boost utf-8 code conversion facet has security problems

I've been meaning to mention this for some time. The boost utf-8 code conversion facet implements an early spec of utf-8 that allows up to 6-byte representations, but current specs and security concerns say it should only support up to four. See http://en.wikipedia.org/wiki/UTF-8 and in particular the section on invalid byte sequences. It also gets some things wrong; for example, do_length() is supposed to report the length of valid code sequences only, but the boost implementation doesn't check for validity.

The old scheme has long since been superseded by newer specs that allow only 1-4 byte representations. The issue is that the longer representations let you alias characters: if someone is filtering out certain characters for security reasons, for SQL or HTTP requests for example, an attacker can bypass the filter by using a different representation of a character and inject their naughty stuff just when you think you've got them stopped. In addition to eliminating 5- and 6-byte utf-8 characters, some other first and second byte values are forbidden because they would alias other characters. The current specs, STD 63 (currently RFC 3629) and The Unicode Standard Version 5.2, specify how to correctly implement the conversion. I think that everything you need to know to implement do_in() and do_length() correctly is in this table (notice that the start and end of the valid second byte range varies across different Unicode code ranges):

/* Table 3-7. Well-Formed UTF-8 Byte Sequences, from version 5.2 of the Unicode Standard

   Code Points           First Byte   Second Byte   Third Byte   Fourth Byte
   U+0000..U+007F        00..7F
   U+0080..U+07FF        C2..DF       80..BF
   U+0800..U+0FFF        E0           A0..BF        80..BF
   U+1000..U+CFFF        E1..EC       80..BF        80..BF
   U+D000..U+D7FF        ED           80..9F        80..BF
   U+E000..U+FFFF        EE..EF       80..BF        80..BF
   U+10000..U+3FFFF      F0           90..BF        80..BF       80..BF
   U+40000..U+FFFFF      F1..F3       80..BF        80..BF       80..BF
   U+100000..U+10FFFF    F4           80..8F        80..BF       80..BF
*/

So you have to do more checking, but it's still efficient. I have test code that checks all this. I'm willing to help in various ways.

o I have an implementation of the utf-8 code conversion facet that implements do_in, do_out, and do_length. You can have the source to jump start the fix.
o I'm willing to make changes to the boost one and contribute a patch with a little guidance.
o I'm willing to just figure that now that I've pointed it out, someone who owns this code will be motivated to fix it; it's not too hard.
o Also willing to give you the test code to check with. Probably a lot different than how boost implements test code.
o Willing to give peace a chance;)

It should really be fixed. There's a lot of bad guys out there that know about these sorts of problems, plus, there aren't many open source implementations of utf-8 code conversion facets, so folks are likely to emulate/steal this.

btw mine's freely available to anyone who requests it.

wc -l *
 232 codecvt_utf8_facet.cpp
  29 codecvt_utf8_facet.hpp
  14 Makefile
 510 testcodecvt.cpp

Funny the test code is twice the size of the code. Patrick
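(To make the overlong-form problem and the Table 3-7 checks concrete, here is a small sketch of the kind of validation the facet needs. It is illustrative only, not the Boost facet's code, and the function name is made up for the example.)

    #include <cstddef>

    // Return the length (1-4) of a well-formed UTF-8 sequence starting at s,
    // or 0 if the bytes are ill-formed per Table 3-7 (or the sequence is
    // truncated).  'avail' is the number of readable bytes at s.
    std::size_t utf8_sequence_length(const unsigned char* s, std::size_t avail)
    {
        if (avail == 0) return 0;
        unsigned char b0 = s[0];
        if (b0 <= 0x7F) return 1;                   // U+0000..U+007F
        if (b0 < 0xC2 || b0 > 0xF4) return 0;       // C0, C1, F5..FF never valid

        std::size_t len = (b0 <= 0xDF) ? 2 : (b0 <= 0xEF) ? 3 : 4;
        if (avail < len) return 0;                  // truncated sequence

        // The valid range of the second byte depends on the lead byte;
        // this is what rejects overlong forms and UTF-16 surrogates.
        unsigned char lo = 0x80, hi = 0xBF;
        if      (b0 == 0xE0) lo = 0xA0;             // no overlong 3-byte forms
        else if (b0 == 0xED) hi = 0x9F;             // no U+D800..U+DFFF
        else if (b0 == 0xF0) lo = 0x90;             // no overlong 4-byte forms
        else if (b0 == 0xF4) hi = 0x8F;             // nothing above U+10FFFF
        if (s[1] < lo || s[1] > hi) return 0;

        for (std::size_t i = 2; i < len; ++i)       // remaining bytes: 80..BF
            if (s[i] < 0x80 || s[i] > 0xBF) return 0;
        return len;
    }

Under the old 6-byte scheme the two-byte sequence C0 AF decodes to U+002F '/', which is exactly how a filter that blocks '/' gets slipped past; with the table above, the C0 lead byte is rejected outright.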

Patrick Horgan wrote:
It should really be fixed. There's a lot of bad guys out there that know about these sorts of problems, plus, there aren't many open source implementations of utf-8 code conversion facets, so folks are likely to emulate/steal this.
btw mine's freely available to anyone who requests it.
wc -l *
 232 codecvt_utf8_facet.cpp
  29 codecvt_utf8_facet.hpp
  14 Makefile
 510 testcodecvt.cpp
Funny the test code is twice the size of the code.
Note that there is a test of this code in the serialization library test suite as well as documentation. The current one fails some tests on some platforms. So if you want to fix all this up, code, tests and documentation, it would be a great thing as far as I'm concerned. In fact what I would like to see is the utf8 facet moved from boost/detail to boost/utility along with all its documentation and tests. The module needs the documentation and tests, but there is no boost/lib/detail to put them in. The serialization library isn't a good place for this because this module is used by at least one other library (program options) and maybe more. Robert Ramey

On Thu, Oct 14, 2010 at 6:29 PM, Robert Ramey <ramey@rrsd.com> wrote:
Note that there is a test of this code in the serialization library test suite as well as documentation. ... The serialization library isn't a good place for this because this module is used by at least one other library (program options) and maybe more.
It is also used by the filesystem library. I also think it would be great to turn this into a real Boost library. --Beman

Actually I want to mention that the UTF-8 codecvt facet implementation has several other problems:
1. When sizeof(wchar_t)==2 it supports only UCS-2 and not full UTF-16.
2. It indeed does not strictly enforce that the maximal encoding of a single UTF-8 character is 4 bytes.
In Boost.Locale I have implemented a full UTF-8 codecvt facet that supports both UTF-16 and UTF-32. I assume that this code can replace the current implementation, though it would need to be extracted from Boost.Locale, as the facet there is more generic and supports other encodings as well. Note, this UTF-8 facet does not depend on an external library. Artyom
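(For readers who haven't run into the distinction: "full UTF-16" differs from UCS-2 exactly by the surrogate-pair step a facet has to perform when the 16-bit type is involved. A minimal sketch of that step, for illustration only and not taken from Boost.Locale:)

    #include <cstdint>

    // Encode one code point (<= U+10FFFF and not a surrogate) as UTF-16.
    // Returns the number of 16-bit units written: 1 or 2.
    // A UCS-2-only facet stops after the first branch; full UTF-16 support
    // is the surrogate-pair branch below.
    int to_utf16(char32_t cp, char16_t out[2])
    {
        if (cp < 0x10000) {                 // BMP: one unit, same value
            out[0] = static_cast<char16_t>(cp);
            return 1;
        }
        cp -= 0x10000;                      // 20 bits remain
        out[0] = static_cast<char16_t>(0xD800 + (cp >> 10));   // high surrogate
        out[1] = static_cast<char16_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
        return 2;
    }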

On 10/15/2010 02:18 AM, Artyom wrote:
Actually I want to mention that UTF-8 codecvt facet implementation has several other problems:
1. When sizeof(wchar_t)==2 it supports only UCS-2 and not full UTF-16

In general that would have to be true of any implementation; wchar_t IS a wide character, not an encoding scheme. According to the spec, wchar_t is supposed to represent the internal character set of the program as a wide character. From a recent C++ draft standard (N3126):

3.9.1 Fundamental types ... 5 Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

In 16 bits you could only support UCS2 in a single wide character (and current C++ specs clearly state that). That means it would be a mistake in that environment to support any locale that needed characters outside Unicode plane 0, codes U+0000..U+FFFF, also known as the Basic Multilingual Plane (BMP). In effect, if a compiler writer chose to implement a 16-bit wchar_t, they would be choosing not to support any locale that needed codes from any plane outside the BMP. That would be silly, since Unicode says the following about plane 1, codes U+10000..U+1FFFF, the Supplementary Multilingual Plane (SMP), in the Unicode Standard version 5.2:

The majority of scripts currently identified for encoding will eventually be allocated in the SMP. As a result, some areas of the SMP will experience common, frequent usage.

So right now you'd do OK with a 16-bit wchar_t holding UCS2 codes in most parts of the world, but going forward, no. (Too bad a 17-bit wchar_t doesn't make sense; it would just hold the BMP and the SMP.) Just because it's silly doesn't mean no one would do it, of course. Full UTF-16 requires two 16-bit code units for the codes in the supplementary or higher planes, so they won't fit in a 16-bit wchar_t. Support of the recent C++ drafts requires a char32_t basic type anyway, so I can't imagine anyone using a 16-bit wchar_t going forward. Nevertheless, my code notes the presence of a 16-bit wchar_t and returns an encoding error in do_in(), as required by the C++ spec, if a utf-8 sequence would overflow it.

I'd like to see support for the same three facets required by recent drafts, named (as in the spec) codecvt_utf8 (one of UCS2 or UCS4 to utf-8), codecvt_utf16 (one of UCS2 or UCS4 to utf-16), and codecvt_utf8_utf16 (utf-16 to utf-8), which explicitly state the two encodings. The draft says:

22.4.1.4 Class template codecvt ... 3 ... codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes. codecvt<wchar_t,char,mbstate_t> converts between the native character sets for narrow and wide characters.

I find the last sentence ambiguous; it sounds like it would just be a conversion between single chars and single wchar_ts, maybe between a locale-specified ISO encoding like ISO-8859-? and a UCS wchar_t, but that's not what they mean, I think. If you look in the locale.stdcvt section they say:

22.5 Standard code conversion facets ... 4 For the facet codecvt_utf8: — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

clearly saying that it's a conversion to UCS2 for a 16-bit wchar_t or UCS4 for a 32-bit wchar_t. Future libstdc++ libraries will provide these anyway, but they won't be available everywhere for quite a while. Patrick
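(Roughly what that do_in()/do_length() check amounts to, as a sketch under the assumption that the facet targets UCS2 or UCS4 as 22.5 describes; the function name is mine, not the facet's:)

    #include <locale>

    // A UCS2-targeted facet has to refuse code points outside the BMP.
    // 'cp' is the code point just decoded from the UTF-8 side.
    inline std::codecvt_base::result check_representable(char32_t cp)
    {
        const char32_t max_internal =
            sizeof(wchar_t) == 2 ? 0xFFFF      // 16-bit wchar_t: UCS2 only
                                 : 0x10FFFF;   // 32-bit wchar_t: all of Unicode
        return cp <= max_internal ? std::codecvt_base::ok
                                  : std::codecvt_base::error;
    }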

On 16.10.2010, at 00:23, Patrick Horgan wrote:
Support of the recent C++ drafts requires a char32_t basic type anyway, so I can't imagine anyone using a 16-bit wchar_t going forward,
There's absolutely no way Windows programming will ever change wchar_t away from 16 bits, and people will continue to use it. Sebastian

On 10/16/2010 06:10 AM, Sebastian Redl wrote:
On 16.10.2010, at 00:23, Patrick Horgan wrote:
Support of the recent C++ drafts requires a char32_t basic type anyway, so I can't imagine anyone using a 16-bit wchar_t going forward,

There's absolutely no way Windows programming will ever change wchar_t away from 16 bits, and people will continue to use it.

Then that implies that it can only hold UCS2. That's a choice. In C99, the type wchar_t is officially intended to be used only for 32-bit ISO 10646 values, independent of the currently used locale. C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." Of course Microsoft isn't able to define that, since you can't hold 21 bits in a 16-bit data type.

Both C and C++ in their current draft standards require wchar_t to be a data type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales. It takes 21 bits to hold all of the Unicode code space. That would make Visual C++ non-compliant with the current draft standards. Perhaps that will be removed or fudged with another macro before the final vote. Of course, this is why the new standards have explicit char16_t and char32_t, because of the impossibility of using wchar_t. Nicely, the new types are defined to be unsigned:)

I mislike signatures that use char, since it's the only int type that is explicitly allowed to be either signed or unsigned. Of course, dealing with conversions you always need unsigned, because the effects of sign extension are startling at the least and erroneous at the worst. Wish they'd specified originally std::codecvt<wchar_t, unsigned char, std::mbstate_t>.

The current Unicode standard, 5.2, notes that there are places where wchar_t is only 8 bits and suggests that only char16_t (for UCS2) and char32_t (for UCS2 or UCS4) be used by programmers going forward. Unfortunately, the signature std::codecvt<wchar_t, char, std::mbstate_t> has to be dealt with. As required by the specs, I note if wchar_t is 16-bit and return an error whenever do_in() or do_length() is asked to decode any utf-8 that would yield a code greater than U+FFFF. That's the right thing to do. UCS2 DOES cover most of the world's scripts right now, but there are things in the supplementary planes, like musical symbols, that people like me like, as are all the ancient scripts, and going forward Unicode plans to put codes for most scripts awaiting encoding in the supplementary planes.

It's all a mess and quite frustrating. I really wish the new C++ Standard had deprecated std::codecvt<wchar_t, char, std::mbstate_t> and encouraged everyone to use std::codecvt<char32_t, char, std::mbstate_t> in its place. Their job must be hard. I say, just throw it away! It would be cleaner and much more elegant, but of course they can't, since they told people to use it before, left wiggle room on the size of wchar_t saying only that it had to be at least the size of a char, and they want existing code to compile. Patrick
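(A tiny illustration of the sign-extension trap mentioned above; nothing here is from the facet, it just shows why conversion code wants unsigned char:)

    #include <cstdio>

    int main()
    {
        char c = '\xC2';             // a UTF-8 lead byte held in a plain char
        int as_int = c;              // sign-extends where char is signed
        unsigned char u = static_cast<unsigned char>(c);

        std::printf("as plain char: %d\n", as_int);  // typically -62, not 194
        std::printf("as byte value: %u\n", u + 0u);  // 194 (0xC2)
        // A test like (c >= 0xC2) silently fails on signed-char platforms,
        // which is why decoders should work with unsigned char internally.
        return 0;
    }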

On 18.10.2010 08:07, Patrick Horgan wrote:
On 10/16/2010 06:10 AM, Sebastian Redl wrote:
On 16.10.2010, at 00:23, Patrick Horgan wrote:
[...] Of course Microsoft isn't able to define that [__STDC_ISO_10646__], since you can't hold 21 bits in a 16-bit data type.
Microsoft defines wchar_t to be a UTF-16 2-byte unit, screw the standards. Sebastian

A few points:
1. wchar_t comes from C and was first defined at a time when 16 bits were indeed enough, and its size wasn't specified exactly.
2. Keeping a compatible ABI is very important, so you should just get used to the fact that wchar_t may be 2 or 4 bytes.
3. Ignoring non-BMP code points is a very bad idea, so you should assume that std::wstring is UTF-16 or UTF-32 and not UCS-2 or UCS-4!
4. If you write new applications, don't use wchar_t; use std::string and UTF-8. This is a perfectly good and right solution.
5. If you support wchar_t, support UTF-16 and support it well.
So IMHO the facet we are talking about should support UTF-16 as well and should work with the correct definition of UTF-8. You can blame the standards, you can blame Microsoft or even Unicode, but you still have to live with it, and as long as you live with it, do it right. Artyom
P.S.: I think UTF-16 should die: http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmf...
P.P.S.: As long as Boost supports wchar_t I believe it should support UTF-16, even though it is a nightmare.

On 10/18/2010 12:36 AM, Sebastian Redl wrote:
> ...elision by patrick...
> Microsoft defines wchar_t to be a UTF-16 2-byte unit,
> screw the standards.

Does that mean that a Microsoft Visual C++ supplied codecvt_utf8_facet<wchar_t,char,mbstate_t> would convert to UTF-16, or UCS2? It wouldn't be that hard to make it aware of Microsoft and do the wrong thing, but 1) shouldn't we follow the spec, and 2) wouldn't we annoy those on Windows who read the spec and expected UCS2? I suspect it would be better to just provide the specializations the current draft standard calls for:

    template<class Elem, unsigned long Maxcode = 0x10ffff,
             codecvt_mode Mode = (codecvt_mode)0>
    class codecvt_utf8 : public codecvt<Elem, char, mbstate_t> {
        // convert between UCS2 or UCS4 and utf-8
    };

    template<class Elem, unsigned long Maxcode = 0x10ffff,
             codecvt_mode Mode = (codecvt_mode)0>
    class codecvt_utf16 : public codecvt<Elem, char, mbstate_t> {
        // convert between UCS2 or UCS4 and utf-16
    };

    template<class Elem, unsigned long Maxcode = 0x10ffff,
             codecvt_mode Mode = (codecvt_mode)0>
    class codecvt_utf8_utf16 : public codecvt<Elem, char, mbstate_t> {
        // convert between utf-16 and utf-8 - this one works for Microsoft
    };

with Elem as wchar_t, char16_t, or char32_t, and Mode allowing you to specify that a BOM header be generated or consumed, as well as to generate little-endian instead of the default big-endian. On Microsoft, the third might be specialized:

    template<unsigned long Maxcode, codecvt_mode Mode>
    class codecvt_utf8_utf16<wchar_t, Maxcode, Mode> { /* stuff here */ };

Patrick
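(For completeness, here is what using those draft facets would look like, assuming a standard library that already ships the draft's <codecvt> header; most 2010-era libraries don't yet, so treat it as a sketch rather than working code everywhere:)

    #include <codecvt>
    #include <fstream>
    #include <locale>
    #include <string>

    int main()
    {
        // Convert a wide string to UTF-8 in memory...
        std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
        std::string utf8 = conv.to_bytes(L"gr\u00fc\u00df dich");

        // ...or imbue a stream so the conversion happens on output.
        std::wofstream out;
        out.imbue(std::locale(out.getloc(), new std::codecvt_utf8<wchar_t>));
        out.open("hello.txt");
        out << L"gr\u00fc\u00df dich\n";
        return 0;
    }

On a platform with a 16-bit wchar_t, codecvt_utf8<wchar_t> gives you UCS2 as 22.5 says, so codecvt_utf8_utf16<wchar_t> is the one you'd reach for to round-trip non-BMP text.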
participants (5)
- Artyom
- Beman Dawes
- Patrick Horgan
- Robert Ramey
- Sebastian Redl