Re: [boost] boost utf-8 code conversion facet has security problems

15 Oct 2010

On 10/15/2010 02:18 AM, Artyom wrote:
      ...
Actually I want to mention that UTF-8 codecvt facet implementation
has several other problems:
1. When sizeof(wchar_t)==2 it supports only UCS-2 and not full UTF-16

In general that would have to be true of any
implementation: wchar_t IS a wide character, and not an
encoding scheme.  According to the spec, wchar_t is
supposed to represent the internal character set of the
program as a wide character. (From a recent C++ draft
standard, N3126):

3.9.1 Fundamental types
...
5 Type wchar_t is a distinct type whose values can 
represent distinct codes for all members of the largest 
extended character set specified among the supported 
locales (22.3.1).

In 16 bits you could only support UCS2 in a single wide 
character (and current C++ specs clearly state that). 
That means it would be a mistake in that environment to 
support any locale that needed characters outside the 
Unicode plane 0, codes U+0000-U+FFFF, also known as the 
Basic Multilingual Plane (BMP).  In effect, if a 
compiler writer chose to implement a 16-bit wchar_t, 
they would be choosing not to support any locale that 
needed codes from any plane outside the BMP.  That 
would be silly since Unicode says the following about 
plane 1, codes U+10000-U+1FFFF, the Supplementary 
Multilingual Plane (SMP). (From the Unicode Standard 
version 5.2):

The majority of scripts currently identified for 
encoding will eventually be allocated in the SMP. As a 
result, some areas of the SMP will experience common, 
frequent usage.

So right now you'd do OK with a 16-bit wchar_t holding
UCS2 codes in most parts of the world, but going
forward, no.  (Too bad a 17-bit wchar_t doesn't make
sense; it would hold exactly the BMP and the SMP.)  Just
because it's silly doesn't mean no one would do it, of
course.  Full UTF-16 requires two 16-bit code units (a
surrogate pair) for codes in the supplementary planes,
so such a code won't fit in a single 16-bit wchar_t.
Support of the recent C++ drafts requires a char32_t
basic type anyway, so I can't imagine anyone using a
16-bit wchar_t going forward.  Nevertheless, my code
notes the presence of a 16-bit wchar_t and returns an
encoding error in do_in(), as required by the C++ spec,
if a UTF-8 sequence would overflow it.
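
As a rough illustration of that overflow check (a minimal
sketch, not the actual facet code), the function below
decodes one UTF-8 sequence and reports an error when the
code point does not fit in the target wide type.  The
names decode_one and DecodeResult are made up for this
example; a real do_in() works on pointer ranges and
returns std::codecvt_base::result instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical result type, loosely mirroring codecvt_base::result.
enum class DecodeResult { ok, partial, error };

// Decode the UTF-8 sequence starting at s[pos] into one Wide value.
// On success, stores the code point in `out` and advances `pos`.
// Returns `partial` if the input ends mid-sequence, and `error` for
// malformed input or a code point too large for Wide (for example
// U+10000 and above when Wide is only 16 bits).
template <typename Wide>
DecodeResult decode_one(const std::string& s, std::size_t& pos, Wide& out)
{
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if (lead < 0x80)                       { len = 1; cp = lead; }
    else if (lead >= 0xC2 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0)        { len = 3; cp = lead & 0x0F; }
    else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; }
    else return DecodeResult::error;  // stray continuation or invalid lead

    if (s.size() - pos < len) return DecodeResult::partial;

    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) return DecodeResult::error;
        cp = (cp << 6) | (c & 0x3F);
    }

    // Reject overlong forms, surrogates, and values past U+10FFFF.
    static const std::uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF))
        return DecodeResult::error;

    // The check discussed above: the code point must fit in ONE Wide.
    const std::uint64_t wide_max =
        static_cast<std::uint64_t>(std::numeric_limits<Wide>::max());
    if (cp > wide_max) return DecodeResult::error;

    out = static_cast<Wide>(cp);
    pos += len;
    return DecodeResult::ok;
}
```

With Wide = char16_t (standing in for a 16-bit wchar_t),
the 4-byte sequence for U+10348 is rejected, while the
same bytes decode fine into a char32_t.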

I'd like to see support for the same three facets
required by recent drafts, named (as in the spec)
codecvt_utf8 (one of UCS2 or UCS4 to utf-8),
codecvt_utf16 (one of UCS2 or UCS4 to utf-16), and
codecvt_utf8_utf16 (utf-16 to utf-8), which explicitly
state the two encodings:

22.4.1.4 Class template codecvt
...
3 ... codecvt<char, char, mbstate_t> implements a 
degenerate conversion; it does not convert at all. The 
specialization codecvt<char16_t, char, mbstate_t> 
converts between the UTF-16 and UTF-8 encodings 
schemes, and the specialization codecvt <char32_t, 
char, mbstate_t> converts between the UTF-32 and UTF-8 
encodings schemes. codecvt<wchar_t,char,mbstate_t> 
converts between the native character sets for narrow 
and wide characters.

I find the last ambiguous.  It sounds like it would just
be a conversion between single chars and single
wchar_ts, maybe between a locale-specified ISO encoding
like ISO-8859-? and a UCS wchar_t, but I think that's
not what they mean.  If you look in the locale.stdcvt
section they say:

22.5 Standard code conversion facets
...
4 For the facet codecvt_utf8:
— The facet shall convert between UTF-8 multibyte 
sequences and UCS2 or UCS4 (depending on the size of 
Elem) within the program.

clearly saying that it's a conversion to UCS2 for 
wchar_t of 16-bit or UCS4 for wchar_t of 32-bit.
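
For what it's worth, here is a small sketch of how the
draft's facets behave when driven through
std::wstring_convert, assuming a library that ships
<codecvt> (these facets were later deprecated in C++17,
so expect warnings on newer compilers).  The helper names
utf8_to_ucs4, utf8_to_utf16 and utf8_to_ucs2 are only for
illustration, and char16_t stands in for a 16-bit wchar_t.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// UTF-8 -> UCS4: every Unicode code point fits in one char32_t.
inline std::u32string utf8_to_ucs4(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}

// UTF-8 -> UTF-16: code points above U+FFFF become a surrogate pair,
// i.e. two 16-bit code units.
inline std::u16string utf8_to_utf16(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(bytes);
}

// UTF-8 -> UCS2: a 16-bit Elem can only hold BMP code points, so the
// facet reports an error for anything larger. wstring_convert turns
// that error into std::range_error when no error string was supplied.
inline bool utf8_to_ucs2(const std::string& bytes, std::u16string& out)
{
    std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> conv;
    try {
        out = conv.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

Feeding the UTF-8 bytes for U+10348 (F0 90 8D 88) to
these gives one char32_t from utf8_to_ucs4, the pair
D800 DF48 from utf8_to_utf16, and a failure from
utf8_to_ucs2, which is the UCS2-versus-UTF-16 distinction
in question.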

Future libstdc++ releases will provide these anyway,
but they won't be available everywhere for quite a while.

Patrick

Patrick Horgan