
On 10/15/2010 02:18 AM, Artyom wrote:
> Actually I want to mention that UTF-8 codecvt facet implementation has
> several other problems:
>
> 1. When sizeof(wchar_t)==2 it supports only UCS-2 and not full UTF-16

In general that would have to be true of any implementation: wchar_t IS a wide character, not an encoding scheme. According to the spec, wchar_t is supposed to represent the internal character set of the program as a wide character. (From a recent C++ draft standard, N3126):

    3.9.1 Fundamental types
    ...
    5 Type wchar_t is a distinct type whose values can represent
    distinct codes for all members of the largest extended character
    set specified among the supported locales (22.3.1).

In 16 bits you can only support UCS2 in a single wide character (and current C++ specs clearly state that). That means it would be a mistake in that environment to support any locale that needed characters outside Unicode plane 0, codes U+0000-U+FFFF, also known as the Basic Multilingual Plane (BMP). In effect, a compiler writer who chose to implement a 16-bit wchar_t would be choosing not to support any locale that needed codes from any plane outside the BMP.

That would be silly, since Unicode says the following about plane 1, codes U+10000-U+1FFFF, the Supplementary Multilingual Plane (SMP). (From the Unicode Standard version 5.2):

    The majority of scripts currently identified for encoding will
    eventually be allocated in the SMP. As a result, some areas of the
    SMP will experience common, frequent usage.

So right now you'd do OK with a 16-bit wchar_t holding UCS2 codes in most parts of the world, but going forward, no. (Too bad a 17-bit wchar_t doesn't make sense; it would hold exactly the BMP and the SMP.) Just because it's silly doesn't mean no one would do it, of course.

Full UTF-16 requires two 16-bit code units for each code point outside the BMP, so such a code point won't fit in a single 16-bit wchar_t. Support of the recent C++ drafts requires a char32_t basic type anyway, so I can't imagine anyone using a 16-bit wchar_t going forward. Nevertheless, my code notes the presence of a 16-bit wchar_t and returns an encoding error in do_in(), as required by the C++ spec, if a UTF-8 sequence would overflow it.
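A minimal sketch of the two points above, not the actual facet code: decode_utf8 decodes one UTF-8 sequence and reports an error when the result would not fit in a 16-bit wide character, and encode_utf16 shows why a code point outside the BMP needs two 16-bit units. The names decode_utf8, encode_utf16 and DecodeResult are made up for illustration; the result codes only loosely mirror std::codecvt_base::result.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result codes, loosely mirroring std::codecvt_base::result.
enum class DecodeResult { ok, partial, error };

// Decode one UTF-8 sequence from [p, end) into a wide character that is
// wide_bits wide (16 or 32). On success, p is advanced past the sequence.
inline DecodeResult decode_utf8(const unsigned char*& p, const unsigned char* end,
                                int wide_bits, std::uint32_t& out)
{
    assert(wide_bits == 16 || wide_bits == 32);
    if (p == end) return DecodeResult::partial;

    const unsigned char lead = *p;
    std::uint32_t cp;
    int trail;  // number of continuation bytes expected
    if (lead < 0x80)                { cp = lead;        trail = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trail = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trail = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trail = 3; }
    else return DecodeResult::error;

    if (end - p < trail + 1) return DecodeResult::partial;
    for (int i = 1; i <= trail; ++i) {
        if ((p[i] & 0xC0) != 0x80) return DecodeResult::error;
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    // Reject overlong forms, surrogate code points, and values past U+10FFFF.
    static const std::uint32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[trail] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return DecodeResult::error;

    // A single 16-bit wide character can hold only the BMP (UCS2).
    if (wide_bits == 16 && cp > 0xFFFF) return DecodeResult::error;

    out = cp;
    p += trail + 1;
    return DecodeResult::ok;
}

// Encode one code point as UTF-16. Returns the number of 16-bit units
// written (1 or 2), or 0 if cp is not a valid Unicode scalar value.
inline int encode_utf16(std::uint32_t cp, std::uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    if (cp < 0x10000) {
        out[0] = static_cast<std::uint16_t>(cp);
        return 1;
    }
    cp -= 0x10000;  // leaves a 20-bit value, split 10/10 across the pair
    out[0] = static_cast<std::uint16_t>(0xD800 + (cp >> 10));    // high surrogate
    out[1] = static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    return 2;
}
```

For example, the bytes F0 9D 84 9E (U+1D11E) decode fine with wide_bits == 32 but give DecodeResult::error with wide_bits == 16, while encode_utf16 turns the same code point into the pair D834 DD1E.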
I'd like to see support for the same three facets required by recent drafts, named (as in the spec) codecvt_utf8 (one of UCS2 or UCS4 to UTF-8), codecvt_utf16 (one of UCS2 or UCS4 to UTF-16), and codecvt_utf8_utf16 (UTF-16 to UTF-8), which explicitly state the two encodings. The draft describes the base codecvt specializations this way:

    22.4.1.4 Class template codecvt
    ...
    3 ... codecvt<char, char, mbstate_t> implements a degenerate
    conversion; it does not convert at all. The specialization
    codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and
    UTF-8 encodings schemes, and the specialization
    codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and
    UTF-8 encodings schemes. codecvt<wchar_t,char,mbstate_t> converts
    between the native character sets for narrow and wide characters.

I find the last sentence ambiguous. It sounds like it would just be a conversion between single chars and single wchar_ts, maybe between a locale-specified ISO encoding like ISO-8859-? and a UCS wchar_t, but that's not what they mean, I think. If you look in the locale.stdcvt section they say:

    22.5 Standard code conversion facets
    ...
    4 For the facet codecvt_utf8:
    - The facet shall convert between UTF-8 multibyte sequences and
      UCS2 or UCS4 (depending on the size of Elem) within the program.

which clearly says that it's a conversion to UCS2 for a 16-bit wchar_t or UCS4 for a 32-bit wchar_t.

Future libstdc++ releases will provide these facets anyway, but they won't be available everywhere for quite a while.

Patrick