
On 10/16/2010 06:10 AM, Sebastian Redl wrote:
> On 16.10.2010, at 00:23, Patrick Horgan wrote:
>> Support of the recent C++ drafts requires a char32_t basic type anyway, so I can't imagine anyone using a 16-bit wchar_t going forward.
> There's absolutely no way Windows programming will ever change wchar_t away from 16 bits, and people will continue to use it.

Then that implies that it can only hold UCS2. That's a choice. In C99, the type wchar_t is officially intended to hold 32-bit ISO 10646 values, independent of the currently used locale. C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." Of course Microsoft can't define that, since you can't hold the 21 bits needed for the full range of code points in a 16-bit data type.
Both C and C++ in their current draft standards require wchar_t to be a type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales. It takes 21 bits to hold the full range of code points. That would make Visual C++ non-compliant with the current drafts. Perhaps that will be removed or fudged with another macro before the final vote. Of course, this is why the new standards have explicit char16_t and char32_t: because of the impossibility of relying on wchar_t. Nicely, the new types are defined to be unsigned:)

I mislike signatures that use char, since it's the only integer type that is explicitly allowed to be either signed or unsigned, and when dealing with conversions you always need unsigned, because the effects of sign extension are startling at the least and erroneous at the worst. I wish they'd originally specified std::codecvt<wchar_t, unsigned char, std::mbstate_t>.

The current Unicode standard, 5.2, notes that there are places where wchar_t is only 8 bits, and suggests that programmers going forward use only char16_t (for UTF-16) and char32_t (for UCS4). Unfortunately, the signature std::codecvt<wchar_t, char, std::mbstate_t> has to be dealt with. As the specs require, when wchar_t is 16-bit I return an error whenever do_in() or do_length() is asked to decode any UTF-8 that would yield a code point greater than U+FFFF. That's the right thing to do. UCS2 DOES cover most of the world's scripts right now, but there are things in the supplementary planes, like musical symbols, that people like me like, as well as all the ancient scripts; and going forward, Unicode plans to put codes for most scripts awaiting encoding in the supplementary planes.

It's all a mess and quite frustrating. I really wish the new C++ Standard had deprecated std::codecvt<wchar_t, char, std::mbstate_t> and encouraged everyone to use std::codecvt<char32_t, char, std::mbstate_t> in its place.
Their job must be hard. I say, just throw it away! That would be cleaner and much more elegant, but of course they can't, since they told people to use it before, left wiggle room on the size of wchar_t by saying only that it had to be at least the size of a char, and they want existing code to keep compiling.

Patrick