
On 10/16/2010 06:10 AM, Sebastian Redl wrote:
> On 16.10.2010, at 00:23, Patrick Horgan wrote:
>> Support of the recent C++ drafts requires a char32_t basic type anyway, so I can't imagine anyone using a 16-bit wchar_t going forward.
> There's absolutely no way Windows programming will ever change wchar_t away from 16 bits, and people will continue to use it.

Then that implies that it can only hold UCS2. That's a choice. In C99, the type wchar_t is officially intended to hold 32-bit ISO 10646 values, independent of the currently used locale. C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." Of course Microsoft can't define that, since you can't hold the 21 bits needed for the full range of code points in a 16-bit data type.
Both C and C++ in their current draft standards require wchar_t to be a type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales. It takes 21 bits to hold the full range of code points. That would make Visual C++ non-compliant with the current drafts. Perhaps that will be removed or fudged with another macro before the final vote. Of course, this is why the new standards have explicit char16_t and char32_t: because of the impossibility of relying on wchar_t. Nicely, the new types are defined to be unsigned:)

I mislike signatures that use char, since it's the only integer type that is explicitly allowed to be either signed or unsigned, and when dealing with conversions you always need unsigned, because the effects of sign extension are startling at the least and erroneous at the worst. I wish they'd originally specified std::codecvt<wchar_t, unsigned char, std::mbstate_t>.

The current Unicode standard, 5.2, notes that there are places where wchar_t is only 8 bits, and suggests that programmers going forward use only char16_t (for UTF-16) and char32_t (for UCS4). Unfortunately, the signature std::codecvt<wchar_t, char, std::mbstate_t> has to be dealt with. As the specs require, when wchar_t is 16-bit I return an error whenever do_in() or do_length() is asked to decode any UTF-8 that would yield a code point greater than U+FFFF. That's the right thing to do. UCS2 DOES cover most of the world's scripts right now, but there are things in the supplementary planes, like musical symbols, that people like me like, as well as all the ancient scripts; and going forward, Unicode plans to put codes for most scripts awaiting encoding in the supplementary planes.

It's all a mess and quite frustrating. I really wish the new C++ Standard had deprecated std::codecvt<wchar_t, char, std::mbstate_t> and encouraged everyone to use std::codecvt<char32_t, char, std::mbstate_t> in its place.
Their job must be hard. I say, just throw it away! That would be cleaner and much more elegant, but of course they can't, since they told people to use it before, left wiggle room on the size of wchar_t by saying only that it had to be at least the size of a char, and they want existing code to keep compiling.

Patrick