Re: [boost] boost utf-8 code conversion facet has security problems

15 Oct 2010

Actually, I want to mention that the UTF-8 codecvt facet implementation
has several other problems:

1. When sizeof(wchar_t)==2 it supports only UCS-2, not full UTF-16.
2. Indeed, it does not strictly enforce that the maximal encoding of
   a single UTF-8 character is 4 bytes.
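
To illustrate point 1: full UTF-16 has to split code points above
U+FFFF into a surrogate pair, which a UCS-2-only facet cannot do. Below
is a minimal sketch of that step. It is not the code from Boost or
Boost.Locale, and the function name encode_utf16 is made up for this
example.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Boost): encode one Unicode code point
// as UTF-16 code units. Returns an empty string for invalid input
// (a surrogate value or anything above U+10FFFF).
std::u16string encode_utf16(std::uint32_t cp)
{
    std::u16string out;
    if (cp >= 0xD800 && cp <= 0xDFFF) return out; // surrogates are not code points
    if (cp > 0x10FFFF) return out;                // outside the Unicode range

    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit, which is all UCS-2 can hold.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, U+1F600 comes out as the two code units 0xD83D 0xDE00,
while a UCS-2-only conversion has no way to represent it at all.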

In Boost.Locale I have implemented a full UTF-8 codecvt facet
that supports both UTF-16 and UTF-32. I assume that this code
can replace the current implementation, even though it would have
to be extracted from Boost.Locale, as this facet is more generic
and supports other encodings as well.

Note that this UTF-8 facet does not depend on any external library.

Artyom
...
I've been meaning to mention this for some time. The boost utf-8 code
conversion facet implements an early spec of utf-8 that allows up to 6
byte representations, but current specs and security issues suggest it
should only support up to four. See
http://en.wikipedia.org/wiki/UTF-8 and in particular the section on
invalid byte sequences. It also has some stuff wrong: for example,
do_length() is supposed to only tell you the length of valid code
sequences, but the boost implementation doesn't check for validity.
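
For reference, here is a rough sketch of the kind of validity check
being asked for: at most four bytes per character, no overlong forms,
no surrogates, nothing above U+10FFFF. The names decode_utf8 and
valid_prefix_length are invented for this example. This is not the
Boost facet and does not follow the std::codecvt::do_length signature;
it only shows the checks a length function would need in order to
count valid sequences only.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decode one UTF-8 sequence from [p, p + n).
// Returns the number of bytes consumed (1..4), or 0 if the sequence is
// invalid or truncated. cp is only meaningful when the result is non-zero.
std::size_t decode_utf8(const unsigned char* p, std::size_t n, std::uint32_t& cp)
{
    if (n == 0) return 0;
    const unsigned char b0 = p[0];
    std::size_t len = 0;
    std::uint32_t min = 0; // smallest code point allowed for this length

    if (b0 < 0x80) { cp = b0; return 1; }
    else if ((b0 & 0xE0) == 0xC0) { len = 2; min = 0x80;    cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; min = 0x800;   cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; min = 0x10000; cp = b0 & 0x07; }
    else return 0; // stray continuation byte, or an old 5/6-byte lead byte

    if (n < len) return 0; // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0; // not a continuation byte
        cp = (cp << 6) | (p[i] & 0x3Fu);
    }
    if (cp < min) return 0;                       // overlong encoding
    if (cp > 0x10FFFF) return 0;                  // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // UTF-16 surrogate
    return len;
}

// Hypothetical helper: number of leading bytes in [p, p + n) that form
// complete, valid UTF-8 sequences. Stops at the first invalid sequence.
std::size_t valid_prefix_length(const unsigned char* p, std::size_t n)
{
    std::size_t pos = 0;
    while (pos < n) {
        std::uint32_t cp = 0;
        const std::size_t used = decode_utf8(p + pos, n - pos, cp);
        if (used == 0) break;
        pos += used;
    }
    return pos;
}
```

With this sketch, the overlong pair 0xC0 0x80 and any sequence starting
with 0xF8..0xFF are rejected instead of being decoded.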