Re: [boost] [General] Always treat std::strings as UTF-8

19 Jan 2011


On Tue, 18 Jan 2011 13:18:58 -0800
Patrick Horgan <phorgan1@gmail.com> wrote:
...
...
...
On 01/18/2011 08:23 AM, Chad Nelson wrote:
If they can do that, that's great!  The conversion code was so short
that I assumed it wasn't a full, complete conversion algorithm.
They're complete, and accurate. The algorithms aren't overly complex,
they just translate between different forms of the exact same data,
after all.
If you can assume that the encoding is correct already, that's true.
Most of the code to convert from UTF-8 to UTF-32 or UTF-16, for example,
is there to check that you don't have overlong encodings that cause
security issues, or other violations of the well-formedness table in
the Unicode spec.  Otherwise, especially if you carry things around in
UTF-8 by preference, and do your checking in that encoding, you open
yourself up to problems
(http://capec.mitre.org/data/definitions/80.html).  If you don't ever
accept UTF-8 encoded things from users, of course, you don't have to
worry about this, but I would write the conversion defensively.
The conversion code in those classes does exactly that, and will (at
the moment) throw an exception on any problem.

It is, again at the moment, possible for a programmer to get invalid
encodings into the utf*_t strings, but it shouldn't be possible to ever
get them from the conversion functions. The unit tests that I wrote for
it (not included in the package) deliberately try to feed in invalid
code, just to ensure that it's caught correctly.
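
For readers following along, the kind of defensive conversion discussed
above can be sketched roughly as follows. This is only an illustration and
is not the code from the classes under discussion: the names
decode_utf8_strict and is_valid_utf8 are hypothetical, and the checks assume
the current UTF-8 definition (at most four bytes per code point, no overlong
forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the library in this thread): decode UTF-8
// into UTF-32, throwing std::invalid_argument on any ill-formed sequence.
std::u32string decode_utf8_strict(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // total bytes in this sequence
        char32_t cp = 0;       // code point being assembled
        char32_t min_cp = 0;   // smallest code point allowed for this length

        if (lead < 0x80)      { len = 1; cp = lead;        min_cp = 0x0; }
        else if (lead < 0xC0) { throw std::invalid_argument("stray continuation byte"); }
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; min_cp = 0x80; }
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; min_cp = 0x800; }
        else if (lead < 0xF8) { len = 4; cp = lead & 0x07; min_cp = 0x10000; }
        else { throw std::invalid_argument("5- and 6-byte forms are not allowed"); }

        if (in.size() - i < len)
            throw std::invalid_argument("truncated sequence");

        // Fold in the continuation bytes, each of which must be 10xxxxxx.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("bad continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms: the value must need this many bytes.
        if (cp < min_cp)
            throw std::invalid_argument("overlong encoding");
        // Reject UTF-16 surrogates and values beyond the Unicode range.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("surrogate code point");
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        out.push_back(cp);
        i += len;
    }
    return out;
}

// Convenience wrapper for callers that want a yes/no answer.
bool is_valid_utf8(const std::string& in) {
    try { decode_utf8_strict(in); return true; }
    catch (const std::invalid_argument&) { return false; }
}
```

With a decoder along these lines, an overlong form such as the two-byte
sequence C0 AF (an alternate spelling of '/') is rejected instead of being
silently converted, which is the failure mode the CAPEC entry describes.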
...
I should say that I haven't read your code yet and you might very
well do this correctly.  The code conversion facet used by a lot of
boost code doesn't.  It was written to an older version of the spec
for UTF-8 and allows 5- and 6-byte encodings.  It does have these
security concerns.
Then having freshly-written code, using the latest specifications, is an
advantage. ;-)
...
I offered a while back to replace it, but I assume that with the locale
stuff coming up for review it would be better to go with that.  I did
write a replacement for utf8_codecvt_facet.cpp and utf8_codecvt_facet.hpp
that could be dropped in for the use of serialization and passes the
tests in that part of boost.
I saw your message, and your generous offer. It, and the silence that
greeted it, is part of what convinced me that I needed to write my own
conversion functions.
-- 
Chad Nelson
Oak Circle Software, Inc.