[boost] boost utf-8 code conversion facet has security problems

14 Oct 2010

      I've been meaning to mention this for some time.  The boost utf-8 code
conversion facet implements an early spec of utf-8 that allows
representations of up to 6 bytes, but current specs and known security
issues say it should support at most four.  See
http://en.wikipedia.org/wiki/UTF-8, in particular the section on
invalid byte sequences.  It also gets some other things wrong: do_length()
is supposed to count only valid code sequences, but the
boost implementation doesn't check for validity.

The 6-byte spec has long since been superseded by newer specs that allow
only 1-4 byte representations.  The issue is that the longer
representations let you alias characters, so if someone is filtering out
certain chars for security reasons (for sql or http requests, for
example), an attacker can bypass the filter by using a different
representation of a character and inject their naughty stuff just when
you think you've got them stopped.  In addition to dropping 5 and 6 byte
utf-8 chars, the current specs forbid some other first and second byte
codes because they would alias other characters.
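
To make the aliasing concrete, here is a minimal sketch.  Both functions
are hypothetical and written only for illustration; neither is the boost
code nor my facet.  The decoder masks the payload bits together without
checking that the value actually needed two bytes, and the filter only
looks for a literal '/' byte.

```cpp
#include <cassert>   // used by the checks in the note below
#include <cstdint>
#include <string>

// Hypothetical lenient decoder for a two-byte sequence: it combines the
// payload bits and never checks that the result is >= U+0080.
std::uint32_t lenient_decode2(unsigned char b0, unsigned char b1) {
    return (static_cast<std::uint32_t>(b0 & 0x1F) << 6) | (b1 & 0x3F);
}

// Hypothetical byte-level filter: accepts input only if it contains no
// literal '/' (0x2F) byte.
bool filter_allows(const std::string& bytes) {
    return bytes.find('/') == std::string::npos;
}
```

The bytes C0 AF pass filter_allows(), yet lenient_decode2(0xC0, 0xAF)
yields 0x2F, which is '/'.  A strict decoder rejects C0 and C1 as first
bytes, so the alias never gets through.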

Current specs, see STD 63 (currently RFC 3629) and The Unicode Standard,
Version 5.2, specify how to correctly implement the conversion.  I think
everything you need to know to implement do_in() and do_length()
correctly is in this table (notice that the start and end of the valid
second byte range vary in different unicode code ranges):

/*  Table 3-7. Well-Formed UTF-8 Byte Sequences from version 5.2 of the
      Unicode Standard

     Code Points    First Byte Second Byte Third Byte Fourth Byte
    U+0000..U+007F     00..7F
    U+0080..U+07FF     C2..DF    80..BF
    U+0800..U+0FFF     E0        A0..BF      80..BF
    U+1000..U+CFFF     E1..EC    80..BF      80..BF
    U+D000..U+D7FF     ED        80..9F      80..BF
    U+E000..U+FFFF     EE..EF    80..BF      80..BF
    U+10000..U+3FFFF   F0        90..BF      80..BF      80..BF
    U+40000..U+FFFFF   F1..F3    80..BF      80..BF      80..BF
    U+100000..U+10FFFF F4        80..8F      80..BF      80..BF
*/

So, you have to do more checking, but it's still efficient.
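
As a rough illustration of how little extra checking the table requires,
here is a sketch of a validator built directly from it.  The function
names well_formed_length and valid_prefix_length are made up for this
example; this is not the boost code and not the facet I'm offering, just
one way the table could be turned into code.

```cpp
#include <cassert>   // used by the checks in the note below
#include <cstddef>

// Returns the byte length (1..4) of the well-formed UTF-8 sequence that
// starts at p, or 0 if the bytes are ill-formed or truncated before end.
std::size_t well_formed_length(const unsigned char* p,
                               const unsigned char* end) {
    if (p == end) return 0;
    const unsigned char b0 = *p;
    if (b0 <= 0x7F) return 1;
    if (b0 < 0xC2 || b0 > 0xF4) return 0;  // trail bytes, C0, C1, F5..FF

    std::size_t len;
    unsigned char lo = 0x80, hi = 0xBF;    // default second byte range
    if (b0 <= 0xDF) {
        len = 2;
    } else if (b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) lo = 0xA0;         // row U+0800..U+0FFF
        else if (b0 == 0xED) hi = 0x9F;    // row U+D000..U+D7FF
    } else {
        len = 4;
        if (b0 == 0xF0) lo = 0x90;         // row U+10000..U+3FFFF
        else if (b0 == 0xF4) hi = 0x8F;    // row U+100000..U+10FFFF
    }
    if (static_cast<std::size_t>(end - p) < len) return 0;
    if (p[1] < lo || p[1] > hi) return 0;
    for (std::size_t i = 2; i < len; ++i)
        if (p[i] < 0x80 || p[i] > 0xBF) return 0;
    return len;
}

// Sketch of the count a do_length() override could report: the bytes
// consumed by at most max_chars well-formed sequences, stopping at the
// first ill-formed or truncated one.
std::size_t valid_prefix_length(const unsigned char* from,
                                const unsigned char* end,
                                std::size_t max_chars) {
    const unsigned char* p = from;
    for (; max_chars > 0; --max_chars) {
        const std::size_t n = well_formed_length(p, end);
        if (n == 0) break;
        p += n;
    }
    return static_cast<std::size_t>(p - from);
}
```

For example, well_formed_length() returns 0 for C0 AF and for ED A0 80,
and 4 for F4 8F BF BF.  The cost over an unchecked decoder is a couple of
comparisons on the second byte.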

I have test code that checks all this.  I'm willing to help in various ways.

o I have my own implementation of the utf-8 code conversion facet, with
do_in, do_out, and do_length.  You can have the source to jump start the
fix.

o I'm willing to make changes to the boost one and contribute a patch
with a little guidance.

o I'm willing to just figure that, now that I've pointed it out, someone
who owns this code will be motivated to fix it.  It's not too hard.

o Also willing to give you the test code to check with.  It's probably a
lot different from how boost implements test code.

o Willing to give peace a chance;)

It should really be fixed.  There are a lot of bad guys out there who
know about these sorts of problems.  Plus, there aren't many open source
implementations of utf-8 code conversion facets, so folks are likely to
emulate/steal this one.

btw mine's freely available to anyone who requests it.

wc -l *
    232 codecvt_utf8_facet.cpp
     29 codecvt_utf8_facet.hpp
     14 Makefile
    510 testcodecvt.cpp

Funny the test code is twice the size of the code.

Patrick

Patrick Horgan