
I've been meaning to mention this for some time. The boost utf-8 code conversion facet implements an early spec of utf-8 that allows representations of up to 6 bytes. That spec has long since been superseded by newer ones that, for security reasons, allow only 1-4 byte representations. See http://en.wikipedia.org/wiki/UTF-8 and in particular the section on invalid byte sequences.

The issue is that the longer representations let you alias characters. If someone is filtering out certain chars for security reasons, for sql or http requests, for example, an attacker can bypass the filter by using a different representation of a character and inject their naughty stuff just when you think you've got them stopped. In addition to no more 5 and 6 byte utf-8 chars, there are some other first and second byte codes that are forbidden because they would alias other characters, encode surrogates, or encode values past U+10FFFF.

The facet also has some other stuff wrong. For example, do_length() is supposed to count only the length of valid code sequences, but the boost implementation doesn't check for validity.

Current specs, see STD 63 (currently RFC 3629) and The Unicode Standard Version 5.2, specify how to correctly implement the conversion. I think that everything you need to know to implement do_in() and do_length() correctly, however, is in this table (notice that the start and end of the valid second byte range vary between unicode code ranges):

/*
Table 3-7. Well-Formed UTF-8 Byte Sequences
from version 5.2 of the Unicode Standard

Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF      80..BF
U+0800..U+0FFF       E0          A0..BF       80..BF
U+1000..U+CFFF       E1..EC      80..BF       80..BF
U+D000..U+D7FF       ED          80..9F       80..BF
U+E000..U+FFFF       EE..EF      80..BF       80..BF
U+10000..U+3FFFF     F0          90..BF       80..BF      80..BF
U+40000..U+FFFFF     F1..F3      80..BF       80..BF      80..BF
U+100000..U+10FFFF   F4          80..8F       80..BF      80..BF
*/

So, you have to do more checking, but it's still efficient.
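
To make the table concrete, here is a minimal sketch of the per-sequence check it implies. This is not the boost code and not my facet; well_formed_utf8_prefix is a hypothetical helper name made up for illustration. It returns how many leading bytes of a buffer form complete, well-formed sequences, which is roughly the quantity a validating do_length() would have to work from.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of boost): returns the number of bytes at
// the start of [first, last) that form complete, well-formed UTF-8
// sequences per Table 3-7. Stops at the first ill-formed or truncated one.
std::size_t well_formed_utf8_prefix(const unsigned char* first,
                                    const unsigned char* last)
{
    assert(first <= last);
    const unsigned char* p = first;
    while (p != last) {
        const unsigned char b0 = p[0];
        std::size_t len = 0;       // total bytes in this sequence
        unsigned char lo = 0x80;   // valid range for the second byte
        unsigned char hi = 0xBF;

        if (b0 <= 0x7F)                    { len = 1; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
        else if (b0 == 0xE0)               { len = 3; lo = 0xA0; } // no overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
        else if (b0 == 0xED)               { len = 3; hi = 0x9F; } // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { len = 3; }
        else if (b0 == 0xF0)               { len = 4; lo = 0x90; } // no overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
        else if (b0 == 0xF4)               { len = 4; hi = 0x8F; } // nothing past U+10FFFF
        else break;  // 80..BF, C0, C1 and F5..FF never start a sequence

        if (static_cast<std::size_t>(last - p) < len) break;  // truncated
        if (len >= 2 && (p[1] < lo || p[1] > hi)) break;      // bad second byte

        bool ok = true;
        for (std::size_t i = 2; i < len; ++i) {
            // Third and fourth bytes are always plain continuation bytes.
            if (p[i] < 0x80 || p[i] > 0xBF) { ok = false; break; }
        }
        if (!ok) break;
        p += len;
    }
    return static_cast<std::size_t>(p - first);
}
```

For example, the classic alias for '/', the two bytes C0 AF, is rejected at the first byte, so the function returns 0 for it, while the valid three-byte sequence E2 82 AC returns 3. A real do_length() would also have to honor the max argument and the conversion state, which this sketch leaves out.
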

I have test code that checks all this. I'm willing to help in various ways.

o I have my own implementation of the utf-8 code conversion facet that implements do_in, do_out, and do_length. You can have the source to jump start the fix.
o I'm willing to make changes to the boost one and contribute a patch with a little guidance.
o I'm willing to just figure that now that I've pointed it out, someone who owns this code will be motivated to fix it. It's not too hard.
o Also willing to give you the test code to check with. It's probably a lot different from how boost implements test code.
o Willing to give peace a chance;)

It should really be fixed. There are a lot of bad guys out there who know about these sorts of problems. Plus, there aren't many open source implementations of utf-8 code conversion facets, so folks are likely to emulate/steal this one.

btw mine's freely available to anyone who requests it.

wc -l *
  232 codecvt_utf8_facet.cpp
   29 codecvt_utf8_facet.hpp
   14 Makefile
  510 testcodecvt.cpp

Funny, the test code is twice the size of the code.

Patrick