Re: [boost] UTF-8 conversion etc.

3 Mar 2008


      Kirit Sælensminde wrote:
...
If an enclosing specification already tells us that it is Unicode and 
which encoding (i.e. HTTP and SMTP/MIME have this mechanism) then there 
shouldn't be a BOM.
Yes, if the mechanism tells us the endianness. Otherwise, the BOM is
still needed.
 There also should never be a BOM anywhere other than 
the start of a string/stream/file (if you concatenate you should remove 
inner ones). I think some old applications may incorrectly use a BOM as 
a zero width break too.
Not really incorrectly. U+FEFF really was the zero-width non-breaking
space originally, but its zero-width property led people to use
it as a BOM. Thus, a different character (U+2060 WORD JOINER) was
designated to take over the ZWNBSP role, and U+FEFF was officially made
the BOM. (0xFFFE is the byte-swapped value, which is how a reader
notices the stream has the opposite endianness.) So the usage is only
incorrect in new applications.
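
To make the endianness point concrete, here is a minimal sketch of BOM sniffing. It assumes the caller already knows the bytes are UTF-16 (FF FE followed by 00 00 would also match a UTF-32LE BOM). The names Utf16Order and detect_utf16_bom are made up for illustration and are not part of Boost or of Phil's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration only.
enum class Utf16Order { Unknown, BigEndian, LittleEndian };

// Inspect the first two bytes of a byte stream assumed to be UTF-16.
// U+FEFF serialised big-endian is FE FF; little-endian it is FF FE.
inline Utf16Order detect_utf16_bom(const std::string& bytes) {
    if (bytes.size() < 2) return Utf16Order::Unknown;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    if (b0 == 0xFE && b1 == 0xFF) return Utf16Order::BigEndian;
    if (b0 == 0xFF && b1 == 0xFE) return Utf16Order::LittleEndian;
    return Utf16Order::Unknown;  // no BOM: fall back to the enclosing protocol
}
```

If the result is Unknown, the byte order has to come from the enclosing mechanism (for example a charset label such as UTF-16BE), which is the case discussed above.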
When decoding UTF-8 it is also useful to check that the character you 
just decoded is actually meant to use that number of UTF-8 bytes. For 
example, by zero padding you can encode an apostrophe as 2 bytes rather 
than 1. There are a number of security exploits centred around this and 
getting one means you're dealing with a buggy Unicode encoder at best, 
but more likely your software is under attack. I throw an exception to 
stop all processing in its tracks if I see this.
Phil's code does that, too.
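
For anyone following along, the overlong check can be sketched roughly as below. This is not Phil's code, just one way to do it under my own assumptions: decode_utf8 is a hypothetical helper that decodes one code point from a std::string, compares the result against the smallest value that legitimately needs that many bytes, and throws std::runtime_error otherwise.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode one code point starting at s[i] and advance i
// past it. Throws std::runtime_error on malformed or overlong input.
inline char32_t decode_utf8(const std::string& s, std::size_t& i) {
    if (i >= s.size()) throw std::runtime_error("UTF-8: read past end");
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len;
    char32_t cp;
    char32_t min;  // smallest code point that legitimately needs `len` bytes
    if (lead < 0x80)                { len = 1; cp = lead;        min = 0x0; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; min = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; min = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; min = 0x10000; }
    else throw std::runtime_error("UTF-8: invalid lead byte");

    if (len > s.size() - i) throw std::runtime_error("UTF-8: truncated sequence");
    for (std::size_t k = 1; k < len; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80)
            throw std::runtime_error("UTF-8: bad continuation byte");
        cp = (cp << 6) | (c & 0x3F);
    }
    // The zero-padding trick: the value fits in fewer bytes than were used.
    if (cp < min) throw std::runtime_error("UTF-8: overlong encoding");
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::runtime_error("UTF-8: code point out of range");
    i += len;
    return cp;
}
```

With this, the apostrophe example comes out as expected: the single byte 0x27 decodes normally, while the zero-padded two-byte form C0 A7 yields the value 0x27, which is below the two-byte minimum of 0x80, so it throws.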

Sebastian Redl