
Phil Endecott wrote:
Sebastian Redl wrote:
It gets worse. I've tried to implement a very simple "kinda-shift" encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to determine endianness. This encoding uses the shift state to remember which endianness it is in. (No dynamic switching.)
The common case is that you have a BOM at the start, and if there are any other BOMs they'll be the same. But what I don't know is what the Unicode specs allow in this respect, and whether it's sensible to provide explicit support for that limited case as well as the more general case.
From memory, when I was implementing Unicode strings for my web framework, it went something along these lines. If an enclosing specification already tells us that the content is Unicode and which encoding it uses (HTTP and SMTP/MIME both have this mechanism) then there shouldn't be a BOM at all. There should also never be a BOM anywhere other than at the start of a string/stream/file; if you concatenate strings you should remove the inner ones. I think some old applications may also incorrectly use U+FEFF mid-stream in its old role as a zero-width no-break space. In practice you probably want to just filter out all BOMs on input and only write one to streams etc. when explicitly told to do so.
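Something along these lines would do it. This is only a sketch, not my actual code; decode_utf16_with_bom and utf16_endian are names made up for illustration. It fixes the endianness once from the leading BOM, much like the shift state Sebastian describes, and drops any inner U+FEFF while it is at it:

    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    enum class utf16_endian { big, little };

    // Read a UTF-16 byte stream: use the leading BOM to fix the
    // endianness once (no dynamic switching), consume the BOM itself,
    // and drop any later U+FEFF so concatenated streams don't leak
    // inner BOMs into the text.
    std::vector<char16_t> decode_utf16_with_bom(const unsigned char* data,
                                                std::size_t size)
    {
        if (size % 2 != 0)
            throw std::runtime_error("odd-length UTF-16 stream");

        utf16_endian e = utf16_endian::big; // no BOM: big-endian default
        std::size_t pos = 0;
        if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
            e = utf16_endian::big;
            pos = 2;
        } else if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
            e = utf16_endian::little;
            pos = 2;
        }

        std::vector<char16_t> out;
        for (; pos + 1 < size; pos += 2) {
            char16_t u = (e == utf16_endian::big)
                ? char16_t((data[pos] << 8) | data[pos + 1])
                : char16_t((data[pos + 1] << 8) | data[pos]);
            if (u == 0xFEFF)
                continue; // filter stray inner BOMs
            out.push_back(u);
        }
        return out;
    }

Defaulting to big-endian when there is no BOM matches what RFC 2781 says for the plain "UTF-16" label; whether that is the right default for UTF-16VE is exactly the sort of spec question raised above.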
When decoding UTF-8 it is also useful to check that the character you just decoded is actually meant to use that number of UTF-8 bytes. For example, by zero-padding the leading bits you can encode an apostrophe (U+0027) as 2 bytes rather than 1. There are a number of security exploits centred around these overlong forms, and getting one means you're dealing with a buggy Unicode encoder at best, but more likely your software is under attack. I throw an exception to stop all processing in its tracks if I see this.
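The check is cheap once you know the sequence length. Again just a sketch with a made-up name, not my production code:

    #include <stdexcept>

    // Decode one UTF-8 sequence, rejecting overlong forms by comparing
    // the decoded code point against the minimum value that genuinely
    // needs that many bytes.
    char32_t decode_utf8_char(const unsigned char*& p, const unsigned char* end)
    {
        if (p == end)
            throw std::runtime_error("empty input");
        unsigned char b = *p++;
        if (b < 0x80)
            return b; // 1 byte: U+0000..U+007F

        int trail;            // number of continuation bytes
        char32_t cp, min_cp;  // accumulated value, minimum for this length
        if ((b & 0xE0) == 0xC0)      { trail = 1; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { trail = 2; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { trail = 3; cp = b & 0x07; min_cp = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        for (int i = 0; i < trail; ++i) {
            if (p == end || (*p & 0xC0) != 0x80)
                throw std::runtime_error("truncated UTF-8 sequence");
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        // The overlong apostrophe 0xC0 0xA7 decodes to U+0027, which is
        // below the 2-byte minimum of U+0080, so it is rejected here.
        if (cp < min_cp)
            throw std::runtime_error("overlong UTF-8 sequence");
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");
        return cp;
    }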
K -- http://www.kirit.com/