
Phil Endecott wrote:
Sebastian Redl wrote:
It gets worse. I've tried to implement a very simple "kinda-shift" encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to determine endianness. This encoding uses the shift state to remember which endianness it is in. (No dynamic switching.)
The common case is that you have a BOM at the start, and if there are any other BOMs they'll be the same. But what I don't know is what the Unicode specs allow in this respect, and whether it's sensible to provide explicit support for that limited case as well as the more general case.
From memory, when I was implementing Unicode strings for my web framework, it went something along these lines. If an enclosing specification already tells us that the content is Unicode and which encoding it uses (HTTP and SMTP/MIME both have this mechanism) then there shouldn't be a BOM at all. There should also never be a BOM anywhere other than at the start of a string/stream/file; if you concatenate strings you should remove the inner ones. I think some old applications may also incorrectly use U+FEFF mid-stream in its old role as a zero-width no-break space. In practice you probably want to just filter out all BOMs on input and only write one to streams etc. when explicitly told to do so.
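Something along these lines would do it. This is only a sketch, not my actual code; decode_utf16_with_bom and utf16_endian are names made up for illustration. It fixes the endianness once from the leading BOM, much like the shift state Sebastian describes, and drops any inner U+FEFF while it is at it:

    #include <cstddef>
    #include <stdexcept>
    #include <vector>

    enum class utf16_endian { big, little };

    // Read a UTF-16 byte stream: use the leading BOM to fix the
    // endianness once (no dynamic switching), consume the BOM itself,
    // and drop any later U+FEFF so concatenated streams don't leak
    // inner BOMs into the text.
    std::vector<char16_t> decode_utf16_with_bom(const unsigned char* data,
                                                std::size_t size)
    {
        if (size % 2 != 0)
            throw std::runtime_error("odd-length UTF-16 stream");

        utf16_endian e = utf16_endian::big; // no BOM: big-endian default
        std::size_t pos = 0;
        if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
            e = utf16_endian::big;
            pos = 2;
        } else if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
            e = utf16_endian::little;
            pos = 2;
        }

        std::vector<char16_t> out;
        for (; pos + 1 < size; pos += 2) {
            char16_t u = (e == utf16_endian::big)
                ? char16_t((data[pos] << 8) | data[pos + 1])
                : char16_t((data[pos + 1] << 8) | data[pos]);
            if (u == 0xFEFF)
                continue; // filter stray inner BOMs
            out.push_back(u);
        }
        return out;
    }

Defaulting to big-endian when there is no BOM matches what RFC 2781 says for the plain "UTF-16" label; whether that is the right default for UTF-16VE is exactly the sort of spec question raised above.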
When decoding UTF-8 it is also useful to check that the character you just decoded is actually meant to use that number of UTF-8 bytes. For example, by zero-padding the leading bits you can encode an apostrophe (U+0027) as 2 bytes rather than 1. There are a number of security exploits centred around these overlong forms, and getting one means you're dealing with a buggy Unicode encoder at best, but more likely your software is under attack. I throw an exception to stop all processing in its tracks if I see this.
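The check is cheap once you know the sequence length. Again just a sketch with a made-up name, not my production code:

    #include <stdexcept>

    // Decode one UTF-8 sequence, rejecting overlong forms by comparing
    // the decoded code point against the minimum value that genuinely
    // needs that many bytes.
    char32_t decode_utf8_char(const unsigned char*& p, const unsigned char* end)
    {
        if (p == end)
            throw std::runtime_error("empty input");
        unsigned char b = *p++;
        if (b < 0x80)
            return b; // 1 byte: U+0000..U+007F

        int trail;            // number of continuation bytes
        char32_t cp, min_cp;  // accumulated value, minimum for this length
        if ((b & 0xE0) == 0xC0)      { trail = 1; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { trail = 2; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { trail = 3; cp = b & 0x07; min_cp = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        for (int i = 0; i < trail; ++i) {
            if (p == end || (*p & 0xC0) != 0x80)
                throw std::runtime_error("truncated UTF-8 sequence");
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        // The overlong apostrophe 0xC0 0xA7 decodes to U+0027, which is
        // below the 2-byte minimum of U+0080, so it is rejected here.
        if (cp < min_cp)
            throw std::runtime_error("overlong UTF-8 sequence");
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");
        return cp;
    }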
K -- http://www.kirit.com/