
Kirit Sælensminde wrote:
> If an enclosing specification already tells us that it is Unicode and which encoding (i.e. HTTP and SMTP/MIME have this mechanism) then there shouldn't be a BOM.

Yes, if the mechanism tells us the endianness. Otherwise, the BOM is still needed.

> There also should never be a BOM anywhere other than the start of a string/stream/file (if you concatenate you should remove inner ones). I think some old applications may incorrectly use a BOM as a zero-width break too.

Not really incorrectly. U+FEFF really was the zero-width non-breaking space originally, but the special zero-width property led people to use it as a BOM. Thus, a different character (U+2060 WORD JOINER) was designated to take over the ZWNBSP role, and U+FEFF was officially made the BOM. So the usage is only incorrect in new applications.

> When decoding UTF-8 it is also useful to check that the character you just decoded is actually meant to use that number of UTF-8 bytes. For example, by zero padding you can encode an apostrophe as 2 bytes rather than 1. There are a number of security exploits centred around this, and getting one means you're dealing with a buggy Unicode encoder at best, but more likely your software is under attack. I throw an exception to stop all processing in its tracks if I see this.
Phil's code does that, too.

Sebastian Redl
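
For anyone who wants to see what the overlong check amounts to, here is a minimal sketch in Python. It is not Phil's code or anyone's actual implementation; the names `decode_one`, `OverlongEncodingError` and `MIN_CODE_POINT` are made up for illustration. The idea is simply to decode one sequence, then compare the resulting code point against the smallest value that legitimately needs that many bytes, and throw if it falls below.

```python
from typing import Tuple


class OverlongEncodingError(ValueError):
    """Hypothetical exception: code point encoded with more bytes than needed."""


# Smallest code point that legitimately requires a sequence of each length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one UTF-8 sequence at data[pos]; return (code_point, next_pos)."""
    lead = data[pos]
    if lead < 0x80:
        length, cp = 1, lead
    elif lead & 0xE0 == 0xC0:
        length, cp = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        length, cp = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError(f"invalid lead byte 0x{lead:02X}")

    tail = data[pos + 1:pos + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for byte in tail:
        if byte & 0xC0 != 0x80:
            raise ValueError(f"invalid continuation byte 0x{byte:02X}")
        cp = (cp << 6) | (byte & 0x3F)

    # The check discussed above: was this many bytes actually necessary?
    if cp < MIN_CODE_POINT[length]:
        raise OverlongEncodingError(
            f"U+{cp:04X} encoded in {length} bytes at offset {pos}"
        )
    # Surrogates and values beyond U+10FFFF are not valid scalar values either.
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"invalid code point 0x{cp:X}")
    return cp, pos + length
```

With this sketch, `decode_one(b"'")` returns `(0x27, 1)`, while the zero-padded two-byte form `b"\xC0\xA7"` of the same apostrophe raises `OverlongEncodingError` instead of quietly yielding U+0027.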