Re: [boost] UTF-8 conversion etc.

1 Mar 2008

      Phil Endecott wrote:
...
My original hope was that only the actual conversion function would 
need to track the shift state, while the decoding of the units would be 
independent of it.  Unfortunately this isn't true of, for example, 
iso-2022; you need to track the shift state to know whether you're 
looking at a 1-byte or a 2-byte character.  So, yes, this interface 
will need to change.

It gets worse. I've tried to implement a very simple "kinda-shift"
encoding, UTF-16VE. That is, a UTF-16 form that expects a BOM to
determine endianness. This encoding uses the shift state to remember
which byte order it is in. (No dynamic switching.)
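
To make the idea concrete, here is a minimal sketch of such a decoder.
It is not the code I am describing above; the names Utf16Endian and
decode_utf16ve are made up for illustration, and it only goes as far as
UTF-16 code units (no surrogate handling). The point is that the only
"shift state" is the byte order learned from the BOM.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical shift state for "UTF-16VE": it only remembers the byte
// order announced by the BOM, and never changes after that.
enum class Utf16Endian { Unknown, Big, Little };

// Decodes a BOM-prefixed byte sequence into UTF-16 code units.
// Throws if the input has an odd length or does not start with a BOM.
std::vector<std::uint16_t> decode_utf16ve(const std::vector<unsigned char>& bytes)
{
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("odd number of bytes");

    Utf16Endian state = Utf16Endian::Unknown;
    std::vector<std::uint16_t> units;

    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned b0 = bytes[i];
        const unsigned b1 = bytes[i + 1];

        if (state == Utf16Endian::Unknown) {
            // The first unit must be the BOM; it sets the state and is
            // not emitted.
            if (b0 == 0xFE && b1 == 0xFF)
                state = Utf16Endian::Big;
            else if (b0 == 0xFF && b1 == 0xFE)
                state = Utf16Endian::Little;
            else
                throw std::invalid_argument("missing BOM");
            continue;
        }

        const unsigned unit = (state == Utf16Endian::Big)
                                  ? ((b0 << 8) | b1)
                                  : ((b1 << 8) | b0);
        units.push_back(static_cast<std::uint16_t>(unit));
    }
    return units;
}
```

Because the state is fixed once the BOM has been read, an iterator over
this encoding can carry the byte order along and still step backwards.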

Trying to implement this, I've found that it is apparently logically
impossible to provide bidirectional iterators for shift encodings, like
ISO 2022-based encodings. These encodings rely on state that can only be
known by sequentially scanning the string from front to back. Any
attempt to iterate backwards would first require a forward pass that
records each switch position and the mode being switched from.
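
As a rough illustration of that marking pass, here is a sketch over a
toy shift encoding. The encoding is invented for the example and is not
real ISO 2022: byte 0x0E switches to two-byte mode and 0x0F switches
back to one-byte mode. The names Mode, ShiftMark, mark_shifts and
mode_at are likewise assumptions.

```cpp
#include <cstddef>
#include <vector>

// Toy shift encoding, invented for illustration (not real ISO 2022):
// 0x0E switches to two-byte mode, 0x0F switches back to one-byte mode.
enum class Mode { OneByte, TwoByte };

struct ShiftMark {
    std::size_t pos;  // offset of the shift byte
    Mode before;      // mode in effect before the shift
    Mode after;       // mode in effect after the shift
};

// Forward pass: the only way to find the shifts is to decode from the
// start, because the width of each character depends on the mode.
std::vector<ShiftMark> mark_shifts(const std::vector<unsigned char>& s)
{
    std::vector<ShiftMark> marks;
    Mode mode = Mode::OneByte;
    std::size_t i = 0;
    while (i < s.size()) {
        if (s[i] == 0x0E || s[i] == 0x0F) {
            const Mode next = (s[i] == 0x0E) ? Mode::TwoByte : Mode::OneByte;
            marks.push_back({i, mode, next});
            mode = next;
            ++i;
        } else {
            // Skip one whole character in the current mode.
            i += (mode == Mode::TwoByte) ? 2 : 1;
        }
    }
    return marks;
}

// Mode in effect for the byte at offset pos, looked up from the marks.
// A backward-stepping iterator would need this to know how far to step.
Mode mode_at(const std::vector<ShiftMark>& marks, std::size_t pos)
{
    Mode mode = Mode::OneByte;
    for (const ShiftMark& m : marks) {
        if (m.pos >= pos)
            break;
        mode = m.after;
    }
    return mode;
}
```

The table of marks costs a full scan plus storage before the first
backward step, which is exactly what an iterator adapter cannot do on
its own.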

This can be worked around for my UTF-16VE, but not for true shift
encodings. Thus, the charset traits probably need a flag that designates
the set as a shift encoding and makes the iterator adapter be forward-only.
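
One way such a flag might look, purely as a sketch: the traits names
and the is_shift_encoding member below are hypothetical, not part of
any existing interface. The adapter would pick its iterator category
from the flag.

```cpp
#include <iterator>
#include <type_traits>

// Hypothetical charset traits; the member name is an assumption.
struct utf8_traits      { static constexpr bool is_shift_encoding = false; };
struct iso2022jp_traits { static constexpr bool is_shift_encoding = true;  };

// Selects the category the iterator adapter should advertise:
// forward-only for shift encodings, bidirectional otherwise.
template <typename Traits>
struct decoding_iterator_category {
    using type = typename std::conditional<
        Traits::is_shift_encoding,
        std::forward_iterator_tag,
        std::bidirectional_iterator_tag>::type;
};

static_assert(std::is_same<decoding_iterator_category<utf8_traits>::type,
                           std::bidirectional_iterator_tag>::value,
              "stateless encodings can step backwards");
static_assert(std::is_same<decoding_iterator_category<iso2022jp_traits>::type,
                           std::forward_iterator_tag>::value,
              "shift encodings are forward-only");
```

Generic code that needs to walk backwards would then fail to compile
for shift encodings instead of silently misdecoding.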

On a side note, Shift-JIS, EUC-JP and ISO-2022-JP are all absurdly
complex. UTF-8 is so much easier!

Sebastian Redl