
Phil Endecott wrote:
The common case is that you have a BOM at the start, and if there are any other BOMs they'll be the same. But what I don't know is what the Unicode specs allow in this respect, and whether it's sensible to provide explicit support for that limited case as well as the more general case. (Do the IANA character set names that I'm using as the basis for the charset_t enum have any way of distinguishing these cases, for example? I think the answer is no.)
IANA registers UTF-16BE, UTF-16LE and UTF-16. BE and LE are the fixed-endian variants. UTF-16 depends on context: if the base unit is a 16-bit entity, UTF-16 is simply endian-agnostic. If it's an 8-bit entity, I believe UTF-16 requires a BOM. I don't think flipping endianness in the middle of a string is useful. I can't imagine what twisted tool would generate such code. Come to think of it, if I'm not careful, *my* code will generate it, namely when you concatenate a BE and an LE string. Concatenating shift encodings is *not* fun. Neither is substringing them.
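For the byte-oriented case with a leading BOM, the check can be sketched roughly as follows. The names utf16_order and detect_utf16_bom are made up for illustration and are not part of any proposed interface. The sketch only looks at the first two bytes, so it does not distinguish a UTF-16LE BOM from the start of a UTF-32LE one, and it says nothing about what to do when no BOM is present.

```cpp
#include <cstddef>

// Hypothetical result type; not taken from the code under discussion.
enum class utf16_order { big_endian, little_endian, unknown };

// Inspect the first two bytes of a byte-oriented UTF-16 sequence.
// U+FEFF serialized big-endian is FE FF, little-endian is FF FE.
constexpr utf16_order detect_utf16_bom(const unsigned char* data,
                                       std::size_t size)
{
    return size < 2 ? utf16_order::unknown
         : (data[0] == 0xFE && data[1] == 0xFF) ? utf16_order::big_endian
         : (data[0] == 0xFF && data[1] == 0xFE) ? utf16_order::little_endian
         : utf16_order::unknown;
}
```

Concatenating a BE string and an LE string byte-for-byte would leave the second BOM in the middle of the result, which is exactly the mixed case described above.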
Trying to implement this, I've found that it is apparently logically impossible to provide bidirectional iterators for shift encodings, like ISO 2022-based encodings. These encodings rely on state that can only be known by sequentially scanning the string from front to back.
Yes. You may be able to argue in some cases that you can predict the state during backward traversal IF there are no redundant shifts and if there are only two states. Again, I don't know whether that's useful in practice (and I suspect not).
Not really. The only shift encodings that ever found use are those of the ISO 2022 family, which have two different sets of shift states, one with four and one with three states, for a total of 12 shift states, not to mention the character set selection capabilities. Have I mentioned that the complexity of this stuff is absurd?
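To illustrate why the state has to be computed from the front, here is a toy model with only two states, switched by the SO (0x0E) and SI (0x0F) control bytes. It is deliberately far simpler than real ISO 2022, and the names shift_state, next_state and state_at are invented for this sketch.

```cpp
#include <cstddef>

// Toy model: two shift states only. Real ISO 2022 has many more.
enum class shift_state { g0, g1 };

// State after consuming one byte: SO (0x0E) selects g1, SI (0x0F) selects g0,
// any other byte leaves the state unchanged.
constexpr shift_state next_state(shift_state s, unsigned char byte)
{
    return byte == 0x0E ? shift_state::g1
         : byte == 0x0F ? shift_state::g0
         : s;
}

// The state in effect at offset pos is a function of every byte before pos.
// The recursion bottoms out at the start of the string, where the state is g0.
constexpr shift_state state_at(const unsigned char* data, std::size_t pos)
{
    return pos == 0 ? shift_state::g0
                    : next_state(state_at(data, pos - 1), data[pos - 1]);
}
```

The same byte value can sit at two offsets under two different states, so an iterator that lands on it from behind cannot tell how to interpret it without rescanning from the beginning.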
We could detect the case when skip_forward_char is not implemented.
What I'm currently doing is detecting if state_t is an empty class. Much, much easier than detecting if a function is implemented or not, especially if you have a base class that provides a default for the function.
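A minimal sketch of that check, written with the standard-library traits (std::is_empty, std::conditional) rather than whatever the actual code uses. The two encoding traits classes and char_iterator_category are hypothetical stand-ins; only the name state_t comes from the discussion.

```cpp
#include <iterator>
#include <type_traits>

// Hypothetical encoding traits. A stateless encoding has an empty state_t.
struct stateless_encoding { struct state_t {}; };
struct shift_encoding     { struct state_t { int active_set; }; };

// Stateless encodings can be walked backwards; stateful ones are limited
// to forward traversal because the state is only known front to back.
template <typename Encoding>
struct char_iterator_category
{
    typedef typename std::conditional<
        std::is_empty<typename Encoding::state_t>::value,
        std::bidirectional_iterator_tag,
        std::forward_iterator_tag>::type type;
};
```

This needs no knowledge of which member functions an encoding overrides, which is the advantage over detecting skip_forward_char.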
There are various factors that influence the adapted iterator traversal tag. For example, I wanted to say that the character iterator has the same traversal tag as the unit iterator, except that it's not random access; i.e. min(unit_iter_t,bidirectional). Is there any existing code anywhere for doing operations like this on iterator traversal category tags?
Not that I know of. I had something like this around for old-style categories, but when I tried to adapt it to the new ones, I realized that it didn't actually work. (I ended up never using it.)

Sebastian Redl
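For the old-style standard category tags, the min(unit_iter_t, bidirectional) operation asked about above can be sketched with a convertibility test, since those tags form a single inheritance chain from input up to random access. min_category is an invented name. The sketch gives no meaningful answer for std::output_iterator_tag, which sits outside that chain, and it has not been checked against the new-style traversal tags.

```cpp
#include <iterator>
#include <type_traits>

// Picks the weaker of two standard iterator category tags.
// If A converts to B, A derives from (or is) B, so B is the weaker one.
// Otherwise A is taken to be the weaker one.
template <typename A, typename B>
struct min_category
{
    typedef typename std::conditional<
        std::is_convertible<A, B>::value, B, A>::type type;
};
```

A character iterator could then take min_category<unit category, std::bidirectional_iterator_tag>::type, which caps random access at bidirectional and leaves weaker categories unchanged.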