[boost] Re: Any interest in adding unicode support to boost?

19 Oct 2004

      ...
I am pretty sure you mean abstract character here, not code unit. My
understanding of the Unicode terminology is that the decomposed version of ü
consists of
one abstract character (ü)
two encoded characters (u, ¨)
two UTF-32 code units (0x00000075 0x00000308)
two UTF-16 code units (0x0075 0x0308)
three UTF-8 code units (0x75 0xCC 0x88)
but perhaps I have it backwards...
No. You are correct about that. I don't know what I was talking about. This 
is another example of me talking before I think! ;) I think we agree on 
this, but are just misunderstanding each other.
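
For what it's worth, the code unit counts quoted above can be checked with a few lines of C++. This is only a sketch: to_utf8 and to_utf16 are hand-written helpers made up for this example, not part of any existing or proposed library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode code points as UTF-8 (1 to 4 code units each).
std::string to_utf8(const std::u32string& code_points) {
    std::string out;
    for (char32_t cp : code_points) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: encode code points as UTF-16 (1 or 2 code units each).
std::u16string to_utf16(const std::u32string& code_points) {
    std::u16string out;
    for (char32_t cp : code_points) {
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}

// Decomposed "ü": U+0075 (u) followed by U+0308 (combining diaeresis).
// In a u32string each element is one UTF-32 code unit.
const std::u32string decomposed{U'\x75', U'\x308'};
```

Calling decomposed.size(), to_utf16(decomposed).size() and to_utf8(decomposed).size() gives the counts of UTF-32, UTF-16 and UTF-8 code units respectively.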

Anyhoo...  To answer this again: :)
...
Again, taking this example, let's say that do_some_operation performs
canonicalization to some Unicode canonical form; you can't do this by
iterating over code points.
No, you can't do that with code point iterators, but I am pretty sure you 
couldn't do it with an abstract character iterator either (or any kind of 
iterator, for that matter). The process of canonicalization (I'm assuming you 
are talking about canonical decomposition here) involves splitting one code 
point into multiple code points where that is possible (ü would be split 
into u and ¨, as you say). That means that do_some_operation would need to 
insert code points into the string it is iterating over, something that 
would take some "hacking" to do inside a normal iterator interface.
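
To illustrate the point about insertion, here is a rough sketch of a decomposition pass that builds a new string instead of modifying the one being iterated. The names decomposition_table and decompose are invented for this example, the table holds only two entries, and the sketch ignores recursive decomposition and the canonical reordering of combining marks that a real implementation would need.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical two-entry table. Real mappings would come from the
// Unicode Character Database.
const std::map<char32_t, std::u32string>& decomposition_table() {
    static const std::map<char32_t, std::u32string> table = {
        {U'\u00FC', U"u\u0308"},  // ü -> u + combining diaeresis
        {U'\u00E9', U"e\u0301"},  // é -> e + combining acute accent
    };
    return table;
}

// Read code points from the input and append to a separate output, so
// one input code point can become several without invalidating the
// iteration over the input.
std::u32string decompose(const std::u32string& input) {
    std::u32string output;
    for (char32_t cp : input) {
        auto it = decomposition_table().find(cp);
        if (it != decomposition_table().end()) {
            output += it->second;  // one code point in, several out
        } else {
            output += cp;          // no mapping: copy unchanged
        }
    }
    return output;
}
```

The output can be longer than the input, which is why this reads from one string and writes to another rather than working through a single iterator.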

Abstract character iterators are no better. An abstract character is the 
same whether it is stored as one code point (ü) or as two (u followed by ¨), 
so iterating over abstract characters (I'm not sure how this would even be 
done) would not reveal the underlying sequence of code points that canonical 
decomposition needs to work on.

Ultimately, I feel that normalization (which involves canonical 
decomposition) of Unicode strings should be hidden from the user completely 
and be performed automatically by the library wherever it is needed, for 
example on a call to the == operator. I think that solution would be 
satisfactory for most users, as the normalization process is somewhat 
intricate and really not something users should be forced to understand.
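
As a rough sketch of what I mean by hiding it behind ==, something like the following. The class name ustring and its private normalized helper are made up for this example, and the "normalization" is a toy single-level decomposition over a two-entry table, not a real Unicode normalization form.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical string class that stores code points as given and only
// normalizes when two strings are compared.
class ustring {
public:
    explicit ustring(std::u32string code_points)
        : code_points_(std::move(code_points)) {}

    // Users just write a == b; both sides are decomposed first.
    friend bool operator==(const ustring& a, const ustring& b) {
        return ustring::normalized(a.code_points_) ==
               ustring::normalized(b.code_points_);
    }
    friend bool operator!=(const ustring& a, const ustring& b) {
        return !(a == b);
    }

private:
    // Toy stand-in for canonical decomposition (assumption: a two-entry
    // table, no recursion, no reordering of combining marks).
    static std::u32string normalized(const std::u32string& s) {
        static const std::map<char32_t, std::u32string> table = {
            {U'\u00FC', U"u\u0308"},  // ü -> u + combining diaeresis
            {U'\u00E9', U"e\u0301"},  // é -> e + combining acute accent
        };
        std::u32string out;
        for (char32_t cp : s) {
            auto it = table.find(cp);
            if (it != table.end()) out += it->second;
            else out += cp;
        }
        return out;
    }

    std::u32string code_points_;
};
```

With this, a composed ü and a decomposed u + ¨ compare equal without the caller ever asking for normalization. Whether to normalize on every comparison or once up front is a separate design question that this sketch does not settle.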

Are we at all on the same page now?

Erik Wien