
I am pretty sure you mean abstract character here, not code unit. My understanding of the Unicode terminology is that the decomposed version of ü consists of
  one abstract character (ü)
  two encoded characters (u, ¨)
  two UTF-32 code units (0x00000075 0x00000308)
  two UTF-16 code units (0x0075 0x0308)
  three UTF-8 code units (0x75 0xCC 0x88)
but perhaps I have it backwards...
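For anyone who wants to check the counts above, here is a minimal sketch in Python (my choice of language for illustration, not something used in this thread). It encodes the decomposed sequence U+0075 U+0308 and counts the code units in each encoding form. The variable names are purely illustrative.

```python
# Decomposed "ü": LATIN SMALL LETTER U followed by COMBINING DIAERESIS.
decomposed = "\u0075\u0308"

# Python strings are sequences of code points, so len() counts code points.
code_points = len(decomposed)

# Code unit counts: encoded byte length divided by the code unit size.
# The "-be" codecs are used so that no byte order mark is prepended.
utf32_units = len(decomposed.encode("utf-32-be")) // 4
utf16_units = len(decomposed.encode("utf-16-be")) // 2
utf8_units = len(decomposed.encode("utf-8"))

# The individual UTF-8 code units, for comparison with the list above.
utf8_bytes = [hex(b) for b in decomposed.encode("utf-8")]

print(code_points, utf32_units, utf16_units, utf8_units, utf8_bytes)
```

This only confirms the code point and code unit counts; it says nothing about how "abstract character" should be counted, which is a question of terminology rather than of encoding.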
No, you have it right. I don't know what I was talking about. This is another example of me talking before I think! ;) I think we agree on this, and are just misunderstanding each other. Anyhoo... To answer this again: :)
Again, taking this example, let's say that do_some_operation performs canonicalization to some Unicode canonical form; you can't do this by iterating over code points.
No, you can't do that with code point iterators, but I am pretty sure you couldn't do it with an abstract character iterator either (or with any kind of iterator, for that matter). The process of canonicalization (I'm assuming you are talking about canonical decomposition here) involves splitting one code point into multiple code points wherever that is possible (ü would be split into u and ¨, as you say). That means do_some_operation would need to insert code points into the string it is iterating over, something that would take some "hacking" to do inside a normal iterator interface.

Abstract character iterators are no better. The concept of an abstract character is oblivious to the differences between these representations, and iterating over abstract characters (I'm not sure how this would even be done) would not reveal the underlying composition of code points that canonical decomposition needs to work on.

Ultimately I feel that normalization of Unicode strings (which involves canonical decomposition) should be hidden from the user completely and be performed automatically by the library wherever it is needed, for example on a call to the == operator. I think that solution would be satisfactory for most users, as the normalization process is somewhat intricate and really not something users should be forced to understand. Are we at all on the same page now?
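To make the last point a bit more concrete, here is a rough sketch of what "hidden" normalization could look like. It is written in Python only because its standard `unicodedata` module makes the idea easy to show; the `UString` class is a hypothetical wrapper invented for this example and is not a proposal for the actual library interface. Note that decomposition returns a new, longer sequence rather than modifying anything in place, which is the reason it fits poorly into a plain iterator.

```python
import unicodedata


class UString:
    """Hypothetical wrapper whose == compares by canonical equivalence."""

    def __init__(self, text):
        self.text = text

    def _nfd(self):
        # Canonical decomposition (NFD), done internally by the "library".
        return unicodedata.normalize("NFD", self.text)

    def __eq__(self, other):
        if not isinstance(other, UString):
            return NotImplemented
        # Normalize both sides before comparing code point by code point.
        return self._nfd() == other._nfd()

    def __hash__(self):
        # Keep hashing consistent with the normalized comparison.
        return hash(self._nfd())


precomposed = "\u00fc"   # ü as a single code point
decomposed = "u\u0308"   # u followed by COMBINING DIAERESIS

# Decomposition produces a new sequence with more code points.
nfd = unicodedata.normalize("NFD", precomposed)
nfd_length = len(nfd)

# A raw code point comparison treats the two spellings as different ...
raw_equal = precomposed == decomposed

# ... while the wrapper normalizes behind the scenes.
wrapped_equal = UString(precomposed) == UString(decomposed)

print(nfd_length, raw_equal, wrapped_equal)
```

Normalizing on every comparison is the simplest thing that works for a sketch; a real library would presumably cache the normalized form or normalize once on construction, but that is a design question this example does not try to settle.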