[boost] Re: Any interest in adding unicode support to boost?

19 Oct 2004

In article <cl3nps$4d8$1@sea.gmane.org>, "Erik Wien" <wien@start.no> wrote:
...
Hi. Thanks for the feedback!
My pleasure :-)
...
"Miro Jurisic" <macdev@meeroh.org> wrote in message 
news:macdev-BACD3C.13585519102004@sea.gmane.org...
...
I generally agree with this design approach, but I don't think that code 
point iterators alone are sufficient.
Neither do I, as a matter of fact, but this is as far as I have come right 
now. :) There would probably be different types of iterators (or iterator 
wrappers) made available to enable iterations over everything from code units 
to code points/abstract characters.
Yes, I agree.
...
...
Iteration over encoded characters and abstract characters would be needed 
for some algorithms to function sensibly. For example, the simple task of:
find(begin, end, "ü")
needs to use abstract characters in order to be able to find precomposed 
and decomposed versions of ü.
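To make the find() example concrete, here is a minimal sketch, assuming C++11 (char32_t, std::u32string). The names decompose and contains_canonical are hypothetical, and the decomposition table is a toy that knows only U+00FC; a real implementation would take its mappings from the Unicode Character Database and would also handle canonical reordering of combining marks.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Toy canonical decomposition (assumption: only U+00FC is known).
// U+00FC LATIN SMALL LETTER U WITH DIAERESIS -> U+0075 U+0308.
std::u32string decompose(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x00FC) {
            out += static_cast<char32_t>(0x0075);  // 'u'
            out += static_cast<char32_t>(0x0308);  // COMBINING DIAERESIS
        } else {
            out += cp;
        }
    }
    return out;
}

// True if needle occurs in haystack once both are brought to the same
// (decomposed) form. An empty needle is treated as always found.
bool contains_canonical(const std::u32string& haystack,
                        const std::u32string& needle) {
    const std::u32string h = decompose(haystack);
    const std::u32string n = decompose(needle);
    if (n.empty()) return true;
    return std::search(h.begin(), h.end(), n.begin(), n.end()) != h.end();
}

// Usage: a plain code point search misses the decomposed spelling,
// while the normalizing search finds either spelling.
inline void demo_find() {
    const std::u32string decomposed = U"Mu\u0308nchen";
    const std::u32string precomposed = U"M\u00FCnchen";
    assert(decomposed.find(U'\u00FC') == std::u32string::npos);
    assert(contains_canonical(decomposed, U"\u00FC"));
    assert(contains_canonical(precomposed, U"u\u0308"));
}
```

Note that this sketch still compares code points after normalizing, so a search for a bare "u" would match the base letter inside a decomposed ü. Avoiding that needs iteration over whole abstract characters, which is the point being made above.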
True... And this is a point where implementation would be less than trivial.
Yeah, that's how far I got before I decided that I didn't have the time to deal 
with the problem given my current schedule.
...
...
Again, taking this example, let's say that do_some_operation performs 
canonicalization to some Unicode canonical form; you can't do this by 
iterating over code points.
Nope. A code unit iterator would be needed for things like that.
I am pretty sure you mean abstract character here, not code unit. My 
understanding of the Unicode terminology is that the decomposed version of ü 
consists of

one abstract character (ü)
two encoded characters (u, ¨)
two UTF-32 code units (0x00000075 0x00000308)
two UTF-16 code units (0x0075 0x0308)
three UTF-8 code units (0x75 0xCC 0x88)

but perhaps I have it backwards...
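The code unit counts in the list above can be checked with a small sketch, assuming C++11 and valid Unicode scalar values as input. The helper names (to_utf8, to_utf16, count_units) are made up for illustration and are not part of any proposed Boost interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one code point as UTF-8 code units (assumes a valid scalar value).
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode one code point as UTF-16 code units (surrogate pair above U+FFFF).
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
    }
    return out;
}

struct UnitCounts { std::size_t utf32, utf16, utf8; };

// Count the code units needed for a code point sequence in each encoding form.
UnitCounts count_units(const std::u32string& s) {
    UnitCounts c = { s.size(), 0, 0 };
    for (char32_t cp : s) {
        c.utf16 += to_utf16(cp).size();
        c.utf8 += to_utf8(cp).size();
    }
    return c;
}

// Usage: decomposed u-with-diaeresis is U+0075 followed by U+0308.
inline void demo_counts() {
    const UnitCounts c = count_units(U"u\u0308");
    assert(c.utf32 == 2 && c.utf16 == 2 && c.utf8 == 3);
    assert(to_utf8(0x0308) == "\xCC\x88");
}
```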
...
The implementation described here would not pose too much of a problem, I was 
thinking more of the problems that arise when you take things like collation 
and locales into consideration. From what I understand there is a real issue 
in enabling proper Unicode support in the standard classes like locale, ctype 
and collate, as they assume things that do not necessarily apply to a Unicode 
representation of text. A failure to enable good support in those classes 
(at least locale and ctype), would also make the iostream support break, and 
things start to snowball. I could very well be wrong on this (Actually, I 
hope I am! :) ), as I haven't had the time to read up on all issues 
concerning this. But again, this is one of many problems I hope running this 
project will help reveal.
I don't know enough about locales to comment on this, unfortunately.

meeroh