Call for interest for native unicode character and string support in boost

Hi, I am considering creating a set of Unicode support classes for boost that would add native Unicode support, and would like to find out if there is any interest in either using it or helping to write it.

1. unistring

This would store UTF16 logically ordered data and allow Unicode actions and iterations. It would be largely similar to basic_string, except the iterators supported would be forward-only and would be arranged in the following class hierarchy:

  Data > UTF16 > Grapheme > { Word, Sentence, Line, Identifier }

- Data: WORDs of data. Data iterator: WORD.
- UTF16: UTF16 encoded words (i.e. includes surrogates, which are two WORDs). UTF16 iterator: const WORD.
- Grapheme: what appears to the user as a character, but may consist of several UTF16 encoded WORDs (e.g. e followed by acute). Grapheme iterator: unichar.
- Word: word breaks (n.b. this would not work in languages like Thai, where breaks cannot be computed from the characters). Word iterator: unistring (returns a portion of the string as a new and independent string).
- Line: line breaks (n.b. ditto Thai). Line iterator: unistring (returns a portion of the string as a new and independent string).
- Identifier: language parsing iterator. Identifier iterator: unistring (returns a portion of the string as a new and independent string).

All iterators would support basedata(), which would convert the iterator to a Data iterator, which can then have const cast off as required.

The unistring would support equality (==) but not equivalence (<). Equality would perform canonical decomposition on the fly and compare the decomposed data. Equivalence would be supported by a separate class called unistringsort.

The following methods would be supported: empty, size, decompose, tolower, toupper, assignUTF7, assignUTF8, assignUTF16, assignUTF32, insertUTF7, insertUTF8, insertUTF16, insertUTF32, find, find_if, replace, replace_if, and others.

2.
unistringsort

unistringsort would have two members: a const unistring and a const vector<WORD> of sort data (4 WORDs per Unicode character). Equality (==) would not be supported. Equivalence (<) would be supported, and would do a level 4 compare on the sort data. Sort level 4 is the level used for display sorting of strings; other levels allow ignoring case, accents, etc.

The following methods would be supported:
- equals(const unistringsort& other, sortlevel level), which would allow the other sort levels (1, 2, 3, and 4) to be tested.
- const unistring& string()
- const vector<WORD>* data()

Note: Unicode sorting is complex. It involves sort decomposition, canonical decomposition, and mathematical expansion of characters, using a predetermined table and rules, to 4 WORDs per character, which are then compared in priority order.

3. unichar

A unichar data type would also be created to allow Unicode tests on individual characters. These will be DWORDs for convenience, since Unicode characters are 21 bits. This would allow testing of: isstrongright, isright, isleft, isstrongleft, isleftjoining, isrightjoining, isbothjoining, isnumeric, etc.

The implementation would require a few megabytes of data files created from the current Unicode release, and my initial thoughts are that the following files would need to be parsed: allkeys.txt, CaseFolding.txt, GraphemeBreakTest.txt, LineBreak.txt, SentenceBreakTest.txt, SpecialCasing.txt, UnicodeData.txt, WordBreakTest.txt.

The data files created would be header files containing native C arrays of specially organised data to support all the above functionality. A program would be created to produce these files automatically from an up-to-date Unicode release. The ranges in Blocks.txt would need to be hard coded to support functions like ishangul etc.

Questions:

1. Is this worth doing? Are enough people interested to make it worthwhile?
2. Which would be the best implementation of basic_string to use?
3.
Should unistring support equality, given the overhead of decomposition, or should there be a decompunistring?
4. Do we need other classes, or to include functionality such as the ability to convert to display order?
5. What other methods are required on the classes?
6. Any comments?

Yours,

Graham Barnett
BEng, MCSD/MCAD .Net, MCSE/MCSA 2003, CompTIA Sec+

Hello Graham,

There was a student project aiming to produce a Unicode library, but I didn't hear anything of it after the thread in http://lists.boost.org/boost/2005/03/22580.php

There are loads of comments and ideas in that thread. Everyone wants a Unicode library, but no one seems to have enough time to write it well. I have again been playing with the idea of trying to write a library over the past few weeks. You seem to be quite well versed in Unicode. My (hopefully constructive) comments on your post:

First, are WORD and DWORD the Windows equivalents of uint16_t and uint32_t, respectively?

I think the C++ way would be to ultimately leave the choice of encoding to the user through a template parameter. This would, I guess, do away with the assign* and insert* methods for various encodings.

I think the normalisation form should be an invariant of the string as well (and a template parameter). This makes it possible to implement operator== and operator< as binary comparisons of codepoints, so that they will be relatively fast (more so for UTF-8 and UTF-32 than for UTF-16). People will surely want to use the string as a key for std::maps, for example. Other, more expensive collation methods (including localised ones) could be implemented by different classes.

As far as the iterators are concerned, I believe the standard Unicode string should contain grapheme clusters, and thus its iterator should have this beast as its value_type. I would call it "character" because, as far as the Unicode standard and combining characters are concerned, C++ programmers in general are "users", and grapheme clusters are what they think of as characters.

Hope this helps.
Rogier

Rogier van Dalen wrote:
There was a student project aiming to produce a Unicode library, but I didn't hear anything of it after the thread in http://lists.boost.org/boost/2005/03/22580.php
Aw shucks... Time flies. I was hoping to get back to you sooner on this (you've all been very helpful here, so you deserve more feedback than you have been getting), but after completing the project I got a job that takes up most of my time, and thus the Unicode library has had to play second fiddle for a while. Anyway, this is as good an occasion as any, so I might as well fill you in on the latest developments.

We finished our bachelor's project a couple of months ago, and we ended up with a fairly usable (although highly unpolished) implementation. The intention was to release that to you guys, but I wasn't completely happy with some things in that version, so I decided to rewrite large portions of it to make it "worthy" of your scrutiny. (I mean, how long could that take? Sigh...) This obviously took much longer than I expected (in fact, I'm still working on it), and that is more or less why you haven't heard from me until now.

The version I have now provides mutable code point strings (boost::ReversibleContainers), with both dynamic and locked encoding forms (more or less the same as the ones described in earlier threads here). It also supports normalization (all forms) of code point sequences through STL compatible algorithms, with a complete set of tests to verify the validity of these algorithms. Finally, the library provides a mutable "text-element" string class that can represent strings of grapheme clusters, words, (sentences,) or anything else you want it to, normalized to a specified normalization form, and in any encoding you want. Tests for checking that grapheme clusters and words are broken up correctly are also provided. (The text_element/text_element_string should be pretty close to what you wanted as a "unicode string class"; it certainly has the monster of a value_type thing covered. ;)) I can give you some more details on the design and implementation of this a little later; I don't have enough time to do that right now.
It does need to be revised though, as it's rather clumsy to use as it is. (This is what I was hoping to do during the summer, but didn't have the time to.)

As for creating the "Boost Unicode library", which really is the ultimate goal of all this mucking about, I am beginning to feel we should make more of a community effort out of it. There are clearly a lot of people experienced in Unicode here (with Graham Barnett now joining the club), but as you said, no one seems to have much time to spend on it. Therefore a more organized collaboration between all of us, on both design and code, could be a good idea to get this moving along a little faster. I'd be more than happy to donate the code I have developed so far to serve as a starting point/reference/example of failure for something like that. Any thoughts?

- Erik

In article <086E419469537C439E250E428F0DF0930162A1@host.Sysdev.local>, "Graham" <Graham@system-development.co.uk> wrote:
1. Is this worth doing? Are enough people interested to make it worthwhile?
Yes. There has already been a lot of discussion of boost Unicode support on this list. You should read the archives before you continue with your proposal, because many of your questions are already discussed in the archives. Look for postings by Erik Wien, as he was the one driving the effort at that time.
2. Which would be the best implementation of basic_string to use?
This has been discussed at length in the archived threads. It is at best questionable to use basic_string for Unicode, and in my opinion it is almost certainly wrong.
3. Should unistring support equality due to the overhead in decomposition or should there be a decompunistring?
The question of whether Unicode APIs should present a normalized view to the client has also been touched upon. My opinion, probably stated in the archives, is that requiring the user of the API to know about decompositions before being able to correctly compare Unicode strings is unacceptable.
6. Any comments?
I cannot stress this enough: please read the archives. I am sure that the enthusiasm for discussing various aspects of this design will be greatly diminished the second (or probably third, by now) time through. Ben -- I changed my name: <http://periodic-kingdom.org/People/NameChange.php>
participants (4)
- Ben Artin
- Erik Wien
- Graham
- Rogier van Dalen