
Hi,

I am considering creating a set of Unicode support classes for boost that would add native Unicode support, and I would like to find out whether there is any interest in either using it or helping to write it.

1. unistring

This would store logically ordered UTF16 data and allow Unicode operations and iteration. It would be largely similar to basic_string, except that the iterators supported would be forward only and would be arranged in the following class hierarchy:

    Data
    UTF16 - const iterator
    Grapheme
    Word
    Sentence
    Line
    Identifier

Data: WORDs of data
*Data Iterator: WORD

UTF16: UTF16 encoded words [i.e. includes surrogate pairs, which are two WORDs]
*UTF16 Iterator: const WORD

Grapheme: What appears to the user as a character but may consist of several UTF16 encoded WORDs [e.g. e followed by acute].
*Grapheme Iterator: unichar

Word: Word breaks [n.b. this would not work in languages like Thai, where word breaks cannot be computed from the characters].
*Word iterator: unistring [return portion of string as new and independent string]

Line: Line breaks [n.b. ditto Thai ...]
*Line iterator: unistring [return portion of string as new and independent string]

Identifier: Language parsing iterator
*Identifier iterator: unistring [return portion of string as new and independent string]

All iterators would support basedata(), which would convert the iterator to a Data iterator, which can then have const cast off as required.

The unistring would support equality (==) but not equivalence (<). Equality would perform canonical decomposition on the fly and compare the decomposed data. Equivalence would be supported by a separate class called unistringsort.

The following methods would be supported:

    empty
    size
    decompose
    tolower
    toupper
    assignUTF7
    assignUTF8
    assignUTF16
    assignUTF32
    insertUTF7
    insertUTF8
    insertUTF16
    insertUTF32
    find
    find_if
    replace
    replace_if
    others ...

2.
unistringsort

unistringsort would have two members: a const unistring and a const vector<WORD> of sort data [4 words per Unicode character].

Equality (==) would not be supported. Equivalence (<) would be supported and would do a level 4 compare on the sort data. Sort level 4 is the level used for display sorting of strings. Other levels allow ignoring case, accents etc.

The following methods would be supported:

    equals(const unistringsort & other, sortlevel level)
        would allow other sort levels (1, 2, 3, and 4) to be tested.
    const unistring& string()
    const vector<WORD>* data()

Note: Unicode sorting is complex and involves sort decomposition, canonical decomposition, and mathematical expansion of characters, using a predetermined table and rules, to 4 words per character, which are then parsed in priority order.

3. unichar

A unichar data type would also be created to allow Unicode tests on individual characters. It would be a DWORD for ease, since Unicode characters are 21 bits.

This would allow testing of:

    isstrongright
    isright
    isleft
    isstrongleft
    isleftjoining
    isrightjoining
    isbothjoining
    isnumeric
    etc...

The implementation would require a few megabytes of data files created from the current Unicode release. My initial thoughts are that the following files would need to be parsed:

    allkeys.txt
    CaseFolding.txt
    GraphemeBreakTest.txt
    LineBreak.txt
    SentenceBreakTest.txt
    SpecialCasing.txt
    UnicodeData.txt
    WordBreakTest.txt

The data files created would be header files containing native C arrays of specially organised data to support all the above functionality. A program would be created to produce these files automatically from an up to date Unicode release.

The ranges in Blocks.txt would need to be hard coded to support functions like ishangul etc.

Questions:

1. Is this worth doing, i.e. are enough people interested to make it worthwhile?
2. Which would be the best implementation of basic_string to use?
3.
Should unistring support equality, given the overhead of decomposition, or should there be a separate decompunistring?
4. Do we need other classes, or to include functionality such as the ability to convert to display order?
5. What other methods are required on the classes?
6. Any comments?

Yours,

Graham Barnett
BEng, MCSD/ MCAD .Net, MCSE/ MCSA 2003, CompTIA Sec+
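
P.S. To make the UTF16 iterator level a little more concrete, here is a minimal sketch of how a forward-only iterator might combine surrogate pairs into a single 21-bit value. Everything in it is a placeholder for illustration and not a proposed interface: the class name utf16_iterator, the typedefs for WORD and unichar, the helper decode_utf16, and the decision to pass unpaired surrogates through unchanged are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder typedefs (assumptions): WORD is a 16-bit code unit,
// unichar a 32-bit value wide enough for a 21-bit code point.
typedef std::uint16_t WORD;
typedef std::uint32_t unichar;

// Hypothetical forward-only iterator that reads UTF16 code units and
// yields whole code points, combining surrogate pairs on the fly.
class utf16_iterator {
public:
    utf16_iterator(const WORD* pos, const WORD* end) : pos_(pos), end_(end) {}

    // Decode the code point at the current position.
    unichar operator*() const {
        if (length() == 2) {
            return 0x10000u + ((unichar(pos_[0]) - 0xD800u) << 10)
                            + (unichar(pos_[1]) - 0xDC00u);
        }
        return pos_[0];  // BMP character, or an unpaired surrogate passed through
    }

    // Advance by one code point (one or two WORDs).
    utf16_iterator& operator++() {
        pos_ += length();
        return *this;
    }

    bool operator==(const utf16_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf16_iterator& o) const { return pos_ != o.pos_; }

    // Rough analogue of basedata(): expose the underlying Data position.
    const WORD* basedata() const { return pos_; }

private:
    static bool is_lead(WORD w)  { return w >= 0xD800 && w <= 0xDBFF; }
    static bool is_trail(WORD w) { return w >= 0xDC00 && w <= 0xDFFF; }

    // Number of WORDs occupied by the code point at the current position.
    std::size_t length() const {
        return (is_lead(pos_[0]) && pos_ + 1 != end_ && is_trail(pos_[1])) ? 2 : 1;
    }

    const WORD* pos_;
    const WORD* end_;
};

// Collect all code points in a buffer of UTF16 code units.
std::vector<unichar> decode_utf16(const std::vector<WORD>& data) {
    std::vector<unichar> out;
    const WORD* b = data.data();
    const WORD* e = b + data.size();
    for (utf16_iterator it(b, e), last(e, e); it != last; ++it)
        out.push_back(*it);
    return out;
}
```

For example, the WORD sequence 0x0065 0x0301 0xD83D 0xDE00 would yield three values: U+0065, U+0301 and U+1F600. Grouping the first two into what the user sees as one character would be the job of the Grapheme iterator sitting above this level.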