Call for interest for native unicode character and string support in boost

Hi, I am considering creating a set of Unicode support classes for boost that would add native Unicode support, and would like to find out if there is any interest in either using it or helping to write it.

1. unistring

This would store UTF16 logically ordered data and allow Unicode actions and iterations. It would be largely similar to basic_string, except the iterators supported would be forward-only and would be arranged in the following class hierarchy:

  Data > UTF16 > Grapheme > { Word, Sentence, Line, Identifier }

- Data: WORDs of data. Data iterator: WORD.
- UTF16: UTF16 encoded words (i.e. includes surrogates, which are two WORDs). UTF16 iterator: const WORD.
- Grapheme: what appears to the user as a character, but may consist of several UTF16 encoded WORDs (e.g. e followed by acute). Grapheme iterator: unichar.
- Word: word breaks (n.b. this would not work in languages like Thai, where breaks cannot be computed from the characters). Word iterator: unistring (returns a portion of the string as a new and independent string).
- Line: line breaks (n.b. ditto Thai). Line iterator: unistring (returns a portion of the string as a new and independent string).
- Identifier: language parsing iterator. Identifier iterator: unistring (returns a portion of the string as a new and independent string).

All iterators would support basedata(), which would convert the iterator to a Data iterator, which can then have const cast off as required.

The unistring would support equality (==) but not equivalence (<). Equality would perform canonical decomposition on the fly and compare the decomposed data. Equivalence would be supported by a separate class called unistringsort.

The following methods would be supported: empty, size, decompose, tolower, toupper, assignUTF7, assignUTF8, assignUTF16, assignUTF32, insertUTF7, insertUTF8, insertUTF16, insertUTF32, find, find_if, replace, replace_if, and others.

2.
unistringsort

unistringsort would have two members: a const unistring and a const vector<WORD> of sort data (4 WORDs per Unicode character). Equality (==) would not be supported. Equivalence (<) would be supported, and would do a level 4 compare on the sort data. Sort level 4 is the level used for display sorting of strings; other levels allow ignoring case, accents, etc.

The following methods would be supported:
- equals(const unistringsort& other, sortlevel level), which would allow the other sort levels (1, 2, 3, and 4) to be tested.
- const unistring& string()
- const vector<WORD>* data()

Note: Unicode sorting is complex. It involves sort decomposition, canonical decomposition, and mathematical expansion of characters, using a predetermined table and rules, to 4 WORDs per character, which are then compared in priority order.

3. unichar

A unichar data type would also be created to allow Unicode tests on individual characters. These will be DWORDs for convenience, since Unicode characters are 21 bits. This would allow testing of: isstrongright, isright, isleft, isstrongleft, isleftjoining, isrightjoining, isbothjoining, isnumeric, etc.

The implementation would require a few megabytes of data files created from the current Unicode release, and my initial thoughts are that the following files would need to be parsed: allkeys.txt, CaseFolding.txt, GraphemeBreakTest.txt, LineBreak.txt, SentenceBreakTest.txt, SpecialCasing.txt, UnicodeData.txt, WordBreakTest.txt.

The data files created would be header files containing native C arrays of specially organised data to support all the above functionality. A program would be created to produce these files automatically from an up-to-date Unicode release. The ranges in Blocks.txt would need to be hard coded to support functions like ishangul etc.

Questions:

1. Is this worth doing? Are enough people interested to make it worthwhile?
2. Which would be the best implementation of basic_string to use?
3.
Should unistring support equality, given the overhead of decomposition, or should there be a decompunistring?
4. Do we need other classes, or to include functionality such as the ability to convert to display order?
5. What other methods are required on the classes?
6. Any comments?

Yours,

Graham Barnett
BEng, MCSD/MCAD .Net, MCSE/MCSA 2003, CompTIA Sec+

Hello Graham,

There was a student project aiming to produce a Unicode library, but I didn't hear anything of it after the thread in http://lists.boost.org/boost/2005/03/22580.php

There are loads of comments and ideas in that thread. Everyone wants a Unicode library, but no one seems to have enough time to write it well. I have again been playing with the idea of trying to write a library over the past few weeks. You seem to be quite well versed in Unicode. My (hopefully constructive) comments on your post:

First, are WORD and DWORD the Windows equivalents of uint16_t and uint32_t, respectively?

I think the C++ way would be to ultimately leave the choice of encoding to the user through a template parameter. This would, I guess, do away with the assign* and insert* methods for various encodings.

I think the normalisation form should be an invariant of the string as well (and a template parameter). This makes it possible to implement operator== and operator< as binary comparisons of codepoints, so that they will be relatively fast (more so for UTF-8 and UTF-32 than for UTF-16). People will surely want to use the string as a key for std::maps, for example. Other, more expensive collation methods (including localised ones) could be implemented by different classes.

As far as the iterators are concerned, I believe the standard Unicode string should contain grapheme clusters, and thus its iterator should have this beast as its value_type. I would call it "character" because, as far as the Unicode standard and combining characters are concerned, C++ programmers in general are "users", and grapheme clusters are what they think of as characters.

Hope this helps.
Rogier

Rogier van Dalen wrote:
There was a student project aiming to produce a Unicode library, but I didn't hear anything of it after the thread in http://lists.boost.org/boost/2005/03/22580.php
Aw shucks... Time flies. I was hoping to get back to you sooner on this (you've all been very helpful here, so you deserve more feedback than you have been getting), but after completing the project I got a job that takes up most of my time, and thus the Unicode library has had to play second fiddle for a while. Anyway, this is as good an occasion as any, so I might as well fill you in on the latest developments.

We finished our bachelor's project a couple of months ago, and we ended up with a fairly usable (although highly unpolished) implementation. The intention was to release that to you guys, but I wasn't completely happy with some things in that version, so I decided to rewrite large portions of it to make it "worthy" of your scrutiny. (I mean, how long could that take? Sigh...) This obviously took much longer than I expected (in fact, I'm still working on it), and that is more or less why you haven't heard from me until now.

The version I have now provides mutable code point strings (boost::ReversibleContainers), with both dynamic and locked encoding forms (more or less the same as the ones described in earlier threads here). It also supports normalization (all forms) of code point sequences through STL compatible algorithms, with a complete set of tests to verify the validity of these algorithms. Finally, the library provides a mutable "text-element" string class that can represent strings of grapheme clusters, words, (sentences,) or anything else you want it to, normalized to a specified normalization form, and in any encoding you want. Tests for checking that grapheme clusters and words are broken up correctly are also provided. (The text_element/text_element_string should be pretty close to what you wanted as a "unicode string class"; it certainly has the monster of a value_type thing covered. ;)) I can give you some more details on the design and implementation of this a little later; I don't have enough time to do that right now.
It does need to be revised though, as it's rather clumsy to use as it is. (This is what I was hoping to do during the summer, but didn't have the time to.)

As for creating the "Boost Unicode library", which really is the ultimate goal of all this mucking about, I am beginning to feel we should make more of a community effort out of it. There are clearly a lot of people experienced in Unicode here (with Graham Barnett now joining the club), but as you said, no one seems to have much time to spend on it. Therefore a more organized collaboration between all of us, on both design and code, could be a good idea to get this moving along a little faster. I'd be more than happy to donate the code I have developed so far to serve as a starting point/reference/example of failure for something like that. Any thoughts?

- Erik

In article <086E419469537C439E250E428F0DF0930162A1@host.Sysdev.local>, "Graham" <Graham@system-development.co.uk> wrote:
1. Is this worth doing? Are enough people interested to make it worthwhile?
Yes. There has already been a lot of discussion of boost Unicode support on this list. You should read the archives before you continue with your proposal, because many of your questions are already discussed in the archives. Look for postings by Erik Wien, as he was the one driving the effort at that time.
2. Which would be the best implementation of basic_string to use?
This has been discussed at length in the archived threads. It is at best questionable to use basic_string for Unicode, and in my opinion it is almost certainly wrong.
3. Should unistring support equality due to the overhead in decomposition or should there be a decompunistring?
The question of whether Unicode APIs should present a normalized view to the client has also been touched upon. My opinion, probably stated in the archives, is that requiring the user of the API to know about decompositions before being able to correctly compare Unicode strings is unacceptable.
6. Any comments?
I cannot stress this enough: please read the archives. I am sure that the enthusiasm for discussing various aspects of this design will be greatly diminished the second (or probably third, by now) time through. Ben -- I changed my name: <http://periodic-kingdom.org/People/NameChange.php>
participants (4)
- Ben Artin
- Erik Wien
- Graham
- Rogier van Dalen