
Hi Erik!

I'm glad to see you've made a lot of progress during these months of silence. I've got a few comments for now.

Of course there isn't much documentation yet, but now that the library is out in the open, writing a Unicode primer would be a good thing to do. Issues that I don't think many programmers are aware of include (off the top of my head): what code points are (21 bits), what Unicode characters are, why you need combining characters, and why UTF-32 is usually not optimal. The library will need these docs at some point anyway. I'd gladly help out with this, though I'm not sure it would fit your university's requirements.

Some speculation on the Unicode database: do you really need the character names? Maybe you should use multi_index, probably with hashing (sketch 1 at the end of this mail). Maybe you could use Boost.Serialization for loading the file.

I think that in general you will need to separate input/output from the rest of the Unicode processing. For example, endianness only matters when portably reading or writing files; IMO strings in memory should use the platform's endianness. (I second Thorsten's proposal of having utf8_string, utf16_string, utf32_string, utf_string.) For reading code points from files, a codecvt could be used (sketch 2 below). This can be fast because its virtual functions are called once per block of bytes rather than once per character. I think there's an implementation floating around in the Yahoo files section that can automatically figure out a file's encoding and convert to and from any endianness.

I also think you should separate code points and Unicode characters (sketch 3 below illustrates the difference). In normal situations the user should not have to deal with code points, and the discussion should not focus on them for now; they're an implementation detail. I strongly object to your

    typedef encoded_string<unicode_tag> unicode_string;

because I think a Unicode string should contain characters. For example, a regular expression on Unicode strings should support level 2 (see <http://www.unicode.org/reports/tr18/>). Why go for anything less?

Hoping this will be useful,
Rogier
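
P.S. A few rough sketches to make the above more concrete.

Sketch 1 - the database as a Boost.MultiIndex container. This is only meant to show the shape of the thing; the struct, its field names and the second index are placeholders of mine, not anything from your library:

    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/hashed_index.hpp>
    #include <boost/multi_index/member.hpp>
    #include <boost/cstdint.hpp>
    #include <string>

    struct code_point_properties
    {
        boost::uint32_t code_point;        // 21 bits needed, 32 stored
        std::string     name;              // droppable, if you don't need names
        int             general_category;  // ...plus whatever else of the UCD you keep
    };

    namespace mi = boost::multi_index;

    typedef mi::multi_index_container<
        code_point_properties,
        mi::indexed_by<
            // O(1) lookup by code point, which is what most algorithms need
            mi::hashed_unique<
                mi::member<code_point_properties, boost::uint32_t,
                           &code_point_properties::code_point> >,
            // lookup by name, only if the names are kept at all
            mi::hashed_non_unique<
                mi::member<code_point_properties, std::string,
                           &code_point_properties::name> >
        >
    > unicode_database;

Lookup by code point is then just db.find(0x00E9), since the container exposes the interface of its first index directly. As far as I know Boost.MultiIndex containers can be serialised directly, so loading the database could become a single Boost.Serialization archive read instead of reparsing the UCD text files at start-up.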
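
Sketch 2 - reading code points through a codecvt imbued into a file stream. This only shows the mechanism, assuming a UTF-8 file and using a ready-made facet (std::codecvt_utf8 here, but any codecvt<wchar_t, char, mbstate_t> implementation would do); the facet from the Yahoo files section would presumably detect the encoding and endianness itself, and "example.txt" is of course made up:

    #include <fstream>
    #include <locale>
    #include <codecvt>   // std::codecvt_utf8

    int main()
    {
        std::wifstream in("example.txt");

        // Plug the conversion facet into the stream's locale.  The file
        // buffer calls the facet once per block of bytes it reads, not
        // once per character, which is why this can be made fast.
        in.imbue(std::locale(in.getloc(), new std::codecvt_utf8<wchar_t>));

        wchar_t cp;
        while (in.get(cp))
        {
            // ... hand one code point at a time to the rest of the library ...
            // (wchar_t is only 16 bits on some platforms, so a real
            // implementation would want a 32-bit code point type here)
        }
    }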
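
Sketch 3 - why code points and characters are not the same thing. Just illustrative data, no library code involved:

    #include <string>
    #include <cassert>

    int main()
    {
        // Two spellings of the same user-perceived character "é":
        std::u32string composed   = U"\u00E9";    // LATIN SMALL LETTER E WITH ACUTE
        std::u32string decomposed = U"e\u0301";   // 'e' + COMBINING ACUTE ACCENT

        assert(composed.size()   == 1);  // one code point
        assert(decomposed.size() == 2);  // two code points...
        // ...but both are a single character (grapheme), which is the unit
        // a unicode_string (and a level 2 regular expression) should
        // present to the user.
    }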