Re: [boost] [unicode] Interest Check / Proof of Concept

19 Nov 2008

      James Porter wrote:
...
Over the past few months, I've been tinkering with a Unicode string 
library. It's still *far* from finished, but it's far enough along that 
the overall structure is visible. I've seen a bunch of Unicode proposals 
for Boost come and go, so hopefully this one will address the most 
common needs people have.
Hi Jim,

Mine was probably one of those proposals that you looked at; for the 
record the code is all available at

   http://svn.chezphil.org/libpbe/trunk/include/charset/

and nearby directories.  I was reasonably happy with my implementations 
of the most common character sets (i.e. Unicode, ASCII, ISO 8859), but I 
wanted to explore some of the more esoteric ones to understand the 
implications that they would have on how a general-purpose framework 
should work.  For example, I wanted to explore how error handling 
policies could be specified and what conditions they would need to 
handle.  The last work that I did with this code was a general-purpose 
command-line conversion utility that could be used to benchmark the 
conversions.  Input and output character sets and error policies could 
be set from the command line, but the problem that I hit was that 
making these things template parameters led to a code-size and 
compilation-time explosion, since every combination that can be 
selected at run time has to be instantiated.  That means that I'll need 
to rethink a few 
things, but it has been low on my to-do list.
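
To make the trade-off concrete, here is a minimal sketch of an error policy passed as a template parameter. It is not the libpbe code: the names throw_on_error, replace_on_error and ascii_to_utf32 are hypothetical, and only ASCII input is handled.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical error policies: each decides what a non-ASCII input byte becomes.
struct throw_on_error {
    static char32_t invalid(unsigned char) {
        throw std::runtime_error("byte is not valid ASCII");
    }
};
struct replace_on_error {
    static char32_t invalid(unsigned char) {
        return 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER
    }
};

// The policy is a template parameter, so each policy gets its own copy of
// the conversion loop.
template <typename ErrorPolicy>
std::u32string ascii_to_utf32(const std::string& in) {
    std::u32string out;
    for (char c : in) {
        const unsigned char byte = static_cast<unsigned char>(c);
        out.push_back(byte < 0x80 ? static_cast<char32_t>(byte)
                                  : ErrorPolicy::invalid(byte));
    }
    return out;
}

// Usage: the policy is chosen in the type, not at run time.
inline void error_policy_example() {
    const std::u32string s = ascii_to_utf32<replace_on_error>("a\x80");
    assert(s.size() == 2 && s[0] == U'a' && s[1] == 0xFFFD);
}
```

With S source sets, D destination sets and P policies all selectable at run time, a tool built this way needs up to S x D x P instantiations of the loop, which is where the growth in code size and compile time comes from.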
...
The library is based on two (immutable) string types: ct_string and 
rt_string. ct_strings are _C_ompile _T_ime tagged with a particular 
encoding, and rt_strings are _R_un _T_ime tagged with an encoding.
Mutable vs. immutable strings is something that has been briefly 
discussed before.  My personal preference has been for mutable strings, 
but without the O(1) random access guarantee of a std::string.  I also 
considered strings where the only mutation allowed is appending, i.e. 
there's a back_insert_iterator.  Why do you prefer immutable strings?

One argument for mutable strings is simply that std::string is mutable, 
and that a proposal is more likely to prove popular if it changes less 
w.r.t. existing practice.

I also have run-time and compile-time tagging.  My feeling now is that 
compile-time tagging is the more important case.  Data whose encoding 
is known only at run-time can be handled using a more ad-hoc method if 
necessary.  I also struggled to find good names for these things; I 
don't find ct_string and rt_string great.  Do any readers have suggestions?
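
The difference between the two kinds of tagging could be sketched roughly as follows. This reflects neither library's actual design: tagged_string, runtime_string, encoding_id and the tag types are hypothetical names, and storing the encoding inside the run-time string is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical tag types; a real library would attach conversion code to them.
struct utf8_tag {};
struct latin1_tag {};

// Compile-time tagging: the encoding is part of the type.
template <typename Encoding>
class tagged_string {
public:
    explicit tagged_string(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
private:
    std::string bytes_;
};

// Run-time tagging: the encoding is a value carried next to the bytes
// (storing it in the string is an assumption of this sketch).
enum class encoding_id { utf8, latin1 };

class runtime_string {
public:
    runtime_string(std::string bytes, encoding_id enc)
        : bytes_(std::move(bytes)), encoding_(enc) {}
    const std::string& bytes() const { return bytes_; }
    encoding_id encoding() const { return encoding_; }
private:
    std::string bytes_;
    encoding_id encoding_;
};

// Differently tagged strings are unrelated types, so mixing them up is
// caught by the compiler rather than at run time.
static_assert(!std::is_same<tagged_string<utf8_tag>,
                            tagged_string<latin1_tag>>::value,
              "encodings are distinguished at compile time");

inline void tagging_example() {
    tagged_string<utf8_tag> a("abc");
    runtime_string b("abc", encoding_id::latin1);  // e.g. read from a file header
    assert(a.bytes() == b.bytes());
    assert(b.encoding() == encoding_id::latin1);
}
```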
...
This is to allow for faster conversion when the encoding is known at 
compile-time, but to allow for conversion at run-time (useful for 
reading XML!).
General usage would look something like this:
ct_string<ct::utf8> foo("Hello, world!");
typedef ct_string<ct::utf8> utf8string;
...
ct_string<ct::utf16> bar;
  bar.encode(foo);
Well, it's actually decoding the UTF-8 and encoding the UTF-16.  Maybe 
"transcode", and preferably as a free function:

transcode(bar,foo);

equivalent to:

std::copy(foo.begin(),foo.end(),std::back_inserter(bar));
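
A free-function transcode along these lines might look like the sketch below. It works on std::string and std::u16string rather than on either library's string types, utf8_to_utf16 is a hypothetical helper name, and validation is limited to rejecting truncated sequences, so otherwise well-formed input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>

// Decodes UTF-8 from src and writes UTF-16 code units to out.
template <typename OutputIt>
OutputIt utf8_to_utf16(const std::string& src, OutputIt out) {
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char lead = static_cast<unsigned char>(src[i]);
        // Continuation-byte count, read off the lead byte's high bits.
        const std::size_t extra =
            lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        if (src.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(src[i + k]) & 0x3F);
        i += extra + 1;
        if (cp < 0x10000) {
            *out++ = static_cast<char16_t>(cp);
        } else {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            *out++ = static_cast<char16_t>(0xD800 + (cp >> 10));
            *out++ = static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}

// Destination first, matching the transcode(bar,foo) call above.
inline void transcode(std::u16string& dst, const std::string& src) {
    utf8_to_utf16(src, std::back_inserter(dst));
}

inline void transcode_example() {
    const std::string foo = "A\xC3\xA9\xF0\x9F\x98\x80";  // "A", U+00E9, U+1F600
    std::u16string bar;
    transcode(bar, foo);
    assert(bar == u"A\u00E9\U0001F600");
}
```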
...
rt_string baz;
  baz.encode(bar,rt::utf8);
So the encoding of the rt_string is not stored in the string?
...
Note the use of ct::utf8 and rt::utf8. As you might expect from the 
syntax, ct::utf8 is a type, and rt::utf8 is an object. Broadly speaking, 
to create an encoding, you create a class with read and write methods, 
and then you create an instance of an rt_encoding<MyEncoding>. Most of 
this is laid out in the comments of my code, so I won't go into too much 
detail here.
I'll try to find time to have a look, but I do encourage you to post 
more details to the list.  That tends to generate more discussion than 
"please look at the code" proposals do.
...
There's still a lot missing from the code (most notably, 
dynamically-sized strings and string concatenation),
So what is your underlying implementation?  Not std::string?
...
but here's a 
rundown of what *is* present:
* Compile-time and run-time tagged strings
* Re-encoding of strings based on compile-/run-time tags
* Uses simple memory copying when source and dest encodings are the same
* Forward iterators to step through code points in strings
If you'd like to take a look at the code, it's available here: 
http://www.teamboxel.com/misc/unicode.tar.gz . I've tested it in gcc 
4.3.2 and MSVC8, but most modern compilers should be able to handle it. 
Comments and criticisms are, of course, welcome.
One of my priorities has been performance; it would be good to compare 
e.g. utf8-to/from-utf16 conversion speed.

My feeling about the way forward is as follows:

- A complete character set library is a lot of work.

- A library that only understands Unicode is less work, but is it what 
people need?

- Is there a consensus about mutable vs. immutable strings?  Perhaps we 
should start by defining a new string concept, removing the 
character-set-unfriendly aspects of std::string like indexing using 
integers, and see what people think of it.  I have been trying to use 
only the standard algorithms and iterators with strings in new code, but it can 
often be simpler to use indexes and the std::string members that use or 
return them.

- It would be useful to factor out the actual Unicode bit-bashing 
operations.  I have implementations of them that I have carefully 
tuned, and they are ready for wider use even though the rest of my code isn't.
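
As an illustration of the kind of operation that could be factored out, here is a plain, untuned UTF-8 encoder written against an output iterator. It is not one of the tuned implementations mentioned above; encode_utf8 is a hypothetical name, and the input is assumed to be a valid Unicode scalar value.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Writes the UTF-8 encoding of cp to out.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate; nothing is checked.
template <typename OutputIt>
OutputIt encode_utf8(char32_t cp, OutputIt out) {
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Usage: U+20AC EURO SIGN encodes to the three bytes E2 82 AC.
inline void encode_utf8_example() {
    std::string s;
    encode_utf8(0x20AC, std::back_inserter(s));
    assert(s == "\xE2\x82\xAC");
}
```

Because it depends only on an output iterator, a function like this can be reused by any string type, mutable or append-only, without committing to the rest of a character set framework.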

Regards,  Phil.

Phil Endecott