
James Porter wrote:
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.
Hi Jim,
Mine was probably one of those proposals that you looked at; for the record, the code is all available at http://svn.chezphil.org/libpbe/trunk/include/charset/ and nearby directories. I was reasonably happy with my implementations of the most common character sets (i.e. Unicode, ASCII, ISO 8859), but I wanted to explore some of the more esoteric ones to understand the implications that they would have for how a general-purpose framework should work. For example, I wanted to explore how error handling policies could be specified and what conditions they would need to handle.
The last work that I did with this code was a general-purpose command-line conversion utility that could be used to benchmark the conversions. Input and output character sets and error policies could be set from the command line, but the problem that I hit was that making these things template parameters led to a code-size and compilation-time explosion. That means that I'll need to rethink a few things, but it has been low on my to-do list.
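To make the point about error policies as template parameters concrete, here is a minimal sketch. It is not taken from the code at the URL above: the names throw_on_error, replace_on_error and decode_ascii are hypothetical, and plain ASCII is used only because its single error condition (a byte with the high bit set) is easy to show.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical policy: abort the conversion on the first invalid byte.
struct throw_on_error {
    static char32_t invalid_byte(unsigned char) {
        throw std::runtime_error("invalid byte in input");
    }
};

// Hypothetical policy: substitute U+FFFD REPLACEMENT CHARACTER and carry on.
struct replace_on_error {
    static char32_t invalid_byte(unsigned char) { return 0xFFFD; }
};

// Decode ASCII into code points. Bytes >= 0x80 are not valid ASCII, so they
// are handed to the policy chosen at compile time.
template <typename ErrorPolicy>
std::u32string decode_ascii(const std::string& in) {
    std::u32string out;
    for (char ch : in) {
        unsigned char b = static_cast<unsigned char>(ch);
        out.push_back(b < 0x80 ? static_cast<char32_t>(b)
                               : ErrorPolicy::invalid_byte(b));
    }
    return out;
}
```

Each distinct policy instantiates its own copy of decode_ascii. Once the input encoding, the output encoding and the policy are all template parameters, the number of instantiations is the product of the three, which is one plausible reading of the code-size and compilation-time explosion described above.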
The library is based on two (immutable) string types: ct_string and rt_string. ct_strings are _C_ompile _T_ime tagged with a particular encoding, and rt_strings are _R_un _T_ime tagged with an encoding.
Mutable vs. immutable strings is something that has been briefly discussed before. My personal preference has been for mutable strings, but without the O(1) random access guarantee of a std::string. I also considered strings where the only mutation allowed is appending, i.e. there's a back_insert_iterator. Why do you prefer immutable strings? One argument for mutable strings is simply that std::string is mutable, and that a proposal is more likely to prove popular if it changes less w.r.t. existing practice.
I also have run-time and compile-time tagging. My feeling now is that compile-time tagging is the more important case. Data whose encoding is known only at run-time can be handled using a more ad-hoc method if necessary.
I also struggled to find good names for these things; I don't find ct_string and rt_string great. Do any readers have suggestions?
This is to allow for faster conversion when the encoding is known at compile-time, but to allow for conversion at run-time (useful for reading XML!).
General usage would look something like this:
ct_string<ct::utf8> foo("Hello, world!");
typedef ct_string<ct::utf8> utf8string;
ct_string<ct::utf16> bar;
bar.encode(foo);
Well, it's actually decoding the utf8 and encoding the utf16. Maybe "transcode", and preferably as a free function:
transcode(bar,foo);
equivalent to:
std::copy(foo.begin(),foo.end(),std::back_inserter(bar));
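A free-function transcode along these lines could be sketched as follows. This is an illustration only, not code from either library: std::string and std::u16string stand in for the tagged string types, the helper names decode_utf8 and encode_utf16 are hypothetical, and the input is assumed to be well-formed UTF-8 (no validation or error policy is applied).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode one code point from UTF-8 starting at s[i], advancing i past it.
// Assumption: the input is well-formed; nothing is validated here.
inline char32_t decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;                       // single-byte (ASCII)
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);             // payload bits of lead byte
    while (extra-- > 0) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Append one code point to a UTF-16 string, using a surrogate pair for
// code points outside the Basic Multilingual Plane.
inline void encode_utf16(char32_t cp, std::u16string& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Destination first, source second, matching transcode(bar,foo) above.
inline void transcode(std::u16string& dst, const std::string& src) {
    for (std::size_t i = 0; i < src.size();) {
        encode_utf16(decode_utf8(src, i), dst);
    }
}
```

Keeping the decode and encode steps as separate functions also means they can be reused by an iterator-based interface, so that the std::copy formulation and the free function share the same low-level code.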
rt_string baz; baz.encode(bar,rt::utf8);
So the encoding of the rt_string is not stored in the string?
Note the use of ct::utf8 and rt::utf8. As you might expect from the syntax, ct::utf8 is a type, and rt::utf8 is an object. Broadly speaking, to create an encoding, you create a class with read and write methods, and then you create an instance of an rt_encoding<MyEncoding>. Most of this is laid out in the comments of my code, so I won't go into too much detail here.
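As a rough illustration of the arrangement described here, the sketch below pairs a compile-time encoding type with a run-time object wrapping it. The signatures of read and write, the rt_encoding_base interface and the Latin-1 example are all assumptions made for the sake of the example; the real library's interfaces are laid out in its own comments and may differ considerably.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace ct {
// Hypothetical compile-time encoding: a type with static read and write.
struct latin1 {
    // Read one code point at position i and advance i.
    static char32_t read(const std::string& s, std::size_t& i) {
        return static_cast<unsigned char>(s[i++]);
    }
    // Append one code point; anything above U+00FF becomes '?' in this sketch.
    static void write(char32_t cp, std::string& out) {
        out.push_back(cp <= 0xFF ? static_cast<char>(cp) : '?');
    }
};
}  // namespace ct

// Assumed run-time interface: the same two operations behind virtual calls.
struct rt_encoding_base {
    virtual ~rt_encoding_base() {}
    virtual char32_t read(const std::string& s, std::size_t& i) const = 0;
    virtual void write(char32_t cp, std::string& out) const = 0;
};

// Adapter that turns any compile-time encoding into a run-time object.
template <typename Encoding>
struct rt_encoding : rt_encoding_base {
    char32_t read(const std::string& s, std::size_t& i) const override {
        return Encoding::read(s, i);
    }
    void write(char32_t cp, std::string& out) const override {
        Encoding::write(cp, out);
    }
};

namespace rt {
// ct::latin1 is a type; rt::latin1 is an object that can be passed around.
const rt_encoding<ct::latin1> latin1{};
}  // namespace rt
```

With this shape, code that knows the encoding statically calls ct::latin1::read directly and can be inlined, while code that learns the encoding at run-time holds a reference to rt_encoding_base and pays for a virtual call per operation.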
I'll try to find time to have a look, but I do encourage you to post more details to the list. That tends to generate more discussion than "please look at the code" proposals do.
There's still a lot missing from the code (most notably, dynamically-sized strings and string concatenation),
So what is your underlying implementation? Not std::string?
but here's a rundown of what *is* present:
* Compile-time and run-time tagged strings
* Re-encoding of strings based on compile-/run-time tags
* Uses simple memory copying when source and dest encodings are the same
* Forward iterators to step through code points in strings
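One way the "simple memory copying when source and dest encodings are the same" item could work with compile-time tags is through overload selection, sketched below. The names tagged_string and reencode and the empty tag types are hypothetical and are not taken from the posted code; Latin-1 to UTF-8 is used as the converting case only because it is short.

```cpp
#include <cassert>
#include <string>

namespace ct { struct utf8 {}; struct latin1 {}; }  // hypothetical tag types

// Hypothetical compile-time tagged string: raw bytes plus a type-level tag.
template <typename Enc>
struct tagged_string { std::string bytes; };

// Same tag on both sides: the bytes can be copied without any decoding.
template <typename Enc>
void reencode(tagged_string<Enc>& dst, const tagged_string<Enc>& src) {
    dst.bytes = src.bytes;
}

// Different tags: a real conversion. Every Latin-1 byte is the code point of
// the same value, so bytes >= 0x80 become two-byte UTF-8 sequences.
inline void reencode(tagged_string<ct::utf8>& dst,
                     const tagged_string<ct::latin1>& src) {
    dst.bytes.clear();
    for (char ch : src.bytes) {
        unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            dst.bytes.push_back(ch);
        } else {
            dst.bytes.push_back(static_cast<char>(0xC0 | (b >> 6)));
            dst.bytes.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
}
```

The choice between the two overloads is made entirely at compile time. A run-time tagged string would need an explicit comparison of the two encoding objects to reach the same fast path.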
If you'd like to take a look at the code, it's available here: http://www.teamboxel.com/misc/unicode.tar.gz . I've tested it in gcc 4.3.2 and MSVC8, but most modern compilers should be able to handle it. Comments and criticisms are, of course, welcome.
One of my priorities has been performance; it would be good to compare e.g. utf8-to/from-utf16 conversion speed.
My feeling about the way forward is as follows:
- A complete character set library is a lot of work.
- A library that only understands Unicode is less work, but is it what people need?
- Is there a consensus about mutable vs. immutable strings? Perhaps we should start by defining a new string concept, removing the character-set-unfriendly aspects of std::string like indexing using integers, and see what people think of it. I have been trying to use only std:: algorithms and iterators with strings in new code, but it can often be simpler to use indexes and the std::string members that use or return them.
- It would be useful to factor out the actual Unicode bit-bashing operations. I have implementations of them that I have carefully tuned, and they are ready for wider use even though the rest of my code isn't.
Regards,
Phil.