[boost] [unicode] Interest Check / Proof of Concept

19 Nov 2008

Over the past few months, I've been tinkering with a Unicode string 
library. It's still *far* from finished, but it's far enough along that 
the overall structure is visible. I've seen a bunch of Unicode proposals 
for Boost come and go, so hopefully this one will address the most 
common needs people have.

The library is based on two (immutable) string types: ct_string and 
rt_string. A ct_string is tagged with its encoding at _C_ompile _T_ime, 
and an rt_string is tagged with its encoding at _R_un _T_ime. This 
allows faster conversion when the encoding is known at compile time, 
while still supporting encodings that are only known at run time 
(useful for reading XML!).

General usage would look something like this:

	ct_string<ct::utf8> foo("Hello, world!");

	ct_string<ct::utf16> bar;
	bar.encode(foo);

	rt_string baz;
	baz.encode(bar,rt::utf8);

Note the use of ct::utf8 and rt::utf8. As you might expect from the 
syntax, ct::utf8 is a type, and rt::utf8 is an object. Broadly speaking, 
to create an encoding, you write a class with read and write methods, 
and then create an instance of rt_encoding<MyEncoding>. Most of this is 
laid out in the comments of my code, so I won't go into too much detail 
here.
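
As a rough illustration of the read/write idea (a sketch only, not the
code in the tarball), an encoding class and its run-time wrapper could
look something like the following. The signatures of read and write,
the rt_encoding_base interface, ct::latin1, and the transcode helper
are all assumptions made up for this example, and the UTF-8 decoder
assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

typedef std::uint32_t code_point;

namespace ct {
// Hypothetical encoding classes: each only knows how to read and write
// a single code point. Signatures are assumed, not taken from the library.
struct latin1 {
    static code_point read(const std::string& in, std::size_t& pos) {
        return static_cast<unsigned char>(in[pos++]);
    }
    static void write(code_point cp, std::string& out) {
        // Code points outside Latin-1 are replaced with '?'.
        out.push_back(cp <= 0xFF ? static_cast<char>(cp) : '?');
    }
};

struct utf8 {
    // Assumes well-formed input; a real decoder would validate each byte.
    static code_point read(const std::string& in, std::size_t& pos) {
        unsigned char lead = static_cast<unsigned char>(in[pos++]);
        if (lead < 0x80) return lead;
        int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        code_point cp = lead & (0x3F >> extra);
        while (extra-- > 0 && pos < in.size())
            cp = (cp << 6) | (static_cast<unsigned char>(in[pos++]) & 0x3F);
        return cp;
    }
    static void write(code_point cp, std::string& out) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
};
}  // namespace ct

// Assumed run-time interface: virtual calls stand in for the static ones.
struct rt_encoding_base {
    virtual ~rt_encoding_base() {}
    virtual code_point read(const std::string& in, std::size_t& pos) const = 0;
    virtual void write(code_point cp, std::string& out) const = 0;
};

// Wraps a compile-time encoding class so it can be passed around as an object.
template <class Enc>
struct rt_encoding : rt_encoding_base {
    rt_encoding() {}
    code_point read(const std::string& in, std::size_t& pos) const {
        return Enc::read(in, pos);
    }
    void write(code_point cp, std::string& out) const { Enc::write(cp, out); }
};

namespace rt {
const rt_encoding<ct::latin1> latin1;  // objects, mirroring the ct:: types
const rt_encoding<ct::utf8> utf8;
}  // namespace rt

// Hypothetical helper: re-encode a byte string, choosing encodings at run time.
std::string transcode(const std::string& in, const rt_encoding_base& from,
                      const rt_encoding_base& to) {
    std::string out;
    for (std::size_t pos = 0; pos < in.size();)
        to.write(from.read(in, pos), out);
    return out;
}
```

With these definitions, transcode(text, rt::latin1, rt::utf8) picks both
encodings at run time, at the cost of two virtual calls per code point.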

There's still a lot missing from the code (most notably, 
dynamically-sized strings and string concatenation), but here's a 
rundown of what *is* present:

* Compile-time and run-time tagged strings
* Re-encoding of strings based on compile-/run-time tags
* Simple memory copying when the source and destination encodings are the same
* Forward iterators to step through code points in strings
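
To make the last two items a bit more concrete, here is a minimal sketch
of how a compile-time tag can select a plain copy for same-encoding
conversions, and how an iterator can step through code points. It is
not the library's implementation: ct_string_sketch, code_point_iterator,
byte_size, and the two toy encodings (latin1 and utf32_native) are
hypothetical names chosen to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <iterator>
#include <vector>

typedef std::uint32_t code_point;

// Two deliberately simple, hypothetical encodings (not from the library).
struct latin1 {  // one byte per code point
    static code_point read(const char*& p) {
        return static_cast<unsigned char>(*p++);
    }
    static void write(code_point cp, std::vector<char>& out) {
        out.push_back(cp <= 0xFF ? static_cast<char>(cp) : '?');
    }
};
struct utf32_native {  // four bytes per code point, host byte order
    static code_point read(const char*& p) {
        code_point cp;
        std::memcpy(&cp, p, sizeof cp);
        p += sizeof cp;
        return cp;
    }
    static void write(code_point cp, std::vector<char>& out) {
        const char* b = reinterpret_cast<const char*>(&cp);
        out.insert(out.end(), b, b + sizeof cp);
    }
};

// Steps through a byte buffer one code point at a time.
// Dereferencing yields the code point by value.
template <class Enc>
class code_point_iterator {
    const char* p_;
public:
    typedef std::forward_iterator_tag iterator_category;
    typedef code_point value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const code_point* pointer;
    typedef code_point reference;

    explicit code_point_iterator(const char* p = 0) : p_(p) {}
    code_point operator*() const { const char* q = p_; return Enc::read(q); }
    code_point_iterator& operator++() { Enc::read(p_); return *this; }
    code_point_iterator operator++(int) {
        code_point_iterator t(*this);
        ++*this;
        return t;
    }
    bool operator==(const code_point_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const code_point_iterator& o) const { return p_ != o.p_; }
};

template <class Enc>
class ct_string_sketch {
    std::vector<char> bytes_;
public:
    typedef code_point_iterator<Enc> iterator;
    ct_string_sketch() {}
    ct_string_sketch(const char* raw, std::size_t n) : bytes_(raw, raw + n) {}

    std::size_t byte_size() const { return bytes_.size(); }
    iterator begin() const { return iterator(bytes_.data()); }
    iterator end() const { return iterator(bytes_.data() + bytes_.size()); }

    // Same encoding: overload resolution prefers this non-template
    // overload, so the bytes are copied verbatim with no decoding.
    void encode(const ct_string_sketch<Enc>& src) { bytes_ = src.bytes_; }

    // Different encoding: decode each code point and re-encode it.
    template <class Other>
    void encode(const ct_string_sketch<Other>& src) {
        bytes_.clear();
        typedef typename ct_string_sketch<Other>::iterator src_iterator;
        for (src_iterator it = src.begin(); it != src.end(); ++it)
            Enc::write(*it, bytes_);
    }
};
```

The choice between the two encode overloads is made entirely by the
compiler, so the same-encoding case pays nothing for the generality of
the other one.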

If you'd like to take a look at the code, it's available here: 
http://www.teamboxel.com/misc/unicode.tar.gz
I've tested it with GCC 4.3.2 and MSVC8, but most modern compilers 
should be able to handle it. Comments and criticisms are, of course, 
welcome.

- Jim

James Porter