
Over the past few months, I've been tinkering with a Unicode string library. It's still *far* from finished, but it's far enough along that the overall structure is visible. I've seen a bunch of Unicode proposals for Boost come and go, so hopefully this one will address the most common needs people have.

The library is based on two (immutable) string types: ct_string and rt_string. ct_strings are _C_ompile _T_ime tagged with a particular encoding, and rt_strings are _R_un _T_ime tagged with an encoding. This is to allow for faster conversion when the encoding is known at compile-time, but to allow for conversion at run-time (useful for reading XML!). General usage would look something like this:

    ct_string<ct::utf8> foo("Hello, world!");
    ct_string<ct::utf16> bar;
    bar.encode(foo);

    rt_string baz;
    baz.encode(bar,rt::utf8);

Note the use of ct::utf8 and rt::utf8. As you might expect from the syntax, ct::utf8 is a type, and rt::utf8 is an object. Broadly speaking, to create an encoding, you create a class with read and write methods, and then you create an instance of an rt_encoding<MyEncoding>. Most of this is laid out in the comments of my code, so I won't go into too much detail here.

There's still a lot missing from the code (most notably, dynamically-sized strings and string concatenation), but here's a rundown of what *is* present:

* Compile-time and run-time tagged strings
* Re-encoding of strings based on compile-/run-time tags
* Uses simple memory copying when source and dest encodings are the same
* Forward iterators to step through code points in strings

If you'd like to take a look at the code, it's available here: http://www.teamboxel.com/misc/unicode.tar.gz . I've tested it in gcc 4.3.2 and MSVC8, but most modern compilers should be able to handle it.

Comments and criticisms are, of course, welcome.

- Jim
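
P.S. To make the "class with read and write methods" idea a bit more concrete, here is a rough, standalone sketch of how compile-time tags can drive re-encoding, including the plain-copy shortcut when source and destination encodings match. Everything in it (utf8_sketch, utf16_sketch, transcoder, the read/write signatures, the use of std::vector as a buffer) is a hypothetical stand-in made up for illustration, not the interface in the tarball, and it assumes well-formed input with no error handling.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical encoding class: read() decodes one code point and advances
// the pointer, write() appends one encoded code point to a buffer.
struct utf8_sketch {
    typedef unsigned char unit;

    static std::uint32_t read(const unit*& p) {
        std::uint32_t c = *p++;
        if (c < 0x80) return c;                        // single-byte (ASCII)
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
        c &= (0x3Fu >> extra);                         // keep lead-byte payload bits
        while (extra-- > 0) c = (c << 6) | (*p++ & 0x3F);
        return c;
    }

    static void write(std::uint32_t c, std::vector<unit>& out) {
        if (c < 0x80) {
            out.push_back(static_cast<unit>(c));
        } else if (c < 0x800) {
            out.push_back(static_cast<unit>(0xC0 | (c >> 6)));
            out.push_back(static_cast<unit>(0x80 | (c & 0x3F)));
        } else if (c < 0x10000) {
            out.push_back(static_cast<unit>(0xE0 | (c >> 12)));
            out.push_back(static_cast<unit>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<unit>(0x80 | (c & 0x3F)));
        } else {
            out.push_back(static_cast<unit>(0xF0 | (c >> 18)));
            out.push_back(static_cast<unit>(0x80 | ((c >> 12) & 0x3F)));
            out.push_back(static_cast<unit>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<unit>(0x80 | (c & 0x3F)));
        }
    }
};

struct utf16_sketch {
    typedef std::uint16_t unit;

    static std::uint32_t read(const unit*& p) {
        std::uint32_t c = *p++;
        if (c >= 0xD800 && c < 0xDC00) {               // high surrogate: pair follows
            std::uint32_t lo = *p++;
            c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
        }
        return c;
    }

    static void write(std::uint32_t c, std::vector<unit>& out) {
        if (c < 0x10000) {
            out.push_back(static_cast<unit>(c));
        } else {                                       // split into a surrogate pair
            c -= 0x10000;
            out.push_back(static_cast<unit>(0xD800 + (c >> 10)));
            out.push_back(static_cast<unit>(0xDC00 + (c & 0x3FF)));
        }
    }
};

// General case: decode each code point with From, re-encode it with To.
template <typename From, typename To>
struct transcoder {
    static std::vector<typename To::unit>
    run(const std::vector<typename From::unit>& in) {
        std::vector<typename To::unit> out;
        if (in.empty()) return out;
        const typename From::unit* p = &in[0];
        const typename From::unit* end = p + in.size();
        while (p != end) To::write(From::read(p), out);
        return out;
    }
};

// Same encoding on both sides, known at compile time: just copy the units.
template <typename E>
struct transcoder<E, E> {
    static std::vector<typename E::unit>
    run(const std::vector<typename E::unit>& in) { return in; }
};

// Small usage example: "A" followed by U+20AC (euro sign), UTF-8 to UTF-16.
inline void sketch_self_check() {
    const unsigned char raw[] = {0x41, 0xE2, 0x82, 0xAC};
    std::vector<unsigned char> u8(raw, raw + sizeof(raw));
    std::vector<std::uint16_t> u16 = transcoder<utf8_sketch, utf16_sketch>::run(u8);
    assert(u16.size() == 2 && u16[0] == 0x0041 && u16[1] == 0x20AC);
}
```

Because the source and destination encodings are template parameters here, the compiler picks the copy-only specialization with no run-time check. A run-time tagged version would presumably have to make that same decision through some kind of dynamic dispatch instead.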