
On Tue, 19 Oct 2004 18:32:50 +0200, Erik Wien <wien@start.no> wrote:
> ----- Original Message ----- From: "Rogier van Dalen" <rogiervd@gmail.com>
>> I've recently started on the first draft of a Unicode library.
>
> Interesting. Is there a discussion going on about this library that I
> have missed, or haven't you posted anything about it yet? I'd hate to
> start something like this if an effort is already being made on the
> subject.
It's in the planning stage; I have a preliminary implementation of some parts. Your message prompted me to bring my ideas out into the open.
>> I think a definition of unicode::code as uint32_t would be much
>> better. The problem is that codecvt is only implemented for wchar_t
>> and char, so it's not possible to make a Unicode codecvt without
>> manually adding (dummy) implementations of
>> codecvt<unicode::code, char, mbstate_t> to the std namespace. I guess
>> this is the reason that Ron Garcia just used wchar_t.
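(As an aside, to make concrete what such a codecvt<unicode::code, char, mbstate_t> facet would do on output: its do_out would essentially perform the UTF-32-to-UTF-8 conversion below. This is only a rough sketch; encode_utf8 is an illustrative name, not something from the library, and it does no validation of the input code point.)

```cpp
#include <cstdint>
#include <string>

// Encode one UTF-32 code point as a UTF-8 byte sequence. This is the
// core of what a codecvt<unicode::code, char, mbstate_t>::do_out would
// have to do; the free-function form here is purely for illustration.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // One byte: plain ASCII.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx.
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```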
> I don't really feel locking the code unit size to 32 bits is a good
> solution either, as strings would then become unnecessarily large.
As I tried to show, the choice of the underlying buffer is templated. It could be a std::string, an SGI rope<wchar_t>, or anything else; a char-based buffer would automatically make it a UTF-8-encoded string, and so on. I agree with you (and with the Unicode standard) that using strings of UTF-16 is probably best for most practical applications.

The interface, however, should IMHO always use UTF-32 (I agree with the Unicode standard here too):

    codepoint_string<...> s = ....;

I think *s.begin() should return a UTF-32 code point. The codecvt class converts to UTF-32 because it didn't occur to me to do anything else; and why would you?

Regards,
Rogier
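P.S. To make the "*s.begin() returns UTF-32" idea concrete, here is a rough sketch of the decoding a UTF-16-backed codepoint_string iterator would do on the fly: surrogate pairs in the buffer are combined into single UTF-32 code points. The name decode_utf16 and the eager vector-based interface are illustrative only, and a real implementation would have to deal with unpaired surrogates.

```cpp
#include <cstdint>
#include <vector>

// Decode a UTF-16 code unit sequence into UTF-32 code points -- the
// same transformation a codepoint_string iterator over a UTF-16 buffer
// would perform lazily, one code point per dereference.
std::vector<std::uint32_t> decode_utf16(const std::vector<std::uint16_t>& in)
{
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t u = in[i];
        // High surrogate followed by another unit: combine the pair.
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t low = in[++i];
            u = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00);
        }
        out.push_back(u);
    }
    return out;
}
```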