
(Sorry if this is a double post; I hadn't subscribed to the list the first time.)
Joseph Gauterin wrote:
If you change state_type in the char_traits, you'd be able to differentiate the various basic_string types and include information about the character encoding without writing a whole lot of new code.
Thanks for the suggestion. Clearly I need to learn some more about this corner of "namespace std" before I go and re-invent something.
IIRC, some of the non-const std::basic_string methods aren't suitable for handling variable-width encodings like UTF-8 and UTF-16. Non-const operator[] in particular returns a reference to the character type, which is a big problem if you want to assign a value > 0x7F (i.e. a character that takes 2 or more bytes in UTF-8).
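To make the operator[] problem concrete, here is a minimal sketch, assuming C++11 and UTF-8 text held in a plain std::string. The helper names encode_utf8, replace_code_unit and make_cafe are hypothetical and invented for this illustration; they are not part of the standard library or of any Boost library. The point is only that writing a non-ASCII character may change the length of the string, which a char& handed out by operator[] cannot do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// Sketch only: it does not reject surrogates or values above 0x10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: replace the single code unit at `pos` with the
// encoding of `cp`. The string may grow, which assignment through the
// char& returned by operator[] has no way of expressing.
void replace_code_unit(std::string& s, std::size_t pos, char32_t cp) {
    assert(pos < s.size());
    s.replace(pos, 1, encode_utf8(cp));
}

// Usage sketch: turn "cafe" into "café".
std::string make_cafe() {
    std::string s = "cafe";
    // s[3] = 0xE9 would store one byte, leaving a string that is not valid UTF-8.
    replace_code_unit(s, 3, 0xE9);  // U+00E9 takes two bytes: 0xC3 0xA9
    return s;
}
```

After the call the string is five bytes long rather than four, so any interface that promises in-place assignment of a single element has to be dropped or replaced by something like the replace call above.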
I've noticed that there are frequent requests/proposals for some sort of boost unicode/string encoding library. I've thought about the problem and it seems too big for one person to handle in their spare time - perhaps a group of us should get together to discuss working on one? I'd be happy to participate.
I'm going to chime in here to say that I've been using a string implementation similar to this for a few years now. Our systems are on Windows, so we want UTF-16 where we interface with Windows APIs and other Windows software, but we wanted to put all of the surrogate pair handling in one place. Our FSLib::wstring uses UTF-32 characters for its character interfaces (i.e. at() and operator[]), but UTF-16 internally. We throw out the non-const operator[] and the non-const iterator. They haven't really been missed. We also have to offer std_str(), which returns a std::wstring, and buffer_begin() and buffer_end(), which return wchar_t*, so we can use Boost.Regex etc.
I've also started looking at tagged types for many of the same sorts of things already mentioned. I also want to use them to describe other kinds of encoding, such as HTTP query string and file specification encodings, HTML attribute encoding, SQL statement string encoding etc. The idea is that it would be impossible to concatenate a query-string-encoded string to an HTML-attribute-encoded one without using the correct conversion function. The aim is to improve security by defeating things like XSS attacks on web servers and SQL injection attacks. I've been looking at making the conversions happen through explicit constructors in order to make them easier to use.
A final thing I've just started to look at is getting the compiler to choose the best internal representation out of UTF-8, UTF-16 and UTF-32 for general use, but it's not something I've gotten very far with.
K
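As a rough illustration of the tagged-type idea with explicit converting constructors, here is a minimal sketch, assuming C++11. The class names plain_text, html_attribute and query_value and the function build_attribute are hypothetical and are not FSLib's actual interface. The escaping rules are deliberately simplified and should not be taken as a complete defence against XSS or injection.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Unencoded text: the starting point for every conversion.
class plain_text {
public:
    explicit plain_text(std::string s) : value_(std::move(s)) {}
    const std::string& str() const { return value_; }
private:
    std::string value_;
};

// Text that has been escaped for use inside a quoted HTML attribute value.
class html_attribute {
public:
    explicit html_attribute(const plain_text& t) {
        for (char c : t.str()) {
            switch (c) {
                case '&':  value_ += "&amp;";  break;
                case '<':  value_ += "&lt;";   break;
                case '>':  value_ += "&gt;";   break;
                case '"':  value_ += "&quot;"; break;
                case '\'': value_ += "&#39;";  break;
                default:   value_ += c;        break;
            }
        }
    }
    // Concatenation is only defined between values of the same encoding.
    html_attribute& operator+=(const html_attribute& rhs) {
        value_ += rhs.value_;
        return *this;
    }
    const std::string& str() const { return value_; }
private:
    std::string value_;
};

// Percent-encoded value for a URL query string.
class query_value {
public:
    explicit query_value(const plain_text& t) {
        static const char hex[] = "0123456789ABCDEF";
        for (unsigned char c : t.str()) {
            const bool unreserved =
                (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                (c >= '0' && c <= '9') ||
                c == '-' || c == '.' || c == '_' || c == '~';
            if (unreserved) {
                value_ += static_cast<char>(c);
            } else {
                value_ += '%';
                value_ += hex[c >> 4];
                value_ += hex[c & 0x0F];
            }
        }
    }
    const std::string& str() const { return value_; }
private:
    std::string value_;
};

// There is no route from one encoding to the other except via plain_text.
static_assert(!std::is_constructible<html_attribute, query_value>::value,
              "a query value must not silently become an HTML attribute");

// Usage sketch: each piece is converted explicitly before being joined.
std::string build_attribute() {
    html_attribute attr{plain_text{"Tom & \"Jerry\""}};
    attr += html_attribute{plain_text{"<b>"}};
    // attr += query_value{plain_text{"x"}};  // would not compile
    return attr.str();
}
```

Because every constructor is explicit and operator+= only accepts the same type, mixing encodings becomes a compile-time error rather than a run-time vulnerability.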