Re: [boost] Strings tagged with their character set

27 Sep 2007

      (Sorry if this is a double post, I'd not subscribed to the list first time)

Joseph Gauterin wrote:
...
...
...
If you change state_type in the char_traits, you'd be able to
differentiate the various basic_string types and include information
about the character encoding without writing a whole lot of new code.
Thanks for the suggestion.  I need to learn some more about this corner
of "namespace std", clearly, before I go and re-invent something.
IIRC, some of the non-const std::basic_string methods aren't suitable
for handling variable-width encodings like UTF-8 and UTF-16. Non-const
operator[] in particular returns a reference to the character type, a
big problem if you want to assign a value > 0x7F (i.e. a character
that uses 2 or more bytes).
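To make the operator[] problem concrete, here is a minimal sketch, assuming UTF-8 stored in a plain std::string and a C++11 compiler for char32_t. The names encode_utf8 and replace_unit are hypothetical helpers for illustration, not part of std or Boost. A reference to a single char cannot hold a code point above 0x7F, so the assignment has to become a replace that may change the length of the string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encode one code point as UTF-8.
// Assumes cp is a valid Unicode scalar value (no validation is done).
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical helper: what "s[i] = cp" would have to do for UTF-8.
// It replaces the one code unit at index i, so s.size() can grow.
inline void replace_unit(std::string& s, std::size_t i, char32_t cp) {
    s.replace(i, 1, encode_utf8(cp));
}
```

Replacing the last 'e' of "cafe" with U+00E9 this way grows the string from four bytes to five, which is exactly what a char& returned from operator[] cannot express.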
I've noticed that there are frequent requests/proposals for some sort
of Boost Unicode/string encoding library. I've thought about the
problem and it seems too big for one person to handle in their spare
time - perhaps a group of us should get together to discuss working on
one? I'd be happy to participate.
I'm going to chime in here to say that I've been using a string 
implementation similar to this for a few years now. Our systems are on 
Windows so we want UTF-16 where we interface with Windows APIs and other 
Windows software, but we wanted to put all of the surrogate pairs stuff 
in one place.

Our FSLib::wstring uses UTF-32 characters for character interfaces (i.e. 
at() and operator[]), but UTF-16 internally. We threw out the non-const 
operator[] and the non-const iterator. They haven't really been missed. 
We also have to offer a std_str(), which returns a std::wstring, and 
buffer_begin() and buffer_end(), which return wchar_t*, so we can use 
Boost.Regex etc.
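A rough sketch of the idea of read-only UTF-32 access over UTF-16 storage, for anyone who hasn't had to deal with surrogate pairs. This is not the FSLib code: the function names are made up for illustration, it uses std::u16string (C++11) rather than std::wstring so that it behaves the same where wchar_t is 32 bits, and it simply passes unpaired surrogates through as they are.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical read-only accessor: returns the n-th code point (0-based)
// of a UTF-16 string. It walks from the start, so it is O(n).
inline char32_t code_point_at(const std::u16string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size()) {
        char32_t cp = s[i];
        std::size_t len = 1;
        // A high surrogate followed by a low surrogate encodes one
        // code point in the range U+10000..U+10FFFF.
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            cp = 0x10000
               + ((static_cast<char32_t>(s[i]) - 0xD800) << 10)
               + (static_cast<char32_t>(s[i + 1]) - 0xDC00);
            len = 2;
        }
        if (n == 0) return cp;
        --n;
        i += len;
    }
    throw std::out_of_range("code_point_at: index past end of string");
}
```

Because the accessor returns a value rather than a reference, there is nothing for a caller to assign through, which is the same trade-off as dropping the non-const operator[].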

I've also started looking at tagged types for many of the same sorts of 
things already mentioned. I also want to use them to describe other 
types of encodings such as HTTP query string and file specification 
encodings, HTML attribute encoding, SQL statement string encoding etc. 
The idea is that it would be impossible to concatenate a query-string 
encoded string to an HTML-attribute encoded one without using the 
correct conversion function.

The aim is to improve security by defeating things like XSS attacks 
on web servers and SQL injection attacks. I've been looking at making 
the conversions happen through explicit constructors in order to make 
the types easier to use.
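Something along these lines is what I mean by conversion through explicit constructors. It is only a sketch: the class names plain_text and html_attribute and the escaping rules are assumptions made for the example, not an existing library, and a real version would need one such type per encoding.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical type: text with no encoding applied yet.
class plain_text {
public:
    explicit plain_text(std::string s) : data_(std::move(s)) {}
    const std::string& str() const { return data_; }
private:
    std::string data_;
};

// Hypothetical type: text already escaped for use in an HTML attribute.
class html_attribute {
public:
    // The only way in is this explicit, escaping constructor.
    explicit html_attribute(const plain_text& t) : data_(escape(t.str())) {}

    // Concatenation is only defined between strings of the same encoding.
    html_attribute& operator+=(const html_attribute& other) {
        data_ += other.data_;
        return *this;
    }
    const std::string& str() const { return data_; }

private:
    static std::string escape(const std::string& in) {
        std::string out;
        for (char c : in) {
            switch (c) {
                case '&':  out += "&amp;";  break;
                case '<':  out += "&lt;";   break;
                case '>':  out += "&gt;";   break;
                case '"':  out += "&quot;"; break;
                case '\'': out += "&#39;";  break;
                default:   out += c;        break;
            }
        }
        return out;
    }
    std::string data_;
};

// No implicit conversion exists, so "attr += some_plain_text" will not compile.
static_assert(!std::is_convertible<plain_text, html_attribute>::value,
              "conversion must be spelled out explicitly");
```

With this, attr += user_input is a compile error, while attr += html_attribute(user_input) compiles and escapes the input on the way in.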

A final thing I've just started to look at is to get the compiler to 
choose the best internal representation out of UTF-8, UTF-16 and UTF-32 
for general use, but it's not something I've gotten very far with.

K

Kirit Sælensminde