
In article <d19pdf$jhu$1@sea.gmane.org>, Erik Wien <wien@start.no> wrote:
Thorsten Ottosen wrote:
| Current design:
| The current design is based around the concept of "encoding traits".
Is it entirely improper to make unicode strings a typedef for std::basic_string<...>?
Not entirely, but certainly less than optimal. basic_string (and the iostreams) make assumptions that don't necessarily apply to Unicode text. One of them is that strings can be represented as a sequence of equally sized characters. Unicode can be represented that way, but that would mean you'd have to use 32 bits per character to be able to represent all the code points assigned in the Unicode standard. In most cases, that is way too much overhead for a string, and usually also a waste, since Unicode code points rarely require more than 16 bits to be encoded. You could of course implement Unicode for 16-bit characters in basic_string, but that would require that the user know about things like surrogate pairs, and also know how to correctly handle them. An unlikely scenario.
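To make the surrogate-pair point concrete, here is a small illustrative sketch of the UTF-16 encoding step (my example, not code from Erik's library): any code point above U+FFFF, such as U+1D11E MUSICAL SYMBOL G CLEF, has to be split across two 16-bit code units, and a plain 16-bit basic_string would leave that detail entirely to the user.

    #include <cstdint>
    #include <vector>

    // Encode one Unicode code point as UTF-16 code units. Code points
    // above U+FFFF need a surrogate pair -- exactly the detail a 16-bit
    // basic_string would force every user to handle correctly.
    std::vector<char16_t> to_utf16(std::uint32_t cp) {
        if (cp <= 0xFFFF)
            return { static_cast<char16_t>(cp) };                // one unit
        cp -= 0x10000;                                           // 20 bits left
        return { static_cast<char16_t>(0xD800 + (cp >> 10)),     // high surrogate
                 static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) }; // low surrogate
    }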
I completely agree with Erik on this. std::string makes assumptions that do not hold for Unicode characters, and it provides interfaces that are misleading (or outright wrong) for Unicode strings. For example, basic_string lets you erase a single element, which can leave the string no longer a valid Unicode string (unless the elements are represented in UTF-32). The same problem exists with every other mutating algorithm on basic_string, including operator[].
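A few lines are enough to show the problem (illustrative only):

    #include <iostream>
    #include <string>

    int main() {
        // "café" in UTF-8: the final 'é' is the two-byte sequence 0xC3 0xA9.
        std::string s = "caf\xC3\xA9";

        s.erase(s.size() - 1);  // erases one *code unit*, not one character

        // s now ends with the lone lead byte 0xC3, so it is no longer valid
        // UTF-8 -- yet basic_string performed the operation without complaint.
        std::cout << s.size() << " code units remain\n";
    }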
And what is the benefit of having a function vs. a function template? Surely a function template will look the same to the client as an ordinary function. Is it often the case that people must change encoding on the fly?
Normally I would not think so, and my first implementation did not work this way. That one templated the entire string class on the encoding, which eliminated the whole implementation inheritance tree found in this implementation.
There was however (as far as I could tell at least) some concern about this approach in the other thread, mostly related to code size and to being locked into an encoding at compile time. Some thought that could be a problem for XML parsers and related technology that needs to establish the encoding at run time (when reading files, for example). This new implementation was simply a test to see whether an alternative solution could be found without those drawbacks. (It has a plethora of new ones, though.)
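As a rough sketch of the shape of that run-time alternative (all names here are invented for illustration; this is not code from either implementation): the encoding is selected when the string is constructed, behind an abstract implementation class, which is what lets an XML parser pick the encoding only after it has started reading the file.

    #include <cstddef>
    #include <memory>
    #include <utility>

    struct ustring_impl {                        // one subclass per encoding
        virtual ~ustring_impl() {}
        virtual std::size_t code_points() const = 0;
        // ... decode/iterate interface ...
    };

    class ustring {                              // single public string type
    public:
        explicit ustring(std::unique_ptr<ustring_impl> impl)
            : impl_(std::move(impl)) {}
        std::size_t code_points() const { return impl_->code_points(); }
    private:
        std::unique_ptr<ustring_impl> impl_;     // encoding chosen at run time
    };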
Here I also agree. Having multiple string classes would just force everyone to pick one for, in most cases, no good reason whatsoever. If I am writing code that uses C++ strings, which encoding should I choose? Why should I care? Particularly, if I don't care, why would I have to choose anyway? More than likely, I would just choose the same thing 99% of the time anyway.

I believe that the ability to force a Unicode string to be in a particular encoding has some value -- especially for people doing low-level work such as serializing Unicode strings to XML, and for people who need to understand the time and space complexity of various Unicode encodings -- but I do not believe that this justifiable demand for complexity means we should make the interface harder for everyone else.

I do, however, think that some people are going to feel that they need to eliminate the runtime overhead of generalized strings and explicitly instantiate strings in a particular encoding, and I don't know whether the library currently provides a facility to accomplish this.
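To make that last point concrete, here is a purely speculative sketch of what such a facility could look like (encoded_string and utf16_traits are placeholders, not names from the proposed library):

    #include <vector>

    struct utf16_traits {
        typedef unsigned short code_unit;   // 16-bit units, surrogates allowed
    };

    template <typename EncodingTraits>      // encoding fixed at compile time,
    class encoded_string {                  // so no run-time dispatch at all
        std::vector<typename EncodingTraits::code_unit> units_;
    };

    // A user (or the library) pins the common case once...
    typedef encoded_string<utf16_traits> utf16_string;

    // ...and can explicitly instantiate it in one translation unit to keep
    // the template's code out of every other object file:
    template class encoded_string<utf16_traits>;

meeroh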