
Thorsten Ottosen wrote:
| Hi Erik,

Hi! Thanks for your reply.
Let me first say that it's good to see progress being made on this important topic.
Here are just some small comments; I didn't follow the first discussion, so maybe these things have already been answered.
| Current design:
| The current design is based around the concept of "encoding_traits".
| Is it entirely improper to make Unicode strings a typedef for std::basic_string<...>?
Not entirely, but certainly less than optimal. basic_string (and the iostreams) make assumptions that don't necessarily apply to Unicode text. One of them is that strings can be represented as a sequence of equally sized characters. Unicode can be represented that way, but that would mean you'd have to use 32 bits per character to be able to represent all the code points assigned in the Unicode standard. In most cases that is far too much overhead for a string, and usually also a waste, since Unicode code points rarely require more than 16 bits to encode.

You could of course implement Unicode for 16-bit characters in basic_string, but that would require the user to know about things like surrogate pairs, and to know how to handle them correctly. An unlikely scenario. By using encoding_traits, however, we are able to make a string class that internally works with 8-, 16- or 32-bit code units (UTF-8, UTF-16 and UTF-32 respectively), but whose external interface uses 32-bit code points, abstracting away the underlying encoding. Done that way, we easily halve the effective size of a string for most users (when using UTF-16, for example).
| And what is the benefit of having a function vs. a function template? Surely a function template will look the same to the client as an ordinary function. Do people often need to change encoding on the fly?
Normally I would not think so, and my first implementation did not work this way. In that one the entire string class was templated on the encoding, which eliminated the whole implementation inheritance tree found in this implementation. There was, however (as far as I could tell at least), some concern about this approach in the other thread, mostly related to code size and to being locked into an encoding at compile time. Some thought that could be a problem for XML parsers and related technology that needs to establish the encoding at run time (when reading files, for example).

This new implementation was simply a test to see whether an alternative solution could be found without those drawbacks. (It has a plethora of new ones, though.) I am more than willing to change this if the current design is no good; starting a discussion on this is one of my main reasons for posting the code in the first place.
| You do however gain speed (I would assume), since you wouldn't have the overhead of virtual function calls, as well as a less complex implementation.
| It would be good to see some real data on how much slower it gets. If the slowdown is significant, then you should consider a two-layered approach (implementing the virtual functions in terms of non-virtual ones) or removing the virtual functions altogether.
Yep. Some profiling of the different designs would be a good idea, and will probably be done in the near future.
| -Thorsten
- Erik