
In article <d19pdf$jhu$1@sea.gmane.org>, Erik Wien <wien@start.no> wrote:
Thorsten Ottosen wrote:
| Current design:
| The current design is based around the concept of "encoding traits".
Is it entirely improper to make unicode strings a typedef for std::basic_string<...>?
Not entirely, but certainly less than optimal. basic_string (and the iostreams) make assumptions that don't necessarily apply to Unicode text. One of them is that strings can be represented as a sequence of equally sized characters. Unicode can be represented that way, but that would mean you'd have to use 32 bits per character to be able to represent all the code points assigned in the Unicode standard. In most cases, that is way too much overhead for a string, and usually also a waste, since Unicode code points rarely require more than 16 bits to be encoded. You could of course implement Unicode for 16-bit characters in basic_string, but that would require that the user know about things like surrogate pairs, and also know how to correctly handle them. An unlikely scenario.
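To make the surrogate-pair point concrete, here is a small illustrative sketch of the UTF-16 encoding step (my example, not code from Erik's library): any code point above U+FFFF, such as U+1D11E MUSICAL SYMBOL G CLEF, has to be split across two 16-bit code units, and a plain 16-bit basic_string would leave that detail entirely to the user.

    #include <cstdint>
    #include <vector>

    // Encode one Unicode code point as UTF-16 code units. Code points
    // above U+FFFF need a surrogate pair -- exactly the detail a 16-bit
    // basic_string would force every user to handle correctly.
    std::vector<char16_t> to_utf16(std::uint32_t cp) {
        if (cp <= 0xFFFF)
            return { static_cast<char16_t>(cp) };                // one unit
        cp -= 0x10000;                                           // 20 bits left
        return { static_cast<char16_t>(0xD800 + (cp >> 10)),     // high surrogate
                 static_cast<char16_t>(0xDC00 + (cp & 0x3FF)) }; // low surrogate
    }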
I completely agree with Erik on this. std::string makes assumptions that do not hold for Unicode characters, and it provides interfaces that are misleading (or outright wrong) for Unicode strings. For example, basic_string lets you erase a single element, which can leave the string no longer a valid Unicode string (unless the elements are represented in UTF-32). The same problem exists with every other mutating algorithm on basic_string, including operator[].
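A few lines are enough to show the problem (illustrative only):

    #include <iostream>
    #include <string>

    int main() {
        // "café" in UTF-8: the final 'é' is the two-byte sequence 0xC3 0xA9.
        std::string s = "caf\xC3\xA9";

        s.erase(s.size() - 1);  // erases one *code unit*, not one character

        // s now ends with the lone lead byte 0xC3, so it is no longer valid
        // UTF-8 -- yet basic_string performed the operation without complaint.
        std::cout << s.size() << " code units remain\n";
    }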
And what is the benefit of having a function vs. a function template? Surely a function template will look the same to the client as an ordinary function. Is it often the case that people must change encoding on the fly?
Normally I would not think so, and my first implementation did not work this way. That one templated the entire string class on the encoding, which eliminated the whole implementation inheritance tree found in this implementation.
There was however (as far as I could tell at least) some concern about this approach in the other thread, mostly related to code size and to being locked into an encoding at compile time. Some thought that could be a problem for XML parsers and related technology that needs to establish the encoding at run time (when reading files, for example). This new implementation was simply a test to see whether an alternative solution could be found without those drawbacks. (It has a plethora of new ones, though.)
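As a rough sketch of the shape of that run-time alternative (all names here are invented for illustration; this is not code from either implementation): the encoding is selected when the string is constructed, behind an abstract implementation class, which is what lets an XML parser pick the encoding only after it has started reading the file.

    #include <cstddef>
    #include <memory>
    #include <utility>

    struct ustring_impl {                        // one subclass per encoding
        virtual ~ustring_impl() {}
        virtual std::size_t code_points() const = 0;
        // ... decode/iterate interface ...
    };

    class ustring {                              // single public string type
    public:
        explicit ustring(std::unique_ptr<ustring_impl> impl)
            : impl_(std::move(impl)) {}
        std::size_t code_points() const { return impl_->code_points(); }
    private:
        std::unique_ptr<ustring_impl> impl_;     // encoding chosen at run time
    };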
Here I also agree. Having multiple string classes would just force everyone to pick one for, in most cases, no good reason whatsoever. If I am writing code that uses C++ strings, which encoding should I choose? Why should I care? Particularly, if I don't care, why would I have to choose anyway? More than likely, I would just choose the same thing 99% of the time anyway.

I believe that the ability to force a Unicode string to be in a particular encoding has some value -- especially for people doing low-level work such as serializing Unicode strings to XML, and for people who need to understand the time and space complexity of various Unicode encodings -- but I do not believe that this justifiable demand for complexity means we should make the interface harder for everyone else.

I do, however, think that some people are going to feel that they need to eliminate the runtime overhead of generalized strings and explicitly instantiate strings in a particular encoding, and I don't know whether the library currently provides a facility to accomplish this.
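To make that last point concrete, here is a purely speculative sketch of what such a facility could look like (encoded_string and utf16_traits are placeholders, not names from the proposed library):

    #include <vector>

    struct utf16_traits {
        typedef unsigned short code_unit;   // 16-bit units, surrogates allowed
    };

    template <typename EncodingTraits>      // encoding fixed at compile time,
    class encoded_string {                  // so no run-time dispatch at all
        std::vector<typename EncodingTraits::code_unit> units_;
    };

    // A user (or the library) pins the common case once...
    typedef encoded_string<utf16_traits> utf16_string;

    // ...and can explicitly instantiate it in one translation unit to keep
    // the template's code out of every other object file:
    template class encoded_string<utf16_traits>;

meeroh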