
In article <d1cam8$sdq$1@sea.gmane.org>, Erik Wien <wien@start.no> wrote:
Miro Jurisic wrote:
Here I also agree. Having multiple string classes would just force everyone to pick one for, in most cases, no good reason whatsoever. If I am writing code that uses C++ strings, which encoding should I choose? Why should I care? Particularly, if I don't care, why would I have to choose anyway? More than likely, I would just choose the same thing 99% of the time anyway.
If we went with an implementation templated on encoding, I would suggest simply having a typedef like today's std::string, let's say "typedef encoded_string<utf16_tag> unicode_string;", and marketing that as "the unicode string class". Users who don't care would use that and be happy, possibly not even knowing they are using a template instantiation. Advanced users could still easily use one of the other encodings, or even template their code to use all of them if necessary. But then, as I have said, you wouldn't have functions/classes that are encoding-independent without templating them.
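A minimal sketch of what that templated design could look like. The tag types, member names, and the encoded_string interface below are my assumptions for illustration, not the proposed library's actual API; real Unicode handling (validation, code point iteration) is elided:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags; each names the code unit its encoding uses.
struct utf8_tag  { using code_unit = char; };
struct utf16_tag { using code_unit = char16_t; };
struct utf32_tag { using code_unit = char32_t; };

// Sketch of a string templated on encoding.
template <typename EncodingTag>
class encoded_string {
public:
    using code_unit = typename EncodingTag::code_unit;

    encoded_string() = default;
    explicit encoded_string(std::basic_string<code_unit> units)
        : units_(std::move(units)) {}

    std::size_t size_in_code_units() const { return units_.size(); }

private:
    std::basic_string<code_unit> units_;
};

// "The unicode string class" for users who don't care about encoding.
typedef encoded_string<utf16_tag> unicode_string;
```

A casual user would just write `unicode_string s(std::u16string(u"hi"));` and never see the template syntax.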
Well, here's what I think -- and this is based entirely on my experience, so I know it's biased:

1. How much of my code has to deal with strings (manipulation, creation, or use)? Almost all of it.

2. How much of that code has to know about the encoding? Almost none of it.

Because of this, I really think that for my purposes the right answer is an encoding-agnostic abstraction. Now, based on my understanding of where knowledge of encodings is necessary, I think that my use cases are similar to those of most C++ users. I could be wrong on that point, of course.
I believe that the ability to force a Unicode string to be in a particular encoding has some value -- especially for people doing low-level work such as serializing Unicode strings to XML, and for people who need to understand time and space complexity of various Unicode encodings -- but I do not believe that this justifiable demand for complexity means we should make the interface harder for everyone else.
I agree. But having a templated implementation would not mean a complex interface for the end user. It would probably be simpler than the current implementation, since you could lose all the encoding setting and getting, especially if we go for the above-mentioned typedef to remove the template syntax for the casual user.
I am not sure that's really true. Let's consider this:

1. When you are passing a boost::unicode_string to an API that uses a different kind of string, you are going to have to perform some conversion (even if it's as simple as extracting a wchar_t* from the unicode_string) one way or another. Therefore, the relative complexity of the two possible interfaces in this use case depends on how easy it is to perform the required conversion. I think that they can be equally easy to use for this use case.

2. When you are manipulating a boost::unicode_string with boost APIs, I believe that the two proposed designs would have the same ease of use.

3. When you need to mix and match encodings, I don't think that the two APIs can be equally easy to use, primarily because implicit conversions in C++ lead to difficulties. (I assume I don't have to bring up specific examples here.)

I think that the end result of the "typedef encoded_string" design would be that either I would have to turn every function that uses a string into a template (which is annoying), or I would have to choose one encoding to use throughout my code, and this seems unnecessary to me.

Finally, it doesn't make sense to me to pay the transcoding cost any earlier than necessary. Consider this code:

    unicode_string foo() { return function_that_returns_utf8(); }

With the "typedef encoded_string" design, I am forced to pay the cost of transcoding even if the caller of this code will actually need UTF-8.

So, to summarize: my opinion is that in applications in which one encoding is used throughout the application (and note that this really means "the application and all boost::unicode_string-savvy libraries it uses"), the typedef approach is probably as easy as the class approach (and faster, because it eliminates vtable dispatch), whereas in applications in which more than one encoding is used, the benefit of avoiding the vtable dispatch will be offset by having to pay the transcoding cost up front.
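The up-front transcoding cost can be made concrete with a toy sketch. Everything here is my own invention (the string types, the counter, function_that_returns_utf8); real transcoding is elided to a character-by-character copy:

```cpp
#include <string>

// Toy stand-ins: a "UTF-8" and a "UTF-16" string, with an implicit,
// counted conversion between them standing in for real transcoding.
static int transcode_count = 0;

struct utf8_string { std::string data; };

struct utf16_string {
    std::u16string data;
    utf16_string() = default;
    // Implicit conversion from UTF-8, as the typedef design would require.
    utf16_string(const utf8_string& s) {
        ++transcode_count;  // the transcoding cost is paid here, eagerly
        data.assign(s.data.begin(), s.data.end());
    }
};

typedef utf16_string unicode_string;

utf8_string function_that_returns_utf8() { return utf8_string{"hello"}; }

// The example from the post: returning through the typedef forces a
// transcode even if the caller wanted UTF-8 all along.
unicode_string foo() { return function_that_returns_utf8(); }
```

A caller that actually needed UTF-8 would then have to transcode back, paying twice for a round trip that an encoding-agnostic string could have skipped entirely.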
In my opinion, having boost::unicode_string_utfN for the situations in which encoding is important and boost::unicode_string which can hold any encoding is better than not having a string that can hold any encoding. (I am sure that if we decide to accept this library with typedef unicode_string_utfM unicode_string, the first thing I'll need is my own encoding-agnostic unicode_string...)
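One way such an encoding-agnostic unicode_string could be sketched: a variant-based toy of my own devising (not anything from the proposal), which carries whichever encoding it was given and would only transcode when a caller asks for a different one:

```cpp
#include <string>
#include <variant>

// Toy encoding-agnostic string: holds any of several encodings at
// runtime. Dispatch is a runtime cost (here a variant index, in the
// proposal a vtable), but nothing is transcoded eagerly.
class any_unicode_string {
public:
    explicit any_unicode_string(std::string utf8)
        : data_(std::move(utf8)) {}
    explicit any_unicode_string(std::u16string utf16)
        : data_(std::move(utf16)) {}

    bool holds_utf8() const {
        return std::holds_alternative<std::string>(data_);
    }
    bool holds_utf16() const {
        return std::holds_alternative<std::u16string>(data_);
    }

private:
    std::variant<std::string, std::u16string> data_;
};
```

The trade-off is exactly the one described above: a per-operation dispatch cost in exchange for deferring (or avoiding) transcoding.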
I do, however, think that some people are going to feel that they need to eliminate the runtime overhead of generalized strings and explicitly instantiate strings in a particular encoding, and I don't know whether the library currently provides a facility to accomplish this.
It doesn't currently. But it would be pretty simple to create an implementation that allows that through use of the encoding_traits classes. I have done that before, and could probably use most of that code again if we were to include that.
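A sketch of how encoding_traits could enable that kind of explicit instantiation. This is my own toy in the spirit described, not Erik's prior code: all encoding-specific knowledge lives in a traits specialization, so a string fixed to one encoding compiles to direct calls with no runtime dispatch.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical traits: one specialization per encoding.
template <typename Encoding> struct encoding_traits;  // primary: undefined

struct utf8_encoding  {};
struct utf16_encoding {};

template <> struct encoding_traits<utf8_encoding> {
    using code_unit = char;
    static constexpr std::size_t max_units_per_code_point = 4;
};

template <> struct encoding_traits<utf16_encoding> {
    using code_unit = char16_t;
    static constexpr std::size_t max_units_per_code_point = 2;
};

// A string bound at compile time to one encoding via its traits.
template <typename Encoding>
class fixed_encoding_string {
public:
    using traits = encoding_traits<Encoding>;
    using code_unit = typename traits::code_unit;

    explicit fixed_encoding_string(std::basic_string<code_unit> units)
        : units_(std::move(units)) {}

    std::size_t size_in_code_units() const { return units_.size(); }

private:
    std::basic_string<code_unit> units_;
};
```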
I think that it should provide this, but I don't demand that it provide it right away.

meeroh