
In article <d1cam8$sdq$1@sea.gmane.org>, Erik Wien <wien@start.no> wrote:
Miro Jurisic wrote:
Here I also agree. Having multiple string classes would just force everyone to pick one for, in most cases, no good reason whatsoever. If I am writing code that uses C++ strings, which encoding should I choose? Why should I care? Particularly, if I don't care, why would I have to choose anyway? More than likely, I would just choose the same thing 99% of the time anyway.
If we went with an implementation templated on encoding, I would suggest simply having a typedef like today's std::string, let's say "typedef encoded_string<utf16_tag> unicode_string;", and marketing that as "the unicode string class". Users who don't care would use that and be happy, possibly not even knowing they are using a template instantiation. Advanced users could still easily use one of the other encodings, or even template their code to use all of them if necessary. But then, as I have said, you wouldn't have functions/classes that are encoding-independent without templating them.
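A minimal sketch of what that templated design could look like. The tag types, member names, and the encoded_string interface below are my assumptions for illustration, not the proposed library's actual API; real Unicode handling (validation, code point iteration) is elided:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical encoding tags; each names the code unit its encoding uses.
struct utf8_tag  { using code_unit = char; };
struct utf16_tag { using code_unit = char16_t; };
struct utf32_tag { using code_unit = char32_t; };

// Sketch of a string templated on encoding.
template <typename EncodingTag>
class encoded_string {
public:
    using code_unit = typename EncodingTag::code_unit;

    encoded_string() = default;
    explicit encoded_string(std::basic_string<code_unit> units)
        : units_(std::move(units)) {}

    std::size_t size_in_code_units() const { return units_.size(); }

private:
    std::basic_string<code_unit> units_;
};

// "The unicode string class" for users who don't care about encoding.
typedef encoded_string<utf16_tag> unicode_string;
```

A casual user would just write `unicode_string s(std::u16string(u"hi"));` and never see the template syntax.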
Well, here's what I think -- and this is based entirely on my experience, so I know it's biased:

1. How much of my code has to deal with strings (manipulation, creation, or use)? Almost all of it.

2. How much of that code has to know about the encoding? Almost none of it.

Because of this, I really think that for my purposes the right answer is an encoding-agnostic abstraction. Now, based on my understanding of where knowledge of encodings is necessary, I think that my use cases are similar to those of most C++ users. I could be wrong on that point, of course.
I believe that the ability to force a Unicode string to be in a particular encoding has some value -- especially for people doing low-level work such as serializing Unicode strings to XML, and for people who need to understand time and space complexity of various Unicode encodings -- but I do not believe that this justifiable demand for complexity means we should make the interface harder for everyone else.
I agree. But having a templated implementation would not mean a complex interface for the end user. It would probably be simpler than the current implementation, since you could lose all the encoding setting and getting, especially if we go for the above-mentioned typedef to remove the template syntax for the casual user.
I am not sure that's really true. Let's consider this:

1. When you are passing a boost::unicode_string to an API that uses a different kind of string, you are going to have to perform some conversion (even if it's as simple as extracting a wchar_t* from the unicode_string) one way or another. Therefore, the relative complexity of the two possible interfaces in this use case depends on how easy it is to perform the required conversion. I think that they can be equally easy to use for this use case.

2. When you are manipulating a boost::unicode_string with boost APIs, I believe that the two proposed designs would have the same ease of use.

3. When you need to mix and match encodings, I don't think that the two APIs can be equally easy to use, primarily because implicit conversions in C++ lead to difficulties. (I assume I don't have to bring up specific examples here.)

I think that the end result of the "typedef encoded_string" design would be that either I would have to turn every function that uses a string into a template (which is annoying), or I would have to choose one encoding to use throughout my code, and this seems unnecessary to me.

Finally, it doesn't make sense to me to pay the transcoding cost any earlier than necessary. Consider this code:

    unicode_string foo() { return function_that_returns_utf8(); }

With the "typedef encoded_string" design, I am forced to pay the cost of transcoding even if the caller of this code will actually need UTF-8.

So, to summarize: my opinion is that in applications in which one encoding is used throughout the application (and note that this really means "the application and all boost::unicode_string-savvy libraries it uses"), the typedef approach is probably as easy as the class approach (and faster, because it eliminates vtable dispatch), whereas in applications in which more than one encoding is used, the benefit of avoiding the vtable dispatch will be offset by having to pay the transcoding cost up front.
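The up-front transcoding cost can be made concrete with a toy sketch. Everything here is my own invention (the string types, the counter, function_that_returns_utf8); real transcoding is elided to a character-by-character copy:

```cpp
#include <string>

// Toy stand-ins: a "UTF-8" and a "UTF-16" string, with an implicit,
// counted conversion between them standing in for real transcoding.
static int transcode_count = 0;

struct utf8_string { std::string data; };

struct utf16_string {
    std::u16string data;
    utf16_string() = default;
    // Implicit conversion from UTF-8, as the typedef design would require.
    utf16_string(const utf8_string& s) {
        ++transcode_count;  // the transcoding cost is paid here, eagerly
        data.assign(s.data.begin(), s.data.end());
    }
};

typedef utf16_string unicode_string;

utf8_string function_that_returns_utf8() { return utf8_string{"hello"}; }

// The example from the post: returning through the typedef forces a
// transcode even if the caller wanted UTF-8 all along.
unicode_string foo() { return function_that_returns_utf8(); }
```

A caller that actually needed UTF-8 would then have to transcode back, paying twice for a round trip that an encoding-agnostic string could have skipped entirely.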
In my opinion, having boost::unicode_string_utfN for the situations in which encoding is important and boost::unicode_string which can hold any encoding is better than not having a string that can hold any encoding. (I am sure that if we decide to accept this library with typedef unicode_string_utfM unicode_string, the first thing I'll need is my own encoding-agnostic unicode_string...)
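One way such an encoding-agnostic unicode_string could be sketched: a variant-based toy of my own devising (not anything from the proposal), which carries whichever encoding it was given and would only transcode when a caller asks for a different one:

```cpp
#include <string>
#include <variant>

// Toy encoding-agnostic string: holds any of several encodings at
// runtime. Dispatch is a runtime cost (here a variant index, in the
// proposal a vtable), but nothing is transcoded eagerly.
class any_unicode_string {
public:
    explicit any_unicode_string(std::string utf8)
        : data_(std::move(utf8)) {}
    explicit any_unicode_string(std::u16string utf16)
        : data_(std::move(utf16)) {}

    bool holds_utf8() const {
        return std::holds_alternative<std::string>(data_);
    }
    bool holds_utf16() const {
        return std::holds_alternative<std::u16string>(data_);
    }

private:
    std::variant<std::string, std::u16string> data_;
};
```

The trade-off is exactly the one described above: a per-operation dispatch cost in exchange for deferring (or avoiding) transcoding.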
I do, however, think that some people are going to feel that they need to eliminate the runtime overhead of generalized strings and explicitly instantiate strings in a particular encoding, and I don't know whether the library currently provides a facility to accomplish this.
It doesn't currently. But it would be pretty simple to create an implementation that allows that through use of the encoding_traits classes. I have done that before, and could probably use most of that code again if we were to include that.
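A sketch of how encoding_traits could enable that kind of explicit instantiation. This is my own toy in the spirit described, not Erik's prior code: all encoding-specific knowledge lives in a traits specialization, so a string fixed to one encoding compiles to direct calls with no runtime dispatch.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical traits: one specialization per encoding.
template <typename Encoding> struct encoding_traits;  // primary: undefined

struct utf8_encoding  {};
struct utf16_encoding {};

template <> struct encoding_traits<utf8_encoding> {
    using code_unit = char;
    static constexpr std::size_t max_units_per_code_point = 4;
};

template <> struct encoding_traits<utf16_encoding> {
    using code_unit = char16_t;
    static constexpr std::size_t max_units_per_code_point = 2;
};

// A string bound at compile time to one encoding via its traits.
template <typename Encoding>
class fixed_encoding_string {
public:
    using traits = encoding_traits<Encoding>;
    using code_unit = typename traits::code_unit;

    explicit fixed_encoding_string(std::basic_string<code_unit> units)
        : units_(std::move(units)) {}

    std::size_t size_in_code_units() const { return units_.size(); }

private:
    std::basic_string<code_unit> units_;
};
```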
I think that it should provide this, but I don't demand that it provide it right away.

meeroh