
Erik Wien wrote:
- Why would the user want to change the encoding? Especially between UTF-16 and UTF-32?
Well... Different people have different needs. If you are mostly using ASCII characters and require small size, UTF-8 would fit the bill. If you need the best general performance on most operations, use UTF-16. If you need fast iteration over code points and size doesn't matter, use UTF-32.
OK, since everybody agreed that characters outside 16 bits are very rare, UTF-32 seems never to be needed. As for UTF-8 vs. UTF-16: yes, the need for a choice seems present. However, a UTF-16 string class would be better than no string class at all, and the extra genericity will cost you development time.
- Why would the user want to specify the encoding at compile time? Are there performance benefits to that? Basically, if we agree that UTF-32 is not needed, then UTF-16 is the only encoding which does not require complex handling. Maybe, for the other encodings, using virtual functions in the character iterator is OK? And if iterators have "abstract characters" as their value_type, maybe the overhead of that is much larger than a virtual function call, even for UTF-16.
Though I haven't confirmed this by testing, I would assume that templating the encoding, and thus specifying it at compile time, would result in better performance, since you don't have the overhead of virtual function calls. (Polymorphism would probably be needed if templates were scrapped.)
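[For illustration, a minimal sketch of what templating on the encoding might look like; names such as utf16_encoding and unicode_string are made up here, not the proposed interface:

    #include <cstdint>
    #include <vector>

    struct utf8_encoding  { typedef unsigned char  code_unit; };
    struct utf16_encoding { typedef std::uint16_t  code_unit; };

    template<class Encoding>
    class unicode_string {
    public:
        typedef typename Encoding::code_unit code_unit;
        // Access and decoding are resolved at compile time for the chosen
        // encoding, so the compiler is free to inline the decoding logic.
    private:
        std::vector<code_unit> units_;
    };

    typedef unicode_string<utf16_encoding> u16string_t;

Everything below this point is ordinary non-virtual code, which is where the inlining argument in the next replies comes from.]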
It would. The question is by how much.
Avoiding virtual calls also enables the compiler to optimize (inline) more thoroughly, something that is very beneficial in this case because of the number of different small, specialized functions that are needed in string manipulation.
This is a bit abstract. A virtual function is an inlining barrier, but it would be placed only at character access. On both sides of the barrier, the compiler can freely optimize everything.
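[A sketch of the virtual-call alternative being discussed; the reader class and its names are hypothetical, and error handling for unpaired surrogates is omitted:

    #include <cstdint>

    typedef std::uint32_t code_point;   // one "abstract character"

    // Only character access sits behind the virtual call; whatever the
    // caller does with the decoded code points can still be inlined.
    class code_point_reader {
    public:
        virtual ~code_point_reader() {}
        virtual bool next(code_point& cp) = 0;   // decode one code point
    };

    class utf16_reader : public code_point_reader {
        const std::uint16_t* cur_;
        const std::uint16_t* end_;
    public:
        utf16_reader(const std::uint16_t* b, const std::uint16_t* e)
            : cur_(b), end_(e) {}
        virtual bool next(code_point& cp) {
            if (cur_ == end_) return false;
            std::uint16_t u = *cur_++;
            if (u >= 0xD800 && u <= 0xDBFF && cur_ != end_) {
                std::uint16_t lo = *cur_++;   // assume a valid low surrogate
                cp = 0x10000 + ((code_point(u - 0xD800) << 10) | (lo - 0xDC00));
            } else {
                cp = u;
            }
            return true;
        }
    };

The question raised above is whether the per-character virtual call costs more than it buys, especially for UTF-16 where decoding is usually trivial.]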
- What if the user wants to specify the encoding at run time? For example, XML files specify their encoding explicitly. I'd want to use an ASCII/UTF-8 encoding if the XML document is 8-bit, and UTF-16 when it's Unicode.
That is one problem with templating the encoding. You would have to either template all the file-scanning functions in the XML parser on the encoding as well, or you would need to do some run-time checks and use the correct template depending on the encoding used in the file. This is of course not ideal, but it only matters where the encoding is something that is specified at run time. Which scenario is the most common is something that needs to be determined before a final design is decided on.
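[The second option mentioned above, one run-time check dispatching to the right template instantiation, could look roughly like this; all names here are illustrative stubs:

    #include <istream>

    struct utf8_encoding  {};
    struct utf16_encoding {};

    enum encoding_id { enc_utf8, enc_utf16 };

    // Would inspect the BOM and/or the XML declaration; stubbed out here.
    encoding_id detect_encoding(std::istream&) { return enc_utf8; }

    template<class Encoding>
    void parse_document(std::istream&) { /* encoding-specific scanning */ }

    // One run-time decision at the top; everything below it is a
    // compile-time instantiation for the detected encoding.
    void parse(std::istream& in)
    {
        switch (detect_encoding(in)) {
        case enc_utf8:  parse_document<utf8_encoding>(in);  break;
        case enc_utf16: parse_document<utf16_encoding>(in); break;
        }
    }

The cost is that the parser's scanning code gets instantiated once per supported encoding.]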
Another possibility is that you can decide whether UTF-8 or UTF-16 should be used dynamically, just by counting the number of non-ASCII characters. That would mean that only really advanced users would need to make the decision themselves. I think I'm starting to like Peter's idea that advanced users need vector<char_xxx> together with a set of algorithms. - Volodya
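[The counting heuristic mentioned here could be as simple as the following sketch; the function name and the 50% threshold are made up for illustration:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Given already-decoded code points, pick a storage encoding:
    // prefer UTF-8 when the text is mostly ASCII, UTF-16 otherwise.
    bool prefer_utf8(const std::vector<std::uint32_t>& code_points)
    {
        std::size_t non_ascii = 0;
        for (std::size_t i = 0; i < code_points.size(); ++i)
            if (code_points[i] > 0x7F)
                ++non_ascii;
        return non_ascii * 2 < code_points.size();   // arbitrary threshold
    }

A library could apply such a heuristic by default and still expose the explicit choice to advanced users.]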