
Erik Wien wrote:
- Why would the user want to change the encoding? Especially between UTF-16 and UTF-32?
Well... Different people have different needs. If you are mostly using ASCII characters and require small size, UTF-8 would fit the bill. If you need the best general performance on most operations, use UTF-16. If you need fast iteration over code points and size doesn't matter, use UTF-32.
OK, since everybody agreed that characters outside 16 bits are very rare, UTF-32 seems never to be needed. As for UTF-8 vs. UTF-16: yes, the need for a choice seems present. However, a UTF-16 string class would be better than no string class at all, and the extra genericity will cost you development time.
- Why would the user want to specify the encoding at compile time? Are there performance benefits to that? Basically, if we agree that UTF-32 is not needed, then UTF-16 is the only encoding which does not require complex handling. Maybe, for the other encodings, using virtual functions in the character iterator is OK? And if iterators have "abstract characters" as their value_type, maybe the overhead of that is much larger than a virtual function call, even for UTF-16.
Though I haven't confirmed this by testing, I would assume that templating the encoding, and thus specifying it at compile time, would result in better performance, since you don't have the overhead of virtual function calls. (Polymorphism would probably be needed if templates were scrapped.)
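[For illustration, a minimal sketch of what templating on the encoding might look like; names such as utf16_encoding and unicode_string are made up here, not the proposed interface:

    #include <cstdint>
    #include <vector>

    struct utf8_encoding  { typedef unsigned char  code_unit; };
    struct utf16_encoding { typedef std::uint16_t  code_unit; };

    template<class Encoding>
    class unicode_string {
    public:
        typedef typename Encoding::code_unit code_unit;
        // Access and decoding are resolved at compile time for the chosen
        // encoding, so the compiler is free to inline the decoding logic.
    private:
        std::vector<code_unit> units_;
    };

    typedef unicode_string<utf16_encoding> u16string_t;

Everything below this point is ordinary non-virtual code, which is where the inlining argument in the next replies comes from.]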
It would. The question is by how much.
Avoiding virtual calls also enables the compiler to optimize (inline) more thoroughly, something that is very beneficial in this case because of the number of different small, specialized functions that are needed in string manipulation.
This is a bit abstract. A virtual function is an inlining barrier, but it would be placed only at character access. On both sides of the barrier, the compiler can freely optimize everything.
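[A sketch of the virtual-call alternative being discussed; the reader class and its names are hypothetical, and error handling for unpaired surrogates is omitted:

    #include <cstdint>

    typedef std::uint32_t code_point;   // one "abstract character"

    // Only character access sits behind the virtual call; whatever the
    // caller does with the decoded code points can still be inlined.
    class code_point_reader {
    public:
        virtual ~code_point_reader() {}
        virtual bool next(code_point& cp) = 0;   // decode one code point
    };

    class utf16_reader : public code_point_reader {
        const std::uint16_t* cur_;
        const std::uint16_t* end_;
    public:
        utf16_reader(const std::uint16_t* b, const std::uint16_t* e)
            : cur_(b), end_(e) {}
        virtual bool next(code_point& cp) {
            if (cur_ == end_) return false;
            std::uint16_t u = *cur_++;
            if (u >= 0xD800 && u <= 0xDBFF && cur_ != end_) {
                std::uint16_t lo = *cur_++;   // assume a valid low surrogate
                cp = 0x10000 + ((code_point(u - 0xD800) << 10) | (lo - 0xDC00));
            } else {
                cp = u;
            }
            return true;
        }
    };

The question raised above is whether the per-character virtual call costs more than it buys, especially for UTF-16 where decoding is usually trivial.]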
- What if the user wants to specify the encoding at run time? For example, XML files specify their encoding explicitly. I'd want to use an ASCII/UTF-8 encoding if the XML document is 8-bit, and UTF-16 when it's Unicode.
That is one problem with templating the encoding. You would have to either template all the file-scanning functions in the XML parser on the encoding as well, or you would need to do some run-time checks and use the correct template depending on the encoding used in the file. This is of course not ideal, but it only matters where the encoding is something that is specified at run time. Which scenario is the most common is something that needs to be determined before a final design is decided on.
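[The second option mentioned above, one run-time check dispatching to the right template instantiation, could look roughly like this; all names here are illustrative stubs:

    #include <istream>

    struct utf8_encoding  {};
    struct utf16_encoding {};

    enum encoding_id { enc_utf8, enc_utf16 };

    // Would inspect the BOM and/or the XML declaration; stubbed out here.
    encoding_id detect_encoding(std::istream&) { return enc_utf8; }

    template<class Encoding>
    void parse_document(std::istream&) { /* encoding-specific scanning */ }

    // One run-time decision at the top; everything below it is a
    // compile-time instantiation for the detected encoding.
    void parse(std::istream& in)
    {
        switch (detect_encoding(in)) {
        case enc_utf8:  parse_document<utf8_encoding>(in);  break;
        case enc_utf16: parse_document<utf16_encoding>(in); break;
        }
    }

The cost is that the parser's scanning code gets instantiated once per supported encoding.]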
Another possibility is that you can decide whether UTF-8 or UTF-16 should be used dynamically, just by counting the number of non-ASCII characters. That would mean that only really advanced users would need to make the decision themselves. I think I'm starting to like Peter's idea that advanced users need vector<char_xxx> together with a set of algorithms. - Volodya
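[The counting heuristic mentioned here could be as simple as the following sketch; the function name and the 50% threshold are made up for illustration:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Given already-decoded code points, pick a storage encoding:
    // prefer UTF-8 when the text is mostly ASCII, UTF-16 otherwise.
    bool prefer_utf8(const std::vector<std::uint32_t>& code_points)
    {
        std::size_t non_ascii = 0;
        for (std::size_t i = 0; i < code_points.size(); ++i)
            if (code_points[i] > 0x7F)
                ++non_ascii;
        return non_ascii * 2 < code_points.size();   // arbitrary threshold
    }

A library could apply such a heuristic by default and still expose the explicit choice to advanced users.]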