
Mathias Gaunard wrote:
Andrey Semashev wrote:
Text parsing is one such example. And it may be extremely performance-critical.
Text parsing being quite low-level, parsers should probably use lower-level accesses to the string, iterating over code points or code units for example, and possibly take extra care depending on what they do.
I agree that parsing is rather a low-level task. But I see no benefit in being forced to parse Unicode code points instead of fixed-length characters in a given encoding.
Various Unicode related tools (text boundaries searching etc.) would be needed to assist the parser in this task.
That would be nice.
Building a fully Unicode-aware regex engine is probably difficult. See the guidelines here: http://unicode.org/unicode/reports/tr18/ Boost.Regex -- which makes use of ICU for Unicode support -- for example, does not even fully comply with level 1.
Interesting. I wonder what level of support will be proposed to the Standardization Committee.
You should be working with Unicode internally in your app anyway if you want to avoid translations, since most systems or toolkits require Unicode in some form in their interfaces. I'm not sure about the word "most" in the context of "require". I'd rather say "most allow Unicode".
I know of several libraries or APIs that only work with Unicode. It's simply easier for them if there is only one format that represents all text. GTK+ is one example.
Well, that doesn't mean I was wrong in my statement. :)
But that does not mean that all strings in C++ should be in Unicode and I should always work in it. I just want to have a choice, after all.
Additionally, there is plenty of already written code that does not use Unicode. We can't just throw it away.
Compatibility with legacy code will always be an issue. Isn't a runtime conversion simply acceptable?
I don't think so - we're back to the performance issue. I just don't understand why there's such a strong will to cut down fixed-width char encodings in favor of exclusive Unicode support. Why can't we have both? Is it because the text processing algorithms would have to be duplicated? I think not, if the implementation is well designed. Is it because of CRT size growth due to encoding-specific data? Possibly, but not necessarily. In fact, if application size is the primary concern, the whole of Unicode support is a good candidate to cut away.