
Mathias Gaunard wrote:
Andrey Semashev wrote:
Text parsing is one such example. And it may be extremely performance-critical.
Text parsing being quite low-level, parsers should probably use lower-level accesses to the string, iterating over code points or code units for example, and possibly take extra care depending on what they do.
I agree that parsing is rather a low-level task. But I see no benefit in being forced to parse Unicode code points instead of fixed-length characters in a given encoding.
Various Unicode related tools (text boundaries searching etc.) would be needed to assist the parser in this task.
That would be nice.
Building a fully Unicode-aware regex engine is probably difficult. See the guidelines here: http://unicode.org/unicode/reports/tr18/ Boost.Regex -- which makes use of ICU for Unicode support -- for example, does not even fully comply with level 1.
Interesting. I wonder what level of support will be proposed to the Standardization Committee.
You should be working with Unicode internally in your app anyway if you want to avoid translations, since most systems or toolkits require Unicode in some form in their interfaces. I'm not sure about the word "most" in the context of "require". I'd rather say "most allow Unicode".
I know of several libraries or APIs that only work with Unicode. It's simply easier for them if there is only one format that represents all text. GTK+ is one example.
Well, that doesn't mean I was wrong in my statement. :)
But that does not mean that all strings in C++ should be in Unicode and I should always work in it. I just want to have a choice, after all.
Additionally, there is plenty of already written code that does not use Unicode. We can't just throw it away.
Compatibility with legacy code will always be an issue. Isn't a runtime conversion simply acceptable?
I don't think so - we're back to the performance issue. I just don't understand why there's such a strong will to cut down fixed-width char encodings in favor of exclusive Unicode support. Why can't we have both? Is it because the text processing algorithms would have to be duplicated? I think not, if the implementation is well designed. Is it because of CRT size growth due to encoding-specific data? Possibly, but not necessarily. In fact, if application size is the primary concern, the whole of Unicode support is a good candidate to cut away.