
On 28/01/2011 14:58, Artyom wrote:
What am I paying for? I don't see how I gain anything.
You don't pay for validation of the UTF-8, especially when 99% of the uses of the string are encoding-agnostic.
I asked for what I gained, not what I did not lose.
// UTF validation
bool is_valid_utf() const;
See, that's what makes the whole thing pointless.
Actually not, consider:
socket.read(my_string);
if(!my_string.is_valid_utf()) ....
Could be a free function, and would actually be *better* as a free function, because you could apply it on any range, not just your type.
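As an illustration of that point, here is a minimal sketch of such a free function, assuming nothing beyond the standard library: a UTF-8 validity check templated on the iterator type, so it applies to a std::string, a vector of bytes, or any other byte range. The name is_valid_utf8 and its interface are hypothetical, not taken from either library under discussion.

#include <cstdint>

template <typename Iterator>
bool is_valid_utf8(Iterator first, Iterator last)
{
    while (first != last)
    {
        std::uint8_t lead = static_cast<std::uint8_t>(*first++);
        int trailing = 0;
        std::uint32_t cp = 0;

        if (lead < 0x80)      continue;                        // ASCII
        else if (lead < 0xC2) return false;                    // stray continuation or overlong lead
        else if (lead < 0xE0) { trailing = 1; cp = lead & 0x1F; }
        else if (lead < 0xF0) { trailing = 2; cp = lead & 0x0F; }
        else if (lead < 0xF5) { trailing = 3; cp = lead & 0x07; }
        else                  return false;                    // beyond U+10FFFF

        for (int i = 0; i < trailing; ++i)
        {
            if (first == last) return false;                   // truncated sequence
            std::uint8_t b = static_cast<std::uint8_t>(*first++);
            if ((b & 0xC0) != 0x80) return false;              // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp > 0x10FFFF) return false;                       // out of Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;        // UTF-16 surrogate
        if (trailing == 2 && cp < 0x800) return false;         // overlong 3-byte form
        if (trailing == 3 && cp < 0x10000) return false;       // overlong 4-byte form
    }
    return true;
}

Applied to the example above, that gives socket.read(my_string); if(!is_valid_utf8(my_string.begin(), my_string.end())) ...., with my_string remaining a plain std::string.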
Your type doesn't add any semantic value on top of std::string, it's just an agglomeration of free functions into a class. That's a terrible design. The only advantage that a specific type for unicode strings would bring is that it could enforce certain useful invariants.
You don't need to enforce things you don't care about in 99% of cases.
You don't get the point. Your type doesn't add any information on top of std::string. Therefore it is meaningless. It's just an agglomeration of functions; in C++ we use namespaces for that, not classes.
Enforcing that the string is in a valid UTF encoding and is normalized in a specific normalization form can make most Unicode algorithms several orders of magnitude faster.
You do not always want to normalize text. It is the user's choice; you may have optimized algorithms for already-normalized strings, but that is not always the case.
If my strings are valid and normalized, I can compare them with a simple binary-level comparison; likewise for substring search, where I may also need to add a boundary check if I want fine-grained search. What you want to do is implement comparison by iterating through each lazily computed code point and comparing them. This is at least 60 times as slow; it also doesn't really compare equivalent characters in the strings.

To get correct behaviour when comparing strings, they should be normalized. Normalization is costly, so you don't want to do it at each comparison, but only once. In practice, all data available everywhere should already be in NFC (XML mandates it, for example), and checking whether a string is normalized is very fast (though less fast than checking whether a string is valid UTF-8, since you still need to access a table, which might hurt the cache, and is not vectorizable).

Dealing with potentially invalid UTF strings can be highly dangerous as well; exploits for that kind of thing are commonplace. I suspect denormalized Unicode could be sensitive too, since in some parts of your application U+00E0 (à) and U+0061 U+0300 (a + combining grave) could compare equal but not in others, depending on what that string went through, causing inconsistencies.

Anyway, the only value we can bring on top of the range abstraction is by establishing invariants. It makes sense to establish the strongest one, though I am not opposed to just checking for UTF validity. But no checking at all? There is no point. You might as well make your string type a typedef of std::string.
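To make the "normalize once, then compare cheaply" point concrete, here is a sketch that uses ICU purely as a stand-in for some normalization backend (nobody in this thread is proposing ICU itself): the two spellings of à differ byte-wise, but after a single NFC pass at the boundary they compare equal with nothing more than std::string's binary operator==.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

// Normalize a UTF-8 string to NFC once, at the boundary of the application.
std::string to_nfc(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
    icu::UnicodeString normalized = nfc->normalize(s, status);
    std::string out;
    normalized.toUTF8String(out);
    return out;
}

int main()
{
    std::string composed   = "\xC3\xA0";      // U+00E0, precomposed "à"
    std::string decomposed = "a\xCC\x80";     // U+0061 U+0300, "a" + combining grave

    std::cout << (composed == decomposed) << '\n';                  // 0: raw bytes differ
    std::cout << (to_nfc(composed) == to_nfc(decomposed)) << '\n';  // 1: byte-equal after NFC
}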
Also, what kind of normalization? NFC? NFKC?
NFC, of course. It takes less space and doesn't make you lose anything. If you want to work in decomposed forms or something else, use your own container and not the adaptor. Remember, this whole thing is just there to help you deal with the general case in a practical, correct and efficient way. The real algorithms are fully generic, and allow you to do whatever you want; they accept both normalized and un-normalized strings, data regardless of its memory layout, etc.
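As a sketch of what "fully generic" means here (the function below and its name are mine, not taken from the library being discussed, and it assumes the input is already valid UTF-8): an algorithm written against an iterator concept runs unchanged over contiguous and non-contiguous storage alike.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <string>

template <typename ByteIterator>
std::size_t count_code_points(ByteIterator first, ByteIterator last)
{
    std::size_t n = 0;
    for (; first != last; ++first)
    {
        // Count every byte that is not a UTF-8 continuation byte (10xxxxxx).
        if ((static_cast<std::uint8_t>(*first) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

int main()
{
    std::string s = "na\xC3\xAFve";              // "naïve", contiguous storage
    std::list<char> l(s.begin(), s.end());       // same bytes, non-contiguous storage

    std::cout << count_code_points(s.begin(), s.end()) << '\n';  // 5
    std::cout << count_code_points(l.begin(), l.end()) << '\n';  // 5
}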
All of this is trivial to implement quickly with my Unicode library.
No, it is not.
I know better what I described and what my library is capable of, thank you.
Your Unicode library is locale-agnostic, which makes it quite useless in too many cases.
In the common case, you don't care (nor want to care) about a locale.
Almost every added function was locale-sensitive:
- search
- collation
- case handling
And so on. This is a major drawback of your library: it is not capable of doing the locale-sensitive algorithms that are the vast majority of the Unicode algorithms.
Search up to the combining character sequence boundary is locale-agnostic. Search up to the grapheme boundary is virtually locale-agnostic (Unicode does not distribute locale alternatives, though it does hint at their possibility).

Case folding only has a couple of characters that are specific to Turkish, making it quite reasonably locale-agnostic.

Collation depends on a special table; Unicode only provides a default one, which aims at being as locale-agnostic as possible. It also hosts a repository where one can get alternative tables.

Anyway, those are mere details; you can always change the backend for one tailored to your locale.
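To illustrate the case-folding remark, a small sketch, again using ICU only because it ships the relevant data (the fold options are ICU's, not part of either proposal): default case folding and Turkic case folding differ only on the dotted/dotless I family; every other character folds identically.

#include <unicode/unistr.h>
#include <unicode/uchar.h>
#include <iostream>
#include <string>

int main()
{
    icu::UnicodeString capital_i("I");

    // Default case folding: I -> i.
    std::string def;
    icu::UnicodeString(capital_i).foldCase(U_FOLD_CASE_DEFAULT).toUTF8String(def);

    // Turkic case folding: I -> dotless i (U+0131).
    std::string tr;
    icu::UnicodeString(capital_i).foldCase(U_FOLD_CASE_EXCLUDE_SPECIAL_I).toUTF8String(tr);

    std::cout << def << ' ' << tr << '\n';   // prints: i ı
}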