
On 28/01/2011 14:58, Artyom wrote:
What am I paying for? I don't see how I gain anything.
You don't pay for validation of the UTF-8, especially when 99% of the uses of the string are encoding-agnostic.
I asked for what I gained, not what I did not lose.
// UTF validation
bool is_valid_utf() const;
See, that's what makes the whole thing pointless.
Actually not, consider:
socket.read(my_string);
if(!my_string.is_valid_utf()) ....
Could be a free function, and would actually be *better* as a free function, because you could apply it on any range, not just your type.
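As an illustration of that point, here is a minimal sketch of such a free function, assuming nothing beyond the standard library: a UTF-8 validity check templated on the iterator type, so it applies to a std::string, a vector of bytes, or any other byte range. The name is_valid_utf8 and its interface are hypothetical, not taken from either library under discussion.

#include <cstdint>

template <typename Iterator>
bool is_valid_utf8(Iterator first, Iterator last)
{
    while (first != last)
    {
        std::uint8_t lead = static_cast<std::uint8_t>(*first++);
        int trailing = 0;
        std::uint32_t cp = 0;

        if (lead < 0x80)      continue;                        // ASCII
        else if (lead < 0xC2) return false;                    // stray continuation or overlong lead
        else if (lead < 0xE0) { trailing = 1; cp = lead & 0x1F; }
        else if (lead < 0xF0) { trailing = 2; cp = lead & 0x0F; }
        else if (lead < 0xF5) { trailing = 3; cp = lead & 0x07; }
        else                  return false;                    // beyond U+10FFFF

        for (int i = 0; i < trailing; ++i)
        {
            if (first == last) return false;                   // truncated sequence
            std::uint8_t b = static_cast<std::uint8_t>(*first++);
            if ((b & 0xC0) != 0x80) return false;              // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp > 0x10FFFF) return false;                       // out of Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;        // UTF-16 surrogate
        if (trailing == 2 && cp < 0x800) return false;         // overlong 3-byte form
        if (trailing == 3 && cp < 0x10000) return false;       // overlong 4-byte form
    }
    return true;
}

Applied to the example above, that gives socket.read(my_string); if(!is_valid_utf8(my_string.begin(), my_string.end())) ...., with my_string remaining a plain std::string.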
Your type doesn't add any semantic value on top of std::string, it's just an agglomeration of free functions into a class. That's a terrible design. The only advantage that a specific type for unicode strings would bring is that it could enforce certain useful invariants.
You don't need to enforce things you don't care about in 99% of cases.
You don't get the point. Your type doesn't add any information on top of std::string. Therefore it is meaningless. It's just an agglomeration of functions; in C++ we use namespaces for that, not classes.
Enforcing that the string is in a valid UTF encoding and is normalized in a specific normalization form can make most Unicode algorithms several orders of magnitude faster.
You do not always want to normalize text. It is the user's choice; you may have optimized algorithms for already-normalized strings, but that is not always the case.
If my strings are valid and normalized, I can compare them with a simple binary-level comparison; likewise for substring search, where I may also need to add a boundary check if I want fine-grained search. What you want to do is implement comparison by iterating through each lazily computed code point and comparing them. This is at least 60 times as slow; it also doesn't really compare equivalent characters in the strings.

To get correct behaviour when comparing strings, they should be normalized. Normalization is costly, so you don't want to do it at each comparison, but only once. In practice, all data available everywhere should already be in NFC (XML mandates it, for example), and checking whether a string is normalized is very fast (though less fast than checking whether a string is valid UTF-8, since you still need to access a table, which might hurt the cache, and is not vectorizable).

Dealing with potentially invalid UTF strings can be highly dangerous as well; exploits for that kind of thing are commonplace. I suspect denormalized Unicode could be sensitive too, since in some parts of your application U+00E0 (à) and U+0061 U+0300 (a + combining grave) could compare equal but not in others, depending on what that string went through, causing inconsistencies.

Anyway, the only value we can bring on top of the range abstraction is by establishing invariants. It makes sense to establish the strongest one, though I am not opposed to just checking for UTF validity. But no checking at all? There is no point. You might as well make your string type a typedef of std::string.
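To make the "normalize once, then compare cheaply" point concrete, here is a sketch that uses ICU purely as a stand-in for some normalization backend (nobody in this thread is proposing ICU itself): the two spellings of à differ byte-wise, but after a single NFC pass at the boundary they compare equal with nothing more than std::string's binary operator==.

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>
#include <string>

// Normalize a UTF-8 string to NFC once, at the boundary of the application.
std::string to_nfc(const std::string& utf8)
{
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
    icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
    icu::UnicodeString normalized = nfc->normalize(s, status);
    std::string out;
    normalized.toUTF8String(out);
    return out;
}

int main()
{
    std::string composed   = "\xC3\xA0";      // U+00E0, precomposed "à"
    std::string decomposed = "a\xCC\x80";     // U+0061 U+0300, "a" + combining grave

    std::cout << (composed == decomposed) << '\n';                  // 0: raw bytes differ
    std::cout << (to_nfc(composed) == to_nfc(decomposed)) << '\n';  // 1: byte-equal after NFC
}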
Also, what kind of normalization? NFC? NFKC?
NFC, of course. It takes less space and doesn't make you lose anything. If you want to work in decomposed forms or something else, use your own container and not the adaptor. Remember, this whole thing is just there to help you deal with the general case in a practical, correct and efficient way. The real algorithms are fully generic, and allow you to do whatever you want; they accept both normalized and un-normalized strings, data regardless of its memory layout, etc.
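As a sketch of what "fully generic" means here (the function below and its name are mine, not taken from the library being discussed, and it assumes the input is already valid UTF-8): an algorithm written against an iterator concept runs unchanged over contiguous and non-contiguous storage alike.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <string>

template <typename ByteIterator>
std::size_t count_code_points(ByteIterator first, ByteIterator last)
{
    std::size_t n = 0;
    for (; first != last; ++first)
    {
        // Count every byte that is not a UTF-8 continuation byte (10xxxxxx).
        if ((static_cast<std::uint8_t>(*first) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

int main()
{
    std::string s = "na\xC3\xAFve";              // "naïve", contiguous storage
    std::list<char> l(s.begin(), s.end());       // same bytes, non-contiguous storage

    std::cout << count_code_points(s.begin(), s.end()) << '\n';  // 5
    std::cout << count_code_points(l.begin(), l.end()) << '\n';  // 5
}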
All of this is trivial to implement quickly with my Unicode library.
No, it is not.
I know better what I described and what my library is capable of, thank you.
Your Unicode library is locale-agnostic, which makes it quite useless in too many cases.
In the common case, you don't care (nor want to care) about a locale.
Almost every added function was locale-sensitive:
- search
- collation
- case handling
And so on. This is a major drawback of your library: it is not capable of doing the locale-sensitive algorithms that are the vast majority of the Unicode algorithms.
Search up to the combining character sequence boundary is locale-agnostic. Search up to the grapheme boundary is virtually locale-agnostic (Unicode does not distribute locale alternatives, though it does hint at their possibility).

Case folding only has a couple of characters that are specific to Turkish, making it quite reasonably locale-agnostic.

Collation depends on a special table; Unicode only provides a default one, which aims at being as locale-agnostic as possible. It also hosts a repository where one can get alternative tables.

Anyway, those are mere details; you can always change the backend for one tailored to your locale.
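To illustrate the case-folding remark, a small sketch, again using ICU only because it ships the relevant data (the fold options are ICU's, not part of either proposal): default case folding and Turkic case folding differ only on the dotted/dotless I family; every other character folds identically.

#include <unicode/unistr.h>
#include <unicode/uchar.h>
#include <iostream>
#include <string>

int main()
{
    icu::UnicodeString capital_i("I");

    // Default case folding: I -> i.
    std::string def;
    icu::UnicodeString(capital_i).foldCase(U_FOLD_CASE_DEFAULT).toUTF8String(def);

    // Turkic case folding: I -> dotless i (U+0131).
    std::string tr;
    icu::UnicodeString(capital_i).foldCase(U_FOLD_CASE_EXCLUDE_SPECIAL_I).toUTF8String(tr);

    std::cout << def << ' ' << tr << '\n';   // prints: i ı
}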