
Erik Wien wrote:
"Rogier van Dalen" <rogiervd@gmail.com> wrote in message
I think the best solution is to store the string in the form it was originally received (decomposed or not), and instead provide composition functions, or even iterator wrappers that compose on the fly. That would allow composed strings to be used where needed (like in an XML library), without imposing that requirement on all other users.
I don't think I can agree with that. If you do a lot of input/output this might yield better performance, but even when reading XML you probably need to compare strings a lot, and if they are not normalised, that will take a lot of processing. Correct me if I'm wrong, but a simple comparison of two non-normalised Unicode strings would require looking up each character in the Unicode Character Database, decomposing it, gathering the base characters and combining marks, ordering the marks, and only then comparing; and this must be done for every character. I don't have any numbers, of course, but I have a feeling it is going to be really, really slow.
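The problem being described is canonical equivalence: two strings can be "the same" text while differing code point by code point. The thread is about a C++ library, but as a quick, language-neutral illustration, Python's `unicodedata` module (which exposes the Unicode normalisation forms) shows why a naive comparison fails and why normalisation is needed first:

```python
import unicodedata

# Two canonically equivalent spellings of "é":
precomposed = "\u00E9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # U+0065 LATIN SMALL LETTER E + U+0301 COMBINING ACUTE

# A naive code-point comparison says they differ:
print(precomposed == decomposed)   # False

# Comparing correctly means normalising both sides first
# (NFD = canonical decomposition, one of the Unicode normal forms):
print(unicodedata.normalize("NFD", precomposed) ==
      unicodedata.normalize("NFD", decomposed))   # True
```

Doing that normalisation lazily inside every comparison is exactly the per-character database lookup and reordering cost described above.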
You are quite correct... It is slow. And that is why I am hesitant to make decomposition something that will happen every time you assign something to a string.
What this really boils down to, is what kind of usage pattern is the most common? The library should be written to provide the best performance on the operations most people do.
How about this:

- when initialising or assigning to a string, you can opt to normalise one way or the other, or not at all;
- normalised strings are flagged as such;
- normalisation on comparison can be skipped if both strings are flagged as being normalised the same way;
- normalisation on assignment can be skipped if the right-hand side is flagged as being normalised appropriately.

Then the user can choose to normalise whichever way is best for his application, but without breaking interoperability with libraries that require something different (or produce unnormalised strings); there's a speed penalty for renormalising, but that seems inevitable. Obviously there's also a speed penalty for checking normalisation flags repeatedly at run-time, but I don't think it would be too bad.

Ben.
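The flagging scheme can be sketched in a few lines. This is a hypothetical illustration, not the proposed library's API; the class name `UString` and the choice of NFD for the slow path are assumptions, and a real C++ implementation would carry the flag alongside the character data rather than in a wrapper object:

```python
import unicodedata

class UString:
    """A string tagged with the normalisation form it is known to be in
    ('NFC', 'NFD', or None for unknown/unnormalised)."""

    def __init__(self, text, form=None):
        # Optionally normalise on construction/assignment, recording the form.
        self.text = unicodedata.normalize(form, text) if form else text
        self.form = form

    def __eq__(self, other):
        # Fast path: both sides flagged with the same form -> compare directly.
        if self.form is not None and self.form == other.form:
            return self.text == other.text
        # Slow path: renormalise both sides before comparing.
        return (unicodedata.normalize("NFD", self.text) ==
                unicodedata.normalize("NFD", other.text))

a = UString("\u00E9", form="NFC")
b = UString("e\u0301", form="NFC")   # normalised to NFC on assignment
print(a == b)                         # True, via the fast path
```

The run-time cost of the flag check is one comparison per operation; the renormalisation cost is only paid when an unflagged or differently flagged string enters the mix, which matches the trade-off described above.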