Re: [boost] Call for interest for native unicode character and string support in boost

From: Rogier van Dalen <rogiervd@gmail.com>
Subject: Re: [boost] Call for interest for native unicode character and string support in boost
Dear Rogier,
DWORD is actually uint32_t.
I know, but - correct me if I'm wrong - I think the sort keys consist of 16-bit integers.
This depends on how we implement it, and on whether we implement or allow for more than 4-level comparison - if so, the 5+ level comparison, if coded, may require uint32_t. What do you think? I am happy to go to uint16_t and limit to 4 levels, as this is a reasonable performance/implementation trade-off that I have previously used with success.
I believe collation must be here, as there will probably be several containers with Unicode characteristics and this is a good level for them to work on.
I'm not sure - seeing as I'd expect us to come up with at least the option to keep strings in some Normalisation Form, the first step of the collation algorithm should be skipped in some cases. Furthermore, individual comparisons should probably be done without producing sort keys.
String equality [=] would be handled directly using on-the-fly comparison of canonically decomposed forms, as I previously mentioned and as you correctly mention. String equivalence [<] would be handled using sort levels 4+.
Not being a native speaker of English I don't think I understand what a "coding hit" is.
Reduction in performance - your English is better than a lot of the British!
I'm assuming that the caller usually knows the start and end. (I'm obviously assuming a C++-style string and not a C-style zero-terminated one.) In this case, the caller must check whether the end has been reached and whether it's at begin, pass in 0 in those cases, and then start_of_grapheme has to check whether 0 was passed in. To me it seems this would be slower, both to code and to execute.
Performance can be argued back and forth, but when I looked at optimising the performance I needed three iterators, as iterating graphemes is not a 'cheap' operation - and I moved the iterators along in series [step forward is curr->prev, next->curr, calculate next]. In 64-bit code iterating a uint32_t container, iterators can introduce high costs. The real problem, though, is that iterators do not work across DLL boundaries unless you have the code for both the DLL and the caller. Have a look at my recently posted implementation. This implementation proposal would mean you can produce a DLL for sale that has a custom Unicode control in it - without having to have the source code for the iterators that it needs and providing an interface for each iterator type, or placing the whole of the Unicode data in that DLL and preventing the use of custom characters.
Furthermore, one had better abstract away from the implementation as much as possible, and I don't think it is relevant to the caller how far the functions want to look ahead. (I don't think "next" is needed for the grapheme_cluster case anyway.)
A good example of next grapheme and start of grapheme is the default behaviour when moving a cursor left/right - the cursor may be on a grapheme [or optionally inside a grapheme if the editor supports this, as is optionally recommended in the standard] and you need to find the previous/next grapheme.
Ermmm... what's this doing then on p. 137 of Unicode standard 4.0.0?
"Characters may have case mappings that depend on the locale. The principal example is Turkish, ..."
Ouch - I had not seen that before. I believe this means that the upper/lower case conversions will need to take a locale, to allow for these special extensions at a later date. I have not seen any written rules for this - I guess because I had never worked with Turkish - but it should be allowed for. Does this mean that sorting will also require a locale? Please let me know what you think, and if you can spot any other assumptions I have made in my attachment posted on the 28th. Yours, Graham

Hi Graham, There should be a message on your most recent header proposal on the list; if you don't see it, please let me know. On 7/30/05, Graham <Graham@system-development.co.uk> wrote:
...
I know, but - correct me if I'm wrong - I think the sort keys consist of 16-bit integers.
This depends on how we implement it - and if we implement or allow for more than 4 level comparison - if so the 5+ comparison if coded may require uint32_t.
What do you think? I am happy to go to uint16_t and limit to 4 levels as this is a reasonable performance/ implementation trade-off that I have previously used with success.
I hadn't thought of that, actually. But as for the sort keys: I was actually thinking of packing the key *with* the string, so that the string itself can be used as a tie-breaker (cos that's what you need the 32 bits for, isn't it?).

    class string_with_sort_key {
        const string s;
        std::vector <uint16_t> sort_key;
    public:
        string_with_sort_key (const string &);
    };

This would make a good key for, say, std::map: it would implicitly convert from string when required, e.g. when calling "find" with a string. I think the string itself would be needed in case another locale is required, and as a tie-breaker, but I haven't fully thought this through yet. What do you think about this?
done without producing sort keys.
String equality [=] would be handled directly using on the fly comparison of canonically decomposed forms as I previously mentioned, and as you correctly mention.
String equivalence [<] would be handled using sort level 4+.
I'd rather use binary comparison for the normal operator< and operator==: that's what the Unicode standard recommends for speed (5.18). For UTF-8 and UTF-32 it's as fast as normal string comparison, and for UTF-16 it's only slightly slower. C++ has this habit of passing comparison objects (like std::less) to algorithms and containers that would nicely fit in with the need for locales. Basically, a "collate" object would contain the information needed to compare strings for a given locale on a certain level.
Not being a native speaker of English I don't think I understand what a "coding hit" is.
Reduction in performance - your English is better than a lot of the British!
Being a linguist (here meaning someone who knows things about linguistics) I should be of the opinion that written language isn't really language, so that doesn't really count - but thank you anyway! :-)
I'm assuming that the caller usually knows the start and end. (I'm obviously assuming a C++-style string and not a C-style zero-terminated one.) In this case, the caller must check whether the end has been reached and whether it's at begin, pass in 0 in those cases, and then start_of_grapheme has to check whether 0 was passed in. To me it seems this would be slower, both to code and to execute.
Performance can be argued back and forth, but when I looked at optimising the performance I needed three iterators, as iterating graphemes is not a 'cheap' operation - and I moved the iterators along in series [step forward is curr->prev, next->curr, calculate next]. In 64-bit code iterating a uint32_t container, iterators can introduce high costs.
I'm not sure I see what you mean, sorry. I'd probably have to see the code. If I am correct, however, for advancing to the next grapheme boundary you don't need a "previous" iterator, because it only has a look-back (if that's the antonym of lookahead) of one character max.
The real problem though is that iterators do not work across DLL boundaries unless you have the code for the DLL and the caller. Have a look at my recently posted implementation. This implementation proposal would mean you can produce a DLL for sale that has a custom Unicode control in it - without having to have the source code for the iterators that it needs and providing an interface for each iterator type, or placing the whole of the Unicode data in that DLL and preventing the use of custom characters.
Whoops... I'm only now starting to see that we probably have slightly different headers in mind: I'm thinking of what the user needs, you're thinking of what the "database" needs... sorry I didn't realise that. Obviously, the DLL shouldn't use iterators, and should probably have, well ermmm, just about the interface you propose. Hopefully this realisation will get us somewhere because I'm sure any iterator type can be used for processing, even if the database returns plain pointers. It'll probably mean the database interface should be as low-level as possible and any smarts should be outside it. These should definitely be different header files, too... For example, I think the database (DLL) should provide the Grapheme_Cluster_Break property for every character; the actual algorithm could be templated, allowing any iterator to be used.
...
(I don't think "next" is needed for the grapheme_cluster case anyway.)
Correcting myself: I meant to say "previous" rather than "next", but see up.
...
Ermmm... what's this doing then on p. 137 of Unicode standard 4.0.0?
"Characters may have case mappings that depend on the locale. The principal example is Turkish, ..."
Ouch - I had not seen that before.
I believe that this would mean that the upper/ lower case conversions will need to take a locale to allow for these special extensions at a later date.
I have not seen any written rules for this - I guess because I had never worked with Turkish - but it should be allowed for.
Does this mean that sorting will also require a locale?
In my latest header proposal, I've grouped both functions under one "locale" object, and I think this would be fast, allow for (later) adding tailoring for language, and yield a reasonable syntax. What do you think? Regards, Rogier
participants (2)
- Graham
- Rogier van Dalen