Re: [boost] Call for interest for native unicode character and string support in boost - updated definition 3

Dear Rogier,
What do you think? I am happy to go to uint16_t and limit to 4 levels as
this is a reasonable performance/ implementation trade-off that I have
previously used with success.
I hadn't thought of that, actually. But as for the sort keys: I was
actually thinking of packing the key *with* the string, so that the
string itself can be used as a tie-breaker (cos that's what you need
the 32 bits for, isn't it?).
class string_with_sort_key
{
const string s;
std::vector <uint16_t> sort_key;
public:
string_with_sort_key (const string &);
};
This would make a good key for, say, std::map: it woul implicitly
converting from string when required, e.g. when calling "find" with a
string. I think the string itself would be needed in case another
locale is required, and as a tie-breaker, but I haven't fully thought
this through yet. What do you think about this?
Yes - almost exactly what I had in mind. We can then use the sort data to level 4 with uint16_t and the string is present. The final flourish might be to have a version that has the canonical decom of the string present as well for times when you hit a sort tie. I have never needed to handle a sort tie in this way, but we have the flexibility under this scheme to even have two classes string_with_sort_key and string_with_sort_key_and_decomp as this is just a wrapper round the actual processing call and developers can then use whichever they prefer for the memory/ complexity tradeoffs on their system.
done without producing sort keys.
String equality [=] would be handled directly using on the fly
comparison of canonically decomposed forms as I previously mentioned,
and as you correctly mention.
String equivalence [<] would be handled using sort level 4+.
I'd rather use binary comparison for the normal operator< and
operator==: that's what the Unicode standard recommends for speed
(5.18). For UTF-8 and UTF-32 it's as fast as normal string comparison,
and for UTF-16 it's only slightly slower. C++ has this habit of
passing comparison objects (like std::less) to algorithms and
containers that would nicely fit in with the need for locales.
Basically, a "collate" object would contain the information needed to
compare strings for a given locale on a certain level.
Fine - I don't have a problem with it - the only reason that I have prevented it in the past was to ensure good programming practice [i.e. force use of the collation version] at the cost of removing the simple version.
I'm not sure I see what you mean, sorry. I'd probably have to see the
code. If I am correct, however, for advancing to the next grapheme
boundary you don't need a "previous" iterator, because it only has a
look-back (if that's the antonym of lookahead) of one character max.
I used to think that I only needed two characters [current and next] for grapheme testing but then I hit some oddities. To do a reliable test you actually do need the previous character for: <cr><lf> as opposed to <lf> i.e. 123<lf>, the <lf> is a grapheme start but 123<cr><lf> the <lf> is not a grapheme start Hangul syllable sequences Base char + virama + letter And a few of the others...
Whoops... I'm only now starting to see that we probably have slightly
different headers in mind: I'm thinking of what the user needs, you're
thinking of what the "database" needs... sorry I didn't realise that.
Obviously, the DLL shouldn't use iterators, and should probably have,
well ermmm, just about the interface you propose. Hopefully this
realisation will get us somewhere because I'm sure any iterator type
can be used for processing, even if the database returns plain
pointers. It'll probably mean the database interface should be as
low-level as possible and any smarts should be outside it. These
should definitely be different header files, too...
For example, I think the database (DLL) should provide the
Grapheme_Cluster_Break property for every character; the actual
algorithm could be templated, allowing any iterator to be used.
Great - this is an extremely important breakthrough in mindset - you have a fixed interface for sharing round third party DLLs etc, but can write templated iterators around the interface calls.
...
(I don't think "next" is
needed for the grapheme_cluster case anyway.)
Correcting myself: I meant to say "previous" rather than "next", but see up.
Moving a cursor left requires previous grapheme.
Does this mean that sorting will also require a locale?
In my latest header proposal, I've grouped both functions under one
"locale" object, and I think this would be fast, allow for (later)
adding tailoring for language, and yield a reasonable syntax. What do
you think?
What the implementation does internally and whether it decides to do use a locale class should not be specified at this level. I think it would be simpler to just pass a locale ID in, and then let the implementation move the few exceptions around dependent on the locale using whatever internal technique it wanted. Let me know what you think of the latest iteration that shows the interface into the character implementation and around which we will probably added templated inline calls for iterators etc. Yours, Graham
participants (1)
-
Graham