Re: [boost] Call for interest for native unicode character and string support in boost - updated definition 3 - Boost

31 Jul 2005

      Dear Rogier,
...
...
What do you think? I am happy to go to uint16_t and limit to 4
levels as
...
...
this is a reasonable performance/ implementation trade-off that I
have
...
...
previously used with success.
...
I hadn't thought of that, actually. But as for the sort keys: I was
...
actually thinking of packing the key *with* the string, so that the
...
string itself can be used as a tie-breaker (cos that's what you need
...
the 32 bits for, isn't it?).
...
class string_with_sort_key
...
{
...
const string s;
...
std::vector <uint16_t>  sort_key;
...
public:
...
string_with_sort_key (const string &);
...
};
...
This would make a good key for, say, std::map: it woul implicitly
...
converting from string when required, e.g. when calling "find" with a
...
string. I think the string itself would be needed in case another
...
locale is required, and as a tie-breaker, but I haven't fully thought
...
this through yet. What do you think about this?
Yes - almost exactly what I had in mind. We can then use the sort data
to level 4 with uint16_t and the string is present.

The final flourish might be to have a version that has the canonical
decom of the string present as well for times when you hit a sort tie.

I have never needed to handle a sort tie in this way, but we have the
flexibility under this scheme to even have two classes
string_with_sort_key and string_with_sort_key_and_decomp as this is just
a wrapper round the actual processing call and developers can then use
whichever they prefer for the memory/ complexity tradeoffs on their
system.
...
...
...
done without producing sort keys.
...
...
...
...
String equality [=] would be handled directly using on the fly
...
...
comparison of canonically decomposed forms as I previously
mentioned,
...
...
and as you correctly mention.
...
...
...
...
String equivalence [<] would be handled using sort level 4+.
...
I'd rather use binary comparison for the normal operator< and
...
operator==: that's what the Unicode standard recommends for speed
...
(5.18). For UTF-8 and UTF-32 it's as fast as normal string comparison,
...
and for UTF-16 it's only slightly slower. C++ has this habit of
...
passing comparison objects (like std::less) to algorithms and
...
containers that would nicely fit in with the need for locales.
...
Basically, a "collate" object would contain the information needed to
...
compare strings for a given locale on a certain level.
Fine - I don't have a problem with it - the only reason that I have
prevented it in the past was to ensure good programming practice [i.e.
force use of the collation version] at the cost of removing the simple
version.
...
I'm not sure I see what you mean, sorry. I'd probably have to see the
...
code. If I am correct, however, for advancing to the next grapheme
...
boundary you don't need a "previous" iterator, because it only has a
...
look-back (if that's the antonym of lookahead) of one character max.
I used to think that I only needed two characters [current and next] for
grapheme testing but then I hit some oddities.

To do a reliable test you actually do need the previous character for:

<cr><lf> as opposed to <lf> i.e. 123<lf>, the <lf> is a grapheme start
but 123<cr><lf> the <lf> is not a grapheme start

Hangul syllable sequences

Base char + virama + letter

And a few of the others...
...
Whoops... I'm only now starting to see that we probably have slightly
...
different headers in mind: I'm thinking of what the user needs, you're
...
thinking of what the "database" needs... sorry I didn't realise that.
...
Obviously, the DLL shouldn't use iterators, and should probably have,
...
well ermmm, just about the interface you propose. Hopefully this
...
realisation will get us somewhere because I'm sure any iterator type
...
can be used for processing, even if the database returns plain
...
pointers. It'll probably mean the database interface should be as
...
low-level as possible and any smarts should be outside it. These
...
should definitely be different header files, too...
...
For example, I think the database (DLL) should provide the
...
Grapheme_Cluster_Break property for every character; the actual
...
algorithm could be templated, allowing any iterator to be used.
Great - this is an extremely important breakthrough in mindset - you
have a fixed interface for sharing round third party DLLs etc, but can
write templated iterators around the interface calls.
...
...
...
...
...
...
(I don't think "next" is
...
...
needed for the grapheme_cluster case anyway.)
...

...
Correcting myself: I meant to say "previous" rather than "next", but
see up.
Moving a cursor left requires previous grapheme.
...
...
Does this mean that sorting will also require a locale?
...

...
In my latest header proposal, I've grouped both functions under one
...
"locale" object, and I think this would be fast, allow for (later)
...
adding tailoring for language, and yield a reasonable syntax. What do
...
you think?
What the implementation does internally and whether it decides to do use
a locale class should not be specified at this level.

I think it would be simpler to just pass a locale ID in, and then let
the implementation move the few exceptions around dependent on the
locale using whatever internal technique it wanted.

Let me know what you think of the latest iteration that shows the
interface into the character implementation and around which we will
probably added templated inline calls for iterators etc. 

Yours,

Graham

Re: [boost] Call for interest for native unicode character and string support in boost - updated definition 3

Graham

tags

participants (1)