Re: [boost] String Algorithms Library: Case insensitive compare UTF-8

28 Aug 2008

      ...
...
...
Martin Lutken wrote:
Does anyone know how this could be made possible?
I suppose I need a locale facet like std::ctype, but one that works
for UTF-8 and not just for ASCII a-z, A-Z. I guess the information in
a table like this
(http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.
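As a rough illustration of how such a table might be used: each data
line in CaseFolding.txt has the form "code; status; mapping; # name".
The sketch below reads the one-to-one entries (status C and S) into a
std::map. The names load_simple_folds, fold and the example function
are hypothetical, not part of Boost or StringAlgo, and full (F) and
Turkic (T) mappings are deliberately skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Build a simple (one-to-one) case-folding map from text in the
// CaseFolding.txt format, e.g. "0041; C; 0061; # LATIN CAPITAL LETTER A".
// Hypothetical helper: keeps only status C (common) and S (simple).
std::map<char32_t, char32_t> load_simple_folds(std::istream& in) {
    std::map<char32_t, char32_t> folds;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;  // blank or comment
        std::istringstream fields(line);
        std::string code, status, mapping;
        if (!std::getline(fields, code, ';')) continue;
        if (!std::getline(fields, status, ';')) continue;
        if (!std::getline(fields, mapping, ';')) continue;
        const std::size_t p = status.find_first_not_of(" \t");
        if (p == std::string::npos) continue;
        const char kind = status[p];
        if (kind != 'C' && kind != 'S') continue;  // skip F and T entries
        // std::stoul skips leading whitespace; fields are hexadecimal.
        const char32_t from = static_cast<char32_t>(std::stoul(code, nullptr, 16));
        const char32_t to = static_cast<char32_t>(std::stoul(mapping, nullptr, 16));
        folds[from] = to;
    }
    return folds;
}

// Fold one code point; code points without an entry map to themselves.
char32_t fold(const std::map<char32_t, char32_t>& folds, char32_t c) {
    const auto it = folds.find(c);
    return it == folds.end() ? c : it->second;
}

// Usage example on two sample lines (in practice, pass a std::ifstream).
void load_simple_folds_example() {
    std::istringstream sample(
        "0041; C; 0061; # LATIN CAPITAL LETTER A\n"
        "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S\n");
    const auto folds = load_simple_folds(sample);
    assert(folds.size() == 1 && fold(folds, U'A') == U'a');
}
```

Note that this only helps once the text is available as code points;
it does not by itself make byte-oriented algorithms UTF-8 aware.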
This might not work out of the box. The StringAlgo library is designed
around sequences of characters. Since UTF-8 is a variable-width
character encoding, the algorithms in the library would not work as
expected.
To make it work, you will need a container with iterators that
iterate over meta-characters, not bytes.
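To make the bytes-versus-characters distinction concrete, here is a
minimal sketch that decodes a UTF-8 byte string into a sequence of
code points. decode_utf8 is a hypothetical helper, not a Boost or
StringAlgo function. It replaces malformed bytes with U+FFFD and, to
stay short, does not reject overlong forms or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into code points (hypothetical helper, sketch only).
// Malformed or truncated sequences yield U+FFFD and decoding resumes at
// the next byte. Overlong forms and surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)              { cp = lead;        extra = 0; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < in.size());  // all continuation bytes present?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;  // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        ok = ok && cp <= 0x10FFFF;  // outside the Unicode code space
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Usage example: "a", U+00E9 and U+20AC take 6 bytes but are 3 code points.
void decode_utf8_example() {
    const std::u32string cps = decode_utf8("a\xC3\xA9\xE2\x82\xAC");
    assert(cps.size() == 3 && cps[2] == 0x20AC);
}
```

An iterator adaptor that decodes lazily would avoid the copy; the
eager conversion is used here only because it is the shortest way to
show the idea.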
...
If it's better/easier just to convert the string to UTF-32 before
doing case-insensitive compares and replaces, I could live with that.
If you meant UTS-32 and you have a corresponding locale
implementation, then this approach is a viable solution.
Sorry, what is UTS-32? I tried to Google it: 351 results, with none
of them looking related to character encoding.
I found this article on Wikipedia on UTF-32/UCS-4:
http://en.wikipedia.org/wiki/UTF-32
Is it not what I need?
I suspect that many people must have run into similar problems.
Perhaps we should add a 32-bit string class to Boost. And until I get
a better understanding, I will keep calling it UTF-32 :-)
...
Sorry, I mixed it up a little. I meant UCS-4, a.k.a. fixed-width
encoding. I was not aware that UTF-32 is de facto the same.
Anyway, the statement about usability with StringAlgo still holds. It
can work with any fixed-size encoding, as long as you have the
corresponding locales.
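As a sketch of what a fixed-width comparison can look like, the code
below compares two UTF-32 strings one code point at a time. It does
not use StringAlgo or a locale; fold_simple and iequals_utf32 are
hypothetical names, and the folding covers only ASCII and the Latin-1
capitals as a stand-in for a real locale or the full CaseFolding.txt
data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One-to-one case folding for a small illustrative subset only:
// ASCII A-Z and Latin-1 U+00C0..U+00DE (U+00D7 is the multiplication
// sign and is left alone). A real implementation needs the full table.
char32_t fold_simple(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;
    return c;
}

// Case-insensitive equality on fixed-width (UTF-32) strings.
// With one-to-one folding, equal strings must have equal length; this
// would no longer hold with full folding (e.g. sharp s versus "ss").
bool iequals_utf32(const std::u32string& a, const std::u32string& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (fold_simple(a[i]) != fold_simple(b[i])) return false;
    }
    return true;
}

// Usage example: U+00C9 (capital E acute) folds to U+00E9.
void iequals_utf32_example() {
    assert(iequals_utf32(U"\u00C9COLE", U"\u00E9cole"));
    assert(!iequals_utf32(U"abc", U"abd"));
}
```

Because every element is a whole code point, indexing and
element-wise comparison behave the way the byte-oriented algorithms
assume for ASCII.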
...
It could theoretically also work with variable-width characters,
provided you have a container/locale framework that allows operating
on metacharacters.
I'm not sure how efficient it will be, though.
Best regards,
Pavol.
Martin,
The Unicode library I posted in the vault will do what you want for
arbitrary characters. It allows you to take two UTF-32 Unicode strings
and compare them at different comparison levels [e.g. exact, case
insensitive, etc.], and includes the calls necessary for iterating
characters [which may be more than a single 4-byte character, e.g.
surrogates can be 3 x 'UTF-32' numbers], and then of course the
ability to iterate graphemes.

It will allow you to do a case insensitive comparison using the full
Unicode character library specification.

The only thing we did not have time to do was the string wrapper
class.

Feel free to work on that !

Thanks.

Yours,

Graham
