Re: [boost] String Algorithms Library: Case insensitive compare UTF-8

Martin Lutken wrote:
Anyone who knows how this could be made possible? I suppose I need a locale facet like std::ctype, but one that works for UTF-8 and not just for ASCII a-z, A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.
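To make the table idea concrete, here is a minimal sketch of loading such a file into a lookup table. It assumes the published line layout of CaseFolding.txt (`<code>; <status>; <mapping>; # <name>`) and keeps only the single-code-point mappings (status C and S). The function name `load_simple_folding` is made up for illustration, and malformed hex fields are not handled.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Build a simple case-folding table from text in CaseFolding.txt format:
//   <code>; <status>; <mapping>; # <name>
// Only statuses C (common) and S (simple) are kept, so every code point
// folds to exactly one code point. F (full) and T (Turkic) are skipped.
// Hypothetical helper, not part of any library; std::stoul will throw
// if a code or mapping field is not valid hex.
std::map<char32_t, char32_t> load_simple_folding(std::istream& in)
{
    std::map<char32_t, char32_t> table;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;      // blank or comment
        std::istringstream fields(line);
        std::string code, status, mapping;
        if (!std::getline(fields, code, ';') ||
            !std::getline(fields, status, ';') ||
            !std::getline(fields, mapping, ';')) continue;  // not a data line
        // The status field looks like " C"; skip the leading space(s).
        const std::size_t s = status.find_first_not_of(' ');
        if (s == std::string::npos) continue;
        const char kind = status[s];
        if (kind != 'C' && kind != 'S') continue;
        // std::stoul skips leading whitespace before the hex digits.
        table[static_cast<char32_t>(std::stoul(code, nullptr, 16))] =
            static_cast<char32_t>(std::stoul(mapping, nullptr, 16));
    }
    return table;
}
```

Passing a `std::ifstream` opened on a local copy of the file would fill the table. Entries with status F map one code point to several (for example U+00DF to "ss"), so they need a different table shape and are left out of this sketch.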
This might not work out of the box. The StringAlgo library is designed around sequences of characters. Since UTF-8 is a variable-width encoding, the algorithms in the library would not work as expected.
To make it work, you will need a container with iterators that iterate over meta-characters, not bytes.
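As an illustration of the bytes versus meta-characters distinction, the following sketch decodes a UTF-8 byte string into one `char32_t` per code point. It is not taken from StringAlgo or any other library; `decode_utf8` is a hypothetical name, and the code assumes well-formed input (no validation of overlong forms, stray continuation bytes, or encoded surrogates).

```cpp
#include <cstddef>
#include <string>

// Decode a UTF-8 byte string into UTF-32 code points.
// Sketch only: assumes well-formed input and performs no validation.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }  // 1 byte
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }  // 2 bytes
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }  // 3 bytes
        else                  { cp = lead & 0x07; extra = 3; }  // 4 bytes
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < bytes.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For the three-byte input `"\xC3\x85" "b"` (the letter Å followed by b), the result holds two elements, which is the unit a character-wise algorithm would need to see.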
If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares and replaces, I could live with that.
If you meant UTS-32 and you have a corresponding locale implementation, then this approach is a viable solution.
Sorry, what is UTS-32? I tried to Google it: 351 results, with none of them looking related to character encodings.
I found this article on Wikipedia on UTF-32/UCS-4: http://en.wikipedia.org/wiki/UTF-32
Is it not what I need? I suspect that many people must have run into similar problems. Perhaps we should add a 32-bit string class to Boost. And until I get a better understanding, I will keep calling it UTF-32 :-)
Sorry, I mixed it up a little. I meant UCS-4, a.k.a. a fixed-width encoding. I was not aware that UTF-32 is de facto the same.
Anyway, the statement about usability with StringAlgo still holds. It can work with any fixed-size encoding, as long as you have the corresponding locales.
It could theoretically also work with variable-width characters, provided you have a container/locale framework that allows operating on metacharacters. I'm not sure how efficient it would be, though.
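The fixed-width case can be sketched with the standard library alone: once every element is a whole code point, a case-insensitive comparison reduces to an element-wise comparison through a folding function. The names `FoldTable`, `fold` and `iequals_utf32` below are hypothetical, and the table stands in for the "corresponding locale"; this is not how StringAlgo itself is implemented.

```cpp
#include <algorithm>
#include <map>
#include <string>

// Hypothetical table type: code point -> simple case-folded code point.
using FoldTable = std::map<char32_t, char32_t>;

// Fold one code point; code points without an entry fold to themselves.
char32_t fold(const FoldTable& table, char32_t c)
{
    const auto it = table.find(c);
    return it == table.end() ? c : it->second;
}

// Case-insensitive equality on fixed-width (UTF-32) strings.
// Works element by element, so it only covers one-to-one foldings.
bool iequals_utf32(const std::u32string& a, const std::u32string& b,
                   const FoldTable& table)
{
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [&table](char32_t x, char32_t y) {
                          return fold(table, x) == fold(table, y);
                      });
}
```

With a table containing `A -> a`, `B -> b` and `U+00C5 -> U+00E5`, the strings `U"\u00C5B"` and `U"\u00E5b"` compare equal. Foldings that change the length of a string cannot be expressed this way, which is one reason a per-element approach falls short of full Unicode case folding.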
Best regards,
Pavol.

Martin,
The Unicode library I posted in the vault will do what you want for arbitrary characters. It allows you to take two UTF-32 Unicode strings and compare them at different comparison levels [e.g. exact, case insensitive etc.], and includes the calls necessary for iterating characters [which may be more than a single 4-byte character, e.g. surrogates can be 3 x 'UTF-32' numbers] and then of course the ability to iterate graphemes. It will allow you to do a case insensitive comparison using the full Unicode character library specification.

The only thing we did not have time to do was the string wrapper class. Feel free to work on that!

Thanks.

Yours,
Graham