Re: [boost] UTF-8 conversion etc. (Sebastian Redl)

As I said to Phil, Rogier and I completed a Unicode character library for
release under Boost, but never submitted it, as we had intended to
release it with a string library built on it and never had time to do the
second part of the work.
Post it, and we'll do the second part. It's open-source.
Sebastian
Sebastian, As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip. Please feel free to use this under the boost license. It would be great if somebody had the time to build on the existing character support and add the string wrappers that we had intended to write but ran out of time for. It should be well documented, but if you want any questions answered, please feel free to ask and I'll try and help! Thanks. Yours, Graham

Graham wrote:
As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip.
Thanks for that. I'll study it properly in due course, but I've just had a quick look at your UTF8 functions for now. I see that you use tables for decoding, and I'll have to see how the performance of that compares to my version. For encoding you generate the bytes backwards, which is a trick that's unfortunately not available to an implementation that's templated on the output iterator, unless as a specialisation for random-access iterators. Anyway, I'll investigate further. Thanks again, Phil.
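
To illustrate the point about output iterators: a generic output iterator can only be written to in order, so an encoder templated on it has to emit the lead byte first instead of filling a buffer from the back. The following is a minimal sketch of that forward approach. The name encode_utf8 is hypothetical and is not taken from Unicode_lib.zip or from Phil's code, and char32_t assumes a C++11 compiler.

```cpp
#include <iterator>
#include <string>

// Hypothetical helper (not part of Unicode_lib.zip): writes the UTF-8
// encoding of one code point to any output iterator, lead byte first.
// Returns the advanced iterator; invalid input produces no output.
template <typename OutputIterator>
OutputIterator encode_utf8(char32_t cp, OutputIterator out)
{
    // Surrogates and values above U+10FFFF are not valid scalar values.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out;

    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

It can be used with std::back_inserter on a std::string, for example encode_utf8(0x20AC, std::back_inserter(s)). The cost compared with the backwards trick is that the length has to be decided by the branch before any byte is written.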

Graham wrote:
As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip.
Please feel free to use this under the boost license.
It would be great if somebody had the time to build on the existing character support and add the string wrappers that we had intended to write but ran out of time for.
It should be well documented, but if you want any questions answered, please feel free to ask and I'll try and help!
I finally got around to taking a good look at it. It is my understanding that the library consists essentially of two parts:
1) codecvt facets for UTF-8, UTF-16 and UTF-32 and char_traits for codepoint, along with appropriate fstream typedefs.
2) An interface for getting the properties of a Unicode codepoint, and implementations for some of the Unicode algorithms.
This is very impressive, but unfortunately, codecvt is simply not rich enough to build a string class based on it that actually stores the characters in encoded form. Am I correct in this? Sebastian

On Monday 07 April 2008 19:20:56 Sebastian Redl wrote:
Graham wrote: As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip.
Please feel free to use this under the boost license.
It would be great if somebody had the time to build on the existing character support and add the string wrappers that we had intended to write but ran out of time for.
It should be well documented, but if you want any questions answered, please feel free to ask and I'll try and help!
I have been looking at your work. Would it be useful for converting between upper and lower case for non-ASCII characters? I have been trying to use those functions in <boost/algorithm/string.hpp>, but they don't seem to work for my UTF-8 strings. I tried with different locales, but that doesn't help. It seems I need a locale that defines all the facets for converting between UTF-8 upper- and lowercase characters. I was thinking, perhaps your library could be used for that somehow? -Regards Martin L

Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used. If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that. -Regards Martin Lütken
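
The "convert to UTF-32 first" route mentioned above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names are hypothetical, the decoder does only light validation (it does not reject overlong forms or surrogates), and the folding function is a toy that covers ASCII and the Latin-1 letters only. A real version would load the full simple mappings from CaseFolding.txt.

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder: UTF-8 bytes to UTF-32 code points. Emits U+FFFD for
// an invalid lead byte or a missing/invalid continuation byte.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;                       // continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}

// Toy simple case folding: ASCII A-Z and Latin-1 U+00C0..U+00DE (except the
// multiplication sign U+00D7) map to the code point 0x20 above them.
char32_t fold_simple(char32_t c)
{
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;
    return c;
}

// Case-insensitive equality on UTF-8 input, done entirely in UTF-32.
bool equal_ignoring_case(const std::string& a, const std::string& b)
{
    const std::u32string x = utf8_to_utf32(a);
    const std::u32string y = utf8_to_utf32(b);
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (fold_simple(x[i]) != fold_simple(y[i])) return false;
    return true;
}
```

Because every element of the UTF-32 string is a whole code point, the comparison loop can work element by element, which is exactly what a byte-wise pass over UTF-8 cannot do. Full case folding (for example U+00DF folding to "ss") changes string length and is not handled here.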

Hi, Martin Lutken wrote:
Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.
This might not work out of the box. The StringAlgo library is designed around sequences of characters. Since UTF-8 is a variable-width character encoding, the algorithms in the library would not work as expected. To make it work, you will need a container whose iterators iterate over meta-characters, not bytes.
If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that.
If you meant UTS-32 and you have a corresponding locale implementation, then this approach is a viable solution. Best regards, Pavol.

Martin Lutken wrote:
Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.
This might not work out of the box. The StringAlgo library is designed around sequences of characters. Since UTF-8 is a variable-width character encoding, the algorithms in the library would not work as expected. To make it work, you will need a container whose iterators iterate over meta-characters, not bytes.
If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that.
If you meant UTS-32 and you have a corresponding locale implementation, then this approach is a viable solution.
Sorry, what is UTS-32? I tried to Google it: 351 results, with none of them looking like char encoding related. I found this article on Wikipedia on UTF-32/UCS-4: http://en.wikipedia.org/wiki/UTF-32 Is it not what I need? I suspect that many people must have run into similar problems. Perhaps we should add a 32-bit string class to Boost. And until I get a better understanding, I will keep calling it UTF-32 :-) -Regards Martin Lütken

Martin Lütken wrote:
Martin Lutken wrote:
Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.
This might not work out of the box. The StringAlgo library is designed around sequences of characters. Since UTF-8 is a variable-width character encoding, the algorithms in the library would not work as expected.
To make it work, you will need a container whose iterators iterate over meta-characters, not bytes.
If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that.
If you meant UTS-32 and you have a corresponding locale implementation, then this approach is a viable solution.
Sorry, what is UTS-32 ? I tried to Google it: 351 results, with none of them looking like char encoding related.
I found this article on Wikipedia on UTF-32/UCS-4: http://en.wikipedia.org/wiki/UTF-32
Is it not what I need? I suspect that many people must have run into similar problems. Perhaps we should add a 32-bit string class to Boost. And until I get a better understanding, I will keep calling it UTF-32 :-)
Sorry, I mixed it up a little. I meant UCS-4, a.k.a. the fixed-width encoding. I was not aware that UTF-32 is de facto the same. Anyway, the statement about usability with StringAlgo still holds. It can work with any fixed-size encoding, as long as you have the corresponding locales. It could theoretically also work with variable-width characters, provided you have a container/locale framework that allows you to operate on meta-characters. I'm not sure how efficient it would be, though. Best regards, Pavol.
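
For readers wondering what "iterators that iterate over meta-characters, not bytes" might mean in practice, here is a minimal read-only sketch. The class name utf8_code_point_iterator is hypothetical and is not part of StringAlgo or Unicode_lib.zip. It assumes well-formed UTF-8, does no validation, and is only an input iterator, which is less than many StringAlgo algorithms would need.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor: walks a UTF-8 std::string and yields one code point
// per step. Assumes well-formed input; malformed bytes are not detected.
class utf8_code_point_iterator
{
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t                value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const char32_t*         pointer;
    typedef char32_t                reference;   // code points are returned by value

    explicit utf8_code_point_iterator(std::string::const_iterator pos)
        : pos_(pos) {}

    // Decode the code point that starts at the current byte position.
    char32_t operator*() const
    {
        std::string::const_iterator p = pos_;
        const unsigned char lead = static_cast<unsigned char>(*p++);
        const std::size_t n = length(lead);
        if (n == 1) return lead;
        char32_t cp = lead & (0x7F >> n);        // keep the payload bits of the lead byte
        for (std::size_t k = 1; k < n; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(*p++) & 0x3F);
        return cp;
    }

    // Advance by one whole encoded character, not by one byte.
    utf8_code_point_iterator& operator++()
    {
        pos_ += length(static_cast<unsigned char>(*pos_));
        return *this;
    }

    utf8_code_point_iterator operator++(int)
    {
        utf8_code_point_iterator old(*this);
        ++*this;
        return old;
    }

    bool operator==(const utf8_code_point_iterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const utf8_code_point_iterator& o) const { return pos_ != o.pos_; }

private:
    // Sequence length implied by the lead byte (1 to 4 bytes).
    static std::size_t length(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        return 4;
    }

    std::string::const_iterator pos_;
};
```

With a pair of these constructed from s.begin() and s.end(), std::distance counts code points instead of bytes. The efficiency concern raised above shows up directly: every dereference decodes again, and there is no constant-time random access.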
participants (6)
- Graham
- Martin Lutken
- Martin Lütken
- Pavol Droba
- Phil Endecott
- Sebastian Redl