Re: [boost] UTF-8 conversion etc. (Sebastian Redl)

Graham

12 Mar 2008

9:15 p.m.

...
As I said to Phil, Rogier and I completed a Unicode character library for release under Boost, but never submitted it to Boost as we had intended to release it with a string library built on it, and never had time to do the second part of the work.

Post it, and we'll do the second part. It's open-source.

Sebastian

Sebastian,

As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip.

Please feel free to use this under the boost license.

It would be great if somebody had the time to develop the existing character support to add the string wrappers that we had intended to - but ran out of time doing.

It should be well documented, but if you want any questions answered, please feel free to ask and I'll try and help!

Thanks.

Yours,
Graham


Phil Endecott

12 Mar

10:37 p.m.

New subject: UTF-8 conversion etc.

Graham wrote:

...

As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip.

Thanks for that. I'll study it properly in due course, but I've just had a quick look at your UTF-8 functions for now.

I see that you use tables for decoding, and I'll have to see how the performance of that compares to my version. For encoding you generate the bytes backwards, which is a trick that's unfortunately not available to an implementation that's templated on the output iterator, unless as a specialisation for random-access iterators.

Anyway, I'll investigate further.

Thanks again,
Phil.
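To illustrate the constraint Phil describes: an encoder templated on an output iterator can only write forwards, so it has to choose the sequence length first and then emit the lead byte before the continuation bytes. The following is a minimal sketch assuming C++11; `encode_utf8` and `to_utf8` are hypothetical names and are not taken from Unicode_lib.zip or from Phil's own implementation.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical helper, not taken from Unicode_lib.zip or from Phil's code.
// Encodes one code point as UTF-8, emitting the bytes in forward order so
// that any output iterator (e.g. std::back_inserter) can be used.
template <typename OutputIt>
OutputIt encode_utf8(char32_t cp, OutputIt out)
{
    // Precondition: cp is a Unicode scalar value (no surrogates, <= U+10FFFF).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x80) {                        // 1 byte:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx + 2 continuation
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx + 3 continuation
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Convenience wrapper: encodes a whole UTF-32 string into a std::string.
inline std::string to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in)
        encode_utf8(cp, std::back_inserter(out));
    return out;
}
```

Writing the bytes backwards avoids the chain of range checks up front, but it needs a destination that can be addressed out of order, which a plain output iterator does not offer.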

Sebastian Redl

7 Apr

5:20 p.m.

New subject: UTF-8 conversion etc.

Graham wrote:

...

As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip.

Please feel free to use this under the boost license.

It would be great if somebody had the time to develop the existing character support to add the string wrappers that we had intended to - but ran out of time doing.

It should be well documented, but if you want any questions answered, please feel free to ask and I'll try and help!

I finally got around to taking a good look at it. It is my understanding that the library consists essentially of two parts:

1) codecvt facets for UTF-8, UTF-16 and UTF-32, and char_traits for codepoint, along with appropriate fstream typedefs.

2) An interface for getting the properties of a Unicode codepoint, and implementations for some of the Unicode algorithms.

This is very impressive, but unfortunately, codecvt is simply not rich enough to build a string class based on it that actually stores the characters in encoded form. Am I correct in this?

Sebastian

Martin Lutken

26 Aug

10:49 a.m.

New subject: UTF-8 conversion etc. upper,lower case converting

On Monday 07 April 2008 19:20:56 Sebastian Redl wrote:

...

Graham wrote: As requested, I have posted a Unicode character support library in the boost vault as Unicode_lib.zip.

Please feel free to use this under the boost license.

It would be great if somebody had the time to develop the existing character support to add the string wrappers that we had intended to - but ran out of time doing.

It should be well documented, but if you want any questions answered, please feel free to ask and I'll try and help!

I have been looking at your work. Would it be useful for converting between upper and lower case for non-ASCII characters?

I have been trying to use the functions in <boost/algorithm/string.hpp>, but they don't seem to work for my UTF-8 strings. I tried different locales, but that doesn't help. It seems I need a locale that defines all the facets for converting between upper- and lowercase UTF-8 characters.

I was thinking, perhaps your library could be used for that somehow?

-Regards
Martin L

Martin Lutken

12:11 p.m.

New subject: String Algorithms Library: Case insensitive compare UTF-8

Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used. If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that. -Regards Martin Lütken
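The "convert to UTF-32 first" route mentioned above could look roughly like the sketch below, assuming C++11. All names (`decode_utf8`, `fold_simple`, `iequals_utf8`) are hypothetical. The decoder assumes well-formed UTF-8, and the folding function hard-codes only a tiny illustrative subset of the simple mappings (ASCII and the Latin-1 capitals); a real implementation would load the full table from CaseFolding.txt and would probably also need normalization, which is not attempted here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32.
// Assumes well-formed input; malformed sequences are not detected reliably.
inline std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;                       // continuation bytes to read
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            const unsigned char c = static_cast<unsigned char>(in[i]);
            assert((c & 0xC0) == 0x80);          // continuation byte expected
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Simple case folding for an illustrative subset only:
// ASCII A-Z and U+00C0..U+00DE (excluding U+00D7, the multiplication sign).
inline char32_t fold_simple(char32_t cp)
{
    if (cp >= 0x41 && cp <= 0x5A) return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
    return cp;
}

// Case-insensitive equality, comparing code point by code point.
inline bool iequals_utf8(const std::string& a, const std::string& b)
{
    const std::u32string x = decode_utf8(a);
    const std::u32string y = decode_utf8(b);
    if (x.size() != y.size()) return false;
    for (std::size_t i = 0; i < x.size(); ++i)
        if (fold_simple(x[i]) != fold_simple(y[i])) return false;
    return true;
}
```

Because the comparison runs on fixed-width code points, each element of the decoded string is a whole character, which is what a per-element case mapping needs.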

Pavol Droba

27 Aug

8:04 a.m.

New subject: String Algorithms Library: Case insensitive compare UTF-8

Hi, Martin Lutken wrote:

...

Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.

This might not work out of the box. The StringAlgo library is designed around sequences of characters. Since UTF-8 has a variable-width character encoding, the algorithms in the library would not work as expected. To make it work, you will need a container with iterators that iterate over meta-characters, not bytes.
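As a rough illustration of an iterator that steps over characters rather than bytes, here is a minimal read-only sketch, assuming C++11 and well-formed UTF-8. `utf8_iterator` and `count_code_points` are hypothetical names, not part of StringAlgo or of any Boost library, and whether StringAlgo's algorithms would accept such an iterator is not something this sketch establishes.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical read-only adaptor that walks a UTF-8 buffer one code point
// at a time. Assumes well-formed UTF-8; no validation is performed.
class utf8_iterator
{
public:
    // Declared as an input iterator because operator* returns by value.
    using iterator_category = std::input_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;

    explicit utf8_iterator(const char* p) : p_(p) {}

    // Decodes the code point starting at the current position.
    char32_t operator*() const
    {
        assert(p_ != nullptr);
        const unsigned char lead = static_cast<unsigned char>(*p_);
        const std::size_t n = length(lead);
        // Strip the length marker bits from the lead byte.
        char32_t cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        return cp;
    }

    // Advances by one code point, i.e. by 1 to 4 bytes.
    utf8_iterator& operator++()
    {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }
    utf8_iterator operator++(int) { utf8_iterator t = *this; ++*this; return t; }

    friend bool operator==(utf8_iterator a, utf8_iterator b) { return a.p_ == b.p_; }
    friend bool operator!=(utf8_iterator a, utf8_iterator b) { return a.p_ != b.p_; }

private:
    // Sequence length implied by the lead byte.
    static std::size_t length(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        return 4;
    }
    const char* p_;
};

// Number of code points (not bytes) in a UTF-8 string.
inline std::size_t count_code_points(const std::string& s)
{
    return static_cast<std::size_t>(
        std::distance(utf8_iterator(s.data()), utf8_iterator(s.data() + s.size())));
}
```

For a string such as "naïve" encoded in UTF-8, the byte length is 6 while this iterator visits 5 code points, which is the mismatch a byte-oriented algorithm runs into.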

...

If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that.

If you meant UTS-32 and you have a corresponding locale implementation, then this approach is a viable solution.

Best regards,
Pavol.

Martin Lütken

9:03 p.m.

New subject: String Algorithms Library: Case insensitive compare UTF-8

Martin Lutken wrote:

...

Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.

...

If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that.

If you meant UTS-32 and you have a corresponding locale implementation, then this approach is a viable solution.

Sorry, what is UTS-32? I tried to Google it: 351 results, with none of them looking related to character encoding.

I found this article on Wikipedia on UTF-32/UCS-4: http://en.wikipedia.org/wiki/UTF-32

Is it not what I need? I suspect that many people must have run into similar problems. Perhaps we should add a 32-bit string class to Boost. And until I get a better understanding, I will keep calling it UTF-32 :-)

-Regards
Martin Lütken

Pavol Droba

28 Aug

1:53 p.m.

New subject: String Algorithms Library: Case insensitive compare UTF-8

Martin Lütken wrote:

...

Martin Lutken wrote:

...
Anyone who knows how this could be made possible? I suppose I need a locale facet like the std::ctype, but which works for UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt) could be used.

This might not work out of the box. The StringAlgo library is designed around sequences of characters. Since UTF-8 has a variable-width character encoding, the algorithms in the library would not work as expected.

To make it work, you will need a container with iterators that iterate over meta-characters, not bytes.

...
If it's better/easier just to convert the string to UTF-32 before doing case insensitive compares, replaces I could live with that.

If you meant UTS-32 and you have a corresponding locale implementation, then this approach is a viable solution.

Sorry, what is UTS-32 ? I tried to Google it: 351 results, with none of them looking like char encoding related.

I found this article on Wikipedia on UTF-32/UCS-4: http://en.wikipedia.org/wiki/UTF-32

Is it not what I need? I suspect that many people must have run into similar problems. Perhaps we should add a 32-bit string class to Boost. And until I get a better understanding, I will keep calling it UTF-32 :-)

Sorry, I mixed it up a little. I meant UCS-4, a.k.a. a fixed-width encoding. I was not aware that UTF-32 is de facto the same.

Anyway, the statement about usability with StringAlgo still holds. It can work with any fixed-width encoding, as long as you have the corresponding locales. It could theoretically also work with variable-width characters, provided you have a container/locale framework that allows operating on meta-characters. I'm not sure how efficient it would be, though.

Best regards,
Pavol.
