Boost.Locale (was Re: [SQL-Connectivity] Is Boost interested in CppDB?)

On Tue, Dec 14, 2010 at 10:48 PM, Edward Diener <eldiener@tropicsoft.com> wrote:
On 12/14/2010 9:03 AM, Artyom wrote:
Accidentally (or not), there is a Boost.Locale library that I submitted for formal review; it is stuck in the queue waiting for its formal review.
It supports charset conversions and much more of Unicode handling besides.
And accidentally (or not), it recommends using UTF-8 everywhere and not using wide strings...
That's fine as long as you realize that it may not be the most common usage of Unicode for everybody (such as in Windows programming).
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;) I'm not sure, but maybe Spirit.Karma can handle something like that? I haven't checked yet but a string conversion/encoding utility like that would be very much appreciated IMO.
I am looking forward to the review of Boost.Locale.
+1 -- Dean Michael Berris deanberris.com
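For reference, the C++0x draft library has machinery for exactly the conversion requested here. A minimal sketch, assuming an implementation that ships the <codecvt> header; on Windows, wchar_t strings are UTF-16, which codecvt_utf8_utf16 accounts for:

#include <codecvt>
#include <locale>
#include <string>

// Wide string -> UTF-8 encoded std::string, and back.
std::string to_utf8(const std::wstring& ws) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t> > conv;
    return conv.to_bytes(ws);
}

std::wstring from_utf8(const std::string& s) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t> > conv;
    return conv.from_bytes(s);
}

On platforms where wchar_t is 32-bit UTF-32 (e.g. Linux), std::codecvt_utf8<wchar_t> would be the appropriate facet instead.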

On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption. -- Eric Niebler BoostPro Computing http://www.boostpro.com
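A minimal sketch of the corruption Eric describes (the string literal is spelled out in escaped UTF-8 bytes):

#include <string>

int main() {
    std::string s = "\xC3\xA9t\xC3\xA9";  // UTF-8 bytes of "été": 5 bytes, 3 characters
    s[0] = 'e';  // the random-access byte mutation std::string invites
    // s now begins with a stray continuation byte (0xA9): the buffer is no
    // longer valid UTF-8, and nothing in the std::string interface flags it.
    return int(s.size());  // still 5 -- the damage is silent
}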

On Tue, Dec 14, 2010 at 11:08 PM, Eric Niebler <eric@boostpro.com> wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
Yes, in a perfect world we would have a sane string implementation that knew how to handle UTF-8 natively. I guess C++0x should fix that, but until then there are libraries that deal with UTF-8 through a `char const *`, and consequently std::strings that contain UTF-8 encoded data. Until all those libraries get re-written to use a better string implementation, a converter from std::wstring to a UTF-8 encoded std::string (or something else convertible to an std::string) would be really useful -- even though corrupting string operations on UTF-8 std::strings remain possible, and even likely. I would really rather not write a library like this myself, which is why I'm hoping there's something out there that's easy to use and provides STL-like container interfaces to UTF-8 encoded strings. Whether that string is an std::string doesn't really matter much, as long as I can use it now almost the same way I'd use an std::string. ;) Are you working on something like that Eric? :D -- Dean Michael Berris deanberris.com

On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
The same goes for std::wstring; it's just that a lot of developers are not aware of it (that UTF-16 is a variable-length encoding). Artyom
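To make Artyom's point concrete, a sketch using the C++0x char16_t type (assuming a compiler that supports it):

#include <string>

int main() {
    // U+10400 lies outside the Basic Multilingual Plane, so UTF-16
    // stores it as a surrogate pair -- two 16-bit code units.
    std::u16string s = u"\U00010400";
    return s.size() == 2 ? 0 : 1;  // size() counts code units, not characters
}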

On 14/12/2010 16:08, Eric Niebler wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
It is, however, an appropriate holder for the *data* of a UTF-8 string.

On 12/14/2010 2:27 PM, Mathias Gaunard wrote:
On 14/12/2010 16:08, Eric Niebler wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
It is, however, an appropriate holder for the *data* of a UTF-8 string.
Doesn't C++0x define character types for Unicode characters? If that is the case, wouldn't a basic_string<utf8> (or whatever the type might be called) be a better choice than basic_string<char>, if such a type existed?

Doesn't C++0x specify char to be UTF-8, char16_t to be UTF-16, and char32_t to be UTF-32? Potentially misleading source: http://en.wikipedia.org/wiki/C%2B%2B0x
std::string then becomes a UTF-8 character container by default.
On Tue, Dec 14, 2010 at 22:05, Edward Diener <eldiener@tropicsoft.com> wrote:
On 12/14/2010 2:27 PM, Mathias Gaunard wrote:
On 14/12/2010 16:08, Eric Niebler wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
It is, however, an appropriate holder for the *data* of a UTF-8 string.
Doesn't C++0x define character types for Unicode characters? If that is the case, wouldn't a basic_string<utf8> (or whatever the type might be called) be a better choice than basic_string<char>, if such a type existed?

On 14/12/2010 22:05, Edward Diener wrote:
On 12/14/2010 2:27 PM, Mathias Gaunard wrote:
On 14/12/2010 16:08, Eric Niebler wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
It is, however, an appropriate holder for the *data* of a UTF-8 string.
Doesn't C++0x define character types for Unicode characters? If that is the case, wouldn't a basic_string<utf8> (or whatever the type might be called) be a better choice than basic_string<char>, if such a type existed?
While C++0x introduces char16_t and char32_t, meant for UTF-16 and UTF-32 respectively, there is no special character type dedicated to UTF-8.
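A quick sketch of what C++0x does and does not provide here (u8 literals were a late addition, so availability is compiler-dependent):

const char16_t* a = u"text";   // UTF-16 encoded
const char32_t* b = U"text";   // UTF-32 encoded
const char*     c = u8"text";  // UTF-8 encoded, but the element type is
                               // plain char -- no dedicated UTF-8 type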

On 12/15/2010 7:13 AM, Mathias Gaunard wrote:
On 14/12/2010 22:05, Edward Diener wrote:
On 12/14/2010 2:27 PM, Mathias Gaunard wrote:
On 14/12/2010 16:08, Eric Niebler wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
It is, however, an appropriate holder for the *data* of a UTF-8 string.
Doesn't C++0x define character types for Unicode characters? If that is the case, wouldn't a basic_string<utf8> (or whatever the type might be called) be a better choice than basic_string<char>, if such a type existed?
While C++0x introduces char16_t and char32_t, meant for UTF-16 and UTF-32 respectively, there is no special character type dedicated to UTF-8.
UTF-8 is variable length encoded (so is UTF-16). basic_string and string are unsuitable for any variable length encoded data, as Eric pointed out. Regards, -- Joel de Guzman http://www.boostpro.com http://spirit.sf.net

On 15/12/2010 00:28, Joel de Guzman wrote:
On 12/15/2010 7:13 AM, Mathias Gaunard wrote:
On 14/12/2010 22:05, Edward Diener wrote:
On 12/14/2010 2:27 PM, Mathias Gaunard wrote:
On 14/12/2010 16:08, Eric Niebler wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
It is, however, an appropriate holder for the *data* of a UTF-8 string.
<snip stuff that seems irrelevant to the message>
UTF-8 is variable length encoded (so is UTF-16). basic_string and string are unsuitable for any variable length encoded data, as Eric pointed out.
What I said is that basic_string<char> is perfectly suitable as a container to store UTF-8 data; it is not, however, very suitable for doing text processing with (although UTF-8 has sufficiently nice properties to make it more or less passable). Raw data is different from an abstraction meant to represent text.
My Unicode library does not provide a Unicode string type, so as to decouple the data storage and representation from the abstraction and semantics of the text we want to attach to that data. This means that it is up to the user not to tamper with the data and make it invalid, and to respect the preconditions of the algorithms he wishes to use.
This could be strongly enforced by using a specific type, but that would mean strict ownership of the data by said type, copies between external representations (and there are many between all the major C++ libraries), and other interoperability problems.

Mathias Gaunard wrote:
On 15/12/2010 00:28, Joel de Guzman wrote:
On 12/15/2010 7:13 AM, Mathias Gaunard wrote:
On 14/12/2010 22:05, Edward Diener wrote:
On 12/14/2010 2:27 PM, Mathias Gaunard wrote:
On 14/12/2010 16:08, Eric Niebler wrote:
On 12/14/2010 9:53 AM, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too.
Please, no. std::string is not an appropriate holder for a UTF-8 string. It encourages random-access mutation of any byte in a UTF-8 sequence, pretty much guaranteeing data corruption.
It is, however, an appropriate holder for the *data* of a UTF-8 string.
<snip stuff that seems irrelevant to the message>
UTF-8 is variable length encoded (so is UTF-16). basic_string and string are unsuitable for any variable length encoded data, as Eric pointed out.
What I said is that basic_string<char> is perfectly suitable as a container to store UTF-8 data; it is not, however, very suitable for doing text processing with (although UTF-8 has sufficiently nice properties to make it more or less passable).
Raw data is different from an abstraction meant to represent text.
I disagree. vector<char> would be a suitable storage container for the raw data. basic_string<char> implies string semantics. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

On 15/12/2010 13:59, Stewart, Robert wrote:
I disagree. vector<char> would be a suitable storage container for the raw data. basic_string<char> implies string semantics.
More or less. It also adds some purely data-structure elements that vector<char> does not provide, such as everything related to substrings (extracting a given substring, replacing a substring with a substring of a different length, appending a substring, etc.). One can implement a rope data structure with the interface of basic_string, but not really with that of vector.
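A small sketch of the substring interface in question -- everything here is a basic_string member, while vector<char> would force you to hand-roll each from insert/erase and std::search:

#include <string>

int main() {
    std::string s = "hello world";
    s.replace(0, 5, "goodbye");        // replace a substring with a longer one
    std::size_t p = s.find("world");   // substring search
    std::string t = s.substr(p);       // substring extraction
    s += t;                            // append
    return 0;
}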

On Wed, Dec 15, 2010 at 7:59 AM, Stewart, Robert <Robert.Stewart@sig.com> wrote:
vector<char> would be a suitable storage container for the raw data. basic_string<char> implies string semantics.
But it also implies string *optimizations* (small string optimization, anyone?) As an *implementation detail* of some other Unicode container jobbie, basic_string makes sense. It might even make sense to expose the raw basic_string (with warnings), just for interoperability reasons. I'm sure there are lots of interfaces out there that already deal with utf-8 in basic_strings. -- Dave Abrahams BoostPro Computing http://www.boostpro.com
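A sketch of how one can observe the small-string optimization Dave mentions; whether the first check reports true is entirely implementation-dependent:

#include <string>

// With SSO, a short string's characters live inside the string object
// itself, so data() points into the object and no heap allocation occurs.
bool appears_sso(const std::string& s) {
    const char* obj = reinterpret_cast<const char*>(&s);
    return s.data() >= obj && s.data() < obj + sizeof(s);
}

int main() {
    std::string small_str = "hi";
    std::string big_str(1000, 'x');
    return appears_sso(small_str) && !appears_sso(big_str) ? 0 : 1;
}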

On 14/12/2010 15:53, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
My library can do that kind of conversion with arbitrary ranges, and possibly lazily as it is being iterated. Artyom's library can probably do it too, but only eagerly and with contiguous memory segments. My Unicode library would be in the review queue if people had manifested sufficient interest, but I was quite disappointed to see none last time I asked for comments. I did send a submission to boostcon 2011 about it though, to present its approach to Unicode and discuss it.

On 12/14/2010 2:25 PM, Mathias Gaunard wrote:
On 14/12/2010 15:53, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
My library can do that kind of conversion with arbitrary ranges, and possibly lazily as it is being iterated.
Artyom's library can probably do it too, but only eagerly and with contiguous memory segments.
My Unicode library would be in the review queue if people had manifested sufficient interest, but I was quite disappointed to see none last time I asked for comments. I did send a submission to boostcon 2011 about it though, to present its approach to Unicode and discuss it.
What is your library called, and where in the sandbox is it?

On 14/12/2010 22:01, Edward Diener wrote:
What is your library called, and where in the sandbox is it?
Boost.Unicode, it's in soc/2009. It still needs some work before getting submitted for review, in particular refactoring. I may get back to work on it soon if I get some motivating feedback. But if you want a solution for Unicode that's ready for deployment, you should consider Artyom's, which appears well polished.

On 12/14/2010 6:04 PM, Mathias Gaunard wrote:
On 14/12/2010 22:01, Edward Diener wrote:
What is your library called, and where in the sandbox is it?
Boost.Unicode, it's in soc/2009.
Maybe the lack of interest is because this means little to me, and may mean little to others. If you have a library you need to tell people how to get it. I found a Soc 2009 home page and I still have no idea how one is supposed to see what is there or how to get your library.
It still needs some work before getting submitted for review, in particular refactoring.
I may get back to work on it soon if I get some motivating feedback. But if you want a solution for Unicode that's ready for deployment, you should consider Artyom's, which appears well polished.
Very strange. You mention your library as possibly being more complete but then you tout someone else's. OK, I will study Artyom's Boost.Locale instead.

On 15/12/2010 01:44, Edward Diener wrote:
On 12/14/2010 6:04 PM, Mathias Gaunard wrote:
On 14/12/2010 22:01, Edward Diener wrote:
What is your library called, and where in the sandbox is it?
Boost.Unicode, it's in soc/2009.
Maybe the lack of interest is because this means little to me, and may mean little to others. If you have a library you need to tell people how to get it. I found a Soc 2009 home page and I still have no idea how one is supposed to see what is there or how to get your library.
There have been emails about it regularly on this mailing list for the past year and a half. Searching this list for Unicode should give you many hits. The docs are here, if that's what you're looking for: <http://mathias.gaunard.com/unicode/doc/html/>
Very strange. You mention your library as possibly being more complete but then you tout someone else's. OK, I will study Artyom's Boost.Locale instead.
My library is more powerful in a way, but is also less polished and feature-complete. They also have completely different approaches in their interfaces, as my library is made to be locale-agnostic and Artyom's chooses to make use of the standard C++ locale subsystem as much as possible, even though it is inherently broken for Unicode. My library is a generic implementation of Unicode, while Boost.Locale is mostly a wrapper on top of ICU, IBM's Unicode library. They're quite different, and I like mine best of course, but I have to admit Boost.Locale is more ready for production than Boost.Unicode for the time being.

On Wed, Dec 15, 2010 at 8:21 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 15/12/2010 01:44, Edward Diener wrote:
Maybe the lack of interest is because this means little to me, and may mean little to others. If you have a library you need to tell people how to get it. I found a Soc 2009 home page and I still have no idea how one is supposed to see what is there or how to get your library.
There have been emails about it regularly on this mailing list for the past year and a half.
Searching this list for Unicode should give you many hits. The docs are here, if that's what you're looking for: <http://mathias.gaunard.com/unicode/doc/html/>
This is just beautiful -- I think this is exactly what I need in cpp-netlib! Will this be submitted to Boost for review soon, because I really want to be able to deal with UTF-8 real soon now. :D
Very strange. You mention your library as possibly being more complete but then you tout someone else's. OK, I will study Artyom's Boost.Locale instead.
My library is more powerful in a way, but is also less polished and feature-complete.
So, what are the features you'd like to implement, so that we potential users can be the judge of whether it's feature-complete enough?
They also have completely different approaches in their interface, as my library is made to be locale-agnostic and Artyom's chooses to make use of the standard C++ locale subsystem as much as possible, even though it is inherently broken for Unicode.
My library is a generic implementation of Unicode, while Boost.Locale is mostly a wrapper on top of ICU, IBM's Unicode library.
They're quite different, and I like mine best of course, but I have to admit Boost.Locale is more ready for production than Boost.Unicode for the time being.
I think I like this approach better too for dealing with Unicode data in a generic way. I like that it plays nicely with Boost.Range and Boost.RangeEx, which is definitely a good way to deal with strings and text. Of course this is just me. I look forward to this library getting stable and usable soon -- definitely something sorely missing in Boost and in C++ in general. -- Dean Michael Berris deanberris.com

On 15/12/2010 13:34, Dean Michael Berris wrote:
On Wed, Dec 15, 2010 at 8:21 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 15/12/2010 01:44, Edward Diener wrote:
Maybe the lack of interest is because this means little to me, and may mean little to others. If you have a library you need to tell people how to get it. I found a Soc 2009 home page and I still have no idea how one is supposed to see what is there or how to get your library.
There have been emails about it regularly on this mailing list for the past year and a half.
Searching this list for Unicode should give you many hits. The docs are here, if that's what you're looking for: <http://mathias.gaunard.com/unicode/doc/html/>
This is just beautiful -- I think this is exactly what I need in cpp-netlib! Will this be submitted to Boost for review soon, because I really want to be able to deal with UTF-8 real soon now. :D
It could be submitted soon if I gave it a bit of love. If my talk about it gets accepted for boostcon 2011, I will definitely have it in the review queue several months before it starts.
Very strange. You mention your library as possibly being more complete but then you tout someone else's. OK, I will study Artyom's Boost.Locale instead.
My library is more powerful in a way, but is also less polished and feature-complete.
So, what are the features you'd like to implement, so that we potential users can be the judge of whether it's feature-complete enough?
- I need to finish support for word, sentence and line boundaries
- The ABI needs to be more clearly defined to guarantee backward and upward compatibility
- The convert and segment subsystem must be clearly separated into its own library and namespace
- The system must be made SIMD-ready
- Simple case conversion should be added
- General case folding (and maybe collation) should be added
Nothing among these is particularly difficult.

----- Original Message ----
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
- I need to finish support for word, sentence and line boundaries
- The ABI needs to be more clearly defined to guarantee backward and upward compatibility
- The convert and segment subsystem must be clearly separated into its own library and namespace
- The system must be made SIMD-ready
- Simple case conversion should be added
- General case folding (and maybe collation) should be added
Nothing among these is particularly difficult.
A few notes or questions. You say that your library is locale-agnostic; I see a contradiction between what you say and what you need to implement:
1. AFAIK boundary analysis is locale-dependent.
2. Case conversion is locale-dependent -- for example, if the locale is Turkish then upper("i") == "İ", while upper("i") == "I" for other languages.
3. Collation **is** locale-dependent, as text sorting in different languages is very different -- even if they use the same script (Latin, for example).
Artyom

On 15/12/2010 18:50, Artyom wrote:
A few notes or questions. You say that your library is locale-agnostic; I see a contradiction between what you say and what you need to implement:
My personal belief is that the locale matters for few things, and it's a big burden to set up and manage. So if I can avoid having to choose one, I'd rather do that, and only specify one when I really need it.
1. AFAIK boundary analysis is locale-dependent.
Tailoring of break properties is not supported: the default values are used. The specification in question (UAX #29) barely mentions tailoring anyway. A possibility to achieve a locale-dependent behaviour here would be to swap the database with a tailored one.
2. Case conversion is locale-dependent -- for example, if the locale is Turkish then upper("i") == "İ", while upper("i") == "I" for other languages.
Simple case conversions are the easy 1:1 language- and context-agnostic mappings. I can't do the more complex conversions because they depend on specific languages and contexts. Thankfully, case folding is neither language- nor context-dependent, and is probably what most people want rather than case conversion.
3. Collation **is** locale-dependent, as text sorting in different languages is very different -- even if they use the same script (Latin, for example).
Yes, it definitely is; but you could still have a "general" collation that would work well enough for most languages. I did say 'maybe', but I had forgotten how complicated the official algorithm was, so I won't do collation support for a while.

On Wed, 15 Dec 2010 20:50:47 +0100, Mathias Gaunard wrote:
On 15/12/2010 18:50, Artyom wrote:
A few notes or questions. You say that your library is locale-agnostic; I see a contradiction between what you say and what you need to implement:
...
2. Case conversion is locale-dependent -- for example, if the locale is Turkish then upper("i") == "İ", while upper("i") == "I" for other languages.
Simple case conversions are the easy 1:1 language- and context-agnostic mappings.
How are you defining 'simple' conversions? They are either right or wrong. Not converting i to İ for a Turkish user is just plain wrong and rather defeats the point of Unicode. Alex

On 16/12/2010 12:07, Alexander Lamaison wrote:
How are you defining 'simple' conversions?
Those that are in the UnicodeData mapping and not in the SpecialCasing mapping.
They are either right or wrong.
Unicode is rather more flexible than that in its levels of compliance.
Not converting i to İ for a Turkish user is just plain wrong and rather defeats the point of Unicode.
Turkish, Azeri and Lithuanian have a couple of characters whose case mappings are specific to their language, indeed. I'm not saying it's not important to deal correctly with those languages; I'm just saying that dealing with special casing is a more advanced feature than dealing with simple casing.

On 16/12/2010 12:33, Mathias Gaunard wrote:
I'm not saying it's not important to deal correctly with those languages; I'm just saying that dealing with special casing is a more advanced feature than dealing with simple casing.
If that's what it takes for people to use my library, however, I could prioritize this feature.

2. Case conversion is locale-dependent -- for example, if the locale is Turkish then upper("i") == "İ", while upper("i") == "I" for other languages.
Simple case conversions are the easy 1:1 language- and context-agnostic mappings.
I can't do the more complex conversions because they depend on specific languages and contexts.
Thankfully, case folding is neither language- nor context-dependent, and is probably what most people want rather than case conversion.
Then don't do case conversion! Do just case folding. For such "simple" and incorrect case conversion I don't need a sophisticated Unicode library; I can use the standard operating system API and even std::locale::ctype very successfully (which I do in Boost.Locale if the user prefers a non-ICU-based backend). Case conversion is:
- context-dependent: the Greek letter "Σ" is converted to "σ" or to "ς" according to its position in the word.
- locale-dependent: Turkish i goes to İ.
- not 1-to-1: German ß goes to SS in upper case.
So if you can't do this right, just don't do it. I'm not sure about case folding, but AFAIK it is not 1-to-1 either -- but I may be wrong.
Yes, it definitely is; but you could still have a "general" collation that would work well enough for most languages.
For a general collation that works "well" for most languages I can use strcmp... I don't need a Unicode library for that. Artyom
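The 1-to-1 limitation Artyom lists is easy to demonstrate: a per-character toupper cannot express the German ß -> SS mapping, so a correct upper-casing has to be a string-to-string operation. A toy sketch -- UTF-8 bytes hardcoded, only ß and ASCII handled, purely illustrative:

#include <string>

std::string to_upper_full(const std::string& utf8) {
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ) {
        if (utf8.compare(i, 2, "\xC3\x9F") == 0) {  // U+00DF, ß
            out += "SS";  // a 1:2 expansion -- impossible with char -> char
            i += 2;
        } else {
            char c = utf8[i++];
            out += (c >= 'a' && c <= 'z') ? char(c - 'a' + 'A') : c;
        }
    }
    return out;
}
// to_upper_full("stra\xC3\x9F" "e") == "STRASSE"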

On 16/12/2010 12:32, Artyom wrote:
Then don't do case conversion!
I already parse the data that provides that information, so I might as well forward it to the user. Unicode provides two levels of casing: one in its main character mapping, and one in the SpecialCasing supplement.
Do just case folding. For such "simple" and incorrect case conversion I don't need a sophisticated Unicode library; I can use the standard operating system API and even std::locale::ctype very successfully (which I do in Boost.Locale if the user prefers a non-ICU-based backend)
Case conversion is:
- context-dependent: the Greek letter "Σ" is converted to "σ" or to "ς" according to its position in the word.
- locale-dependent: Turkish i goes to İ.
- not 1-to-1: German ß goes to SS in upper case.
Right, and the reason I'm not doing it right now is because I don't want to look into the context thing before I take a look at more complex things that I think are more immediately useful.
I'm not sure about case folding, but AFAIK it is not 1-to-1 either -- but I may be wrong.
No it isn't. It also needs special treatment of Turkish, but nothing context-dependent.
For a general collation that works "well" for most languages I can use strcmp... I don't need a Unicode library for that.
That doesn't allow searching for a substring regardless of case, accentuation or punctuation. The thing that really interests me in collation is collation folding.
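The "search under folding" operation Mathias describes, sketched for ASCII only; real case or collation folding needs the Unicode tables, so this just shows the shape of the operation:

#include <cctype>
#include <string>

std::string fold(std::string s) {
    for (std::size_t i = 0; i < s.size(); ++i)
        s[i] = char(std::tolower(static_cast<unsigned char>(s[i])));
    return s;
}

// Find needle in hay regardless of case: fold both sides, then search.
std::size_t find_folded(const std::string& hay, const std::string& needle) {
    return fold(hay).find(fold(needle));
}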

On Thu, Dec 16, 2010 at 04:15, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
That doesn't allow searching for a substring regardless of case, accentuation or punctuation. The thing that really interests me in collation is collation folding.
Case (and other mutator)-insensitive search/comparison is certainly the part of most interest to me, be it via additional operations or through some kind of "ascii-fication" step. I've never found an example of where just converting a string's case is actually useful. Does anyone have an example?

On Wed, Dec 15, 2010 at 10:04 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 15/12/2010 13:34, Dean Michael Berris wrote:
This is just beautiful -- I think this is exactly what I need in cpp-netlib! Will this be submitted to Boost for review soon, because I really want to be able to deal with UTF-8 real soon now. :D
It could be submitted soon if I gave it a bit of love. If my talk about it gets accepted for boostcon 2011, I will definitely have it in the review queue several months before it starts.
Cool!
So, what are the features you'd like to implement, so that we potential users can be the judge of whether it's feature-complete enough?
- I need to finish support for word, sentence and line boundaries
This doesn't sound like something I'd need, so I'm not going to wait for this.
- The ABI needs to be more clearly defined to guarantee backward and upward compatibility
I'm not too worried about ABI maintenance, especially since I'm just going to write a header-only library that will require this, so that's not a deal-breaker for me.
- The convert and segment subsystem must be clearly separated into its own library and namespace
Okay. This sounds like a good thing, but a migration path would be enough I think for people wanting to use the "current" version (assuming I can get it from the sandbox).
- The system must be made SIMD-ready
I don't see how this would be a requirement, but it might be critical for some people who actually care that a text-processing library be SIMD-ready.
- Simple case conversion should be added
I'd really like this, but it's not too critical for me. I'm just more worried about encoding to UTF-8.
- General case folding (and maybe collation) should be added
This doesn't sound critical to me (well, I mostly need English and "webby" characters) but I guess this would be good.
Nothing among these is particularly difficult.
Cool, I'll look forward to giving it a whirl soon. :D Thanks Mathias! -- Dean Michael Berris deanberris.com

Very strange. You mention your library as possibly being more complete but then you tout someone else's. OK, I will study Artyom's Boost.Locale instead.
My library is more powerful in a way, but is also less polished and feature-complete. They also have completely different approaches in their interface, as my library is made
to be locale-agnostic and Artyom's chooses to make use of the standard C++ locale subsystem as much as possible, even though it is inherently broken for Unicode.
A few notes: std::locale is not "inherently broken"; it has a great way of doing things, it's just that some things are generally done in an imperfect way. The great thing about std::locale is that it is extensible, which allows you to fix some issues and use it very well. BTW, there are still things that even the "broken" std::locale does well -- for example, collation (at least under Linux) works quite fine.
My library is a generic implementation of Unicode, while Boost.Locale is mostly a wrapper on top of ICU, IBM's Unicode library.
Yes and no. It is not a wrapper over ICU, but ICU is a central part. You can use many Unicode/localization providers, even the standard library, and in many cases that works very well. But ICU gives very good, high-quality features that the standard libraries do not. Artyom
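The std::locale collation Artyom refers to, as a sketch; the locale name is platform-dependent, and "en_US.UTF-8" is assumed here:

#include <locale>
#include <string>

// Negative / zero / positive, like strcmp, but using the locale's
// collation rules rather than raw byte values.
int collate_compare(const std::string& a, const std::string& b) {
    std::locale loc("en_US.UTF-8");  // throws if the name is unknown
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}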

On Wed, Dec 15, 2010 at 3:25 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 14/12/2010 15:53, Dean Michael Berris wrote:
+1 -- if there was a library that did easy conversion from std::wstring (usually the default in Windows now) to proper UTF-8 encoded std::string in Boost that would be *awesome*. I can totally use that library in cpp-netlib too. ;)
My library can do that kind of conversion with arbitrary ranges, and possibly lazily as it is being iterated.
Cool.
Artyom's library can probably do it too, but only eagerly and with contiguous memory segments.
I'll have to admit I haven't looked -- if that were true that would be a shame.
My Unicode library would be in the review queue if people had manifested sufficient interest, but I was quite disappointed to see none last time I asked for comments. I did send a submission to boostcon 2011 about it though, to present its approach to Unicode and discuss it.
I might have been in hibernation at that time as I missed it -- but consider this message an expression of interest in a *sane* and *generic* string encoding/decoding/representation library that may support not only UTF-8 but other Unicode encoding schemes (UTF-16, UTF-32). Pointers to code+documentation would be greatly appreciated. Also, Boost.Locale getting reviewed soon would be a good thing. Unfortunately I'm just too busy (and inexperienced) to be a review manager for *any* of the libraries in the queue. -- Dean Michael Berris deanberris.com

On Tue, Dec 14, 2010 at 8:25 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
My library can do that kind of conversion with arbitrary ranges, and possibly lazily as it is being iterated.
Artyom's library can probably do it too, but only eagerly and with contiguous memory segments.
My Unicode library would be in the review queue if people had manifested sufficient interest, but I was quite disappointed to see none last time I asked for comments. I did send a submission to boostcon 2011 about it though, to present its approach to Unicode and discuss it.
I *am* interested in a good (semi-)standard Unicode handling library for C++, since it is IMO long overdue (not counting all the C libraries), and working with text at the character level in C++ is nowadays a real pain if you are not limited to ASCII. Eager/lazy iteration and traversing noncontiguous sequences are cool, but I would also welcome some high-level one-line tools for convenient conversion between std::strings and wstrings on different platforms, most notably Windows, where using std::strings in Unicode builds with functions taking just LPWSTR is a nightmare. IMO a lot of people would find something like this extremely useful (even if not extremely efficient):

std::string s = get_utf8_string();
WhatEverWinapiFunc(..., convert_to<std::basic_string<TCHAR>>(s).c_str(), ...);

or

std::wstring ws = get_string();
AnotherWinapiFunc(..., convert_to<std::basic_string<TCHAR>>(ws).c_str(), ...);

Another thing is some kind of adaptor for std::(w)string providing begin()/end() functions returning an iterator traversing the code points instead of the utf-XY "chars", i.e. in C++0x:

std::string s = get_utf8_string();
auto as = adapt(s);
auto i = as.begin(), e = as.end();
while(i != e) {
    char32_t c = *i;
    ...
    *i = transform(c);
    ++i;
}

I have just scrolled through the docs for Boost.Unicode some time ago, so maybe it is already there and I've missed it. If so, links to some examples showing this would be appreciated.

On 15/12/2010 08:20, Matus Chochlik wrote:
IMO a lot of people would find something like this extremely useful (even if not extremely efficient).
std::string s = get_utf8_string();
WhatEverWinapiFunc(..., convert_to<std::basic_string<TCHAR>>(s).c_str(), ...);
or
std::wstring ws = get_string();
AnotherWinapiFunc(..., convert_to<std::basic_string<TCHAR>>(ws).c_str(), ...);
The interface is modeled after that of standard algorithms, and therefore it takes an output iterator to write the output to, rather than creating a container directly.

// ws is a std::string (utf-8) or std::wstring (utf-16 or utf-32).
std::basic_string<TCHAR> out;
utf_transcode<TCHAR>(ws, std::back_inserter(out));
AnotherWinapiFunc(..., out.c_str(), ...);

Assuming TCHAR is either char or wchar_t, this should work out of the box. The fact that it takes an output iterator is quite practical, as you can easily do two passes, for example: one to count how many characters you need, and one to copy the data. Or you can just grow the container as you add elements, as std::back_inserter does. Something like

convert_to<std::basic_string<TCHAR>>(utf_transcode<TCHAR>(ws)).c_str()

would also work, but that's maybe a bit verbose.
Another thing is some kind of adaptor for std::(w)string providing begin()/end() functions returning an iterator traversing through the code points instead of utf-XY "chars". i.e. in C++0x:
std::string s = get_utf8_string();
auto as = adapt(s);
auto i = as.begin(), e = as.end();
while(i != e) {
    char32_t c = *i;
Replace adapt(s) by utf_decode(s)
... *i = transform(c);
No, you can't do that: data accessed like this is immutable. It's not impossible to make it mutable (a bit complicated in the code, though; the range concepts don't support inserting/erasing elements), but it's probably not a good idea because it would be O(n) worst case. If you really want to do that, you can already do it using i.base() and next(i).base(), which give you the range of the character in terms of the original std::string iterators, so you can use std::string::replace.
++i; }
I have just scrolled through the docs for Boost.Unicode some time ago so maybe it is already there and I've missed it. If so, links to some examples showing this would be appreciated.
Of course it's there, transcoding between UTF encodings is the most basic feature.
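To make the output-iterator style above concrete: utf_transcode is Boost.Unicode's own (sandbox) interface, so here is a self-contained stand-in showing the same pattern, a UTF-32 to UTF-8 encoder with surrogate/range validation omitted:

#include <iterator>  // std::back_inserter, for the usage below
#include <string>

template <typename InputIt, typename OutputIt>
OutputIt encode_utf8(InputIt first, InputIt last, OutputIt out) {
    for (; first != last; ++first) {
        char32_t c = *first;
        if (c < 0x80) { *out++ = char(c); }
        else if (c < 0x800) {
            *out++ = char(0xC0 | (c >> 6));
            *out++ = char(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            *out++ = char(0xE0 | (c >> 12));
            *out++ = char(0x80 | ((c >> 6) & 0x3F));
            *out++ = char(0x80 | (c & 0x3F));
        } else {
            *out++ = char(0xF0 | (c >> 18));
            *out++ = char(0x80 | ((c >> 12) & 0x3F));
            *out++ = char(0x80 | ((c >> 6) & 0x3F));
            *out++ = char(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Grows the container as it goes, exactly as described above:
// std::string out;
// encode_utf8(u32.begin(), u32.end(), std::back_inserter(out));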

On Wed, Dec 15, 2010 at 2:15 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 15/12/2010 08:20, Matus Chochlik wrote:
The interface is modeled after that of standard algorithms, and therefore it takes an output iterator to write the output to, rather than creating a container directly.
// ws is a std::string (utf-8) or std::wstring (utf-16 or utf-32).
std::basic_string<TCHAR> out;
utf_transcode<TCHAR>(ws, std::back_inserter(out));
AnotherWinapiFunc(..., out.c_str(), ...);
Assuming TCHAR is either char or wchar_t this should work out of the box.
The fact it takes an output iterator is quite practical, as you can easily do two passes for example, one to count how many characters you need, and one to copy that data. Or you can just grow the container as you add elements, as std::back_inserter does.
Something like
convert_to<std::basic_string<TCHAR>>(utf_transcode<TCHAR>(ws)).c_str()
would also work, but that's maybe a bit verbose.
My point is that many times you need to do this kind of conversion only once or twice in the whole application code, i.e. when you have to use a library that does not play well with the WinAPI's character-type switching. Besides the low-level tools, which are cool if you need efficiency, it would be good to have some syntactic-sugar wrappers on top of them for situations where clarity of code and non-verbosity are more important. Currently I use a wrapper around the MultiByteToWideChar / WideCharToMultiByte functions that does the conversion and cleans up afterwards, and I copy the code whenever I start a new project, but I was looking for something more "standardized" and portable.
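A sketch of the kind of wrapper described here, two-pass: one call to size the output, one to convert; error handling omitted:

#include <string>
#include <windows.h>

std::wstring utf8_to_wide(const std::string& s) {
    if (s.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), int(s.size()), 0, 0);
    std::wstring out(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), int(s.size()), &out[0], n);
    return out;
}

std::string wide_to_utf8(const std::wstring& ws) {
    if (ws.empty()) return std::string();
    int n = WideCharToMultiByte(CP_UTF8, 0, ws.data(), int(ws.size()), 0, 0, 0, 0);
    std::string out(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, ws.data(), int(ws.size()), &out[0], n, 0, 0);
    return out;
}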
Another thing is some kind of adaptor for std::(w)string providing begin()/end() functions returning an iterator traversing through the code points instead of utf-XY "chars". i.e. in C++0x:
std::string s = get_utf8_string();
auto as = adapt(s);
auto i = as.begin(), e = as.end();
while(i != e) {
    char32_t c = *i;
Replace adapt(s) by utf_decode(s)
Great, this is what I've been looking for.
... *i = transform(c);
No, you can't do that. Data accessed like this is immutable.
It's not impossible to make them mutable (a bit complicated in the code though, the range concepts don't support inserting/erasing elements), but it's probably not a good idea because it would be O(n) worst case.
That's a valid point, so the more efficient alternative to in-place transformation is to use another container for the output and an inserter.
If you really want to do that, you can already do it using i.base() and next(i).base(), which gives you the range of the character in terms of original std::string iterators, so you can use std::string::replace.
Still, it could be useful if the transformation were applied only to a small subset of the characters in the range, or if the original and the replacement byte sequences had equal length, which tends to happen for characters from the same "script".
++i; }
I have just scrolled through the docs for Boost.Unicode some time ago so maybe it is already there and I've missed it. If so, links to some examples showing this would be appreciated.
Of course it's there, transcoding between UTF encodings is the most basic feature.
Yes, I know that Boost.Unicode does the transcoding (it would be a sad Unicode library if it didn't :-)). I was asking about the syntactic-sugar functions and the possibly mutating iteration. But thanks for your response :)

On 15/12/2010 15:01, Matus Chochlik wrote:
That's a valid point, so the more efficient alternative to in-place transformation is to use another container for the output and an inserter.
Yes. Or you can use the transform adaptor in Boost.RangeEx to adapt your elements into transformed elements without modifying the original range. Boost.Unicode (or Boost.Convert, as I'm going to separate it) also provides a convert adaptor, which is a generalization of the transform adaptor in that it allows N:M transformation rather than 1:1 transformation. This is what it uses to adapt UTF-8 ranges into UTF-32 ones.
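The N:M idea is easiest to see from the N:1 direction: a decoder that consumes 1-4 UTF-8 bytes per step and yields one char32_t, without ever materializing the UTF-32 sequence. A minimal sketch with validation omitted -- not Boost.Unicode's actual adaptor, just the underlying idea:

#include <string>

struct utf8_decoder {
    const char* p;
    const char* end;

    bool done() const { return p == end; }

    char32_t next() {  // consume one code point's worth of bytes
        unsigned char b = static_cast<unsigned char>(*p++);
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        char32_t c = extra == 0 ? b
                   : extra == 1 ? (b & 0x1F)
                   : extra == 2 ? (b & 0x0F)
                   :              (b & 0x07);
        for (int i = 0; i < extra && p != end; ++i)
            c = (c << 6) | (static_cast<unsigned char>(*p++) & 0x3F);
        return c;
    }
};

// Usage:
// utf8_decoder d = { s.data(), s.data() + s.size() };
// while (!d.done()) handle(d.next());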

From: Matus Chochlik <chochlik@gmail.com>
On Tue, Dec 14, 2010 at 8:25 PM, Mathias Gaunard wrote:
My library can do that kind of conversion with arbitrary ranges, and possibly lazily as it is being iterated.
Artyom's library can probably do it too, but only eagerly and with contiguous memory segments.
+ Eager / lazy iteration and traversing noncontiguous sequences are cool
[...]
Another thing is some kind of adaptor for std::(w)string providing begin()/end() functions returning an iterator traversing through the code points instead of utf-XY "chars". i.e. in C++0x:
std::string s = get_utf8_string();
auto as = adapt(s);
auto i = as.begin(), e = as.end();
while(i != e) {
    char32_t c = *i;
    ...
    *i = transform(c);
    ++i;
}
That is exactly the reason Boost.Locale does not provide iteration over code points... What kind of transform(c) do you want to do? See, code points are usually meaningless in the context of natural text processing; you generally need higher-level units.
Examples:
1. How many characters are there in "שָלוֹם"? There are 4 characters and 6 code points (4 base letters + 2 diacritics). Code point != character, and this is why you do not need "indexing" over code points unless you are developing some Unicode algorithm.
2. You rarely work on (transform) standalone code points. You always use context; even stuff like converting case may change the number of code points in the string!
If you want to split the text into characters, words, etc., there is a break iterator that does this for you.
Artyom
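Artyom's point in miniature, with a Latin example (assuming char32_t support): one user-perceived character, two code points.

#include <string>

int main() {
    // "é" as a base letter plus a combining mark:
    // U+0065 (e) followed by U+0301 (combining acute accent).
    std::u32string s = U"e\u0301";
    return s.size() == 2 ? 0 : 1;  // 2 code points, 1 "character"
}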

On Wed, Dec 15, 2010 at 6:42 PM, Artyom <artyomtnk@yahoo.com> wrote:
From: Matus Chochlik <chochlik@gmail.com>
On Tue, Dec 14, 2010 at 8:25 PM, Mathias Gaunard wrote:
That is exactly the reason Boost.Locale does not provide iteration over code points...
What kind of transform(c) do you want to do?
See, code points are usually meaningless in the context of natural text processing; you generally need higher-level units:
Examples:
1. How many characters are there in "שָלוֹם"? There are 4 characters and 6 code points (4 base letters + 2 diacritics). Code point != character, and this is why you do not need "indexing" over code points unless you are developing some Unicode algorithm.
2. You rarely work on (transform) standalone code points. You always use context; even stuff like converting case may change the number of code points in the string!
If you want to split the text into characters, words, etc., there is a break iterator that does this for you.
Artyom
OK, thanks for the clarification. Matus
participants (12)
- Alexander Lamaison
- Artyom
- Dave Abrahams
- Dean Michael Berris
- Edward Diener
- Eric Niebler
- Joel de Guzman
- Klaim
- Mathias Gaunard
- Matus Chochlik
- Scott McMurray
- Stewart, Robert