Boost Unicode support ideas

It seems that Unicode support in Boost (which could lead to Unicode support in the C++ language and standard library) would be quite desirable.

The IBM International Components for Unicode (ICU) library (http://oss.software.ibm.com/icu/) is an existing C++ library with what appears to be a Boost-compatible license, which provides all or most of the Unicode support that would be desired in Boost or the C++ standard library, in addition to Unicode-equivalents of libraries already either in the standard library or in Boost, including number/currency formatting, date formatting, message formatting, and a regular expression library. Unfortunately, it does not use C++ exceptions to signal exceptional conditions (but rather it uses an error code return mechanism), it does not follow Boost naming conventions, and although there are some C++-specific facilities, most of the C++ API is the same as the C API, thus resulting in a less-than-optimal C++ interface.

Nonetheless, I think Boostifying the ICU library would be quite feasible, whereas attempting to reimplement all of the desired functionality that the ICU library provides would be extremely time consuming, since the collating and other services in the ICU library already support a large number of locales, and the character-set conversion facilities support a large number of character sets.

The representation of locales does present an issue that needs to be considered. The existing C++ standard locale facets are not very suitable for a variety of reasons:

- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.

- The interface of std::collate<Ch> is not at all suitable for providing all of the functionality desired for Unicode string collation. A suitable Unicode collation facility should at least allow for user-selection of the strength level used (refer to http://www.unicode.org/unicode/reports/tr10/), and would ideally also support customizations as extensive as the ICU library does (refer to http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html and http://oss.software.ibm.com/icu/userguide/Collate_Customization.html).

- Facilities such as Unicode string collation are heavily data-driven, and it would be inefficient to load the data for facilities that are not used. This could be avoided by using some sort of lazy loading mechanism.

It would still be possible to use the standard locale object as a container of an entirely new set of facets, which could be loaded from the data sources based on the name of the locale, and ``injected'' into an existing locale object, by calling some function. It is not clear, however, what advantage this would serve over simply using a thin-wrapper over a locale name to represent a ``locale,'' as is done in the ICU library. -- Jeremy Maitin-Shepard
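
As a point of comparison, a minimal sketch of the thin locale-name wrapper approach mentioned above might look like this; the class name and interface are hypothetical, merely in the spirit of ICU's Locale class rather than taken from ICU or Boost:

#include <string>

// Hypothetical sketch only: a locale represented as a thin wrapper over
// its name.  Services such as collation or case mapping would be looked
// up lazily from their data sources using this name.
class unicode_locale
{
public:
    explicit unicode_locale(const std::string& name) : name_(name) {}

    const std::string& name() const { return name_; }

private:
    std::string name_;
};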

In article <8765c5yl77.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
The IBM International Components for Unicode (ICU) library (http://oss.software.ibm.com/icu/) is an existing C++ library with what appears to be a Boost-compatible license, which provides all or most of the Unicode support that would be desired in Boost or the C++ standard library, in addition to Unicode-equivalents of libraries already either in the standard library or in Boost, including number/currency formatting, date formatting, message formatting, and a regular expression library. Unfortunately, it does not use C++ exceptions to signal exceptional conditions (but rather it uses an error code return mechanism), it does not follow Boost naming conventions, and although there are some C++-specific facilities, most of the C++ API is the same as the C API, thus resulting in a less-than-optimal C++ interface.
I don't agree with the statement that ICU "provides all or most of the Unicode support that would be desired in Boost or the C++ standard library", based on the fact that ICU does not use exceptions and does not provide STL iterator access to Unicode strings (as far as I know). To me this means that a new library is needed (its implementation can be based on ICU, linked to ICU, etc) or a substantial change to ICU is needed. My preference lies with a boost library that leverages ICU as much as possible without compromising any of the features that I would consider critical for a boost Unicode library. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Miro Jurisic <macdev@meeroh.org> writes:
In article <8765c5yl77.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
The IBM International Components for Unicode (ICU) library (http://oss.software.ibm.com/icu/) is an existing C++ library with what appears to be a Boost-compatible license, which provides all or most of the Unicode support that would be desired in Boost or the C++ standard library, in addition to Unicode-equivalents of libraries already either in the standard library or in Boost, including number/currency formatting, date formatting, message formatting, and a regular expression library. Unfortunately, it does not use C++ exceptions to signal exceptional conditions (but rather it uses an error code return mechanism), it does not follow Boost naming conventions, and although there are some C++-specific facilities, most of the C++ API is the same as the C API, thus resulting in a less-than-optimal C++ interface.
I don't agree with the statement that ICU "provides all or most of the Unicode support that would be desired in Boost or the C++ standard library", based on the fact that ICU does not use exceptions and does not provide STL iterator access to Unicode strings (as far as I know). To me this means that a new library is needed (its implementation can be based on ICU, linked to ICU, etc) or a substantial change to ICU is needed. My preference lies with a boost library that leverages ICU as much as possible without compromising any of the features that I would consider critical for a boost Unicode library.
Semantic details of the statement aside, I do hope somebody or somebodies tackle the problem. As has been made extra-obvious here recently, there are more issues with unicode and internationalization than most of us can grasp, and the only way I'm going to be effective with Unicode is if I have an effective library with the right abstractions. -- Dave Abrahams Boost Consulting www.boost-consulting.com

In article <ur7utig46.fsf@boost-consulting.com>, David Abrahams <dave@boost-consulting.com> wrote:
Semantic details of the statement aside, I do hope somebody or somebodies tackle the problem. As has been made extra-obvious here recently, there are more issues with unicode and internationalization than most of us can grasp, and the only way I'm going to be effective with Unicode is if I have an effective library with the right abstractions.
All I can say here is that due to my workload I am unlikely to tackle the issue until Unicode support becomes a requirement in my own work, so don't expect any proposals from me in this area before 2005. If someone else starts working on it, I'll participate in the design and review as time allows. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

David Abrahams <dave@boost-consulting.com> writes:
[snip]
Semantic details of the statement aside, I do hope somebody or somebodies tackle the problem. As has been made extra-obvious here recently, there are more issues with unicode and internationalization than most of us can grasp, and the only way I'm going to be effective with Unicode is if I have an effective library with the right abstractions.
I would be interested in tackling this, but I would like to reach a resolution on the locale issue I described in the previous posting before much work is begun. -- Jeremy Maitin-Shepard

David Abrahams wrote:
Miro Jurisic <macdev@meeroh.org> writes:
I don't agree with the statement that ICU "provides all or most of the Unicode support that would be desired in Boost or the C++ standard library",
Semantic details of the statement aside, I do hope somebody or somebodies tackle the problem. As has been made extra-obvious here recently, there are more issues with unicode and internationalization than most of us can grasp, and the only way I'm going to be effective with Unicode is if I have an effective library with the right abstractions.
Oh, in fact we have not even touched internationalization. It is a separate nontrivial issue and, btw, while Jeremy says the existing std::locale is not exactly good for Unicode, the same can be said about the standard 'messages' locale facet. Since it requires integer ids for messages, I'd rather stick to GNU gettext (not to mention I was never able to figure out how to make the 'messages' facet work :-( ) - Volodya

Miro Jurisic <macdev@meeroh.org> writes:
In article <8765c5yl77.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
The IBM International Components for Unicode (ICU) library (http://oss.software.ibm.com/icu/) is an existing C++ library with what appears to be a Boost-compatible license, which provides all or most of the Unicode support that would be desired in Boost or the C++ standard library, in addition to Unicode-equivalents of libraries already either in the standard library or in Boost, including number/currency formatting, date formatting, message formatting, and a regular expression library. Unfortunately, it does not use C++ exceptions to signal exceptional conditions (but rather it uses an error code return mechanism), it does not follow Boost naming conventions, and although there are some C++-specific facilities, most of the C++ API is the same as the C API, thus resulting in a less-than-optimal C++ interface.
I don't agree with the statement that ICU "provides all or most of the Unicode support that would be desired in Boost or the C++ standard library", based on the fact that ICU does not use exceptions and does not provide STL iterator access to Unicode strings (as far as I know). To me this means that a new library is needed (its implementation can be based on ICU, linked to ICU, etc) or a substantial change to ICU is needed. My preference lies with a boost library that leverages ICU as much as possible without compromising any of the features that I would consider critical for a boost Unicode library.
Regardless of your interpretation of that statement, I think we agree on this issue: ICU provides the desired Unicode facilities, but its C++ interface is not satisfactory. -- Jeremy Maitin-Shepard

In article <87n05hw2dc.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
Miro Jurisic <macdev@meeroh.org> writes:
Regardless of your interpretation of that statement, I think we agree on this issue: ICU provides the desired Unicode facilities, but its C++ interface is not satisfactory.
Absolutely. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Hi Jeremy,
The IBM International Components for Unicode (ICU) library (http://oss.software.ibm.com/icu/) is an existing C++ library .... and although there are some C++-specific facilities, most of the C++ API is the same as the C API, thus resulting in a less-than-optimal C++ interface.
True. In particular, it looks like they use iterators which have a 'next' method. Hmm... let me guess why -- IIRC it was a Java library initially and was then ported to C++.
Nonetheless, I think Boostifying the ICU library would be quite feasible, ...
As Miro said, there are several alternatives for how Boost.Unicode might relate to ICU, though reusing code from there is desirable.
The representation of locales does present an issue that needs to be considered. The existing C++ standard locale facets are not very suitable for a variety of reasons:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
We can just forget about locale::operator() ;-) But there are other issues. For example, 'toupper' takes a charT and returns a charT. The Unicode standard (in 5.18) gives an example of a character which becomes two characters when uppercased. Also, it might be necessary to look at the following code point to see whether it is a combining character. Other facets, say 'num_put', maybe don't need changes. If it generates data in UCS-2, that's fine.
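
To illustrate that point, a Unicode case mapping has to be a string-to-string operation rather than a charT-to-charT one. The sketch below is only an illustration (the function name is hypothetical, and only the sharp-s special case is handled; real code would use the full Unicode case mapping tables):

#include <string>

// Sketch only: Unicode case mapping is string-to-string.  U+00DF LATIN
// SMALL LETTER SHARP S has no single-character uppercase form; it maps
// to "SS", so one input code point becomes two output code points.
std::wstring to_upper_full(const std::wstring& in)
{
    std::wstring out;
    for (std::wstring::size_type i = 0; i != in.size(); ++i)
    {
        if (in[i] == 0x00DF)
            out += L"SS";      // one code point becomes two
        else
            out += in[i];      // identity here; the real tables go here
    }
    return out;
}
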
- The interface of std::collate<Ch> is not at all suitable for providing all of the functionality desired for Unicode string collation. A suitable Unicode collation facility should at least allow for user-selection of the strength level used (refer to http://www.unicode.org/unicode/reports/tr10/),
Can't you 'imbue' a new facet whenever you need to change something? It's needed, though, to decide what to use for 'charT' and what encoding to use. If ICU can compare UTF-16 encoded strings, then it's possible to pass those strings to 'compare'. I don't understand what 'transform' is for, though.
It would still be possible to use the standard locale object as a container of an entirely new set of facets, which could be loaded from the data sources based on the name of the locale, and ``injected'' into an existing locale object, by calling some function. It is not clear, however, what advantage this would serve over simply using a thin-wrapper over a locale name to represent a ``locale,'' as is done in the ICU library.
First, using std::locale would be more familiar. Second, std::locale allows mixing different facets, and that's a good thing in general. E.g. I have all POSIX locale categories set to "C" except for LC_CTYPE. It would be inconvenient to have only one locale setting for everything. - Volodya

It seems that Unicode support in Boost (which could lead to Unicode support in the C++ language and standard library) would be quite desirable.
You bet, I would love to add much improved Unicode support to Boost.Regex (the issue is being raised by users more and more often), but I need a "standard" library on which to base it. I was planning to raise this issue myself later in the summer, but if you want to take it on, that's one less thing to worry about :-)
The IBM International Components for Unicode (ICU) library (http://oss.software.ibm.com/icu/) is an existing C++ library with what appears to be a Boost-compatible license, which provides all or most of the Unicode support that would be desired in Boost or the C++ standard library, in addition to Unicode-equivalents of libraries already either in the standard library or in Boost, including number/currency formatting, date formatting, message formatting, and a regular expression library. Unfortunately, it does not use C++ exceptions to signal exceptional conditions (but rather it uses an error code return mechanism), it does not follow Boost naming conventions, and although there are some C++-specific facilities, most of the C++ API is the same as the C API, thus resulting in a less-than-optimal C++ interface.
Agreed.
Nonetheless, I think Boostifying the ICU library would be quite feasible, whereas attempting to reimplement all of the desired functionality that the ICU library provides would be extremely time consuming, since the collating and other services in the ICU library already support a large number of locales, and the character-set conversion facilities support a large number of character sets.
The representation of locales does present an issue that needs to be considered. The existing C++ standard locale facets are not very suitable for a variety of reasons:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
- The interface of std::collate<Ch> is not at all suitable for providing all of the functionality desired for Unicode string collation.
There may be problems with other facets, but not with this one IMO: Unicode provides a well-defined algorithm for creating a sort key from a Unicode string, and that's exactly the facility that std::collate needs (for transform; the other member methods can then be implemented in terms of that).
A suitable Unicode collation facility should at least allow for user-selection of the strength level used (refer to http://www.unicode.org/unicode/reports/tr10/), and would ideally also support customizations as extensive as the ICU library does (refer to
http://oss.software.ibm.com/icu/userguide/Collate_ServiceArchitecture.html
and http://oss.software.ibm.com/icu/userguide/Collate_Customization.html).
Complicated stuff! Normally the features that you are describing would be handled by a named collate facet:

template <class charT>
class unicode_collate_byname : public std::collate<charT>
{
    unicode_collate_byname(const char* locale_name);
    /* details */
};

When the user imbues their locale with a unicode_collate_byname("en_GB"), then they would expect it to "do the right thing". Of course there may be a lower-level interface below this, but I see no problem with implementing this facet.
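
A self-contained usage sketch of how such a named facet would be injected and used follows; the unicode_collate_byname here is only a placeholder that forwards to std::collate_byname, standing in for the real Unicode-aware facet described above:

#include <locale>
#include <string>

// Placeholder facet: a real implementation would apply the Unicode
// collation algorithm for the named locale instead of forwarding.
template <class charT>
class unicode_collate_byname : public std::collate_byname<charT>
{
public:
    explicit unicode_collate_byname(const char* name)
        : std::collate_byname<charT>(name) {}
};

int main()
{
    // Inject the named facet into a copy of the classic locale.
    std::locale loc(std::locale::classic(),
                    new unicode_collate_byname<wchar_t>("C"));
    const std::collate<wchar_t>& coll =
        std::use_facet<std::collate<wchar_t> >(loc);

    std::wstring a(L"apple"), b(L"banana");
    int r = coll.compare(a.data(), a.data() + a.size(),
                         b.data(), b.data() + b.size()); // -1, 0 or 1
    (void)r;
}
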
- Facilities such as Unicode string collation are heavily data-driven, and it would be inefficient to load the data for facilities that are not used. This could be avoided by using some sort of lazy loading mechanism.
Yep.
It would still be possible to use the standard locale object as a container of an entirely new set of facets, which could be loaded from the data sources based on the name of the locale, and ``injected'' into an existing locale object, by calling some function. It is not clear, however, what advantage this would serve over simply using a thin-wrapper over a locale name to represent a ``locale,'' as is done in the ICU library.
That would be a really bad idea - no code would take advantage of those facets; the big advantage of implementing the std ones is that it gets iostreams working with Unicode data types, and that would then get lexical_cast and a whole load of other things working too...

However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:

1) define the data types for 8/16/32 bit Unicode characters.
2) define iterator adapters to convert a sequence of one Unicode character type to another.
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, and typedef the appropriate string types:

   typedef basic_string<utf8_t> utf8_string; // etc

4) define low level access to the core Unicode data properties (in unidata.txt).
5) Begin to add locale support - a big job, probably a few facets at a time.
6) define iterator adapters for various Unicode algorithms (composition/decomposition/compression etc).
7) Anything I've forgotten :-)

The main goal would be to define a good clean interface; the implementation could be:

1) On top of ICU.
2) On top of platform-specific APIs (Windows and, I believe, MacOS X have some Unicode support without the need to resort to ICU or whatever).
3) An independent Boost implementation (difficult once you get into the locale-specific stuff).

Anyway, I hope these thoughts help,

John.
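
A rough sketch of what the decoding half of step 2 might amount to, assuming well-formed input; a real library would express this as an iterator adapter and would validate the input, both of which are omitted here for brevity:

#include <cstddef>
#include <vector>
#include <boost/cstdint.hpp>

// Sketch only: turn UTF-8 code units into code points.  Ill-formed input
// is not detected; this just shows the shape of the conversion.
std::vector<boost::uint32_t> utf8_to_code_points(const std::vector<unsigned char>& in)
{
    std::vector<boost::uint32_t> out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char lead = in[i];
        std::size_t len;
        boost::uint32_t cp;
        if      (lead < 0x80) { len = 1; cp = lead; }          // ASCII
        else if (lead < 0xE0) { len = 2; cp = lead & 0x1F; }   // 2-byte sequence
        else if (lead < 0xF0) { len = 3; cp = lead & 0x0F; }   // 3-byte sequence
        else                  { len = 4; cp = lead & 0x07; }   // 4-byte sequence
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (in[i + k] & 0x3F);               // continuation bytes
        out.push_back(cp);
        i += len;
    }
    return out;
}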

"John Maddock" <john@johnmaddock.co.uk> writes:
[snip]
The representation of locales does present an issue that needs to be considered. The existing C++ standard locale facets are not very suitable for a variety of reasons:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I'm not sure they will ``just work,'' although I will admit I haven't looked into the issues relating to Unicode support in iostreams thoroughly yet.
[snip]
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
unsigned char for UTF-8 code units, boost::uint16_t for UTF-16 code units, and boost::int32_t for UTF-32 code units and to represent Unicode code points
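
Spelled out as code, that suggestion amounts to something like the following (the typedef names are illustrative only):

#include <boost/cstdint.hpp>

typedef unsigned char   utf8_t;   // UTF-8 code unit
typedef boost::uint16_t utf16_t;  // UTF-16 code unit
typedef boost::int32_t  utf32_t;  // UTF-32 code unit / Unicode code point
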
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is easy enough; unicode.org provides optimized C code for this purpose, which could easily be changed slightly for use in iterator adapters. Alternatively, ICU probably provides this too, though less directly.
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
As far as the use of UTF-8 as the internal encoding, I personally would suggest that UTF-16 be used instead, because UTF-8 is rather inefficient to work with. Although I am not overly attached to UTF-16, I do think it is important to standardize on a single internal representation, because for practical reasons, it is useful to be able to have non-templated APIs for purposes such as collating. The other issues I see with using basic_string include that many of its methods would not be suitable for use with a Unicode string, and it does not have something like an operator += which would allow appending of a single Unicode code point (represented as a 32-bit integer). What it comes down to is that basic_string is designed with fixed-width character representations in mind. I would be more in favor of creating a separate type to represent Unicode strings.
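
A sketch of the kind of dedicated string type being argued for here, assuming a fixed UTF-16 internal representation; the class name and interface are hypothetical:

#include <cstddef>
#include <vector>
#include <boost/cstdint.hpp>

// Sketch only: a Unicode string holding UTF-16 code units internally,
// with operator+= appending one code point, encoding it as a single
// code unit or as a surrogate pair as needed.
class unicode_string
{
public:
    unicode_string& operator+=(boost::uint32_t code_point)
    {
        if (code_point < 0x10000)
        {
            units_.push_back(static_cast<boost::uint16_t>(code_point));
        }
        else
        {
            boost::uint32_t c = code_point - 0x10000;
            units_.push_back(static_cast<boost::uint16_t>(0xD800 + (c >> 10)));
            units_.push_back(static_cast<boost::uint16_t>(0xDC00 + (c & 0x3FF)));
        }
        return *this;
    }

    std::size_t size() const { return units_.size(); }  // in code units

private:
    std::vector<boost::uint16_t> units_;
};
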
4) define low level access to the core Unicode data properties (in unidata.txt).
Reuse of the ICU library would probably be very helpful in this.
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale. Furthermore, many of the operations such as toupper on a single character are not well defined, and rather must be defined as a string to string mapping. Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient). Specific cases include collate<Ch>, which lacks an interface for configuring collation, such as which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity. Additionally, it depends on basic_string<Ch> (note lack of char_traits specification), which is used as the return type of transform, when something representing a byte array might be more suitable. Additionally, num_put, moneypunct and money_put all would allow only a single code unit in a number of cases, when a string of multiple code points would be suitable. In addition, those facets also depend on basic_string<Ch>.
6) define iterator adapters for various Unicode algorithms (composition/decomposition/compression etc). 7) Anything I've forgotten :-)
A facility for Unicode substring matching, which would use the collation facilities, would be useful. This could be based on the ICU implementation. Additionally, a date formatting facility for Unicode would be useful.
[snip]
-- Jeremy Maitin-Shepard

In article <87ptabsq2h.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
What it comes down to is that basic_string is designed with fixed-width character representations in mind.
I would be more in favor of creating a separate type to represent Unicode strings.
I agree with this completely.
4) define low level access to the core Unicode data properties (in unidata.txt).
Reuse of the ICU library would probably be very helpful in this.
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale. Furthermore, many of the operations such as toupper on a single character are not well defined, and rather must be defined as a string to string mapping. Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).
You are forgetting that abstract Unicode characters are defined as sequences of code points (even if those code points are 32-bit) and string manipulation has to take this into account (there are numerous combinations of characters and combining marks that must be treated as single units for purpose of searching, collation, etc.) A single encoded character type may be 32 bits, but encoded characters are often not the level on which the clients need to manipulate strings. I am happy to see that there is someone here who knows more about locales in C++ than I do; I haven't had the time to research that as thoroughly as I would like to. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Miro Jurisic <macdev@meeroh.org> writes:
[snip]
You are forgetting that abstract Unicode characters are defined as sequences of code points (even if those code points are 32-bit) and string manipulation has to take this into account (there are numerous combinations of characters and combining marks that must be treated as single units for purpose of searching, collation, etc.) A single encoded character type may be 32 bits, but encoded characters are often not the level on which the clients need to manipulate strings.
Right, it will certainly be necessary to provide a grapheme_cluster_iterator (with value_type = the Unicode string type). ICU should help with this. Nonetheless, it is useful to represent a single code point, for several reasons:

- For the purpose of string construction, the Unicode specification explicitly states that any sequence of code points is well formed, and so this provides the smallest unit by which guaranteed-well-formed strings can be formed.
- It would be useful to provide functions for querying the Unicode properties of individual code points, and this code_point type would be the only suitable parameter type.

I do agree, however, that for almost any output formatting, the locale-specific or user-specified fill text/symbols should be specified as strings, rather than as individual characters. -- Jeremy Maitin-Shepard
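
An interface sketch of what such per-code-point property queries might look like; all names are hypothetical, and only the structural checks that need no data tables are implemented here:

#include <boost/cstdint.hpp>

namespace unicode {

typedef boost::uint32_t code_point;

// Trivial structural checks, needing no data tables:
inline bool is_valid(code_point c)     { return c <= 0x10FFFF; }
inline bool is_surrogate(code_point c) { return c >= 0xD800 && c <= 0xDFFF; }

// Data-driven queries, which would be generated from UnicodeData.txt
// (declarations only in this sketch):
int  combining_class(code_point c);
bool is_alphabetic(code_point c);
bool is_uppercase(code_point c);

} // namespace unicode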

In article <87isg3skr6.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
Right, it will certainly be necessary to provide a grapheme_cluster_iterator (with value_type = the Unicode string type). ICU should help with this.
You are conflating abstract characters (which exist in absence of a graphical representation) and graphemes (whose existence is dependent upon the graphical representation), but I believe we are talking about the same thing.
Nonetheless, it is useful to represent a single code point, for several reasons:
I agree; as I mentioned elsewhere, I believe that the Unicode string abstraction needs to support at least iteration by abstract characters, encoded characters, and encoding units.
- For the purpose of string construction, the Unicode specification explicitly states that any sequence of code points is well formed, and so this provides the smallest unit by which guaranteed-well-formed strings can be formed.
Can you refer me to a specific point in the spec where this is stated?
- It would be useful to provide functions for querying the Unicode properties of individual code points, and this code_point type would be the only suitable parameter type.
Absolutely.
I do agree, however, that for almost any output formatting, the locale-specific or user-specified fill text/symbols should be specified as strings, rather than as individual characters.
Yes. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Miro Jurisic <macdev@meeroh.org> writes:
In article <87isg3skr6.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
[snip]
- For the purpose of string construction, the Unicode specification explicitly states that any sequence of code points is well formed, and so this provides the smallest unit by which guaranteed-well-formed strings can be formed.
Can you refer me to a specific point in the spec where this is stated?
In Unicode 4.0.1, Chapter 3.9:

D30a Well-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it does follow the specification of that Unicode encoding form.
- A Unicode code unit sequence that consists entirely of a sequence of well-formed Unicode code unit sequences (all of the same Unicode encoding form) is itself a well-formed Unicode code unit sequence.

Thus, since any code unit sequence representing a single Unicode scalar value is itself well-formed, any sequence of encoded code points is well-formed.
[snip]
-- Jeremy Maitin-Shepard

In article <87r7ur4kmy.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
Miro Jurisic <macdev@meeroh.org> writes:
In article <87isg3skr6.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
[snip]
- For the purpose of string construction, the Unicode specification explicitly states that any sequence of code points is well formed, and so this provides the smallest unit by which guaranteed-well-formed strings can be formed.
Can you refer me to a specific point in the spec where this is stated?
In Unicode 4.0.1, Chapter 3.9, D30a
Right, ok, everything you said so far makes sense; I agree that operating on encoded characters (as sequences of code points) is useful in a number of contexts, and (as I already pointed out), operating on abstract characters is useful in other contexts. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

"Jeremy Maitin-Shepard" <jbms@attbi.com> wrote in message news:87ptabsq2h.fsf@jbms.ath.cx...
"John Maddock" <john@johnmaddock.co.uk> writes: [snip]
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale.
The new string algorithms to_upper etc do support locales. br Thorsten

"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
"Jeremy Maitin-Shepard" <jbms@attbi.com> wrote in message news:87ptabsq2h.fsf@jbms.ath.cx...
"John Maddock" <john@johnmaddock.co.uk> writes: [snip]
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale.
The new string algorithms to_upper etc do support locales.
I'm not sure you are understanding me. What I am saying is that operations such as "convert to uppercase" on Unicode strings are locale-independent, and thus such operations need not and should not be part of the locale interface. It seems perhaps you are referring to the string algorithm library that was reviewed some time ago. -- Jeremy Maitin-Shepard

"Jeremy Maitin-Shepard" <jbms@attbi.com> wrote in message news:87llkz47tz.fsf@jbms.ath.cx...
"Thorsten Ottosen" <nesotto@cs.auc.dk> writes: [snip]
The new string algorithms to_upper etc do support locales.
I'm not sure you are understanding me. What I am saying is that operations such as "convert to uppercase" on Unicode strings are locale-independent,
yep, I misunderstood you. br Thorsten

In article <87llkz47tz.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
What I am saying is that operations such as "convert to uppercase" on Unicode strings are locale-independent, and thus such operations need not and should not be part of the locale interface.
To clarify even further, Unicode incorporates some concepts that have traditionally been swept under the locale rug; string encodings and character properties fall in that category. Unicode does not completely replace locale facilities, of course, as it only deals with strings, and not with all other l10n/i18n issues. Furthermore, the locale abstraction is not always compatible with the Unicode abstraction; this is primarily because the locale abstraction defines characters as fixed-size entities and treats many transformations, such as case change, as 1-1 mappings, whereas Unicode uses a more general definition that works in more languages and locales. As a result, just because Unicode and locales both deal with some of the same concepts, that doesn't mean their treatment is compatible. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Jeremy Maitin-Shepard <jbms@attbi.com> writes:
What I am saying is that operations such as "convert to uppercase" on Unicode strings are locale-independent, and thus such operations need not and should not be part of the locale interface.
In which case you are wrong. The SpecialCasing.txt file from the Unicode data file set identifies locale-specific case conversions, such as:

# Turkish and Azeri
# I and i-dotless; I-dot and i are case pairs in Turkish and Azeri
# The following rules handle those cases.
# Remove spurious dot above small i's when lowercasing, if there are no more
# accents above:
0307; ; 0307; 0307; tr AFTER_i NOT_MORE_ABOVE # COMBINING DOT ABOVE
0307; ; 0307; 0307; az AFTER_i NOT_MORE_ABOVE # COMBINING DOT ABOVE
# Fix case pairs
0049; 0131; 0049; 0049; tr; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
0049; 0131; 0049; 0049; az; # LATIN CAPITAL LETTER I
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

In fact, as the sample shows, not only is case conversion locale-dependent, but it is context-dependent too --- the conversion of a character depends on the preceding characters. Anthony -- Anthony Williams Senior Software Engineer, Beran Instruments Ltd.
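
A small sketch of what that locale sensitivity means for a case-conversion interface, handling only the Turkish/Azeri dotted-i rule from the excerpt above and falling back to a naive ASCII mapping otherwise; the function and parameter names are hypothetical:

#include <string>

// Sketch only: in "tr"/"az" locales, U+0069 LATIN SMALL LETTER I
// uppercases to U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE, not U+0049.
// Real code would use the full Unicode case tables and handle context.
std::wstring to_upper(const std::wstring& in, const std::string& locale_name)
{
    const bool turkic = (locale_name == "tr" || locale_name == "az");
    std::wstring out;
    for (std::wstring::size_type i = 0; i != in.size(); ++i)
    {
        wchar_t c = in[i];
        if (turkic && c == 0x0069)
            out += static_cast<wchar_t>(0x0130);   // i -> I WITH DOT ABOVE
        else if (c >= L'a' && c <= L'z')
            out += static_cast<wchar_t>(c - (L'a' - L'A'));
        else
            out += c;                              // full tables in real code
    }
    return out;
}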

Anthony Williams <anthony_w.geo@yahoo.com> writes:
Jeremy Maitin-Shepard <jbms@attbi.com> writes:
What I am saying is that operations such as "convert to uppercase" on Unicode strings are locale-independent, and thus such operations need not and should not be part of the locale interface.
In which case you are wrong.
Hmm, it seems that as you say there are a few cases where locale affects properties/transformations, and so those facilities should be tied to the locale. Nonetheless, it does not appear that the existing ctype facet is suitable for Unicode. -- Jeremy Maitin-Shepard

On Wed, Apr 14, 2004 at 11:37:29AM +1000, Thorsten Ottosen wrote:
"Jeremy Maitin-Shepard" <jbms@attbi.com> wrote in message news:87ptabsq2h.fsf@jbms.ath.cx...
"John Maddock" <john@johnmaddock.co.uk> writes: [snip]
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale.
The new string algorithms to_upper etc do support locales.
They do; however, the algorithm only operates on a per-character basis. This discussion about Unicode goes far beyond what is possible to achieve by using std locales. Anyway, if proper Unicode support is provided in Boost, I see no problem with adapting the string algorithms to make use of it. Regards, Pavol

Pavol Droba wrote:
The new string algorithms to_upper etc do support locales.
They do, however, the algorithm only operates on per-character basis. This discussion about the unicode goes far beyond what is possible to achieve by using std locales.
Anyway, if there will be a proper unicode support provided in boost, I see no problem with the adaptation of string algorithms to make use of it.
Hmm... shouldn't it be the other way around? That is boost::unicode_string should provide some iterators which can be passed to the string_algo library, and guarantee that it works OK. - Volodya

On Wed, Apr 14, 2004 at 12:15:22PM +0400, Vladimir Prus wrote:
Pavol Droba wrote:
The new string algorithms to_upper etc do support locales.
They do, however, the algorithm only operates on per-character basis. This discussion about the unicode goes far beyond what is possible to achieve by using std locales.
Anyway, if there will be a proper unicode support provided in boost, I see no problem with the adaptation of string algorithms to make use of it.
Hmm... shouldn't it be the other way around? That is boost::unicode_string should provide some iterators which can be passed to the string_algo library, and guarantee that it works OK.
AFAIK there is no unicode_string so far. So the string_algo lib works with the stuff that is available, i.e. std containers (with fixed-width characters) and locales. I can imagine support for a Unicode string. If it were well designed, many string algorithms could work almost out of the box. Regards, Pavol

Pavol Droba wrote:
Hmm... shouldn't it be the other way around? That is boost::unicode_string should provide some iterators which can be passed to the string_algo library, and guarantee that it works OK.
AFAIK there is no unicode_string there so far. So the string_algo lib works with the stuff, that is available. i.e. std containers (with fixed-width characters) and locales.
I can imagine a support for unicode string. If it would be well designed, many string algorithms can work almost out-of-box.
I mean that one of the design goals of unicode string is allowing other code (like string_algo) to work out of the box. - Volodya

On Wed, Apr 14, 2004 at 02:26:34PM +0400, Vladimir Prus wrote:
Pavol Droba wrote:
Hmm... shouldn't it be the other way around? That is boost::unicode_string should provide some iterators which can be passed to the string_algo library, and guarantee that it works OK.
AFAIK there is no unicode_string there so far. So the string_algo lib works with the stuff, that is available. i.e. std containers (with fixed-width characters) and locales.
I can imagine a support for unicode string. If it would be well designed, many string algorithms can work almost out-of-box.
I mean that one of the design goals of unicode string is allowing other code (like string_algo) to work out of the box.
I understand that, but there is a conceptual difference between classic strings and a Unicode string. For instance, to implement a to_upper algorithm, the former needs locales, while the latter has the toupper functionality built in. I can imagine a way to make these two compatible; the question, however, is whether it would compromise the natural properties of the specific string representations. Pavol

1) define the data types for 8/16/32 bit Unicode characters.
unsigned char for UTF-8 code units, boost::uint16_t for UTF-16 code units, and boost::int32_t for UTF-32 code units and to represent Unicode code points
Almost, ICU uses wchar_t for UTF-16 on Win32 (just to complicate things).
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is easy enough, unicode.org provides optimized C code for this purpose, which could easily be changed slightly for the use of iterator adapters. Alternatively, ICU probably less directly provides this.
Yep.
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
As far as the use of UTF-8 as the internal encoding, I personally would suggest that UTF-16 be used instead, because UTF-8 is rather inefficient to work with. Although I am not overly attached to UTF-16, I do think it is important to standardize on a single internal representation, because for practical reasons, it is useful to be able to have non-templated APIs for purposes such as collating.
You can use whatever you want - I don't think users should be constrained to a specific internal encoding. Personally I don't like UTF8 either, but I know some people do...
The other issues I see with using basic_string include that many of its methods would not be suitable for use with a Unicode string, and it does not have something like an operator += which would allow appending of a single Unicode code point (represented as a 32-bit integer).
What it comes down to is that basic_string is designed with fixed-width character representations in mind.
I would be more in favor of creating a separate type to represent Unicode strings.
Personally I think we have too many string types around already. While I understand your concerns about basic_string, as a container of code points it's just fine IMO. We can always add non-member functions for more advanced manipulation.
4) define low level access to the core Unicode data properties (in unidata.txt).
Reuse of the ICU library would probably be very helpful in this.
5) Begin to add locale support - a big job, probably a few facets at a time.
The issue is that, despite what you say, most or all of the standard library facets are not suitable for use with Unicode strings. For instance, the character classification and toupper-like operations need not be tied to a locale.
Accepted, ctype operations are largely (though not completely) independent of the locale; that just makes the ctype specialisations easier IMO.
Furthermore, many of the operations such as toupper on a single character are not well defined, and rather must be defined as a string to string mapping.
I know, however 1 to 1 approximations are available (those in Unidata.txt). I'm not saying that the std locale facets should be the only interface, or even the primary one, but providing it does get a lot of other stuff working.
Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).
True, for UTF-16 only the core Unicode subset would be supported by std::locale (ie no surrogates): this is the same as the situation in Java and JavaScript.
Specific cases include collate<Ch>, which lacks an interface for configuring collation, such as which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity.
You can provide any constructor interface to the collate facet that you want; for example, to support a locale and a strength level one might use:

template <class charT>
class unicode_collate : public std::collate<charT>
{
public:
    unicode_collate(const char* name, int level = INT_MAX);
    /* details */
};

I'm assuming that we have a non-member function to create a locale object that contains a set of Unicode facets:

std::locale create_unicode_locale(const char* name);

Usage to create a locale object with primary level collation would then be:

std::locale l(create_unicode_locale("en_GB"), new unicode_collate<wchar_t>("en_GB", 1));
mystream.imbue(l);
mystream << something; // etc.
Additionally, it depends on basic_string<Ch> (note lack of char_traits specification), which is used as the return type of transform, when something representing a byte array might be more suitable.
You might have me on that one :-)
Additionally, num_put, moneypunct and money_put all would allow only a single code unit in a number of cases, when a string of multiple code points would be suitable. In addition, those facets also depend on basic_string<Ch>.
I don't understand what the problem is there, please explain.
6) define iterator adapters for various Unicode algorithms (composition/decomposition/compression etc). 7) Anything I've forgotten :-)
A facility for Unicode substring matching, which would use the collation facilities, would be useful. This could be based on the ICU implementation.
Additionally, a date formatting facility for Unicode would be useful.
std::time_get / std::time_put ? :-) John.

"John Maddock" <john@johnmaddock.co.uk> writes:
[snip]
You can use whatever you want - I don't think users should be constrained to a specific internal encoding. Personally I don't like UTF8 either, but I know some people do...
Well, the problem with making the Unicode string, however it is defined, templated on the encoding is that the entire library must then exist as function templates, which, given the size and data-driven nature of operations such as collation and the convenience of being able to use run-time polymorphism, would be rather undesirable, as I see it.
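
A sketch of the non-templated style of API this argument points towards: with one internal encoding fixed, a data-driven service like collation can sit behind an ordinary abstract base class instead of a family of function templates. The interface and names here are hypothetical:

#include <boost/cstdint.hpp>

// Sketch only: a runtime-polymorphic collation service over a fixed
// internal encoding (UTF-16 code units assumed here).
class collator
{
public:
    virtual ~collator() {}

    // Compare two UTF-16 code unit sequences according to the collation
    // rules and strength level this collator was constructed with.
    virtual int compare(const boost::uint16_t* first1, const boost::uint16_t* last1,
                        const boost::uint16_t* first2, const boost::uint16_t* last2) const = 0;
};
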
The other issues I see with using basic_string include that many of its methods would not be suitable for use with a Unicode string, and it does not have something like an operator += which would allow appending of a single Unicode code point (represented as a 32-bit integer).
What it comes down to is that basic_string is designed with fixed-width character representations in mind.
I would be more in favor of creating a separate type to represent Unicode strings.
Personally I think we have too many string types around already. While I understand you're concerns about basic_string, as a container of code-points it's just fine IMO. We can always add non-member functions for more advanced manipulation.
I do see the advantage of using an existing type, and I would agree that the run-time complexity issues can be avoided. However, I can also see how it would give a false sense of compatibility, because users would tend to view the basic_string type as something higher-level than it really is; for instance, basic_string defines many low-level operations such as find (even find for a single code unit), which when dealing with Unicode text should probably be avoided in most cases, and so the additional verbosity of using std::find and std::search would be beneficial. Also, I do like the idea of having an operator += for the code point type, which would not be possible with basic_string.
[snip]
Furthermore, many of the operations such as toupper on a single character are not well defined, and rather must be defined as a string to string mapping.
I know, however 1 to 1 approximations are available (those in Unidata.txt). I'm not saying that the std locale facets should be the only interface, or even the primary one, but providing it does get a lot of other stuff working.
As I describe below, I would be reluctant to provide a deficient interface, particularly when it is likely to be used since it is the interface with which users are familiar.
Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).
True, for UTF-16 only the core Unicode subset would be supported by std::locale (ie no surrogates): this is the same as the situation in Java and JavaScript.
I don't really like the idea of providing a deficient interface, since at the very least that would necessitate providing an additional non-deficient interface. I would say: if we are going to add Unicode support, we might as well do it right, instead of trying to hack it into the existing interface in a way that would result in something less than intuitive to use. Note that AFAIK, there are a few common-use (Cantonese) characters outside the Basic Multilingual Plane (BMP).
Specific cases include collate<Ch>, which lacks an interface for configuring collation, such as which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity.
You can provide any constructor interface to the collate facet that you want, for example to support a locale and a strenth level one might use:
template <class charT>
class unicode_collate : public std::collate<charT>
{
public:
    unicode_collate(const char* name, int level = INT_MAX);
    /* details */
};
I'm assuming that we have a non-member function to create a locale object that contains a set of Unicode facets:
std::locale create_unicode_locale(const char* name);
Usage to create a locale object with primary level collation would then be:
std::locale l(create_unicode_locale("en_GB"), new unicode_collate<wchar_t>("en_GB", 1));
mystream.imbue(l);
mystream << something; // etc.
Okay, I suppose the parameters can be specified at construction-time.
Additionally, it depends on basic_string<Ch> (note lack of char_traits specification), which is used as the return type of transform, when something representing a byte array might be more suitable.
You might have me on that one :-)
There is also the additional problem with UTF-16 that (AFAIK) a binary sort of UTF-16 strings does not give an ordering by code point. This is one other reason I don't really like the use of basic_string.
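
The mismatch comes from BMP code units in the range 0xE000-0xFFFF comparing greater than the surrogate code units (0xD800-0xDFFF) that encode code points above 0xFFFF. A comparison in code point order can compensate by remapping code units before comparing; the sketch below is only an illustration of that idea, not any particular library's API:

#include <cstddef>
#include <boost/cstdint.hpp>

// Remap UTF-16 code units so that unsigned comparison of the remapped
// values matches code point order (surrogates move above 0xE000-0xFFFF).
inline boost::uint16_t code_point_order_fixup(boost::uint16_t c)
{
    if (c >= 0xE000) return static_cast<boost::uint16_t>(c - 0x800);
    if (c >= 0xD800) return static_cast<boost::uint16_t>(c + 0x2000);
    return c;
}

int compare_utf16_code_point_order(const boost::uint16_t* a, std::size_t na,
                                   const boost::uint16_t* b, std::size_t nb)
{
    std::size_t n = na < nb ? na : nb;
    for (std::size_t i = 0; i != n; ++i)
    {
        boost::uint16_t ca = code_point_order_fixup(a[i]);
        boost::uint16_t cb = code_point_order_fixup(b[i]);
        if (ca != cb) return ca < cb ? -1 : 1;
    }
    if (na == nb) return 0;
    return na < nb ? -1 : 1;
}
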
Additionally, num_put, moneypunct and money_put all would allow only a single code unit in a number of cases, when a string of multiple code points would be suitable. In addition, those facets also depend on basic_string<Ch>.
I don't understand what the problem is there, please explain.
For purposes such as specifying the fill character, these facets allow only a single code unit. For Unicode, a string of code points would be more suitable. Additionally, these facets must supply certain other symbols as basic_string<Ch> (without char_traits specialization), and so various specializations of basic_string would end up being used to represent Unicode strings, resulting in problems, I would say.
[snip]
Additionally, a date formatting facility for Unicode would be useful.
std::time_get / std::time_put ? :-)
std::time_put has the same problems as std::num_put with respect to requiring a single code unit to represent the fill character. -- Jeremy Maitin-Shepard

In article <87d66a4lsf.fsf@jbms.ath.cx>, Jeremy Maitin-Shepard <jbms@attbi.com> wrote:
I do see the advantage of using an existing type, and I would agree that the run-time complexity issues can be avoided. However, I can also see how it would give a false sense of compatibility, because users would tend to view the basic_string type as something higher-level than it really is; for instance, basic_string defines many low-level operations such as find (even find for a single code unit), which when dealing with Unicode text should probably be avoided in most cases, and so the additional verbosity of using std::find and std::search would be beneficial.
I completely agree with this. std::basic_string is an abstraction that happens to share a part of the name of what we are discussing here (Unicode strings), and parts of its implementation are suitable for our purpose, but its interface is very much mismatched to what we are looking for. Therefore, I firmly believe that using std::basic_string to represent Unicode strings (as opposed to using it to implement Unicode strings) would provide both a false sense of security (to those who would think that their knowledge of basic_string directly carries over to Unicode strings) and an invitation to use those parts of the basic_string interface that are simply wrong for Unicode strings (such as basic_string::find*). Just because it's a string abstraction, it doesn't mean it's the string abstraction we want. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding.
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe that it is, because even within one encoding many strings can have multiple representations.
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
7) Anything I've forgotten :-)
I think you have forgotten to read and understand the complexity of Unicode (or any of the books that discuss the spec less tersely, such as Unicode Demystified), because I think that some of the suggestions you made here are incompatible with how Unicode actually works. Please correct me if I am wrong -- I would love to be wrong :-)
The main goal would be to define a good clean interface, the implementation could be:
We can't define a good clean interface until we understand the problems. meeroh -- If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work", that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Basic_string is a sequence of code points, no more, no less; all performance guarantees for basic_string can be met as such.
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings. Unicode characters are not fixed-width data in any encoding.
Well, it is the same first step that ICU takes: there is also a proposal before the C language committee to introduce such data types (they're called char16_t and char32_t), and C++ is likely to follow suit (see http://std.dkuug.dk/jtc1/sc22/wg14/www/docs/n1040.pdf). I'm talking about code points (and sequences thereof), not characters or glyphs, which as you say consist of multiple code points. I would handle "characters" and "glyphs" as iterator adapters sitting on top of sequences of code points. For code points, basic_string is as good a container as any (as are vector and deque and anything else you care to define).
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe that it is, because even within one encoding many strings can have multiple representations.
I'm not talking about normalisation / composition here: just conversion between encodings; ICU does this already, as do many other libraries. Iterator adapters for normalisation / composition / compression would also be useful additions. Likewise adapters for iterating "characters" and "glyphs".
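To make the intent concrete, here is a minimal, unchecked sketch of what such an adapter has to do internally for the UTF-16 to UTF-32 direction (a real adapter would expose this as an iterator, and would have to signal or repair ill-formed input rather than assume it away):

    #include <boost/cstdint.hpp>
    #include <cstddef>
    #include <vector>

    typedef boost::uint16_t utf16_t;
    typedef boost::uint32_t utf32_t;

    // Decode a sequence of UTF-16 code units into code points.
    // Assumes well-formed input: every high surrogate is followed by
    // a low surrogate, and no low surrogate appears on its own.
    std::vector<utf32_t> utf16_to_utf32(const std::vector<utf16_t>& in)
    {
        std::vector<utf32_t> out;
        for (std::size_t i = 0; i < in.size(); ++i)
        {
            utf32_t c = in[i];
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in.size())
            {
                // Combine the surrogate pair into one code point.
                utf32_t low = in[++i];
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            }
            out.push_back(c);
        }
        return out;
    }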
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Working on sequences of code points always requires care: clearly one could erase a low surrogate and leave a high surrogate "orphaned" behind, for example. One would need to make it clear in the documentation that potential problems like this can occur.
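A tiny illustration of the orphaned-surrogate problem (the container here is just a stand-in for whatever code-unit sequence type ends up being used):

    #include <boost/cstdint.hpp>
    #include <vector>

    typedef boost::uint16_t utf16_t;

    int main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF, encoded in UTF-16 as a surrogate pair.
        std::vector<utf16_t> s;
        s.push_back(0xD834);  // high surrogate
        s.push_back(0xDD1E);  // low surrogate

        // Erasing a single code unit leaves the high surrogate orphaned:
        // the sequence is no longer well-formed UTF-16, and the container
        // has no way of knowing or caring.
        s.erase(s.begin() + 1);
        return 0;
    }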
7) Anything I've forgotten :-)
I think you have forgotten to read and understand the complexity of Unicode (or any of the books that discuss the spec less tersely, such as Unicode Demystified), because I think that some of the suggestions you made here are incompatible with how Unicode actually works. Please correct me if I am wrong -- I would love to be wrong :-)
Well, sometimes I'm wrong, and sometimes I'm right ;-) Unicode is such a large and complex issue that it's actually pretty hard to keep even a small fraction of the issues in one's mind at a time, hence my suggestion to split the issue up into a series of steps.
The main goal would be to define a good clean interface, the implementation could be:
We can't define a good clean interface until we understand the problems.
Obviously.

John.

John Maddock wrote:
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Basic_string is a sequence of code points, no more, no less; all performance guarantees for basic_string can be met as such.
Using basic_string as a container for code points is fine, but what do we do about the other operations: find, replace, and whatever else? It would be nice if the interface the user reaches for most of the time were also the most convenient one. If we agree that the user is most likely to want find/replace and the like on a sequence of *characters*, it's not good to require something like

    std::find(unicode_iterator(s.begin()), unicode_iterator(s.end()), .....);

to do that, and since it's not possible to change the definition of std::string, you might want a boost::unicode_string whose find methods work on characters. Another possible approach could be:

    typedef basic_string<wchar_t> unicode_codepoints_string;

    class unicode_characters_string {
    public:
        unicode_characters_string(const unicode_codepoints_string&);

        class iterator { };
        iterator begin();
        iterator end();

        // no find* methods!
    private:
        // might even hold rep by reference.
        unicode_codepoints_string& m_rep;
    };

After that, one simply states that to do find/replace on 'unicode_characters_string' one should use the string_algo library. Together with a big warning that basic_string<> does not really do 100% correct find/replace, this might be enough.

In fact, I'm still not sure basic_string is all that useful. If you have unicode_characters_string, which does all operations correctly, and basic_string, which does only some operations correctly, why would you use basic_string? For efficiency?
I'm talking about code-points (and sequences thereof), not characters or glyphs which as you say consist of multiple code points.
I would handle "characters" and "glyphs" as iterator adapters sitting on top of sequences of code points. For code points, basic_string is as good a container as any (as are vector and deque and anything else you care to define).
Iterator adapters are fine for the implementation. I fear that requiring the user to employ iterator adapters directly is a bad decision.
Working on sequences of code points always requires care: clearly one could erase a low surrogate and leave a high surrogate "orphaned" behind, for example. One would need to make it clear in the documentation that potential problems like this can occur.
And what can the user do to avoid such problems, except for not using basic_string? - Volodya

In article <022101c4220f$a0c363f0$a8500352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work"; that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Basic_string is a sequence of code points, no more, no less; all performance guarantees for basic_string can be met as such.
If all you want basic_string for is a sequence of code points, you should use a vector<codePointT> instead, as vector does not provide additional methods that would be at best deceptive and at worst dangerous when applied to Unicode strings.
Iterator adapters for normalisation / composition / compression would also be useful additions.
Likewise adapters for iterating "characters" and "glyphs".
Leaving compression out (I don't see what it has to do with Unicode strings per se), I don't think they would merely be useful additions; I think they would be required in order for a boost Unicode library to meet my expectations.
Working on sequences of code points always requires care: clearly one could erase a low surrogate and leave a high surrogate "orphaned" behind, for example. One would need to make it clear in the documentation that potential problems like this can occur.
It is precisely because this interface is dangerous that I believe that it should not be the default interface to a Unicode string. It is rarely useful and often harmful. It does not make it easy to do things right.
Unicode is such a large and complex issue that it's actually pretty hard to keep even a small fraction of the issues in one's mind at a time, hence my suggestion to split the issue up into a series of steps.
The problem is that I think that some of the steps you propose do not take us in the direction of a useful Unicode string abstraction in boost, but merely provide convenient wrappers for the simple problems without tackling the complicated problems. I don't have a problem with solving simple problems first, but I would like to have a reason to believe that solving those simple problems gets us closer to solving the hard problems at a later time; I am not convinced the approach you propose fits that bill.

meeroh

--
If this message helped you, consider buying an item from my wish list: <http://web.meeroh.org/wishlist>

Miro Jurisic <macdev@meeroh.org> writes:
In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote:
- The standard facets (and the locale class itself, in that it is a functor for comparing basic_strings) are tied to facilities such as std::basic_string and std::ios_base which are not suitable for Unicode support.
Why not? Once the locale facets are provided, the std iostreams will "just work"; that was the whole point of templating them in the first place.
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Only if you use an encoding other than UTF-32/UCS-4. This has to be a (POD) UDT rather than a typedef, so that one may specialize std::char_traits. Of course, if this gets standardized, then it can be a built-in, since the standard can specialize its own templates.
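A rough sketch of the kind of POD UDT plus char_traits specialisation meant here (the names are illustrative and this is only a sketch; a production version needs more care over the int_type / eof choices and the state_type):

    #include <boost/cstdint.hpp>
    #include <cstddef>
    #include <cstring>
    #include <cwchar>   // std::mbstate_t
    #include <ios>      // std::streamoff, std::streampos
    #include <string>

    // POD wrapper for one UTF-32 code point; a UDT rather than a
    // typedef so that specialising std::char_traits for it is legal.
    struct utf32_char
    {
        boost::uint32_t value;
    };

    namespace std
    {
        template<>
        struct char_traits<utf32_char>
        {
            typedef utf32_char       char_type;
            typedef boost::uint32_t  int_type;
            typedef std::streamoff   off_type;
            typedef std::streampos   pos_type;
            typedef std::mbstate_t   state_type;

            static void assign(char_type& c1, const char_type& c2) { c1 = c2; }
            static bool eq(const char_type& a, const char_type& b) { return a.value == b.value; }
            static bool lt(const char_type& a, const char_type& b) { return a.value < b.value; }

            // Plain code-point order: not a culturally correct collation,
            // which belongs in a locale facet, not in char_traits.
            static int compare(const char_type* s1, const char_type* s2, size_t n)
            {
                for (size_t i = 0; i < n; ++i)
                {
                    if (lt(s1[i], s2[i])) return -1;
                    if (lt(s2[i], s1[i])) return 1;
                }
                return 0;
            }
            static size_t length(const char_type* s)
            {
                size_t n = 0;
                while (s[n].value != 0) ++n;
                return n;
            }
            static const char_type* find(const char_type* s, size_t n, const char_type& a)
            {
                for (size_t i = 0; i < n; ++i)
                    if (eq(s[i], a)) return s + i;
                return 0;
            }
            static char_type* move(char_type* s1, const char_type* s2, size_t n)
            {
                std::memmove(s1, s2, n * sizeof(char_type));
                return s1;
            }
            static char_type* copy(char_type* s1, const char_type* s2, size_t n)
            {
                std::memcpy(s1, s2, n * sizeof(char_type));
                return s1;
            }
            static char_type* assign(char_type* s, size_t n, char_type a)
            {
                for (size_t i = 0; i < n; ++i) s[i] = a;
                return s;
            }
            static int_type to_int_type(const char_type& c) { return c.value; }
            static char_type to_char_type(const int_type& i)
            {
                char_type c = { i };
                return c;
            }
            static bool eq_int_type(const int_type& a, const int_type& b) { return a == b; }
            static int_type eof() { return static_cast<int_type>(0xFFFFFFFFu); }
            static int_type not_eof(const int_type& i) { return i == eof() ? 0 : i; }
        };
    }

    typedef std::basic_string<utf32_char> utf32_string;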
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding.
Yes, but a codepoint is 32 bits, and codepoints can come in any sequence. A given sequence of codepoints may or may not have a valid semantic meaning as a "character", but that is like debating whether or not "fjkp" is a valid word --- beyond the scope of basic string handling facilities.
2) define iterator adapters to convert a sequence of one Unicode character type to another.
This is also not as easy as you seem to believe, because even within one encoding many strings can have multiple representations.
That is why there are various canonical forms defined. We should provide a means of converting to the canonical forms. However, this is independent of Unicode encoding --- the same sequence of code points can be represented in each Unicode encoding in precisely one way.
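A concrete example of the multiple-representations point (just comparing raw code points here, no real normaliser involved): U+00E9 and the pair U+0065 U+0301 both denote "é", but compare unequal until both sequences are brought into the same canonical form:

    #include <boost/cstdint.hpp>
    #include <cassert>
    #include <vector>

    typedef boost::uint32_t utf32_t;

    int main()
    {
        // "é" as a single precomposed code point (what NFC would give)...
        std::vector<utf32_t> composed(1, 0x00E9);

        // ...and as a base letter plus a combining acute accent (NFD).
        std::vector<utf32_t> decomposed;
        decomposed.push_back(0x0065);  // LATIN SMALL LETTER E
        decomposed.push_back(0x0301);  // COMBINING ACUTE ACCENT

        // Same abstract character, different code-point sequences:
        // a code-point-level comparison says they differ.
        assert(composed != decomposed);
        return 0;
    }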
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Yes. basic_string<CharType> relies on each CharType being a valid entity in its own right --- for Unicode this means it must be a single Unicode code point, so using basic_string for UTF-8 is out.

You are right that Unicode does not play fair with most standard locale facilities, especially case conversions (1-1, 1-many, 1-0, context sensitivity (which could be seen as many-many), locale specifics).

Collation is one area where the standard library facilities should be OK, since the standard library collation support deals with whole strings. When you install the collation facet in your locale, you choose the Unicode collation options that are relevant to you.

Anthony

--
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.
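As an aside, a minimal sketch of the whole-string interface referred to above, using the standard std::collate facet (the locale name "en_US.UTF-8" is platform-dependent and purely illustrative, and whether its collation is actually Unicode-aware depends entirely on the implementation):

    #include <iostream>
    #include <locale>
    #include <stdexcept>
    #include <string>

    int main()
    {
        try
        {
            std::locale loc("en_US.UTF-8");  // platform-dependent name
            const std::collate<wchar_t>& coll =
                std::use_facet<std::collate<wchar_t> >(loc);

            std::wstring a = L"cote";
            std::wstring b = L"côte";

            // compare() sees whole strings, so the facet is free to apply
            // multi-level (e.g. accent-aware) collation internally.
            int result = coll.compare(a.data(), a.data() + a.size(),
                                      b.data(), b.data() + b.size());
            std::cout << result << '\n';
        }
        catch (const std::runtime_error&)
        {
            std::cout << "locale not available on this system\n";
        }
        return 0;
    }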

Anthony Williams <anthony_w.geo@yahoo.com> writes:
Miro Jurisic <macdev@meeroh.org> writes:
[snip]
I have already gone over this in other posts, but, in short, std::basic_string makes performance guarantees that are at odds with Unicode strings.
Only if you use an encoding other than UTF-32/UCS-4. This has to be a (POD) UDT rather than a typedef, so that one may specialize std::char_traits. Of course, if this gets standardized, then it can be a built-in, since the standard can specialize its own templates.
Performance guarantees aside, standardizing around UTF-32 is not, IMO, practical.
[snip]
3) define char_traits specialisations (as necessary) in order to get basic_string working with Unicode character sequences, typedef the appropriate string types:
typedef basic_string<utf8_t> utf8_string; // etc
This is not a good idea. If you do this, you will produce a basic_string which can violate well-formedness of Unicode strings when you use any mutation algorithm other than concatenation, or you will violate performance guarantees of basic_string.
Yes. basic_string<CharType> relies on each CharType being a valid entity in its own right --- for Unicode this means it must be a single Unicode code point, so using basic_string for UTF-8 is out.
basic_string can still be used as a low-level storage facility, although it was certainly not designed to be used as such, and in treating it as such, many of the ``compatibility'' advantages are lost anyway. If you are advocating internal representation in UTF-32, however, I would say that performance measurements generally show that UTF-16 is significantly faster for processing, such that the advantage of being able to nicely fit it into the existing interfaces is not justified.
You are right that Unicode does not play fair with most standard locale facilities, especially case conversions (1-1, 1-many, 1-0, context sensitivity (which could be seen as many-many), locale specifics).
Collation is one area where the standard library facilities should be OK, since the standard library collation support deals with whole strings. When you install the collation facet in your locale, you choose the Unicode collation options that are relevant to you.
Perhaps, except for the other issues which I have described in other messages. -- Jeremy Maitin-Shepard

On 4/13/04 3:27 PM, "Miro Jurisic" <macdev@meeroh.org> wrote:
In article <00c101c42167$8d0a7f40$1b440352@fuji>, "John Maddock" <john@johnmaddock.co.uk> wrote: [SNIP]
However I think we're getting ahead of ourselves here: I think a Unicode library should be handled in stages:
1) define the data types for 8/16/32 bit Unicode characters.
The fact that you believe this is a reasonable first step leads me to believe that you have not given much thought to the fact that even if you use a 32-bit Unicode encoding, a character can take up more than 32 bits (and likewise for 16-bit and 8-bit encodings). Unicode characters are not fixed-width data in any encoding. [TRUNCATE]
Unicode code-points fit in 31-bit values. The 8- and 16-bit standards just encode the 32-bit standard. We could base a Unicode string only around the code-points. It may be better to use abstract Unicode characters instead. However, each abstract character can be made up of a variable number of code-points. Worse, there can be several ways of expressing the same abstract character (that's why there are normalization standards). Maybe we can have:

    struct unicode_code_point
    {
        int_least32_t c;
    };

    struct unicode_code_point_traits
    {
        /* like char_traits */
    };

    struct unicode_abstract_character
    {
        int_least32_t main_char;      // can there be co-main characters?
        std::size_t helper_count;     // length of following array
        int_least32_t *helper_chars;  // dynamic array of combiners
    };

    struct unicode_abstract_character_traits
    {
        /* like char_traits, but much more complicated */
    };

Recall that character types must be POD, so all the smarts have to go into the traits class.

--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
participants (9)
- Anthony Williams
- Daryle Walker
- David Abrahams
- Jeremy Maitin-Shepard
- John Maddock
- Miro Jurisic
- Pavol Droba
- Thorsten Ottosen
- Vladimir Prus