Any interest in adding unicode support to boost?

Hi. I am in the process of planning a library for handling unicode strings in C++, and would like to probe the interest in the boost community for something like that. I read through the unicode discussion that was up back in April, and from what I could gather there was some amount of interest, but no one felt comfortable taking on the task as of yet.

I am hoping to be able to run this project as my Bachelor's Thesis in Computer Engineering (not sure if that is the correct translation from Norwegian), and if it gets approved by my college, two other programmers and I will spend one semester working exclusively on this (of course in collaboration with the boost community). At the end of that semester I hope the library (or at least parts of it) will be in such a state that it can be submitted for review by boost.

The library should ultimately have support for at least basic handling of unicode strings (in all encodings), collation of strings and other locale-specific operations. The library should also be (to the extent that is possible) integrated with the standard C++ library (and boost) to get as much functionality as possible "for free". I'm here thinking of, among other things, the std::locale class and compatibility with iostreams. How these requirements are fulfilled will be determined as the project (hopefully) moves forward.

I really feel the C++ language needs some form of standardized unicode support, and developing such a library within the boost community would be a very good way to ensure it fits everybody's needs in the best possible way.

If you have any, and I do mean ANY, thoughts on this, please do not hesitate to reply to this mail and let me know. I'm looking forward to your responses.

Best regards, Erik Wien - Gjøvik College (Høyskolen i Gjøvik), Norway.

Erik Wien wrote:
... If you have any, and I do mean ANY, thoughts on this, please do not hesitate to reply to this mail and let me know. I'm looking forward to your responses.
Erik, be sure to take a look at IBM's open-source ICU library: http://oss.software.ibm.com/icu/

----- Original Message ----- From: "Patrick Bennett" <patrick.bennett@inin.com>
Erik, be sure to take a look at IBM's open-source ICU library: http://oss.software.ibm.com/icu/
I have looked at the ICU library, and it's a very good unicode implementation, but the goal of my project will be to create a library that is in a form that could be standardized in boost, and from what I have seen/read, ICU is not really suited for that. ICU's error handling (error codes instead of exceptions), naming conventions etc. are some of the problems. If the ICU license is boost compatible, it would however be possible to base an implementation on ICU, but that will be determined if the project is approved by my college.

In article <cl1tqh$qp$1@sea.gmane.org>, "Erik Wien" <wien@start.no> wrote:
Hi. I am in the process of planning a library for handling unicode strings in C++, and would like to probe the interest in the boost community for something like that. I read through the unicode discussion that was up back in April, and from what I could gather there was some amount of interest, but no one felt comfortable taking on the task as of yet.
I think that's a fair summary.
I am hoping to be able to run this project as my Bachelor's Thesis in Computer Engineering (not sure if that is the correct translation from Norwegian), and if it gets approved by my college, two other programmers and I will spend one semester working exclusively on this (of course in collaboration with the boost community). At the end of that semester I hope the library (or at least parts of it) will be in such a state that it can be submitted for review by boost.
In addition to all the things you would normally consider when preparing a boost submission (e.g., whether there is interest) and all the things you would normally consider when writing a bachelor's thesis (e.g., whether you can complete it on time), you should think about some of the possible interactions between boost and the thesis:

1. Is the boost license compatible with copyright and intellectual property requirements of your school? (It's best not to assume; most schools have a specific policy about this.)

2. If the boost review process begins after you graduate, which of the three of you will be responsible for interacting with boost? It's clear from your message you have an interest in boost, but you should be prepared to participate in the review process in case your collaborators decide they want to do something else after they graduate.

Etc. I don't want to sound discouraging, but I do want to point out that the completion of your thesis and the submission to boost should be decoupled so that you don't make your life more stressful than necessary.

Besides that point, my only question right now is when we can see the preliminary proposal :-)

meeroh

----- Original Message ----- From: "Miro Jurisic" <macdev@meeroh.org>
In addition to all the things you would normally consider when preparing a boost submission (e.g., whether there is interest) and all the things you would normally consider when writing a bachelor's thesis (e.g., whether you can complete it on time), you should think about some of the possible interactions between boost and the thesis:
Yes. I am currently working with my college to find out if there are any such issues that might come up. I have to turn in a project description in a week or so, and in the following review process, these matters should surface.
1. Is the boost license compatible with copyright and intellectual property requirements of your school? (It's best not to assume; most schools have a specific policy about this.)
Though I haven't checked personally (and I will do that), I know several open-source projects have been run as bachelor's theses at my college before, so I would imagine a boost project should be possible.
2. If the boost review process begins after you graduate, which of the three of you will be responsible for interacting with boost? It's clear from your message you have an interest in boost, but you should be prepared to participate in the review process in case your collaborators decide they want to do something else after they graduate.
I don't want to sound discouraging, but I do want to point out that the completion of your thesis and the submission to boost should be decoupled so that you don't make your life more stressful than necessary.
Good point. I have not yet decided if I should make a connection between the actual review process and the thesis itself. I will probably end up with the thesis being a paper and library that could then be submitted to boost (but not required to) when it is completed. I would like to do the development within the boost community though, so a total decoupling from boost would probably not be such a good idea either.
Besides that point, my only question right now is when we can see the preliminary proposal :-)
Patience. ;)
meeroh

I think you should spend a little more time investigating the following:

a) The "vault" files section has code by A Barbati which addresses issues related to unicode.
b) Ron Garcia contributed codecvt facets for unicode that have been incorporated into boost and are currently used by two boost libraries (serialization and program options).
c) ansi library functions exist for converting strings and characters to/from wstrings/wchar_t's in accordance with the currently selected locale. Not all libraries implement these functions, however.

So it's not clear to me what exactly needs to be done here - other than fixing up some older standard libraries. I don't think that's what you had in mind. Of course, I'm not exactly sure what you propose to do in your library other than the above, so these observations might not be relevant. On the other hand, if you want more project ideas, I'm sure we could come up with lots of suggestions.

Robert Ramey

"Erik Wien" <wien@start.no> wrote in message news:cl1tqh$qp$1@sea.gmane.org...
Hi. I am in the process of planning a library for handling unicode strings in C++, and would like to probe the interest in the boost community for something like that. I read through the unicode discussion that was up back in April, and from what I could gather there was some amount of interest, but no one felt comfortable taking on the task as of yet.
I am hoping to be able to run this project as my Bachelor's Thesis in Computer Engineering (not sure if that is the correct translation from Norwegian), and if it gets approved by my college, two other programmers and I will spend one semester working exclusively on this (of course in collaboration with the boost community). At the end of that semester I hope the library (or at least parts of it) will be in such a state that it can be submitted for review by boost.
The library should ultimately have support for at least basic handling of unicode strings (in all encodings), collation of strings and other locale-specific operations. The library should also be (to the extent that is possible) integrated with the standard C++ library (and boost) to get as much functionality as possible "for free". I'm here thinking of, among other things, the std::locale class and compatibility with iostreams. How these requirements are fulfilled will be determined as the project (hopefully) moves forward.
I really feel the C++ language needs some form of standardized unicode support, and developing such a library within the boost community would be a very good way to ensure it fits everybody's needs in the best possible way.
If you have any, and I do mean ANY, thoughts on this, please do not hesitate to reply to this mail and let me know. I'm looking forward to your responses.
Best regards, Erik Wien - Gjøvik College (Høyskolen i Gjøvik), Norway.

In article <cl268e$okv$1@sea.gmane.org>, "Robert Ramey" <ramey@rrsd.com> wrote:
I think you should spend a little more time investigating the following:
a) The "vault" files section has code by A Barbati which addresses issue related to unicode. b) Ron Garcia contributed codecvt facets for unicode that have been incorporated into boost are currently used by two boost libraries (serialization and program options.) c) asni library functions exist for converting strings and characters to/from wstrings/wchar s in accordance with the currently selected locale. Not all libraries implement these functions however.
So its not clear to me what exactly needs to be done here - other than fixing up some older stdandard libraries. I don't think that's what you had in mind.
There is a lot of Unicode work to be done in the standard C++ library and boost. C++ currently has no Unicode-aware string abstraction, and this is a big problem for anyone who has to deal with Unicode strings in C++ code. std::string is poorly suited for any Unicode-savvy work, for many reasons -- mainly having to do with the fact that std::string and STL and boost algorithms using std::string::iterator don't know how to handle strings in accordance with the Unicode spec. meeroh

In article <cl268e$okv$1@sea.gmane.org>, "Robert Ramey" <ramey@rrsd.com> wrote:
I think you should spend a little more time investigating the following:
a) The "vault" files section has code by A Barbati which addresses issue related to unicode. b) Ron Garcia contributed codecvt facets for unicode that have been incorporated into boost are currently used by two boost libraries (serialization and program options.) c) asni library functions exist for converting strings and characters to/from wstrings/wchar s in accordance with the currently selected locale. Not all libraries implement these functions however.
So its not clear to me what exactly needs to be done here - other than fixing up some older stdandard libraries. I don't think that's what you had in mind.
"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-320EB0.01505419102004@sea.gmane.org...
There is a lot of Unicode work to be done in the standard C++ library and boost. C++ currently has no Unicode-aware string abstraction, and this is a big problem for anyone who has to deal with Unicode strings in C++ code. std::string is poorly suited for any Unicode-savvy work, for many reasons -- mainly having to do with the fact that std::string and STL and boost algorithms using std::string::iterator don't know how to handle strings in accordance with the Unicode spec.
Hmmm - it would never occur to me to use std::string for characters wider than 8 bits. Having studied this issue in some detail, I concluded that if one uses unicode or another 2 or 4 byte encoding, the simplest and most natural way is to use std::wstring (a synonym for std::basic_string<wchar_t>). At this point the only issues would be:

a) implementations which are not based on basic_string (I don't know if there are any of these around)
b) input/output to other encodings such as utf-8 or ? - this is handled by codecvt facets.

I believe that STL and boost algorithms that handle std::string can (or should) be able to handle any std::basic_string<?>. That is my basis for the view that unicode shouldn't be a big issue.

Of course if one wants to handle unicode as std::string containing - say - a UTF-8 encoding of unicode characters, then that would be a separate issue. I don't think anyone would want to do that. I'm willing to be convinced I'm wrong about this - but I just don't see it yet.

Robert Ramey

Hi Robert,
"Miro Jurisic" <macdev@meeroh.org> wrote in message:
There is a lot of Unicode work to be done in the standard C++ library and boost. C++ currently has no Unicode-aware string abstraction, and this is a big problem for anyone who has to deal with Unicode strings in C++ code. std::string is poorly suited for any Unicode-savvy work, for many reasons -- mainly having to do with the fact that std::string and STL and boost algorithms using std::string::iterator don't know how to handle strings in accordance with the Unicode spec.
I believe that STL and boost algorithms that handle std::string can (or should) be able to handle any std::basic_string<?> . That is my basis for the view that unicode shouldn't be a big issue. ... I'm willing to be convinced I'm wrong about this - but I just don't see it yet.
This was discussed extensively before. For example, Miro has pointed out that even plain "find" is not suitable for unicode strings because some characters can be represented with several wchar_t values. Then, there's an issue of proper collation. Given that Unicode can contain accents and various other "marks", it is not obvious that string::operator< will always do the right thing. - Volodya
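A minimal sketch of the "find" problem (illustrative only, not code from the thread; it assumes a compiler where wide literals accept \u escapes and wchar_t holds Unicode values):

#include <cassert>
#include <string>

int main()
{
    std::wstring precomposed(L"\u00FC");  // U+00FC LATIN SMALL LETTER U WITH DIAERESIS
    std::wstring decomposed(L"u\u0308");  // U+0075 followed by U+0308 COMBINING DIAERESIS
    // The two spellings are canonically equivalent in Unicode, but
    // code-unit comparison sees two different strings...
    assert(precomposed != decomposed);
    // ...and plain "find" misses the equivalent spelling entirely.
    assert(decomposed.find(precomposed) == std::wstring::npos);
    return 0;
}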

In article <cl2d2p$7a3$1@sea.gmane.org>, Vladimir Prus <ghost@cs.msu.su> wrote:
This was discussed extensively before. For example, Miro has pointed out that even plain "find" is not suitable for unicode strings because some characters can be represented with several wchar_t values.
Then, there's an issue of proper collation. Given that Unicode can contain accents and various other "marks", it is not obvious that string::operator< will always do the right thing.
Thanks, it's comforting to know /someone/ is paying attention :-) meeroh

"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cl2d2p$7a3$1@sea.gmane.org...
This was discussed extensively before. For example, Miro has pointed out that even plain "find" is not suitable for unicode strings because some characters can be represented with several wchar_t values.
Then, there's an issue of proper collation. Given that Unicode can contain accents and various other "marks", it is not obvious that string::operator< will always do the right thing.
My reference (Stroustrup, The C++ Programming Language) shows the locale class containing a function:

template<class Ch, class Tr, class A>  // compare strings using this locale
bool operator()(const basic_string<Ch, Tr, A>&, const basic_string<Ch, Tr, A>&) const;

So I always presumed that there was a "unicode" locale that implemented this as well as all other required information. Now that I think about it, I realize that it was only a presumption that I never really checked. Now I wonder what facilities most libraries provide for unicode facets. I know there are ansi functions for translating between multi-byte and wide character strings. I've used these functions and they did what I expected them to do. I presumed they worked in accordance with the currently selected locale and its related facets. If basic_string<wchar_t>::operator<(...) isn't doing "the right thing", wouldn't it just be a bug in the implementation of the standard library rather than a candidate for a boost library?

Robert Ramey
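That operator() does make std::locale directly usable as a comparison predicate; whether the collate facet behind it does the right thing for unicode is exactly what is in question. A minimal usage sketch (assuming the runtime supports the named "" locale):

#include <algorithm>
#include <locale>
#include <string>
#include <vector>

int main()
{
    std::vector<std::wstring> words;
    words.push_back(L"zebra");
    words.push_back(L"apple");
    // std::locale::operator() forwards to the locale's collate<wchar_t>
    // facet, so the locale object itself serves as the predicate.
    std::locale loc("");  // the user's preferred locale
    std::sort(words.begin(), words.end(), loc);
    return 0;
}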

Robert Ramey wrote:
"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cl2d2p$7a3$1@sea.gmane.org...
This was discussed extensively before. For example, Miro has pointed out that even plain "find" is not suitable for unicode strings because some characters can be represented with several wchar_t values.
Then, there's an issue of proper collation. Given that Unicode can contain accents and various other "marks", it is not obvious that string::operator< will always do the right thing.
My reference (Stroustrup, The C++ Programming Language) shows the locale class containing a function:

template<class Ch, class Tr, class A>  // compare strings using this locale
bool operator()(const basic_string<Ch, Tr, A>&, const basic_string<Ch, Tr, A>&) const;

So I always presumed that there was a "unicode" locale that implemented this as well as all other required information. Now that I think about it, I realize that it was only a presumption that I never really checked. Now I wonder what facilities most libraries provide for unicode facets. I know there are ansi functions for translating between multi-byte and wide character strings. I've used these functions and they did what I expected them to do. I presumed they worked in accordance with the currently selected locale and its related facets. If basic_string<wchar_t>::operator<(...) isn't doing "the right thing", wouldn't it just be a bug in the implementation of the standard library rather than a candidate for a boost library?
The use of 'wchar_t' is purely implementation defined as to what it means, other than the very little said about it in the C++ standard in relation to 'char'. It need have nothing to do with any of the Unicode encodings, or it may represent a particular Unicode encoding. This is purely up to the implementation. So doing the "right thing" is purely up to the implementer although, of course, the implementer will tell you what wchar_t represents for that implementation.

"Edward Diener" <eddielee@tropicsoft.com> wrote in message news:cl3umm$pkp$1@sea.gmane.org...
Robert Ramey wrote:
"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cl2d2p$7a3$1@sea.gmane.org...
This was discussed extensively before. For example, Miro has pointed out that even plain "find" is not suitable for unicode strings because some characters can be represented with several wchar_t values.
Then, there's an issue of proper collation. Given that Unicode can contain accents and various other "marks", it is not obvious that string::operator< will always do the right thing.
My reference (Stroustrup, The C++ Programming Language) shows the locale class containing a function:

template<class Ch, class Tr, class A>  // compare strings using this locale
bool operator()(const basic_string<Ch, Tr, A>&, const basic_string<Ch, Tr, A>&) const;

So I always presumed that there was a "unicode" locale that implemented this as well as all other required information. Now that I think about it, I realize that it was only a presumption that I never really checked. Now I wonder what facilities most libraries provide for unicode facets. I know there are ansi functions for translating between multi-byte and wide character strings. I've used these functions and they did what I expected them to do. I presumed they worked in accordance with the currently selected locale and its related facets. If basic_string<wchar_t>::operator<(...) isn't doing "the right thing", wouldn't it just be a bug in the implementation of the standard library rather than a candidate for a boost library?
The use of 'wchar_t' is purely implementation defined as to what it means, other than the very little said about it in the C++ standard in relation to 'char'. It need have nothing to do with any of the Unicode encodings, or it may represent a particular Unicode encoding. This is purely up to the implementation. So doing the "right thing" is purely up to the implementer
OK I can buy that
although, of course, the implementer will tell you what wchar_t represents for that implementation.
OK - are there standard library implementations which use other than unicode (or variants thereof) for wchar_t encodings?

Basically my reservations about the utility of a unicode library stem from the following:

a) the standard library has std::basic_string<T> where T is any type: char, wchar_t or whatever.
b) all algorithms that use std::string are (or should be) applicable to std::basic_string<T> regardless of the actual type of T (more or less)
c) character encodings can be classified into two types - single element types like unicode (UCS-2, UCS-4) and ascii, and multi element types like JIS, and others.
d) there exist ansi functions which translate strings from one type to another based on information in the current locale. This information is dependent on the particular encoding.
e) There is nothing particularly special about unicode in this scheme. It's just one more encoding scheme among many. Therefore making a special unicode library would be unnecessarily specific. Any efforts so spent would be better invested in generic encoding/decoding algorithms and/or setting up locale facets for specific encodings: UTF-8, UTF-16, etc.

Robert Ramey
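The ansi functions referred to in d) are presumably the mbstowcs/wcstombs family; a minimal sketch of locale-driven narrow-to-wide conversion (error handling reduced to checking the return value):

#include <clocale>
#include <cstdlib>

int main()
{
    std::setlocale(LC_ALL, "");  // conversions follow the environment's locale
    const char narrow[] = "hello";
    wchar_t wide[32];
    // Converts a multi-byte string to wide characters per LC_CTYPE;
    // returns (size_t)-1 on an invalid sequence.
    std::size_t n = std::mbstowcs(wide, narrow, 32);
    return n == static_cast<std::size_t>(-1) ? 1 : 0;
}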

Robert Ramey wrote:
Basically my reservations about the utility of a unicode library stem from the following:
a) the standard library has std::basic_string<T> where T is any type: char, wchar_t or whatever.
Yes. The problem with unicode is that it is not really possible to represent a character as an atomic value. A single glyph could in extreme cases be made up of 3 (or even more) 32-bit code units (UTF-32), and therefore defining a good T is nigh on impossible.
b) all algorithms that use std::string are (or should be) applicable to std::basic_string<T> regardless of the actual type of T (more or less)
c) character encodings can be classified into two types - single element types like unicode (UCS-2, UCS-4) and ascii, and multi element types like JIS, and others.
As I said, Unicode is not fixed-width in any encoding scheme. Therefore it is very difficult to teach the basic_string class to correctly handle unicode strings.
d) there exist ansi functions which translate strings from one type to another based on information in the current locale. This information is dependent on the particular encoding.
e) There is nothing particularly special about unicode in this scheme. It's just one more encoding scheme among many. Therefore making a special unicode library would be unnecessarily specific. Any efforts so spent would be better invested in generic encoding/decoding algorithms and/or setting up locale facets for specific encodings: UTF-8, UTF-16, etc.
The reason for focusing on Unicode is that it has become the de facto standard for character representation. It is supported by most OSes and many programming languages. This is not likely to change. As for other encoding schemes: I actually had support for other encodings (like UCS, Shift-JIS etc.) in the back of my mind when I wrote the implementation I described earlier. That is why the string class is called encoded_string, and not unicode_string. If the interface of the encoding_traits class is made general enough, it should be a piece of cake to add support for additional encoding schemes at a later date.

From: "Erik Wien" <wien@start.no>
Robert Ramey wrote:
a) the standard library has std::basic_string<T> where T is any type: char, wchar_t or whatever.
Yes. The problem with unicode is that it is not really possible to represent a character as an atomic value. A single glyph could in extreme cases be made up of 3 (or even more) 32-bit code units (UTF-32), and therefore defining a good T is nigh on impossible.
Could the character type be a class that can hold one or more data members of some representation type plus a pointer to overflow data? Then, an abstract character can be represented completely within the character type if the encoding is sufficiently simple, and if the encoding is more complex, the additional data is put on the free store. For example, if most characters can be represented with a single representation type instance, then the class would contain one data member of that type plus a pointer to the rest, if any.

Performance analysis can indicate how best to implement such a class, but it could have from one to N data members of the representation type, where N is the maximum number of representation type values needed to represent all abstract characters. Differing choices of N and the representation type will give different performance characteristics for a given Unicode string. Those values might be tuned for general purpose use or they might be exposed via template parameters.

Granted, a simple character is enlarged by an unused pointer, and it may be that using N objects of the representation type takes no more space, thereby obviating the conditional code checking for a non-null pointer. Nevertheless, it's an idea to consider, if only for a minute. ;-)

-- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;
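A rough sketch of that idea (all names are invented here, and copying is left out of the sketch): up to N code units stored inline, with longer combining sequences spilling to the free store.

#include <cstddef>
#include <cstring>
#include <vector>

typedef unsigned int code_unit;  // stand-in for the representation type

template<std::size_t N>
class abstract_character
{
    code_unit units_[N];                // the common, short case lives inline
    std::size_t size_;                  // total number of code units
    std::vector<code_unit>* overflow_;  // non-null only for long sequences
public:
    abstract_character(const code_unit* p, std::size_t n)
        : size_(n), overflow_(0)
    {
        if (n <= N)
            std::memcpy(units_, p, n * sizeof(code_unit));
        else
            overflow_ = new std::vector<code_unit>(p, p + n);
    }
    ~abstract_character() { delete overflow_; }
    std::size_t size() const { return size_; }
    code_unit operator[](std::size_t i) const
    {
        return overflow_ ? (*overflow_)[i] : units_[i];
    }
private:
    abstract_character(const abstract_character&);             // copying and assignment
    abstract_character& operator=(const abstract_character&);  // omitted from this sketch
};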

Robert Ramey wrote:
"Edward Diener" <eddielee@tropicsoft.com> wrote in message news:cl3umm$pkp$1@sea.gmane.org...
Robert Ramey wrote:
"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cl2d2p$7a3$1@sea.gmane.org...
This was discussed extensively before. For example, Miro has pointed out that even plain "find" is not suitable for unicode strings because some characters can be represented with several wchar_t values.
Then, there's an issue of proper collation. Given that Unicode can contain accents and various other "marks", it is not obvious that string::operator< will always do the right thing.
My reference (Stroustrup, The C++ Programming Language) shows the locale class containing a function:

template<class Ch, class Tr, class A>  // compare strings using this locale
bool operator()(const basic_string<Ch, Tr, A>&, const basic_string<Ch, Tr, A>&) const;

So I always presumed that there was a "unicode" locale that implemented this as well as all other required information. Now that I think about it, I realize that it was only a presumption that I never really checked. Now I wonder what facilities most libraries provide for unicode facets. I know there are ansi functions for translating between multi-byte and wide character strings. I've used these functions and they did what I expected them to do. I presumed they worked in accordance with the currently selected locale and its related facets. If basic_string<wchar_t>::operator<(...) isn't doing "the right thing", wouldn't it just be a bug in the implementation of the standard library rather than a candidate for a boost library?
The use of 'wchar_t' is purely implementation defined as to what it means, other than the very little said about it in the C++ standard in relation to 'char'. It need have nothing to do with any of the Unicode encodings, or it may represent a particular Unicode encoding. This is purely up to the implementation. So doing the "right thing" is purely up to the implementer
OK I can buy that
although, of course, the implementer will tell you what wchar_t represents for that implementation.
OK - are there standard library implementations which use other than unicode (or variants thereof) for wchar_t encodings?
I do not know if there are or not. My point is that wchar_t is not a Unicode character by definition. That is why I believe that new character types, either as future built-in characters or as C++ classes, are needed to support Unicode encodings. As soon as one says that wchar_t should be changed depending on the locale/facet in order to support a Unicode encoding, one is doing the wrong thing, because wchar_t is only C++'s idea of an implementation defined wide character.
Basically my reservations about the utility of a unicode library stem from the following:
a) the standard library has std::basic_string<T> where T is any type: char, wchar_t or whatever.
Agreed
b) all algorithms that use std::string are (or should be) applicable to std::basic_string<T> regardless of the actual type of T (more or less)
The standard algorithms work using iterators, and treat std::basic_string<T> as a container. That is fine but it doesn't produce results that treat a string as a meaningful collection of characters which represent a character encoding. For that the std::basic_string<T> member functions are needed and should be used.
c) character encodings can be classified into two types - single element types like unicode (UCS-2, UCS-4) and ascii, and multi element types like JIS, and others.
OK, at least in the present.
d) there exist ansi functions which translate strings from one type to another based on information in the current locale. This information is dependent on the particular encoding.
OK, I see where you are going, and you may be right. You want to continue using 'char' and 'wchar_t' but use only the locale to define their encoding and functionality. That sounds possible except for one issue I can think of: 'char' and 'wchar_t' do not encompass all the possible popular encoding sizes for fixed size encodings. Right now we have 8, 16, and 32 bits. Perhaps we need a new C++ basic character type, maybe 'lwchar_t', and the rule sizeof(char) <= sizeof(wchar_t) <= sizeof(lwchar_t). This would give us a better fighting chance to represent all the most popular encodings, at least as far as fixed size character sizes are concerned. Along with your suggestion, thought would then have to be given to what a std::basic_string<T> really means beyond the use of narrow characters. Right now it is implementation defined, but in the future how and where do we specify character encodings via locales?
e) There is nothing particularly special about unicode in this scheme. It's just one more encoding scheme among many. Therefore making a special unicode library would be unnecessarily specific. Any efforts so spent would be better invested in generic encoding/decoding algorithms and/or setting up locale facets for specific encodings: UTF-8, UTF-16, etc.
You have made a good point, but see above. In your scheme, I would still want all current standard library functionality regarding characters and strings to be templated on built-in character types everywhere. There would probably need to be a review of current and future functionality to determine in which situations locales, with their character encoding information, need to be passed along with a string.

At 08:08 PM 10/19/2004, Robert Ramey wrote:
OK - are there standard library implementations which use other than unicode (or variants thereof) for wchar_t encodings?
IIRC, this question was asked at one of the committee meetings, and the answer was "yes". It may have been Sun, although I'm not sure. The rationale was simply that the compiler predated Unicode. Presumably the vendor will continue with their current encoding for wchar_t, and use Unicode encodings for char16_t and char32_t strings. --Beman

e) There is nothing particularly special about unicode in this scheme. It's just one more encoding scheme among many. Therefore making a special unicode library would be unnecessarily specific. Any efforts so spent would be better invested in generic encoding/decoding algorithms and/or setting up locale facets for specific encodings: UTF-8, UTF-16, etc.
Robert, there is a lot more to Unicode than an encoding form: character properties, collation, bidirectional character handling, character shaping algorithms, and lots more I've forgotten. Do take a look at ICU; I don't like the C++ design much, but it's a *huge* and technically very competent library. John.

Erik Wien wrote:
Hi. I am in the process of planning a library for handling unicode strings in C++, and would like to probe the interest in the boost community for something like that. I read through the unicode discussion that was up back in April, and from what I could gather there was some amount of interest, but no one felt comfortable taking on the task as of yet.
I am hoping to be able to run this project as my Bachelor's Thesis in Computer Engineering (not sure if that is the correct translation from Norwegian), and if it gets approved by my college, two other programmers and I will spend one semester working exclusively on this (of course in collaboration with the boost community). At the end of that semester I hope the library (or at least parts of it) will be in such a state that it can be submitted for review by boost.
The library should ultimately have support for at least basic handling of unicode strings (in all encodings), collation of strings and other locale-specific operations. The library should also be (to the extent that is possible) integrated with the standard C++ library (and boost) to get as much functionality as possible "for free". I'm here thinking of, among other things, the std::locale class and compatibility with iostreams. How these requirements are fulfilled will be determined as the project (hopefully) moves forward.
A few points you probably already know:

1) Wide characters and Unicode characters are not necessarily the same thing for any given implementation.
2) There are quite a few Unicode encodings.
3) The idea is to be able to plug a Unicode encoding into the same standard library templates and boost templates which now support 'char' and 'wchar_t'. In other words, ideally you want to treat your Unicode encoding as just another character type, with extra smarts depending on the encoding. The extra smarts would be used in specializations.

In the past in comp.std.c++ I attempted to promote the idea that all standard library functionality which dealt generally in characters and strings should be parameterized on the character type for the sake of orthogonality and the future. While most of it is, there is still some functionality which is not, i.e. exceptions and file names and locale message files, which assume that only narrow characters exist in their usage. I am still amazed that programmers from countries which would normally use wide characters as Unicode encodings, such as the Japanese, have not made more of an issue of this, but perhaps they are so used to their far more difficult DBCS roots that pursuing wide characters everywhere, much less a real Unicode encoding, is a minor issue for them.

Hi. I am in the process of planning a library for handling unicode strings in C++ ...
... functionality which is not, i.e. exceptions and file names and locale message files, which assume that only narrow characters exist in their usage. I am still amazed that programmers from countries which would normally use wide characters as Unicode encodings, such as the Japanese, have not made more of an issue of this, but perhaps they are so used to their far more difficult DBCS roots that pursuing wide characters everywhere, much less a real Unicode encoding, is a minor issue for them.
For Japanese strings I use either Shift-JIS or UTF-8 encodings, both of which work fine with std::string for most tasks. Matching sub-strings could give a false match, however. When talking about Unicode, does that include UTF-8? Darren

----- Original Message ----- From: "Edward Diener" <eddielee@tropicsoft.com>
A few points you probably already know:
1) Wide characters and Unicode characters are not necessarily the same thing for any given implementation.
2) There are quite a few Unicode encodings.
Yes I know. Thanks for the heads up though! ;)
3) The idea is to be able to plug a Unicode encoding into the same standard library templates and boost templates which now support 'char' and 'wchar_t'. In other words, ideally you want to treat your Unicode encoding as just another character type, with extra smarts depending on the encoding. The extra smarts would be used in specializations.
Agreed. That is one of the main design goals for a potential library in my opinion. I have recently created a little test library for simple unicode strings that provides iterators that can be used with the different algorithms in boost and std. I would probably base some parts of a new library on that implementation. I will post a new message with more information about this later.
In the past in comp.std.c++ I attempted to promote the idea that all standard library functionality which dealt generally in characters and strings should be parameterized on the character type for the sake of orthogonality and the future. While most of it is, there is still some functionality which is not, i.e. exceptions and file names and locale message files, which assume that only narrow characters exist in their usage. I am still amazed that programmers from countries which would normally use wide characters as Unicode encodings, such as the Japanese, have not made more of an issue of this, but perhaps they are so used to their far more difficult DBCS roots that pursuing wide characters everywhere, much less a real Unicode encoding, is a minor issue for them.
I completely agree. There are a few areas of the standard that make a lot of assumptions about how characters and strings are represented, and many of these assumptions are not necessarily true when it comes to unicode. How to match a potential library with the standard is therefore an important issue in the development, and one I hope to devote some time to resolving (or at least knowingly ignoring! ;) ) if I move forward with the project.

On Oct 18, 2004, at 7:22 PM, Erik Wien wrote:
I really feel the C++ language needs some form of standardized unicode support, and developing such a library within the boost community would be a very good way to ensure it fits everybody's needs in the best possible way.
If you have any, and I do mean ANY, thoughts on this, please do not hesitate to reply to this mail and let me know. I'm looking forward to your responses.
We absolutely need a Unicode library and I'd be glad to see someone tackle it. Doug

I've recently started on the first draft of a Unicode library.

An assumption I think is wrong is that wchar_t would be suitable for Unicode. Correct me if I'm wrong, but IIRC wchar_t has 16 bits on Microsoft compilers, for example. The utf8_codecvt_facet implementation will on these compilers cut off any codepoints over 0xFFFF. (U+1D12C will come out as U+D12C.) I think a definition of unicode::code as uint32_t would be much better. Problem is, codecvt is only implemented for wchar_t and char, so it's not possible to make a Unicode codecvt without manually adding (dummy) implementations of codecvt<unicode::code,char,mbstate_t> to the std namespace. I guess this is the reason that Ron Garcia just used wchar_t.

About Unicode strings: I suggest having a codepoint_string, with the string of code units as a template parameter. Its interface should work with 21 (32) bit values, while internally these are converted to UTF-8, UTF-16, or remain UTF-32.

template <class CodeUnitString>
class codepoint_string {
    CodeUnitString code_units;
    // ...
};

The real unicode::string would be the character string, which uses a base character with its combining marks for its interface.

template <class CodePointString>
class string {
    CodePointString codepoints;
    // ...
};

So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters. unicode::string should take care of correctly searching for a character string, rather than a codepoint string.

operator< has never done "the right thing" anyway: it does not handle the difference between uppercase and lowercase, for example. Probably, locales should be used for collation. The Unicode collation algorithm is pretty well specified.

Hope all this is clear...

Regards, Rogier

Rogier van Dalen wrote:
An assumption I think is wrong is that wchar_t would be suitable for Unicode. Correct me if I'm wrong, but IIRC wchar_t has 16 bits on Microsoft compilers, for example. The utf8_codecvt_facet implementation will on these compilers cut off any codepoints over 0xFFFF. (U+1D12C will come out as U+D12C.)
This is because the Windows NT ABI is hardwired for 16-bit wide characters. I believe that means the wide characters are actually UTF-16 characters that use "surrogate pairs." Regardless of whether this is a good thing or not, Windows compilers need to follow suit, as the underlying implementation of their wide characters is in Windows, not in the compiler. It might be possible for a compiler to provide its own Unicode implementation, and map that to Windows' wide characters, but in the user-visible situations where the two implementations disagreed, there might be surprising results that could make the compiler-provided implementation unusable. Aaron W. LaFramboise

On Tuesday 19 October 2004 12:37, Aaron W. LaFramboise wrote:
An assumption I think is wrong is that wchar_t would be suitable for Unicode. Correct me if I'm wrong, but IIRC wchar_t has 16 bits on Microsoft compilers, for example. The utf8_codecvt_facet implementation will on these compilers cut off any codepoints over 0xFFFF. (U+1D12C will come out as U+D12C.)
This is because the Windows NT ABI is hardwired for 16-bit wide characters. I believe that means the wide characters are actually UTF-16 characters that use "surrogate pairs."
We tested at some point, and it is UCS-2 by default, and will switch to UTF-16 if support for East Asian languages is enabled. Teemu

----- Original Message ----- From: "Rogier van Dalen" <rogiervd@gmail.com>
I've recently started on the first draft of a Unicode library.
Interesting. Is there a discussion going on about this library that I have missed, or haven't you posted anything about it yet? I'd hate to start something like this if an effort is already being made on the subject.
An assumption I think is wrong is that wchar_t would be suitable for Unicode. Correct me if I'm wrong, but IIRC wchar_t has 16 bits on Microsoft compilers, for example. The utf8_codecvt_facet implementation will on these compilers cut off any codepoints over 0xFFFF. (U+1D12C will come out as U+D12C.)
I agree. The "unicode is wide strings" assumption is wrong in my opinion, and I would stribe to provide a correct implementation based on the Unicode standard if I were to go ahead with this.
I think a definition of unicode::code as uint32_t would be much better. Problem is, codecvt is only implemented for wchar_t and char, so it's not possible to make a Unicode codecvt without manually adding (dummy) implementations of codecvt<unicode::code,char,mbstate_t> to the std namespace. I guess this is the reason that Ron Garcia just used wchar_t.
I don't really feel locking the code unit size to 32 bits is a good solution either, as strings would then become unnecessarily large. In a test implementation I have recently made, I templated the entire encoding scheme (using an encoding_traits class) and made a common interface for strings that lets you iterate over the code points it contains, no matter what the underlying encoding is. (I will post another message with more details of this library.) This does of course make for problems with other parts of the standard, but solutions to these problems are what I want my thesis to be all about.
About Unicode strings: I suggest having a codepoint_string, with the string of code units as a template parameter. Its interface should work with 21 (32) bit values, while internally these are converted to UTF-8, UTF-16, or remain UTF-32.

template <class CodeUnitString>
class codepoint_string {
    CodeUnitString code_units;
    // ...
};

The real unicode::string would be the character string, which uses a base character with its combining marks for its interface.

template <class CodePointString>
class string {
    CodePointString codepoints;
    // ...
};
So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
unicode::string should take care of correctly searching for a character string, rather than a codepoint string.
Thanks. I will take that into consideration. I'm glad to hear any design/implementation ideas, since I want this library to be usable by the largest number of people possible.
operator< has never done "the right thing" anyway: it does not handle the difference between uppercase and lowercase, for example. Probably, locales should be used for collation. The Unicode collation algorithm is pretty well specified.
Yes. I hope to be able to add support for the collation algorithm to enable proper, locale specific collation.

On Tue, 19 Oct 2004 18:32:50 +0200, Erik Wien <wien@start.no> wrote:
----- Original Message ----- From: "Rogier van Dalen" <rogiervd@gmail.com>
I've recently started on the first draft of a Unicode library.
Interesting. Is there a discussion going on about this library that I have missed, or haven't you posted anything about it yet? I'd hate to start something like this if an effort is already being made on the subject.
It's in the planning stage; I have a preliminary implementation of some parts. Your message made me bring out my ideas into the public.
I think a definition of unicode::code as uint32_t would be much better. Problem is, codecvt is only implemented for wchar_t and char, so it's not possible to make a Unicode codecvt without manually adding (dummy) implementations of codecvt<unicode::code,char,mbstate_t> to the std namespace. I guess this is the reason that Ron Garcia just used wchar_t.
I don't really feel locking the code unit size to 32 bits is a good solution either, as strings would then become unnecessarily large.
As I tried to show, the choice of the underlying buffer is templated. This could be std::string, or an SGI rope<wchar_t>, or anything else. A char-based buffer would automatically make it a UTF-8-encoded string, etcetera. I agree with you (and with the Unicode standard) that using strings of UTF-16 is probably best for most practical applications.

The interface should IMHO always use UTF-32 (I agree with the Unicode standard here too). Given

codepoint_string<...> s = ....;

I think *s.begin() should return a UTF-32-encoded codepoint. The codecvt class converts to UTF-32 because it didn't occur to me to do anything else; and why would you?

Regards, Rogier
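To illustrate what "*s.begin() returns a UTF-32-encoded codepoint" could look like over a char-based buffer, a bare-bones decoding iterator (my sketch, not Rogier's code; it assumes well-formed UTF-8 and does no error checking):

#include <stdint.h>
#include <string>

class utf8_codepoint_iterator
{
    std::string::const_iterator pos_;
public:
    explicit utf8_codepoint_iterator(std::string::const_iterator p) : pos_(p) {}

    // Decode the UTF-8 sequence starting at pos_ into one code point.
    uint32_t operator*() const
    {
        unsigned char lead = static_cast<unsigned char>(*pos_);
        if (lead < 0x80) return lead;                         // single byte
        int trail = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;  // trailing bytes
        uint32_t cp = lead & (0x3F >> trail);                 // payload bits of the lead byte
        std::string::const_iterator p = pos_;
        for (int i = 0; i < trail; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(*++p) & 0x3F);
        return cp;
    }

    // Advance past one whole code point, however many bytes it occupies.
    utf8_codepoint_iterator& operator++()
    {
        unsigned char lead = static_cast<unsigned char>(*pos_);
        pos_ += lead < 0x80 ? 1 : lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : 2;
        return *this;
    }

    bool operator!=(const utf8_codepoint_iterator& other) const
    {
        return pos_ != other.pos_;
    }
};

For a buffer holding the bytes 0x75 0xCC 0x88, dereferencing yields U+0075 and, after one increment, U+0308.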

In article <e094f9eb041019032718d58d04@mail.gmail.com>, Rogier van Dalen <rogiervd@gmail.com> wrote:
An assumption I think is wrong is that wchar_t would be suitable for Unicode.
wchar_t is to be avoided at all costs. Its size differs from compiler to compiler, and even depends on compiler settings (2 or 4 bytes). Encoding of wchar_t strings is ill-defined and also varies from system to system (usually UCS-2 or UCS-4, but there is no guarantee it's a Unicode encoding).
So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
Encoded characters or abstract characters? (See section 2.4 of Unicode standard for definitions) meeroh

So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
Encoded characters or abstract characters? (See section 2.4 of Unicode standard for definitions)
I mean a base character with its combining characters. I don't think this is the same as "abstract character", is it? My plan was to decompose all characters in unicode::string. This makes manipulation of diacritics easier. Correct me if I'm wrong, but your example of finding "ü" in a string would come down to finding the codepoint sequence "U+0075 U+0308" and checking that it is not followed by another combining character - still pretty trivial. Regards, Rogier

In article <e094f9eb04102006096b92c870@mail.gmail.com>, Rogier van Dalen <rogiervd@gmail.com> wrote:
So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
Encoded characters or abstract characters? (See section 2.4 of Unicode standard for definitions)
I mean a base character with its combining characters. I don't think this is the same as "abstract character", is it?
That is an abstract character, yes.
My plan was to decompose all characters in unicode::string. This makes manipulation of diacritics easier. Correct me if I'm wrong, but your example of finding "ü" in a string would come down to finding the codepoint sequence "U+0075 U+0308" and checking that it is not followed by another combining character - still pretty trivial.
You not only have to decompose them but also put them in canonical decomposed order for that to work. meeroh

"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-
My plan was to decompose all characters in unicode::string. This makes manipulation of diacritics easier. Correct me if I'm wrong, but your example of finding "ü" in a string would come down to finding the codepoint sequence "U+0075 U+0308" and checking that it is not followed by another combining character - still pretty trivial.
You not only have to decompose them but also put them in canonical decomposed order for that to work.
You could also do a Canonical Composition after the decomposition (Normalization Form C). Either way, this is not something you would like to do on every assignment of a string, but rather when it is needed (i.e. on comparison).

On Wed, 20 Oct 2004 12:20:22 -0400, Miro Jurisic <macdev@meeroh.org> wrote:
In article <e094f9eb04102006096b92c870@mail.gmail.com>, Rogier van Dalen <rogiervd@gmail.com> wrote:
My plan was to decompose all characters in unicode::string. This makes manipulation of diacritics easier. Correct me if I'm wrong, but your example of finding "ü" in a string would come down to finding the codepoint sequence "U+0075 U+0308" and checking that it is not followed by another combining character - still pretty trivial.
You not only have to decompose them but also put them in canonical decomposed order for that to work.
Yes, of course. I left it out thinking it was trivial (which it may be; you'd need a small part of the Unicode Database though). Regards, Rogier

Rogier van Dalen wrote:
So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
After reading your post, I don't understand what exactly the two levels of template parameters give you. And, even if they are needed, it's very important that unicode::strings with different template parameters are always convertible to each other. Otherwise, two libraries with different types won't be able to interoperate. - Volodya

On Wed, 20 Oct 2004 10:44:04 +0400, Vladimir Prus <ghost@cs.msu.su> wrote:
Rogier van Dalen wrote:
So unicode::string<unicode::codepoint_string<std::string> > would be a UTF8-encoded string that is manipulated using its characters.
After reading your post, I don't understand what exactly the two levels of template parameters give you.
Even though I think that a Unicode library should by default work on base characters with their combining characters, sometimes you may need to manipulate codepoints directly.
And, even if they are needed, it's very important that unicode::strings with different template parameters are always convertible to each other. Otherwise, two libraries with different types won't be able to interoperate.
Of course. This should be fairly trivial though. Regards, Rogier

As I have said in a couple of other posts here, I have already started testing different approaches to this library, and I might as well post some examples of what I have so far and how it would be used. I have only been looking closely at the string representation part so far, so don't expect too much. ;)

The basic idea I have been working around is to make an encoded_string class templated on unicode encoding types (i.e. UTF-8, UTF-16). This is made possible through an encoding_traits class which contains all necessary implementation details for working on strings of code units. The outline of the encoding_traits class looks something like this:

template<typename encoding>
struct encoding_traits
{
    // Type definitions for code_units etc.
    // Is the encoding fixed width? (allows a good deal of iterator optimizations)
    // Algorithms for iterating forwards and backwards over code units.
    // Function for converting a series of code units to a unicode code point.
    // Any other operations that are encoding specific.
};

This traits class is used by the encoded_string class to provide support for strings using any unicode representation internally. This allows the programmer to choose what encoding should be used from string to string, depending on what would be best suited. The external interface of this class would mainly be code point iterators. These iterators can iterate over any encoded_string, and the underlying encoding should be invisible. (This is something that requires a non-standard iterator implementation according to the C++ spec, but would work nicely with the boost iterator library.)

You could use the encoded_string class like this:

// Constructor converts the ASCII string to UTF-16.
encoded_string<utf16> some_string("Hello World");
// Run some standard algorithm on the string:
std::for_each(some_string.begin(), some_string.end(), do_some_operation);

I do currently have a really rough implementation that works like described above, and I would probably base parts of a potential library on that. I am aware that this implementation will be less than ideal for integration with the current C++ standard, but it's issues like that I would like to get deeper into during the development. Any comments you might have on this approach are most welcome.

Regards, Erik
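To make the traits idea concrete, here is a hypothetical specialization along those lines (the member names are my guesses, not Erik's actual code; unpaired surrogates are not checked):

#include <stdint.h>

template<typename encoding> struct encoding_traits;  // primary template

struct utf16 {};  // encoding tag

template<>
struct encoding_traits<utf16>
{
    typedef uint16_t code_unit;
    static const bool fixed_width = false;  // surrogate pairs take two units

    // Read the code point starting at 'it' and advance past it.
    template<typename Iterator>
    static uint32_t decode_and_advance(Iterator& it)
    {
        uint32_t lead = *it++;
        if (lead >= 0xD800 && lead <= 0xDBFF)  // high surrogate
        {
            uint32_t trail = *it++;            // assumed to be a valid low surrogate
            return 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00);
        }
        return lead;
    }
};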

In article <cl3hl9$g4e$1@sea.gmane.org>, "Erik Wien" <wien@start.no> wrote:
The basic idea I have been working around is to make an encoded_string class templated on unicode encoding types (i.e. UTF-8, UTF-16). This is made possible through an encoding_traits class which contains all necessary implementation details for working on strings of code units.
I generally agree with this design approach, but I don't think that code point iterators alone are sufficient. Iteration over encoded characters and abstract characters would be needed for some algorithms to function sensibly. For example, the simple task of: find(begin, end, "ü") needs to use abstract characters in order to be able to find precomposed and decomposed versions of ü.
You could use the encoded_string class like this:
// Constructor converts the ASCII string to UTF-16.
encoded_string<utf16> some_string("Hello World");
// Run some standard algorithm on the string:
std::for_each(some_string.begin(), some_string.end(), do_some_operation);
Again, taking this example: let's say that do_some_operation performs canonicalization to some Unicode canonical form; you can't do this by iterating over code points.
I am aware that this implementation will be less than ideal for integration with the current C++ standard, but it's issues like that I would like to get deeper into during the development.
You should explain what problems with integration you foresee. meeroh

Hi. Thanks for the feedback! "Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-BACD3C.13585519102004@sea.gmane.org...
I generally agree with this design approach, but I don't think that code point iterators alone are sufficient.
Neither do I, as a matter of fact, but this is as far as I have come right now. :) There would probably be different types of iterators (or iterator wrappers) made available to enable iteration over everything from code units to code points/abstract characters.
Iteration over encoded characters and abstract characters would be needed for some algorithms to function sensibly. For example, the simple task of:
find(begin, end, "ü")
needs to use abstract characters in order to be able to find precomposed and decomposed versions of ü.
True... And this is a point where implementation would be less than trivial. Comparing strings in unicode is anything BUT trivial, and it's imperative to find a good way to implement this functionality through the standard algorithms.
Again, taking this example, let's say that do_some_operation performs canonicalization to some Unicode canonical form; you can't do this by iterating over code points.
Nope. A code unit iterator would be needed for things like that.
I am aware that this implementation will be less that ideal for integration with the current c++ standard, but it's issues like that I would like to get deeper into during the develpoment.
You should explain what problems with integration you foresee.
I think I was thinking a little ahead of myself when I wrote that. :) The implementation described here would not pose too much of a problem; I was thinking more of the problems that arise when you take things like collation and locales into consideration. From what I understand there is a real issue in enabling proper unicode support in the standard classes like locale, ctype and collate, as they assume things that do not necessarily apply to a unicode representation of text. A failure to enable good support in those classes (at least locale and ctype) would also make the iostream support break, and things start to snowball. I could very well be wrong on this (Actually, I hope I am! :) ), as I haven't had the time to read up on all issues concerning this. But again, this is one of many problems I hope running this project will help reveal.

In article <cl3nps$4d8$1@sea.gmane.org>, "Erik Wien" <wien@start.no> wrote:
Hi. Thanks for the feedback!
My pleasure :-)
"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-BACD3C.13585519102004@sea.gmane.org...
I generally agree with this design approach, but I don't think that code point iterators alone are sufficient.
Neither do I, as a matter of fact, but this is as far as I have come right now. :) There would probably be different types of iterators (or iterator wrappers) made available to enable iteration over everything from code units to code points/abstract characters.
Yes, I agree.
Iteration over encoded characters and abstract characters would be needed for some algorithms to function sensibly. For example, the simple task of:
find(begin, end, "ü")
needs to use abstract characters in order to be able to find precomposed and decomposed versions of ü.
True... And this is a point where implementation would be less than trivial.
Yeah, that's how far I got before I decided that I didn't have the time to deal with the problem given my current schedule.
Again, taking this example, let's say that do_some_operation performs canonicalization to some Unicode canonical form; you can't do this by iterating over code points.
Nope. A code unit iterator would be needed for things like that.
I am pretty sure you mean abstract character here, not code unit. My understanding of the Unicode terminology is that the decomposed version of ü consists of

one abstract character (ü)
two encoded characters (u, plus U+0308 COMBINING DIAERESIS)
two UTF-32 code units (0x00000075 0x00000308)
two UTF-16 code units (0x0075 0x0308)
three UTF-8 code units (0x75 0xCC 0x88)

but perhaps I have it backwards...
The implementation described here would not pose too much of a problem; I was thinking more of the problems that arise when you take things like collation and locales into consideration. From what I understand there is a real issue in enabling proper unicode support in the standard classes like locale, ctype and collate, as they assume things that do not necessarily apply to a unicode representation of text. A failure to enable good support in those classes (at least locale and ctype) would also make the iostream support break, and things start to snowball. I could very well be wrong on this (Actually, I hope I am! :) ), as I haven't had the time to read up on all issues concerning this. But again, this is one of many problems I hope running this project will help reveal.
I don't know enough about locales to comment on this, unfortunately. meeroh

I am pretty sure you mean abstract character here, not code unit. My understanding of the Unicode terminology is that the decomposed version of ü consists of

one abstract character (ü)
two encoded characters (u, plus U+0308 COMBINING DIAERESIS)
two UTF-32 code units (0x00000075 0x00000308)
two UTF-16 code units (0x0075 0x0308)
three UTF-8 code units (0x75 0xCC 0x88)

but perhaps I have it backwards...
No. You are correct about that. I don't know what I was talking about. This is another example of me talking before I think! ;) I think we agree on this, but are just misunderstanding each other. Anyhoo... To answer this again: :)
Again, taking this example, let's say that do_some_operation performs canonicalization to some Unicode canonical form; you can't do this by iterating over code points.
No, you can't do that with code point iterators, but I am pretty sure you couldn't do it with an abstract character iterator either. (Or any kind of iterator, for that matter.) The process of canonicalization (I'm assuming you are talking about canonical decomposition here) involves splitting one code point into multiple code points where that is possible. (ü would be split into u and U+0308, as you say.) That means that do_some_operation would need to insert code points into the string it is iterating over, something that would take some "hacking" to do inside a normal iterator interface.

Abstract character iterators are no better. The concept of abstract characters is oblivious to the code unit differences between these representations, and iterating over abstract characters (I'm not sure how this would even be done) would not reveal the underlying composition of code points needed for canonical decomposition to be performed.

Ultimately I feel that the operation of normalization (which involves canonical decomposition) of unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.) I think that solution would be satisfactory for most users, as the normalization process is somewhat intricate and really not something users should be forced to understand. Are we at all on the same page now?

Erik Wien wrote:
Ultimately I feel that the operation of normalization (which involves canonical decomposition) of unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.)
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics.

My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case. The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.

In other words, I believe that string::operator== should always perform the per-element comparison std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table. If I want to test whether two sequences of char16_t's, interpreted as UTF16 Unicode strings, would represent the same string in a printed form, I should be given a dedicated function that does just that - or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF8, I'd call the appropriate 'normalize' function/algorithm. But I may be wrong. :-)
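To illustrate the split being argued for here, container comparison stays element-wise while Unicode-aware comparison is a separate algorithm, consider this sketch. The names unicode_equal and normalize_nfd are invented for the example, and the toy normalizer decomposes only U+00FC; a real one would consult the Unicode Character Database.

#include <cstddef>
#include <vector>

typedef unsigned short char16; // stand-in for char16_t, which C++ lacks today

// Toy canonical decomposition: handles only U+00FC, for illustration.
std::vector<char16> normalize_nfd(const std::vector<char16>& s)
{
    std::vector<char16> out;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == 0x00FC) { out.push_back(0x0075); out.push_back(0x0308); }
        else                  out.push_back(s[i]);
    }
    return out;
}

// The dedicated "same printed string" test; the vector's own operator==
// remains the plain element-wise Container comparison.
bool unicode_equal(const std::vector<char16>& lhs,
                   const std::vector<char16>& rhs)
{
    return normalize_nfd(lhs) == normalize_nfd(rhs);
}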

Peter Dimov wrote:
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics.
My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
That is kinda what my current implementation does, but the container is not directly accessible by the user. (Nor do I think it should be.) Instead I wrap the vector of code points in a class and provide different types of iterators to iterate through the vector at different "character levels", instead of external algorithms. You can therefore access the string on a code unit level, but the casual user would not necessarily know (or care) about that. Instead he would use the "string as a value" approach, using strings to represent a sentence, word, or some other language construct. When most people think of a string, they think of text, and not the underlying binary representation, and therefore that is, in my opinion, the notion a library should be designed around.
In other words, I believe that string::operator== should always perform the per-element comparison std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table.
If I want to test whether two sequences of char16_t's, interpreted as UTF16 Unicode strings, would represent the same string in a printed form, I should be given a dedicated function that does just that - or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF8, I'd call the appropriate 'normalize' function/algorithm.
Though I see where you are coming from, I don't agree with you on that. In my opinion a good unicode library should hide as much as possible of the complexity of the actual character representation from the user. If we were to require the user to know that a direct binary comparison of strings is not the same as an actual textual comparison, we lose some of the simplicity of the library. Most users of such a library would not know that the character ö can be represented both as 'o' followed by U+0308 and as the single code point 'ö', and that as a consequence, calling == on two strings could result in the behaviour "ö" != "ö". By removing the need for such knowledge on the user's part, we reduce the learning curve considerably, which is one of the main reasons for abstracting this functionality anyway.

In article <cl4bs4$prn$1@sea.gmane.org>, "Erik Wien" <wien@start.no> wrote:
Peter Dimov wrote:
In other words, I believe that string::operator== should always perform the per-element comparison std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table.
If I want to test whether two sequences of char16_t's, interpreted as UTF16 Unicode strings, would represent the same string in a printed form, I should be given a dedicated function that does just that - or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF8, I'd call the appropriate 'normalize' function/algorithm.
Though I see where you are coming from, I don't agree with you on that. In my opinion a good unicode library should hide as much as possible of the complexity of the actual character representation from the user. If we were to require the user to know that a direct binary comparison of strings is not the same as an actual textual comparison, we lose some of the simplicity of the library. Most users of such a library would not know that the character ö can be represented both as 'o' followed by U+0308 and as the single code point 'ö', and that as a consequence, calling == on two strings could result in the behaviour "ö" != "ö". By removing the need for such knowledge on the user's part, we reduce the learning curve considerably, which is one of the main reasons for abstracting this functionality anyway.
I completely agree with Erik on this. For anything except US English, the interface of basic_string grafted on top of a sequence of UTF code points produces wrong results for most people most of the time. Unicode is hard enough as it is; we don't need to expose it via an interface whose default behavior violates the principle of least surprise most of the time. meeroh

In article <00da01c4b6a3$114a1a10$0600a8c0@pdimov>, "Peter Dimov" <pdimov@mmltd.net> wrote:
Miro Jurisic wrote:
I completely agree with Erik on this. For anything except for US english, the interface of basic_string grafted on top of a sequence of UTF code points produces wrong results for most people most of the time.
How did basic_string even enter the discussion?
You said string::operator== which I interpreted to mean basic_string::operator== rather than fictitious_unicode_string::operator==. meeroh

Erik Wien wrote:
Peter Dimov wrote:
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics. My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
That is kinda what my current implementation does, but the container is not directly accessible by the user. (Nor do I think it should be.) Instead I wrap the vector of code points in a class and provide different types of iterators to iterate through the vector at different "character levels", instead of external algorithms.
That's what external algorithms take, iterators. I don't understand what you mean by that.
You can therefore access the string on a code unit level, but the casual user would not necessarily know (or care) about that. Instead he would use the "string as a value" approach, using strings to represent a sentence, word, or some other language construct. When most people think of a string, they think of text, and not the underlying binary representation, and therefore that is, in my opinion, the notion a library should be designed around.
That may be so. But I don't see how the user can be isolated from the binary representation if he needs to pick one of utf8_string, utf16_string, ucs2_string, ucs4_string to store his strings. Perhaps I misunderstand your idea. Can you post a sketch of your spec? How many string classes do you have? What encoding do they use? What do begin(), end(), size() return? Are the iterators random access? Bidirectional? Constant? How can the user obtain the underlying element sequence to persist it somewhere or to pass it to an external library?
In my opinion a good unicode library should hide as much as possible of the complexity of the actual character representation from the user.
Hiding intrinsic complexity isn't necessarily a good idea. Sometimes users need to accomplish a specific task and the abstraction layer, in its attempts to "hide the complexity", just gets in the way. This should never happen.

"Peter Dimov" <pdimov@mmltd.net> wrote in message news:00d501c4b6a2$f541bda0
That may be so. But I don't see how the user can be isolated from the binary representation if he needs to pick one of utf8_string, utf16_string, ucs2_string, ucs4_string to store his strings. Perhaps I misunderstand your idea. Can you post a sketch of your spec? How many string classes do you have? What encoding do they use? What do begin(), end(), size() return? Are the iterators random access? Bidirectional? Constant? How can the user obtain the underlying element sequence to persist it somewhere or to pass it to an external library?
First you need to understand that what I have so far is just a preliminary test implementation for my own amusement. I anticipate a lot of things will change if I go forward with this project. Right now I have a single encoded_string class that has two template parameters, namely encoding and encoding_traits. encoding_traits is a class where all encoding specific implementation is kept, and this class is used to set up the encoded_string class to correctly represent strings in the given encoding. begin() and end() return a code point iterator that has the same interface and value_type no matter what the underlying encoding is. That is, you only see code points when iterating over a string, not the underlying code unit sequence. The iterators used are bidirectional, not random access (impossible on UTF-8 and UTF-16), and they are as of now not constant. It IS possible to assign a code point to a UTF-8 encoded string through an iterator, even if the resulting code unit sequence would be longer than the one the iterator is pointing to. The underlying container is automatically resized to make room for the new sequence. (This is of course slow!)
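For concreteness, a rough sketch of how such a code point iterator could decode on the fly over UTF-8 storage. This is an illustration only, not the posted implementation: it assumes well-formed UTF-8, hard-codes std::string as storage instead of going through the traits, and leaves out the iterator typedefs and all error handling.

#include <string>

class utf8_code_point_iterator
{
    std::string::const_iterator pos_;

public:
    explicit utf8_code_point_iterator(std::string::const_iterator pos)
        : pos_(pos) {}

    // Dereference decodes the whole multi-unit sequence to a code point.
    unsigned long operator*() const
    {
        unsigned char b = static_cast<unsigned char>(*pos_);
        if (b < 0x80) return b;                                  // 1 unit
        if (b < 0xE0) return ((b & 0x1Ful) << 6)                 // 2 units
                           |  (static_cast<unsigned char>(pos_[1]) & 0x3F);
        if (b < 0xF0) return ((b & 0x0Ful) << 12)                // 3 units
                           | ((static_cast<unsigned char>(pos_[1]) & 0x3Ful) << 6)
                           |  (static_cast<unsigned char>(pos_[2]) & 0x3F);
        return ((b & 0x07ul) << 18)                              // 4 units
             | ((static_cast<unsigned char>(pos_[1]) & 0x3Ful) << 12)
             | ((static_cast<unsigned char>(pos_[2]) & 0x3Ful) << 6)
             |  (static_cast<unsigned char>(pos_[3]) & 0x3F);
    }

    // Step over one encoded character, however many units it occupies.
    utf8_code_point_iterator& operator++()
    {
        unsigned char b = static_cast<unsigned char>(*pos_);
        pos_ += b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        return *this;
    }

    // Back up over continuation units (bidirectional, not random access).
    utf8_code_point_iterator& operator--()
    {
        do { --pos_; }
        while ((static_cast<unsigned char>(*pos_) & 0xC0) == 0x80);
        return *this;
    }

    bool operator!=(const utf8_code_point_iterator& rhs) const
        { return pos_ != rhs.pos_; }
};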

Erik Wien wrote:
Right now I have a single encoded_string class that has two template parameters, namely encoding and encoding_traits. encoding_traits is a class where all encoding specific implementation is kept, and this class is used to set up the encoded_string class to correctly represent strings in the given encoding.
Yes, that's close to what I thought. Do not repeat the basic_string mistake and make encoding_traits a template parameter. A traits class is never used in this way. Your encoding_traits is actually a policy. A traits class is independent of the components that use it. It is basically a mapping from a type to something; in your case, a mapping between the encoding parameter and the operations. So, encoding_traits aside, you essentially have string<utf8>.
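In code, the distinction might look like the following sketch (names are placeholders): the traits template is a fixed mapping keyed on the encoding tag, looked up by the string itself rather than passed in, so string_<utf8> is one type everywhere.

struct utf8  {};                     // encoding tags
struct utf16 {};

template<class Encoding> struct encoding_traits;  // one mapping, not a knob

template<> struct encoding_traits<utf8>
{
    typedef unsigned char code_unit;
};

template<> struct encoding_traits<utf16>
{
    typedef unsigned short code_unit;
};

// The string takes only the encoding; it finds the traits itself.
// Two string_<utf8> objects are therefore always the same type.
template<class Encoding>
class string_
{
    typedef typename encoding_traits<Encoding>::code_unit code_unit;
    // std::vector<code_unit> storage_; ...
};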
The iterators used are bidirectional, not random access (impossible on UTF-8 and UTF-16), and they are as of now not constant. It IS possible to assign a code point to a UTF-8 encoded string through an iterator, even if the resulting code unit sequence would be longer than the one the iterator is pointing to. The underlying container is automatically resized to make room for the new sequence. (This is of course slow!)
This is another basic_string mistake that effectively rules out efficient reference counting. ;-) Just make the iterators constant. The functionality can be obtained with explicit erase/insert/replace members.

At 03:46 PM 10/20/2004, Peter Dimov wrote:
The iterators used are bidirectional, not random access (impossible on UTF-8 and UTF-16), and they are as of now not constant. It IS possible to assign a code point to a UTF-8 encoded string through an iterator, even if the resulting code unit sequence would be longer than the one the iterator is pointing to. The underlying container is automatically resized to make room for the new sequence. (This is of course slow!)
This is another basic_string mistake that effectively rules out efficient reference counting. ;-) Just make the iterators constant. The functionality can be obtained with explicit erase/insert/replace members.
There are additional advantages to a constant iterator and explicit erase/insert/replace design; various caching and disk-residence implementations become both possible and efficient. --Beman

Erik Wien wrote:
The iterators used are bidirectional, not random access (impossible on UTF-8 and UTF-16)
No. Andrei Alexandrescu explained a scheme to me whereby a UTF-16 encoded string can have a random-access iterator, and I think it should. The basic idea is you keep a plain array of 16-bit integers which are the 16-bit characters and the first 16 bits of surrogate pairs. Then you have a data structure which maps from string offsets to the second 16 bits of surrogate pairs. Random access involves a simple index and a map look-up. Sequential access requires no map look-up. And since surrogate pairs are very rare, the map will almost always be empty and the look-up is skipped. I think the default should be UTF-16 encoding, and that the iterator should use a scheme like this to be random access. Rationale: there are string algorithms that benefit from random access (Boyer-Moore comes to mind). -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
Erik Wien wrote:
The iterators used are bidirectional, not random access (impossible on UTF-8 and UTF-16)
No. Andrei Alexandrescu explained a scheme to me whereby a UTF-16 encoded string can have a random-access iterator, and I think it should. The basic idea is you keep a plain array of 16-bit integers which are the 16-bit characters and the first 16 bits of surrogate pairs. Then you have a data structure which maps from string offsets to the second 16 bits of surrogate pairs. Random access involves a simple index and a map look-up. Sequential access requires no map look-up. And since surrogate pairs are very rare, the map will almost always be empty and the look-up is skipped.
Nice! But this seems to make c_str an O(N) operation. If I need to speak to a library in the common extern "C" language of interoperability, and that library happens to need a UTF-16 encoded wchar_t const [], which by coincidence has the same representation as char16_t const [], I won't be very happy if the C++ string ignores this common scenario.

Peter Dimov wrote:
Eric Niebler wrote:
Erik Wien wrote:
The iterators used are bidirectional, not random access (impossible on UTF-8 and UTF-16)
No. Andrei Alexandrescu explained a scheme to me whereby a UTF-16 encoded string can have a random-access iterator, and I think it should. The basic idea is you keep a plain array of 16-bit integers which are the 16-bit characters and the first 16 bits of surrogate pairs. Then you have a data structure which maps from string offsets to the second 16 bits of surrogate pairs. Random access involves a simple index and a map look-up. Sequential access requires no map look-up. And since surrogate pairs are very rare, the map will almost always be empty and the look-up is skipped.
Nice! But this seems to make c_str an O(N) operation. If I need to speak to a library in the common extern "C" language of interoperability, and that library happens to need a UTF-16 encoded wchar_t const [], which by coincidence has the same representation as char16_t const [], I won't be very happy if the C++ string ignores this common scenario.
Two points. First, keep in mind that surrogates are exceedingly rare. The common case is that there are no surrogates, and c_str is O(1). Second, in the rare case where there are surrogates, there can be a mutable cache that c_str can return, building it on demand only when the cache is dirty. IMO the advantages of having a random access iterator are worth the trouble, especially considering how rare surrogates are. Oh, and I agree that it should be a const iterator. :-) -- Eric Niebler Boost Consulting www.boost-consulting.com

On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
I think the default should be UTF-16 encoding, and that the iterator should use a scheme like this to be random access. Rationale: there are string algorithms that benefit from random access (Boyer-Moore comes to mind).
Correct me if I'm wrong. From what I gather from a Google search, Boyer-Moore is a fast string search algorithm. Why not use the algorithm on the code units rather than codepoints? UTF-8 and UTF-16 are both not stateful, specifically to allow optimisations such as this (as well as error recovery). As was pointed out earlier in this thread, searching for Unicode characters takes looking at combining characters as well. I think this will go for many, if not all, algorithms that you can think of: either they can be made to work with code units, or they must work on abstract characters, which means a variable-width encoding anyway. (See the Unicode Standard 4, Section 2.5 for a similar argument for UTF-16 over UTF-32, even though the latter is fixed-width.) I'm ready to be proven wrong; however, at this moment at least I believe that any effort to make UTF-16 randomly accessible is not useful. Regards, Rogier

Rogier van Dalen wrote:
On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
I think the default should be UTF-16 encoding, and that the iterator should use a scheme like this to be random access. Rationale: there are string algorithms that benefit from random access (Boyer-Moore comes to mind).
Correct me if I'm wrong. From what I gather from a Google search, Boyer-Moore is a fast string search algorithm. Why not use the algorithm on the code units rather than codepoints? UTF-8 and UTF-16 are both not stateful, specifically to allow optimisations such as this (as well as error recovery).
Searching a Unicode string for a particular bit pattern is not particularly meaningful because the same string can be represented with different bit patterns. Have I misinterpreted what you are suggesting? -- Eric Niebler Boost Consulting www.boost-consulting.com

On Thu, 21 Oct 2004 12:10:28 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
Rogier van Dalen wrote:
On Wed, 20 Oct 2004 12:48:31 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
I think the default should be UTF-16 encoding, and that the iterator should use a scheme like this to be random access. Rationale: there are string algorithms that benefit from random access (Boyer-Moore comes to mind).
Correct me if I'm wrong. From what I gather from a Google search, Boyer-Moore is a fast string search algorithm. Why not use the algorithm on the code units rather than codepoints? UTF-8 and UTF-16 are both not stateful, specifically to allow optimisations such as this (as well as error recovery).
Searching a Unicode string for a particular bit pattern is not particularly meaningful because the same string can be represented with different bit patterns. Have I misinterpreted what you are suggesting?
If the strings are not normalised, that is correct. However, if the strings are normalised (to the same form), then the codepoint pattern will be the same; and so will the code unit pattern (through the way UTF-8 and UTF-16 are specified). If you search a Unicode string for a substring using code units, you *will* find all matches, plus some bogus matches. For example, if you're looking for "can" you will match "cañ" too, in normalisation form D. To get rid of these bogus matches, all you have to do is check that no combining marks are following, and otherwise, proceed to the next match. So Boyer-Moore may be used on the code units, as long as it is checked that a match includes full abstract characters only. Hope this makes clearer what I'm trying to say. Rogier
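In code, the filtering step might look like the sketch below (an illustration only). std::search stands in for Boyer-Moore, both inputs are assumed to be normalised the same way, and is_combining is a toy test covering just the main combining-diacritics block; a real test would consult the Unicode Character Database.

#include <algorithm>
#include <vector>

typedef unsigned short char16;

// Toy combining-mark test: the Combining Diacritical Marks block only.
inline bool is_combining(char16 u)
{
    return u >= 0x0300 && u <= 0x036F;
}

// Fast code unit search, then reject matches followed by a combining
// mark (so "can" does not match inside "can" + combining tilde).
std::vector<char16>::const_iterator
unicode_find(const std::vector<char16>& hay,
             const std::vector<char16>& needle)
{
    std::vector<char16>::const_iterator it = hay.begin();
    for (;;)
    {
        it = std::search(it, hay.end(), needle.begin(), needle.end());
        if (it == hay.end())
            return hay.end();                        // no match at all
        std::vector<char16>::const_iterator after = it + needle.size();
        if (after == hay.end() || !is_combining(*after))
            return it;                               // full abstract characters
        ++it;                                        // bogus match, keep going
    }
}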

At 08:47 AM 10/20/2004, Peter Dimov wrote:
... How can the user obtain the underlying element sequence to persist it somewhere or to pass it to an external library?
I'd like to underline this requirement. There must be the equivalent of c_str() to allow retrieval of the underlying element sequence. Alternately, if the underlying element sequence is hidden, then there would have to be a whole series of functions to get copies. utf8(), utf16(), etc. --Beman

Peter Dimov wrote:
Ultimately I feel that the operation of normalization (which involves canonical decomposition) of unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.)
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics.
My opinion is that the std::char_traits<> experiment failed
I agree to that.
and conclusively demonstrated that the "string as a value" approach is a dead end,
How was it demonstrated? There are two separate questions. The first is how many operations are methods of 'string' and how many are external. Contrary to what Exceptional C++ says, I believe many methods on string are OK. As an example, QString presents a huge but consistent interface, while in standard C++ we have string, boost::format, boost::tokenizer and boost::string_algo, and that's simply too many separate docs to look at. The second question is whether operator==, operator< or 'find' should operate on vector<char_XX> or on abstract characters, using Unicode rules, or whether there should be two versions. I don't really understand why a 'unicode-unaware' semantic is ever needed, so we should have only the 'unicode-aware' one.
But I may be wrong. :-)
Me too. - Volodya

Vladimir Prus wrote:
The second question is whether operator==, operator< or 'find' should operate on vector<char_XX> or on abstract characters, using Unicode rules, or whether there should be two versions. I don't really understand why a 'unicode-unaware' semantic is ever needed, so we should have only the 'unicode-aware' one.
Look at 21.3/2: "The class template basic_string conforms to the requirements of a Sequence, as specified in (23.1.1). Additionally, because the iterators supported by basic_string are random access iterators (24.1.5), basic_string conforms to the requirements of a Reversible Container, as specified in (23.1)."

Now look at Table 65, Container requirements, operator==: "== is an equivalence relation. a.size()==b.size() && equal(a.begin(), a.end(), b.begin())"

The question is now: what do begin(), end() and size() return for our hypothetical string16? I maintain that the library design is much cleaner if begin(), end() and size() are random access iterators over the underlying _storage_, not over the codepoint representation or abstract character representation. Codepoint iterators and abstract character iterators would still be provided, but they would be constant bidirectional with char32_t as the value_type. Codepoint and abstract character operations would be provided by algorithms, taking an iterator range. The user should remember and honor the encoding (UTF-16, UCS-2, other) of a particular container of char16_t, not the container itself. This is straightforward STL-style container-iterator-algorithm orthogonalization.

Peter Dimov wrote:
Vladimir Prus wrote:
The second question is whether operator==, operator< or 'find' should operate on vector<char_XX> or on abstract characters, using Unicode rules, or whether there should be two versions. I don't really understand why a 'unicode-unaware' semantic is ever needed, so we should have only the 'unicode-aware' one.
Look at 21.3/2: "The class template basic_string conforms to the requirements of a Sequence, as specified in (23.1.1).
Additionally, because the iterators supported by basic_string are random access iterators (24.1.5), basic_string conforms to the requirements of a Reversible Container, as specified in (23.1)."
Now look at Table 65, Container requirements, operator==:
"== is an equivalence relation.
a.size()==b.size() && equal(a.begin(), a.end(), b.begin())"
Yes, I know that.
The question is now, what do begin(), end() and size() return for our hypothetical string16?
The first two return some hypothetical iterator. Its operator* will return either a "unicode_character" (base character + all accents) or a "unicode_character_ref" which will refer back to storage and extract components on demand. The performance of both versions is unclear. Further, with "unicode_character_ref", the iterator won't be an lvalue iterator (or a random access iterator, in standard terms).
I maintain that the library design is much cleaner if begin(), end() and size() are random access iterators over the underlying _storage_, not over the codepoint representation or abstract character representation.
But those methods allow working directly on storage, and I don't know if that's ever needed, or more common than working on the character level. And after all, unicode_string can have a 'storage()' method that gives vector<char_16>&, if direct storage manipulation is desired.
Codepoint iterators and abstract character iterators would still be provided, but they would be constant bidirectional with char32_t as the value_type.
Codepoint and abstract character operations would be provided by algorithms, taking an iterator range.
The user should remember and honor the encoding (UTF-16, UCS-2, other) of a particular container of char16_t, not the container itself.
So, vector<char16_t> could be either UTF-16 or UCS-2? I think that's a bad idea. If a library accepts a unicode string, then its interface can:

- use 'unicode_string'
- use 'unicode_string<some_encoding>'
- use 'vector<char16_t>' and have a comment that the string is UTF-16.

I think the first option is best, and the last is too easy to misuse. - Volodya

Vladimir Prus wrote:
Peter Dimov wrote:
The question is now, what do begin(), end() and size() return for our hypothetical string16?
The first two return some hypothetical iterator. Its operator* will return either a "unicode_character" (base character + all accents) or a "unicode_character_ref" which will refer back to storage and extract components on demand. The performance of both versions is unclear. Further, with "unicode_character_ref", the iterator won't be an lvalue iterator (or a random access iterator, in standard terms).
I maintain that the library design is much cleaner if begin(), end() and size() are random access iterators over the underlying _storage_, not over the codepoint representation or abstract character representation.
But those methods allow working directly on storage, and I don't know if that's ever needed, or more common than working on the character level.
Well, only storage elements can be directly manipulated. So if any direct manipulation is needed, it needs to be direct storage manipulation. ;-)
And after all, unicode_string can have a 'storage()' method that gives vector<char_16>&, if direct storage manipulation is desired.
I'm not sure that this is a good idea. It violates encapsulation and can break the invariant of unicode_string, which, if I understand correctly, is that it contains a sequence of abstract Unicode characters in a particular pre-determined normalized form, encoded using a particular, pre-determined encoding.
The user should remember and honor the encoding (UTF-16, UCS-2, other) of a particular container of char16_t, not the container itself.
So, vector<char16_t> could be either UTF-16 or UCS-2? I think that's a bad idea.
Maybe, but this is just the way things are. :-) A sequence of char16_t can have any encoding.
If a library accepts a unicode string, then its interface can:

- use 'unicode_string'
- use 'unicode_string<some_encoding>'
- use 'vector<char16_t>' and have a comment that the string is UTF-16.
I think the first option is best, and the last is too easy to misuse.
Yes. So let's see if I understand your position correctly. A single string class shall be used to store Unicode strings, i.e. logical sequences of Unicode abstract characters. This string shall be stored in one chosen encoding, for example UTF-8. The user does not have direct access to the underlying storage, however, so it might be regarded as an implementation detail. An invariant of the string is that it is always in one chosen normalized form. Iteration over the string gives back a sequence of char32_t abstract characters. Comparisons are defined in terms of these sequences. Is this a fair summary?

Peter Dimov wrote:
Vladimir Prus wrote:
If a library accepts a unicode string, then its interface can:

- use 'unicode_string'
- use 'unicode_string<some_encoding>'
- use 'vector<char16_t>' and have a comment that the string is UTF-16.
I think the first option is best, and the last is too easy to misuse.
Yes.
So let's see if I understand your position correctly.
A single string class shall be used to store Unicode strings, i.e. logical sequences of Unicode abstract characters.
This string shall be stored in one chosen encoding, for example UTF-8. The user does not have direct access to the underlying storage, however, so it might be regarded as an implementation detail.
An invariant of the string is that it is always in one chosen normalized form. Iteration over the string gives back a sequence of char32_t abstract characters. Comparisons are defined in terms of these sequences.
Is this a fair summary?
Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications. If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example). Also, it is impossible to store an abstract unicode character in char32_t because there may be N zero-width combining characters associated with it. Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle. -- Eric Niebler Boost Consulting www.boost-consulting.com

"Eric Niebler" <eric@boost-consulting.com> wrote in message
Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications.
Yes... That's why I would like the encoding to be templated, allowing the programmer to choose the encoding best suited to his/her needs.
If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example).
I think the best solution is to store the string in the form it was originally received (decomposed or not), and instead provide composition functions or even iterator wrappers that compose on the fly. That would allow composed strings to be used where needed (like in an XML library), without imposing that requirement on all other users.
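A sketch of what such a compose-on-the-fly wrapper might look like. This is an illustration only: the composition "table" is a single hard-coded pair, u + U+0308 composing to ü, where a real implementation would use the tables from the Unicode Character Database.

#include <vector>

typedef unsigned long char32;

// Wraps a code point sequence and yields precomposed code points on the
// fly, leaving the underlying string in whatever form it arrived in.
class composing_iterator
{
    std::vector<char32>::const_iterator pos_, end_;

    bool at_composable_pair() const
    {
        return *pos_ == 0x0075            // 'u'
            && pos_ + 1 != end_
            && *(pos_ + 1) == 0x0308;     // combining diaeresis
    }

public:
    composing_iterator(std::vector<char32>::const_iterator pos,
                       std::vector<char32>::const_iterator end)
        : pos_(pos), end_(end) {}

    char32 operator*() const
    {
        return at_composable_pair() ? 0x00FC : *pos_;  // compose to 'ü'
    }

    composing_iterator& operator++()
    {
        pos_ += at_composable_pair() ? 2 : 1;  // consumed a pair or a single
        return *this;
    }

    bool operator!=(const composing_iterator& rhs) const
        { return pos_ != rhs.pos_; }
};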
Also, it is impossible to store an abstract unicode character in char32_t because there may be N zero-width combining characters associated with it.
Quite true... Storing abstract characters would require some variable-width storage facility.
Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle.
I would really like to provide enough knobs to keep everyone happy! ;)

Erik Wien wrote:
"Eric Niebler" <eric@boost-consulting.com> wrote in message
Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications.
Yes... That's why I would like the encoding to be templated, allowing the programmer to choose the encoding best suited to his/her needs.
It's good to have one string class for library interoperability reasons. Otherwise library A would demand utf8_string, library B would demand utf16_string, and library C would demand utf32_string. No matter which one you choose, you'll pay a price. (This doesn't change even if you spell utf8_string as string<utf8>.)

It's good to have one string class for library interoperability reasons. Otherwise library A would demand utf8_string, library B would demand utf16_string, and library C would demand utf32_string. No matter which one you choose, you'll pay a price. (This doesn't change even if you spell utf8_string as string<utf8>.)
That is true. Though the strings of different encodings should be assignable to each other, libraries taking references to encoded_strings would need some conversion to be done. We have a similar problem today with basic_string<char> and basic_string<wchar_t>, and I think it could also be solved in a way that is very similar to what is done in the <string> header. If we typedef a unicode_string or something as encoded_string<utf16>, and promote that as THE string class, most users would use that as their primary string representation, and simply be oblivious to the underlying encoding. (A good thing.) Advanced users could (just like we do today with basic_string) choose to support multiple encodings by templating their own functions on encoding as well.
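The conversion itself could be a constructor templated on the source encoding, re-encoding through the code point iterators that every encoding already exposes. A skeleton (all names are placeholders, and assign() is assumed to re-encode a code point sequence into this string's code units):

struct utf8 {};
struct utf16 {};

template<class encoding>
class encoded_string
{
public:
    encoded_string();
    encoded_string(const char* ascii);            // converts from ASCII

    // Simplified code point iterator; the real one is the bidirectional
    // iterator discussed earlier in the thread.
    typedef const unsigned long* const_iterator;
    const_iterator begin() const;
    const_iterator end() const;

    // Re-encode a code point sequence into this string's code units.
    template<class InputIterator>
    void assign(InputIterator first, InputIterator last);

    // Converting constructor: any encoding can be built from any other,
    // because the conversion goes through code points.
    template<class other_encoding>
    encoded_string(const encoded_string<other_encoding>& other)
    {
        assign(other.begin(), other.end());
    }
};

typedef encoded_string<utf16> unicode_string;     // THE string class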

Erik Wien wrote:
It's good to have one string class for library interoperability reasons. Otherwise library A would demand utf8_string, library B would demand utf16_string, and library C would demand utf32_string. No matter which one you choose, you'll pay a price. (This doesn't change even if you spell utf8_string as string<utf8>.)
That is true. Though the strings of different encodings should be assignable to each other, libraries taking references to encoded_strings would need some conversion to be done.
We have a similar problem today with basic_string<char> and basic_string<wchar_t>, and I think it could also be solved in a way that is very similar to what is done in the <string> header.
Just to clarify: the string and wstring in the standard have a huge problem: you can't convert string to wstring in any way: there's just no appropriate converting constructor.
If we typedef a unicode_string or something as encoded_string<utf16>, and promote that as THE string class, most users would use that as their primary string representation, and simply be oblivious to the underlying encoding. (A good thing.)
That would still make it easy for a user to use some different encoding without good reason.
Advanced users could (just like we do today with basic_string) choose to support multiple encodings by templating their own functions on encoding as well.
Oh well. I just hope nobody will ever make an implementation of XML parser + XML Schema + XPath + XQuery + SOAP + HTML renderer which is fully templated on string type, unless the same person speeds up gcc by 10 times previously. - Volodya

"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cl7kha$tk4
That is true. Though the strings of different encodings should be assignable to each other, libraries taking references to encoded_strings would need some conversion to be done.
We have a similar problem today with basic_string<char> and basic_string<wchar_t>, and I think it could also be solved in a way that is very similar to what is done in the <string> header.
Just to clarify: the string and wstring in the standard have a huge problem: you can't convert string to wstring in any way: there's just no appropriate converting constructor.
Correct, and that should be provided if we were to make a unicode_string templated on encoding. (Not that we necessarily will.)
If we typedef a unicode_string or something as encoded_string<utf16>, and promote that as THE string class, most users would use that as their primary string representation, and simply be oblivious to the underlying encoding. (A good thing.)
That would still make it easy for a user to use some different encoding without good reason.
It would be possible, yes, not even difficult, but I don't think that means people will actually do it. People usually use std::string today, not basic_string<whatever>, because that is "the string class". I think a similar thing would happen with a unicode_string typedef, especially if there is no difference in the interface between the different template versions. Then you'd have to know what the differences between the different encodings are (be an advanced user, aware of the drawbacks) to actually bother using anything other than UTF-16.
Advanced users could (just like we do today with basic_string) choose to support multiple encodings by templating their own functions on encoding as well.
Oh well. I just hope nobody will ever make an implementation of
XML parser + XML Schema + XPath + XQuery + SOAP + HTML renderer
which is fully templated on string type, unless the same person speeds up gcc by 10 times previously.
Point taken. ;)

On Fri, 22 Oct 2004 02:43:19 +0200, Erik Wien <wien@start.no> wrote:
"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cl7kha$tk4
Just to clarify: the string and wstring in the standard have a huge problem: you can't convert string to wstring in any way: there's just no appropriate converting constructor.
Correct, and that should be provided if we were to make a unicode_string templated on encoding. (Not that we necessarily will.)
I think that if you don't know the encoding (and normalisation form) at compile time, iterators will have to be really slow. I hope you will try and find out during your research how important that is. I guess things like regular expressions would be slow, unless you provide algorithms with a possibility to switch to versions tailored to the encoding at run-time, similar to the visitor pattern boost::variant provides.
That would still make it easy for a user to use some different encoding without good reason.
It would be possible, yes, not even difficult, but I don't think that means people will actually do it. People usually use std::string today, not basic_string<whatever>, because that is "the string class". I think a similar thing would happen with a unicode_string typedef, especially if there is no difference in the interface between the different template versions. Then you'd have to know what the differences between the different encodings are (be an advanced user, aware of the drawbacks) to actually bother using anything other than UTF-16.
I fully agree. Rogier

If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example).
I think the best solution is to store the string in the form it was originally received (decomposed or not), and instead provide composition functions or even iterator wrappers that compose on the fly. That would allow composed strings to be used where needed (like in an XML library), without imposing that requirement on all other users.
I don't think I can agree on that. If you do a lot of input/output, this might yield better performance, but even in reading XML, you probably need to compare strings a lot, and if they are not normalised, this will really take a lot of processing. Correct me if I'm wrong, but a simple comparison of two non-normalised Unicode strings would take looking up the characters in the Unicode Character Database, decomposing every single character, gathering base characters and combining marks, and ordering the marks, then comparing them. And this must be done for every character. I don't have any numbers, of course, but I have this feeling it is going to be really, really slow. Regards, Rogier

"Rogier van Dalen" <rogiervd@gmail.com> wrote in message
I think the best solution is to store the string in the form it was originally received (decomposed or not), and instead provide composition functions or even iterator wrappers that compose on the fly. That would allow composed strings to be used where needed (like in an XML library), without imposing that requirement on all other users.
I don't think I can agree on that. If you do a lot of input/output, this might yield better performance, but even in reading XML, you probably need to compare strings a lot, and if they are not normalised, this will really take a lot of processing. Correct me if I'm wrong, but a simple comparison of two non-normalised Unicode strings would take looking up the characters in the Unicode Character Database, decomposing every single character, gathering base characters and combining marks, and ordering the marks, then comparing them. And this must be done for every character. I don't have any numbers, of course, but I have this feeling it is going to be really, really slow.
You are quite correct... It is slow. And that is why I am hesitant to make decomposition something that will happen every time you assign something to a string. What this really boils down to, is what kind of usage pattern is the most common? The library should be written to provide the best performance on the operations most people do.

Erik Wien wrote:
"Rogier van Dalen" <rogiervd@gmail.com> wrote in message
I think the best solution is to store the string in the form it was originally received (decomposed or not), and instead provide composition functions or even iterator wrappers that compose on the fly. That would allow composed strings to be used where needed (like in an XML library), without imposing that requirement on all other users.
I don't think I can agree on that. If you do a lot of input/output, this might yield better performance, but even in reading XML, you probably need to compare strings a lot, and if they are not normalised, this will really take a lot of processing. Correct me if I'm wrong, but a simple comparison of two non-normalised Unicode strings would take looking up the characters in the Unicode Character Database, decomposing every single character, gathering base characters and combining marks, and ordering the marks, then comparing them. And this must be done for every character. I don't have any numbers, of course, but I have this feeling it is going to be really, really slow.
You are quite correct... It is slow. And that is why I am hesitant to make decomposition something that will happen every time you assign something to a string.
What this really boils down to, is what kind of usage pattern is the most common? The library should be written to provide the best performance on the operations most people do.
How about this:

- when initialising or assigning to a string, you can opt to normalise one way or the other, or not at all
- normalised strings are flagged as such
- normalisation on comparison can be skipped if both strings are flagged as being normalised the same way
- normalisation on assignment can be skipped if the right-hand side is flagged as being normalised appropriately

Then the user can choose to normalise whichever way is best for his application but without breaking interoperability with libraries that require something different (or produce unnormalised strings); there's a speed penalty for renormalising but that seems inevitable. Obviously there's a speed penalty for checking normalisation flags repeatedly at run-time, but I don't think it would be too bad. Ben.
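A sketch of how the flags could be wired up. The names are invented for the example, and normalise() is only declared, standing in for a real UCD-driven normaliser:

#include <vector>

typedef unsigned short char16;

enum normalisation_form { not_normalised, form_c, form_d };

// Hypothetical, declaration only: renormalise a sequence to the
// requested form using the Unicode Character Database.
std::vector<char16> normalise(const std::vector<char16>& s,
                              normalisation_form f);

class flagged_string
{
    std::vector<char16> units_;
    normalisation_form form_;       // how units_ is currently normalised

public:
    flagged_string(const std::vector<char16>& units, normalisation_form f)
        : units_(units), form_(f) {}

    friend bool operator==(const flagged_string& a, const flagged_string& b)
    {
        // Fast path: both sides flagged as normalised the same way, so
        // the element-wise comparison is already the textual one.
        if (a.form_ != not_normalised && a.form_ == b.form_)
            return a.units_ == b.units_;

        // Slow path: renormalise both sides to a common form first.
        return normalise(a.units_, form_d) == normalise(b.units_, form_d);
    }
};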

"Ben Hutchings" <ben.hutchings@businesswebsoftware.com> wrote in message
You are quite correct... It is slow. And that is why I am hesitant to make decomposition something that will happen every time you assign something to a string.
What this really boils down to, is what kind of usage pattern is the most common? The library should be written to provide the best performance on the operations most people do.
How about this:
- when initialising or assigning to a string, you can opt to normalise one way or the other, or not at all
- normalised strings are flagged as such
- normalisation on comparison can be skipped if both strings are flagged as being normalised the same way
- normalisation on assignment can be skipped if the right-hand side is flagged as being normalised appropriately
That is not a bad idea... I think... ;) Providing a "normalization scheme" as a setting in the string would be possible. I'll take that into consideration.

Eric Niebler wrote:
Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle.
Right. My original point was that the advanced users don't really need a class (or a class template with policies.) They need algorithms that operate on sequences of char*_t.

Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications. If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example).
This is a good point. I should think however that a codecvt facet should be responsible for serialization rather than the unicode string. Furthermore, IMO, the invariants of any unicode string should be checked when reading from a file anyway. This should happen on two levels: the UTF-8 or UTF-16 encoding must be correct, and no dangling combining characters or combining characters on control characters should occur; furthermore, the normalisation form should probably be checked as well. So I'm not sure whether using Normalisation Form C rather than D will give you any big performance gains - you may need less memory though.
Also, it is impossible to store an abstract unicode character in char32_t because there may be N zero-width combining characters associated with it.
I'm not sure what you mean here, but if you mean that one abstract character would be one codepoint: that's not true, I'm sorry to say. Especially languages for which there was no encoding before Unicode, and funny scientists like mathematicians or linguists (I count myself among the latter) will use abstract characters that have not been encoded as precomposed characters in Unicode. Nor will they be; the precomposed forms are there for backwards compatibility, mainly. Note that adding a combining mark to a precomposed character takes decomposing it and recomposing it, so that might be pretty slow.
Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle.
I do agree with that; and also I seem to remember from the discussion back in April that some people felt they needed to iterate over codepoints too. So please allow me to propose an altered version of my earlier proposal, taking in various suggestions from this thread.

namespace unicode {

// ***** Level 1: code units *****

// The code unit sequence is not explicitly specified, but it
// could be std::string, or SGI rope<char16_t>, or whatever.
// I think it would be reasonable to require replace, find,
// find_first_of and similar.

// ***** Level 2: codepoints *****

// The codepoint sequence is templatised on the code unit
// sequence. Depending on CodeUnits::value_type the encoding
// will be UTF-8, UTF-16, or UTF-32.
template <class CodeUnits>
class codepoint_string
{
    CodeUnits _code_units;
public:
    // ...

    // A user is not allowed to change the code unit
    // sequence, but it may be copied, or serialised.
    const CodeUnits & code_units();

    // The iterator is a bidirectional iterator.
    // This is cheap to implement on any correct Unicode-
    // encoded string since the iterator is not stateful.
    typedef ... iterator;

    // A size() member function is not included;
    // count() may be nice though.
};

// ***** Level 3: characters *****

// Normalisation policies
struct normalisation_form_c {};
struct normalisation_form_d {};

// Input policies
struct as_utf8 {};
struct as_latin1 {};
struct as_utf16 {};
// etcetera

// Error checking policies
struct throw_on_encoding_error {};
struct workaround_encoding_error {};

// An abstract Unicode character.
// I have not given this guy's interface much thought yet.
template <class NormalisationForm>
class character
{
    char32_t _base;
    std::vector<char32_t> _marks;
public:
    character (char32_t base);
    character & operator = (char32_t base);

    const char32_t & base() const;
    void add_mark (char32_t mark);

    // An iterator to iterate over the combining marks.
    // It is a const_iterator because we wouldn't want to
    // allow introducing non-marks in the list of marks.
    typedef std::vector<char32_t>::const_iterator mark_iterator;
    mark_iterator mark_begin() const;
    mark_iterator mark_end() const;

    // ....
};

// The actual Unicode string
template <class CodeUnits, class NormalisationForm, class ErrorChecking>
class string
{
    codepoint_string<CodeUnits> _codepoints;
public:
    // Initialise with a utf8 string; normalise and check for errors
    string (const CodeUnits &, as_utf8);

    template <class CodeUnits2, class NormalisationForm2, class ErrorChecking2>
    string (const string <CodeUnits2, NormalisationForm2, ErrorChecking2> &);

    // ....

    const codepoint_string<CodeUnits> & codepoints();
    const CodeUnits & code_units();

    // Another bidirectional iterator, this one iterates
    // over abstract characters.
    class iterator
    {
    public:
        // Returns an object with an interface equal to
        // unicode::character, but it changes the string.
        character_ref operator *() const;
        // ...
    };
};

} // namespace unicode

// ***** That was all *****

Mutating operations on unicode::string may require O(n) time, where n is the length of the code unit sequence, depending on CodeUnits' properties. That's why using an SGI rope would make sense. Some default template parameters for unicode::string should be thought of. Regards, Rogier

Rogier van Dalen wrote:
    // The actual Unicode string
    template <class CodeUnits, class NormalisationForm, class ErrorChecking>
    class string
By using ErrorChecking as a template parameter, you are encoding it as part of the string type, but this is not necessary, because there is no difference between values of strings with different ErrorChecking policies (ErrorChecking does not change the invariant). You should just provide different member functions for the two ErrorChecking behaviors, or pass the ErrorChecking parameter to the member functions that require it. The other two parameters do seem to affect the string value/invariant, so they aren't redundant. Whether they are a good idea is another matter. :-)
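(To make Peter's two alternatives concrete, a sketch with invented member names; neither is taken from an actual implementation:

    class string
    {
    public:
        // Alternative 1: one member function per behaviour
        void append(char32_t cp);            // throws on an invalid codepoint
        void append_corrected(char32_t cp);  // substitutes U+FFFD instead

        // Alternative 2: the behaviour is a tag parameter of the call
        void append(char32_t cp, throw_on_encoding_error);
        void append(char32_t cp, workaround_encoding_error);
    };

Either way, strings using different error handling remain the same type, so they can be assigned and compared freely.)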

On Wed, 20 Oct 2004 23:05:08 +0300, Peter Dimov <pdimov@mmltd.net> wrote:
Rogier van Dalen wrote:
    // The actual Unicode string
    template <class CodeUnits, class NormalisationForm, class ErrorChecking>
    class string
By using ErrorChecking as a template parameter, you are encoding it as part of the string type, but this is not necessary, because there is no difference between values of strings with different ErrorChecking policies (ErrorChecking does not change the invariant). You should just provide different member functions for the two ErrorChecking behaviors, or pass the ErrorChecking parameter to the member functions that require it.
I hadn't yet looked at it this way, but you are right from a theoretical point of view at least. To get more to practical matters, what do you think this should do:

    unicode::string s = ...;
    s += 0xDC01; // An isolated surrogate, which is nonsense

? Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT CHARACTER (Unicode standard 4 Section 2.7), or something else? And what should the member function with the opposite behaviour be called?

Rogier

Rogier van Dalen wrote:
On Wed, 20 Oct 2004 23:05:08 +0300, Peter Dimov <pdimov@mmltd.net> wrote:
Rogier van Dalen wrote:
    // The actual Unicode string
    template <class CodeUnits, class NormalisationForm, class ErrorChecking>
    class string
By using ErrorChecking as a template parameter, you are encoding it as part of the string type, but this is not necessary, because there is no difference between values of strings with different ErrorChecking policies (ErrorChecking does not change the invariant). You should just provide different member functions for the two ErrorChecking behaviors, or pass the ErrorChecking parameter to the member functions that require it.
I hadn't yet looked at it this way, but you are right from a theoretical point of view at least. To get more to practical matters, what do you think this should do:
    unicode::string s = ...;
    s += 0xDC01; // An isolated surrogate, which is nonsense
? Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT CHARACTER (Unicode standard 4 Section 2.7), or something else?
Whatever is most common. My choice would probably be 'throw', but I haven't used Unicode strings enough to have a strong opinion.
And what should the member function with the opposite behaviour be called?
    s.append( 0xDC01 ); // default (throw), += alias

    // pick your favorite from the list below
    s.append_and_correct( 0xDC01 );
    s.append( 0xDC01, unicode::convert_on_error );
    s.append<unicode::convert_on_error>( 0xDC01 );

I'd go with the first option based on general principles, all else being equal. There is also

    unicode::append_and_correct( s, 0xDC01 );

if the operation can be performed in "user space", i.e. doesn't need to be a friend of the string class. Or

    s += unicode::correct( 0xDC01 );

if the automatic correction does not depend on the left side.

"Rogier van Dalen" <rogiervd@gmail.com> wrote in message
I hadn't yet looked at it this way, but you are right from a theoretical point of view at least. To get more to practical matters, what do you think this should do:
    unicode::string s = ...;
    s += 0xDC01; // An isolated surrogate, which is nonsense
? Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT CHARACTER (Unicode standard 4 Section 2.7), or something else? And what should the member function with the opposite behaviour be called?
The best solution would be to never append single code units, but instead code points. The += operator would determine how many code units are required for the given code point.

Erik Wien wrote:
"Rogier van Dalen" <rogiervd@gmail.com> wrote in message
I hadn't yet looked at it this way, but you are right from a theoretical point of view at least. To get more to practical matters, what do you think this should do:
    unicode::string s = ...;
    s += 0xDC01; // An isolated surrogate, which is nonsense
? Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT CHARACTER (Unicode standard 4 Section 2.7), or something else? And what should the member function with the opposite behaviour be called?
The best solution would be to never append single code units, but instead code points. The += operator would determine how many code units are required for the given code point.
I disagree. The user should be allowed to twiddle as many bits as she pleases, even permitted to create an invalid UTF string. However, operations that interpret the string as a whole (comparison, canonicalization, etc.) should detect invalid strings and throw. The reason is that people will need to manipulate strings at the bit level; intermediate states may be invalid, but the final state may be valid. We shouldn't do too much nannying during these intermediate states. -- Eric Niebler Boost Consulting www.boost-consulting.com

In article <41780922.9070406@boost-consulting.com>, "Eric Niebler" <eric@boost-consulting.com> wrote:
Erik Wien wrote:
"Rogier van Dalen" <rogiervd@gmail.com> wrote in message
I hadn't yet looked at it this way, but you are right from a theoretical point of view at least. To get more to practical matters, what do you think this should do:
    unicode::string s = ...;
    s += 0xDC01; // An isolated surrogate, which is nonsense
? Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT CHARACTER (Unicode standard 4 Section 2.7), or something else? And what should the member function with the opposite behaviour be called?
The best solution would be to never append single code units, but instead code points. The += operator would determine how many code units are required for the given code point.
I disagree. The user should be allowed to twiddle as many bits as she pleases, even permitted to create an invalid UTF string. However, operations that interpret the string as a whole (comparison, canonicalization, etc.) should detect invalid strings and throw. The reason is that people will need to manipulate strings at the bit level; intermediate states may be invalid, but the final state may be valid. We shouldn't do too much nannying during these intermediate states.
I am not sure I buy this. I think that if you want to have unchecked Unicode data, you should use a vector<char*_t>. Unicode strings have well-defined invariants with respect to canonicalization and well-formedness, and I think that a Unicode string abstraction should enforce those invariants. Having intermediate states that are invalid and a final state that is valid is not a feature, it's a bug. It's a silent failure that I want to know about. meeroh

"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-
I am not sure I buy this. I think that if you want to have unchecked Unicode data, you should use a vector<char*_t>. Unicode strings have well-defined invariants with respect to canonicalization and well-formedness, and I think that a Unicode string abstraction should enforce those invariants.
Having intermediate states that are invalid and a final state that is valid is not a feature, it's a bug. It's a silent failure that I want to know about.
Amen. ;)

Erik Wien wrote:
"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-
I am not sure I buy this. I think that if you want to have unchecked Unicode data, you should use a vector<char*_t>. Unicode strings have well-defined invariants with respect to canonicalization and well-formedness, and I think that a Unicode string abstraction should enforce those invariants.
Having intermediate states that are invalid and a final state that is valid is not a feature, it's a bug. It's a silent failure that I want to know about.
Amen. ;)
No fair bringing religion into this. ;-) I'll repeat what I said before -- this would be an unfortunate design, and you'll hear about it from your users. If you force people to do their bit twiddling in vector<char*_t>, then you impose an extra allocation and a copy to get it into a unicode::string, and most people won't bother. -- Eric Niebler Boost Consulting www.boost-consulting.com

On Thu, 21 Oct 2004 16:02:34 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
No fair bringing religion into this. ;-) I'll repeat what I said before -- this would be an unfortunate design, and you'll hear about it from your users. If you force people to do their bit twiddling in vector<char*_t>, then you impose an extra allocation and a copy to get it into a unicode::string, and most people won't bother.
What frequent use cases do you see where people will want to change bits rather than work with characters? Rogier

Rogier van Dalen wrote:
On Thu, 21 Oct 2004 16:02:34 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
No fair bringing religion into this. ;-) I'll repeat what I said before -- this would be an unfortunate design, and you'll hear about it from your users. If you force people to do their bit twiddling in vector<char*_t>, then you impose an extra allocation and a copy to get it into a unicode::string, and most people won't bother.
What frequent use cases do you see where people will want to change bits rather than work with characters?
    utf8_string str;
    str.reserve(some_big_number);

    ifstream file("utf8.txt");
    istreambuf_iterator<char8_t> begin(file), end;

    std::copy( begin, end, back_inserter(str) );

This can't throw if you want people to use your string class. -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
What frequent use cases do you see where people will want to change bits rather than work with characters?
utf8_string str; str.reserve(some_big_number);
ifstream file("utf8.txt"); istreambuf_iterator<char8_t> begin(file), end;
std::copy( begin, end, back_inserter(str) );
This can't throw if you want people to use your string class.
And what if the example is changed to be:

    ifstream file("local8bit.txt");
    istreambuf_iterator<char8_t> begin(file), end;
    std::copy( begin, end, back_inserter(str) );

? This use case is common, so you have to provide something like "local8bit_back_inserter", and then to be explicit it's better to provide "utf8_back_inserter". - Volodya

On Fri, 22 Oct 2004 18:09:12 +0300, Peter Dimov <pdimov@mmltd.net> wrote:
Eric Niebler wrote:
utf8_string str; str.reserve(some_big_number);
ifstream file("utf8.txt"); istreambuf_iterator<char8_t> begin(file), end;
std::copy( begin, end, back_inserter(str) );
This can't throw if you want people to use your string class.
utf8_string str( begin, end );
Seems sensible, though I would like to reintroduce the as_utf8_tag class I introduced earlier. I think the utf8_string we're talking about is either a sequence of char32_t's, or a sequence of unicode::character's (with a different underlying encoding). Whatever it is, I think you'll want to distinguish between a UTF-8 code unit iterator, a UTF-16 code unit iterator and a UTF-32 one, and make it:

    utf8_string str (begin, end, as_utf8_tag());

Rogier

From: Rogier van Dalen <rogiervd@gmail.com>
<eric@boost-consulting.com> wrote:
-- this would be an unfortunate design, and you'll hear about it from your users. If you force people to do their bit twiddling in vector<char*_t>, then you impose an extra allocation and a copy to get it into a unicode::string, and most people won't bother.
What frequent use cases do you see where people will want to change bits rather than work with characters?
It is easy to build a safe interface atop a fast one, but you can't do the reverse. Perhaps the normal Unicode string, the safe one, can be implemented using the lower level one. Then, as someone mentioned, you can use the likes of c_str() to get a copy of the low level string. (You'll want to be able to convert from and assign from the low level string, too.) -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

Rob Stewart wrote:
It is easy to build a safe interface atop a fast one, but you can't do the reverse.
Yes. The question is, why should the fast interface not consist of a set of algorithms operating on sequences of char*_t? The argument against was that creating a string incurs an allocation penalty, and the users will not do it. But my experience does not suggest that this would be the case. I frequently use vector<char> for C APIs that need a char[] and std::string for C++ APIs, and haven't found the necessary conversions a performance burden.
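(The pattern described, sketched; some_c_api is a hypothetical C function that fills a caller-supplied buffer:

    std::vector<char> buf(256);
    some_c_api(&buf[0], buf.size()); // the C API writes into a char[]
    std::string s(&buf[0]);          // one conversion at the boundary

The single allocation and copy at the boundary is exactly the cost under discussion.)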

"Erik Wien" <wien@start.no> writes:
"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-
I am not sure I buy this. I think that if you want to have unchecked Unicode data, you should use a vector<char*_t>. Unicode strings have well-defined invariants with respect to canonicalization and well-formedness, and I think that a Unicode string abstraction should enforce those invariants.
Having intermediate states that are invalid and a final state that is valid is not a feature, it's a bug. It's a silent failure that I want to know about.
Amen. ;)
How is this different from the situation with filesystem::path, where eager checking has turned out to be painful for a broad spectrum of users? -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

David Abrahams wrote:
"Erik Wien" <wien@start.no> writes:
"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-
I am not sure I buy this. I think that if you want to have unchecked Unicode data, you should use a vector<char*_t>. Unicode strings have well-defined invariants with respect to canonicalization and well-formedness, and I think that a Unicode string abstraction should enforce those invariants.
Having intermediate states that are invalid and a final state that is valid is not a feature, it's a bug. It's a silent failure that I want to know about.
Amen. ;)
How is this different from the situation with filesystem::path, where eager checking has turned out to be painful for a broad spectrum of users?
"We have a valid Unicode string" is a pretty sensible invariant.

"Peter Dimov" <pdimov@mmltd.net> writes:
How is this different from the situation with filesystem::path, where eager checking has turned out to be painful for a broad spectrum of users?
"We have a valid Unicode string" is a pretty sensible invariant.
How is that more sensible than "we have a valid and portable path?" -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
How is this different from the situation with filesystem::path, where eager checking has turned out to be painful for a broad spectrum of users?
"We have a valid Unicode string" is a pretty sensible invariant.
How is that more sensible than "we have a valid and portable path?"
An invalid Unicode string isn't meaningful anywhere whereas an unportable path can be valid and meaningful on some platforms.

In article <uis92u4hg.fsf@boost-consulting.com>, David Abrahams <dave@boost-consulting.com> wrote:
"Erik Wien" <wien@start.no> writes:
"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-
I am not sure I buy this. I think that if you want to have unchecked Unicode data, you should use a vector<char*_t>. Unicode strings have well-defined invariants with respect to canonicalization and well-formedness, and I think that a Unicode string abstraction should enforce those invariants.
Having intermediate states that are invalid and a final state that is valid is not a feature, it's a bug. It's a silent failure that I want to know about.
Amen. ;)
How is this different from the situation with filesystem::path, where eager checking has turned out to be painful for a broad spectrum of users?
I am not familiar with the problem in boost::fs so I can't comment on that specifically, but I can comment in a more general sense. I think we can safely agree that:

1. There is an invariant that is valuable in some problem domains
2. There may be some problem domains in which it is valuable to sidestep that invariant

boost::fs, as far as I understand it, ran into the problem that it was impossible to sidestep the invariant.

My position is not that we should prohibit people from manipulating Unicode strings in a manner that does not maintain well-formedness. My position is that we should permit use of Unicode strings that guarantees well-formedness. I haven't taken the time to figure out how to do that (if I had the time, I would not be discussing how Erik could do this, I'd be writing a proposal myself :-) ). I come from a problem domain in which an abstraction which guarantees well-formedness has more value, but I am not looking for an answer that satisfies me and nobody else.

meeroh

Miro Jurisic <macdev@meeroh.org> writes:
How is this different from the situation with filesystem::path, where eager checking has turned out to be painful for a broad spectrum of users?
I am not familiar with the problem in boost::fs so I can't comment on that specifically, but I can comment in a more general sense. I think we can safely agree that:
1. There is an invariant that is valuable in some problem domains
2. There may be some problem domains in which it is valuable to sidestep that invariant
I see. That is different. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

At 01:10 PM 10/22/2004, Miro Jurisic wrote:
boost::fs, as far as I understand it, ran into the problem that it was impossible to sidestep the invariant.
No, rather the error check was on by default. Some people want it off as the default. As far as Unicode strings are concerned, the question is a little different. Is it well-defined behavior to create a string that does not meet the Unicode invariants? If so, can ordinary operations break invariants, or is such dangerous activity restricted to "experts only" functions? --Beman

No, rather the error check was on by default. Some people want it off as the default.
As far as Unicode strings are concerned, the question is a little different. Is it well defined behavior to create a string that does not meet the Unicode invariants? If so, can ordinary operations break invariants, or is such dangerous activity restricted to "experts only" functions?
For what it's worth: the Unicode standard *requires* conforming implementations to neither accept nor generate ill-formed Unicode sequences. Ref: Unicode Chapter 3, C12 and C12a. John.

On Sat, 23 Oct 2004 10:56:55 +0100, John Maddock <john@johnmaddock.co.uk> wrote:
No, rather the error check was on by default. Some people want it off as the default.
As far as Unicode strings are concerned, the question is a little different. Is it well defined behavior to create a string that does not meet the Unicode invariants? If so, can ordinary operations break invariants, or is such dangerous activity restricted to "experts only" functions?
For what it's worth: the Unicode standard *requires* conforming implementations to neither accept nor generate ill-formed Unicode sequences.
Ref: Unicode Chapter 3, C12 and C12a.
It does explicitly allow processing ill-formed sequences, or, at least, "talk[ing] about" concatenating two ill-formed code unit sequences to form a valid one, though. (Ch. 3, D30e.) Rogier

On Fri, 22 Oct 2004 14:49:46 -0400, Beman Dawes <bdawes@acm.org> wrote:
At 01:10 PM 10/22/2004, Miro Jurisic wrote:
boost::fs, as far as I understand it, ran into the problem that it was impossible to sidestep the invariant.
No, rather the error check was on by default. Some people want it off as the default.
As far as Unicode strings are concerned, the question is a little different. Is it well defined behavior to create a string that does not meet the Unicode invariants? If so, can ordinary operations break invariants, or is such dangerous activity restricted to "experts only" functions?
I guess you could say that all ordinary operations may take place on three different levels. Appending a code unit to the sequence of code units may make it uninterpretable as a codepoint sequence. Appending a codepoint may make it uninterpretable as a sequence of characters. The problem I think is not the operations, but rather the level they operate on. I have not yet found examples where a non-const code unit or codepoint sequence is needed, except for input. I think initialising a string from a code unit sequence (say, a UTF-8 encoded file) via two iterators, as shown by Peter Dimov, would be just right. (You can always make your own UTF-8 sequence and put it into a Unicode string, of course, but this will probably mean copying the data.) (For output, non-mutating access to the code units may be provided, for example to UTF-16 code units if you want to interface with Win32 API functions.) Rogier

Beman Dawes wrote:
At 01:10 PM 10/22/2004, Miro Jurisic wrote:
boost::fs, as far as I understand it, ran into the problem that it was impossible to sidestep the invariant.
No, rather the error check was on by default. Some people want it off as the default.
I interpret it a little differently: the "error" check provided no value to users; in fact, it "provided" a negative value, which is why most prefer "off by default". This in itself does not prove that all possible portability checks do not provide value to users, just that this particular check is a net loss. By saying that "some people want it off as default" you are discounting the results of the experiment, attributing them to personal preference. ;-)

At 08:45 AM 10/25/2004, Peter Dimov wrote:
Beman Dawes wrote:
At 01:10 PM 10/22/2004, Miro Jurisic wrote:
boost::fs, as far as I understand it, ran into the problem that it was impossible to sidestep the invariant.
No, rather the error check was on by default. Some people want it off as the default.
I interpret it a little differently: the "error" check provided no value to users;

That's true for some users, but not all users.

in fact, it "provided" a negative value, which is why most prefer "off by default". This in itself does not prove that all possible portability checks do not provide value to users, just that this particular check is a net loss.

Only for the users who don't want to perform the check. For those that do want the check, it has a positive value.

By saying that "some people want it off as default" you are discounting the results of the experiment, attributing them to personal preference. ;-)
You could attribute the preference to the degree of portability required by the user's applications, but I think it is more than that. Some people would rather be safe by default, while others would rather error checks only be applied if explicitly invoked. That seems like a personal preference to me. For Unicode strings, it is possible to provide a const_iterator interface which both guarantees the "valid Unicode" invariant and allows construction from character streams. That's the important point, if I understand your postings correctly. And maintenance of the invariant seems more than just a personal preference; it is a stronger design. So it is a stronger argument than the filesystem case. --Beman

Beman Dawes wrote:
At 08:45 AM 10/25/2004, Peter Dimov wrote:
Beman Dawes wrote:
At 01:10 PM 10/22/2004, Miro Jurisic wrote:
boost::fs, as far as I understand it, ran into the problem that it was impossible to sidestep the invariant.
No, rather the error check was on by default. Some people want it off as the default.
I interpret it a little differently: the "error" check provided no value to users;
That's true for some users, but not all users.
Certainly. Think of my statement as integrated over all possible users. Each user has his own unique needs, but it's still possible to form meaningful sentences that describe the general case.
in fact, it "provided" a negative value, which is why most prefer "off by default".
Only for the users who don't want to perform the check. For those that do want the check, it has a positive value.
I'm afraid I haven't been able to express my point well. My point was that if you focus on The Check, you'll only conclude that most users don't want The Check by default. This doesn't mean that most users don't want A Check by default; it just may not be The Check. There is a spectrum of possibilities between Strictest Check and No Check which may contain the optimal default that provides just the right amount of portability checking.
You could attribute the preference to the degree of portability required by the user's applications, but I think it is more than that. Some people would rather be safe by default, while others would rather error checks only be applied if explicitly invoked. That seems like a personal preference to me.
I think that the main problem is that the current checks do not always signal errors. Errors are never a matter of personal preference, they are always part of a specification or dictated by external requirements. Tolerating checks that typically produce a significant amount of false positives can be a matter of personal preference, of course. But it also depends on how such checks are implemented. I tolerate a relatively high degree of false positives in compiler warnings, but I'd rather not see a similar rate in compiler errors. I certainly wouldn't mind a filesystem _warning_ for nonportable paths, if we find a suitable delivery mechanism.

At 02:10 PM 10/25/2004, Peter Dimov wrote:
... I certainly wouldn't mind a filesystem _warning_ for nonportable paths, if we find a suitable delivery mechanism.
Interesting! A fresh perspective.

Let's see what can be done within the current design. An adaptor function could intercept what otherwise would be errors and turn them into warnings:

    bool warn_nonportable_name( std::string const & name )
    {
        if ( !fs::portable_name( name ) )
            boost::issue_warning(
                "Warning, path contains non-portable name: " + name );
        return true;
    }

Because the adaptor always returns true, an error exception is never thrown. This function could become the new default, or could be explicitly coded like any name_check function.

What should issue_warning() do? How about something similar to throw_exception():

    namespace boost
    {
    #ifdef BOOST_USER_WARNINGS
        void issue_warning( std::string const & msg ); // user defined
    #else
        void issue_warning( std::string const & msg )
        {
            std::clog << msg << '\n';
        }
    #endif
    }

Does that capture what you had in mind?

--Beman

Beman Dawes wrote:
At 02:10 PM 10/25/2004, Peter Dimov wrote:
... I certainly wouldn't mind a filesystem _warning_ for nonportable paths, if we find a suitable delivery mechanism.
Interesting! A fresh perspective.
Let's see what can be done within the current design. An adaptor function could intercept what otherwise would be errors and turn them into warnings:

    bool warn_nonportable_name( std::string const & name )
    {
        if ( !fs::portable_name( name ) )
            boost::issue_warning(
                "Warning, path contains non-portable name: " + name );
        return true;
    }

Because the adaptor always returns true, an error exception is never thrown. This function could become the new default, or could be explicitly coded like any name_check function.
[...]
Does that capture what you had in mind?
Almost. The current validation scheme (with the above addition) is (IIUC):

1. Parse path according to grammar, throw on parse error;
2. Check every path element with checker(e), throw on false;

Warning checker calls issue_warning on nonportable paths.

What I had in mind was more like:

1. Parse path according to grammar, throw on parse error;
2. Call checker(e) for every path element; if false, call fs_nonportable_path_element

Where:

    void (*fs_nonportable_path_element)( std::string const & path,
        std::string const & element ) = __fsnp_default;

(As an aside, the exceptions really need their own types, because these aren't filesystem errors, i.e. no underlying filesystem error code is associated with them.)

The general idea of this approach is that if the user wants an exception on nonportable path elements, the name checker can simply throw. If the user wants the current behavior, he can install an fs_nonportable_path_element that throws. Otherwise, he can install an appropriate handler that logs or ignores the nonportable path.

It might even be possible to get rid of the default name checker "replaceable global constant", if the original default is now good enough and nobody needs to change it.
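(A compilable sketch of the replaceable-handler idea above; __fsnp_default is spelled out here as default_handler, and all names are illustrative rather than part of boost::filesystem:

    #include <iostream>
    #include <stdexcept>
    #include <string>

    void default_handler( std::string const & path,
                          std::string const & element )
    {
        std::clog << "nonportable element '" << element
                  << "' in path '" << path << "'\n";
    }

    // The replaceable global hook.
    void (*fs_nonportable_path_element)( std::string const &,
                                         std::string const & )
        = default_handler;

    // A user who wants the current behaviour installs a handler
    // that throws:
    void throwing_handler( std::string const & path,
                           std::string const & )
    {
        throw std::runtime_error( "nonportable path: " + path );
    }

    // fs_nonportable_path_element = throwing_handler;
)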

On Thu, 21 Oct 2004 12:08:18 -0700, Eric Niebler <eric@boost-consulting.com> wrote:
I disagree. The user should be allowed to twiddle as many bits as she pleases, even permitted to create an invalid UTF string. However, operations that interpret the string as a whole (comparison, canonicalization, etc.) should detect invalid strings and throw. The reason is that people will need to manipulate strings at the bit level; intermediate states may be invalid, but the final state may be valid. We shouldn't do too much nannying during these intermediate states.
Manipulation of strings at the bit level is always possible -- use std::string to fiddle with your UTF-8 string all you like. IMO unicode::string (or whatever you wish to call it) should always contain a valid Unicode string. Otherwise, every time operator++() is called on an iterator many checks would have to be done, and exceptions may be thrown. The iterators would take more memory because they would have to know their begin and end positions, too. Furthermore, I'm not convinced the average C++ programmer should be expected to know the Unicode standard well enough to make twiddling bits the primary mode of Unicode string manipulation. Regards, Rogier

On Thu, 21 Oct 2004 20:31:24 +0200, Erik Wien <wien@start.no> wrote:
The best solution would be to never append single code units, but instead code points. The += operator would determine how many code units are required for the given code point.
I fully agree with you on that; I was considering what should happen if the user appended something invalid (e.g., an isolated surrogate). Sorry for any confusion caused. I made a second mistake in mixing up the two levels in an unclear way.

I very much like Peter's suggestion of using free functions converting invalid values to valid ones. Using that, I suggest:

unicode::codepoint_string should throw when an invalid codepoint is appended to it (e.g., an isolated surrogate). unicode::correct_codepoint() should convert an invalid codepoint into U+FFFD, and could be used to "safely" insert codepoints:

    char32_t correct_codepoint (char32_t);

unicode::string should take a unicode::character for appending. A unicode::character object may be constructed with a single codepoint, which will be its base character. If this codepoint is invalid, it should throw. If the codepoint is a combining mark, it should also throw. unicode::correct() should convert an invalid codepoint into U+FFFD, and if given a combining mark, it should use U+0020 SPACE as a base character:

    character correct (char32_t);

Regards,
Rogier
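(A sketch of what correct_codepoint() could do under the rules above; the surrogate range and the U+10FFFF limit come from the Unicode standard:

    char32_t correct_codepoint (char32_t cp)
    {
        // Isolated surrogates (U+D800..U+DFFF) and values beyond
        // U+10FFFF are not valid Unicode scalar values; replace
        // them with U+FFFD REPLACEMENT CHARACTER.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return 0xFFFD;
        return cp;
    }
)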

From: Rogier van Dalen <rogiervd@gmail.com>
unicode::string should take a unicode::character for appending. A unicode::character object may be constructed with a single codepoint, which will be its base character. If this codepoint is invalid, it should throw. If the codepoint is a combining mark, it should also throw. unicode::correct() should convert an invalid codepoint into U+FFFD, and if given a combining mark, it should use U+0020 SPACE as a base character.
Why not have unicode::character's ctor invoke unicode::correct()? -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;

On Fri, 22 Oct 2004 12:46:00 -0400 (EDT), Rob Stewart <stewart@sig.com> wrote:
From: Rogier van Dalen <rogiervd@gmail.com>
unicode::string should take a unicode::character for appending. A unicode::character object may be constructed with a single codepoint, which will be its base character. If this codepoint is invalid, it should throw. If the codepoint is a combining mark, it should also throw. unicode::correct() should convert an invalid codepoint into U+FFFD, and if given a combining mark, it should use U+0020 SPACE as a base character.
Why not have unicode::character's ctor invoke unicode::correct()?
unicode::correct() replaces every encoding error in the input by a replacement character. This loses information and it is not recoverable. The combining character bit is only slightly better. When I proposed a policy I called it workaround_encoding_error; maybe we need a better name than "correct". I agree with Peter Dimov, however, that the default should be to throw rather than to throw away information and pretend nothing happened. Regards, Rogier

"Erik Wien" <wien@start.no> writes:
"Rogier van Dalen" <rogiervd@gmail.com> wrote in message
I hadn't yet looked at it this way, but you are right from a theoretical point of view at least. To get more to practical matters, what do you think this should do:
unicode::string s = ...; s += 0xDC01; // An isolated surrogate, which is nonsense
? Should it throw, or convert the isolated surrogate to U+FFFD REPLACEMENT CHARACTER (Unicode standard 4 Section 2.7), or something else? And what should the member function with the opposite behaviour be called?
The best solution would be to never append single code units, but instead code points. The += operator would determine how many code units are required for the given code point.
Is this going to be illegal for most fs, then?

    std::copy( std::istream_iterator<char>(f),
               std::istream_iterator<char>(),
               std::back_inserter(my_utf8_string));

I think it pretty much has to work. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

David Abrahams wrote:
Is this going to be illegal for most fs, then?
std::copy( std::istream_iterator<char>(f), std::istream_iterator<char>(), std::back_inserter(my_utf8_string));
I think it pretty much has to work.
    my_utf8_string.append( std::istream_iterator<char>(f),
                           std::istream_iterator<char>() );

is the proper spelling. Or, if you prefer,

    my_utf8_string.insert( my_utf8_string.end(),
                           std::istream_iterator<char>(f),
                           std::istream_iterator<char>() );

Character by character push_back loops just aren't cool.

"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
Is this going to be illegal for most fs, then?

    std::copy( std::istream_iterator<char>(f),
               std::istream_iterator<char>(),
               std::back_inserter(my_utf8_string));

I think it pretty much has to work.
my_utf8_string.append( std::istream_iterator<char>(f), std::istream_iterator<char>() );
is the proper spelling. Or, if you prefer,
my_utf8_string.insert( my_utf8_string.end(), std::istream_iterator<char>(f), std::istream_iterator<char>() );
Character by character push_back loops just aren't cool.
Okay, but should we make this well-known and understood idiom one that compiles and fails at runtime? -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
Is this going to be illegal for most fs, then?

    std::copy( std::istream_iterator<char>(f),
               std::istream_iterator<char>(),
               std::back_inserter(my_utf8_string));

I think it pretty much has to work.
my_utf8_string.append( std::istream_iterator<char>(f), std::istream_iterator<char>() );
is the proper spelling. Or, if you prefer,
my_utf8_string.insert( my_utf8_string.end(), std::istream_iterator<char>(f), std::istream_iterator<char>() );
Character by character push_back loops just aren't cool.
Okay, but should we make this well-known and understood idiom one that compiles and fails at runtime?
The well-known idiom is evil, but I see your point. A checking utf8_string probably shouldn't have push_back( char ). However, how did a utf8_string enter the discussion? Wasn't our unicode string supposed to hold UTF-16 in a clever way that permitted random access, and expose char32_t as its value_type?

"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
"Peter Dimov" <pdimov@mmltd.net> writes:
David Abrahams wrote:
Is this going to be illegal for most fs, then?

    std::copy( std::istream_iterator<char>(f),
               std::istream_iterator<char>(),
               std::back_inserter(my_utf8_string));

I think it pretty much has to work.
my_utf8_string.append( std::istream_iterator<char>(f), std::istream_iterator<char>() );
is the proper spelling. Or, if you prefer,
my_utf8_string.insert( my_utf8_string.end(), std::istream_iterator<char>(f), std::istream_iterator<char>() );
Character by character push_back loops just aren't cool.
Okay, but should we make this well-known and understood idiom one that compiles and fails at runtime?
The well-known idiom is evil, but I see your point. A checking utf8_string probably shouldn't have push_back( char ).
However, how did a utf8_string enter the discussion? Wasn't our unicode string supposed to hold UTF-16 in a clever way that permitted random access, and expose char32_t as its value_type?
I don't know. Been somewhat distracted with the committee meeting so I probably just haven't been paying enough attention to this thread. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

In article <uoeiuu4n6.fsf@boost-consulting.com>, David Abrahams <dave@boost-consulting.com> wrote:
Is this going to be illegal for most fs, then?
std::copy( std::istream_iterator<char>(f), std::istream_iterator<char>(), std::back_inserter(my_utf8_string));
I think it pretty much has to work.
I think this is a red herring. A UTF-8 string is not a sequence of chars, nor is every char convertible to a UTF-8 char, so why should this work any more than

    vector<void*> my_vector;

    std::copy( std::istream_iterator<char>(f),
               std::istream_iterator<char>(),
               std::back_inserter(my_vector));

? OTOH, a UTF-8 string is a sequence of Unicode chars, so what should work is:

    std::copy( std::istream_iterator<unicode_char>(f),
               std::istream_iterator<unicode_char>(),
               std::back_inserter(my_utf8_string));

with the semantics that every time you hit the iterator, an entire unicode character is read off the stream and appended to the string.

meeroh
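(The semantics Miro describes require an extraction operator that consumes one whole UTF-8 sequence per call; istream_iterator then does the rest. A simplified sketch with invented names, and with no error handling for malformed lead bytes:

    #include <cstddef>
    #include <istream>

    struct unicode_char { unsigned long value; };

    std::istream & operator>>(std::istream & in, unicode_char & c)
    {
        char lead;
        if (!in.get(lead)) return in;
        unsigned char u = static_cast<unsigned char>(lead);

        // Sequence length and payload bits follow from the lead byte.
        std::size_t extra = u < 0x80 ? 0 : u < 0xE0 ? 1
                          : u < 0xF0 ? 2 : 3;
        c.value = u & (u < 0x80 ? 0x7Fu : u < 0xE0 ? 0x1Fu
                     : u < 0xF0 ? 0x0Fu : 0x07u);

        for (std::size_t i = 0; i < extra; ++i)
        {
            char cont;
            if (!in.get(cont)) return in; // truncated sequence
            c.value = (c.value << 6)
                    | (static_cast<unsigned char>(cont) & 0x3F);
        }
        return in;
    }
)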

Miro Jurisic <macdev@meeroh.org> writes:
In article <uoeiuu4n6.fsf@boost-consulting.com>, David Abrahams <dave@boost-consulting.com> wrote:
Is this going to be illegal for most fs, then?
std::copy( std::istream_iterator<char>(f), std::istream_iterator<char>(), std::back_inserter(my_utf8_string));
I think it pretty much has to work.
I think this is a red herring. A UTF-8 string is not a sequence of chars, nor is every char convertible to a UTF-8 char, so why should this work any more than
vector<void*> my_vector;
std::copy( std::istream_iterator<char>(f), std::istream_iterator<char>(), std::back_inserter(my_vector));
?
OTOH, A UTF-8 string is a sequence of Unicode chars, so what should work is:
std::copy( std::istream_iterator<unicode_char>(f), std::istream_iterator<unicode_char>(), std::back_inserter(my_utf8_string));
with the semantics that every time you hit the iterator, an entire unicode character is read off the stream and appended to the string.
Okay. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Eric Niebler wrote:
A single string class shall be used to store Unicode strings, i.e. logical sequences of Unicode abstract characters.
This string shall be stored in one chosen encoding, for example UTF-8. The user does not have direct access to the underlying storage, however, so it might be regarded as an implementation detail.
An invariant of the string is that it is always in one chosen normalized form. Iteration over the string gives back a sequence of char32_t abstract characters. Comparisons are defined in terms of these sequences.
Is this a fair summary?
Such a one-size-fits-all unicode_string is guaranteed to be inefficient for some applications. If it is always stored in a decomposed form, an XML library probably wouldn't want to use it, because it requires a composed form. And making the encoding an implementation detail makes it inefficient to use in situations where binary compatibility matters (serialization, for example).
This seems right, but there's a catch. Configurable encoding would help if all components of your application use the same encoding. Say an XML parser wants the composed form, so you use unicode_string<utf16, composed>. Now another part of your application (a library written by somebody else) uses a different encoding, and you have to convert the data at the interface. If there's only one encoding, you only need to do conversion for code which really, really needs another encoding. If there are several encodings, then different libraries will use different encodings based on educated guesses about data, and you'll be converting everywhere.
Also, it is impossible to store an abstract unicode character in char32_t because there may be N zero-width combining characters associated with it.
Perhaps having a one-size-fits-all unicode_string might be a nice default, as long as users who care about encoding and canonical form have other types (template + policies?) with knobs they can twiddle.
Maybe. I just wish there was some efficient mechanism to prevent users who did not read the entire Unicode standard 10 times (and so don't know what they're doing) from touching the knobs ;-) - Volodya

Peter Dimov wrote:
If a library accepts unicode string, then its interface can either:
- use 'unicode_string'
- use 'unicode_string<some_encoding>'
- use 'vector<char16_t>' and have a comment that the string is UTF8.
I think the first option is best, and the last is too easy to misuse.
Yes.
So let's see if I understand your position correctly.
A single string class shall be used to store Unicode strings, i.e. logical sequences of Unicode abstract characters.
This string shall be stored in one chosen encoding, for example UTF-8. The user does not have direct access to the underlying storage, however, so it might be regarded as an implementation detail.
An invariant of the string is that it is always in one chosen normalized form. Iteration over the string gives back a sequence of char32_t abstract characters. Comparisons are defined in terms of these sequences.
Is this a fair summary?
Yes, with these additions:
- user can obtain the raw data in any format he likes (local8bit, utf8, utf16)
- user can construct the string from any format he likes (from the same list)
- ideally, there should be an "encoder" add-on, which can handle specific named encodings ("koi8-r"...)

- Volodya

In article <001301c4b635$bda16750$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
There is no dispute that the rep of the string needs to be a container. (Though I do not agree that it's obvious that it should be a vector.) However, the basic_string interface grafted on top of a container of Unicode code units will produce bogus Unicode strings. This is why I strongly believe that basic_string is not a suitable container for Unicode strings. A separate container which does not provide convenient and completely incorrect member functions (such as find and assign) should be used.

Consider this; pretend that:

- c and d are characters
- C and D are the same characters, but with an umlaut
- C and D do not have precomposed code units in Unicode

    basic_string<char16_t> s("Cc");
    // pretend assign and find use iterator ranges, for simplicity
    s.assign(s.find("c"), "d");

This will result in "Dc", which is completely wrong IMNSHO, and there should not be a simple interface that allows you to shoot yourself in the foot so thoroughly. It is not strings-as-containers that I am opposed to, but the deceptive simplicity of basic_string member functions.

meeroh

Miro Jurisic wrote:
In article <001301c4b635$bda16750$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
There is no dispute that the rep of the string needs to be a container. (Though I do not agree that it's obvious that it should be a vector.) However, the basic_string interface grafted on top of a container of Unicode code units will produce bogus Unicode strings. This is why I strongly believe that basic_string is not a suitable container for Unicode strings.
We agree on that.

Erik Wien wrote:
Ultimately I feel that the operation of normalization (which involves canonical decomposition) of unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.)
It appears that there are two schools of thought when it comes to string design. One approach treats a string purely as a sequential container of values. The other tries to represent "string values" as a coherent whole. It doesn't help that in the simple case where the value_type is char the two approaches result in mostly identical semantics.
My opinion is that the std::char_traits<> experiment failed and conclusively demonstrated that the "string as a value" approach is a dead end, and that practical string libraries must treat a string as a sequential container, vector<char>, vector<char16_t> and vector<char32_t> in our case.
The interpretation of that sequence of integers as a concrete string value representation needs to be done by algorithms.
In other words, I believe that string::operator== should always perform the per-element comparison std::equal( lhs.begin(), lhs.end(), rhs.begin() ) that is specified in the Container requirements table.
If I want to test whether two sequences of char16_t's, interpreted as UTF16 Unicode strings, would represent the same string in a printed form, I should be given a dedicated function that does just that - or an equivalent. Similarly, if I want to normalize a sequence of chars that are actually UTF8, I'd call the appropriate 'normalize' function/algorithm.
Right, and there are several different Normalised forms, so we have to be able to choose the algorithm that does the right thing for what we want here. Can I make one other plea here: *please* let's not get too stuck on string class representations; we can have iterator sequences as well (these may well be part of a string, or they may be part of a memory-mapped file, or some other smart iterator - like the Unicode encoding transformation iterators I've just been writing), and operations / algorithms on iterators are more important to me than YASC (Yet Another String Class) :-) John.
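(The shape John argues for, sketched with invented names: normalisation as an algorithm over an iterator range, so the same code serves strings, memory-mapped files, and transforming iterators:

    // Decompose the codepoints in [first, last) into Normalisation
    // Form D, writing the result through 'out'. A real
    // implementation would consult the Unicode Character Database.
    template <class InputIterator, class OutputIterator>
    OutputIterator normalize_nfd(InputIterator first, InputIterator last,
                                 OutputIterator out);

    // Usable on any codepoint source:
    //     normalize_nfd(s.begin(), s.end(), std::back_inserter(result));
)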

Erik Wien wrote:
Ultimately I feel that the operation of normalization (which involves canonical decomposition) of unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.) I think that solution would be satisfactory for most users as the normalization process is somewhat intricate and really not something users should be forced to understand.
Are we at all on the same page now?
No. "Normalization" doesn't always mean canonical decomposition. There are several canonical forms, some of which *require* the use of composite characters. In fact, the XML standard requires such a canonical form. A Unicode library cannot hide the issue of canonicalization from the user, because users will care which canonical form is being used. -- Eric Niebler Boost Consulting www.boost-consulting.com

Eric Niebler wrote:
Erik Wien wrote:
Ultimately I feel that the operation of normalization (which involves canonical decomposition) of unicode strings should be hidden from the user completely and be performed automatically by the library where that is needed. (Like on a call to the == operator.) I think that solution would be satisfactory for most users as the normalization process is somewhat intricate and really not something users should be forced to understand.
Are we at all on the same page now?
No. "Normalization" doesn't always mean canonical decomposition. There are several canonical forms, some of which *require* the use of composite characters. In fact, the XML standard requires such a canonical form. A Unicode library cannot hide the issue of canonicalization from the user, because users will care which canonical form is being used.
Why? If I want to compare two strings, I don't really care which normalized form is used. - Volodya

Vladimir Prus wrote:
Eric Niebler wrote:
No. "Normalization" doesn't always mean canonical decomposition. There are several canonical forms, some of which *require* the use of composite characters. In fact, the XML standard requires such a canonical form. A Unicode library cannot hide the issue of canonicalization from the user, because users will care which canonical form is being used.
Why? If I want to compare two strings, I don't really care which normalized form is used.
But if you need a particular normalized form for other purposes (to store it into a database, perhaps), you have no way to obtain it from operator==.

Peter Dimov wrote:
Vladimir Prus wrote:
Eric Niebler wrote:
No. "Normalization" doesn't always mean canonical decomposition. There are several canonical forms, some of which *require* the use of composite characters. In fact, the XML standard requires such a canonical form. A Unicode library cannot hide the issue of canonicalization from the user, because users will care which canonical form is being used.
Why? If I want to compare two strings, I don't really care which normalized form is used.
But if you need a particular normalized form for other purposes (to store it into a database, perhaps), you have no way to obtain it from operator==.
Yes. But it's possible to have a standalone "normalization" function, and still use a default normalized representation for the string class. - Volodya

Vladimir Prus wrote:
Peter Dimov wrote:
Vladimir Prus wrote:
Eric Niebler wrote:
No. "Normalization" doesn't always mean canonical decomposition. There are several canonical forms, some of which *require* the use of composite characters. In fact, the XML standard requires such a canonical form. A Unicode library cannot hide the issue of canonicalization from the user, because users will care which canonical form is being used.
Why? If I want to compare two strings, I don't really care which normalized form is used.
But if you need a particular normalized form for other purposes (to store it into a database, perhaps), you have no way to obtain it from operator==.
Yes. But it's possible to have a standalone "normalization" function, and still use a default normalized representation for the string class.
Thereby assuming that all users need to pay for normalization (twice) on every comparison? Or maybe you are arguing that the string should always be kept in a particular normalized form?

Peter Dimov wrote:
Yes. But it's possible to have a standalone "normalization" function, and still use a default normalized representation for the string class.
Thereby assuming that all users need to pay for normalization (twice) on every comparison?
No way, of course.
Or maybe you are arguing that the string should always be kept in a particular normalized form?
That's what I meant. - Volodya

On Wed, 20 Oct 2004 15:51:21 +0300, Peter Dimov <pdimov@mmltd.net> wrote:
Vladimir Prus wrote:
Peter Dimov wrote:
But if you need a particular normalized form for other purposes (to store it into a database, perhaps), you have no way to obtain it from operator==.
Yes. But it's possible to have a standalone "normalization" function, and still use a default normalized representation for the string class.
Thereby assuming that all users need to pay for normalization (twice) on every comparison?
Or maybe you are arguing that the string should always be kept in a particular normalized form?
That seems to be the only way of keeping comparison, search, etcetera, implementable in terms of char_traits<> functions --- and so, the only way of getting performance similar to std::basic_string<>'s. Note that normalisation of any kind requires access to the Unicode Character Database, which may take some time, especially if the relevant parts happen not to be in the processor cache. Comparing any Unicode data in different or unknown normalisation forms will therefore by definition be slow. Regards, Rogier

"Rogier van Dalen" <rogiervd@gmail.com> wrote in message news:e094f9eb0410200629617a4e01@mail.gmail.com...
On Wed, 20 Oct 2004 15:51:21 +0300, Peter Dimov <pdimov@mmltd.net> wrote:
Vladimir Prus wrote:
Peter Dimov wrote:
But if you need a particular normalized form for other purposes (to store it into a database, perhaps), you have no way to obtain it from operator==.
Yes. But it's possible to have a standalone "normalization" function, and still use a default normalized representation for the string class.
Thereby assuming that all users need to pay for normalization (twice) on every comparison?
Or maybe you are arguing that the string should always be kept in a particular normalized form?
That seems to be the only way of keeping comparison, search, etcetera, implementable in terms of char_traits<> functions --- and so, the only way of getting performance similar to std::basic_string<>'s.
Note that normalisation of any kind requires access to the Unicode Character Database, which may take some time, especially if the relevant parts happen not to be in the processor cache.
Comparing any Unicode data in different or unknown normalisation forms will therefore by definition be slow.
True... So what we basically need to determine is what is most critical: fast comparison of strings (strings always represented in a given NF), or fast general string handling (NF determined when needed)?

From: "Erik Wien" <wien@start.no>
"Rogier van Dalen" <rogiervd@gmail.com> wrote in message news:e094f9eb0410200629617a4e01@mail.gmail.com...
On Wed, 20 Oct 2004 15:51:21 +0300, Peter Dimov <pdimov@mmltd.net> wrote:
Or maybe you are arguing that the string should always be kept in a particular normalized form?
That seems to be the only way of keeping comparison, search, etcetera, implementable in terms of char_traits<> functions --- and so, the only way of getting performance similar to std::basic_string<>'s.
Note that normalisation of any kind requires access to the Unicode Character Database, which may take some time, especially if the relevant parts happen not to be in the processor cache.
Comparing any Unicode data in different or unknown normalisation forms will therefore by definition be slow.
True... So what we basically need to determine is what is most critical: fast comparison of strings (strings always represented in a given NF), or fast general string handling (NF determined when needed)?
What if the class had the option, at least, to hold multiple forms, creating each on demand? Then, the operations you invoke would simply request the particular form they require. If that form is not currently available, it is generated. That approach means you need a dirty flag set by mutating operations to know when to invalidate the secondary forms. I can envision thrashing as operations requiring a secondary form trigger mutations which invalidate the secondary form only to be needed immediately thereafter. It might also be possible to mutate all currently available generated forms, but then the complexity guarantees are affected. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;
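(Rob's scheme, sketched with invented names; note the mutable members, which foreshadow the thread-safety point raised in the next message:

    #include <string>

    class multi_form_string
    {
        std::string nfc_;           // primary form, always maintained
        mutable std::string nfd_;   // secondary form, built on demand
        mutable bool nfd_valid_;

        static std::string decompose(const std::string & s)
        {
            // Placeholder: a real implementation consults the
            // Unicode Character Database.
            return s;
        }

    public:
        explicit multi_form_string(const std::string & nfc)
            : nfc_(nfc), nfd_valid_(false) {}

        // Logically const, physically not: builds the cache lazily.
        const std::string & nfd() const
        {
            if (!nfd_valid_) { nfd_ = decompose(nfc_); nfd_valid_ = true; }
            return nfd_;
        }

        void append(const std::string & s)
        {
            nfc_ += s;          // (re-normalisation elided)
            nfd_valid_ = false; // the dirty flag: invalidate the cache
        }
    };
)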

Rob Stewart wrote:
What if the class had the option, at least, to hold multiple forms, creating each on demand? Then, the operations you invoke would simply request the particular form they require. If that form is not currently available, it is generated.
The classic problem is that logically const operations are now physically non-const (as I was surprised to learn from Howard Hinnant, this is the case with std::locale.) The implicit thread safety contract ("basic thread safety") says that unless stated otherwise, several logically const operations can be performed concurrently. This would now require a mutex lock.

Peter Dimov wrote:
Rob Stewart wrote:
What if the class had the option, at least, to hold multiple forms, creating each on demand? Then, the operations you invoke would simply request the particular form they require. If that form is not currently available, it is generated.
DCCI is what you need here.
The classic problem is that logically const operations are now physically non-const (as I was surprised to learn from Howard Hinnant, this is the case with std::locale.)
The implicit thread safety contract ("basic thread safety") says that unless stated otherwise, several logically const operations can be performed concurrently. This would now require a mutex lock.
Or lockless atomic<>. ;-) http://groups.google.com/groups?selm=415BD983.E2DA2114%40web.de regards, alexander.
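One shape the lockless variant could take, sketched with std::atomic from today's C++; to_nfc() is again an assumed normalization function, not a real API:

    #include <atomic>
    #include <string>
    #include <utility>

    std::u32string to_nfc(const std::u32string& s); // assumed to exist

    class lazily_normalized
    {
    public:
        explicit lazily_normalized(std::u32string text) : text_(std::move(text)) {}
        ~lazily_normalized() { delete nfc_.load(std::memory_order_relaxed); }

        const std::u32string& nfc() const
        {
            // Fast path: another thread may already have published the form.
            if (const std::u32string* p = nfc_.load(std::memory_order_acquire))
                return *p;
            std::u32string* fresh = new std::u32string(to_nfc(text_));
            std::u32string* expected = nullptr;
            // Publish our copy, or discard it if another thread won the race.
            if (!nfc_.compare_exchange_strong(expected, fresh,
                                              std::memory_order_release,
                                              std::memory_order_acquire))
            {
                delete fresh;
                return *expected;
            }
            return *fresh;
        }

    private:
        std::u32string text_;
        mutable std::atomic<std::u32string*> nfc_{nullptr};
    };

Racing threads may both compute the form, but only one copy is published; the loser deletes its own, so no mutex is needed.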

On Wed, 20 Oct 2004 20:18:27 +0200, Erik Wien <wien@start.no> wrote:
"Rogier van Dalen" <rogiervd@gmail.com> wrote in message news:e094f9eb0410200629617a4e01@mail.gmail.com...
Comparing any Unicode data in different or unknown normalisation forms will therefore by definition be slow.
True.. So what we basically need to determine, is what is most critical? Fast comparing of strings (Strings always represented in a given NF), or fast genereal string handling (NF determined when needed)
I'm not quite sure what you mean. Do you propose to check whether a string is valid when reading it? And do you propose to make sure it is in some normalization form? Or will you leave it in any form it is in? What use cases do you envision where only "string handling" that does not need normalization is used? I have not been able to think of any. Regards, Rogier

"Eric Niebler" <eric@boost-consulting.com> wrote in message
No. "Normalization" doesn't always mean canonical decomposition. There are several canonical forms, some of which *require* the use of composite characters. In fact, the XML standard requires such a canonical form. A Unicode library cannot hide the issue of canonicalization from the user, because users will care which canonical form is being used.
When I say "hidden", I do not necessarily mean "inaccessible". I can think of many ways to provide a policy or something similar that would determine which normalization form would be used on a call to the == operator. My point was that we should not require the common user to know what a normalization form is, and to aid that, provide a default normalization policy that maps the closest to "common sense". ('ö' should "equal" 'o' followed by a combining diaeresis, for example.)

Erik Wien wrote:
The basic idea I have been working around is to make an encoded_string class templated on unicode encoding types (i.e. UTF-8, UTF-16). This is made possible through an encoding_traits class which contains all necessary implementation details for working on strings of code units.
The outline of the encoding traits class looks something like this:
template<typename encoding>
struct encoding_traits
{
    // Type definitions for code_units etc.
    // Is the encoding fixed width? (allows a good deal of iterator optimizations)
    // Algorithms for iterating forwards and backwards over code units.
    // Function for converting a series of code units to a unicode code point.
    // Any other operations that are encoding specific.
};
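To make the outline concrete, a hypothetical specialization for UTF-16 might look roughly like this, written in today's C++; the member names are illustrative guesses, not part of the actual design:

    struct utf16_encoding {};

    template<>
    struct encoding_traits<utf16_encoding>
    {
        typedef char16_t code_unit;
        static const bool is_fixed_width = false; // surrogate pairs exist

        // Advance past the code units of one code point.
        template<typename Iter>
        static void next(Iter& it)
        {
            char16_t u = *it++;
            if (u >= 0xD800 && u <= 0xDBFF) ++it; // high surrogate: skip its low half
        }

        // Read the code point starting at 'it' (validity checks omitted).
        template<typename Iter>
        static char32_t decode(Iter it)
        {
            char32_t u = *it;
            if (u >= 0xD800 && u <= 0xDBFF)
                return 0x10000 + ((u - 0xD800) << 10) + (char32_t(*++it) - 0xDC00);
            return u;
        }
    };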
Why do you need the traits at compile time? - Why would the user want to change the encoding? Especially between UTF-16 and UTF-32? - Why would the user want to specify encoding at compile time? Are there performance benefits to that? Basically, if we agree that UTF-32 is not needed, then UTF-16 is the only encoding which does not require complex handling. Maybe, for other encodings, using virtual functions in the character iterator is OK? And if iterators have "abstract characters" as value_type, maybe the overhead of that is much larger than a virtual function call, even for UTF-16. (As a side note, a discussion about templated vs. non-templated interface seems a reasonable addition to a thesis. It's a sure thing that if anybody wrote such a thesis in our lab, he would be asked to justify such global decisions.) - What if the user wants to specify encoding at run time? For example, XML files specify encoding explicitly. I'd want to use ascii/UTF-8 encoding if the XML document is 8-bit, and UTF-16 when it's Unicode. - Volodya

"Vladimir Prus" <ghost@cs.msu.su> wrote in message news:cl55cd$9ei$1@sea.gmane.org...
Why do you need the traits at compile time?
Perhaps I didn't state this clearly enough. The traits class is one of the template parameters of the encoded_string class. (Defaulting to encoding_traits<encoding>) The traits class contains all information about the encoding being specified, like code unit size, and functions for iterating through a code unit sequence. All encoding specific implementation is done in the traits class.
- Why would the user want to change the encoding? Especially between UTF-16 and UTF-32?
Well... Different people have different needs. If you are mostly using ASCII characters, and require small size, UTF-8 would fit your bill. If you need the best general performance on most operations, use UTF-16. If you need fast iteration over code points and size doesn't matter, use UTF-32.
- Why would the user want to specify encoding at compile time? Are there performance benefits to that? Basically, if we agree that UTF-32 is not needed, then UTF-16 is the only encoding which does not require complex handling. Maybe, for other encodings, using virtual functions in the character iterator is OK? And if iterators have "abstract characters" as value_type, maybe the overhead of that is much larger than a virtual function call, even for UTF-16.
Though I haven't confirmed this by testing, I would assume templating the encoding and thus specifying it at compile time would result in better performance, since you don't have the overhead of virtual function calls. (Polymorphism would probably be needed if templates were scrapped.) Avoiding virtual calls also enables the compiler to optimize (inline) more thoroughly, something that is very beneficial in this case because of the number of different small, specialized functions that are needed in string manipulation.
(As a side note, a discussion about templated vs. non-templated interface seems a reasonable addition to a thesis. It's a sure thing that if anybody wrote such a thesis in our lab, he would be asked to justify such global decisions).
Thanks for the tip! I would probably include a discussion on why templates are used if they end up in a final implementation.
- What if the user wants to specify encoding at run time? For example, XML files specify encoding explicitly. I'd want to use ascii/UTF-8 encoding if XML document is 8-bit, and UTF-16 when it's Unicode.
That is one problem with the templating of encoding. You would have to either template all file scanning functions in the XML parser on encoding as well, or you would need to do some run-time checks and use the correct template depending on the encoding used in the file. This is of course not ideal, but only arises where encoding is something that is specified at run time. Which scenario is most common is something that needs to be determined before a final design is decided on.
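The run-time check mentioned here could look roughly like the sketch below; detect_encoding, parse_impl and raw_bytes are hypothetical names, and std::string/std::u16string stand in for the proposed encoded_string:

    #include <cstdint>
    #include <string>
    #include <vector>

    enum class xml_encoding { utf8, utf16 };
    typedef std::vector<std::uint8_t> raw_bytes;

    // Trivial BOM sniffing; a real parser would also honour the XML declaration.
    xml_encoding detect_encoding(const raw_bytes& b)
    {
        if (b.size() >= 2 && ((b[0] == 0xFF && b[1] == 0xFE) ||
                              (b[0] == 0xFE && b[1] == 0xFF)))
            return xml_encoding::utf16;
        return xml_encoding::utf8; // XML defaults to UTF-8 without a BOM
    }

    template<typename String>
    void parse_impl(const String& doc) { /* templated scanning code */ }

    void parse(const raw_bytes& input)
    {
        // One run-time branch selects the right template instantiation.
        switch (detect_encoding(input)) {
        case xml_encoding::utf8:
            parse_impl(std::string(input.begin(), input.end()));
            break;
        case xml_encoding::utf16:
            parse_impl(std::u16string(/* byte-to-char16_t conversion elided */));
            break;
        }
    }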

Erik Wien wrote:
- Why would the user want to change the encoding? Especially between UTF-16 and UTF-32?
Well... Different people have different needs. If you are mostly using ASCII characters, and require small size, UTF-8 would fit your bill. If you need the best general performance on most operations, use UTF-16. If you need fast iteration over code points and size doesn't matter, use UTF-32.
Ok, since everybody agreed characters outside 16 bits are very rare, UTF-32 seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need for choice seems present. However, UTF-16 string class would be better than no string class at all, and extra genericity will cost you development time.
- Why would the user want to specify encoding at compile time? Are there performance benefits to that? Basically, if we agree that UTF-32 is not needed, then UTF-16 is the only encoding which does not require complex handling. Maybe, for other encodings, using virtual functions in the character iterator is OK? And if iterators have "abstract characters" as value_type, maybe the overhead of that is much larger than a virtual function call, even for UTF-16.
Though I haven't confirmed this by testing, I would assume templating the encoding and thus specifying it at compile time would result in better performance, since you don't have the overhead of virtual function calls. (Polymorphism would probably be needed if templates were scrapped.)
It would. The question is by how much.
Avoiding virtual calls also enables the compiler to optimize (inline) more thoroughly, something that is very beneficial in this case because of the number of different small, specialized functions that are needed in string manipulation.
This is a bit abstract. A virtual function is an inlining barrier, but it would be placed only on character access. On both sides of the barrier, the compiler can freely optimize everything.
- What if the user wants to specify encoding at run time? For example, XML files specify encoding explicitly. I'd want to use ascii/UTF-8 encoding if XML document is 8-bit, and UTF-16 when it's Unicode.
That is one problem with the templating of encoding. You would have to either template all file scanning functions in the XML parser on encoding as well, or you would need to do some run-time checks and use the correct template depending on the encoding used in the file. This is of course not ideal, but only arises where encoding is something that is specified at run time. Which scenario is most common is something that needs to be determined before a final design is decided on.
Another possibility is that you can decide whether UTF-8 or UTF-16 should be used dynamically -- just by counting the number of non-ascii characters. That would mean that only really advanced users need to make the decision themselves. I think I'm starting to like Peter's idea that advanced users need vector<char_xxx> together with a set of algorithms. - Volodya
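A rough sketch of that dynamic decision; the 50% threshold is made up for illustration, not derived from any measurement:

    #include <cstddef>
    #include <string>

    // Pick a storage encoding by counting non-ascii characters in the input.
    bool prefer_utf16(const std::u32string& code_points)
    {
        std::size_t non_ascii = 0;
        for (char32_t c : code_points)
            if (c > 0x7F) ++non_ascii;
        // ASCII is one byte in UTF-8; once most characters need multi-byte
        // sequences, UTF-16 tends to be the more compact choice.
        return non_ascii * 2 > code_points.size();
    }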

- Why would the user want to change the encoding? Especially between UTF-16 and UTF-32?
Well... Different people have different needs. If you are mostly using ASCII characters, and require small size, UTF-8 would fit your bill. If you need the best general performance on most operations, use UTF-16. If you need fast iteration over code points and size doesn't matter, use UTF-32.
Ok, since everybody agreed characters outside 16 bits are very rare, UTF-32 seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need for choice seems present. However, UTF-16 string class would be better than no string class at all, and extra genericity will cost you development time.
<rant> umm... so you're saying that no one will ever need more than 640K RAM? Just because YOU don't need more than 16 bits doesn't mean that I don't need more than 16 bits. </rant> The main question of a Unicode library should _always_ be: can the library represent every character that can be drawn? Things like iterators, algorithms, etc are nice-to-haves -> the representation of the written language is the first priority, everything else is secondary. Also, the Unicode standard will evolve over time to include more characters from many more character sets that you or I may never use but someone else might; who knows, maybe the ASCII character set will get a 27th character one day... A library shouldn't preclude the use of these new characters, just because we thought "no one will ever need more than 16 bits"... So, how about we don't make the same mistakes as we made in the past... Whatever decision finally gets chosen will come down to one of two choices: a) variable length string format, eg: UTF-8, or something similar b) fixed width format with so many bits that humans are unlikely to use all the address space at any time in the next 50/100 years, eg UTF-32, or similar FWIW: my personal preference would be to go for a variable width encoding -> so that we never have to solve this problem again... although this makes concepts like text-reflow quite a bit harder to implement. regards, Mathew Robertson

Mathew Robertson wrote:
- Why would the user want to change the encoding? Especially between UTF-16 and UTF-32?
Well... Different people have different needs. If you are mostly using ASCII characters, and require small size, UTF-8 would fit your bill. If you need the best general performance on most operations, use UTF-16. If you need fast iteration over code points and size doesn't matter, use UTF-32.
Ok, since everybody agreed characters outside 16 bits are very rare, UTF-32 seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need for choice seems present. However, UTF-16 string class would be better than no string class at all, and extra genericity will cost you development time.
<rant> umm... so you're saying that no one will ever need more than 640K RAM? Just because YOU don't need more than 16 bits doesn't mean that I don't need more than 16 bits. </rant>
The main question of a Unicode library should _always_ be: can the library represent every character that can be drawn? Things like iterators, algorithms, etc are nice-to-haves -> the representation of the written language is the first priority, everything else is secondary.
Also, the Unicode standard will evolve over time to include more characters from many more character sets that you or I may never use but someone else might; who knows, maybe the ASCII character set will get a 27th character one day... A library shouldn't preclude the use of these new characters, just because we thought "no one will ever need more than 16 bits"... So, how about we don't make the same mistakes as we made in the past...
Do you realize that "nobody needs UTF-32" is not the same as "nobody needs characters which can't be represented in 16 bits"? UTF-16 can represent all Unicode characters.
Whatever decision finally gets chosen will come down to one of two choices: a) variable length string format, eg: UTF-8, or something similar b) fixed width format with so many bits that humans are unlikely to use all the address space at any time in the next 50/100 years, eg UTF-32, or similar
FWIW: my personal preference would be to go for a variable width encoding -> so that we never have to solve this problem again... although this makes concepts like text-reflow quite a bit harder to implement.
What's "text-reflow", BTW? - Volodya

- Why would the user want to change the encoding? Especially between UTF-16 and UTF-32?
Well... Different people have different needs. If you are mostly using ASCII characters, and require small size, UTF-8 would fit your bill. If you need the best general performance on most operations, use UTF-16. If you need fast iteration over code points and size doesn't matter, use UTF-32.
Ok, since everybody agreed characters outside 16 bits are very rare, UTF-32 seems to never be needed. As for UTF-8 vs. UTF-16: yes, the need for choice seems present. However, UTF-16 string class would be better than no string class at all, and extra genericity will cost you development time.
<rant> umm... so you're saying that no one will ever need more than 640K RAM? Just because YOU don't need more than 16 bits doesn't mean that I don't need more than 16 bits. </rant>
The main question of a Unicode library should _always_ be: can the library represent every character that can be drawn? Things like iterators, algorithms, etc are nice-to-haves -> the representation of the written language is the first priority, everything else is secondary.
Also, the Unicode standard will evolve over time to include more characters from many more character sets that you or I may never use but someone else might; who knows, maybe the ASCII character set will get a 27th character one day... A library shouldn't preclude the use of these new characters, just because we thought "no one will ever need more than 16 bits"... So, how about we don't make the same mistakes as we made in the past...
Do you realize that "nobody needs UTF-32" is not the same as "nobody needs characters which can't be represented in 16 bits"? UTF-16 can represent all Unicode characters.
yes I do realise... the original statement was "...everybody agreed characters outside 16 bits are very rare, UTF-32 seems to never be needed." UTF-16 can indeed represent every Unicode character, but that is not what was written. Also, "nobody needs characters which can't be represented in 16 bits" in the context of UTF-16 is the same as "nobody needs more than 8 bits" if the context is UTF-8. The same could be said for 4 bits and 2 bits, given an appropriate encoding scheme... One point that hasn't been mentioned so far is that word sizes on most modern CPUs are 32 bits wide. From a performance POV, the word alignment may be a suitable justification for offsetting the increased storage requirements of a 32-bit unit.
Whatever decision finally gets chosen will come down to one of two choices: a) variable length string format, eg: UTF-8, or something similar b) fixed width format with so many bits that humans are unlikely to use all the address space at any time in the next 50/100 years, eg UTF-32, or similar
FWIW: my personal preference would be to go for a variable width encoding -> so that we never have to solve this problem again... although this makes concepts like text-reflow quite a bit harder to implement.
What's "text-reflow", BTW?
text-reflow is the term used to describe what happens when a slab of text needs to be formatted to use a specified width. For example, a wordprocessor (particularly one that uses variable width character font metrics) will need to reflow the paragraph so as to fit within the specified width. Say you resize the wordprocessor window; the formatting engine would need to reflow the text according to the new window size. Mathew

yes I do realise... the original statement was "...everybody agreed characters outside 16 bits are very rare, UTF-32 seems to never be needed." UTF-16 can indeed represent every Unicode character, but that is not what was written. It must have been obvious to the poster of the original statement.
One point that hasn't been mentioned so far is that word sizes on most modern CPUs are 32 bits wide. From a performance POV, the word alignment may be a suitable justification for offsetting the increased storage requirements of a 32-bit unit. Of course, from a performance POV you surely don't want to waste twice (or four times) as much memory in the cache either. Performance is always a tradeoff.
Cheers, Michael

Any comments you might have on this approach are most welcome.
Interesting: funnily enough, I've just started experimenting with Unicode support for Boost.Regex (based initially on top of ICU, but it could equally sit on top of Boost.Unicode or whatever). The first thing I had to do was write a bunch of iterators for interconverting between encoding forms (I needed Bidirectional Iterators, which code conversion facets don't/can't provide). So I guess we're all on a similar page here; can your encoding converters provide efficient iterator-based interconversion? John.

"John Maddock" <john@johnmaddock.co.uk> wrote in message
Interesting: funnily enough, I've just started experimenting with Unicode support for Boost.Regex (based initially on top of ICU, but it could equally sit on top of Boost.Unicode or whatever). The first thing I had to do was write a bunch of iterators for interconverting between encoding forms (I needed Bidirectional Iterators, which code conversion facets don't/can't provide). So I guess we're all on a similar page here; can your encoding converters provide efficient iterator-based interconversion?
Well... "efficient" is probably not the word I would use ;) (yet, that is). The way it is implemented right now, the value_type of an encoded_string iterator of any encoding is 32 bits (a unicode code point). So when iterating over any encoding, the external interface always looks like a vector of code points. Consequently you can use iterators from one string (UTF-8) to initialize another string (UTF-16) and the conversion between the two encodings would happen automatically. I'm guessing this is something similar to what you have. I also have a rather hackish implementation that can provide non-const (assignable) code point iterators on any encoding. This involves a lot of trickery with iterators changing the size of the container they are iterating over, and proxy classes as the reference_type in the iterator (something that is not allowed (yet) in standard C++, but is in boost). As you can imagine, this implementation is everything but efficient. Kinda neat though! ;)
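For illustration, a stripped-down version of such a code point iterator over UTF-8 might look like this (const, forward-only, no validation; the class name and interface are guesses, not the actual implementation being discussed):

    #include <cstddef>
    #include <iterator>

    class utf8_code_point_iterator
    {
    public:
        typedef std::forward_iterator_tag iterator_category;
        typedef char32_t                  value_type;   // always a code point
        typedef std::ptrdiff_t            difference_type;
        typedef const char32_t*           pointer;
        typedef char32_t                  reference;

        explicit utf8_code_point_iterator(const char* p) : p_(p) {}

        // Decode the code point starting at the current position.
        char32_t operator*() const
        {
            unsigned char first = static_cast<unsigned char>(*p_);
            if (first < 0x80) return first;               // 1-byte sequence
            int extra = (first < 0xE0) ? 1 : (first < 0xF0) ? 2 : 3;
            char32_t cp = first & (0x3F >> extra);        // payload of the lead byte
            for (int i = 1; i <= extra; ++i)              // continuation bytes
                cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
            return cp;
        }

        // Step over one code point's worth of code units.
        utf8_code_point_iterator& operator++()
        {
            unsigned char first = static_cast<unsigned char>(*p_);
            p_ += (first < 0x80) ? 1 : (first < 0xE0) ? 2 : (first < 0xF0) ? 3 : 4;
            return *this;
        }

        bool operator==(const utf8_code_point_iterator& o) const { return p_ == o.p_; }
        bool operator!=(const utf8_code_point_iterator& o) const { return p_ != o.p_; }

    private:
        const char* p_;
    };

Because the value_type is a code point, a string in another encoding can be initialized straight from such an iterator pair, which is the automatic cross-encoding conversion described above.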

At 10:22 PM 10/18/2004, Erik Wien wrote:
... I really feel the C++ language needs some form of standardized unicode support, and developing such a library within the boost community would be a very good way to ensure it fits everybody's needs the best possible way.
Are you aware of the Unicode Technical Report (TR) being prepared by the C committee? See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf While there is no guarantee, it is a definite possibility that the C++ committee will use this C TR as part of Unicode support in C++. Because this TR is so limited in scope (it just has a few conversion functions), that probably won't cause problems for a C++ Unicode library, but you should still be aware of the data types introduced. --Beman

Are you aware of the Unicode Technical Report (TR) being prepared by the C committee? See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf
While there is no guarantee, it is a definite possibility that the C++ committee will use this C TR as part of Unicode support in C++. Because this TR is so limited in scope (it just has a few conversion functions), that probably won't cause problems for a C++ Unicode library, but you should still be aware of the data types introduced.
No, I was not aware of that. Thanks for the link. There don't seem to be any problems associated with a possible unicode library and that C library being included in the C++ standard. It basically does a lot of the low-level unicode stuff that I am currently doing by hand, so it would probably be more of a blessing than a curse if it got included in standard C++.

At 07:24 PM 10/19/2004, Erik Wien wrote:
Are you aware of the Unicode Technical Report (TR) being prepared by the C committee? See http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1040.pdf
While there is no guarantee, it is a definite possibility that the C++ committee will use this C TR as part of Unicode support in C++. Because this TR is so limited in scope (it just has a few conversion functions), that probably won't cause problems for a C++ Unicode library, but you should still be aware of the data types introduced.
No, I was not aware of that. Thanks for the link.
There don't seem to be any problems associated with a possible unicode library and that C library being included in the C++ standard. It basically does a lot of the low-level unicode stuff that I am currently doing by hand, so it would probably be more of a blessing than a curse if it got included in standard C++.
Since there is probably more than a 90% chance it will eventually get into standard C++, I think we should consider implementing it and then use it for the low-level functionality needed by higher level Boost Unicode libraries. --Beman

"Erik Wien" <wien@start.no> wrote in message news:cl1tqh$qp$1@sea.gmane.org...
If you have any, and I do mean ANY, thoughts on this, please do not hesitate to reply to this mail and let me know. I'm looking forward to your responses.
LOL - well, be careful what you wish for. I think it would be illuminating to see a short summary of the classes and functions such a library is expected to provide. Robert Ramey

In article <cl6pur$ceh$1@sea.gmane.org>, "Robert Ramey" <ramey@rrsd.com> wrote:
"Erik Wien" <wien@start.no> wrote in message news:cl1tqh$qp$1@sea.gmane.org...
If you have any, and I do mean ANY, thoughts on this, please do not hesitate to reply to this mail and let me know. I'm looking forward to your responses.
LOL - well, be careful what you wish for.
I think it would be illuminating to see a short summary of the classes and functions such a library is expected to provide.
I agree; I think that the feedback given so far covers most of the important perspectives, and I think we should not fool ourselves into thinking every one of them can be accommodated immediately. A library that addresses some of them, designed to allow further work to address additional applications, would be a huge step from where we are today, so I suggest that we (Erik :-) ) pick a set of objectives manageable for his thesis timeline, and submit a preliminary proposal. meeroh

"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-
I think it would be illuminating to see a short summary of the classes and functions such a library is expected to provide.
I agree; I think that the feedback given so far covers most of the important perspectives, and I think we should not fool ourselves into thinking every one of them can be accommodated immediately. A library that addresses some of them, designed to allow further work to address additional applications, would be a huge step from where we are today, so I suggest that we (Erik :-) ) pick a set of objectives manageable for his thesis timeline, and submit a preliminary proposal.
It's coming. :) I will have to write a description of what I want to achieve with this for my college to look at by this weekend, and I will try to take everything said here into consideration when writing that. I will try to post a translation of the description here (or preferably in a new thread) once it's handed in. As for preliminary proposals and the like, you will probably not see too much of that kind before the project (hopefully) gets accepted and I am able to work full time on it. Right now I am buried in other work that needs attention, so I'm not able to dedicate myself completely to this just yet. I will try to use any spare moment though!

"Erik Wien" <wien@start.no> wrote in message news:cl1tqh$qp$1@sea.gmane.org...
Hi. I am in the process of planning a library for handling unicode strings in C++, and would like to probe the interest in the boost community for something like that. I read through the unicode dicussion that was up back in april, and from what I could gather there was some amount of interest, but no one felt comfortable taking on the task as of yet.
[snip]
I really feel the C++ language needs some form of standardized unicode support, and developing such a library within the boost community would be a very good way to ensure it fits everybody's needs the best possible way.
If you have any, and I do mean ANY, thoughts on this, please do not hesitate to reply to this mail and let me know. I'm looking forward to your responses.
FWIW, here are my thoughts. There is no equivalence between std::basic_string (aka std::string, std::wstring) and a sequence of characters conforming to an encoding (aka encoded-string). However, an encoded-string can (potentially) be converted to a string, but not the other way round, because the std::string does not provide adequate information. For an encoding scheme to work, the encoding must be provided, and must be available at run time. The best way to do this for various encodings is to use packets, with headers providing the information regarding the contents, eg type of encoding, number of characters, checksum etc. These packets themselves could be manipulated in std::strings (including sequences of packets), which could then be used to perform operations where the encoding is not important. This should give the best combination of performance, both in speed and size. regards Andy Little
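A rough sketch of the packet idea; the field names and widths below are guesses for illustration, not Andy's specification:

    #include <cstdint>
    #include <vector>

    // The header travels with the payload, so the encoding is known at run time.
    struct text_packet_header
    {
        std::uint8_t  encoding;   // e.g. 0 = UTF-8, 1 = UTF-16, ...
        std::uint32_t characters; // number of characters in the payload
        std::uint32_t checksum;   // integrity check over the payload
    };

    struct text_packet
    {
        text_packet_header header;
        std::vector<std::uint8_t> payload; // the encoded code units
    };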

"Andy Little" <andy@servocomm.freeserve.co.uk> wrote in message news:clbvg3$f4o$1@sea.gmane.org...
"Erik Wien" <wien@start.no> wrote in message news:cl1tqh$qp$1@sea.gmane.org...
Hi. I am in the process of planning a library for handling unicode strings in C++, and would like to probe the interest in the boost community for something like that. I read through the unicode discussion that was up back in april, and from what I could gather there was some amount of interest, but no one felt comfortable taking on the task as of yet.
I forgot to say ... Whatever scheme you come up with, this is an essential C++ library and needs to be done by somebody, sometime soon. :-) ....... As a potential user... The other problem that immediately springs to my mind is, how do I compose a unicode string in my C++ source code? I really don't want to be dealing with some unicode_character<TheEncode>(0x78,99,'c',-1); style. The unicode charts I have seen are mind boggling, when taken in the raw. However, I always want to know the (common) encoding(s) for a particular character. For that I would, I guess, need a naming convention for common characters, eg the functionality of http://www.unicode.org/charts/charindex.html eg: I would prefer to do (similar to HTML character entities):

    // whatever... target unicode string
    encoded_string str;
    // set of common characters in required encoding
    typedef unicode_common_character_set<TheEncode> uni_chars;
    // eg "um" where u == 'micro' symbol etc.
    uni_chars my_str[] = {uni_chars::quot, uni_chars::micro, uni_chars::m, uni_chars::quot};
    str = my_str;

As a user that's probably the bit I am most interested in, rather than the implementation details. regards Andy Little

At 07:08 AM 10/23/2004, Andy Little wrote:
The other problem that immediately springs to my mind is, how do I compose a unicode string in my C++ source code? I really don't want to be dealing with some unicode_character<TheEncode>(0x78,99,'c',-1); style.
The C TR provides additional string literals and character constants. See below. --Beman

5.1 String literals and character constants notations

The notations for string literals and character constants for char16_t are defined analogous to the wide character string literals and wide character constants: u"s-char-sequence" denotes a char16_t type string literal and initializes an array of char16_t. The corresponding character constant is denoted by u'c-char-sequence' and has the type char16_t. Likewise, the string literal and character constant for char32_t are U"s-char-sequence" and U'c-char-sequence'.
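For reference, here is how the TR's notations look in practice, shown as they were later adopted into C++11:

    const char16_t greeting16[] = u"gr\u00FC\u00DFe"; // char16_t string literal
    const char32_t greeting32[] = U"gr\u00FC\u00DFe"; // char32_t string literal
    char16_t micro16 = u'\u00B5';                     // char16_t character constant
    char32_t micro32 = U'\u00B5';                     // char32_t character constant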

"Beman Dawes" <bdawes@acm.org> wrote in message news:6.0.3.0.2.20041023085844.028a7018@mailhost.esva.net...
At 07:08 AM 10/23/2004, Andy Little wrote:
The other problem that immediately springs to my mind is, how do I compose a unicode string in my C++ source code? I really don't want to be dealing with some unicode_character<TheEncode>(0x78,99,'c',-1); style.
The C TR provides additional string literals and character constants. See below.
--Beman
5.1 String literals and character constants notations
[snip detail] OK, I understand that long term the C/C++ language should support Unicode, and I am all in favour. The other question is whether there is a need for a C++ library which provides definitive encodings (not restricted to Unicode) that can be implemented in the current C++ language. I could certainly make use of it today. I would assume that such a library would be extremely useful in deciding how to implement language support in C++. I would guess that to be generic the characters could use named indirection in some cases, similar to the html scheme...

    typedef typename default_charset<GUI>::type charset_type;
    encoded_string<charset_type> str = "length ";
    str += charset_type::micro; // much nicer ...
    str += "m";
    // etc

regards Andy Little
participants (22)
- Aaron W. LaFramboise
- Alexander Terekhov
- Andy Little
- Beman Dawes
- Ben Hutchings
- Darren Cook
- David Abrahams
- Doug Gregor
- Edward Diener
- Eric Niebler
- Erik Wien
- John Maddock
- Mathew Robertson
- Michael Walter
- Miro Jurisic
- Patrick Bennett
- Peter Dimov
- Rob Stewart
- Robert Ramey
- Rogier van Dalen
- Teemu Torma
- Vladimir Prus