
Hi. I hope you guys still remember this, despite my lack of activity on this list, but I was here about developing a Unicode library for Boost a while back, and well... we're off! :) We are now at a point in the development where we would really appreciate feedback from you Boosters. We have developed a *very* early prototype based on some of the ideas put forward in the earlier discussion here, and we would like you to comment on it. The design is by no means locked, and only represents one possible way to implement a Unicode string class. If you have an alternate solution, please let us know. We are using an evolutionary development model (prototyping), so we are open to design changes should there be a need for them.

You can download the source code (VC++/DevCpp projects only for now, sorry) from our site here: http://hovedprosjekter.hig.no/v2005/data/Gr5_unicode (the files section). On the same site we have also set up a forum where you can discuss different aspects of the project if you want to keep it off list. No registration is required, so it should be hassle (password) free.

Current design:

The current design is based around the concept of «encoding_traits». These are templated on the different encodings used in Unicode (UTF-8, 16 and 32, both endians), and provide functions and typedefs for working on code units (8, 16 and 32 bit integers respectively) in any encoding. These traits are then used for implementing different interfaces that externally use 32-bit code points, thereby abstracting away the underlying encoding.

The string class itself is created with encoding transparency in mind, also at the class level. This means that the encoding used in the string is not a template parameter of the string class itself (which would make each instantiation of the string its own type), but rather a parameter of an implementation class that is used internally to hold the string. Something like this (highly simplified):

    class impl_base
    {
        // A lot of pure virtual functions for manipulating a string.
    };

    template<typename encoding>
    class impl : public impl_base
    {
        // Implement the functions... obviously.
    };

    class encoded_string
    {
        impl_base* m_impl;

        template<typename encoding>
        void set_encoding(encoding enc_tag)
        {
            m_impl = new impl<encoding>();
        }
    };

The reason for doing this is that it allows functions that take encoded_string parameters to be blissfully unaware of what encoding they are working on, without having to templatize (is that a word?) the function itself. (Something I understood was a bit of a worry for some in the last discussion.) An alternate way of doing this (something we also tested when developing the current version) is to simply template the string class itself on encoding, but then you lose the above advantage of being able to have non-template functions working on multiple encodings. You do however gain speed (I would assume), since you wouldn't have the overhead of virtual function calls, as well as a less complex implementation.

There's also an implementation of the Unicode Character Database in the prototype, along with an implementation of the normalization algorithms, but I won't go into the details of them here (to keep this from becoming a novel). They should be easy enough to understand if you want to look.

Anyway... comments are as always welcome. Either here, or in the forum at the site.

Regards
- Erik

To Eric Niebler: Did you receive the mail I sent you a while back about the whole contact-person debacle? (on the Boost Consulting address) Never got a reply, so I'm not sure if it went through.
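To make the traits idea concrete, here is a minimal sketch of what one encoding_traits specialization might look like. All names and signatures here are illustrative assumptions, not the prototype's actual interface:

    #include <cstddef>
    #include <boost/cstdint.hpp>

    typedef boost::uint32_t code_point;

    struct utf16_tag {};

    template<typename encoding> struct encoding_traits;

    template<> struct encoding_traits<utf16_tag>
    {
        typedef boost::uint16_t code_unit;

        // How many code units this code point occupies in UTF-16.
        static std::size_t encoded_length(code_point cp)
        {
            return cp < 0x10000 ? 1 : 2;
        }

        // Decode one code point starting at 'first', advancing past it.
        // Assumes well-formed input; error handling omitted.
        template<typename Iterator>
        static code_point decode(Iterator& first)
        {
            code_unit lead = *first++;
            if (lead < 0xD800 || lead > 0xDBFF)
                return lead;                     // not a surrogate
            code_unit trail = *first++;          // trailing surrogate
            return 0x10000
                + ((code_point(lead) - 0xD800) << 10)
                + (code_point(trail) - 0xDC00);
        }
    };

An interface built on such traits can hand out 32-bit code points regardless of whether the underlying storage is 8, 16 or 32 bits wide.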

I tried to compile this on CW9 and I ran into two major problems that will probably cause you woes with other compilers as well.

1. Given the declaration:

    template<>
    class bar<foo>
    {
        template<typename baz> void method();
    };

the correct method definition is not

    template<typename baz>
    void bar<foo>::method() { }

but

    template<>
    template<typename baz>
    void bar<foo>::method() { }

Your compiler apparently accepts the former, but I believe it's wrong.

2. Given the declaration:

    template<typename foo>
    class Base
    {
    public:
        typedef int bar;
    };

and a derived class

    template<typename foo>
    class Derived : public Base<foo> {

you can't use unqualified "bar" inside the declaration of Derived -- you have to qualify it with "typename Base<foo>::" (or you can use a using declaration such as "using Base<foo>::bar" inside the declaration of Derived). I don't know if CW is right on this or not -- I remember having it explained to me before, but I paged out the explanation.

Anyway, these two problems generate a bunch of compiler errors, but right now I am running into the problem that my boost is older than what your code requires, so I can't submit patches yet.

meeroh

Thanks for your reply. Miro Jurisic wrote:
I tried to compile this on CW9 and I ran into two major problems that will probably cause you woes with other compilers as well.
1. Given the declaration:

    template<>
    class bar<foo>
    {
        template<typename baz> void method();
    };

the correct method definition is not

    template<typename baz>
    void bar<foo>::method() { }

but

    template<>
    template<typename baz>
    void bar<foo>::method() { }
Your compiler apparently accepts the former, but I believe it's wrong.
I see. I can't remember exactly where we do that off the top of my head, but I guess we do somewhere. :) The code works on both VC++ (7.1 and 8.0) and GCC 3.3, so that's probably why we haven't noticed. I'll look into that tomorrow, and fix it where applicable.
2. Given the declaration:
    template<typename foo>
    class Base
    {
    public:
        typedef int bar;
    };
and a derived class
    template<typename foo>
    class Derived : public Base<foo> {
you can't use unqualified "bar" inside declaration of Derived -- you have to qualify it with "typename Base<foo>::" (or you can use a using declaration such as "using Base<foo>::bar" inside the declaration of Derived). I don't know if CW is right on this or not -- I remember having it explained to me before, but I paged out the explanation.
I'm pretty sure it is an error. Now that you mention it, I have seen the explanation for that somewhere too. (It's the dependent base class lookup rule: unqualified names aren't looked up in a base class that depends on a template parameter.) I'm not sure where we actually do that though, but I'll look into this tomorrow as well.
Anyway, these two problems generate a bunch of compiler errors, but right now I am running into the problem that my boost is older than what your code requires so I can't submit patches yet.
Guess I should have mentioned this. Sorry. The code is using Boost 1.32 - release version.
meeroh
- Erik

For the record, with pretty minor changes to bring Erik's code up to compliance with CW9's rightfully picky compiler, his test app compiles, runs, and passes the test on Mac OS X. I sent him patches out of band. meeroh

Hmm... I have now applied the patches you provided, but unfortunately they break the library on both VC++ and GCC. It seems they don't like the revised encoding_traits function definitions. (They can't be matched to their declarations, according to both of them.) The other fixes work, though. Is adding the template<> to the definitions the only way to make it compile on CW, or is there another solution?

"Erik Wien" <wien@start.no> wrote in message news:d192ee$u9n$1@sea.gmane.org... | Hmm.. I have now added the patches you provided, but unfortunately they | break the library on both VC++ and GCC. It seems they don't like the | revised encoding_traits function definitions. (They can't be matched to | their declarations according to both of them.) The other fixes work though. | | Is adding the template<> to the definitions the only way to make it | compile on CW, or is there another solution? put the definition inside the class. -Thorsten

Thorsten Ottosen wrote:
| Is adding the template<> to the definitions the only way to make it
| compile on CW, or is there another solution?
put the definition inside the class.
D'oh. That would work! :) I'll do the changes and upload a new version to our site shortly. - Erik

Erik Wien wrote:
I'll do the changes and upload a new version to our site shortly.
And now I have. Same place as the last one: http://hovedprosjekter.hig.no/v2005/data/Gr5_unicode - Erik

Just wanted to add some things I forgot in the other mail.

I'd like to stress that none of the code you see in the current implementation should be considered production quality. There are a lot of things that are less than optimal, and a lot of things that are just plain wrong and might very well blow up. :) Much of it is simply thrown together to test different ideas.

One of the things I'm really not sure about is the character_set_traits concept that is in there now. The basic idea was to allow the library to be used with character sets that are not code point compatible with Unicode, by abstracting this into another traits concept and having the string class use that for its external interface. This was an idea that seemed good at the time, but I'm getting more and more unsure about its usefulness. The biggest reservation I have against it is that it basically makes it impossible to incorporate Unicode-specific functionality in the string class's interface. (Functions for normalization and collation come to mind.)

Another thing is the way the Unicode Character Database is implemented. As of now, we simply generate one massive 2MB source file with the database as one gigantic array inside it. This of course leads to equally gigantic executables, which may or may not be acceptable.

Anyway... just wanted to cover my behind before you start complaining. ;)

- Erik
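For illustration, the generated database presumably boils down to something of this shape (the field names here are guesses, not the prototype's actual layout):

    #include <boost/cstdint.hpp>

    // One record per assigned code point; with tens of thousands of
    // entries, this is where the 2MB source file (and the executable
    // bloat) comes from.
    struct character_properties
    {
        boost::uint8_t  general_category;    // Lu, Ll, Nd, ...
        boost::uint8_t  combining_class;     // needed for normalization
        boost::uint32_t uppercase_mapping;   // simple case mapping
        // ... further UCD fields
    };

    extern const character_properties unicode_data[];

A common space saving for tables like this is a two-stage lookup that shares identical blocks of entries between ranges of code points, trading a little indirection for a much smaller static footprint.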

Hi Erik,

Let me first say that it's good to see that progress is happening on this important topic. Here are just some small comments; I didn't follow the first discussion, so maybe these things have already been answered.

| Current design:
| The current design is based around the concept of «encoding_traits».

Is it entirely improper to make unicode strings a typedef for std::basic_string<...>?

| The string class itself is created with encoding transparency in mind.
| Also at the class level. This means that the encoding used in the string
| is not a template parameter of the string class itself (making each
| instantiation of the string its own type), but rather a parameter of an
| implementation class that is used internally to hold the string.
...
| The reason for doing this is that it allows functions that take
| encoded_string parameters to be blissfully unaware of what encoding they
| are working on, without having to templatize (is that a word?) the
| function itself. (Something I understood was a bit of a worry for some
| in the last discussion.) An alternate way of doing this (something we
| also tested when developing the current version) is to simply template
| the string class itself on encoding, but then you lose the above
| advantage of being able to have non-template functions working on
| multiple encodings.

and what is the benefit of having a function vs a function template? surely a function template will look the same to the client as an ordinary function; is it often the case that people must change encoding on the fly?

| You do however gain speed (I would assume), since you
| wouldn't have the overhead of virtual function calls, as well as a less
| complex implementation.

It would be good to see some real data on how much slower it gets. If the slowdown is high, then you should consider a two-layered approach (implementing the virtual functions in terms of the non-virtual) or to remove the virtual functions altogether.

-Thorsten

Thorsten Ottosen wrote:
Hi Erik,
Hi! Thanks for your reply.
Let me first say that it's good to see that progress is happening on this important topic.
Here are just some small comments; I didn't follow the first discussion, so maybe these things have already been answered.
| Current design:
| The current design is based around the concept of «encoding_traits».
Is it entirely improper to make unicode strings a typedef for std::basic_string<...>?
Not entirely, but certainly less than optimal. basic_string (and the iostreams) make assumptions that don't necessarily apply to Unicode text. One of them is that strings can be represented as a sequence of equally sized characters. Unicode can be represented that way, but that would mean you'd have to use 32 bits per character to be able to represent all the code points assigned in the Unicode standard. In most cases, that is way too much overhead for a string, and usually also a waste, since Unicode code points rarely require more than 16 bits to be encoded. You could of course implement unicode for 16 bit characters in basic_string, but that would require that the user know about things like surrogate pairs, and also know how to correctly handle them. An unlikely scenario. By using encoding_traits however, we are able to make a string class that internally works with 8, 16 or 32 bit code units (UTF-8, 16 and 32 respectively), but that has an external interface that uses 32 bit code points, abstracting away the underlying encoding. By doing it that way we easily halve the effective size of a string for most users. (When using UTF-16, for example.)
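In other words, usage might look roughly like this; the interface is hypothetical and only meant to illustrate the code unit/code point split:

    // The string stores UTF-16 code units internally, but iteration
    // yields whole 32-bit code points; surrogate pairs are decoded
    // behind the scenes.
    encoded_string s;
    s.set_encoding(utf16_tag());
    // ... fill the string ...
    for (encoded_string::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        code_point cp = *i;  // always a full code point, never half a pair
    }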
and what is the benefit of having a function vs a function template? surely a function template will look the same to the client as an ordinary function; is it often the case that people must change encoding on the fly?
Normally I would not think so, and my first implementation did not work this way. That one was implemented with the entire string class being templated on encoding, which eliminated the whole implementation inheritance tree found in this version. There was however (as far as I could tell at least) some concern about this approach in the other thread. (Mostly related to code size and being locked into an encoding at compile time.) Some thought that could be a problem for XML parsers and related technology that need to establish encoding at run-time. (When reading files, for example.) This new implementation was simply a test to see if an alternate solution could be found without those drawbacks. (It has a plethora of new ones though.) I am more than willing to change this if the current design is no good. Starting a discussion on this is one of my main reasons for posting the code in the first place.
| You do however gain speed (I would assume), since you
| wouldn't have the overhead of virtual function calls, as well as a less
| complex implementation.
It would be good to see some real data on how much slower it gets. If the slowdown is high, then you should consider a two-layered approach (implementing the virtual functions in terms of the non-virtual) or to remove the virtual functions altogether.
Yep. Some profiling of the different designs would be a good idea, and will probably be done in the near future.
-Thorsten
- Erik

"Erik Wien" <wien@start.no> wrote in message news:d19pdf$jhu$1@sea.gmane.org... Thorsten Ottosen wrote:
Hi Erik,
Is it entirely improper to make unicode strings a typedef for std::basic_string<...>?
| Not entirely, but certainly less than optimal. basic_string (and the
| iostreams) make assumptions that don't necessarily apply to Unicode text.
| One of them is that strings can be represented as a sequence of equally
| sized characters. Unicode can be represented that way, but that would
| mean you'd have to use 32 bits per character to be able to represent all
| the code points assigned in the Unicode standard. In most cases, that is
| way too much overhead for a string, and usually also a waste, since
| Unicode code points rarely require more than 16 bits to be encoded. You
| could of course implement unicode for 16 bit characters in basic_string,
| but that would require that the user know about things like surrogate
| pairs, and also know how to correctly handle them. An unlikely scenario.

I'm not sure I get this, probably because I just don't know enough about this subject. Ok, so basic_string< char, char_traits<char>, allocator<char> > makes assumptions. So what? I was implying that you should write a specialization basic_string< char, utf_traits<char>, allocator<char> >:

    template< class T, class UTF >
    class basic_string< T, utf_traits<UTF>, std::allocator<T> >
    {
    public:
        basic_string() { }
        ...
    };

    typedef basic_string< char, utf_traits<utf8> > utf8_string;

What is it you wouldn't be able to do with this interface?

| Normally I would not think so, and my first implementation did not work
| this way. That one was implemented with the entire string class being
| templated on encoding, which eliminated the whole implementation
| inheritance tree found in this version.
|
| There was however (as far as I could tell at least) some concern about
| this approach in the other thread. (Mostly related to code size and

hm... the function is only going to be used by 3 different classes, right? If so, at most 3 times the size of a virtual function solution; v-tables fill up too, and virtual functions in a class template can have *large* code size impact if not all virtual functions are used. (So are they?)

| being locked into an encoding at compile time.)

sometimes strong typesafety is good; sometimes it's not

| Some thought that could
| be a problem for XML parsers and related technology that need to
| establish encoding at run-time. (When reading files, for example.)

ok, that seems to motivate that some form of dynamic type should be there.

| This
| new implementation was simply a test to see if an alternate solution
| could be found without those drawbacks. (It has a plethora of new ones
| though.)
| I am more than willing to change this if the current design is no good.
| Starting a discussion on this is one of my main reasons for posting the
| code in the first place.

It seems to me that we then need four classes:

    utf8_string
    utf16_string
    utf32_string
    utf_string   // the dynamic one

Thorsten Ottosen wrote:
Ok, so basic_string< char, char_traits<char>, allocator<char> > makes assumptions. So what? I was implying that you should write a specialization
basic_string< char, utf_traits<char>, allocator<char> >:
    template< class T, class UTF >
    class basic_string< T, utf_traits<UTF>, std::allocator<T> >
    {
    public:
        basic_string() { }
        ...
    };
typedef basic_string< char, utf_traits<utf8> > utf8_string;
What is it you wouldn't be able to do with this interface?
because (for example) std::basic_string provides random access to its data (i.e. characters), and that implies a fixed-size character type, which not all Unicode encodings provide.

Regards,
Stefan

"Stefan Seefeld" <seefeld@sympatico.ca> wrote in message news:42388F83.7090306@sympatico.ca... | Thorsten Ottosen wrote: | > What is it you wouldn't be able to do with this interface? | | because (for example) std::basic_string provides random access to | its data (i.e. characters), and that implies a fixed-size character | type, which not all unicode encodings provide. ahhraaa :-) Thanks both of you. -Thorsten

Thorsten Ottosen wrote:
hm...the function is only going to be used by 3 different classes, right? If so at most 3 times the size of a virtual function solution;
No, 5 I think: UTF-8, plus UTF-16 and UTF-32 in both endians. The ones in the platform's reversed endian would only really be used for file parsing though, whenever we get around to that...
v-tables fill up too; and virtual functions in a class template can have *large* code size impact if not all virtual functions are used. (So are they?)
The idea is to keep the virtual interface to a bare minimum, and let the string class itself create its own complex interface by combining these virtual functions. Basically just having functions for setting, getting and iteration in the implementation, meaning they should all be used frequently.
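A sketch of what such a bare-minimum virtual interface might look like (assumed names; the actual prototype may differ, and the iteration primitives are omitted here):

    #include <cstddef>

    typedef unsigned int code_point;  // 32-bit code points, as described

    class impl_base
    {
    public:
        virtual ~impl_base() {}

        // The primitive operations. Everything else in encoded_string's
        // interface is built on top of these, so every v-table entry is
        // actually exercised.
        virtual std::size_t length() const = 0;
        virtual code_point  get(std::size_t index) const = 0;   // may be O(n)
        virtual void        set(std::size_t index, code_point cp) = 0;
    };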
sometimes strong typesafety is good; sometimes it's not
Yep. What we need to decide is whether it is good more often than it is not. :)
ok, that seems to motivate that some form of dynamic types should be there.
That's what I thought too a while ago, but I'm not so sure anymore. I'll admit I'm no iostream wizard, but wouldn't it be possible to create some kind of unicode_stream by making a specialization of char_traits for unsigned ints (Unicode code points), and then create some facets (I forget which ones; codecvt and ctype I guess) that enable these streams to read all Unicode encoding forms from their buffer, and transcode into a sequence of Unicode code points before returning them to the user? This would mean that users would not have to know what kind of encoding is used in the file they are reading. It would be totally transparent to them.
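For what it's worth, a bare-bones sketch of the codecvt half of that idea might look like the following. It uses the C++11 codecvt<char32_t, char, mbstate_t> specialization for brevity, the class name is made up, and validation, error handling and the do_out (writing) direction are all omitted:

    #include <locale>

    // Decodes UTF-8 code units from the stream buffer into 32-bit
    // code points. A real facet would also sniff the file's encoding
    // (e.g. via the BOM) and dispatch to the right decoder.
    class utf8_codepoints : public std::codecvt<char32_t, char, std::mbstate_t>
    {
    protected:
        result do_in(std::mbstate_t&,
                     const char* from, const char* from_end, const char*& from_next,
                     char32_t* to, char32_t* to_end, char32_t*& to_next) const
        {
            while (from != from_end && to != to_end)
            {
                unsigned char lead = static_cast<unsigned char>(*from);
                int len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
                if (from_end - from < len)
                    break;                            // wait for more input
                char32_t cp = (len == 1)
                    ? lead
                    : static_cast<char32_t>(lead & (0xFF >> (len + 1)));
                for (int i = 1; i < len; ++i)         // fold in trail bytes
                    cp = (cp << 6) | (static_cast<unsigned char>(from[i]) & 0x3F);
                *to++ = cp;
                from += len;
            }
            from_next = from;
            to_next = to;
            return from == from_end ? ok : partial;
        }
    };

    // Usage would be along the lines of:
    //   std::basic_ifstream<char32_t> in("file.txt");
    //   in.imbue(std::locale(in.getloc(), new utf8_codepoints));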
It seems to me that we then need four classes
utf8_string utf16_string utf32_string utf_string // the dynamic one
The first three could be created by having one template class templated on encoding, and having it use the encoding_traits classes from the current prototype. I have tried this before, and it works fine. The necessity of the last one would depend on whether the iostream functionality I mentioned above would work or not. If it is possible, I don't really see the need for a dynamic string class either. - Erik

Erik Wien wrote:
<snip>
The first three could be created by having one template class templated on encoding, and having it use the encoding_traits classes from the current prototype. I have tried this before, and it works fine. The necessity of the last one would depend on whether the iostream functionality I mentioned above would work or not. If it is possible, I don't really see the need for a dynamic string class either.
I think it would be desirable to have the dynamic class even without such iostream functionality. Sometimes we don't know which UTF we are going to use, even when we need to read it from somewhere else or construct it through some low-level means. But having iostreams read and write Unicode would be awesome. Maybe having some kind of stringstream would be great too, but I think it would be much more work than was planned.

Felipe Magno de Almeida wrote:
I think it would be desirable to have the dynamic class even without such iostream functionality. Sometimes we don't know which UTF we are going to use, even when we need to read it from somewhere else or construct it through some low-level means. But having iostreams read and write Unicode would be awesome. Maybe having some kind of stringstream would be great too, but I think it would be much more work than was planned.
Why should such a string class stop at unicode? Wouldn't it be a good idea to support other encodings? It might be better to have such a class as part of a separate library, probably with 'pluggable' encodings, which would include unicode.

Daniel James wrote:
Felipe Magno de Almeida wrote:
I think it would be desirable to have the dynamic class even without such iostream functionality. Sometimes we don't know which UTF we are going to use, even when we need to read it from somewhere else or construct it through some low-level means. But having iostreams read and write Unicode would be awesome. Maybe having some kind of stringstream would be great too, but I think it would be much more work than was planned.
Why should such a string class stop at unicode? Wouldn't it be a good idea to support other encodings? It might be better to have such a class as part of a separate library, probably with 'pluggable' encodings, which would include unicode.
Well, it would be a good idea. I don't know how the license of IBM's ICU works, but maybe we could use that, or perhaps such a feature would just be too much trouble. But if we have a unicode string, then we can reuse that already.

Daniel James wrote:
Why should such a string class stop at unicode? Wouldn't it be a good idea to support other encodings? It might be better to have such a class as part of a separate library, probably with 'pluggable' encodings, which would include unicode.
That was the idea behind the "character_set_traits" class in the current prototype. You could just implement the traits for some other encoding, and you'd be set. The problem though (and in my opinion it's a big one) is that for the encoded_string class (and any iostream implementation based on the same concepts) to be usable at all as a Unicode string class, we would have to include a lot of functionality that is Unicode specific. (Normalization is one example.) What would we do with that functionality for Shift-JIS?

Erik Wien wrote:
Daniel James wrote:
Why should such a string class stop at unicode? Wouldn't it be a good idea to support other encodings? It might be better to have such a class as part of a separate library, probably with 'pluggable' encodings, which would include unicode.
That was the idea behind the "character_set_traits" class in the current prototype. You could just implement the traits for some other encoding, and you'd be set. The problem though (and in my opinion it's a big one) is that for the encoded_string class (and any iostream implementation based on the same concepts) to be usable at all as a Unicode string class, we would have to include a lot of functionality that is Unicode specific. (Normalization is one example.) What would we do with that functionality for Shift-JIS?
I have no idea ;) I know this is a complicated subject, and I'm far from an expert. I was writing about the suggested dynamic string, 'utf_string', possibly better called 'any_string' or 'encoded_string'. IMO your library should concentrate on unicode (and perhaps encodings that are close enough to unicode), and leave other encodings to other libraries. A dynamically encoded string class would probably require a different interface, partly for efficiency's sake and partly because of the differences between encodings. Also, it will be more important that it interacts well with other string implementations. Daniel

Daniel James wrote:
That was the idea behind the "character_set_traits" class in the current prototype. You could just implement the traits for some other encoding, and you'd be set. The problem though (and in my opinion it's a big one) is that for the encoded_string class (and any iostream implementation based on the same concepts) to be usable at all as a Unicode string class, we would have to include a lot of functionality that is Unicode specific. (Normalization is one example.) What would we do with that functionality for Shift-JIS?
I have no idea ;)
Neither do I. :) That's why I feel it's a dead end.
I was writing about the suggested dynamic string, 'utf_string', possibly better called 'any_string' or 'encoded_string'.
Actually, it *is* already called encoded_string. Given its function, though, I think code_point_string would be a more descriptive name. I'm not sure what it will end up being.

IMO your library should concentrate on unicode (and perhaps encodings that are close enough to unicode), and leave other encodings to other libraries. A dynamically encoded string class would probably require a different interface, partly for efficiency's sake and partly because of the differences between encodings. Also, it will be more important that it interacts well with other string implementations.
Yes, that is basically how I am beginning to feel too. After all, since Unicode is supported by all the major players in the industry, it will (I hope) eventually replace all the encodings in existence today. Support for those encodings will therefore not be as important in the future, making concentrating exclusively on Unicode a more viable solution. - Erik

"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
| There was however (as far as I could tell at least) some concern about
| this approach in the other thread. (Mostly related to code size and
hm...the function is only going to be used by 3 different classes, right? If so at most 3 times the size of a virtual function solution; v-tables fill up too; and virtual functions in a class template can have *large* code size impact if not all virtual functions are used. (So are they?)
They're costly even if they are used. Calling virtual functions generates a *lot* more code than a trivial inline function does. Putting type erasure in the lowest-level string design is wrong, especially if it means calling through virtual functions once per character (but even if it doesn't). If you want to erase the encoding information from the type, that should be done in a separate layer, as with boost::function.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
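To illustrate the layering Dave suggests (all names here are hypothetical): keep the real string a plain template with no virtual functions, and erase the encoding in a separate wrapper, the way boost::function erases the type of its callable:

    #include <cstddef>
    #include <vector>

    // Layer 1: the real string. No virtuals, fully inlineable.
    // (The storage member is just a placeholder.)
    template<typename Encoding>
    class encoded_string
    {
    public:
        std::size_t length() const { return units_.size(); }
        // ... full interface built on encoding_traits<Encoding> ...
    private:
        std::vector<unsigned short> units_;
    };

    // Layer 2: an optional type-erasing wrapper for the minority of
    // code that must handle "a string in some encoding decided at
    // run-time".
    class any_string
    {
        struct holder_base
        {
            virtual ~holder_base() {}
            virtual std::size_t length() const = 0;
        };

        template<typename Encoding>
        struct holder : holder_base
        {
            explicit holder(const encoded_string<Encoding>& s) : str(s) {}
            std::size_t length() const { return str.length(); }
            encoded_string<Encoding> str;
        };

        holder_base* impl_;

        any_string(const any_string&);              // copying omitted
        any_string& operator=(const any_string&);   // in this sketch

    public:
        template<typename Encoding>
        any_string(const encoded_string<Encoding>& s)
            : impl_(new holder<Encoding>(s)) {}

        ~any_string() { delete impl_; }

        std::size_t length() const { return impl_->length(); }
    };

Code that doesn't care about encodings can take an any_string; everything else stays fully static, and nothing pays for a virtual call per character.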

In article <d19pdf$jhu$1@sea.gmane.org>, Erik Wien <wien@start.no> wrote:
Thorsten Ottosen wrote:
| Current design:
| The current design is based around the concept of «encoding_traits».

Is it entirely improper to make unicode strings a typedef for std::basic_string<...>?

Not entirely, but certainly less than optimal. basic_string (and the iostreams) make assumptions that don't necessarily apply to Unicode text. One of them is that strings can be represented as a sequence of equally sized characters. Unicode can be represented that way, but that would mean you'd have to use 32 bits per character to be able to represent all the code points assigned in the Unicode standard. In most cases, that is way too much overhead for a string, and usually also a waste, since Unicode code points rarely require more than 16 bits to be encoded. You could of course implement unicode for 16 bit characters in basic_string, but that would require that the user know about things like surrogate pairs, and also know how to correctly handle them. An unlikely scenario.
I completely agree with Erik on this. std::string makes assumptions that do not hold for Unicode characters, and it provides interfaces that are misleading (or outright wrong) for Unicode strings. For example, basic_string lets you erase a single element, which can make the string no longer be a valid Unicode string (unless the elements are represented in UTF-32). The same problem exists with every other mutating algorithm on basic_string, including operator[].
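A concrete example of the erase problem (a hypothetical snippet, not from the library):

    #include <iostream>
    #include <string>

    int main()
    {
        std::string s = "caf\xC3\xA9";  // "café": the é is two UTF-8 code units
        s.erase(4, 1);                  // erases one *element*, not one character
        // s now ends in a lone 0xC3 lead byte -- no longer valid UTF-8.
        std::cout << s.size() << " code units remain\n";
        return 0;
    }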
and what is the benefit of having a function vs a function template? surely a function template will look the same to the client as an ordinary function; is it often the case that people must change encoding on the fly?
Normally I would not think so, and my first implementation did not work this way. That one was implemented with the entire string class being templated on encoding, which eliminated the whole implementation inheritance tree found in this version.

There was however (as far as I could tell at least) some concern about this approach in the other thread. (Mostly related to code size and being locked into an encoding at compile time.) Some thought that could be a problem for XML parsers and related technology that need to establish encoding at run-time. (When reading files, for example.) This new implementation was simply a test to see if an alternate solution could be found without those drawbacks. (It has a plethora of new ones though.)
Here I also agree. Having multiple string classes would just force everyone to pick one for, in most cases, no good reason whatsoever. If I am writing code that uses C++ strings, which encoding should I choose? Why should I care? Particularly, if I don't care, why would I have to choose anyway? More than likely, I would just choose the same thing 99% of the time anyway.

I believe that the ability to force a Unicode string to be in a particular encoding has some value -- especially for people doing low-level work such as serializing Unicode strings to XML, and for people who need to understand time and space complexity of various Unicode encodings -- but I do not believe that this justifiable demand for complexity means we should make the interface harder for everyone else.

I do, however, think that some people are going to feel that they need to eliminate the runtime overhead of generalized strings and explicitly instantiate strings in a particular encoding, and I don't know whether the library currently provides a facility to accomplish this.

meeroh

Miro Jurisic wrote:
Here I also agree. Having multiple string classes would just force everyone to pick one for, in most cases, no good reason whatsoever. If I am writing code that uses C++ strings, which encoding should I choose? Why should I care? Particularly, if I don't care, why would I have to choose anyway? More than likely, I would just choose the same thing 99% of the time anyway.
If we went with an implementation templated on encoding, I would suggest simply having a typedef like today's std::string, let's say "typedef encoded_string<utf16_tag> unicode_string;", and marketing that as "the Unicode string class". Users that don't care would use that and be happy, possibly not even knowing they are using a template instantiation. Advanced users could still easily use one of the other encodings, or even template their code to work with all of them if necessary. But then, like I have said, you wouldn't have functions/classes that are encoding independent without templating them.
I believe that the ability to force a Unicode string to be in a particular encoding has some value -- especially for people doing low-level work such as serializing Unicode strings to XML, and for people who need to understand time and space complexity of various Unicode encodings -- but I do not believe that this justifiable demand for complexity means we should make the interface harder for everyone else.
I agree. But having a templated implementation would not mean a complex interface for the end user. It would probably be simpler than the current implementation, since you could lose all the encoding setting and getting. Especially if we go for the above-mentioned typedef, to remove the template syntax for the casual user.
I do, however, think that some people are going to feel that they need to eliminate the runtime overhead of generalized strings and explicitly instantiate strings in a particular encoding, and I don't know whether the library currently provides a facility to accomplish this.
It doesn't currently. But it would be pretty simple to create an implementation that allows that through use of the encoding_traits classes. I have done that before, and could probably use most of that code again if we were to include that. - Erik

In article <d1cam8$sdq$1@sea.gmane.org>, Erik Wien <wien@start.no> wrote:
Miro Jurisic wrote:
Here I also agree. Having multiple string classes would just force everyone to pick one for, in most cases, no good reason whatsoever. If I am writing code that uses C++ strings, which encoding should I choose? Why should I care? Particularly, if I don't care, why would I have to choose anyway? More than likely, I would just choose the same thing 99% of the time anyway.
If we went with an implementation templated on encoding, I would suggest simply having a typedef like today's std::string, let's say "typedef encoded_string<utf16_tag> unicode_string;", and marketing that as "the Unicode string class". Users that don't care would use that and be happy, possibly not even knowing they are using a template instantiation. Advanced users could still easily use one of the other encodings, or even template their code to work with all of them if necessary. But then, like I have said, you wouldn't have functions/classes that are encoding independent without templating them.
Well, here's what I think -- and this is based entirely on my experience, so I know it's biased:

1. How much of my code has to deal with strings (manipulation, creation, or use)? Almost all of it.

2. How much of that code has to know about the encoding? Almost none of it.

Because of this, I really think that for my purposes the right answer is an encoding-agnostic abstraction. Now, based on my understanding of where knowledge of encodings is necessary, I think that my use cases are similar to those of most C++ users. I could be wrong on that point, of course.
I believe that the ability to force a Unicode string to be in a particular encoding has some value -- especially for people doing low-level work such as serializing Unicode strings to XML, and for people who need to understand time and space complexity of various Unicode encodings -- but I do not believe that this justifiable demand for complexity means we should make the interface harder for everyone else.
I agree. But having a templated implementation would not mean a complex interface for the end user. It would probably be simpler than the current implementation, since you could lose all the encoding setting and getting. Especially if we go for the above-mentioned typedef, to remove the template syntax for the casual user.
I am not sure that's really true. Let's consider this:

1. When you are passing a boost::unicode_string to an API that uses a different kind of string, you are going to have to perform some conversion (even if it's as simple as extracting a wchar_t* from the unicode_string) one way or another. Therefore, the relative complexity of the two possible interfaces in this use case depends on how easy it is to perform the required conversion. I think that they can be equally easy to use for this use case.

2. When you are manipulating a boost::unicode_string with boost APIs, I believe that the two proposed designs would have the same ease of use.

3. When you need to mix and match encodings, then I don't think that the two APIs can be equally easy to use, primarily because implicit conversions in C++ lead to difficulties. (I assume I don't have to bring up specific examples here.)

I think that the end result of the "typedef encoded_string" design would be that either I would have to turn every function that uses a string into a template (which is annoying), or I would have to choose one encoding to use throughout my code, and this seems unnecessary to me.

Finally, it doesn't make sense to me to pay the transcoding cost any earlier than necessary. Consider this code:

    unicode_string foo()
    {
        return function_that_returns_utf8();
    }

With the "typedef encoded_string" design, I am forced to pay the cost of transcoding even if the caller of this code actually needs UTF-8.

So, to summarize, my opinion is that in applications in which one encoding is used throughout the application (and note that this really means "the application and all boost::unicode_string-savvy libraries it uses") the typedef approach is probably as easy as the class approach (and faster, because it eliminates vtable dispatch), whereas in applications in which more than one encoding is used, the benefit of avoiding the vtable dispatch will be offset by having to pay the transcoding cost up front.

In my opinion, having boost::unicode_string_utfN for the situations in which encoding is important, and a boost::unicode_string which can hold any encoding, is better than not having a string that can hold any encoding. (I am sure that if we decide to accept this library with typedef unicode_string_utfM unicode_string, the first thing I'll need is my own encoding-agnostic unicode_string...)
I do, however, think that some people are going to feel that they need to eliminate the runtime overhead of generalized strings and explicitly instantiate strings in a particular encoding, and I don't know whether the library currently provides a facility to accomplish this.
It doesn't currently. But it would be pretty simple to create an implementation that allows that through use of the encoding_traits classes. I have done that before, and could probably use most of that code again if we were to include that.
I think that it should provide this, but I don't demand that it provide it right away. meeroh

On Wed, 16 Mar 2005 18:13:36 +0100, Erik Wien <wien@start.no> wrote:
Not entirely, but certainly less than optimal. basic_string (and the iostreams) make assumptions that don't necessarily apply to Unicode text. One of them is that strings can be represented as a sequence of equally sized characters. Unicode can be represented that way, but that would mean you'd have to use 32 bits per character to be able to represent all the code points assigned in the Unicode standard. In most cases, that is way too much overhead for a string, and usually also a waste, since Unicode code points rarely require more than 16 bits to be encoded. You could of course implement unicode for 16 bit characters in basic_string, but that would require that the user know about things like surrogate pairs, and also know how to correctly handle them. An unlikely scenario.
Looking at the code, it seems to duplicate a lot of what basic_string does. AFAIK, though I haven't looked that closely at Unicode, you have two ways of viewing the string: as a string of UTF-* elements(?), and as a string of characters. The former has the same properties as basic_string; the latter doesn't. It seems to me then that a possible design would be to make it a basic_string and provide special iterators etc. that view the string as characters. This would require the iterator to have a reference to the basic_string to be able to support assignment. Maybe it would require a whole wrapper class around basic_string to provide the required functionality. Rakshasa

In article <b3f2685905031806482a425a55@mail.gmail.com>, Sundell Software <sundell.software@gmail.com> wrote:
Looking at the code, it seems to duplicate a lot of what basic_string does. AFAIK, though I haven't looked that closely at Unicode, you have two ways of viewing the string: as a string of UTF-* elements(?), and as a string of characters. The former has the same properties as basic_string; the latter doesn't.

It seems to me then that a possible design would be to make it a basic_string and provide special iterators etc. that view the string as characters. This would require the iterator to have a reference to the basic_string to be able to support assignment. Maybe it would require a whole wrapper class around basic_string to provide the required functionality.
I believe that the question of why basic_string is not a suitable Unicode abstraction has been answered adequately in this thread, but to summarize: numerous basic_string methods would allow the client to violate invariants set by the Unicode standard. meeroh

On Fri, 18 Mar 2005 13:16:24 -0500, Miro Jurisic <macdev@meeroh.org> wrote:
I believe that the question of why basic_string is not a suitable Unicode abstraction has been answered adequately in this thread, but to summarize: numerous basic_string methods would allow the client to violate invariants set by the Unicode standard.
The client would not be using the basic_string directly to manipulate the unicode character string, although he would have access to the basic_string. If the client chooses to shoot themselves in the foot, they can. But any operation on the string as a string of characters would be done through another interface. Rakshasa

Sundell Software wrote:
On Fri, 18 Mar 2005 13:16:24 -0500, Miro Jurisic <macdev@meeroh.org> wrote:
I believe that the question of why basic_string is not a suitable Unicode abstraction has been answered adequately in this thread, but to summarize: numerous basic_string methods would allow the client to violate invariants set by the Unicode standard.
The client would not be using the basic_string directly to manipulate the unicode character string, although he would have access to the basic_string. If the client chooses to shoot themselves in the foot, they can. But any operation on the string as a string of characters would be done through another interface.
So what's the advantage of using std::basic_string over, say, std::vector ? Regards, Stefan

Stefan Seefeld wrote:
Sundell Software wrote:
On Fri, 18 Mar 2005 13:16:24 -0500, Miro Jurisic <macdev@meeroh.org> wrote:
I believe that the question of why basic_string is not a suitable Unicode abstraction has been answered adequately in this thread, but to summarize: numerous basic_string methods would allow the client to violate invariants set by the Unicode standard.
The client would not be using the basic_string directly to manipulate the unicode character string, although he would have access to the basic_string. If the client chooses to shoot themselves in the foot, they can. But any operation on the string as a string of characters would be done through another interface.
So what's the advantage of using std::basic_string over, say, std::vector ?
The reference counting optimization, and maybe others, where available.

Felipe Magno de Almeida wrote:
So what's the advantage of using std::basic_string over, say, std::vector ?
The reference counting optimization, and maybe others, where available.
While I agree that ref counting might be a good optimization, this is an implementation detail and not an excuse to impose an interface (with all the semantics that come with it) that doesn't fit. Regards, Stefan

In article <423B2D40.3060105@ic.unicamp.br>, Felipe Magno de Almeida <felipe.almeida@ic.unicamp.br> wrote:
Stefan Seefeld wrote:
Sundell Software wrote:
On Fri, 18 Mar 2005 13:16:24 -0500, Miro Jurisic <macdev@meeroh.org> wrote:
I believe that the question of why basic_string is not a suitable Unicode abstraction has been answered adequately in this thread, but to summarize: numerous basic_string methods would allow the client to violate invariants set by the Unicode standard.
The client would not be using the basic_string directly to manipulate the unicode character string, although he would have access to the basic_string. If the client chooses to shoot themselves in the foot, they can. But any operation on the string as a string of characters would be done through another interface.
So what's the advantage of using std::basic_string over, say, std::vector ?
The reference counting optimization, and maybe others, where available.
Reference counting is widely understood to make the performance of multi-threaded code worse than some other implementations of basic_string. Can you name some other reason why basic_string would be a good rep for Unicode strings? I can't think of any. meeroh

Miro Jurisic wrote:
So what's the advantage of using std::basic_string over, say, std::vector ?
reference counting optimization and maybe others, where there is.
Reference counting is widely understood to make the performance of multi-threaded code worse than some other implementations of basic_string. Can you name some other reason why basic_string would be a good rep for Unicode strings? I can't think of any.
Small string optimisation. ;) But, whether you use std::string, std::vector, or manually manage memory, it should be an implementation detail. The binary representation should probably be exposed by its iterators, not the std::string, i.e. instead of get_binary_layout(), have:

    binary_iterator binary_begin();
    binary_iterator binary_end();
    const_binary_iterator binary_begin() const;
    const_binary_iterator binary_end() const;

Should you have non-const binary iterators? IMO, yes; anyone who uses them should know what they're letting themselves in for. Although, you might want to guarantee that the binary representation is in contiguous memory, and isn't, say, a deque.

Daniel James wrote:
But, whether you use std::string, std::vector, or manually manage memory, it should be an implementation detail.
Exactly. We will probably go for std::vector for now, but since it is an implementation detail, changing this at a later time should not be a problem. (As long as the change is found worth the effort in terms of efficiency, that is.)

The binary representation should probably be exposed by its iterators, not the std::string, i.e. instead of get_binary_layout(), have:
    binary_iterator binary_begin();
    binary_iterator binary_end();
    const_binary_iterator binary_begin() const;
    const_binary_iterator binary_end() const;
Should you have non-const binary iterators? IMO, yes, anyone who uses them should know what they're letting themselves in for.
Although, you might want to guarantee that the binary representation is in contiguous memory, and isn't, say, a deque.
I have problems finding any reason why anyone would need access to the binary representation. To me it seems anything lower level than code points would never be needed. Unless of course you are processing Unicode encoding forms manually, but if you are doing that, why are you using a Unicode library anyway? A simple vector of code units would be much more appropriate. - Erik

On Apr 4, 2005 1:16 PM, Erik Wien <wien@start.no> wrote:
Daniel James wrote:
But, whether you use std::string, std::vector, or manually manage memory, it should be an implementation detail.

True.
Exactly. We will probably go for std::vector for now, but since it is an implementation detail, changing this at a later time should not be a problem. (As long as the change is found worth the effort in terms of efficiency, that is.)
basic_string has convenient and fast find* and replace methods though. Regards, Rogier

On Apr 4, 2005 10:34 AM, Stefan Seefeld <seefeld@sympatico.ca> wrote:
Rogier van Dalen wrote:
basic_string has convenient and fast find* and replace methods though.
What is it that you want to find and replace? Bytes?
I suppose the Unicode string will contain find*() and replace() methods that will often forward to the underlying container's algorithms. If two strings are normalised and in the same encoding, you can find one in the other by comparing bytes, yes. If you want to replace a codepoint, or a grapheme cluster, with one of a different encoded length, replace() will be your friend (the Unicode string's friend, rather). Anyway, I shouldn't have sent my previous message: let's quickly get out of this bike shed! Regards, Rogier

In article <e094f9eb050404061051334a6b@mail.gmail.com>, Rogier van Dalen <rogiervd@gmail.com> wrote:
If two strings are normalised and in the same encoding, you can find one in the other by comparing bytes, yes.
No, you can't. Consider searching for "e" in a decomposed representation of e-acute -- you should find nothing, but you will get a false positive instead. meeroh
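To make the false positive concrete (in NFD, "é" decomposes into U+0065 followed by U+0301 COMBINING ACUTE ACCENT; the snippet is illustrative):

    #include <cassert>
    #include <string>

    int main()
    {
        // UTF-8 bytes for the decomposed sequence U+0065 U+0301 ("e" + acute)
        std::string nfd_e_acute = "e\xCC\x81";

        // A byte-level search happily matches the base letter...
        assert(nfd_e_acute.find("e") == 0);

        // ...but position 0 starts the combining sequence e + U+0301,
        // which renders as "é", not "e". A correct search must check
        // that a match is not followed by combining marks.
        return 0;
    }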

On Apr 4, 2005 10:45 PM, Miro Jurisic <macdev@meeroh.org> wrote:
In article <e094f9eb050404061051334a6b@mail.gmail.com>, Rogier van Dalen <rogiervd@gmail.com> wrote:
If two strings are normalised and in the same encoding, you can find one in the other by comparing bytes, yes.
No, you can't. Consider searching for "e" in a decomposed representation of e-acute -- you should find nothing, but you will get a false positive instead.
Whoops, I'd forgotten about that for a moment. But, if I'm not mistaken again, you never get false negatives. False positives can be skipped quite easily after basic_string::find has found one. Regards, Rogier

In article <e094f9eb05040500266a200bf8@mail.gmail.com>, Rogier van Dalen <rogiervd@gmail.com> wrote:
On Apr 4, 2005 10:45 PM, Miro Jurisic <macdev@meeroh.org> wrote:
In article <e094f9eb050404061051334a6b@mail.gmail.com>, Rogier van Dalen <rogiervd@gmail.com> wrote:
If two strings are normalised and in the same encoding, you can find one in the other by comparing bytes, yes.
No, you can't. Consider searching for "e" in a decomposed representation of e-acute -- you should find nothing, but you will get a false positive instead.
Whoops, I'd forgotten about that for a moment. But, if I'm not mistaken again, you never get false negatives. False positives can be skipped quite easily after basic_string::find has found one.
I wouldn't call anything that requires knowledge of Unicode composition rules "quite easy". :-) meeroh

Miro Jurisic wrote:
I wouldn't call anything that requires knowledge of Unicode composition rules "quite easy". :-)
I would have to agree. The simple fact that many of us forget about those problems when discussing this topic (it totally slipped past my filters) is evidence enough. - Erik

On Fri, 18 Mar 2005 16:34:24 -0300, Felipe Magno de Almeida <felipe.almeida@ic.unicamp.br> wrote:
The reference counting optimization, and maybe others, where available.
And that it is already a part of the standard. Less code duplication, and the existence of a stable implementation. Dunno if the best thing would be to make a unicode string class privately inherit from basic_string, or perhaps do everything through iterators. Each of UTF-8/16/32 has its own iterator type, but all output UTF-32 when accessed. Look at std::istream_iterator/std::ostream_iterator for the design. There would probably be helper functions for the most common tasks, and I think you should be able to do all the necessary tasks with just iterators.

    typedef basic_string<utf_8>  ustring8;
    typedef basic_string<utf_16> ustring16;

    ustring8  u8;
    ustring16 u16;

    // Would probably make .begin() the default.
    unicode_iterator i8(u8, u8.begin());

    // This would be a slow way of doing operator[]. The assignment would
    // insert/remove elements from the basic_string if necessary.
    // (std::advance returns void, so advance first, then assign.)
    unicode_iterator i16(u16, u16.begin());
    std::advance(i16, 5);
    *i16 = *i8++;

Note that the client is responsible for giving a valid iterator to unicode_iterator.

BTW, is using UTF-8/16 in the container really cheaper overall than UTF-32? Since if the client changes a character, and it happens to be larger/smaller, then all the elements behind it would need to be moved. Does that happen rarely enough? Though the client should probably know that themselves.

Rakshasa

Sundell Software wrote:
Each of UTF-8/16/32 has its own iterator type, but all output UTF-32 when accessed. Look at std::istream_iterator/std::ostream_iterator for the design. There would probably be helper functions for the most common tasks, and I think you should be able to do all the necessary tasks with just iterators.
Yep. That is basically how the current implementation works. It's all (bi-directional) iterators. A Unicode string is by nature a bi-directional sequence, so you're basically forced to work that way.
typedef basic_string<utf_8> ustring8;
typedef basic_string<utf_16> ustring16;

ustring8 u8;
ustring16 u16;

// Would probably make .begin() the default.
unicode_iterator i8(u8, u8.begin());

// This would be a slow way of doing operator[]. The assignment would
// insert/remove elements from the basic_string if necessary.
unicode_iterator i16(u16, u16.begin());
std::advance(i16, 5);
*i16 = *(i8++);
Note that the client is responsible for giving a valid iterator to unicode_iterator.
An implementation like this is already in place, but not locked to basic_string. A mutable code_point_iterator (unicode_iterator in your code) can be created from any random access sequence. You won't be getting random access to the Unicode sequence itself though, as I mentioned above.
BTW, is using UTF-8/16 in the container really cheaper overall than UTF-32? If the client changes a character and it happens to be larger/smaller, then all the elements behind it would need to be moved. Does that happen rarely enough? Though the client should probably know that themselves.
UTF-8, no. That is for people who require small size above all. But UTF-16 usually is, unless you are using some obscure language whose characters fall outside the BMP (Basic Multilingual Plane). - Erik
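For illustration, the decode step such a code_point_iterator performs on every increment over UTF-8 might look roughly like this (a minimal sketch, not the prototype's actual code; a real implementation must also reject overlong forms and surrogate values):

#include <stdexcept>

typedef unsigned int code_point; // a 21-bit value in a 32-bit integer

// Decode one code point starting at p, advancing p past the sequence.
code_point decode_utf8(const unsigned char*& p) {
    unsigned char lead = *p++;
    if (lead < 0x80) return lead;                    // single byte (ASCII)
    int trail;                                       // trailing byte count
    code_point cp;
    if      ((lead & 0xE0) == 0xC0) { trail = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { trail = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { trail = 3; cp = lead & 0x07; }
    else throw std::runtime_error("ill-formed UTF-8 lead byte");
    while (trail--) {
        if ((*p & 0xC0) != 0x80)
            throw std::runtime_error("ill-formed UTF-8 trail byte");
        cp = (cp << 6) | (*p++ & 0x3F);
    }
    return cp;
}

The variable-width decode is also why such an iterator can only be bi-directional: there is no O(1) way to jump n characters ahead.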

"Miro Jurisic" <macdev@meeroh.org> wrote in message news:macdev-831075.13162418032005@sea.gmane.org... | I believe that the question of why basic_string is not a suitable Unicode | abstraction has been answered adequately in this thread, but to summarize: | numerous basic_string methods would allow the client to violate invariants set | by the Unicode standard. the basic_string interface isn't exactly the best interface invented, so I don't consider this important...sorry I brought that up. Would boost.string algorithm work with unicode strings or what would it take for that support (this I do think is important). -Thorsten

Thorsten Ottosen wrote:
<snip> Would the Boost string algorithm library work with Unicode strings, or what would it take to get that support? (This I do think is important.)
I would suspect much of this could come for free when an eventual iostream implementation for Unicode is made. I'm not entirely sure, though. There are rules that apply to Unicode strings when it comes to things like casing, searching etc. that might not be trivial to implement within the current Boost string library.

Hi Erik! I'm glad to see you've made a lot of progress in these months of silence. I've got a few comments for now.

Of course there isn't much documentation yet, but now that the library is out in the open, writing a Unicode primer might be a good thing to do. Issues that I don't think many programmers are aware of include (off the top of my head): what code points are (21 bits), what Unicode characters are, why you need combining characters, and why UTF-32 is not usually optimal. The library will need these docs eventually anyway. I'd gladly help out with this, though I'm not sure this would fit your university's requirements.

Some speculation on the Unicode database: do you really need the character names? Maybe you should use multi_index, probably with hashing. Maybe you could use Boost.Serialisation for loading the file.

I think that in general you would need to separate input/output from other Unicode processing. For example: endianness only matters when portably reading/writing files; IMO strings in memory should have your platform's endianness. (I second Thorsten's proposal of having utf8_string, utf16_string, utf32_string, utf_string.) For reading code points from files, a codecvt could be used. This can be fast because its virtual functions are called only once per so many bytes. I think there's an implementation floating around in the yahoo files section that can automatically figure out the file's encoding and convert to and from any endianness.

I also think you should separate code points and Unicode characters. In normal situations, the user should not have to deal with code points. The discussion should not focus on that for now; it's an implementation detail. I strongly object to your

typedef encoded_string<unicode_tag> unicode_string;

because I think a Unicode string should contain characters. For example, a regular expression on Unicode strings should support level 2 (see <http://www.unicode.org/reports/tr18/>). Why go for anything less?

Hoping this will be useful, Rogier
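For what it's worth, the codecvt usage Rogier describes might look like this (a sketch: utf8_codecvt_facet is assumed to come from an external header such as the one in the yahoo files archive mentioned below; it is not a standard facet):

#include <fstream>
#include <locale>
#include "utf8_codecvt_facet.hpp" // assumed external UTF-8 facet

int main() {
    std::wifstream file;
    // Imbue before open() so the facet sees the byte stream from the start.
    std::locale utf8_locale(std::locale(), new utf8_codecvt_facet);
    file.imbue(utf8_locale);
    file.open("input.txt");
    for (wchar_t c; file.get(c); ) {
        // c holds a decoded code point here. The facet's virtual do_in()
        // runs once per buffer of bytes, not once per character, which is
        // why this can be fast.
    }
}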

Rogier van Dalen wrote:
Hi Erik!
Hi!
I'm glad to see you've made a lot of progress in these months of silence. I've got a few comments for now.
Of course there isn't much documentation yet, but now that the library is out in the open, writing a Unicode primer might be a good thing to do. Issues that I don't think many programmers are aware of include (off the top of my head): what code points are (21 bits), what Unicode characters are, why you need combining characters, and why UTF-32 is not usually optimal. The library will need these docs eventually anyway. I'd gladly help out with this, though I'm not sure this would fit your university's requirements.
Yep. You are absolutely right. That would greatly cut down on the time spent explaining these concepts to people here on the list. As you say, the boost documentation will probably need this too eventually, so it's a good idea to do it now. You could always do it if you want to, but we will have to write some paragraphs about this for our report too, so we might as well do it ourselves. No need for two people doing the same thing.
Some speculation on the Unicode database: do you really need the character names? Maybe you should use multi_index, probably with hashing. Maybe you could use Boost.Serialisation for loading the file.
We probably won't need the names, and we have been speculating on taking them out. (The Unicode 1.0 and ISO names will probably go regardless.) We thought about using multi_index, but (correct me if I'm wrong) wouldn't that make it necessary to fill the database at run-time? With serialization, the overhead would probably not be that bad, but still.
I think that in general you would need to separate input/output from other Unicode processing. For example: endianness only matters when portably reading/writing files; IMO strings in memory should have your platform's endianness. (I second Thorsten's proposal of having utf8_string, utf16_string, utf32_string, utf_string.) For reading code points from files, a codecvt could be used. This can be fast because its virtual functions are called only once per so many bytes. I think there's an implementation floating around in the yahoo files section that can automatically figure out the file's encoding and convert to and from any endianness.
I have an idea on how to implement iostream support in the library (I wrote about it in another mail here), but I'm not really sure it would work. Could you perhaps verify that?
I also think you should separate code points and Unicode characters. In normal situations, the user should not have to deal with code points. The discussion should not focus on that for now; it's an implementation detail. I strongly object to your
typedef encoded_string<unicode_tag> unicode_string;
because I think a Unicode string should contain characters. For example, a regular expression on Unicode strings should support level 2 (see <http://www.unicode.org/reports/tr18/>). Why go for anything less?
What exactly do you mean by the term "character"? Abstract characters? I do agree with you on the level 2 support, though. The closer the behaviour of a string in "reg-ex use" is to what the user would normally expect, the better. - Erik

Just some short answers; I don't have much time at the moment:
Of course there isn't much documentation yet, but now that the library is out in the open, writing a Unicode primer might be a good thing to do. [...] [...] we will have to write some paragraphs about this for our report too, so we might as well do it ourselves. No need for two people doing the same thing.

Great! I look forward to seeing it.

[...] We thought about using multi_index but (correct me if I'm wrong) wouldn't that make it necessary to fill the database at run-time? With serialization, the overhead would probably not be that bad, but still.

I think so, but I think hashing might give an enormous runtime performance gain. I'm not particularly knowledgeable in this area, just throwing in ideas on this.

I have an idea on how to implement iostream support in the library (I wrote about it in another mail here), but I'm not really sure it would work. Could you perhaps verify that?

What you're saying sounds correct to me. http://groups.yahoo.com/group/boost/files/utf/ has utf-2003-01-12.zip. I have no idea what its status is, but it seems to implement all kinds of UTF I/O you'll need. There's even a detect_from_bom.hpp which appears to check for a BOM and imbue the correct codecvt.
I also think you should separate code points and Unicode characters. In normal situations, the user should not have to deal with code points. The discussion should not focus on that for now; it's an implementation detail. I strongly object to your
typedef encoded_string<unicode_tag> unicode_string;
because I think a Unicode string should contain characters. For example, a regular expression on Unicode strings should support level 2 (see <http://www.unicode.org/reports/tr18/>). Why go for anything less?
What exactly do you mean by the term "character"? Abstract characters?
Yes. IMHO: One of the goals of the Unicode library is to relieve programmers of having to know all the ins and outs of Unicode. In my opinion, the average programmer should not need to know about code points and normalisation forms. Finding strings in other strings should just work, without having to mess with normalisation forms. Starting a string with a combining character should throw, because it's meaningless (and may cause hard-to-diagnose errors later on). The third character in the string "rôle" is 'l', and not either-an-l-or-a-combining-character. Code points should be hidden away in the "advanced topics" section of the library.

I'm so totally convinced of this that I have a hard time seeing why it should be otherwise. Do you, or anyone else, feel there is anything obvious I'm missing?
I do agree with you on the level 2 support, though. The closer the behaviour of a string in "reg-ex use" is to what the user would normally expect, the better.

Exactly.
Regards, Rogier
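A rough illustration of counting at the level Rogier describes: in decomposed form, "rôle" holds five code points but four characters (is_combining is hypothetical shorthand for a Unicode-database lookup, and Hangul syllables are ignored entirely):

#include <cstddef>

typedef unsigned int code_point;

// Hypothetical stand-in for a real database query of the combining class.
bool is_combining(code_point cp) {
    return cp >= 0x0300 && cp <= 0x036F; // only the basic combining block
}

// A base code point plus any trailing combining marks is one "character".
// (A leading mark is counted as its own, degenerate character here.)
std::size_t character_count(const code_point* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (i == 0 || !is_combining(s[i]))
            ++count;
    return count;
}

// In NFD, "rôle" is { 'r', 'o', 0x0302, 'l', 'e' }: five code points, but
// character_count() returns four, and the third character is 'l'.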

Rogier van Dalen wrote:
[...] We thought about using multi_index but (correct me if I'm wrong) wouldn't that make it necessary to fill the database at run-time? With serialization, the overhead would probably not be that bad, but still.
I think so, but I think hashing might give an enormous runtime performance gain. I'm not particularly knowledgeable in this area, just throwing in ideas on this.
Hashing would certainly be faster than the binary search we are using now, but would we need to do that through multi_index (or any other run-time solution, for that matter)? Wouldn't it be more efficient to build the hash table statically through the code generator we are using now? You can't do lazy loading of the database and all that stuff that way, but you would lose the dependency on an external file for the database.
What you're saying sounds correct to me. http://groups.yahoo.com/group/boost/files/utf/ has utf-2003-01-12.zip. I have no idea what its status is but it seems to implement all kinds of UTF I/O you'll need. There's even a detect_from_bom.hpp which appears to check for a BOM and imbue the correct codecvt.
I'll take a look at it. Looks promising.
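What a statically generated table might look like (a sketch assuming the generator emits a code-point-sorted array; the field names and entries here are invented):

#include <algorithm>
#include <cstddef>

struct char_props {
    unsigned int code;      // code point
    unsigned char category; // general category as a small enum value
    unsigned char ccc;      // canonical combining class
};

// Emitted by the code generator at build time -- no run-time loading,
// no external database file.
static const char_props props_table[] = {
    { 0x0041, 1, 0   }, // LATIN CAPITAL LETTER A (Lu)
    { 0x0301, 6, 230 }, // COMBINING ACUTE ACCENT (Mn)
    // ... thousands of generated entries, sorted by code point ...
};

static bool code_less(const char_props& p, unsigned int cp) { return p.code < cp; }

const char_props* lookup(unsigned int cp) {
    const char_props* end = props_table + sizeof props_table / sizeof props_table[0];
    const char_props* it = std::lower_bound(props_table, end, cp, code_less);
    return (it != end && it->code == cp) ? it : 0;
}

A generated perfect-hash function over the same data would give constant-time lookup with the same no-run-time-loading property.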

On Thu, 17 Mar 2005 17:52:25 +0100, Erik Wien <wien@start.no> wrote:
What exactly do you mean by the term "character"? Abstract characters?
I really need to remember the correct terminology - what I mean is the thing "a user thinks of as a character", a "grapheme cluster", of which the Unicode standard says: "[T]here is a core concept of "characters that should be kept together" that can be defined for the Unicode Standard in a language-independent way. This core concept is known as a grapheme cluster, and it consists of any combining character sequence that contains only nonspacing combining marks, or any sequence of characters that constitutes a Hangul syllable (possibly followed by one or more nonspacing marks)." I believe this is what a Unicode library should use as its basic unit. Sorry for any confusion caused, Rogier

Rogier van Dalen wrote:
On Thu, 17 Mar 2005 17:52:25 +0100, Erik Wien <wien@start.no> wrote:
What exactly do you mean by the term "character"? Abstract characters?
I really need to remember the correct terminology - what I mean is the thing "a user thinks of as a character", a "grapheme cluster", of which the Unicode standard says:
"[T]here is a core concept of "characters that should be kept together" that can be defined for the Unicode Standard in a language-independent way. This core concept is known as a grapheme cluster, and it consists of any combining character sequence that contains only nonspacing combining marks, or any sequence of characters that constitutes a Hangul syllable (possibly followed by one or more nonspacing marks)."
I believe this is what a Unicode library should use as its basic unit.
Be careful with making a global assertion. Different users of a Unicode library will need to access the data at different levels. Some will need the raw encoding bytes or words, some will need code points, and some will need 'grapheme clusters'. The library should support working at the level that each particular user needs, and different parts of an application or library may need to work at multiple levels. -- Jon Biggar Levanta jon@levanta.com

On Fri, 18 Mar 2005 14:56:53 -0800, Jonathan Biggar <jon@levanta.com> wrote:
Rogier van Dalen wrote:
I believe this is what a Unicode library should use as its basic unit.
Be careful with making a global assertion. Different users of a Unicode library will need to access the data at different levels. Some will need the raw encoding bytes or words, some will need code points, and some will need 'grapheme clusters'.
The library should support working at the level that each particular user needs, and different parts of an application or library may need to work at multiple levels.
A decision must be made. Certainly you should have access to code points; and you should be able to work at multiple levels. However, one level has to be the default level. Most programmers should be able to get what they want by using boost::unicode_string (or whatever it's going to be called). We need to make a "global assertion" that's correct 99% of the time.

I think we need an interface that will work for programmers who have no idea what the difference between a code point and a grapheme cluster is, and don't want to be bothered by the difference between U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX and U+006A LATIN SMALL LETTER J followed by U+0302 COMBINING CIRCUMFLEX ACCENT. String handling includes searching and comparing, for which the above should be equivalent. As a programmer, I don't want to be bothered with different sequences that are canonically equivalent. I want it to just work. The library should handle the cases I didn't think about.

Input and output have to deal with code points, obviously, but I think going from code points to what users think of as "characters" and vice versa for I/O should be done by the library. By default. I have not been able to find another use case for accessing code points directly. I'm ready to be convinced I'm wrong. However, we'll have to make a choice. Regards, Rogier

Rogier van Dalen wrote:
Be careful with making a global assertion. Different users of a Unicode library will need to access the data at different levels. Some will need the raw encoding bytes or words, some will need code points, and some will need 'grapheme clusters'.
The library should support working at the level that each particular user needs, and different parts of an application or library may need to work at multiple levels.
A decision must be made. Certainly you should have access to code points; and you should be able to work at multiple levels. However, one level has to be the default level. Most programmers should be able to get what they want by using boost::unicode_string (or whatever it's going to be called). We need to make a "global assertion" that's correct 99% of the time.
I don't see why there has to be a "default" interface at all. There should just be multiple interfaces, one for each level that a programmer may need to work at.
I think we need an interface that will work for programmers who have no idea what the difference between a code point and a grapheme cluster is, and don't want to be bothered by the difference between
U+0135 LATIN SMALL LETTER J WITH CIRCUMFLEX and U+006A LATIN SMALL LETTER J followed by U+0302 COMBINING CIRCUMFLEX ACCENT
That's fine for *certain* uses. Other programs may need to distinguish between the two, and need the ability to convert a Unicode string between the form where all combining characters are combined and the form where they are all separate explicit code points. A way of telling the library that you don't care about the difference is to ensure that every string you use is canonicalized into the form that makes your job easier. Alternatively, the interface could provide the ability to set state bits in the string that indicate whether you want to see the differences or not.
String handling includes searching and comparing, for which the above should be equivalent. As a programmer, I don't want to be bothered with different sequences that are canonically equivalent. I want it to just work. The library should handle the cases I didn't think about.
That's fine *when* you are working at that high a level of abstraction.
Input and output have to deal with code points, obviously, but I think going from code points to what users think of as "characters" and vice versa for I/O should be done by the library. By default. I have not been able to find another use case for accessing code points directly. I'm ready to be convinced I'm wrong. However, we'll have to make a choice.
Another use case would be writing codeset conversion functions. -- Jonathan Biggar jon@levanta.com

On Sat, 19 Mar 2005 08:57:55 -0800, Jonathan Biggar <jon@levanta.com> wrote:
Rogier van Dalen wrote:
Be careful with making a global assertion. Different users of a Unicode library will need to access the data at different levels. Some will need the raw encoding bytes or words, some will need code points, and some will need 'grapheme clusters'.
The library should support working at the level that each particular user needs, and different parts of an application or library may need to work at multiple levels.
A decision must be made. Certainly you should have access to code points; and you should be able to work at multiple levels. However, one level has to be the default level. Most programmers should be able to get what they want by using boost::unicode_string (or whatever it's going to be called). We need to make a "global assertion" that's correct 99% of the time.
I don't see why there has to be a "default" interface at all. There should just be multiple interfaces, [...]
I'm sorry, I don't see how these propositions are mutually exclusive. I believe we are talking about different kinds of users. Let's get this clear: I was assuming that the Unicode library will be aimed at programmers doing everyday programming jobs whose programs will have to deal with non-English characters (because they're bound to be localised, or because non-English names will be inserted in a database, or whatever), i.e. people who have no idea about how Unicode works and don't want to, as long as it does work. Correct me if I'm wrong, but you seem to assume the library will be used mostly by those who need to code things like codeset conversions, who should know a great deal about Unicode.

What I think would be a good interface:

// A string of code points, encoded UTF-16 (or templated).
class code_point_string {
public:
    // ...
    const std::basic_string<char16_t>& code_units() const;
};

// A string of "grapheme clusters", with a code_point_string underlying.
// The string is always in a normalisation form.
template <class NormalisationPolicy = NormalisationFormC>
class unicode_string {
public:
    // ...
    const code_point_string& code_points() const;
};

Those who need to process code points can happily use code_point_string; others can use unicode_string.
[...] Other programs may need to distinguish between the two, and need the ability to convert a Unicode string between the form where all combining characters are combined and the form where they are all separate explicit code points.
I believe you would not need to manipulate code points to convert *all* characters in a string from one normalisation form to another. (See the interface proposal above.)
A way of telling the library that you don't care about the difference is to ensure that every string you use is canonicalized into the form that makes your job easier.
I'd say the normalisation form of a string is an invariant that the library rather than the user should deal with.
[...]
Regards, Rogier
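Hypothetical usage of the proposed two-level interface (a sketch against the class outlines above; all of these methods are imagined, none exist anywhere yet):

// Everyday code stays at the character level and never sees code points:
unicode_string<> name;
// name.find(...), comparisons, etc. would work on whole characters, with
// the normalisation invariant maintained behind the scenes.

// Specialist code drops down a level explicitly:
const code_point_string& cps = name.code_points();
const std::basic_string<char16_t>& units = cps.code_units();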

Rogier van Dalen wrote:
<snip> I believe we are talking about different kinds of users. Let's get this clear: I was assuming that the Unicode library will be aimed at programmers doing everyday programming jobs whose programs will have to deal with non-English characters (because they're bound to be localised, or because non-English names will be inserted in a database, or whatever), i.e. people who have no idea about how Unicode works and don't want to, as long as it does work.
That was my initial thought. This Unicode library should in my opinion make handling Unicode strings correctly as easy as handling ASCII strings is today. But that does not mean we have to put mittens on everyone else to keep them away from the lower details. If you need to manipulate code points, I think you should be allowed to. Code units, on the other hand, I'm a little more wary about, since users could easily screw things up at that level. (Make a sequence ill-formed.) Furthermore, I don't really see why anyone would need to muck about with code units.
What I think would be a good interface:
// A string of code points, encoded UTF-16 (or templated).
class code_point_string {
public:
    // ...
    const std::basic_string<char16_t>& code_units() const;
};

// A string of "grapheme clusters", with a code_point_string underlying.
// The string is always in a normalisation form.
template <class NormalisationPolicy = NormalisationFormC>
class unicode_string {
public:
    // ...
    const code_point_string& code_points() const;
};
Those who need to process code points can happily use code_point_string; others can use unicode_string.
This is starting to look more and more like the way to go in my opinion. By layering interfaces with an increasing level of abstraction (from code points and up), we could more or less keep everyone happy.

What I really don't like about this solution is that we would end up with a myriad of different types that all are "unicode strings", but at different levels. I can easily imagine mayhem erupting when everyone gets their favorite Unicode abstraction and uses that one exclusively in their APIs. Passing strings around would be a complete nightmare. One solution could be to make code points the "base level" of abstraction and use normalization policies (like you outlined) for functions where normalization form actually matters (find etc.). We could still get most of the functionality a grapheme_cluster_string would provide, but without the extra types.

I'm just afraid that if we have a code_point_string in all encodings, plus the dynamic one, in addition to the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse the users more than it would help them. Feel free to convince me otherwise though.

[Rearranging paragraphs from your post] On Mon, 21 Mar 2005 01:50:04 +0100, Erik Wien <wien@start.no> wrote:
One solution could be to make code points the "base level" of abstraction and use normalization policies (like you outlined) for functions where normalization form actually matters (find etc.). We could still get most of the functionality a grapheme_cluster_string would provide, but without the extra types.
I'm not too sure how you envision using normalisation policies for functions. However, the problem I see with it is that a normalisation form is not a property of a function; it is a property of a string. I think it should be an invariant of that string. Imagine a std::map<> where you use a Unicode string as a key; you want equivalent strings to map to the same object. operator< for two strings with the same normalisation form and the same encoding is trivial (and as fast as std::basic_string::operator< for UTF-8 or UTF-32). On two strings with unknown normalisation forms, it will be dreadfully slower, because you'll need to look things up in the Unicode database all the time.
What I really don't like about this solution is that we would end up with a myriad of different types that all are "unicode strings", but at different levels. I can easily imagine mayhem erupting when everyone gets their favorite Unicode abstraction and uses that one exclusively in their APIs. Passing strings around would be a complete nightmare.
I'm just afraid that if we have a code_point_string in all encodings, plus the dynamic one, in addition to the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse the users more than it would help them.
As long as there is one boost::unicode_string, I speculate this shouldn't be much of a problem. Developers wanting to make a different choice than you have made will, I think, fall into either of two categories: those who know about Unicode and are not easily confused by encodings and normalisation forms; and those who worry about performance. With a good rationale (based on measured performance in a number of test cases), you should be able to pick one that's good enough in most situations, I think. (Looking at the ICU website, I'd say this would involve UTF-16, but let's see what you come up with.) Regards, Rogier
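In terms of the interface sketched earlier, the fast path could be as simple as this (a sketch; note that UTF-16 code-unit order differs from code-point order, but any consistent strict weak ordering suffices for std::map):

// Both operands share a normalisation form and encoding by construction,
// so a plain code-unit comparison is a valid (and fast) ordering.
template <class NormalisationPolicy>
bool operator<(const unicode_string<NormalisationPolicy>& a,
               const unicode_string<NormalisationPolicy>& b) {
    return a.code_points().code_units() < b.code_points().code_units();
}

// Canonically equivalent spellings now land on the same entry:
// std::map<unicode_string<>, int> index;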

Sorry about the late reply. I have been away for easter, and to top it all off, been sick a while. Anyway, I'm back...
I'm not too sure how you envision using normalisation policies for functions. However, the problem I see with it is that a normalisation form is not a property of a function; it is a property of a string. I think it should be an invariant of that string.
Imagine a std::map<> where you use a Unicode string as a key; you want equivalent strings to map to the same object. operator< for two strings with the same normalisation form and the same encoding is trivial (and as fast as std::basic_string::operator< for UTF-8 or UTF-32). On two strings with unknown normalisation forms, it will be dreadfully slower, because you'll need to look things up in the Unicode database all the time.
Yep.. You are of course right. I should start thinking before I talk. :) Having strings locked to a normalization form would be the most logical way to go. What I don't really see, though, is why you would have to have a separate class (different from the code point string class, that is) for this functionality. If we made the code point string classes (both the static and dynamic ones) take a normalization policy, and provided a policy that doesn't actually do anything in addition to ones that normalize to each of the normalization forms, everyone could have their way. If you don't care about normalization, use the do-nothing one. If you do care (or simply have no clue what normalization is - most users), use NFD or NFC or something.
What I really don't like about this solution is that we would end up with a myriad of different types that all are "unicode strings", but at different levels. I can easily imagine mayhem erupting when everyone gets their favorite Unicode abstraction and uses that one exclusively in their APIs. Passing strings around would be a complete nightmare.
I'm just afraid that if we have a code_point_string in all encodings, plus the dynamic one, in addition to the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse the users more than it would help them.
As long as there is one boost::unicode_string, I speculate this shouldn't be much of a problem.
I hope you are right, because if it turns out to be a problem, it will be a major one! What do the rest of you think? Would a large number of different classes lead to confusion, or would a unicode_string typedef hide this complexity?

Developers wanting to make a different choice than you have made will, I think, fall into either of two categories: those who know about Unicode and are not easily confused by encodings and normalisation forms; and those who worry about performance.

Yep, that sounds about right. Most users should not really care what kind of encoding and normalization form is used. They want to work with the string, not fiddle with its internal representation.

With a good rationale (based on measured performance in a number of test cases), you should be able to pick one that's good enough in most situations, I think. (Looking at the ICU website, I'd say this would involve UTF-16, but let's see what you come up with.)

I would be surprised if any other encoding than UTF-16 would end up as the most efficient one. UTF-8 suffers from the big variation in code unit count for any given code point, and UTF-32 is just a waste of space for little performance gain for most users. You never know though.
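For reference, the policy slot Erik describes might look like this (all names are invented, and the normalisation bodies are elided; this is only a sketch of the idea being weighed):

struct no_normalisation {
    template <class Sequence> static void apply(Sequence&) {} // leave as-is
};

struct nfd_normalisation {
    template <class Sequence> static void apply(Sequence& s) {
        // decompose canonically, then sort marks by combining class
    }
};

template <typename encoding, class NormalisationPolicy = nfd_normalisation>
class encoded_string {
public:
    template <class Iterator>
    void insert(Iterator first, Iterator last) {
        // splice the new code points into the underlying storage
        // (elided), then re-establish the invariant the policy chose:
        NormalisationPolicy::apply(*this);
    }
    // ...
};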

On Apr 4, 2005 12:51 PM, Erik Wien <wien@start.no> wrote:
Sorry about the late reply. I have been away for easter, and to top it all off, been sick a while. Anyway, I'm back...
Great; we'll go on with the discussion. I'm glad you agree with me on most points. :-)
Yep.. You are of course right. I should start thinking before I talk. :)
I don't know... thinking out aloud often works in abstract discussions like this.
Having strings locked to a normalization form would be the most logical way to go. What I don't really see, though, is why you would have to have a separate class (different from the code point string class, that is) for this functionality. If we made the code point string classes (both the static and dynamic ones) take a normalization policy, and provided a policy that doesn't actually do anything in addition to ones that normalize to each of the normalization forms, everyone could have their way. If you don't care about normalization, use the do-nothing one. If you do care (or simply have no clue what normalization is - most users), use NFD or NFC or something.
I'm not sure about this. The simplicity point is a good one. Assuming you do want to have built-in grapheme cluster support, I do however see two problems with this approach:

1. You'd still need two kinds of iterators: iterators over code points, and iterators over grapheme clusters. This makes things conceptually muddy for users, I think. The string class will need code point versions and grapheme cluster versions of many methods (e.g., insert, erase, find*). You may end up actually implementing two strings in one string class.

2. Elements are not straightforwardly inserted into the sequence. E.g., appending 0x317 (a combining character) to a string s will not make s.back() return 0x317.

In short, a code point string that automatically normalises is not a Sequence, though it may superficially look like one. I have a feeling this would be more difficult for users to understand than two separate string classes would be. But maybe that's because I already understand my own viewpoint? Regards, Rogier
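Rogier's second problem, spelled out (a sketch; ustring is a hypothetical code point string that re-normalises on every mutation; U+0317 has canonical combining class 220, U+0301 has class 230):

// ustring: hypothetical code point string that keeps itself normalised.
ustring s;
s.push_back(0x0065); // 'e'
s.push_back(0x0301); // COMBINING ACUTE ACCENT (above, class 230)
s.push_back(0x0317); // COMBINING ACUTE ACCENT BELOW (class 220)

// Canonical ordering sorts the marks by combining class, so the stored
// sequence becomes { 0x0065, 0x0317, 0x0301 }: the element just appended
// is no longer at the back, which breaks the Sequence requirements.
assert(s.back() == 0x0301); // not 0x0317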

Rogier van Dalen wrote:
Great; we'll go on with the discussion.
I'm glad you agree with me on most points. :-)
Great, isn't it? :D
Having strings locked to a normalization form would be the most logical way to go. What I don't really see, though, is why you would have to have a separate class (different from the code point string class, that is) for this functionality. If we made the code point string classes (both the static and dynamic ones) take a normalization policy, and provided a policy that doesn't actually do anything in addition to ones that normalize to each of the normalization forms, everyone could have their way. If you don't care about normalization, use the do-nothing one. If you do care (or simply have no clue what normalization is - most users), use NFD or NFC or something.
Based on the comments made by Miro in the other thread (and by you in response), I'm going to disagree with myself on that point. (How's that for a change?) Normalization in a code-point string would lead to many problems when searching and inserting that I never even thought of. A grapheme-cluster string makes more and more sense the more I think about it. I'll throw some ideas around over the weekend (don't hold me to this), and see if I come up with a smart way of implementing something like that.
I'm not sure about this. The simplicity point is a good one. Assuming you do want to have built-in grapheme cluster support, I do however see two problems with this approach:
1. You'd still need two kinds of iterators: iterators over code points, and iterators over grapheme clusters. This makes things conceptually muddy for users, I think. The string class will need code point versions and grapheme cluster versions of many methods (e.g., insert, erase, find*). You may end up actually implementing two strings in one string class.
2. Elements are not straightforwardly inserted into the sequence. E.g., appending 0x317 (a combining character) to a string s will not make s.back() return 0x317.
In short, a code point string that automatically normalises is not a Sequence, though it may superficially look like one. I have a feeling this would be more difficult for users to understand than two separate string classes would be. But maybe that's because I already understand my own viewpoint?
All true. You have me convinced, sir. - Erik
participants (10)
- Daniel James
- David Abrahams
- Erik Wien
- Felipe Magno de Almeida
- Jonathan Biggar
- Miro Jurisic
- Rogier van Dalen
- Stefan Seefeld
- Sundell Software
- Thorsten Ottosen