[UTF String] UTF String library 1.5 ready for perusal

It's a little later than I'd planned, but the second version of the UTF String library is now available for download, at <http://www.oakcircle.com/toolkit.html>. I'd appreciate comments or (constructive) criticisms on it.
This version is substantially better than the original. The design has been somewhat simplified, removing extraneous features like null-string emulation. Each of the classes now contains as many of the std::string functions as I could efficiently add (essentially all of them in utf32_t), including I/O stream functions, and also features code-point iterators. It includes conversion classes for many common code-pages too. (I have not yet added true *character* iterators. That's going to require more study. But the code-point iterators should be pretty useful even without that.)
There may well be bugs in it. I've only just completed it, and I haven't used this version in anything but test code yet. It's also still *not* ready for submission... it's closer, but at the very least I'd like to add a few more code-pages to it first (so it can handle any code-page that Windows is likely to throw at it), and break up the code-page cpp files.
Artyom, it might interest you to note that its capabilities are very similar to those outlined in your "realistic API proposal" (<http://lists.boost.org/Archives/boost/2011/01/176046.php>), though the design itself is different. It doesn't handle Unicode normalization, case handling, comparison, or search, but all the other capabilities are there. Your API could be implemented on top of utf8_t fairly easily.
-- Chad Nelson Oak Circle Software, Inc. * * *

On Wed, Feb 9, 2011 at 3:50 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
It's a little later than I'd planned, but the second version of the UTF String library is now available for download, at <http://www.oakcircle.com/toolkit.html>. I'd appreciate comments or (constructive) criticisms on it.
Are there any examples of usage?
BR Matus

On Wed, 9 Feb 2011 15:59:35 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
On Wed, Feb 9, 2011 at 3:50 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
It's a little later than I'd planned, but the second version of the UTF String library is now available for download, at <http://www.oakcircle.com/toolkit.html>. I'd appreciate comments or (constructive) criticisms on it.
Are there any examples of usage?
Not yet, other than the couple on the documentation page. I've got some unit testing code for it, which didn't make it into the package because it isn't stand-alone; if that would help, I can post the UTF portion of it...? -- Chad Nelson Oak Circle Software, Inc. * * *

On 09/02/2011 15:50, Chad Nelson wrote:
This version is substantially better than the original. The design has been somewhat simplified, removing extraneous features like null-string emulation. Each of the classes now contain as many of the std::string functions as I could efficiently add (essentially all of them in utf32_t), including I/O stream functions
Bad design, IMHO.
and also features code-point iterators.
That code point iterator uses pointers and indexes instead of iterators, which means it cannot work as an arbitrary iterator adaptor even though it could with virtually no change, especially since it only requires a forward iterator.

On Thu, 10 Feb 2011 14:22:56 +0100 Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 09/02/2011 15:50, Chad Nelson wrote:
This version is substantially better than the original. The design has been somewhat simplified, removing extraneous features like null-string emulation. Each of the classes now contain as many of the std::string functions as I could efficiently add (essentially all of them in utf32_t), including I/O stream functions
Bad design, IMHO.
Not very constructive. *Why* do you think it's a bad design?
and also features code-point iterators.
That code point iterator uses pointers and indexes instead of iterators, which means it cannot work as an arbitrary iterator adaptor even though it could with virtually no change, especially since it only requires a forward iterator.
Sorry, I don't understand the reasoning behind that assertion. Please enlighten me. -- Chad Nelson Oak Circle Software, Inc. * * *

On 10/02/2011 14:40, Chad Nelson wrote:
On Thu, 10 Feb 2011 14:22:56 +0100 Mathias Gaunard<mathias.gaunard@ens-lyon.org> wrote:
On 09/02/2011 15:50, Chad Nelson wrote:
This version is substantially better than the original. The design has been somewhat simplified, removing extraneous features like null-string emulation. Each of the classes now contain as many of the std::string functions as I could efficiently add (essentially all of them in utf32_t), including I/O stream functions
Bad design, IMHO.
Not very constructive. *Why* do you think it's a bad design?
It's generally agreed on that std::string is a bad design. See GotW #84 for example. That must be a good ten years old...
and also features code-point iterators.
That code point iterator uses pointers and indexes instead of iterators, which means it cannot work as an arbitrary iterator adaptor even though it could with virtually no change, especially since it only requires a forward iterator.
Sorry, I don't understand the reasoning behind that assertion. Please enlighten me.
There is no need for any reasoning: look at the code of your code point iterator. It uses a pointer and indexes, and is therefore not a generic iterator adaptor. Iterating through code points is fully generic and should work for any forward iterator or bidirectional iterator, not just a pointer. Making your iterator random access when it obviously isn't is also a terrible idea. The special case thing has nothing to do there either, it should be a different iterator. I'm not a fan of returning a reference in operator* as well. I also don't understand what mIndex and mEndIndex are for (nor why you compute the size in code points of the string before constructing iterators), since you seem to check data is valid beforehand. And if data is invalid, you have lots of potential for unsafety in your iterators anyway (in _value for example).

On Thu, 10 Feb 2011 15:18:27 +0100 Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 10/02/2011 14:40, Chad Nelson wrote:
This version is substantially better than the original. The design has been somewhat simplified, removing extraneous features like null-string emulation. Each of the classes now contain as many of the std::string functions as I could efficiently add (essentially all of them in utf32_t), including I/O stream functions
Bad design, IMHO.
Not very constructive. *Why* do you think it's a bad design?
It's generally agreed on that std::string is a bad design. See GotW #84 for example. That must be a good ten years old...
Maybe so, but irrelevant in this case. The goal was to make transitioning from std::string to the UTF types as painless as possible, for those who want to do it, and that means duplicating as many of std::string's functions as can efficiently be done.
and also features code-point iterators.
That code point iterator uses pointers and indexes instead of iterators, which means it cannot work as an arbitrary iterator adaptor even though it could with virtually no change, especially since it only requires a forward iterator.
Sorry, I don't understand the reasoning behind that assertion. Please enlighten me.
There is no need for any reasoning: look at the code of your code point iterator. It uses a pointer and indexes, and is therefore not a generic iterator adaptor.
It wasn't meant to be generic. It was meant to be exactly what it is: an iterator specific to the UTF type where it's defined. For that purpose, it's designed exactly as it should be, IMHO.
Iterating through code points is fully generic and should work for any forward iterator or bidirectional iterator, not just a pointer.
I could make it fully generic, but it wouldn't be nearly as efficient that way. I chose to do the extra work to make it efficient.
Making your iterator random access when it obviously isn't is also a terrible idea. The special case thing has nothing to do there either, it should be a different iterator.
It could be a bidirectional iterator, as it has all of the abilities of one. And it could be a random access iterator, as it has all but one of the requirements for that (and in many cases has all of them). Given that choice, I chose to make it a random access iterator.
I'm not a fan of returning a reference in operator* as well.
No choice in that, I ran into at least one STL algorithm under GCC that wouldn't compile if it wasn't a reference, even when it was only being read. I don't remember which one, but it was something important and commonly-used enough that breaking it was not an option.
I also don't understand what mIndex and mEndIndex are for (nor why you compute the size in code points of the string before constructing iterators),
They give me a way to prevent my iterator code from walking off the beginning or end of the underlying string. The only other way to do it would be to store a pointer to the string object in every iterator, or a pair of iterators or pointers to the underlying type, which I considered worse. As an important side benefit, they also provide an efficient way to calculate the difference in code points for operator-, which I feel is important. And the size in code-points is (supposed to be) stored at all times. If it isn't then it's a leftover from an earlier iteration, which I'll be happy to correct, but which should be harmless (because even if that's the case, it's only calculated once per string, then stored and updated).
since you seem to check data is valid beforehand. And if data is invalid, you have lots of potential for unsafety in your iterators anyway (in _value for example).
All the UTF types were very carefully designed so that there's no chance of invalid data in them, barring extraordinary measures to deliberately corrupt it. Anything else? :-) -- Chad Nelson Oak Circle Software, Inc. * * *

Hi Chad,
Like Mathias I'm not very enthusiastic about the approach that you're taking here - but there is plenty of space for different approaches, so if you want to do it like this you are welcome to do so. My own approach has been to:
- Store text in sequence-of-byte containers of whatever sort seem appropriate, i.e. std::string, std::vector<char>, raw memory etc.
- Use iterator adaptors to access that data as UTF-8 when appropriate.
- Use std::algorithms like find(begin,end,what) rather than std::string members.
This works for me, and I recommend it. So I have one comment on this exchange:
Chad Nelson wrote:
There is no need for any reasoning: look at the code of your code point iterator. It uses a pointer and indexes, and is therefore not a generic iterator adaptor.
It wasn't meant to be generic. It was meant to be exactly what it is: an iterator specific to the UTF type where it's defined. For that purpose, it's designed exactly as it should be, IMHO.
Iterating through code points is fully generic and should work for any forward iterator or bidirectional iterator, not just a pointer.
I could make it fully generic, but it wouldn't be nearly as efficient that way. I chose to do the extra work to make it efficient.
I have to challenge your efficiency comment. I have UTF-8 encoding and decoding that works with generic iterators, including pointers, and I have no efficiency issues resulting from its genericity. In fact I spent some time carefully optimising it and I believe that when used with pointers it is as good as I could get by writing it in assembler. Regards, Phil.

On Thu, 10 Feb 2011 21:19:49 +0000 "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
[...] So I have one comment on this exchange:
Iterating through code points is fully generic and should work for any forward iterator or bidirectional iterator, not just a pointer.
I could make it fully generic, but it wouldn't be nearly as efficient that way. I chose to do the extra work to make it efficient.
I have to challenge your efficiency comment. I have UTF-8 encoding and decoding that works with generic iterators, including pointers, and I have no efficiency issues resulting from its genericity. In fact I spent some time carefully optimising it and I believe that when used with pointers it is as good as I could get by writing it in assembler.
You may well be right, for your UTF-8-only code. For my design, it was more efficient to create separate iterators for UTF-8, UTF-16, and UTF-32 than to try to make one completely generic UTF-anything iterator. That's what I meant by "fully generic." -- Chad Nelson Oak Circle Software, Inc. * * *

On 11/02/2011 01:21, Chad Nelson wrote:
You may well be right, for your UTF-8-only code. For my design, it was more efficient to create separate iterators for UTF-8, UTF-16, and UTF-32 than to try to make one completely generic UTF-anything iterator. That's what I meant by "fully generic."
This doesn't make any sense whatsoever. The code to deal with different UTF encodings cannot be the same, so you cannot write generic code to deal with all those encodings.

On Fri, 11 Feb 2011 11:48:50 +0100 Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 11/02/2011 01:21, Chad Nelson wrote:
You may well be right, for your UTF-8-only code. For my design, it was more efficient to create separate iterators for UTF-8, UTF-16, and UTF-32 than to try to make one completely generic UTF-anything iterator. That's what I meant by "fully generic."
This doesn't make any sense whatsoever.
The code to deal with different UTF encodings cannot be the same, so you cannot write generic code to deal with all those encodings.
Just because you don't have the imagination and skill to do it doesn't mean that it can't be done. As a matter of fact, I considered it, and had worked out a design that *did* do it, using only one helper function. I decided against using it. -- Chad Nelson Oak Circle Software, Inc. * * *

On 10/02/2011 18:39, Chad Nelson wrote:
On Thu, 10 Feb 2011 15:18:27 +0100 Mathias Gaunard<mathias.gaunard@ens-lyon.org> wrote:
On 10/02/2011 14:40, Chad Nelson wrote:
This version is substantially better than the original. The design has been somewhat simplified, removing extraneous features like null-string emulation. Each of the classes now contain as many of the std::string functions as I could efficiently add (essentially all of them in utf32_t), including I/O stream functions
Bad design, IMHO.
Not very constructive. *Why* do you think it's a bad design?
It's generally agreed on that std::string is a bad design. See GotW #84 for example. That must be a good ten years old...
Maybe so, but irrelevant in this case. The goal was to make transitioning from std::string to the UTF types as painless as possible, for those who want to do it, and that means duplicating as many of std::string's functions as can efficiently be done.
and also features code-point iterators.
That code point iterator uses pointers and indexes instead of iterators, which means it cannot work as an arbitrary iterator adaptor even though it could with virtually no change, especially since it only requires a forward iterator.
Sorry, I don't understand the reasoning behind that assertion. Please enlighten me.
There is no need for any reasoning: look at the code of your code point iterator. It uses a pointer and indexes, and is therefore not a generic iterator adaptor.
It wasn't meant to be generic. It was meant to be exactly what it is: an iterator specific to the UTF type where it's defined. For that purpose, it's designed exactly as it should be, IMHO.
It was already stated on this list that the ability to deal with arbitrary ranges is more valuable; I am merely stating that your iterator could work with any iterator with virtually no change.
I could make it fully generic, but it wouldn't be nearly as efficient that way. I chose to do the extra work to make it efficient.
Your code never uses the fact that the iterator is a pointer or that memory is stored contiguously. I also don't think that for example your utf-8 iterating strategy is very fast. Your utf-8 decoding itself seems to have lots of repetition and unnecessary tests and memory accesses... Not that it is easy to make that kind of thing fast anyway.
It could be a bidirectional iterator, as it has all of the abilities of one. And it could be a random access iterator, as it has all but one of the requirements for that (and in many cases has all of them). Given that choice, I chose to make it a random access iterator.
It has constant-time distance, which isn't very useful and adds unnecessary overhead to your iterator.
I'm not a fan of returning a reference in operator* as well.
No choice in that, I ran into at least one STL algorithm under GCC that wouldn't compile if it wasn't a reference, even when it was only being read. I don't remember which one, but it was something important and commonly-used enough that breaking it was not an option.
I believe the standard containers indeed require it to be a reference, but I'm not aware of any problems with any implementation. Would you mind telling me what libstdc++ algorithm relies on this?
They give me a way to prevent my iterator code from walking off the beginning or end of the underlying string.
The only other way to do it would be to store a pointer to the string object in every iterator, or a pair of iterators or pointers to the underlying type, which I considered worse.
You seek the next "first" character of a code-unit sequence. That indeed causes problems when you reach the end (unless you put a 0 at the end of your buffer, which isn't such a bad idea). UTF-8 and 16, however, don't require this. You can deduce how many code units you need to consume from the first code unit. You do that in your decoders.
As an important side benefit, they also provide an efficient way to calculate the difference in code points for operator-, which I feel is important.
And the size in code-points is (supposed to be) stored at all times.
What's the point of this? The size in code points is not a very useful thing in general.
All the UTF types were very carefully designed so that there's no chance of invalid data in them, barring extraordinary measures to deliberately corrupt it.
That's a good thing. However, you could use this opportunity to make decoding much faster, since you don't need to check for correctness anymore.
Anything else? :-)
You would probably encounter problems on platforms where int is 8 or 16 bits.
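The remark above about deducing the number of code units from the first code unit can be illustrated with a short sketch. The function name utf8_sequence_length is hypothetical and is not taken from the library under discussion; the ranges follow the UTF-8 definition (lead bytes 0xC2..0xF4 are the only valid multi-byte starters).

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in the UTF-8 sequence that
// starts with `lead`, or 0 if `lead` cannot start a sequence.
constexpr std::size_t utf8_sequence_length(unsigned char lead) {
    return lead < 0x80 ? 1    // 0xxxxxxx: ASCII
         : lead < 0xC2 ? 0    // continuation bytes, and overlong leads C0/C1
         : lead < 0xE0 ? 2    // 110xxxxx
         : lead < 0xF0 ? 3    // 1110xxxx
         : lead < 0xF5 ? 4    // 11110xxx, up to U+10FFFF
         : 0;                 // F5..FF never appear in UTF-8
}
```

An iterator can advance by this length instead of scanning forward for the next lead byte, which is why no sentinel or end index is needed when the data is known to be valid.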

On Fri, 11 Feb 2011 00:07:58 +0100 Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
There is no need for any reasoning: look at the code of your code point iterator. It uses a pointer and indexes, and is therefore not a generic iterator adaptor.
It wasn't meant to be generic. It was meant to be exactly what it is: an iterator specific to the UTF type where it's defined. For that purpose, it's designed exactly as it should be, IMHO.
It was already stated on this list that the ability to deal with arbitrary ranges is more valuable; I am merely stating that your iterator could work with any iterator with virtually no change.
Either I missed that statement, or I don't recognize it in this context, because I'm not sure what you mean.
I could make it fully generic, but it wouldn't be nearly as efficient that way. I chose to do the extra work to make it efficient.
Your code never uses the fact that the iterator is a pointer or that memory is stored contiguously.
How would you suggest that it use that information? I think we have different definitions of generic. As I said in my last message, I define fully generic as working with any UTF-encoded string (not just UTF-8). That would be possible, but would almost certainly be less processor-efficient than having iterators customized to each type.
I also don't think that for example your utf-8 iterating strategy is very fast. Your utf-8 decoding itself seems to have lots of repetition and unnecessary tests and memory accesses...
Not that it is easy to make that kind of thing fast anyway.
There's always room for further optimization, at the cost of more programmer time and more code.
It could be a bidirectional iterator, as it has all of the abilities of one. And it could be a random access iterator, as it has all but one of the requirements for that (and in many cases has all of them). Given that choice, I chose to make it a random access iterator.
It has constant-time distance, which isn't very useful and adds unnecessary overhead to your iterator.
I disagree. I find it extremely useful to have a true random access iterator for some strings (most, in UTF-16, and arguably most in many cases of UTF-8 too), and an emulated one for the rest. And for that, the overhead isn't unnecessary.
I'm not a fan of returning a reference in operator* as well.
No choice in that, I ran into at least one STL algorithm under GCC that wouldn't compile if it wasn't a reference, even when it was only being read. I don't remember which one, but it was something important and commonly-used enough that breaking it was not an option.
I believe the standard containers indeed require it to be a reference, but I'm not aware of any problems with any implementation.
Would you mind telling me what libstdc++ algorithm relies on this?
If I recalled which one it was, I would have put it in the original message.
They give me a way to prevent my iterator code from walking off the beginning or end of the underlying string.
The only other way to do it would be to store a pointer to the string object in every iterator, or a pair of iterators or pointers to the underlying type, which I considered worse.
You seek the next "first" character of a code-unit sequence. That indeed causes problems when you reach the end (unless you put a 0 at the end of your buffer, which isn't such a bad idea). UTF-8 and 16, however, don't require this. You can deduce how many code units you need to consume from the first code unit. You do that in your decoders.
And I didn't want to duplicate that code in the iterators as well, or separate it out and possibly add the overhead of another function call to the decoder functions.
As an important side benefit, they also provide an efficient way to calculate the difference in code points for operator-, which I feel is important.
And the size in code-points is (supposed to be) stored at all times.
What's the point of this? The size in code points is not a very useful thing in general.
On the contrary, keeping track of the length of the string is *very* useful. The alternative is to calculate it on the fly, whenever someone asks for it. If you want that, you can always go back to using C-style strings.
All the UTF types were very carefully designed so that there's no chance of invalid data in them, barring extraordinary measures to deliberately corrupt it.
That's a good thing. However, you could use this opportunity to make decoding much faster, since you don't need to check for correctness anymore.
As I said, there's always room for further optimization.
Anything else? :-)
You would probably encounter problems on platforms where int is 8 or 16 bits.
I haven't seen a platform where an int is 16 bits since DOS, which I stopped coding for in the late nineties. And I've never seen one where it's eight bits. Do you know of any modern platform -- as in one that uses Unicode, and could usefully use this library -- where that's the case? -- Chad Nelson Oak Circle Software, Inc. * * *
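On the int-width question, one way to sidestep it entirely is to do code-point arithmetic in char32_t rather than int. The following is only a sketch under that assumption; combine_surrogates is a hypothetical name and is not part of the library being discussed.

```cpp
#include <climits>

// char32_t has the same width as std::uint_least32_t, so this holds on any
// conforming C++11 implementation regardless of how wide int is.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32,
              "char32_t must hold any Unicode code point");

// Hypothetical helper: combines a UTF-16 surrogate pair into one code point.
// Assumption: `high` is in D800..DBFF and `low` is in DC00..DFFF.
constexpr char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

The shift by 10 would overflow a 16-bit int, which is the kind of problem raised above; casting to char32_t first keeps the intermediate values wide enough.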

On Thu, Feb 10, 2011 at 16:52, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Fri, 11 Feb 2011 00:07:58 +0100 Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
The size in code points is not a very useful thing in general.
On the contrary, keeping track of the length of the string is *very* useful. The alternative is to calculate it on the fly, whenever someone asks for it. If you want that, you can always go back to using C-style strings.
I understand why it's useful to know how long it is in encoding units, but the number of code points seems quite useless to me. Can you elaborate? ~ Scott

On Thu, 10 Feb 2011 18:57:00 -0800 Scott McMurray <me22.ca+boost@gmail.com> wrote:
On Thu, Feb 10, 2011 at 16:52, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
The size in code points is not a very useful thing in general.
On the contrary, keeping track of the length of the string is *very* useful. The alternative is to calculate it on the fly, whenever someone asks for it. If you want that, you can always go back to using C-style strings.
I understand why it's useful to know how long it is in encoding units, but the number of code points seems quite useless to me.
Can you elaborate?
The size in code-points *is* the size of the string, according to the view of the string that the class exposes. -- Chad Nelson Oak Circle Software, Inc. * * *

On Thu, Feb 10, 2011 at 21:41, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
On Thu, 10 Feb 2011 18:57:00 -0800 Scott McMurray <me22.ca+boost@gmail.com> wrote:
I understand why it's useful to know how long it is in encoding units, but the number of code points seems quite useless to me.
Can you elaborate?
The size in code-points *is* the size of the string, according to the view of the string that the class exposes.
Ok, but what would I actually want to use that for?

On Fri, 11 Feb 2011 20:23:58 -0800 Scott McMurray <me22.ca+boost@gmail.com> wrote:
On Thu, Feb 10, 2011 at 21:41, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
I understand why it's useful to know how long it is in encoding units, but the number of code points seems quite useless to me.
Can you elaborate?
The size in code-points *is* the size of the string, according to the view of the string that the class exposes.
Ok, but what would I actually want to use that for?
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for. -- Chad Nelson Oak Circle Software, Inc. * * *

On 02/12/2011 05:57 AM, Chad Nelson wrote:
On Fri, 11 Feb 2011 20:23:58 -0800 Scott McMurray<me22.ca+boost@gmail.com> wrote:
On Thu, Feb 10, 2011 at 21:41, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
I understand why it's useful to know how long it is in encoding units, but the number of code points seems quite useless to me.
Can you elaborate?
The size in code-points *is* the size of the string, according to the view of the string that the class exposes.
Ok, but what would I actually want to use that for?
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for.
std::string::length specifies the amount of memory required to represent it as encoded, and is useful if you intend to pass it to something else as a char array, length pair. Given that number of code points is directly related to neither the memory required nor the number of logical characters/glyphs/size it will take up to display, it seems it is unlikely to be useful in many cases. In cases where there is a limit of the maximum length of a string, I believe that is almost certainly going to be in terms of the encoded length in a particular encoding (e.g. UTF-8 or UTF-16), rather than in code points.
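The distinction drawn here between encoded length and code-point count can be made concrete with a small sketch. count_code_points is a hypothetical helper, not a function of the UTF String library, and it assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in valid UTF-8 by counting every
// byte that is not a continuation byte (10xxxxxx).
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301, a combining acute accent), std::string::size() reports 3 code units, count_code_points reports 2 code points, and a reader sees one character. The three numbers answer three different questions.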

On Sat, Feb 12, 2011 at 8:00 PM, Jeremy Maitin-Shepard <jeremy@jeremyms.com> wrote:
On 02/12/2011 05:57 AM, Chad Nelson wrote:
On Fri, 11 Feb 2011 20:23:58 -0800 Scott McMurray<me22.ca+boost@gmail.com> wrote:
On Thu, Feb 10, 2011 at 21:41, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
I understand why it's useful to know how long it is in encoding units, but the number of code points seems quite useless to me.
Can you elaborate?
The size in code-points *is* the size of the string, according to the view of the string that the class exposes.
Ok, but what would I actually want to use that for?
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for.
std::string::length specifies the amount of memory required to represent it as encoded, and is useful if you intend to pass it to something else as a char array, length pair. Given that number of code points is directly related to neither the memory required nor the number of logical characters/glyphs/size it will take up to display, it seems it is unlikely to be useful in many cases. In cases where there is a limit of the maximum length of a string, I believe that is almost certainly going to be in terms of the encoded length in a particular encoding (e.g. UTF-8 or UTF-16), rather than in code points.
How about size() returning the required storage size for the string as in number of bytes and length() returning the number of code points? length() could be used for example when allocating an array of code-points (char32_t) where the string could be 'expanded' from UTF-8 for algorithms that require true random-access. Matus
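The "expand to an array of code points" use case could be sketched as follows. expand_to_utf32 is a hypothetical function, and the code_points parameter stands in for whatever the string class would report as its length in code points; the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes valid UTF-8 into a UTF-32 string so that
// individual code points can be reached by index in constant time.
// `code_points` is the known code-point count, used to allocate exactly once.
std::u32string expand_to_utf32(const std::string& utf8,
                               std::size_t code_points) {
    std::u32string out;
    out.reserve(code_points);
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const std::size_t extra =
            lead < 0x80 ? 0 : lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        // Payload bits of the lead byte: 7, 5, 4 or 3 of them.
        char32_t cp = lead < 0x80 ? lead : (lead & (0x3F >> extra));
        for (std::size_t k = 1; k <= extra; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Knowing the code-point count up front only saves reallocations here; the decoding loop itself does not depend on it.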

On Sat, 12 Feb 2011 20:19:09 +0100 Matus Chochlik <chochlik@gmail.com> wrote:
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for.
std::string::length specifies the amount of memory required to represent it as encoded, and is useful if you intend to pass it to something else as a char array, length pair. Given that number of code points is directly related to neither the memory required nor the number of logical characters/glyphs/size it will take up to display, it seems it is unlikely to be useful in many cases. [...]
How about size() returning the required storage size for the string as in number of bytes and length() returning the number of code points?
Wouldn't that confuse any STL algorithm that uses the number of elements? Anything that cares about the number of elements seems to use size() to retrieve it, since length() is only provided by strings. In any case, both measurements are easily available already. T.length() (or T.size()) gives the length in code-points, i.e. the size it would be as a UTF-32 string. T.coded() exposes the underlying encoded type, so T.coded().length() gives the amount of memory needed for the encoded data.
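To illustrate the two measurements described above, here is a deliberately simplified sketch. It is not the library's implementation: utf8_text is a hypothetical class, only the member names length(), size() and coded() are borrowed from the message, and validation (which the real types are said to perform on input) is omitted.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical wrapper: stores UTF-8 and caches the code-point count.
// Assumption: the caller passes valid UTF-8; no validation is done here.
class utf8_text {
public:
    explicit utf8_text(std::string encoded)
        : encoded_(std::move(encoded)), code_points_(0) {
        // Count once at construction; every non-continuation byte starts
        // a new code point.
        for (unsigned char c : encoded_) {
            if ((c & 0xC0) != 0x80) ++code_points_;
        }
    }

    // Length in code points; size() is a synonym, as in std::string.
    std::size_t length() const { return code_points_; }
    std::size_t size() const { return code_points_; }

    // The underlying encoded data; coded().length() is the byte count.
    const std::string& coded() const { return encoded_; }

private:
    std::string encoded_;
    std::size_t code_points_;
};
```

With utf8_text t("na\xC3\xAFve"), t.length() is 5 while t.coded().length() is 6, which is the difference between the code-point view and the storage requirement.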
length() could be used for example when allocating an array of code-points (char32_t) where the string could be 'expanded' from UTF-8 for algorithms that require true random-access.
True, though the utf32_t type makes that unnecessary most of the time. -- Chad Nelson Oak Circle Software, Inc. * * *

On 12 February 2011 13:19, Matus Chochlik <chochlik@gmail.com> wrote:
How about size() returning the required storage size for the string as in number of bytes and length() returning the number of code points?
Please pick a different name. size() and length() are synonyms as far as std::string is concerned; changing that would cause needless frustration for users migrating to a different string class. -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

On Mon, Feb 14, 2011 at 07:51, Nevin Liber <nevin@eviloverlord.com> wrote:
Please pick a different name. size() and length() are synonyms as far as std::string is concerned; changing that would cause needless frustration for users migrating to a different string class.
+1
Though perhaps the right way to solve this would be to have the string class only directly expose things that work equivalently at the code unit, code point, or grapheme cluster level, even if that means only copying, exact equality, and concatenation -- no iteration, length, etc without looking at an encoding or grouping explicitly. Let all the other operations use whatever view they prefer.

On Sat, 12 Feb 2011 11:00:31 -0800 Jeremy Maitin-Shepard <jeremy@jeremyms.com> wrote:
The size in code-points *is* the size of the string, according to the view of the string that the class exposes.
Ok, but what would I actually want to use that for?
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for.
std::string::length specifies the amount of memory required to represent it as encoded, and is useful if you intend to pass it to something else as a char array, length pair. Given that number of code points is directly related to neither the memory required nor the number of logical characters/glyphs/size it will take up to display, it seems it is unlikely to be useful in many cases.
But for those few cases where it *would* be useful, I see no reason not to provide it. It costs essentially nothing, since the count is originally provided by the same function that validates the encoded data when it's put into a UTF type, and is used for other things as well. And people are used to being able to retrieve the size of a string; eliminating that function would discomfort some developers.
In cases where there is a limit on the maximum length of a string, I believe that is almost certainly going to be in terms of the encoded length in a particular encoding (e.g. UTF-8 or UTF-16), rather than in code points.
Well, that's easily available too, via T.coded().length(). -- Chad Nelson Oak Circle Software, Inc. * * *

Jeremy Maitin-Shepard <jeremy <at> jeremyms.com> writes:
In cases where there is a limit on the maximum length of a string, I believe that is almost certainly going to be in terms of the encoded length in a particular encoding (e.g. UTF-8 or UTF-16), rather than in code points.
Cutting any variable-width encoded string after a certain number of code units is about as useful as cutting a dollar bill in half: once you have done it, it loses its value. Having said that, the same applies to cutting a string after a certain number of code points, though that is more like tearing a corner off a bill. Counting graphemes or grapheme clusters is usually the way to go. Regards, Anders Dalvander -- WWFSMD?
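The difference between cutting at a code unit and cutting at a code point can be sketched as follows. This assumes valid UTF-8 in a std::string; truncate_utf8 is a hypothetical helper, not something any library in this thread provides.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest prefix of `utf8` that fits in
// `max_bytes` without splitting a multi-byte sequence. Assumes valid UTF-8.
std::string truncate_utf8(const std::string& utf8, std::size_t max_bytes) {
    if (utf8.size() <= max_bytes) {
        return utf8;
    }
    std::size_t cut = max_bytes;
    // Back up while the byte at the cut position is a continuation byte
    // (10xxxxxx), i.e. while the cut would land inside a code point.
    while (cut > 0 &&
           (static_cast<unsigned char>(utf8[cut]) & 0xC0) == 0x80) {
        --cut;
    }
    return utf8.substr(0, cut);
}
```

This only avoids producing an invalid byte sequence. As noted above, it can still split a grapheme cluster (for example, a base letter from a following combining mark), which needs Unicode segmentation data that this sketch does not attempt.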

On Sat, Feb 12, 2011 at 05:57, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for.
Seeing if a string will "fit", under various meanings: * wrapping (like 80-column console lines) * fixed-width fields (like ID3v1) Neither of which are things that are applicable at codepoint level :) I've never seen anything best done at codepoint level. Comparing, rendering, and storing text all are done at different non-codepoint levels.

On Sun, Feb 13, 2011 at 3:17 PM, Scott McMurray <me22.ca+boost@gmail.com> wrote:
On Sat, Feb 12, 2011 at 05:57, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for.
Seeing if a string will "fit", under various meanings: * wrapping (like 80-column console lines) * fixed-width fields (like ID3v1)
Neither of which are things that are applicable at codepoint level :)
I've never seen anything best done at codepoint level. Comparing, rendering, and storing text all are done at different non-codepoint levels.
+1. I can't see any good use case for code points anywhere in the default interface. Operations are either dumb (can operate on code units) or complex (need grapheme clusters). While I'm sure an in-between case exists somewhere, I think it will be quite rare. The best design imho would expose code units and provide views for code points and grapheme clusters. -- Cory Nelson http://int64.org

On Sun, 13 Feb 2011 15:17:30 -0800 Scott McMurray <me22.ca+boost@gmail.com> wrote:
On Sat, Feb 12, 2011 at 05:57, Chad Nelson <chad.thecomfychair@gmail.com> wrote:
What do you use string.length() for? :-) Efficiently providing an answer to that is one of several things the UTF string classes keep track of it for.
Seeing if a string will "fit", under various meanings: * wrapping (like 80-column console lines) * fixed-width fields (like ID3v1)
Neither of which are things that are applicable at codepoint level :)
I don't know what the code-point length might be needed for, though that doesn't mean much as I haven't played with Unicode for very long. But std::string has that capability, so anyone who wants to use the UTF classes as std::string equivalents, or even replacements -- the target audience for the library -- will expect it, whether they end up using it or not.
I've never seen anything best done at codepoint level. Comparing, rendering, and storing text all are done at different non-codepoint levels.
Anything that operates at a higher level, such as glyphs or true Unicode characters, has to operate on code-points. -- Chad Nelson Oak Circle Software, Inc. * * *

On 11/02/2011 01:52, Chad Nelson wrote:
It was already stated on this list that the ability to deal with arbitrary ranges is more valuable; I am merely stating that your iterator could work with any iterator with virtually no change.
Either I missed that statement, or I don't recognize it in this context, because I'm not sure what you mean.
Which statement are you referring to? The piece you quoted contains two different statements.
I could make it fully generic, but it wouldn't be nearly as efficient that way. I chose to do the extra work to make it efficient.
Your code never uses the fact that the iterator is a pointer or that memory is stored contiguously.
How would you suggest that it use that information?
You said it was more efficient to only work with pointers than to work with arbitrary iterators. Again, I'm merely stating that you are never relying on the fact that your pointer is a pointer, so your code could very well work with any iterator.
I think we have different definitions of generic. As I said in my last message, I define fully generic as working with any UTF-encoded string (not just UTF-8). That would be possible, but would almost certainly be less processor-efficient than having iterators customized to each type.
I don't see what that has to do with genericity, nor how that would incur any runtime overhead. Just dispatch (through template specialization or overloading) to different iterator types depending on the size type of the underlying data range.
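A rough sketch of that kind of compile-time dispatch, under the assumption that the code-unit width alone identifies the encoding (1 byte for UTF-8, 2 for UTF-16, 4 for UTF-32) and that the input is valid. All names here (the tag types, encoding_for_size, code_point_count) are hypothetical and come from neither library under discussion.

```cpp
#include <cstddef>
#include <iterator>

// Hypothetical tags naming the encoding implied by the code-unit width.
struct utf8_tag {};
struct utf16_tag {};
struct utf32_tag {};

// Primary template is left undefined so unsupported widths fail to compile.
template <std::size_t CodeUnitSize> struct encoding_for_size;
template <> struct encoding_for_size<1> { using type = utf8_tag; };
template <> struct encoding_for_size<2> { using type = utf16_tag; };
template <> struct encoding_for_size<4> { using type = utf32_tag; };

// UTF-8: count every byte that is not a continuation byte (10xxxxxx).
template <typename It>
std::size_t count_impl(It first, It last, utf8_tag) {
    std::size_t n = 0;
    for (; first != last; ++first)
        if ((static_cast<unsigned char>(*first) & 0xC0) != 0x80) ++n;
    return n;
}

// UTF-16: count every unit that is not a low surrogate (0xDC00..0xDFFF).
template <typename It>
std::size_t count_impl(It first, It last, utf16_tag) {
    std::size_t n = 0;
    for (; first != last; ++first)
        if ((static_cast<unsigned>(*first) & 0xFC00u) != 0xDC00u) ++n;
    return n;
}

// UTF-32: one code unit per code point.
template <typename It>
std::size_t count_impl(It first, It last, utf32_tag) {
    return static_cast<std::size_t>(std::distance(first, last));
}

// Entry point: the overload is chosen at compile time from the size of the
// iterator's value type, so no runtime branch on the encoding is needed.
template <typename It>
std::size_t code_point_count(It first, It last) {
    using unit = typename std::iterator_traits<It>::value_type;
    return count_impl(first, last,
                      typename encoding_for_size<sizeof(unit)>::type{});
}
```

Because the functions take an arbitrary iterator pair, the same code works over pointers, std::string iterators, or any other range of code units.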
There's always room for further optimization, at the cost of more programmer time and more code.
Well, a tool for encoding/decoding UTF strings is only as good as: - the quality of its codec implementations - how flexible those codecs are to be able to work with the user's data I'm afraid your library is not particularly good on the first point, and is rather bad on the second.
It has constant-time distance, which isn't very useful and adds unnecessary overhead to your iterator.
I disagree. I find it extremely useful to have a true random access iterator for some strings (most, in UTF-16, and arguably most in many cases of UTF-8 too), and an emulated one for the rest. And for that, the overhead isn't unnecessary.
constant-time distance does not give you random access, so I don't know where you're going with this. If you want pseudo-random access, use std::advance.
If I recalled which one it was, I would have put it in the original message.
So you base your claims on vague memories, I see. I also see your iterators are missing operator->. It is usually better to use boost::iterator_facade or boost::iterator_adaptor to define iterators.
And I didn't want to duplicate that code in the iterators as well, or separate it out and possibly add the overhead of another function call to the decoder functions.
The fact you would need to duplicate this is proof of how inflexible your design is. So you're adding overhead with meaningless data and essentially computing redundant things because you can't restructure your code to do it the right way. Interesting.
On the contrary, keeping track of the length of the string is *very* useful. The alternative is to calculate it on the fly, whenever someone asks for it. If you want that, you can always go back to using C-style strings.
Again, extracting the size in code points of a string is not a particularly useful operation, so I don't really see the point of maintaining it. Do you have real examples of where it is useful to have that operation be O(1) instead of O(n)?
That's a good thing. However, you could use this opportunity to make decoding much faster, since you don't need to check for correctness anymore.
As I said, there's always room for further optimization.
Well, the whole point of enforcing validity is to make use of it; not making use of it doesn't really demonstrate that possibility in your design. I'm afraid doing it with your design would also require a lot of code duplication.
You would probably encounter problems on platforms where int is 8 or 16 bits.
I haven't seen a platform where an int is 16 bits since DOS, which I stopped coding for in the late nineties. And I've never seen one where it's eight bits. Do you know of any modern platform -- as in one that uses Unicode, and could usefully use this library -- where that's the case?
If you want to be included in Boost, it is good measure to not restrict yourself to non-portable assertions when there is absolutely no need to or no gain from doing so. As far as I know, only DSPs these days have such properties, and it does seem unlikely one would want to use these for Unicode text processing.

On Fri, 11 Feb 2011 11:44:03 +0100 Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 11/02/2011 01:52, Chad Nelson wrote:
It has constant-time distance, which isn't very useful and adds unnecessary overhead to your iterator.
I disagree. I find it extremely useful to have a true random access iterator for some strings (most, in UTF-16, and arguably most in many cases of UTF-8 too), and an emulated one for the rest. And for that, the overhead isn't unnecessary.
constant-time distance does not give you random access, so I don't know where you're going with this.
You couldn't have missed the "special case" code in the iterators, which provides true random access if the string contains no multi-element-encoded code-points (which is usually the case for UTF-16, and often for UTF-8), because you criticized that in an earlier message. So I have to assume that you aren't attempting to provide honest feedback, and that your motivation is simply to attack the library. Why? Hm, you're developing a Boost.Unicode proposal yourself, aren't you? Competition is such an inconvenient thing. Don't bother responding, I will not waste my time with you any further. -- Chad Nelson Oak Circle Software, Inc. * * *
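The "special case" idea described here can be sketched roughly as follows. This is not the library's actual code; counted_utf8 and offset_of are hypothetical names, and the sketch assumes valid UTF-8. It shows how a cached code-point count lets an index lookup take a constant-time path when no code point is multi-byte, and fall back to a linear walk otherwise.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sketch: a UTF-8 buffer paired with a cached code-point count.
class counted_utf8 {
public:
    explicit counted_utf8(std::string bytes)
        : bytes_(std::move(bytes)), code_points_(0) {
        // The count is gathered in one pass at construction time.
        for (unsigned char b : bytes_)
            if ((b & 0xC0) != 0x80) ++code_points_;
    }

    // O(1): the count was computed once, up front.
    std::size_t length() const { return code_points_; }

    // Byte offset of the n-th code point (caller ensures n < length()).
    std::size_t offset_of(std::size_t n) const {
        // Fast path: as many code points as bytes means every code point
        // is a single byte, so the index is the offset.
        if (code_points_ == bytes_.size()) return n;
        // Slow path: walk the buffer, counting lead bytes.
        std::size_t offset = 0;
        for (std::size_t seen = 0; offset < bytes_.size(); ++offset) {
            if ((static_cast<unsigned char>(bytes_[offset]) & 0xC0) != 0x80) {
                if (seen == n) break;
                ++seen;
            }
        }
        return offset;
    }

private:
    std::string bytes_;
    std::size_t code_points_;
};
```

The trade-off argued over in this thread is visible in the sketch: the fast path exists only because the count is maintained, and the slow path is still linear.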

On 11/02/2011 15:00, Chad Nelson wrote:
So I have to assume that you aren't attempting to provide honest feedback, and that your motivation is simply to attack the library. Why? Hm, you're developing a Boost.Unicode proposal yourself, aren't you? Competition is such an inconvenient thing.
Don't bother responding, I will not waste my time with you any further.
Well, I was being nice by giving you feedback because I have some experience in the field. I don't think you'd get much feedback otherwise, since you're just doing "yet another unicode/string library proposal", with no explanation or docs at all (doxygen reference doesn't count as documentation, it's just a reference for the user, it doesn't explain the design of the library). But obviously you prefer to see a personal attack rather than think about the potential design flaws that I noticed from a quick look at your library. I'm trying to help you there, or at least make you realize there are already several related efforts that could be better than yours in certain respects. If you want to eventually be able to submit your library for review, it would be nice to know how it stands compared to other proposed approaches. Of course, mine is better, but that goes without saying ;).

Chad Nelson wrote:
Mathias Gaunard wrote:
I'm not a fan of returning a reference in operator* as well.
No choice in that, I ran into at least one STL algorithm under GCC that wouldn't compile if it wasn't a reference, even when it was only being read. I don't remember which one, but it was something important and commonly-used enough that breaking it was not an option.
To satisfy the requirements of a forward iterator, you must return a reference in operator*. A long time ago I asked on this list whether output iterator also has this requirement, because Boost.Iterator didn't declare an iterator an output iterator if this condition was not satisfied (thereby causing problems for Boost.MultiArray on MSVC-10). I'm now pretty sure that the requirement is definitely there for forward iterators and definitely absent for output iterators, so that the behavior of Boost.Iterator with respect to output iterators can at least be regarded as undesirable. The take-away for this discussion is that forward iterator and any of its refinements like bidirectional iterator and random access iterator have to return a reference in operator*. Regards, Thomas

On 2/11/2011 11:21 AM, Thomas Klimpel wrote: [...]
The take-away for this discussion is that forward iterator and any of its refinements like bidirectional iterator and random access iterator have to return a reference in operator*.
Just curious, do you have any idea what standard library implementations at present actually rely on this? Was this addressed at all in C++0x? I.e., were the iterator concepts formally "orthogonalized" to separate access concepts from traversal concepts, as outlined in the Boost.Iterator library? 'Cause disallowing proxy references on random access traversal and bidirectional traversal iterators...that sucks. Sorry to change the topic... :/ - Jeff

On 11.02.2011, at 20:36, Jeffrey Lee Hellrung, Jr. wrote:
On 2/11/2011 11:21 AM, Thomas Klimpel wrote: [...]
The take-away for this discussion is that forward iterator and any of its refinements like bidirectional iterator and random access iterator have to return a reference in operator*.
Just curious, do you have any idea what standard library implementations at present actually rely on this?
Was this addressed at all in C++0x? I.e., were the iterator concepts formally "orthogonalized" to separate access concepts from traversal concepts, as outlined in the Boost.Iterator library?
'Cause disallowing proxy references on random access traversal and bidirectional traversal iterators...that sucks.
IIRC, the issue was supposed to get addressed by the formal concepts for iterators. But when concepts were dropped, iterators reverted to their C++03 state, without any time remaining to bring in orthogonal iterator concepts. Sebastian

Sebastian Redl wrote:
On 11.02.2011, at 20:36, Jeffrey Lee Hellrung, Jr. wrote:
'Cause disallowing proxy references on random access traversal and bidirectional traversal iterators...that sucks.
IIRC, the issue was supposed to get addressed by the formal concepts for iterators. But when concepts were dropped, iterators reverted to their C++03 state, without any time remaining to bring in orthogonal iterator concepts.
If I understood Dave correctly, he is no longer convinced that the orthogonal iterator concept would have been the correct solution. I don't even understand the advantages of disallowing proxy references. What prevents generic code from just using iterator_traits<Iterator>::reference when the (proxy-) reference returned from an iterator must be saved temporarily? What I can understand is to require that the iterators from the standard containers should not use proxy references, because it would force iterator_traits<iterator_type>::reference into normal non-generic user code. As many "normal" users might not be too familiar with "iterator_traits", it is desirable to avoid that situation. However, if some algorithms of an STL implementation have difficulties with iterators returning proxy references, I think it is something that could easily be fixed if some standard would say that proxy references are allowed. Regards, Thomas
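As a small illustration of the iterator_traits<Iterator>::reference point, here is a sketch using std::vector<bool>, whose iterators return a proxy object from operator*. The helper assign_through is hypothetical and exists only for this example.

```cpp
#include <iterator>
#include <vector>

// Hypothetical generic helper: stores the result of *it in the iterator's
// declared reference type, so it works whether that type is a true
// reference (e.g. int&) or a proxy (e.g. std::vector<bool>::reference).
template <typename Iterator>
void assign_through(Iterator it,
                    typename std::iterator_traits<Iterator>::value_type value) {
    typename std::iterator_traits<Iterator>::reference ref = *it;
    ref = value;  // writes to the element in both cases
}
```

Generic code that instead wrote `auto& ref = *it;` would compile for std::vector<int> but not for std::vector<bool>, since a non-const lvalue reference cannot bind to the temporary proxy.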

On Fri, 11 Feb 2011 20:21:07 +0100 Thomas Klimpel <Thomas.Klimpel@synopsys.com> wrote:
Chad Nelson wrote:
I'm not a fan of returning a reference in operator* as well.
No choice in that, I ran into at least one STL algorithm under GCC that wouldn't compile if it wasn't a reference, even when it was only being read. I don't remember which one, but it was something important and commonly-used enough that breaking it was not an option.
To satisfy the requirements of a forward iterator, you must return a reference in operator*. [...] The take-away for this discussion is that forward iterator and any of its refinements like bidirectional iterator and random access iterator have to return a reference in operator*.
That would certainly explain the behavior I saw. Thanks for the information. -- Chad Nelson Oak Circle Software, Inc. * * *
participants (12)
- Anders Dalvander
- Chad Nelson
- Cory Nelson
- Jeffrey Lee Hellrung, Jr.
- Jeremy Maitin-Shepard
- Mathias Gaunard
- Matus Chochlik
- Nevin Liber
- Phil Endecott
- Scott McMurray
- Sebastian Redl
- Thomas Klimpel