[UTF String] Feedback on UTF String library, please

Chad Nelson

11 Feb 2011

2:14 p.m.

I reiterate my request from this message <http://permalink.gmane.org/gmane.comp.lib.boost.devel/214601>: would those interested in the UTF String library please comment on the new version? Specifically, I'm looking for feedback on:

* Technical opinions on the design;
* How useful it would be to you as-is;
* What would make it more useful to you.

The code can be found here: <http://www.oakcircle.com/toolkit.html>.

Thanks in advance.

-- Chad Nelson Oak Circle Software, Inc. * * *


Phil Endecott

11 Feb 2011

5:22 p.m.

Chad Nelson wrote:

...

I reiterate my request from this message <http://permalink.gmane.org/gmane.comp.lib.boost.devel/214601>: would those interested in the UTF String library please comment on the new version?

Hi Chad,

It looks like you've started a new thread in the hope of "throwing Mathias off the scent". His tone has indeed been unusually aggressive and it would be good to get some other people's comments - but to me, it looks like he has made a fundamentally accurate analysis of your proposal.

For example, you have this "almost" random-access feature that, IIUC, for UTF-8 will give you O(1) random access if you have only ASCII characters and for UTF-16 will give you O(1) random access if you have only BMP characters. That's just horrible! Imagine an email application that slows down by a factor of N when it receives a message with a non-BMP character in it - yuk.

IMHO, if you want to add these features, they should be done in a way that prevents the user from accidentally misusing them. Rather than this drop-in replacement for std::string that misbehaves at run time, I would prefer something that requires some work to make it fit, but then behaves as expected.

Regards, Phil.
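To make the complaint above concrete, here is a minimal C++ sketch of how an "almost" random-access lookup might behave for UTF-8. The helper names `is_ascii` and `utf8_offset_of` are hypothetical and are not taken from the UTF String library; the sketch also assumes the input is well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True when every byte is below 0x80, i.e. one byte per code point.
inline bool is_ascii(const std::string& s)
{
    for (unsigned char c : s)
        if (c >= 0x80) return false;
    return true;
}

// Hypothetical helper: byte offset of the n-th code point (0-based) in a
// well-formed UTF-8 string. With ascii_only set, the answer is n itself,
// so the lookup is O(1). Otherwise every preceding code point has to be
// stepped over, so the lookup is O(N).
inline std::size_t utf8_offset_of(const std::string& s, std::size_t n,
                                  bool ascii_only)
{
    if (ascii_only) {
        assert(n <= s.size());
        return n;  // index == byte offset
    }

    std::size_t offset = 0;
    while (n > 0 && offset < s.size()) {
        ++offset;  // step past the lead byte
        // ...and past any continuation bytes (bit pattern 10xxxxxx).
        while (offset < s.size() &&
               (static_cast<unsigned char>(s[offset]) & 0xC0) == 0x80)
            ++offset;
        --n;
    }
    return offset;
}
```

The same call is constant-time for one input and linear for another, and nothing at the call site shows which case applies. A single non-ASCII character anywhere in the string is enough to switch every lookup to the slow path.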

Chad Nelson

5:44 p.m.

On Fri, 11 Feb 2011 17:22:50 +0000 "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:

...

Chad Nelson wrote:

...
I reiterate my request from this message <http://permalink.gmane.org/gmane.comp.lib.boost.devel/214601>: would those interested in the UTF String library please comment on the new version?

It looks like you've started a new thread in the hope of "throwing Mathias off the scent".

More to leave the bad taste of the previous thread behind, for myself and other potential participants.

...

His tone has indeed been unusually aggressive and it would be good to get some other people's comments - but to me, it looks like he has made a fundamentally accurate analysis of your proposal.

Maybe so, I couldn't tell the good feedback from the bad.

...

For example, you have this "almost" random-access feature that, IIUC, for UTF-8 will give you O(1) random access if you have only ASCII characters and for UTF-16 will give you O(1) random access if you have only BMP characters. That's just horrible! [...]

If you put it that way, you're right. I assumed that the developer using the library would read the documentation and know that the iterators weren't always true random-access, but that assumption doesn't stand up to conscious examination.

...

IMHO, if you want to add these features, they should be done in a way that prevents the user from accidentally misusing them. Rather than this drop-in replacement for std::string that misbehaves at run time, I would prefer something that requires some work to make it fit, but then behaves as expected.

Perhaps renaming the current iterator types to "fast_iterator", and making the standard iterators bidirectional ones? I'd hate to saddle all UTF strings with only bidirectional iterators when many of them can use the much faster random-access ones. -- Chad Nelson Oak Circle Software, Inc. * * *
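One way to read the proposal above: the default iterator offers only bidirectional traversal, so its cost is the same whatever the string contains, and the faster random-access type has to be requested by name. A minimal sketch of such a default iterator for UTF-8 follows. The class name `utf8_bidir_iterator` and its constructor are hypothetical, the code assumes well-formed UTF-8, and nothing here is taken from the library itself.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical sketch, not the UTF String library's API: a bidirectional
// iterator over the code points of a std::string holding well-formed UTF-8.
class utf8_bidir_iterator {
public:
    using iterator_category = std::bidirectional_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;  // decoded on the fly, returned by value

    utf8_bidir_iterator(const std::string& s, std::size_t pos)
        : s_(&s), pos_(pos) {}

    // Decode the code point that starts at the current lead byte.
    char32_t operator*() const {
        assert(pos_ < s_->size());
        const unsigned char lead = byte(pos_);
        if (lead < 0x80) return lead;  // single-byte (ASCII) code point
        const int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
        char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
        for (int i = 1; i <= extra; ++i)
            cp = (cp << 6) | (byte(pos_ + i) & 0x3F);
        return cp;
    }

    // Step forward past the current code point's continuation bytes.
    utf8_bidir_iterator& operator++() {
        ++pos_;
        while (pos_ < s_->size() && is_continuation(byte(pos_))) ++pos_;
        return *this;
    }
    utf8_bidir_iterator operator++(int) { auto old = *this; ++*this; return old; }

    // Step backward to the previous lead byte.
    utf8_bidir_iterator& operator--() {
        assert(pos_ > 0);
        --pos_;
        while (pos_ > 0 && is_continuation(byte(pos_))) --pos_;
        return *this;
    }
    utf8_bidir_iterator operator--(int) { auto old = *this; --*this; return old; }

    friend bool operator==(const utf8_bidir_iterator& a,
                           const utf8_bidir_iterator& b) {
        return a.s_ == b.s_ && a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_bidir_iterator& a,
                           const utf8_bidir_iterator& b) {
        return !(a == b);
    }

private:
    unsigned char byte(std::size_t i) const {
        return static_cast<unsigned char>((*s_)[i]);
    }
    static bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

    const std::string* s_;
    std::size_t pos_;
};
```

Because only `++` and `--` are provided, an expression such as `it + 278` does not compile. A caller who wants to jump ahead has to write the O(N) walk out explicitly, for example with `std::advance`, which makes the cost visible in the code.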

Vlad Lazarenko

6:24 p.m.

On Fri, Feb 11, 2011 at 12:44 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:

...

On Fri, 11 Feb 2011 17:22:50 +0000 "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:

...
Chad Nelson wrote:

...
I reiterate my request from this message <http://permalink.gmane.org/gmane.comp.lib.boost.devel/214601>: would those interested in the UTF String library please comment on the new version?

It looks like you've started a new thread in the hope of "throwing Mathias off the scent".

More to leave the bad taste of the previous thread behind, for myself and other potential participants.

...
His tone has indeed been unusually aggressive and it would be good to get some other people's comments - but to me, it looks like he has made a fundamentally accurate analysis of your proposal.

Maybe so, I couldn't tell the good feedback from the bad.

...
For example, you have this "almost" random-access feature that, IIUC, for UTF-8 will give you O(1) random access if you have only ASCII characters and for UTF-16 will give you O(1) random access if you have only BMP characters. That's just horrible! [...]

If you put it that way, you're right. I assumed that the developer using the library would read the documentation and know that the iterators weren't always true random-access, but that assumption doesn't stand up to conscious examination.

Unfortunately, users tend not to read documentation. In my experience, many just copy and paste from examples and modify the result to fit their needs. IMHO the default behavior should be the fastest of the safest options. But having an example of "tricky" usage that improves performance, with an explanation of the pros and cons, will help some users copy-paste from the right place and read the documentation before they do so.

...

...
IMHO, if you want to add these features, they should be done in a way that prevents the user from accidentally misusing them. Rather than this drop-in replacement for std::string that misbehaves at run time, I would prefer something that requires some work to make it fit, but then behaves as expected.

Perhaps renaming the current iterator types to "fast_iterator", and making the standard iterators bidirectional ones? I'd hate to saddle all UTF strings with only bidirectional iterators when many of them can use the much faster random-access ones. -- Chad Nelson Oak Circle Software, Inc. * * *


-- *Vlad Lazarenko* *Lazarenko.me <http://lazarenko.me>* vlad@lazarenko.com <vlad@lazarenko.me>

Chad Nelson

10:07 p.m.

On Fri, 11 Feb 2011 13:24:06 -0500 Vlad Lazarenko <vlad@lazarenko.me> wrote:

...

On Fri, Feb 11, 2011 at 12:44 PM, Chad Nelson <chad.thecomfychair@gmail.com> wrote:

...
...
For example, you have this "almost" random-access feature that, IIUC, for UTF-8 will give you O(1) random access if you have only ASCII characters and for UTF-16 will give you O(1) random access if you have only BMP characters. That's just horrible! [...]

If you put it that way, you're right. I assumed that the developer using the library would read the documentation and know that the iterators weren't always true random-access, but that assumption doesn't stand up to conscious examination.

Unfortunately, users tend not to read documentation. In my experience, many just copy and paste from examples and modify the result to fit their needs.

Mine too -- that's what I meant by conscious examination: when I thought about it, I realized that almost no one reads documentation except as a last resort.

...

IMHO the default behavior should be the fastest of the safest options. But having an example of "tricky" usage that improves performance, with an explanation of the pros and cons, will help some users copy-paste from the right place and read the documentation before they do so.

Hm, nice idea. Thanks. -- Chad Nelson Oak Circle Software, Inc. * * *

Marsh Ray

8:09 p.m.

On 02/11/2011 11:44 AM, Chad Nelson wrote:

...

On Fri, 11 Feb 2011 17:22:50 +0000 "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:

...
For example, you have this "almost" random-access feature that, IIUC, for UTF-8 will give you O(1) random access if you have only ASCII characters and for UTF-16 will give you O(1) random access if you have only BMP characters. That's just horrible! [...]

If you put it that way, you're right. I assumed that the developer using the library would read the documentation and know that the iterators weren't always true random-access, but that assumption doesn't stand up to conscious examination.

We've heard this argument against UTF-8 many times. Like many of us, I've worked with a lot of code to process a lot of text over many years. I'd like to question this idea that random access to arbitrary character data is really very relevant.

The difference between O(1) and O(N) isn't significant until N becomes nontrivial, which in practical terms is probably in the dozens or hundreds of characters.

So let me ask the question: Just when is it really valid to want to jump to the 278th "abstract character" in a string? Seriously, how often do these situations arise?

A guy who's only ever programmed "US ASCII" on a plain text terminal may think he needs every 80th character in reverse order to get a column from a screen line or something. But he would be wrong anywhere that uses controls, compose characters, non-spacing blanks, multibyte, or whatever.

Some string search and regex algorithms use skip-ahead N, but how often is N large enough to avoid a whole cache line fill?

Isn't it sufficient to simply document the behavior that derives from a straightforward implementation of the API?

- Marsh
