Re: [boost] [UTF String] Feedback on UTF String library, please

11 Feb 2011

On 02/11/2011 11:44 AM, Chad Nelson wrote:
> ...
> On Fri, 11 Feb 2011 17:22:50 +0000
> "Phil Endecott"<spam_from_boost_dev@chezphil.org>  wrote:
>> ...
>> For example, you have this "almost" random-access feature that, IIUC,
>> for UTF-8 will give you O(1) random access if you have only ASCII
>> characters and for UTF-16 will give you O(1) random access if you
>> have only BMP characters.  That's just horrible! [...]
>
> If you put it that way, you're right. I assumed that the developer
> using the library would read the documentation and know that the
> iterators weren't always true random-access, but that assumption
> doesn't stand up to conscious examination.

We've heard this argument against UTF-8 many times. Like many of us, 
I've worked with a lot of code to process a lot of text over many years. 
I'd like to question this idea that random access to arbitrary character 
data is really very relevant.

The difference between O(1) and O(N) isn't significant until N becomes 
nontrivial, which in practical terms is probably in the dozens or 
hundreds of characters.
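
To make the O(N) case concrete, here is a minimal sketch of what indexing by code point costs on UTF-8. The name utf8_offset_of is hypothetical and is not part of the proposed library; it assumes well-formed input and simply counts lead bytes from the start of the string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not taken from the UTF String library.
// Returns the byte offset of the n-th code point (0-based) in a
// well-formed UTF-8 string, or s.size() if there is no such code point.
std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; any other byte starts
        // a new code point.
        const bool is_lead = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (is_lead) {
            if (n == 0) return i;
            --n;
        }
    }
    return s.size();
}
```

The loop does one mask and one compare per byte, so for N in the dozens or hundreds it stays within a handful of cache lines.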

So let me ask the question:

Just when is it really valid to want to jump to the 278th "abstract 
character" in a string?

Seriously, how often do these situations arise?

A guy who's only ever programmed "US ASCII" on a plain text terminal may 
think he needs every 80th character in reverse order to get a column 
from a screen line or something. But he would be wrong anywhere that 
uses controls, compose characters, non-spacing blanks, multibyte, or 
whatever.
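
As a small illustration of why "every 80th character" does not map to screen columns, the sketch below counts code points in two encodings of the same displayed letter. The function name utf8_code_points is hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a well-formed UTF-8 string
// by counting every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Two spellings of the same displayed letter "e with acute accent":
//   precomposed U+00E9          -> bytes C3 A9    -> 1 code point
//   'e' + combining U+0301      -> bytes 65 CC 81 -> 2 code points
const std::string precomposed = "\xC3\xA9";
const std::string decomposed  = "e\xCC\x81";
```

Both strings occupy one column on a terminal that handles combining marks, so even a true O(1) code point index would not give the column lookup the ASCII-only programmer expects.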

Some string search and regex algorithms use skip-ahead N, but how often 
is N large enough to avoid a whole cache line fill?

Isn't it sufficient to simply document the behavior that derives from a 
straightforward implementation of the API?

- Marsh

Marsh Ray