
On 02/11/2011 11:44 AM, Chad Nelson wrote:
On Fri, 11 Feb 2011 17:22:50 +0000 "Phil Endecott" <spam_from_boost_dev@chezphil.org> wrote:
For example, you have this "almost" random-access feature that, IIUC, for UTF-8 will give you O(1) random access if you have only ASCII characters and for UTF-16 will give you O(1) random access if you have only BMP characters. That's just horrible! [...]
If you put it that way, you're right. I assumed that the developer using the library would read the documentation and know that the iterators weren't always true random-access, but that assumption doesn't stand up to conscious examination.
We've heard this argument against UTF-8 many times. Like many of us, I've worked with a lot of code to process a lot of text over many years. I'd like to question this idea that random access to arbitrary character data is really very relevant.

The difference between O(1) and O(N) isn't significant until N becomes nontrivial, which in practical terms is probably in the dozens or hundreds of characters.

So let me ask the question: just when is it really valid to want to jump to the 278th "abstract character" in a string? Seriously, how often do these situations arise?

A guy who's only ever programmed "US ASCII" on a plain text terminal may think he needs every 80th character in reverse order to get a column from a screen line or something. But he would be wrong anywhere that uses control characters, compose characters, non-spacing blanks, multibyte encodings, or whatever.

Some string search and regex algorithms use skip-ahead N, but how often is N large enough to avoid a whole cache line fill?

Isn't it sufficient to simply document the behavior that derives from a straightforward implementation of the API?

- Marsh
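
For illustration only, here is roughly what the O(N) case under discussion looks like for UTF-8. This is a minimal sketch, not code from the library being reviewed: the name `utf8_offset_of` is hypothetical, and it assumes well-formed UTF-8 input. It counts code points, not "abstract characters", so combining sequences would still need handling on top of it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (an assumption for illustration, not part of any
// library in this thread): returns the byte offset of the n-th code point
// (0-based) in a UTF-8 string, or std::string::npos if the string holds
// fewer than n + 1 code points. Assumes the input is well-formed UTF-8.
std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t seen = 0;  // code points encountered so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        // Continuation bytes look like 10xxxxxx; any other byte starts
        // a new code point.
        if ((c & 0xC0) != 0x80) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}
```

The scan is one linear pass over contiguous bytes with a single mask and compare per byte, which is the kind of cost being weighed against true O(1) indexing for strings of a few dozen or a few hundred characters.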