Re: [boost] [UTF String] Feedback on UTF String library, please

On 02/11/2011 14:09:27 Marsh Ray wrote:
So let me ask the question:
Just when is it really valid to want to jump to the 278th "abstract character" in a string?
Seriously, how often do these situations arise?
Well, Boyer-Moore string searching, obviously. And as noted previously, short snippets of null-terminated text are not the only things that people store within a std::string or behind a char *. There's a tremendous amount of data out there in fixed-column databases, for example, and undoubtedly many maintainers who'd like a painless way to upgrade the column unit from 'byte' to 'generic character.' It's also probably a logical fallacy to assume that every string algorithm on every device supported by Boost will always be cache-limited.

If strings for you are just handles that you pass to other people's APIs, then no, you don't need random access to the contents. But if you're seriously aiming to create a Unicode-aware drop-in replacement for existing string classes, then it ought to support all of the existing code that assumes efficient random access. And if you can provide a measurable speedup for at least the common case, even at the cost of a few extra bits per data chunk, that's worth pursuing.
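To make the fixed-column point concrete, here is a hypothetical sketch of the kind of legacy code described above; the record layout and field widths are invented, and today the "columns" are simply bytes in a std::string:

    #include <string>

    // Hypothetical fixed-column record parser.  The field widths are made
    // up for the example.  Today the columns are bytes, and the code
    // quietly assumes that cutting the line at the N'th character is
    // cheap; a Unicode-aware drop-in replacement would need to keep that
    // access pattern efficient once a "character" is no longer one byte.
    struct customer_record
    {
        std::string name;     // columns  0..29
        std::string city;     // columns 30..49
        std::string country;  // columns 50..51
    };

    inline customer_record parse_fixed_columns(const std::string& line)
    {
        return customer_record{
            line.substr(0, 30),
            line.substr(30, 20),
            line.substr(50, 2)
        };
    }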

Brent Spillner wrote:
Just when is it really valid to want to jump to the 278th "abstract character" in a string?
Seriously, how often do these situations arise?
Well, Boyer-Moore string searching, obviously.
If you're doing Boyer-Moore on a UTF-8 string you should normally do it on the bytes, not on the codepoints, so there is no need for random access to the codepoints. This works because of the particular properties of UTF-8 (it is self-synchronising: lead bytes and continuation bytes occupy disjoint value ranges, so a byte-level match of a complete, well-formed pattern can only begin on a character boundary), and it will be faster because you don't have the overhead of the UTF-8 decoding.

The exception would be if you want to do comparison at the "grapheme cluster" level, i.e. if you want to consider a single-codepoint, multi-byte representation of e.g. an accented character as equal to the multiple-codepoint, multi-byte representation comprising a base character plus a combining accent. I would suggest that it is better to normalise your strings first, so that this doesn't arise.

Even if you did want to do Boyer-Moore on the codepoint sequence, I think that using std::advance to move a bidirectional iterator would not change its complexity order. I believe this is also true for UTF-16, but I have less experience with that.
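As a minimal sketch of the byte-level approach, using the C++17 std::boyer_moore_searcher from the standard library (nothing from the proposed UTF String library; the strings here are invented for the example):

    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <string>

    int main()
    {
        // UTF-8 spelled out byte by byte: "café" is 'c' 'a' 'f' 0xC3 0xA9.
        const std::string haystack = "na\xC3\xAFve caf\xC3\xA9 style";
        const std::string needle   = "caf\xC3\xA9";

        // Byte-level Boyer-Moore; no codepoint decoding anywhere.
        auto it = std::search(haystack.begin(), haystack.end(),
                              std::boyer_moore_searcher(needle.begin(),
                                                        needle.end()));

        if (it != haystack.end())
            std::cout << "match at byte offset "
                      << (it - haystack.begin()) << '\n';
    }

Because the needle is a complete, well-formed UTF-8 sequence, a byte match can never start in the middle of a multi-byte character, so the byte offset that comes back already lies on a character boundary.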
If strings for you are just handles that you pass to other people's APIs, then no, you don't need random access to the contents. But if you're seriously aiming to create a Unicode-aware drop-in replacement for existing string classes, then it ought to support all of the existing code that assumes efficient random access. And if you can provide measurable speedup for at least the common case, even at the cost of a few extra bits per data chunk, that's worth pursuing.
Are you suggesting some sort of indexed string? I guess you could use an augmented tree to provide O(log N) access to the N'th character in a UTF-8 string. In fact, didn't someone propose something like that a few months ago? Not as a string, but as a "vector whose iterators are not invalidated" or a "list with random access" or something?

Regards,

Phil.
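For what it's worth, here is a minimal sketch of that indexed-string idea; the class name, chunk size and flat layout are all invented, and a real implementation would hang the counts off a balanced, augmented tree so that insertions also stay O(log N) rather than using a flat, immutable index:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    // Sketch only: an immutable UTF-8 string split into fixed-size byte
    // chunks, recording for each chunk how many codepoints precede it.
    // Locating the n'th codepoint is then a binary search over those
    // prefix counts plus a short scan inside a single chunk.
    class indexed_utf8
    {
    public:
        explicit indexed_utf8(std::string utf8, std::size_t chunk_bytes = 64)
            : data_(std::move(utf8)), chunk_bytes_(chunk_bytes)
        {
            std::size_t seen = 0;  // codepoints counted so far
            for (std::size_t start = 0; start < data_.size(); start += chunk_bytes_) {
                chunk_start_.push_back(start);
                prefix_count_.push_back(seen);
                const std::size_t end = std::min(start + chunk_bytes_, data_.size());
                for (std::size_t i = start; i < end; ++i)
                    if (!is_continuation(data_[i]))  // count lead bytes only
                        ++seen;
            }
        }

        // Byte offset of the n'th codepoint; precondition: n < codepoint count.
        std::size_t byte_offset_of(std::size_t n) const
        {
            // Find the last chunk preceded by at most n codepoints.
            auto it = std::upper_bound(prefix_count_.begin(), prefix_count_.end(), n);
            const std::size_t chunk =
                static_cast<std::size_t>(it - prefix_count_.begin()) - 1;

            std::size_t remaining = n - prefix_count_[chunk];
            for (std::size_t i = chunk_start_[chunk]; ; ++i)
                if (!is_continuation(data_[i]) && remaining-- == 0)
                    return i;  // lead byte of the n'th codepoint
        }

    private:
        static bool is_continuation(char c)
        {
            return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;  // 10xxxxxx
        }

        std::string data_;
        std::size_t chunk_bytes_;
        std::vector<std::size_t> chunk_start_;
        std::vector<std::size_t> prefix_count_;
    };

The storage overhead is a couple of words per chunk (the "few extra bits per data chunk" mentioned earlier), and a lookup costs one binary search plus at most one chunk's worth of scanning.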
participants (2)
- Brent Spillner
- Phil Endecott