Re: [boost] [UTF String] Feedback on UTF String library, please

On 2/11/2001 17:05:25, Phil Endecott wrote:
Brent Spillner wrote:
Just when is it really valid to want to jump the 278th "abstract character" in a string?
Seriously, how often do these situations arise?
Well, Boyer-Moore string searching, obviously.
If you're doing Boyer-Moore on a UTF-8 string you should normally do it on the bytes, not on the codepoints, so there is no need for random access to the codepoints. This works because of the particular properties of UTF-8, and will be faster because you don't have the overhead of the UTF-8 decoding.
The discussion was about 'abstract characters,' not codepoints. I think a codepoint iterator is likely useful too, but for most applications the character view is required.
The exception would be if you want to do comparison at the "grapheme cluster" level, i.e. if you want to consider a single-codepoint, multi-byte representation of e.g. an accented character as equal to the multiple-codepoint, multi-byte representation comprising a base
Yes, I think that's the behavior desired for most applications.
character plus a combining accent. I would suggest that it would be better to first normalise your strings, so that this doesn't arise.
...but I think that's a little too glib. If you have the string "résumé" in your data store, and a user enters "resume" as a search key, then you either have to search independently for each plausible variant for "resume," or you have to maintain a parallel, accent-free copy of the original text and the means to translate indices into the accent-free version into indices into the rich version, or you have to have some notion of 'abstract character' comparison and iteration. I think the last solution is by far the most preferable, particularly for a low-level library. This is how people actually use strings in practice, after all.
Even if you did want to do Boyer-Moore on the codepoint-sequence, I think that using std::advance to move a bidirectional iterator would not change its complexity order.
It clearly won't; you need to ::advance at most O(n) times, and Boyer-Moore is already a linear complexity algorithm (and I'd be amazed to see a sublinear one for the general case.) But since the whole algorithm is linear, those extra O(n) iterator advancements past code units that you already know to be irrelevant could be a measurable contributor to the total run time.
Are you suggesting some sort of indexed string? I guess you could use an augmented tree to provide O(log N) access to the N'th character in a UTF-8 string. In fact didn't someone propose something like that a few
Yes, that's what I was thinking.
months ago? Not as a string but as a "vector whose iterators are not invalided" or "list with random access" or something?
Good to know; I'll take a look at that.

Brent Spillner wrote:
If you have the string "r?sum?" in your data store,
[Aside: I'm reading this via the "daily digest", and I find it sadly appropriate that the mailing list software replaces all non-ASCII characters with ?s...]
and a user enters "resume" as a search key, then you either have to search independently for each plausible variant for "resume," or you have to maintain a parallel, accent-free copy of the original text and the means to translate indices into the accent-free version into indices into the rich version, or you have to have some notion of 'abstract character' comparison and iteration. I think the last solution is by far the most preferable, particularly for a low-level library. This is how people actually use strings in practice, after all.
Well, "how people actually use" and "for most applications" are things that could be argued about ad infinitum. Let's not go there. Yes, approximate-matching is useful. I have a to_ascii_character iterator adaptor (or to_alphanumeric_character) that I can layer on top of a utf8 iterator when I need that. More often, I just need a comparison that will work for keys in a std::associative_container.
Even if you did want to do Boyer-Moore on the codepoint-sequence, I think that using std::advance to move a bidirectional iterator would not change its complexity order.
It clearly won't; you need to ::advance at most O(n) times, and Boyer-Moore is already a linear complexity algorithm (and I'd be amazed to see a sublinear one for the general case.) But since the whole algorithm is linear, those extra O(n) iterator advancements past code units that you already know to be irrelevant could be a measurable contributor to the total run time.
But what are the options we're comparing here? I don't think there's anything on the table that gives you O(1) random access to the characters or codepoints of a UTF-8 string. Chad's proposal has an iterator that gives you that only when the string doesn't contain any non-ASCII characters, and needs to scan the string in O(N) time to determine that; in general, it will emulate random access in linear time. The hypothetical augmented tree representation can do random access in log(N) time, but you still have to build the tree first which is O(N). Using std::advance on a bidirectional iterator is probably better than those alternatives. If I were trying to implement fast Boyer-Moore on UTF-8, I would do it on the bytes: the distance that you advance after a mismatch can be thought of as a hint; if it tells you to advance N characters, I would advance N bytes and then skip to the next codepoint-start byte. This is, in my experience, typical of how UTF-8 can be most efficiently used; if you don't forget that it's a byte sequence, you can do smart things with it. Regards, Phil.
participants (2)
-
Brent Spillner
-
Phil Endecott