
On Sat, Jan 22, 2011 at 1:51 AM, Dave Abrahams <dave@boostpro.com> wrote:
At Sat, 22 Jan 2011 01:14:38 +0800, Dean Michael Berris wrote:
4. Looks like a real STL container except the iterator type is smarter than your average iterator.
Encoding is a matter of external interpretation and I think should not be part of a string's interface. You can have wrappers that interpret a string as a UTF-* string.
What does it iterate over? chars? code points? characters? Something else?
I can see a way of saying which kind of iterator you want when you ask for one -- by default, though, a call to '.begin()' will return an iterator over characters (just so you don't break compatibility with std::string).
Then you mean an iterator over chars, not characters.
Yeah, over chars. :)
The iterator can store a reference to the original string and, when advanced, can do the appropriate interpretation of the string in context. If you wanted a code point iterator, you'd get the code point iterator. If you wanted characters based on a certain encoding, you could have a special iterator for that. An iterator would also know whether it was out of bounds.
This allows people to write code that deals with code points, characters (based on the encoding), and raw data if absolutely necessary.
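As a rough illustration of the iterator described above, here is a minimal C++11 sketch of a code point iterator over a std::string that is assumed to hold UTF-8. The names code_point_iterator, at_end and count_code_points are made up for this example and are not part of Boost or the standard library. The sketch does no validation of malformed input; it only shows an iterator that keeps a reference to its string, interprets the bytes when dereferenced, and knows when it is out of bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical type for illustration only; assumes the string holds UTF-8.
class code_point_iterator {
public:
    explicit code_point_iterator(const std::string& s, std::size_t pos = 0)
        : str_(&s), pos_(pos) {}

    // The iterator refers back to its string, so it can tell by itself
    // whether it has run past the last byte.
    bool at_end() const { return pos_ >= str_->size(); }

    // Interpret the bytes starting at the current offset as one code point.
    char32_t operator*() const {
        assert(!at_end());
        const unsigned char lead = byte(pos_);
        const std::size_t len = sequence_length(lead);
        if (len == 1) return lead;
        // Keep only the payload bits of the lead byte.
        char32_t cp = lead & (0xFF >> (len + 1));
        // Each continuation byte contributes its low six bits.
        for (std::size_t i = 1; i < len && pos_ + i < str_->size(); ++i)
            cp = (cp << 6) | (byte(pos_ + i) & 0x3F);
        return cp;
    }

    // Advance by one code point, clamping at the end of the string.
    code_point_iterator& operator++() {
        assert(!at_end());
        pos_ += sequence_length(byte(pos_));
        if (pos_ > str_->size()) pos_ = str_->size();
        return *this;
    }

    bool operator==(const code_point_iterator& other) const {
        return str_ == other.str_ && pos_ == other.pos_;
    }
    bool operator!=(const code_point_iterator& other) const {
        return !(*this == other);
    }

private:
    unsigned char byte(std::size_t i) const {
        return static_cast<unsigned char>((*str_)[i]);
    }

    // Number of bytes announced by a UTF-8 lead byte. Bytes that are not
    // valid lead bytes are treated as one-byte sequences (no validation).
    static std::size_t sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        if ((lead >> 3) == 0x1E) return 4;
        return 1;
    }

    const std::string* str_;  // the original string, not a copy
    std::size_t pos_;         // byte offset of the current code point
};

// Count code points rather than chars.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (code_point_iterator it(s); !it.at_end(); ++it) ++n;
    return n;
}
```

With this, a string such as "a\xC3\xA9\xE2\x82\xAC" has six chars when traversed with the ordinary std::string iterators but three code points when traversed with the wrapper, and the underlying string stays encoding-agnostic.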
Hmm, I'm just not sure whether these are useful. The iterators to be supplied (if any) should IMO be dictated by the needs of real algorithms.
I thought about it a little more too, and there should be a way of just crafting the appropriate iterator from the outside -- much like how the current Iterators library allows you to create different kinds of iterators. Algorithms that deal with text, such as rendering characters in a GUI, would basically need to iterate over code points or glyphs. Typesetting algorithms would pretty much need the same kind of traversal. Things like instance counting (building a histogram of character counts, for example for compression) would also need access to the individual "elements" of a given text -- in the pre-Unicode days this was just a simple table of 255 characters; unfortunately it's gotten a lot more complex than that ;).

--
Dean Michael Berris
about.me/deanberris