Re: [boost] [General] Always treat std::strings as UTF-8

19 Jan 2011


On Wed, 19 Jan 2011 00:00:59 +0100
Robert Kawulak <robert.kawulak@gmail.com> wrote:
...
...
From: Artyom
OK, let's think: what do you need iterators for? Accessing "characters"?
If so, you are most likely doing something terribly wrong, as you
ignore the fact that codepoint != character.
I would say such an iterator is wrong by design unless you are developing
a Unicode algorithm that operates on code points.
Now wouldn't it be nice if ascii_t (or whatever it's called) and
utf*_t string classes had 3 kinds of iterators:
- storage iterator (char, wchar_t etc.),
- codepoint iterator,
- character iterator.
The current iterators fall under the storage iterator category, but
code-point iterators are easily possible. Character iterators may
require help from a full-fledged Unicode library (I don't yet know
whether there's a simple way to determine which code points are
combining ones; I doubt there is), but they should be doable too.
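
To make the middle level concrete, here is a minimal sketch of a code-point
iterator layered on top of the storage iterator of a std::string. The names
codepoint_iterator, count_code_points and contains_code_point are hypothetical
and are not part of any proposed string class. The sketch assumes well-formed
UTF-8 and does no validation; a character iterator would still need Unicode
property data on top of this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical code-point iterator over UTF-8 held in a std::string.
// Assumption: the input is well-formed UTF-8. Unexpected lead bytes are
// stepped over one byte at a time, but nothing is validated.
class codepoint_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;

    codepoint_iterator(std::string::const_iterator pos,
                       std::string::const_iterator end)
        : pos_(pos), end_(end) {}

    // Decode the code point that starts at the current storage position.
    char32_t operator*() const {
        assert(pos_ != end_);
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        const std::ptrdiff_t len = length();
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0x7Fu >> len));
        // Each continuation byte contributes its low six bits.
        for (std::ptrdiff_t i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(pos_[i]) & 0x3Fu);
        return cp;
    }

    // Advance by one whole encoded sequence, not by one byte.
    codepoint_iterator& operator++() {
        assert(pos_ != end_);
        pos_ += length();
        return *this;
    }

    codepoint_iterator operator++(int) {
        codepoint_iterator old(*this);
        ++*this;
        return old;
    }

    friend bool operator==(const codepoint_iterator& a,
                           const codepoint_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const codepoint_iterator& a,
                           const codepoint_iterator& b) {
        return !(a == b);
    }

private:
    // Sequence length implied by the lead byte, clamped to the bytes left.
    std::ptrdiff_t length() const {
        const unsigned char lead = static_cast<unsigned char>(*pos_);
        std::ptrdiff_t len = 1;
        if ((lead >> 5) == 0x06)      len = 2;  // 110xxxxx
        else if ((lead >> 4) == 0x0E) len = 3;  // 1110xxxx
        else if ((lead >> 3) == 0x1E) len = 4;  // 11110xxx
        return std::min<std::ptrdiff_t>(len, end_ - pos_);
    }

    std::string::const_iterator pos_;
    std::string::const_iterator end_;
};

// Existing standard algorithms then work at the code-point level.
std::size_t count_code_points(const std::string& s) {
    codepoint_iterator first(s.begin(), s.end()), last(s.end(), s.end());
    return static_cast<std::size_t>(std::distance(first, last));
}

bool contains_code_point(const std::string& s, char32_t cp) {
    codepoint_iterator first(s.begin(), s.end()), last(s.end(), s.end());
    return std::find(first, last, cp) != last;
}
```

For the string "e\xCC\x81" ('e' followed by U+0301 COMBINING ACUTE ACCENT),
size() reports 3 storage units and count_code_points() reports 2 code points,
while a character iterator would have to report 1. That gap is the
codepoint != character point made above.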
...
You could then reuse many existing algorithms to perform operations on
a level that is sufficient in a given situation [...] I don't know
Unicode quirks enough to tell how useful this interface would be, but
it seems interesting.
And intriguing. When I get back to the Unicode string classes, I'll
look into adding such iterators.
-- 
Chad Nelson
Oak Circle Software, Inc.