Re: [boost] [General] Always treat std::strings as UTF-8

18 Jan 2011

      ...
From: Artyom
OK, let's think: what do you need iterators for? Accessing "characters"?
If so, you are most likely doing something terribly wrong, as you ignore
the fact that codepoint != character.
I would say such an iterator is wrong by design unless you are developing
a Unicode algorithm that operates on code points.
Now wouldn't it be nice if ascii_t (or whatever it's called) and utf*_t string classes had 3 kinds of iterators:
- storage iterator (char, wchar_t etc.),
- codepoint iterator,
- character iterator.
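
To make the three levels concrete, here is a minimal sketch that counts the same UTF-8 string at each level. The names level_counts, is_combining_mark and count_levels are made up for illustration. The "character" level is only approximated: it treats codepoints in U+0300..U+036F as attaching to the previous character, which is far short of the real Unicode grapheme cluster rules. The input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical result type: one count per iteration level.
struct level_counts {
    std::size_t storage_units;  // bytes (char)
    std::size_t codepoints;
    std::size_t characters;     // approximation, see is_combining_mark
};

// ASSUMPTION: only the Combining Diacritical Marks block is treated as
// "extends the previous character". Real segmentation needs much more.
bool is_combining_mark(std::uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

level_counts count_levels(const std::string& s) {
    level_counts n{s.size(), 0, 0};
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length from the lead byte (no validation in this sketch).
        std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        assert(i + len <= s.size());  // input must not be truncated
        // Payload bits of the lead byte, then 6 bits per continuation byte.
        std::uint32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        ++n.codepoints;
        if (!is_combining_mark(cp)) ++n.characters;
        i += len;
    }
    return n;
}
```

For "cafe" followed by U+0301 (combining acute accent, bytes CC 81), this gives 6 storage units, 5 codepoints and 4 characters, so the three iterator kinds would each see a different length.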

You could then reuse many existing algorithms to perform operations on a level that is sufficient in a given situation, like:

- bitwise copy:
    std::copy(utf8_1.storage_begin(), utf8_1.storage_end(),
        utf8_2.storage_begin())
- check if utf32 is a substring of utf8, codepoint-wise:
    std::search(utf8.codepoint_begin(), utf8.codepoint_end(),
        utf32.codepoint_begin(), utf32.codepoint_end())
- character-wise copy ascii_t to utf16_t, considering the codepage of the ascii object:
    utf16_t utf16(ascii.character_begin(), ascii.character_end())
- count codepoints:
    std::distance(utf8.codepoint_begin(), utf8.codepoint_end())
- count characters:
    std::distance(utf8.character_begin(), utf8.character_end())
- get the 5th codepoint (std::advance needs a named iterator and returns nothing, so std::next fits better here):
    *std::next(utf8.codepoint_begin(), 4)
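
As a rough sketch of how the codepoint level could plug into the standard algorithms, here is a forward iterator over the bytes of a std::string. Everything in it is an assumption made for illustration: codepoint_iterator is a made-up name, and the free functions codepoint_begin/codepoint_end stand in for the member functions a utf8_t class might have. It assumes well-formed UTF-8 and does no validation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>

// Hypothetical iterator: walks UTF-8 bytes one codepoint at a time.
class codepoint_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type = std::uint32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const std::uint32_t*;
    using reference = std::uint32_t;  // decoded on the fly, returned by value

    codepoint_iterator() = default;
    explicit codepoint_iterator(const char* p) : p_(p) {}

    std::uint32_t operator*() const {
        assert(p_ != nullptr);
        unsigned char b = static_cast<unsigned char>(*p_);
        std::size_t len = length(b);
        // Payload bits of the lead byte, then 6 bits per continuation byte.
        std::uint32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[k]) & 0x3F);
        return cp;
    }
    codepoint_iterator& operator++() {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }
    codepoint_iterator operator++(int) {
        codepoint_iterator old = *this;
        ++*this;
        return old;
    }
    friend bool operator==(codepoint_iterator a, codepoint_iterator b) {
        return a.p_ == b.p_;
    }
    friend bool operator!=(codepoint_iterator a, codepoint_iterator b) {
        return a.p_ != b.p_;
    }

private:
    // Sequence length from the lead byte (no validation in this sketch).
    static std::size_t length(unsigned char b) {
        return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    }
    const char* p_ = nullptr;
};

// Stand-ins for the proposed utf8.codepoint_begin() / utf8.codepoint_end().
inline codepoint_iterator codepoint_begin(const std::string& s) {
    return codepoint_iterator(s.data());
}
inline codepoint_iterator codepoint_end(const std::string& s) {
    return codepoint_iterator(s.data() + s.size());
}
```

With this, std::distance(codepoint_begin(s), codepoint_end(s)) counts codepoints, and std::search can take a std::u32string as the needle directly, since both sides compare as 32-bit codepoint values. A character-level iterator would need the Unicode segmentation rules on top of this and is not attempted here.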

I don't know Unicode quirks enough to tell how useful this interface would be, but it seems interesting. What do you think?

Best regards,
Robert

Robert Kawulak