
Otherwise you should:
1. Reinvent the string
Or at least wrap it. ;-)
2. Reinvent the standard library to use the new string
Not entirely necessary, for the same reason that very few changes to the standard library are needed when you switch from char strings to char16_t strings to char32_t strings -- the standard library, designed around the idea of iterators, is mostly type-agnostic.
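(A minimal sketch of that claim, assuming nothing beyond standard C++11: the same iterator-based algorithm compiles and runs unchanged over different code-unit types.)

    #include <algorithm>
    #include <iostream>
    #include <string>

    int main() {
        // The same iterator-based algorithm works regardless of the code-unit type.
        std::string    narrow = "hello";
        std::u32string wide   = U"hello";

        bool n = std::find(narrow.begin(), narrow.end(), 'l')  != narrow.end();
        bool w = std::find(wide.begin(),   wide.end(),   U'l') != wide.end();

        std::cout << n << " " << w << "\n"; // prints: 1 1
    }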
Ok... A few things:
1. UTF-32 is a waste of space - don't use it unless you are doing something like handling individual code points (char32_t).
2. UTF-16 is too error prone (see: UTF-16 considered harmful).
3. There is no special type char8_t distinct from char, so you can't use it (see the sketch below).
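On point 3, a small sketch of why an alias does not help; the names utf8_char and utf8_unit are hypothetical, used only for illustration:

    // A typedef does NOT introduce a distinct type, so you cannot overload
    // (or specialize) on "UTF-8-ness" versus plain narrow strings.
    typedef char utf8_char;               // hypothetical alias

    void process(const char* s);          // narrow-string overload
    // void process(const utf8_char* s);  // error: redeclares the same function

    // The only way to get a genuinely distinct type is to wrap the code unit:
    struct utf8_unit { char value; };     // hypothetical wrapper
    void process(const utf8_unit* s);     // now a distinct overload is possible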
The utf*_t types provide fully functional iterators,
Ok, let's think: what do you need iterators for? Accessing "characters"? If so, you are most likely doing something terribly wrong, because you are ignoring the fact that codepoint != character. I would say such an iterator is wrong by design, unless you are developing a Unicode algorithm that actually operates on code points.
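To illustrate why iterating code points does not give you characters, a minimal example (standard C++11 only): one user-perceived character made of two code points.

    #include <iostream>
    #include <string>

    int main() {
        // One user-perceived character, two code points:
        // U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
        std::u32string s = U"e\u0301";

        // A code-point iterator reports 2 elements, yet a reader sees one character.
        std::cout << "code points: " << s.size() << "\n"; // prints 2
    }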
so they'll work fine with most library functions, so long as those functions don't care that some characters are encoded as multiple bytes. It's just the ones that assume that a single byte represents all characters that you have to replace, and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
Ok... The paragraph above is inherently wrong. First of all, let's clean things up:
that some characters are encoded as multiple bytes
Characters are not code points.
the ones that assume that a single byte represents all characters
Please, I want to make this statement even clearer: C H A R A C T E R != C O D E P O I N T. This holds even in single-byte encodings - for example, windows-1255 is a single-byte encoding and may still represent a single character using 1, 2 or 3 bytes! (See the sketch below.) Once again: when you work with strings, you don't work with them as a series of characters; you work with them as text entities - text chunks.
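A sketch of that windows-1255 point; the byte values are read off the windows-1255 code page (base letter shin plus two combining points) and are shown only as an illustration:

    #include <iostream>
    #include <string>

    int main() {
        // windows-1255 is a single-byte encoding, yet one Hebrew character
        // (shin + shin dot + qamats) is a base letter plus two combining
        // points: 3 bytes for 1 user-perceived character.
        std::string shin_with_points = "\xF9\xD1\xC8";

        std::cout << "bytes: " << shin_with_points.size() << "\n"; // prints 3
    }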
and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
No, I would not, because I don't look at a string as a sequence of code points - by themselves they are meaningless. Code points are meaningful only in terms of Unicode algorithms that know how to combine them. So if you want to handle text chunks, you will have to use some Unicode-aware library anyway.
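For instance, a Unicode-aware library such as Boost.Locale (assumed here to be built with its ICU backend) applies the Unicode algorithms itself while working directly on std::string text chunks; a minimal sketch:

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        // The library, not the string type, carries the Unicode knowledge
        // (here: full case mapping).
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::string text = "gr\xC3\xBC\xC3\x9F";                 // "grüß" as UTF-8 bytes
        std::cout << boost::locale::to_upper(text, loc) << "\n"; // "GRÜSS" (ß -> SS)
    }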
It is just neither feasible nor necessary.
My code says it's perfectly feasible. ;-) Whether it's necessary or not is up to the individual developer, but the type-safety it offers is more in line with the design philosophy of C++ than using std::string for everything. I hate to harp on the same tired example, but why do you really need any pointer type other than void*? It's the same idea.
No, it isn't. A string is a text chunk. You can combine them, concatenate them, search for specific substrings, or refer to ASCII characters (for example in HTML) and parse them, and this is perfectly doable with the standard std::string, regardless of whether it is UTF-8, Latin-1 or another ISO-8859-* ASCII-compatible encoding. This is very different. Giving you a "utf-8" string or a UTF-8 container would give you a false feeling that you are doing something right. Unicode is not about splitting a string into code points or iterating over them... It is a totally different thing.

Artyom
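A small illustration of the ASCII-compatibility point above (the markup and Hebrew payload are made-up example data):

    #include <iostream>
    #include <string>

    int main() {
        // UTF-8 is ASCII-compatible: no byte of a multi-byte sequence overlaps
        // the ASCII range, so searching for ASCII markup inside a UTF-8
        // std::string needs no special "UTF-8 string" type.
        std::string html = "<p>\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D</p>"; // "<p>שלום</p>" in UTF-8

        std::size_t open  = html.find("<p>");
        std::size_t close = html.find("</p>");

        std::cout << "tag found: " << (open != std::string::npos)
                  << ", payload bytes: " << (close - (open + 3)) << "\n"; // 1, 8
    }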