
On Tue, 18 Jan 2011 11:01:10 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
2. Reinvent standard library to use new string
Not entirely necessary, for the same reason that very few changes to the standard library are needed when you switch from char strings to char16_t strings to char32_t strings -- the standard library, designed around the idea of iterators, is mostly type-agnostic.
OK... a few things:
1. UTF-32 is a waste of space -- don't use it unless you're doing something like handling individual code points (char32_t).
2. UTF-16 is too error-prone (see: "UTF-16 considered harmful").
No argument with either assertion.
3. There is no special char8_t type distinct from char, so you can't use it.
That's why I wrote the utf8_t type. I'd have been quite happy to just use an std::basic_string<utf8_byte_t>, and I looked into the C++0x "opaque typedef" idea to see if it was possible. I couldn't find any elegant way to make it work, and the opaque typedef proposal was dropped from the spec, so I felt that I had to write the utf8_t class. However, I'm not sure what point you're trying to make with the above.
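For illustration only -- the actual utf8_t implementation isn't shown in this thread -- a minimal sketch of the idea: since there's no opaque typedef, a small wrapper class gives UTF-8 data a type distinct from plain char strings, so the two can't be silently mixed:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical minimal sketch, NOT the real utf8_t class: a distinct
// type wrapping a buffer of UTF-8 code units. The explicit constructor
// is what prevents accidental conversion from ordinary narrow strings.
class utf8_sketch {
public:
    explicit utf8_sketch(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
    std::size_t byte_size() const { return bytes_.size(); }
private:
    std::string bytes_;  // raw UTF-8 bytes, not "characters"
};
```

The real class would add encoding validation and richer iteration; the sketch only shows why a wrapper class can stand in for the missing opaque typedef.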
The utf*_t types provide fully functional iterators,
OK, let's think: what do you need iterators for? Accessing "characters"? If so, you are most likely doing something terribly wrong, because you're ignoring the fact that a code point != a character.
In the current incarnation of the class, the iterators are for accessing the bytes, to make it trivially compatible with things like std::copy.
I would say such an iterator is wrong by design, unless you're developing a Unicode algorithm that operates on code points.
If that's needed (and it probably is), it's easy enough to add. It just wouldn't use the standard begin() and end() functions.
so they'll work fine with most library functions, so long as those functions don't care that some characters are encoded as multiple bytes. It's just the ones that assume that a single byte represents all characters that you have to replace, and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
Ok...
The paragraph above is inherently wrong.
Oh?
First of all, let's clear some things up:
that some characters are encoded as multiple bytes
Characters are not code points.
A semantic point. Correct, but irrelevant to the argument I was trying to make.
the ones that assume that a single byte represents all characters
Please, I want to make this statement even clearer:
C H A R A C T E R != C O D E P O I N T
Even in single-byte encodings -- for example, windows-1255 is a single-byte encoding, and it may still represent a single character using 1, 2 or 3 bytes!
std::copy, std::mismatch, std::equal, std::search, and several others would work equally well on UTF-8 strings. Functions that only allow you to specify a single element to work with, like std::find, would require a slightly different kind of iterator, one that operated on either characters or code-points. I don't see how that makes anything in the quoted paragraph inherently wrong.
Once again -- when you work with strings, you don't work with them as a series of characters; you work with them as text entities -- text chunks.
That depends on what you're doing with them. If you're using them as translations for messages your program is sending out, then your statement is correct -- you treat them as opaque blobs. But if for instance you're parsing a file, you want tokens, which *are* merely an arbitrary series of characters. Or if your program allows the user to edit a file, you want something that gives you single characters, regardless of how many bytes or code-points they're encoded in.
and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
No, I would not, because I don't look at a string as a sequence of code points -- by themselves they are meaningless.
Code points are meaningful in terms of Unicode algorithms that know how to combine them.
So if you want to handle text chunks, you will have to use some Unicode-aware library.
If you want to sort them properly for the locale you're working with, you're correct. If you just want to write them out, or edit them, then barring things like messages in mixed left-to-right and right-to-left languages, it's fairly simple.
It is just neither feasible nor necessary.
My code says it's perfectly feasible. ;-) Whether it's necessary or not is up to the individual developer, but the type-safety it offers is more in line with the design philosophy of C++ than using std::string for everything. I hate to harp on the same tired example, but why do you really need any pointer type other than void*? It's the same idea.
No, it isn't. A string is a text chunk.
You can combine them, concatenate them, search for specific substrings, or match against ASCII characters (for example, as in HTML) and parse them, and all of this is perfectly doable with a standard std::string, regardless of whether it holds UTF-8, Latin-1, or another ISO-8859-* ASCII-compatible encoding.
This is very different.
I'm trying to understand your point, but with no success so far. If you want something that gives you characters or code-points, then an std::string has no chance of working in any multi-byte encoding -- a UTF-whatever-specific type does.
Giving you a "utf-8" string or UTF-8 container would give you the false feeling that you're doing something right.
How?
Unicode is not about splitting a string into code points or iterating over them... It is a totally different thing.
I'm baffled by this statement. For doing anything interesting, Unicode or any other encoding *is* about iterating over characters (or code-points, if that's what you're looking for). Your point seems to be that the utf*_t classes are actively harmful in some way that I don't see, and using std::string somehow mitigates that by making you do more work. Or am I misunderstanding you?
--
Chad Nelson
Oak Circle Software, Inc.
* * *