
On Tue, 18 Jan 2011 11:01:10 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
2. Reinvent standard library to use new string
Not entirely necessary, for the same reason that very few changes to the standard library are needed when you switch from char strings to char16_t strings to char32_t strings -- the standard library, designed around the idea of iterators, is mostly type-agnostic.
OK... a few things:
1. UTF-32 is a waste of space -- don't use it unless you're doing something like handling individual code points (char32_t).
2. UTF-16 is too error-prone (see: "UTF-16 considered harmful").
No argument with either assertion.
3. There is no special char8_t type distinct from char, so you can't use it.
That's why I wrote the utf8_t type. I'd have been quite happy to just use an std::basic_string<utf8_byte_t>, and I looked into the C++0x "opaque typedef" idea to see if it was possible. I couldn't find any elegant way to make it work, and the opaque typedef proposal was dropped from the spec, so I felt that I had to write the utf8_t class. However, I'm not sure what point you're trying to make with the above.
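For illustration only -- the actual utf8_t implementation isn't shown in this thread -- a minimal sketch of the idea: since there's no opaque typedef, a small wrapper class gives UTF-8 data a type distinct from plain char strings, so the two can't be silently mixed:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical minimal sketch, NOT the real utf8_t class: a distinct
// type wrapping a buffer of UTF-8 code units. The explicit constructor
// is what prevents accidental conversion from ordinary narrow strings.
class utf8_sketch {
public:
    explicit utf8_sketch(std::string bytes) : bytes_(std::move(bytes)) {}
    const std::string& bytes() const { return bytes_; }
    std::size_t byte_size() const { return bytes_.size(); }
private:
    std::string bytes_;  // raw UTF-8 bytes, not "characters"
};
```

The real class would add encoding validation and richer iteration; the sketch only shows why a wrapper class can stand in for the missing opaque typedef.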
The utf*_t types provide fully functional iterators,
OK, let's think: what do you need iterators for? Accessing "characters"? If so, you are most likely doing something terribly wrong, because you're ignoring the fact that a code point != a character.
In the current incarnation of the class, the iterators are for accessing the bytes, to make it trivially compatible with things like std::copy.
I would say such an iterator is wrong by design, unless you're developing a Unicode algorithm that operates on code points.
If that's needed (and it probably is), it's easy enough to add. It just wouldn't use the standard begin() and end() functions.
so they'll work fine with most library functions, so long as those functions don't care that some characters are encoded as multiple bytes. It's just the ones that assume that a single byte represents all characters that you have to replace, and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
Ok...
The paragraph above is inherently wrong.
Oh?
First of all, let's clear some things up:
that some characters are encoded as multiple bytes
Characters are not code points.
A semantic point. Correct, but irrelevant to the argument I was trying to make.
the ones that assume that a single byte represents all characters
Please, I want to make this statement even clearer:
C H A R A C T E R != C O D E P O I N T
Even in single-byte encodings -- for example, windows-1255 is a single-byte encoding, and it may still represent a single character using 1, 2 or 3 bytes!
std::copy, std::mismatch, std::equal, std::search, and several others would work equally well on UTF-8 strings. Functions that only allow you to specify a single element to work with, like std::find, would require a slightly different kind of iterator, one that operated on either characters or code-points. I don't see how that makes anything in the quoted paragraph inherently wrong.
Once again -- when you work with strings, you don't work with them as a series of characters; you work with them as text entities -- text chunks.
That depends on what you're doing with them. If you're using them as translations for messages your program is sending out, then your statement is correct -- you treat them as opaque blobs. But if for instance you're parsing a file, you want tokens, which *are* merely an arbitrary series of characters. Or if your program allows the user to edit a file, you want something that gives you single characters, regardless of how many bytes or code-points they're encoded in.
and you'd have to replace those regardless of whether you're using a new string type or not, if you're using any multi-byte encoding.
No, I would not, because I don't look at a string as a sequence of code points -- by themselves they are meaningless.
Code points are meaningful in terms of Unicode algorithms that know how to combine them.
So if you want to handle text chunks, you will have to use some Unicode-aware library.
If you want to sort them properly for the locale you're working with, you're correct. If you just want to write them out, or edit them, then barring things like messages in mixed left-to-right and right-to-left languages, it's fairly simple.
It is just neither feasible nor necessary.
My code says it's perfectly feasible. ;-) Whether it's necessary or not is up to the individual developer, but the type-safety it offers is more in line with the design philosophy of C++ than using std::string for everything. I hate to harp on the same tired example, but why do you really need any pointer type other than void*? It's the same idea.
No, it isn't. A string is a text chunk.
You can combine them, concatenate them, search for specific substrings, or match against ASCII characters (for example, as in HTML) and parse them, and all of this is perfectly doable with a standard std::string, regardless of whether it holds UTF-8, Latin-1, or another ISO-8859-* ASCII-compatible encoding.
This is very different.
I'm trying to understand your point, but with no success so far. If you want something that gives you characters or code-points, then an std::string has no chance of working in any multi-byte encoding -- a UTF-whatever-specific type does.
Giving you a "utf-8" string or UTF-8 container would give you the false feeling that you're doing something right.
How?
Unicode is not about splitting a string into code points or iterating over them... It is a totally different thing.
I'm baffled by this statement. For doing anything interesting, Unicode or any other encoding *is* about iterating over characters (or code-points, if that's what you're looking for). Your point seems to be that the utf*_t classes are actively harmful in some way that I don't see, and using std::string somehow mitigates that by making you do more work. Or am I misunderstanding you?
--
Chad Nelson
Oak Circle Software, Inc.
* * *