
On Sun, Jan 23, 2011 at 6:42 PM, Patrick Horgan <phorgan1@gmail.com> wrote: [snip/]
No. They're using std::string. It works just fine for this, as it does for other things. Its performance guarantees are in respect to the templated data type, not the encoding of the contents. std::string lets you walk through a JIS string and decode it. A utf-8 string would hurl chunks since, of course, the contents wouldn't be valid utf-8. I could go on and on, but perhaps if you refreshed yourself on the interface of std::string and thought about the implications for it if it were a validating utf-8 string, you'd see. I'm really in favor of a utf-8 string, I just wouldn't call it string, because that would be a lie. It wouldn't be a general string, but a special case of string.
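To illustrate, here is a minimal sketch (the helper name and the simplified decoder are mine, nothing standard) of the kind of check a hypothetical validating utf-8 string would have to run, and why legacy-encoded bytes that std::string carries happily would make it choke:

    #include <cstdio>
    #include <string>

    // Hypothetical helper: the structural check a validating "utf8_string"
    // would have to run on every mutation. Simplified: it checks lead and
    // continuation byte structure only, not overlong forms or surrogates.
    bool looks_like_valid_utf8(const std::string& s)
    {
        for (std::size_t i = 0; i < s.size(); )
        {
            unsigned char c = static_cast<unsigned char>(s[i]);
            std::size_t len = c < 0x80         ? 1
                            : (c >> 5) == 0x6  ? 2
                            : (c >> 4) == 0xE  ? 3
                            : (c >> 3) == 0x1E ? 4 : 0;
            if (len == 0 || i + len > s.size())
                return false; // not a well-formed UTF-8 sequence
            for (std::size_t k = 1; k < len; ++k)
                if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                    return false; // missing continuation byte
            i += len;
        }
        return true;
    }

    int main()
    {
        // EUC-JP bytes for a Japanese word: std::string stores them fine,
        // but a validating utf-8 string would have to reject them.
        std::string euc_jp("\xB4\xC1\xBB\xFA", 4);
        std::printf("valid utf-8? %s\n",
                    looks_like_valid_utf8(euc_jp) ? "yes" : "no"); // "no"
    }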
This whole debate is, at least for me, about what std::string is and what we want it to be:

a) a little more than a glorified std::vector<char> with a few extra operations for more convenient handling of the byte sequences stored inside, which can currently be interpreted in dozens if not hundreds of ways, depending on the current platform, the default or explicitly selected locale+encoding, etc., etc.

b) a container of byte sequences that represent human-readable text, where every single sequence (provided it is valid) can be translated into exactly one sequence of "logical" characters of the said text by a standardized mapping, and which also provides operations for handling the text character-by-character, not only byte-by-byte (portably). If the application wishes, it can still treat the string only as a byte sequence, because that is of course a valid usage.
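To make the two views concrete, a small sketch (simplified decoder, no error handling; the names are mine) of the same bytes seen both ways:

    #include <cstdio>
    #include <string>

    // Simplified UTF-8 decoder for illustration only: advances i past one
    // code point and returns it. Assumes the input is well-formed.
    char32_t next_code_point(const std::string& s, std::size_t& i)
    {
        unsigned char c = static_cast<unsigned char>(s[i++]);
        if (c < 0x80) return c;                  // 1-byte (ASCII)
        int extra = (c >> 5) == 0x6 ? 1          // 2-byte sequence
                  : (c >> 4) == 0xE ? 2 : 3;     // 3- or 4-byte sequence
        char32_t cp = c & (0x3F >> extra);       // payload bits of lead byte
        while (extra-- > 0)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
        return cp;
    }

    int main()
    {
        std::string text = "na\xC3\xAFve"; // "naive" with i-diaeresis, UTF-8
        std::printf("view a) bytes: %zu\n", text.size());   // 6
        std::size_t chars = 0;
        for (std::size_t i = 0; i < text.size(); ++chars)
            next_code_point(text, i);
        std::printf("view b) characters: %zu\n", chars);    // 5
    }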
... elision by me. What functionality would they lose, exactly? Again, on many platforms the default encoding for all (or nearly all) locales already is UTF-8, so if you get a string from the OS API and store it in a std::string, then it is UTF-8 encoded. I do an equal share of programming on Windows and Linux and I have yet to run into the problems you describe on Linux, where for some time now the default encoding is UTF-8. Actually, today I encounter more problems on Windows, where I can't set the locale to use UTF-8 and consequently have to transcode data from socket connections or files manually.
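A quick way to see this asymmetry for yourself (a sketch; nl_langinfo is POSIX, so it is not available on Windows, which is exactly the point):

    #include <clocale>
    #include <cstdio>
    #ifndef _WIN32
    #include <langinfo.h>
    #endif

    int main()
    {
        // Adopt the user's default locale, then ask which encoding it implies.
        // On most current Linux distributions this prints "UTF-8".
        std::setlocale(LC_ALL, "");
    #ifndef _WIN32
        std::printf("codeset: %s\n", nl_langinfo(CODESET));
    #else
        // No narrow UTF-8 locale can be selected here, hence the manual
        // transcoding mentioned above.
        std::printf("no UTF-8 narrow locale available\n");
    #endif
    }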
I never said that having utf-8 encoded characters in a std::string would cause you some kind of problem. I don't think I said anything that would even let you infer that. You're completely off base here. I was talking about a string specialized for utf-8 encoded characters. You're chasing a red herring. You're attacking a strawman. I agree with you entirely: std::string does a great job of holding utf-8 encoded characters, as well as many other things.
Why would people want to lose so much of the functionality of std::string? [/quote]
OK, [OT] I was referring to the [quote] part. I meant no offense; I merely said that I have yet to run into any problems (losing much of the functionality) on platforms where std::string is used to hold byte sequences in the UTF-8 encoding. I certainly don't need to be right or to have everyone agree with me; this is a discussion and I gladly let myself be educated by people who know more about the issue at hand than I do. [/OT]
If you are talking about having indexed random access to "logical characters", for example on Windows with some special encodings, then that is only platform-specific, unportable functionality. What I propose is to extend the interface so that it would allow you to handle the "raw" byte sequences that are now used to represent strings of logical characters in a platform-independent way, by using the Unicode standard.
That's nice. I vote for that idea. Just don't call it std::string, because it won't be, and you won't be able to do everything with it that std::string does today.
Would you care to elaborate on what functionality you would lose? Even random access to individual characters could be implemented. Of course this would break the existing performance guarantees, which are however granted only on platforms that use std::string for single-byte encodings. It could also employ some caching mechanism to speed things up, but this is just an implementation detail and of course has its trade-offs (a sketch follows below).

But if this helps us (slowly) get rid of the necessity to handle various encodings that are relics from an age when every single byte of memory and every processor tick was a precious resource, then I am all for it. I imagine that the folks at the Unicode consortium have worked for the past 20+ years on the standard not only to create yet another encoding that would complement and live happily ever after with all the others, but to replace them eventually.

Having said that, I *do not* want to "ban" or prevent anyone from using specific encodings where it is necessary or advantageous, but such usage should be considered a special case and not general usage, as it is now. Many database systems, web browsers, web content creation tools, XML editors, etc., etc. consider UTF-8 to be the default, and yes, they let you work with other encodings, but as a special case. [snip/]
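Here is the promised minimal sketch (names mine) of what random access over UTF-8 could look like: O(n) per lookup instead of std::string's constant-time operator[], which is precisely the guarantee that would be traded away, and which a cache of byte offsets could amortize:

    #include <cstddef>
    #include <string>

    // Hypothetical helper: byte offset of the n-th code point in a UTF-8
    // string. Linear scan; a real implementation could cache offsets.
    std::size_t offset_of_code_point(const std::string& s, std::size_t n)
    {
        std::size_t i = 0;
        while (n > 0 && i < s.size())
        {
            ++i; // skip the lead byte...
            // ...and its continuation bytes (of the form 10xxxxxx)
            while (i < s.size() &&
                   (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
                ++i;
            --n;
        }
        return i;
    }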
That's an advantage of a utf-8 encoded file. You don't need a special string type to write to it. Before writing to the file you can hold the data in memory in a std::string, a C string, or a chunk of mmap'd memory today, and if they contain data encoded in utf-8 you have the same advantage.
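Indeed, the round trip needs nothing special; a minimal sketch:

    #include <fstream>
    #include <string>

    int main()
    {
        // Bytes that are already UTF-8 go into a UTF-8 file verbatim;
        // std::string just carries them, no special string type needed.
        std::string utf8 = "gr\xC3\xBC\xC3\x9F"; // German "gruess" in UTF-8
        std::ofstream out("out.txt", std::ios::binary);
        out << utf8;
    }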
That is not only what I can do, but also what I already do, and I'm not very happy with the results, because if std::string is now expected by the OS/libraries to use a platform-specific encoding, then the two do not play together very well, unless you (of course) transcode explicitly. I rarely use an mmap'd file as a whole without trying to parse it and use the data, for example in a GUI. [snip/]
A string specialized to only hold utf-8 encoded data wouldn't be any good to someone not using the utf-8 encoding. Even if they were using 7-bit ascii for a network application like ftp, for example, they'd have to pay the penalty of validating that the 7-bit ascii was valid utf-8. If they're using it as a container to carry around encrypted strings, well, that wouldn't be possible at all.
Let us have a "special_encoding_string" where we need to handle the legacy encodings ...
If a system call returned a name valid in that operating system, one that I would later pass to another system call, and it wasn't utf-8, what could I do? Break? Or corrupt the string?
... and a native_encoding_string or even let's use vector<char> for these two (they are valid, but IMO *special*) use-cases. [snip/]
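To illustrate the round trip in question with a POSIX-only sketch (file names there are opaque byte sequences with no guarantee of valid UTF-8, and std::string or vector<char> carries them verbatim):

    #include <dirent.h>
    #include <string>
    #include <vector>

    // POSIX sketch: file names come back as raw bytes and must be passed
    // to later system calls unchanged; no UTF-8 validity is guaranteed.
    std::vector<std::string> list_names(const char* path)
    {
        std::vector<std::string> names;
        if (DIR* d = opendir(path))
        {
            while (dirent* e = readdir(d))
                names.push_back(e->d_name); // bytes copied verbatim
            closedir(d);
        }
        return names; // each entry can go straight back to open()/stat()
    }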
Vaya con Dios, hasta la vista :)
Matus