
On Wed, 19 Jan 2011 00:44:39 -0800 (PST) Artyom <artyomtnk@yahoo.com> wrote:
From: Chad Nelson <chad.thecomfychair@gmail.com>
3. There is no special char8_t type distinct from char, so you can't use it.
That's why I wrote the utf8_t type. I'd have been quite happy to just use an std::basic_string<utf8_byte_t>, and I looked into the C++0x "opaque typedef" idea to see if it was possible.
Even if opaque typedefs had been included in C++0x, they would still not be feasible for use as a string. [...] I faced this problem when I tested char16_t/char32_t under gcc with a partial standard library implementation that hadn't specialized its classes for them, and I couldn't get many things to work.
This is a real problem.
So it would just not work even if C++0x had opaque typedefs.
I think there would have been ways around the problem. For the example you quoted, the most logical solution would probably be to just use basic_stringstream<wchar_t> and convert the string afterward. Not a very satisfying solution, but it would have worked. In any case, the point is moot, since opaque typedefs won't be in C++0x.
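For illustration, a minimal sketch of that workaround - format into a wide stream, then convert to UTF-8 afterward. The conversion step here assumes Boost.Locale's utf_to_utf helper; any wchar_t-to-UTF-8 converter would do:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <boost/locale/encoding_utf.hpp>

    int main() {
        // Format into a wide stream, where the standard library's
        // specializations actually exist...
        std::basic_stringstream<wchar_t> ss;
        ss << L"value = " << 42;

        // ...then convert the finished result to UTF-8 afterward.
        std::string utf8 = boost::locale::conv::utf_to_utf<char>(ss.str());
        std::cout << utf8 << "\n";
        return 0;
    }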
Ok...
The paragraph above is inherently wrong.
Oh?
I hope you are not offended, but I have seen so many things go wrong because of such assumptions that I'm a little bit frustrated that such things come up again and again.
'Fraid they'll continue to come up, because there are always new developers and there isn't a lot of information on the subject available where a developer would stumble into it by accident. Having a set of UTF string types with three different kinds of iterators would at least make some C++ programmers realize that the problem exists, when they wouldn't have before.
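For the record, a hypothetical interface sketch of what "three different kinds of iterators" means here - none of this exists yet; utf8_t and the iterator names below are only the shape of the proposal:

    // Hypothetical sketch only -- utf8_t is the proposed class, not an
    // existing library type. Each iterator makes a different notion of
    // "one element" explicit at compile time.
    class utf8_t {
    public:
        class byte_iterator;        // one UTF-8 code unit (byte) at a time
        class codepoint_iterator;   // one Unicode code point at a time
        class character_iterator;   // one user-perceived character
                                    //   (grapheme cluster) at a time

        byte_iterator      bytes_begin() const;
        byte_iterator      bytes_end() const;
        codepoint_iterator codepoints_begin() const;
        codepoint_iterator codepoints_end() const;
        character_iterator characters_begin() const;
        character_iterator characters_end() const;
    };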
Or if your program allows the user to edit a file, you want something that gives you single characters, regardless of how many bytes or code-points they're encoded in.
That is what great Unicode-aware toolkits like Qt, GtkMM, and others do for you, with hundreds of thousands of lines of code. [...]
Which is great, if you happen to be using a Qt-based or Gtk-based interface in your program, but useless if you're not. I'd prefer a solution that's not tied to monolithic libraries that try to deliver everything and the kitchen sink.
I'm trying to understand your point, but with no success so far. If you want something that gives you characters or code-points, then an std::string has no chance of working in any multi-byte encoding -- a UTF-whatever-specific type does.
It works perfectly well. However, for text analysis you either:

1. Use a library like Boost.Locale, or
2. Restrict yourself to the ASCII subset of the text, which lets you handle 99% of the various formats out there - you don't need code-point iterators for this.
Why would you want to do either of those, when something like a utf8_t class could make the Boost.Locale interface easier and more intuitively obvious to use, and eliminate the ASCII restriction too?
Your point seems to be that the utf*_t classes are actively harmful in some way that I don't see, and using std::string somehow mitigates that by making you do more work. Or am I misunderstanding you?
My statement is the following:
- utf*_t would not give any real added value - that is what I was trying to show - and if you want to iterate over code points, you can do it perfectly well with an external iterator over std::string.
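For illustration, here is roughly what such external iteration looks like - a hand-rolled sketch of decoding code points straight out of a std::string; the decoder is my own simplified version, assuming well-formed UTF-8 and doing no validation:

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Decode the code point starting at s[i], advancing i past it.
    // Simplified: assumes well-formed UTF-8, no validation at all.
    std::uint32_t next_codepoint(const std::string& s, std::size_t& i) {
        unsigned char b = s[i++];
        if (b < 0x80) return b;                     // ASCII, single byte
        int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
        std::uint32_t cp = b & (0x3F >> extra);     // leading-byte payload
        while (extra-- > 0)
            cp = (cp << 6) | (s[i++] & 0x3F);       // continuation bytes
        return cp;
    }

    int main() {
        std::string text = "na\xC3\xAFve";          // "naive" with U+00EF
        for (std::size_t i = 0; i < text.size(); )
            std::cout << "U+" << std::hex << next_codepoint(text, i) << "\n";
        return 0;
    }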
You can also handle strings perfectly well the C way, with manually allocated memory, strcpy, strlen, and the like. But you still see the benefits of using an std::string class.
But in most cases you don't want to iterate over code points, but rather over characters, words, and other text entities - and code points would not help you with this.
An explicit *character* iterator, over a UTF-type, would solve that problem.
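For comparison, character-level (grapheme) and word-level iteration over a plain std::string is what Boost.Locale's boundary analysis already provides - a sketch, assuming Boost.Locale is built with its ICU backend:

    #include <boost/locale.hpp>
    #include <iostream>
    #include <string>

    int main() {
        boost::locale::generator gen;
        std::locale loc = gen("en_US.UTF-8");

        std::string text = "Gr\xC3\xBC\xC3\x9F" "e";   // "Gruesse" in UTF-8

        namespace ba = boost::locale::boundary;

        // boundary::character yields user-perceived characters
        // (grapheme clusters), not bytes or code points.
        ba::ssegment_index chars(ba::character, text.begin(), text.end(), loc);
        for (ba::ssegment_index::iterator p = chars.begin();
             p != chars.end(); ++p)
            std::cout << *p << "\n";
        return 0;
    }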
- utf*_t would create trouble, as it would require constant conversions between the utf*_t types and 1001 other libraries.
And what is even more important, it can't simply be integrated into the existing C++ string framework.
Oh? :-) The way I'm envisioning it, you could do something like this...

    utf8_t foo = some_utf8_text;
    cout << *native_t(foo);

...to send a string to stdout. It would be automatically transcoded to the system's current code page (probably using Boost.Locale) if the code page isn't already UTF-8, and the asterisk would provide an std::string in that type. Though of course, the utf*_t classes would be provided with an output operator of their own that would take care of that for you, so you wouldn't have to.

If you needed to interface with a Windows API function...

    utf16_t bar = foo; // Automatic conversion
    DrawTextW(dc, bar->c_str(), bar->length(), rect, flags);

...that would do the trick, and would probably get buried in a library function of some sort that takes a utf16_t type. If you fed it an std::string, std::wstring, or utf32_t type, it would be automatically converted when the function is called. And if you fed it a utf16_t, of course, no conversion would be done; it would be used as-is.

So while you might have to do some conversion to other string types to interface with different existing libraries (like Qt), the process is very simple and can probably be automated. *If* you decided to use the utf*_t types at all. And as I've said before, you can simply use std::string for any functions that are encoding-agnostic.
- All I suggest is this: when you work on Windows, don't use ANSI encodings; assume that std::string is UTF-8 encoded, and convert it to UTF-16 at system-call boundaries.
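For illustration, the whole of that approach at a Win32 boundary is only a few lines - a sketch using the real MultiByteToWideChar call; the widen() helper name is just mine:

    #include <string>
    #include <windows.h>

    // std::string carries UTF-8 everywhere in the program; widen only
    // at the system-call boundary.
    std::wstring widen(const std::string& utf8) {
        if (utf8.empty()) return std::wstring();
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                    (int)utf8.size(), NULL, 0);
        std::wstring utf16(n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                            (int)utf8.size(), &utf16[0], n);
        return utf16;
    }

    // Usage at the boundary:
    //   SetWindowTextW(hwnd, widen(title).c_str());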
Assumptions like that will cause problems for existing codebases, which are probably using std::strings in ways that would break. Can something as widely used as Boost afford to make a breaking change like that?

On the other hand, with a set of UTF types, you could provide two overloads: one that blindly operates on std::strings as it does now, and one that works on the most convenient UTF form, which would automatically provide some guarantees about the content (such as that it's valid). If, of course, the function you're using cares about the encoding. As you pointed out, most won't, and can be left using std::string with no problem.

And if you, the function's author, want to move away from the std::string form, you just mark it deprecated and leave it there with a warning about when it will go away. The company using the library can make its own decision about whether to upgrade beyond that point or not. I don't foresee many authors with a need for that kind of thing, but for those that do, it would be nice if it were there.
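A sketch of that dual-overload arrangement - everything here is hypothetical (draw_label is an invented example function, and the utf8_t stub merely stands in for the proposed class):

    #include <string>

    // Stand-in stub for the proposed class; the real one would
    // validate its contents on construction.
    struct utf8_t {
        explicit utf8_t(const std::string& bytes) : data(bytes) {}
        std::string data;
    };

    // New overload: the type itself guarantees valid UTF-8.
    void draw_label(const utf8_t& text);

    // Old overload, kept for existing callers and eventually marked
    // deprecated; it blindly treats the bytes as UTF-8, as the
    // current code does.
    inline void draw_label(const std::string& text) {
        draw_label(utf8_t(text));
    }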
Basically - don't reinvent things; try to make the current code work well. It has some design flaws, but overall C++/STL is totally fine for Unicode handling. It needs some things improved, but providing utf*_t classes is not the way to go.
This is **My** point of view.
Thanks for making it clear. I have to disagree, though. Most programmers don't want to delve into Unicode and learn about the intricacies of code-points and the like; they just want to use it. The UTF string types should let them do so, in most cases, with a much gentler learning curve than using ICU (or even Boost.Locale) directly.

--
Chad Nelson
Oak Circle Software, Inc.

* * *