
On Sun, Jan 29, 2012 at 17:52, Beman Dawes <bdawes@acm.org> wrote: [...]
I personally prefer char32_t and char16_t to uint_least32_t and uint_least16_t, but don't have enough experience with the C++11 types to make blanket recommendations.
I don't care for the name. I claim that we don't need a distinct type with a keyword for that.
2. "Standard library strings with different character encodings have different types that do not interoperate." It's good. There shall no be implicit conversions in user code. If the user wants, she shall
specify the
conversion explicitly, as in:
s2 = convert-with-whatever-explicit-interface-you-like("foo");
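For illustration only: one explicit interface that already exists in C++11 is wstring_convert (heavyweight, but it keeps the conversion visible at the call site):

    #include <codecvt>
    #include <locale>
    #include <string>

    int main() {
        // Explicit UTF-8 -> UTF-32 conversion; nothing happens implicitly.
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cvt;
        std::u32string s2 = cvt.from_bytes("foo");
    }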
int x; long y; ... y = x; ... x = y;
Nothing controversial here, and very convenient. The x = y conversion is lossy, but the semantics are well defined and you can always use a function call if you want different semantics.
It is controversial. It was inherited from C, where even void* -> int* conversion was possible. Some argue that x = y should be an error; see D&E 14.3.5.2. Most compilers issue a warning for this. Note that where compatibility with C is not a concern, C++ prohibits narrowing conversions:

    vector<int> v = {1, 2, 3};
    vector<short> v1 = {v[0], v[1], v[2]}; // error: narrowing
    vector<long> v2 = v;                   // not narrowing, but fails too

Btw, x = y is implementation-defined if y holds a large negative value, not "well defined".

string x; u32string y; ... y = x; ... x = y;
Why is this any different? It is very convenient. We can argue about the best semantics for the x = y conversion, but once those semantics are settled you can always use a function call if you want different semantics.
Convenient: yes. But not every convenient feature is good; it can do harm. The first things that come to mind are:
1. Overload resolution ambiguity or surprising results (see the sketch below).
2. It hides potentially expensive conversions (I agree to do these implicitly only when interacting with 3rd-party code).
3. It eases interoperability between different encodings, thus postponing standardization on one encoding, yet it doesn't solve the headache completely (the user still has to think about encodings and choose the string she needs from this zoo: string, u16string, u32string...).
And why don't we have std::string::operator const char*()?
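To make point 1 concrete, here is a toy reproduction of the ambiguity (the type names are made up; the implicit converting constructors stand in for the cross-encoding conversions under discussion):

    // Toy stand-ins for strings that convert implicitly between encodings.
    struct utf16_string {};
    struct narrow_string { narrow_string(); narrow_string(const utf16_string&); };
    struct utf32_string  { utf32_string();  utf32_string(const utf16_string&); };

    void f(narrow_string);
    void f(utf32_string);

    void g(utf16_string s) {
        f(s); // error: ambiguous call - both implicit conversions are viable
    }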
problems..." Class path forces the user to use a specific encoding
3. "...class path solves some of the string interoperability that she
even may not be willing to hear of. It manifests in the following ways: - The 'default' interface returns the encoding used by the system, requiring the user to use a verbose interface to get the encoding she uses. - If the user needs to get the path encoded in her favorite encoding *by reference* with a lifetime of the path (e.g. as a parameter to an async call), she must maintain a long living *copy* of the temporary returned from the said interface. - Getting the extension from a narrow-string path using boost::path on Windows involves *two* conversions although the system is never called in the middle. - Library code can't use path::imbue(). It must pass the corresponding codecvt facet everywhere to use anything but the (implementation defined and volatile at runtime) default.
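For the extension case, the two conversions look like this (Boost.Filesystem v3 on Windows, where path stores wchar_t internally):

    #include <boost/filesystem/path.hpp>
    #include <string>

    int main() {
        boost::filesystem::path p("readme.txt");  // conversion 1: narrow -> wide
        std::string ext = p.extension().string(); // conversion 2: wide -> narrow
        // No system call happens in between; the extension could have been
        // computed directly on the original narrow string.
    }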
My contention is that class path is having to take on conversion responsibilities that are better performed by basic_string. That is part of the motivation for exploring ways string classes could take on some of those responsibilities.
Good. But my intent is to move the conversions inside the operational functions (preferably). Until we can standardize on a Unicode execution character set, let the conversion happen when calling those functions (perhaps using a path_ref that does it implicitly, if we don't want the FS v2 templated functions); a rough sketch follows. I remind you that class path is used not just for calling the system.
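Something along these lines (entirely hypothetical; the name path_ref and its members are illustrative only):

    // Hypothetical: a cheap, non-owning reference to a narrow path string.
    class path_ref {
    public:
        path_ref(const char* s) : s_(s) {}  // no conversion at construction
        const char* narrow() const { return s_; }
    private:
        const char* s_;
    };

    // Operational function: the narrow -> native conversion happens here,
    // right before the OS is called, not inside a stored path object.
    void remove(path_ref p);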
4. "Can be called like this: (example)" So we had 2 encodings to consider before C++11, 4 after the additions in C++11 and you're
proposing
additions to make it easier to work with any number of encodings. We are moving towards encoding HELL.
The number of encodings isn't a function of C++; it is a function of the real world. Traditionally, there were many encodings in wide use, and then Unicode came along with a few more. But the Unicode encodings have enough advantages that users are gradually moving away from non-Unicode encodings. C++ needs to accommodate that trend by becoming friendlier to the Unicode encodings.
Sure. But it doesn't mean that it has to be friendlier to ALL Unicode encodings.
5. "A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:" Unicode string literal (except u8) illustrates how adding yet another unneeded feature to the C++ standard complicates the language, adds problems, adds frustration and solves nothing. The user can just write
cout << u8"您好世界";
Even better is:
cout << "您好世界";
which *just works* on most compilers (e.g. GCC) and needs some trickery on others (MSVC: save the source as UTF-8 without a BOM). A much simpler solution is to standardize narrow string literals to be UTF-8 encoded (or, better phrased, "capable of storing any Unicode data", so this will work with UTF-EBCDIC where needed), but I know it's too much to ask.
I'm not sure that is too much to ask for the C++ standard after C++11, whatever it ends up being called. It would take a lot of careful work to bring the various interests on board. A year ago was the wrong point in the C++ standard revision cycle to even talk about such a change. But C++11 has shipped. Now is the time to start the process of moving the problem onto the committee's radar screen.
Thanks for the forecast!
6. "String conversion iterators are not provided (minus Example)" This section *I fully support*. The additions to C++11 pushed by Dinkumware
heavy, not general enough, and badly designed. C++11 still lacks convenient conversion between different Unicode encodings, which is a must in today's world. Just a few notes: - "Interfaces work at the level of entire strings rather than characters," This *is* desired since the overhead of the temporary allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32 conversions need large chunks of data. Nevertheless I agree that iterator access is sometimes preferred. - Instead of the c_str() from "Example" a better approach is to provide a convenience non-member function that can work on any range of chars. E.g. using the "char type specifies the encoding" approach
are this
would be:
std::wstring wstr = convert<wchar_t>(u8"您好世界"); // doesn't even construct an std::string std::string u8str = convert<char>(wstr); // don't care for the name
While I'm totally convinced that conversion iterators would be very useful, the exact form is an open question. Could you be more specific about the details of your convert suggestion?
The point is that it's more like a free-standing version of the c_str() you proposed. Unlike the c_str() member function, it would work on any character range and return a range of converting iterators. We don't need to extend basic_string for this; it is already too big.
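Roughly, the shape would be something like this (a sketch only; none of these names exist anywhere, and the iterator machinery is omitted):

    #include <utility>

    // Hypothetical: an input iterator that re-encodes code units on the
    // fly as it is advanced; the actual implementation is omitted here.
    template <class ToChar, class FromIter>
    class converting_iterator;

    // A free-standing analogue of the proposed c_str(): accepts any
    // character range and returns a lazy range of converting iterators.
    // No string is materialized unless the caller copies the range out.
    template <class ToChar, class FromIter>
    std::pair<converting_iterator<ToChar, FromIter>,
              converting_iterator<ToChar, FromIter>>
    convert(FromIter first, FromIter last);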
7. True interoperability, portability and conciseness will come when we standardize on *one* encoding.
Even if we are only talking about Unicode, multiple encodings still seem a necessity.
Unicode algorithms work on code points (UCS-4) internally. Everything else can be encoded in some (narrow) execution character set capable of storing Unicode. Almost no one implements Unicode algorithms, so we can practically assume that one encoding is sufficient on each platform.

-- Yakov