
On Sun, Jan 29, 2012 at 17:52, Beman Dawes <bdawes@acm.org> wrote: [...]
I personally prefer char32_t and char16_t to uint_least32_t and uint_least16_t, but don't have enough experience with the C++11 types to make blanket recommendations.
I don't care for the name. I claim that we don't need a distinct type with a keyword for that.
2. "Standard library strings with different character encodings have different types that do not interoperate." It's good. There shall no be implicit conversions in user code. If the user wants, she shall
specify the
conversion explicitly, as in:
s2 = convert-with-whatever-explicit-interface-you-like("foo");
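For illustration only: one explicit interface that already exists in C++11 is wstring_convert (heavyweight, but it keeps the conversion visible at the call site):

    #include <codecvt>
    #include <locale>
    #include <string>

    int main() {
        // Explicit UTF-8 -> UTF-32 conversion; nothing happens implicitly.
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cvt;
        std::u32string s2 = cvt.from_bytes("foo");
    }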
int x; long y; ... y = x; ... x = y;
Nothing controversial here, and very convenient. The x = y conversion is lossy, but the semantics are well defined and you can always use a function call if you want different semantics.
It is controversial. It was inherited from C, where even void* -> int* conversion was possible. Some argue that x = y should be an error; see D&E 14.3.5.2. Most compilers issue a warning for this. Note that where compatibility with C is not a concern, C++ prohibits narrowing conversions:

    vector<int> v = {1, 2, 3};
    vector<short> v1 = {v[0], v[1], v[2]}; // error: narrowing
    vector<long> v2 = v;                   // not narrowing, but fails too

Btw, x = y is implementation-defined if y holds a large negative value, not "well defined".

string x; u32string y; ... y = x; ... x = y;
Why is this any different? It is very convenient. We can argue about the best semantics for the x = y conversion, but once those semantics are settled you can always use a function call if you want different semantics.
Convenient: yes. But not every convenient feature is good; it can do harm. The first things that come to mind are:
1. Overload resolution ambiguity or surprising results (see the sketch below).
2. It hides potentially expensive conversions (I agree to do these implicitly only when interacting with 3rd-party code).
3. It eases interoperability between different encodings, thus postponing standardization on one encoding, yet it doesn't solve the headache completely (the user still has to think about encodings and choose the string she needs from this zoo: string, u16string, u32string...).
And why don't we have std::string::operator const char*()?
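To make point 1 concrete, here is a toy reproduction of the ambiguity (the type names are made up; the implicit converting constructors stand in for the cross-encoding conversions under discussion):

    // Toy stand-ins for strings that convert implicitly between encodings.
    struct utf16_string {};
    struct narrow_string { narrow_string(); narrow_string(const utf16_string&); };
    struct utf32_string  { utf32_string();  utf32_string(const utf16_string&); };

    void f(narrow_string);
    void f(utf32_string);

    void g(utf16_string s) {
        f(s); // error: ambiguous call - both implicit conversions are viable
    }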
problems..." Class path forces the user to use a specific encoding
3. "...class path solves some of the string interoperability that she
even may not be willing to hear of. It manifests in the following ways: - The 'default' interface returns the encoding used by the system, requiring the user to use a verbose interface to get the encoding she uses. - If the user needs to get the path encoded in her favorite encoding *by reference* with a lifetime of the path (e.g. as a parameter to an async call), she must maintain a long living *copy* of the temporary returned from the said interface. - Getting the extension from a narrow-string path using boost::path on Windows involves *two* conversions although the system is never called in the middle. - Library code can't use path::imbue(). It must pass the corresponding codecvt facet everywhere to use anything but the (implementation defined and volatile at runtime) default.
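For the extension case, the two conversions look like this (Boost.Filesystem v3 on Windows, where path stores wchar_t internally):

    #include <boost/filesystem/path.hpp>
    #include <string>

    int main() {
        boost::filesystem::path p("readme.txt");  // conversion 1: narrow -> wide
        std::string ext = p.extension().string(); // conversion 2: wide -> narrow
        // No system call happens in between; the extension could have been
        // computed directly on the original narrow string.
    }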
My contention is that class path is having to take on conversion responsibilities that are better performed by basic_string. That is part of the motivation for exploring ways string classes could take on some of those responsibilities.
Good. But my intent is to move the conversions inside the operational functions (preferably). Until we can standardize on a Unicode execution character set, let the conversion happen when calling those functions (perhaps using a path_ref that does it implicitly, if we don't want the FS v2 templated functions); a rough sketch follows. I remind you that class path is used not just for calling the system.
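Something along these lines (entirely hypothetical; the name path_ref and its members are illustrative only):

    // Hypothetical: a cheap, non-owning reference to a narrow path string.
    class path_ref {
    public:
        path_ref(const char* s) : s_(s) {}  // no conversion at construction
        const char* narrow() const { return s_; }
    private:
        const char* s_;
    };

    // Operational function: the narrow -> native conversion happens here,
    // right before the OS is called, not inside a stored path object.
    void remove(path_ref p);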
4. "Can be called like this: (example)" So we had 2 encodings to consider before C++11, 4 after the additions in C++11 and you're
proposing
additions to make it easier to work with any number of encodings. We are moving towards encoding HELL.
The number of encodings isn't a function of C++; it is a function of the real world. Traditionally, there were many encodings in wide use, and then Unicode came along with a few more. But the Unicode encodings have enough advantages that users are gradually moving away from non-Unicode encodings. C++ needs to accommodate that trend by becoming friendlier to the Unicode encodings.
Sure. But it doesn't mean that it has to be friendlier to ALL Unicode encodings.
5. "A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:" Unicode string literal (except u8) illustrates how adding yet another unneeded feature to the C++ standard complicates the language, adds problems, adds frustration and solves nothing. The user can just write
cout << u8"您好世界";
Even better is:
cout << "您好世界";
which *just works* on most compilers (e.g. GCC) and needs some trickery on others (MSVC: save the source as UTF-8 without a BOM). A much simpler solution is to standardize narrow string literals to be UTF-8 encoded (or, better phrased, "capable of storing any Unicode data", so this will work with UTF-EBCDIC where needed), but I know it's too much to ask.
I'm not sure that is too much to ask for the C++ standard after C++11, whatever it ends up being called. It would take a lot of careful work to bring the various interests on board. A year ago was the wrong point in the C++ standard revision cycle to even talk about such a change. But C++11 has shipped. Now is the time to start the process of moving the problem onto the committee's radar screen.
Thanks for the forecast!
6. "String conversion iterators are not provided (minus Example)" This section *I fully support*. The additions to C++11 pushed by Dinkumware
heavy, not general enough, and badly designed. C++11 still lacks convenient conversion between different Unicode encodings, which is a must in today's world. Just a few notes: - "Interfaces work at the level of entire strings rather than characters," This *is* desired since the overhead of the temporary allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32 conversions need large chunks of data. Nevertheless I agree that iterator access is sometimes preferred. - Instead of the c_str() from "Example" a better approach is to provide a convenience non-member function that can work on any range of chars. E.g. using the "char type specifies the encoding" approach
are this
would be:
std::wstring wstr = convert<wchar_t>(u8"您好世界"); // doesn't even construct an std::string std::string u8str = convert<char>(wstr); // don't care for the name
While I'm totally convinced that conversion iterators would be very useful, the exact form is an open question. Could you be more specific about the details of your convert suggestion?
The point is that it's more like a free-standing version of the c_str() you proposed. Unlike the c_str() member function, it would work on any character range and return a range of converting iterators. We don't need to extend basic_string for this; it is already too big.
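Roughly, the shape would be something like this (a sketch only; none of these names exist anywhere, and the iterator machinery is omitted):

    #include <utility>

    // Hypothetical: an input iterator that re-encodes code units on the
    // fly as it is advanced; the actual implementation is omitted here.
    template <class ToChar, class FromIter>
    class converting_iterator;

    // A free-standing analogue of the proposed c_str(): accepts any
    // character range and returns a lazy range of converting iterators.
    // No string is materialized unless the caller copies the range out.
    template <class ToChar, class FromIter>
    std::pair<converting_iterator<ToChar, FromIter>,
              converting_iterator<ToChar, FromIter>>
    convert(FromIter first, FromIter last);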
7. True interoperability, portability and conciseness will come when we standardize on *one* encoding.
Even if we are only talking about Unicode, multiple encodings still seem a necessity.
Unicode algorithms work on code points (UCS-4) internally. Everything else can be encoded in some (narrow) execution character set capable of storing Unicode. Almost no one implements Unicode algorithms, so we can practically assume that one encoding is sufficient on each platform.

-- Yakov