[strings][unicode] Proposals for Improved String Interoperability in a Unicode World

Beman.github.com/string-interoperability/interop_white_paper.html describes Boost components intended to ease string interoperability in general and Unicode string interoperability in particular.

These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.

I'm very interested in hearing comments about either the Boost or the TR2 proposal. Are these useful additions? Is there a better way to achieve the same easy interoperability goals?

Where is the best home for the Boost proposals? A separate library? Part of some existing library?

Are these proposals orthogonal to the need for deeper Unicode functionality, such as Mathias Gaunard's Unicode components?

--Beman

My opinion:

1. You shall not use any char type other than char and wchar_t for working with strings. Using the char type and/or char_traits to mark the encoding doesn't work, because the standard-provided facets, C standard library functions, etc. are provided almost only for the char and wchar_t types. And we *don't want* to specialize all possible facets for each possible encoding, just as we don't want to add u16sprintf, u32sprintf, u16cout, u32cout, etc. This would effectively increase the size of the interface to ϴ(number-of-entities × number-of-encodings). Following the above, you won't use the char32_t and char16_t types added in C++11 either. You will use just one or two encodings internally, namely those used for char and wchar_t according to the conventions in your code and/or the platform you work with. The only place you may need the char**_t types is when converting from UTF-16/UTF-32 into the internal encoding you use for your strings (either narrow or wide). But in those conversion algorithms uint_least32_t and uint_least16_t suit your needs just fine.

2. "Standard library strings with different character encodings have different types that do not interoperate." That is good. There shall be no implicit conversions in user code. If the user wants a conversion, she shall specify it explicitly, as in:

s2 = convert-with-whatever-explicit-interface-you-like("foo");

3. "...class path solves some of the string interoperability problems..." Class path forces the user to use a specific encoding that she may not even be willing to hear of. It manifests in the following ways:
- The 'default' interface returns the encoding used by the system, requiring the user to use a verbose interface to get the encoding she uses.
- If the user needs to get the path encoded in her favorite encoding *by reference* with the lifetime of the path (e.g. as a parameter to an async call), she must maintain a long-living *copy* of the temporary returned from the said interface.
- Getting the extension from a narrow-string path using boost::path on Windows involves *two* conversions although the system is never called in the middle.
- Library code can't use path::imbue(). It must pass the corresponding codecvt facet everywhere to use anything but the (implementation-defined and volatile at runtime) default.

4. "Can be called like this: (example)" So we had 2 encodings to consider before C++11, 4 after the additions in C++11, and you're proposing additions to make it easier to work with any number of encodings. We are moving towards encoding HELL.

5. "A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:" Unicode string literals (except u8) illustrate how adding yet another unneeded feature to the C++ standard complicates the language, adds problems, adds frustration and solves nothing. The user can just write

cout << u8"您好世界";

Even better is:

cout << "您好世界";

which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ) and needs some trickery on others (MSVC: save as UTF-8 without BOM). A much simpler solution is to standardize narrow string literals to be UTF-8 encoded (or, better phrased, "capable of storing any Unicode data", so that this also works with UTF-EBCDIC where needed), but I know it's too much to ask.

6. "String conversion iterators are not provided (minus Example)" This section *I fully support*. The additions to C++11 pushed by Dinkumware are heavy, not general enough, and badly designed.
C++11 still lacks convenient conversion between different Unicode encodings, which is a must in today's world. Just a few notes:
- "Interfaces work at the level of entire strings rather than characters." This *is* desired, since the overhead of the temporary allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32 conversions need large chunks of data. Nevertheless I agree that iterator access is sometimes preferred.
- Instead of the c_str() from "Example", a better approach is to provide a convenience non-member function that can work on any range of chars. E.g. using the "char type specifies the encoding" approach this would be:

std::wstring wstr = convert<wchar_t>(u8"您好世界"); // doesn't even construct an std::string
std::string u8str = convert<char>(wstr); // don't care for the name

7. True interoperability, portability and conciseness will come when we standardize on *one* encoding.

On Sat, Jan 28, 2012 at 18:46, Beman Dawes <bdawes@acm.org> wrote:
Beman.github.com/string-interoperability/interop_white_paper.html describes Boost components intended to ease string interoperability in general and Unicode string interoperability in particular.
These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
I'm very interested in hearing comments about either the Boost or the TR2 proposal. Are these useful additions? Is there a better way to achieve the same easy interoperability goals?
Where is the best home for the Boost proposals? A separate library? Part of some existing library?
Are these proposals orthogonal to the need for deeper Unicode functionality, such as Mathias Gaunard's Unicode components?
--Beman
Sincerely, -- Yakov

On 01/28/2012 08:48 PM, Yakov Galka wrote:
The user can just write
cout<< u8"您好世界";
Even better is:
cout<< "您好世界";
which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ) and needs some trickery on others (MSVC: save as UTF-8 without BOM).
No, that's just wrong. That's not the model that C++ uses. By not storing it with the BOM, you're essentially tricking MSVC into believing it is ANSI (windows-1252 on western systems), and thus avoiding the conversion from the source character set to the execution character set, since those happen to be the same.

The way a C++ compiler is supposed to work is that all of your source is in the source character set, regardless of the type of string literal you use. Then the compiler will convert your source character set to the execution character set for narrow string literals, to the wide execution character set for wide string literals, to UTF-8 for u8 literals, etc.

The correct way to portably use Unicode characters in a C++ source is to write it as UTF-8 and ensure that all compilers will consider the source character set to be UTF-8. Then use the appropriate literal types depending on what encoding you want your string literals to end up in.

Of course, in the real world, this causes two practical problems:
- MSVC requires a BOM to be present, but GCC will choke if there is one.
- In the absence of u8 string literals, you're stuck with wide string literals if you want something resembling Unicode, unless you use narrow string literals with just ASCII and escape sequences (\xYY; \u and \U will not work since they will be converted).

What probably should be done is that compilers should be compelled to support UTF-8 as the source character set in a unified way.

I once asked volodya if it were feasible to implement this in the build system (add a BOM for MSVC), but he didn't seem to think it was worth it.
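A short illustrative sketch of the literal model described above, plus the \xYY workaround from the second bullet (editorial example, not from the original message; variable names are arbitrary, and the comments note where the resulting encoding is implementation-defined):

    // Assuming the source file itself is UTF-8: the prefix, not the source
    // encoding, decides what encoding the literal's data ends up in.
    const char*     narrow = "caf\u00E9";   // execution character set (the ANSI code page with MSVC)
    const wchar_t*  wide   = L"caf\u00E9";  // wide execution character set (UTF-16 on Windows, UTF-32 on most others)
    const char*     utf8   = u8"caf\u00E9"; // always UTF-8 (C++11)
    const char16_t* utf16  = u"caf\u00E9";  // always UTF-16 (C++11)
    const char32_t* utf32  = U"caf\u00E9";  // always UTF-32 (C++11)

    // The \xYY escape route: spell out the UTF-8 bytes yourself so that neither
    // the source nor the execution character set matters.
    // "\xE4\xBD\xA0\xE5\xA5\xBD" is the UTF-8 encoding of U+4F60 U+597D.
    const char nihao_utf8[] = "\xE4\xBD\xA0\xE5\xA5\xBD";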

On Sun, Jan 29, 2012 at 02:49, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/28/2012 08:48 PM, Yakov Galka wrote:
The user can just write
cout<< u8"您好世界";
Even better is:
cout<< "您好世界";
which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ) and needs some trickery on others (MSVC: save as UTF-8 without BOM).
No, that's just wrong. That's not the model that C++ uses. By not storing it with the BOM, you're essentially tricking MSVC into believing it is ANSI (windows-1252 on western systems), and thus avoiding the conversion from the source character set to the execution character set, since those happen to be the same.
The way a C++ compiler is supposed to work is that all of your source is in the source character set, regardless of the type of string literal you use. Then the compiler will convert your source character set to the execution character set for narrow string literals, to the wide execution character set for wide string literals, to UTF-8 for u8 literals, etc.
Sorry for not being clear enough. I agree and I've not said otherwise. The second 'cout' line *is* a hack. I admit it won't work if you mix such string literals with wide literals or external identifiers containing Unicode. The intent was to show how it could be done if the effort was focused on making narrow string literals "Unicode compatible". [...] What probably should be done is that compilers should be compelled to
support UTF-8 as the source character set in a unified way.
Yes, it could be nice. It would solve half the problem, which is a huge step forward given the current mood of the committee. However, embedding Unicode string literals in source code is still not something you routinely do. Internationalization usually uses external string tables. I once asked volodya if it were feasible to implement this in the build
system (add a BOM for MSVC), but he didn't seem to think it was worth it.
I don't understand. MSVC already understands the BOM, and GCC has already been fixed according to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415 (didn't test it).

On Sun, Jan 29, 2012 at 03:12, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
I think you should consider the points being made in N3334. While that proposal is in my opinion not good enough, it raises an important issue that is often present with std::string-based or similar designs.
A function that takes a std::string, or a boost::filesystem::path for that matter, necessarily causes the callee to copy the data into a heap-allocated buffer, even if there is no need to.
Use of the range concept would solve that issue, but then that requires making the function a template. A type-erased range would be possible, but that has significant performance overhead. a string_ref or path_ref is maybe the lesser evil.
+1 This topic has been raised here in program-options context: http://boost.2283326.n4.nabble.com/program-options-Some-methods-take-const-c... -- Yakov

From: Yakov Galka <ybungalobill@gmail.com>
[...] What probably should be done is that compilers should be compelled to
support UTF-8 as the source character set in a unified way.
Yes, it could be nice. It would solve half the problem, which is a huge step forward given the current mood of the committee. However, embedding Unicode string literals in source code is still not something you routinely do. Internationalization usually uses external string tables.
Not right. Sometimes you do want non-ASCII symbols in the source code; what is wrong with having © in the text or the € symbol in the code? Also, the fact that C++ does not define Unicode source code is a design problem of the standard; there is nothing wrong with having Unicode literals in the source code. In fact, the ONLY modern compiler that does not support them is Visual Studio; all others I have ever used (gcc, clang, intel, sunstudio) work fine with UTF-8.
I once asked volodya if it were feasible to implement this in the build
system (add a BOM for MSVC), but he didn't seem to think it was worth it.
I don't understand. MSVC already understands the BOM, and GCC has already been fixed according to http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33415 (didn't test it).
A few points:
1. BOM should not be used in source code; no compiler except MSVC uses it and most do not support it. A BOM is totally pointless for UTF-8, as UTF-8 does not have a "byte order", so it should just die for UTF-8.
2. Setting a UTF-8 BOM still makes narrow literals be encoded in the ANSI encoding, which makes the BOM with MSVC even more useless (crap... sorry).

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/
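As a concrete aside (an editorial note, not part of the original message): the UTF-8 "BOM" is just U+FEFF encoded in UTF-8, the fixed byte sequence EF BB BF, so there is indeed no byte order for it to mark. A tool that wants to tolerate it only needs to skip that prefix, as in this minimal sketch:

    #include <cstring>

    // Skip a leading UTF-8 "BOM" (the bytes EF BB BF) if present.
    const char* skip_utf8_bom(const char* text) {
        return std::strncmp(text, "\xEF\xBB\xBF", 3) == 0 ? text + 3 : text;
    }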

On 01/29/2012 09:13 AM, Artyom Beilis wrote:
In fact, the ONLY modern compiler that does not support them is Visual Studio; all others I have ever used (gcc, clang, intel, sunstudio) work fine with UTF-8.
They all support it, the problem is that they require different things to use it.
1. BOM should not be used in source code, no compiler except MSVC uses it and most do not support it.
According to Yakov, GCC supports it now. It would be nice if it could work without any BOM though.
2. Setting a UTF-8 BOM still makes narrow literals be encoded in the ANSI encoding, which makes the BOM with MSVC even more useless (crap... sorry).
That's the correct behaviour. Use u8 string literals if you want UTF-8. The problem is only present if the compiler doesn't have those string literals.

----- Original Message -----
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org> On 01/29/2012 09:13 AM, Artyom Beilis wrote:
In fact, the ONLY modern compiler that does not support them is Visual Studio; all others I have ever used (gcc, clang, intel, sunstudio) work fine with UTF-8.
They all support it, the problem is that they require different things to use it.
No, MSVC does not allow creating both the "שלום" and L"שלום" literals as Unicode (UTF-8, UTF-16), while for all other compilers that is the default behavior.
1. BOM should not be used in source code, no compiler except MSVC uses it and most do not support it.
According to Yakov, GCC supports it now. It would be nice if it could work without any BOM though.
GCC's default input and literal encoding is UTF-8. BOM is not needed.
2. Setting a UTF-8 BOM still makes narrow literals be encoded in the ANSI encoding, which makes the BOM with MSVC even more useless (crap... sorry).
That's the correct behaviour.
No, it is unspecified behavior according to the standard. The standard does not specify which narrow encoding should be used; that is why u8"" was created. All compilers (but MSVC) create UTF-8 literals and use UTF-8 input, and this is the default.
Use u8 string literals if you want UTF-8.
Why on earth should I do this? The whole world uses UTF-8. Why should I specify u8"" if it is something that can be easily defined at the compiler level? All we need is some flag for MSVC that tells it that the string literal encoding is UTF-8. I think the standard should require a way to specify the input encoding and the literal encoding, and require support for UTF-8 for both, whether by adding some flag or by providing some pragma.
The problem is only present if the compiler doesn't have those string literals.
AFAIR, neither gcc4.6 nor msvc10 supports u8"". Artyom

On 01/29/2012 02:53 PM, Artyom Beilis wrote:
No, MSVC does not allow creating both the "שלום" and L"שלום" literals as Unicode (UTF-8, UTF-16), while for all other compilers that is the default behavior.
And it shouldn't. String literals are in the execution character set. On Windows the execution character set is what it calls ANSI. That much is not going to change.
1. BOM should not be used in source code, no compiler except MSVC uses it and most do not support it.
According to Yakov, GCC supports it now. It would be nice if it could work without any BOM though.
GCC's default input and literal encoding is UTF-8. BOM is not needed.
That's not what I'm saying. What we want is a unified way to set UTF-8 as the source character set. The problem is that MSVC requires a BOM, but GCC used not to allow it.
2. Setting a UTF-8 BOM still makes narrow literals be encoded in the ANSI encoding, which makes the BOM with MSVC even more useless (crap... sorry).
That's the correct behaviour.
No, it is unspecified behavior according to the standard.
It isn't.
Standard does not specify what narrow encoding should be used, that is why u8"" was created.
The standard specifies that it is the execution character set. MSVC specifies that for its implementation, the execution character set is ANSI.
All (but MSVC) compilers create UTF-8 literals and use UTF-8 input and this is the default.
That's because for those other compilers, you are in a case where the source character set is the same as the execution character set. With MSVC, if you don't do anything, both your source and execution character sets are ANSI. If you set your source character set to UTF-8, your execution character set remains ANSI still. On non-Windows platforms, UTF-8 is the most common execution character set, so you can have a setup where source = execution = UTF-8, but you can't do that on Windows. But that is irrelevant to the standard.
Use u8 string literals if you want UTF-8.
Why on earth should I do this?
Because it makes perfect sense and it's the way it's supposed to work.
The whole world uses UTF-8. Why should I specify u8"" if it is something that can be easily defined at the compiler level?
Because otherwise you're not independent from the execution character set. Writing your program with Unicode allows you not to depend on platform-specific encodings; that doesn't mean it makes them go away. I repeat, narrow string literals are and will remain in the execution character set. Expecting those to end up as UTF-8 data is wrong and not portable.
All we need is some flag for MSVC that tells that string literals encoding is UTF-8.
That "flag" is using the u8 prefix on those string literals. Remember: the encoding used for the data in a string literal is independent from the encoding used to write the source.
AFAIR, neither gcc4.6 nor msvc10 supports u8"".
Unicode string literals have been in GCC since 4.5. However there are indeed practical problems with using the standard mechanisms because they're not always implemented.

On Sun, Jan 29, 2012 at 16:28, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/29/2012 02:53 PM, Artyom Beilis wrote:
No, MSVC does not allow creating both the "שלום" and L"שלום" literals as Unicode (UTF-8, UTF-16), while for all other compilers that is the default behavior.
And it shouldn't. String literals are in the execution character set. On Windows the execution character set is what it calls ANSI. That much is not going to change.
The execution character set is defined by the implementation, that is, by the compiler and the runtime library. It has nothing to do with the system underneath. That is, the implementation is free to decide that the execution character set is UTF-8, even though Windows narrow strings are in some 'ANSI' encoding. Standard library interfaces would then accept UTF-8 (fopen, fstream, etc.).
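A rough editorial sketch of what that would mean for a runtime library on Windows (fopen_utf8 is a hypothetical helper; MultiByteToWideChar and _wfopen are existing Win32/CRT calls): narrow UTF-8 file names are converted to the native wide form internally, so user code never touches the 'ANSI' code page.

    #include <stdio.h>      // FILE, _wfopen (Microsoft CRT)
    #include <string>
    #include <windows.h>    // MultiByteToWideChar, CP_UTF8

    // Hypothetical: how an implementation could let an fopen-style interface
    // accept UTF-8 even though the OS narrow APIs use the ANSI code page.
    FILE* fopen_utf8(const char* utf8_name, const wchar_t* mode) {
        int len = ::MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, NULL, 0);
        std::wstring wide(len, L'\0');
        ::MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, &wide[0], len);
        return ::_wfopen(wide.c_str(), mode);
    }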
[...]
2. Setting a UTF-8 BOM still makes narrow literals be encoded in the ANSI encoding, which makes the BOM with MSVC even more useless (crap... sorry).
That's the correct behaviour.
No, it is unspecified behavior according to the standard.
It isn't.
As said above, you can't deduce from the standard what the "execution character set for Windows" is. MSVC defines it to be 'ANSI', which is the source of all the problems. But it is unspecified behavior according to the standard.
Standard does not specify what narrow encoding should be used, that is why u8"" was created.
The standard specifies that it is the execution character set. MSVC specifies that for its implementation, the execution character set is ANSI.
Yes, and we would like to at least have a flag that overrides the execution character set to UTF-8.
[...]
Use u8 string literals if you want UTF-8.
Why on earth should I do this?
Because it makes perfect sense and it's the way it's supposed to work.
As of C++11, it doesn't make sense to use any narrow string literal other than u8"". Why would you use plain "" on Windows? [...]
All we need is some flag for MSVC that tells that string
literals encoding is UTF-8.
That "flag" is using the u8 prefix on those string literals. Remember: the encoding used for the data in a string literal is independent from the encoding used to write the source.
Yes, it will remain independent even with "" meaning u8"". Even if the source character set was UTF-32 it would mean UTF-8. Sincerely, -- Yakov

----- Original Message -----
From: Mathias Gaunard <mathias.gaunard@ens-lyon.org>
No, MSVC does not allow creating both the "שלום" and L"שלום" literals as Unicode (UTF-8, UTF-16), while for all other compilers that is the default behavior.
And it shouldn't.
It depends on the point of view. (see below)
String literals are in the execution character set. On Windows the execution character set is what it calls ANSI. That much is not going to change.
The execution character set is host-dependent, and the ANSI code page differs from one host to another. When you compile the program on one host with one character set, the program will not behave correctly on another host. This is a huge bug in the design of C++, and that is why it should be fixed. Most compilers around have already done this... It can be done in a backward-compatible way by requiring a compile-time option and deprecating the concept of an "execution character set".
1. BOM should not be used in source code, no compiler except MSVC uses it and most do not support it.
According to Yakov, GCC supports it now. It would be nice if it could work without any BOM though.
GCC's default input and literal encoding is UTF-8. BOM is not needed.
That's not what I'm saying. What we want is a unified way to set UTF-8 as the source character set. The problem is that MSVC requires BOM, but GCC used to not allow it.
The problem is not BOM versus no BOM. A BOM is not the way to fix the problem. The whole concept of using a "BOM" to distinguish between the ANSI encoding and UTF-8 exists only on Windows. It is not portable, and most importantly it is a silly thing to provide a "Byte-Order-Mark" for UTF-8, which does not have a byte order. GCC provides a flag to specify the encoding, and AFAIR most other compilers do the same.
2. Setting a UTF-8 BOM still makes narrow literals be encoded in the ANSI encoding, which makes the BOM with MSVC even more useless (crap... sorry).
That's the correct behaviour.
No, it is unspecified behavior according to the standard.
It isn't.
It is, because the host character set is not well defined and varies from host to host. So the result is simply not specified. I'll make it more clear: **it is not well defined**.
Standard does not specify what narrow encoding should be used, that is why u8"" was created.
The standard specifies that it is the execution character set. MSVC specifies that for its implementation, the execution character set is ANSI.
So maybe the standard should add an option to specify the input character set explicitly, so it would not vary from host to host?
All (but MSVC) compilers create UTF-8 literals and use UTF-8 input and this is the default.
That's because for those other compilers, you are in a case where the source character set is the same as the execution character set.
GCC allows specifying both the "" literal encoding and the input encoding, via the -fexec-charset and -finput-charset options.
With MSVC, if you don't do anything, both your source and execution character sets are ANSI. If you set your source character set to UTF-8, your execution character set remains ANSI still.
No, it will remain the original ANSI encoding, which may not match the host's ANSI encoding:

input CP-XXX "test" -> literal CP-XXX "test" <- execution charset

But at runtime it would be CP-YYY != CP-XXX.
On non-Windows platforms, UTF-8 is the most common execution character set, so you can have a setup where source = execution = UTF-8, but you can't do that on Windows. But that is irrelevant to the standard.
As I told you, the standard should specify a way to define both the execution and the input character sets.
Use u8 string literals if you want UTF-8.
Why on earth should I do this?
Because it makes perfect sense and it's the way it's supposed to work.
Except that it does not solve any real problem.
The whole world uses UTF-8. Why should I specify u8"" if it is something that can be easily defined at the compiler level?
Because otherwise you're not independent from the execution character set. Writing your program with Unicode allows you not to depend on platform-specific encodings; that doesn't mean it makes them go away.
I remind you that UTF-8 is Unicode...
I repeat, narrow string literals are and will remain in the execution character set. Expecting those to end up as UTF-8 data is wrong and not portable.
I think it is a bug in the design, and the programmer should be able to override it. Finally, the "execution" character set is meaningless as it is host-dependent; the "narrow-literal" character set is meaningful.
All we need is some flag for MSVC that tells that string literals encoding is UTF-8.
That "flag" is using the u8 prefix on those string literals. Remember: the encoding used for the data in a string literal is independent from the encoding used to write the source.
I know
AFAIR, neither gcc4.6 nor msvc10 supports u8"".
Unicode string literals have been in GCC since 4.5.
AFAIR GCC supports u"" and U""; when I checked, u8"" was not working, but I may be wrong. Artyom Beilis

On Sat, Jan 28, 2012 at 7:49 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
...
The way a C++ compiler is supposed to work is that all of your source is in the source character set, regardless of the type of string literal you use. Then the compiler will convert your source character set to the execution character set for narrow string literals, to the wide execution character set for wide string literals, to UTF-8 for u8 literals, etc.
The correct way to portably use Unicode characters in a C++ source is to write it as UTF-8 and ensure that all compilers will consider the source character set to be UTF-8. Then use the appropriate literal types depending on what encoding you want your string literals to end up in. Of course, in the real world, this causes two practical problems:
- MSVC requires a BOM to be present, but GCC will choke if there is one.
- In the absence of u8 string literals, you're stuck with wide string literals if you want something resembling Unicode, unless you use narrow string literals with just ASCII and escape sequences (\xYY; \u and \U will not work since they will be converted).
What probably should be done is that compilers should be compelled to support UTF-8 as the source character set in a unified way.
Makes sense to me. Why don't you write up an issue for the C and C++ committees? My guess is that it would be well received as long as (1) C and C++ stay in sync (or at least don't conflict), and (2) compiler vendors aren't required to do anything that would cause existing source files that work with their compiler to stop working.

This issue might well attract national body support, which increases the chance the committee will take action. It would be helpful if the issue write-up included a survey of current compilers, so that committee members not familiar with various compilers could see that UTF-8 is already widely supported modulo the BOM issue.

Another possibility is to start lobbying compiler vendors, or at least Microsoft, to support UTF-8 both with and without a BOM.

--Beman

----- Original Message -----
From: Beman Dawes <bdawes@acm.org>
What probably should be done is that compilers should be compelled to support UTF-8 as the source character set in a unified way.
Makes sense to me.
Why don't you write up an issue for the C and C++ committees? My
[snip]
Another possibility is to start lobbying compiler vendors, or at least Microsoft, to support UTF-8 both with and without BOM.
It is not only a BOM/no-BOM issue. It is mostly about the ability to define the execution character set, i.e. the character set for normal "some text" literals, and the input character set, and, what is even more important, that C++ compilers must support UTF-8 for the two of them. Artyom

----------------------------------------
Date: Mon, 30 Jan 2012 00:24:30 -0800 From: Artyom
----- Original Message -----
From: Beman Dawes <bdawes@acm.org>
What probably should be done is that compilers should be compelled to support UTF-8 as the source character set in a unified way.
Makes sense to me.
Why don't you write up an issue for the C and C++ committees? My
[snip]
Another possibility is to start lobbying compiler vendors, or at least Microsoft, to support UTF-8 both with and without BOM.
It is not only a BOM/no-BOM issue. It is mostly about the ability to define the execution character set, i.e. the character set for normal "some text" literals, and the input character set, and, what is even more important, that C++ compilers must support UTF-8 for the two of them.
This probably isn't the right post to respond to, but I don't want to spend forever figuring it out.

Not every system is an 8/16/32(/64)-bit computer using ASCII/Latin-1/UTF-8. C++ (from C) was designed so a user with a 9/36/81-bit EBCDIC system and one with an 8/16/32/64 UTF-16 system can write programs for the other (with the appropriate cross-compiler). We don't want to obnoxiously be prejudiced against systems not matching the current configuration trends.

(I was originally going to write "9/36/72", but then realized that higher types only have to be a multiple of char, not of each other, so my new system breaks more common-programmer assumptions. BTW, that's 9-bit bytes (char), 36-bit words (short and int), and 81-bit long-words (long and long long). I wonder if anyone here can fabricate this custom hardware, to mess people up.)

Daryle W.

On Tue, Jan 31, 2012 at 10:57, Daryle Walker <darylew@hotmail.com> wrote:
----------------------------------------
Date: Mon, 30 Jan 2012 00:24:30 -0800 From: Artyom
----- Original Message -----
From: Beman Dawes <bdawes@acm.org>
What probably should be done is that compilers should be compelled to support UTF-8 as the source character set in a unified way.
Makes sense to me.
Why don't you write up an issue for the C and C++ committees? My
[snip]
Another possibility is to start lobbying compiler vendors, or at least Microsoft, to support UTF-8 both with and without BOM.
It is not only a BOM/no-BOM issue. It is mostly about the ability to define the execution character set, i.e. the character set for normal "some text" literals, and the input character set, and, what is even more important, that C++ compilers must support UTF-8 for the two of them.
This probably isn't the right post to respond to, but I don't want to spend forever figuring it out.
Not every system is an 8/16/32(/64)-bit computer using ASCII/Latin-1/UTF-8. C++ (from C) was designed so a user with a 9/36/81-bit EBCDIC system and one with an 8/16/32/64 UTF-16 system can write programs for the other (with the appropriate cross-compiler). We don't want to obnoxiously be prejudiced against systems not matching the current configuration trends.
(I was originally going to write "9/36/72", but then realized that higher types only have to be a multiple of char, not each other, so my new system breaks more common-programmer assumptions. BTW, that's 9-bit bytes (char), 36-bit words (short and int), and 81-bit long-words (long and long-long). I wonder if anyone here can fabricate this custom hardware, to mess people up.)
Daryle W.
Thanks Daryle. I'm aware of this issue and thus refrained from talking about UTF-8 only. The wording I'm interested in is "the execution character set is capable of storing any Unicode data". This would mean that it will be UTF-8 on systems having CHAR_BIT==8 and compatible with ASCII, UTF-EBCDIC on IBM mainframes, and perhaps UTF-32 on a DSP with CHAR_BIT==32 and sizeof(char) == sizeof(long). Yet another option is to restrict the requirement to hosted implementations only. -- Yakov

On 01/31/2012 09:57 AM, Daryle Walker wrote:
This probably isn't the right post to respond to, but I don't want to spend forever figuring it out.
Not every system is an 8/16/32(/64)-bit computer using ASCII/Latin-1/UTF-8. C++ (from C) was designed so a user with a 9/36/81-bit EBCDIC system and one with an 8/16/32/64 UTF-16 system can write programs for the other (with the appropriate cross-compiler). We don't want to obnoxiously be prejudiced against systems not matching the current configuration trends.
Which is exactly why forcing a particular execution character set is a bad idea. Forcing a particular source character set, however, may be another matter, as it only affects the compiler itself.

On Tue, Jan 31, 2012 at 1:12 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/31/2012 09:57 AM, Daryle Walker wrote:
This probably isn't the right post to respond to, but I don't want to spend forever figuring it out.
Not every system is an 8/16/32(/64)-bit computer using ASCII/Latin-1/UTF-8. C++ (from C) was designed so a user with a 9/36/81-bit EBCDIC system and one with an 8/16/32/64 UTF-16 system can write programs for the other (with the appropriate cross-compiler). We don't want to obnoxiously be prejudiced against systems not matching the current configuration trends.
Which is exactly why forcing a particular execution character set is a bad idea. Forcing a particular source character set, however, may be another matter, as it only affects the compiler itself.
Wouldn't it affect editors and other utilities too? -- Olaf

On 01/31/2012 03:13 PM, Olaf van der Spek wrote:
On Tue, Jan 31, 2012 at 1:12 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/31/2012 09:57 AM, Daryle Walker wrote:
This probably isn't the right post to respond to, but I don't want to spend forever figuring it out.
Not every system is an 8/16/32(/64)-bit computer using ASCII/Latin-1/UTF-8. C++ (from C) was designed so a user with a 9/36/81-bit EBCDIC system and one with an 8/16/32/64 UTF-16 system can write programs for the other (with the appropriate cross-compiler). We don't want to obnoxiously be prejudiced against systems not matching the current configuration trends.
Which is exactly why forcing a particular execution character set is a bad idea. Forcing a particular source character set, however, may be another matter, as it only affects the compiler itself.
Wouldn't it affect editors and other utilities too?
Not necessarily, a compiler can support multiple source character sets.

On Sat, Jan 28, 2012 at 2:48 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
My opinion:
1. You shall not use any char type other than char and wchar_t for working with strings. Using the char type and/or char_traits to mark the encoding doesn't work, because the standard-provided facets, C standard library functions, etc. are provided almost only for the char and wchar_t types. And we *don't want* to specialize all possible facets for each possible encoding, just as we don't want to add u16sprintf, u32sprintf, u16cout, u32cout, etc. This would effectively increase the size of the interface to ϴ(number-of-entities × number-of-encodings). Following the above, you won't use the char32_t and char16_t types added in C++11 either. You will use just one or two encodings internally, namely those used for char and wchar_t according to the conventions in your code and/or the platform you work with. The only place you may need the char**_t types is when converting from UTF-16/UTF-32 into the internal encoding you use for your strings (either narrow or wide). But in those conversion algorithms uint_least32_t and uint_least16_t suit your needs just fine.
I agree with you that "we *don't want* to specialize all possible facets for each possible encoding, just as we don't want to add u16sprintf, u32sprintf, u16cout, u32cout, etc...". Hopefully someone will step forward with a set of deeply Unicode-aware generic algorithms to take advantage of Unicode-specific functionality. I personally prefer char32_t and char16_t to uint_least32_t and uint_least16_t, but don't have enough experience with the C++11 types to make blanket recommendations.
2. "Standard library strings with different character encodings have different types that do not interoperate." It's good. There shall no be implicit conversions in user code. If the user wants, she shall specify the conversion explicitly, as in:
s2 = convert-with-whatever-explicit-interface-you-like("foo");
int x;
long y;
...
y = x;
...
x = y;

Nothing controversial here, and very convenient. The x = y conversion is lossy, but the semantics are well defined and you can always use a function call if you want different semantics.

string x;
u32string y;
...
y = x;
...
x = y;

Why is this any different? It is very convenient. We can argue about the best semantics for the x = y conversion, but once those semantics are settled you can always use a function call if you want different semantics.
3. "...class path solves some of the string interoperability problems..." Class path forces the user to use a specific encoding that she even may not be willing to hear of. It manifests in the following ways: - The 'default' interface returns the encoding used by the system, requiring the user to use a verbose interface to get the encoding she uses. - If the user needs to get the path encoded in her favorite encoding *by reference* with a lifetime of the path (e.g. as a parameter to an async call), she must maintain a long living *copy* of the temporary returned from the said interface. - Getting the extension from a narrow-string path using boost::path on Windows involves *two* conversions although the system is never called in the middle. - Library code can't use path::imbue(). It must pass the corresponding codecvt facet everywhere to use anything but the (implementation defined and volatile at runtime) default.
My contention is that class path is having to take on conversion responsibilities that are better performed by basic_string. That's part of the motivation for exploring ways string classes could take on some of those responsibilities.
4. "Can be called like this: (example)" So we had 2 encodings to consider before C++11, 4 after the additions in C++11 and you're proposing additions to make it easier to work with any number of encodings. We are moving towards encoding HELL.
The number of encodings isn't a function of C++, it is a function of the real-world. Traditionally, there were many encodings in wide use, and then Unicode came along with a few more. But the Unicode encodings have enough advantages that users are gradually moving away from non-Unicode encodings. C++ needs to accommodate that trend by becoming friendlier to the Unicode encodings.
5. "A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:" Unicode string literal (except u8) illustrates how adding yet another unneeded feature to the C++ standard complicates the language, adds problems, adds frustration and solves nothing. The user can just write
cout << u8"您好世界";
Even better is:
cout << "您好世界";
which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ) and needs some trickery on others (MSVC: save as UTF-8 without BOM). A much simpler solution is to standardize narrow string literals to be UTF-8 encoded (or a better phrasing would be "capable of storing any Unicode data" so this will work with UTF-EBCDIC where needed), but I know it's too much to ask.
I'm not sure that is too much to ask for the C++ standard after C++11, whatever it ends up being called. It would take a lot of careful work to bring the various interests on board. A year ago was the wrong point in the C++ standard revision cycle to even talk about such a change. But C++11 has shipped. Now is the time to start the process of moving the problem onto the committee's radar screen.
6. "String conversion iterators are not provided (minus Example)" This section *I fully support*. The additions to C++11 pushed by Dinkumware are heavy, not general enough, and badly designed. C++11 still lacks convenient conversion between different Unicode encodings, which is a must in today's world. Just a few notes: - "Interfaces work at the level of entire strings rather than characters," This *is* desired since the overhead of the temporary allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32 conversions need large chunks of data. Nevertheless I agree that iterator access is sometimes preferred. - Instead of the c_str() from "Example" a better approach is to provide a convenience non-member function that can work on any range of chars. E.g. using the "char type specifies the encoding" approach this would be:
std::wstring wstr = convert<wchar_t>(u8"您好世界"); // doesn't even construct an std::string std::string u8str = convert<char>(wstr); // don't care for the name
While I'm totally convinced that conversion iterators would be very useful, the exact form is an open question. Could you be more specific about the details of your convert suggestion?
7. True interoperability, portability and conciseness will come when we standardize on *one* encoding.
Even if we are only talking about Unicode, multiple encodings still seem a necessity. --Beman

On Sun, Jan 29, 2012 at 17:52, Beman Dawes <bdawes@acm.org> wrote: [...]
I personally prefer char32_t and char16_t to uint_least32_t and uint_least16_t, but don't have enough experience to the C++11 types to make blanket recommendations.
I don't care for the name. I claim that we don't need a distinct type with a keyword for that.
2. "Standard library strings with different character encodings have different types that do not interoperate." It's good. There shall no be implicit conversions in user code. If the user wants, she shall
specify the
conversion explicitly, as in:
s2 = convert-with-whatever-explicit-interface-you-like("foo");
int x; long y; ... y = x; ... x = y;
Nothing controversial here, and very convenient. The x = y conversion is lossy, but the semantics are well defined and you can always use a function call if you want different semantics.
It is controversial. It was inherited from C, where even the void* -> int* conversion was possible. Some argue that x = y should be an error; see D&E 14.3.5.2. Most compilers issue a warning for this. Note that where compatibility with C is not a concern, C++ prohibits narrowing conversions:

vector<int> v = {1, 2, 3};
vector<short> v1 = {v[0], v[1], v[2]};
vector<long> v2 = v; // not narrowing but fails too

Btw, x = y is implementation-defined if y is a large negative value, not "well defined".

string x;
u32string y;
...
y = x;
...
x = y;
Why is this any different? It is very convenient. We can argue about the best semantics for the x = y conversion, but once those semantics are settled you can always use a function call if you want different semantics.
Convenient: yes. But not every convenient feature is good. It can do harm. The first things that come to mind are:
1. Overload resolution ambiguity or surprising results.
2. It hides potentially expensive conversions (I agree to do these implicitly only when interacting with 3rd-party code).
3. It eases interoperability between different encodings, thus postponing standardization on one encoding, yet doesn't solve the headache completely (the user still has to think about encodings and choose the string she needs from this zoo: string, u16string, u32string...).
And why don't we have std::string::operator const char*()?
problems..." Class path forces the user to use a specific encoding
3. "...class path solves some of the string interoperability that she
even may not be willing to hear of. It manifests in the following ways: - The 'default' interface returns the encoding used by the system, requiring the user to use a verbose interface to get the encoding she uses. - If the user needs to get the path encoded in her favorite encoding *by reference* with a lifetime of the path (e.g. as a parameter to an async call), she must maintain a long living *copy* of the temporary returned from the said interface. - Getting the extension from a narrow-string path using boost::path on Windows involves *two* conversions although the system is never called in the middle. - Library code can't use path::imbue(). It must pass the corresponding codecvt facet everywhere to use anything but the (implementation defined and volatile at runtime) default.
My contention is that class path is having to take on conversion responsibilities that are better performed by basic_string. That's part of the motivation for exploring ways string classes could take on some of those responsibilities.
Good. But my intent is to move the conversions inside the operational functions (preferably). Until we can standardize on a Unicode execution character set, let the conversion happen when calling those functions (perhaps use a path_ref that does it implicitly if we don't want the FS v2 templated functions). I remind you that class path is used not just for calling the system.
4. "Can be called like this: (example)" So we had 2 encodings to consider before C++11, 4 after the additions in C++11 and you're
proposing
additions to make it easier to work with any number of encodings. We are moving towards encoding HELL.
The number of encodings isn't a function of C++, it is a function of the real-world. Traditionally, there were many encodings in wide use, and then Unicode came along with a few more. But the Unicode encodings have enough advantages that users are gradually moving away from non-Unicode encodings. C++ needs to accommodate that trend by becoming friendlier to the Unicode encodings.
Sure. But it doesn't mean that it has to be friendlier to ALL Unicode encodings.
5. "A "Hello World" program using a C++11 Unicode string literal illustrates this frustration:" Unicode string literal (except u8) illustrates how adding yet another unneeded feature to the C++ standard complicates the language, adds problems, adds frustration and solves nothing. The user can just write
cout << u8"您好世界";
Even better is:
cout << "您好世界";
which *just works* on most compilers (e.g. GCC: http://ideone.com/lBpMJ) and needs some trickery on others (MSVC: save as UTF-8 without BOM). A much simpler solution is to standardize narrow string literals to be UTF-8 encoded (or a better phrasing would be "capable of storing any Unicode data" so this will work with UTF-EBCDIC where needed), but I know it's too much to ask.
I'm not sure that is too much to ask for the C++ standard after C++11, whatever it ends up being called. It would take a lot of careful work to bring the various interests on board. A year ago was the wrong point in the C++ standard revision cycle to even talk about such a change. But C++11 has shipped. Now is the time to start the process of moving the problem onto the committee's radar screen.
Thanks for the forecast!
6. "String conversion iterators are not provided (minus Example)" This section *I fully support*. The additions to C++11 pushed by Dinkumware
heavy, not general enough, and badly designed. C++11 still lacks convenient conversion between different Unicode encodings, which is a must in today's world. Just a few notes: - "Interfaces work at the level of entire strings rather than characters," This *is* desired since the overhead of the temporary allocations is repaid by the fact that optimized UTF-8↔UTF-16↔UTF-32 conversions need large chunks of data. Nevertheless I agree that iterator access is sometimes preferred. - Instead of the c_str() from "Example" a better approach is to provide a convenience non-member function that can work on any range of chars. E.g. using the "char type specifies the encoding" approach
are this
would be:
std::wstring wstr = convert<wchar_t>(u8"您好世界"); // doesn't even construct an std::string std::string u8str = convert<char>(wstr); // don't care for the name
While I'm totally convinced that conversion iterators would be very useful, the exact form is an open question. Could you be more specific about the details of your convert suggestion?
The point is that it's more like a free-standing version of the c_str() you proposed. Unlike the c_str() member function, it would work on any character range and return a range of converting iterators. We don't need to extend basic_string for this; it is already too big.
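To make the idea concrete, here is a minimal editorial sketch of such a free function (hypothetical names, error handling for invalid sequences omitted): it takes any character range holding UTF-8 and returns a pair of lazily converting iterators that yield char32_t code points, without building a temporary string.

    #include <cstddef>
    #include <iterator>
    #include <utility>

    template <class BaseIt>
    class u8_to_u32_iterator {
        BaseIt pos_;
        static int seq_length(unsigned char b0) {
            return b0 < 0x80 ? 1 : b0 >= 0xF0 ? 4 : b0 >= 0xE0 ? 3 : 2;
        }
    public:
        typedef std::forward_iterator_tag iterator_category;
        typedef char32_t                  value_type;
        typedef std::ptrdiff_t            difference_type;
        typedef const char32_t*           pointer;
        typedef char32_t                  reference;

        explicit u8_to_u32_iterator(BaseIt pos) : pos_(pos) {}

        char32_t operator*() const {                // decode one code point
            BaseIt it = pos_;
            unsigned char b0 = static_cast<unsigned char>(*it);
            if (b0 < 0x80) return b0;               // single byte (ASCII)
            int len = seq_length(b0);
            char32_t cp = b0 & (0xFF >> (len + 1)); // payload bits of the lead byte
            for (int i = 1; i < len; ++i)
                cp = (cp << 6) | (static_cast<unsigned char>(*++it) & 0x3F);
            return cp;
        }
        u8_to_u32_iterator& operator++() {          // advance one whole sequence
            std::advance(pos_, seq_length(static_cast<unsigned char>(*pos_)));
            return *this;
        }
        u8_to_u32_iterator operator++(int) { u8_to_u32_iterator t(*this); ++*this; return t; }

        friend bool operator==(const u8_to_u32_iterator& a, const u8_to_u32_iterator& b) { return a.pos_ == b.pos_; }
        friend bool operator!=(const u8_to_u32_iterator& a, const u8_to_u32_iterator& b) { return a.pos_ != b.pos_; }
    };

    // The free-standing "c_str": works on any char range, returns converting iterators.
    template <class Iterator>
    std::pair<u8_to_u32_iterator<Iterator>, u8_to_u32_iterator<Iterator> >
    as_utf32(Iterator first, Iterator last) {
        return std::make_pair(u8_to_u32_iterator<Iterator>(first),
                              u8_to_u32_iterator<Iterator>(last));
    }

    // usage (no temporary string is built):
    //   std::string text = u8"您好世界";
    //   auto range = as_utf32(text.begin(), text.end());
    //   for (auto it = range.first; it != range.second; ++it)
    //       use_code_point(*it);   // *it is a char32_t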
7. True interoperability, portability and conciseness will come when we standardize on *one* encoding.
Even if we are only talking about Unicode, multiple encodings still seem a necessity.
Unicode algorithms work on code points (UCS-4) internally. Everything else can be encoded in some (narrow) execution character set capable of storing Unicode. Almost no-one implements Unicode algorithms, thus we can practically assume that one encoding is sufficient on each platform. -- Yakov

On Mon, Jan 30, 2012 at 12:00 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
On Sun, Jan 29, 2012 at 17:52, Beman Dawes <bdawes@acm.org> wrote: ...
While I'm totally convinced that conversion iterators would be very useful, the exact form is an open question. Could you be more specific about the details of your convert suggestion?
The point is that it's more like a free-standing c_str() you proposed. Unlike c_str() member function it would work on any character range, and returns a range of converting iterators. We don't need to extent basic_string for this, which is already too big.
The only way I can see that working with a totally unchanged basic_string would involve a temporary, which I was trying to avoid. Although with move semantics the temporary isn't as expensive as it used to be.

If basic_string changed to accept range templates (which others may propose), a free-function approach would work (pending the details of the range proposal). If basic_string changed to accept single iterator templates, a free-function conversion iterator generator approach would work.

I've added these three alternative solutions to the paper, and given you credit in the acknowledgments. Thanks! See http://beman.github.com/string-interoperability/tr2-proposal.html

--Beman

On Mon, Jan 30, 2012 at 12:00 PM, Yakov Galka <ybungalobill@gmail.com> wrote:
Unicode algorithms work on code points (UCS-4) internally. Everything else can be encoded in some (narrow) execution character set capable of storing Unicode. Almost no-one implements Unicode algorithms, thus we can practically assume that one encoding is sufficient on each platform.
That's totally at odds with my experience. A client deals with many database files every day from many different sources. Most are encoded in UTF-8, but some are encoded in UTF-16 or non-Unicode schemes. That's life. Get over it:-) --Beman

On 01/28/2012 05:46 PM, Beman Dawes wrote:
Beman.github.com/string-interoperability/interop_white_paper.html describes Boost components intended to ease string interoperability in general and Unicode string interoperability in particular.
These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
I'm very interested in hearing comments about either the Boost or the TR2 proposal. Are these useful additions? Is there a better way to achieve the same easy interoperability goals?
I think you should consider the points being made in N3334. While that proposal is in my opinion not good enough, it raises an important issue that is often present with std::string-based or similar designs.

A function that takes a std::string, or a boost::filesystem::path for that matter, necessarily causes the callee to copy the data into a heap-allocated buffer, even if there is no need to.

Use of the range concept would solve that issue, but then that requires making the function a template. A type-erased range would be possible, but that has significant performance overhead. A string_ref or path_ref is maybe the lesser evil.
Where is the best home for the Boost proposals? A separate library? Part of some existing library?
Are these proposals orthogonal to the need for deeper Unicode functionality, such as Mathias Gaunard's Unicode components?
It seems all you really care about is having iterator adaptors that do character set conversion, allowing any range of any encoding to be lazily converted to a particular Unicode encoding. This has always been the goal of my library, which somewhat provides that along with more advanced Unicode features. Those two things could live separately though.

For standardization, the problem with iterator adaptors is that they cannot be as fast as free functions operating on pointers, unless the optimizer is pretty darn good. The conversion algorithms are also fully templated and cannot be put in the library binary. Those are disadvantages compared to the mechanisms that exist today in the standard.

By the way, you only have input iterator adaptors. In my library I've implemented bidirectional iterator adaptors and output iterator adaptors. You've only been considering input, but output can also be useful depending on the situation.
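For the output direction mentioned at the end, a minimal editorial sketch of a UTF-32 to UTF-8 output iterator adaptor might look like this (hypothetical names; surrogate and out-of-range checks omitted):

    #include <iterator>

    template <class BaseIt>
    class u32_to_u8_output_iterator {
        BaseIt out_;
    public:
        typedef std::output_iterator_tag iterator_category;
        typedef void value_type;
        typedef void difference_type;
        typedef void pointer;
        typedef void reference;

        explicit u32_to_u8_output_iterator(BaseIt out) : out_(out) {}

        u32_to_u8_output_iterator& operator=(char32_t cp) {   // encode one code point
            if (cp < 0x80) {
                *out_++ = static_cast<char>(cp);
            } else if (cp < 0x800) {
                *out_++ = static_cast<char>(0xC0 | (cp >> 6));
                *out_++ = static_cast<char>(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                *out_++ = static_cast<char>(0xE0 | (cp >> 12));
                *out_++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                *out_++ = static_cast<char>(0x80 | (cp & 0x3F));
            } else {
                *out_++ = static_cast<char>(0xF0 | (cp >> 18));
                *out_++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
                *out_++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
                *out_++ = static_cast<char>(0x80 | (cp & 0x3F));
            }
            return *this;
        }
        // the usual no-op output iterator operations
        u32_to_u8_output_iterator& operator*()     { return *this; }
        u32_to_u8_output_iterator& operator++()    { return *this; }
        u32_to_u8_output_iterator& operator++(int) { return *this; }
    };

    // usage:
    //   std::string utf8;
    //   std::u32string u32 = U"您好世界";
    //   std::copy(u32.begin(), u32.end(),
    //             u32_to_u8_output_iterator<std::back_insert_iterator<std::string> >(
    //                 std::back_inserter(utf8)));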

On Sat, Jan 28, 2012 at 8:12 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 01/28/2012 05:46 PM, Beman Dawes wrote:
Beman.github.com/string-interoperability/interop_white_paper.html describes Boost components intended to ease string interoperability in general and Unicode string interoperability in particular.
These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
I'm very interested in hearing comments about either the Boost or the TR2 proposal. Are these useful additions? Is there a better way to achieve the same easy interoperability goals?
I think you should consider the points being made in N3334.
Ah, thanks! Yes, that's a very interesting proposal. I've started a separate thread to discuss it, so won't repeat that discussion here.
Where is the best home for the Boost proposals? A separate library? Part of some existing library?
Are these proposals orthogonal to the need for deeper Unicode functionality, such as Mathias Gaunard's Unicode components?
It seems all you really care about is having iterator adaptors that do character set conversion, allowing to lazily convert any range of any encoding to a particular Unicode encoding.
Yes, that's a fair summary.
This has always been the goal of my library, which somewhat provides that along with more advanced Unicode features. Those two things could live separately though.
I'm still feeling my way. I'd actually prefer to leave the encoding conversion to someone else. It's like my POD relaxation proposal that went into C++11 - I really didn't feel qualified to do that work, but none of the experts stepped forward. So I got sucked into the problem.
For standardization, the problem with iterator adaptors is that they cannot be as fast as free functions operating on pointers, unless the optimizer is pretty darn good.
Yes, but the optimizers are often "pretty darn good", and iterator adapters are very flexible.
The conversion algorithms are also fully template and cannot be put in the library binary.
That may well be correct for the general algorithms, but I'd be surprised if specializations for the most common cases couldn't call down to compiled binary functions.
Those are disadvantages compared to the mechanisms that exist today in the standard.
By the way you only have input iterator adaptors. In my library I've implemented bidirectional iterator adaptors and output iterator adaptors. You've only been considering input, but output can also be useful depending on the situation.
There is a to-do list work item to implement bidirectional iterator adapters. And output iterator adapters are worth some work too. Thanks for your comments, --Beman

-----Original Message-----
[snip]
These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html. I'm very interested in hearing comments about either the Boost or the TR2 proposal
[snip]
-----Original Message-----

Beman,

I do not understand how the converting c_str template can be useful in what, for me, is the normal usage of the c_str function. Given existing code

std::string stdstr;
const char * cstr = stdstr.c_str();
third_party_api( cstr );

and moving to general use of a wide string type, e.g.

std::u32string stdstr;
const char * cstr = stdstr.c_str< char >(); // ?????????
third_party_api( cstr );

clearly it is possible to make third_party_api( stdstr.c_str< char >().c_str() ) work, but surely that would also permit the above invalid use.
Keith Burton

On Sun, Jan 29, 2012 at 3:21 AM, Keith Burton <kb@xtramax.co.uk> wrote:
-----Original Message----- [snip] These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
I'm very interested in hearing comments about either the Boost or the TR2 proposal
[snip] -----Original Message-----
Beman
I do not understand how the converting c_str template can be useful in what for me, is the normal usage of the c_str function.
Given existing code
std::string stdstr; const char * cstr = stdstr.c_str();
third_party_api( cstr );
and moving to general use of a wide string type e.g.
std::u32string stdstr; const char * cstr = stdstr.c_str< char >(); // ?????????
That's a compile time error. The unspecified iterator type returned will not be const char*. It will be a conversion iterator with a value type of char, and thus only useful directly in purpose written code or in generic algorithms templated on iterator type.
third_party_api( cstr );
clearly it is possible to make third_party_api( stdstr.c_str< char >().c_str() ) work, but surely that would also permit the above invalid use.
One possible problem with conversion iterators with a value type of char is that they can be passed to functions that don't work with UTF-8 encoded data because of its multibyte nature. But UTF-8 is so craftily designed that many functions do work as intended, even though the functions were designed without multibyte encodings in mind. --Beman
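For example, ordinary byte-oriented code that searches for an ASCII delimiter keeps working on UTF-8 data, because bytes below 0x80 never occur inside a multibyte sequence. A minimal illustration of that point:

#include <iostream>
#include <string>

int main() {
    // a UTF-8 path containing non-ASCII directory names
    std::string path = u8"/home/пользователь/документы/readme.txt";

    // Searching for the byte '/' is reliable even in multibyte text,
    // since continuation bytes are always in the 0x80-0xBF range.
    std::string::size_type slash = path.rfind('/');
    std::cout << path.substr(slash + 1) << '\n';   // prints "readme.txt"
}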

std::u32string stdstr; const char * cstr = stdstr.c_str< char >(); // ?????????
That's a compile time error. The unspecified iterator type returned will not be const char*. It will be a conversion iterator with a value type of char, and thus only useful directly in purpose written code or in generic algorithms templated on iterator type.
In that case, perhaps a more appropriate name would be cbegin<> instead of c_str<>.
Keith

----- Original Message -----
From: Beman Dawes <bdawes@acm.org> To: Boost Developers List <boost@lists.boost.org> Cc: Sent: Saturday, January 28, 2012 6:46 PM Subject: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
[snip]
Before I address specific points in the draft I'd like to say - it is not the way to go. In order to make Unicode work, we need two things:

1. First of all, define in the standard that any compiler should be able to treat literals as UTF-8 and the input text as UTF-8 text, and recommend that this be the default. This would make developers' lives much easier whether they develop for "wide" Unicode or for narrow UTF-8.

2. The standard does not define what locales are actually supported and how they are defined. The standard should define explicitly that UTF-8 locales must be supported.

The rest becomes trivial: std::wcout << L"שלום" and std::cout << "שלום" would work, and much more.

We are all working so hard to work around a design flaw of C++ and the C++ standard library, which allow ANSI encodings and work with them. If the standard required and recommended handling UTF-8 by default, we would not have all the boost::filesystem::path::imbue and other stuff that makes life a nightmare.

If we want to go forward with Unicode we need to deprecate non-UTF encodings: we should have UTF-8, UTF-16 and UTF-32 by default, or defined at compilation time, and let the standard library handle it. Take a look at what Go did. All modern languages are Unicode by their nature; let C++ be as well. All other stuff is just a workaround for a deeper problem and makes programming harder. This would be possible if the standard committee would vote for it.

--------------------------------------------

Now some specific points about the converting iterator: it is fine for Unicode encoding conversion but it is very problematic for non-Unicode encodings.

Small note:
Interfaces don't work well with generic programming techniques, particularly iterators.
An iterator is a bad design for general encoding conversion, for several reasons. In many cases conversion is stateful, and an iterator is not the best concept for that. Some conversions require complex algorithms that should not be inlined but rather implemented with inheritance; using an iterator would either require several virtual function calls per character (with a trivial implementation) or would be very complex, as it would require buffering techniques within the iterator. That is why the codecvt interface is actually good for encoding conversion (even though it has a design flaw with mbstate_t, which is useless for implementing stateful encoders; if mbstate_t were something reasonable it would be a very good interface). Now I explain why:

1. In some cases you may want to perform normalization before conversion, or some other operations, because it is not always a correct assumption that XYZ-encoding-character <-> Unicode code point. Sometimes several characters may be joined into a single code point, and the other way around.

2. When you operate on complex encodings it is better to pass a buffer, for performance: a conversion algorithm works much better on a chunk of text than through a per-character API (a whole-buffer call of this kind is sketched right after this message). Even take a look at the MSVC standard library: the wide to narrow conversion calls codecvt for *every* code point rather than using buffers. So do you really expect that implementations would actually create efficient iterators?

Bottom line: the "iterators range" is not a good method for handling this.

-------------------------------------

Iterator concept: the paper does not specify which iterator concept is required. Input? Output? Forward? Bidirectional? Random access? For some encodings it can work as a bidirectional or even random access iterator; for some it may be forward only.

-----------------------------

So I don't really think this is the way to go.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/
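One way to express the whole-buffer style Artyom prefers, using only C++11 facilities: a single std::wstring_convert call converts a chunk of text at once instead of making one virtual call per code point. This is just an illustration of the calling convention, not his proposed interface:

#include <codecvt>
#include <locale>
#include <string>

// one call converts the whole buffer through the codecvt facet
std::u32string decode_utf8(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);   // single call over the whole chunk
}

// usage:  std::u32string cps = decode_utf8(u8"שלום");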
participants (7)
- Artyom Beilis
- Beman Dawes
- Daryle Walker
- Keith Burton
- Mathias Gaunard
- Olaf van der Spek
- Yakov Galka