
----- Original Message -----
From: Beman Dawes <bdawes@acm.org>
To: Boost Developers List <boost@lists.boost.org>
Cc:
Sent: Saturday, January 28, 2012 6:46 PM
Subject: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
Beman.github.com/string-interoperability/interop_white_paper.html describes Boost components intended to ease string interoperability in general and Unicode string interoperability in particular.
These proposals are the Boost version of the TR2 proposals made in N3336, Adapting Standard Library Strings and I/O to a Unicode World. See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html.
I'm very interested in hearing comments about either the Boost or the TR2 proposal. Are these useful additions? Is there a better way to achieve the same easy interoperability goals?
Where is the best home for the Boost proposals? A separate library? Part of some existing library?
Are these proposals orthogonal to the need for deeper Unicode functionality, such as Mathias Gaunard's Unicode components?
--Beman
Before I address specific points in the draft, I'd like to say: this is not the way to go. In order to make Unicode work, we need two things:

1. First of all, the standard should say that any compiler must be able to treat literals and input text as UTF-8, and recommend that this be the default. That would make developers' lives much easier, whether they develop for "wide" Unicode or for narrow UTF-8.

2. The standard does not define which locales are actually supported or how they are named. It should state explicitly that UTF-8 locales must be supported.

The rest becomes trivial: std::wcout << L"שלום" and std::cout << "שלום" would just work, and much more.

We are all working so hard to work around a design flaw of C++ and the C++ standard library, which allow ANSI encodings and are built around them. If the standard required, or at least recommended, UTF-8 handling by default, we would not need boost::filesystem::path::imbue and all the other machinery that makes life a nightmare.

If we want to go forward with Unicode we need to deprecate non-UTF encodings: UTF-8, UTF-16 and UTF-32 should be the defaults, or selected at compile time, and the standard library should handle them. Take a look at what Go did. All modern languages are Unicode by nature; let C++ be as well. Everything else is just a workaround for a deeper problem and makes programming harder. This would be possible if the standard committee voted for it.
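For contrast, here is roughly what it takes today just to get the std::wcout << L"שלום" line above to print correctly on a typical Linux system. This is only a sketch under my own assumptions: an environment locale such as en_US.UTF-8, a UTF-8 terminal, and a compiler that stores the narrow literal as UTF-8 bytes.

    #include <iostream>
    #include <locale>

    int main()
    {
        // Under the default "C" locale std::wcout cannot narrow the wide
        // literal, so the program has to opt in to the user's (hopefully
        // UTF-8) locale by hand:
        std::locale::global(std::locale(""));
        std::wcout.imbue(std::locale());
        std::wcout << L"שלום" << L"\n";

        // std::cout << "שלום" needs no locale at all -- it just copies
        // bytes -- but it is correct only if the literal really is UTF-8
        // and the terminal expects UTF-8, which is exactly what the
        // standard currently refuses to guarantee.
    }

If UTF-8 were the required default, none of this ceremony would be needed.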
--------------------------------------------

Now some specific points about the converting iterator: it is fine for Unicode encoding conversion, but it is very problematic for non-Unicode encodings. A small note first:

Interfaces don't work well with generic programming techniques, particularly iterators.
An iterator is a bad design for general encoding conversion, for several reasons.

In many cases conversion is stateful, and an iterator is not the best concept for that. Some conversions require complex algorithms that cannot simply be inlined but have to be implemented through inheritance; an iterator would then need several virtual function calls per character for a trivial operation, or would become very complex because it would need buffering techniques inside the iterator. That is why the codecvt interface is actually a good fit for encoding conversion (even though it has a design flaw: mbstate_t is useless for implementing stateful encoders; if mbstate_t were something reasonable, it would be a very good interface). Let me explain why:

1. In some cases you may want to perform normalization or other operations before conversion, because it is not always correct to assume that one character of the XYZ encoding maps to one Unicode code point. Sometimes several characters join into a single code point, and the other way around.

2. When you operate on complex encodings it is better to pass a buffer, for performance: a conversion algorithm works much better on a chunk of text than through a per-character API. Just look at the MSVC standard library: its wide-to-narrow conversion calls codecvt for **every** code point rather than using buffers. So do you really expect implementations to actually create efficient iterators? (See the sketch at the end of this mail.)

Bottom line: an "iterator range" is not a good method for handling this.

-------------------------------------

Iterator concept.

The paper does not say which iterator concept is required. Input? Output? Forward? Bidirectional? Random access? For some encodings it can work as a bidirectional or even a random-access iterator; for others it may be a forward iterator only.

-----------------------------

So I don't really think this is the way to go.

Artyom Beilis
--------------
CppCMS - C++ Web Framework: http://cppcms.com/
CppDB - C++ SQL Connectivity: http://cppcms.com/sql/cppdb/
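P.S. The buffer-oriented sketch mentioned above, showing chunk-at-a-time conversion through codecvt. It is only an illustration: the helper name widen(), the 256-character buffer, and the assumption that the given locale carries a suitable codecvt facet are all mine.

    #include <locale>
    #include <stdexcept>
    #include <string>

    // Convert a narrow (assumed UTF-8) string to the wide encoding,
    // calling the codecvt facet once per chunk instead of once per
    // code point.
    std::wstring widen(std::string const & in, std::locale const & loc)
    {
        typedef std::codecvt<wchar_t, char, std::mbstate_t> cvt_type;
        cvt_type const & cvt = std::use_facet<cvt_type>(loc);

        std::wstring out;
        std::mbstate_t state = std::mbstate_t();
        char const * from = in.data();
        char const * const from_end = from + in.size();

        while(from != from_end) {
            wchar_t chunk[256];
            wchar_t * to_next = chunk;
            char const * from_next = from;
            std::codecvt_base::result const r =
                cvt.in(state, from, from_end, from_next,
                       chunk, chunk + 256, to_next);
            // Stop on an invalid sequence, or if no progress was made
            // because the input ends with an incomplete sequence.
            if(r == std::codecvt_base::error || from_next == from)
                throw std::runtime_error("bad or incomplete sequence");
            out.append(chunk, to_next);  // whatever fit into this chunk
            from = from_next;
        }
        return out;
    }

Called as, say, widen(utf8_bytes, std::locale("en_US.UTF-8")), this makes one facet call per 256 characters, which is exactly the granularity an iterator interface cannot express without hidden buffering.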