
Eric Niebler wrote:
Agreed. Thanks, Zach. I'm discouraged that every time the issue of a Unicode library comes up, the discussion immediately descends into a debate about how to design yet another string class. Such a high-level wrapper *might* be useful (strong emphasis on "might"), but the core must be the Unicode algorithms, and the design for a Unicode library must start there.
Since it seems like there's a lot of concern with making a new string type, how about the following (off-the-cuff):
* Iterator filters a la Zach's message:
[snip]
* Runtime-defined filters:
typedef boost::recoding_iterator<boost::utf16,boost::runtime> utf16_to_any_iter;
boost::runtime *my_codec = /*...*/;
std::copy(utf16_to_any_iter(u_string.begin(), my_codec),
          utf16_to_any_iter(u_string.end(), my_codec),
          std::back_inserter(std_string));
Yes, that's what I was thinking as well. In fact, if you look at the Boost.GIL any_image<> and any_image_view<> templates, you'll see that they allow the user to specify a limited number of variants (a la Boost.Variant). So it's more restrictive than a Boost.Any, but that might be an advantage if it allows you to detect more errors at runtime. I think that in most use cases, one will know the maximum number of encodings that are possible. Just something to consider.
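To make the "limited number of variants" idea concrete, here is a rough sketch. It uses std::variant (C++17) as a stand-in for Boost.Variant, and every name in it (utf8_tag, utf16_tag, latin1_tag, any_encoding, name_of, encoding_name) is made up for illustration and is not part of any Boost library.

```cpp
#include <string>
#include <variant>

// Hypothetical encoding tags; the closed set is chosen by the user.
struct utf8_tag {};
struct utf16_tag {};
struct latin1_tag {};

// A runtime-selected encoding restricted to a known list of alternatives.
using any_encoding = std::variant<utf8_tag, utf16_tag, latin1_tag>;

// One overload per supported encoding.
inline const char* name_of(utf8_tag)   { return "UTF-8"; }
inline const char* name_of(utf16_tag)  { return "UTF-16"; }
inline const char* name_of(latin1_tag) { return "ISO-8859-1"; }

// Dispatch on whichever alternative is currently held.
inline std::string encoding_name(const any_encoding& e) {
    return std::visit([](auto tag) { return std::string(name_of(tag)); }, e);
}
```

With a closed set like this, std::visit only compiles if name_of has an overload for every alternative, and an encoding outside the list cannot be stored in any_encoding at all. An any-style holder would accept anything and leave all such checks to the caller.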
* Shorthand for the above two points:
boost::transcode(u_string, boost::utf16(), std_string, boost::utf8());
Looks good, but is this function an assignment, or an append?
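To illustrate why the distinction matters, here is a minimal sketch of the two possible meanings for a UTF-16 to UTF-8 conversion. The names append_utf8, transcode_append and transcode_assign are hypothetical and are not the proposed boost::transcode. Replacing unpaired surrogates with U+FFFD is my assumption, not something settled in this thread.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to out.
inline void append_utf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Append semantics: whatever dst already holds is kept.
inline void transcode_append(const std::u16string& src, std::string& dst) {
    for (std::size_t i = 0; i < src.size(); ++i) {
        char32_t cp = src[i];
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < src.size()) {
            const char32_t low = src[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        // Assumption: an unpaired surrogate is replaced with U+FFFD.
        if (cp >= 0xD800 && cp <= 0xDFFF) cp = 0xFFFD;
        append_utf8(cp, dst);
    }
}

// Assignment semantics: dst is cleared first, then filled.
inline void transcode_assign(const std::u16string& src, std::string& dst) {
    dst.clear();
    transcode_append(src, dst);
}
```

If only one of the two is provided, append is probably the more primitive choice, since assignment can be built from it with a clear() as shown, while the reverse needs a temporary.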
* String views that can wrap up the encoding type and the data (a container of some kind: strings, vector<char>s, ropes, etc):
boost::estring_view<utf8> my_utf8_string(std_string);
boost::estring_view<> my_rt_string(str, my_codec);
boost::transcode(my_utf8_string, my_rt_string);
Yes. Views are notably absent in my original post. I think views are essential for encodings that are variable in length (e.g. UTF-8). Getting the character-location of code point N, or vice versa, and doing it efficiently, is a must-have.
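As a rough illustration of the index mapping a UTF-8 view would have to provide, here is a sketch of both directions. The function names utf8_offset_of and utf8_index_of are hypothetical, the input is assumed to be well-formed UTF-8, and both functions are plain linear scans. A real view would presumably cache or index positions to make this efficient.

```cpp
#include <cstddef>
#include <string>

// True for any byte that starts a code point, i.e. anything that is not a
// continuation byte of the form 10xxxxxx.
inline bool utf8_is_lead(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
}

// Byte offset of the n-th code point (0-based), or std::string::npos if the
// string holds n or fewer code points.
inline std::size_t utf8_offset_of(const std::string& s, std::size_t n) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (utf8_is_lead(s[i])) {
            if (seen == n) return i;
            ++seen;
        }
    }
    return std::string::npos;
}

// Index of the code point that starts at byte_offset. The offset is assumed
// to lie on a code point boundary.
inline std::size_t utf8_index_of(const std::string& s, std::size_t byte_offset) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset && i < s.size(); ++i) {
        if (utf8_is_lead(s[i])) ++count;
    }
    return count;
}
```

For a string made of "a", U+00E9, U+20AC and "b", the four code points start at byte offsets 0, 1, 3 and 6, which is exactly the bookkeeping a fixed-width encoding never needs.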
Luckily, most of the work I've done is in making the encoding facets extensible and chooseable at runtime, so I wouldn't mourn the loss of my (frankly none-too-zazzy) string class.
This is just what I was hoping. The bulk of the work you'll do in any case will probably be with the algorithms and the number of supported encodings.

Zach