Re: [boost] [unicode] Interest Check / Proof of Concept

20 Nov 2008


      ...
Eric Niebler wrote:
...
Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode
library comes up, the discussion immediately descends into a debate about
how to design yet another string class. Such a high level wrapper *might* be
useful (strong emphasis on "might"), but the core must be the Unicode
algorithms, and the design for a Unicode library must start there.
Since it seems like there's a lot of concern with making a new string type,
how about the following (off-the-cuff):
* Iterator filters a la Zach's message:
[snip]
...
* Runtime-defined filters:
typedef boost::recoding_iterator<boost::utf16, boost::runtime>
        utf16_to_any_iter;
boost::runtime *my_codec = /*...*/;
std::copy(utf16_to_any_iter(u_string.begin(), my_codec),
          utf16_to_any_iter(u_string.end(), my_codec),
          std::back_inserter(std_string));
Yes, that's what I was thinking as well.  In fact, if you look at the
Boost.GIL any_image<> and any_image_view<> templates, you'll see that
they allow the user to specify a limited number of variants (a la
Boost.Variant).  So it's more restrictive than a Boost.Any, but that
might be an advantage if it allows you to detect more errors at
runtime.  I think that in most use cases, one will have knowledge of the
maximum number of encodings that are possible in that case.  Just
something to consider.
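
To make the bounded-variant idea a bit more concrete, here is a rough
sketch.  It uses std::variant (C++17) purely for illustration, and the tag
types and helper functions are all made up for this example; none of them
are an existing Boost interface.

```cpp
#include <cstddef>
#include <string>
#include <variant>

// Hypothetical encoding tags, invented for this sketch.
struct utf8_tag {
    static constexpr std::size_t max_units = 4;  // code units per code point
    static const char* name() { return "UTF-8"; }
};
struct utf16_tag {
    static constexpr std::size_t max_units = 2;
    static const char* name() { return "UTF-16"; }
};
struct utf32_tag {
    static constexpr std::size_t max_units = 1;
    static const char* name() { return "UTF-32"; }
};

// The closed set of encodings this particular application will accept.
// An encoding outside this list cannot even be stored in the variant.
using any_encoding = std::variant<utf8_tag, utf16_tag, utf32_tag>;

// Dispatch on whichever encoding is currently held.
inline std::size_t max_code_units(const any_encoding& enc) {
    return std::visit([](auto tag) { return decltype(tag)::max_units; }, enc);
}

inline std::string encoding_name(const any_encoding& enc) {
    return std::visit(
        [](auto tag) { return std::string(decltype(tag)::name()); }, enc);
}
```

The visitor has to compile for every alternative in the list, so forgetting
to handle one of the permitted encodings shows up as a build failure rather
than a surprise later on.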
...
* Shorthand for the above two points:
boost::transcode(u_string, boost::utf16(),
               std_string, boost::utf8());
Looks good, but is this function an assignment, or an append?
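
To show why the distinction matters, here is a small sketch with the two
behaviours spelled out as separate functions.  The names transcode_append
and transcode_assign are hypothetical, and the conversion is hard-wired to
UTF-16 in, UTF-8 out, with unpaired surrogates replaced by U+FFFD; it is
not the proposed boost::transcode.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point to `out`.
inline void append_utf8(char32_t cp, std::string& out) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Append semantics: whatever `out` already holds is preserved.
inline void transcode_append(const std::u16string& in, std::string& out) {
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        append_utf8(cp, out);
    }
}

// Assignment semantics: `out` is cleared first.
inline void transcode_assign(const std::u16string& in, std::string& out) {
    out.clear();
    transcode_append(in, out);
}
```

With an append-style primitive the assigning form is a one-liner on top,
whereas going the other way costs a temporary, which is one argument for
making append (or an output iterator) the basic operation.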
...
* String views that can wrap up the encoding type and the data (a container
of some kind: strings, vector<char>s, ropes, etc):
boost::estring_view<boost::utf8> my_utf8_string(std_string);
boost::estring_view<> my_rt_string(str, my_codec);
boost::transcode(my_utf8_string, my_rt_string);
Yes.  Views are notably absent in my original post.  I think views are
essential for encodings that are variable in length (e.g. UTF-8).
Getting the character-location of code point N, or vice versa, and
doing it efficiently, is a must-have.
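
As a rough illustration of the two lookups I mean, here is a minimal sketch
of a read-only view over UTF-8 data.  The class name utf8_view and its
members are hypothetical, the input is assumed to be well-formed UTF-8, and
both lookups are plain linear scans; an efficient version would presumably
cache offsets or keep some kind of index.

```cpp
#include <cstddef>
#include <string>

// Hypothetical view over a std::string assumed to hold well-formed UTF-8.
// The view does not own the string; the string must outlive it.
class utf8_view {
public:
    explicit utf8_view(const std::string& s) : s_(&s) {}

    // Number of code points in the underlying string.
    std::size_t length() const {
        std::size_t count = 0;
        for (std::size_t i = 0; i < s_->size(); ++i)
            if (is_lead(i)) ++count;
        return count;
    }

    // Byte offset at which code point number `n` (zero-based) starts.
    // Returns the string's size if `n` is past the last code point.
    std::size_t byte_offset(std::size_t n) const {
        std::size_t i = 0;
        for (; i < s_->size(); ++i) {
            if (is_lead(i)) {
                if (n == 0) return i;
                --n;
            }
        }
        return i;
    }

    // Zero-based index of the code point that contains byte `offset`.
    std::size_t code_point_index(std::size_t offset) const {
        std::size_t count = 0;
        for (std::size_t i = 0; i <= offset && i < s_->size(); ++i)
            if (is_lead(i)) ++count;
        return count == 0 ? 0 : count - 1;
    }

private:
    // Any byte that is not a continuation byte (10xxxxxx) starts a code point.
    bool is_lead(std::size_t i) const {
        return (static_cast<unsigned char>((*s_)[i]) & 0xC0) != 0x80;
    }

    const std::string* s_;
};
```

For a fixed-width encoding both functions collapse to a multiplication or
division, which is why the view only really earns its keep for
variable-length encodings.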
...
Luckily, most of the work I've done is in making the encoding facets
extensible and chooseable at runtime, so I wouldn't mourn the loss of my
(frankly none-too-zazzy) string class.
This is just what I was hoping.  The bulk of the work you'll do in any
case will probably be with the algorithms and number of supported
encodings.

Zach

Zach Laine