
On Sat, Jul 18, 2009 at 15:34, Cory Nelson <phrosty@gmail.com> wrote:
> On Fri, Jul 17, 2009 at 4:29 PM, Rogier van Dalen <rogiervd@gmail.com> wrote:
>> Though I'm not sure decoding this much UTF-8-encoded data is often a
>> bottleneck in practice,
>
> UTF-8 is the primary bottleneck in XML decoding. That's been my
> motivation thus far.
And is it necessary to decode large stretches of UTF-8, rather than only
the textual content? I'd imagine the performance characteristics are
quite different when you decode only short stretches of text at a time.
But I've never actually made this comparison, so I'm happy to take your
word for it.
>> It now seems to me that a full Unicode library would be hard to get
>> accepted into Boost; it would be more feasible to get a UTF library
>> submitted, which is more along the lines of your library. (A Unicode
>> library could later be based on the same principles.)
>>
>> Freestanding transcoding functions and codecvt facets are not the only
>> things I believe a UTF library would need, though. I'd add to the list:
>> - iterator adaptors (input and output);
>> - range adaptors;
>> - a code point string;
>> - compile-time encoding (meta-programming);
>> - documentation.
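To make the first item on that list concrete: the kind of input iterator
adaptor I have in mind would look roughly like the sketch below. All names
(utf8_codepoint_iterator and so on) are invented for illustration, not an
existing interface, and validation of ill-formed sequences is omitted for
brevity.

// Rough sketch of a UTF-8 decoding input iterator adaptor.
// Illustrative names only; no validation of ill-formed sequences.
#include <iterator>
#include <boost/cstdint.hpp>

template <typename BaseIterator>
class utf8_codepoint_iterator {
public:
    explicit utf8_codepoint_iterator(BaseIterator position)
        : position_(position) {}

    // Decode the code point whose lead byte is at the current position.
    boost::uint32_t operator*() const {
        BaseIterator it = position_;
        unsigned char lead = static_cast<unsigned char>(*it++);
        if (lead < 0x80)
            return lead;                    // single-byte (ASCII)
        int extra = lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        boost::uint32_t code_point = lead & (0x3F >> extra);
        for (int i = 0; i != extra; ++i)    // accumulate continuation bytes
            code_point = (code_point << 6)
                | (static_cast<unsigned char>(*it++) & 0x3F);
        return code_point;
    }

    // Advance past all code units of the current code point.
    utf8_codepoint_iterator & operator++() {
        unsigned char lead = static_cast<unsigned char>(*position_);
        int length = lead < 0x80 ? 1 : lead < 0xE0 ? 2
            : lead < 0xF0 ? 3 : 4;
        std::advance(position_, length);
        return *this;
    }

    bool operator==(const utf8_codepoint_iterator & other) const
    { return position_ == other.position_; }
    bool operator!=(const utf8_codepoint_iterator & other) const
    { return position_ != other.position_; }

private:
    BaseIterator position_;
};

An output adaptor would do the reverse: take code points and append the
corresponding UTF-8 code units to an underlying sequence.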
> I agree, mostly. I'm not sure if a special string (as opposed to
> basic_string<utf16_t>) would be worthwhile -- what would you do with it
> that didn't require a full Unicode library supporting it?
Good point. I am not able to come up with a use case other than "use it
as the base of a grapheme string". From the tactical perspective of
getting something through a Boost review, though, it would help to flesh
out the design of a code point string before writing a grapheme string
in the same vein. I think. But I'm becoming less sure of it as I write
it.
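To give a feel for what fleshing that out might produce, here is a first
sketch of a code point string over UTF-16 storage. The names
(codepoint_string, utf16_t) are invented for illustration, and this is
nowhere near a worked-out design:

// Sketch of a code point string over UTF-16 storage.
// codepoint_string and utf16_t are illustrative names, not an existing API.
#include <cstddef>
#include <vector>
#include <boost/cstdint.hpp>

typedef boost::uint16_t utf16_t;

class codepoint_string {
public:
    // Append one code point, emitting a surrogate pair where needed --
    // an invariant that basic_string<utf16_t> cannot enforce by itself.
    void push_back(boost::uint32_t code_point) {
        if (code_point < 0x10000) {
            data_.push_back(static_cast<utf16_t>(code_point));
        } else {
            code_point -= 0x10000;
            data_.push_back(static_cast<utf16_t>(0xD800 + (code_point >> 10)));
            data_.push_back(static_cast<utf16_t>(0xDC00 + (code_point & 0x3FF)));
        }
    }

    // Size in code points, not code units: skip high surrogates so that
    // each surrogate pair counts once.
    std::size_t size() const {
        std::size_t count = 0;
        for (std::size_t i = 0; i != data_.size(); ++i)
            if (data_[i] < 0xD800 || data_[i] > 0xDBFF)
                ++count;
        return count;
    }

private:
    std::vector<utf16_t> data_;
};

The point of a dedicated type over basic_string<utf16_t> would be
invariants like these: well-formed surrogate pairs on insertion, and
size() counting code points rather than code units.

Cheers,
Rogier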