
On Sat, Jul 18, 2009 at 15:34, Cory Nelson <phrosty@gmail.com> wrote:
> On Fri, Jul 17, 2009 at 4:29 PM, Rogier van Dalen <rogiervd@gmail.com> wrote:
>> Though I'm not sure decoding this much UTF-8-encoded data is often a
>> bottleneck in practice,
>
> UTF-8 is the primary bottleneck in XML decoding. That's been my
> motivation thus far.
And is it necessary to decode large stretches of UTF-8, rather than only
the textual content? I'd imagine the performance characteristics are
quite different when you decode only short stretches of text at a time.
But I've never actually made this comparison, so I'm happy to take your
word for it.
>> It now seems to me that a full Unicode library would be hard to get
>> accepted into Boost; it would be more feasible to get a UTF library
>> submitted, which is more along the lines of your library. (A Unicode
>> library could later be based on the same principles.)
>>
>> Freestanding transcoding functions and codecvt facets are not the only
>> things I believe a UTF library would need, though. I'd add to the list:
>> - iterator adaptors (input and output);
>> - range adaptors;
>> - a code point string;
>> - compile-time encoding (meta-programming);
>> - documentation.
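To make the first item on that list concrete: the kind of input iterator
adaptor I have in mind would look roughly like the sketch below. All names
(utf8_codepoint_iterator and so on) are invented for illustration, not an
existing interface, and validation of ill-formed sequences is omitted for
brevity.

// Rough sketch of a UTF-8 decoding input iterator adaptor.
// Illustrative names only; no validation of ill-formed sequences.
#include <iterator>
#include <boost/cstdint.hpp>

template <typename BaseIterator>
class utf8_codepoint_iterator {
public:
    explicit utf8_codepoint_iterator(BaseIterator position)
        : position_(position) {}

    // Decode the code point whose lead byte is at the current position.
    boost::uint32_t operator*() const {
        BaseIterator it = position_;
        unsigned char lead = static_cast<unsigned char>(*it++);
        if (lead < 0x80)
            return lead;                    // single-byte (ASCII)
        int extra = lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        boost::uint32_t code_point = lead & (0x3F >> extra);
        for (int i = 0; i != extra; ++i)    // accumulate continuation bytes
            code_point = (code_point << 6)
                | (static_cast<unsigned char>(*it++) & 0x3F);
        return code_point;
    }

    // Advance past all code units of the current code point.
    utf8_codepoint_iterator & operator++() {
        unsigned char lead = static_cast<unsigned char>(*position_);
        int length = lead < 0x80 ? 1 : lead < 0xE0 ? 2
            : lead < 0xF0 ? 3 : 4;
        std::advance(position_, length);
        return *this;
    }

    bool operator==(const utf8_codepoint_iterator & other) const
    { return position_ == other.position_; }
    bool operator!=(const utf8_codepoint_iterator & other) const
    { return position_ != other.position_; }

private:
    BaseIterator position_;
};

An output adaptor would do the reverse: take code points and append the
corresponding UTF-8 code units to an underlying sequence.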
> I agree, mostly. I'm not sure if a special string (as opposed to
> basic_string<utf16_t>) would be worthwhile -- what would you do with it
> that didn't require a full Unicode library supporting it?
Good point. I am not able to come up with a use case other than "use it
as the base of a grapheme string". From the tactical perspective of
getting something through a Boost review, though, it would help to flesh
out the design of a code point string before writing a grapheme string
in the same vein. I think. But I'm becoming less sure of it as I write
it.
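To give a feel for what fleshing that out might produce, here is a first
sketch of a code point string over UTF-16 storage. The names
(codepoint_string, utf16_t) are invented for illustration, and this is
nowhere near a worked-out design:

// Sketch of a code point string over UTF-16 storage.
// codepoint_string and utf16_t are illustrative names, not an existing API.
#include <cstddef>
#include <vector>
#include <boost/cstdint.hpp>

typedef boost::uint16_t utf16_t;

class codepoint_string {
public:
    // Append one code point, emitting a surrogate pair where needed --
    // an invariant that basic_string<utf16_t> cannot enforce by itself.
    void push_back(boost::uint32_t code_point) {
        if (code_point < 0x10000) {
            data_.push_back(static_cast<utf16_t>(code_point));
        } else {
            code_point -= 0x10000;
            data_.push_back(static_cast<utf16_t>(0xD800 + (code_point >> 10)));
            data_.push_back(static_cast<utf16_t>(0xDC00 + (code_point & 0x3FF)));
        }
    }

    // Size in code points, not code units: skip high surrogates so that
    // each surrogate pair counts once.
    std::size_t size() const {
        std::size_t count = 0;
        for (std::size_t i = 0; i != data_.size(); ++i)
            if (data_[i] < 0xD800 || data_[i] > 0xDBFF)
                ++count;
        return count;
    }

private:
    std::vector<utf16_t> data_;
};

The point of a dedicated type over basic_string<utf16_t> would be
invariants like these: well-formed surrogate pairs on insertion, and
size() counting code points rather than code units.

Cheers,
Rogier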