
Rogier van Dalen wrote:
<snip> I believe we are talking about different kinds of users. Let's get this clear: I was assuming that the Unicode library will be aimed at programmers doing everyday programming jobs whose programs will have to deal with non-English characters (because they're bound to be localised, or because non-English names will be inserted in a database, or whatever), i.e. people who have no idea about how Unicode works and don't want to, as long as it does work.
That was my initial thought. This Unicode library should, in my opinion, make handling Unicode strings correctly as easy as handling ASCII strings is today. But that does not mean we have to put mittens on everyone else to keep them away from the lower-level details. If you need to manipulate code points, I think you should be allowed to. Code units, on the other hand, I'm a little more wary about, since users could easily screw things up at that level (make a sequence ill-formed, for instance). Furthermore, I don't really see why anyone would need to muck about with code units.
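To make that ill-formedness danger concrete, here is a minimal self-contained sketch (my own illustration, not part of any proposed API) of how blind code unit manipulation breaks a UTF-16 string:

#include <string>

int main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF is the surrogate pair 0xD834 0xDD1E
    // in UTF-16, i.e. one code point but two code units.
    std::basic_string<char16_t> s;
    s.push_back(0xD834);
    s.push_back(0xDD1E);

    // Truncating at an arbitrary code unit boundary splits the pair and
    // leaves s as ill-formed UTF-16: a lone high surrogate.
    s.resize(1);
}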
What I think would be a good interface:
// A string of code points, encoded UTF-16 (or templated).
class code_point_string {
public:
    //...
    const std::basic_string<char16_t> & code_units() const;
};
// A string of "grapheme clusters", with a code_point_string underlying. // The string is always in a normalisation form. template <class NormalisationPolicy = NormalisationFormC> class unicode_string { public: //... const code_point_string & code_points() const; };
Those who need to process code points can happily use code_point_string; others can use unicode_string.
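As an aside, a small self-contained example (plain char16_t strings, none of the proposed classes) of why the grapheme cluster layer has to pin down a normalisation form:

#include <cassert>
#include <string>

int main()
{
    // "é" spelled as one code point: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
    std::basic_string<char16_t> composed(1, char16_t(0x00E9));

    // "é" spelled as two code points: U+0065 followed by
    // U+0301 COMBINING ACUTE ACCENT.
    std::basic_string<char16_t> decomposed;
    decomposed.push_back(0x0065);
    decomposed.push_back(0x0301);

    // At the code point level these are different strings...
    assert(composed != decomposed);

    // ...but both render as the single grapheme cluster "é", which is why
    // unicode_string normalises: under NFC both collapse to U+00E9.
}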
This is starting to look more and more like the way to go, in my opinion. By layering interfaces with an increasing level of abstraction (from code points and up), we could more or less keep everyone happy.

What I really don't like about this solution is that we would end up with a myriad of different types that are all "unicode strings", just at different levels. I can easily imagine mayhem erupting when everyone gets their favourite Unicode abstraction and uses it exclusively in their APIs. Passing strings around would be a complete nightmare.

One solution could be to make code points the "base level" of abstraction and use normalisation policies (like you outlined) for the functions where normalisation form actually matters (find etc.). That way we could still get most of the functionality a grapheme_cluster_string would provide, but without the extra types. I'm just afraid that if we have a code_point_string in all encodings, plus the dynamic one, plus the same number of strings at the grapheme cluster level, there would simply be too many of them, and it would confuse users more than it would help them. Feel free to convince me otherwise, though.
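For what it's worth, here is a rough sketch of what I mean, with std::basic_string<char16_t> standing in for code_point_string and a hypothetical NormalisationFormC policy; a real policy would apply actual Unicode normalisation:

#include <cstddef>
#include <string>

// Hypothetical policy: a real implementation would perform canonical
// decomposition and recomposition; this stub only marks where it happens.
struct NormalisationFormC
{
    static std::basic_string<char16_t>
    normalise(const std::basic_string<char16_t> & s)
    {
        return s; // placeholder for real NFC normalisation
    }
};

// find() at the code point level, with normalisation as a policy instead
// of a separate grapheme-cluster string type.
template <class NormalisationPolicy = NormalisationFormC>
std::size_t find(const std::basic_string<char16_t> & haystack,
                 const std::basic_string<char16_t> & needle)
{
    // Normalising both operands under the same form means "é" matches
    // whether it was stored precomposed or as e + combining acute.
    return NormalisationPolicy::normalise(haystack)
               .find(NormalisationPolicy::normalise(needle));
}

The caller keeps a single string type; only the algorithms that are sensitive to normalisation grow a policy parameter.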