
"John Maddock" <john@johnmaddock.co.uk> writes:
[snip]
You can use whatever you want - I don't think users should be constrained to a specific internal encoding. Personally I don't like UTF-8 either, but I know some people do...
Well, the problem with making the Unicode string, however it is defined, templated on the encoding is that the entire library would then have to exist as function templates. Given the size and data-driven nature of operations such as collation, and the convenience of being able to use run-time polymorphism, that would be rather undesirable, as I see it.
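To illustrate what run-time polymorphism buys here (hypothetical names throughout, just a sketch): with a single internal representation, a large data-driven operation like collation can sit behind an ordinary virtual interface, and virtual functions cannot be templates:

class unicode_string;  // one fixed, non-template string type

class collator {
public:
    virtual ~collator() {}
    virtual int compare(const unicode_string& a,
                        const unicode_string& b) const = 0;
};

// Were the string templated on its encoding, compare() would have to
// be a function template, e.g.
//   template <class Encoding>
//   int compare(const ustring<Encoding>&, const ustring<Encoding>&);
// instantiated (collation tables and all) per encoding, and it could
// no longer be virtual.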
The other issues I see with using basic_string are that many of its methods would not be suitable for use with a Unicode string, and that it lacks something like an operator += for appending a single Unicode code point (represented as a 32-bit integer).
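For example, appending one code point to a UTF-16 sequence takes encoding logic along these lines (a hypothetical unicode_string, offered only as a sketch):

#include <cstdint>
#include <vector>

class unicode_string {
public:
    // Append a single code point, encoding it as one UTF-16 code unit
    // or as a surrogate pair.
    unicode_string& operator+=(std::uint32_t cp) {
        if (cp < 0x10000) {
            units_.push_back(static_cast<std::uint16_t>(cp));
        } else {
            cp -= 0x10000;
            units_.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            units_.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return *this;
    }
private:
    std::vector<std::uint16_t> units_;
};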
What it comes down to is that basic_string is designed with fixed-width character representations in mind.
I would be more in favor of creating a separate type to represent Unicode strings.
Personally I think we have too many string types around already. While I understand your concerns about basic_string, as a container of code points it's just fine IMO. We can always add non-member functions for more advanced manipulation.
I do see the advantage of using an existing type, and I would agree that the run-time complexity issues can be avoided. However, I can also see how it would give a false sense of compatibility, because users would tend to view basic_string as something higher-level than it really is. For instance, basic_string defines many low-level operations such as find (even find for a single code unit) which, when dealing with Unicode text, should probably be avoided in most cases; the additional verbosity of using std::find and std::search would therefore be beneficial. Also, I do like the idea of having an operator += for the code point type, which would not be possible with basic_string.
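For example (assuming a UTF-16 specialization of basic_string; C++11's u16string is used here only for brevity):

#include <cassert>
#include <string>

int main() {
    // U+10400 is stored as the surrogate pair 0xD801 0xDC00 in UTF-16.
    std::u16string s = u"\U00010400";
    // find() happily matches the lone high surrogate: a "hit" on half
    // a character, which no Unicode-aware search should ever report.
    assert(s.find(u'\xD801') == 0);
}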
[snip]
Furthermore, many of the operations, such as toupper on a single character, are not well defined; rather, they must be defined as a string-to-string mapping.
I know; however, 1-to-1 approximations are available (those in UnicodeData.txt). I'm not saying that the std locale facets should be the only interface, or even the primary one, but providing it does get a lot of other stuff working.
As I describe below, I would be reluctant to provide a deficient interface, particularly when it is likely to be used since it is the interface with which users are familiar.
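To make the deficiency concrete (a hypothetical to_upper_full, not a proposed interface): U+00DF LATIN SMALL LETTER SHARP S (ß) uppercases to "SS", two code points, which no char32_t-to-char32_t toupper can return.

#include <string>

std::u32string to_upper_full(char32_t c) {
    if (c == U'\x00DF')            // ß
        return U"SS";
    // ... table-driven lookup in a real implementation ...
    return std::u32string(1, c);   // fall back to the 1-to-1 mapping
}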
Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).
True, for UTF-16 only the core Unicode subset would be supported by std::locale (i.e., no surrogates); this is the same situation as in Java and JavaScript.
I don't really like the idea of providing a deficient interface, since at the very least that would necessitate providing an additional, non-deficient interface. I would say: if we are going to add Unicode support, we might as well do it right, instead of trying to hack it into the existing interface in a way that would result in something less than intuitive to use. Note that, AFAIK, there are a few common-use (Cantonese) characters outside the Basic Multilingual Plane (BMP).
Specific cases include collate<Ch>, which lacks an interface for configuring collation, such as which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity.
You can provide any constructor interface to the collate facet that you want; for example, to support a locale and a strength level one might use:
#include <climits>  // INT_MAX
#include <locale>

template <class charT>
class unicode_collate : public std::collate<charT> {
public:
    unicode_collate(const char* name, int level = INT_MAX);
    /* details */
};
I'm assuming that we have a non-member function to create a locale object that contains a set of Unicode facets:
std::locale create_unicode_locale(const char* name);
Usage to create a locale object with primary level collation would then be:
std::locale l(create_unicode_locale("en_GB"),
              new unicode_collate<wchar_t>("en_GB", 1)); // facet's charT must match the stream's
mystream.imbue(l);
mystream << something; // etc.
Okay, I suppose the parameters can be specified at construction time.
Additionally, it depends on basic_string<Ch> (note the lack of a char_traits parameter), which is used as the return type of transform, when something representing a byte array might be more suitable.
You might have me on that one :-)
There is also the additional problem with UTF-16 that, AFAIK, a binary sort of UTF-16 strings does not give an ordering by code point. This is one more reason I don't really like the use of basic_string.
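A small self-contained illustration (again using C++11's u16string only for brevity):

#include <cassert>
#include <string>

int main() {
    std::u16string a = u"\uE000";      // one code unit: 0xE000
    std::u16string b = u"\U00010000";  // surrogate pair: 0xD800 0xDC00
    assert(a > b);            // binary (code-unit) order: 0xE000 > 0xD800
    assert(0xE000 < 0x10000); // yet U+E000 < U+10000 by code point
}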
Additionally, num_put, moneypunct and money_put would all allow only a single code unit in a number of cases where a string of multiple code points would be suitable. Those facets also depend on basic_string<Ch>.
I don't understand what the problem is there; please explain.
For purposes such as specifying the fill character, these facets allow only a single code unit, where for Unicode a string of code points would be more suitable. Additionally, these facets must supply certain other symbols as basic_string<Ch> (without a char_traits specialization), and so various specializations of basic_string would end up being used to represent Unicode strings, which I would say is a problem.
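To illustrate via the iostreams side of the same limitation (num_put::do_put likewise receives the fill as a single charT):

#include <iomanip>
#include <sstream>

int main() {
    std::wostringstream os;
    os.fill(L'*');             // fill() accepts exactly one wchar_t
    os << std::setw(8) << 42;  // yields "******42"
    // A fill "character" needing more than one code unit in the
    // stream's encoding (e.g. any non-BMP code point where wchar_t
    // is 16 bits) cannot be expressed through this interface at all.
}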
[snip]
Additionally, a date formatting facility for Unicode would be useful.
std::time_get / std::time_put ? :-)
std::time_put has the same problems as std::num_put with respect to requiring a single code unit to represent the fill character.

--
Jeremy Maitin-Shepard