
"John Maddock" <john@johnmaddock.co.uk> writes:
[snip]
You can use whatever you want - I don't think users should be constrained to a specific internal encoding. Personally I don't like UTF-8 either, but I know some people do...
Well, the problem with making the Unicode string, however it is defined, templated on the encoding is that the entire library would then have to exist as function templates. Given the size and data-driven nature of operations such as collation, and the convenience of being able to use run-time polymorphism, that would be rather undesirable, as I see it.
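To illustrate what run-time polymorphism buys here (hypothetical names throughout, just a sketch): with a single internal representation, a large data-driven operation like collation can sit behind an ordinary virtual interface, and virtual functions cannot be templates:

class unicode_string;  // one fixed, non-template string type

class collator {
public:
    virtual ~collator() {}
    virtual int compare(const unicode_string& a,
                        const unicode_string& b) const = 0;
};

// Were the string templated on its encoding, compare() would have to
// be a function template, e.g.
//   template <class Encoding>
//   int compare(const ustring<Encoding>&, const ustring<Encoding>&);
// instantiated (collation tables and all) per encoding, and it could
// no longer be virtual.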
The other issues I see with using basic_string are that many of its methods would not be suitable for use with a Unicode string, and that it lacks something like an operator += for appending a single Unicode code point (represented as a 32-bit integer).
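For example, appending one code point to a UTF-16 sequence takes encoding logic along these lines (a hypothetical unicode_string, offered only as a sketch):

#include <cstdint>
#include <vector>

class unicode_string {
public:
    // Append a single code point, encoding it as one UTF-16 code unit
    // or as a surrogate pair.
    unicode_string& operator+=(std::uint32_t cp) {
        if (cp < 0x10000) {
            units_.push_back(static_cast<std::uint16_t>(cp));
        } else {
            cp -= 0x10000;
            units_.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            units_.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        }
        return *this;
    }
private:
    std::vector<std::uint16_t> units_;
};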
What it comes down to is that basic_string is designed with fixed-width character representations in mind.
I would be more in favor of creating a separate type to represent Unicode strings.
Personally I think we have too many string types around already. While I understand your concerns about basic_string, as a container of code points it's just fine IMO. We can always add non-member functions for more advanced manipulation.
I do see the advantage of using an existing type, and I would agree that the run-time complexity issues can be avoided. However, I can also see how it would give a false sense of compatibility, because users would tend to view basic_string as something higher-level than it really is. For instance, basic_string defines many low-level operations such as find (even find for a single code unit) which, when dealing with Unicode text, should probably be avoided in most cases; the additional verbosity of using std::find and std::search would therefore be beneficial. Also, I do like the idea of having an operator += for the code point type, which would not be possible with basic_string.
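For example (assuming a UTF-16 specialization of basic_string; C++11's u16string is used here only for brevity):

#include <cassert>
#include <string>

int main() {
    // U+10400 is stored as the surrogate pair 0xD801 0xDC00 in UTF-16.
    std::u16string s = u"\U00010400";
    // find() happily matches the lone high surrogate: a "hit" on half
    // a character, which no Unicode-aware search should ever report.
    assert(s.find(u'\xD801') == 0);
}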
[snip]
Furthermore, many of the operations, such as toupper on a single character, are not well defined; rather, they must be defined as a string-to-string mapping.
I know; however, 1-to-1 approximations are available (those in UnicodeData.txt). I'm not saying that the std locale facets should be the only interface, or even the primary one, but providing it does get a lot of other stuff working.
As I describe below, I would be reluctant to provide a deficient interface, particularly when it is likely to be used since it is the interface with which users are familiar.
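To make the deficiency concrete (a hypothetical to_upper_full, not a proposed interface): U+00DF LATIN SMALL LETTER SHARP S (ß) uppercases to "SS", two code points, which no char32_t-to-char32_t toupper can return.

#include <string>

std::u32string to_upper_full(char32_t c) {
    if (c == U'\x00DF')            // ß
        return U"SS";
    // ... table-driven lookup in a real implementation ...
    return std::u32string(1, c);   // fall back to the 1-to-1 mapping
}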
Finally, the single-character type must be a 32-bit integer, while the code unit type will probably not be (since UTF-32 as the internal representation would be inefficient).
True, for UTF-16 only the core Unicode subset would be supported by std::locale (i.e., no surrogates); this is the same situation as in Java and JavaScript.
I don't really like the idea of providing a deficient interface, since at the very least that would necessitate providing an additional, non-deficient interface. I would say: if we are going to add Unicode support, we might as well do it right, instead of trying to hack it into the existing interface in a way that would result in something less than intuitive to use. Note that, AFAIK, there are a few common-use (Cantonese) characters outside the Basic Multilingual Plane (BMP).
Specific cases include collate<Ch>, which lacks an interface for configuring collation, such as which strength level to use, whether uppercase or lowercase letters should sort first, whether in French locales accents should be sorted right to left, and other such features. It is true that an additional, more powerful interface could be provided, but this would add complexity.
You can provide any constructor interface to the collate facet that you want; for example, to support a locale and a strength level one might use:
#include <climits>  // INT_MAX
#include <locale>

template <class charT>
class unicode_collate : public std::collate<charT> {
public:
    unicode_collate(const char* name, int level = INT_MAX);
    /* details */
};
I'm assuming that we have a non-member function to create a locale object that contains a set of Unicode facets:
std::locale create_unicode_locale(const char* name);
Usage to create a locale object with primary level collation would then be:
std::locale l(create_unicode_locale("en_GB"),
              new unicode_collate<wchar_t>("en_GB", 1)); // facet's charT must match the stream's
mystream.imbue(l);
mystream << something; // etc.
Okay, I suppose the parameters can be specified at construction time.
Additionally, it depends on basic_string<Ch> (note the lack of a char_traits parameter), which is used as the return type of transform, when something representing a byte array might be more suitable.
You might have me on that one :-)
There is also the additional problem with UTF-16 that, AFAIK, a binary sort of UTF-16 strings does not give an ordering by code point. This is one more reason I don't really like the use of basic_string.
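A small self-contained illustration (again using C++11's u16string only for brevity):

#include <cassert>
#include <string>

int main() {
    std::u16string a = u"\uE000";      // one code unit: 0xE000
    std::u16string b = u"\U00010000";  // surrogate pair: 0xD800 0xDC00
    assert(a > b);            // binary (code-unit) order: 0xE000 > 0xD800
    assert(0xE000 < 0x10000); // yet U+E000 < U+10000 by code point
}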
Additionally, num_put, moneypunct and money_put would all allow only a single code unit in a number of cases where a string of multiple code points would be suitable. Those facets also depend on basic_string<Ch>.
I don't understand what the problem is there; please explain.
For purposes such as specifying the fill character, these facets allow only a single code unit, where for Unicode a string of code points would be more suitable. Additionally, these facets must supply certain other symbols as basic_string<Ch> (without a char_traits specialization), and so various specializations of basic_string would end up being used to represent Unicode strings, which I would say is a problem.
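To illustrate via the iostreams side of the same limitation (num_put::do_put likewise receives the fill as a single charT):

#include <iomanip>
#include <sstream>

int main() {
    std::wostringstream os;
    os.fill(L'*');             // fill() accepts exactly one wchar_t
    os << std::setw(8) << 42;  // yields "******42"
    // A fill "character" needing more than one code unit in the
    // stream's encoding (e.g. any non-BMP code point where wchar_t
    // is 16 bits) cannot be expressed through this interface at all.
}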
[snip]
Additionally, a date formatting facility for Unicode would be useful.
std::time_get / std::time_put ? :-)
std::time_put has the same problems as std::num_put with respect to requiring a single code unit to represent the fill character.

--
Jeremy Maitin-Shepard