
From: "Erik Wien" <wien@start.no>
Robert Ramey wrote:
a) the standard library has std::basic_string<T>, where T is any type: char, wchar_t, or whatever.
Yes. The problem with Unicode is that it is not really possible to represent a character as an atomic value. A single glyph could in extreme cases be made up of 3 (or even more) 32-bit code units (UTF-32), and therefore defining a good T is nigh on impossible.
Could the character type be a class that can hold one or more data members of some representation type plus a pointer to overflow data? Then, an abstract character can be represented completely within the character type if the encoding is sufficiently simple, and if the encoding is more complex, the additional data is put on the free store.

For example, if most characters can be represented with a single representation type instance, then the class would contain one data member of that type plus a pointer to the rest, if any. Performance analysis can indicate how best to implement such a class, but it could have from one to N data members of the representation type, where N is the maximum number of representation type values needed to represent all abstract characters.

Differing choices of N and the representation type will give different performance characteristics for a given Unicode string. Those values might be tuned for general purpose use or they might be exposed via template parameters.

Granted, a simple character is enlarged by an unused pointer and it may be that using N objects of the representation type takes no more space, thereby obviating the conditional code checking for a non-null pointer. Nevertheless, it's an idea to consider, if only for a minute. ;-)

--
Rob Stewart                           stewart@sig.com
Software Engineer                     http://www.sig.com
Susquehanna International Group, LLP  using std::disclaimer;
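
A minimal sketch of the character class proposed above, to make the trade-off concrete. Everything here is an assumption for illustration: the name abstract_char, its members, and the defaults (a 32-bit representation type and N = 1) are hypothetical and not part of any existing library. The first N code units live inside the object; any remainder goes to the free store behind a pointer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical sketch: one abstract character stored as up to N inline
// code units of type Rep, with any remaining units on the free store.
template <typename Rep = std::uint32_t, std::size_t N = 1>
class abstract_char {
    static_assert(N >= 1, "at least one inline code unit is required");

public:
    // Build a character from `count` code units starting at `units`.
    abstract_char(const Rep* units, std::size_t count) : size_(count) {
        const std::size_t in_place = std::min(count, N);
        std::copy(units, units + in_place, inline_);
        if (count > N) {
            // Complex case: the units that do not fit go to the free store.
            overflow_.reset(new Rep[count - N]);
            std::copy(units + N, units + count, overflow_.get());
        }
    }

    // Deep copy, so each character owns its own overflow data.
    abstract_char(const abstract_char& other) : size_(other.size_) {
        std::copy(other.inline_, other.inline_ + N, inline_);
        if (other.overflow_) {
            overflow_.reset(new Rep[size_ - N]);
            std::copy(other.overflow_.get(),
                      other.overflow_.get() + (size_ - N),
                      overflow_.get());
        }
    }

    abstract_char(abstract_char&&) = default;
    abstract_char& operator=(abstract_char&&) = default;

    abstract_char& operator=(const abstract_char& other) {
        abstract_char tmp(other);   // copy first, then take over the copy
        *this = std::move(tmp);
        return *this;
    }

    // Number of code units that make up this character.
    std::size_t size() const { return size_; }

    // True when the character did not fit in the N inline units.
    bool uses_free_store() const { return overflow_ != nullptr; }

    // This is the conditional mentioned above: every access has to decide
    // between inline storage and the overflow pointer.
    Rep operator[](std::size_t i) const {
        assert(i < size_);
        return i < N ? inline_[i] : overflow_[i - N];
    }

private:
    Rep inline_[N] = {};               // simple characters live entirely here
    std::size_t size_;                 // total number of code units
    std::unique_ptr<Rep[]> overflow_;  // null unless size_ > N
};
```

With the defaults, a plain 'A' (one UTF-32 unit) never touches the free store, while a base letter followed by two combining marks allocates room for the two extra units. Instantiating abstract_char<std::uint32_t, 3> instead keeps that same sequence fully inline, at the cost of a larger object for every simple character, which is exactly the tuning question raised above.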