Re: [boost] Re: Any interest in adding unicode support to boost?

20 Oct 2004

      From: "Erik Wien" <wien@start.no>
...
Robert Ramey wrote:
...
a) the standard library has std::basic_string<T>, where T is any type: 
char, wchar_t, or whatever.
Yes. The problem with Unicode is that it is not really possible to represent 
a character as an atomic value. A single glyph could in extreme cases be 
made up of 3 (or even more) 32-bit code units (UTF-32), and therefore 
defining a good T is nigh on impossible.
Could the character type be a class that can hold one or more
data members of some representation type plus a pointer to
overflow data?  Then, an abstract character can be represented
completely within the character type if the encoding is
sufficiently simple, and if the encoding is more complex, the
additional data is put on the free store.

For example, if most characters can be represented with a single
representation type instance, then the class would contain one
data member of that type plus a pointer to the rest, if any.

Performance analysis can indicate how best to implement such a
class, but it could have from one to N data members of the
representation type, where N is the maximum number of
representation type values needed to represent all abstract
characters.  Differing choices of N and the representation type
will give different performance characteristics for a given
Unicode string.  Those values might be tuned for general purpose
use or they might be exposed via template parameters.
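
A minimal sketch of such a class might look like the following.
It is only an illustration under stated assumptions: the names
abstract_char, Rep, N, and uses_free_store are hypothetical and
come from no existing library, C++11 is assumed for
std::unique_ptr, and copying is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch: holds up to N representation values inline and
// puts any remaining values on the free store.
template <typename Rep, std::size_t N = 1>
class abstract_char {
    static_assert(N > 0, "at least one inline value is required");

public:
    // Build from `count` representation values starting at `units`.
    abstract_char(const Rep* units, std::size_t count) : size_(count) {
        const std::size_t inline_count = count < N ? count : N;
        for (std::size_t i = 0; i < inline_count; ++i)
            inline_[i] = units[i];
        if (count > N)  // complex case: spill the rest to the free store
            overflow_.reset(new std::vector<Rep>(units + N, units + count));
    }

    // Number of representation values making up this abstract character.
    std::size_t size() const { return size_; }

    // True only when the inline storage was not sufficient.
    bool uses_free_store() const { return overflow_ != nullptr; }

    // The i-th representation value, from inline or overflow storage.
    Rep operator[](std::size_t i) const {
        assert(i < size_);
        return i < N ? inline_[i] : (*overflow_)[i - N];
    }

private:
    Rep inline_[N] = {};                          // the simple case lives here
    std::size_t size_;                            // total values stored
    std::unique_ptr<std::vector<Rep>> overflow_;  // null unless size_ > N
};
```

With Rep = char32_t and N = 1, a plain 'a' would stay entirely
inline, while 'e' followed by two combining marks would allocate
overflow storage; with N = 3 the same sequence would fit inline.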

Granted, a simple character is enlarged by an unused pointer and
it may be that using N objects of the representation type takes
no more space, thereby obviating the conditional code checking
for a non-null pointer.  Nevertheless, it's an idea to consider,
if only for a minute. ;-)

-- 
Rob Stewart                           stewart@sig.com
Software Engineer                     http://www.sig.com
Susquehanna International Group, LLP  using std::disclaimer;