Re: [boost] Call for interest for native unicode character and string support in boost

Dear Doug,

Thank you for your positive mail.
You seem to have a lot of valuable experience with Unicode.
I feel that the interface is the more urgent matter for now. The
implementation of the tables can always be changed as long as the
interface remains stable. Let me rephrase that: I think the issues of
programmer's interface and data tables are orthogonal, and programmers
will care most about the interface, and then about performance. This
is not to say you haven't raised a whole lot of important and difficult
issues that must be solved at some point.
I agree that there is a large degree of orthogonality between the character properties and the string handling, but there is also a fundamental and strong link. Until you have access to the character properties, or have at least defined what methods to access those properties will be available, you cannot support Unicode strings. A classic example is the fact that the lower case version of a single Unicode character may be a series of Unicode characters - without understanding this and coding a method to handle it, the string handling cannot be written.

I believe that we should agree on a Unicode character data specification that will supply the Unicode data, so that we can then independently look at the string interface. In my experience the string part can then be split off and handled completely independently, as you have correctly stated. Should these character properties be supplied using an interface, published method calls, etc.? What calls should there be?

If we can agree on the interface/separation of Unicode character data from the string interfaces, then I believe that we will move forward quickly from there, as we are then 'just' talking about algorithm optimisation on a known data set to create the best possible string implementation.
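For what it's worth, a minimal sketch of the kind of interface this implies, with purely illustrative names (code_point, to_lower_full) and a single hard-coded mapping standing in for the generated tables:

    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    // Full lowercase mapping: the result is a sequence, because a single
    // character may lower-case to more than one code point. Only one
    // illustrative entry is hard-coded; a real implementation would be
    // generated from the published Unicode data files.
    std::vector<code_point> to_lower_full(code_point c)
    {
        if (c == 0x0130)                  // LATIN CAPITAL LETTER I WITH DOT ABOVE
            return { 0x0069, 0x0307 };    // 'i' followed by COMBINING DOT ABOVE
        return { c };                     // identity fallback for this sketch
    }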
How have you hooked in dictionary word break support for languages like Thai?
IMO that would be beyond the scope of a general Unicode library.
It is both outside the scope and fundamental to the approach, as this case must be handled/provided for. In my experience this is handled by the dictionary pass [outside the scope of this support] adding special break markers into the text [which need to be supported transparently as Unicode characters that happen to be in the private use range at this level], so that the text and string iterators can then be handled normally. The fact that the break markers are special characters in the private use range should not be relevant or special at this level.
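A minimal sketch of that arrangement, assuming a hypothetical thai_word_breaks() dictionary pass and an arbitrarily chosen marker value U+E000; neither is part of any proposed interface:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    const code_point WORD_BREAK_MARKER = 0xE000;   // a private use code point

    // Stub standing in for a real dictionary-based Thai segmenter; it would
    // return the offsets at which words end.
    std::vector<std::size_t> thai_word_breaks(const std::vector<code_point>&)
    {
        return {};
    }

    // The dictionary pass inserts the marker into the text; from this point
    // on the marker is just another Unicode character, and break iterators
    // need no Thai-specific logic.
    std::vector<code_point> mark_word_breaks(const std::vector<code_point>& text)
    {
        std::vector<std::size_t> breaks = thai_word_breaks(text);
        std::vector<code_point> out;
        std::size_t next = 0;
        for (std::size_t i = 0; i < text.size(); ++i)
        {
            out.push_back(text[i]);
            if (next < breaks.size() && breaks[next] == i + 1)
            {
                out.push_back(WORD_BREAK_MARKER);
                ++next;
            }
        }
        return out;
    }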
How far have you gone? Do you have support for going from logical to display on combined ltor and rtol? Customised glyph conversion for Indic Urdu?
Correct me if I'm wrong, but I think these issues become important
only when rendering Unicode strings. Aren't they thus better handled
by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost
Unicode library should focus on processing symbolic Unicode strings
and keep away from what happens when they are displayed, just like
std::basic_string does.
Unfortunately I believe that there may be serious limitations in this approach. I strongly believe that even if we do not actually write all the code, we must not be in a position where, for example, you have to use a Uniscribe library based on Unicode 4 and a Boost library based on Unicode 4.1. [This is even ignoring Uniscribe's 'custom' handling.] We must provide a Unicode character system on which all libraries can operate consistently. Even working out a grapheme break may require different sets of compromises that must work consistently for any set of inter-related libraries to be successful.

As another example where display controls data organisation: what happens if you want to have a page of text display the same on several machines? This is actually a very difficult thing to do, due to limitations in the Windows GDI scaling [which is not floating point but 'rounds' scaling calculations, and which can result in as much as a +/-10% difference in simple string lengths on different machines unless handled specifically - e.g. IIIIIIIIIIIIIIII can be the same length as WXWXWX on one machine but there can be a 20% difference on another]. It requires access to the data conversion mechanisms, and requires that you know how you are going to perform the rendering.
Why would you want to do canonical decomposition explicitly in a
regular expression?
Let me give two examples.

First, why? If you use a regular expression to search for <e acute> [I use angle brackets <> to describe a single character for this e-mail], then logically you should find text containing <e><acute> as well as <e acute>, as these are both visually the same when displayed to the user.

Second, why do we need to know? If we decompose arbitrarily then we can cover over syntax errors and act unexpectedly. E.g. if we type [\x1<e acute>] and decompose it outside the syntactic context of the regular expression engine, we would get [\x1e<acute>]. This is NOT what we expect and is therefore an error, as we are now looking only for an acute, and the actual syntax error [incomplete hex number] is not reported.
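A small sketch of that failure mode, with a toy decompose() that hard-codes the one mapping needed (U+00E9 to U+0065 U+0301); it only illustrates the effect on the pattern text, not a real regex engine:

    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    // Toy decomposition covering only the one mapping needed here:
    // U+00E9 (e acute) -> U+0065 (e) + U+0301 (combining acute).
    std::vector<code_point> decompose(const std::vector<code_point>& in)
    {
        std::vector<code_point> out;
        for (code_point c : in)
        {
            if (c == 0x00E9) { out.push_back(0x0065); out.push_back(0x0301); }
            else             { out.push_back(c); }
        }
        return out;
    }

    int main()
    {
        // The pattern as typed: '[' '\' 'x' '1' <e acute> ']'
        std::vector<code_point> pattern = { '[', '\\', 'x', '1', 0x00E9, ']' };

        // Decomposing before the regex engine parses the pattern yields
        // '[' '\' 'x' '1' 'e' <combining acute> ']', so the engine now sees
        // a complete hex escape \x1e, the class contains only the combining
        // acute, and the original error (incomplete \x1) is silently hidden.
        std::vector<code_point> decomposed = decompose(pattern);
        (void)decomposed;
    }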
We will need to create a utility to take the 'raw'/published Unicode data files, along with user-defined private characters, to make these tables, which would then be used by the set of functions that we will agree on, such as isnumeric, ishangul, isstrongrtol, isrtol, etc.
I find the idea of users embedding private character properties
_within_ the standard Unicode tables, and building their own slightly
different version of the Unicode library, scary. Why is this needed?
It is important that the private use range, which is part of the Unicode spec, be handled consistently with the other Unicode ranges, otherwise we end up having to write everything twice! The private use range is in the Unicode spec specifically because it has been recognised that any complex Unicode system will need private use characters.

Classic examples are implementations that move special display characters into portions of the private use ranges to allow for optimal display of visible tabs, visible carriage returns, special characters like Thai word breaks, and of course completely non-standard characters like a button that can be embedded in text and would be entirely implementation specific. Having the breaking characteristics of these characters handled consistently with all Unicode characters is a massive simplification for coding.

I strongly believe that we must therefore allow each developer who wants to use the Unicode system the ability to add these private use character properties into their own main character tables, so they are handled consistently with all other characters, while acknowledging that they are implementation specific. This private use character data would NOT be published or distributed - the facility to merge them in during usage gives each developer a way to add their own private use data for their own system only.

Yours,

Graham Barnett BEng, MCSD/MCAD .Net, MCSE/MCSA 2003, CompTIA Sec+
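[A rough sketch of the table-building and private-use merging described in the message above; all names (char_properties, load_generated_table, load_private_use_overrides) are illustrative assumptions, and the loaders are stubs:]

    #include <cstdint>
    #include <map>

    typedef std::uint32_t code_point;

    struct char_properties
    {
        bool is_numeric;
        bool is_hangul;
        bool is_strong_rtl;
        // ... further properties generated from the Unicode data files ...
    };

    // Stub for the table generated from the published Unicode data files.
    std::map<code_point, char_properties> load_generated_table()       { return {}; }

    // Developer-supplied entries for the private use range; never published
    // or distributed, merged in only for that developer's own build.
    std::map<code_point, char_properties> load_private_use_overrides() { return {}; }

    std::map<code_point, char_properties> build_table()
    {
        std::map<code_point, char_properties> table = load_generated_table();
        for (const auto& entry : load_private_use_overrides())
            table[entry.first] = entry.second;   // PUA entries overlay the standard data
        return table;
    }

    // Queries such as isnumeric/ishangul/isstrongrtol then work identically
    // for standard and private use characters.
    bool is_numeric(const std::map<code_point, char_properties>& table, code_point c)
    {
        auto it = table.find(c);
        return it != table.end() && it->second.is_numeric;
    }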

Hi Graham,

On 7/25/05, Graham <Graham@system-development.co.uk> wrote:
[...] If we can agree on the interface/separation of Unicode character data from the string interfaces, then I believe that we will move forward quickly from there, as we are then 'just' talking about algorithm optimisation on a known data set to create the best possible string implementation.
OK, we agree on this; I was incorrectly lumping things together.
How have you hooked in dictionary word break support for languages like Thai?
IMO that would be beyond the scope of a general Unicode library.
It is both outside the scope and fundamental to the approach, as this case must be handled/provided for.
In my experience this is handled by the dictionary pass [outside the scope of this support] adding special break markers into the text [which need to be supported transparently as Unicode characters that happen to be in the private use range at this level] so that the text and string iterators can then be handled normally. The fact that the break markers are special characters in the private use range should not be relevant or special at this level.
You mean that we invent a set of private characters that the dictionary pass should use?
How far have you gone? Do you have support for going from logical to display on combined ltor and rtol? Customised glyph conversion for Indic Urdu?
Correct me if I'm wrong, but I think these issues become important only when rendering Unicode strings. Aren't they thus better handled by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost Unicode library should focus on processing symbolic Unicode strings and keep away from what happens when they are displayed, just like std::basic_string does.
Unfortunately I believe that there may be serious limitations in this approach.
I strongly believe that even if we do not actually write all the code, we must not be in a position where, for example, you have to use a Uniscribe library based on Unicode 4 and a Boost library based on Unicode 4.1. [This is even ignoring Uniscribe's 'custom' handling.]
We must provide a Unicode character system on which all libraries can operate consistently.
Even working out a grapheme break may require different sets of compromises that must work consistently for any set of inter-related libraries to be successful.
Do you have an example? I'm having trouble envisioning a situation in which libraries based on different Unicode versions actually cause conflicts.
As another example where display controls data organisation, what happens if you want to have a page of text display the same on several machines?
Can you elaborate? In what cases is this vital and how does display influence data organisation?
This is actually a very difficult thing to do, due to limitations in the Windows GDI scaling [which is not floating point but 'rounds' scaling calculations, and which can result in as much as a +/-10% difference in simple string lengths on different machines unless handled specifically - e.g. IIIIIIIIIIIIIIII can be the same length as WXWXWX on one machine but there can be a 20% difference on another]. It requires access to the data conversion mechanisms, and requires that you know how you are going to perform the rendering.
I fear I don't understand what you mean. It sounds to me like you're suggesting defining a new font format for the Boost Unicode library.
Why would you want to do canonical decomposition explicitly in a regular expression?
Let me give two examples:
First, why?
If you use a regular expression to search for <e acute> - [I use angle brackets <> to describe a single character for this e-mail] then logically you should find text containing:
And <acute> is a combining acute?
<e><acute> and <e acute> as these are both visually the same when displayed to the user.
Second, why do we need to know? If we decompose arbitrarily then we can cover over syntax errors and act unexpectedly: [...]
Yes, the Unicode library should by default process grapheme clusters rather than code points. This would automatically solve the regex issue.
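[A deliberately reduced sketch of cluster-based processing; is_combining() below covers only the combining diacritics block, whereas real boundaries follow UAX #29 and the generated property tables:]

    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    // Stand-in covering only the combining diacritical marks block.
    bool is_combining(code_point c)
    {
        return c >= 0x0300 && c <= 0x036F;
    }

    // Split a sequence of code points into (simplified) grapheme clusters:
    // a base character plus any combining marks that follow it. With this
    // view, 'e' + COMBINING ACUTE is handled as one unit, which is what lets
    // a regex or search treat <e><acute> and <e acute> equivalently.
    std::vector<std::vector<code_point>> clusters(const std::vector<code_point>& text)
    {
        std::vector<std::vector<code_point>> out;
        for (code_point c : text)
        {
            if (out.empty() || !is_combining(c))
                out.push_back({});           // start a new cluster at each base character
            out.back().push_back(c);
        }
        return out;
    }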
We will need to create a utility to take the 'raw'/published Unicode data files, along with user-defined private characters, to make these tables, which would then be used by the set of functions that we will agree on, such as isnumeric, ishangul, isstrongrtol, isrtol, etc.
I find the idea of users embedding private character properties _within_ the standard Unicode tables, and building their own slightly different version of the Unicode library, scary. Why is this needed?
It is important that the private use range, which is part of the Unicode spec, be handled consistently with the other Unicode ranges, otherwise we end up having to write everything twice!
The private use range is in the Unicode spec specifically as it has been recognised that any complex Unicode system will need private use characters.
Classic examples are implementations that move special display characters into portions of the private use ranges to allow for optimal display of visible tabs, visible cr, special characters like Thai word breaks, and of course completely non-standard characters like a button that can be embedded in text and would be entirely implementation specific. Having the breaking characteristics of these characters be handled consistently with all Unicode characters is a massive simplification for coding.
I strongly believe that we must therefore allow each developer who wants to use the Unicode system the ability to add these private use character properties into their own personal main character tables so they are handled consistently with all other characters, but acknowledge that these are implementation specific.
This private use character data would NOT be published or distributed - the facility to merge them in during usage allows each developer the access to add their own private use data for their own system only.
But surely this means every app would have to come with a different DLL? I'm not so sure about this. For many cases other markup (XML or something) would do. Maybe other people have opinions about this?

Regards,
Rogier