Re: [boost] Call for interest for native unicode character and string support in boost

Dear Doug,

Thank you for your positive mail.
You seem to have a lot of valuable experience with Unicode.
I feel that the interface is the more urgent matter for now. The
implementation of the tables can always be changed as long as the
interface remains stable. Let me rephrase that: I think the issues of
programmer's interface and data tables are orthogonal, and programmers
will care most about the interface, and then about performance. This
is not to say you haven't raised a whole lot of important and difficult
issues that must be solved at some point.
I agree that there is a large degree of orthogonality between the character properties and the string handling, but there is also a fundamental and strong link. Until you have access to the character properties, or have at least defined what methods to access those properties will be available, you cannot support Unicode strings. A classic example is the fact that the lower case version of a single Unicode character may be a series of Unicode characters - without understanding this and coding a method to handle it, the string handling cannot be written.

I believe that we should agree on a Unicode character data specification that will supply the Unicode data, so that we can then independently look at the string interface. In my experience the string part can then be split off and handled completely independently, as you have correctly stated. Should these character properties be supplied using an interface, published method calls, etc.? What calls should there be?

If we can agree on the interface/separation of Unicode character data from the string interfaces, then I believe that we will move forward quickly from there, as we are then 'just' talking about algorithm optimisation on a known data set to create the best possible string implementation.
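For what it's worth, a minimal sketch of the kind of interface this implies, with purely illustrative names (code_point, to_lower_full) and a single hard-coded mapping standing in for the generated tables:

    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    // Full lowercase mapping: the result is a sequence, because a single
    // character may lower-case to more than one code point. Only one
    // illustrative entry is hard-coded; a real implementation would be
    // generated from the published Unicode data files.
    std::vector<code_point> to_lower_full(code_point c)
    {
        if (c == 0x0130)                  // LATIN CAPITAL LETTER I WITH DOT ABOVE
            return { 0x0069, 0x0307 };    // 'i' followed by COMBINING DOT ABOVE
        return { c };                     // identity fallback for this sketch
    }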
How have you hooked in dictionary word break support for languages like Thai?
IMO that would be beyond the scope of a general Unicode library.
It is both outside the scope and fundamental to the approach, as this case must be handled/provided for. In my experience this is handled by the dictionary pass [outside the scope of this support] adding special break markers into the text [which need to be supported transparently as Unicode characters that happen to be in the private use range at this level], so that the text and string iterators can then be handled normally. The fact that the break markers are special characters in the private use range should not be relevant or special at this level.
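A minimal sketch of that arrangement, assuming a hypothetical thai_word_breaks() dictionary pass and an arbitrarily chosen marker value U+E000; neither is part of any proposed interface:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    const code_point WORD_BREAK_MARKER = 0xE000;   // a private use code point

    // Stub standing in for a real dictionary-based Thai segmenter; it would
    // return the offsets at which words end.
    std::vector<std::size_t> thai_word_breaks(const std::vector<code_point>&)
    {
        return {};
    }

    // The dictionary pass inserts the marker into the text; from this point
    // on the marker is just another Unicode character, and break iterators
    // need no Thai-specific logic.
    std::vector<code_point> mark_word_breaks(const std::vector<code_point>& text)
    {
        std::vector<std::size_t> breaks = thai_word_breaks(text);
        std::vector<code_point> out;
        std::size_t next = 0;
        for (std::size_t i = 0; i < text.size(); ++i)
        {
            out.push_back(text[i]);
            if (next < breaks.size() && breaks[next] == i + 1)
            {
                out.push_back(WORD_BREAK_MARKER);
                ++next;
            }
        }
        return out;
    }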
How far have you gone? Do you have support for going from logical to display on combined ltor and rtol? Customised glyph conversion for Indic Urdu?
Correct me if I'm wrong, but I think these issues become important
only when rendering Unicode strings. Aren't they thus better handled
by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost
Unicode library should focus on processing symbolic Unicode strings
and keep away from what happens when they are displayed, just like
std::basic_string does.
Unfortunately I believe that there may be serious limitations in this approach. I strongly believe that even if we do not actually write all the code, we must not be in a position where, for example, you have to use a Uniscribe library based on Unicode 4 and a Boost library based on Unicode 4.1. [This is even ignoring Uniscribe's 'custom' handling.] We must provide a Unicode character system on which all libraries can operate consistently. Even working out a grapheme break may require different sets of compromises that must work consistently for any set of inter-related libraries to be successful.

As another example where display controls data organisation: what happens if you want to have a page of text display the same on several machines? This is actually a very difficult thing to do, due to limitations in the Windows GDI scaling [which is not floating point but 'rounds' scaling calculations, and which can result in as much as a +/-10% difference in simple string lengths on different machines unless handled specifically - e.g. IIIIIIIIIIIIIIII can be the same length as WXWXWX on one machine but there can be a 20% difference on another]. It requires access to the data conversion mechanisms, and requires that you know how you are going to perform the rendering.
Why would you want to do canonical decomposition explicitly in a
regular expression?
Let me give two examples.

First, why? If you use a regular expression to search for <e acute> [I use angle brackets <> to describe a single character for this e-mail], then logically you should find text containing <e><acute> as well as <e acute>, as these are both visually the same when displayed to the user.

Second, why do we need to know? If we decompose arbitrarily then we can cover over syntax errors and act unexpectedly. E.g. if we type [\x1<e acute>] and decompose it outside the syntactic context of the regular expression engine, we would get [\x1e<acute>]. This is NOT what we expect and is therefore an error, as we are now looking only for an acute, and the actual syntax error [incomplete hex number] is not reported.
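A small sketch of that failure mode, with a toy decompose() that hard-codes the one mapping needed (U+00E9 to U+0065 U+0301); it only illustrates the effect on the pattern text, not a real regex engine:

    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    // Toy decomposition covering only the one mapping needed here:
    // U+00E9 (e acute) -> U+0065 (e) + U+0301 (combining acute).
    std::vector<code_point> decompose(const std::vector<code_point>& in)
    {
        std::vector<code_point> out;
        for (code_point c : in)
        {
            if (c == 0x00E9) { out.push_back(0x0065); out.push_back(0x0301); }
            else             { out.push_back(c); }
        }
        return out;
    }

    int main()
    {
        // The pattern as typed: '[' '\' 'x' '1' <e acute> ']'
        std::vector<code_point> pattern = { '[', '\\', 'x', '1', 0x00E9, ']' };

        // Decomposing before the regex engine parses the pattern yields
        // '[' '\' 'x' '1' 'e' <combining acute> ']', so the engine now sees
        // a complete hex escape \x1e, the class contains only the combining
        // acute, and the original error (incomplete \x1) is silently hidden.
        std::vector<code_point> decomposed = decompose(pattern);
        (void)decomposed;
    }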
We will need to create a utility to take the 'raw'/published Unicode data files, along with user-defined private characters, to make these tables, which would then be used by the set of functions that we will agree on, such as isnumeric, ishangul, isstrongrtol, isrtol, etc.
I find the idea of users embedding private character properties
_within_ the standard Unicode tables, and building their own slightly
different version of the Unicode library, scary. Why is this needed?
It is important that the private use range, which is part of the Unicode spec, be handled consistently with the other Unicode ranges, otherwise we end up having to write everything twice! The private use range is in the Unicode spec specifically because it has been recognised that any complex Unicode system will need private use characters.

Classic examples are implementations that move special display characters into portions of the private use ranges to allow for optimal display of visible tabs, visible carriage returns, special characters like Thai word breaks, and of course completely non-standard characters like a button that can be embedded in text and would be entirely implementation specific. Having the breaking characteristics of these characters handled consistently with all Unicode characters is a massive simplification for coding.

I strongly believe that we must therefore allow each developer who wants to use the Unicode system the ability to add these private use character properties into their own main character tables, so they are handled consistently with all other characters, while acknowledging that they are implementation specific. This private use character data would NOT be published or distributed - the facility to merge them in during usage gives each developer a way to add their own private use data for their own system only.

Yours,

Graham Barnett BEng, MCSD/MCAD .Net, MCSE/MCSA 2003, CompTIA Sec+
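[A rough sketch of the table-building and private-use merging described in the message above; all names (char_properties, load_generated_table, load_private_use_overrides) are illustrative assumptions, and the loaders are stubs:]

    #include <cstdint>
    #include <map>

    typedef std::uint32_t code_point;

    struct char_properties
    {
        bool is_numeric;
        bool is_hangul;
        bool is_strong_rtl;
        // ... further properties generated from the Unicode data files ...
    };

    // Stub for the table generated from the published Unicode data files.
    std::map<code_point, char_properties> load_generated_table()       { return {}; }

    // Developer-supplied entries for the private use range; never published
    // or distributed, merged in only for that developer's own build.
    std::map<code_point, char_properties> load_private_use_overrides() { return {}; }

    std::map<code_point, char_properties> build_table()
    {
        std::map<code_point, char_properties> table = load_generated_table();
        for (const auto& entry : load_private_use_overrides())
            table[entry.first] = entry.second;   // PUA entries overlay the standard data
        return table;
    }

    // Queries such as isnumeric/ishangul/isstrongrtol then work identically
    // for standard and private use characters.
    bool is_numeric(const std::map<code_point, char_properties>& table, code_point c)
    {
        auto it = table.find(c);
        return it != table.end() && it->second.is_numeric;
    }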

Hi Graham,

On 7/25/05, Graham <Graham@system-development.co.uk> wrote:
[...] If we can agree on the interface/separation of Unicode character data from the string interfaces, then I believe that we will move forward quickly from there, as we are then 'just' talking about algorithm optimisation on a known data set to create the best possible string implementation.
OK, we agree on this; I was incorrectly lumping things together.
How have you hooked in dictionary word break support for languages like Thai?
IMO that would be beyond the scope of a general Unicode library.
It is both outside the scope and fundamental to the approach, as this case must be handled/provided for.
In my experience this is handled by the dictionary pass [outside the scope of this support] adding special break markers into the text [which need to be supported transparently as Unicode characters that happen to be in the private use range at this level] so that the text and string iterators can then be handled normally. The fact that the break markers are special characters in the private use range should not be relevant or special at this level.
You mean that we invent a set of private characters that the dictionary pass should use?
How far have you gone? Do you have support for going from logical to display on combined ltor and rtol? Customised glyph conversion for Indic Urdu?
Correct me if I'm wrong, but I think these issues become important only when rendering Unicode strings. Aren't they thus better handled by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost Unicode library should focus on processing symbolic Unicode strings and keep away from what happens when they are displayed, just like std::basic_string does.
Unfortunately I believe that there may be serious limitations in this approach.
I strongly believe that even if we do not actually write all the code, we must not be in a position where, for example, you have to use a Uniscribe library based on Unicode 4 and a Boost library based on Unicode 4.1. [This is even ignoring Uniscribe's 'custom' handling.]
We must provide a Unicode character system on which all libraries can operate consistently.
Even working out a grapheme break may require different sets of compromises that must work consistently for any set of inter-related libraries to be successful.
Do you have an example? I'm having trouble envisioning a situation in which libraries based on different Unicode versions actually cause conflicts.
As another example where display controls data organisation, what happens if you want to have a page of text display the same on several machines?
Can you elaborate? In what cases is this vital and how does display influence data organisation?
This is actually a very difficult thing to do, due to limitations in the Windows GDI scaling [which is not floating point but 'rounds' scaling calculations, and which can result in as much as a +/-10% difference in simple string lengths on different machines unless handled specifically - e.g. IIIIIIIIIIIIIIII can be the same length as WXWXWX on one machine but there can be a 20% difference on another]. It requires access to the data conversion mechanisms, and requires that you know how you are going to perform the rendering.
I fear I don't understand what you mean. It sounds to me like you're suggesting defining a new font format for the Boost Unicode library.
Why would you want to do canonical decomposition explicitly in a regular expression?
Let me give two examples:
First, why?
If you use a regular expression to search for <e acute> - [I use angle brackets <> to describe a single character for this e-mail] then logically you should find text containing:
And <acute> is a combining acute?
<e><acute> and <e acute> as these are both visually the same when displayed to the user.
Second, why do we need to know? If we decompose arbitrarily then we can cover over syntax errors and act unexpectedly: [...]
Yes, the Unicode library should by default process grapheme clusters rather than code points. This would automatically solve the regex issue.
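[A deliberately reduced sketch of cluster-based processing; is_combining() below covers only the combining diacritics block, whereas real boundaries follow UAX #29 and the generated property tables:]

    #include <cstdint>
    #include <vector>

    typedef std::uint32_t code_point;

    // Stand-in covering only the combining diacritical marks block.
    bool is_combining(code_point c)
    {
        return c >= 0x0300 && c <= 0x036F;
    }

    // Split a sequence of code points into (simplified) grapheme clusters:
    // a base character plus any combining marks that follow it. With this
    // view, 'e' + COMBINING ACUTE is handled as one unit, which is what lets
    // a regex or search treat <e><acute> and <e acute> equivalently.
    std::vector<std::vector<code_point>> clusters(const std::vector<code_point>& text)
    {
        std::vector<std::vector<code_point>> out;
        for (code_point c : text)
        {
            if (out.empty() || !is_combining(c))
                out.push_back({});           // start a new cluster at each base character
            out.back().push_back(c);
        }
        return out;
    }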
We will need to create a utility to take the 'raw'/published Unicode data files, along with user-defined private characters, to make these tables, which would then be used by the set of functions that we will agree on, such as isnumeric, ishangul, isstrongrtol, isrtol, etc.
I find the idea of users embedding private character properties _within_ the standard Unicode tables, and building their own slightly different version of the Unicode library, scary. Why is this needed?
It is important that the private use range, which is part of the Unicode spec, be handled consistently with the other Unicode ranges, otherwise we end up having to write everything twice!
The private use range is in the Unicode spec specifically as it has been recognised that any complex Unicode system will need private use characters.
Classic examples are implementations that move special display characters into portions of the private use ranges to allow for optimal display of visible tabs, visible cr, special characters like Thai word breaks, and of course completely non-standard characters like a button that can be embedded in text and would be entirely implementation specific. Having the breaking characteristics of these characters be handled consistently with all Unicode characters is a massive simplification for coding.
I strongly believe that we must therefore allow each developer who wants to use the Unicode system the ability to add these private use character properties into their own personal main character tables so they are handled consistently with all other characters, but acknowledge that these are implementation specific.
This private use character data would NOT be published or distributed - the facility to merge them in during usage allows each developer the access to add their own private use data for their own system only.
But surely this means every app would have to come with a different DLL? I'm not so sure about this. For many cases other markup (XML or something) would do. Maybe other people have opinions about this?

Regards,
Rogier