Re: [boost] Call for interest for native unicode character and string support in boost

Date: Thu, 28 Jul 2005 06:24:52 +0200
From: Rogier van Dalen <rogiervd@gmail.com>
Hi Graham,
In my experience this is handled by the dictionary pass [outside the
scope of this support] adding special break markers into the text [which
need to be supported transparently as Unicode characters that happen to
be in the private use range at this level] so that the text and string
iterators can then be handled normally. The fact that the break markers
are special characters in the private use range should not be relevant
or special at this level.
You mean that we invent a set of private characters that the
dictionary pass should use?
Actually, individual developers are free to use their own private-use range as they see fit; these characters are just one example of the use of non-standard private characters. 'We' as in Boost do not invent them: any developer can add such characters to their own code to handle Thai, provided private character definition on a per-developer basis is supported.
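As a sketch of the idea (the marker value U+E000 and the splitting helper below are my own illustration, not part of any proposal in this thread), a dictionary pass could mark word breaks with a private-use code point, and later stages would treat that marker as ordinary character data:

```cpp
#include <string>
#include <vector>

// Hypothetical private-use code point chosen by one developer as a
// word-break marker; any value in U+E000..U+F8FF would do.
constexpr char32_t kWordBreak = U'\uE000';

// Split a code-point string on the private-use break marker, dropping
// the markers themselves. The rest of the pipeline never needs to know
// the marker is "special" -- it is just another Unicode character.
std::vector<std::u32string> split_on_breaks(const std::u32string& text) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t c : text) {
        if (c == kWordBreak) {
            if (!current.empty()) words.push_back(current);
            current.clear();
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

The point is that the iterator machinery stays generic: only the dictionary pass knows what U+E000 means.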
We must provide a Unicode character system on which all libraries can
operate consistently.
Even working out a grapheme break may require different sets of
compromises that must work consistently for any set of inter-related
libraries to be successful.
Do you have an example? I'm having trouble envisioning a situation in
which libraries based on different Unicode versions actually cause
conflicts.
I enclose a few of the Unicode 4.0.1 to 4.1 differences as an example:

Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
* Addition of 1273 new characters to the standard, including those to complete roundtrip mapping of the HKSCS and GB 18030 standards, five new currency signs, some characters for Indic and Korean, and eight new scripts. (The exact list of additions can be seen in DerivedAge.txt, in the age=4.1 section.)
* Change in the end of the CJK Unified Ideographs range from U+9FA5 to U+9FBB, with the addition of some Han characters. The boundaries of such ranges are sometimes hardcoded in software, in which case the hardcoded value needs to be changed.

This is actually a very difficult thing to do due to limitations in the
Windows GDI scaling [which is not floating point but 'rounds' scaling
calculations, and which can result in as much as a +/-10% difference in
simple string lengths on different machines, unless handled specifically
e.g. IIIIIIIIIIIIIIII can be the same length as WXWXWX on one machine
but there can be a 20% difference on another machine] and requires
access to the data conversion mechanisms and requires that you know how
you are going to perform the rendering.
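The rounding effect can be sketched numerically. The glyph width, count and scale below are invented values; the point is only that per-glyph rounding to whole device units (as GDI does) accumulates, so the summed string length drifts away from the exact scaled length:

```cpp
// Sum per-glyph advance widths that have each been rounded to a whole
// pixel, as a GDI-style renderer does. Every glyph contributes up to
// half a pixel of error, and the errors accumulate over the string.
int rounded_length(double glyph_width, int count, double scale) {
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += static_cast<int>(glyph_width * scale + 0.5);  // per-glyph rounding
    return total;
}

// The exact scaled length, with no intermediate rounding.
double exact_length(double glyph_width, int count, double scale) {
    return glyph_width * count * scale;
}
```

With a 3.3-unit glyph repeated 16 times at scale 1.0, the rounded sum is 48 units while the exact length is 52.8 -- roughly the 10% divergence described above, and the divergence changes with the scale factor.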
I fear I don't understand what you mean. It sounds to me like you're
suggesting defining a new font format for the Boost Unicode library.
Please accept that there are fundamental issues with the way scaling happens on the Windows platform: anybody who wants to ensure that a document shows the same number of lines regardless of the 'zoom' setting used to display it, so that the lines appear to have the same relative lengths at any zoom setting, has to work very hard.

This might be an example of where a particular implementation might store an array of wrapped lines rather than an array of characters. By moving the Unicode-specific rules out of, and separate from, the storage, the storage method becomes an implementation issue. There can then be many different storage methods based on the same Unicode rules, e.g. implementations based on vector<char32_t> and vector<char16_t>, to give two examples.
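To illustrate the "same rules, different storage" point, here is a minimal sketch (the names are mine, and the surrogate test is the only Unicode rule shown) of one operation written once and applied to both vector<char32_t> and vector<char16_t> storage:

```cpp
#include <cstdint>
#include <vector>

// In well-formed UTF-16 a low surrogate is always the trailing half of
// a pair; in UTF-32 no valid unit is a surrogate at all.
inline bool is_low_surrogate(std::uint32_t u) {
    return u >= 0xDC00 && u <= 0xDFFF;
}

// One Unicode rule (counting code points), written once over any code
// unit type the storage chooses. Counting every unit that is not a low
// surrogate gives the code point count for both UTF-16 and UTF-32.
template <typename CodeUnit>
std::size_t count_code_points(const std::vector<CodeUnit>& storage) {
    std::size_t n = 0;
    for (CodeUnit u : storage)
        if (!is_low_surrogate(static_cast<std::uint32_t>(u))) ++n;
    return n;
}
```

For example, U+1F600 is one unit in UTF-32 storage but a surrogate pair (0xD83D 0xDE00) in UTF-16 storage, yet both count as a single code point.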
Why would you want to do canonical decomposition explicitly in a
regular expression?
Let me give two examples:
First why?
If you use a regular expression to search for <e acute> - [I use angle
brackets <> to describe a single character for this e-mail] then
logically you should find text containing:
And <acute> is a combining acute?
Yes
<e><acute> and <e acute> as these are both visually the same when
displayed to the user.
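A minimal sketch of this search case, with a toy decomposition table covering only <e acute> (a real library would draw on the full Unicode decomposition data):

```cpp
#include <string>

// Toy canonical decomposition: maps U+00E9 <e acute> to <e><combining
// acute>. A real implementation would use the UnicodeData.txt mappings.
std::u32string decompose(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == U'\u00E9') {
            out.push_back(U'e');
            out.push_back(U'\u0301');  // combining acute accent
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Find a pattern regardless of which canonically equivalent form the
// text uses: decompose both sides, then do a plain substring search.
bool equivalent_find(const std::u32string& text, const std::u32string& pattern) {
    return decompose(text).find(decompose(pattern)) != std::u32string::npos;
}
```

Searching for <e acute> then matches both "café" stored precomposed and "cafe" followed by a combining acute, which is exactly the behaviour a user would expect.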
Second why do we need to know? If we decompose arbitrarily then we can
cover over syntax errors and act unexpectedly:
[...]
Yes, the Unicode library should by default process grapheme clusters
rather than code points. This would automatically solve the regex
issue.
I would envision that the string storage classes would support UTF-16, UTF-32 and grapheme-based iterators. It is important in some circumstances to be able to process UTF-32 characters rather than graphemes - for example when drawing the string! As graphemes can be several UTF-32 characters long and require calculation to determine their data length, it is often not practical to make them the default processing scheme.

This private use character data would NOT be published or distributed - the facility to merge them in during usage allows each developer the access to add their own private use data for their own system only.
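One possible shape for a string class offering both views, sketched with a deliberately simplified grapheme rule (only the U+0300..U+036F combining block is recognised; real cluster boundaries follow UAX #29), and with a class name that is my own invention:

```cpp
#include <string>
#include <vector>

// Sketch of a string that stores UTF-32 but can hand out grapheme
// clusters on request, so callers choose the view they need.
class unicode_string {
public:
    explicit unicode_string(std::u32string s) : data_(std::move(s)) {}

    // Cheap view: raw code points, e.g. for feeding a renderer.
    const std::u32string& code_points() const { return data_; }

    // Costlier view: group each base character with its trailing
    // combining marks. This requires a pass over the data, which is
    // why graphemes may not suit the default processing scheme.
    std::vector<std::u32string> graphemes() const {
        std::vector<std::u32string> out;
        for (char32_t c : data_) {
            bool combining = (c >= 0x0300 && c <= 0x036F);
            if (combining && !out.empty())
                out.back().push_back(c);
            else
                out.push_back(std::u32string(1, c));
        }
        return out;
    }

private:
    std::u32string data_;
};
```

So <e><combining acute><x> is three code points but two graphemes, and the caller pays the grouping cost only when asking for the grapheme view.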
But surely this means every app would have to come with a different DLL?
I'm not so sure about this. For many cases other markup (XML or
something) would do. Maybe other people have opinions about this?
I propose making it so that the Unicode data could be placed in either a DLL or an application. The data and Boost functions would be supplied as a number of .h and .cpp files that can be compiled into either. The functions would be in supplied .h and .cpp files, and the data .h and .cpp files would be generated once only by a developer using a stand-alone command-line application.

Access to the Unicode data would ONLY be through the set of core functions, or function wrappers around the core functions, as outlined in my previous attachment. These core functions would be accessed by function pointers, which can therefore be passed across DLLs even when the DLLs are written by third parties. The DLLs could then ask for the Unicode function pointers if required during initialisation, so that first- and third-party DLLs could work with the same data.

Yours,
Graham
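The function-pointer arrangement described here might look roughly like the following sketch; the struct layout and function names are hypothetical stand-ins, and the implementations are toy versions of what real Unicode data lookups would do:

```cpp
#include <cstdint>

// Hypothetical function table handed across a DLL boundary so that
// first- and third-party modules query the same Unicode data. Only the
// table-of-pointers idea matters; the lookups below are stand-ins.
struct unicode_api {
    bool (*is_private_use)(std::uint32_t cp);
    std::uint32_t (*to_upper)(std::uint32_t cp);
};

static bool impl_is_private_use(std::uint32_t cp) {
    return cp >= 0xE000 && cp <= 0xF8FF;  // BMP private-use area only
}

static std::uint32_t impl_to_upper(std::uint32_t cp) {
    return (cp >= 'a' && cp <= 'z') ? cp - 0x20 : cp;  // ASCII stand-in
}

// The host fills the table once at initialisation; each DLL receives
// the table instead of linking the Unicode data directly, so all
// modules see identical behaviour.
inline unicode_api make_unicode_api() {
    return unicode_api{&impl_is_private_use, &impl_to_upper};
}
```

Because the table is plain data, it can be passed to DLLs built by third parties without those DLLs linking against the data-carrying module at all.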

Hi,

On 7/28/05, Graham <Graham@system-development.co.uk> wrote:
... Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
* Addition of 1273 new characters to the standard, including those to complete roundtrip mapping of the HKSCS and GB 18030 standards, five new currency signs, some characters for Indic and Korean, and eight new scripts. (The exact list of additions can be seen in DerivedAge.txt, in the age=4.1 section.)
* Change in the end of the CJK Unified Ideographs range from U+9FA5 to U+9FBB, with the addition of some Han characters. The boundaries of such ranges are sometimes hardcoded in software, in which case the hardcoded value needs to be changed.
I see. Things like these may indeed be problematic when, say, the operating system uses another version than the library. However, do we have any alternative but to accept that this may lead to problems?
...
I fear I don't understand what you mean. It sounds to me like you're suggesting defining a new font format for the Boost Unicode library.
Please accept that there are fundamental issues with the way scaling happens on the Windows platform: anybody who wants to ensure that a document shows the same number of lines regardless of the 'zoom' setting used to display it, so that the lines appear to have the same relative lengths at any zoom setting, has to work very hard.
Having done a fair share of font development, I know. However, this is not a Windows-specific problem. Computer screens' resolutions just aren't high enough to both display readable text and position characters at unrounded positions. I'm not sure there is much to be done about it in a Unicode implementation, though.
This might be an example of where a particular implementation might store an array of wrapped lines rather than an array of characters.
By moving the Unicode specific rules out of, and separate to, the storage, the storage method becomes an implementation issue.
There can then be many different storage methods based on the same Unicode rules, e.g. implementations based on vector<char32_t> and vector<char16_t> to give two examples.
This sounds reasonable, but I'm having trouble seeing what you mean to suggest. Is it that you'd want to give users the choice of an encoding?
I would envision that the string storage classes would support UTF-16, UTF-32 and grapheme-based iterators. It is important in some circumstances to be able to process UTF-32 characters rather than graphemes - for example when drawing the string!
As graphemes can be several UTF-32 characters long and require calculation to determine their data length, it is often not practical to make them the default processing scheme.
Have you read the discussion on this list some months ago? I suggested a grapheme-based string containing a codepoint string. A codepoints() method should give the user the codepoint string, so that it can be used for display, other I/O, or whatever. It's just that you don't want to force people to deal with normalisation forms and whatnot if it can be helped. Equivalent sequences should work the same as far as the user is concerned, independent of normalisation form or encoding.

If you really want to process UTF-32 codepoints, or UTF-8 bytes for that matter, you should be able to get at them, but how many people do you think would need that? A very small fraction, I think. Thus, the default interface should use graphemes.
This private use character data would NOT be published or distributed
the facility to merge them in during usage allows each developer the access to add their own private use data for their own system only.
But surely this means every app would have to come with a different DLL? I'm not so sure about this. For many cases other markup (XML or something) would do. Maybe other people have opinions about this?
I propose making it so that the Unicode data could be either placed in a DLL or an application. ...
OK, this may be possible. Thinking from a possible standardisation point of view, however, this would seem quite impossible. Especially since the fraction of people wanting to introduce their own characters will be small, I think it would be best to focus on what the interface should look like first, and then to see how extra characters can be added.

(I'll reply to your new header mail in a minute.)

Regards,
Rogier
participants (2)
-
Graham
-
Rogier van Dalen