Re: [boost] Call for interest for native unicode character and string support in boost

Message: 14
Date: Sun, 24 Jul 2005 20:14:29 +0200
From: Erik Wien <wien@start.no>
Subject: Re: [boost] Call for interest for native unicode character and string support in boost
To: boost@lists.boost.org
Message-ID: <dc0lpu$ej0$1@sea.gmane.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Dear Erik,

I have done extensive Unicode work, and would be very interested to see what you have done. I apologise if I ask a lot of questions, but these are the questions that immediately spring to mind on any Unicode implementation.

How have you organised the Unicode 4.1 data? Optimising this is itself a project and I have gone through a number of iterations. Unfortunately I have no definitive answer, as you ultimately trade off speed against data size.

Have you allowed for updating against the current Unicode standard from the Unicode data files?

Have you made any trade-off of border [grapheme etc] detection for a simplification of character data?

Have you given access to the character properties?

Have you added Unicode sorting and done it in such a way as to not get a high performance hit, e.g. having a separate pair<sort data, string> class?

Have you stopped equivalence or equality on a Unicode string? What was your trade-off on canonical decomposition?

How have you hooked in dictionary word break support for languages like Thai, or have you just built in support for adding private characters so that you can support force word break, force no word break, and detected word break [to name just three necessary private characters - in this case for Thai]?

How far have you gone? Do you have support for going from logical to display on combined ltor and rtol? Customised glyph conversion for Indic Urdu?

Should these discussions be in a separate mailing group?

How can we ensure that other boost projects understand the implication of Unicode support and the subtle changes required, e.g. hooks to allow for canonical decomposition on string data portions of regular expressions in the regexpr project?

I am open to suggestions as to the best way to proceed, as I feel that there are so many factors that can be traded against each other that there must be some flexibility in the design to allow for speed or memory optimised designs, and this will need a lot of careful thought from different informed viewpoints.

My feeling is that the first step must be to agree the organisation of the data tables that are parsed from the Unicode data to allow for character tests, upper/lower case conversion, sort conversion etc. There must be agreement on how best to organise these for speed or size, and what character tests are required. Until we can perform tests on Unicode characters and have the basics, getting into Unicode strings is jumping the gun!

We will need to create a utility to take the 'raw'/published Unicode data files along with user defined private characters to make these tables, which would then be used by the set of functions that we will agree, such as isnumeric, ishangul, isstrongrtol, isrtol etc.

I think that this would allow us to progress in a way where we can build the foundations and then build everything else on top of them, and which would then allow the standard to be directly related to and updated from the Unicode standard itself.

What do you think?

Yours,
Graham Barnett
BEng, MCSD/ MCAD .Net, MCSE/ MCSA 2003, CompTIA Sec+
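To make the speed-versus-size trade-off concrete, here is a minimal sketch of one common way such tables can be organised: a two-stage lookup in which identical 256-entry blocks are stored only once. It is not taken from any existing Boost code. The names PropertyTable, make_sample_table, is_numeric and is_strong_rtl are hypothetical, the two property flags are an arbitrary illustration, and the input is a handful of records in UnicodeData.txt format rather than the full file (First/Last range records are not handled).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical property flags; a real table would carry many more properties.
constexpr std::uint8_t kNumeric = 1;    // general category Nd, Nl or No
constexpr std::uint8_t kStrongRtl = 2;  // bidi class R or AL

constexpr std::uint32_t kMaxCodePoint = 0x10FFFF;
constexpr std::uint32_t kBlock = 256;   // entries per second-stage block

class PropertyTable {
public:
    // Build from records in UnicodeData.txt format: semicolon-separated fields,
    // 0 = code point (hex), 2 = general category, 4 = bidi class.
    explicit PropertyTable(const std::vector<std::string>& lines) {
        std::vector<std::uint8_t> flat(kMaxCodePoint + 1);  // zero-initialised
        for (const std::string& line : lines) {
            const std::vector<std::string> f = split(line);
            if (f.size() < 5) continue;  // skip malformed records
            const std::uint32_t cp =
                static_cast<std::uint32_t>(std::stoul(f[0], nullptr, 16));
            if (cp > kMaxCodePoint) continue;
            std::uint8_t flags = 0;
            if (!f[2].empty() && f[2][0] == 'N') flags |= kNumeric;
            if (f[4] == "R" || f[4] == "AL") flags |= kStrongRtl;
            flat[cp] = flags;
        }
        // Compress: cut the flat array into blocks and share identical blocks.
        std::map<std::vector<std::uint8_t>, std::uint16_t> seen;
        for (std::uint32_t base = 0; base <= kMaxCodePoint; base += kBlock) {
            std::vector<std::uint8_t> block(flat.begin() + base,
                                            flat.begin() + base + kBlock);
            auto it = seen.find(block);
            if (it == seen.end()) {
                const std::uint16_t index = static_cast<std::uint16_t>(seen.size());
                it = seen.emplace(block, index).first;
                stage2_.insert(stage2_.end(), block.begin(), block.end());
            }
            stage1_.push_back(it->second);
        }
        assert(stage1_.size() == (kMaxCodePoint + 1) / kBlock);
    }

    // Two array reads per lookup: block index first, then the entry in the block.
    std::uint8_t flags(std::uint32_t cp) const {
        if (cp > kMaxCodePoint) return 0;
        return stage2_[stage1_[cp >> 8] * kBlock + (cp & 0xFF)];
    }

    // Number of distinct blocks actually stored.
    std::size_t block_count() const { return stage2_.size() / kBlock; }

private:
    // Split on ';' and keep empty fields between separators.
    static std::vector<std::string> split(const std::string& line) {
        std::vector<std::string> fields;
        std::istringstream in(line);
        std::string field;
        while (std::getline(in, field, ';')) fields.push_back(field);
        return fields;
    }

    std::vector<std::uint16_t> stage1_;  // one block index per 256 code points
    std::vector<std::uint8_t> stage2_;   // the deduplicated blocks, concatenated
};

bool is_numeric(const PropertyTable& t, std::uint32_t cp) {
    return (t.flags(cp) & kNumeric) != 0;
}

bool is_strong_rtl(const PropertyTable& t, std::uint32_t cp) {
    return (t.flags(cp) & kStrongRtl) != 0;
}

// A few sample records; a real generator utility would read the whole file.
PropertyTable make_sample_table() {
    return PropertyTable({
        "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;",
        "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
        "05D0;HEBREW LETTER ALEF;Lo;0;R;;;;;N;;;;;",
        "0627;ARABIC LETTER ALEF;Lo;0;AL;;;;;N;;;;;",
        "0E01;THAI CHARACTER KO KAI;Lo;0;L;;;;;N;;;;;",
    });
}
```

With a table built by make_sample_table(), is_numeric(table, 0x31) and is_strong_rtl(table, 0x05D0) would both be true. The block size is where speed trades against size in this sketch: smaller blocks tend to share better and shrink the second stage, at the cost of a larger first stage, while a flat array needs only one read but stores every code point.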

Dear Graham,

You seem to have a lot of valuable experience with Unicode. I feel that the interface is the more urgent matter for now. The implementation of the tables can always be changed as long as the interface remains stable.

Let me rephrase that: I think the issues of programmer's interface and data tables are orthogonal, and programmers will care most about the interface, and then about performance. This is not to say you don't give a whole lot of important and difficult issues that must be solved at some point.

As far as I can see, C++ provides the machinery to abstract away from code points, which gives an opportunity for hiding more of the complexity of Unicode than other programming languages. I would love to see a library which hides the issue of different normalisation forms from the programmer.

As for the scope of the library:
How have you hooked in dictionary word break support for languages like Thai
IMO that would be beyond the scope of a general Unicode library.
How far have you gone? Do you have support for going from logical to display on combined ltor and rtol? Customised glyph conversion for Indic Urdu?
Correct me if I'm wrong, but I think these issues become important only when rendering Unicode strings. Aren't they thus better handled by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost Unicode library should focus on processing symbolic Unicode strings and keep away from what happens when they are displayed, just like std::basic_string does.
How can we ensure that other boost projects understand the implication of Unicode support and the subtle changes required, e.g. hooks to allow for canonical decomposition on string data portions of regular expressions in the regexpr project?
As long as the Unicode string abstracts away from normalisation forms, level 2 Unicode support for regular expressions should basically come for free, I believe. In general, the Unicode library should incorporate as much Unicode-specific machinery as possible, leaving as little difficulty for other library authors as possible. Docs are important here. Why would you want to do canonical decomposition explicitly in a regular expression?
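The idea of a string that abstracts away from normalisation forms can be illustrated with a minimal sketch: two code point sequences are treated as equal when their canonical decompositions, after canonical reordering of combining marks, are identical. This is an illustration under stated assumptions and not a proposed interface. The function names toy_nfd and canonically_equal are hypothetical, the decomposition and combining class data cover only a few characters, and Hangul decomposition is left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Toy canonical decomposition data, for illustration only. A real library
// would generate this from the Unicode Character Database.
static const std::map<char32_t, std::u32string>& decompositions() {
    static const std::map<char32_t, std::u32string> table = {
        {U'\u00E9', U"e\u0301"},  // e with acute -> e + combining acute
        {U'\u00C5', U"A\u030A"},  // A with ring  -> A + combining ring above
        {U'\u212B', U"\u00C5"},   // ANGSTROM SIGN -> A with ring (singleton)
    };
    return table;
}

// Toy canonical combining classes; everything not listed is a starter (0).
static int combining_class(char32_t c) {
    switch (c) {
        case 0x0301: return 230;  // combining acute accent
        case 0x030A: return 230;  // combining ring above
        case 0x0323: return 220;  // combining dot below
        default:     return 0;
    }
}

// Recursively replace a code point by its full canonical decomposition.
static void decompose(char32_t c, std::u32string& out) {
    const auto it = decompositions().find(c);
    if (it == decompositions().end()) {
        out.push_back(c);
        return;
    }
    for (char32_t d : it->second) decompose(d, out);
}

// Decompose, then stable-sort each run of non-starters by combining class.
std::u32string toy_nfd(const std::u32string& s) {
    std::u32string out;
    for (char32_t c : s) decompose(c, out);
    auto i = out.begin();
    while (i != out.end()) {
        if (combining_class(*i) == 0) {
            ++i;
            continue;
        }
        auto j = i;
        while (j != out.end() && combining_class(*j) != 0) ++j;
        std::stable_sort(i, j, [](char32_t a, char32_t b) {
            return combining_class(a) < combining_class(b);
        });
        i = j;
    }
    return out;
}

// Equality that does not depend on which normalisation form the inputs use.
bool canonically_equal(const std::u32string& a, const std::u32string& b) {
    return toy_nfd(a) == toy_nfd(b);
}
```

Under these assumptions, canonically_equal(U"caf\u00E9", U"cafe\u0301") holds even though the two sequences differ code point by code point. A string class whose operator== behaved this way would let client code, such as a regular expression engine, compare text without choosing a normalisation form itself.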
We will need to create a utility to take the 'raw'/ published unicode data files along with user defined private characters to make these tables which would then be used by the set of functions that we will agree such as isnumeric, ishangul, isstrongrtol, isrtol etc.
I find the idea of users embedding private character properties _within_ the standard Unicode tables, and building their own slightly different version of the Unicode library, scary. Why is this needed?

Regards,
Rogier