Re: [boost] Call for interest for native unicode character and string support in boost - updated definition

From: Rogier van Dalen <rogiervd@gmail.com>
Subject: Re: [boost] Call for interest for native unicode character and string support in boost - updated definition
Hi Graham,
I fear we might be heading off in a new direction here. I think enabling third parties to add codepoints,
if desirable at all (I still have my sincere doubts) should not be a primary concern.
I propose we focus on the interface first, and leave out the implementation bits.
I have been working with Unicode for many years and cannot envisage any serious Unicode implementation where private characters are not required. I have given examples: private characters like buttons, wrapping markers for Thai, streaming format changes, etc. Each developer will need to be able to customise it - after all, how can you support Unicode but decide to ignore private characters that are part of the Unicode spec?
Discussing the implementation, I'm afraid, will just make the discussion unclear, as I'll show by pointing out some things once.
- as for the "functions" struct, do you realise that C++ has a built-in feature called "virtual functions" to do things like this?
Interesting you raise this point. I started off with an interface but hit a major problem - performance. I tried wrapping 100,000 words and the interface took a hell of a hit due to the way in which the compiled code takes the interface and then offsets to the member to call a function on the interface. So I looked at being able to access the function on the interface directly - this proved to be impossible via a static function, as you cannot have statics being implemented on an interface, and it is no better if you wrap an interface call using mem_fun. Then I considered how I would modify one function, like sort, without ANY performance hit - and ended up with what I had. How the hell can you do it more neatly without a performance hit? I am open to suggestions. Do not forget we must be able to optimise this for times when this interface will be hit millions of times per second - and yes, this is something I have done a number of times in real life, e.g. formatting of live data feeds - and this must be a primary concern in the design.
- Have you realised what happens when deque<>::iterators, or any other iterators to containers with non-contiguous elements, are fed into the get_uppercase method with the current implementation? What about iterators that process UTF-8 and pretend it to be a UTF-32 sequence?
The implementation will work on any non-contiguous iterator as written; after all, the single iterator version increments/decrements the iterator and then dereferences it. Of course, from bitter experience I can tell you that working with a single iterator and not caching the values is a massive performance hit.
- Do you realise that inline functions are not macros and thus need no backslashes at the end of lines?
Sorry - I went through a number of iterations to try and get the code you saw - in one of the iterations I tried to inline the functions.
The implementation won't help discussing the interface, so I think we'd better leave them out for now
I am actually finding that the snapshots I have included are important as they can help to reveal why the implementation might not work and what we can do about it.
BTW, now looking at the code: do you fully realise how iterators work?
It seems to me that
StartOfGrapheme(functions* pFns, inputIterator chPrev, inputIterator ch, inputIterator chNext) is not really needed,
because it is quite easy to find an iterator to the next and previous position given the current one.
That is correct(ish) - but processor intensive - I have had real problems when you process a document from a data feed and wrap it for storage in a database. As an example on a ".4 Gig Pentium 4 I have seen this take up to six seconds [depending on the text - and we were handling some large text documents] when fully optimised - and that required a lot of profiling to get the best performance. Wrapping a full document for the first time requires a lot of processing and you cannot just increment and decrement iterators - as I said in my previous e-mail [the one today, after the e-mail to which you have replied] the only way of doing this with reasonable performance is to have three iterators that get shuffled along. Remember these are really low level functions and get hit very heavily and often.
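A minimal sketch of the "three iterators that get shuffled along" idea, for readers following the argument. This is not the code from the posted header: count_boundaries and starts_run are hypothetical names, and the boundary test is a trivial stand-in (a position counts as a boundary when it has no predecessor or differs from it), not a real grapheme rule. It assumes forward (multi-pass) iterators, and uses `last` as the "no neighbour" value for prev and next.

```cpp
#include <cstddef>
#include <iterator>

// Hypothetical stand-in for a boundary test such as StartOfGrapheme:
// a position is a boundary if it has no predecessor or differs from it.
// This is NOT a Unicode grapheme rule, only something simple to call.
struct starts_run {
    template <typename It>
    bool operator()(It prev, It cur, It /*next*/, It last) const
    {
        return prev == last || *prev != *cur;
    }
};

// Walks [first, last) once, keeping prev/cur/next in step so the
// boundary test never has to re-derive its neighbours.
template <typename ForwardIt, typename BoundaryTest>
std::size_t count_boundaries(ForwardIt first, ForwardIt last,
                             BoundaryTest is_boundary)
{
    if (first == last)
        return 0;

    std::size_t count = 0;
    ForwardIt prev = last;              // "no previous element" marker
    ForwardIt cur  = first;
    ForwardIt next = std::next(first);  // may already equal last

    while (cur != last) {
        if (is_boundary(prev, cur, next, last))
            ++count;
        // Shuffle the three iterators along by one position.
        prev = cur;
        cur  = next;
        if (next != last)
            ++next;
    }
    return count;
}
```

Each iterator is incremented once per element, so the cost per position is one increment plus the test itself, which is the property being argued for above.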
Please find attached a modified version of the header. Changes are:
- It now looks like C++: superfluous semicolons are deleted and identifiers are lowercase with underscores;
This will mean altering the automatic conversion of the Unicode names into this form of C++ names, but this should not be a problem: it just means converting the upper case to lower case and replacing spaces and - with _.
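The conversion described here could look roughly like the following sketch. The function name to_identifier is made up for illustration and is not part of either posted header; it assumes the input is a Unicode character name consisting of upper-case letters, digits, spaces and hyphens.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: turns a Unicode character name such as
// "LATIN SMALL LETTER A" into a lowercase identifier with underscores.
std::string to_identifier(const std::string& unicode_name)
{
    std::string result;
    result.reserve(unicode_name.size());
    for (unsigned char c : unicode_name) {
        if (c == ' ' || c == '-')
            result += '_';                                 // separators become '_'
        else
            result += static_cast<char>(std::tolower(c));  // letters are lowercased
    }
    return result;
}
```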
- the last "uni" prefixes have been removed;
- it does not show the implementation any more;
Unfortunately you seem to have fallen into the great trap of iterators. The code can no longer be used to/from third party DLLs, please see my post of earlier today. Iterators do not work unless you have all the source code - which means that third party DLLs containing custom controls will not work with your changes. This would be a major step backwards, as grid controls are a major business and many are shipped without source, just as DLLs with a defined interface. You cannot use an iterator across such a boundary - please see my previous code for an example of how I have ensured that it WILL work with third party DLLs such as grid controls without every DLL from a different company having to have its own Unicode data. The implementation shown demonstrated how to do this and was included for that reason.
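For readers without the earlier post to hand, the general shape of an interface that avoids templates and iterators at a binary boundary is sketched below. This is an assumption about the approach, not the code from that post: unicode_functions, get_unicode_functions, ascii_to_upper and uppercase_in_place are all hypothetical names, and the upper-casing covers ASCII only as a placeholder for real Unicode data.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical table of plain function pointers. Only built-in types
// cross the boundary, so the caller needs neither the source nor the
// Unicode data of the module that fills in the table.
struct unicode_functions {
    std::uint32_t (*to_upper)(std::uint32_t code_point);
};

// Placeholder implementation: ASCII only, standing in for real data.
static std::uint32_t ascii_to_upper(std::uint32_t code_point)
{
    if (code_point >= 0x61u && code_point <= 0x7Au)  // 'a'..'z'
        return code_point - 0x20u;
    return code_point;
}

// What the module owning the Unicode data would export.
const unicode_functions* get_unicode_functions()
{
    static const unicode_functions table = { &ascii_to_upper };
    return &table;
}

// What a third party module would do: work on a pointer plus a length
// and call through the table, with no iterator types in the signature.
void uppercase_in_place(const unicode_functions* fns,
                        std::uint32_t* text, std::size_t length)
{
    for (std::size_t i = 0; i < length; ++i)
        text[i] = fns->to_upper(text[i]);
}
```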
- I corrected some spelling errors, possibly introduced new ones (seperable is spelled sepArable, isn't it? I'm not 100% sure);
- a get_category() function is defined to get the general General Category (i.e. letter, mark, number, etc.)
I have just realised that all the category enums need to be placed in a single enum, as they are in fact in a single data space, so your get_category becomes the single category call.
- I deleted the page0() function because I didn't see why it should be there (feel free to move it back in if I missed something).
OK
- I provided only iterator-based grapheme, word and sentence skipping functions;
This will not work across third party code when you don't have the source. It means you can't have any more grid control DLLs etc. in Unicode unless they contain all the Unicode data separately, or you insist that all suppliers have to sell their DLLs as source code.
- I provided a locale with a few lines sketching an idea of what a collation object should look like.
Did you realise that string comparison should likely be passed to a function or container as a comparison object? For example, my specification would allow:
// Some container c of strings
// Some string s
std::lower_bound (c.begin(), c.end(), s, unicode::default_locale().collate_accents());
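To make the comparison-object idea concrete, here is a small self-contained sketch. It does not implement collate_accents() or any real collation: codepoint_less is a hypothetical stand-in that simply compares code point by code point, and insert_position is an illustrative helper. The comparator is written against any two ranges with begin/end, so it is not tied to one string container.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for a collation object: orders two code point
// ranges lexicographically by code point value. A real collation object
// would apply locale-specific rules instead.
struct codepoint_less {
    template <typename A, typename B>
    bool operator()(const A& a, const B& b) const
    {
        return std::lexicographical_compare(std::begin(a), std::end(a),
                                            std::begin(b), std::end(b));
    }
};

// Returns the index at which s would be inserted into a vector that is
// already sorted with codepoint_less.
std::size_t insert_position(const std::vector<std::u32string>& sorted,
                            const std::u32string& s)
{
    return static_cast<std::size_t>(
        std::lower_bound(sorted.begin(), sorted.end(), s, codepoint_less())
        - sorted.begin());
}
```

Because the algorithm only sees a callable, swapping in a locale-aware comparison would not change the calling code.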
Any comments are, of course, most welcome!
We need to make a collation object that is container independent, as there are going to be several Unicode containers - not just one. We must therefore separate collation from container. It must work with third party DLLs where you don't have the source code, so iterators are out.
The collate_accents call is strange - I don't understand why you don't just pass a locale in. For example, some languages sort <ae> differently and this is not an accent. I would therefore just pass in a uint32_t for the locale.
Any thoughts?
Yours,
Graham

"Graham" <Graham@system-development.co.uk> writes:
- Have you realised what happens when deque<>::iterators, or any other iterators to containers with non-contiguous elements, are fed into the get_uppercase method with the current implementation? What about iterators that process UTF-8 and pretend it to be a UTF-32 sequence?
The implementation will work on any non-contiguous iterator as written, after all the single iterator version increments/ decrements the iterator then de-refs it.
The use of "-" on only the first line of a quotation is nonstandard, and makes it hard for both people and automated tools to understand the thread. Remember that your posts are saved for posterity, and especially with a topic as important as Unicode support, it is important to be able to easily review the discussion history for rationales.
http://www.boost.org/more/discussion_policy.htm#quoting
Thank you,
--
Dave Abrahams
Boost Consulting
www.boost-consulting.com