Re: [boost] GSoC Unicode library: second preview

Hello,
Here is the documentation for the current state of the Unicode library that I am writing as a Google Summer of Code project: http://blogloufoque.free.fr/unicode/doc/html/ [snip]
Where is the source code? .... Some notes:
UTF-16 ... This is the recommended encoding for dealing with Unicode internally for general purposes
To be honest, it is the most error-prone encoding to work with for Unicode:

1. It is a variable-length encoding.
2. Surrogate characters are quite rare, so bugs related to them are very hard to find.

UTF-16 was mostly born as a "mistake" at the beginning of Unicode, when it was believed that 16 bits were enough for a single code point. Many software platforms adopted a 16-bit encoding that supported only the BMP. As a result, you can **easily** find a **huge** amount of bugs in code that uses UTF-16, and in most cases such bugs are hard to track down because these code points are rare. For example, try to edit a file name in Windows containing a character that is not in the BMP: you will see that you need to press "delete" twice. Try to write such a character in a Qt3 application: it simply does not work. There are many examples like this. So I would beware of recommending this encoding as the internal encoding just because many platforms use it.
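As an illustrative aside (plain C++11, not specific to any of the libraries discussed here), the "press delete twice" effect comes straight from the code unit count of a non-BMP code point:

    #include <iostream>
    #include <string>

    int main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF, a code point outside the BMP,
        // is encoded in UTF-16 as a surrogate pair (two code units).
        std::u16string s = u"\U0001D11E";

        std::cout << "code points: 1, UTF-16 code units: " << s.size() << '\n'; // prints 2

        // A naive "delete one character" that removes a single code unit
        // leaves an unpaired high surrogate behind.
        s.pop_back();
        std::cout << "code units left after one deletion: " << s.size() << '\n'; // prints 1
    }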
UTF-32 ... This encoding isn't really recommended
As I mentioned above, that is not quite true; it is a much safer encoding to work with, so I would recommend not writing such "suggestions".

More notes:
-----------

- For boundary checks I'd suggest an ICU- or Qt4-like API: iterate over the string and return the next boundary each time. Don't check whether there is a boundary at a specific position. (See the sketch below.)
- Examples and more description are required.

Artyom
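As a sketch of the API style meant above, here is roughly what that iteration looks like with plain ICU's BreakIterator (ICU only; not any proposed Boost interface):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <iostream>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;

        // "e" + COMBINING ACUTE ACCENT + "o": two grapheme clusters, three code points.
        const UChar raw[] = { 0x65, 0x0301, 0x6F, 0 };
        icu::UnicodeString text(raw);

        icu::BreakIterator *it =
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status);
        it->setText(text);

        // Each call to next() returns the position of the next boundary,
        // or BreakIterator::DONE once the end of the text is reached.
        for(int32_t pos = it->first(); pos != icu::BreakIterator::DONE; pos = it->next())
            std::cout << pos << ' ';     // prints: 0 2 3
        std::cout << std::endl;

        delete it;
    }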

2009/6/20 Artyom <artyomtnk@yahoo.com>
UTF-16 ... This is the recommended encoding for dealing with Unicode internally for general purposes
To be honest, it is the most error-prone encoding to work with for Unicode:
Amen. Really, I don't see why people don't just use UTF-8 all over the place. Even UTF-32 isn't as convenient as most would like, since you still have combining code points and other similar complications. As a programmer, what I really care about is usually some nebulous concept of "characters", and one character can easily be 3 code points or 1/3 of a code point.

It feels like the only way to get Unicode string handling right (at the application level, not the library or rendering levels) is to deal entirely in strings and regexes. Suppose I have "difficult" with the "ffi" ligature code point, and I do a Perl-style split on /i/. I should probably be getting "d", the "ff" ligature code point, and "cult". I know that if I tried to code that by hand in every application, I'd miss all kinds of evil corner cases like that.

Scott McMurray wrote:
Suppose I have "difficult" with the "ffi" ligature code point, and I do a Perl-style split on /i/.
There is no way for "i" to match as part of that string unless you replace the "ffi" ligature by the letters "f", "f", "i". That operation is known as a compatibility decomposition (and will be provided by the library in due time, of course, along with compatibility composition, canonical decomposition, canonical composition, and the normalization forms that are defined in terms of them).

You could choose to apply split with arguments normalized according to Normalization Form KC, which allows comparison independently of formatting considerations. But that also means 5 will match ⁵. You could instead decide that 5 should match ⁵ but ⁵ should not match 5, in which case the pattern should be in NFC but the string being searched should be in NFKC.
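To make the NFC/NFKC distinction concrete, here is a small sketch using ICU's classic Normalizer (ICU again, not the API of the library under review):

    #include <unicode/normlzr.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <string>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;

        // "di" + U+FB03 LATIN SMALL LIGATURE FFI + "cult"
        const UChar raw[] = { 0x64, 0x69, 0xFB03, 0x63, 0x75, 0x6C, 0x74, 0 };
        icu::UnicodeString source(raw);

        icu::UnicodeString nfc, nfkc;
        icu::Normalizer::normalize(source, UNORM_NFC,  0, nfc,  status);
        icu::Normalizer::normalize(source, UNORM_NFKC, 0, nfkc, status);

        // NFC keeps the ligature as-is, so /i/ still cannot match inside it;
        // NFKC applies the compatibility decomposition and yields "difficult".
        std::string out_nfc, out_nfkc;
        nfc.toUTF8String(out_nfc);
        nfkc.toUTF8String(out_nfkc);
        std::cout << out_nfc << '\n' << out_nfkc << '\n';
    }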
I should probably be getting "d", the "ff" ligature code point, and "cult". I know that if I tried to code that by hand in every application, I'd miss all kinds of evil corner cases like that.
Unfortunately, Unicode is made of a lot of corner cases, and there is no way around them without understanding it.

Artyom wrote:
Where is the source code?
On the sandbox svn, as I said. I can provide a tarball or zip if it is really needed, but it is easier to keep the svn up to date.
UTF-16 ... This is the recommended encoding for dealing with Unicode internally for general purposes
To be honest, it is the most error-prone encoding to work with for Unicode:
You're not supposed to deal with it for text management; it is nothing more than the encoding of your raw data. UTF-16 is recommended because it allows algorithms to operate efficiently while minimizing memory waste, and thus is believed to be a better compromise than UTF-8 and UTF-32. Of course, the library works with UTF-8 and UTF-32 just as well; it makes no difference to the generic algorithms (which don't exist yet, but expect substring searching and the like). It's up to you to choose what makes the most sense for your situation (for example, you may choose UTF-8 because you need to interact a lot with programming interfaces expecting that format).
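As a rough, hedged illustration of that trade-off (standard C++11 string literals only; the exact numbers obviously depend on the text):

    #include <iostream>
    #include <string>

    int main()
    {
        // "日本語" (Japanese, 3 code points), written with universal character
        // names; the narrow literal assumes a UTF-8 execution character set.
        std::string    utf8  = "\u65E5\u672C\u8A9E";
        std::u16string utf16 = u"\u65E5\u672C\u8A9E";
        std::u32string utf32 = U"\u65E5\u672C\u8A9E";

        // For this text: 9 bytes in UTF-8, 6 in UTF-16, 12 in UTF-32.
        // For ASCII-heavy text UTF-8 wins instead, which is the compromise at stake.
        std::cout << "UTF-8:  " << utf8.size()  * sizeof(char)     << " bytes\n"
                  << "UTF-16: " << utf16.size() * sizeof(char16_t) << " bytes\n"
                  << "UTF-32: " << utf32.size() * sizeof(char32_t) << " bytes\n";
    }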
1. It is a variable-length encoding. 2. Surrogate characters are quite rare, so bugs related to them are very hard to find.
All facilities of the library take that into account, and even more if you ask them to (they will also be able to work at the grapheme cluster level; it's just part of the generic interface when it is relevant).
It was mostly born as a "mistake" at the beginning of Unicode, when it was believed that 16 bits were enough for a single code point. Many software platforms adopted a 16-bit encoding that supported only the BMP. As a result, you can **easily** find a **huge** amount of bugs in code that uses UTF-16, and in most cases such bugs are hard to track down because these code points are rare.
They should be fairly easy to find. Either you're using the algorithm that does the task correctly, or you're fiddling with the encoding by hand, which is likely to be wrong.
UTF-32 ... This encoding isn't really recommended
As I mentioned above, that is not quite true; it is a much safer encoding to work with,
In my personal opinion, it only exists in order to be "politically correct", so that broken code that relies on the illusion that you can have fixed-size characters keeps working.
- For boundary checks I'd suggest an ICU- or Qt4-like API: iterate over the string and return the next boundary each time.
That's what consumer_iterator, and the _bounded functions that invoke it, do.

With UTF-8 as the source code character encoding, and assuming C++0x features for readability:

    char foo[] = "eoaéôn";
    for(auto subrange : u8_bounded(foo))
    {
        for(unsigned char c : subrange)
            cout << c;
        cout << ' ';
    }
    cout << endl;

prints:

    e o a é ô n

(i.e. spaces are only put between code points, not code units)
Don't check whether there is a boundary at a specific position.
Checking if a given position constitutes a boundary is a fairly useful primitive to have for certain algorithms (since you can delay the boundary check until the moment it is really needed instead of doing it everywhere), and it can be useful for applications that need some kind of pseudo-random access. It's also a primitive that the Unicode standard provides optimized implementations of (part of the Unicode character database contains information to speed up that primitive on grapheme clusters, for example). consumer_iterator can be implemented either in terms of such a primitive or in terms of the Consumer primitive. It can therefore be used to iterate over sequences of code units, grapheme clusters, words, sentences, lines, etc. (any pattern modeled by the Consumer concept)
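For illustration, the same "is this position a boundary?" primitive as exposed by ICU's BreakIterator (a sketch only; the library's own interface may differ):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <iostream>

    int main()
    {
        UErrorCode status = U_ZERO_ERROR;

        // "e" + COMBINING ACUTE ACCENT + "o"
        const UChar raw[] = { 0x65, 0x0301, 0x6F, 0 };
        icu::UnicodeString text(raw);

        icu::BreakIterator *it =
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status);
        it->setText(text);

        // Offset 1 falls between the base letter and its combining accent,
        // so it is not a grapheme cluster boundary; offset 2 is one.
        std::cout << "offset 1: " << (it->isBoundary(1) ? "boundary" : "not a boundary") << '\n';
        std::cout << "offset 2: " << (it->isBoundary(2) ? "boundary" : "not a boundary") << '\n';

        delete it;
    }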
- Examples and more description are required.
There is a pretty simplistic example in the source, libs/unicode/example/test.cpp, which I mostly use to check that things work between refactorings without setting up unit tests. I'll try to work on a tutorial.