
Hello,
Here is the documentation of the current state of the Unicode library that I am doing as a Google Summer of Code project: http://blogloufoque.free.fr/unicode/doc/html/ [snip]
Where is the source code? .... Some notes:
UTF-16 ... This is the recommended encoding for dealing with Unicode internally for general purposes
To be honest, it is the most error-prone encoding for working with Unicode:

1. It is a variable-length encoding.
2. Surrogate characters are quite rare, so bugs related to them are very hard to find.

It was mostly born as a "mistake" at the beginning of Unicode, when it was believed that 16 bits were enough for a single code point. So many software platforms adopted a 16-bit encoding that supported only the BMP. As a result, you can **easily** find a **huge** number of bugs in code that uses UTF-16, and in most cases such bugs are hard to track down because these code points are rare.

For example, try to edit a file name in Windows that contains a character outside the BMP: you will see that you need to press "delete" twice. Try to type such a character in a Qt3 application: it just will not work. There are many examples like this.

So I would be wary of recommending this encoding as the internal encoding just because many platforms use it.
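To make the "press delete twice" kind of bug concrete, here is a minimal sketch. It assumes a C++11 compiler and plain std::u16string; the function names are made up for illustration and are not taken from the library under review.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A high (lead) surrogate is a code unit in the range D800..DBFF.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// A low (trail) surrogate is a code unit in the range DC00..DFFF.
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points, treating a well-formed surrogate pair as one.
// Unpaired surrogates are counted as one code point each.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1]))
            ++i;  // skip the low half of the pair
        ++n;
    }
    return n;
}

// The typical BMP-only bug: "delete the last character" removes one code unit.
void naive_backspace(std::u16string& s) {
    if (!s.empty()) s.pop_back();
}

// Surrogate-aware version: removes both halves of a trailing pair.
void backspace(std::u16string& s) {
    if (s.empty()) return;
    const bool was_low = is_low_surrogate(s.back());
    s.pop_back();
    if (was_low && !s.empty() && is_high_surrogate(s.back())) s.pop_back();
}
```

With a string such as u"a\U0001D11E" (U+1D11E lies outside the BMP), size() reports 3 code units for 2 code points, and naive_backspace leaves a lone high surrogate behind. Code that is only ever tested with BMP text never hits this path.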
UTF-32 ... This encoding isn't really recommended
As I mentioned above, this is not quite true: it is a much safer encoding to work with, so I would recommend not writing such "suggestions".

More notes:
-----------

- For boundary checks I'd suggest using an ICU- or Qt4-like API: iterate over the string and return the next boundary each time, rather than checking whether there is a boundary at a specific character.
- Examples and more description are required.

Artyom
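P.S. To illustrate the boundary-iteration suggestion, here is a rough sketch of the shape I have in mind. The class and function names are hypothetical (this is not the ICU, Qt or Boost API), and for brevity it only finds code point boundaries in UTF-16; a real implementation would apply the Unicode segmentation rules for graphemes, words and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sentinel returned when there are no more boundaries (hypothetical name).
const std::size_t no_more_boundaries = std::u16string::npos;

// Hypothetical iterator: each call to next() returns the offset, in code
// units, of the following boundary. The text must outlive the iterator.
class code_point_boundaries {
public:
    explicit code_point_boundaries(const std::u16string& text)
        : text_(text), pos_(0) {}

    std::size_t next() {
        if (pos_ >= text_.size()) return no_more_boundaries;
        const char16_t u = text_[pos_++];
        const bool high = u >= 0xD800 && u <= 0xDBFF;
        // Keep a well-formed surrogate pair together.
        if (high && pos_ < text_.size() &&
            text_[pos_] >= 0xDC00 && text_[pos_] <= 0xDFFF)
            ++pos_;
        return pos_;
    }

private:
    const std::u16string& text_;
    std::size_t pos_;
};

// Usage: collect every boundary by iterating until the sentinel is returned.
std::vector<std::size_t> all_boundaries(const std::u16string& text) {
    std::vector<std::size_t> out;
    code_point_boundaries it(text);
    for (std::size_t b = it.next(); b != no_more_boundaries; b = it.next())
        out.push_back(b);
    return out;
}
```

The point of this design is that the iterator keeps its own state and walks forward once, so the caller never has to ask "is there a boundary at position i?" for every i, which would force the library to re-scan the surrounding context on each query.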