Re: [boost] GSoC Unicode library: second preview

20 Jun 2009

      Hello,
...
Here is the documentation of the
current state of the Unicode library that I am doing as a
google summer of code project:
http://blogloufoque.free.fr/unicode/doc/html/
[snip]
Where is the source code?

....

Some notes:
...
UTF-16 ... This is the recommended encoding for dealing with
Unicode internally for general purposes
To be honest, it is the most error-prone encoding for working with Unicode:

1. It is a variable-length encoding.
2. Surrogate characters are quite rare, so it is very
   hard to find bugs related to them.

It was mostly born as a "mistake" at the beginning of Unicode,
when it was believed that 16 bits were enough for a single code point.
So many software platforms adopted a 16-bit encoding that supported
only the BMP. As a result you can **easily** find a **huge** number of
bugs in code that uses UTF-16. In most cases such bugs
are hard to track down because these code points are rare.
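
To make the point concrete, here is a minimal sketch (assuming a C++11
compiler for char16_t and std::u16string; the function names are made up
for illustration and are not part of the proposed library). It shows that
the number of 16-bit units and the number of code points differ as soon as
a character outside the BMP appears, which is exactly what BMP-only code
gets wrong:

```cpp
#include <cstddef>
#include <string>

// High (leading) surrogates occupy U+D800..U+DBFF.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Low (trailing) surrogates occupy U+DC00..U+DFFF.
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Count code points in a UTF-16 string. A well-formed surrogate pair
// counts once; a lone surrogate is counted as one (ill-formed) unit.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() &&
            is_low_surrogate(s[i + 1])) {
            ++i;  // skip the trailing half of the pair
        }
        ++n;
    }
    return n;
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF), size() reports 2 units while
count_code_points() reports 1. Code that only ever sees BMP text never
exercises that branch.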

For example, try to edit a file name in Windows that contains a character
outside the BMP: you will see that you need to press "delete" twice. Try
to write such a character in a Qt3 application... that would just not work.
There are many examples of this.
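
The "delete twice" behaviour can be reproduced in a few lines. This is only
a sketch of the kind of bug involved, not the actual Windows or Qt3 code;
both function names are hypothetical and C++11 is assumed:

```cpp
#include <string>

// Naive version: removes one 16-bit unit. On a character outside the BMP
// this leaves a lone high surrogate behind, so a second "delete" is needed.
void erase_last_unit(std::u16string& s) {
    if (!s.empty()) s.pop_back();
}

// Surrogate-aware version: removes a whole code point.
void erase_last_code_point(std::u16string& s) {
    if (s.empty()) return;
    const char16_t last = s.back();
    s.pop_back();
    const bool last_is_low = last >= 0xDC00 && last <= 0xDFFF;
    if (last_is_low && !s.empty() &&
        s.back() >= 0xD800 && s.back() <= 0xDBFF) {
        s.pop_back();  // also drop the matching high surrogate
    }
}
```

With the text "x" followed by U+1D11E, erase_last_unit() leaves two units
(the "x" and an orphaned 0xD834), while erase_last_code_point() leaves
just "x".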

So, I would be wary of recommending this encoding as the internal encoding
just because many platforms use it.
...
UTF-32 ... This encoding isn't really recommended
As I mentioned above, this is not quite true: it is a much safer encoding
to work with.

So I would recommend not writing such "suggestions".
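
A small sketch of why UTF-32 is harder to get wrong (C++11 assumed, helper
names hypothetical): every element is a whole code point, so counting and
indexing need no special cases. Note that this only covers code points;
combining sequences still span several elements in any encoding.

```cpp
#include <cstddef>
#include <string>

// In UTF-32 the number of elements is the number of code points.
std::size_t code_point_count(const std::u32string& s) {
    return s.size();
}

// Indexing yields a complete code point, inside or outside the BMP.
// at() throws std::out_of_range for an invalid index.
char32_t code_point_at(const std::u32string& s, std::size_t i) {
    return s.at(i);
}
```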

More notes:
-----------

- For boundary checks I'd suggest an ICU- or Qt4-like API: iterate
  over the string and return the next boundary each time, rather than
  checking whether there is a boundary at a specific character.
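
  A rough sketch of the iteration style I mean, loosely in the spirit of
  ICU's BreakIterator::next(). The class below is hypothetical and, to
  stay short, only finds code point boundaries in UTF-16; a real
  implementation would apply the grapheme, word or line break rules
  instead. C++11 is assumed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical boundary iterator: each call to next() returns the offset
// (in 16-bit units) of the following boundary, or npos when exhausted.
// The referenced string must outlive the iterator.
class code_point_boundaries {
public:
    explicit code_point_boundaries(const std::u16string& text)
        : text_(text), pos_(0) {}

    std::size_t next() {
        if (pos_ >= text_.size()) return std::u16string::npos;
        const char16_t u = text_[pos_++];
        // A high surrogate followed by a low surrogate forms one unit.
        if (u >= 0xD800 && u <= 0xDBFF && pos_ < text_.size() &&
            text_[pos_] >= 0xDC00 && text_[pos_] <= 0xDFFF) {
            ++pos_;
        }
        return pos_;
    }

private:
    const std::u16string& text_;
    std::size_t pos_;
};
```

  The caller just loops until next() returns npos and never has to ask
  "is there a boundary at position i?", which would require rescanning
  context for every query.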

- Examples and more description are required.

Artyom