
On Sun, Mar 29, 2009 at 9:40 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
I plan to submit my Summer of Code proposal about Unicode during the week.
I plan to provide:
- iterator adaptors to iterate over sequences of code units, code points and graphemes, and eventually more, from a sequence in UTF-8, UTF-16, UCS-2 or UTF-32/UCS-4.
What about conversion algorithms to conveniently generate these sequences in the first place?
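To make the question concrete, here is a minimal sketch of the kind of conversion algorithm I have in mind, turning a range of UTF-32 code points into UTF-8 code units. The names `append_utf8` and `utf32_to_utf8` are hypothetical and purely illustrative; they are not part of the proposal or of any existing Boost interface.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (illustrative only, not a proposed interface):
// append the UTF-8 encoding of one code point to `out`.
inline void append_utf8(std::uint32_t cp, std::string& out) {
    // Reject values beyond U+10FFFF and the surrogate range U+D800..U+DFFF.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical conversion algorithm: UTF-32 code points in, UTF-8 code units out.
template <class InputIt>
std::string utf32_to_utf8(InputIt first, InputIt last) {
    std::string out;
    for (; first != last; ++first)
        append_utf8(static_cast<std::uint32_t>(*first), out);
    return out;
}
```

An algorithm taking an iterator range like this would compose with the iterator adaptors, since the adaptors could supply the input range and the algorithm would produce the stored sequence.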
- miscellaneous utilities, such as categorization of code points
- normalization functions
- comparisons but not collations
- substring search algorithms
- and finally, a Unicode string type
From prior discussions, it seemed to me that there is actually a need for several Unicode string types.
* Specific UTF-8, UTF-16, UTF-* string classes to be used within an application, when a particular Unicode string type and internal representation is the optimal choice.
* A single utf_string that varies its internal representation at run time. This is the choice for communication between third parties where not enough is known about the applications to choose a particular internal representation, or within an application when the application must cope with needs that change at run time.
I am well aware that defining yet another new string type is quite controversial, but I believe it would be quite useful. A dedicated type would be able to maintain certain invariants, such as keeping the string in a particular normalization form. Also, I believe it is possible to come up with a string design that allows easy integration with any other existing string type, such as the ones from the standard library or Qt.
While this is an interesting proposal, it appears to me to be several years' worth of work. How would you structure the first summer's work? Would you aim at breadth (a prototype covering the whole) or depth (production-quality work that concentrates on one aspect)?

--Beman