[gsoc] Unicode tools and a Unicode string type

I plan to submit during the week my proposal for the Summer of Code about Unicode. I plan to provide:
- iterator adaptors to iterate sequences of code units, code points and graphemes, and eventually more, from a sequence in UTF-8, UTF-16, UCS-2 or UTF-32/UCS-4
- miscellaneous utilities, such as categorization of code points
- normalization functions
- comparisons but not collations
- substring search algorithms
- and finally, a Unicode string type

I am well aware that defining yet another new string type is quite controversial, but I believe it is quite useful. A dedicated type would be able to maintain certain invariants, such as keeping the data in a particular normalization form. Also, I believe it is possible to come up with a string design that allows easy integration with any other existing string type, such as the ones from the standard or Qt.

On Mon, Mar 30, 2009 at 02:40, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
I plan to submit during the week my proposal for the Summer of Code about Unicode.
That sounds good. Your proposal seems comprehensive. Are you aware that Erik Wien did work on a Unicode library for his Bachelor's thesis in 2005? (If I remember correctly.) Cheers, Rogier

On Sun, Mar 29, 2009 at 9:40 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
I plan to submit during the week my proposal for the Summer of Code about Unicode.
I plan to provide:
- iterator adaptors to iterate sequences of code units, code points and graphemes, and eventually more, from a sequence in UTF-8, UTF-16, UCS-2 or UTF-32/UCS-4.
What about conversion algorithms to conveniently generate these sequences in the first place?
- miscellaneous utilities, such as categorization of code points
- normalization functions
- comparisons but not collations
- substring search algorithms
- and finally, a Unicode string type
From prior discussions, it seemed to me that there were actually needs for several unicode string types.
* Specific UTF-8, UTF-16, UTF-* string classes to be used within an application, when a particular Unicode string type and internal representation is the optimal choice.
* A single utf_string that varies its internal representation at run-time. This is the choice for communication between third parties, where not enough is known about the applications to choose a particular internal representation, or within an application when the application must cope with changing needs at runtime.
I am well aware that defining yet another new string type is quite controversial, but I believe it is quite useful. A dedicated type would be able to maintain certain invariants, such as keeping the data in a particular normalization form. Also, I believe it is possible to come up with a string design that allows easy integration with any other existing string type, such as the ones from the standard or Qt.
While this is an interesting proposal, it appears to me to be several years worth of work. How would you structure the first summer's work? Would you aim at breadth (a prototype covering the whole) or depth (production quality work that concentrates on one aspect)? --Beman

Beman Dawes wrote:
While this is an interesting proposal, it appears to me to be several years worth of work. How would you structure the first summer's work? Would you aim at breadth (a prototype covering the whole) or depth (production quality work that concentrates on one aspect)?
As there have been multiple attempts at a boost.unicode library in the past, and they all failed (at least in the sense that they didn't result in any usable Unicode library as part of Boost), I would like to see any further attempt at this learn from that experience.

To me, it suggests that a more incremental approach is needed, where the focus is on small self-contained chunks of functionality that can be reviewed and integrated into Boost quickly. (Obviously it's also important to keep the big picture in mind, but aiming too high is a clear recipe for failure.)

One aspect of this is that it may be important to recognize that the API won't necessarily be 'right' in the first iteration. Having to aim for the ultimate (generic) Unicode API upfront (knowing that any future API changes will meet heavy resistance) may actually hinder progress.

Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
To me, it suggests that a more incremental approach is needed, where the focus is on small self-contained chunks of functionality that can be reviewed and integrated into boost quickly. (Obviously it's also important to keep the big picture in mind, but aiming too high is a clear recipe for failure.) (One aspect of this is that it may be important to recognize that the API won't necessarily be 'right' in the first iteration. Having to aim for the ultimate (generic) unicode API upfront (knowing that any future API changes will meet heavy resistance) may actually hinder progress.)
+1 -t

Beman Dawes wrote:
While this is an interesting proposal, it appears to me to be several years worth of work. How would you structure the first summer's work? Would you aim at breadth (a prototype covering the whole) or depth (production quality work that concentrates on one aspect)?
I'd *again* (we have already discussed several proposals at length... I am disappointed to see that this one is no more restricted in scope than the others) advise the student to simplify the proposal. Better to write a hundred lines of code that is tested, documented and committed to trunk than a thousand lines that languish in the gsoc09 code repository. -t

Beman Dawes wrote:
On Sun, Mar 29, 2009 at 9:40 PM, Mathias Gaunard
- iterator adaptors to iterate sequences of code units, code points and graphemes, and eventually more, from a sequence in UTF-8, UTF-16, UCS-2 or UTF-32/UCS-4.
What about conversion algorithms to conveniently generate these sequences in the first place?
I am not really interested in supporting arbitrary conversions between charsets, since that is mostly about writing big charset-specific look-up tables. This could be done by a separate library. Conversions between the different Unicode encodings as well as from charsets that are included verbatim into Unicode (such as ISO-8859-1) should probably be allowed, however.
From prior discussions, it seemed to me that there were actually needs for several unicode string types.
* Specific UTF-8, UTF-16, UTF-*, string classes to be used within an application, when a particular Unicode string type and internal representation is the optimal choice.
* A single utf_string that varies its internal representation at run-time. This is the choice for communication between third parties where not enough is known about the applications to choose a particular internal representation, or within an application when the application must cope with runtime changing needs.
Since I was seeing integration with existing string types as important, I was thinking of actually templating on the underlying string type: unicode_string&lt;std::string&gt;, for example. The underlying value type of the string type gives the encoding used (here, UTF-8). Different levels of type erasure could perhaps be used to forget either the underlying string type or the underlying encoding. This still requires some thought, obviously. Any ideas on how it should be done are welcome.
While this is an interesting proposal, it appears to me to be several years worth of work. How would you structure the first summer's work? Would you aim at breadth (a prototype covering the whole) or depth (production quality work that concentrates on one aspect)?
I suppose breadth. Quality could be increased after the SoC. I'm more interested in using the time where I am mentored to come up with interesting and practical designs that would pass a review. Also, I'm quite the regular on this list, so it's not like I'll disappear once the SoC is done.

Mathias Gaunard wrote:
I suppose breadth. Quality could be increased after the SoC. I'm more interested in using the time where I am mentored to come up with interesting and practical designs that would pass a review.
If this is indeed going to be more about design than actual code, I'd suggest to make the review and discussion of previous attempts (that did have broad support in this community), such as the already mentioned work by Erik Wien (see http://lists.boost.org/Archives/boost/2004/10/74349.php) a central part of the work. As I already mentioned, I don't think a top-down approach is a good idea in this case, but it would be especially bad if all it did was to add yet another item to this potentially-good-but-unimplemented-unicode-designs bag. FWIW. Regards, Stefan -- ...ich hab' noch einen Koffer in Berlin...

Stefan Seefeld wrote:
If this is indeed going to be more about design than actual code, I'd suggest to make the review and discussion of previous attempts (that did have broad support in this community), such as the already mentioned work by Erik Wien (see http://lists.boost.org/Archives/boost/2004/10/74349.php) a central part of the work.
I haven't read the whole thread, since it's quite big and I am limited in time at the moment. I was never able to get the final code of Erik Wien, either. I have gone through most of the Unicode threads, and have noted most of the issues that were raised.

Note that my proposal is somewhat more restricted than Erik's. I'm not planning to do any locale-specific work: no collation support, and no integration with codecvt facets or standard locales. Locale support could eventually be added given time (and I'd personally rather do it using a custom-made locale system, like ICU does), but it was asked to restrict the scope of the library.

It will be purely iterators (or rather, ranges) and algorithms, which are much simpler to deal with than the whole standard locale subsystem. On top of that will be layered a Unicode string type, which is nothing more than a glorified container wrapper with optional type erasure, whose purpose is to maintain invariants and thus accurately represent a Unicode string. It's really aimed at being simple and non-intrusive. The components are fairly separate, so the code is incremental, and the Unicode string just composes the work.

I personally believe basic_string, char_traits, codecvt facets and the standard locale system are not really suitable for dealing with Unicode, which may have been the reason why previous proposals ended up the way they did. I think some people said the same in the various Unicode discussions, too.
As I already mentioned, I don't think a top-down approach is a good idea in this case, but it would be especially bad if all it did was to add yet another item to this potentially-good-but-unimplemented-unicode-designs bag.
Efficient algorithms are provided by the Unicode consortium, so it's mostly just the design or glue code that needs work, the glue depending on what integration with other components is being done. Here, it's mostly just range concepts. Furthermore, assuming design is what matters most for this project, the documentation would itself be an integral part of the project. Lack of good documentation may be the reason why some previous Unicode projects failed, too.
participants (5)
- Beman Dawes
- Mathias Gaunard
- Rogier van Dalen
- Stefan Seefeld
- troy d. straszheim