Comment on string / unicode discussion

6 Jul 2006

I don't have enough time to delve deeply into this thread, but I
thought I'd make a few passing comments.

Adobe has a fairly major string class problem (we joke that every
project must have its own string class, which is nearly true).
There is no such thing as a single type of string. There are _many_
purposes, and you need to be able to handle things like language and
style runs, very large blocks of text with efficient edits, UI
substitutions (which are aware of things like split negation and
masculine/feminine forms), language-based ordering, different
encodings...

We need another string class like a hole in the head.

What we do need are good standard algorithms that can be applied
to any string class.

I believe this is doable with the current iterator interface.

I believe it's possible (meaning I've done some quick experiments) to
define an input iterator (actually as strong as a non-mutating
forward iterator) and an output iterator that do conversions. This
means that you can define operations in terms of Unicode code points
rather than a particular encoding (though some operations, such as
ordering, may still require a locale).
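
As a rough illustration of the input side, here is a minimal sketch of
a decoding adapter. The names utf8_decode_iterator and decode_utf8 are
made up for this example (they are not ASL or standard library names),
and the code assumes well-formed UTF-8 with no error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adapter: wraps an iterator over UTF-8 code units and yields
// UTF-32 code points. Assumes well-formed input; no validation is done.
template <typename I>
class utf8_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;  // multi-pass if I is
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;  // returned by value, so non-mutating

    explicit utf8_decode_iterator(I pos) : pos_(pos) {}

    char32_t operator*() const {
        I p = pos_;
        const unsigned lead = static_cast<unsigned char>(*p);
        const int n = length(lead);
        // Keep only the payload bits of the lead byte.
        char32_t cp = (n == 1) ? lead : (lead & (0x7Fu >> n));
        for (int i = 1; i < n; ++i) {
            ++p;
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3Fu);
        }
        return cp;
    }
    utf8_decode_iterator& operator++() {
        // Step over every code unit of the current code point.
        std::advance(pos_, length(static_cast<unsigned char>(*pos_)));
        return *this;
    }
    utf8_decode_iterator operator++(int) {
        utf8_decode_iterator tmp = *this;
        ++*this;
        return tmp;
    }
    friend bool operator==(const utf8_decode_iterator& a,
                           const utf8_decode_iterator& b) { return a.pos_ == b.pos_; }
    friend bool operator!=(const utf8_decode_iterator& a,
                           const utf8_decode_iterator& b) { return !(a == b); }

private:
    // Number of code units in the sequence that starts with this lead byte.
    static int length(unsigned lead) {
        if (lead < 0x80u) return 1;
        if ((lead >> 5) == 0x6u) return 2;
        if ((lead >> 4) == 0xEu) return 3;
        return 4;
    }
    I pos_;
};

// Convenience wrapper for the example: decode a whole UTF-8 string.
inline std::u32string decode_utf8(const std::string& s)
{
    using It = utf8_decode_iterator<std::string::const_iterator>;
    return std::u32string(It(s.begin()), It(s.end()));
}
```

Because the adapter only needs an underlying forward iterator, the same
template should work over std::vector<char> or std::list<char> as well.
An encoding output iterator would be the mirror image and is not shown.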

Consider -

to_lower(first, last, output)
to_upper(first, last, output)

Such transformations can work with any encoding (you can uppercase
UTF-8 into UTF-32). They can't work in situ (but I don't think
to_upper or to_lower really can work in situ: certainly not in UTF-8,
probably not in UTF-16, and I believe there are some multi-character
forms that break even in UTF-32). It is possible, though, to wrap
them with a replace function for in-place operations.
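
To make the shape of such an interface concrete, here is a small sketch
over a range of code points. The mapping table is a toy and purely an
assumption for illustration: it handles ASCII letters plus U+00DF,
whose full uppercase form is the two code points "SS". That one case is
enough to show why the output can be longer than the input even in
UTF-32. The name to_upper_in_place is hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Toy case mapping, for illustration only. A real implementation would
// consult the Unicode case-mapping data instead of this hard-coded rule.
template <typename O>
O upper_code_point(char32_t c, O out)
{
    if (c == U'\u00DF') {            // sharp s uppercases to "SS"
        *out++ = U'S';
        *out++ = U'S';
    } else if (c >= U'a' && c <= U'z') {
        *out++ = static_cast<char32_t>(c - (U'a' - U'A'));
    } else {
        *out++ = c;                  // everything else passes through
    }
    return out;
}

// to_upper(first, last, output): reads code points, writes code points.
// The input and output may use different encodings if the iterators convert.
template <typename I, typename O>
O to_upper(I first, I last, O out)
{
    for (; first != last; ++first)
        out = upper_code_point(static_cast<char32_t>(*first), out);
    return out;
}

// In-place wrapper: transform into a temporary, then replace the original.
template <typename String>
void to_upper_in_place(String& s)
{
    String tmp;
    to_upper(s.begin(), s.end(), std::back_inserter(tmp));
    s.swap(tmp);
}
```

Writing through an output iterator keeps the algorithm independent of
how the destination grows, and the wrapper recovers in-place behaviour
at the cost of one temporary.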

The current std::find() will work with such iterator adapters to find
a single UTF-32 character (in any encoded sequence).
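
For instance, a search in a UTF-16 sequence might be sketched as
follows. The adapter utf16_decode_iterator and the helper
find_code_point are names invented for this example, and the code
assumes well-formed UTF-16 (every high surrogate is followed by a low
surrogate).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adapter: reads UTF-16 code units, yields UTF-32 code points.
template <typename I>
class utf16_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf16_decode_iterator(I pos) : pos_(pos) {}

    char32_t operator*() const {
        const char32_t hi = static_cast<char16_t>(*pos_);
        if (!is_high_surrogate(hi)) return hi;
        // Combine the surrogate pair into one code point.
        const char32_t lo = static_cast<char16_t>(*std::next(pos_));
        return static_cast<char32_t>(0x10000u + ((hi - 0xD800u) << 10) + (lo - 0xDC00u));
    }
    utf16_decode_iterator& operator++() {
        std::advance(pos_, is_high_surrogate(static_cast<char16_t>(*pos_)) ? 2 : 1);
        return *this;
    }
    utf16_decode_iterator operator++(int) {
        utf16_decode_iterator tmp = *this;
        ++*this;
        return tmp;
    }
    I base() const { return pos_; }  // position in the underlying sequence
    friend bool operator==(const utf16_decode_iterator& a,
                           const utf16_decode_iterator& b) { return a.pos_ == b.pos_; }
    friend bool operator!=(const utf16_decode_iterator& a,
                           const utf16_decode_iterator& b) { return !(a == b); }

private:
    static bool is_high_surrogate(char32_t u) { return u >= 0xD800u && u <= 0xDBFFu; }
    I pos_;
};

// Returns the code-unit offset of the first occurrence of cp, or s.size()
// if cp does not occur. The search itself is plain std::find.
inline std::size_t find_code_point(const std::u16string& s, char32_t cp)
{
    using It = utf16_decode_iterator<std::u16string::const_iterator>;
    const It hit = std::find(It(s.begin()), It(s.end()), cp);
    return static_cast<std::size_t>(hit.base() - s.begin());
}
```

The base() accessor is what lets the caller map the result back to a
position in the original code-unit sequence.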

Currently with ASL we're taking such an approach for localization
strings: replacing an existing string class for localized strings at
Adobe with a small set of functions and _any_ string class (any
sequence of code units), including std::string and std::vector (or
deque or list).
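
A toy example of the "functions over any sequence of code units" idea
is sketched below. It is not taken from ASL; code_point_count is a
hypothetical name, and the function assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <vector>

// Counts code points in any sequence of UTF-8 code units by counting the
// units that are not continuation bytes (bit pattern 10xxxxxx).
template <typename Sequence>
std::size_t code_point_count(const Sequence& units)
{
    std::size_t n = 0;
    for (auto it = std::begin(units); it != std::end(units); ++it) {
        if ((static_cast<unsigned char>(*it) & 0xC0u) != 0x80u) ++n;
    }
    return n;
}
```

The same function accepts std::string, std::vector<char> and
std::list<unsigned char> without any of them knowing about the other.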

You might take a look here for some ideas:
<http://opensource.adobe.com/group__asl__xstring.html>.

Sean

Sean Parent
