Re: [boost] Comment on string / unicode discussion

6 Jul 2006

      Sean Parent wrote:
...
I don't have enough time to delve deeply into this thread but I  
thought I'd make a few passing comments.
Adobe has a fairly major string class problem (we joke that every
project must have its own string class - which is nearly true).
There is no such thing as a single type of string - there are _many_
purposes and you need to be able to handle things like language and
style runs and large, large blocks of text with efficient edits, UI
substitutions (which are aware of things like split negation and
masculine/feminine forms), language based ordering, different
encodings...
We need another string class like a hole in the head.
What we do need are good standard algorithms which can be applied
to any string class.
I believe this is doable with the current iterator interface.
I believe it's possible (meaning I've done some quick experiments) to  
define an input iterator (actually as strong as a non-mutating  
forward iterator) and output iterator, which do conversions. This  
means that you can define operations in terms of unicode encoding  
(though some operations such as ordering may still require a locale).
Consider -
to_lower(first, last, output)
to_upper(first, last, output)
such transformations can work with any encoding (you can uppercase  
UTF-8 into UTF-32). They can't work in-situ (but I don't think  
to_upper or to_lower really can work in-situ - certainly not in UTF-8  
and probably not in UTF-16, and I believe there are some
multi-character forms that even break in UTF-32...). It is possible though
to wrap them with a replace function for in-place operations.
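To make the in-situ problem concrete, here is a minimal sketch of the iterator-based form operating on UTF-32 code points. Everything in namespace sketch is hypothetical and is not the ASL interface. The case mapping is a stand-in that only knows ASCII letters plus the German sharp s (U+00DF, which uppercases to "SS"); a real version would need the full Unicode case mapping data.

```cpp
#include <cassert>
#include <iterator>
#include <string>

namespace sketch {

// Hypothetical to_upper(first, last, output) over UTF-32 code points.
// Stand-in mapping: ASCII a-z, plus U+00DF -> "SS". Not a full Unicode mapping.
template <class InputIt, class OutputIt>
OutputIt to_upper(InputIt first, InputIt last, OutputIt out) {
    for (; first != last; ++first) {
        const char32_t cp = *first;
        if (cp >= U'a' && cp <= U'z') {
            *out++ = static_cast<char32_t>(cp - (U'a' - U'A'));
        } else if (cp == U'\u00DF') {
            // One input code point becomes two output code points,
            // so the result cannot be written back over the input.
            *out++ = U'S';
            *out++ = U'S';
        } else {
            *out++ = cp;
        }
    }
    return out;
}

// Convenience wrapper for the demo: collect the output in a new string.
inline std::u32string upper(const std::u32string& src) {
    std::u32string dest;
    sketch::to_upper(src.begin(), src.end(), std::back_inserter(dest));
    return dest;
}

// Small usage check: six code points in, seven out.
inline void self_check() {
    assert(upper(U"stra\u00DFe") == U"STRASSE");
}

}  // namespace sketch
```

Because the algorithm only writes through an output iterator, the destination can be any container or an encoding adapter, and a length change such as the one above is not a problem.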
The current std::find() will work with such iterator adapters to find
a single UTF-32 character (in any encoded sequence).
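As a rough illustration of the adapter idea, the following sketch wraps an iterator over UTF-8 code units and yields UTF-32 code points, so that the unmodified std::find can search for one. The names utf8_decode_iterator and find_code_point are made up for this example and are not taken from ASL. The sketch assumes well-formed UTF-8 and does no validation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

namespace sketch {

// Hypothetical read-only adapter: UTF-8 code units in, UTF-32 code points out.
// Assumes well-formed input; truncated or invalid sequences are not handled.
template <class It>
class utf8_decode_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    utf8_decode_iterator() = default;
    explicit utf8_decode_iterator(It pos) : pos_(pos) {}

    // Decode the code point that starts at the current position.
    char32_t operator*() const {
        It p = pos_;
        const unsigned char lead = static_cast<unsigned char>(*p);
        const int n = length(lead);
        char32_t cp = (n == 1) ? lead : (lead & (0xFF >> (n + 1)));
        for (int i = 1; i < n; ++i) {
            ++p;
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
        }
        return cp;
    }

    // Step over one whole encoded code point.
    utf8_decode_iterator& operator++() {
        std::advance(pos_, length(static_cast<unsigned char>(*pos_)));
        return *this;
    }
    utf8_decode_iterator operator++(int) {
        utf8_decode_iterator old = *this;
        ++*this;
        return old;
    }

    It base() const { return pos_; }  // underlying code unit position

    friend bool operator==(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_decode_iterator& a, const utf8_decode_iterator& b) {
        return !(a == b);
    }

private:
    // Sequence length in code units, read from the lead byte.
    static int length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }
    It pos_{};
};

// Byte offset of the first occurrence of cp in a UTF-8 string, or npos.
inline std::size_t find_code_point(const std::string& utf8, char32_t cp) {
    using iter = utf8_decode_iterator<std::string::const_iterator>;
    const iter last(utf8.end());
    const iter hit = std::find(iter(utf8.begin()), last, cp);
    return hit == last ? std::string::npos
                       : static_cast<std::size_t>(hit.base() - utf8.begin());
}

// Small usage check: U+00E9 is encoded as the two bytes C3 A9 at offset 3.
inline void self_check() {
    assert(find_code_point("caf\xC3\xA9 au lait", U'\u00E9') == 3);
}

}  // namespace sketch
```

The same adapter type could in principle sit in front of any container of code units (vector, deque, list), since it only needs to read and advance the underlying iterator.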
Currently with ASL we're taking such an approach for localization
strings (replacing an existing string class for localized strings at
Adobe with a small set of functions and _any_ string class (any
sequence of code units), including std::string, std::vector (or deque
or list)).
You might take a look here for some ideas:
<http://opensource.adobe.com/group__asl__xstring.html>.
This is very close to what I have in mind. The main difference is that
the functions/algorithms in my mind take ranges instead of iterators.
Thus:

     to_lower(src, dest)
     to_upper(src, dest)

With these, I could make Fusion like wrappers that transform them into
something like:

     some_string s1 = to_lower(src);
     some_string s2 = to_upper(src);

where to_lower and to_upper return cheap views that are in and of
themselves valid strings/ranges. They are cheap because the actual
conversions/transformations are done on demand-- think lazy evaluation.
So, like those done by expression template techniques, there are
no expensive temporaries when you perform seemingly expensive tasks
like:

     some_string s = f1(f2(f3(f4(src))));

And yes, because they are generic, those string algorithms can work
on any string type that satisfies some basic requirements.
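A minimal sketch of what such lazy views might look like follows. All names in namespace sketch (transform_iterator, lazy_view, ascii_lower, ascii_upper) are hypothetical, and the case mappings are ASCII-only stand-ins for real Unicode-aware ones. The view stores only a pair of iterators and a function object; the mapping runs when an element is dereferenced, and the templated conversion operator materializes the result only when it is assigned to a concrete string type.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

namespace sketch {

// ASCII-only stand-ins for real, Unicode-aware case mappings.
struct ascii_lower {
    char operator()(char c) const { return (c >= 'A' && c <= 'Z') ? char(c - 'A' + 'a') : c; }
};
struct ascii_upper {
    char operator()(char c) const { return (c >= 'a' && c <= 'z') ? char(c - 'a' + 'A') : c; }
};

// Applies f to each element at dereference time; nothing is computed earlier.
template <class It, class F>
class transform_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char;
    using difference_type = std::ptrdiff_t;
    using pointer = const char*;
    using reference = char;

    transform_iterator(It pos, F f) : pos_(pos), f_(f) {}
    char operator*() const { return f_(*pos_); }
    transform_iterator& operator++() { ++pos_; return *this; }
    transform_iterator operator++(int) { transform_iterator old = *this; ++pos_; return old; }
    friend bool operator==(const transform_iterator& a, const transform_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const transform_iterator& a, const transform_iterator& b) {
        return !(a == b);
    }

private:
    It pos_;
    F f_;
};

// A cheap view: two iterators plus a function object, itself a valid range.
template <class It, class F>
class lazy_view {
public:
    using iterator = transform_iterator<It, F>;
    using const_iterator = iterator;

    lazy_view(It first, It last, F f = F()) : first_(first), last_(last), f_(f) {}
    iterator begin() const { return iterator(first_, f_); }
    iterator end() const { return iterator(last_, f_); }

    // Materialize into any string type constructible from an iterator pair.
    template <class String>
    operator String() const { return String(begin(), end()); }

private:
    It first_;
    It last_;
    F f_;
};

template <class Range>
lazy_view<typename Range::const_iterator, ascii_lower> to_lower(const Range& src) {
    return lazy_view<typename Range::const_iterator, ascii_lower>(src.begin(), src.end());
}

template <class Range>
lazy_view<typename Range::const_iterator, ascii_upper> to_upper(const Range& src) {
    return lazy_view<typename Range::const_iterator, ascii_upper>(src.begin(), src.end());
}

// Small usage check: the nested call builds no intermediate string.
inline void self_check() {
    const std::string src = "Hello, World";
    std::string s = sketch::to_upper(sketch::to_lower(src));
    assert(s == "HELLO, WORLD");
}

}  // namespace sketch
```

In the nested call, the outer view wraps the inner view's iterators by value, so the only allocation happens when the final std::string is constructed. The underlying source string has to outlive the views, which is the usual caveat with this kind of design.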

Regards,
-- 
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net