Comment on string / unicode discussion

I don't have enough time to delve deeply into this thread but I thought I'd make a few passing comments.

Adobe has a fairly major string class problem (we joke that every project must have its own string class - which is nearly true). There is no such thing as a single type of string - there are _many_ purposes and you need to be able to handle things like language and style runs, large, large blocks of text with efficient edits, UI substitutions (which are aware of things like split negation and masculine/feminine forms), language based ordering, different encodings...

We need another string class like a hole in the head. What we do need are good standard algorithms which can be applied to any string class. I believe this is doable with the current iterator interface. I believe it's possible (meaning I've done some quick experiments) to define an input iterator (actually as strong as a non-mutating forward iterator) and an output iterator which do conversions. This means that you can define operations in terms of the Unicode encoding (though some operations such as ordering may still require a locale). Consider:

    to_lower(first, last, output)
    to_upper(first, last, output)

Such transformations can work with any encoding (you can uppercase UTF-8 into UTF-32). They can't work in-situ (but I don't think to_upper or to_lower really can work in-situ - certainly not in UTF-8 and probably not in UTF-16, and I believe there are some multi-character forms that even break in UTF-32...). It is possible though to wrap them with a replace function for in-place operations. The current std::find() will work with such iterator adapters to find a single UTF-32 character (in any encoded sequence).

Currently with ASL we're taking such an approach for localization strings (replacing an existing string class for localized strings at Adobe with a small set of functions and _any_ string class (any sequence of code units), including std::string and std::vector (or deque or list)).

You might take a look here for some ideas: <http://opensource.adobe.com/group__asl__xstring.html>.

Sean
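
A minimal sketch of the converting-iterator idea described above, assuming well-formed UTF-8 input. The thread does not show the actual ASL adapters, so the names utf8_to_utf32_iterator and find_code_point are hypothetical and exist only for this illustration. It shows an unmodified std::find locating a UTF-32 code point inside a UTF-8 encoded std::string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor (not part of ASL or Boost): walks UTF-8 code units
// and yields UTF-32 code points. Assumes well-formed UTF-8; no validation.
template <class It>
class utf8_to_utf32_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char32_t* pointer;
    typedef char32_t reference;  // returned by value, so non-mutating

    explicit utf8_to_utf32_iterator(It pos) : pos_(pos) {}

    // Decode the code point that starts at the current position.
    char32_t operator*() const {
        It p = pos_;
        const unsigned char lead = static_cast<unsigned char>(*p);
        assert((lead & 0xC0) != 0x80);  // must sit on a lead byte
        const int extra = trailing(lead);
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int i = 0; i < extra; ++i) {
            ++p;
            cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
        }
        return cp;
    }

    // Step over one whole encoded character (1 to 4 code units).
    utf8_to_utf32_iterator& operator++() {
        std::advance(pos_, 1 + trailing(static_cast<unsigned char>(*pos_)));
        return *this;
    }
    utf8_to_utf32_iterator operator++(int) {
        utf8_to_utf32_iterator old(*this);
        ++*this;
        return old;
    }

    // Underlying position, in code units.
    It base() const { return pos_; }

    friend bool operator==(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b) {
        return !(a == b);
    }

private:
    // Number of continuation bytes announced by a lead byte.
    static int trailing(unsigned char lead) {
        if (lead < 0x80) return 0;
        if ((lead >> 5) == 0x06) return 1;
        if ((lead >> 4) == 0x0E) return 2;
        return 3;
    }
    It pos_;
};

// Byte offset of the first occurrence of cp in s, or s.size() if absent.
inline std::size_t find_code_point(const std::string& s, char32_t cp) {
    typedef utf8_to_utf32_iterator<std::string::const_iterator> iter;
    const iter hit = std::find(iter(s.begin()), iter(s.end()), cp);
    return static_cast<std::size_t>(hit.base() - s.begin());
}
```

The adaptor is tagged as an input iterator because it returns values rather than references, although it can be traversed more than once. The same shape would apply to a UTF-16 source; only the decoding step changes.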

Sean Parent wrote:
What we do need - are good standard algorithms which can be applied to any string class.
Have you guys looked into boost.string_algo for this? http://www.boost.org/doc/html/string_algo.html Most of the algorithms take a locale for localization. There was also a proposal to add at least some of these to TR2, although I haven't followed where this has gone since the Mont Tremblant meeting. Jeff

Sean Parent wrote:
This is very close to what I have in mind. The main difference is that the functions/algorithms in my mind take ranges instead of iterators. Thus:

    to_lower(src, dest)
    to_upper(src, dest)

With these, I could make Fusion-like wrappers that transform them into something like:

    some_string s1 = to_lower(src);
    some_string s2 = to_upper(src);

where to_lower and to_upper return cheap views that are in and of themselves valid strings/ranges. They are cheap because the actual conversions/transformations are done on demand; think lazy evaluation. So, like those done by expression template techniques, there are no expensive temporaries when you perform seemingly expensive tasks like:

    some_string s = f1(f2(f3(f4(src))));

And yes, because they are generic, those string algorithms can work on any string type that satisfies some basic requirements.

Regards,
-- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net
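
The lazy views described above could be sketched roughly as follows. This is only an illustration under stated assumptions: lazy_to_lower, lazy_to_upper, mapped_view and mapped_iterator are hypothetical names, and the case mapping is per-character std::tolower/std::toupper in the current C locale, which is not a Unicode-correct transformation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Stateless per-character mappings (C locale, single-byte characters only).
struct lower_fn {
    char operator()(char c) const {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
};
struct upper_fn {
    char operator()(char c) const {
        return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
};

// Iterator that applies Fn each time it is dereferenced.
template <class It, class Fn>
class mapped_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char* pointer;
    typedef char reference;

    explicit mapped_iterator(It pos) : pos_(pos) {}
    char operator*() const { return Fn()(*pos_); }
    mapped_iterator& operator++() { ++pos_; return *this; }
    mapped_iterator operator++(int) { mapped_iterator t(*this); ++pos_; return t; }
    friend bool operator==(const mapped_iterator& a, const mapped_iterator& b) {
        return a.pos_ == b.pos_;
    }
    friend bool operator!=(const mapped_iterator& a, const mapped_iterator& b) {
        return !(a == b);
    }
private:
    It pos_;
};

// Cheap view: stores two iterators and nothing else.
template <class It, class Fn>
class mapped_view {
public:
    typedef mapped_iterator<It, Fn> iterator;
    mapped_view(It first, It last) : first_(first), last_(last) {}
    iterator begin() const { return iterator(first_); }
    iterator end() const { return iterator(last_); }
private:
    It first_, last_;
};

template <class Range>
auto lazy_to_lower(const Range& src)
    -> mapped_view<decltype(std::begin(src)), lower_fn> {
    return mapped_view<decltype(std::begin(src)), lower_fn>(std::begin(src), std::end(src));
}

template <class Range>
auto lazy_to_upper(const Range& src)
    -> mapped_view<decltype(std::begin(src)), upper_fn> {
    return mapped_view<decltype(std::begin(src)), upper_fn>(std::begin(src), std::end(src));
}

// Usage sketch: views nest without intermediate strings; characters are
// only converted while the final string is being built.
inline std::string nested_example(const std::string& src) {
    auto view = lazy_to_upper(lazy_to_lower(src));
    std::string out(view.begin(), view.end());
    assert(out.size() == src.size());
    return out;
}
```

Each view copies the iterators of the range it wraps instead of holding a reference to it, so nesting a temporary view inside another stays valid as long as the original string outlives the result.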

To implement part of the serialization library, I defined iterators that could be composed in any sequence. This permitted me to write a small number of iterators that could be composed at compile time to generate a much larger number of possible transforming iterators. I called these "Dataflow" iterators. They are used for things like converting strings from wide char to base64 and so on. I've never been able to convince anyone else of the merit of the approach - but hope springs eternal. Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
How are these different from many of the iterators provided by http://www.boost.org/libs/iterator, particularly transform_iterator? -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
http://www.boost.org/libs/serialization/doc/index.html describes this. All the "dataflow iterators" are derived from boost.iterator. For some I derived from transform iterator, for others I derived from filter, and I forget the rest. Aside from implementing the transforming behavior required for the specific instance, the only real addition is the requirement that all of them have a templated constructor. This simple addition made all the difference for me. It permitted me to compose them to any reasonable depth and sequence with just one (rather long) typedef into a new iterator which can be used just as easily as any other. Robert Ramey

"Robert Ramey" <ramey@rrsd.com> writes:
I don't know what you mean by "derived" but if you're referring to regular C++ derivation, that generally doesn't result in a legal iterator class. For example, the result type of operator++ is usually wrong.
Details would be helpful here.
The existing iterator adaptors are easily composed, so I'd like to know a little more about what you did, and gained.
I've never been able to convince anyone else of the merit of the approach - but hope springs eternal.
Maybe you never articulated sufficiently clearly what you added to the basics provided by the iterator library, and why. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
Whoops - try http://www.boost.org/libs/serialization/doc/dataflow.html
That's what I'm referring to. And it has required extra effort to make operator++ work correctly.
See the above link.
The existing iterator adaptors are easily composed, so I'd like to know a little more about what you did, and gained.
I wanted to be able to do things like the following:

a) define an iterator for transforming 8-bit octets into 6-bit values
b) define an iterator which would render those 6-bit values into ASCII codes as used in base64 coding
c) define an iterator which would insert line breaks every 50 characters

I wanted to be able to develop and test each of these separately. So far no real problem with the boost iterators. Then I wanted to compose them with a typedef:

    typedef
        boost::archive::iterators::insert_linebreaks<
            boost::archive::iterators::base64_from_binary<
                boost::archive::iterators::transform_width<
                    const char *,
                    6,
                    8
                >
            >
            ,72
            ,const char // cwpro8 needs this
        >
        base64_text;

Wow - just great. Now just construct an instance of base64_text - and we're in business. So I can do:

    char * address; // pointer to character buffer
    ...
    boost::archive::iterators::ostream_iterator<char> oi(os);
    std::copy(
        base64_text(address),
        base64_text(address + count),
        oi
    );

Uh - oh - I can't do the above - because I need to instantiate the iterator with some sort of make_???_iterator. This is an extra pain and adds a lot of confusion. It also prevents me from doing something like:

    std::copy(
        wchar_from_mb(base64_text(address)),
        wchar_from_mb(base64_text(address + count)),
        oi
    );

should I find this convenient. By adding the templated constructors and using them instead of make_??? functions I was able to achieve what I desired.
LOL - apparently so. Robert Ramey
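
The templated-constructor technique described in this message can be sketched without the serialization library. The adaptors below (map_from, upper_from, rot13_from) are hypothetical stand-ins, not the real boost::archive::iterators classes, and they are written from scratch rather than derived from boost iterator_adaptor. The point is only that each layer accepts any argument its base can be built from, so the whole stack is constructed from the innermost pointer.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical composable adaptor: applies Fn to whatever Base yields.
template <class Fn, class Base>
class map_from {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char* pointer;
    typedef char reference;

    // The templated constructor: forward the argument to the base, which
    // may itself be another map_from with the same kind of constructor.
    template <class T>
    explicit map_from(T start) : base_(start) {}

    char operator*() const { return Fn()(*base_); }
    map_from& operator++() { ++base_; return *this; }
    map_from operator++(int) { map_from t(*this); ++base_; return t; }
    friend bool operator==(const map_from& a, const map_from& b) {
        return a.base_ == b.base_;
    }
    friend bool operator!=(const map_from& a, const map_from& b) {
        return !(a == b);
    }
private:
    Base base_;
};

struct upper_fn {
    char operator()(char c) const {
        return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
};
struct rot13_fn {
    char operator()(char c) const {
        if (c >= 'a' && c <= 'z') return static_cast<char>('a' + (c - 'a' + 13) % 26);
        if (c >= 'A' && c <= 'Z') return static_cast<char>('A' + (c - 'A' + 13) % 26);
        return c;
    }
};

// A small "stable" of adaptors, each usable on its own or as a component.
template <class Base> using upper_from = map_from<upper_fn, Base>;
template <class Base> using rot13_from = map_from<rot13_fn, Base>;

// Composition is a single typedef, read inside out.
typedef rot13_from<upper_from<const char*> > shouted_rot13;

inline std::string encode(const char* first, std::size_t count) {
    assert(first != nullptr || count == 0);
    std::string out;
    // Constructed directly from the raw pointer; no make_xxx helpers.
    std::copy(shouted_rot13(first), shouted_rot13(first + count),
              std::back_inserter(out));
    return out;
}
```

A constructor template is never used to copy an object of its own class, so ordinary copies made inside std::copy still go through the implicit copy constructor.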

On Thursday 06 July 2006 11:17, Robert Ramey wrote:
So the advantage is the ability to construct a base64_text iterator directly from a char const* without manually constructing each of the underlying iterator adaptor layers. That's neat, but wouldn't a function base64_text make_base64_text_iterator( char const* ); work just as well? D.

Daniel Mitchell wrote:
Yes - the newly created iterator - created with the typedef - hides all the underlying implementation just presents the "composed" interface.
base64_text make_base64_text_iterator( char const* );
work just as well?
How would I make such a function automatically? Wouldn't I have to do something like

    base64_text make_base64_text_iterator(
        make_insert_linebreaks<...>(
            make_transform_width<...>(char const *)
        )
    )

and now a LOT of stuff goes in to the ... But maybe it could be made to work. I just much preferred the idea of not having extra make_??? functions. The method I used lets me make a "stable" of iterators and compose them at will. The generated iterators (derived from boost iterator adaptor) are all (I think) legal iterators and can be used in any STL algorithms. It comes close to implementing what people are talking about now. And it does all the heavy lifting at compile time so that the compiler can inline everything.

What I would like to see is a code_cvt facet which used an iterator as template argument. Combined with the above I could easily generate any code_cvt facet from my "stable" of composable iterators. This could be attached to any output stream to do stuff like base64 output or utf8 output or whatever. Note that the serialization library uses this to implement things like mb_from_wchar etc., so I would expect it would be easy to make things like utf8_from_mb and add them to the "stable" of dataflow iterators.

Robert Ramey

On Thursday 06 July 2006 12:50, Robert Ramey wrote:
I was actually thinking of something a little less general and a little more direct like

    base64_text make_base64_text_iterator( char const* c )
    {
        typedef transform_width<char const*,6,8> transform_iter;
        typedef base64_from_binary<transform_iter> base64_from_binary_iter;
        return base64_text( base64_from_binary_iter( transform_iter( c ) ) );
    }

However...
...I can see now that the above would be inadequate because you don't want to write a make_xxx_iterator function for each new combination of iterators. Sorry, I wasn't thinking "big picture." The method you suggest where each fundamental iterator (transform_width, base64_from_binary, and so on) gets its own make_xxx_iterator function and those functions are chained together to generate the right type is better anyway, and there's no question that the template constructors offer a much nicer syntax. D.
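
For comparison, the chained make_xxx_iterator style mentioned above might look like the sketch below. All names (map_iterator, make_upper_iterator, make_dash_iterator, slug) are hypothetical and only illustrate the shape: one factory function per fundamental adaptor, with the composed type deduced from the chain instead of being named in a typedef.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical adaptor with an ordinary (non-template) constructor.
template <class Fn, class Base>
class map_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const char* pointer;
    typedef char reference;

    explicit map_iterator(Base base) : base_(base) {}
    char operator*() const { return Fn()(*base_); }
    map_iterator& operator++() { ++base_; return *this; }
    map_iterator operator++(int) { map_iterator t(*this); ++base_; return t; }
    friend bool operator==(const map_iterator& a, const map_iterator& b) {
        return a.base_ == b.base_;
    }
    friend bool operator!=(const map_iterator& a, const map_iterator& b) {
        return !(a == b);
    }
private:
    Base base_;
};

struct upper_fn {
    char operator()(char c) const {
        return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
};
struct dash_fn {  // replaces spaces with dashes
    char operator()(char c) const { return c == ' ' ? '-' : c; }
};

// One factory per fundamental adaptor; the Base type is deduced.
template <class Base>
map_iterator<upper_fn, Base> make_upper_iterator(Base base) {
    return map_iterator<upper_fn, Base>(base);
}
template <class Base>
map_iterator<dash_fn, Base> make_dash_iterator(Base base) {
    return map_iterator<dash_fn, Base>(base);
}

inline std::string slug(const char* first, std::size_t count) {
    assert(first != nullptr || count == 0);
    std::string out;
    // The chain has to be spelled out once for each end of the range.
    std::copy(make_dash_iterator(make_upper_iterator(first)),
              make_dash_iterator(make_upper_iterator(first + count)),
              std::back_inserter(out));
    return out;
}
```

The trade-off matches the discussion: the factories avoid writing the nested type, but every use site repeats the whole chain, whereas a typedef plus templated constructors names the composition once.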

"Robert Ramey" <ramey@rrsd.com> writes:
Thanks; I think I see why you need the templated ctors now. w.r.t. composing with a typedef, would

    boost::archive::iterators::insert_linebreaks<
        boost::archive::iterators::base64_from_binary<
            boost::archive::iterators::transform_width<
                const char *,
                6,
                8
            >::type
        >::type
        ,72
        ,const char // cwpro8 needs this
    >::type
    base64_text;

be much worse, for your purposes? -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
Of course there is another possibility - that it's clear how it works but no one sees the appeal of the approach. A previous time when ranges were being discussed, I flogged this idea and there was a strong objection raised to the nested template syntax. Funny thing was - that is exactly what I love about this. It's the compile-time counterpart of the common way of composition invoked at run time. When iostreams code conversion was being discussed I also tried to promote the idea - but I think it was seen as too "far out". I suspect that a large part is aesthetic and what people are familiar with. It started out as an experiment with the then new boost iterators. As I progressed I became more enthusiastic about the idea. It was of immense help with implementing the more tedious aspects of the serialization library and has needed almost no maintenance. So I'm very happy with it. In my view it - or something like it - provides the answer to the question which started this thread - whatever that was.
Hmmm - I read this as creating an instance of base64_text - presumably on the stack. I don't see how I could use this to do:
which is what I REALLY wanted to do. Robert Ramey

David Abrahams wrote:
Oh - I only noticed that the word "typedef" was gone. Without looking at exactly what I did in detail and just commenting from a user's perspective - I don't see a big problem. Maybe it's even better in that it is explicit about exactly what we're doing - composing iterator types to create a new iterator type. But then - if I want to use base_64_text somewhere - where does the ::type go? That is: where would base_64_text::type come from? Currently - every composed iterator can be used as a component in another iterator. In fact, in the serialization library I did this a couple of times just to keep the code readable. If STL didn't use pairs of iterators everywhere, we might not even need the typedef - just make a composed construction - but that gets very tedious when one needs more than one of the same type.

FWIW - my main interest really wasn't in converting strings of characters. I've been thinking for a very long time about what I would want in a "views" type library for relational algebra. Hence the name "Dataflow". I would really like to use this to generate at compile time a sequence of operations on data tuples. That is, I would like to have select, join, project, cat, etc. I suppose this is probably very close to what fusion already does - but I haven't looked at it. Most database systems rely on very elaborate precompiled functions composed at runtime. I would hope that using something like the above along with a very good compiler would result in much faster programs for database operations. Heck, I would even consider making a system which compiled each query on the fly. I've become convinced that with templated constructors and the iterators that are already in there (zip, filter, etc.) I'm probably very close to where I wanted to be originally. However, now I'm involved in other kinds of applications so I don't have the motivation to take things to what I see as their logical conclusion. Oh well.

Robert Ramey
participants (6)
- Daniel Mitchell
- David Abrahams
- Jeff Garland
- Joel de Guzman
- Robert Ramey
- Sean Parent