Interest in super string class?

I've been working on a little project where I've had to do lots of string processing, so I decided to put together a string type that wraps up boost.regex and boost.string_algo. I also remember a discussion in the LWG about whether the various string algorithms should be built in or not -- well, consider this a test -- personally I find it easier built into the string than as standalone functions.

You can download it from the string/text processing part of the vault:

http://tinyurl.com/dbcye

Below is the summary and motivating code example.

Enjoy,

Jeff

--------------------------------------------------------------------------

Souped up string class that includes fancy query, replacement, and conversion functions.

This type has the following main goals:

* Is a drop-in replacement convertible to std::string and std::wstring
* Provide case conversions and case insensitive comparison
* Provide white space trimming functions
* Provide split functions to parse a string into pieces based on string or regex
* Provide sophisticated text replacement functions based on strings or regex
* Provide append and insert functions for types

Overall, this class is mostly a convenience wrapper around functions available in boost.string_algo and boost.regex.
This is best illustrated with some code:

    super_string s(" (456789) [123] 2006-10-01 abcdef ");
    s.to_upper();
    cout << s << endl;

    s.trim(); //lop off the whitespace on both sides
    cout << s << endl;

    double dbl = 1.23456;
    s.append(dbl); //append any streamable type
    s += " ";
    cout << s << endl;

    date d(2006, Jul, 1);
    s.insert_at(28, d); //insert any streamable type
    cout << s << endl;

    //find the yyyy-mm-dd date format
    if (s.contains_regex("\\d{4}-\\d{2}-\\d{2}")) {
      //replace parens around digits with square brackets [the digits]
      s.replace_all_regex("\\(([0-9]+)\\)", "__[$1]__");
      cout << s << endl;

      //split the string on white space to process parts
      super_string::string_vector out_vec;
      unsigned int count = s.split_regex("\\s+", out_vec);
      if (count) {
        for (unsigned int i = 0; i < out_vec.size(); ++i) {
          out_vec[i].replace_first("__",""); //get rid of first __ in string
          cout << i << " " << out_vec[i] << endl;
        }
      }
    }

    //wide strings too...
    wsuper_string ws(L" hello world ");
    ws.trim_left();
    wcout << ws << endl;

Expected output is:

    (456789) [123] 2006-10-01 ABCDEF
    (456789) [123] 2006-10-01 ABCDEF
    (456789) [123] 2006-10-01 ABCDEF1.23456
    (456789) [123] 2006-10-01 2006-Jul-01 ABCDEF1.23456
    __[456789]__ [123] 2006-10-01 2006-Jul-01 ABCDEF1.23456
    0 [456789]__
    1 [123]
    2 2006-10-01
    3 2006-Jul-01
    4 ABCDEF1.23456
    hello world
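For readers without the vault download at hand, here is a minimal sketch of what such a convenience wrapper can look like. It is not the super_string implementation: the class name mini_string and the member append_value are hypothetical, std::regex (C++11) stands in for boost.regex so that the sketch has no external dependencies, and only a few of the operations from the example are covered.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for super_string; NOT the code from the vault.
class mini_string : public std::string {
public:
    typedef std::vector<mini_string> string_vector;

    mini_string() {}
    mini_string(const char* s) : std::string(s) {}
    mini_string(const std::string& s) : std::string(s) {}

    // Upper-case in place, one char at a time ("C" locale semantics).
    mini_string& to_upper() {
        std::transform(begin(), end(), begin(), [](unsigned char c) {
            return static_cast<char>(std::toupper(c));
        });
        return *this;
    }

    // Remove whitespace from both ends.
    mini_string& trim() {
        const char* const ws = " \t\r\n";
        const size_type first = find_first_not_of(ws);
        if (first == npos) {
            clear();
            return *this;
        }
        const size_type last = find_last_not_of(ws);
        assert(first <= last);  // a non-blank character exists, so this holds
        assign(substr(first, last - first + 1));
        return *this;
    }

    // Append any streamable value (a separate name, so that the overloads of
    // std::string::append are not hidden).
    template <class T>
    mini_string& append_value(const T& value) {
        std::ostringstream os;
        os << value;
        std::string::append(os.str());
        return *this;
    }

    bool contains_regex(const std::string& pattern) const {
        return std::regex_search(str(), std::regex(pattern));
    }

    mini_string& replace_all_regex(const std::string& pattern,
                                   const std::string& fmt) {
        assign(std::regex_replace(str(), std::regex(pattern), fmt));
        return *this;
    }

    // Split on a regex; returns the number of pieces stored in out.
    unsigned int split_regex(const std::string& pattern,
                             string_vector& out) const {
        out.clear();
        const std::regex re(pattern);
        std::sregex_token_iterator it(begin(), end(), re, -1), last;
        for (; it != last; ++it) out.push_back(mini_string(it->str()));
        return static_cast<unsigned int>(out.size());
    }

private:
    const std::string& str() const { return *this; }
};
```

Every member simply forwards to a standard algorithm or regex call, which is the sense in which the class is a convenience layer rather than a new string implementation.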

Jeff Garland <jeff@crystalclearsoftware.com> writes:
I appreciate the convenience of such an interface, I really do, but doesn't this design just compound the "fat interface" problems that std::string already has? Even Python's string, which has a *lot* built in, doesn't try to handle the regex stuff directly. -- Dave Abrahams Boost Consulting www.boost-consulting.com

On Jul 2, 2006, at 9:51 AM, David Abrahams wrote:
I smell Objective-C. In which School of Philosophy is Boost? Perhaps a 'convenience' namespace where "fat" types can live.

Kon Lovett wrote:
I know nothing about Objective-C so any resemblance is coincidental.
In which School of Philosophy is Boost?
Perhaps a 'convenience' namespace where "fat" types can live.
It's against the Boost/C++ way...no doubt. See my reply to Dave for more details. It may never see the light of day in Boost, but I'm finding it useful so it seemed like I should share -- especially since it's been awhile since I've contributed any code outside of date_time. Jeff

David Abrahams wrote:
Yes, that's partially the point :-) I understand std::string is too big for some. Sadly the members it has make it hard to do the things I tend to do most often with strings. The fact is, if you look around at the languages people are using most for string processing, they offer just as many features as super_string and then some. Somehow, programmers are managing to deal with this.

I'd buy more into the fat interface being a problem if something in the string class went beyond string processing, but it doesn't. String processing is a big complex domain -- whole languages have been optimized for it -- it needs a lot of functions to cover the domain and make code easy to read. Any way you slice it the current basic_string is inferior to what most modern languages offer.

Needless to say, I understand all about stl, free functions, their power, etc, etc. But the big thing this misses is that having a single type that unifies the string processing interface means there's a single set of documentation to start figuring out how to do a string manipulation. I don't have to wade thru 50 pages of string_algorithms, 50 pages of regex docs and so on -- there's hundreds of functions to deal with strings there. Not to mention the templatization factor in the docs of these libraries which mostly detracts from me figuring out how to process the string. If I'm a Boost novice much of this great and useful string processing capability might be lost in so many other libraries.

The other thing that gets me is the readability of code. With a built-in function, it's one less parameter to remember when calling these functions. It seems trivial, but I believe the code is ultimately easier to understand. Simple example:

    std::string s1("foo");
    std::string s2("bar");
    std::string s3("foo");

    //The next line makes me go read the docs again, every time
    replace_all(s1,s2,s3); //which string is modified exactly?

or

    s1.replace_all(s2, s3); //obvious which string is modified here

I understand this flies against the current established C++ wisdom, but that's part of the reason I've done it. After thinking about it, I think the 'wisdom' is wrong. Usability and readability have been lost -- my code is harder to understand.

I expect that super_string has little chance of ever making it to Boost because it goes too radically against some of these deeply held beliefs. That said, I think there's a group of folks out there that agree with me and are afraid to speak up. Now they can at least download it from the vault -- but maybe they'll speak up -- we'll see. In any case, it's up to individuals to decide whether to download and use super_string, or continue using their inferior string class ;-)
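To make the two call forms concrete, here is a minimal sketch using only the standard library. The namespace algo and the type wrapped_string are hypothetical names for illustration; the free function follows the argument order of boost::algorithm::replace_all (input, search, format), where the first argument is the one modified in place.

```cpp
#include <cassert>
#include <string>

namespace algo {
// Free-function form: the FIRST argument is modified in place.
inline void replace_all(std::string& input, const std::string& search,
                        const std::string& format) {
    assert(!search.empty());  // an empty search string has no sensible meaning
    std::string::size_type pos = 0;
    while ((pos = input.find(search, pos)) != std::string::npos) {
        input.replace(pos, search.size(), format);
        pos += format.size();  // continue after the text just inserted
    }
}
}  // namespace algo

// Member form: a hypothetical thin wrapper that forwards to the free
// function, so the modified object is always the one left of the dot.
struct wrapped_string : std::string {
    wrapped_string(const char* s) : std::string(s) {}
    wrapped_string& replace_all(const std::string& search,
                                const std::string& format) {
        algo::replace_all(*this, search, format);
        return *this;
    }
};
```

Both forms do the same work; the difference under discussion is purely which one a reader can decode at the call site without the documentation.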
Even Python's string, which has a *lot* built in, doesn't try to handle the regex stuff directly.
There are plenty of counter examples: Perl, Java, Javascript, and Ruby that build regex directly into the library/language. It's very powerful and useful in my experience. And, of course, super_string doesn't take away anything, just makes these powerful tools more accessible and easier to use. Jeff

Jeff Garland wrote:
I am with you, Jeff. I do not think that std::string is too fat but only that a design mistake was made with it. The mistake is that after specifying a std::string constructor that takes a C null-terminated string ( const char *), which std::string has, all other functionality dealing with a string should have been in terms of std::string, and nothing else should have been in terms of a C null-terminated string. This is the principle of making a clear interface which has a single good way of doing things, rather than a muddy interface with numerous ways of doing the same thing. Other than this design mistake, no doubt unfortunately done to cater to the C crowd, std::string is fine for what it does and is not too fat at all.
I will speak up. The passion for loosely coupled free functions has gone too far. It works when there is a reason for it, usually because it is a function template and must deal with different types, ala the algorithms in the C++ standard library, but is not a solution for all situations. I am for a rich string class and think that super string is the right idea. My only difference is that I want a string class to only deal in C++ std::strings at all times, once a constructor has been provided for converting a null-terminated C string into the string class, in order to make the interface much cleaner and clearer.

On Sun, 02 Jul 2006 11:51:32 -0700, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
Hi Jeff, just to be sure I understood: *the* two reasons for such a class are: (a) helping novices (b) having a cleaner (whatever that means) syntax. Sorry if I'm missing something or oversimplifying the issue. It seems to me that (a) is a secondary point, as being a novice is something deemed to disappear asap if you want to seriously program in C++; (b) is very nicely obtained with Shunsuke's proposal, which seems to get the best of the two worlds: power and syntactical convenience (BTW, Shunsuke, is the code in the vault?). Other points? -- [ Gennaro Prota, C++ developer for hire ] [ resume: available on request ]

Gennaro Prota wrote:
Well that's too bad indeed, but I think you're wrong on this. For one thing, the committee is working on changes for C++0x to fix long-standing issues in C++ and eliminate common mistakes made by novices. Even so, I don't care *that much* about novices. I just want to be able to write code that *I* can understand when I read it 3 months later without rereading the string_algo and regex docs.
is very nicely obtained with Shunsuke's proposal, which seems to get the best of the two worlds: power and syntactical convenience
I'm unconvinced from what I've seen so far.
(BTW, Shunsuke, is the code in the vault?).
Seems unlikely given that it's a 'future proposal'. The only range algorithms stuff I know of in the vault was done by Eric N.
Other points?
Sure. Here's a sanitized fragment of a pretty typical form for a perl script I've probably written 100 times:

    open(IN, "<$filename") || die "File not found $filename\n";
    $count = 0;
    while ($line = <IN>) {
      chomp($line);
      $count++;
      if ($line =~ /^\*\*\*/) {
        $line =~ s/^\*\*\*//;
        $line .= " line: " . $count;
        do_something1($line);
      }
      elsif ($line =~ /^\*\*/) {
        $line =~ s/^\*\*//;
        do_something($line);
      }
      #...lots more...
    }

I believe I can trivially rewrite this using super_string in C++. Due to the extra escape the regex's are harder to understand, but with some comments it should be easier to see what is happening than in the perl:

    std::ifstream infile(...);
    super_string line;
    unsigned int count = 0;
    while (std::getline(infile, line)) {
      count++;
      line.trim_right();
      if (line.contains_regex("^\\*\\*\\*")) {
        line.replace_regex("^\\*\\*\\*", "");
        line.append(" line: ").append(count);
        do_something1(line);
      }
      else {...

I believe with some comments about what the regex's do, mere mortals (or Java programmers ;-) can read and understand this code. Why should it be any other way?

Jeff
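For comparison, the same loop can be sketched with nothing but the standard library. This is an illustration only, under a few assumptions: std::regex (C++11) replaces the super_string members, the function name process_lines is hypothetical, and because the do_something handlers are not shown in the thread the results are collected in a vector instead of being passed on.

```cpp
#include <cassert>
#include <istream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Mirrors the Perl fragment: a leading "***" is stripped and the line number
// appended; a leading "**" is just stripped; other lines are ignored.
inline std::vector<std::string> process_lines(std::istream& in) {
    static const std::regex three_stars("^\\*\\*\\*");  // regex: ^\*\*\*
    static const std::regex two_stars("^\\*\\*");       // regex: ^\*\*
    const std::regex_constants::match_flag_type first_only =
        std::regex_constants::format_first_only;

    std::vector<std::string> handled;
    std::string line;
    unsigned int count = 0;
    while (std::getline(in, line)) {  // getline already drops the newline
        ++count;
        if (std::regex_search(line, three_stars)) {
            std::ostringstream os;
            os << std::regex_replace(line, three_stars, "", first_only)
               << " line: " << count;
            handled.push_back(os.str());
        } else if (std::regex_search(line, two_stars)) {
            handled.push_back(
                std::regex_replace(line, two_stars, "", first_only));
        }
    }
    assert(handled.size() <= count);  // at most one result per input line
    return handled;
}
```

Feeding it the three lines "***alpha", "plain" and "**beta" yields two results, "alpha line: 1" and "beta".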

Jeff Garland <jeff@crystalclearsoftware.com> writes:
Agreed.
Try "python -c help(str)" I think python may hit a sweet spot for where to push functionality outside of member functions.
Iterators do :)
Yes, in part because there's so much interface redundancy.
I agree that dealing with CharT and traits over and over again is handily beaten by having it encoded, once, into the string type.
Yes, I used to make that argument regularly and I still agree with it.
Well, let's not make this political before we have to, OK? :)
Not sure that's a good example if you're going for readability; Plus, it has special operators that help (and could in principle be implemented as free functions).
Java, Javascript, and Ruby that build regex directly into the library/language.
Whoa there. Python builds regex directly into the library too. That doesn't mean it should be part of the string. python -c "import sre;help(sre)"
I agree with the idea in principle; I just want to scrutinize its execution a bit before we all buy into it as proposed ;-) -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
Looks like I'm still missing a couple functions ;-) BTW, Is it true that strings in python are immutable?
Fair enough -- it's the reason I've tried to explain my position in terms of code whenever possible.
I thought the c++ was pretty readable. As for free functions, I'm not opposed, but so far I haven't seen a proposal that makes the code clearer to my eye.
I wasn't trying to suggest that Python didn't support regex. I mostly leave Python I still haven't written any significant programs in it. One day...
How does that phrase go...it'll be a cold summer day in Az (a balmy 105 now) before we 'all' agree ;-) Jeff

Jeff Garland <jeff@crystalclearsoftware.com> writes:
And by the same measure, you have some extras (no regexps in python strings).
BTW, Is it true that strings in python are immutable?
Yes it is.
Well, Shunsuke Sogame's proposed interface (which echoes some work Eric Niebler has already done) works pretty well. I don't think that concatenating string operations with '.' is vastly better than using '|', and the former comes with some attendant disadvantages that have been detailed elsewhere in this thread.
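As a rough illustration of the '|' style being discussed, here is a minimal sketch. It is not Shunsuke Sogame's or Eric Niebler's code: the tag types are hypothetical, and each step eagerly returns a new std::string rather than a lazy range, which keeps the example short but is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical adaptor tags; the names only echo the example in the thread.
struct to_upper_tag {};
struct to_lower_tag {};
const to_upper_tag to_upper = to_upper_tag();
const to_lower_tag to_lower = to_lower_tag();

// Each operator| takes its left operand by value and returns the transformed
// copy, so the original string is never modified.
inline std::string operator|(std::string s, to_upper_tag) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return s;
}

inline std::string operator|(std::string s, to_lower_tag) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Usage: operations read left to right, as in a shell pipeline.
inline void pipe_demo() {
    const std::string src("Hello");
    const std::string dst = src | to_upper | to_lower | to_upper;
    assert(dst == "HELLO");
    assert(src == "Hello");  // the source is untouched
}
```

New adaptors can be added as further free operator| overloads without touching the string class, which is the extensibility argument made for this style.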
Didn't think you were; you seem to be missing my point, which is that there are lots of ways to get the functionality into the library. It doesn't necessarily have to be directly attached to the string class.
Seems like you're deflecting rather than engaging, which is disappointing. There are lots of important questions here; I'm glad you proposed this interface and thereby raised them. I just wish we could have a more complete exploration of the solution space, which -- especially where C++ is concerned -- is still largely uncharted. -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
True, but for my work I need regex and I've cited other languages where it is integrated.
I guess I have yet to be convinced of the supposed disadvantages. super_string is just as extensible as basic_string. I can still take advantage of Shunsuke's code. But, I think there was actually some agreement that if we surveyed programmers more would understand form #1 without reading the docs than #2.

    1) s.replace_all(s2, s3)
    2) replace_all(out(s1), in(s2), in(s3));

Don't get me wrong, I think the new range interface is a major usability advance -- well at least for unix programmers. He actually threw me off a bit with this example:

    std::string dst(...)
    range_copy(rng|to_upper|to_lower|to_upper, dst);

which didn't make sense (apparently it was a joke). Still, I'm pretty sure in a code reading contest more programmers would understand this:

    dst.to_upper().to_lower().to_upper();

as obviously ridiculous code.
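For form #2, the out() and in() markers can be sketched as small wrapper types that make the role of each argument visible at the call site. The thread does not show their definitions, so the struct names and layout here are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag wrappers; each only carries a pointer to its string.
struct out_string { std::string* target; };
struct in_string { const std::string* value; };

inline out_string out(std::string& s) {
    out_string o = { &s };
    return o;
}
inline in_string in(const std::string& s) {
    in_string i = { &s };
    return i;
}

// Only the argument wrapped with out() can be modified.
inline void replace_all(out_string dst, in_string search, in_string format) {
    assert(!search.value->empty());
    std::string& text = *dst.target;
    std::string::size_type pos = 0;
    while ((pos = text.find(*search.value, pos)) != std::string::npos) {
        text.replace(pos, search.value->size(), *format.value);
        pos += format.value->size();  // continue after the inserted text
    }
}
```

A call then reads replace_all(out(s1), in(s2), in(s3)); passing a plain std::string where a wrapper is expected does not compile, so the roles cannot be mixed up silently.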
No, I got it, but I don't know what to do with it. Python chose to leave it out of the string class -- Java chose to provide a simplified interface (just to be clear, there's a Pattern class in Java that works with string). In Perl it's helped along by language features, but it's essentially built-in using operators. So it's a mixed bag as far as how people have chosen to build these functions. Overall though, I didn't see the motivation for leaving Regex out if I'm going to add other functions to the string type. There's no user cost if they don't use Regex other than processing of a few extra includes on compile. So my take is that it's confusing to introduce a new string type and then make half the functions free functions. And some of the regex functions are the 3 parameter case discussed above.
I think I've been plenty engaged. And we can explore the solutions all we want. It's just a fact that there's some group of people that will never agree with me. Jeff

Jeff Garland wrote:
Me, I prefer the immutable and functional approach: to_upper(to_lower(to_upper(rng))) Yeah, for the record, I never liked the use of the |. With the plain functional syntax, the joke is very clear. For the record, I don't like fat everything-but-the-kitchen-sink interface too. Same as I dislike mutating functions. Sorry, Jeff. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

On Wed, 05 Jul 2006 22:01:55 +0800, Joel de Guzman <joel@boost-consulting.com> wrote:
It could be a joke or a leftover from a previous example where the OP used three different things, such as to_upper, to_lower and trim, just to change them at the latest moment. But it might be not: the question of what the result of to_upper(to_lower(to_upper(rng))) is supposed to be is not trivial at all. -- [ Gennaro Prota, C++ developer for hire ] [ resume: available on request ]

Gennaro Prota wrote:
The result will be a view. Then, if it's done correctly, ideally, the view transformations can collapse the view. IOW, this:

    to_lower_view<to_upper_view<range> >

will be optimized to:

    to_lower_view<range>

and:

    to_upper_view<to_lower_view<range> >

will be optimized to:

    to_upper_view<range>

Hence, the entire

    to_upper(to_lower(to_upper(rng)))

will be collapsed to:

    to_upper(rng)

Regards,
--
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net
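A minimal sketch of what "the result will be a view" can mean in code follows. The names string_source and char_view are hypothetical, the view handles single chars only, and no collapsing of nested views is attempted; it merely shows that nothing is copied or changed until characters are actually read.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Non-owning handle to an existing std::string (the base of a view chain).
struct string_source {
    const std::string* s;
    std::size_t size() const { return s->size(); }
    char operator[](std::size_t i) const { return (*s)[i]; }
};

typedef char (*char_fn)(char);

// Lazy view: holds its cheap-to-copy source by value and applies fn only
// when a character is read.
template <class Source>
class char_view {
public:
    char_view(const Source& src, char_fn fn) : src_(src), fn_(fn) {}
    std::size_t size() const { return src_.size(); }
    char operator[](std::size_t i) const {
        assert(i < size());
        return fn_(src_[i]);
    }
    // Materialize the view into an owning string.
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::size_t i = 0; i < size(); ++i) out.push_back((*this)[i]);
        return out;
    }
private:
    Source src_;
    char_fn fn_;
};

inline char upper_char(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}
inline char lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// The string must outlive every view built on it (do not pass temporaries).
inline char_view<string_source> to_upper_view(const std::string& s) {
    string_source src = { &s };
    return char_view<string_source>(src, &upper_char);
}
inline char_view<string_source> to_lower_view(const std::string& s) {
    string_source src = { &s };
    return char_view<string_source>(src, &lower_char);
}
template <class Source>
char_view<char_view<Source> > to_upper_view(const char_view<Source>& v) {
    return char_view<char_view<Source> >(v, &upper_char);
}
template <class Source>
char_view<char_view<Source> > to_lower_view(const char_view<Source>& v) {
    return char_view<char_view<Source> >(v, &lower_char);
}
```

Because the view reads through to the source on every access, a later change to the underlying string is visible through the view, which is one of the costs and hazards raised elsewhere in the thread.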

On Wed, 05 Jul 2006 22:35:47 +0800, Joel de Guzman <joel@boost-consulting.com> wrote:
Yes, I know. What I was hinting at is that not all languages have "uppercase" and/or "lowercase". And in those that have there's no guarantee of one-to-one mapping, so that to_lower or to_upper aren't in general invertible functions, or even left-invertible functions. -- [ Gennaro Prota, C++ developer for hire ] [ resume: available on request ]

Sebastian Redl wrote:
Right! All the more reason to go for the functional non-mutating approach. In many (most?) cases, you'll end up with a new transformed string. In-situ conversions/transformations seem to be less common in reality, especially when dealing with different encodings and different string types. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

On Wed, 05 Jul 2006 22:35:47 +0800 Joel de Guzman <joel@boost-consulting.com> wrote:
That's a whole bunch of assumptions, both on the compiler and the developer, not to mention the semantics of each "view." I've used views on numerous occasions, and while they give the appearance of "no cost" they do, quite often, incur significant overhead. Both for the view abstraction itself, and the shenanigans behind the scene when the real underlying data changes. Hey, they are great for some things, but I do not see how they can be the endall that eliminates mutable interfaces...

Jody Hagins wrote:
Maybe. But unless you provide some benchmarks and numbers, what you are hinting at has no weight, as far as I'm concerned. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

I'm the one asking YOU for reasons why immutable interfaces should be used over mutable ones ;-) You are the one making a proposal in favor of immutable interfaces. Aside from being able to represent the functional paradigm in C++, what are the benefits from your point of view? I've read several papers, and I've used several similar libraries, but I want your experience stories because I respect your talent and experience. Why should the interfaces for string (or other boost libraries) be immutable? I am unaware of an immutable implementation that provides transparent performance cost. I'd be most interested if you know of one.

BTW, just for a fun story, working in the kernel and in certain parts of the government, I've learned that benchmarks mean almost nothing, unless you REALLY know all the details of what is going on. One of my first jobs as a kernel developer was to improve performance of our scheduler. I was able to find a few clever optimizations, but the best one was to "configure" the system for the type of operations being run and change the scheduling policy to something that favored the "recognized pattern" which ended up putting our product at the very top of every price/performance index available. Special code for tests is against the rules, and we certainly did not do that. However, you are allowed to configure the kernel, in any documented manner. You are also allowed to "recognize" certain patterns and change behavior in response to those execution patterns. Recognizing the interesting characteristics of those benchmarks was pretty simple (and could reasonably be justified as recognizing a class of program characteristics common to industry usage).

Jody Hagins wrote:
I'm not making a proposal. I just stated my preference when I said "Me, I prefer the immutable and functional approach". Now, it might be worthwhile to invest some time in defending that. Maybe I would, when time permits, but certainly not to someone who starts going "Blech" the very moment he reads "immutable" or "functional". ;) Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

On Thu, 06 Jul 2006 07:43:41 +0800 Joel de Guzman <joel@boost-consulting.com> wrote:
Ah... then I misrepresented my position yet again. I was hoping that FC++ would be accepted, and I even tried to get Brian interested in applying here (though we don't do much in the center of his tool interest). I am definitely NOT opposed to either immutable interfaces or functional methodologies on a fundamental basis. I am, however, opposed to the idea of replacing mutable interfaces in C++, which have a long track record of success, with immutable ones, with nothing more than something akin to "I like them better." Furthermore, I am opposed to replacing ANY existing good practice with another without good reasoning as to why it is better.

That's one reason I like C++ so much. There are lots of areas in which to specialize, but you have a huge solution space in which to work. You are never left with just one way to do things.

I was weaned on lisp, and moved to smalltalk. I haven't played in that world for more than 15 years now, though. My memory isn't all that great, but at times I wish I could go back to the smalltalk world. I'm sure by now they have addressed the terrible performance (I actually heard that a major electric utility company on the east coast of the U.S. has their entire operation written in smalltalk).

I'm genuinely interested in your response. I have a great deal of respect for your work and opinion. If you think back a bit, you will remember that I tried to hire you as well, before you joined Boost consulting. Oh well, I guess I've wandered off topic sufficiently to take further stuff to private email (unless you really do want to share your reasoning for immutable interfaces -- you seem to have several people on this list who share your opinion -- I truly am interested).

Jody Hagins wrote:
Because it is desirable to have cheap string copying. The most effective solution is to use a reference counted string with copy-on-write. AFAIK, no one has come up with a thread safe implementation for a COW std::basic_string; its interface is too leaky (operator[] returns a bare char&, for example). Although GNU C++ std::basic_string is reference counted and COW, they say it is not thread safe in some corner cases.

http://groups.google.com/group/comp.lang.c++.moderated/msg/0aa848d97e205d72

Cheap copy is the only motivation behind an immutable string. One implementation has been reported to perform well as a drop-in replacement for std::basic_string.

http://article.gmane.org/gmane.comp.lib.boost.devel/110029
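The "leaky interface" point can be shown with a toy reference-counted string. This is a deliberately naive, single-threaded sketch; the class cow_string and the function leak_demo are hypothetical and do not describe how any real standard library implements std::basic_string.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Toy copy-on-write string: copies share one buffer until someone writes.
class cow_string {
public:
    explicit cow_string(const std::string& s)
        : data_(std::make_shared<std::string>(s)) {}

    char operator[](std::size_t i) const {
        assert(i < data_->size());
        return (*data_)[i];
    }

    // Mutable access un-shares first, but the returned char& then escapes
    // the class and can outlive the moment at which the check was made.
    char& operator[](std::size_t i) {
        assert(i < data_->size());
        if (data_.use_count() > 1)
            data_ = std::make_shared<std::string>(*data_);
        return (*data_)[i];
    }

    const std::string& str() const { return *data_; }

private:
    std::shared_ptr<std::string> data_;
};

inline void leak_demo() {
    cow_string a("abc");
    char& r = a[0];    // a is unshared here, so r points into a's buffer
    cow_string b = a;  // b now shares that same buffer
    r = 'X';           // write through the old reference...
    assert(a.str() == "Xbc");
    assert(b.str() == "Xbc");  // ...and the "copy" b has changed too
}
```

An immutable string avoids the problem by construction, since no char& into the shared buffer is ever handed out.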

Joel de Guzman wrote:
Hmm, I'm not convinced of that.

    some_container tmp(to_upper_view(rng));
    to_lower_view(tmp);

would return something different than

    to_lower_view(to_upper_view(rng))

in languages which do not have a homomorphous characterwise mapping between lowercase and uppercase spelling (e.g. German: ß -> SS, but S -> s). For the same reason, a locale-aware to_upper_view would not be as trivial as one would think at first. (Not overly complex, either, probably). The collapsing views would produce a better quality output in the case of ß, but I'm not sure the behaviour of compile-time composition should differ from the behaviour of the run-time composition.

Regards,
m

Send instant messages to your online friends http://au.messenger.yahoo.com
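The ß case can be checked with a toy mapping. This sketch assumes UTF-8 input and handles only ASCII letters plus U+00DF; the function names are hypothetical and it is in no way a real locale-aware implementation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy full-string upper-casing over UTF-8: ASCII letters plus one special
// case, U+00DF (sharp s, bytes C3 9F), whose upper-case form is "SS".
inline std::string upper_full(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        const bool sharp_s = c == 0xC3 && i + 1 < in.size() &&
                             static_cast<unsigned char>(in[i + 1]) == 0x9F;
        if (sharp_s) {
            out += "SS";
            ++i;  // skip the second byte of the sequence
        } else if (c < 0x80) {
            out += static_cast<char>(std::toupper(c));
        } else {
            out += in[i];  // other non-ASCII bytes pass through unchanged
        }
    }
    return out;
}

// Toy lower-casing: ASCII only, so "SS" becomes "ss", never sharp s again.
inline std::string lower_full(const std::string& in) {
    std::string out(in);
    for (std::size_t i = 0; i < out.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(out[i]);
        if (c < 0x80) out[i] = static_cast<char>(std::tolower(c));
    }
    return out;
}

inline void sharp_s_demo() {
    const std::string word = "Stra\xC3\x9F" "e";  // "Strasse" with sharp s
    assert(upper_full(word) == "STRASSE");
    assert(lower_full(upper_full(word)) == "strasse");  // two-step result
    assert(lower_full(word) == "stra\xC3\x9F" "e");     // "collapsed" result
    assert(lower_full(upper_full(word)) != lower_full(word));
}
```

Under this mapping, replacing to_lower(to_upper(x)) with to_lower(x) changes the output, so the collapse is an observable change in behaviour and not just an optimization.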

Martin Wille wrote:
Yes, indeed. I realized that after some folks pointed that out. So collapsing to_upper/to_lower is not a good idea. But then, yes, the German ß -> SS also implies that a universal to_lower/to_upper algorithm cannot be efficiently implemented as a mutating/in-place function (i.e. the string will have to grow/shrink). When such memory movement happens, the advantage of in-place mutation is gone. Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Joel de Guzman wrote:
Martin Wille wrote:
True. I didn't intend to object to the functional model, anyway. (In fact, I like it a lot and I'd prefer it most of the time.)

IMHO, the ongoing discussion highlights some fundamental disagreement on what a string is/should be. There's an immutable string with free functions group and a fat-interface string group. Either approach looks wrong to me when taken alone.

String handling is the thing Java got basically right, IMHO. Java distinguishes between a string class (immutable strings) and a string builder class. Typically, immutable string classes are memory efficient because they allow for sharing instances easily and time efficient because they allow for easy passing of the strings into and out of functions. String builders focus on modifying contents and intend to save memory and execution time and to avoid memory fragmentation by avoiding repeated allocation/copy/disposal sequences (even with views, you'll sometimes need to save intermediate results and to work from there in order to gain performance).

std::string tries to be both and that led to a fat and somewhat confusing interface that is hard to implement efficiently.

The proposed super string could be a nice string _builder_, while the proposed functional/view interface could operate nicely on an immutable string class, std::string and on the string builder. Once we distinguish between a string builder and a string, most of the confusing aspects of having two ways of doing the same thing go away. Offering a generic, uniform approach to string handling, free functions and views (supposedly) can work on any kind of string representation, while the string builder can focus on efficient in-place transformations.

Use-cases in favor of either approach can be made. Consequently, there's room for both suggested ways of dealing with character sequences and for the libraries supporting them. What we need in order to avoid confusion is to communicate clearly what the intent of the classes or views/functions is.
So if we add the suggested super string then PLEASE do not name it 'string', but 'string_builder' or 'string_buffer' in order to emphasize the in-place modification aspect in the name. Of course, there should be a complement to string_builder: immutable_string. (ISTR there was a proposal for that, already).

Regards,
m
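The split between an immutable string and a builder can be sketched in a few lines. The class names follow the suggestion above, but the members (freeze, shares_buffer_with and so on) are hypothetical; this is neither the proposed boost::const_string nor super_string.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Immutable value: copies share one buffer and nothing can modify it.
class immutable_string {
public:
    explicit immutable_string(std::string s)
        : data_(std::make_shared<std::string>(std::move(s))) {}
    const std::string& str() const { return *data_; }
    std::size_t size() const { return data_->size(); }
    bool shares_buffer_with(const immutable_string& other) const {
        return data_ == other.data_;
    }
private:
    std::shared_ptr<const std::string> data_;
};

// Mutable builder: all in-place edits happen here; freeze() then hands the
// finished buffer over to an immutable_string and leaves the builder empty.
class string_builder {
public:
    string_builder& reserve(std::size_t n) {
        buf_.reserve(n);
        return *this;
    }
    template <class T>
    string_builder& append(const T& value) {  // any streamable type
        std::ostringstream os;
        os << value;
        buf_ += os.str();
        return *this;
    }
    immutable_string freeze() {
        immutable_string result(std::move(buf_));
        buf_.clear();  // a moved-from string is unspecified; reset it
        return result;
    }
private:
    std::string buf_;
};

inline void builder_demo() {
    string_builder b;
    b.reserve(32).append("line: ").append(42);
    const immutable_string a = b.freeze();
    const immutable_string copy = a;  // cheap: shares the buffer
    assert(a.str() == "line: 42");
    assert(copy.shares_buffer_with(a));
}
```

Each class has one role: the builder never needs to be shared, and the immutable value never needs to un-share, which is the separation of concerns argued for above.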

Martin Wille wrote:
They had no choice but to get it right. Java is reference-based and has no const. The only way to provide value semantics is to make the values immutable. If they aren't, you have to defensively clone them whenever you are returning a field (otherwise the caller may modify your state through its reference), which is extremely error-prone. They managed to get Date wrong, though. It's mutable.

On Thu, 06 Jul 2006 11:08:12 +0200 Martin Wille <mw8329@yahoo.com.au> wrote:
I think you just nailed my thoughts, but I incorrectly aimed them at "prefer immutable." I tend to lean toward the side of providing flexible SAFE interfaces, rather than restricting them. I'm not against immutable. I am against providing ONLY immutable interfaces. I'm not against mutable. I am against providing ONLY mutable interfaces. Most of us probably agree with the latter, probably a fair number of us disagree on the former. No, I'm not in favor of fat interfaces, but I am in favor of providing safe, efficient, flexible interfaces. I think your point about std::string is very important, and may be a better description of "fat" class. What makes a class "fat?" I do not think it is the number of functions (member and free support) provided in the interface, but the number of "roles" the interface attempts to play. Anytime an interface attempts to do "too much" you are asking for trouble. Restrict the job that an interface is trying to do, but allow flexible ways to accomplish that job. Hmmm. Maybe now I'm clear as mud...

Jody Hagins wrote:
Don't most of the advantages of immutability (representation sharing in its various forms, especially in the presence of multiple threads) disappear when you provide mutable accessors as well?
I think your point about std::string is very important, and may be a better description of "fat" class. What makes a class "fat?"
The number of functions that need (or have - pick your definition) access to its private members, probably.

"Peter Dimov" <pdimov@mmltd.net> writes:
Yes. I think the idea is that you have an immutable string for most purposes, and some sort of mutable "string creation buffer" that allows time-critical processing to occur in a different context. -- Dave Abrahams Boost Consulting www.boost-consulting.com

On 7/6/06 11:10 AM, "Jody Hagins" <jody-boost-011304@atdesk.com> wrote: [SNIP]
I understand. I even have an example from the standard. The stringstream classes are supposed to be the new way to handle in-memory streams. They are supposed to replace the now-deprecated strstream. However, the old class has three mutually-exclusive modes of operation, but the new class only replaces the most(?) common case, a dynamically-allocated buffer. There currently isn't an updated replacement for the static buffer or static constant buffer modes. The problem was in strstream being overloaded with three classes' worth of interface. (That reminds me, I [or someone else] has to propose the two replacement class template sets for the next TR/Standard cycle.) -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com
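To show what a dedicated class for the static buffer case might look like, here is a minimal sketch built on std::streambuf. The class name fixed_buf and its members are hypothetical; this is not a proposed standard interface, only an illustration that the fixed-buffer role fits in a small class of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>

// Output stream buffer over a caller-supplied fixed array: no dynamic
// allocation takes place, and writes beyond the end of the array fail.
class fixed_buf : public std::streambuf {
public:
    fixed_buf(char* buffer, std::size_t size) {
        assert(buffer != 0 && size > 0);
        setp(buffer, buffer + size);  // put area = the caller's array
    }

    // Number of characters written so far.
    std::size_t written() const {
        return static_cast<std::size_t>(pptr() - pbase());
    }

    // Copy of the characters written so far.
    std::string str() const { return std::string(pbase(), pptr()); }

    // overflow() is deliberately not overridden: the default returns eof,
    // so the owning stream reports failure once the array is full.
};
```

A stream is attached with std::ostream os(&buf); once the array is full, further output leaves the stream in a failed state instead of reallocating.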

"Martin Wille" wrote:
One possibility is to use Boost.Const String (should be sitting somewhere in the review queue) so that:

    super_string<char, std::basic_string<char> >

would provide the mutable interface and

    super_string<char, boost::const_string<char> >

would be limited to immutable operations. The const_string variant may also avoid the basic_string overhead.

/Pavel

Pavel Vozenilek wrote:
Frankly, I don't see the benefit of that. std::string and const_string are already working string implementations. Why add a wrapper around them? The suggested wrappers still have different types and even different interfaces. The latter would certainly be confusing. A wrapping approach would make sense if (raw) buffers could be wrapped: E.g. super_string< vector<char> >, or super_string< boost::array<...> > or super_string< char[N] > or even super_string< view<...> > super_string< char[N] > would have the advantage that it can operate entirely on the stack which helps reduce the dynamic allocations (which is the purpose of a string buffer). Regards, m

Pavel Vozenilek wrote:
After some further experimentation and consideration I've

1) created a const_super_string variant that is immutable (will upload soon)
2) dropped the second template parameter for the base string type

The immutable form is, as you suggested, based on the proposed boost::const_string. It provides only const functions and is thread safe -- while the mutable form is not. In the mutable form I've removed the *_copy functions to reduce the 'fatness' of the interface.

I dropped the string_type template parameter after actually doing the implementation of const_super_string. The reality is that the implementation relies on the underlying string type and you can't just drop in a new base string type easily. The likelihood of using that parameter seems rather remote. And the extra parameter complicates the documentation and the usage interface -- of which the goal is to make as easy as possible.

Jeff

Martin Wille wrote:
Joel de Guzman wrote:
Martin Wille wrote:
Catching up on a few things.
I mostly agree with your points, except that it lives in denial of the existence and experience of programmers with basic_string. As much as we have issues with the approach it's there and it's used heavily. I need to inter-operate with it. My approach is to build some additional capabilities onto it. Clearly that's not going to please everyone, but I think it's a reasonable approach.
I'm not going to change the name because I'm not planning on changing the approach. It's a derivative of basic_string -- that makes it a string. In the next version there will be an experimental const_super_string which is an immutable version built on the proposed boost::const_string. My goal is to extend the standard library in a useful way, not pitch it out and rewrite it completely. Jeff

Joel de Guzman wrote:
Of course the issue with this is always the efficiency.
For the record, I don't like fat everything-but-the-kitchen-sink interface too. Same as I dislike mutating functions. Sorry, Jeff.
To be clear, I'm not opposed to an immutable super_string -- I agree that would make things nicer. However, I don't have the time or inclination to write the core of a string class (is there a boost.const_string somewhere?). The other thing I value highly is working seamlessly with existing code that uses std::string. Jeff

Jeff Garland wrote:
Of course not! These should return views ;) Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

On Wed, 05 Jul 2006 22:01:55 +0800 Joel de Guzman <joel@boost-consulting.com> wrote: Joel, I am really interested to know why you prefer the functional approach in C++, a language which offers both options.
Me, I prefer the immutable and functional approach:
to_upper(to_lower(to_upper(rng)))
Blech. What are the overwhelming reasons for having purely immutable interfaces in C++? If I had both mutable and immutable options available to me, there is no way I'd use the above, unless there was a very good reason. In most cases, "++i" may not be much better than "i++" but I always prefer "++i" unless I have a very good reason for the other option. Sure, premature optimization is bad, but premature pessimization is worse. If you have two readily available and similarly easy to use interfaces, why would I want to use the one that incurs far more overhead? I assume you will answer that you'd prefer to only have the immutable interface available, so, why is that, particularly in C++?
For the record, I don't like fat everything-but-the-kitchen-sink interface too. Same as I dislike mutating functions. Sorry, Jeff.
I'm not in favor of huge interfaces either. However, let's not forget that just because a function is outside the class specification, does not remove it from the class interface.

Jody said:
Personally I'd prefer it to be an option that you can specify at call time. Something like to_upper(ref(s)); t = to_upper(s); Submit whatever word you like for 'ref', it could be 'in_place' or whatever. Of course that design would mandate the functional approach. Although I suppose with some clever proxying you could do in_place(s).replace('a', 'b'); My feeling is that member functions aren't right for algorithms because any extensions to a super_string library can't use the same natural syntax as the library provided functions. For example, I come up with a function String to_title_case(String s); I can't use it in the same way as I can to_upper. s = s.to_upper(); s = to_title_case(s); The inconsistency could be considered a good thing, because it makes it clear which functions are library provided and which are local. But I'm not so keen.
I assume you will answer that you'd prefer to only have the immutable interface available, so, why is that, particularly in C++?
I think both should be available. The "you don't pay for what you don't use" principle applies here I think, if I don't need to make a copy, and in my application it matters, I should be able to avoid it. Sam
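The call-time choice suggested above can be sketched with two free-function overloads. This is only an illustration: the in_place() wrapper and the in_place_ref type are hypothetical names, and std::string stands in for whatever string type is used.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical wrapper that marks a string for in-place modification.
struct in_place_ref {
    std::string& target;
};

inline in_place_ref in_place(std::string& s) {
    in_place_ref r = { s };
    return r;
}

// Copying form: the argument is untouched, the result is a new string.
inline std::string to_upper(const std::string& s) {
    std::string out(s);
    for (std::string::size_type i = 0; i < out.size(); ++i)
        out[i] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(out[i])));
    return out;
}

// In-place form: selected by wrapping the argument at the call site.
inline void to_upper(in_place_ref r) {
    r.target = to_upper(r.target);
}

// Usage: the caller decides per call whether a copy is made.
inline void demo_in_place() {
    std::string s("abc");
    std::string t = to_upper(s);   // s is unchanged
    assert(s == "abc" && t == "ABC");
    to_upper(in_place(s));         // s is modified
    assert(s == "ABC");
}
```

For brevity the in-place overload here still builds a temporary; a real implementation would modify the characters directly to avoid the copy.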

Jody Hagins <jody-boost-011304@atdesk.com> writes:
In C++ or any language, functional interfaces tend to make code clearer, for the same reason that "const" is valuable. When values associated with names change, it is usually harder to keep track of what those names mean. Of course, when immutability starts to interfere with expressive power, it can have the opposite effect, but I have learned to prefer immutability until it proves impractical. Also, it's worth noting that immutable types tend to be more amenable to efficient parallelization, so mutability is not necessarily an efficiency win. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Joel de Guzman wrote:
How about the following with plain syntax? :-)
    some_algorithm(
        file_range<boost::uint8_t>(name) |    // file_iterator range
        utf8_decoded |                        // u8_to_u32_iterator range
        transformed(::newline_cvter()) |      // transform_iterator range
        tab_expanded(::tabsize<>::value) |    // my buggy iterator
        to_upper |
        memoized                              // multi_pass iterator range
    );
-- Shunsuke Sogame

Shunsuke Sogame wrote:
Hmmm... lemme see... I'm not sure if I understand your code precisely, but, maybe something like: pipeline< utf8_decoded , transformed<::newline_cvter> , tab_expanded<::tabsize<>::value> , to_upper , memoized> f; file_range<boost::uint8_t> rng(name); some_algorithm(transform(rng, f)); Regards, -- Joel de Guzman http://www.boost-consulting.com http://spirit.sf.net

Joel de Guzman wrote:
That's not so complicated. (In fact, it is the actual code snippets.) It is the same as the following:
    some_algorithm(
        make_memoize_iterator_range(
            make_to_upper_iterator_range(
                make_tab_expand_iterator_range(
                    make_transform_iterator_range(
                        make_u8_to_u32_iterator_range(
                            file_range<>(name)
                        ),
                        ::newline_cvter()
                    ),
                    ::tabsize<>::value
                )
            )
        )
    );
For example, 'make_transform_iterator_range' returns the pair of 'boost::transform_iterator'. You mean the type re-construction for optimization? If it is possible, it must be exciting! Anyway, the Range proposal seems to provide the plain syntax together. Users have the choice. -- Shunsuke Sogame

"Shunsuke Sogame" wrote:
The super string was intended to be something familiar and easy to use for users. The functional syntax is not common in C++, immutability will be feared as inefficient and the pipe syntax is completely novel. Something like superstring s; s.do_this().do_that(); has a chance to be used in spite of having a large API. Problematic operations such as German ß -> SS -> ss can be handled without problem in this model. /Pavel

On Wed, 05 Jul 2006 06:23:44 -0700, Jeff Garland <jeff@crystalclearsoftware.com> wrote:
Ok. But is it something for *your* work or a generally usable library? ;)
Yes. But call me stupid, that's not necessarily a disadvantage. Programmers should never go without reading docs. If anyone makes a mistake with (2) because he *supposed* the modified string was something he randomly chose, then change the syntax if you like but be sure to change the programmer, as well.
That's in a perfect world. On this planet, your library could fail to compile because a compiler bug prevents the regex headers from compiling, even if you don't use them.
Indeed.
Exactly. Discussions here are never to "reject" or "approve" per se; they just try to improve and *understand* things before they get approved or rejected.
I know how you may feel, believe me. Some time ago I proposed things like 0p as null pointer literal, or a new null keyword (the former being a palliative for the latter); I suggest typed enums and many other things, to always hear one of the same two answers: no new keywords, please; no compelling case for making a language change. You know what happened to all of these. I personally think your proposal is naturally going to have some (or much) opposition but is anyway an idea worth exploring. Just let us "digest" it, so to speak, and try your code on the field. Some months from now our ideas might be different. -- [ Gennaro Prota, C++ developer for hire ] [ resume: available on request ]

"Gennaro Prota" wrote:
Speaking about Regex: it fails only on a few older systems: http://engineering.meta-comm.com/boost-regression/CVS-RC_1_33_0/user/regex.h... Format, by virtue of age, is somewhat better: http://engineering.meta-comm.com/boost-regression/CVS-RC_1_33_0/user/format.... String algos ignore VC6 but otherwise work almost everywhere: http://engineering.meta-comm.com/boost-regression/CVS-RC_1_33_0/user/algorit... The combination of these libraries could reveal hidden flaws, though. /Pavel

Gennaro Prota wrote:
I wrote super_string for my own purposes. My experience writing tools/utilities and other things that process strings (largely in perl) is that regex processing is essential for all but the most trivial cases. I posted it because I thought others might find it useful.
I won't call you names, but I couldn't disagree more. I might just be reading the code not changing it. I'm just trying to understand what it does. Having to read the docs is a distraction.
It isn't necessarily a mistake. When I read #2 I just wonder what in() and out() mean...it distracts me from the purpose of the code.
Regex is in TR1 - it's highly portable. I have to say, I could really care less about people with really bad compilers at this point.
I'm sorry that you think I'm deflecting -- I'm not. I believe I've patiently explained my reasons. I don't expect everyone to agree with the trade-offs I've made. Personally I think this sort of deflecting comment detracts from the real discussion -- let's move on.
This is just bogus....I never stopped discussing the technical merits of anything. I had low expectations for the proposal because it goes against the 'C++ is expert only' and 'obscure code is better' dogma you've espoused and that seems to be so commonly held. But this is going off course again...
Sorry to hear that. I will say, however, language changes are very hard to justify and are very expensive. They impact everyone. Libraries are easier because they are optional -- you don't have to use them. You're free to ignore super_string if you don't like it.
Feel free to digest all you like -- I'm not going anywhere. Jeff

On Wed, 05 Jul 2006 16:09:17 +0200 Gennaro Prota <gennaro_prota@yahoo.com> wrote:
0p or null would certainly help clarify the "null-pointer" issue. I assume you have done something similar to moving code to x86_64? Those pesky "..." interfaces. Based on Stroustrup's recommendation and years of practice, our developers use "0" for null pointers, only to be bitten by passing "0" as a pointer to a vararg function. Oops. On x86_64, ints are 32 bits, and 0 is passed as an int to a vararg function because there is no signature forcing an implicit type promotion. I guess you could say to not use vararg functions, but that is impossible if you also use C libraries. Even if their use is "hidden" away in library code, at some point you have to make the distinction. Worse, the problems result in runtime errors... hopefully core dumps in development and testing. So... the rule now is to use NULL anytime a vararg function is being called... 0 otherwise... until we run into the next problem...
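A small sketch of the vararg pitfall described above. The count_strings() helper is hypothetical; the point is only that nothing in a "..." parameter list converts a literal 0 to a pointer, so the sentinel has to be given a pointer type explicitly. The cast is used here rather than NULL because NULL may itself be defined as a plain integer 0 in C++.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>

// Hypothetical helper: counts the const char* arguments that come
// before a null-pointer sentinel.
inline std::size_t count_strings(const char* first, ...) {
    std::size_t n = 0;
    va_list args;
    va_start(args, first);
    // Every variadic argument is read back as a const char*, so the
    // caller must really have passed pointers, including the sentinel.
    for (const char* p = first; p != 0; p = va_arg(args, const char*))
        ++n;
    va_end(args);
    return n;
}

inline void demo_sentinel() {
    // Safe: the sentinel is a pointer-sized null pointer.
    assert(count_strings("a", "b", static_cast<const char*>(0)) == 2);
    // Not safe: count_strings("a", "b", 0) passes an int through "...",
    // which is the mismatch described above on x86_64.
}
```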

On Wed, 5 Jul 2006 11:09:58 -0400, Jody Hagins <jody-boost-011304@atdesk.com> wrote:
Not in the sense you probably mean (I know of people who use my code on 64-bit platforms though). I've never followed Stroustrup's advice to use "0", because I had a (slight) hope that I could grep for "NULL" some day in the future and replace it with something better: this is from an old post of mine on c.l.c++.m (the message is very long, so I'll directly quote the relevant part here) As to using 0 vs. NULL I prefer NULL (mainly in the hope that I can grep and replace it easily if a true null pointer constant will be introduced in C++) [...] If the language had a true, type-safe, overload-compatible null pointer constant then we could start discussing on real technical basis. The NULL definition was an error. They wanted to prohibit (void*)0, because it implied a special case of implicit conversion from void* to object pointer. But they probably didn't realize they were replacing it with something requiring a special case for 'int and any integral or enumeration type' -> pointer, which is worse. enum { e = 32 }; const int null = !!!!! !!!! !!!! !!!!!! !!!!!!!! ! ! !! !! !! !! !! !! !!!!! !! !! !! !! !!!!!! !! !!!!! !! !! !! !! !! !! ! ! !! !! !! !! !! !! !!!!! !!!! !!!! !!!!!! !! (e >> 6) * ('9'-'1') * sizeof(char); void (*p)() = null;
So... the rule now is to use NULL anytime a vararg function is being called... 0 otherwise... until we run into the next problem...
That doesn't solve the problem if NULL is used from C++. As to nullptr... the current accepted naming style is/was (a) avoiding abbreviations (b) separating different words with underscore. That suggests "null_pointer" but... hey we have shared_ptr, so why not null_ptr? Oh no, we don't like underscores :) -- [ Gennaro Prota, C++ developer for hire ] [ resume: available on request ]

On Wed, 05 Jul 2006 17:56:39 +0200 Gennaro Prota <gennaro_prota@yahoo.com> wrote:
We started using it long ago when using NULL caused compiler errors in some situations with certain compilers. Unfortunately, neither NULL nor 0 provide a true null pointer option. I'm not sure if you were trying, but your post made me laugh... I especially like the "not" art, as I've never seen it before. The closest I've seen to that was a long time ago in an obfuscated code competition where the programmer wrote code to solve a maze, and the code itself was in the shape of a maze. Quite clever, though I think it was back before the "web" when it was easy to spot clever bits... now there is so much chaff, I don't even bother to look for the wheat...

Jeff Garland <jeff@crystalclearsoftware.com> writes:
I'm not saying regex isn't important.
and I've cited other languages where it is integrated.
Yes. There are examples of both.
Only sort of. Noninvasive extensions are relegated to "second class" syntax.
Sure.
Python also has a pattern class that works with string. python -c "import re; help(re.compile(''))" What point are you making here?
In Perl it's helped along by language features, but it's essentially built-in using operators.
Most operators in C++ can be implemented as free functions.
So it's a mixed bag as far as how people have chosen to build these functions.
Exactly.
Overall though, I didn't see the motivation for leaving Regex out if I'm going to add other functions to the string type.
Well, you have to stop somewhere. Regex might be a good place because it represents a rather large, complicated, and rather different batch of functionality from the rest of what one gets from string. And it draws in dependencies that not everyone needs.
There's no user cost if they don't use Regex other than the processing of a few extra includes on compile.
Linking with more libraries, perhaps? -- Dave Abrahams Boost Consulting www.boost-consulting.com

David Abrahams wrote:
Jeff Garland <jeff@crystalclearsoftware.com> writes:
I stand by my original statement: 'super_string is just as extensible as basic_string'. Your comment applies to basic_string as much as super_string. Although I guess there's really nothing stopping someone from deriving from super_string and adding their own nifty functions -- as long as they're stateless. Still, I don't see how the current alternative of having "second class" syntax for all advanced string code is better? And C++ programmers have to deal with the fact that some code may be written in an 'object style' and some in a pure functional style -- that's just life as a C++ programmer.
I'm just clarifying that in Java all the 'Regex functionality' isn't all embedded in the String class -- but there is a relation between the two. super_string is very similar to the JavaString in that respect.
I realize -- and perhaps something could be done to emulate a perl-like syntax. But I think there's plenty of people that would be against that idea as enabling even more obscure code -- me among them. I'm open to being convinced, but I think clear function names will end up being superior.
Nope. basic_superstring is a template, so if you never call a *_regex function you won't need to link in the regex library -- it's a header only dependency. If this weren't the case, I'd agree with you. Jeff
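The template argument above rests on a general C++ rule: a member function of a class template is only instantiated if it is actually used. A minimal sketch, with hypothetical names throughout (basic_mini_string, external_regex_match); the undefined function stands in for a separately compiled library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for a function living in a separately compiled library.
// It is declared but deliberately never defined here, so any real
// use of it would fail at link time.
bool external_regex_match(const std::string& text,
                          const std::string& pattern);

template <typename CharT>
class basic_mini_string {
public:
    explicit basic_mini_string(const std::basic_string<CharT>& s)
        : data_(s) {}

    std::size_t length() const { return data_.size(); }

    // Instantiated only if some code calls it; until then no reference
    // to external_regex_match is generated.
    bool contains_regex(const std::basic_string<CharT>& pattern) const {
        return external_regex_match(data_, pattern);
    }

private:
    std::basic_string<CharT> data_;
};

inline void demo_lazy_instantiation() {
    const std::string text("abc");
    basic_mini_string<char> s(text);
    // contains_regex() is never called, so this links without a
    // definition of external_regex_match.
    assert(s.length() == 3);
}
```

The extra header still has to be parsed, which matches the "few extra includes" cost mentioned earlier in the thread.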

On Sunday 02 July 2006 03:03, Jeff Garland wrote:
I really like this proposal, but have a minor point to add: IMHO, value classes should have far fewer mutating (non-const) (member) functions. The above two examples are a very good illustration. to_upper() should be const and return the string converted to uppercase, instead of performing the operation on 'this'. Similarly, trim() should be trimmed() and return a trimmed copy of 'this'. This makes it much easier to work with such classes. Qt4's QString has gone a long way towards this style of interface, but still has a lot of annoying exceptions that make me mad most every day. The usual argument against is performance, and I agree performance is important for a general-purpose string class. But what is often overlooked is that the prefer-const-to-mutable-methods-style interface allows many more optimisations than the classical style, if the implementor is willing to go down the expression template path: s = s.replace('&',"&amp;").replace('<',"&lt;").replace('\'',"&apos;"); can be made to execute a lot faster if replace is const than if it is not, b/c it forces the return value of replace() to be evaluated. Maybe I'm talking common wisdom here, it's just something that I found annoying in my everyday work and wanted to share. Feel free to ignore :) Thanks, Marc -- Marc Mutz -- marc@klaralvdalens-datakonsult.se, mutz@kde.org Klarälvdalens Datakonsult AB, Platform-independent software solutions

Marc Mutz wrote:
Well, I went about 50% of the way on this. You'll notice that for some things there are non-modifying variations. basic_super_string trim_copy() const; basic_super_string trim_left_copy() const; basic_super_string trim_right_copy() const; basic_super_string to_lower_copy() const; basic_super_string to_upper_copy() const; I just got tired and didn't add all the variations for replace_xxx. Of course that makes the interface even fatter :-)
I'm no expert in expression templates, but I don't think '.' can be overloaded -- how can ET do anything here?
Maybe I'm talking common wisdom here, it's just something that I found annoying in my everyday work and wanted to share. Feel free to ignore :)
I'll add replace_*_copy versions in the next rev... Jeff

On Sunday 02 July 2006 23:09, Jeff Garland wrote:
Right. Therefore I'd make _copy the standard and not provide mutating variants at all. Who needs trim() and trim_copy() (or trimmed(), as I prefer to use English grammar to convey constness of member functions, where possible) if trim() can always be written as s = s.trim_copy()?
Well, operators are also mere functions. Instead of implementing a free op+, you implement a member replace() in the ET class, returning another instantiation of the ET, just like op+ would. Granted, it's more work to implement for the plethora of operations, and I can't readily see how to make this specific case of replace() any faster while preserving the ordering of replacements implied in the above (important for '&' in this case), but the point was that this kind of interface is open for these kinds of optimizations, though, on second thought, you could probably do the same thing with the dtor of the ET in the non-const case, too. Marc -- Marc Mutz -- marc@klaralvdalens-datakonsult.se, mutz@kde.org Klarälvdalens Datakonsult AB, Platform-independent software solutions

Marc Mutz wrote:
I assume you meant s = s.trim(); Well, that's very persuasive although I think it's harder to write something that modifies a collection of objects in place -- no? Don't get me wrong, I'm pretty fond of immutable value types. Most of date_time is written as immutable value types with a couple exceptions. However, in this case I'm building on a base type that's already mutable and it seems to me that it's pretty natural to say s.replace_all(...).
I'll take that as a cue to do nothing ;-) Jeff

On Monday 03 July 2006 00:14, Jeff Garland wrote:
I assume you meant
s = s.trim();
After renaming trim_copy(), yes. Otherwise no.
Well, that's very persuasive although I think it's harder to write something that modifies a collection of objects in place -- no?
Very good point. With a mutating trim(), people are tempted to write std::for_each( v.begin(), v.end(), mem_fn( &super_string::trim ) ); which, strictly speaking, is not explicitly allowed by the std, IIRC. With a const trimmed(), the user would be forced to use std::transform(v.begin(),v.end(),v.begin(),mem_fn(&super_string::trimmed)); which much better conveys what the code does.
It's only natural b/c it's what people are used to. It's much easier, IMHO, to work with immutable types, b/c the interface is consistent. You have a point when you say that std::string is a mutable type and you're building on it. That doesn't mean, however, that you need to keep its baggage. :) Marc -- Marc Mutz -- marc@klaralvdalens-datakonsult.se, mutz@kde.org phone: +49 521 521 45 45; mobile: +47 45 27 38 95 Klarälvdalens Datakonsult AB, Platform-independent software solutions
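The two ways of processing a collection compared above can be put side by side. This is a sketch only: tiny_string is a hypothetical stand-in for super_string that offers both a mutating trim() and a const trimmed(), and std::mem_fn (C++11) is used in place of the unqualified mem_fn from the message.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical minimal string type offering both styles of trim.
class tiny_string {
public:
    explicit tiny_string(const std::string& s) : data_(s) {}

    // Mutating style: modifies *this.
    void trim() { data_ = trimmed_copy(data_); }

    // Const style: returns a trimmed copy, *this is unchanged.
    tiny_string trimmed() const { return tiny_string(trimmed_copy(data_)); }

    const std::string& str() const { return data_; }

private:
    static std::string trimmed_copy(const std::string& s) {
        const std::string ws(" \t\n");
        const std::string::size_type first = s.find_first_not_of(ws);
        if (first == std::string::npos) return std::string();
        const std::string::size_type last = s.find_last_not_of(ws);
        return s.substr(first, last - first + 1);
    }

    std::string data_;
};

// Mutating member applied to each element.
inline void trim_all_in_place(std::vector<tiny_string>& v) {
    std::for_each(v.begin(), v.end(), std::mem_fn(&tiny_string::trim));
}

// Const member; the results are written back over the inputs.
inline void trim_all_by_transform(std::vector<tiny_string>& v) {
    std::transform(v.begin(), v.end(), v.begin(),
                   std::mem_fn(&tiny_string::trimmed));
}

inline void demo_trim_collection() {
    std::vector<tiny_string> a;
    a.push_back(tiny_string("  x "));
    a.push_back(tiny_string("y  "));
    std::vector<tiny_string> b(a);
    trim_all_in_place(a);
    trim_all_by_transform(b);
    assert(a[0].str() == "x" && a[1].str() == "y");
    assert(b[0].str() == "x" && b[1].str() == "y");
}
```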

Marc Mutz <marc@klaralvdalens-datakonsult.se> writes:
On Monday 03 July 2006 00:14, Jeff Garland wrote:
I assume you meant
s = s.trim();
After renaming trim_copy(), yes. Otherwise no.
Well, that's very persuasive although I think it's harder to write
something that modifies a collection of objects in place -- no?
Very good point. With a mutating trim(), people are tempted to write
std::for_each( v.begin(), v.end(), mem_fn( &super_string::trim ) );
which, strictly speaking, is not explicitly allowed by the std, IIRC. With a
const trimmed(), the user would be forced to use
std::transform(v.begin(),v.end(),v.begin(),mem_fn(&super_string::trimmed));
which much better conveys what the code does.
Don't get me
wrong, I'm pretty fond of immutable value types. Most of date_time is
written as immutable value types with a couple exceptions. However, in
this case I'm building on a base type that's already mutable and it seems
to me that it's pretty natural to say s.replace_all(...).
It's only natural b/c it's what people are used to. It's much easier, IMHO, to
work with immutable types,
In general, yes, and I would oppose the acceptance of a new C++ string class into Boost if it weren't immutable. And once you know it's immutable, you don't need naming contortions like "trim_copy." "trim" will do nicely. -- Dave Abrahams Boost Consulting www.boost-consulting.com

Marc Mutz wrote:
There's no restriction on for_each. I think the value of transform versus for_each for this case is marginal at best. As for trim versus trimmed, I'm afraid I don't like trimmed much. It's 'past tense'. To my ear it strikes as odd -- like the operation is already done. I normally think of mutating functions as present tense verbs. Just looking around a bit, Boost.string_algo-trim, QTString-trimmed, Java.String-trim, RWCString-strip -- so it's not that consistent. As much as anything I'm leveraging Boost.string_algo here, so consistent naming is a virtue. Hopefully this won't make you mad, but your prior email made me realize that the mutating functions should return a self reference so now you can write: s.trim().to_upper().append(" more stuff "); The other thing it made me realize is that it would be handy to have additional overloads on the number of parameters in append/insert_at. So in the next version you'll be able to write: double dbl = 1.12345; int i = 100; s.append(dbl, " a string ", i); //"1.12345 a string 100" I'll probably do something like boost.tuple and support up to 10 arbitrary parameters. And at the suggestion of someone else, I'm also adding boost.format into the mix: s.append_formatted(dbl, "-some string-", i, "%-7.2f %s %=5d"); //"1.12 -some string- 100 " So I'm afraid I'm making it more mutable, not less ;-) Jeff
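A minimal sketch of how mutating functions that return a self reference allow the chaining shown above. chain_string is a hypothetical miniature written for illustration and is not the actual super_string code; it only assumes that appended values are streamable.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical miniature of the chaining interface.
class chain_string {
public:
    explicit chain_string(const std::string& s) : data_(s) {}

    // Each mutating function returns *this so calls can be chained.
    chain_string& trim() {
        const std::string ws(" \t\n");
        const std::string::size_type first = data_.find_first_not_of(ws);
        if (first == std::string::npos) {
            data_.clear();
            return *this;
        }
        const std::string::size_type last = data_.find_last_not_of(ws);
        data_ = data_.substr(first, last - first + 1);
        return *this;
    }

    chain_string& to_upper() {
        for (std::string::size_type i = 0; i < data_.size(); ++i)
            data_[i] = static_cast<char>(
                std::toupper(static_cast<unsigned char>(data_[i])));
        return *this;
    }

    // Appends any streamable value by converting it through a stream.
    template <typename T>
    chain_string& append(const T& value) {
        std::ostringstream ss;
        ss << value;
        data_ += ss.str();
        return *this;
    }

    const std::string& str() const { return data_; }

private:
    std::string data_;
};

inline void demo_chaining() {
    chain_string s("  hello ");
    s.trim().to_upper().append(" x").append(100);
    assert(s.str() == "HELLO x100");
}
```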

Hi, Jeff Garland wrote:
I was following this discussion closely. I must say that I kind of like your class. Especially in the way you propose it - as a convenience wrapper built on top of algorithms. Having both, we have a win-win scenario. Just one remark on top of mutable vs. copy. As you said, the major reason for coming up with this class was clarity of code. I must say that when I read s.trim(), I pretty much assume that the operation is performed on 's'. So mutable is the natural winner for me here. Regards, Pavol

"Jeff Garland" wrote:
s.trim().to_upper().append(" more stuff ");
I don't know why but something reminded me of awk. Imagine having a class named "superstring_list": superstring_list s; vector<string> v = s.load_from_file("filename").split_by("\n").trim().replace("a", "b").reverse().remove_first(); /Pavel

Pavel Vozenilek wrote:
Well that's interesting -- I've done some experimenting, it really wouldn't be difficult to support a super string collection that exports string functions, but the semantics get a bit odd on some things. Like is append(string) equivalent to push_back -- I think it would have to be. I can hear the cries now about bloated interface. Of course template<class T> append(const T& val); is a unique function. Anyway, I can see this getting a bit messy. Jeff

On 7/3/06 9:08 AM, "Jeff Garland" <jeff@crystalclearsoftware.com> wrote: [SNIP discussion with Marc Mutz over the "trim" name]
I don't think this is a good idea. Could this be implemented any more efficiently than: s.append( dbl ).append( " a string " ).append( i ); You would have to do the memory allocations piecemeal in either case. And no matter how many parameters are given, users would have to resort to chaining whenever they go above your arbitrary limit. There would be little gain over the cost of having to drag in tuples and advanced programming. The "append" method could have an alternative version with a second parameter, that of a locale so the user can control the string conversion. The single-parameter version would just use the default locale.
Unless you are limiting the method to be exactly four parameters, how would it know that the last parameter is a format string and not a direct argument (that looks like a format by coincidence)?
So I'm afraid I'm making it more mutable, not less ;-)
You still have to evaluate each potential method, and not make your class a dumping ground of mixed quality. -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com

Daryle Walker wrote:
Not really:
    template<class char_type>
    template<typename T1, typename T2, typename T3>
    inline basic_super_string<char_type>&
    basic_super_string<char_type>::append(const T1& val1, const T2& val2, const T3& val3)
    {
        string_stream_type ss;
        ss << val1 << val2 << val3;
        *this += ss.str();
        return *this;
    }
In most cases I doubt the string stream will need to reallocate.
Are you saying I should force users to create a tuple to do this? That seems like an unnecessary complication and I think I'd still have to have many overloads to make this work seamlessly.
It could, but I'm going to just give them direct access to the stream. Here's the 3 parameter variant: template<typename T1, typename T2, typename T3> basic_super_string<char_type>& append (const T1& val1, const T2& val2, const T3& val3, string_stream_type& ss); Used like this: std::ostringstream ss; super_string s; ss << std::setprecision(3); double dbl = 1.987654321; int i = 1000; s.append(dbl, i, " stuff", ss); //s == "1.99 1000 stuff" So if they want to change the locale for this conversion stream they can. This is also much more efficient than reallocating the string stream for each set of conversions. In my performance tests allocating the stringstream is a significant impact, so doing more conversions at once or having the user supply one that doesn't get reallocated is a definite benefit.
Simple -- append_formatted always requires the last argument to be a string and it is always interpreted as a format -- it's different from 'append'. If the format doesn't match with the rest you get a boost.format exception.
True enough. I'm still exploring what should be in the interface -- that's part of the reason I posted -- I knew I would get lots of advice. That said, the type conversion capabilities are essential to my personal goals, even though that cranks up the number of overloads. Also, after playing with the append_formatted I really like it for simple creation of table-like output: double dbl = 1.123456789; int i = 1000; s.append_formatted(dbl, i, dbl, i , "|%-7.2f|%-7d|%-7.2f|%-7d|\n"); s.append_formatted(i, dbl, i, dbl , "|%=7d|%=7.4f|%=7d|%=7.4f|\n"); // s == "|1.12 |1000 |1.12 |1000 |\n" // "| 1000 | 1.1235| 1000 | 1.1235|\n" On the other side of the ledger, for v2 I've removed all the *_copy methods in the mutable string class. Discussion convinced me that they were over the line -- and after all, if you want to operate on a copy, it's easy enough to make one without building it into every method (eg: to_upper_copy). Also, I'll be introducing a const_super_string based on boost::const_string that is immutable -- for those that want to see how that interface works out. Personally, I prefer the mutable interface, but this works too: const_super_string s; double dbl = 1.543; s = s.append("double: ", dbl, " blah, blah"); // s == "double: 1.543 blah, blah" and the performance hit isn't too bad. More to come... Jeff

On 7/8/06 10:06 PM, "Jeff Garland" <jeff@crystalclearsoftware.com> wrote:
Each part of the "ss << val1 << val2 << val3" is done with a separate function call, leading to piecemeal allocation. There's another allocation possibility at the "*this += ss.str()" line. You can't hope that previous calls left enough reserve space to elide any allocation.
No, I don't want the tuple/advanced stuff; it sounded like you were going to add all that stuff to implement your multiple-argument idea. Since you have to stop making overloads at some point and force users to chain parameter calls, why not skip those multi-argument overloads and allow the chaining of single-argument calls?
So you would break encapsulation by moving an implementation object, the string stream, up to the user's level just so you wouldn't have to implement extra formatting methods? What happens to "s" if the string stream already has data in it? Could your code survive using a stream that can have ANY adjustments made to it? -- Daryle Walker Mac, Internet, and Video Game Junkie darylew AT hotmail DOT com
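One possible answer to the "stream already has data in it" question, sketched as a free function. append_with() is a hypothetical name and this is not the actual super_string implementation; it assumes the chosen policy is to discard buffered text while keeping the caller's formatting settings.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical free-function version of the stream-taking append.
// The caller owns the stream and therefore its formatting state.
template <typename T1, typename T2, typename T3>
std::string& append_with(std::string& target, const T1& v1, const T2& v2,
                         const T3& v3, std::ostringstream& ss) {
    ss.str(std::string());  // discard text already buffered in the stream
    ss.clear();             // reset error state; precision/flags are kept
    ss << v1 << v2 << v3;
    target += ss.str();
    return target;
}

inline void demo_append_with() {
    std::ostringstream ss;
    ss << "leftover";            // pre-existing data in the caller's stream
    ss << std::setprecision(3);  // caller-chosen formatting
    std::string s;
    append_with(s, 1.987654321, 1000, " stuff", ss);
    // The leftover text is gone, the precision setting is honoured, and
    // no separators are inserted between the values.
    assert(s == "1.991000 stuff");
}
```

Other adjustments to the stream (a failed state set on purpose, an imbued locale, width settings) would still leak through, which is the encapsulation concern raised above.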

Jeff Garland wrote:
Apart from optimization, the future (under range proposal) 'to_upper' might be: std::vector<char> const rng; std::string dst(rng|to_upper); The following is possible today: std::string dst = range_construct(rng|to_upper); range_copy(rng|to_upper|to_lower|to_upper, dst); Note that '|to_upper' is lazy. Well, IMHO, I prefer free-functions for another readability: ::CString rng; boost::to_upper(rng); std::vector<char> rng; boost::to_upper(rng); super_string str; str.to_upper(); // ! -- Shunsuke Sogame
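For readers unfamiliar with the pipe syntax, here is a small sketch of how 'rng|to_upper' can be made lazy. It is a simplified stand-in, not the Range proposal's design: to_upper_tag and upper_view are hypothetical names, and a real adaptor would return a range of transforming iterators instead of an indexable view.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical tag type; the object `to_upper` is the adaptor.
struct to_upper_tag {};
const to_upper_tag to_upper = to_upper_tag();

// Lazy view: refers to the source and converts characters only when read.
class upper_view {
public:
    explicit upper_view(const std::string& s) : src_(&s) {}

    std::string::size_type size() const { return src_->size(); }

    char operator[](std::string::size_type i) const {
        return static_cast<char>(
            std::toupper(static_cast<unsigned char>((*src_)[i])));
    }

    // Materializes the view into a new string.
    std::string str() const {
        std::string out;
        out.reserve(size());
        for (std::string::size_type i = 0; i < size(); ++i)
            out += (*this)[i];
        return out;
    }

private:
    const std::string* src_;  // the source must outlive the view
};

// rng | to_upper builds the view; no character is converted yet.
inline upper_view operator|(const std::string& rng, to_upper_tag) {
    return upper_view(rng);
}

inline void demo_pipe() {
    std::string rng("abc");
    upper_view v = rng | to_upper;
    assert(v[1] == 'B');        // converted on access
    assert(rng == "abc");       // the source is untouched
    assert(v.str() == "ABC");   // converted when materialized
}
```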

Shunsuke Sogame wrote:
Sorta hard for me to anticipate the future ;-) Anyway, I assume rng can be an std::string as well?
I have no idea what this code does? Construct a range from chars that have been upper-cased and write it into dst. Then copy the rng to while converting it to upper, then lower, then upper? This one isn't winning me over with code clarity...
In the zero/one argument case there's no significant advantage. So I'll repeat this case again:
    std::string s1("foo");
    std::string s2("bar");
    std::string s3("foo");
    //The next line makes me go read the docs again, every time
    replace_all(s1,s2,s3); //which string is modified exactly?
or
    s1.replace_all(s2, s3); //obvious which string is modified here
Of course, you want to be able to work consistently so you have to pull along the functions with fewer arguments too. Jeff
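To make the comparison concrete, here is a self-contained sketch of both call styles. The free function follows the argument order of Boost.StringAlgo's replace_all (input, search, format), where the first argument is the one modified; the sketch namespace and the mini_string class are hypothetical and written with std::string only.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Free-function form: (input, search, format); `input` is modified.
inline void replace_all(std::string& input, const std::string& search,
                        const std::string& format) {
    if (search.empty()) return;
    std::string::size_type pos = 0;
    while ((pos = input.find(search, pos)) != std::string::npos) {
        input.replace(pos, search.size(), format);
        pos += format.size();  // continue after the inserted text
    }
}

// Hypothetical member form: the modified object is the one before the dot.
class mini_string {
public:
    explicit mini_string(const std::string& s) : data_(s) {}

    mini_string& replace_all(const std::string& search,
                             const std::string& format) {
        sketch::replace_all(data_, search, format);
        return *this;
    }

    const std::string& str() const { return data_; }

private:
    std::string data_;
};

}  // namespace sketch

inline void demo_replace_all() {
    std::string s1("foo bar foo");
    sketch::replace_all(s1, "foo", "baz");  // s1 is the string modified
    assert(s1 == "baz bar baz");

    sketch::mini_string m("foo bar foo");
    m.replace_all("foo", "baz");
    assert(m.str() == "baz bar baz");
}
```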

Jeff Garland wrote:
'std::string' will be constructible from any "rng" in the far distant future... :-) I also can't wait for the future Boost.Range, so I'm working on something for my own use: http://tinyurl.com/p8e4m Note that it is not fully implemented yet.
'|to_upper' makes a range of 'transform_iterator'.
The main problem is that 's1' must be a 'super_string'. The 'Range' and 'Sequence' abstractions will disappear. How does this look? :-) replace_all(out(s1), in(s2), in(s3)); -- Shunsuke Sogame

Shunsuke Sogame wrote:
Jeff Garland wrote:
Looks interesting.
Honestly, that didn't clear it up for me. What does the rest of it do? This code doesn't make sense to me: 'to_upper|to_lower|to_upper' It's upper, then lower, then upper -- huh?
Yep that's true it does. But, you know, that's what I'm writing code for -- strings of chars. I don't really need the particular code I'm writing to handle every possible type in the world -- just strings of chars. If you aren't processing strings, then don't use super_string. Pavol and John have taken care of all the generic cases. I'm reducing things down to the common and frequent thing that I need to do. And the thing is, there's still nothing stopping me from writing super_string s(...); some_free_function_algorithm_here(s, ...); So what have I lost? Nothing. I've gained cleaner, clearer code for the things I do every day -- code I can explain to any programmer. There isn't a single novel algorithm or function in super_string -- I've just optimized the generic code for the common cases I need by wrapping up a couple of existing Boost libraries.
Better, but it's still ugly in comparison. I could give the first one to just about any programmer (even non-C++) and they would get it. With your example I'm sure lots of programmers will be scratching their heads. They'll be distracted by the 'in' and 'out'. I certainly am. Jeff

Jeff Garland wrote:
'to_upper' is called a range adaptor. The pitiful iterators do what you mentioned. That code was somewhat of a joke. :-)
Right. 'super_string' can be used as 'Sequence' and 'Range'. IIRC, your proposal is not radical, and it is related to the language feature proposal that gives users the choice of 'foo(a);' or 'a.foo();'
I agree we should appreciate non-C++ programmers. Actually, range adaptors look influenced by Unix pipe syntax. -- Shunsuke Sogame

Bo Persson wrote:
"Marc Mutz" <marc@klaralvdalens-datakonsult.se> skrev i meddelandet
And in what locale should to_upper work? Swedish?
The global locale. You can replace it with locale::global if you so desire. Of course, I could add locale parameters to these functions, but leaving them out was one way to keep things simpler. Whether it's worth adding the parameter is really a question of how frequently applications manipulate multiple locales at the same time. Jeff
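One way the trade-off above could look, sketched with the standard library only. The function name `to_upper_copy_sketch` is made up for illustration and is not part of super_string; the point is that a defaulted `std::locale` argument keeps the simple call simple while still letting a caller pass a specific locale.

```cpp
#include <locale>
#include <string>

// Returns an upper-cased copy of `s`.
// The default argument std::locale() is a copy of the current global locale,
// so callers who never think about locales get the behaviour Jeff describes.
inline std::string to_upper_copy_sketch(std::string s,
                                        const std::locale& loc = std::locale()) {
    for (char& c : s) {
        c = std::toupper(c, loc);  // locale-aware overload from <locale>
    }
    return s;
}
```

A call such as `to_upper_copy_sketch(text)` uses whatever was last installed with `std::locale::global`, while `to_upper_copy_sketch(text, std::locale::classic())` pins the "C" locale regardless of global state.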

On Monday 03 July 2006 00:11, Jeff Garland wrote:
Actually, I think having a global locale is an abomination in the first place, but that's for another place and time :) Marc -- Marc Mutz -- marc@klaralvdalens-datakonsult.se, mutz@kde.org Klarälvdalens Datakonsult AB, Platform-independent software solutions

Thorsten Ottosen wrote:
It doesn't stop being easier to use -- it's the same every day. Your implicit implication is that later on I'm going to find a bunch of extra functions that I can't perform with super_string. True enough. At that point I either add it to super_string or write it using the functional interface. No big deal.
against the spirit of low separation and minimal interfaces.
Against the spirit of minimal interfaces for a single type, perhaps. Overall though, I'm radically simplifying the overall string processing interface 'surface area'. Compare the universe of all available boost string processing free functions in string_algo, format, lexical_cast, tokenizer, regex, and xpressive versus super_string. super_string is much smaller. And super_string can be documented without the 'noise' of all the template parameters associated with the lower level libraries. As for low separation, it's all template code, so you only pay for what you instantiate. If you don't use regex interfaces you don't need to link the library. Yes, there are more includes to process, but I think I can afford it. And again, I've lost nothing w.r.t. using generic code -- I can still use it whenever I need.
Not a good idea IMO.
We disagree, obviously, and that's fine. Just to save some time, I'm pretty sure it's impossible to convince me that range or something else is going to solve my set of issues. I'll just point out that some of this was a result of recent brushes of mine with other languages. I've been doing some coding lately in, gasp, Java and Javascript. As a diehard C++ developer, it ticks me off that I can sit down with the manual for these languages and in 15 minutes whip up some fancy string parsing code using regular expressions, etc. It's all very nice and neat. Go to the string reference page -- see the list of functions and boom, you're in business. So it gets me thinking, why is it that C++ makes this so hard? Well, std::string isn't as capable as Java's string. Of course, C++ (with Boost) has all of the same capabilities, but it takes a truckload of documentation and a master's degree to figure it all out. And then it's a pain in the neck to use and read the code. This is my attempt to rectify that. Jeff

Jeff Garland wrote:
My implicit implication was that soon you'll feel comfortable with the free-standing functions. In PHP, a string does not have a single member function, and all string processing is done with free-standing functions. In general I find strings in PHP easy to work with: http://www.php.net/manual/en/ref.strings.php http://dk2.php.net/manual/en/ref.regex.php
Probably not. I did not have that in mind. I think some extensions to range will make stuff like regexes easier, but it would give you "one place for all string processing".
If the problem is bad documentation/tutorials, then I think we should fix that instead. For Java, many many people have been paid to write the documentation, whereas for boost, we have to do it in our spare time. It's pretty good most of the time anyway. -Thorsten

Thorsten Ottosen wrote:
Jeff Garland wrote:
Not likely, I've used free functions for years, but I'm going to the dark side now ;-) Anyway, I've already responded to this point in detail elsewhere. I agree, btw, that it can be done cleanly without a string type, but then you're going to need language support to do it well.
I've written code in PHP as well. It has the advantage over C++ in that it doesn't have templates to distract in the documentation. But overall, I think PHP string handling is a mess...sorry.
Sorry -- do you mean 'wouldn't give'?
Nope, that's not really the problem.
For Java, many many people have been paid to write the documentation,
That's sad, a lot of it is pretty lousy as far as I'm concerned.
whereas for boost, we have to do it in our spare time. It's pretty good most of the time anyway.
I don't really have a problem with the documentation. I think each of the various docs are good by themselves. But as I mentioned before, the functions are spread across a number of libraries. And, of course, the reference documentation has to be written in a totally generic fashion. Just take regex_replace as a case in point:

    template <class OutputIterator, class BidirectionalIterator, class traits, class charT>
    OutputIterator regex_replace(OutputIterator out,
                                 BidirectionalIterator first,
                                 BidirectionalIterator last,
                                 const basic_regex<charT, traits>& e,
                                 const basic_string<charT>& fmt,
                                 match_flag_type flags = match_default);

    template <class traits, class charT>
    basic_string<charT> regex_replace(const basic_string<charT>& s,
                                      const basic_regex<charT, traits>& e,
                                      const basic_string<charT>& fmt,
                                      match_flag_type flags = match_default);

My first reaction when I read this is, wow, interesting, but how do I use it? It's hard for even an experienced guy like me to see the forest for the template trees here. So I scroll down to the example and start reading the example code. Ok, now I see it and I can go back, consume it, ponder more...then realize, ok I guess it's the second signature because I'm using an std::string...now I can go write some code. (Of course, I normally don't do it like this because I go and look up some regex code I've already written).

Now let's compare JavaString.replaceAll (short description):

    String replaceAll(String regex, String replacement)
    Replaces each substring of this string that matches the given regular expression with the given replacement.

Wow, ok I don't need to see the example code, I can write code now. I might need to read more about the regex string rules, but no biggie, they follow expected conventions. After 2 minutes I'm testing code. Of course, JavaString.replaceAll is just lame compared to what regex can do. But, you know, it covers most of what I use for typical day to day string processing. It's clean, easy, fast -- I can focus on other parts of my app rather than the template parameters for the string function.

Now let's examine the hastily created *pre-alpha* super_string docs:

    template<class char_type>
    void basic_super_string< char_type >::replace_all_regex(
        const base_string_type & match_regex,
        const base_string_type & replace_format)

    Replace all instances of the match_string with the replace_format.

    super_string s("(abc)3333()(456789) [123] (1) (cde)");
    //replace parens around digits with #--the digits--#
    s.replace_all_regex("\\(([0-9]+)\\)", "#--$1--#");
    //s == "(abc)3333()#--456789--# [123] #--1--# (cde)"

Right from the start there's only one signature and only one template parameter to document -- char_type is pretty easy to understand, doesn't even really require explanation -- but really the docs would be nicer without that distraction. The context is string processing, so I don't have to worry about explaining that the regex function can work on vector<char> or whatever sequence I want. I've ditched a couple of function parameters -- always going for the regex defaults. So super_string is more like JavaString -- very limited compared to full up regex or string_algo, but it's easier to document and use for common cases. Jeff
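To make the "one signature, regex defaults" idea concrete, here is a rough sketch of how such a member could be layered over a generic regex_replace. It is not the super_string implementation: it swaps in C++11 `std::regex` for Boost.Regex so that it builds without Boost, and the class name `regex_demo_string` is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical stand-in for super_string with a single regex member.
struct regex_demo_string : std::string {
    regex_demo_string(const char* s) : std::string(s) {}

    // Replace every match of match_regex with replace_format, in place.
    // The regex is compiled with default (ECMAScript) syntax and the default
    // match flags, i.e. the extra parameters of the generic interface are
    // deliberately hidden from the caller.
    void replace_all_regex(const std::string& match_regex,
                           const std::string& replace_format) {
        const std::regex re(match_regex);
        const std::string& self = *this;  // view this object as plain std::string
        assign(std::regex_replace(self, re, replace_format));
    }
};
```

Used on the string from the docs above, `s.replace_all_regex("\\(([0-9]+)\\)", "#--$1--#")` turns only the parenthesised digit groups into `#--digits--#`, since `[0-9]+` needs at least one digit and so skips `()` and `(abc)`.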

Jeff Garland wrote:
OK, now I 'get it'. Until this example I simply saw functions moving from one interface into another - now I see that the interface itself becomes cleaner. Move me from 'skeptic' to 'believer', although I still can't help feeling dirty for endorsing this! Some dogma just runs too deep to easily walk away from I guess. -- AlisdairM

AlisdairM wrote:
LOL :-) Thanks for opening up...
Some dogma just runs too deep to easily walk away from I guess.
Believe me, I had to think about this for awhile before posting it. I figured I'd be sealing my fate with some people on this list for good. And I knew the 'less generic' == good was going to be hard to sell ;-) Jeff

| -----Original Message-----
| From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Thorsten Ottosen
| Sent: 04 July 2006 22:56
| To: boost@lists.boost.org
| Subject: Re: [boost] Interest in super string class?
|
| Jeff Garland wrote:
| > Thorsten Ottosen wrote:
| >>If the problem is bad documentation/tutorials, then I think we should fix that instead.
| >
| > Nope, that's not really the problem.

I disagree - I think the docs ARE a problem.

| > I don't really have a problem with the documentation. I think each of the various docs are good by themselves.
| >
| > template <class OutputIterator, class BidirectionalIterator, class traits, class charT>
| > OutputIterator regex_replace(OutputIterator out,
| >                              BidirectionalIterator first,
| >                              BidirectionalIterator last,
| >                              const basic_regex<charT, traits>& e,
| >                              const basic_string<charT>& fmt,
| >                              match_flag_type flags = match_default);
| >
| > template <class traits, class charT>
| > basic_string<charT> regex_replace(const basic_string<charT>& s,
| >                                   const basic_regex<charT, traits>& e,
| >                                   const basic_string<charT>& fmt,
| >                                   match_flag_type flags = match_default);
| >
| > My first reaction when I read this is, wow, interesting, but how do I use it?

I'm glad I'm not the only one who goes gulp (and sometimes gives up). There is just too much class clutter to see the wood for the trees.

| That's the job of a tutorial.

I disagree - I think that somehow the USER needs the EXAMPLES right side-by-side with the reference info. One might crack this problem more than one way, for example by hyper-links to AND FROM the example/tutorial and reference info. With paper, this seems to me to lead to having both the reference book and the tutorial/example book open at the same time. Good for book sales/authors ;-) (I think that this is a problem with the whole of the STL. It is very powerful, but it isn't obvious HOW to use it, and often can turn out to be rather subtle.)
Overall, I have the impression that Jeff is on the right track. Paul --- Paul A Bristow Prizet Farmhouse, Kendal, Cumbria UK LA8 8AB +44 1539561830 & SMS, Mobile +44 7714 330204 & SMS pbristow@hetp.u-net.com

"Thorsten Ottosen" wrote:
Other scripting languages have /very/ powerful strings and present this as a major feature. Borland C++ uses a fat string, and every GUI framework on the planet has created its own string to deal with the insufficiency of std::string (in their own words). /Pavel

Pavel Vozenilek wrote:
In their own words is right. The std::string implementation is much better than Borland's AnsiString, Microsoft's CString, .NET's System.String, java.lang.String and nearly any other string class I have ever seen and used. The insufficiency is the other way around as far as the design goes. Almost all of the other string classes have no orthogonality, and a set of minimalistic functions which leave programmers groping for the correct sequence of basic manipulations when they have to do more complicated string manipulations. Only the std::string class has the richness to provide for a very large set of basic string manipulations at any point in a string for any length in an orthogonal way. No one in their right mind would prefer, for example, Borland's AnsiString to std::string as far as the basic string interface goes. What some of these other string classes have is other specialized functionality, which std::string lacks, and Jeff is just bringing this into a super string class from Boost's own regex and string_algo libraries. I see nothing wrong with that as a simplification in a single class of other functionality. I pointed out what I thought was the only bad design decision in std::string, but it's a done deal already and the mavens of C++ evidently want to support C idioms for the life of the language so who am I to object. Other than that, the much-maligned design of std::string is actually excellent, but that shouldn't stop anyone else from adding new functionality to a derived class, and Jeff has elegantly done so. Thanks Jeff !

On Tue, 04 Jul 2006 16:40:49 -0400 Edward Diener <eddielee@tropicsoft.com> wrote:
Sure, the C-centric viewpoint is a problem. However, IMO, the biggest problem with std::string (yes, others fall into a similar category)... some very significant implementation details are left to the implementor, with no choice for the user. Yeah, I know... you are not supposed to be concerned with what's behind the interface. However, in the real world, at some point you must be concerned with those issues. Many applications will perform terribly with unique copies underneath every std::string. Likewise, many applications will perform terribly with ref-counted COW strings. Many applications have a small multithreaded need, but are yoked with the mutex locking that some implementations do for each string operation. Worse, move your code between compilers and you will get different behavior. I would love to see more string options, so that I can choose when unique strings are best, or non-mutex protected COW operations are acceptable. Personally, I don't care if the strings have different names, or are typed based on policies, or there is a special type that defers the implementation to some type of strategy object. Obviously some are better than others, but any of them would be better than being saddled with a class that cannot be used in some situations because you have no idea what it will cost to use. We have a home grown string library that has its own problems. However, at the time, we were using two different versions of two different compilers, and each was using a different string implementation. Where string behavior is required to be pre-determined, how are you supposed to use std::string when its requirements allow such different behavior? The answer was (and I believe still is)... you can't. It sure would be nice to have all that std::string functionality (and interoperability) available to me even if I need a different underlying implementation.

First of all I'd like to "come out" too and say that this feature in boost would be a real joy for all of us "non boost super users" (I have quite a bunch of them all around me). I use tokenizer by copying and pasting some code, and regex is still too obscure for me at this point. I handle lexical_cast masterfully, but I never took the time to read the format documentation because I never thought it could be useful for strings. I never heard of string_algo or xpressive before today. All in all, having a class that offers all string related algorithms in a clean interface, aimed only at strings, looks like something I've been missing for years. Jeff Garland wrote:
Compare the universe of all available boost string processing free functions in string_algo, format, lexical_cast, tokenizer, regex, and
Speaking strictly of strings, I think this is the main problem. Maybe beginning by writing a doc that simply teaches what can be done on strings, using all boost libs, would be the better thing to do? Basically, that's what the super_string documentation will do. I, at least, would be very pleased to find such a tutorial, and would get a lot of use out of it. The second step might be to wrap all those features in free functions (I wonder why there isn't a namespace boost::string_algo that provides all support for strings), and maybe the third to have a class that provides a clean interface. But I'll leave that up to you Boost super users. Maybe I'm not super user enough to have a good point on this, but at least you can be sure I'm not a lazy developer, and reading docs is my hobby. I feel weird realizing that I missed a lot of string features from boost. SeskaPeel.

SeskaPeel wrote:
I suppose someone could write a nice article about this, but my goal was also to make my resulting code look cleaner. When I'm looking at the code 3 months later I'm not really that interested in which Boost library implements it...just what it does.
There is, but it's a library that provides most of the functions super_string provides.
Well, if nothing else, this discussion might lead you to look at parts of Boost you were missing out on :-) Jeff

Feature request: would it be possible to parametrize the internal string type (currently hardcoded to std::basic_string)? For example one may like to use flex_string. template<class char_type, typename Str_t = std::basic_string<char_type> > class basic_super_string : public Str_t .... /Pavel
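Pavel's request could be sketched roughly as below. This is an illustration of the shape of the template only, not the actual super_string source: the single `to_upper` member and both typedefs are assumptions added so the example does something, and the member relies on nothing from `Str_t` beyond `begin()`/`end()`, which is the kind of constraint a replaceable base type would have to satisfy.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Sketch: the base string type is a template parameter that defaults to
// std::basic_string, so existing users would see no change.
template <class char_type, typename Str_t = std::basic_string<char_type> >
class basic_super_string : public Str_t {
public:
    basic_super_string() {}
    basic_super_string(const char_type* s) : Str_t(s) {}

    // One representative member, written against the iterator interface
    // only, so any Str_t with mutable begin()/end() would work.
    void to_upper() {
        const std::locale loc;  // copy of the current global locale
        std::transform(this->begin(), this->end(), this->begin(),
                       [&loc](char_type c) { return std::toupper(c, loc); });
    }
};

// Assumed convenience typedefs mirroring the names used in the thread.
typedef basic_super_string<char> super_string;
typedef basic_super_string<wchar_t> wsuper_string;
```

Swapping the implementation would then be a matter of naming a different second argument, provided that type offers the constructors and iterator interface the members use.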
participants (22): AlisdairM, Bo Persson, Daryle Walker, David Abrahams, Edward Diener, Gennaro Prota, Jeff Garland, Jody Hagins, Joel de Guzman, Kon Lovett, Marc Mutz, Martin Wille, Maxim Yegorushkin, Paul A Bristow, Pavel Vozenilek, Pavol Droba, Peter Dimov, Sam Partington, Sebastian Redl, SeskaPeel, Shunsuke Sogame, Thorsten Ottosen