
On Thu, Aug 26, 2010 at 7:07 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:
On 26/08/10 08:09, Dean Michael Berris wrote:
Indeed.
However, I am looking for a means of representing a string -- a collection of characters on which you can implement algorithms on. They might as well be ranges, but things like pattern matching and templates (as in string templates, where you have placeholders and other things) can be applied to or used to generate them.
A string is just a range of characters; nothing more.
Actually, not *just* a range of characters. If your range dealt with ownership of data and abstracted the means by which the data is actually manipulated (even through iterators) then maybe a string is just a range of characters. If your range made sure that the data is moved instead of copied in certain situations, or whether there is an optimization that can be made on certain operations (like concatenation) then yeah I'll agree that a string is just a range of characters. If you think about a string as a special beast that does a lot of things: 1. Allocates memory which holds the characters 2. Offers an abstraction unique to strings (token_iterator, line_iterator, char_iterator, wchar_iterator) 3. Has value semantics similar to built-in types (copyable, movable, "regular") Then it doesn't look like just a range.
I suppose you'd have to linearize it. Sending it in multiple chunks would have different behaviour on different types of sockets.
Yes, but something that is inherently supported by the type.
There is no "inherently supported" or "not inherently supported". Linearizing the memory means you have to copy it in a contiguous buffer.
Sure, but if the string type already does that for you at the time you need it, then it's inherently supported by the type, no?
Why strings are important has a lot to do with being able to perform domain-specific optimization on string algorithms. Things like capitalization, whitespace removal, encoding/decoding, transformations like breaking up strings according to some pattern (tokenization, parsing, etc.). Because you can specialize the memory-management of strings (as opposed to just ranges of char's) the "win" in treating a string as a separate type are practical rather than conceptual.
I've got a system that does arbitrary conversion and segmentation of ranges in a lazy way without ever allocating memory.
Sure, but the memory has to be allocated somewhere else. I'm really looking for that "thing that allocates the memory and manages it for me" more than just the range. If you use std::string, then I think what I want to replace is an std::string.
I can use it for things like on-the-fly character encoding conversion, Unicode normalization, or base64 encoding, for a few examples.
It never requires to differentiate strings from other types of ranges, not to use special containers or anything.
That's cool, but in cases where you deal with potentially huge strings (i.e. more than a memory page's worth of data) you start looking for ways of moving some of the work out of runtime and into compile time (things like getting the size of the literals). And especially in cases where you need to abstract the string from being something that is exclusively in memory to something that refers to data that is not in memory (retrieved from a socket, from a file, from user input) you run into issues like buffer management, demand-driven/lazy-loading data, etc. that a range is not the end-all and/or best solution for. For instance, think about a forward iterator (or a single-pass iterator?) that the string can expose to allow for one-time traversal of the data it holds or it refers to. The string doesn't have to have the data all in memory if it doesn't yet, and it can then lazily load the data and expose it through that forward iterator. You can still deal with it in ranges, but the string is smart enough to figure out how to manage the memory and the data it represents.
Just in case you haven't noticed, that's exactly the same code as you wrote, except it's real code that works and it uses different names.
Until you think about the string_handle actually manages the memory unto which the data is stored.
That's fine if I can control the memory allocation of the lazy range.
Using type erasure means you have to use dynamic memory allocation eventually, unless you limit it to use stack storage and refuse types that are too large. (which in practice should be able to work fine)
Not in that sense. The dynamic memory will have to be managed by the string, or the string handle. If the string handle knew that some part of the string was a conglomeration of statically-sized literals then it can hold that data in a boost::array of the correct size -- then think about the transformed proto tree to have nodes merged in case they were adjoined literals be a single node as a merged literal, or in cases where there are bounded strings (even those generated at runtime, i.e. a bounded string of length 255) then you can correctly allocate enough space at the point where the operator= is implemented. Of course the type erasure might be an issue, but knowing the eventual size at compile time allows you a lot of optimizations you otherwise can't or won't do.
As it is, a lazy range just represents a collection of iterator pairs
Just the one pair, actually.
Yes, but then you have a left-leaning tree of iterator pairs. ;)
-- the data has to live somewhere still.
Having it live in your new "string type" is a bad idea, as this just makes things not inter-operable. Simply allow people to use whatever data structure they see fit for their storage needs, and just deal with ranges when you want to manipulate strings.
If the new string type implemented iterators (several types of iterators in fact) and manages the memory for you in a configurable manner, *and* allows you to convert it to either an std::string or an std::wstring, why wouldn't it be inter-operable?
What I'm looking for is a combined ownership+iteration mechanism. Right now the problem of allocating a chunk of memory every time a you concatenate two strings is the problem I'm trying to solve with metaprogramming and knowing at compile time how much memory I'm going to need to allocate.
What meta-programming will give you is a type that represents an expression that you can then evaluate. It doesn't own the values in any way.
Yes, but it allows me to know the lengths of the strings at compile time and eliminate the multiple allocations the temporary strings I otherwise would incur in case I just did std::string + std::string or even in cases where I use an ostringstream. What I can then eliminate are the multiple allocations -- something nearly impossible to do just at runtime.
Range adaptors are exactly the same thing, they're a lightweight DSEL.
But the range adaptors don't solve the issue of multiple allocations -- I still have std::strings owning the data and potentially fragmenting the memory because I keep allocating and deallocating small chunks of irregular-sized memory. The new string type will have a more sane memory management mechanism which relies on as much compile-time programming as well as runtime magic to sanitize the memory utilization of dealing with strings. Someone can even make it behave almost the same as an std::string and offer almost the same interface (I'm not a fan of all the functions there) then it just becomes a better string.
Of course if you're dealing with strings that have an unknown length (as in my example, we really can't tell the length of 'f' at the point 's' is defined) at least getting to know the parts that have a known length at compile time (the literals) allows me to allocate enough space ahead of time with the compiler's help.
With ranges, the way it works is depending on whether the range you want to get the size of has random access or not. If it is, getting the size is O(1). If it isn't, it's O(n). If it's a single-pass range, you can't even get the size.
a joined_range has the lowest category of the ranges it's joining, so if one of the ranges isn't random access, then the concatenated range isn't.
Yes, that's understood. I'm happy with the ranges approach, as long as the string type I'm using is smart enough to know that part of the string it's referring to is a conglomeration of statically-sized and variable-sized data. The problem with just ranges is that I can't really just return that from a function and expect that the data it refers to (being iterators) will be valid when they're needed. Again, unless the range owns the data, this will be a problem.
Maybe instead of having multiple concatenations, what happens is I allocate a chunk "just large enough" to hold the multiple concatenated strings, and just traverse the lazy string as in your example. The copy happens at runtime, the allocation of a large enough buffer (maybe a boost::array) happens at compile-time (or at least the determination of the size).
You can't get the size of a range at compile-time.
Precisely, which is why strings aren't *just" ranges. If I had a literal string, I can know the size of that at compile time. If I had a bounded string or a generator that knew exactly how much data (even just the maximum) it will be generating that made these things known at compile time, then I don't need to be stuck with just ranges where you can't get the size of at compile time.
If you're only dealing with statically-sized stuff, I suggest you use fusion sequences, which is basically the same thing as ranges except they're statically-sized and can have heterogeneous elements. (and can make your compile-time explode) There is a boost::fusion::joint_view that behaves the same as boost::join.
You can't really mix both runtime and compile-time sized things together. It's either one or the other, since compile-time code and runtime code at different beasts. (of course, you can consider a fusion sequence as a range, but then you lose the information it was statically sized when manipulating it as a range)
Yes, but... if you made enough information known at compile time then you can get away with optimizing then for a better runtime. If I were to use Proto (which I most definitely will do) then I would then be able to use Fusion to deal with the sequences and operate on these at both compile-time and run-time.
boost::lazy_range is not actually in boost, but that would be a type erased range, and I've got an implementation somewhere.
Maybe a lazy_range would be nice to have in Boost. Or even just a join iterator.
The join iterator is already there (since 1.43 I think? maybe before that), I wouldn't have put it in my code without saying it's not in Boost otherwise.
Ah, cool. I'll look into that.
Type erasure for ranges and iterators isn't in Boost, I suppose because there is little interest given using the real type is much more efficient, it would need allocations policies, and there is still hope that someday we'll have a generic type erasure facility.
Indeed.
Of course, not using type erasure at all (i.e. replacing lazy_range by auto or the actual type of the expression) would allow it to be quite faster.
Definitely.
Then why the string_handle thing? You want a type that represents the expression.
Ah, because 'auto' is C++0x and I was still thinking that maybe people not moving to C++0x might as well want to be able to use this new string type even at C++03.
The hope was to be able to determine at least the length of the whole string from the concatenation of multiple strings
Nothing particularly hard to do there.
At compile time.
so that effective memory allocation can be done at the time it's needed, which is usually at the time a copy of the whole string is required.
The problem is that you seem to want to apply stateful operations on your string object and delay them until you copy it, and have it automagically deal with lifetime and ownership issues.
Yes! Well, technically not stateful operations -- they're functionally pure operations: concatenation doesn't munge with the existing strings, it just returns a new string made of the contents of the other strings (or string generators).
Maybe the right solution for you is to use a rope data structure, which is a container and doesn't actually concatenate things, it just puts the fragments in a self-balancing tree. Never really understood the point of that data structure myself.
In cases where you concatenate strings that amount to a huge string to deal with, eliminating the copies or re-allocations of memory amounts to a lot of efficiency gains. Now if only there was a rope that was smart enough to deal with some of the things at compile-time instead... -- Dean Michael Berris deanberris.com