Re: [boost] [Potentially OT] String Concatenation Operator

27 Aug 2010

      On Thu, Aug 26, 2010 at 7:07 PM, Mathias Gaunard
<mathias.gaunard@ens-lyon.org> wrote:
...
On 26/08/10 08:09, Dean Michael Berris wrote:
...
Indeed.
However, I am looking for a means of representing a string -- a
collection of characters on which you can implement algorithms on.
They might as well be ranges, but things like pattern matching and
templates (as in string templates, where you have placeholders and
other things) can be applied to or used to generate them.
A string is just a range of characters; nothing more.
Actually, not *just* a range of characters.

If your range dealt with ownership of data and abstracted the means by
which the data is actually manipulated (even through iterators) then
maybe a string is just a range of characters. If your range made sure
that the data is moved instead of copied in certain situations, or
whether there is an optimization that can be made on certain
operations (like concatenation) then yeah I'll agree that a string is
just a range of characters.

If you think about a string as a special beast that does a lot of things:

1. Allocates memory which holds the characters
2. Offers an abstraction unique to strings (token_iterator,
line_iterator, char_iterator, wchar_iterator)
3. Has value semantics similar to built-in types (copyable, movable, "regular")

Then it doesn't look like just a range.
...
...
...
I suppose you'd have to linearize it.
Sending it in multiple chunks would have different behaviour on different
types of sockets.
Yes, but something that is inherently supported by the type.
There is no "inherently supported" or "not inherently supported".
Linearizing the memory means you have to copy it in a contiguous buffer.
Sure, but if the string type already does that for you at the time you
need it, then it's inherently supported by the type, no?
...
...
Why strings are important has a lot to do with being able to perform
domain-specific optimization on string algorithms. Things like
capitalization, whitespace removal, encoding/decoding, transformations
like breaking up strings according to some pattern (tokenization,
parsing, etc.). Because you can specialize the memory-management of
strings (as opposed to just ranges of char's) the "win" in treating a
string as a separate type are practical rather than conceptual.
I've got a system that does arbitrary conversion and segmentation of ranges
in a lazy way without ever allocating memory.
Sure, but the memory has to be allocated somewhere else.

I'm really looking for that "thing that allocates the memory and
manages it for me" more than just the range. If you use std::string,
then I think what I want to replace is an std::string.
...
I can use it for things like on-the-fly character encoding conversion,
Unicode normalization, or base64 encoding, for a few examples.
It never requires to differentiate strings from other types of ranges, not
to use special containers or anything.
That's cool, but in cases where you deal with potentially huge strings
(i.e. more than a memory page's worth of data) you start looking for
ways of moving some of the work out of runtime and into compile time
(things like getting the size of the literals). And especially in
cases where you need to abstract the string from being something that
is exclusively in memory to something that refers to data that is not
in memory (retrieved from a socket, from a file, from user input) you
run into issues like buffer management, demand-driven/lazy-loading
data, etc. that a range is not the end-all and/or best solution for.

For instance, think about a forward iterator (or a single-pass
iterator?) that the string can expose to allow for one-time traversal
of the data it holds or it refers to. The string doesn't have to have
the data all in memory if it doesn't yet, and it can then lazily load
the data and expose it through that forward iterator. You can still
deal with it in ranges, but the string is smart enough to figure out
how to manage the memory and the data it represents.
...
Just in case you haven't noticed, that's exactly the same code as you wrote,
except it's real code that works and it uses different names.
Until you think about the string_handle actually manages the memory
unto which the data is stored.
...
...
That's fine if I can control the memory allocation of the lazy range.
Using type erasure means you have to use dynamic memory allocation
eventually, unless you limit it to use stack storage and refuse types that
are too large. (which in practice should be able to work fine)
Not in that sense. The dynamic memory will have to be managed by the
string, or the string handle. If the string handle knew that some part
of the string was a conglomeration of statically-sized literals then
it can hold that data in a boost::array of the correct size -- then
think about the transformed proto tree to have nodes merged in case
they were adjoined literals be a single node as a merged literal, or
in cases where there are bounded strings (even those generated at
runtime, i.e. a bounded string of length 255) then you can correctly
allocate enough space at the point where the operator= is implemented.
Of course the type erasure might be an issue, but knowing the eventual
size at compile time allows you a lot of optimizations you otherwise
can't or won't do.
...
...
As it is, a lazy range just represents a collection of iterator pairs
Just the one pair, actually.
Yes, but then you have a left-leaning tree of iterator pairs. ;)
...
...
-- the data has to live somewhere still.
Having it live in your new "string type" is a bad idea, as this just makes
things not inter-operable.
Simply allow people to use whatever data structure they see fit for their
storage needs, and just deal with ranges when you want to manipulate
strings.
If the new string type implemented iterators (several types of
iterators in fact) and manages the memory for you in a configurable
manner, *and* allows you to convert it to either an std::string or an
std::wstring, why wouldn't it be inter-operable?
...
...
What I'm looking for is a
combined ownership+iteration mechanism. Right now the problem of
allocating a chunk of memory every time a you concatenate two strings
is the problem I'm trying to solve with metaprogramming and knowing at
compile time how much memory I'm going to need to allocate.
What meta-programming will give you is a type that represents an expression
that you can then evaluate. It doesn't own the values in any way.
Yes, but it allows me to know the lengths of the strings at compile
time and eliminate the multiple allocations the temporary strings I
otherwise would incur in case I just did std::string + std::string or
even in cases where I use an ostringstream. What I can then eliminate
are the multiple allocations -- something nearly impossible to do just
at runtime.
...
Range adaptors are exactly the same thing, they're a lightweight DSEL.
But the range adaptors don't solve the issue of multiple allocations
-- I still have std::strings owning the data and potentially
fragmenting the memory because I keep allocating and deallocating
small chunks of irregular-sized memory. The new string type will have
a more sane memory management mechanism which relies on as much
compile-time programming as well as runtime magic to sanitize the
memory utilization of dealing with strings. Someone can even make it
behave almost the same as an std::string and offer almost the same
interface (I'm not a fan of all the functions there) then it just
becomes a better string.
...
...
Of course if you're dealing with strings that have an unknown length
(as in my example, we really can't tell the length of 'f' at the point
's' is defined) at least getting to know the parts that have a known
length at compile time (the literals) allows me to allocate enough
space ahead of time with the compiler's help.
With ranges, the way it works is depending on whether the range you want to
get the size of has random access or not.
If it is, getting the size is O(1). If it isn't, it's O(n).
If it's a single-pass range, you can't even get the size.
a joined_range has the lowest category of the ranges it's joining, so if one
of the ranges isn't random access, then the concatenated range isn't.
Yes, that's understood. I'm happy with the ranges approach, as long as
the string type I'm using is smart enough to know that part of the
string it's referring to is a conglomeration of statically-sized and
variable-sized data.

The problem with just ranges is that I can't really just return that
from a function and expect that the data it refers to (being
iterators) will be valid when they're needed. Again, unless the range
owns the data, this will be a problem.
...
...
Maybe instead of having
multiple concatenations, what happens is I allocate a chunk "just
large enough" to hold the multiple concatenated strings, and just
traverse the lazy string as in your example. The copy happens at
runtime, the allocation of a large enough buffer (maybe a
boost::array) happens at compile-time (or at least the determination
of the size).
You can't get the size of a range at compile-time.
Precisely, which is why strings aren't *just" ranges. If I had a
literal string, I can know the size of that at compile time. If I had
a bounded string or a generator that knew exactly how much data (even
just the maximum) it will be generating that made these things known
at compile time, then I don't need to be stuck with just ranges where
you can't get the size of at compile time.
...
If you're only dealing with statically-sized stuff, I suggest you use fusion
sequences, which is basically the same thing as ranges except they're
statically-sized and can have heterogeneous elements. (and can make your
compile-time explode)
There is a boost::fusion::joint_view that behaves the same as boost::join.
You can't really mix both runtime and compile-time sized things together.
It's either one or the other, since compile-time code and runtime code at
different beasts. (of course, you can consider a fusion sequence as a range,
but then you lose the information it was statically sized when manipulating
it as a range)
Yes, but... if you made enough information known at compile time then
you can get away with optimizing then for a better runtime. If I were
to use Proto (which I most definitely will do) then I would then be
able to use Fusion to deal with the sequences and operate on these at
both compile-time and run-time.
...
...
...
boost::lazy_range is not actually in boost, but that would be a type
erased
range, and I've got an implementation somewhere.
Maybe a lazy_range would be nice to have in Boost. Or even just a join
iterator.
The join iterator is already there (since 1.43 I think? maybe before that),
I wouldn't have put it in my code without saying it's not in Boost
otherwise.
Ah, cool. I'll look into that.
...
Type erasure for ranges and iterators isn't in Boost, I suppose because
there is little interest given using the real type is much more efficient,
it would need allocations policies, and there is still hope that someday
we'll have a generic type erasure facility.
Indeed.
...
...
...
Of course, not using type erasure at all (i.e. replacing lazy_range by
auto
or the actual type of the expression) would allow it to be quite faster.
Definitely.
Then why the string_handle thing? You want a type that represents the
expression.
Ah, because 'auto' is C++0x and I was still thinking that maybe people
not moving to C++0x might as well want to be able to use this new
string type even at C++03.
...
...
The hope was to be able to determine at least the length of the whole
string from the concatenation of multiple strings
Nothing particularly hard to do there.
At compile time.
...
...
so that effective
memory allocation can be done at the time it's needed, which is
usually at the time a copy of the whole string is required.
The problem is that you seem to want to apply stateful operations on your
string object and delay them until you copy it, and have it automagically
deal with lifetime and ownership issues.
Yes! Well, technically not stateful operations -- they're functionally
pure operations: concatenation doesn't munge with the existing
strings, it just returns a new string made of the contents of the
other strings (or string generators).
...
Maybe the right solution for you is to use a rope data structure, which is a
container and doesn't actually concatenate things, it just puts the
fragments in a self-balancing tree.
Never really understood the point of that data structure myself.
In cases where you concatenate strings that amount to a huge string to
deal with, eliminating the copies or re-allocations of memory amounts
to a lot of efficiency gains. Now if only there was a rope that was
smart enough to deal with some of the things at compile-time
instead...

-- 
Dean Michael Berris
deanberris.com