[Potentially OT] String Concatenation Operator

newer
[Smart Pointers] compilation error...

Dean Michael Berris

24 Aug 2010 24 Aug '10

4:11 p.m.

Good day everyone, I am currently taking some time to implement some functionality into cpp-netlib [0] (shameless plug) and somehow I've stumbled into a sort of conundrum. First, some background: I'm trying to abstract away the string building routines of the network library to come up with the most efficient way of doing the following: 1. Efficiently allocate space to contain a string being built from literals and variable length strings. 2. Be able to build/traverse the string lazily (i.e., it doesn't matter that the string is contiguous in memory as in the case of C-strings, or whether they are built/backed by a stream as in Haskell ByteString). 3. As much as possible be "automagically" network-safe (i.e. can be dealt with by Boost.Asio without having to do much acrobatics with it). At the heart of the issue is the semantics of the '+' operator to signify string concatenation. Trying not to sound pedantic about it, the addition operator in traditional mathematical notions is both commutative and associative, while string concatenation is not commutative but right associative. Trying to remember my C++ operator precedence and associativity rules, it looks like operator% and/or operator^ might be good candidates for this, but only in expression templates where you fold from the right. Now I don't want to start beating on the STL's standard string implementation, but I'd like to know if anyone is already working on a string implementation that meets the above requirements? I'd be happy to wait on compile times with Proto, if it means I can save big at runtime. What I wanted to be able to do (and am reproducing at the moment) is a means of doing the following: string_handle f = /* some means of building a string */; string_handle s = str("Literal:") ^ f ^ str("\r\n\r\n"); std::string some_string = string_handle; // convert to string and build lazily If for instance f were also a literal, then s can efficiently already hold the string in some fixed sized byte array whose size is determined at compile time. Somehow the function str() would only be able to take a literal and look something like this: template <size_t N> inline bounded_fragment<N> str(char const s[N]) { return bounded_fragment<N>(s); } The evaluation of the assignment (or copy constructor) of the string_handle will then evaluate the expression template and already know at compile time: A. Whether the string is just a long literal and allocate enough space to effectively hold the whole string at compile time, or at least reserve enough space statically (a boost::array perhaps) so that a simple range copy can be done (and optimized by the compiler as well) B. Whether the string is a list of variable length strings, having a list of handles built C. Whether it is a mix and have all the adjacent literals joined effectively at compile time and those variable sized strings retrieved when required Pointers to ongoing work would be most appreciated -- I'm currently too preoccupied to chase this particular rabbit down the hole (I'm chasing a different rabbit in a different hole) but maybe this is an interesting enough problem for the template metaprogramming guru's to look into? Thanks in advance and I look forward to any thoughts/pointers. -- Dean Michael Berris deanberris.com

Show replies by date

joel falcou

24 Aug 24 Aug

5:42 p.m.

Can't you just work over the wstring example in proto and use this as a base for a intelligent string concatenation system ? basically, any termianl containing a string of any kind and operator+ or w/e build a proto tree that you affect to your string. This expression is crawled down at RT to gatehr the length, allcoate the buffer and do another traversal for copying each part where it does. Then use proto::lit() to build char[], char* or std::string into the proto AST.

Dean Michael Berris

25 Aug 25 Aug

2:20 a.m.

On Wed, Aug 25, 2010 at 1:42 AM, joel falcou <joel.falcou@lri.fr> wrote:

...

Can't you just work over the wstring example in proto and use this as a base for a intelligent string concatenation system ?

I can... and I just might... but not now. :) I'll read through that particular example, thanks for the pointer.

...

basically, any termianl containing a string of any kind and operator+ or w/e build a proto tree that you affect to your string. This expression is crawled down at RT to gatehr the length, allcoate the buffer and do another traversal for copying each part where it does. Then use proto::lit() to build char[], char* or std::string into the proto AST.

I wonder though, when you say RT you mean "Run-Time"? Is there a way to do it possibly at compile time, gathering the length of literal strings at least? My biggest problem with the run-time approach is that it doesn't allow me to be smart about the lengths at compile time. A different string type then that has lazy construction semantics might be memory-heavy if left to be constructed at runtime alone. I'm wondering whether there's a way to create a string type that (like the Spirit rule template) encapsulates the building blocks required to build the string at some point in time later (at evaluation time) and doesn't do anything until the whole string is actually required. Then that type can expose iterators that can be treated as a range, which builds the string one character at a time. Disregarding the requirement for constant-time random access for example, a string type that builds the string on demand might make part of its building blocks a generator tied to an external source. In case constant-time random access would be required, then the string can lazily build a contiguous string out of the building blocks, cache it, and then offer a read-only random access iterator over that (or even use it on an operator[] overload). I still have not anything concrete in mind and was just wondering aloud and thinking that something like this might have already been done by someone else. Might be interesting to work on at some point I think. ;) -- Dean Michael Berris deanberris.com

joel falcou

5:20 a.m.

On 25/08/10 04:20, Dean Michael Berris wrote:

...

I wonder though, when you say RT you mean "Run-Time"? Is there a way to do it possibly at compile time, gathering the length of literal strings at least?

Ah yes of course, you can walk the tree to find char[N] and use N as a length element. Now, mosr compiler will inline it automatically if you do it at runtime anyway.

Mathias Gaunard

9:38 a.m.

On 24/08/2010 17:11, Dean Michael Berris wrote:

...

Good day everyone,

I am currently taking some time to implement some functionality into cpp-netlib [0] (shameless plug) and somehow I've stumbled into a sort of conundrum.

First, some background: I'm trying to abstract away the string building routines of the network library to come up with the most efficient way of doing the following:

1. Efficiently allocate space to contain a string being built from literals and variable length strings. 2. Be able to build/traverse the string lazily (i.e., it doesn't matter that the string is contiguous in memory as in the case of C-strings, or whether they are built/backed by a stream as in Haskell ByteString).

It seems to be what you're looking for is a range (or a slight refinement). I.e. an entity that can be iterated. Arrays, tuples, strings, vectors, lists, a pair of istream_iterator, a pair of pointers, a concatenation of ranges, a transformed range, etc. are all ranges.

...

3. As much as possible be "automagically" network-safe (i.e. can be dealt with by Boost.Asio without having to do much acrobatics with it).

I suppose you'd have to linearize it. Sending it in multiple chunks would have different behaviour on different types of sockets.

...

What I wanted to be able to do (and am reproducing at the moment) is a means of doing the following:

string_handle f = /* some means of building a string */; string_handle s = str("Literal:") ^ f ^ str("\r\n\r\n"); std::string some_string = string_handle; // convert to string and build lazily

How about: boost::lazy_range<char> f = /* some means of building a string */ boost::lazy_range<char> s = boost::adaptors::join( boost::as_literal("Literal"), f, boost::as_literal("\r\n\r\n") ); std::string some_string; boost::copy(s, std::back_inserter(some_string)); boost::lazy_range is not actually in boost, but that would be a type erased range, and I've got an implementation somewhere. Of course, not using type erasure at all (i.e. replacing lazy_range by auto or the actual type of the expression) would allow it to be quite faster.

Mathias Gaunard

11:54 a.m.

On 25/08/10 10:38, Mathias Gaunard wrote:

...

boost::lazy_range<char> s = boost::adaptors::join(

It's actually boost::join, not boost::adaptors::join.

Dean Michael Berris

26 Aug 26 Aug

7:09 a.m.

On Wed, Aug 25, 2010 at 5:38 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

On 24/08/2010 17:11, Dean Michael Berris wrote:

...
1. Efficiently allocate space to contain a string being built from literals and variable length strings. 2. Be able to build/traverse the string lazily (i.e., it doesn't matter that the string is contiguous in memory as in the case of C-strings, or whether they are built/backed by a stream as in Haskell ByteString).

It seems to be what you're looking for is a range (or a slight refinement). I.e. an entity that can be iterated.

Almost... That's just one part of it.

...

Arrays, tuples, strings, vectors, lists, a pair of istream_iterator, a pair of pointers, a concatenation of ranges, a transformed range, etc. are all ranges.

Indeed. However, I am looking for a means of representing a string -- a collection of characters on which you can implement algorithms on. They might as well be ranges, but things like pattern matching and templates (as in string templates, where you have placeholders and other things) can be applied to or used to generate them.

...

...
3. As much as possible be "automagically" network-safe (i.e. can be dealt with by Boost.Asio without having to do much acrobatics with it).

I suppose you'd have to linearize it. Sending it in multiple chunks would have different behaviour on different types of sockets.

Yes, but something that is inherently supported by the type. Why strings are important has a lot to do with being able to perform domain-specific optimization on string algorithms. Things like capitalization, whitespace removal, encoding/decoding, transformations like breaking up strings according to some pattern (tokenization, parsing, etc.). Because you can specialize the memory-management of strings (as opposed to just ranges of char's) the "win" in treating a string as a separate type are practical rather than conceptual.

...

...
What I wanted to be able to do (and am reproducing at the moment) is a means of doing the following:

string_handle f = /* some means of building a string */; string_handle s = str("Literal:") ^ f ^ str("\r\n\r\n");

...
std::string some_string = string_handle; // convert to string and build lazily

How about:

boost::lazy_range<char> f = /* some means of building a string */

boost::lazy_range<char> s = boost::adaptors::join( boost::as_literal("Literal"), f, boost::as_literal("\r\n\r\n") );

std::string some_string; boost::copy(s, std::back_inserter(some_string));

That's fine if I can control the memory allocation of the lazy range. As it is, a lazy range just represents a collection of iterator pairs -- the data has to live somewhere still. What I'm looking for is a combined ownership+iteration mechanism. Right now the problem of allocating a chunk of memory every time a you concatenate two strings is the problem I'm trying to solve with metaprogramming and knowing at compile time how much memory I'm going to need to allocate. Of course if you're dealing with strings that have an unknown length (as in my example, we really can't tell the length of 'f' at the point 's' is defined) at least getting to know the parts that have a known length at compile time (the literals) allows me to allocate enough space ahead of time with the compiler's help. Maybe instead of having multiple concatenations, what happens is I allocate a chunk "just large enough" to hold the multiple concatenated strings, and just traverse the lazy string as in your example. The copy happens at runtime, the allocation of a large enough buffer (maybe a boost::array) happens at compile-time (or at least the determination of the size).

...

boost::lazy_range is not actually in boost, but that would be a type erased range, and I've got an implementation somewhere.

Maybe a lazy_range would be nice to have in Boost. Or even just a join iterator.

...

Of course, not using type erasure at all (i.e. replacing lazy_range by auto or the actual type of the expression) would allow it to be quite faster.

Definitely. The hope was to be able to determine at least the length of the whole string from the concatenation of multiple strings, so that effective memory allocation can be done at the time it's needed, which is usually at the time a copy of the whole string is required. -- Dean Michael Berris deanberris.com

Mathias Gaunard

11:07 a.m.

On 26/08/10 08:09, Dean Michael Berris wrote:

...

Indeed.

However, I am looking for a means of representing a string -- a collection of characters on which you can implement algorithms on. They might as well be ranges, but things like pattern matching and templates (as in string templates, where you have placeholders and other things) can be applied to or used to generate them.

A string is just a range of characters; nothing more.

...

...
I suppose you'd have to linearize it. Sending it in multiple chunks would have different behaviour on different types of sockets.

Yes, but something that is inherently supported by the type.

There is no "inherently supported" or "not inherently supported". Linearizing the memory means you have to copy it in a contiguous buffer.

...

Why strings are important has a lot to do with being able to perform domain-specific optimization on string algorithms. Things like capitalization, whitespace removal, encoding/decoding, transformations like breaking up strings according to some pattern (tokenization, parsing, etc.). Because you can specialize the memory-management of strings (as opposed to just ranges of char's) the "win" in treating a string as a separate type are practical rather than conceptual.

I've got a system that does arbitrary conversion and segmentation of ranges in a lazy way without ever allocating memory. I can use it for things like on-the-fly character encoding conversion, Unicode normalization, or base64 encoding, for a few examples. It never requires to differentiate strings from other types of ranges, not to use special containers or anything.

...

...
...
What I wanted to be able to do (and am reproducing at the moment) is a means of doing the following:

string_handle f = /* some means of building a string */; string_handle s = str("Literal:") ^ f ^ str("\r\n\r\n");

...
std::string some_string = string_handle; // convert to string and build lazily

How about:

boost::lazy_range<char> f = /* some means of building a string */

boost::lazy_range<char> s = boost::adaptors::join( boost::as_literal("Literal"), f, boost::as_literal("\r\n\r\n") );

std::string some_string; boost::copy(s, std::back_inserter(some_string));

Just in case you haven't noticed, that's exactly the same code as you wrote, except it's real code that works and it uses different names.

...

That's fine if I can control the memory allocation of the lazy range.

Using type erasure means you have to use dynamic memory allocation eventually, unless you limit it to use stack storage and refuse types that are too large. (which in practice should be able to work fine)

...

As it is, a lazy range just represents a collection of iterator pairs

Just the one pair, actually.

...

-- the data has to live somewhere still.

Having it live in your new "string type" is a bad idea, as this just makes things not inter-operable. Simply allow people to use whatever data structure they see fit for their storage needs, and just deal with ranges when you want to manipulate strings.

...

What I'm looking for is a combined ownership+iteration mechanism. Right now the problem of allocating a chunk of memory every time a you concatenate two strings is the problem I'm trying to solve with metaprogramming and knowing at compile time how much memory I'm going to need to allocate.

What meta-programming will give you is a type that represents an expression that you can then evaluate. It doesn't own the values in any way. Range adaptors are exactly the same thing, they're a lightweight DSEL.

...

Of course if you're dealing with strings that have an unknown length (as in my example, we really can't tell the length of 'f' at the point 's' is defined) at least getting to know the parts that have a known length at compile time (the literals) allows me to allocate enough space ahead of time with the compiler's help.

With ranges, the way it works is depending on whether the range you want to get the size of has random access or not. If it is, getting the size is O(1). If it isn't, it's O(n). If it's a single-pass range, you can't even get the size. a joined_range has the lowest category of the ranges it's joining, so if one of the ranges isn't random access, then the concatenated range isn't.

...

Maybe instead of having multiple concatenations, what happens is I allocate a chunk "just large enough" to hold the multiple concatenated strings, and just traverse the lazy string as in your example. The copy happens at runtime, the allocation of a large enough buffer (maybe a boost::array) happens at compile-time (or at least the determination of the size).

You can't get the size of a range at compile-time. If you're only dealing with statically-sized stuff, I suggest you use fusion sequences, which is basically the same thing as ranges except they're statically-sized and can have heterogeneous elements. (and can make your compile-time explode) There is a boost::fusion::joint_view that behaves the same as boost::join. You can't really mix both runtime and compile-time sized things together. It's either one or the other, since compile-time code and runtime code at different beasts. (of course, you can consider a fusion sequence as a range, but then you lose the information it was statically sized when manipulating it as a range)

...

...
boost::lazy_range is not actually in boost, but that would be a type erased range, and I've got an implementation somewhere.

Maybe a lazy_range would be nice to have in Boost. Or even just a join iterator.

The join iterator is already there (since 1.43 I think? maybe before that), I wouldn't have put it in my code without saying it's not in Boost otherwise. Type erasure for ranges and iterators isn't in Boost, I suppose because there is little interest given using the real type is much more efficient, it would need allocations policies, and there is still hope that someday we'll have a generic type erasure facility.

...

...
Of course, not using type erasure at all (i.e. replacing lazy_range by auto or the actual type of the expression) would allow it to be quite faster.

Definitely.

Then why the string_handle thing? You want a type that represents the expression.

...

The hope was to be able to determine at least the length of the whole string from the concatenation of multiple strings

Nothing particularly hard to do there.

...

so that effective memory allocation can be done at the time it's needed, which is usually at the time a copy of the whole string is required.

The problem is that you seem to want to apply stateful operations on your string object and delay them until you copy it, and have it automagically deal with lifetime and ownership issues. Maybe the right solution for you is to use a rope data structure, which is a container and doesn't actually concatenate things, it just puts the fragments in a self-balancing tree. Never really understood the point of that data structure myself.

Jeffrey Lee Hellrung, Jr.

2:19 p.m.

On 8/26/2010 4:07 AM, Mathias Gaunard wrote:

...

On 26/08/10 08:09, Dean Michael Berris wrote: [...]

...
Maybe instead of having multiple concatenations, what happens is I allocate a chunk "just large enough" to hold the multiple concatenated strings, and just traverse the lazy string as in your example. The copy happens at runtime, the allocation of a large enough buffer (maybe a boost::array) happens at compile-time (or at least the determination of the size).

You can't get the size of a range at compile-time.

But you can define a trait or metafunction that tells you *whether* the range has a compile-time size (and, if so, then *what* that size is), and go around specializing it for C arrays, Boost.Array, the Boost.Range adaptors, etc. E.g., a transformed range is statically sized iff it's underlying range is statically sized; a joined range is statically sized if *all* its underlying ranges are statically sized; etc. Something like that may be what the OP is looking for... - Jeff

Mathias Gaunard

4:47 p.m.

On 26/08/10 15:19, Jeffrey Lee Hellrung, Jr. wrote:

...

...
You can't get the size of a range at compile-time.

But you can define a trait or metafunction that tells you *whether* the range has a compile-time size (and, if so, then *what* that size is)

You might as well special case fusion sequences.

...

and go around specializing it for C arrays, Boost.Array, the Boost.Range adaptors, etc. E.g., a transformed range is statically sized iff it's underlying range is statically sized; a joined range is statically sized if *all* its underlying ranges are statically sized; etc.

Indeed, you could specialize all range adaptors to call the fusion equivalent when passed fusion sequences, but then it wouldn't be clear what the return type is, since an iterator adaptor and a fusion expression are quite different things. This is already a problem with ranges, as return types are not well documented (and sometimes are in a detail namespace...). This could be solved with a result_of namespace like fusion does.

Jeremy Maitin-Shepard

2:37 p.m.

On 08/26/2010 07:07 AM, Mathias Gaunard wrote:

...

[snip]

...

There is no "inherently supported" or "not inherently supported". Linearizing the memory means you have to copy it in a contiguous buffer.

I'm not sure which protocol you're thinking of, but even over a UDP socket you can use scatter/gather send/recv (at least under POSIX). It is not clear that this mechanism is necessarily substantially more efficient in practice than just copying the data to a contiguous buffer, but in principle it seems useful for the data structure, assuming it is in fact represented as a sequence of arrays, to provide a way to access that underlying representation for the purpose of IO. Although I imagine this facility would be used most often with char arrays, I agree there is no advantage (except perhaps faster compilation due to reduced use of templates) in restricting it to char.

...

[snip]

...

Then why the string_handle thing? You want a type that represents the expression.

It seems that the idea would be better realized by replacing string_handle with auto. Still, I would think the same result could be achieved by implementing certain optimizations in the existing join, e.g. joining statically sized arrays produces a single statically sized array.

...

[snip]

Dean Michael Berris

27 Aug 27 Aug

2:50 a.m.

On Thu, Aug 26, 2010 at 7:07 PM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

On 26/08/10 08:09, Dean Michael Berris wrote:

...
Indeed.

However, I am looking for a means of representing a string -- a collection of characters on which you can implement algorithms on. They might as well be ranges, but things like pattern matching and templates (as in string templates, where you have placeholders and other things) can be applied to or used to generate them.

A string is just a range of characters; nothing more.

Actually, not *just* a range of characters. If your range dealt with ownership of data and abstracted the means by which the data is actually manipulated (even through iterators) then maybe a string is just a range of characters. If your range made sure that the data is moved instead of copied in certain situations, or whether there is an optimization that can be made on certain operations (like concatenation) then yeah I'll agree that a string is just a range of characters. If you think about a string as a special beast that does a lot of things: 1. Allocates memory which holds the characters 2. Offers an abstraction unique to strings (token_iterator, line_iterator, char_iterator, wchar_iterator) 3. Has value semantics similar to built-in types (copyable, movable, "regular") Then it doesn't look like just a range.

...

...
...
I suppose you'd have to linearize it. Sending it in multiple chunks would have different behaviour on different types of sockets.

Yes, but something that is inherently supported by the type.

There is no "inherently supported" or "not inherently supported". Linearizing the memory means you have to copy it in a contiguous buffer.

Sure, but if the string type already does that for you at the time you need it, then it's inherently supported by the type, no?

...

...
Why strings are important has a lot to do with being able to perform domain-specific optimization on string algorithms. Things like capitalization, whitespace removal, encoding/decoding, transformations like breaking up strings according to some pattern (tokenization, parsing, etc.). Because you can specialize the memory-management of strings (as opposed to just ranges of char's) the "win" in treating a string as a separate type are practical rather than conceptual.

I've got a system that does arbitrary conversion and segmentation of ranges in a lazy way without ever allocating memory.

Sure, but the memory has to be allocated somewhere else. I'm really looking for that "thing that allocates the memory and manages it for me" more than just the range. If you use std::string, then I think what I want to replace is an std::string.

...

I can use it for things like on-the-fly character encoding conversion, Unicode normalization, or base64 encoding, for a few examples.

It never requires to differentiate strings from other types of ranges, not to use special containers or anything.

That's cool, but in cases where you deal with potentially huge strings (i.e. more than a memory page's worth of data) you start looking for ways of moving some of the work out of runtime and into compile time (things like getting the size of the literals). And especially in cases where you need to abstract the string from being something that is exclusively in memory to something that refers to data that is not in memory (retrieved from a socket, from a file, from user input) you run into issues like buffer management, demand-driven/lazy-loading data, etc. that a range is not the end-all and/or best solution for. For instance, think about a forward iterator (or a single-pass iterator?) that the string can expose to allow for one-time traversal of the data it holds or it refers to. The string doesn't have to have the data all in memory if it doesn't yet, and it can then lazily load the data and expose it through that forward iterator. You can still deal with it in ranges, but the string is smart enough to figure out how to manage the memory and the data it represents.

...

Just in case you haven't noticed, that's exactly the same code as you wrote, except it's real code that works and it uses different names.

Until you think about the string_handle actually manages the memory unto which the data is stored.

...

...
That's fine if I can control the memory allocation of the lazy range.

Using type erasure means you have to use dynamic memory allocation eventually, unless you limit it to use stack storage and refuse types that are too large. (which in practice should be able to work fine)

Not in that sense. The dynamic memory will have to be managed by the string, or the string handle. If the string handle knew that some part of the string was a conglomeration of statically-sized literals then it can hold that data in a boost::array of the correct size -- then think about the transformed proto tree to have nodes merged in case they were adjoined literals be a single node as a merged literal, or in cases where there are bounded strings (even those generated at runtime, i.e. a bounded string of length 255) then you can correctly allocate enough space at the point where the operator= is implemented. Of course the type erasure might be an issue, but knowing the eventual size at compile time allows you a lot of optimizations you otherwise can't or won't do.

...

...
As it is, a lazy range just represents a collection of iterator pairs

Just the one pair, actually.

Yes, but then you have a left-leaning tree of iterator pairs. ;)

...

...
-- the data has to live somewhere still.

Having it live in your new "string type" is a bad idea, as this just makes things not inter-operable. Simply allow people to use whatever data structure they see fit for their storage needs, and just deal with ranges when you want to manipulate strings.

If the new string type implemented iterators (several types of iterators in fact) and manages the memory for you in a configurable manner, *and* allows you to convert it to either an std::string or an std::wstring, why wouldn't it be inter-operable?

...

...
What I'm looking for is a combined ownership+iteration mechanism. Right now the problem of allocating a chunk of memory every time a you concatenate two strings is the problem I'm trying to solve with metaprogramming and knowing at compile time how much memory I'm going to need to allocate.

What meta-programming will give you is a type that represents an expression that you can then evaluate. It doesn't own the values in any way.

Yes, but it allows me to know the lengths of the strings at compile time and eliminate the multiple allocations the temporary strings I otherwise would incur in case I just did std::string + std::string or even in cases where I use an ostringstream. What I can then eliminate are the multiple allocations -- something nearly impossible to do just at runtime.

...

Range adaptors are exactly the same thing, they're a lightweight DSEL.

But the range adaptors don't solve the issue of multiple allocations -- I still have std::strings owning the data and potentially fragmenting the memory because I keep allocating and deallocating small chunks of irregular-sized memory. The new string type will have a more sane memory management mechanism which relies on as much compile-time programming as well as runtime magic to sanitize the memory utilization of dealing with strings. Someone can even make it behave almost the same as an std::string and offer almost the same interface (I'm not a fan of all the functions there) then it just becomes a better string.

...

...
Of course if you're dealing with strings that have an unknown length (as in my example, we really can't tell the length of 'f' at the point 's' is defined) at least getting to know the parts that have a known length at compile time (the literals) allows me to allocate enough space ahead of time with the compiler's help.

With ranges, the way it works is depending on whether the range you want to get the size of has random access or not. If it is, getting the size is O(1). If it isn't, it's O(n). If it's a single-pass range, you can't even get the size.

a joined_range has the lowest category of the ranges it's joining, so if one of the ranges isn't random access, then the concatenated range isn't.

Yes, that's understood. I'm happy with the ranges approach, as long as the string type I'm using is smart enough to know that part of the string it's referring to is a conglomeration of statically-sized and variable-sized data. The problem with just ranges is that I can't really just return that from a function and expect that the data it refers to (being iterators) will be valid when they're needed. Again, unless the range owns the data, this will be a problem.

...

...
Maybe instead of having multiple concatenations, what happens is I allocate a chunk "just large enough" to hold the multiple concatenated strings, and just traverse the lazy string as in your example. The copy happens at runtime, the allocation of a large enough buffer (maybe a boost::array) happens at compile-time (or at least the determination of the size).

You can't get the size of a range at compile-time.

Precisely, which is why strings aren't *just" ranges. If I had a literal string, I can know the size of that at compile time. If I had a bounded string or a generator that knew exactly how much data (even just the maximum) it will be generating that made these things known at compile time, then I don't need to be stuck with just ranges where you can't get the size of at compile time.

...

If you're only dealing with statically-sized stuff, I suggest you use fusion sequences, which is basically the same thing as ranges except they're statically-sized and can have heterogeneous elements. (and can make your compile-time explode) There is a boost::fusion::joint_view that behaves the same as boost::join.

You can't really mix both runtime and compile-time sized things together. It's either one or the other, since compile-time code and runtime code at different beasts. (of course, you can consider a fusion sequence as a range, but then you lose the information it was statically sized when manipulating it as a range)

Yes, but... if you made enough information known at compile time then you can get away with optimizing then for a better runtime. If I were to use Proto (which I most definitely will do) then I would then be able to use Fusion to deal with the sequences and operate on these at both compile-time and run-time.

...

...
...
boost::lazy_range is not actually in boost, but that would be a type erased range, and I've got an implementation somewhere.

Maybe a lazy_range would be nice to have in Boost. Or even just a join iterator.

The join iterator is already there (since 1.43 I think? maybe before that), I wouldn't have put it in my code without saying it's not in Boost otherwise.

Ah, cool. I'll look into that.

...

Type erasure for ranges and iterators isn't in Boost, I suppose because there is little interest given using the real type is much more efficient, it would need allocations policies, and there is still hope that someday we'll have a generic type erasure facility.

Indeed.

...

...
...
Of course, not using type erasure at all (i.e. replacing lazy_range by auto or the actual type of the expression) would allow it to be quite faster.

Definitely.

Then why the string_handle thing? You want a type that represents the expression.

Ah, because 'auto' is C++0x and I was still thinking that maybe people not moving to C++0x might as well want to be able to use this new string type even at C++03.

...

...
The hope was to be able to determine at least the length of the whole string from the concatenation of multiple strings

Nothing particularly hard to do there.

At compile time.

...

...
so that effective memory allocation can be done at the time it's needed, which is usually at the time a copy of the whole string is required.

The problem is that you seem to want to apply stateful operations on your string object and delay them until you copy it, and have it automagically deal with lifetime and ownership issues.

Yes! Well, technically not stateful operations -- they're functionally pure operations: concatenation doesn't munge with the existing strings, it just returns a new string made of the contents of the other strings (or string generators).

...

Maybe the right solution for you is to use a rope data structure, which is a container and doesn't actually concatenate things, it just puts the fragments in a self-balancing tree. Never really understood the point of that data structure myself.

In cases where you concatenate strings that amount to a huge string to deal with, eliminating the copies or re-allocations of memory amounts to a lot of efficiency gains. Now if only there was a rope that was smart enough to deal with some of the things at compile-time instead... -- Dean Michael Berris deanberris.com

Mathias Gaunard

9:30 p.m.

On 27/08/2010 03:50, Dean Michael Berris wrote:

...

Actually, not *just* a range of characters.

If your range dealt with ownership of data and abstracted the means by which the data is actually manipulated (even through iterators) then maybe a string is just a range of characters. If your range made sure that the data is moved instead of copied in certain situations, or whether there is an optimization that can be made on certain operations (like concatenation) then yeah I'll agree that a string is just a range of characters.

If you think about a string as a special beast that does a lot of things:

1. Allocates memory which holds the characters 2. Offers an abstraction unique to strings (token_iterator, line_iterator, char_iterator, wchar_iterator) 3. Has value semantics similar to built-in types (copyable, movable, "regular")

Then it doesn't look like just a range.

What I call a string is just the data, not a container of said data. I don't see what 2 is doing in there. Surely those mechanisms are independent of the container.

...

Sure, but if the string type already does that for you at the time you need it, then it's inherently supported by the type, no?

If you can iterate through a sequence of copyable elements, then you can copy those elements into a new sequence of the structure of your choosing. This doesn't require any particular support from the first sequence type.

...

That's cool, but in cases where you deal with potentially huge strings (i.e. more than a memory page's worth of data) you start looking for ways of moving some of the work out of runtime and into compile time

This doesn't make sense. If the data is too big to fit into memory, then it's going to be completely out of the range (no pun intended) of what the compile-time world can deal with.

...

And especially in cases where you need to abstract the string from being something that is exclusively in memory to something that refers to data that is not in memory (retrieved from a socket, from a file, from user input) you run into issues like buffer management, demand-driven/lazy-loading data, etc. that a range is not the end-all and/or best solution for.

For instance, think about a forward iterator (or a single-pass iterator?) that the string can expose to allow for one-time traversal of the data it holds or it refers to. The string doesn't have to have the data all in memory if it doesn't yet, and it can then lazily load the data and expose it through that forward iterator.

I don't get your argument. Basically you're saying "wouldn't it be cool if you could not actually store your data in memory, but generate it as you're traversing or get it from I/O on demand?". That's what ranges are. Ranges *are* iterators.

...

If the string handle knew that some part of the string was a conglomeration of statically-sized literals then it can hold that data in a boost::array of the correct size

No it cannot. string_handle is a single type defined in advance, and it must have a finite size. It cannot guess how many conglomerations of statically-sized literals you're going to want to put into it, and therefore can't guarantee enough storage.

...

then you can correctly allocate enough space at the point where the operator= is implemented.

At runtime; (operator= is a function that executes code, not a type definition that provides automatic storage) which makes the memory dynamically allocated, as I said.

...

Of course the type erasure might be an issue, but knowing the eventual size at compile time allows you a lot of optimizations you otherwise can't or won't do.

Here, I can't see it bringing any benefit compared to knowing it at runtime, since the allocation can only happen at runtime.

...

Yes, but then you have a left-leaning tree of iterator pairs. ;)

Just like you have a tree of a bounded segments, or an AST if you go for a proto solution.

...

If the new string type implemented iterators (several types of iterators in fact) and manages the memory for you in a configurable manner, *and* allows you to convert it to either an std::string or an std::wstring, why wouldn't it be inter-operable?

Because your code will expect your string type, not the string type of the user, meaning he'll have to convert to it. Converting is not actually needed, since you could just treat your type, the type of the user, or any smart lazy evaluated type the same.

...

But the range adaptors don't solve the issue of multiple allocations

How do they not? Your problem is that you reallocate multiple times with a classic binary operator+ implementation: (pseudo-code) buffer = a + b + c is buffer = allocate(size(a)) copy(buffer, a) buffer2 = allocate(size(buffer)+size(b)) copy(buffer2, buffer) cat(buffer2, b) free(buffer) buffer = allocate(size(buffer2)+size(c)) copy(buffer, buffer2) cat(buffer, c) free(buffer2) Instead of that you could just evaluate that expression as r = join(a, b, c) buffer3 = allocate(size(r)) copy(buffer3, r) As you can see, it solves the problem just fine.

...

Ah, because 'auto' is C++0x and I was still thinking that maybe people not moving to C++0x might as well want to be able to use this new string type even at C++03.

You can write the type out yourself or use a result_of-like protocol, like Fusion does.

...

Yes! Well, technically not stateful operations -- they're functionally pure operations: concatenation doesn't munge with the existing strings, it just returns a new string made of the contents of the other strings (or string generators).

If you end up re-storing the result into the same variable, it basically is stateful.

Dean Michael Berris

29 Aug 29 Aug

7:23 a.m.

On Sat, Aug 28, 2010 at 5:30 AM, Mathias Gaunard <mathias.gaunard@ens-lyon.org> wrote:

...

On 27/08/2010 03:50, Dean Michael Berris wrote:

...
Actually, not *just* a range of characters.

If your range dealt with ownership of data and abstracted the means by which the data is actually manipulated (even through iterators) then maybe a string is just a range of characters. If your range made sure that the data is moved instead of copied in certain situations, or whether there is an optimization that can be made on certain operations (like concatenation) then yeah I'll agree that a string is just a range of characters.

If you think about a string as a special beast that does a lot of things:

1. Allocates memory which holds the characters 2. Offers an abstraction unique to strings (token_iterator, line_iterator, char_iterator, wchar_iterator) 3. Has value semantics similar to built-in types (copyable, movable, "regular")

Then it doesn't look like just a range.

What I call a string is just the data, not a container of said data.

I don't see what 2 is doing in there. Surely those mechanisms are independent of the container.

The data is the container when it comes to the (ugly) world of strings. You can treat a string as a special case of a container that is also the data. The string "Foo" is really a self-contained data-type which has the properties of size (amount of memory used) and contents. For literal strings much of the information (except maybe the contents) is available at compile time, and using a pure runtime construct such as a range is inadequate IMO -- thus the question about a different (if not better) means of concatenating literal strings and by extension strings in general, abstracted enough by a string type that knows how to deal with these things both at compile time and runtime. Like I said, you can treat a string as a range through the iterators, but really looking at a string as *just* a range without thinking about the implications of efficient memory management and value semantics outside the realm of ranges is lacking IMO.

...

...
Sure, but if the string type already does that for you at the time you need it, then it's inherently supported by the type, no?

If you can iterate through a sequence of copyable elements, then you can copy those elements into a new sequence of the structure of your choosing. This doesn't require any particular support from the first sequence type.

My contention is that you don't need to iterate through a sequence of copyable elements earlier than you have to. A string type should be allowed and should be able to delay these operations until the point of when the information is required. Again unless you can return a range from a function and ensure that the data that the range refers to is valid (unlike a container which does guarantee that) then ranges would be insufficient for that case.

...

...
That's cool, but in cases where you deal with potentially huge strings (i.e. more than a memory page's worth of data) you start looking for ways of moving some of the work out of runtime and into compile time

This doesn't make sense. If the data is too big to fit into memory, then it's going to be completely out of the range (no pun intended) of what the compile-time world can deal with.

Actually not so. You can know the length of a string at compile time or at least the maximum length of the string at compile, then it costs you only the size of the type that holds the information about the string. It doesn't mean that you need the whole string in memory already at compile time. If for example a string is known to have only a maximum length of 4096 bytes and a minimum of 100 bytes at compile time, then the string type can then allocate only as much as 100 bytes enough to represent that sub-string. At runtime then you already have a buffer that is at least 100 bytes long, and a string type that exposes an iterator that uses an in-memory window of 100 bytes and re-use that buffer that it already allocated and knew the size of statically. Think about strings that are mmapped to files or those that are built from socket IO. In cases where you know the string will always be 4096 bytes at compile time, then what's the point of not allocating all the 4096 bytes as a boost::array for example?

...

...
And especially in cases where you need to abstract the string from being something that is exclusively in memory to something that refers to data that is not in memory (retrieved from a socket, from a file, from user input) you run into issues like buffer management, demand-driven/lazy-loading data, etc. that a range is not the end-all and/or best solution for.

For instance, think about a forward iterator (or a single-pass iterator?) that the string can expose to allow for one-time traversal of the data it holds or it refers to. The string doesn't have to have the data all in memory if it doesn't yet, and it can then lazily load the data and expose it through that forward iterator.

I don't get your argument. Basically you're saying "wouldn't it be cool if you could not actually store your data in memory, but generate it as you're traversing or get it from I/O on demand?". That's what ranges are. Ranges *are* iterators.

I'm not saying ranges don't work. I'm saying for the purpose of doing a better string, thinking just in ranges is lacking. Nothing is stopping the new string type to be smarter than your average string and still be dealt with like a range. I'm just saying that thinking about it as *just* a range is missing the forest for the trees.

...

...
If the string handle knew that some part of the string was a conglomeration of statically-sized literals then it can hold that data in a boost::array of the correct size

No it cannot. string_handle is a single type defined in advance, and it must have a finite size. It cannot guess how many conglomerations of statically-sized literals you're going to want to put into it, and therefore can't guarantee enough storage.

You can actually -- if you treat the operator= as your evaluator, you can have a temporary "statically-sized" boost::array that holds the conglomerated data, wrapped in a template type that derives from a virtual base, which exposes the appropriate iterators as a pimpl. You can then as part of either the copy constructor, move constructor, and/or the copy/move assignment operator for string_handle. Pseudocode: struct base; template <size_t size_> struct data : base { // implementation here }; struct string_handle { // ... template <class AST> string_handle & operator=(AST const & rhs) { boost::array<typename char_<AST>::type, sizeof_<AST>::value> buffer; eval(rhs, buffer); // perform the copy to the buffer this->pimpl(new data<sizeof_<AST>::value>(move(buffer))); return *this; } // ... unique_ptr<base> pimpl; }; You can even think about a list of variants to hold conglomerated literals, bounded strings, and string_handle instances, and then expose iterators that know how to traverse the list of variants maintained by the string_handle.

...

...
then you can correctly allocate enough space at the point where the operator= is implemented.

At runtime; (operator= is a function that executes code, not a type definition that provides automatic storage) which makes the memory dynamically allocated, as I said.

Sure, but operator= can be a template function that can also perform computations on the types/statically-known-data at compile time. It's not just about avoiding dynamic memory, it's all about knowing the sizes at compile-time.

...

...
Of course the type erasure might be an issue, but knowing the eventual size at compile time allows you a lot of optimizations you otherwise can't or won't do.

Here, I can't see it bringing any benefit compared to knowing it at runtime, since the allocation can only happen at runtime.

If you know that you have 10 adjacent literal strings from the Proto AST then you can transform that into one long literal buffer and know the size at compile time. Even if your allocation happens at runtime, knowing the size upfront saves you 9 unnecessary temporary allocations, or 9 iterator pairs in case of a joined range. I don't see why that's *not* a saving.

...

...
Yes, but then you have a left-leaning tree of iterator pairs. ;)

Just like you have a tree of a bounded segments, or an AST if you go for a proto solution.

Yes, but you can definitely do a lot more at compile time with a proto AST than a left-leaning tree of iterator pairs in the case of just runtime ranges.

...

...
If the new string type implemented iterators (several types of iterators in fact) and manages the memory for you in a configurable manner, *and* allows you to convert it to either an std::string or an std::wstring, why wouldn't it be inter-operable?

Because your code will expect your string type, not the string type of the user, meaning he'll have to convert to it. Converting is not actually needed, since you could just treat your type, the type of the user, or any smart lazy evaluated type the same.

But if the client code didn't even see this type -- it can be that magical you know -- and it plays well with the other STL-string like types, why won't it be interoperable?

...

...
But the range adaptors don't solve the issue of multiple allocations

How do they not?

Your problem is that you reallocate multiple times with a classic binary operator+ implementation: (pseudo-code)

buffer = a + b + c

is

buffer = allocate(size(a)) copy(buffer, a)

buffer2 = allocate(size(buffer)+size(b)) copy(buffer2, buffer) cat(buffer2, b) free(buffer)

buffer = allocate(size(buffer2)+size(c)) copy(buffer, buffer2) cat(buffer, c) free(buffer2)

Instead of that you could just evaluate that expression as

r = join(a, b, c) buffer3 = allocate(size(r)) copy(buffer3, r)

As you can see, it solves the problem just fine.

Not if a, b, or c are literals. The call to size(r) is not evaluated at compile-time, and you'll have to store a, b, c, somehow as iterable strings -- temporary std::string objects -- each one being an allocation -- instead of just a pointer to statically-located data as what a compiler will usually do for string literals. The traversal for size(r) is also going to happen at runtime and is potentially an O(n) operation.

...

...
Ah, because 'auto' is C++0x and I was still thinking that maybe people not moving to C++0x might as well want to be able to use this new string type even at C++03.

You can write the type out yourself or use a result_of-like protocol, like Fusion does.

Sure, that's one option, but the simple type erasure that costs exactly one indirection would be just fine.

...

...
Yes! Well, technically not stateful operations -- they're functionally pure operations: concatenation doesn't munge with the existing strings, it just returns a new string made of the contents of the other strings (or string generators).

If you end up re-storing the result into the same variable, it basically is stateful.

But you don't in the case of concatenation -- strictly speaking, the assignment operation is the stateful part, not the concatentation. In the situation I'm exploring, Concatenation is non-immediate (more correctly, is a lazy operation) and is non-destructive. Of course, the string will eventually have to expose the iterators anyway, and all the range operations will still apply. Except the smarter string will already have bought you the optimizations possible at compile time and at runtime. -- Dean Michael Berris deanberris.com

Jeroen Habraken

25 Aug 25 Aug

10:18 a.m.

Hi, On 24 August 2010 18:11, Dean Michael Berris <mikhailberis@gmail.com> wrote:

...

Good day everyone,

I am currently taking some time to implement some functionality into cpp-netlib [0] (shameless plug) and somehow I've stumbled into a sort of conundrum.

First, some background: I'm trying to abstract away the string building routines of the network library to come up with the most efficient way of doing the following:

1. Efficiently allocate space to contain a string being built from literals and variable length strings. 2. Be able to build/traverse the string lazily (i.e., it doesn't matter that the string is contiguous in memory as in the case of C-strings, or whether they are built/backed by a stream as in Haskell ByteString). 3. As much as possible be "automagically" network-safe (i.e. can be dealt with by Boost.Asio without having to do much acrobatics with it).

At the heart of the issue is the semantics of the '+' operator to signify string concatenation. Trying not to sound pedantic about it, the addition operator in traditional mathematical notions is both commutative and associative, while string concatenation is not commutative but right associative. Trying to remember my C++ operator precedence and associativity rules, it looks like operator% and/or operator^ might be good candidates for this, but only in expression templates where you fold from the right.

Now I don't want to start beating on the STL's standard string implementation, but I'd like to know if anyone is already working on a string implementation that meets the above requirements? I'd be happy to wait on compile times with Proto, if it means I can save big at runtime.

What I wanted to be able to do (and am reproducing at the moment) is a means of doing the following:

string_handle f = /* some means of building a string */; string_handle s = str("Literal:") ^ f ^ str("\r\n\r\n"); std::string some_string = string_handle; // convert to string and build lazily

If for instance f were also a literal, then s can efficiently already hold the string in some fixed sized byte array whose size is determined at compile time. Somehow the function str() would only be able to take a literal and look something like this:

template <size_t N> inline bounded_fragment<N> str(char const s[N]) { return bounded_fragment<N>(s); }

The evaluation of the assignment (or copy constructor) of the string_handle will then evaluate the expression template and already know at compile time:

A. Whether the string is just a long literal and allocate enough space to effectively hold the whole string at compile time, or at least reserve enough space statically (a boost::array perhaps) so that a simple range copy can be done (and optimized by the compiler as well)

B. Whether the string is a list of variable length strings, having a list of handles built

C. Whether it is a mix and have all the adjacent literals joined effectively at compile time and those variable sized strings retrieved when required

Pointers to ongoing work would be most appreciated -- I'm currently too preoccupied to chase this particular rabbit down the hole (I'm chasing a different rabbit in a different hole) but maybe this is an interesting enough problem for the template metaprogramming guru's to look into?

Thanks in advance and I look forward to any thoughts/pointers.

-- Dean Michael Berris deanberris.com _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

There are two things that come to mind: - ropes <http://www.sgi.com/tech/stl/Rope.html> - libevent buffers <http://monkey.org/~provos/libevent/doxygen-2.0.1/structbufferevent.html> The latter has been designed for a purpose similar to yours and I believe a lot can be learned from its implementation, even though it is C code. Regards, Jeroen

5423

Age (days ago)

5428

Last active (days ago)

List overview

Download

14 comments

6 participants

participants (6)

Dean Michael Berris
Jeffrey Lee Hellrung, Jr.
Jeremy Maitin-Shepard
Jeroen Habraken
joel falcou
Mathias Gaunard