
on Wed Sep 03 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
On Wed, Sep 3, 2008 at 3:11 PM, David Abrahams <dave@boostpro.com> wrote:
on Tue Sep 02 2008, "Giovanni Piero Deretta" <gpderetta-AT-gmail.com> wrote:
Great, but what are you doing with these stacks? Normally, the answer would be something like "I'm applying the X algorithm to a range of elements that represent Y but have been adapted to look like Z"
Usually it is a matter of converting a text document to a point in a feature vector space as a preprocessing stage:
e.g., for the purposes of search? Just trying to get a feel for what you're describing.
Near-duplicate elimination, and document clustering in general. Most clustering algorithms work on representations of documents as points in a very high-dimensional space.
OK.
Most pipelines start with tokenization, normalization, filtering and hashing, with a 'take' operation at the end.
By 'take' you mean a destructive read?
What do you mean by 'destructive'?
Never mind; it was a stab in the dark, and you explain what you mean below.
'take(range, n)' returns a range adaptor that iterates over the first 'n' elements of 'range'. This way I do not have to decide how much of the original document (which could in theory be of unlimited length) I need until the end of the pipeline (most importantly, I want to set the final length *after* filtering).
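Something along these lines (a minimal sketch, not the exact code I'm using; the names are just illustrative):

    #include <iterator>
    #include <utility>

    // Sketch of a 'take' iterator adaptor: it stops either after n
    // increments or when the underlying iterator reaches its own end,
    // whichever comes first.  That is what lets the "how many elements do
    // I keep?" decision move to the very end of a pipeline over a
    // potentially unbounded document.
    template <class Iterator>
    class take_iterator
    {
        typedef std::iterator_traits<Iterator> traits;
    public:
        typedef std::forward_iterator_tag iterator_category;
        typedef typename traits::value_type      value_type;
        typedef typename traits::difference_type difference_type;
        typedef typename traits::pointer         pointer;
        typedef typename traits::reference       reference;

        take_iterator() : base_(), n_(0) {}
        take_iterator(Iterator base, difference_type n) : base_(base), n_(n) {}

        reference operator*() const { return *base_; }
        take_iterator& operator++() { ++base_; --n_; return *this; }
        take_iterator operator++(int) { take_iterator tmp(*this); ++*this; return tmp; }

        friend bool operator==(take_iterator const& a, take_iterator const& b)
        {
            // Equal when both counters are exhausted, or when the
            // underlying positions coincide (e.g. the source ran out
            // before n elements were produced).
            return (a.n_ == 0 && b.n_ == 0) || a.base_ == b.base_;
        }
        friend bool operator!=(take_iterator const& a, take_iterator const& b)
        { return !(a == b); }

    private:
        Iterator base_;
        difference_type n_;
    };

    // take(first, last, n): the resulting range visits at most n elements.
    template <class Iterator>
    std::pair<take_iterator<Iterator>, take_iterator<Iterator> >
    take(Iterator first, Iterator last,
         typename std::iterator_traits<Iterator>::difference_type n)
    {
        return std::make_pair(take_iterator<Iterator>(first, n),
                              take_iterator<Iterator>(last, 0));
    }

With that in place, the last stage of a pipeline is just take(filtered_first, filtered_last, n), and nothing upstream needs to know how many elements will actually be consumed.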
OK.
<snip>
Most of the uses of lazy ranges are in relatively non-performance-critical parts of the application, so I do not aim for absolute zero abstraction overhead.
Okay, so... is it worth breaking up the iterator abstraction for this purpose?
Do you think we are breaking the iterator abstraction? How? Or have I misunderstood the question?
Well, if we add "Factorable" or something like it, the iterator abstraction gets broken down into smaller bits for the purpose of storage, complicating some code [like a generic range, for example ;-)], and it imposes an extra coding duty on writers of high-quality iterator adaptors. I'm asking if it's worth it.
We are simply trying to come up with a simple and clean trick to keep iterator size under control.
Really? IIUC so far we're only discussing compressing the size of ranges, unless we're willing to pay for indirection inside of iterators (I'm not).
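To make sure we're talking about the same thing, here's roughly what I understand the factoring to buy us, sketched for one concrete adaptor (the names here are mine, not an actual proposal):

    #include <boost/iterator/filter_iterator.hpp>

    // A filter_iterator<Pred, Iter> carries the predicate and the end of
    // the underlying sequence in *every* copy; a "factored" range stores
    // them once and rebuilds full, indirection-free iterators on demand.
    // The iterators themselves stay fat; only the stored range shrinks.
    template <class Pred, class Iter>
    class factored_filter_range
    {
    public:
        typedef boost::filter_iterator<Pred, Iter> iterator;

        factored_filter_range(Pred pred, Iter first, Iter last)
            : pred_(pred), first_(first), last_(last) {}

        // Recombine the shared state (pred_, last_) with the per-iterator
        // state (the current position) into ordinary fat iterators.
        iterator begin() const { return iterator(pred_, first_, last_); }
        iterator end()   const { return iterator(pred_, last_,  last_); }

    private:
        Pred pred_;   // stored once instead of once per iterator
        Iter first_;
        Iter last_;
    };

Generalizing that so an arbitrary stack of adaptors can be split into shared and per-iterator parts is where the extra duty on iterator authors comes in, and that's the cost I'm asking about.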
I'm interested in using dynamic iterators in the near future for code decoupling
You mean, like, any_iterator?
Yes.
and it would be a pity if these ranges couldn't fit in a small object optimization buffer.
Do you know how big that target ("small object optimization buffer") is?
I'm going to implement my own any_iterator, so I'll make the buffer big enough for the average iterator size.
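The storage side would be something along these lines (a very rough sketch; the 100-byte figure and all the names are placeholders):

    #include <new>
    #include <cstddef>

    // Small wrapped iterators live in an inline buffer, larger ones go to
    // the heap.  The type-erasure/dispatch machinery of a real
    // any_iterator (and alignment, destruction, copying of the stored
    // object) is omitted here.
    class iterator_storage
    {
        static std::size_t const buffer_size = 100;  // tuned to the "average" iterator size

    public:
        iterator_storage() : heap_(0) {}

        template <class Iterator>
        Iterator* store(Iterator const& it)
        {
            if (sizeof(Iterator) <= buffer_size)
                return new (buffer_) Iterator(it);  // small: placement-new into the buffer
            Iterator* p = new Iterator(it);         // big: ordinary heap allocation
            heap_ = p;
            return p;
        }

        // A real implementation would also remember how to destroy/copy
        // whatever it stored (e.g. via a vtable or function pointers).

    private:
        char  buffer_[buffer_size];
        void* heap_;
    };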
Oh, that's not the small object optimization. The SOO is only done by compilers: it means passing small objects in registers instead of on the stack. http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/323704.... IIUC you're talking about the too-specifically-named "small string optimization." I don't understand why you would implement any_iterator yourself when there are so many extant implementations, but maybe it's none of my business.
Currently the iterators I use are often over 400 bytes, which is a bit too big :), thus the need to squeeze their size down as much as possible. A hundred bytes could be enough [1].
Again, have you measured?
[1] 100 bytes for SBO might seem like a lot, but I've used a custom string with a very large SBO buffer (over 100 bytes), happily copied it around, and performance was much better than with standard std::string (which, admittedly, did reference counting but no SBO). I guess that the size of the buffer depends a lot on what you are going to do with it.
Yes indeed.

-- 
Dave Abrahams
BoostPro Computing
http://www.boostpro.com