container with large number of elements and a small number of invalide elements

Hello, I have a need for a container where I can store 10^6 elements of type T (PODs) , a small number of which (say 100) can be invalid elements (ie special value of T) The 100 elements can be randomly distributed throughout the 10^6 elements. The container should be as close to a sequence container as possible. The required usages: 0. invalid elements can be set = 10^6 - 100 1. get number of valid elements in O(1) 2. get the max/min element of all valid elements 3. have a bidrectional iterator that iterators only over valid elements On a separate note, is there a version of multi_array where the number of dimensions is known at runtime? regards,

What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough. Did you try with a reserved std::vector already? If size is a problem using it, then maybe STXXL would work for you:http://stxxl.sourceforge.net/

On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,

On Thu, Oct 31, 2013 at 6:14 PM, MM <finjulhich@gmail.com> wrote:
On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ?
You can check it easily. On this site using Boost 1.49 I get 4 additional bytes on clang and gcc: http://rextester.com/FMH8278 I'm a bit surprised because I thought it would be 1 byte max. Anyway you will need a way to mark values as either valid or invalid. There are several solutions, like using a std::vector<bool> (it's a bit vector) or boost::dynamic_bitset to have one bit per positions telling if it's valid or not. A simpler solution would have your POD providing the info.
Also, what sort of iterator can iterate over just the valid values? or how to implement one?
Well if the container, say std::vector, already provide an iterator, you just have to wrap it by forwarding work but just continue iterating if you just iterated into an invalid. Or better, a range. I never used boost helpers to build ranges so I can't say more on that but you have several possibilities.
The point here is that there's a small number of elements that are invalid, their indices in the container are random,
I understand that, but I don't really see a specific problem.

On 10/31/2013 1:14 PM, MM wrote:
On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place? Jeff

On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
On 10/31/2013 1:14 PM, MM wrote:
On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
______________________________**_________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/**mailman/listinfo.cgi/boost-**users<http://lists.boost.org/mailman/listinfo.cgi/boost-users>
because they are processed at a later stage, their indices in the container are relevant,

On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means. If you've some functions: bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); } , and a pre-existing container then you can get an adapted view using boost::range::adaptors::filtered like: typedef std::vector<MyPod> MyPods; MyPods all(...); MyPods bad; boost::copy(all | filtered(invalid), std::back_inserter(bad)); boost::foreach(all | filtered(valid), someFunction); It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is: http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/html/boost_icl/projects.... I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only. Jeff

On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.**com <Jeffrey.Flinn@gmail.com>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::**filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/**1_55_0b1/libs/icl/doc/html/** boost_icl/projects.html#boost_**icl.projects.large_bitset<http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/html/boost_icl/projects.html#boost_icl.projects.large_bitset>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
Jeff
______________________________**_________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/**mailman/listinfo.cgi/boost-**users<http://lists.boost.org/mailman/listinfo.cgi/boost-users>
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/sparsehash/source/browse/trunk/src/sparsehash/spars... The biggest difference between that and something like an std::vector<boost::optional<T>> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T. That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional). -- Travis Gockel Chief λ Combinator

On 11/1/2013 7:24 PM, Travis Gockel wrote:
On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com <mailto:Jeffrey.Flinn@gmail.com>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::__filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/__boost_icl/proje... <http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/html/boost_icl/projects.html#boost_icl.projects.large_bitset>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/sparsehash/source/browse/trunk/src/sparsehash/spars...
The biggest difference between that and something like an std::vector<boost::optional<T>> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T.
That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional).
That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits per interval. In the OP's case that would be 64x100(max number of invalid items assuming all invalid separated by atleast 1 valid item) about 3 orders of magnitude less. Jeff

On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
On 11/1/2013 7:24 PM, Travis Gockel wrote:
On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com
<mailto:Jeffrey.Flinn@gmail.com>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::__filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_ _boost_icl/projects.html#boost___icl.projects.large_bitset
<http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/ html/boost_icl/projects.html#boost_icl.projects.large_bitset>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/sparsehash/source/browse/ trunk/src/sparsehash/sparse_hash_set
The biggest difference between that and something like an std::vector<boost::optional<T>> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T.
That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional).
That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits per interval. In the OP's case that would be 64x100(max number of invalid items assuming all invalid separated by atleast 1 valid item) about 3 orders of magnitude less.
Jeff
Thanks for the various answers. I'm reading about ICL, and sparsehash to learn more. basically, struct MyPOD { double d1; ..... double d8; } and I currently have std::vector<MyPOD> with 1e6 elements. 100 of those would be == a particular instance of MyPOD invalid { +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant I need: 1. calculate the max element, and the min element, _while_ ignoring the 100 or so invalid entries (I don't know where they are), I can't however change the vector<MyPOD>. it's read only. 2. I need also to plot it (ie, plot all the elements that are not Invalid). The index of each element in vector is transformed and returns a sort of coordinate to plot a point So, I would have wrapped this vector into a class, and exposed similar interface to vector, but would have implemented an iterator that just skips the invallid entries:the iterator would skip the invalid entry until next valid one when incremented or decremented. My class would also be indexable with [] with this access not with the same order as vector's [] (it would skip the invalid entries, so slower) Perhaps the best choice is a boost range adapted view. Reading the docs.... thanks very much MM

On 4 November 2013 18:18, MM <finjulhich@gmail.com> wrote:
On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
On 11/1/2013 7:24 PM, Travis Gockel wrote:
On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com
<mailto:Jeffrey.Flinn@gmail.com>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::__filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_ _boost_icl/projects.html#boost___icl.projects.large_bitset
<http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/ html/boost_icl/projects.html#boost_icl.projects.large_bitset>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/sparsehash/source/browse/ trunk/src/sparsehash/sparse_hash_set
The biggest difference between that and something like an std::vector<boost::optional<T>> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T.
That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional).
That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits per interval. In the OP's case that would be 64x100(max number of invalid items assuming all invalid separated by atleast 1 valid item) about 3 orders of magnitude less.
Jeff
Thanks for the various answers. I'm reading about ICL, and sparsehash to learn more.
basically,
struct MyPOD { double d1; ..... double d8; } and I currently have std::vector<MyPOD> with 1e6 elements. 100 of those would be == a particular instance of MyPOD invalid { +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
I need: 1. calculate the max element, and the min element, _while_ ignoring the 100 or so invalid entries (I don't know where they are), I can't however change the vector<MyPOD>. it's read only.
2. I need also to plot it (ie, plot all the elements that are not Invalid). The index of each element in vector is transformed and returns a sort of coordinate to plot a point
So, I would have wrapped this vector into a class, and exposed similar interface to vector, but would have implemented an iterator that just skips the invallid entries:the iterator would skip the invalid entry until next valid one when incremented or decremented. My class would also be indexable with [] with this access not with the same order as vector's [] (it would skip the invalid entries, so slower)
Perhaps the best choice is a boost range adapted view.
Reading the docs....
thanks very much
MM
I have wrapped my vector in a class, and used boost range adaptor filtered. I've implemented operator[](std::size_t i) as a simple begin is the begin iterator to a filtered view on myvector | boost::adapators::filtered( ... ) with a predicate that ignores my invalid values. std::advance(begin, i) /// this advance on the view gives a performance equal to what a forward only iterator (not random access iterator) would give. This appears to be too slow in my case, I need a better order of magnitude of such access, I probably should have a solution where I store the indices of the invalid values, and I implement operator[](std::size_t i) with the help of those indices. MM

Is std::stable_partition too slow? (Sorry if answered earlier in the thread.) http://www.sgi.com/tech/stl/stable_partition.html Vince On Tue, Nov 5, 2013 at 12:43 PM, MM <finjulhich@gmail.com> wrote:
On 4 November 2013 18:18, MM <finjulhich@gmail.com> wrote:
On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
On 11/1/2013 7:24 PM, Travis Gockel wrote:
On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com
<mailto:Jeffrey.Flinn@gmail.com>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::__filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_ _boost_icl/projects.html#boost___icl.projects.large_bitset
<http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/ html/boost_icl/projects.html#boost_icl.projects.large_bitset>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/sparsehash/source/browse/ trunk/src/sparsehash/sparse_hash_set
The biggest difference between that and something like an std::vector<boost::optional<T>> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T.
That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional).
That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits per interval. In the OP's case that would be 64x100(max number of invalid items assuming all invalid separated by atleast 1 valid item) about 3 orders of magnitude less.
Jeff
Thanks for the various answers. I'm reading about ICL, and sparsehash to learn more.
basically,
struct MyPOD { double d1; ..... double d8; } and I currently have std::vector<MyPOD> with 1e6 elements. 100 of those would be == a particular instance of MyPOD invalid { +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
I need: 1. calculate the max element, and the min element, _while_ ignoring the 100 or so invalid entries (I don't know where they are), I can't however change the vector<MyPOD>. it's read only.
2. I need also to plot it (ie, plot all the elements that are not Invalid). The index of each element in vector is transformed and returns a sort of coordinate to plot a point
So, I would have wrapped this vector into a class, and exposed similar interface to vector, but would have implemented an iterator that just skips the invallid entries:the iterator would skip the invalid entry until next valid one when incremented or decremented. My class would also be indexable with [] with this access not with the same order as vector's [] (it would skip the invalid entries, so slower)
Perhaps the best choice is a boost range adapted view.
Reading the docs....
thanks very much
MM
I have wrapped my vector in a class, and used boost range adaptor filtered.
I've implemented operator[](std::size_t i) as a simple
begin is the begin iterator to a filtered view on myvector | boost::adapators::filtered( ... ) with a predicate that ignores my invalid values.
std::advance(begin, i) /// this advance on the view gives a performance equal to what a forward only iterator (not random access iterator) would give.
This appears to be too slow in my case, I need a better order of magnitude of such access,
I probably should have a solution where I store the indices of the invalid values, and I implement operator[](std::size_t i) with the help of those indices.
MM
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

On 5 November 2013 18:09, Vincent N. Virgilio <virgilio@ieee.org> wrote:
On Tue, Nov 5, 2013 at 12:43 PM, MM <finjulhich@gmail.com> wrote:
On 4 November 2013 18:18, MM <finjulhich@gmail.com> wrote:
On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
On 11/1/2013 7:24 PM, Travis Gockel wrote:
On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com
<mailto:Jeffrey.Flinn@gmail.com>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::__filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_ _boost_icl/projects.html#boost___icl.projects.large_bitset
<http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/ html/boost_icl/projects.html#boost_icl.projects.large_bitset>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/sparsehash/source/browse/ trunk/src/sparsehash/sparse_hash_set
The biggest difference between that and something like an std::vector<boost::optional<T>> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T.
That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional).
That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits per interval. In the OP's case that would be 64x100(max number of invalid items assuming all invalid separated by atleast 1 valid item) about 3 orders of magnitude less.
Jeff
Thanks for the various answers. I'm reading about ICL, and sparsehash to learn more.
basically,
struct MyPOD { double d1; ..... double d8; } and I currently have std::vector<MyPOD> with 1e6 elements. 100 of those would be == a particular instance of MyPOD invalid { +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
I need: 1. calculate the max element, and the min element, _while_ ignoring the 100 or so invalid entries (I don't know where they are), I can't however change the vector<MyPOD>. it's read only.
2. I need also to plot it (ie, plot all the elements that are not Invalid). The index of each element in vector is transformed and returns a sort of coordinate to plot a point
So, I would have wrapped this vector into a class, and exposed similar interface to vector, but would have implemented an iterator that just skips the invallid entries:the iterator would skip the invalid entry until next valid one when incremented or decremented. My class would also be indexable with [] with this access not with the same order as vector's [] (it would skip the invalid entries, so slower)
Perhaps the best choice is a boost range adapted view.
Reading the docs....
thanks very much
MM
I have wrapped my vector in a class, and used boost range adaptor filtered.
I've implemented operator[](std::size_t i) as a simple
begin is the begin iterator to a filtered view on myvector | boost::adapators::filtered( ... ) with a predicate that ignores my invalid values.
std::advance(begin, i) /// this advance on the view gives a performance equal to what a forward only iterator (not random access iterator) would give.
This appears to be too slow in my case, I need a better order of magnitude of such access,
I probably should have a solution where I store the indices of the invalid values, and I implement operator[](std::size_t i) with the help of those indices.
MM
Is std::stable_partition too slow? (Sorry if answered earlier in the thread.)
http://www.sgi.com/tech/stl/stable_partition.html
Vince
I am unable to copy or modify my original vector, I can only browse over it, iterate over it in different ways. stable_partition would move around the elements if I'm not mistaken
MM

Yes, stable_partition (and probably partition) involve swaps, according to SGI's doc. Instead, you might apply one of the above partition functions to a parallel container of indices, and dereference with the predicate? Seems like this is steering back to an ersatz version of a boost adaptor. Vince On Tue, Nov 5, 2013 at 1:29 PM, MM <finjulhich@gmail.com> wrote:
On 5 November 2013 18:09, Vincent N. Virgilio <virgilio@ieee.org> wrote:
On Tue, Nov 5, 2013 at 12:43 PM, MM <finjulhich@gmail.com> wrote:
On 4 November 2013 18:18, MM <finjulhich@gmail.com> wrote:
On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn@gmail.com> wrote:
On 11/1/2013 7:24 PM, Travis Gockel wrote:
On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com
<mailto:Jeffrey.Flinn@gmail.com>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::__filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/_ _boost_icl/projects.html#boost___icl.projects.large_bitset
<http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/ html/boost_icl/projects.html#boost_icl.projects.large_bitset>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/sparsehash/source/browse/ trunk/src/sparsehash/sparse_hash_set
The biggest difference between that and something like an std::vector<boost::optional<T>> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T.
That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional).
That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits per interval. In the OP's case that would be 64x100(max number of invalid items assuming all invalid separated by atleast 1 valid item) about 3 orders of magnitude less.
Jeff
Thanks for the various answers. I'm reading about ICL, and sparsehash to learn more.
basically,
struct MyPOD { double d1; ..... double d8; } and I currently have std::vector<MyPOD> with 1e6 elements. 100 of those would be == a particular instance of MyPOD invalid { +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
I need: 1. calculate the max element, and the min element, _while_ ignoring the 100 or so invalid entries (I don't know where they are), I can't however change the vector<MyPOD>. it's read only.
2. I need also to plot it (ie, plot all the elements that are not Invalid). The index of each element in vector is transformed and returns a sort of coordinate to plot a point
So, I would have wrapped this vector into a class, and exposed similar interface to vector, but would have implemented an iterator that just skips the invallid entries:the iterator would skip the invalid entry until next valid one when incremented or decremented. My class would also be indexable with [] with this access not with the same order as vector's [] (it would skip the invalid entries, so slower)
Perhaps the best choice is a boost range adapted view.
Reading the docs....
thanks very much
MM
I have wrapped my vector in a class, and used boost range adaptor filtered.
I've implemented operator[](std::size_t i) as a simple
begin is the begin iterator to a filtered view on myvector | boost::adapators::filtered( ... ) with a predicate that ignores my invalid values.
std::advance(begin, i) /// this advance on the view gives a performance equal to what a forward only iterator (not random access iterator) would give.
This appears to be too slow in my case, I need a better order of magnitude of such access,
I probably should have a solution where I store the indices of the invalid values, and I implement operator[](std::size_t i) with the help of those indices.
MM
Is std::stable_partition too slow? (Sorry if answered earlier in the thread.)
http://www.sgi.com/tech/stl/stable_partition.html
Vince
I am unable to copy or modify my original vector, I can only browse over it, iterate over it in different ways. stable_partition would move around the elements if I'm not mistaken
MM
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

On 5 November 2013 18:34, Vincent N. Virgilio <virgilio@ieee.org> wrote:
Yes, stable_partition (and probably partition) involve swaps, according to SGI's doc.
Instead, you might apply one of the above partition functions to a parallel container of indices, and dereference with the predicate?
Seems like this is steering back to an ersatz version of a boost adaptor.
Vince
Could you elaborate? what do you mean an "ersatz version of a boost adaptor" I can store a container of indices that are invalid, and get max and min,and access via a [] with the help of that container. I just was looking to reuse some component for this rather than implement it myself.
Or modify your algorithm to not require random access. ;-) What are you wanting to do with the sequence(s)? I need to plot only the elements that are valid. Unfortunately, the plotting library requires a sort of indexed element,
Thanks, MM

On 5 November 2013 12:45, MM <finjulhich@gmail.com> wrote:
I can store a container of indices that are invalid,
Or you can store a container of indices that are valid. Unless you are on a very constrained platform, an extra 4M (assuming 1 million elements and 4-byte indices) for that container just isn't all that much space.
and get max and min,
Why not just store their indices when you build the vector? -- Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> (847) 691-1404

ersatz = limited (cheap) You might use partition or stable_partition on the container of indices. The predicate would dereference and return valid/invalid. Partition or stable_partition would separate your container of indices into two groups. Iterators in [begin, middle) would mark valid indices, and iterators in [middle, end) would mark invalid indices. You'd plot the dereferenced set of valid indices in [begin, middle). Vince On Tue, Nov 5, 2013 at 1:45 PM, MM <finjulhich@gmail.com> wrote:
On 5 November 2013 18:34, Vincent N. Virgilio <virgilio@ieee.org> wrote:
Yes, stable_partition (and probably partition) involve swaps, according to SGI's doc.
Instead, you might apply one of the above partition functions to a parallel container of indices, and dereference with the predicate?
Seems like this is steering back to an ersatz version of a boost adaptor.
Vince
Could you elaborate? what do you mean an "ersatz version of a boost adaptor"
I can store a container of indices that are invalid, and get max and min,and access via a [] with the help of that container. I just was looking to reuse some component for this rather than implement it myself.
Or modify your algorithm to not require random access. ;-) What are you wanting to do with the sequence(s)? I need to plot only the elements that are valid. Unfortunately, the plotting library requires a sort of indexed element,
Thanks, MM
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

On 11/5/2013 12:43 PM, MM wrote:
On 4 November 2013 18:18, MM <finjulhich@gmail.com <mailto:finjulhich@gmail.com>> wrote:
On 3 November 2013 22:55, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com>> wrote:
On 11/1/2013 7:24 PM, Travis Gockel wrote:
On Fri, Nov 1, 2013 at 1:18 PM, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com <mailto:Jeffrey.Flinn@gmail.com>>> wrote:
On 11/1/2013 1:28 PM, MM wrote:
On 1 November 2013 12:37, Jeff Flinn <Jeffrey.Flinn@gmail.com <mailto:Jeffrey.Flinn@gmail.com> <mailto:Jeffrey.Flinn@gmail.__com <mailto:Jeffrey.Flinn@gmail.com>> <mailto:Jeffrey.Flinn@gmail. <mailto:Jeffrey.Flinn@gmail.>____com
<mailto:Jeffrey.Flinn@gmail.__com <mailto:Jeffrey.Flinn@gmail.com>>>> wrote: On 10/31/2013 1:14 PM, MM wrote: On 31 October 2013 16:10, Klaim - Joël Lamotte <mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com> <mailto:mjklaim@gmail.com <mailto:mjklaim@gmail.com>>>>> wrote:
What would "valid"/"invalid" means here? If it's "exist" or "don't exist", then maybe using boost::optional<MyPod> would be enough.
That's right. How about sizeof( boost::optional<MyPod> ) vs sizeof( MyPOD ) ? Also, what sort of iterator can iterate over just the valid values? or how to implement one? The point here is that there's a small number of elements that are invalid, their indices in the container are random,
So why not keep the invalid items from getting into the container in the first place?
Jeff
because they are processed at a later stage, their indices in the container are relevant,
I'd need a better description of just what defines and item to be valid/invalid. And just what this processed at a later stage means.
If you've some functions:
bool valid(const MyPod& ) { return ...; } bool invalid(const MyPod& pod) { return !valid(pod); }
, and a pre-existing container then you can get an adapted view using boost::range::adaptors::____filtered like:
typedef std::vector<MyPod> MyPods;
MyPods all(...); MyPods bad;
boost::copy(all | filtered(invalid), std::back_inserter(bad));
boost::foreach(all | filtered(valid), someFunction);
It all depends on the context of where this is used. Another approach uses the boost icl (interval container library). One example is:
http://www.boost.org/doc/libs/____1_55_0b1/libs/icl/doc/html/____boost_icl/p... <http://www.boost.org/doc/libs/__1_55_0b1/libs/icl/doc/html/__boost_icl/projects.html#boost___icl.projects.large_bitset>
<http://www.boost.org/doc/__libs/1_55_0b1/libs/icl/doc/__html/boost_icl/proje... <http://www.boost.org/doc/libs/1_55_0b1/libs/icl/doc/html/boost_icl/projects.html#boost_icl.projects.large_bitset>>
I've used a similar approach to just store the "valid" bases in a dna sequences and keeping track of gaps with an interval container. The advantage is avoiding the filtering cost when the application is primarily interested in the valid items only.
I would recommend checking out Google sparsehash -- specifically, sparse_hash_set<T>: http://code.google.com/p/__sparsehash/source/browse/__trunk/src/sparsehash/s... <http://code.google.com/p/sparsehash/source/browse/trunk/src/sparsehash/sparse_hash_set>
The biggest difference between that and something like an std::vector<boost::optional<T>__> is google::sparse_hash_set will re-order values on you (the nature of hash tables). It also forces you to have a "deleted key" which exists in the realm of possible T values. Based on impression that I get from your original email, this fits with your idea of an "invalid" T.
That said, it is a ridiculously low-overhead container (on the order of ~2 bits/entry vs 4 bytes/entry with a boost::optional).
That's 2x10^6 bits for the OP's case, where-as the ICL approach is 64bits per interval. In the OP's case that would be 64x100(max number of invalid items assuming all invalid separated by atleast 1 valid item) about 3 orders of magnitude less.
Jeff
Thanks for the various answers. I'm reading about ICL, and sparsehash to learn more.
basically,
struct MyPOD { double d1; ..... double d8; } and I currently have std::vector<MyPOD> with 1e6 elements. 100 of those would be == a particular instance of MyPOD invalid { +Inf, +Inf, ... -Inf,, 0. }. It's a known, predefined MyPOD constant
I need: 1. calculate the max element, and the min element, _while_ ignoring the 100 or so invalid entries (I don't know where they are), I can't however change the vector<MyPOD>. it's read only.
2. I need also to plot it (ie, plot all the elements that are not Invalid). The index of each element in vector is transformed and returns a sort of coordinate to plot a point
So, I would have wrapped this vector into a class, and exposed similar interface to vector, but would have implemented an iterator that just skips the invallid entries:the iterator would skip the invalid entry until next valid one when incremented or decremented. My class would also be indexable with [] with this access not with the same order as vector's [] (it would skip the invalid entries, so slower)
Perhaps the best choice is a boost range adapted view.
Reading the docs....
thanks very much
MM
I have wrapped my vector in a class, and used boost range adaptor filtered.
I've implemented operator[](std::size_t i) as a simple
begin is the begin iterator to a filtered view on myvector | boost::adapators::filtered( ... ) with a predicate that ignores my invalid values.
std::advance(begin, i) /// this advance on the view gives a performance equal to what a forward only iterator (not random access iterator) would give.
This appears to be too slow in my case, I need a better order of magnitude of such access,
I probably should have a solution where I store the indices of the invalid values, and I implement operator[](std::size_t i) with the help of those indices.
Or modify your algorithm to not require random access. ;-) What are you wanting to do with the sequence(s)? Jeff
participants (6)
-
Jeff Flinn
-
Klaim - Joël Lamotte
-
MM
-
Nevin Liber
-
Travis Gockel
-
Vincent N. Virgilio