Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)

On Mon, Jun 8, 2009 at 2:55 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Mon, Jun 8, 2009 at 2:47 PM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote: [snip]
Thanks for your insights! I'll make the changes and then maybe put it in the sandbox along with the documentation that I'll be working on.
The implementation is now available in Boost's Sandbox. You can check out the implementation along with the documentation via:

svn co https://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/

Can someone please add this to the Libraries Under Construction page?

I'd also like to request to add this to the formal review queue (if this qualifies for a fast-track review, that would also be an option I'm open to).

Thanks in advance, and I hope this helps.

-- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

Hi,

----- Original Message -----
From: "Dean Michael Berris" <mikhailberis@gmail.com>
To: <boost@lists.boost.org>
Sent: Monday, June 08, 2009 1:59 PM
Subject: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)
On Mon, Jun 8, 2009 at 2:55 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Mon, Jun 8, 2009 at 2:47 PM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote: [snip]
Thanks for your insights! I'll make the changes and then maybe put it in the sandbox along with the documentation that I'll be working on.
The implementation is now available in Boost's Sandbox. You can check out the implementation along with the documentation via:
svn co https://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/
Can someone please add this to the Libraries Under Construction page?
Done (see https://svn.boost.org/trac/boost/wiki/LibrariesUnderConstruction#Boost.Bloom...)

Best, Vicente

Dean Michael Berris <mikhailberis <at> gmail.com> writes:
The implementation is now available in Boost's Sandbox. You can check out the implementation along with the documentation via:
[...]
I'd also like to request to add this to the formal review queue (if this qualifies for a fast track review, that would also be an option I'm open to).
With all due respect, I think your submission is in too early a state to request a formal review. I'd say you are in the "refinement" stage as described in: http://www.boost.org/development/submissions.html So you can still get a lot of feedback until the design reaches a mature state. IMHO there are plenty of opportunities for improvement; here are a handful of observations in no particular order:

1. Your material lacks proper tests and examples, and the docs are sketchy (a formal reference would be needed at least).
2. bloom_filter should have an allocator template parameter, just like any other container around.
3. Have you considered the possibility of letting the user specify M and K at construction time rather than compile time, and the pros and cons of both approaches?
4. It'd be nice to have a way to specify the desired false positive probability instead of M and K.
5. It'd be nice if entry-level users could get a default set of hash functions so that they need not specify them -- maybe a parameterized hash formula h(x,N) so that you can just use h(x,0), h(x,1), ..., h(x,N-1).
6. You're not providing a free function swap(bloom_filter&, bloom_filter&).
7. bloom_filters are not currently testable for equality.
8. bloom_filters are not currently serializable.
9. Union and intersection of bloom_filters are not provided (according to the Wikipedia entry these ops are trivial to implement).
10. For completeness, it seems like constructing a bloom_filter from a bitset_type would be nice to have.
11. You'd have to decide whether you want to make this look more like a std::map<T,bool> or not. If the former, some boilerplate typedefs like value_type and member functions like size() could be provided.

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo

On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote:
Dean Michael Berris <mikhailberis <at> gmail.com> writes:
The implementation is now available in Boost's Sandbox. You can check out the implementation along with the documentation via:
[...]
I'd also like to request to add this to the formal review queue (if this qualifies for a fast track review, that would also be an option I'm open to).
With all due respect, I think your submission is in too early a state to request a formal review. I'd say you are in the "refinement" stage as described in:
Oh, okay. Thanks for pointing this out.
So you can still get a lot of feedback until the design reaches a mature state. IMHO there are plenty of opportunities for improvement; here are a handful of observations in no particular order:
1. Your material lacks proper tests and examples, and the docs are sketchy (a formal reference would be needed at least).
Okay. I can work on adding more tests (or at least having more "formal" unit tests).
2. bloom_filter should have an allocator template parameter, just like any other container around.
Okay, I'll just forward this to the underlying data structure (the bitset), so that should be easy to add.
3. Have you considered the possibility of letting the user specify M and K at construction time rather than compile time, and the pros and cons of both approaches?
Yes, I have, but I've put a premium on type safety, making M and K part of the type instead of runtime values. The reasons are:

1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset.
2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime.
3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time.

One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users, I think.
4. It'd be nice to have a way to specify the desired false positive probability instead of M and K.
That'd be nice, but then the hash functions would have to be provided for by the implementation too to ensure that the probability is held.
5. It'd be nice if entry-level users could get a default set of hash functions so that they need not specify them -- maybe a parameterized hash formula h(x,N) so that you can just use h(x,0), h(x,1), ..., h(x,N-1).
I agree. I can use Boost.CRC's functions and implement a simple mod-hash against M. Let me see what I can come up with. BTW, this may require that the actual bloom_filter<> template be renamed to something like static_bloom_filter<> and have a dynamic_bloom_filter<> version as well that supports runtime-setting of M like something I describe above. The bloom_filter<> will derive from static_bloom_filter<> and provide a set of basic hash functions so that users don't have to worry about providing their own hash functions. Or, I can support a default-constructible bloom_filter<> that provides these functions too. This sounds like something I'd be able to implement quickly.
6. You're not providing a free function swap(bloom_filter&,bloom_filter&).
Not yet, you're correct. Let me address this soon.
7. bloom_filters are not currently testable for equality.
You're correct. I'll add this soon.
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
9. Union and intersection of bloom_filters are not provided (according to the Wikipedia entry these ops are trivial to implement).
True.
10. For completeness, seems like constructing a bloom_filter from a bitset_type would be nice to have.
I agree, this is something that can be added soon as I allow for having a default set of hash functions for bloom filters.
11. You'd have to decide whether you want to make this look more like a std::map<T,bool> or not. If the former, some boilerplate typedefs like value_type and member functions like size() could be provided.
I'm still struggling with this, mainly because I'd think a bloom filter would better model std::set's behavior, but without access to the "stored" elements (mainly because the elements aren't really stored).

Thanks for the great insights and feedback; I'll try implementing these things sooner rather than later.

Have a great day!

Dean Michael Berris <mikhailberis <at> gmail.com> writes:
On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin <at> tid.es> wrote:
11. You'd have to decide whether you want to make this look more like a std::map<T,bool> or not. If the former, some boilerplate typedefs like value_type and member functions like size() could be provided.
I'm still struggling with this mainly because I'd think a bloom filter would better model an std::set's behavior, but without access to the "stored" elements (mainly because the elements aren't really stored).
Now that you mention it, maybe std::set<T> is a better model than std::map<T,bool>. Hadn't thought about it.
Have a great day!
Thanks, I'll try :)

Just an update -- as of Sandbox revision 53772, I've addressed some of the issues below:

On Tue, Jun 9, 2009 at 12:52 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote:
So you can still get a lot of feedback until the design reaches a mature state. IMHO there are plenty of opportunities for improvement; here are a handful of observations in no particular order:
1. Your material lacks proper tests and examples, and the docs are sketchy (a formal reference would be needed at least).
Okay. I can work on adding more tests (or at least having more "formal" unit tests).
I've added a couple more tests, hardly "formal" by definition, but it's an improvement over the single test file that was implemented.
2. bloom_filter should have an allocator template parameter, just like any other container around.
Okay, I'll just forward this anyway to the underlying data structure (bitset) so that should be easy to add.
And apparently I mis-typed -- std::bitset<> doesn't have an allocator parameter. Because of this, I've chosen not to support a custom allocator for the meantime.
3. Have you considered the possibility of letting the user specify M and K at construction time rather than compile time, and the pros and cons of both approaches?
Yes, I have but I've put a premium at type safety and putting M and K as part of the type instead of being a runtime value. The reasons are:
1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset.
2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime.
3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time.
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I thought about it a little more and I don't think I like the idea of a separate static and dynamic bloom_filter. If there are more reasons to allow for dynamically growing bloom filters, then I might support this.
4. It'd be nice to have a way to specify the desired false positive probability instead of M and K.
That'd be nice, but then the hash functions would have to be provided for by the implementation too to ensure that the probability is held.
I will not be supporting this anytime soon. ;-)
5. It'd be nice if entry-level users could get a default set of hash functions so that they need not specify them -- maybe a parameterized hash formula h(x,N) so that you can just use h(x,0), h(x,1), ..., h(x,N-1).
I agree. I can use Boost.CRC's functions and implement a simple mod-hash against M. Let me see what I can come up with. BTW, this may require that the actual bloom_filter<> template be renamed to something like static_bloom_filter<> and have a dynamic_bloom_filter<> version as well that supports runtime-setting of M like something I describe above. The bloom_filter<> will derive from static_bloom_filter<> and provide a set of basic hash functions so that users don't have to worry about providing their own hash functions.
This is already done; I put it in boost/bloom_filter/detail/default_hash.hpp, with naive offset and mod-hash functions. Yes, this should be improved so that it can easily use Boost.Hash or other implementations. ;-)
Or, I can support a default-constructible bloom_filter<> that provides these functions too. This sounds like something I'd be able to implement quickly.
Which is already the case. A default constructed bloom filter uses an unsophisticated hash function internally.
6. You're not providing a free function swap(bloom_filter&,bloom_filter&).
Not yet, you're correct. Let me address this soon.
Done.
7. bloom_filters are not currently testable for equality.
You're correct. I'll add this soon.
Done, but I've run into the issue of not being able to compare boost::function objects of the same type -- for some reason this has been disabled in Boost Trunk.
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
This should come later... I still need to look into whether serialization should only be done for the bitset state rather than the whole type. I say this because up to now I don't think there's a way of serializing boost::function objects and deserializing them effectively either.
9. Union and intersection of bloom_filters are not provided (according to the Wikipedia entry these ops are trivial to implement).
True.
And I need to implement these later (if there's an actual need for this).
10. For completeness, seems like constructing a bloom_filter from a bitset_type would be nice to have.
I agree, this is something that can be added soon as I allow for having a default set of hash functions for bloom filters.
And this is done.

HTH

Dean Michael Berris wrote:
3. Have you considered the possibility of letting the user specify M and K at construction time rather than compile time, and the pros and cons of both approaches?
Yes, I have, but I've put a premium on type safety, making M and K part of the type instead of runtime values. The reasons are:
1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset.
2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime.
3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time.
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I thought about it a little more and I don't think I like the idea of a separate static and dynamic bloom_filter. If there are more reasons to allow for dynamically growing bloom filters, then I might support this.
Dynamically growing bloom filters are technically impossible. The request was for a bloom filter whose size is set at construction time. This is useful e.g. because it allows the user of some filtering tool to trade memory for reliability via a configuration file instead of having to recompile. Sebastian

On Tue, Jun 9, 2009 at 8:10 PM, Sebastian Redl<sebastian.redl@getdesigned.at> wrote:
Dean Michael Berris wrote:
I thought about it a little more and I don't think I like the idea of a separate static and dynamic bloom_filter. If there are more reasons to allow for dynamically growing bloom filters, then I might support this.
Dynamically growing bloom filters are technically impossible.
Oh, you're right.
The request was for a bloom filter whose size is set at construction time. This is useful e.g. because it allows the user of some filtering tool to trade memory for reliability via a configuration file instead of having to recompile.
Agreed. Then I'll implement a dynamically sized bloom filter, maybe call it dynamic_bloom_filter, that uses a Boost.Dynamic_bitset internally.

Thanks for pointing out my oversight.

On Tue, Jun 9, 2009 at 11:40 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Tue, Jun 9, 2009 at 8:10 PM, Sebastian Redl<sebastian.redl@getdesigned.at> wrote:
The request was for a bloom filter whose size is set at construction time. This is useful e.g. because it allows the user of some filtering tool to trade memory for reliability via a configuration file instead of having to recompile.
Agreed. Then I'll implement a dynamically sized bloom filter, maybe call it dynamic_bloom_filter that uses a Boost.Dynamic_bitset internally.
Thanks for pointing out my oversight.
And as of revision 53785 of the Boost Sandbox, boost::bloom_filter<> now has the following template parameters:

template <class Input, class Sequence, class Block, class Allocator>
struct bloom_filter;

Sequence is a Fusion-compliant sequence of hash functions. The default parameter is a set of three pre-seeded hash functions which are an improved version of the earlier boost::detail::default_hash<Size> type. Block is the type used by the underlying Boost.Dynamic_bitset allocated by Allocator.

Now a bloom_filter can just be defined as:

bloom_filter<int> bf(2048); // 2048 is the number of bits to store, a runtime value

The default bloom_filter has three hash functions, all specializations of the boost::detail::default_hash<Size> type (namely default_hash<0>, default_hash<1>, default_hash<2>).

HTH

And as of revision 53785 of the Boost Sandbox, boost::bloom_filter<> now has the following template parameters:
template <class Input, class Sequence, class Block, class Allocator> struct bloom_filter;
Sequence is a Fusion-compliant sequence of hash functions.

So call it HashFunctions or HashSequence. Intuitively, I'd interpret Sequence as "the underlying sequence", i.e. the container that stores the bits. And yes, I know that makes no sense for a Bloom filter.

Sebastian

On Thu, Jun 11, 2009 at 12:48 AM, Sebastian Redl<sebastian.redl@getdesigned.at> wrote:
And as of revision 53785 of the Boost Sandbox, boost::bloom_filter<> now has the following template parameters:
template <class Input, class Sequence, class Block, class Allocator> struct bloom_filter;
Sequence is a Fusion-compliant sequence of hash functions.

So call it HashFunctions or HashSequence. Intuitively, I'd interpret Sequence as "the underlying sequence", i.e. the container that stores the bits. And yes, I know that makes no sense for a Bloom filter.
Okay, that makes sense. ;-) Changing it soon. :-)

On Thu, Jun 11, 2009 at 3:20 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Thu, Jun 11, 2009 at 12:48 AM, Sebastian Redl<sebastian.redl@getdesigned.at> wrote:
So call it HashFunctions or HashSequence. Intuitively, I'd interpret Sequence as "the underlying sequence", i.e. the container that stores the bits. And yes, I know that makes no sense for a Bloom filter.
Okay, that makes sense. ;-)
Changing it soon. :-)
And this is done. :-)

Dean Michael Berris wrote:
Just an update -- as of Sandbox revision 53772, I've addressed some of the issues below:
On Tue, Jun 9, 2009 at 12:52 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote:
So you can still get a lot of feedback until the design reaches a mature state. IMHO there are plenty of opportunities for improvement; here are a handful of observations in no particular order:
3. Have you considered the possibility of letting the user specify M and K at construction time rather than compile time, and the pros and cons of both approaches?
Yes, I have, but I've put a premium on type safety, making M and K part of the type instead of runtime values. The reasons are:
1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset.
2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime.
3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time.
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I thought about it a little more and I don't think I like the idea of a separate static and dynamic bloom_filter. If there are more reasons to allow for dynamically growing bloom filters, then I might support this.
Two bloom_filters with different hash functions are not comparable. Don't you think that the hash functions should be part of the bloom_filter type?

bloom_filter<M, hash_1, ..., hash_k> bf;

Dean Michael Berris wrote:
7. bloom_filters are not currently testable for equality.
You're correct. I'll add this soon.
Done, but I've run into the issue of not being able to compare boost::function objects of the same type -- for some reason this has been disabled in Boost Trunk.
If the hash functions are part of the bloom_filter type there is no issue. There is no need to compare the hash functions at run-time; the compiler will do it for you.

Dean Michael Berris wrote:
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
This should come later... I still need to look into whether serialization should only be done for the bitset state rather than the whole type. I say this because up to now I don't think there's a way of serializing boost::function objects and deserializing them effectively either.
If the hash functions are part of the bloom_filter type there is no issue; only the data must be serialized.

HTH,
Vicente

Hi Vicente!

On Tue, Jun 9, 2009 at 9:49 PM, Vicente Botet Escriba<vicente.botet@wanadoo.fr> wrote:
On Tue, Jun 9, 2009 at 12:52 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
Yes, I have, but I've put a premium on type safety, making M and K part of the type instead of runtime values. The reasons are:
1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset.
2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime.
3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time.
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I thought about it a little more and I don't think I like the idea of a separate static and dynamic bloom_filter. If there are more reasons to allow for dynamically growing bloom filters, then I might support this.
Two bloom_filters with different hash functions are not comparable. Don't you think that the hash functions should be part of the bloom_filter type?
bloom_filter<M, hash_1, ... hash_k>bf;
Thinking about it a little more, yes, this makes sense. However, pending the formalization of variadic templates, I'm leaning towards supporting Fusion sequences as the collection of hash functions:

bloom_filter<Input, M, Sequence> bf;

This way I can deduce K from size<Sequence> and maybe even derive the bloom_filter linearly from the Sequence elements (to optimize the storage, using the compressed_pair trick). I'm still mulling it around in my head, and will be trying to implement this in the morning (PH time).

My only worry with this is that I'd have to define the Concept for hash function types. It shouldn't be too hard, especially if I build upon the Boost.Hash concepts (if they're applicable).
Done, but I've run into the issue of not being able to compare boost::function objects of the same type -- for some reason this has been disabled in Boost Trunk.
If the hash functions are part of the bloom_filter type there is no issue. There is no need to compare the hash functions at run-time; the compiler will do it for you.
Yup, this is true. Now I just have to implement it using Boost.Fusion -- I'm worried this makes the bloom_filter implementation (and usage) a little more complex than I'd like, but if the requirements dictate it then there are fewer reasons for me to avoid it. ;-)
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
This should come later... I still need to look into whether serialization should only be done for the bitset state rather than the whole type. I say this because up to now I don't think there's a way of serializing boost::function objects and deserializing them effectively either.
If the hash functions are part of the bloom_filter type there is no issue; only the data must be serialized.
You're correct here. Now it's just a matter of me implementing it then. :-)
HTH,
It definitely does help. Thanks! :-)

On Tue, Jun 9, 2009 at 11:57 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Tue, Jun 9, 2009 at 9:49 PM, Vicente Botet Escriba<vicente.botet@wanadoo.fr> wrote:
Two bloom_filters with different hash functions are not comparable. Don't you think that the hash functions should be part of the bloom_filter type?
bloom_filter<M, hash_1, ... hash_k>bf;
Thinking about it a little more, yes this makes sense. However, pending the formalization of variadic templates, I'm leaning towards supporting Fusion sequences in the collection of hash functions:
bloom_filter<Input, M, Sequence> bf;
This way I can deduce K from size<Sequence> and maybe even derive the bloom_filter linearly from the Sequence elements (to optimize the storage, using the compressed_pair trick). I'm still mulling it around in my head, and will be trying to implement this in the morning (PH time).
My only worry with this is that I'd have to define the Concept for hash function types. It shouldn't be too hard especially if I'll build upon the Boost.Hash concepts (if they're applicable).
This has also been addressed in the version that's in revision 53785 of the Boost Sandbox. :-)
If the hash functions are part of the bloom_filter type there is no issue. There is no need to compare the hash functions at run-time; the compiler will do it for you.
Yup, this is true. Now I just have to implement it using Boost.Fusion -- I'm worried this makes the bloom_filter implementation (and usage) a little more complex than I'd like, but if the requirements dictate it then there are fewer reasons for me to avoid it. ;-)
And my worries were misplaced. :-) The implementation seems cleaner now IMO using Boost.Fusion internally. Fusion saves my day again. :-D
If the hash functions are part of the bloom_filter type there is no issue; only the data must be serialized.
You're correct here. Now it's just a matter of me implementing it then. :-)
And this is already done too. :-)

Thanks for the insights! :-D

Dean Michael Berris wrote:
Just an update -- as of Sandbox revision 53772, I've addressed some of the issues below:
On Tue, Jun 9, 2009 at 12:52 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
2. bloom_filter should have an allocator template parameter, just like any other container around.
Okay, I'll just forward this anyway to the underlying data structure (bitset) so that should be easy to add.
And apparently I mis-typed -- std::bitset<> doesn't have an allocator parameter. Because of this, I've chosen not to support a custom allocator for the meantime.
Umm, fair point :) This raises two points:

2.1. Allocators are still needed if you implement a variant where M (and maybe K) are passed at construction time.

2.2. Is it wise to use the stack-based std::bitset when M is huge? I think it's not unlikely that M can be very large (say millions), which can easily lead to stack overflow problems. This might bring allocators back into the scene.
3. Have you considered the possibility of letting the user specify M and K at construction time rather than compile time, and the pros and cons of both approaches?
Yes, I have, but I've put a premium on type safety, making M and K part of the type instead of runtime values. The reasons are:
1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset.
2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime.
3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time.
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I'd ask you to think this over before dismissing it: I can envision scenarios where M is not known at compile time (for instance, when dealing with external data whose size is not known beforehand). We can easily enter into overdesign land, but you can always let this be driven by a policy, so that everybody's happy.
7. bloom_filters are not currently testable for equality.
You're correct. I'll add this soon.
Done, but I've run into the issue of not being able to compare boost::function objects of the same type -- for some reason this has been disabled in Boost Trunk.
Ummm, this is tricky, because of course it does not suffice to test the state only -- it critically depends on the hash functions used. (At first sight I thought you only needed to test state and ignore auxiliary objects, like std containers do.)
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
Intrusiveness is not necessarily related with having a separate serialize.hpp (you can friend declare inside bloom_filter in bloom_filter.hpp and have the implementation elsewhere), but it's a good idea nonetheless. Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

On Tue, Jun 9, 2009 at 11:58 PM, <joaquin@tid.es> wrote:
Dean Michael Berris escribió:
And apparently I mis-typed -- std::bitset<> doesn't have an allocator parameter. Because of this, I've chosen not to support a custom allocator for the time being.
Umm, fair point :) this raises two points:
2.1. Allocators are still needed if you implement a variant where M (and maybe K) are passed at construction time.
Indeed. I'm leaning towards making this the default behavior using Boost.Dynamic_bitset instead and implementing the static_bloom_filter<> template that uses std::bitset internally now.
2.2. Is it wise to use the stack-based std::bitset when M is huge? I think it's not unlikely that M can be very large (say millions) which can easily lead to stack overflow problems. This might bring allocators back into scene.
You're right, it doesn't make it practical to use huge bitsets that sit on the stack. Perhaps Boost.Dynamic_bitset is my friend. ;-)
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I'd ask you to think this over before dismissing it: I can envision scenarios where M is not known at compile time (for instance, when dealing with external data whose size is not known beforehand).
We can easily enter into overdesign land, but you can always let this be driven by a policy, so that everybody's happy.
Now that you mention it, I agree. :-) First I'll do two things:

- Make the runtime-constructed bloom_filter the default implementation (uses Boost.Dynamic_bitset and allows for custom allocators)

- Rename the current implementation to static_bloom_filter, which takes Input, M, and a Fusion Sequence of Hash Function types as arguments.
Done, but I've run into the issue of not being able to compare boost::function objects of the same type -- for some reason this has been disabled in Boost Trunk.
Ummm, this is tricky, because of course it does not suffice to test the state only -- it critically depends on the hash functions used. (At first sight I thought you only needed to test state and ignore auxiliary objects, like std containers do.)
Yup. But apparently, if I make the hash functions part of the type instead of function objects passed in at runtime, that would solve the problem (based on earlier comments in the thread).
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
Intrusiveness is not necessarily related with having a separate serialize.hpp (you can friend declare inside bloom_filter in bloom_filter.hpp and have the implementation elsewhere), but it's a good idea nonetheless.
The friend declaration might be an option, and if I can't do it non-intrusively I'll fall back to this option. Thanks for pointing this out. :-) -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

On Wed, Jun 10, 2009 at 1:51 AM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Tue, Jun 9, 2009 at 11:58 PM, <joaquin@tid.es> wrote:
Umm, fair point :) this raises two points:
2.1. Allocators are still needed if you implement a variant where M (and maybe K) are passed at construction time.
Indeed. I'm leaning towards making this the default behavior using Boost.Dynamic_bitset instead and implementing the static_bloom_filter<> template that uses std::bitset internally now.
Now I've decided that boost::bloom_filter<> will take a runtime parameter which is the size of the internal bitset (a Boost.Dynamic_bitset) -- and the static version that uses std::bitset is boost::static_bloom_filter<>.
2.2. Is it wise to use the stack-based std::bitset when M is huge? I think it's not unlikely that M can be very large (say millions) which can easily lead to stack overflow problems. This might bring allocators back into scene.
You're right, it doesn't make it practical to use huge bitsets that sit on the stack. Perhaps Boost.Dynamic_bitset is my friend. ;-)
This has been addressed in revision 53785 of the Boost Sandbox. :-)
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I'd ask you to think this over before dismissing it: I can envision scenarios where M is not known at compile time (for instance, when dealing with external data whose size is not known beforehand).
We can easily enter into overdesign land, but you can always let this be driven by a policy, so that everybody's happy.
Now that you mention it, I agree. :-) First I'll do two things:
- Make the runtime-constructed bloom_filter the default implementation (uses Boost.Dynamic_bitset and allows for custom allocators)
This is done.
- Rename the current implementation as static_bloom_filter that takes Input, M, and a Fusion Sequence of Hash Function types as arguments.
Also done. :)
Done, but I've run into the issue of not being able to compare boost::function objects of the same type -- for some reason this has been disabled in Boost Trunk.
Ummm, this is tricky, because of course it does not suffice to test the state only -- it critically depends on the hash functions used. (At first sight I thought you only needed to test state and ignore auxiliary objects, like std containers do.)
Yup. But apparently, if I make the hash functions part of the type instead of function objects passed in at runtime, that would solve the problem (based on earlier comments in the thread).
Which is also done. :D
Intrusiveness is not necessarily related with having a separate serialize.hpp (you can friend declare inside bloom_filter in bloom_filter.hpp and have the implementation elsewhere), but it's a good idea nonetheless.
The friend declaration might be an option, and if I can't do it non-intrusively I'll fall back to this option. Thanks for pointing this out. :-)
And I've yet to work on this. ;-) HTH! -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

About the implementation: if I understand your implementation, you do something like this: given a value V, you compute a number X (say, the digest of V), then you compute K different values by applying the transformation (X + offset) mod M. I don't know whether this produces K hash functions suitable for Bloom filters -- but why don't you use Boost.Hash to compute X? It would also make it easier to define the requirements on the Input type.

About the documentation: I think the following paragraph in the documentation must be reformulated:

<snip> To find out whether we've encountered a certain URL before, we simply pass the URL to the same three hash functions. If we see that all the positions returned by the three hash functions are all 1, then we might have seen the URL before -- this is called a false positive. </snip>

Indeed, you cannot say that the URL is a false positive, since it could be equal to a URL stored previously. IMHO, you should describe the mathematical principles in more detail and, moreover, discuss in more detail the false positives and how to control their rate.

Even if it isn't mandatory, it would be nice if you wrote your documentation in QuickBook format, since that would make both the documentation toolchain and the documentation's look uniform. Also, don't forget to synchronize the documentation with the current interface.

Manuel Fiorelli

Hi Manuel, On Tue, Jun 9, 2009 at 8:45 PM, Manuel Fiorelli<manuel.fiorelli@gmail.com> wrote:
About the implementation: If I understand your implementation, you do something like that: given a value V, you compute a number X (say the digest of V), then you compute K different values applying the transformation (X + offset) mod(M)
Right. This naive filter will almost always create K consecutive indexes (or results) from the hash functions.
I don't know if this produces k hash functions suitable for bloom filters, but, why don't you use Boost.hash to compute X? Also, it would be easier to define the requirements on the Input type.
I've considered using Boost.Hash to compute the hash values -- the naive implementation above is meant to be a sample. I'm in the process of (re)thinking how the functions are made part of the type instead. Maybe instead of just Boost.Arrays at construction time, I'll consider Fusion-compatible sequences that are part of the signature (and thus the type of the filter).
About the documentation: I think that the following paragraph in the documentation must be reformulated
<snip> To find out whether we've encountered a certain URL before, we simply pass the URL to the same three hash functions. If we see that all the positions returned by the three hash functions are all 1, then we might have seen the URL before -- this is called a false positive. </snip>
indeed, you cannot say that the URL is a false positive, since it could be equal to an URL stored previously.
Right, I need to make that a little more clear. Thanks for pointing it out.
IMHO, you should describe the mathematical principles in more detail and, moreover, you should discuss in more detail the false positives and how to control their rate.
I agree. I'll work on it some more and gather more resources worth referencing.
Even if it isn't mandatory, it would be nice if you wrote your documentation in QuickBook format, since that would make both the documentation toolchain and the documentation's look uniform.
I chose to write in ReStructuredText first to make the documentation generation cycle faster -- it should be easier later to create Quickbook documentation from the RST sources if I remember my quickbook correctly. ;-)
Also, don't forget to synchronize the documentation with the current interface.
Right. More work for me then. ;-) Thanks for the insights! :-D -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

2009/6/9 Dean Michael Berris <mikhailberis@gmail.com>:
Hi Manuel,
On Tue, Jun 9, 2009 at 8:45 PM, Manuel Fiorelli<manuel.fiorelli@gmail.com> wrote:
About the implementation: If I understand your implementation, you do something like that: given a value V, you compute a number X (say the digest of V), then you compute K different values applying the transformation (X + offset) mod(M)
Right. This naive filter will almost always create K consecutive indexes (or results) from the hash functions.
Well, the K hash functions should be independent, but actually the default functions are highly correlated. I suspect that a more sophisticated approach should be taken. Wikipedia suggests some techniques and some references. Manuel Fiorelli

On Tue, Jun 9, 2009 at 11:47 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
Hi Manuel,
On Tue, Jun 9, 2009 at 8:45 PM, Manuel Fiorelli<manuel.fiorelli@gmail.com> wrote:
About the implementation: If I understand your implementation, you do something like that: given a value V, you compute a number X (say the digest of V), then you compute K different values applying the transformation (X + offset) mod(M)
Right. This naive filter will almost always create K consecutive indexes (or results) from the hash functions.
Which I've addressed in revision 53785 -- now the default_hash<> uses Boost.Hash internally with statically defined seeds.
About the documentation: I think that the following paragraph in the documentation must be reformulated
<snip> To find out whether we've encountered a certain URL before, we simply pass the URL to the same three hash functions. If we see that all the positions returned by the three hash functions are all 1, then we might have seen the URL before -- this is called a false positive. </snip>
indeed, you cannot say that the URL is a false positive, since it could be equal to an URL stored previously.
Right, I need to make that a little more clear. Thanks for pointing it out.
This is something I'll be working on later. HTH! -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

1. Can you please explain why you need multiple "hashes"?

2. Your code doesn't trivially exit on the contains.

3. Your solution doesn't provide a means to determine the various optimal parameter combinations.

4. Your solution doesn't provide a means for counting modes.

5. Your solution doesn't provide for erasures (do not confuse with removal of inserted items).

6. Your solution doesn't provide for self superimposed compression mode (similar to union but doubling of fpp).

Arash Partow
__________________________________________________
Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net

Dean Michael Berris wrote:
On Tue, Jun 9, 2009 at 11:47 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
Hi Manuel,
On Tue, Jun 9, 2009 at 8:45 PM, Manuel Fiorelli<manuel.fiorelli@gmail.com> wrote:
About the implementation: If I understand your implementation, you do something like that: given a value V, you compute a number X (say the digest of V), then you compute K different values applying the transformation (X + offset) mod(M)
Right. This naive filter will almost always create K consecutive indexes (or results) from the hash functions.
Which I've addressed in revision 53785 -- now the default_hash<> uses Boost.Hash internally with statically defined seeds.
About the documentation: I think that the following paragraph in the documentation must be reformulated
<snip> To find out whether we've encountered a certain URL before, we simply pass the URL to the same three hash functions. If we see that all the positions returned by the three hash functions are all 1, then we might have seen the URL before -- this is called a false positive. </snip>
indeed, you cannot say that the URL is a false positive, since it could be equal to an URL stored previously.
Right, I need to make that a little more clear. Thanks for pointing it out.
This is something I'll be working on later.
HTH!

Hi Arash, On Thu, Jun 18, 2009 at 7:00 AM, Arash Partow<arash@partow.net> wrote:
1. Can you please explain why you need multiple "hashes"?
You don't need multiple hashes, but if you do want to use more hash functions (K) you can. The point is the user should be able to provide the hash functions that the bloom filter will use internally.
2. Your code doesn't trivially exit on the contains
It used to (in the version that didn't use Boost.Fusion). I can still try to optimize for cases where it encounters a '0' bit in between the hash function results.
3. Your solution doesn't provide a means to determine the various optimal parameter combinations
What do you mean by this?
4. Your solution doesn't provide a means for counting modes
Modes for bloom filters? I'm not sure I understand what you mean.
5. Your solution doesn't provide for erasures (do not confuse with removal of inserted items)
.clear() is supposed to clear the internal bitset and clear the bloom filter.
6. Your solution doesn't provide for self superimposed compression mode (similar to union but doubling of fpp)
The idea of the bloom_filter is to provide precisely that, a bloom filter -- any compression you might want to do in transmitting a bloom_filter would IMO be beyond the scope of the library. There have been some suggestions of implementing a compressed_bloom_filter that maintains a compressed bitset, but apart from that I must say I have no idea what self superimposed compression mode is. -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

The idea of the bloom_filter is to provide precisely that, a bloom filter -- any compression you might want to do in transmitting a bloom_filter would IMO be beyond the scope of the library. There have been some suggestions of implementing a compressed_bloom_filter that maintains a compressed bitset, but apart from that I must say I have no idea what self superimposed compression mode is.
Implementing a naïve bloom filter according to the encyclopædic definition does not make much sense -- it would be just a proof of concept... As I have already mentioned in the other thread (http://thread.gmane.org/gmane.comp.lib.boost.devel/190192/), it's all about providing a generic solution that can accommodate all the peculiarities.

In my opinion there is established research on compressed bloom filters, and the issue of compression is not orthogonal to the bloom filter itself. I also agree that it is a must to provide a way to determine optimal parameter values for all the variants of the filters.

As for the erasure, I think Arash was referring to cropped filters and their interpretation as erasure codes (reference: Bloom Filters: One Size Fits All?).

Regards, Milosz

Dean Michael Berris wrote:
Hi Arash,
On Thu, Jun 18, 2009 at 7:00 AM, Arash Partow wrote:
1. Can you please explain why you need multiple "hashes"?
You don't need multiple hashes, but if you do want to use more hash functions (K) you can. The point is the user should be able to provide the hash functions that the bloom filter will use internally.
Please read something other than the wiki; there are many great citations in the wiki article that explain the most practical state-of-the-art BFs. If you want something for Boost then at least try to provide the most correct set of interfaces, even if the underlying details are not the most correct. In short, you only need 2 independent hash functions to get the optimal fpp for a set of size m bits (an expected insertion amount is optional).
2. Your code doesn't trivially exit on the contains
It used to be that it did (in the version that didn't use Boost.Fusion). I can try and optimize still in cases when it encounters a '0' bit in between the hash function results.
this is what i see:

    bool contains(const_ref input) const {
        using fusion::for_each;
        typedef typename base::contains_impl contains_checker;
        bool found = true;
        for_each(
            hash_functions,
            contains_checker(bit_set, input, found)
        );
        return found;
    }

unless the contract for for_each has changed (which it hasn't, both in the STL and Fusion):

1. it doesn't exit early

2. if you even bother to look at the machine code generated, you'd see that even with the best available set of opt flags, it's still miles behind a simple loop with a break statement. You have a copy constructor being called -- do you know the cost that will incur when calling contains 10^9 times?!
3. Your solution doesn't provide a means to determine the various optimal parameter combinations
What do you mean by this?
There are various (many) use-cases for a BF:

1. user provides max size and max fpp, BF determines best k
2. user provides max fpp and max expected insertions, BF determines best k, m
3. user provides max fpp and fuzzy k with limit, BF determines m, k and optimal fpp
4. etc etc...

Some of these can change/vary during runtime, so getters for fpp and inserted count would be nice/good/expected.
4. Your solution doesn't provide a means for counting modes
Modes for bloom filters? I'm not sure I understand what you mean.
There are modes of BFs where instead of 1 bit per block you use more, say 4 bits, and instead of setting the bit high you increment a counter. This allows for a probabilistic determination of how many times an element has been inserted into a BF; this is called a counting bloom filter. This is another of MANY important use-cases you have not covered.
5. Your solution doesn't provide for erasures (do not confuse with removal of inserted items)
.clear() is supposed to clear the internal bitset and clear the bloom filter.
No, not clear. A bloom filter is to a degree error resilient/tolerant. If for some reason part of your bloom filter data is damaged (n/w io, bad memory etc...), you can erase that section (set all the bits high); doing so, depending on how large the section is, will increase the fpp. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.
6. Your solution doesn't provide for self superimposed compression mode (similar to union but doubling of fpp)
The idea of the bloom_filter is to provide precisely that, a bloom filter -- any compression you might want to do in transmitting a bloom_filter would IMO be beyond the scope of the library. There have been some suggestions of implementing a compressed_bloom_filter that maintains a compressed bitset, but apart from that I must say I have no idea what self superimposed compression mode is.
This is NOT data compression. If you are providing union operations, then you should also provide self compression -- that is, for example, OR'ing the 1st half of the filter with the 2nd half, decreasing its size.

Also, another note looking at your code again (btw, do you even use it?!): your equality test should also include the hashes and any other seeding data you use to construct the BF.

Another thing that is missing is serialization of the BF. I would like to be able to save it to a file, or send it over a n/w, so a streaming operator or something in line with Boost.Serialize would be good.

Another thing to consider is that a BF implementation should not be a means to show off one's abilities in the use of templates and meta-programming paradigms, but rather something that is efficient and lean. Think about situations where cache hit-miss and coherency will be an issue (as these are some of the more critical uses of BFs), look at the machine code that is generated for some of the more critical routines insert/contains for various archs, and ask yourself what it is you're "really" trying to achieve.

Arash Partow
__________________________________________________
Be one who knows what they don't know, Instead of being one who knows not what they don't know, Thinking they know everything about all things. http://www.partow.net

On Fri, Jun 19, 2009 at 4:59 AM, Arash Partow<arash@partow.net> wrote:
Dean Michael Berris wrote:
Hi Arash,
On Thu, Jun 18, 2009 at 7:00 AM, Arash Partow wrote:
1. Can you please explain why you need multiple "hashes"?
You don't need multiple hashes, but if you do want to use more hash functions (K) you can. The point is the user should be able to provide the hash functions that the bloom filter will use internally.
Please read something other than the wiki; there are many great citations in the wiki article that explain the most practical state-of-the-art BFs. If you want something for Boost then at least try to provide the most correct set of interfaces, even if the underlying details are not the most correct. In short, you only need 2 independent hash functions to get the optimal fpp for a set of size m bits (an expected insertion amount is optional).
When you say correct set of interfaces, what do you need more than 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the case of a counting bloom filter, maybe you'd have 'remove(...)' or 'erase(...)'. Other than these functions I don't know what you mean by most correct set of interfaces.
2. Your code doesn't trivially exit on the contains
It used to be that it did (in the version that didn't use Boost.Fusion). I can try and optimize still in cases when it encounters a '0' bit in between the hash function results.
this is what i see:
    bool contains(const_ref input) const {
        using fusion::for_each;
        typedef typename base::contains_impl contains_checker;
        bool found = true;
        for_each(
            hash_functions,
            contains_checker(bit_set, input, found)
        );
        return found;
    }
unless the contract for for_each has changed (which it hasn't both in stl and fusion):
1. it doesn't exit
Oh, but it does. Have you tried the tests? Remember this is Boost.Fusion -- http://www.boost.org/doc/libs/1_39_0/libs/fusion/doc/html/fusion/algorithm/i... -- it iterates on the 'hash_functions' sequence which is finite.
2. if you even bother to look at the machine code generated, you'd see that even with the best available set of opt flags, it's still miles behind a simple loop with a break statement. You have a copy constructor being called -- do you know the cost that will incur when calling contains 10^9 times?!
Really? The Boost.Fusion version actually unrolls the loop for you, so the cost of that for_each call is actually constant time. Sure, there's no short-circuit (yet) but then the cost is constant.
3. Your solution doesn't provide a means to determine the various optimal parameter combinations
What do you mean by this?
There are various (many) use-cases for a BF
1. user provides max size and max fpp, BF determines best k
2. user provides max fpp and max expected insertions, BF determines best k, m
3. user provides max fpp and fuzzy k with limit, BF determines m, k and optimal fpp
4. etc etc...
Some of these can change/vary during runtime, so getters for fpp and inserted count would be nice/good/expected.
This is too complex -- and has too many assumptions. First, it assumes that the user wants a self-(re)sizing bloom filter. Second it assumes that the user doesn't want to provide their own hash functions. Third it assumes that the user actually really cares if the bloom filter should be smarter than the user. In some cases (like the simplest cases I'm trying to solve for the moment) the current implementation seems sufficient. Of course, it's not fancy, and I don't intend for it to become too fancy. If you think you can implement something better, then please by all means show me code that I can deal with -- this is not said sarcastically, I really do want to see what you mean when you say all the above translated into code.
4. Your solution doesn't provide a means for counting modes
Modes for bloom filters? I'm not sure I understand what you mean.
There are modes of BFs where instead of 1 bit per block you use more, say 4 bits, and instead of setting the bit high you increment a counter. This allows for a probabilistic determination of how many times an element has been inserted into a BF; this is called a counting bloom filter. This is another of MANY important use-cases you have not covered.
I thought 'counting modes' meant 'Mode' in the statistical sense and getting these statistical modes. I would have understood it better if you meant I wasn't implementing counting bloom filters or bloomier filters or <insert new variation of a highfalutin name> bloom filter. Anyway, the point is to provide a bloom filter using the textbook definition. If there's a need for other kinds of bloom filters, like counting bloom filters, then that's another implementation that can be bundled in the library later -- *if* there is enough need for it, and *if* the library even gets into Boost just trying to solve the common (basic) cases. Of course the library is designed to evolve.
5. Your solution doesn't provide for erasures (do not confuse with removal of inserted items)
.clear() is supposed to clear the internal bitset and clear the bloom filter.
No, not clear. A bloom filter is to a degree error resilient/tolerant. If for some reason part of your bloom filter data is damaged (n/w io, bad memory etc...), you can erase that section (set all the bits high); doing so, depending on how large the section is, will increase the fpp. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.
If you want to go down this road, then you might as well think about implementing your own allocator with error correction support and use that in the bloom_filter or any container that supports/uses custom allocators. Maybe you'll want an STL vector that also knows how to determine if part of the data is corrupted too?
6. Your solution doesn't provide for self superimposed compression mode (similar to union but doubling of fpp)
The idea of the bloom_filter is to provide precisely that, a bloom filter -- any compression you might want to do in transmitting a bloom_filter would IMO be beyond the scope of the library. There have been some suggestions of implementing a compressed_bloom_filter that maintains a compressed bitset, but apart from that I must say I have no idea what self superimposed compression mode is.
This is NOT data compression. If you are providing union operations, then you should also provide self compression -- that is, for example, OR'ing the 1st half of the filter with the 2nd half, decreasing its size.
I don't understand why you want to do this when clearly the Bloom Filter is meant to have a fixed number of bits. Any variation to this in transit is beyond the scope of the implementation of the Bloom Filter.
Also, another note looking at your code again (btw, do you even use it?!): your equality test should also include the hashes and any other seeding data you use to construct the BF.
I guess you missed the earlier discussion: the point was that since the hash functions are part of the bloom_filter's type, it's assumed that two bloom_filter instances of the same type have the same hash functions. If one bloom filter has a different encapsulated state compared to another bloom filter of the same type, then they are not equal -- simple, and easy to understand/implement. That's the point of equality testing after all.
Another thing that is missing is serialization of the BF. I would like to be able to save it to a file, or send it over a n/w, so a streaming operator or something in line with Boost.Serialize would be good.
Yes, I know. And if you missed the earlier discussion, I'm working on it. ;-)
Another thing to consider is that a BF implementation should not be a means to show off one's abilities in the use of templates and meta-programming paradigms, but rather something that is efficient and lean. Think about situations where cache hit-miss and coherency will be an issue (as these are some of the more critical uses of BFs), look at the machine code that is generated for some of the more critical routines insert/contains for various archs, and ask yourself what it is you're "really" trying to achieve.
Hmmm... I guess you haven't followed the discussion in the list at all. Let me rehash a bit:

1. The only reason Boost.Fusion was used was that the hash functions were actually made part of the type (of the bloom_filter template). There was a need to iterate through the hash functions when inserting an element and when checking whether it is part of the represented set.

2. The reason there are two bloom filter implementations is that there was a need for a version that could construct a bitset of size M at runtime (bloom_filter), as opposed to one that had the size set as part of the type (static_bloom_filter).

3. Nobody is showing off one's abilities to use templates and meta-programming here: the point of using Boost.Fusion is to simplify the iteration over the sequence of Hash Functions. If any showing off with template metaprogramming were going to happen on my part, I wouldn't have used Boost.Fusion and would have done it by hand instead.

4. The implementation is meant to be simple, portable across platforms, and C++ standard-compliant. When you deal with these issues, it's best you don't think about cache hit-miss and coherency issues IMO. At any rate, if you want better control of memory allocation and issues like these, you can control them by: a) using your favorite STL implementation of std::bitset in the case of the statically-sized filter, or b) providing your own allocator and using the "correct" block type in the case of a runtime-constructed Bloom Filter.

Of course, if the library under construction isn't suiting your needs, there should be room for improvement. But there's no need to be too fancy with the implementation/interface or to be clever with the False Positive Probability -- you can control this easily by choosing/implementing your own hash functions and the size of the underlying bitset.

HTH -- Dean Michael Berris | Software Engineer, Friendster, Inc. 
blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

Dean Michael Berris wrote:
When you say correct set of interfaces, what do you need more than 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the case of a counting bloom filter, maybe you'd have 'remove(...)' or 'erase(...)'. Other than these functions I don't know what you mean by most correct set of interfaces.
For me it means the possibility of defining different variants of Bloom Filters (standard or extensions) in an easy way. Something more than text-book.
There are various (many) use-cases for a BF:

1. User provides max size and max FPP; BF determines the best k.
2. User provides max FPP and max expected insertions; BF determines the best k and m.
3. User provides max FPP and a fuzzy k with a limit; BF determines m, k, and the optimal FPP.
4. Etc., etc.
Some of these can change/vary during runtime, so getters for fpp and inserted count would be nice/good/expected.
This is too complex -- and has too many assumptions. First, it assumes that the user wants a self-(re)sizing bloom filter. Second it assumes that the user doesn't want to provide their own hash functions. Third it assumes that the user actually really cares if the bloom filter should be smarter than the user.
1) The mentioned assumptions don't imply a dynamic self-resizing filter. But it would be nice to look at the recent papers about dynamic self-resizing filters...
2) Why does it assume that the user doesn't want custom hash functions?
3) It's not about being smarter, but about providing, in convenient form, some of the theory/math behind the filters.
In some cases (like the simplest cases I'm trying to solve for the moment) the current implementation seems sufficient. Of course, it's not fancy, and I don't intend for it to become too fancy. If you think you can implement something better, then please by all means show me code that I can deal with -- this is not said sarcastically, I really do want to see what you mean when you say all the above translated into code.
And I am implying that the simplest, naïve case is far from real-world applications. One should not focus on the simplest case, but attempt to provide a good solution.
Anyway, the point is to provide a bloom filter using the textbook definition. If there's a need for other kinds of bloom filters, like counting bloom filters, then that's another implementation that can be bundled in the library later -- *if* there is enough need for it, and *if* the library even gets into Boost just trying to solve the common (basic) cases. Of course the library is designed to evolve.
Yet again - the simplest case. If the library is designed to evolve, one should take much care when designing the interfaces. It's not easy to change afterwards. Providing counting modes or full-fledged counting filter should not be a problem.
No, not clear. A Bloom filter is, to a degree, error resilient/tolerant. If for some reason part of your Bloom filter data is damaged (network I/O, bad memory, etc.) you can erase that section (set all of its bits high); doing so will increase the FPP, depending on how large the section is. For some uses that is still quite acceptable, as the FPP has already been determined to be orders of magnitude less than what can be tolerated.
If you want to go down this road, then you think about implementing your own allocator with error correction support and use that in the bloom_filter or any container that supports/uses custom allocators. Maybe you'll want an STL vector that also knows how to determine if part of the data is corrupted too?
Please read the paper that I have mentioned in one of the earlier posts. There are also uses of the intentional introduction of the errors into the BF.
This is NOT data compression. If you are providing union operations, then you should also provide self compression: for example, OR'ing the 1st half of the filter with the 2nd half, halving its size.
I don't understand why you want to do this when clearly the Bloom Filter is meant to have a fixed number of bits. Any variation to this in transit is beyond the scope of the implementation of the Bloom Filter.
Again, there is some motivation and research behind compression. Tradeoff between size, time of lookup and FPR is one of the reasons.
Let me rehash a bit: 4. The implementation is meant to be simple, portable across platforms, and C++ standard-compliant. When you deal with these issues, it's best you don't think about cache hit-miss and coherency issues IMO. At any rate if you want to have better control of the memory allocation and issues like these, you can control them by: a) using your favorite STL implementation of an std::bitset in the case of the statically-sized filter or b) by providing your own allocator and using the "correct" block type in the case of a runtime-constructed Bloom Filter.
And the implementation can be perfectly C++ compliant and portable with including the variants of the Bloom filter that pay attention to the cache hit-miss ratio and coherency. And it is not only the issue of memory allocation itself but it has much to do with the internal working of the filter.
Of course, if the library under construction isn't suiting your needs, there should be room for improvement. But there's no need to be too fancy with the implementation/interface and being clever with the False Positive Probability -- you can control this easily with choosing/implementing your own hash functions and the size of the underlying bitset.
You can do more than choosing a custom hash and size. Regards, Milosz

Milosz, I completely agree with everything you have said. In the event Dean decides it's a bit much for him, I wouldn't mind teaming up and doing it properly.

Arash Partow
__________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net

Arash Partow wrote:
Milosz, I completely agree with everything you have said. In the event Dean decides its a bit much for him, I wouldn't mind teaming up and doing it properly.
I also don't mind teaming up in order to provide the "most robust Bloom Filter implementation ever". ;) Regards, Milosz

On Fri, Jun 19, 2009 at 2:59 PM, Milosz Marian HULBOJ<mhulboj@hulboj.org> wrote:
Dean Michael Berris wrote:
When you say correct set of interfaces, what do you need more than 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the case of a counting bloom filter, maybe you'd have 'remove(...)' or 'erase(...)'. Other than these functions I don't know what you mean by most correct set of interfaces.
For me it means the possibility of defining different variants of Bloom Filters (standard or extensions) in an easy way. Something more than text-book.
Right, then I guess the Concepts documentation should be enough -- saying that a type models the BloomFilter concept when it supports the following restrictions/interface: ...
There are various (many) use-cases for a BF
1. user provides max size and max fpp BF determines best k 2. user provides max fpp and max expected insertion BF determine best k,m 3. user provides max fpp and fuzzy k with limit BF determines m,k and optimal fpp 4. etc etc...
Some of these can change/vary during runtime, so getters for fpp and inserted count would be nice/good/expected.
This is too complex -- and has too many assumptions. First, it assumes that the user wants a self-(re)sizing bloom filter. Second it assumes that the user doesn't want to provide their own hash functions. Third it assumes that the user actually really cares if the bloom filter should be smarter than the user.
1) The mentioned assumptions don't imply the dynamic-self-resizing filter. But it would be nice to look at the recent papers about dynamic-self-resizing papers...
Actually, it does -- it implies that you just give a false positive probability and it computes the number of hash functions (k) and the size of the container (m). Maybe not re-sizing (I've read a few papers that imply or suggest the possibility and practicability of that), but self-sizing at construction is IMO too fancy. Maybe a convenience function and meta-function would work:

template <double FPP> struct m_and_k_computer;

m_and_k_computer<0.09>::k; // yields the number of hash functions
m_and_k_computer<0.09>::m; // yields the number of bits used to represent the set
2) Why does it assume that the user doesn't want custom hash functions?
Because none of the items in the numbered list presented earlier mentioned the possibility of a custom hash function. ;-)
3) It's not about being smarter, but about providing in convenient form some of the theory/math behind the filters.
Sure, but it's completely possible to layer these things on top of the underlying data structure which is the *_bloom_filter. I mean computing the false positive probability is independent of the actual workings of a bloom filter. I can imagine a layer of convenience functions (factories really) that does everything above and uses the current bloom filter implementation as the resulting type. What I'm really saying is, that these things (ways to compute K,M, having different hash functions, etc.) are nice to have, but is it really part of the basic notion of what a Bloom Filter is or how it actually behaves? There is an algorithmically simple template for the 'insert' and 'contains' operations and sure there can be more optimizations at that level -- and I'm definitely welcome to suggestions. However now the question becomes: should these things be part of a Bloom Filter library? Maybe yes, maybe no -- maybe like the type of elements you represent and your actual requirements in the application, it should be determined by the users.
In some cases (like the simplest cases I'm trying to solve for the moment) the current implementation seems sufficient. Of course, it's not fancy, and I don't intend for it to become too fancy. If you think you can implement something better, then please by all means show me code that I can deal with -- this is not said sarcastically, I really do want to see what you mean when you say all the above translated into code.
And I am implying that the simplest and naïve case is far from the real world applications. And one should not focus on the simplest case, but attempt to provide a good solution.
Hmmm... I'm providing a simple means of defining a bloom filter given the specifications. Now if you need a certain FPP met, then maybe you should be the one computing K and M -- if you want to use special hash functions, then maybe you should provide them instead of using the defaults. The defaults should make it convenient to use, but of course if genericness of the simplest case doesn't work for you, then you have the option to specialize to your own requirements.
Anyway, the point is to provide a bloom filter using the textbook definition. If there's a need for other kinds of bloom filters, like counting bloom filters, then that's another implementation that can be bundled in the library later -- *if* there is enough need for it, and *if* the library even gets into Boost just trying to solve the common (basic) cases. Of course the library is designed to evolve.
Yet again - the simplest case. If the library is designed to evolve, one should take much care when designing the interfaces. It's not easy to change afterwards. Providing counting modes or full-fledged counting filter should not be a problem.
I'm a little confused here: what else do you need from a bloom filter other than 'bf.insert(x)' and 'bf.contains(x)' ? This is the basic interface to any bloom filter IIRC, and I don't get why sticking to that will necessarily make it harder to change in the future. If you're talking about the type interface, it's going to be way harder to change if you have a single bloom_filter type that has loads and loads of template parameters that define the body of the bloom_filter IMO. This is why I chose to separate the static_bloom_filter (statically-sized bloom filter that is) and the runtime-sized bloom_filter templates. If there's a special kind of bloom_filter that needs to be implemented in the future (like a counting bloom filter, and the compressed bloom filter) then that's a different type but with similar interfaces to the basic bloom_filter.
No not clear. A bloom filter is to a degree error resilient/tolerant. if for some reason part of your bloom filter data is damaged (n/w io, bad memory etc...) you can erase that section (set all the bits high), doing so depending on how large the section is will increase the fpp. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.
If you want to go down this road, then you think about implementing your own allocator with error correction support and use that in the bloom_filter or any container that supports/uses custom allocators. Maybe you'll want an STL vector that also knows how to determine if part of the data is corrupted too?
Please read the paper that I have mentioned in one of the earlier posts. There are also uses of the intentional introduction of the errors into the BF.
Hmmm... What I'm implying here is that the "generic" data structure should not be bothered with details like these of whether the memory gets corrupted, etc. IMO you deal with network transmission errors at the network transmission layer, the memory errors at the appropriate level, and appropriate errors at the appropriate level of detail. Now if we're going to start writing self-aware containers, then maybe there should be containers that can actually heal themselves in case of memory corruption. Those would be nice to have, but I doubt they'd be a lot better than the containers we already have at the moment which *aren't* storage/network/self-aware except in these special cases when memory gets corrupted.
This is NOT data compression if you are providing union operations, then you should also provide self compression, that is for example, OR'ing the 1st half of the filter with the 2nd half, decreasing its size.
I don't understand why you want to do this when clearly the Bloom Filter is meant to have a fixed number of bits. Any variation to this in transit is beyond the scope of the implementation of the Bloom Filter.
Again, there is some motivation and research behind compression. Tradeoff between size, time of lookup and FPR is one of the reasons.
Yes, I understand this. Now if you need/want a compressed bloom filter, maybe a compressed_bloom_filter type would be more appropriate. No?
Let me rehash a bit: 4. The implementation is meant to be simple, portable across platforms, and C++ standard-compliant. When you deal with these issues, it's best you don't think about cache hit-miss and coherency issues IMO. At any rate if you want to have better control of the memory allocation and issues like these, you can control them by: a) using your favorite STL implementation of an std::bitset in the case of the statically-sized filter or b) by providing your own allocator and using the "correct" block type in the case of a runtime-constructed Bloom Filter.
And the implementation can be perfectly C++ compliant and portable with including the variants of the Bloom filter that pay attention to the cache hit-miss ratio and coherency. And it is not only the issue of memory allocation itself but it has much to do with the internal working of the filter.
Yes, I agree. Now what do these special bloom filters look like? How do they behave? Can I possibly implement all these different kinds of special bloom filters? Maybe yes, maybe no, but what I'm saying is that for the most part the goal is to make using Bloom Filters simple and easy to use. Just like the STL vector, some people have chosen to not use it for a myriad of reasons just like the ones you mention (cache coherency, etc.), but for the simplest cases where you need a homogeneous container I believe the STL vector is enough. Same reasoning goes: if you need a simple bloom filter, then the candidate Boost.Bloom_filter might work for you. If not, maybe a special one in the library would work better for you. Or, you just might have to write one for your specific needs.
Of course, if the library under construction isn't suiting your needs, there should be room for improvement. But there's no need to be too fancy with the implementation/interface and being clever with the False Positive Probability -- you can control this easily with choosing/implementing your own hash functions and the size of the underlying bitset.
You can do more than choosing a custom hash and size.
Yes, I know this -- but the point I was trying to make was that _for the simplest cases_ and for the most part, the simplicity of the proposed solution should do.

HTH

-- Dean Michael Berris | Software Engineer, Friendster, Inc.

Dean Michael Berris wrote:
On Fri, Jun 19, 2009 at 4:59 AM, Arash Partow wrote:
Dean Michael Berris wrote:
Hi Arash,
1. Can you please explain why you need multiple "hashes"? You don't need multiple hashes, but if you do want to use more hash functions (K) you can. The point is the user should be able to provide
On Thu, Jun 18, 2009 at 7:00 AM, Arash Partow wrote: the hash functions that the bloom filter will use internally.
Please read something other than the wiki; there are many great citations in the wiki article that explain the most practical state-of-the-art BFs. If you want something for Boost then at least try to provide the most correct set of interfaces, even if the underlying details are not the most correct. In short, you only need 2 independent hash functions to get the optimal FPP for a set of size m bits (an expected insertion count is optional).
When you say correct set of interfaces, what do you need more than 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the case of a counting bloom filter, maybe you'd have 'remove(...)' or 'erase(...)'. Other than these functions I don't know what you mean by most correct set of interfaces.
Those are not the only methods; in fact there are a few constructors and some policies that can be provided. Understanding the problem space is a good place to begin. Seeing how BFs are used, and not just assuming how they are used, is another thing you should look into. I would also have a look at the Boost.CRC library; there's more to this than just an insert and contains. It could become quite a useful library if it meets the needs of people.
2. Your code doesn't trivially exit on the contains It used to be that it did (in the version that didn't use Boost.Fusion). I can try and optimize still in cases when it encounters a '0' bit in between the hash function results.
this is what i see:
bool contains(const_ref input) const {
    using fusion::for_each;
    typedef typename base::contains_impl contains_checker;
    bool found = true;
    for_each(
        hash_functions,
        contains_checker(bit_set, input, found)
    );
    return found;
}
unless the contract for for_each has changed (which it hasn't both in stl and fusion):
1. it doesn't exit
Oh, but it does. Have you tried the tests? Remember this is Boost.Fusion -- http://www.boost.org/doc/libs/1_39_0/libs/fusion/doc/html/fusion/algorithm/i... -- it iterates on the 'hash_functions' sequence which is finite.
"Your code doesn't trivially exit" -- I'll repeat it again: "Your code doesn't trivially exit." There is no talk about the for_each not ending. It's just that because you really want to push the idea of using metaprogramming, you've lost sight of the simplest, most elegant solution, which is:

for (i = 0; i < k; ++i)
    if (0 == m[h[i](d)]) return false;
return true;
2. if you even bother to look at the machine code generated, you'd see that even with the best available set of opt flags, its still miles behind a simple loop with a break statement. You have a copy constructor being called, do you know the cost that will incur when calling contains 10^9 times?!
Really? The Boost.Fusion version actually unrolls the loop for you, so the cost of that for_each call is actually constant time. Sure, there's no short-circuit (yet) but then the cost is constant.
If you actually ran your own code, you would realize that unrolling a loop of calls to contains is far more beneficial than unrolling the loop inside contains (this is also true for insert). Furthermore, if iteration over the data to be hashed were mixed in with the hashing operations (e.g. update hash_0..hash_k-1 with data_i), that would also be far more efficient than the "The Boost.Fusion version actually unrolls the loop for you" philosophy.
3. Your solution doesn't provide a means to determine the various optimal parameter combinations What do you mean by this?
There are various (many) use-cases for a BF
1. user provides max size and max fpp BF determines best k 2. user provides max fpp and max expected insertion BF determine best k,m 3. user provides max fpp and fuzzy k with limit BF determines m,k and optimal fpp 4. etc etc...
Some of these can change/vary during runtime, so getters for fpp and inserted count would be nice/good/expected.
This is too complex -- and has too many assumptions. First, it assumes that the user wants a self-(re)sizing bloom filter. Second it assumes that the user doesn't want to provide their own hash functions. Third it assumes that the user actually really cares if the bloom filter should be smarter than the user.
In some cases (like the simplest cases I'm trying to solve for the moment) the current implementation seems sufficient. Of course, it's not fancy, and I don't intend for it to become too fancy. If you think you can implement something better, then please by all means show me code that I can deal with -- this is not said sarcastically, I really do want to see what you mean when you say all the above translated into code.
What I read from this is: I'm trying to propose and eventually submit a library for which I don't fully understand its most basic use-cases, furthermore I can't be bothered to look into it any further without someone feeding me the answers - good attitude :)
4. Your solution doesn't provide a means for counting modes Modes for bloom filters? I'm not sure I understand what you mean.
There are modes of BFs where instead of 1 bit per block you use more, say 4 bits, and instead of setting a bit high you increment a counter. This allows for a probabilistic determination of how many times an element has been inserted into a BF; this is called a counting bloom filter. This is another of MANY important use-cases you have not covered.
I thought 'counting modes' meant 'Mode' in the statistical sense and getting these statistical modes.
I would have understood it better if you meant I wasn't implementing counting bloom filters or bloomier filters or <insert new variation of a highfalutin name> bloom filter.
Anyway, the point is to provide a bloom filter using the textbook definition. If there's a need for other kinds of bloom filters, like counting bloom filters, then that's another implementation that can be bundled in the library later -- *if* there is enough need for it, and *if* the library even gets into Boost just trying to solve the common (basic) cases. Of course the library is designed to evolve.
A smart representation of the internal data and some policies could cater for both modes without having to create an xxx_bf class for each new mode. Again, as mentioned above, this requires understanding the problem domain and not just 5 minutes on Google.
5. Your solution doesn't provide for erasures (do not confuse with removal of inserted items) .clear() is supposed to clear the internal bitset and clear the bloom filter.
No not clear. A bloom filter is to a degree error resilient/tolerant. if for some reason part of your bloom filter data is damaged (n/w io, bad memory etc...) you can erase that section (set all the bits high), doing so depending on how large the section is will increase the fpp. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.
If you want to go down this road, then you think about implementing your own allocator with error correction support and use that in the bloom_filter or any container that supports/uses custom allocators. Maybe you'll want an STL vector that also knows how to determine if part of the data is corrupted too?
An allocator allocates/deallocates memory; that's all it does, nothing else. This is clearly a job for the class holding the data. Again, I'm not going to feed you the answer as to why this is done; as a person interested in proposing a library you should at the very least know about it and why it's done.
6. Your solution doesn't provide for self superimposed compression mode (similar to union but doubling of fpp)
The idea of the bloom_filter is to provide precisely that, a bloom filter -- any compression you might want to do in transmitting a bloom_filter would IMO be beyond the scope of the library. There have been some suggestions of implementing a compressed_bloom_filter that maintains a compressed bitset, but apart from that I must say I have no idea what a self superimposed compression mode is.
This is NOT data compression if you are providing union operations, then you should also provide self compression, that is for example, OR'ing the 1st half of the filter with the 2nd half, decreasing its size.
I don't understand why you want to do this when clearly the Bloom Filter is meant to have a fixed number of bits. Any variation to this in transit is beyond the scope of the implementation of the Bloom Filter.
Again if you can't see why decreasing the size of the data gradually has a benefit then.....
Also, another note from looking at your code again (btw, do you even use it?!): your equality test should also include the hashes and any other seeding data you use to construct the BF.
I guess you missed the earlier discussion: the point was that since the hash functions are part of the bloom_filter's type, it's assumed that two bloom_filter instances of the same type have the same hash functions. If one bloom filter has a different encapsulated state compared to another bloom filter of the same type, then they are not equal -- simple, and easy to understand/implement. That's the point of equality testing after all.
OK, my bad, I probably didn't catch that. But to be fair, all I did was look at the code you've presented. Of all the information available on this library, that is the most pertinent, not the back-and-forths on an ML.
Another thing that is missing is serialization of the BF; I would like to be able to save it to a file, or send it over a n/w. So a streaming operator, or something in line with Boost.Serialization, would be good.
Yes, I know. And if you missed the earlier discussion, I'm working on it. ;-)
ok.
Another thing to consider is that a BF implementation should not be a means to show off one's abilities in the use of templates and meta-programming paradigms, but rather something that is efficient and lean. Think about situations where cache hit/miss and coherency will be an issue (as these are some of the more critical uses of BFs), look at the machine code that is generated for some of the more critical routines (insert/contains) for various archs, and ask yourself what it is you're "really" trying to achieve.
Hmmm... I guess you haven't followed the discussion at all in the list.
No, you're right; all I need is to see code where you try and push the use of complex template algos with in-line copy constructors when all you need is a for-loop, an index counter, and a break. I apologize for not being enlightened by the discussion on the ML.
Let me rehash a bit:
Of course, if the library under construction isn't suiting your needs, there should be room for improvement. But there's no need to be too fancy with the implementation/interface or too clever with the false positive probability.
If you actually use BFs you'll soon realize you need all these things. Otherwise, all it is is an array with some hash functions. There is nothing wrong with wanting the bare minimum and most correct interface; isn't that one of the guiding principles of Boost?

Look, I've been a bit brash, I apologize. I know you're keen to make a contribution to Boost and get your name up in lights; you've got that function input iter thing (whatever that is) pending. It's always nice to see people eager and keen to contribute to open source stuff, so don't let me or my whims discourage you. :D

Arash Partow
__________________________________________________
Be one who knows what they don't know,
Instead of being one who knows not what they don't know,
Thinking they know everything about all things.
http://www.partow.net

On Fri, Jun 19, 2009 at 3:18 PM, Arash Partow<arash@partow.net> wrote:
Dean Michael Berris wrote:
When you say correct set of interfaces, what do you need more than 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the case of a counting bloom filter, maybe you'd have 'remove(...)' or 'erase(...)'. Other than these functions I don't know what you mean by most correct set of interfaces.
Those are not the only methods, in fact there are a few constructors and some policies that can be provided.
Constructors, maybe. Policies, I'm still a little iffy about. I'd say I can derive policies from the existing implementation. If there comes another special bloom filter, that gives me a sample of three implementations from which to derive policies. I'm not about to go psychic and try to predict *all* the possible ways I can split up the bloom filter implementation to cover all possible variations of the bloom filter -- although that would be cool if I could. ;-)
Understanding the problem space is a good place to begin. Seeing how BFs are used, and not just assuming how they are used, is another thing you should look into; I would also have a look at the Boost.CRC library. There's more to this than just an insert and contains. It could become quite a useful library, if it meets the needs of people.
But Boost.CRC isn't a container; there's no way you can compare the bloom_filters implementation to Boost.CRC -- unless I'm missing your point. The problem space that I understand Bloom Filters are trying to address is this:

1. You need to represent a set of values.
2. You want the representation to be space efficient.
3. You can tolerate false positives.

Now, there are *a lot* of places where you can use a data structure like this. One that pops into my head is a means of space-efficiently checking whether you've already encountered a value before. This works in filesystem caches and web proxy caches, and there are a lot of uses in social networking as well. Some people use Bloom Filters merely for representing sets of numbers, in case they were generating new "random" identifiers.

Of course I cannot predict all the performance requirements of all these use cases -- some would want compression in case of transmitting the data over the network, some would want wicked fast checking, some would want to represent absolutely huge sets and have the bitset representation in memory compressed, etc. Do I want to deal with some of these special cases? Oh yes, the compressed_bloom_filter is something I'm thinking of implementing next, and there's also the interesting counting_bloom_filter. Do I want to deal with them all? Probably not.
2. Your code doesn't trivially exit on the contains

It used to be that it did (in the version that didn't use Boost.Fusion). I can still try to optimize for cases where it encounters a '0' bit in between the hash function results.
this is what i see:
bool contains(const_ref input) const {
    using fusion::for_each;
    typedef typename base::contains_impl contains_checker;
    bool found = true;
    for_each(hash_functions, contains_checker(bit_set, input, found));
    return found;
}
unless the contract for for_each has changed (which it hasn't both in stl and fusion):
1. it doesn't exit
Oh, but it does. Have you tried the tests? Remember this is Boost.Fusion --
http://www.boost.org/doc/libs/1_39_0/libs/fusion/doc/html/fusion/algorithm/i... -- it iterates on the 'hash_functions' sequence which is finite.
"Your code doesn't trivially exit", I'll repeat it again "Your code doesn't trivially exit" there is no talk about the for_each not ending. Its just that because you really want to push the idea of using metaprogramming, you've lost sight of the simplest most elegant solution, which is:
for (i : 0 .. k-1)
    if (0 == m[h[i](d)]) return false;
return true;
I know, but have you checked my initial implementation, which is almost like that?

    bool result = true;
    for (size_t i = 0; i < K && result; ++i)
        result = result && bit_set_[hashes[i](input)];
    return result;

The reason I went with Boost.Fusion is that the sequence is statically sized and cannot be iterated through using runtime identifiers/indexes like above. See, how do you get the i'th element of a tuple when i is defined at runtime? Do you know how to do this with Boost.Fusion?

See, I'm still looking for a way of making the current implementation using Boost.Fusion short-circuit the loop. This means changing the function object used in the for_each to include a check for whether it's already encountered a 0 bit, so it doesn't have to perform the hashing function again. Because Boost.Fusion already unrolls the loop for me at compile time, I'm hoping the compiler will create the necessary branches that make it equivalent to the above runtime loop.

Now, if I hadn't made the function objects part of the type as a Boost.Fusion sequence, then I'd run into the equality-checking and swap/assignment problem. Been there, done that, and the solution above has made the implementation of 'contains' a little more complicated. This is why I feel frustrated at the moment, since this has already been discussed on the list before too. I know the burden is on me to get creative with how to address this issue now. I welcome the challenge, and I assure you I'm working on it. :-)
2. If you even bother to look at the machine code generated, you'd see that even with the best available set of opt flags, it's still miles behind a simple loop with a break statement. You have a copy constructor being called; do you know the cost that will incur when calling contains 10^9 times?!
Really? The Boost.Fusion version actually unrolls the loop for you, so the cost of that for_each call is actually constant time. Sure, there's no short-circuit (yet) but then the cost is constant.
If you actually ran your own code, you would realize that unrolling a loop of calls to contains is far more beneficial than unrolling the loop inside contains (also true for insert).
Wait, but inside contains, you get the benefit without paying too much of a cost. Boost.Fusion already unrolls the for_each for you, so I don't see what the problem is. I always thought unrolling/optimizing inner loops gave good results, and if you layer that on top of unrolled outer loops then you've essentially eliminated the loops too. I don't see why unrolling the inner loop (Boost.Fusion's for_each) is a bad thing.
Furthermore, if iteration over the data to be hashed was mixed in with the hashing operations (e.g. update hash_0..hash_k-1 with data_i), that would also be far more efficient than the "The Boost.Fusion version actually unrolls the loop for you" philosophy.
But this is up to the hash function implemented, not to 'contains'. See, if Boost.Fusion already unrolls the loop in 'contains' and the hash function implementation is already sufficiently optimized internally, I don't see what the problem is. BTW, if you check what the compiler actually did, you'll see that Boost.Fusion is dealing with references (not copies), even for the function object to apply; the unrolled loop is there, and all that remains are the (almost always inlined) calls to the hash functions.
This is too complex -- and has too many assumptions. First, it assumes that the user wants a self-(re)sizing bloom filter. Second it assumes that the user doesn't want to provide their own hash functions. Third it assumes that the user actually really cares if the bloom filter should be smarter than the user.
In some cases (like the simplest cases I'm trying to solve for the moment) the current implementation seems sufficient. Of course, it's not fancy, and I don't intend for it to become too fancy. If you think you can implement something better, then please by all means show me code that I can deal with -- this is not said sarcastically, I really do want to see what you mean when you say all the above translated into code.
What I read from this is: I'm trying to propose and eventually submit a library for which I don't fully understand the most basic use cases; furthermore, I can't be bothered to look into it any further without someone feeding me the answers -- good attitude :)
No, what I was communicating was the fact that the bloom filter implementation should be simple enough to fit simple cases, and good enough to cover cases with a little more complexity. Now, if you need anything specific that's not yet covered by the currently implemented bloom filters, then maybe those are different types that still need to be implemented. I personally may not be able to implement all of these before (and if ever) this gets reviewed and into the main Boost distribution, but in the off chance that you do want to actually contribute code, I am open to suggestions and collaboration.

Now, I don't see what's wrong with proposing and submitting a simple library _that works_ and iterating according to the requirements of the community. So far I see that there's only one person asking for a set of highly complex specialized bloom filters, and I'm questioning whether it's really a good use of my time to chase corner cases for (currently) unimplemented bloom filters. ;-)
I thought 'counting modes' meant 'Mode' in the statistical sense and getting these statistical modes.
I would have understood it better if you meant I wasn't implementing counting bloom filters or bloomier filters or <insert new variation of a highfalutin name> bloom filter.
Anyway, the point is to provide a bloom filter using the textbook definition. If there's a need for other kinds of bloom filters, like counting bloom filters, then that's another implementation that can be bundled in the library later -- *if* there is enough need for it, and *if* the library even gets into Boost just trying to solve the common (basic) cases. Of course the library is designed to evolve.
A smart representation of the internal data and some policies could cater for both modes without having to create an xxx_bf class for each new mode. Again, as mentioned above, this requires understanding the problem domain, and not just five minutes on Google.
The smart representation is already there for the basic use case (a Boost.Dynamic_bitset and an STL bitset). Now, in the case of compressed internal representations for the compressed_bloom_filter suggestion, that has yet to be implemented. Also, the counting_bloom_filter is also a different thing. Notice that they're different types? Remember, you derive policies, not predict them. I think the implementation can improve and evolve, but I don't see why you're looking at the current version and thinking I'm not thinking about changing them along the way... Have you assumed that I've stopped thinking about this project already?
5. Your solution doesn't provide for erasures (do not confuse with removal of inserted items)

clear() is supposed to clear the internal bitset and clear the bloom filter.
No, not clear. A bloom filter is to a degree error resilient/tolerant. If for some reason part of your bloom filter data is damaged (n/w IO, bad memory, etc.) you can erase that section (set all the bits high); doing so, depending on how large the section is, will increase the fpp. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.
If you want to go down this road, then you could think about implementing your own allocator with error-correction support and using that in the bloom_filter, or in any container that supports/uses custom allocators. Maybe you'll want an STL vector that also knows how to determine if part of its data is corrupted, too?
An allocator allocates/deallocates memory; that's all it does, nothing else. This is clearly a job for the class holding the data. Again, I'm not going to feed you the answer as to why this is done; as a person interested in proposing a library, you should at the very least know about it and why it's done.
If you had a use case where you actually expected a container to heal itself due to memory corruption, then maybe I'll understand better what you mean. As I've already responded to another email, I'm thinking this should be a special case that doesn't need to be part of a generic solution.
I don't understand why you want to do this when clearly the Bloom Filter is meant to have a fixed number of bits. Any variation to this in transit is beyond the scope of the implementation of the Bloom Filter.
Again if you can't see why decreasing the size of the data gradually has a benefit then.....
Sure, you want a policy that defines the internal storage, so you can define a policy for compressed bloom filters. I get it. But how's that different from implementing a compressed_bloom_filter?
I guess you missed the earlier discussion: the point was that since the hash functions are part of the bloom_filter's type, it's assumed that two bloom_filter instances of the same type have the same hash functions. If one bloom filter has a different encapsulated state compared to another bloom filter of the same type, then they are not equal -- simple, and easy to understand/implement. That's the point of equality testing after all.
OK, my bad, I probably didn't catch that. But to be fair, all I did was look at the code you've presented. Of all the information available on this library, that is the most pertinent, not the back-and-forths on an ML.
Actually if you read the code carefully, that's already the case even if you didn't read the discussions on the list.
Hmmm... I guess you haven't followed the discussion at all in the list.
No, you're right; all I need is to see code where you try and push the use of complex template algos with in-line copy constructors when all you need is a for-loop, an index counter, and a break. I apologize for not being enlightened by the discussion on the ML.
What in-line copy constructors are you talking about? I make absolutely no copies in the code whatsoever when it comes to using Boost.Fusion's for_each. Are you sure you know how Boost.Fusion works? You seem afraid of the idioms used in the implementation.
Let me rehash a bit:
Of course, if the library under construction isn't suiting your needs, there should be room for improvement. But there's no need to be too fancy with the implementation/interface or too clever with the false positive probability.
If you actually use BFs you'll soon realize you need all these things. Otherwise, all it is is an array with some hash functions. There is nothing wrong with wanting the bare minimum and most correct interface; isn't that one of the guiding principles of Boost?
Sure, that's why I'm open to compressed_bloom_filter and counting_bloom_filter and other bloom filter implementations (bloomier_filters come to mind). But what I'm saying is that, pending these implementations, I'm sticking with the implementation that's currently up there, which suits the bare minimum use case just fine. There's still room for improvement, as you've pointed out in the contains() function, and I'll look into addressing the trivial-exit problem.
Look, I've been a bit brash, I apologize. I know you're keen to make a contribution to Boost and get your name up in lights; you've got that function input iter thing (whatever that is) pending. It's always nice to see people eager and keen to contribute to open source stuff, so don't let me or my whims discourage you. :D
Has nothing to do with getting the name up in lights. I guess I was just looking for clearer explanations/thoughts than just one-liners and vague statements. Hopefully you can get more verbose about what you mean rather than just silently implying things; this medium lends itself better to discussion that way. Otherwise, it's really hard to understand what you meant with the earlier list of one-liners, and the way I knew how to get more information from you was to engage you in discussion. Don't worry, they're more encouraging than discouraging, really. Now that just means there's more work for me. :-)

-- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

The name static_bloom_filter is not the best, as the keywords (semi-)static can have a special meaning related to the filter's behaviour. Regards, Milosz
participants (9)
-
Arash Partow
-
Dean Michael Berris
-
Joaquin M Lopez Munoz
-
joaquin@tid.es
-
Manuel Fiorelli
-
Milosz Marian HULBOJ
-
Sebastian Redl
-
Vicente Botet Escriba
-
vicente.botet