Interest in a Bloom Filter?

Hi Everyone, During the weekend, I got acquainted with and excited about Bloom Filters and the utility of such a data structure. I then proceeded to look at generic implementations of a bloom filter and I thought about implementing one myself instead as a simple exercise. Attached is what I came up with which I'm submitting as a library for review (and uploading to the vault). Also attached is a sample program which uses the bloom_filter. I've yet to write some documentation about it, which would be in the list of things to do in case there's enough interest for the library to be included in Boost. Thanks in advance and I hope this helps! -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

Dean Michael Berris wrote:
During the weekend, I got acquainted with and excited about Bloom Filters and the utility of such a data structure.
I've only heard of bloom effects as used in 3D graphics for more realistic lighting effects. Looking at the code, that does not seem to be the purpose of your bloom filter. Just what is it and why might I use it? --Jeffrey Bosboom

Hi Jeffrey, On Mon, Jun 8, 2009 at 2:00 PM, Jeffrey Bosboom<jbosboom@uci.edu> wrote:
Dean Michael Berris wrote:
During the weekend, I got acquainted with and excited about Bloom Filters and the utility of such a data structure.
I've only heard of bloom effects as used in 3D graphics for more realistic lighting effects. Looking at the code, that does not seem to be the purpose of your bloom filter. Just what is it and why might I use it?
A Bloom Filter is a space efficient data-structure that allows you to represent set membership that allows for false positives (elements may have been marked as included in a set when they really aren't part of the set) but not false negatives (when an element has not been included/added in the set, the bloom filter wouldn't show that they are part of the set). These have been used in proxy caches to signify whether they already have a cached copy of a URI. File systems also use these to see whether a part of a file (or a page) has already been accessed before (to limit the frequency of fetching pages that have been fetched before on a DFS for instance). A more in-depth explanation of bloom filters can be found here: http://en.wikipedia.org/wiki/Bloom_filter HTH -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

Dean Michael Berris <mikhailberis <at> gmail.com> writes:
Hi Everyone,
During the weekend, I got acquainted with and excited about Bloom Filters and the utility of such a data structure. I then proceeded to look at generic implementations of a bloom filter and I thought about implementing one myself instead as a simple exercise.
Attached is what I came up with which I'm submitting as a library for review (and uploading to the vault). Also attached is a sample program which uses the bloom_filter.
I'm definitely interested in having a Bloom filter in Boost. A couple of comments on your current implementation: * Seems to me it is more natural (or at least more in sync with current custom in std containers) to let the order of bloom_filter template parameters be Input,M,K * Why not make bloom_filter<M,K,T> interface mimic that of a std::map<T,bool> so that instead of bf.contains(x) one can write bf[x] // returns a bool Admittedly, one cannot extend this analogy to insertion: bf[x]=true could represent bf.insert(x), but bf[x]=false is not implementable due to the very nature of Bloom filters. * Again with the purpose of using as standard a terminology as possible I'd rename reset() to clear(). * state() should return a const reference to bit_set so as to avoid copying when not strictly needed. Joaquín M López Muñoz Telefónica, Investigación y Desarrollo

On Mon, Jun 8, 2009 at 2:47 PM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote:
Dean Michael Berris <mikhailberis <at> gmail.com> writes:
Attached is what I came up with which I'm submitting as a library for review (and uploading to the vault). Also attached is a sample program which uses the bloom_filter.
I'm definitely interested in having a Bloom filter in Boost.
This sounds encouraging! :-)
A couple of comments on your current implementation:
* Seems to me it is more natural (or at least more in sync with current custom in std containers) to let the order of bloom_filter template parameters be
Input,M,K
Right. That's something that can easily be changed. ;-)
* Why not make bloom_filter<M,K,T> interface mimic that of a std::map<T,bool> so that instead of
bf.contains(x)
This can be retained though, for semantic purposes.
one can write
bf[x] // returns a bool
I like this. :-)
Admittedly, one cannot extend this analogy to insertion: bf[x]=true could represent bf.insert(x), but bf[x]=false is not implementable due to the very nature of Bloom filters.
Right. I think though that once you know that something is a bloom filter, you might very well use insert() -- if the operator[] overload returns a bool, I'm not sure how the above code would compile.
* Again with the purpose of using as standard a terminology as possible I'd rename reset() to clear().
I borrowed reset() from std::bitset -- although I agree clear() is better terminology.
* state() should return a const reference to bit_set so as to avoid copying when not strictly needed.
I was torn with this one, but I agree that this would be a sane solution. I was worrying about the user doing a const_cast<> and then doing something stupid with the internal data structure, but then I guess all bets are off once that happens. ;-) Thanks for your insights! I'll make the changes and then maybe put it in the sandbox along with the documentation that I'll be working on. Have a great day! -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

Dean Michael Berris wrote:
Hi Everyone,
During the weekend, I got acquainted with and excited about Bloom Filters and the utility of such a data structure. I then proceeded to look at generic implementations of a bloom filter and I thought about implementing one myself instead as a simple exercise.
Attached is what I came up with which I'm submitting as a library for review (and uploading to the vault). Also attached is a sample program which uses the bloom_filter.
Hello, Few comments: - It would be nice to have the option of using `double-hashing' schema, where all the hashes are generated according to: hash(i) = h1(x) + i*h2(x) or the squared/cubed variant: hash(i) = h1(x) + i*h2(x) + i*i (*i) - one hash value can be potentially used to generate many hash values. For example, if you need 16-bit hashes, then from the hash function producing 32-bit hash-value you could get 2 values. This can be in turn combined with the previous point - double hashing. - From my point of view bloom filters are used in performance-critical applications, and each tiny optimisation counts. - intersection/union/difference operations Regards, Milosz

Hi Milosz, On Mon, Jun 15, 2009 at 2:49 PM, Milosz Marian HULBOJ<mhulboj@hulboj.org> wrote:
- It would be nice to have the option of using `double-hashing' schema, where all the hashes are generated according to: hash(i) = h1(x) + i*h2(x) or the squared/cubed variant: hash(i) = h1(x) + i*h2(x) + i*i (*i)
This is interesting. I'd think this would have to be implemented by the user of the bloom_filter in the HashFunctions template parameter.
- one hash value can be potentially used to generate many hash values. For example, if you need 16-bit hashes, then from the hash function producing 32-bit hash-value you could get 2 values. This can be in turn combined with the previous point - double hashing.
Right. But this is something that would have to be provided by the user IMO.
- From my point of view bloom filters are used in performance-critical applications, and each tiny optimisation counts.
I would agree, and I've tried to optimize as much as I can (or as much as I know how to) by having a statically-sized version and another version that has a runtime-constructed Boost.Dynamic_bitset contained. If you have specific ideas about how I can still optimize parts of the implementation, I'd definitely like to hear them. :)
- intersection/union/difference operations
This sounds like it should be easy to do, but I'd need to look into the protocol implementation for this. I say this because I might run into a place where instead of just doing simple naive intersection/union/difference on bitsets, I might have to do some expression templates with Boost.Proto -- and as much as I'd love to use Boost.Proto and learn how to implement real protocols with it, I'd rather stick with the simple implementations first. At any rate, this would be interesting/easy to implement as free function operator overloads. Thanks for your thoughts and I hope this helps! -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com

Two more ideas: - Counting bloom filter (different from normal bloom filter, as supports removal, but requires more space overhead in order to store counts) - Example/test code (attached the skeleton) Regards, Milosz

Dean Michael Berris wrote:
Hi Milosz,
On Mon, Jun 15, 2009 at 2:49 PM, Milosz Marian HULBOJ<mhulboj@hulboj.org> wrote:
- It would be nice to have the option of using `double-hashing' schema, where all the hashes are generated according to: hash(i) = h1(x) + i*h2(x) or the squared/cubed variant: hash(i) = h1(x) + i*h2(x) + i*i (*i)
This is interesting. I'd think this would have to be implemented by the user of the bloom_filter in the HashFunctions template parameter.
- one hash value can be potentially used to generate many hash values. For example, if you need 16-bit hashes, then from the hash function producing 32-bit hash-value you could get 2 values. This can be in turn combined with the previous point - double hashing.
Hi Dean, The bottom line of the two points above is that in order to hash for the bloom filter (in the double-hashing schema) one only needs to: - invoke hash function once - split the bits and do few multiplications and additions Still I didn't have time to look into yours implementation to see whether the above can be used seamlessly. There are also variants of double hashing to be more cache-line friendly, but personally the idea presented above was sufficient for my needs (for the hashing functions I've been using the TR1 hash and murmur hash). Regards, Milosz

2009/6/15 Milosz Marian HULBOJ <mhulboj@hulboj.org>:
Dean Michael Berris wrote:
Hi Milosz,
On Mon, Jun 15, 2009 at 2:49 PM, Milosz Marian HULBOJ<mhulboj@hulboj.org> wrote:
- It would be nice to have the option of using `double-hashing' schema, where all the hashes are generated according to: hash(i) = h1(x) + i*h2(x) or the squared/cubed variant: hash(i) = h1(x) + i*h2(x) + i*i (*i)
This is interesting. I'd think this would have to be implemented by the user of the bloom_filter in the HashFunctions template parameter.
- one hash value can be potentially used to generate many hash values. For example, if you need 16-bit hashes, then from the hash function producing 32-bit hash-value you could get 2 values. This can be in turn combined with the previous point - double hashing.
Hi Dean,
The bottom line of the two points above is that in order to hash for the bloom filter (in the double-hashing schema) one only needs to: - invoke hash function once - split the bits and do few multiplications and additions Still I didn't have time to look into yours implementation to see whether the above can be used seamlessly.
There are also variants of double hashing to be more cache-line friendly, but personally the idea presented above was sufficient for my needs (for the hashing functions I've been using the TR1 hash and murmur hash).
I suspect that this approach cannot be implemented with the current design, because of the management of the hash functions. The template class bloom_filter takes a parameter, called HashFunctions, which describes the K independent hash functions to instantiate and use. There is no way to express that the i-th hash depends on the previous hashes other than recompute them inside the i-th hash function. I suspect that the interface could be changed to be like the following template<class Input, class HashingPolicy, class Hasher> class bloom_filter; The Hashing policy must compute the K hash functions, using the value produced by the hash function described by the hasher. The Hashing policy can use the Hasher many times and can split the values as you describe. If we want to accomodate an HashingPolicy using M different hash functions, Hasher could be a fusion vector, possibly with less elements than K. PS: I don't know if this less elegant design could really determine any advantage, or even if it is buggy :-) Manuel Fiorelli

Manuel Fiorelli wrote:
There is no way to express that the i-th hash depends on the previous hashes other than recompute them inside the i-th hash function.
I've got the same impression.
I suspect that the interface could be changed to be like the following
template<class Input, class HashingPolicy, class Hasher> class bloom_filter;
The Hashing policy must compute the K hash functions, using the value produced by the hash function described by the hasher.
If we want to accomodate an HashingPolicy using M different hash functions, Hasher could be a fusion vector, possibly with less elements than K.
Seems sound, but it would be good to list the typical use cases and work out the best generic interface for them. So far we have got following basic cases: Independent case: A simple case with distinct and independent hash functions that can be evaluated separately: h_1(x), h_2(x), h_3(x), ... Dependent cases: B case when from one hash value (i.e. 64bit) we can get multiple hash values (i.e. 4*16bit): h(x) -> {h_1(x), h_2(x), h_3(x), ...} C hash_combine - one hash function and the combining method D double-hashing schema (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf) h_i(x) = h_1(x) + i*h_2(x) + optional In this schema one can provide: - two different hash functions (A) - one hash function and use approach (B) - use one hash function and do something along hash_combine (C) other, less popular (?) D partitioned bloom filters, combinations of above, other? E different types of bloom filters: - bloomier filters - compressed bloom filter - ...
PS: I don't know if this less elegant design could really determine any advantage, or even if it is buggy :-)
Neither do I, small brainstorming with use cases is likely to help... Regards, Milosz

2009/6/16 Milosz Marian HULBOJ <mhulboj@hulboj.org>:
Seems sound, but it would be good to list the typical use cases and work out the best generic interface for them. So far we have got following basic cases:
Independent case:
A simple case with distinct and independent hash functions that can be evaluated separately: h_1(x), h_2(x), h_3(x), ...
Dependent cases:
B case when from one hash value (i.e. 64bit) we can get multiple hash values (i.e. 4*16bit): h(x) -> {h_1(x), h_2(x), h_3(x), ...}
C hash_combine - one hash function and the combining method
D double-hashing schema (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf) h_i(x) = h_1(x) + i*h_2(x) + optional In this schema one can provide: - two different hash functions (A) - one hash function and use approach (B) - use one hash function and do something along hash_combine (C)
other, less popular (?)
Your listing of methods is really interesting.... Manuel Fiorelli

D partitioned bloom filters, combinations of above, other?
I think it would be also nice to have the mentioned previously partitioned cache-aware bloom filters. I haven't got time to try out their performance/efficiency, but the paper seems interesting: http://algo2.iti.uni-karlsruhe.de/singler/publications/cacheefficientbloomfi... Implementation doesn't seem problematic. There are plenty of possible perfomance-related quirks/variants in this relatively simple data-structure, as in most applications the sole reason for using bloom filters is the performance. Regards, Milosz
participants (5)
-
Dean Michael Berris
-
Jeffrey Bosboom
-
Joaquin M Lopez Munoz
-
Manuel Fiorelli
-
Milosz Marian HULBOJ