[GSoC] Request for Feedback on Boost.Bloom Filter Project

Hello all,

I'm writing to request feedback, suggestions, and advice from the Boost community. I've been working on a Bloom filter data structure for about a month and a half now. At this point, the following is complete and available in the Boost Sandbox:

- a working, basic prototype (insert & query operations, union & intersect)
- a test suite covering the public interface
- five examples
- documentation covering many aspects of the project (tested on Chrome 12 and 13, Firefox 4 and 6)

Before I proceed to the next phase, I need feedback. Specifically, I need comments on:

- code quality: is there anything that stands out about my design, naming convention, or implementation that could use improvement?
- documentation quality: was it informative, clear, and easy to understand? Is there anything missing?

At this point, there are many directions the Bloom filter project can go, and I need to know which direction would best serve the interests of the community. For this, I would be very interested in hearing about:

- desired usage: do you have a project that could use a Bloom filter? What kind of Bloom filter would you need?
- compression policy: would you be interested in Bloom filters that could be serialized to a compressed form and deserialized back into a decompressed form?
- variants: there are a few planned variants listed in the design page of the documentation (future directions). Do any of these seem particularly useful to you?

To access the source, visit the sandbox here: http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/

To access the documentation, visit the sandbox here: http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/libs/bloom_filter/...

All portions of the project assume the canonical Boost release model. Namely:

    $(BOOST_ROOT)/boost/bloom_filter
    $(BOOST_ROOT)/libs/bloom_filter

Thanks!
-Alejandro Cabrera

On 6/21/2011 12:21 AM, Alejandro Cabrera wrote:
Hello all,
I'm writing to request feedback, suggestions, and advice from the Boost community. I've been working on a Bloom filter data structure for about a month and a half now. At this point, the following is complete and available in the Boost Sandbox:
- a working, basic prototype (insert & query operations, union & intersect)
- a test suite covering the public interface
- five examples
- documentation covering many aspects of the project (tested on Chrome 12 and 13, Firefox 4 and 6)
I would suggest to you that you briefly explain what a Bloom filter entails, and why it might be important, when sending your message to this list.

Edward Diener-3 wrote:
On 6/21/2011 12:21 AM, Alejandro Cabrera wrote:
Hello all,
I'm writing to request feedback, suggestions, and advice from the Boost community. I've been working on a Bloom filter data structure for about a month and a half now. At this point, the following is complete and available in the Boost Sandbox:
- a working, basic prototype (insert & query operations, union & intersect)
- a test suite covering the public interface
- five examples
- documentation covering many aspects of the project (tested on Chrome 12 and 13, Firefox 4 and 6)
I would suggest to you that you briefly explain what a Bloom filter entails, and why it might be important, when sending your message to this list.
Hi,

maybe this link can help: http://en.wikipedia.org/wiki/Bloom_filter

Best,
Vicente

-- View this message in context: http://boost.2283326.n4.nabble.com/GSoC-Request-for-Feedback-on-Boost-Bloom-... Sent from the Boost - Dev mailing list archive at Nabble.com.

Having bloom_filter in boost:: is a great idea! It is super useful; we use one in production code (our own implementation; actually we use Bloomier filters with murmur-hash perfect hashing from your wiki reference). It lets us use a very memory- and lookup-efficient data structure for very large datasets (think a 100 GB file of strings on an SSD disk indexed by e.g. a 200 MB Bloomier filter in memory).

Have you considered adding serialization of a bloom_filter to your implementation? In general, reconstructing hash-based containers with a series of inserts is pretty inefficient. The use case I'm talking about: e.g. in your web proxy scenario, the proxy service keeps running, downloading and caching URLs and adding them to the bloom filter. Then the process needs to be restarted for some reason. All documents downloaded and stored on disk will have to be reiterated and their URLs reinserted into a newly created bloom_filter, which makes the startup of the process slow.

btw, I have the same problem with the standard containers (except std::vector). There is no efficient serialization/deserialization for them, rendering them useless for any larger side project (like an unordered_set of 1M strings).

Hi Alejandro, Alejandro Cabrera wrote:
I'm writing to request feedback, suggestions, and advice from the Boost community. I've been working on a Bloom filter data structure for about a month and a half now. At this point, the following is complete and available in the Boost Sandbox:
- a working, basic prototype (insert & query operations, union & intersect)
- a test suite covering the public interface
- five examples
- documentation covering many aspects of the project (tested on Chrome 12 and 13, Firefox 4 and 6)
At first glance, this looks very good. Well done.

Bloom filters are essentially a quite simple concept, and I suggest that you should stick to that, i.e. resist any temptation to add lots of extra features. Having said that, I think you should look at e.g. std::set<> and std::unordered_set<> and see what you can do to make your class closer to them. I can think of:

- empty()
- clear() (Ah, you do have this, but not documented?)
- a constructor that takes a range, e.g.

      int a[] = { 1, 2, 3, 4, 5 };
      bloom_filter<int> b(a, a+5);
      std::list<int> l;
      bloom_filter<int> c(l.begin(), l.end());

- insert(begin, end) similarly
- swap()
- You have a size() which is (confusingly?) different from those containers' size(); it's more like capacity(), I guess. (Similarly, I think I would rename count() to maybe bit_count().)

There is a trade-off between the size of the filter, the number of values stored, the number of hash functions and the false positive rate. You might consider adding some sort of compile-time metafunctions to help the user determine the right size and number of hash functions, based on their expected number of values stored and acceptable false positive rate. Similarly, you could add a static expected_false_positive_rate() method that takes the number of stored values as a parameter. Sometimes it would be better for size to be a run-time parameter.

Sometimes it would be useful to access the underlying data. For example, say I have a spell-check dictionary or a URL block-list; I might want to compile that list into a bloom filter and ship that as a binary file. For that I'd want to be able to use it as an adaptor on top of an underlying container that I supply:

    // compile:
    std::vector<uint32_t> data(100000);
    bloom_filter b(data);  // acts as a container adaptor
    b.insert(...);
    file f;
    f.write(data.begin(), data.end());

    // use:
    memory_mapped_file f("dictionary.bloom");
    bloom_filter b(f);
    if (b.contains(url)) ....
I think it would be sufficient for false_positive_rate() to return a float. No-one cares about its 19th digit.

For some reason I have a mental block about how intersection works. Union is OK; I can see that

    set a; set b;
    bloom_filter(a) union bloom_filter(b) == bloom_filter( a union b );

but it's not clear to me if

    bloom_filter(a) intersect bloom_filter(b) == bloom_filter( a intersect b );

Can you convince me?

During the XInt review there was some discussion about whether is_prime() should be called is_probably_prime(). The same argument could be made here about whether your contains() should be probably_contains(). If there were some six-letter word meaning "probably contains" then I would support using it, but "probably_contains" is a lot of typing.

Regarding hash functions: I believe I have read that some care is needed when you use essentially the same hash function but with different seeds, i.e. are the results really independent? It would be more obviously correct to have a set of hash functions that use entirely different algorithms. Unfortunately this is hard.

Performance: it would be great to see some benchmarks for different types against std::set and std::unordered_set. In particular, it would be interesting to look at string keys of different lengths: both the bloom filter and the unordered_set always have to look at every byte of the string to compute the hash, while a std::set need only look at the string's unique prefix. You mention things like web caches, and so I guess getting some real-world benchmark data would be simple enough. One hashing option for strings is to hash substrings.

Anyway, good work so far.

Regards, Phil.

Phil Endecott wrote:
For some reason I have a mental block about how intersection works. Union is OK; I can see that
set a; set b; bloom_filter(a) union bloom_filter(b) == bloom_filter( a union b );
but it's not clear to me if
bloom_filter(a) intersect bloom_filter(b) == bloom_filter( a intersect b );
Can you convince me?
That is correct, afaik only union (or) and difference (xor) can be applied to non-counting BFs.

Arash Partow wrote:
Phil Endecott wrote:
For some reason I have a mental block about how intersection works. Union is OK; I can see that
set a; set b; bloom_filter(a) union bloom_filter(b) == bloom_filter( a union b );
but it's not clear to me if
bloom_filter(a) intersect bloom_filter(b) == bloom_filter( a intersect b );
Can you convince me?
That is correct, afaik only union (or) and difference (xor) can be applied to non-counting BFs.
Sorry Arash, but I don't understand your reply; what exactly is correct?

I believe I have convinced myself that a bitwise AND on bloom filters gives something that is a superset of the intersection of the contents, i.e. the false positive rate is increased. A proof would be good.

Regarding xor, it superficially looks like it might give the symmetric set difference, i.e. those elements in A or B but not both, but when you consider overlapping hashes I think it results in false negatives.

Regards, Phil.

Phil Endecott wrote:
Sorry Arash but I don't understand your reply; what exactly is correct?
My point was that you are correct: the "and" does not provide anything of value. Only "or" and "xor" are practical for non-counting bloom filters.
I believe I have convinced myself that a bitwise AND on bloom filters gives something that is a superset of the intersection of the contents, i.e. the false positive rate is increased. A proof would be good. Regarding xor, it superficially looks like it might give the symetric set difference i.e. those elements in A or B but not both, but when you consider overlapping hashes I think it results in false negatives.
You're correct there as well; I should have qualified the difference type, as it is indeed a symmetric difference. That said, I'm not sure what you mean by overlapping hashes. The only rule for such operations is that the BFs be constructed similarly, that is:

1. same types of hash functions
2. same number of hash functions
3. same filter sizes
4. same hash value quantization value

Otherwise such set operations simply won't work and become meaningless.

On 6/25/2011 1:53 PM, Phil Endecott wrote:
I believe I have convinced myself that a bitwise AND on bloom filters gives something that is a superset of the intersection of the contents, i.e. the false positive rate is increased. A proof would be good.
Easily done:

Take two Bloom filters, A and B, with the same hash functions (h of them) and the same size vectors (N). If the number of bits set in one of them is s, then when h << N and s >> 1 the probability of a false positive is approximately (s/N)^h, so the false positive rate increases with the proportion of the bits that are set.

The bits corresponding to any element that appears in both sets will be set in both vectors, and will therefore be set in the bit-wise conjunction. That same element will therefore appear as a positive in the conjunction, so the conjunction at least includes the intersection.

For any bit that is not among the bits that would be set in the properly constructed Bloom filter for (A intersect B), there is some probability that it is set by one of the elements in A's set that is not in B's /and/ by one of the elements in B's set that is not in A's, so it will appear in the conjunction's bits. This means that the s for the conjunction will be larger than the s for the intersection, so the false positive rate will be higher for the conjunction than for the intersection.

The probability is fairly high for any particular bit not in the intersection to be set in the conjunction if the intersection is small relative to the number of elements in A and the number of elements in B. If the hash functions are each applied to the entire N bits, then there will be h*|A-B| opportunities for a bit to be set from an element in set A that is not in B, and h*|B-A| opportunities for it to be set from an element in set B that is not in A. If each hash is applied to a distinct subset of N/h bits from the total, then the probability is smaller but still not insignificant: |A-B| opportunities for elements in A but not in B, times |B-A| opportunities for elements in B but not A.

Topher Cooper

Alejandro Cabrera wrote:
I'm writing to request feedback, suggestions, and advice from the Boost community. I've been working on a Bloom filter data structure for about a month and a half now. At this point, the following is complete and available in the Boost Sandbox:
- a working, basic prototype (insert & query operations, union & intersect)
- a test suite covering the public interface
- five examples
- documentation covering many aspects of the project (tested on Chrome 12 and 13, Firefox 4 and 6)
Looks OK, but one important question: why is the BF typed? It's not necessary, and in fact there are many use-cases where one might want to insert and/or test membership for a range of different types using the same BF; all that those types require is that they be hashable.

Another issue, along the lines of what Phil mentioned wrt naming of the "contains" method. It should really be called "contains" for two reasons: firstly, it's a transitive verb, a "doing" label, which is indicative of how it would be used in code; and secondly, you've provided the method false_positive_rate which, coupled with the fact that a BF is a probabilistic set, indicates to the user that any result will inherently have a false positive probability associated with it, which is naively proportional to a function of the bits in the BF and the number of elements inserted into the BF (and not necessarily the element's bit-length).

All in all a good start - keep up the good work!

Arash Partow wrote:
Alejandro Cabrera wrote:
I'm writing to request feedback, suggestions, and advice from the Boost community. I've been working on a Bloom filter data structure for about a month and a half now. At this point, the following is complete and available in the Boost Sandbox:
- a working, basic prototype (insert & query operations, union & intersect)
- a test suite covering the public interface
- five examples
- documentation covering many aspects of the project (tested on Chrome 12 and 13, Firefox 4 and 6)
Looks OK, but one important question: why is the BF typed? It's not necessary, and in fact there are many use-cases where one might want to insert and/or test membership for a range of different types using the same BF; all that those types require is that they be hashable.
Though not necessary, I believe having a typed BF is safer. If there are users that want to eschew type safety to allow operations on multiple types, I believe this can be accomplished by using a Bloom filter that accepts insertions of type void *.

Arash Partow wrote:
Another issue, along the lines of what Phil mentioned wrt naming of the "contains" method. It should really be called "contains" for two reasons, firstly its a transitive verb - "a doing" label which is indicative of how it would be used in code, and secondly you've provided the method false_positive_rate which coupled with the fact that a BF is a probabilistic set, indicates to the user that any result will inherently have a false positive probability assigned with it, which is naively proportional to a function of the bits in the BF and the number of elements inserted into the BF (and not necessarily the element's bit-length).
I agree that the transitive verb property is crucial. It communicates the intent of the operation. I've opted to go with the name "probably_contains" for the time being. Even though it is slightly longer to type, it highlights its probabilistic nature. I want there to be minimal confusion for users new to Bloom filters about the nature of the data structure. Arash Partow wrote:
All in all a good start - keep up the good work!
Thank you for the feedback!

Alejandro Cabrera wrote:
Arash Partow wrote:
Looks ok, but one important question - Why is the BF typed? Its not necessary and in fact there are many use-cases where one might want to insert and/or test membership for a range of different types using the same BF - all that those types require are that they be hashable.
Though not necessary, I believe having a typed BF is safer. If there are users that want to eschew type safety to allow operations on multiple types, I believe this can be accomplished by using a Bloom filter that accepts insertions of type void *.
Hmm, it's not clear to me how exactly you would do that with void*. I think Arash's suggestion is better, and you can then build strongly-typed filters on top of the "any" version like this:

    class bloom_filter_any {
      template <typename T>
      void insert(const T& val) { .... }
      ....
    };

    template <typename T>
    class bloom_filter: bloom_filter_any {
      void insert(const T& val) { bloom_filter_any::insert(val); }
      ....
    };

Regards, Phil.

Hi, Phil Endecott-48 wrote:
Alejandro Cabrera write:
Arash Partow wrote:
Looks ok, but one important question - Why is the BF typed? Its not necessary and in fact there are many use-cases where one might want to insert and/or test membership for a range of different types using the same BF - all that those types require are that they be hashable.
Could you give a concrete example of insertion of different types? any type?
Though not necessary, I believe having a typed BF is safer. If there are users that want to eschew type safety to allow operations on multiple types, I believe this can be accomplished either by using a Bloom filter that accepts insertions of type void *.
Hmm, it's not clear to me how exactly you would do that with void*. I think Arash's suggestion is better, and you can then build strongly-typed filters on top of the "any" version like this:
class bloom_filter_any { template <typename T> void insert(const T& val) { .... } .... };
Could you elaborate on how the hash functions will be applied to any T? Should this imply dynamic polymorphism?

Best,
Vicente

Vicente Botet wrote:
Phil Endecott-48 wrote:
Alejandro Cabrera write:
Arash Partow wrote:
Looks ok, but one important question - Why is the BF typed? Its not necessary and in fact there are many use-cases where one might want to insert and/or test membership for a range of different types using the same BF - all that those types require are that they be hashable.
Could you give a concrete example of insertion of different types? any type?
You mean a practical application? No.
Though not necessary, I believe having a typed BF is safer. If there are users that want to eschew type safety to allow operations on multiple types, I believe this can be accomplished either by using a Bloom filter that accepts insertions of type void *.
Hmm, it's not clear to me how exactly you would do that with void*. I think Arash's suggestion is better, and you can then build strongly-typed filters on top of the "any" version like this:
class bloom_filter_any { template <typename T> void insert(const T& val) { .... } .... };
Could you elaborate on how the hash functions will be applied to any T? Should this imply dynamic polymorphism?
No, it doesn't require dynamic polymorphism, just compile-time polymorphism. The filter's hash functions must accept the types that are actually used in calls to insert() etc. Regards, Phil.

Vicente Botet wrote:
Could you give a concrete example of insertion of different types? any type?
A valid use-case could be that you have a struct called ipv4 and another called ipv6 which represent IP addresses (though not necessarily inheriting from a common base). Now it is true that they are nothing more than arrays of chars (or a string), but let's assume we'd like to treat them as the objects they are without knowing their underlying detail. Following on from this, I'd like to be able to do something like the following:

    bloom_filter<AHashFunc> bf(.....);
    ipv4 i4(....);
    ipv6 i6(....);
    bf.insert(i4);
    bf.insert(i6);

where the only requirement is that the types ipv4 and ipv6 have the correct hash function specialization in the namespace std::tr1.

Something to ponder over: when inserting a boost::any, is it the any-type as a whole, the underlying type, or the serialized type that will be inserted into the BF? What about a Person class? Would it be the Person type as a whole, or the serialized form of every field in the Person type that will be inserted? I suppose it all depends on the way the hash input is generated.

Arash, Arash Partow wrote:
Vicente Botet wrote:
Could you give a concrete example of insertion of different types? any type?
A valid use-case could be that you have a struct called ipv4 and another called ipv6 which represent ip addresses (though not necessarily inheriting from a common base). Now it is true that they are nothing more than arrays of chars (or a string), but lets assume we'd like to treat them as the objects they are without knowing their underlying detail. Following on from this I'd like to be able to do something like the following:
bloom_filter<AHashFunc> bf(.....);
ipv4 i4(....); ipv6 i6(....);
bf.insert(i4); bf.insert(i6);
Where the only requirement is that the types ipv4 and ipv6 have the correct hash function specialization in the namespace std::tr1
I can see the convenience of this approach. Why would you prefer to use one Bloom filter to store two related types rather than two Bloom filters? For example:
    bloom_filter<ip4, 10000> ip4_bloom;
    bloom_filter<ip6, 10000> ip6_bloom;

    ip4 x4(...);
    ip6 x6(...);

    if ( /* is ip4, perhaps using type-traits or some form of polymorphism */ )
      ip4_bloom.insert(x4);
    else if ( /* is ip6 */ )
      ip6_bloom.insert(x6);
    else
      // error;

    //...

    if (!ip4_bloom.contains(...))
      // load resource from high-latency storage
    else
      // fetch from cache
This is how I would typically expect the current Bloom filter to be used in multi-type scenarios. Arash Partow wrote:
Something to ponder over, when inserting a boost:any is it the any-type as a whole, the underlying type or the serialized type that will be inserted into the BF? What about a Person class? would it be the Person type as a whole, or the serialized form of every field in the Person type that will be inserted? - I supposed it all depends on the way the hash input is generated.
This really depends on the implementation of the Hasher (hash function callable). A Hasher could target a specific field of an object to generate its fingerprint. Alternatively, the Hasher could treat an object as an array of bytes and generate the fingerprint in that fashion. I've left this aspect of the design very open to customization.

-Alej

On 21/06/2011 2:21 PM, Alejandro Cabrera wrote:
Hello all,
I'm writing to request feedback, suggestions, and advice from the Boost community. I've been working on a Bloom filter data structure for about a month and a half now. At this point, the following is complete and available in the Boost Sandbox: - working, basic prototype (insert & query operations, union & intersect) - a test suite covering the public interface - five examples - documentation covering many aspects of the project (tested on Chrome 12 and 13, Firefox 4 and 6)
One more suggestion: have you considered introducing a compression function/method? Compression wrt a BF is really just super-imposing (OR'ing) the first half of the BF with the second half and halving the quantization factor for the hash values; this operation doubles the effective false positive probability.

This technique is typically used in DHT scenarios, where the complete BF is stored at a particular node, and a compressed version is sent to the direct neighbors of that node, which then compress the received BF again and send that on to their direct neighbors, etc., until such time as the BF reaches a certain size or the EFPP is above a certain threshold. (The BF is coupled with information regarding routing to the originating node.)
participants (7)

- Alejandro Cabrera
- Arash Partow
- Edward Diener
- jakub szymanski
- Phil Endecott
- Topher Cooper
- Vicente Botet