
Hi Alejandro,

Alejandro Cabrera wrote:
> I'm writing to request feedback, suggestions, and advice from the
> Boost community. I've been working on a Bloom filter data structure
> for about a month and a half now. At this point, the following is
> complete and available in the Boost Sandbox:
>
> - working, basic prototype (insert & query operations, union & intersect)
> - a test suite covering the public interface
> - five examples
> - documentation covering many aspects of the project (tested on
>   Chrome 12 and 13, Firefox 4 and 6)
At first glance, this looks very good. Well done. Bloom filters are essentially a quite simple concept, and I suggest you stick to that, i.e. resist any temptation to add lots of extra features.

Having said that, I think you should look at e.g. std::set<> and std::unordered_set<> and see what you can do to make your class closer to them. I can think of:

- empty()
- clear() (Ah, you do have this, but not documented?)
- a constructor that takes a range, e.g.

      int a[] = { 1, 2, 3, 4, 5 };
      bloom_filter<int> b(a, a+5);

      std::list<int> l;
      bloom_filter<int> c(l.begin(), l.end());

- insert(begin, end), similarly
- swap()
- You have a size() which is (confusingly?) different from those containers' size(); it's more like capacity(), I guess. (Similarly, I think I would rename count() to maybe bit_count().)

There is a trade-off between the size of the filter, the number of values stored, the number of hash functions and the false positive rate. You might consider adding some compile-time metafunctions to help the user determine the right size and number of hash functions, based on their expected number of values stored and acceptable false positive rate. Similarly, you could add a static expected_false_positive_rate() method that takes the number of stored values as a parameter. (See the sketch of the relevant formulas at the end of this message.) Sometimes it would be better for the size to be a run-time parameter.

Sometimes it would be useful to access the underlying data. For example, say I have a spell-check dictionary or a URL block-list; I might want to compile that list into a Bloom filter and ship it as a binary file. For that I'd want to be able to use it as an adaptor on top of an underlying container that I supply:

    // compile:
    std::vector<uint32_t> data(100000);
    bloom_filter b(data);   // acts as a container adaptor
    b.insert(...);
    file f;
    f.write(data.begin(), data.end());

    // use:
    memory_mapped_file f("dictionary.bloom");
    bloom_filter b(f);
    if (b.contains(url)) ....

I think it would be sufficient for false_positive_rate() to return a float. No-one cares about its 19th digit.

For some reason I have a mental block about how intersection works. Union is OK; I can see that

    set a; set b;
    bloom_filter(a) union bloom_filter(b) == bloom_filter( a union b );

but it's not clear to me whether

    bloom_filter(a) intersect bloom_filter(b) == bloom_filter( a intersect b );

Can you convince me?

During the XInt review there was some discussion about whether is_prime() should be called is_probably_prime(). The same argument could be made here about whether your contains() should be probably_contains(). If there were some six-letter word meaning "probably contains" then I would support using it, but "probably_contains" is a lot of typing.

Regarding hash functions: I believe I have read that some care is needed when you use essentially the same hash function but with different seeds, i.e. are the results really independent? It would be more obviously correct to have a set of hash functions that use entirely different algorithms. Unfortunately, this is hard.

Performance: it would be great to see some benchmarks for different types against std::set and std::unordered_set. In particular, it would be interesting to look at string keys of different lengths: both the Bloom filter and the unordered_set always have to look at every byte of the string to compute the hash, while a std::set need only look at the string's unique prefix. (A skeleton for that kind of benchmark is sketched below.) You mention things like web caches, and so I guess getting some real-world benchmark data would be simple enough.

One hashing option for strings is to hash substrings.
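To make the sizing and false-positive-rate suggestions above concrete, here is a minimal sketch of the standard formulas: for n expected elements and target rate p, the usual choices are m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions, and the expected rate for given m, k, n is (1 - e^(-kn/m))^k. The function names are hypothetical, not anything from the sandbox; a compile-time variant would wrap the same arithmetic in metafunctions.

    #include <cmath>
    #include <cstddef>
    #include <iostream>

    // Hypothetical sizing helpers -- a sketch of the textbook formulas,
    // not the sandbox interface.
    // n = expected number of stored values, p = acceptable false positive rate.

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    std::size_t optimal_bit_count(std::size_t n, double p)
    {
        const double ln2 = std::log(2.0);
        return static_cast<std::size_t>(
            std::ceil(-static_cast<double>(n) * std::log(p) / (ln2 * ln2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    std::size_t optimal_hash_count(std::size_t m, std::size_t n)
    {
        return static_cast<std::size_t>(
            std::ceil(static_cast<double>(m) / n * std::log(2.0)));
    }

    // Expected false positive rate for m bits, k hashes, n stored values:
    // p = (1 - e^(-k*n/m))^k
    double expected_false_positive_rate(std::size_t m, std::size_t k, std::size_t n)
    {
        return std::pow(1.0 - std::exp(-static_cast<double>(k) * n / m),
                        static_cast<double>(k));
    }

    int main()
    {
        std::size_t n = 1000000;   // e.g. a million URLs
        double p = 0.01;           // 1% acceptable false positives
        std::size_t m = optimal_bit_count(n, p);
        std::size_t k = optimal_hash_count(m, n);
        std::cout << m << " bits, " << k << " hashes, expected rate "
                  << expected_false_positive_rate(m, k, n) << "\n";
    }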
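On the union identity, here is a toy illustration of why OR-ing the bit arrays reproduces the filter of the union, provided both filters share the same size and hash functions. Everything here is made up for the example (a plain std::bitset and two arbitrary hash functions), not the sandbox class:

    #include <bitset>
    #include <cassert>
    #include <vector>

    const std::size_t M = 64;   // toy filter size

    // Two deliberately simple "hash functions" for the illustration.
    std::size_t h1(int x) { return (x * 2654435761u) % M; }
    std::size_t h2(int x) { return (x * 40503u + 17) % M; }

    std::bitset<M> make_filter(const std::vector<int>& values)
    {
        std::bitset<M> bits;
        for (std::size_t i = 0; i < values.size(); ++i) {
            bits.set(h1(values[i]));
            bits.set(h2(values[i]));
        }
        return bits;
    }

    int main()
    {
        std::vector<int> a; a.push_back(1); a.push_back(2); a.push_back(3);
        std::vector<int> b; b.push_back(3); b.push_back(4); b.push_back(5);

        std::vector<int> a_union_b(a);
        a_union_b.push_back(4); a_union_b.push_back(5);   // {1,2,3,4,5}

        // Bitwise OR of the two filters equals the filter of the union.
        assert((make_filter(a) | make_filter(b)) == make_filter(a_union_b));
    }

Whether the analogous assert holds with operator& and the intersection of a and b is exactly the question I'm asking above.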
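And for the benchmark point, a self-contained skeleton comparing std::set and std::unordered_set lookups on string keys (this assumes a C++11 compiler for <chrono> and <unordered_set>); the Bloom filter would be timed the same way as a third case, using whatever membership query the class exposes:

    #include <chrono>
    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Time membership lookups for a container of strings.
    template <class Container>
    void time_lookups(const char* label, const Container& c,
                      const std::vector<std::string>& keys)
    {
        using clock = std::chrono::steady_clock;
        auto start = clock::now();
        std::size_t found = 0;
        for (const auto& k : keys)
            found += c.count(k);   // count the hits so the loop isn't optimised away
        auto stop = clock::now();
        std::cout << label << ": "
                  << std::chrono::duration<double>(stop - start).count()
                  << " s (" << found << " hits)\n";
    }

    int main()
    {
        const std::size_t n = 100000;
        const std::size_t key_length = 64;   // vary this to see the effect

        // Keys differ in their first few characters, so std::set comparisons
        // can stop early, while hashing has to read the whole string.
        std::vector<std::string> keys;
        for (std::size_t i = 0; i != n; ++i) {
            std::ostringstream os;
            os << i << std::string(key_length, 'x');
            keys.push_back(os.str());
        }

        std::set<std::string> ordered(keys.begin(), keys.end());
        std::unordered_set<std::string> hashed(keys.begin(), keys.end());

        time_lookups("std::set          ", ordered, keys);
        time_lookups("std::unordered_set", hashed, keys);
    }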
Anyway, good work so far.

Regards,
Phil.