
Hi Alejandro,

Alejandro Cabrera wrote:
> I'm writing to request feedback, suggestions, and advice from the
> Boost community. I've been working on a Bloom filter data structure
> for about a month and a half now. At this point, the following is
> complete and available in the Boost Sandbox:
>
> - working, basic prototype (insert & query operations, union & intersect)
> - a test suite covering the public interface
> - five examples
> - documentation covering many aspects of the project (tested on
>   Chrome 12 and 13, Firefox 4 and 6)
At first glance, this looks very good. Well done. Bloom filters are essentially a quite simple concept, and I suggest you stick to that, i.e. resist any temptation to add lots of extra features.

Having said that, I think you should look at e.g. std::set<> and std::unordered_set<> and see what you can do to make your class closer to them. I can think of:

- empty()
- clear() (Ah, you do have this, but not documented?)
- a constructor that takes a range, e.g.

      int a[] = { 1, 2, 3, 4, 5 };
      bloom_filter<int> b(a, a+5);

      std::list<int> l;
      bloom_filter<int> c(l.begin(), l.end());

- insert(begin, end), similarly
- swap()
- You have a size() which is (confusingly?) different from those containers' size(); it's more like capacity(), I guess. (Similarly, I think I would rename count() to maybe bit_count().)

There is a trade-off between the size of the filter, the number of values stored, the number of hash functions and the false positive rate. You might consider adding some compile-time metafunctions to help the user determine the right size and number of hash functions, based on their expected number of values stored and acceptable false positive rate. Similarly, you could add a static expected_false_positive_rate() method that takes the number of stored values as a parameter. (See the sketch of the relevant formulas at the end of this message.) Sometimes it would be better for the size to be a run-time parameter.

Sometimes it would be useful to access the underlying data. For example, say I have a spell-check dictionary or a URL block-list; I might want to compile that list into a Bloom filter and ship it as a binary file. For that I'd want to be able to use it as an adaptor on top of an underlying container that I supply:

    // compile:
    std::vector<uint32_t> data(100000);
    bloom_filter b(data);   // acts as a container adaptor
    b.insert(...);
    file f;
    f.write(data.begin(), data.end());

    // use:
    memory_mapped_file f("dictionary.bloom");
    bloom_filter b(f);
    if (b.contains(url)) ....

I think it would be sufficient for false_positive_rate() to return a float. No-one cares about its 19th digit.

For some reason I have a mental block about how intersection works. Union is OK; I can see that

    set a; set b;
    bloom_filter(a) union bloom_filter(b) == bloom_filter( a union b );

but it's not clear to me whether

    bloom_filter(a) intersect bloom_filter(b) == bloom_filter( a intersect b );

Can you convince me?

During the XInt review there was some discussion about whether is_prime() should be called is_probably_prime(). The same argument could be made here about whether your contains() should be probably_contains(). If there were some six-letter word meaning "probably contains" then I would support using it, but "probably_contains" is a lot of typing.

Regarding hash functions: I believe I have read that some care is needed when you use essentially the same hash function but with different seeds, i.e. are the results really independent? It would be more obviously correct to have a set of hash functions that use entirely different algorithms. Unfortunately, this is hard.

Performance: it would be great to see some benchmarks for different types against std::set and std::unordered_set. In particular, it would be interesting to look at string keys of different lengths: both the Bloom filter and the unordered_set always have to look at every byte of the string to compute the hash, while a std::set need only look at the string's unique prefix. (A skeleton for that kind of benchmark is sketched below.) You mention things like web caches, and so I guess getting some real-world benchmark data would be simple enough.

One hashing option for strings is to hash substrings.
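To make the sizing and false-positive-rate suggestions above concrete, here is a minimal sketch of the standard formulas: for n expected elements and target rate p, the usual choices are m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions, and the expected rate for given m, k, n is (1 - e^(-kn/m))^k. The function names are hypothetical, not anything from the sandbox; a compile-time variant would wrap the same arithmetic in metafunctions.

    #include <cmath>
    #include <cstddef>
    #include <iostream>

    // Hypothetical sizing helpers -- a sketch of the textbook formulas,
    // not the sandbox interface.
    // n = expected number of stored values, p = acceptable false positive rate.

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    std::size_t optimal_bit_count(std::size_t n, double p)
    {
        const double ln2 = std::log(2.0);
        return static_cast<std::size_t>(
            std::ceil(-static_cast<double>(n) * std::log(p) / (ln2 * ln2)));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2
    std::size_t optimal_hash_count(std::size_t m, std::size_t n)
    {
        return static_cast<std::size_t>(
            std::ceil(static_cast<double>(m) / n * std::log(2.0)));
    }

    // Expected false positive rate for m bits, k hashes, n stored values:
    // p = (1 - e^(-k*n/m))^k
    double expected_false_positive_rate(std::size_t m, std::size_t k, std::size_t n)
    {
        return std::pow(1.0 - std::exp(-static_cast<double>(k) * n / m),
                        static_cast<double>(k));
    }

    int main()
    {
        std::size_t n = 1000000;   // e.g. a million URLs
        double p = 0.01;           // 1% acceptable false positives
        std::size_t m = optimal_bit_count(n, p);
        std::size_t k = optimal_hash_count(m, n);
        std::cout << m << " bits, " << k << " hashes, expected rate "
                  << expected_false_positive_rate(m, k, n) << "\n";
    }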
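On the union identity, here is a toy illustration of why OR-ing the bit arrays reproduces the filter of the union, provided both filters share the same size and hash functions. Everything here is made up for the example (a plain std::bitset and two arbitrary hash functions), not the sandbox class:

    #include <bitset>
    #include <cassert>
    #include <vector>

    const std::size_t M = 64;   // toy filter size

    // Two deliberately simple "hash functions" for the illustration.
    std::size_t h1(int x) { return (x * 2654435761u) % M; }
    std::size_t h2(int x) { return (x * 40503u + 17) % M; }

    std::bitset<M> make_filter(const std::vector<int>& values)
    {
        std::bitset<M> bits;
        for (std::size_t i = 0; i < values.size(); ++i) {
            bits.set(h1(values[i]));
            bits.set(h2(values[i]));
        }
        return bits;
    }

    int main()
    {
        std::vector<int> a; a.push_back(1); a.push_back(2); a.push_back(3);
        std::vector<int> b; b.push_back(3); b.push_back(4); b.push_back(5);

        std::vector<int> a_union_b(a);
        a_union_b.push_back(4); a_union_b.push_back(5);   // {1,2,3,4,5}

        // Bitwise OR of the two filters equals the filter of the union.
        assert((make_filter(a) | make_filter(b)) == make_filter(a_union_b));
    }

Whether the analogous assert holds with operator& and the intersection of a and b is exactly the question I'm asking above.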
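And for the benchmark point, a self-contained skeleton comparing std::set and std::unordered_set lookups on string keys (this assumes a C++11 compiler for <chrono> and <unordered_set>); the Bloom filter would be timed the same way as a third case, using whatever membership query the class exposes:

    #include <chrono>
    #include <iostream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Time membership lookups for a container of strings.
    template <class Container>
    void time_lookups(const char* label, const Container& c,
                      const std::vector<std::string>& keys)
    {
        using clock = std::chrono::steady_clock;
        auto start = clock::now();
        std::size_t found = 0;
        for (const auto& k : keys)
            found += c.count(k);   // count the hits so the loop isn't optimised away
        auto stop = clock::now();
        std::cout << label << ": "
                  << std::chrono::duration<double>(stop - start).count()
                  << " s (" << found << " hits)\n";
    }

    int main()
    {
        const std::size_t n = 100000;
        const std::size_t key_length = 64;   // vary this to see the effect

        // Keys differ in their first few characters, so std::set comparisons
        // can stop early, while hashing has to read the whole string.
        std::vector<std::string> keys;
        for (std::size_t i = 0; i != n; ++i) {
            std::ostringstream os;
            os << i << std::string(key_length, 'x');
            keys.push_back(os.str());
        }

        std::set<std::string> ordered(keys.begin(), keys.end());
        std::unordered_set<std::string> hashed(keys.begin(), keys.end());

        time_lookups("std::set          ", ordered, keys);
        time_lookups("std::unordered_set", hashed, keys);
    }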
Anyway, good work so far.

Regards,
Phil.