Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project

Hello Phil,

Phil Endecott wrote:
> Hi Alejandro,
>
> Alejandro Cabrera wrote:
>> I'm writing to request feedback, suggestions, and advice from the
>> Boost community. I've been working on a Bloom filter data structure
>> for about a month and a half now. At this point, the following is
>> complete and available in the Boost Sandbox:
>> - a working, basic prototype (insert & query operations, union &
>>   intersect)
>> - a test suite covering the public interface
>> - five examples
>> - documentation covering many aspects of the project (tested on
>>   Chrome 12 and 13, Firefox 4 and 6)
>
> At first glance, this looks very good. Well done.
>
> Bloom filters are essentially quite a simple concept, and I suggest
> that you stick to that, i.e. resist any temptation to add lots of
> extra features. Having said that, I think you should look at e.g.
> std::set<> and std::unordered_set<> and see what you can do to make
> your class closer to them. I can think of:
>
> - empty()
> - clear() (Ah, you do have this, but not documented?)

I accidentally skipped documenting this. Thank you for pointing it
out.

> - a constructor that takes a range, e.g.
>     int a[] = { 1,2,3,4,5 };
>     bloom_filter<int> b(a, a+5);
>     std::list<int> l;
>     bloom_filter<int> c(l.begin(), l.end());
> - insert(begin, end), similarly

I hadn't considered the various ways to construct the Bloom filter. I
assumed users would always want to start with an empty Bloom filter,
but this would make using the Bloom filter more flexible.
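Both could be layered on the existing single-element insert; a minimal
sketch (illustrative only, not the sandbox code):

    #include <cstddef>

    template <typename T, std::size_t Size>
    class bloom_filter {
    public:
        bloom_filter() {}  // default: start empty

        // Range construction: semantically an insert of each element
        // in [first, last).
        template <typename InputIterator>
        bloom_filter(InputIterator first, InputIterator last) {
            insert(first, last);
        }

        template <typename InputIterator>
        void insert(InputIterator first, InputIterator last) {
            for (; first != last; ++first)
                insert(*first);
        }

        void insert(const T& t);  // existing single-element insert
        // ... query operations, union/intersect, etc. ...
    };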
> - swap()

I overlooked swap. I'll add this.

> - You have a size() which is (confusingly?) different from those
>   containers' size(); it's more like capacity(), I guess. (Similarly,
>   I think I would rename count() to maybe bit_count().)

I agree that using size() as I do is confusing compared with what
size() has come to mean for the standard containers. It would be
better named something like bit_count(). I'll reconsider the name.

> There is a trade-off between the size of the filter, the number of
> values stored, the number of hash functions, and the false positive
> rate. You might consider adding some sort of compile-time
> metafunctions to help the user determine the right size and number
> of hash functions, based on their expected number of values stored
> and acceptable false positive rate. Similarly, you could add a
> static expected_false_positive_rate() method that takes the number
> of stored values as a parameter.

Given a desired false positive rate, an expected number of elements,
and an upper limit on the storage requirement, it is possible to
determine both a good Size value and a good number of hash functions.
I see now how metafunctions can help relieve the user of the burden
of picking good values. This might take a bit of work to provide as a
feature, since it would involve ratios to represent the false positive
rate as a template parameter to the metafunction. I'll put this
feature off until version 1.1 or 1.2. Until then, users will have to
determine the best parameters for their application from the
documentation guidelines. On that note, I think I should add a section
to the tutorial detailing how to choose good parameters.

> Sometimes it would be better for size to be a run-time parameter.

I have a dynamic Bloom filter in the queue of variants. It's
definitely required if I were to develop a scalable Bloom filter that
guarantees a certain false positive rate, as detailed in a recent
research paper titled "Scalable Bloom Filters".
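Returning to the expected_false_positive_rate() idea: the arithmetic
behind such a helper is small. A sketch, using the standard
approximation p = (1 - e^(-kn/m))^k for m bits, k hash functions, and
n stored elements (the name and signature are illustrative):

    #include <cmath>
    #include <cstddef>

    // Approximate false positive rate of a Bloom filter with
    // bit_count bits (m) and hash_count hash functions (k) after
    // element_count insertions (n): p = (1 - e^(-k*n/m))^k.
    // Assumes bit_count > 0.
    double expected_false_positive_rate(std::size_t bit_count,
                                        std::size_t hash_count,
                                        std::size_t element_count)
    {
        const double k = static_cast<double>(hash_count);
        const double n = static_cast<double>(element_count);
        const double m = static_cast<double>(bit_count);
        return std::pow(1.0 - std::exp(-k * n / m), k);
    }

A compile-time metafunction would compute the same quantity from
integral template parameters; the target rate could be expressed as a
ratio (numerator and denominator) to stay within integral template
arguments.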
> Sometimes it would be useful to access the underlying data. For
> example, say I have a spell-check dictionary or a URL block-list; I
> might want to compile that list into a Bloom filter and ship that as
> a binary file. For that I'd want to be able to use it as an adaptor
> on top of an underlying container that I supply:
>
>     // compile:
>     std::vector<uint32_t> data(100000);
>     bloom_filter b(data); // acts as a container adaptor
>     b.insert(...);
>     file f;
>     f.write(data.begin(), data.end());
>
>     // use:
>     memory_mapped_file f("dictionary.bloom");
>     bloom_filter b(f);
>     if (b.contains(url)) ....

I'm not sure how this would work. Could you elaborate further in this
regard? As I currently understand it, there are two components of
interest. The first is being able to access the underlying Bloom
filter data for serialization/deserialization purposes. This should be
easy enough given a function that exposes the Bloom filter data. The
other component I'm picking up on is being able to construct a Bloom
filter from various sources. Range construction will be added that
semantically behaves as an insertion of each element in the range.
However, I'm not sure how to support construction from a
memory_mapped_file.
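For the serialization half, a read-only accessor over the underlying
bits may be all that's needed. A sketch of the idea (the data() member
and std::bitset storage are assumptions for illustration, not the
current sandbox interface):

    #include <bitset>
    #include <cstddef>
    #include <ostream>

    template <typename T, std::size_t Size>
    class bloom_filter {
    public:
        typedef std::bitset<Size> bitset_type;

        // Hypothetical read-only view of the underlying bit array.
        const bitset_type& data() const { return bits_; }

        // ... insert(), query operations, etc. ...
    private:
        bitset_type bits_;
    };

    // Caller-side serialization: pack the bits into bytes and write
    // them to a stream.
    template <typename T, std::size_t Size>
    void save(const bloom_filter<T, Size>& f, std::ostream& os)
    {
        const typename bloom_filter<T, Size>::bitset_type& bits = f.data();
        for (std::size_t i = 0; i < Size; i += 8) {
            unsigned char byte = 0;
            for (std::size_t b = 0; b < 8 && i + b < Size; ++b)
                byte |= static_cast<unsigned char>(bits[i + b] ? 1 : 0) << b;
            os.put(static_cast<char>(byte));
        }
    }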
> I think it would be sufficient for false_positive_rate() to return a
> float. No-one cares about its 19th digit.

I preferred greater precision to anticipate user needs. Unless
false_positive_rate() were called frequently, I wouldn't expect the
added precision to incur a significant performance penalty.

> For some reason I have a mental block about how intersection works.
> Union is OK; I can see that
>
>     set a; set b;
>     bloom_filter(a) union bloom_filter(b) == bloom_filter(a union b);
>
> but it's not clear to me if
>
>     bloom_filter(a) intersect bloom_filter(b) == bloom_filter(a intersect b);
>
> Can you convince me?

Intersect is intended to leave on only the bits that are set in both
Bloom filters, which correspond to the elements that both filters have
inserted. This works because intersection is only allowed between
Bloom filters that use the same set of hash functions and the same
storage size. Assume that both Bloom filters are instantiated with a
hash function that returns the integer value of the element hashed.
Then the following occurs:

    bloom_filter<int, 4> a;
    bloom_filter<int, 4> b;
    bloom_filter<int, 4> c;
    a.insert(1); a.insert(2); // 0110
    b.insert(2); b.insert(3); // 0011
    c = a & b;                // 0010

At this point, only c.contains(2) will return true.

> During the XInt review there was some discussion about whether
> is_prime() should be called is_probably_prime(). The same argument
> could be made here about whether your contains() should be
> probably_contains(). If there were some six-letter word meaning
> "probably contains" then I would support using it, but
> "probably_contains" is a lot of typing.

The wording for contains() is another tricky issue.
probably_contains() has the benefit of pointing out the probabilistic
nature of the data structure, which is very helpful for warning users.
I don't think typing nine more characters (probably_) should be a
problem, since it isn't the kind of function that can nest on itself
(x.contains(t).contains...). I will tentatively rename the function
probably_contains(), and if further comments indicate that this adds
some level of confusion, I'll reconsider the name again.

> Regarding hash functions: I believe I have read that some care is
> needed when you use essentially the same hash function but with
> different seeds, i.e. are the results really independent? It would
> be more obviously correct to have a set of hash functions that use
> entirely different algorithms. Unfortunately this is hard.

I've left the hashing component open to extension for exactly this
reason. The Bloom filter data structure does rely greatly on the use
of good hash functions, but the definition of "good" depends on the
context. A CRC32 might be perfect for some applications, a MurmurHash3
better for others, and some applications might even require a
cryptographically secure hash. Since the intent of the project is not
to develop great hash functions, I'm inclined to keep boost::hash_value
as the default, though it might be best to default to an mpl::vector
containing a single hash function instead of two since, as you stated,
the results aren't independent.
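To illustrate the single-hasher default: a hypothetical sketch built
on boost::hash_combine (the seeded wrapper and the mpl::vector
convention here are illustrative, not the current sandbox API):

    #include <cstddef>
    #include <boost/functional/hash.hpp>
    #include <boost/mpl/vector.hpp>

    // A seeded wrapper around boost::hash_value: hash_combine mixes
    // the seed with hash_value(t), so different Seed values yield
    // different (but not truly independent) hash functions.
    template <typename T, std::size_t Seed>
    struct seeded_boost_hash {
        std::size_t operator()(const T& t) const {
            std::size_t seed = Seed;
            boost::hash_combine(seed, t);
            return seed;
        }
    };

    // A default hasher set containing a single hash function:
    template <typename T>
    struct default_hashers {
        typedef boost::mpl::vector< seeded_boost_hash<T, 0> > type;
    };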
> Performance: it would be great to see some benchmarks for different
> types against std::set and std::unordered_set. In particular, it
> would be interesting to look at string keys of different lengths:
> both the Bloom filter and the unordered_set always have to look at
> every byte of the string to compute the hash, while a std::set need
> only look at the string's unique prefix. You mention things like web
> caches, and so I guess getting some real-world benchmark data would
> be simple enough. One hashing option for strings is to hash
> substrings.

I suspect that a benchmark on at least strings would be useful to
demonstrate how the bloom_filter compares against std::set and
std::unordered_set, assuming that most users will be hashing data
(file names, URLs, etc.). I'll add benchmarks to my queue of things to
do, but I want to make sure the interface is close to stable before I
commit to writing benchmarks.
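A query-timing harness along these lines could be a starting point
(sketch only; the Bloom filter would plug in via a thin wrapper, since
its query name is still being settled above):

    #include <ctime>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>
    #include <boost/unordered_set.hpp>

    // Times `repetitions` rounds of membership queries against any
    // container exposing count(key). Sketch only.
    template <typename Container>
    double time_queries(const Container& c,
                        const std::vector<std::string>& keys,
                        int repetitions)
    {
        std::clock_t start = std::clock();
        std::size_t hits = 0;
        for (int r = 0; r < repetitions; ++r)
            for (std::size_t i = 0; i < keys.size(); ++i)
                hits += c.count(keys[i]);
        double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::cerr << "hits: " << hits << '\n';  // defeat dead-code elimination
        return seconds;
    }

    int main()
    {
        std::vector<std::string> keys;
        keys.push_back("http://a.example/");
        keys.push_back("http://b.example/");
        // ... a real benchmark would load thousands of URLs here ...

        std::set<std::string> s(keys.begin(), keys.end());
        boost::unordered_set<std::string> u(keys.begin(), keys.end());

        std::cout << "std::set:             "
                  << time_queries(s, keys, 1000) << " s\n";
        std::cout << "boost::unordered_set: "
                  << time_queries(u, keys, 1000) << " s\n";
        // A bloom_filter could be timed the same way via a small
        // wrapper whose count(k) returns probably_contains(k) ? 1 : 0.
        return 0;
    }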
> Anyway, good work so far.
>
> Regards, Phil.

Thank you for taking the time to comment so thoroughly on the Bloom
filter project!

Alejandro Cabrera wrote:
>> Sometimes it would be useful to access the underlying data. For
>> example, say I have a spell-check dictionary or a URL block-list; I
>> might want to compile that list into a Bloom filter and ship that
>> as a binary file. For that I'd want to be able to use it as an
>> adaptor on top of an underlying container that I supply:
>>
>>     // compile:
>>     std::vector<uint32_t> data(100000);
>>     bloom_filter b(data); // acts as a container adaptor
>>     b.insert(...);
>>     file f;
>>     f.write(data.begin(), data.end());
>>
>>     // use:
>>     memory_mapped_file f("dictionary.bloom");
>>     bloom_filter b(f);
>>     if (b.contains(url)) ....
>
> I'm not sure how this would work. Could you elaborate further in
> this regard? As I currently understand it, there are two components
> of interest. The first is being able to access the underlying Bloom
> filter data for serialization/deserialization purposes. This should
> be easy enough given a function that exposes the Bloom filter data.
> The other component I'm picking up on is being able to construct a
> Bloom filter from various sources. Range construction will be added
> that semantically behaves as an insertion of each element in the
> range. However, I'm not sure how to support construction from a
> memory_mapped_file.
OK, let me expand on the example I started above.

Say I have a long list of "banned" URLs and an interactive program
that must check URLs entered by the user against that list. I
pre-compile the Bloom filter and save its raw binary content to a
file. Then, in the application, I map that data into memory. The point
about memory mapping is that the data is not actually read from the
file until it is used, so in this case the Bloom filter's ctor can be
very fast; when the user types in a URL, only those pages of the file
that contain the bits corresponding to the URL's hashes are read. This
is both faster and more memory-efficient than loading all of the data.

In order for this to work, it's only necessary that I can make a Bloom
filter that is an adaptor over a range; the range that I provide will
be a pair of pointers that refer to the memory region where the file
is mapped. I.e. something like this (pseudo-code):

    int fd = open("banned_urls.dat", O_RDONLY);
    const char* begin = static_cast<const char*>(
        mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0));
    const char* end = begin + size;
    bloom_filter_adaptor<const char*> banned_urls(begin, end);

Currently you're using std::bitset for the underlying data. Since that
doesn't (AFAIK) have a way to adapt an underlying implementation data
type, I think you'll need to replace it with a wrapper that does do
that. Then you can have something like this:

    template <typename ITER_T>
    class bloom_filter_adaptor {
        typedef bitset_adaptor<ITER_T> impl_t;
        impl_t impl;
    public:
        bloom_filter_adaptor(ITER_T begin, ITER_T end): impl(begin, end) {}
        ...
    };

(An alternative to taking an iterator pair is to take a container and
store a reference to it.) Then the non-adaptor version, for when that
flexibility is not needed:

    template <size_t SIZE>
    struct bloom_filter_base {
        typedef std::vector<uint32_t> data_t;
        data_t data;
        bloom_filter_base(): data(SIZE, 0) {}  // SIZE zeroed words
    };

    template <size_t SIZE>
    class bloom_filter:
        bloom_filter_base<SIZE>,
        bloom_filter_adaptor<typename bloom_filter_base<SIZE>::data_t::iterator>
    {
    public:
        bloom_filter():
            bloom_filter_adaptor<typename bloom_filter_base<SIZE>::data_t::iterator>(
                this->data.begin(), this->data.end()) {}
    };

...or something like that.

Regards, Phil.
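For concreteness, the bitset_adaptor wrapper mentioned above might be
sketched as follows (word width, bit order, and the random-access
iterator requirement are arbitrary illustrative choices, not a
proposed interface):

    #include <cstddef>
    #include <boost/cstdint.hpp>

    // A bitset-like view over a caller-supplied range of 32-bit
    // words. ITER_T is assumed to be a random-access iterator, e.g.
    // a pointer into a memory-mapped file.
    template <typename ITER_T>
    class bitset_adaptor {
        ITER_T begin_;
        std::size_t bit_count_;
    public:
        bitset_adaptor(ITER_T begin, ITER_T end)
            : begin_(begin),
              bit_count_(32 * static_cast<std::size_t>(end - begin)) {}

        std::size_t size() const { return bit_count_; }

        bool test(std::size_t i) const {
            return (begin_[i / 32] >> (i % 32)) & 1u;
        }

        // Only valid when the underlying range is writable (i.e. not
        // a read-only memory-mapped file).
        void set(std::size_t i) {
            begin_[i / 32] |= boost::uint32_t(1) << (i % 32);
        }
    };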
participants (2)
- Alejandro Cabrera
- Phil Endecott