Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project

Hello Phil,

Phil Endecott wrote:
> Hi Alejandro,
>
> Alejandro Cabrera wrote:
>> I'm writing to request feedback, suggestions, and advice from the
>> Boost community. I've been working on a Bloom filter data structure
>> for about a month and a half now. At this point, the following is
>> complete and available in the Boost Sandbox:
>> - a working, basic prototype (insert & query operations, union &
>>   intersect)
>> - a test suite covering the public interface
>> - five examples
>> - documentation covering many aspects of the project (tested on
>>   Chrome 12 and 13, Firefox 4 and 6)
>
> At first glance, this looks very good. Well done.
>
> Bloom filters are essentially quite a simple concept, and I suggest
> that you stick to that, i.e. resist any temptation to add lots of
> extra features. Having said that, I think you should look at e.g.
> std::set<> and std::unordered_set<> and see what you can do to make
> your class closer to them. I can think of:
>
> - empty()
> - clear() (Ah, you do have this, but not documented?)

I accidentally skipped documenting this. Thank you for pointing it
out.

> - a constructor that takes a range, e.g.
>     int a[] = { 1,2,3,4,5 };
>     bloom_filter<int> b(a, a+5);
>     std::list<int> l;
>     bloom_filter<int> c(l.begin(), l.end());
> - insert(begin, end), similarly

I hadn't considered the various ways to construct the Bloom filter. I
assumed users would always want to start with an empty Bloom filter,
but this would make using the Bloom filter more flexible.
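Both could be layered on the existing single-element insert; a minimal
sketch (illustrative only, not the sandbox code):

    #include <cstddef>

    template <typename T, std::size_t Size>
    class bloom_filter {
    public:
        bloom_filter() {}  // default: start empty

        // Range construction: semantically an insert of each element
        // in [first, last).
        template <typename InputIterator>
        bloom_filter(InputIterator first, InputIterator last) {
            insert(first, last);
        }

        template <typename InputIterator>
        void insert(InputIterator first, InputIterator last) {
            for (; first != last; ++first)
                insert(*first);
        }

        void insert(const T& t);  // existing single-element insert
        // ... query operations, union/intersect, etc. ...
    };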
> - swap()

I overlooked swap. I'll add this.

> - You have a size() which is (confusingly?) different from those
>   containers' size(); it's more like capacity(), I guess. (Similarly,
>   I think I would rename count() to maybe bit_count().)

I agree that using size() as I do is confusing compared with what
size() has come to mean for the standard containers. It would be
better named something like bit_count(). I'll reconsider the name.

> There is a trade-off between the size of the filter, the number of
> values stored, the number of hash functions, and the false positive
> rate. You might consider adding some sort of compile-time
> metafunctions to help the user determine the right size and number
> of hash functions, based on their expected number of values stored
> and acceptable false positive rate. Similarly, you could add a
> static expected_false_positive_rate() method that takes the number
> of stored values as a parameter.

Given a desired false positive rate, an expected number of elements,
and an upper limit on the storage requirement, it is possible to
determine both a good Size value and a good number of hash functions.
I see now how metafunctions can help relieve the user of the burden
of picking good values. This might take a bit of work to provide as a
feature, since it would involve ratios to represent the false positive
rate as a template parameter to the metafunction. I'll put this
feature off until version 1.1 or 1.2. Until then, users will have to
determine the best parameters for their application from the
documentation guidelines. On that note, I think I should add a section
to the tutorial detailing how to choose good parameters.

> Sometimes it would be better for size to be a run-time parameter.

I have a dynamic Bloom filter in the queue of variants. It's
definitely required if I were to develop a scalable Bloom filter that
guarantees a certain false positive rate, as detailed in a recent
research paper titled "Scalable Bloom Filters".
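Returning to the expected_false_positive_rate() idea: the arithmetic
behind such a helper is small. A sketch, using the standard
approximation p = (1 - e^(-kn/m))^k for m bits, k hash functions, and
n stored elements (the name and signature are illustrative):

    #include <cmath>
    #include <cstddef>

    // Approximate false positive rate of a Bloom filter with
    // bit_count bits (m) and hash_count hash functions (k) after
    // element_count insertions (n): p = (1 - e^(-k*n/m))^k.
    // Assumes bit_count > 0.
    double expected_false_positive_rate(std::size_t bit_count,
                                        std::size_t hash_count,
                                        std::size_t element_count)
    {
        const double k = static_cast<double>(hash_count);
        const double n = static_cast<double>(element_count);
        const double m = static_cast<double>(bit_count);
        return std::pow(1.0 - std::exp(-k * n / m), k);
    }

A compile-time metafunction would compute the same quantity from
integral template parameters; the target rate could be expressed as a
ratio (numerator and denominator) to stay within integral template
arguments.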
> Sometimes it would be useful to access the underlying data. For
> example, say I have a spell-check dictionary or a URL block-list; I
> might want to compile that list into a Bloom filter and ship that as
> a binary file. For that I'd want to be able to use it as an adaptor
> on top of an underlying container that I supply:
>
>     // compile:
>     std::vector<uint32_t> data(100000);
>     bloom_filter b(data); // acts as a container adaptor
>     b.insert(...);
>     file f;
>     f.write(data.begin(), data.end());
>
>     // use:
>     memory_mapped_file f("dictionary.bloom");
>     bloom_filter b(f);
>     if (b.contains(url)) ....

I'm not sure how this would work. Could you elaborate further in this
regard? As I currently understand it, there are two components of
interest. The first is being able to access the underlying Bloom
filter data for serialization/deserialization purposes. This should be
easy enough given a function that exposes the Bloom filter data. The
other component I'm picking up on is being able to construct a Bloom
filter from various sources. Range construction will be added that
semantically behaves as an insertion of each element in the range.
However, I'm not sure how to support construction from a
memory_mapped_file.
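For the serialization half, a read-only accessor over the underlying
bits may be all that's needed. A sketch of the idea (the data() member
and std::bitset storage are assumptions for illustration, not the
current sandbox interface):

    #include <bitset>
    #include <cstddef>
    #include <ostream>

    template <typename T, std::size_t Size>
    class bloom_filter {
    public:
        typedef std::bitset<Size> bitset_type;

        // Hypothetical read-only view of the underlying bit array.
        const bitset_type& data() const { return bits_; }

        // ... insert(), query operations, etc. ...
    private:
        bitset_type bits_;
    };

    // Caller-side serialization: pack the bits into bytes and write
    // them to a stream.
    template <typename T, std::size_t Size>
    void save(const bloom_filter<T, Size>& f, std::ostream& os)
    {
        const typename bloom_filter<T, Size>::bitset_type& bits = f.data();
        for (std::size_t i = 0; i < Size; i += 8) {
            unsigned char byte = 0;
            for (std::size_t b = 0; b < 8 && i + b < Size; ++b)
                byte |= static_cast<unsigned char>(bits[i + b] ? 1 : 0) << b;
            os.put(static_cast<char>(byte));
        }
    }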
> I think it would be sufficient for false_positive_rate() to return a
> float. No-one cares about its 19th digit.

I preferred greater precision to anticipate user needs. Unless
false_positive_rate() were called frequently, I wouldn't expect the
added precision to incur a significant performance penalty.

> For some reason I have a mental block about how intersection works.
> Union is OK; I can see that
>
>     set a; set b;
>     bloom_filter(a) union bloom_filter(b) == bloom_filter(a union b);
>
> but it's not clear to me if
>
>     bloom_filter(a) intersect bloom_filter(b) == bloom_filter(a intersect b);
>
> Can you convince me?

Intersect is intended to leave on only the bits that are set in both
Bloom filters, which correspond to the elements that both filters have
inserted. This works because intersection is only allowed between
Bloom filters that use the same set of hash functions and the same
storage size. Assume that both Bloom filters are instantiated with a
hash function that returns the integer value of the element hashed.
Then the following occurs:

    bloom_filter<int, 4> a;
    bloom_filter<int, 4> b;
    bloom_filter<int, 4> c;
    a.insert(1); a.insert(2); // 0110
    b.insert(2); b.insert(3); // 0011
    c = a & b;                // 0010

At this point, only c.contains(2) will return true.

> During the XInt review there was some discussion about whether
> is_prime() should be called is_probably_prime(). The same argument
> could be made here about whether your contains() should be
> probably_contains(). If there were some six-letter word meaning
> "probably contains" then I would support using it, but
> "probably_contains" is a lot of typing.

The wording for contains() is another tricky issue.
probably_contains() has the benefit of pointing out the probabilistic
nature of the data structure, which is very helpful for warning users.
I don't think typing nine more characters (probably_) should be a
problem, since it isn't the kind of function that can nest on itself
(x.contains(t).contains...). I will tentatively rename the function
probably_contains(), and if further comments indicate that this adds
some level of confusion, I'll reconsider the name again.

> Regarding hash functions: I believe I have read that some care is
> needed when you use essentially the same hash function but with
> different seeds, i.e. are the results really independent? It would
> be more obviously correct to have a set of hash functions that use
> entirely different algorithms. Unfortunately this is hard.

I've left the hashing component open to extension for exactly this
reason. The Bloom filter data structure does rely greatly on the use
of good hash functions, but the definition of "good" depends on the
context. A CRC32 might be perfect for some applications, a MurmurHash3
better for others, and some applications might even require a
cryptographically secure hash. Since the intent of the project is not
to develop great hash functions, I'm inclined to keep boost::hash_value
as the default, though it might be best to default to an mpl::vector
containing a single hash function instead of two since, as you stated,
the results aren't independent.
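To illustrate the single-hasher default: a hypothetical sketch built
on boost::hash_combine (the seeded wrapper and the mpl::vector
convention here are illustrative, not the current sandbox API):

    #include <cstddef>
    #include <boost/functional/hash.hpp>
    #include <boost/mpl/vector.hpp>

    // A seeded wrapper around boost::hash_value: hash_combine mixes
    // the seed with hash_value(t), so different Seed values yield
    // different (but not truly independent) hash functions.
    template <typename T, std::size_t Seed>
    struct seeded_boost_hash {
        std::size_t operator()(const T& t) const {
            std::size_t seed = Seed;
            boost::hash_combine(seed, t);
            return seed;
        }
    };

    // A default hasher set containing a single hash function:
    template <typename T>
    struct default_hashers {
        typedef boost::mpl::vector< seeded_boost_hash<T, 0> > type;
    };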
> Performance: it would be great to see some benchmarks for different
> types against std::set and std::unordered_set. In particular, it
> would be interesting to look at string keys of different lengths:
> both the Bloom filter and the unordered_set always have to look at
> every byte of the string to compute the hash, while a std::set need
> only look at the string's unique prefix. You mention things like web
> caches, and so I guess getting some real-world benchmark data would
> be simple enough. One hashing option for strings is to hash
> substrings.

I suspect that a benchmark on at least strings would be useful to
demonstrate how the bloom_filter compares against std::set and
std::unordered_set, assuming that most users will be hashing data
(file names, URLs, etc.). I'll add benchmarks to my queue of things to
do, but I want to make sure the interface is close to stable before I
commit to writing benchmarks.
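A query-timing harness along these lines could be a starting point
(sketch only; the Bloom filter would plug in via a thin wrapper, since
its query name is still being settled above):

    #include <ctime>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>
    #include <boost/unordered_set.hpp>

    // Times `repetitions` rounds of membership queries against any
    // container exposing count(key). Sketch only.
    template <typename Container>
    double time_queries(const Container& c,
                        const std::vector<std::string>& keys,
                        int repetitions)
    {
        std::clock_t start = std::clock();
        std::size_t hits = 0;
        for (int r = 0; r < repetitions; ++r)
            for (std::size_t i = 0; i < keys.size(); ++i)
                hits += c.count(keys[i]);
        double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
        std::cerr << "hits: " << hits << '\n';  // defeat dead-code elimination
        return seconds;
    }

    int main()
    {
        std::vector<std::string> keys;
        keys.push_back("http://a.example/");
        keys.push_back("http://b.example/");
        // ... a real benchmark would load thousands of URLs here ...

        std::set<std::string> s(keys.begin(), keys.end());
        boost::unordered_set<std::string> u(keys.begin(), keys.end());

        std::cout << "std::set:             "
                  << time_queries(s, keys, 1000) << " s\n";
        std::cout << "boost::unordered_set: "
                  << time_queries(u, keys, 1000) << " s\n";
        // A bloom_filter could be timed the same way via a small
        // wrapper whose count(k) returns probably_contains(k) ? 1 : 0.
        return 0;
    }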
> Anyway, good work so far.
>
> Regards, Phil.

Thank you for taking the time to comment so thoroughly on the Bloom
filter project!

Alejandro Cabrera wrote:
>> Sometimes it would be useful to access the underlying data. For
>> example, say I have a spell-check dictionary or a URL block-list; I
>> might want to compile that list into a Bloom filter and ship that
>> as a binary file. For that I'd want to be able to use it as an
>> adaptor on top of an underlying container that I supply:
>>
>>     // compile:
>>     std::vector<uint32_t> data(100000);
>>     bloom_filter b(data); // acts as a container adaptor
>>     b.insert(...);
>>     file f;
>>     f.write(data.begin(), data.end());
>>
>>     // use:
>>     memory_mapped_file f("dictionary.bloom");
>>     bloom_filter b(f);
>>     if (b.contains(url)) ....
>
> I'm not sure how this would work. Could you elaborate further in
> this regard? As I currently understand it, there are two components
> of interest. The first is being able to access the underlying Bloom
> filter data for serialization/deserialization purposes. This should
> be easy enough given a function that exposes the Bloom filter data.
> The other component I'm picking up on is being able to construct a
> Bloom filter from various sources. Range construction will be added
> that semantically behaves as an insertion of each element in the
> range. However, I'm not sure how to support construction from a
> memory_mapped_file.
OK, let me expand on the example I started above.

Say I have a long list of "banned" URLs and an interactive program
that must check URLs entered by the user against that list. I
pre-compile the Bloom filter and save its raw binary content to a
file. Then, in the application, I map that data into memory. The point
about memory mapping is that the data is not actually read from the
file until it is used, so in this case the Bloom filter's ctor can be
very fast; when the user types in a URL, only those pages of the file
that contain the bits corresponding to the URL's hashes are read. This
is both faster and more memory-efficient than loading all of the data.

In order for this to work, it's only necessary that I can make a Bloom
filter that is an adaptor over a range; the range that I provide will
be a pair of pointers that refer to the memory region where the file
is mapped. I.e. something like this (pseudo-code):

    int fd = open("banned_urls.dat", O_RDONLY);
    const char* begin = static_cast<const char*>(
        mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0));
    const char* end = begin + size;
    bloom_filter_adaptor<const char*> banned_urls(begin, end);

Currently you're using std::bitset for the underlying data. Since that
doesn't (AFAIK) have a way to adapt an underlying implementation data
type, I think you'll need to replace it with a wrapper that does do
that. Then you can have something like this:

    template <typename ITER_T>
    class bloom_filter_adaptor {
        typedef bitset_adaptor<ITER_T> impl_t;
        impl_t impl;
    public:
        bloom_filter_adaptor(ITER_T begin, ITER_T end): impl(begin, end) {}
        ...
    };

(An alternative to taking an iterator pair is to take a container and
store a reference to it.) Then the non-adaptor version, for when that
flexibility is not needed:

    template <size_t SIZE>
    struct bloom_filter_base {
        typedef std::vector<uint32_t> data_t;
        data_t data;
        bloom_filter_base(): data(SIZE, 0) {}  // SIZE zeroed words
    };

    template <size_t SIZE>
    class bloom_filter:
        bloom_filter_base<SIZE>,
        bloom_filter_adaptor<typename bloom_filter_base<SIZE>::data_t::iterator>
    {
    public:
        bloom_filter():
            bloom_filter_adaptor<typename bloom_filter_base<SIZE>::data_t::iterator>(
                this->data.begin(), this->data.end()) {}
    };

...or something like that.

Regards, Phil.
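For concreteness, the bitset_adaptor wrapper mentioned above might be
sketched as follows (word width, bit order, and the random-access
iterator requirement are arbitrary illustrative choices, not a
proposed interface):

    #include <cstddef>
    #include <boost/cstdint.hpp>

    // A bitset-like view over a caller-supplied range of 32-bit
    // words. ITER_T is assumed to be a random-access iterator, e.g.
    // a pointer into a memory-mapped file.
    template <typename ITER_T>
    class bitset_adaptor {
        ITER_T begin_;
        std::size_t bit_count_;
    public:
        bitset_adaptor(ITER_T begin, ITER_T end)
            : begin_(begin),
              bit_count_(32 * static_cast<std::size_t>(end - begin)) {}

        std::size_t size() const { return bit_count_; }

        bool test(std::size_t i) const {
            return (begin_[i / 32] >> (i % 32)) & 1u;
        }

        // Only valid when the underlying range is writable (i.e. not
        // a read-only memory-mapped file).
        void set(std::size_t i) {
            begin_[i / 32] |= boost::uint32_t(1) << (i % 32);
        }
    };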
participants (2)
- Alejandro Cabrera
- Phil Endecott