
Jakub,
Having bloom_filter in boost:: is a great idea! It is super useful; we use one in production code (our own implementation; actually, we use Bloomier filters with murmur hash perfect hashing, from your wiki reference). It lets us use a very memory- and lookup-efficient data structure for very large datasets (think of a 100 GB file of strings on an SSD disk, indexed by e.g. a 200 MB Bloomier filter in memory).
I've read about Bloomier filters, though those seem a ways off still. In general, I'm building from the most basic variants up to the more complex ones. Your statistics sound amazing, to say the least! Working with very large data isn't something I've had the opportunity to do just yet, though I'm interested in and aware of such scenarios, as my research area is operating systems and storage.
Did you consider adding serialization of a bloom_filter to your implementation?
I haven't considered it at length, but there must be a way to encode the type of the Bloom filter uniquely prior to writing out its bits, so that when deserialization time comes, a user wouldn't accidentally instantiate a Bloom filter of the wrong type. The idea I have in mind at the moment is to first serialize the template parameters of the Bloom filter (Type, Size, HashFunctions). The Type and the HashFunctions appear to be the trickiest, since their structure can take on as many possibilities as the type system allows. Writing out the bits should be the easiest part. If you have any ideas in this regard, I'd appreciate it if you shared them.
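To make that concrete, here's a minimal sketch of the tag-then-bits idea. Everything in it is hypothetical rather than part of the proposal: toy_bloom_filter stands in for bloom_filter<Type, Size, HashFunctions>, and typeid(...).name() is compiler-specific, so a real scheme would need a portable encoding of the template parameters instead.

  #include <bitset>
  #include <cstddef>
  #include <iostream>
  #include <stdexcept>
  #include <string>
  #include <typeinfo>

  // Hypothetical stand-in for bloom_filter<Type, Size, HashFunctions>;
  // only what the sketch needs is shown.
  template <typename Type, std::size_t Size>
  struct toy_bloom_filter {
      std::bitset<Size> bits;
  };

  // Write a type tag first, then the bits.
  template <typename Filter>
  void save(const Filter& f, std::ostream& os) {
      const std::string tag = typeid(Filter).name();
      os << tag.size() << '\n' << tag << '\n';
      os << f.bits.to_string() << '\n';  // '0'/'1' text form: simple, not space-efficient
  }

  // Check the tag before touching the bits, so a filter of the wrong
  // type fails loudly instead of silently reinterpreting the bit array.
  template <typename Filter>
  void load(Filter& f, std::istream& is) {
      std::size_t n = 0;
      is >> n;
      is.ignore();  // skip the newline after the size
      std::string tag(n, '\0');
      is.read(&tag[0], static_cast<std::streamsize>(n));
      if (tag != typeid(Filter).name())
          throw std::runtime_error("bloom_filter type mismatch");
      std::string raw;
      is >> raw;
      f.bits = decltype(f.bits)(raw);
  }

With something like this, saving a toy_bloom_filter<int, 1024> and then loading it into a toy_bloom_filter<std::string, 2048> throws at load time rather than producing a filter full of garbage.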
In general, reconstructing hash-based containers with a series of inserts is pretty inefficient.
I can imagine this to be the case, especially if the containers are storing a very large number of elements.
The use case I'm talking about: e.g., in your web proxy scenario, the proxy service keeps running, downloading and caching URLs and adding them to the bloom filter. Then the process needs to be restarted for some reason. All documents downloaded and stored on disk have to be reiterated and their URLs reinserted into a newly created bloom_filter, which makes startup of the process slow. By the way, I have the same problem with the standard containers (except std::vector): there is no efficient serialization/deserialization for them, rendering them useless for any larger side project (like an unordered_set of 1M strings).
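That startup path is exactly where dumping the bit array pays off: a rebuild re-hashes every cached URL, while a reload is a single sequential read of the filter's bits. A rough contrast, along the same lines as the sketch above (again hypothetical: a toy filter with a single std::hash function standing in for the proposal's HashFunctions set, and a made-up on-disk format):

  #include <bitset>
  #include <cstddef>
  #include <fstream>
  #include <functional>
  #include <string>
  #include <vector>

  constexpr std::size_t kBits = 1 << 20;  // arbitrary filter size for the sketch

  struct toy_filter {
      std::bitset<kBits> bits;
      void insert(const std::string& url) {
          bits.set(std::hash<std::string>{}(url) % kBits);
      }
  };

  // Slow restart: re-hash and re-insert every cached URL.
  toy_filter rebuild(const std::vector<std::string>& cached_urls) {
      toy_filter f;
      for (const std::string& url : cached_urls)
          f.insert(url);
      return f;
  }

  void save(const toy_filter& f, const std::string& path) {
      std::ofstream out(path);
      out << f.bits.to_string();  // text form; raw bytes would be 8x smaller
  }

  // Fast restart: one sequential read of the saved bit array.
  toy_filter reload(const std::string& path) {
      std::ifstream in(path);
      std::string raw;
      in >> raw;
      toy_filter f;
      f.bits = std::bitset<kBits>(raw);
      return f;
  }
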
There was a Boost Serialization library in the works. It seems that the last time it was maintained was 7 years ago. I'm not sure what generic, type-safe [de]serialization would look like - it sounds like a project all on its own! I agree that the C++ standard would benefit greatly from a means to [de]serialize data structures, much like Python has the pickle module.

Thank you for your feedback, Jakub.

-Alej