
Jakub,
Having bloom_filter in boost:: is a great idea! It is super useful; we use one in production code (our own implementation; actually, we use Bloomier filters with murmur hash perfect hashing, from your wiki reference). It lets us use a very memory- and lookup-efficient data structure for very large datasets (think of a 100 GB file of strings on an SSD disk, indexed by e.g. a 200 MB Bloomier filter in memory).
I've read about Bloomier filters, though those seem a ways off still. In general, I'm building from the most basic variants up to the more complex ones. Your statistics sound amazing, to say the least! Working with very large data isn't something I've had the opportunity to do just yet, though I'm interested in and aware of such scenarios, as my research area is operating systems and storage.
Did you consider adding serialization of a bloom_filter to your implementation?
I haven't considered it at length, but there must be a way to encode the type of the Bloom filter uniquely prior to writing out its bits, so that when deserialization time comes, a user wouldn't accidentally instantiate a Bloom filter of the wrong type. The idea I have in mind at the moment is to first serialize the template parameters of the Bloom filter (Type, Size, HashFunctions). The Type and the HashFunctions appear to be the trickiest, since their structure can take on as many possibilities as the type system allows. Writing out the bits should be the easiest part. If you have any ideas in this regard, I'd appreciate it if you shared them.
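To make that concrete, here's a minimal sketch of the tag-then-bits idea. Everything in it is hypothetical rather than part of the proposal: toy_bloom_filter stands in for bloom_filter<Type, Size, HashFunctions>, and typeid(...).name() is compiler-specific, so a real scheme would need a portable encoding of the template parameters instead.

  #include <bitset>
  #include <cstddef>
  #include <iostream>
  #include <stdexcept>
  #include <string>
  #include <typeinfo>

  // Hypothetical stand-in for bloom_filter<Type, Size, HashFunctions>;
  // only what the sketch needs is shown.
  template <typename Type, std::size_t Size>
  struct toy_bloom_filter {
      std::bitset<Size> bits;
  };

  // Write a type tag first, then the bits.
  template <typename Filter>
  void save(const Filter& f, std::ostream& os) {
      const std::string tag = typeid(Filter).name();
      os << tag.size() << '\n' << tag << '\n';
      os << f.bits.to_string() << '\n';  // '0'/'1' text form: simple, not space-efficient
  }

  // Check the tag before touching the bits, so a filter of the wrong
  // type fails loudly instead of silently reinterpreting the bit array.
  template <typename Filter>
  void load(Filter& f, std::istream& is) {
      std::size_t n = 0;
      is >> n;
      is.ignore();  // skip the newline after the size
      std::string tag(n, '\0');
      is.read(&tag[0], static_cast<std::streamsize>(n));
      if (tag != typeid(Filter).name())
          throw std::runtime_error("bloom_filter type mismatch");
      std::string raw;
      is >> raw;
      f.bits = decltype(f.bits)(raw);
  }

With something like this, saving a toy_bloom_filter<int, 1024> and then loading it into a toy_bloom_filter<std::string, 2048> throws at load time rather than producing a filter full of garbage.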
In general, reconstructing hash-based containers with a series of inserts is pretty inefficient.
I can imagine this to be the case, especially if the containers are storing a very large number of elements.
The use case I'm talking about: e.g., in your web proxy scenario, the proxy service keeps running, downloading and caching URLs and adding them to the bloom filter. Then the process needs to be restarted for some reason. All documents downloaded and stored on disk have to be reiterated and their URLs reinserted into a newly created bloom_filter, which makes startup of the process slow. By the way, I have the same problem with the standard containers (except std::vector): there is no efficient serialization/deserialization for them, rendering them useless for any larger side project (like an unordered_set of 1M strings).
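That startup path is exactly where dumping the bit array pays off: a rebuild re-hashes every cached URL, while a reload is a single sequential read of the filter's bits. A rough contrast, along the same lines as the sketch above (again hypothetical: a toy filter with a single std::hash function standing in for the proposal's HashFunctions set, and a made-up on-disk format):

  #include <bitset>
  #include <cstddef>
  #include <fstream>
  #include <functional>
  #include <string>
  #include <vector>

  constexpr std::size_t kBits = 1 << 20;  // arbitrary filter size for the sketch

  struct toy_filter {
      std::bitset<kBits> bits;
      void insert(const std::string& url) {
          bits.set(std::hash<std::string>{}(url) % kBits);
      }
  };

  // Slow restart: re-hash and re-insert every cached URL.
  toy_filter rebuild(const std::vector<std::string>& cached_urls) {
      toy_filter f;
      for (const std::string& url : cached_urls)
          f.insert(url);
      return f;
  }

  void save(const toy_filter& f, const std::string& path) {
      std::ofstream out(path);
      out << f.bits.to_string();  // text form; raw bytes would be 8x smaller
  }

  // Fast restart: one sequential read of the saved bit array.
  toy_filter reload(const std::string& path) {
      std::ifstream in(path);
      std::string raw;
      in >> raw;
      toy_filter f;
      f.bits = std::bitset<kBits>(raw);
      return f;
  }
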
There was a Boost Serialization library in the works. It seems that the last time it was maintained was 7 years ago. I'm not sure what generic, type-safe [de]serialization would look like - it sounds like a project all on its own! I agree that the C++ standard would benefit greatly from a means to [de]serialize data structures, much like Python has the pickle module.

Thank you for your feedback, Jakub.

-Alej