Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project

Jakub,
Having bloom_filter in boost:: is a great idea! It is super useful; we use one in production code (our own implementation, actually Bloomier filters with MurmurHash-based perfect hashing, from your wiki reference). It lets us use a very memory- and lookup-efficient data structure for very large datasets (think a 100 GB file of strings on an SSD disk indexed by, e.g., a 200 MB Bloomier filter in memory).
I've read about Bloomier filters, though those seem a ways off still. I'm building from the most basic to the more complex variants, in general. Your statistics sound amazing, to say the least! Working with very large data isn't something I've had the opportunity to do just yet, though I'm interested and aware of such scenarios, as my research area is in operating systems and storage issues.
Did you consider adding serialization of a bloom_filter to your implementation?
I haven't considered it at length, but there must be a way to encode the type of the Bloom filter uniquely prior to writing out its bits, so that when deserialization time comes, a user wouldn't accidentally instantiate a Bloom filter of the wrong type. The idea I have in mind at the moment is to first serialize the template parameters of the Bloom filter (Type, Size, HashFunctions). The Type and the HashFunctions appear to be the trickiest, since their structure can take on as many possibilities as the type system allows. Writing out the bits should be the easiest part. If you have any ideas in this regard, I'd appreciate it if you shared them.
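Roughly, the scheme I have in mind looks like the sketch below. Everything here is hypothetical and nothing is final: it assumes a filter exposing its std::bitset<Size> through a bits() accessor, writes the compile-time parameters as a small header ahead of the raw bits, and checks that header again on load so a mismatched instantiation is rejected early.

  #include <bitset>
  #include <cstddef>
  #include <istream>
  #include <ostream>
  #include <stdexcept>
  #include <string>
  #include <typeinfo>

  // Illustrative only: assumes a filter type exposing its std::bitset<Size>
  // through bits(). The element type tag from typeid is not portable across
  // compilers, so a real scheme would need a stabler encoding.
  template <typename T, std::size_t Size, class Filter>
  std::ostream& save_bloom_filter(std::ostream& os, const Filter& f)
  {
      os << typeid(T).name() << '\n'        // element type tag
         << Size << '\n'                    // bit-array size
         << f.bits().to_string() << '\n';   // the raw bits
      return os;
  }

  template <typename T, std::size_t Size, class Filter>
  std::istream& load_bloom_filter(std::istream& is, Filter& f)
  {
      std::string type_tag, bits;
      std::size_t size = 0;

      std::getline(is, type_tag);
      is >> size >> bits;
      if (type_tag != typeid(T).name() || size != Size)
          throw std::runtime_error("bloom_filter parameter mismatch");

      f.bits() = std::bitset<Size>(bits);
      return is;
  }

The HashFunctions parameter is the part this sketch glosses over, since a sequence of function objects has no obvious stable encoding.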
In general, reconstructing hash-based containers with a series of inserts is pretty inefficient.
I can imagine this to be the case, especially if the containers are storing a very large number of elements.
The use case I'm talking about: e.g., in your web proxy scenario, the proxy service keeps running, downloading and caching URLs and adding them to the bloom filter. Then the process needs to be restarted for some reason. All the documents downloaded and stored on disk have to be iterated over again and their URLs reinserted into a newly created bloom_filter, which makes startup of the process slow. By the way, I have the same problem with the standard containers (except std::vector): there is no efficient serialization/deserialization for them, rendering them useless for any larger project (like an unordered_set of 1M strings).
There was a Boost Serialization library in the works. It seems that the last time it was maintained was 7 years ago. I'm not sure what generic, type-safe [de]serialization would look like - it sounds like a project all on its own! I agree that the C++ standard would benefit greatly from a means to [de]serialize data structures, much like Python has the pickle module. Thank you for your feedback, Jakub. -Alej

Alejandro Cabrera <cpp.cabrera <at> gmail.com> writes:
There was a Boost Serialization library in the works. It seems that the last time it was maintained was 7 years ago. I'm not sure what generic, type-safe [de]serialization would look like - it sounds like a project all on its own! I agree that the C++ standard would benefit greatly from a means to [de]serialize data structures, much like Python has the pickle module.
http://www.boost.org/doc/libs/1_46_1/libs/serialization/doc/index.html I am really confused. Serialization is in Boost, and it does containers. Am I not understanding what you are talking about? Joel
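For example, round-tripping a std::vector<std::string> through Boost.Serialization is only a few lines (a minimal sketch; the archive format and file name here are arbitrary):

  #include <fstream>
  #include <string>
  #include <vector>

  #include <boost/archive/text_iarchive.hpp>
  #include <boost/archive/text_oarchive.hpp>
  #include <boost/serialization/string.hpp>
  #include <boost/serialization/vector.hpp>

  int main()
  {
      std::vector<std::string> urls;
      urls.push_back("http://www.boost.org/");
      urls.push_back("http://example.org/");

      // Save the container to a text archive (through a const reference,
      // since the library expects saved objects to be const).
      {
          std::ofstream out("urls.archive");
          boost::archive::text_oarchive oa(out);
          const std::vector<std::string>& const_urls = urls;
          oa << const_urls;
      }

      // Reload it into a fresh container.
      std::vector<std::string> restored;
      {
          std::ifstream in("urls.archive");
          boost::archive::text_iarchive ia(in);
          ia >> restored;
      }

      return 0;
  }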

Joel, Mailing List, My apologies regarding the statement on Boost Serialization. I thought that it wasn't actively maintained. I had based my assumption of activity on the History page, which reports the most recent changes as having occurred on November 1, 2004 (http://www.boost.org/doc/libs/1_46_1/libs/serialization/doc/index.html -> History). The release notes list changes up until Boost 1.45. It's definitely an actively maintained library, so I don't want to create a misconception with my own misinformation.

Alejandro Cabrera <cpp.cabrera <at> gmail.com> writes:
My apologies regarding the statement on Boost Serialization. I thought that it wasn't actively maintained.
No problem. Out of curiosity, have you seen the NPS Bloom filter package linked to here: http://afflib.org/ Joel

Joel, Joel Young wrote:
Alejandro Cabrera <cpp.cabrera <at> gmail.com> writes:
My apologies regarding the statement on Boost Serialization. I thought that it wasn't actively maintained.
No problem.
Out of curiosity, have you seen the NPS Bloom filter package linked to here:
Joel
I had not heard of it before. I've also been informed of the existence of this alternate Bloom filter library: http://tech.blog.greplin.com/lucene-utilities-and-bloom-filters Mailing list, There are definitely many alternatives out there. I'll be writing benchmarks during the next week testing collision rates, insertion speed, and query speed for the following types: integer, string, and user-defined (a URL class storing the URL as a string). Are there any other metrics the community is interested in? Are there any specific types of data that you'd like to see tested? -Alej
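P.S. For anyone who wants to sanity-check the measured collision rates: the theoretical false positive rate for a filter with m bits, k hash functions, and n inserted elements is roughly (1 - e^(-kn/m))^k. A quick helper (illustrative only, not part of the library) for comparing against the benchmark numbers:

  #include <cmath>
  #include <cstdio>

  // Expected false positive rate of a Bloom filter with m bits, k hash
  // functions, and n inserted elements: (1 - e^(-k*n/m))^k.
  double expected_fpr(double m, double k, double n)
  {
      return std::pow(1.0 - std::exp(-k * n / m), k);
  }

  int main()
  {
      // e.g. 8 bits per element, 4 hash functions, 1,000,000 elements
      std::printf("expected FPR: %f\n", expected_fpr(8e6, 4.0, 1e6));
      return 0;
  }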

Alejandro Cabrera <cpp.cabrera <at> gmail.com> writes: Please configure your client to properly reply. You are breaking the threads on every post. See: http://post.gmane.org/post.php?group=gmane.comp.lib.boost.devel Joel
participants (2)
- Alejandro Cabrera
- Joel Young