[Feedback] Towards a Better Boost.BloomFilter

Hello all, The GSoC coding period is over. However, I have every intent to continue to develop, improve, and maintain the Boost.BloomFilter package. Your feedback last time around was very useful, and made me think very carefully about the operations supported by each Bloom filter variant implemented. Thank you. Now, I could use feedback on at least the following aspects: - Are all the operations you need implemented by Boost.BloomFilter? Serialization support is still missing, though at least now it is possible to access the underlying storage for each structure implemented through the data() member function. - Are there are any particular variants you'd like to see implemented sooner? I've gathered together most Bloom filter literature on the bibliography page: http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/libs/bloom_filter/... http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/libs/bloom_filter/... (Note: twohash_ Bloom filters have been implemented.) - There is much talk of C++11 as of late on the mailing list: do you see anywhere in the implementation where taking advantage of C++11 features may produce a better Bloom filter package? - Regarding compiler support, this Bloom filter has been tested on several compiler suites (gcc-4.6, gcc-4.6 -std=c++0x, clang-2.9, clang-2.9-darwin), but has been primarily developed on Linux. Does the test suite compile and pass on your target compiler? My goal is to produce a review-ready Bloom filter library by the end of the year. Any feedback helps. Thank you for your time! A few links: - repository trunk: http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/ http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/ - documentation: http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/libs/bloom_filter/... http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/libs/bloom_filter/... Cheers, -Alejandro Cabrera -- View this message in context: http://boost.2283326.n4.nabble.com/Feedback-Towards-a-Better-Boost-BloomFilt... Sent from the Boost - Dev mailing list archive at Nabble.com.

Hi Alejandro, Alejandro Cabrera wrote:
Hello all,
The GSoC coding period is over. However, I have every intent to continue to develop, improve, and maintain the Boost.BloomFilter package.
Your feedback last time around was very useful, and made me think very carefully about the operations supported by each Bloom filter variant implemented. Thank you.
Now, I could use feedback on at least the following aspects: - Are all the operations you need implemented by Boost.BloomFilter?
There is still no "adaptor" functionality, so I can't mmap() a file to use as the bloom filter's raw content, as I might want to do for e.g. a spellcheck, URL blocklist, etc. etc.
Serialization support is still missing, though at least now it is possible to access the underlying storage for each structure implemented through the data() member function.
data() returns a std::bitset, but that doesn't provide access to its data in a form that I can write to a file (e.g. in preparing the data for the above examples). I consider this a fault of std::bitset. I believe you should use a std::vector or array instead. I see that you still have intersection without discussing the error rate implications. Your dynamic_bitset has a resize() method. Was that there before? I don't understand how it can work in any useful way. The final documentation should mention the complexity of each method. Regards, Phil.

Phil Endecott-48 wrote:
There is still no "adaptor" functionality, so I can't mmap() a file to use as the bloom filter's raw content, as I might want to do for e.g. a spellcheck, URL blocklist, etc. etc.
Could you describe this is more detail? I am familiar with the mmap() interface, but how would one go about providing an adaptor so that mmap() could be used as the Bloom filter's raw content? Would this be accomplished through a constructor, for example: // dynamic_basic_bloom_filter(void *addr, const size_t len); dynamic_basic_bloom_filter<std::string> bloom(address, length); Would you happen to know if there is any work being done on a Boost.Posix or any similar C++ project? Phil Endecott-48 wrote:
data() returns a std::bitset, but that doesn't provide access to its data in a form that I can write to a file (e.g. in preparing the data for the above examples). I consider this a fault of std::bitset. I believe you should use a std::vector or array instead.
data() returns the underlying type in each case. For the basic Bloom filter, this is a bitset (std:: or dynamic), and for counting Bloom filters, this is either a boost::array or an std::vector. I see the problem with std::bitset now. In order to serialize the bitset, it would take O(num_bits) operations, rather than the number of blocks (using operator[]). Using an std::vector, the serialization can be accomplished in O(num_elements). It also helps that boost.Serialization provides an implementation for std::vector. Thank you for the insight. I'll work on converting the underlying storage type next week. Phil Endecott-48 wrote:
I see that you still have intersection without discussing the error rate implications.
I will add a section in the tutorial that discusses the implications of union/intersect operations. Phil Endecott-48 wrote:
Your dynamic_bitset has a resize() method. Was that there before? I don't understand how it can work in any useful way.
Yes, this was there initially. The dynamic_basic_bloom_filter is the only dynamic Bloom filter that has the resize() method. I intend to remove this method, as users requiring a resize can instantiate a new dynamic Bloom filter to achieve the same effect. It was a design relic that should've been discarded much sooner, as it is more dangerous to users than useful. Phil Endecott-48 wrote:
The final documentation should mention the complexity of each method.
I've made sure to include the complexity of all methods. This is located at: http://svn.boost.org/svn/boost/sandbox/bloom_filter/trunk/libs/bloom_filter/... In the class references, do you think it would be convenient to make each function name clickable and make them link to the specific entry in the function reference? Thank you, -Alej -- View this message in context: http://boost.2283326.n4.nabble.com/Feedback-Towards-a-Better-Boost-BloomFilt... Sent from the Boost - Dev mailing list archive at Nabble.com.

On Fri, Aug 26, 2011 at 10:02 AM, Alejandro Cabrera <cpp.cabrera@gmail.com> wrote:
Phil Endecott-48 wrote:
There is still no "adaptor" functionality, so I can't mmap() a file to use as the bloom filter's raw content, as I might want to do for e.g. a spellcheck, URL blocklist, etc. etc.
Could you describe this is more detail? I am familiar with the mmap() interface, but how would one go about providing an adaptor so that mmap() could be used as the Bloom filter's raw content?
A possible solution is simply to allow the use of boost.interprocess allocators which support memory mapped files: http://www.boost.org/doc/libs/1_47_0/doc/html/boost/interprocess/allocator.h... HTH, -- gpd

Alejandro Cabrera wrote:
Phil Endecott-48 wrote:
There is still no "adaptor" functionality, so I can't mmap() a file to use as the bloom filter's raw content, as I might want to do for e.g. a spellcheck, URL blocklist, etc. etc.
Could you describe this is more detail? I am familiar with the mmap() interface, but how would one go about providing an adaptor so that mmap() could be used as the Bloom filter's raw content? Would this be accomplished through a constructor, for example:
// dynamic_basic_bloom_filter(void *addr, const size_t len); dynamic_basic_bloom_filter<std::string> bloom(address, length);
It is normally preferable to pass a begin-end pair rather than address and length, but fundamentally yes I would like to be able to construct a read-only bloom filter from a pair of const_iterators i.e. const pointers in this case.
Would you happen to know if there is any work being done on a Boost.Posix or any similar C++ project?
Not relevant.
Phil Endecott-48 wrote:
data() returns a std::bitset, but that doesn't provide access to its data in a form that I can write to a file (e.g. in preparing the data for the above examples). I consider this a fault of std::bitset. I believe you should use a std::vector or array instead.
data() returns the underlying type in each case. For the basic Bloom filter, this is a bitset (std:: or dynamic), and for counting Bloom filters, this is either a boost::array or an std::vector.
I see the problem with std::bitset now. In order to serialize the bitset, it would take O(num_bits) operations, rather than the number of blocks (using operator[]). Using an std::vector, the serialization can be accomplished in O(num_elements). It also helps that boost.Serialization provides an implementation for std::vector. Thank you for the insight. I'll work on converting the underlying storage type next week.
I don't care about serialisation. I just want to be able to const T* p = &(*(bloom_filter.data().begin())); size_t len = sizeof(T) * bloom_filter.data().size(); write(fd,p,len); Regards, Phil.

Phil, Phil Endecott-48 wrote:
It is normally preferable to pass a begin-end pair rather than address and length, but fundamentally yes I would like to be able to construct a read-only bloom filter from a pair of const_iterators i.e. const pointers in this case.
The input iterator constructor is already available. All that remains is what you describe below. Phil Endecott-48 wrote:
I don't care about serialisation. I just want to be able to
const T* p = &(*(bloom_filter.data().begin())); size_t len = sizeof(T) * bloom_filter.data().size(); write(fd,p,len);
Regards, Phil.
All this should take is changing the underlying storage type to something that supports the operations begin() and size(). The intent is to use std::vector, as with the dynamic counting Bloom filter implementations. The operations you've described above are serialization. This will be supported soon, as soon as I can work on the basic Bloom filters. I'll add a test to my suite to make sure there is a way to support this in both directions (write/read, store/load). Thanks, -Alej -- View this message in context: http://boost.2283326.n4.nabble.com/Feedback-Towards-a-Better-Boost-BloomFilt... Sent from the Boost - Dev mailing list archive at Nabble.com.

Alejandro Cabrera wrote:
Phil,
Phil Endecott-48 wrote:
It is normally preferable to pass a begin-end pair rather than address and length, but fundamentally yes I would like to be able to construct a read-only bloom filter from a pair of const_iterators i.e. const pointers in this case.
The input iterator constructor is already available.
No, I don't think so. What you have is a ctor that iterates over a sequence of values and inserts them: template <typename InputIterator> basic_bloom_filter(const InputIterator start, const InputIterator end) { for (InputIterator i = start; i != end; ++i) this->insert(*i); } What I want is a way to adapt a chunk of raw memory, which I've got from mmap(), and to treat it as a bloom filter. I haven't yet seen any "adaptor" functionality in your code.
All that remains is what you describe below.
Phil Endecott-48 wrote:
I don't care about serialisation. I just want to be able to
const T* p = &(*(bloom_filter.data().begin())); size_t len = sizeof(T) * bloom_filter.data().size(); write(fd,p,len);
Regards, Phil.
All this should take is changing the underlying storage type to something that supports the operations begin() and size(). The intent is to use std::vector, as with the dynamic counting Bloom filter implementations. The operations you've described above are serialization. This will be supported soon, as soon as I can work on the basic Bloom filters. I'll add a test to my suite to make sure there is a way to support this in both directions (write/read, store/load).
I really don't want to "load" the filter i.e. to do something that needs O(N) time and O(N) memory. I just want to mmap the file into memory and "declare" the bloom filter into existence in O(1) time. Then when I eventually use it, pages will be faulted-in on demand. The aim is to ensure that I can have a perhaps very large filter and yet to perform small numbers of queries quite quickly. Regards, Phil.
participants (3)
-
Alejandro Cabrera
-
Giovanni Piero Deretta
-
Phil Endecott