Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project

25 Jun 2011

      Alejandro Cabrera wrote:
...
...
Sometimes it would be useful to access the underlying data.  For
example, say I have a spell check dictionary or a URL block-list; I
might want to compile that list into a bloom filter and ship that as a
binary file.  For that I'd want to be able to use it as an adaptor on
top of an underlying container that I supply:
// compile:
std::vector<uint32_t>  data(100000);
bloom_filter b(data);  // acts as a container adaptor
b.insert(...);
file f;
f.write(data.begin(),data.end());
// use:
memory_mapped_file f("dictionary.bloom");
bloom_filter b(f);
if (b.contains(url)) ....
I'm not sure how this would work. Could you elaborate further in this 
regard?
Currently, as I understand, there's two components of interest. The 
first is to be able to access the underlying Bloom filter data for 
serialization/deserialization purposes. This should be easy enough to do 
given a function that exposes the Bloom filter data. The other component 
that I'm picking up on is to be able to construct a Bloom filter using 
various sources. Range construction will be added that semantically 
behaves as an insertion for each element in the range. However, I'm not 
sure how to support construction from a memory_mapped_file.
OK, let me expand on the example I started above.  Say I have a long 
list of "banned" URLs and an interactive program that must check URLs 
entered by the user against that list.  I pre-compile the bloom filter 
and save its raw binary content to a file.  Then in the application, I 
map that data into memory.  The point about memory-mapping is that the 
data is not actually read from the file until it is used, so in this 
case the bloom filter's ctor can be very fast; when the user types in a 
URL, only those pages of the file that contain the bits corresponding 
to the URL's hashes are read.  This is both faster and more 
memory-efficient than loading all of the data.

In order for this to work, it's only necessary that I can make a bloom 
filter that is an adaptor over a range; the range that I provide will 
be a pair of pointers that refer to the memory region where the file is 
mapped.  I.e. something like this (pseudo-code):

int fd = open("banned_urls.dat", O_RDONLY);
const char* begin = mmap(0, size, PROT_READ, fd, 0);
const char* end = begin + size;
bloom_filter_adaptor<const char*> banned_urls(begin,end);

Currently you're using std::bitset for the underlying data.  Since that 
doesn't (AFAIK) have a way to adapt an underlying implementation data 
type, I think you'll need to replace it with a wrapper that does do 
that.  Then you can have something like this:

template <typename ITER_T>
class bloom_filter_adaptor {
   typedef bitset_adaptor<ITER_T> impl_t;
   impl_t impl;
public:
   bloom_filter_adaptor(ITER_T begin, ITER_T end):
     impl(begin,end)
   {}
   ...
};

(An alternative to taking an iterator-pair is to take a container and 
store a reference to it.)

Then the non-adaptor version, for when that flexibility is not needed:

template <size_t SIZE>
struct bloom_filter_base {
   typedef std::vector<uint32_t> data_t;
   data_t data;
   bloom_filter_base():
     data(0,SIZE)
   {}
};

template <size_t SIZE>
class bloom_filter:
   bloom_filter_base<SIZE>,
   bloom_filter_adaptor<bloom_filter_base::data_t>
{
public:
   bloom_filter():
     bloom_filter_adaptor(data.begin(),data.end())
   {}
};

...or something like that.

Regards,  Phil.

Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project

Phil Endecott