
Alejandro Cabrera wrote:
Sometimes it would be useful to access the underlying data. For example, say I have a spell check dictionary or a URL block-list; I might want to compile that list into a bloom filter and ship that as a binary file. For that I'd want to be able to use it as an adaptor on top of an underlying container that I supply:
// compile: std::vector<uint32_t> data(100000); bloom_filter b(data); // acts as a container adaptor b.insert(...); file f; f.write(data.begin(),data.end());
// use: memory_mapped_file f("dictionary.bloom"); bloom_filter b(f); if (b.contains(url)) .... I'm not sure how this would work. Could you elaborate further in this regard?
Currently, as I understand, there's two components of interest. The first is to be able to access the underlying Bloom filter data for serialization/deserialization purposes. This should be easy enough to do given a function that exposes the Bloom filter data. The other component that I'm picking up on is to be able to construct a Bloom filter using various sources. Range construction will be added that semantically behaves as an insertion for each element in the range. However, I'm not sure how to support construction from a memory_mapped_file.
OK, let me expand on the example I started above. Say I have a long list of "banned" URLs and an interactive program that must check URLs entered by the user against that list. I pre-compile the bloom filter and save its raw binary content to a file. Then in the application, I map that data into memory. The point about memory-mapping is that the data is not actually read from the file until it is used, so in this case the bloom filter's ctor can be very fast; when the user types in a URL, only those pages of the file that contain the bits corresponding to the URL's hashes are read. This is both faster and more memory-efficient than loading all of the data. In order for this to work, it's only necessary that I can make a bloom filter that is an adaptor over a range; the range that I provide will be a pair of pointers that refer to the memory region where the file is mapped. I.e. something like this (pseudo-code): int fd = open("banned_urls.dat", O_RDONLY); const char* begin = mmap(0, size, PROT_READ, fd, 0); const char* end = begin + size; bloom_filter_adaptor<const char*> banned_urls(begin,end); Currently you're using std::bitset for the underlying data. Since that doesn't (AFAIK) have a way to adapt an underlying implementation data type, I think you'll need to replace it with a wrapper that does do that. Then you can have something like this: template <typename ITER_T> class bloom_filter_adaptor { typedef bitset_adaptor<ITER_T> impl_t; impl_t impl; public: bloom_filter_adaptor(ITER_T begin, ITER_T end): impl(begin,end) {} ... }; (An alternative to taking an iterator-pair is to take a container and store a reference to it.) Then the non-adaptor version, for when that flexibility is not needed: template <size_t SIZE> struct bloom_filter_base { typedef std::vector<uint32_t> data_t; data_t data; bloom_filter_base(): data(0,SIZE) {} }; template <size_t SIZE> class bloom_filter: bloom_filter_base<SIZE>, bloom_filter_adaptor<bloom_filter_base::data_t> { public: bloom_filter(): bloom_filter_adaptor(data.begin(),data.end()) {} }; ...or something like that. Regards, Phil.