
On Fri, Jun 19, 2009 at 2:59 PM, Milosz Marian HULBOJ <mhulboj@hulboj.org> wrote:
Dean Michael Berris wrote:
When you say the correct set of interfaces, what do you need more than 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the case of a counting bloom filter, maybe you'd have 'remove(...)' or 'erase(...)'. Other than these functions, I don't know what you mean by the most correct set of interfaces.
For me it means the possibility of defining different variants of Bloom Filters (standard or extensions) in an easy way. Something more than the textbook version.
Right, then I guess the Concepts documentation should be enough -- saying that a type models the BloomFilter concept when it supports the following restrictions/interface: ...
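Roughly the sort of thing that documentation would state, as a sketch only (the check function and names here are illustrative, not the candidate library's actual code):

    // Rough sketch of a concept check for a type BF modelling BloomFilter;
    // only insert/contains are required of the basic filter, while remove()
    // would belong to a separate CountingBloomFilter refinement.
    template <class BF, class T>
    void bloom_filter_concept_check(BF & bf, T const & value)
    {
        bf.insert(value);               // must accept any value of type T
        bool hit = bf.contains(value);  // true here, but in general a 'true'
        (void)hit;                      // answer can be a false positive
    }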
There are various (many) use-cases for a BF, for example:
1. the user provides max size and max fpp; the BF determines the best k
2. the user provides max fpp and max expected insertions; the BF determines the best k and m
3. the user provides max fpp and a fuzzy k with a limit; the BF determines m, k, and the optimal fpp
4. etc., etc.
Some of these can change/vary at runtime, so getters for the fpp and the inserted count would be nice/good/expected.
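For reference, the textbook arithmetic behind these use-cases is the standard sizing rule; a minimal sketch with plain helper functions (not part of any proposed interface), assuming n expected insertions and a target fpp p:

    #include <cmath>
    #include <cstddef>

    // Textbook sizing: for n expected insertions and target false positive
    // probability p,
    //   m = ceil(-n * ln(p) / (ln 2)^2)  bits
    //   k = round((m / n) * ln(2))       hash functions
    std::size_t optimal_m(std::size_t n, double p)
    {
        double const ln2 = std::log(2.0);
        return static_cast<std::size_t>(
            std::ceil(-(static_cast<double>(n) * std::log(p)) / (ln2 * ln2)));
    }

    std::size_t optimal_k(std::size_t m, std::size_t n)
    {
        return static_cast<std::size_t>(
            (static_cast<double>(m) / static_cast<double>(n)) * std::log(2.0) + 0.5);
    }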
This is too complex -- and makes too many assumptions. First, it assumes that the user wants a self-(re)sizing bloom filter. Second, it assumes that the user doesn't want to provide their own hash functions. Third, it assumes that the user actually cares whether the bloom filter should be smarter than the user.
1) The mentioned assumptions don't imply a dynamic self-resizing filter. But it would be nice to look at the recent papers about dynamic self-resizing filters...
Actually, it does -- it's implying that you just give a false positive probability and it computes the number of hash functions (k) and the size of the container (m). Maybe not re-sizing (I've read a few papers that imply or suggest the possibility and practicability of that), but self-sizing at construction is IMO too fancy. Maybe a convenience function and meta-function would work:

    template <double FPP>
    struct m_and_k_computer;

    m_and_k_computer<0.09>::k; // yields the number of hash functions
    m_and_k_computer<0.09>::m; // yields the number of bits used to represent the filter
2) Why does it assume that the user doesn't want custom hash functions?
Because none of the items in the numbered list presented earlier mentioned the possibility of a custom hash function. ;-)
3) It's not about being smarter, but about providing some of the theory/math behind the filters in a convenient form.
Sure, but it's completely possible to layer these things on top of the underlying data structure, which is the *_bloom_filter. I mean, computing the false positive probability is independent of the actual workings of a bloom filter. I can imagine a layer of convenience functions (factories really) that does everything above and uses the current bloom filter implementation as the resulting type.

What I'm really saying is that these things (ways to compute K and M, having different hash functions, etc.) are nice to have, but are they really part of the basic notion of what a Bloom Filter is or how it actually behaves? There is an algorithmically simple template for the 'insert' and 'contains' operations, and sure, there can be more optimizations at that level -- I'm definitely open to suggestions. However, now the question becomes: should these things be part of a Bloom Filter library? Maybe yes, maybe no -- maybe, like the type of elements you represent and your actual requirements in the application, it should be determined by the users.
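For example, the expected false positive probability can be estimated without touching the filter's internals at all; a minimal sketch of such a free-standing convenience function (the name is made up here):

    #include <cmath>
    #include <cstddef>

    // Expected false positive probability of a filter with m bits and k hash
    // functions after n insertions: (1 - e^(-k*n/m))^k.  Nothing here needs
    // access to the filter's internals.
    double estimated_fpp(std::size_t m, std::size_t k, std::size_t n)
    {
        double const exponent =
            -(static_cast<double>(k) * static_cast<double>(n)) / static_cast<double>(m);
        return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
    }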
In some cases (like the simplest cases I'm trying to solve for the moment) the current implementation seems sufficient. Of course, it's not fancy, and I don't intend for it to become too fancy. If you think you can implement something better, then please, by all means, show me code that I can deal with -- this is not said sarcastically; I really do want to see what you mean when all of the above is translated into code.
And I am implying that the simplest, naïve case is far from real-world applications. One should not focus on the simplest case, but attempt to provide a good solution.
Hmmm... I'm providing a simple means of defining a bloom filter given the specifications. Now, if you need a certain FPP met, then maybe you should be the one computing K and M -- and if you want to use special hash functions, then maybe you should provide them instead of using the defaults. The defaults should make it convenient to use, but of course, if the genericity of the simplest case doesn't work for you, you have the option to specialize to your own requirements.
Anyway, the point is to provide a bloom filter using the textbook definition. If there's a need for other kinds of bloom filters, like counting bloom filters, then that's another implementation that can be bundled in the library later -- *if* there is enough need for it, and *if* the library even gets into Boost just trying to solve the common (basic) cases. Of course the library is designed to evolve.
Yet again -- the simplest case. If the library is designed to evolve, one should take much care when designing the interfaces; they are not easy to change afterwards. Providing counting modes or a full-fledged counting filter should not be a problem.
I'm a little confused here: what else do you need from a bloom filter other than 'bf.insert(x)' and 'bf.contains(x)'? This is the basic interface to any bloom filter IIRC, and I don't get why sticking to that will necessarily make it harder to change in the future.

If you're talking about the type interface, it's going to be way harder to change if you have a single bloom_filter type with loads and loads of template parameters that define the body of the bloom_filter, IMO. This is why I chose to separate the static_bloom_filter (the statically-sized bloom filter, that is) and the runtime-sized bloom_filter templates. If there's a special kind of bloom_filter that needs to be implemented in the future (like a counting bloom filter, or a compressed bloom filter), then that's a different type but with similar interfaces to the basic bloom_filter.
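To illustrate that last point, here is a rough sketch of what a counting variant's core could look like -- the same insert/contains surface, plus remove(). The double hashing scheme (h1 + i*h2) is just one common way to derive the k indexes; none of this is the proposed library's code:

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Counting-filter core: small counters instead of bits, so removal
    // becomes possible while insert/contains keep the same meaning.
    template <class T>
    class counting_filter_sketch
    {
        std::vector<unsigned char> counters_;
        std::size_t k_;

        std::size_t index(T const & v, std::size_t i) const
        {
            std::size_t h1 = std::hash<T>()(v);
            std::size_t h2 = static_cast<std::size_t>(h1 * 0x9e3779b97f4a7c15ULL) | 1;
            return (h1 + i * h2) % counters_.size();
        }

    public:
        counting_filter_sketch(std::size_t m, std::size_t k)
            : counters_(m), k_(k) {}

        void insert(T const & v)
        {
            for (std::size_t i = 0; i < k_; ++i)
                if (counters_[index(v, i)] != 255) ++counters_[index(v, i)]; // saturate
        }

        void remove(T const & v)  // only sound if v was actually inserted
        {
            for (std::size_t i = 0; i < k_; ++i)
                if (counters_[index(v, i)] != 0) --counters_[index(v, i)];
        }

        bool contains(T const & v) const
        {
            for (std::size_t i = 0; i < k_; ++i)
                if (counters_[index(v, i)] == 0) return false;
            return true;  // may still be a false positive
        }
    };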
No, not clear. A bloom filter is, to a degree, error resilient/tolerant. If for some reason part of your bloom filter data is damaged (network I/O, bad memory, etc.) you can erase that section (set all the bits high); doing so will increase the fpp, by an amount depending on how large the section is. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.
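In code the recovery step is just "force the damaged range to all ones"; a sketch over a plain bit vector (not the proposed filter type):

    #include <cstddef>
    #include <vector>

    // If bits [first, last) of the filter's bit array are suspect (bad I/O,
    // bad memory, ...), forcing them to 1 keeps lookups free of false
    // negatives; the price is a higher fpp, proportional to the size of the
    // erased section.
    void erase_damaged_section(std::vector<bool> & bits,
                               std::size_t first, std::size_t last)
    {
        for (std::size_t i = first; i < last && i < bits.size(); ++i)
            bits[i] = true;
    }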
If you want to go down this road, then think about implementing your own allocator with error correction support and using that in the bloom_filter, or in any container that supports/uses custom allocators. Maybe you'll want an STL vector that also knows how to determine whether part of its data is corrupted, too?
Please read the paper that I mentioned in one of the earlier posts. There are also uses for the intentional introduction of errors into the BF.
Hmmm... What I'm implying here is that the "generic" data structure should not be bothered with details like these -- whether the memory gets corrupted, etc. IMO you deal with network transmission errors at the network transmission layer, with memory errors at the memory level, and with each class of error at the appropriate level of detail. Now, if we're going to start writing self-aware containers, then maybe there should be containers that can actually heal themselves in case of memory corruption. Those would be nice to have, but I doubt they'd be a lot better than the containers we already have at the moment, which *aren't* storage/network/self-aware, except in the special case when memory gets corrupted.
This is NOT data compression. If you are providing union operations, then you should also provide self-compression -- that is, for example, OR'ing the 1st half of the filter with the 2nd half, decreasing its size.
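A sketch of that folding step over a plain bit vector, assuming an even m and that subsequent lookups reduce their positions modulo the new, halved size:

    #include <cstddef>
    #include <vector>

    // Fold a filter of even size m down to m/2 by OR'ing the two halves.
    // Because (h mod m) mod (m/2) == h mod (m/2) when m is even, a lookup
    // that reduces its bit positions modulo the new size still finds every
    // bit that was set -- at the price of a higher false positive rate.
    std::vector<bool> fold_in_half(std::vector<bool> const & bits)
    {
        std::size_t const half = bits.size() / 2;
        std::vector<bool> folded(half);
        for (std::size_t i = 0; i < half; ++i)
            folded[i] = bits[i] || bits[i + half];
        return folded;
    }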
I don't understand why you want to do this when clearly the Bloom Filter is meant to have a fixed number of bits. Any variation to this in transit is beyond the scope of the implementation of the Bloom Filter.
Again, there is some motivation and research behind compression. The trade-off between size, lookup time, and FPR is one of the reasons.
Yes, I understand this. Now if you need/want a compressed bloom filter, maybe a compressed_bloom_filter type would be more appropriate. No?
Let me rehash a bit: the implementation is meant to be simple, portable across platforms, and C++ standard-compliant. When you deal with these issues, it's best you don't think about cache hit-miss and coherency issues, IMO. At any rate, if you want better control over memory allocation and issues like these, you can get it by: a) using your favorite STL implementation's std::bitset in the case of the statically-sized filter, or b) providing your own allocator and using the "correct" block type in the case of a runtime-constructed Bloom Filter.
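In plain standard C++ terms the two options look roughly like this; how the storage actually plugs into the candidate bloom_filter depends on its template parameters, so the typedefs below are only illustrative:

    #include <bitset>
    #include <cstdint>
    #include <memory>
    #include <vector>

    // (a) statically-sized filter: the bit storage is whatever std::bitset
    //     your STL implementation of choice provides.
    typedef std::bitset<1u << 20> static_storage;   // size fixed at compile time

    // (b) runtime-constructed filter: pick the block type and the allocator
    //     that manages it.  std::allocator is shown only so the sketch
    //     compiles; in practice this is where a user-provided allocator with
    //     its own placement / error-handling policy would go.
    typedef std::uint64_t block_type;
    typedef std::vector<block_type, std::allocator<block_type> > runtime_storage;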
And the implementation can be perfectly C++ compliant and portable while including variants of the Bloom filter that pay attention to the cache hit/miss ratio and coherency. And it is not only an issue of memory allocation itself; it has much to do with the internal workings of the filter.
Yes, I agree. Now, what do these special bloom filters look like? How do they behave? Can I possibly implement all these different kinds of special bloom filters? Maybe yes, maybe no, but what I'm saying is that for the most part the goal is to make Bloom Filters simple and easy to use.

It's the same as with the STL vector: some people have chosen not to use it for a myriad of reasons just like the ones you mention (cache coherency, etc.), but for the simplest cases where you need a homogeneous container I believe the STL vector is enough. The same reasoning goes here: if you need a simple bloom filter, then the candidate Boost.Bloom_filter might work for you. If not, maybe a special one in the library would work better for you. Or you might just have to write one for your specific needs.
Of course, if the library under construction isn't suiting your needs, there should be room for improvement. But there's no need to be too fancy with the implementation/interface or too clever with the False Positive Probability -- you can control this easily by choosing/implementing your own hash functions and the size of the underlying bitset.
You can do more than choosing a custom hash and size.
Yes, I know this -- but the point I was trying to make was that, for the most part, the simplicity of the proposed solution should do _for the simplest cases_.

HTH

--
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com