
On Fri, Jun 19, 2009 at 2:59 PM, Milosz Marian HULBOJ <mhulboj@hulboj.org> wrote:
Dean Michael Berris wrote:
When you say the correct set of interfaces, what do you need more than 'insert(...)' and 'contains(...)' for the basic bloom_filter? In the case of a counting bloom filter, maybe you'd have 'remove(...)' or 'erase(...)'. Other than these functions, I don't know what you mean by the most correct set of interfaces.
For me it means the possibility of defining different variants of Bloom Filters (standard or extensions) in an easy way. Something more than the textbook version.
Right, then I guess the Concepts documentation should be enough -- saying that a type models the BloomFilter concept when it supports the following restrictions/interface: ...
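Roughly the sort of thing that documentation would state, as a sketch only (the check function and names here are illustrative, not the candidate library's actual code):

    // Rough sketch of a concept check for a type BF modelling BloomFilter;
    // only insert/contains are required of the basic filter, while remove()
    // would belong to a separate CountingBloomFilter refinement.
    template <class BF, class T>
    void bloom_filter_concept_check(BF & bf, T const & value)
    {
        bf.insert(value);               // must accept any value of type T
        bool hit = bf.contains(value);  // true here, but in general a 'true'
        (void)hit;                      // answer can be a false positive
    }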
There are various (many) use-cases for a BF, for example:
1. the user provides max size and max fpp; the BF determines the best k
2. the user provides max fpp and max expected insertions; the BF determines the best k and m
3. the user provides max fpp and a fuzzy k with a limit; the BF determines m, k, and the optimal fpp
4. etc., etc.
Some of these can change/vary at runtime, so getters for the fpp and the inserted count would be nice/good/expected.
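For reference, the textbook arithmetic behind these use-cases is the standard sizing rule; a minimal sketch with plain helper functions (not part of any proposed interface), assuming n expected insertions and a target fpp p:

    #include <cmath>
    #include <cstddef>

    // Textbook sizing: for n expected insertions and target false positive
    // probability p,
    //   m = ceil(-n * ln(p) / (ln 2)^2)  bits
    //   k = round((m / n) * ln(2))       hash functions
    std::size_t optimal_m(std::size_t n, double p)
    {
        double const ln2 = std::log(2.0);
        return static_cast<std::size_t>(
            std::ceil(-(static_cast<double>(n) * std::log(p)) / (ln2 * ln2)));
    }

    std::size_t optimal_k(std::size_t m, std::size_t n)
    {
        return static_cast<std::size_t>(
            (static_cast<double>(m) / static_cast<double>(n)) * std::log(2.0) + 0.5);
    }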
This is too complex -- and makes too many assumptions. First, it assumes that the user wants a self-(re)sizing bloom filter. Second, it assumes that the user doesn't want to provide their own hash functions. Third, it assumes that the user actually cares whether the bloom filter should be smarter than the user.
1) The mentioned assumptions don't imply a dynamic self-resizing filter. But it would be nice to look at the recent papers about dynamic self-resizing filters...
Actually, it does -- it's implying that you just give a false positive probability and it computes the number of hash functions (k) and the size of the container (m). Maybe not re-sizing (I've read a few papers that imply or suggest the possibility and practicability of that), but self-sizing at construction is IMO too fancy. Maybe a convenience function and meta-function would work:

    template <double FPP>
    struct m_and_k_computer;

    m_and_k_computer<0.09>::k; // yields the number of hash functions
    m_and_k_computer<0.09>::m; // yields the number of bits used to represent the filter
2) Why does it assume that the user doesn't want custom hash functions?
Because none of the items in the numbered list presented earlier mentioned the possibility of a custom hash function. ;-)
3) It's not about being smarter, but about providing some of the theory/math behind the filters in a convenient form.
Sure, but it's completely possible to layer these things on top of the underlying data structure, which is the *_bloom_filter. I mean, computing the false positive probability is independent of the actual workings of a bloom filter. I can imagine a layer of convenience functions (factories really) that does everything above and uses the current bloom filter implementation as the resulting type.

What I'm really saying is that these things (ways to compute K and M, having different hash functions, etc.) are nice to have, but are they really part of the basic notion of what a Bloom Filter is or how it actually behaves? There is an algorithmically simple template for the 'insert' and 'contains' operations, and sure, there can be more optimizations at that level -- I'm definitely open to suggestions. However, now the question becomes: should these things be part of a Bloom Filter library? Maybe yes, maybe no -- maybe, like the type of elements you represent and your actual requirements in the application, it should be determined by the users.
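For example, the expected false positive probability can be estimated without touching the filter's internals at all; a minimal sketch of such a free-standing convenience function (the name is made up here):

    #include <cmath>
    #include <cstddef>

    // Expected false positive probability of a filter with m bits and k hash
    // functions after n insertions: (1 - e^(-k*n/m))^k.  Nothing here needs
    // access to the filter's internals.
    double estimated_fpp(std::size_t m, std::size_t k, std::size_t n)
    {
        double const exponent =
            -(static_cast<double>(k) * static_cast<double>(n)) / static_cast<double>(m);
        return std::pow(1.0 - std::exp(exponent), static_cast<double>(k));
    }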
In some cases (like the simplest cases I'm trying to solve for the moment) the current implementation seems sufficient. Of course, it's not fancy, and I don't intend for it to become too fancy. If you think you can implement something better, then please, by all means, show me code that I can deal with -- this is not said sarcastically; I really do want to see what you mean when all of the above is translated into code.
And I am implying that the simplest, naïve case is far from real-world applications. One should not focus on the simplest case, but attempt to provide a good solution.
Hmmm... I'm providing a simple means of defining a bloom filter given the specifications. Now, if you need a certain FPP met, then maybe you should be the one computing K and M -- and if you want to use special hash functions, then maybe you should provide them instead of using the defaults. The defaults should make it convenient to use, but of course, if the genericity of the simplest case doesn't work for you, you have the option to specialize to your own requirements.
Anyway, the point is to provide a bloom filter using the textbook definition. If there's a need for other kinds of bloom filters, like counting bloom filters, then that's another implementation that can be bundled in the library later -- *if* there is enough need for it, and *if* the library even gets into Boost just trying to solve the common (basic) cases. Of course the library is designed to evolve.
Yet again -- the simplest case. If the library is designed to evolve, one should take much care when designing the interfaces; they are not easy to change afterwards. Providing counting modes or a full-fledged counting filter should not be a problem.
I'm a little confused here: what else do you need from a bloom filter other than 'bf.insert(x)' and 'bf.contains(x)'? This is the basic interface to any bloom filter IIRC, and I don't get why sticking to that will necessarily make it harder to change in the future.

If you're talking about the type interface, it's going to be way harder to change if you have a single bloom_filter type with loads and loads of template parameters that define the body of the bloom_filter, IMO. This is why I chose to separate the static_bloom_filter (the statically-sized bloom filter, that is) and the runtime-sized bloom_filter templates. If there's a special kind of bloom_filter that needs to be implemented in the future (like a counting bloom filter, or a compressed bloom filter), then that's a different type but with similar interfaces to the basic bloom_filter.
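To illustrate that last point, here is a rough sketch of what a counting variant's core could look like -- the same insert/contains surface, plus remove(). The double hashing scheme (h1 + i*h2) is just one common way to derive the k indexes; none of this is the proposed library's code:

    #include <cstddef>
    #include <functional>
    #include <vector>

    // Counting-filter core: small counters instead of bits, so removal
    // becomes possible while insert/contains keep the same meaning.
    template <class T>
    class counting_filter_sketch
    {
        std::vector<unsigned char> counters_;
        std::size_t k_;

        std::size_t index(T const & v, std::size_t i) const
        {
            std::size_t h1 = std::hash<T>()(v);
            std::size_t h2 = static_cast<std::size_t>(h1 * 0x9e3779b97f4a7c15ULL) | 1;
            return (h1 + i * h2) % counters_.size();
        }

    public:
        counting_filter_sketch(std::size_t m, std::size_t k)
            : counters_(m), k_(k) {}

        void insert(T const & v)
        {
            for (std::size_t i = 0; i < k_; ++i)
                if (counters_[index(v, i)] != 255) ++counters_[index(v, i)]; // saturate
        }

        void remove(T const & v)  // only sound if v was actually inserted
        {
            for (std::size_t i = 0; i < k_; ++i)
                if (counters_[index(v, i)] != 0) --counters_[index(v, i)];
        }

        bool contains(T const & v) const
        {
            for (std::size_t i = 0; i < k_; ++i)
                if (counters_[index(v, i)] == 0) return false;
            return true;  // may still be a false positive
        }
    };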
No, not clear. A bloom filter is, to a degree, error resilient/tolerant. If for some reason part of your bloom filter data is damaged (network I/O, bad memory, etc.) you can erase that section (set all the bits high); doing so will increase the fpp, by an amount depending on how large the section is. For some uses that is still quite acceptable, as the fpp has already been determined to be orders of magnitude less than what can be tolerated.
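In code the recovery step is just "force the damaged range to all ones"; a sketch over a plain bit vector (not the proposed filter type):

    #include <cstddef>
    #include <vector>

    // If bits [first, last) of the filter's bit array are suspect (bad I/O,
    // bad memory, ...), forcing them to 1 keeps lookups free of false
    // negatives; the price is a higher fpp, proportional to the size of the
    // erased section.
    void erase_damaged_section(std::vector<bool> & bits,
                               std::size_t first, std::size_t last)
    {
        for (std::size_t i = first; i < last && i < bits.size(); ++i)
            bits[i] = true;
    }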
If you want to go down this road, then think about implementing your own allocator with error correction support and using that in the bloom_filter, or in any container that supports/uses custom allocators. Maybe you'll want an STL vector that also knows how to determine whether part of its data is corrupted, too?
Please read the paper that I mentioned in one of the earlier posts. There are also uses for the intentional introduction of errors into the BF.
Hmmm... What I'm implying here is that the "generic" data structure should not be bothered with details like these -- whether the memory gets corrupted, etc. IMO you deal with network transmission errors at the network transmission layer, with memory errors at the memory level, and with each class of error at the appropriate level of detail. Now, if we're going to start writing self-aware containers, then maybe there should be containers that can actually heal themselves in case of memory corruption. Those would be nice to have, but I doubt they'd be a lot better than the containers we already have at the moment, which *aren't* storage/network/self-aware, except in the special case when memory gets corrupted.
This is NOT data compression. If you are providing union operations, then you should also provide self-compression -- that is, for example, OR'ing the 1st half of the filter with the 2nd half, decreasing its size.
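A sketch of that folding step over a plain bit vector, assuming an even m and that subsequent lookups reduce their positions modulo the new, halved size:

    #include <cstddef>
    #include <vector>

    // Fold a filter of even size m down to m/2 by OR'ing the two halves.
    // Because (h mod m) mod (m/2) == h mod (m/2) when m is even, a lookup
    // that reduces its bit positions modulo the new size still finds every
    // bit that was set -- at the price of a higher false positive rate.
    std::vector<bool> fold_in_half(std::vector<bool> const & bits)
    {
        std::size_t const half = bits.size() / 2;
        std::vector<bool> folded(half);
        for (std::size_t i = 0; i < half; ++i)
            folded[i] = bits[i] || bits[i + half];
        return folded;
    }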
I don't understand why you want to do this when clearly the Bloom Filter is meant to have a fixed number of bits. Any variation to this in transit is beyond the scope of the implementation of the Bloom Filter.
Again, there is some motivation and research behind compression. The trade-off between size, lookup time, and FPR is one of the reasons.
Yes, I understand this. Now if you need/want a compressed bloom filter, maybe a compressed_bloom_filter type would be more appropriate. No?
Let me rehash a bit: the implementation is meant to be simple, portable across platforms, and C++ standard-compliant. When you deal with these issues, it's best you don't think about cache hit-miss and coherency issues, IMO. At any rate, if you want better control over memory allocation and issues like these, you can get it by: a) using your favorite STL implementation's std::bitset in the case of the statically-sized filter, or b) providing your own allocator and using the "correct" block type in the case of a runtime-constructed Bloom Filter.
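In plain standard C++ terms the two options look roughly like this; how the storage actually plugs into the candidate bloom_filter depends on its template parameters, so the typedefs below are only illustrative:

    #include <bitset>
    #include <cstdint>
    #include <memory>
    #include <vector>

    // (a) statically-sized filter: the bit storage is whatever std::bitset
    //     your STL implementation of choice provides.
    typedef std::bitset<1u << 20> static_storage;   // size fixed at compile time

    // (b) runtime-constructed filter: pick the block type and the allocator
    //     that manages it.  std::allocator is shown only so the sketch
    //     compiles; in practice this is where a user-provided allocator with
    //     its own placement / error-handling policy would go.
    typedef std::uint64_t block_type;
    typedef std::vector<block_type, std::allocator<block_type> > runtime_storage;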
And the implementation can be perfectly C++ compliant and portable while including variants of the Bloom filter that pay attention to the cache hit/miss ratio and coherency. And it is not only an issue of memory allocation itself; it has much to do with the internal workings of the filter.
Yes, I agree. Now, what do these special bloom filters look like? How do they behave? Can I possibly implement all these different kinds of special bloom filters? Maybe yes, maybe no, but what I'm saying is that for the most part the goal is to make Bloom Filters simple and easy to use.

It's the same as with the STL vector: some people have chosen not to use it for a myriad of reasons just like the ones you mention (cache coherency, etc.), but for the simplest cases where you need a homogeneous container I believe the STL vector is enough. The same reasoning goes here: if you need a simple bloom filter, then the candidate Boost.Bloom_filter might work for you. If not, maybe a special one in the library would work better for you. Or you might just have to write one for your specific needs.
Of course, if the library under construction isn't suiting your needs, there should be room for improvement. But there's no need to be too fancy with the implementation/interface or too clever with the False Positive Probability -- you can control this easily by choosing/implementing your own hash functions and the size of the underlying bitset.
You can do more than choosing a custom hash and size.
Yes, I know this -- but the point I was trying to make was that, for the most part, the simplicity of the proposed solution should do _for the simplest cases_.

HTH

--
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com