Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)

9 Jun 2009

      Hi Manuel,

On Tue, Jun 9, 2009 at 8:45 PM, Manuel
Fiorelli<manuel.fiorelli@gmail.com> wrote:
...
About the implementation:
If I understand your implementation, you do something like that:
given a value V,
you compute a number X (say the digest of V),
then you compute K different values applying the transformation (X +
offset) mod(M)
Right. This naive filter will almost always create K consecutive
indexes (or results) from the hash functions.
...
I don't know if this produces k hash functions suitable for bloom
filters, but, why don't you use Boost.hash to compute X? Also, it
would be easier to define the requirements on the Input type.
I've considered using Boost.Hash to compute the hash values -- the
naive implementation up there is meant to be a sample. I'm in the
process of (re)thinking how the functions are made part of the type
instead. Maybe instead of just Boost.Array's at construction time I'll
consider Fusion-compatible sequences that are part of the signature
(thus the type of the filter).
...
About the documentation:
I think that the following paragraph in the documentation must be reformulated
<snip>
To find out whether we've encountered a certain URL before, we simply
pass the URL to the same three hash functions. If we see that all the
positions returned by the three hash functions are all 1, then we
might have seen the URL before -- this is called a false positive.
</snip>
indeed, you cannot say that the URL is a false positive, since it
could be equal to an URL stored previously.
Right, I need to make that a little more clear. Thanks for pointing it out.
...
IMHO, you should describe the mathematical principles in more detail
and, moreover, you should discuss in more detail the false positives
and how to control their rate.
I agree. I'll work on it some more and gather more resources worth referencing.
...
Even if it isn't mandatory, it would be nice if you write your
documentation in QuickBook format, since it will uniform both the
documentation toolchain and the documentation look.
I chose to write in ReStructuredText first to make the documentation
generation cycle faster -- it should be easier later to create
Quickbook documentation from the RST sources if I remember my
quickbook correctly. ;-)
...
Also, don't forgive to synchronize the documentation with the current interface.
Right. More work for me then. ;-)

Thanks for the insights! :-D

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com