Re: [boost] Boost.Bloom_filter now in Sandbox (WAS Re: Interest in a Bloom Filter?)

10 Jun 2009

      On Wed, Jun 10, 2009 at 1:51 AM, Dean Michael
Berris<mikhailberis@gmail.com> wrote:
...
On Tue, Jun 9, 2009 at 11:58 PM, <joaquin@tid.es> wrote:
...
Umm, fair point :) this raises two points:
2.1. Allocators are still needed if you implement a variant where M (and
maybe K)
are passed at construction time.
Indeed. I'm leaning towards making this the default behavior using
Boost.Dynamic_bitset instead and implementing the
static_bloom_filter<> template that uses std::bitset internally now.
Now I've decided that boost::bloom_filter<> will take a runtime
parameter which is the size of the internal bitset (a
Boost.Dynamic_bitset) -- and the static version that uses std::bitset
is boost::static_bloom_filter<>.
...
...
2.2. Is it wise to use the stack-based std::bitset when M is huge? I think
it's not
unlikely that M can be very large (say millions) which can easily lead to
stack overflow problems. This might bring allocators back into scene.
You're right, it doesn't make it practical to use huge bitsets that
sit on the stack. Perhaps Boost.Dynamic_bitset is my friend. ;-)
This has been addressed in revision 53785 of the Boost Sandbox. :-)
...
...
...
...
One thing I might be amenable to is a static K and a runtime M, but
that makes the hash functions unable to determine statically the
bounds of the bitset (M). I can change the concept for the hash
function, but the simplicity of size_t(Input) just makes it easier for
users I think.
I'd ask you to think this over before dismissing it: I can envision
scenarios where M
is not known at compile time (for instance, when dealing with external data
whose
size is not known beforehand).
We can easily enter into overdesign land, but you can always let this be
driven by
a policy, so that everybody's happy.
Now that you mention it, I agree. :-) First I'll do two things:
- Make the runtime-constructed bloom_filter the default implementation
(uses Boost.Dynamic_bitset and allows for custom allocators)
This is done.
...
- Rename the current implementation as static_bloom_filter that takes
Input, M, and a Fusion Sequence of Hash Function types as arguments.
Also done. :)
...
...
...
Done, but I've run into the issue of not being able to compare
boost::function objects of the same type -- for some reason this has
been disabled in Boost Trunk.
Ummm, this is tricky, because of course it does not suffice to test the
state only --it
critically depends on the hash functions to be used (At first sight I
thought you only
needed to test state and ignore auxiliary objects like std containers do).
Yup. But apparently, if I make the hash functions part of the type
instead of be function objects passed in at runtime, that would solve
the problem (based on earlier comments in the thread).
Which is also done. :D
...
...
Intrusiveness is not necessarily related with having a separate
serialize.hpp (you can
friend declare inside bloom_filter in bloom_filter.hpp and have the
implementation
elsewhere), but it's a good idea nonetheless.
The friend declaration might be an option, and if I can't do it
non-intrusively I'll fall back to this option. Thanks for pointing
this out. :-)
And I've yet to work on this. ;-)

HTH!

-- 
Dean Michael Berris | Software Engineer, Friendster, Inc.
blog.cplusplus-soup.com | twitter.com/mikhailberis |
linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis |
deanberris.com