
Just an update -- as of Sandbox revision 53772, I've addressed some of the issues below: On Tue, Jun 9, 2009 at 12:52 PM, Dean Michael Berris<mikhailberis@gmail.com> wrote:
On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote:
So you can still get a lot of feedback till the design reaches a mature state. IMHO there're plenty of opportunities for improvement, here are a handful of observations in no particular order:
1. Your material lacks proper tests and examples, and the docs are sketchy (a formal reference would be needed at least).
Okay. I can work on adding more tests (or at least having more "formal" unit tests).
I've added a couple more tests, hardly "formal" by definition, but is an improvement over the single test file that's implemented.
2. bloom_filter should have an allocator template parameter, just like any other container around.
Okay, I'll just forward this anyway to the underlying data structure (bitset) so that should be easy to add.
And apparently I mis-typed -- std::bitset<> doesn't have an allocator parameter. Because of this, I've chosen to not support for the meantime a custom allocator.
3. Have you considered the possibility to let the user specify M and K at constuction time rather than compile tiem and the pros and cons of both approaches?
Yes, I have but I've put a premium at type safety and putting M and K as part of the type instead of being a runtime value. The reasons are:
1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset. 2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime. 3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time.
One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
I thought about it a little more and I don't think I like the idea of a separate static and dynamic bloom_filter. If there are more reasons to allow for dynamically growing bloom filters, then I might support this.
4. It'd be nice to have a way to specify the desired false positive probability instead of M and K.
That'd be nice, but then the hash functions would have to be provided for by the implementation too to ensure that the probability is held.
I will not be supporting this anytime soon. ;-)
5. It'd be nice if entry-level users can get a default set of hash functions so that they need not specifiy them --maybe a parameterized hash formula h(x,N) so that you can just use h(x,0), h(x,1),... h(x,N-1).
I agree. I can use Boost.CRC's functions and implement a simple mod-hash against M. Let me see what I can come up with. BTW, this may require that the actual bloom_filter<> template be renamed to something like static_bloom_filter<> and have a dynamic_bloom_filter<> version as well that supports runtime-setting of M like something I describe above. The bloom_filter<> will derive from static_bloom_filter<> and provide a set of basic hash functions so that users don't have to worry about providing their own hash functions.
This is already done, put it in boost/bloom_filter/detail/default_hash.hpp -- with a naive offset and mod-hash functions. Yes, this should be improved to use Boost.Hash or other implementations easily. ;-)
Or, I can support a default-constructible bloom_filter<> that provides these functions too. This sounds like something I'd be able to implement quickly.
Which is already the case. A default constructed bloom filter uses an unsophisticated hash function internally.
6. You're not providing a free function swap(bloom_filter&,bloom_filter&).
Not yet, you're correct. Let me address this soon.
Done.
7. bloom_filters are not currently testable for equality.
You're correct. I'll add this soon.
Done, but I've run into the issue of not being able to compare boost::function objects of the same type -- for some reason this has been disabled in Boost Trunk.
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
This should come later... I still need to look into whether serialization should only be done for the bitset state rather than the whole type. I say this because up to now I don't think there's a way of serializing boost::function objects and deserializing them effectively either.
9. Union and intersection of bloom_filters are not provided (according to the Wikipedia entry these ops are trivial to implement).
True.
And I need to implement these later (if there's an actual need for this).
10. For completeness, seems like constructing a bloom_filter from a bitset_type would be nice to have.
I agree, this is something that can be added soon as I allow for having a default set of hash functions for bloom filters.
And this is done. HTH -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com