
On Tue, Jun 9, 2009 at 4:36 AM, Joaquin M Lopez Munoz<joaquin@tid.es> wrote:
Dean Michael Berris <mikhailberis <at> gmail.com> writes:
The implementation is now available in Boost's Sandbox. You can check out the implementation along with the documentation via:
[...]
I'd also like to request to add this to the formal review queue (if this qualifies for a fast track review, that would also be an option I'm open to).
With all due respect, I think your submission is in too early a state to already request a formal review. I'd say you are in "refinement" stage as described in:
Oh, okay. Thanks for pointing this out.
So you can still get a lot of feedback till the design reaches a mature state. IMHO there're plenty of opportunities for improvement, here are a handful of observations in no particular order:
1. Your material lacks proper tests and examples, and the docs are sketchy (a formal reference would be needed at least).
Okay. I can work on adding more tests (or at least having more "formal" unit tests).
2. bloom_filter should have an allocator template parameter, just like any other container around.
Okay, I'll just forward this anyway to the underlying data structure (bitset) so that should be easy to add.
3. Have you considered the possibility to let the user specify M and K at constuction time rather than compile tiem and the pros and cons of both approaches?
Yes, I have but I've put a premium at type safety and putting M and K as part of the type instead of being a runtime value. The reasons are: 1. Specifying a static M allows the use of the standard bitset instead of Boost's dynamic_bitset. 2. A static K allows for enforcing the number of hash functions at compile time instead of at runtime. 3. Since M and K are not variable and are part of the type, it makes little sense for me to have either of them set at construction time. One thing I might be amenable to is a static K and a runtime M, but that makes the hash functions unable to determine statically the bounds of the bitset (M). I can change the concept for the hash function, but the simplicity of size_t(Input) just makes it easier for users I think.
4. It'd be nice to have a way to specify the desired false positive probability instead of M and K.
That'd be nice, but then the hash functions would have to be provided for by the implementation too to ensure that the probability is held.
5. It'd be nice if entry-level users can get a default set of hash functions so that they need not specifiy them --maybe a parameterized hash formula h(x,N) so that you can just use h(x,0), h(x,1),... h(x,N-1).
I agree. I can use Boost.CRC's functions and implement a simple mod-hash against M. Let me see what I can come up with. BTW, this may require that the actual bloom_filter<> template be renamed to something like static_bloom_filter<> and have a dynamic_bloom_filter<> version as well that supports runtime-setting of M like something I describe above. The bloom_filter<> will derive from static_bloom_filter<> and provide a set of basic hash functions so that users don't have to worry about providing their own hash functions. Or, I can support a default-constructible bloom_filter<> that provides these functions too. This sounds like something I'd be able to implement quickly.
6. You're not providing a free function swap(bloom_filter&,bloom_filter&).
Not yet, you're correct. Let me address this soon.
7. bloom_filters are not currently testable for equality.
You're correct. I'll add this soon.
8. bloom_filters are not currently serializable.
True. I might need to have a file 'boost/bloom_filter/serialize.hpp' to add serializability non-intrusively.
9. Union and intersection of bloom_filters are not provided (according to the Wikipedia entry these ops are trivial to implement).
True.
10. For completeness, seems like constructing a bloom_filter from a bitset_type would be nice to have.
I agree, this is something that can be added soon as I allow for having a default set of hash functions for bloom filters.
11. You'd have to decide whether you want to make this look more like a std::map<T,bool> or not. If the former, some boilerplate typedefs like value_type and memer functions like size() could be provided.
I'm still struggling with this mainly because I'd think a bloom filter would better model an std::set's behavior, but without access to the "stored" elements (mainly because the elements aren't really stored). Thanks for the great insights and feedback, I'll try implementing these things sooner than later. Have a great day! -- Dean Michael Berris | Software Engineer, Friendster, Inc. blog.cplusplus-soup.com | twitter.com/mikhailberis | linkedin.com/in/mikhailberis | profiles.friendster.com/mikhailberis | deanberris.com