
Hi Everyone, The following can hardly be called a library review, but I still wanted to give this feedback. The documentation of this library meets what I call the highest standards of Boost. What attracted me to Boost originally was the way the libraries were documented. Studying each new library was learning something new: about programming, about structuring your program, about structuring your thoughts. A lot of libraries, back then, first documented the problem domain, then gave you some theory behind it, then listed possible approaches, and then finally also presented to you the chosen solution. Boost.Bloom represents the same character. I did not know what Bloom Filters were before reading the docs. Now I feel I do. I learned about the problem domain. I was convinced that the problem domain is worth addressing. I know the technical challenges. I know what to expect of the library interface (without having looked at the reference section)! I can already answer one of the review questions. Is the design of the library sound? If the author was able to explain the problem domain and the technical challenges/trade-offs so clearly and succinctly, then I am confident that the library design must be sound. Some interface suggestions: 1. Use the term "bit_capacity" rather than "capacity" for representing capacity in bits. This is not to accidently mislead the users. 2. It is a good move to use the name "may_contain" rather than "contains", but "may_contain" actually also doesn't do justice to the semantics either. The user is asking, "do you contain the value", and the library is replying, "maybe I do, and maybe I don't". Actually, I am unable to suggest anything better. If the function name was `f.definately_misses()` then a true/false answer would be meaningful. and the opposite of "definately_misses" is "either 'contains', or 'misses' incorrectly classified as 'contains'". But I realize that it is longer and contains negatives. So I think your choice of the name is the best given the set of alternatives. 3. When describing the |= operator, use the name "set union", which is the industry standard. The current choice of words "containing both the elements of f and f2" can be interpreted both as a union and as an intersection. 4. I am uneasy about the functionality under &= operator. It implies symmetry with operator |=. But the symmetry is broken for at least two reasons: * If I put some elements to filter f1 and some other elements to filter f2, and then I apply |=, then the effect is the same as if I put all the elements in f1 to begin with. The same property, if I understand correctly, does not apply to operator &=. * The compromising of the FPR value that you are describing. I am not sure if anyone ever needs this intersection behavior. But if it is useful, maybe reflect this lack of symmetry by using a named function rather than an operator? 5. Does the suffix "_for" in the name "fpr_for" bring any value? 6. Not sure if it is a good idea to have two overloads for `reset()` doing slightly different things. For constructors this is unavoidable, but for member functions, you can have `reset` and `reset_for`. 7. I think that you are missing a precondition for operators &= and |=. They work under the assumptions that for both filters f1 and f2 inserting the same element renders the same bit pattern in the hash. But this is not the case when the hashers that I provide to f1 and f2 have the same type but different initial state: filter<T, MyHash> f1(1000, MyHash{1}); filter<T, MyHash> f2(1000, MyHash{2}); The above feedback is based on me having read the documentation. I haven't tried the library, and it is unlikely that I will be able to do this within the review period, so I am not sure if I am qualified to give a recommendation for accepting the library into Boost. Joaquín, thank you for writing and sharing this library! Regards, &rzej;

El 18/05/2025 a las 13:29, Andrzej Krzemienski via Boost escribió:
Hi Everyone, The following can hardly be called a library review, but I still wanted to give this feedback.
Hi Andrzej, thanks for this feedback! It'd be great if you can find the time to write a full review.
[...].
Some interface suggestions:
1. Use the term "bit_capacity" rather than "capacity" for representing capacity in bits. This is not to accidently mislead the users.
I find the name ugly but it certainly dispels any potential ambiguity, so, yes, I can adopt that.
[...] 3. When describing the |= operator, use the name "set union", which is the industry standard. The current choice of words "containing both the elements of f and f2" can be interpreted both as a union and as an intersection.
Noted, will do.
4. I am uneasy about the functionality under &= operator. It implies symmetry with operator |=. But the symmetry is broken for at least two reasons: * If I put some elements to filter f1 and some other elements to filter f2, and then I apply |=, then the effect is the same as if I put all the elements in f1 to begin with. The same property, if I understand correctly, does not apply to operator &=.
The following, dual property holds: if I put some elements to f1 and some other elements to f2 and then I apply &=, those elements that were both in f1 and f2 will also be in the result. The symmetry breaks in that this operation is _not_ the same as constructing a filter from scratch with the elements in both f1 and f2, as the latter won't have stray, unnecessary bits set to one that may potentially be left over in the former.
* The compromising of the FPR value that you are describing. I am not sure if anyone ever needs this intersection behavior. But if it is useful, maybe reflect this lack of symmetry by using a named function rather than an operator?
I don't like this much. Admittedly there's a lack of simmetry due to FPR degradation, but &= is so descriptive of what's going on that I'd rather keep it.
5. Does the suffix "_for" in the name "fpr_for" bring any value?
This suffix has been added for consistency reasons. We have capacity() (non-static, capacity of the filter under consideration) and capacity_for(...) (static, capacity estimation, not dependent on the state of any filter). And fpr_for(...) is sort of the inverse function of capacity_for(...), so it makes sense (in my mind at least) that it have the same suffix.
6. Not sure if it is a good idea to have two overloads for `reset()` doing slightly different things. For constructors this is unavoidable, but for member functions, you can have `reset` and `reset_for`.
This usage of "_for()" to unload the reset overload is not consistent with how it's used for capacity() and capacity_for(). I mean, reset() is NOT to reset_for() as capacity() is to capacity_for(); rather both overloads of reset do the same thing with different input parameters.
7. I think that you are missing a precondition for operators &= and |=. They work under the assumptions that for both filters f1 and f2 inserting the same element renders the same bit pattern in the hash. But this is not the case when the hashers that I provide to f1 and f2 have the same type but different initial state:
filter<T, MyHash> f1(1000, MyHash{1}); filter<T, MyHash> f2(1000, MyHash{2});
Well, technically the behavior is well defined even if the hashers are not equivalent: "[...] changes the value of each bit in the internal array with the result of doing a logical AND operation of that bit and the corresponding one in x." Admittedly, doing this with non-equivalent hashers will produce nothing but garbage, so... well, I don't know what I should do here :-)
[...] Joaquín, thank you for writing and sharing this library!
Thank _you_ for taking the time to look into my proposal. Joaquin M Lopez Munoz

niedz., 18 maj 2025 o 18:52 Joaquin M López Muñoz via Boost < boost@lists.boost.org> napisał(a):
El 18/05/2025 a las 13:29, Andrzej Krzemienski via Boost escribió:
Hi Everyone, The following can hardly be called a library review, but I still wanted to give this feedback.
7. I think that you are missing a precondition for operators &= and |=. They work under the assumptions that for both filters f1 and f2 inserting the same element renders the same bit pattern in the hash. But this is not the case when the hashers that I provide to f1 and f2 have the same type but different initial state:
filter<T, MyHash> f1(1000, MyHash{1}); filter<T, MyHash> f2(1000, MyHash{2});
Well, technically the behavior is well defined even if the hashers are not equivalent:
"[...] changes the value of each bit in the internal array with the result of doing a logical AND operation of that bit and the corresponding one in x."
Admittedly, doing this with non-equivalent hashers will produce nothing but garbage, so... well, I don't know what I should do here :-)
I am pretty sure there is only one right answer here. While the semantics of the function are specified in terms of setting the bit values, the value that this interface brings to the users is at a higher level. It is this: If filter f1 indicates a set S1 of values that are certainly not stored (those for which may_contain() returns false), and filter f2 indicates a set S2 of values that are certainly not stored in f2, then the result is a filter whose values certainly not stored are the union of S1 and S2. This equivalence of specifications (in terms of bits, and in terms of definitely not stored values) is broken when f1 and f2 use different hashes.This is sufficient a reason to call this a precondition: breaking the high-level abstraction/guarantee. Having or not having undefined behavior is irrelevant here. Regards, &rzej;

El 18/05/2025 a las 19:54, Andrzej Krzemienski escribió:
niedz., 18 maj 2025 o 18:52 Joaquin M López Muñoz via Boost <boost@lists.boost.org> napisał(a):
El 18/05/2025 a las 13:29, Andrzej Krzemienski via Boost escribió: > Hi Everyone, > The following can hardly be called a library review, but I still wanted to > give this feedback.
> 7. I think that you are missing a precondition for operators &= and |=. > They work under the assumptions that for both filters f1 and f2 inserting > the same element renders the same bit pattern in the hash. But this is not > the case when the hashers that I provide to f1 and f2 have the same type > but different initial state: > > filter<T, MyHash> f1(1000, MyHash{1}); > filter<T, MyHash> f2(1000, MyHash{2});
Well, technically the behavior is well defined even if the hashers are not equivalent:
"[...] changes the value of each bit in the internal array with the result of doing a logical AND operation of that bit and the corresponding one in x."
Admittedly, doing this with non-equivalent hashers will produce nothing but garbage, so... well, I don't know what I should do here :-)
I am pretty sure there is only one right answer here.
While the semantics of the function are specified in terms of setting the bit values, the value that this interface brings to the users is at a higher level. It is this: If filter f1 indicates a set S1 of values that are certainly not stored (those for which may_contain() returns false), and filter f2 indicates a set S2 of values that are certainly not stored in f2, then the result is a filter whose values certainly not stored are the union of S1 and S2.
This equivalence of specifications (in terms of bits, and in terms of definitely not stored values) is broken when f1 and f2 use different hashes.This is sufficient a reason to call this a precondition: breaking the high-level abstraction/guarantee. Having or not having undefined behavior is irrelevant here.
Not sure I fully agree, but, hey, it's just a precondition statement in the documentation, so I'll add it. Thank you, Joaquin M Lopez Munoz

niedz., 18 maj 2025 o 13:29 Andrzej Krzemienski <akrzemi1@gmail.com> napisał(a):
Hi Everyone, The following can hardly be called a library review, but I still wanted to give this feedback.
I was also curious if Boost.Bloom works with Boost.Hash2. And it does! (gcc 13.2 in C++11 and C++20.) Probably not the best move performance-wise, but this was to check if the libraries can be combined. Let me just remark that while the library apparently works with C++11, the initial "hello world" example from the docs does not compile in C++11, due to the digit separator in literal 1'000'000. Probably not a problem, but I was initially surprised why my example stopped compiling when I switched to C++11. Regards, &rzej;

On Mon, May 19, 2025 at 11:13 PM Andrzej Krzemienski via Boost < boost@lists.boost.org> wrote:
niedz., 18 maj 2025 o 13:29 Andrzej Krzemienski <akrzemi1@gmail.com> napisał(a):
Probably not the best move performance-wise, but this was to check if the libraries can be combined.
If you mark your hash as avalanching performance should be better. But in my simple experiments even with that tiny extra boost it still not even close to boost::hash(that is identity) and extra hashing done by Bloom library(because boost::hash for uint64_t is not avalanching)... To be clear evaluating hash performance in bloom filters is very hard, so do not give a lot of weight to my measurements. :)

El 19/05/2025 a las 23:07, Andrzej Krzemienski via Boost escribió:
niedz., 18 maj 2025 o 13:29 Andrzej Krzemienski <akrzemi1@gmail.com> napisał(a):
[...] Let me just remark that while the library apparently works with C++11, the initial "hello world" example from the docs does not compile in C++11, due to the digit separator in literal 1'000'000. Probably not a problem, but I was initially surprised why my example stopped compiling when I switched to C++11.
Yes, I decided to add the digit separators for the benefit of the reader. The equivalent full example here, though, https://github.com/joaquintides/bloom/blob/develop/example/basic.cpp is C++11 compatible. Joaquin M Lopez Munoz

niedz., 18 maj 2025 o 13:29 Andrzej Krzemienski <akrzemi1@gmail.com> napisał(a):
Hi Everyone, The following can hardly be called a library review, but I still wanted to give this feedback.
I recommend to ACCEPT Boost.Bloom to Boost. I have provided some more detailed feedback earlier. I find the documentation exemplary, design sound and adequate, use case real and worth addressing. I have played with toy examples. They compiled without problems (gcc 13.2 in C++11 and C++20). I have not looked at the implementation. I let the review manager judge how this should affect the weight of the review. I am not knowledgeable in the problem domain. The time I have spent on studying the library: if I summed up all the hours, that would be like 20h A note about exception safety. Using the definitions from David Abrahams, "basic exception safety" should be provided for every function. I cannot see any note about the exception safety of this library. Usually when I find a new library on the web, I assume that the author is not aware of the problems of "exception safety", and I do not trust them until I am convinced otherwise. I am pretty sure Boost.Bloom takes this into account, but I could not find anything in the docs on this matter. Additionally, for the copy assignment of the class filter: the docs do not say what happens when the allocator throws. Note that allocators may throw even if we have not run out of memory: do you provide the strong ("commit or rollback") guarantee? Or just the basic one? The same for reset(). Regards, &rzej;

El 22/05/2025 a las 14:06, Andrzej Krzemienski via Boost escribió:
niedz., 18 maj 2025 o 13:29 Andrzej Krzemienski <akrzemi1@gmail.com> napisał(a):
[...]
A note about exception safety. Using the definitions from David Abrahams, "basic exception safety" should be provided for every function. I cannot see any note about the exception safety of this library. Usually when I find a new library on the web, I assume that the author is not aware of the problems of "exception safety", and I do not trust them until I am convinced otherwise. I am pretty sure Boost.Bloom takes this into account, but I could not find anything in the docs on this matter.
Yes, exception safety has been taken into account and basic (at least) safety is provided everywhere. I will add this to the reference. Would it be ok to state that basic is provided by default and then explicitly document those functions providing strong (noexcept is not necessary as the signature tells it)?
Additionally, for the copy assignment of the class filter: the docs do not say what happens when the allocator throws. Note that allocators may throw even if we have not run out of memory: do you provide the strong ("commit or rollback") guarantee? Or just the basic one? The same for reset().
Strong safety is provided, which, when combined with propagate_on_container_copy_assignment traits, results in a rather convoluted implementation: https://github.com/joaquintides/bloom/blob/develop/include/boost/bloom/detai... reset() also provides strong safety. Joaquin M Lopez Munoz

On Thu, May 22, 2025, 18:52 Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
niedz., 18 maj 2025 o 13:29 Andrzej Krzemienski <akrzemi1@gmail.com> napisał(a):
[...]
A note about exception safety. Using the definitions from David Abrahams, "basic exception safety" should be provided for every function. I cannot see any note about the exception safety of this library. Usually when I find a new library on the web, I assume that the author is not aware of
El 22/05/2025 a las 14:06, Andrzej Krzemienski via Boost escribió: the
problems of "exception safety", and I do not trust them until I am convinced otherwise. I am pretty sure Boost.Bloom takes this into account, but I could not find anything in the docs on this matter.
Yes, exception safety has been taken into account and basic (at least) safety is provided everywhere. I will add this to the reference. Would it be ok to state that basic is provided by default and then explicitly document those functions providing strong (noexcept is not necessary as the signature tells it)?
Sure. Thank you. &rzej;
Additionally, for the copy assignment of the class filter: the docs do not say what happens when the allocator throws. Note that allocators may throw even if we have not run out of memory: do you provide the strong ("commit or rollback") guarantee? Or just the basic one? The same for reset().
Strong safety is provided, which, when combined with propagate_on_container_copy_assignment traits, results in a rather convoluted implementation:
https://github.com/joaquintides/bloom/blob/develop/include/boost/bloom/detai...
reset() also provides strong safety.
Joaquin M Lopez Munoz
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
participants (3)
-
Andrzej Krzemienski
-
Ivan Matek
-
Joaquin M López Muñoz