
Dear Boost Community,

As Review Manager for Candidate Boost.Bloom, I have carefully reviewed and incorporated the community’s feedback from May 13–22 and hereby ACCEPT Boost.Bloom unconditionally.

Congratulations to Joaquín on an outstanding contribution, and my genuine gratitude to everyone for their thorough analyses, respectful debates, and lively discussion. If I missed thanking anyone directly, please accept my apologies; your time and effort are greatly appreciated.

The submission’s outstanding quality, the community engagement, the author’s extensive experience with Boost and longstanding contributions, their responsiveness during the review process, and the fact that they’ve already begun integrating the proposed changes into the repository have convinced me to accept the library in its current form and trust that any further suggestions will be carefully considered. Consequently, I have not imposed any conditions on its acceptance.

Finally, I’ll open a series of [peer-review]-tagged issues on the Boost.Bloom repository (https://github.com/joaquintides/bloom) to track each of the items below for Joaquín. He’s already begun addressing many of them, so some may close immediately, but this will help us monitor progress.

On a personal note, this was my first time serving in a Boost leadership role, and I found the experience both rewarding and constructive. My warmest thanks to you all for making it a positive one.

---

## Community recommendations

I note that non-C++ Alliance reviewers rarely disclosed their affiliations.

### ACCEPT (7)

1. Claudio de Souza (May 18) (Undisclosed)
2. Ivan Matek (personal exchange: May 25) (Undisclosed)
3. Tomer Vromen (May 23) (Undisclosed)
4. Дмитрий Архипов (May 21) (C++ Alliance)
5. Christian Mazakas (May 21) (Undisclosed)
6. Vinnie Falco (May 22) (C++ Alliance)
7. Andrzej Krzemienski (May 22) (Undisclosed)

### ACCEPT CONDITIONALLY (1)

1. Kostas Savvidis (personal exchange: May 22) (Institute of Nuclear and Particle Physics Demokritos)

### NOT A REVIEW (5)

1. Peter Turcan (May 17) (C++ Alliance)
2. Ruben Perez (May 21) (C++ Alliance)
3. Seth (May 23) (Undisclosed)
4. David Bien (May 23) (Undisclosed)
5. Alexander Grund (May 23) (Undisclosed)

---

## Community feedback

### Mailing List

1. Strong consensus to accept
   - Nearly every reviewer recommends accepting Boost.Bloom, praising its code quality, interface, SIMD optimizations, and documentation.
   - It is my understanding that it could become a "model" library for future contributions.
2. Documentation and onboarding
   - Add a friendlier getting-started section and introduction to Bloom filters
   - Add intuition behind mathematical equations
   - Clarify terms such as capacity vs. bit_capacity, may_contain, and fpr_for
   - Provide copy-and-paste-ready examples, syntax highlighting, and visuals for parameters vs. false-positive rate
3. Build and integration
   - Improve CMake support (generate Visual Studio solution, add gdb pretty-printers)
   - Document minimum C++ standard and supported compilers
4. API design and semantics
   - Reconsider or justify container-like features (emplace, allocator semantics)
   - Simplify template parameters or consider a hybrid compiled-lib approach
   - Ensure operator and method names are unambiguous and document preconditions
5. Performance and accuracy
   - Document real-world vs. theoretical false-positive rates
   - Provide post-construction FPR estimation utilities
   - Consider replacing custom RNG with a standard linear-congruential approach
6. Advanced features and suggestions
   - Add batch-lookup or multi-element tests
   - Highlight and document cache-line-blocked filters as a primary use-case
   - Offer utilities for computing memory requirements and alignment
7. Real-world integration
   - Include examples of Boost.Bloom in large projects (e.g., a bitcoind fork)
8. Minor refinements
   - Mention the origin of the Bloom filter name
   - Zero unused bits for deterministic serialization
   - Warn about potential OOM errors with unrealistic false-positive rates
9. Roadmap
   - Explore runtime-filter variation and Cuckoo/XOR filter support
   - Add bulk-lookup API
   - Formalize ContainerHash integration (e.g. `is_avalanching`)
   - Build examples/tests and coverage in CI to avoid code rot
   - Flesh out a guided tutorial narrative

### Slack

The #boost channel on the Official C++ Language Slack Workspace (joinable at https://slack.cpp.al) was buzzing with animated discussions:

- May 26: Joaquín and Sam Darwin discussed setting up code coverage for Boost.Bloom.
- May 25: Vinnie asked if a bitcoind fork using Boost.Bloom would be interesting, Joaquín agreed, and Janko pointed out Bitcoin already uses a rolling bloom filter.
- May 22: Vinnie Falco asked Mohammed Nejati for a CMakeLists.txt to generate a VS solution for Boost.Bloom that lets you browse headers/sources, build and run tests, and debug with breakpoints: https://github.com/ashtum/bloom
- May 21: Vinnie and others debugged why the tests wouldn’t load: Janko suggested BUILD_TESTING; Vinnie tried it, then learned there is no CMake file in the test directory.
- May 21: Joaquín added CMake-based tests to Boost.Bloom and encountered a missing header error; Janko pointed out the tests needed to link against Boost::bloom, which fixed the build.
- May 21: Pdimov clarified that listing individual Boost dependencies isn’t necessary when linking to Boost::bloom.
- May 21: The channel then held an extended debate on whether Boost.Bloom’s interface should mimic standard containers, the role of allocators and template parameters, and the pros and cons of header-only versus compiled-library designs.
- May 21: Vinnie called for volunteers to fork bitcoind to use Boost.Bloom, and later suggested extracting and summarizing the day’s discussion for the review records, adding a link to the channel, and adding a fancy ASCII banner to it (surely he was joking):

[ASCII-art flower banner reading "join us at cpplang.slack.com#boost"]

### Reddit

The full post lives here: https://www.reddit.com/r/cpp/comments/1klujy0/boostbloom_review_starts_on_ma...

- May 13: the author announced the Boost.Bloom library review, prompting 5 comments with questions and explanations.
  - One user asked if it was a container that “compresses” its contents.
  - Another clarified that Bloom filters solve the set-membership problem by using a tunable bit array that guarantees no false negatives but allows configurable false-positive rates, ideal for fast, memory-efficient membership checks rather than full object storage.

---

## Conclusion

Although I have tried my best to provide a thorough and accurate summary of the material, should any omissions, inaccuracies or plain mistakes remain, I respectfully ask for your understanding and welcome your corrections.

Thank you for your time and consideration,
Thank you Joaquín for this amazing library, and long live Boost.Bloom,

Arnaud Becheler
Boost.Bloom Review Manager
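To make the sizing items in the feedback above more concrete (theoretical false-positive rate, memory requirements, and the out-of-memory warning for unrealistic targets), here is a minimal sketch based on the textbook formulas for a classical Bloom filter. The function names are hypothetical and are not Boost.Bloom's API; block-based filters such as those offered by the library will deviate from these idealized figures.

```cpp
#include <cassert>
#include <cmath>

// Textbook false-positive rate of a classical Bloom filter with
// m bits, n inserted elements and k bits set per element:
// FPR = (1 - exp(-k*n/m))^k.
inline double theoretical_fpr(double m, double n, double k)
{
    assert(m > 0 && n >= 0 && k > 0);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}

// Bits needed to reach a target rate p when k is chosen optimally:
// m = -n * ln(p) / (ln 2)^2.
inline double bits_for(double n, double p)
{
    assert(n > 0 && p > 0 && p < 1);
    const double ln2 = std::log(2.0);
    return -n * std::log(p) / (ln2 * ln2);
}

// Optimal number of bits per element for a given m and n: k = (m/n) * ln 2.
inline double optimal_k(double m, double n)
{
    assert(m > 0 && n > 0);
    return m / n * std::log(2.0);
}
```

Under these formulas the required memory grows linearly in ln(1/p): for the same number of elements, a target of 1e-16 needs eight times the bits of a target of 1e-2, which is why a very small requested rate combined with a large element count can translate into an allocation that does not fit in memory.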

On 29 May 2025, at 19:25, Vinnie Falco via Boost <boost@lists.boost.org> wrote:
> Thank you for your work in the role of review manager, and also I think this sets a new bar for what we might like to see in terms of review summaries.

Thanks indeed go to Arnaud for managing the review. I am somewhat puzzled that no technical points whatsoever became part of the review manager's report. The main technical innovation of the design of this library was the idea to do it essentially without hashing. This idea is solid, given that a library like bloom is going to be dealing with millions or at most a few billion items. It is sufficient to use DETERMINISTIC random numbers, i.e., a good enough RNG. The library is choosing this RNG parametrically, which is unworkable, and the specific details are such that the choice may turn out to be really bad according to well-established theory. The suggestion in my review was to use a Knuth 64-bit RNG, but other choices are possible. I do not imagine that there are people here who have not at least heard of the disaster awaiting the naive use of bad RNGs.

On a separate note, maybe we, the authors of the Boost.Random library, should have done a better job of providing RNGs which people might use in such a project instead of rolling their own. I am the author of the MIXMAX RNG which is in Boost.Random; it is industrial strength, but this project needed something slightly more lightweight. Unfortunately the small lightweight generators in Boost.Random, and incidentally also std::random, are simply no good. I.e., that 64-bit RNG of Knuth is sadly not there afaik. Thus, we cannot fault the author for not using an RNG from Boost.Random, but not addressing this issue in the report at all is puzzling. I hope that Joaquin finds a good way forward nonetheless.

Best Regards to All,
Kostas

=========================================
Institute of Nuclear and Particle Physics
NCSR Demokritos
https://inspirehep.net/literature?q=a%20Konstantin.G.Savvidy.1
https://mixmax.hepforge.org
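For readers unfamiliar with the generator mentioned above, a 64-bit generator attributed to Knuth is commonly given as the MMIX linear congruential generator. The sketch below assumes these are the constants meant by "Knuth 64-bit RNG"; the message itself does not spell them out. The `positions` helper and its multiply-and-shift mapping are purely illustrative and are not taken from Boost.Bloom.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Constants of Knuth's MMIX LCG; the modulus 2^64 comes for free from
// unsigned 64-bit wraparound.
constexpr std::uint64_t mmix_a = 6364136223846793005ULL;
constexpr std::uint64_t mmix_c = 1442695040888963407ULL;

// One step: x' = a * x + c (mod 2^64).
constexpr std::uint64_t mmix_next(std::uint64_t x)
{
    return mmix_a * x + mmix_c;
}

// Hypothetical helper: derive k positions in [0, m) from a single 64-bit
// hash by iterating the generator and mapping the high 32 bits of each
// state to the range with a multiply-and-shift.
inline std::vector<std::uint32_t> positions(std::uint64_t hash, std::uint32_t m, int k)
{
    assert(m > 0 && k >= 0);
    std::vector<std::uint32_t> out;
    out.reserve(static_cast<std::size_t>(k));
    std::uint64_t x = hash;
    for (int i = 0; i < k; ++i) {
        x = mmix_next(x);
        // (x >> 32) < 2^32 and m < 2^32, so the product fits in 64 bits
        // and the shifted result is strictly less than m.
        out.push_back(static_cast<std::uint32_t>(((x >> 32) * m) >> 32));
    }
    return out;
}
```

Dropping the increment `mmix_c` turns this into a pure MCG with the same multiplier, at the same cost of one multiplication per position.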

On 29/05/2025 at 18:42, Kostas Savvidis via Boost wrote:
> On 29 May 2025, at 19:25, Vinnie Falco via Boost <boost@lists.boost.org> wrote:
>> Thank you for your work in the role of review manager, and also I think this sets a new bar for what we might like to see in terms of review summaries.
>
> Thanks indeed go to Arnaud for managing the review. I am somewhat puzzled that no technical points whatsoever became part of the review manager's report. The main technical innovation of the design of this library was the idea to do it essentially without hashing. This idea is solid, given that a library like bloom is going to be dealing with millions or at most a few billion items. It is sufficient to use DETERMINISTIC random numbers, i.e., a good enough RNG. The library is choosing this RNG parametrically, which is unworkable, and the specific details are such that the choice may turn out to be really bad according to well-established theory. The suggestion in my review was to use a Knuth 64-bit RNG, but other choices are possible. I do not imagine that there are people here who have not at least heard of the disaster awaiting the naive use of bad RNGs.

Hi Kostas,

Regardless of whether this point you raised is included in Arnaud's report or not, it's already in my backlog and I will address it properly. The reason I'm not immediately adopting your Knuth mixer approach is that I'd like to understand (and hopefully reproduce) the conditions, if any, under which a poor MCG may ruin the efficiency of the filter. The key point here (in my mind at least) is that, given two sequences of hash values hi and gi, we're not interested in internal correlation of either hi or gi, but cross correlation between the two sequences. Internally, I think it suffices to guarantee that hi (and gi) values won't repeat, but of course I may be wrong and this is what I would like to study carefully.

I know this is asking too much, but given that you're an expert on random numbers and such, it would be extremely helpful if you could assist in looking for pathological cases, or lack thereof. Either way, I'll keep you posted on my developments.

> On a separate note, maybe we, the authors of the Boost.Random library, should have done a better job of providing RNGs which people might use in such a project instead of rolling their own.

The only reason for not adopting Boost.Random is speed: Boost.Bloom has been implemented with the aim of being as fast as possible, and any non-trivial mixer will have a measurable impact on performance.

Joaquin M Lopez Munoz

On 29 May 2025, at 20:37, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
> given two sequences of hash values hi and gi, we're not interested in internal correlation of either hi or gi, but cross correlation between the two sequences.

I am actually not sure which is more important for a Bloom filter, autocorrelation or cross-correlation. What is known is that the cross-correlation between two sequences in any MCG or LCG (x' = a * x mod k) is even worse than the autocorrelation. It goes like this: if in the first bucket h1 = 2*g1, then hi = 2*gi for all i (that is, for all buckets). This is independent of a and k. The same holds if you replace the "2" with "3", etc. Effectively there is 100% correlation between buckets. One cannot fix that even with a better multiplier.

K
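The proportionality argument above can be checked numerically. The following sketch takes the modulus to be 2^64 (an assumption made so that plain unsigned wraparound implements the reduction) and verifies that two MCG streams started at g and c*g stay in the ratio c at every step, whatever the multiplier; the function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// One MCG step modulo 2^64: x' = a * x.
inline std::uint64_t mcg_next(std::uint64_t x, std::uint64_t a)
{
    return a * x;
}

// Starts two streams at g and h = c * g and returns true if
// h_i == c * g_i (mod 2^64) holds for every one of `steps` iterations.
inline bool stays_proportional(std::uint64_t g, std::uint64_t c,
                               std::uint64_t a, int steps)
{
    assert(steps >= 0);
    std::uint64_t h = c * g;
    for (int i = 0; i < steps; ++i) {
        g = mcg_next(g, a);
        h = mcg_next(h, a);
        if (h != c * g) return false; // never taken: a*(c*g) == c*(a*g)
    }
    return true;
}
```

The check succeeds for any choice of a, because multiplication modulo 2^64 is commutative and associative; this matches the statement that a better multiplier alone does not remove the relation. Whether the relation actually degrades a filter's false-positive rate is the open question in this thread.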

Kostas Savvidis wrote:
> On 29 May 2025, at 20:37, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
>> given two sequences of hash values hi and gi, we're not interested in internal correlation of either hi or gi, but cross correlation between the two sequences.
>
> I am actually not sure which is more important for a Bloom filter, autocorrelation or cross-correlation.
> What is known is that the cross-correlation between two sequences in any MCG or LCG (x' = a * x mod k) is even worse than the autocorrelation. It goes like this: if in the first bucket h1 = 2*g1, then hi = 2*gi for all i (that is, for all buckets). This is independent of a and k. The same holds if you replace the "2" with "3", etc. Effectively there is 100% correlation between buckets. One cannot fix that even with a better multiplier.
It's not clear to me why this would be a problem, but if it is, it can be fixed by using an LCG (a*x+b) instead of an MCG (a*x).

On 30/05/2025 at 18:27, Peter Dimov via Boost wrote:
> Kostas Savvidis wrote:
>> On 29 May 2025, at 20:37, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
>>> given two sequences of hash values hi and gi, we're not interested in internal correlation of either hi or gi, but cross correlation between the two sequences.
>>
>> I am actually not sure which is more important for a Bloom filter, autocorrelation or cross-correlation.
>> What is known is that the cross-correlation between two sequences in any MCG or LCG (x' = a * x mod k) is even worse than the autocorrelation. It goes like this: if in the first bucket h1 = 2*g1, then hi = 2*gi for all i (that is, for all buckets). This is independent of a and k. The same holds if you replace the "2" with "3", etc. Effectively there is 100% correlation between buckets. One cannot fix that even with a better multiplier.
>
> It's not clear to me why this would be a problem, but if it is, it can be fixed by using an LCG (a*x+b) instead of an MCG (a*x).

Umm, I like the LCG idea as an additional sum is basically free. Kostas, would a smart choice of b improve the statistical properties of the procedure? Note that we can afford determining b as a function of a (here a=m where m is the capacity of the array).

Joaquin M Lopez Munoz
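As a rough illustration of the scheme under discussion (a multiplier tied to the array capacity m, positions taken from the high bits of the product, and an optional additive constant b), here is a scaled-down sketch that uses 32-bit state and a 64-bit product. The name `next_position`, the bit widths, and the requirement that m be odd are assumptions made for this example; it is not Boost.Bloom's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Returns a position in [0, m) and advances `state` in place.
//   position = high 32 bits of state * m   (multiply-and-shift range reduction)
//   state'   = low 32 bits of state * m + b  (LCG step; b == 0 gives an MCG)
// Assumption of this sketch: m is odd, so that x -> m*x + b is a bijection
// modulo 2^32 and the state cannot collapse by losing its low bits.
inline std::uint32_t next_position(std::uint32_t& state, std::uint32_t m,
                                   std::uint32_t b = 0)
{
    assert(m % 2 == 1);
    const std::uint64_t prod = std::uint64_t{state} * m;
    state = static_cast<std::uint32_t>(prod) + b; // wraps modulo 2^32
    return static_cast<std::uint32_t>(prod >> 32); // floor(state*m / 2^32) < m
}
```

The addition of b costs one extra instruction per position, which is the sense in which the extra sum is "basically free". Note that with b == 0 a state of 0x80000000 is a fixed point for any odd m, a small reminder that the quality of such a generator depends on more than its cost.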

On 30 May 2025, at 19:54, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
> On 30/05/2025 at 18:27, Peter Dimov via Boost wrote:
>> Kostas Savvidis wrote:
>>> On 29 May 2025, at 20:37, Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
>>>> given two sequences of hash values hi and gi, we're not interested in internal correlation of either hi or gi, but cross correlation between the two sequences.
>>>
>>> I am actually not sure which is more important for a Bloom filter, autocorrelation or cross-correlation.
>>> What is known is that the cross-correlation between two sequences in any MCG or LCG (x' = a * x mod k) is even worse than the autocorrelation. It goes like this: if in the first bucket h1 = 2*g1, then hi = 2*gi for all i (that is, for all buckets). This is independent of a and k. The same holds if you replace the "2" with "3", etc. Effectively there is 100% correlation between buckets. One cannot fix that even with a better multiplier.
>>
>> It's not clear to me why this would be a problem, but if it is, it can be fixed by using an LCG (a*x+b) instead of an MCG (a*x).
>
> Umm, I like the LCG idea as an additional sum is basically free. Kostas, would a smart choice of b improve the statistical properties of the procedure? Note that we can afford determining b as a function of a (here a=m where m is the capacity of the array).

The b parameter does not fundamentally do anything for an MCG/LCG: the quality remains the same, and it just makes reasoning about this issue more difficult. std::random and Boost do not have any additive constants in any RNG for this reason: no improvement whatsoever.

And why do we need good randomness at all? The high correlation between the buckets may mean that if two items collide in the first bucket, then they will collide in all subsequent buckets. Or the opposite, equally bad (!), that if they don't collide in the first bucket, then they won't collide at all. In this library we have not seen this leading to outright failure (yet), probably because in the typical use case only SOME (high) bits of the hash are used to get the position in the bucket. So it's complicated: someone could and should do a detailed theoretical analysis of "Bloom with MCG instead of hashing", but...

... let's get back to earth. Definitely, if you want to use an MCG, you do need a decent multiplier, and a~=m does not assure that. You can afford a good multiplier at the cost of one extra machine cycle. A modern CPU performs both addition and multiplication in one cycle, and it is anyway masked by memory latency.

Regards,
Kostas
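One way to see why the additive constant changes little is to look at the difference of two LCG streams driven by the same a and b: the constant cancels, so the difference evolves as a plain MCG. The sketch below checks this identity numerically modulo 2^64 (the modulus and the function names are assumptions of the example); it illustrates one specific relation only and does not settle how much it matters for a Bloom filter.

```cpp
#include <cassert>
#include <cstdint>

// One LCG step modulo 2^64: x' = a * x + b.
inline std::uint64_t lcg_next(std::uint64_t x, std::uint64_t a, std::uint64_t b)
{
    return a * x + b;
}

// Runs two LCG streams from g and h with the same (a, b) and returns true
// if their difference d_i = h_i - g_i satisfies d_{i+1} = a * d_i at every
// step, i.e. if the difference behaves as an MCG regardless of b.
inline bool difference_is_mcg(std::uint64_t g, std::uint64_t h,
                              std::uint64_t a, std::uint64_t b, int steps)
{
    assert(steps >= 0);
    std::uint64_t d = h - g;
    for (int i = 0; i < steps; ++i) {
        g = lcg_next(g, a, b);
        h = lcg_next(h, a, b);
        d = a * d;
        if (h - g != d) return false; // never taken: (a*h+b)-(a*g+b) == a*(h-g)
    }
    return true;
}
```

In other words, the proportionality between two MCG streams becomes an affine relation between two LCG streams: it is harder to spot, but it is still fully determined by the starting values.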

On 29/05/2025 at 18:00, Arnaud Becheler via Boost wrote:
> Dear Boost Community,
> As Review Manager for Candidate Boost.Bloom, I have carefully reviewed and incorporated the community’s feedback from May 13–22 and hereby ACCEPT Boost.Bloom unconditionally.

Thank you Arnaud for your hard work as RM! I'll digest your report carefully and will tend to all the issues described there.

For those interested in following the post-review evolution of the library, I'm already working on incorporating the feedback here:

https://github.com/joaquintides/bloom/compare/develop...feature/review-feedb...

Please take a look, speak up if something's missing or not properly addressed, file new issues, etc.

Thank you to all the people who participated in the review; a lot of valid and interesting points have been raised. I'll add your names to the (upcoming) acknowledgements section.

Feels so good to be part of Boost! I hope I can give back by adding and maintaining this little library for the potential benefit of our users.

Best,
Joaquin M Lopez Munoz
Participants (5): Arnaud Becheler, Joaquin M López Muñoz, Kostas Savvidis, Peter Dimov, Vinnie Falco