[Bloom] review starts on 13th of May

Dear Boost community,

The review of the Boost.Bloom library begins today May 13th, 2025, and will run through May 22nd, 2025.

Boost.Bloom is a header-only C++ library written by Joaquín M López Muñoz, providing fast and flexible Bloom filters. These are space-efficient probabilistic data structures used to test set membership with a controllable false positive rate and minimal memory footprint. The library supports a variety of configurations and hash policies and is designed to integrate well with modern C++.

You can find the library here:
- Repo: https://github.com/joaquintides/bloom
- Documentation: https://master.bloom.cpp.al/

Bloom filters are especially useful when:
- You need to test if something is not in a large dataset, and can tolerate a small false positive rate (e.g., avoiding duplicate URL crawls in a web crawler).
- You need a memory-efficient structure to represent large sets (e.g., checking for the presence of DNA subsequences across massive genomic datasets).
- You want to avoid expensive lookups for items that are definitely not present (e.g., filtering nonexistent records before querying a remote database).

As always, we welcome all reviews — from quick impressions to detailed analysis. Your feedback helps ensure that the library meets Boost's high standards in terms of correctness, performance, documentation, and design. If you're interested in contributing a review, please post to the Boost mailing list during the review period.

In your review please state whether you recommend to reject or accept the library into Boost, and whether you suggest any conditions for acceptance. Other questions you might want to answer in your review are:
- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any problems?
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
- Are you knowledgeable about the problems tackled by the library?

Note: All links to other parts of Boost in the Boost.Bloom documentation are currently broken, as they were written with the assumption that the library was already integrated into Boost - please do not account for those in your review.

Thank you in advance for your time and insights!

Best regards,
Arnaud Becheler
Review Manager, Boost.Bloom

Joaquín, Arnaud,

Thanks for submitting and managing the review of this library. Here is my review of the documentation.

Good to see plenty of links, tables and structure to the reference, diagrams and equations, and performance benchmarks.

I think the doc could use a much friendlier introduction that explains the purpose of these useful filters to all C++ developers. I suggest something like this:

*Introduction*

Bloom filters provide a pattern-matching mechanism to rapidly determine if a given item - an email, URL, genome, username, digital fingerprint etc. - *might* be a member of a long and important list. A key feature of Bloom filters is that there are no false negatives - if the result of the filter is *negative* then the item does not occur in the list with a mathematical certainty of 100%. But there can be false positives - if the result of the filter is *positive* it means that the item *may* occur in the list, but the filter is not certain.

The certainty of avoiding false negatives makes Bloom filters particularly useful in fraud detection. This is best explained by an over-simplified example. Say you have a list, a database, of known bad actors that needs to be checked before access to your financial website is approved. And, for the sake of simplicity, the bad actors list contains only:

1. mr-jekyll@mail.com
2. mr-hyde@mail.com
3. frankenstein@mail.com
4. al-capone@mail.com
5. lord-voldemort@mail.com
6. darth-vader@mail.com

Now, an attempt is made to access the financial website that is protected by a pattern-matching Bloom filter. In our example, the Bloom filter could simply be:

If (request-name contains "e") then return positive, else return negative;

If access is requested by Tony-Montana@mail.com, the filter will return *negative* - there is no "e" in the name, so Tony Montana could not possibly be in the bad actors list and he should be granted access to the financial website without further ado. If access is requested by John-Dillinger@mail.com, then the filter returns *positive* - it contains an "e", so the name *might* be in the bad actors list. The only way to be sure is now to perform a full-text search - time consuming, but it will provide certainty. The value of Bloom filters is that they can eliminate full-text searches for many access requests.

In reality, the list of bad actors is likely to be millions long. The algorithms used in the Bloom filter use mathematical patterns (determined by hash functions) that are present in the list. You can imagine that a long list might require half-a-dozen or more patterns to ensure that ALL entries in the bad actors list are represented by one or more patterns. It does not matter if a bad actor is covered by multiple patterns. It is essential that all bad actors are represented by at least one pattern. The Bloom filter will check an access request against each pattern it is provided with, and return *immediately* when a check against a pattern returns positive. This result initiates a full-text search, though no doubt this can be optimized with clever indexing and organization.

Whereas fraud detection is the top use of Bloom filters, they have proved useful in such situations as checking genome strings against a database of DNA sequences. These checks can be time intensive due to the length of DNA strings. Similarly for high-frequency trading, to check for recent or duplicate orders, as speed is of the essence.
For the record, the Bloom filter is named after its inventor - Burton Howard Bloom - who described its purpose in a 1970 paper, *Space/Time Trade-offs in Hash Coding with Allowable Errors*. There is no connection between the filters and "blooming" in the floral sense!

If an application you are building contains a large database that needs to be searched regularly, this library provides you with the components necessary to create a Bloom filter of the complexity required to fine-tune the search performance of your application.

Follow the intro with a *Getting Started* section:

*Getting Started* - describe the *Requirements*, *Installation* and a trivial "hello world" example of the use of the library.

*Tutorial* - this section is "tutorial" in name only - the section should be a numbered step-by-step procedure for the reader to follow, resulting in a small but meaningful example working, with the expected output. Currently it is not certain what to do.

Following the Tutorial, your *Bloom Filter Primer*, *Choosing a Filter Combination* and *Benchmarks* are all appropriate.

*Reference* - I found it difficult to follow the Reference in the sense of "what am I looking at?". Best to be specific as to what is a class, template, method, property, field, macro etc., AND critically explain the use-case of each of these constructs in the intro or description of the construct - what is the scenario that this construct applies to? Some of the reference I didn't really understand - such as:

A compile-time std::size_t value indicating the number of (not necessarily distinct) bits set/checked per operation.

- what does *not necessarily distinct* mean?

*Appendix A* - The equations are somewhat daunting, and it is difficult to determine what is good practice and what is not. Perhaps add text to describe best practices in certain situations.

Perhaps add an *Acknowledgements* section at the end - authors, testers, motivation, etc.

*Summary*

I would most like to see a friendly introduction, a step-by-step tutorial, and a clearer reference. The case for Bloom filters seems strong, and good luck with the library.

I am the technical writer for the C++ Alliance.

- best

Peter Turcan

On 17/05/2025 at 0:42, Peter Turcan via Boost wrote:

Joaquín, Arnaud,
Thanks for submitting and managing the review of this library. Here is my review of the documentation. Good to see plenty of links, tables and structure to the reference, diagrams and equations, and performance benchmarks.

Hi Peter, thank you for your documentation review!

I think the doc could use a much friendlier introduction that explains the purpose of these useful filters to all C++ developers.

In the current documentation, this is the intended purpose of the "Primer" section. In my mind, there are two big groups in the audience:

* People who already know what a Bloom filter is and want to go straight into what this particular library offers.
* People who don't know about Bloom filters.

So, the primer is provided as a sort of skippable section depending on the previous background of the reader. In which ways do you think that this primer does not fulfill the goals of your proposed "Introduction" section?

I suggest something like this:
*Introduction* [...]
Now, an attempt is made to access the financial website that is protected by a pattern-matching Bloom filter. In our example, the Bloom filter could simply be:
If (request-name contains "e") then return positive, else return negative;

This is not how Bloom filters work. A Bloom filter does not process elements based on (implicit or explicit) patterns of the input data. Instead, for each element inserted into the filter, the values of the associated hash functions as applied to the element are used to calculate a number of bit positions that are then set to 1. The statistical magic is that when querying for an element that has _not_ been inserted, the probability that all the associated bits are set to 1 is very small --though not zero, hence the false positive rate thing.
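In toy-code form, the mechanism is roughly the following (just an illustration, not how boost::bloom::filter is actually implemented; the hash mixing constant is arbitrary):

#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy Bloom filter: insertion sets k hash-derived bit positions; a query
// reports "maybe present" only if all k positions are set.
class toy_bloom_filter {
public:
    toy_bloom_filter(std::size_t m, std::size_t k) : bits_(m, false), k_(k) {}

    void insert(const std::string& x) {
        for (std::size_t i = 0; i < k_; ++i) bits_[position(x, i)] = true;
    }

    bool may_contain(const std::string& x) const {
        for (std::size_t i = 0; i < k_; ++i)
            if (!bits_[position(x, i)]) return false; // definitely not inserted
        return true; // possibly inserted (false positives allowed)
    }

private:
    std::size_t position(const std::string& x, std::size_t i) const {
        // i-th probe via double hashing: derived from two hash values h0, h1
        std::size_t h0 = std::hash<std::string>{}(x);
        std::size_t h1 = h0 * 0x9e3779b97f4a7c15ull + 1; // cheap second hash
        return (h0 + i * h1) % bits_.size();
    }

    std::vector<bool> bits_;
    std::size_t       k_;
};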
If the primer has led you to this exposition of Bloom filters, then I guess it's not doing an excellent job at explaining the data structure :/ Have you written this proposed introduction based on your reading of my docs or from some other sources?

[...]
Whereas fraud detection is the top use of Bloom filters, [...]

I wouldn't say fraud detection is _the top use_ of Bloom filters, though it's certainly a reasonable application of this data structure.

they have proved useful in such situations as checking genome strings against a database of DNA sequences. These checks can be time intensive due to the length of DNA strings [...]

FWIW, there's a genomics-related example at https://github.com/joaquintides/bloom/blob/develop/example/genome.cpp

[...]
Follow the intro with a *Getting Started* section:
*Getting Started* - describe the *Requirements*, *Installation* and a trivial "hello world" example of the use of the library.

Yes, I can plug this info as a subsection of the introduction. Note that installation is basically non-existent, as the docs assume that the library is already part of Boost, and, being header-only, there's no additional build step.

*Tutorial* - this section is "tutorial" in name only - the section should be a numbered step-by-step procedure for the reader to follow - that results in a small but meaningful example working, with the expected output. Currently it is not certain what to do.

I can try to organize the tutorial as part of a narrative where a full example is built. I have some problems though with that, as some parts, notably the one where I describe the different configuration options, don't look to me amenable to that approach.

[...]
*Reference* - I found it difficult to follow the Reference in the sense of "what am I looking at?". Best to be specific as to what is a class, template, method, property, field, macro etc. AND critically explain the use-case of each of these constructs in the intro or description of the construct - what is the scenario that this construct applies to?

The reference is built in standardese format, so I guess it looks pretty straightforward to me cause I've wasted the best years of my youth reading the standard. Do you have any favorite Boost lib referencewise?

Some of the reference I didn't really understand - such as:
A compile-time std::size_t value indicating the number of (not necessarily distinct) bits set/checked per operation.
- what does *not necessarily distinct* mean?

The way a Bloom filter works, element insertion translates to a given number of bits being set to 1. But, as their positions are selected pseudorandomly and independently, it may be the case that some of the positions coincide.

*Appendix A* The equations are somewhat daunting and difficult to determine what is good practice and what is not. Perhaps add text to describe best practices in certain situations.

Sorry, I'm not getting you here. How are mathematical equations related to best practices? In any case, the appendix is provided precisely as an appendix because it's not essential to understanding or using the library --only the mathematically inclined will be looking at this.

Perhaps add an *Acknowledgements* section at the end - authors, testers, motivation, etc.

Absolutely, on my todo list.

*Summary* I would most like to see a friendly introduction, a step-by-step tutorial, and a clearer reference.
The case for Bloom filters seems strong, and good luck with the library.
Again, thanks for your review and useful tips. Joaquin M Lopez Munoz

Joaquin - thanks for your comments on my comments, here are my comments on your comments on my comments:
In which ways do you think that this primer does not fulfill the goals of your proposed "Introduction" section?
My issue with your Primer is that it explains what a Bloom filter is, not what it does - start the *Introduction* by answering the question "what is the problem that a Bloom Filter solves?". Don't move on to the "is" until you have answered "why should I be interested in this?".

A Bloom filter *is* a probabilistic data structure where inserted elements can be looked up with 100% accuracy, whereas looking up a non-inserted element may fail with some probability called the filter's *false positive rate* or FPR. The tradeoff here is that Bloom filters occupy much less space than traditional non-probabilistic containers (typically, around 8-20 bits per element) for an acceptably low FPR. The greater the filter's *capacity* (its size in bits), the lower the resulting FPR.

My 2c is that an *Introduction* section should describe the purpose and use cases of a library in a way that is understandable to a program manager or analyst who is researching the libraries to consider for their project, but who will not be as technical as their developers. It is people who need assistance who turn to overview documentation - typically experts in the field go straight to the *Reference*. If fraud detection is not the main use of Bloom Filters, what is? Be bold and articulate use cases. Every use-case you mention will pull in users.
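(As a back-of-the-envelope check of that "8-20 bits per element" figure, the standard textbook formulas for the optimal filter parameters give roughly the following - generic Bloom math, not library code:)

#include <cmath>
#include <cstdio>
#include <initializer_list>

// Optimal classical-Bloom parameters for a target false positive rate eps:
//   bits per element  m/n = -ln(eps) / (ln 2)^2
//   hash functions    k   = (m/n) * ln 2
int main()
{
    for (double eps : {0.05, 0.01, 0.001}) {
        double bits_per_elem = -std::log(eps) / (std::log(2.0) * std::log(2.0));
        double k             = bits_per_elem * std::log(2.0);
        std::printf("FPR %.3f -> %.1f bits/element, k ~ %.1f\n",
                    eps, bits_per_elem, k);
    }
}

For FPR targets between 5% and 0.1% this lands at roughly 6 to 14 bits per element, which is where the "8-20 bits" rule of thumb comes from.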
Have you written this proposed introduction based on your reading of my docs or from some other sources?
Other sources, for the reason described above.
I can try to organize the tutorial as part of a narrative where a full example is built. I have some problems though with that, as some parts, notably the one where I describe the different configuration options, don't look to me amenable to that approach.
Configuration options belong in the *Getting Started/Installation* sections; a *Tutorial* should be step by step. Consider deciding on the appropriate configuration for your tutorial, and run with that. The purpose of a first tutorial is to provide the user with a "feel good" experience that encourages them to go further - it need not be complete in any sense.
Do you have any favorite Boost lib referencewise?
Take a look at this ref entry from Boost.Geometry - I don't have to look outside of this page to understand what it does, and how to use it:

https://www.boost.org/doc/libs/1_85_0/libs/geometry/doc/html/geometry/refere...

Users of your library will jump straight into *Reference* pages, and straight back out again. Best to fully describe the purpose and functionality of each construct without requiring the reader to leave that page. Repetition within a *Reference* is not necessarily much of an issue, as references are not read front to back like a book.

Perhaps identify the reference entries that do the heavy lifting, and consider expanding them so that they are self-contained explanations with examples and diagrams if needed.

Hope this is useful, and good luck with this effort!

- Peter Turcan

On 19/05/2025 at 22:36, Peter Turcan via Boost wrote:
Joaquin - thanks for your comments on my comments, here are my comments on your comments on my comments:
In which ways do you think that this primer does not fulfill the goals of your proposed "Introduction" section?
My issue with your Primer is that it explains what a Bloom filter is, not what it does - start the *Introduction *by answering the question "what is the problem that a Bloom Filter solves?". Don't move onto the "is" until you have answered "why should I be interested in this?".
A Bloom filter *is* a probabilistic data structure where inserted elements can be looked up with 100% accuracy, whereas looking up for a non-inserted element may fail with some probability called the filter’s *false positive rate* or FPR. The tradeoff here is that Bloom filters occupy much less space than traditional non-probabilistic containers (typically, around 8-20 bits per element) for an acceptably low FPR. The greater the filter’s *capacity* (its size in bits), the lower the resulting FPR.
My 2c is that an *Introduction *section should describe the purpose and use cases of a library that is understandable to a program manager or analyst who is researching the libraries to consider for their project but who will not be as technical as their developers. It is people who need assistance that turn to overview documentation - typically experts in the field go straight to the *Reference*.
Makes sense.
If fraud detection is not the main use of Bloom Filters, what is? Be bold and articulate use cases. Every use-case you mention will pull in users.
Some use cases (rather, areas of application) are mentioned in the primer. I can try to make this more prominent.
Have you written this proposed introduction based on your reading of my docs or from some other sources?
Other sources, for the reason described above.
I can try to organize the tutorial as part of a narrative where a full example is built. I have some problems though with that, as some parts, notably the one where I describe the different configuration options, don't look to me amenable to that approach.
Configuration options belong in the *Getting Started/Installation* sections, a *Tutorial *should step by step. Consider deciding on the appropriate configuration for your tutorial, and run with that. The purpose of a first tutorial is to provide the user with a "feel good" experience that encourages them to go further - it need not be complete in any sense.
Well, I'm not sure filter configuration belongs in the getting started section, but I'll see what I can do.
Do you have any favorite Boost lib referencewise? Take a look at this ref entry from Boost.Geometry - I don't have to look out of this page to understand what it does, and how to use it:
https://www.boost.org/doc/libs/1_85_0/libs/geometry/doc/html/geometry/refere...
This makes sense for the kind of entities described in Boost.Geometry, namely meaty algorithms with a lot of explaining behind them. Most of the entities described in Boost.Bloom are far simpler than that, and far less isolated. For instance, boost::bloom::filter has 18 constructors; what would I write about them other than list them and say what they do in one line? The only important component in the library is boost::bloom::filter. I tried to give the more user-friendly explanation, examples etc. in the tutorial --what's left in the reference is just dry exposition, standard-style. I still feel this is ok as long as the tutorial does the heavy work of introducing the users to the component more gently.
Users of your library will jump straight into *Reference *pages, and straight back out again. Best to fully describe the purpose and functionality of each construct without requiring leaving that page. Repetition within a *Reference *is not necessarily much of an issue as they are not read front to back like a book.
Perhaps identify the reference entries that do the heavy-lifting, and consider expanding them so that they are self-contained explanations with examples and diagrams if needed.
Hope this is useful and good luck with this effort!
It is useful, thanks for your feedback. I'll do my best to apply it to improve the documentation. Joaquin M Lopez Munoz

Hi All,

This is a mini review of the proposed Bloom library. I have spent a few hours researching the background and a couple of hours with the example programs, modified them in various ways and did not encounter any showstopper problems. I am knowledgeable in the areas of probability, random numbers and related mathematical areas.

Cheers,
Kostas

=========================================
Institute of Nuclear and Particle Physics
NCSR Demokritos
http://inspirehep.net/author/profile/K.G.Savvidy.1
https://mixmax.hepforge.org
====================================================================

CONSIDERATIONS:

This library implements the classic Bloom filter, which can be used to implement an O(1)-time computation of whether a given object belongs in a previously preprocessed set. The preprocessing takes O(n) time and, importantly, O(n) in storage for a bit array, but the original data can be optionally discarded. I was reminded of the storage requirement while experimenting with the provided example for genomic data: the example datafile is 140MB, while the resulting filter contains a lookup table taking up 180MB (!!!), with the 0.01 setting for target false positive rate (FPR).

Further, I have compared the "predicted" false positive rate obtained from the library function genome_filter::fpr_for(n, f.capacity()) with the actual FPR. This FPR appears to be overpessimistic, as experiments with the real data showed. Setting the filter FPR to 0.5 or even 0.9 (which results in a much smaller lookup table), there were almost no false positives in practice. I do not fully understand the full reasons for this, but it appears one of the reasons is: THE "SET" WE INSERTED HAD MANY IDENTICAL ITEMS. In this case, the stream had many identical k-mers; to be precise, out of 142 million base pairs, there are actually ~22 million identical 20-mers, but even when FPR is set to 0.9 only 40 million extra false collisions are generated, which is way less than 90%. Further investigation suggests that fpr_for is far from reality: the bit array for FPR=0.9 is 2^28=268m bits wide, of which only 142m - 22m = 120m are occupied, thus the correct FPR is 120/268 = 44%, in rough agreement with the observed 40m/120m. Similarly, when I request fpr=0.01 or 1%, the number of false collisions is only 86k, so the actual FPR is probably closer to 2*86k/120m = 0.14%. This necessitates my recommendation 2), see below.

EASE OF USE:

Appears to be reasonable. Some amount of code to realize the filter needs to be written by hand. In particular, if the items are long strings or some other nontrivial datatype, one needs to write a function called "hash_value" which maps the item to a std::size_t.

DOCUMENTATION:

Clear writing, background is discussed, examples are good, and references to relevant papers are provided.

DESIGN:

Is solid to my eyes, with the exception of a hand-rolled random number generator "mcg_and_fastrange" in boost/bloom/detail/core.hpp. Maybe this is good enough, but it is not very good. It attempts two things at once: generate a position in the bit array to insert a bit, as well as "expand" a hash. However, when the bucket size is a power of two, say m=2^28, the multiplier will be chosen as 2^28+3; needless to say, it means the lower 28 bits will behave badly, as if the hash was produced with h' = h*3 mod 2^64. The technique of producing a random integer on a range 0..m according to Lemire is solid, but it cannot be combined to produce the random numbers themselves. Hence my recommendation 3).
CONCLUSION:

The popularity of Bloom filters is possibly due to actual performance (false positive rate) being better in practice than estimated a priori. This is not for free, and neither is it magic, as it comes at the cost of large memory storage, which is 1.44*n*log2(1/ε) bits, or 5-6 bits per item. Nevertheless, it is definitely a useful general-purpose algorithm.

RECOMMENDATIONS:

1) Accept the library, with conditions.

2) Provide a second, more accurate way to compute FPR, not based on an a priori estimate from the number of elements already inserted, but on the actual number of bits that are set in the bit array. This quantity could be computed at small cost during insertion by keeping track of collisions, or evaluated after the fact from a full scan of the bit array (total Hamming weight).

3) For the rng, use h' = h*a mod 2^64 with a=6364136223846793005, the known good multiplier from Knuth, then high(h*m) for the random position.
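Concretely, 3) would look something like this (sketch only; the 128-bit multiply below is GCC/Clang-specific):

#include <cstdint>

// Advance the hash with Knuth's 64-bit multiplier (wraps mod 2^64), then map
// it to a bit position in [0, m) by taking the high 64 bits of h*m
// (Lemire's range reduction).
inline std::uint64_t next_hash(std::uint64_t h)
{
    return h * 6364136223846793005ull; // h' = h*a mod 2^64
}

inline std::uint64_t bit_position(std::uint64_t h, std::uint64_t m)
{
    return static_cast<std::uint64_t>(((__uint128_t)h * m) >> 64); // high(h*m)
}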

On 18/05/2025 at 2:18, Kostas Savvidis via Boost wrote:
Hi All,
This is a mini review of the proposed Bloom library.
Hi Kostas, thanks for your review!
[...] ==================================================================== CONSIDERATIONS: This library implements the classic Bloom filter which can be used to implement an O(1)-time computation of whether a given object belongs in a previously preprocessed set. The preprocessing takes O(n) time and importantly O(n) in storage for a bit array, but the original data can be optionally discarded. I was reminded of the storage requirement while experimenting with the provided example for genomic data: the example datafile is 140MB, while the resulting filter contains a lookup table taking up 180MB (!!!), with the 0.01 setting for target false positive rate (FPR). Further, I have compared the "predicted" false positive error rate obtained from the library function genome_filter::fpr_for(n, f.capacity()) with the actual FPR.
By actual FPR you mean the FPR for individual 20-mers, not for whole DNA sequences, right? Sequence FPR is expected to be much lower than 20-mer FPR.
This FPR appears to be overpessimistic, as experiments with the real data showed. Setting the filter FPR to 0.5 or even 0.9 (which results in a much smaller lookup table) there were almost no false positives in practice. I do not fully understand the full reasons for this, but it appears one of the reasons is: THE "SET" WE INSERTED HAD MANY IDENTICAL ITEMS, In this case, the stream had many identical k-mers, to be precise, out of 142 million base pairs, there is actually ~22 million identical 20-mers, but even when FPR is set to 0.9 only 40 million extra false collisions are generated which is way less than 90%. Further investigation suggests that fpr_for is far from reality: the bit array for FPR=0.9 is 2^28=268m bits wide, of which only 142m - 22m = 120m are occupied, thus the correct FPR rate is 120/268 = 44%, in rough agreement with the observed 40m/120m. Similarly, when I request fpr=0.01 or 1%, the number of false collisions is only 86k, so actual FPR is probably closer to 2*86k/120m = 0.14%.
We must be doing radically different things, because my local experiments don't look at all like what you see. I've written an instrumented version of the genomics example here:

https://gist.github.com/joaquintides/e8fbf87fb0dd0b6fda87969c676c1e34

and I get these results for the genome of Drosophila melanogaster:

actual unique kmers: 121430875
bits set: 719647041
capacity: 1533664512
estimated unique kmers based on array occupancy: 121434296
actual fpr: 0.004248

If I use the exact number of k-mers as an input on filter construction instead of the original approach, where the number of k-mers is taken to be approximately equal to the number of bytes of the file (defining the macro USE_ACTUAL_NUMBER_K_MERS), I get:

actual unique kmers: 121430875
bits set: 680516771
capacity: 1278575360
estimated unique kmers based on array occupancy: 121434774
actual fpr: 0.01011

So, the actual FPR is very much in line with the value specified on construction, and the number of different 20-mers is 121M rather than 22M. Maybe we're using different input files? I'm using GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna obtained from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4/
This necessitates my recommendation 2), see below.
EASE OF USE: Appears to be reasonable. Some amount of code to realize the filter needs to be written by hand. In particular, if the items are long strings or some other nontrivial datatype, one needs to write a function called "hash_value" which maps the item to a std::size_t.
Strictly speaking, this is not required by the library itself, but by boost::hash (the default hash function). If you use some other hash function (say, std::hash<my_type>), the exact requirement will be different, but in general you need to provide a hashing procedure for any user-defined class you wish to use with boost::bloom::filter (or any hash container, for that matter).
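For illustration, the customization point boost::hash looks for is a free hash_value() overload found by ADL; a minimal sketch for a hypothetical user type (the type and its fields are made up, not part of the library):

#include <boost/container_hash/hash.hpp>
#include <cstddef>
#include <string>

namespace myapp {

struct record {
    std::string name;
    int         id;
};

// Found by ADL; combines the members' hashes into one std::size_t.
std::size_t hash_value(const record& r)
{
    std::size_t seed = 0;
    boost::hash_combine(seed, r.name);
    boost::hash_combine(seed, r.id);
    return seed;
}

} // namespace myapp

With that in place, boost::hash<myapp::record> works out of the box, and so does a filter (or any Boost hash container) using it as its default hash function.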
DOCUMENTATION Clear writing, background is discussed, examples are good, and references to relevant papers are provided.
DESIGN: Is solid to my eyes, with the exception of a hand rolled random number generator "mcg_and_fastrange" in boost/bloom/detail/core.hpp Maybe this is good enough, but it is not very good. It attempts two things at once, generate a position in the bit array to insert a bit as well as "expand" a hash. However, when the bucket size is a power of two, say m=2^28 the multiplier will be chosen as 2^28+3, needless to say it means the lower 28 bits will behave badly, as if hash was produced with h' = h*3 mod 64. The technique of producing a random integer on a range 0..m according to Lemire is solid, but it cannot be combined to produce the random numbers themselves. Hence my recommendation 3).
I will run local experiments to see if some FPR degradation occurs due to a poor MCG being used when m is a power of two. Note that this procedure is not required to produce high entropy, but only to yield different {hi} values for different h0 initial values, as explained for the double hash scheme in section 3 of https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
CONCLUSION: The popularity of the Bloom filters is possibly due to actual performance (false positive rate) being better in practice than estimated apriori. This is not for free and neither it is magic, as it comes at the cost of large memory storage, which is 1.44*n*log2(1/ε), or 5-6 bits per item. Nevertheless, it is definitely a useful general-purpose algorithm.
I'd be happy to extend the library to support more space-efficient filters such as Cuckoo or xor, though they seem to be much slower, and this library focuses heavily on speed (I dare say it's the fastest around). But as I said, if there's an interest in having additional filter structures I can definitely tackle that.
RECOMMENDATIONS: 1) Accept the library, with conditions, 2) provide a second more accurate way to compute FPR, not based on apriori estimate from the number of elements already inserted but on the actual number of bits that are set in the bit array. This quantity could be computed at small cost during insertion by keeping track of collisions, or evaluated after the fact from the full scan of the bit array (total Hamming weight).
I can certainly do that:

* Keeping track of collisions: a member function try_insert (or some such name) can be written that returns false if the element existed prior to insertion. This way the user themself can keep track of the actual number of elements --I prefer this to the alternative of tasking boost::bloom::filter with keeping track itself, as this introduces a penalty even for users not interested in the functionality.
* Estimating FPR based on array occupancy: I know how to estimate the number of elements from the number of bits set to 1 with

n* = -(m/k) ln(1 - num_bits_set/m),   (1)

and from there I can use the standard FPR formula with n*, m and k. The problem is that (1) is valid for the classical Bloom filter, but I don't think it applies to the block and multiblock filters also provided by candidate Boost.Bloom. FWIW, I used (1) in the instrumented genome example above, which uses a non-classical filter, and it worked like a charm (?).
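In code, the occupancy-based estimate would be something like this (a sketch under classical-filter assumptions only, per the caveat above):

#include <cmath>

// m = capacity in bits, k = bits set per insertion,
// bits_set = number of 1 bits currently in the array.
double estimated_fpr(double m, double k, double bits_set)
{
    double n_est = -(m / k) * std::log(1.0 - bits_set / m); // formula (1)
    return std::pow(1.0 - std::exp(-k * n_est / m), k);     // classical FPR formula
}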
3) for rng, use h' = h*a mod 2^64 with a=6364136223846793005, the known good multiplier from Knuth, then high(h*m) for random position
As mentioned above, let me try to measure the statistical soundness of the current approach. Will keep you posted. Again, thank you for your review, Joaquin M Lopez Munoz

Tue, 13 May 2025 at 12:48, Arnaud Becheler via Boost <boost@lists.boost.org>:
Dear Boost community,
The review of the Boost.Bloom library begins today May 13th, 2025, and will run through May 22nd, 2025.
I'm curious if you have measured the library against other Bloom libraries. After some searching I discovered that there's a bunch of Rust implementations. And you mention one C++ and a few Java-based implementations in another thread.

On 18/05/2025 at 17:50, Дмитрий Архипов via Boost wrote:
I'm curious if you have measured the library against other Bloom libraries. After some searching I discovered that there's a bunch of Rust implementations. And you mention one C++ and a few Java-based implementations in another thread.
Yes, I've measured it against the following (some of these provide other data structures, I only measured against the Bloom-like ones):

* Lemire's fastfilter_cpp: https://github.com/FastFilter/fastfilter_cpp
* Sasha Krassovsky's bloom_filters: https://github.com/save-buffer/bloomfilter_benchmarks
* Jim Apple's libfilter: https://github.com/jbapple/libfilter (libfilter's Bloom filter code is adapted from Apache Kudu: https://github.com/apache/kudu/blob/master/src/kudu/util/block_bloom_filter_...)
* fastbloom (Rust): https://github.com/tomtomwombat/fastbloom

fastfilter_cpp and bloom_filters are slower or much slower. libfilter has similar performance to boost::bloom::fast_multiblock32<8>; both use AVX2 acceleration. As for fastbloom, I did a rough comparison (a more direct benchmark is difficult as fastbloom is written in Rust), results here:

https://www.reddit.com/r/rust/comments/1kkhppa/media_bloom_filter_accuracy_u...

Joaquin M Lopez Munoz

Tue, 13 May 2025 at 12:48, Arnaud Becheler via Boost <boost@lists.boost.org>:
Dear Boost community,
The review of the Boost.Bloom library begins today May 13th, 2025, and will run through May 22nd, 2025.
This is my review of the proposed Boost.Bloom library.

First of all, I want to disclose that I work for the C++ Alliance. Also, I don't have any particular knowledge of probabilistic theory or container implementations. Between reading documentation and tinkering with the tests, examples, and benchmarks, I'd say I've spent around 6 hours.

The library is very to the point: it implements a particular data structure and provides ways to customise it for your needs. The documentation also explains why one would need to customise it. Customisation is very easily done, which is good. The API is small, but I looked at other libraries that implement Bloom filters and couldn't find anything that's missing. After thinking about it for several days I could only conceive of one possible extra operation: looking for several items at once may potentially be faster than doing it in sequence.

The container provided by this library can be used to optimise search. This has a broad range of applications (the docs have a link to a paper that lists several), but on the other hand is limited by the speed of the search that is being optimised compared to the speed of the Bloom filter. Which is why it is crucial for Bloom filters to be as fast as possible. The provided benchmarks compare Bloom filters with boost::unordered_flat_set (AFAIK the fastest dictionary container in the world) and there are indeed configurations where bloom::filter wins even in that competition. Joaquin also assures that competitor libraries are at least not faster than his. So, I think that the library has utility.

My only criticism of the documentation is that I don't believe it mentions why Bloom filters are called that. When I started doing my review I thought that there was some metaphorical connection between the container and flowers blooming. I am in general not a fan of one-page documentation, but this one is short enough to work well in this format.

I recommend we ACCEPT the library into Boost. I want to thank Joaquin for submitting the library and Arnaud for managing the review process.

On 21/05/2025 at 9:39, Дмитрий Архипов via Boost wrote:
This is my review of the proposed Boost.Bloom library.
Thanks Dmitry for your review!
First of all, I want to disclose that I work for the C++ Alliance. Also, I don't have any particular knowledge of probabilistic theory or container implementations. Between reading documentation and tinkering with the tests, examples, and benchmarks, I'd say I've spent around 6 hours.
[...] The API is small, but I looked at other libraries that implement Bloom filters and couldn't find anything that's missing. After thinking about it for several days I could only conceive of one possible extra operation: looking for several items at once may potentially be faster than doing it in sequence.
Yes, we could potentially leverage the same technique used by Boost.Unordered concurrent containers' bulk visitation: https://bannalia.blogspot.com/2023/10/bulk-visitation-in-boostconcurrentflat...
[...]
My only criticism of the documentation is that I don't believe it mentions why Bloom filters are called that. When I started doing my review I thought that there is some metaphorical connection between the container and flowers blooming.
This was also observed by Peter Turcan. I'll add a mention of that in the docs.
I am in general not a fan of one page documentation, but this one is short enough to work well in this format.
If the library is accepted, my plan is to ask Chris Mazakas for help to port the docs to multi-page Antora. Joaquin M Lopez Munoz

On Wed, May 21, 2025 at 6:32 PM Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
If the library is accepted, my plan is to ask Chris Mazakas for help to port the docs to multi-page Antora.
This is a general comment, not directed at Bloom specifically. What I love about the Mp11, Hash2, and Bloom (current version) docs is that I can use find-in-browser to quickly locate some example I am looking for. For example, recently I remembered that the phrase "or better yet" was used in a part of the docs I was reading, so I just searched for that; similarly when I want to see all usages of fast_, or avalanching. If documentation is not single-page, jumping quickly through all results with context is harder than just going through the page with find. To be clear, this is nitpicking; paged or one-page documentation will work great, and even ignoring the few content suggestions people have made (including mine), it is already at a very, very high level.

On Wed, May 21, 2025 at 10:30 AM Ivan Matek via Boost <boost@lists.boost.org> wrote:
This is general comment, not directed at Bloom directly. What I love about Mp11, Hash2, Bloom(current version) docs is that I can use find in browser to quickly find some example I am looking for. For example recently I remembered that phrase "or better yet" was used in part of docs I was reading so I just searched for that, similarly when I want to see all usages of fast_, or avalanching. If documentation is not single page jumping quickly through all results with context is harder than just going through the page with find.
This is kind of funny because I'm the exact opposite. I find myself not enjoying something like the b2 docs because it's all a single page, so if I wanna search a somewhat generic term, I get like 200 results on the page, which is borderline useless. I was very happy with how the Unordered docs came out once we went multi-page. Something like a library-wide search might be in the cards though. - Christian

On Tue, 13 May 2025 at 11:48, Arnaud Becheler via Boost <boost@lists.boost.org> wrote:
Dear Boost community,
The review of the Boost.Bloom library begins today May 13th, 2025, and will run through May 22nd, 2025.
Hi all,

Since I'm not very knowledgeable about Bloom filters or containers, and I'm friends with the author, I won't submit a formal review of the proposed Boost.Bloom. However, I still would like to provide some (hopefully useful) feedback for the author.

In general, I think the library is great - it solves one problem and solves it well, and the documentation does a great job explaining the domain problem, the design decisions that were made and how to use the library.

Some points regarding the interface and implementation:

* Peeking at the implementation, I've got the impression that the block class, when using integers distinct from unsigned char (e.g. block<std::uint32_t>), ends up using reinterpret_cast<std::uint32_t*> on the underlying unsigned char array. AFAIK this is undefined behavior, since no std::uint32_t object exists at this location. I don't know how relevant this is though, as I guess it works fine in most compilers.
* Is there a reason to use boost::uint64_t and boost::uint32_t in a C++11 library (vs. std::uint64_t and std::uint32_t)?
* Does filter::reset() without arguments make sense?
* Does filter::emplace() add any value over just constructing and inserting an object? In a traditional container, this is justified because the elements are actually stored, but that does not seem to be the case here.
* Do the constructor and operator= overloads taking a range of elements add any value? They seem to be equivalent to constructing/assigning and then inserting the values separately.
* The block classes seem to perform no type checking on the passed Block type argument.
* The convenience header <boost/bloom.hpp> seems to be missing.
* I actually like the naming chosen for filter::may_contain(). I think it conveys pretty succinctly what it does.
* I also like the fact that you provided a natvis file for your classes.

Some points regarding the documentation:

* I struggle to understand the BucketSize template parameter. I think this is because the term "bucket" is never mentioned in the primer. It'd be cool if some information regarding this concept was mentioned in the primer.
* The serialization example saves and loads a filter using multiblock<std::uint64_t>. Is the bit pattern generated by the filter endianness-dependent? One of the papers you reference [1] mentions sending such representations over the network, which would make this point relevant for the reader.
* I've found the examples more difficult to read than I'd have liked. Some more comments would be useful.
* Examples in the docs would benefit from syntax highlighting. You can use asciidoc [source,cpp] blocks to achieve it.
* I'd try to avoid C-style casts in the examples.
* I'd try to make code snippets as close to real code as possible. That is, I'd try to avoid things like "using filter = boost::bloom::filter<std::string, ...>;", in favor of something that the user may copy to their code.

Some points regarding testing:

* It looks like the examples are not being built by any CI build. This is dangerous because it leads to code rot.
* It looks like the tests are not being built from CMake. Having "if(BUILD_TESTING AND EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/CMakeLists.txt")" in the main CMakeLists.txt may trick the reader into thinking they are, but there is no test/CMakeLists.txt. I understand that this is what's generated by default by boostdep --cmake, but I'd advise either adding a test/CMakeLists.txt (preferable) or removing the if statement altogether.
* The Jamfile in example/ shouldn't specify <cxxstd> directly.
* Adding a code coverage metric would be useful.

Hope this helps.

Regards,
Ruben.

[1] https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf

On 21/05/2025 at 11:05, Ruben Perez via Boost wrote:
On Tue, 13 May 2025 at 11:48, Arnaud Becheler via Boost <boost@lists.boost.org> wrote:
Dear Boost community,
The review of the Boost.Bloom library begins today May 13th, 2025, and will run through May 22nd, 2025.
Hi all,
Since I'm not very knowledgeable of Bloom filters or containers, and I'm friends with the author, I won't emit a formal review of the proposed Boost.Bloom. However, I still would like to provide some (hopefully useful) feedback for the author.
Your feedback is much welcome!
In general, I think the library is great - it solves one problem and solves it well, and the documentation does a great job explaining the domain problem, the design decisions that were made and how to use the library.
Some points regarding the interface and implementation:
* Peeking the implementation, I've got the impression that the block class, when using integers distinct to unsigned char (e.g. block<std::uint32_t>), end up using reinterpret_cast<std::uint32_t*> on the underlying unsigned char array. AFAIK this is undefined behavior, since no std::uint32_t object exists at this location. I don't know how relevant this is though, as I guess it works fine in most compilers.
There are two situations here:

* When possible, the array memory is aligned to Block and accessed directly as if it were an array of Blocks. I think we're good here as Block, being an unsigned integral type, is also an implicit-lifetime type.
* When alignment is not possible (typically, when there's subarray overlap), Blocks are memcpy'ed from and to the array. Again I think this is legal because unsigned integral types are trivially copyable.
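That is, the memcpy path amounts to something along these lines (an illustrative sketch, not the actual library code):

#include <cstdint>
#include <cstring>

// Round-tripping a Block value (here std::uint32_t) through the unsigned
// char array via memcpy is well-defined because the type is trivially
// copyable; no std::uint32_t object needs to live in the array.
inline std::uint32_t load_block(const unsigned char* p)
{
    std::uint32_t b;
    std::memcpy(&b, p, sizeof b);
    return b;
}

inline void store_block(unsigned char* p, std::uint32_t b)
{
    std::memcpy(p, &b, sizeof b);
}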
* Is there a reason to use boost::uint64_t and boost::uint32_t in a C++11 library? (vs. std::uint64_t and std::uint32_t).
I don't know, I'm just in the habit of using <boost/cstdint.hpp> since time immemorial. Yes, I shall use <cstdint> instead.
* Does filter::reset() without arguments make sense?
filter::reset() is the same as filter::reset(0), i.e. it deallocates the bit array. I find reset() convenient and reasonably self-explanatory, but if the community thinks otherwise I can just omit the default arg value.
* Does filter::emplace() add any value over just constructing and inserting an object? In a traditional container, this is justified because the elements are actually stored, but that does not seem to be the case here.
It's basically there for consistency with standard container APIs. There's an astronomically unlikely use case for it, namely uses-allocator value_type construction: https://en.cppreference.com/w/cpp/memory/uses_allocator
* Do the constructor and operator= overloads taking a range of elements add any value? They seem to be equivalent to constructing/assigning and then inserting the values separately.
Again, consistency with standard APIs. Yes, they're equivalent to construction plus insertion.
* The block classes seem to perform no type checking on the passed Block type argument.
You're right, I had it in my mental backlog and forgot about it. Will fix.
* The convenience header <boost/bloom.hpp> seems to be missing.
I'm a little wary of omnibus headers cause they keep growing as the lib does. In this particular case the whole library is very small, so I guess it's not a problem.
[...]
Some points regarding the documentation:
* I struggle to understand the BucketSize template parameter. I think this is because the term "bucket" is never mentioned in the primer. It'd be cool if some information regarding this concept was mentioned in the primer.
The concept does not belong in the primer as this business with overlapping subarrays is an innovation in candidate Boost.Bloom. Anyway, this was already brought up by Ivan Matek, I will add a diagram to support the explanation on buckets, subarrays, and how they interact.
* The serialization example saves and loads a filter using multiblock<std::uint64_t>. Is the bit pattern generated by the filter endianness-dependent? One of the papers you reference [1] mentions sending such representations over the network, which would make this point relevant for the reader.
Yes, it is endianness-dependent. I guess I should mention this somewhere in the docs.
* I've found the examples more difficult to read than what I'd have liked. Some more comments would be useful.
Noted, I can add more comments.
* Examples in the docs would benefit from syntax highlighting. You can use asciidoc [source,cpp] blocks to achieve it.
This is how it was initially but then I took it back for some reason I can't remember now. Will look at it again.
* I'd try to avoid C-style casts in the examples.
Noted.
* I'd try to make code snippets as close to real code as possible. That is, I'd try to avoid things like "using filter = boost::bloom::filter<std::string, ...>;", in favor of something that the user may copy to their code.
You mean the user-defined filter alias having the same name as boost::bloom::filter? I don't see a problem with that, even in real code.
Some points regarding testing:
* It looks like the examples are not being built by any CI build. This is dangerous because it leads to code rot.
Ok. Do you know of any Boost lib already doing that from which I can take inspiration?
* It looks like the tests are not being built from CMake. Having "if(BUILD_TESTING AND EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/CMakeLists.txt")" in the main CMakeLists.txt may trick the reader into thinking they are, but there is no test/CMakeLists.txt. I understand that this is what's generated by default by boostdep --cmake, but I'd advise either to add a test/CMakeLists.txt (preferable) or removing the if statement altogether.
I depend on others' kindness for that as my knowledge of CMake is embarrassingly limited. Will ask for help to do it (if you know how to do it and would like to submit a PR, be my guest!)
* The Jamfile in example/ shouldn't specify <cxxstd> directly.
How should I do it, then?
* Adding a code coverage metric would be useful.
Noted, will do.
Hope this helps.
It certainly does! My post-review backlog is growing by the hour :) Thanks again for taking the time to go through my proposal. Joaquin M Lopez Munoz

Joaquin M López Muñoz wrote:
* It looks like the tests are not being built from CMake. Having "if(BUILD_TESTING AND EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/test/CMakeLists.txt")" in the main CMakeLists.txt may trick the reader into thinking they are, but there is no test/CMakeLists.txt. I understand that this is what's generated by default by boostdep --cmake, but I'd advise either to add a test/CMakeLists.txt (preferable) or removing the if statement altogether.
I depend on others' kindness for that as my knowledge of CMake is embarrassingly limited. Will ask for help to do it (if you know how to do it and would like to submit a PR, be my guest!)
You need to simplify your test/Jamfile. Instead of

  test-suite "bloom" :
      [ run test_array.cpp ]
      [ run test_capacity.cpp ]
      [ run test_combination.cpp ]
      [ run test_comparison.cpp ]
      [ run test_construction.cpp ]
      [ run test_fpr.cpp ]
      [ run test_insertion.cpp ]
      ;

have just

  run test_array.cpp ;
  run test_capacity.cpp ;
  run test_combination.cpp ;
  run test_comparison.cpp ;
  run test_construction.cpp ;
  run test_fpr.cpp ;
  run test_insertion.cpp ;

and then the "generic" test/CMakeLists.txt that uses boost_test_jamfile, which you can lift from somewhere else, will work with minimal changes.

I vote to ACCEPT this library.

First and foremost, a Bloom filter is a true return to form for Boost, implementing a classic data structure from computer science. Second, it has broadly understood practical use-cases already adopted by industry. Third, it was written by one Joaquín M López Muñoz who I'd say is one of the best container authors I've ever had the joys to work with.

The documentation is nice and clear, explaining the application of the container and then the backing theory. Choosing a filter can seem a bit daunting but the docs do a good job here guiding users through it. And I'm sure the defaults are sufficient for the majority of users anyway.

The implementation is robust and also vector-optimized. The presented benchmarks showcase the advantages of using the Bloom filter once you hit a certain element count, which is fantastic.

So it seems to me like all we have here is a well-understood data structure with an incredibly high-quality implementation being presented to us. Seems like an easy acceptance vote to me.

- Christian

On 21/05/2025 at 21:52, Christian Mazakas via Boost wrote:
I vote to ACCEPT this library.
First and foremost, a Bloom filter is a true return to form for Boost, implementing a classic data structure from computer science.
Second, it has broadly understood practical use-cases already adopted by industry.
Third, it was written by one Joaquín M López Muñoz who I'd say is one of the best container authors I've ever had the joys to work with.
The documentation is nice and clear, explaining the application of the container and then the backing theory.
Choosing a filter can seem a bit daunting but the docs do a good job here guiding users through it. And I'm sure the defaults are sufficient for the majority of users anyway.
The implementation is robust and also vector-optimized. The presented benchmarks showcase the advantages of using the Bloom filter once you hit a certain element count, which is fantastic.
So it seems to me like all we have here is a well-understood data structure with an incredibly high-quality implementation being presented to us. Seems like an easy acceptance vote to me.
Hi Christian, thank you for your review! Joaquin M Lopez Munoz

What follows is my review of Boost.Bloom.

Please note, I am a decision-maker at the C++ Alliance and Joaquin has a working relationship with the non-profit. The Boost.Bloom library was not sponsored in any part by C++ Alliance, and I was not aware he was working on it until he first announced it. That said, Joaquin is a pretty solid developer and the key word which describes his work output is "consistent quality."

When I first looked at the library I was not able to use its CMakeLists.txt to create a proper Visual Studio solution and project files containing the header files of the library so that I can browse the code, and projects containing the tests so I can set breakpoints and inspect the behavior of the implementation at run-time. However, Mohammad kindly provided a patch and thus the version of the library I am reviewing is slightly different, as it contains this additional commit: https://github.com/ashtum/bloom/commit/e119a3c868757104d929b21b5a0d06266d60b...

Although it is not a requirement, I do wish more libraries submitted for review had these additions to the CMakeLists.txt, as my skills at compiling and running things outside of Visual Studio are incredibly limited (or rather, non-existent). I am far more likely to submit a review for a library that I can interact with in the Visual Studio IDE with a minimum of effort. Thankfully the solution was generated and I got this nice looking set of projects and files (https://i.imgur.com/s7JNZ05.png).

Unfortunately I also got these warnings: https://gist.github.com/vinniefalco/056fe64e6302a761b703cb4ec2f5a255

Documentation is a mixed bag. I can't seem to find very important things like the minimum version of C++ required, or which compilers it supports, like I can find here: https://www.boost.org/doc/libs/latest/libs/json/doc/html/json/overview.html#... The content of the documentation is high quality but the layout is in my opinion terrible. I find it hard to read these class synopses and the styling is an impediment to legibility. This is not a problem specific to Boost.Bloom. Hopefully some of the work the C++ Alliance is doing in the area of documentation will empower authors to do better.

The inclusion of the Visual Studio visualizer ("natvis file") is a very nice touch and an indicator of quality. Hopefully we will see the corresponding gdb pretty printer. Or perhaps someone might contribute it.

Heads up, I am quite familiar with Bloom filters in terms of implementation experience, although Joaquin is way ahead of me on the math and optimization. Bloom filters are very common in decentralized network programs such as Gnutella, BearShare, LimeWire, and now bitcoind. For example, thin bitcoind nodes can provide their directly connected peers with a compressed bloom filter, to reduce broadcast message volume to the subset of interested transactions: https://bitcoinops.org/en/topics/transaction-bloom-filtering

Before the review period started, I had mentioned at least twice that it would be nice to see a demo of Boost.Bloom data structures patched into some other popular software which already uses these filters. A quick review of the repository turns up these leads which could provide for interesting study:
https://github.com/bitcoin/bitcoin/blob/35bf3f88398da0211cece0a29da4de001e34...
https://github.com/bitcoin/bitcoin/blob/35bf3f88398da0211cece0a29da4de001e34...
https://github.com/bitcoin/bitcoin/blob/35bf3f88398da0211cece0a29da4de001e34...
The "rolling bloom filter" referenced above is a slightly different data structure from what Boost.Bloom offers, and I can't help but wonder if this "CRollingBloomFilter" can be implemented in terms of Boost.Bloom. If the Boost.Bloom powered version of bitcoind proves more efficient than its current implementation, it will be adopted by that project as such efficiencies contribute to increased decentralization. The example in the README could be improved by declaring the filter thusly: using filter = boost::bloom::filter<boost::core::string_view, 5>; I don't think it is a good idea to give users an example which performs unnecessary copying or allocation. I'm puzzled about the terminology. "insert" and "emplace" imply that new nodes or array elements of type value_type are being stored. This is not the case though, as these objects are hashed and the resulting digest is used to algorithmically adjust the filter which is in essence an array of bool. `emplace` documentation says that it constructs elements. This talk of subfilters is confusing and sounds like an implementation detail. And I trust that whatever is there is necessary, as Joaquin is not prone to frivolities. I think that the array() function(s) need to guarantee that bits beyond the capacity but present in the storage will be zeroed out. This way they can be serialized deterministically. I looked through the implementation, the optimizations are a bit beyond me although it appears to be well thought out and the benchmarks are nice. I ran the examples and I set some breakpoints and stepped into the code, where I understood that insert and emplace do not store the elements yet we are constructing them anyway which is disappointing for a string_view aficionado such as myself. I suppose this review can mostly be considered feedback and not any meaningful opposition to its inclusion in Boost, therefore my vote is to ACCEPT this library. Joaquin has a track record of perpetual improvements to the libraries he works on so even if there is an identified shortcoming, I doubt it will last long. I appreciate the work that went into the library and I think it will do well in its domain. Thanks

Vinnie Falco wrote:
Although it is not a requirement, I do wish more libraries submitted for review had these additions to the CMakeLists.txt as my skills at compiling and running things outside of Visual Studio are incredibly limited (or rather, non-existent).
Or not. I don't think we should uglify all our CMakeLists.txt files beyond recognition for the benefit of exactly one person in the observable universe.

On Thu, May 22, 2025 at 8:47 AM Peter Dimov via Boost <boost@lists.boost.org> wrote:
Vinnie Falco wrote:
Although it is not a requirement, I do wish more libraries submitted for review had these additions to the CMakeLists.txt as my skills at compiling and running things outside of Visual Studio are incredibly limited (or rather, non-existent).
Or not. I don't think we should uglify all our CMakeLists.txt files beyond recognition for the benefit of exactly one person in the observable universe.
I'm actually curious whether it would've sufficed to just add a test/CMakeLists.txt that simply did: boost_test_jamfile(...) and then generate the solution from the boost-root, as is the standard for Boost. Interestingly, I would've had to manually add this for my own workflows, as I use clangd, which is based on a compile_commands.json. Without tests, there would've been nothing to compile. - Christian

On 22/05/2025 at 16:29, Vinnie Falco via Boost wrote:
[...]
Documentation is a mixed bag. I can't seem to find very important things like the minimum version of C++ required, or which compilers it supports, like I can find here:
https://www.boost.org/doc/libs/latest/libs/json/doc/html/json/overview.html#...
Peter Turcan also asked for a getting started section along these lines. I will add it as a subsection to the introduction.
The content of the documentation is high quality but the layout is in my opinion terrible. I find it hard to read these class synopses and the styling is an impediment to legibility. This is not a problem specific to Boost.Bloom. Hopefully some of the work the C++ Alliance is doing in the area of documentation will empower authors to do better.
My plan is, if the library is accepted, to ask for help to port this to Antora (aka Boost.Unordered style).
The inclusion of the Visual Studio visualizer ("natvis file") is a very nice touch and an indicator of quality. Hopefully we will see the corresponding gdb pretty printer. Or perhaps someone might contribute it.
I already have in mind who I'll ask to do this :-)
Heads up, I am quite familiar with Bloom filters in terms of implementation experience although Joaquin is way ahead of me on the math and optimization. Bloom filters are very common in decentralized network programs such as Gnutella, BearShare, LimeWire, and now bitcoind. For example, thin bitcoind nodes can provide their directly connected peers with a compressed bloom filter, to reduce broadcast message volume to the subset of interested transactions:
https://bitcoinops.org/en/topics/transaction-bloom-filtering
Before the review period started, I had mentioned at least twice that it would be nice to see a demo of Boost.Bloom data structures patched into some other popular software which already uses these filters. A quick review of the repository turns up these leads which could provide for interesting study:
https://github.com/bitcoin/bitcoin/blob/35bf3f88398da0211cece0a29da4de001e34... https://github.com/bitcoin/bitcoin/blob/35bf3f88398da0211cece0a29da4de001e34... https://github.com/bitcoin/bitcoin/blob/35bf3f88398da0211cece0a29da4de001e34...
The "rolling bloom filter" referenced above is a slightly different data structure from what Boost.Bloom offers, and I can't help but wonder if this "CRollingBloomFilter" can be implemented in terms of Boost.Bloom. If the Boost.Bloom powered version of bitcoind proves more efficient than its current implementation, it will be adopted by that project as such efficiencies contribute to increased decentralization.
In fact I'm somewhat surprised most reviewers haven't asked for variations of/alternatives to Bloom filters. This can be added to the roadmap if the library gains traction, I guess.
The example in the README could be improved by declaring the filter thusly:
using filter = boost::bloom::filter<boost::core::string_view, 5>;
I don't think it is a good idea to give users an example which performs unnecessary copying or allocation.
This is not needed because the filter (with the exception of emplace) does not create elements on its own --you comment on this later in the review.
I'm puzzled about the terminology. "insert" and "emplace" imply that new nodes or array elements of type value_type are being stored. This is not the case though, as these objects are hashed and the resulting digest is used to algorithmically adjust the filter which is in essence an array of bool. `emplace` documentation says that it constructs elements.
The interface mimics that of a container as a way to lower the entry barrier for new users learning it. But filters are _not_ containers, as you've found out. Claudio also mentioned this. I'll stress the point in some prominent place in the documentation. BTW, emplace() does construct the element, calculates its hash, and then throws it away.
This talk of subfilters is confusing and sounds like an implementation detail. And I trust that whatever is there is necessary, as Joaquin is not prone to frivolities.
I'll expand on this in the docs. Subfilters are the way the library allows users to configure the filter according to different FPR/speed/size tradeoffs. Ruben Perez and Ivan Matek also requested more clarity in this area.
I think that the array() function(s) need to guarantee that bits beyond the capacity but present in the storage will be zeroed out. This way they can be serialized deterministically.
Not sure I'm getting you, but the extent of the span returned by array() is exactly as wide as capacity() indicates.
I looked through the implementation, the optimizations are a bit beyond me although it appears to be well thought out and the benchmarks are nice. I ran the examples and I set some breakpoints and stepped into the code, where I understood that insert and emplace do not store the elements yet we are constructing them anyway which is disappointing for a string_view aficionado such as myself.
I suppose this review can mostly be considered feedback and not any meaningful opposition to its inclusion in Boost, therefore my vote is to ACCEPT this library. Joaquin has a track record of perpetual improvements to the libraries he works on so even if there is an identified shortcoming, I doubt it will last long. I appreciate the work that went into the library and I think it will do well in its domain.
Thanks for your review Vinnie! Joaquin M Lopez Munoz

On Thu, May 22, 2025 at 10:09 AM Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
In fact I'm somewhat surprised most reviewers haven't asked for variations of/alternatives to Bloom filters. This can be added to the roadmap if the library gains traction, I guess.
Ideally, variations would have motivating use-cases so their performance could be studied in a real-world setting.
The example in the README could be improved by declaring the filter thusly:
using filter = boost::bloom::filter<boost::core::string_view, 5>;
I don't think it is a good idea to give users an example which performs unnecessary copying or allocation.
This is not needed cause the filter (with the exception of emplace) does not create elements on its own --you comment on this later in the review.
It does create the element temporarily. Here, we are constructing a `std::string` from `char const*`: https://github.com/joaquintides/bloom/blob/fabc8698c1053efdfe5336de2d01a4fb3... This is only because the filter type uses `std::string`. I realize in the example that these probably qualify for std::string's small buffer optimization, and users might be surprised when "inserting" larger strings that dynamic allocations take place.
I'm puzzled about the terminology. "insert" and "emplace" imply that new nodes or array elements of type value_type are being stored. This is not the case though, as these objects are hashed and the resulting digest is used to algorithmically adjust the filter which is in essence an array of bool. `emplace` documentation says that it constructs elements.
The interface mimics that of a container as a way to lower the entry barrier for new users learning it. But filters are _not_ containers, as you've found out. Claudio also mentioned this. I'll stress the point in some prominent place in the documentation.
I have the opposite opinion: the interface should distance itself from a typical container interface, otherwise new users may mistakenly believe that they behave similarly in terms of resource usage. At least when it comes to strings, proper selection of the template parameter type seems to be a foot-gun. Maybe there is room for a type alias "string_filter" which is tuned for character strings.
BTW, emplace() does construct the element, calculates its hash, and then throws it away.
Yeah, this is weird. Why even have the element type? `insert` can in theory accept any object which is hashable. And emplace doesn't seem as valuable for the bloom filter container as it is for regular containers, since there are no nodes. That is, rather than `emplace(args...)` a user could just as easily write `insert(T(args...))`
I think that the array() function(s) need to guarantee that bits beyond the capacity but present in the storage will be zeroed out. This way they can be serialized deterministically.
Not sure I'm getting you, but the extent of the span returned by array() is exactly as wide as capacity() indicates.
What if I construct the filter with say, a prime number of bits like 137? What happens to the last 7 bits (to bring it up to 144)? Thanks

On 22/05/2025 at 19:36, Vinnie Falco wrote:
On Thu, May 22, 2025 at 10:09 AM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
In fact I'm somewhat surprised most reviewers haven't asked for variations of/alternatives to Bloom filters. This can be added to the roadmap if the library gains traction, I guess.
Ideally, variations would have motivating use-cases so their performance could be studied in a real-world setting.
> The example in the README could be improved by declaring the filter thusly:
>
> using filter = boost::bloom::filter<boost::core::string_view, 5>;
>
> I don't think it is a good idea to give users an example which performs
> unnecessary copying or allocation.
This is not needed because the filter (with the exception of emplace) does not create elements on its own --you comment on this later in the review.
It does create the element temporarily. Here, we are constructing a `std::string` from `char const*`:
https://github.com/joaquintides/bloom/blob/fabc8698c1053efdfe5336de2d01a4fb3...
This is only because the filter type uses `std::string`. [...]
Ah ok. Yes, a string must be constructed to extract its hash value. boost::bloom::filter supports heterogeneous lookup, so you can avoid the construction that way, or use std::string_view to begin with, as you pointed out.
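A minimal sketch of the second option (assuming the filter template, bit-count constructor and insert() usage shown elsewhere in this thread, and that the default hash function supports std::string_view; not taken verbatim from the documentation):

  #include <boost/bloom/filter.hpp>
  #include <string_view>

  int main()
  {
      // 1,000,000 bits requested; 5 bits set per element.
      boost::bloom::filter<std::string_view, 5> f(1000000);
      f.insert("john-dillinger@mail.com"); // no std::string temporary is built
  }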
> I'm puzzled about the terminology. "insert" and "emplace" imply that new > nodes or array elements of type value_type are being stored. This is not > the case though, as these objects are hashed and the resulting digest is > used to algorithmically adjust the filter which is in essence an array of > bool. `emplace` documentation says that it constructs elements.
The interface mimics that of a container as a way to lower the entry barrier for new users learning it. But filters are _not_ containers, as you've found out. Claudio also mentioned this. I'll stress the point in some prominent place in the documentation.
I have the opposite opinion: the interface should distance itself from a typical container interface, otherwise new users may mistakenly believe that they behave similarly in terms of resource usage. At least when it comes to strings, proper selection of the template parameter type seems to be a foot-gun. Maybe there is room for a type alias "string_filter" which is tuned for character strings.
I personally don't see the need: if you want to process strings, you have to be aware of the implications of using std::string vs. std::string_view; this is not something specific to candidate Boost.Bloom per se.
BTW, emplace() does construct the element, calculates its hash, and then throws it away.
Yeah, this is weird. Why even have the element type? `insert` can in theory accept any object which is hashable.
I've been tempted to introduce an insert_hash_value method, but this looks mighty dangerous for the casual user. As an alternative, one can always declare a filter<std::size_t, ...> and do all the hashing stuff externally.
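A sketch of that alternative (hypothetical usage, assuming the constructor and insert() interface discussed in this thread): the digest is computed once, outside the filter, and reused across filters without re-hashing the original value.

  #include <boost/bloom/filter.hpp>
  #include <boost/container_hash/hash.hpp>
  #include <cstddef>
  #include <string>

  int main()
  {
      // Two filters keyed on a pre-computed digest.
      boost::bloom::filter<std::size_t, 5> f1(1000000);
      boost::bloom::filter<std::size_t, 5> f2(500000);

      std::size_t h = boost::hash<std::string>()("john-dillinger@mail.com");
      f1.insert(h); // the same digest is reused...
      f2.insert(h); // ...without re-hashing the string
  }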
And emplace doesn't seem as valuable for the bloom filter container as it is for regular containers, since there are no nodes. That is, rather than `emplace(args...)` a user could just as easily write `insert(T(args...))`
It's there just for consistency with container APIs. It also supports the incredibly unlikely case of uses-allocator construction, but this is admittedly not a real motivation: https://en.cppreference.com/w/cpp/memory/uses_allocator
> I think that the array() function(s) need to guarantee that bits beyond the
> capacity but present in the storage will be zeroed out. This way they can
> be serialized deterministically.
Not sure I'm getting you, but the extent of the span returned by array() is exactly as wide as capacity() indicates.
What if I construct the filter with say, a prime number of bits like 137? What happens to the last 7 bits (to bring it up to 144)?
You can _request_ 137 bits, but the resulting capacity will be slightly larger than that to accommodate the subfilter block size and round up to whole bytes. Joaquin M Lopez Munoz

On Thu, May 22, 2025 at 11:24 AM Joaquin M López Muñoz via Boost < boost@lists.boost.org> wrote:
I've been tempted to introduced an insert_hash_value method, but this looks mighty dangerous for the casual user.
I am all for creating simple but usable interfaces, yet we need to be careful about who is in our target audience. "Casual" users of Bloom filters will be more technically savvy than casual users of, say, vectors. You can go your entire career without ever using or hearing about Bloom filters. `insert_hash_value` is an important function for setting filter entries in more than one container without re-computing digests. Eventually someone is going to want this API.
It's there just for consistency with container APIs. It also supports the incredibly unlikely case of uses-allocator constuction, but this is admittedly not a real motivation:
I am growing increasingly convinced that the adherence to a container-like API is providing meager returns.
Since the container does not store elements, you could probably make this library not-header-only and instead ship it as a compiled lib component, which of course I would find preferable for all the benefits that compiled libraries provide: faster compile times, reduced executable bloat, opportunities for increased compiler optimizations. Thanks

On Thu, May 22, 2025, at 7:09 PM, Joaquin M López Muñoz via Boost wrote:
BTW, emplace() does construct the element, calculates its hash, and then throws it away.
In other words, it *doesn't emplace*. "emplace" to me always meant "inplace-construction" (for node-based containers, std::optional, std::any, std::variant). If `emplace` by definition destructs the element it constructed, should it have a different name?

I think I agree with those that think the adherence to container-like member names does more harm than good. `insert` is fine, because in _some sense_ (potential set membership) an element is being inserted. For the current semantics, `emplace` should be called something like `insert_from_temporary`, but I don't see the appeal over `insert(T(...))`.

The uses-allocator argument just highlights to me that having the allocator be `allocator<T>` instead of `allocator<unsigned char>` feels misleading to begin with. (There might be advanced scenarios that I missed in my reading of the docs, but my understanding is that bloom::filter in principle never stores T?) In this light, I prefer the caller be in charge of any allocators used for temporaries of T. It's actually more likely that it would *not* coincide with the allocator for the filter array (e.g. a stack/pool allocator for repeated temps; bloom filters themselves typically have longer lifetimes and fixed capacities, the polar opposite usage pattern).

Also, having the temporary right there in the interface (`insert(T(....))`) makes the [allocation] cost visible to the user. It might prompt the user to switch to a custom `Hash` type that doesn't incur the [same] cost.

Hopefully my understanding doesn't completely miss the mark - in which case perhaps it could spark more ideas for documentation improvements :)

Seth (PS. This is not a review, because I didn't spend enough time)

I noticed the specialized emplace() right away - it looked weird so I thought about it for a bit:

  template<typename... Args>
  BOOST_FORCEINLINE void emplace(Args&&... args)
  {
    insert(detail::allocator_constructed<allocator_type,value_type>{
      get_allocator(),std::forward<Args>(args)...}.value());
  }

  template<
    typename U,
    typename std::enable_if<
      std::is_same<T,detail::remove_cvref_t<U>>::value>::type* =nullptr
  >
  BOOST_FORCEINLINE void emplace(U&& x)
  {
    insert(x); /* avoid value_type construction */
  }

Given T t; So, emplace(std::move(t)) will call emplace(U&&), which doesn't move anything out of the value x, cuz insert(const&). emplace(t) will call the emplace(Args&&... args) because emplace(U&&) can't accept a const T& - and will make a copy of t before calling insert().

I get why it was done, but I find it to be obfuscating or confusing. You could as well just declare the emplace for T as emplace(const U&) (which also looks weird) and a copy wouldn't be made, but it will still accept the call emplace(std::move(t)). I think ridding emplace entirely is probably the best move - we aren't emplacing anything in reality.

That's my two cents, and also not really a review since I didn't spend enough time for a real review. Hopefully I'm not annoying anyone - lol.

bien

On 23.05.25 at 03:13, David Bien via Boost wrote:

I noticed the specialized emplace() right away - it looked weird so I thought about it for a bit:

  template<typename... Args>
  BOOST_FORCEINLINE void emplace(Args&&... args)
  {
    insert(detail::allocator_constructed<allocator_type,value_type>{
      get_allocator(),std::forward<Args>(args)...}.value());
  }

  template<
    typename U,
    typename std::enable_if<
      std::is_same<T,detail::remove_cvref_t<U>>::value>::type* =nullptr
  >
  BOOST_FORCEINLINE void emplace(U&& x)
  {
    insert(x); /* avoid value_type construction */
  }

Given T t; So, emplace(std::move(t)) will call emplace(U&&), which doesn't move anything out of the value x, cuz insert(const&).

Which is fine, isn't it? No function is required to "destroy" the rvalue argument.

emplace(t) will call the emplace(Args&&... args) because emplace(U&&) can't accept a const T& - and will make a copy of t before calling insert().

I don't think so. Isn't the 2nd overload a universal reference? I.e. it trivially does accept a const T&, and in fact I think this is the whole point of it, as indicated by the concept-like enable_if. So it accepts any form of a T, and no copy should be made. Although I guess a std::forward would make the pattern more clear.

Why even have the element type? `insert` can in theory accept any object which is hashable. And emplace doesn't seem as valuable for the bloom filter container as it is for regular containers, since there are no nodes. That is, rather than `emplace(args...)` a user could just as easily write `insert(T(args...))`

Having the element type is useful to distinguish a filter for ints vs strings or some custom type. I wouldn't want to lose that. As for emplace, I see the value in using the container's allocator. Maybe this should be highlighted in the documentation as the main reason over `insert(T(...))`. Also, the interface of implicitly constructing the T avoids the noise in the call.

On 23/05/2025 at 3:13, David Bien via Boost wrote:
I noticed the specialized emplace() right away - it looked weird so I thought about it for a bit:
  template<typename... Args>
  BOOST_FORCEINLINE void emplace(Args&&... args)
  {
    insert(detail::allocator_constructed<allocator_type,value_type>{
      get_allocator(),std::forward<Args>(args)...}.value());
  }

  template<
    typename U,
    typename std::enable_if<
      std::is_same<T,detail::remove_cvref_t<U>>::value>::type* =nullptr
  >
  BOOST_FORCEINLINE void emplace(U&& x)
  {
    insert(x); /* avoid value_type construction */
  }
Given T t; So, the emplace( std::move(t)) will call emplace(U&&), which doesn’t move anything out of the value x, cuz insert(const&). emplace(t) will call the emplace(Args&&… args) because emplace(U&&) can’t accept a const T& - and will make a copy of t before calling insert().
No, emplace(t) also selects

  template<
    typename U,
    typename std::enable_if<
      std::is_same<T,detail::remove_cvref_t<U>>::value>::type* =nullptr
  >
  BOOST_FORCEINLINE void emplace(U&& x)
  {
    insert(x); /* avoid value_type construction */
  }

So, no copy of t is incurred. U&& is a so-called universal reference that matches everything, including const T&.
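A small standalone demonstration of that point (generic C++, not library code): a forwarding reference constrained to T binds T&, const T& and T&& alike without making a copy.

  #include <iostream>
  #include <type_traits>
  #include <utility>

  struct T
  {
      T() = default;
      T(const T&) { std::cout << "copy\n"; }
  };

  template<
      typename U,
      typename std::enable_if<
          std::is_same<T, typename std::decay<U>::type>::value>::type* = nullptr>
  void emplace(U&&) {}  // binds any form of T; never copies

  int main()
  {
      T t;
      const T& ct = t;
      emplace(t);             // U deduced as T&
      emplace(ct);            // U deduced as const T&
      emplace(std::move(t));  // U deduced as T
      // nothing is printed: no copy of T is ever made
  }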
I get why it was done, but I find it to be obfuscating or confusing. You could as well just declare the emplace for T as emplace(const U&) (which also looks weird) and a copy wouldn’t be made but it will still accept the call emplace(std::move(t)).
I think ridding emplace entirely is probably the best move - we aren’t emplacing anything in reality.
That’s my two cents, and also not really a review since I didn’t spend enough time for a real review. Hopefully I’m not annoying anyone - lol.
On the contrary, I'm happy the review is producing such interesting conversations. Joaquin M Lopez Munoz

On 23/05/2025 at 1:24, Seth via Boost wrote:
On Thu, May 22, 2025, at 7:09 PM, Joaquin M López Muñoz via Boost wrote:
BTW, emplace() does construct the element, calculates its hash, and then throws it away.
In other words, it *doesn't emplace*.
"emplace" to me always meant "inplace-construction" (for node-based containers, std::optional, std::any, std::variant). If `emplace` by definition destructs the element constructed, should it have a different name?
I think I agree with those that think the adherence to container-like member names does more harm than good. `insert` is fine, because in _some sense_ (potential set membership) an element is being inserted. For the current semantics, `emplace` should be called something like `insert_from_temporary`, but I don't see the appeal over `insert(T(...))`.
My personal preference is towards mimicking container APIs because this is likely one of the most familiar interfaces out there. Alexander seems to be on the same front. Anyway, I'll do whatever the community agrees on --or the review manager decides. This particular member function is a very minor detail in the grand scheme of things.
The uses-allocator argument just highlights to me, that having the allocator be `allocator<T>` instead of `allocator<unsigned char>` feels misleading to begin with. (There might be advanced scenarios that I missed reading of the docs, but my understanding is that bloom::filter in principle never stores T?).
bloom::filter never stores T. As for having allocator<T>, again this is for consistency with all standard usages I know of, where the default allocator is std::allocator<value_type>. Internally, an allocator rebound to unsigned char is indeed used.
In this light, I prefer the caller be in charge of any allocators used for temporaries of T. It's actually more likely that it would *not* coincide with the allocator for the filter array (e.g. a stack/pool allocator for repeated temps; bloom filters themselves typically have longer lifetimes and fixed capacities, the polar opposite usage pattern).
Right now, it is perfectly possible (and expected) to pass T values to the filter that have been constructed by the user any way they please.
Also, having the temporary right there in the interface (`insert(T(....))`) makes the [allocation] cost visible to the user. It might prompt to the user to switch to a custom `Hash` type that doesn't incur the [same] cost.
Hopefully my understanding doesn't completely miss the mark - in which case perhaps it could spark more ideas for documentation improvements :)
I think your understanding of the matter is correct. Thanks for your comments! Joaquin M Lopez Munoz

Hi all,

Bottom line first: In my opinion, the proposed Bloom library should be ACCEPTED to Boost. It implements a useful data structure, and does it with high quality. Joaquín, thank you for submitting this library to Boost!

Now for the details.

Background:

This is my first time reviewing a proposed Boost library. I recently finished implementing a Blocked Bloom filter, so this library is very interesting to me and I have some relevant knowledge. Sadly, my version of the BF requires some unique customizations which would likely make it difficult to use the proposed library in my code (I won’t go into details, as they are not relevant, and I wouldn't want the library to become encumbered by multiple customization options). I will only mention that in my implementation, reducing memory access to a single cache line was found to be critical for performance.

Design & documentation:

The design of the library looks solid, with good design choices. However, I’m missing the rationale behind those choices. One example is that k is a compile-time constant. As a user this is far from optimal. I do not care about the number of bits used for each key, only that it is optimal. In practice, the optimal value for k is a function of the target FPR. In other cases the target is the memory overhead (m/n) and k is derived from that. I can choose m, n, and the FPR at runtime, but I have to decide on k in compile-time. I understand that there are implementation considerations in favor of making k a compile-time constant, I just wish to have that explained in the docs.

Another design limitation is that to my understanding the library doesn’t actually offer an implementation of the classical Bloom filter (contradicting the claim in the first line of the docs). The default behavior is a variant that sets/gets bits in k different 1-byte blocks, and there is no option for a block-less BF. This is different from the classical Bloom filter in which 2 or more hashes can target the same byte, and even the same bit. The FPR will be very slightly higher than the FPR for a classical BF with the same parameters, which makes the current calculation very slightly off. I think this should at least be addressed in the documentation.

I wanted to use a Blocked BF with blocks of 1 cache line. It was not obvious from the docs that this is possible, because the docs say that the underlying type for the Block must be an unsigned integral type. I tried the following, which seemed to work:

  using CacheLine = std::bitset<512>;
  using filter = boost::bloom::filter<T, 1, boost::bloom::block<CacheLine, K>>;

I would like to see this officially supported and documented as a main use-case.

The purpose of multiblock is not clear. Seems to me that only the _fast variants are useful, for their speed. Otherwise, a block with sufficient size (e.g. using std::bitset as above, but might need something with a better implementation) would be better due to slightly lower FPR.

The documentation mentions many examples like this: “filter<1,block<uint32_t,K>>”. However, the first template parameter is missing, and this is confusing. The examples should all provide valid types and working code. This appears in several places.

“For given values of n and m, the optimum k is the integer closest to [formula]” Mathematically speaking, this is not strictly correct. For example, with m/n=12.2553042 you will get k_opt=8.4947, but choosing k=9 will give you (very slightly) lower FPR.
On the other hand, in this case I would still choose k=8 because 8 iterations is less than 9. So it's mostly a matter of wording.

“Without formal justification, we have relaxed this even further to just one initial hash value [...]” You might be interested in SingleHash [https://yangzhou1997.github.io/paper/singlehash-bigcomp18.pdf] which provides formal justification.

Implementation:

I only reviewed the code very lightly, and only tried to use the filter in some basic code, so I don’t have much to contribute in this area.

To use multiblock I needed to include a dedicated header, which is not needed for block. This is a bit confusing.

Estimating FPR based on array occupancy: You wrote in another thread that you “know how to estimate the number of elements from the number of bits set [… but the formula] is valid for the classical Bloom filter”. Since each block is itself a classical filter, and expected value is linear, you can apply the formula per block (using K') and then sum the total for the whole filter, and finally divide by K. I think this is a nice feature to have.

Summary:

As I wrote above, this is a useful high-quality library. I don't think a single Bloom library can support all possible variations, so it's important to document the design choices that were made. Thank you for contributing this code to Boost.

Tomer Vromen

On 22/05/2025 at 17:35, Vromen, Tomer via Boost wrote:
Hi all,
Bottom line first: In my opinion, the proposed Bloom library should be ACCEPTED to Boost. It implements a useful data structure, and does it with high quality. Joaquín, thank you for submitting this library to Boost!
Hi Tomer, I already answered Arnaud's repost of your review before your own post hit the ML, so let's continue the conversation on that thread. Thank you! Joaquin M Lopez Munoz

Dear Boost community, Tomer encountered some issues with the ML and asked me to forward their review. Since the deadline is approaching, if anyone else is interested in reviewing, please let me know—we may be able to arrange an extension. Best regards, Arnaud Becheler Review Manager, Boost.Bloom

Dear Boost community, I want to sincerely thank everyone for your thoughtful contributions and active participation throughout the review of the Boost Bloom library. Your insights have been extremely helpful. With this message, I’m officially closing the review period. I’ll be compiling the feedback and will share a summary report with you within the next week. Coincidentally, it's midnight in Paris and there are fireworks going off outside as I type this—a fitting celebration to wrap up our review! Thank you again for your involvement! Best regards, Arnaud Becheler Review Manager, Boost.Bloom

On 22.05.25 at 22:15, Arnaud Becheler via Boost wrote:
One example is that k is a compile-time constant. As a user this is far from optimal. I do not care about the number of bits used for each key, only that it is optimal. In practice, the optimal value for k is a function of the target FPR. In other cases the target is the memory overhead (m/n) and k is derived from that. I can choose m, n, and the FPR at runtime, but I have to decide on k in compile-time. I understand that there are implementation considerations in favor of making k a compile-time constant, I just wish to have that explained in the docs.
An idea: Other containers use "-1" as template arguments for "runtime size". If this isn't too hard to support it might provide the extra flexibility for power users.
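A hypothetical sketch of that pattern (names invented for illustration, not part of Boost.Bloom), similar in spirit to std::span's std::dynamic_extent: a sentinel template argument selects a runtime-k specialization.

  #include <cstddef>

  constexpr std::size_t runtime_k = static_cast<std::size_t>(-1);

  template<typename T, std::size_t K>
  struct filter_sketch                  // K fixed at compile time (fast path)
  {
      static constexpr std::size_t k() { return K; }
  };

  template<typename T>
  struct filter_sketch<T, runtime_k>    // K chosen at run time (flexible path)
  {
      explicit filter_sketch(std::size_t k) : k_value(k) {}
      std::size_t k() const { return k_value; }
      std::size_t k_value;
  };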

On 22/05/2025 at 22:15, Arnaud Becheler via Boost wrote:
Dear Boost community,
Tomer encountered some issues with the ML and asked me to forward their review. Since the deadline is approaching, if anyone else is interested in reviewing, please let me know—we may be able to arrange an extension.
Best regards, Arnaud Becheler Review Manager, Boost.Bloom
--- Hi all,
Bottom line first: In my opinion, the proposed Bloom library should be ACCEPTED to Boost. It implements a useful data structure, and does it with high quality. Joaquín, thank you for submitting this library to Boost!
Hi Tomer, thanks for your review!
Now for the details.
Background:
This is my first time reviewing a proposed Boost library. I recently finished implementing a Blocked Bloom filter, so this library is very interesting to me and I have some relevant knowledge. Sadly, my version of the BF requires some unique customizations which would likely make it difficult to use the proposed library in my code (I won’t go into details, as they are not relevant, and I wouldn't want the library to become encumbered by multiple customization options). I will only mention that in my implementation, reducing memory access to a single cache line was found to be critical for performance.
Design & documentation:
The design of the library looks solid, with good design choices. However, I’m missing the rationale behind those choices.
One example is that k is a compile-time constant. As a user this is far from optimal. I do not care about the number of bits used for each key, only that it is optimal. In practice, the optimal value for k is a function of the target FPR. In other cases the target is the memory overhead (m/n) and k is derived from that. I can choose m, n, and the FPR at runtime, but I have to decide on k in compile-time. I understand that there are implementation considerations in favor of making k a compile-time constant, I just wish to have that explained in the docs.
The rationale is that having k a compile-time value makes the filter faster. I've been playing with variations where k can be specified at run time and they're 10-20% slower. Of course I can provide those variations as part of the library, and in fact I'm inclined to include them in the roadmap.
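A sketch of why a compile-time k helps (a generic illustration, not the library's actual implementation): the per-element loop has a trip count known to the compiler, so it can be fully unrolled and the hash derivations interleaved. The mixing constant below is purely illustrative.

  #include <cstddef>
  #include <cstdint>

  template<std::size_t K>
  void set_k_bits(unsigned char* array, std::size_t m_bits, std::uint64_t h)
  {
      for(std::size_t i = 0; i < K; ++i)
      {
          std::uint64_t pos = h % m_bits;       // position of the i-th bit
          array[pos / 8] |= (1u << (pos % 8));  // set it
          h = h * 0x9E3779B97F4A7C15ull + i;    // derive next value (illustrative mixing)
      }
  }

  int main()
  {
      unsigned char bits[128] = {};            // 1024-bit array
      set_k_bits<5>(bits, 1024, 0x12345678u);  // k = 5 known at compile time
  }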
Another design limitation is that to my understanding the library doesn’t actually offer an implementation of the classical Bloom filter (contradicting the claim in the first line of the docs). The default behavior is a variant that sets/gets bits in k different 1-byte blocks, and there is no option for a block-less BF. This is different from the classical Bloom filter in which 2 or more hashes can target the same byte, and even the same bit. The FPR will be very slightly higher than the FPR for a classical BF with the same parameters, which makes the current calculation very slightly off. I think this should at least be addressed in the documentation.
The default configuration for boost::bloom::filter: filter<T, K, block<unsigned char, 1>> is _exactly_ a classical Bloom filter: randomly selecting an unsigned char and then selecting a random bit in it is the same as randomly selecting a bit in the whole array (repeat K times). This is also mentioned in https://master.bloom.cpp.al/html/index.html#fpr_estimation "We have the particular case FPR_block(n,m,b,k,1) = FPR_multiblock(n,m,b,k,1) = FPR(n,m,k), which follows simply from the observation that using {block|multiblock}<Block, 1> behaves exactly as a classical Bloom filter." (It is not hard to prove the equation expanding the formulas for k' = 1).
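For reference, the classical false positive rate being matched here is the standard textbook result (not quoted from the Boost.Bloom docs), with n inserted elements, m bits of storage and k bits set per element:

  \mathrm{FPR}(n,m,k) = \left(1-\left(1-\tfrac{1}{m}\right)^{kn}\right)^{k} \approx \left(1-e^{-kn/m}\right)^{k}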
I wanted to use a Blocked BF with blocks of 1 cache line. It was not obvious from the docs that this is possible, because the docs say that the underlying type for the Block must be an unsigned integral type. I tried the following which seemed to work:
using CacheLine = std::bitset<512>; using filter = boost::bloom::filter<T, 1, boost::bloom::block<CacheLine, K>>;
I would like to see this officially supported and documented as a main use-case.
Yes, this sort of works currently and you've got 64-byte alignment. Internally, though, it's hitting UB as the assumption that Block is an unsigned integral type translates to treating it as a trivial type (which std::bitset is not). Triviality is important because the internal bit array is exposed as precisely that, for instance for serialization purposes. If we relax the "unsigned integral type" requirement to "trivial type supporting the integral operations Block{0}, Block{1}, <<, >>, &, |, &= and |=", a custom uint512_t class could be used to fit the bill.
The purpose of multiblock is not clear. Seems to me that only the _fast variants are useful, for their speed. Otherwise, a block with sufficient size (e.g. using std::bitset as above, but might need something with better implementation) would be better due to slightly lower FPR.
block<uint512_t, 8> (assuming such uint512_t exists) would be much slower than multiblock<uint64_t, 8>. The latter would have slightly worse FPR, as you point out. Also, multiblock<uint64_t, 5> (for instance) operates over blocks of size 8*5 = 40 bytes, and you won't have a Block type of size 40 to match that.
The documentation mentions many examples like this: “filter<1,block<uint32_t,K>>” However, the first template parameter is missing, and this is confusing. The examples should all provide valid types and working code. This appears in several places.
Yes, this appears in tables to save space wrt filter<T, 1, ...> (tables are already very wide). Will try to accommodate.
“For given values of n and m, the optimum k is the integer closest to [formula]” Mathematically speaking, this is not strictly correct. For example, with m/n=12.2553042 you will get k_opt=8.4947, but choosing k=9 will give you (very slightly) lower FPR. On the other hand, in this case I would still choose k=8 because 8 iterations is less than 9. So it's mostly a matter of wording.
You're absolutely right, I went for that sloppy wording to avoid making it more complicated, but well, reviewers are unforgiving :) Will fix.
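For the record, the optimum referred to is presumably the standard result (the [formula] in the quoted sentence), which matches the figure above:

  k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2, \quad \text{so for } \frac{m}{n} = 12.2553042: \quad 12.2553042 \times 0.693147\ldots \approx 8.4947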
“Without formal justification, we have relaxed this even further to just one initial hash value [...]” You might be interested in SingleHash [ https://yangzhou1997.github.io/paper/singlehash-bigcomp18.pdf] which provides formal justification.
I had that paper on my radar, but the mixing procedure they use is not the same as ours. Sometime I'll try to crack the maths for this.
Implementation:
I only reviewed the code very lightly, and only tried to use the filter in some basic code, so I don’t have much to contribute in this area.
To use multiblock I needed to include a dedicated header, which is not needed for block. This is a bit confusing.
<boost/bloom/filter.hpp> includes <boost/bloom/block.hpp>, as block is the default type for boost::bloom::filter, hence the asymmetry. I understand this will be mitigated by an omnibus header <boost/bloom.hpp>, which was requested by Ruben Perez.
Estimating FPR based on array occupancy: You wrote in another thread that you “know how to estimate the number of elements from the number of bits set [… but the formula] is valid for the classical Bloom filter”. Since each block is itself a classical filter, and expected value is linear, you can apply the formula per block (using K') and then sum the total for the whole filter, and finally divide by K.
Wouldn't work for filter<T, K, block<..., K'>>, as multiple blocks are accessed per element.
I think this is a nice feature to have.
I concur. Again, I'll try to crack the math. If there's some mathematically inclined person reading this, any help would be appreciated.
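For reference, the classical-filter estimator being discussed is the standard one (a textbook result, not specific to Boost.Bloom), where X is the number of bits set in an array of m bits with k bits set per inserted element:

  \hat{n} = -\frac{m}{k}\,\ln\!\left(1-\frac{X}{m}\right)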
Summary:
As I wrote above, this is a useful high-quality library. I don't think a single Bloom library can support all possible variations, so it's important to document the design choices that were made.
Thank you for contributing this code to Boost.
Thank _you_ for your interesting remarks. Joaquin M Lopez Munoz
participants (14)
- Alexander Grund
- Arnaud Becheler
- Christian Mazakas
- David Bien
- Ivan Matek
- Joaquin M López Muñoz
- Kostas Savvidis
- Peter Dimov
- Peter Turcan
- Ruben Perez
- Seth
- Vinnie Falco
- Vromen, Tomer
- Дмитрий Архипов