Re: [boost] [Bloom] review starts on 13th of May

16 May 2025

      Joaquín, Arnaud,

Thanks for submitting and managing the review of this library.  Here is my
review of the documentation. Good to see plenty of links, tables and
structure to the reference, diagrams and equations, and performance
benchmarks.

I think the doc could use a much friendlier introduction that explains the
purpose of these useful filters to all C++ developers. I suggest something
like this:

*Introduction*
Bloom filters provide a pattern-matching mechanism to rapidly determine if
a given item - an email, URL, genome, username, digital fingerprint
etc. - *might
*be a member of a long and important list. A key feature of Bloom filters
is that there are no false negatives - if the result of the filter is *negative
*then the item does not occur in the list with a mathematical certainty of
100%. But there can be false positives - if the result of the filter
is *positive
*it means that the item *may *occur in the list but the filter is not
certain. The certainty of avoiding false negatives makes Bloom filters
particularly useful in fraud detection. This is best explained by an
over-simplified example. Say you have a list, a database, of known bad
actors, that needs to be checked before access to your financial website is
approved. And, for the sake of simplicity the bad actors list contains only:

   1. mr-jekyll@mail.com
   2. mr-hyde@mail.com
   3. frankenstein@mail.com
   4. al-capone@mail.com
   5. lord-voldemort@mail.com
   6. darth-vader@mail.com

Now, an attempt is made to access the financial website that is protected
by a pattern-matching Bloom filter. In our example, the Bloom filter could
simply be:

If (request-name contains "e") then return positive, else return negative;

If access is requested by T <Tony-Montana@mail.com>ony-Montana@mail.com,
the filter will return *negative *- there is no "e" in the name so Tony
Montana could not possibly be in the bad actors list and he should be
granted access to the financial website without further ado. If access is
requested by John-Dillinger@mail.com, then the filter returns *positive *-
it contains an "e" so the name *might *be in the bad actors list. The only
way to be sure is now to perform a full-text search - time consuming but
will provide certainty.

The value of Bloom filters is that they can eliminate full-text searches
for many access requests.

In reality, the list of bad actors is likely to be millions long. The
algorithms used in the Bloom filter use mathematical patterns (determined
by hash functions) that are present in the list. You can imagine that a
long list might require half-a-dozen or more patterns to ensure that ALL
entries in the bad actors list are represented by one or more patterns. It
does not matter if a bad actor is covered by multiple patterns. It is
essential that all bad actors are represented by at least one pattern. The
Bloom filter will check an access request against each pattern it is
provided with, and return *immediately* when a check against a pattern
returns positive. This result initiates a full-text search, though no doubt
this can be optimized with clever indexing and organization.

Whereas fraud detection is the top use of Bloom filters, they have proved
useful in such situations as checking genome strings against a database of
DNA sequences. These checks can be time intensive due to the length of DNA
strings. Similarly for high-frequency trading, to check for recent or
duplicate orders, as speed is of the essence.

For the record, the Bloom filter is named after its inventor - Burton
Howard Bloom - who described its purpose in a 1970 paper - *Space/Time
Trade-offs in Hash Coding with Allowable Errors*.  There is no connection
between the filters and "blooming" in the floral sense!

If an application you are building contains a large database that needs to
be searched regularly, this library provides you with the components
necessary to create a Bloom filter of the complexity required to fine tune
the search performance of your application.

Follow the intro with a *Getting Started* section:

*Getting Started*
- describe the *Requirements*, *Installation *and a trivial "hello world"
example of the use of the library.

*Tutorial*
- this section is "tutorial" in name only - the section should be a
numbered step-by-step procedure for the reader to follow - that results in
a small but meaningful example working, with the expected output. Currently
it is not certain what to do.

Following the Tutorial, your *Bloom Filter Primer, Choosing a Filter
Combination, Benchmarks *- are all appropriate.

*Reference*
- I found it difficult to follow the Reference in the sense of "what am I
looking at?". Best to be specific as to what is a class, template, method,
property, field, macro etc. AND critically explain the use-case of each of
these constructs in the intro or description of the construct - what is the
scenario that this construct applies to?

Some of the reference I didn't really understand - such as:

A compile-time std::size_t value indicating the number of (not necessarily
distinct) bits set/checked per operation.

- what does *not necessarily distinct* mean?

*Appendix A*
The equations are somewhat daunting and difficult to determine what is good
practice and what is not. Perhaps add text to describe best practices in
certain situations.

Perhaps add an *Acknowledgements *section at the end - authors, testers,
motivation, etc.

*Summary*
I would most like to see a friendly introduction, a step-by-step tutorial,
and a clearer reference.

The case for Bloom filters seems strong, and good luck with the library.

I am the technical writer for the C++ Alliance.

- best
Peter Turcan

On Tue, May 13, 2025 at 2:47 AM Arnaud Becheler via Boost <
boost@lists.boost.org> wrote:
...
Dear Boost community,
The review of the Boost.Bloom library begins today May 13th, 2025, and will
run through May 22nd, 2025.
Boost.Bloom is a header-only C++ library written by Joaquín M López Muñoz,
providing fast and flexible Bloom filters. These are space-efficient
probabilistic data structures used to test set membership with a
controllable false positive rate and minimal memory footprint. The library
supports a variety of configurations and hash policies and is designed to
integrate well with modern C++.
You can find the library here:
    - Repo: https://github.com/joaquintides/bloom
    - Documentation: https://master.bloom.cpp.al/
Bloom filters are especially useful when:
    - You need to test if something is not in a large dataset, and can
tolerate a small false positive rate. (e.g.,  avoiding duplicate URL crawls
in a web crawler)
    - You need a memory-efficient structure to represent large sets (e.g.,
checking for the presence of DNA subsequences across massive genomic
datasets)
    - You want to avoid expensive lookups for items that are definitely not
present (e.g., filtering nonexistent records before querying a remote
database).
As always, we welcome all reviews — from quick impressions to detailed
analysis. Your feedback helps ensure that the library meets Boost's high
standards in terms of correctness, performance, documentation, and design.
If you’re interested in contributing a review, please post to the Boost
mailing list during the review period.
In your review please state whether you recommend to reject or accept the
library into Boost, and whether you suggest any conditions for acceptance.
Other questions you might want to answer in your review are:
  -  What is your evaluation of the design?
  -  What is your evaluation of the implementation?
  -  What is your evaluation of the documentation?
  -  What is your evaluation of the potential usefulness of the library?
  -  Did you try to use the library? With what compiler? Did you have any
problems?
  -  How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
  -  Are you knowledgeable about the problems tackled by the library?
Note: All links to other parts of Boost in the Boost.Bloom documentation
are currently broken, as they were written with the assumption that the
library was already integrated into Boost - please do not account for those
in your review.
Thank you in advance for your time and insights!
Best regards,
Arnaud Becheler
Review Manager, Boost.Bloom
_______________________________________________
Unsubscribe & other changes:
http://lists.boost.org/mailman/listinfo.cgi/boost