Re: [boost] [Bloom] review starts on 13th of May

17 May 2025

      ...
Joaquín, Arnaud,
Thanks for submitting and managing the review of this library.  Here is my
review of the documentation. Good to see plenty of links, tables and
structure to the reference, diagrams and equations, and performance
benchmarks.
Hi Peter, thank you for your documentation review!
I think the doc could use a much friendlier introduction that explains the
purpose of these useful filters to all C++ developers.
In the current documentation, this is the intended purpose of the "Primer"
...
I suggest something like this:
*Introduction*
[...]
Now, an attempt is made to access the financial website that is protected
by a pattern-matching Bloom filter. In our example, the Bloom filter could
simply be:
If (request-name contains "e") then return positive, else return negative;
This is not how Bloom filters work. A Bloom filter does not process elements
El 17/05/2025 a las 0:42, Peter Turcan via Boost escribió:
section. In my mind, there are two big groups in the audience:

* People who already know what a Bloom filter is and want to go straight
into what this particular library offers.
* People who don't know about Bloom filters.

So, the primer is provided as a sort of skippable section depending on the
previous background of the reader. In which ways do you think that this
primer does not fullfill the goals of your proposed "Introduction" section?

based on (implicit or explicit) patterns of the input data. Instead, for 
each
element inserted into the filter, the values of the associated hash 
functions
as applied to the element are used to calculate a number of bit positions
that are then set to 1. The statistical magic is that when querying for an
element that has _not_ been inserted, the probability that all the 
associated
bits are set to 1 is very small --though not zero, hence the false 
positive rate
thing.

If the primer has led you to this exposition of Bloom filters, then I guess
it's not doing an excellent job at explaining the data structure :/ Have 
you written
this proposed introduction based on your reading of my docs or from some
other sources?
...
[...]
Whereas fraud detection is the top use of Bloom filters, [...]
I woulnd't say fraud detection is _the top use_ of Bloom filters, though
it's certainly a reasonable application of this data structure.
they have proved
useful in such situations as checking genome strings against a database of
DNA sequences. These checks can be time intensive due to the length of DNA
strings[...]
...
[...]
Follow the intro with a *Getting Started* section:
*Getting Started*
- describe the *Requirements*, *Installation *and a trivial "hello world"
example of the use of the library.
Yes, I can plug this info as a subsection of the introduction. Note that 
installation
is basically non-existent, as the docs assume that the library is 
already part of
Boost, and, being header-only, there's no additional build step.
*Tutorial*
- this section is "tutorial" in name only - the section should be a
numbered step-by-step procedure for the reader to follow - that results in
a small but meaningful example working, with the expected output. Currently
it is not certain what to do.
I can try to organize the tutorial as part of a narrative where a full 
example
is built. I have some problems though with that, as some parts, notably the
one where I describe the different configuration options, don't look to me
amenable to that approach.
[...]
*Reference*
- I found it difficult to follow the Reference in the sense of "what am I
looking at?". Best to be specific as to what is a class, template, method,
property, field, macro etc. AND critically explain the use-case of each of
these constructs in the intro or description of the construct - what is the
scenario that this construct applies to?
The reference is built in standardese format, so I guess it looks pretty
straightforward to me cause I've wasted the best years of my youth reading
...
Some of the reference I didn't really understand - such as:
A compile-time std::size_t value indicating the number of (not necessarily
distinct) bits set/checked per operation.
- what does *not necessarily distinct* mean?
The way a Bloom filter works, element insertion translates to a given
numbers of bits being set to 1. But, as their positions are selected
FWIW, there's a genomics-related example at

https://github.com/joaquintides/bloom/blob/develop/example/genome.cpp

the standard. Do you have any favorite Boost lib referencewise?
pseudorandomly and independently, it may be the case that some
of the positions coincide.
...
*Appendix A*
The equations are somewhat daunting and difficult to determine what is good
practice and what is not. Perhaps add text to describe best practices in
certain situations.
Sorry, I'm not getting you here. How are mathematical equations related
to best practices? In any case, the appendix is provided precisely as an
appendix because it's not essential to understanding or using the library
--only the mathematically inclined will be looking at this.
Perhaps add an *Acknowledgements *section at the end - authors, testers,
motivation, etc.
Absolutely, on my todo list.
*Summary*
I would most like to see a friendly introduction, a step-by-step tutorial,
and a clearer reference.
The case for Bloom filters seems strong, and good luck with the library.
Again, thanks for your review and useful tips.

Joaquin M Lopez Munoz

Re: [boost] [Bloom] review starts on 13th of May

Joaquin M López Muñoz