
On 22/05/2025 at 22:15, Arnaud Becheler via Boost wrote:
Dear Boost community,
Tomer encountered some issues with the ML and asked me to forward their review. Since the deadline is approaching, if anyone else is interested in reviewing, please let me know—we may be able to arrange an extension.
Best regards,
Arnaud Becheler
Review Manager, Boost.Bloom
---
Hi all,
Bottom line first: In my opinion, the proposed Bloom library should be ACCEPTED to Boost. It implements a useful data structure, and does it with high quality. Joaquín, thank you for submitting this library to Boost!
Hi Tomer, thanks for your review!
Now for the details.
Background:
This is my first time reviewing a proposed Boost library. I recently finished implementing a Blocked Bloom filter, so this library is very interesting to me and I have some relevant knowledge. Sadly, my version of the BF requires some unique customizations which would likely make it difficult to use the proposed library in my code (I won’t go into details, as they are not relevant, and I wouldn't want the library to become encumbered by multiple customization options). I will only mention that in my implementation, reducing memory access to a single cache line was found to be critical for performance.
Design & documentation:
The design of the library looks solid, with good design choices. However, I’m missing the rationale behind those choices.
One example is that k is a compile-time constant. As a user, this is far from optimal: I do not care about the number of bits used for each key, only that it is optimal. In practice, the optimal value of k is a function of the target FPR. In other cases the target is the memory overhead (m/n), and k is derived from that. I can choose m, n, and the FPR at runtime, but I have to decide on k at compile time. I understand that there are implementation considerations in favor of making k a compile-time constant; I just wish to have that explained in the docs.
The rationale is that making k a compile-time value makes the filter faster. I've been playing with variations where k can be specified at run time, and they're 10-20% slower. Of course, I could provide those variations as part of the library, and in fact I'm inclined to include them in the roadmap.
Another design limitation is that to my understanding the library doesn’t actually offer an implementation of the classical Bloom filter (contradicting the claim in the first line of the docs). The default behavior is a variant that sets/gets bits in k different 1-byte blocks, and there is no option for a block-less BF. This is different from the classical Bloom filter in which 2 or more hashes can target the same byte, and even the same bit. The FPR will be very slightly higher than the FPR for a classical BF with the same parameters, which makes the current calculation very slightly off. I think this should at least be addressed in the documentation.
The default configuration for boost::bloom::filter, filter<T, K, block<unsigned char, 1>>, is _exactly_ a classical Bloom filter: randomly selecting an unsigned char and then selecting a random bit in it is the same as randomly selecting a bit in the whole array (repeated K times). This is also mentioned in https://master.bloom.cpp.al/html/index.html#fpr_estimation : "We have the particular case FPR_block(n,m,b,k,1) = FPR_multiblock(n,m,b,k,1) = FPR(n,m,k), which follows simply from the observation that using {block|multiblock}<Block, 1> behaves exactly as a classical Bloom filter." (It is not hard to prove the equation by expanding the formulas for k' = 1.)
I wanted to use a Blocked BF with blocks of 1 cache line. It was not obvious from the docs that this is possible, because the docs say that the underlying type for the Block must be an unsigned integral type. I tried the following which seemed to work:
using CacheLine = std::bitset<512>;
using filter = boost::bloom::filter<T, 1, boost::bloom::block<CacheLine, K>>;
I would like to see this officially supported and documented as a main use-case.
Yes, this sort of works currently, and you get 64-byte alignment. Internally, though, it hits UB, as the assumption that Block is an unsigned integral type translates to treating it as a trivial type (which std::bitset is not). Triviality is important because the internal bit array is exposed as precisely that, for instance for serialization purposes. If we relax the "unsigned integral type" requirement to "trivial type supporting the integral operations Block{0}, Block{1}, <<, >>, &, |, &= and |=", a custom uint512_t class could be used to fit the bill.
The purpose of multiblock is not clear. It seems to me that only the _fast variants are useful, for their speed. Otherwise, a block of sufficient size (e.g. using std::bitset as above, though something with a better implementation might be needed) would be preferable due to its slightly lower FPR.
block<uint512_t, 8> (assuming such uint512_t exists) would be much slower than multiblock<uint64_t, 8>. The latter would have slightly worse FPR, as you point out. Also, multiblock<uint64_t, 5> (for instance) operates over blocks of size 8*5 = 40 bytes, and you won't have a Block type of size 40 to match that.
The documentation contains many examples like this: "filter<1, block<uint32_t, K>>". However, the first template parameter (the value type T) is missing, which is confusing. The examples should all provide valid types and working code. This appears in several places.
Yes, this appears in tables to save space wrt filter<T, 1, ...> (the tables are already very wide). I will try to accommodate.
“For given values of n and m, the optimum k is the integer closest to [formula]” Mathematically speaking, this is not strictly correct. For example, with m/n=12.2553042 you will get k_opt=8.4947, but choosing k=9 will give you (very slightly) lower FPR. On the other hand, in this case I would still choose k=8 because 8 iterations is less than 9. So it's mostly a matter of wording.
You're absolutely right, I went for that sloppy wording to avoid making it more complicated, but well, reviewers are unforgiving :) Will fix.
“Without formal justification, we have relaxed this even further to just one initial hash value [...]” You might be interested in SingleHash [https://yangzhou1997.github.io/paper/singlehash-bigcomp18.pdf] which provides formal justification.
I had that paper on my radar, but the mixing procedure they use is not the same as ours. Sometime I'll try to crack the maths for this.
Implementation:
I only reviewed the code very lightly, and only tried to use the filter in some basic code, so I don’t have much to contribute in this area.
To use multiblock I needed to include a dedicated header, which is not needed for block. This is a bit confusing.
<boost/bloom/filter.hpp> includes <boost/bloom/block.hpp>, as block is the default type for boost::bloom::filter, hence the asymmetry. I understand this will be mitigated by an omnibus header <boost/bloom.hpp>, which was requested by Ruben Perez.
Estimating FPR based on array occupancy: You wrote in another thread that you “know how to estimate the number of elements from the number of bits set [… but the formula] is valid for the classical Bloom filter”. Since each block is itself a classical filter, and expected value is linear, you can apply the formula per block (using K') and then sum the total for the whole filter, and finally divide by K.
Wouldn't work for filter<T, K, block<..., K'>>, as multiple blocks are accessed per element.
I think this is a nice feature to have.
I concur. Again, I'll try to crack the math. If there's some mathematically inclined person reading this, any help would be appreciated.
Summary:
As I wrote above, this is a useful high-quality library. I don't think a single Bloom library can support all possible variations, so it's important to document the design choices that were made.
Thank you for contributing this code to Boost.
Thank _you_ for your interesting remarks.
Joaquin M Lopez Munoz