
On 18/05/2025 at 2:18, Kostas Savvidis via Boost wrote:
Hi All,
This is a mini review of the proposed Bloom library.
Hi Kostas, thanks for your review!
[...]

====================================================================
CONSIDERATIONS:
This library implements the classic Bloom filter, which can be used to implement an O(1)-time test of whether a given object belongs to a previously preprocessed set. The preprocessing takes O(n) time and, importantly, O(n) storage for a bit array, but the original data can optionally be discarded. I was reminded of the storage requirement while experimenting with the provided example for genomic data: the example datafile is 140MB, while the resulting filter contains a lookup table taking up 180MB (!!!), with the 0.01 setting for the target false positive rate (FPR). Further, I have compared the "predicted" false positive rate obtained from the library function genome_filter::fpr_for(n, f.capacity()) with the actual FPR.
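For readers following along, here is a hedged sketch of the FPR query being compared here; the genome_filter alias and the constructor argument below are stand-in assumptions, only fpr_for() and capacity() come from the example under discussion:

    #include <boost/bloom/filter.hpp>
    #include <cstdio>
    #include <string>

    // assumed stand-in for the genomics example's genome_filter typedef
    using genome_filter = boost::bloom::filter<std::string, 8>;

    int main()
    {
        std::size_t n = 140'000'000;     // rough number of 20-mers in the 140MB file
        genome_filter f(1'200'000'000);  // assumed: constructed from a capacity in bits
        std::printf("predicted FPR: %g\n",
                    genome_filter::fpr_for(n, f.capacity()));
    }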
By actual FPR you mean the FPR for individual 20-mers, not for whole DNA sequences, right? Sequence FPR is expected to be much lower than 20-mer FPR.
This FPR appears to be overpessimistic, as experiments with the real data showed. Setting the filter FPR to 0.5 or even 0.9 (which results in a much smaller lookup table), there were almost no false positives in practice. I do not fully understand the reasons for this, but one of them appears to be that THE "SET" WE INSERTED HAD MANY IDENTICAL ITEMS. In this case, the stream had many identical k-mers; to be precise, out of 142 million base pairs there are actually ~22 million identical 20-mers, but even when the FPR is set to 0.9, only 40 million extra false collisions are generated, which is way less than 90%. Further investigation suggests that fpr_for is far from reality: the bit array for FPR=0.9 is 2^28 = 268m bits wide, of which only 142m - 22m = 120m are occupied, thus the correct FPR is 120/268 = 44%, in rough agreement with the observed 40m/120m. Similarly, when I request fpr=0.01 or 1%, the number of false collisions is only 86k, so the actual FPR is probably closer to 2*86k/120m = 0.14%.
We must be doing radically different things, because my local experiments don't look at all like what you see. I've written an instrumented version of the genomics example here: https://gist.github.com/joaquintides/e8fbf87fb0dd0b6fda87969c676c1e34 and I get these results for the genome of Drosophila melanogaster:

    actual unique kmers: 121430875
    bits set: 719647041
    capacity: 1533664512
    estimated unique kmers based on array occupancy: 121434296
    actual fpr: 0.004248

If I use the exact number of k-mers as an input on filter construction, instead of the original approach where the number of k-mers is taken to be approximately equal to the number of bytes in the file (defining the macro USE_ACTUAL_NUMBER_K_MERS), I get:

    actual unique kmers: 121430875
    bits set: 680516771
    capacity: 1278575360
    estimated unique kmers based on array occupancy: 121434774
    actual fpr: 0.01011

So, the actual FPR is very much in line with the value specified on construction, and the number of different 20-mers is 121M rather than 22M. Maybe we're using different input files? I'm using GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna, obtained from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4/
This necessitates my recommendation 2), see below.
EASE OF USE:
Appears to be reasonable. Some amount of code to realize the filter needs to be written by hand. In particular, if the items are long strings or some other nontrivial datatype, one needs to write a function called "hash_value" which maps the item to a std::size_t.
Strictly speaking, this is not required by the library itself, but by boost::hash (the default hash function). If you use some other hash function (say, std::hash<my_type>), the exact requirement will be different, but in general you need to provide a hashing procedure for any user-defined class you wish to use with boost::bloom::filter (or any hash container, for that matter).
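As a hedged illustration of that requirement, here is what the hash_value hook typically looks like for a user-defined type. The boost::bloom::filter parameters and constructor below are my assumptions about the candidate interface; the hash_value/hash_combine part is standard Boost.ContainerHash usage:

    #include <boost/bloom/filter.hpp>
    #include <boost/container_hash/hash.hpp>
    #include <cstddef>
    #include <string>

    struct read_tag
    {
        std::string chromosome;
        unsigned    position;
    };

    // found by boost::hash through argument-dependent lookup, so read_tag
    // works with any boost::hash-based container or filter
    std::size_t hash_value(const read_tag& r)
    {
        std::size_t seed = 0;
        boost::hash_combine(seed, r.chromosome);
        boost::hash_combine(seed, r.position);
        return seed;
    }

    int main()
    {
        // assumed: filter<T, K> with K hash functions, constructed from a
        // capacity in bits
        boost::bloom::filter<read_tag, 8> f(1'000'000);
        f.insert(read_tag{"2L", 12345});
        return f.may_contain(read_tag{"2L", 12345}) ? 0 : 1;
    }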
DOCUMENTATION:
Clear writing, the background is discussed, the examples are good, and references to relevant papers are provided.
DESIGN:
Is solid to my eyes, with the exception of the hand-rolled random number generator "mcg_and_fastrange" in boost/bloom/detail/core.hpp. Maybe this is good enough, but it is not very good. It attempts two things at once: generating a position in the bit array at which to insert a bit, as well as "expanding" a hash. However, when the bucket size is a power of two, say m=2^28, the multiplier will be chosen as 2^28+3; needless to say, this means the lower 28 bits will behave badly, as if the hash had been produced with h' = h*3 mod 2^64. The technique of producing a random integer in the range 0..m according to Lemire is solid, but it cannot also serve to produce the random numbers themselves. Hence my recommendation 3).
I will run local experiments to see if some FPR degradation occurs due to a poor MCG being used when m is a power of two. Note that this procedure is not required to produce high entropy, but only to yield different {hi} values for different h0 initial values, as explained for the double hash scheme in section 3 of https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
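For context, here is a bare-bones sketch of the double hashing scheme from that paper (my illustration, not the library's implementation): the k probe positions are derived from just two hash values, and all that is required is that different elements get different position sequences, not that each step be high-entropy.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Kirsch-Mitzenmacher double hashing: g_i(x) = h1(x) + i*h2(x) mod m,
    // for i = 0, ..., k-1
    std::vector<std::size_t> probe_positions(std::uint64_t h1, std::uint64_t h2,
                                             std::size_t m, std::size_t k)
    {
        std::vector<std::size_t> pos(k);
        for (std::size_t i = 0; i < k; ++i)
        {
            pos[i] = static_cast<std::size_t>((h1 + i * h2) % m);
        }
        return pos;
    }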
CONCLUSION:
The popularity of Bloom filters is possibly due to the actual performance (false positive rate) being better in practice than estimated a priori. This is not free, nor is it magic, as it comes at the cost of large memory storage, namely 1.44*n*log2(1/ε) bits, or 5-6 bits per item. Nevertheless, it is definitely a useful general-purpose algorithm.
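For reference (this quick evaluation is mine, not part of the original review), the per-item cost implied by that formula at the optimal number of hash functions works out to

    \frac{m}{n} = \frac{\log_2(1/\varepsilon)}{\ln 2} \approx 1.44\,\log_2(1/\varepsilon)

which gives roughly 4.8 bits per item for \varepsilon = 0.1 and roughly 9.6 bits per item for \varepsilon = 0.01.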
I'd be happy to extend the library to support more space-efficient filters such as Cuckoo or xor, though they seem to be much slower, and this library focuses heavily on speed (I dare say it's the fastest around). But as I said, if there's an interest in having additional filter structures I can definitely tackle that.
RECOMMENDATIONS:
1) Accept the library, with conditions.
2) Provide a second, more accurate way to compute the FPR, based not on an a priori estimate from the number of elements already inserted but on the actual number of bits that are set in the bit array. This quantity could be computed at small cost during insertion by keeping track of collisions, or evaluated after the fact from a full scan of the bit array (total Hamming weight).
I can certainly do that:

* Keeping track of collisions: a member function try_insert (or some such name) can be written that returns false if the element existed prior to insertion. This way the user themself can keep track of the actual number of elements --I prefer this to the alternative of tasking boost::bloom::filter with keeping track itself, as this introduces a penalty even for users not interested in the functionality.

* Estimating the FPR based on array occupancy: I know how to estimate the number of elements from the number of bits set to 1 with

    n* = -(m/k) ln(1 - num_bits_set/m),    (1)

and from there I can use the standard FPR formula with n*, m and k. The problem is that (1) is valid for the classical Bloom filter, but I don't think it applies to the block and multiblock filters also provided by candidate Boost.Bloom. FWIW, I used (1) in the instrumented genome example above, which uses a non-classical filter, and it worked like a charm (?).
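Here is a minimal, self-contained sketch of that occupancy-based estimate (my code, not the library's); it assumes only m, k and the measured Hamming weight as inputs, combining formula (1) with the standard classical-filter FPR expression:

    #include <cmath>
    #include <cstdio>

    double estimated_elements(double m, double k, double bits_set)
    {
        // n* = -(m/k) ln(1 - bits_set/m)                            (1)
        return -(m / k) * std::log(1.0 - bits_set / m);
    }

    double estimated_fpr(double m, double k, double bits_set)
    {
        // standard classical Bloom filter FPR formula evaluated at n*
        double n = estimated_elements(m, k, bits_set);
        return std::pow(1.0 - std::exp(-k * n / m), k);
    }

    int main()
    {
        // purely illustrative numbers: a 1 Gbit array, k = 8, 250 Mbit set
        double m = 1e9, k = 8, bits_set = 2.5e8;
        std::printf("n* = %.0f, estimated fpr = %g\n",
                    estimated_elements(m, k, bits_set),
                    estimated_fpr(m, k, bits_set));
    }

As noted above, the FPR expression used here is only exact for the classical filter layout.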
3) For the rng, use h' = h*a mod 2^64 with a=6364136223846793005, the known-good multiplier from Knuth, then high(h*m) for the random position.
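Read literally, recommendation 3) would look something like this sketch (the multiplier is the one quoted above; the struct and everything else is my illustration, not a proposed patch, and it uses the GCC/Clang __uint128_t extension to take the high 64 bits of the product):

    #include <cstdint>

    struct knuth_fastrange
    {
        std::uint64_t h; // hash state, seeded with the element's hash h0
        std::uint64_t m; // number of positions in the bit array

        std::uint64_t next_position()
        {
            // h' = h*a mod 2^64, Knuth's 64-bit multiplier
            h *= 6364136223846793005ull;
            // Lemire's fastrange: high(h*m) is a value in [0, m)
            return static_cast<std::uint64_t>((static_cast<__uint128_t>(h) * m) >> 64);
        }
    };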
As mentioned above, let me try to measure the statistical soundness of the current approach. Will keep you posted.

Again, thank you for your review,

Joaquin M Lopez Munoz