
On 18/05/2025 at 2:18, Kostas Savvidis via Boost wrote:
Hi All,
This is a mini review of the proposed Bloom library.
Hi Kostas, thanks for your review!
[...]

====================================================================
CONSIDERATIONS:
This library implements the classic Bloom filter, which can be used to implement an O(1)-time test of whether a given object belongs to a previously preprocessed set. The preprocessing takes O(n) time and, importantly, O(n) storage for a bit array, but the original data can optionally be discarded. I was reminded of the storage requirement while experimenting with the provided example for genomic data: the example datafile is 140MB, while the resulting filter contains a lookup table taking up 180MB (!!!), with the 0.01 setting for the target false positive rate (FPR). Further, I have compared the "predicted" false positive rate obtained from the library function genome_filter::fpr_for(n, f.capacity()) with the actual FPR.
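For readers following along, here is a hedged sketch of the FPR query being compared here; the genome_filter alias and the constructor argument below are stand-in assumptions, only fpr_for() and capacity() come from the example under discussion:

    #include <boost/bloom/filter.hpp>
    #include <cstdio>
    #include <string>

    // assumed stand-in for the genomics example's genome_filter typedef
    using genome_filter = boost::bloom::filter<std::string, 8>;

    int main()
    {
        std::size_t n = 140'000'000;     // rough number of 20-mers in the 140MB file
        genome_filter f(1'200'000'000);  // assumed: constructed from a capacity in bits
        std::printf("predicted FPR: %g\n",
                    genome_filter::fpr_for(n, f.capacity()));
    }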
By actual FPR you mean the FPR for individual 20-mers, not for whole DNA sequences, right? Sequence FPR is expected to be much lower than 20-mer FPR.
This FPR appears to be overpessimistic, as experiments with the real data showed. Setting the filter FPR to 0.5 or even 0.9 (which results in a much smaller lookup table), there were almost no false positives in practice. I do not fully understand the reasons for this, but one of them appears to be that THE "SET" WE INSERTED HAD MANY IDENTICAL ITEMS. In this case, the stream had many identical k-mers; to be precise, out of 142 million base pairs there are actually ~22 million identical 20-mers, but even when the FPR is set to 0.9, only 40 million extra false collisions are generated, which is way less than 90%. Further investigation suggests that fpr_for is far from reality: the bit array for FPR=0.9 is 2^28 = 268m bits wide, of which only 142m - 22m = 120m are occupied, thus the correct FPR is 120/268 = 44%, in rough agreement with the observed 40m/120m. Similarly, when I request fpr=0.01 or 1%, the number of false collisions is only 86k, so the actual FPR is probably closer to 2*86k/120m = 0.14%.
We must be doing radically different things, because my local experiments don't look at all like what you see. I've written an instrumented version of the genomics example here: https://gist.github.com/joaquintides/e8fbf87fb0dd0b6fda87969c676c1e34 and I get these results for the genome of Drosophila melanogaster:

    actual unique kmers: 121430875
    bits set: 719647041
    capacity: 1533664512
    estimated unique kmers based on array occupancy: 121434296
    actual fpr: 0.004248

If I use the exact number of k-mers as an input on filter construction, instead of the original approach where the number of k-mers is taken to be approximately equal to the number of bytes in the file (defining the macro USE_ACTUAL_NUMBER_K_MERS), I get:

    actual unique kmers: 121430875
    bits set: 680516771
    capacity: 1278575360
    estimated unique kmers based on array occupancy: 121434774
    actual fpr: 0.01011

So, the actual FPR is very much in line with the value specified on construction, and the number of different 20-mers is 121M rather than 22M. Maybe we're using different input files? I'm using GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna, obtained from https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001215.4/
This necessitates my recommendation 2), see below.
EASE OF USE:
Appears to be reasonable. Some amount of code to realize the filter needs to be written by hand. In particular, if the items are long strings or some other nontrivial datatype, one needs to write a function called "hash_value" which maps the item to a std::size_t.
Strictly speaking, this is not required by the library itself, but by boost::hash (the default hash function). If you use some other hash function (say, std::hash<my_type>), the exact requirement will be different, but in general you need to provide a hashing procedure for any user-defined class you wish to use with boost::bloom::filter (or any hash container, for that matter).
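As a hedged illustration of that requirement, here is what the hash_value hook typically looks like for a user-defined type. The boost::bloom::filter parameters and constructor below are my assumptions about the candidate interface; the hash_value/hash_combine part is standard Boost.ContainerHash usage:

    #include <boost/bloom/filter.hpp>
    #include <boost/container_hash/hash.hpp>
    #include <cstddef>
    #include <string>

    struct read_tag
    {
        std::string chromosome;
        unsigned    position;
    };

    // found by boost::hash through argument-dependent lookup, so read_tag
    // works with any boost::hash-based container or filter
    std::size_t hash_value(const read_tag& r)
    {
        std::size_t seed = 0;
        boost::hash_combine(seed, r.chromosome);
        boost::hash_combine(seed, r.position);
        return seed;
    }

    int main()
    {
        // assumed: filter<T, K> with K hash functions, constructed from a
        // capacity in bits
        boost::bloom::filter<read_tag, 8> f(1'000'000);
        f.insert(read_tag{"2L", 12345});
        return f.may_contain(read_tag{"2L", 12345}) ? 0 : 1;
    }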
DOCUMENTATION:
Clear writing, the background is discussed, the examples are good, and references to relevant papers are provided.
DESIGN:
Is solid to my eyes, with the exception of the hand-rolled random number generator "mcg_and_fastrange" in boost/bloom/detail/core.hpp. Maybe this is good enough, but it is not very good. It attempts two things at once: generating a position in the bit array at which to insert a bit, as well as "expanding" a hash. However, when the bucket size is a power of two, say m=2^28, the multiplier will be chosen as 2^28+3; needless to say, this means the lower 28 bits will behave badly, as if the hash had been produced with h' = h*3 mod 2^64. The technique of producing a random integer in the range 0..m according to Lemire is solid, but it cannot also serve to produce the random numbers themselves. Hence my recommendation 3).
I will run local experiments to see if some FPR degradation occurs due to a poor MCG being used when m is a power of two. Note that this procedure is not required to produce high entropy, but only to yield different {hi} values for different h0 initial values, as explained for the double hash scheme in section 3 of https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
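For context, here is a bare-bones sketch of the double hashing scheme from that paper (my illustration, not the library's implementation): the k probe positions are derived from just two hash values, and all that is required is that different elements get different position sequences, not that each step be high-entropy.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Kirsch-Mitzenmacher double hashing: g_i(x) = h1(x) + i*h2(x) mod m,
    // for i = 0, ..., k-1
    std::vector<std::size_t> probe_positions(std::uint64_t h1, std::uint64_t h2,
                                             std::size_t m, std::size_t k)
    {
        std::vector<std::size_t> pos(k);
        for (std::size_t i = 0; i < k; ++i)
        {
            pos[i] = static_cast<std::size_t>((h1 + i * h2) % m);
        }
        return pos;
    }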
CONCLUSION:
The popularity of Bloom filters is possibly due to the actual performance (false positive rate) being better in practice than estimated a priori. This is not free, nor is it magic, as it comes at the cost of large memory storage, namely 1.44*n*log2(1/ε) bits, or 5-6 bits per item. Nevertheless, it is definitely a useful general-purpose algorithm.
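For reference (this quick evaluation is mine, not part of the original review), the per-item cost implied by that formula at the optimal number of hash functions works out to

    \frac{m}{n} = \frac{\log_2(1/\varepsilon)}{\ln 2} \approx 1.44\,\log_2(1/\varepsilon)

which gives roughly 4.8 bits per item for \varepsilon = 0.1 and roughly 9.6 bits per item for \varepsilon = 0.01.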
I'd be happy to extend the library to support more space-efficient filters such as Cuckoo or xor, though they seem to be much slower, and this library focuses heavily on speed (I dare say it's the fastest around). But as I said, if there's an interest in having additional filter structures I can definitely tackle that.
RECOMMENDATIONS:
1) Accept the library, with conditions.
2) Provide a second, more accurate way to compute the FPR, based not on an a priori estimate from the number of elements already inserted but on the actual number of bits that are set in the bit array. This quantity could be computed at small cost during insertion by keeping track of collisions, or evaluated after the fact from a full scan of the bit array (total Hamming weight).
I can certainly do that:

* Keeping track of collisions: a member function try_insert (or some such name) can be written that returns false if the element existed prior to insertion. This way the user themself can keep track of the actual number of elements --I prefer this to the alternative of tasking boost::bloom::filter with keeping track itself, as this introduces a penalty even for users not interested in the functionality.

* Estimating the FPR based on array occupancy: I know how to estimate the number of elements from the number of bits set to 1 with

    n* = -(m/k) ln(1 - num_bits_set/m),    (1)

and from there I can use the standard FPR formula with n*, m and k. The problem is that (1) is valid for the classical Bloom filter, but I don't think it applies to the block and multiblock filters also provided by candidate Boost.Bloom. FWIW, I used (1) in the instrumented genome example above, which uses a non-classical filter, and it worked like a charm (?).
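Here is a minimal, self-contained sketch of that occupancy-based estimate (my code, not the library's); it assumes only m, k and the measured Hamming weight as inputs, combining formula (1) with the standard classical-filter FPR expression:

    #include <cmath>
    #include <cstdio>

    double estimated_elements(double m, double k, double bits_set)
    {
        // n* = -(m/k) ln(1 - bits_set/m)                            (1)
        return -(m / k) * std::log(1.0 - bits_set / m);
    }

    double estimated_fpr(double m, double k, double bits_set)
    {
        // standard classical Bloom filter FPR formula evaluated at n*
        double n = estimated_elements(m, k, bits_set);
        return std::pow(1.0 - std::exp(-k * n / m), k);
    }

    int main()
    {
        // purely illustrative numbers: a 1 Gbit array, k = 8, 250 Mbit set
        double m = 1e9, k = 8, bits_set = 2.5e8;
        std::printf("n* = %.0f, estimated fpr = %g\n",
                    estimated_elements(m, k, bits_set),
                    estimated_fpr(m, k, bits_set));
    }

As noted above, the FPR expression used here is only exact for the classical filter layout.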
3) For the rng, use h' = h*a mod 2^64 with a=6364136223846793005, the known-good multiplier from Knuth, then high(h*m) for the random position.
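Read literally, recommendation 3) would look something like this sketch (the multiplier is the one quoted above; the struct and everything else is my illustration, not a proposed patch, and it uses the GCC/Clang __uint128_t extension to take the high 64 bits of the product):

    #include <cstdint>

    struct knuth_fastrange
    {
        std::uint64_t h; // hash state, seeded with the element's hash h0
        std::uint64_t m; // number of positions in the bit array

        std::uint64_t next_position()
        {
            // h' = h*a mod 2^64, Knuth's 64-bit multiplier
            h *= 6364136223846793005ull;
            // Lemire's fastrange: high(h*m) is a value in [0, m)
            return static_cast<std::uint64_t>((static_cast<__uint128_t>(h) * m) >> 64);
        }
    };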
As mentioned above, let me try to measure the statistical soundness of the current approach. Will keep you posted.

Again, thank you for your review,

Joaquin M Lopez Munoz