
On 22/05/2025 at 22:15, Arnaud Becheler via Boost wrote:
Dear Boost community,
Tomer encountered some issues with the ML and asked me to forward their review. Since the deadline is approaching, if anyone else is interested in reviewing, please let me know—we may be able to arrange an extension.
Best regards,
Arnaud Becheler
Review Manager, Boost.Bloom
---
Hi all,
Bottom line first: In my opinion, the proposed Bloom library should be ACCEPTED to Boost. It implements a useful data structure, and does it with high quality. Joaquín, thank you for submitting this library to Boost!
Hi Tomer, thanks for your review!
Now for the details.
Background:
This is my first time reviewing a proposed Boost library. I recently finished implementing a Blocked Bloom filter, so this library is very interesting to me and I have some relevant knowledge. Sadly, my version of the BF requires some unique customizations which would likely make it difficult to use the proposed library in my code (I won’t go into details, as they are not relevant, and I wouldn't want the library to become encumbered by multiple customization options). I will only mention that in my implementation, reducing memory access to a single cache line was found to be critical for performance.
Design & documentation:
The design of the library looks solid, with good design choices. However, I’m missing the rationale behind those choices.
One example is that k is a compile-time constant. As a user, this is far from optimal: I do not care about the number of bits used for each key, only that it is optimal. In practice, the optimal value of k is a function of the target FPR. In other cases the target is the memory overhead (m/n), and k is derived from that. I can choose m, n, and the FPR at runtime, but I have to decide on k at compile time. I understand that there are implementation considerations in favor of making k a compile-time constant; I just wish to have that explained in the docs.
The rationale is that making k a compile-time value makes the filter faster. I've been playing with variations where k can be specified at run time, and they're 10-20% slower. Of course, I could provide those variations as part of the library, and in fact I'm inclined to include them in the roadmap.
Another design limitation is that to my understanding the library doesn’t actually offer an implementation of the classical Bloom filter (contradicting the claim in the first line of the docs). The default behavior is a variant that sets/gets bits in k different 1-byte blocks, and there is no option for a block-less BF. This is different from the classical Bloom filter in which 2 or more hashes can target the same byte, and even the same bit. The FPR will be very slightly higher than the FPR for a classical BF with the same parameters, which makes the current calculation very slightly off. I think this should at least be addressed in the documentation.
The default configuration for boost::bloom::filter, filter<T, K, block<unsigned char, 1>>, is _exactly_ a classical Bloom filter: randomly selecting an unsigned char and then selecting a random bit in it is the same as randomly selecting a bit in the whole array (repeated K times). This is also mentioned in https://master.bloom.cpp.al/html/index.html#fpr_estimation : "We have the particular case FPR_block(n,m,b,k,1) = FPR_multiblock(n,m,b,k,1) = FPR(n,m,k), which follows simply from the observation that using {block|multiblock}<Block, 1> behaves exactly as a classical Bloom filter." (It is not hard to prove the equation by expanding the formulas for k' = 1.)
I wanted to use a Blocked BF with blocks of 1 cache line. It was not obvious from the docs that this is possible, because the docs say that the underlying type for the Block must be an unsigned integral type. I tried the following which seemed to work:
using CacheLine = std::bitset<512>;
using filter = boost::bloom::filter<T, 1, boost::bloom::block<CacheLine, K>>;
I would like to see this officially supported and documented as a main use-case.
Yes, this sort of works currently, and you get 64-byte alignment. Internally, though, it hits UB, as the assumption that Block is an unsigned integral type translates to treating it as a trivial type (which std::bitset is not). Triviality is important because the internal bit array is exposed as precisely that, for instance for serialization purposes. If we relax the "unsigned integral type" requirement to "trivial type supporting the integral operations Block{0}, Block{1}, <<, >>, &, |, &= and |=", a custom uint512_t class could be used to fit the bill.
The purpose of multiblock is not clear. It seems to me that only the _fast variants are useful, for their speed. Otherwise, a block of sufficient size (e.g. using std::bitset as above, though something with a better implementation might be needed) would be preferable due to its slightly lower FPR.
block<uint512_t, 8> (assuming such uint512_t exists) would be much slower than multiblock<uint64_t, 8>. The latter would have slightly worse FPR, as you point out. Also, multiblock<uint64_t, 5> (for instance) operates over blocks of size 8*5 = 40 bytes, and you won't have a Block type of size 40 to match that.
The documentation contains many examples like this: "filter<1, block<uint32_t, K>>". However, the first template parameter (the value type T) is missing, which is confusing. The examples should all provide valid types and working code. This appears in several places.
Yes, this appears in tables to save space wrt filter<T, 1, ...> (the tables are already very wide). I will try to accommodate.
“For given values of n and m, the optimum k is the integer closest to [formula]” Mathematically speaking, this is not strictly correct. For example, with m/n=12.2553042 you will get k_opt=8.4947, but choosing k=9 will give you (very slightly) lower FPR. On the other hand, in this case I would still choose k=8 because 8 iterations is less than 9. So it's mostly a matter of wording.
You're absolutely right, I went for that sloppy wording to avoid making it more complicated, but well, reviewers are unforgiving :) Will fix.
“Without formal justification, we have relaxed this even further to just one initial hash value [...]” You might be interested in SingleHash [https://yangzhou1997.github.io/paper/singlehash-bigcomp18.pdf] which provides formal justification.
I had that paper on my radar, but the mixing procedure they use is not the same as ours. Sometime I'll try to crack the maths for this.
Implementation:
I only reviewed the code very lightly, and only tried to use the filter in some basic code, so I don’t have much to contribute in this area.
To use multiblock I needed to include a dedicated header, which is not needed for block. This is a bit confusing.
<boost/bloom/filter.hpp> includes <boost/bloom/block.hpp>, as block is the default type for boost::bloom::filter, hence the asymmetry. I understand this will be mitigated by an omnibus header <boost/bloom.hpp>, which was requested by Ruben Perez.
Estimating FPR based on array occupancy: You wrote in another thread that you “know how to estimate the number of elements from the number of bits set [… but the formula] is valid for the classical Bloom filter”. Since each block is itself a classical filter, and expected value is linear, you can apply the formula per block (using K') and then sum the total for the whole filter, and finally divide by K.
Wouldn't work for filter<T, K, block<..., K'>>, as multiple blocks are accessed per element.
I think this is a nice feature to have.
I concur. Again, I'll try to crack the math. If there's some mathematically inclined person reading this, any help would be appreciated.
Summary:
As I wrote above, this is a useful high-quality library. I don't think a single Bloom library can support all possible variations, so it's important to document the design choices that were made.
Thank you for contributing this code to Boost.
Thank _you_ for your interesting remarks.
Joaquin M Lopez Munoz