
Hi everybody, I was reading the documentation/code and I have some questions/comments.

1. Why is the unit of measurement bits? I understand that conceptually Bloom filters (maybe others, I did not check) are bit arrays, but to me this seems very error prone. Would a verbose alternative of using boost::bloom::bits(1000) or boost::bloom::bytes(125) be preferred? (A sketch of what I mean is at the end of this message.) I did not suggest a UDL, as I feel it may be overkill for such a rarely used class.

2. boost::hash<short> (and the other integral ones) is AFAIK the identity. I presume this is not an issue because you do extra steps if hash_is_avalanching is false, but maybe users should prefer an avalanching hash? If so, maybe suggest a fast avalanching hash available in Boost... BTW, I tried to read the docs but got the following (not a big issue, since I read them from the original version):

    404 Not Found - Code: NoSuchKey - Message: The specified key does not exist. - Key: develop.bloom.cpp.al/unordered/doc/html/unordered/reference/hash_traits.html

3. libs/unordered/include/boost/unordered/detail/mulx.hpp contains a mulx implementation that seems very similar to the one used in Bloom, and the authors are the same. Although this is tiny duplication, would it be best to extract it somewhere to avoid it?

4. Is there no benefit in having power-of-2-sized Bloom filters? AFAIK it is a trick used by hash maps (it requires the hash to be good) to get faster indexing. I see you use a clever 128-bit multiplication that generates a fast mul instruction to avoid division, but I still presume shift and masking would be faster.

5. Why is BOOST_ASSERT(fpr>=0.0&&fpr<=1.0); not BOOST_ASSERT(fpr>0.0&&fpr<=1.0);? I.e., is there a benefit to allowing calls with an impossible fpr argument? This may sound like nitpicking, but I think it is wrong to allow calls with arguments that give people the wrong idea of what Bloom filters are capable of. I understand that this functionality is under capacity *estimation*, but I still think it may be confusing.

6. Related to 5: it is relatively easy to OOM with this if you are "reckless" with the fpr value. I would add a suggestion to the docs to mention this (if it is not already there; maybe I missed it).

7. It may be just me, but I believe the Filter Definition chapter would benefit from a visual example, in particular of how BucketSize relates to the subarray size.

8. Regarding choosing the optimal value: in the example we use 10M elements. From what I see they are unused in the computations till the end, so I would rewrite "Suppose we plan to insert 10M elements and want to keep the FPR at 10^-4. The chart gives us five possibilities:" to something like "Suppose we plan to insert 10M elements and want to keep the FPR at 10^-4. Looking up the FPR in the chart (this step does not depend on the number of elements) gives us five possibilities:".

9. My IDE complains about redundant inline on member functions. Is this intentionally done because some compilers (AFAIK MSVC) use inline as an optimization (inlining) hint, or?

10. Should the #if 1 in may_contain be there?
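To illustrate question 1, here is a rough sketch of the strong types I have in mind (hypothetical names and shape, not existing library API):

    #include <cstddef>

    namespace boost{ namespace bloom{
    // hypothetical strong types; the filter constructor would overload on them
    struct bits { std::size_t value; explicit bits (std::size_t v):value(v){} };
    struct bytes{ std::size_t value; explicit bytes(std::size_t v):value(v){} };
    }}

    // usage makes the unit explicit at the call site:
    // boost::bloom::filter<std::string, 5> f(boost::bloom::bits(1000));  // 1000 bits
    // boost::bloom::filter<std::string, 5> g(boost::bloom::bytes(125));  // same size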

On 17/05/2025 at 20:36, Ivan Matek via Boost wrote:
I'm using bits because that's the unit used universally in the literature on Bloom filters. But yes, this bit/byte confusion may lead to user errors (I've been bitten by it myself occasionally during development). As for libraries out there, some use bits (to name a few, Apache Commons Collections, Arash Partow's, fastbloom) and some use bytes (Impala). I'm happy to do whatever the community decides is best here.
Correct.
but maybe users should prefer avalanching hash? If so maybe suggest fast avalanching hash available in boost...
What's that fast avalanching hash you refer to?
See https://master.bloom.cpp.al/html/index.html#implementation_notes_bucket_loca... for details.
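In short, the bucket location avoids the division with a multiply-shift range reduction (Lemire's technique). A sketch of the idea, not the exact library code (it uses the GCC/Clang unsigned __int128 extension for brevity):

    #include <cstdint>

    // Maps a 64-bit hash h to [0, n) with a single mul instead of div/mod:
    // the high 64 bits of the 128-bit product h*n are uniform over [0, n)
    // provided h itself is well distributed.
    inline std::uint64_t mul_shift_range(std::uint64_t h, std::uint64_t n)
    {
      return (std::uint64_t)(((unsigned __int128)h * n) >> 64);
    }

Unlike power-of-2 masking (h & (n-1)), which consumes only the low bits and therefore needs a good hash, this form consumes the high bits and places no power-of-2 constraint on n.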
fpr==0.0 is a legitimate (if uninteresting) argument value for capacity_for:

    capacity_for(0, 0.0) --> 24
    capacity_for(1, 0.0) --> 18446744073709549592

The formal reason why fpr==0.0 is supported is symmetry: some calls to fpr_for actually return 0.0 (for instance, fpr_for(0, 100)).
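To see the round trip concretely (a sketch; fpr_for(n, capacity) and capacity_for(n, fpr) are static members, and the exact numbers depend on the filter instantiation):

    using filter_t = boost::bloom::filter<int, 5>;   // any instantiation
    double p = filter_t::fpr_for(0, 100);            // 0.0: empty filter, no false positives
    std::size_t c = filter_t::capacity_for(0, 0.0);  // well defined thanks to fpr==0.0 support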
Yes, I can add that warning, thanks for the suggestion.
Absolutely, will do that.
Yes, the catch is that the chart's x axis is not the number of elements, but rather c = capacity / number of elements. I'll come up with a better wording to avoid confusion.
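For instance (numbers of my own, just to illustrate the intended reading): if the chart says the target FPR is reached at c = 20 bits per element, then for n = 10M elements the capacity is 20 × 10^7 = 2 × 10^8 bits = 25 MB, and the chart lookup itself never used n.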
Can you provide the full warning messages you're getting and for which environment this is happening? BOOST_FORCEINLINE is used quite liberally to nudge Visual Studio, which is otherwise notoriously lazy at inlining.
10. Should #if 1 in may_contain be there?
That's a leftover from the micro-optimization phase during lib development. It's harmless as it is, but I will remove it eventually. Joaquin M Lopez Munoz

Thank you for the answers; here are some quick clarifications, I will read/think more about the "complex" topics later...

On Sun, May 18, 2025 at 11:56 AM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
What's that fast avalanching hash you refer to?
I meant a hash that is fast (i.e. computation is very cheap) and avalanching. I presume that is what people might find most interesting for Bloom filters. But I could be wrong; e.g. maybe the best hash to use for integers is the identity, and then let the Bloom library do the extra hashing.
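For example, something along these lines (a sketch: the well-known splitmix64 finalizer, marked as avalanching via the nested typedef that boost::unordered::hash_is_avalanching detects):

    #include <cstddef>
    #include <cstdint>

    struct fast_avalanching_hash
    {
      using is_avalanching = void; // tells the library no extra mixing is needed

      std::size_t operator()(std::uint64_t x)const noexcept
      {
        // splitmix64 finalizer: two multiplies and three xor-shifts, avalanching
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return (std::size_t)x;
      }
    };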
Maybe Unordered could expose it as public functionality? Not super clean, but it avoids duplication; TBH I don't know what best practice is here. Or maybe move it to ContainerHash, as it seems both libs depend on it.
The warning is about the literal inline keyword, not BOOST_FORCEINLINE. I presume the reason is that member functions defined in-class are implicitly inline (I never understood why they are treated differently from free functions, but that is a different story...).

On 18/05/2025 at 14:42, Ivan Matek wrote:
Yes, I think this is good enough; plus, it's what, say, Boost.Unordered uses to good effect. There are much higher-quality hash functions in Boost.Hash2, but I'd say that quality is worth the extra computational cost when dealing with cryptographic security, not just for purposes of avalanching (Peter Dimov and Chris Mazakas can correct me otherwise).
I don't know, it looks ugly to me and potentially confusing to users, who have no real use for that. In fact, there's another piece of functionality that I'd rather have migrated to ContainerHash, namely the is_avalanching trait. Currently this lives in Boost.Unordered, which forces Boost.Bloom to depend on it for no other reason than to access this almost trivial utility --not ideal, obviously.
Can you please post here the entire warning message so that I can locate the offending line, plus the environment (compiler and version) you're using? Thank you! Joaquin M Lopez Munoz

On Sun, May 18, 2025 at 5:59 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
I don't know, looks to me ugly and potentially confusing to users, who have no real use for that.
To be honest I do not think it is so bad: you could just document that it is a common "primitive" used for index computation, and then users will know why it is where it is. The only thing I suggest is not to put it in Core. I have seen core/util/common/whatever_name bloat in many projects and I am not a fan of that.
Philosophically, I would be a big fan of the trait being in a small header/library. As a related example, I dislike that for iterator tags you must include a heavy std header when it is trivial functionality. IIRC there are even Boost libs that define them just to get around the include. :)
It is an IDE clang-tidy warning; I am not sure the compiler matters: https://clang.llvm.org/extra/clang-tidy/checks/readability/redundant-inline-...

On 18/05/2025 at 19:29, Ivan Matek wrote:
Boost.IsAvalanching? Not sure I'm buying that :-)
Could you please check if adding this comment to the offending lines:

    /* NOLINT(readability-redundant-inline-specifier) */

makes the warnings go away? If so, I'd happily accept a PR with that change. Thank you! Joaquin M Lopez Munoz

Joaquin M López Muñoz wrote:
I already have to declare hash_is_avalanching in ContainerHash (https://github.com/boostorg/container_hash/blob/b8179488b20eb1373bdbf5c7fcca...), so I have no objections to moving it there. IIRC we originally kept it in Unordered because Unordered was C++11 and ContainerHash wasn't yet. But now it is.

On Mon, May 19, 2025 at 8:27 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
I am now a bit confused. :) I believe the warning is correct, so why not change the code instead of suppressing the warning? To be clear, this is a warning for the literal inline keyword, not the Boost force-inline macro. A few examples:

    inline constexpr std::size_t range()const noexcept{return (std::size_t)rng;}
    inline void prepare_hash(boost::uint64_t& hash)const noexcept

(Ignore the gray on the size_t cast; the IDE does not know we sometimes compile on a 32-bit system.) But to answer your question: suppression with NOLINT works.

On Mon, May 19, 2025 at 8:44 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
I have actually considered that also (I thought of std::nextafter(0.0, 1.0);), but it is the same idea... This is getting a bit philosophical, but I see those as two different, although similar, issues:

1. impossible to compute: the result does not exist or exceeds size_t max;
2. practically impossible: e.g. the result is 2^47, which fits inside size_t but will OOM.

I am not a big fan of the library picking a constant for which the result is unreasonable (to handle 2.), but on the other hand nonsense values should be detected ASAP, before the program has a chance to continue... Long story short, I am not sure what the best decision is here. For 1. I am much more convinced that returning nullopt is the morally ;) correct API design.
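Something like this, to make the shape concrete (a hypothetical wrapper, not a proposed signature; some_filter stands for whatever filter instantiation is in use):

    #include <boost/optional.hpp>
    #include <cstddef>

    boost::optional<std::size_t> checked_capacity_for(std::size_t n, double fpr)
    {
      if(fpr == 0.0){
        if(n == 0) return some_filter::capacity_for(0, 0.0); // well defined: minimal capacity
        return boost::none; // n > 0 with fpr 0.0: no finite capacity exists
      }
      return some_filter::capacity_for(n, fpr);
    }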

AMDG

On 5/19/25 1:01 PM, Ivan Matek via Boost wrote:
What do you mean by correct? The warning does not indicate a problem in the code, and whether it is more or less readable is very subjective. It seems like a pretty pointless warning to me. Even if you care about such things, you really shouldn't be applying it to any code other than your own. In Christ, Steven Watanabe

On Tue, May 20, 2025 at 2:09 AM Steven Watanabe via Boost <boost@lists.boost.org> wrote:
The warning is not a false positive.
Well, as I said, it is not a problem, since inline is just redundant. As for whether it should be removed: I believe so, because people sometimes read Boost sources, especially in libraries like Bloom that are readable compared to, for example, some older libraries that do heroics with macros. And to be clear: IIRC my CLion clang-tidy settings are at their defaults, so I did nothing to enable this lint warning, meaning that every CLion user who reads the Bloom source will get the same grayed-out code.

On 21/05/2025 at 13:54, Ivan Matek via Boost wrote:
It is redundant from the point of view of the language standard, but compilers take inline as a hint to favor inlining in optimization modes. Consider for instance https://godbolt.org/z/dnfhYvnMK : you can see that X::foo is inlined while X::bar is not. This is the original reason why those member functions in candidate Boost.Bloom carry the "inline" bit, namely, to invite compilers to inline --though, admittedly, for very short functions this is hardly going to make a difference. Joaquin M Lopez Munoz
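For readers who don't follow the link, the example has this shape (a reconstruction, not the exact godbolt snippet; whether either function is actually inlined depends on the compiler's heuristics):

    // identical bodies, one with the hint, compiled in optimization mode
    struct X
    {
      static inline int foo(int x){ return (int)(x * 2654435761u % 8191u); } // "inline" hint
      static int bar(int x)       { return (int)(x * 2654435761u % 8191u); } // no hint
    };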

On 21.05.25 at 14:28, Joaquin M López Muñoz via Boost wrote:
I think the warning is mostly correct. From the Docs:
In your godbolt example you have *static* member functions; this seems to make a difference. See https://godbolt.org/z/h76z4GfzP for your example with static removed. I'm wondering if this is a compiler/optimizer issue that should be fixed. So I'd say at least "inline constexpr", which is used in the source, can be simplified. However, you are right in that clang-tidy also warns for "static inline" where it does matter: https://godbolt.org/z/WqK3oK93Y It seems to indeed take inline as a hint, as the function will be inlined when you make it smaller. Just my 2c, Alex

El 21/05/2025 a las 16:42, Alexander Grund via Boost escribió:
From the point of view of the standard, there's no difference between static and non-static member functions when it comes to how inline applies to them --that is, it is redundant if the function definition is in-class. From an optimization-hint point of view, static vs. non-static may or may not make a difference, but at any rate compilers still seem to be paying attention to the inline keyword as of this writing.
It's used once, if I'm not wrong, here: https://github.com/joaquintides/bloom/blob/develop/include/boost/bloom/detai... Yes, removing inline here should be OK, as the function is always used at compile time, so inlining is immaterial.
Joaquin M Lopez Munoz

I had a bit more time to think :) so here are my replies and a few more questions.

On Sun, May 18, 2025 at 11:56 AM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
I am a strong fan of strong types (e.g. std::chrono), so a bits/bytes strong type has my vote; I am not sure what the community in general prefers.
This is a bit philosophical, but I actually do not feel this is correct. First of all, is (0, 0.0) the only use case where an fpr of 0.0 makes sense? I.e. any time n > 0, an fpr of 0.0 is impossible (or I misunderstood something). So the assert could be an implication (which is funny, because we had a discussion about implies on the ML a few months ago), something like:

    BOOST_IMPLICATION(fpr == 0.0, n == 0);

Similarly, for (1, 0.0) I do not believe the result should be the size_t max value, as this is not a correct value. Now, we both know you will OOM before noticing this in reality, but even if we imagine a magical computer that can allocate that much memory, the fpr is not 0.0. If this library was not C++11 I would suggest std::optional as the return type, but as it is, boost::optional seems like the best option. I know people might think I am making the API ugly for users, but I really do not think a bit more typing is such a big problem when the alternative is people messing up (probably not when using constants in code; more likely when they dynamically compute something). Bloom filters are important, but they are not like JSON or protobuf, present everywhere in a codebase; users needing a slightly uglier API in 10 LOC of a 1M LOC codebase does not seem like a big deal to me. So in the examples above:

    capacity_for(0, 0.0) -> min_possible_capacity
    capacity_for(1, 0.0) -> nullopt

A few more questions I thought of recently; apologies in advance if I misunderstood something.

1. There is some SIMD code that seems to have a cost in K proportional to the next multiple of 8 (i.e. K rounded up to a multiple of 8). I wonder, does that affect the recommendation for the optimal K or not? E.g. if the table we discussed before tells me to use K 14, should I bump it up to 16 if using fast_multiblock64?

    static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
    {
      for(int i=0;i<k/8;++i){
        if(!check_m256ix2(x[i],hash,8))return false;
        hash=detail::mulx64(hash);
      }
      if(k%8){
        if(!check_m256ix2(x[k/8],hash,k%8))return false;
      }
      return true;
    }

2. The fact that a Bloom filter has a break in lookup (you can break as soon as one bit is not set; let's ignore SIMD for now) makes me wonder if performance depends on hit rate. E.g. if I know my hit rate will be ~10% or ~90%, should that affect my picking of K?

3. I have just noticed I wrongly remembered that g++ defaults to -march=native; it does not. Why is this related to this library? Well, I am not sure the fallback in the fast_ headers should exist. In other words, when I want to use fast_ and AVX2 is not activated, I would prefer the code to fail to compile. So I would remove this fallback:

    template<std::size_t K>
    using fast_multiblock64=multiblock<boost::uint64_t,K>;

Now, there is a reasonable argument that people do not want their code to break when they change compiler flags or target a new architecture (e.g. one you did not add SIMD for), but I believe the SIMD speedup is so huge that this should not be a silent change. If users really want the fallback behavior, they can write their own preprocessor + using logic.

4. Another thing related to SIMD is the question of the ones arrays used in make_m256ix2/make_m256i. Initially I assumed they are not static const (guaranteed construction just once, but with thread-safe init overhead) because the compiler is always smart enough to optimize this to some read-only memory initialized at compile time, and everything is great. I looked into the asm of a simple test program I wrote on my machine; initialization is done at compile time, great.
But :) then I went to godbolt and made a small reproduction of the code, and there I cannot get the compilers to avoid populating that array at runtime (https://godbolt.org/z/sYfc7rffa). Now, I do not expect you to go over the godbolt link and try to nudge the compilers to optimize this by changing flags or example code; my question is more: if this is fragile, is that maybe a reason to use some ugly trick to initialize those values at compile time (e.g. as a constexpr array of properly aligned uint32 that, when interpreted as ones, has the correct values)? If anybody reading this has any idea why this happens, please do tell. My first guess is that I messed up when porting the code to the small godbolt example. :) My second guess would be that in Bloom the code I extracted for godbolt is used in a way that hints the compiler to do the initialization at compile time (e.g. the kp values are known at compile time), but this is just a guess. So, to be clear, there is no problem I can detect when using the Bloom library; I am just worried that with the huge combination of compilers, standards, optimization levels (e.g. O2 vs O3) and architectures, there may be a subset of users that will hit a performance issue if Bloom does not force compile-time init of those values. To be more specific, for this source:

    int main() {
        using simd_filter = boost::bloom::filter<std::string_view, 1,
            boost::bloom::fast_multiblock64<2>, 1>;
        simd_filter sf;
        char str[2] = {char('a'+(rand()%26)), '\0'};
        sf.insert(str);
        str[0] = char('a'+(rand()%26));
        return sf.may_contain(str);
    }

on my machine clang generates beautiful asm, in the sense that there is no initialization of ones at runtime. When I copy the godbolt example to my machine, initialization of ones at runtime exists, so it is probably not just a godbolt/my machine difference.

On 18/05/2025 at 23:38, Ivan Matek wrote:
Yes, it is impossible: the capacity would have to be infinite --the maximum attainable value is returned instead, though this is of little value as OOM would ensue (as you point out below).
I understand your point and can relate to it, but consider this: capacity_for(1, 1.E-200). Is this legit? OOM will happen here, too. Where do we put the limit?
I will address these comments tomorrow (out of time today), but I felt like answering the first part of your post now. Joaquin M Lopez Munoz

On 18/05/2025 at 23:38, Ivan Matek wrote:
This k%8 check doesn't really happen at run time, as k is a compile-time value, so k=14 should be (slightly) faster than k=16. So, I don't particularly recommend that a multiple of 8 be used if that's not optimal in terms of achieved FPR. That said, you can see in https://github.com/joaquintides/boost_bloom_benchmarks that execution times for fast_multiblockXX<K>, although roughly proportional to K, seem to occasionally favor k%8==0. So, in the end I'm afraid you'll have to measure to come up with the best configuration for your environment.
block and multiblock do not do early termination on lookup, while fast_multiblockXX do. The reason is that this is what was fastest overall in each case. As for the hit rate, well, I guess if you know most lookups are going to be negative, then you can maybe use a higher k. But then, for a given capacity and number of inserted elements, there's no point in increasing k beyond its optimal value (that yielding the smallest possible FPR).
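In sketch form, the two lookup strategies being compared are (simplified scalar code, not the actual library internals; test_bit is a hypothetical helper):

    #include <cstdint>

    // hypothetical helper: test bit p of a 64-bit-word array
    inline bool test_bit(const std::uint64_t* w, int p)
    {
      return (w[p / 64] >> (p % 64)) & 1;
    }

    // no early termination: accumulate all k bit tests branchlessly
    inline bool contains_no_early_exit(const std::uint64_t* w, const int* pos, int k)
    {
      bool res = true;
      for(int i = 0; i < k; ++i) res &= test_bit(w, pos[i]);
      return res;
    }

    // early termination: bail out on the first unset bit
    inline bool contains_early_exit(const std::uint64_t* w, const int* pos, int k)
    {
      for(int i = 0; i < k; ++i) if(!test_bit(w, pos[i])) return false;
      return true;
    }

Which of the two wins depends on how predictable the branch is, which is exactly where the hit rate enters the picture.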
That's a matter of opinion, I guess, but I'd rather have people not wanting the fallback write the compile-time check instead of the other way around. Sometimes you're not writing a final application but a library (say, on top of candidate Boost.Bloom), and you don't control compilation flags or target architecture.
Your godbolt snippet behaves differently from candidate Boost.Bloom because the values of kp are not known at compile time (whereas they are in candidate Boost.Bloom). If you instead do

    std::array vals = {5,6,7};

(https://godbolt.org/z/M7W5Tnbsq), the code becomes noticeably simpler, and in one compiler (Clang) the ones table goes away completely. In general, having the table marked as static or not makes no difference; compilers are smart enough to handle that regardless. I fail to see any run-time table initialization in your original snippet at https://godbolt.org/z/sYfc7rffa . Joaquin M Lopez Munoz

On Tue, May 20, 2025 at 5:10 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
I guess my concern is that people reading the documentation will assume that if fast_ compiles, it uses SIMD. But I see your point. To be clear, what I mean is this wording: "but uses faster SIMD-based algorithms when SSE2, AVX2 or Neon are available". A user might think: my CPU supports AVX2, so surely it will use SIMD algorithms. But "available" here refers to compiler options (and obviously CPU support when the binary is run), not just CPU support. I know I am not telling you anything you do not know; I just think a large percentage of users might misunderstand what "available" means.
I am not a SIMD expert, but is this not creating those variables on the stack? GCC asm:

    vbroadcastsd ymm1, qword ptr [rip + .LCPI0_1]
    vmovaps xmm3, xmm1
    vmovaps ymmword ptr [rsp + 64], ymm3
    vpmovsxbq ymm4, dword ptr [rip + .LCPI0_4]
    vmovaps ymmword ptr [rsp + 128], ymm4
    vmovaps ymmword ptr [rsp + 192], ymm1
    vmovaps ymmword ptr [rsp + 256], ymm1
    vmovaps ymmword ptr [rsp + 320], ymm1
    vmovaps ymmword ptr [rsp + 384], ymm1
    vmovaps ymmword ptr [rsp + 448], ymm1

But again, my question was mostly about how certain those optimizations are for Bloom, considering the huge variety of compilers and compiler options, not to mention some future refactoring that might trip up the compiler optimizations. Now, I may be just too paranoid, but those variables are not simple ints, so I suspect that is why compilers have a problem computing them at compile time in my godbolt example, although, as you said, they do it successfully for Bloom, and I have verified that in my example code on my machine the compiler optimizes it.

On 20/05/2025 at 18:00, Ivan Matek wrote:
Yes, you're right, I can rewrite "are available" as "are enabled at compile time".
Umm, yes, maybe. Anyway, scratch what I said about compilers not really caring about const vs. static const: adding static to your snippet severely pessimizes the codegen, with static initialization guards and all. So there you have the explanation of why static was not used :-) For the record, during development I examined the codegen for all fast_multiblockXX classes with the three major compilers, Intel and ARM, to check that nothing looked bad. Joaquin M Lopez Munoz
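For readers following along, a minimal sketch of the effect (assuming AVX and the standard intrinsics; not the library's code):

    #include <immintrin.h>

    __m256i no_guard()
    {
      // plain const local: compilers can fold this into a constant load
      const __m256i ones = _mm256_set1_epi64x(1);
      return ones;
    }

    __m256i with_guard()
    {
      // function-local static of a non-constant-initializable type:
      // a thread-safe initialization guard is emitted around the first call
      static const __m256i ones = _mm256_set1_epi64x(1);
      return ones;
    }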

On Tue, May 20, 2025 at 6:31 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
Thank you, I believe that is a big improvement.
Yes, I have noticed static messes it up, although for ints (https://godbolt.org/z/oYM15zYoW) the compiler is smart enough not to emit that guard. That is one of the reasons why I am so paranoid that this optimization might stop working with some future compiler; SIMD intrinsics may be harder for the compiler to reason about than "just" ints.
I agree that 99% it will never break, since I presume compilers rarely regress in this manner... but I still think there is a tiny chance they might. :) One more question: I have some handcrafted tests (where the Bloom filter is so small it fits in L1/L2 cache, and the hit rate of lookups is 0%, besides false positives), and the SIMD one is a bit slower than no SIMD for certain values of K:

    constexpr size_t num_inserted = 10'000;
    constexpr double fpr = 1e-5;
    constexpr size_t K = 5;
    using vanilla_filter = boost::bloom::filter<uint64_t, 1,
        boost::bloom::multiblock<uint64_t, K>, 1>;
    using simd_filter = boost::bloom::filter<uint64_t, 1,
        boost::bloom::fast_multiblock64<K>, 1>;

I presume that is expected, since it is hard to make sure SIMD is always faster, but I just wanted to double-check with you that this is not an unexpected result. So, to recap my question: if the Bloom filter fits in L1 or L2 cache, is it best practice to check whether the SIMD or the normal version is faster, instead of assuming SIMD always wins?

On 20/05/2025 at 21:07, Ivan Matek wrote:
Benchmarks at https://github.com/joaquintides/boost_bloom_benchmarks show that the advantage of fast_multiblock64<K> with respect to multiblock<uint64_t, K> is small for some compilers (Clang, VS) and low values of K, and occasionally multiblock wins (though these measurements come with a fair degree of noise). So, yes, I'd profile to make sure. In the case of fast_multiblock32<K> vs. multiblock<uint64_t, K>, the advantage of the former is much more clear (note that multiblock<uint32_t, K> is not included in the benchmarks because it does not get us anything with respect to multiblock<uint64_t, K>). Joaquin M Lopez Munoz
participants (5)
- Alexander Grund
- Ivan Matek
- Joaquin M López Muñoz
- Peter Dimov
- Steven Watanabe