
On Tue, May 20, 2025 at 6:31 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
A user might think: my CPU supports AVX2, so surely it will use the SIMD algorithms. But "available" here refers to compiler options (and, obviously, CPU support when the binary is run), not just to CPU support. I know I am not telling you anything you do not know; I just think a large percentage of users might misunderstand what "available" means.
Yes, you're right, I can rewrite "are available" as "are enabled at compile time".
Thank you, I believe that is a big improvement.
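To make the "enabled at compile time" distinction concrete, here is a minimal sketch. It relies only on the `__AVX2__` predefined macro, which GCC and Clang define under `-mavx2` (or a suitable `-march=`) and MSVC defines under `/arch:AVX2`. The function name is hypothetical and is not part of Boost.Bloom.

```cpp
// Hypothetical helper, not a Boost.Bloom API.
// Reports whether the compiler was told to target AVX2. This reflects the
// build flags only; it says nothing about the CPU the binary later runs on.
constexpr bool avx2_enabled_at_compile_time()
{
#if defined(__AVX2__)
  return true;
#else
  return false;
#endif
}
```

On an AVX2-capable machine this still returns false if the build did not pass the relevant flag, which is exactly the case a user could misread as "available".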
Umm, yes, maybe. Anyway, scratch what I said about compilers not really caring about const vs. static const: adding static to your snippet severely pessimizes the codegen, with static initialization guards and all. So there goes your explanation of why static was not used :-)
Yes, I have noticed that static messes it up, although for ints <https://godbolt.org/z/oYM15zYoW> the compiler is smart enough not to emit that guard. That is one of the reasons why I am so paranoid that this optimization might stop working with some future compiler. SIMD intrinsics may be harder for the compiler to reason about than "just" ints.
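For readers following along, the const vs. static const difference can be sketched as below. This is an illustration and not the snippet discussed above: `make_mask` is a hypothetical stand-in for an initializer such as a SIMD intrinsic, and whether a guard is actually emitted depends on the compiler and on whether it can fold the initializer to a constant.

```cpp
#include <cstdint>

// Hypothetical stand-in for an initializer that is not a constant expression
// (in the thread, the real case is a SIMD intrinsic).
std::uint64_t make_mask(int shift)
{
  return std::uint64_t(1) << shift;
}

std::uint64_t with_local_const(std::uint64_t x)
{
  // Plain local: initialized on every call, so no guard is needed and the
  // optimizer is free to fold the value.
  const std::uint64_t mask = make_mask(5);
  return x & mask;
}

std::uint64_t with_static_const(std::uint64_t x)
{
  // Function-local static: if the initializer is not folded to a constant,
  // first-call initialization must be thread-safe, which typically means a
  // guard variable and a check on every call.
  static const std::uint64_t mask = make_mask(5);
  return x & mask;
}
```

Both functions return the same values; the difference only shows up in the generated code, which is why inspecting the output on a tool like Compiler Explorer is the practical way to check.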
For the record, during development I examined the codegen for all fast_multiblockXX classes with the three major compilers, Intel and ARM, to check that nothing looked bad.
I agree that 99% of the time it will never break, since I presume compilers will rarely regress in this manner... but I still think there is a tiny chance they might. :)

One more question: I have some handcrafted tests (where the Bloom filter is so small that it fits in L1/L2 cache, and the hit rate of lookups is 0%, apart from false positives), and the SIMD one is a bit slower than the non-SIMD one for certain values of K.

    constexpr size_t num_inserted = 10'000;
    constexpr double fpr = 1e-5;
    constexpr size_t K = 5;
    using vanilla_filter = boost::bloom::filter<uint64_t, 1, boost::bloom::multiblock<uint64_t, K>, 1>;
    using simd_filter = boost::bloom::filter<uint64_t, 1, boost::bloom::fast_multiblock64<K>, 1>;

I presume that is expected, since it is hard to make sure SIMD is always faster, but I just wanted to double check with you that this is not an unexpected result.

So to recap my question: if the Bloom filter fits in L1 or L2 cache, is it best practice to check whether the SIMD or the normal version is faster instead of assuming SIMD always wins?
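A comparison of this kind could be measured with something like the sketch below. It is not the test code described above: `best_time_ns`, `count_hits` and `stub_filter` are hypothetical names, and the stub only stands in for a filter type so the snippet does not depend on Boost. It assumes the filter exposes a const `may_contain` lookup, as Boost.Bloom filters do.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>

// Best (minimum) wall-clock time in nanoseconds over `reps` runs of f().
// Taking the minimum reduces noise from scheduling and cold caches.
template<typename F>
std::int64_t best_time_ns(F&& f, int reps = 10)
{
  using clock = std::chrono::steady_clock;
  std::int64_t best = std::numeric_limits<std::int64_t>::max();
  for (int i = 0; i < reps; ++i) {
    const auto t0 = clock::now();
    f();
    const auto t1 = clock::now();
    const std::int64_t ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    best = std::min(best, ns);
  }
  return best;
}

// Looks up `lookups` consecutive keys starting at `first` and returns the
// number of positive answers, so the loop has an observable result.
template<typename Filter>
std::size_t count_hits(const Filter& f, std::uint64_t first, std::size_t lookups)
{
  std::size_t hits = 0;
  for (std::size_t i = 0; i < lookups; ++i) {
    hits += f.may_contain(first + i) ? 1 : 0;
  }
  return hits;
}

// Hypothetical placeholder with the same lookup shape as a Bloom filter;
// it reports even keys as present. Replace with the real filter types.
struct stub_filter
{
  bool may_contain(std::uint64_t x) const { return x % 2 == 0; }
};
```

With the real types, one would pass a lambda calling `count_hits` on `vanilla_filter` and on `simd_filter` to `best_time_ns`, use keys that were never inserted to reproduce the 0% hit rate, and compare the two timings for each K of interest.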