
On 18/05/2025 at 23:38, Ivan Matek wrote:
[...] Few more questions I thought of recently, apologies in advance if I misunderstood something.
1. There is some SIMD code whose cost in K seems proportional to the next multiple of 8 (rounded up). I wonder whether that affects the recommendation for optimal K or not? E.g. if the table we discussed before tells me to use K=14, should I bump it up to 16 if using fast_multiblock64?

    static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
    {
      for(int i=0;i<k/8;++i){
        if(!check_m256ix2(x[i],hash,8))return false;
        hash=detail::mulx64(hash);
      }
      if(k%8){
        if(!check_m256ix2(x[k/8],hash,k%8))return false;
      }
      return true;
    }
This k%8 check doesn't really happen at run time, as k is a compile-time value, so k=14 should be (slightly) faster than k=16. So, I don't particularly recommend bumping k up to a multiple of 8 if that's not optimal in terms of achieved FPR. That said, you can see at https://github.com/joaquintides/boost_bloom_benchmarks that execution times for fast_multiblockXX&lt;K&gt;, although roughly proportional to K, occasionally seem to favor k%8==0. So, in the end, I'm afraid you'll have to measure to come up with the best configuration for your environment.
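To illustrate why the k%8 branch costs nothing at run time, here is a minimal sketch (my own illustration of the structure, not the library's actual code): with K a template parameter, the number of 8-wide group checks is a compile-time constant, so K=14 and K=16 need the same number of group operations, with K=14 merely testing a shorter tail.

```cpp
#include <cstddef>

// Sketch only: K is a compile-time constant, so K/8 and K%8 are folded
// by the compiler and no run-time modulo is ever executed.
template<std::size_t K>
constexpr std::size_t simd_group_checks()
{
    return K / 8 + (K % 8 ? 1 : 0); // full 8-wide groups plus optional tail
}

// K=14 and K=16 both need two group checks, but the K=14 tail only
// compares 6 lanes, which is why K=14 can be slightly faster.
static_assert(simd_group_checks<14>() == 2, "two group checks for K=14");
static_assert(simd_group_checks<16>() == 2, "two group checks for K=16");
```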
2. The fact that a Bloom filter lookup can terminate early (you can break as soon as one bit is not set; let's ignore SIMD for now) makes me wonder whether performance depends on hit rate. E.g. if I know my hit rate will be ~10% or ~90%, should that affect my picking of K?
block and multiblock do not do early termination on lookup, while fast_multiblockXX do; in each case, that is simply what proved fastest overall. As for the hit rate: if you know most lookups are going to be negative, then you can maybe use a higher k. But, for a given capacity and number of inserted elements, there's no point in increasing k beyond its optimal value (the one yielding the smallest possible FPR).
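For reference, the standard estimates behind that "optimal value" can be sketched as follows (textbook Bloom filter formulas, not Boost.Bloom API): for m bits of capacity and n inserted elements, the FPR-minimizing k is (m/n)·ln 2, and pushing k past it makes the FPR worse again.

```cpp
#include <cmath>

// Textbook Bloom filter estimates (an illustration, not library code).
// m: capacity in bits, n: number of inserted elements.
double optimal_k(double m, double n)
{
    return (m / n) * std::log(2.0); // k minimizing the false positive rate
}

double fpr(double m, double n, double k)
{
    // Approximate false positive rate for k hash functions.
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```

With 12 bits per element, for example, optimal_k(12, 1) is about 8.3, and both fpr(12, 1, 4) and fpr(12, 1, 16) come out worse than fpr(12, 1, 8).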
3. I have just noticed I wrongly remembered that g++ defaults to -march=native; it does not. Why is this related to this library? Well, I am not sure the fallback in the fast_ headers should exist. In other words, when I want to use fast_ and AVX2 is not activated, I would prefer the code to fail to compile. So I would remove this fallback:

    template<std::size_t K> using fast_multiblock64=multiblock<boost::uint64_t,K>;

Now, there is a reasonable argument that people do not want their code to break when they change compiler flags or target a new architecture (e.g. one for which you did not add SIMD), but I believe the SIMD speedup is so huge that this should not be a silent change. If users really want the fallback behavior, they can write their own preprocessor + using logic.
That's a matter of opinion, I guess, but I'd rather have people not wanting the fallback write the compile-time check instead of the other way around. Sometimes you're not writing a final application but a library (say, on top of candidate Boost.Bloom), and you don't control compilation flags or target architecture.
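The user-side compile-time check can be quite small. A sketch, assuming the compiler-defined __AVX2__ macro (which GCC and Clang set under -mavx2 or a suitable -march, and MSVC sets under /arch:AVX2); the actual macro the library keys off internally may differ:

```cpp
// Sketch of a user-side guard (my assumption, not library code): detect
// AVX2 at compile time and fail loudly, instead of silently getting the
// scalar fallback for fast_multiblock64.
#if defined(__AVX2__)
inline constexpr bool avx2_enabled = true;
#else
inline constexpr bool avx2_enabled = false;
#endif

// Uncomment to turn the silent fallback into a hard build error:
// static_assert(avx2_enabled,
//               "fast_multiblock64 needs AVX2: build with -mavx2");
```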
4. Another thing related to SIMD is the question of the ones arrays used in make_m256ix2/make_m256i. Initially I assumed they are not static const (guaranteed construction just once, but with thread-safe init overhead) because the compiler is always smart enough to optimize this into read-only memory initialized at compile time, and everything is great. I looked into the asm of a simple test program I wrote on my machine: initialization is done at compile time, great. But :) then I went to godbolt and made a small reproduction of the code, and there I cannot get the compilers to avoid populating that array at runtime: https://godbolt.org/z/sYfc7rffa

Now, I do not expect you to go over the godbolt link and try to nudge the compilers into optimizing this by changing flags or example code. My question is more: if this is fragile, is that maybe a reason to use some ugly trick to initialize those values at compile time (e.g. as a constexpr array of properly aligned uint32 that, when interpreted as ones, has the correct values)? If anybody reading this has any idea why this happens, please do tell. My first guess is that I messed up when porting the code to the small godbolt example. :) My second guess would be that in bloom the code I extracted for godbolt is used in a way that hints the compiler to do the initialization at compile time (e.g. the kp values are known at compile time), but this is just a guess.

So, to be clear, there is no problem I can detect when using the bloom library. I am just worried that, with the huge combination of compilers, standards, optimization levels (e.g. O2 vs O3) and architectures, there may be a subset of users that will hit a performance issue if bloom does not force compile-time init of those values.
To be more specific, for this source:
    int main()
    {
      using simd_filter = boost::bloom::filter<
        std::string_view, 1, boost::bloom::fast_multiblock64<2>, 1>;
      simd_filter sf;
      char str[2] = {char('a'+(rand()%26)), '\0'};
      sf.insert(str);
      str[0] = char('a'+(rand()%26));
      return sf.may_contain(str);
    }

on my machine clang generates beautiful asm, in the sense that there is no initialization of ones at runtime. When I copy the godbolt example to my machine, initialization of ones at runtime exists, so it is probably not just a godbolt/my-machine difference.
Your godbolt snippet behaves differently from candidate Boost.Bloom because the values of kp are not known at compile time (whereas they are in candidate Boost.Bloom). If you instead do std::array vals = {5,6,7}; ( https://godbolt.org/z/M7W5Tnbsq ) then the code becomes noticeably simpler, and in one compiler (Clang) the ones table goes away completely. In general, having the table marked static or not makes no difference; compilers are smart enough to handle that regardless. In fact, I fail to see any run-time table initialization in your original snippet at https://godbolt.org/z/sYfc7rffa .

Joaquin M Lopez Munoz
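If one did want to force compile-time initialization along the lines Ivan suggests, a minimal sketch of the general technique could look like this (an illustration only; the names and the 64-bit element type are my assumptions, not the library's actual ones table):

```cpp
#include <cstdint>

// A constexpr, 32-byte-aligned table: guaranteed constant-initialized,
// so there is no run-time population and no thread-safe-static overhead,
// and the storage is aligned suitably for an AVX2 _mm256_load_si256.
struct alignas(32) ones_table
{
    std::uint64_t v[8];
};

inline constexpr ones_table ones = {{1, 1, 1, 1, 1, 1, 1, 1}};

static_assert(alignof(ones_table) == 32, "fits aligned 256-bit loads");
static_assert(ones.v[0] == 1 && ones.v[7] == 1, "compile-time values");
```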