
On 02/05/2025 at 3:05, Ivan Matek via Boost wrote:
On Fri, May 2, 2025 at 2:10 AM Jean-Louis Leroy via Boost <boost@lists.boost.org> wrote:
Maybe Joaquín wants to jump in; I don't want to put words in his mouth ;-)
Would be nice. From what I got from his talk, some behavior I have seen in my experiments looks implausible: e.g., finding a perfect hash for 368 items in 512 buckets seems almost impossible when we apply the formula from the talk.
Ok, boost::openmethod::policies::fast_perfect_hash::hash_initialize(first, last) takes a range of N class info objects, calculates M as the lowest shift such that 1<<M >= N, and then randomly tries multipliers C until one of them is such that (type_id*C)>>(64-M) produces no collisions, that is, until it acts as a perfect hash for the type_ids in [first, last) into 2^M buckets. 100,000 random multipliers are tried, and if all of them fail then M is bumped by 1, 2 and 3; if nothing works, the function bails out.

For N = 368, the values of M tried are 9, 10, 11 and 12, which correspond to 512, 1024, 2048 and 4096 buckets, respectively. Now, according to the formulae from the talk you mention, the average number of random tries necessary to find a perfect multiplier (assuming the input is random) is, for each number of buckets:

     512 buckets: 1.63840297782e+80 tries
    1024 buckets: 7.17526892837e+32 tries
    2048 buckets: 1.83078061406e+15 tries
    4096 buckets: 24222060 tries

all of them much greater than the 100,000 limit imposed by hash_initialize for each value of M. So, as you point out, registration of 368 classes should fail with extremely high probability, right?

I wrote this small program that mocks up the registration of 368 classes and calls fast_perfect_hash::hash_initialize:

    #include <array>
    #include <climits>
    #include <iostream>
    #include <random>
    #include <vector>
    #include <boost/openmethod.hpp>
    #include <boost/unordered/unordered_flat_set.hpp>

    using namespace boost::openmethod;

    struct fake_class
    {
        fake_class(type_id id) { type_ids[0] = id; }

        auto type_id_begin() { return &type_ids[0]; }
        auto type_id_end() { return &type_ids[1]; }

        type_id type_ids[1];
    };

    static constexpr std::size_t num_classes = 368;

    int main()
    {
        std::vector<fake_class> data;
        boost::unordered_flat_set<type_id> unique;
        std::default_random_engine rnd(13081963);
        std::uniform_int_distribution<type_id> d;

        // generate num_classes distinct random type_ids
        for (std::size_t i = 0; i < num_classes; ++i) {
            for (;;) {
                auto id = d(rnd);
                if (unique.insert(id).second) {
                    data.push_back(id);
                    break;
                }
            }
        }

        using fast_perfect_hash =
            policies::fast_perfect_hash<BOOST_OPENMETHOD_DEFAULT_POLICY>;

        fast_perfect_hash::hash_initialize(data.begin(), data.end());

        // locally tweaked fast_perfect_hash so that these members are public
        std::cout << "hash_mult: " << fast_perfect_hash::hash_mult << "\n";
        std::cout << "M: "
                  << (sizeof(std::size_t) * CHAR_BIT -
                      fast_perfect_hash::hash_shift)
                  << "\n";
    }

As type_ids we simply use random numbers, and, sure enough, the program fails:

    could not find hash factors after 400000s using 8192 buckets
    D:\openmethods\x64\Debug\openmethods.exe (process 24820) exited with code 3.

Now, let's change the uniform distribution (d) as follows:

    std::uniform_int_distribution<type_id> d(1, 1000);

and, with this change, registration succeeds for 1024 buckets!

    hash_mult: 16501117525188094459
    M: 10
    D:\openmethods\x64\Debug\openmethods.exe (process 24332) exited with code 0.

What's happening? In the second run, all the type_ids are contained in a range that is much smaller than the entire space of 64-bit numbers, so it is *much much* more likely that a random multiplier will be perfect. In real life, type_ids are just the addresses of typeid(X) for each registered class X, and as it happens these usually occupy a contiguous region of memory instead of being scattered across the 2^64 bytes of the memory space.

To sum up, fast_perfect_hash::hash_initialize works against theoretical expectation because the type_info objects in a program occupy a small, contiguous range of addresses within the memory space.

Joaquin M Lopez Munoz
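
P.S. For anyone who wants to double-check the four figures above, here is a minimal program that reproduces them, assuming the estimate from the talk is the usual birthday-problem one: the probability that a random multiplier is perfect for N random keys in 2^M buckets is prod_{i=0}^{N-1} (1 - i/2^M), and the expected number of tries is its reciprocal.

    // Expected number of random multipliers to try before finding a
    // perfect hash for N random keys into 2^M buckets, assuming
    // P(perfect) = prod_{i=0}^{N-1} (1 - i/2^M).
    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int N = 368;
        for (int M = 9; M <= 12; ++M) {
            const double buckets = std::ldexp(1.0, M); // 2^M
            double log_p = 0.0;                        // log P(perfect)
            for (int i = 0; i < N; ++i)
                log_p += std::log1p(-i / buckets);
            std::printf("%5.0f buckets: %.11e expected tries\n",
                        buckets, std::exp(-log_p));
        }
    }

This prints values matching the figures listed above, up to floating-point rounding.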
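
P.P.S. And here is roughly what the search itself looks like. This is an illustrative sketch, not the actual Boost.OpenMethod code; the function name, the odd-multiplier trick and the fake addresses are my own choices for the example.

    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Returns a multiplier C such that (id * C) >> (64 - M) maps the
    // given ids to distinct buckets, or 0 if none is found in budget.
    std::uint64_t find_hash_mult(const std::vector<std::uint64_t>& ids, int M)
    {
        std::default_random_engine rnd;
        std::uniform_int_distribution<std::uint64_t> d;
        std::vector<bool> used;

        for (int attempt = 0; attempt < 100000; ++attempt) {
            // force an odd C, a common choice in multiplicative hashing
            const std::uint64_t C = d(rnd) | 1;
            used.assign(std::size_t(1) << M, false);
            bool collision = false;
            for (auto id : ids) {
                const auto bucket = (id * C) >> (64 - M);
                if (used[bucket]) { collision = true; break; }
                used[bucket] = true;
            }
            if (!collision) return C; // perfect for these ids
        }
        return 0; // the real thing bumps M before giving up
    }

    int main()
    {
        // ids in a small contiguous range, like the addresses of
        // type_info objects in practice (base address is made up)
        std::vector<std::uint64_t> ids;
        for (std::uint64_t i = 0; i < 368; ++i)
            ids.push_back(0x7ff654320000 + 16 * i);

        const auto C = find_hash_mult(ids, 10); // 1024 buckets
        std::printf("hash_mult: %llu\n", (unsigned long long)C);
    }

With clustered ids like these, the sketch typically finds a perfect multiplier within a handful of attempts, mirroring the second run of the program above.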