
El 12/05/2025 a las 12:06, Ivan Matek escribió:
On Sun, May 11, 2025 at 9:01 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
https://github.com/joaquintides/perfect_range_hash
Thank you for this. Unfortunately even this basic math is a bit too complex for me so few small comments:
1. stylistically I would link to particular hash of source code, although documentation link is nicer. Reason is that documentation link can die
I can relink to a more stable doc if the lib is accepted.
2. afaik C is not randomly chosen between 0<= C <= 2^64, i.e. it is always an odd number.
You're right. We'd have to ask the author, but I guess the "| 1" in "hash_mult = uniform_dist(rnd) | 1" is there to avoid the (very unlikely) case hash_mult = 0. In any case, this small adjustment doesn't change the big picture statistically --the output distribution is largely dominated by C/B, where C is the multiplier and B is 2^(64-m), so the least significant bit of C is mostly swallowed along the way.
3. in my brief experiments I tried to add large gap, and randomly pick subset of types, and performance was still amazing.
Example for 3.: [...]
This case of 231 type_ids scattered in a range of 3000 positions is lower-bounded by (has better chances than) the baseline case with 3000 adjacent positions, which has a probability 0.14 of finding a perfect hash multiplier for 4096 buckets. Note that a "hard" case would involve having your type_ids distributed across the entire 64-bit space of possible numerical values, and in the example you wrote the values are confined to a range that is around 10^14 times smaller than 2^64 --this is the fact that the algorithm takes an implicit advantage of. Joaquin M Lopez Munoz