Re: [boost] [openmethod] On the unreasonable efficiency of fast_perfect_hash

El 12/05/2025 a las 12:06, Ivan Matek escribió:
On Sun, May 11, 2025 at 9:01 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
https://github.com/joaquintides/perfect_range_hash
Thank you for this. Unfortunately even this basic math is a bit too complex for me, so a few small comments:
1. Stylistically, I would link to a particular commit hash of the source code, although a documentation link is nicer. The reason is that a documentation link can die.
I can relink to a more stable doc if the library is accepted.
2. AFAIK C is not chosen uniformly at random over 0 <= C < 2^64, i.e. it is always an odd number.
You're right. We'd have to ask the author, but I guess the "| 1" in "hash_mult = uniform_dist(rnd) | 1" is there to avoid the (very unlikely) case hash_mult = 0. In any case, this small adjustment doesn't change the big picture statistically: the output distribution is largely dominated by C/B, where C is the multiplier and B is 2^(64-m), so the least significant bit of C is mostly swallowed along the way.
3. In my brief experiments I tried to add a large gap and randomly pick a subset of types, and performance was still amazing.
Example for 3.: [...]
This case of 231 type_ids scattered in a range of 3000 positions is lower-bounded by (has better chances than) the baseline case with 3000 adjacent positions, which has a probability 0.14 of finding a perfect hash multiplier for 4096 buckets. Note that a "hard" case would involve having your type_ids distributed across the entire 64-bit space of possible numerical values; in the example you wrote, the values are confined to a range that is around 10^14 times smaller than 2^64. This is the fact that the algorithm implicitly takes advantage of.
Joaquin M Lopez Munoz

On Mon, May 12, 2025 at 1:48 PM Joaquin M López Muñoz via Boost <boost@lists.boost.org> wrote:
El 12/05/2025 a las 12:06, Ivan Matek escribió: You're right. We'd have to ask the author, but I guess the "| 1" in "hash_mult = uniform_dist(rnd) | 1" is there to avoid the (very unlikely) case hash_mult = 0. In any case, this small adjustment doesn't change the big picture
I think the reason people pick an odd number is that if your multiplier is a power of 2, the last bits are guaranteed to be 0. Not sure that matters a lot here since we only use a few top bits. https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that... "the number 11400714819323198486 is closer but we don't want multiples of two because that would throw away one bit"
statistically --the output distribution is largely dominated by C/B, where C is the multiplier and B is 2^(64-m), so the least significant bit of C is mostly swallowed along the way.
This case of 231 type_ids scattered in a range of 3000 positions is lower-bounded by (has better chances than) the baseline case with 3000 adjacent positions, which has a probability 0.14 of finding a perfect hash multiplier for 4096 buckets. Note that a "hard" case would involve having your type_ids distributed across the entire 64-bit space of possible numerical values, and in the example you wrote the values are confined to a range that is around 10^14 times smaller than 2^64
Thank you for the explanation. It is interesting to me that even if the values are not consecutive, but are bounded within a "small" (compared to the 64-bit range) interval, perfect hashing still works great. That was surprising to me, but I guess even an uneven distribution over a small range of values can be nicely hashed. To think a bit more broadly, not just in terms of openmethods: this is quite interesting to me. E.g., does that mean that randomly picked short (16-bit) numbers can be hashed much better (assuming a "proper" 64-bit hash, even though that is not what many STL implementations do, i.e. they use the identity <https://godbolt.org/z/bYYT3M5j1>) than the same number of randomly picked int (32-bit) numbers? What about randomly picked ints vs randomly picked positive ints (not unsigned int, but int whose value is positive)?

You're right. We'd have to ask the author, but I guess the "| 1" in "hash_mult = uniform_dist(rnd) | 1" is there to avoid the (very unlikely) case hash_mult = 0. In any case, this small adjustment doesn't change the big picture
I think the reason people pick an odd number is that if your multiplier is a power of 2, the last bits are guaranteed to be 0. "the number 11400714819323198486 is closer but we don't want multiples of two because that would throw away one bit"
Yes, that was my intuition (I dare not say reasoning). I tried discussing this with a mathematician friend, but all I could get out of her was: "my field is differential topology, not integer number theory" :-D
the values are confined to a range that is around 10^14 times smaller than 2^64
type_ids are few; in theory they are confined to the initialized data segment, and in practice they are consecutive in a small range of memory. J-L

Jean-Louis Leroy wrote:
type_ids are few, and they are confined in the initialized data segment in theory, and in practice they are consecutive in a small range of memory.
Incidentally, am I right in inferring from this conversation that the library assumes that &typeid(T) == &typeid(T)? This isn't true on Windows when DLLs are involved, or under other OSes when shared objects (.so) and -fvisibility=hidden come into play.

Incidentally, am I right in inferring from this conversation that the library assumes that &typeid(T) == &typeid(T)?
No. See here: https://tinyurl.com/2afhuju6 It looks like I was a tad too optimistic when I wrote the comments, though ;-) J-L

On Mon, May 12, 2025, at 5:09 PM, Jean-Louis Leroy via Boost wrote:
Incidentally, am I right in inferring from this conversation that the library assumes that &typeid(T) == &typeid(T)?
No. See here: https://tinyurl.com/2afhuju6
May I suggest *not* using a URL shortener. As a rule, people should not have to click on opaque links in emails. Here, it's worth more to see https://github.com/jll63/Boost.OpenMethod/blob/BOOST_REVIEW/include/boost/op... which immediately conveys that we're going to 30-ish lines of `openmethod/compiler.hpp`. It's a win-win: people see where they will end up, and can also predict what information they might find (to decide whether they want to). Thanks, Seth
participants (5)
- Ivan Matek
- Jean-Louis Leroy
- Joaquin M López Muñoz
- Peter Dimov
- Seth