
On 12/7/24 03:16, Peter Dimov via Boost wrote:
> Andrey Semashev wrote:
>> 2. Somewhat related, on the use case of extending the size of the hash value by repeatedly calling result(). Is this something that should be encouraged? I'm not a specialist, but my understanding is that extending the hash value this way does not improve its quality, as you get a series of bits in the value that are predetermined based on the lower bits. Isn't that effectively equivalent to just e.g. zero-extending the hash value?
> No, it isn't. Hashes often have more internal state than they output as the final result, so extension can actually produce non-redundant bits. This is true, for example, for xxhash and siphash, and there's an example in the docs of extending xxhash.
> Even if the effective number of bits doesn't change, it's still not the same as zero-extending, because a zero-extended hash is of very poor quality. (The constant zero bits, being constant and zero, are completely unaffected by the input.)
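To make the quoted point concrete, here is a minimal sketch of a hasher whose internal state (128 bits) is wider than its result (64 bits). The class is a made-up toy, not Boost.Hash2's xxhash; the update()/finalize() names merely follow the interface discussed in this thread. Because finalize() mixes and then advances the state, a second call returns bits that are not derivable from the first result alone:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Toy hasher with 128 bits of internal state but a 64-bit result.
    // Hypothetical; not Boost.Hash2's xxhash, only the same shape of API.
    class toy_hash
    {
    private:

        std::uint64_t s0_ = 0x9E3779B97F4A7C15ull; // arbitrary nonzero seeds
        std::uint64_t s1_ = 0xBF58476D1CE4E5B9ull;

        static std::uint64_t mix( std::uint64_t x )
        {
            // splitmix64's finalizer, a well-known 64-bit mixer
            x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
            x ^= x >> 27; x *= 0x94D049BB133111EBull;
            x ^= x >> 31;
            return x;
        }

    public:

        void update( void const * p, std::size_t n )
        {
            unsigned char const * q = static_cast< unsigned char const* >( p );

            for( std::size_t i = 0; i < n; ++i )
            {
                s0_ = mix( s0_ ^ q[ i ] );
                s1_ = mix( s1_ + q[ i ] );
            }
        }

        // May be called repeatedly; each call mixes and advances the state,
        // so the second call exposes state bits the first one did not.
        std::uint64_t finalize()
        {
            std::uint64_t r = mix( s0_ + s1_ );

            s0_ = mix( s0_ );
            s1_ = mix( s1_ + 1 );

            return r;
        }
    };

    int main()
    {
        toy_hash h;
        h.update( "hello", 5 );

        std::uint64_t lo = h.finalize(); // baseline 64-bit value
        std::uint64_t hi = h.finalize(); // extension draws on the remaining state

        std::printf( "%016llx%016llx\n",
            (unsigned long long)hi, (unsigned long long)lo );
    }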
Thanks. So for hash functions that maintain internal state larger than the hash value, subsequent calls to finalize() (formerly, result()) expose that additional state, which makes the extended hash value of higher quality than the baseline one.

However, I still don't see the benefit of extending through subsequent calls to finalize() when the internal state is not larger than the hash value. Any input that produces a particular hash value v will also produce the same extended value xv, and that holds whether you extend through additional calls to finalize(), by duplicating the baseline hash value, or by zero-extension. Of course, the contribution to uniqueness of the individual bits in xv differs in each case, but as far as the uniqueness of xv as a whole is concerned, there seems to be no difference. And if you need to extend the hash value in the first place, you are probably interested in the whole xv rather than in a portion of it that may not be as effective as the whole.
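The collision argument is easy to demonstrate. In the following sketch (again a made-up toy, with exactly 32 bits of state feeding a 32-bit result), two inputs that collide in the baseline value necessarily collide in the extended 64-bit value too, because every finalize() call is a function of the same 32-bit state:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Deliberately weak toy hasher: exactly 32 bits of state, 32-bit result.
    class toy32
    {
    private:

        std::uint32_t s_ = 0;

        static std::uint32_t mix( std::uint32_t x )
        {
            x ^= x >> 16; x *= 0x7FEB352Du;
            x ^= x >> 15; x *= 0x846CA68Bu;
            x ^= x >> 16;
            return x;
        }

    public:

        void update( void const * p, std::size_t n )
        {
            unsigned char const * q = static_cast< unsigned char const* >( p );

            // order-insensitive accumulation, so collisions are easy to exhibit
            for( std::size_t i = 0; i < n; ++i ) s_ += q[ i ];
        }

        std::uint32_t finalize()
        {
            std::uint32_t r = mix( s_ );
            s_ = mix( s_ + 1 ); // advance the state for extension
            return r;
        }
    };

    static std::uint64_t extended( char const * s )
    {
        toy32 h;
        h.update( s, std::strlen( s ) );

        std::uint64_t lo = h.finalize();
        std::uint64_t hi = h.finalize(); // extra bits, but still a function
                                         // of the same 32-bit state as lo
        return hi << 32 | lo;
    }

    int main()
    {
        // "ab" and "ba" collide in the 32-bit value, so they necessarily
        // collide in the extended 64-bit value as well; both lines match.
        std::printf( "%016llx\n%016llx\n",
            (unsigned long long)extended( "ab" ),
            (unsigned long long)extended( "ba" ) );
    }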
> Ideally, each bit in the hash value should have about 50% probability of being 1. This is called "avalanche property".
My understanding is that the main property of a hash function that determines its quality is how uniquely each hash value identifies a given input. There are other requirements, such as the infeasibility of reconstructing the input from the hash value, but I'd say uniqueness is the number one requirement. Sometimes this is interpreted as how many bits in the hash value are affected by each given input bit. But while extending through calls to finalize() or duplicating the baseline hash value appears to improve this metric, the extended bits are fully determined by the baseline hash value bits, and therefore do not increase the uniqueness of the extended hash value as a whole.
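The difference between per-bit statistics and uniqueness can be measured directly. The sketch below (hypothetical throughout; C++20 for std::popcount; splitmix64's finalizer serves as the stand-in extender) flips one bit of a 32-bit baseline value and counts how many of the 32 extended bits change. Zero-extension flips none and mix-based extension flips about half, yet under both schemes the extended values collide exactly when the baseline values do:

    #include <bit>
    #include <cstdint>
    #include <cstdio>

    // splitmix64's finalizer, used as a stand-in mix-based extender.
    static std::uint64_t mix64( std::uint64_t x )
    {
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
        x ^= x >> 27; x *= 0x94D049BB133111EBull;
        x ^= x >> 31;
        return x;
    }

    // Extend a 32-bit value to 64 bits: v in the low half, derived bits on top.
    static std::uint64_t extend_zero( std::uint32_t v ) { return v; }
    static std::uint64_t extend_mix( std::uint32_t v ) { return v | mix64( v ) << 32; }

    int main()
    {
        int const N = 100000;
        long flips_zero = 0, flips_mix = 0; // flips observed in the upper half

        for( int i = 0; i < N; ++i )
        {
            std::uint32_t v = (std::uint32_t)mix64( i ); // stand-in baseline hash
            std::uint32_t w = v ^ ( 1u << ( i % 32 ) );  // flip one bit of it

            flips_zero += std::popcount( ( extend_zero( v ) ^ extend_zero( w ) ) >> 32 );
            flips_mix  += std::popcount( ( extend_mix( v )  ^ extend_mix( w )  ) >> 32 );
        }

        // Per-bit statistics differ radically (0 vs. ~16 of 32 upper bits
        // flipped), yet extend_zero( v ) == extend_zero( w ) iff v == w, and
        // the same holds for extend_mix: uniqueness is unchanged either way.
        std::printf( "average upper-half flips: zero-ext %.2f, mix-ext %.2f\n",
            (double)flips_zero / N, (double)flips_mix / N );
    }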