
On 12/7/24 03:16, Peter Dimov via Boost wrote:
> Andrey Semashev wrote:
>> 2. Somewhat related, on the use case of extending the size of the hash value by repeatedly calling result(). Is this something that should be encouraged? I'm not a specialist, but my understanding is that extending the hash value this way does not improve its quality, as you get a series of bits in the value that are predetermined based on the lower bits. Isn't that effectively equivalent to just e.g. zero-extending the hash value?
> No, it isn't. Hashes often have more internal state than they output as the final result, so extension can actually produce non-redundant bits. This is true, for example, for xxhash and siphash, and there's an example in the docs of extending xxhash.
> Even if the effective number of bits doesn't change, it's still not the same as zero-extending, because a zero-extended hash is of very poor quality. (The constant zero bits, being constant and zero, are completely unaffected by the input.)
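To make the quoted point concrete, here is a minimal sketch of a hasher whose internal state (128 bits) is wider than its result (64 bits). The class is a made-up toy, not Boost.Hash2's xxhash; the update()/finalize() names merely follow the interface discussed in this thread. Because finalize() mixes and then advances the state, a second call returns bits that are not derivable from the first result alone:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // Toy hasher with 128 bits of internal state but a 64-bit result.
    // Hypothetical; not Boost.Hash2's xxhash, only the same shape of API.
    class toy_hash
    {
    private:

        std::uint64_t s0_ = 0x9E3779B97F4A7C15ull; // arbitrary nonzero seeds
        std::uint64_t s1_ = 0xBF58476D1CE4E5B9ull;

        static std::uint64_t mix( std::uint64_t x )
        {
            // splitmix64's finalizer, a well-known 64-bit mixer
            x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
            x ^= x >> 27; x *= 0x94D049BB133111EBull;
            x ^= x >> 31;
            return x;
        }

    public:

        void update( void const * p, std::size_t n )
        {
            unsigned char const * q = static_cast< unsigned char const* >( p );

            for( std::size_t i = 0; i < n; ++i )
            {
                s0_ = mix( s0_ ^ q[ i ] );
                s1_ = mix( s1_ + q[ i ] );
            }
        }

        // May be called repeatedly; each call mixes and advances the state,
        // so the second call exposes state bits the first one did not.
        std::uint64_t finalize()
        {
            std::uint64_t r = mix( s0_ + s1_ );

            s0_ = mix( s0_ );
            s1_ = mix( s1_ + 1 );

            return r;
        }
    };

    int main()
    {
        toy_hash h;
        h.update( "hello", 5 );

        std::uint64_t lo = h.finalize(); // baseline 64-bit value
        std::uint64_t hi = h.finalize(); // extension draws on the remaining state

        std::printf( "%016llx%016llx\n",
            (unsigned long long)hi, (unsigned long long)lo );
    }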
Thanks. So for hash functions that maintain internal state larger than the hash value, subsequent calls to finalize() (formerly, result()) expose that additional state, which makes the extended hash value of higher quality than the baseline one.

However, I still don't see the benefit of extending through subsequent calls to finalize() when the internal state is not larger than the hash value. Any input that produces a particular hash value v will also produce the same extended value xv, and that holds whether you extend through additional calls to finalize(), by duplicating the baseline hash value, or by zero-extension. Of course, the contribution to uniqueness of the individual bits in xv differs in each case, but as far as the uniqueness of xv as a whole is concerned, there seems to be no difference. And if you need to extend the hash value in the first place, you are probably interested in the whole xv rather than in a portion of it that may not be as effective as the whole.
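The collision argument is easy to demonstrate. In the following sketch (again a made-up toy, with exactly 32 bits of state feeding a 32-bit result), two inputs that collide in the baseline value necessarily collide in the extended 64-bit value too, because every finalize() call is a function of the same 32-bit state:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    // Deliberately weak toy hasher: exactly 32 bits of state, 32-bit result.
    class toy32
    {
    private:

        std::uint32_t s_ = 0;

        static std::uint32_t mix( std::uint32_t x )
        {
            x ^= x >> 16; x *= 0x7FEB352Du;
            x ^= x >> 15; x *= 0x846CA68Bu;
            x ^= x >> 16;
            return x;
        }

    public:

        void update( void const * p, std::size_t n )
        {
            unsigned char const * q = static_cast< unsigned char const* >( p );

            // order-insensitive accumulation, so collisions are easy to exhibit
            for( std::size_t i = 0; i < n; ++i ) s_ += q[ i ];
        }

        std::uint32_t finalize()
        {
            std::uint32_t r = mix( s_ );
            s_ = mix( s_ + 1 ); // advance the state for extension
            return r;
        }
    };

    static std::uint64_t extended( char const * s )
    {
        toy32 h;
        h.update( s, std::strlen( s ) );

        std::uint64_t lo = h.finalize();
        std::uint64_t hi = h.finalize(); // extra bits, but still a function
                                         // of the same 32-bit state as lo
        return hi << 32 | lo;
    }

    int main()
    {
        // "ab" and "ba" collide in the 32-bit value, so they necessarily
        // collide in the extended 64-bit value as well; both lines match.
        std::printf( "%016llx\n%016llx\n",
            (unsigned long long)extended( "ab" ),
            (unsigned long long)extended( "ba" ) );
    }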
> Ideally, each bit in the hash value should have about 50% probability of being 1. This is called "avalanche property".
My understanding is that the main property of a hash function that determines its quality is how uniquely each hash value identifies a given input. There are other requirements, such as the infeasibility of reconstructing the input from the hash value, but I'd say uniqueness is the number one requirement. Sometimes this is interpreted as how many bits in the hash value are affected by each given input bit. But while extending through calls to finalize() or duplicating the baseline hash value appears to improve this metric, the extended bits are fully determined by the baseline hash value bits, and therefore do not increase the uniqueness of the extended hash value as a whole.
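The difference between per-bit statistics and uniqueness can be measured directly. The sketch below (hypothetical throughout; C++20 for std::popcount; splitmix64's finalizer serves as the stand-in extender) flips one bit of a 32-bit baseline value and counts how many of the 32 extended bits change. Zero-extension flips none and mix-based extension flips about half, yet under both schemes the extended values collide exactly when the baseline values do:

    #include <bit>
    #include <cstdint>
    #include <cstdio>

    // splitmix64's finalizer, used as a stand-in mix-based extender.
    static std::uint64_t mix64( std::uint64_t x )
    {
        x ^= x >> 30; x *= 0xBF58476D1CE4E5B9ull;
        x ^= x >> 27; x *= 0x94D049BB133111EBull;
        x ^= x >> 31;
        return x;
    }

    // Extend a 32-bit value to 64 bits: v in the low half, derived bits on top.
    static std::uint64_t extend_zero( std::uint32_t v ) { return v; }
    static std::uint64_t extend_mix( std::uint32_t v ) { return v | mix64( v ) << 32; }

    int main()
    {
        int const N = 100000;
        long flips_zero = 0, flips_mix = 0; // flips observed in the upper half

        for( int i = 0; i < N; ++i )
        {
            std::uint32_t v = (std::uint32_t)mix64( i ); // stand-in baseline hash
            std::uint32_t w = v ^ ( 1u << ( i % 32 ) );  // flip one bit of it

            flips_zero += std::popcount( ( extend_zero( v ) ^ extend_zero( w ) ) >> 32 );
            flips_mix  += std::popcount( ( extend_mix( v )  ^ extend_mix( w )  ) >> 32 );
        }

        // Per-bit statistics differ radically (0 vs. ~16 of 32 upper bits
        // flipped), yet extend_zero( v ) == extend_zero( w ) iff v == w, and
        // the same holds for extend_mix: uniqueness is unchanged either way.
        std::printf( "average upper-half flips: zero-ext %.2f, mix-ext %.2f\n",
            (double)flips_zero / N, (double)flips_mix / N );
    }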