
Darryl Green wrote:
> There has been a lot of discussion, in the hash2 review, about non-idempotent result() / the requirement that every hash2 hash algorithm be an extensible hash function. I have questions:
> Isn't any non-idempotent result function surprising?
I wouldn't say so. If you have a thing that exposes three operations, construct, update, result, your expectation should be that it supports the following sequence

```
construct update update ... update result
```

and any deviation puts you into "read the documentation" land. Adding a second `result` call to the end can easily have one of the following effects:

- return an unspecified value
- return the same value as the previous call (idempotence)
- fail (either by returning an error or by throwing)
- undefined behavior

There's no a priori reason to expect idempotence here, unless the specification guarantees it, and if you look at the opinions expressed so far in this review, you'll find people arguing for each of the above as their preference, presumably the one that wouldn't surprise them.

In addition, if you now add `update update result` to the end of that sequence, you have some further choices to make. I suspect that we'll also have a diversity of opinions here as to what the unsurprising effect of the above should be; my expectation would be that a person who considers idempotence the only unsurprising alternative would prefer the clean semantics of `result` always returning the hash of the partial message accumulated so far by the preceding sequence of `update` calls.

But of course that's not at all undisputed, partly because it imposes a cost. As currently specified, you can obtain these semantics by doing the following:

```
auto get_partial_result( Hash const& hash )
{
    return Hash( hash ).result();
}
```

which is why I consider them derived semantics, and not a fundamental operation that the algorithm needs to provide. They are achievable from the outside, with no loss of functionality, and with the costs paid by the user who needs them.
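For concreteness, here's a minimal, runnable version of that idiom. boost::hash2::fnv1a_32 stands in for any copyable hash algorithm; the point is only that finalizing a copy leaves the original free to keep accumulating.

```
#include <boost/hash2/fnv1a.hpp>
#include <cstdio>

// Finalize a copy of the state; the original keeps accumulating.
template<class Hash> auto get_partial_result( Hash const& hash )
{
    return Hash( hash ).result();
}

int main()
{
    boost::hash2::fnv1a_32 h;

    h.update( "abc", 3 );
    std::printf( "partial: %08X\n", (unsigned)get_partial_result( h ) );

    h.update( "def", 3 );
    std::printf( "full:    %08X\n", (unsigned)h.result() ); // hash of "abcdef"
}
```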
> As I said in my review, hash2 imposing the extensibility requirement on "raw" hash algorithms that are not XOFs (you can always do "something" to extend it, but doing "something" does not always make a good extensible hash function) is surprising. Ideally hash2 should allow an XOF to be used, but not by requiring every hash to pretend to be one. I did say some documentation would help, but I feel that relying on the docs to make it hard to "do the wrong thing", or even possible to do what should be a simple thing (use an existing hash algorithm/algorithm implementation) without that in itself being a silly mistake, is asking a lot. (NOT having result stir the pot in some way is a bug now, unless one considers extension by repetition to be "good enough" - which, absurd as it sounds, maybe it is - it's at least "honest".)
In short, the argument here is that we shouldn't require hash algorithm authors to provide result extension, because they might screw it up.

That's a legitimate argument, in principle; if result extension were something uniquely hard to provide, or something particularly error-prone (measured against the baseline of implementing the rest of the hash algorithm), I would probably have agreed.

But that's not the case. In fact, most of the time, result extension is free: if you just implement `result` in an entirely straightforward manner per the algorithm spec, you automatically get result extension. It's easier to implement it accidentally than to make a mistake.

And it's not true that the rest of the hash algorithm is easier to implement, or less error-prone. Seeded construction and `update` are at least as tricky, if not more. In addition, the exact same argument applies against seeded construction; people can easily make mistakes when implementing it. But I consider this an acceptable cost to pay, because the functionality is necessary for the intended use of the hash algorithms, by the library and by the users.

Why is result extension necessary? Well, consider the std::hash use case, where the wrapper object does this (ignoring seeding for the moment):

```
std::size_t operator()( T const& v ) const
{
    H h;
    boost::hash2::hash_append( h, {}, v );
    return boost::hash2::get_integral_result<std::size_t>( h.result() );
}
```

If H::result_type is uint32_t, but std::size_t is 64 bit, get_integral_result needs to extend the uint32_t value into 64 bits. It currently does the equivalent of

```
return h.result() * 0x7FFFFFFF'7FFFFFFF;
```

but I'm planning to change it to take advantage of result extension and do this instead:

```
return (uint64_t)h.result() << 32 | h.result();
```

There is no scenario in which the former produces better 64 bit hash values than the latter, unless, of course, the implementation of result extension is majorly screwed up.

For another example, consider hash_append_unordered_range, which currently does

```
w += hash2::get_integral_result<std::uint64_t>( h2.result() );
```

As we already established, a 64 bit accumulator w isn't going to cut it, so I'm planning to change w from

```
std::uint64_t w = 0;
```

to

```
std::uint64_t w[ 4 ] = {};
```

This would require the ability to obtain 256 bits of hash from the hash algorithm. Again, there is no way to do this from the outside, without support from the hash algorithm, while still achieving a good result distribution.

In the general case, when you need to do nontrivial things with the hash algorithm that aren't just a simple sequence of `update` calls, result extension is simply necessary.

I suppose I could, in principle, implement the above use cases without result extension, by using an MGF-like scheme. E.g. instead of

```
return (uint64_t)h.result() << 32 | h.result();
```

do this:

```
H h1( h );
h1.update( "\x01", 1 );
auto r1 = h1.result();

H h2( h );
h2.update( "\x02", 1 );
auto r2 = h2.result();

return (uint64_t)r1 << 32 | r2;
```

but this imposes completely unnecessary performance costs.
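To illustrate the "result extension is free" point above, here's a toy sketch (not a real hash algorithm; the constants and mixing are arbitrary, chosen only to make it self-contained). Because the straightforward way to write `result` is to mix the internal state in place and return it, a second call naturally continues the output stream.

```
#include <cstdint>
#include <cstddef>

class toy_hash
{
private:

    std::uint64_t state_ = 0x9E3779B97F4A7C15; // arbitrary initial value

public:

    using result_type = std::uint64_t;

    void update( void const* p, std::size_t n )
    {
        unsigned char const* q = static_cast<unsigned char const*>( p );

        for( std::size_t i = 0; i < n; ++i )
        {
            state_ = ( state_ ^ q[ i ] ) * 0x100000001B3; // FNV-1a style mix
        }
    }

    result_type result()
    {
        // Straightforward finalization: mix the state in place and
        // return it. A second call continues from the mutated state,
        // so result extension falls out automatically.
        state_ *= 0xFF51AFD7ED558CCD;
        state_ ^= state_ >> 33;

        return state_;
    }
};
```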
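And here's a rough sketch of the planned hash_append_unordered_range accumulation described above, under the stated assumption that repeated `result()` calls keep producing further output bits. The helper name add_element_hash is mine, not the library's, and siphash_64 is used only as a convenient concrete algorithm.

```
#include <boost/hash2/siphash.hpp>
#include <cstdint>
#include <cstddef>

// Draw 256 bits from a 64-bit hash algorithm via result extension:
// each result() call is assumed to yield the next 64 bits of output.
template<class H> void add_element_hash( std::uint64_t (&w)[ 4 ], H& h2 )
{
    for( std::size_t i = 0; i < 4; ++i )
    {
        w[ i ] += h2.result();
    }
}

int main()
{
    std::uint64_t w[ 4 ] = {};

    boost::hash2::siphash_64 h2;
    h2.update( "element", 7 );

    add_element_hash( w, h2 ); // accumulate one element's 256-bit hash
}
```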
> What is the specification for a "hash2 XOF"? Does the author really mean to take on the role of specifying the properties required? If FIPS says that a function is an XOF, it probably is. If FIPS doesn't say that, but Peter says it is one (it must be, to be a "hash2 hash algorithm")... Is that ok? Who wins?
If FIPS doesn't say that a hash function is an XOF, it isn't. Result extension doesn't automatically turn any hash function into an XOF, similarly to how byte sequence seeding doesn't turn any hash function into a MAC. You can use the function as an XOF or a MAC, but this doesn't make it meet the stricter cryptographic requirements for an XOF or a MAC.

It's probably my own fault for not making that point in the documentation. I did make an explicit note in the constructor description that even though it makes all algorithms usable as MACs, an established MAC algorithm such as HMAC should be used, unless the algorithm has been specifically designed to be used as a MAC (as SipHash was.) But I didn't make the corresponding note when describing `result`, and I should have.

To sum up with a general point, the hash algorithm requirements as specified reflect (my idea of) what a modern hash algorithm (one designed today) would provide as an API. If we look at SipHash, which is relatively recent, we'll see that it's a very good match for these requirements; the only slight issue is that it doesn't support seed sequences of arbitrary length (which is a regression in usability compared to HMAC.) In particular, it supports byte sequence seeding, doesn't expose its entire internal state as the digest, and trivially supports result extension.

There is actually a SipHash variant that produces a 128 bit digest instead of a 64 bit one, and it's almost exactly equivalent to a result-extended hash2::siphash_64. The only difference is that some xor constants have been changed so that the low 64 bits of the 128 bit variant do not match the value produced by the 64 bit variant.

Thanks to everyone who read this far. We should gamify the list and have some sort of achievement unlocked by that.