
Vinnie Falco asked me the following on Slack:

Calling result() twice (or more times) provides result extension: the ability to extract a variable number of bits from a hash algorithm, instead of a fixed-size value (e.g. 64 bits). This is in fact stated in the docs here

https://pdimov.github.io/hash2/doc/html/hash2.html#hashing_bytes_result

and there is an example of doing that here

https://pdimov.github.io/hash2/doc/html/hash2.html#example_result_extension

All hash algorithms are required to support result extension, because (in my opinion) this is extremely useful functionality that is easy - even trivial - to provide, but is often withheld either by accident or, in some cases, even deliberately.

Hash algorithms typically have a "finalization" phase that pads the message, mixes in the length, scrambles the internal state in a more thorough manner than in `update`, and then derives a hash value from that state. (The hash value is often shorter than the total amount of state.) If this "finalization" phase is performed more than once, one naturally gets the mandated `result()` behavior.

Falco continues:

That's of course correct, but the same applies to the quality of calling `result()` only once; it's naturally dependent on the implementation of the hash algorithm. What's important here is that it's not possible to provide an extended result of better quality from the outside; the hash algorithm is in the best place to provide it, because it has access to more bits of internal state than it lets out.

This requirement effectively mandates that all _hash algorithms_ be _extendable-output hash functions_: https://en.wikipedia.org/wiki/Extendable-output_function

Note that this is not the only innovation that the proposed hash algorithm concept involves. All hash algorithms are required to support seeding from uint64_t and from an arbitrary sequence of bytes, which makes them effectively _keyed hash functions_ (or _message authentication codes_).

Also note that the requirement that one can interleave calls to `update` and `result` arbitrarily makes it possible to implement byte sequence seeding (for algorithms that don't already support it) in the following manner:

    Hash::Hash( unsigned char const* p, size_t n ): Hash()
    {
        if( n != 0 )
        {
            update( p, n );
            result();
        }
    }

Subsequent `update` calls now start from an initial internal state that has incorporated the contents of [p, p+n), and that has been "finalized" (scrambled thoroughly) such that the result is not equivalent to just prepending the seed to the message (as would have happened if the result() call had been omitted).
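To make result extension concrete, here is a minimal usage sketch. `Algo` is a placeholder for any conforming hash algorithm whose `result_type` is `std::uint64_t`, and the helper name `hash128` is illustrative rather than part of the library:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Extract 128 bits from a 64-bit hash algorithm by calling result() twice.
    template<class Algo>
    std::array<std::uint64_t, 2> hash128( unsigned char const* p, std::size_t n )
    {
        Algo h;
        h.update( p, n );

        std::uint64_t r1 = h.result(); // first 64 bits of output
        std::uint64_t r2 = h.result(); // next 64 bits; the state is re-finalized

        return { r1, r2 };
    }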

Peter Dimov wrote:
This requirement effectively mandates that all hash algorithms be extendable-output hash functions.
For those unfamiliar with Extendable-Output Functions (XOFs), FIPS 202 [1] and the reference implementation [2] provide good detail, since the wiki article seems a bit short.

Matt

[1] https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf
[2] https://github.com/XKCP/XKCP/blob/master/usage-example.md

On Mon, Dec 9, 2024 at 8:43 AM Peter Dimov via Boost wrote:
It seems to me that Hash2 is primarily aimed at users who wish to opt in to a superior framework for use in unordered-style containers. For this use case, hash algorithms only need the digest finalized once. All of the work in the paper is focused on this use case. What you have done, in your own words, is innovative. From the documentation:

This obviously is out of scope for unordered containers. I think it would be better to change HashAlgorithm to only allow one call to finalize() (and rename it from "result"), and no calls to update() after finalize(), using the language I have already posted, or similar language.

And there is nothing inherently wrong with innovating, therefore I would propose that you simply add a new concept:

    ExtendableOutputHashAlgorithm

and in this new concept impose your additional requirements. Most users won't care about extendable output and can just stick with the elements of the library needed for working with unordered containers. Those users will find it easy to adapt external implementations of hash algorithms, as they do not also need the expertise to ensure that these external implementations meet the more stringent requirements of ExtendableOutputHashAlgorithm.

There is an added benefit of relaxing the requirements for HashAlgorithm: users can call into existing algorithms without modifying them, often a requirement for security certifications.

Thanks
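For concreteness, here is a rough sketch of the two-concept split being proposed, written with C++20 concepts purely for illustration; the names and the exact requirement set are assumptions on my part, not taken from the library or from the post above:

    #include <concepts>
    #include <cstddef>

    // Base requirements: update() plus a single finalize() call.
    template<class H>
    concept HashAlgorithm =
        std::default_initializable<H> &&
        std::copyable<H> &&
        requires( H h, void const* p, std::size_t n )
        {
            typename H::result_type;
            h.update( p, n );
            { h.finalize() } -> std::same_as<typename H::result_type>;
        };

    // Refinement: result() may be called repeatedly, each call extending the
    // output, and update()/result() may be interleaved. (The "repeatedly"
    // part is a semantic requirement that a concept cannot express; only the
    // extra syntax is checked here.)
    template<class H>
    concept ExtendableOutputHashAlgorithm =
        HashAlgorithm<H> &&
        requires( H h )
        {
            { h.result() } -> std::same_as<typename H::result_type>;
        };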

On Mon, Dec 9, 2024 at 9:28 AM Peter Dimov wrote:
"seed" "message" and "seedm" "essage" trivially collide, which may not be desirable.
Sorry, I don't follow. You are comparing the results of two different hash functions:

    H1 h1( "seed", 4 );
    h1.update( "message", 7 );
    auto r1 = h1.result();

    H2 h2( "seedm", 5 );
    h2.update( "essage", 6 );
    auto r2 = h2.result();

    assert( r1 == r2 );

h1 and h2 are different hash functions from the same family [1], and you have gone out of your way to form a preimage using knowledge of the seed. Hardly fair, and it also does not in any way reflect the security concerns of real use cases.

Your own implementation of the seeded FNV-1a [2] constructor simply calls update():

https://github.com/pdimov/hash2/blob/7a25f8518692b657e9272884519519fbaca2ec5...

I think it would be preferable if Hash2 made this the default implementation, and allowed more sophisticated algorithms (and authors) to opt in to a better implementation. Otherwise, we are forcing ordinary users who just want to adapt an external library to meet the HashAlgorithm requirements to become experts.

[1] https://en.wikipedia.org/wiki/Universal_hashing
[2] https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function

On Mon, Dec 9, 2024 at 9:41 AM Vinnie Falco wrote:
Peter, would you please comment on the following sketch:
/** HashAlgorithm exemplar
*/
struct HashAlgorithm
{
    using result_type = /* integral or array-like */;

    HashAlgorithm();
    HashAlgorithm( HashAlgorithm const& r );
    HashAlgorithm& operator=( HashAlgorithm const& r );

    /** Append data to the message

        The behavior of subsequent calls to update() after
        finalize() has been called at least once is undefined
        unless specified by the HashAlgorithm.
    */
    void update( void const* data, std::size_t n );

    /** Return the digest

        This function shall return the final hash value of the input
        message, where the input message is defined by the ordered
        sequence of bytes provided in all prior calls to update().

        The behavior of subsequent calls to finalize() is undefined
        unless specified by the HashAlgorithm.
    */
    result_type finalize();

    static constexpr int block_size = /*...*/;                  // optional

    explicit HashAlgorithm( std::uint64_t seed );               // optional
    HashAlgorithm( unsigned char const* seed, std::size_t n );  // optional
};

/** Return a seeded HashAlgorithm
*/
template< class HashAlgorithm >
HashAlgorithm make_seeded( std::uint64_t seed )
{
    if constexpr( std::is_constructible<
        HashAlgorithm, std::uint64_t >::value )
        return HashAlgorithm( seed );
    else
    {
        HashAlgorithm h;
        hash_append( h, seed );
        return h;
    }
}

/** Return a seeded HashAlgorithm
*/
template< class HashAlgorithm >
HashAlgorithm make_seeded(
    unsigned char const* seed, std::size_t n )
{
    if constexpr( std::is_constructible
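The post is cut off at this point in the archive. By analogy with the std::uint64_t overload, a plausible completion of the second make_seeded would look like the following; the update()-based fallback is an assumption on my part, not text from the original message:

    template< class HashAlgorithm >
    HashAlgorithm make_seeded(
        unsigned char const* seed, std::size_t n )
    {
        if constexpr( std::is_constructible<
            HashAlgorithm, unsigned char const*, std::size_t >::value )
            return HashAlgorithm( seed, n );
        else
        {
            // Assumed fallback: feed the seed bytes through update()
            HashAlgorithm h;
            h.update( seed, n );
            return h;
        }
    }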

On 12/9/24 19:43, Peter Dimov via Boost wrote:
Only some hash functions are specified as extendable-output functions (XOFs). I mean "specified" as in "the hash algorithm's specification". The link you posted says XOF is an extension and even lists a few examples of functions that support it.

The fact that you can implement some hash functions such that the implementation allows multiple finalization calls, or even interleaves updates and finalization steps, does not make that hash function an XOF. That's just a property of your particular implementation. A useful property, but still beyond the specification. A different implementation may rightfully not support this property and still be compliant with the spec.

In my opinion, HashAlgorithm must support the latter kind of implementation, one that is compliant with the hash function specification and does not support digest extension. If you want to expose the XOF capability, then please create a separate concept, say HashAlgorithmXOF, and add a way to detect whether a given algorithm supports result extension.

I'll add that XOF is supported by some implementations, but they are also incompatible with the current HashAlgorithm concept. For example, OpenSSL provides EVP_DigestFinalXOF, but it must also be called only once. The difference from EVP_DigestFinal_ex is that EVP_DigestFinalXOF accepts the size of the buffer to fill with the digest. If you are going to define HashAlgorithmXOF, please take existing implementations of this feature into account.
The exact behavior of the hash algorithm's constructor is an implementation detail. It doesn't need to be specified in terms of the public update and result methods. And certainly, the fact that one hash algorithm supports this sort of operation ordering doesn't mean that all of them should.
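For reference, a minimal sketch of how the OpenSSL interface mentioned above is used; SHAKE128 is chosen arbitrarily and error handling is omitted:

    #include <openssl/evp.h>

    #include <cstddef>

    // Extract outlen bytes of SHAKE128 output. The caller picks the length,
    // but EVP_DigestFinalXOF may still be called only once per context.
    void shake128_xof( unsigned char const* msg, std::size_t msglen,
                       unsigned char* out, std::size_t outlen )
    {
        EVP_MD_CTX* ctx = EVP_MD_CTX_new();

        EVP_DigestInit_ex( ctx, EVP_shake128(), nullptr );
        EVP_DigestUpdate( ctx, msg, msglen );
        EVP_DigestFinalXOF( ctx, out, outlen );

        EVP_MD_CTX_free( ctx );
    }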

On 12/9/24 19:43, Peter Dimov via Boost wrote:
Also, my understanding of HMAC [1] is that the key is prepended to the subsequent data and then the whole of the data is hashed. This contradicts your code calling result() in the middle. Am I missing something?

[1] https://datatracker.ietf.org/doc/html/rfc2104#section-2

Andrey Semashev wrote:
I don't understand what contradicts what. HMAC is a separate algorithm. https://pdimov.github.io/hash2/doc/html/hash2.html#ref_hmac
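For readers following the exchange, RFC 2104 defines HMAC as a nested construction rather than a simple key-prepend: HMAC(K, m) = H( (K' xor opad) || H( (K' xor ipad) || m ) ), where K' is the key zero-padded to the block size. The sketch below is written against the HashAlgorithm exemplar from earlier in the thread, purely for illustration; it assumes result_type is array-like (with data() and size(), no longer than the block size) and that block_size is provided. It is not the library's actual hmac implementation.

    #include <cstddef>
    #include <cstring>

    template<class H>
    typename H::result_type hmac( unsigned char const* key, std::size_t keylen,
                                  unsigned char const* msg, std::size_t msglen )
    {
        constexpr std::size_t B = H::block_size;

        // K': the key zero-padded to the block size (hashed first if longer)
        unsigned char k[ B ] = {};

        if( keylen > B )
        {
            H hk;
            hk.update( key, keylen );
            auto d = hk.finalize();
            std::memcpy( k, d.data(), d.size() );
        }
        else
        {
            std::memcpy( k, key, keylen );
        }

        unsigned char ipad[ B ], opad[ B ];

        for( std::size_t i = 0; i < B; ++i )
        {
            ipad[ i ] = static_cast<unsigned char>( k[ i ] ^ 0x36 );
            opad[ i ] = static_cast<unsigned char>( k[ i ] ^ 0x5C );
        }

        // Inner hash: H( (K' xor ipad) || m )
        H hi;
        hi.update( ipad, B );
        hi.update( msg, msglen );
        auto inner = hi.finalize();

        // Outer hash: H( (K' xor opad) || inner )
        H ho;
        ho.update( opad, B );
        ho.update( inner.data(), inner.size() );
        return ho.finalize();
    }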

On Mon, Dec 9, 2024 at 8:43 AM Peter Dimov via Boost wrote:
...there is an example of doing that here
https://pdimov.github.io/hash2/doc/html/hash2.html#example_result_extension
I will argue that, for all the cases where result extension is a valid use case, the caller is already committed to a specific hash algorithm. And in that case, the hash algorithm can specify additional behavior for update() and result() with no reduction in the usability of Hash2.

Thanks

Vinnie Falco wrote:
In other words, no generic code should be able to rely on result extension. I don't think this is an improvement.

It's much more ergonomic for end users to be able to rely on all hash algorithm features, so that code can be generic without fear of invoking undefined behavior. That's also why all algorithms are required to support specific forms of seed construction, instead of every algorithm providing its own assorted constructors (even though this better reflects what happens today).

The current hash algorithm requirements reflect my view on what minimum functionality a hash algorithm should provide in 2025. It's true that this doesn't correspond to what has traditionally been provided.
Participants (4)
- Andrey Semashev
- Matt Borland
- Peter Dimov
- Vinnie Falco