Re: [boost] [hash2] Formal Review Begins

12 Dec 2024

      On Sat, Dec 7, 2024 at 1:47 PM Matt Borland via Boost
<boost@lists.boost.org> wrote:
...
Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2
Docs: https://pdimov.github.io/hash2/
Since I've already barged in the topic earlier, I might as well leave a review.
...
Please provide feedback on the following general topics:
- What is your evaluation of the design?
One of the things that puzzles me about this library is its intended
purpose. "Types don't know #" was clearly about hash types for use in
hash tables, which explained why keying the hashes was mandatory. But
it is unclear to me why MD5, SHA-1, and other cryptographic hashes are
present here; they are way too inefficient to be of much use in hash
tables, so there is little reason to include them. If, on the other
hand, this is meant as a more general hashing library, I have strong
objections as detailed below.

I believe the HashAlgorithm concept is too overloaded. It _requires_
that hashes be both keyable and extendable-output, which are things
that many common hash functions are not capable of. MD5, for example,
is not capable of either keying or extendable-output, and as such will
require an ad hoc _mode of operation_ to achieve these goals, which is
fraught with danger. A user that wants to use this library with a hash
that is not included here is going to be pushed to include
functionality the hash does not provide and that they might not know
how to achieve. Despite being very useful functionality, this is a
source of vulnerabilities waiting to happen.

Additionally, one of the constructors for keying is std::uint64_t.
While this is mostly OK for the hash table DoS-prevention use case, it
is woefully small for cryptographic purposes. This constructor should
not exist in this case.

Furthermore, the multiple result() calls API to achieve extendable
output, besides being confusing and incompatible with just about every
implementation out there, is suboptimal from a performance standpoint.
If I want to generate a long output, calling result() repeatedly will
make vectorization much more annoying to implement (and requires an
unnecessary large internal buffer) than an API such as digest(output,
length). This is the same rationale for proposals such as P1068 for
more efficient output on the standard <random> generators.

I would suggest splitting HashAlgorithm concepts into
KeyedHashAlgorithm/PRF and
FixedOutputAlgorithm/VariableOutputAlgorithm, or something along these
lines. It should not be necessary to shoehorn this functionality into
primitives that do not support it. Note that this is not a clean
hierarchy; some keyed algorithms cannot be safely used as unkeyed
hashes, so KeyedHashAlgorithm is not a subset of HashAlgorithm and
vice-versa.
...
- What is your evaluation of the implementation?
The implementation of the cryptographic hash functions highlight the
issues with the above design. I will talk about MD5 here, but it
applies all the same to SHA-1, SHA-256, etc.

Firstly, it is strange that an implementation of MD5 is keyed, since
MD5 does not provide this functionality. But the keying here simply
prepends the key, padded to block length, to the message. This enables
the so-called length-extension attack: if I have MD5(k || m), I can
easily compute MD5(k || m || other stuff) without knowing the key,
which can be used to bypass authenticators (Flickr was a famous
example of this [1]). This is why HMAC is has been drilled into
developers whenever keyed hashing comes up.

Furthermore, the way extended output works breaks pseudorandomness.
Suppose I have a keyed MD5 instance and want to generate several
blocks of output. The expectation here is that _all_ of the output is
indistinguishable from a random string of the same length. But that is
not the case here. What we have instead is

    first_block  = MD5(k || m || padding)
    second_block = MD5(k || m || padding || more padding)
    ...

An attacker who observes this can easily distinguish this by taking
first_block, which consists of the internal MD5 state, hashing the
extra padding, and checking whether the output is equal to
second_block. If I were to use this scheme as a stream cipher to
encrypt data, knowing the first 16 bytes of plaintext (e.g., a
known/predictable header) would immediately enable the recovery of the
rest. This is broken as an XOF.

The way variable-length output in Merkle-Damgard hashes such as MD5
historically has been handled is to use MGF1 [2, 10.2.1] or something
along these lines. But once again this is a _mode of operation_ on top
of MD5, similarly to HMAC.

I have already mentioned this elsewhere, but the unordered hash
approach currently implemented cannot possibly be secure without
enormous parameters.

A minor question I have regards the constants used in
get_result_multiplier(). The documentation states that this is used to
produce a uniform output on the target range, but it's unclear from
the documentation or source code the rationale or criteria for the
method or constants used here. This seems to get into integer hash
territory, but for example the 4 -> 1 case consists of x ->
(x*0x7f7f7f7f) >> 24, for which easy differentials exist, e.g., (x,
x^0xc0800000) collide with probability ~1/4.

The implementations are otherwise clean and easy to read, which
simplified spotting the issues above.

[1] https://vnhacker.blogspot.com/2009/09/flickrs-api-signature-forgery.html
[2] https://www.ietf.org/rfc/rfc2437.txt
...
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness
of the library? Do you already use it in industry?
A framework to easily hash custom types without having to repeat the
same std::hash or similar boilerplate over and over is something I
would definitely use, and have sort of reinvented kludgy versions of
it multiple times before. But apart from easily enabling types to be
usable in hash tables I would not have much use for this library.
...
- Did you try to use the library? With which compiler(s)? Did
you have any problems?
Yes. Clang 19 on Linux. It worked as expected.
...
- How much effort did you put into your evaluation?
A glance? A quick reading? In-depth study?
A few hours. Mostly skimming the source code and documentation.
...
- Are you knowledgeable about the problem domain?
I have been involved in the design and analysis of hash functions.
...
Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).
I would rather leave this field empty, since I'm not too familiar with
the procedure here, but if pressed I would say reject in its current
state.

Re: [boost] [hash2] Formal Review Begins

Samuel Neves