
On Sat, Dec 7, 2024 at 1:47 PM Matt Borland via Boost
Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
Since I've already barged in the topic earlier, I might as well leave a review.
Please provide feedback on the following general topics:
- What is your evaluation of the design?
One of the things that puzzles me about this library is its intended purpose. "Types don't know #" was clearly about hash types for use in hash tables, which explained why keying the hashes was mandatory. But it is unclear to me why MD5, SHA-1, and other cryptographic hashes are present here; they are way too inefficient to be of much use in hash tables, so there is little reason to include them. If, on the other hand, this is meant as a more general hashing library, I have strong objections as detailed below. I believe the HashAlgorithm concept is too overloaded. It _requires_ that hashes be both keyable and extendable-output, which are things that many common hash functions are not capable of. MD5, for example, is not capable of either keying or extendable-output, and as such will require an ad hoc _mode of operation_ to achieve these goals, which is fraught with danger. A user that wants to use this library with a hash that is not included here is going to be pushed to include functionality the hash does not provide and that they might not know how to achieve. Despite being very useful functionality, this is a source of vulnerabilities waiting to happen. Additionally, one of the constructors for keying is std::uint64_t. While this is mostly OK for the hash table DoS-prevention use case, it is woefully small for cryptographic purposes. This constructor should not exist in this case. Furthermore, the multiple result() calls API to achieve extendable output, besides being confusing and incompatible with just about every implementation out there, is suboptimal from a performance standpoint. If I want to generate a long output, calling result() repeatedly will make vectorization much more annoying to implement (and requires an unnecessary large internal buffer) than an API such as digest(output, length). This is the same rationale for proposals such as P1068 for more efficient output on the standard <random> generators. I would suggest splitting HashAlgorithm concepts into KeyedHashAlgorithm/PRF and FixedOutputAlgorithm/VariableOutputAlgorithm, or something along these lines. It should not be necessary to shoehorn this functionality into primitives that do not support it. Note that this is not a clean hierarchy; some keyed algorithms cannot be safely used as unkeyed hashes, so KeyedHashAlgorithm is not a subset of HashAlgorithm and vice-versa.
- What is your evaluation of the implementation?
The implementation of the cryptographic hash functions highlight the issues with the above design. I will talk about MD5 here, but it applies all the same to SHA-1, SHA-256, etc. Firstly, it is strange that an implementation of MD5 is keyed, since MD5 does not provide this functionality. But the keying here simply prepends the key, padded to block length, to the message. This enables the so-called length-extension attack: if I have MD5(k || m), I can easily compute MD5(k || m || other stuff) without knowing the key, which can be used to bypass authenticators (Flickr was a famous example of this [1]). This is why HMAC is has been drilled into developers whenever keyed hashing comes up. Furthermore, the way extended output works breaks pseudorandomness. Suppose I have a keyed MD5 instance and want to generate several blocks of output. The expectation here is that _all_ of the output is indistinguishable from a random string of the same length. But that is not the case here. What we have instead is first_block = MD5(k || m || padding) second_block = MD5(k || m || padding || more padding) ... An attacker who observes this can easily distinguish this by taking first_block, which consists of the internal MD5 state, hashing the extra padding, and checking whether the output is equal to second_block. If I were to use this scheme as a stream cipher to encrypt data, knowing the first 16 bytes of plaintext (e.g., a known/predictable header) would immediately enable the recovery of the rest. This is broken as an XOF. The way variable-length output in Merkle-Damgard hashes such as MD5 historically has been handled is to use MGF1 [2, 10.2.1] or something along these lines. But once again this is a _mode of operation_ on top of MD5, similarly to HMAC. I have already mentioned this elsewhere, but the unordered hash approach currently implemented cannot possibly be secure without enormous parameters. A minor question I have regards the constants used in get_result_multiplier(). The documentation states that this is used to produce a uniform output on the target range, but it's unclear from the documentation or source code the rationale or criteria for the method or constants used here. This seems to get into integer hash territory, but for example the 4 -> 1 case consists of x -> (x*0x7f7f7f7f) >> 24, for which easy differentials exist, e.g., (x, x^0xc0800000) collide with probability ~1/4. The implementations are otherwise clean and easy to read, which simplified spotting the issues above. [1] https://vnhacker.blogspot.com/2009/09/flickrs-api-signature-forgery.html [2] https://www.ietf.org/rfc/rfc2437.txt
- What is your evaluation of the documentation? - What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
A framework to easily hash custom types without having to repeat the same std::hash or similar boilerplate over and over is something I would definitely use, and have sort of reinvented kludgy versions of it multiple times before. But apart from easily enabling types to be usable in hash tables I would not have much use for this library.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
Yes. Clang 19 on Linux. It worked as expected.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
A few hours. Mostly skimming the source code and documentation.
- Are you knowledgeable about the problem domain?
I have been involved in the design and analysis of hash functions.
Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).
I would rather leave this field empty, since I'm not too familiar with the procedure here, but if pressed I would say reject in its current state.