
Darryl Green wrote:
> There has been a lot of discussion, in the hash2 review, about non-idempotent result() / the requirement that every hash2 hash algorithm be an extensible hash function. I have questions:
> Isn't any non-idempotent result function surprising?
I wouldn't say so. If you have a thing that exposes three operations, construct, update, result, your expectation should be that it supports the following sequence

```
construct update update ... update result
```

and any deviation puts you into "read the documentation" land. Adding a second `result` call to the end can easily have one of the following effects:

- return an unspecified value
- return the same value as the previous call (idempotence)
- fail (either by returning an error or by throwing)
- undefined behavior

There's no a priori reason to expect idempotence here, unless the specification guarantees it, and if you look at the opinions expressed so far in this review, you'll find people arguing for each of the above as their preference, presumably the one that wouldn't surprise them.

In addition, if you now add `update update result` to the end of that sequence, you have some further choices to make. I suspect that we'll also have a diversity of opinions here as to what the unsurprising effect of the above should be; my expectation would be that a person who considers idempotence the only unsurprising alternative would prefer the clean semantics of `result` always returning the hash of the partial message accumulated so far by the preceding sequence of `update` calls.

But of course that's not at all undisputed, partly because it imposes a cost. As currently specified, you can obtain these semantics by doing the following:

```
auto get_partial_result( Hash const& hash )
{
    return Hash( hash ).result();
}
```

which is why I consider them derived semantics, and not a fundamental operation that the algorithm needs to provide. They are achievable from the outside, with no loss of functionality, and with the costs paid by the user who needs them.
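For concreteness, here's a minimal, runnable version of that idiom. boost::hash2::fnv1a_32 stands in for any copyable hash algorithm; the point is only that finalizing a copy leaves the original free to keep accumulating.

```
#include <boost/hash2/fnv1a.hpp>
#include <cstdio>

// Finalize a copy of the state; the original keeps accumulating.
template<class Hash> auto get_partial_result( Hash const& hash )
{
    return Hash( hash ).result();
}

int main()
{
    boost::hash2::fnv1a_32 h;

    h.update( "abc", 3 );
    std::printf( "partial: %08X\n", (unsigned)get_partial_result( h ) );

    h.update( "def", 3 );
    std::printf( "full:    %08X\n", (unsigned)h.result() ); // hash of "abcdef"
}
```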
> As I said in my review, hash2 imposing the extensibility requirement on "raw" hash algorithms that are not XOFs (you can always do "something" to extend it, but doing "something" does not always make a good extensible hash function) is surprising. Ideally hash2 should allow an XOF to be used, but not by requiring every hash to pretend to be one. I did say some documentation would help, but I feel that relying on the docs to make it hard to "do the wrong thing", or even possible to do what should be a simple thing (use an existing hash algorithm/algorithm implementation) without that in itself being a silly mistake, is asking a lot. (NOT having result stir the pot in some way is a bug now, unless one considers extension by repetition to be "good enough" - which, absurd as it sounds, maybe it is - it's at least "honest".)
In short, the argument here is that we shouldn't require hash algorithm authors to provide result extension, because they might screw it up.

That's a legitimate argument, in principle; if result extension were something uniquely hard to provide, or something particularly error-prone (measured against the baseline of implementing the rest of the hash algorithm), I would probably have agreed.

But that's not the case. In fact, most of the time, result extension is free: if you just implement `result` in an entirely straightforward manner per the algorithm spec, you automatically get result extension. It's easier to implement it accidentally than to make a mistake.

And it's not true that the rest of the hash algorithm is easier to implement, or less error-prone. Seeded construction and `update` are at least as tricky, if not more. In addition, the exact same argument applies against seeded construction; people can easily make mistakes when implementing it. But I consider this an acceptable cost to pay, because the functionality is necessary for the intended use of the hash algorithms, by the library and by the users.

Why is result extension necessary? Well, consider the std::hash use case, where the wrapper object does this (ignoring seeding for the moment):

```
std::size_t operator()( T const& v ) const
{
    H h;
    boost::hash2::hash_append( h, {}, v );
    return boost::hash2::get_integral_result<std::size_t>( h.result() );
}
```

If H::result_type is uint32_t, but std::size_t is 64 bit, get_integral_result needs to extend the uint32_t value into 64 bits. It currently does the equivalent of

```
return h.result() * 0x7FFFFFFF'7FFFFFFF;
```

but I'm planning to change it to take advantage of result extension and do this instead:

```
return (uint64_t)h.result() << 32 | h.result();
```

There is no scenario in which the former produces better 64 bit hash values than the latter, unless, of course, the implementation of result extension is majorly screwed up.

For another example, consider hash_append_unordered_range, which currently does

```
w += hash2::get_integral_result<std::uint64_t>( h2.result() );
```

As we already established, a 64 bit accumulator w isn't going to cut it, so I'm planning to change w from

```
std::uint64_t w = 0;
```

to

```
std::uint64_t w[ 4 ] = {};
```

This would require the ability to obtain 256 bits of hash from the hash algorithm. Again, there is no way to do this from the outside, without support from the hash algorithm, while still achieving a good result distribution.

In the general case, when you need to do nontrivial things with the hash algorithm that aren't just a simple sequence of `update` calls, result extension is simply necessary.

I suppose I could, in principle, implement the above use cases without result extension, by using an MGF-like scheme. E.g. instead of

```
return (uint64_t)h.result() << 32 | h.result();
```

do this:

```
H h1( h );
h1.update( "\x01", 1 );
auto r1 = h1.result();

H h2( h );
h2.update( "\x02", 1 );
auto r2 = h2.result();

return (uint64_t)r1 << 32 | r2;
```

but this imposes completely unnecessary performance costs.
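To illustrate the "result extension is free" point above, here's a toy sketch (not a real hash algorithm; the constants and mixing are arbitrary, chosen only to make it self-contained). Because the straightforward way to write `result` is to mix the internal state in place and return it, a second call naturally continues the output stream.

```
#include <cstdint>
#include <cstddef>

class toy_hash
{
private:

    std::uint64_t state_ = 0x9E3779B97F4A7C15; // arbitrary initial value

public:

    using result_type = std::uint64_t;

    void update( void const* p, std::size_t n )
    {
        unsigned char const* q = static_cast<unsigned char const*>( p );

        for( std::size_t i = 0; i < n; ++i )
        {
            state_ = ( state_ ^ q[ i ] ) * 0x100000001B3; // FNV-1a style mix
        }
    }

    result_type result()
    {
        // Straightforward finalization: mix the state in place and
        // return it. A second call continues from the mutated state,
        // so result extension falls out automatically.
        state_ *= 0xFF51AFD7ED558CCD;
        state_ ^= state_ >> 33;

        return state_;
    }
};
```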
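And here's a rough sketch of the planned hash_append_unordered_range accumulation described above, under the stated assumption that repeated `result()` calls keep producing further output bits. The helper name add_element_hash is mine, not the library's, and siphash_64 is used only as a convenient concrete algorithm.

```
#include <boost/hash2/siphash.hpp>
#include <cstdint>
#include <cstddef>

// Draw 256 bits from a 64-bit hash algorithm via result extension:
// each result() call is assumed to yield the next 64 bits of output.
template<class H> void add_element_hash( std::uint64_t (&w)[ 4 ], H& h2 )
{
    for( std::size_t i = 0; i < 4; ++i )
    {
        w[ i ] += h2.result();
    }
}

int main()
{
    std::uint64_t w[ 4 ] = {};

    boost::hash2::siphash_64 h2;
    h2.update( "element", 7 );

    add_element_hash( w, h2 ); // accumulate one element's 256-bit hash
}
```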
> What is the specification for a "hash2 XOF"? Does the author really mean to take on the role of specifying the properties required? If FIPS says that a function is an XOF, it probably is. If FIPS doesn't say that, but Peter says it is one (it must be, to be a "hash2 hash algorithm")... Is that ok? Who wins?
If FIPS doesn't say that a hash function is an XOF, it isn't. Result extension doesn't automatically turn any hash function into an XOF, similarly to how byte sequence seeding doesn't turn any hash function into a MAC. You can use the function as an XOF or a MAC, but this doesn't make it meet the stricter cryptographic requirements for an XOF or a MAC.

It's probably my own fault for not making that point in the documentation. I did make an explicit note in the constructor description that even though it makes all algorithms usable as MACs, an established MAC algorithm such as HMAC should be used, unless the algorithm has been specifically designed to be used as a MAC (as SipHash was.) But I didn't make the corresponding note when describing `result`, and I should have.

To sum up with a general point, the hash algorithm requirements as specified reflect (my idea of) what a modern hash algorithm (one designed today) would provide as an API. If we look at SipHash, which is relatively recent, we'll see that it's a very good match for these requirements; the only slight issue is that it doesn't support seed sequences of arbitrary length (which is a regression in usability compared to HMAC.) In particular, it supports byte sequence seeding, doesn't expose its entire internal state as the digest, and trivially supports result extension.

There is actually a SipHash variant that produces a 128 bit digest instead of a 64 bit one, and it's almost exactly equivalent to a result-extended hash2::siphash_64. The only difference is that some xor constants have been changed so that the low 64 bits of the 128 bit variant do not match the value produced by the 64 bit variant.

Thanks to everyone who read this far. We should gamify the list and have some sort of achievement unlocked by that.