
- What is your evaluation of the design? Please see the documentation section; most of the comments there are about the design at least as much as about how it is documented.
- What is your evaluation of the implementation? The implementation is unsurprisingly of high quality.
- What is your evaluation of the documentation?
The documentation is very comprehensive. Sufficiently so that it called out all the surprising features of what I'd expected would simply be a refined (given the author) implementation of the approach described in the paper "Types Don't Know #". The documentation describes the hash algorithm requirements early. I expected to be unsurprised, and initially was (compared with the hash algorithm requirements in the paper), but the fact that every hash algorithm must have a constructor taking an integer seed and one taking a byte sequence immediately caught my attention. I didn't expect that all of the listed hash algorithms would have either or both of those constructors, so I checked the reference docs for the algorithms to find out what this requirement has added to the "usual" expectations for each algorithm. (A sketch of the required shape, as I read it, follows below.)

FNV-1a looked like a predictable, minimal answer to the question of how to seed something that isn't a keyed hash: just feed the "key" in as a prefix by calling update(). That did make me wonder why the concept was complicated with constructors that don't do anything the user couldn't do post-construction by simply calling update() (the second sketch below illustrates this point). But the xxHash algorithm specification (as opposed to the hash2 library) does define a (result-type-sized) seed, and sure enough hash2 provides a constructor that operates as expected, with the twist that the seed parameter is not of result_type, so it can be larger than the seed the 32-bit algorithm supports. This discrepancy, together with the requirement that different seeds give different results, forces larger seed values to be handled creatively. The approach taken is: if the seed is not small enough to be converted to result_type without overflow, use the low 32 bits for the specified seeding, then mix the high 32 bits into the state "somehow" (I checked the implementation; it is as if the high 32 bits were copied into unsigned char b[4] and update(b, 4) were then called, four times, I think).

So what happens when constructing from a byte sequence? In this case we fall back to default construction followed by update(). Or rather, the docs say update() and a call to result(). That looks odd. Why would one call result()? Are further update() calls valid after calling result()? Reading more of the docs: every hash algorithm is required not only to have these constructors, but also to be an extensible hash. Another surprise. It turns out that result() having the side effect of changing the state is required and expected, and it is being used here to "mix" the seed. I'm not 100% sure why. Not that it's bad, but just how arbitrary is each algorithm's seeding behaviour? Is it reasonable to be creating seeding conventions specific to the library? The obvious answer is: if you don't like it, write a different algorithm implementation.

It is true that, for the purpose of hashing unordered container keys, any "reasonable" seeding convention is good enough, even where hash-flooding protection is needed. Perhaps using the library to hash arbitrary C++ objects for any and all uses is getting into an area where the user needs to be more aware of potential interoperability issues. Even so, the specification for each provided algorithm should be very explicit about exactly how it seeds, and about the fact that there is no interoperability of these features with any other implementation of the "same" algorithm. Unless of course there is: if an algorithm in hash2 follows an external specification for seeding and/or extension, that should be clearly documented and referenced.
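For reference, this is roughly the shape that, as I read the documentation, every hash algorithm is required to have. It is only an illustrative skeleton of the requirements discussed above, not the library's code, and I may be slightly off on the exact parameter types:

    #include <cstdint>
    #include <cstddef>

    // Illustrative skeleton only: the members the documented hash algorithm
    // concept appears to require (declarations, no definitions).
    struct hash_algorithm_sketch
    {
        using result_type = std::uint32_t;  // or an array-like byte sequence

        hash_algorithm_sketch();                                   // default (unseeded)
        explicit hash_algorithm_sketch( std::uint64_t seed );      // seeded from an integer
        hash_algorithm_sketch( void const* seed, std::size_t n );  // seeded from a byte sequence

        void update( void const* data, std::size_t n );  // absorb input, callable repeatedly
        result_type result();                            // non-const: advances the state,
                                                         // so it may be called more than once
    };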
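And to illustrate the point about the integer-seed constructor of an unkeyed algorithm such as FNV-1a adding nothing the user couldn't do themselves: below is a minimal, self-contained FNV-1a sketch (my own, not hash2's implementation, whose documented seeding convention also involves a result() call) in which a hypothetical seed constructor simply feeds the seed bytes in as a prefix via update().

    #include <cstdint>
    #include <cstddef>
    #include <cassert>

    struct fnv1a_32_sketch
    {
        std::uint32_t state = 0x811C9DC5u;  // FNV-1a 32-bit offset basis

        fnv1a_32_sketch() = default;

        // Hypothetical seed constructor: just prefixes the message with the seed bytes.
        explicit fnv1a_32_sketch( std::uint64_t seed )
        {
            update( &seed, sizeof( seed ) );
        }

        void update( void const* p, std::size_t n )
        {
            auto const* b = static_cast<unsigned char const*>( p );
            for( std::size_t i = 0; i < n; ++i )
            {
                state ^= b[ i ];
                state *= 0x01000193u;  // FNV 32-bit prime
            }
        }

        std::uint32_t result() const { return state; }
    };

    int main()
    {
        std::uint64_t seed = 0x0123456789ABCDEFull;

        fnv1a_32_sketch seeded( seed );          // via the constructor
        seeded.update( "hello", 5 );

        fnv1a_32_sketch manual;                  // the user doing the same by hand
        manual.update( &seed, sizeof( seed ) );
        manual.update( "hello", 5 );

        assert( seeded.result() == manual.result() );
    }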
An implementation of KangarooTwelve (https://keccak.team/kangarootwelve.html) would make an interesting addition, as it has a comprehensive set of functions that might be expected to directly fulfil the concepts (among other nice features).
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
I haven't used it yet, but I've used hash algorithms for various things: some cryptographic, some purely for high-performance (relative to resources) unordered containers and data structures. I would definitely use this library for unordered-container key hashing, and the convenience of creating a hash for a user-defined type independently of the algorithm choice is a valuable capability. The reason I haven't had much to say about that aspect of the library is that I don't have any complaints or suggestions for improvement, except that I think support for hashing unordered containers should be dropped, because there doesn't seem to be a robust and performant generic solution (if there is, it's not in the library). I would think twice before using the library for any cryptographic purpose, not because of any specific issues (other than the unordered-container issue), but because it is yet another implementation of the raw hash algorithms without any specific benefits, and with an interface that is (for good reason) restricted and tailored to the uses envisioned in the paper.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
I compiled and ran some tests using g++ 14.2. It was very easy to do this; modular Boost is great!
- How much effort did you put into your evaluation?
A few hours.
A glance? A quick reading? In-depth study?
I took a glance, mistook it for a nice simple library to make unordered containers easier to use with arbitrary key types, and thought I wouldn't dig deeper and probably wouldn't have a lot to add beyond voting to accept. But the extended scope of the algorithm concept, and also the aspiration to solve the problem of hashing multisets / unordered containers as key types, made me take more notice. I don't think the generalisations of the hash algorithm concept do any harm, but there is potential for surprise, and the potential to expect that the hash algorithms so defined and encapsulated should be used for "everything". The ergonomics are a bit quirky. I do wonder whether adopting names that are more descriptive of the hash algorithm operation (if we were talking only about sponges I'd say absorb() and squeeze() in place of update() and result()) might be in order, to show just how different the algorithm concepts are, but I'm not sure that every hash algorithm should provide the latter. Perhaps it is more that everything should provide update(), an extensible hash should provide "stir()", and everything should provide a (const) result(); a sketch of what I mean follows below. Surely the few places where result() is called multiple times with the expectation that the state changes can and should make this explicit? I'm also not sure I like result types being either array-like byte sequences or integers, but perhaps optimisers can't quite make that choice irrelevant from a performance perspective. I'm not sure the library should even try to solve the problem of extracting "good" bits from a hash that doesn't have great dispersion properties. There isn't anything "wrong" with a hash function that is fast but less than uniform and certainly not preimage resistant; there is something wrong with trying to "make it work" where it shouldn't be used in the first place.
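To make the naming suggestion concrete, this is the kind of split I have in mind. It is only a sketch of an alternative shape; "stir" is a name I am inventing here, not something the library or the paper proposes.

    #include <cstdint>
    #include <cstddef>

    // Hypothetical alternative shape: a plain hash exposes update() and a
    // const result(); an extensible hash adds an explicit state-advancing
    // operation instead of giving result() that side effect.
    struct plain_hash_sketch
    {
        void update( void const* data, std::size_t n );  // absorb input
        std::uint64_t result() const;                     // observation only, no state change
    };

    struct extensible_hash_sketch
    {
        void update( void const* data, std::size_t n );  // absorb input
        void stir();                                      // hypothetical: explicit state advance
                                                          // for repeated extraction
        std::uint64_t result() const;                     // observation only, no state change
    };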
- Are you knowledgeable about the problem domain? I know enough to be dangerous, and I know enough to know that I am (dangerous). This library is mostly a way to avoid footgun problems while easing the task for an average-knowledge user like me.
Ensure to explicitly include with your review: ACCEPT, REJECT, or
CONDITIONAL ACCEPT (with acceptance conditions). I vote to ACCEPT the library. I don't think adding explicit conditions is necessary as I expect the author will address any issues.