
This is my review of the proposed boost::hash2 library.
# What is your evaluation of the design?
The base idea, separating the hash algorithm from the actual hashing
of objects, is a brilliant one.
The library consists of two parts:
* a collection of hashing algorithms, some of them of crypto quality
* a C++ object hashing facility
Given this design, I would not have expected the library to provide
a thorough collection of hash algorithms, but it's a good thing to get
them. I would have expected, however, to see some guidance about how
to integrate an external hashing library. It is unfortunate that this
part is missing, because it turns out to be a lot harder than it
should be.
Several design decisions could be improved in my opinion:
* the requirements on the HashAlgorithm concept are too strict, and have
no rationale in the documentation
* copy constructibility and copy assignment are unusual across current
hash implementations, as has been noted by other reviewers. I can
understand a rationale for copy construction; copy assignment looks very
weird to me. The only rationale I can think of is resetting the internal
state to a defaulted state, which could be a dedicated operation.
* creation with a seed (either as a uint64 or as a byte array) as well as
default construction is required. That seems odd. The documentation, for
example, strongly advises against using siphash without a seed. I would
prefer that it be a compile-time error to use siphash without a seed /
with an incorrectly sized seed. I have no problem with adding a key/seed
to algorithms that do not mandate one, though.
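One way to make a missing or mis-sized seed a compile-time error is to accept the key only as a fixed-size array and delete default construction. A minimal sketch (keyed_hash and its toy mixing are my own illustration, not the library's siphash interface; the 16-byte key size mirrors siphash's):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical keyed algorithm: a fixed-size key is the only way to
// construct it, so "no seed" or a wrongly sized seed cannot compile.
class keyed_hash
{
    std::uint64_t state_;

public:
    keyed_hash() = delete; // default construction forbidden

    explicit keyed_hash( std::array<unsigned char, 16> const& key )
        : state_( 0 )
    {
        for( unsigned char b: key ) state_ = state_ * 31 + b; // toy mixing
    }

    void update( void const* data, std::size_t n )
    {
        auto p = static_cast<unsigned char const*>( data );
        for( std::size_t i = 0; i < n; ++i ) state_ = state_ * 31 + p[ i ];
    }

    std::uint64_t result() { return state_; }
};

static_assert( !std::is_default_constructible<keyed_hash>::value,
    "using the algorithm without a seed is rejected at compile time" );
```

An algorithm that does not mandate a key could simply also provide the default constructor; the point is that the keyed ones would not.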
But, since the library itself does not create the HashAlgorithm
instances (there is one notable exception to that), it should not
put additional requirements beyond what it uses. The HashAlgorithm
concept could be:
```cpp
struct HashAlgorithm
{
    using result_type = /*integral or array-like*/;

    static constexpr int block_size = /*...*/; // optional

    void update( void const* data, std::size_t n );

    result_type result();
};
```
Interestingly enough, most of the object hashing machinery works if you
provide it with such an algorithm (neither copyable nor default
constructible). I failed to see where default construction is used
outside tests, for example.
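A minimal model of that reduced concept could look like this (a toy FNV-1a-style mixer of my own, purely illustrative, not one of the library's algorithms):

```cpp
#include <cstddef>
#include <cstdint>

// Toy streaming algorithm modelling the reduced concept: move-only,
// not default constructible, not copy assignable.
class minimal_hash
{
    std::uint64_t state_;

public:
    using result_type = std::uint64_t;

    explicit minimal_hash( std::uint64_t seed )
        : state_( seed ^ 0xcbf29ce484222325ull ) {}

    minimal_hash( minimal_hash const& ) = delete;
    minimal_hash& operator=( minimal_hash const& ) = delete;
    minimal_hash( minimal_hash&& ) = default;

    void update( void const* data, std::size_t n )
    {
        auto p = static_cast<unsigned char const*>( data );
        for( std::size_t i = 0; i < n; ++i )
        {
            state_ ^= p[ i ];
            state_ *= 0x100000001b3ull; // FNV-1a prime, toy usage
        }
    }

    result_type result() { return state_; }
};
```

Feeding the same bytes in one call or split across several calls yields the same digest, which is all hash_append fundamentally needs from the algorithm.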
* the use of void const* + size in the interface is controversial, as
several people would prefer a span<>. There are good reasons for not
using span, and this should not be an issue, as it should only impact
HashAlgorithm writers. It is in my opinion an unfortunate design choice
that this internal interface leaks to the end user. The functionality is
already provided by hash_append_range, which could be the advertised way
of hashing bytes with this library (i.e. when mixing hashing of objects
and hashing of bytes, e.g. when implementing custom hashing for a type).
* constexpr use mandates the use of unsigned char const*. It could
therefore be the primary supported interface over void const*, to avoid
interface duplication.
* the requirement on result extension is also uncommon. While I
understand the rationale, it should be noted that it is likely to cause
cross-language interoperability issues, as result extension is not
necessarily specified by the algorithm. Implementations in other
languages are unlikely to provide it, making the feature a niche one,
unless the underlying algorithm specification defines it.
* most of the functionality is provided by the hash_append function, and the
hash_append_xxx helpers. This is a neat design.
* The Flavor template argument allows customizing the hashing of objects,
to get platform-independent hashes. This is a good idea, but it comes
with some problems:
* the flavor is repeated on each hash_append call. I don't think a
flavor changing over time is a realistic use case. If I were to
integrate the library in our code base, the first thing I would do is
write a wrapper to get rid of that flavor argument at every call site.
* the default flavor uses std::uint64_t for sizes. That seems odd: if
I'm not asking for a specific flavor, I'm not asking for the hash to be
consistent across different systems, so std::size_t should be used
there instead. This looked to me like a "pay for what you don't use"
situation, so I ran a quick near worst-case benchmark on our arm
platform (hashing a std::vector of the three integers 1-2-3 a million
times), and found that the code using size_t runs 15% faster.
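The wrapper I have in mind is a one-liner that bakes the flavor in once. A sketch against stand-ins (native_flavor, lib_hash_append and toy_hash are placeholders of my own; real code would forward to boost::hash2::hash_append with the chosen flavor):

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for the library's flavor / hash_append pair, just to make
// the wrapper idea concrete.
struct native_flavor { using size_type = std::size_t; };

struct toy_hash
{
    std::uint64_t s = 0;
    void update( void const* p, std::size_t n )
    {
        auto b = static_cast<unsigned char const*>( p );
        for( std::size_t i = 0; i < n; ++i ) s = s * 31 + b[ i ];
    }
    std::uint64_t result() { return s; }
};

template<class Hash, class Flavor, class T>
void lib_hash_append( Hash& h, Flavor const&, T const& v )
{
    h.update( &v, sizeof( v ) ); // placeholder for the real dispatch
}

// The wrapper: call sites never mention the flavor again.
template<class Hash, class T>
void my_hash_append( Hash& h, T const& v )
{
    lib_hash_append( h, native_flavor(), v );
}
```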
# What is your evaluation of the implementation?
The implementation looks good. Code is clean, very readable.
A minor remark on hash_append_unordered_range: m should be declared as
std::size_t (or std::uint64_t). The current
typename std::iterator_traits<It>::difference_type m = 0; is wrong.
I'm not at ease with hash_append_unordered_range. The fact that it
reduces each element hash to 64 bits, simply sums those values, and then
hashes the sum does not look right. If the goal is simply to avoid
unintended collisions, it is probably fine. If the goal is to protect
against an attacker deliberately trying to provoke collisions, it may do
the trick, although intuitively xoring the full results would be better
at that job than summing truncated ones. If it is to preserve the
cryptographic properties of the hash function, then it looks wrong.
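As I read it, the combining step amounts to something like the following simplified model (my own reconstruction, not the library's exact code): hash each element independently to 64 bits, sum the results, then feed the sum to the outer hash. The sum makes the result order-independent, which is the point, but it also discards everything beyond 64 bits per element:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy 64-bit per-element hash (FNV-1a style), standing in for running
// a fresh copy of the hash algorithm over each element.
inline std::uint64_t element_hash( std::uint64_t v )
{
    std::uint64_t h = 0xcbf29ce484222325ull;
    unsigned char bytes[ sizeof( v ) ];
    std::memcpy( bytes, &v, sizeof( v ) );
    for( std::size_t i = 0; i < sizeof( v ); ++i )
    {
        h ^= bytes[ i ];
        h *= 0x100000001b3ull;
    }
    return h;
}

// Simplified model of the unordered combining step: per-element hashes
// are summed (mod 2^64), so element order cannot affect the result.
template<class It>
std::uint64_t unordered_digest( It first, It last )
{
    std::uint64_t sum = 0;
    for( ; first != last; ++first ) sum += element_hash( *first );
    return sum; // the real code then hashes this 64-bit sum
}
```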
Another remark is about size serialization. The library relies on
std::tuple_size to identify container-like structs, but does not include
the size in the hash (or is it because of is_contiguously_hashable?).
This leads to unintuitive results, such as:
std::array