
Here is my review of the Boost.Hash2 library. Sorry for the long read.

* Are you knowledgeable about the problem domain?

I use hash functions with containers regularly, and I have added hashing support for my types using std::hash and Boost.ContainerHash, but I have little to no knowledge of designing and validating hash functions. I'm definitely not a specialist in cryptography.

* How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

A couple of days of reading the docs, source code, the proposal and some other resources linked there or brought up during discussions, plus some time writing toy examples and the review itself.

* What is your evaluation of the design, documentation and implementation?

I'll comment on the three aspects of the library together because in some instances they are related. For the implementation, I was reviewing revision 26b16023dc639ae354e246458719e4b569bc1d97 on develop (which was the default on checkout from git when I did the review; no tag or branch was specified in the review announcement).

My general impression of the documentation is that it is sufficient to eventually learn the library, but not easy for a newcomer to grasp. The library doesn't show the user how to add hashing support to his types, or how to compute a hash value for an object, until late in the documentation. I think the description of HashAlgorithm could be moved to the later chapters intended for hash algorithm implementers, and the initial chapters should provide an introduction for hash algorithm users.

Also, I feel that some chapters could use more detailed explanations. For example, the introduction in Contiguously Hashable Types seems incomplete, as it doesn't explain what "contiguously hashable" means or what happens when a type is not contiguously hashable.
I feel that in some cases important details are missing from the less formal initial chapters but are present in the later reference sections. Given that there are no links from the text to the reference sections, and the whole documentation is one giant wall of text, it is very difficult to navigate to learn these details. As I wrote my review while reading the docs, there were multiple times when I had to edit it because I initially thought something was not documented, only to find it later mentioned somewhere in the reference sections. Adding cross-links and more details in the initial informal chapters would be helpful. Pagination would also be very welcome, although I understand it is difficult with Asciidoc.

I like the general idea of separating hash functions from hashing support in types. I think this is a step in the right direction for improving support for cryptography in C++. However, the way HashAlgorithm is defined is controversial.

In the earlier discussion I already suggested renaming the HashAlgorithm::result() method to finalize(), and Vinnie agreed to this change in the proposal. The specification has also been relaxed to allow hash algorithms that don't support further update() and finalize() calls after the initial finalize(). In the discussion, Peter indicated that there is no intention to adopt this, which, I think, significantly limits compatibility with existing hash algorithm implementations. I have posted a short overview of the existing crypto libraries here:

https://lists.boost.org/Archives/boost/2024/12/258649.php

I understand that the ability to call result() multiple times and to interleave result() and update() calls has its uses, but I don't think this feature should be mandatory.
First, as noted, the overwhelming majority of the current, optimized and verified implementations don't support it, and making the library incompatible with them complicates the library's development and impacts its usefulness.

Second, the feature itself is not often needed. The primary use case is extending the hash value size, which is rather niche. If the hash value size (or its quality) is not sufficient, I would rather consider a different hash function entirely than try to extend the existing digest, as that would arguably result in better quality hashing. Digest extension is normally only supported by extendable-output hash functions (XOF). Some cryptographic libraries support XOF, but again, HashAlgorithm is incompatible with them (or at least with those I checked). For example, in OpenSSL, EVP_DigestFinalXOF is used to finalize the hash calculation and extract the variable-sized digest in one call; it is still required to be called no more than once, and no further updates or finalizations are allowed. I think the ability to repeatedly call result() is a poor vehicle for supporting XOF, and I prefer the OpenSSL approach.

Another feature that is required by HashAlgorithm and not supported by many implementations is seeding the algorithm on construction. Conceptually, seeding is typically understood as initializing the internal state of the hash algorithm with specific non-default values. Seeding is not supported by many hash algorithms, for example SHA-1 and SHA-2, and the conventional workaround is to add salt to the hashed data. I think HashAlgorithm should allow implementations to use the seed arguments to the hash algorithm constructors as salt, i.e. the constructor would implicitly call update().

I think the basic HashAlgorithm concept should require a minimal set of operations, so that existing mainstream implementations of popular non-XOF hash algorithms can be supported.
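To illustrate the seed-as-salt idea, here is a minimal (non-cryptographic) sketch; the class name and the choice of FNV-1a are mine, purely for illustration, not a Boost.Hash2 or proposal API. The seeding constructor does not set the internal state directly; it just feeds the seed bytes through update() before any message data:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical hash algorithm without native seeding support that
// treats the seed bytes as a salt prefix for the hashed data.
class salted_fnv1a
{
    std::uint64_t state_ = 14695981039346656037ull; // FNV-1a offset basis

public:
    salted_fnv1a() = default;

    // The seed bytes are not used to initialize internal state with
    // special values; they are simply hashed first, acting as a salt.
    salted_fnv1a( void const* seed, std::size_t n )
    {
        update( seed, n );
    }

    void update( void const* data, std::size_t n )
    {
        unsigned char const* p = static_cast< unsigned char const* >( data );
        for( std::size_t i = 0; i < n; ++i )
        {
            state_ ^= p[ i ];
            state_ *= 1099511628211ull; // FNV-1a prime
        }
    }

    std::uint64_t finalize() { return state_; }
};
```

With this approach, seeding with bytes S and then hashing a message M is by construction equivalent to hashing the concatenation S || M with an unseeded algorithm, which is exactly the conventional salting workaround.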
A more advanced concept that supports digest expansion, say HashAlgorithmXOF, can be defined separately. Something along these lines:

    struct HashAlgorithmBase
    {
        static constexpr std::size_t block_size = /*...*/; // optional

        /*
         * Initializes the algorithm to the default state.
         */
        HashAlgorithmBase();

        /*
         * Initializes the algorithm to an initial state dependent on the
         * bytes in [seed, seed + n). If n is zero, equivalent to default
         * construction.
         * The exact meaning of the seed bytes is implementation-defined.
         */
        HashAlgorithmBase( void const* seed, std::size_t n );

        HashAlgorithmBase( HashAlgorithmBase const& r );            // optional
        HashAlgorithmBase& operator=( HashAlgorithmBase const& r ); // optional
        HashAlgorithmBase( HashAlgorithmBase&& r );
        HashAlgorithmBase& operator=( HashAlgorithmBase&& r );

        /*
         * Feeds the algorithm input data. Can be called multiple times.
         * Calling update() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         */
        void update( void const* data, std::size_t n );
    };

    struct HashAlgorithm : HashAlgorithmBase
    {
        using result_type = /* more on this below */;

        using HashAlgorithmBase::HashAlgorithmBase;
        using HashAlgorithmBase::operator=;

        /*
         * Finalizes the algorithm and returns the final digest.
         * Calling finalize() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         */
        result_type finalize();
    };

    struct HashAlgorithmXOF : HashAlgorithmBase
    {
        using result_type = /* more on this below */;

        using HashAlgorithmBase::HashAlgorithmBase;
        using HashAlgorithmBase::operator=;

        /*
         * Finalizes the algorithm and copies the final digest to the
         * buffer of `size` bytes pointed to by `digest`.
         * Calling finalize() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         */
        void finalize( unsigned char* digest, std::size_t size );

        /*
         * Finalizes the algorithm and returns the final digest of size
         * `size`.
         * Calling finalize() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         *
         * Note: The returned digest is equivalent to the above overload's.
         * The difference is that this overload allocates memory for the
         * digest instead of using user-provided storage.
         */
        result_type finalize( std::size_t size );
    };

Above, I have intentionally omitted the constructor from uint64_t, as I find it confusing. In particular, it isn't clear what endianness it uses to seed the algorithm, and it looks like it duplicates hash_append.

In HashAlgorithmXOF, I provided two overloads of finalize(), where one could be implemented in terms of the other. I'm not particularly attached to this duality, but if I had to keep just one of the overloads, I would keep the first one, the one that fills the user-provided buffer. It is more flexible, although slightly less safe than the other. However, I'm not insisting on any specific form of finalize() here.

Ion has already mentioned in his review that HashAlgorithm::block_size should have type size_t, and I agree. At the very least, it should be an unsigned integer. BTW, the same applies to hmac::block_size. Ion also mentioned that hash algorithms should support move construction and move assignment, and I agree.

Also, HashAlgorithm currently requires hash algorithms to be copy constructible and copy assignable. It looks like the only place where copyability is required is hash_append_unordered_range. Copyability is a rather strong requirement, which may not be supported by all implementations.
This is why I marked the copy constructor and copy assignment "optional" in my proposed interfaces above: copyability should only be required by the library algorithms that actually need it; those that don't should still work with non-copyable hash algorithms.

The documentation should provide a more detailed description of the HashAlgorithm::result_type type, specifically what interface and operations it must support. Ideally, the library should provide a concept, similar to HashAlgorithm. Currently, the documentation only says it can be an unsigned integer type or a "std::array-like" type, where the latter is rather nebulous. From the implementation of get_integral_result, it looks like the type must have a data() member function returning a pointer, and its size() after default construction must be a constant expression returning at least 8. This, for example, excludes std::vector, which would be useful for HashAlgorithmXOF::result_type.

Also on the topic of get_integral_result for "std::array-like" hash values, a few comments:

- It requires the static size of 8 regardless of the size of the requested integer, so even if uint32_t is requested it will still require at least std::array<unsigned char, 8>. This should not be necessary.

- It ignores the data beyond the initial 8 bytes. That might be ok for the purpose of slicing the hash value, but it is inconsistent with get_integral_result for integers, which attempts to mix bits through multiplications by a constant. I think the implementation should be consistent between arrays and integers. The approach of universally mixing bits for all hash value types seems safer, as it produces better results for hash values with a non-uniform distribution of high-quality bits. It might be possible to reuse one of the hash algorithms implemented in the library for this bit mixing.

So I think get_integral_result and HashAlgorithm::result_type should be better defined by the library.
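To make the "mix all bytes" alternative concrete, here is a rough sketch of what a consistent get_integral_result for array-like digests could look like; the function name and the 64-bit constant are my own choices, not the library's implementation:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Derives an integer from an array-like digest by mixing in every
// byte, rather than reading only the first 8 bytes.
template< class T, std::size_t N >
T mixed_integral_result( std::array< unsigned char, N > const& digest )
{
    std::uint64_t m = 0;

    for( unsigned char b: digest )
    {
        // xor the byte in, then multiply by an odd constant so that
        // the byte's influence spreads across the whole 64-bit state
        m = ( m ^ b ) * 0x9E3779B97F4A7C15ull;
    }

    m ^= m >> 32; // fold the high bits down before narrowing

    return static_cast< T >( m );
}
```

Because every byte participates in the multiplicative mixing, a difference confined to the tail of the digest still changes the derived integer, which is not the case when only the first 8 bytes are read.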
It should be possible for the user to provide his own type as HashAlgorithm::result_type, and for that (a) there must be a list of requirements for that type to satisfy, and (b) it should be clear how an integer will be derived from it by get_integral_result. Maybe the user should even be able to define his own get_integral_result behavior for his digest type. The reason I think user-defined hash value types are useful is type safety, as it may be desirable to associate a digest with the algorithm that produced it.

In the Provided Hash Algorithms section (https://pdimov.github.io/hash2/doc/html/hash2.html#hashing_bytes_provided_ha...), consider documenting the specific classes and headers that implement the algorithms. I appreciate the guidance you give on whether each algorithm is cryptographically strong and on its intended application; this is very useful. It would also be nice to have some performance estimates for each of the algorithms, as implemented in the library.

The hmac_* type aliases are not provided for siphash and xxhash. But I would actually prefer that those aliases be removed for the rest of the hash algorithms, so that the algorithms don't bring in a dependency on hmac.hpp, which is not always needed by the user. Typing hmac<sha2_256> instead of hmac_sha2_256 is hardly any more tedious.

In the discussion before the review, I asked to provide a member-style API, similar to the free hash_append function, in tag_invoke's h argument, which is currently the hash algorithm. The idea is to pass a wrapper or proxy object as that argument; the wrapper would provide the API for appending data to the underlying hash algorithm. This would make it possible to eliminate the flavor argument that the user has to forward (more on this below) and would reduce the dependency on Boost.Hash2 needed to add hashing support to user-defined classes.
Unfortunate dependencies caused by support for hashing and serialization have been a long-standing problem within and outside Boost, and it would be prudent not to perpetuate it further. The tag_invoke protocol still requires hash_append_tag to be declared, but forward-declaring the tag should be much easier, and less limiting on Boost.Hash2's future evolution, than forward-declaring the hash_append function. I would prefer if the need for that tag was removed as well, but presumably that would mean dropping the tag_invoke protocol and using a sufficiently unique name for the customization point function (e.g. boost_hash2_append). I would be fine with that, but I can understand if the authors choose to stick with the tag_invoke protocol and the tag. If there is a way to avoid the forward declaration and still remain compatible with tag_invoke, I would love to hear it, and it should probably be added to the library docs.

Also before the review, I asked to provide a simple way to integrate types that support Boost.Hash2 with containers that use the std::hash protocol. Peter answered that there should be a `hash` utility class template that could be used as a base class for a std::hash specialization or, presumably, directly specified as the hash function of the container. I can see there is a section in the documentation discussing this (https://pdimov.github.io/hash2/doc/html/hash2.html#hashing_objects). I think this sort of utility should be provided by the library out of the box, in a separate header, and the docs should point the user to it.

In the list of types natively supported by hash_append, long double is not listed. From the implementation, it should work if it's 64-bit. Larger representations could be supported too, but care must be exercised with padding bytes.
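For illustration, here is a portable way to give padding bytes a deterministic value before hashing, without relying on compiler intrinsics; the struct and the helper function are hypothetical, not part of Boost.Hash2:

```cpp
#include <cstddef>
#include <cstring>

struct sample
{
    int x;
    char y;
    // on typical ABIs, padding bytes follow y
};

// Copies the members of s into a zeroed buffer, so that the padding
// bytes in the buffer have a deterministic (zero) value. The buffer
// can then be fed to the hash algorithm as a single blob of bytes.
void normalized_bytes( sample const& s, unsigned char ( &out )[ sizeof( sample ) ] )
{
    std::memset( out, 0, sizeof( sample ) );
    std::memcpy( out + offsetof( sample, x ), &s.x, sizeof s.x );
    std::memcpy( out + offsetof( sample, y ), &s.y, sizeof s.y );
}
```

Two objects with equal members then always produce identical byte sequences, so they hash equally regardless of what their padding bytes happened to contain.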
The padding could be cleared before hashing by using __builtin_clear_padding (gcc, clang) or __builtin_zero_non_value_bits (MSVC); you can find an example in Boost.Atomic:

https://github.com/boostorg/atomic/blob/5e2e01666860178591d2616a71f88181dded...
https://github.com/boostorg/atomic/blob/5e2e01666860178591d2616a71f88181dded...
https://github.com/boostorg/atomic/blob/5e2e01666860178591d2616a71f88181dded...

Also, std::nullptr_t is missing from the list but is actually supported.

In that same list, pointers to functions are listed as supported, but not pointers to members. I find this surprising and I don't see why they shouldn't be supported. If this omission is intentional, a rationale in the docs would be helpful. Otherwise, pointers to members should be supported.

You may also want to consider adding support for trivially copyable class types when native byte order is selected in the flavor argument to hash_append. You can clear the padding with one of the intrinsics mentioned above and then hash the entire object as a single blob of bytes. It looks like this should be allowed under the is_contiguously_hashable optimization.

I would like a flavor class template like this:

    template< endian ByteOrder = endian::native, typename SizeType = std::size_t >
    struct flavor
    {
        static constexpr endian byte_order = ByteOrder;
        using size_type = SizeType;
    };

    using default_flavor = flavor<>;
    using little_endian_flavor = flavor< endian::little, std::uint64_t >;
    using big_endian_flavor = flavor< endian::big, std::uint64_t >;

    // Maybe even this, not sure if it is worth it
    template< endian ByteOrder = endian::native, typename SizeType = std::size_t >
    inline constexpr flavor< ByteOrder, SizeType > flavor_v{};

This way the user is able to customize the size type without having to define a new flavor struct. In practice, container sizes most of the time do not exceed the capacity of uint32_t (and may in fact be uint32_t or less, with Boost.Container containers), so users might be tempted to specify that type instead of the default. This also changes default_flavor to use size_t for container sizes, which is consistent with the default flavor semantics: default_flavor represents the current hardware properties, which are both the byte order and the native size type. For portable hashing, one should use either little_endian_flavor or big_endian_flavor, which should both have a portable size type.

In the hash_append function call, the flavor argument is mandatory, and I'm not convinced this is necessary. In the earlier discussion, Peter stated that this is intentional, so that there is less opportunity to forget that argument when it is actually needed. Presumably, the main purpose is to ensure portable behavior across machines with different byte order. However, such portability is not always required, and in fact the default flavor (both in name and in hash_append template parameters) does not offer this portability. The empty brace initializer "{}" that is used for the default flavor also doesn't indicate a particular flavor, other than "whatever the default is, I don't care". Clearly, the library treats one set of parameters as the default in all respects but the requirement to always specify it in hash_append arguments. This doesn't make sense to me. Either make the default flavor optional, or don't offer a default (not in name, not in template parameters) and require users to always specify a specific flavor.
My preference would be the former, because default_flavor seems like a sensible default, and, since hash portability is not free in terms of performance, it should be explicitly opted into by users (the "don't pay for what you don't use" principle).

The second problem is the need to manually forward the flavor argument in tag_invoke overloads. I think having to pass around unused arguments is not a good design. This can be mitigated by wrapping the flavor argument in the wrapper/proxy that is passed as tag_invoke's h argument, something like this:

    template< typename Hash, typename Flavor >
    class hash_proxy
    {
    public:
        using hash_type = Hash;
        using flavor_type = Flavor;

    private:
        Hash& h;

    public:
        explicit hash_proxy( Hash& h ) noexcept : h( h ) {}

        template< typename T >
        void append( T const& data )
        {
            // As if hash_append( h, flavor_type{}, data ) here
        }
    };

    template< typename Flavor, typename Hash, typename T >
    void hash_append( Hash& h, T const& data )
    {
        hash_proxy< Hash, Flavor > hp( h );
        tag_invoke( hash_append_tag{}, hp, data );
    }

    template< typename Hash, typename T >
    void hash_append( Hash& h, T const& data )
    {
        hash_append< default_flavor >( h, data );
    }

    class my_class
    {
    private:
        int x;
        short y;
        char z;

        template< typename H >
        friend void tag_invoke( hash_append_tag, H& h, my_class const& data )
        {
            h.append( data.x );
            h.append( data.y );
            h.append( data.z );
        }
    };

Lastly on this topic, "flavor" doesn't sound like the right word to me. It sounds like a parameter that controls the hash algorithm's specifics, whereas it actually has no relation to the hash algorithm at all. I think "options" would be a better name for this parameter.

* Did you try to use the library? With which compiler(s)? Did you have any problems?

Yes, I tried to compile a few toy examples with gcc 13.2. I found that the library does not prefer a user-provided tag_invoke overload when the user's type is also described with Boost.Describe, and the code fails to compile because of overload resolution ambiguity.
https://github.com/pdimov/hash2/issues/28

I think tag_invoke should not cause ambiguities, even if the type matches other categories of types natively supported by the library. If the user defined tag_invoke, he clearly intended this function to implement hashing.

I also wrote a simple benchmark to see how the library's performance compares to OpenSSL, and used the same program to test how one would wrap OpenSSL primitives in Boost.Hash2 interfaces. The benchmark is here:

https://github.com/Lastique/hash2/blob/6bd5dafbcb05417014d0b85336a1245f5f186...

Compiled with:

    g++ -O3 -march=rocketlake -std=c++17 -I../include -I../../boost -o hash2_ossl_perf hash2_ossl_perf.cpp -lssl -lcrypto

Here's the output on my machine:

    sha1_160 (1024 bytes): 1220.093078 MiB/s
    sha1_160 (1048576 bytes): 1317.424120 MiB/s
    sha1_160 (16777216 bytes): 1318.296514 MiB/s
    sha2_256 (1024 bytes): 406.139530 MiB/s
    sha2_256 (1048576 bytes): 430.141013 MiB/s
    sha2_256 (16777216 bytes): 429.341694 MiB/s
    sha2_512 (1024 bytes): 629.593514 MiB/s
    sha2_512 (1048576 bytes): 712.100583 MiB/s
    sha2_512 (16777216 bytes): 711.549426 MiB/s
    openssl_sha1_160 (1024 bytes): 1531.142208 MiB/s
    openssl_sha1_160 (1048576 bytes): 2759.062140 MiB/s
    openssl_sha1_160 (16777216 bytes): 2738.579482 MiB/s
    openssl_sha2_256 (1024 bytes): 1541.291824 MiB/s
    openssl_sha2_256 (1048576 bytes): 2462.035414 MiB/s
    openssl_sha2_256 (16777216 bytes): 2449.770143 MiB/s
    openssl_sha2_512 (1024 bytes): 781.187505 MiB/s
    openssl_sha2_512 (1048576 bytes): 1086.239840 MiB/s
    openssl_sha2_512 (16777216 bytes): 1082.922782 MiB/s

So, depending on the input message size, OpenSSL is faster by 1.24x to 5.72x. Interestingly, despite what is said in the Boost.Hash2 documentation (and confirmed by the test), OpenSSL's implementation of SHA2-256 is faster than SHA2-512. This could be because of the SHA ISA extensions in my CPU.

* What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
I do not use the library in real projects. I think the separation between hash functions and hashed types is a very useful design idea; I like it very much. However, the way HashAlgorithm is defined significantly limits the usefulness of the Boost.Hash2 library, as it prevents users from plugging in existing hash algorithm implementations, even just to experiment. I noted other issues in the review, but I think this one is critical to the library's adoption.

When I wrote the review, I was torn between "conditional accept" and "reject", as I really do like the general concept the library offers, and yet I don't see myself adopting it because of the limitations. But since Peter indicated that he is not willing to address those limitations, I will have to vote REJECT.

Besides the HashAlgorithm requirements, it looks like the library could use some work anyway. In particular, the mentioned tag_invoke ambiguity should be resolved, and the member-style API for hash_append needs to be explored. These things may affect the protocol used to add hashing support to user types, and changes to HashAlgorithm would affect hash algorithms, so these issues need to be sorted out before acceptance and adoption. I hope to see the library improved and submitted for a second review.

Thank you Peter for submitting the library, and Matt for managing the review.