
Here is my review of the Boost.Hash2 library. Sorry for the long read.

* Are you knowledgeable about the problem domain?

I use hash functions with containers regularly, and I have added hashing support for my types using std::hash and Boost.ContainerHash, but I have little to no knowledge of designing and validating hash functions. I'm definitely not a specialist in cryptography.

* How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

A couple of days of reading the docs, source code, the proposal and some other resources linked there or brought up during discussions, plus some time writing toy examples and the review itself.

* What is your evaluation of the design, documentation and implementation?

I'll comment on the three aspects of the library together because in some instances they are related. For the implementation, I was reviewing revision 26b16023dc639ae354e246458719e4b569bc1d97 on develop (which was the default on checkout from git when I did the review; no tag or branch was specified in the review announcement).

My general impression of the documentation is that it is sufficient to eventually learn the library, but not easy for a newcomer to grasp. The library doesn't show the user how to add hashing support to his types, or how to compute a hash value for an object, until late in the documentation. I think the description of HashAlgorithm could be moved to the later chapters intended for hash algorithm implementers, and the initial chapters should provide an introduction for hash algorithm users.

Also, I feel that some chapters could use more detailed explanations. For example, the introduction in Contiguously Hashable Types seems incomplete, as it doesn't explain what "contiguously hashable" means or what happens when a type is not contiguously hashable.
I feel that in some cases important details are missing from the less formal initial chapters but are present in the later reference sections. Given that there are no links from the text to the reference sections, and the whole documentation is one giant wall of text, it is very difficult to navigate to learn these details. As I wrote my review while reading the docs, there were multiple times when I had to edit it because I initially thought something was not documented, only to find it later mentioned somewhere in the reference sections. Adding cross-links and more details in the initial informal chapters would be helpful. Pagination would also be very welcome, although I understand it is difficult with Asciidoc.

I like the general idea of separating hash functions from hashing support in types. I think this is a step in the right direction for improving support for cryptography in C++. However, the way HashAlgorithm is defined is controversial.

In the earlier discussion I already suggested renaming the HashAlgorithm::result() method to finalize(), and Vinnie agreed to this change in the proposal. The specification has also been relaxed to allow hash algorithms that don't support further update() and finalize() calls after the initial finalize(). In the discussion, Peter indicated that there is no intention to adopt this, which, I think, significantly limits compatibility with existing hash algorithm implementations. I have posted a short overview of the existing crypto libraries here:

https://lists.boost.org/Archives/boost/2024/12/258649.php

I understand that the ability to call result() multiple times and to interleave result() and update() calls has its uses, but I don't think this feature should be mandatory.
First, as noted, the overwhelming majority of the current, optimized and verified implementations don't support it, and making the library incompatible with them complicates the library's development and impacts its usefulness.

Second, the feature itself is not often needed. The primary use case is extending the hash value size, which is rather niche. If the hash value size (or its quality) is not sufficient, I would rather consider a different hash function entirely than try to extend the existing digest, as that would arguably result in better quality hashing. Digest extension is normally only supported by extendable-output hash functions (XOF). Some cryptographic libraries support XOF, but again, HashAlgorithm is incompatible with them (or at least with those I checked). For example, in OpenSSL, EVP_DigestFinalXOF is used to finalize the hash calculation and extract the variable-sized digest in one call; it is still required to be called no more than once, and no further updates or finalizations are allowed. I think the ability to repeatedly call result() is a poor vehicle for supporting XOF, and I prefer the OpenSSL approach.

Another feature that is required by HashAlgorithm and not supported by many implementations is seeding the algorithm on construction. Conceptually, seeding is typically understood as initializing the internal state of the hash algorithm with specific non-default values. Seeding is not supported by many hash algorithms, for example SHA-1 and SHA-2, and the conventional workaround is to add salt to the hashed data. I think HashAlgorithm should allow implementations to use the seed arguments to the hash algorithm constructors as salt, i.e. the constructor would implicitly call update().

I think the basic HashAlgorithm concept should require a minimal set of operations, so that existing mainstream implementations of popular non-XOF hash algorithms can be supported.
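To illustrate the seed-as-salt idea, here is a minimal (non-cryptographic) sketch; the class name and the choice of FNV-1a are mine, purely for illustration, not a Boost.Hash2 or proposal API. The seeding constructor does not set the internal state directly; it just feeds the seed bytes through update() before any message data:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical hash algorithm without native seeding support that
// treats the seed bytes as a salt prefix for the hashed data.
class salted_fnv1a
{
    std::uint64_t state_ = 14695981039346656037ull; // FNV-1a offset basis

public:
    salted_fnv1a() = default;

    // The seed bytes are not used to initialize internal state with
    // special values; they are simply hashed first, acting as a salt.
    salted_fnv1a( void const* seed, std::size_t n )
    {
        update( seed, n );
    }

    void update( void const* data, std::size_t n )
    {
        unsigned char const* p = static_cast< unsigned char const* >( data );
        for( std::size_t i = 0; i < n; ++i )
        {
            state_ ^= p[ i ];
            state_ *= 1099511628211ull; // FNV-1a prime
        }
    }

    std::uint64_t finalize() { return state_; }
};
```

With this approach, seeding with bytes S and then hashing a message M is by construction equivalent to hashing the concatenation S || M with an unseeded algorithm, which is exactly the conventional salting workaround.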
A more advanced concept that supports digest expansion, say HashAlgorithmXOF, can be defined separately. Something along these lines:

    struct HashAlgorithmBase
    {
        static constexpr std::size_t block_size = /*...*/; // optional

        /*
         * Initializes the algorithm to the default state.
         */
        HashAlgorithmBase();

        /*
         * Initializes the algorithm to an initial state dependent on the
         * bytes in [seed, seed + n). If n is zero, equivalent to default
         * construction.
         * The exact meaning of the seed bytes is implementation-defined.
         */
        HashAlgorithmBase( void const* seed, std::size_t n );

        HashAlgorithmBase( HashAlgorithmBase const& r );            // optional
        HashAlgorithmBase& operator=( HashAlgorithmBase const& r ); // optional
        HashAlgorithmBase( HashAlgorithmBase&& r );
        HashAlgorithmBase& operator=( HashAlgorithmBase&& r );

        /*
         * Feeds the algorithm input data. Can be called multiple times.
         * Calling update() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         */
        void update( void const* data, std::size_t n );
    };

    struct HashAlgorithm : HashAlgorithmBase
    {
        using result_type = /* more on this below */;

        using HashAlgorithmBase::HashAlgorithmBase;
        using HashAlgorithmBase::operator=;

        /*
         * Finalizes the algorithm and returns the final digest.
         * Calling finalize() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         */
        result_type finalize();
    };

    struct HashAlgorithmXOF : HashAlgorithmBase
    {
        using result_type = /* more on this below */;

        using HashAlgorithmBase::HashAlgorithmBase;
        using HashAlgorithmBase::operator=;

        /*
         * Finalizes the algorithm and copies the final digest to the
         * buffer of `size` bytes pointed to by `digest`.
         * Calling finalize() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         */
        void finalize( unsigned char* digest, std::size_t size );

        /*
         * Finalizes the algorithm and returns the final digest of size
         * `size`.
         * Calling finalize() after the algorithm has been finalized is
         * undefined behavior, unless otherwise specified by the specific
         * algorithm.
         *
         * Note: The returned digest is equivalent to the above overload's.
         * The difference is that this overload allocates memory for the
         * digest instead of using user-provided storage.
         */
        result_type finalize( std::size_t size );
    };

Above, I have intentionally omitted the constructor from uint64_t, as I find it confusing. In particular, it isn't clear what endianness it uses to seed the algorithm, and it looks like it duplicates hash_append.

In HashAlgorithmXOF, I provided two overloads of finalize(), where one could be implemented in terms of the other. I'm not particularly attached to this duality, but if I had to keep just one of the overloads, I would keep the first one, the one that fills the user-provided buffer. It is more flexible, although slightly less safe than the other. However, I'm not insisting on any specific form of finalize() here.

Ion has already mentioned in his review that HashAlgorithm::block_size should have type size_t, and I agree. At the very least, it should be an unsigned integer. BTW, the same applies to hmac::block_size. Ion also mentioned that hash algorithms should support move construction and move assignment, and I agree.

Also, HashAlgorithm currently requires hash algorithms to be copy constructible and copy assignable. It looks like the only place where copyability is required is hash_append_unordered_range. Copyability is a rather strong requirement, which may not be supported by all implementations.
This is why I marked the copy constructor and copy assignment "optional" in my proposed interfaces above: copyability should only be required by the library algorithms that actually need it; those that don't should still work with non-copyable hash algorithms.

The documentation should provide a more detailed description of the HashAlgorithm::result_type type, specifically what interface and operations it must support. Ideally, the library should provide a concept, similar to HashAlgorithm. Currently, the documentation only says it can be an unsigned integer type or a "std::array-like" type, where the latter is rather nebulous. From the implementation of get_integral_result, it looks like the type must have a data() member function returning a pointer, and its size() after default construction must be a constant expression returning at least 8. This, for example, excludes std::vector, which would be useful for HashAlgorithmXOF::result_type.

Also on the topic of get_integral_result for "std::array-like" hash values, a few comments:

- It requires the static size of 8 regardless of the size of the requested integer, so even if uint32_t is requested it will still require at least std::array<unsigned char, 8>. This should not be necessary.

- It ignores the data beyond the initial 8 bytes. That might be ok for the purpose of slicing the hash value, but it is inconsistent with get_integral_result for integers, which attempts to mix bits through multiplications by a constant. I think the implementation should be consistent between arrays and integers. The approach of universally mixing bits for all hash value types seems safer, as it produces better results for hash values with a non-uniform distribution of high-quality bits. It might be possible to reuse one of the hash algorithms implemented in the library for this bit mixing.

So I think get_integral_result and HashAlgorithm::result_type should be better defined by the library.
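To make the "mix all bytes" alternative concrete, here is a rough sketch of what a consistent get_integral_result for array-like digests could look like; the function name and the 64-bit constant are my own choices, not the library's implementation:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Derives an integer from an array-like digest by mixing in every
// byte, rather than reading only the first 8 bytes.
template< class T, std::size_t N >
T mixed_integral_result( std::array< unsigned char, N > const& digest )
{
    std::uint64_t m = 0;

    for( unsigned char b: digest )
    {
        // xor the byte in, then multiply by an odd constant so that
        // the byte's influence spreads across the whole 64-bit state
        m = ( m ^ b ) * 0x9E3779B97F4A7C15ull;
    }

    m ^= m >> 32; // fold the high bits down before narrowing

    return static_cast< T >( m );
}
```

Because every byte participates in the multiplicative mixing, a difference confined to the tail of the digest still changes the derived integer, which is not the case when only the first 8 bytes are read.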
It should be possible for the user to provide his own type as HashAlgorithm::result_type, and for that (a) there must be a list of requirements for that type to satisfy, and (b) it should be clear how an integer will be derived from it by get_integral_result. Maybe the user should even be able to define his own get_integral_result behavior for his digest type. The reason I think user-defined hash value types are useful is type safety, as it may be desirable to associate a digest with the algorithm that produced it.

In the Provided Hash Algorithms section (https://pdimov.github.io/hash2/doc/html/hash2.html#hashing_bytes_provided_ha...), consider documenting the specific classes and headers that implement the algorithms. I appreciate the guidance you give on whether each algorithm is cryptographically strong and on its intended application; this is very useful. It would also be nice to have some performance estimates for each of the algorithms, as implemented in the library.

The hmac_* type aliases are not provided for siphash and xxhash. But I would actually prefer that those aliases be removed for the rest of the hash algorithms, so that the algorithms don't bring in a dependency on hmac.hpp, which is not always needed by the user. Typing hmac<sha2_256> instead of hmac_sha2_256 is hardly any more tedious.

In the discussion before the review, I asked to provide a member-style API, similar to the free hash_append function, in tag_invoke's h argument, which is currently the hash algorithm. The idea is to pass a wrapper or proxy object as that argument; the wrapper would provide the API for appending data to the underlying hash algorithm. This would make it possible to eliminate the flavor argument that the user has to forward (more on this below) and would reduce the dependency on Boost.Hash2 needed to add hashing support to user-defined classes.
Unfortunate dependencies caused by support for hashing and serialization have been a long-standing problem within and outside Boost, and it would be prudent not to perpetuate it further. The tag_invoke protocol still requires hash_append_tag to be declared, but forward-declaring the tag should be much easier, and less limiting on Boost.Hash2's future evolution, than forward-declaring the hash_append function. I would prefer if the need for that tag was removed as well, but presumably that would mean dropping the tag_invoke protocol and using a sufficiently unique name for the customization point function (e.g. boost_hash2_append). I would be fine with that, but I can understand if the authors choose to stick with the tag_invoke protocol and the tag. If there is a way to avoid the forward declaration and still remain compatible with tag_invoke, I would love to hear it, and it should probably be added to the library docs.

Also before the review, I asked to provide a simple way to integrate types that support Boost.Hash2 with containers that use the std::hash protocol. Peter answered that there should be a `hash` utility class template that could be used as a base class for a std::hash specialization or, presumably, directly specified as the hash function of the container. I can see there is a section in the documentation discussing this (https://pdimov.github.io/hash2/doc/html/hash2.html#hashing_objects). I think this sort of utility should be provided by the library out of the box, in a separate header, and the docs should point the user to it.

In the list of types natively supported by hash_append, long double is not listed. From the implementation, it should work if it's 64-bit. Larger representations could be supported too, but care must be exercised with padding bytes.
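For illustration, here is a portable way to give padding bytes a deterministic value before hashing, without relying on compiler intrinsics; the struct and the helper function are hypothetical, not part of Boost.Hash2:

```cpp
#include <cstddef>
#include <cstring>

struct sample
{
    int x;
    char y;
    // on typical ABIs, padding bytes follow y
};

// Copies the members of s into a zeroed buffer, so that the padding
// bytes in the buffer have a deterministic (zero) value. The buffer
// can then be fed to the hash algorithm as a single blob of bytes.
void normalized_bytes( sample const& s, unsigned char ( &out )[ sizeof( sample ) ] )
{
    std::memset( out, 0, sizeof( sample ) );
    std::memcpy( out + offsetof( sample, x ), &s.x, sizeof s.x );
    std::memcpy( out + offsetof( sample, y ), &s.y, sizeof s.y );
}
```

Two objects with equal members then always produce identical byte sequences, so they hash equally regardless of what their padding bytes happened to contain.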
The padding could be cleared before hashing by using __builtin_clear_padding (gcc, clang) or __builtin_zero_non_value_bits (MSVC); you can find an example in Boost.Atomic:

https://github.com/boostorg/atomic/blob/5e2e01666860178591d2616a71f88181dded...
https://github.com/boostorg/atomic/blob/5e2e01666860178591d2616a71f88181dded...
https://github.com/boostorg/atomic/blob/5e2e01666860178591d2616a71f88181dded...

Also, std::nullptr_t is missing from the list but is actually supported.

In that same list, pointers to functions are listed as supported, but not pointers to members. I find this surprising and I don't see why they shouldn't be supported. If this omission is intentional, a rationale in the docs would be helpful. Otherwise, pointers to members should be supported.

You may also want to consider adding support for trivially copyable class types when native byte order is selected in the flavor argument to hash_append. You can clear the padding with one of the intrinsics mentioned above and then hash the entire object as a single blob of bytes. It looks like this should be allowed under the is_contiguously_hashable optimization.

I would like a flavor class template like this:

    template< endian ByteOrder = endian::native, typename SizeType = std::size_t >
    struct flavor
    {
        static constexpr endian byte_order = ByteOrder;
        using size_type = SizeType;
    };

    using default_flavor = flavor<>;
    using little_endian_flavor = flavor< endian::little, std::uint64_t >;
    using big_endian_flavor = flavor< endian::big, std::uint64_t >;

    // Maybe even this, not sure if it is worth it
    template< endian ByteOrder = endian::native, typename SizeType = std::size_t >
    inline constexpr flavor< ByteOrder, SizeType > flavor_v{};

This way the user is able to customize the size type without having to define a new flavor struct. In practice, container sizes most of the time do not exceed the capacity of uint32_t (and may in fact be uint32_t or less, with Boost.Container containers), so users might be tempted to specify that type instead of the default. This also changes default_flavor to use size_t for container sizes, which is consistent with the default flavor semantics: default_flavor represents the current hardware properties, which are both the byte order and the native size type. For portable hashing, one should use either little_endian_flavor or big_endian_flavor, which should both have a portable size type.

In the hash_append function call, the flavor argument is mandatory, and I'm not convinced this is necessary. In the earlier discussion, Peter stated that this is intentional, so that there is less opportunity to forget that argument when it is actually needed. Presumably, the main purpose is to ensure portable behavior across machines with different byte order. However, such portability is not always required, and in fact the default flavor (both in name and in hash_append template parameters) does not offer this portability. The empty brace initializer "{}" that is used for the default flavor also doesn't indicate a particular flavor, other than "whatever the default is, I don't care". Clearly, the library treats one set of parameters as the default in all respects but the requirement to always specify it in hash_append arguments. This doesn't make sense to me. Either make the default flavor optional, or don't offer a default (not in name, not in template parameters) and require users to always specify a specific flavor.
My preference would be the former, because default_flavor seems like a sensible default, and, since hash portability is not free in terms of performance, it should be explicitly opted into by users (the "don't pay for what you don't use" principle).

The second problem is the need to manually forward the flavor argument in tag_invoke overloads. I think having to pass around unused arguments is not a good design. This can be mitigated by wrapping the flavor argument in the wrapper/proxy that is passed as tag_invoke's h argument, something like this:

    template< typename Hash, typename Flavor >
    class hash_proxy
    {
    public:
        using hash_type = Hash;
        using flavor_type = Flavor;

    private:
        Hash& h;

    public:
        explicit hash_proxy( Hash& h ) noexcept : h( h ) {}

        template< typename T >
        void append( T const& data )
        {
            // As if hash_append( h, flavor_type{}, data ) here
        }
    };

    template< typename Flavor, typename Hash, typename T >
    void hash_append( Hash& h, T const& data )
    {
        hash_proxy< Hash, Flavor > hp( h );
        tag_invoke( hash_append_tag{}, hp, data );
    }

    template< typename Hash, typename T >
    void hash_append( Hash& h, T const& data )
    {
        hash_append< default_flavor >( h, data );
    }

    class my_class
    {
    private:
        int x;
        short y;
        char z;

        template< typename H >
        friend void tag_invoke( hash_append_tag, H& h, my_class const& data )
        {
            h.append( data.x );
            h.append( data.y );
            h.append( data.z );
        }
    };

Lastly on this topic, "flavor" doesn't sound like the right word to me. It sounds like a parameter that controls the hash algorithm's specifics, whereas it actually has no relation to the hash algorithm at all. I think "options" would be a better name for this parameter.

* Did you try to use the library? With which compiler(s)? Did you have any problems?

Yes, I tried to compile a few toy examples with gcc 13.2. I found that the library does not prefer a user-provided tag_invoke overload when the user's type is also described with Boost.Describe, and the code fails to compile because of overload resolution ambiguity.
https://github.com/pdimov/hash2/issues/28

I think tag_invoke should not cause ambiguities, even if the type matches other categories of types natively supported by the library. If the user defined tag_invoke, he clearly intended this function to implement hashing.

I also wrote a simple benchmark to see how the library's performance compares to OpenSSL, and used the same program to test how one would wrap OpenSSL primitives in Boost.Hash2 interfaces. The benchmark is here:

https://github.com/Lastique/hash2/blob/6bd5dafbcb05417014d0b85336a1245f5f186...

Compiled with:

    g++ -O3 -march=rocketlake -std=c++17 -I../include -I../../boost -o hash2_ossl_perf hash2_ossl_perf.cpp -lssl -lcrypto

Here's the output on my machine:

    sha1_160 (1024 bytes): 1220.093078 MiB/s
    sha1_160 (1048576 bytes): 1317.424120 MiB/s
    sha1_160 (16777216 bytes): 1318.296514 MiB/s
    sha2_256 (1024 bytes): 406.139530 MiB/s
    sha2_256 (1048576 bytes): 430.141013 MiB/s
    sha2_256 (16777216 bytes): 429.341694 MiB/s
    sha2_512 (1024 bytes): 629.593514 MiB/s
    sha2_512 (1048576 bytes): 712.100583 MiB/s
    sha2_512 (16777216 bytes): 711.549426 MiB/s
    openssl_sha1_160 (1024 bytes): 1531.142208 MiB/s
    openssl_sha1_160 (1048576 bytes): 2759.062140 MiB/s
    openssl_sha1_160 (16777216 bytes): 2738.579482 MiB/s
    openssl_sha2_256 (1024 bytes): 1541.291824 MiB/s
    openssl_sha2_256 (1048576 bytes): 2462.035414 MiB/s
    openssl_sha2_256 (16777216 bytes): 2449.770143 MiB/s
    openssl_sha2_512 (1024 bytes): 781.187505 MiB/s
    openssl_sha2_512 (1048576 bytes): 1086.239840 MiB/s
    openssl_sha2_512 (16777216 bytes): 1082.922782 MiB/s

So, depending on the input message size, OpenSSL is faster by 1.24x to 5.72x. Interestingly, despite what is said in the Boost.Hash2 documentation (and confirmed by the test), OpenSSL's implementation of SHA2-256 is faster than SHA2-512. This could be because of the SHA ISA extensions in my CPU.

* What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
I do not use the library in real projects. I think the separation between hash functions and hashed types is a very useful design idea; I like it very much. However, the way HashAlgorithm is defined significantly limits the usefulness of the Boost.Hash2 library, as it prevents users from plugging in existing hash algorithm implementations, even just to experiment. I noted other issues in the review, but I think this one is critical to the library's adoption.

When I wrote the review, I was torn between "conditional accept" and "reject", as I really do like the general concept the library offers, and yet I don't see myself adopting it because of the limitations. But since Peter indicated that he is not willing to address those limitations, I will have to vote REJECT.

Besides the HashAlgorithm requirements, it looks like the library could use some work anyway. In particular, the mentioned tag_invoke ambiguity should be resolved, and the member-style API for hash_append needs to be explored. These things may affect the protocol used to add hashing support to user types, and changes to HashAlgorithm would affect hash algorithms, so these issues need to be sorted out before acceptance and adoption. I hope to see the library improved and submitted for a second review.

Thank you Peter for submitting the library, and Matt for managing the review.