
This is my review of the proposed boost::hash2 library.

# What is your evaluation of the design?

The base idea, separating the hash algorithm from the actual hashing of objects, is a brilliant one. The library consists of two parts:

* a collection of hashing algorithms, some of them of cryptographic quality
* a C++ object hashing facility

Given this design, I would not have expected the library to provide a thorough collection of hash algorithms, but it's a good thing to get them. I would have expected, however, some guidance on how to integrate an external hashing library. It is unfortunate that this part is missing, because it seems much harder than it should be.

Several design decisions could be improved in my opinion:

* The requirements on the HashAlgorithm concept are too strict, and have no rationale in the documentation.
* Copy constructibility and copy assignment are unusual across current hash implementations, as other reviewers have noted. I can understand a rationale for copy construction; copy assignment looks very weird to me. The only rationale I can think of is resetting the internal state to a defaulted state, which could be a dedicated operation instead.
* Creation with a seed (either as a uint64 or as an array), as well as default construction, is required. This seems odd. The documentation, for example, strongly advises against using siphash without a seed. I would prefer that using siphash without a seed, or with an incorrectly sized seed, be a compile-time error. I have no problem with adding a key/seed to algorithms that do not mandate one. But since the library itself does not create the HashAlgorithm instances (with one notable exception), it should not put additional requirements on them beyond what it actually uses.
The HashAlgorithm concept could be reduced to:

    struct HashAlgorithm
    {
        using result_type = /* integral or array-like */;
        static constexpr int block_size = /* ... */; // optional

        void update( void const* data, std::size_t n );
        result_type result();
    };

Interestingly enough, most of the object hashing machinery works if you provide it with such an algorithm (neither copyable nor default constructible). I failed to see where default construction is used outside the tests, for example.

* The use of void const* + size in the interface is controversial, as several people would prefer a span<>. There are good reasons for not using span, and this should not be an issue, as it should only affect HashAlgorithm writers. It is in my opinion an unfortunate design choice that this internal interface leaks to the end user. The functionality is already provided by hash_append_range, which could be the advertised way of hashing bytes with this library (i.e., when mixing object hashing and byte hashing, e.g. when implementing custom hashing for a type).
* constexpr use mandates unsigned char const*, so it could be the primary supported interface instead of void const*, to avoid interface duplication.
* The requirement on result extension is also uncommon. While I understand the rationale, it is likely to cause interoperability issues across languages, as result extension is not necessarily specified by the algorithm. Implementations in other languages are unlikely to provide it, making the feature a niche one unless the underlying algorithm specification defines it.
* Most of the functionality is provided by the hash_append function and the hash_append_xxx helpers. This is a neat design.
* The Flavor template argument allows customizing the hashing of objects, to get platform-independent hashes. This is a good idea, but it comes with some problems:
* The flavor is repeated in every hash_append call. I don't think a flavor that changes over time is a realistic use case.
If I were to integrate the library into our code base, the first thing I would do is write a wrapper to get rid of that flavor argument at every call site.

* The default flavor uses std::uint64_t for sizes. This seems odd: if I'm not asking for a specific flavor, I'm not asking for the hash to be consistent across different systems, so std::size_t should be used there instead. This looked to me like a "pay for what you don't use" situation, so I ran a quick near-worst-case benchmark on our arm platform (hashing a std::vector of the three integers 1, 2, 3 a million times), and found that the code using size_t runs 15% faster.

# What is your evaluation of the implementation?

The implementation looks good. The code is clean and very readable.

A minor remark on hash_append_unordered_range: m should be declared as std::size_t (or std::uint64_t). The current typename std::iterator_traits<It>::difference_type m = 0; is wrong.

I'm not at ease with hash_append_unordered_range. The fact that it reduces each element hash to 64 bits, simply sums the element hashes, and then hashes that, does not look right. If the goal is simply to avoid unintended collisions, it's probably fine. If the goal is to protect against an attacker deliberately trying to provoke collisions, it may do the trick, although intuitively xoring the whole results would be better at that job than summing truncated ones. If it's meant to preserve the cryptographic properties of the hash function, then it looks wrong.

Another remark is about size serialization. The library relies on std::tuple_size to identify container-like structs, but does not include the size in the hash (or is it because of is_contiguously_hashable?).
This leads to unintuitive results, such as:

    std::array<int, 4> a{ 1, 2, 3, 4 };
    std::span<int, 4> s( a );
    Hash h1, h2, h3, h4;
    hash_append( h1, {}, a );
    hash_append_range( h2, {}, begin( a ), end( a ) );
    assert( h1.result() == h2.result() ); // TRUE
    hash_append( h3, {}, s );
    hash_append_range( h4, {}, begin( s ), end( s ) );
    assert( h3.result() == h4.result() ); // FAILS

I'm not sure there's a general solution to this issue, though. But since std::array models a container, not including its size in the digest may not be the wisest choice (the doc talks about plain arrays, not about std::array, so I don't know whether this is a bug or a feature).

# What is your evaluation of the documentation?

The order of the documentation is odd. I would expect it to first explain how to use the library with the provided algorithms to hash C++ data or bytes, but it first goes into deep details about, for example, how result() can be called multiple times, or what requirements are set on custom hash algorithms.

A description of the scope of the library would be welcome. It provides both hashing for use by containers and cryptographic-quality hashing. These two seem unrelated at first, and the library lies in a not clearly defined in-between. This has been explained during the review, the arguments given are convincing, and they should find their place in the documentation.

The motivation for machine-architecture-independent hashes is lacking. Usually, the way to achieve this is to use a domain-specific canonical form, because endianness or word size is unlikely to be the only variable at stake here.

Overall, though, the documentation has all the relevant information. A tutorial / quick start would be a nice addition. I dislike the "single page style" documentation, and hope the tooling can be improved in this regard.

# What is your evaluation of the potential usefulness of the library? Do you already use it in industry?

I have a need for an easy-to-use sha1 that I can easily compile for embedded targets.
Having it in boost actually lowers the number of external dependencies, so yes, this is definitely useful. I currently don't have a need for the object hashing facility, but would definitely use it if the need arises.

# Did you try to use the library? With which compiler(s)? Did you have any problems?

I tried with several flavors of gcc, cross-compiling for bare-metal cortex-m0 targets and embedded arm32 linux, as well as standard debian x64. Everything worked.

# How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

I did not review the algorithm implementations. I spent around 15 hours studying the library's documentation and internals.

# Are you knowledgeable about the problem domain?

No specific knowledge of the domain.

# What is your final recommendation?

I would recommend to CONDITIONALLY ACCEPT the library into boost, the conditions being:

* demonstrate the usage of an existing external hash library within the boost::hash2 framework;
* remove default construction and too-short key construction for algorithms where this is actually harmful (such as siphash). I don't have an issue with providing constructors with keys for algorithms that do not mandate them, but boost code should encourage correct usage of the algorithms it proposes, not allow bad usage in the name of genericity.

The following improvements should imho be made, but I don't base my conditional acceptance on them. Some of them can actually be done on top of the proposed library:

* advertise hash_append_range as the recommended way to add untyped data to the hash algorithm (or define an additional hash_append_bytes to fulfill this role);
* add a tutorial / introduction to the doc, and move the whole "hashing byte sequences" item after the "hashing c++ objects" item (if the previous point is addressed, part of it would have to be included in the "hashing c++ objects" item);
* make the copy construction requirement on the hash algorithm a requirement only when hash_append_unordered_range is used. Hashing unordered ranges is an open problem, other creative solutions may be found, and I see no reason why they would have to involve duplicating the algorithm state. Actually, if the hash is default constructible, copy construction is not needed: you need either default or copy constructibility, not both;
* use std::size_t instead of std::uint64_t in the default flavor, as it is faster on some platforms;
* provide wrappers that avoid repeating the flavor over and over. It is much more convenient for the user to write once using MyHash = hash2::FlavoredHash<hash2::default_flavor>; and then MyHash::append(X) than to repeat the parameter over and over.

Finally, I want to thank Peter for developing and proposing this library for boost, and Matt for managing the review.

Regards,

Julien

Julien Blanc wrote: ...
* remove default construction and too-short key construction for algorithms where this is actually harmful (such as siphash). I don't have an issue with providing constructors with keys for algorithms that do not mandate them, but boost code should encourage correct usage of the algorithms it proposes, not allow bad usage in the name of genericity. ... * make the copy construction requirement on the hash algorithm a requirement only when hash_append_unordered_range is used. Hashing unordered ranges is an open problem, other creative solutions may be found, and I see no reason why they would have to involve duplicating the algorithm state. Actually, if the hash is default constructible, copy construction is not needed: you need either default or copy constructibility, not both.
I already answered this (twice) - using default constructed hash algorithm instances in hash_append_unordered_range makes collisions seed-independent. But this aside, these two points also contradict each other. The first one wants to remove default construction from SipHash; the second one wants to use default construction in hash_append_unordered_range. If both are applied, SipHash will not work in hash_append_unordered_range. As for improving hash_append_unordered_range, my current ideas are here: https://lists.boost.org/Archives/boost/2024/12/258744.php

On Friday, December 13, 2024 at 19:24 +0200, Peter Dimov via Boost wrote:
Julien Blanc wrote: ...
* remove default construction and too-short key construction for algorithms where this is actually harmful (such as siphash). I don't have an issue with providing constructors with keys for algorithms that do not mandate them, but boost code should encourage correct usage of the algorithms it proposes, not allow bad usage in the name of genericity. ... * make the copy construction requirement on the hash algorithm a requirement only when hash_append_unordered_range is used. Hashing unordered ranges is an open problem, other creative solutions may be found, and I see no reason why they would have to involve duplicating the algorithm state. Actually, if the hash is default constructible, copy construction is not needed: you need either default or copy constructibility, not both.
I already answered this (twice) - using default constructed hash algorithm instances in hash_append_unordered_range makes collisions seed-independent.
And it seems we won't agree here. IMHO it's not worth it: the potential for misuse and its consequences far outweigh the ease of implementation it provides.
But this aside, these two points also contradict each other. The first one wants to remove default construction from SipHash; the second one wants to use default construction in hash_append_unordered_range. If both are applied, SipHash will not work in hash_append_unordered_range.
Obviously I failed at explaining my point, because I see no contradiction. My point is that for unordered range hashing, you can keep your scheme and need either:

1. a default constructed hash, or
2. a hash copy constructed from the given hash, which makes the result less predictable.

But not both. The current implementation uses #2, but it could use #1 if #2 is not provided and #1 is available (a typical case for #1 would be openssl md5, whereas a typical case for #2 is siphash in boost::hash2). If you have neither #1 nor #2, then you cannot hash unordered containers, but you can hash everything else, and that will be fine for a lot of users.

Hope I made myself a little clearer,

Regards,

Julien