
his is my review of the proposed boost::hash2 library
# What is your evaluation of the design?
The base idea, separating the hash algorithm from the actual hashing
of objects, is a brilliant one.
The library consist on two parts:
* a collection of hashing algorithm, some of them of crypto quality
* a c++ objects hashing facility
Given this design, i would not have expected from the library to provide
a thorough collection of hash algorithm. But it's a good think to get
them. I would have expected, however, to see some guidance about how
to integrate an external hashing library. Actually it is unfortunate
that this part is missing, because it seems a lot more harder than it
should be.
Several design decisions could be improved in my opinion:
* the requirements on the HashAlgorithm concept are too strict, and have
no rationale in the documentation
* copy constructibility and copy assignment are unusual across current
hash implementations, as it has been noted by other reviewers. I can
understand a rationale for copy construct, copy assignment looks very
weird to me. The only rationale i can think of is resetting the internal
state to a defaulted state, which could be a specific operation.
* creation with seed (either as uint64 or as an array) as well as default
construction is required. It seems odd. The documentation, for example,
strongly advice agains using siphash without a seed. I would prefer that
it is a compile time error to use siphash without a seed / with an
incorrectly sized seed. I've no problem adding a key/seed to algorithms
that do not mandate them, though.
But, since the library itself does not create the HashAlgorithm
themselve (there is one notable exception to that that), it should not
put additional requirements beyond what it uses. The HashAlgorithm
concept could be:
struct HashAlgorithm
{
using result_type = /*integral or array-like*/;
static constexpr int block_size = /*...*/; // optional
void update( void const* data, std::size_t n );
result_type result();
};
Interestingly enough, most of object hashing works if you provide it
with such an algorithm (not copyable and not default constructible). I
failed to see where default construction is used outside tests, for
example.
* the usage of void const* + size in interface is controversial, as several
people would prefer a span<>. There are good reasons for not using span,
and this is not something that should be an issue, as this should only
impact HashAlgorithm writers. It is in my opinion an unfortunate design
choice that this internal interface is leaked to the end user. The
functionality is already provided by hash_append_range, which could
be the advertised way of hashing bytes via that library (ie, when mixing
hashing of object and hashing of bytes, eg when implementing custom
hashing for a type).
* constexpr use mandates usage of unsigned char const*. So it could be
the primary supported interface over void const*, to avoid interface
duplication.
* the requirement on result extension is also uncommon. While i understand
the rationale, it should be noted that it is likely to cause some
language interoperability issues, as it is not necessarily
specified by the algorithm. Implementations in other languages are
unlikely to provide them, making the feature a niche one, unless the
underlying algorithm specification provides it.
* most of the functionality is provided by the hash_append function, and the
hash_append_xxx helpers. This is a neat design.
* The Flavor template argument allows to customize hashing of objects, to
get platform-independants hashes. This is a good idea, but it comes with
some problems:
* the flavor is repeated over each hash_append call. I don't think it is
a realistic use case that the flavor changes over time. If i were to
integrate the library in our code base, the first thing i'll do is
write a wrapper to get rid of that flavor argument at every call site.
* the default flavor uses std::uint64_t for size. It seems odd. If i'm
not asking for a specific flavor, i'm not asking the hash to be
consistent accross different systems. So std::size_t should be used
there instead. This looked to me like a pay for what you don't use
situation, so i ran a quick near worst-case benchmark on our arm
platform (hashing a million times a std::vector of the three integers
1-2-3), to find out that it makes a the code using size_t runs 15%
faster.
# What is your evaluation of the implementation?
The implementation looks good. Code is clean, very readable.
A minor remark in hash_append_unordered_range : m should be declared as
std::size_t (or std::uint64_t). The current
typename std::iterator_traits<It>::difference_type m = 0; is wrong.
I'm not at ease with hash_append_unordered_range. The fact that it
reduces the hash to 64 bits, simply making a sum of the element hashes,
and then hashing that, does not look right. If the goal is simply to
avoid unintended collisions, it's probably fine. If the goal is to preserve
against an attacker mindfully trying to provoke collisions, it may do the
trick,although intuitively xoring the whole results would be better at that
job than summing truncated results. If it's to preserve cryptographic
properties of the hash function, then it looks wrong.
Another remark is about size serialization. The library relies on
std::tuple_size to identify container-like structs, but not include the size
in the hash (or is it because of is_contiguously_hashable?). This leads to
unintuitive results, such as:
std::array

Julien Blanc wrote: ...
* remove default construction and too short key construction for algorithms where this is actually harmful (such as siphash). I don't have an issue with providing constructors with keys for algorithms that do not mandate it. But boost code should encourage correct usage of the algorithms it propose, not allow bad usage in the name of genericity. ... * makes copy construction requirement on hash algorithm a requirement only if hash_append_unordered_range is used. Hashing unordeder ranges is an open problem, some creative other solutions may be found, and i see no reasons why they would involve duplicating the algorithm state. Actually, if the hash is default constructible, then it is not needed. You need either one of default or copy constructible, not both.
I already answered this (twice) - using default constructed hash algorithm instances in hash_append_unordered_range makes collisions seed-independent. But this aside, these two points also contradict each other. The first one wants to remove default construction from SipHash; the second one wants to use default construction in hash_append_unordered_range. If both are applied, SipHash will not work in hash_append_unordered_range. As for improving hash_append_unordered_range, my current ideas are here: https://lists.boost.org/Archives/boost/2024/12/258744.php

Le vendredi 13 décembre 2024 à 19:24 +0200, Peter Dimov via Boost a écrit :
Julien Blanc wrote: ...
* remove default construction and too short key construction for algorithms where this is actually harmful (such as siphash). I don't have an issue with providing constructors with keys for algorithms that do not mandate it. But boost code should encourage correct usage of the algorithms it propose, not allow bad usage in the name of genericity. ... * makes copy construction requirement on hash algorithm a requirement only if hash_append_unordered_range is used. Hashing unordeder ranges is an open problem, some creative other solutions may be found, and i see no reasons why they would involve duplicating the algorithm state. Actually, if the hash is default constructible, then it is not needed. You need either one of default or copy constructible, not both.
I already answered this (twice) - using default constructed hash algorithm instances in hash_append_unordered_range makes collisions seed-independent.
And it seems we won't agree here. IMHO it's not worth it. The potential for misuse and its consequences far outweight the ease of implementation that it provides.
But this aside, these two points also contradict each other. The first one wants to remove default construction from SipHash; the second one wants to use default construction in hash_append_unordered_range. If both are applied, SipHash will not work in hash_append_unordered_range.
Obviously i failed at explaining my point, because i see no contradiction. My point is that for unordered range hashing, you can keep your scheme and need either: 1. use a default constructed hash 2. use a copy constructed hash from the given hash. This makes the hash less predictable. But not both. The current implementation uses #2, but it could use #1 if #2 is not provided and #1 is available (typical use case for #1 would be openssl md5, whereas typical for #2 is siphash in boost::hash2). If you have neither #1 or #2, then you cannot hash unordered containers. But you can hash everything else. And it will be fine for a lot of users. Hope i made myself a little clearer, Regards, Julien
participants (3)
-
Julien Blanc
-
Matt Borland
-
Peter Dimov