
This is my review of the proposed boost::hash2 library.
# What is your evaluation of the design?
The base idea, separating the hash algorithm from the actual hashing
of objects, is a brilliant one.
The library consists of two parts:
* a collection of hashing algorithms, some of them of crypto quality
* a C++ object hashing facility
Given this design, I would not have expected the library to provide
a thorough collection of hash algorithms, but it's a good thing to get
them. I would have expected, however, to see some guidance about how
to integrate an external hashing library. It is unfortunate that this
part is missing, because it turns out to be a lot harder than it
should be.
Several design decisions could be improved in my opinion:
* the requirements on the HashAlgorithm concept are too strict, and have
no rationale in the documentation
* copy constructibility and copy assignment are unusual across current
hash implementations, as has been noted by other reviewers. I can
understand a rationale for copy construction; copy assignment looks very
weird to me. The only rationale I can think of is resetting the internal
state to a defaulted state, which could be a dedicated operation.
* creation with a seed (either as a uint64 or as a byte array) as well as
default construction is required. That seems odd. The documentation, for
example, strongly advises against using siphash without a seed. I would
prefer that it be a compile-time error to use siphash without a seed /
with an incorrectly sized seed. I have no problem with adding a key/seed
to algorithms that do not mandate one, though.
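One way to make a missing or mis-sized seed a compile-time error is to accept the key only as a fixed-size array and delete default construction. A minimal sketch (keyed_hash and its toy mixing are my own illustration, not the library's siphash interface; the 16-byte key size mirrors siphash's):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical keyed algorithm: a fixed-size key is the only way to
// construct it, so "no seed" or a wrongly sized seed cannot compile.
class keyed_hash
{
    std::uint64_t state_;

public:
    keyed_hash() = delete; // default construction forbidden

    explicit keyed_hash( std::array<unsigned char, 16> const& key )
        : state_( 0 )
    {
        for( unsigned char b: key ) state_ = state_ * 31 + b; // toy mixing
    }

    void update( void const* data, std::size_t n )
    {
        auto p = static_cast<unsigned char const*>( data );
        for( std::size_t i = 0; i < n; ++i ) state_ = state_ * 31 + p[ i ];
    }

    std::uint64_t result() { return state_; }
};

static_assert( !std::is_default_constructible<keyed_hash>::value,
    "using the algorithm without a seed is rejected at compile time" );
```

An algorithm that does not mandate a key could simply also provide the default constructor; the point is that the keyed ones would not.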
But, since the library itself does not create the HashAlgorithm
instances (there is one notable exception to that), it should not
put additional requirements beyond what it uses. The HashAlgorithm
concept could be:
```cpp
struct HashAlgorithm
{
    using result_type = /*integral or array-like*/;

    static constexpr int block_size = /*...*/; // optional

    void update( void const* data, std::size_t n );

    result_type result();
};
```
Interestingly enough, most of the object hashing machinery works if you
provide it with such an algorithm (neither copyable nor default
constructible). I failed to see where default construction is used
outside tests, for example.
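A minimal model of that reduced concept could look like this (a toy FNV-1a-style mixer of my own, purely illustrative, not one of the library's algorithms):

```cpp
#include <cstddef>
#include <cstdint>

// Toy streaming algorithm modelling the reduced concept: move-only,
// not default constructible, not copy assignable.
class minimal_hash
{
    std::uint64_t state_;

public:
    using result_type = std::uint64_t;

    explicit minimal_hash( std::uint64_t seed )
        : state_( seed ^ 0xcbf29ce484222325ull ) {}

    minimal_hash( minimal_hash const& ) = delete;
    minimal_hash& operator=( minimal_hash const& ) = delete;
    minimal_hash( minimal_hash&& ) = default;

    void update( void const* data, std::size_t n )
    {
        auto p = static_cast<unsigned char const*>( data );
        for( std::size_t i = 0; i < n; ++i )
        {
            state_ ^= p[ i ];
            state_ *= 0x100000001b3ull; // FNV-1a prime, toy usage
        }
    }

    result_type result() { return state_; }
};
```

Feeding the same bytes in one call or split across several calls yields the same digest, which is all hash_append fundamentally needs from the algorithm.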
* the use of void const* + size in the interface is controversial, as
several people would prefer a span<>. There are good reasons for not
using span, and this should not be an issue, as it should only impact
HashAlgorithm writers. It is in my opinion an unfortunate design choice
that this internal interface leaks to the end user. The functionality is
already provided by hash_append_range, which could be the advertised way
of hashing bytes with this library (i.e. when mixing hashing of objects
and hashing of bytes, e.g. when implementing custom hashing for a type).
* constexpr use mandates the use of unsigned char const*. It could
therefore be the primary supported interface over void const*, to avoid
interface duplication.
* the requirement on result extension is also uncommon. While I
understand the rationale, it should be noted that it is likely to cause
cross-language interoperability issues, as result extension is not
necessarily specified by the algorithm. Implementations in other
languages are unlikely to provide it, making the feature a niche one,
unless the underlying algorithm specification defines it.
* most of the functionality is provided by the hash_append function, and the
hash_append_xxx helpers. This is a neat design.
* The Flavor template argument allows customizing the hashing of objects,
to get platform-independent hashes. This is a good idea, but it comes
with some problems:
* the flavor is repeated on each hash_append call. I don't think a
flavor changing over time is a realistic use case. If I were to
integrate the library in our code base, the first thing I would do is
write a wrapper to get rid of that flavor argument at every call site.
* the default flavor uses std::uint64_t for sizes. That seems odd: if
I'm not asking for a specific flavor, I'm not asking for the hash to be
consistent across different systems, so std::size_t should be used
there instead. This looked to me like a "pay for what you don't use"
situation, so I ran a quick near worst-case benchmark on our arm
platform (hashing a std::vector of the three integers 1-2-3 a million
times), and found that the code using size_t runs 15% faster.
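The wrapper I have in mind is a one-liner that bakes the flavor in once. A sketch against stand-ins (native_flavor, lib_hash_append and toy_hash are placeholders of my own; real code would forward to boost::hash2::hash_append with the chosen flavor):

```cpp
#include <cstddef>
#include <cstdint>

// Stand-ins for the library's flavor / hash_append pair, just to make
// the wrapper idea concrete.
struct native_flavor { using size_type = std::size_t; };

struct toy_hash
{
    std::uint64_t s = 0;
    void update( void const* p, std::size_t n )
    {
        auto b = static_cast<unsigned char const*>( p );
        for( std::size_t i = 0; i < n; ++i ) s = s * 31 + b[ i ];
    }
    std::uint64_t result() { return s; }
};

template<class Hash, class Flavor, class T>
void lib_hash_append( Hash& h, Flavor const&, T const& v )
{
    h.update( &v, sizeof( v ) ); // placeholder for the real dispatch
}

// The wrapper: call sites never mention the flavor again.
template<class Hash, class T>
void my_hash_append( Hash& h, T const& v )
{
    lib_hash_append( h, native_flavor(), v );
}
```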
# What is your evaluation of the implementation?
The implementation looks good. Code is clean, very readable.
A minor remark on hash_append_unordered_range: m should be declared as
std::size_t (or std::uint64_t). The current
typename std::iterator_traits<It>::difference_type m = 0; is wrong.
I'm not at ease with hash_append_unordered_range. The fact that it
reduces each element hash to 64 bits, simply sums those values, and then
hashes the sum does not look right. If the goal is simply to avoid
unintended collisions, it is probably fine. If the goal is to protect
against an attacker deliberately trying to provoke collisions, it may do
the trick, although intuitively xoring the full results would be better
at that job than summing truncated ones. If it is to preserve the
cryptographic properties of the hash function, then it looks wrong.
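As I read it, the combining step amounts to something like the following simplified model (my own reconstruction, not the library's exact code): hash each element independently to 64 bits, sum the results, then feed the sum to the outer hash. The sum makes the result order-independent, which is the point, but it also discards everything beyond 64 bits per element:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy 64-bit per-element hash (FNV-1a style), standing in for running
// a fresh copy of the hash algorithm over each element.
inline std::uint64_t element_hash( std::uint64_t v )
{
    std::uint64_t h = 0xcbf29ce484222325ull;
    unsigned char bytes[ sizeof( v ) ];
    std::memcpy( bytes, &v, sizeof( v ) );
    for( std::size_t i = 0; i < sizeof( v ); ++i )
    {
        h ^= bytes[ i ];
        h *= 0x100000001b3ull;
    }
    return h;
}

// Simplified model of the unordered combining step: per-element hashes
// are summed (mod 2^64), so element order cannot affect the result.
template<class It>
std::uint64_t unordered_digest( It first, It last )
{
    std::uint64_t sum = 0;
    for( ; first != last; ++first ) sum += element_hash( *first );
    return sum; // the real code then hashes this 64-bit sum
}
```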
Another remark is about size serialization. The library relies on
std::tuple_size to identify container-like structs, but does not include
the size in the hash (or is it because of is_contiguously_hashable?).
This leads to unintuitive results, such as:
std::array