
Dear all,

The review of Hash2 by Peter Dimov and Christian Mazakas begins today, Saturday, December 7th, and runs through Sunday, December 15th, 2024.

Code: https://github.com/pdimov/hash2
Docs: https://pdimov.github.io/hash2/

Please provide feedback on the following general topics:

- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
- Did you try to use the library? With which compiler(s)? Did you have any problems?
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
- Are you knowledgeable about the problem domain?

Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).

There's already been a flurry of conversation, so I thank you all for that. If you have any questions during this process, feel free to reach out to me. Additionally, if you would like to submit your review privately, which I will anonymize for the review report, you may send it directly to me at: matt@mattborland.com.

Matt Borland
Review Manager and C++ Alliance Staff Engineer

Matt Borland wrote:
Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
I know that many do not want to clone the entire Boost to place the library in libs/hash2, so I've added support for building it with CMake as a standalone project - it uses FetchContent to acquire the necessary Boost dependencies.

By default, `cmake --build .` builds the benchmark and the examples. If you want to build and run the tests, use

cmake --build . --target tests
ctest

For Windows, add `--config Debug` to the build line and `-C Debug` to the ctest line (or Release).

On Sat, Dec 7, 2024 at 6:14 AM Peter Dimov via Boost
I know that many do not want to clone the entire Boost to place the library in libs/hash2, so I've added support for building it with CMake as a standalone project - it uses FetchContent to acquire the necessary Boost dependencies.
It is easier for me to put it in libs/hash2. I cloned it there and then ran this command line from hash2:

cmake -G "Visual Studio 17 2022" -A x64 -B bin64 -DCMAKE_TOOLCHAIN_FILE="C:/vcpkg/scripts/buildsystems/vcpkg.cmake" -DVCPKG_CHAINLOAD_TOOLCHAIN_FILE="C:/users/vinnie/src/toolchains/msvc.cmake"

FetchContent was invoked. How do I build a Visual Studio solution for hash2 which uses my existing local clone of the superproject? Thanks

This is my review of boost hash2.
Since I encourage people to write smaller reviews, here's mine. Please
weigh it accordingly.
I cloned the repo and ran the tests, all worked.
I only studied the docs (not the implementation). If there's one person
whose implementation I blindly trust, it's Peter.
That plus writing the review took about an hour.
The choice of hash algorithms looks correct to me.
HashAlgorithm should allow me to plug in any other algorithm, e.g. from
OpenSSL.
I also think that update(void*,size_t) is the correct interface, because it
doesn't require a reinterpret_cast to an inappropriate type,
nor does it create any dependency.
The built-in algorithms for hashing are a good choice, although I wonder
about the usefulness of pointers. A pointer doesn't really tell you whether
it is itself the information or merely points to it (what do you hash for a
const char*?). I would limit it to void* and function pointers. However,
many other hash libraries have it built in too, so following precedent here
seems OK to me.
Another example that I would be interested in is how one would hash a
`boost::json::value`. If it's just calling `hash_append_unordered_range`,
can recursion become an issue? If so, how do I address it?
I think this library does solve multiple issues with std::hash/boost.hash:
1. having an actual hash algorithm
2. ability to choose a hash algorithm
3. convenient extensibility
4. endian awareness
I think (1) is the most important, because with std::hash everyone needing
their class to be hashed must know how to implement/combine hashes without
knowing which hash algorithm is used underneath.
By using an established algorithm and just being required to provide the
bytes to be hashed, the user does not need to know anything about hashing
algorithms.
It is also very useful as is, e.g. I can drop it into a project using
unordered_map by just replacing the `Hash` parameter.
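For instance, here is a minimal sketch of such a drop-in adaptor (illustrative only; `hash2_for` is my name, not the library's, and the call pattern follows the documentation's "Use with Unordered Containers" chapter):

#include <boost/hash2/fnv1a.hpp>
#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/get_integral_result.hpp>
#include <string>
#include <unordered_map>

// Adapts any Hash2 algorithm into a Hash usable by unordered containers.
template<class T, class H = boost::hash2::fnv1a_64>
struct hash2_for
{
    std::size_t operator()( T const& v ) const
    {
        H h;
        boost::hash2::hash_append( h, {}, v );
        return boost::hash2::get_integral_result<std::size_t>( h.result() );
    }
};

// Usage: only the Hash parameter changes.
std::unordered_map<std::string, int, hash2_for<std::string>> m;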
I think the design of the library is correct and recommend to ACCEPT this
library into Boost.

Klemens Morgenstern wrote:
Another example that I would be interested in is how one would hash a `boost::json::value`.
Good suggestion, thanks.
In theory, this is how one would do it:

namespace boost
{
namespace json
{

template<class Hash, class Flavor, class V>
std::enable_if_t< std::is_same<V, value>::value >
tag_invoke( boost::hash2::hash_append_tag const&, Hash& h, Flavor const& f, V const& v )
{
    boost::json::visit( [&](auto const& w){ boost::hash2::hash_append( h, f, w ); }, v );
}

} // namespace json
} // namespace boost

Oops, forgot to include the kind().
The correct way to write the tag_invoke overload today is

namespace boost
{
namespace json
{

template<class Hash, class Flavor, class V>
std::enable_if_t< std::is_same<V, value>::value >
tag_invoke( boost::hash2::hash_append_tag const&, Hash& h, Flavor const& f, V const& v )
{
    boost::hash2::hash_append( h, f, v.kind() );
    boost::json::visit( [&](auto const& w){ boost::hash2::hash_append( h, f, w ); }, v );
}

} // namespace json
} // namespace boost
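With that overload in scope, usage would be (a hedged sketch; fnv1a_64 is an arbitrary choice of algorithm):

#include <boost/hash2/fnv1a.hpp>
#include <boost/json.hpp>

void hash_json_value()
{
    boost::json::value jv = { { "k", 1 } };

    boost::hash2::fnv1a_64 h;
    boost::hash2::hash_append( h, {}, jv ); // dispatches to the tag_invoke overload

    std::uint64_t r = h.result();
    (void)r;
}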

On Sat, Dec 7, 2024 at 8:16 AM Peter Dimov via Boost
The other subtlety here is that boost::json::object should be considered an unordered range, rather than an ordinary range.
Why?
It doesn't expose a `hasher` typedef, which is the heuristic is_unordered_range uses, but it already specializes is_unordered_range here:
Should we add a hasher typedef? Thanks

On Sat, Dec 7, 2024 at 8:28 AM Vinnie Falco
On Sat, Dec 7, 2024 at 8:16 AM Peter Dimov via Boost < boost@lists.boost.org> wrote:
The other subtlety here is that boost::json::object should be considered an unordered range, rather than an ordinary range.
Why?
Never mind... it should be unordered because equality is independent of the ordering of elements (and json::object preserves order). Thanks

On Sat, Dec 7, 2024 at 7:27 AM Klemens Morgenstern via Boost < boost@lists.boost.org> wrote:
I wonder about the usefulness of pointers.
A use-case for pointers is:

std::unordered_map< T*, std::weak_ptr<T> >

where the T is constructed via std::make_shared. Note: if any std::weak_ptr references the control block created by std::make_shared after the lifetime of all shared owners ended, the memory occupied by T persists until all weak owners get destroyed as well, which may be undesirable if sizeof(T) is large. Thanks

On Sat, Dec 7, 2024 at 5:18 PM Vinnie Falco via Boost
On Sat, Dec 7, 2024 at 7:27 AM Klemens Morgenstern via Boost < boost@lists.boost.org> wrote:
I wonder about the usefulness of pointers.
A use-case for pointers is:
std::unordered_map< T*, std::weak_ptr<T> >
Where the T is constructed via std::make_shared. Note:
I think this case itself can easily be handled by people casting T* to
std::uintptr_t.
I think the bigger issue is that users cannot do this when the pointer is
nested inside a type.
E.g., how can a user hash
std::tuple
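To illustrate the point with a sketch (append_ptr is a hypothetical helper, not library code):

#include <boost/hash2/hash_append.hpp>
#include <cstdint>

template<class Hash, class Flavor, class T>
void append_ptr( Hash& h, Flavor const& f, T* p )
{
    // A bare pointer key can be hashed manually by casting it to an integer.
    boost::hash2::hash_append( h, f, reinterpret_cast<std::uintptr_t>( p ) );
}

// But when the pointer sits inside std::tuple<T*, int>, the tuple's members
// are serialized inside the library, so there is no place for the user to
// insert this cast; pointer support has to be built in.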

Hi,
This is my review of this library
-----------------------------------
- What is your evaluation of the design?
-----------------------------------
I think the design is good, I like the clear separation between the
algorithm and user-defined types, and I was sympathetic with the
original "Types don’t know #" proposal.
I think the included hash algorithms are varied and practical.
Some minor comments (opinions, don't take them as "bad design decisions"):
- "HashAlgorithm::block_size" is an int, IMHO it should std::size_t. A
negative value has no sense (and the documentation does not say anything
if an algorithm defines it as a negative number, I suppose it's UB/bug).
Maybe we should document some valid range that the library supports.
- No move constructor/assignment for algorithms. I don't know if those
make any sense, but I imagine user implementations with those members
are perfectly valid.
- The semantics for copy constructor/assignment might not be clear. The
library internally uses the copy constructor (e.g. when implementing the
unordered range hash), so it certainly expects something more concrete
than "equivalent to the original". I imagine a copy/assigned algorithm
is required to lead to the same output when updated with the same data.
Maybe this should be explained in the documentation with an example.
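For instance, the documentation could show something like this (my sketch, valid for any conforming algorithm H):

#include <cassert>
#include <cstddef>

template<class H>
void copy_semantics_demo( unsigned char const* p, std::size_t n,
                          unsigned char const* q, std::size_t m )
{
    H h1;
    h1.update( p, n );

    H h2( h1 ); // copy taken mid-stream

    h1.update( q, m );
    h2.update( q, m );

    // A copy fed the same subsequent data must produce the same digest.
    assert( h1.result() == h2.result() );
}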
- "result" sounds like a getter/const function returning the already
calculated value, but it's a required "final" step that might even
return different values if repeatedly called. Naming it to something
like "produce_result" might be more self-explaining.
- Should the HashAlgorithm support several calls to "update" and "result",
interleaving both? Does the library call "result" only once as the final
step, or more times? I think the required guarantees should be stated more
explicitly in the design/documentation.
- I don't know if SHA-1/RIPEMD-128 should be in the library at all
(insecure and not sure if useful in any real context).
- The original proposal mentions a "variadic version of hash_append",
(https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#variadic)
maybe that would be a useful utility in the library.
- The original proposal uses a different syntax:

void
operator()(void const* key, std::size_t len) noexcept
{
    state_.Update(key, len);
}

explicit
operator result_type() noexcept
{
    std::uint64_t h1, h2;
    state_.Final(&h1, &h2);
    return h1;
}
Which are the advantages of the new approach (update/result) vs the
original proposal?
- The hash_append_unordered_range function is surprising for people like
me that are not experts on hashing, but it's an important concept for
users. The original proposal explains the problem
(https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#unordere...),
I feel the problem should also be explained in the documentation of
Hash2, and how the library design solves it (using a copy of the
HashAlgorithm, as proposed in n3980).
N3980 also says that "And there are various other schemes. They are all
implementable. But they each have their advantages and disadvantages.
Therefore this proposal proposes none of them. Should the future expose
the ideal hash_append specification for unordered containers, it can
always be added at that time."
Why does Hash2 offer a single alternative? Would it be a good idea to
offer different unordered hashing functions (each with different
trade-offs) so that a user can choose the best one?
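For context, this is my understanding of the copy-based scheme from N3980 that hash_append_unordered_range uses (a conceptual sketch, not the library's actual code):

#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/get_integral_result.hpp>
#include <cstdint>

template<class Hash, class Flavor, class It>
void unordered_hash_sketch( Hash& h, Flavor const& f, It first, It last )
{
    std::uint64_t w = 0;

    for( ; first != last; ++first )
    {
        Hash h2( h ); // the copy carries the current state/seed
        boost::hash2::hash_append( h2, f, *first );
        w += boost::hash2::get_integral_result<std::uint64_t>( h2.result() );
    }

    // Addition is commutative, so the combined value is order-independent.
    boost::hash2::hash_append( h, f, w );
}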
-N3980 proposes a "type-erase hasher"
(https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#type_era...),
would it be interesting to include it in Hash2?
- I didn't know about the "tag_invoke" customization point trend. I
imagine that this avoids polluting all namespaces with the chosen
function name.
However, AFAIK library proposals such as "std::execution" have abandoned
the tag_invoke approach and will use member functions as customization
points
(https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2855r1.html).
Are member functions a better customization point for user-defined
classes than these tag_invoke friend functions that rely on ADL (and its
associated potential problems)?
Say:

//declare a member function in your type that will be
//called by boost::hash2::hash_append()
template<class Hash, class Flavor> // signature illustrative
void hash_append( Hash& h, Flavor const& f ) const;

Ion Gaztañaga wrote:
Hi,
This is my review of this library
Thank you for the review. That's a lot of questions. :-)
- "HashAlgorithm::block_size" is an int, IMHO it should std::size_t.
Maybe. The type doesn't matter that much for integral constants, but I suppose size_t is more correct.
- No move constructor/assigment for algorithms. I don't know if those make any sense, but I imagine user implementations with those members are perfectly valid.
Hash algorithms should be allowed to provide a move constructor or a move assignment operator, yes.
- The semantics for copy constructor/assignment might not be clear. The library internally uses the copy constructor (e.g. when implementing the unordered range hash), so it certainly expects something more concrete than "equivalent to the original".
"Equivalent to the original" is the strongest possible requirement. I don't see how the library can possibly expect anything more than that. In particular,
I imagine a copy/assigned algorithm is required to lead to the same output when updated with the same data.
... this is implied by "equivalent to the original", because if the copy produced a different output, it wouldn't be equivalent.
- Should the HashAlgorithm support calls to "update" and "result" several times, interleaving both calls?
It should, yes. Any calls are allowed, in any order.
- I don't know if SHA-1/RIPEMD-128 should be in the library at all (insecure and not sure if useful in any real context).
SHA-1 is still widely used, e.g. for Git commit hashes. We added RIPEMD-128 in order to have a recommended replacement for MD5 in the case where the digest has to be exactly 128 bits.
- The original proposal mentions a "variadic version of hash_append", (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#variadic) maybe that would be a useful utility in the library.
It might be, yes.
- The original proposal uses a different syntax:
void operator()(void const* key, std::size_t len) noexcept { state_.Update(key, len); }
explicit operator result_type() noexcept { std::uint64_t h1, h2; state_.Final(&h1, &h2); return h1; }
Which are the advantages of the new approach (update/result) vs the original proposal?
It seems obvious to me that named functions are better than unnamed functions.
- The hash_append_unordered_range function is surprising for people like me that are not experts on hashing, but it's an important concept for users. The original proposal explains the problem (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#unordered), I feel the problem should also be explained in the documentation of Hash2, and how the library design solves it (using a copy of the HashAlgorithm, as proposed in n3980).
N3980 also says that "And there are various other schemes. They are all implementable. But they each have their advantages and disadvantages. Therefore this proposal proposes none of them. Should the future expose the ideal hash_append specification for unordered containers, it can always be added at that time."
Why does Hash2 offer a single alternative?
I don't know of a better one. Constructing a temporary std::set is not a serious alternative. In addition to being allocating, slow, and potentially throwing, it also relies on the existence of operator<, which an unordered container doesn't require.
-N3980 proposes a "type-erase hasher" (https://www.open- std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#type_erased_hasher ), would it be interesting to include it in Hash2?
I had the type erased hasher in the "planned additions" list for a long time, partly because I thought I needed it for the hash2sum example. But it doesn't meet the hash algorithm requirements, because it doesn't have the correct constructors and because the copy constructor is allocating and potentially throwing (the latter is not, strictly speaking, a violation but it's still undesirable.)
Is tag_invoke superior because we can declare customization points for classes that we can't modify?
Yes.
But we need access to data members so we need privileged access to that type...
Not always. E.g. std::complex or std::error_code provide accessors for the necessary members.
- The documentation shows (in the chapter "Use with Unordered Containers") several "hash" classes that are needed to be able to use Hash2 with std/boost unordered_xx classes (or boost::intrusive::unordered_xxx types). I think it would be a good idea to include such utilities in this Hash2 library (under names like "hasher", "seeded_hasher").
One of these, or maybe some different one, will eventually be included, once I decide on which.
- The design says "The size_type member type of the flavor affects how container and range sizes (typically of type size_t) are serialized. Since the size of size_t in bytes can vary, serializing the type directly results in different hash values when the code is compiled for 64 bit or for 32 bit. Using a fixed width type avoids this".
I don't agree with this, between 32 and 64 bit compilation you will have differences when hashing user-defined types that have pointers, stored std::size_t/ptrdiff_t members, etc..
There are still a lot of types for which the size type is the only source of variability, e.g. std::vector<int>.
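As an illustration of the claim (a sketch, assuming the little_endian_flavor described in the docs):

#include <boost/hash2/fnv1a.hpp>
#include <boost/hash2/hash_append.hpp>
#include <vector>

std::uint64_t portable_hash( std::vector<int> const& v )
{
    boost::hash2::fnv1a_64 h;

    // The vector's size is serialized as the flavor's fixed-width
    // size_type (uint64_t), so 32-bit and 64-bit builds feed the
    // algorithm identical bytes.
    boost::hash2::hash_append( h, boost::hash2::little_endian_flavor(), v );

    return h.result();
}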
IMHO xxx_flavours defined in the library should define size_type as "std::size_t". Those wanting portable hashes between 32 and 64 bit need to do a bigger effort than configuring Hash2, and probably defining size_type as std::uint64_t will reduce the performance of 32 bit users...
It doesn't. I initially had size_type defined as uint32_t for similar reasons, but it doesn't really matter, so I changed it to uint64_t.
- One thing I don't like is that the library depends on container_hash/describe and mp11. I would expect this library to be dependency-free (other than Config/Assert). I especially dislike the "Describe" dependency. Is there a way to support Described classes without depending on that library?
No.
- There is an asymmetry between "writeNNle" and "writeNNbe" implementations (little vs big endian). In little endian implementations "detail::is_constant_evaluated() and endianness" is used to be able to use memcpy in some cases (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail/write.hpp#L26) whereas in the big endian implementation this is not done (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail/write.hpp#L73).
Probably few people use big endian platforms (except those like me who still need to compile for PowerPC targets...), but I would expect an equivalent implementation for the big-endian functions.
The implementations are equivalent in practice. GCC and Clang recognize the pattern used and turn it into a single mov; the memcpy path is only necessary for MSVC, which doesn't support big endian architectures.
- class Str example: N is missing the type, missing <algorithm> include, operator== has the wrong type...
Sorry, why does operator== have the wrong type?
- All "Use with Unordered Containers" examples should be compilable, including the needed headers.
The examples are actually compilable, I just stripped the #includes when embedding them in the documentation to avoid the unnecessary repetition. Maybe I shouldn't have. :-)

On 08/12/2024 at 12:42, Peter Dimov via Boost wrote:
... this is implied by "equivalent to the original", because if the copy produced a different output, it wouldn't be equivalent.
- Should the HashAlgorithm support calls to "update" and "result" several times, interleaving both calls?
It should, yes. Any calls are allowed, in any order.
Understood.
- I don't know if SHA-1/RIPEMD-128 should be in the library at all (insecure and not sure if useful in any real context).
SHA-1 is still widely used, e.g. for Git commit hashes. We added RIPEMD-128 in order to have a recommended replacement for MD5 in the case where the digest has to be exactly 128 bits.
Ok.
- The original proposal mentions a "variadic version of hash_append", (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#variadic) maybe that would be a useful utility in the library.
It might be, yes.
Let's add it to the to-do list ;-)
- The hash_append_unordered_range function is surprising for people like me that are not experts on hashing, but it's an important concept for users. The original proposal explains the problem (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#unordered), I feel the problem should also be explained in the documentation of Hash2, and how the library design solves it (using a copy of the HashAlgorithm, as proposed in n3980).
N3980 also says that "And there are various other schemes. They are all implementable. But they each have their advantages and disadvantages. Therefore this proposal proposes none of them. Should the future expose the ideal hash_append specification for unordered containers, it can always be added at that time."
Why does Hash2 offer a single alternative?
I don't know of a better one.
Constructing a temporary std::set is not a serious alternative. In addition to being allocating, slow, and potentially throwing, it also relies on the existence of operator<, which an unordered container doesn't require.
Ok. Yes, the other alternative described in the original paper is much worse. I'm not sure how much the hash quality is weakened with the current implementation: all element hashes are converted to 64 bits, added, and a final hash is computed. It seems to me that combining and shuffling all the original result_type bits from each individual hash would provide a better hash result for the whole sequence, but I'm certainly not able to prove that mathematically...
-N3980 proposes a "type-erase hasher" (https://www.open- std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#type_erased_hasher ), would it be interesting to include it in Hash2?
I had the type erased hasher in the "planned additions" list for a long time, partly because I thought I needed it for the hash2sum example. But it doesn't meet the hash algorithm requirements, because it doesn't have the correct constructors and because the copy constructor is allocating and potentially throwing (the latter is not, strictly speaking, a violation but it's still undesirable.)
Type-erasing requires trade-offs, but for things like the pimpl idiom, you can't template (or it's not desirable to template) code that previously lived in a .cpp file. One option is for boost::hash2::type_erased_hasher to have enough small buffer optimization (at least for the hashes implemented in the boost::hash2 library) so that allocation can be avoided in most cases. A good type_erased_hasher is not trivial and certainly useful.
- The documentation shows (in the chapter "Use with Unordered Containers") several "hash" classes that are needed to be able to use Hash2 with std/boost unordered_xx classes (or boost::intrusive::unordered_xxx types). I think it would be a good idea to include such utilities in this Hash2 library (under names like "hasher", "seeded_hasher".
One of these, or maybe some different one, will eventually be included, once I decide on which.
Great.
- The design says "The size_type member type of the flavor affects how container and range sizes (typically of type size_t) are serialized. Since the size of size_t in bytes can vary, serializing the type directly results in different hash values when the code is compiled for 64 bit or for 32 bit. Using a fixed width type avoids this".
I don't agree with this, between 32 and 64 bit compilation you will have differences when hashing user-defined types that have pointers, stored std::size_t/ptrdiff_t members, etc..
There are still a lot of types for which the size type is the only source of variability, e.g. std::vector<int>.
IMHO xxx_flavours defined in the library should define size_type as "std::size_t". Those wanting portable hashes between 32 and 64 bit need to do a bigger effort than configuring Hash2, and probably defining size_type as std::uint64_t will reduce the performance of 32 bit users...
It doesn't.
Show me the numbers ;-) Maybe if benchmark results are added to the documentation this can be shown.
- There is an asymmetry between "writeNNle" and "writeNNbe" implementations (little vs big endian). In little endian implementations "detail::is_constant_evaluated() and endianness" is used to be able to use memcpy in some cases (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail/write.hpp#L26) whereas in the big endian implementation this is not done (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail/write.hpp#L73).
Probably few people use big endian platforms (except those like me who still need to compile for PowerPC targets...), but I would expect an equivalent implementation for the big-endian functions.
The implementations are equivalent in practice. GCC and Clang recognize the pattern used and turn it into a single mov; the memcpy path is only necessary for MSVC, which doesn't support big endian architectures.
Ok, understood.
- class Str example: N is missing the type, missing <algorithm> include, operator== has the wrong type...
Sorry, why does operator== have the wrong type?
Because operator== for Str compares X instances instead of Str instances:

class Str
{
private:

    static constexpr N = 32;
    std::uint8_t size_ = 0;
    char data_[ N ] = {};

public:

    friend constexpr bool operator==( X const& x1, X const& x2 )
    {
        return x1.size_ == x2.size_ &&
            std::equal( x1.data_, x1.data_ + x1.size_, x2.data_ );
    }
- All "Use with Unordered Containers" examples should be compilable, including the needed headers.
The examples are actually compilable, I just stripped the #includes when embedding them in the documentation to avoid the unnecessary repetition. Maybe I shouldn't have. :-)
I've found (at least in my libraries) that some users like to copy-paste examples in the doc and want them to work without needing to figure out what's missing. Best, Ion

Ion Gaztañaga wrote:
- I don't know if SHA-1/RIPEMD-128 should be in the library at all (insecure and not sure if useful in any real context).
Amusingly enough, I'm _right now_ looking at an API documentation which requires HMAC-SHA1. :-) (Proprietary and NDAed, of course.)

On Mon, 9 Dec 2024, 11:54 pm Peter Dimov via Boost,
Amusingly enough, I'm _right now_ looking at an API documentation which requires HMAC-SHA1. :-)
Amusingly enough, there isn't a problem with using HMAC-SHA1 if the key is being used as a secret, is unpredictable, etc. This should make its use in message authentication perfectly OK if the conditions for effective authentication through use of HMAC are actually met in the system in question.

*My review:*
- What is your evaluation of the design?
Excellent. The trade-offs in API design have been well explained. There is
nothing in the design that will get in the way of any use case of mine.
- What is your evaluation of the implementation?
My assumption is that it will be carefully thought out and close to
perfectly implemented. The author has form.
- What is your evaluation of the documentation?
Comprehensive and well laid out with plenty of examples.
One reading of the documentation left me convinced that I am ready to
incorporate the library into new and existing code, and know how to.
- What is your evaluation of the potential usefulness of the library? Do
you already use it in industry?
Very useful. When I write software it is generally facing the internet and
is generally interfacing with servers that require HMAC.
The library fulfils my three main requirements out of the box:
- prevention of DoS hash attacks
- computation of HMAC
- a trustworthy and *easy to use* SHA-256 implementation
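For illustration, the HMAC requirement in code (a sketch; hmac_sha2_256 and the seed-taking constructor follow the library's docs):

#include <boost/hash2/sha2.hpp>
#include <cstddef>

boost::hash2::digest<32> make_tag( unsigned char const* key, std::size_t keylen,
                                   void const* msg, std::size_t n )
{
    boost::hash2::hmac_sha2_256 h( key, keylen ); // key passed as the seed
    h.update( msg, n );
    return h.result();
}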
- Did you try to use the library? With which compiler(s)? Did
you have any problems?
No
- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
An afternoon of reading the documentation and evaluating the library's
relevance to me.
- Are you knowledgeable about the problem domain?
Yes
My vote: ACCEPT
My comment: If the ISO committee rejected a library as well designed and as
useful as this, that is merely a continuation of the stream of evidence of
its incompetence.
R

On Sat, Dec 7, 2024 at 7:47 AM Matt Borland via Boost
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024. Please provide feedback on the following general topics:
- What is your evaluation of the design?
Overall, I like it. First, some general notes.
I know that the authors want robust C++11 support, but I think the
library suffers from having a C++11-only implementation. I think if
you want to support C++11, then by all means do that. But the rest of
us would like interfaces that use: std::byte instead of sequences of
unsigned char and void*; std::span instead of a pointer and size;
language concepts instead of documentation-as-concepts; etc. If you
macro-guard your C++11 interfaces, you can have your cake and eat it
too. I for one have not used C++11 since sometime in 2014. While C++11
users exist, they are a small minority now. Requiring them to define
a macro to get 11-compatible interfaces does not seem onerous to me,
and the rest of us should be able to use more modern stuff.
One example is digest<>. The docs say, "digest<N> is a constexpr-friendly
class template similar to std::array<unsigned char, N>. It is used to store
the resulting message digest of hash algorithms such as SHA2-256 or
RIPEMD-160." So, exactly like C++17 std::array then? :)
- What is your evaluation of the implementation?
Looks good -- easy to read; no red flags. There are lots of low-level shenanigans, and they are done without UB, as far as I can tell.
- What is your evaluation of the documentation?
I think the intro could really benefit from a small snippet showing a simple
use of the library. In particular, I cannot tell at the beginning whether I
always need to write something like HashAlgorithm or not. I'm sure it will
become clear later, but earlier is better; providing a simple use, like
hashing a string with a reasonable choice of algorithm for use in a
std::unordered_map, would help.
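Something like this is what I have in mind (my sketch; fnv1a_64 standing in for "a reasonable choice of algorithm"):

#include <boost/hash2/fnv1a.hpp>
#include <boost/hash2/hash_append.hpp>
#include <string>

std::uint64_t hash_string( std::string const& s )
{
    boost::hash2::fnv1a_64 h;              // pick an algorithm
    boost::hash2::hash_append( h, {}, s ); // feed it the object
    return h.result();                     // take the result
}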
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
I think it will be very useful. I look forward to not having to write hash_append over and over again in different projects. :) I do not use the library currently.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
I did not try to use it.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
I read the documentation, and read the implementation (though not the hash algorithms themselves). I spent about 4 hours doing this review.
- Are you knowledgeable about the problem domain?
As a user of hashing containers, yes. As for anything cryptographic, not at all.
Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).
I vote to ACCEPT this library. Zach

Matt, et al,

I have read over the documentation for the Hash2 library, and what follows are my notes. First and foremost, *I appreciate the effort and detail that has gone into getting the documentation this far - no mean amount of work*. I also appreciate that there are plenty of links to external topics and many examples - all helpful, and perhaps some of the things I mention here are already planned.

*General Points:*

1. Just a little confused as to the name "Hash2" when there is no "Hash1"? Perhaps this was explained earlier and I missed it.

2. With all the doc in one single page I found the absence of an obvious Search function awkward, but perhaps this is forthcoming through another effort? I found I wanted to search for words such as "required", "optional", "hash_append" and so on.

3. I agree with other commenters that the all-in-one-page approach is daunting ("Grandmother's fruitcake"). I would have liked to see at least all the top-level topics as separate pages, though ideally Reference material is better as one page per construct (class, method, property, structure) so a dev can link to that page when they are working hard on an implementation of the construct.

*Contents:*

1. The "Supported Compilers" section should be moved up after the Overview, as part of a section "Getting Started with Hash2".

2. The Getting Started section should work through one example that a newcomer to hashing can understand: a. installing Hash2; b. linking/compiler requirements or options; c. a short but meaningful example; d. running and explaining the output.

3. The Getting Started section could use a "Common Errors and Issues" section detailing the most likely errors, misconceptions or stumbling blocks and how to respond, and/or a Troubleshooting section for the library as a whole.

4. Consider following this with a "Tutorial" section - repeating the steps of the Getting Started section, but with a meatier and potentially useful example.

5. Usage Examples: the code examples have almost no comments and would benefit from being extensively commented. The same goes for other code snippets throughout the doc, though "Examples" takes priority.

6. The "Example" sections in the Reference would also benefit from a sentence/para explaining what the example does, in particular noting any gotchas or potential pain points.

7. In the doc: "Described classes (using Boost.Describe)". Boost.Describe should be a link to that lib's docs. Same for mentions of other libraries.

8. In the block_size section for Hash Algorithm Requirements, there are these two paragraphs (my bold): "Cryptographic hash functions provide a block_size value, which is their block size (e.g. 64 for MD5, 128 for SHA2-512) and is *required* in order to implement the corresponding HMAC." and "block_size is an *optional* requirement." These statements appear to be contradictory - and there is no such thing as an "optional requirement". There is such a thing as an "optional parameter/setting". Consider searching the doc for the phrase "optional requirement" and amending, and searching for other "required/optional" statements to verify there is no ambiguity.

9. Nit: replace "e.g." with "for example," - it is so much easier to read. Same for other contractions (i.e. = "that is", etc. = "and so on").

10. Should there be an "Acknowledgements" section near the end - design input, testers?

I am the technical writer for the C++ Alliance. Hope this helps. Best of luck with the project.
- Peter Turcan

On Mon, Dec 9, 2024 at 12:45 PM Matt Borland via Boost <boost@lists.boost.org> wrote:
I vote to ACCEPT this library.
Zach
Zach, thank you for your review and comments.
Matt

Peter Turcan wrote:
1. Just a little confused as to the name "Hash2" when there is no "Hash1" ? Perhaps this was explained earlier and I missed it.
"Hash1" is the library formerly known as Functional/Hash, and now named ContainerHash. It defines the function object boost::hash. (Whereas Hash2 defines its stuff in namespace boost::hash2.)

On 10/12/2024 at 1:47, Peter Dimov via Boost wrote:
Peter Turcan wrote:
1. Just a little confused as to the name "Hash2" when there is no "Hash1" ? Perhaps this was explained earlier and I missed it.
"Hash1" is the library formerly known as Functional/Hash, and now named ContainerHash. It defines the function object boost::hash.
(Whereas Hash2 defines its stuff in namespace boost::hash2.)
Since Hash2 depends on ContainerHash for several traits and functions, both have the same dependencies (Config, Assert, Describe, Mp11), and both are about hashing... Wouldn't it make sense to merge both libraries into a single Hash library? Best, Ion

Ion Gaztañaga wrote:
On 10/12/2024 at 1:47, Peter Dimov via Boost wrote:
Peter Turcan wrote:
1. Just a little confused as to the name "Hash2" when there is no "Hash1" ? Perhaps this was explained earlier and I missed it.
"Hash1" is the library formerly known as Functional/Hash, and now named ContainerHash. It defines the function object boost::hash.
(Whereas Hash2 defines its stuff in namespace boost::hash2.)
Since Hash2 depends on ContainerHash for several traits and functions both have the same dependencies (Config, Assert, Describe, Mp11), and both are about hashing... Wouldn't make sense to merge both libraries into a single Hash library?
Maybe, eventually. ContainerHash doesn't accept a seed at the moment, and there isn't that much to be gained by the merge. We also have existing hash_value overloads in user code, and these have to continue working.

The review of Hash2 by Peter Dimov and Christian Mazakas begins
Christopher's Review.
DISCLAIMER: I know the review manager.
I highly respect Matt as a person and
through his open-source work. Furthermore,
I work in a dedicated fashion with Matt
on new, potentially upcoming Boost-like
projects. I do not receive any open-source
money. I operate as a self-funded,
independent researcher in FOSS.
Here is my review of the proposed
Boost.Hash2
- What is your evaluation of the design?
Exceptional. It is clean with terse,
well-documented code. The design addresses
a sorely neglected area in C/C++ programming
and makes strong, decisive strides to move
forward.
- What is your evaluation of the implementation?
It's simply fantastic. And I will qualify
this further below.
- What is your evaluation of the documentation?
It is fine and I could use the library
immediately within about 3 minutes.
- What is your evaluation of the potential usefulness
of the library? Do you already use it in industry?
The library is highly useful. At the moment,
it lacks some of the extreme quality
aspects that my industrial area would mandate.
But it is not far off from becoming a viable
secure component. I'd be willing to work
toward improving it with the authors.
What I'd like to see short-term:
* Handle enhanced compiler warnings.
* Include a subset of NIST testing.
* Fuzzing run(s) on some hashes in CI.
* I think SHA-3 is worthy of inclusion.
- Did you try to use the library? With which
compiler(s)? Did you have any problems?
Yes. I performed a whole slew of PC-based
tests with the standard vanilla compiler
crew including GCC, VC, clang in
a few versions. Given the (already)
extremely well-tested functionality
of Hash2, there were no problems.
But here comes the fun part. When I studied
the implementation it seemed highly portable.
So I put it onto several embedded bare-metal
systems. The code compiled and ran on 32-bit
microcontrollers and even ran on an 8-bit
controller. Rarely do we find such portability
and quality in Boost-proposed libraries,
or anywhere for that matter. Kudos on
a job well done! I can provide ROM-sizes
and microsecond run-time comparisons
if needed.
- How much effort did you put into your evaluation?
A glance? A quick reading? In-depth study?
An in-depth study.
- Are you knowledgeable about the problem domain?
Over the decades, my "day-job" has involved
delivery of hundreds of millions of
high-volume, safety-critical devices,
mostly having some kind of crypto in them.
I think I'm somewhat knowledgeable
in this area. I've had determining
roles including library author,
library client, security auditor and
also specified the commercial acquisition
of numerous libraries and silicon vendor
crypto-accelerators.
Ensure to explicitly include with your review:
ACCEPT, REJECT, or CONDITIONAL ACCEPT
(with acceptance conditions).
I vote to unconditionally ACCEPT Hash2.
Thank you to the authors for this seminal
work and bold step forward.
Thanks Matt for managing this rich
and lively review.
Kind regards, Christopher Kormanyos
On Tuesday, December 10, 2024 at 02:56:10 PM GMT+1, Matt Borland via Boost
I have read over the documentation for the Hash2 library, and what follows are my notes.
Peter, thank you for the documentation review. I always find these beneficial. Matt

Christopher Kormanyos wrote:
But here comes the fun part. When I studied the implementation it seemed highly portable. So I put it onto several embedded bare-metal systems. The code compiled and ran on 32-bit microcontrollers and even ran on an 8-bit controller. Rarely do we find such portability and quality in Boost-proposed libraries, or anywhere for that matter. Kudos on a job well done!
Thanks Christopher. :-)
What I'd like to see short-term: * Handle enhanced compiler warnings.
Would you please go into more detail here? We currently test with -Wall -Wextra.

What I'd like to see short-term: * Handle enhanced compiler warnings.
Would you please go into more detail here? We currently test with -Wall -Wextra.
Yes. I would like to. I appreciate
and benefit from (at bare minimum in GCC):
-Wall -Wextra -Wpedantic -Wconversion -Wsign-conversion
And sometimes I even go for -Werror
You would face a much larger onslaught
of warnings with a commercial-grade CERT-C++
checker, so that set above is actually
rather slack in my world. I can provide
an actual build-log with line numbers
if needed. Or I can just PR the changes
if you like.
The whole warning-game is just like, hey,
why the noise? 90% of the warnings
are useless, but every once in a while you
get a nugget that (might) save you from
a really big crash. So I just activate
a bunch of warnings, handle them all and
simultaneously weed out the good stuff.
Here is my typical warning set "on the metal"
real-time-cpp/ref_app/target/app/make/app_make.gmk at 8cc0ce3dd26c30eeba21a1826716f7604f5ef2e4 · ckormanyos/real-time-cpp
(Source code for the book Real-Time C++, by Christopher Kormanyos)
Kind regards, Christopher

On Tue, Dec 10, 2024 at 12:07 PM Christopher Kormanyos via Boost < boost@lists.boost.org> wrote:
What I'd like to see short-term: * Handle enhanced compiler warnings. * Include a subset of NIST testing. * Fuzzing run(s) on some hashes in CI. * I think SHA-3 is worthy of inclusion.
I was wondering what kind of NIST testing you're alluding to. We do have some copy-pasted test vectors from myriad PDFs, but for a good portion of the algorithms, we're using the test vectors outlined here:

https://github.com/pdimov/hash2/blob/7a25f8518692b657e9272884519519fbaca2ec5...
https://csrc.nist.gov/Projects/Cryptographic-Algorithm-Validation-Program/Se...

Towards the bottom there, under Test Vectors, one can download a .zip folder full of .rsp files which were used to verify the output of the applicable algorithms. I tried to avoid relying on those too much during testing because I wanted something quasi-human-readable and understandable, so if there was an applicable PDF, I preferred that. I did think about a scenario where we would've committed the .rsp files to the repo and run them during CI or some such as part of a much more extensive test suite.

Which hashes would you like to see fuzzed, assuming "all of them" is off the table? And how long should we fuzz as well? I'm not sure if we can exhaustively fuzz the algorithms as part of a normal CI infrastructure.

- Christian

* Include a subset of NIST testing.
I did think about a scenario where we would've committed the .rsp files to the repo and run them during CI or some such as part of a much more extensive test suite.
Thanks for responding, Christian. I think including and running the short test vectors might be a good compromise for your CI. They run rather quickly and I don't think that'll slow down CI too much.
* Fuzzing run(s) on some hashes in CI.
Which hashes would you like to see fuzzed, assuming "all of them" is off the table? And how long should we fuzz as well?
I think one of them, your pick, like SHA2-512
maybe for 5 minutes in CI would add robustness.
Maybe limit the fuzzed messages to 2 < len < 1024
bytes? This is not required in any way,
although I think this would increase robustness.
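Something along these lines would do (a hypothetical libFuzzer harness, not something already in the repo):

#include <boost/hash2/sha2.hpp>
#include <cstddef>
#include <cstdint>

extern "C" int LLVMFuzzerTestOneInput( const std::uint8_t* data, std::size_t size )
{
    if( size < 2 || size > 1024 ) return 0; // the suggested length bounds

    boost::hash2::sha2_512 h;
    h.update( data, size );
    (void)h.result(); // exercise finalization as well

    return 0;
}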
Christopher.

Which hashes would you like to see fuzzed, assuming "all of them" is off the table? And how long should we fuzz as well?
I think one of them, your pick, like SHA2-512 maybe for 5 minutes in CI would add robustness. Maybe limit the fuzzed messages to 2 < len < 1024 bytes? This is not required in any way, although I think this would increase robustness.
You could link up with OpenSSL to obtain
control values for the dynamically selected,
fuzzed message(s).
Christopher

Zach Laine wrote:
One example is digest<>. The docs say, "digest<N> is a constexpr-friendly class template similar to std::array<unsigned char, N>. It is used to store the resulting message digest of hash algorithms such as SHA2-256 or RIPEMD-160." So, exactly like C++17 std::array then? :)
Almost. Most of std::array does indeed become constexpr in C++17, except for comparisons. For those you have to wait until C++20. digest<> also has operator<< and to_string.
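A quick illustration of those extras (a sketch; per the reference, sha2_256::result() returns a digest<32>):

#include <boost/hash2/sha2.hpp>
#include <iostream>
#include <string>

int main()
{
    boost::hash2::sha2_256 h;
    h.update( "abc", 3 );

    boost::hash2::digest<32> d = h.result();

    std::cout << d << '\n';         // operator<<
    std::string s = to_string( d ); // found by ADL
}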

Zach Laine wrote:
Why the discrepancy between unsigned char const * in the hash algorithm ctor, and void const * in update()?
The discrepancy is because both used to take `unsigned char const*`, and then I changed `update` to `void const*` to avoid all the reinterpret casts, but didn't change the constructor. However, the key often comes in char[] or string form, especially when using HMAC to interact with existing APIs, so I think that the constructor is also going to be changed soon to take `void const*` to match.

Zach Laine wrote:
Why define your own endian? You already have an external dependency it seems, on boost::container_hash, so why not use the logic and constants from Boost.Endian?
Endian would be an avoidable dependency, and people don't like those. ContainerHash is different because the reason to depend on it is to pick up user specializations of its traits. If you specialize ContainerHash's traits, it's exceedingly likely that you want the same specializations for Hash2, and it's friendlier to not make users specialize the same trait twice, as would have happened if I duplicated them.

Zach Laine wrote:
Naively, I would have reduced them to:

struct unordered_tag {};

template<class Hash, class Flavor, class T>
constexpr void hash_append( Hash& h, Flavor const& f, T const& v );

template<class Hash, class Flavor, class It>
constexpr void hash_append( Hash& h, Flavor const& f, It first, It last );

template<class Hash, class Flavor, class It>
constexpr void hash_append( Hash& h, Flavor const& f, It first, It last, unordered_tag );

Is there a reason that could not work?
There is, yes. Naming everything hash_append doesn't play well with making hash_append variadic, as has already been suggested a few times. Once you have that, the above overload set becomes a nice trap because hash_append( h, f, "foo", "bar" ); calls the iterator overload.

Zach Laine wrote:
Also, even if you don't change any of the names above, please don't use the phrase "sized_range", since we now have a std::ranges concept with that name that means something substantially different
That thought has occurred to me, but I never came up with a replacement name that I was happy with.
(std::ranges::sized_range requires std::ranges::size(x), for instance, where yours does not).
If we add overloads for the _range functions that actually take ranges and not iterator pairs, the hash_append_sized_range function will however require and use r.size(), instead of taking the iterator distance.

Here is my review of the Boost.Hash2 library. Sorry for the long read.
* Are you knowledgeable about the problem domain?
I use hash functions with containers regularly and I have added support
for hashing for my types using std::hash and Boost.ContainerHash, but I
have little to no knowledge about designing and validating hash
functions. I'm definitely not a specialist in cryptography.
* How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?
A couple days of reading the docs, source code, the proposal and some
other resources linked there or brought up during discussions. And some
time writing toy examples and the review itself.
* What is your evaluation of the design, documentation and implementation?
I'll comment on the three aspects of the library together because in
some instances the three are related. For the implementation, I was
reviewing revision 26b16023dc639ae354e246458719e4b569bc1d97 on develop
(which was the default on checkout from git when I did the review, and
no tag or branch was specified in the review announcement).
My general impression of the documentation is that it is sufficient to
eventually learn the library, but not easy to grasp for a newcomer. The
library doesn't give the user an illustration of how to add support for
hashing to his types or how to compute a hash value for an object until
late in the documentation. I think the description of HashAlgorithm
could be moved to later chapters of the documentation intended for hash
algorithm implementers, and the initial chapters should provide
introduction for hash algorithm users. Also, I feel that some chapters
could use more detailed explanation. For example, the introduction in
Contiguously Hashable Types seems incomplete, as it doesn't explain what
is "contiguously hashable" and what happens when the type is not
contiguously hashable.
I feel that in some cases important details are missing in the text of
the less formal initial chapters, but are present in the later reference
section. Given that there are no links within the text to the reference
sections, and the whole documentation is a giant wall of text, the
documentation is very difficult to navigate to learn these details. As I
wrote my review while I was reading the docs, there were multiple times
when I had to edit my review because I initially thought something was
not documented, only to find it later mentioned somewhere in the
reference sections. Adding cross-links and more details in the initial
informal chapters would be helpful. Also pagination would be very
welcome, although I understand it is difficult with Asciidoc.
I like the general idea of separated hash functions and support for
hashing in types. I think this is a step in the right direction for
improving support for cryptography in C++.
However, the way HashAlgorithm is defined is controversial. In the
earlier discussion I've already suggested renaming the
HashAlgorithm::result() method to finalize(), and Vinnie has agreed to
this change in the proposal. The specification has also been relaxed to
allow hash algorithms that don't support further update() and finalize()
calls after the initial finalize(). In the discussion, Peter has
indicated that there is no intention to adopt this, which I think
significantly limits compatibility with existing hash algorithm
implementations. I have posted a short overview of the existing crypto
libraries here:
https://lists.boost.org/Archives/boost/2024/12/258649.php
I understand that the ability to call result() multiple times and
interleave result() and update() calls has its uses, but I don't think
this feature should be mandatory. First, as noted, the overwhelming
majority of the current, optimized and verified implementations don't
support it, and making the library incompatible with them complicates
the library development and impacts its usefulness to users. Second, the
feature itself is not often needed. The primary use case is extending
the hash value size, which is rather niche. If a hash value
size (or its quality) is not sufficient, I would rather consider a
different hash function entirely than try to extend the existing digest,
as that would arguably result in better quality hashing.
Digest extension is normally only supported by extendable-output hash
functions (XOF). Some cryptographic libraries have support for XOF, but
again, HashAlgorithm is incompatible with them (or at least, those I
checked). For example, in OpenSSL, EVP_DigestFinalXOF is used to
finalize the hash calculation and extract the variable-sized digest in
one call. EVP_DigestFinalXOF is still required to be called no more than
once; no further updates or finalization are allowed. I think the
ability to repeatedly call result() is a poor vehicle to support XOF,
and I prefer the OpenSSL approach.
Another feature that is required by HashAlgorithm and not supported by
many implementations is seeding the algorithm on construction.
Conceptually, seeding is typically understood as initializing the
internal state of the hash algorithm with specific non-default values.
Seeding is not supported by many hash algorithms, for example SHA-1 and
SHA-2, and the conventional workaround is to add salt to the hashed
data. I think HashAlgorithm should allow implementations to use the
seed arguments to the hash algorithm constructors as salt, i.e. the
constructor would implicitly call update().
I think the basic HashAlgorithm concept should require a minimal set of
operations so that existing mainstream implementations of popular
non-XOF hash algorithms can be supported. A more advanced concept, say
named HashAlgorithmXOF, that supports digest expansion can be defined
separately. Something along these lines:
struct HashAlgorithmBase
{
    static constexpr size_t block_size = /*...*/; // optional

    /*
     * Initializes algorithm to the default state.
     */
    HashAlgorithmBase();

    /*
     * Initializes algorithm to an initial state dependent on bytes
     * in [seed, seed + n). If n is zero, equivalent to default
     * construction.
     * The exact meaning of the seed bytes is implementation-defined.
     */
    HashAlgorithmBase( void const* seed, std::size_t n );

    HashAlgorithmBase( HashAlgorithmBase const& r );            // optional
    HashAlgorithmBase& operator=( HashAlgorithmBase const& r ); // optional

    HashAlgorithmBase( HashAlgorithmBase&& r );
    HashAlgorithmBase& operator=( HashAlgorithmBase&& r );

    /*
     * Feeds algorithm with input data. Can be called multiple times.
     * Calling update() after the algorithm has been finalized is
     * undefined behavior, unless otherwise specified by the specific
     * algorithm.
     */
    void update( void const* data, std::size_t n );
};

struct HashAlgorithm : HashAlgorithmBase
{
    using result_type = /* more on this below */;

    using HashAlgorithmBase::HashAlgorithmBase;
    using HashAlgorithmBase::operator=;

    /*
     * Finalizes the algorithm and returns the final digest.
     * Calling finalize() after the algorithm has been finalized is
     * undefined behavior, unless otherwise specified by the specific
     * algorithm.
     */
    result_type finalize();
};

struct HashAlgorithmXOF : HashAlgorithmBase
{
    using result_type = /* more on this below */;

    using HashAlgorithmBase::HashAlgorithmBase;
    using HashAlgorithmBase::operator=;

    /*
     * Finalizes the algorithm and copies the final digest to the buffer
     * of size `size`, in bytes, pointed to by `digest`.
     * Calling finalize() after the algorithm has been finalized is
     * undefined behavior, unless otherwise specified by the specific
     * algorithm.
     */
    void finalize( unsigned char* digest, std::size_t size );

    /*
     * Finalizes the algorithm and returns the final digest of size
     * `size`.
     * Calling finalize() after the algorithm has been finalized is
     * undefined behavior, unless otherwise specified by the specific
     * algorithm.
     *
     * Note: The returned digest is equivalent to the above overload.
     * The difference is that this overload allocates memory for the
     * digest instead of using the user-provided storage.
     */
    result_type finalize( std::size_t size );
};
Above, I have intentionally omitted the constructor from uint64_t as I
find it confusing. In particular, it isn't clear what endianness it uses
to seed the algorithm. It looks like it duplicates hash_append.
In HashAlgorithmXOF, I provided two overloads of finalize(), where one
could be implemented in terms of the other. I'm not particularly
attached to this dualism, but if I had to keep just one of the
overloads, I would keep the first one, the one that fills the
user-provided buffer. It is more flexible, although it is slightly less
safe than the other one. However, I'm not insisting on any specific form
of finalize() in this case.
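For illustration, usage under the proposed XOF concept would look like this (HashXOF stands in for any model of HashAlgorithmXOF, e.g. a hypothetical shake128; only intended usage is shown):

#include <cstddef>
#include <vector>

template<class HashXOF>
void derive_key( std::vector<unsigned char> const& msg, unsigned char* key, std::size_t key_size )
{
    HashXOF h;
    h.update( msg.data(), msg.size() );
    h.finalize( key, key_size ); // single finalization, caller-chosen output size
}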
Ion has already mentioned in his review that HashAlgorithm::block_size
should have type size_t, and I agree. At the very least, it should be an
unsigned integer. BTW, the same applies to hmac::block_size.
Ion also mentioned that hash algorithms should support move construction
and move assignment, and I agree. Also, HashAlgorithm currently requires
hash algorithms to be copy constructible and copy assignable. It looks
like the only place where copyability is required is
hash_append_unordered_range. Copyability is a rather strong requirement,
which may not be supported by all implementations. This is why I marked
copy constructor and copy assignment "optional" in my proposed
interfaces above - copyability should only be required by the library
algorithms that actually need it; those that don't should still work
with non-copyable hash algorithms.
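To illustrate where the copy enters, here is a hypothetical rendering of the unordered-range helper (not the library's actual code; I also assume get_integral_result can be applied to the intermediate digest):

#include <boost/hash2/hash_append.hpp>
#include <boost/hash2/get_integral_result.hpp>
#include <cstdint>

template<class Hash, class Flavor, class It>
void hash_append_unordered_sketch( Hash& h, Flavor const& f, It first, It last )
{
    std::uint64_t w = 0;

    for( ; first != last; ++first )
    {
        Hash h2( h ); // the copy requirement originates here

        boost::hash2::hash_append( h2, f, *first );

        // summation is commutative, so element order doesn't matter
        w += boost::hash2::get_integral_result<std::uint64_t>( h2.result() );
    }

    boost::hash2::hash_append( h, f, w );
}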
The documentation should provide a more detailed description of the
HashAlgorithm::result_type type. Specifically, what interface and
operations it must support. Ideally, the library should provide a
concept, similar to HashAlgorithm. Currently, the documentation only
says it can be an unsigned integer type or a "std::array-like" type,
where the latter is rather nebulous. From the implementation of
get_integral_result it looks like it must provide a data() member
function returning a pointer, and that its size() after default
construction is a constant expression returning at least 8. This, for
example, excludes
std::vector, which would be useful for HashAlgorithmXOF::result_type.
Also on the topic of get_integral_result for "std::array-like" hash
values, a few comments:
- It requires the static size of 8 regardless of the size of the
requested integer, so even if uint32_t is requested it will still
require at least std::array< unsigned char, 8 >. This should not be
necessary.
- It ignores the data beyond the initial 8 bytes. It might be ok for
the purpose of slicing the hash value, but it is inconsistent with
get_integral_result for integers, which attempts to mix bits through
multiplications with a constant. I think the implementation should be
consistent between arrays and integers. The approach of universally
mixing bits for all hash value types seems safer, as it produces
better results for hash values with non-uniform distribution of
high-quality bits across the hash value. It might be possible to reuse
one of the hash algorithms implemented in the library for this bit
mixing.
So I think get_integral_result and HashAlgorithm::result_type should
be better defined by the library. It should be possible for the user
to provide his custom type as HashAlgorithm::result_type, and for that
(a) there must be a list of requirements for that type to satisfy, and
(b) it should be clear how an integer will be derived from it by
get_integral_result. Maybe the user should be able to define his own
get_integral_result behavior for his digest type. The reason why I
think user-defined hash value types are useful is type safety, as it
may be desirable to associate the digest with the algorithm that
produced it.

Regarding the Flavor parameter, I would suggest making the flavor type
a template over the byte order and the container size type, along
these lines:

template< endian ByteOrder = endian::native, typename SizeType = std::size_t >
struct flavor
{
    static constexpr endian byte_order = ByteOrder;
    using size_type = SizeType;
};

using default_flavor = flavor<>;
using little_endian_flavor = flavor< endian::little, std::uint64_t >;
using big_endian_flavor = flavor< endian::big, std::uint64_t >;

// Maybe even this, not sure if it is worth it
template< endian ByteOrder = endian::native, typename SizeType = std::size_t >
inline constexpr flavor< ByteOrder, SizeType > flavor_v{};

This way the user is able to customize the size type without having to
define a new flavor struct. In practice, most of the time container
sizes do not exceed the capacity of uint32_t (and may in fact be
uint32_t or less, with Boost.Container containers), so users might be
tempted to specify that type instead of the default.

This also changes the default_flavor to use size_t for container
sizes, and this is consistent with the default flavor semantics. That
is, default_flavor represents the current hardware properties, which
are both the byte order and the native size type. For portable
hashing, one should use either little_endian_flavor or
big_endian_flavor, which should both have a portable size type.

In the hash_append function call, the flavor argument is mandatory,
and I'm not convinced this is necessary. In the earlier discussion,
Peter stated that this is intentional, so that there is less
opportunity to forget that argument when it is actually needed.
Presumably, the main purpose of this is to ensure portable behavior
across machines with different byte order. However, such portability
is not always required, and in fact the default flavor (both in name,
and in hash_append template parameters) does not offer this
portability. The empty brace initializer "{}" that is used for the
default flavor also doesn't indicate a particular flavor, other than
"whatever the default is, I don't care". Clearly, the library treats
one set of parameters as the default in all respects but the
requirement to always specify it in hash_append arguments. This
doesn't make sense to me. Either make the default flavor optional, or
don't offer a default (not in name, not in template parameters) and
require users to always specify a specific flavor. My preference would
be the former, because default_flavor seems like a sensible default,
and, since hash portability is not free in terms of performance, it
should be explicitly opted into by users (i.e. the "don't pay for what
you don't use" principle).

The second problem is the need to manually forward the flavor argument
in the tag_invoke overloads. I think having to pass around unused
arguments is not a good design. This can be mitigated by wrapping the
flavor argument in a wrapper/proxy that is passed as tag_invoke's h
argument, something like this:

template< typename Hash, typename Flavor >
class hash_proxy
{
public:
    using hash_type = Hash;
    using flavor_type = Flavor;

private:
    Hash& h;

public:
    explicit hash_proxy(Hash& h) noexcept : h(h) {}

    template< typename T >
    void append(T const& data)
    {
        // As if hash_append(h, flavor_type{}, data) here
    }
};

template< typename Flavor, typename Hash, typename T >
void hash_append(Hash& h, T const& data)
{
    hash_proxy< Hash, Flavor > hp(h);
    tag_invoke(hash_append_tag{}, hp, data);
}

template< typename Hash, typename T >
void hash_append(Hash& h, T const& data)
{
    hash_append< default_flavor >(h, data);
}

class my_class
{
private:
    int x;
    short y;
    char z;

    template< typename H >
    friend void tag_invoke(hash_append_tag, H& h, my_class const& data)
    {
        h.append(data.x);
        h.append(data.y);
        h.append(data.z);
    }
};

Lastly on this topic, "flavor" doesn't sound like the correct word to
me. It sounds like a parameter that controls the hash algorithm
specifics, whereas it actually has no relation to the hash algorithm
at all. I think "options" would be a better name for this parameter.

* Did you try to use the library? With which compiler(s)? Did you have any problems?
Yes, I tried to compile a few toy examples with gcc 13.2.

I found that the library does not prefer a user-provided tag_invoke
overload when the user's type is also described with Boost.Describe,
and the code fails to compile because of overload resolution
ambiguity:

https://github.com/pdimov/hash2/issues/28

I think tag_invoke should not cause ambiguities, even if the type
matches other categories of types natively supported by the library.
If the user defined tag_invoke, he clearly intended that function to
implement hashing.

I also wrote a simple benchmark to see how the library performance
compares to OpenSSL. I also used this program to test how one would
wrap OpenSSL primitives into Boost.Hash2 interfaces. The benchmark is
here:

https://github.com/Lastique/hash2/blob/6bd5dafbcb05417014d0b85336a1245f5f186...

Compiled with:

g++ -O3 -march=rocketlake -std=c++17 -I../include -I../../boost -o hash2_ossl_perf hash2_ossl_perf.cpp -lssl -lcrypto

Here's the output on my machine:

sha1_160 (1024 bytes): 1220.093078 MiB/s
sha1_160 (1048576 bytes): 1317.424120 MiB/s
sha1_160 (16777216 bytes): 1318.296514 MiB/s
sha2_256 (1024 bytes): 406.139530 MiB/s
sha2_256 (1048576 bytes): 430.141013 MiB/s
sha2_256 (16777216 bytes): 429.341694 MiB/s
sha2_512 (1024 bytes): 629.593514 MiB/s
sha2_512 (1048576 bytes): 712.100583 MiB/s
sha2_512 (16777216 bytes): 711.549426 MiB/s
openssl_sha1_160 (1024 bytes): 1531.142208 MiB/s
openssl_sha1_160 (1048576 bytes): 2759.062140 MiB/s
openssl_sha1_160 (16777216 bytes): 2738.579482 MiB/s
openssl_sha2_256 (1024 bytes): 1541.291824 MiB/s
openssl_sha2_256 (1048576 bytes): 2462.035414 MiB/s
openssl_sha2_256 (16777216 bytes): 2449.770143 MiB/s
openssl_sha2_512 (1024 bytes): 781.187505 MiB/s
openssl_sha2_512 (1048576 bytes): 1086.239840 MiB/s
openssl_sha2_512 (16777216 bytes): 1082.922782 MiB/s

So, depending on the input message size, OpenSSL is faster by a factor
of 1.24x to 5.72x. Interestingly, despite what is said in Boost.Hash2
documentation (and confirmed by the test), OpenSSL's implementation of
SHA2-256 is faster than SHA2-512. This could be because of the SHA ISA
extensions in my CPU.

* What is your evaluation of the potential usefulness of the library?
Do you already use it in industry?

I do not use the library in real projects. I think the separation
between the hash functions and the hashed types is a very useful
design idea, and I like it very much. However, the way HashAlgorithm
is defined significantly limits the usefulness of the Boost.Hash2
library, as it prevents users from plugging in existing hash algorithm
implementations, even if just to experiment. I noted other issues in
the review, but I think this one is critical to the library adoption.

When I wrote the review, I was torn between "conditional accept" and
"reject" votes, as I really do like the general concept the library
offers, and yet I don't see myself adopting it because of the
limitations. But since Peter indicated that he is not willing to
address those limitations, I will have to vote REJECT.

Besides the HashAlgorithm requirements, it looks like the library
could use some work anyway. In particular, the mentioned issue with
tag_invoke ambiguity should be resolved, and the member-wise API for
hash_append needs to be explored. These things may affect the protocol
used to add support for hashing to user types, and changes to
HashAlgorithm would affect hash algorithms, so these issues need to be
sorted out before acceptance and adoption. I hope to see the library
improved and submitted for review a second time.
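For reference, the OpenSSL adapter from the benchmark, reduced to its essentials, looks roughly like this (a sketch; only the EVP_* calls are OpenSSL's actual API, and the comments note which HashAlgorithm requirements it cannot naturally satisfy):

#include <openssl/evp.h>
#include <array>
#include <cstddef>
#include <stdexcept>

class openssl_sha2_256
{
    EVP_MD_CTX* ctx_;

public:
    using result_type = std::array<unsigned char, 32>;

    openssl_sha2_256() : ctx_( EVP_MD_CTX_new() )
    {
        if( !ctx_ || EVP_DigestInit_ex( ctx_, EVP_sha256(), nullptr ) != 1 )
            throw std::runtime_error( "EVP_DigestInit_ex failed" );
    }

    // EVP_MD_CTX_copy_ex does exist, but deep copying is the exception,
    // not the rule, across crypto libraries; disabled here on purpose
    openssl_sha2_256( openssl_sha2_256 const& ) = delete;
    openssl_sha2_256& operator=( openssl_sha2_256 const& ) = delete;

    openssl_sha2_256( openssl_sha2_256&& r ) noexcept : ctx_( r.ctx_ )
    {
        r.ctx_ = nullptr; // EVP_MD_CTX_free(nullptr) is a no-op
    }

    ~openssl_sha2_256() { EVP_MD_CTX_free( ctx_ ); }

    void update( void const* data, std::size_t n )
    {
        EVP_DigestUpdate( ctx_, data, n );
    }

    result_type result() // unlike HashAlgorithm, valid to call only once
    {
        result_type r{};
        EVP_DigestFinal_ex( ctx_, r.data(), nullptr );
        return r;
    }
};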
Thank you Peter for submitting the library and Matt for managing the review.

On 12/10/24 16:09, Andrey Semashev wrote:
The documentation should provide a more detailed description of the HashAlgorithm::result_type type. Specifically, what interface and operations it must support. Ideally, the library should provide a concept, similar to HashAlgorithm. Currently, the documentation only says it can be an unsigned integer type or a "std::array-like" type, where the latter is rather nebulous. From the implementation of get_integral_result it looks like it must support data() member function returning a pointer and that its size() after default-construction is a constant expression and returns at least 8. This, for example, excludes std::vector, which would be useful for HashAlgorithmXOF::result_type.
Also on the topic of get_integral_result for "std::array-like" hash values, a few comments:
- It requires the static size of 8 regardless of the size of the requested integer, so even if uint32_t is requested it will still require at least std::array< unsigned char, 8 >. This should not be necessary.
- It ignores the data beyond the initial 8 bytes. It might be ok for the purpose of slicing the hash value, but it is inconsistent with get_integral_result for integers, which attempts to mix bits through multiplications with a constant. I think, the implementation should be consistent between arrays and integers. The approach of universally mixing bits for all hash value types seems safer as it produces better results for hash values with non-uniform distribution of high-quality bits across the hash value. It might be possible to reuse one of the hash algorithms implemented in the library for this bit mixing.
So I think, get_integral_result and HashAlgorithm::result_type should be better defined by the library. It should be possible for the user to provide his custom type as HashAlgorithm::result_type, and for that (a) there must be a list of requirements for that type to satisfy, and (b) it should be clear how an integer will be derived from it by get_integral_result. Maybe the user should be able to define his own get_integral_result behavior for his digest type. The reason why I think user-defined hash value types are useful is type safety, as it may be desirable to associate the digest with the algorithm that produced it.
Also, since I already suggested a few renames in my review, I think renaming result_type to digest_type would make sense. result_type is traditionally used as the result type of a function object, which HashAlgorithm isn't. And the result of a hash function is commonly called a digest.

Andrey Semashev wrote:
Above, I have intentionally omitted the constructor from uint64_t as I find it confusing. In particular, it isn't clear what endianness it uses to seed the algorithm. It looks like it duplicates hash_append.
It doesn't use any endianness. It's just used as a seed from which to derive the initial state. Most, if not all, existing non-cryptographic algorithms take integral seeds.
The documentation should provide a more detailed description of the HashAlgorithm::result_type type. Specifically, what interface and operations it must support. Ideally, the library should provide a concept, similar to HashAlgorithm. Currently, the documentation only says it can be an unsigned integer type or a "std::array-like" type, where the latter is rather nebulous.
Hash2 started out as a C++03 library, and mandated use of boost::array. Then it became a C++11 library, and mandated use of std::array. But then it became a C++14 constexpr library, and std::array didn't work, so now it mandates the nebulous "std::array-like" type (which is not yet completely nailed down, but will crystallize with practice.)
Also on the topic of get_integral_result for "std::array-like" hash values, a few comments:
- It requires the static size of 8 regardless of the size of the requested integer, so even if uint32_t is requested it will still require at least std::array< unsigned char, 8 >. This should not be necessary.
I'm not yet clear on the practical utility of using a result_type that is an array of fewer than 8 bytes. We don't have any such examples yet, so it's hard to tell whether this makes any sense and whether it should be explicitly blessed, and to what extent.
- It ignores the data beyond the initial 8 bytes. It might be ok for the purpose of slicing the hash value, but it is inconsistent with get_integral_result for integers, which attempts to mix bits through multiplications with a constant.
This is explicitly stated in the documentation of get_integral_result, in the Remarks. In practice it's a useful heuristic to assume that an array-like result implies a hash function of high quality (an avalanching one) so that picking any M bits would suffice. Not making this assumption would make things a bit hairy; it's not quite clear how to turn an arbitrary number of not-high-quality bits into 32. The library tries its best for the integral types, but there the possibilities are at least a finite amount and multiplications are built-in.
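In other words, the array path amounts to something like this (a sketch of the heuristic, not the library's exact code):

#include <cstdint>
#include <cstring>

template<class T, class R>
T slice_integral_result( R const& r )
{
    std::uint64_t v = 0;
    std::memcpy( &v, r.data(), 8 ); // first 8 bytes of the digest; the rest is ignored
    return static_cast<T>( v );     // truncation picks bits of those 8 bytes
}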
It might be possible to reuse one of the hash algorithms implemented in the library for this bit mixing.
Interesting "meta" approach, but not necessarily better in practice than the current one, under any practical criteria.
So I think, get_integral_result and HashAlgorithm::result_type should be better defined by the library.
It most likely will be, eventually, once we have near-100% clarity on how precisely it needs to be defined.
The hmac_* type aliases are not provided for siphash and xxhash. But I would actually prefer that those type aliases be removed for the rest of the hash algorithms, so that the algorithms don't bring the dependency on hmac.hpp, which is not always needed by the user. Typing hmac< sha2_256 > instead of hmac_sha2_256 is hardly any more tedious.
hmac.hpp is a very light header, so...
Thank you Peter for submitting the library and Matt for managing the review.
Thanks, Andrey, for your review.

On 12/10/24 17:18, Peter Dimov via Boost wrote:
Andrey Semashev wrote:
Above, I have intentionally omitted the constructor from uint64_t as I find it confusing. In particular, it isn't clear what endianness it uses to seed the algorithm. It looks like it duplicates hash_append.
It doesn't use any endianness. It's just used as a seed from which to derive the initial state.
Most, if not all, existing non-cryptographic algorithms take integral seeds.
Yes, but for other algorithms some byte order is implied, as you call `update(&seed, sizeof(seed))` in the constructor. Otherwise, the algorithm seeded with the same integer would produce different results on machines with different byte orders. I'd say, if a specific hash algorithm supports seeding from an integer, which is different from salting (i.e. simply calling update() on the seed), that algorithm should provide the constructor from an integer. If the constructor is described in HashAlgorithm, at least it should be marked optional.
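To illustrate, a seeding helper that fixes the byte order would make the seeded state portable (just a sketch):

#include <cstdint>

template<class H>
void seed_from_uint64( H& h, std::uint64_t seed )
{
    unsigned char b[ 8 ];

    for( int i = 0; i < 8; ++i )
        b[ i ] = static_cast<unsigned char>( seed >> ( 8 * i ) ); // fixed little-endian layout

    h.update( b, sizeof( b ) ); // same bytes on any host, unlike update(&seed, 8)
}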
Also on the topic of get_integral_result for "std::array-like" hash values, a few comments:
- It requires the static size of 8 regardless of the size of the requested integer, so even if uint32_t is requested it will still require at least std::array< unsigned char, 8 >. This should not be necessary.
I'm not yet clear on the practical utility of using a result_type that is an array of fewer than 8 bytes. We don't have any such examples yet, so it's hard to tell whether this makes any sense and whether it should be explicitly blessed, and to what extent.
It's not so much about blessing anything, it's about not requiring what you actually don't need.
- It ignores the data beyond the initial 8 bytes. It might be ok for the purpose of slicing the hash value, but it is inconsistent with get_integral_result for integers, which attempts to mix bits through multiplications with a constant.
This is explicitly stated in the documentation of get_integral_result, in the Remarks.
In practice it's a useful heuristic to assume that an array-like result implies a hash function of high quality (an avalanching one) so that picking any M bits would suffice.
If you're willing to rely on the array-like result being of high quality, why not do the same for integer results? I just think the behavior should be consistent, whether the digest is an integer, an array or some user-defined type.
Not making this assumption would make things a bit hairy; it's not quite clear how to turn an arbitrary number of not-high-quality bits into 32. The library tries its best for the integral types, but there the possibilities are at least a finite amount and multiplications are built-in.
It might be possible to reuse one of the hash algorithms implemented in the library for this bit mixing.
Interesting "meta" approach, but not necessarily better in practice than the current one, under any practical criteria.
Why? Yes, applying e.g. fnv1a_32 would be slower than a single multiplication or slicing an array, but from the perspective of the quality of the reduced digest this approach sounds good to me.
The hmac_* type aliases are not provided for siphash and xxhash. But I would actually prefer that those type aliases be removed for the rest of the hash algorithms, so that the algorithms don't bring the dependency on hmac.hpp, which is not always needed by the user. Typing hmac< sha2_256 > instead of hmac_sha2_256 is hardly any more tedious.
hmac.hpp is a very light header, so...
This may be the case today, and typedefs are there to stay pretty much forever. I say don't introduce them to begin with; they don't bring much value anyway.

Andrey Semashev wrote:
It might be possible to reuse one of the hash algorithms implemented in the library for this bit mixing.
Interesting "meta" approach, but not necessarily better in practice than the current one, under any practical criteria.
Why? Yes, applying, e.g. fnv1a_32, would be slower than a single multiply or slicing an array, but from the perspective of quality of the reduced digest this approach sounds good to me.
Because taking the low 32 bits from a high quality digest is better than taking the FNV-1a hash of all the bits.

Andrey Semashev wrote:
I'm not yet clear on the practical utility of using a result_type that is an array of fewer than 8 bytes. We don't have any such examples yet, so it's hard to tell whether this makes any sense and whether it should be explicitly blessed, and to what extent.
It's not so much about blessing anything, it's about not requiring what you actually don't need.
What I need today is only part of the story.

On Tue, Dec 10, 2024 at 5:09 AM Andrey Semashev via Boost wrote:
...Vinnie has agreed to this change in the proposal
Well, no one needs my permission to deviate from the paper :)
Copyability is a rather strong requirement, which may not be supported by all implementations.
Could you please provide an example of a non-copyable HashAlgorithm? I'm struggling to see it. Your review was very well written btw, thanks. Regards

On 12/10/24 17:50, Vinnie Falco wrote:
On Tue, Dec 10, 2024 at 5:09 AM Andrey Semashev via Boost wrote:
Copyability is a rather strong requirement, which may not be supported by all implementations.
Could you please provide an example of a non-copyable HashAlgorithm? I'm struggling to see it.
The primary use case I had in mind is wrapping an external library, where the external library does not support creating a deep copy of the algorithm state. This is the case with libsodium (https://doc.libsodium.org/hashing/generic_hashing), for example. In general, copyability requires support from whatever is being copied, movability does not (as you can always store a pointer and move that). Since the need for HashAlgorithm's copyability is rather limited, I think it makes sense to not require it, if possible.

On 12/10/24 18:05, Andrey Semashev wrote:
On 12/10/24 17:50, Vinnie Falco wrote:
On Tue, Dec 10, 2024 at 5:09 AM Andrey Semashev via Boost wrote:
Copyability is a rather strong requirement, which may not be supported by all implementations.
Could you please provide an example of a non-copyable HashAlgorithm? I'm struggling to see it.
The primary use case I had in mind is wrapping an external library, where the external library does not support creating a deep copy of the algorithm state. This is the case with libsodium (https://doc.libsodium.org/hashing/generic_hashing), for example.
I should clarify that libsodium uses structs for storing the state of the hashing algorithms. For example, for SHA2-256 the struct is crypto_hash_sha256_state: https://github.com/jedisct1/libsodium/blob/bfa6ee638673dc88c51331dc13e8f112f... As far as I could see, the documentation makes no guarantees about the contents of this struct and whether it is safe to memcpy it. So, even if the struct can be copied as bytes, I still don't consider copying as properly supported by the library, as one has to rely on implementation details.
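For completeness, this is the kind of move-only wrapper I have in mind (a sketch; only the crypto_generichash_* calls are libsodium's actual API, and it relies on C++17's over-aligned new for the state struct's alignment):

#include <sodium.h>
#include <cstddef>
#include <memory>

class sodium_generic_hash
{
    std::unique_ptr<crypto_generichash_state> st_;

public:
    sodium_generic_hash() : st_( new crypto_generichash_state )
    {
        crypto_generichash_init( st_.get(), nullptr, 0, crypto_generichash_BYTES );
    }

    // Moves just transfer the pointer; no deep copy of library state is
    // ever required, which is why movability is the safer baseline.
    sodium_generic_hash( sodium_generic_hash&& ) = default;
    sodium_generic_hash& operator=( sodium_generic_hash&& ) = default;

    void update( void const* data, std::size_t n )
    {
        crypto_generichash_update( st_.get(), static_cast<unsigned char const*>( data ), n );
    }

    void finalize( unsigned char* out, std::size_t n )
    {
        crypto_generichash_final( st_.get(), out, n );
    }
};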

Andrey Semashev wrote:
Here's the output on my machine:
sha1_160 (1024 bytes): 1220.093078 MiB/s
sha1_160 (1048576 bytes): 1317.424120 MiB/s
sha1_160 (16777216 bytes): 1318.296514 MiB/s
sha2_256 (1024 bytes): 406.139530 MiB/s
sha2_256 (1048576 bytes): 430.141013 MiB/s
sha2_256 (16777216 bytes): 429.341694 MiB/s
sha2_512 (1024 bytes): 629.593514 MiB/s
sha2_512 (1048576 bytes): 712.100583 MiB/s
sha2_512 (16777216 bytes): 711.549426 MiB/s
openssl_sha1_160 (1024 bytes): 1531.142208 MiB/s
openssl_sha1_160 (1048576 bytes): 2759.062140 MiB/s
openssl_sha1_160 (16777216 bytes): 2738.579482 MiB/s
openssl_sha2_256 (1024 bytes): 1541.291824 MiB/s
openssl_sha2_256 (1048576 bytes): 2462.035414 MiB/s
openssl_sha2_256 (16777216 bytes): 2449.770143 MiB/s
openssl_sha2_512 (1024 bytes): 781.187505 MiB/s
openssl_sha2_512 (1048576 bytes): 1086.239840 MiB/s
openssl_sha2_512 (16777216 bytes): 1082.922782 MiB/s
So, depending on the input message size, OpenSSL is faster from 1.24x to 5.72x. Interestingly, despite what is said in Boost.Hash2 documentation (and confirmed by the test), OpenSSL's implementation of SHA2-256 is faster than SHA2-512. This could be because of the SHA ISA extensions in my CPU.
These are interesting timings. Originally I assumed that this is because OpenSSL uses SIMD. But you're right that it might actually use SHA-NI instead. https://en.wikipedia.org/wiki/Intel_SHA_extensions For some reason I thought that SHA-NI only supported SHA1, but it does support SHA2-256 as well. Is there an easy way to check what OpenSSL uses? Probably not because it selects the instruction set at runtime.

On Tue, 10 Dec 2024, 16:35 Peter Dimov via Boost wrote:
Andrey Semashev wrote:
Here's the output on my machine:
sha1_160 (1024 bytes): 1220.093078 MiB/s
sha1_160 (1048576 bytes): 1317.424120 MiB/s
sha1_160 (16777216 bytes): 1318.296514 MiB/s
sha2_256 (1024 bytes): 406.139530 MiB/s
sha2_256 (1048576 bytes): 430.141013 MiB/s
sha2_256 (16777216 bytes): 429.341694 MiB/s
sha2_512 (1024 bytes): 629.593514 MiB/s
sha2_512 (1048576 bytes): 712.100583 MiB/s
sha2_512 (16777216 bytes): 711.549426 MiB/s
openssl_sha1_160 (1024 bytes): 1531.142208 MiB/s
openssl_sha1_160 (1048576 bytes): 2759.062140 MiB/s
openssl_sha1_160 (16777216 bytes): 2738.579482 MiB/s
openssl_sha2_256 (1024 bytes): 1541.291824 MiB/s
openssl_sha2_256 (1048576 bytes): 2462.035414 MiB/s
openssl_sha2_256 (16777216 bytes): 2449.770143 MiB/s
openssl_sha2_512 (1024 bytes): 781.187505 MiB/s
openssl_sha2_512 (1048576 bytes): 1086.239840 MiB/s
openssl_sha2_512 (16777216 bytes): 1082.922782 MiB/s
So, depending on the input message size, OpenSSL is faster from 1.24x to 5.72x. Interestingly, despite what is said in Boost.Hash2 documentation (and confirmed by the test), OpenSSL's implementation of SHA2-256 is faster than SHA2-512. This could be because of the SHA ISA extensions in my CPU.
These are interesting timings. Originally I assumed that this is because OpenSSL uses SIMD. But you're right that it might actually use SHA-NI instead.
https://en.wikipedia.org/wiki/Intel_SHA_extensions
For some reason I thought that SHA-NI only supported SHA1, but it does support SHA2-256 as well.
Is there an easy way to check what OpenSSL uses? Probably not because it selects the instruction set at runtime.
Is it expected to reach similar levels of performance in the future using Hash2? Or is this not in scope for the library? Thanks, Ruben.

Ruben Perez wrote:
On Tue, 10 Dec 2024, 16:35 Peter Dimov via Boost wrote:
Andrey Semashev wrote:
Here's the output on my machine:
sha1_160 (1024 bytes): 1220.093078 MiB/s
sha1_160 (1048576 bytes): 1317.424120 MiB/s
sha1_160 (16777216 bytes): 1318.296514 MiB/s
sha2_256 (1024 bytes): 406.139530 MiB/s
sha2_256 (1048576 bytes): 430.141013 MiB/s
sha2_256 (16777216 bytes): 429.341694 MiB/s
sha2_512 (1024 bytes): 629.593514 MiB/s
sha2_512 (1048576 bytes): 712.100583 MiB/s
sha2_512 (16777216 bytes): 711.549426 MiB/s
openssl_sha1_160 (1024 bytes): 1531.142208 MiB/s
openssl_sha1_160 (1048576 bytes): 2759.062140 MiB/s
openssl_sha1_160 (16777216 bytes): 2738.579482 MiB/s
openssl_sha2_256 (1024 bytes): 1541.291824 MiB/s
openssl_sha2_256 (1048576 bytes): 2462.035414 MiB/s
openssl_sha2_256 (16777216 bytes): 2449.770143 MiB/s
openssl_sha2_512 (1024 bytes): 781.187505 MiB/s
openssl_sha2_512 (1048576 bytes): 1086.239840 MiB/s
openssl_sha2_512 (16777216 bytes): 1082.922782 MiB/s
So, depending on the input message size, OpenSSL is faster from 1.24x to 5.72x. Interestingly, despite what is said in Boost.Hash2 documentation (and confirmed by the test), OpenSSL's implementation of SHA2-256 is faster than SHA2-512. This could be because of the SHA ISA extensions in my CPU.
These are interesting timings. Originally I assumed that this is because OpenSSL uses SIMD. But you're right that it might actually use SHA-NI instead.
https://en.wikipedia.org/wiki/Intel_SHA_extensions
For some reason I thought that SHA-NI only supported SHA1, but it does support SHA2-256 as well.
Is there an easy way to check what OpenSSL uses? Probably not because it selects the instruction set at runtime.
Is it expected to reach similar levels of performance in the future using Hash2? Or is this not in scope for the library?
I don't see why it wouldn't be in scope; performance optimizations are never ruled out. Taking advantage of SHA-NI when it's statically known to be available (under GCC/Clang -march=native, for instance) should be relatively easy to add. Dynamic dispatch is a bit more convoluted because of constexpr but should be possible.
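For the static case, the shape would be something like this (a sketch showing only the dispatch skeleton; __SHA__ is the macro GCC and Clang predefine when the target includes SHA-NI, and runtime dispatch would additionally need a cpuid check):

#include <cstdint>

#if defined(__SHA__)

void sha256_compress( std::uint32_t /*state*/[ 8 ], unsigned char const* /*block*/ )
{
    // SHA-NI path: _mm_sha256rnds2_epu32 and friends from <immintrin.h>
}

#else

void sha256_compress( std::uint32_t /*state*/[ 8 ], unsigned char const* /*block*/ )
{
    // portable scalar compression function, as currently implemented
}

#endif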

On 12/10/24 21:15, Peter Dimov via Boost wrote:
Ruben Perez wrote:
Is it expected to reach similar levels of performance in the future using Hash2? Or is this not in scope for the library?
I don't see why it wouldn't be in scope; performance optimizations are never ruled out.
Taking advantage of SHA-NI when it's statically known to be available (under GCC/Clang -march=native, for instance) should be relatively easy to add.
Requiring SHA instructions unconditionally sets a rather high bar on the CPU requirements.
Dynamic dispatch is a bit more convoluted because of constexpr but should be possible.
SIMD intrinsics (at least x86 ones) are not constexpr-friendly as they require reinterpret_cast to access memory. Only AVX-512 added load/store intrinsics taking void pointers. Maybe gcc's vector extensions are better, but I'm not sure if SHA intrinsics will work with them. I suspect the constexpr version will have to always use the generic code, and the runtime version may use CPU detection and vectorized code.

Andrey Semashev wrote:
I think, HashAlgorithm should allow implementations to use the seed arguments to the hash algorithm constructors as salt, i.e. the constructor would implicitly call update().
No, this is a bad practice and should never be used. As I already said, the algorithm should at minimum include the size of the seed, and then pad to a multiple of the block size. This is not just something I made up, it's existing practice. See for instance how KMAC is defined in https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-185.pdf If not that, you should at minimum include the hash of the seed in the initial prefix, instead of (or in addition to) the seed, like HMAC does. Don't forget to pad to a multiple of the block size.
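In sketch form, the convention is (illustrative length encoding, not KMAC's exact bytepad):

#include <cstddef>

template<class H>
void seed_prefix( H& h, unsigned char const* seed, std::size_t n )
{
    h.update( seed, n );

    unsigned char len[ 8 ] = {};
    for( int i = 0; i < 8; ++i )
        len[ i ] = static_cast<unsigned char>( n >> ( 8 * i ) ); // include the seed size

    h.update( len, sizeof( len ) );

    // pad with zeros so the seed prefix occupies whole blocks
    unsigned char const zero = 0;
    std::size_t used = ( n + sizeof( len ) ) % H::block_size;

    if( used != 0 )
    {
        for( std::size_t k = H::block_size - used; k != 0; --k )
        {
            h.update( &zero, 1 );
        }
    }
}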

On 12/10/24 19:51, Peter Dimov via Boost wrote:
Andrey Semashev wrote:
I think, HashAlgorithm should allow implementations to use the seed arguments to the hash algorithm constructors as salt, i.e. the constructor would implicitly call update().
No, this is a bad practice and should never be used. As I already said, the algorithm should at minimum include the size of the seed, and then pad to a multiple of the block size.
Is this something that should be intrinsically supported by HashAlgorithm, though? Can it be an external component, such as a free function or an adapter like hmac<>?

Currently, as I look at the hash algorithms provided by Boost.Hash2, most of them interpret the byte-range seed the way you describe, with the exception of fnv1a and siphash. fnv1a doesn't append the seed size (and doesn't pad, as FNV-1a is not a block hash); I'm not sure if that's intentional. siphash has special treatment of the seed if it is exactly 8 bytes.

Given these differences, is the user able to have any sort of guarantee on consistent behavior of the constructors of various algorithms? If not, why provide the constructor at all? I think it would be better if the library provided a standalone component that would seed any hash algorithm (whether implemented by Boost.Hash2 or not) in a consistent and secure way.

Andrey Semashev wrote:
Currently, as I look in the hash algorithms provided by Boost.Hash2, most of them interpret the byte-range seed the way you describe, with the exception of fnv1a and siphash. fnv1a doesn't append the seed size (and doesn't pad as FNV-1a is not a block hash); I'm not sure if that's intentional.
I suppose it would be better for it to do so, yes.
siphash has special treatment of the seed if it is exactly 8 bytes.
That's because this specific handling of the key is part of its specification, as siphash is a keyed hash function. So if you use siphash as specified, using the key length that it mandates, you get its documented (in siphash.pdf) behavior and it passes the corresponding siphash test suite.
Given these differences, is the user able to have any sort of guarantee on consistent behavior of the constructors of various algorithms? If not, why provide the constructor at all?
The constructor does what it's specified to do in the hash algorithm requirements. Yes, the user can rely on that.

On Sat, Dec 7, 2024 at 7:41 AM Matt Borland via Boost wrote:
Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
Please provide feedback on the following general topics:
- What is your evaluation of the design?
This looks to be a good API and hits the goals desired. Seems like it would be simple and nice to use.
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
There are several hashes given in this library that are used in security critical scenarios (e.g. SHA-2). Additionally, the explicit goal of the library is to give the user tools they need to build their own hashes.

I do not see any indication that any kind of cryptographic assessment was done on this library. There is not even a mention that the security of these hashes has been thought about, nor in what scenarios they could be used. This is absolutely essential for any library providing cryptographic primitives. There are a *lot* of potential pitfalls in the implementation of a cryptographic primitive, even in cases where the "correctness" of the end result isn't in question.

For example, there have been timing attacks against SHA-2/HMAC where the difference in the amount of time processing takes can leak information about the secret key. https://dl.acm.org/doi/10.1007/978-3-030-89915-8_2

These side-channel type attacks can be extremely insidious, and are not things that people who are otherwise experts in C++/C/Assembly/etc would ever think about. For cryptography, we need a different type of expert than the type we typically grow in this community.

**If Boost wants to take on providing cryptographic primitives, we need to hold those libraries to a higher standard, including an evaluation by outside cryptographers before we release anything to the public.**

I would be much happier if the choice was made to use a thoroughly vetted library (e.g. OpenSSL, but more compact options would probably be better) behind the scenes, with a nice C++ wrapper on the front.

I am tempted to suggest an option where we put big banners on the library saying "not for use in scenarios that require cryptographic security", but there is simply too much danger in users not understanding *when* they require security (with dozens of examples of this happening in the literature) that they would accidentally use this library in scenarios that undermine their security.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
I did not.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Spent a good amount of time reading the documentation. Browsed through some of the code.
- Are you knowledgeable about the problem domain?
Moderately; I work in a security critical industry.

Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).

REJECT

Sincerely,
Thomas Kent

Tom Kent wrote:
For example, there have been timing attacks against SHA-2/HMAC where the difference in the amount of time processing takes can leak information about the secret key. https://dl.acm.org/doi/10.1007/978-3-030-89915-8_2
This is an attack against a hardware implementation of HMAC-SHA2-256. The paper doesn't seem to be freely available, but I _think_ that it relies on data leaks via changes in power consumption. This doesn't seem applicable to software implementations, and I wasn't able to find attacks against software HMAC-SHA2-256. (Admittedly, after a not that exhaustive search attempt.) Either way, if you think that not having a SHA-2 implementation in Boost will increase the resistance of the collective body of C++ code against side channel attacks, I can only say that I believe you are very much mistaken.

On Wed, 11 Dec 2024, 02:55 Tom Kent via Boost wrote:
There are several hashes given in this library that are used in security critical scenarios (e.g. SHA-2). Additionally, the explicit goal of the library is to give the user tools they need to build their own hashes.
I do not see any indication that any kind of cryptographic assessment was done on this library. There is not even a mention that the security of these hashes has been thought about nor in what scenarios they could be used. This is absolutely essential for any library providing cryptographic primitives. There are a *lot* of potential pitfalls in the implementation of a cryptographic primitive, even in cases where the "correctness" of the end result aren't in question.
For example, there have been timing attacks against SHA-2/HMAC where the difference in the amount of time processing takes can leak information about the secret key. https://dl.acm.org/doi/10.1007/978-3-030-89915-8_2
These side-channel type attacks can be extremely insidious, and are not things that people who are otherwise experts in C++/C/Assembly/etc would ever think about. For cryptography, we need a different type of expert than the type we typically grow in this community.
**If Boost wants to take on providing cryptographic primitives, we need to hold those libraries to a higher standard, including an evaluation by outside cryptographers before we release anything to the public.**
As a potential user mainly interested in the "hashing untyped byte sequences" use case (involving SHA2), do you think migrating from OpenSSL to Boost.Hash2 would be detrimental for security at this point? If the answer is yes, is there a way to remediate this (even after the library gets accepted)? Or is this just not the main use case of the library? The use case involves generating digests for a network protocol (MySQL). I'd like to know both Tom's and Peter's opinions. Thanks, Ruben.

Ruben Perez wrote:
As a potential user mainly interested in the "hashing untyped byte sequences" use case (involving SHA2), do you think migrating from OpenSSL to Boost.Hash2 would be detrimental for security at this point? If the answer is yes, is there a way to remediate this (even after the library gets accepted)? Or is this just not the main use case of the library?
The use case involves generating digests for a network protocol (MySQL).
I'd like to know both Tom's and Peter's opinions.
Can you please point me to the source code portions in Boost.MySQL that implement SHA-2 authentication?

On Thu, 12 Dec 2024 at 14:27, Peter Dimov wrote:
Ruben Perez wrote:
As a potential user mainly interested in the "hashing untyped byte sequences" use case (involving SHA2), do you think migrating from OpenSSL to Boost.Hash2 would be detrimental for security at this point? If the answer is yes, is there a way to remediate this (even after the library gets accepted)? Or is this just not the main use case of the library?
The use case involves generating digests for a network protocol (MySQL).
I'd like to know both Tom's and Peter's opinions.
Can you please point me to the source code portions in Boost.MySQL that implement SHA-2 authentication?
Current code (using OpenSSL): https://github.com/boostorg/mysql/blob/c438f26731e36c2db6457705ec5dbb9f7657d... Code using the proposed library: https://github.com/boostorg/mysql/pull/389/files#diff-1ce941e5f315c38f0eb53e... Protocol docs: https://dev.mysql.com/doc/dev/mysql-server/8.4.3/page_caching_sha2_authentic... It's somehow similar in spirit to SCRAM-SHA256, but built in-house by MySQL.
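For reviewers unfamiliar with the protocol: per the docs above, the digest in question is XOR( SHA256(password), SHA256( SHA256(SHA256(password)) || nonce ) ). A sketch against the Hash2-style interface, assuming sha2_256::result_type is array-like with data()/size()/operator[] (illustrative, not the PR's actual code):

#include <boost/hash2/sha2.hpp>
#include <cstddef>

using boost::hash2::sha2_256;
using digest = sha2_256::result_type;

digest sha256( void const* p, std::size_t n )
{
    sha2_256 h;
    h.update( p, n );
    return h.result();
}

digest scramble( unsigned char const* password, std::size_t n,
                 unsigned char const* nonce, std::size_t nonce_size )
{
    digest d1 = sha256( password, n );          // SHA256(password)
    digest d2 = sha256( d1.data(), d1.size() ); // SHA256(SHA256(password))

    sha2_256 h;                                 // SHA256(d2 || nonce)
    h.update( d2.data(), d2.size() );
    h.update( nonce, nonce_size );
    digest d3 = h.result();

    for( std::size_t i = 0; i < d1.size(); ++i ) d1[ i ] ^= d3[ i ]; // XOR the two
    return d1;
}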

Ruben Perez wrote:
On Thu, 12 Dec 2024 at 14:27, Peter Dimov wrote:
Ruben Perez wrote:
As a potential user mainly interested in the "hashing untyped byte sequences" use case (involving SHA2), do you think migrating from OpenSSL to Boost.Hash2 would be detrimental for security at this point? If the answer is yes, is there a way to remediate this (even after the library gets accepted)? Or is this just not the main use case of the library?
The use case involves generating digests for a network protocol (MySQL).
I'd like to know both Tom's and Peter's opinions.
Can you please point me to the source code portions in Boost.MySQL that implement SHA-2 authentication?
Current code (using OpenSSL): https://github.com/boostorg/mysql/blob/c438f26731e36c2db6457705ec5dbb9f7657d... Code using the proposed library: https://github.com/boostorg/mysql/pull/389/files#diff-1ce941e5f315c38f0eb53e... Protocol docs: https://dev.mysql.com/doc/dev/mysql-server/8.4.3/page_caching_sha2_authentic...
It's somehow similar in spirit to SCRAM-SHA256, but built in-house by MySQL.
In this specific case, if we assume that Hash2 is accepted into Boost, I'd say that using OpenSSL is much more susceptible to supply chain attacks.

The user acquires both Boost.MySQL and Boost.Hash2 through Boost, whereas the typical practice of acquiring OpenSSL under Windows until very recently was "web search and download random binaries from somewhere on the Internet." Things are probably much better today because of vcpkg and conan, but the exact version of OpenSSL that the user will end up using is still an unknown variable.

The actual SHA256 implementation in OpenSSL has been looked at much more than the one in Hash2 at this point, but the Hash2 code is easy to inspect and verify because it follows the reference implementation very closely at the moment (although this might change if we add SHA-NI optimizations in the future.)

(I ignore here the C-style interface of OpenSSL's function, which has its own safety implications, but let's assume that the code in Boost.MySQL is 100% correct.)

The actual SHA256 implementation in OpenSSL has been looked at much more than the one in Hash2 at this point, but the Hash2 code is easy to inspect and verify because it follows the reference implementation very closely at the moment (although this might change if we add SHA-NI optimizations in the future.)
OpenSSL's SHA implementations (SHS and SHA3) are also certified under FIPS 140-2 [1], if this is a relevant property for your users, Ruben.

Matt

[1] https://csrc.nist.gov/projects/cryptographic-module-validation-program/certi...

On Thu, 12 Dec 2024, 21:34 Matt Borland wrote:
The actual SHA256 implementation in OpenSSL has been looked at much more than the one in Hash2 at this point, but the Hash2 code is easy to inspect and verify because it follows the reference implementation very closely at the moment (although this might change if we add SHA-NI optimizations in the future.)
OpenSSL's SHA implementations (SHS and SHA3) are also certified under FIPS 140-2 [1], if this is a relevant property for your users, Ruben.
Matt
[1] https://csrc.nist.gov/projects/cryptographic-module-validation-program/certi...
Thanks for the info.

On Thu, 12 Dec 2024, 21:17 Peter Dimov wrote:
On Thu, 12 Dec 2024 at 14:27, Peter Dimov wrote:
Ruben Perez wrote:
As a potential user mainly interested in the "hashing untyped byte sequences" use case (involving SHA2), do you think migrating from OpenSSL to Boost.Hash2 would be detrimental for security at this point? If the answer is yes, is there a way to remediate this (even after the library gets accepted)? Or is this just not the main use case of the library?
The use case involves generating digests for a network protocol (MySQL).
I'd like to know both Tom's and Peter's opinions.
Can you please point me to the source code portions in Boost.MySQL that implement SHA-2 authentication?
Current code (using OpenSSL):
https://github.com/boostorg/mysql/blob/c438f26731e36c2db6457705ec5dbb9f7657d...
Code using the proposed library:
https://github.com/boostorg/mysql/pull/389/files#diff-1ce941e5f315c38f0eb53e...
Protocol docs: https://dev.mysql.com/doc/dev/mysql-server/8.4.3/page_caching_sha2_authentic...
It's somehow similar in spirit to SCRAM-SHA256, but built in-house by MySQL.
In this specific case, if we assume that Hash2 is accepted into Boost, I'd say that using OpenSSL is much more susceptible to supply chain attacks.
The user acquires both Boost.MySQL and Boost.Hash2 through Boost, whereas the typical practice of acquiring OpenSSL under Windows until very recently was "web search and download random binaries from somewhere on the Internet."
Things are probably much better today because of vcpkg and conan, but the exact version of OpenSSL that the user will end up using is still an unknown variable.
The actual SHA256 implementation in OpenSSL has been looked at much more than the one in Hash2 at this point, but the Hash2 code is easy to inspect and verify because it follows the reference implementation very closely at the moment (although this might change if we add SHA-NI optimizations in the future.)
(I ignore here the C-style interface of OpenSSL's function, which has its own safety implications, but let's assume that the code in Boost.MySQL is 100% correct.)
Thanks for the clarifications, Peter.

All,

We are now halfway through the review period, so if you are going to submit a review you only have a few days remaining. There's been a lot of good discussion which I'm sure Peter and Christian appreciate.

Matt

My review (a bit longer than intended)
- What is your evaluation of the design?
The library is based on the paper, which goes to great lengths discussing the design. To me this sounds well reasoned, so the library design is solid.
- What is your evaluation of the implementation?
The whole code base looks very clean and well structured. The implementation of the provided hash functions follows their specification, making them easy to verify. Reasonable use of trait-types and C++11/14 features and SFINAE-friendly methods make it look solid to me.
The only point of criticism I have is the type used for "byte-ranges" in the HashAlgorithm constructor and update method(s). In the (seed-)ctor the bytes are of type `unsigned char const` while in `update` they are of type `void const`. If both methods are supposed to receive a byte sequence, I'd expect them to use the same type, and am confused by the difference. Given that the constexpr-enabled concept uses `unsigned char const`, I'd choose that.
Use of a span-like type has been extensively discussed, and I'll just mention my personal key takeaway: in the context of static analysis through compiler warnings, I'd like to see support for both `update(ptr, size)` & `update(byte-span)` in the concept, to give HashAlgorithm implementers the freedom to not provide a signature that triggers compiler warnings. Support for a span-like type at least constructible from an `unsigned char const*` and a size, to be passed to the algorithm's update function, seems reasonable to me. However, I do see the advantage of the more flexible approach currently implemented, especially due to the multitude of span classes available. So to me this is optional and at most a QoI improvement.
- What is your evaluation of the documentation?
Well written and structured. However, I find it odd that it seems to emphasize the provided hashing algorithms and still calls itself "An Extensible Hashing Framework". For me it also dives too soon too deep into details. I'd like to have the distinction made more clear between:
a) hashing (raw) byte sequences by using a HashAlgorithm
b) hashing any C++ types/objects by using the framework
As a (potential) user I'd really like to be told first how to use this framework with something existing (an already implemented hash algorithm class) rather than having an in-depth explanation of the (class-)concepts and of a multitude of provided algorithms; a simple list with links to details at this point (the introduction) might suffice.
To me, the "major use case" the library addresses is not "hashing an untyped sequence of bytes". If I understood it correctly, the algorithms provided are "just" examples of algorithms the framework can be used with. There are already implementations of such algorithms, and some are seemingly much faster (compare the OpenSSL benchmarks posted on the list). The novelty to me here is a framework that allows the separation of types and hash algorithms, i.e. enables types to be hashed without knowing by which algorithm. So maybe that should be more prominent in the docs.
I also find the description of the `result()` function questionable:
Note that result is non-const, because it changes the internal state. [...] subsequent calls [...] produce a pseudorandom sequence
Having this in the description of the HashAlgorithm concept seems to exclude any algorithm that doesn't have a meaningful "finalize" operation, or those that "finalize" e.g. by padding, where subsequent "finalize" calls are no-ops yielding the same result. If it isn't an intentional requirement for all algorithms to be usable with this, I'd suggest changing the wording to e.g. "may change internal state." and "Some might produce a pseudorandom sequence" or similar.
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
Very useful, especially as it is an implementation of a paper for a standards proposal, which I think is very much in line with what Boost is/was/should be. I don't use it yet.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
Tried examples using GCC 13, no issues.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
Something short of an in-depth study.
- Are you knowledgeable about the problem domain?
Somewhat, yes.
Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).
CONDITIONAL ACCEPT. I'd like to suggest reconsidering the concept of the `result` function (like the split into `result() const` and `finalize()` mentioned on the list), or at least the wording, and the types for "byte ranges" mentioned above, unless changing that type severely affects the functionality (negatively). Having the concept defined on inclusion might make it hard to change later on, when people rely on it, so I'd rather have this "right" from the start; if in doubt, be more permissive. This "condition" is intentionally very weak, as I do like to see this library included, and can be met by simply stating that the values returned by multiple invocations of `result` might be the same.
In the context of the cryptographic security mentioned by others, I'd like to see a) an example of how to wrap an existing, third-party hash implementation (which might only have a C interface) into the HashAlgorithm concept, and b) at least a statement on the security of the provided cryptographic hash algorithms in the documentation.
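As an illustration of (a), a wrapper might look roughly like the sketch below. The `c_hash_*` functions stand in for a hypothetical third-party C API; the seed handling is only indicative (a real MAC construction such as HMAC should be preferred), and copy semantics and error handling are omitted:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    extern "C" {
        // hypothetical third-party C interface
        struct c_hash_state;
        c_hash_state* c_hash_new();
        void c_hash_update( c_hash_state*, void const*, std::size_t );
        void c_hash_final( c_hash_state*, unsigned char out[ 32 ] );
        void c_hash_free( c_hash_state* );
    }

    class c_hash_wrapper
    {
        c_hash_state* st_;

    public:
        using result_type = std::array<unsigned char, 32>;

        c_hash_wrapper(): st_( c_hash_new() ) {}

        explicit c_hash_wrapper( std::uint64_t seed ): st_( c_hash_new() )
        {
            // sketch only: feed the seed as a prefix
            update( &seed, sizeof( seed ) );
        }

        c_hash_wrapper( unsigned char const* p, std::size_t n ): st_( c_hash_new() )
        {
            update( p, n );
        }

        ~c_hash_wrapper() { c_hash_free( st_ ); }

        void update( void const* p, std::size_t n )
        {
            c_hash_update( st_, p, n );
        }

        result_type result()
        {
            // note: a C API like this typically finalizes once; supporting
            // the concept's result extension would need additional work
            result_type r = {};
            c_hash_final( st_, r.data() );
            return r;
        }
    };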

- What is your evaluation of the design? Please see the documentation section - most of the comments there are about the design at least as much as how it is documented.
- What is your evaluation of the implementation? The implementation is unsurprisingly of high quality.
- What is your evaluation of the documentation?
The documentation is very comprehensive - sufficiently so that it called out all the surprising features of what I'd expected was going to simply be a refined (given the author) implementation of the approach described in the paper "Types don't know #".
The documentation describes hash algorithm requirements early. I expected to be unsurprised, and initially was (vs the hash algorithm requirements in the paper), but the fact that every hash algorithm must have a constructor taking an integer seed and one taking a byte sequence immediately caught my attention. I didn't expect that all of the listed hash algorithms would have either or both of those constructors. So I checked the reference docs for the algorithms to find out what this requirement has added to the "usual" expectations for each algorithm.
FNV-1a looked like a predictable, minimal "how to seed something that isn't a keyed hash": just feed it a "key" as a prefix (by calling update). That did make me wonder why the concept was complicated with constructors that aren't doing anything the user couldn't do post-construction by simply calling update(). But the xxhash algorithm specification (as opposed to the hash2 library) does define a (result-type sized) seed, and sure enough, hash2 provides a constructor that operates as expected (with a twist: the seed is not of result_type, so it can be larger than the seed the 32 bit algorithm supports). This discrepancy, and the requirement that different seeds give different results, forces larger seed values to be used creatively. The approach taken, if the seed is not small enough to be converted to the result type without overflow, is to use the low 32 bits to do the defined seeding, then mix the high 32 bits into the state "somehow" (I checked the implementation; it's as if the high 32 bits were copied to unsigned char b[4] and then update(b, 4) called 4 times - I think?).
So what happens when constructing from a byte sequence? In this case we fall back to default construction followed by update() - oh, the docs say "update() and a call to result()". That looks odd. Why would one call result()? After calling result(), further update() calls are valid? Read some more docs. Oh - every hash algorithm is required not only to have these constructors, but also to be an extensible hash. Another surprise. It turns out that result() having a side effect of changing the state is required/expected, and is being used here to "mix" the seed.
I'm not 100% sure why - not that it's bad - but just how arbitrary is each algorithm's seeding behaviour? Is it reasonable to be creating these seeding conventions specific to the library? The obvious answer is: if you don't like it, write a different algorithm implementation. It is true that for the purposes of generating hashes of unordered container keys, any "reasonable" seeding convention is good enough - even where hash flooding protection is needed. Perhaps using the library to hash arbitrary C++ objects for any/all uses is getting into an area where the user needs to be a bit more aware of potential interoperability issues. Even so, the algorithm specification for each provided algorithm should be very explicit about just how it seeds - and that there is no interoperability of these features with any other implementation of the "same" algorithm. Unless of course there is: if an algorithm in hash2 follows a specification for seeding and/or extension, this should be clearly documented/referenced.
An implementation of https://keccak.team/kangarootwelve.html would make an interesting addition due to it having a comprehensive set of functions that might be expected to directly fulfil the concepts (among other nice features).
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
I haven't used it yet. But I've used hash algorithms for various things - some cryptographic, some purely for high performance (relative to resources) unordered containers / data structures. I would definitely use this library for unordered key hashing, and the convenience of creating a hash for a user-defined type independent of the algorithm choice is a valuable capability. The reason I haven't had much to say about that aspect of the library is that I don't have any complaints or suggestions for improvement, except that I think support for hashing unordered containers should be dropped, because there isn't a robust and performant generic solution (it seems - if there is, it's not in the library). I would think twice before using the library for any cryptographic purpose, not because of any specific issues (other than the unordered container issue), but because it is yet another implementation of the raw hash algorithms without any specific benefits, and with a restrictive (for good reason) interface tailored to the uses envisioned in the paper.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
I compiled and ran some tests using g++ 14.2 - It was very easy to do this. Modular boost is great!
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
A few hours.
I took a glance, mistook it for a nice simple library to make unordered containers easier to use with arbitrary key types, and thought I wouldn't dig deeper and probably didn't have a lot to add beyond voting to accept. But the extended scope of the algorithm concept, and also the aspirations to solve the problem of hashing multisets / unordered containers as key types, made me take a bit more notice.
I don't think the generalisations of the hash algorithm concept do any harm, but the potential for surprise, and the potential to expect that the hash algorithms so defined/encapsulated should be used for "everything", is there. The ergonomics are a bit quirky - I do wonder if adopting names that are a bit more descriptive of the hash algorithm operation (if we were talking only about sponges I'd say absorb() and squeeze() in place of update() and digest()) might be in order, to show just how different the algorithm concepts are, but I'm not sure that every hash algorithm should provide the latter. Perhaps it is more that everything should provide update(), an extensible hash should provide "stir()", and everything should provide a (const) result(). Surely the few places where result is called multiple times with the expectation that the state was changed can/should make this explicit?
I'm also not sure I like result types as array-like byte sequences or integers, but perhaps optimisers can't quite make that choice irrelevant from a performance perspective. I'm not sure the library should even try to solve the problem of extracting "good" bits from a hash that doesn't have great dispersion properties. There isn't anything "wrong" with a fast, less than uniform, and certainly not preimage-resistant hash function - there is something wrong with trying to "make it work" where it shouldn't be used in the first place.
- Are you knowledgeable about the problem domain? I know enough to be dangerous. And know enough to know I am (dangerous). This library is mostly a way to avoid footgun problems while easing the task for the average knowledge user like me.
Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).
I vote to ACCEPT the library. I don't think adding explicit conditions is necessary, as I expect the author will address any issues.

On Sat, Dec 7, 2024 at 1:47 PM Matt Borland via Boost
Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
Since I've already barged into the topic earlier, I might as well leave a review.
Please provide feedback on the following general topics:
- What is your evaluation of the design?
One of the things that puzzles me about this library is its intended purpose. "Types don't know #" was clearly about hash types for use in hash tables, which explained why keying the hashes was mandatory. But it is unclear to me why MD5, SHA-1, and other cryptographic hashes are present here; they are way too inefficient to be of much use in hash tables, so there is little reason to include them. If, on the other hand, this is meant as a more general hashing library, I have strong objections as detailed below.
I believe the HashAlgorithm concept is too overloaded. It _requires_ that hashes be both keyable and extendable-output, which are things that many common hash functions are not capable of. MD5, for example, is not capable of either keying or extendable-output, and as such will require an ad hoc _mode of operation_ to achieve these goals, which is fraught with danger. A user that wants to use this library with a hash that is not included here is going to be pushed to include functionality the hash does not provide and that they might not know how to achieve. Despite being very useful functionality, this is a source of vulnerabilities waiting to happen.
Additionally, one of the constructors for keying is std::uint64_t. While this is mostly OK for the hash table DoS-prevention use case, it is woefully small for cryptographic purposes. This constructor should not exist in this case.
Furthermore, the multiple result() calls API to achieve extendable output, besides being confusing and incompatible with just about every implementation out there, is suboptimal from a performance standpoint. If I want to generate a long output, calling result() repeatedly will make vectorization much more annoying to implement (and requires an unnecessarily large internal buffer) than an API such as digest(output, length). This is the same rationale for proposals such as P1068 for more efficient output on the standard <random> generators.
I would suggest splitting the HashAlgorithm concept into KeyedHashAlgorithm/PRF and FixedOutputAlgorithm/VariableOutputAlgorithm, or something along these lines. It should not be necessary to shoehorn this functionality into primitives that do not support it. Note that this is not a clean hierarchy; some keyed algorithms cannot be safely used as unkeyed hashes, so KeyedHashAlgorithm is not a subset of HashAlgorithm and vice-versa.
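For concreteness, the suggested split might look something along these lines; the names and digest size are invented for illustration, and this is an interface sketch rather than a full design:

    #include <array>
    #include <cstddef>

    struct FixedOutputExample
    {
        using result_type = std::array<unsigned char, 32>;
        void update( void const* p, std::size_t n );
        result_type result(); // single finalization, no extension
    };

    struct VariableOutputExample
    {
        void update( void const* p, std::size_t n );
        void digest( unsigned char* out, std::size_t len ); // one-shot XOF-style output
    };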
- What is your evaluation of the implementation?
The implementation of the cryptographic hash functions highlights the issues with the above design. I will talk about MD5 here, but it applies all the same to SHA-1, SHA-256, etc.
Firstly, it is strange that an implementation of MD5 is keyed, since MD5 does not provide this functionality. But the keying here simply prepends the key, padded to block length, to the message. This enables the so-called length-extension attack: if I have MD5(k || m), I can easily compute MD5(k || m || other stuff) without knowing the key, which can be used to bypass authenticators (Flickr was a famous example of this [1]). This is why HMAC has been drilled into developers whenever keyed hashing comes up.
Furthermore, the way extended output works breaks pseudorandomness. Suppose I have a keyed MD5 instance and want to generate several blocks of output. The expectation here is that _all_ of the output is indistinguishable from a random string of the same length. But that is not the case here. What we have instead is
first_block = MD5(k || m || padding)
second_block = MD5(k || m || padding || more padding)
...
An attacker who observes this can easily distinguish it by taking first_block, which consists of the internal MD5 state, hashing the extra padding, and checking whether the output is equal to second_block. If I were to use this scheme as a stream cipher to encrypt data, knowing the first 16 bytes of plaintext (e.g., a known/predictable header) would immediately enable the recovery of the rest. This is broken as an XOF.
The way variable-length output in Merkle-Damgard hashes such as MD5 has historically been handled is to use MGF1 [2, 10.2.1] or something along these lines. But once again this is a _mode of operation_ on top of MD5, similarly to HMAC.
I have already mentioned this elsewhere, but the unordered hash approach currently implemented cannot possibly be secure without enormous parameters.
A minor question I have regards the constants used in get_result_multiplier(). The documentation states that this is used to produce a uniform output on the target range, but it's unclear from the documentation or source code what the rationale or criteria for the method or constants used here are. This seems to get into integer hash territory, but for example the 4 -> 1 case consists of x -> (x*0x7f7f7f7f) >> 24, for which easy differentials exist, e.g., (x, x^0xc0800000) collide with probability ~1/4.
The implementations are otherwise clean and easy to read, which simplified spotting the issues above.
[1] https://vnhacker.blogspot.com/2009/09/flickrs-api-signature-forgery.html
[2] https://www.ietf.org/rfc/rfc2437.txt
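For reference, the MGF1 construction mentioned above can be written generically over a hash2-style algorithm H roughly as follows; this is my own sketch (assuming H::result_type is an array-like digest), not code from the library:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    template<class H>
    std::vector<unsigned char> mgf1( unsigned char const* seed, std::size_t n, std::size_t len )
    {
        std::vector<unsigned char> out;

        for( std::uint32_t counter = 0; out.size() < len; ++counter )
        {
            H h; // fresh instance per output block

            h.update( seed, n );

            unsigned char c[ 4 ] = {
                static_cast<unsigned char>( counter >> 24 ),
                static_cast<unsigned char>( counter >> 16 ),
                static_cast<unsigned char>( counter >> 8 ),
                static_cast<unsigned char>( counter ) };

            h.update( c, 4 ); // 4-byte big-endian counter, per MGF1

            auto r = h.result();
            out.insert( out.end(), r.begin(), r.end() );
        }

        out.resize( len );
        return out;
    }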
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
A framework to easily hash custom types without having to repeat the same std::hash or similar boilerplate over and over is something I would definitely use, and have sort of reinvented kludgy versions of it multiple times before. But apart from easily enabling types to be usable in hash tables I would not have much use for this library.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
Yes. Clang 19 on Linux. It worked as expected.
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
A few hours. Mostly skimming the source code and documentation.
- Are you knowledgeable about the problem domain?
I have been involved in the design and analysis of hash functions.
Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).
I would rather leave this field empty, since I'm not too familiar with the procedure here, but if pressed I would say reject in its current state.

Ensure to explicitly include with your review: ACCEPT, REJECT, or CONDITIONAL ACCEPT (with acceptance conditions).
I would rather leave this field empty, since I'm not too familiar with the procedure here, but if pressed I would say reject in its current state.
Samuel,
There is no particular procedure for casting your vote besides emailing this list. It is purely based on your analysis of the library: do you think it should be accepted, rejected, or accepted at a later time once certain conditions have been fulfilled? As review manager I compile all the votes and analysis, and will provide a final ruling, typically within a few days of the review period closing. For further information: https://www.boost.org/community/reviews.html. Please let me know if you have any additional questions. Thank you for your time and review.
Matt

Samuel Neves wrote:
Furthermore, the way extended output works breaks pseudorandomness. Suppose I have a keyed MD5 instance and want to generate several blocks of output. The expectation here is that _all_ of the output is indistinguishable from a random string of the same length. But that is not the case here. What we have instead is
first_block = MD5(k || m || padding) second_block = MD5(k || m || padding || more padding) ...
An attacker who observes this can easily distinguish this by taking first_block, which consists of the internal MD5 state, hashing the extra padding, and checking whether the output is equal to second_block.
This only works if the message length is known to the attacker, because the "more padding" part includes the message length, which is incremented by each call to result().

Samuel Neves wrote:
A minor question I have regards the constants used in get_result_multiplier(). The documentation states that this is used to produce a uniform output on the target range, but it's unclear from the documentation or source code the rationale or criteria for the method or constants used here. This seems to get into integer hash territory, but for example the 4 -> 1 case consists of x -> (x*0x7f7f7f7f) >> 24, for which easy differentials exist, e.g., (x, x^0xc0800000) collide with probability ~1/4.
It's pretty hard to come up with "good" multipliers here because it's not possible to quantify the "good" part. There are basically two cases, one where the 32 bit input is uniformly distributed, and one in which it isn't. And when it isn't, that is, it comes from a "bad" hash, it's not really possible to optimize the multiplier without a known input distribution, and we don't have that. So I came up with some ad hoc criteria here https://github.com/pdimov/hash2/blob/develop/test/get_integral_result_4.cpp and tried to make the multipliers work for them.

Samuel Neves wrote:
Additionally, one of the constructors for keying is std::uint64_t. While this is mostly OK for the hash table DoS-prevention use case, it is woefully small for cryptographic purposes. This constructor should not exist in this case.
While it's trivially true that uint64_t is not enough seed for cryptographic purposes, MD5 seeded with an unknown uint64_t seed is still incrementally much more cryptographic than plain MD5. The use case here is generic code. The user has something templated on the hash algorithm that is currently using e.g. xxhash_64 seeded with uint64_t. But this turns out to not work well in practice because collisions happen. So they just change the template parameter from xxhash_64 to md5_128 and carry on. For real crypto, you'd have the code using HMAC-SHA2-256, and while hmac_sha2_256 doesn't much need the uint64_t constructor, I very much doubt that anyone would use it by mistake. The rationale for adding a byte seed constructor to everything, even though the functions aren't necessarily designed for it, is the same; a hash function seeded with something unknown to the attacker is incrementally much more secure than the same hash function not seeded with anything.
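To illustrate the generic-code scenario (my sketch, not code from the library): the algorithm is a template parameter, so the switch is a one-line change at the call site.

    #include <boost/hash2/hash_append.hpp>
    #include <boost/hash2/md5.hpp>
    #include <boost/hash2/xxhash.hpp>
    #include <cstdint>
    #include <string>

    template<class Hash, class T>
    typename Hash::result_type digest_of( T const& v, std::uint64_t seed )
    {
        Hash h( seed ); // every algorithm provides the uint64_t seed constructor
        boost::hash2::hash_append( h, {}, v );
        return h.result();
    }

    // std::string s( "hello" );
    // auto d1 = digest_of< boost::hash2::xxhash_64 >( s, 42 );
    // auto d2 = digest_of< boost::hash2::md5_128 >( s, 42 ); // drop-in swap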

One of the things that puzzles me about this library is its intended purpose. "Types don't know #" was clearly about hash types for use in hash tables, which explained why keying the hashes was mandatory. But it is unclear to me why MD5, SHA-1, and other cryptographic hashes are present here; [...] I believe the HashAlgorithm concept is too overloaded. It _requires_ that hashes be both keyable and extendable-output, [...] A user that wants to use this library with a hash that is not included here is going to be pushed to include functionality the hash does not provide and that they might not know how to achieve.
That is actually a good point: why are those requirements on the HashAlgorithm concept at all? The library is (mostly) about allowing types to be hashed, i.e. the framework surrounding `hash_append`. I understand that a specific way of construction (currently being discussed here on the list) is required for implementing `hash_append` for unordered containers. But besides that, why does the library require specific constructors at all? The HashAlgorithm is provided by the user as an instance; how the user constructed it isn't relevant for the purpose of the library, is it? To my understanding, having support for seeds could be highlighted as best practice for specific use cases, but not as a general requirement.
I understand that the library provides ways to extract a hash out of an algorithm, so including `result_type` and `result` makes sense, although given that this is C++11, the former seems a bit outdated as it can be determined from the latter. But besides that, there are no requirements on HashAlgorithm relied upon by the library, are there?
For user convenience I can see `SeedableHashAlgorithm` as a subset of `HashAlgorithm` being a useful, additional concept to include in the library. The library could provide traits to check a given type for a specific concept (when lacking C++20 concepts); a sketch follows below. This would give users and hash authors more freedom in which algorithm is used. Of course it could be stated that all included algorithms conform to the `SeedableHashAlgorithm` concept, if they (meaningfully) do.
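Such a trait might look like this in C++11; `is_seedable_hash_algorithm` is an invented name, and this only checks for the byte-sequence seed constructor:

    #include <cstddef>
    #include <type_traits>

    template<class H, class = void>
    struct is_seedable_hash_algorithm : std::false_type {};

    template<class H>
    struct is_seedable_hash_algorithm<H,
        typename std::enable_if<
            std::is_constructible<H, unsigned char const*, std::size_t>::value
        >::type> : std::true_type {};

    // static_assert( is_seedable_hash_algorithm< boost::hash2::md5_128 >::value, "" );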

Alexander Grund wrote:
That is actually a good point: Why are those requirements on the HashAlgorithm concept at all?
The library is (mostly) about allowing to hash types, i.e. the framework surrounding `hash_append`. I understand that a specific way (currently being discussed here on the list) of construction is required for implementing `hash_append` for unordered containers.
But besides that, why does the library require specific constructors at all?
Simplified, this asks "if we ignore the use cases that require seeding, why does the library require that the hash algorithms support seeding?"
For user convenience I can see `SeedableHashAlgorithm` as a subset of `HashAlgorithm` as a useful, additional concept to include in the library.
I've already answered this a few times. It's more ergonomic if all hash algorithms support a consistent interface. This allows writing generic code that can work with any algorithm. It allows switching from one hash algorithm to any other without breakage. It also allows switching from unseeded use to seeded use, or from use with uint64_t seed to use with a byte sequence seed, without breakage regardless of the chosen hash algorithm. The idea here is to make switching from a less secure use to a more secure use easy and painless.

Samuel Neves wrote:
One of the things that puzzles me about this library is its intended purpose. "Types don't know #" was clearly about hash types for use in hash tables, which explained why keying the hashes was mandatory. But it is unclear to me why MD5, SHA-1, and other cryptographic hashes are present here; they are way too inefficient to be of much use in hash tables, so there is little reason to include them. If, on the other hand, this is meant as a more general hashing library, I have strong objections as detailed below.
There exist intermediate use cases between "key in a hash table" and "cryptographically secure hashing of a binary blob" that are also served by the library. Some of them are:
- if you have a program with some program state S, represented by a C++ object, which the user can save to disk by means of e.g. Ctrl+S, you often want to know whether the current state has been saved or not (to display an indicator, or to decide whether to autosave on exit.) This is implemented by keeping a hash of the last save, and comparing the current hash to the last saved one. (A sketch follows below.)
- if you have something like Boost.Compute, which needs to compile shaders, you can keep a cache of already compiled shader binaries, along with the hashes of the source used to create them, and then skip the compilation if the source matches a hash of an already compiled binary. https://github.com/boostorg/compute/blob/cf7907574d6159cd43e6cf687a7b656278c...
- if you need to send complex C++ objects over the network, you can first ask the remote endpoint "do you already have the object with this hash?" and if so, skip sending the object.
And there are many other cases isomorphic to the ones above. In all of these cases, you want a very good quality hash function, one for which you can ideally assume that collisions never happen, for some practical definition of never (less than cosmic rays or random RAM bit flips, for instance). And because of that, you want to be able to use a cryptographic, or ex-cryptographic, hash function, even when there is no external attacker, because non-cryptographic hash functions are generally optimized for speed and do not produce a sufficiently long digest size. (But if a non-cryptographic hash works for you, you also want to be able to use _that_, because more speed is always better than less speed, other things being equal.)
In short, having a reliable framework for hashing arbitrary C++ objects with any user-selected hash algorithm opens up a lot of possibilities that one might not realize are there, and "how do I implement std::hash for my type" is only a small fraction of them.
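A minimal sketch of the first use case, assuming the program state is something hash_append already understands (a std::vector<std::string> stands in for the real state object here):

    #include <boost/hash2/hash_append.hpp>
    #include <boost/hash2/sha2.hpp>
    #include <string>
    #include <vector>

    using digest_type = boost::hash2::sha2_256::result_type;

    digest_type state_digest( std::vector<std::string> const& state )
    {
        boost::hash2::sha2_256 h;
        boost::hash2::hash_append( h, {}, state );
        return h.result(); // compare against the digest taken at last save
    }

    // bool unsaved_changes = state_digest( current_state ) != last_saved_digest;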

On Fri, 13 Dec 2024, 11:22 Peter Dimov via Boost,
Samuel Neves wrote:
One of the things that puzzles me about this library is its intended purpose. "Types don't know #" was clearly about hash types for use in hash tables, which explained why keying the hashes was mandatory. But it is unclear to me why MD5, SHA-1, and other cryptographic hashes are present here; they are way too inefficient to be of much use in hash tables, so there is little reason to include them. If, on the other hand, this is meant as a more general hashing library, I have strong objections as detailed below.
There exist intermediate use cases between "key in a hash table" and "cryptographically secure hashing of a binary blob" that are also served by the library. Some of them are:
- if you have a program with some program state S, represented by a C++ object, which the user can save to disk by means of e.g. Ctrl+S, you often want to know whether the current state has been saved or not (to display an indicator, or to decide whether to autosave on exit.) This is implemented by keeping a hash of the last save, and comparing the current hash to the last saved one.
- if you have something like Boost.Compute, which needs to compile shaders, you can keep a cache of already compiled shader binaries, along with the hashes of the source used to create them, and then skip the compilation if the source matches a hash of an already compiled binary.
https://github.com/boostorg/compute/blob/cf7907574d6159cd43e6cf687a7b656278c...
- if you need to send complex C++ objects over the network, you can first ask the remote endpoint "do you already have the object with this hash?" and if so, skip sending the object.
These three have been pretty clarifying for me regarding some design decisions in the library - thanks for providing them. I'd encourage you to mention them in the docs. Thanks, Ruben.

All, I received the following review directly from Ivan Matek, because the review is written as a PDF that exceeds the allowable size of the ML. The document can be accessed here: https://drive.proton.me/urls/A6NCZVTK58#yuGnqqnzYlhA Matt

On Fri, Dec 13, 2024 at 9:29 AM Matt Borland via Boost < boost@lists.boost.org> wrote:
I received the following review directly from Ivan Matek, because the review is written as a PDF that exceeds the allowable size of the ML. The document can be accessed here: https://drive.proton.me/urls/A6NCZVTK58#yuGnqqnzYlhA
Please post this at a minimum as an HTML email. Preferably reformat to be text-only. Our future plan for the new boost.org is that we will collect and index reviews in the site database and cross-reference them from library pages. Reviews submitted as nicely formatted PDFs unfortunately will not be picked up by this. It would be best if all reviews were posted to the mailing list so they can be accessed by tooling. Thanks

I received the following review directly from Ivan Matek, because the review is written as a PDF that exceeds the allowable size of the ML. The document can be accessed here: https://drive.proton.me/urls/A6NCZVTK58#yuGnqqnzYlhA
Please post this at a minimum as an HTML email. Preferably reformat to be text-only.
Our future plan for the new boost.org is that we will collect and index reviews in the site database and cross-reference them from library pages. Reviews submitted as nicely formatted PDFs unfortunately will not be picked up by this. It would be best if all reviews were posted to the mailing list so they can be accessed by tooling.
Below is my attempt at converting this to text for posterity.
Boost.Hash2 Review - Ivan Matek December 2024
About my Preferences/Experience/Expectations
I am not a Boost developer. I am a longtime C++ developer.
I know of the issues with std::hash, but am not familiar with Boost.ContainerHash. I am pretty sure I used hash_combine years ago, but not sure. I prefer modern C++, and dislike features that are easy to use wrong. I do not like WG21 ABI/keyword (e.g. co_await) decisions, as I feel they are ruining the language. In terms of safety vs performance, my opinion is that it depends. In general I feel underlying code can use unsafe constructs, as those components are much more used/tested than user-facing API; when it comes to user code I prefer safer, harder-to-misuse constructs, unless we are talking low latency or HPC domains.
When first hearing the name Hash2 I originally expected that the library would help many C++ developers that are frustrated with the fact C++23 std does not know how to hash std::pair or std::vector. That is still the main use case I am considering in this review, but during discussions on the mailing list and reading the documentation I have learned that it is good to also consider Hash2 as a general framework for decoupling type value representation and hashing algorithms.
My knowledge of cryptography is basic.
Bias
I have not interacted in person with the primary author. But over years he has replied many times to my questions on the mailing list and I have noticed he is active on reddit where he often provides very good comments/answers, I often use the named argument trick I learned from his blog. Boost.Mp11 is one of my favorite Boost libraries. Beside that I know the author did a lot of work in Boost, e.g. fast hash maps and Boost.Describe.
Due to all things above I consider him a world class C++ expert.
With regards to the second author if I remember correctly we have never interacted. As for authors of the original old WG21 proposal I know that Howard is the author of std::chrono, provided many great answers on SO, and IIRC was author of original versions of libc++ so I also consider him a world class C++ expert.
Limitations
I did not follow any other discussion except the mailing list.
Design
I think the general C++ design of the library is great, so I will not spend much time on that, beside saying that I feel that, like the STL separates iterators and algorithms, it separates hash algorithms and object representations. Beside that, not hashing the size when the size of the type is constant is a smart optimization.
C++11
With regard to the decision to support C++11 I think new Boost libraries should not support C++11, but that is a topic that has been discussed at length, and in any case it is not a reason for library acceptance or rejection. It is possible that with C++17 or C++20 API could be nicer, but I can not judge that without actually seeing it, besides assuming that the concept for the algorithm could give nicer error messages.
HashAlgorithm Concept
Unfortunately I do not know enough about the domain to judge it, or whether it would be better to have more than one concept.
Describe
With regards to the dependency on Describe: I feel Describe is a great library, although I have never used it in production. So this is a huge benefit, not a liability. I am very happy to see Boost libraries combined to produce nice code. I understand that people are worried about paying for compilation of includes that they do not use, but I think the benefits outweigh this unfortunate downside.
Variant
I do not understand why variant is not supported.
There are three main areas where I think there is opportunity for improvements:
- Easy to use, hard to misuse wrappers for users
- More advanced avoidance of hashing size
- Cryptographic guarantees
Easy to Use Hard to Misuse
It is possible that I am misunderstanding the reasons for the signature of the tag_invoke function, but it is not clear to me why the current:
template

There has been a lot of discussion, in the hash2 review, about non-idempotent result() / every hash2 hash algorithm being required to be an extensible hash function. I have questions:
Isn't any non-idempotent result function surprising?
As I said in my review, hash2 imposing the extensibility requirement on "raw" hash algorithms that are not XOFs (you can always do "something" to extend it, but doing "something" does not always make a good extensible hash function) is surprising. Ideally hash2 should allow an XOF to be used, but not by requiring every hash to pretend to be one. I did say some documentation would help, but I feel that this is relying on docs to make it hard to "do the wrong thing", or even possible to do what should be a simple thing (use an existing hash algorithm/algorithm implementation) without that in itself being a silly mistake (NOT having result in some way stir the pot is a bug now - unless one considers extension by repetition to be "good enough", which, as absurd as it sounds, maybe it is - it's at least "honest").
What is the specification for a "hash2 XOF"? Does the author really mean to take on the role of specifying the properties required? If FIPS says that a function is an XOF, it probably is. If FIPS doesn't say that, but Peter says it is one (it must be, to be a "hash2 hash algorithm")... Is that ok? Who wins?

Darryl Green wrote:
There has been a lot of discussion, in hash2 review, about non-idempotent result() / every hash2 hash algorithm is required to be an extensible hash function. I have questions:
Isn't any non-idempotent result function surprising?
I wouldn't say so. If you have a thing that exposes three operations - construct, update, result - your expectation should be that it supports the following sequence
construct update update ... update result
and any deviations put you into "read the documentation" land. Adding a second `result` call to the end can easily have one of the following effects:
- return an unspecified value
- return the same value as the previous call (idempotence)
- fail (either by returning an error or throwing)
- undefined behavior
There's no a priori reason to expect idempotence here, unless the specification guarantees it, and if you look at the opinions expressed so far in this review, you'll find people who argue for each of the above being their preference, presumably the one they wouldn't be surprised by.
In addition, if you now add
update update result
to the end of that sequence, you have some further choices to make. I suspect that we'll also have a diversity of opinions here as to what the unsurprising effect of the above should be; my expectation would be for a person who considers idempotence the only unsurprising alternative to prefer the clean semantics of `result` always returning the hash of the partial message accumulated so far by the preceding sequence of update calls. But of course that's not at all undisputed, partly because it imposes a cost.
As currently specified, you can obtain these semantics by doing the following:
auto get_partial_result( Hash const& hash )
{
    return Hash( hash ).result();
}
which is why I consider them derived semantics, and not a fundamental operation that the algorithm needs to provide. They are achievable from the outside, with no loss of functionality, and with the costs paid by the user who needs them.
As I said in my review, hash2 imposing the extensibility requirement on "raw" hash algorithms that are not XOFs (you can always do "something" to extend it but doing "something" does not always make a good extensible hash function) is surprising. Ideally hash2 should allow an XOF to be used but not by requiring every hash to pretend to be one. I did say some documentation would help but I feel that relying on docs to make it hard to "do the wrong thing" or even possible to do what should be a simple thing (use an existing hash algorithm/algorithm implementation) without that in itself being a silly mistake (NOT having result in some way stir the pot is a bug, now, unless one considers extension by repetition to be "good enough" - which absurd as it sounds, maybe it is - it's at least "honest").
In short, the argument here is that we shouldn't require hash algorithm authors to provide result extension, because they might screw it up.
That's a legitimate argument, in principle; if result extension were something uniquely hard to provide, or something particularly error-prone (measured against the baseline of implementing the rest of the hash algorithm), I would probably have agreed. But that's not the case. In fact, most of the time, result extension is free, in that if you just implement `result` in an entirely straightforward manner per the algorithm spec, you automatically get result extension. It's easier to implement it accidentally than to make a mistake. And it's not true that the rest of the hash algorithm is easier to implement, or less error prone. Seed construction and update are at least as tricky, if not more.
In addition, the exact same argument applies against seeded construction; people can easily make mistakes when implementing it. But I consider this an acceptable cost to pay, because the functionality is necessary for the intended use of the hash algorithms, by the library and by the users.
Why is result extension necessary? Well, consider the std::hash use case, where the wrapper object does this (ignoring seeding at the moment):
std::size_t operator()( T const& v ) const
{
    H h;
    boost::hash2::hash_append( h, {}, v );
    return boost::hash2::get_integral_result<std::size_t>( h.result() );
}
If H::result_type is uint32_t, but std::size_t is 64 bit, get_integral_result needs to extend the uint32_t value into 64 bits. It currently does the equivalent of
return h.result() * 0x7FFFFFFF'7FFFFFFF;
but I'm planning to change it to take advantage of result extension and do this instead:
return (uint64_t)h.result() << 32 | h.result();
There is no scenario in which the former produces better 64 bit hash values than the latter, unless, of course, the implementation of result extension is majorly screwed up.
For another example, consider hash_append_unordered_range, which currently does
w += hash2::get_integral_result<std::uint64_t>( h2.result() );
As we already established, a 64 bit accumulator w isn't going to cut it; so I'm planning to change w from
std::uint64_t w = 0;
to
std::uint64_t w[ 4 ] = {};
This would require the ability to obtain 256 bits of hash from the hash algorithm. Again, there is no way to do this from the outside, without support from the hash algorithm, and achieve a better result distribution. In the general case, when you need to do nontrivial things with the hash algorithm that aren't just a simple sequence of `update` calls, result extension is simply necessary.
I suppose I can, in principle, implement the above use cases without result extension, by using an MGF-like scheme. E.g. instead of
return (uint64_t)h.result() << 32 | h.result();
do this
H h1( h );
h1.update( "\x01", 1 );
auto r1 = h1.result();
H h2( h );
h2.update( "\x02", 1 );
auto r2 = h2.result();
return (uint64_t)r1 << 32 | r2;
but this imposes completely unnecessary performance costs.
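To make the 256 bit case concrete, consuming result extension looks roughly like this (a sketch; the algorithm is taken by value so the caller's accumulated state is left untouched):

    #include <boost/hash2/xxhash.hpp>
    #include <array>
    #include <cstddef>
    #include <cstdint>

    std::array<std::uint64_t, 4> extend_to_256( boost::hash2::xxhash_64 h )
    {
        std::array<std::uint64_t, 4> w = {};

        for( std::size_t i = 0; i < w.size(); ++i )
        {
            w[ i ] = h.result(); // each call advances the internal state
        }

        return w;
    }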
What is the specification for a "hash2 XOF"? Does the author really mean to take on the role of specifying the properties required? If FIPS says that a function is an XOF it probably is. If FIPS doesn't say that, but Peter says it is one (it must be to be a "hash2 hash algorithm")... Is that ok? Who wins?
If FIPS doesn't say that a hash function is an XOF, it isn't. Result extension doesn't automatically turn any hash function into an XOF, similarly to how byte sequence seeding doesn't turn any hash function into a MAC. You can use the function as an XOF or a MAC, but this doesn't make it meet the stricter cryptographic requirements for an XOF or a MAC.
It's probably my own fault for not making that point in the documentation. I did make an explicit note in the constructor description that even though it makes all algorithms usable as MACs, an established MAC algorithm such as HMAC should be used unless the algorithm has been specifically designed to be used as a MAC (as SipHash was.) But that's not what I said when describing `result`, and I should have.
To sum up with a general point, the hash algorithm requirements as specified reflect (my idea of) what a modern hash algorithm (one designed today) would provide as an API. If we look at SipHash, which is relatively recent, we'll see that it's a very good match for these requirements; the only slight issue is that it doesn't support seed sequences of arbitrary length (which is a regression in usability compared to HMAC.) In particular, it supports byte sequence seeding, doesn't expose its entire internal state as the digest, and trivially supports result extension.
There is actually a SipHash variant that produces a 128 bit digest, instead of a 64 bit one, and it's almost exactly equivalent to a result-extended hash2::siphash_64. The only difference is that some xor constants have been changed so that the low 64 bits of the 128 bit variant do not match the value produced by the 64 bit variant.
Thanks to everyone who read this far. We should gamify the list and have some sort of achievement be unlocked by that.

Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
All, Reminder that the review for Hash2 ends tomorrow (15 December) night. If you have not already submitted a review of the library please consider doing so. Matt

On Sat, Dec 14, 2024 at 19:15, Matt Borland via Boost
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
This is my review of the proposed Hash2 library. Before the review I want to state that I am affiliated with the C++ Alliance.
- What is your evaluation of the design?
I read the paper on which the library is based years ago and was very fond of the design outlined in it. It's great to see it finally realised. The library has a few innovations with regards to the original paper, and after reading all of the reviews I consider those innovations useful.
- What is your evaluation of the implementation?
I only glanced at the implementation.
- What is your evaluation of the documentation?
I think the documentation should be much more explicit about the implications of the design choices it made. After reading other people's comments and reviews, it is clear to me that many people have not realized that the implications of the requirements of the HashAlgorithm concept are deliberate and not a mistake. Also, it would probably be good to explicitly state that no vetting by crypto experts or certification by relevant agencies was done for any of the hashers provided by the library.
- What is your evaluation of the potential usefulness of the library? Do you already use it in industry?
I think that the library achieves what it tried to achieve (see below). And I consider that very useful.
- Did you try to use the library? With which compiler(s)? Did you have any problems?
In order to review Hash2 I experimented with integrating it into Boost.JSON. Hashing is used in two places in Boost.JSON. One of the uses is boost:: and std::hash support for JSON containers. Rewriting the existing code with Hash2 was very straightforward, as it is extremely similar to the existing usage of ContainerHash. The benefit of using Hash2 is exactly the one outlined in the original paper: ease of switching hashing algorithms.
The second use of hashing in Boost.JSON is hashing of object keys. Rewriting the existing hashing function to become a HashAlgorithm was very straightforward (although this could be because our algorithm is so simple). After that I experimented with replacing the hashing algorithm with several of the ones provided by Hash2. This was extremely easy: changing an include line and changing the hash object type.
This ability to trivially change hashing algorithms is, as far as I can tell, the main objective of the library. That is the reason for the requirements of HashAlgorithm. While those requirements may exclude many of the algorithms provided by existing libraries today, I hope that the maintainers of those libraries will see the utility of these additional operations and will provide them.
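For readers who want to try the same exercise, the std::hash side of such an integration can be sketched as follows (my illustration; the wrapper name is invented):

    #include <boost/hash2/get_integral_result.hpp>
    #include <boost/hash2/hash_append.hpp>
    #include <boost/hash2/xxhash.hpp>
    #include <cstddef>

    // Generic std::hash-style wrapper; swapping H swaps the algorithm.
    template<class T, class H = boost::hash2::xxhash_64>
    struct hash2_hasher
    {
        std::size_t operator()( T const& v ) const
        {
            H h;
            boost::hash2::hash_append( h, {}, v );
            return boost::hash2::get_integral_result<std::size_t>( h.result() );
        }
    };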
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
With reading documentation, all reviews and comments, and trying the library out, I've spent around 5 hours.
- Are you knowledgeable about the problem domain?
I'm not an expert on either hashing or cryptography. I suggest we ACCEPT the library into Boost.

Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
All, This is the final reminder that the review period for Hash2 ends tonight. Matt

Here's my review. I want to thank Peter Dimov and Christian Mazakas for their work.
Does this library bring real benefit to C++ developers for real world use-case?
Yes. Improving the state of hash functions for unordered containers is a worthy goal, as is easily changing the hash function being used. Choosing a specific hash algorithm or plugging in external hash algorithms is very useful. I like the choice of hash algorithms. The list can be expanded if people clamor for a new algorithm.
Do you have an application for this library?
Yes. We always use unordered containers, which is the primary use case here.
Does the API match with current best practices?
Yes. It's a significant improvement over "Boost.Hash1". Maybe it could just be "Boost.Hash", since there's no "Boost.Hash1" and this library is not only about containers. I was initially concerned that `result()` was not idempotent, but Peter's arguments made sense to me. The authors could change the name to some verb, but that's up to them. It is the same with the provided hashers being as cryptographically safe as they could be, but that seems like an unfair criterion for a library focused on containers and aiming to be an improvement over "Boost.Hash1". However, in both cases, expectations have changed over time, so big admonitions in the documentation are worth it. hash_append_sized_range could become something like hash_append_range_and_size, to avoid confusion with std::ranges::sized_range.
Is the documentation helpful and clear?
Yes. The documentation is very complete and precise. It's driven by goals and examples, which is excellent. Since the implementations are not supposed to be the fastest, the documentation could include some benchmarks. This could be the first example since unordered containers are the most common use case. Then some content about wrapping external hashing algorithms could be included, since the library is about generalizing that. I wouldn't say I like single-page documentation very much, but it does the job.
Did you try to use it? What problems or surprises did you encounter?
Yes. I compiled the code and ran the examples.
What is your evaluation of the implementation?
The code is very well organized and adequately documented. I hope the authors can figure out an implementation for hash_append_unordered_range that doesn't lose entropy with cryptographically safe hash functions. Or maybe they could provide alternatives and document their trade-offs. Or there could be some hash_aggregator concept to let it be customized.
Are you knowledgeable about the problem domain?
Yes. I have been using various hash functions in multiple applications for a long time.
Please explicitly state that you either *accept* or *reject* the inclusion of this library into boost.
Also please indicate the time & effort spent on the evaluation and give the reasons for your decision.
I recommend accepting the library. I read the documentation and the source code. I compiled the library and ran the examples. I spent time on and off over the last week evaluating the library; in total, I spent about two days evaluating it.
Disclaimer
I should also mention my affiliation with the C++ Alliance as a Staff Engineer.
On Sun, Dec 15, 2024 at 14:15, Matt Borland via Boost <boost@lists.boost.org> wrote:
Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
All,
This is the final reminder that the review period for Hash2 ends tonight.
Matt
-- Alan Freitas https://alandefreitas.github.io/alandefreitas/ https://github.com/alandefreitas

Dear all,
The review of Hash2 by Peter Dimov and Christian Mazakas begins today Saturday, December 7th through Sunday December 15th, 2024.
Code: https://github.com/pdimov/hash2 Docs: https://pdimov.github.io/hash2/
Dear all, The review period for Hash2 is now closed. Thank you to everyone who participated by writing a review, and generating lively discussion. I will publish the report and results in short order. Matt
participants (20)
- Alan de Freitas
- Alexander Grund
- Andrey Semashev
- Christian Mazakas
- Christopher Kormanyos
- Darryl Green
- Ion Gaztañaga
- Ivan Matek
- Joaquín M López Muñoz
- Klemens Morgenstern
- Matt Borland
- Peter Dimov
- Peter Turcan
- Richard Hodges
- Ruben Perez
- Samuel Neves
- Tom Kent
- Vinnie Falco
- Zach Laine
- Дмитрий Архипов