
On 08/12/2024 at 12:42, Peter Dimov via Boost wrote:
... this is implied by "equivalent to the original", because if the copy produced a different output, it wouldn't be equivalent.
- Should the HashAlgorithm support calls to "update" and "result" several times, interleaving both calls?
It should, yes. Any calls are allowed, in any order.
Understood.
- I don't know if SHA-1/RIPEMD-128 should be in the library at all (they are insecure, and I'm not sure they are useful in any real context).
SHA-1 is still widely used, e.g. for Git commit hashes. We added RIPEMD-128 in order to have a recommended replacement for MD5 in the case where the digest has to be exactly 128 bits.
Ok.
- The original proposal mentions a "variadic version of hash_append" (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#variadic); maybe that would be a useful utility in the library.
It might be, yes.
Let's add it to the to-do list ;-)
- The hash_append_unordered_range function is surprising for people like me who are not experts on hashing, but it's an important concept for users. The original proposal explains the problem (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#unordered); I feel the problem should also be explained in the Hash2 documentation, along with how the library design solves it (using a copy of the HashAlgorithm, as proposed in N3980).
N3980 also says that "And there are various other schemes. They are all implementable. But they each have their advantages and disadvantages. Therefore this proposal proposes none of them. Should the future expose the ideal hash_append specification for unordered containers, it can always be added at that time."
Why does Hash2 offer a single alternative?
I don't know of a better one.
Constructing a temporary std::set is not a serious alternative. In addition to being allocating, slow, and potentially throwing, it also relies on the existence of operator<, which an unordered container doesn't require.
Ok. Yes, the other alternative described in the original paper is much worse. I'm not sure how much the hash quality is weakened by the current implementation: all element hashes are converted to 64 bits, added, and a final hash is computed. It seems to me that combining and shuffling all the original result_type bits from each individual hash would provide a better hash for the whole sequence, but I'm certainly not able to prove that mathematically...
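For reference, the order-independent scheme being discussed could be sketched like this (std::hash stands in for the per-element hasher here; the library uses a copy of the outer hash algorithm instead):

```cpp
#include <cstdint>
#include <functional>
#include <unordered_set>

// Hash each element independently, reduce each digest to 64 bits,
// and combine with a commutative operation (addition), so the
// container's iteration order does not affect the final value.
std::uint64_t unordered_digest( std::unordered_set<int> const& s )
{
    std::uint64_t sum = 0;

    for( int x: s )
    {
        sum += static_cast<std::uint64_t>( std::hash<int>()( x ) );
    }

    return sum; // this combined value is then fed to the outer hasher
}
```

Two sets holding the same elements produce the same digest regardless of insertion (and therefore iteration) order.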
- N3980 proposes a "type-erased hasher" (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html#type_erased_hasher); would it be interesting to include it in Hash2?
I had the type erased hasher in the "planned additions" list for a long time, partly because I thought I needed it for the hash2sum example. But it doesn't meet the hash algorithm requirements, because it doesn't have the correct constructors and because the copy constructor is allocating and potentially throwing (the latter is not, strictly speaking, a violation but it's still undesirable.)
Type erasure requires trade-offs, but for code using idioms like pimpl, you can't use templates, or it's not desirable to template code that previously lived in a .cpp file. One option is for boost::hash2::type_erased_hasher to have enough small buffer optimization (at least for the hashes implemented in the boost::hash2 library itself) so that allocation can be avoided in most cases. A good type_erased_hasher is not trivial, and certainly useful.
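The classic N3980-style type erasure could be sketched roughly like this (hypothetical names, a 64-bit result_type assumed for simplicity, and the small buffer optimization mentioned above omitted, so this version allocates on copy):

```cpp
#include <cstdint>
#include <cstddef>
#include <memory>

// Sketch of a type-erased hasher: a virtual interface plus a
// templated model wrapping any concrete hash algorithm.
class type_erased_hasher
{
private:

    struct concept_t
    {
        virtual ~concept_t() = default;
        virtual void update( void const* p, std::size_t n ) = 0;
        virtual std::uint64_t result() = 0;
        virtual std::unique_ptr<concept_t> clone() const = 0;
    };

    template<class H> struct model_t: concept_t
    {
        H h_;

        explicit model_t( H h ): h_( h ) {}

        void update( void const* p, std::size_t n ) override { h_.update( p, n ); }
        std::uint64_t result() override { return h_.result(); }

        std::unique_ptr<concept_t> clone() const override
        {
            return std::unique_ptr<concept_t>( new model_t( *this ) );
        }
    };

    std::unique_ptr<concept_t> impl_;

public:

    template<class H> explicit type_erased_hasher( H h )
        : impl_( new model_t<H>( h ) ) {}

    // The copy constructor allocates, which is exactly the problem
    // discussed above w.r.t. the hash algorithm requirements.
    type_erased_hasher( type_erased_hasher const& r )
        : impl_( r.impl_->clone() ) {}

    void update( void const* p, std::size_t n ) { impl_->update( p, n ); }
    std::uint64_t result() { return impl_->result(); }
};
```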
- The documentation shows (in the chapter "Use with Unordered Containers") several "hash" classes that are needed to use Hash2 with std/boost unordered_xxx classes (or boost::intrusive::unordered_xxx types). I think it would be a good idea to include such utilities in the Hash2 library (under names like "hasher" and "seeded_hasher").
One of these, or maybe some different one, will eventually be included, once I decide on which.
Great.
- The design says "The size_type member type of the flavor affects how container and range sizes (typically of type size_t) are serialized. Since the size of size_t in bytes can vary, serializing the type directly results in different hash values when the code is compiled for 64 bit or for 32 bit. Using a fixed width type avoids this".
I don't agree with this: between 32- and 64-bit compilation you will still get differences when hashing user-defined types that contain pointers, stored std::size_t/ptrdiff_t members, etc.
There are still a lot of types for which the size type is the only source of variability, e.g. std::vector<int>.
IMHO the xxx_flavour types defined in the library should define size_type as std::size_t. Those wanting hashes that are portable between 32 and 64 bit need to make a bigger effort than just configuring Hash2, and defining size_type as std::uint64_t will probably reduce performance for 32-bit users...
It doesn't.
Show me the numbers ;-) Maybe if benchmark results are added to the documentation this can be shown.
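To make the size_type point above concrete, here is a sketch of the two serialization choices (illustrative functions, not the library's; little-endian byte order assumed for the memcpy):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Widening to a fixed std::uint64_t yields the same 8 input bytes on
// 32-bit and 64-bit targets, so the hash input is identical.
std::vector<unsigned char> serialize_size_fixed( std::size_t n )
{
    std::uint64_t v = n;
    std::vector<unsigned char> out( 8 );
    std::memcpy( out.data(), &v, 8 );
    return out;
}

// Serializing the native std::size_t yields 4 bytes on a 32-bit
// target and 8 on a 64-bit one, so e.g. std::vector<int> hashes
// differently on the two platforms.
std::vector<unsigned char> serialize_size_native( std::size_t n )
{
    std::vector<unsigned char> out( sizeof( n ) );
    std::memcpy( out.data(), &n, sizeof( n ) );
    return out;
}
```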
- There is an asymmetry between the "writeNNle" and "writeNNbe" implementations (little vs. big endian). The little-endian implementations use detail::is_constant_evaluated() and endianness checks to be able to use memcpy in some cases (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail/write.hpp#L26), whereas this is not done in the big-endian implementations (https://github.com/pdimov/hash2/blob/develop/include/boost/hash2/detail/write.hpp#L73).
Probably few people use big-endian platforms (except those like me who still need to compile for powerpc targets...), but I would expect an equivalent implementation for the big-endian functions.
The implementations are equivalent in practice. GCC and Clang recognize the pattern used and turn it into a single mov; the memcpy path is only necessary for MSVC, which doesn't support big endian architectures.
Ok, understood.
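For reference, the byte-shift pattern being discussed looks roughly like this (a sketch, not the library's code); GCC and Clang recognize it and emit a single mov (plus a bswap for the non-native order):

```cpp
#include <cstdint>

// Write a 64-bit value in little-endian order, one byte at a time.
inline void write64le_sketch( unsigned char* p, std::uint64_t v )
{
    for( int i = 0; i < 8; ++i )
    {
        p[ i ] = static_cast<unsigned char>( v >> ( 8 * i ) );
    }
}

// Same pattern in big-endian order; the per-byte shifts are what
// the compilers pattern-match into a single store.
inline void write64be_sketch( unsigned char* p, std::uint64_t v )
{
    for( int i = 0; i < 8; ++i )
    {
        p[ i ] = static_cast<unsigned char>( v >> ( 8 * ( 7 - i ) ) );
    }
}
```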
- class Str example: N is missing its type, the <algorithm> include is missing, operator== has the wrong type...
Sorry, why does operator== have the wrong type?
Because operator== for Str compares X instances instead of Str instances:

    class Str
    {
    private:

        static constexpr N = 32; // missing type

        std::uint8_t size_ = 0;
        char data_[ N ] = {};

    public:

        friend constexpr bool operator==( X const& x1, X const& x2 ) // X instead of Str
        {
            return x1.size_ == x2.size_ &&
                std::equal( x1.data_, x1.data_ + x1.size_, x2.data_ );
        }
    };
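A corrected version of that example might look like this (sketch; constexpr std::equal requires C++20):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>

class Str
{
private:

    static constexpr std::size_t N = 32; // N now has a type

    std::uint8_t size_ = 0;
    char data_[ N ] = {};

public:

    // operator== now takes Str, not X
    friend constexpr bool operator==( Str const& x1, Str const& x2 )
    {
        return x1.size_ == x2.size_ &&
            std::equal( x1.data_, x1.data_ + x1.size_, x2.data_ );
    }
};
```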
- All "Use with Unordered Containers" examples should be compilable, including the needed headers.
The examples are actually compilable, I just stripped the #includes when embedding them in the documentation to avoid the unnecessary repetition. Maybe I shouldn't have. :-)
I've found (at least with my libraries) that some users like to copy-paste examples from the docs and want them to work without having to figure out what's missing.

Best,

Ion