[hash2] Difference from paper: HashAlgorithm interface vs. function object

Hi,

The HashAlgorithm interface defined in Boost.Hash2 is notably different from the one described in N3980 (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html). Specifically, the paper defines HashAlgorithm as a function object, whose operator() is equivalent to Boost.Hash2's HashAlgorithm::update method. The function object also provides a conversion operator to result_type, which is equivalent to Boost.Hash2's HashAlgorithm::result method.

Is this discrepancy intentional, and are there plans (on the paper authors' or the library authors' side) to eliminate it?

Personally, I don't think a function object is the right interface for a hash algorithm. Update and finalize are two distinct operations provided by hash algorithms, and both should be performed explicitly by the user. I don't think a conversion operator is the right name for finalize: conversion operators are normally used when one type can meaningfully "impersonate" another, and that is not the case here. The function object also misuses result_type, as its operator() doesn't return it.

The paper discusses that the fact that a hash algorithm is a function object can be used to wrap it in std::function, which can be useful with the pimpl idiom and for type erasure. I don't think this will actually work, at least not as simply as the paper describes, because std::function won't allow extracting the computed hash digest (i.e. you won't be able to call the conversion operator on the wrapped function object). I think users will have to write some wrapper classes anyway to retain access to the wrapped hash algorithm.

However, this is still an interesting use case. Does Boost.Hash2 intend to provide utilities to address it? For example, Boost.Hash2 could provide a hash-algorithm-agnostic interface for feeding an algorithm with input data, and a wrapper for an external algorithm object that would implement this interface. Something along these lines:

class polymorphic_hash_algorithm
{
public:
    polymorphic_hash_algorithm( polymorphic_hash_algorithm const& ) = delete;
    polymorphic_hash_algorithm& operator=( polymorphic_hash_algorithm const& ) = delete;

    virtual void update( const void* data, size_t size ) = 0;

protected:
    polymorphic_hash_algorithm() = default;
    ~polymorphic_hash_algorithm() = default;
};

template< typename Hash >
class polymorphic_hash_algorithm_wrapper :
    public polymorphic_hash_algorithm
{
public:
    using hash_type = Hash;
    using result_type = typename hash_type::result_type;

    static constexpr size_t block_size = hash_type::block_size;

private:
    Hash& h;

public:
    explicit polymorphic_hash_algorithm_wrapper( hash_type& h ) noexcept :
        h( h )
    {
    }

    void update( const void* data, size_t size ) override
    {
        h.update( data, size );
    }

    result_type result()
    {
        return h.result();
    }
};

This way, polymorphic_hash_algorithm could be used to abstract away the hash algorithm type while remaining compatible with hash_append and friends.

As a side note, this would be another use case where the hash algorithm passed to hash_append is non-copyable. And while copyability could be added, most likely it wouldn't be free.
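To make the difference between the two designs concrete, here is a minimal sketch of both interface styles. The FNV-1a-like mixing is a toy stand-in, not any real library code:

#include <cstddef>
#include <cstdint>

// N3980 style: a function object. operator() consumes bytes; an explicit
// conversion to result_type finalizes and returns the digest.
class n3980_style_hasher
{
    std::uint64_t state_ = 14695981039346656037ull;

public:
    using result_type = std::uint64_t;

    void operator()( void const* data, std::size_t size ) noexcept
    {
        auto p = static_cast<unsigned char const*>( data );
        for( std::size_t i = 0; i < size; ++i )
            state_ = ( state_ ^ p[ i ] ) * 1099511628211ull;
    }

    explicit operator result_type() noexcept // plays the role of finalize
    {
        return state_;
    }
};

// Boost.Hash2 style: update and result are distinct, named operations.
class hash2_style_hasher
{
    std::uint64_t state_ = 14695981039346656037ull;

public:
    using result_type = std::uint64_t;

    void update( void const* data, std::size_t size ) noexcept
    {
        auto p = static_cast<unsigned char const*>( data );
        for( std::size_t i = 0; i < size; ++i )
            state_ = ( state_ ^ p[ i ] ) * 1099511628211ull;
    }

    result_type result() noexcept
    {
        return state_;
    }
};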

Andrey Semashev wrote:
Hi,
The HashAlgorithm interface defined in Boost.Hash2 is notably different from the one described in N3980 (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html). Specifically, the paper defines HashAlgorithm as a function object, whose operator() is equivalent to Boost.Hash2's HashAlgorithm::update method. The function object also provides a conversion operator to result_type, which is equivalent to Boost.Hash2's HashAlgorithm::result method.
Is this discrepancy intentional and are there plans (on the paper authors' or the library authors' side) to eliminate it?
It is, yes.
The paper discusses that the fact that a hash algorithm is a function object can be used to wrap it in std::function, which can be useful with the pimpl idiom and for type erasure. I don't think this will actually work, at least not as simply as the paper describes, because std::function won't allow extracting the computed hash digest (i.e. you won't be able to call the conversion operator on the wrapped function object). I think users will have to write some wrapper classes anyway to retain access to the wrapped hash algorithm.
That was more or less my line of thought too.
However, this is still an interesting use case. Does Boost.Hash2 intend to provide utilities to address it?
As I already said, I had a type-erased wrapper on the to-do list, but decided against providing one, for two reasons:

1. A type-erased wrapper can't provide a constructor, not even a default one.
2. The copy constructor of the type-erased wrapper would need to allocate and be potentially throwing. While this is not prohibited by the requirements, it's unlikely to be desirable.

You've already stated that you don't think copy construction should be required. But copy construction is required by the current implementation of hash_append_unordered_range.

What if we just drop support for unordered ranges?

While I appreciate the ingenuity of this Stalin-inspired design approach, I don't think that the library should only require from the hash algorithms what the library uses today. hash_append_unordered_range should be seen as representative of what users may need to do with hash algorithms, and it clearly demonstrates that sometimes, taking a copy of the hash algorithm is needed.

What about solving both (1) and (2) by making construction and copy construction optional requirements (along with result extension)?

That, too, is not what I would consider particularly user-friendly. Note that we now have five properties (four constructors and one result retrieval function) that can vary independently. This translates to 32 possible hash algorithm interfaces. No, what I prefer is for users to be able to rely on each of these five properties, and for hash algorithm authors to be required to implement them, instead of picking what to support and what not.
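To illustrate the two objections, a minimal sketch of the kind of type-erased wrapper in question. All names here are hypothetical, not library API:

#include <cstddef>
#include <memory>
#include <type_traits>

class any_hash
{
    struct base
    {
        virtual ~base() = default;
        virtual void update( void const* data, std::size_t n ) = 0;
        virtual std::unique_ptr<base> clone() const = 0;
    };

    template<class H> struct impl: base
    {
        H h_;
        explicit impl( H const& h ): h_( h ) {}

        void update( void const* data, std::size_t n ) override { h_.update( data, n ); }
        std::unique_ptr<base> clone() const override { return std::make_unique<impl>( h_ ); }
    };

    std::unique_ptr<base> p_;

public:
    // (1): the wrapper cannot offer a default (or seeded) constructor,
    // because it doesn't know which concrete algorithm to construct
    template<class H, class = std::enable_if_t<!std::is_same_v<std::decay_t<H>, any_hash>>>
    explicit any_hash( H const& h ): p_( std::make_unique<impl<H>>( h ) ) {}

    // (2): copying must allocate, and can therefore throw
    any_hash( any_hash const& r ): p_( r.p_->clone() ) {}

    void update( void const* data, std::size_t n ) { p_->update( data, n ); }
};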

On Wed, Dec 11, 2024 at 6:32 PM Peter Dimov via Boost
hash_append_unordered_range should be seen as representative of what users may need to do with hash algorithms, and it clearly demonstrates that sometimes, taking a copy of the hash algorithm is needed.
Can you please elaborate why this is necessary (not speaking generally, just in this particular case)? I originally thought we do Hash h2( h ); so that the seed of h matters, but then I noticed that at the end of the function we do hash2::hash_append( h, f, w ); hash2::hash_append_size( h, f, m ); anyway, so I presume the seed of h matters even if we default-constructed the hashers in the loop.

Ivan Matek wrote:
Can you please elaborate why this is necessary (not speaking generally, just in this particular case)? I originally thought we do Hash h2( h ); so that the seed of h matters, but then I noticed that at the end of the function we do hash2::hash_append( h, f, w ); hash2::hash_append_size( h, f, m ); anyway, so I presume the seed of h matters even if we default-constructed the hashers in the loop.
It matters for the final hash value, but it won't matter for creating collisions by crafting unordered maps. That is, any collision will be seed-independent because w will not depend on the seed. And seed-independent collisions are bad.
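To make the attack scenario concrete, a toy sketch. The linear "digest" below stands in for what a default-constructed per-element hasher would compute; it is deliberately weak so that a colliding pair is easy to exhibit:

#include <cstdint>

// Seed-independent per-element digest (what a default-constructed h2
// would give an attacker: a publicly computable constant per element).
std::uint64_t digest( int x )
{
    return 0x9E3779B97F4A7C15ull * static_cast<std::uint64_t>( x );
}

int main()
{
    // Two different element sets engineered to have equal sums:
    // digest(1) + digest(4) == digest(2) + digest(3), since this toy
    // digest is linear. Both sets then produce the same w, hence the
    // same final hash, no matter what seed the outer hasher h carries.
    std::uint64_t w1 = digest( 1 ) + digest( 4 );
    std::uint64_t w2 = digest( 2 ) + digest( 3 );
    return w1 == w2 ? 0 : 1; // returns 0: a seed-independent collision
}

A real hash would require actual search effort to find such sets, but the point is that the search can be done once, offline, and the result works for every seed.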

On Wednesday, 11 December 2024 at 19:31 +0200, Peter Dimov via Boost wrote:
You've already stated that you don't think copy construction should be required. But copy construction is required by the current implementation of hash_append_unordered_range.
The way you phrase it makes it sound like it could be implemented without requiring copy construction. Is that a poor understanding on my side? Or would it incur some other penalty (e.g. performance) if that requirement were dropped and hash_append_unordered_range were implemented without copy construction? In the latter case, I think it would be worth explaining in the documentation.

Regards,

Julien

Julien Blanc wrote:
On Wednesday, 11 December 2024 at 19:31 +0200, Peter Dimov via Boost wrote:
You've already stated that you don't think copy construction should be required. But copy construction is required by the current implementation of hash_append_unordered_range.
The way you phrase it makes it sound like it could be implemented without requiring copy construction. Is that a poor understanding on my side?
I feel like I already answered this question.

The alternative mechanism given in N3980, constructing a temporary std::set from the unordered set elements and then hashing that, is wildly impractical for obvious reasons, performance and otherwise.

Using default-constructed instances of the hash algorithm, instead of copies, makes the hashing seed-independent, which means that a way to engineer collisions will work regardless of the seed used.

I say "by the current implementation" because I obviously can't say anything about hypothetical future implementations unknown to me.
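For concreteness, a sketch of that N3980 alternative, with hypothetical names; the costs are visible directly in the code:

#include <set>

// N3980's suggestion: copy the unordered elements into an ordered
// container, then hash that. Sketch only, not library code.
template<class Hash, class Flavor, class UnorderedSet>
void hash_unordered_via_set( Hash& h, Flavor const& f, UnorderedSet const& s )
{
    // O(n log n), allocates, and requires operator< on the elements
    std::set<typename UnorderedSet::value_type> t( s.begin(), s.end() );

    for( auto const& x: t )
    {
        hash2::hash_append( h, f, x );
    }

    hash2::hash_append_size( h, f, t.size() );
}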

On 12 December 2024 at 14:22, "Peter Dimov" wrote:
Julien Blanc wrote:
On Wednesday, 11 December 2024 at 19:31 +0200, Peter Dimov via Boost wrote:
You've already stated that you don't think copy construction should be required. But copy construction is required by the current implementation of hash_append_unordered_range.
The way you phrase it makes it sound like it could be implemented without requiring copy construction. Is that a poor understanding on my side?
I feel like I already answered this question.
Thanks for clarifying this again, then.
The alternative mechanism given in N3980, constructing a temporary std::set from the unordered set elements, then hashing that, is wildly impractical for obvious reasons, performance and otherwise.
Yes, this one is pretty obvious.
Using default-constructed instances of the hash algorithm, instead of copies, makes the hashing seed-independent, which means that a way to engineer collisions will work regardless of the seed used.
This one is less obvious. That's a design motivation that should at least be documented. The rationale in N3980 for copying the hash algorithm was performance (by the way, I don't really buy the performance argument; it seems like a pretty unusual case), completely unrelated to unordered containers.

Regards,

Julien

On 12 December 2024, at 14:22, Peter Dimov via Boost wrote:
Using default-constructed instances of the hash algorithm, instead of copies, makes the hashing seed-independent, which means that a way to engineer collisions will work regardless of the seed used.
Can’t you construct h2 by seeding it with a result from h (possibly after updating h with the range size)?

Joaquin M Lopez Munoz

Joaquín M López Muñoz wrote:
On 12 December 2024, at 14:22, Peter Dimov via Boost wrote:
Using default-constructed instances of the hash algorithm, instead of copies, makes the hashing seed-independent, which means that a way to engineer collisions will work regardless of the seed used.
Can’t you construct h2 by seeding it with a result from h (possibly after updating h with the range size)?
That's (a) much slower and (b) requires the ability to obtain a result without perturbing the state, which is basically equivalent to making a copy.
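Expressed in code, the suggestion amounts to something like the sketch below. derive_element_hasher is a hypothetical helper, and an integral result_type is assumed, which digest-producing algorithms don't have:

#include <cstdint>

template<class Hash>
Hash derive_element_hasher( Hash const& h )
{
    // result() finalizes and perturbs the state, so it cannot be called
    // on h directly; a snapshot copy is needed first, which is the very
    // operation the suggestion was trying to avoid.
    Hash h2( h );
    return Hash( static_cast<std::uint64_t>( h2.result() ) );
}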

On 12/12/24 16:22, Peter Dimov via Boost wrote:
Julien Blanc wrote:
On Wednesday, 11 December 2024 at 19:31 +0200, Peter Dimov via Boost wrote:
You've already stated that you don't think copy construction should be required. But copy construction is required by the current implementation of hash_append_unordered_range.
The way you phrase it makes it sound like it could be implemented without requiring copy construction. Is that a poor understanding on my side?
I feel like I already answered this question.
The alternative mechanism given in N3980, constructing a temporary std::set from the unordered set elements, then hashing that, is wildly impractical for obvious reasons, performance and otherwise.
Using default-constructed instances of the hash algorithm, instead of copies, makes the hashing seed-independent, which means that a way to engineer collisions will work regardless of the seed used.
But the original hash algorithm h does depend on the seed, and so does its final result. Conceptually, hash_append_unordered_range has the effect of applying h to the input sequence ordered in a specific (though unspecified) way. That is, regardless of the seed used to initialize h, the input data of h does not depend on the seed. Again, conceptually, this looks equivalent to the case when each h2 that is used to hash each element of the sequence does not depend on the seed (i.e. produces a value that is solely dependent on the hashed element).

Andrey Semashev wrote:
On 12/12/24 16:22, Peter Dimov via Boost wrote:
Using default-constructed instances of the hash algorithm, instead of copies, makes the hashing seed-independent, which means that a way to engineer collisions will work regardless of the seed used.
But the original hash algorithm h does depend on the seed, and so does its final result.
void hash_append_unordered_range( Hash& h, Flavor const& f, It first, It last )
{
    std::uint64_t w = 0;

    // compute w from [first, last)

    hash2::hash_append( h, f, w );
    hash2::hash_append_size( h, f, m );
}

If two different ranges produce the same w, the final result will be the same, even though it does depend on the seed.
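One plausible shape of the elided loop, as a sketch only (the actual implementation differs in detail, e.g. in how a digest is folded into w; m is the element count, and an integral result_type is assumed):

std::uint64_t w = 0;
std::size_t m = 0;

for( It it = first; it != last; ++it, ++m )
{
    Hash h2( h ); // per-element hasher copies the seeded state
    hash2::hash_append( h2, f, *it );

    // addition is commutative, so w doesn't depend on iteration order
    w += static_cast<std::uint64_t>( h2.result() );
}

If h2 were default-constructed instead, every per-element digest, and therefore w itself, would be a seed-independent constant.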
participants (6)
- Andrey Semashev
- Ivan Matek
- Joaquín M López Muñoz
- Julien Blanc
- Peter Dimov
- Vinnie Falco