[hash2] Question about cryptographic guarantees of hash_append_unordered_range

Hi everyone,

The implementation uses a uint64_t into which it sums the hashes of the individual elements (each computed with a "fresh" hasher, so ordering does not matter), and it finally hashes that sum and the range size into the originally passed hasher. I see no problem with this logic, since it cleverly exploits the fact that uint64_t overflow is defined and that the sum is commutative. But I wonder if this preserves the cryptographic quality of the original hash. For example, if we use SHA-2, it seems to me that finding collisions becomes easier, since the sum is "only" 64 bits and the length is "only" 64 bits (or 32 on some systems), and maybe even less if the attacker knows the unordered range was not very long, e.g. that the number of items was less than 10000. Does somebody know more about this particular domain and could point me to some resource to learn more? I know this was in the original proposal, but it is possible the original proposal was not focusing only on cryptographically secure algorithms. I was unable to google any guarantees, maybe because I don't know what keywords to use. :)
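For concreteness, here is a minimal sketch of the scheme as described above. It is not the actual hash2 source: it assumes a hash2-style hasher with update(void const*, size_t), assumes result() can be reduced to a 64-bit value (the real code has a helper for this), and omits the flavor parameter that the real hash_append takes.

    #include <cstddef>
    #include <cstdint>

    // Sketch only: hashes each element in isolation, sums the 64-bit results
    // with wrapping (commutative) addition, then feeds the sum and the count
    // into the caller's hasher.
    template<class Hasher, class It>
    void append_unordered_sketch( Hasher& h, It first, It last )
    {
        std::uint64_t w = 0; // wraps modulo 2^64
        std::uint64_t n = 0;

        for( ; first != last; ++first )
        {
            Hasher h2( h );  // per-element hasher; copying h means whatever
                             // was hashed before the range acts as a seed
            hash_append( h2, *first ); // real signature also takes a flavor
            w += static_cast<std::uint64_t>( h2.result() ); // only 64 bits kept
            ++n;
        }

        // only the 64-bit sum and the size reach the original hasher
        h.update( &w, sizeof( w ) );
        h.update( &n, sizeof( n ) );
    }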

Ivan Matek wrote:
Ideally I'd use a 128 or 256 bit integer there, but using __uint128_t portably is still a pain, and hashing unordered containers is a very niche use case. You'd probably never encounter it in practice... well, except for boost::json::object, I suppose.

Note that if you have things in the message preceding the unordered range, this effectively serves as a seed for the per-element hash function, which makes engineering collisions harder. Still, 64 bits are 64 bits. 128 would have been a good compromise between speed and security, but std::uint128_t is not here yet: https://github.com/cplusplus/papers/issues/1793 (Please don't leave comments on the above issue; it's not for discussion, but for tracking paper progress.)

Matt Borland is considering proposing boost::uint128_t to Core, but that would mean depending on Core.

On Sun, Dec 8, 2024 at 11:44 AM Peter Dimov via Boost wrote:
Matt Borland is considering proposing boost::uint128_t to Core, but that would mean depending on Core.
Would it be better to make this its own library? More fine-grained libraries could have appeal to Boost users. Thanks

On 08/12/2024 at 20:49, Vinnie Falco via Boost wrote:
I wouldn't like to add another dependency for library users. Wouldn't it be acceptable to take all the bits from the hash, treat the hash as an array of 64 bit or 32 bit hashes, and apply something like hash_combine or similar to each element, so that we combine big hashes while preserving all the bits of result_type? This would also scale to 512 bit or bigger hashes.

Best,
Ion
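If I understand the suggestion correctly, it could look roughly like the sketch below: keep an accumulator as wide as the digest, split each per-element digest into 64-bit lanes, and combine lane-wise, so that no bits of result_type are discarded. The 32-byte digest type and the wrapping-add combine step are illustrative assumptions, not library API.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    using digest256 = std::array<unsigned char, 32>; // e.g. a SHA-256 result

    // Combine a per-element digest into a 256-bit accumulator, lane by lane.
    // Wrapping addition keeps the operation commutative, so element order
    // still does not matter.
    void combine_digest( std::array<std::uint64_t, 4>& acc, digest256 const& d )
    {
        for( std::size_t i = 0; i < acc.size(); ++i )
        {
            std::uint64_t lane;
            std::memcpy( &lane, d.data() + i * 8, 8 );
            acc[ i ] += lane;
        }
    }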

On Sun, Dec 8, 2024 at 8:44 PM Peter Dimov via Boost wrote:
As for int128: my first intuition is that the sum type should be conditional on the size of result_type, i.e. use int128 only when result_type is greater than 64 bits. I am not sure how this affects the cryptographic properties of the algorithm, but it could be a good heuristic, since I don't know of any safe crypto algorithm that uses 64 bits or less for its result. Still, this is not something I would recommend, since as mentioned I do not know enough about cryptography to understand which transformations are safe.
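Just to illustrate that heuristic (not a recommendation, as said above): the accumulator type could be chosen from result_type's width, falling back to the non-portable __int128 on compilers that have it.

    #include <cstdint>
    #include <type_traits>

    // Pick the accumulator width from the hasher's result_type.
    // unsigned __int128 is a gcc/clang extension, which is exactly the
    // portability problem mentioned earlier in the thread.
    template<class ResultType>
    using unordered_sum_t = std::conditional_t<
        ( sizeof( ResultType ) > sizeof( std::uint64_t ) ),
        unsigned __int128,
        std::uint64_t >;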

On Sun, Dec 8, 2024 at 11:27 AM Ivan Matek via Boost wrote:
...sum is commutative.
Bitwise XOR and modular unsigned multiplication are also commutative. Check out this paper: "Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking" https://people.csail.mit.edu/devadas/pubs/mhashes.pdf Thanks

Vinnie Falco wrote:
Bitwise xor has the problem that xoring two equal values yields 0, so the sequences [1], [1, 1, 1], [1, 2, 2] and so on all hash to the same thing. Multiply has the problem that a single 0 cancels out everything, and the problem that multiplying 64 even numbers zeroes out the low 64 bits.
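Both failure modes are easy to check numerically (illustrative only; all arithmetic below is the usual unsigned wrap-around modulo 2^64):

    #include <cassert>
    #include <cstdint>

    int main()
    {
        // xor: duplicates cancel, so the multisets {1} and {1, 2, 2} collide
        std::uint64_t x1 = 1u;
        std::uint64_t x2 = 1u ^ 2u ^ 2u;
        assert( x1 == x2 );

        // multiplication mod 2^64: 64 even factors contribute at least 2^64,
        // so the product of 64 twos is already 0
        std::uint64_t p = 1;
        for( int i = 0; i < 64; ++i ) p *= 2;
        assert( p == 0 );
    }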

On Sun, Dec 8, 2024 at 7:27 PM Ivan Matek via Boost wrote:
It doesn't, even if you preserve the length of the hash. There are two attack approaches here, depending on the size of the sets you can work with.

On the low-data side, you can turn collision or second-preimage finding into a lattice problem. Suppose you have some set whose intermediate hashes are [a=13816517845079277968, b=2675547450723878905, c=18133324598626471763, d=10200581252073748329]. Build the lattice spanned by the rows

    [K*2^64, 0, 0, 0, 0]
    [K*a,    1, 0, 0, 0]
    [K*b,    0, 1, 0, 0]
    [K*c,    0, 0, 1, 0]
    [K*d,    0, 0, 0, 1]

where K is a constant large enough to force the first coordinate of short vectors to carry more weight and therefore sum to 0. Short vectors of this lattice will have the form [0, x, y, z, w], which means that x*a + y*b + z*c + w*d = 0 (mod 2^64), or in other words the set [a, b, c, d] and the multiset [[a]*(x+1), [b]*(y+1), [c]*(z+1), [d]*(w+1)] have the same hash. In the above case we can use K=2^32 and use LLL to obtain the vector [0, 2870, 27951, 8550, 46911], and thus a second preimage for the above hash. Shorter second preimages are possible by increasing the dimension. In particular, for 64-bit hashes it is possible to find lattice vectors containing coefficients only in {-1, 0, 1}, corresponding to removals and additions of single elements from the original set, at dimension ~48. Note that while finding shortest vectors in a lattice is a very hard problem as the dimension grows, here we don't really have to find exceedingly short vectors; they must instead have either -1 (removal) or positive coefficients. For collision finding, where we control both inputs, we can start with many repetitions of the same element to make the problem easier and allow reasonably small negative coefficients as well. See also [2, Appendix B].

For larger sets, the generalized birthday attacks from Wagner [1, Section 4] should also apply here with minimal modifications, and reduce the security of an n-bit sum to O(2^(2*sqrt(n))).

As far as I can tell the only successful multiset hashes either require enormous moduli or use elliptic curves.

[1] https://www.iacr.org/archive/crypto2002/24420288/24420288.pdf
[2] https://arxiv.org/pdf/1601.06502
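As a sanity check, the relation behind that LLL vector can be verified directly with wrapping 64-bit arithmetic (which is already reduction modulo 2^64); the constants below are the intermediate hashes from the post:

    #include <cassert>
    #include <cstdint>

    int main()
    {
        std::uint64_t a = 13816517845079277968u;
        std::uint64_t b = 2675547450723878905u;
        std::uint64_t c = 18133324598626471763u;
        std::uint64_t d = 10200581252073748329u;

        // 2870*a + 27951*b + 8550*c + 46911*d == 0 (mod 2^64), so adding that
        // many extra copies of each element leaves the 64-bit sum unchanged.
        std::uint64_t s = 2870u * a + 27951u * b + 8550u * c + 46911u * d;
        assert( s == 0 );
    }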

But I wonder if this preserves the cryptographic quality of the original hash.
As far as I can tell the only successful multiset hashes either require enormous moduli or use elliptic curves.
[1] https://www.iacr.org/archive/crypto2002/24420288/24420288.pdf [2] https://arxiv.org/pdf/1601.06502
This is the correct answer. See also section 5.1 of NIST SP 800-107 [1].

Matt

[1] https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-107r1....

Matt Borland wrote: ...
Thanks Samuel, very informative.
I'm not sure why hash truncation applies here; what Samuel is saying is that this scheme for multiset hashing is insecure even if the original hash is not truncated. By the O(2^(2*sqrt(n))) bound above, a 256 bit hash, for instance, would only provide (at most) 32 bit security, and for 64 bit security one would need a 1024 bit source hash, which seems like overkill to me.

Samuel Neves wrote:
Thanks again for this informative post. It should probably be noted here that the current approach, inadequate as it is, does include the size of the multiset in the hash. I think that this makes it a bit harder to hack.

I've come up with several possible approaches to "harden" the current algorithm, and I'll use the opportunity to post them for discussion while we're all tuned to the hash2 wavelength. These are all independent; they can be applied separately from one another.

1. As suggested, extend the width of the accumulator from 64 bits to 256 bits, by using uint64_t[4] with element-wise addition.

2. In addition to computing the sum of the hashes w = sum(hi), also compute the sum of hi xor C, where C is a constant, e.g. 0x9e3779b97f4a7c15, and include it in the final hash as well. This can be extended to computing the sums of hi ^ C1, hi ^ C2, and so on. It can also be generalized to computing the sum of f(hi), where f is some arbitrary, preferably nonlinear, transformation function, e.g. mulx(hi, C), where mulx is the primitive created by performing a double width multiplication (64x64 -> 128) and then xor-ing the high and low parts of the result.

3. In addition to computing the sum of the hashes, also compute a histogram of their hex digits. That is, maintain an array of type uint64_t[16] and increment its i-th element each time the hex digit i appears in an intermediate hash. Then include this histogram in the final hash.

4. Use a two-pass approach. On the first pass, compute the hash as is done today, and add it to the hash algorithm state, as done today. Then, on a second pass, do the same calculation, but starting from the current state. That is, effectively, compute the intermediate hashes using a hash function H' that is dependent on the contents of the multiset.

I can easily see that each of 1-4 makes attacks harder, but I lack the knowledge to estimate how much harder. Maybe some of these are completely unnecessary because they are trivial to work around. Maybe not. Comments appreciated.
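For concreteness, here is a rough, purely illustrative sketch of what options 2 and 3 could look like when applied to a 64-bit intermediate hash hi; the constant is the one from the post, everything else is an assumption of this sketch and not a proposed patch:

    #include <cstdint>

    struct unordered_accumulator
    {
        std::uint64_t sum = 0;         // current scheme: plain wrapping sum
        std::uint64_t sum2 = 0;        // option 2: sum of a transformed hi
        std::uint64_t hist[ 16 ] = {}; // option 3: histogram of hex digits
        std::uint64_t size = 0;

        // 64x64 -> 128 multiply, xor of high and low halves
        // (unsigned __int128 is a gcc/clang extension)
        static std::uint64_t mulx( std::uint64_t x, std::uint64_t c )
        {
            unsigned __int128 r = static_cast<unsigned __int128>( x ) * c;
            return static_cast<std::uint64_t>( r ) ^
                   static_cast<std::uint64_t>( r >> 64 );
        }

        void add( std::uint64_t hi )
        {
            sum += hi;
            sum2 += mulx( hi, 0x9e3779b97f4a7c15ull ); // nonlinear f(hi)

            for( int j = 0; j < 16; ++j )   // count each hex digit of hi
                ++hist[ ( hi >> ( 4 * j ) ) & 0xF ];

            ++size;
        }

        // sum, sum2, hist and size would all be fed into the final hasher
    };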
participants (9)
- Darryl Green
- Ion Gaztañaga
- Ivan Matek
- Kostas Savvidis
- Matt Borland
- Peter Dimov
- Rainer Deyke
- Samuel Neves
- Vinnie Falco