[hash2] Question about cryptographic guarantees of hash_append_unordered_range

Hi everyone,

The implementation uses a uint64_t into which it sums the hashes of the individual elements (each computed with a "fresh" hasher, so ordering does not matter), and it finally hashes that sum and the range size into the originally passed hasher. I see no problem with this logic, since it cleverly exploits the fact that uint64_t overflow is defined and that the sum is commutative. But I wonder if this preserves the cryptographic quality of the original hash. For example, if we use SHA-2, it seems to me that finding collisions becomes easier, since the sum is "only" 64 bits and the length is "only" 64 bits (or 32 on some systems), and maybe even less if the attacker knows the unordered range was not very long, e.g. that the number of items was less than 10000. Does somebody know more about this particular domain and could point me to some resource to learn more? I know this was in the original proposal, but it is possible the original proposal was not focusing only on cryptographically secure algorithms. I was unable to google any guarantees, maybe because I don't know what keywords to use. :)
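For concreteness, here is a minimal sketch of the scheme as described above. It is not the actual hash2 source: it assumes a hash2-style hasher with update(void const*, size_t), assumes result() can be reduced to a 64-bit value (the real code has a helper for this), and omits the flavor parameter that the real hash_append takes.

    #include <cstddef>
    #include <cstdint>

    // Sketch only: hashes each element in isolation, sums the 64-bit results
    // with wrapping (commutative) addition, then feeds the sum and the count
    // into the caller's hasher.
    template<class Hasher, class It>
    void append_unordered_sketch( Hasher& h, It first, It last )
    {
        std::uint64_t w = 0; // wraps modulo 2^64
        std::uint64_t n = 0;

        for( ; first != last; ++first )
        {
            Hasher h2( h );  // per-element hasher; copying h means whatever
                             // was hashed before the range acts as a seed
            hash_append( h2, *first ); // real signature also takes a flavor
            w += static_cast<std::uint64_t>( h2.result() ); // only 64 bits kept
            ++n;
        }

        // only the 64-bit sum and the size reach the original hasher
        h.update( &w, sizeof( w ) );
        h.update( &n, sizeof( n ) );
    }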

Ivan Matek wrote:
Ideally I'd use a 128 or 256 bit integer there, but using __uint128_t portably is still a pain, and hashing unordered containers is a very niche use case. You'd probably never encounter it in practice... well, except for boost::json::object, I suppose.

Note that if you have things in the message preceding the unordered range, this effectively serves as a seed for the per-element hash function, which makes engineering collisions harder. Still, 64 bits are 64 bits. 128 would have been a good compromise between speed and security, but std::uint128_t is not here yet: https://github.com/cplusplus/papers/issues/1793 (Please don't leave comments on the above issue; it's not for discussion, but for tracking paper progress.)

Matt Borland is considering proposing boost::uint128_t to Core, but that would mean depending on Core.

On Sun, Dec 8, 2024 at 11:44 AM Peter Dimov via Boost wrote:
Matt Borland is considering proposing boost::uint128_t to Core, but that would mean depending on Core.
Would it be better to make this its own library? More fine-grained libraries could have appeal to Boost users. Thanks

On 08/12/2024 at 20:49, Vinnie Falco via Boost wrote:
I wouldn't like to add another dependency for library users. Wouldn't it be acceptable to take all the bits from the hash, treat the hash as an array of 64 bit or 32 bit hashes, and apply something like hash_combine or similar to each element, so that we combine big hashes while preserving all the bits of result_type? This would also scale to 512 bit or bigger hashes.

Best,
Ion
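If I understand the suggestion correctly, it could look roughly like the sketch below: keep an accumulator as wide as the digest, split each per-element digest into 64-bit lanes, and combine lane-wise, so that no bits of result_type are discarded. The 32-byte digest type and the wrapping-add combine step are illustrative assumptions, not library API.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    using digest256 = std::array<unsigned char, 32>; // e.g. a SHA-256 result

    // Combine a per-element digest into a 256-bit accumulator, lane by lane.
    // Wrapping addition keeps the operation commutative, so element order
    // still does not matter.
    void combine_digest( std::array<std::uint64_t, 4>& acc, digest256 const& d )
    {
        for( std::size_t i = 0; i < acc.size(); ++i )
        {
            std::uint64_t lane;
            std::memcpy( &lane, d.data() + i * 8, 8 );
            acc[ i ] += lane;
        }
    }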

On Sun, Dec 8, 2024 at 8:44 PM Peter Dimov via Boost wrote:
As for int128: my first intuition is that the sum type should be conditional on the size of result_type, i.e. use int128 only when result_type is greater than 64 bits. I am not sure how this affects the cryptographic properties of the algorithm, but it could be a good heuristic, since I don't know of any safe crypto algorithm that uses 64 bits or less for its result. Still, this is not something I would recommend, since as mentioned I do not know enough about cryptography to understand which transformations are safe.
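Just to illustrate that heuristic (not a recommendation, as said above): the accumulator type could be chosen from result_type's width, falling back to the non-portable __int128 on compilers that have it.

    #include <cstdint>
    #include <type_traits>

    // Pick the accumulator width from the hasher's result_type.
    // unsigned __int128 is a gcc/clang extension, which is exactly the
    // portability problem mentioned earlier in the thread.
    template<class ResultType>
    using unordered_sum_t = std::conditional_t<
        ( sizeof( ResultType ) > sizeof( std::uint64_t ) ),
        unsigned __int128,
        std::uint64_t >;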

On Sun, Dec 8, 2024 at 11:27 AM Ivan Matek via Boost wrote:
...sum is commutative.
Bitwise XOR and modular unsigned multiplication are also commutative. Check out this paper: "Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking" https://people.csail.mit.edu/devadas/pubs/mhashes.pdf Thanks

Vinnie Falco wrote:
Bitwise xor has the problem that xoring two equal values yields 0, so the sequences [1], [1, 1, 1], [1, 2, 2] and so on all hash to the same thing. Multiply has the problem that a single 0 cancels out everything, and the problem that multiplying 64 even numbers zeroes out the low 64 bits.
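Both failure modes are easy to check numerically (illustrative only; all arithmetic below is the usual unsigned wrap-around modulo 2^64):

    #include <cassert>
    #include <cstdint>

    int main()
    {
        // xor: duplicates cancel, so the multisets {1} and {1, 2, 2} collide
        std::uint64_t x1 = 1u;
        std::uint64_t x2 = 1u ^ 2u ^ 2u;
        assert( x1 == x2 );

        // multiplication mod 2^64: 64 even factors contribute at least 2^64,
        // so the product of 64 twos is already 0
        std::uint64_t p = 1;
        for( int i = 0; i < 64; ++i ) p *= 2;
        assert( p == 0 );
    }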

On Sun, Dec 8, 2024 at 7:27 PM Ivan Matek via Boost wrote:
It doesn't, even if you preserve the length of the hash. There are two attack approaches here, depending on the size of the sets you can work with.

On the low-data side, you can turn collision or second-preimage finding into a lattice problem. Suppose you have some set whose intermediate hashes are [a=13816517845079277968, b=2675547450723878905, c=18133324598626471763, d=10200581252073748329]. Build the lattice spanned by the rows

    [K*2^64, 0, 0, 0, 0]
    [K*a,    1, 0, 0, 0]
    [K*b,    0, 1, 0, 0]
    [K*c,    0, 0, 1, 0]
    [K*d,    0, 0, 0, 1]

where K is a constant large enough to force the first coordinate of short vectors to carry more weight and therefore sum to 0. Short vectors of this lattice will have the form [0, x, y, z, w], which means that x*a + y*b + z*c + w*d = 0 (mod 2^64), or in other words the set [a, b, c, d] and the multiset [[a]*(x+1), [b]*(y+1), [c]*(z+1), [d]*(w+1)] have the same hash. In the above case we can use K=2^32 and use LLL to obtain the vector [0, 2870, 27951, 8550, 46911], and thus a second preimage for the above hash. Shorter second preimages are possible by increasing the dimension. In particular, for 64-bit hashes it is possible to find lattice vectors containing coefficients only in {-1, 0, 1}, corresponding to removals and additions of single elements from the original set, at dimension ~48. Note that while finding shortest vectors in a lattice is a very hard problem as the dimension grows, here we don't really have to find exceedingly short vectors; they must instead have either -1 (removal) or positive coefficients. For collision finding, where we control both inputs, we can start with many repetitions of the same element to make the problem easier and allow reasonably small negative coefficients as well. See also [2, Appendix B].

For larger sets, the generalized birthday attacks from Wagner [1, Section 4] should also apply here with minimal modifications, and reduce the security of an n-bit sum to O(2^(2*sqrt(n))).

As far as I can tell the only successful multiset hashes either require enormous moduli or use elliptic curves.

[1] https://www.iacr.org/archive/crypto2002/24420288/24420288.pdf
[2] https://arxiv.org/pdf/1601.06502
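As a sanity check, the relation behind that LLL vector can be verified directly with wrapping 64-bit arithmetic (which is already reduction modulo 2^64); the constants below are the intermediate hashes from the post:

    #include <cassert>
    #include <cstdint>

    int main()
    {
        std::uint64_t a = 13816517845079277968u;
        std::uint64_t b = 2675547450723878905u;
        std::uint64_t c = 18133324598626471763u;
        std::uint64_t d = 10200581252073748329u;

        // 2870*a + 27951*b + 8550*c + 46911*d == 0 (mod 2^64), so adding that
        // many extra copies of each element leaves the 64-bit sum unchanged.
        std::uint64_t s = 2870u * a + 27951u * b + 8550u * c + 46911u * d;
        assert( s == 0 );
    }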

But I wonder if this preserves the cryptographic quality of the original hash.
As far as I can tell the only successful multiset hashes either require enormous moduli or use elliptic curves.
[1] https://www.iacr.org/archive/crypto2002/24420288/24420288.pdf [2] https://arxiv.org/pdf/1601.06502
This is the correct answer. See also section 5.1 of NIST SP 800-107 [1].

Matt

[1] https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-107r1....

Matt Borland wrote: ...
Thanks Samuel, very informative.
I'm not sure why hash truncation applies here; what Samuel is saying is that this scheme for multiset hashing is insecure even if the original hash is not truncated. By the O(2^(2*sqrt(n))) bound above, a 256 bit hash, for instance, would only provide (at most) 32 bit security, and for 64 bit security one would need a 1024 bit source hash, which seems like overkill to me.

Samuel Neves wrote:
Thanks again for this informative post. It should probably be noted here that the current approach, inadequate as it is, does include the size of the multiset in the hash. I think that this makes it a bit harder to hack.

I've come up with several possible approaches to "harden" the current algorithm, and I'll use the opportunity to post them for discussion while we're all tuned to the hash2 wavelength. These are all independent; they can be applied separately from one another.

1. As suggested, extend the width of the accumulator from 64 bits to 256 bits, by using uint64_t[4] with element-wise addition.

2. In addition to computing the sum of the hashes w = sum(hi), also compute the sum of hi xor C, where C is a constant, e.g. 0x9e3779b97f4a7c15, and include it in the final hash as well. This can be extended to computing the sums of hi ^ C1, hi ^ C2, and so on. It can also be generalized to computing the sum of f(hi), where f is some arbitrary, preferably nonlinear, transformation function, e.g. mulx(hi, C), where mulx is the primitive created by performing a double width multiplication (64x64 -> 128) and then xor-ing the high and low parts of the result.

3. In addition to computing the sum of the hashes, also compute a histogram of their hex digits. That is, maintain an array of type uint64_t[16] and increment its i-th element each time the hex digit i appears in an intermediate hash. Then include this histogram in the final hash.

4. Use a two-pass approach. On the first pass, compute the hash as is done today, and add it to the hash algorithm state, as done today. Then, on a second pass, do the same calculation, but starting from the current state. That is, effectively, compute the intermediate hashes using a hash function H' that is dependent on the contents of the multiset.

I can easily see that each of 1-4 makes attacks harder, but I lack the knowledge to estimate how much harder. Maybe some of these are completely unnecessary because they are trivial to work around. Maybe not. Comments appreciated.
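For concreteness, here is a rough, purely illustrative sketch of what options 2 and 3 could look like when applied to a 64-bit intermediate hash hi; the constant is the one from the post, everything else is an assumption of this sketch and not a proposed patch:

    #include <cstdint>

    struct unordered_accumulator
    {
        std::uint64_t sum = 0;         // current scheme: plain wrapping sum
        std::uint64_t sum2 = 0;        // option 2: sum of a transformed hi
        std::uint64_t hist[ 16 ] = {}; // option 3: histogram of hex digits
        std::uint64_t size = 0;

        // 64x64 -> 128 multiply, xor of high and low halves
        // (unsigned __int128 is a gcc/clang extension)
        static std::uint64_t mulx( std::uint64_t x, std::uint64_t c )
        {
            unsigned __int128 r = static_cast<unsigned __int128>( x ) * c;
            return static_cast<std::uint64_t>( r ) ^
                   static_cast<std::uint64_t>( r >> 64 );
        }

        void add( std::uint64_t hi )
        {
            sum += hi;
            sum2 += mulx( hi, 0x9e3779b97f4a7c15ull ); // nonlinear f(hi)

            for( int j = 0; j < 16; ++j )   // count each hex digit of hi
                ++hist[ ( hi >> ( 4 * j ) ) & 0xF ];

            ++size;
        }

        // sum, sum2, hist and size would all be fed into the final hasher
    };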
participants (9)
- Darryl Green
- Ion Gaztañaga
- Ivan Matek
- Kostas Savvidis
- Matt Borland
- Peter Dimov
- Rainer Deyke
- Samuel Neves
- Vinnie Falco