
AMDG Andrej van der Zee wrote:
I have a question that I hope can be answered by using Boost. I have the following datastructure:
struct ip_info { const u_int8_t _ifile_index; const unsigned long _src_ip; const unsigned long _dst_ip; const u_int16_t _id; const u_int16_t _length; const u_int16_t _src_port; const u_int16_t _dst_port; const u_int32_t _seq; const u_int32_t _ack_seq; };
The object is identified by all its members. Now I want to count duplicates. But I have MANY of these objects, I mean billions. I have to reduce the memory footprint of my program, because it soon grows over 4GB and boom! So I rather not keep all the objects in memory and why would I, because I do not need them. I just need to count duplicates.
My idea was to hash such objects to a smaller values and use that hash value as key_type in a map. The value_type of the map would be an integer. This something like std::map
or something. Then I though maybe use Boost.Functional/Hash, but are there any guarantees that the hash value is unique?
No. It's impossible to guarantee this, because there are fewer possible hash values than input values.
How unique are they? Are there better alternatives?
Assuming an ideal hash function that produces k-bit values, the probability that n elements have no collisions is \frac{2^k!}{(2^k - n)!2^{kn}}. With a 32 bit hash, the probability of a collision is about 50% when there are 77,000 elements. In Christ, Steven Watanabe