Re: [Boost-users] Boost.Functional/Hash

2 Sep 2010

      AMDG

Andrej van der Zee wrote:
...
I have a question that I hope can be answered by using Boost. I have
the following datastructure:
struct ip_info
{
  const u_int8_t _ifile_index;
  const unsigned long _src_ip;
  const unsigned long _dst_ip;
  const u_int16_t _id;
  const u_int16_t _length;
  const u_int16_t _src_port;
  const u_int16_t _dst_port;
  const u_int32_t _seq;
  const u_int32_t _ack_seq;
};
The object is identified by all its members. Now I want to count
duplicates. But I have MANY of these objects, I mean billions. I have
to reduce the memory footprint of my program, because it soon grows
over 4GB and boom! So I rather not keep all the objects in memory and
why would I, because I do not need them. I just need to count
duplicates.
My idea was to hash such objects to a smaller values and use that hash
value as key_type in a map. The value_type of the map would be an
integer. This something like std::map<int32_t, u_int16_t> or
something.
Then I though maybe use Boost.Functional/Hash, but are there any
guarantees that the hash value is unique?
No.  It's impossible to guarantee this, because there
are fewer possible hash values than input values.
...
How unique are they? Are
there better alternatives?
Assuming an ideal hash function that produces k-bit values,
the probability that n elements have no collisions is
\frac{2^k!}{(2^k - n)!2^{kn}}.  With a 32 bit hash,
the probability of a collision is about 50% when there
are 77,000 elements.

In Christ,
Steven Watanabe