Re: [Boost-users] Boost.Functional/Hash

2 Sep 2010

      Marsh Ray wrote:
...
On 09/02/2010 04:06 AM, Andrej van der Zee wrote:
...
Hi,
I have a question that I hope can be answered by using Boost. I have
the following datastructure:
struct ip_info
{
   const u_int8_t _ifile_index;
   const unsigned long _src_ip;
   const unsigned long _dst_ip;
   const u_int16_t _id;
   const u_int16_t _length;
   const u_int16_t _src_port;
   const u_int16_t _dst_port;
   const u_int32_t _seq;
   const u_int32_t _ack_seq;
};
The object is identified by all its members. Now I want to count
duplicates. But I have MANY of these objects, I mean billions. I have
to reduce the memory footprint of my program, because it soon grows
over 4GB and boom! So I rather not keep all the objects in memory and
why would I, because I do not need them. I just need to count
duplicates.
My, what a clear statement of a classic problem. Hmm, are you sure
this isn't homework? :-)
I know I'm not supposed to do this but ....

Here is an easy solution.

a) create a file of all the objects.  Divide the file size by the record 
length to get number
of records.
b) sort them using the postman's sort evaluation version (available at 
www.rrsd.com)
using the -u switch to eliminate duplicates.
c) calculate the output file size using the method in a) above
d) the difference between the two numbers is the number of duplicates.

On my 5 year old machine, sorting 1 GB takes about 1 minute.  Since your
records are about 2 bytes long it would be 20 min/billion records.  I doubt
you'd find a faster way.

Robert Ramey