
Marsh Ray wrote:
On 09/02/2010 04:06 AM, Andrej van der Zee wrote:
Hi,
I have a question that I hope can be answered by using Boost. I have the following datastructure:
struct ip_info { const u_int8_t _ifile_index; const unsigned long _src_ip; const unsigned long _dst_ip; const u_int16_t _id; const u_int16_t _length; const u_int16_t _src_port; const u_int16_t _dst_port; const u_int32_t _seq; const u_int32_t _ack_seq; };
The object is identified by all its members. Now I want to count duplicates. But I have MANY of these objects, I mean billions. I have to reduce the memory footprint of my program, because it soon grows over 4GB and boom! So I rather not keep all the objects in memory and why would I, because I do not need them. I just need to count duplicates.
My, what a clear statement of a classic problem. Hmm, are you sure this isn't homework? :-)
I know I'm not supposed to do this but .... Here is an easy solution. a) create a file of all the objects. Divide the file size by the record length to get number of records. b) sort them using the postman's sort evaluation version (available at www.rrsd.com) using the -u switch to eliminate duplicates. c) calculate the output file size using the method in a) above d) the difference between the two numbers is the number of duplicates. On my 5 year old machine, sorting 1 GB takes about 1 minute. Since your records are about 2 bytes long it would be 20 min/billion records. I doubt you'd find a faster way. Robert Ramey