On 03/12/2015 10:58 AM, Steven Clark wrote:
There are probably some constraints you didn't mention.
Of course. :)
Here are some ideas based on various different guesses.
And thank you so much for taking the time to respond to my post.
* At 80 bytes per line, that's a total of about 15 GB of data. With a moderately beefy computer you can hold it all in memory.
* You can store the intermediate results unserialized, just dumping your structs into files. Only serialize when you're finished.
True that, but one of the details I omitted is that this app is linked with libsparse, which is like lint on steroids. This tool parses preprocessed files and creates an in-memory tree of all the symbols in the file. My code walks this tree to create a database of info germane to our purposes. Of course, this uses still more memory. With about 3000 files to process, there isn't enough memory on the average workstation to contain it all at once. When I tried to do this all in memory, even a big kahuna machine with 32 GB of memory and 48 cores tanked after about the 100th file.
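For reference, the "dump your structs" suggestion above is simple to do when the intermediate records are plain, trivially copyable structs with no pointers. A minimal sketch, with a made-up IntermediateRecord type standing in for the real data:

    #include <fstream>
    #include <vector>

    // Hypothetical stand-in for the per-symbol intermediate data; fixed-size
    // fields only (no pointers, no std::string), so the bytes can be dumped as-is.
    struct IntermediateRecord {
        int  symbol_id;
        int  edge_count;
        char name[64];
    };

    // Write the records for one input file with no serialization layer at all.
    void dump_raw(const std::vector<IntermediateRecord>& records, const char* path)
    {
        std::ofstream out(path, std::ios::binary);
        out.write(reinterpret_cast<const char*>(records.data()),
                  static_cast<std::streamsize>(records.size() * sizeof(IntermediateRecord)));
    }

    // Read them back; the record count falls out of the file size.
    std::vector<IntermediateRecord> load_raw(const char* path)
    {
        std::ifstream in(path, std::ios::binary | std::ios::ate);
        const std::streamsize bytes = in.tellg();
        std::vector<IntermediateRecord> records(bytes / sizeof(IntermediateRecord));
        in.seekg(0);
        in.read(reinterpret_cast<char*>(records.data()), bytes);
        return records;
    }

This only works when the files are written and read on the same machine with the same build, which is usually fine for intermediate results.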
* Depending on what you're doing, using an actual database to store your intermediate results might improve performance.
Tried that. The performance of boost serialization trumps the performance of a dbms. :)
* Reorganize your algorithm so it computes the final results for a file in one pass. Perhaps you can read each file, store some information in memory, then write results for each file.
* Store the intermediate results for all 3000 files in one file. Mmap the intermediate results file; this is another variation of the suggestion not to serialize intermediate results.
* Fix the program that reads the serialized files, so that it can read an arbitrary number of serialized records rather than just one. I'm sure this can be done - slurp in a serialized record, see if you're at the end of file, if not then repeat.
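For illustration, a read-until-end-of-file loop along those lines could look roughly like the sketch below. The Record type, the read_all name, and the assumption that all records were appended through a single text_oarchive into one file are guesses, not the real code:

    #include <boost/archive/text_iarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <fstream>
    #include <istream>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical record type; stands in for whatever struct is actually serialized.
    struct Record {
        std::string symbol;

        template <class Archive>
        void serialize(Archive& ar, const unsigned int /*version*/)
        {
            ar & symbol;
        }
    };

    // Keep deserializing records until the underlying stream is exhausted.
    std::vector<Record> read_all(const std::string& path)
    {
        std::ifstream ifs(path);
        boost::archive::text_iarchive ia(ifs);
        std::vector<Record> records;

        for (;;) {
            ifs >> std::ws;                                      // skip whitespace between records
            if (ifs.peek() == std::ifstream::traits_type::eof())
                break;                                           // end of file: stop
            Record r;
            ia >> r;                                             // slurp in one record
            records.push_back(std::move(r));
        }
        return records;
    }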
These steps offer the most promise. The code already reads all the serialized records into memory, into a vector, with one deserialization call. The fault lies in the algorithm I am using to manage duplicate symbols when I encounter them. What I do for every symbol is:

  . create a new node (vertex)
  . search the existing list for duplicates
  . if the symbol is a duplicate, add its connections (edges) to the pre-existing node and delete the new node
  . next

Performance drops from about 3 files per second to less than one per second at the end. For the 3000+ files, it takes more than 50 minutes on an 8-core machine with 16 GB of memory. To speed things up, I've created a nodes-only list, which reduces the size of the vector to be searched by a factor of 4. I haven't got this working yet, so I have yet to determine the performance gain.
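For concreteness, here is roughly what that per-symbol loop looks like, with the real types stripped down to hypothetical Node/Edge structs:

    #include <string>
    #include <vector>

    // Hypothetical, stripped-down stand-ins for the real graph types.
    struct Edge { std::string target; };

    struct Node {
        std::string       symbol;
        std::vector<Edge> edges;
    };

    // Merge one symbol and its connections into the graph, per the steps above.
    void add_symbol(std::vector<Node>& graph,
                    const std::string& symbol,
                    const std::vector<Edge>& connections)
    {
        // Search the existing list for a duplicate. This is a linear scan, so
        // the cost per symbol grows with the size of the graph (quadratic
        // overall) -- which is why the later files process so much more slowly
        // than the first ones.
        for (Node& existing : graph) {
            if (existing.symbol == symbol) {
                // Duplicate: add the new node's connections (edges) to the
                // pre-existing node and drop the new node.
                existing.edges.insert(existing.edges.end(),
                                      connections.begin(), connections.end());
                return;
            }
        }
        // Not a duplicate: keep the new node.
        graph.push_back(Node{symbol, connections});
    }

The nodes-only list mentioned above is aimed squarely at that scan: a smaller vector to walk on every lookup.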
If none of these ideas are useful, at least they should help point out what other constraints you have that were not evident in your first message.
Steven J. Clark
VGo Communications
Many thanks, Steven. I realize how busy everybody is, and I really appreciate the thoughtful and valuable input.

Regards,
Tony