I'm trying to serialize the data from multiple passes of an app over various files. Each pass generates data that must be serialized. If I simply append each serialization, then deserialization will only occur for the first instance, since the record count read by the deserializer covers only that first instance. What I'm doing instead is deserializing on each new pass, deleting the original file, and then serializing everything again with the new information.

If there were only a few files to process, this would not be a problem. However, there are thousands of files. Additionally, on each new pass, I check whether a certain type of record has already been saved, so with every pass I must search a deeper and deeper database. Currently, it's taking almost an hour to process about 3000 files, with an average of 55,000 lines per file. That is a huge amount of data, but I'm looking for a way to reduce the time this processing takes. Does anybody have a better idea than cycling through the serialize-deserialize-lookup-serialize sequence for each file?
There are probably some constraints you didn't mention. Here are some ideas based on various different guesses.

* At 80 bytes per line, that's a total of about 15 GB of data. With a moderately beefy computer you can hold it all in memory.
* You can store the intermediate results unserialized, just dumping your structs into files. Only serialize when you're finished. Or, keep all your intermediate results in memory until you're finished.
* Depending on what you're doing, using an actual database to store your intermediate results might improve performance.
* Reorganize your algorithm so it computes the final results for a file in one pass. Perhaps you can read each file, store some information in memory, then write results for each file.
* Store the intermediate results for all 3000 files in one file. Mmap the intermediate results file; this is another variation of the suggestion not to serialize intermediate results.
* Fix the program that reads the serialized files, so that it can read an arbitrary number of serialized records rather than just one. I'm sure this can be done - slurp in a serialized record, see if you're at the end of file, if not then repeat (see the sketch after this list).

If none of these ideas are useful, at least they should help point out what other constraints you have, that were not evident in your first message.

Steven J. Clark
VGo Communications
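A minimal sketch of that last suggestion, assuming each pass wrote one complete Boost text archive containing a std::vector of records (with text archives, each input archive should consume exactly the tokens its matching output archive wrote). The Record type and the file name are placeholders, not taken from the original post:

    #include <fstream>
    #include <string>
    #include <vector>
    #include <boost/archive/text_iarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>

    // Placeholder record type; the real per-symbol data would go here.
    struct Record {
        std::string symbol;
        template <class Archive>
        void serialize(Archive& ar, const unsigned /*version*/) { ar & symbol; }
    };

    int main() {
        std::ifstream in("combined.dat");   // all appended archives, one file
        std::vector<Record> all;

        // Keep constructing a fresh input archive as long as data remains;
        // each archive reads one appended chunk, then the loop repeats.
        in >> std::ws;
        while (in.peek() != std::char_traits<char>::eof()) {
            boost::archive::text_iarchive ia(in);
            std::vector<Record> chunk;
            ia >> chunk;
            all.insert(all.end(), chunk.begin(), chunk.end());
            in >> std::ws;                  // skip separators before the EOF test
        }
        return 0;
    }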
On 03/12/2015 10:58 AM, Steven Clark wrote:
There are probably some constraints you didn't mention.
Of course. :)
Here are some ideas based on various different guesses.
And thank you so much for taking the time to respond to my post.
* At 80 bytes per line, that's a total of about 15 GB of data. With a moderately beefy computer you can hold it all in memory.
* You can store the intermediate results unserialized, just dumping your structs into files. Only serialize when you're finished. Or, keep all your intermediate results in memory until you're finished.
True that, but one of the details I omitted is that this app is linked with libsparse, which is like lint on steroids. That library parses preprocessed files and creates a tree in memory of all the symbols in each file. My code walks this tree to create a database of info germane to our purposes, which of course uses still more memory. With about 3000 files to process, there isn't enough memory on the average workstation to contain it all at once. When I tried to do it all in memory, even a big kahuna machine with 32 GB of memory and 48 cores tanked after about the 100th file.
* Depending on what you're doing, using an actual database to store your intermediate results might improve performance.
Tried that. The performance of Boost serialization trumps the performance of a DBMS. :)
* Reorganize your algorithm so it computes the final results for a file in one pass. Perhaps you can read each file, store some information in memory, then write results for each file.
* Store the intermediate results for all 3000 files in one file. Mmap the intermediate results file; this is another variation of the suggestion not to serialize intermediate results.
* Fix the program that reads the serialized files, so that it can read an arbitrary number of serialized records rather than just one. I'm sure this can be done - slurp in a serialized record, see if you're at the end of file, if not then repeat.
These steps offer the most promise. The code already reads all the serialized records into a vector in memory with one deserialization call. The fault lies in the algorithm I am using to manage duplicate symbols when I encounter them. What I do for every symbol is:

* create a new node (vertex)
* search the existing list for duplicates
* if the symbol is a duplicate, add its connections (edges) to the pre-existing node and delete the new node
* move on to the next symbol

Performance drops from about 3 files per second to less than one per second at the end. For the 3000+ files, it takes more than 50 minutes on an 8-core machine with 16 GB of memory. To speed things up, I've created a nodes-only list, which reduces the size of the vector to be searched by a factor of 4. I haven't got this working yet, so I have yet to determine the performance gain.
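A hedged sketch of the usual fix for that per-symbol linear search: key the existing nodes by symbol name in a hash map, so the duplicate check stays roughly constant-time as the database grows. The type and member names below are placeholders, not the project's real graph types:

    #include <cstddef>
    #include <iterator>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Placeholder edge/node types; real ones would carry whatever sparse reports.
    struct Edge { std::string to; };

    struct Node {
        std::string symbol;
        std::vector<Edge> edges;
    };

    class SymbolGraph {
        std::unordered_map<std::string, Node> nodes_;   // keyed by symbol name
    public:
        // Create the node on first sight, or merge new edges into the existing
        // one - an average O(1) lookup instead of a scan of the whole vector.
        void add(const std::string& symbol, std::vector<Edge> new_edges) {
            Node& node = nodes_[symbol];     // inserts an empty Node if absent
            node.symbol = symbol;
            node.edges.insert(node.edges.end(),
                              std::make_move_iterator(new_edges.begin()),
                              std::make_move_iterator(new_edges.end()));
        }

        std::size_t size() const { return nodes_.size(); }
    };

With something like this in place, each file's symbols can be folded into the map as they are parsed, and the whole map serialized once at the end of the run.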
If none of these ideas are useful, at least they should help point out what other constraints you have, that were not evident in your first message.
Steven J. Clark VGo Communications
Many thanks, Steven. I realize how busy everybody is, and I really appreciate the thoughtful and valuable input. Regards, Tony
Tony Camuso wrote
I'm trying to serialize the data from multiple passes of an app on various files. Each pass generates data that must be serialized. If I simply append each serialization, then deserialization will only occur for the first instance, since the number of records read by the deserialization will only be for the first instance.
What I'm doing is deserializing on each new pass, deleting the original file, and then serializing everything again with the new information.
I'm not sure I understand what you're trying to do - but of course this is the list so I can just answer anyway. Why doesn't the following work? OK, a couple of problems:

a) tracking prevents writing of data multiple times
b) serialization requires that data be const, just to prevent users from doing this exact sort of thing - which is a mistake in the presence of tracking.

Solution - turn tracking off and cast away constness.

I have no idea if this is helpful - but maybe it's food for thought.

Robert Ramey
On 03/13/2015 12:35 PM, Robert Ramey wrote:
Tony Camuso wrote
I'm trying to serialize the data from multiple passes of an app on various files. Each pass generates data that must be serialized. If I simply append each serialization, then deserialization will only occur for the first instance, since the number of records read by the deserialization will only be for the first instance.
What I'm doing is deserializing on each new pass, deleting the original file, and then serializing everything again with the new information.
I'm not sure I understand what you're trying to do - but of course this is the list so I can just answer anyway.
Hi, Robert. I sent a response to Steven this morning, posted here ... http://lists.boost.org/boost-users/2015/03/83963.php ... that gives a little more detail about what I'm trying to do.
Why doesn't the following work?
Ok a couple of problems: a) tracking prevents writing of data multiple times b) serialization requires that data be const just to prevent users from doing this exact sort of thing - which is a mistake in the presence of tracking.
Solution - turn tracking off and cast away constness
How do I disable tracking? That sounds like it may be very useful. I must do my own tracking, as described in my response to Steven, so having Boost track for me is redundant and probably hurts performance. By "cast away constness" do you mean casting from "const type" to "type", as in static_cast<type>(identifier)?
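For reference, a minimal sketch of what Robert appears to be suggesting, using a placeholder MyRecord type: per-class tracking is disabled with the BOOST_CLASS_TRACKING macro, and the const requirement is met by saving through a const reference:

    #include <fstream>
    #include <boost/archive/text_oarchive.hpp>
    #include <boost/serialization/tracking.hpp>

    // Placeholder type standing in for whatever actually gets serialized.
    struct MyRecord {
        int value = 0;
        template <class Archive>
        void serialize(Archive& ar, const unsigned /*version*/) { ar & value; }
    };

    // Tell Boost.Serialization never to track MyRecord objects, so repeated
    // saves write the data each time instead of a back-reference to the first.
    BOOST_CLASS_TRACKING(MyRecord, boost::serialization::track_never)

    void save_pass(std::ofstream& out, MyRecord& rec) {
        boost::archive::text_oarchive oa(out);
        // "Casting away constness" here means saving through a const reference;
        // with tracking disabled this is safe even if rec changes between saves.
        oa << const_cast<const MyRecord&>(rec);
    }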
I have no idea if this is helpful - but maybe it's food for thought
Robert Ramey
Thanks, Robert. I appreciate any and all input. Regards, Tony Camuso
Hmmm - I included a code sketch of what I had in mind. Does it not show up? Robert Ramey
On 03/14/2015 05:22 PM, Robert Ramey wrote:
Hmmm - I included a code sketch of what I had in mind. Does it not show up?
Robert Ramey
It shows up on the Nabble link you gave me, but not on the Boost users list at http://lists.boost.org/boost-users/2015/03/83965.php. Thanks for the link!
Hi, Robert. I would have answered sooner, but other issues arose.

I had a look at your code, and that's basically what I'm already doing. The problem is that processing each file takes incrementally longer as the database grows, so the total time climbs badly with the number of files. It takes about an hour to process around 3000 files holding about 15 GB of data. Sounds reasonable, until you compare it to the compiler, which whizzes through all the same files, and more, in only a few minutes.

When I serialize the output without trying to recreate the whole database for each file, the time to process these 3000 files drops to about 5 minutes, which is a much more acceptable number for my target users. This yields one very large file with about 3000 appended serializations.

What I'd like to do, because I think it would be much faster, is to go through the one big file and deserialize each of those serializations as they are encountered. Early testing showed that it would only take a few minutes to integrate these pieces into one whole. If there were linefeeds in the serialized data, the code to do this would be much simpler.

Is there another, more architected way for me to deserialize an aggregate of serializations?
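A hedged sketch of the producer side of that plan, reusing the placeholder Record type and file name from the earlier reading sketch: each pass appends one self-contained archive to the big file, and the reading loop shown earlier can then pick the archives up one at a time:

    #include <fstream>
    #include <string>
    #include <vector>
    #include <boost/archive/text_oarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>

    // Placeholder record type, as in the earlier reading sketch.
    struct Record {
        std::string symbol;
        template <class Archive>
        void serialize(Archive& ar, const unsigned /*version*/) { ar & symbol; }
    };

    // Append one complete archive per pass instead of rewriting the whole
    // database; the file grows, but no pass has to re-read what came before.
    void append_pass(const std::vector<Record>& pass_results) {
        std::ofstream out("combined.dat", std::ios::app);
        {
            boost::archive::text_oarchive oa(out);
            oa << pass_results;
        }                       // the archive flushes when it goes out of scope
        out << '\n';            // a separator makes the boundaries easier to spot
    }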
tcamuso wrote
Problem is that processing each file takes incrementally longer as the database grows, ....
I would think that this is solvable, but I can't really comment without spending significant time looking at the specific code.
If there were linefeeds in the serialized data, the code to do this would be much simpler.
I don't even remember if there should be line feeds in there. Certainly XML archives have linefeeds. But again, I'd have to spend a lot of time looking at your specific case. Of course you could hire me by the hour if you like.
Is there another, more architected way for me to deserialize an aggregate of serializations?
I think what you want to do should be possible in an efficient way. However, it would require spending enough time with the library to understand how it works at a deeper level. I realized that this defeats the original appeal of the library to some extent. But it's still better than writing a new system from scratch. Robert Ramey
On 03/20/2015 12:17 PM, Robert Ramey [via Boost] wrote:
tcamuso wrote
Problem is that processing each file takes incrementally longer as the database grows,
I would think that this is solvable, but I can't really comment without spending significant time looking at the specific code.
Understood.
If there were linefeeds in the serialized data, the code to do this would be much simpler.
I don't even remember if there should be line feeds in there. Certainly XML archives have linefeeds. But again, I'd have to spend a lot of time looking at your specific case.
Hmm... can I save a text archive as XML? Does the serializer care whether XML tags are present? Interestingly, the text archiver was giving me linefeeds for a while. Now they aren't there. I didn't change any of the serialization code, but I did change the classes and structs that get serialized.
Of course you could hire me by the hour if you like.
:) I don't have that kind of money.
Is there another, more architected way for me to deserialize an aggregate of serializations?
I think what you want to do should be possible in an efficient way. However, it would require spending enough time with the library to understand how it works at a deeper level.
Unfortunately, time is of the essence.
I realized that this defeats the original appeal of the library to some extent. But it's still better than writing a new system from scratch.
Agreed. I need to get to the knee point with this app, then I can address these things in maintenance mode.
Robert Ramey
tcamuso wrote
Interestingly, the text archiver was giving me linefeeds for a while.
I always thought it worked that way. But I don't remember.
Now they aren't there.
which surprises me.
Of course you could hire me by the hour if you like. :) I don't have that kind of money.
How much does it cost all your customers to run your program for hours? Robert Ramey
On 03/20/2015 03:22 PM, Robert Ramey wrote:
tcamuso wrote
Interestingly, the text archiver was giving me linefeeds for a while.
I always thought it worked that way. But I don't remember.
Now they aren't there.
which surprises me.
Of course you could hire me by the hour if you like. :) I don't have that kind of money.
How much does it cost all your customers to run your program for hours?
My customers are my fellow engineers, who will likely run it as a cron job at night with all their other cron jobs. However, there are times when you need to refresh on the spot, and waiting an hour is a hideous prospect. Of course, most of us are balancing more than one thing at a time, so it's just another context switch in the big scheme of things.

Basically, what this thing does is look for exported symbols in the Linux kernel, using the sparse library. We are looking for deeply nested structures that could affect the kernel application binary interface (KABI). If changes are made to those structures that are not KABI-safe, then problems can emerge with 3rd-party apps that use the KABI. The idea is to provide kernel developers with a tool that can expose whether the data structure they are considering changing could affect the KABI. We have means to protect such changes, but it's difficult to know when to use them without a tool that can plumb the depths looking for any and all dependencies an exported symbol may have.

Many thanks and warm regards,
Tony Camuso
Platform Enablement
Red Hat
Greetings Robert. Given the assistance you and the other boost cognoscenti provided while I was developing my project, I feel that I owe you an update.

What I decided to do in the end was to use a distributed database model. The code generates a data file for each preprocessed kernel source file. Rather than squashing those together into one large database, I left them distributed in their respective source directories. Processing the whole kernel now takes only about 5 minutes on my desktop, and the lookup utility can find anything in less than a minute. Performance is enhanced all around, though the database is collectively about ten times larger than if I compressed it into one file. The trade-off of disk space for performance was well worth it.

The project is at a decent knee-point, though there are a few things I'm sure my fellow engineers will want to add or change. You can track the progress of the project at https://github.com/camuso/kabiparser

Thanks and regards,
Tony Camuso
Red Hat Platform Kernel
On Fri, Mar 20, 2015 at 10:37 AM, tcamuso wrote:
What I'd like to do, because I think it would be much faster, is to go through the one big file and deserialize each of those serializations as they are encountered. Early testing showed that it would only take a few minutes to integrate these pieces into one whole.
If there were some simple way for user code to recognize the end of an archive, maybe you could interpose an input filter using Boost.Iostreams. The filter would present EOF to its caller on spotting the end of archive, but leave the underlying file open (with its read pointer adjusted to immediately after the end of archive) for the application to bind another instance of the same filter onto the same underlying file.
If there were linefeeds in the serialized data, the code to do this would be much simpler.
Maybe you could derive an archive type yourself from one of the existing ones that differs only in appending an easily-recognizable marker when finished writing?
On 03/20/2015 02:18 PM, Nat Goodspeed wrote:
On Fri, Mar 20, 2015 at 10:37 AM, tcamuso wrote:
What I'd like to do, because I think it would be much faster, is to go through the one big file and deserialize each of those serializations as they are encountered. Early testing showed that it would only take a few minutes to integrate these pieces into one whole.
If there were some simple way for user code to recognize the end of an archive, maybe you could interpose an input filter using Boost.Iostreams. The filter would present EOF to its caller on spotting the end of archive, but leave the underlying file open (with its read pointer adjusted to immediately after the end of archive) for the application to bind another instance of the same filter onto the same underlying file.
I must look into Boost.Iostreams, because I can recognize the end of an archive by detecting the "serialization::archive" string at the beginning of another. This is a hack, I realize, and there's no guarantee that this banner won't change in the future.
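A hedged sketch of that hack, assuming the combined file is a concatenation of plain text archives, each holding a std::vector of a placeholder Record type. As noted above, it leans on the "serialization::archive" banner, which the library does not promise to keep stable:

    #include <cctype>
    #include <cstddef>
    #include <fstream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>
    #include <boost/archive/text_iarchive.hpp>
    #include <boost/serialization/string.hpp>
    #include <boost/serialization/vector.hpp>

    // Placeholder record type, as in the earlier sketches.
    struct Record {
        std::string symbol;
        template <class Archive>
        void serialize(Archive& ar, const unsigned /*version*/) { ar & symbol; }
    };

    int main() {
        std::ifstream in("combined.dat");
        const std::string whole((std::istreambuf_iterator<char>(in)),
                                std::istreambuf_iterator<char>());

        // Each text archive begins "<len> serialization::archive <version> ...",
        // so the integer token just before the banner marks an archive start.
        const std::string banner = "serialization::archive";
        std::vector<std::size_t> starts;
        for (std::size_t pos = whole.find(banner); pos != std::string::npos;
             pos = whole.find(banner, pos + banner.size())) {
            std::size_t s = pos;
            while (s > 0 && std::isspace(static_cast<unsigned char>(whole[s - 1]))) --s;
            while (s > 0 && std::isdigit(static_cast<unsigned char>(whole[s - 1]))) --s;
            starts.push_back(s);
        }

        // Deserialize each slice independently and fold it into one collection.
        std::vector<Record> all;
        for (std::size_t i = 0; i < starts.size(); ++i) {
            const std::size_t end = (i + 1 < starts.size()) ? starts[i + 1]
                                                            : whole.size();
            std::istringstream piece(whole.substr(starts[i], end - starts[i]));
            boost::archive::text_iarchive ia(piece);
            std::vector<Record> chunk;
            ia >> chunk;
            all.insert(all.end(), chunk.begin(), chunk.end());
        }
        return 0;
    }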
If there were linefeeds in the serialized data, the code to do this would be much simpler.
Maybe you could derive an archive type yourself from one of the existing ones that differs only in appending an easily-recognizable marker when finished writing?
This may be what's needed. I'm almost out of time on this project, so I may have to eat the long time it takes to build the database and revisit this another day when I'm in maintenance mode. I will be happy to post my results.
participants (5)

- Nat Goodspeed
- Robert Ramey
- Steven Clark
- tcamuso
- Tony Camuso