[serialization] Bug in text_oarchive but not in xml_oarchive?
I would assume that simply changing the following from xml_oarchive:

    ofstream ofs(filename.c_str(), ios::binary);
    boost::archive::xml_oarchive oa(ofs);
    oa << BOOST_SERIALIZATION_NVP(my_object);

to text_oarchive (with, of course, the corresponding changes to the included headers):

    ofstream ofs(filename.c_str(), ios::binary);
    boost::archive::text_oarchive oa(ofs);
    oa << BOOST_SERIALIZATION_NVP(my_object);

should, a priori, not result in a core dump in the text_oarchive version?

I am running under gcc 3.4.3 and the Intel C++ compiler version 9.0, with boost version 1.33.1. The text_oarchive fails with a segmentation violation on both compilers, and the xml_oarchive works with both. Cursory inspection of the xml output shows the xml to be well-formed, etc.

Before I try to narrow this down further and provide a clear test that shows this error (which, given the complexity of the code and the size of the datasets, will be a considerable effort), I would like to know if my above assumption is correct, or if it is possible I have made a mistake somehow in writing my serialization methods.

BTW, I am the same guy who has posted several (confirmed) bugs to this list: I am not a beginner, I have been using boost serialization for the better part of a year or so, so in general, I know what I'm doing. I say this simply because I don't want to waste a ton of time with this if I need not --- in short, to my mind, this looks like a clear example of a bug in the boost text_oarchive, but I perhaps may be wrong.

Thank you.

Bill
The intention is that archive classes be interchangeable. So if a serialization fails for one archive class and not for another, it would be a bug by definition. As of now, the only known instance of this occurring is the serialization of NaN floating point numbers, which fails on text/xml archives but passes with native binary archives.

Bill Lear wrote:
The text_oarchive fails with a segmentation violation on both compilers, and the xml_oarchive works with both. Cursory inspection of the xml output shows the xml to be well-formed, etc.
Which is kind of interesting since both of these archive classes use the same i/o implementation. So it must be in text_?oarchive which after all isn't very big.
BTW, I am the same guy who has posted several (confirmed) bugs to this list: I am not a beginner, I have been using boost serialization for the better part of a year or so, so in general, I know what I'm doing. I say this simply because I don't want to waste a ton of time with this if I need not --- in short, to my mind, this looks like a clear example of a bug in the boost text_oarchive, but I perhaps may be wrong.
I do sympathize and much appreciate the effort you have to expend to narrow this down. It hasn't been reported so it must be pretty obscure. Sorry, I can't be of more help. BTW, the best thing for me in these cases is to run under the debugger, trap when the fault occurs, and inspect the call stack (which is going to be long). So maybe it won't be so hard to track down once you can reproduce it.

Robert Ramey
On Tuesday, May 2, 2006 at 20:47:22 (-0700) Robert Ramey writes:
The intention is that archive classes be interchangeable. So if a serialization fails for one archive class and not for another, it would be a bug by definition.
Ok, that's good --- at least I'm not crazy. This will help in my efforts here on out.
As of now, the only known instance of this occurring is the serialization of NaN floating point numbers, which fails on text/xml archives but passes with native binary archives.
I did confirm last night that the binary version is failing as well.
I do sympathize and much appreciate the effort you have to expend to narrow this down. It hasn't been reported so it must be pretty obscure. Sorry, I can't be of more help.
BTW, the best thing for me in these cases is to run under the debugger, trap when the fault occurs, and inspect the call stack (which is going to be long). So maybe it won't be so hard to track down once you can reproduce it.
Thanks for the note. I did run under the debugger --- the call stack was not merely long, it was unbelievable. I'm running on a reasonably fast, high-memory machine, and when I asked the debugger for a stack trace, it took over an hour to print 60,000+ frames. I gave up at that point. I also ran this under valgrind, but it fails and valgrind doesn't seem to notice the seg fault --- at least it doesn't print out anything useful (and I'm going whole hog with the options on valgrind to force it to be verbose).

I'm going to give it another go today. If I can't make progress with this today, I'm going to have to abandon boost for the moment and move on with our current, hand-coded version of serialization. I think the boost approach is the way to go in the long term for us for many reasons, but I won't be able to spend too much more time on this. Hopefully something will pop up today that I can share with you.

Thanks again.

Bill
Bill Lear wrote:
Thanks for the note. I did run under the debugger --- the call stack was not merely long, it was unbelievable. I'm running on a reasonably fast, high-memory machine, and when I asked the debugger for a stack trace, it took over an hour to print 60,000+ frames. I gave up at that point.
Since you don't have 60,000 functions in your program, that would almost surely indicate a stack overflow due to un-terminated recursion. Just look at the stack back from the end; the recursion should be obvious.

Robert Ramey
On Wednesday, May 3, 2006 at 07:53:15 (-0700) Robert Ramey writes:
.... Just look at the stack back from the end; the recursion should be obvious.
I'm sorry to be dense --- do you mean do something like "frame -50", to look at the stack from the top while in the debugger (gdb)? How do I get to the top of the stack? Thanks for your help. Bill
Here's what I do. I run the debugger. I set a trap anywhere, or I can wait until it traps on its own. GDB lets me view the last x calls in the stack. I might look at the last 100 (I forget the exact command, BT 500?). I can inspect this and see where it's repeating, e.g.

    a b c d ... a   ==> uh oh, recursive call - investigate this.

Note that the following case isn't a problem:

    a b c d ... a k l ...

If your structure has cyclic pointers you should get "some" recursion: e.g. if you have 20 pointers which form a cycle, you'll get a repeat of the stack after 20 pointers have been serialized. The library tracks pointer serialization and detects cycles so that there should be no stack overflow. Unless, of course, you are serializing a cycle of pointers to types marked "untracked", which is pretty hard to do without explicitly disabling a compile-time trap that exists to make this hard to do.

Good Luck

Robert Ramey
On Wednesday, May 3, 2006 at 09:03:03 (-0700) Robert Ramey writes:
Here's what I do.
I run the debugger.
I set a trap anywhere, or I can wait until it traps on its own.
I did want to share something I kvetched about: the pain in the neck when C++ exceptions are thrown and you get no call stack of how that exception came to be thrown (e.g., "stream error"). I discovered what seems to be a fairly easy solution under gdb. Compile this test code with -g:

    #include <iostream>
    #include <stdexcept>

    using namespace std;

    void a() { throw runtime_error("bad dog"); }
    void b() { a(); }
    void c() { b(); }
    void d() { c(); }

    int main(int ac, char* av[]) {
        try {
            d();
        } catch (const exception& e) {
            cerr << "error: " << e.what() << '\n';
        }
    }

In gdb 6.4, you can simply say "catch throw". In the 6.3 version (perhaps earlier), you can set the "catch throw" after shared libraries have been loaded:

    % gdb ./a.out
    (gdb) set stop-on-solib-events 1
    (gdb) run
    Starting program: /home/blear/a.out
    Stopped due to shared library event
    (gdb) catch throw
    Catchpoint 1 (throw)
    (gdb) c
    Continuing.
    Stopped due to shared library event
    (gdb) c
    Continuing.
    Stopped due to shared library event
    (gdb) c
    Continuing.
    Catchpoint 1 (exception thrown)
    0x00aeaf11 in __cxa_throw () from /usr/lib/libstdc++.so.6
    (gdb) where
    #0 0x00aeaf11 in __cxa_throw () from /usr/lib/libstdc++.so.6
    #1 0x08048c04 in a () at t.cc:6
    #2 0x08048c0f in b () at t.cc:7
    #3 0x08048c1d in c () at t.cc:8
    #4 0x08048c2b in d () at t.cc:9
    #5 0x08048c50 in main (ac=1, av=0xbff68dd4) at t.cc:13
    (gdb)

Hope this helps others.

Bill
On Wednesday, May 3, 2006 at 07:53:15 (-0700) Robert Ramey writes:
... Since you don't have 60,000 functions in your program, that would almost surely indicate a stack overflow due to un-terminated recursion. Just look at the stack back from the end; the recursion should be obvious.
After plowing through endless reams of data and stack frames, looking for an obnoxious pattern, I finally tried what should have been obvious: check how much stack size my processes are allowed and try bumping that. Surely enough, I was granted 10 megabytes of stack size. I removed the ceiling on this (ulimit -s unlimited) and the thing worked, repeatedly, no problems.

So, it appears that these larger data sets are just pressing the stack quite hard, so boost is apparently not the culprit. I seem to remember, though, seeing tons of frames of boost library calls for a single serialization output --- seemingly lots of "dispatch" type template redirections. Is there any way to optimize some of this out? I suppose not, but just wanted to ask.

In the course of this little adventure, I did come across something else that I'll try to put into a more thoughtful email later: exceptions in the serialization library are sometimes maddeningly unhelpful (C++ exceptions in general drive me nuts sometimes). One time, I successfully wrote the entire structure out to disk, and after the routine had returned, boost threw a "stream error" --- I finally tried setting ios::binary on the stream and this seemed to fix it. But the "stream error" had no context; obviously the code just looks for the fail bit being set (also the bad bit, I think), and just throws an exception. But there is no context to this, no stack trace (C++ exception), etc. I did notice that in gdb you can put a debugger "catch" on a C++ throw --- next time I get a stream error, I will try this trick.

Anyway, thank you again for all of your help. I am tired, but happy that the boost library doesn't have some sort of weird bug and that this was all just a case of me racing down the non-obvious path, instead of choosing the one (seemingly, always) less-traveled by.

Bill
Bill Lear wrote:
So, it appears that these larger data sets are just pressing the stack quite hard, so boost is apparently not the culprit. I seem to remember, though, seeing tons of frames of boost library calls for a single serialization output --- seemingly lots of "dispatch" type template redirections. Is there any way to optimize some of this out? I suppose not, but just wanted to ask.
Except for calls into the precompiled library, all of these calls can in theory be optimized away by an optimizing compiler. Current compilers generally can't do ALL of them, but you'll find that there is an absolutely HUGE improvement in performance and reduction in code size when using the optimization switches of one's compiler. The difference between optimization and debug builds when using MPL type templated code is large - and the serialization library is an extreme example of this.
In the course of this little adventure, I did come across something else that I'll try to put into a more thoughtful email later: exceptions in the serialization library are sometimes maddeningly unhelpful (C++ exceptions in general drive me nuts sometimes). One time, I successfully wrote the entire structure out to disk, and after the routine had returned, boost threw a "stream error" --- I finally tried setting ios::binary on the stream and this seemed to fix it.
But the "stream error" had no context; obviously the code just looks for the fail bit being set (also the bad bit, I think), and just throws an exception. But there is no context to this, no stack trace (C++ exception), etc.
I did notice that in gdb you can put a debugger "catch" on a C++ throw --- next time I get a stream error, I will try this trick.
The debugger has to be set to trap when an exception is thrown. All the debuggers I've used permit this as an option. The source code where the exception is thrown describes all the information that is known that might be helpful. But if the library gets to the point where it has to throw an exception, it has exhausted its facilities for knowing what to do. I don't see any way to change this.
Anyway, thank you again for all of your help. I am tired, but happy that the boost library doesn't have some sort of weird bug and that this was all just a case of me racing down the non-obvious path, instead of choosing the one (seemingly, always) less-traveled by.
Well, maybe next time you might consider titling your message something other than "Bug in text_oarchive but not in xml_oarchive?", like perhaps "[serialization] I can't figure out what I'm doing wrong - please Help". It's really irksome to an author of a library like this, which includes 50 tests x 5 archives x 10? compilers (2500 testing scenarios), to defend the proposition that the library is correct because some huge and un-understandable program (10 MB of stack space and thousands of stack frames - good grief!) throws an exception and the author hasn't even found where the exception is thrown. Just a heads up - I'm a sensitive guy.

Robert Ramey
On Thursday, May 4, 2006 at 10:00:17 (-0700) Robert Ramey writes:
Well, maybe next time you might consider titling your message something other than "Bug in text_oarchive but not in xml_oarchive?", like perhaps "[serialization] I can't figure out what I'm doing wrong - please Help". It's really irksome to an author of a library like this, which includes 50 tests x 5 archives x 10? compilers (2500 testing scenarios), to defend the proposition that the library is correct because some huge and un-understandable program (10 MB of stack space and thousands of stack frames - good grief!) throws an exception and the author hasn't even found where the exception is thrown. Just a heads up - I'm a sensitive guy.
That's why I did two things: put a question mark in the title, and said "thank you" lots. Don't forget: it wasn't throwing an exception, this was a side issue that I introduced late. It was core dumping in the text version of the library and not in the xml version. My initial assumption, that you confirmed --- both of us wrong --- was that this was a priori evidence of something being wrong with the text version of the library. When I had reported bugs before that exhibited this behavior, you mentioned this yourself. So, I just wanted confirmation on where to begin my search --- I wasn't asking for a defense, as I wasn't making an accusation (again, the question mark). Yes, my fault for not doing the obvious (in hindsight) and simply upping the stack space, but in general, as I think you can see clearly from my posts to this list over time, I try to be very careful in my queries and try to assume that this might not be a problem in boost and leave ample room for pointing the finger at myself. Thank you, by the way, for taking time to answer my questions on optimization. Bill
Participants (2): Bill Lear, Robert Ramey