boost serialize - stack overflow for large problems

I have implemented the serialization routine for a rather complex structure with many nested classes. The serialization seems to work for small examples; however, for bigger data sets I get a stack overflow error. If I increase the stack size from 8 MB to 20 MB, the examples seem to work. However, I'd like to optimize the code so that I do not need to increase the stack size (especially since not all users can do that themselves).

So here are my questions:

1. Why does the required stack size differ between data sets? The structure of the data sets is identical, only their size differs - so why is the recursion depth different? The only thing I can imagine is that I have a class (container) with a vector of objects, and each object stores a pointer back to the container class. Is it possible that once serialization of the container class starts, a recursive serialization pattern sets in? This seems to be supported by the fact that the objects in the container class are not serialized in the same order as they are stored in the container.

class container { std::vector<object*> myObjects; };
class object { const container* myContainer; };

2. How can I decrease the required stack size for the serialization routines, and is there a way to estimate the required stack size for a specific problem a priori, so that I can throw an exception with an error message manually instead of getting a segmentation fault?

Thanks for any suggestions,
JFU
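For what it's worth, here is a guess at how the two serialize() functions might be wired up for the classes above (member names are taken from the post, everything else is assumed; the back-pointer is made non-const to keep the sketch simple):

#include <vector>
#include <boost/serialization/vector.hpp>

class object;

class container {
public:
    std::vector<object*> myObjects;

    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & myObjects;   // every object is serialized through its pointer
    }
};

class object {
public:
    container* myContainer = nullptr;

    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        // With default tracking the container is normally written only once,
        // so this back-pointer usually becomes a mere reference and should
        // not deepen the recursion by itself. Deep recursion appears when an
        // object holds pointers to other objects that have not been
        // serialized yet, because each of those is then serialized in place,
        // inside the current call.
        ar & myContainer;
    }
};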

"Jörg F. Unger" wrote:
I have implemented the serialization routine for a rather complex structure with many nested classes. The serialization seems to work for small examples; however, for bigger data sets I get a stack overflow error. If I increase the stack size from 8 MB to 20 MB, the examples seem to work. However, I'd like to optimize the code so that I do not need to increase the stack size (especially since not all users can do that themselves).
Are you referring to compile time or runtime?
So here are my questions: 1. Why does the required stack size differ between data sets? The structure of the data sets is identical, only their size differs - so why is the recursion depth different? The only thing I can imagine is that I have a class (container) with a vector of objects, and each object stores a pointer back to the container class. Is it possible that once serialization of the container class starts, a recursive serialization pattern sets in? This seems to be supported by the fact that the objects in the container class are not serialized in the same order as they are stored in the container.
The stack depth should be proportional to the depth of your class data. It should be easy to check this. Set up your debugger to trap on stack overflow. When it does, show a backtrace. You should be able to easily determine the source of the stack usage. It could be some deeply recursive structure, a coding error, or who knows. Doing this analysis is much more time-efficient than me trying to guess what the problem might be.
2. How can I decrease the required stack size for the serialization routines, and is there a way to estimate the required stack size for a specific problem a priori, so that I can throw an exception with an error message manually instead of getting a segmentation fault?
If the source is some sort of error, then there is no problem once the error is fixed. If the source is that there is VERY deep nesting of data structures, you'll just have to refactor the data. This isn't a hardship, since you would likely have other issues generated by this besides serialization.

Robert Ramey

Robert Ramey <ramey <at> rrsd.com> writes:
"Jörg F. Unger" wrote:
I have implemented the serialization routine for a rather complex structure with many nested classes. The serialization seems to work for small examples; however, for bigger data sets I get a stack overflow error. If I increase the stack size from 8 MB to 20 MB, the examples seem to work. However, I'd like to optimize the code so that I do not need to increase the stack size (especially since not all users can do that themselves).
Are you referring to compile time or runtime?
If the source is that there is VERY deep nesting of data structures, you'll just have to refactor the data. This isn't a hardship since you would likely have other issues generated by this besides serialization.
Robert Ramey
The problem occurs at runtime. I've figured out that the problem is due to the nested data structure. I have kind of a graph, where every node stores pointers to its neighboring nodes. This means that once I start to serialize the first node, all the neighboring nodes have to be serialized as well, which finally ends up with as many recursive function calls as I have nodes. This is probably similar to a linked list: once the first entry is serialized, a pointer to the next element is serialized, and so on, which finally ends up with as many recursive function calls as there are elements in the list. Is there a clever way to restructure the serialization (e.g. first serializing all the nodes, and in a second step, when the nodes have been created, serializing their connectivity) without introducing two separate serialize routines?

PS: Sorry for initially posting my response as a new message

Jorg F. Unger wrote:
Robert Ramey <ramey <at> rrsd.com> writes:
"Jörg F. Unger" wrote:
I have implemented the serialization routine for a rather complex structure with many nested classes. The serialization seems to work for small examples; however, for bigger data sets I get a stack overflow error. If I increase the stack size from 8 MB to 20 MB, the examples seem to work. However, I'd like to optimize the code so that I do not need to increase the stack size (especially since not all users can do that themselves).
Are you referring to compile time or runtime?
If the source is that there is VERY deep nesting of data structures, you'll just have to refactor the data. This isn't a hardship since you would likely have other issues generated by this besides serialization.
Robert Ramey
The problem occurs at runtime. I've figured out that the problem is due to the nested data structure. I have kind of a graph, where every node stores pointers to its neighboring nodes. This means that once I start to serialize the first node, all the neighboring nodes have to be serialized as well, which finally ends up with as many recursive function calls as I have nodes. This is probably similar to a linked list: once the first entry is serialized, a pointer to the next element is serialized, and so on, which finally ends up with as many recursive function calls as there are elements in the list. Is there a clever way to restructure the serialization (e.g. first serializing all the nodes, and in a second step, when the nodes have been created, serializing their connectivity) without introducing two separate serialize routines?
Generally the library "tracks" those things serialized through pointers, so there is no problem with cycles etc. If you can list all your nodes ahead of time, you can serialize them. Then when you serialize your graph, there will be no recursion.

Robert Ramey

Robert Ramey <ramey <at> rrsd.com> writes:
Jorg F. Unger wrote:
Robert Ramey <ramey <at> rrsd.com> writes:
"Jörg F. Unger" wrote:
I have implemented the serialization routine for a rather complex structure with many nested classes. The serialization seems to work for small examples; however, for bigger data sets I get a stack overflow error. If I increase the stack size from 8 MB to 20 MB, the examples seem to work. However, I'd like to optimize the code so that I do not need to increase the stack size (especially since not all users can do that themselves).
Are you referring to compile time or runtime?
If the source is that there is VERY deep nesting of data structures, you'll just have to refactor the data. This isn't a hardship since you would likely have other issues generated by this besides serialization.
Robert Ramey
The problem occurs at runtime. I've figured out that the problem is due to the nested data structure. I have kind of a graph, where every node stores pointers to its neighboring nodes. This means that once I start to serialize the first node, all the neighboring nodes have to be serialized as well, which finally ends up with as many recursive function calls as I have nodes. This is probably similar to a linked list: once the first entry is serialized, a pointer to the next element is serialized, and so on, which finally ends up with as many recursive function calls as there are elements in the list. Is there a clever way to restructure the serialization (e.g. first serializing all the nodes, and in a second step, when the nodes have been created, serializing their connectivity) without introducing two separate serialize routines?
Generally the library "tracks" those things serialized through pointers, so there is no problem with cycles etc. If you can list all your nodes ahead of time, you can serialize them. Then when you serialize your graph, there will be no recursion.
Robert Ramey
The problem is that the data structure is as follows:

class container { std::vector<object*> myObjects; };
class object { std::vector<object*> neighborObjects; };

The problem is how to split the node information from the graph information. I've implemented a serialize routine for the container, which then calls the serialize routine for each object, which recursively calls the serialize routine of the neighboring objects. That means the first call to the serialize routine in the object class only finishes once all the neighbors have been serialized. Similarly, the serialize routine of the first neighbor requires all other neighbors to be serialized. Although the library can track the pointers, this leads to a stack problem, since the number of recursive serialize calls (which are all stored on the stack) is equal to the number of objects. Is there a clever way to serialize first all the objects with their data (omitted in the class definition for simplicity) and then serialize the neighbor information, without changing the data structure?
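To make the recursion explicit, here is a minimal sketch of the serialize() routine described above (the data member is a placeholder for the omitted members):

#include <vector>
#include <boost/serialization/vector.hpp>

class object {
public:
    double data = 0.0;
    std::vector<object*> neighborObjects;

    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & data;
        // The first not-yet-tracked neighbor is serialized right here, inside
        // this call, and its serialize() does the same with its own neighbors:
        // one stack frame per reachable object, so the depth grows with the
        // size of the connected component, not with the nesting of the classes.
        ar & neighborObjects;
    }
};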

Jorg F. Unger wrote:
Robert Ramey <ramey <at> rrsd.com> writes:
Jorg F. Unger wrote:
Robert Ramey <ramey <at> rrsd.com> writes:
"Jörg F. Unger" wrote:
I have implemented the serialization routine for a rather complex structure with many nested classes. The serialization seems to work for small examples; however, for bigger data sets I get a stack overflow error. If I increase the stack size from 8 MB to 20 MB, the examples seem to work. However, I'd like to optimize the code so that I do not need to increase the stack size (especially since not all users can do that themselves).
Are you referring to compile time or runtime?
If the source is that there is VERY deep nesting of data structures, you'll just have to refactor the data. This isn't a hardship since you would likely have other issues generated by this besides serialization.
Robert Ramey
The problem occurs at runtime. I've figured out that the problem is due to the nested data structure. I have kind of a graph, where every node stores pointers to its neighboring nodes. This means that once I start to serialize the first node, all the neighboring nodes have to be serialized as well, which finally ends up with as many recursive function calls as I have nodes. This is probably similar to a linked list: once the first entry is serialized, a pointer to the next element is serialized, and so on, which finally ends up with as many recursive function calls as there are elements in the list. Is there a clever way to restructure the serialization (e.g. first serializing all the nodes, and in a second step, when the nodes have been created, serializing their connectivity) without introducing two separate serialize routines?
Generally the library "tracks" those things serialized through pointers, so there is no problem with cycles etc. If you can list all your nodes ahead of time, you can serialize them. Then when you serialize your graph, there will be no recursion.
Robert Ramey
The problem is that the data structure is as follows
class container { std::vector<object*> myObjects; };
class object { std::vector<object*> neighborObjects; };
The problem is how to split the node information from the graph information. I've implemented a serialize routine for the container, which then calls the serialize routine for each object, which recursively calls the serialize routine of the neighboring objects. That means the first call to the serialize routine in the object class only finishes once all the neighbors have been serialized. Similarly, the serialize routine of the first neighbor requires all other neighbors to be serialized. Although the library can track the pointers, this leads to a stack problem, since the number of recursive serialize calls (which are all stored on the stack) is equal to the number of objects.
I totally understand the problem, but messing with the serialization is the wrong way to go about it. Question: how do you delete the graph? You have to delete all the elements. If you do it recursively, you'll have the same problem of deep nesting. I doubt that the graph "owns" the objects pointed to. So you must have a way to list all the nodes without traveling through the graph. Assuming you do, serialize all the nodes this way - but DON'T include the pointers to adjacent nodes in your serialize function. This should resolve your problem.

Robert Ramey
Is there a clever way to serialize first all the objects with their data (omitted in the class definition for simplicity) and then serialize the neighbor information, without changing the data structure?

On Wed, Sep 29, 2010 at 11:26 AM, Robert Ramey <ramey@rrsd.com> wrote:
Question: how do you delete the graph? You have to delete all the elements. If you do it recursively, you'll have the same problem of deep nesting. I doubt that the graph "owns" the objects pointed to. So you must have a way to list all the nodes without traveling through the graph. Assuming you do, serialize all the nodes this way - but DON'T include the pointers to adjacent nodes in your serialize function.
Having worked on similar things, I think you may be missing the point here.

Consider a triangulated surface, made up of nodes. Each node has an ordered list of its neighbors (nodes) that implicitly defines all the triangles having the node as an apex.

The surface "owns" the nodes (in a vector, let's say). The neighboring nodes of a node are not owned, OTOH, yet the connectivity info (the topology) they represent must still be persisted for each node. And that info happens to be a vector of pointers to other nodes of the same surface.

When you serialize the surface, you must serialize all its nodes, and all those nodes' topology too. If you do it depth first, you end up blowing up the stack indeed.

I see two possible solutions:

1) save all the nodes *without* their topology info on a first pass of the vector, thus recording all the "tracked" pointers for them, then write the topology info for each; since then you don't "recurse", you only write a "ref" to an already serialized pointer.

2) have the possibility to write a "weak reference" to a pointer, which only adds the pointer to the "tracking" map with a special flag saying it wasn't "really" written yet, such that writing the topology of a node that wasn't yet written is akin to "forward references" to those neighboring nodes. Either the node will really be written later on, or it never will, and I suppose serialization should handle that gracefully.

#1 doesn't require changes in boost serialization, but puts the onus on the client code, especially since you have to "externally" save the neighbors, so it's no longer well encapsulated, and writing a subset of the surface's nodes becomes more complex.

#2 would require seeking ahead to read the actual object when encountering a forward reference, and furthermore a separate map to know where to seek for it. That may be incompatible with boost serialization (I confess to be mostly ignorant about it).

But my point is that asking how you delete these nodes is not really a fair question. The imagined code above (based on real code I've worked on) knows the neighbors are just references to nodes from the same surface, and doesn't delete them. Removing a node is an operation you can do at the surface level only, and removing all nodes simply iterates on the surface's nodes and deletes them; the neighbors are never deleted by the node itself.
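For illustration, a minimal sketch of option 1 with Boost.Serialization (the names node and surface and the coordinate members are hypothetical; encapsulation is deliberately ignored):

#include <vector>
#include <boost/serialization/vector.hpp>

struct node {
    double x = 0, y = 0, z = 0;            // per-node data
    std::vector<node*> neighbors;          // topology: references, not owned

    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & x & y & z;                    // note: neighbors are deliberately NOT here
    }
};

struct surface {
    std::vector<node*> nodes;              // owned

    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        // pass 1: every node goes through the archive via its pointer; each
        // node's serialize() touches only its own data, so the call depth
        // stays flat and the library records every pointer in its tracking table
        ar & nodes;
        // pass 2: the topology; every neighbor pointer is already tracked,
        // so only a reference is written (or resolved, on loading) - no recursion
        for (node* n : nodes)
            ar & n->neighbors;
    }
};

On loading, pass 1 recreates the nodes and pass 2 fills in each neighbor vector from the already-tracked pointers, so the same two-pass serialize() works in both directions.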
This should resolve your problem.
As I hope I've shown above, I don't think you've actually addressed the issue. Don't get me wrong, I'm not barging into this thread to criticize you or boost serialization; I'm just trying to make you see the problem from a different angle based on my own experience, so you can formulate a better answer, since I'm interested in that answer. If it comes out wrong, blame it on English not being my native language, and I apologize in advance for it.

Thanks, --DD

Dominique Devienne wrote:
On Wed, Sep 29, 2010 at 11:26 AM, Robert Ramey <ramey@rrsd.com> wrote:
Question: how do you delete the graph? You have to delete all the elements. If you do it recursively, you'll have the same problem of deep nesting. I doubt that the graph "owns" the objects pointed to. So you must have a way to list all the nodes without traveling through the graph. Assuming you do, serialize all the nodes this way - but DON'T include the pointers to adjacent nodes in your serialize function.
Having worked on similar things, I think you may be missing the point here.
Consider a triangulated surface, made up of nodes. Each node has an ordered list of its neighbors (nodes) that implicitly defines all the triangles having the node as an apex.
The surface "owns" the nodes (in a vector, let's say). The neighboring nodes of a node are not owned, OTOH, yet the connectivity info (the topology) they represent must still be persisted for each node. And that info happens to be a vector of pointers to other nodes of the same surface.
When you serialize the surface, you must serialize all its nodes, and all those nodes' topology too. If you do it depth first, you end up blowing up the stack indeed.
I see two possible solutions:
1) save all the nodes *without* their topology info on a first pass of the vector, thus recording all the "tracked" pointers for them, then write the topology info for each, since then you don't "recurse", you only write a "ref" to an already serialized pointer.
That's my suggestion.
2) have the possibility to write a "weak reference" to a pointer, which only adds the pointer to the "tracking" map with a special flag saying it wasn't "really" written yet, such that writing the topology of a node that wasn't yet written is akin to "forward references" to those neighboring nodes. Either the node will really be written later on, or it never will, and I suppose serialization should handle that gracefully.
I'd have to think about this.
#1 doesn't require changes in boost serialization, but puts the onus on the client code, especially since you have to "externally" save the neighbors, so it's no longer well encapsulated, and writing a subset of the surface's nodes becomes more complex.
#2 would require seeking ahead to read the actual object when encountering a forward reference, and furthermore a separate map to know where to seek for it. That may be incompatible with boost serialization (I confess to be mostly ignorant about it).
But my point is that asking how you delete these nodes is not really a fair question. The imagined code above (based on real code I've worked on) knows the neighbors are just references to nodes from the same surface, and doesn't delete them. Removing a node is an operation you can do at the surface level only,
and removing all nodes simply iterates on the surface's nodes and deletes them, and the neighbors are never deleted by the node itself.
If you can iterate over all the nodes, you can serialize them. Then you use #1 above.
This should resolve your problem.
As I hope I've shown above, I don't think you've actually addressed the issue. Don't get me wrong, I'm not barging into this thread to criticize you or boost serialization; I'm just trying to make you see the problem from a different angle based on my own experience, so you can formulate a better answer, since I'm interested in that answer. If it comes out wrong, blame it on English not being my native language, and I apologize in advance for it.

Thanks, --DD
I'm aware of the problem; it just seems to me that it's more general than serialization and not addressable from within it. It would come up in any nested recursive algorithm. Maybe just create a large stack and don't worry about it. If memory commitment is a problem, then create a temporary stack with a separate task or co-routine and use that for serialization. Then the problem becomes temporary rather than permanent.

Robert Ramey
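For the separate-stack idea, one possible shape (assuming a Boost version that provides boost::thread::attributes; my_data, the file name and the 64 MiB figure are only placeholders) is to run the save on a worker thread that gets its own, larger stack:

#include <fstream>
#include <vector>
#include <boost/archive/text_oarchive.hpp>
#include <boost/serialization/vector.hpp>
#include <boost/thread/thread.hpp>

struct my_data {                                // stand-in for the real graph
    std::vector<double> values;
    template<class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) { ar & values; }
};

int main() {
    const my_data d = {{1.0, 2.0, 3.0}};
    boost::thread::attributes attrs;
    attrs.set_stack_size(64 * 1024 * 1024);     // 64 MiB for this thread only
    boost::thread worker(attrs, [&d] {
        std::ofstream ofs("out.txt");
        boost::archive::text_oarchive oa(ofs);
        oa << d;                                // deep recursion happens on the big stack
    });
    worker.join();
    return 0;
}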

On 09/29/2010 11:12 AM, Dominique Devienne wrote:
When you serialize the surface, you must serialize all its nodes, and all those nodes' topology too. If you do it depth first, you end up blowing up the stack indeed.
3) Serialize the nodes through a sequence adapter which presents each node once. You may have that already. Adapt the serialization of the nodes' vertex vectors to cast the node pointers to uint64_t or uintptr_t (yes, I know). Systematically refer to that value as a "node ID" from now on.

On deserialization, maintain an associative container like

map<node_id_t, pair<node*, node**>>

such that the key is the node ID, pr.first is either null or the deserialized object, and pr.second is either null or points to the head of a singly-linked list formed of elements in the referencing nodes' vertex vectors. After all nodes have been deserialized in this form, traverse each node's SLL and fix up the next node** with the new node*.

This assumes sizeof(node**) <= sizeof(node*) <= sizeof(node_id_t). In my book there is, occasionally, a time and a place for reinterpret_cast. Maybe you could do it better with a union type, but is this odd case worth the additional type complexity? (Hmm, this is a Boost list we're on.) I've known this to be used successfully in practice (I think it was assembler though).

The real danger with this scheme is that you'll never convince some coworkers that serializing the pointer isn't a bug. Now that I think about it, in some cases it might be considered a security problem to leak information about heap layout through the pointer values. If this is the case, encrypt them all and throw away the key!

- Marsh
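A simplified sketch of that idea, with the in-place linked-list fixup replaced by transient members and a std::map built after loading (all names are hypothetical, the payload is reduced to one double):

#include <cstdint>
#include <map>
#include <vector>
#include <boost/serialization/split_member.hpp>
#include <boost/serialization/vector.hpp>

struct node {
    double data = 0.0;
    std::vector<node*> neighbors;            // references, not owned

    // transient helpers, used only while loading
    std::vector<std::uintptr_t> pending_ids;
    std::uintptr_t old_id = 0;

    template<class Archive>
    void save(Archive& ar, const unsigned int /*version*/) const {
        ar & data;
        // neighbors leave as opaque integer IDs (their current addresses),
        // so no nested serialize() call is ever made for them
        std::vector<std::uintptr_t> ids;
        for (const node* n : neighbors)
            ids.push_back(reinterpret_cast<std::uintptr_t>(n));
        ar & ids;
        std::uintptr_t self = reinterpret_cast<std::uintptr_t>(this);
        ar & self;                           // this node's own ID
    }

    template<class Archive>
    void load(Archive& ar, const unsigned int /*version*/) {
        ar & data;
        ar & pending_ids;                    // fixed up later by the surface
        ar & old_id;
    }
    BOOST_SERIALIZATION_SPLIT_MEMBER()
};

struct surface {
    std::vector<node*> nodes;                // owned

    template<class Archive>
    void save(Archive& ar, const unsigned int /*version*/) const {
        ar & nodes;                          // one flat pass, no recursion
    }

    template<class Archive>
    void load(Archive& ar, const unsigned int /*version*/) {
        ar & nodes;                          // nodes without topology yet
        std::map<std::uintptr_t, node*> by_id;
        for (node* n : nodes)
            by_id[n->old_id] = n;
        for (node* n : nodes) {              // turn saved IDs back into pointers
            n->neighbors.clear();
            for (std::uintptr_t id : n->pending_ids)
                n->neighbors.push_back(by_id.at(id));
            n->pending_ids.clear();
        }
    }
    BOOST_SERIALIZATION_SPLIT_MEMBER()
};

Since the neighbor vectors never pass through the archive as pointers, neither saving nor loading ever nests one node's serialization inside another's.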

Quoting "Jorg F. Unger" <j-unger@northwestern.edu>:
Robert Ramey <ramey <at> rrsd.com> writes:
Jorg F. Unger wrote:
Robert Ramey <ramey <at> rrsd.com> writes:
"Jörg F. Unger" wrote:
I have implemented the serialization routine for a rather complex structure with many nested classes. The serialization seems to work for small examples; however, for bigger data sets I get a stack overflow error. If I increase the stack size from 8 MB to 20 MB, the examples seem to work. However, I'd like to optimize the code so that I do not need to increase the stack size (especially since not all users can do that themselves).
Are you referring to compile time or runtime?
If the source is that there is VERY deep nesting of data structures, you'll just have to refactor the data. This isn't a hardship since you would likely have other issues generated by this besides serialization.
Robert Ramey
The problem occurs at runtime. I've figured out that the problem is due to the nested data structure. I have kind of a graph, where every node stores pointers to its neighboring nodes. This means that once I start to serialize the first node, all the neighboring nodes have to be serialized as well, which finally ends up with as many recursive function calls as I have nodes. This is probably similar to a linked list: once the first entry is serialized, a pointer to the next element is serialized, and so on, which finally ends up with as many recursive function calls as there are elements in the list. Is there a clever way to restructure the serialization (e.g. first serializing all the nodes, and in a second step, when the nodes have been created, serializing their connectivity) without introducing two separate serialize routines?
Generally the library "tracks" those things serialized through pointers, so there is no problem with cycles etc. If you can list all your nodes ahead of time, you can serialize them. Then when you serialize your graph, there will be no recursion.
Robert Ramey
The problem is that the data structure is as follows
class container { std::vector<object*> myObjects; };
class object { std::vector<object*> neighborObjects; };
The problem is how to split the node information from the graph information. I've implemented a serialize routine for the container, which then calls the serialize routine for each object, which recursively calls the serialize routine of the neighboring objects. That means the first call to the serialize routine in the object class only finishes once all the neighbors have been serialized. Similarly, the serialize routine of the first neighbor requires all other neighbors to be serialized. Although the library can track the pointers, this leads to a stack problem, since the number of recursive serialize calls (which are all stored on the stack) is equal to the number of objects. Is there a clever way to serialize first all the objects with their data (omitted in the class definition for simplicity) and then serialize the neighbor information, without changing the data structure?
Without changing the data structure, yes, but not without changing your serialization code, since Boost.Serialization doesn't distinguish between "owning pointers" and references.

One solution is to make your class "container" a container that is serialized by value, similar to how std::list is serialized: std::list internally is a linked list, and if the internal nodes were passed to Boost.Serialization, the same stack overflow problem would occur. But it is exposed to Boost.Serialization as a series of values (through the STL container interface). The topology of the internal graph of std::list is clear just from the order of the values; in your case you'd have to save references to the neighbouring nodes in a second pass.

So the basic idea is that only "class container" has a serialize() function (which saves all the information necessary to reconstruct the entire graph), and your "class object" does not.

Ideally, Boost.Serialization would always save a reference to tracked objects and save the referenced objects at the end of the archive, without recursion.
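A minimal sketch of that by-value approach (placeholder data member; ownership of any previously loaded objects is ignored), where only the container talks to the archive and the topology is written as indices in a second pass:

#include <cstddef>
#include <map>
#include <vector>
#include <boost/serialization/access.hpp>
#include <boost/serialization/split_member.hpp>

class object {
public:
    double data = 0.0;                       // placeholder for the real members
    std::vector<object*> neighborObjects;    // references into the same container
};

class container {
public:
    std::vector<object*> myObjects;          // owned

private:
    friend class boost::serialization::access;

    template<class Archive>
    void save(Archive& ar, const unsigned int /*version*/) const {
        // pass 1: per-object data only - no pointers, no recursion
        const std::size_t n = myObjects.size();
        ar & n;
        std::map<const object*, std::size_t> index;
        for (std::size_t i = 0; i < n; ++i) {
            index[myObjects[i]] = i;
            ar & myObjects[i]->data;
        }
        // pass 2: topology, encoded as indices into myObjects
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t degree = myObjects[i]->neighborObjects.size();
            ar & degree;
            for (const object* p : myObjects[i]->neighborObjects) {
                const std::size_t k = index.at(p);
                ar & k;
            }
        }
    }

    template<class Archive>
    void load(Archive& ar, const unsigned int /*version*/) {
        std::size_t n = 0;
        ar & n;
        myObjects.assign(n, nullptr);        // note: previous contents would leak here
        for (std::size_t i = 0; i < n; ++i) {
            myObjects[i] = new object();
            ar & myObjects[i]->data;
        }
        for (std::size_t i = 0; i < n; ++i) {
            std::size_t degree = 0;
            ar & degree;
            myObjects[i]->neighborObjects.resize(degree);
            for (std::size_t k = 0; k < degree; ++k) {
                std::size_t idx = 0;
                ar & idx;
                myObjects[i]->neighborObjects[k] = myObjects[idx];
            }
        }
    }
    BOOST_SERIALIZATION_SPLIT_MEMBER()
};

Because class object never sees an archive, the call depth stays constant no matter how the graph is connected.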
participants (6)
- "Jörg F. Unger"
- Dominique Devienne
- Jorg F. Unger
- Marsh Ray
- Robert Ramey
- Stefan Strasser