I have implemented the serialization routine for a rather complex
structure with many nested classes. The serialization seems to work for
small examples; however, for bigger data sets I get a stack overflow
error. If I increase the stack size from 8MB to 20MB, the examples seem
to work. However, I'd like to optimize the code so that I do not
need to increase the stack size (especially since not all users can do
that themselves).
So here are my questions:
1. Why is the required stack size different for different data sets?
The structure of the data sets is identical, only the number
of data sets differs - so why is the recursion depth different? The only
thing I can imagine is that I have a class (container) with a vector of
objects. Each object stores a pointer to the container class. Is it
possible that once serialization in the container class starts,
there is a recursive serialization pattern? This seems to be supported by
the fact that the objects in the container class are not serialized in
the same way as they are stored in the container.
The stack depth should be proportional to the depth of your
class data. It should be easy to check this. Set up your debugger
to trap on stack overflow. When it does, show a backtrace. You
should be able to easily determine the source of the stack usage.
It could be some deeply recursive structure, a coding error, or
who knows. Doing this analysis is much more time-efficient than
me trying to guess what the problem might be.
2. How can I try to decrease the required stack size for the
serialization routines, and is there a way to estimate the required
stack size for a specific problem a priori, so that I can throw an
exception with an error message manually instead of getting a
segmentation fault?
If the source is some sort of error - then there is no problem
once the error is fixed.
If the source is that there is VERY deep nesting of data structures,
you'll just have to refactor the data. This isn't a hardship since
you would likely have other issues generated by this besides
serialization.
Robert Ramey
Are you referring to compile time or runtime?
The problem occurs at runtime. I've figured out that the problem is
due to the nested data structure. I have a kind of graph, where every
node stores a pointer to the neighboring node. This means that once I start
to serialize the first node, all the neighboring nodes have to be
serialized as well, which finally ends up with as many recursive
function calls as I have nodes. This is probably similar to a linked
list, where once the first entry is serialized, a pointer to the next
element is serialized and so on, which finally ends up with as many
recursive function calls as there are elements in the list. Is there a
clever way to restructure the serialization (e.g. first serializing all
the nodes, and in a second step, when the nodes have been created,
serializing their connectivity) without introducing two separate serialize
routines?
PS: Sorry for initially posting my response as a new message.
Generally the library "tracks" those things serialized by pointers,
so that there is no problem with cycles, etc. If you can list all your
nodes ahead of time, you can serialize them. Then when you
serialize your graph, there will be no recursion.
Robert Ramey
The problem is that the data structure is as follows:

class container
{
    std::vector<object*> myObjects;
};

class object
{
    std::vector<object*> neighborObjects;
};
The problem is how to split the node information from the graph
information. I've implemented a serialize routine for the container,
which then calls the serialize routine for the object, which
recursively calls the serialize routine of the neighboring objects.
That means the first function call to the serialize routine in the
object class only finishes once all the neighbors have been
serialized. Similarly, the serialize routine of the first neighbor
requires all other neighbors to be serialized. Although the serialize
routine can track the pointers, this leads to a stack problem, since
the number of recursive serialize calls to be made (which are
all stored on the stack) is equivalent to the number of objects.
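The recursion described here can be reproduced in a few lines. The following is only an illustrative sketch (the `Node`, `save_depth`, and `make_chain` names are invented, not Boost.Serialization code), showing that a depth-first serialize over linked nodes needs one stack frame per node:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node type: each element points at the next one, much like
// the container/object layout described above.
struct Node {
    int value = 0;
    Node* next = nullptr;
};

// A stand-in for a depth-first serialize: because it follows the pointer
// before returning, the first call cannot finish until every reachable
// node has been visited, so the call depth equals the chain length.
std::size_t save_depth(const Node* n) {
    if (n == nullptr) return 0;
    return 1 + save_depth(n->next);  // one stack frame per node
}

// Build a chain of `count` nodes; the vector owns them, mirroring a
// container that owns objects which only reference each other.
std::vector<Node> make_chain(std::size_t count) {
    std::vector<Node> nodes(count);
    for (std::size_t i = 0; i + 1 < count; ++i)
        nodes[i].next = &nodes[i + 1];
    return nodes;
}
```

With a chain of a million nodes and a few hundred bytes per serialize frame, a million frames easily exceed an 8MB stack, which matches the observed behavior.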
I totally understand the problem. But messing with the serialization
is the wrong way to go about it.
Question: How do you delete the graph? You have to delete all the
elements. If you do it recursively, you'll have the same problem
of deep nesting. I doubt that the graph "owns" the objects pointed to.
So you must have a way to list all the nodes without traveling through
the graph. Assuming you do, serialize all the nodes this way - but
DON'T include the pointers to adjacent nodes in your serialize function.
This should resolve your problem.
Robert Ramey
Is there a clever way to serialize first all the objects with their data
(these have been omitted in the class definition for simplicity) and
then serialize the neighbor information without changing the data
structure?
Having worked on similar things, I think you may be missing the point here.
Consider a triangulated surface, made up of nodes. Each node has an
ordered list of its neighbors (nodes) that implicitly defines all the
triangles having the node as an apex.
The surface "owns" the nodes (in a vector let's say). The neighboring
nodes of a node are not owned OTOH, yet the connectivity info (the
topology) they represent must still be persisted for each node. And
that info happens to be a vector of pointers of other nodes of the
same surface.
When you serialize the surface, you must serialize all its nodes, and
all those nodes' topology too.
If you do it depth first, you end up blowing up the stack indeed.
I see two possible solutions:
1) save all the nodes *without* their topology info on a first pass of
the vector, thus recording all the "tracked" pointers for them, then
write the topology info for each, since then you don't "recurse", you
only write a "ref" to an already serialized pointer.
2) have the possibility to write a "weak reference" to a pointer,
which only adds the pointer to the "tracking" map with a special flag
saying it wasn't "really" written yet, such that writing the topology
of a node that wasn't yet written is akin to "forward references" to
those neighboring nodes. Either the node will really be written later
on, or it never will, and I suppose serialization should handle that
gracefully.
#1 doesn't require changes in boost serialization, but puts the onus
on the client code. Especially since you have to "externally" save the
neighbors, so it's no longer well encapsulated, and writing a subset
of the surface's nodes becomes more complex.
#2 would require seeking ahead to read the actual object when
encountering a forward reference, and furthermore a separate map to
know where to seek for it. That may be incompatible with boost
serialization (I confess to be mostly ignorant about it).
But my point is that asking how you delete these nodes is not really a
fair question. The imagined code above (based on real code I've worked
on) knows the neighbors are just references to nodes from the same
surface, and doesn't delete them. Removing a node is an operation you
can do at the surface level only, and removing all nodes simply
iterates on the surface's nodes and deletes them, and the neighbors
are never deleted by the node itself.
As I hope I've shown above, I don't think you've addressed the issue
actually. Don't get me wrong, I'm not barging into this thread to
criticize you or boost serialization, I'm just trying to make you see
the problem from a different angle based on my own experience, so you
can formulate a better answer, since I'm interested in that answer. If
it comes out wrong, blame it on English not being my native language,
and I apologize in advance for it. Thanks, --DD
I'm aware of the problem, it just seems to me that it's more general
than serialization and not addressable from within it. It would
come up in any nested recursive algorithm.
Maybe just create a large stack and don't worry about it.
If memory commitment is a problem, then create a temporary stack
with a separate task or co-routine and use that for serialization.
Then the problem becomes temporary rather than permanent.
Robert Ramey
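The "temporary stack" idea above can be sketched with POSIX threads, whose attributes allow an explicit stack size (this is platform-specific and illustrative; the recursion here is a dummy stand-in for the serialization call):

```cpp
#include <cassert>
#include <pthread.h>

// Depth actually reached by the recursive worker (for verification).
static int depth_reached = 0;

// Dummy deeply recursive work: each frame consumes some stack, like a
// per-node serialize call would.
static void recurse(int n) {
    char pad[4096];
    pad[0] = static_cast<char>(n);
    depth_reached = n;
    if (n < 30000) recurse(n + 1);
    (void)pad;
}

static void* worker(void*) {
    recurse(1);  // ~120MB of frames: too deep for a small default stack
    return nullptr;
}

// Run the recursive work on a short-lived thread with a 256MB stack, so
// the main thread's default (often 8MB) stack is never the limit. The
// large commitment is temporary: it is released when the thread joins.
int run_with_big_stack() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256u * 1024u * 1024u);
    pthread_t t;
    if (pthread_create(&t, &attr, worker, nullptr) != 0) return -1;
    pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
    return depth_reached;
}
```

This does not reduce the stack requirement, it only pays it in a controlled, temporary way, which is exactly the trade-off described above.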
When you serialize the surface, you must serialize all its nodes, and
all those nodes' topology too.
If you do it depth first, you end up blowing up the stack indeed.
3) Serialize the nodes through a sequence adapter which presents each
node once. You may have that already.
Adapt the serialization of the nodes' vertex vectors to cast the node
pointers to uint64_t or uintptr_t (yes, I know). Systematically refer
to that value as a "node ID" from now on.
On deserialization, maintain an associative container like:
map<node_id_t, pair<node*, node**>>
such that the key is the node ID,
pr.first is either null or the deserialized object, and
pr.second is either null or points to the head of a singly-linked list
formed of elements in the referencing nodes' vertex vectors.
After all nodes have been deserialized in this form, traverse each
node's SLL and fix up each node** with the new node*.
This assumes sizeof(node**) <= sizeof(node*) <= sizeof(node_id_t).
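A simplified version of this scheme can be sketched as follows: store the old pointer values as opaque node IDs, then remap them to the freshly allocated nodes in a fixup pass. This uses a plain remap table rather than the singly-linked-list fixup described above, and all names (`Node`, `SavedNode`, `load`) are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// A graph node: payload plus non-owning neighbor pointers.
struct Node {
    int value = 0;
    std::vector<Node*> neighbors;
};

// On-disk form: the old `this` pointer, reinterpreted as an opaque ID,
// plus neighbor pointers stored the same way.
struct SavedNode {
    std::uintptr_t id;
    int value;
    std::vector<std::uintptr_t> nbrs;
};

std::vector<Node> load(const std::vector<SavedNode>& saved) {
    std::vector<Node> nodes(saved.size());
    std::unordered_map<std::uintptr_t, Node*> remap;
    // Pass 1: create every node and record old-ID -> new-address.
    for (std::size_t i = 0; i < saved.size(); ++i) {
        nodes[i].value = saved[i].value;
        remap[saved[i].id] = &nodes[i];
    }
    // Pass 2: fix up neighbor IDs into real pointers. Forward references
    // are harmless because all nodes already exist.
    for (std::size_t i = 0; i < saved.size(); ++i)
        for (std::uintptr_t old : saved[i].nbrs)
            nodes[i].neighbors.push_back(remap.at(old));
    return nodes;
}
```

The IDs never get dereferenced, only compared, so the old pointer values are just convenient unique keys; any unique integer per node would do equally well (and would also sidestep the heap-layout leak mentioned below).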
In my book there is, occasionally, a time and a place for
reinterpret_cast. Maybe you could do it better with a union type, but is
this odd case worth the additional type complexity? (Hmm, this is a
Boost list we're on).
I've known this to be used successfully in practice (I think it was
assembler though). The real danger with this scheme is that you'll never
convince some coworkers that serializing the pointer isn't a bug.
Now that I think about it, in some cases it might be considered a
security problem to leak information about heap layout through the
pointer values. If this is the case, encrypt them all and throw away the
key!
- Marsh
Without changing the data structure, yes, but not without changing
your serialization code, since boost.serialization doesn't
distinguish between "owning pointers" and references.
One solution is to make your class "container" a container that is
serialized by value, similar to how std::list is serialized:
std::list internally is a linked list, and if the internal nodes were
passed to boost.serialization, the same stack overflow problem would
occur. But it is exposed to boost.serialization as a series of values
(through the STL container interface).
The topology of the internal graph of std::list is clear just from the
order of the values; in your case you'd have to save references to the
neighbouring nodes in a second pass.
So the basic idea is that only "class container" has a serialize()
function (which saves all the information necessary to reconstruct the
entire graph), and your "class object" does not.
Ideally, boost.serialization would always save a reference to tracked
objects and save the referenced objects at the end of the archive without
recursion.
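This container-only approach can be sketched with the classes from the post: the container flattens neighbor pointers into indices and rebuilds them in two flat loops, while `object` itself has no serialize routine. The index encoding and function names here are assumptions for illustration, not the poster's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct object {
    std::vector<object*> neighborObjects;  // non-owning topology
};

struct container {
    std::vector<object> myObjects;  // stored by value here, for simplicity

    // Flatten: each object's neighbors become positions in myObjects.
    // Only the container knows the graph; no per-object recursion.
    std::vector<std::vector<std::size_t>> to_indices() const {
        std::vector<std::vector<std::size_t>> out(myObjects.size());
        for (std::size_t i = 0; i < myObjects.size(); ++i)
            for (const object* nb : myObjects[i].neighborObjects)
                out[i].push_back(
                    static_cast<std::size_t>(nb - myObjects.data()));
        return out;
    }

    // Rebuild: allocate all objects first, then wire the neighbors -
    // two flat loops, so stack depth never depends on the graph size.
    static container from_indices(
        const std::vector<std::vector<std::size_t>>& idx) {
        container c;
        c.myObjects.resize(idx.size());
        for (std::size_t i = 0; i < idx.size(); ++i)
            for (std::size_t j : idx[i])
                c.myObjects[i].neighborObjects.push_back(&c.myObjects[j]);
        return c;
    }
};
```

In a real serialize() for the container, `to_indices()` would feed the archive and `from_indices()` would run on load; the point is that the order of the values plus an index table carries the whole topology, just as with std::list.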