Boost Serialization tracking issues

gast128 wrote:

Dear all,

I was trying to use the Boost Serialization library again, but I ran into a problem which cost me a full day. The Boost Serialization library is very handy, but I often run into problems with it that cost me days.

This time I am trying to serialize only 'track_never' objects. First of all, the compiler firewall which is built into the library does not work when using the nvp macro:

void Foo()
{
    boost::archive::xml_oarchive oa(...);
    SomeStruct s;
    oa << s;                           // compile-time error
    oa << BOOST_SERIALIZATION_NVP(s);  // ok
}

But this wasn't the real problem. I was serializing non-tracked objects in a loop:

void Bla()
{
    boost::archive::xml_oarchive oa(...);
    for (int i = 0; i < 1000; ++i)
    {
        const SomeStruct s;
        oa << s;
    }
}

However, the library did track them by pointer because of an earlier request to serialize a pointer. It decides this in a call to 'basic_serializer::serialized_as_pointer' deep in the library.

struct SomeStructHolder
{
    SomeStructHolder() : m_p(NULL) {}

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/)
    {
        ar & BOOST_SERIALIZATION_NVP(m_p);
    }

    const SomeStruct* m_p;
};

Somewhere else:

void Bla()
{
    boost::archive::xml_oarchive oa(...);
    const SomeStructHolder holder;
    oa << holder;
}

So this is again an unexpected property of the library. Maybe the ambition of automatic object tracking is too high; specifying it explicitly per class or per archiving operation would be more work for the user, but also clearer.

And the original problem was still to find out whether Boost Serialization could also be used for on-demand loading of (value) objects: in my case the final storage can potentially be bigger than the computer's memory, so I was wondering if objects could be loaded on demand (instead of all at once).

wkr, me

Robert Ramey wrote:

If you want to suppress object tracking for a particular class, use

BOOST_CLASS_TRACKING(SomeStruct, boost::serialization::track_never)

as described in the manual under Reference/Special Considerations/Object Tracking.

Of course, you must realize that if you do this and you serialize pointers, you are exposed to a risk: multiple pointers to a single object will not refer to the same object after you load the archive. Each pointer will point to a different (and new) object.

Robert Ramey

gast128 wrote:

Thx for the answer. My point is that if you nowhere specify a load/store through a pointer, the following code just stores multiple copies:

for (int i = 0; i < 100; ++i)
{
    const SomeStruct s;
    oa << s;
}

But this behavior seems to change if a store through a pointer is performed somewhere. This was surprising to me.

Robert Ramey wrote:

LOL - well it IS surprising.

Take a look at the documentation for the class serialization trait "Implementation Level".

This touches upon a really big issue with a library such as this. The question is which do you do:

a) always the right thing
b) always the same thing - and maybe emit a warning or error when it's not a good idea

Choosing b) makes for a transparent system, and this is what I usually prefer. Generally I detest hidden features and attributes which make things look easier but in fact produce surprising side effects. When things fail, it means a huge effort trying to get to the bottom of things - often the only recourse is trial and error.

a) is much more popular today. It seems easier and makes a better demo. And it's seductive - let's design it so "it just works". If there is one program in my whole life which has tested my sanity, it's Microsoft Word (next in line might be bjam).

In practice I usually prefer b), but sometimes - like this time - I give myself a pass and slip into a). In this case it was a deliberate design decision to "do what the user probably wants if he doesn't otherwise specify it". In spite of my general prejudice against this kind of thing, I think in this case it has worked to advantage, as I have gotten very few complaints and problems about it.

There might be a few other places where something similar has been done - but these would also be exceptional cases.

So I sympathize with your point of view in general; I just don't think it's the correct one in this particular case.

Of course, if one is going to do something like run 1 TB through the serializer, he really should carefully read the manual and carefully consider what he's doing. I wouldn't trust me to have done the right thing.

Robert Ramey

gast128 wrote:

I understand this. But maybe the ambition is too high. In the C++ language one can choose among many constructs, but the intention is not specified. For references it is clear (they must point to other objects), but (shared) pointers can have value or reference semantics. An alternative would be to make a two-layer library: in the basic layer the user has to specify the storage intention himself; the upper layer uses the basic layer and tries to make assumptions about intentions. In this way the user can always fall back. For example, in Boost.Bind one can explicitly specify the return type if automatic detection is unsatisfying.

This is of course all said without the experience of making a serialization library myself...

Sergey Skorniakov wrote:

Unfortunately, the current tracking implementation can also lead to weird version incompatibility. The unlucky combination is object_serializable + track_selectively. Suppose someone has code that serializes vector<int> to a file (by value). Some time later, code that serializes vector<int> through a pointer is added to the program. That is enough to make old files unreadable (or readable with errors). Sample code:
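#include <cassert>
#include <sstream>
#include <vector>
#include <boost/archive/polymorphic_text_iarchive.hpp>
#include <boost/archive/polymorphic_text_oarchive.hpp>
#include <boost/archive/polymorphic_xml_iarchive.hpp>
#include <boost/serialization/nvp.hpp>
#include <boost/serialization/vector.hpp>

// Never called: its mere presence instantiates loading of vector<int>*.
void TestTrack2Err(boost::archive::polymorphic_xml_iarchive &ar)
{
    std::vector<int> *p;
    ar & BOOST_SERIALIZATION_NVP(p);
    // saving of vector<int>* is absent intentionally, to simulate
    // different versions of the saving and loading code
}

void TestTrack2()
{
    std::stringstream s;
    std::vector<int> x;
    x.push_back(4);
    {
        // save the vector by value
        boost::archive::polymorphic_text_oarchive ar(s);
        ar & BOOST_SERIALIZATION_NVP(x);
    }
    x.clear();
    {
        // load it back by value
        boost::archive::polymorphic_text_iarchive ar(s);
        ar & BOOST_SERIALIZATION_NVP(x);
        // The vector is empty at this point, unless the loading of
        // vector<int>* in TestTrack2Err is commented out
        assert(x.size() == 1);
    }
}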
The consequences of the error can vary from nothing (especially with xml archives) to program crashes in more difficult cases. I recently encountered this error in a big project (more than 500 serialized types) and have no idea how to get out of this difficulty.

Robert Ramey wrote:

OK - I see the problem - good example. I don't see an obvious or easy fix. I'll look into it.

Robert Ramey

Robert Ramey wrote:

Thinking about this some more.

This issue was considered in the implementation. Whether a type is tracked or untracked is written to the archive when the data is saved. Tracking at load time is determined by the corresponding flag in the archive and NOT by the current attribute.

Soooo this will have to be looked into. Standard library collections are a little different in that they are unversioned. I don't know if that makes a difference, but we'll look into it. It's also very odd that the behavior differs depending on the type of archive.

Robert Ramey

Sergey Skorniakov wrote:

I looked into the code and found that tracking and version information is loaded from the archive only if basic_iserializer::class_info returns true:
void
basic_iarchive_impl::load_preamble(
    basic_iarchive & ar,
    cobject_id & co
){
    if(! co.initialized){
        if(co.bis_ptr->class_info()){
            class_id_optional_type cid;
            load(ar, cid);    // to be thrown away
            load(ar, co.tracking_level);
            load(ar, co.file_version);
        }
        else{
            // override tracking with indicator from class information
            co.tracking_level = co.bis_ptr->tracking(m_flags);
            co.file_version = version_type(
                co.bis_ptr->version()
            );
        }
        co.initialized = true;
    }
}
So, for all types that have an implementation_level less than object_class_info (standard library collections of primitive types have the object_serializable implementation_level), the information about tracking (and version) is generated in place, depending on whether an appropriate instance of pointer_iserializer is present in the code.

Robert Ramey wrote:

Thanks for looking into this.

Sergey Skorniakov wrote:
> I think this is a very dangerous solution. Maybe it would be better to treat track_selectively as track_always in such a situation.

That would raise howls from those concerned about performance. I'm thinking that tracking behavior shouldn't be tied to implementation level at all.

> The different behavior of xml archives is explained by the different implementation of loading object_id: the xml archive just returns an already parsed attribute (it gives the object_id of the last loaded object that had an object_id, or an uninitialized value), whereas the other archives perform an actual read, which leads to various troubles depending on the archive content.

It's the intention that all archives behave alike. I'll look into it.

Robert Ramey
participants (3)
- gast128
- Robert Ramey
- Sergey Skorniakov