[serialization] a proposal for an alternative to the new const-saving rule

Hello,

The new rule introduced in Boost.Serialization that forbids saving of non-const objects has proved a little controversial. My point of view is that the rule, albeit far from perfect, provides some level of safety against hard-to-track errors. Others' opinions differ.

The current rule is a rough approximation to what IMHO would constitute the right enforcement: every time a trackable object is saved, check whether an object with the same address has been previously saved and, if so, make sure that the object didn't change.

The hardest part is checking for equality. My proposal is to follow a hash-based approach, which is efficient in both time and space (one word per tracked object) and does not impose any special requirement on the serialized objects (for instance, an approach based on operator== would require that objects be equality comparable).

How does one compute a hash value for an object being saved? It is easy to do recursively as follows (pseudocode):

  unsigned int* hash_addr=0; // a member of the current archive

  internal_save(object obj)
  {
    if(hash_addr!=0 || obj is trackable){
      // a hash computation is in progress or we need to initiate one
      unsigned int* old_hash_addr=hash_addr;
      unsigned int h=0;
      hash_addr=&h;
      user_defined_save(obj); // primitives saved from here are hashed into h
      set_hash(obj,h);
      hash_addr=old_hash_addr;
      if(hash_addr!=0){
        // add h to the hash being computed
        boost::hash_combine(*hash_addr,h);
      }
    }
    else{
      user_defined_save(obj);
    }
  }

  user_defined_save(primitive_type x) // provided by Boost.Serialization
  {
    if(hash_addr!=0){
      boost::hash_combine(*hash_addr,x);
    }
    // rest of serialization stuff, as always
  }

This implementation does not impose any additional requirement on the objects being serialized and is totally transparent to the user (she doesn't have to do any hash-related work herself). With the computed hash values, Boost.Serialization can emit run-time errors if a trackable object is serialized twice and changed in between (currently, the errors are compile-time).

What do you think?

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo
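For readers who want something compilable, here is a minimal sketch of the pseudocode in the message above. It is not Boost.Serialization code: `checking_archive`, `save_tracked`, `check_unchanged` and `point` are hypothetical names, the local `hash_combine` is a stand-in that mimics the classic `boost::hash_combine` mixing step, and the archive writes nothing (it only computes and compares hashes). Tracking is keyed on address and type here, an assumption made so that an object and its first member, which share an address, do not collide.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Stand-in for boost::hash_combine so the sketch has no dependencies.
inline void hash_combine(std::size_t& seed, std::size_t value) {
    seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical toy archive: computes hashes only, produces no output.
class checking_archive {
public:
    // Save a primitive: fold it into the hash being computed, if any.
    void save(int x) {
        if (hash_addr_ != nullptr)
            hash_combine(*hash_addr_, static_cast<std::size_t>(x));
        // A real archive would write x here.
    }

    // Save a tracked object; T must provide
    // `void save(checking_archive&) const`.
    // Not exception-safe: a throwing T::save leaves hash_addr_ dangling.
    template <class T>
    void save_tracked(const T& obj) {
        std::size_t* outer = hash_addr_;  // enclosing computation, if any
        std::size_t h = 0;
        hash_addr_ = &h;
        obj.save(*this);                  // members are hashed into h
        assert(hash_addr_ == &h);         // nested saves must restore it
        hash_addr_ = outer;
        check_unchanged(&obj, typeid(T), h);
        if (outer != nullptr) hash_combine(*outer, h);
    }

private:
    using key = std::pair<const void*, std::type_index>;

    // First sighting: remember the hash. Later sightings: compare.
    void check_unchanged(const void* addr, std::type_index type,
                         std::size_t h) {
        auto result = seen_.emplace(key(addr, type), h);
        if (!result.second && result.first->second != h)
            throw std::logic_error("tracked object changed between saves");
    }

    std::size_t* hash_addr_ = nullptr;
    std::map<key, std::size_t> seen_;
};

// Example serializable type.
struct point {
    int x, y;
    void save(checking_archive& ar) const { ar.save(x); ar.save(y); }
};
```

With this sketch, saving the same `point` twice through `save_tracked` succeeds, while modifying it between the two saves makes the second call throw `std::logic_error`.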

On Friday 24 June 2005 12:27, Joaquín Mª López Muñoz wrote:
The new rule introduced in Boost.Serialization that forbids saving of non-const objects has proved a little controversial. My point of view is that the rule, albeit far from perfect, provides some level of safety against hard to track errors. Others' opinions differ.
The current rule is a rough approximation to what IMHO would constitute the right enforcement: every time a trackable object is saved, check whether an object with the same address has been previously saved and, if so, make sure that the object didn't change.
To avoid getting deep into a "what is a trackable object" discussion, we can reformulate that as "every time serialization is about to skip writing an object's data and write the id of a previously saved object instead, check.......".
The hardest part is checking for equality. My proposal is to follow a hash-based approach, which is efficient in both time and space (one word per tracked object) and does not impose any special requirement on the serialized objects (for instance, an approach based on operator== would require that objects be equality comparable).
.....
This implementation does not impose any additional requirement on the objects being serialized and is totally transparent to the user (she doesn't have to do any hash-related work herself). With the computed hash values, Boost.Serialization can emit run-time errors if a trackable object is serialized twice and changed in between (currently, the errors are compile-time).
What do you think?
This suggestion is very interesting! But I wonder how you can detect for sure that objects are the same using a hash. Can't two different objects hash to the same value? Maybe operator== is still needed?

- Volodya

Vladimir Prus <ghost <at> cs.msu.su> writes:
On Friday 24 June 2005 12:27, Joaquín Mª López Muñoz wrote:
[...]
The hardest part is checking for equality. My proposal is to follow a hash-based approach, which is efficient in both time and space (one word per tracked object) and does not impose any special requirement on the serialized objects (for instance, an approach based on operator== would require that objects be equality comparable).
.....
What do you think?
This suggestion is very interesting! But I wonder how you can detect for sure that objects are the same using a hash. Can't two different objects hash to the same value? Maybe operator== is still needed?
Admittedly, hash-based checking is not 100% safe, but I don't think resorting to operator== is doable because:

* A serializable type need not be equality comparable.
* Even if the type is equality comparable, you'd have to store a copy of the object first time you see it in order to check it later against later occurrences. This further imposes assignability on the type, and the overhead of copying and storing a whole object can be prohibitive.

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo

Joaquin M Lopez Munoz wrote:
This suggestion is very interesting! But I wonder how you can detect for sure that objects are the same using a hash. Can't two different objects hash to the same value? Maybe operator== is still needed?
Admittedly, hash-based checking is not 100% safe, but I don't think resorting to operator== is doable because:
* A serializable type need not be equality comparable.
* Even if the type is equality comparable, you'd have to store a copy of the object first time you see it in order to check it later against later occurrences. This further imposes assignability on the type, and the overhead of copying and storing a whole object can be prohibitive.
I agree with you. While it's possible that an object is modified but gets the same hash value, in most cases your approach will detect modification. I'd say this is the way to go! - Volodya
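To put a rough number on "in most cases": assuming the hash behaves like a uniformly distributed value of b bits (an idealization, not something established in this thread, and b = 32 is only an example word size), the chance that a modified object happens to produce the previously stored hash is about

```latex
P(\text{missed change}) \approx 2^{-b},
\qquad b = 32 \;\Rightarrow\; 2^{-32} \approx 2.3 \times 10^{-10}
```

A hash with poor mixing on the data actually being saved would do worse than this estimate.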

The hardest part is checking for equality. My proposal is to follow a hash-based approach, which is effective both in terms of complexity and space (one word per tracked object.), ... Admittedly, hash-based checking is not 100% safe, ...
I think this is unacceptable. If I have two objects A and B, and they do happen to hash to the same value, then B won't get saved. However good the hash code is does not matter: this would stop me using the serialization library in anything mission critical.

BTW, I've been trying to follow this "serialization and const" discussion but getting lost in places. Are we talking about the same problems I had described here:
http://lists.boost.org/boost/2005/04/24700.php

As you can see:
a) The workaround was easy enough once I knew what the problem was;
b) I saw the only solution to be educating library users to be aware of the issues.

Darren

Darren Cook <darren@dcook.org> writes:
The hardest part is checking for equality. My proposal is to follow a hash-based approach, which is effective both in terms of complexity and space (one word per tracked object.), ... Admittedly, hash-based checking is not 100% safe, ...
I think this is unacceptable. If I have two objects A and B, and they do happen to hash to the same value, then B won't get saved.
Only if they have the same address, in which case you've done something wrong anyway and you want an assertion.
However good the hash code is does not matter: this would stop me using the serialization library in anything mission critical.
I think one of us is misunderstanding the proposal. -- Dave Abrahams Boost Consulting www.boost-consulting.com
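The point about addresses can be illustrated with a small sketch. It assumes, as the proposal describes, that the stored hash is only consulted when an object's address has been seen before; `tracker`, `needs_full_save` and `record` are hypothetical names, and the "hash" is deliberately trivial so that two equal objects are guaranteed to collide.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

struct record { int value; };

// Hypothetical toy tracker keyed by address, with a stored hash per address.
class tracker {
public:
    // Returns true if the object must be written in full, false if a
    // back-reference to an earlier save of the same address is enough.
    bool needs_full_save(const record& r) {
        std::size_t h = static_cast<std::size_t>(r.value);  // toy hash
        auto result = seen_.emplace(&r, h);
        if (result.second) return true;  // new address: always written
        // Same address seen before: the hash is only used to check
        // that the object has not changed since then.
        assert(result.first->second == h &&
               "tracked object changed between saves");
        return false;
    }

private:
    std::map<const record*, std::size_t> seen_;
};
```

Two live objects `a` and `b` with equal contents, and therefore equal hashes, have different addresses, so both are written in full. Only a second save of `a` itself is replaced by a back-reference.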

Joaquín Mª López Muñoz <joaquin@tid.es> writes:
The new rule introduced in Boost.Serialization that forbids saving of non-const objects has proved a little controversial. My point of view is that the rule, albeit far from perfect, provides some level of safety against hard to track errors. Others' opinions differ.
The current rule is a rough approximation to what IMHO would constitute the right enforcement: every time a trackable object is saved, check whether an object with the same address has been previously saved and, if so, make sure that the object didn't change.
I don't understand what bugs this is going to catch. Certainly in this case,

  for( ...
      X x(... );
      ar << x;

the low-level problem isn't that x is changing, but that it's a different object each time. Two temporally different x's might well be identical.

But all of this misses the high-level problem: the author of the code doesn't know what he's doing. You simply can't serialize objects from distinct scopes with tracking into the same archive, because there may be aliasing. And there's nothing we can reasonably do to detect that problem when the aliased objects have the same type (asserting when a tracked pointer is serialized again with a new type is a great idea).

There should be a mechanism to manually clear all tracking so that the user can tell the archive that he's serializing tracked objects in a new scope.

Finally, this has nothing at all to do with constness.

<disclaimer> Of course, I might be missing something important :) </disclaimer>

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
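The aliasing problem described above can be reproduced with a toy model. This is only a sketch under stated assumptions: `toy_archive` is a hypothetical class that tracks by address alone, `reset_tracking` is a hypothetical stand-in for the "clear all tracking" mechanism asked for in the message (not an existing Boost.Serialization call), and placement new into a fixed buffer is used to make the address reuse of successive loop-local objects deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <set>
#include <vector>

struct X { int value; };

// Hypothetical toy archive that tracks saved objects by address only.
class toy_archive {
public:
    void save(const X& x) {
        if (seen_.insert(&x).second)
            written_.push_back(x.value);  // first time: write the data
        else
            ++skipped_;                   // seen before: back-reference only
    }
    // Hypothetical: tell the archive a new scope of objects begins.
    void reset_tracking() { seen_.clear(); }

    const std::vector<int>& written() const { return written_; }
    std::size_t skipped() const { return skipped_; }

private:
    std::set<const X*> seen_;
    std::vector<int> written_;
    std::size_t skipped_ = 0;
};

// Saves n distinct short-lived objects that all occupy the same storage,
// as successive loop-local variables typically do.
inline void save_short_lived(toy_archive& ar, int n, bool reset_each_time) {
    alignas(X) unsigned char slot[sizeof(X)];
    for (int i = 0; i < n; ++i) {
        X* x = new (slot) X{i};           // new object, reused address
        if (reset_each_time) ar.reset_tracking();
        ar.save(*x);
        x->~X();
    }
}
```

Without the reset, only the first of three objects is written and the other two are silently treated as repeats. With the reset before each save, all three are written.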

David Abrahams wrote:
Joaquín Mª López Muñoz <joaquin@tid.es> writes:
The new rule introduced in Boost.Serialization that forbids saving of non-const objects has proved a little controversial. My point of view is that the rule, albeit far from perfect, provides some level of safety against hard to track errors. Others' opinions differ.
The current rule is a rough approximation to what IMHO would constitute the right enforcement: every time a trackable object is saved, check whether an object with the same address has been previously saved and, if so, make sure that the object didn't change.
I don't understand what bugs this is going to catch. Certainly in this case,
for( ... X x(... ); ar << x;
the low-level problem isn't that x is changing, but that it's a different object each time. Two temporally different x's might well be identical.
But all of this misses the high-level problem: the author of the code doesn't know what he's doing. You simply can't serialize objects from distinct scopes with tracking into the same archive, because there may be aliasing.
Now the key question: why do I need tracking if I never save an object with the same address both by pointer and by value? I don't. In some other part of the program I might be serializing vector<X*>, but as long as none of those pointers point to stack objects, I don't need tracking in the above loop. Yet the serialization library insists on tracking just because a pointer to X is serialized somewhere else in the program.

- Volodya

Vladimir Prus <ghost@cs.msu.su> writes:
Now the key question: why do I need tracking if I never save an object with the same address both by pointer and by value? I don't.
Surely if you have an object graph where everything is saved by pointer you want to track those cases as well?
In some other part of the program I might be serializing vector<X*>, but as long as none of those pointers point to stack objects, I don't need tracking in the above loop.
Tracking in the above loop wouldn't help you if those pointers pointed to stack objects, because, for that other part of the program to be legal, the pointers would have to point to stack objects other than the ones in the loop, which have all been destroyed by that point. -- Dave Abrahams Boost Consulting www.boost-consulting.com
participants (5)
- Darren Cook
- David Abrahams
- Joaquin M Lopez Munoz
- Joaquín Mª López Muñoz
- Vladimir Prus