Re: [boost] [serialization] a proposal for an alternative to the new const-saving rule

----- Original Message ----- From: David Abrahams <dave@boost-consulting.com> Date: Saturday, June 25, 2005, 1:39 am Subject: Re: [boost] [serialization] a proposal for an alternative to the new const-saving rule
"JOAQUIN LOPEZ MU?Z" <joaquin@tid.es> writes:
De: David Abrahams <dave@boost-consulting.com>
But all of this misses the high-level problem: the author of the code doesn't know what he's doing. You simply can't serialize objects from distinct scopes with tracking into the same archive, because there may be aliasing.
Totally agreed, this is what we are trying to detect in order to protect the author of the code.
And there's nothing we can reasonably do to detect that problem when the aliased objects have the same type.
Nothing? I'm afraid I don't get you. A perfect aliasing detection mechanism is probably impossible to implement, but the hash test at least approximates one. This is better than providing no safety mechanism at all, which is what I understand you to be advocating.
I'm not sure it is. There's an imposition on users: all the types they want to serialize have to support hashing.
No, no. Please check the piece of pseudocode in my first message: the hash code is automatically built by Boost.Serialization, without any intervention from the user, and certainly without any requirement that the type be hashable (in the sense of providing a hash_value overload or something like this). Let me illustrate with an example:

struct labelled_point
{
  int x;
  int y;
  string label;

  template<class Archive>
  void serialize(Archive & ar, const unsigned int version)
  {
    ar & x;
    ar & y;
    ar & label;
  }
};

struct labelled_segment
{
  labelled_point p0, p1;

  template<class Archive>
  void serialize(Archive & ar, const unsigned int version)
  {
    ar & p0;
    ar & p1;
  }
};

So, given an object seg of type labelled_segment, the framework automatically constructs the following associated hash value:

combine(
  combine(
    boost::hash<int>()(seg.p0.x),
    boost::hash<int>()(seg.p0.y),
    boost::hash<std::string>()(seg.p0.label)),
  combine(
    boost::hash<int>()(seg.p1.x),
    boost::hash<int>()(seg.p1.y),
    boost::hash<std::string>()(seg.p1.label)))

(combine(...) is shorthand for the obvious combination formula of hash values using boost::hash_combine.) The process recursively goes down to primitive types, as specified in my proposed pseudocode, and only these have to be hashable; fortunately, they are.
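To make the recursion concrete, here is a minimal sketch of my own (not the library's implementation; the class name hashing_archive and its interface are assumptions) of a saving-side visitor that accumulates a combined hash as it walks serialize(). Only the leaf overloads touch boost::hash; user-defined types participate solely through the serialize() members they already have.

#include <boost/functional/hash.hpp>
#include <iostream>
#include <string>

// Hypothetical saving-side visitor: each "ar & member" folds the member
// into a running hash, recursing through serialize() just as a real
// saving archive would walk the object.
class hashing_archive
{
public:
  std::size_t digest() const { return seed_; }

  // Leaf (hashable) types: fold the value directly.
  hashing_archive & operator&(int v)
  { boost::hash_combine(seed_, v); return *this; }

  hashing_archive & operator&(const std::string & v)
  { boost::hash_combine(seed_, v); return *this; }

  // Everything else: recurse into its serialize() member.
  template<class T>
  hashing_archive & operator&(T & v)
  { v.serialize(*this, 0); return *this; }

private:
  std::size_t seed_ = 0;
};

struct labelled_point
{
  int x, y;
  std::string label;

  template<class Archive>
  void serialize(Archive & ar, const unsigned int) { ar & x & y & label; }
};

struct labelled_segment
{
  labelled_point p0, p1;

  template<class Archive>
  void serialize(Archive & ar, const unsigned int) { ar & p0 & p1; }
};

int main()
{
  labelled_segment seg{{0, 0, "origin"}, {3, 4, "tip"}};
  hashing_archive ar;
  ar & seg; // walks the whole object, combining leaf hashes
  std::cout << "hash: " << ar.digest() << '\n';
}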
It is nice that the serialization library automatically takes care of hashing aggregated types and leaving out the unserialized data...
uh, wait: this will never work unless you plan only to do shallow hashing. Otherwise you will get an exponential explosion for some object graphs. Is that your intention?
I'm not getting you. The hash value is calculated as part of the saving process itself, so it has the very same complexity. I have a hunch you mean something else; could you please elaborate?

Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo

"JOAQUIN LOPEZ MU?Z" <joaquin@tid.es> writes:
David Abrahams <dave@boost-consulting.com> wrote: to thenewconst-saving rule
"JOAQUIN LOPEZ MU?Z" <joaquin@tid.es> writes:
De: David Abrahams <dave@boost-consulting.com>
I'm not sure it is. There's an imposition on users: all the types they want to serialize have to support hashing.
No, no. Please check the piece of pseudocode on my first message: the hash code is automatically built by Boost.Serialization, without any intervention from the user,
I understand that.
and certainly without any requirement that the type be hashable (in the sense of providing a hash_value overload or something like this.)
I was talking about the type's sub-objects.
Let me illustrate with an example:
<snip> I actually did understand that, as you can see from...
The process recursively goes down to primitive types, as specified in my proposed pseudocode, and only these have to be hashable, but fortunately they are.
It is nice that the serialization library automatically takes care of hashing aggregated types and leaving out the unserialized data...
...this remark, and the remarks at the end of my message, which you snipped.
uh, wait: this will never work unless you plan only to do shallow hashing. Otherwise you will get an exponential explosion for some object graphs. Is that your intention?
I'm not getting you. The hash value is calculated as part of the saving process itself, so it has the very same complexity. I have a hunch you mean something else; could you please elaborate?
The question is, when encountering a pointer or reference member that needs to be hashed, do you do what the serialization process does and hash the thing that it refers to, or do you just hash its address? The former is deep hashing; the latter is shallow hashing.

In a graph with tracking, you serialize an object X once, and all subsequent times you encounter X in the graph you just serialize an id. If you add hashing, the first time you encounter X you'll hash it and serialize it. For this one time it doesn't matter if hashing is deep or shallow from a big-O perspective. However, when you encounter X again, if hashing is deep, you'll do a lot of extra work. Actually, a potentially infinite amount if there are cycles, so hashing really does have to be shallow. I guess I answered my own question.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
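A minimal sketch of the distinction, for illustration only (the node type and both functions are assumptions, not code from the thread): shallow hashing folds in the pointer's address and terminates trivially, while naive deep hashing follows the pointer and loops forever on a cycle unless a visited set cuts it off, at which point each node is hashed from its immediate members exactly once, which is the same information shallow hashing collects.

#include <boost/functional/hash.hpp>
#include <cstddef>
#include <set>

struct node
{
  int value;
  node * next; // may point back into the graph, forming a cycle
};

// Shallow hash: immediate members only; the pointer contributes just
// its address, so cycles are harmless and cost is O(members).
std::size_t shallow_hash(const node & n)
{
  std::size_t seed = 0;
  boost::hash_combine(seed, n.value);
  boost::hash_combine(seed, n.next);
  return seed;
}

// Deep hash: follows the pointer. Without the visited set it would
// recurse forever on a cycle (e.g. x.next == &x); with it, each node
// gets hashed from its immediate members exactly once.
std::size_t deep_hash(const node & n, std::set<const node *> & seen)
{
  std::size_t seed = 0;
  boost::hash_combine(seed, n.value);
  if(n.next != nullptr && seen.insert(n.next).second)
    boost::hash_combine(seed, deep_hash(*n.next, seen));
  return seed;
}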

David Abrahams wrote:
I'm not getting you. The hash value is calculated as part of the saving process itself, so it has the very same complexity. I have a hunch you mean something else; could you please elaborate?
The question is, when encountering a pointer or reference member that needs to be hashed, do you do what the serialization process does and hash the thing that it refers to, or do you just hash its address? The former is deep hashing; the latter is shallow hashing.

In a graph with tracking, you serialize an object X once, and all subsequent times you encounter X in the graph you just serialize an id. If you add hashing, the first time you encounter X you'll hash it and serialize it. For this one time it doesn't matter if hashing is deep or shallow from a big-O perspective. However, when you encounter X again, if hashing is deep, you'll do a lot of extra work. Actually, a potentially infinite amount if there are cycles, so hashing really does have to be shallow. I guess I answered my own question.
If hashing is shallow, then what does it check? The library already has a mechanism to check that the same address is serialized, and the point of the proposal, as I understand it, is to detect cases like:

A* a = new A();
archive << a;
// modify *a
archive << a;

In this case, deep hashing is needed. Yes, it can be costly; all the more reason to make it optional, maybe even at runtime.

- Volodya
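A toy model of address tracking (my own sketch, not Boost.Serialization code; the class tracking_oarchive is hypothetical) shows what goes wrong in this case: the second save of the same address emits only a reference id, so the intervening modification never reaches the archive.

#include <cstddef>
#include <iostream>
#include <map>
#include <sstream>

struct A { int n = 0; };

// Hypothetical tracking archive: remembers every address it has saved
// and, like real tracking, writes only an id on repeat encounters.
class tracking_oarchive
{
public:
  explicit tracking_oarchive(std::ostream & os) : os_(os) {}

  tracking_oarchive & operator<<(const A * p)
  {
    auto r = ids_.emplace(p, ids_.size());
    if(r.second)
      os_ << "object " << r.first->second << " n=" << p->n << '\n';
    else
      os_ << "ref " << r.first->second << '\n'; // state is NOT re-read
    return *this;
  }

private:
  std::ostream & os_;
  std::map<const A *, std::size_t> ids_;
};

int main()
{
  std::ostringstream out;
  tracking_oarchive archive(out);

  A * a = new A;
  archive << a; // writes "object 0 n=0"
  a->n = 42;    // modify *a between the two saves
  archive << a; // writes only "ref 0": the 42 is silently lost
  delete a;

  std::cout << out.str();
}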

Vladimir Prus <ghost@cs.msu.su> writes:
David Abrahams wrote:
I'm not getting you. The hash value is calculated as part of the saving process itself, so it has the very same complexity. I have a hunch you mean something else; could you please elaborate?
The question is, when encountering a pointer or reference member that needs to be hashed, do you do what the serialization process does and hash the thing that it refers to, or do you just hash its address? The former is deep hashing; the latter is shallow hashing.

In a graph with tracking, you serialize an object X once, and all subsequent times you encounter X in the graph you just serialize an id. If you add hashing, the first time you encounter X you'll hash it and serialize it. For this one time it doesn't matter if hashing is deep or shallow from a big-O perspective. However, when you encounter X again, if hashing is deep, you'll do a lot of extra work. Actually, a potentially infinite amount if there are cycles, so hashing really does have to be shallow. I guess I answered my own question.
If hashing is shallow, then what does it check?
That the immediate members of an object serialized at a particular address have the same values.
The library already has a mechanism to check that the same address is serialized, and the point of the proposal, as I understand it, is to detect cases like:

A* a = new A();
archive << a;
// modify *a
archive << a;

In this case, deep hashing is needed. Yes, it can be costly; all the more reason to make it optional, maybe even at runtime.
Deep hashing is impossible. In the case of a reference cycle it will never terminate. You could do deep hashing with marking of visited objects so you don't hash them twice, but that's equivalent to shallow hashing.

--
Dave Abrahams
Boost Consulting
www.boost-consulting.com
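Putting the two points together, here is a sketch (again hypothetical; checking_oarchive and shallow_hash are my names, not the library's) of how a shallow-hash check could sit on top of tracking and catch Volodya's modify-and-resave case without any deep traversal: the archive remembers the shallow hash taken at first save and compares it on every repeat encounter of the same address.

#include <boost/functional/hash.hpp>
#include <cstddef>
#include <iostream>
#include <map>
#include <stdexcept>

struct A
{
  int n = 0;
  A * next = nullptr;
};

// Shallow hash of the immediate members; the pointer contributes only
// its address.
std::size_t shallow_hash(const A & a)
{
  std::size_t seed = 0;
  boost::hash_combine(seed, a.n);
  boost::hash_combine(seed, a.next);
  return seed;
}

// Hypothetical checking archive: records the shallow hash at first
// save and complains if a later save of the same address disagrees.
class checking_oarchive
{
public:
  checking_oarchive & operator<<(const A * p)
  {
    std::size_t h = shallow_hash(*p);
    auto r = hashes_.emplace(p, h);
    if(!r.second && r.first->second != h)
      throw std::logic_error("object modified between saves");
    return *this;
  }

private:
  std::map<const A *, std::size_t> hashes_;
};

int main()
{
  checking_oarchive archive;
  A * a = new A;
  archive << a; // records shallow hash of {0, nullptr}
  a->n = 42;    // the mistake the proposal wants to catch
  try
  {
    archive << a; // same address, different shallow hash: detected
  }
  catch(const std::logic_error & e)
  {
    std::cout << "detected: " << e.what() << '\n';
  }
  delete a;
}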
participants (3)
- David Abrahams
- JOAQUIN LOPEZ MUÑOZ
- Vladimir Prus