Re: [boost] [review] Review of Unordered library begins today December 7

11 Dec 2007

      Daniel:  First, thanks for your reply.  And sorry if I came across as  
overly critical.  I think your library is excellent.  The point of  
the review process is to try to make it even better :)  In fact, if  
we could make the C++ standard better in the process, I'd be delighted.

As I've said in my review, but to be clear I'll say once again, none  
of these items under discussion below are contingent to my  
(enthusiastic) vote in favor of your library.  They are merely for  
your consideration.

On Dec 10, 2007, at 3:54 PM, Daniel James wrote:
...
...
I find it unfortunate that it's not defined in the standard.  Howard
gives the same implementation I would have given, which works in O(N)
time.
And only works for unordered_set. How would you define equality for
unordered_multimap? Do you compare the mapped values, does the order
of the values matter? If it doesn't then the comparison becomes more
expensive, if it does then it's a bit odd for an 'unordered' container
(although this implementation does preserve insertion order, if it
didn't then things would be worse). Is it okay to define equality in
terms of the equality predicate for keys, but operator== for values?
It works equally well with a little tweak if you require equality of  
count(key) for the multiset.  I didn't think about map/multimap,  
thanks for pointing this out.  But Daniel, what exactly does  
operator== for std::multimap do???  In fact, because insertion in  
std::multimap inserts at the end of the range, the value of  
std::multimap depends on the insertion order.  You wouldn't be doing  
anything different with unordered_multimap.  And as you mention, your  
implementation does already preserve insertion order, so you wouldn't  
have any problem to supply these extensions, and they'd be well- 
defined.   Isn't it better if you can provide stronger guarantees  
than the current draft standard, just as you already do for exceptions?

Now the standard should really define what happens when inserting  
equivalent keys in unordered_multimap, just as it does for multimap,  
but we're not fixing the standard.  I'll raise that point on another  
mailing list if I get around it.  As I understand, it depends on  
whether you use singly- or doubly-linked lists for chaining, and I  
buy your argument as to why use doubly-linked lists for multiset.  I  
think you made good points there.

As for the equality comparison vs predicate, the same again is true  
for operator== for the standard associative containers, defined as  
a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin()   
[sic] (no, *I* didn't forget the final paren, it's not on page 671  
either).  Nowhere in that makes use of the "equivalence" implied by  
operator<, which is used for uniqueness in set/map and equivalence in  
multiset/multimap.  So by defining equality in terms of the equality  
predicate for keys in order to check for the uniqueness property, but  
operator== for both keys and values in the implementation of  
operator== for unordered_map/multimap, you wouldn't exactly break new  
ground.

Or am I missing some subtle implications or other clause in the  
standard?
...
...
I found also the following bogus note in the Working Draft:
...
10 Unordered associative containers are not required to support the
expressions a == b or a != b. [ Note: This is because the container
requirements define operator equality in terms of equality of
ranges. Since the elements of an unordered associative container
appear in an arbitrary order, range equality is not a useful
operation. — end note]
That's not bogus at all. An equality operator for unordered containers
would have completely different semantics to all other containers
(especially when using custom comparisons, comparing associative
containers can have surprising results).
Why not? As I argue above, the same is already true of the ordered  
associative containers.  I don't think you'd be providing anything  
more surprising than already present in the standard.

As for surprising semantics, a set and multiset have well-defined  
notions of value (in a mathematical sense), i.e., a subset from a  
ground set of key, which is exactly what operator== compares.  For  
map and multimap, the value is a function from a ground set of keys  
to a value for map, sequence of values for multimap (sequence means  
that order matters).  The operational definition in terms of range is  
simply an algorithm for comparing those values.  Exactly the same is  
true for the values of unordered containers;  operator== as Howard  
and I've defined them precisely perform this comparison of value.  I  
find nothing surprising in that.  The fact that operator== is defined  
in terms of ranges for sequences and ordered associative containers,  
and is defined/implemented differently for unordered containers, is  
not surprising, it's necessary.  Again, I believe the standard could  
be improved on this, and wouldn't your library be the natural vector  
to provide a reference implementation for a proposal to define  
operator== for unordered containers?
...
"Right now, I'm trying to limit myself to obvious design decisions for
extensions so I don't really want to go into this area.
Fair enough.  You gave good points.  I hope my counter-arguments are  
equally relevant, but you're of course free to not implement them.   
Still, as boost explores possibilities for later standardization, it  
could be hoped that you would provide an implementation that provides  
prior art for future discussion in the committee.  Whether you or  
someone else (perhaps me, I really would like to) would make the  
proposal, is for the future.

Still, I think there's only one sensible definition for  
unordered_set, and unordered_multiset, and the multiset version isn't  
trivial either.  Would be nice if at least those werre in the  
library, perhaps with a named function instead of an operator.  Would  
it be a problem if you at least provided those?
...
...
The prime number list is too short, it roughly doubles every time.
On the other hand, I understand that it wouldn't do that have a
*much* larger list, as this static array is included in every
translation unit.
...
I think it can actually cause an ODR violation so I probably should
change that anyway. David Abrahams recently posted a way to avoid this
by including the value in a template, so I'll try that.
In theory, yes.  In practice, linkers don't really enforce it, I  
hear.  Having it as a template doesn't solve the basic problem, which  
is the replication of identical data in many translation units.  In  
fact, it might exacerbate it by having the same table defined  
multiple times in the same translation unit!   Where I work, we're  
trying to curb executable sizes (which can be enormous, like you  
wouldn't believe).  This is precisely what would get a frown at the  
code review.
...
How much of a difference would it make? The memory requirements for
the array of buckets is typically smaller than the memory required for
the nodes (unless you have a small number of nodes or increase the
maximum load factor). T
You're correct, unless people like to have very large maps of small  
types (unordered_map<int, some_type*> with 10M elements) where it  
starts to really matter.  On the other end, it also matters if a user  
has lots of small maps.  I think it can really be a problem, in fact  
I have been in that situation myself (having to fix the prime number  
table of our implementation of hash table, by request of one user  
with performance-critical needs).  So here I actually speak from  
practical experience.

I believe the best thing, is to move the table to the .cpp, but I  
also I totally understanding the appeal of header-only libraries.   
Would you consider a configuration flag, whereby unless explicitly  
defined, your table is in the header, but for a user which controls  
its own site installation, a table can be generated into a .cpp and  
compiled into a library to be linked into programs using  
Boost.Unordered?  I actually did write such a piece of code, it isn't  
hard, and you can specify the step (30% or 100%) and the range of the  
desired table. If you include this code into the example section and  
make the flag public, your users can choose what they prefer based on  
their requirements.

Then again, if they're that sophisticated, maybe they can do it  
themselves.
...
...
Re: swap:  I really think you should be following the committee's
current advice (I'm looking at the code of hash_table_impl.hpp, lines
1194 -- 1219), sooner rather than later.  But my opinion is biased.
Why is your opinion biased?
My boss is John Lakos, and my colleague Pablo Halpern.  They both  
have strong opinions on allocators and value-semantic types.  In  
fact, they're arguing exactly about these issues with the committee  
(N2436, N2387, N1850 ).  I've been converted myself :)
...
That works both ways, as the committee could change its mind. Since we
don't have concepts, following their current decision means slow
swaps. Which isn't that bad since it's only for the rare occasions
when the allocators aren't equal. I haven't got a strong preference
here either, but this was discussed before and the general consensus
seemed to be for what I've done.
What I hear from Pablo and John is that they're getting good support  
around their proposal (ok, they're biased too :)...  Maybe you're  
right to wait and see what the committee will do.  Hopefully, won't  
take long as C++0x is around the horizon and there'll be two meetings  
in Jan. and June.  Somehow, I didn't think that Pablo's proposal  
(N2436) required concepts, but required traits instead.  But I'm not  
intimately familiar.
--
Hervé Brönnimann
hervebronnimann@mac.com