Re: [boost] [multi_index][ann] Boost.MultiIndex 1.56 preview: major update with new hashed index implementation

25 Dec 2013

      Ion Gaztañaga <igaztanaga <at> gmail.com> writes:
...
El 25/12/2013 18:12, Joaquín Mª López Muñoz escribió:
...
The most important change is a complete reimplementation of the
internals of hashed indices, so that erasure of elements is
constant-time regardless of the level of occupancy of the internal
bucket array [...]
Joaquín, you blog posts are incredibly interesting. Congratulations.
Thank you!
...
Just in case you want to check it in the future, Boost.Intrusive uses a 
policy-based data structure, but suffers from the same performance 
degradation as MultiIndex had in pre-1.56 versions. AFAIK, the good old 
bucket rooted singly linked list is the only one that can support 
"auto-unlink" hooks (nodes that delete themselves from the container 
without having a pointer to the container).
The data structure I propose can do auto unlinking. Check
https://github.com/boostorg/multi_index/blob/master/
include/boost/multi_index/detail/hash_index_node.hpp , functions
unlink @ line 280 (non-duplicate elements) and unlink @ line 438
(duplicate elements): in both cases unlinking does not need any
additional information from the container.
...
- It does avoid the "iterator erase(...)" problem returning "void" 
instead of "iterator" (as Intrusive need not to be standard-conforming).
That would be my preferred soluton at any rate but unfortunately
the committee decided that LWG issue #579

http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-closed.html#579

was a NAD, so I had to comply, even though I take it as proven that
complying incurs an extra pointer per element that the original
proponents of unordered associative containers didn't see coming.
...
- If the user is willing to pay the price, it can offer "group" linking 
as Boost.Unordered does for duplicated elements but although this does 
not solve the "depleted bucket" syndrome. Your proxy-returning erase()
Unfinished sentence?
...
- It can also store the hash value in the node and this is used for 
faster rehashing and lookups but it does not implement the 
libstdc++/libc++/Boost.Unordered data structure. This was planned for 
the future to improve erasure times.
- Power of two buckets can be optionally enabled to speed up some 
operations if the user is sure the hash function is good enough.
You novel data structure is really interesting, although I guess it's 
not easy to implement, specially when duplicates are around.
On the contrary, I think Boost.Intrusive can very easily implement
it as it has the auto-unlinking property. Drop me a line if you want
to port it to your lib.

I think the variant for duplicate elements can be improved, as currently
group linking kicks in only for #group>2. I decided to leave it like
that for the moment but will get back to it when time permits.
...
It's interesting to see how well the Dinkumware implementation performs, 
as the memory cost is now lower than some years ago, now that the rest 
of implementations have added the hash value in the node (the overhead 
of the Dinkumware implementation is one pointer per bucket).
Dinkumware's excellent performance (despite its heavy overhead of
2 pointers per node plus 2 pointers per bucket entry) comes from
to factors:

* Bucket arrays are powers of two,
* the data structure has very good locality,

But is has a serious problem with erasure, as it's basically violating
the standard by invoking the hash function, which can throw --and
erasure can't throw. The only fix would be adding an additional
node field to cache the hash value. As it stands now, Dinkumware's
implementation of standard unordered associative containers is broken.

Best,

Joaquín M López Muñoz
Telefónica Digital

Re: [boost] [multi_index][ann] Boost.MultiIndex 1.56 preview: major update with new hashed index implementation

Joaquin M Lopez Munoz