Request for feedback on design of concurrent_unordered_map, plus notes on use of memory transactions
Dear Boost,
Finally I have found some time to write up my experiments with
using transactional GCC or Intel TSX transactional memory to make an
improved concurrent_unordered_map. The quick tl;dr summary is this:
1. Transactional GCC is great iff you are happy accepting up to a 50%
performance reduction in normal code execution. For a greenfield
multithreaded project this is absolutely worth it due to the enormous
savings in debug effort.
2. First generation Intel TSX isn't worth using in almost any common
use case. The only common case where it makes sense is almost-always
read-only data which is *always* highly contended, where speedups can
approach linear in the number of execution cores. Even then, the
refactoring you need to do to avoid writes to any cache line touched
by any other thread will make your code slower than throwing
everything inside a standard spin lock. Spin locks are well optimised
and allow simple implementations, and they are very fast for
uncontended use. Intel TSX in either form (HLE, RTM) introduces
penalties for uncontended use and seems to force a pipeline stall
which can reduce single threaded performance by as much as 40%.
3. Trying to implement atomics-based logic to have the CPU select a
non-TSX path during uncontended use and TSX during contended use
comes out, at best, slightly slower than always using TSX. This, for
me, simply kills TSX v1.0; I no longer consider it worth thinking
about until Intel ships a v2.0.
For reference, for those wishing to repeat my experiments, you can
see the history of my four attempts at a concurrent_unordered_map at
https://github.com/ned14/boost.spinlock/commits/concurrent_unordered_map2
plus benchmark results at
https://ci.nedprod.com/view/Boost%20Thread-Expected-Permit/job/Memory%20Transactions%20Test%20Linux%20GCC%204.8/.
So, onto my proposed new concurrent_unordered_map design, which is
NOT a split ordered list design as used by Intel TBB and Microsoft
PPL. I had four criteria for my concurrent_unordered_map design,
based on my needs for the proposed Boost.AFIO:
1. erase() is safe. This is for me the big showstopper with split
ordered list designs, and I think it probably is for most people.
2. insert(), find(), and erase() all had to have single threaded
performance similar to a std::unordered_map<> with a spin lock around
it i.e. fast.
3. Everything apart from rehashing should be able to be concurrent.
This includes iteration, inserting, finding, erasing, sizing.
4. Memory transaction friendly. This means transactional GCC right
now, but maybe some future Intel TSX v2.0 support later. This
criterion, incidentally, *requires* C++11 noexcept support, or at
least I can't see a way to avoid it (just to be clear, I am claiming
that for the implementation to _efficiently_ use memory transactions
it needs noexcept move construction for both the key and the mapped
type).
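To illustrate what criterion 4 implies in code, here is a minimal
sketch (the names are illustrative only, not part of any proposed
API) of the compile-time check it amounts to:

#include <type_traits>

// Hypothetical compile-time check implied by criterion 4: the moves done
// inside a memory transaction must never throw, so both the key and the
// mapped type must be nothrow move constructible.
template<class Key, class T>
struct transaction_friendly
{
    static_assert(std::is_nothrow_move_constructible<Key>::value,
                  "key type must have a noexcept move constructor");
    static_assert(std::is_nothrow_move_constructible<T>::value,
                  "mapped type must have a noexcept move constructor");
};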
Designs tried and benchmarked and rejected:
1. Items in a bucket were kept via an atomic forward pointer to the
next item, and cmpxchg was used to insert and remove items (a rough
sketch of this approach appears after this list). This had really
great, even superb, performance for load factors near one, but
became pathological for badly distributed hash functions (I'm looking
at you, libstdc++). Once you rehashed the hash function's output to
make it better distributed, performance nose dived. I was tempted to
override std::hash with a decent implementation ...
2. Items in a bucket were kept as a linear list of hashes and
atomic pointers to value_type's. Each atomic pointer's bottom bit was
its spin lock, to keep each item's footprint small, so you could lock
each individual item during use. The problem here really was how to
safely resize the table without a per-bucket lock, and solving that
halved performance relative to a spinlocked unordered_map, as
transactions really didn't deliver as expected here.
3. I created a specialisation, only for mapped types which are some
std::shared_ptr<>, which let me implement a completely lock free
design. Unfortunately it was also rather slow, at about half the
performance of a spinlocked unordered_map. Atomic increments and
decrements of the same cache line across execution cores are slow :)
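For reference, here is a minimal, untested sketch of the cmpxchg
insertion used by design 1 above (the node layout and names are
illustrative only, not the rejected code itself):

#include <atomic>
#include <cstddef>

// Illustrative node layout for design 1: each item carries an atomic
// forward pointer to the next item in its bucket.
struct node
{
    std::size_t hash;            // cached hash of the key
    std::atomic<node *> next;    // forward pointer to the next item
    // ... key/value payload would live here ...
};

// Lock free push of a new item onto the front of a bucket's list using
// compare-exchange (the "cmpxchg" referred to above).
inline void bucket_push_front(std::atomic<node *> &bucket_head, node *n)
{
    node *old_head = bucket_head.load(std::memory_order_relaxed);
    do
    {
        n->next.store(old_head, std::memory_order_relaxed);
    } while(!bucket_head.compare_exchange_weak(old_head, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed));
}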
Final proposed design (comments welcome):
* item_type is allocated via a rebind of the supplied allocator, and
looks like:
struct item_type
{
  value_type p;   // the stored value
  size_t hash;    // cached hash of the key; also used to mark empty slots
};
If value_type has a big sizeof(), a very good idea is to make the
above a unique_ptr, as otherwise you can't pack many item_type's into
a cache line (AFIO's value_type is a std::pair which is 24 bytes, so
item_type is 32 bytes, which works well for my purposes).
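A rough sketch of the two layouts being contrasted (illustrative
names only, not the proposed code):

#include <cstddef>
#include <memory>

// Small value_type: store it inline so several item_type's fit in a cache line.
template<class value_type>
struct item_type_inline
{
    value_type p;
    std::size_t hash;
};

// Large value_type: hold it via unique_ptr so item_type stays at pointer +
// hash, at the cost of an extra allocation and an indirection on access.
template<class value_type>
struct item_type_indirect
{
    std::unique_ptr<value_type> p;
    std::size_t hash;
};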
On 28/07/2014 03:20, Niall Douglas wrote:
* item_type is allocated via a rebind of the supplied allocator, and looks like:
struct item_type { value_type p; size_t hash; };
I presume this will support partial construction to handle the case when value_type does not have a default constructor, or is movable-only, etc?
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome).
Isn't it by definition (admittedly reality is less ideal) that all hash values have equal probability? What happens if a real item does end up with the magic "empty" hash value? I think it'd be safer to have a separate flag for this case.
* To erase an item is easy: iterators store an iterator to the bucket and an offset into the linear table. Erasure involves constructing a temporary value_type, locking the bucket and performing a *move* construction out of the table into the temporary value_type, resetting the hash to empty. Destruction can then occur without holding the bucket lock. If the item just erased is the last in the bucket, the bucket's size (not capacity) is reduced by one, and that is repeated until the bucket size is the minimal possible.
If the hash is reset to empty and the lock released before destruction occurs, doesn't that mean that something else could try to concurrently construct into the same storage? After all, the item slot will now look empty to insert(). Similarly the bucket size could be reduced by one just as insert() puts something into that bucket slot (thinking that it was still within the valid range, so not trying to increment it itself).
* size() under this design becomes O(bucket count) as we sum the count members of each bucket. empty() is worst case O(bucket count) when the map is empty, best case O(1) when the first bucket is not empty, average case O(bucket count/item count/2).
You could use a relaxed count at the top level to improve this, possibly. size() and empty() become statistical rather than prescriptive in the presence of concurrent writes anyway, so a stale value is not harmful. Although I suppose conversely updating a shared counter will cause cache contention performance loss for insert/erase, which may not be worth it for the benefit of statistical functions that are less likely to even be called in a concurrent design.
Thoughts on this design before I write a final implementation are welcome. I don't plan to complete my implementation past the minimum needed for the proposed Boost.AFIO, so if someone else wishes to finish it and make it part of official Boost, that would be great.
I presume you've looked at other common implementations of this sort of thing? eg. libcds.
On 28 Jul 2014 at 13:26, Gavin Lambert wrote:
On 28/07/2014 03:20, Niall Douglas wrote:
* item_type is allocated via a rebind of the supplied allocator, and looks like:
struct item_type { value_type p; size_t hash; };
I presume this will support partial construction to handle the case when value_type does not have a default constructor, or is movable-only, etc?
My partial implementation will only support move construction, as that is all I need to store shared_ptr. You're right that there is a small object vs large object optimisation opportunity here.
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome).
Isn't it by definition (admittedly reality is less ideal) that all hash values have equal probability? What happens if a real item does end up with the magic "empty" hash value? I think it'd be safer to have a separate flag for this case.
Firstly, if the magic hash value ever turns up during insert the program aborts :) It should, hopefully, be exceedingly rare on 64 bit to the point that we don't care [1]. Secondly, a separate flag costs another 8 bytes, and it ruins the cache line symmetry for AFIO where items size out to 32 bytes, which is a lovely multiple.

[1]: On libstdc++ a hash of a size_t is that size_t. As AFIO increments a size_t from 1 onwards, any magic value with a top bit set is safe. MSVC tries rather harder and for a size_t it uses the FNV-1a hash function, which is rated as second best overall (but best for numbers) at https://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed. FNV-1a looks pretty random to me.
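For concreteness, the probe of a bucket's linear table looks roughly
like this (an illustrative sketch; the names and the choice of magic
constant are not final, and the bucket itself is chosen as hash
modulo bucket count):

#include <cstddef>
#include <vector>

// Illustrative only: the reserved "this slot is empty" marker discussed above.
static const std::size_t magic_empty_hash = static_cast<std::size_t>(-1);

struct slot
{
    std::size_t hash;   // cached hash, or magic_empty_hash when unused
    // ... the value storage sits alongside the cached hash ...
};

// Linear scan of a bucket whose lock is already held: return the index of
// the first slot whose cached hash equals h (callers never pass the magic
// value), or items.size() if there is no match.
inline std::size_t find_in_bucket(const std::vector<slot> &items, std::size_t h)
{
    for(std::size_t i = 0; i < items.size(); ++i)
        if(items[i].hash == h)
            return i;
    return items.size();
}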
* To erase an item is easy: iterators store an iterator to the bucket and an offset into the linear table. Erasure involves constructing a temporary value_type, locking the bucket and performing a *move* construction out of the table into the temporary value_type, resetting the hash to empty. Destruction can then occur without holding the bucket lock. If the item just erased is the last in the bucket, the bucket's size (not capacity) is reduced by one, and that is repeated until the bucket size is the minimal possible.
If the hash is reset to empty and the lock released before destruction occurs, doesn't that mean that something else could try to concurrently construct into the same storage? After all, the item slot will now look empty to insert().
We have to be careful here about "destruction of what?". The assumption is that the value_type inside the item_type is left in a low cost or default state after being moved out of it. The lock is held during the move, after which that slot is available for use by someone else, who may claim the lock and move construct into that storage. If someone held a reference to that value, then it would suddenly change contents in a racy way, but it would still point to an unrelocated value of that type, and therefore I believe it is safe for most uses. As destruction is generally slow (free() etc.), that's why the bucket lock isn't held for it. It makes an enormous difference to concurrency anyway.
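Roughly, the erase path looks like this (an untested sketch with
placeholder names, not the actual implementation):

#include <cstddef>
#include <utility>

// Placeholder sketch of the erase path; bucket_type, its lock and its
// members are stand-ins, not the real container's types.
template<class bucket_type>
void erase_slot(bucket_type &b, std::size_t offset)
{
    b.lock.lock();
    // Move construct the value out of the table while the bucket is locked.
    typename bucket_type::value_type tmp(std::move(b.items[offset].p));
    b.items[offset].hash = b.empty_hash;   // the slot now reads as empty
    // If this was the last occupied slot, the bucket's size (not capacity)
    // would also be trimmed here, still under the lock.
    b.lock.unlock();
    // tmp is destroyed here, after the lock is released, so any slow
    // destructor work (free() etc.) does not block other users of the bucket.
}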
Similarly the bucket size could be reduced by one just as insert() puts something into that bucket slot (thinking that it was still within the valid range, so not trying to increment it itself).
The lock is held during all resizes of the bucket.
* size() under this design becomes O(bucket count) as we sum the count members of each bucket. empty() is worst case O(bucket count) when the map is empty, best case O(1) when the first bucket is not empty, average case O(bucket count/item count/2).
You could use a relaxed count at the top level to improve this, possibly.
I didn't mention the atomic count in each bucket as I thought it obvious, but it holds the number of valid entries in each bucket. This saves size() or empty() having to iterate the bucket tables.
size() and empty() become statistical rather than prescriptive in the presence of concurrent writes anyway, so a stale value is not harmful. Although I suppose conversely updating a shared counter will cause cache contention performance loss for insert/erase, which may not be worth it for the benefit of statistical functions that are less likely to even be called in a concurrent design.
Exactly. It's trivial to keep a top level atomic count, but the concurrency consequences are dire. I do really wish I knew of a better way for empty() to have constant time though :(
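For clarity, size() under this scheme amounts to something like the
following (the bucket layout and names are illustrative only):

#include <atomic>
#include <cstddef>
#include <vector>

struct bucket
{
    std::atomic<std::size_t> count;   // number of occupied slots in this bucket
    // ... lock and item table omitted ...
};

// O(bucket count) size(): sum each bucket's atomic count with relaxed loads.
// In the presence of concurrent inserts/erases this is only a snapshot.
inline std::size_t map_size(const std::vector<bucket> &buckets)
{
    std::size_t n = 0;
    for(std::size_t i = 0; i < buckets.size(); ++i)
        n += buckets[i].count.load(std::memory_order_relaxed);
    return n;
}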
Thoughts on this design before I write a final implementation are welcome. I don't plan to complete my implementation past the minimum needed for the proposed Boost.AFIO, so if someone else wishes to finish it and make it part of official Boost, that would be great.
I presume you've looked at other common implementations of this sort of thing? eg. libcds.
I did do quite a bit of prior art research, and was surprised that no one seems to have had a crack at an STL compliant concurrent_unordered_map apart from the Intel TBB and Microsoft PPL implementations, both of which don't allow concurrent erase, which is a showstopper. I don't have the time to complete my implementation past what I need now, but I was thinking it would make a great GSoC project for 2015, and Boost (and the STL) is in sore need of a decent concurrent_unordered_map with a different set of design compromises to the split ordered list designs.

The libcds implementations don't follow the STL API, and unless I missed something they get their speed from some clever memory allocation tricks which require threads to call init and cleanup routines to initialise thread local state. I didn't look deeply here, so if I am wrong please do correct me.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
On July 28, 2014 6:08:26 AM EDT, Niall Douglas wrote:
On 28 Jul 2014 at 13:26, Gavin Lambert wrote:
On 28/07/2014 03:20, Niall Douglas wrote:
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome).
Isn't it by definition (admittedly reality is less ideal) that all hash values have equal probability? What happens if a real item does end up with the magic "empty" hash value? I think it'd be safer to have a separate flag for this case.
Firstly, if the magic hash value ever turns up during insert the program aborts :) It should, hopefully, be exceedingly rare on 64 bit to the point that we don't care [1].
I'd be hard pressed to trust a library that aborts if the possible, but exceedingly rare, event occurs.
Secondly, a separate flag costs another 8 bytes, and it ruins the cache line symmetry for AFIO where items size out to 32 bytes which is a lovely multiple.
A separate flag only costs one bit. You might yet be able to rethink your current data structure to free one. The worst case, as I see it, is that you simply increment (or decrement) a computed hash value should the magic value ever arise. That is, you increase the odds of a collision so that no value ever uses the magic value.
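In code, that amounts to something like this (an untested sketch,
assuming a size_t hash and an all-ones empty marker):

#include <cstddef>
#include <limits>

// Untested sketch: reserve one hash value as the "empty slot" marker by
// nudging any real hash that collides with it, slightly increasing the
// collision rate for one value instead of aborting.
inline std::size_t adjust_hash(std::size_t h)
{
    const std::size_t empty_marker = std::numeric_limits<std::size_t>::max();
    return (h == empty_marker) ? h - 1 : h;
}

___
Rob
(Sent from my portable computation engine)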
On 28 Jul 2014 at 7:21, Rob Stewart wrote:
The worst case, as I see it, is that you simply increment (or decrement) a computed hash value should the magic value ever arise. That is, you increase the odds of a collision so that no value ever uses the magic value.
That works for me. Great idea, thank you.

Here are some benchmarks for a first attempt at the design I proposed earlier. The tests involve reserving 10,000 capacity and inserting 5,000 items to try and remove the overhead of rehashing for unordered_map. The write test is a random burst of up to 128 insertions or deletions which may be done concurrently by as many threads as the CPU has (8 hyperthreaded ones). The read test is as many finds as it can do concurrently. I tested single threaded and 8 threaded to ensure the concurrent design was not slower than a spinlocked std::unordered_map.

First, single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 23548585.078085 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 64616938.960940 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 36929611.532615 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 66784946.368549 transactions per second

As concurrent_unordered_map doesn't have to check the load factor during inserts and erases, it is quicker single threaded.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 22012179.855850 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 42627903.829194 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 166278672.013391 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 165071376.646606 transactions per second

The contention on the read spinlock reduces unordered_map performance over the single threaded case significantly. Meanwhile my concurrent_unordered_map sees a 4.5x insert/erase improvement and a 2.5x find improvement; in fact writing and reading the concurrent_unordered_map now have the same performance under maximum contention.

Someone is bound to ask how this compares to a split ordered list concurrent_unordered_map. Well, for Microsoft PPL's one using VS2013:

=== Large concurrent_unordered_map read performance ===
There are 4 threads in this CPU
Mine: Achieved 37400530.145442 transactions per second
PPL: Achieved 54401408.200488 transactions per second

So PPL is about 1.5x of mine for reads. PPL can't do concurrent erase though, so I can't test the writes.

My design is also intended for transactional GCC, and there I see the following benchmarks.

Single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 15925637.896608 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 52702345.715530 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 19225268.992226 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 18177607.124302 transactions per second

Something is bad here with my single threaded read performance, it's way low. I'm going to assume it's a bug in transactional GCC, see below for what I mean.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 13913300.037094 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 32325773.115938 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 95011747.620646 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 109675855.394830 transactions per second

My concurrent_unordered_map sees a 5x insert/erase improvement and a 6x find improvement. That surely means my code is in fact good and is parallelising nicely. Obviously these are early results, I still have much to do. It is looking like it should provide a nice wee boost for AFIO's performance for sure.

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
On 07/27/2014 05:20 PM, Niall Douglas wrote:
1. Items in a bucket were kept via an atomic forward pointer to the next item and one used cmpxchg to insert and remove items. This had really great, even superb, performance for load factors near one, but became pathological for badly distributed hash functions (I'm looking at you libstdc++). Once you rehashed the hash function to make it better distributed, performance nose dived. I was tempted to override std::hash with a decent implementation ...
I would not rule out this solution. You can just tell users to supply their own hash function to solve the distribution problem.
Final proposed design (comments welcome):
If value_type has a big sizeof(), a very good idea is to make the above a unique_ptr as otherwise you can't pack many item_type's into a cache line (AFIO's value_type is a std::pair which is 24 bytes, so item_type is 32 bytes, which works well for my purposes).
You add storage traits so users can decide for themselves. This trait would contain the stored type and to/from conversion functions.
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome). find() starts
std::hash already does a modulo operation, so you could simply do another to free a number for the magic value. For example:

auto magic = std::numeric_limits<Hash::result_type>::max();
hash %= magic;

It will change the distribution, however, as both 0 and max are mapped to 0.
On 30 Jul 2014 at 17:19, Bjorn Reese wrote:
On 07/27/2014 05:20 PM, Niall Douglas wrote:
1. Items in a bucket were kept via an atomic forward pointer to the next item and one used cmpxchg to insert and remove items. This had really great, even superb, performance for load factors near one, but became pathological for badly distributed hash functions (I'm looking at you libstdc++). Once you rehashed the hash function to make it better distributed, performance nose dived. I was tempted to override std::hash with a decent implementation ...
I would not rule out this solution. You can just tell users to supply their own hash function to solve the distribution problem.
There are additional problems: (i) a concurrent_unordered_map cannot rehash automatically because iterators would spontaneously invalidate and (ii) size(), needed for rehashing, is O(buckets) which is an obvious scalability problem. That means it must be much more tolerant to high load factors than normal. I am aiming for good performance at a load factor of up to 12. AFIO is going to push the problem of bucket sizing onto its user and do no rehashing at all :)
Final proposed design (comments welcome):
If value_type has a big sizeof(), a very good idea is to make the above a unique_ptr as otherwise you can't pack many item_type's into a cache line (AFIO's value_type is a std::pair which is 24 bytes, so item_type is 32 bytes, which works well for my purposes).
You add storage traits so users can decide for themselves. This trait would contain the stored type and to/from conversion functions.
I couldn't make rehashing exception safe without bringing in malloc for all allocations, so this has gone away. I now always use malloc.
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome). find() starts
std::hash already does a modulo operation, so you could simply do another to free a number for the magic value. For example:

auto magic = std::numeric_limits<Hash::result_type>::max();
hash %= magic;

It will change the distribution, however, as both 0 and max are mapped to 0.
One useful side effect of always using malloc is that we can now use a null pointer to indicate the slot is unused. This avoids the need for special hash values.
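Roughly, the slot layout becomes something like this (an illustrative
sketch only; whether the cached hash stays in the table or moves into
the allocated item is an open detail):

#include <cstddef>

struct item;   // the malloc()ed item holding the value_type

// Illustrative slot layout: the pointer doubles as the occupancy flag, so
// no magic hash value or separate flag bit is needed.
struct slot
{
    item *p;            // nullptr means "this slot is unused"
    std::size_t hash;   // cached hash, only meaningful when p is not nullptr
};

inline bool slot_is_empty(const slot &s)
{
    return s.p == nullptr;
}

Niall
--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/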