Request for feedback on design of concurrent_unordered_map, plus notes on use of memory transactions

Dear Boost,

Finally I have found some time to write up my experimentations with using transactional GCC or Intel TSX TM to make an improved concurrent_unordered_map. The quick tl;dr summary is this:

1. Transactional GCC is great iff you are happy accepting up to a 50% performance reduction in normal code execution. For a greenfield multithreaded project this is absolutely worth it due to the enormous savings in debug effort.

2. First generation Intel TSX isn't worth using in almost any common use case. The only common use case where it makes sense is almost-always-read-only data which is *always* highly contended, where speedups can approach linear in the number of execution cores; even then, the refactoring you need to do to avoid writes to any cache line touched by any other thread will make your code slower than throwing everything inside a standard spin lock. Spin locks are well optimised and allow simple implementations, plus spin locks are very fast for uncontended use. Any Intel TSX (HLE, RTM) introduces penalties for uncontended use and seems to force a pipeline stall which can reduce single threaded performance by as much as 40%.

3. Trying to implement atomics-based logic to have the CPU select non-TSX when uncontended and TSX when contended comes out, at best, slightly slower than always using TSX.

This, for me, simply kills TSX v1.0; I no longer consider it worth thinking about until v2.0 is made by Intel.

For reference, for those wishing to repeat my experimentations you can see the history of my four attempts at a concurrent_unordered_map at https://github.com/ned14/boost.spinlock/commits/concurrent_unordered_map2 plus benchmark results at https://ci.nedprod.com/view/Boost%20Thread-Expected-Permit/job/Memory%20Transactions%20Test%20Linux%20GCC%204.8/.

So, onto my proposed new concurrent_unordered_map design, which is NOT a split ordered list design as Intel TBB or Microsoft PPL does it. I had four criteria for my concurrent_unordered_map design, based on my needs for proposed Boost.AFIO:

1. erase() is safe. This is for me the big showstopper with split ordered list designs, and I think it probably is for most people.

2. insert(), find() and erase() all had to have single threaded performance similar to a std::unordered_map<> with a spin lock around it, i.e. fast.

3. Everything apart from rehashing should be able to be concurrent. This includes iteration, inserting, finding, erasing and sizing.

4. Memory transaction friendly. This means transactional GCC right now, but maybe some future Intel TSX v2.0 support later. This criterion, incidentally, *requires* C++ 11 noexcept support, or at least I can't see a way to avoid it otherwise (just to be clear, I am claiming that for the implementation to _efficiently_ use memory transactions it needs noexcept move construction for both key and mapped type).

Designs tried, benchmarked and rejected:

1. Items in a bucket were kept via an atomic forward pointer to the next item, and one used cmpxchg to insert and remove items. This had really great, even superb, performance for load factors near one, but became pathological for badly distributed hash functions (I'm looking at you, libstdc++). Once you rehashed the hash function to make it better distributed, performance nose dived. I was tempted to override std::hash with a decent implementation ...

2. Items in a bucket were kept as a linear list of hashes and atomic pointers to value_type's.
Each atomic pointer's bottom bit was its spin lock, to keep each item's footprint small, so you could lock each individual item during usage. The problem here really was how to safely resize the table without a per-bucket lock, and that halved performance over a spinlocked unordered_map, as transactions really didn't deliver as expected here.

3. I created a specialisation, only for mapped types of some std::shared_ptr<>, which let me implement a completely lock free design. Unfortunately it was also rather slow, about half the performance of a spinlocked unordered_map. Atomic increments and decrements of the same cache line across execution cores are slow :)

Final proposed design (comments welcome):

* item_type is allocated via a rebind of the supplied allocator, and looks like:

    struct item_type
    {
      value_type p;
      size_t hash;
    };

If value_type has a big sizeof(), a very good idea is to make the above a unique_ptr, as otherwise you can't pack many item_type's into a cache line (AFIO's value_type is a std::pair<size_t, std::shared_ptr<>> which is 24 bytes, so item_type is 32 bytes, which works well for my purposes).

* bucket_type looks like this:

    struct bucket_type
    {
      spinlock<bool> lock;
      atomic<unsigned> count;  // count is the number of used items in there
      std::vector<item_type, item_type_allocator_type> items;
      char pad[64 - sizeof(spinlock<bool>) - sizeof(atomic<unsigned>)
                  - sizeof(std::vector<item_type, item_type_allocator_type>)];
    };
    static_assert(sizeof(bucket_type) == 64, "bucket_type is not 64 bytes long!");

Note that items has its capacity directly managed, and is always expanded to powers of two. The number of items actually stored can float between zero and capacity.

* And finally, concurrent_unordered_map itself merely contains:

    template<class Key, class T, class Hash = std::hash<Key>,
             class Pred = std::equal_to<Key>,
             class Alloc = std::allocator<std::pair<const Key, T>>>
    class concurrent_unordered_map
    {
      spinlock<bool> _rehash_lock;
      hasher _hasher;
      key_equal _key_equal;
      std::vector<bucket_type> _buckets;
    };

* To insert or find an item, you determine the bucket from the modulus of the hash by the number of buckets, lock the bucket, and scan the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome). find() starts from the start of the list and works forwards; insert() starts from the end of the list and works backwards, to help with cache line contention. As insert() needs to check for key matches as well as for empty slots, the scan needs to happen anyway, but if we knew key collisions could never happen we could detect when the table is full from the atomic count and skip the table scan. If insert() finds neither a match nor an empty slot, and there is capacity, it appends the item; if there is no capacity, a capacity doubling is done and then an append. If capacity doubling occurs, all references to items in that bucket become invalid. I intend an insert_stable() API which will refuse to modify capacity, plus a bucket_reserve() API. insert() and find() are O(average bucket size (not count)).

On transactional GCC, find() and the initial search for key equivalence in insert() are implemented as transactions without locking the bucket. All modifications are done by locking the bucket. This was found, by trial and error, to have the best performance.
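To make the scan concrete, the locked find path looks roughly like this (a sketch only, not the final code: the member names, the empty-hash sentinel and the way the result is returned are all illustrative):

    // Sketch: find() under the bucket lock.  On transactional GCC the
    // lock is replaced by a transaction around the same scan.
    item_type *find_item(const key_type &k)
    {
      size_t h = _hasher(k);
      bucket_type &b = _buckets[h % _buckets.size()];
      std::lock_guard<decltype(b.lock)> g(b.lock);
      for (item_type &i : b.items)              // front-to-back scan
        if (i.hash == h && _key_equal(i.p.first, k))
          return &i;                            // found it
      return nullptr;                           // not present
    }

insert() does the same scan from the back, and on a miss either reuses an empty slot or appends.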
* To erase an item is easy: iterators store an iterator to the bucket and an offset into the linear table. Erasure involves constructing a temporary value_type, locking the bucket and performing a *move* construction out of the table into the temporary value_type, resetting the hash to empty. Destruction can then occur without holding the bucket lock. If the item just erased is the last in the bucket, the bucket's size (not capacity) is reduced by one, and that is repeated until the bucket size is the minimum possible. I currently violate the STL allocator requirements here because I don't use the allocator for the temporary value_type, to avoid a usually pointless malloc; any suggestions on how to fix this would be welcome. You'll note that erasure here is safe as the item's storage never moves, though if you concurrently use and erase the same item you'll get a race onto a destroyed item, which is as expected. erase() is O(1).

* size() under this design becomes O(bucket count), as we sum the count members of each bucket. empty() is worst case O(bucket count) when the map is empty, best case O(1) when the first bucket is not empty, and average case O(bucket count / item count / 2).

* Iteration under this design is queerly inversely dependent on load factor: the higher the load factor, the better the performance. This is because we don't bother with the usual forward pointer per item, as those are expensive for concurrent use and doubly linked pointers are rather a lot of overhead per item. Iteration is therefore a case of walking each bucket's linear table and, once that is done, searching for the next non-empty bucket, which in a sparsely occupied bucket list is very lumpy as cache lines are pulled one by one into the cache.

* rehash() under this design firstly locks the special rehash lock, then locks every bucket in the list. It then constructs a new empty bucket list and locks every lock in that. It then swaps the bucket lists, and moves all the items from the old buckets to the new buckets. It finally unlocks all the buckets, and the special rehash lock. This function is hard to write completely race free under all CPU load circumstances without burdening the main API implementations. I may use the concept of keeping around the old, just-emptied bucket list, permanently marked to users as "go reload the bucket list", to work around this, and then document that you must not rehash too frequently or else you will race your code.

Note that rehashing under this design is completely manual. If it were not, insertion, erasure and item referencing could not be safe. This implies that if you rehash, it is up to you not to be using items by reference at the time. Rehashing is also by definition not concurrent; indeed it is the only function in the design which is not capable of concurrency.

Thoughts on this design before I write a final implementation are welcome. I don't plan to complete my implementation past the minimum needed for proposed Boost.AFIO, so if someone else wishes to finish it and make it part of official Boost, that would be great.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/

On 28/07/2014 03:20, Niall Douglas wrote:
* item_type is allocated via a rebind of the supplied allocator, and looks like:
struct item_type { value_type p; size_t hash; };
I presume this will support partial construction to handle the case when value_type does not have a default constructor, or is movable-only, etc?
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome).
Isn't it by definition (admittedly reality is less ideal) that all hash values have equal probability? What happens if a real item does end up with the magic "empty" hash value? I think it'd be safer to have a separate flag for this case.
* To erase an item is easy: iterators store an iterator to the bucket and an offset into the linear table. Erasure involves constructing a temporary value_type, locking the bucket and performing a *move* construction out of the table into the temporary value_type, resetting the hash to empty. Destruction can then occur without holding the bucket lock. If the item just erased is the last in the bucket, the bucket's size (not capacity) is reduced by one, and that is repeated until the bucket size is the minimal possible.
If the hash is reset to empty and the lock released before destruction occurs, doesn't that mean that something else could try to concurrently construct into the same storage? After all, the item slot will now look empty to insert(). Similarly the bucket size could be reduced by one just as insert() puts something into that bucket slot (thinking that it was still within the valid range, so not trying to increment it itself).
* size() under this design becomes O(bucket count) as we sum the count members of each bucket. empty() is worst case O(bucket count) when the map is empty, best case O(1) when the first bucket is not empty, average case O(bucket count/item count/2).
You could use a relaxed count at the top level to improve this, possibly. size() and empty() become statistical rather than prescriptive in the presence of concurrent writes anyway, so a stale value is not harmful. Although I suppose conversely updating a shared counter will cause cache contention performance loss for insert/erase, which may not be worth it for the benefit of statistical functions that are less likely to even be called in a concurrent design.
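Something along these lines, purely as a sketch of the idea (not tested, names invented):

    std::atomic<std::size_t> approx_size{0};   // one counter at the map level

    // insert: approx_size.fetch_add(1, std::memory_order_relaxed);
    // erase:  approx_size.fetch_sub(1, std::memory_order_relaxed);
    // size(): return approx_size.load(std::memory_order_relaxed);  // statistical only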
Thoughts on this design before I write a final implementation are welcome. I don't plan to complete my implementation past the minimum needed for proposed Boost.AFIO, so if someone else wishes to finish it that and make it part of official Boost that would be great.
I presume you've looked at other common implementations of this sort of thing? eg. libcds.

On 28 Jul 2014 at 13:26, Gavin Lambert wrote:
On 28/07/2014 03:20, Niall Douglas wrote:
* item_type is allocated via a rebind of the supplied allocator, and looks like:
struct item_type { value_type p; size_t hash; };
I presume this will support partial construction to handle the case when value_type does not have a default constructor, or is movable-only, etc?
My partial implementation will only support move construction, as that is all I need to store shared_ptr. You're right that there is a small object vs large object optimisation opportunity here.
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome).
Isn't it by definition (admittedly reality is less ideal) that all hash values have equal probability? What happens if a real item does end up with the magic "empty" hash value? I think it'd be safer to have a separate flag for this case.
Firstly, if the magic hash value ever turns up during insert the program aborts :) It should, hopefully, be exceedingly rare on 64 bit, to the point that we don't care [1]. Secondly, a separate flag costs another 8 bytes, and it ruins the cache line symmetry for AFIO, where items size out to 32 bytes which is a lovely multiple.

[1]: On libstdc++ the hash of a size_t is that size_t. As AFIO increments a size_t from 1 onwards, any magic value with a top bit set is safe. MSVC tries rather harder, and for a size_t it uses the FNV-1a hash function, which is rated as second best overall (but best for numbers) at https://programmers.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed. FNV-1a looks pretty random to me.
* To erase an item is easy: iterators store an iterator to the bucket and an offset into the linear table. Erasure involves constructing a temporary value_type, locking the bucket and performing a *move* construction out of the table into the temporary value_type, resetting the hash to empty. Destruction can then occur without holding the bucket lock. If the item just erased is the last in the bucket, the bucket's size (not capacity) is reduced by one, and that is repeated until the bucket size is the minimal possible.
If the hash is reset to empty and the lock released before destruction occurs, doesn't that mean that something else could try to concurrently construct into the same storage? After all, the item slot will now look empty to insert().
We have to be careful here about "destruction of what?". The assumption is that the value_type inside the item_type is left in a low cost or default state after being moved out of it, so the lock is held during the move, after which that slot will be available for use by someone else who may claim the lock and move construct into that storage. If someone held a reference to that value, then it would suddenly change contents in a racy way. But it would still point to an unrelocated value of that type, and therefore I believe is safe for most uses. As destruction is generally slow (free() etc), that's why the bucket lock isn't held for it. It makes an enormous difference to concurrency anyway.
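In rough code, the erase sequence is something like this (a sketch with illustrative names; the iterator internals and the placement-new temporary are not the actual implementation, and it relies on value_type having a noexcept move constructor):

    void erase(iterator it)
    {
      bucket_type &b = *it.bucket;                   // illustrative iterator members
      alignas(value_type) unsigned char raw[sizeof(value_type)];
      value_type *tmp;
      {
        std::lock_guard<decltype(b.lock)> g(b.lock);
        item_type &i = b.items[it.offset];
        tmp = new(raw) value_type(std::move(i.p));   // move out under the lock
        i.hash = empty_hash;                         // slot now reads as empty
        b.count.fetch_sub(1, std::memory_order_relaxed);
        // ... trim trailing empty slots if i was the last used item ...
      }
      tmp->~value_type();                            // slow destruction outside the lock
    }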
Similarly the bucket size could be reduced by one just as insert() puts something into that bucket slot (thinking that it was still within the valid range, so not trying to increment it itself).
The lock is held during all resizes of the bucket.
* size() under this design becomes O(bucket count) as we sum the count members of each bucket. empty() is worst case O(bucket count) when the map is empty, best case O(1) when the first bucket is not empty, average case O(bucket count/item count/2).
You could use a relaxed count at the top level to improve this, possibly.
I didn't mention the atomic count in each bucket as I thought it obvious, but it holds the number of valid entries in each bucket. This saves size() or empty() having to iterate the bucket tables.
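So size() is simply a sum over those per-bucket counts, something like this (sketch):

    size_t size() const noexcept
    {
      size_t n = 0;
      for (const bucket_type &b : _buckets)
        n += b.count.load(std::memory_order_relaxed);
      return n;   // O(bucket count), never touches the items themselves
    }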
size() and empty() become statistical rather than prescriptive in the presence of concurrent writes anyway, so a stale value is not harmful. Although I suppose conversely updating a shared counter will cause cache contention performance loss for insert/erase, which may not be worth it for the benefit of statistical functions that are less likely to even be called in a concurrent design.
Exactly. It's trivial to keep a top level atomic count, but the concurrency consequences are dire. I do really wish I knew of a better way for empty() to have constant time though :(
Thoughts on this design before I write a final implementation are welcome. I don't plan to complete my implementation past the minimum needed for proposed Boost.AFIO, so if someone else wishes to finish it that and make it part of official Boost that would be great.
I presume you've looked at other common implementations of this sort of thing? eg. libcds.
I did do quite a bit of prior art research, and was surprised that no one seems to have had a crack at an STL-compliant concurrent_unordered_map apart from the Intel TBB and Microsoft PPL implementations, both of which don't allow concurrent erase, which is a showstopper. I don't have the time to complete my implementation past what I need now, but I was thinking it would make a great GSoC project for 2015, and Boost (and the STL) is in sore need of a decent concurrent_unordered_map with a different set of design compromises to the split ordered list designs.

The libcds implementations don't follow the STL API, and unless I missed something they get their speed from some clever memory allocation tricks which require threads to call init and cleanup routines to initialise thread local state. I didn't look deeply here, so if I am wrong please do correct me.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/

On July 28, 2014 6:08:26 AM EDT, Niall Douglas <s_sourceforge@nedprod.com> wrote:
On 28 Jul 2014 at 13:26, Gavin Lambert wrote:
On 28/07/2014 03:20, Niall Douglas wrote:
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome).
Isn't it by definition (admittedly reality is less ideal) that all hash values have equal probability? What happens if a real item does end up with the magic "empty" hash value? I think it'd be safer to have a separate flag for this case.
Firstly, if the magic hash value ever turns up during insert the program aborts :) It should, hopefully, be exceedingly rare on 64 bit to the point that we don't care [1].
I'd be hard pressed to trust a library that aborts when something possible, but exceedingly rare, occurs.
Secondly, a separate flag costs another 8 bytes, and it ruins the cache line symmetry for AFIO where items size out to 32 bytes which is a lovely multiple.
A separate flag only costs one bit. You might yet be able to rethink your current data structure to free one. The worst case, as I see it, is that you simply increment (or decrement) a computed hash value should the magic value ever arise. That is, you increase the odds of a collision so that no value ever uses the magic value.

___
Rob
(Sent from my portable computation engine)
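Concretely, the suggestion amounts to something like this (a sketch only, with illustrative names):

    size_t h = hasher(k);
    if (h == empty_hash)   // the magic "slot unused" value
      ++h;                 // accept a marginally higher collision rate instead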

On 28 Jul 2014 at 7:21, Rob Stewart wrote:
The worst case, as I see it, is that you simply increment (or decrement) a computed hash value should the magic value ever arise. That is, you increase the odds of a collision so that no value ever uses the magic value.
That works for me. Great idea, thank you.

Here are some benchmarks for a first attempt at the design I proposed earlier. The tests involve reserving 10,000 capacity and inserting 5,000 items, to try to remove the overhead of rehashing for unordered_map. The write test is a random burst of up to 128 insertions or deletions which may be done concurrently by as many threads as the CPU has (8 hyperthreaded ones). The read test is as many finds as it can do concurrently. I tested single threaded and 8 threaded to ensure the concurrent design was not slower than a spinlocked std::unordered_map.

First, single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 23548585.078085 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 64616938.960940 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 36929611.532615 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 66784946.368549 transactions per second

As concurrent_unordered_map doesn't have to check the load factor during inserts and erases, it is quicker single threaded.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 22012179.855850 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 42627903.829194 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 166278672.013391 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 165071376.646606 transactions per second

The contention on the read spinlock reduces unordered_map performance over the single threaded case significantly. Meanwhile my concurrent_unordered_map sees a 4.5x insert/erase improvement and a 2.5x find improvement; in fact, writing and reading the concurrent_unordered_map now have the same performance under maximum contention.

Someone is bound to ask how this compares to a split ordered list concurrent_unordered_map. Well, for Microsoft PPL's one using VS2013:

=== Large concurrent_unordered_map read performance ===
There are 4 threads in this CPU
Mine: Achieved 37400530.145442 transactions per second
PPL:  Achieved 54401408.200488 transactions per second

So PPL is about 1.5x of mine for reads. PPL can't do concurrent erase though, so I can't test the writes.

My design is also intended for transactional GCC, and there I see the following benchmarks.

Single threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 15925637.896608 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 52702345.715530 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 1 threads in this CPU
1. Achieved 19225268.992226 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 1 threads in this CPU
1. Achieved 18177607.124302 transactions per second

Something is bad here with my single threaded read performance; it's way low. I'm going to assume it's a bug in transactional GCC, see below for what I mean.

Eight threaded:

=== Large unordered_map spinlock write performance ===
1. Achieved 13913300.037094 transactions per second
=== Large unordered_map spinlock read performance ===
1. Achieved 32325773.115938 transactions per second
=== Large concurrent_unordered_map write performance ===
There are 8 threads in this CPU
1. Achieved 95011747.620646 transactions per second
=== Large concurrent_unordered_map read performance ===
There are 8 threads in this CPU
1. Achieved 109675855.394830 transactions per second

My concurrent_unordered_map sees a 5x insert/erase improvement and a 6x find improvement. That surely means my code is in fact good and is parallelising nicely.

Obviously these are early results, and I still have much to do. It is looking like it should provide a nice wee boost to AFIO's performance for sure.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/

On 07/27/2014 05:20 PM, Niall Douglas wrote:
1. Items in a bucket were kept via an atomic forward pointer to the next item and one used cmpxchg to insert and remove items. This had really great, even superb, performance for load factors near one, but became pathological for badly distributed hash functions (I'm looking at you libstdc++). Once you rehashed the hash function to make it better distributed, performance nose dived. I was tempted to override std::hash with a decent implementation ...
I would not rule out this solution. You can just tell users to supply their own hash function to solve the distribution problem.
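For example, a user could supply a mixing hash along these lines (just a sketch; the constants are the 64-bit finaliser from MurmurHash3):

    struct mixed_hash
    {
      size_t operator()(size_t x) const noexcept
      {
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33; x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
      }
    };
    // concurrent_unordered_map<size_t, T, mixed_hash> m;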
Final proposed design (comments welcome):
If value_type has a big sizeof(), a very good idea is to make the above a unique_ptr as otherwise you can't pack many item_type's into a cache line (AFIO's value_type is a std::pair<size_t, std::shared_ptr<>> which is 24 bytes, so item_type is 32 bytes which works well for my purposes).
You add storage traits so users can decide for themselves. This trait would contain the stored type and to/from conversion functions.
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome). find() starts
std::hash already does a modulo operation, so you could simply do another to free a number for the magic value. For example:

    auto magic = std::numeric_limits<Hash::result_type>::max();
    hash %= magic;

It will change the distribution, however, as both 0 and max are mapped to 0.

On 30 Jul 2014 at 17:19, Bjorn Reese wrote:
On 07/27/2014 05:20 PM, Niall Douglas wrote:
1. Items in a bucket were kept via an atomic forward pointer to the next item and one used cmpxchg to insert and remove items. This had really great, even superb, performance for load factors near one, but became pathological for badly distributed hash functions (I'm looking at you libstdc++). Once you rehashed the hash function to make it better distributed, performance nose dived. I was tempted to override std::hash with a decent implementation ...
I would not rule out this solution. You can just tell users to supply their own hash function to solve the distribution problem.
There are additional problems: (i) a concurrent_unordered_map cannot rehash automatically because iterators would spontaneously invalidate and (ii) size(), needed for rehashing, is O(buckets) which is an obvious scalability problem. That means it must be much more tolerant to high load factors than normal. I am aiming for good performance at a load factor of up to 12. AFIO is going to push the problem of bucket sizing onto its user and do no rehashing at all :)
Final proposed design (comments welcome):
If value_type has a big sizeof(), a very good idea is to make the above a unique_ptr as otherwise you can't pack many item_type's into a cache line (AFIO's value_type is a std::pair<size_t, std::shared_ptr<>> which is 24 bytes, so item_type is 32 bytes which works well for my purposes).
You add storage traits so users can decide for themselves. This trait would contain the stored type and to/from conversion functions.
I couldn't make rehashing exception safe without bringing in malloc for all allocations, so this has gone away. I now always use malloc.
* To insert or find an item involves determining the bucket from the modulus of the hash by the number of buckets, locking the bucket and scanning the linear list for an empty item or the right item as appropriate. We know empty items from a magic hash value which is defined to some magic constant (advice on the best magic hash value least likely to turn up from std::hash is welcome). find() starts
std::hash already does a modulo operation, so you could simply do another to free a number for the magic value. For example,
auto magic = std::numeric_limits<Hash::result_type>::max(); hash %= magic;
It will change the distribution, however, as both 0 and max are mapped to 0.
One useful side effect of always using malloc is that we can now use a null pointer to indicate the slot is unused. This avoids the need for special hash values.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
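For illustration, the item layout this implies is roughly the following (an assumption for the revised design, not the actual code):

    struct item_type
    {
      value_type *p;   // storage obtained from malloc; nullptr means the slot is unused
      size_t hash;     // only meaningful when p != nullptr
    };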