
Dear Boost,

Finally I have found some time to write up my experiments with using transactional GCC or Intel TSX transactional memory to make an improved concurrent_unordered_map. The quick tl;dr summary is this:

1. Transactional GCC is great iff you are happy to accept up to a 50% performance reduction in normal code execution. For a greenfield multithreaded project this is absolutely worth it due to the enormous savings in debugging effort.

2. First generation Intel TSX isn't worth using in almost any common use case. The only common case where it makes sense is almost-always-read-only data which is *always* highly contended, where speedups can approach linear scaling with execution cores, and even then the refactoring you need to do to avoid writes to any cache line touched by any other thread will make your code slower than throwing everything inside a standard spin lock. Spin locks are well optimised and allow simple implementations, plus spin locks are very fast for uncontended use. Any Intel TSX (HLE, RTM) introduces penalties for uncontended use and seems to force a pipeline stall which can reduce single threaded performance by as much as 40%.

3. Trying to implement atomics-based logic to have the CPU select non-TSX paths when uncontended and TSX paths when contended comes out at best as slightly slower than always using TSX. This, for me, simply kills TSX v1.0; I no longer consider it worth thinking about until v2.0 is made by Intel.

For reference, for those wishing to repeat my experiments, you can see the history of my four attempts at a concurrent_unordered_map at https://github.com/ned14/boost.spinlock/commits/concurrent_unordered_map2 plus benchmark results at https://ci.nedprod.com/view/Boost%20Thread-Expected-Permit/job/Memory%20Transactions%20Test%20Linux%20GCC%204.8/.

So, onto my proposed new concurrent_unordered_map design, which is NOT a split ordered list design as Intel TBB or Microsoft PPL does it. I had four criteria for my concurrent_unordered_map design, based on my needs for proposed Boost.AFIO:

1. erase() is safe. This is for me the big showstopper with split ordered list designs, and I think it probably is for most people.

2. insert(), find() and erase() all had to have single threaded performance similar to a std::unordered_map<> with a spin lock around it, i.e. fast (a sketch of that baseline follows this list).

3. Everything apart from rehashing should be able to run concurrently. This includes iteration, inserting, finding, erasing and sizing.

4. Memory transaction friendly. This means transactional GCC right now, but maybe some future Intel TSX v2.0 support later. This criterion, incidentally, *requires* C++ 11 noexcept support, or at least I can't see a way to avoid it otherwise (just to be clear, I am claiming that for the implementation to _efficiently_ use memory transactions it needs noexcept move construction for both the key and the mapped type).
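To be concrete about the baseline in criterion 2 (and the "spinlocked unordered_map" that the rejected designs below are compared against), it is nothing cleverer than a spin lock held around every operation on a std::unordered_map<>. A minimal sketch of the sort of thing I mean follows; the trivial std::atomic_flag spin lock stands in for Boost.Spinlock's spinlock<bool>, and all the names are purely illustrative:

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <unordered_map>
    #include <utility>

    // Trivial test-and-set spin lock; Boost.Spinlock's spinlock<bool> plays
    // this role in my actual benchmarks.
    struct spin_lock
    {
      std::atomic_flag flag;
      spin_lock() { flag.clear(std::memory_order_relaxed); }
      void lock() { while(flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
      void unlock() { flag.clear(std::memory_order_release); }
    };

    // The baseline: a std::unordered_map with every operation serialised
    // behind the one spin lock.
    template<class Key, class T> class spinlocked_unordered_map
    {
      mutable spin_lock _lock;
      std::unordered_map<Key, T> _map;

    public:
      bool insert(std::pair<const Key, T> v)
      {
        std::lock_guard<spin_lock> g(_lock);
        return _map.insert(std::move(v)).second;
      }
      bool find(const Key &k, T &out) const
      {
        std::lock_guard<spin_lock> g(_lock);
        auto it = _map.find(k);
        if(it == _map.end())
          return false;
        out = it->second;
        return true;
      }
      std::size_t erase(const Key &k)
      {
        std::lock_guard<spin_lock> g(_lock);
        return _map.erase(k);
      }
    };

Uncontended, each call costs one atomic read-modify-write plus a release store around the map operation, which is why this baseline is so hard to beat in the single threaded case.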
Designs tried, benchmarked and rejected:

1. Items in a bucket were kept via an atomic forward pointer to the next item, and cmpxchg was used to insert and remove items. This had really great, even superb, performance for load factors near one, but became pathological for badly distributed hash functions (I'm looking at you, libstdc++). Once you rehashed the hash function to make it better distributed, performance nose dived. I was tempted to override std::hash with a decent implementation ...

2. Items in a bucket were kept as a linear list of hashes and atomic pointers to value_type's. Each atomic pointer's bottom bit was its spin lock, to keep each item's footprint small, so you could lock each individual item during use. The real problem here was how to safely resize the table without a per-bucket lock, and that halved performance relative to a spinlocked unordered_map, as transactions really didn't deliver as expected here.

3. I created a specialisation, only for mapped types which are some std::shared_ptr<>, which let me implement a completely lock free design. Unfortunately it was also rather slow, again about half the performance of a spinlocked unordered_map. Atomic increments and decrements of the same cache line across execution cores are slow :)

Final proposed design (comments welcome):

* item_type is allocated via a rebind of the supplied allocator, and looks like:

    struct item_type
    {
      value_type p;
      size_t hash;
    };

If value_type has a big sizeof(), a very good idea is to make the above a unique_ptr, as otherwise you can't pack many item_type's into a cache line (AFIO's value_type is a std::pair<size_t, std::shared_ptr<>> which is 24 bytes, so item_type is 32 bytes, which works well for my purposes).

* bucket_type looks like this:

    struct bucket_type
    {
      spinlock<bool> lock;
      atomic<unsigned> count;  // count of used items in here
      std::vector<item_type, item_type_allocator_type> items;
      char pad[64 - sizeof(spinlock<bool>) - sizeof(atomic<unsigned>)
                  - sizeof(std::vector<item_type, item_type_allocator_type>)];
    };
    static_assert(sizeof(bucket_type) == 64, "bucket_type is not 64 bytes long!");

Note that items has its capacity directly managed, and is always expanded to powers of two. The number of items actually stored can float between zero and capacity.

* And finally, concurrent_unordered_map itself merely contains:

    template<class Key, class T, class Hash = std::hash<Key>,
             class Pred = std::equal_to<Key>,
             class Alloc = std::allocator<std::pair<const Key, T>>>
    class concurrent_unordered_map
    {
      spinlock<bool> _rehash_lock;
      hasher _hasher;
      key_equal _key_equal;
      std::vector<bucket_type> _buckets;
    };

* To insert or find an item you determine the bucket from the modulus of the hash by the number of buckets, lock the bucket, and scan the bucket's linear list for an empty slot or for the matching item as appropriate. Empty slots are recognised by a magic hash value defined to some magic constant (advice on the magic hash value least likely to turn up from std::hash is welcome). find() scans from the start of the list forwards; insert() scans from the end of the list backwards, to help with cache line contention. As insert() needs to check for key matches as well as for empty slots, the scan has to happen anyway, but if we knew key collisions could never occur we could detect that the table is full from the atomic count and skip the scan. If insert() finds neither a match nor an empty slot, it appends the item if there is spare capacity; if there is none, the capacity is doubled and the append is then done. If capacity doubling occurs, all references to items in that bucket become invalid. I intend an insert_stable() API which will refuse to modify capacity, plus a bucket_reserve() API. insert() and find() are O(average bucket size (not bucket count)). On transactional GCC, find() and the initial search for key equivalence in insert() are implemented as transactions without locking the bucket; all modifications are done with the bucket locked. This was found, through trial and error, to give the best performance. A sketch of find() under this scheme is just below.
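Purely for illustration, here is roughly what find() looks like under this scheme. It is a sketch, not the final code: it presumes the members shown above, assumes spinlock<bool> is usable with std::lock_guard, writes empty_hash for whatever magic empty constant is chosen, returns a raw pointer where the real function would return an iterator (bucket plus offset), and shows the plain locked path rather than the transactional one:

    // Sketch only: assumes the item_type, bucket_type, _buckets, _hasher and
    // _key_equal members shown above, plus a reserved empty_hash constant.
    value_type *find(const key_type &k)
    {
      size_t h = _hasher(k);
      bucket_type &b = _buckets[h % _buckets.size()];
      // On transactional GCC this read-only scan runs inside a transaction
      // rather than taking the bucket lock; the locked path is shown here.
      std::lock_guard<spinlock<bool>> g(b.lock);
      // Scan forwards. Empty slots hold empty_hash, which (by choice of the
      // magic constant) should never compare equal to a real hash.
      for(size_t offset = 0; offset < b.items.size(); ++offset)
      {
        item_type &i = b.items[offset];
        if(i.hash == h && _key_equal(i.p.first, k))
          return &i.p;
      }
      return nullptr;
    }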
* Erasing an item is easy: iterators store an iterator to the bucket plus an offset into its linear table. Erasure involves constructing a temporary value_type, locking the bucket and performing a *move* construction out of the table into the temporary value_type, resetting the hash to the empty value. Destruction of the erased value can then occur without holding the bucket lock. If the item just erased is the last in the bucket, the bucket's size (not capacity) is reduced by one, and that is repeated until the bucket size is the minimum possible. I currently violate the STL allocator requirements here because I don't use the allocator for the temporary value_type, to avoid a usually pointless malloc; any suggestions on how to fix this would be welcome. You'll note that erasure here is safe, as an item's storage never moves, though if you concurrently use and erase the same item you'll get a race on a destroyed item, which is as expected. erase() is O(1). (A rough sketch of this is in the P.S. at the end of this mail.)

* size() under this design becomes O(bucket count), as we sum the count members of each bucket. empty() is worst case O(bucket count) when the map is empty, best case O(1) when the first bucket is not empty, and average case O(bucket count / item count / 2).

* Iteration under this design is, oddly, inversely dependent on load factor: the higher the load factor, the better the performance. This is because we don't bother with the usual forward pointer per item, as those are expensive for concurrent use, and doubly linked pointers are rather a lot of overhead per item. Iteration is therefore a matter of walking each bucket's linear table and, once that is done, searching for the next non-empty bucket, which in a sparsely occupied bucket list is very lumpy as cache lines are pulled into the cache one by one.

* rehash() under this design firstly locks the special rehash lock, then locks every bucket in the list. It then constructs a new empty bucket list and locks every lock in that. It then swaps the bucket lists, moves all the items from the old buckets to the new buckets, and finally unlocks all the buckets and the special rehash lock. This function is hard to write completely race free under all CPU load circumstances without burdening the main API implementations. I may work around this by keeping around the old, just emptied bucket list, permanently marked to users as "go reload the bucket list", and then documenting that you must not rehash too frequently or else you will race your own code. Note that rehashing under this design is completely manual. If it were not, insertion, erasure and item referencing could not be safe. This implies that if you rehash, it is up to you not to be using items by reference at the time. Rehashing is also by definition not concurrent; indeed it is the only function in the design which is not capable of concurrency.

Thoughts on this design before I write a final implementation are welcome. I don't plan to take my implementation past the minimum needed for proposed Boost.AFIO, so if someone else wishes to finish it and make it part of official Boost, that would be great.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
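P.S. Since criterion 1 is the whole point of the design, here is a rough sketch of the erase() mechanics described above, in case it helps review. Again illustrative only: it takes the bucket and offset an iterator would carry, writes empty_hash for the reserved empty constant, assumes spinlock<bool> works with std::lock_guard, and simplifies the allocator handling and the trailing-slot trimming policy:

    // Sketch only: the value is moved out under the bucket lock, the slot is
    // marked empty, and the (possibly expensive) destruction of the erased
    // value then runs without the lock held.
    void erase(bucket_type &b, size_t offset)
    {
      alignas(value_type) char buf[sizeof(value_type)];
      value_type *old;
      {
        std::lock_guard<spinlock<bool>> g(b.lock);
        item_type &i = b.items[offset];
        old = new(buf) value_type(std::move(i.p)); // move out; slot stays constructed
        i.hash = empty_hash;                       // mark the slot as empty
        b.count.fetch_sub(1);
        // Trim trailing empty slots so the bucket's size (not its capacity)
        // shrinks back to the minimum possible.
        while(!b.items.empty() && b.items.back().hash == empty_hash)
          b.items.pop_back();
      }
      old->~value_type();                          // destroy outside the lock
    }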