
Dear Boost,

Finally I have found some time to write up my experiments with using transactional GCC or Intel TSX transactional memory to make an improved concurrent_unordered_map. The quick tl;dr summary is this:

1. Transactional GCC is great iff you are happy to accept up to a 50% performance reduction in normal code execution. For a greenfield multithreaded project this is absolutely worth it due to the enormous savings in debugging effort.

2. First generation Intel TSX isn't worth using in almost any common use case. The only common case where it makes sense is almost-always-read-only data which is *always* highly contended, where speedups can approach linear scaling with execution cores, and even then the refactoring you need to do to avoid writes to any cache line touched by any other thread will make your code slower than throwing everything inside a standard spin lock. Spin locks are well optimised and allow simple implementations, plus spin locks are very fast for uncontended use. Any Intel TSX (HLE, RTM) introduces penalties for uncontended use and seems to force a pipeline stall which can reduce single threaded performance by as much as 40%.

3. Trying to implement atomics-based logic to have the CPU select non-TSX paths when uncontended and TSX paths when contended comes out at best as slightly slower than always using TSX. This, for me, simply kills TSX v1.0; I no longer consider it worth thinking about until v2.0 is made by Intel.

For reference, for those wishing to repeat my experiments, you can see the history of my four attempts at a concurrent_unordered_map at https://github.com/ned14/boost.spinlock/commits/concurrent_unordered_map2 plus benchmark results at https://ci.nedprod.com/view/Boost%20Thread-Expected-Permit/job/Memory%20Transactions%20Test%20Linux%20GCC%204.8/.

So, onto my proposed new concurrent_unordered_map design, which is NOT a split ordered list design as Intel TBB or Microsoft PPL does it. I had four criteria for my concurrent_unordered_map design, based on my needs for proposed Boost.AFIO:

1. erase() is safe. This is for me the big showstopper with split ordered list designs, and I think it probably is for most people.

2. insert(), find() and erase() all had to have single threaded performance similar to a std::unordered_map<> with a spin lock around it, i.e. fast (a sketch of that baseline follows this list).

3. Everything apart from rehashing should be able to run concurrently. This includes iteration, inserting, finding, erasing and sizing.

4. Memory transaction friendly. This means transactional GCC right now, but maybe some future Intel TSX v2.0 support later. This criterion, incidentally, *requires* C++ 11 noexcept support, or at least I can't see a way to avoid it otherwise (just to be clear, I am claiming that for the implementation to _efficiently_ use memory transactions it needs noexcept move construction for both the key and the mapped type).
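To be concrete about the baseline in criterion 2 (and the "spinlocked unordered_map" that the rejected designs below are compared against), it is nothing cleverer than a spin lock held around every operation on a std::unordered_map<>. A minimal sketch of the sort of thing I mean follows; the trivial std::atomic_flag spin lock stands in for Boost.Spinlock's spinlock<bool>, and all the names are purely illustrative:

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <unordered_map>
    #include <utility>

    // Trivial test-and-set spin lock; Boost.Spinlock's spinlock<bool> plays
    // this role in my actual benchmarks.
    struct spin_lock
    {
      std::atomic_flag flag;
      spin_lock() { flag.clear(std::memory_order_relaxed); }
      void lock() { while(flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
      void unlock() { flag.clear(std::memory_order_release); }
    };

    // The baseline: a std::unordered_map with every operation serialised
    // behind the one spin lock.
    template<class Key, class T> class spinlocked_unordered_map
    {
      mutable spin_lock _lock;
      std::unordered_map<Key, T> _map;

    public:
      bool insert(std::pair<const Key, T> v)
      {
        std::lock_guard<spin_lock> g(_lock);
        return _map.insert(std::move(v)).second;
      }
      bool find(const Key &k, T &out) const
      {
        std::lock_guard<spin_lock> g(_lock);
        auto it = _map.find(k);
        if(it == _map.end())
          return false;
        out = it->second;
        return true;
      }
      std::size_t erase(const Key &k)
      {
        std::lock_guard<spin_lock> g(_lock);
        return _map.erase(k);
      }
    };

Uncontended, each call costs one atomic read-modify-write plus a release store around the map operation, which is why this baseline is so hard to beat in the single threaded case.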
Designs tried, benchmarked and rejected:

1. Items in a bucket were kept via an atomic forward pointer to the next item, and cmpxchg was used to insert and remove items. This had really great, even superb, performance for load factors near one, but became pathological for badly distributed hash functions (I'm looking at you, libstdc++). Once you rehashed the hash function to make it better distributed, performance nose dived. I was tempted to override std::hash with a decent implementation ...

2. Items in a bucket were kept as a linear list of hashes and atomic pointers to value_type's. Each atomic pointer's bottom bit was its spin lock, to keep each item's footprint small, so you could lock each individual item during use. The real problem here was how to safely resize the table without a per-bucket lock, and that halved performance relative to a spinlocked unordered_map, as transactions really didn't deliver as expected here.

3. I created a specialisation, only for mapped types which are some std::shared_ptr<>, which let me implement a completely lock free design. Unfortunately it was also rather slow, again about half the performance of a spinlocked unordered_map. Atomic increments and decrements of the same cache line across execution cores are slow :)

Final proposed design (comments welcome):

* item_type is allocated via a rebind of the supplied allocator, and looks like:

    struct item_type
    {
      value_type p;
      size_t hash;
    };

If value_type has a big sizeof(), a very good idea is to make the above a unique_ptr, as otherwise you can't pack many item_type's into a cache line (AFIO's value_type is a std::pair<size_t, std::shared_ptr<>> which is 24 bytes, so item_type is 32 bytes, which works well for my purposes).

* bucket_type looks like this:

    struct bucket_type
    {
      spinlock<bool> lock;
      atomic<unsigned> count;  // count of used items in here
      std::vector<item_type, item_type_allocator_type> items;
      char pad[64 - sizeof(spinlock<bool>) - sizeof(atomic<unsigned>)
                  - sizeof(std::vector<item_type, item_type_allocator_type>)];
    };
    static_assert(sizeof(bucket_type) == 64, "bucket_type is not 64 bytes long!");

Note that items has its capacity directly managed, and is always expanded to powers of two. The number of items actually stored can float between zero and capacity.

* And finally, concurrent_unordered_map itself merely contains:

    template<class Key, class T, class Hash = std::hash<Key>,
             class Pred = std::equal_to<Key>,
             class Alloc = std::allocator<std::pair<const Key, T>>>
    class concurrent_unordered_map
    {
      spinlock<bool> _rehash_lock;
      hasher _hasher;
      key_equal _key_equal;
      std::vector<bucket_type> _buckets;
    };

* To insert or find an item you determine the bucket from the modulus of the hash by the number of buckets, lock the bucket, and scan the bucket's linear list for an empty slot or for the matching item as appropriate. Empty slots are recognised by a magic hash value defined to some magic constant (advice on the magic hash value least likely to turn up from std::hash is welcome). find() scans from the start of the list forwards; insert() scans from the end of the list backwards, to help with cache line contention. As insert() needs to check for key matches as well as for empty slots, the scan has to happen anyway, but if we knew key collisions could never occur we could detect that the table is full from the atomic count and skip the scan. If insert() finds neither a match nor an empty slot, it appends the item if there is spare capacity; if there is none, the capacity is doubled and the append is then done. If capacity doubling occurs, all references to items in that bucket become invalid. I intend an insert_stable() API which will refuse to modify capacity, plus a bucket_reserve() API. insert() and find() are O(average bucket size (not bucket count)). On transactional GCC, find() and the initial search for key equivalence in insert() are implemented as transactions without locking the bucket; all modifications are done with the bucket locked. This was found, through trial and error, to give the best performance. A sketch of find() under this scheme is just below.
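Purely for illustration, here is roughly what find() looks like under this scheme. It is a sketch, not the final code: it presumes the members shown above, assumes spinlock<bool> is usable with std::lock_guard, writes empty_hash for whatever magic empty constant is chosen, returns a raw pointer where the real function would return an iterator (bucket plus offset), and shows the plain locked path rather than the transactional one:

    // Sketch only: assumes the item_type, bucket_type, _buckets, _hasher and
    // _key_equal members shown above, plus a reserved empty_hash constant.
    value_type *find(const key_type &k)
    {
      size_t h = _hasher(k);
      bucket_type &b = _buckets[h % _buckets.size()];
      // On transactional GCC this read-only scan runs inside a transaction
      // rather than taking the bucket lock; the locked path is shown here.
      std::lock_guard<spinlock<bool>> g(b.lock);
      // Scan forwards. Empty slots hold empty_hash, which (by choice of the
      // magic constant) should never compare equal to a real hash.
      for(size_t offset = 0; offset < b.items.size(); ++offset)
      {
        item_type &i = b.items[offset];
        if(i.hash == h && _key_equal(i.p.first, k))
          return &i.p;
      }
      return nullptr;
    }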
* Erasing an item is easy: iterators store an iterator to the bucket plus an offset into its linear table. Erasure involves constructing a temporary value_type, locking the bucket and performing a *move* construction out of the table into the temporary value_type, resetting the hash to the empty value. Destruction of the erased value can then occur without holding the bucket lock. If the item just erased is the last in the bucket, the bucket's size (not capacity) is reduced by one, and that is repeated until the bucket size is the minimum possible. I currently violate the STL allocator requirements here because I don't use the allocator for the temporary value_type, to avoid a usually pointless malloc; any suggestions on how to fix this would be welcome. You'll note that erasure here is safe, as an item's storage never moves, though if you concurrently use and erase the same item you'll get a race on a destroyed item, which is as expected. erase() is O(1). (A rough sketch of this is in the P.S. at the end of this mail.)

* size() under this design becomes O(bucket count), as we sum the count members of each bucket. empty() is worst case O(bucket count) when the map is empty, best case O(1) when the first bucket is not empty, and average case O(bucket count / item count / 2).

* Iteration under this design is, oddly, inversely dependent on load factor: the higher the load factor, the better the performance. This is because we don't bother with the usual forward pointer per item, as those are expensive for concurrent use, and doubly linked pointers are rather a lot of overhead per item. Iteration is therefore a matter of walking each bucket's linear table and, once that is done, searching for the next non-empty bucket, which in a sparsely occupied bucket list is very lumpy as cache lines are pulled into the cache one by one.

* rehash() under this design firstly locks the special rehash lock, then locks every bucket in the list. It then constructs a new empty bucket list and locks every lock in that. It then swaps the bucket lists, moves all the items from the old buckets to the new buckets, and finally unlocks all the buckets and the special rehash lock. This function is hard to write completely race free under all CPU load circumstances without burdening the main API implementations. I may work around this by keeping around the old, just emptied bucket list, permanently marked to users as "go reload the bucket list", and then documenting that you must not rehash too frequently or else you will race your own code. Note that rehashing under this design is completely manual. If it were not, insertion, erasure and item referencing could not be safe. This implies that if you rehash, it is up to you not to be using items by reference at the time. Rehashing is also by definition not concurrent; indeed it is the only function in the design which is not capable of concurrency.

Thoughts on this design before I write a final implementation are welcome. I don't plan to take my implementation past the minimum needed for proposed Boost.AFIO, so if someone else wishes to finish it and make it part of official Boost, that would be great.

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
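P.S. Since criterion 1 is the whole point of the design, here is a rough sketch of the erase() mechanics described above, in case it helps review. Again illustrative only: it takes the bucket and offset an iterator would carry, writes empty_hash for the reserved empty constant, assumes spinlock<bool> works with std::lock_guard, and simplifies the allocator handling and the trailing-slot trimming policy:

    // Sketch only: the value is moved out under the bucket lock, the slot is
    // marked empty, and the (possibly expensive) destruction of the erased
    // value then runs without the lock held.
    void erase(bucket_type &b, size_t offset)
    {
      alignas(value_type) char buf[sizeof(value_type)];
      value_type *old;
      {
        std::lock_guard<spinlock<bool>> g(b.lock);
        item_type &i = b.items[offset];
        old = new(buf) value_type(std::move(i.p)); // move out; slot stays constructed
        i.hash = empty_hash;                       // mark the slot as empty
        b.count.fetch_sub(1);
        // Trim trailing empty slots so the bucket's size (not its capacity)
        // shrinks back to the minimum possible.
        while(!b.items.empty() && b.items.back().hash == empty_hash)
          b.items.pop_back();
      }
      old->~value_type();                          // destroy outside the lock
    }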