On 8 Jul 2014 at 22:29, Lee Clagett wrote:
So back to the drawing board again. I'm now thinking of simplifying by requiring that the mapped type is a shared_ptr<T>, and I'll see what sort of design that might yield. I am finding, far more often than I expected, that using TM is not worth it due to the cost it imposes on single-threaded performance. Maybe the next Intel chip will improve TSX's overheads substantially.
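Very roughly, the sort of shape I mean is something like this (a minimal sketch only; the class name and bucket layout are purely illustrative, not the actual design):

    #include <array>
    #include <functional>
    #include <memory>
    #include <mutex>
    #include <unordered_map>

    template<class Key, class T>
    class concurrent_ptr_map
    {
      struct bucket
      {
        std::mutex lock;
        std::unordered_map<Key, std::shared_ptr<T>> items;
      };
      std::array<bucket, 64> buckets_;

      bucket &bucket_for(const Key &k)
      {
        return buckets_[std::hash<Key>()(k) % buckets_.size()];
      }

    public:
      // Lookup copies the shared_ptr under the bucket lock and releases
      // the lock immediately, so callers never hold a lock while using T.
      std::shared_ptr<T> find(const Key &k)
      {
        bucket &b = bucket_for(k);
        std::lock_guard<std::mutex> g(b.lock);
        auto it = b.items.find(k);
        return it != b.items.end() ? it->second : std::shared_ptr<T>();
      }

      void insert(const Key &k, std::shared_ptr<T> v)
      {
        bucket &b = bucket_for(k);
        std::lock_guard<std::mutex> g(b.lock);
        b.items[k] = std::move(v);
      }
    };

The attraction of forcing shared_ptr<T> as the mapped type is that a lookup only has to copy a shared_ptr under the bucket lock, so the critical sections stay tiny.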
I got the impression that writing in a transaction could be the expensive part, especially if it was contended (having to roll back, etc.).
Aborting an RTM transaction is *very* expensive. I think this is why HLE clearly uses a different internal implementation from RTM. You also can only touch about 100 cache lines in a transaction before you have a 50% chance of it aborting anyway, irrespective of contention, simply from exceeding internal buffer capacities (half the L1 cache is available for TM, but it's shared).
However, if you entered a critical section only for reading, there should be less of a penalty since it never "dirtied" the cache line. Have you tested that too (lots of readers, few writers)? Intel's tbb::speculative_spin_rw_lock _really_ makes sure that atomic flag is on its own cache line (padding everywhere), and acquiring the read lock doesn't appear to do a write.
I tried a many-reader, few-writer approach (80/20 split) where readers never write any cache lines. Aborts are even more costly than in a many-writer approach, I assume because the readers mark more cache lines as touched before a writer collides with them, so more has to be thrown away on abort. Yeah, I was surprised too.

Putting the fallback atomic flag into its own cache line is okay as a one-off for maybe an entire container. Per-bucket it's excessive, and per-future it would be crazy.

BTW, my RTM-enhanced spinlock doesn't acquire the spinlock but instead starts a transaction which will abort if someone does acquire the spinlock. That way all users of the spinlock execute the critically sectioned code without actually locking the spinlock. I did this because an HLE-enhanced spinlock is so slow in single-threaded code, whereas an RTM-enhanced spinlock had acceptable single-threaded performance costs (~3%).
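In outline it looks something like this (a minimal sketch only, assuming the x86 TSX intrinsics from <immintrin.h>; the class name and retry policy are illustrative, not my actual implementation):

    #include <immintrin.h>   // _xbegin, _xend, _xabort, _xtest (compile with -mrtm)
    #include <atomic>

    class rtm_spinlock
    {
      // Keep the lock word on its own cache line to avoid false sharing.
      alignas(64) std::atomic<unsigned> lock_{0};

    public:
      void lock()
      {
        // Try to elide the lock: start a transaction rather than writing
        // the lock word. A real version would retry elision a few times.
        if (_xbegin() == _XBEGIN_STARTED)
        {
          // Reading the lock word puts it into the transaction's read set,
          // so anyone who really acquires the spinlock aborts us.
          if (lock_.load(std::memory_order_relaxed) == 0)
            return;            // run the critical section transactionally
          _xabort(0xff);       // lock is genuinely held; give up on elision
        }
        // Fallback path: actually acquire the spinlock.
        unsigned expected = 0;
        while (!lock_.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire))
        {
          expected = 0;
          _mm_pause();
        }
      }

      void unlock()
      {
        if (_xtest())
          _xend();             // we elided the lock; commit the transaction
        else
          lock_.store(0, std::memory_order_release);
      }
    };

The key point is that the elided path only ever *reads* the lock word, so concurrent eliding threads never invalidate each other's cache lines; only a thread that takes the fallback path and really writes the spinlock aborts everyone else.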
Although, the single-threaded performance has me thinking that I am mistaken; I feel like a novice despite reading so much about hardware memory barriers.
Well, do bear in mind this stuff isn't my forte. I could simply be incompetent. It doesn't help that I'm working on this stuff after a full day of work, so my brain is pretty tired. I'm sure when someone like Andrey gets onto this stuff he'll see much better results than I have.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/