On 8 Jul 2014 at 22:29, Lee Clagett wrote:
So back to the drawing board again. I'm now thinking of simplifying by requiring that the mapped type is a shared_ptr<T>, and I'll see what sort of design that might yield. I am finding, far more often than I expected, that using TM is not worth it due to the cost it imposes on single-threaded performance. Maybe the next Intel chip will improve TSX's overheads substantially.
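Very roughly, the sort of shape I mean is something like this (a minimal sketch only; the class name and bucket layout are purely illustrative, not the actual design):

    #include <array>
    #include <functional>
    #include <memory>
    #include <mutex>
    #include <unordered_map>

    template<class Key, class T>
    class concurrent_ptr_map
    {
      struct bucket
      {
        std::mutex lock;
        std::unordered_map<Key, std::shared_ptr<T>> items;
      };
      std::array<bucket, 64> buckets_;

      bucket &bucket_for(const Key &k)
      {
        return buckets_[std::hash<Key>()(k) % buckets_.size()];
      }

    public:
      // Lookup copies the shared_ptr under the bucket lock and releases
      // the lock immediately, so callers never hold a lock while using T.
      std::shared_ptr<T> find(const Key &k)
      {
        bucket &b = bucket_for(k);
        std::lock_guard<std::mutex> g(b.lock);
        auto it = b.items.find(k);
        return it != b.items.end() ? it->second : std::shared_ptr<T>();
      }

      void insert(const Key &k, std::shared_ptr<T> v)
      {
        bucket &b = bucket_for(k);
        std::lock_guard<std::mutex> g(b.lock);
        b.items[k] = std::move(v);
      }
    };

The attraction of forcing shared_ptr<T> as the mapped type is that a lookup only has to copy a shared_ptr under the bucket lock, so the critical sections stay tiny.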
I got the impression that writing in a transaction could be the expensive part, especially if it was contended (having to roll back, etc.).
Aborting an RTM transaction is *very* expensive. I think this is why HLE clearly uses a different internal implementation from RTM. You also can only touch about 100 cache lines in a transaction before you have a 50% chance of it aborting anyway, irrespective of contention, simply from exceeding internal buffer capacities (half the L1 cache is available for TM, but it's shared).
However, if you entered a critical section only for reading, there should be less of a penalty since it never "dirtied" the cache line. Have you tested that too (lots of readers, few writers)? Intel's tbb::speculative_spin_rw_lock _really_ makes sure that atomic flag is on its own cache line (padding everywhere), and acquiring the read lock doesn't appear to do a write.
I tried a many-reader, few-writer approach (80/20 split) where readers never write any cache lines. Aborts are even more costly than in a many-writer approach, I assume because the readers mark more cache lines as touched before a writer collides with them, so more has to be thrown away on abort. Yeah, I was surprised too.

Putting the fallback atomic flag into its own cache line is okay as a one-off for maybe an entire container. Per-bucket it's excessive, and per-future it would be crazy.

BTW, my RTM-enhanced spinlock doesn't acquire the spinlock but instead starts a transaction which will abort if someone does acquire the spinlock. That way all users of the spinlock execute the critically sectioned code without actually locking the spinlock. I did this because an HLE-enhanced spinlock is so slow in single-threaded code, whereas an RTM-enhanced spinlock had acceptable single-threaded performance costs (~3%).
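In outline it looks something like this (a minimal sketch only, assuming the x86 TSX intrinsics from <immintrin.h>; the class name and retry policy are illustrative, not my actual implementation):

    #include <immintrin.h>   // _xbegin, _xend, _xabort, _xtest (compile with -mrtm)
    #include <atomic>

    class rtm_spinlock
    {
      // Keep the lock word on its own cache line to avoid false sharing.
      alignas(64) std::atomic<unsigned> lock_{0};

    public:
      void lock()
      {
        // Try to elide the lock: start a transaction rather than writing
        // the lock word. A real version would retry elision a few times.
        if (_xbegin() == _XBEGIN_STARTED)
        {
          // Reading the lock word puts it into the transaction's read set,
          // so anyone who really acquires the spinlock aborts us.
          if (lock_.load(std::memory_order_relaxed) == 0)
            return;            // run the critical section transactionally
          _xabort(0xff);       // lock is genuinely held; give up on elision
        }
        // Fallback path: actually acquire the spinlock.
        unsigned expected = 0;
        while (!lock_.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire))
        {
          expected = 0;
          _mm_pause();
        }
      }

      void unlock()
      {
        if (_xtest())
          _xend();             // we elided the lock; commit the transaction
        else
          lock_.store(0, std::memory_order_release);
      }
    };

The key point is that the elided path only ever *reads* the lock word, so concurrent eliding threads never invalidate each other's cache lines; only a thread that takes the fallback path and really writes the spinlock aborts everyone else.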
Although, the single-threaded performance has me thinking that I am mistaken; I feel like a novice despite reading so much about hardware memory barriers.
Well, do bear in mind this stuff isn't my forte. I could simply be incompetent. It doesn't help that I'm working on this stuff after a full day of work, so my brain is pretty tired. I'm sure when someone like Andrey gets onto this stuff he'll see much better results than I have.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/