On 20 Mar 2015 at 19:32, Giovanni Piero Deretta wrote:
What's special about memory allocation here? Intuitively, sharing futures on the stack might actually be worse, especially if you have multiple futures being serviced by different threads.
I think memory allocation is entirely wise for shared_future. I think auto allocation is wise for future, as exactly one of those can exist at any time per promise.
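To illustrate with a minimal sketch (hypothetical names, single-threaded, not a real implementation): because the future is unique per promise, the promise can simply point at it and write the value straight into storage embedded in the future itself, so nothing need touch the heap.

    // Hypothetical sketch, single-threaded, no error handling: the promise
    // points at its unique future, and the value lives inside the future
    // itself (and so can live on the stack) instead of in heap-allocated
    // shared state.
    #include <cassert>
    #include <utility>

    template <class T> struct stack_future;

    template <class T> struct stack_promise
    {
        stack_future<T> *fut = nullptr;   // the one and only future
        void set_value(T v);
    };

    template <class T> struct stack_future
    {
        stack_promise<T> *prom;
        T value{};                        // value storage embedded here
        bool ready = false;

        explicit stack_future(stack_promise<T> &p) : prom(&p) { p.fut = this; }
        stack_future(stack_future &&o) noexcept
            : prom(o.prom), value(std::move(o.value)), ready(o.ready)
        {
            prom->fut = this;             // repoint the promise at our new home
            o.prom = nullptr;
        }
        T get() { assert(ready); return std::move(value); }
    };

    template <class T> void stack_promise<T>::set_value(T v)
    {
        fut->value = std::move(v);
        fut->ready = true;
    }

    int main()
    {
        stack_promise<int> p;
        stack_future<int> f(p);           // f lives on the stack, no malloc
        p.set_value(42);
        assert(f.get() == 42);
    }

The catch, and it is what the rest of this discussion turns on, is that the move constructor must repoint the promise, and as soon as the promise may be completed from another thread that repointing has to be atomic.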
On Intel, RMW is the same speed as non-atomic ops unless the cache line is Owned or Shared.
Yes, if the thread does not own the cache line the communication cost dwarfs everything else, but in the normal case of an exclusive cache line, mfence, xchg, cmpxchg and friends cost 30-50 cycles and stall the CPU. That is significantly more than the cost of non-serialising instructions, and not something I want to do in a move constructor.
You're right and I'm wrong on this - I based the claim above on empirical testing in which I found no difference from the LOCK prefix. It would appear I had an inefficiency in my testing code. Agner's instruction tables for Haswell say:

  XADD:          5 uops,  7 cycles latency
  LOCK XADD:     9 uops, 19 cycles latency
  CMPXCHG:       6 uops,  8 cycles latency
  LOCK CMPXCHG: 10 uops, 19 cycles latency

(Source: http://www.agner.org/optimize/instruction_tables.pdf)

So the LOCK prefix approximately halves the throughput and triples the latency, irrespective of the state of the cache line. Additionally, as I reported on this list maybe a year ago, first-generation Intel TSX provides no benefit and indeed a hefty penalty over simple atomic RMW. ARM and other CPUs provide load-linked/store-conditional, so RMW on those is indeed close to penalty-free if the cache line is exclusive to the CPU doing the ops. It's just that Intel is still incapable of low-latency lock acquisition, though it is enormously better than the Pentium 4.

All that said, I don't see a 50 cycle cost per move constructor as a problem at all. Compilers are also pretty good at applying RVO (if you don't get in the way) to elide moves, even though a move constructor using atomics has observable effects. The number of 50 cycle move constructors actually executed is therefore usually minimal.
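For concreteness, a hedged sketch (hypothetical type, not a complete protocol - a real implementation must also handle set_value racing the move) of where that LOCK'd RMW lands, and of the RVO point:

    #include <atomic>
    #include <utility>

    struct future_s;

    struct promise_s
    {
        std::atomic<future_s *> fut{nullptr};
    };

    struct future_s
    {
        promise_s *prom;
        explicit future_s(promise_s &p) : prom(&p)
        {
            p.fut.store(this, std::memory_order_release);
        }
        future_s(future_s &&o) noexcept : prom(o.prom)
        {
            // This is the LOCK-prefixed RMW whose cost is at issue: the
            // promise may be completing concurrently, so repointing it at
            // the new future must be atomic.
            prom->fut.exchange(this, std::memory_order_acq_rel);
            o.prom = nullptr;
        }
    };

    // RVO constructs the temporary directly in the caller's storage, so in
    // the common case no move - and hence no atomic RMW - actually executes.
    future_s make_future(promise_s &p) { return future_s(p); }

    int main()
    {
        promise_s p;
        future_s f = make_future(p);   // move elided by RVO
        (void) f;
    }

The exchange in the move constructor is the ~19 cycle LOCK'd instruction from the table above; make_future shows why it rarely executes in practice.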
[Snip]
A couple of months ago I was arguing with Gor Nishanov (author of the MS resumable functions paper) that heap-allocating the resumable function by default is unacceptable. And here I am arguing the other side :).
It is unacceptable. Chris's big point in his Concurrency alternative paper before WG21 is that future-promise is useless for 100k-1M socket scalability, because to reach that you must have much lower latency than future-promise is capable of. ASIO's async_result system can achieve 100k-1M scalability; future-promise (as currently implemented) cannot.
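To make the comparison concrete, here is a minimal sketch of the two completion styles against a Boost.ASIO timer. It illustrates the allocation difference only, it is not a benchmark: the callback form allocates no future-promise shared state, while the use_future completion token pays for allocating and synchronising one on every operation.

    #include <boost/asio.hpp>
    #include <boost/asio/use_future.hpp>
    #include <future>
    #include <iostream>

    int main()
    {
        boost::asio::io_service io;
        boost::asio::deadline_timer timer(io, boost::posix_time::milliseconds(1));

        // async_result with a callback: the handler is invoked directly by
        // the reactor, no future-promise shared state is allocated.
        timer.async_wait([](const boost::system::error_code &) {
            std::cout << "callback fired\n";
        });
        io.run();

        // The same operation via the use_future completion token: a
        // std::future is returned, at the cost of allocating and
        // synchronising its shared state.
        io.reset();
        timer.expires_from_now(boost::posix_time::milliseconds(1));
        std::future<void> f = timer.async_wait(boost::asio::use_future);
        io.run();
        f.get();
        std::cout << "future fired\n";
    }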
Thankfully WG21 appear to have accepted this about resumable functions, so far as I am aware.
I'm a big fan of Chris' proposal as well. I haven't seen any new papers on resumable functions; I would love to know where the committee is heading.
The argument between those two camps is essentially concurrency vs parallelism. I had thought the latter had won? My urging by private email to those involved has been that a substantial reconciliation needs to happen between the Concurrency TS and the Networking TS such that they are far more tightly integrated, not these separate opposing paradigm islands. A future-promise as efficient as async_result would be an excellent first step along such a reconciliation.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/