Re: [boost] [asio] Bug: Handlers execute on the wrong strand (Gavin Lambert).
On 24/10/2013 22:27, Quoth Marat Abrarov:
I actually ended up abandoning Asio and rolling my own threadqueue (I didn't want any of the socket-related stuff so this wasn't as big a job as it sounds). Mine isn't fully-baked yet (and it's Windows-only for the moment) and I'm not sure I'd want to show it off either, but on the upside most of the guts are entirely lock-free (though not wait-free, since it's based on Boost.LockFree's queue).
Asio on Windows uses IOCP (with default settings, using asio::io_service for task scheduling), and that is the (theoretical) reason for the better thread scheduling of the Asio-based thread pool. Sometimes it's really visible.
It's also full of mutexes though, which is why it didn't work out for me. (Note that I was using Boost 1.53 when testing Asio; maybe this has changed in future versions, although I heard that 1.54 picked up a bug in the IOCP reactor.)

I'm not sure exactly which lock triggered the slow path (my logging was only sufficient to show that it was one of the ones inside Asio, but not which one). But as the prior email said, given reuse of strand implementations between supposedly independent strands, that seems like a likely candidate. (Though it didn't take long for the latency spikes to manifest -- typically they'd start after a couple of minutes and then recur roughly every 10-30 seconds.)

My implementation also uses IOCP for the io_service equivalent, but without the dispatch queue mutexes that Asio uses. The Boost.LockFree queue is used mostly for the io_strand and object_handle equivalents (and even then it's only used by the latter to support multiple concurrent waiters, which I didn't end up needing). (The strand, of course, is still blocking in the execution path but not in the posting path.)

I haven't done a head-to-head benchmark of the two (and it wouldn't surprise me if Asio were faster than mine for many loads -- and it's definitely more flexible than I made mine), but so far mine is doing at least as well as Asio on production loads, without the latency spikes from the locks. Still very early days yet though.
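Roughly the shape being described -- a strand whose post() is a single lock-free enqueue plus at most one CAS, with serialisation enforced only on the draining side. This is an illustrative sketch with invented names, not the actual implementation (which would hand the drain off to the IOCP worker pool rather than run it inline):

// Sketch only: lock-free posting path, serialised execution path.
#include <boost/lockfree/queue.hpp>
#include <atomic>
#include <functional>

class lockfree_strand {                       // invented name
public:
    explicit lockfree_strand(std::size_t capacity = 128)
        : queue_(capacity), draining_(false) {}

    // Posting path: no mutex, just an enqueue and (at most) one CAS.
    void post(std::function<void()> handler) {
        queue_.push(new std::function<void()>(std::move(handler)));
        bool expected = false;
        if (draining_.compare_exchange_strong(expected, true))
            drain();                          // real code: hand drain() to the pool
    }

private:
    // Execution path: exactly one thread drains at a time, so handlers posted
    // to the same strand never run concurrently.
    void drain() {
        for (;;) {
            std::function<void()>* h = 0;
            while (queue_.pop(h)) { (*h)(); delete h; }
            draining_.store(false);
            // A producer may have enqueued between the failed pop and the
            // store above, so re-check before giving up the drainer role.
            if (queue_.empty()) return;
            bool expected = false;
            if (!draining_.compare_exchange_strong(expected, true)) return;
        }
    }

    boost::lockfree::queue<std::function<void()>*> queue_;
    std::atomic<bool> draining_;
};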
On 25 Oct 2013 at 19:39, Gavin Lambert wrote:
upside most of the guts are entirely lock-free (though not wait-free, since it's based on Boost.LockFree's queue).
lockfree::queue isn't actually hugely performant. Most lock-free code isn't, compared with most lock-based implementations, because you gain in worst-case execution times by sacrificing average-case execution times. The only major exception is lockfree::spsc_queue, which is indeed very fast by any metric.
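To make the distinction concrete, spsc_queue only applies when there is exactly one producer thread and one consumer thread -- a trivial usage sketch (not a benchmark):

// Single-producer / single-consumer: the one case where Boost.LockFree is
// fast by any metric.  Sketch only.
#include <boost/lockfree/spsc_queue.hpp>
#include <thread>

boost::lockfree::spsc_queue<int, boost::lockfree::capacity<1024> > ring;

void producer() {
    for (int i = 0; i < 100000; ++i)
        while (!ring.push(i)) { /* ring full: spin or yield */ }
}

void consumer() {
    int v, received = 0;
    while (received < 100000)
        if (ring.pop(v)) ++received;
}

int main() {
    std::thread p(producer), c(consumer);   // exactly one of each
    p.join();
    c.join();
}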
Asio on Windows uses IOCP (with default settings, using asio::io_service for task scheduling), and that is the (theoretical) reason for the better thread scheduling of the Asio-based thread pool. Sometimes it's really visible.
It's also full of mutexes though, which is why it didn't work out for me. (Note that I was using Boost 1.53 when testing Asio; maybe this has changed in future versions, although I heard that 1.54 picked up a bug in the IOCP reactor.)
I saw lost wakeups during parallel writes in ASIO 1.54, so I disabled those for AFIO. That appears to be fixed in 1.55, so AFIO now parallelises everything as it was designed to. This might mean ASIO in 1.55 is fixed.
I'm not sure exactly which lock triggered the slow path (my logging was only sufficient to show that it was one of the ones inside Asio, but not which one). But as the prior email said, given reuse of strand implementations between supposedly independent strands, that seems like a likely candidate. (Though it didn't take long for the latency spikes to manifest -- typically they'd start after a couple of minutes and then recur roughly every 10-30 seconds.)
ASIO is, once you compile it with optimisation, really a thin wrapper doing a lot of mallocs and frees around Win IO completion ports. Any latency spikes are surely due to either IOCP or the memory allocator causing a critical section to exceed its spin count, and therefore go to kernel sleep?
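For anyone who hasn't looked at the Win32 side: the completion-port core that both Asio and a hand-rolled pool sit on is genuinely small. A bare-bones sketch (invented names, no error handling, not Asio's code):

// Minimal IOCP-backed dispatcher: post() hands work to the port, run() is
// called by each worker thread.  Sketch only.
#include <windows.h>
#include <functional>

class iocp_pool {                                        // invented name
public:
    explicit iocp_pool(DWORD concurrency)
        : port_(CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, concurrency)) {}
    ~iocp_pool() { CloseHandle(port_); }

    // Heap-allocate the handler and let the kernel wake a worker for us.
    void post(std::function<void()>* handler) {
        PostQueuedCompletionStatus(port_, 0,
                                   reinterpret_cast<ULONG_PTR>(handler), NULL);
    }

    // Worker loop: blocks in the kernel until a completion packet arrives.
    void run() {
        for (;;) {
            DWORD bytes = 0;
            ULONG_PTR key = 0;
            LPOVERLAPPED ov = NULL;
            GetQueuedCompletionStatus(port_, &bytes, &key, &ov, INFINITE);
            if (key == 0) break;                         // key 0 = our stop signal
            std::function<void()>* handler =
                reinterpret_cast<std::function<void()>*>(key);
            (*handler)();
            delete handler;
        }
    }

    // Wake each worker once with the stop signal.
    void stop(DWORD workers) {
        for (DWORD i = 0; i < workers; ++i)
            PostQueuedCompletionStatus(port_, 0, 0, NULL);
    }

private:
    HANDLE port_;
};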
I haven't done a head-to-head benchmark of the two (and it wouldn't surprise me if Asio were faster than mine for many loads -- and it's definitely more flexible than I made mine), but so far mine is doing at least as well as Asio on production loads, without the latency spikes from the locks. Still very early days yet though.
If you're on Haswell, you might look into my memory transaction implementation in AFIO. It uses Intel TSX if available according to runtime detection, otherwise it falls back onto a policy-composed spin lock (yes, I know I did NIH with yet another Boost spinlock implementation, but hey, mine is policy-composed so you can vary spin counts etc.!!!). It works on Intel's TSX simulator, but I would really love to know if it works on real TSX hardware.

Niall
--
Currently unemployed and looking for work.
Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
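For reference, the runtime-detection-plus-fallback shape described above looks roughly like the following. This is a sketch, not AFIO's code: the feature test is CPUID leaf 7, sub-leaf 0, EBX bit 11 (RTM), real code would also verify the maximum supported leaf, and GCC/Clang need -mrtm for the _xbegin/_xend intrinsics.

// Sketch: elide the lock with an RTM transaction when the CPU supports it,
// otherwise fall back to a plain CAS spin lock.  Not AFIO's implementation.
#include <immintrin.h>
#include <atomic>
#ifdef _MSC_VER
#include <intrin.h>
#endif

static bool cpu_has_rtm() {
#ifdef _MSC_VER
    int regs[4];
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 11)) != 0;            // EBX bit 11 == RTM
#else
    unsigned eax, ebx, ecx, edx;
    __asm__ volatile("cpuid"
                     : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                     : "a"(7), "c"(0));
    return (ebx & (1u << 11)) != 0;
#endif
}

class tsx_or_spin_lock {                          // invented name
    std::atomic<int> flag_;                       // 0 = unlocked, 1 = locked
    bool use_rtm_;
public:
    tsx_or_spin_lock() : flag_(0), use_rtm_(cpu_has_rtm()) {}

    void lock() {
        if (use_rtm_ && _xbegin() == _XBEGIN_STARTED) {
            // Speculative path: reading flag_ puts it in the read set, so a
            // real locker aborts us; abort ourselves if it's already held.
            if (flag_.load(std::memory_order_relaxed) == 0) return;
            _xabort(0xff);                        // control resumes at _xbegin, falls through
        }
        // Fallback: plain CAS spin (spin-count/backoff policy omitted).
        int expected;
        do { expected = 0; }
        while (!flag_.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire));
    }

    void unlock() {
        if (use_rtm_ && _xtest()) { _xend(); return; }   // commit the elision
        flag_.store(0, std::memory_order_release);
    }
};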
On 26/10/2013 11:09, Quoth Niall Douglas:
lockfree::queue isn't actually hugely performant. Most lock-free code isn't, compared with most lock-based implementations, because you gain in worst-case execution times by sacrificing average-case execution times. The only major exception is lockfree::spsc_queue, which is indeed very fast by any metric.
Yes, I know that. As I said my implementation only barely uses the queue anyway. (I was thinking about using an MPSC queue for the strand implementation, since it's the heaviest queue user, MPSC is all it should require, and I do have one handy -- but so far that's not my bottleneck so I haven't worried about it too much.) I still generally find that being able to complete work without context switching is a massive win over running into a lock-wall, even if the individual work takes longer on average to complete.
ASIO is, once you compile it with optimisation, really a thin wrapper doing a lot of mallocs and frees around Win IO completion ports. Any latency spikes are surely due to either IOCP or the memory allocator causing a critical section to exceed its spin count, and therefore go to kernel sleep?
No, the latency was very definitely coming from *some* instance of either boost::asio::detail::mutex or boost::asio::detail::static_mutex. I didn't trace it down any further than that. (As far as the memory allocator goes, I'm actually using nedmalloc -- which I know isn't lock-free but it's pretty decent at avoiding lock contention. And I had that instrumented too and there were no memory allocation locks around the time when the latency occurred.) My application was using an io_service with 6-8 threads, and a large number of serial_ports each with their own io_strand and deadline_timer. The deadline_timer is cancelled and restarted most of the time rather than being allowed to expire -- but this shouldn't be unusual behaviour for a timeout.
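For context, the per-port arrangement was essentially the textbook Asio timeout idiom -- roughly this shape (illustrative sketch only, invented names, details simplified):

// One io_service shared by the worker threads; each serial port owns its own
// strand and deadline_timer, with the timer normally cancelled and re-armed
// rather than allowed to expire.
#include <boost/asio.hpp>
#include <boost/bind.hpp>

class port_session {                                   // invented name
public:
    port_session(boost::asio::io_service& io, const std::string& device)
        : port_(io, device), strand_(io), timer_(io) {}

    void start_read() {
        restart_timeout();
        port_.async_read_some(boost::asio::buffer(buf_),
            strand_.wrap(boost::bind(&port_session::on_read, this,
                boost::asio::placeholders::error,
                boost::asio::placeholders::bytes_transferred)));
    }

private:
    void on_read(const boost::system::error_code& ec, std::size_t /*n*/) {
        timer_.cancel();               // the common case: data arrived in time
        if (!ec) start_read();         // ...and the next read re-arms the timer
    }

    void restart_timeout() {
        timer_.expires_from_now(boost::posix_time::seconds(2));
        timer_.async_wait(strand_.wrap(boost::bind(&port_session::on_timeout,
            this, boost::asio::placeholders::error)));
    }

    void on_timeout(const boost::system::error_code& ec) {
        if (ec != boost::asio::error::operation_aborted)
            port_.cancel();            // genuine timeout: abort the pending read
    }

    boost::asio::serial_port port_;
    boost::asio::io_service::strand strand_;
    boost::asio::deadline_timer timer_;
    char buf_[256];
};

// The io_service itself is then run by 6-8 threads each calling io.run().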
On 31 Oct 2013 at 12:12, Gavin Lambert wrote:
ASIO is, once you compile it with optimisation, really a thin wrapper doing a lot of mallocs and frees around Win IO completion ports. Any latency spikes are surely due to either IOCP or the memory allocator causing a critical section to exceed its spin count, and therefore go to kernel sleep?
No, the latency was very definitely coming from *some* instance of either boost::asio::detail::mutex or boost::asio::detail::static_mutex. I didn't trace it down any further than that.
From what I can tell ASIO uses Win32 critical sections which come with spin counts. If you keep every operation under the count, all is good, else you get spikes as it drops to kernel wait.
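For reference, the knob in question (sketch; the 4000 figure is just the commonly cited heap-manager default, not necessarily what Asio passes):

// A Win32 CRITICAL_SECTION spins in user mode up to its spin count and only
// then falls back to a kernel wait -- the expensive path behind the spikes.
#include <windows.h>

CRITICAL_SECTION cs;

void init() {
    // Spin up to 4000 times before sleeping in the kernel (example value).
    InitializeCriticalSectionAndSpinCount(&cs, 4000);
}

void do_work() {
    EnterCriticalSection(&cs);    // cheap if the owner releases within the
                                  // spin budget; a kernel wait otherwise
    /* ... keep this short ... */
    LeaveCriticalSection(&cs);
}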
(As far as the memory allocator goes, I'm actually using nedmalloc -- which I know isn't lock-free but it's pretty decent at avoiding lock contention. And I had that instrumented too and there were no memory allocation locks around the time when the latency occurred.)
I might know something about nedmalloc :). Mmm, I was just about to suggest that nedmalloc might be doing a free space consolidation run and that might be the cause of the spike, but if it isn't then okay.
My application was using an io_service with 6-8 threads, and a large number of serial_ports each with their own io_strand and deadline_timer. The deadline_timer is cancelled and restarted most of the time rather than being allowed to expire -- but this shouldn't be unusual behaviour for a timeout.
It just goes to show how complex and unpredictable modern C++ has become :( At the bleeding edge we're actually all complex systems engineers now, not software or computer science engineers. I wish the way we as an industry interview job candidates reflected that (or I wouldn't be having the problems I have had!).

Niall
--
Currently unemployed and looking for work.
Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
On 31/10/2013 14:12, Quoth Niall Douglas:
From what I can tell ASIO uses Win32 critical sections which come with spin counts. If you keep every operation under the count, all is good, else you get spikes as it drops to kernel wait.
I also tried replacing the locks with some of my own (which are slower but handle PI, i.e. priority inheritance), but it didn't make much difference. This isn't surprising though, as the times I was seeing were in the order of 300ms from requesting the lock to being granted it, as I said before, which is a bit excessive for even a kernel wait. (And before you ask, the handlers themselves don't take that long to run -- I don't have precise figures handy ATM, but generally I consider them slow if they take 5ms. Though of course, Windows being Windows, there can be 15ms spikes if it does hit a kernel wait or similar; still nowhere close to 300ms though.)

I still wouldn't rule out "I was doing something silly in my code" of course (there's always room for that), and I'm probably operating outside of what ASIO was originally intended for anyway (since I'm not doing sockets). But whatever was causing it must have been either in ASIO itself or in the small amount of wrapper code I had to rewrite when moving from ASIO to my custom implementation, because it seems to have gone away since switching over. (The access pattern of the outside code is unchanged.)
I might know something about nedmalloc :)
Funny that. :)
Mmm, I was just about to suggest that nedmalloc might be doing a free space consolidation run and that might be the cause of the spike, but if it isn't then okay.
Not unless it can do that without locking anything, at least. I was basically only recording attempts to lock/unlock rather than any access to the allocator. I suspect I'm hitting the memory allocator in my implementation more frequently than ASIO was, actually -- I'm not trying to cache and reuse operations or buffers; it just does a "new" whenever it needs it. (Although I might be getting away with fewer intermediate objects, since I've cut the functionality to the bare minimum.) So I doubt allocation was the issue. (Unless maybe it was trying to *avoid* allocation that introduced the issue, as the post that started this discussion implied.)
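The kind of recording being described can be as simple as a scoped wrapper that timestamps acquisition and only reports the slow cases -- a sketch (the threshold and output format are arbitrary):

// Scoped lock probe: measures how long acquisition took and logs anything
// over a threshold.  Illustrative only.
#include <chrono>
#include <cstdio>
#include <mutex>

template <class Mutex>
class timed_lock_probe {
public:
    timed_lock_probe(Mutex& m, const char* tag) : m_(m), tag_(tag) {
        using namespace std::chrono;
        const steady_clock::time_point t0 = steady_clock::now();
        m_.lock();
        const milliseconds waited =
            duration_cast<milliseconds>(steady_clock::now() - t0);
        if (waited.count() >= 5)                    // "slow" threshold: 5ms
            std::fprintf(stderr, "lock '%s' took %lld ms\n", tag_,
                         static_cast<long long>(waited.count()));
    }
    ~timed_lock_probe() { m_.unlock(); }
private:
    Mutex& m_;
    const char* tag_;
};

// Usage:
//   { timed_lock_probe<std::mutex> hold(dispatch_mutex, "dispatch"); /* work */ }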
On 31 Oct 2013 at 15:23, Gavin Lambert wrote:
surprising though as the times I was seeing were in the order of 300ms from requesting the lock to being granted it, as I said before, which is a bit excessive for even a kernel wait. (And before you ask, the
I've seen CAS locks spike to a quarter of a second if you get a very unlucky sequence of events where all cores are read-modify-writing more cache lines than the cache coherency bus can cope with. You'll see the mouse pointer, disc i/o etc. all go to ~4Hz. Admittedly, that's a problem older processors experience more than newer ones; Intel have improved things.
ASIO itself or in the small amount of wrapper code I had to rewrite when moving from ASIO to my custom implementation, because it seems to have gone away since switching over. (The access pattern of the outside code is unchanged.)
ASIO may be doing nothing wrong, but simply the combination of your code with its code produces weird timing resonances which just happen to cause spikes on some particular hardware. I occasionally get bug reports for nedmalloc by hedge funds where they upgraded to some new hardware and nedmalloc suddenly starts latency spiking. I tell them to add an empty for loop incrementing an atomic, and they're often quite surprised when the spiking goes away.
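That fix amounts to nothing more than something like the following, dropped onto the hot path (the numbers are arbitrary):

// The "empty for loop incrementing an atomic": a tiny fixed amount of extra
// work that perturbs thread timing enough to break the resonance.  Sketch only.
#include <atomic>

static std::atomic<unsigned> jitter_counter(0);

inline void timing_jitter(unsigned spins = 64) {
    for (unsigned i = 0; i < spins; ++i)
        jitter_counter.fetch_add(1, std::memory_order_relaxed);
}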
Mmm, I was just about to suggest that nedmalloc might be doing a free space consolidation run and that might be the cause of the spike, but if it isn't then okay.
Not unless it can do that without locking anything, at least. I was basically only recording attempts to lock/unlock rather than any access to the allocator.
nedmalloc keeps multiple pools, and while it is consolidating free space in one pool it will send traffic to one of the other pools.
I suspect I'm hitting the memory allocator in my implementation more frequently than ASIO was, actually -- I'm not trying to cache and reuse operations or buffers; it just does a "new" whenever it needs it. (Although I might be getting away with fewer intermediate objects, since I've cut the functionality to the bare minimum.) So I doubt allocation was the issue. (Unless maybe it was trying to *avoid* allocation that introduced the issue, as the post that started this discussion implied.)
One of the cunning ideas I had while at BlackBerry was for a new clang optimiser pass plugin which has the compiler coalesce operator new calls into batch mallocs and replace all sequences of stack-unwound new/deletes with alloca(). It would break ABI compatibility with GCC, but I reckoned it would deliver tremendous performance improvements in malloc-contended code. Shame we probably won't see that optimisation any time soon; it would help Boost code in particular.

Niall
--
Currently unemployed and looking for work.
Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
participants (3)
- Gavin Lambert
- Marat Abrarov
- Niall Douglas