[lockfree] Faster ring-buffer queue implementation
Hello,

Some time ago I implemented a lock-free multi-producer multi-consumer queue on a ring buffer, and recently I compared its performance with the current (Boost 1.55.0) boost::lockfree::queue - it shows 50% better performance. I had only a quick glance at the Boost implementation, but I noticed that it uses CAS operations, which are basically slower than a simple RMW (an atomic increment in our case) on modern x86-64 hardware.

Maybe I made some mistake in my boost::lockfree::queue usage. If so, I'd be obliged if you pointed out the problem in my benchmark. Otherwise, I'm wondering whether it makes sense to integrate the queue implementation into boost::lockfree?

The algorithm description is available at http://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-...

The source code of the queue, with the benchmark against boost::lockfree::queue: https://github.com/krizhanovsky/NatSys-Lab/blob/master/lockfree_rb_q.cc

And the benchmark results are available here: http://natsys-lab.blogspot.ru/2014/01/natsys-lock-free-queue-in-comparison.h...

Thanks!

--
Alexander Krizhanovsky
NatSys Lab. (http://natsys-lab.com)
tel: +7 (916) 717-3899, +7 (499) 747-6304
e-mail: ak@natsys-lab.com
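To make the performance claim concrete: the difference boils down to a retrying CAS loop versus a single fetch-and-add. A minimal sketch of the two slot-claiming styles (function names are hypothetical, not code from either implementation):

    #include <atomic>

    std::atomic<unsigned long> tail{0};

    // CAS-loop style: may retry under contention; each failed
    // attempt reloads the expected value and tries again.
    unsigned long claim_slot_cas()
    {
        unsigned long t = tail.load(std::memory_order_relaxed);
        while (!tail.compare_exchange_weak(t, t + 1,
                                           std::memory_order_acq_rel))
            ; // t was reloaded on failure; retry until we win
        return t;
    }

    // Single-RMW style: one atomic increment that always succeeds
    // in a single instruction (LOCK XADD on x86-64).
    unsigned long claim_slot_xadd()
    {
        return tail.fetch_add(1, std::memory_order_acq_rel);
    }

Under contention every thread's XADD makes forward progress, while the CAS loop can fail repeatedly, which is consistent with the benchmark gap described above.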
Maybe I made some mistake in my boost::lockfree::queue usage. If so, I'd be obliged if you pointed out the problem in my benchmark. Otherwise, I'm wondering whether it makes sense to integrate the queue implementation into boost::lockfree?
nothing wrong - comparing a bounded with an unbounded algorithm is just comparing apples with oranges. but for the bounded case (lockfree::queue with fixed_sized<> or capacity<>) it could make sense to make use of it!

tim
On 16/01/2014 07:32, Quoth Alexander Krizhanovsky:
Otherwise, I'm wondering whether it makes sense to integrate the queue implementation into boost::lockfree?
The algorithm description is available at http://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-...
I've only had a quick glance at it, but it looks less general-purpose -- it appears that for each queue instance, you need to predefine an upper bound on the number of producer and consumer threads and assign each thread a unique id within that bound.

(Also, given the apparent locality of the thread-id, I don't like that it is stored and accessed as a thread-global; this would complicate use of multiple queues from one thread. And interleaving the per-thread head/tail indexes seems vulnerable to false sharing.)

There are still many cases where manually assigning thread ids is achievable, so something like this might be useful to have. But it couldn't be a replacement for the more general queue.
On 01/16/2014 08:04 AM, Gavin Lambert wrote:
On 16/01/2014 07:32, Quoth Alexander Krizhanovsky:
Otherwise, I'm wondering whether it makes sense to integrate the queue implementation into boost::lockfree?
The algorithm description is available at http://www.linuxjournal.com/content/lock-free-multi-producer-multi-consumer-...
I've only had a quick glance at it, but it looks less general-purpose -- it appears that for each queue instance, you need to predefine an upper bound on the number of producer and consumer threads and assign each thread a unique id within that bound.
(Also, given the apparent locality of the thread-id, I don't like that it is stored and accessed as a thread-global; this would complicate use of multiple queues from one thread.
The queue requires thread-local head and tail pointers (the same approach is used in the LMAX Disruptor). Since each thread can have access to many queues, the local pointers are implemented as an array of ThrPos structures, and the array is a member of the particular queue instance. However, to access the queue quickly we need small consecutive thread identifiers. For example, if we have two threads with identifiers 0 and 1 and two queues Q0 and Q1, then the threads access their local pointers through Q0.thr_p_[0] and Q0.thr_p_[1] for the first queue, and Q1.thr_p_[0] and Q1.thr_p_[1] for the second. This extends to an arbitrary number of threads and queues. We only need to know at queue construction how many threads will be using it.
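In code, the layout just described looks roughly like this (a sketch following the description above; see lockfree_rb_q.cc for the real code):

    // One slot of thread-local queue positions.
    struct ThrPos {
        unsigned long head, tail;
    };

    class LockFreeQueue {
        ThrPos *thr_p_;  // one entry per thread that may use this queue
        // ... ring-buffer storage, global head/tail, etc.
    public:
        explicit LockFreeQueue(size_t n_threads)
            : thr_p_(new ThrPos[n_threads]())
        {}
        ~LockFreeQueue() { delete[] thr_p_; }

        // Each thread indexes the array with its small consecutive id.
        ThrPos &thr_pos(size_t thr_id) const { return thr_p_[thr_id]; }
    };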
And interleaving the per-thread head/tail indexes seems vulnerable to false sharing.)
False sharing is bad due to the costly request-for-ownership messages of the cache coherence protocol. The tail and head pointers are modified only by the owning thread, which should preferably be bound to a particular CPU (and the OS scheduler doesn't reschedule a thread to another CPU without need). Other threads/CPUs only read the thread's tail/head pointers. It makes sense to align ThrPos to a cache line, so that different CPUs won't update the same cache lines. I tried this, but didn't get a noticeable performance improvement on my laptop (2 cores), a 12-core/2-processor server, or a 40-core/4-processor server. Probably it makes sense to run more benchmarks....
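The alignment experiment mentioned above can be expressed as follows in C++11 (a sketch; the actual code may use compiler-specific attributes instead of alignas):

    // Pad each per-thread slot to its own 64-byte cache line so that
    // one thread's head/tail updates never invalidate a line that
    // another thread's slot lives on.
    struct alignas(64) ThrPos {
        unsigned long head, tail;
    };
    static_assert(sizeof(ThrPos) == 64, "one cache line per thread slot");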
There are still many cases where manually assigning thread ids is achievable, so something like this might be useful to have.
Yes, I understand that the current thread ids are a bit ugly and it would be good to replace them with a nicer interface. It seems boost::thread and std::thread don't provide small consecutive thread ids... Any ideas on how the interface could be adjusted?
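One possibility (a sketch only, not an existing Boost or std interface): hand out small consecutive ids from an atomic counter the first time each thread asks for one.

    #include <atomic>
    #include <cstddef>

    inline std::size_t small_thread_id()
    {
        static std::atomic<std::size_t> next{0};
        // Assigned once per thread, on first call.
        thread_local std::size_t id =
            next.fetch_add(1, std::memory_order_relaxed);
        return id;
    }

The caveat is that such ids are process-wide rather than per-queue, so every queue would have to be sized for the total number of threads that might touch it; a per-queue registration call would avoid that at the cost of a noisier interface.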
But it couldn't be a replacement for the more general queue.
Yes, I definitely agree. The ring-buffer queue, with all its limitations, was designed for the classical work-queue scenario, not as a general fast queue.
On 16/01/2014 22:06, Quoth Alexander Krizhanovsky:
The queue requires thread-local head and tail pointers (the same approach is used in the LMAX Disruptor). Since each thread can have access to many queues, the local pointers are implemented as an array of ThrPos structures, and the array is a member of the particular queue instance. However, to access the queue quickly we need small consecutive thread identifiers.
For example, if we have two threads with identifiers 0 and 1 and two queues Q0 and Q1, then the threads access their local pointers through Q0.thr_p_[0] and Q0.thr_p_[1] for the first queue, and Q1.thr_p_[0] and Q1.thr_p_[1] for the second. This extends to an arbitrary number of threads and queues. We only need to know at queue construction how many threads will be using it.
What I don't like is the idea that a given thread is always "thread 1". Imagine a situation where you have an existing queue that's already used by one producer and one consumer, each of which is already "thread 0". Now another producer thread is added; this of necessity must be thread 1. But that thread is also a consumer of another SPSC queue -- and yet that second queue must be declared as having two consumers, despite "consumer thread 0" not existing for it.

Granted, it's only a small memory wastage in this case (trivial next to the cache alignment), but keeping all the thread ids straight in a complex web of queues could get messy, plus having to lie to the queue about how many producers and consumers it actually has makes it do extra work.

I'd be happier if the thread id were a parameter to push/pop instead of using thread-local data for that -- even thread-local is *too* global for this. (Since producer and consumer ids are independent, the id is only needed in that one spot anyway.)
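For illustration, the interface change being suggested might look like this (names are hypothetical):

    #include <cstddef>

    // Producer and consumer ids are explicit parameters, independent
    // of each other and of any thread-global state.
    template <typename T>
    class RingQueue {
    public:
        bool push(const T &value, std::size_t producer_id);
        bool pop(T &out, std::size_t consumer_id);
    };

    // The same OS thread can then be producer 1 of one queue and
    // consumer 0 of another, with no global thread id to keep straight:
    //   q1.push(item, /*producer_id=*/1);
    //   q2.pop(item, /*consumer_id=*/0);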
And interleaving the per-thread head/tail indexes seems vulnerable to false sharing.)
False sharing is bad due to the costly request-for-ownership messages of the cache coherence protocol. The tail and head pointers are modified only by the owning thread, which should preferably be bound to a particular CPU (and the OS scheduler doesn't reschedule a thread to another CPU without need). Other threads/CPUs only read the thread's tail/head pointers.
A write to any of the 64 bytes on the cache line will temporarily block reads to any other byte on that line. Having one writer and the rest readers just avoids write races; it has no bearing on false sharing.
It makes sense to align ThrPos to a cache line, so that different CPUs won't update the same cache lines. I tried this, but didn't get a noticeable performance improvement on my laptop (2 cores), a 12-core/2-processor server, or a 40-core/4-processor server. Probably it makes sense to run more benchmarks....
Have you tried splitting the head/tail apart? I'm not really sure why they're in the same structure to begin with -- you're already passing n_producers and n_consumers separately, so you should be able to have separate arrays of the appropriate lengths. I guess it saves one malloc/free, but I don't think that matters.

It might be useful to have a variant that takes n_producers and n_consumers as template parameters -- this would allow constant-size arrays to be used, and would let the compiler apply unrolling optimisations to the iteration loops. And it's reasonable to expect that these will be known at compile time, given the way that thread ids are being set up anyway.
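A sketch of that compile-time variant (hypothetical, not the posted code): the thread counts become template parameters, so the per-thread position arrays become fixed-size members and the compiler can unroll the scans over them.

    #include <cstddef>

    // One published position per thread, padded to its own cache line.
    struct alignas(64) Pos {
        unsigned long pos;
    };

    template <typename T, std::size_t NProducers, std::size_t NConsumers>
    class RingQueue {
        Pos prod_pos_[NProducers];  // producers' published head positions
        Pos cons_pos_[NConsumers];  // consumers' published tail positions
        // ... fixed-capacity ring storage, global head/tail, etc.
    };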
participants (3)

- Alexander Krizhanovsky
- Gavin Lambert
- tim