[Review] Lockfree review starts today, July 18th

Hi all,

The review of Tim Blechmann's Boost.Lockfree library starts today, July 18th 2011, and will end on July 28th. I really hope to see your vote and your participation in the discussions on the Boost mailing lists!

-----------------------------------------------------

About the library:

Boost.Lockfree provides implementations of lock-free data structures. Lock-free data structures can be accessed by multiple threads without the need for blocking synchronization primitives such as guards. They can be used in real-time systems, where blocking algorithms may lead to high worst-case execution times, to avoid priority inversion, or to increase scalability on multi-processor machines.

The following data structures are provided:

- boost::lockfree::fifo, a lock-free fifo queue
- boost::lockfree::stack, a lock-free stack
- boost::lockfree::atomic_int, an atomic integer class

The library is accessible from here: http://tim.klingt.org/boost_lockfree.tag.gz.

Boost.Lockfree depends on C++0x atomics, which are not yet well supported by current compilers. For this reason Boost.Lockfree depends on the *unreviewed* Boost.Atomic library, which emulates C++0x atomics for C++98. This review is about Boost.Lockfree and not about Boost.Atomic, although the latter is included in the tarball and the git repository. If Boost.Lockfree is accepted, it won't be merged into trunk before Boost.Atomic is accepted.

---------------------------------------------------

Please always state in your review whether you think the library should be accepted as a Boost library! Additionally, please consider giving feedback on the following general topics:

- What is your evaluation of the design?
- What is your evaluation of the implementation?
- What is your evaluation of the documentation?
- What is your evaluation of the potential usefulness of the library?
- Did you try to use the library? With what compiler? Did you have any problems?
- How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
- Are you knowledgeable about the problem domain?

Regards
Hartmut
Review Manager

small correction (it seems i have copied an older description)
- boost::lockfree::atomic_int, an atomic integer class
boost.lockfree does not include an integer class. the line should read:

- boost::lockfree::ringbuffer, a wait-free single-producer/single-consumer ringbuffer

sorry for the confusion,
tim

The URL for the download had a typo. I think the URL should be:

http://tim.klingt.org/boost_lockfree.tar.gz

not:

http://tim.klingt.org/boost_lockfree.tag.gz

Thanks,
Brendon.

On 7/18/2011 8:32 AM, Hartmut Kaiser wrote:
> Boost.Lockfree depends on C++0x atomics, which are not yet well supported by current compilers. For this reason Boost.Lockfree depends on the *unreviewed* Boost.Atomic library, which emulates C++0x atomics for C++98. This review is about Boost.Lockfree and not about Boost.Atomic, although the latter is included in the tarball and the git repository.
>
> If Boost.Lockfree is accepted, it won't be merged into trunk before Boost.Atomic is accepted.

That's good to hear. So, um, when is Boost.Atomic going to be accepted? :)

I vote yes to the library, with the trivial condition that the interface design issues I bring up below are considered. This is my first review and vote: w00t.

I've been using this library in its unofficial state since February. Our applications are command-line multithreaded analysis tools. We routinely run with 8 threads and have run on more (but not since the lockfree modification). I have a mix of questions and comments.

> Additionally please consider giving feedback on the following general topics:
>
> - What is your evaluation of the design?

I'm happy with the design for the most part. It's reasonably consistent with the queue and stack adaptors from the standard library. It can go a little farther though:

- size()/empty(): These should be available and not just "for debugging." I'm aware that using these doesn't make a lot of sense in multithreaded programming (i.e. just because the queue is empty doesn't mean the producer is done), but the std adaptors have them so I think the lockfree versions should as well. A size() can be quite useful for displaying an (estimated) asynchronous counter, especially for the case of a filled-up-front fifo with multiple consumers. The front()/back() functions are understandably omitted though.

- range constructors: it's often convenient to construct a container from another container; since the lockfree fifo is a real container and not just an adaptor, it should have the standard container constructors.

The only other things that bother me are the class template specializations. The documentation had me thinking for several minutes that the fifo.hpp reference was duplicated. It wasn't until I read the description that I understood the specialization. Which got me to thinking about a way to eliminate that redundancy: instead of a class template specialization, can dequeue's parameter be a template that must be constructible from T? That way you could have:

fifo<Foo*> q;
q.enqueue(new Foo);
q.enqueue(new Foo);
Foo* rawPtr;
q.dequeue(rawPtr);
shared_ptr<Foo> sharedPtr;
q.dequeue(sharedPtr);

This would also support different pointer types, e.g. tr1::shared_ptr or std::unique_ptr or even some UDT.

> - What is your evaluation of the implementation?

I'm not really knowledgeable enough to evaluate it. It works well in my multithreaded application.

> - What is your evaluation of the documentation?

There's a lot of nitpicking to be done here. I'll say up front that the examples are clean, easy to understand, and pleasantly short.

What is the difference between guard and lock? Since this is lockfree and not guardfree, the docs could just have "lock" :)

Why is the is_lock_free() function needed? If it is needed, why can't it be static? What is a real-world use case for it? And doesn't the caching_freelist_t type count as lockfree even though it may lock for memory allocations?

Some fun to be had with Search and Replace.
Some of these occur more than once:

- "According Maurice Herlihy distinguishes" -> What is the wording supposed to be here? Accordingly?
- "to ensure the correctness" -> "to ensure thread-safety"
- "with no interaction without any direct interaction" -> "without direct interaction"
- "atomic operations, specific CPU instructions, which are executed atomically." -> "atomic operations (specific CPU instructions executed without interruption)."
- "Not every hardware supports" -> "Not all hardware supports"
- loosing -> losing
- lock-free' and wait-free' -> are these supposed to be quoted? Please don't use `
- "A mentioned earlier" -> "As mentioned earlier"
- "an interface, which is not compatible to the stl containers." -> "an interface which is not compatible with standard containers."
- "produced and consumed by 2 threads each" -> "produced and consumed by 4 threads each"
- "by 2 separate threads" -> "by separate threads"
- "Enqueueing may fail, if the ringbuffer is full." -> "Enqueueing will fail if the ringbuffer is full."
- "[begin, end[" -> "[begin, end)"
- poping -> popping
- os -> OS
- "struct caching_freelist_t" -> "caching_freelist_t" (with quotes, bold, and/or monospace font); same for all uses of "struct x" or "x_t" in proportional font.

Screwed up quotes and formatting (again with the `):

> The implementations are text-book' implementations of well-known data structures. the queue is based on a paper of Michael Scott and Maged Michael, stack and ringbuffer are considered as folklore' and are implemented in several open-source projects.

No hyphen in textbook.

In Design & Implementation, the final list of bullet points should have a preface like "The implementation of lockfree containers comes with some caveats:"

RISC is capitalized. Perhaps list some examples of unsupported RISC processors?

Make hyphenation of lock-free consistent. It's fine to have lockfree in code, but prose should always use lock-free (or never).

In reference.html, why are all three headers linked but only fifo.hpp is displayed? The hyperlinks are screwed up. When I click on fifo.hpp I want to go to the full page, not the class declaration. For that matter, what is the purpose of having a page just for the class declaration and its specialization? If you want to show class declarations, show them all on one page (with links to the full pages).

What does it mean for a fifo node to be "lockfree"?

Why is it impossible for C++0x atomics to provide "a completely accurate implementation"? I think this was already asked and answered but it should be in the docs. And it's especially strange given your leading sentence "Boost.Lockfree depends on C++0x atomics". I'm probably missing something.

Why is dequeue_unsafe unsafe if the platform supports atomics?

Ringbuffer methods 4 & 5 don't have "Returns:" documentation. For method 5, is the array size supposed to be passed in as the template parameter? Rationale?

In method 8 of the ringbuffer specialization, 't' should be 'ret'.

What is the ringbuffer specialization for? Why have both the template-fixed-size and constructor-fixed-size versions?

The array methods for ringbuffer need examples. I actually have no idea how to use them by looking at the method signatures.

How is the range enqueue method non-blocking?

"While this function is thread-safe, it only guarantees that at some point during the execution of the function the stack has been empty." This definition is kind of confusing.
Perhaps "May return true if the queue/stack/ringbuffer was empty at some point during the call to empty(). It is rarely practical to use this value in program logic." > - What is your evaluation of the potential usefulness of the library? The useful producer-consumer paradigm is easier and simpler using the lockfree queue. So yes, it's useful. > - Did you try to use the library? With what compiler? Did you have any problems? We build and run our applications primarily with x86_64/GCC 4.1.2 and x86/MSVC 9. > - How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? I spent a couple hours proof-reading the documentation and another to write the review. Which is probably more time than it actually took me to switch from my lock-based multi-consumer queue to the lockfree fifo. :) > - Are you knowledgeable about the problem domain? Not enough to critique the implementation, but I can ask plenty of questions. :) Thanks for the very useful library Tim (and Helge)! -Matt

hi matthew,
If Boost.Lockfree is accepted, it won't be merged into trunk before Boost.Atomic is accepted.
That's good to hear. So, um, when is Boost.Atomic going to be accepted? :)
boost.atomic should be treated on its own. personally i would encourage a fast-track review, because it is an implementation of a c++0x feature, so there is no real need to evaluate the interface, for example ;)
- size()/empty(): These should be available and not just "for debugging." I'm aware that using these doesn't make a lot of sense in multithreaded programming (i.e. just because the queue is empty doesn't mean the producer is done), but the std adaptors have them so I think the lockfree versions should as well. A size() can be quite useful for displaying an (estimated) asynchronous counter, especially for the case of a filled-up-front fifo with multiple consumers. The front()/back() functions are understandably omitted though.
afaict, empty() cannot be implemented correctly for the fifo (unless using an atomic counter). size() would require an atomic integer (read: one more atomic operation, which will probably be rather expensive) and cannot be implemented accurately (one can only give a lower or upper bound).
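For illustration, the compromise described above might look like the following sketch (the names are hypothetical, not part of the library): one extra atomic counter, updated once per operation, whose value is only ever a possibly stale bound.

    #include <atomic>

    std::atomic<long> approx_size(0);

    // producer, after every successful enqueue:
    void on_enqueue() { approx_size.fetch_add(1, std::memory_order_relaxed); }

    // consumer, after every successful dequeue:
    void on_dequeue() { approx_size.fetch_sub(1, std::memory_order_relaxed); }

    // readers only ever see a snapshot that may already be stale,
    // i.e. a lower or upper bound rather than an exact size:
    long estimated_size() { return approx_size.load(std::memory_order_relaxed); }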
- range constructors: it's often convenient to construct a container from another container; since the lockfree fifo is a real container and not just an adaptor, it should have the standard container constructors.
here again i hesitate: stack and fifo use an internal freelist, which can be configured via a policy to be constant-sized (required for real-time systems, where you cannot allocate memory). range constructors would therefore also require an additional argument to specify the number of nodes of the freelist.
The only other things that bother me are the class template specializations. The documentation had me thinking for several minutes that the fifo.hpp reference was duplicated. It wasn't until I read the description that I understood the specialization. Which got me to thinking about a way to eliminate that redundancy: instead of a class template specialization, can dequeue's parameter be a template that must be constructible from T? That way you could have:

fifo<Foo*> q;
q.enqueue(new Foo);
q.enqueue(new Foo);
Foo* rawPtr;
q.dequeue(rawPtr);
shared_ptr<Foo> sharedPtr;
q.dequeue(sharedPtr);

This would also support different pointer types, e.g. tr1::shared_ptr or std::unique_ptr or even some UDT.
this is an interesting idea! however it cannot be used to dequeue to a scoped_ptr. although i must admit that dequeueing to a scoped_ptr is debatable ...
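A rough sketch of how the suggested member-template dequeue could be implemented (hypothetical, not the reviewed API; dequeue_raw stands in for the real lock-free pop):

    template <typename T>
    class fifo
    {
        bool dequeue_raw(T & t); // the actual lock-free pop (not shown)

    public:
        bool enqueue(T const & t);

        // U only needs to be constructible from T, so for a fifo<Foo*>
        // both Foo* and shared_ptr<Foo> work as dequeue targets
        template <typename U>
        bool dequeue(U & ret)
        {
            T t;
            if (!dequeue_raw(t))
                return false;
            ret = U(t);
            return true;
        }
    };

Note that the `ret = U(t)` assignment is also what rules out scoped_ptr, since scoped_ptr is not assignable.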
- What is your evaluation of the documentation?
i won't comment on all of your suggestions: most of them i'll simply agree with (i am not a native speaker, and the docs haven't been proofread, as i rewrote them after the boostcon pre-review)
What is the difference between guard and lock? Since this is lockfree and not guardfree, the docs could just have "lock" :)
yes and no: lock-free is not the same as non-blocking (according to the terminology of herlihy, which i am trying to follow). 'obstruction-free' data structures would not be lock-free, but would not use locks either.
Why is the is_lock_free() function needed? If it is needed, why can't it be static? What is a real-world use case for it? And doesn't the caching_freelist_t type count as lockfree even though it may lock for memory allocations?
atomic operations are not necessarily lock-free. boost.lockfree requires double-word compare-and-swap, which is not available on all architectures; e.g. ppc does not provide it. c++0x atomics define atomic<>::is_lock_free as a non-static member function (which is reasonable, because on some architectures a specific alignment is required, so atomic<> instances would be lock-free or not depending on their memory address). targeting c++0x, it cannot be static. even worse: to give a fully correct answer, we would have to check every internal node, which is not possible if the internal nodes are allocated dynamically. and it is true that the caching_freelist_t may lock during memory allocations. so maybe instead of a boolean, i should return an enum with the states 'lockfree', 'blocking_on_allocation' and 'blocking_on_atomic_operation' ...
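To make the per-instance point concrete, a small c++0x sketch (std::atomic is assumed here; the two-pointer struct stands in for a double-word payload that may need DCAS):

    #include <atomic>
    #include <cstdio>

    struct dword { void * a; void * b; }; // double-word payload

    int main()
    {
        std::atomic<dword> tagged;  // may fall back to a lock without DCAS
        std::atomic<int>   counter; // lock-free on virtually all targets

        // is_lock_free() is a non-static member: the answer can depend
        // on the alignment of this particular instance
        std::printf("dword: %d, int: %d\n",
                    (int)tagged.is_lock_free(),
                    (int)counter.is_lock_free());
    }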
In reference.html, why are all three headers linked but only fifo.hpp is displayed? The hyperlinks are screwed up. When I click on fifo.hpp I want to go to the full page, not the class declaration. For that matter, what is the purpose of having a page just for the class declaration and its specialization? If you want to show class declarations, show them all on one page (with links to the full pages).
well, this is what quickbook/doxygen generates by default. i'd prefer to list all headers on reference.html, but i'd need some advice on how to achieve this.
What does it mean for a fifo node to be "lockfree"?
Why is it impossible for C++0x atomics to provide "a completely accurate implementation"? I think this was already asked and answered but it should be in the docs. And it's especially strange given your leading sentence "Boost.Lockfree depends on C++0x atomics". I'm probably missing something.
as stated above: is_lock_free is a per-instance member function of atomic<>. i am not sure if i really should describe the reasons in detail, because it would go quite deep into the details of the implementation.
Why is dequeue_unsafe unsafe if the platform supports atomics?
the _unsafe operations are unsafe because they do not use atomics. the reason for this is performance: atomics are slow.
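A hedged usage sketch of what that buys (enqueue_unsafe is assumed here by analogy with dequeue_unsafe; check the reference for the exact names):

    #include <boost/lockfree/fifo.hpp>

    void prefill(boost::lockfree::fifo<int> & q)
    {
        // single-threaded setup phase: no other thread touches q yet,
        // so the non-atomic variant is safe and cheaper
        for (int i = 0; i != 1024; ++i)
            q.enqueue_unsafe(i);
        // once the worker threads are started, only the atomic
        // enqueue()/dequeue() may be used
    }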
What is the ringbuffer specialization for? Why have both the template-fixed-size and constructor-fixed-size versions?
i want to provide the same data structure for specifying the size at compile-time and at run-time. the main reason for compile-time is that, using a power of two, the index math can use bitmasks for performance reasons.
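The index arithmetic in question, sketched (illustrative names, not the library's internals):

    #include <cstddef>

    // compile-time, power-of-two capacity: wrap-around is a single AND
    template <std::size_t Size> // Size == 2^n
    std::size_t next_index(std::size_t i)
    {
        return (i + 1) & (Size - 1);
    }

    // run-time capacity: needs a modulo (or a compare-and-reset branch)
    std::size_t next_index(std::size_t i, std::size_t size)
    {
        return (i + 1) % size;
    }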
The array methods for ringbuffer need examples. I actually have no idea how to use them by looking at the method signatures.
ok, will try to add
How is the range enqueue method non-blocking?
it does not enqueue all objects, but only tries to enqueue the first N objects. will improve the docs in this part.

thanks for your review, i will try to adapt the docs according to your suggestions.

cheers, tim

On 7/18/11 6:32 AM, Hartmut Kaiser wrote:
Hi all,
The review of Tim Blechmann's Boost.Lockfree library starts today, July 18th 2011, and will end on July 28th. I really hope to see your vote and your participation in the discussions on the Boost mailing lists!
How does one build the documentation? The build system changes to the libs/lockfree/doc directory and then runs "bjam", but this just results in the following error message:

Jamfile.v2:11: in modules.load
rule doxygen unknown in module Jamfile</Users/gcross/Downloads/boost_lockfree/libs/lockfree/doc>.
/opt/local/share/boost-build/build/project.jam:312: in load-jamfile
/opt/local/share/boost-build/build/project.jam:68: in load
/opt/local/share/boost-build/build/project.jam:170: in project.find
/opt/local/share/boost-build/build-system.jam:248: in load
/opt/local/share/boost-build/kernel/modules.jam:261: in import
/opt/local/share/boost-build/kernel/bootstrap.jam:132: in boost-build
/opt/local/share/boost-build/boost-build.jam:1: in module scope

I have no experience with Boost-Build so I don't have a sense of what could go wrong to cause an error like this. It might also be a good idea to post the documentation somewhere on the internet so we can browse through it without having to hack our way through the build system first.

Thanks!
Greg

It might also be a good idea to post the documentation somewhere on the internet so we can browse through it without having to hack our way through the build system first.
The documentation is online at http://tim.klingt.org/boost_lockfree/. There used to be a link on the Boost Review Schedule page [1], but since the review started it's gone (why is there a "Link" column in the "Schedule" table but not in the "Past Review Results and Milestones" table?)

Regards,
Nate.

[1] http://www.boost.org/community/review_schedule.html

It might also be a good idea to post the documentation somewhere on the internet so we can browse through it without having to hack our way through the build system first.
The documentation is online at http://tim.klingt.org/boost_lockfree/.
the rendered html docs are also part of the tarball ... but i agree, generating the docs with bjam is not really straightforward.

cheers, tim

Hi Tim,

Here is a new set of questions (sorry, they may have been more relevant in a pre-review; I missed it if there was one). I did not read nor test the code yet, so forgive me if my questions are stupid.

** On lock-free concepts in general

- Your definition of obstruction-free is strange; I would have thought that a lock-free structure would be quite the opposite: "if a concurrent operation is guaranteed to be finished in a finite number of steps, even if another concurrent operation interferes". I do not understand the guarantee that obstruction-freedom gives otherwise. I think this section of the documentation should be expanded and give more details.

- Is there some kind of inclusion relationship between lock-free, wait-free, and obstruction-free? (I am thinking something like: everything that is wait-free is also lock-free, and everything lock-free is also obstruction-free.)

- It would be good to define "guard" in the documentation, as I would have considered memory barriers (required for atomic operations) to be guards.

** About performance

- You don't mention in the "motivations" section that lock-free data structures are probably more performant than their locked counterparts. Though it is obvious to some, it is not to all, and to be honest, I'd prefer you mention at this stage the typical usage of lock-free structures (for instance, are there situations where locked data structures perform better?).

- You don't give any performance comparisons with a normal, "locked" stack. It would be interesting to do. It should nearly be a test case ;).

- Is there a way to force the lock-free data structures to use a locked implementation? I'll probably want to compare the two, and it could be useful for the test cases.

** About data structures

- What would be a typical use case of a concurrent stack? Since both consumers and writers are accessing the same end of the structure, won't it trigger a lot of cache-sharing and have bad performance? I had a look: .NET has one, so I guess there is a point, but I can't figure it out on my own.

- Is it feasible to implement a lock-free pool of objects? Is it relevant? (I am thinking of something vaguely similar to .NET's ConcurrentBag.)

** Others

- Enqueue may block if memory needs to be allocated, but wouldn't it be possible to avoid this by specifying an optional maximum size as a template parameter or constructor argument?

- Documentation: the class synopsis of the data structures should be accessible from the "Reference" page. (That's where I instinctively looked for them.)

Thanks for your work, it overall looks very promising,
Julien

hi julien,
** On lock-free concepts in general
- Your definition of obstruction-free is strange; I would have thought that a lock-free structure would be quite the opposite: "if a concurrent operation is guaranteed to be finished in a finite number of steps, even if another concurrent operation interferes". I do not understand the guarantee that obstruction-freedom gives otherwise. I think this section of the documentation should be expanded and give more details.
i am basically taking the definitions from herlihy & shavit. but i can try to improve this part of the docs.
- Is there some kind of inclusion relationship between lock-free, wait-free, and obstruction-free? (I am thinking something like: everything that is wait-free is also lock-free, and everything lock-free is also obstruction-free.)
exactly.
- It would be good to define "guard" in the documentation as I would have considered memory barriers (required for atomic operations) to be guards.
yes, possibly. i am trying to avoid the term 'lock', because, using the terminology of wait-free/lock-free/obstruction-free, there are data structures that do not use locks, but are not lock-free.
** About performance
- You don't mention in the "motivations" section that lock-free data structures are probably more performant than their locked counterparts. Though it is obvious to some, it is not to all, and to be honest, I'd prefer you mention at this stage the typical usage of lock-free structures (for instance, are there situations where locked data structures perform better?).
there is a section about 'performance of non-blocking data structures'. there i mention that in some cases blocking data structures can outperform their lock-free equivalents.
- You don't give any performance comparisons with a normal, "locked" stack. It would be interesting to do. It should nearly be a test case ;).
i basically want to avoid all discussion about performance and performance characteristics in the docs, because it heavily depends on the design of the CPU, the instruction set, and probably the connection between the cpu cores. so my advice is to test with the specific workload. synthetic benchmarks with 8 threads hammering a stack (which basically uses one memory location) will most likely be slower than a synthetic benchmark using locks because of contention (threads interfering with each other).
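For readers who want to follow that advice, a minimal sketch of such a workload test (it assumes the enqueue/dequeue interface shown earlier in the thread, uses c++0x <thread>/<chrono>, and measures a single-producer/single-consumer best case only):

    #include <boost/lockfree/fifo.hpp>
    #include <thread>
    #include <chrono>
    #include <cstdio>

    int main()
    {
        boost::lockfree::fifo<int> q;
        int const n = 10000000;

        auto start = std::chrono::steady_clock::now();

        std::thread producer([&] {
            for (int i = 0; i != n; ++i)
                q.enqueue(i); // default freelist grows on demand
        });
        std::thread consumer([&] {
            int v;
            for (int i = 0; i != n; ++i)
                while (!q.dequeue(v)) // spin until an element arrives
                    ;
        });

        producer.join();
        consumer.join();

        std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start;
        std::printf("%.2f million transfers/s\n", n / dt.count() / 1e6);
    }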
- Is there a way to force the lock-free data structures to use a locked implementation? I'll probably want to compare the two, and it could be useful for the test cases.
this is not really feasible: if you want to use a blocking implementation, you will use a different layout for the data structure (e.g. a std::vector using a contiguous memory region instead of the node-based lock-free data structures).
** About data structures
- What would be a typical use case of a concurrent stack? Since both consumers and writers are accessing the same end of the structure, won't it trigger a lot of cache-sharing and have bad performance? I had a look: .NET has one, so I guess there is a point, but I can't figure it out on my own.
the fifo requires more atomic operations than the stack (the queue has to update both a head and a tail pointer, while the stack only compare-and-swaps its top).
- Is it feasible to implement a lock-free pool of objects? Is it relevant? (I am thinking of something vaguely similar to .NET's ConcurrentBag.)
one can implement a lock-free linked list, which serves as a set-like data structure. the main reason why there is none in boost.lockfree is that i never needed one myself. but it may be a good addition in a future version.
- Enqueue may block if memory needs to be allocated, but wouldn't it be possible to avoid this by specifying an optional maximum size as a template parameter or constructor argument?
there are two ways:

* there is a constructor, which reserves a number of nodes for the freelist
* one can use the freelist_t tag to configure the stack/fifo to avoid any memory allocations; this is mainly required for hard-realtime systems
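Roughly, in code (the exact signatures may differ from the reviewed tarball; 'message' is a placeholder type):

    // 1) pre-reserve nodes so the hot path does not have to allocate
    //    (the constructor mentioned above):
    boost::lockfree::fifo<message *> q(4096);

    // 2) select a fixed-size, non-allocating freelist via the freelist_t
    //    policy tag, for hard-realtime systems -- see the freelist_t
    //    reference for the exact spelling of the policy parameter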
- Documentation: the class synopsis of the data structures should be accessible from the "Reference" page. (That's where I instinctively looked for them.)
true ... but i will need some help from a quickbook/doxygen master for that :/

cheers, tim

----- Original Message ----
From: Tim Blechmann <tim@klingt.org>
To: boost@lists.boost.org
Sent: Tue, July 19, 2011 9:10:06 PM
Subject: Re: [boost] [Review] Lockfree review starts today, July 18th
hi julien,
** On lock-free concepts in general
- Your definition of obstruction-free is strange; I would have thought that a lock-free structure would be quite the opposite: "if a concurrent operation is guaranteed to be finished in a finite number of steps, even if another concurrent operation interferes". I do not understand the guarantee that obstruction-freedom gives otherwise. I think this section of the documentation should be expanded and give more details.
i am basically taking the definitions from herlihy & shavit. but i can try to improve this part of the docs.
Actually, reading these notes I opened the Herlihy & Shavit book "The Art of Multiprocessor Programming" to check whether this is the correct definition, and then I saw that you refer to the book. Which actually gives the library even more of a +1.
** About performance
- You don't mention in the "motivations" section that lock-free data structures are probably more performant than their locked counterparts. Though it is obvious to some, it is not to all, and to be honest, I'd prefer you mention at this stage the typical usage of lock-free structures (for instance, are there situations where locked data structures perform better?).
there is a section about 'performance of non-blocking data structures'. there i mention that in some cases blocking data structures can outperform their lock-free equivalents.
I must strongly support this point. Lock-free is not always the fastest algorithm, but in some cases it may be very important to ensure conditions like "at-least-one-succeeds-eventually".

Artyom

The review of Tim Blechmann's Boost.Lockfree library starts today, July 18th 2011, and will end on July 28th. I really hope to see your vote and your participation in the discussions on the Boost mailing lists!
I had a quick look at the code and read the first few pages of the docs, but haven't built or run anything yet. Still, a few questions:

The documentation talks a bit about false sharing and to some extent about cacheline alignment to achieve that, but I don't see that to the extent I would expect in the code. Specifically, how do you ensure that a given object (I only looked at ringbuffer) _starts_ on a cacheline boundary? I only see this weird padding "idiom" that everyone seems to use, but nothing to prevent a ringbuffer from being put in the middle of other objects that reside on cachelines that are happily write-allocated by other threads. For instance, what happens for:

ringbuffer<foo> x;
ringbuffer<foo> y;

Consider a standard toolchain without fancy optimizations. Wouldn't this normally result in x.read_pos and y.write_pos being allocated on the same cacheline? I have been bitten quite a few times by compilers not implementing explicit alignment specification in a way you would expect (specifically for objects with automatic storage duration).

There also doesn't seem to be a way to override the allocation of memory. For the kind of low latency we (as in Morgan Stanley) are interested in, we may sometimes care about delays from lazy PTE mechanisms that many operating systems have. If you just simply allocate via new[] you may get a few lazily allocated pages from the OS. A 1ms delay for a page fault is something we do care about. Is there any good way to override the allocation?

Are there any performance targets/tests? E.g. for a ringbuffer, I found a test with a variable number of producers and consumers useful, where producers feed well-known data and consumers do almost nothing (e.g. just add the dequeued numbers or something), and see what kind of feed rates can be sustained without the consumer(s) falling behind.

Lastly, what's going on with all the atomic code in there? Can I assume that's just implementation details that override things in the current Boost.Atomic lib and hence ignore it for the review?

Thanks!
-hg

hello holger,
The documentation talks a bit about false sharing and to some extent about cacheline alignment to achieve that, but I don't see that to the extent I would expect in the code. Specifically, how do you ensure that a given object (I only looked at ringbuffer) _starts_ on a cacheline boundary?
i am trying to ensure that those parts which are modified by different threads are in different cache lines. however, i don't necessarily care if they are at the beginning of a cache line.
I only see this weird padding "idiom" that everyone seems to use, but nothing to prevent a ringbuffer from being put in the middle of other objects that reside on cachelines that are happily write-allocated by other threads. For instance, what happens for:
ringbuffer<foo> x; ringbuffer<foo> y;
Consider a standard toolchain without fancy optimizations. Wouldn't this normally result in x.read_pos and y.write_pos being allocated on the same cacheline?
in this case one could argue, that you should ensure the padding manually :) nevertheless, there is one point that i should probably address: i should enforce that neither read index, write index and the actual ringbuffer array use different cache lines.
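One way to spell that manual padding, as a sketch (c++11 alignas is used for brevity; 64 bytes is a common x86 cacheline size, and the ringbuffer template parameters here are illustrative):

    // 'foo' is a placeholder element type
    struct alignas(64) padded_ringbuffer
    {
        boost::lockfree::ringbuffer<foo, 1024> rb;
    };

    // alignas(64) also rounds sizeof up to a multiple of 64, so two
    // consecutive objects can no longer share a cacheline:
    padded_ringbuffer x, y;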
There also doesn't seem to be a way to override the allocation of memory. For the kind of low latency we (as in Morgan Stanley) are interested in, we may sometimes care about delays from lazy PTE mechanisms that many operating systems have. If you just simply allocate via new[] you may get a few lazily allocated pages from the OS. A 1ms delay for a page fault is something we do care about.
Is there any good way to override the allocation?
if the size of the ringbuffer is specified at runtime, there is currently no way to do this; i should probably add allocator support. however, this will only help if your allocator forces the memory regions into physical ram by using mlock() or the like, and by touching them to avoid minor page faults.
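A posix sketch of what such an allocator might do (assumptions: mlock() is available and zeroing the region is enough to fault every page in):

    #include <sys/mman.h>
    #include <cstdlib>
    #include <cstring>

    void * allocate_pinned(std::size_t bytes)
    {
        void * p = std::malloc(bytes);
        if (!p)
            return 0;
        ::mlock(p, bytes);        // keep the region in physical RAM
        std::memset(p, 0, bytes); // touch every page now, not on first use
        return p;
    }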
Are there any performance targets/tests? E.g. for a ringbuffer, I found a test with a variable number of producers and consumers useful, where producers feed well-known data and consumers do almost nothing (e.g. just add the dequeued numbers or something), and see what kind of feed rates can be sustained without the consumer(s) falling behind.
the ringbuffer is a single-producer, single-consumer data structure. if you use multiple producers, it will be corrupted! in general i hesitate to add any performance numbers, because the performance heavily depends on the CPU that is used.
Lastly, what's going on with all the atomic code in there? Can I assume that's just implementation details that override things in the current Boost.Atomic lib and hence ignore it for the review?
boost.lockfree depends on boost.atomic. for this review, boost.atomic should be ignored, and we probably have to decide later whether we postpone the inclusion until boost.atomic is reviewed, or whether i provide a modified version of boost.atomic as an implementation detail. nevertheless, i have a small wrapper which could be used to switch between boost::atomic and std::atomic. unfortunately none of my compilers implements the necessary parts of the atomic<> template ...

cheers, tim

Thanks Tim,
ringbuffer<foo> x; ringbuffer<foo> y;
Consider a standard toolchain without fancy optimizations. Wouldn't this normally result in x.read_pos and y.write_pos being allocated on the same cacheline?
in this case one could argue, that you should ensure the padding manually :)
That seems counter-intuitive to me. If you expect users to do so that should probably be spelled out clearly.
nevertheless, there is one point that i should probably address: i should enforce that neither read index, write index and the actual ringbuffer array use different cache lines.
I'm having trouble understanding this. Do you mean they should go on different cachelines?
Is there any good way to override the allocation?
if the size of the ringbuffer is specified at runtime, there is currently no way to do this; i should probably add allocator support. however, this will only help if your allocator forces the memory regions into physical ram by using mlock() or the like, and by touching them to avoid minor page faults.
We usually just touch the memory and assume it will never be paged out (i.e. you just have a box with enough memory for your stuff and you don't run weird things on it). Large pages are usually implicitly locked on most operating systems -- so that's another cheap option.
the ringbuffer is a single-producer, single-consumer data structure. if you use multiple producers, it will be corrupted!
My bad -- sorry, haven't read the docs or code carefully.
in general i hesitate to add any performance numbers, because the performance heavily depends on the CPU that is used.
And setting up a system for low latencies isn't exactly trivial these days either. Still, I have found a best-case comparison to be usually useful, especially when you can just rerun the test in your own setup. Say your consumer does nothing -- what's the highest produce rate you can get, etc. And yes, I understand that these things are highly dependent on the microarchitecture and OS & baseboard configuration.

Thanks!
-hg

ringbuffer<foo> x; ringbuffer<foo> y;
Consider a standard toolchain without fancy optimizations. Wouldn't this normally result in x.read_pos and y.write_pos being allocated on the same cacheline?
in this case one could argue, that you should ensure the padding manually :)
That seems counter-intuitive to me. If you expect users to do so that should probably be spelled out clearly.
well, i think this is not really a common case. if one wants to be sure, that there is absolutely no false sharing, one would have to pad the whole class to ensure that nothing before and nothing after the data structure will be placed in the same cache line. but i think going that far is overkill ...
nevertheless, there is one point that i should probably address: i should enforce that neither read index, write index and the actual ringbuffer array use different cache lines.
I'm having trouble understanding this. Do you mean they should go on different cachelines?
yes
the ringbuffer is a single-producer, single-consumer data structure. if you use multiple producers, it will be corrupted!
My bad -- sorry haven't read the docs or code carefully.
i've now mentioned it explicitly in the reference section of each function.

cheers, tim

in this case one could argue, that you should ensure the padding manually :)
That seems counter-intuitive to me. If you expect users to do so that should probably be spelled out clearly.
well, i think this is not really a common case. if one wants to be sure, that there is absolutely no false sharing, one would have to pad the whole class to ensure that nothing before and nothing after the data structure will be placed in the same cache line. but i think going that far is overkill ...
Interesting -- I would have expected this to be the common case and the cost of cacheline alignment to be relatively minor for a ringbuffer. Anyway, I'll see if I can take a closer look.

Thanks!
-hg
participants (9)
- Artyom Beilis
- Brendon Costa
- Gregory Crosswhite
- Grund, Holger
- Hartmut Kaiser
- Julien Nitard
- Matthew Chambers
- Nathan Ridge
- Tim Blechmann