
hello holger,
The documentation talks a bit about false sharing and to some extent about cacheline alignment to avoid it, but I don't see that reflected in the code to the extent I would expect. Specifically, how do you ensure that a given object (I only looked at ringbuffer) _starts_ on a cacheline boundary?
i am trying to ensure that those parts which are modified by different threads are in different cache lines. however i don't necessarily care if they are at the beginning of the cache line.
I only see this weird padding "idiom" that everyone seems to use, but nothing to prevent a ringbuffer from being placed in the middle of other objects that reside on cachelines that are happily write-allocated by other threads. For instance, what happens for:
ringbuffer<foo> x; ringbuffer<foo> y;
Consider a standard toolchain without fancy optimizations. Wouldn't this normally result in x.read_pos and y.write_pos being allocated on the same cacheline?
in this case one could argue that you should ensure the padding manually :) nevertheless, there is one point that i should probably address: i should enforce that the read index, the write index, and the actual ringbuffer array each occupy different cache lines.
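the layout tim describes could be sketched like this. a minimal illustration only, not boost.lockfree's actual code: it assumes a 64-byte cache line and uses C++11 alignas, whereas the library itself uses the manual char-array padding idiom mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// assumed cache line size; real hardware may differ (e.g. 128 on some CPUs)
constexpr std::size_t cacheline_size = 64;

template <typename T, std::size_t Capacity>
struct alignas(cacheline_size) spsc_ringbuffer_sketch {
    // consumer-owned index, alone on its cache line
    alignas(cacheline_size) std::atomic<std::size_t> read_pos{0};
    // producer-owned index, alone on its cache line
    alignas(cacheline_size) std::atomic<std::size_t> write_pos{0};
    // element storage starts on its own cache line as well
    alignas(cacheline_size) T buffer[Capacity];
};
```

because the struct itself is alignas(cacheline_size), two adjacent instances (like x and y above) can no longer straddle one line, which addresses the "starts on a cacheline boundary" concern as well.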
There also doesn't seem to be a way to override the allocation of memory. For the kind of low latency we (as in Morgan Stanley) are interested in, we may sometimes care about delays from the lazy PTE mechanisms that many operating systems have. If you simply allocate via new[] you may get a few lazily allocated pages from the OS. A 1ms delay for a page fault is something we do care about.
Is there any good way to override the allocation?
if the size of the ringbuffer is specified at runtime, there is currently no way to do this; i should probably add allocator support. however this will only help if your allocator forces the memory regions into physical ram, by using mlock() or the like and by touching them to avoid minor page faults.
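such an allocator could look roughly like the following POSIX-only sketch (allocate_locked/free_locked are hypothetical names, not part of the library): it page-aligns the allocation, pins it with mlock(), and touches every byte once so the kernel populates the page tables up front rather than on first use.

```cpp
#include <sys/mman.h>   // mlock, munlock (POSIX)
#include <cstddef>
#include <cstdlib>      // posix_memalign, free
#include <cstring>      // memset
#include <new>          // std::bad_alloc

// hypothetical helper: allocate memory that will not take lazy-PTE
// minor page faults on first access. assumes 4 KiB pages.
void* allocate_locked(std::size_t bytes, std::size_t page_size = 4096)
{
    void* p = nullptr;
    // page-aligned allocation so mlock() covers whole pages
    if (posix_memalign(&p, page_size, bytes) != 0)
        throw std::bad_alloc();
    // pin the region into physical ram; may need RLIMIT_MEMLOCK headroom
    if (mlock(p, bytes) != 0) {
        free(p);
        throw std::bad_alloc();
    }
    // touch every page now, so later accesses take no minor fault
    std::memset(p, 0, bytes);
    return p;
}

void free_locked(void* p, std::size_t bytes)
{
    munlock(p, bytes);
    free(p);
}
```

note that mlock() fails if the process exceeds its RLIMIT_MEMLOCK limit, so production code would want a more graceful fallback than throwing.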
Are there any performance targets/tests? E.g. for a ringbuffer, I found a test with a variable number of producer and consumers useful, where producers feed well-known data and consumers do almost nothing (e.g. just add the dequeued numbers or something) and see what kind of feed rates can be sustained without the consumer(s) falling behind.
the ringbuffer is a single-producer, single-consumer data structure. if you use multiple producers, it will be corrupted! in general i hesitate to publish any performance numbers, because performance heavily depends on the CPU that is used.
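the kind of test holger describes, restricted to one producer and one consumer, might be sketched like this. the ring here is a minimal hand-rolled SPSC queue for illustration, not boost.lockfree's API: the producer feeds well-known data (0..count-1), the consumer just sums what it dequeues, and the sum verifies nothing was lost or duplicated.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// minimal SPSC ring for the test harness sketch (capacity must be a power of two)
struct spsc_ring {
    static constexpr std::size_t capacity = 1024;
    std::uint64_t buf[capacity];
    std::atomic<std::size_t> head{0};   // consumer index
    std::atomic<std::size_t> tail{0};   // producer index

    bool push(std::uint64_t v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == capacity)
            return false;                           // full
        buf[t % capacity] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(std::uint64_t& v) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                           // empty
        v = buf[h % capacity];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

// run one producer and one consumer; returns the consumer's checksum,
// which should equal count*(count-1)/2 if every element arrived exactly once
std::uint64_t run_spsc_test(std::uint64_t count)
{
    spsc_ring q;
    std::uint64_t sum = 0;
    std::thread producer([&] {
        for (std::uint64_t i = 0; i < count; ++i)
            while (!q.push(i)) { /* spin: consumer is behind */ }
    });
    std::thread consumer([&] {
        std::uint64_t v, received = 0;
        while (received < count)
            if (q.pop(v)) { sum += v; ++received; }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

timing the producer loop (elements pushed per second before the consumer falls behind) would then give the sustained feed rate on a particular CPU.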
Lastly, what's going on with all the atomic code in there? Can I assume that's just implementation detail that overrides things in the current Boost.Atomic lib, and hence ignore it for the review?
boost.lockfree depends on boost.atomic. for this review, boost.atomic should be ignored, and we probably have to decide later whether to postpone the inclusion until boost.atomic is reviewed, or whether i provide a modified version of boost.atomic as an implementation detail. nevertheless, i have a small wrapper which could be used to switch between boost::atomic and std::atomic. unfortunately none of my compilers implements the necessary parts of the atomic<> template ... cheers, tim
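such a wrapper could be as simple as a preprocessor switch pulling either implementation into a detail namespace. a sketch only: the macro name and namespace are made up here, not the library's actual configuration.

```cpp
// hypothetical config macro; define it to prefer std::atomic over boost::atomic
#define LOCKFREE_USE_STD_ATOMIC

#ifdef LOCKFREE_USE_STD_ATOMIC
#include <atomic>
namespace lockfree_detail {
    using std::atomic;
    using std::memory_order_relaxed;
    using std::memory_order_acquire;
    using std::memory_order_release;
}
#else
#include <boost/atomic.hpp>
namespace lockfree_detail {
    using boost::atomic;
    using boost::memory_order_relaxed;
    using boost::memory_order_acquire;
    using boost::memory_order_release;
}
#endif
```

the rest of the library then refers only to lockfree_detail::atomic<>, so swapping the backend touches a single header.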