
hello holger,
The documentation talks a bit about false sharing and to some extent about cacheline alignment to avoid it, but I don't see that reflected in the code to the extent I would expect. Specifically, how do you ensure that a given object (I only looked at ringbuffer) _starts_ on a cacheline boundary?
i am trying to ensure that those parts which are modified by different threads are in different cache lines. however i don't necessarily care if they are at the beginning of the cache line.
I only see this weird padding "idiom" that everyone seems to use, but nothing to prevent a ringbuffer from being placed in the middle of other objects that reside on cachelines that are happily write-allocated by other threads. For instance, what happens for:
ringbuffer<foo> x; ringbuffer<foo> y;
Consider a standard toolchain without fancy optimizations. Wouldn't this normally result in x.read_pos and y.write_pos being allocated on the same cacheline?
in this case one could argue that you should ensure the padding manually :) nevertheless, there is one point that i should probably address: i should enforce that the read index, the write index, and the actual ringbuffer array each occupy different cache lines.
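the layout tim describes could be sketched like this. a minimal illustration only, not boost.lockfree's actual code: it assumes a 64-byte cache line and uses C++11 alignas, whereas the library itself uses the manual char-array padding idiom mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// assumed cache line size; real hardware may differ (e.g. 128 on some CPUs)
constexpr std::size_t cacheline_size = 64;

template <typename T, std::size_t Capacity>
struct alignas(cacheline_size) spsc_ringbuffer_sketch {
    // consumer-owned index, alone on its cache line
    alignas(cacheline_size) std::atomic<std::size_t> read_pos{0};
    // producer-owned index, alone on its cache line
    alignas(cacheline_size) std::atomic<std::size_t> write_pos{0};
    // element storage starts on its own cache line as well
    alignas(cacheline_size) T buffer[Capacity];
};
```

because the struct itself is alignas(cacheline_size), two adjacent instances (like x and y above) can no longer straddle one line, which addresses the "starts on a cacheline boundary" concern as well.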
There also doesn't seem to be a way to override the allocation of memory. For the kind of low latency we (as in Morgan Stanley) are interested in, we may sometimes care about delays from the lazy PTE mechanisms that many operating systems have. If you simply allocate via new[] you may get a few lazily allocated pages from the OS. A 1ms delay for a page fault is something we do care about.
Is there any good way to override the allocation?
if the size of the ringbuffer is specified at runtime, there is currently no way to do this; i should probably add allocator support. however this will only help if your allocator forces the memory regions into physical ram, by using mlock() or the like and by touching them to avoid minor page faults.
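such an allocator could look roughly like the following POSIX-only sketch (allocate_locked/free_locked are hypothetical names, not part of the library): it page-aligns the allocation, pins it with mlock(), and touches every byte once so the kernel populates the page tables up front rather than on first use.

```cpp
#include <sys/mman.h>   // mlock, munlock (POSIX)
#include <cstddef>
#include <cstdlib>      // posix_memalign, free
#include <cstring>      // memset
#include <new>          // std::bad_alloc

// hypothetical helper: allocate memory that will not take lazy-PTE
// minor page faults on first access. assumes 4 KiB pages.
void* allocate_locked(std::size_t bytes, std::size_t page_size = 4096)
{
    void* p = nullptr;
    // page-aligned allocation so mlock() covers whole pages
    if (posix_memalign(&p, page_size, bytes) != 0)
        throw std::bad_alloc();
    // pin the region into physical ram; may need RLIMIT_MEMLOCK headroom
    if (mlock(p, bytes) != 0) {
        free(p);
        throw std::bad_alloc();
    }
    // touch every page now, so later accesses take no minor fault
    std::memset(p, 0, bytes);
    return p;
}

void free_locked(void* p, std::size_t bytes)
{
    munlock(p, bytes);
    free(p);
}
```

note that mlock() fails if the process exceeds its RLIMIT_MEMLOCK limit, so production code would want a more graceful fallback than throwing.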
Are there any performance targets/tests? E.g. for a ringbuffer, I found a test with a variable number of producer and consumers useful, where producers feed well-known data and consumers do almost nothing (e.g. just add the dequeued numbers or something) and see what kind of feed rates can be sustained without the consumer(s) falling behind.
the ringbuffer is a single-producer, single-consumer data structure. if you use multiple producers, it will be corrupted! in general i hesitate to publish any performance numbers, because performance heavily depends on the CPU that is used.
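the kind of test holger describes, restricted to one producer and one consumer, might be sketched like this. the ring here is a minimal hand-rolled SPSC queue for illustration, not boost.lockfree's API: the producer feeds well-known data (0..count-1), the consumer just sums what it dequeues, and the sum verifies nothing was lost or duplicated.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// minimal SPSC ring for the test harness sketch (capacity must be a power of two)
struct spsc_ring {
    static constexpr std::size_t capacity = 1024;
    std::uint64_t buf[capacity];
    std::atomic<std::size_t> head{0};   // consumer index
    std::atomic<std::size_t> tail{0};   // producer index

    bool push(std::uint64_t v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == capacity)
            return false;                           // full
        buf[t % capacity] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(std::uint64_t& v) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                           // empty
        v = buf[h % capacity];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

// run one producer and one consumer; returns the consumer's checksum,
// which should equal count*(count-1)/2 if every element arrived exactly once
std::uint64_t run_spsc_test(std::uint64_t count)
{
    spsc_ring q;
    std::uint64_t sum = 0;
    std::thread producer([&] {
        for (std::uint64_t i = 0; i < count; ++i)
            while (!q.push(i)) { /* spin: consumer is behind */ }
    });
    std::thread consumer([&] {
        std::uint64_t v, received = 0;
        while (received < count)
            if (q.pop(v)) { sum += v; ++received; }
    });
    producer.join();
    consumer.join();
    return sum;
}
```

timing the producer loop (elements pushed per second before the consumer falls behind) would then give the sustained feed rate on a particular CPU.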
Lastly, what's going on with all the atomic code in there? Can I assume that's just implementation detail that overrides things in the current Boost.Atomic lib, and hence ignore it for the review?
boost.lockfree depends on boost.atomic. for this review, boost.atomic should be ignored, and we probably have to decide later whether to postpone the inclusion until boost.atomic is reviewed, or whether i provide a modified version of boost.atomic as an implementation detail. nevertheless, i have a small wrapper which could be used to switch between boost::atomic and std::atomic. unfortunately none of my compilers implements the necessary parts of the atomic<> template ... cheers, tim
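such a wrapper could be as simple as a preprocessor switch pulling either implementation into a detail namespace. a sketch only: the macro name and namespace are made up here, not the library's actual configuration.

```cpp
// hypothetical config macro; define it to prefer std::atomic over boost::atomic
#define LOCKFREE_USE_STD_ATOMIC

#ifdef LOCKFREE_USE_STD_ATOMIC
#include <atomic>
namespace lockfree_detail {
    using std::atomic;
    using std::memory_order_relaxed;
    using std::memory_order_acquire;
    using std::memory_order_release;
}
#else
#include <boost/atomic.hpp>
namespace lockfree_detail {
    using boost::atomic;
    using boost::memory_order_relaxed;
    using boost::memory_order_acquire;
    using boost::memory_order_release;
}
#endif
```

the rest of the library then refers only to lockfree_detail::atomic<>, so swapping the backend touches a single header.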