
thanks for your comments!
First of all I think that having a library of lock-free algorithms/containers would be greatly beneficial to a great number of boost users. On the other hand, having implemented several lock-free algorithms, I know that it is notoriously hard to get them right. We'll need all the help we can get from knowledgeable people in the field. The implementation needs to be thoroughly reviewed. Let me give the first attempt at that.
i know how difficult it is to get lock-free data structures right, and several issues have already been found and fixed by people reviewing my code ...
Second, this algorithm makes certain assumptions about the ability to access memory that has been freed. The algorithm can try to read memory that has been de-allocated. (See further down for a description of how.) Several debug implementations of malloc/free store every allocated element on its own page, and when the element is freed, the page is marked protected for both reading and writing. That means that this algorithm can only work with a subset of custom allocators. Your "free_list" for example can't work with it, since it's possible for a node to be freed when you call "deallocate". The "caching_freelist" and "static_freelist" on the other hand work just fine, because they never return the memory back to the system.
the free_list class should not be used for the stack/fifo classes (i should remove it to avoid confusion). i chose the freelist approach, since hazard pointers and pass-the-buck are (or may be) subject to patents.
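for illustration, here is a minimal sketch of the caching idea (hypothetical names, not the actual boost.lockfree code): nodes are recycled through an internal list and only handed back to the heap when the freelist itself is destroyed, so a late reader that still dereferences a "freed" node never touches unmapped memory. the sketch only shows the single-threaded caching property, not the lock-free push/pop.

    // sketch only: assumes Node has a 'next' member. the real freelist
    // pushes/pops with CAS; the key property shown here is that
    // deallocate() never returns memory to the system before destruction.
    template <typename Node>
    class caching_freelist_sketch
    {
        Node * pool_;                       // singly-linked list of cached nodes

    public:
        caching_freelist_sketch() : pool_(0) {}

        Node * allocate()
        {
            if (Node * n = pool_) {         // reuse a cached node if possible
                pool_ = n->next;
                return n;
            }
            return static_cast<Node*>(::operator new(sizeof(Node)));
        }

        void deallocate(Node * n)
        {
            n->next = pool_;                // cache the node instead of freeing it
            pool_ = n;
        }

        ~caching_freelist_sketch()
        {
            while (Node * n = pool_) {      // memory is returned only here
                pool_ = n->next;
                ::operator delete(n);
            }
        }
    };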
Third, and the biggest limitation, is the algorithm's assumption that you can store a pointer to an element plus an ABA tag in a contiguous memory space that can be passed to a CAS instruction. I believe not all platforms provide a DCAS instruction, that's why you're trying to work around this issue by using "packed pointers", i.e. storing the ABA tag in the part of the pointer that presumably is never used. The problem with that, however, is that some libc implementations of malloc/free do use all the bits in the pointer for their own needs, so you can't assume that you can meddle with them.
puh, can you elaborate on this? i haven't been aware of any malloc/free implementation making use of the unused address-space bits.
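for context, the packing i have in mind looks roughly like this on x86_64, where the upper 16 bits of the virtual address space are currently unused (simplified sketch, not the exact tagged_ptr code):

    #include <boost/cstdint.hpp>

    // simplified sketch: pointer in the low 48 bits, ABA tag in the top 16.
    // assumes the platform never hands out addresses that use the top bits.
    struct packed_ptr_sketch
    {
        boost::uint64_t data;

        static boost::uint64_t pack(void * p, boost::uint16_t tag)
        {
            return reinterpret_cast<boost::uint64_t>(p)
                 | (static_cast<boost::uint64_t>(tag) << 48);
        }

        void * get_ptr() const
        {
            return reinterpret_cast<void*>(data & ((1ULL << 48) - 1));
        }

        boost::uint16_t get_tag() const
        {
            return static_cast<boost::uint16_t>(data >> 48);
        }
    };

    // the whole 64-bit word can then be updated with a single-word CAS.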
My experience with this algorithm shows that the safest approach is to use a pre-allocated array for the queue and replace pointers with indexes into the array. Assuming that the size of an index is sufficiently smaller than the size of a pointer, you can use the remaining bits to store the ABA tag. This alleviates both points 2 and 3. It does mean, however, that the queue can't have an unbounded size; the size has to be specified during the queue instantiation.
that would greatly increase the portability ... it would even be possible to use e.g. 16-bit indices and 16-bit tags on 32-bit platforms without dcas.
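something along these lines, for example (rough sketch with hypothetical names): a 16-bit index into a preallocated node array plus a 16-bit tag, packed into one 32-bit word that an ordinary single-word cas can swap:

    #include <boost/cstdint.hpp>

    // rough sketch: node "pointers" become indices into a fixed array,
    // so the index+tag pair fits into a single 32-bit CAS-able word.
    struct index_tag_sketch
    {
        boost::uint32_t data;   // low 16 bits: index, high 16 bits: ABA tag

        static boost::uint32_t pack(boost::uint16_t index, boost::uint16_t tag)
        {
            return static_cast<boost::uint32_t>(index)
                 | (static_cast<boost::uint32_t>(tag) << 16);
        }

        boost::uint16_t index() const { return static_cast<boost::uint16_t>(data & 0xffff); }
        boost::uint16_t tag()   const { return static_cast<boost::uint16_t>(data >> 16); }
    };

    // the tag is incremented every time an index is reused, so a stale
    // index coming back from the free list no longer compares equal
    // (the usual ABA protection).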
Fourth, on platforms where CAS is implemented using LL/SC (such as PowerPC and MIPS), it is possible to livelock. The problem with LL/SC on most platforms is that the granularity for marking the LL'd address as "dirty" isn't just the memory pointed to by LL, but the whole cache line where the memory resides. On these platforms it is important to make sure that each "node" of the queue is located on a separate cache line.
interesting point! i should somehow handle size and alignment of the nodes on these platforms
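something like this, i guess (sketch; assumes a 64-byte cache line, which would really have to be detected per platform, and an allocator that hands out cache-line-aligned memory):

    // pad each node up to a full cache line so that LL/SC reservations on
    // one node are not invalidated by stores to a neighbouring node.
    static const size_t cache_line_size = 64;   // assumed, platform dependent

    struct padded_node
    {
        struct node
        {
            void * data;
            node * next;
        } n;

        // padding so sizeof(padded_node) is a multiple of the cache line size;
        // the allocator must additionally return cache-line-aligned addresses
        char pad[cache_line_size - sizeof(node) % cache_line_size];
    };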
Here's the code of your enqueue:
bool enqueue(T const & t)
{
    node * n = alloc_node(t);
    if (n == NULL)
        return false;

    for (;;)
    {
        atomic_node_ptr tail (tail_);            //1
        read_memory_barrier();                   //2
        atomic_node_ptr next (tail->next);       //3

        if (likely(tail == tail_))               //4
        {
            if (next.get_ptr() == 0)             //5
            {
                if (tail->next.cas(next, n))     //6
                {
                    tail_.cas(tail, n);
                    return true;
                }
            }
            else
                tail_.cas(tail, next.get_ptr());
        }
    }
}
1. You're missing a write barrier after the call to alloc_node(). Rationale: alloc_node sets the "next" pointer of the newly allocated node to NULL. That is a store to shared memory. Setting the next pointer to NULL has to be observable to other threads BEFORE they observe the inserted node. If you don't insert a write barrier, it is possible for other threads to see the new element in the list before the element's next pointer is initialized with NULL.
i was assuming that cas issues a full memory barrier (at least my cas implementation does), so i don't think this would be an issue.
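for reference, my cas wrapper boils down to something like this (simplified sketch; gcc documents the __sync builtins as full memory barriers):

    template <typename T>
    bool cas(volatile T * addr, T expected, T desired)
    {
    #if defined(__GNUC__)
        // gcc documents __sync_bool_compare_and_swap as a full barrier:
        // no memory operand is moved across it in either direction
        return __sync_bool_compare_and_swap(addr, expected, desired);
    #else
    #   error "cas sketch only shown for gcc here"
    #endif
    }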
2. You've misplaced the read_barrier. Rationale: Line 4 checks the validity of the load done on line 3. That means that the read_barrier should be between lines 3 and 4. If you don't have a read_barrier after loading *tail->next, then this load can happen after the load of tail_ on line 4, which defeats the purpose of line 4.
You still need to have a barrier between lines 1 and 3, but here only a data dependency barrier will suffice, which is a weaker form of a read barrier and on almost all platforms is a no-op, since data-dependent loads are handled by the processor automatically.
thanks for this explanation ...
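just to make sure i understand the suggested placement, the loop head would roughly become (sketch, not yet the actual patch; data_dependency_barrier() is a placeholder name, linux calls it smp_read_barrier_depends):

    atomic_node_ptr tail (tail_);            //1
    data_dependency_barrier();               //2  weaker barrier is sufficient here
    atomic_node_ptr next (tail->next);       //3
    read_memory_barrier();                   //   full read barrier before re-checking tail_
    if (likely(tail == tail_))               //4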
3. Your tail_ and head_ should be volatile. Furthermore, the getters for your tagged_ptr (like get_ptr, operator->() and operator*()) should return pointers/references to volatile types. Rationale: If your tail_ is not volatile, then the compiler is free to load tail_ just once, and line 4 effectively becomes a no-op.
this was already pointed out a few days ago and is fixed in my repository.
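roughly along these lines (simplified sketch of the fix, not the exact repository code):

    template <typename T>
    class tagged_ptr_sketch
    {
        T * ptr;
        unsigned int tag;

    public:
        // getters return pointers/references to volatile T, so the compiler
        // cannot cache loads made through them across loop iterations
        volatile T * get_ptr() const    { return ptr; }
        volatile T & operator*() const  { return *ptr; }
        volatile T * operator->() const { return ptr; }
    };

    // tail_ and head_ themselves are declared volatile so that the
    // re-check in line 4 really re-loads tail_ on every iteration:
    //     volatile atomic_node_ptr tail_;
    //     volatile atomic_node_ptr head_;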
Regarding the dequeue:
1. Again, you need to put a read_barrier between lines 4 and 5. Only a data dependency barrier will suffice at line 2.
2. You need to add a write barrier after line 7, for the same reason as in the enqueue.
same as above
Hope this was descriptive,
i am currently porting the data structures to the boost.atomic proposal, which provides better language support for specifying the memory order constraints. the code is available from a separate branch of my repository. i would be curious whether you could have a look at that implementation as well.

thanks a lot for your review!

cheers, tim

--
tim@klingt.org
http://tim.klingt.org

There's no such entity as "most people". These are generalities. All generalities are meaningless. You've got to pin it down to a specific person doing a specific thing at a specific time and space.
  William S. Burroughs