
"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:201001181953.33044.hcb@chaoticmind.net...
On Saturday 16 January 2010 03:15:44, Chris M. Thomasson wrote:
FWIW Helge, here is a new proxy garbage collector algorithm of mine that I am currently experimenting with:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a53f24de178b419f
interesting concept, but AFAICT it shares two undesirable properties with my implementation: Reclamation can be delayed indefinitely if
- one thread is "stuck" within the critical section (e.g. scheduled away)
Well, if a thread actually deadlocks or "permanently livelocks" within the critical section, then no reclamation can occur.
- threads interlock such that at least one thread is always within the critical section
Right?
No. My proxy algorithm is N = 2, while yours is N = 1, where N is the number of collector objects. This means that you only have a _single_ reference count. During periods of load, this single reference count might never reach zero and the garbage will keep piling up. In my case, I can atomically switch collector objects and reset the reference count, thus guaranteeing that the old one will drop to zero exactly when the last thread accessing it releases its reference. This is because all new threads will take references to the new collector. Here is the key area in my algorithm; take note of the following procedure in the code: http://cpt.pastebin.com/f537f2c7c
__________________________________________________________________
void prv_quiesce_begin(collector& c)
{
    // Try to begin the quiescence process.
    if (! m_quiesce.exchange(true, std::memory_order_acquire))
    {
        // advance the current collector and grab the old one.
128:    size_t i = ((&c - m_collectors) + 1) & 1;

130:    unsigned int old =
            m_current.exchange(i, std::memory_order_acq_rel);

        // decode reference count.
        unsigned int refs = old & 0xFFFFFFF0U;

        // verify previous collector index.
        RL_ASSERT((old & 0xFU) == (&c - m_collectors));

        // increment and generate an odd reference count.
        c.m_count.fetch_add(refs + 0x10U, std::memory_order_release);
    }
}
__________________________________________________________________
Line `128' simply determines the index of the next collector object. Line `130' atomically resets the main reference count to zero _and_ swaps in the next collector index. Now, all new threads will only be able to gain references to the collector that was just swapped in, which means that it will be impossible for them to gain access to the old collector. Therefore, the old collector will drop to zero. I guess this is similar to using split counters to implement RCU. However, AFAICT my algorithm is unique and completely different from Paul McKenney's split-counter RCU solution.
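For clarity, here is a simplified, self-contained sketch of the acquire/release side that the swap above relies on. The names and exact encoding are illustrative only (the real code is in the pastebin link); the point is that a single atomic RMW on `m_current' both selects the current collector and bumps the packed count, while releases land on the per-collector count that the transfer in `prv_quiesce_begin' balances:
__________________________________________________________________
#include <atomic>

// Illustrative sketch, not the exact pastebin code. Assumed
// encoding: low 4 bits of m_current hold the current collector
// index; each outstanding reference adds 0x10 to the rest.
struct collector
{
    std::atomic<unsigned int> m_count{0}; // per-collector split count
    // ...deferred garbage nodes would hang off of here...
};

struct proxy
{
    collector m_collectors[2];
    std::atomic<unsigned int> m_current{0};

    collector& acquire()
    {
        // one atomic RMW picks the current collector _and_
        // bumps the packed reference count.
        unsigned int cur =
            m_current.fetch_add(0x10U, std::memory_order_acquire);
        return m_collectors[cur & 0xFU];
    }

    void release(collector& c)
    {
        // releases decrement the collector's own count; it can only
        // balance to zero after the quiesce step has transferred the
        // packed acquire count (plus a 0x10 bias held by the
        // quiescing thread) onto m_count.
        if (c.m_count.fetch_sub(0x10U, std::memory_order_release)
                == 0x10U)
        {
            // last reference is gone: reclaim c's deferred garbage.
        }
    }
};
__________________________________________________________________
Note that c.m_count may wrap below zero (unsigned) while releases race ahead of the transfer; the arithmetic still balances to exactly zero once, and the 0x10 bias guarantees zero cannot be reached before the transfer has happened, which is what makes the handoff safe.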
It's not that I consider this problematic (there is no reason I would ever want to share a data structure between threads if I know that the above can pose a problem).
It can be problematic for various usage patterns. For instance, suppose I wanted to use an N = 1 collector to solve the reader/writer problem (e.g., readers iterating over a data-structure while writers are concurrently pushing/popping items). I fully expect reads to be very frequent in such scenarios. This would not work well for an N = 1 collector because, if reader threads were constantly accessing the data-structure, the reference count would probably never reach zero. However, with an N = 2 collector I can guarantee that new reader threads will not affect the reference count of the collector swapped out in line `130'. This is a case in which N = 2 is the key difference between our implementations.
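To make that concrete, a hypothetical reader (building on the sketch above) might look like this; `g_head', `node', and `reader_iterate' are assumed names for illustration, not code from the pastebin:
__________________________________________________________________
struct node { int value; node* next; };

std::atomic<node*> g_head{nullptr}; // head of a lock-free list
proxy g_proxy;                      // the two-collector proxy above

void reader_iterate()
{
    // pin the current collector for the whole traversal.
    collector& c = g_proxy.acquire();
    for (node* n = g_head.load(std::memory_order_acquire);
         n != nullptr; n = n->next)
    {
        // safe: writers that pop nodes defer them to the collector
        // instead of freeing, so `n' stays valid while pinned.
    }
    g_proxy.release(c);
}
__________________________________________________________________
Readers never block writers, and with N = 2 a steady stream of readers only ever holds references to the _current_ collector; the swapped-out one still drains to zero and gets reclaimed.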
The algorithm is wait-free in practice on systems with native fetch-and-add and swap instructions, like x86-32/64. If the architecture forces you to implement fetch-and-add with a damn LL/SC or CAS loop, well, then it's not really wait-free now, is it? No, I am afraid it's lock-free, or whatever the semantics of the LL/SC loop turn out to be, which could even be obstruction-free. Sad...
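To illustrate the point, here is what emulated fetch-and-add looks like when all you have is CAS; this is generic C++, not any particular architecture's code. A thread can fail the compare-exchange indefinitely while other threads keep succeeding, so the loop is only lock-free, not wait-free:
__________________________________________________________________
#include <atomic>

unsigned int emulated_fetch_add(std::atomic<unsigned int>& a,
                                unsigned int delta)
{
    unsigned int expected = a.load(std::memory_order_relaxed);
    // retry until no other thread modified `a' between the load
    // and the compare-exchange; unbounded under contention.
    while (! a.compare_exchange_weak(expected, expected + delta,
                                     std::memory_order_acq_rel,
                                     std::memory_order_relaxed))
    {
        // `expected' was reloaded with the current value; retry.
    }
    return expected;
}
__________________________________________________________________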
Depends how it is implemented -- LL/SC is wait-free iff other reads to the location do not invalidate the reservation (PowerPC is in that category).
Here are some fairly interesting posts from Andy Glew (former Computer Architect at Intel) wrt LL/SC:
http://groups.google.com/group/comp.arch/msg/18f0d50aa3c9a148
http://groups.google.com/group/comp.arch/msg/5d1ff313519005f8
I strongly suspect that CAS on x86 is actually just a micro-coded LL/SC anyway...
I think they take locks to get the job done. Andy Glew would be the guy to ask. So, I should probably post a query over in `comp.arch' where all the hardware gurus hang out. ;^)