[lockfree::fifo] Review

I decided to start a new thread on this topic since the last thread on the same topic is pretty outdated and not everyone will see it. (You can disregard my reply to the previous thread, as this one is more complete.)

First of all, I think that having a library of lock-free algorithms/containers would be greatly beneficial to a great number of Boost users. On the other hand, having implemented several lock-free algorithms, I know that it is notoriously hard to get them right. We'll need all the help we can get from knowledgeable people in the field. The implementation needs to be thoroughly reviewed. Let me give the first attempt at that. In this e-mail I'll be concentrating primarily on the correctness of the algorithm rather than on the interface.

First, let me start with the choice of algorithm for the fifo queue: Maged Michael and Michael Scott's "Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms". I really think it is the best choice. Having said that, however, I need to point out that there are some inherent limitations in this algorithm.

The first limitation is that you can't peek at the head of the queue without dequeueing the element first, hence you can't get an interface that conforms to std::queue. But I truly believe that this will be the case with any lock-free queue implementation, hence we should abandon the hope of having a lock-free queue that looks the same as std::queue.

Second, this algorithm makes certain assumptions about the ability to access memory that has been freed: the algorithm can try to read memory that has already been de-allocated. (See further down for a description of how.) Several debug implementations of malloc/free store every allocated element on its own page, and when the element is freed, the page is marked protected for both reading and writing. That means this algorithm can only work with a subset of custom allocators. Your "free_list", for example, can't work with it, since it's possible for a node to be freed when you call "deallocate". The "caching_freelist" and "static_freelist", on the other hand, work just fine because they never return memory back to the system.

Third, and the biggest limitation, is the algorithm's assumption that you can store a pointer to an element plus an ABA tag in a contiguous memory region that can be passed to a single CAS instruction. Not all platforms provide a DCAS instruction, which is why you're trying to work around this issue by using "packed pointers", i.e. storing the ABA tag in the part of the pointer that presumably is never used. The problem with that, however, is that some libc implementations of malloc/free do use all the bits in the pointer for their own needs, so you can't assume that you can meddle with them. My experience with this algorithm shows that the safest approach is to use a pre-allocated array for the queue and replace pointers with indexes into the array. Assuming that the size of an index is sufficiently smaller than the size of a pointer, you can use the remaining bits to store the ABA tag. This alleviates both points 2 and 3. It does mean, however, that the queue can't have an unbounded size: the size has to be specified when the queue is instantiated.

Fourth, on platforms where CAS is implemented using LL/SC (such as PowerPC and MIPS), it is possible to livelock. The problem with LL/SC on most platforms is that the granularity for marking LL's address as "dirty" isn't just the memory pointed to by LL, but the whole cache line where that memory resides.
On these platforms it is important to make sure that each "node" of the queue is located on a separate cache line.

Here's the code of your enqueue:

    bool enqueue(T const & t)
    {
        node * n = alloc_node(t);
        if (n == NULL)
            return false;

        for (;;)
        {
            atomic_node_ptr tail (tail_);             //1
            read_memory_barrier();                    //2
            atomic_node_ptr next (tail->next);        //3

            if (likely(tail == tail_))                //4
            {
                if (next.get_ptr() == 0)              //5
                {
                    if (tail->next.cas(next, n))      //6
                    {
                        tail_.cas(tail, n);
                        return true;
                    }
                }
                else
                    tail_.cas(tail, next.get_ptr());
            }
        }
    }

1. You're missing a write barrier after the call to alloc_node().
Rationale: alloc_node sets the "next" pointer of the newly allocated node to NULL. That is a store to shared memory. Setting the next pointer to NULL has to be observable to other threads BEFORE they observe the inserted node. If you don't insert a write barrier, it is possible for other threads to see the new element in the list before the element's next pointer is initialized with NULL.

2. You've misplaced the read barrier.
Rationale: Line 4 checks the validity of the load done on line 3. That means the read barrier should be between lines 3 and 4. If you don't have a read barrier after loading tail->next, then this load can happen after the load of tail_ on line 4, which defeats the purpose of line 4. Here's a scenario that makes it clear:

State: the queue is well-formed and tail_ points to the last element.
1. Thread A executes line 1. State: the queue is unchanged and thread A has a local copy of tail_.
2. Since there's no read barrier after line 3, the loads at lines 3 and 4 happen out of order: tail_ is pre-loaded for the comparison at line 4.
3. Thread B removes the element pointed to by tail_ and advances tail_ further. State: tail_ is advanced, the old element is removed and given back to the allocator. A's local copy of tail_ becomes stale.
4. Thread A executes line 3. State: A's copy of tail is stale. This copy is used to load memory that has been reclaimed by the allocator and possibly modified. The value of that memory happens to be 0.
5. Thread A executes line 4. Since tail_ has been preloaded, line 4 succeeds.
6. Line 5 succeeds.
7. Line 6 succeeds. Boom: you've just modified a piece of memory not controlled by the queue.

You still need to have a barrier between lines 1 and 3, but there only a data-dependency barrier will suffice, which is a weaker form of a read barrier and on most platforms is a no-op, since data-dependent loads are handled by processors automatically.

3. Your tail_ and head_ should be volatile. Furthermore, the getters of your tagged_ptr (like get_ptr, operator->() and operator*()) should return pointers/references to volatile types.
Rationale: If your tail_ is not volatile, then the compiler is free to load tail_ just once and line 4 effectively becomes a no-op.

Here's the code of your dequeue:

    bool dequeue (T * ret)
    {
        for (;;)
        {
            atomic_node_ptr head (head_);             //1
            read_memory_barrier();                    //2
            atomic_node_ptr tail (tail_);             //3
            node * next = head->next.get_ptr();       //4

            if (likely(head == head_))                //5
            {
                if (head.get_ptr() == tail.get_ptr()) //6
                {
                    if (next == 0)
                        return false;
                    tail_.cas(tail, next);
                }
                else
                {
                    *ret = next->data;                //7
                    if (head_.cas(head, next))        //8
                    {
                        dealloc_node(head.get_ptr()); //9
                        return true;
                    }
                }
            }
        }
    }

1. Again, you need to put the read barrier between lines 4 and 5. Only a data-dependency barrier is needed at line 2.
2. You need to add a write barrier after line 7 for the same reason as in the enqueue.

Hope this was descriptive,

Andy Venikov.
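(To make the suggested placement concrete, here is a rough sketch of the enqueue loop with the barriers where the review argues they belong. The helper names write_memory_barrier() and read_barrier_depends() are illustrative placeholders, not the library's actual API, and the surrounding types simply follow the quoted code.)

    bool enqueue(T const & t)
    {
        node * n = alloc_node(t);                 // sets n->next to 0
        if (n == NULL)
            return false;

        write_memory_barrier();                   // publish n->next == 0 before the
                                                  // node can become reachable below
        for (;;)
        {
            atomic_node_ptr tail (tail_);         //1
            read_barrier_depends();               //2: data-dependency barrier only
                                                  //   (a no-op on most processors)
            atomic_node_ptr next (tail->next);    //3
            read_memory_barrier();                //   order the load of tail->next
                                                  //   before the re-check of tail_
            if (likely(tail == tail_))            //4
            {
                if (next.get_ptr() == 0)          //5
                {
                    if (tail->next.cas(next, n))  //6
                    {
                        tail_.cas(tail, n);
                        return true;
                    }
                }
                else
                    tail_.cas(tail, next.get_ptr());
            }
        }
    }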

bool enqueue(T const & t)
{
    node * n = alloc_node(t);
    if (n == NULL)
        return false;
    for (;;)
    {
        atomic_node_ptr tail (tail_);        //1
        read_memory_barrier();               //2
        atomic_node_ptr next (tail->next);   //3
        if (likely(tail == tail_))           //4
        {
            if (next.get_ptr() == 0)         //5
            {
                if (tail->next.cas(next, n)) //6
                {
                    tail_.cas(tail, n);
                    return true;
                }
            }
            else
                tail_.cas(tail, next.get_ptr());
        }
    }
}
1. You're missing a write barrier after the call to alloc_node(). Rationale: alloc_node sets the "next" pointer of the newly allocated node to NULL. That is a store to shared memory. Setting the next pointer to NULL has to be observable to other threads BEFORE they observe the inserted node. If you don't insert a write barrier, it is possible for other threads to see the new element in the list before the element's next pointer is initialized with NULL.
'n' (and n.next) isn't "published" until the CAS in line 6 - doesn't the CAS do a write barrier (or full barrier)? Tony

Gottlob Frege wrote:
bool enqueue(T const & t)
{
    node * n = alloc_node(t);
    if (n == NULL)
        return false;
    for (;;)
    {
        atomic_node_ptr tail (tail_);        //1
        read_memory_barrier();               //2
        atomic_node_ptr next (tail->next);   //3
        if (likely(tail == tail_))           //4
        {
            if (next.get_ptr() == 0)         //5
            {
                if (tail->next.cas(next, n)) //6
                {
                    tail_.cas(tail, n);
                    return true;
                }
            }
            else
                tail_.cas(tail, next.get_ptr());
        }
    }
}
1. You're missing a write barrier after the call to alloc_node(). Rationale: alloc_node sets the "next" pointer of the newly allocated node to NULL. That is a store to shared memory. Setting the next pointer to NULL has to be observable to other threads BEFORE they observe the inserted node. If you don't insert a write barrier, it is possible for other threads to see the new element in the list before the element's next pointer is initialized with NULL.
'n' (and n.next) isn't "published" until the CAS in line 6 - doesn't the CAS do a write barrier (or full barrier)?
That's exactly the point: n.next has to be observable (or "published") before the CAS inserts the element into the queue, since the effect of the CAS is visible immediately. As to whether CAS does a memory barrier or not, I think it's not specified. If you know otherwise, please let me know; in that case I'm wrong and you don't need a write barrier here. But just looking at implementations of CAS, it looks like some of them don't do membars. Granted, the x86 memory model is so strict that you don't need any extra memory barriers there, but on PowerPC, for example, CAS will be implemented using lwarx/stwcx., and the documentation on those doesn't say anything about barriers.
Tony
Andy.

thanks for your comments!
First of all I think that having a library of lock-free algorithms/containers would be greatly beneficial to a great number of boost users. On the other hand, having implemented several lock-free algorithms, I know that it is notoriously hard to get them right. We'll need all the help we can get from knowledgeable people in the field. The implementation needs to be thoroughly reviewed. Let me give the first attempt at that.
i know how difficult it is to get lock-free data structures right, and several issues have already been solved by people reviewing my code ...
Second, this algorithm makes certain assumptions about the ability to access memory that has been freed. The algorithm can try to read memory that has been de-allocated. (See further down for description of how). Several debug implementations of malloc/free store every allocated element on one page and when the element is freed, the page is marked protected for both reading and writing. That means that this algorithm can only work with a subset of custom allocators. Your "free_list" for example can't work with it since it's possible for a node to be freed when you call "deallocate". The "caching_freelist" and "static_freelist" on the other hand work just fine because they never return the memory back to the system.
the free_list class should not be used for the stack/fifo classes (i should remove it to avoid confusion). i chose the freelist approach, since hazard pointers and pass-the-buck are (or may be) subject to patents.
Third, and the biggest limitation, is the algorithm's assumption that you can store a pointer to an element plus an ABA tag in a contiguous memory space that can be passed to a CAS instruction. I believe not all platforms provide a DCAS instruction, that's why you're trying to work around this issue by using "packed pointers", i.e. storing the ABA tag in the part of the pointer that presumably is never used. The problem with that, however, is that some libc implementations of malloc/free do use all the bits in the pointer for their own needs, so you can't assume that you can meddle with them.
puh, can you elaborate on this? i haven't been aware of any use of the unused address space bits.
My experience with this algorithm shows that the safest approach is to use a pre-allocated array for the queue and replace pointers with indexes into the array. Assuming that the size of an index is sufficiently lower than the size of a pointer, you can use the remaining bits to store the ABA tag. This alleviates both point 2 and 3. This does mean however, that the queue can't have an unbound size, the size has to be specified during the queue instantiation.
that would greatly increase the portability ... it would be even possible to use e.g. 16bit indices and 16bit tags on 32bit platforms without dcas
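(For illustration, a packed index-plus-tag word for a 32-bit platform might look like the following. This is a hypothetical sketch, not library code; the 16/16 split assumes the node pool never holds more than 65536 nodes, and all names are made up for the example.)

    #include <cstdint>

    struct tagged_index
    {
        std::uint32_t value;   // low 16 bits: index into the node array,
                               // high 16 bits: ABA tag

        std::uint16_t index() const { return std::uint16_t(value & 0xFFFFu); }
        std::uint16_t tag()   const { return std::uint16_t(value >> 16); }

        static tagged_index pack(std::uint16_t idx, std::uint16_t tag)
        {
            tagged_index t;
            t.value = (std::uint32_t(tag) << 16) | idx;
            return t;
        }
    };

    // When replacing head_/tail_/next, always bump the tag so that a recycled
    // array slot cannot be mistaken for the value read earlier (the ABA case);
    // the whole 32-bit word fits an ordinary single-word CAS.
    inline tagged_index next_generation(tagged_index old, std::uint16_t new_idx)
    {
        return tagged_index::pack(new_idx, std::uint16_t(old.tag() + 1));
    }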
Fourth, on platforms where CAS is implemented using LL/SC (such as PowerPC and MIPS), it is possible to livelock. The problem with LL/SC on most platforms is that the granularity for marking LL's address as "dirty" isn't just the memory pointed to by LL, but the whole cache line where the memory resides. On these platforms it is important to make sure that each "node" of the queue is located on a separate cache line.
interesting point! i should somehow handle size and alignment of the nodes on these platforms
Here's the code of your enqueue:
bool enqueue(T const & t)
{
    node * n = alloc_node(t);
    if (n == NULL)
        return false;
    for (;;)
    {
        atomic_node_ptr tail (tail_);        //1
        read_memory_barrier();               //2
        atomic_node_ptr next (tail->next);   //3
        if (likely(tail == tail_))           //4
        {
            if (next.get_ptr() == 0)         //5
            {
                if (tail->next.cas(next, n)) //6
                {
                    tail_.cas(tail, n);
                    return true;
                }
            }
            else
                tail_.cas(tail, next.get_ptr());
        }
    }
}
1. You're missing a write barrier after the call to alloc_node(). Rationale: alloc_node sets the "next" pointer of the newly allocated node to NULL. That is a store to shared memory. Setting the next pointer to NULL has to be observable to other threads BEFORE they observe the inserted node. If you don't insert a write barrier, it is possible for other threads to see the new element in the list before the element's next pointer is initialized with NULL.
i was assuming that cas issues a full memory barrier (at least my cas implementation does), so i don't think this would be an issue.
2. You've misplaced the read barrier. Rationale: Line 4 checks the validity of the load done on line 3. That means the read barrier should be between lines 3 and 4. If you don't have a read barrier after loading tail->next, then this load can happen after the load of tail_ on line 4, which defeats the purpose of line 4.
You still need to have a barrier between lines 1 and 3, but there only a data-dependency barrier will suffice, which is a weaker form of a read barrier and on most platforms is a no-op, since data-dependent loads are handled by processors automatically.
thanks for this explanation ...
3. Your tail_ and head_ should be volatile. Furthermore, the getters of your tagged_ptr (like get_ptr, operator->() and operator*()) should return pointers/references to volatile types. Rationale: If your tail_ is not volatile, then the compiler is free to load tail_ just once and line 4 effectively becomes a no-op.
this was already pointed out a few days ago and is fixed in my repository.
1. Again, you need to put the read barrier between lines 4 and 5. Only a data-dependency barrier is needed at line 2. 2. You need to add a write barrier after line 7 for the same reason as in the enqueue.
same as above
Hope this was descriptive,
i am currently porting the data structures to the boost.atomic proposal, which provides better language support to specify the memory order constraints. the code is available from a separate branch of my repository. i would be curious if you could have a look at that implementation as well.

thanks a lot for your review!

cheers, tim

--
tim@klingt.org
http://tim.klingt.org

There's no such entity as "most people". These are generalities. All generalities are meaningless. You've got to pin it down to a specific person doing a specific thing at a specific time and space.
  William S. Burroughs
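(For readers unfamiliar with that style of interface, a rough sketch of how such ordering constraints are expressed, written with the equivalent C++0x std::atomic syntax for concreteness. This is illustrative only and not the code in the branch; the types and the helper name are made up for the example.)

    #include <atomic>

    template <typename T>
    struct node
    {
        std::atomic<node*> next;
        T data;
    };

    // Linking a freshly initialised node: the release ordering on the successful
    // CAS makes the earlier store of n->next = 0 visible before any other thread
    // can observe n in the list, replacing the explicit write barrier discussed
    // earlier in the thread.
    template <typename T>
    bool try_link_tail(std::atomic<node<T>*>& tail_, node<T>* n)
    {
        node<T>* tail = tail_.load(std::memory_order_acquire);
        node<T>* expected = 0;
        return tail->next.compare_exchange_weak(expected, n,
                                                std::memory_order_release,
                                                std::memory_order_relaxed);
    }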

Tim Blechmann wrote:
Third, and the biggest limitation, is the algorithm's assumption that you can store a pointer to an element plus an ABA tag in a contiguous memory space that can be passed to a CAS instruction. I believe not all platforms provide a DCAS instruction, that's why you're trying to work around this issue by using "packed pointers", i.e. storing the ABA tag in the part of the pointer that presumably is never used. The problem with that, however, is that some libc implementations of malloc/free do use all the bits in the pointer for their own needs, so you can't assume that you can meddle with them.
puh, can you elaborate on this? i haven't been aware of any use of the unused address space bits.
Yes. Recently I did an experiment on 32-bit 2.6 Linux using SUSE 9.1 and the libc that came with it (libc version 2.3.3-97). I was doing all sorts of allocations and looking at the returned pointers. What I found was that only the two low bits were always the same (obviously, because malloc has to return aligned data). All the other bits could differ depending on the size of the requested allocation. I think they also depended on the thread that was requesting the allocation, but I'm not sure on this point. So, other than the 2 lower bits, you can't assume you can use any more. And 2 bits seems kind of low for an ABA tag.
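(The experiment was roughly of the following shape; this is an illustrative reconstruction, not the original test program.)

    #include <cstdio>
    #include <cstdlib>

    // Allocate blocks of different sizes and print the addresses to see
    // which bits actually vary across allocations.
    int main()
    {
        const std::size_t sizes[] = { 8, 24, 100, 4096, 1u << 20 };
        const std::size_t count = sizeof(sizes) / sizeof(sizes[0]);

        for (std::size_t i = 0; i < count; ++i)
        {
            void* p = std::malloc(sizes[i]);
            std::printf("size %8lu -> %p\n",
                        static_cast<unsigned long>(sizes[i]), p);
            std::free(p);
        }
        return 0;
    }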
My experience with this algorithm shows that the safest approach is to use a pre-allocated array for the queue and replace pointers with indexes into the array. Assuming that the size of an index is sufficiently lower than the size of a pointer, you can use the remaining bits to store the ABA tag. This alleviates both point 2 and 3. This does mean however, that the queue can't have an unbound size, the size has to be specified during the queue instantiation.
that would greatly increase the portability ... it would be even possible to use e.g. 16bit indices and 16bit tags on 32bit platforms without dcas
Yup, that's the idea. Originally though, I said that the queue's size will have to be bounded. I've been thinking about it for a while and I think there's a way to keep it growing dynamically. I'll bounce some ideas around on c.p.t newsgroup and will come back to you shortly.
Fourth, on platforms where CAS is implemented using LL/SC (such as PowerPC and MIPS), it is possible to livelock. The problem with LL/SC on most platforms is that the granularity for marking LL's address as "dirty" isn't just the memory pointed to by LL, but the whole cache line where the memory resides. On these platforms it is important to make sure that each "node" of the queue is located on a separate cache line.
interesting point! i should somehow handle size and alignment of the nodes on these platforms
1. You're missing write barrier after a call to alloc_node(). Rational: alloc_node sets the "next" pointer of the newly allocated node to NULL. That is a store to shared memory. Setting next pointer to NULL has to be observable to other threads BEFORE they observe the inserted node. If you don't insert a write barrier, it is possible for other threads to see the new element in the list before element's next pointer is initialized with NULL.
i was assuming, that cas issues a full memory barrier (at least my cas implementation does), so i don't think, this would be an issue.
You know, it's an interesting question whether CAS really implies a memory barrier or not. Obviously, you've implemented it that way, but I think it need not be that way. I think there's really no "standard" on CAS, so we need to go with what is commonly accepted. It may be another question that would be good to ask on c.p.t. If someone can answer this question, please chime in. <snip>
i am currently porting the data structures to the boost.atomic proposal, which provides better language support to specify the memory order constraints. the code is available from a separate branch of my repository. i would be curious, if you could have a look at that implementation as well.
Definitely. How can I get to your latest code?
thanks a lot for your review! cheers, tim
Sure, Andy.

Andy Venikov wrote:
that would greatly increase the portability ... it would be even possible to use e.g. 16bit indices and 16bit tags on 32bit platforms without dcas
Yup, that's the idea. Originally though, I said that the queue's size will have to be bounded. I've been thinking about it for a while and I think there's a way to keep it growing dynamically. I'll bounce some ideas around on c.p.t newsgroup and will come back to you shortly.
I've just posted the description of the solution with some questions to the c.p.t: http://groups.google.com/group/comp.programming.threads/browse_thread/thread... Andy.

On Tue, 8 Dec 2009, Andy Venikov wrote:
Third, and the biggest limitation, is the algorithm's assumption that you can store a pointer to an element plus an ABA tag in a contiguous memory space that can be passed to a CAS instruction. I believe not all platforms provide a DCAS instruction, that's why you're trying to work around this issue by using "packed pointers", i.e. storing the ABA tag in the part of the pointer that presumably is never used. The problem with that, however, is that some libc implementations of malloc/free do use all the bits in the pointer for their own needs, so you can't assume that you can meddle with them.
puh, can you elaborate on this? i haven't been aware of any use of the unused address space bits.
Yes. Recently I did an experiment on 32 bit 2.6 linux using SUSE 9.1 and libc that came with it. (libc version was 2.3.3-97).
So far I have not yet met any 64-bit platform that provides 64 bits of (user-space!) address space (the same cannot be said of 32-bit platforms), so I would say that any use of spare bits on 64 bit is safe (for user-space!). FWIW there are other ways to avoid ABA than tagging pointers, and without DCAS on 32-bit, see e.g. this sample implementation of a lock-free stack: http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp
Fourth, on platforms where CAS is implemented using LL/SC (such as PowerPC and MIPS), it is possible to livelock. The problem with LL/SC on most platforms is that the granularity for marking LL's address as "dirty" isn't just the memory pointed to by LL, but the whole cache line where the memory resides. On these platforms it is important to make sure that each "node" of the queue is located on a separate cache line.
just to make this clear... there are two types of "live-lock"s with ll/sc:

a) one thread accessing memory between ll/sc
b) two threads racing in a tight loop on ll/sc, continually aborting the other's operation

If ll/sc is encapsulated in CAS, then a) is impossible. There is nothing you can do about b) without changing the CAS implementation -- and there is generally no reason to even attempt that: with a window of usually two to three instructions between ll/sc, the rate of failure (i.e. repetition required due to interference) is surprisingly low, even with two threads racing in tight ll/sc loops. Moving things to separate cache lines is a useful optimization (even without ll/sc), but nothing that is required.
You know, it's an interesting question whether CAS really implies a memory barrier or not. Obviously, you've implemented it that way, but I think it not need be that way. I think there's really no "standard" on CAS so we need to go with what is commonly accepted. It may be another question that would be good to ask at c.p.t If someone can answer this question, please chime in.
There are valid uses for CAS with and without various types of barriers. See the explicit memory_order specification in my atomic operations library. Using correct ordering constraints is unfortunately tricky -- use "too strong" ones, and performance will look ridiculous compared to locking. Use "too weak" ones, and your implementation will spuriously fail in ways that are nearly impossible to diagnose.

Best regards, Helge

Helge Bahmann <hcb@chaoticmind.net> writes:
On Tue, 8 Dec 2009, Andy Venikov wrote:
Third, and the biggest limitation, is the algorithm's assumption that you can store a pointer to an element plus an ABA tag in a contiguous memory space that can be passed to a CAS instruction. I believe not all platforms provide a DCAS instruction, that's why you're trying to work around this issue by using "packed pointers", i.e. storing the ABA tag in the part of the pointer that presumably is never used. The problem with that, however, is that some libc implementations of malloc/free do use all the bits in the pointer for their own needs, so you can't assume that you can meddle with them.
puh, can you elaborate on this? i haven't been aware of any use of the unused address space bits.
Yes. Recently I did an experiment on 32 bit 2.6 linux using SUSE 9.1 and libc that came with it. (libc version was 2.3.3-97).
So far I have not yet met any 64-bit platform that provides 64 bits of (user-space!) address space (the same cannot be said of 32-bit platforms), so I would say that any use of spare bits on 64 bit is safe (for user-space!).
As Chris also noted, I would be wary of using them for ABA checking though: it's easy to overflow a 16-bit counter quite quickly, and that's all you've got even if the OS only uses 48 of your 64 bits for address space.
FWIW there are other ways to avoid ABA than tagging pointers, and without DCAS on 32-bit, see e.g. this sample implementation of a lock-free stack:
http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp
Yup. Incidentally, you don't even need to count the nodes in push, provided you're careful when reclaiming nodes in pop. The downside of this technique is that if the traffic is too heavy then the ref count never reaches zero, so the nodes can never be reclaimed.
Using correct ordering constraints is unfortunately tricky -- use "too strong" ones, and performance will look ridiculous compared to locking. Use "took weak" ones, and your implementation will spuriously fail in ways that are nearly impossible to diagnose.
Totally agreed. Anthony -- Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/ just::thread C++0x thread library http://www.stdthread.co.uk Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk 15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

On Wed, 9 Dec 2009, Anthony Williams wrote:
So far I have not yet met any 64-bit platform that provides 64 bits of (user-space!) address space (the same cannot be said of 32-bit platforms), so I would say that any use of spare bits on 64 bit is safe (for user-space!).
As Chris also noted, I would be wary of using them for ABA checking though: it's easy to overflow a 16-bit counter quite quickly, and that's all you've got even if the OS only uses 48 of your 64 bits for address space.
This concern is certainly valid for basically any algorithm that uses "generation counters" or similar constructs for disambiguation (it could be argued that even 32 bits may not be enough). Using spare bits at all should however be considered a valid technique.
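(To put rough, illustrative numbers on the overflow concern: at ten million successful updates per second to a single location, a 16-bit tag can repeat a value after as little as 65536 / 10^7 s, i.e. about 6.5 ms, and even a full 32-bit tag repeats after roughly 2^32 / 10^7 s, i.e. about seven minutes, so the window in which an ABA mix-up is possible recurs quite frequently.)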
FWIW there are other ways to avoid ABA than tagging pointers, and without DCAS on 32-bit, see e.g. this sample implementation of a lock-free stack:
http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp
Yup. Incidentally, you don't even need to count the nodes in push, provided you're careful when reclaiming nodes in pop.
hm, I'm not sure what you mean, I don't think I am counting anything but the number of threads in the critical section?
The downside of this technique is that if the traffic is too heavy then the ref count never reaches zero, so the nodes can never be reclaimed.
Right, and although I would argue that such a highly contended hotspot is a design flaw in itself, there are ways to do reliable reclaim even in that case (e.g. ways that allow reclaiming an element after the last thread that could actually have "seen" it leaves the critical section -- this can be done with checkpoints similar to RCU, so it can even be more efficient than the atomic counter I used), yet I am not sure this is really worth it -- I would like to make an entirely different point that has been nagging me:

I'm not certain that portable, generally useful "symmetric" [1] wait-free (or obstruction-free) data structures for unbounded numbers of threads are both a) feasible and b) required.

As to a), implementations overwhelmingly end up being more expensive than locking (as well as less deterministic than locking with priority inheritance), and there is also some theoretical basis for suspecting this to be an intrinsic property [2] [3].

As to b), the cases when I have needed and successfully used wait-/obstruction-free data structures are roughly:

- bounded numbers of threads with static roles (e.g. producer/consumer with single reader/multiple writer or vice versa)
- strongly asymmetric access patterns (e.g. the number of read accesses vastly outweighs the number of write accesses; or the converse: a producer in "interrupt" context, with many consumers in the slow path)

In the former case, often all operations can/must be wait-/obstruction-free. In the latter case the goal is just to make a subset of the operations "less expensive" or wait-/obstruction-free, while the other subset typically becomes "more expensive". Both cases are easier to address, cover a large number of use cases and still pay off tremendously.

Personally I would therefore consider less generic data structures more useful, e.g. hash-tables/trees with a wait-free read-side path, or fast fully wait-free 1:n and n:1 producer/consumer data structures -- instead of not very portable (DCAS), quite expensive (compared to locking) n:m p/c data structures that strive to be egg-laying woolly dairy pigs. (I don't consider my stack implementation terribly useful, for that matter.)

Helge

[1] by "symmetric" I mean: all types of accesses (e.g. both insertion and deletion) are wait-/obstruction-free
[2] http://portal.acm.org/citation.cfm?id=1146427
[3] http://portal.acm.org/citation.cfm?id=197975

Helge Bahmann <hcb@chaoticmind.net> writes:
On Wed, 9 Dec 2009, Anthony Williams wrote:
FWIW there are other ways to avoid ABA than tagging pointers, and without DCAS on 32-bit, see e.g. this sample implementation of a lock-free stack:
http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp
Yup. Incidentally, you don't even need to count the nodes in push, provided you're careful when reclaiming nodes in pop.
hm, I'm not sure what you mean, I don't think I am counting anything but the number of threads in the critical section?
Sorry, that's what I meant: you're counting threads in both push and pop, but with a bit of care you only need to count the ones in pop. I haven't examined your algorithm closely enough to know if your code takes the necessary "bit of care".
The downside of this technique is that if the traffic is too heavy then the ref count never reaches zero, so the nodes can never be reclaimed.
Right, and although I would argue that such a highly contended hotspot is a design flaw in itself,
Agreed. Such a highly contended queue is going to take a huge performance hit due to all the cache ping-pong, and a blocking queue might actually perform better.
there are ways to do reliable reclaim even in that case (e.g. that allow reclaiming an element after the last thread that could actually have "seen" it leaves the critical section -- can be done with checkpoints similar RCU, so it can even be more efficient than the atomic counter I used), yet I am not sure this is really worth it
Agreed. You need to use a strategy that works for your use case. RCU-style checkpoints are one technique that does work, but there may be better solutions at the architecture level that avoid the heavy contention.
-- I would like to make an entirely different point that has been nagging me:
I'm not certain that portable, generally useful "symmetric" [1] wait-free (or obstruction-free) data structures for unbounded numbers of threads are both a) feasible and b) required. As to a) implementations overwhelmingly end up being more expensive than locking (as well as less deterministic than locking with priority inheritance), and there is also some theoretical basis for suspecting this to be an intrinsic property [2] [3]. As to b), the cases when I have needed and successfully used wait-/obstruction-free data structures are roughly:
- bounded numbers of threads with static roles (e.g. producer/consumer with single reader/multiple writer or vice versa)
- strongly asymmetric access patterns (e.g. number of read accesses vastly outweighs number of write accesses; or the converse: a producer in "interrupt" context, with many consumers in the slow path)
In the former case, often all operations can/must be wait-/obstruction-free. In the latter case the goal is just to make a subset of the operations "less expensive" or wait/obstruction-free, while the other subset typically becomes "more expensive". Both cases are easier to address, cover a large number of use cases and still pay off tremendously.
Yes. There are many ways to implement a FIFO queue, each making a different set of trade-offs.
Personally I would therefore consider less generic data structures more useful, e.g. hash-tables/trees with wait-free read-side path, or fast fully wait-free 1:n and n:1 producer/consumer data structures -- instead of not very portable (DCAS), quite expensive (compared to locking) n:m p/c data structures that strive to be egg-laying woolly dairy pigs. (I don't consider my stack implementation terribly useful, for that matter).
Totally agreed. The egg-laying woolly dairy pigs are fun to design, but are hard to get right, and often have a much higher performance cost than a specialized data structure. Anthony -- Author of C++ Concurrency in Action http://www.stdthread.co.uk/book/ just::thread C++0x thread library http://www.stdthread.co.uk Just Software Solutions Ltd http://www.justsoftwaresolutions.co.uk 15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976

"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.0912101115460.456@m65s28.vlinux.de...
On Wed, 9 Dec 2009, Anthony Williams wrote:
So far I have not yet met any 64-bit platform that provides 64 bits of (user-space!) address space (the same cannot be said of 32-bit platforms), so I would say that any use of spare bits on 64 bit is safe (for user-space!).
As Chris also noted, I would be wary of using them for ABA checking though: it's easy to overflow a 16-bit counter quite quickly, and that's all you've got even if the OS only uses 48 of your 64 bits for address space.
This concern is certainly valid for basically any algorithm that uses "generation counters" or similar constructs for disambiguation (it could be argued that even 32 bits may not be enough). Using spare bits at all should however be considered a valid technique.
FWIW there are other ways to avoid ABA than tagging pointers, and without DCAS on 32-bit, see e.g. this sample implementation of a lock-free stack:
http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp
Yup. Incidentally, you don't even need to count the nodes in push, provided you're careful when reclaiming nodes in pop.
hm, I'm not sure what you mean, I don't think I am counting anything but the number of threads in the critical section?
The downside of this technique is that if the traffic is too heavy then the ref count never reaches zero, so the nodes can never be reclaimed.
Right, and although I would argue that such a highly contended hotspot is a design flaw in itself,
You have created a proxy garbage collector that reminds me of one I did a long time ago which also suffered from this exact same problem. I was using it for memory reclamation: http://groups.google.com/group/comp.programming.threads/browse_frm/thread/3d...

it went something like, membars aside for a moment:
_____________________________________________________________
struct node
{
    struct node* next;
};

struct proxy
{
    struct node* head;   // = NULL
    uint32_t count;      // = 0
};

void proxy_acquire(struct proxy* const self)
{
    struct proxy cmp, xchg;
    cmp.head = self->head;
    cmp.count = self->count;
    do
    {
        xchg.head = cmp.head;
        xchg.count = cmp.count + 1;
    } while (! DWCAS(self, &cmp, &xchg));
}

struct node* proxy_release(struct proxy* const self)
{
    struct proxy cmp, xchg;
    cmp.head = self->head;
    cmp.count = self->count;
    do
    {
        xchg.count = cmp.count - 1;
        if (! xchg.count)
        {
            xchg.head = NULL;
        }
        else
        {
            xchg.head = cmp.head;
        }
    } while (! DWCAS(self, &cmp, &xchg));
    if (! xchg.count)
    {
        return cmp.head;
    }
    return NULL;
}

struct node* proxy_defer(struct proxy* const self, struct node* node)
{
    struct proxy cmp, xchg;
    cmp.head = self->head;
    cmp.count = self->count;
    do
    {
        node->next = cmp.head;
        xchg.count = cmp.count;
        if (! xchg.count)
        {
            xchg.head = NULL;
        }
        else
        {
            xchg.head = node;
        }
    } while (! DWCAS(self, &cmp, &xchg));
    if (! xchg.count)
    {
        return node;
    }
    return NULL;
}
_____________________________________________________________

This solves memory reclamation, but will pile up nodes during times of sustained load...
there are ways to do reliable reclaim even in that case (e.g. that allow reclaiming an element after the last thread that could actually have "seen" it leaves the critical section
Yes. My next incarnation of the proxy collector above was coded to explicitly handle this situation. It simply breaks off collector objects during deferment and backlinks them all together: http://webpages.charter.net/appcore/misc/pc_sample_h_v1.html
-- can be done with checkpoints similar RCU, so it can even be more efficient than the atomic counter I used), yet I am not sure this is really worth it -- I would like to make an entirely different point that has been nagging me:
I don't think it's worth using RCU just to get around ABA in a lock-free stack. Heck, it does not even fit in with the general usage pattern which does not expect a lot of writes. Humm, well, of course you can take tremendous pressure off of RCU by combining it with version counters and only using them to determine when to defer nodes. For instance, you don't need to defer a node into RCU until a version counter is just about to rollover.

"Chris M. Thomasson" <cristom@charter.net> wrote in message news:hgfqaj$k90$1@ger.gmane.org...
"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.0912101115460.456@m65s28.vlinux.de...
On Wed, 9 Dec 2009, Anthony Williams wrote:
So far I have not yet met any 64-bit platform that provides 64 bits of (user-space!) address space (the same cannot be said of 32-bit platforms), so I would say that any use of spare bits on 64 bit is safe (for user-space!).
As Chris also noted, I would be wary of using them for ABA checking though: it's easy to overflow a 16-bit counter quite quickly, and that's all you've got even if the OS only uses 48 of your 64 bits for address space.
This concern is certainly valid for basically any algorithm that uses "generation counters" or similar constructs for disambiguation (it could be argued that even 32 bits may not be enough). Using spare bits at all should however be considered a valid technique.
FWIW there are other ways to avoid ABA than tagging pointers, and without DCAS on 32-bit, see e.g. this sample implementation of a lock-free stack:
http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp
Yup. Incidentally, you don't even need to count the nodes in push, provided you're careful when reclaiming nodes in pop.
hm, I'm not sure what you mean, I don't think I am counting anything but the number of threads in the critical section?
The downside of this technique is that if the traffic is too heavy then the ref count never reaches zero, so the nodes can never be reclaimed.
Right, and although I would argue that such a highly contended hotspot is a design flaw in itself,
You have created a proxy garbage collector that reminds me of one I did a long time ago which also suffered from this exact same problem.
FWIW Helge, here is a new proxy garbage collector algorithm of mine that I am currently experimenting with:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a5...

It is basically a simplified version of the following lock-free algorithm:

http://webpages.charter.net/appcore/misc/pc_sample_h_v1.html

The new algorithm is a *wait-free solution to the ABA problem and memory reclamation issues in general wrt non-blocking algorithms. The usage pattern is basically identical to your collector wrt the following code:

http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp

you do it like:
__________________________________________________________________
bool pop(T &data) throw()
{
    enter_crit();
    node * n = stack.pop();
    if (n)
    {
        data = n->data;
        expired.push(n);
    }
    leave_crit();
    return n != 0;
}
___________________________________________________________________

The equivalent code for my collector would be:
___________________________________________________________________
bool pop(T& data) throw()
{
    proxy_type::collector& c = m_proxy.acquire();
    node* n = m_stack.pop();
    if (n)
    {
        data = n->m_data;
        m_proxy.collect(c, n);
    }
    m_proxy.release(c);
    return (n != NULL);
}
___________________________________________________________________

[*] The algorithm is wait-free in practice on systems with native fetch-and-add and swap instructions like x86-32/64. If the architecture forces you to implement fetch-and-add with a damn LL/SC or CAS loop, well, then it's not really wait-free now is it? No, I am afraid it's lock-free, or whatever semantics the LL/SC is, which could even be obstruction-free. Sad... ;^(...

On Saturday 16 January 2010 03:15:44, Chris M. Thomasson wrote:
FWIW Helge, here is a new proxy garbage collector algorithm of mine that I am currently experimenting with:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a 53f24de178b419f
interesting concept, but AFAICT it shares two undesirable properties with my implementation: reclamation can be delayed indefinitely if

- one thread is "stuck" within the critical section (e.g. scheduled away)
- threads interlock such that at least one thread is always within the critical section

Right? It's not that I consider this problematic (there is no reason I would ever want to share a data structure between threads if I know that the above can pose a problem).
The algorithm is wait-free in practice on systems with native fetch-and-add and swap instructions like x86-32/64. If the architecture forces you to implement fetch-and-add with a damn LL/SC or CAS loop, well, then it's not really wait-free now is it? No, I am afraid it's lock-free, or whatever semantics the LL/SC is which could even be obstruction-free. Sad...
Depends how it is implemented -- LL/SC is wait-free iff other reads to the location do not invalidate the reservation (PowerPC is in that category). I strongly suspect that CAS on x86 is actually just a micro-coded LL/SC anyway... Regards, Helge

"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:201001181953.33044.hcb@chaoticmind.net...
On Saturday 16 January 2010 03:15:44, Chris M. Thomasson wrote:
FWIW Helge, here is a new proxy garbage collector algorithm of mine that I am currently experimenting with:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a 53f24de178b419f
interesting concept, but AFAICT it shares two undesirable properties with my implementation: Reclamation can be delayed indefinitely if
- one thread is "stuck" within the critical section (e.g. scheduled away)
Well, if a thread actually deadlocks or "permanently livelocks" within the critical section, then no reclamation can occur.
- threads interlock such that at least one thread is always within the critical section
Right?
No. My proxy algorithm is N = 2, while yours is N = 1, where N is the number of collector objects. This means that you only have a _single_ reference count. During periods of load, this single reference count might never reach zero and the garbage will keep piling up. In my case, I can atomically switch collector objects and reset the reference count, thus guaranteeing that the old one will drop to zero exactly when the last thread accessing it releases its reference. This is because all new threads will take references to the new collector. Here is the key area in my algorithm; take note of the following procedure in the code: http://cpt.pastebin.com/f537f2c7c
__________________________________________________________________
void prv_quiesce_begin(collector& c)
{
    // Try to begin the quiescence process.
    if (! m_quiesce.exchange(true, std::memory_order_acquire))
    {
        // advance the current collector and grab the old one.
128:    size_t i = ((&c - m_collectors) + 1) & 1;
130:    unsigned int old = m_current.exchange(i, std::memory_order_acq_rel);

        // decode reference count.
        unsigned int refs = old & 0xFFFFFFF0U;

        // verify previous collector index.
        RL_ASSERT((old & 0xFU) == (&c - m_collectors));

        // increment and generate an odd reference count.
        c.m_count.fetch_add(refs + 0x10U, std::memory_order_release);
    }
}
__________________________________________________________________

Line `128' simply determines the index of the next collector object. Line `130' atomically resets the main reference count to zero _and_ swaps in the next collector index. Now, all new threads will only be able to gain references to the collector that was just swapped in, which means that it will be impossible for them to gain access to the old collector. Therefore, the old collector will drop to zero. I guess this is similar to using split counters to implement RCU. However, AFAICT my algorithm is unique and completely different from Paul McKenney's split counter RCU solution.
It's not that I consider this problematic (there is no reason I would ever want to share a data structure between threads if I know that the above can pose a problem).
It can be problematic for various usage patterns. For instance, suppose I wanted to use an N = 1 collector to solve the reader/writer problem (e.g., readers iterating over a data structure while writers are concurrently pushing/popping items). Now, I fully expect reads to be very frequent in such scenarios. This would not work well with an N = 1 collector, because if reader threads were constantly accessing the data structure, then the reference count would probably never reach zero. However, with an N = 2 collector I can guarantee that new reader threads will not affect the reference count of the collector swapped out in line `130'. This is a case in which N = 2 is the key difference between our implementations.
The algorithm is wait-free in practice on systems with native fetch-and-add and swap instructions like x86-32/64. If the architecture forces you to implement fetch-and-add with a damn LL/SC or CAS loop, well, then it's not really wait-free now is it? No, I am afraid it's lock-free, or whatever semantics the LL/SC is which could even be obstruction-free. Sad...
Depends how it is implemented -- LL/SC is wait-free iff other reads to the location do not invalidate the reservation (PowerPC is in that category).
Here are some fairly interesting posts from Andy Glew (former Computer Architect at Intel) wrt LL/SC: http://groups.google.com/group/comp.arch/msg/18f0d50aa3c9a148 http://groups.google.com/group/comp.arch/msg/5d1ff313519005f8
I strongly suspect that CAS on x86 is actually just a micro-coded LL/SC anyway...
I think they take locks to get the job done. Andy Glew would be the guy to ask. So, I should probably post a query over in `comp.arch' where all the hardware gurus hang out. ;^)

On Mon, 18 Jan 2010, Chris M. Thomasson wrote:
"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:201001181953.33044.hcb@chaoticmind.net...
On Saturday 16 January 2010 03:15:44, Chris M. Thomasson wrote:
FWIW Helge, here is a new proxy garbage collector algorithm of mine that I am currently experimenting with:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a 53f24de178b419f
interesting concept, but AFAICT it shares two undesirable properties with my implementation: Reclamation can be delayed indefinitely if
- one thread is "stuck" within the critical section (e.g. scheduled away)
Well, if a thread actually deadlocks or "permanently livelocks" within the critical section, then no reclamation can occur.
this means that reclamation (and by extension writing with bounded amount of memory) is not wait-free (this is not a problem I think, just pointing out)
- threads interlock such that at least one thread is always within the critical section
Right?
No. My proxy algorithm is N = 2, while yours is N = 1 where N is the number of collector objects. This means that you only have a _single_ reference count. During periods of load, this single reference count might never reach zero and the garbage will keep piling up.
[snip explanation] Okay thanks for your explanation, I did not understand your code correctly the first time, but now things are much more clear. The concept is very nice indeed, and I agree that N=1 and N=2 makes quite a difference. I *think* that your approach is much more valuable as a building block for "object-local RCU" for data structures with long-lived read access than ABA prevention for producer/consumer data structures (where the read-side critical section is very short). Thanks for sharing the code, I can already think of an application for this in my own code :) Best regards, Helge

"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.1001191553030.25332@m65s28.vlinux.de...
On Mon, 18 Jan 2010, Chris M. Thomasson wrote:
"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:201001181953.33044.hcb@chaoticmind.net...
On Saturday 16 January 2010 03:15:44, Chris M. Thomasson wrote:
FWIW Helge, here is a new proxy garbage collector algorithm of mine that I am currently experimenting with:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/a 53f24de178b419f
interesting concept, but AFAICT it shares two undesirable properties with my implementation: Reclamation can be delayed indefinitely if
- one thread is "stuck" within the critical section (e.g. scheduled away)
Well, if a thread actually deadlocks or "permanently livelocks" within the critical section, then no reclamation can occur.
this means that reclamation (and by extension writing with bounded amount of memory) is not wait-free (this is not a problem I think, just pointing out)
Yes; very good point Helge. I simply _failed_ to clarify that the wait-free properties come from the fact that I am not using loops (e.g., a CAS loop) to acquire/release a reference to a collector object. This has been a point of interest in the past over on `comp.programming.threads'. Previous collectors used a CAS loop to either acquire and/or release a reference. This does not work out too well for a reader/writer pattern, because only one reader thread can gain access while the others fail their CAS and have to try again. This lock-free property can result in fairly poor performance for readers. Therefore, we have been coming up with wait-free techniques to allow readers to gain access without looping around in software.
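(A small sketch of the difference being described, with illustrative names and C++0x-style atomics rather than the actual collector code: the CAS-loop acquire is merely lock-free, while a native fetch-and-add acquire is wait-free.)

    #include <atomic>

    struct collector_counter
    {
        std::atomic<unsigned> refs;

        // Lock-free: under contention a reader can keep failing its CAS and
        // having to retry while other threads succeed.
        void acquire_cas_loop()
        {
            unsigned cur = refs.load(std::memory_order_relaxed);
            while (!refs.compare_exchange_weak(cur, cur + 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed))
            { /* retry */ }
        }

        // Wait-free (given a native fetch-and-add): every reader completes in
        // a bounded number of steps regardless of what other threads do.
        void acquire_fetch_add()
        {
            refs.fetch_add(1, std::memory_order_acquire);
        }
    };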
- threads interlock such that at least one thread is always within the critical section
Right?
No. My proxy algorithm is N = 2, while yours is N = 1 where N is the number of collector objects. This means that you only have a _single_ reference count. During periods of load, this single reference count might never reach zero and the garbage will keep piling up.
[snip explanation]
Okay thanks for your explanation, I did not understand your code correctly the first time, but now things are much more clear.
That's great! :^)
The concept is very nice indeed, and I agree that N=1 and N=2 makes quite a difference. I *think* that your approach is much more valuable as a building block for "object-local RCU" for data structures with long-lived read access than ABA prevention for producer/consumer data structures (where the read-side critical section is very short).
I think I agree with that statement.
Thanks for sharing the code,
You are welcome! If you can think of any optimizations I would LOVE to hear all about them. One thing I can think of would be to remove the need for threads to actually destroy objects. This can be moved to a separate thread or the objects can simply be stuck in a cache to be reused. For instance, if the `proxy::release(...)' procedure was turned into a function which returned a pointer to a stack of nodes that are in a quiescent state, then the thread could decide what to do with them. I think that this would make the algorithm more flexible. Please note that this is only an initial implementation and it's more of a proof-of-concept than anything. There are probably many improvements that can be made... ;^)
I can already think of an application for this in my own code :)
Cool. IMVHO, the proxy collector is fairly generic in that you can use it to solve memory reclamation in many different types of non-blocking algorithms and *usage patterns. I think that if Boost does include a non-blocking library, then something along the lines of the collector algorithm should probably be included.

[*] - I am experimenting with the collector using the following usage pattern. Imagine a database server that gets a great many sustained queries on a regular basis. Now, reader threads receive search requests and then execute the query on a dynamic shared data structure. Here is high-level pseudo-code of how I am constructing the logic of the reader threads:
__________________________________________________________________
#define SYNC_INTERVAL 256
#define DEFER_THRESHOLD 128

typedef proxy<DEFER_THRESHOLD> proxy_type;

static proxy_type g_proxy;

void reader_thread()
{
    proxy_type::collector* c = &g_proxy.acquire();

    for (unsigned i = 1 ;; ++i)
    {
        query_type* q = try_to_get_a_query_request();

        if (! q)
        {
            g_proxy.release(*c);
            q = wait_for_a_query_request();
            c = &g_proxy.acquire();
        }

        execute_query(q);

        if (! (i % SYNC_INTERVAL))
        {
            g_proxy.release(*c);
            c = &g_proxy.acquire();
        }
    }

    g_proxy.release(*c);
}
__________________________________________________________________

Instead of acquiring/releasing a collector object for every single execution of a query, this technique attempts to amortize the number of times a reader thread needs to acquire/release a collector.

Now, I can think of a possible optimization/enhancement to the proxy algorithm itself. Take note that an odd reference count denotes that a collection cycle is in progress. Therefore, if a reader thread can simply test whether the reference count of its acquired collector object is even, then it simply does not actually "need" to release a reference to it. If it's odd, then a release is necessary in order to move along the current collection process. This would allow the reader threads' acquire/release pattern to be amortized across the actual garbage deferment threshold. I think I am going to alter the algorithm in order to support the following usage pattern:
__________________________________________________________________
#define DEFER_THRESHOLD 256

typedef proxy<DEFER_THRESHOLD> proxy_type;

static proxy_type g_proxy;

void reader_thread()
{
    proxy_type::collector* c = &g_proxy.acquire();

    for (;;)
    {
        query_type* q = try_to_get_a_query_request();

        if (! q)
        {
            g_proxy.release(*c);
            q = wait_for_a_query_request();
            c = &g_proxy.acquire();
        }

        execute_query(q);

        if (c->is_collecting())
        {
            g_proxy.release(*c);
            c = &g_proxy.acquire();
        }
    }

    g_proxy.release(*c);
}
__________________________________________________________________

This would allow reader threads to churn along during periods of load with fairly low overhead. I say this because the only time the readers would have to release their collector object and re-acquire it is when they must explicitly wait for query requests, or when the garbage threshold was reached and a collection cycle was initiated by a writer thread. There are some fairly promising enhancements I think I can make to the collector, and that is one of them. Humm...

"Chris M. Thomasson" <cristom@charter.net> wrote in message news:hj84tr$7r8$1@ger.gmane.org...
"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.1001191553030.25332@m65s28.vlinux.de... [...] Now, I can think of a possible optimization/enhancement to the proxy algorithm itself. Take note that an odd reference count denotes that a collection cycle is in progress. Therefore, if a reader thread can simply test if the reference count of it's acquired collector object is even, then it simple does not actually "need" to release a reference to it. If it's odd, then a release is necessary in order to move along the current collection process. This would allow the reader threads acquire/release pattern to be amortized across the actual garbage deferment threshold. I think I am going to alter the algorithm in order to support the following usage pattern: __________________________________________________________________ #define DEFER_THRESHOLD 256
typedef proxy<DEFER_THRESHOLD> proxy_type;
static proxy_type g_proxy;
void reader_thread() { proxy_type::collector* c = &g_proxy.acquire();
for (;;) { query_type* q = try_to_get_a_query_request();
if (! q) { g_proxy.release(*c);
q = wait_for_a_query_request();
c = &g_proxy.acquire(); }
execute_query(q);
if (c->is_collecting()) { g_proxy.release(*c);
c = &g_proxy.acquire(); } }
g_proxy.release(*c); } __________________________________________________________________
This would allow reader threads to curn along during periods of load with fairly low overhead. I say this because the only time the readers would have to release their collector object and re-acquire it is when they must explicitly wait for query requests, or when the garbage threshold was reached and a collection cycle was initiated by a writer thread. There are some fairly promising enhancements I think I can make to the collector, and that is one of them.
Humm...
Refer to the following message over in `comp.programming.threads' for a model of the pattern above in Relacy 2.2: http://groups.google.com/group/comp.programming.threads/msg/f03f8435a1792061 Enjoy! ;^)

"Andy Venikov" <avenikov@gmail.com> wrote in message news:hfrs5q$anq$1@ger.gmane.org...
Helge Bahmann wrote:
There are valid uses for CAS with and without various types of barriers. See the explicit memory_order specification in my atomic operations library.
So, CAS by itself doesn't imply anything, it's all about how you call it, right?
Humm, kind of. CAS on certain architectures actually implies quite a bit (e.g., the `LOCK' prefix on x86-32/64). However, since this is on a Boost forum I should probably be thinking abstractly, in the sense of creating highly portable algorithms in order to help make the programming experience with the Boost libraries easier, so to speak; right? ;^) Viewed abstractly, then, CAS is an atomic RMW with absolutely no effect on memory visibility. However, I think that the return value from a failed CAS on an address A can be thought of as a "load". On the flip side, a successful CAS on address B is basically a "store", in the sense of the successful RMW portion of the CAS itself. The "load" from A can rise above the "store" to B. CAS is irrelevant here, and an external memory barrier operation might need to be executed after the store to B in order to ensure that it is "fully rendered into visible memory" _before_ the load from A is "committed".
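To make that abstract point concrete, here is a minimal sketch in C++0x-style atomics (hypothetical variables `a' and `b', not anyone's library code): a relaxed CAS on `b' acts as the store, a later failed relaxed CAS on `a' acts as the load, and only an explicit fence between them forces the store-then-load order.
__________________________________________________________________
#include <atomic>

std::atomic<int> a(0), b(0);

void store_then_load()
{
    int eb = 0;
    // successful CAS on b: effectively a store of 1 into b
    b.compare_exchange_strong(eb, 1, std::memory_order_relaxed,
                                     std::memory_order_relaxed);

    // Without this fence, the load half of the CAS on `a' below may be
    // satisfied before the store to `b' becomes visible to other threads.
    std::atomic_thread_fence(std::memory_order_seq_cst);

    int ea = -1; // deliberately wrong, so the CAS fails and acts as a load
    a.compare_exchange_strong(ea, 0, std::memory_order_relaxed,
                                     std::memory_order_relaxed);
    // ea now holds the value observed in a
}
__________________________________________________________________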

FWIW there are other ways to avoid ABA than tagging pointers, and without DCAS on 32-bit, see e.g. this sample implementation of a lock-free stack:
http://www.chaoticmind.net/~hcb/projects/boost.atomic/stack.hpp
i ran some throughput benchmarks of your stack class vs mine (using the stack test without the data hash table)

boost.lockfree stack, x86_64, pointer compression:

  97019.782330  task-clock-msecs   #  7.730 CPUs
        108258  context-switches   #  0.001 M/sec
            14  CPU-migrations     #  0.000 M/sec
           820  page-faults        #  0.000 M/sec
  270234935420  cycles             #  2785.359 M/sec
   18556843363  instructions       #  0.069 IPC
    1002004340  cache-references   #  10.328 M/sec
         65412  cache-misses       #  0.001 M/sec
    1497109143  branches           #  15.431 M/sec
     159665283  branch-misses      #  1.646 M/sec

  12.550888223  seconds time elapsed

helge's stack, x86_64:

  97193.980901  task-clock-msecs   #  6.829 CPUs
        181307  context-switches   #  0.002 M/sec
            19  CPU-migrations     #  0.000 M/sec
           525  page-faults        #  0.000 M/sec
  270730718104  cycles             #  2785.468 M/sec
   21901021611  instructions       #  0.081 IPC
     816105899  cache-references   #  8.397 M/sec
         44053  cache-misses       #  0.000 M/sec
    4929368756  branches           #  50.717 M/sec
     234936307  branch-misses      #  2.417 M/sec

  14.231853706  seconds time elapsed

i added your stack class to a new branch (topic/helge_stack) in my repository ... cheers, tim -- tim@klingt.org http://tim.klingt.org The price an artist pays for doing what he wants is that he has to do it. William S. Burroughs

Yes. Recently I did an experiment on 32 bit 2.6 linux using SUSE 9.1 and the libc that came with it. (libc version was 2.3.3-97). I was doing all sorts of allocations and looking at the returned pointer. What I found was that only the two low bits were always the same (obviously because malloc has to return aligned data). All other bits could be different depending on the size of the requested allocation. I think they also depended on the thread that was requesting the allocation, but I'm not sure on this point. So, other than the 2 lower bits, you can't assume you can use any more. And 2 bits seems kinda low for an ABA tag.
all 32 bit platforms that i know of are using the full 32bit as address space (the lower two bits are 0 because of 32bit alignment) ... on 64bit platforms it is quite different ... ppc64, ia64 and x86_64 all use just a part of the full address space (iirc 48 bit for x86_64) ... the rest should be usable as aba prevention tag.
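For illustration, here is a minimal sketch (my wording, not the library's actual tagged_ptr) of that pointer/tag compression idea, under the assumption that user-space addresses on x86_64 fit in 48 bits:
__________________________________________________________________
#include <cstdint>
#include <cassert>

struct packed_ptr {
    std::uint64_t value; // bits [63:48] = ABA tag, bits [47:0] = pointer

    static packed_ptr pack(void* p, std::uint16_t tag) {
        std::uint64_t bits = reinterpret_cast<std::uint64_t>(p);
        assert((bits >> 48) == 0); // assumes a 48-bit user address space
        return packed_ptr{ (std::uint64_t(tag) << 48) | bits };
    }
    void* ptr() const {
        return reinterpret_cast<void*>(value & ((std::uint64_t(1) << 48) - 1));
    }
    std::uint16_t tag() const { return static_cast<std::uint16_t>(value >> 48); }
};
__________________________________________________________________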
1. You're missing write barrier after a call to alloc_node(). Rational: alloc_node sets the "next" pointer of the newly allocated node to NULL. That is a store to shared memory. Setting next pointer to NULL has to be observable to other threads BEFORE they observe the inserted node. If you don't insert a write barrier, it is possible for other threads to see the new element in the list before element's next pointer is initialized with NULL.
i was assuming, that cas issues a full memory barrier (at least my cas implementation does), so i don't think, this would be an issue.
You know, it's an interesting question whether CAS really implies a memory barrier or not. Obviously, you've implemented it that way, but I think it need not be that way. I think there's really no "standard" on CAS so we need to go with what is commonly accepted. It may be another question that would be good to ask at c.p.t. If someone can answer this question, please chime in.
the c++0x atomic library handles it explicitly ... compare_exchange takes the ordering as an argument ...
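A tiny sketch of what that looks like at the call site (C++0x/Boost.Atomic style; the names here are illustrative, not taken from the library):
__________________________________________________________________
#include <atomic>

struct node;

bool try_push(std::atomic<node*>& head, node*& expected, node* desired)
{
    // success order: release (publish the new node)
    // failure order: relaxed (we only reuse the reloaded `expected' value)
    return head.compare_exchange_weak(expected, desired,
                                      std::memory_order_release,
                                      std::memory_order_relaxed);
}
__________________________________________________________________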
i am currently porting the data structures to the boost.atomic proposal, which provides better language support to specify the memory order constraints. the code is available from a separate branch of my repository. i would be curious, if you could have a look at that implementation as well.
Definitely. How can I get to your latest code?
the code can be found in my git repository [1], branch lockfree-atomic ... unfortunately, i am stuck with the fifo implementation, which is currently broken, although i cannot see why :/ thnx, tim [1] http://tim.klingt.org/git?p=boost_lockfree.git -- tim@klingt.org http://tim.klingt.org Relying on the government to protect your privacy is like asking a peeping tom to install your window blinds. John Perry Barlow

Tim Blechmann wrote:
all 32 bit platforms that i know of are using the full 32bit as address space (the lower two bits are 0 because of 32bit alignment) ... on 64bit platforms it is quite different ... ppc64, ia64 and x86_64 all use just a part of the full address space (iirc 48 bit for x86_64) ... the rest should be usable as aba prevention tag.
48 bits is only for the current x86_64 platforms, it will get upgraded to 64 bits eventually. However, such platforms will also support DWCAS, so you have your backup solution.

all 32 bit platforms that i know of are using the full 32bit as address space (the lower two bits are 0 because of 32bit alignment) ... on 64bit platforms it is quite different ... ppc64, ia64 and x86_64 all use just a part of the full address space (iirc 48 bit for x86_64) ... the rest should be usable as aba prevention tag.
48 bits is only for the current x86_64 platforms, it will get upgraded to 64 bits eventually.
However, such platforms will also support DWCAS, so you have your backup solution.
it is also performance related, though ... on x86_64 (nehalem) my fifo stress test runs about 25% faster with pointer/tag compression than with cmpxchg16b ... that said, the lock-free property is for me more important than the throughput, since i am using it for soft real-time systems ... tim -- tim@klingt.org http://tim.klingt.org I had nothing to offer anybody except my own confusion Jack Kerouac

Tim Blechmann wrote:
all 32 bit platforms that i know of are using the full 32bit as address space (the lower two bits are 0 because of 32bit alignment) ... on 64bit platforms it is quite different ... ppc64, ia64 and x86_64 all use just a part of the full address space (iirc 48 bit for x86_64) ... the rest should be usable as aba prevention tag. 48 bits is only for the current x86_64 platforms, it will get upgraded to 64 bits eventually.
However, such platforms will also support DWCAS, so you have your backup solution.
it is also performance related, though ... on x86_64 (nehalem) my fifo stress test runs about 25% faster with pointer/tag compression than with cmpxchg16b ... that said, the lock-free property is for me more important than the throughput, since i am using it for soft real-time systems ...
Yes, this has also been my fear: that cmpxchg outperforms cmpxchg8b, which outperforms cmpxchg16b. Tim, have you read the replies to my post on c.p.t regarding ABA bits? Even on this thread someone (I think Helge) argued that even 32 bits may not be enough. Now I'm thinking that maybe the "generation counter" solution may not be workable as a general solution. Andy.

Tim, have you read the replies to my post on c.p.t regarding ABA bits? Even on this thread someone (I think Helge) argued that even 32 bits may not be enough. Now I'm thinking that maybe "generation counter" solution may not be workable as a general solution.
For ABA, overflow (or wraparound, really) shouldn't be a problem. All you are trying to do is make sure you don't see the same pointer twice, while in a CAS. Seeing the same pointer twice does happen, but not that often. What are the chances that not only will it be the same pointer, but also that the 32 bits have wrapped around back to the same value? I would suspect that the CAS'ing thread must have suspended for quite a while in order for the 32 bits to loop around, and on top of that, to hit the 1 / 2^32 chance of landing on the same value...
Andy.
Tony

"Gottlob Frege" <gottlobfrege@gmail.com> wrote in message news:97ffb310912101725w14104d84n8bdeeebeed2d7345@mail.gmail.com...
Tim, have you read the replies to my post on c.p.t regarding ABA bits? Even on this thread someone (I think Helge) argued that even 32 bits may not be enough. Now I'm thinking that maybe "generation counter" solution may not be workable as a general solution.
For ABA, overflow (or wraparound, really) shouldn't be a problem. All you are trying to do is make sure you don't see the same pointer twice, while in a CAS. Seeing the same pointer twice does happen, but not that often. What are the chances that not only will it be the same pointer, but also that the 32 bits have wrapped around back to the same value? I would suspect that the CAS'ing thread must have suspended for quite a while in order for the 32 bits to loop around, and on top of that, to hit the 1 / 2^32 chance of landing on the same value...
Although, it would _definitely_ suck if the race-condition hit! I am way too paranoid... lol. ;^)

Hi, do x86 and x86_64 support DCAS (Intel Core 2 Duo onwards)? And ultimately, how does Tim solve the ABA problem? Thanks jon

Do x86 and x86_64 support DCAS (Intel Core 2 Duo onwards)? And ultimately, how does Tim solve the ABA problem?
x86 supports cmpxchg8b (since i586 iirc). x86_64 provides cmpxchg16b, except on early amd machines. tim -- tim@klingt.org http://tim.klingt.org Every word is like an unnecessary stain on silence and nothingness Samuel Beckett
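For what it's worth, here is a hedged sketch of the DWCAS route on an anchor (this is not Tim's code): a pointer/counter pair packed into a 16-byte struct and updated through compare_exchange. Whether this actually compiles down to cmpxchg16b rather than a lock depends on the compiler and flags (e.g. -mcx16 on x86_64), so is_lock_free() should be checked at run time.
__________________________________________________________________
#include <atomic>
#include <cstdint>

struct node;

struct counted_ptr {
    node* ptr;
    std::uint64_t aba; // bumped on every successful update
};

bool advance(std::atomic<counted_ptr>& anchor, node* next)
{
    counted_ptr expected = anchor.load();
    counted_ptr desired = { next, expected.aba + 1 };
    return anchor.compare_exchange_weak(expected, desired);
}
__________________________________________________________________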

it is also performance related, though ... on x86_64 (nehalem) my fifo stress test runs about 25% faster with pointer/tag compression than with cmpxchg16b ... that said, the lock-free property is for me more important than the throughput, since i am using it for soft real-time systems ...
Yes, this has also been my fear that cmpxchg outperforms cmpxch8b which outperforms cmpxch16b.
Tim, have you read the replies to my post on c.p.t regarding ABA bits? Even on this thread someone (I think Helge) argued that even 32 bits may not be enough. Now I'm thinking that maybe "generation counter" solution may not be workable as a general solution.
i just went through the replies ... (maybe i should upcase some parts in the documentation, stating that the implementation focuses on WORST CASE, not AVERAGE CASE performance ... people keep complaining that the stack/fifo may be outperformed by blocking algorithms, which is both true and irrelevant for me, as these implementations are soft real-time safe (and could be made hard real-time safe). as for the aba tag ... increasing the tag is not necessary in the enqueue operation (chris thomasson made a valid point here), but even then 16 bits would give 2**16 different tags. of course a tag overflow is possible, but not very likely ... by definition ll/sc are immune to aba problems, but implementing cas via ll/sc, one loses this feature ... personally i would prefer to have language support for ll/sc transactions instead of aba-prone cas ... most cas-architectures provide dcas, while most ll/sc architectures shouldn't use cas emulation, but ll/sc-style transactions directly, as it is by definition aba immune ... too bad, c++0x doesn't provide language support for ll/sc, but only for cas :/ tim -- tim@klingt.org http://tim.klingt.org Life is really simple, but we insist on making it complicated. Confucius

On Fri, 11 Dec 2009, Tim Blechmann wrote:
by definition ll/sc are immune to aba problems
This is NOT correct. The sequence of operations you would need to prevent ABA is: object *tmp=head.load_linked(); object *next=tmp->next; head.store_conditional(next); This is however NOT possible because you are not allowed to make any memory access between ll and sc -- some CPUs (definitely older mips, but not ppc, unsure about many of the other architectures) clear the reservation on _any_ memory reference and will cause this sequence to always fail. Alpha has the fun requirement of a memory barrier before you are allowed to dereference "tmp", which just provides another way to kill your reservation. ll/sc only prevents "pathological" ABA races as you cannot inspect the object being replaced. Exposing an ll/sc interface to a higher-level language is therefore unlikely to result in something usable.
most cas-architectures provide dcas
I think x86 is the rare exception, not the rule, so relying on DCAS excludes pretty much all 32-bit except for x86. Helge

Helge Bahmann wrote:
On Fri, 11 Dec 2009, Tim Blechmann wrote:
by definition ll/sc are immune to aba problems
This is NOT correct. The sequence of operations you would need to prevent ABA is:
object *tmp=head.load_linked(); object *next=tmp->next; head.store_conditional(next);
This is however NOT possible because you are not allowed to make any memory access between ll and sc -- some CPUs (definitely older mips, but not ppc, unsure about many of the other architectures) clear the reservation on _any_ memory reference and will cause this sequence to always fail. Alpha has the fun requirement of a memory barrier before you are allowed to dereference "tmp", which just provides another way to kill your reservation. PPC as well. 7447, for example, will invalidate the "reservation bit" after writing to ANY memory location.
ll/sc only prevents "pathological" ABA races as you cannot inpsect the object being replaced. Exposing an ll/sc interface to a higher-level language is therefore unlikely to result in something usable.
most cas-architectures provide dcas
I think x86 is the rare exception, not the rule, so relying on DCAS excludes pretty much all 32-bit except for x86.
Helge _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

On Fri, 11 Dec 2009, Andy Venikov wrote:
Helge Bahmann wrote:
This is however NOT possible because you are not allowed to make any memory access between ll and sc -- some CPUs (definitely older mips, but not ppc, unsure about many of the other architectures) clear the reservation on _any_ memory reference and will cause this sequence to always fail. Alpha has the fun requirement of a memory barrier before you are allowed to dereference "tmp", which just provides another way to kill your reservation. PPC as well. 7447, for example, will invalidate the "reservation bit" after writing to ANY memory location.
Cool, this makes the whole thing model-specific behavior and thus even more fun :) On the ppc 7455 available to me it not only works, but the "stwcx." in fact behaves more like a CAS (it remembers the value loaded by lwarx, and succeeds if it matches -- thus ABA detection won't work for entirely different reasons). Other fun restrictions that I remember from some architecture manual (I don't know which, think it was also some mips flavor): No taken branch allowed between ll/sc. The message should however be clear: ll/sc is *brittle* and not usable for ABA prevention. Helge

The message should however be clear: ll/sc is *brittle* and not usable for ABA prevention.
Helge
Sad thing is that if some of those brittle restrictions could be removed, ll/sc could be more useful than CAS. And unfortunately C++0x seems to be leaning towards CAS, and that might discourage other alternatives. Tony

Am Friday 11 December 2009 17:07:58 schrieb Gottlob Frege:
The message should however be clear: ll/sc is *brittle* and not usable for ABA prevention.
Helge
Sad thing is that if some of those brittle restrictions could be removed, ll/sc could be a more useful than CAS. And unfortunately C++0x seems to be leanings towards CAS, and that might discourage other alternatives.
What _are_ the alternatives? Transactional memory systems have been promised "real soon now", but are not there yet, and standardizing something unimplementable helps no one. In my opinion the problem is something entirely different: When you need a "queue" concept, you will highly likely immediately choose a doubly-linked list without further thought. The thing that turned out to be really difficult is the construction of direct wait-free equivalents of such seemingly "simple" specific data structures. This has not prevented programmers who had no trouble constructing data-structures custom-tailored to their specific purpose from being very successful with just CAS. This track record, the many instances of code using locking that could be replaced with wait-free methods (see e.g. my wait-free signal/slot library) and the theoretical results showing CAS to be universal make me quite content with CAS. Helge

"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.0912111612520.13830@m65s28.vlinux.de...
On Fri, 11 Dec 2009, Andy Venikov wrote:
Helge Bahmann wrote:
This is however NOT possible because you are not allowed to make any memory access between ll and sc -- some CPUs (definitely older mips, but not ppc, unsure about many of the other architectures) clear the reservation on _any_ memory reference and will cause this sequence to always fail. Alpha has the fun requirement of a memory barrier before you are allowed to dereference "tmp", which just provides another way to kill your reservation. PPC as well. 7447, for example, will invalidate the "reservation bit" after writing to ANY memory location.
Cool, this makes the whole thing model-specific behavior and thus even more fun :) On the ppc 7455 available to me it not only works, but the "stwcx." in fact behaves more like a CAS (it remembers the value loaded by lwarx, and succeeds if it matches -- thus ABA detection won't work for entirely different reasons).
Other fun restrictions that I remember from some architecture manual (I don't know which, think it was also some mips flavor): No taken branch allowed between ll/sc.
The message should however be clear: ll/sc is *brittle* and not usable for ABA prevention.
FWIW, it's all in `Book II: PowerPC Virtual Environment Architecture' Appendix `B' which you can get here: http://www.ibm.com/developerworks/systems/library/es-archguide-v2.html Especially section `B.3 List Insertion'. You have got to take extreme care when implementing this on a PPC. You don't want something like false-sharing to break reservations. You have to make sure to align everything on cache line boundaries _and_ to ensure proper padding between non-related elements, assuming that the reservation granule is a cache line. All of that good stuff! ;^)

Tim Blechmann wrote: Also, Tim, Chris Thomasson found another problem with the current code. In the non-pointer-packing solution your head and tail are multi-word. You can't load a multi-word value atomically without special instructions. Here's his comment: Chris M. Thomasson wrote
AFAICT, in the code by Tim Blechmann, the anchor is typedef'ed as: ____________________________________________________________ typedef tagged_ptr<node> atomic_node_ptr; ____________________________________________________________
and `tagged_ptr<node>' is double-wide when `BOOST_LOCKFREE_PTR_COMPRESSION' is not defined. Tim loads the head/tail anchor in his implementation of the MS algorithm like: ____________________________________________________________ atomic_node_ptr tail (tail_); ____________________________________________________________
and the copy constructor of the double-wide version of `tagged_ptr' is: ____________________________________________________________ /** copy constructor */ tagged_ptr(tagged_ptr const & p)//: ptr(0), tag(0) { set(p); } ____________________________________________________________
and the `set()' procedure is: ____________________________________________________________ void set(tagged_ptr const & p) { ptr = p.ptr; tag = p.tag; } ____________________________________________________________
Which will not be atomic on a 32-bit system. AFAICT, this breaks the algorithm according to the MS queue pseudo-code. So, Tim should probably change this behavior and call `atomic_set()' instead. Of course this will only make the performance much worse! You could use the `MOVQ' instruction on the x86, but that would use MMX. This is why I pointed you to the blocking queue which does not need to do this. Yes, it's blocking but it's a MUCH simpler algorithm and has far better performance characteristics.
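A hedged sketch of the kind of `atomic_set()'-style snapshot being asked for (illustrative only, not Tim's code): on 32-bit x86 a double-wide anchor can be read consistently by compare-and-swapping the current value with itself, which is exactly why it is so much more expensive than a plain load.
__________________________________________________________________
#include <atomic>
#include <cstdint>

struct node;
struct tagged_node_ptr { node* ptr; std::uintptr_t tag; };

tagged_node_ptr atomic_snapshot(std::atomic<tagged_node_ptr>& anchor)
{
    tagged_node_ptr expected = { 0, 0 };
    // On failure, compare_exchange writes the current value into `expected',
    // giving a consistent two-word snapshot (via cmpxchg8b on 32-bit x86).
    anchor.compare_exchange_strong(expected, expected);
    return expected;
}
__________________________________________________________________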
<snip>
i just went through the replies ... (maybe i should upcase some parts in the documentation, that the implementation focus on WORST CASE, not AVERAGE CASE performance ... people keep complaining that the stack/fifo may be outperformed be blocking algorithms, which is both true and irrelevant for me, as these implementations are soft real-time safe (and could be made hard real-time safe).
Agreed. Are you planning to augment your library with algorithms for different needs? Like MPSC, SPMC and so forth? I think that could be useful, because it looks like implementing the most generic MPMC case won't give you the best results.
as for the aba tag ... increasing the tag is not necessary in the enqueue operation (chris thomasson made a valid point here), but then 16bit would give 2**16 different tags. of course a tag overflow is possible, but not very likely ...
I guess it all depends on the algorithm and the actual memory manager used. For example, the ABA problem is more likely to appear in a simple LIFO stack than in M-S's FIFO queue. The generation count just decreases the chance. As I understand it, if 1/N is the probability of ABA in any given algorithm, then using a generation count of M bits makes the probability 1/(N * 2^M). Now, coming up with the N for any given algorithm is what's hard to do. I guess you could say that if M is sufficiently high, the overall probability becomes negligible regardless of N.
by definition ll/sc are immune to aba problems, but implementing cas via ll/sc, one loses this feature ... personally i would prefer to have language support for ll/sc transactions instead of aba-prone cas ...
most cas-architectures provide dcas, while most ll/sc architectures shouldn't use cas emulation, but ll/sc-style transactions directly, as it is by definition aba immune ... too bad, c++0x doesn't provide language support for ll/sc, but only for cas :/
Totally agree - when I was implementing this algorithm, I had two different versions - one CAS-based with ABA tag and the other LL/SC based without the tag. The problem with LL/SC on some platforms however is sometimes not solvable. On PowerPC 7447, for example, the LL is invalidated when ANY thread writes to ANY memory location, not just to the same cache line.... ugh...
tim
------------------------------------------------------------------------
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

Also, Tim, it Chris Thomasson found another problem with the current code. In non pointer packing solution your head and tail are multi-words. You can't load multi-word atomically without some special instructions. Here's his comment:
i ported boost.lockfree to boost.atomic, which deals with this ...
i just went through the replies ... (maybe i should upcase some parts in the documentation, that the implementation focus on WORST CASE, not AVERAGE CASE performance ... people keep complaining that the stack/fifo may be outperformed be blocking algorithms, which is both true and irrelevant for me, as these implementations are soft real-time safe (and could be made hard real-time safe).
Agreed. Are you planning to augment your library with algorithms for different needs? Like MPSC, SPMC and so forth? I think that could be useful, because it looks like implementing the most generic MPMC case won't give you the best results.
for now, i will provide mpmc queues only ... maybe it makes sense to add some specializations for single producers/consumers, later, especially, if more people are requesting it ...
by definition ll/sc are immune to aba problems, but implementing cas via ll/sc, one loses this feature ... personally i would prefer to have language support for ll/sc transactions instead of aba-prone cas ...
most cas-architectures provide dcas, while most ll/sc architectures shouldn't use cas emulation, but ll/sc-style transactions directly, as it is by definition aba immune ... too bad, c++0x doesn't provide language support for ll/sc, but only for cas :/
Totally agree - when I was implementing this algorithm, I had two different versions - one CAS-based with ABA tag and the other LL/SC based without the tag. The problem with LL/SC on some platforms however is sometimes not solvable. On PowerPC 7447, for example, the LL is invalidated when ANY thread writes to ANY memory location, not just to the same cache line.... ugh...
one wouldn't be able to convert every cas-based algorithm to ll/sc, but ll/sc gives another technique for implementing lockfree data structures ... cheers, tim -- tim@klingt.org http://tim.klingt.org You can play a shoestring if you're sincere John Coltrane

i just went through the replies ... (maybe i should upcase some parts in the documentation, that the implementation focus on WORST CASE, not AVERAGE CASE performance ... people keep complaining that the stack/fifo may be outperformed be blocking algorithms, which is both true and irrelevant for me, as these implementations are soft real-time safe (and could be made hard real-time safe).
Agreed. Are you planning to augment your library with algorithms for different needs? Like MPSC, SPMC and so forth? I think that could be useful, because it looks like implementing the most generic MPMC case won't give you the best results.
i just added a lock-free single-producer/single-consumer ringbuffer to my git repository, branch topic/spsc_ringbuffer ... it implements an algorithm, that is commonly found in open source audio applications ... performs about 8 times faster than the m-s fifo queue (which is based on linked lists). it comes in two versions: template <typename T, size_t size> class ringbuffer; template <typename T> class ringbuffer<T, 0>; in the first version the size is set at compile time, in the second version at run time (during the construction). tim -- tim@klingt.org http://tim.klingt.org You can play a shoestring if you're sincere John Coltrane

"Tim Blechmann" <tim@klingt.org> wrote in message news:4B23A083.7080008@klingt.org...
i just went through the replies ... (maybe i should upcase some parts in the documentation, that the implementation focus on WORST CASE, not AVERAGE CASE performance ... people keep complaining that the stack/fifo may be outperformed be blocking algorithms, which is both true and irrelevant for me, as these implementations are soft real-time safe (and could be made hard real-time safe).
Agreed. Are you planning to augment your library with algorithms for different needs? Like MPSC, SPMC and so forth? I think that could be useful, because it looks like implementing the most generic MPMC case won't give you the best results.
i just added a lock-free single-producer/single-consumer ringbuffer to my git repository, branch topic/spsc_ringbuffer ... it implements an algorithm, that is commonly found in open source audio applications ... performs about 8 times faster than the m-s fifo queue (which is based on linked lists).
FWIW Tim, here is a very simple pseudo-code implementation of a nice single-producer/consumer bounded queue: _______________________________________________________________ template<typename T, size_t T_depth> class spsc_queue { atomic<T*> m_buffer[T_depth]; // = { NULL } size_t m_head; // = 0 size_t m_tail; // = 0 public: bool push(T* object) { if (m_buffer[m_head].load(memory_order_relaxed)) { return false; } m_buffer[m_head].store(memory_order_release); ++m_head; if (m_head == T_depth) { m_head = 0; } return true; } T* pop() { T* object = m_buffer[m_tail].load(memory_order_acquire); if (object) { m_buffer[m_tail].store(NULL, memory_order_relaxed); ++m_tail; if (m_tail == T_depth) { m_tail = 0; } } return object; } }; _______________________________________________________________ This is good because the producer and consumer do not interfere with each other wrt the `spsc_queue::m_head/tail' variables. I think I noticed that you are allowing producers and consumers to load both the reader and writer variables and use them to determine boundary conditions (e.g., empty/full) right?

"Chris M. Thomasson" <cristom@charter.net> wrote in message news:hgfmj4$8m0$1@ger.gmane.org...
"Tim Blechmann" <tim@klingt.org> wrote in message news:4B23A083.7080008@klingt.org...
i just went through the replies ... (maybe i should upcase some parts in the documentation, that the implementation focus on WORST CASE, not AVERAGE CASE performance ... people keep complaining that the stack/fifo may be outperformed be blocking algorithms, which is both true and irrelevant for me, as these implementations are soft real-time safe (and could be made hard real-time safe).
Agreed. Are you planning to augment your library with algorithms for different needs? Like MPSC, SPMC and so forth? I think that could be useful, because it looks like implementing the most generic MPMC case won't give you the best results.
i just added a lock-free single-producer/single-consumer ringbuffer to my git repository, branch topic/spsc_ringbuffer ... it implements an algorithm, that is commonly found in open source audio applications ... performs about 8 times faster than the m-s fifo queue (which is based on linked lists).
FWIW Tim, here is a very simple pseudo-code implementation of a nice single-producer/consumer bounded queue: _______________________________________________________________ template<template T, size_t T_depth> class spsc_queue { atomic<T*> m_buffer[T_depth]; // = { NULL } size_t m_head; // = 0 size_t m_tail; // = 0
public: bool push(T* object) { if (m_buffer[m_head].load(memory_order_relaxed)) { return false; }
m_buffer[m_head].store(memory_order_release);
Ummm, that should be m_buffer[m_head].store(object, memory_order_release); of course!!! Sorry about that.

FWIW Tim, here is a very simple pseudo-code implementation of a nice single-producer/consumer bounded queue:
interesting ... it passes objects via pointers, though, which limits its usefulness ... still, if it performs better than my spsc ringbuffer, it would be interesting to use it as template specialization for pointer types ... tim -- tim@klingt.org http://tim.klingt.org A year ago, six months ago, I thought that I was an artist. I no longer think about it, I am. Henry Miller

FWIW Tim, here is a very simple pseudo-code implementation of a nice single-producer/consumer bounded queue:
interesting ... it passes objects via pointers, though, which limits its usefulness ...
I believe that can be solved by doing something like this: <pseudo-code typed in news reader> ___________________________________________________________ template<typename T, size_t T_depth> struct spsc_queue { struct cell { std::atomic<bool> m_busy; // = false; T m_object; }; cell m_buffer[T_depth]; size_t m_head; // = 0 size_t m_tail; // = 0 bool push(T const& object) { cell& c = m_buffer[m_head]; if (c.m_busy.load(memory_order_acquire)) return false; c.m_object = object; c.m_busy.store(true, memory_order_release); m_head = (m_head == T_depth - 1) ? 0 : (m_head + 1); return true; } bool pop(T& object) { cell& c = m_buffer[m_tail]; if (! c.m_busy.load(memory_order_acquire)) return false; object = c.m_object; c.m_busy.store(false, memory_order_release); m_tail = (m_tail == T_depth - 1) ? 0 : (m_tail + 1); return true; } }; ___________________________________________________________
still, if it performs better than my spsc ringbuffer, it would be interesting to use it as template specialization for pointer types ...
Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.

still, if it performs better than my spsc ringbuffer, it would be interesting to use it as template specialization for pointer types ...
Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.
yes, it would increase the size of the ringbuffer, but thrashing of the indices should be reduced, especially, if they are placed in different cache lines. i will do a bit of profiling to get some numbers. thanks, tim -- tim@klingt.org http://tim.klingt.org It is better to make a piece of music than to perform one, better to perform one than to listen to one, better to listen to one than to misuse it as a means of distraction, entertainment, or acquisition of 'culture'. John Cage

Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.
the performance difference is almost the same.

current implementation:

Performance counter stats for 'workspace/boost_lockfree/bin.v2/libs/lockfree/test/ringbuffer_test.test/gcc-4.4.1/release/threading-multi/ringbuffer_test':

  104843.459702  task-clock-msecs   #  1.919 CPUs
          86507  context-switches   #  0.001 M/sec
            100  CPU-migrations     #  0.000 M/sec
           2123  page-faults        #  0.000 M/sec
   292926371337  cycles             #  2793.940 M/sec
   139203486138  instructions       #  0.475 IPC
     3829638652  cache-references   #  36.527 M/sec
       19789409  cache-misses       #  0.189 M/sec
    31762728722  branches           #  302.954 M/sec
     1669597777  branch-misses      #  15.925 M/sec

  54.643822949  seconds time elapsed

your proposed implementation (with cache line padding):

Performance counter stats for 'workspace/boost_lockfree/bin.v2/libs/lockfree/test/ringbuffer_test2.test/gcc-4.4.1/release/threading-multi/ringbuffer_test2':

  104827.826694  task-clock-msecs   #  1.922 CPUs
          84196  context-switches   #  0.001 M/sec
            196  CPU-migrations     #  0.000 M/sec
           2135  page-faults        #  0.000 M/sec
   292892624370  cycles             #  2794.035 M/sec
   149885017189  instructions       #  0.512 IPC
     3301119985  cache-references   #  31.491 M/sec
       20729054  cache-misses       #  0.198 M/sec
    33378354235  branches           #  318.411 M/sec
     1685405757  branch-misses      #  16.078 M/sec

  54.527086724  seconds time elapsed

cheers, tim -- tim@klingt.org http://tim.klingt.org Desperation is the raw material of drastic change. Only those who can leave behind everything they have ever believed in can hope to escape. William S. Burroughs

"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2E4492.3050201@klingt.org...
Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.
the performance difference is almost the same.
Interesting. Thanks for profiling it. Have you tried aligning everything on cache line boundaries? I would try to ensure that the buffer is aligned and padded along with the head and tail variables. For a 64-byte cache line and 32-bit pointers you could do: struct ringbuffer { void* m_buffer[1024]; // multiple of a cache-line uint32_t m_head; char pad1[60]; uint32_t m_tail; char pad2[60]; }; Then ensure that `ringbuffer' structs are actually aligned on a cache line boundary. unsigned char buffer[sizeof(struct ringbuffer) + 63]; struct ringbuffer* rb = ALIGN(buffer, 64); Now I know for sure that `rb' is isolated and won't experience/cause false sharing with other parts of the application.

On 12/20/2009 04:57 PM, Chris M. Thomasson wrote:
"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2E4492.3050201@klingt.org...
Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.
the performance difference is almost the same.
Interesting. Thanks for profiling it. Have you tried aligning everything on cache line boundaries? I would try to ensure that the buffer is aligned and padded along with the head and tail variables. For 64-byte cache line and 32-bit pointers you could do:
i am using [1] (from my topic/ringbuffer_chris branch): static const int padding_size = BOOST_LOCKFREE_CACHELINE_BYTES - sizeof(size_t); cell m_buffer[max_size]; char padding1[padding_size]; size_t m_head; // = 0 char padding2[padding_size]; size_t m_tail; // = 0 cheers, tim [1] http://tim.klingt.org/git?p=boost_lockfree.git;a=blob;f=boost/lockfree/ringb... -- tim@klingt.org http://tim.klingt.org You can play a shoestring if you're sincere John Coltrane

On Sun, Dec 20, 2009 at 11:17 AM, Tim Blechmann <tim@klingt.org> wrote:
On 12/20/2009 04:57 PM, Chris M. Thomasson wrote:
"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2E4492.3050201@klingt.org...
Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.
the performance difference is almost the same.
Interesting. Thanks for profiling it. Have you tried aligning everything on cache line boundaries? I would try to ensure that the buffer is aligned and padded along with the head and tail variables. For 64-byte cache line and 32-bit pointers you could do:
How about we go through the ring buffer by steps of 2^n - 1 such that each next element is on a separate cache line? i.e. instead of m_head = (m_head == T_depth - 1) ? 0 : (m_head + 1); we do m_head = (m_head + 7) % T_depth; You still use each slot, just in a different order. You calculate 'n' to be whatever you need based on the cell size, as long as the resultant step size is coprime with T_depth so every slot still gets visited. I'm not sure if the false-sharing avoidance would be worth the cost of using up more cache lines. Probably depends on how full the queue is, etc. Or might that be worse? Tony
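A tiny sketch of the stride idea (hypothetical sizes): as long as the step is coprime with the capacity, every slot is still visited exactly once per lap, just in an order that spreads consecutive operations across cache lines.
__________________________________________________________________
#include <cstddef>

const std::size_t depth = 1024; // ring capacity
const std::size_t step  = 7;    // must be coprime with depth

std::size_t next_index(std::size_t i)
{
    // cycles through all `depth' slots before repeating, since
    // gcd(step, depth) == 1
    return (i + step) % depth;
}
__________________________________________________________________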

On 12/20/2009 08:44 PM, Gottlob Frege wrote:
On Sun, Dec 20, 2009 at 11:17 AM, Tim Blechmann <tim@klingt.org> wrote:
On 12/20/2009 04:57 PM, Chris M. Thomasson wrote:
"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2E4492.3050201@klingt.org...
Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.
the performance difference is almost the same.
Interesting. Thanks for profiling it. Have you tried aligning everything on cache line boundaries? I would try to ensure that the buffer is aligned and padded along with the head and tail variables. For 64-byte cache line and 32-bit pointers you could do:
How about we go through the ring buffer by steps of 2^n - 1 such that each next element is on a separate cache line? ie instead of m_head = (m_head == T_depth - 1) ? 0 : (m_head + 1); we do m_head = (m_head + 7) % T_depth;
You still use each slot, just in a different order. You calculate 'n' to be whatever you need based on the cell size. As long as the resultant step size is prime mod T_depth.
I'm not sure if the false-sharing avoidance would be worth the cost of using up more cache lines. Probably depends on how full the queue is, etc.
yes, i would guess, the performance characteristics differ depending on the number of elements in the buffer. also, the current implementation can easily be adapted to enqueue/dequeue multiple objects efficiently ... cheers, tim -- tim@klingt.org http://tim.klingt.org All we composers really have to work with is time and sound - and sometimes I'm not even sure about sound. Morton Feldman

"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2F4530.4040003@klingt.org... On 12/20/2009 08:44 PM, Gottlob Frege wrote:
On Sun, Dec 20, 2009 at 11:17 AM, Tim Blechmann <tim@klingt.org> wrote:
On 12/20/2009 04:57 PM, Chris M. Thomasson wrote:
"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2E4492.3050201@klingt.org...
Well, IMO, it should perform better because the producer and consumer are not thrashing each other wrt the head and tail indexes.
the performance difference is almost the same.
Interesting. Thanks for profiling it. Have you tried aligning everything on cache line boundaries? I would try to ensure that the buffer is aligned and padded along with the head and tail variables. For 64-byte cache line and 32-bit pointers you could do:
How about we go through the ring buffer by steps of 2^n - 1 such that each next element is on a separate cache line? ie instead of m_head = (m_head == T_depth - 1) ? 0 : (m_head + 1); we do m_head = (m_head + 7) % T_depth;
You still use each slot, just in a different order. You calculate 'n' to be whatever you need based on the cell size. As long as the resultant step size is prime mod T_depth.
I'm not sure if the false-sharing avoidance would be worth the cost of using up more cache lines. Probably depends on how full the queue is, etc.
I think it would be beneficial if the producer could push enough items to fill up a cache line, and the consumer just pops a cache line's worth of objects at a time.
yes, i would guess, the performance characteristics differ depending on the number of elements in the buffer. also, the current implementation can easily be adapted to enqueue/dequeue multiple objects efficiently ...
For the single-producer/consumer queue you can treat the access like a transaction. For instance, the producer can read/write to the buffer up to the tail and commit the update just by incrementing the head. A consumer can read/write to the produced portion of the buffer and commit by bumping the tail. This can be good because you can avoid copying in some cases and allow the thread to use the data in the buffer directly. Here is an example of such an algorithm: http://pastebin.com/f693b64b3 This version uses eventcounts to enforce boundary conditions because I did not want to introduce any nasty spin-waiting. The behavior of this example code is basically equal to a socket/pipe connection between two threads.
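Here is a hedged sketch (my own illustration, not the pastebin code) of just the producer half of that begin/commit idea, with monotonically growing indices and no blocking:
__________________________________________________________________
#include <atomic>
#include <cstddef>

template<std::size_t N>
struct byte_ring {
    unsigned char buf[N];
    std::atomic<std::size_t> head{0}; // advanced only by the producer
    std::atomic<std::size_t> tail{0}; // advanced only by the consumer

    // Producer: expose the contiguous writable region starting at head.
    unsigned char* write_begin(std::size_t& avail) {
        std::size_t h = head.load(std::memory_order_relaxed);
        std::size_t t = tail.load(std::memory_order_acquire);
        std::size_t used = h - t;           // indices grow monotonically
        avail = N - used;
        std::size_t to_wrap = N - (h % N);  // stop at the wrap point
        if (avail > to_wrap) avail = to_wrap;
        return &buf[h % N];
    }

    // Producer: commit `n' bytes written into the region returned above.
    void write_commit(std::size_t n) {
        head.store(head.load(std::memory_order_relaxed) + n,
                   std::memory_order_release);
    }
};
__________________________________________________________________
A consumer-side read_begin/read_commit would be symmetric; the point is exactly the one described above: the thread works on the buffer memory in place, and only the final index store publishes the transaction.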

yes, i would guess, the performance characteristics differ depending on the number of elements in the buffer. also, the current implementation can easily be adapted to enqueue/dequeue multiple objects efficiently ...
For the single-producer/consumer queue you can treat the access like a transaction. For instance, the producer can read/write to the buffer up to the tail and commit the update just by incrementing the head. A consumer can read/write to the produced portion of the buffer and commit by bumping the tail. This is can good because you can avoid copying is come cases and allow the thread to use the data in the buffer directly. Here is an example of such an algorithm:
i just had a brief look at the algorithm, but it doesn't look too different from the kfifo ringbuffer from the linux kernel, does it? thanks, tim -- tim@klingt.org http://tim.klingt.org The aim of education is the knowledge, not of facts, but of values William S. Burroughs

"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2F84D7.5090200@klingt.org...
yes, i would guess, the performance characteristics differ depending on the number of elements in the buffer. also, the current implementation can easily be adapted to enqueue/dequeue multiple objects efficiently ...
For the single-producer/consumer queue you can treat the access like a transaction. For instance, the producer can read/write to the buffer up to the tail and commit the update just by incrementing the head. A consumer can read/write to the produced portion of the buffer and commit by bumping the tail. This is can good because you can avoid copying is come cases and allow the thread to use the data in the buffer directly. Here is an example of such an algorithm:
i just had a brief look at the algorithm, but it doesn't look too different from the kfifo ringbuffer from the linux kernel, does it?
You mean this one right? http://www.gelato.unsw.edu.au/lxr/source/kernel/kfifo.c I think it's different in that it allows for a producer or consumer to use the buffer directly. For instance, you can do something like: // producer buffer_T buffer; if (ringbuf_try_write_begin(..., &buffer)) { // the producer has sole access to the memory pointed // to by `buffer->ptr' the size of `buffer->size'. // just write a single character. buffer->ptr[0] = 'X'; buffer->size = 1; // Okay, we can commit the write. ringbuf_write_commit(..., &buffer); } // consumer buffer_T buffer; if (ringbuf_try_read_begin(..., &buffer)) { // the consumer has sole access to the memory pointed // to by `buffer->ptr' the size of `buffer->size'. // print out all characters. char tmp = buffer->ptr[buffer->size - 1]; buffer->ptr[buffer->size - 1] = '\0'; puts(buffer->ptr); putchar(tmp); // commit our transaction ringbuf_read_commit(..., &buffer); } AFAICT, you cannot do this with the kfifo ringbuffer from the Linux Kernel because it does not support the proper interface. The code does not need to use external buffers and copy to/from the ringbuffer. Instead it just accesses the ringbuffer memory directly. Notice how the consumer is operating on the buffer instead of copying everything out?

Hi tim: Where is the test code for the ringbuffer??

On 12/22/2009 02:58 PM, jon_zhou@agilent.com wrote:
Hi tim:
Where is the test code for the ringbuffer??
it is part of my topic/spsc_ringbuffer branch: http://tim.klingt.org/git?p=boost_lockfree.git;a=blob;f=libs/lockfree/test/r... cheers, tim -- tim@klingt.org http://tim.klingt.org Art is either a complaint or do something else John Cage quoting Jasper Johns

"Tim Blechmann" <tim@klingt.org> wrote in message news:4B2E059C.3020808@klingt.org...
FWIW Tim, here is a very simple pseudo-code implementation of a nice single-producer/consumer bounded queue:
interesting ... it passes objects via pointers, though, which limits its usefulness ... still, if it performs better than my spsc ringbuffer, it would be interesting to use it as template specialization for pointer types ...
Here is another "interesting" pointer-based algorithm. It's not a strict FIFO and it allows producers to act as consumers; here is some pseudo-code: _____________________________________________________________ template<typename T, size_t T_depth> struct wierd_mpsc_queue { atomic<T*> m_buffer[T_depth]; // = { NULL }; atomic<size_t> m_head; // = 0 size_t m_tail; // = 0 T* push(T* object) { assert(object); size_t i = m_head.fetch_add(1, memory_order_relaxed) % T_depth; object = m_buffer[i].exchange(object, memory_order_release); if (object) { atomic_thread_fence(memory_order_acquire); } return object; } T* pop() { T* object = m_buffer[m_tail].exchange(NULL, memory_order_acquire); if (object) { m_tail = (m_tail == T_depth - 1) ? 0 : (m_tail + 1); } return object; } }; _____________________________________________________________ This is pretty weird because it's a multi-producer/single-consumer, however, it can behave as if it was a multi-producer/multi-consumer when the producers start to consume a full buffer. Also, sometimes you just don't care about a "strict" ordering... ;^)

On Mon, 21 Dec 2009, Chris M. Thomasson wrote:
object = m_buffer[i].exchange(object, memory_order_release);
if (object) { atomic_thread_fence(memory_order_acquire); }
return object;
[...]
T* pop() { T* object = m_buffer[m_tail].exchange(NULL, memory_order_acquire);
if (object) { m_tail = (m_tail == T_depth - 1) ? 0 : (m_tail + 1); }
return object;
you generally do not need "memory_order_acquire" when dereferencing an "atomically published" pointer -- "memory_order_consume" suffices to provide the proper ordering for all operations that are data-dependent on the pointer value (and any dereference obviously needs the value to compute the memory address to access). This is faster on any non-x86 and non-alpha by a fair amount. Regards, Helge
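A minimal sketch of the consume-vs-acquire point (in practice several compilers currently promote consume to acquire, but the intent is as Helge describes):
__________________________________________________________________
#include <atomic>

struct payload { int value; };
std::atomic<payload*> published(0);

int reader()
{
    payload* p = published.load(std::memory_order_consume);
    // The dereference is data-dependent on `p', so on most architectures
    // (Alpha excluded) no extra barrier is needed to order it.
    return p ? p->value : 0;
}
__________________________________________________________________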

"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.0912211548590.31425@m65s28.vlinux.de...
On Mon, 21 Dec 2009, Chris M. Thomasson wrote:
object = m_buffer[i].exchange(object, memory_order_release);
if (object) { atomic_thread_fence(memory_order_acquire); }
return object;
[...]
T* pop() { T* object = m_buffer[m_tail].exchange(NULL, memory_order_acquire);
if (object) { m_tail = (m_tail == T_depth - 1) ? 0 : (m_tail + 1); }
return object;
you generally do not need "memory_order_acquire" when dereferencing an "atomically published" pointer -- "memory_order_consume" suffices to provide the proper ordering for all operations that are data-dependent on the pointer value (and any dereference obviously needs the value to compute the memory address to access).
This is faster on any non-x86 and non-alpha by a fair amount.
Of course you are right. For some reason, I was thinking that `memory_order_consume' would boil down to a: MEMBAR #LoadLoad on a SPARC. The name was confusing me. Perhaps it should be named `memory_order_depends' or something... BTW, where is `memory_order_produce'? ;^) I don't think I can use C++0x memory ordering to achieve simple #LoadLoad and #StoreStore barriers without the #LoadStore constraint. Am I right?

On Mon, 21 Dec 2009, Chris M. Thomasson wrote:
Of course you are right. For some reason, I was thinking that `memory_order_consume' would boil down to a:
MEMBAR #LoadLoad
on a SPARC. The name was confusing me. Perhaps it should be named `memory_order_depends' or something...
I think it was named exactly that way in some earlier proposals, and I also liked it better.
I don't think I can use C++0x memory ordering to achieve simple #LoadLoad and #StoreStore barriers without the #LoadStore constraint. Am I right?
If a hypothetical C++0x compiler were extremely clever and knew that the only operations on shared data before a given "release" fence were stores, it could translate this as "MEMBAR #StoreStore" (similar for "acquire" with loads). I don't know if the required code analysis is feasible or worthwhile, but there is certainly no way to force the compiler to generate these exact barriers. It's a pity that something like the below algorithm is therefore not possible: template<typename X> class atomic_readmostly_state { public: X read() { for(;;) { unsigned int g=generation.load(memory_order_relaxed); load_load_mb(); X tmp=state[g&1]; load_load_mb(); if (g==generation.load(memory_order_relaxed)) return tmp; } } void write(X tmp) { mutex::scoped_lock g(write_lock); unsigned int next=generation.load(memory_order_relaxed)+1; state[next&1]=tmp; generation.store(next, memory_order_release); } private: X state[2]; atomic<unsigned int> generation; mutex write_lock; }; which provides a nice, contention-free (no stores by readers!) mechanism to share read-mostly state. Helge

"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.0912211630230.31425@m65s28.vlinux.de...
On Mon, 21 Dec 2009, Chris M. Thomasson wrote:
Of course you are right. For some reason, I was thinking that `memory_order_consume' would boil down to a:
MEMBAR #LoadLoad
on a SPARC. The name was confusing me. Perhaps it should be named `memory_order_depends' or something...
I think it was named exactly that way in some earlier proposals, and I also liked it better.
I don't think I can use C++0x memory ordering to achieve simple #LoadLoad and #StoreStore barriers without the #LoadStore constraint. Am I right?
If a hypothetical C++0x compiler were extremely clever and knew that the only operations on shared data before a given "release" fence were stores, it could translate this as "MEMBAR #StoreStore" (similar for "acquire" with loads).
Yes, that's a good point.
I don't know if the required code analysis is feasible or worthwhile, but there is certainly no way to force the compiler to generate these exact barriers.
It's a pity that something like the below algorithm is therefore not possible:
Agreed. [...]
which provides a nice, contention-free (no stores by readers!) mechanism to share read-mostly state.
Interesting... Humm, that kind of reminds me of a sequence lock: <membars aside for a moment> _______________________________________________________________ template<typename T> struct sequence_lock { T m_object; // = T() uint32_t m_version; // = 0 mutex m_mutex; T read() { for (;;) { if (LOAD(&m_version) % 2) yield(), continue; T object = m_object; if (LOAD(&m_version) % 2) yield(), continue; return object; } } void write(T const& object) { mutex::scoped_lock lock(m_mutex); STORE(&m_version, LOAD(&m_version) + 1); m_object = object; STORE(&m_version, LOAD(&m_version) + 1); } }; _______________________________________________________________

Am Monday 21 December 2009 18:21:39 schrieb Chris M. Thomasson:
"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.0912211630230.31425@m65s28.vlinux.de...
On Mon, 21 Dec 2009, Chris M. Thomasson wrote:
Of course you are right. For some reason, I was thinking that `memory_order_consume' would boil down to a:
MEMBAR #LoadLoad
on a SPARC.
mmh... do you happen to have a Sparc at hand for testing? (Mine is currently broken). If yes, would you mind doing a little bit of testing so we could implement atomic ops for Sparc as well? [...]
Interesting... Humm, that kind of reminds me of a sequence lock:
yes, it's exactly the same idea Best regards Helge

"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:200912212121.09020.hcb@chaoticmind.net...
Am Monday 21 December 2009 18:21:39 schrieb Chris M. Thomasson:
"Helge Bahmann" <hcb@chaoticmind.net> wrote in message news:alpine.DEB.1.10.0912211630230.31425@m65s28.vlinux.de...
On Mon, 21 Dec 2009, Chris M. Thomasson wrote:
Of course you are right. For some reason, I was thinking that `memory_order_consume' would boil down to a:
MEMBAR #LoadLoad
on a SPARC.
mmh... do you happen to have a Sparc at hand for testing? (Mine is currently broken). If yes, would you mind doing a little bit of testing so we could implement atomic ops for Sparc as well?
Well, Sun Microsystems did give me a brand new Sun Fire T2000 to play with back in 2006 when I was participating in the CoolThreads Contest:
https://coolthreads.dev.java.net
https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?messageID=11001&forumID=1797
(I created the vZOOM project)
I kept it for about two years and then sold it. Unfortunately, I don't have access to a SPARC now. ;^(
Humm. Have you had a look at the relevant source code for OpenSolaris?
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/common/atomic/...
AFAICT, they implement everything you need. [...]

On Mon, 21 Dec 2009, Chris M. Thomasson wrote:
https://coolthreads.dev.java.net
https://coolthreads.dev.java.net/servlets/ProjectForumMessageView?messageID=11001&forumID=1797
(I created the vZOOM project)
I kept it for about two years and then sold it. Unfortunately, I don't have access to a SPARC now.
nice :) but a pity that you sold it :/
Humm. Have you had a look at the relevant source code for OpenSolaris?
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/common/atomic/...
Well, I have already implemented this once (quite some time ago), so in principle I know what to do, but without being able to test it, this becomes whacky. Helge
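One possible stopgap while no SPARC hardware is available for testing (not something suggested in the thread): GCC 4.1 and later provide the __sync_* builtins, which the compiler lowers to the target's own atomic sequences; on SPARC v9 these should come out as cas/casx plus whatever membars the chosen memory model requires. A minimal sketch, with made-up helper names, that could serve as a reference to compare hand-written assembly against:
_______________________________________________________________
#include <stdint.h>

// GCC builtin based fallbacks (GCC >= 4.1); not SPARC-specific source,
// but the compiler emits the target's native CAS instructions for them.
inline uint32_t cas_32(volatile uint32_t* target, uint32_t expected, uint32_t desired)
{
    // returns the value that was actually in *target before the operation
    return __sync_val_compare_and_swap(target, expected, desired);
}

inline uint32_t fetch_add_32(volatile uint32_t* target, uint32_t value)
{
    // returns the previous value of *target
    return __sync_fetch_and_add(target, value);
}
_______________________________________________________________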

"Chris M. Thomasson" <cristom@charter.net> wrote in message news:hgoao1$klu$1@ger.gmane.org... [...]
which provides a nice, contention-free (no stores by readers!) mechanism to share read-mostly state.
Interesting... Humm, that kind of reminds me of a sequence lock:
<membars aside for a moment>
_______________________________________________________________
template<typename T>
struct sequence_lock {
    T m_object;          // = T()
    uint32_t m_version;  // = 0
    mutex m_mutex;

    T read() {
        for (;;) {
            if (LOAD(&m_version) % 2) { yield(); continue; }
            T object = m_object;
            if (LOAD(&m_version) % 2) { yield(); continue; }
I forgot to load and compare the version! Let me fix that:
___________________________________________________________
    T read() {
        for (;;) {
            uint32_t version1 = LOAD(&m_version);
            if (! (version1 % 2)) {
                T object = m_object;
                uint32_t version2 = LOAD(&m_version);
                if (version1 == version2) {
                    return object;
                }
            }
            backoff();
        }
    }
___________________________________________________________
Sorry about that! ;^o
    void write(T const& object) {
        mutex::scoped_lock lock(m_mutex);
        STORE(&m_version, LOAD(&m_version) + 1);
        m_object = object;
        STORE(&m_version, LOAD(&m_version) + 1);
    }
};
_______________________________________________________________
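The membars were set aside above; for reference, a sketch (not the thread's code) of where they would go if the same sequence lock were written with C++0x atomics and fences. backoff() is a made-up stand-in for the yield()/backoff() calls above, and the plain access to m_object is still, formally, a data race under the standard's model, the well-known seqlock caveat.
_______________________________________________________________
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>

static void backoff() { std::this_thread::yield(); } // stand-in for yield()/backoff()

template<typename T>
struct sequence_lock_cxx11 {
    T m_object{};
    std::atomic<uint32_t> m_version{0};
    std::mutex m_mutex;

    T read()
    {
        for (;;) {
            uint32_t version1 = m_version.load(std::memory_order_acquire);
            if (!(version1 % 2)) {                      // no write in progress
                T object = m_object;
                // ~#LoadLoad: order the data read before the version re-check
                std::atomic_thread_fence(std::memory_order_acquire);
                uint32_t version2 = m_version.load(std::memory_order_relaxed);
                if (version1 == version2)
                    return object;
            }
            backoff();
        }
    }

    void write(T const& object)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        uint32_t v = m_version.load(std::memory_order_relaxed);
        m_version.store(v + 1, std::memory_order_relaxed);   // mark write in progress (odd)
        // ~#StoreStore: the odd version must be visible before the data store
        std::atomic_thread_fence(std::memory_order_release);
        m_object = object;
        m_version.store(v + 2, std::memory_order_release);   // publish (even)
    }
};
_______________________________________________________________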

On Fri, Dec 4, 2009 at 11:01 AM, Andrew Venikov <andrew.venikov@genband.com>wrote:
3. Your tail_ and head_ should be volatile. Furthermore, the getters for you tagged_ptr (like get_ptr, operator->() and operator* ) should return pointers/references to volatile types. Rational: If your tail_ is not volatile, then the compiler is free to load head_ just once and line 4 effectively becomes a no-op.
A little late chiming in here, but I think volatile is neither necessary nor sufficient to guarantee that the compiler does not perform such optimizations. If it works, it's due to implementation-specific extensions of the volatile keyword, such as those provided by Microsoft's compiler, which does indeed guarantee that such optimizations are not performed on volatile variables. However, I don't believe this is the case for all compilers (although it might be true for many). In any case, it's definitely not standards-compliant behavior, and my understanding of volatile (which seems to change every time someone publishes a new paper about just how bad it is) is that it's almost completely useless for portable, standards-compliant behavior.
Zach

Zachary Turner wrote:
A little late chiming in here, but I think volatile is neither necessary nor sufficient to guarantee that the compiler does not perform such optimizations. If it works, it's due to implementation specific extensions of the volatile keyword, such as those provided by Microsoft's compiler which does indeed guarantee that such optimizations are not performed on volatile variables. However, I don't believe this is the case for all compilers (although it might be true for many). In any case, it's definitely not standards compliant behavior, and my understanding of volatile (which seems to change every time someone publishes a new paper about just how bad it is) is that it's almost completely useless for portable, standards compliant behavior.
Zach
By no means is using volatile sufficient for producing correct parallel algorithms, but it is absolutely necessary. The standard places requirements on conforming implementations:

1.9.6: The observable behavior of the abstract machine is its sequence of reads and writes to volatile data and calls to library I/O functions.

1.9.7: Accessing an object designated by a volatile lvalue (3.10), modifying an object, calling a library I/O function, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment. Evaluation of an expression might produce side effects. At certain specified points in the execution sequence called sequence points, all side effects of previous evaluations shall be complete and no side effects of subsequent evaluations shall have taken place.

1.9.11: The least requirements on a conforming implementation are that, at sequence points, volatile objects are stable in the sense that previous evaluations are complete and subsequent evaluations have not yet occurred.

That, to me, sounds like a complete enough requirement that compilers don't perform optimizations that produce "surprising" results insofar as observable behavior of the abstract (single-threaded) machine is concerned. To make that work in a parallel environment we have to augment volatile with either library calls or correct use of memory barriers. Microsoft's extension is that reads and writes to volatile are always accompanied by a full barrier. Frankly, I don't want that "service"; I'd much rather have control over that.

Andy
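The disagreement above is essentially about what keeps the compiler from caching head_/tail_ in a register across the validity re-check. A sketch (not from the thread; the names are made up) of the difference as it looks once std::atomic is available, which portably covers the role volatile is being asked to play here:
_______________________________________________________________
#include <atomic>

bool plain_flag = false;                 // plain, non-volatile, non-atomic

void spin_plain()
{
    // With no volatile and no atomics, the compiler may legally load
    // plain_flag once and spin forever -- the "load just once" problem
    // described for tail_ / head_ above.
    while (!plain_flag) { /* spin */ }
}

std::atomic<bool> ready_flag(false);     // C++0x/C++11 replacement

void spin_atomic()
{
    // Every iteration performs a real load, and the required ordering is
    // stated explicitly instead of being implied by volatile.
    while (!ready_flag.load(std::memory_order_acquire)) { /* spin */ }
}
_______________________________________________________________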

hi all, i just wanted to let you know that i ported boost.lockfree to boost.atomic. the code can be obtained from my git repository [1], branch lockfree-atomic-upstream. unfortunately there seems to be an issue with the fifo implementation: after adding another null pointer check (which shouldn't actually be needed), the implementation works fine on my nehalem machine, though i'm not sure about other platforms. the port removes most of the low-level code, so it is way cleaner than before.
cheers, tim
[1] http://tim.klingt.org/git?p=boost_lockfree.git
--
tim@klingt.org
http://tim.klingt.org
Only very good and very bad programmers use goto in C
participants (10)
- Andrew Venikov
- Andy Venikov
- Anthony Williams
- Chris M. Thomasson
- Gottlob Frege
- Helge Bahmann
- jon_zhou@agilent.com
- Mathias Gaunard
- Tim Blechmann
- Zachary Turner