threadpool lockfree_channel

hi oliver,

i just saw that you checked in a lockfree fifo implementation to the boost sandbox. it looks like an implementation of the michael/scott queue, doesn't it? if so, i suppose you are missing a thread-safe memory reclamation mechanism ... some time ago i wrote a boost-style implementation of this data structure, not sure if you came across it [1] ... maybe it would make sense to join our efforts in implementing a lockfree queue and get it into boost?

best, tim

[1] http://tim.klingt.org/git?p=boost_lockfree.git;a=summary

--
tim@klingt.org
http://tim.klingt.org

Music is the can opener of the soul. It makes you terribly quiet inside, makes you aware that there's a roof to your being.
  Henry Miller

Tim,
maybe it would make sense to join our efforts in implementing a lockfree queue and get it into boost?
Please count me in! Regards Hartmut

Tim Blechmann <tim@klingt.org> writes:
maybe it would make sense to join our efforts in implementing a lockfree queue and get it into boost?
[1] http://tim.klingt.org/git?p=boost_lockfree.git;a=summary
Firstly, I'd like to help with this. I have a couple of lock-free queue implementations lying around from my book. I've attached a sample implementation that uses atomic reference-counting --- it would need to be "boostified", as it's written for C++0x, but that should be relatively straightforward. I don't make any claims for performance, but I believe it to be "correct", with no data races or undefined behaviour --- I *really* want to know if that's not the case, so I can correct it before the book goes to press.

Secondly, both these queues (Tim's at the posted link, and Oliver's at https://svn.boost.org/trac/boost/browser/sandbox/threadpool/boost/tp/lockfre...) have race conditions in dequeue/take.

First, let's look at Tim's implementation at http://tim.klingt.org/git?p=boost_lockfree.git;a=blob;f=boost/lockfree/fifo....

Suppose there are two threads calling dequeue() concurrently. If there is an item in the queue, both will proceed identically down to line 137, as neither thread modifies the queue data before then. Now suppose one thread (A) gets preempted, but the other (B) proceeds. Thread B will read next->data and assign it to ret. It will then update head_ with CAS (which will succeed, as no other thread is modifying or has modified the data structure), and *deallocate the node* at line 141. This will destroy next->data. Now, when thread A wakes up it will try to read next->data => reading a destroyed object, undefined behaviour.

Oliver's take() suffers a similar problem. Suppose there are two threads calling take(). Both proceed as far as line 168, and then thread A gets preempted whilst thread B proceeds. There is an item on the queue, so val is non-NULL, and the queue hasn't been modified, so head_==head. No other thread is modifying the queue, so we end up at line 197. We then update head_ in line 198, *delete head.ptr* at line 202 and set it to NULL at line 203. Now, when thread A wakes up the first thing it does is read head.ptr->prev, which is a dereference of a NULL ptr => undefined behaviour.

Lock-free code is hard. Memory reclamation makes it doubly hard.

Anthony

--
Author of C++ Concurrency in Action | http://www.manning.com/williams
just::thread C++0x thread library   | http://www.stdthread.co.uk
Just Software Solutions Ltd         | http://www.justsoftwaresolutions.co.uk
15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK. Company No. 5478976
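For readers without the linked sources, here is a minimal sketch of the dequeue pattern under discussion (illustrative names only, not the actual code of either library); the comment marks the window Anthony describes:

#include <atomic>

template<typename T>
struct naive_ms_queue {
    struct node { T* data; std::atomic<node*> next; };
    std::atomic<node*> head_;   // dummy-node based; tail_/enqueue omitted

    T* dequeue() {
        for (;;) {
            node* head = head_.load();
            // <-- thread A can stall here; if thread B meanwhile completes
            //     the CAS below and deletes the old head, A's dereferences
            //     of head (and, one dequeue later, of next) read destroyed
            //     nodes: undefined behaviour
            node* next = head->next.load();
            if (!next) return 0;                  // queue empty
            T* ret = next->data;
            if (head_.compare_exchange_strong(head, next)) {
                delete head;                      // reclaim the old dummy node
                return ret;
            }
        }
    }
};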

Anthony Williams-4 wrote:
Firstly, I'd like to help with this. I have a couple of lock-free queue implementations lying around from my book. I've attached a sample implementation that uses atomic reference-counting [...]
#include <atomic>
#include <memory>

template<typename T>
class queue {
private:
    struct node;

    // external_count counts in-flight references held via head/tail/next
    struct counted_node_ptr {
        int external_count;
        node* ptr;
    };

    struct node_counter {
        int internal_count:30;
        unsigned external_counters:2;  // one each for the next ptr and head/tail
    };

    struct node {
        std::atomic<T*> data;
        std::atomic<node_counter> count;
        std::atomic<counted_node_ptr> next;

        node() {
            node_counter new_count;
            new_count.internal_count=0;
            new_count.external_counters=2;
            count.store(new_count);
            counted_node_ptr next_node={0};
            next.store(next_node);
        }

        void release_ref() {
            node_counter old_counter=count.load(std::memory_order_relaxed);
            node_counter new_counter;
            do {
                new_counter=old_counter;
                --new_counter.internal_count;
            } while(!count.compare_exchange_strong(
                        old_counter,new_counter,
                        std::memory_order_acquire,std::memory_order_relaxed));
            if(!new_counter.internal_count && !new_counter.external_counters) {
                delete this;  // last reference gone
            }
        }
    };

    std::atomic<counted_node_ptr> head;
    std::atomic<counted_node_ptr> tail;

    static void increase_external_count(
        std::atomic<counted_node_ptr>& counter,
        counted_node_ptr& old_counter) {
        counted_node_ptr new_counter;
        do {
            new_counter=old_counter;
            ++new_counter.external_count;
        } while(!counter.compare_exchange_strong(
                    old_counter,new_counter,
                    std::memory_order_acquire,std::memory_order_relaxed));
        old_counter.external_count=new_counter.external_count;
    }

    void set_new_tail(counted_node_ptr& old_tail,
                      counted_node_ptr const& new_tail) {
        node* const current_tail_ptr=old_tail.ptr;
        while(!tail.compare_exchange_weak(old_tail,new_tail) &&
              old_tail.ptr==current_tail_ptr);
        if(old_tail.ptr==current_tail_ptr) {
            free_external_counter(old_tail);
        } else {
            current_tail_ptr->release_ref();
        }
    }

    static void free_external_counter(counted_node_ptr& old_node_ptr) {
        node* const ptr=old_node_ptr.ptr;
        int const count_increase=old_node_ptr.external_count-2;
        node_counter old_counter=ptr->count.load(std::memory_order_relaxed);
        node_counter new_counter;
        do {
            new_counter=old_counter;
            --new_counter.external_counters;
            new_counter.internal_count+=count_increase;  // fold external refs in
        } while(!ptr->count.compare_exchange_strong(
                    old_counter,new_counter,
                    std::memory_order_acquire,std::memory_order_relaxed));
        if(!new_counter.internal_count && !new_counter.external_counters) {
            delete ptr;
        }
    }

public:
    queue() {
        counted_node_ptr new_node;
        new_node.external_count=1;
        new_node.ptr=new node;  // initial dummy node
        head.store(new_node);
        tail.store(new_node);
    }

    queue(const queue& other)=delete;
    queue& operator=(const queue& other)=delete;

    ~queue() {
        while(pop());
        delete head.load().ptr;
    }

    std::unique_ptr<T> pop() {
        counted_node_ptr old_head=head.load(std::memory_order_relaxed);
        for(;;) {
            increase_external_count(head,old_head);
            node* const ptr=old_head.ptr;
            if(ptr==tail.load().ptr) {
                return std::unique_ptr<T>();  // empty
            }
            counted_node_ptr next=ptr->next.load();
            if(head.compare_exchange_strong(old_head,next)) {
                T* const res=ptr->data.exchange(NULL);
                free_external_counter(old_head);
                return std::unique_ptr<T>(res);
            }
            ptr->release_ref();
        }
    }

    void push(T new_value) {
        std::unique_ptr<T> new_data(new T(new_value));
        counted_node_ptr new_next;
        new_next.ptr=new node;
        new_next.external_count=1;
        counted_node_ptr old_tail=tail.load();
        for(;;) {
            increase_external_count(tail,old_tail);
            T* old_data=NULL;
            if(old_tail.ptr->data.compare_exchange_strong(
                   old_data,new_data.get())) {
                counted_node_ptr old_next={0};
                if(!old_tail.ptr->next.compare_exchange_strong(
                       old_next,new_next)) {
                    delete new_next.ptr;  // another thread set next: ours unused
                    new_next=old_next;
                }
                set_new_tail(old_tail,new_next);
                new_data.release();
                break;
            } else {  // lost the data race: help the winner set next/tail
                counted_node_ptr old_next={0};
                if(old_tail.ptr->next.compare_exchange_strong(
                       old_next,new_next)) {
                    old_next=new_next;
                    new_next.ptr=new node;
                }
                set_new_tail(old_tail,old_next);
            }
        }
    }
};
Hi,

I'm also interested in lock-free queues. In particular I need one that is
* single-writer single-reader
so there are no push/push or front/front race conditions. I think we can extend this and provide implementations that are lock-free for
* single-writer single-reader
* multiple-writer single-reader
* single-writer multiple-reader
* multiple-writer multiple-reader
so we can have the best implementation for each case.

I'm also interested in a dequeue in which
* multiple-writer (push) multiple-reader (pop) single-reader (front).
I think the thread pool has one like that for the internal queues, but I need to verify.

Best, Vicente
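For the single-writer/single-reader case, the classic approach is a ring buffer in which only the producer advances the write index and only the consumer advances the read index, so no CAS is needed at all. A minimal sketch (interface assumed for illustration, written against C++0x atomics; not existing Boost code):

#include <atomic>
#include <cstddef>

template<typename T, size_t Size>      // Size must be a power of two
class spsc_queue {
    T buffer_[Size];
    std::atomic<size_t> write_;        // written only by the producer
    std::atomic<size_t> read_;         // written only by the consumer
public:
    spsc_queue() : write_(0), read_(0) {}

    bool push(const T& value) {        // call from the producer thread only
        size_t const w = write_.load(std::memory_order_relaxed);
        if (w - read_.load(std::memory_order_acquire) == Size)
            return false;              // full
        buffer_[w & (Size - 1)] = value;
        write_.store(w + 1, std::memory_order_release);  // publish the slot
        return true;
    }

    bool pop(T& value) {               // call from the consumer thread only
        size_t const r = read_.load(std::memory_order_relaxed);
        if (write_.load(std::memory_order_acquire) == r)
            return false;              // empty
        value = buffer_[r & (Size - 1)];
        read_.store(r + 1, std::memory_order_release);   // release the slot
        return true;
    }
};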

On Monday 16 March 2009 11:40:03, Vicente Botet Escriba wrote:
I'm also interested in a dequeue in which multiple-writer (push) multiple-reader (pop) single-reader (front).
I think that the Thread pool has one like that for the internal queues, but I need to verify.
The wsq has multiple threads dequeueing items from one end (synchronized via a mutex) and one thread enqueueing/dequeueing at the other end (stack-like).

regards, Oliver

hi anthony,
Thread B will read next->data and assign it to ret. It will then update head_ with CAS [...] and *deallocate the node* at line 141. This will destroy next->data.
dealloc_node in line 141 does not free the node, but pushes it to a freelist stack ... from my understanding, this leads to increased memory usage ... allocated nodes are not freed until the fifo is destroyed, but are reused for new nodes ... so memory reclamation should be safe ... or am i missing something?

best, tim

--
tim@klingt.org
http://tim.klingt.org

There's no such entity as "most people". These are generalities. All generalities are meaningless. You've got to pin it down to a specific person doing a specific thing at a specific time and space.
  William S. Burroughs

Tim Blechmann <tim@klingt.org> writes:
dealloc_node in line 141 does not free the node, but pushes it to a freelist stack ... so memory reclamation should be safe ... or am i missing something?
Your dealloc_node is:

void dealloc_node(node * n)
{
    n->~node();
    pool.deallocate(n);
}

So the node is destroyed (by the destructor) even if the memory is not deallocated.

Anthony

--
Anthony Williams
Author of C++ Concurrency in Action | http://www.manning.com/williams
Custom Software Development         | http://www.justsoftwaresolutions.co.uk
Just Software Solutions Ltd, Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL, UK

Your dealloc_node is:
void dealloc_node(node * n) { n->~node(); pool.deallocate(n); }
So the node is destroyed (by the destructor) even if the memory is not deallocated.
i see your point ... restricting the use of the fifo queue to PODs should work around this issue ...

tim

--
tim@klingt.org
http://tim.klingt.org

Wherever we are, what we hear is mostly noise. When we ignore it, it disturbs us. When we listen to it, we find it fascinating.
  John Cage
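A sketch of that variant, reusing `node` and `pool` from the quoted snippet (a fragment, not a complete fix): for trivially-destructible (POD) nodes the destructor call is simply dropped, so a late reader sees stale but still-allocated memory rather than a destroyed object.

void dealloc_node(node * n)
{
    // no n->~node(): the node is trivially destructible, so recycling the
    // raw storage via the freelist removes the undefined behaviour above
    // (though not, as discussed next, the ABA problem)
    pool.deallocate(n);
}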

Tim Blechmann <tim@klingt.org> writes:
i see your point ... restricting the use of the fifo queue to PODs should work around this issue ...
Yes, in that you can remove the node destructor from dealloc_node, and thus you no longer have undefined behaviour.

However, the node could still be reused by another thread calling push() before our thread A wakes up. In this case, you could have an ABA problem --- thread A has a pointer to node X which originally had a next ptr of Y, but in the meantime thread B has deallocated node X. Node X may subsequently be reused with a next ptr of node Z. If node X becomes the head node before thread A wakes up, then the CAS on line 139 will succeed, despite the fact that node X has been deallocated and reallocated since, and the CAS will store the wrong "next" value (Y rather than Z), thus corrupting the queue.

Thus, you cannot reuse node X unless you can be sure there is no thread in the position of thread A. With the code as it stands, you cannot know that, as thread A has not done any writing in dequeue().

Since you are using tagged pointers, you could increment the tag value whenever you reuse a node, which will at least ensure that the CAS in thread A fails if there is an ABA issue, since the tag part of the pointer will be different. Since the tag has a finite size, this is still a problem in theory, as the tag value might wrap around before thread A resumes, but in practice it's probably OK.

Alternatively, you could use hazard pointers, reference-counted pointers (as per the implementation I posted), reference counting on the whole queue (last one out deletes any dequeued nodes), or another memory reclamation scheme.

Anthony
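A sketch of the tag-on-reuse idea (an illustration, not Tim's actual fix): free nodes live in a preallocated array and are linked by index, and the freelist head packs a 32-bit index with a 32-bit tag that is bumped on every successful CAS, so a stale head value fails its CAS even if the same index comes back. As noted above, the finite tag can in theory wrap around.

#include <atomic>
#include <stdint.h>
#include <vector>

struct tagged_freelist {
    struct node {
        std::atomic<uint32_t> next_free;  // index of the next free node
        // ... payload ...
    };
    static const uint32_t nil = 0xffffffffu;

    std::vector<node> storage;       // sized at construction; initial linking omitted
    std::atomic<uint64_t> head;      // packs [tag:32 | index:32]

    static uint64_t pack(uint32_t tag, uint32_t idx) {
        return (uint64_t(tag) << 32) | idx;
    }

    node* allocate() {
        uint64_t h = head.load();
        for (;;) {
            uint32_t const idx = uint32_t(h);
            if (idx == nil) return 0;    // freelist exhausted
            uint32_t const next =
                storage[idx].next_free.load(std::memory_order_relaxed);
            // bumping the tag makes a stale head (same index, different
            // history) fail this CAS -- modulo 32-bit tag wrap-around
            if (head.compare_exchange_weak(h, pack(uint32_t(h >> 32) + 1, next)))
                return &storage[idx];
        }
    }

    void deallocate(node* n) {
        uint32_t const idx = uint32_t(n - &storage[0]);
        uint64_t h = head.load();
        for (;;) {
            n->next_free.store(uint32_t(h), std::memory_order_relaxed);
            if (head.compare_exchange_weak(h, pack(uint32_t(h >> 32) + 1, idx)))
                return;
        }
    }
};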

Since you are using tagged pointers, you could increment the tag value whenever you reuse a node, which will at least ensure that the CAS in thread A fails if there is an ABA issue.
thanks for noting this issue! i will apply a fix for it later ...
Alternatively, you could use hazard pointers, reference-counted pointers (as per the implementation I posted), reference counting on the whole queue (last one out deletes any dequeued nodes), or another memory reclamation scheme.
i have an implementation of the `pass-the-buck' algorithm for memory reclamation, but from my understanding both `pass-the-buck' and `smr'/hazard pointers may be patented :(

thanks, tim

--
tim@klingt.org
http://tim.klingt.org

Relying on the government to protect your privacy is like asking a peeping tom to install your window blinds.
  John Perry Barlow

Tim Blechmann <tim@klingt.org> writes:
i have an implementation of the `pass-the-buck' algorithm for memory reclamation, but from my understanding both `pass-the-buck' and `smr'/hazard pointers may be patented :(
Hazard pointers are definitely patented. I don't know the state of pass-the-buck.

Anthony

On Mon, Mar 16, 2009 at 2:09 PM, Anthony Williams <anthony.ajw@gmail.com> wrote:
However, the node could still be reused by another thread calling push() before our thread A wakes up. In this case, you could have an ABA problem [...] Alternatively, you could use hazard pointers, reference-counted pointers (as per the implementation I posted), reference counting on the whole queue (last one out deletes any dequeued nodes), or another memory reclamation scheme.
That's why my lock-free queue in C# was comparatively easy to implement: although GC doesn't necessarily prevent the ABA problem in every situation, for a fifo queue with enqueue and dequeue it suffices. Anyway, don't you think that confining T to be a POD type is a little bit restrictive?

Anyway, don't you think that confining T to be a POD type is a little bit restrictive?
well, i am usually using the fifo queue in a soft-realtime system, where this restriction doesn't matter. in order to transfer non-POD classes, one will probably want to allocate them on the heap and pass pointers through the queue ...

tim

--
tim@klingt.org
http://tim.klingt.org

After one look at this planet any visitor from outer space would say "I want to see the manager."
  William S. Burroughs
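A usage sketch of that approach; the fifo interface here (header path, enqueue/dequeue names and signatures) is assumed for illustration, not confirmed against the sandbox code:

#include <iostream>
#include <string>
#include <boost/lockfree/fifo.hpp>   // header path assumed

int main()
{
    boost::lockfree::fifo<std::string*> q;   // queue carries plain pointers (POD)

    q.enqueue(new std::string("hello"));     // ownership passes into the queue

    std::string* s = 0;
    if (q.dequeue(&s)) {                     // consumer takes ownership back
        std::cout << *s << '\n';
        delete s;                            // heap object is freed outside the queue
    }
}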

I've attached a sample implementation that uses atomic reference-counting [...] I don't make any claims for performance, but I believe it to be "correct", with no data races or undefined behaviour.
while the implementation of your queue looks good to me, i am a bit concerned about the use of dynamic memory allocation ... it is `lockfree' in terms of the data structure implementation, but since it allocates memory from the os (for node and T), its behavior depends on the new/delete implementation. in my implementation pop is lockfree in any case, while push is lockfree as long as the number of nodes does not exceed the preallocated node count ... while this may be negligible for an average-case scenario, it is quite significant for the worst case ...

tim

--
tim@klingt.org
http://tim.klingt.org

The price an artist pays for doing what he wants is that he has to do it.
  William S. Burroughs

Tim Blechmann <tim@klingt.org> writes:
it is `lockfree' in terms of the data structure implementation, but since it allocates memory from the os (for node and T), its behavior depends on the new/delete implementation.
Yes, with a poor allocator there will be a performance hit and potentially locking inside the allocator. I would just replace the global allocator with a proper MT allocator such as Hoard in the first instance, and I hope that C++0x compilers will ship with such an allocator. If allocator performance is still an issue, it can easily be replaced with a custom allocator designed for the specific use case.
in my implementation pop is lockfree in any case, while push is lockfree as long as the number of nodes does not exceed the preallocated node count ... while this may be negligible for an average-case scenario, it is quite significant for the worst-case ...
Yes. You could amortize the cost by allocating chunks of nodes rather than single nodes, but then you need to track the chunks...

Anthony
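One way to keep the allocation strategy swappable without touching the queue logic is an allocator template parameter rebound to the internal node type -- a sketch only (the posted queue hard-codes new/delete; the C++03-style rebind is used here to match the era):

#include <memory>
#include <new>

template<typename T, typename Alloc = std::allocator<T> >
class queue {
    struct node { /* ... data/count/next as before ... */ };

    // rebind the user's allocator from T to the internal node type
    typedef typename Alloc::template rebind<node>::other node_allocator;
    node_allocator alloc_;

    node* allocate_node() {
        node* n = alloc_.allocate(1);   // raw storage from the allocator
        new (n) node();                 // placement-construct the node
        return n;
    }
    void deallocate_node(node* n) {
        n->~node();
        alloc_.deallocate(n, 1);
    }
    // ... push/pop unchanged, calling allocate_node/deallocate_node
    //     instead of new/delete ...
};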

On Mon, 16 Mar 2009 13:41:34 +0100, Tim Blechmann <tim@klingt.org> wrote:
while the implementation of your queue looks good to me, i am a bit concerned about the use of dynamic memory allocation ...
What you need is a memory pool for your thread pool. :) -- EA

Dear admin,

I am trying to post to the boost mailing list. I get a reply that the message was received, but it does not appear on the list. Can you help me?

Best wishes, Andreas Klein

------------------------------------------------------------------------

Dear boost developers,

I checked the implementation of boost::dynamic_bitset::count and found that it uses a table lookup. This is not optimal. First, on 64-bit machines divide-and-conquer algorithms are faster than table lookup and even save memory. Second, if the bitset contains several words, there are population count algorithms, for both 32- and 64-bit machines, that are faster than the simple loop over all words used by boost.

Together with Y. Edel I developed a new version, which saves 60% over the simple algorithm. We put the code online (http://cage.ugent.be/~klein/popc.html). A disadvantage of our implementation is that we rely on aggressive optimization (-O3) by the compiler.

If you are looking for a compromise between speed and code complexity, the first iteration of the Harley-Seal method (25% faster than the simple loop) would be a candidate. You can find this and other population count algorithms on our homepage, too.

If there is interest, I can add one of the advanced algorithms to boost.

Best wishes, Andreas Klein
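For reference, the classic 64-bit divide-and-conquer population count being referred to (the constants are the standard bit masks; the Harley-Seal method builds on word-level counts like this with a carry-save adder across multiple words):

#include <stdint.h>

inline unsigned popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                // 2-bit sums
    x = (x & 0x3333333333333333ULL)
      + ((x >> 2) & 0x3333333333333333ULL);                    // 4-bit sums
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;                // 8-bit sums
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);      // add the bytes
}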

Andreas Klein wrote:
I am trying to post to the boost mailing list. I get a reply that the message was received, but it does not appear on the list. Can you help me?
Hi, Andreas. The message that you are reposting here successfully arrived the first time. You must have the "Receive your own posts to the list?" flag turned off. You can turn it on through the web interface, here: http://lists.boost.org/mailman/options.cgi/boost

HTH, Dmitry

Hello Tim,

the lockfree_channel was my first attempt at a lockfree fifo and was only intended for some tests. I'm interested in a boost lib of lockfree structures - I hope the boost community can help.

regards, Oliver

On Monday 16 March 2009 01:04:34, Tim Blechmann wrote:
maybe it would make sense to join our efforts in implementing a lockfree queue and get it into boost?

I'm interested in a boost lib of lockfree structures - I hope the boost community can help.
I agree. Threadpool is just a brick of a bigger ensemble that would make efficient multithreaded programming a little bit easier. Below it one can think of concurrent structures and operations, above it ready-to-use patterns and algorithms.

Maybe you could start by making an "x-mas list" of what you need. Do you have a design diagram of your current threadpool implementation?

Kind regards

-- EA

On Tuesday 17 March 2009 20:08:18, Edouard A. wrote:
Maybe you could start by making an "x-mas list" of what you need.
- improve performance (as fast as TBB, if possible)
- lockfree fifo, lifo and priority queue
- check the work-stealing algorithm
- check for race conditions
- get the framework interface as stable as possible
- timeout for task execution (meaning if a task could not be finished before a certain time is reached, the task gets cancelled -> should be possible via Anthony's futures lib)
Do you have a design diagram of your current threadpool implementation?
currently not :(

- improve performance (as fast as TBB, if possible)
- lockfree fifo, lifo and priority queue
- check the work-stealing algorithm
Maybe there are some things to work out before implementing lockfree structures. What would be helpful to know is:

- task latency - you send one task into the system and see how long it takes (no-op task)
- throughput - how many tasks per second you can handle (no-op tasks)
- a combination of the two - running the same tests with tasks of arbitrary complexity

A profiler running during these tests should give you a pretty clear picture of what needs to be improved. On my side, what I could do is profile the basic parallel_sort implementation I posted a while ago.

Regards.

--
Edouard Alligand
Partner
http://www.bureau14.fr/
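A sketch of the throughput measurement described above (task latency is the n_tasks == 1 case); the queue type and its push/pop interface are placeholders, assumed for illustration:

#include <chrono>
#include <iostream>
#include <thread>

template<typename Queue>
void throughput_test(Queue& q, long n_tasks)
{
    using namespace std::chrono;
    steady_clock::time_point const start = steady_clock::now();

    std::thread producer([&] {
        for (long i = 0; i < n_tasks; ++i)
            while (!q.push(i)) {}        // spin until the no-op task is accepted
    });
    std::thread consumer([&] {
        long v;
        for (long i = 0; i < n_tasks; ++i)
            while (!q.pop(v)) {}         // spin until an item arrives
    });
    producer.join();
    consumer.join();

    double const secs =
        duration_cast<duration<double> >(steady_clock::now() - start).count();
    std::cout << (n_tasks / secs) << " tasks/sec\n";
}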
participants (9)
- Andreas Klein
- Anthony Williams
- Dmitry Goncharov
- Edouard A.
- Hartmut Kaiser
- hurcan solter
- k-oli@gmx.de
- Tim Blechmann
- Vicente Botet Escriba