
hi julien,
First, I have re-read the docs and started looking at the implementation, and I found what appears to be a contradiction:
- In the doc you say "Spinlocks do not directly interact with the operating system either. However it is possible that the owning thread is preempted by the operating system, which violates the lock-free property"
- In the code for fifo<T>, however, the implementation seems to be "spinning" all the time, using "for (;;)" loops.
Could you explain why this is OK?
spinning is not the problem, locking is. the spinning in fifo (aka queue) and stack basically has the semantics `try an operation; if it fails because another thread interfered, try again'. if the os preempts a thread while it is spinning, it does not block the other threads.
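to illustrate the point, here is a minimal sketch (not boost.lockfree's actual code) of what such a `for (;;)` loop looks like -- a CAS retry loop on a trivial treiber-style stack push, not a spinlock:

```cpp
#include <atomic>

// minimal sketch of a CAS retry loop (not boost.lockfree's actual code).
// each iteration *attempts* the operation; if another thread changed the
// state in between, the compare-and-swap fails and we simply retry with
// the new state.  no thread ever holds a lock, so a thread preempted
// mid-loop cannot block the others.
struct node { int value; node* next; };

std::atomic<node*> head{nullptr};

void push(node* n) {
    node* old = head.load(std::memory_order_relaxed);
    for (;;) {
        n->next = old;
        // if head is still `old`, link n in; on failure, `old` is
        // reloaded with the current head and we try again
        if (head.compare_exchange_weak(old, n,
                std::memory_order_release,
                std::memory_order_relaxed))
            return;
    }
}
```

the key difference from a spinlock: the loop spins on *progress of the operation*, not on another thread releasing a flag.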
Then, I started to write a test program to compare performance with Intel's Threading Building Blocks library (3.0 update 7, the latest stable release). I am far from finished, but my first results are that boost::lockfree is slower on my machine by about 40%.
I use the TBB concurrent_queue<T> class, which according to the docs is "non-blocking". From a quick look at the implementation, it seems to use a technique similar to fifo<T>'s (that is, spinning, no system locks). Do you know whether this is actually a lock-free structure or not?
i checked tbb's concurrent_queue several years ago, but at that time, it seemed to be using spinlocks. maybe this has changed. non-blocking does not necessarily mean lock-free, but it could be an obstruction-free queue ... one point where i really question the non-blocking claim is the signature of the push() function: void push( const T& source ); it either needs to allocate memory dynamically from the OS, where the memory allocator may block, or it throws an exception (and afaict that will also cause some memory allocations).
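a hypothetical illustration of the point about the signature (this is not TBB's code): an unbounded `void push(const T&)` has no way to report failure, so every call must obtain a fresh node somehow.

```cpp
// hypothetical sketch -- NOT TBB's implementation.  with a void return
// type, push() cannot report "no node available", so it must call the
// allocator on every push.  operator new may take a lock internally, and
// on failure it throws std::bad_alloc -- either way the call can block,
// which undermines a non-blocking guarantee even if the linking step
// itself is lock-free.
template <class T>
struct node { T value; node* next; };

template <class T>
void push(node<T>*& head, const T& source) {
    node<T>* n = new node<T>{source, head};  // may lock or throw
    head = n;                                // (lock-free linking elided)
}
```

a bool-returning push drawing from a preallocated pool avoids this, at the cost of being able to fail.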
My question is not demanding that you match Intel's perf ;) but rather whether you have any ideas on what could be done to improve the performance further (Intel uses assembly, for instance, though I have no idea whether that has a big impact; another suspect may be a better atomic implementation).
some people suggest that implementing a backoff could improve the performance. otoh, herlihy and shavit suggest that the performance of a backoff depends heavily on the backoff parameters, and these parameters differ among cpu types and even clock speeds. because of this, and because implementing a backoff would require some assembly code, i decided not to include a backoff in boost.lockfree.
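for reference, a sketch of the kind of exponential backoff herlihy and shavit describe -- this is not part of boost.lockfree, and the window sizes are made-up tuning values, which is exactly the parameter-sensitivity problem mentioned above:

```cpp
#include <atomic>
#include <thread>

// sketch of an exponential backoff (NOT part of boost.lockfree).  after a
// failed CAS, wait an exponentially growing amount before retrying, to
// take pressure off the contended cache line.  the initial and maximum
// windows are illustrative guesses; good values depend on the cpu model
// and clock speed.
struct backoff {
    int limit = 1;                       // current backoff window
    static const int max_limit = 1024;   // illustrative cap

    void operator()() {
        for (int i = 0; i < limit; ++i)
            std::this_thread::yield();   // real code would use pause/nop
        if (limit < max_limit)
            limit *= 2;                  // grow the window exponentially
    }
};

std::atomic<int> counter{0};

// CAS loop that backs off after each failed attempt
void increment() {
    backoff bo;
    int old = counter.load();
    while (!counter.compare_exchange_weak(old, old + 1))
        bo();
}
```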
I have attached the code FYI (please keep in mind that this is a quick test). The interesting part is the "SimpleTest" functions. I compiled on Ubuntu 11.04 / gcc 4.5 in C++0x mode, measured the time using the "time" command, and ran the sample on an AMD cpu whose cpuinfo is attached.
one point: the enqueue/push functions of boost.lockfree do not necessarily succeed. in your code, it should be fine, because the freelist that is used will allocate new internal nodes if it is empty. however, this means that the push operation may block, if the allocator blocks. cheers, tim
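an illustrative sketch of the freelist idea (not boost.lockfree's actual freelist, and single-threaded for brevity -- the real one uses atomics): a push that draws nodes from a fixed preallocated pool and reports failure instead of calling the allocator. a push that never allocates can stay non-blocking, but it can fail:

```cpp
// illustrative sketch, NOT boost.lockfree's freelist.  nodes come from a
// fixed, preallocated pool; when the pool is exhausted, push() returns
// false instead of touching the (possibly blocking) system allocator.
// the atomics a real lock-free freelist needs are omitted for brevity.
struct pool_stack {
    struct node { int value; node* next; };

    node storage[4];          // fixed, preallocated pool
    node* free_list;          // singly linked list of unused nodes
    node* head = nullptr;     // top of the stack

    pool_stack() {
        for (int i = 0; i < 3; ++i)
            storage[i].next = &storage[i + 1];
        storage[3].next = nullptr;
        free_list = &storage[0];
    }

    bool push(int v) {
        node* n = free_list;
        if (!n)
            return false;     // pool exhausted: fail, do not block
        free_list = n->next;
        n->value = v;
        n->next = head;
        head = n;
        return true;
    }
};
```

allowing the freelist to fall back to the allocator trades this failure mode for the possible blocking described above.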