
Dear All, Here is a quick review of the proposed Boost.Lockfree library. Utility ------- For some proposed libraries their utility is beyond doubt because they are already being used in some significant application. In contrast most GSoC projects are developed more in the "hope that they will be useful" without a motivating application. So we have to ask ourselves, does this library meet a need? To me, lock free algorithms and data structures have always seemed appealing at first sight but, apart from some trivial cases like counters, on close inspection they turn out to be too complicated to bother with. (They also lack required features, i.e. waiting for non-empty; I don't think I've ever written a producer-consumer application that didn't need the consumer to block on empty.) The author acknowledges that these structures will typically be slower than conventional implementations using mutexes, but claims that their benefit comes in terms of better worst-case latency. Specifically, if a thread is pre-empted while it holds a lock, another thread will have to wait until the first thread is re-scheduled and can unlock before it can itself lock. But this case is exactly the one that priority inversion is supposed to address: when the second thread tries to lock, the kernel immediately re-schedules the first thread. Yes, because of the kernel involvement this will probably still be slower than the lock-free approach, but perhaps not by much. It seems to me that the author should present benchmarks to demonstrate the actual worst-case latency difference between this code and a lock-based implementation. Design ------ I am happy with the design of the interface, as far as it goes; there are a few things that could be added though: - Debug modes that check the thread requirements. - Other combinations of single/multiple producer/consumer. Implementation -------------- I have looked only at the implementation of the ringbuffer class template. It lacks comments. There are some oddities like this: size_t ret = read_index - write_index - 1; if (write_index >= read_index) ret += max_size; return ret; Here, a negative value may be assigned to the unsigned size_t. I think this is actually safe, but it looks odd. I have not found any bugs. It looks like every call to enqueue() and dequeue() will access both the read and write indexes. Won't this cause poor performance due to cache issues? I would have been tempted to avoid this by keeping local 'available' counts: (PSEUDO_CODE) private: atomic<size_t> read_index_; atomic<size_t> write_index_; size_t read_available_; size_t write_available_; public: bool enqueue(...) { if (write_available_==0) { // In this case, we access read_index_ which is probably cached by the consumer thread, // which is expensive. But this is rare. write_available_ = (read_index_ + max_size - write_index_ - 1) % max_size; } if (!write_available_) return false; // full buffer[write_index_.load(memory_order_relaxed)] = data; write_index_.store((write_index_+1)%max_size); --write_available_; } // dequeue similarly... Documentation ------------- I'm not impressed by the documentation. It lacks rationale and information about the implementation. It has some formatting issues and typos (e.g. lose/loose). Platform requirements (i.e. the Boost.Atomic and/or std::atomic requirements and supported architectures) should be explained. The reference documentation involves more clicks than I would expect to reach the actual content. Some functions are marked as "for debugging only", e.g. the empty() and reset() [which should probably be called clear()] methods. It's not clear to me why these methods must be unsafe; for example, for the ringbuffer, I would expect the consumer thread to safely be able to call empty() and the producer to safely be able to call full() (which is not provided). Rationale for these restrictions (or relaxations of them) would be useful. - Did you try to use the library? With what compiler? Did you have any problems? I considered exercising the library on my ARM systems, but I realised that it requires 64-bit atomics; currently, Boost.Atomic only provides 32-bit atomics on ARM. (This raises the whole issue of the Boost.Atomic dependency. I'm presuming that Tim intends not to bundle a Boost.Atomic snapshot with the library if approved. If that changes, we really need another review.) - How much effort did you put into your evaluation? A glance? A quick reading? In-depth study? A few hours reading the docs and code. - Are you knowledgeable about the problem domain? I know a certain amount about atomics and lock-based concurrent programming, but not much about lock-free data structures. - Please always state in your review, whether you think the library should be accepted as a Boost library! I think it should be accepted provided that: * The author demonstrates some measurable benefit over lock-based data structures. * The documentation is improved to describe the implementation and to provide rationale for design decisions. * The library is not added to Boost until Boost.Atomic has been reviewed and approved (or, available std::atomic implementations work with it). Regards, Phil.

hi phil,
For some proposed libraries their utility is beyond doubt because they are already being used in some significant application. In contrast most GSoC projects are developed more in the "hope that they will be useful" without a motivating application. So we have to ask ourselves, does this library meet a need?
first: the GSoC is about reviewing this library. it has been developed long before because i needed some lock-free data structures for a soft real-time system, which eventually became the project of my master thesis. so (a) i have been using this code for quite some time (the oldest codepieces are few years old). (b) over the last one or two years several projects have been using boost.lockfree. sometimes i got emails from name@bigindustyname.com asking for something or reporting a problem (especially maintaining the asm code was quite a hell before i migrated to boost.atomic)
To me, lock free algorithms and data structures have always seemed appealing at first sight but, apart from some trivial cases like counters, on close inspection they turn out to be too complicated to bother with.
they are not the best choice for all applications. but i have never mentioned that :) having a background in low-latency real-time audio synthesis (deadlines of about 1 ms), i cannot rely on high-performance concurrent data structures, if they would eventually block the real-time thread.
(They also lack required features, i.e. waiting for non-empty; I don't think I've ever written a producer-consumer application that didn't need the consumer to block on empty.)
my real-world example has a rate-monotonic scheduler. whenever the scheduler is called, i need to poll a queue to see of there are new jobs.
The author acknowledges that these structures will typically be slower than conventional implementations using mutexes, but claims that their benefit comes in terms of better worst-case latency. Specifically, if a thread is pre-empted while it holds a lock, another thread will have to wait until the first thread is re-scheduled and can unlock before it can itself lock. But this case is exactly the one that priority inversion is supposed to address: when the second thread tries to lock, the kernel immediately re-schedules the first thread.
in theory, yes, but it assumes the following points: * the context switch is cheap * the scheduler itself is does not need to wait for a lock that is held by a different process for too long * the mutex is not implemented as combination of spinlock and waiting lock. iirc this is done sometimes in order to avoid the cost of a system call. actually, there is an interesting blog entry `time waits for nothing' [1], containing sections from the audio system documentation of microsoft, apple and linus, which all clearly state that it is not acceptable to use any blocking primitives from their real-time callbacks.
I am happy with the design of the interface, as far as it goes; there are a few things that could be added though:
- Debug modes that check the thread requirements.
hm, i agree this might be useful, but i am not sure, if it will really be possible to catch all thread requirements accurately. but i will have a look.
- Other combinations of single/multiple producer/consumer.
other people have asked for this, too ... it is probably a good future extension.
Implementation --------------
I have looked only at the implementation of the ringbuffer class template. It lacks comments. There are some oddities like this:
hm, yes ... the index computation can probably be cleaned up.
It looks like every call to enqueue() and dequeue() will access both the read and write indexes. Won't this cause poor performance due to cache issues?
i don't think so: considering enqueue, the enqueue thread will both read and write write_index, but the dequeue thread will only read it. when the dequeue thread reads the value, it can use the one in the local cache, but if the enqueue thread writes the new index atomically, i will need to trigger the cache coherency protocol (e.g. broadcast the new value to other cores or invalidate the cache lines of other values). so afaik, accessing write_index is cheap, because reading does not need to deal with cache coherency.
information about the implementation. It has some formatting issues and typos (e.g. lose/loose). Platform requirements (i.e. the Boost.Atomic and/or std::atomic requirements and supported architectures) should be explained.
i'll probably write some more about the internals and the design.
The reference documentation involves more clicks than I would expect to reach the actual content.
already fixed.
Some functions are marked as "for debugging only", e.g. the empty() and reset() [which should probably be called clear()] methods. It's not clear to me why these methods must be unsafe; for example, for the ringbuffer, I would expect the consumer thread to safely be able to call empty() and the producer to safely be able to call full() (which is not provided). Rationale for these restrictions (or relaxations of them) would be useful.
good points: clear is more consistent to other stl/boost data structures. providing full() and empty() with a defined rationale sounds good, too ...
- Did you try to use the library? With what compiler? Did you have any problems?
I considered exercising the library on my ARM systems, but I realised that it requires 64-bit atomics; currently, Boost.Atomic only provides 32-bit atomics on ARM.
actually, the requirement for 64bit atomics will be slightly relaxed for bounded and fixed-sized data strucures by using 16 bit array indices instead of pointers.
(This raises the whole issue of the Boost.Atomic dependency. I'm presuming that Tim intends not to bundle a Boost.Atomic snapshot with the library if approved. If that changes, we really need another review.)
i have a local branch, where i am bundling boost.atomic. but there are some problems: * as the review has been done without taking boost.atomic into account, there would have to be a second review (implementation only) * it would be a fork of the boost.atomic library, so maintaining both side-by-side could become quite painful * if the implementation will be reviewed, then it is already half the work for a full review.
I think it should be accepted provided that: * The author demonstrates some measurable benefit over lock-based data structures.
as stated above (and cited from [1]), there are situations, where lock-based programming is not an option. i have already stated in the docs that lock-free programming is not the best approach for all applications and lock-based data structures can be more efficient in some (many) cases. probably the first line of the doc should clear up the misconception that lock-free data structures are the high-performance data structures ... like it is a misconception that `real-time' means `really fast' ;)
* The documentation is improved to describe the implementation and to provide rationale for design decisions.
agreed
* The library is not added to Boost until Boost.Atomic has been reviewed and approved (or, available std::atomic implementations work with it).
this has already been stated in the announcement of the review: boost.lockfree can be accepted, but it won't be merged before boost.atomic is accepter. cheers, tim [1] http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits- for-nothing
participants (2)
-
Phil Endecott
-
Tim Blechmann