
Hello,

On 6/13/07, Giovanni Piero Deretta <gpderetta@gmail.com> wrote:
I'm not really familiar with either the implementation of the Join library or with lock-free design, but from a quick glance at the code, I see in boost/join/base/port.hpp:asynch_f [1] that you use a mutex-protected deque to enqueue async calls. This is probably one place that would benefit from a lock-free deque; the fact that the max queue size is bounded would probably also help with this optimization.
Thanks for spending time looking into the code and replying. It is encouraging that some experts are starting to check the details. Each object with async / synch methods and chords, inheriting from the actor class, has only one mutex protecting ALL the state related to concurrency / synchronization / asynchrony, which is distributed over the async / synch / actor classes. One scenario may explain this better. Suppose an async method is called; the calling thread first grabs the mutex and then does the following:

1> push the passed argument onto the method's queue;
2> turn on a bit in a bitmap representing the message-arrival status of all async / synch methods of this object;
3> check this "global" bitmap against the bitmaps of all chords to find which chord has all the messages it needs and is ready to fire;
4> if a chord is ready to fire, do the following:
   4.1> extract all the arguments from the queues of the async / synch methods of this chord, and bind these arguments to the defined chord body method;
   4.2> if the chord has a synch method, wake up the calling thread of that synch method to run the chord body;
   4.3> otherwise spawn a new task in the executor's thread pool to run the chord body;
5> the calling thread of the async method releases the mutex and returns.

So the mutex is not only protecting the queue; it protects this whole process, its data structures and operations. The queues and other data structures have no protection of their own; they are plain STL containers, all guarded by this one global mutex. This call sequence and similar ones are critical to the performance of Join. From Jocaml to Cw, people have done much to optimize these critical paths, and my implementation mostly follows Cw. I don't know much about lock-free algorithms. From what I learnt browsing the web, I only see lock-free algorithms for basic data structures (stacks, queues), which may not help much with the call sequence described above and its heavily customized data structures. I could be wrong, please correct me. A simplified sketch of this call path is shown below.
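To make the steps concrete, here is a minimal, simplified sketch of that call path, not the actual Boost.Join code: a single per-object mutex guards the argument queue, the arrival bitmap, and the chord-matching test, and the chord body is handed off outside the lock. The names (joint, chord_entry, async0) are hypothetical, the chord here has only one async message, a detached thread stands in for the executor's thread pool, and the synch wake-up case (4.2>) is omitted.

#include <chrono>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct joint {
    std::mutex mtx;                       // the single per-object mutex
    unsigned arrived = 0;                 // one bit per async/synch method
    std::deque<int> args;                 // argument queue for async method #0

    struct chord_entry {
        unsigned mask;                    // bits that must all be set to fire
        std::function<void(int)> body;    // chord body
    };
    std::vector<chord_entry> chords;

    // async method #0, taking an int argument
    void async0(int v) {
        std::unique_lock<std::mutex> lock(mtx);   // grab the object mutex
        args.push_back(v);                        // 1> enqueue the argument
        arrived |= 1u << 0;                       // 2> mark message arrival
        for (const chord_entry& c : chords) {     // 3> test chord bitmaps
            if ((arrived & c.mask) == c.mask) {
                int a = args.front();             // 4.1> extract the argument
                args.pop_front();
                if (args.empty())
                    arrived &= ~(1u << 0);
                std::function<void(int)> body = c.body;
                lock.unlock();                    // 5> release before running the body
                std::thread(body, a).detach();    // 4.3> stand-in for the thread pool
                return;
            }
        }
    }                                             // 5> mutex released on return
};

int main() {
    joint j;
    j.chords.push_back({1u << 0,
        [](int v) { std::cout << "chord fired with " << v << "\n"; }});
    j.async0(42);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}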
Not at all. Plain, well-designed lock-full (?) algorithms are fast enough 99% of the time (sometimes even as fast as lock-free). It's just that a more optimized implementation would cover the remaining 1% as well. Do you have any benchmarks? And not just synthetic benchmarks, but some real application written both with classic locks & condvars and with Join patterns. This would be useful to compare both the benefits in code simplicity and in performance (if any). Even if Join were slower, if it really simplified the design of concurrent applications it would be a win.
I cannot agree more with this. I am looking for any way to improve under the current design. I have done some initial optimization with template partial specialization, for example. For async methods, there are two template classes defined:

1> async<void(arg)>: an async method which passes an argument; this class uses a queue to buffer the message;
2> async<void(void)>: an async method which does not pass any arguments; this class does not use a queue, instead it uses an integer to count how many times it has been called.

This way the semantics are maintained with much less overhead. Based on method signatures, the compiler will automatically choose the best-fit template class to use. Similarly, for synch<> methods, there are four template classes defined for the compiler to choose from. A rough sketch of this idea is shown below. I haven't done any benchmark comparison yet. I'll try to do some profiling and benchmarking when I get the time and resources.
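For illustration, here is a rough sketch of the specialization idea (my own simplification, not the library's actual classes; the member names and functions like pending() and take() are hypothetical): the primary template buffers each argument in a queue, while the no-argument specialization keeps only a call counter. Note that void(void) and void() name the same function type, so the specialization is written on void().

#include <cstddef>
#include <deque>
#include <iostream>

template <typename Sig>
class async;                        // primary template; only specializations are used

// async method carrying an argument: buffer each message in a queue
template <typename Arg>
class async<void(Arg)> {
    std::deque<Arg> queue_;
public:
    void operator()(Arg a) { queue_.push_back(a); }
    bool pending() const   { return !queue_.empty(); }
    Arg take()             { Arg a = queue_.front(); queue_.pop_front(); return a; }
};

// async method with no argument: a counter preserves the same semantics
template <>
class async<void()> {
    std::size_t count_ = 0;
public:
    void operator()()      { ++count_; }
    bool pending() const   { return count_ > 0; }
    void take()            { --count_; }
};

int main() {
    async<void(int)> with_arg;  // picks the buffering specialization
    async<void()>    no_arg;    // picks the counting specialization
    with_arg(7);
    no_arg();
    std::cout << with_arg.pending() << " " << no_arg.pending() << "\n";  // prints: 1 1
    std::cout << with_arg.take() << "\n";                                // prints: 7
}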
Some implementation or design documentation would be greatly useful, but I think most of it can be deduced from the Microsoft papers, is that right?
Yes, the current implementation mostly follows the Cw paper. The most notable differences are the handling of synchronous methods, exceptions, and chord priorities. Thanks, Yigong