[Looking for Feedback] - Fairness lib
Dear Boost Community,

I am writing to seek feedback on a new C++ library that I have been developing (https://github.com/Sernior/fairness <https://github.com/Sernior/fairness>). This library, named "*Fairness*," is still a work in progress. Nonetheless, I believe this is an opportune time to begin gathering feedback.

*Fairness* is designed to provide fair synchronization primitives. While the standard C++ library offers various tools for managing concurrency, it does not provide mechanisms for handling fairness policies.

To introduce the most fundamental component of my library, the "priority_mutex," I would like to share a simple example (it uses the BS thread pool, https://github.com/bshoshany/thread-pool):

#include <thread>
#include <chrono>
#include <array>
#include <BS_thread_pool.hpp>
#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> pm;

static void busy_wait_nano(int64_t nanoseconds){
    auto begin = std::chrono::steady_clock::now();
    for (; std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - begin).count() < nanoseconds;)
        continue;
}

static void thread_function_nano(int p, int preCriticalTime, int postCriticalTime, int criticalTime){
    busy_wait_nano(preCriticalTime);
    pm.lock(p);
    busy_wait_nano(criticalTime);
    pm.unlock();
    busy_wait_nano(postCriticalTime);
}

int main()
{
    std::array<int, 8> prios  {0, 2, 2, 1, 1, 3, 3, 0};
    std::array<int, 8> preCT  {2000, 1500, 2000, 3000, 1000, 500, 500, 2000};
    std::array<int, 8> postCT {5000, 3000, 2000, 2500, 1000, 1500, 1500, 4500};
    int criticalTime = 1000;

    BS::thread_pool pool(8);

    for (int i = 0; i < 8; ++i){
        pool.push_task(thread_function_nano, prios[i], preCT[i], postCT[i], criticalTime);
    }
    pool.wait_for_tasks();

    return 0;
}

In this example the priorities are not assigned randomly: I gave priority 0 (the highest) to thread 0, which totals 2000 + 1000 + 5000 ns, and to thread 7, which totals 2000 + 1000 + 4500 ns; these are the two threads with the most total work. Priority mutexes are 0-indexed, much like a std::array: a priority_mutex<4> manages priorities ranging from 0 to 3, with 0 being the highest priority.

Currently, the library comprises priority mutexes, in both recursive and shared versions, along with RAII wrappers such as unique_lock and shared_lock, all of which support priorities. Some elements are still missing, including priority condition variables and timed versions of the mutexes.

The primary motivation behind this library is to enhance throughput in multi-threaded pipelines. To illustrate the concept simply, imagine a pipeline with N tasks, some of which access a protected resource. Under contention, there exists an optimal order of resource accesses that minimizes overall execution time; the images at https://sernior.github.io/fairness/ <https://sernior.github.io/fairness/> show two different orderings of the critical sections, one of which is much better than the other. This library aims to empower developers who have profiled their software to manipulate this order with minimal performance overhead and as simply as possible.

In terms of implementation, much of the development has revolved around benchmarking and testing. During this process, I have experimented with over 10 different versions of the priority mutex.
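As a usage sketch of those RAII wrappers: the snippet below assumes only that the wrapper takes the desired priority as an extra constructor argument and forwards it to lock(p), mirroring std::unique_lock otherwise; the exact signature is an assumption for illustration, not a documented guarantee.

#include <boost/fairness.hpp>

static boost::fairness::priority_mutex<4> mtx;

void do_work(int p){
    // Assumed interface: the priority is passed at construction and
    // forwarded to mtx.lock(p).
    boost::fairness::unique_lock<boost::fairness::priority_mutex<4>> lk(mtx, p);
    // ... critical section ...
}   // the mutex is released on scope exit, as with std::unique_lock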
To return to the implementation: ultimately, I achieved the best results by using a priority spinlock, which is also available to library users, to provide internal atomicity to the rest of the mutexes.

An essential consideration in the implementation is ensuring that the internal priority spinlock is 128 bytes away from the rest of the memory used by the mutex, especially the memory on which the atomic::wait is performed. This separation helps avoid false sharing (a small sketch of the idea is in the P.S. below). The choice of 128 bytes aligns with processor architectures like AMD Zen 4 (the one I am testing on), as indicated in its optimization manual. I can attest that without that 128-byte separation the PM_LockUnlock benchmark was 10x worse and PM_pipeline_benchmark_fast was worse than the std counterpart (see below for the benchmarks).

Another aspect to note in the implementation is the use of 4 bytes for booleans where only a single byte is needed. This decision is related to the way atomic::wait is implemented: most standard library implementations of atomic::wait check whether the type's size is supported by the current platform's wait syscall. For instance, Windows can wait on 1 byte, while Linux kernel based systems can only wait on 4 bytes (the futex syscall operates on 32-bit values).

Below are some benchmark results, measured on 6.4.0-kali3-amd64 with an AMD Ryzen 5 7600X 6-Core Processor.

The LockUnlock benchmarks were the most difficult to optimize. Those are 8 threads simply doing:

m.lock();
m.unlock();

The pipeline benchmarks (of different lengths) are actual pipelines where each thread spends a different amount of time before/during/after the critical section, and I try to set the priorities accordingly so as to finish as soon as possible.

You will also see a mutex called slim_priority_mutex: a priority_mutex implementation that supports up to 7 priorities (15 if Boost.Atomic is found and your hardware supports DWCAS). These are lightweight versions that occupy only 128 bits for up to 7 priorities and, past that, up to 15 priorities, 256 bits.

I am also testing a custom wait implementation (https://github.com/Sernior/fairness/blob/main/include/boost/fairness/detail/... <https://github.com/Sernior/fairness/blob/main/include/boost/fairness/detail/wait_ops.hpp>), which can be tested by defining BOOST_FAIRNESS_USE_EXPERIMENTAL_WAIT_NOTIFY. At some point I wish to make this wait the default the library uses instead of the std one, since users of my library could configure it through my config macros (the number of fast spins, relaxed spins, and yield spins before the wait syscall).

As I said at the beginning, beyond presenting my library, this email has the purpose of obtaining feedback. I would like to see if there is anyone (other than me) interested in what I am doing and, if so, to keep in mind that I am still far from finished; I would love suggestions.

Thank you all for your attention.

Best regards,

Federico Abrignani.
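P.S. To illustrate the false-sharing point above, here is a minimal layout sketch. It is hypothetical (not the library's actual types); it only shows how alignas can keep the spinlock word 128 bytes away from the word that threads perform atomic::wait on:

#include <atomic>
#include <cstdint>

// Hypothetical layout sketch: the spinlock word and the 4-byte word used
// for atomic::wait/notify sit on separate 128-byte boundaries, so threads
// spinning on the lock do not ping-pong the cache line that waiters sleep on.
struct padded_state_sketch {
    alignas(128) std::atomic<std::uint32_t> internal_lock{0}; // spinlock word
    alignas(128) std::atomic<std::uint32_t> wait_word{0};     // waited-on word
};

static_assert(sizeof(padded_state_sketch) == 256,
              "each member occupies its own 128-byte block");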
To add the benchmark results, since the image did not show in my mail:

---------------------------------------------------------------------------------------------------------------
Benchmark                                                                            Time          CPU  Iterations
---------------------------------------------------------------------------------------------------------------
priority_mutex_benchmark::PM_LockUnlock/threads:8                                 1.64 ns      13.0 ns    54189280
standard_mutex_benchmark::STD_LockUnlock/threads:8                               0.712 ns      5.54 ns   125137360
slim_priority_mutex_benchmark::SLM_PM_LockUnlock/threads:8                        3.27 ns      26.0 ns    27112392
spinlock_priority_mutex_benchmark::SPNLC_PM_LockUnlock/threads:8                 0.825 ns      6.49 ns   111065720
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_LockUnlock/threads:8         3.41 ns      27.2 ns    25413960
recursive_priority_mutex_benchmark::R_PM_LockUnlock/threads:8                     1.65 ns      13.1 ns    53737320
standard_recursive_mutex_benchmark::R_STD_LockUnlock/threads:8                   0.960 ns      7.67 ns   105165456
shared_priority_mutex_benchmark::PM_S_LockUnlock/threads:8                        1.62 ns      12.9 ns    54708152
standard_shared_mutex_benchmark::STD_S_LockUnlock/threads:8                      0.965 ns      7.68 ns   106124968
shared_priority_mutex_benchmark::PM_S_SLockSUnlock/threads:8                      1.58 ns      12.6 ns    54070432
standard_shared_mutex_benchmark::STD_S_SLockSUnlock/threads:8                    0.721 ns      5.76 ns   117148936
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_long/threads:8                 9541052 ns  52513517 ns          16
standard_mutex_benchmark::STD_pipeline_benchmark_long/threads:8                9774822 ns  51878916 ns          16
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_long/threads:8        9380027 ns  55017352 ns          16
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_long/threads:8  9531843 ns  75511719 ns           8
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_long/threads:8  9532650 ns  75912929 ns  8
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_long/threads:8     9539370 ns  52512176 ns          16
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_long/threads:8    9775127 ns  51879528 ns          16
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_long/threads:8        9531850 ns  76251304 ns          16
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_long/threads:8       9775146 ns  51880461 ns          16
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_long/threads:8       6484466 ns  51870805 ns          16
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_long/threads:8      6484449 ns  51869927 ns          16
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_gaming/threads:8                949731 ns   5213265 ns         136
standard_mutex_benchmark::STD_pipeline_benchmark_gaming/threads:8              1001777 ns   5193649 ns         136
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_gaming/threads:8       966555 ns   5557157 ns         128
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_gaming/threads:8  952582 ns  7604306 ns          96
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_gaming/threads:8  921612 ns  7355304 ns 88
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_gaming/threads:8    969296 ns   5291337 ns         136
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_gaming/threads:8  1001943 ns   5193937 ns         136
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_gaming/threads:8       921843 ns   7348795 ns          88
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_gaming/threads:8      999545 ns   5191391 ns         136
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_gaming/threads:8      648466 ns   5187308 ns         136
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_gaming/threads:8     648462 ns   5187299 ns         136
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_audio/threads:8                  96730 ns    525237 ns        1336
standard_mutex_benchmark::STD_pipeline_benchmark_audio/threads:8                102634 ns    521351 ns        1344
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_audio/threads:8         97671 ns    566508 ns        1264
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_audio/threads:8   91991 ns    734638 ns         936
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_PM_pipeline_benchmark_audio/threads:8  97838 ns  780163 ns   928
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_audio/threads:8      96704 ns    525095 ns        1336
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_audio/threads:8    102726 ns    521373 ns        1344
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_audio/threads:8         92130 ns    734847 ns         904
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_audio/threads:8       102640 ns    521309 ns        1344
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_audio/threads:8        64865 ns    518881 ns        1352
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_audio/threads:8       64864 ns    518872 ns        1352
____________________________________________________________________________________________
priority_mutex_benchmark::PM_pipeline_benchmark_fast/threads:8                    1181 ns      7046 ns       95936
standard_mutex_benchmark::STD_pipeline_benchmark_fast/threads:8                   1537 ns      7263 ns       95040
slim_priority_mutex_benchmark::SLM_PM_pipeline_benchmark_fast/threads:8           1218 ns      9528 ns       78904
spinlock_priority_mutex_benchmark::SPNLC_PM_pipeline_benchmark_fast/threads:8     1004 ns      8017 ns       86744
slim_spinlock_priority_mutex_benchmark::SPNLC_SLM_pipeline_benchmark_fast/threads:8  1028 ns   8192 ns       85040
recursive_priority_mutex_benchmark::R_PM_pipeline_benchmark_fast/threads:8        1177 ns      7084 ns       94584
standard_recursive_mutex_benchmark::R_STD_pipeline_benchmark_fast/threads:8       1573 ns      7318 ns       96456
shared_priority_mutex_benchmark::PM_S_pipeline_benchmark_fast/threads:8           1380 ns     10979 ns       59928
standard_shared_mutex_benchmark::STD_S_pipeline_benchmark_fast/threads:8          1534 ns      7301 ns       97200
shared_priority_mutex_benchmark::PM_S_Spipeline_benchmark_fast/threads:8           665 ns      5318 ns      131400
standard_shared_mutex_benchmark::STD_S_Spipeline_benchmark_fast/threads:8          663 ns      5303 ns      131816
On 10/21/23 17:11, Federico Abrignani via Boost wrote:
> Dear Boost Community,
>
> I am writing to seek feedback on a new C++ library that I have been developing (https://github.com/Sernior/fairness <https://github.com/Sernior/fairness>). This library, named "*Fairness*," is still a work in progress. Nonetheless, I believe this is an opportune time to begin gathering feedback.
>
> *Fairness* is designed to provide fair synchronization primitives. While the standard C++ library offers various tools for managing concurrency, it does not provide mechanisms for handling fairness policies.
>
> [snip]
>
> The primary motivation behind this library is to enhance throughput in multi-threaded pipelines. To illustrate the concept simply, imagine a pipeline with N tasks, some of which access a protected resource. Under contention, there exists an optimal order of resource accesses that minimizes overall execution time; the images at https://sernior.github.io/fairness/ <https://sernior.github.io/fairness/> show two different orderings of the critical sections, one of which is much better than the other.
>
> This library aims to empower developers who have profiled their software to manipulate this order with minimal performance overhead and as simply as possible.
While I do think priority-based synchronization primitives would be a very useful addition to the Boost libraries, I find this motivation puzzling. Priorities are introduced to avoid starvation of specific (high-priority) threads at the expense of the less important threads, and likely of overall performance, because fairness is not free. The chart you show on the front page is nice, but it looks to me like a hand-tailored case that is not necessarily representative of reality.

Specifically, there is no guarantee that the processing times of different threads align as depicted in your chart. For example, one thread could have a very long processing time compared to the others, which would negate any effect on the total processing time from reordering threads using priorities. What priorities actually provide is to minimize the processing time of the high-priority thread, pretty much regardless of the low-priority threads, barring data starvation, of course. So I think you should reconsider the motivation, or at least present some evidence that the library helps achieve the stated goals in practice (and how it does that).
> Priorities are introduced to avoid starvation of specific (high-priority) threads at the expense of the less important threads, and likely of overall performance, because fairness is not free. The chart you show on the front page is nice, but it looks to me like a hand-tailored case that is not necessarily representative of reality.
Yes, the charts I provide are a simple, hand-tailored case that I am using to visually explain a concept that can happen in the real world.
> Specifically, there is no guarantee that the processing times of different threads align as depicted in your chart. For example, one thread could have a very long processing time compared to the others, which would negate any effect on the total processing time from reordering threads using priorities. What priorities actually provide is to minimize the processing time of the high-priority thread, pretty much regardless of the low-priority threads, barring data starvation, of course. So I think you should reconsider the motivation, or at least present some evidence that the library helps achieve the stated goals in practice (and how it does that).
Right, but that is also why I said "after profiling your software". There is no guarantee until a developer sees it happening. Allow me to quote my own documentation: "These tools, if misused, have the potential to cause harm rather than benefit. Careful implementation and understanding are crucial to harness their benefits effectively." What I meant by this sentence is exactly what you are saying: these tools should be used by people who have first profiled their software and found themselves in the situation where they thought, "If only I had a way to tell my threads that a specific task should be performed first."

But I share your concern. I certainly need to perform more studies to find out exactly where the boundary lies between when using priority mutexes is better than not using them. If you are interested, you can have a look at my pipeline benchmarks and tinker with them: https://github.com/Sernior/fairness/blob/main/benchmarks/pm/priority_mutex_b...

std::array<int, 8> prios  {0, 2, 2, 1, 1, 3, 3, 0};
std::array<int, 8> preCT  {20, 15, 20, 30, 10, 5, 5, 20};
std::array<int, 8> postCT {50, 30, 20, 25, 10, 15, 15, 45};
int CT = 10;

These three arrays hold the priorities and the times before and after the critical section for the 8 threads; CT is the duration of the critical section itself.

But returning to your example: if one of the threads, let's call it *L*, has a very *L*ong processing time compared to the other threads, wouldn't you agree that it would be better if *L* had priority over the other threads, if you wanted to finish the tasks of all threads as fast as possible?

Feel free to contact me by email directly if you want some explanations on how to use my benchmarks!

Thanks for the feedback.

Federico Abrignani.
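P.S. To make that *L* intuition concrete with some invented numbers: suppose threads A and B both reach a critical section of length 10 at time 0, and afterwards A still has 50 units of work while B has 5. If A enters first, A finishes at 10 + 50 = 60 and B at 20 + 5 = 25, so everything is done at time 60. If B enters first, B finishes at 15, but A finishes at 20 + 50 = 70. Giving priority to the thread with the longer remaining work cuts the overall completion time from 70 to 60.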
On 10/21/23 21:03, Federico Abrignani via Boost wrote:
>> Specifically, there is no guarantee that the processing times of different threads align as depicted in your chart. For example, one thread could have a very long processing time compared to the others, which would negate any effect on the total processing time from reordering threads using priorities. What priorities actually provide is to minimize the processing time of the high-priority thread, pretty much regardless of the low-priority threads, barring data starvation, of course. So I think you should reconsider the motivation, or at least present some evidence that the library helps achieve the stated goals in practice (and how it does that).
>
> But returning to your example: if one of the threads, let's call it *L*, has a very *L*ong processing time compared to the other threads, wouldn't you agree that it would be better if *L* had priority over the other threads, if you wanted to finish the tasks of all threads as fast as possible?
No, not necessarily. It is usually not a matter of long or short run time, but rather of what the bottleneck is, or what is required to finish fast. To give an example, consider two threads. One produces data and puts it in a queue; the other dequeues the data and, say, writes it to a file or sends it to the network. Producing data is expensive, so the first thread may take a lot of processing to produce a piece of data. Also, the queue has limited capacity and may overflow, in which case the writer will overwrite previously written data. To avoid that, you would typically prefer the second thread to have the higher priority, so that the enqueued data is dispatched ASAP and the queue stays mostly empty. Doing so may increase the runtime of the first thread, as it may block on the queue for longer periods of time, especially if there are multiple readers.

Surely, there is a wide variety of use cases, some of which are more along the lines of what you're saying. My point, though, is that priorities are not about minimizing overall run time, as your documentation suggests. They are primarily about minimizing starvation, or minimizing the latency of *specific* threads or tasks, at the expense of other threads and general performance. Whether that effect is used to achieve correctness, reduce latency, or improve throughput is application-specific.
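To sketch that scenario in code, using the lock(p) interface from your first example (the queue and the loop bodies here are invented for illustration; a real implementation would also need a way to block when the queue is empty or full):

#include <queue>
#include <boost/fairness.hpp>

// Two priority levels: 0 (highest) for the consumer draining the queue,
// 1 for the producer, so enqueued data is dispatched ASAP.
static boost::fairness::priority_mutex<2> queue_mutex;
static std::queue<int> data_queue;

void producer(){
    for (;;){
        int item = 42;           // stand-in for expensive data production
        queue_mutex.lock(1);     // low priority: yields to the consumer
        data_queue.push(item);
        queue_mutex.unlock();
    }
}

void consumer(){
    for (;;){
        queue_mutex.lock(0);     // high priority: keep the queue mostly empty
        if (!data_queue.empty()){
            int item = data_queue.front();
            data_queue.pop();
            queue_mutex.unlock();
            (void)item;          // stand-in for writing to a file or the network
        } else {
            queue_mutex.unlock();
        }
    }
}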
> Surely, there is a wide variety of use cases, some of which are more along the lines of what you're saying. My point, though, is that priorities are not about minimizing overall run time, as your documentation suggests. They are primarily about minimizing starvation, or minimizing the latency of *specific* threads or tasks, at the expense of other threads and general performance. Whether that effect is used to achieve correctness, reduce latency, or improve throughput is application-specific.
I see your point. Yes, the reason I started this lib was loop-like pipelines, and maybe I got too "settled" on that specific use case while writing that piece of documentation. The producer-consumer problem is certainly not the use case I started this library for, and who knows how many more use cases are out there.

But my point is: imagine you have a loop-like pipeline:

while (true){
    update();
    render();
}

This could be a video game or any other graphical application. update() is performed by X threads, each one having a specific task. In order to gain maximum FPS, you want those threads to be able to finish as fast as possible. This is the use case I had in mind when I wrote that.

But I see your point now. I will remove that statement and instead provide examples of different use cases.

Thanks for the feedback.

Federico Abrignani.
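P.S. A minimal sketch of that frame-update scenario (the task split and the priority values are invented; in a real application they would come from profiling, as discussed above):

#include <thread>
#include <array>
#include <boost/fairness.hpp>

// Four update workers share one lock over the scene state. The priorities
// order the critical sections so the frame can finish as early as possible.
static boost::fairness::priority_mutex<4> scene_mutex;

static void update_task(int priority){ // 0 = highest
    // ... per-thread work before touching the scene ...
    scene_mutex.lock(priority);
    // ... mutate the shared scene state ...
    scene_mutex.unlock();
    // ... per-thread work after the critical section ...
}

static void update(){
    std::array<int, 4> prios{0, 2, 1, 3}; // invented, profile-driven ordering
    std::array<std::thread, 4> workers;
    for (int i = 0; i < 4; ++i)
        workers[i] = std::thread(update_task, prios[i]);
    for (auto& w : workers)
        w.join();
}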
participants (3)
- Andrey Semashev
- Federico Abrignani
- Federico Abrignani