Oliver, Do you have some performance data for your fiber implementation? What is the (amortized) overhead introduced for one fiber (i.e. the average time required to create, schedule, execute, and delete one fiber which runs an empty function, when executing a large number of those, perhaps 500,000 fibers)? It would be interesting to see this number when giving 1..N cores to the scheduler. Thanks! Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
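A minimal sketch of the measurement described here - create, schedule, execute, and delete 500,000 empty fibers on one scheduler thread, amortized per fiber. The header name and the std::thread-like fiber API (boost::fibers::fiber) are assumptions about the proposed library, and batching keeps only 10,000 fiber stacks alive at a time:

#include <boost/fiber/all.hpp>   // assumed convenience header
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    const std::size_t TOTAL = 500000, BATCH = 10000;
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t done = 0; done < TOTAL; done += BATCH) {
        std::vector<boost::fibers::fiber> fibers;
        fibers.reserve(BATCH);
        for (std::size_t i = 0; i < BATCH; ++i)
            fibers.emplace_back([]{});   // fiber running an empty function
        for (auto& f : fibers)
            f.join();                    // drains the scheduler
    }
    auto dt = std::chrono::steady_clock::now() - t0;
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(dt).count() / TOTAL
              << " ns per fiber (create+schedule+run+join)\n";
    return 0;
}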
2014/1/11 Hartmut Kaiser
Oliver,
Do you have some performance data for your fiber implementation? What is the (amortized) overhead introduced for one fiber (i.e. the average time required to create, schedule, execute, and delete one fiber which runs an empty function, when executing a large number of those, perhaps 500,000 fibers)? It would be interesting to see this number when giving 1..N cores to the scheduler.
Unfortunately I have no performance tests yet - maybe I'll write one after some optimizations (like replacing the STL containers by a singly-linked list of intrusive_ptrs). I'm not sure what a fiber should execute within such a test. Should the fiber-function have an empty body (e.g. execute nothing)? Or should it at least yield one time? If the code executed by the fiber does nothing, then the execution time will be determined by the memory-allocation algorithm of the C library, the context switches for resuming and suspending the fiber, and the time required to insert and remove the fiber from the ready-queue inside the fiber-scheduler. This queue is currently an STL container and will be replaced by a singly-linked list of intrusive_ptrs. A context switch (suspending/resuming a coroutine) needs ca. 80 CPU cycles on an Intel Core2 Q6700 (64bit Linux).
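A figure like the ~80 cycles quoted above can be reproduced in spirit with the portable POSIX ucontext API; this is only a sketch of the measurement technique (ping-pong between two contexts, divide elapsed TSC ticks by the number of switches). Note that swapcontext() also saves/restores the signal mask via a syscall, so it will report far more than the ~80 cycles of an optimized assembler switch:

#include <ucontext.h>
#include <x86intrin.h>   // __rdtsc (GCC/Clang, x86)
#include <cstdio>

static ucontext_t main_ctx, fiber_ctx;
static const unsigned long SWITCHES = 1000000;

static void fiber_fn() {
    for (;;)
        swapcontext(&fiber_ctx, &main_ctx);      // yield straight back
}

int main() {
    static alignas(16) char stack[64 * 1024];
    getcontext(&fiber_ctx);
    fiber_ctx.uc_stack.ss_sp = stack;
    fiber_ctx.uc_stack.ss_size = sizeof(stack);
    fiber_ctx.uc_link = &main_ctx;
    makecontext(&fiber_ctx, fiber_fn, 0);

    unsigned long long t0 = __rdtsc();
    for (unsigned long i = 0; i < SWITCHES; ++i)
        swapcontext(&main_ctx, &fiber_ctx);      // one round trip = 2 switches
    unsigned long long t1 = __rdtsc();

    std::printf("cycles per switch: %.1f\n", (t1 - t0) / (2.0 * SWITCHES));
    return 0;
}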
On 10:31 Sat 11 Jan, Oliver Kowalke wrote:
Unfortunately I have no performance tests yet - maybe I'll write one after some optimizations (like replacing the STL containers by a singly-linked list of intrusive_ptrs).
My suggestion: write the performance tests first. There's no better tool to drive code optimization. Best -Andreas
Oliver,
Do you have some performance data for your fiber implementation? What is the (amortized) overhead introduced for one fiber (i.e. the average time required to create, schedule, execute, and delete one fiber which runs an empty function, when executing a large number of those, perhaps 500,000 fibers)? It would be interesting to see this number when giving 1..N cores to the scheduler.
Unfortunately I have no performance tests yet - maybe I'll write one after some optimizations (like replacing the STL containers by a singly-linked list of intrusive_ptrs).
I'd write the test before starting to do optimizations.
I'm not sure what a fiber should execute within such a test. Should the fiber-function have an empty body (e.g. execute nothing)? Or should it at least yield one time?
Well, those are two separate performance tests already :-P However, having it yield just adds two more context switches and a scheduling cycle, so I wouldn't expect too much additional insight from that. While you're at it, I'd suggest also writing a test measuring the overhead of using futures. For an idea of how such tests could look, you might want to glance here: https://github.com/STEllAR-GROUP/hpx/tree/master/tests/performance.
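Such a futures-overhead test could be sketched with plain std:: primitives; substituting the fiber library's promise/future types for std::promise/std::future would then measure that library's overhead instead:

#include <chrono>
#include <future>
#include <iostream>

int main() {
    const unsigned N = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < N; ++i) {
        std::promise<int> p;
        std::future<int> f = p.get_future();
        p.set_value(42);   // make the future ready
        (void)f.get();     // and consume it
    }
    auto dt = std::chrono::steady_clock::now() - t0;
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(dt).count() / N
              << " ns per promise/future round trip\n";
    return 0;
}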
If the code executed by the fiber does nothing, then the execution time will be determined by the memory-allocation algorithm of the C library, the context switches for resuming and suspending the fiber, and the time required to insert and remove the fiber from the ready-queue inside the fiber-scheduler.
Those are assumptions which are by no means conclusive. From our experience with HPX (https://github.com/STEllAR-GROUP/hpx), the overheads for a fiber (which is an hpx::thread in our case) are determined by many more factors than just the memory allocator. Things like contention caused by work stealing, or NUMA effects such as when you start stealing across NUMA domains, usually overshadow the memory allocation costs. Additionally, the quality of the scheduler implementation affects things greatly.
This queue is currently an STL container and will be replaced by a singly-linked list of intrusive_ptrs.
If you had a performance test you'd immediately see whether this improves your performance. Doing optimizations based on gut feeling is most of the time not very effective; you need measurements to support your work.
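For concreteness, the queue replacement described above could look like the following sketch (the names are illustrative, not the library's actual internals): each fiber carries its own link pointer, so pushing to and popping from the ready-queue never touches the allocator:

#include <boost/intrusive_ptr.hpp>
#include <cstddef>

struct fiber_base {
    std::size_t use_count = 0;   // single-threaded scheduler assumed: no atomic needed
    fiber_base* nxt = nullptr;   // embedded link - no per-node allocation
};
inline void intrusive_ptr_add_ref(fiber_base* p) { ++p->use_count; }
inline void intrusive_ptr_release(fiber_base* p) { if (--p->use_count == 0) delete p; }

class ready_queue {
    fiber_base* head_ = nullptr;
    fiber_base* tail_ = nullptr;
public:
    void push(const boost::intrusive_ptr<fiber_base>& f) {
        fiber_base* p = f.get();
        intrusive_ptr_add_ref(p);   // the queue holds one reference
        p->nxt = nullptr;
        if (tail_) tail_->nxt = p; else head_ = p;
        tail_ = p;
    }
    boost::intrusive_ptr<fiber_base> pop() {
        fiber_base* p = head_;
        if (p) {
            head_ = p->nxt;
            if (!head_) tail_ = nullptr;
        }
        return boost::intrusive_ptr<fiber_base>(p, false);  // adopt the queue's reference
    }
};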
A context switch (suspending/resuming a coroutine) needs ca. 80 CPU cycles on an Intel Core2 Q6700 (64bit Linux).
Sure, but this does not tell you how much time is consumed by executing those. The actual execution time will be determined by many factors, such as caching effects, TLB misses, memory bandwidth limitations and other contention effects. IMHO, for this library to be accepted, it has to prove to be of high quality which implies best possible performance. You might want to compare the performance of your library with other existing solutions (for instance TBB, qthreads, openmp, HPX). The link I provided above will give you a set of trivial tests for those. Moreover, we'd be happy to add an equivalent test for your library to our repository. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
<snip>
IMHO, for this library to be accepted, it has to prove to be of high quality which implies best possible performance. You might want to compare the performance of your library with other existing solutions (for instance TBB, qthreads, openmp, HPX). The link I provided above will give you a set of trivial tests for those. Moreover, we'd be happy to add an equivalent test for your library to our repository.
Ping? Any news? I'd make my vote depend on the outcome of the performance tests. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
2014/1/13 Hartmut Kaiser
IMHO, for this library to be accepted, it has to prove to be of high quality which implies best possible performance. You might want to compare the performance of your library with other existing solutions (for instance TBB, qthreads, openmp, HPX). The link I provided above will give you a set of trivial tests for those. Moreover, we'd be happy to add an equivalent test for your library to our repository.
Ping? Any news? I'd make my vote depend on the outcome of the performance tests.
I'll add performance tests, but I doubt that I can finish implementing the tests by Wednesday. I have to look through the tests provided by the HPX project, select some, and try to understand what they are doing.
2014/1/11 Hartmut Kaiser
It would be interesting to see this number when giving 1..N cores to the scheduler.
Things like contention caused by work stealing, or NUMA effects such as when you start stealing across NUMA domains, usually overshadow the memory allocation costs. Additionally, the quality of the scheduler implementation affects things greatly.
You might want to compare the performance of your library with other existing solutions (for instance TBB, qthreads, openmp, HPX). The link I provided above will give you a set of trivial tests for those. Moreover, we'd be happy to add an equivalent test for your library to our repository.
After re-reading I have the impression that there is a misunderstanding. boost.fiber is a thin wrapper over coroutines (each fiber contains one coroutine) - the library schedules and synchronizes fibers (as requested on the developer list in 2013) in one thread. The fibers in this lib are agnostic of threads - I've only added some support so that the classes (mutex, condition_variable) can be used in a multi-threaded context. Combining fibers with threads should be done in another, more sophisticated library (at a higher level). I believe you can't and shouldn't compare fibers with qthreads, TBB or openmp. I'll write a test measuring the overhead of a fiber running in one thread (as already described above) first.
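A minimal illustration of the semantics described here - two fibers interleaving cooperatively inside a single thread. The fiber and this_fiber names assume the proposed std::thread-like API:

#include <boost/fiber/all.hpp>
#include <iostream>

int main() {
    boost::fibers::fiber a([] {
        for (int i = 0; i < 3; ++i) {
            std::cout << "a" << i << ' ';
            boost::this_fiber::yield();   // hand control to the next ready fiber
        }
    });
    boost::fibers::fiber b([] {
        for (int i = 0; i < 3; ++i) {
            std::cout << "b" << i << ' ';
            boost::this_fiber::yield();
        }
    });
    a.join();
    b.join();
    std::cout << '\n';   // with a round-robin scheduler: a0 b0 a1 b1 a2 b2
    return 0;
}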
2014/1/14 Oliver Kowalke
I believe you can't and shouldn't compare fibers with qthreads, TBB or openmp. I'll write a test measuring the overhead of a fiber running in one thread (as already described above) first.
How about comparing fiber construction and joining with thread construction and joining? This will help the users to decide whether it is beneficial to start a new thread or a fiber. A few ideas for tests:

* compare construction+join of a single thread and construction+join of a single fiber (empty functors in both cases)
* compare construction+join of multiple threads and construction+join of multiple fibers (empty functors in both cases)
* compare construction of a thread and construction of a fiber (empty functors in both cases)

Pseudocode:

void foo(){}
const unsigned N = 1000;

// Test #1
timer start
for (unsigned i = 0; i < N; ++i) {
    fiber f(&foo);
    f.join();
}
cout << "Fibers: " << timer stop;

timer start
for (unsigned i = 0; i < N; ++i) {
    boost::thread f(&foo);
    f.join();
}
cout << "Threads: " << timer stop;

// Test #2
timer start
for (unsigned i = 0; i < N; ++i) {
    fiber f1(&foo), f2(&foo), f3(&foo), f4(&foo), f5(&foo);
    f1.join(); f2.join(); f3.join(); f4.join(); f5.join();
}
cout << "Fibers: " << timer stop;

timer start
for (unsigned i = 0; i < N; ++i) {
    boost::thread f1(&foo), f2(&foo), f3(&foo), f4(&foo), f5(&foo);
    f1.join(); f2.join(); f3.join(); f4.join(); f5.join();
}
cout << "Threads: " << timer stop;

// Test #3
timer start
for (unsigned i = 0; i < N; ++i) {
    fiber(&foo).detach();
}
cout << "Fibers: " << timer stop;

timer start
for (unsigned i = 0; i < N; ++i) {
    boost::thread(&foo).detach();
}
cout << "Threads: " << timer stop;

-- Best regards, Antony Polukhin
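Test #1 written out as compilable code, with std::chrono standing in for "timer start/stop". The fiber header and the std::thread-like boost::fibers::fiber API are assumptions about the library under review:

#include <boost/fiber/all.hpp>   // assumed convenience header
#include <boost/thread/thread.hpp>
#include <chrono>
#include <iostream>

void foo() {}

int main() {
    const unsigned N = 1000;
    using clock = std::chrono::steady_clock;
    using us = std::chrono::microseconds;

    auto t0 = clock::now();
    for (unsigned i = 0; i < N; ++i) {
        boost::fibers::fiber f(&foo);
        f.join();
    }
    std::cout << "Fibers:  "
              << std::chrono::duration_cast<us>(clock::now() - t0).count() << " us\n";

    t0 = clock::now();
    for (unsigned i = 0; i < N; ++i) {
        boost::thread t(&foo);
        t.join();
    }
    std::cout << "Threads: "
              << std::chrono::duration_cast<us>(clock::now() - t0).count() << " us\n";
    return 0;
}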
2014/1/14 Antony Polukhin
How about comparing fiber construction and joining with thread construction and joining? This will help the users to decide whether it is beneficial to start a new thread or a fiber.
A few ideas for tests: * compare construction+join of a single thread and construction+join of a single fiber (empty functors in both cases)
== compares the construction overhead of a fiber vs. a thread
* compare construction+join of multiple threads and construction+join of multiple fibers (empty functors in both cases) * compare construction of a thread and construction of a fiber (empty functors in both cases)
I believe this is not valid, because you compare the execution time of N fibers running the test-function (concurrently but not in parallel) in *one* thread with the execution time of N threads (running in parallel), where each single thread runs the test-function once. Fibers do *not* introduce parallelism, i.e. using fibers does not gain the benefits of multi-core systems at first glance. Of course you could combine threads and fibers, but this is not the focus of boost.fiber; this should be done by another library.
2014/1/14 Oliver Kowalke
2014/1/14 Antony Polukhin
* compare construction+join of multiple threads and construction+join of multiple fibers (empty functors in both cases) * compare construction of a thread and construction of a fiber (empty functors in both cases)
I believe this is not valid, because you compare the execution time of N fibers running the test-function (concurrently but not in parallel) in *one* thread with the execution time of N threads (running in parallel), where each single thread runs the test-function once.
Not exactly. The test function is *empty*, so you'll see the influence of each *additional* fiber/thread (how the overhead changes because of already-spawned fibers/threads). In other words: threads require synchronization and OS context switches. As the number of threads grows, those overheads may grow. Fibers must be free from such effects; however, they can be less CPU-cache friendly (in theory). -- Best regards, Antony Polukhin
<snip>
* compare construction+join of multiple threads and construction+join of multiple fibers (empty functors in both cases) * compare construction of a thread and construction of a fiber (empty functors in both cases)
I believe this is not valid, because you compare the execution time of N fibers running the test-function (concurrently but not in parallel) in *one* thread with the execution time of N threads (running in parallel), where each single thread runs the test-function once.
Fibers do *not* introduce parallelism, i.e. using fibers does not gain the benefits of multi-core systems at first glance.
Of course you could combine threads and fibers, but this is not the focus of boost.fiber; this should be done by another library.
If you constrain executing the std::threads to one core you'd get comparable results. OTOH, if you allow the fibers to run concurrently on more than one core you'd get comparable results again. I fail to understand why this shouldn't be viable. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
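One way to do the constrained-thread variant described here: pin the whole process to a single core before running the thread benchmark. A Linux-specific sketch (Windows would use SetProcessAffinityMask instead):

#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                    // core 0 only
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  // pid 0 = this process
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the N-threads benchmark here: all threads now share one core,
    // making the comparison with N fibers on one thread more meaningful
    return 0;
}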
2014/1/14 Antony Polukhin
Pseudocode: <snip>
I did a quick hack, and the code using fibers is 2-3 times faster than the threads. boost.fiber does not yet contain the suggested optimizations (like replacing the STL containers).
<snip>
I did a quick hack, and the code using fibers is 2-3 times faster than the threads. boost.fiber does not yet contain the suggested optimizations (like replacing the STL containers).
I'd be disappointed if the overheads imposed by Boost.Fiber are only 2-3 times smaller than for kernel threads. I'd expect it to impose at least 10 times, if not 15-20 times, less overhead than kernel threads (at least those are the numbers we're seeing from HPX). Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
2014/1/14 Hartmut Kaiser
I'd be disappointed if the overheads imposed by Boost.Fiber are only 2-3 times smaller than for kernel threads. I'd expect it to impose at least 10 times, if not 15-20 times, less overhead than kernel threads (at least those are the numbers we're seeing from HPX).
Yes - I'm disappointed too. But, as I already explained, I was focused on boost.asio's async-result and on providing a way to synchronize coroutines with an interface similar to std::thread. Tuning seems to need more attention from my side - first I have to identify the bottlenecks.
On 14 Jan 2014 at 12:42, Hartmut Kaiser wrote:
I did a quick hack, and the code using fibers is 2-3 times faster than the threads. boost.fiber does not yet contain the suggested optimizations (like replacing the STL containers).
I'd be disappointed if the overheads imposed by Boost.Fiber are only 2-3 times smaller than for kernel threads. I'd expect it to impose at least 10 times, if not 15-20 times, less overhead than kernel threads (at least those are the numbers we're seeing from HPX).
Like any C++, Boost.Fiber probably makes many malloc calls per context switch. It adds up. If I ever had a willing employer, I could get clang to spit out far more malloc optimal C++ at the cost of a new ABI, but I never could get an employer to bite. I think coming within 50% of the performance of Windows Fibers would be more than plenty. After all, Boost.Fiber "does more" than Windows Fibers. Niall -- Currently unemployed and looking for work. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
<snip>
Like any C++, Boost.Fiber probably makes many malloc calls per context switch. It adds up.
I don't think that things like a context switch require any memory allocation. All you do is to flush the registers, flip the stack pointer, and load the registers from the new stack.
If I ever had a willing employer, I could get clang to spit out far more malloc optimal C++ at the cost of a new ABI, but I never could get an employer to bite.
Sorry for sidestepping, but are you sure compilers do memory allocation as part of their way to conform to ABIs? I was always assuming memory allocation is done only when explicitly requested by user code.
I think coming within 50% of the performance of Windows Fibers would be more than plenty. After all Boost.Fiber "does more" than Windows Fibers.
It might be sufficient for you but not for everybody else. It wouldn't be sufficient for us, for instance. If you build systems relying on fine grain parallelism, then efficiently implemented fibers are the only way to go. If you need to create billions of threads (fibers), then every microsecond of overhead counts billion-fold. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
2014/1/15 Hartmut Kaiser
Like any C++, Boost.Fiber probably makes many malloc calls per context switch. It adds up.
I don't think that things like a context switch require any memory allocation. All you do is to flush the registers, flip the stack pointer, and load the registers from the new stack.
The context switch itself does not require memory allocation, but the function/functor to be executed in the fiber must be stored (type-erased) inside the fiber. The current implementation of fiber does internally allocate an object holding the fiber-function. I think it is possible to store the function/functor on top of the stack used by the fiber and thus avoid the memory allocation needed to hold the function/functor.
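A sketch of that idea - placement-new the (decayed) fiber-function just below the top of the fiber's stack instead of heap-allocating it. The helper name and layout are illustrative only, not Boost.Fiber internals, and it assumes the functor's alignment requirement is at most 16 bytes:

#include <cstddef>
#include <cstdint>
#include <new>
#include <type_traits>
#include <utility>

// stack_top points one past the highest usable byte; returns the functor's
// address and shrinks 'size' so the context switch starts below the functor.
template <typename Fn>
typename std::decay<Fn>::type* place_on_stack(void* stack_top, std::size_t& size, Fn&& fn) {
    typedef typename std::decay<Fn>::type fn_t;
    char* p = static_cast<char*>(stack_top) - sizeof(fn_t);
    p = reinterpret_cast<char*>(
            reinterpret_cast<std::uintptr_t>(p) & ~static_cast<std::uintptr_t>(15));
    ::new (static_cast<void*>(p)) fn_t(std::forward<Fn>(fn));  // no heap allocation
    size -= static_cast<std::size_t>(static_cast<char*>(stack_top) - p);
    return reinterpret_cast<fn_t*>(p);  // the fiber's trampoline invokes it from here
}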
On 15 Jan 2014 at 7:18, Hartmut Kaiser wrote:
Like any C++, Boost.Fiber probably makes many malloc calls per context switch. It adds up.
I don't think that things like a context switch require any memory allocation. All you do is to flush the registers, flip the stack pointer, and load the registers from the new stack.
A fiber implementation would also need to maintain work and sleep queues. They're all STL containers at present.
I think coming within 50% of the performance of Windows Fibers would be more than plenty. After all Boost.Fiber "does more" than Windows Fibers.
It might be sufficient for you but not for everybody else. It wouldn't be sufficient for us, for instance. If you build systems relying on fine grain parallelism, then efficiently implemented fibers are the only way to go. If you need to create billions of threads (fibers), then every microsecond of overhead counts billion-fold.
Firstly, I think you underestimate how quick Windows Fibers are - they have been highly tuned to win SQL Server benchmarks. Secondly, Boost.Fiber does a ton more work than Windows Fibers, so no one can reasonably expect it to be as quick.
If I ever had a willing employer, I could get clang to spit out far more malloc optimal C++ at the cost of a new ABI, but I never could get an employer to bite.
Sorry for sidestepping, but are you sure compilers do memory allocation as part of their way to conform to ABIs? I was always assuming memory allocation is done only when explicitly requested by user code.
This is very off-topic for this mailing list. However, one of the projects I proposed at BlackBerry before I was removed was to solve the substantial Qt allocation overhead caused by PIMPL, by getting clang to replace much use of individual operator new's for temporary objects with a single alloca() at the base of the call stack. This broke the ABI because you need to generate an additional copy of every constructor, one which uses the new, purely stack-based allocation mechanism for temporary dynamic memory allocations (also, we'd need to spit out additional metadata to help the link and LTCG layer assemble the right code). Anyway, the idea was deemed too weird to see any business case, and then of course I was eliminated shortly thereafter anyway. I should mention that this idea was one of mine long before joining BlackBerry, and therefore nothing proprietary is being leaked. Niall -- Currently unemployed and looking for work. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
Like any C++, Boost.Fiber probably makes many malloc calls per context switch. It adds up.
I don't think that things like a context switch require any memory allocation. All you do is to flush the registers, flip the stack pointer, and load the registers from the new stack.
A fiber implementation would also need to maintain work and sleep queues. They're all STL containers at present.
I can see that. You explicitly referred to the context switch, thus my request for clarification.
I think coming within 50% of the performance of Windows Fibers would be more than plenty. After all Boost.Fiber "does more" than Windows Fibers.
It might be sufficient for you but not for everybody else. It wouldn't be sufficient for us, for instance. If you build systems relying on fine grain parallelism, then efficiently implemented fibers are the only way to go. If you need to create billions of threads (fibers), then every microsecond of overhead counts billion-fold.
Firstly, I think you underestimate how quick Windows Fibers are - they have been highly tuned to win SQL Server benchmarks. Secondly, Boost.Fiber does a ton more work than Windows Fibers, so no one can reasonably expect it to be as quick.
Whatever the speed of Boost.Fiber, all I would like to see is a measure of its imposed overheads, which would allow everybody to decide whether the implementation performs sufficiently well for a particular use case. That's what I was asking for in the very beginning. At the same time, our own implementation in HPX (on the Windows platform) is using Windows Fibers for our lightweight thread implementation, so I perfectly understand what their imposed overheads are. I also understand that Boost.Fiber does more than the Windows Fibers, which are used just for the underlying context switch operation. Still, my main incentive for voting YES in this review and for considering using this library as a replacement for HPX's thread implementation would be if it had superior performance. This is even more true as I know (and have evidence) that it is possible to come close to the Windows Fibers performance for lightweight threads exposing the same API as std::thread does (see HPX). IMHO, Boost.Fiber is a library which - unlike other Boost libraries - has not been developed as a prototype for a particular API (in which case I'd be all for accepting subpar performance). It clearly has been developed to provide a higher performing implementation for an existing API. That means that if Oliver is not able to demonstrate superior performance over existing implementations, I wouldn't see any point in having the library in Boost in the first place.
<snip>
Thanks for this explanation. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
2014/1/15 Hartmut Kaiser
IMHO, Boost.Fiber is a library which - unlike other Boost libraries - has not been developed as a prototype for a particular API (in which case I'd be all for accepting subpar performance). It clearly has been developed to provide a higher performing implementation for an existing API. That means that if Oliver is not able to demonstrate superior performance over existing implementations, I wouldn't see any point in having the library in Boost in the first place.
As I explained several times in this review - boost.fiber aims to provide a way to synchronize/coordinate coroutines, as was requested on the dev-list some months ago. -> boost.asio
<snip>
As I explained several times in this review - boost.fiber aims to provide a way to synchronize/coordinate coroutines, as was requested on the dev-list some months ago. -> boost.asio
In that case you might be surprised to learn that libraries often have a life of their own which opens up unexpected opportunities way beyond whatever you might have imagined. IMHO it is a mistake to constrain Boost.Fiber to just what you said, as this is only a minor use case (as convenient as it might be) for such a library. Threading with minimal overheads supporting fine-grain parallelism is the future. Building convenient means for managing that parallelism is the future. Application scalability, which today might be a problem just for high-end computing, gains footprint in everyday computing at an exceptionally high rate. Two years from now, my desktop will support 288 concurrent threads (Intel Knights Landing [1]). Massive multi-threading is here to stay. At the same time, application scalability is limited by the 4 horsemen of the apocalypse [2]: Starvation, Latencies, Overheads, and Waiting for contention resolution (SLOW). IOW, minimizing overheads is one of the critical pieces of the puzzle. Libraries such as Boost.Fiber are critical to solving the problems of insufficient scalability and parallel efficiency achievable with existing technologies alone. Wake up Oliver - you're up to the forefront of parallel computing and you don't realize it! Boost has to define the future of C++ libraries (as imposed on us by the computer architectures to come), not focus on covering the past. Let's not let this opportunity slip. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu [1] http://www.extremetech.com/extreme/171678-intel-unveils-72-core-x86-knights-landing-cpu-for-exascale-supercomputing [2] http://stellar.cct.lsu.edu/2012/01/is-the-free-lunch-over-really/
On Wed, Jan 15, 2014 at 4:31 PM, Hartmut Kaiser
Wake up Oliver - you're up to the forefront of parallel computing and you don't realize it!
Boost has to define the future of C++ libraries (as imposed on us by the computer architectures to come), not focus on covering the past. Let's not let this opportunity slip.
I can absolutely agree with this. My divergence is with your previous remark that Boost.Fiber can *only* be justified by its performance. Boost.Fiber has *potential future* performance benefits. It has *present* semantic benefits.
2014/1/15 Hartmut Kaiser
Wake up Oliver - you're up to the forefront of parallel computing and you don't realize it!
Why do you allege that? For me, semantics and usability of boost.fiber are more important than performance. Performance tuning is mostly an issue of implementation details and can be done after the API is stable and proven. But this might not apply to all developers - you are free to choose the tool of your preference.
On Thu, Jan 16, 2014 at 6:26 PM, Oliver Kowalke
2014/1/15 Hartmut Kaiser
Wake up Oliver - you're up to the forefront of parallel computing and you don't realize it!
Why do you allege that?
For me, semantics and usability of boost.fiber are more important than performance. Performance tuning is mostly an issue of implementation details and can be done after the API is stable and proven. But this might not apply to all developers - you are free to choose the tool of your preference.
I realize I'm jumping in on a conversation that I'm not involved in but as mostly a user of Boost libraries, I for one would encourage you to think about the users and their needs more than what you the developer think is important. This is one critical piece of feedback you're getting, and I seriously hope you consider prioritizing this in your continued development of Boost.Fiber.
2014/1/16 Dean Michael Berris
I realize I'm jumping in on a conversation that I'm not involved in but as mostly a user of Boost libraries, I for one would encourage you to think about the users and their needs more than what you the developer think is important. This is one critical piece of feedback you're getting, and I seriously hope you consider prioritizing this in your continued development of Boost.Fiber.
Did I say that I'll ignore Hartmut's concerns? For Hartmut only speed matters, and I've agreed that I'll address this issue.
I realize I'm jumping in on a conversation that I'm not involved in but as mostly a user of Boost libraries, I for one would encourage you to think about the users and their needs more than what you the developer think is important. This is one critical piece of feedback you're getting, and I seriously hope you consider prioritizing this in your continued development of Boost.Fiber.
Thanks Michael!
Did I say that I'll ignore Hartmut's concerns? For Hartmut only speed matters, and I've agreed that I'll address this issue.
I've never said that it's only speed that matters. I said that, because the API is set, performance is the only criterion which could be used to decide whether your library is a worthy addition to Boost (besides implementation quality, which is sub-standard as others have pointed out). Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
2014/1/16 Hartmut Kaiser
I've never said that it's only speed that matters.
But I got this impression from your last postings - sorry if I'm wrong.
I said that, because the API is set, performance is the only criterion which could be used to decide whether your library is a worthy addition to Boost
This is your opinion and I disagree - it is not the only one.
(besides implementation quality, which is sub-standard as others have pointed out).
Not very kind of you. Copy-and-paste errors happen.
On 01/16/2014 02:52 PM, Oliver Kowalke wrote:
2014/1/16 Hartmut Kaiser
I've never said that it's only speed that matters.
But I got this impression from your last postings - sorry if I'm wrong.
I said that, because the API is set, performance is the only criterion which could be used to decide whether your library is a worthy addition to Boost
This is your opinion and I disagree - it is not the only one.
It should be one of the major criteria in the decision whether the library gets accepted, for the arguments brought up by Hartmut.
(besides implementation quality, which is sub-standard as others have pointed out).
Not very kind of you. Copy-and-paste errors happen.
It might not be very kind, but it reflects the current state of the library. In addition, the library is not usable on all of the advertised platforms. The PPC64 implementation of Boost.Context is not tested and does not work (sure, it's not the fault of Fiber per se), for example.
2014/1/16 Thomas Heller
It might not be very kind, but it reflects the current state of the library. In addition, the library is not usable on all of the advertised platforms. The PPC64 implementation of Boost.Context is not tested and does not work (sure, it's not the fault of Fiber per se), for example.
boost.context is irrelevant in this discussion. Do you think I have a machine for each architecture at home? I can only write the code if requested by some users, and I have to rely on the willingness of community members to test the code on the specific hardware. You asked me about boost.context support for PPC64 and I told you that the code is untested from my side and that boost regression tests do not exist for PPC64. But you did not respond to my email. As I did with other users, I hoped that we could fix the problem together, but I didn't get any feedback from you - don't blame me.
Hi, On 15:19 Thu 16 Jan, Oliver Kowalke wrote:
Do you think I have a machine for each architecture at home? I can only write the code if requested by some users, and I have to rely on the willingness of community members to test the code on the specific hardware.
You asked me about boost.context support for PPC64 and I told you that the code is untested from my side and that boost regression tests do not exist for PPC64.
I can get you ssh access to a PPC64 node if you're interested. Just send me a private mail. HTH -Andreas
On 01/16/2014 03:19 PM, Oliver Kowalke wrote:
2014/1/16 Thomas Heller
It might not be very kind, but it reflects the current state of the library. In addition, the library is not usable on all of the advertised platforms. The PPC64 implementation of Boost.Context is not tested and does not work (sure, it's not the fault of Fiber per se), for example.
boost.context is irrelevant in this discussion.
I don't think so. Boost.Fiber builds on Boost.Context, and without a working context implementation, the fiber library is useless.
Do you think I have a machine for each architecture at home? I can only write the code if requested by some users, and I have to rely on the willingness of community members to test the code on the specific hardware.
Absolutely. But context is shipped with code for PPC64, so one would assume it works.
You asked me about boost.context support for PPC64 and I told you that the code is untested from my side and that boost regression tests do not exist for PPC64. But you did not respond to my email. As I did with other users, I hoped that we could fix the problem together, but I didn't get any feedback from you - don't blame me.
Yes, I got side-tracked. I am not blaming you for not delivering a PPC64 Boost.Context implementation. When I get back to the project where I need the context switch for PPC64, I will certainly get back to you. I was just trying to point to a case where Fiber is not working (with no indication in the docs or elsewhere). Sorry if you got the wrong impression.
2014/1/16 Thomas Heller
I don't think so. Boost.Fiber builds on Boost.Context, and without a working context implementation, the fiber library is useless.
Because I don't get feedback about the implementation for an architecture, and no regression tests exist, should I throw away the work? Of course I leave it in the library and wait until someone tests it and reports a bug; otherwise it would be impossible to get any feedback.
Do you think I have a machine for each architecture at home? I can only
write the code if requested by some users, and I have to rely on the willingness of community members to test the code on the specific hardware.
Absolutely. But context is shipped with code for PPC64, so one would assume it works.
...
Yes, I got side-tracked. I am not blaming you for not delivering a PPC64 Boost.Context implementation. When I get back to the project where I need the context switch for PPC64, I will certainly get back to you. I was just trying to point to a case where Fiber is not working (with no indication in the docs or elsewhere). Sorry if you got the wrong impression.
After reading the postings from you and other members of your group, I get the impression you really try to diss me. Please keep in mind that I do this work only in my spare time (besides, I have a family too); I don't have the time you have in your daily work at the university. And then you and your fellows tell me that I'm too stupid - besides, some of the criticised issues are questionable - copy-and-paste errors happen.
On 01/16/2014 03:49 PM, Oliver Kowalke wrote:
2014/1/16 Thomas Heller
I don't think so. Boost.Fiber builds on Boost.Context, and without a working context implementation, the fiber library is useless.
Because I don't get feedback about the implementation for an architecture, and no regression tests exist, should I throw away the work? Of course I leave it in the library and wait until someone tests it and reports a bug; otherwise it would be impossible to get any feedback.
No objection to keeping it in the develop branch. But I think it is bad practice to release code which is not tested.
Do you think I have a machine for each architecture at home? I can only
write the code if requested by some users, and I have to rely on the willingness of community members to test the code on the specific hardware.
Absolutely. But context is shipped with code for PPC64, so one would assume it works.
...
Yes, I got side-tracked. I am not blaming you for not delivering a PPC64 Boost.Context implementation. When I get back to the project where I need the context switch for PPC64, I will certainly get back to you. I was just trying to point to a case where Fiber is not working (with no indication in the docs or elsewhere). Sorry if you got the wrong impression.
After reading the postings from you and other members of your group, I get the impression you really try to diss me.
I am not trying to diss you. My apologies if anything I said offended you personally.
Please keep in mind that I do this work only in my spare time (besides, I have a family too); I don't have the time you have in your daily work at the university. And then you and your fellows tell me that I'm too stupid - besides, some of the criticised issues are questionable - copy-and-paste errors happen.
Sure they happen. They happen to everyone. Again, no one said you're stupid; we are just giving feedback about your work. As a side effect of our daily work we happen to have gained some experience with the kind of library you propose, and in addition we think that performance should be an important and critical feature of your library. That's all. No offense intended.
On 16 January 2014 07:29, Dean Michael Berris
I realize I'm jumping in on a conversation that I'm not involved in but as mostly a user of Boost libraries, I for one would encourage you to think about the users and their needs more than what you the developer think is important. This is one critical piece of feedback you're getting, and I seriously hope you consider prioritizing this in your continued development of Boost.Fiber.
It isn't really user feedback, but feedback from a developer of similar functionality, which is something very different. Pedantically, the developer could also be a user of the library, but their main point of view is as a developer of such functionality, and their opinions are influenced by that. If they've put a lot of effort into something, then it's likely that they will overvalue it. Feedback from other developers is of course extremely useful, but the difference should be appreciated.
I realize I'm jumping in on a conversation that I'm not involved in but as mostly a user of Boost libraries, I for one would encourage you to think about the users and their needs more than what you the developer think is important. This is one critical piece of feedback you're getting, and I seriously hope you consider prioritizing this in your continued development of Boost.Fiber.
It isn't really user feedback, but feedback from a developer of similar functionality, which is something very different.
Yes, it guarantees that the viewpoint expressed by that developer can be assumed to be well educated as that developer understands the issues perfectly, probably much better than any user could.
Pedantically, the developer could also be a user of the library, but their main point of view is as a developer of such functionality, and their opinions are influenced by that. If they've put a lot of effort into something, then it's likely that they will overvalue it. Feedback from other developers is of course extremely useful, but the difference should be appreciated.
Why is it that there is again this 'selfishness' being silently alleged? Do you imply that, just because I claim to have a better understanding of the applicability of a particular idiom/library than many others (because I've been working in the field for years), my opinion is too biased to be considered useful? However, I have to admit that I very much would like to replace some of our code with an external library to lessen our maintenance burden. But alas, as it seems, it will not happen this time. But I'm tired of this pointless discussion. I'm outa here. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
On 16 January 2014 13:42, Hartmut Kaiser
Pedantically, the developer could also be a user of the library, but their main point of view is as a developer of such functionality, and their opinions are influenced by that. If they've put a lot of effort into something, then it's likely that they will overvalue it. Feedback from other developers is of course extremely useful, but the difference should be appreciated.
Why is it that there is again this 'selfishness' being silently alleged? Do you imply that, just because I claim to have a better understanding of the applicability of a particular idiom/library than many others (because I've been working in the field for years), my opinion is too biased to be considered useful?
I said, "Feedback from other developers is of course extremely useful". I have no idea how you managed to interpret that as saying that your opinion isn't useful.
On 13:27 Thu 16 Jan, Daniel James wrote:
It isn't really user feedback, but feedback from a developer of similar functionality, which is something very different.
Yes and no. Usually someone who implemented similar code did just that because he wanted to use it for himself. At least in an Open Source environment. Thus the developer becomes the first user.
Pedantically, the developer could also be a user of the library, but their main point of view is as a developer of such functionality, and their opinions are influenced by that. If they've put a lot of effort into something, then it's likely that they will overvalue it. Feedback from other developers is of course extremely useful, but the difference should be appreciated.
Let me try to rephrase that: said developer's point of view might be biased, thus his arguments carry less weight. Is that what you're saying? I'd then add to the discussion that his experience also makes him a domain expert, which reinforces his authority. This road is called "ad hominem" and doesn't lead anywhere. Let's get back to the facts. I, as a potential user of user-level threads a.k.a. fibers, would only use them if they allowed me to do something std::thread can't do for me: many, many fine-grained threads, which relieve me of the burden of having to adapt the decomposition of my compute problem. And this again boils down to performance: if it's not going to be much faster, why shouldn't I hand the problem over to the OS? Being a library developer myself, I can assure you that performance is not something you can easily bolt on afterwards. Rather, it has to be built in from the beginning. Otherwise you'll end up reimplementing class after class. Just my 0.02€. Cheers -Andreas
On 16 January 2014 13:50, Andreas Schäfer
On 13:27 Thu 16 Jan , Daniel James wrote:
Pedantically, the developer could also be a user of the library, but their main point of view is as a developer of such functionality, and their opinions are influenced by that. If they've put a lot of effort into something, then it's likely that they will overvalue it. Feedback from other developers is of course extremely useful, but the difference should be appreciated.
Let me try to rephrase that: said developer's point of view might be biased, thus his arguments carry less weight. Is that what you're saying?
No, of course it isn't.
I'd then add to the discussion that his experience also makes him a domain expert, which reinforces his authority. This road is called "ad hominem" and doesn't lead anywhere.
Your response is called a "straw man argument", or an "Aunt Sally".
Wake up Oliver - you're up to the forefront of parallel computing and you don't realize it!
Why do you allege that?
I was hoping to raise your awareness that you're onto something much bigger than 'just' fibers. I didn't mean to offend and I apologize if I did.
For me, semantics and usability of boost.fiber are more important than performance. Performance tuning is mostly an issue of implementation details and can be done after the API is stable and proven.
I still don't get it. There is no API stability question. The API is well defined for over 2 years now in the C++11 Standard (and even longer in Boost.Thread). So performance is the main incentive for such a library (what else could there be?). If you don't need the extra performance - use std::thread. Boost.Fiber does not add any new semantics beyond what the Standard mandates. Instead, it adds more constraints to the context where the API can be used (somebody mentioned interaction with Asio, and single-threaded legacy applications) - thus it narrows down existing semantics.
But this might not apply to all developers - you are free to choose the tool of your preference.
Sure, that's beyond question. My concern is that we're about to add a Boost library targeting some minor use cases only, while it has the potential to change the way we do parallel computing. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
2014/1/16 Hartmut Kaiser
I still don't get it. There is no API stability question. The API is well defined for over 2 years now in the C++11 Standard (and even longer in Boost.Thread).
I could have chosen a different API for fibers - but I think developers are more familiar with the std::thread/boost::thread API.
So performance is the main incentive for such a library (what else could there be?).
With fibers you can suspend your execution context while keeping the thread running (it might execute something else). This is not possible with threads: if they are suspended (yield(), waiting on a mutex/condition_variable), the thread itself blocks. This feature of fibers enables you to write code like the following (again asio, even if you don't care about this use case):

for (;;) {
    ...
    boost::asio::async_read( socket_, buffer, yield[ec]);
    ...
}

async_read() suspends the current execution context (not the thread itself) and resumes it when all data have been read. Without fibers you can't write the code like above (the for-loop, for instance). Inside the one thread you can have more than one fiber running such for-loops. With threads you would have to pass a callback to async_read(), and you could not invoke it inside a for-loop. The example directory of boost.fiber contains several asio examples demonstrating this feature.
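For contrast, a sketch of the callback formulation of the same loop in plain asio (no fibers): the for-loop disappears and "looping" becomes a handler that re-arms itself. The session type here is an assumed holder for the socket and buffer, not part of either library:

#include <boost/asio.hpp>
#include <cstddef>
#include <memory>

struct session {
    boost::asio::ip::tcp::socket socket_;
    boost::asio::streambuf buffer_;
    explicit session(boost::asio::io_service& ios) : socket_(ios) {}
};

void do_read(std::shared_ptr<session> self) {
    boost::asio::async_read(self->socket_, self->buffer_,
        boost::asio::transfer_at_least(1),
        [self](const boost::system::error_code& ec, std::size_t /*n*/) {
            if (!ec) {
                // ... process self->buffer_ ...
                do_read(self);   // the "loop": re-arm the handler
            }
        });
}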
If you don't need the extra performance - use std::thread.
I could have chosen a different API.
Boost.Fiber does not add any new semantics beyond what the Standard mandates.
It adds 'suspend/resume' of an execution context while the hosting thread is not suspended.
Instead, it adds more constraints to the context where the API can be used (somebody mentioned interaction with Asio, and single-threaded legacy applications) - thus it narrows down existing semantics.
I think this statement is false.
Sure, that's beyond question. My concern is that we're about to add a Boost library targeting some minor use cases only, while it has the potential to change the way we do parallel computing.
Be sure that I have performance in my sights after this discussion! I've already started to write code for performance measurements.
[sorry for joining the discussion so late]
On Thu, Jan 16, 2014 at 12:35 PM, Oliver Kowalke
2014/1/16 Hartmut Kaiser
I still don't get it. There is no API stability question. The API is well defined for over 2 years now in the C++11 Standard (and even longer in Boost.Thread).
I could have chosen a different API for fibers - but I think developers are more familiar with the std::thread/boost::thread API.
So performance is the main incentive for such a library (what else could there be?).
With fibers you can suspend your execution context while keeping the thread running (it might execute something else). This is not possible with threads: if they are suspended (yield(), waiting on a mutex/condition_variable), the thread itself blocks.
This feature of fibers enables you to write code like the following (again asio, even if you don't care about this use case):
for (;;) {
    ...
    boost::asio::async_read( socket_, buffer, yield[ec]);
    ...
}
async_read() suspends the current execution context (not the thread itself) and resumes it when all data have been read. Without fibers you can't write the code like above (the for-loop, for instance). Inside the one thread you can have more than one fiber running such for-loops.
With threads you would have to pass a callback to async_read(), and you could not invoke it inside a for-loop.
I think that Hartmut's point is that you can very well use threads for the same thing. In this particular case you would just perform a synchronous read. Yes, to maintain the same level of concurrency you need to spawn tens of thousands of threads, but that's feasible on a modern OS/hardware pair. The point of using fibers (i.e. M:N threading) is almost purely performance. -- gpd
On Thu, Jan 16, 2014 at 7:51 AM, Giovanni Piero Deretta
I think that Hartmut's point is that you can very well use threads for the same thing. ... The point of using fibers (i.e. M:N threading) is almost purely performance.
Again, for a large class of use cases, fibers and threads are not the same. Writing thread-safe code remains something of an art, a specialty within the already-rarefied realm of good C++ coding. With care, code review and testing, it is of course possible to produce good thread-safe code when you are writing it from scratch. But retrofitting existing single-threaded code to be thread-safe can be extremely costly. At this moment in history, we have a very large volume of existing code whose developers (perhaps unconsciously) relied on having exclusive access to certain in-process resources. Some of us do not have the option to discard it and rewrite from scratch. Yes, this is a subset of the possible use cases of the Fiber library. It is an important subset because threads provide no equivalent. Yes, I also want a Boost library that will concurrently process very large numbers of tasks, with each of a number of threads running very many fibers. I think the Fiber library gives us a foundation on which to build that support. But even with its present feature set, with Oliver responding to the community, it has great value. I feel frustrated when people dismiss the very real benefit of cooperative context switching as irrelevant to them.
I think that Hartmut's point is that you can very well use threads for the same thing. ... The point of using fibers (i.e. M:N threading) is almost purely performance.
Again, for a large class of use cases, fibers and threads are not the same.
Writing thread-safe code remains something of an art, a specialty within the already-rarefied realm of good C++ coding. With care, code review and testing, it is of course possible to produce good thread-safe code when you are writing it from scratch.
But retrofitting existing single-threaded code to be thread-safe can be extremely costly. At this moment in history, we have a very large volume of existing code whose developers (perhaps unconsciously) relied on having exclusive access to certain in-process resources. Some of us do not have the option to discard it and rewrite from scratch.
Yes, this is a subset of the possible use cases of the Fiber library. It is an important subset because threads provide no equivalent.
If the main target of Boost.Fiber is this use case (supporting 'multi-threading' in single-threaded applications), then the way it's implemented does not make sense to me. Why would you need a single atomic if all you have is a single thread? And the source code has atomics all over the place - thus I gather this use case was not what Oliver had in mind.
Yes, I also want a Boost library that will concurrently process very large numbers of tasks, with each of a number of threads running very many fibers. I think the Fiber library gives us a foundation on which to build that support. But even with its present feature set, with Oliver responding to the community, it has great value. I feel frustrated when people dismiss the very real benefit of cooperative context switching as irrelevant to them.
Why accept a library which is over-engineered for the advertised use case (see above) and not (yet) fit for the broader one? Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
On 01/16/2014 02:27 PM, Nat Goodspeed wrote:
On Thu, Jan 16, 2014 at 7:51 AM, Giovanni Piero Deretta
wrote: I think that Hartmut's point is that you can very well use threads for the same thing. ... The point of using fibers (i.e. M:N threading) is almost purely performance.
Again, for a large class of use cases, fibers and threads are not the same.
Writing thread-safe code remains something of an art, a specialty within the already-rarefied realm of good C++ coding. With care, code review and testing, it is of course possible to produce good thread-safe code when you are writing it from scratch.
But retrofitting existing single-threaded code to be thread-safe can be extremely costly. At this moment in history, we have a very large volume of existing code whose developers (perhaps unconsciously) relied on having exclusive access to certain in-process resources. Some of us do not have the option to discard it and rewrite from scratch.
Even in the context of a Boost.Fiber-like library, you have to take extra care to secure your data structures against concurrent access. Even though it is not necessarily running any threads in parallel, a fiber can suspend while inside a critical section. BTW, from our experience with HPX, such behavior (suspending a user-level thread while a lock is held) is very dangerous and often leads to deadlocks. That being said, even when you decide to use fibers with your legacy code, the cost of making it safe is not negligible.
Yes, this is a subset of the possible use cases of the Fiber library. It is an important subset because threads provide no equivalent.
Yes, I also want a Boost library that will concurrently process very large numbers of tasks, with each of a number of threads running very many fibers. I think the Fiber library gives us a foundation on which to build that support. But even with its present feature set, with Oliver responding to the community, it has great value. I feel frustrated when people dismiss the very real benefit of cooperative context switching as irrelevant to them.
No one said it's irrelevant. The point was that performance should be the major criterion for accepting the library.
On Thu, Jan 16, 2014 at 9:27 AM, Thomas Heller wrote:
Even in the context of a Boost.Fiber-like library, you have to take extra care to secure your data structures against concurrent access. Even though it is not necessarily running any threads in parallel, a fiber can suspend while inside a critical section. BTW, from our experience with HPX, such behavior (suspending a user-level thread while a lock is held) is very dangerous and often leads to deadlocks. That being said, even when you decide to use fibers with your legacy code, the cost of making it safe is not negligible.
Your point is well-taken. Introducing a new fiber -- let's say "a cooperatively-concurrent thread" -- into code *partially* retrofitted for kernel threads is more dangerous than into code which has always run on a single thread; in other words, code with no kernel-thread synchronization constructs. You can at least grep for kernel-thread synchronization constructs, though. Being certain that you have located and adequately defended every potential access to a process-global resource is significantly harder.
2014/1/16 Giovanni Piero Deretta
I think that Hartmut's point is that you can very well use threads for the same thing. In this particular case you would just perform a synchronous read. Yes, to maintain the same level of concurrency you need to spawn tens of thousands of threads, but that's feasible on a modern OS/hardware pair. The point of using fibers (i.e. M:N threading) is almost purely performance.
In the context of the C10K problem and using the one-thread-per-client pattern, I doubt that this would scale (even on modern hardware). Do you have some data showing the performance of a modern operating system and hardware with increasing thread count?
Oliver Kowalke wrote:
In the context of the C10K problem and using the one-thread-per-client pattern, I doubt that this would scale (even on modern hardware). Do you have some data showing the performance of a modern operating system and hardware with increasing thread count?
Spawning 10000 threads, each executing Sleep(100), takes about 350 ms for me; waiting for them to finish adds another 100 ms. Not sure how relevant this benchmark is, though. I was just curious.
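For reference, here is a minimal sketch of that kind of micro-benchmark in portable C++11 (timing thread creation and join separately); note that spawning 10000 kernel threads may bump into OS limits on some configurations, and the numbers quoted above came from Windows' Sleep():

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        typedef std::chrono::steady_clock clock_type;
        std::vector<std::thread> threads;
        threads.reserve(10000);

        clock_type::time_point t0 = clock_type::now();
        for (int i = 0; i < 10000; ++i)
            threads.emplace_back([] {
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
            });
        clock_type::time_point t1 = clock_type::now();   // creation done

        for (std::thread& t : threads)
            t.join();
        clock_type::time_point t2 = clock_type::now();   // all finished

        typedef std::chrono::milliseconds ms;
        std::cout << "spawn: "
                  << std::chrono::duration_cast<ms>(t1 - t0).count()
                  << " ms, join: "
                  << std::chrono::duration_cast<ms>(t2 - t1).count()
                  << " ms\n";
    }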
On Thu, Jan 16, 2014 at 1:33 PM, Oliver Kowalke wrote:
2014/1/16 Giovanni Piero Deretta
I think that Hartmut's point is that you can very well use threads for the same thing. In this particular case you would just perform a synchronous read. Yes, to maintain the same level of concurrency you need to spawn tens of thousands of threads, but that's feasible on a modern OS/hardware pair. The point of using fibers (i.e. M:N threading) is almost purely performance.
In the context of the C10K problem and using the one-thread-per-client pattern, I doubt that this would scale (even on modern hardware). Do you have some data showing the performance of a modern operating system and hardware with increasing thread count?
I do not have hard numbers (do you?), but consider that the C10K page is quite antiquated today. In a previous life I worked on relatively low-latency applications that handled multiple thousands of requests per second per machine. We never bothered with anything but the one-thread-per-connection model. This was on Windows, on, IIRC, octa-core 64-bit machines (today you can "easily" get 24 cores or more on a standard Intel server-class machine). Now, if we were talking about hundreds of thousands of threads or millions of threads, it would be interesting to see numbers for both threads and fibers...
2014/1/16 Giovanni Piero Deretta
I think that Hartmut's point is that you can very well use threads for the same thing. In this particular case you would just perform a synchronous read. Yes, to maintain the same level of concurrency you need to spawn tens of thousands of threads, but that's feasible on a modern OS/hardware pair. The point of using fibers (i.e. M:N threading) is almost purely performance.
In the context of the C10K problem and using the one-thread-per-client pattern, I doubt that this would scale (even on modern hardware). Do you have some data showing the performance of a modern operating system and hardware with increasing thread count?
I do not have hard numbers (do you?), but consider that the C10K page is quite antiquated today.
In a previous life I worked on relatively low-latency applications that handled multiple thousands of requests per second per machine. We never bothered with anything but the one-thread-per-connection model. This was on Windows, on, IIRC, octa-core 64-bit machines (today you can "easily" get 24 cores or more on a standard Intel server-class machine).
Now, if we were talking about hundreds of thousands of threads or millions of threads, it would be interesting to see numbers for both threads and fibers...
FWIW, the use cases I'm seeing (and trust me those are very commonplace at least in scientific computing) involve not just hundreds or thousands of threads, but hundreds of millions of threads (billions of threads a couple of years from now). Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
On Thu, Jan 16, 2014 at 2:32 PM, Hartmut Kaiser wrote:
2014/1/16 Giovanni Piero Deretta
I think that Hartmut's point is that you can very well use threads for the same thing. In this particular case you would just perform a synchronous read. Yes, to maintain the same level of concurrency you need to spawn tens of thousands of threads, but that's feasible on a modern OS/hardware pair. The point of using fibers (i.e. M:N threading) is almost purely performance.
In the context of the C10K problem and using the one-thread-per-client pattern, I doubt that this would scale (even on modern hardware). Do you have some data showing the performance of a modern operating system and hardware with increasing thread count?
I do not have hard numbers (do you?), but consider that the C10K page is quite antiquated today.
In a previous life I worked on relatively low-latency applications that handled multiple thousands of requests per second per machine. We never bothered with anything but the one-thread-per-connection model. This was on Windows, on, IIRC, octa-core 64-bit machines (today you can "easily" get 24 cores or more on a standard Intel server-class machine).
Now, if we were talking about hundreds of thousands of threads or millions of threads, it would be interesting to see numbers for both threads and fibers...
FWIW, the use cases I'm seeing (and trust me those are very commonplace at least in scientific computing) involve not just hundreds or thousands of threads, but hundreds of millions of threads (billions of threads a couple of years from now).
On a single machine? That would be impressive! -- gpd
Now, if we were talking about hundreds of thousands of threads or millions of threads, it would be interesting to see numbers for both threads and fibers...
FWIW, the use cases I'm seeing (and trust me those are very commonplace at least in scientific computing) involve not just hundreds or thousands of threads, but hundreds of millions of threads (billions of threads a couple of years from now).
On a single machine? That would be impressive!
Well, it depends on the size of the machine, doesn't it? The no. 1 machine on the top 500 list [1] (Tianhe-2 [2]) has 3,120,000 cores (in 16,000 compute nodes). Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu [1] http://www.top500.org/ [2] http://www.top500.org/list/2013/11/
On Thu, Jan 16, 2014 at 4:44 PM, Hartmut Kaiser wrote:
Now, if we were talking about hundreds of thousands of threads or millions of threads, it would be interesting to see numbers for both threads and fibers...
FWIW, the use cases I'm seeing (and trust me those are very commonplace at least in scientific computing) involve not just hundreds or thousands of threads, but hundreds of millions of threads (billions of threads a couple of years from now).
On a single machine? That would be impressive!
Well, it depends on the size of the machine, doesn't it? The no. 1 machine on the top 500 list [1] (Tianhe-2 [2]) has 3,120,000 cores (in 16,000 compute nodes).
Oh, right! Do they usually present a single OS image to the application? I.e. do all the cores share a single memory address space, or do nodes communicate via message passing (MPI I presume)? std::thread-like scaling is relevant for the first case, less so for the latter. -- gpd
On 16.01.2014 17:57, "Giovanni Piero Deretta" wrote:
On Thu, Jan 16, 2014 at 4:44 PM, Hartmut Kaiser wrote:
Now, if we were talking about hundreds of thousands of threads or millions of threads, it would be interesting to see numbers for both threads and fibers...
FWIW, the use cases I'm seeing (and trust me those are very commonplace at least in scientific computing) involve not just hundreds or thousands of threads, but hundreds of millions of threads (billions of threads a couple of years from now).
On a single machine? That would be impressive!
Well, it depends on the size of the machine, doesn't it? The no. 1 machine on the top 500 list [1] (Tianhe-2 [2]) has 3,120,000 cores (in 16,000 compute nodes).
Oh, right!
Do they usually present a single OS image to the application? I.e. do all the cores share a single memory address space, or do nodes communicate via message passing (MPI I presume)? std::thread-like scaling is relevant for the first case, less so for the latter.
If you decide to program with MPI that's certainly true. However, HPX[1] provides the ability to spawn threads remotely, completely embedded in a standard-conforming API. For those remote procedure calls, a small overhead is crucial in order to efficiently utilize your whole machine. We demonstrated the capability to do exactly that [2]. [1]: http://stellar.cct.lsu.edu [2]: http://stellar.cct.lsu.edu/pubs/scala13.pdf
-- gpd
Now, if we were talking about hundreds of thousands of threads or millions of threads, it would be interesting to see numbers for both threads and fibers...
FWIW, the use cases I'm seeing (and trust me those are very commonplace at least in scientific computing) involve not just hundreds or thousands of threads, but hundreds of millions of threads (billions of threads a couple of years from now).
On a single machine? That would be impressive!
Well, it depends on the size of the machine, doesn't it? The no. 1 machine on the top 500 list [1] (Tianhe-2 [2]) has 3,120,000 cores (in 16,000 compute nodes).
Oh, right!
Do they usually present a single OS image to the application? I.e. do all the cores share a single memory address space, or do nodes communicate via message passing (MPI I presume)? std::thread-like scaling is relevant for the first case, less so for the latter.
The conventional way to do it is to use MPI. However, if you used HPX you'd see one global address space (for the things that are meant to be sharable). Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
On 14:33 Thu 16 Jan , Oliver Kowalke wrote:
In the context of the C10K problem and using the one-thread-per-client pattern, I doubt that this would scale (even on modern hardware). Do you have some data showing the performance of a modern operating system and hardware with increasing thread count?
Here are two peer-reviewed publications with extensive performance data on various modern architectures. So yes: this can go very fast, if done right. http://stellar.cct.lsu.edu/pubs/isc2012.pdf http://stellar.cct.lsu.edu/pubs/scala13.pdf HTH -Andreas -- ========================================================== Andreas Schäfer HPC and Grid Computing Chair of Computer Science 3 Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany +49 9131 85-27910 PGP/GPG key via keyserver http://www.libgeodecomp.org ========================================================== (\___/) (+'.'+) (")_(") This is Bunny. Copy and paste Bunny into your signature to help him gain world domination!
On 01/16/2014 01:51 PM, Giovanni Piero Deretta wrote:
I think that Hartmut's point is that you can very well use threads for the same thing. In this particular case you would just perform a synchronous read. Yes, to maintain the same level of concurrency you need to spawn ten
Let me add two use cases that cannot be handled reasonably that way. First, many third-party libraries have callbacks as their primary interaction mechanism, and unlike Asio, they do not provide a synchronous alternative for the interaction. In this case fibers can be of great help. Second, when decoding/parsing streaming data (data that is received piecemeal) that is separated by delimiters, you have to start decoding to see if you have received the delimiter. If not, then you have to receive more data and decode again. Rather than having to decode from the beginning every time, it is preferable to remember how far you got and continue from there. This can be done by integrating fibers with the decoder. In these use cases performance is of secondary importance.
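To make the second use case concrete, here is a minimal sketch of such a resumable decoder, assuming Boost.Coroutine's coroutine<T>::push_type/pull_type interface (Boost 1.55); handle_message() is a hypothetical stand-in for whatever the application does with a complete message:

    #include <boost/coroutine/all.hpp>
    #include <iostream>
    #include <string>

    typedef boost::coroutines::coroutine<std::string> coro_t;

    void handle_message(const std::string& msg) {   // stand-in for real work
        std::cout << "message: " << msg << '\n';
    }

    // The partial-parse state ('pending') survives each suspension, so a
    // new chunk continues the scan instead of re-decoding from the start.
    void decoder(coro_t::pull_type& in) {
        std::string pending;
        for (const std::string& chunk : in) {   // suspends between chunks
            pending += chunk;
            std::string::size_type pos;
            while ((pos = pending.find('\n')) != std::string::npos) {
                handle_message(pending.substr(0, pos));
                pending.erase(0, pos + 1);
            }
        }
    }

    int main() {
        coro_t::push_type feed(decoder);
        feed("par");            // no delimiter yet - the decoder just suspends
        feed("tial\nand mo");   // completes one message, keeps the remainder
        feed("re\n");
    }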
On Thu, Jan 16, 2014 at 4:19 PM, Bjorn Reese wrote:
On 01/16/2014 01:51 PM, Giovanni Piero Deretta wrote:
I think that Hartmut's point is that you can very well use threads for the same thing. In this particular case you would just perform a synchronous read. Yes, to maintain the same level of concurrency you need to spawn ten
Let me add two use cases that cannot be handled reasonably that way.
First, many third-party libraries have callbacks as their primary interaction mechanism, and unlike Asio, they do not provide a synchronous alternative for the interaction.
You do not need to sell me the advantage of using continuations for managing callback hell :) ...
In this case fibers can be of great help.
... but we already have boost.coroutine for that ...
Second, when decoding/parsing streaming data (data that is received piecemeal) that is separated by delimiters, you have to start decoding to see if you have received the delimiter. If not, then you have to receive more data and decode again. Rather than having to decode from the beginning every time, it is preferable to remember how far you got and continue from there. This can be done by integrating fibers with the decoder.
... also a perfect match for coroutines.
In these use cases performance is of secondary importance.
What Boost.Fiber adds is a scheduler and a compatibility layer for boost/std thread, including locks, condvars and futures. You do not really need these if you just need to thread your callbacks. -- gpd
2014/1/17 Bjorn Reese
On 01/16/2014 05:28 PM, Giovanni Piero Deretta wrote:
... but we already have boost.coroutine for that ...
Yes, but once we want to coordinate between such continuations, whether through condition variables or message queues, then we are right back in fiber-land.
correct - and such features were already requested on the developer mailing list in 2013
I still don't get it. There is no API stability question. The API has been well defined for over 2 years now in the C++11 Standard (and even longer in Boost.Thread).
I could have chosen a different API for fibers - but I think developers are more familiar with the std::thread/boost::thread API.
But you have not (and for a good reason!). So this argument is moot.
So performance is the main incentive for such a library (what else could there be?).
with fibers you can suspend your execution context while keeping the thread running (it might execute something else). this is not possible with threads - if a thread is suspended (yield(), waiting on a mutex/condition_variable), the thread itself is blocked.
this feature of fibers enables you to write code like the following (again using asio, even if you don't care about this use case):
    for (;;) {
        ...
        boost::asio::async_read( socket_, buffer, yield[ec]);
        ...
    }
async_read() suspends the current execution context (not the thread itself) and resumes it when all data has been read. without fibers you can't write the code like the above (as a for-loop, for instance). the thread itself can have more than one fiber running such for-loops.
with threads you would have to pass a callback to async_read(), and you could not invoke it inside a for-loop.
the example directory of boost.fiber contains several asio examples demonstrating this feature.
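For contrast, a hedged sketch of the callback shape being referred to (the session class and buffer size are hypothetical): without a fiber or coroutine, the 'loop' has to be expressed as a completion handler that re-initiates the read.

    #include <boost/asio.hpp>
    #include <memory>

    class session : public std::enable_shared_from_this<session> {
    public:
        explicit session(boost::asio::ip::tcp::socket socket)
            : socket_(std::move(socket)) {}

        void start() { do_read(); }

    private:
        void do_read() {
            std::shared_ptr<session> self = shared_from_this();
            socket_.async_read_some(boost::asio::buffer(buffer_),
                [this, self](boost::system::error_code ec, std::size_t /*n*/) {
                    if (!ec)
                        do_read();   // the 'for-loop' becomes recursion
                });                  // through the completion handler
        }

        boost::asio::ip::tcp::socket socket_;
        char buffer_[1024];
    };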
The only benefit you're getting from using fibers for this (and you could achieve the same semantics using plain ol' threads as well - Boost.Asio has been doing it for years, after all) is - now guess - performance. So please make up your mind. Are you trying to improve performance or what?
If you don't need the extra performance - use std::thread.
I could have chosen a different API.
As said, you didn't. So this is moot.
Boost.Fiber does not add any new semantics beyond what the Standard mandates.
it adds 'suspend/resume' of an execution context while the hosting thread is not suspended.
Who cares about this if not for performance reasons? However, I still believe that Fibers _are_ threads. You can't do anything with them you couldn't do with std::thread directly (semantically).
Instead, it adds more constraints to the context where the API can be used (somebody mentioned interaction with Asio, and single-threaded legacy applications) - thus it narrows down existing semantics.
I think this statement is false.
Care to elaborate? Why is this false? You're implementing the Standard's API with the Standard's semantics and constraining the usability of the outcome to a couple of minor use cases. Sorry, still no new semantics.
Sure, that's out of the question. My concern is that we're about to add a Boost library targeting some minor use cases only, while it has the potential to change the way we do parallel computing.
be sure that I have performance in my sights after these discussions! I've already started to write code for performance measurements.
Performance is never an after-thought. Your implementation quality so far is not up to Boost standards (as others have pointed out), the performance of the library is not convincing either. I'd suggest you withdraw your submission at this point, rework the library and try for another review once that's achieved. We have more substandard code in Boost already than necessary because of this 'let's fix it later' attitude. This 'later' never happens, most of the time - sadly. Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
On 16 Jan 2014 at 7:22, Hartmut Kaiser wrote:
Performance is never an after-thought. Your implementation quality so far is not up to Boost standards (as others have pointed out), the performance of the library is not convincing either. I'd suggest you withdraw your submission at this point, rework the library and try for another review once that's achieved. We have more substandard code in Boost already than necessary because of this 'let's fix it later' attitude. This 'later' never happens, most of the time - sadly.
I think it isn't unreasonable for a library to enter Boost if it has good performance *scaling* to load (e.g. O(log N)), even if performance in the absolute or comparative-to-near-alternatives sense is not great. Absolute performance can always be incrementally improved later, whereas poor performance scaling to load usually means the design is wrong and you're going to need a whole new library with new API. This is why I really wanted to see performance scaling graphs. If they show O(N log N) or worse, then the design is deeply flawed and the library must not enter Boost. Until we have such a graph, we can't know as there is no substitute for empirical testing. Niall -- Currently unemployed and looking for work in Ireland. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
2014/1/16 Hartmut Kaiser
I still don't get it. There is no API stability question. The API has been well defined for over 2 years now in the C++11 Standard (and even longer in Boost.Thread).
I could have chosen a different API for fibers - but I think developers are more familiar with the std::thread/boost::thread API.
But you have not (and for a good reason!). So this argument is moot.
what I tried to say is that the boost community could have come to the conclusion that the chosen API (the thread-API or any other) is not appropriate for the suggested semantics. what the review figured out is that the thread-API would be accepted by the reviewers - and that's what I was referring to with 'stable API for boost.fiber'.
The only benefit you're getting from using fibers for this (and you could achieve the same semantics using plain ol' threads as well - Boost.Asio has been doing it for years, after all) is - now guess - performance. So please make up your mind. Are you trying to improve performance or what?
As I wrote before - with threads you would have to scatter your code with callbacks. With fibers you don't - you can write the code as if it performed synchronous operations. That makes the code easier to read and understand.
On 17/01/2014 02:44, Quoth Oliver Kowalke:
As I wrote before - with threads you would have to scatter your code with callbacks. With fibers you don't - you can write the code as if it performed synchronous operations. That makes the code easier to read and understand.
Boost.Asio already supports using Boost.Coroutine for that purpose; an extra library seems unnecessary if that is your target. My understanding is that the new thing that Fibers tries to bring to the table is std::thread-like cross-fiber synchronisation. Which is something that only matters if you have fibers running in multiple threads, and does not seem related to the use case you're mentioning above, unless I'm missing something. So I'm a little confused as to what you're trying to focus on.
On Fri, Jan 17, 2014 at 12:54 AM, Gavin Lambert wrote:
On 17/01/2014 02:44, Quoth Oliver Kowalke:
As I wrote before - with threads you would have to scatter your code with callbacks. With fibers you don't - you can write the code as if it performed synchronous operations. That makes the code easier to read and understand.
Boost.Asio already supports using Boost.Coroutine for that purpose; an extra library seems unnecessary if that is your target.
What if you're using an asynchronous API that's not Boost.Asio? What if you're using several different async APIs? Wouldn't you want something like future and promise to interface between your coroutine and an arbitrary asynchronous API? Then there's the lifespan question. In a classic coroutine scenario, you instantiate a coroutine object, you chat with it for a bit, then you destroy it. But launching a "cooperatively context-switched thread" is more of a fire-and-forget operation. Who owns the object? Who cleans up when it's done? Then there's control flow. A coroutine has a caller. When it context-switches away, it specifically resumes that caller. What if you have several different coroutines you're using as cooperative threads, and you want to run whichever of them is ready next? Clearly all of this can be done with coroutines, yes. (Fiber does build it on coroutines!) But it's a whole additional abstraction layer. Must every developer facing this kind of use case build that layer by hand?
My understanding is that the new thing that Fibers tries to bring to the table is std::thread-like cross-fiber synchronisation. Which is something that only matters if you have fibers running in multiple threads, and does not seem related to the use case you're mentioning above, unless I'm missing something.
Consider a producer fiber obtaining input from some async source, pumping items into a queue. That queue is consumed by several different consumer fibers, each interacting with an async sink. All of it is running on a single thread. That's just one example.
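A hedged sketch of that scenario, assuming the proposed library's std::thread-like primitives (fiber, mutex, condition_variable, this_fiber::yield) behave like their std counterparts; everything below runs on one kernel thread, and the async source/sink are reduced to plain loops for brevity:

    #include <boost/fiber/all.hpp>
    #include <deque>
    #include <iostream>
    #include <mutex>

    std::deque<int> queue;
    boost::fibers::mutex mtx;
    boost::fibers::condition_variable cond;

    void producer() {
        for (int i = 0; i < 10; ++i) {
            {
                std::unique_lock<boost::fibers::mutex> lk(mtx);
                queue.push_back(i);
            }
            cond.notify_one();           // wakes one waiting consumer fiber
            boost::this_fiber::yield();  // cooperative hand-off
        }
    }

    void consumer(int id) {
        for (int received = 0; received < 5; ++received) {
            std::unique_lock<boost::fibers::mutex> lk(mtx);
            while (queue.empty())
                cond.wait(lk);           // suspends this fiber only
            int v = queue.front();
            queue.pop_front();
            std::cout << "consumer " << id << " got " << v << '\n';
        }
    }

    int main() {
        boost::fibers::fiber p(producer);
        boost::fibers::fiber c1([] { consumer(1); });
        boost::fibers::fiber c2([] { consumer(2); });
        p.join(); c1.join(); c2.join();
    }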
On 18/01/2014 04:01, Quoth Nat Goodspeed:
Clearly all of this can be done with coroutines, yes. (Fiber does build it on coroutines!) But it's a whole additional abstraction layer. Must every developer facing this kind of use case build that layer by hand?
I asked that because Oliver seems to me to be focusing many of his replies in this thread and elsewhere to "it makes Asio syntax cleaner", which I don't feel is a sufficient justification for this library to exist by itself, because Coroutine already does that. I'm not saying that the library doesn't have merit for other reasons, just that it's not being expressed very well.
Consider a producer fiber obtaining input from some async source, pumping items into a queue. That queue is consumed by several different consumer fibers, each interacting with an async sink. All of it is running on a single thread. That's just one example.
If you are running on a single thread, you do not require any locks at all on the queue, and barely any kind of synchronisation to have consumers go to sleep when idle and be woken up when new work arrives. Although granted this library would theoretically make life easier than the alternatives if the consumers also needed to sleep on things other than the queue itself -- though again, if you're running in one thread you don't need locks, so there's not much you need to sleep on. It's only really when you go to M:N that something like this becomes especially valuable. (Don't get me wrong -- I'm eagerly waiting for something like this, because I *do* have a M:N situation in some of my code. But that code is currently using Windows fibers, so I'm also interested in a performance comparison.)
2014/1/19 Gavin Lambert
I asked that because Oliver seems to me to be focusing many of his replies in this thread and elsewhere to "it makes Asio syntax cleaner", which I don't feel is a sufficient justification for this library to exist by itself, because Coroutine already does that.
correct - but think of the one-thread-per-client pattern (which most developers are familiar with), which is easier to write and understand than using callbacks - both using async I/O. one-thread-per-client -> one-fiber-per-client. with coroutines you can't use one-fiber-per-client because you are missing the synchronization classes.
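To illustrate the shape of that pattern, here is a hedged sketch of a coroutine-per-client echo server using Boost.Asio's spawn()/yield_context (available since Boost 1.54); the one-fiber-per-client variant would look the same, with Boost.Fiber additionally giving the per-client handlers join/mutex/condition_variable to coordinate with each other:

    #include <boost/asio.hpp>
    #include <boost/asio/spawn.hpp>
    #include <memory>

    using boost::asio::ip::tcp;

    int main() {
        boost::asio::io_service io;
        tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 9000));

        boost::asio::spawn(io, [&](boost::asio::yield_context yield) {
            for (;;) {
                std::shared_ptr<tcp::socket> socket =
                    std::make_shared<tcp::socket>(io);
                boost::system::error_code ec;
                acceptor.async_accept(*socket, yield[ec]);  // suspends, not blocks
                if (ec) break;
                // one coroutine per client: a synchronous-looking echo loop
                boost::asio::spawn(io,
                    [socket](boost::asio::yield_context yield) {
                        char data[512];
                        boost::system::error_code ec;
                        for (;;) {
                            std::size_t n = socket->async_read_some(
                                boost::asio::buffer(data), yield[ec]);
                            if (ec) break;
                            boost::asio::async_write(*socket,
                                boost::asio::buffer(data, n), yield[ec]);
                            if (ec) break;
                        }
                    });
            }
        });

        io.run();
    }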
If you are running on a single thread, you do not require any locks at all on the queue, and barely any kind of synchronisation to have consumers go to sleep when idle and be woken up when new work arrives.
hmm - the library doesn't use locks in the sense of thread-locks. fibers are a thin wrapper around coroutines, and you are able to do something like:

    void fn1() {}
    fiber f1( fn1);

    void fn2() { f1.join(); }
    fiber f2( fn2);

e.g. a fiber can join another fiber as you know it from threads (coroutines do not provide this feature; you would have to implement it yourself, which is what boost.fiber already provides).
Although granted this library would theoretically make life easier than the alternatives if the consumers also needed to sleep on things other than the queue itself -- though again, if you're running in one thread you don't need locks, so there's not much you need to sleep on.
fibers do not sleep - they are suspended (their stack and registers are preserved), and when the condition for which a fiber was suspended becomes true it is resumed, i.e. the registers are restored in the CPU and the stack pointer is restored too. so it is not real locking as with threads, but the fiber library provides classes following the std::thread API (while the internal implementation and the mechanisms used are different).
It's only really when you go to M:N that something like this becomes especially valuable.
not really - even if you do cooperative scheduling == userland threads (which can run concurrently in one thread), you need classes for coordinating the fibers.
On 20/01/2014 20:07, Quoth Oliver Kowalke:
with coroutines you can't use one-fiber-per-client because you are missing the synchronization classes.
You can if you don't require synchronisation. Something that's just serving up read-only data (eg. basic in-memory web server) or handing off complex requests to a worker thread via a non-blocking queue, would be an example of that. Every fiber is completely independent of every other -- they don't care what the others are up to. The thing is though that the main advantage of thread-per-client is handling multiple requests simultaneously. And you lose that advantage with fiber-per-client unless you sprinkle your processing code with fiber interruption points (either manually or via calls to the sync classes you're proposing) -- and even then I think that only provides much benefit for long-running-connection protocols (like IRC or telnet), not request-response protocols (like HTTP), and where individual processing time is very short. For a system that has longer processing times but still wants to handle multiple requests (where processing time is CPU bound, rather than waiting on other fibers), the best design would be a limited size threadpool that can run any of the per-client fibers. And AFAIK your proposed library has no support for this scenario. I'm not saying it necessarily *needs* this, but if you're going to talk about fibers-as-useful-to-ASIO I think this case is going to come up sooner rather than later, so it may be worthy of consideration in the library design. Another scenario that doesn't require fiber migration, but does require cross-thread-fiber-synch, is: - one thread running N client fibers - M worker threads each running one fiber If a client thread wants to make a blocking call (eg. database/file I/O) it could post a request to a worker, which would do the blocking call and then post back once it was done. This would allow the client fibers to keep running but the system would still bottleneck once it had M simultaneous blocking calls. (A thread per client system wouldn't bottleneck there, but it loses performance if there are too many non-blocked threads.) Neither design seems entirely satisfactory. (An obvious solution is to never use blocking calls, but that's not always possible.)
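A hedged sketch of the offload idea in the last paragraph, assuming the proposed library's fiber-aware promise/future may be satisfied from a foreign thread (the atomics discussed earlier in this thread suggest that is the intent); blocking_db_lookup() is a hypothetical stand-in for real blocking I/O:

    #include <boost/fiber/all.hpp>
    #include <sstream>
    #include <string>
    #include <thread>

    std::string blocking_db_lookup(int key) {   // stand-in for blocking I/O
        std::ostringstream os;
        os << "value-" << key;
        return os.str();
    }

    std::string lookup_from_fiber(int key) {
        boost::fibers::promise<std::string> p;
        boost::fibers::future<std::string> f = p.get_future();
        // hand the blocking work to a kernel thread; detached for brevity,
        // and p outlives set_value() because f.get() waits for it below
        std::thread worker([&p, key] {
            p.set_value(blocking_db_lookup(key));
        });
        worker.detach();
        // f.get() suspends only this fiber - other fibers on the same
        // thread keep running until the worker delivers the result
        return f.get();
    }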
Oliver,
I asked that because Oliver seems to me to be focusing many of his replies in this thread and elsewhere to "it makes Asio syntax cleaner", which I don't feel is a sufficient justification for this library to exist by itself, because Coroutine already does that.
correct - but think of the one-thread-per-client pattern (which most developers are familiar with), which is easier to write and understand than using callbacks - both using async I/O. one-thread-per-client -> one-fiber-per-client. with coroutines you can't use one-fiber-per-client because you are missing the synchronization classes.
After some more thinking I believe I'm starting to understand the angle you're coming from. The proposed library has been designed for the sole purpose of complementing Boost.Asio (or similar asynchronous libraries), allowing it to be used in a more straightforward way. I apologize for being slow or dense. Using the name Boost.Fiber implies a much broader use case (and that's what got me confused). I think it would be sensible to choose another name for this library. BTW, if the author had referred to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3747.pdf, all these misunderstandings could have been avoided... Regards Hartmut --------------- http://boost-spirit.com http://stellar.cct.lsu.edu
On 20 Jan 2014 at 19:48, Hartmut Kaiser wrote:
BTW, if the author would have referred to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3747.pdf, all these misunderstandings could have been avoided...
That's a good paper, but I wish it didn't claim to be a *universal* model for asynchronous operations because that model is completely unsuitable for persistent storage i/o. I had that argument with Nicholas @ Microsoft Research actually, and I think I may well have persuaded him as they're seeing the same problems of fit. Niall -- Currently unemployed and looking for work in Ireland. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
On Tue, Jan 21, 2014 at 7:44 AM, Niall Douglas wrote:
On 20 Jan 2014 at 19:48, Hartmut Kaiser wrote:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3747.pdf
That's a good paper, but I wish it didn't claim to be a *universal* model for asynchronous operations because that model is completely unsuitable for persistent storage i/o. I had that argument with Nicholas @ Microsoft Research actually, and I think I may well have persuaded him as they're seeing the same problems of fit.
Niall, would you be able to propose a more universal model? Please read this as a simple invitation rather than a challenge. The goal of the paper seems laudable: accepting an argument that allows the caller to specify whether to provide results with a callback, a future or suspend-and-resume.
On 21 Jan 2014 at 10:09, Nat Goodspeed wrote:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3747.pdf
That's a good paper, but I wish it didn't claim to be a *universal* model for asynchronous operations because that model is completely unsuitable for persistent storage i/o. I had that argument with Nicholas @ Microsoft Research actually, and I think I may well have persuaded him as they're seeing the same problems of fit.
Niall, would you be able to propose a more universal model? Please read this as a simple invitation rather than a challenge. The goal of the paper seems laudable: accepting an argument that allows the caller to specify whether to provide results with a callback, a future or suspend-and-resume.
I think I effectively have, via AFIO, which uses an asynchronous execution precedence graph model with explicit gathering for error propagation, and which, unlike the proposed model, works as well with seekable i/o as fifo i/o. The problem with the AFIO model though is that it is very heavily reliant on the performance of futures (and therefore the memory allocator) as those are used to transport state between closures, and for those with very low latency, very small packet (e.g. UDP) socket i/o it's unsuitable (which is of course pointed out in the WG21 paper). My opinion there is that callbacks can become a low level interface for those who need it, while the execution precedence graph model is much easier to program against for everything else. You're probably about to ask me for an N-paper, so I'll save an email cycle by explaining my hesitation on that. Bjorn Reese has been working extensively with me off list to de-wart AFIO when combined with ASIO such that async socket and async disc i/o seamlessly interoperate; specifically, he's persuaded me to supply async_io_op to completion handlers, which is going to be a technical challenge for me to implement in a safe and quick way, but I think I am capable, though it's going to hurt. I will start that work item once I clear my maths coursework (hopefully tomorrow) and file my consulting company's annual accounts (hopefully end of this week). Also, I might actually have a job interview soon, which amazes me as I had expected at least three months to elapse before any jobs in C++ turned up (there are about six a year total in this region). Both Artur and Niklas, who are leading out Microsoft's work on async in C++, are aware of AFIO's design approach, and last time I heard they found themselves coming ever closer to AFIO's design as they find blocking issues and corner cases in their proposals. I think it will be more productive for now for me to keep nagging them on their N-papers rather than write one of my own. After all, being in Europe and being unemployed makes it extremely tough to attend C++ standards meetings, and I don't personally rate an N-paper's chances if someone isn't at the meetings to champion it, i.e. present in the bar afterwards to argue merits with people. Hopefully this explains things. Do point out any problems in my thoughts. Niall -- Currently unemployed and looking for work in Ireland. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
On 01/21/2014 04:09 PM, Nat Goodspeed wrote:
Niall, would you be able to propose a more universal model? Please read this as a simple invitation rather than a challenge. The goal of
A good place to start is to understand the limitations of the Asio model. The first limitation pertains to chained operations. In Asio you chain operations by initiating the next operation when you receive a callback from the previous operation. Although we can initiate several operations at the same time, these operations are not chained. I am going to ignore scatter I/O here because it has its own limitations (e.g. it allows either multiple reads or multiple writes, but not combinations thereof.) The second limitation is about multiple types. The Asio model assumes that there is a one-to-one correspondence between the type that I request and the type I receive. This is perfectly fine for Asio because it just deals with buffers. However, if you create an asynchronous RPC server using the same kind of callback mechanism as Asio, then you request a function of any type and receive a function of a specific type. In this design you have multiple "return types" (received function signatures.) It can be useful to put the second limitation into perspective. The RPC example above fits best into the event listener mentioned below. Inspired by a classification by Eric Meijer, we can say that: 1. If we have a single return value of a single type then we use T (or expected<T>) in the sync case and future<T> in the async case. 2. If we have multiple return values of a single type then we use an iterator in the sync case and an observer pattern (e.g. signals2 or Asio callbacks) in the async case. 3. If we have multiple return values of multiple types then we use a variant<T> visitor in the sync case and an event listener in the async case.
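To illustrate the classification, these are the interface shapes the three cases map to (declarations only; all names here are hypothetical):

    #include <functional>
    #include <future>
    #include <string>

    struct Packet { std::string payload; };

    // 1. single value, single type:
    //    T (or expected<T>) synchronously, future<T> asynchronously
    std::future<int> async_read_length();

    // 2. multiple values, single type:
    //    an iterator synchronously, an observer callback asynchronously
    void async_read_packets(std::function<void(const Packet&)> on_packet);

    // 3. multiple values, multiple types:
    //    a variant visitor synchronously, an event listener asynchronously
    struct event_listener {
        virtual ~event_listener() {}
        virtual void on_key(int key) = 0;
        virtual void on_help(int x, int y) = 0;
    };
    void subscribe(event_listener& listener);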
the paper seems laudable: accepting an argument that allows the caller to specify whether to provide results with a callback, a future or suspend-and-resume.
Definitely, and it is a significant step forward.
On 22 Jan 2014 at 18:37, Bjorn Reese wrote:
Niall, would you be able to propose a more universal model? Please read this as a simple invitation rather than a challenge. The goal of
A good place to start is to understand the limitations of the Asio model.
A great post Bjorn. You definitely explained it better than I. And my public thanks to you for all your work with me on improving AFIO (I'll be replying to your email soon, I ought to submit my maths coursework tomorrow). I should elaborate on the importance of easily chaining operations into patterns as it's probably non-obvious. Storage i/o, unlike fifo i/o, does not scale linearly to queue depth due to highly non-linear variance in callback latencies due to OS caching effects overlaid on mechanical motors, so ease for the programmer to fiddle with operation chain patterns is paramount for maximum performance, especially as it's mainly a trial and error thing given the complexities of the systems which make up filing systems etc. One basically designs an access pattern you think ought to be performant under both cold and warm cache scenarios, try testing it and find yourself a bit wide from the mark, so now you hunt around for the goldilocks zone through repeated testing cycles. I haven't personally found any better way than this yet sadly, storage is so very non-linear at the micro-level. Doing this using ASIO style callbacks involves a ton load of cutting and pasting code around, several iterations of compilation to remedy the unenforced syntax errors, more iterations of debugging because now you've broken the storage etc. Alternatively, doing this using AFIO style chained ops is far easier and quicker because you simply tweak the op dependency graph you're sending to the async closure engine to execute, and you let the engine figure out how best to send it to the OS and ASIO. There is also an additional debugging option made available because with a formal op dependency graph one could have code check that your sequence of reads and writes is power-loss safe and race condition free with other processes reading and writing the same files. Using an ASIO style callback model that would be hard without additional metadata being supplied. As much as AFIO style async code is a bit heavy due to the use of futures, for the kind of latencies you see with big chunks of data being moved around it's an affordable overhead given the convenience. Niall -- Currently unemployed and looking for work in Ireland. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
On 23/01/2014 06:37, Quoth Bjorn Reese:
The first limitation pertains to chained operations. In Asio you chain operations by initiating the next operation when you receive a callback from the previous operation. Although we can initiate several operations at the same time, these operations are not chained. I am going to ignore scatter I/O here because it has its own limitations (e.g. it allows either multiple reads or multiple writes, but not combinations thereof.)
I'm having trouble understanding this. A chained operation must by definition be one operation being called as some other operation completes, and can never possibly refer to operations running in parallel. You can certainly have multiple operations related in fashions other than chains, either by giving them the same callback target object, or by calling them through the same strand, or by calling them on some object that has some internal policy about how concurrent operations are managed, or by making a new composite operation that internally manages sub-operations in a fashion invisible to the caller.
The second limitation is about multiple types. The Asio model assumes that there is a one-to-one correspondence between the type that I request and the type I receive. This is perfectly fine for Asio because it just deals with buffers. However, if you create an asynchronous RPC server using the same kind of callback mechanism as Asio, then you request a function of any type and receive a function of a specific type. In this design you have multiple "return types" (received function signatures.)
Or each "function" is just a custom async operation. You don't request a function and then try to interrogate it, you just execute operations. It's pretty easy to define new I/O objects in the ASIO model and give them whatever async functionality you want. Granted, it's all in-process, so you'd have the added complication of injecting some sort of serialisation and remoting to make it RPC, but it's still fairly readily achievable, I think.
On 01/23/2014 01:26 AM, Gavin Lambert wrote:
I'm having trouble understanding this. A chained operation must by definition be one operation being called as some other operation completes, and can never possibly refer to operations running in parallel.
Think of the execution of chained operations as analogous to the execution of CPU instructions. Niall has already explained the situation where all chained operations should be passed to the scheduler to avoid latency. This is analogous to avoiding a CPU pipeline flush. You can also have chained operations that are commutative, so the scheduler can reorder them for better performance. This is analogous to out-of-order CPU execution.
Or each "function" is just a custom async operation. You don't request a function and then try to interrogate it, you just execute operations.
Can you elaborate? If I have the following event listener, how would it look and be used with your suggestion?

    class gui_event {
    public:
        virtual void on_key(int key);
        virtual void on_help(int x, int y);
    };
On 24 Jan 2014 at 13:13, Bjorn Reese wrote:
I'm having trouble understanding this. A chained operation must by definition be one operation being called as some other operation completes, and can never possibly refer to operations running in parallel.
Think of the execution of chained operations as analogous to the execution of CPU instructions.
Niall has already explained the situation where all chained operations should be passed to the scheduler to avoid latency. This is analogous to avoiding a CPU pipeline flush.
That's a good analogy, but there are significant differences in orders of scaling. Where a pipeline stall in a CPU may cost you 10x, and a main memory cache line miss may cost you 200x, you're talking a 50,000x cost to a warm filing system cache miss. There are also very different queue depth scaling differences, so for example the SATA AHCI driver on Windows gets exponentially slow if you queue more than a few hundred ops to it simultaneously, whereas the Windows FS cache layer will happily scale to tens of thousands of simultaneous ops without blinking. How many FS cache layer ops turn into how many SATA AHCI driver ops is very non-trivial, and essentially it becomes a statistical analysis of black box behaviour which I would assume is not even static across OS releases.
You can also have chained operations that are commutative, so the scheduler can reorder them for better performance. This is analogous to out-of-order CPU execution.
Indeed that is the very point of chaining: you can say to AFIO that this group A here of operations can complete in any order and I don't care, but I don't want that this group B here of operations to occur until the very last operation in group A completes. This affords maximum scope to the OS kernel to reorder operations to complete as fast as possible without losing data integrity/causing races. It's this sort of metadata that the ASIO callback model simply doesn't specify. It's actually really unfortunate that more of this stuff isn't documented explicitly in OS documentation. If you're into filing systems, then you know it, but otherwise people just assume that reading and writing persistent data is just like any other kind of i/o. The Unix abstraction of making fd's identical for any kind of i/o when there are very significant differences underneath in semantics is mainly to blame I assume. Niall -- Currently unemployed and looking for work in Ireland. Work Portfolio: http://careers.stackoverflow.com/nialldouglas/
On 25/01/2014 01:13, Quoth Bjorn Reese:
Or each "function" is just a custom async operation. You don't request a function and then try to interrogate it, you just execute operations.
Can you elaborate? If I have the following event listener, how would it look and be used with your suggestion?
    class gui_event {
    public:
        virtual void on_key(int key);
        virtual void on_help(int x, int y);
    };
You flip it around. Instead of having an event listener object that is registered on some event provider source, where the provider source invokes the methods explicitly when an event arrives, you have anything that is interested in events invoke an async request on the source object. So it'd be something more like this:

    class gui_source {
    public:
        // actually using templates to make the callback more generic
        void async_key(void (*callback)(error_code ec, int key));
        void async_help(void (*callback)(error_code ec, int x, int y));
    };

The code on the receiving side just handles a callback instead of receiving an explicit call, but otherwise it's basically the same. You still have to externally define your threading model (eg. GUI events typically assume they're always called back on the same thread), and the policy on whether events are forwarded to all listeners or only first-come-first-served, if callbacks are supposed to be ordered in some way, and if callbacks are "persistent" (request once, called many times) or "single-shot" (one callback per request, as in ASIO), and if the latter, what happens if an event arrives when a particular listener is in between listen calls. Because of the extra complexity, it's definitely *easier* to use the direct-notifier pattern, which is why most UI frameworks do that. But it's *possible* to use these successfully even with single-shot callbacks -- just look at AJAX long-polling for a real-world example.
On 01/27/2014 11:16 PM, Gavin Lambert wrote:
You flip it around. Instead of having an event listener object that is registered on some event provider source, where the provider source invokes the methods explicitly when an event arrives, you have anything that is interested in events invoke an async request on the source object. So it'd be something more like this:
    class gui_source {
    public:
        // actually using templates to make the callback more generic
        void async_key(void (*callback)(error_code ec, int key));
        void async_help(void (*callback)(error_code ec, int x, int y));
    };
The code on the receiving side just handles a callback instead of receiving an explicit call, but otherwise it's basically the same.
This is an interesting idea. Although it does involve a lot of plumbing, I agree that it can be done. As we are exploring the limitations of the Asio model, let me introduce a use case that is difficult to do within this kind of parallel initiator-callback paradigm. Consider a secure RPC server, whose full API can only be used if the client has the correct privileges. For simplicity, let us assume that this is a single-client server. There are two modes: unauthenticated and authenticated. In unauthenticated mode, the server should reject all but the authentication requests. The way you would typically do this is to have separate implementations of the API for each mode, and when the client has been authenticated, the server will switch from the unauthenticated to the authenticated implementation. This wholesale replacement of the underlying implementation is much more difficult to do with the parallel initiator-callback style. We could solve the problem with another level of indirection, but that would effectively re-introduce the event listener.
On 31/01/2014 22:28, Quoth Bjorn Reese:
Consider a secure RPC server, whose full API can only be used if the client has the correct privileges. For simplicity, let us assume that this is a single-client server. There are two modes: unauthenticated and authenticated. In unauthenticated mode, the server should reject all but the authentication requests. The way you would typically do this is to have separate implementations of the API for each mode, and when the client has been authenticated, the server will switch from the unauthenticated to the authenticated implementation. This wholesale replacement of the underlying implementation is much more difficult to do with the parallel initiator-callback style. We could solve the problem with another level of indirection, but that would effectively re-introduce the event listener.
The way to handle that, I would think, would be to have the "public" API be more limited in scope (not an identical copy that mostly returns not-authenticated errors), and to provide an "async_authenticate" request that calls back (single-shot) with an interface that provides the complete API. Async requests wouldn't carry over between the authenticated and unauthenticated API, but you wouldn't want that anyway -- most clients would authenticate first before making any other requests anyway, and one of the clients might be a "broker" that wants to maintain multiple independently authenticated connections with different credentials. Granted that's outside the scope of a single-client server, but the design would be easier to scale if it turns out that single-client isn't sufficient. Again, I'm not saying that Asio-style callbacks are the "best" way of implementing RPC or UI models. Just that it's not impossible to do so. (But at some level, you need a serialisation+networking layer to actually transfer requests between processes or machines. This can be made completely transparent to both server and client, but something has to be able to translate all the possible types of request.)
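A hedged sketch of what that split might look like (all names hypothetical): the unauthenticated surface is deliberately narrow, and async_authenticate hands back the full interface on success.

    #include <functional>
    #include <memory>
    #include <string>
    #include <system_error>

    // the full API, reachable only through authentication
    struct full_api {
        virtual ~full_api() {}
        virtual void async_query(
            const std::string& request,
            std::function<void(std::error_code, std::string)> on_reply) = 0;
    };

    // the public surface offers authentication and nothing else, so there
    // is no near-copy of the API that mostly returns not-authenticated
    struct public_api {
        virtual ~public_api() {}
        virtual void async_authenticate(
            const std::string& credentials,
            std::function<void(std::error_code,
                               std::shared_ptr<full_api>)> on_done) = 0;
    };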
On 02/02/2014 11:05 PM, Gavin Lambert wrote:
The way to handle that, I would think, would be to have the "public" API be more limited in scope (not an identical copy that mostly returns not-authenticated errors), and to provide an "async_authenticate" request that calls back (single-shot) with an interface that provides the complete API.
Authentication is just one example of how the API may need to change its operational mode dynamically. You could also have a maintenance mode, a defensive mode (against denial-of-service attacks), a budget vs premium mode, and so on. How do I change from defensive mode back to normal mode?
Again, I'm not saying that Asio-style callbacks are the "best" way of implementing RPC or UI models. Just that it's not impossible to do so.
It may just be my lack of imagination, but I cannot see how to do a mode change with the Asio model.
(But at some level, you need a serialisation+networking layer to actually transfer requests between processes or machines. This can be made completely transparent to both server and client, but something has to be able to translate all the possible types of request.)
Yes, and that applies to any model, so I am not holding that against the Asio model.
On 6/02/2014 23:27, Quoth Bjorn Reese:
On 02/02/2014 11:05 PM, Gavin Lambert wrote:
The way to handle that, I would think, would be to have the "public" API be more limited in scope (not an identical copy that mostly returns not-authenticated errors), and to provide an "async_authenticate" request that calls back (single-shot) with an interface that provides the complete API.
Authentication is just one example of how the API may need to change its operational mode dynamically. You could also have a maintenance mode, a defensive mode (against denial-of-service attacks), a budget vs premium mode, and so on.
How do I change from defensive mode back to normal mode?
How are you imagining that the modes change? For example if a client can dynamically upgrade from budget to premium mode then that's just another case of authentication. If it's a global server state change, then probably it would disconnect all currently subscribed clients (calling them back with an error code) and let them reconnect to its new API provider, which might refuse certain operations entirely or place different limits on them.
On 02/06/2014 11:07 PM, Gavin Lambert wrote:
How are you imagining that the modes change?
As we are exploring the limitations of the Asio model, I would say that the mode can change in any imaginable manner: it could alternate between two modes on each request, or it could change only a subset of the requests rather than all, just to name two. And this would be completely transparent to the client, because in the general case it is not simply a matter of accepting or rejecting requests, but of processing them differently.

So I guess that it boils down to:

1. Can I replace a continuation after the function has been initiated?

2. Can I group several such replacements so that they will be replaced at the same time (or at least before the next event occurs)?
On 9/02/2014 02:14, Quoth Bjorn Reese:
As we are exploring the limitations of the Asio model, I would say that the mode can change in any imaginable manner: it could alternate between two modes on each request, or it could change only a subset of the requests rather than all, just to name two. And this would be completely transparent to the client, because in the general case it is not simply a matter of accepting or rejecting requests, but of processing them differently.
I still think you're imagining a scenario that doesn't make sense in practice.
So I guess that it boils down to:
1. Can I replace a continuation after the function has been initiated?
You have to have something that is holding the continuation so that it can be called later. There's no reason *in principle* why this cannot be handed from one object to another as desired, as long as the conceptual operation is still in progress as far as the caller is concerned. (In fact, this actually happens in ASIO -- a pending operation is held by the I/O object until it completes, then is passed to the generic scheduler to execute the user handler.)

If something significant happens that the caller is likely to want to know about (such as changing license state etc), then it may be worthwhile cancelling the connection (by calling back with an error code) and getting the client to reconnect, because it might want to set up a different set of pending operations / subscriptions given the new state. When there are multiple pending operations, there's no reason why you can't cancel some, initiate others, and leave the rest going (depending on what makes sense for the particular change in question).

(I said "in principle" above because the current ASIO model makes heavy use of templates for performance and to enable auxiliary features such as strands and custom allocation; the way that it's implemented at the moment, these can't survive a type-erasure boundary such as would be required to pass through an RPC system. Instead you'd have to replicate these on each side -- but you'd probably want that anyway.)
2. Can I group several such replacements so that they will be replaced at the same time (or at least before the next event occurs)?
In the context of "a thing that implements ASIO-like callbacks", sure. As I said above, you'd just have to move the list of pending operations from one implementation to the other, and the caller wouldn't know the difference as long as whatever handle it uses to make async requests is still valid. (Things get hairier if you're processing events on multiple threads, though.)

In the context of ASIO specifically, not really, at least not once the operations have been marked as complete and ready to execute. But I'm still not really sure why you'd want to.
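[A sketch of what "moving the list of pending operations" between implementations could look like once the handlers are type-erased; the types are invented for illustration, and as noted above, real Asio handlers are templated rather than type-erased like this:]

    #include <functional>
    #include <string>
    #include <system_error>
    #include <utility>
    #include <vector>

    using handler = std::function<void(std::error_code, std::string)>;

    struct pending_op {
        std::string what;
        handler continuation;  // stored until the operation completes
    };

    struct mode_impl {
        std::vector<pending_op> pending;
        // ... mode-specific processing that eventually invokes continuations ...
    };

    // Swap implementations without the caller noticing: the stored
    // continuations simply move to the new object and complete from there.
    void switch_mode(mode_impl& from, mode_impl& to) {
        for (auto& op : from.pending)
            to.pending.push_back(std::move(op));
        from.pending.clear();
    }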
On Wed, Jan 15, 2014 at 4:02 PM, Hartmut Kaiser
IMHO, Boost.Fiber is a library which - unlike other Boost libraries - has not been developed as a prototype for a particular API (in which case I'd be all for accepting subpar performance). It clearly has been developed to provide a higher performing implementation for an existing API. That means that if Oliver is not able to demonstrate superior performance over existing implementations, I wouldn't see any point in having the library in Boost in the first place.
Strongly disagree with your assumption. To me it's the semantics of Boost.Fiber that matter. Before launching any code on a new thread, both Boost.Thread and std::thread require that you sanitize that code against potential race conditions. With a large, ancient code base, that sanitizing effort becomes almost prohibitive. Running one thread with multiple fibers is guaranteed to introduce no new race conditions. Emulating the std::thread API is intended to minimize coder confusion -- not to provide a drop-in replacement.
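[A minimal sketch of that claim, written against the Boost.Fiber API as later published (details of the version under review may differ): two fibers mutate shared state with no mutex, because on a single thread a context switch can only happen at a yield point.]

    #include <boost/fiber/all.hpp>
    #include <cassert>

    int counter = 0;  // shared state, deliberately unprotected

    void bump() {
        for (int i = 0; i < 1000; ++i) {
            ++counter;                   // never torn: no preemption mid-statement
            boost::this_fiber::yield();  // switches happen only here
        }
    }

    int main() {
        boost::fibers::fiber f1(bump), f2(bump);
        f1.join();
        f2.join();
        assert(counter == 2000);  // always holds: one thread, cooperative switches
    }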
IMHO, Boost.Fiber is a library which - unlike other Boost libraries - has not been developed as a prototype for a particular API (in which case I'd be all for accepting subpar performance). It clearly has been developed to provide a higher performing implementation for an existing API. That means that if Oliver is not able to demonstrate superior performance over existing implementations, I wouldn't see any point in having the library in Boost in the first place.
Strongly disagree with your assumption. To me it's the semantics of Boost.Fiber that matter.
The semantics are well defined by the C++11 Standard, no news here.
Before launching any code on a new thread, both Boost.Thread and std::thread require that you sanitize that code against potential race conditions. With a large, ancient code base, that sanitizing effort becomes almost prohibitive. Running one thread with multiple fibers is guaranteed to introduce no new race conditions.
If that's the case, then why does Boost.Fiber provide synchronization primitives to synchronize between fibers and support for work stealing? Or why is it implemented using atomics all over the place? Your statement does not make sense to me, sorry.
Emulating the std::thread API is intended to minimize coder confusion -- not to provide a drop-in replacement.
That's only one particular use case. Clearly we're talking about two different viewpoints where mine focusses on a very broad application for this kind of library, while yours is trying to limit it to a small set of use cases. While I accept the validity of these use cases I think it's not worth having a full Boost library just for those.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
On Wed, Jan 15, 2014 at 6:03 PM, Hartmut Kaiser
Running one thread with multiple fibers is guaranteed to introduce no new race conditions.
If that's the case, then why does Boost.Fiber provide synchronization primitives to synchronize between fibers and support for work stealing? Or why is it implemented using atomics all over the place? Your statement does not make sense to me, sorry.
Because the library is more general than either of our individual use cases. It addresses the scenario in which you need to coordinate a fiber in one thread with a fiber (perhaps the one and only fiber) running in another thread.
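[A hedged sketch of that cross-thread scenario, again using the Boost.Fiber API as later published; whether this exact pattern works in the version under review is an assumption. A fiber in one thread signals a fiber in another through the fiber-level primitives, which suspend only the waiting fiber, not its carrier thread's scheduler:]

    #include <boost/fiber/all.hpp>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    boost::fibers::mutex mtx;
    boost::fibers::condition_variable cv;
    bool ready = false;

    int main() {
        std::thread producer([] {
            boost::fibers::fiber([] {
                std::unique_lock<boost::fibers::mutex> lk(mtx);
                ready = true;
                cv.notify_one();  // wakes a fiber living in the other thread
            }).join();
        });

        boost::fibers::fiber consumer([] {
            std::unique_lock<boost::fibers::mutex> lk(mtx);
            cv.wait(lk, [] { return ready; });  // suspends this fiber only
            std::puts("signalled across threads");
        });

        consumer.join();
        producer.join();
    }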
Emulating the std::thread API is intended to minimize coder confusion -- not to provide a drop-in replacement.
That's only one particular use case. Clearly we're talking about two different viewpoints where mine focusses on a very broad application for this kind of library, while yours is trying to limit it to a small set of use cases. While I accept the validity of these use cases I think it's not worth having a full Boost library just for those.
Hmm! With respect, it sounds to me as though you're saying: my use case is important, yours is not. You're saying that for your use case, performance is critical, and performance would be the only reason you would choose Boost.Fiber over other libraries available to you. I can respect that. I'm saying that for my use case, performance is not critical, and there is nothing in the standard library or presently in Boost that addresses my semantic requirements. That sounds like reason enough to have a Boost library.
That's only one particular use case. Clearly we're talking about two different viewpoints where mine focusses on a very broad application for this kind of library, while yours is trying to limit it to a small set of use cases. While I accept the validity of these use cases I think it's not worth having a full Boost library just for those.
Hmm! With respect, it sounds to me as though you're saying: my use case is important, yours is not.
Please don't start splitting hairs. I said 'While I accept the validity of these use cases' after all.
You're saying that for your use case, performance is critical, and performance would be the only reason you would choose Boost.Fiber over other libraries available to you. I can respect that.
I'm saying that for my use case, performance is not critical, and there is nothing in the standard library or presently in Boost that addresses my semantic requirements. That sounds like reason enough to have a Boost library.
Fine. So you vote YES and I vote NO. What's the problem?

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
On 15/01/2014 23:29, Hartmut Kaiser wrote:
> Fine. So you vote YES and I vote NO. What's the problem?

It depends on whether the vote is just selfish, or reflects that:
- the presence of an additional not-quite-what-you-need library doesn't negatively impact you
- it is valuable for others and technically reasonable

I would hope that votes are not entirely selfish. Under the circumstances I would think you could abstain, but unless there is some reason for its presence to cause a problem, why would you Nack?
> On 15/01/2014 23:29, Hartmut Kaiser wrote:
> > Fine. So you vote YES and I vote NO. What's the problem?
> It depends on whether the vote is just selfish, or reflects that:
> - the presence of an additional not-quite-what-you-need library doesn't negatively impact you
> - it is valuable for others and technically reasonable
>
> I would hope that votes are not entirely selfish.
>
> Under the circumstances I would think you could abstain, but unless there is some reason for its presence to cause a problem, why would you Nack?

With all due respect, I've been contributing to Boost for over 10 years now, and I have 4 major libraries in Boost that I'm authoring/contributing to. I have managed numerous Boost reviews in the past. I'm a member of the Boost steering committee, and I'm one of the main organizers of BoostCon and C++Now. I have invested more time into Boost than you can even start to imagine. Nobody has so far alleged that I am selfish with regards to Boost. I'm very much inclined to ask you to rethink what you wrote.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
2014/1/14 Oliver Kowalke
I did a quick hack, and the code using fibers is 2-3 times faster than the thread-based version. Boost.Fiber does not yet contain the suggested optimizations (like replacing the STL containers).
Not bad! Did the tests run on Linux or Windows? -- Best regards, Antony Polukhin
2014/1/15 Antony Polukhin
2014/1/14 Oliver Kowalke
I did a quick hack, and the code using fibers is 2-3 times faster than the thread-based version. Boost.Fiber does not yet contain the suggested optimizations (like replacing the STL containers).
Not bad! Did the tests run on Linux or Windows?
I've tested it on Linux (32bit/64bit); Windows will follow this evening. Because the code is not optimized (memory allocations in context/coroutine and fiber), it will not be competitive with qthreads, TBB, or HPX yet. But this was not the main aim - the lib tries to integrate with other Boost libs and to support new use cases (for instance, code written like its synchronous counterpart but using asynchronous operations).
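[For reference, a sketch of the kind of micro-benchmark discussed in this thread, written against the Boost.Fiber API as later published; the numbers will vary with allocator and scheduler:]

    #include <boost/fiber/all.hpp>
    #include <chrono>
    #include <cstdio>

    int main() {
        constexpr int N = 500000;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < N; ++i) {
            boost::fibers::fiber f([] {});  // empty fiber function
            f.join();  // create, schedule, run, and destroy one fiber
        }
        auto stop = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      stop - start).count();
        std::printf("avg overhead per fiber: %lld ns\n",
                    static_cast<long long>(ns / N));
    }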
It would be interesting to see this number when giving 1..N cores to the scheduler.
Things like contention caused by the work stealing, or NUMA effects such as when you start stealing across NUMA domains, usually overshadow the memory allocation costs. Additionally, the quality of the scheduler implementation affects things significantly.
You might want to compare the performance of your library with other existing solutions (for instance TBB, qthreads, openmp, HPX). The link I provided above will give you a set of trivial tests for those. Moreover, we'd be happy to add an equivalent test for your library to our repository.
After re-reading, I have the impression that there is a misunderstanding.
I hope not.
Boost.Fiber is a thin wrapper over coroutines (each fiber contains one coroutine) - the library schedules and synchronizes fibers (as requested on the developer list in 2013) within one thread. The fibers in this lib are agnostic of threads - I've only added some support so that the classes (mutex, condition_variable) can be used in a multi-threaded context. Combining fibers with threads should be done in another, more sophisticated library (at a higher level).
I believe you can't and shouldn't compare fibers with qthreads, TBB or openmp. I'll write a test measuring the overhead of a fiber running in one thread (as already described above) first.
I beg to disagree. Surely, you run fibers on top of OS threads (in your case using the coroutines mechanism). However, every fiber is semantically indistinguishable from a std::thread (if implemented properly). It has a dedicated function to execute, it represents a context of execution, you can synchronize it with other fibers, etc. In fact, nothing in the C++ Standard implies that a std::thread has to be implemented using OS (kernel) threads, which is why we decided to name our lightweight tasks 'hpx::thread' - they expose 100% of the mandated interface for std::thread.

If you run on several cores (OS threads), you start executing your fibers concurrently. AFAIU, your library is clearly designed for this, otherwise you wouldn't have implemented special, fiber-oriented synchronization primitives or work stealing capabilities.

To clarify, I'm not talking about measuring the performance of (kernel) threads; rather, I would like for you to give us performance data for Boost.Fiber so we can understand what overheads are imposed by using fibers in the first place. Quantitative numbers alone do not mean anything beyond a single machine, which is why I was suggesting to run equivalent performance benchmarks using other, similar libraries, such as TBB, openmp, HPX, etc., as this would give a qualitative picture regardless of the machine the tests are run on. And the libraries I listed clearly implement a semantically equivalent idiom: lightweight parallelism (be it a task in TBB, a fiber in Boost.Fiber, a hpx::thread, or a qthread, etc.).

Hope this clarifies what I had in mind.

Regards Hartmut
---------------
http://boost-spirit.com
http://stellar.cct.lsu.edu
participants (14)
- Andreas Schäfer
- Antony Polukhin
- Bjorn Reese
- Daniel James
- Dean Michael Berris
- Gavin Lambert
- Giovanni Piero Deretta
- Hartmut Kaiser
- james
- Nat Goodspeed
- Niall Douglas
- Oliver Kowalke
- Peter Dimov
- Thomas Heller