[graph][heap][coroutine] interruptible dijkstra shortest path function

[apologies for double posting, sent it to the wrong list before]

Hi,

A few months ago I posted here to gauge interest in an interruptible and resumable version of the dijkstra_shortest_path algorithm. The initial response was positive, but Jeremiah considered the proposed solution lacking compared to an algorithm object model that inverts the control over the algorithm. In addition to my original solution, I now have two implementations of the algorithm object model.

1. Visitor-based interruption. This is my original approach. It separates the algorithm into an init and a main function. The main function asks the visitor after each iteration whether it should halt. After halting, the main function can be used to resume the calculation.

    auto dijkstra = make_dijkstra_object(graph, named_parameters);
    dijkstra.init_from_sources(sources);
    dijkstra.expand(interruptor);
    dijkstra.expand(other_interruptor);
    auto distance_map = dijkstra.get<vertex_distance_t>();
    auto color_map = dijkstra.get<vertex_color_t>();

2. Algorithm object implementation. This was suggested by Jeremiah. The algorithm object iterates from control point to control point in the algorithm, giving access at each stop to the intermediate results. There are two implementations:

a) Using the [Coroutine] library that is currently under review to step in and out of the function.
b) Fragmenting the original algorithm into a number of small interlinked functions, each starting at one control point and returning at the next.

The interface of a and b is exactly the same, and it works like this:

    auto dijkstra = make_dijkstra_object(graph, sources, named_parameters);
    while(dijkstra.next()) {
      control_point cp = dijkstra.get_control_point();
      vertex u = dijkstra.get_u();
      auto distance_map = dijkstra.get<vertex_distance_t>();
      if(cp == cp_finish_vertex && get(distance_map, u) > 42) {
        cout << "far enough" << endl;
        break;
      }
    }
    auto distance_map = dijkstra.get<vertex_distance_t>();

The algorithm object offers the most flexibility for interweaving multiple searches. The visitor-based solution, on the other hand, is more efficient: it has no loss in performance compared to the existing dijkstra functions. The coroutine algorithm object was extremely easy to implement, but it is not very efficient. Does anybody have a suggestion for how that can be improved?

Besides these interruptible and resumable shortest path searches, there are a few more advantages to the implementation.

a. All dijkstra parameters are bundled in a light-weight dijkstra_state object that is returned by algorithms, giving direct access to all results of the algorithm, for instance:

    auto dijkstra_state = dijkstra_shortest_paths(graph, sources, no_named_parameters());
    auto distance_map = dijkstra_state.get<vertex_distance_t>();

b. Named parameters (old style) are also used for the priority queue and the color map.

c. Boost.Heap mutable queues can be used with Boost.Graph via the named parameter interface. It doesn't help performance now, but may be a good base for experimentation.

d. Separation of the dijkstra shortest path algorithm into init and main functions makes it easier to provide custom initialization, for instance multiple sources, or re-initializing only changed vertices.

Can anybody use this? The latest version is here: http://people.pwf.cam.ac.uk/ahh34/dijkstra.zip

Kind regards, Alex

p.s. below are timings for a large graph: bimba_6M.obj...

    2090 ms. Boost Graph Library
    2121 ms. With interruption visitor
    5601 ms. With Coroutine (*)
    2621 ms. With fragmented functions (*)
    2840 ms. With Boost.Heap

(*) halting at all control points of the dijkstra visitor
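For concreteness, a minimal sketch of what an interruptor passed to expand() might look like, judging from the interface above (make_dijkstra_object and expand() come from the post; the nullary-predicate signature and the name count_interruptor are my assumptions):

    // Hypothetical interruptor: asks the algorithm to halt after a fixed
    // number of settled vertices; calling expand() again resumes the search.
    struct count_interruptor {
        std::size_t budget;
        explicit count_interruptor(std::size_t n) : budget(n) {}
        bool operator()() {              // true => please halt now
            if (budget == 0) return true;
            --budget;
            return false;
        }
    };

    // dijkstra.expand(count_interruptor(1000));  // settle ~1000 vertices
    // dijkstra.expand(count_interruptor(1000));  // resume for ~1000 more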

c. Boost.Heap mutable queues can be used with Boost.Graph via the named parameter interface. It doesn't help performance now, but may be a good base for experimentation.
afaict, boost.heap should perform worse than the bgl heaps. its implementation is much more generic, while bgl heaps are optimized by assuming integer keys.

tim

On Sun, 23 Sep 2012, Tim Blechmann wrote:
c. Boost.Heap mutable queues can be used with Boost.Graph via the named parameter interface. It doesn't help performance now, but may be a good base for experimentation.
afaict, boost.heap should perform worse than the bgl heaps. its implementation is much more generic, while bgl heaps are optimized by assuming integer keys.
Actually, the BGL heaps can use arbitrary keys, as long as there is a way to map them to integers (which might involve some other lookup table).

-- Jeremiah Willcock
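For readers unfamiliar with that pattern, a sketch of such a lookup table using a standard Boost property map (the string key type and its use as a heap index map are illustrative, not from the thread):

    #include <map>
    #include <string>
    #include <boost/property_map/property_map.hpp>

    int main()
    {
        // Arbitrary keys (strings here) mapped to the integers a BGL-style
        // heap can index its internal arrays by.
        std::map<std::string, std::size_t> table;
        boost::associative_property_map<
            std::map<std::string, std::size_t> > index_map(table);

        put(index_map, "vertex_a", 0);               // record a key's index
        std::size_t i = get(index_map, "vertex_a");  // look it up again
        (void)i;
        return 0;
    }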

On 23/09/2012 17:25, Jeremiah Willcock wrote:
On Sun, 23 Sep 2012, Tim Blechmann wrote:
c. Boost.Heap mutable queues can be used with Boost.Graph via the named parameter interface. It doesn't help performance now, but may be a good base for experimentation.
afaict, boost.heap should perform worse than the bgl heaps. its implementation is much more generic, while bgl heaps are optimized by assuming integer keys.
Actually, the BGL heaps can use arbitrary keys, as long as there is a way to map them to integers (which might involve some other lookup table).
-- Jeremiah Willcock
The main difference that I see is in the storage type that is used to get handles for values and to retrieve values from handles. The Boost Graph Library typically uses an array of values (hidden in a shared_array_property_map), where the index in the array is the handle. That is possible because the number of values is limited by the number of vertices in the graph.

Boost.Heap uses a std::list, where the handle is an iterator to an element in the list and the value is stored in the list. This makes sense because for Boost.Heap the number of values may be infinite, and an array for all values would explode.

However, it also seems that a list provides more functionality and overhead than necessary, as it keeps its values sorted. I would therefore expect that using an unordered list instead of a std::list would improve performance of Boost.Heap without sacrificing generality.

On Sep 23, 2012, at 7:04 PM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
On 23/09/2012 17:25, Jeremiah Willcock wrote:
On Sun, 23 Sep 2012, Tim Blechmann wrote:
c. Boost.Heap mutable queues can be used with Boost.Graph via the named parameter interface. It doesn't help performance now, but may be a good base for experimentation.
afaict, boost.heap should perform worse than the bgl heaps. its implementation is much more generic, while bgl heaps are optimized by assuming integer keys.
Actually, the BGL heaps can use arbitrary keys, as long as there is a way to map them to integers (which might involve some other lookup table).
-- Jeremiah Willcock
The main difference that I see is in the storage type that is used to get handles for values and to retrieve values from handles. The boost graph library typically uses an array of values (hidden in a shared_array_property_map) where the index in the array is the handle. That is possible because the number of values is limited by the number of vertices in the graph.
Boost heap uses a std::list, where the handle is an iterator to an element in the list and the value is stored in the list. This makes sense because for Boost.Heap the number of values may be infinite, and an array for all values would explode.
However, it also seems that a list provides more functionality and overhead than necessary as it keeps its values sorted. I would therefore expect that using an unordered list instead of a std::list would improve performance of boost.heap without sacrificing generality.
Are you thinking of set/map versus unordered_set/map? list isn't sorted by default.

___
Rob

On 24/09/2012 10:21, Rob Stewart wrote:
On Sep 23, 2012, at 7:04 PM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
On 23/09/2012 17:25, Jeremiah Willcock wrote:
On Sun, 23 Sep 2012, Tim Blechmann wrote:
c. Boost.Heap mutable queues can be used with Boost.Graph via the named parameter interface. It doesn't help performance now, but may be a good base for experimentation.
afaict, boost.heap should perform worse than the bgl heaps. its implementation is much more generic, while bgl heaps are optimized by assuming integer keys.
Actually, the BGL heaps can use arbitrary keys, as long as there is a way to map them to integers (which might involve some other lookup table).
-- Jeremiah Willcock
The main difference that I see is in the storage type that is used to get handles for values and to retrieve values from handles. The boost graph library typically uses an array of values (hidden in a shared_array_property_map) where the index in the array is the handle. That is possible because the number of values is limited by the number of vertices in the graph.
Boost heap uses a std::list, where the handle is an iterator to an element in the list and the value is stored in the list. This makes sense because for Boost.Heap the number of values may be infinite, and an array for all values would explode.
However, it also seems that a list provides more functionality and overhead than necessary as it keeps its values sorted. I would therefore expect that using an unordered list instead of a std::list would improve performance of Boost.Heap without sacrificing generality.
Are you thinking of set/map versus unordered_set/map? list isn't sorted by default.
It is not sorted, but it is ordered: the iterators go through the elements in the same order that they are added to the list. For the sole purpose of handle fetching and retrieving, that is not a necessary requirement, so there is potential for some gains to be made (probably minute). In particular, the insert() and erase() functions can be simpler.

Boost heap uses a std::list, where the handle is an iterator to an element in the list and the value is stored in the list. This makes sense because for Boost.Heap the number of values may be infinite, and an array for all values would explode.
However, it also seems that a list provides more functionality and overhead than necessary as it keeps its values sorted. I would therefore expect that using an unordered list instead of a std::list would improve performance of Boost.Heap without sacrificing generality.
Are you thinking of set/map versus unordered_set/map? list isn't sorted by default.
It is not sorted, but it is ordered: the iterators go through the elements in the same order that they are added to the list. For the sole purpose of handle fetching and retrieving, that is not a necessary requirement, so there is potential for some gains to be made (probably minute). In particular, the insert() and erase() functions can be simpler.
std::list is only used as a dumb container for implementing mutable d-ary heaps. it is neither ordered, nor does it use insert(); it is basically a way to achieve mutability by managing a heap of std::list iterators. or what are you referring to?

tim
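As I read Tim's description, the arrangement looks roughly like this (an illustration only, not Boost.Heap's actual code; iterator_heap_sketch is a made-up name):

    #include <list>
    #include <vector>

    // Mutability via a heap of std::list iterators: the list gives every
    // value a stable handle (a list iterator stays valid across unrelated
    // insertions and erasures), and the d-ary heap reorders the iterators,
    // never the values themselves.
    template <class T>
    struct iterator_heap_sketch {
        typedef typename std::list<T>::iterator handle_type;
        std::list<T> storage;            // dumb container, order irrelevant
        std::vector<handle_type> heap;   // heap-ordered iterators

        handle_type push(const T& v)
        {
            storage.push_front(v);       // per the thread, push_front is used
            heap.push_back(storage.begin());
            // sift-up to restore the heap property omitted for brevity
            return storage.begin();      // valid until the element is erased
        }
    };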

On 24/09/2012 11:44, Tim Blechmann wrote:
Boost heap uses a std::list, where the handle is an iterator to an element in the list and the value is stored in the list. This makes sense because for Boost.Heap the number of values may be infinite, and an array for all values would explode.
However, it also seems that a list provides more functionality and overhead than necessary as it keeps its values sorted. I would therefore expect that using an unordered list instead of a std::list would improve performance of Boost.Heap without sacrificing generality.
Are you thinking of set/map versus unordered_set/map? list isn't sorted by default.
It is not sorted, but it is ordered: the iterators go through the elements in the same order that they are added to the list. For the sole purpose of handle fetching and retrieving, that is not a necessary requirement, so there is potential for some gains to be made (probably minute). In particular, the insert() and erase() functions can be simpler.
std::list is only used as a dumb container for implementing mutable d-ary heaps. it is neither ordered,
http://www.cplusplus.com/reference/stl/list/
nor does it use insert(),
Heap does not use insert() but push_front(); it does use erase().
it is basically a way to achieve mutability by managing a heap of std::list iterators. or what are you referring to?
tim
I am just interested to see the fundamental difference between the Boost.Graph d_ary_heap and the Boost.Heap d_ary_heap, because the statement that Graph is optimized for integer values does not seem correct. I noticed that Graph manages the handles through an array and Heap through a std::list, and considered that there might be some difference in performance because of that. Probably the difference is minute, but at the same time it is the only place where I see that Graph makes an assumption (the total number of values pushed onto the heap is known) that Heap cannot make.

On Sep 24, 2012, at 7:07 AM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
I am just interested to see the fundamental difference between the Boost.Graph d_ary_heap and the Boost.Heap d_ary_heap, because the statement that Graph is optimized for integer values does not seem correct. I noticed that Graph manages the handles through an array and Heap through a std::list, and considered that there might be some difference in performance because of that.
Probably the difference is minute, but at the same time it is the only place where I see that Graph makes an assumption (the total number of values pushed onto the heap is known) that Heap cannot make.
Locality of reference and a single free store allocation are sufficient reasons for Graph's implementation to be faster. Perhaps a deque would be better than a list for Heap.

___
Rob

I am just interested to see the fundamental difference between the Boost.Graph d_ary_heap and the Boost.Heap d_ary_heap, because the statement that Graph is optimized for integer values does not seem correct. I noticed that Graph manages the handles through an array and Heap through a std::list, and considered that there might be some difference in performance because of that.
Probably the difference is minute, but at the same time it is the only place where I see that Graph makes an assumption (the total number of values pushed onto the heap is known) that Heap cannot make.
Locality of reference and a single free store allocation are sufficient reasons for Graph's implementation to be faster. Perhaps a deque would be better than a list for Heap.
the implementation assumes that std::list iterators are always valid as long as the referred object is part of the list. does this property apply to deque?

On Sep 24, 2012, at 9:31 AM, Tim Blechmann <tim@klingt.org> wrote:
I am just interested to see the fundamental difference between the Boost.Graph d_ary_heap and the Boost.Heap d_ary_heap, because the statement that Graph is optimized for integer values does not seem correct. I noticed that Graph manages the handles through an array and Heap through a std::list, and considered that there might be some difference in performance because of that.
Probably the difference is minute, but at the same time it is the only place where I see that Graph makes an assumption (the total number of values pushed onto the heap is known) that Heap cannot make.
Locality of reference and a single free store allocation are sufficient reasons for Graph's implementation to be faster. Perhaps a deque would be better than a list for Heap.
the implementation assumes that std::list iterators are always valid as long as the referred object is part of the list. does this property apply to deque?
As long as you don't erase from the middle, which I would guess you probably do.
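The standard's guarantees make this precise: inserting at either end of a deque preserves references to existing elements but invalidates all iterators, so iterator-valued handles into a deque would dangle even without erasing from the middle. A small illustration:

    #include <deque>

    int main()
    {
        std::deque<int> d(3, 0);
        std::deque<int>::iterator it = d.begin();
        d.push_back(1);   // the element still exists, but 'it' is now invalid
        // *it;           // undefined behavior: iterators, unlike references,
                          // do not survive insertion at the ends of a deque
        return 0;
    }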

On 24/09/2012 14:25, Rob Stewart wrote:
On Sep 24, 2012, at 7:07 AM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
I am just interested to see the fundamental difference between the Boost.Graph d_ary_heap and the Boost.Heap d_ary_heap, because the statement that Graph is optimized for integer values does not seem correct. I noticed that Graph manages the handles through an array and Heap through a std::list, and considered that there might be some difference in performance because of that.
Probably the difference is minute, but at the same time it is the only place where I see that Graph makes an assumption (the total number of values pushed onto the heap is known) that Heap cannot make.
Locality of reference and a single free store allocation are sufficient reasons for Graph's implementation to be faster. Perhaps a deque would be better than a list for Heap.
The single free store allocation does make a difference, of course. I don't really understand how locality of reference plays a role. Are both options not equally local?

On Sep 24, 2012, at 11:27 AM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
On 24/09/2012 14:25, Rob Stewart wrote:
On Sep 24, 2012, at 7:07 AM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
I am just interested to see the fundamental difference between the Boost.Graph d_ary_heap and the Boost.Heap d_ary_heap, because the statement that Graph is optimized for integer values does not seem correct. I noticed that Graph manages the handles through an array and Heap through a std::list, and considered that there might be some difference in performance because of that.
Probably the difference is minute, but at the same time it is the only place where I see that Graph makes an assumption (the total number of values pushed onto the heap is known) that Heap cannot make.
Locality of reference and a single free store allocation are sufficient reasons for Graph's implementation to be faster. Perhaps a deque would be better than a list for Heap.
The single free store allocation does make a difference, of course. I don't really understand how locality of reference plays a role. Are both options not equally local?
No. Lists are node-based, so each element is from a different part of the free store. By contrast, in an array, the elements are in contiguous memory. That difference affects the speed at which the elements can be accessed due to the behaviors of memory, MMUs, and caching.

___
Rob
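In code terms, the contrast Rob describes looks like this (illustrative only):

    #include <list>
    #include <vector>

    int main()
    {
        // Array-backed handles: one allocation, contiguous memory; data[h]
        // is a single indexed load that caches and prefetchers handle well.
        std::vector<int> data(1000, 0);
        int a = data[42];

        // Node-based handles: each element is a separately allocated node,
        // so dereferencing a handle chases a pointer to wherever the free
        // store happened to place that node.
        std::list<int> nodes(1000, 0);
        std::list<int>::iterator h = nodes.begin();
        int b = *h;

        (void)a; (void)b;
        return 0;
    }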

On 26/09/2012 10:37, Rob Stewart wrote:
On Sep 24, 2012, at 11:27 AM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
On 24/09/2012 14:25, Rob Stewart wrote:
On Sep 24, 2012, at 7:07 AM, Alex Hagen-Zanker <ahh34@cam.ac.uk> wrote:
I am just interested to see the fundamental difference between the Boost.Graph d_ary_heap and the Boost.Heap d_ary_heap, because the statement that Graph is optimized for integer values does not seem correct. I noticed that Graph manages the handles through an array and Heap through a std::list, and considered that there might be some difference in performance because of that.
Probably the difference is minute, but at the same time it is the only place where I see that Graph makes an assumption (the total number of values pushed onto the heap is known) that Heap cannot make.
Locality of reference and a single free store allocation are sufficient reasons for Graph's implementation to be faster. Perhaps a deque would be better than a list for Heap.
The single free store allocation does make a difference, of course. I don't really understand how locality of reference plays a role. Are both options not equally local?
No. Lists are node-based, so each element is from a different part of the free store. By contrast, in an array, the elements are in contiguous memory. That difference affects the speed at which the elements can be accessed due to the behaviors of memory, MMUs, and caching.
I see. That would make the case for using an unordered list, which does not need to be node-based. I gave it a try out of interest, and it neither hurt nor helped (http://people.ds.cam.ac.uk/ahh34/unordered_list.hpp). It is something similar to this:

    std::vector<T> data;
    std::vector<std::size_t> open_slots;

    std::size_t get_handle(T value)
    {
      if (open_slots.empty()) {
        data.push_back(value);
        return data.size() - 1;                   // new slot at the back
      } else {
        std::size_t handle = open_slots.back();   // reuse a freed slot
        open_slots.pop_back();
        data[handle] = value;
        return handle;
      }
    }

    T get_value(std::size_t handle)
    {
      return data[handle];
    }

    void erase(std::size_t handle)
    {
      open_slots.push_back(handle);  // just record the slot as free
    }

on Sat Sep 22 2012, Alex Hagen-Zanker <ahh34-AT-cam.ac.uk> wrote:
[apologies for double posting, sent it to the wrong list before] Hi,
A few months ago I posted here to gauge interest in an interruptible and resumable version of the dijkstra_shortest_path algorithm. The initial response was positive, but Jeremiah considered the proposed solution lacking compared to an algorithm object model that inverts the control over the algorithm.
In addition to my original solution, I now have two implementations of the algorithm object model.
... Have you seen http://cpp-next.com/archive/2012/09/portable-stackless-coroutines-in-one-header/ ?

This approach might be sufficiently lightweight that it would allow us to maintain one version of the algorithm with some macros.

--
Dave Abrahams
BoostPro Computing
Software Development Training
http://www.boostpro.com
Clang/LLVM/EDG Compilers  C++  Boost

On 05/10/2012 02:36, Dave Abrahams wrote:
on Sat Sep 22 2012, Alex Hagen-Zanker <ahh34-AT-cam.ac.uk> wrote:
[apologies for double posting, sent it to the wrong list before] Hi,
A few months ago I posted here to gauge interest in an interruptible and resumable version of the dijkstra_shortest_path algorithm. The initial response was positive, but Jeremiah considered the proposed solution lacking compared to an algorithm object model that inverts the control over the algorithm.
In addition to my original solution, I now have two implementations of the algorithm object model. ...
Have you seen http://cpp-next.com/archive/2012/09/portable-stackless-coroutines-in-one-header/ ?
This approach might be sufficiently lightweight that it would allow us to maintain one version of the algorithm with some macros.
It is very nice indeed. I did not try it with the shortest path algorithm, but instead a simple sum algorithm where yield is used to get the running total. Using the stackless boost asio coroutine from the link you provided is just as simple as using the proposed boost coroutine.

Summing 10,000,000 times the number 2 while yielding intermediate sums gave me the following results:

    sum plain : 20000000 time: 0.01
    sum fragmented : 20000000 time: 0.232
    sum boost coro : 20000000 time: 0.448
    sum asio coro : 20000000 time: 0.034

Without yielding intermediate results, I got the following:

    sum plain : 20000000 time: 0.008
    sum fragmented : 20000000 time: 0.029
    sum boost coro : 20000000 time: 0.027
    sum asio coro : 20000000 time: 0.012

I'd say it is definitely worth investigating the use of the boost asio method as a means of creating algorithm objects. From this example it appears that:

1. It is an efficient way of interrupting and resuming algorithms.
2. It leaves the algorithms coded in a readable state, although it seems that all local variables should become member variables.
3. If you decide not to interrupt the algorithm, the additional cost due to being part of an algorithm object is less than for the alternatives.

I am not working on this anymore, but please see the source for the above trial in the attachment.

Kind regards, Alex
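For readers who have not seen the article: the trick is a switch statement whose case labels sit inside the loop body, so a stored state variable selects the resume point. A sketch in the spirit of that technique, applied to the sum benchmark (this is my own illustration, not Alex's attached source; sum_coro is a made-up name):

    // Hand-rolled stackless coroutine yielding a running sum. Locals that
    // must survive a yield become members, as noted in point 2 above.
    struct sum_coro {
        int state;
        long long total, n;
        sum_coro() : state(0), total(0), n(0) {}

        // Returns true at each yield (running total in 'total'), false when done.
        bool next(long long count)
        {
            switch (state) {
            case 0:
                for (n = 0; n < count; ++n) {
                    total += 2;
                    state = 1; return true;   // "yield"
            case 1: ;                         // resume point inside the loop
                }
            }
            return false;
        }
    };

    // Usage: sum_coro c; while (c.next(10000000)) { /* inspect c.total */ }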

On 05.10.2012 14:07, Alex Hagen-Zanker wrote:
Have you seen http://cpp-next.com/archive/2012/09/portable-stackless-coroutines-in-one-header/ ?
This approach might be sufficiently lightweight that it would allow us to maintain one version of the algorithm with some macros.
It is very nice indeed. I did not try it with the shortest path algorithm, but instead a simple sum algorithm where yield is used to get the running total.
Using the stackless boost asio coroutine from the link you provided is just as simple as using the proposed boost coroutine.
Summing 10,000,000 times the number 2 while yielding intermediate sums gave me the following results:
sum plain : 20000000 time: 0.01
sum fragmented : 20000000 time: 0.232
sum boost coro : 20000000 time: 0.448
sum asio coro : 20000000 time: 0.034
Without yielding intermediate results, I got the following:
sum plain : 20000000 time: 0.008
sum fragmented : 20000000 time: 0.029
sum boost coro : 20000000 time: 0.027
sum asio coro : 20000000 time: 0.012
I'd say it is definitely worth investigating the use of the boost asio method as a means of creating algorithm objects. From this example it appears that:
1. It is an efficient way of interrupting and resuming algorithms. 2. It leaves the algorithms coded in a readable state, although it seems that all local variables should become member variables. 3. If you decide not to interrupt the algorithm, the additional cost due to being part of an algorithm object is less than for the alternatives.
I am not working on this anymore, but please see the source for the above trial in the attachment.
did you compile the code with optimization switched on? preserving fpu state has a significant influence on performance too (at least for boost.coroutine I measure a significant influence).

stack-less coroutines (preprocessor trick with switch statement) will always be faster than stack-full coroutines (boost.coroutine), because stack-full coroutines have to preserve the stack frame and some registers, and because exchanging the stack and instruction pointer defeats the CPU's predictions (shadow stack pointer ...).

On Fri, Oct 5, 2012 at 4:32 PM, Oliver Kowalke <oliver.kowalke@gmx.de> wrote:
On 05.10.2012 14:07, Alex Hagen-Zanker wrote:
Have you seen
http://cpp-next.com/archive/2012/09/portable-stackless-coroutines-in-one-header/ ?
This approach might be sufficiently lightweight that it would allow us to maintain one version of the algorithm with some macros.
It is very nice indeed. I did not try it with the shortest path algorithm, but instead a simple sum algorithm where yield is used to get the running total.
Using the stackless boost asio coroutine from the link you provided is just as simple as using the proposed boost coroutine.
Summing 10,000,000 times the number 2 while yielding intermediate sums gave me the following results:
sum plain : 20000000 time: 0.01
sum fragmented : 20000000 time: 0.232
sum boost coro : 20000000 time: 0.448
sum asio coro : 20000000 time: 0.034
Without yielding intermediate results, I got the following:
sum plain : 20000000 time: 0.008
sum fragmented : 20000000 time: 0.029
sum boost coro : 20000000 time: 0.027
sum asio coro : 20000000 time: 0.012
I'd say it is definitely worth investigating the use of the boost asio method as a means of creating algorithm objects. From this example it appears that:
1. It is an efficient way of interrupting and resuming algorithms. 2. It leaves the algorithms coded in a readable state, although it seems that all local variables should become member variables. 3. If you decide not to interrupt the algorithm, the additional cost due to being part of an algorithm object is less than for the alternatives.
I am not working on this anymore, but please see the source for above trial in the attachment.
did you compile the code with optimization switched on? preserving fpu state has a significant influence on performance too (at least for boost.coroutine I measure a significant influence).
stack-less coroutines (preprocessor trick with switch statement) will always be faster than stack-full coroutines (boost.coroutine), because stack-full coroutines have to preserve the stack frame and some registers, and because exchanging the stack and instruction pointer defeats the CPU's predictions (shadow stack pointer ...).
That's true; the cost could be reduced by providing a buffered coroutine variant (i.e. stack switching is actually done only after n yields); the appropriateness of buffering depends on the use case.

But the biggest reason for the performance difference is that the stackless coroutine code can be completely inlined inside the caller, as the compiler can "see through" the yield and resume. That's the reason why, even without yielding, the stackless code is faster. Having inlineable stack-full coroutines would require heroic compiler help.

--
gpd

On 05.10.2012 17:47, Giovanni Piero Deretta wrote:
But the biggest reason for the performance difference is that the stackless coroutine code can be completely inlined inside the caller, as the compiler can "see through" the yield and resume. That's the reason why, even without yielding, the stackless code is faster. Having inlineable stack-full coroutines would require heroic compiler help.
heroic? :^) I'm thinking about providing inline asm in boost.context for compilers which support it - at first it will be gcc.

Oliver

On 05/10/2012 16:32, Oliver Kowalke wrote:
On 05.10.2012 14:07, Alex Hagen-Zanker wrote:
Have you seen http://cpp-next.com/archive/2012/09/portable-stackless-coroutines-in-one-header/ ?
This approach might be sufficiently lightweight that it would allow us to maintain one version of the algorithm with some macros.
It is very nice indeed. I did not try it with the shortest path algorithm, but instead a simple sum algorithm where yield is used to get the running total.
...
I'd say it is definitely worth investigating the use of the boost asio method as a means of creating algorithm objects. From this example it appears that:
1. It is an efficient way of interrupting and resuming algorithms. 2. It leaves the algorithms coded in a readable state, although it seems that all local variables should become member variables. 3. If you decide not to interrupt the algorithm, the additional cost due to being part of an algorithm object is less than for the alternatives.
did you compile the code with optimization switched on?
I used the default release setting of VS2010. It has optimization switched on.
preserving fpu state has a significant influence on performance too (at least for boost.coroutine I measure a significant influence).
stack-less coroutines (preprocessor trick with switch statement) will always be faster than stack-full coroutines (boost.coroutine), because stack-full coroutines have to preserve the stack frame and some registers, and because exchanging the stack and instruction pointer defeats the CPU's predictions (shadow stack pointer ...).
Does that mean that you would recommend using the preprocessor trick whenever its limitations can be worked around? Maybe it would make sense to offer both solutions in your library, possibly using the same interface?

On 05.10.2012 18:05, Alex Hagen-Zanker wrote:
On 05/10/2012 16:32, Oliver Kowalke wrote:
On 05.10.2012 14:07, Alex Hagen-Zanker wrote:
Have you seen http://cpp-next.com/archive/2012/09/portable-stackless-coroutines-in-one-header/ ?
This approach might be sufficiently lightweight that it would allow us to maintain one version of the algorithm with some macros.
It is very nice indeed. I did not try it with the shortest path algorithm, but instead a simple sum algorithm where yield is used to get the running total.
...
I'd say it is definitely worth investigating the use of the boost asio method as a means of creating algorithm objects. From this example it appears that:
1. It is an efficient way of interrupting and resuming algorithms. 2. It leaves the algorithms coded in a readable state, although it seems that all local variables should become member variables. 3. If you decide not to interrupt the algorithm, the additional cost due to being part of an algorithm object is less than for the alternatives.
did you compile the code with optimization switched on?
I used the default release setting of VS2010. It has optimization switched on.
preserving fpu state has a significant influence on performance too (at least for boost.coroutine I measure a significant influence).
stack-less coroutines (preprocessor trick with switch statement) will always be faster than stack-full coroutines (boost.coroutine), because stack-full coroutines have to preserve the stack frame and some registers, and because exchanging the stack and instruction pointer defeats the CPU's predictions (shadow stack pointer ...).
Does that mean that you would recommend using the preprocessor trick whenever its limitations can be worked around? Maybe it would make sense to offer both solutions in your library, possibly using the same interface?
it depends - as Giovanni already explained in another posting, inline asm could speed up stack-full coroutines.

On 05/10/2012 16:32, Oliver Kowalke wrote:
On 05.10.2012 14:07, Alex Hagen-Zanker wrote:
Have you seen http://cpp-next.com/archive/2012/09/portable-stackless-coroutines-in-one-header/ ?
This approach might be sufficiently lightweight that it would allow us to maintain one version of the algorithm with some macros.
It is very nice indeed. I did not try it with the shortest path algorithm, but instead a simple sum algorithm where yield is used to get the running total.
Using the stackless boost asio coroutine from the link you provided is just as simple as using the proposed boost coroutine.
Summing 10,000,000 times the number 2 while yielding intermediate sums gave me the following results:
sum plain : 20000000 time: 0.01
sum fragmented : 20000000 time: 0.232
sum boost coro : 20000000 time: 0.448
sum asio coro : 20000000 time: 0.034
Without yielding intermediate results, I got the following:
sum plain : 20000000 time: 0.008
sum fragmented : 20000000 time: 0.029
sum boost coro : 20000000 time: 0.027
sum asio coro : 20000000 time: 0.012
I'd say it is definitely worth investigating the use of the boost asio method as a means of creating algorithm objects. From this example it appears that:
1. It is an efficient way of interrupting and resuming algorithms. 2. It leaves the algorithms coded in a readable state, although it seems that all local variables should become member variables. 3. If you decide not to interrupt the algorithm, the additional cost due to being part of an algorithm object is less than for the alternatives.
I am not working on this anymore, but please see the source for above trial in the attachment.
did you compile the code with optimization switched on?
Maximizing all optimization settings I get:

With yield:

    sum plain : 20000000 time: 0.016
    sum fragmented : 20000000 time: 0.047
    sum boost coro : 20000000 time: 0.452
    sum asio coro : 20000000 time: 0.031

Without yield:

    sum plain : 20000000 time: 0.016
    sum fragmented : 20000000 time: 0.031
    sum boost coro : 20000000 time: 0.016
    sum asio coro : 20000000 time: 0.015
participants (8)

- Alex Hagen-Zanker
- Dave Abrahams
- Giovanni Piero Deretta
- Gordon Woodhull
- Jeremiah Willcock
- Oliver Kowalke
- Rob Stewart
- Tim Blechmann