[boost] [coroutine][review] Very Late Review

17 Sep 2012

      Hi all,

This review is way after the deadline, so feel free to ignore my vote, but
I hope that the feedback will be useful anyway.

First of all, I vote to accept the library with the current interface. I
see that Oliver agreed to significant interface changes which I hope these
will be subject to a minireview before final acceptance.

- What is your evaluation of the documentation?

Adequate for someone that already has some knowledge of coroutines and
generators, but it lacks an introduction on the topic. Like other reviewers
I also suggest that some of the more interesting examples should be
presented and discussed in the documentation. Also, a few topics should be
discussed more in details:
    * Why thread local storage cannot be referenced in threaded
applications with a brief descritption of what is, IMHO, a GCC TLS bug.
    * What is the exact meaning of the preserve FPU flags (which, as far as
I understand, it really makes it optional to preserve the FPU context),
not the FPU registers themselves

- What is your evaluation of the potential usefulness of the library?

Very useful. This library is long due in Boost.

- Did you try to use the library?  With what compiler?  Did you have any
problems?

No, I did not try.

- How much effort did you put into your evaluation? A glance? A quick
reading? In-depth study?

In depth study.

- Are you knowledgeable about the problem domain?

yes, I have implemented similar functionality before.

- What is your evaluation of the design and implementation?
The interface as proposed for review is very close to my original
boost.coroutine interface and as such I of course think is adequate. I
think that the interface can be improved (which is one of the reasons that
i never submitted my library for review). Since the beginning of the review
period, a lot of interface changes have been proposed, some of which I
like, some I don't.

About the design:

I do not believe that the coroutine should be parametrized on do_unwind
(insert something about not parametrising on behaviour here :) ). The
behaviour of boost.coroutine should be explicit from the type and one
shouldn't have suprises when unwinding a coroutine. Please chose to either
assert or explicit unwind on destruction. I prefer the former, as per
std::function, but would prefer the latter to parametrization.

I also dislike the preserve_fpu flag, but I understand what it is there for
and could live with it. Still it should be harder to misuse: make it an
enum, and move it last in the constructor parementer list, so that
specifying a stateful allocator (BTW, I assume these are supported),
shouldn't require to specify this flag.

I like the idea that has been proposed during the review that the result of
a coroutine should be retrivable at any time via a get method. I dislike
the proposed optional<T> result for operator(). If the previous delayed
result functionality is implemented, operator() should just return *this,
so that the subsequent get() call can be chained or the result tested for
non empty. This change has the potential of deeply affecting the interface,
so I would really like to see it subject to a mini review before acceptance.

The coroutine object has both the concept of a not-a-coroutine and of a
completed coroutine. This is confusing, and should be conflated to 'empty'
state, analogous to an empty std::function. In addition to simplifying the
documentation and the interface, it should also slightly improve the
implementation. Operator bool should return !empty as per std::function.

It has been suggested that coroutine::self be removed an have a thread
specific object. I strongly disagree with this option. First of all it will
make it too easy to break type safety:

typedef this_coroutine<int(int)> this_coro;

typedef this_coroutine<void()> that_coro;

coroutine<int(int)>([]{
       that_coro::yield(); // oops compiles fine, UB at runtime.
});

Of course you can find ways to break even code with the explicit self
syntax, but the above is, IMHO, just too dangerous.
In practice, one need to pass the signature to any code that wants to yield
anyway, so might as well pass the self object around, at least one can rely
on template type deduction.

Also, one does not really want arbitrary code to yield out of the coroutine
as this can be dangerous (think of code between yield statements as a sort
of critical section) as invariants can be broken or mutices held.

Instead I suggest doing away with a separate self type and instead making
the first parameter to the coroutine functor a full fledged coroutine, with
simmetric signature:

coroutine<int(void)> my_coroutine([](coroutine<void(int)> caller) {
              caller(10);
});

int i = my_coroutine();

This means that the parameter no longer represent the coroutine itself,
but, more logically, the caller coroutine. This makes very obvious the
analogy between coroutines and channels. It also blurs the division between
symmetric and asymmetric coroutines. Bonus points if you had a function
that infers the coroutine singnature from the coroutine functor operator()
(I like to call this function callcc for obvious reasons [
http://en.wikipedia.org/wiki/Call-with-current-continuation]):

auto my_coroutine = callcc([](coroutine<void(int)> caller) { ... } );

This would obviously work only if the functor is monomorphic, but it
interacts very nicely with C++11 lambdas.

I do not like the separation between coroutines and generators. Instead the
generator class should be dropped. The reason that I had generator as a
separate class in the original coroutine SoC is because I wanted to give it
iterator semantics (this also required internally using a reference counted
pointer to make the iterator copiable, which I disliked). Instead I propose
to make every coroutine<T()> fulfill the the input range concept and every
coroutine<void(T)> fulfill the output range concept (this is a
functionality not covered by generators). In practice it means that for
these classes of coroutines, begin and end are defined returning
appropriate iterator (i.e. some very thin layers around the coroutine
itself). As a range is required to survive its iterator anyway, the
iterators can simply use plain pointers to the coroutine.

The last suggestion works very well if coupled with the suggestion of
making 'self' a full coroutine:

std::vector<int> a = { ... };
std::vector<int> b = { ... };

std::cout << "The result of 'merge(a,b) is:\n";
for(auto x : callcc([&](coroutine<void(int)> c) {  std::merge(a.begin(),
a.end(), b.begin(), b.end(), c.begin()); })
{
        std::cout << x << "\n";
}

I thought modelling generalized coroutines<T(T)> as ranges, but I have yet
to find a good model.

Enough abot the interface; I have some more comments about the
Implementation:

The code is readable (it helps if you are familiar with the topic) even if,
as other reviewers have noted, is almost free of comments. The nest of base
classes required to handle all possible coroutine signatures is quite
overwhelming though and I think it could be significantly reduced. I was
able to trace through the execution paths I was interested into, but it
required quite a bit of back and forward between implementation files.

The coroutine object itself holds a reference counted pointer to the
implementation. As the coroutine is movable but not copyable, I do not
think that reference counting is needed. The coroutine itself should
consists of two raw pointers: a pointer to the target context and a pointer
to the result slot. You need this pointer anyway, because, if I understand
correctly, you are planning to allow retriving the result slot at any time
after the operator() and yield call.

I think that the implementaiton is not as eficient as possible. I strongly
believe that a coroutine call should be as fast as a normal functiona call
as possible. As such it is important to optimize operator() and yield as
much as possible. Ideally, no code should be executed other than calling
context::jump_fcontext. I will detail below how such an implementation is
possible:

Operator() needs  to check at every call whether the coroutine has started
and decide whether to call impl->start or impl->resume. This is a very
predictable branch, but every instruction counts, and as a minimum the
branch will increase instruction cache pressure and decoding bandwith.
Similarly, there is a branch after the internal resume call to detect
whether the coroutine has terminated and another one to check whether
exceptions should be thrown.

The first branch can be removed by moving the complexity to the
constructor: a redy-to-start context should be indistinguishable from a
suspended context.

The branch on is_complete can be removed by making sure that a resume on
on-complete also pass the pointer to the return slot via the native_resume
result.

The check for pending exceptions can be removed if context switches that
cause exceptions to be thrown on the target context are modified to create
an artificial stack frame on top of the called stack that runs a function
that throws the exception only when necessary. This would require a small
(but useful) addition to boost.context interface.

The flag manipulation can be removed by completely removing the flags. The
only state you really need to know is whether the coroutine is empty (which
can be signaled by having a null pointer to context) and whether any result
is available (that can be signaled by having a null pointer to result
slot). Both pointers are updated by each call to resume(). As the internal
context implementation only allows passing a single value, this would
require an additional level of indirection. The indirection can be removed
by changing the context jump_fcontext api to return a structure of two
pointers: the existing parameter plus an additional pointer to the calling
context. As most ABI allow to return small structures in registers, this
can be implemented quite eficiently.

Another source of ineficiency and indirection is the constructor, which
while not as critical as task switching itself, it should still be fast. To
construct a coroutine two dynamic allocations are currently needed, one to
allocate the control block (i.e. the derived class from context base), one
to allocate the stack. I suggest completely removing the control block,
instead placing the allocator and the coroutine functor itself directly in
the stack frame of the trampoline. To do this, after allocating the stack,
the constructor switches to the target context trampoline and passes to it
apointer to the functor and allocator pairs from constructor stack, which
are moved to local variables. This will also implicitly type erase both the
allocator and the functor.

Similarly, there is noneed to store the two caller and calle fcontextes,
which can  be mantained as local variables inside inside
coroutine::operator(), and yield respectively. The previously described
additional parameter returned by impl->resume allows to eficiently retrieve
the other context.  These stack allocations are not unlike what is aready
done for result slot which is located inside yield or operator().

As there is no control block any more, cleanup can no longer be done inside
its destructor. Instead, after the context has unwound, the trampoline
switches to the calling context stack and executes a cleanup function on
top of it (this is the same functionality need to implement the exception
throwing as described before). This cleanup function can safely destroy
first the coroutine functor and then deallocate the stack. The cleanup
function then returns null for both the calling context pointer and result
slot, signaling that the coroutine has exited.

I have a sketch implementation of most of the suggestions above here
https://github.com/gpderetta/Experiments/blob/scheduler/switch.hpp(examples
here
https://github.com/gpderetta/Experiments/blob/scheduler/switch_test.cc).

That's all so far; I'll try to come with further comments; I'm sorry for
the wall of text, I wrote it in one go and didn't have time to cut it down.

Good luck with the review!

-- gpd

[boost] [coroutine][review] Very Late Review

Giovanni Piero Deretta