
On Fri, Apr 13, 2012 at 7:39 PM, Oliver Kowalke <oliver.kowalke@gmx.de> wrote:
> On 13.04.2012 19:21, Mathias Gaunard wrote:
>> This is incorrect. ucontext is just one of the provided implementations. There is also custom assembly for x86.
> What about the other architectures? It also has to save/restore the registers as the calling conventions require (as fcontext does). Why should it then be faster? At a brief look, it preserves neither the SSE2 control and status word nor the x87 control word. If you ignore the calling convention and skip preserving relevant data, of course you can be faster (but the code is incorrect and might fail).
Not saving the SSE and x87 control words was a conscious decision on my part. The control words are unlike other callee/caller-saved registers: they define a process mode and are explicitly under the control of the user. In my tests, the instructions used to load/save this state had a considerable cost on my old NetBurst CPU.

The compiler may temporarily change the control state (for example, in legacy x87 mode, to implement some non-standard rounding), but it has to reset it to the original value before calling any externally defined function (like the ASM context switching functions), as these will expect the control words to be in the default state (whatever that is). The only time a called function will see the control words in a non-default state is if the user explicitly changed the state, for example via a C99 compiler pragma or builtin function.

The boost.coroutine documentation did explicitly warn about risky changes to process state across coroutine calls, including the signal mask (which boost.context also does not preserve), locks, TLS and of course the FPU state. So, yes, the user might see failures, but never because of hidden optimizations done by the compiler; only because he forgot to restore the shared state (any state) to a sane default before switching out of the coroutine.

Having said that, I doubt that on a modern CPU this extra state save/restore would cost more than an extra 50% on a context switch (which in the grand scheme of things isn't really that much). Any claimed scalability differences between boost.context and my old library must come from somewhere else and not from the low-level context switching routines. The only thing that comes to mind is that boost.coroutine saved all registers on the stack (which is very likely to be cache hot) instead of in a separate structure as boost.context does (which, IIRC, was heap allocated in the higher-level wrapper).
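To make the point about user-controlled FPU state concrete, here is a minimal sketch of restoring the rounding mode to its previous value before switching out of a coroutine. The names rounding_mode_guard, hypothetical_yield and coroutine_body are made up for illustration and are not part of boost.coroutine or boost.context; the "yield" is just a stub that records the rounding mode another context would observe. It assumes a platform where FE_UPWARD is available (e.g. x86).

```cpp
#include <cassert>
#include <cfenv>

// Hypothetical helper (not a Boost API): sets a rounding mode on construction
// and puts the previous mode back on destruction.
class rounding_mode_guard {
public:
    explicit rounding_mode_guard(int new_mode) : saved_(std::fegetround()) {
        std::fesetround(new_mode);
    }
    ~rounding_mode_guard() { std::fesetround(saved_); }

    rounding_mode_guard(const rounding_mode_guard&) = delete;
    rounding_mode_guard& operator=(const rounding_mode_guard&) = delete;

private:
    int saved_;  // rounding mode that was active before the guard
};

// Stand-in for a real context switch: it only records the rounding mode
// that the code running after the switch would inherit.
int mode_seen_by_other_context = -1;
void hypothetical_yield() { mode_seen_by_other_context = std::fegetround(); }

void coroutine_body() {
    {
        rounding_mode_guard guard(FE_UPWARD);
        // Computations that need upward rounding would go here.
        assert(std::fegetround() == FE_UPWARD);
    }  // previous rounding mode restored here, before any switch

    hypothetical_yield();  // the other context sees the restored mode
}
```

The design choice is simply to keep the non-default state confined to a scope that contains no switch, so the switching routine itself never needs to save or restore the control words.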
FWIW, while it is hard to compare my results on an old 32-bit machine with yours on an undoubtedly newer CPU and OS, I distinctly remember from my tests that a coroutine-to-coroutine switch (using the high-level API) was about 100 times faster using the custom backend than using ucontext (mainly because of the high cost of the function call).

HTH,

-- gpd