
On 14.04.2012 00:37, Giovanni Piero Deretta wrote:
Not saving the SSE and x87 control words was a conscious decision on my part. The control words are unlike other callee/caller-saved registers: they define a process mode and are explicitly under the control of the user. In my tests the instructions used to load/save this state had a considerable cost on my old NetBurst CPU.
The compiler may temporarily change the control state (for example in legacy x87 mode to implement some non-standard rounding), but it has to reset them to the original value before calling any externally defined function (like the ASM context switching functions) as these will expect the control words to be in the default state (whatever this is).
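The change-then-restore discipline described above can be sketched at user level with the standard `<cfenv>` functions. This is only an illustration: `scoped_rounding` and `observed_rounding_mode` are hypothetical names, not part of boost.context or boost.coroutine, and the rounding mode stands in for the control-word state in general.

```cpp
#include <cfenv>

// Hypothetical RAII helper (an assumption for illustration only):
// switches the floating-point rounding mode and restores the previous
// mode when the scope ends, so later calls see the original state again.
class scoped_rounding {
public:
    explicit scoped_rounding(int mode) : saved_(std::fegetround()) {
        std::fesetround(mode);
    }
    ~scoped_rounding() { std::fesetround(saved_); }

    scoped_rounding(const scoped_rounding&) = delete;
    scoped_rounding& operator=(const scoped_rounding&) = delete;

private:
    int saved_;  // rounding mode that was active before the change
};

// Stands in for an externally defined function: it does not touch the
// rounding mode, it only observes whatever the caller left in place.
int observed_rounding_mode() {
    return std::fegetround();
}
```

While a `scoped_rounding guard(FE_UPWARD);` is alive, `observed_rounding_mode()` reports `FE_UPWARD`; after the guard goes out of scope it reports the previous mode again. A context switch made while the guard is alive would carry the changed mode into the other context unless the switch routine saves and restores it.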
The only time called functions will see the control words in a non-default state is if the user explicitly changed the state, for example via a C99 compiler pragma or builtin function. The boost.coroutine documentation did explicitly warn about risky changes to process state across coroutine calls, including the signal mask (which boost.context also does not preserve), locks, TLS and of course the FPU state.
But what about code you don't have under your control (legacy libs etc.)?
Having said that, I doubt that on a modern CPU this extra state save/change would cost more than an extra 50% on a context call (which in the grand scheme of things isn't really that much). Any claimed scalability differences between boost.context and my old library must come from somewhere else and not from the low-level context switching routines. The only thing that comes to my mind is that boost.coroutine did save all registers on the stack (which is very likely to be cache hot) instead of in a separate structure as boost.context does (which, IIRC, was heap allocated in the higher-level wrapper).
If you refer to my performance tests - I never compared boost.context with boost.coroutine - I've measured the cycle costs of fcontext and ucontext.
FWIW, while it is hard to compare my results on an old 32-bit machine with yours on an undoubtedly newer CPU and OS, I distinctly remember from my tests that a coroutine-to-coroutine switch (using the high-level API) was about 100 times faster using the custom backend than using ucontext (mainly because of the high cost of the function call).
HTH,
That is the same as what I figured out (see above, fcontext vs. ucontext, and the performance test app in boost.context). I assumed that your lib used ucontext as its back-end, and therefore I had doubts that it would be much faster than boost.context (as mentioned in another post).
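For readers who want to reproduce the ucontext side of such a measurement, here is a minimal ping-pong sketch using the POSIX `getcontext`/`makecontext`/`swapcontext` calls. It is not the performance test app from boost.context; `ping_pong`, `coroutine_body` and the 64 KiB stack size are assumptions made for illustration. It only counts switches, and timing code would have to be wrapped around the loop.

```cpp
#include <ucontext.h>

namespace {
ucontext_t main_ctx;  // context of the caller
ucontext_t co_ctx;    // context of the coroutine
int switches = 0;     // number of context switches performed so far

// Coroutine body: count the switch into it, then yield back to the caller.
void coroutine_body() {
    for (;;) {
        ++switches;
        swapcontext(&co_ctx, &main_ctx);
    }
}
}  // namespace

// Hypothetical helper: switches into the coroutine and back `rounds` times
// and returns the total number of switches (two per round).
int ping_pong(int rounds) {
    static char stack[64 * 1024];  // assumed stack size for the coroutine

    switches = 0;
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = stack;
    co_ctx.uc_stack.ss_size = sizeof stack;
    co_ctx.uc_link = &main_ctx;  // resumed if the body ever returned
    makecontext(&co_ctx, coroutine_body, 0);

    for (int i = 0; i < rounds; ++i) {
        swapcontext(&main_ctx, &co_ctx);  // caller -> coroutine
        ++switches;                       // coroutine -> caller has happened
    }
    return switches;
}
```

Timing `ping_pong(n)` (for example with `std::chrono::steady_clock`) and dividing by the returned count gives a rough per-switch cost. Note that `swapcontext` also saves and restores the signal mask, which a hand-written switch such as fcontext skips.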
BTW, the file swapcontext64.cpp (from your lib) might contain a bug: https://svn.boost.org/svn/boost/sandbox/SOC/2006/coroutine/trunk/libs/corout... It preserves the registers rbx, rbp, rax, rdx. I think it should be rbx, rbp, r12-r15 (+SSE2 and x87), as described in the 'SysV ABI AMD64 Architecture Processor Supplement - Draft Version 0.99.4'.
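To make the register list concrete, a save area covering that set could look like the sketch below. The struct name and field layout are hypothetical and are not taken from swapcontext64.cpp or boost.context. The `rsp` slot is my addition, since a context switch has to swap the stack pointer in any case, and the last two fields hold the SSE and x87 control words.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical save area (illustration only) for the state that the
// SysV AMD64 ABI requires a callee to preserve across a call.
struct sysv_amd64_saved_state {
    std::uint64_t rbx;
    std::uint64_t rbp;
    std::uint64_t r12;
    std::uint64_t r13;
    std::uint64_t r14;
    std::uint64_t r15;
    std::uint64_t rsp;     // stack pointer, swapped by the switch itself
    std::uint32_t mxcsr;   // SSE control/status register (control bits)
    std::uint16_t x87_cw;  // x87 FPU control word
};

// Seven 8-byte general-purpose slots come first, then the control words.
static_assert(offsetof(sysv_amd64_saved_state, mxcsr) == 7 * sizeof(std::uint64_t),
              "control words follow the general-purpose registers");
```

rax and rdx do not appear here because the ABI treats them as caller-saved, so a switch routine that is entered through a normal function call does not need to keep them.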