
On Saturday 22 October 2011 20:32:44 Tim Blechmann wrote:
then we need some kind of interprocess-specific atomic ... maybe as part of Boost.Interprocess ... in any case, maybe we should provide an implementation which somehow matches the behavior of C++11 compilers ...
well, if the atomics are truly atomic, then BOOST_ATOMIC_*_LOCK_FREE == 2, and I find it difficult to imagine a platform where you cannot use them safely between processes (not that something like that could not exist)
one would have to do the dispatching logic in the preprocessor, so one cannot dispatch depending on a typedef.
it's certainly possible to build a helper template to map types to these macro values (e.g. map all types T with sizeof(T) == sizeof(int) to the value of BOOST_ATOMIC_INT_LOCK_FREE)
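a minimal sketch of such a size-based dispatch helper; the DEMO_* macro values and all names here are made up for illustration (Boost.Atomic defines the real BOOST_ATOMIC_*_LOCK_FREE macros), and it assumes short, int and long long have distinct sizes:

```cpp
#include <cstddef>

// illustrative stand-ins for BOOST_ATOMIC_*_LOCK_FREE
// (0 = never lock-free, 1 = sometimes, 2 = always)
#define DEMO_ATOMIC_SHORT_LOCK_FREE 2
#define DEMO_ATOMIC_INT_LOCK_FREE   2
#define DEMO_ATOMIC_LLONG_LOCK_FREE 1

// primary template: sizes we know nothing about -> not lock-free
template<std::size_t Size>
struct lock_free_by_size { static const int value = 0; };

// one specialization per distinct integer size
template<> struct lock_free_by_size<sizeof(short)>
{ static const int value = DEMO_ATOMIC_SHORT_LOCK_FREE; };
template<> struct lock_free_by_size<sizeof(int)>
{ static const int value = DEMO_ATOMIC_INT_LOCK_FREE; };
template<> struct lock_free_by_size<sizeof(long long)>
{ static const int value = DEMO_ATOMIC_LLONG_LOCK_FREE; };

// map any type T to the macro value of the same-sized integer
template<typename T>
struct lock_free_hint : lock_free_by_size<sizeof(T)> {};
```

so, for example, lock_free_hint<float>::value picks up the int macro value on platforms where sizeof(float) == sizeof(int).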
if they are not atomic, then you are going to hit a fallback-via-locking path, in which case you are almost certainly better off picking an interprocess communication mechanism that just uses locking directly
true, but at the cost of increasing the program logic. however, there are cases when you are happy that you don't have to change the program, at the cost of performance on legacy hardware.
okay, that's a valid point -- I'm not sure how common this use case is, but I do not think it deserves penalizing the process-local path. doing it in Boost.Interprocess might be something to consider, however
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();

    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();
less bloat, and probably only a minor performance hit ;)
problematic because the compiler must insert a lock to ensure thread-safe initialization of the "static bool" (thus it is by definition not "lock-free" any more)
well, one could also set a static variable with a function called before main (e.g. via __attribute__((constructor)))
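purely a sketch of that idea, assuming GCC/Clang on an x86 target (the variable and function names are made up, not from Boost.Atomic):

```cpp
#include <cpuid.h> // GCC/Clang helper for the x86 cpuid instruction

static bool has_cmpxchg16b = false;

// runs before main(), so readers of has_cmpxchg16b need no locking
__attribute__((constructor))
static void detect_cmpxchg16b()
{
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        // CPUID leaf 1, ECX bit 13 = CMPXCHG16B
        has_cmpxchg16b = (ecx & (1u << 13)) != 0;
}
```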
might be possible, but this will then cost everyone the cpuid at load time. I am currently trying out something different, namely a tristate variable ("unknown", "has_cmpxchg8b", "lacks_cmpxchg8b") with a benign race, where (in bad cases) multiple threads might end up doing "cpuid" concurrently until all threads "see" a state other than "unknown"
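roughly like this (again just a sketch with made-up names, GCC/Clang on x86 assumed; the actual implementation may differ):

```cpp
#include <cpuid.h> // GCC/Clang helper for the x86 cpuid instruction

enum cmpxchg8b_state { state_unknown, state_present, state_absent };

// racy but benign global: several threads may probe cpuid
// concurrently until all of them observe a non-"unknown" value
static volatile cmpxchg8b_state g_state = state_unknown;

static bool check_cmpxchg8b()
{
    cmpxchg8b_state s = g_state;
    if (s == state_unknown) {
        unsigned eax, ebx, ecx, edx;
        // CPUID leaf 1, EDX bit 8 = CMPXCHG8B
        bool present = __get_cpuid(1, &eax, &ebx, &ecx, &edx) != 0
                       && (edx & (1u << 8)) != 0;
        s = present ? state_present : state_absent;
        g_state = s; // worst case: redundant cpuid, same final value
    }
    return s == state_present;
}
```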
on average, but not in the worst case. for real-time systems it is not acceptable for the OS to preempt a real-time thread while it is holding a spinlock.
prio-inheriting mutexes are usually much faster than cmpxchg16b -- use these for hard real-time (changing the fallback path to use PI mutexes as well might even be something to consider)
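for reference, setting up such a priority-inheriting mutex with POSIX threads could look like this (a sketch with an illustrative helper name; returns 0 on success):

```cpp
#include <pthread.h>

int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    // with PTHREAD_PRIO_INHERIT the kernel boosts the owner's priority
    // to that of the highest-priority waiter, bounding priority inversion
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

in the uncontended fast path, lock/unlock of such a mutex boils down to a single CAS on the lock word, which is why it can compete with a cmpxchg16b-based protocol.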
do you have some numbers on what latencies can be achieved with PI mutexes?
no, I don't, but the literature measuring wakeup latencies in operating systems is plentiful. I only have throughput numbers, and these peg a double-word CAS operation as slightly less than twice as expensive as a single-word CAS. considering that most protocols need one pair of (either single- or double-word) CAS operations, and considering that a PI mutex lock/unlock can essentially be just a CAS on the lock variable (to store/clear the owner id) in the fast path, PI mutexes usually end up faster. nevertheless, I will add cmpxchg16b for experimentation.

Best regards
Helge