
hi helge and others,

... has been rather quiet about the boost.atomic review, so i want to raise a few issues:

shared memory support: the fallback implementation relies on the spinlock pool that is also used by the smart pointers. however, this pool is per-process, so the fallback implementation won't work in shared memory. can this be changed/fixed?

atomic::is_lock_free(): is_lock_free is set to either `true' or `false'. however, in some cases there are alignment constraints (iirc, 64-bit atomics on ia32/x86_64 require 64-bit alignment). afaict there are no precautions to take care of this, are there?

compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture, e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used without first performing a CPUID check of whether they are really available (at least in a legacy mode)

cmpxchg16b: currently cmpxchg16b doesn't seem to be supported. this instruction is required for some lock-free data structures (e.g. there is a dequeue algorithm that requires a pair of tagged pointers).

maybe this can be a starting point to discuss boost.atomic ...

cheers, tim

Hi Tim and others,

On Friday 21 October 2011 10:27:50 Tim Blechmann wrote:
hi helge and others,
... has been rather quiet about the boost.atomic review, so i want to raise a view issues:
shared memory support: the fallback implementation relies on the spinlock pool that is also used by the smart pointers. however, this pool is per-process, so the fallback implementation won't work in shared memory. can this be changed/fixed?
fixing this would require a per-variable lock... depending on the platform this can have enormous overheads. I would suggest using the compile-time macros BOOST_ATOMIC_*_LOCK_FREE to pick an alternate code path.
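A minimal sketch of this kind of compile-time dispatch, assuming only the BOOST_ATOMIC_*_LOCK_FREE macro convention (value 2 meaning "always lock-free"); the two branch bodies are placeholders:

    #include <boost/atomic.hpp>
    #include <cstdio>

    int main()
    {
    #if BOOST_ATOMIC_INT_LOCK_FREE == 2
        // always lock-free, hence usable even in a shared memory segment
        std::puts("using boost::atomic<int> in the shared segment");
    #else
        // the fallback goes through a per-process spinlock pool, which is
        // not safe across processes -- take a locking IPC path instead
        std::puts("using an explicitly locked IPC structure");
    #endif
        return 0;
    }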
atomic::is_lock_free(): is_lock_free is set to either `true' or `false'. however, in some cases there are alignment constraints (iirc, 64-bit atomics on ia32/x86_64 require 64-bit alignment). afaict there are no precautions to take care of this, are there?
for x86_64 there is nothing to do, the ABI requires 8-byte alignment already. there used to be an __align__(8) to cover ia32, but it got lost... I *think* the "lock" prefix will cover this case nevertheless (at a hefty performance cost, though...)
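The lost annotation might have looked roughly like this (GCC syntax; the typedef name is invented) -- on ia32 the ABI alone only guarantees 4-byte alignment for 64-bit objects:

    // force the 8-byte alignment that 64-bit atomics want on ia32;
    // x86_64 already gets this from the ABI
    typedef unsigned long long aligned_uint64
        __attribute__((__aligned__(8)));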
compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture, e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used without first performing a CPUID check of whether they are really available (at least in a legacy mode)
the correct way to do that is to have different libraries for sub-architectures and have the runtime linker decide... this requires infrastructure not present in boost
cmpxchg16b: currently cmpxchg16b doesn't seem to be supported. this instruction is required for some lock-free data structures (e.g. there is a dequeue algorithm that requires a pair of tagged pointers).
could do, but cmpxchg16b is dog-slow, the fallback path is going to be faster anyways
maybe this can be a starting point to discuss boost.atomic ...
best regards Helge

shared memory support: the fallback implementation relies on the spinlock pool that is also used by the smart pointers. however, this pool is per-process, so the fallback implementation won't work in shared memory. can this be changed/fixed?
fixing this would require a per-variable lock... depending on the platform this can have enormous overheads.
I would suggest using the compile-time macros BOOST_ATOMIC_*_LOCK_FREE to pick an alternate code path.
then we need some kind of interprocess-specific atomic ... maybe as part of boost.interprocess ... iac, maybe we should provide an implementation which somehow matches the behavior of c++11 compilers ...
atomic::is_lock_free(): is_lock_free is set to either `true' or `false'. however, in some cases there are alignment constraints (iirc, 64-bit atomics on ia32/x86_64 require 64-bit alignment). afaict there are no precautions to take care of this, are there?
for x86_64 there is nothing to do, the ABI requires 8-byte alignment already
there used to be an __align__(8) to cover ia32, but it got lost... I *think* the "lock" prefix will cover this case nevertheless (at a hefty performance cost, though...)
i see
compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture, e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used without first performing a CPUID check of whether they are really available (at least in a legacy mode)
the correct way to do that is to have different libraries for sub-architectures and have the runtime linker decide... this requires infrastructure not present in boost
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and prbly only a minor performance hit ;)
cmpxchg16b: currently cmpxchg16b doesn't seem to be supported. this instruction is required for some lock-free data structures (e.g. there is a dequeue algorithm that requires a pair of tagged pointers).
could do, but cmpxchg16b is dog-slow, the fallback path is going to be faster anyways
on average, but not in the worst case. for real-time systems it is not acceptable for the os to preempt a real-time thread while it is holding a spinlock.

cheers, tim

On Friday 21 October 2011 13:06:20 Tim Blechmann wrote:
shared memory support: the fallback implementation relies on the spinlock pool that is also used by the smart pointers. however, this pool is per-process, so the fallback implementation won't work in shared memory. can this be changed/fixed?
fixing this would require a per-variable lock... depending on the platform this can have enormous overheads.
I would suggest using the compile-time macros BOOST_ATOMIC_*_LOCK_FREE to pick an alternate code path.
then we need some kind of interprocess-specific atomic ... maybe as part of boost.interprocess ... iac, maybe we should provide an implementation which somehow matches the behavior of c++11 compilers ...
well, if the atomics are truly atomic, then BOOST_ATOMIC_*_LOCK_FREE == 2, and I find a platform where you cannot use them safely between processes difficult to imagine (not that something like that could not exist).

if they are not atomic, then you are going to hit a "fallback via locking" path, in which case you are almost certainly better off picking an interprocess communication mechanism that just uses locking directly.
atomic::is_lock_free(): is_lock_free is set to either `true' or `false'. however, in some cases there are alignment constraints (iirc, 64-bit atomics on ia32/x86_64 require 64-bit alignment). afaict there are no precautions to take care of this, are there?
for x86_64 there is nothing to do, the ABI requires 8-byte alignment already
there used to be an __align__(8) to cover ia32, but it got lost... I *think* the "lock" prefix will cover this case nevertheless (at a hefty performance cost, though...)
i see
but you certainly have a point that this alignment should be corrected; noted, to be fixed
compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture, e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used without first performing a CPUID check of whether they are really available (at least in a legacy mode)
the correct way to do that is to have different libraries for sub-architectures and have the runtime linker decide... this requires infrastructure not present in boost
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and prbly only a minor performance hit ;)
problematic because the compiler must insert a lock to ensure thread-safe initialization of the "static bool" (thus it is by definition not "lock-free" any more)
cmpxchg16b: currently cmpxchg16b doesn't seem to be supported. this instruction is required for some lock-free data structures (e.g. there is a dequeue algorithm, that requires a pair of tagged pointers).
could do, but cmpxchg16b is dog-slow, the fallback path is going to be faster anyways
on average, but not in the worst case. for real-time systems it is not acceptable for the os to preempt a real-time thread while it is holding a spinlock.
prio-inheriting mutexes are usually much faster than cmpxchg16b -- use these for hard real-time (changing the fallback path to use PI mutexes as well might even be something to consider).

that being said, I can put it in, but I don't think there is value in it.

Best regards Helge

then we need some kind of interprocess-specific atomic ... maybe as part of boost.interprocess ... iac, maybe we should provide an implementation which somehow matches the behavior of c++11 compilers ...
well, if the atomics are truly atomic, then BOOST_ATOMIC_*_LOCK_FREE == 2, and I find a platform where you cannot use them safely between processes difficult to imagine (not that something like that could not exist)
one would have to do the dispatching logic in the preprocessor, so one cannot dispatch on the type.
if they are not atomic, then you are going to hit a "fallback via locking" path, in which case you are almost certainly better off picking an interprocess communication mechanism that just uses locking directly
true, but at the cost of more complex program logic. however, there are cases where you are happy that you don't have to change the program, at the cost of performance on legacy hardware.
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and prbly only a minor performance hit ;)
problematic because the compiler must insert a lock to ensure thread-safe initialization of the "static bool" (thus it is by definition not "lock-free" any more)
well, one could also set a static variable with a function called before main (e.g. via __attribute__((constructor)))
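For instance (a sketch; note the double parentheses in the GCC spelling, and that on CPUID leaf 1 the CMPXCHG16B flag is bit 13 of ECX):

    #include <cpuid.h>   // GCC's __get_cpuid()

    static bool has_cmpxchg16b;   // written exactly once, before main() runs

    __attribute__((constructor))
    static void detect_cmpxchg16b(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            has_cmpxchg16b = (ecx & (1u << 13)) != 0;  // CMPXCHG16B flag
    }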
on average, but not in the worst case. for real-time systems it is not acceptable for the os to preempt a real-time thread while it is holding a spinlock.
prio-inheriting mutexes are usually much faster than cmpxchg16b -- use these for hard real-time (changing the fallback path to use PI mutexes as well might even be something to consider)
do you have some numbers on which latencies can be achieved with PI mutexes?

tim

On Saturday 22 October 2011 20:32:44 Tim Blechmann wrote:
then we need some kind of interprocess-specific atomic ... maybe as part of boost.interprocess ... iac, maybe we should provide an implementation which somehow matches the behavior of c++11 compilers ...
well, if the atomics are truly atomic, then BOOST_ATOMIC_*_LOCK_FREE == 2, and I find a platform where you cannot use them safely between processes difficult to imagine (not that something like that could not exist)
one would have to do the dispatching logic in the preprocessor, so one cannot dispatch on the type.
it's certainly possible to build a helper template to map types to these macro values (map to the value of BOOST_ATOMIC_INT_LOCK_FREE for all types T with sizeof(T) == sizeof(int) for example)
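Such a helper might be sketched like this (invented names; it assumes the BOOST_ATOMIC_LLONG_LOCK_FREE macro alongside the INT one, and that sizeof(int) != sizeof(long long) so the specializations do not collide):

    #include <boost/atomic.hpp>
    #include <cstddef>

    // map an object size to the matching BOOST_ATOMIC_*_LOCK_FREE value
    // (0 = never, 1 = sometimes, 2 = always lock-free)
    template<std::size_t Size> struct lock_free_level;

    template<> struct lock_free_level<sizeof(int)>
    { static const int value = BOOST_ATOMIC_INT_LOCK_FREE; };

    template<> struct lock_free_level<sizeof(long long)>
    { static const int value = BOOST_ATOMIC_LLONG_LOCK_FREE; };

    template<typename T>
    struct is_always_lock_free
    { static const bool value = (lock_free_level<sizeof(T)>::value == 2); };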
if they are not atomic, then you are going to hit a "fallback via locking" path, in which case you are almost certainly better off picking an interprocess communication mechanism that just uses locking directly
true, but at the cost of more complex program logic. however, there are cases where you are happy that you don't have to change the program, at the cost of performance on legacy hardware.
okay, that's a valid point -- not sure how common this use case is, but I do not think it deserves penalizing the process-local path. doing it in Boost.Interprocess might be something to consider, however.
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and prbly only a minor performance hit ;)
problematic because the compiler must insert a lock to ensure thread-safe initialization of the "static bool" (thus it is by definition not "lock-free" any more)
well, one could also set a static variable with a function called before main (e.g. via __attribute__((constructor)))
might be possible, but this will then cost everyone the cpuid at load time. I am currently trying out something different, namely a tristate variable ("unknown", "has_cmpxchg8b", "lacks_cmpxchg8b") with a benign race, where (in bad cases) multiple threads might end up doing "cpuid" concurrently until all threads "see" that it has a state other than "unknown"
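In code, that tristate scheme could look roughly like this (a sketch, shown for cmpxchg16b rather than cmpxchg8b; the pre-C++11 volatile int stands in for a relaxed atomic load/store):

    #include <cpuid.h>

    enum feature_state { unknown = 0, present, absent };
    static volatile int cmpxchg16b_state = unknown;

    static bool have_cmpxchg16b(void)
    {
        int s = cmpxchg16b_state;
        if (s == unknown) {
            // benign race: several threads may execute cpuid concurrently,
            // but they all compute and store the same value
            unsigned eax, ebx, ecx, edx;
            s = (__get_cpuid(1, &eax, &ebx, &ecx, &edx)
                 && (ecx & (1u << 13))) ? present : absent;
            cmpxchg16b_state = s;
        }
        return s == present;
    }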
on average, but not in the worst case. for real-time systems it is not acceptable for the os to preempt a real-time thread while it is holding a spinlock.
prio-inheriting mutexes are usually much faster than cmpxchg16b -- use these for hard real-time (changing the fallback path to use PI mutexes as well might even be something to consider)
do you have some numbers which latencies can be achieved with PI mutexes?
no I don't, but the literature measuring wakeup latencies in operating systems is plentiful.

I only have throughput numbers, and these peg a double-word CAS operation as slightly less than twice as expensive as a single-word CAS -- considering that most protocols need one pair of (either single- or double-word) CAS, and considering that PI mutex lock/unlock can essentially be just a CAS on the lock variable (to store/clear the owner id) in the fast path, PI mutexes usually end up faster.

Nevertheless I will add cmpxchg16b for experimentation.

Best regards Helge

well, if the atomics are truly atomic, then BOOST_ATOMIC_*_LOCK_FREE == 2, and I find a platform where you cannot use them safely between processes difficult to imagine (not that something like that could not exist)
one would have to do the dispatching logic in the preprocessor, so one cannot dispatch on the type.
it's certainly possible to build a helper template to map types to these macro values (map to the value of BOOST_ATOMIC_INT_LOCK_FREE for all types T with sizeof(T) == sizeof(int) for example)
yes ... iac, i would have preferred the semantics of dispatching on a per-class (or per-size) basis myself. but unfortunately atomic::is_lock_free is per-instance ...
on average, but not in the worst case. for real-time systems it is not acceptable for the os to preempt a real-time thread while it is holding a spinlock.
prio-inheriting mutexes are usually much faster than cmpxchg16b -- use these for hard real-time (changing the fallback path to use PI mutexes as well might even be something to consider)
do you have some numbers which latencies can be achieved with PI mutexes?
no I don't, but the literature measuring wakeup latencies in operating systems is plentiful
last year i performed some benchmarks of the worst-case wakeup latencies on linux. on highly optimized systems with PREEMPT_RT you can get a worst case of about 20us; with a vanilla `low-latency' kernel, you have a worst case of about 300us (quite significant for me, targeting deadlines of about 1ms).
I only have throughput numbers, and these peg a double-word CAS operation as slightly less than twice as expensive as a single-word CAS -- considering that most protocols need one pair of (either single- or double-word) CAS, and considering that PI mutex lock/unlock can essentially be just a CAS on the lock variable (to store/clear the owner id) in the fast path, PI mutexes usually end up faster
well, according to my experience on a stock operating system, the worst-case performance is about 100 times slower than the average case ...
Nevertheless I will add cmpxchg16b for experimentation.
great, thanks!

cheers, tim

hi helge, (responding to an old mail after re-reading my latest version of the draft)
well, if the atomics are truly atomic, then BOOST_ATOMIC_*_LOCK_FREE == 2, and I find a platform where you cannot use them safely between processes difficult to imagine (not that something like that could not exist)
one would have to do the dispatching logic in the preprocessor, so one cannot dispatch on the type.
it's certainly possible to build a helper template to map types to these macro values (map to the value of BOOST_ATOMIC_INT_LOCK_FREE for all types T with sizeof(T) == sizeof(int) for example)
the preprocessor variables are only defined for integral types. the standard also says:

    template <class T> bool atomic_is_lock_free(const volatile atomic_type*);

The function atomic_is_lock_free (29.6) indicates whether the object is lock-free. In any given program execution, the result of the lock-free query shall be consistent for all pointers of the same type.

but atomic_is_lock_free is only defined for integral atomic types. so to be standard compliant, one could build the helper template only for integral atomic types.

cheers, tim

On 21.10.2011. 13:06, Tim Blechmann wrote:
compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture, e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used without first performing a CPUID check of whether they are really available (at least in a legacy mode)
the correct way to do that is to have different libraries for sub-architectures and have the runtime linker decide... this requires infrastructure not present in boost
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and prbly only a minor performance hit ;)
cmpxchg8b has been available since the original Pentium. dynamic support for such ancient hardware, if supported at all, should preferably not be enabled by default (by forcing dynamic dispatching on everyone).

On Friday, October 28, 2011 17:12:55 Domagoj Saric wrote:
On 21.10.2011. 13:06, Tim Blechmann wrote:
compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture, e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used without first performing a CPUID check of whether they are really available (at least in a legacy mode)
the correct way to do that is to have different libraries for sub-architectures and have the runtime linker decide... this requires infrastructure not present in boost
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and prbly only a minor performance hit ;)
cmpxchg8b has been available since the original Pentium. dynamic support for such ancient hardware, if supported at all, should preferably not be enabled by default (by forcing dynamic dispatching on everyone).
Unfortunately, cmpxchg16b is not as common as cmpxchg8b, so a dynamic check would be desirable. However, I would prefer that there were no if's like the one above. Perhaps a global table of pointers to the actual function implementations would be better. Initially, the pointers should point to functions that perform cpuid, initialize this table, and then call the real functions for the detected hardware. This way we eliminate almost all overhead in the long run, including call_once.
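A sketch of that shape (all names invented; the native and fallback bodies are declared but elided, since they would be inline assembly and a spinlock path respectively; have_cmpxchg16b() is a detection helper like the tristate sketch earlier in the thread):

    typedef bool (*cas128_fn)(volatile void *ptr,
                              void *expected, const void *desired);

    static bool cas128_native(volatile void *, void *, const void *);
    static bool cas128_fallback(volatile void *, void *, const void *);
    static bool cas128_detect(volatile void *, void *, const void *);

    // the "table": every slot initially points at the detector
    static cas128_fn cas128 = cas128_detect;

    static bool cas128_detect(volatile void *p, void *e, const void *d)
    {
        // first call: run cpuid, patch the table, forward the pending call;
        // the racy pointer update is benign since all threads store the
        // same value
        cas128 = have_cmpxchg16b() ? cas128_native : cas128_fallback;
        return cas128(p, e, d);
    }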

On Friday 28 October 2011 17:43:43 Andrey Semashev wrote:
On Friday, October 28, 2011 17:12:55 Domagoj Saric wrote:
On 21.10.2011. 13:06, Tim Blechmann wrote:
compile-time vs run-time dispatching: some instructions are not available on every CPU of a specific architecture, e.g. cmpxchg8b or cmpxchg16b are not available on all ia32/x86_64 cpus. i would appreciate it if these instructions were not used without first performing a CPUID check of whether they are really available (at least in a legacy mode)
the correct way to do that is to have different libraries for sub-architectures and have the runtime linker decide... this requires infrastructure not present in boost
it would be equally correct to have something like:

    static bool has_cmpxchg16b = query_cpuid_for_cmpxchg16b();
    if (has_cmpxchg16b)
        use_cmpxchg16b();
    else
        use_fallback();

less bloat and prbly only a minor performance hit ;)
cmpxchg8b has been available since the original Pentium. dynamic support for such ancient hardware, if supported at all, should preferably not be enabled by default (by forcing dynamic dispatching on everyone).
considering the cost of cmpxchg8b itself, the cost of a branch -- if done correctly [1] -- is most likely immeasurable
Unfortunately, cmpxchg16b is not as common as cmpxchg8b, so a dynamic check would be desirable. However, I would prefer that there were no if's like the one above. Perhaps, a global table of pointers to the actual function implementations would be better. Initially pointers should point to functions that perform cpuid and initialize this table and then call the real functions for the detected hardware. This way we eliminate almost all overhead in the long run, including call_once.
the processor most likely has more difficulties correctly predicting the code flow through a register-indirect branch than a static one, so I am not really sure this is cheaper, but it is in any case worth trying out.

also, this would not be a "single" function pointer but a whole bunch of them, to cover the different atomic operations (reducing everything to CAS generates more lock/unlock cycles in the fallback path otherwise).

[1] I'm thinking of forcing the fallback path out-of-line such that cmpxchg8b is fall-through.

Best regards Helge
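Footnote [1] might look roughly like this (invented names and GCC builtins; not the actual Boost.Atomic code):

    extern bool has_cmpxchg16b;
    void dcas_fallback(void) __attribute__((noinline, cold)); // spinlock path

    void dcas_op(void)
    {
        if (__builtin_expect(!has_cmpxchg16b, 0)) {
            dcas_fallback();   // forced out of line, predicted not taken
            return;
        }
        // the inline "lock cmpxchg16b" sequence is the fall-through path
    }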

considering the cost of cmpxchg8b itself, the cost of a branch -- if done correctly [1] -- is most likely immeasurable
Probably. But I'm a perfectionist. :)
Unfortunately, cmpxchg16b is not as common as cmpxchg8b, so a dynamic check would be desirable. However, I would prefer that there were no if's like the one above. Perhaps, a global table of pointers to the actual function implementations would be better. Initially pointers should point to functions that perform cpuid and initialize this table and then call the real functions for the detected hardware. This way we eliminate almost all overhead in the long run, including call_once.
the processor most likely has more difficulties correctly predicting the code flow through a register-indirect branch than a static one, so I am not really sure this is cheaper, but it is in any case worth trying out
Yes, this needs testing; however, I hope that an unconditional jump should be quite well predictable. My main concern is that without this trick you'll end up calling pthread_once or something like that on every call, and this will be worse than simply jumping to the final destination. Also, this jump is likely to be inlined anyway (and transformed into a call).
also, this would not be a "single" function pointer but a whole bunch of them to cover the different atomic operations (reducing everything to CAS generates more lock/unlock cycles in the fallback path otherwise)
Sure, like I said - a table of pointers.
[1] I'm thinking of forcing the fallback path out-of-line such that cmpxchg8b is fall-through
Hmm, yeah, that may help inlining the cmpxchg8b part into the calling code.

On Monday 31 October 2011 19:29:35 Andrey Semashev wrote:
considering the cost of cmpxchg8b itself, the cost of a branch -- if done correctly [1] -- is most likely immeasurable
Probably. But I'm a perfectionist. :)
me too, but if it does not have a measurable detriment, I consider it perfect :)
Unfortunately, cmpxchg16b is not as common as cmpxchg8b, so a dynamic check would be desirable. However, I would prefer that there were no if's like the one above. Perhaps, a global table of pointers to the actual function implementations would be better. Initially pointers should point to functions that perform cpuid and initialize this table and then call the real functions for the detected hardware. This way we eliminate almost all overhead in the long run, including call_once.
the processor most likely has more difficulties correctly predicting the code flow through a register-indirect branch than a static one, so I am not really sure this is cheaper, but it is in any case worth trying out
Yes, this needs testing; however, I hope that an unconditional jump should be quite well predictable.
it's only predictable as long as it is in the BTB; as soon as it gets flushed -- out of luck. a branch to a static address, to an out-of-line forward address that hits the "predict not taken" default assumption, is on the other hand still essentially free even on a cold cache
also, this would not be a "single" function pointer but a whole bunch of them to cover the different atomic operations (reducing everything to CAS generates more lock/unlock cycles in the fallback path otherwise)
Sure, like I said - a table of pointers.
since boost.atomic is (supposed) to stay a header-only library, there are cases where these will be instantiated multiple times -- the many different pointers may pressure the BTB unduly.

Best regards Helge

hi helge,
shared memory support: the fallback implementation relies on the spinlock pool that is also used by the smart pointers. however, this pool is per-process, so the fallback implementation won't work in shared memory. can this be changed/fixed?
fixing this would require a per-variable lock... depending on the platform this can have enormous overheads.
i've checked N3225, the most recent version of the draft that i have at hand at the moment. 29.4.4 tells me:

The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes.

cheers, tim

On Sunday 30 October 2011 17:45:18 Tim Blechmann wrote:
hi helge,
shared memory support: the fallback implementation relies on the spinlock pool that is also used by the smart pointers. however, this pool is per-process, so the fallback implementation won't work in shared memory. can this be changed/fixed?
fixing this would require a per-variable lock... depending on the platform this can have enormous overheads.
i've checked N3225, the most recent version of the draft that i have at hand at the moment. 29.4.4 tells me:
The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes.
cheers, tim
but "should" != "must", and additionally: a) IMHO atomics for inter-process coordination is the exception, while inter-thread coordination is the norm b) "per-object" lock is expensive (sizeof(pthread_mutex_t) can be 40 bytes) c) when you hit the fallback path, then you are better off using a datastructure with locking to begin with *If* I change the implementation to a per-object lock then this would in my view mean to optimize for an uncommon case, with noticable memory overhead penalties for the common case, to allow using the atomic objects as "drop-in" replacements in an environment where they are not the best solution anyways. (I would favor to move the distinction between using atomic objects for a given data structure versus locking upwards, instead of moving everything downwards and shoe-horning it into boost.atomic). I have serious difficulties justifying such a change, maybe others can offer their opinion? Best regards Helge

On Monday, October 31, 2011 15:01:35 Helge Bahmann wrote:
a) IMHO atomics for inter-process coordination are the exception, while inter-thread coordination is the norm
I have to disagree. Atomics may be used to communicate between processes just as well as between threads, if not better. I often use atomics to perform lock-free IPC between processes. Aside from performance reasons, this often adds to application resilience to process crashes, when classic primitives like a mutex may be left in an invalid state. Multi-process applications (with shared memory to communicate between processes) are less common than other kinds of applications, yes. But that doesn't make this use case exceptional in any way.
b) "per-object" lock is expensive (sizeof(pthread_mutex_t) can be 40 bytes)
c) when you hit the fallback path, then you are better off using a datastructure with locking to begin with
*If* I change the implementation to a per-object lock then this would in my view mean to optimize for an uncommon case, with noticable memory overhead penalties for the common case, to allow using the atomic objects as "drop-in" replacements in an environment where they are not the best solution anyways. (I would favor to move the distinction between using atomic objects for a given data structure versus locking upwards, instead of moving everything downwards and shoe-horning it into boost.atomic).
Well, yes and no. Consider this example, which illustrates the control structure of a lock-free ring buffer:

    struct index_t
    {
        uint32_t version;
        uint32_t index;
    };

I want to be able to write atomic< index_t > so that it compiles and works on any platform, even without 64-bit CAS support in hardware. It may work slower, yes, but it will work. On the other hand, I agree that there is no sense in atomic< std::string > or something like that. But hey, nothing prevents you from shooting yourself in the foot.
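For illustration, such a control word could be advanced with a compare-exchange loop along these lines (a sketch with invented names; it assumes boost::atomic<index_t> accepts this trivially copyable 8-byte struct):

    #include <boost/atomic.hpp>
    #include <boost/cstdint.hpp>

    struct index_t
    {
        boost::uint32_t version;  // bumped on every update to defeat ABA
        boost::uint32_t index;    // position in the ring buffer
    };

    boost::atomic<index_t> head;

    void advance_head(boost::uint32_t capacity)
    {
        index_t expected = head.load(boost::memory_order_acquire);
        index_t desired;
        do {
            desired.version = expected.version + 1;
            desired.index = (expected.index + 1) % capacity;
        } while (!head.compare_exchange_weak(expected, desired,
                         boost::memory_order_acq_rel,
                         boost::memory_order_acquire));
    }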
I have serious difficulties justifying such a change, maybe others can offer their opinion?
I think having a mutex per atomic instance is overkill. However, a spinlock per instance might just be the silver bullet. The size overhead should be quite modest (1 to 4 bytes, I presume) and the performance would still be decent. After all, atomic<> is intended to be used with relatively small types with simple operations, such as copying and arithmetic. In other cases it is natural to use explicit mutexes, and we could emphasise it in the docs.

On Monday 31 October 2011 19:10:05 Andrey Semashev wrote:
On Monday, October 31, 2011 15:01:35 Helge Bahmann wrote:
a) IMHO atomics for inter-process coordination are the exception, while inter-thread coordination is the norm
I have to disagree. Atomics may be used to communicate between processes just as well as between threads, if not better.
the question is not whether they *can* be used, but which case is more common -- and considering the enormous number of simple atomic counters, "init-once" atomic pointers etc. found in typical applications, I am doubtful that inter-process coordination accounts for more than 1% of use cases [...]
Well, yes and no. Consider this example, which illustrates the control structure of a lock-free ring buffer:
struct index_t { uint32_t version; uint32_t index; };
I want to be able to write atomic< index_t > so that it compiles and works on any platform, even without 64-bit CAS support in hardware. It may work slower, yes, but it will.
what's wrong with just implementing a platform-specific "ipc queue"? mind that you are going to rely on platform specifics as soon as you start considering things such as sleep/wakeup for congestion control
I have serious difficulties justifying such a change, maybe others can offer their opinion?
I think having a mutex per atomic instance is overkill. However, a spinlock per instance might just be the silver bullet. The size overhead should be quite modest (1 to 4 bytes, I presume) and the performance would still be decent. After all, atomic<> is intended to be used with relatively small types with simple operations, such as copying and arithmetic. In other cases it is natural to use explicit mutexes, and we could emphasise it in the docs.
might be possible; the problem is that this assumes that there is an atomic<something> available -- as soon as you hit a platform where everything hits the fallback, you just have to use a mutex and the cost becomes unbearable.

Best regards Helge

On Monday, October 31, 2011 21:33:19 Helge Bahmann wrote:
I have to disagree. Atomics may be used to communicate between processes just as well as between threads, if not better.
the question is not whether they *can* be used, but which case is more common -- and considering the enormous number of simple atomic counters, "init-once" atomic pointers etc. found in typical applications, I am doubtful that inter-process coordination accounts for more than 1% of use cases
But that doesn't make the inter-process use case a second class citizen, does it? The implementation should support it just as well as intra-process cases.
I want to be able to write atomic< index_t > so that it compiles and works on any platform, even without 64-bit CAS support in hardware. It may work slower, yes, but it will.
what's wrong with just implementing a platform-specific "ipc queue"?
Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive. The bottom line is that I want to be able to write generic code with atomics which transparently works on many platforms. I may be able to tune it on one or two platforms that are important to me right now, but the code may work just as well on others. If it works slower, too bad, but it still works - it's better than nothing. If better platform-specific code can be written in this case, and there is someone willing to write it - no problem, let him do that.
I think having a mutex per atomic instance is overkill. However, a spinlock per instance might just be the silver bullet. The size overhead should be quite modest (1 to 4 bytes, I presume) and the performance would still be decent. After all, atomic<> is intended to be used with relatively small types with simple operations, such as copying and arithmetic. In other cases it is natural to use explicit mutexes, and we could emphasise it in the docs.
might be possible; the problem is that this assumes that there is an atomic<something> available -- as soon as you hit a platform where everything hits the fallback, you just have to use a mutex and the cost becomes unbearable
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.

I think having a mutex per atomic instance is overkill. However, a spinlock per instance might just be the silver bullet. The size overhead should be quite modest (1 to 4 bytes, I presume) and the performance would still be decent. After all, atomic<> is intended to be used with relatively small types with simple operations, such as copying and arithmetic. In other cases it is natural to use explicit mutexes, and we could emphasise it in the docs.
might be possible; the problem is that this assumes that there is an atomic<something> available -- as soon as you hit a platform where everything hits the fallback, you just have to use a mutex and the cost becomes unbearable
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
well, it is quite a chicken-and-egg problem, we need atomics to implement atomics to implement atomics, when atomics are not available. but in the real world i guess all platforms will provide some kind of atomic operations, which are sufficient to implement basic spinlocks.

it would also be fine with me to delegate the implementation to boost::detail::spinlock in the smart_ptr library (assuming that it will never be implemented via atomic<>)

cheers, tim

On Monday 31 October 2011 23:30:19 Tim Blechmann wrote:
I think having a mutex per atomic instance is overkill. However, a spinlock per instance might just be the silver bullet. The size overhead should be quite modest (1 to 4 bytes, I presume) and the performance would still be decent. After all, atomic<> is intended to be used with relatively small types with simple operations, such as copying and arithmetic. In other cases it is natural to use explicit mutexes, and we could emphasise it in the docs.
might be possible; the problem is that this assumes that there is an atomic<something> available -- as soon as you hit a platform where everything hits the fallback, you just have to use a mutex and the cost becomes unbearable
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
well, it is quite a chicken-and-egg problem, we need atomics to implement atomics to implement atomics, when atomics are not available. but in the real world i guess all platforms will provide some kind of atomic operations, which are sufficient to implement basic spinlocks.
it would also be fine with me to delegate the implementation to boost::detail::spinlock in the smart_ptr library (assuming that it will never be implemented via atomic<>)
I think that it might long-term be worthwhile to consider formulating smart_ptr in terms of atomic<> -- moving the existing spinlock pool from one place to another would be a relatively trivial change.

Best regards Helge

Helge Bahmann wrote:
I think that it might long-term be worthwhile to consider formulating smart_ptr in terms of atomic<>
For the record, I am already doing this and the code is posted here: https://svn.boost.org/trac/boost/ticket/5625

I think it's important that we consolidate the platform-specific atomic code in one place, so that subtle issues like e.g. trac issue 5372 only need to be fixed in one place.

If anyone thinks that Boost.Atomic is unsuitable for any of the places where Boost (or anything else for that matter) is currently using custom platform-specific atomics code, we should investigate now and decide how Boost.Atomic (and/or std::atomic) can be fixed to make it suitable.

Regards, Phil.

Phil Endecott wrote:
For the record, I am already doing this and the code is posted here:
A few quick comments... you need explicit memory orders (acqrel for --, relaxed for ++), because SC is overkill; long c = use_count_ should be out of the loop.

On Monday 31 October 2011 22:18:16 Andrey Semashev wrote:
On Monday, October 31, 2011 21:33:19 Helge Bahmann wrote:
I have to disagree. Atomics may be used to communicate between processes just as well as between threads, if not better.
the question is not whether they *can* be used, but which case is more common -- and considering the enormous number of simple atomic counters, "init-once" atomic pointers etc. found in typical applications, I am doubtful that inter-process coordination accounts for more than 1% of use cases
But that doesn't make the inter-process use case a second class citizen, does it? The implementation should support it just as well as intra-process cases.
If adding support for one use case penalizes another one, it is a balancing question and you have to answer "which one is more frequent", it's as simple as that.
I want to be able to write atomic< index_t > so that it compiles and works on any platform, even without 64-bit CAS support in hardware. It may work slower, yes, but it will.
what's wrong with just implementing a platform-specific "ipc queue"?
Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive.
This is ridiculous. May I invite you to have a look at socket communication via boost.asio? So this essentially boils down to you considering it okay to penalize the common use case for atomic operations, just to spare you the trivial implementation expense of using existing IPC as a fallback? Besides, there is still the option of implementing something like "interprocess.atomic" that does what you want, without penalizing the process-local case. [...]
I think having a mutex per atomic instance is overkill. However, a spinlock per instance might just be the silver bullet. The size overhead should be quite modest (1 to 4 bytes, I presume) and the performance would still be decent. After all, atomic<> is intended to be used with relatively small types with simple operations, such as copying and arithmetic. In other cases it is natural to use explicit mutexes, and we could emphasise it in the docs.
might be possible; the problem is that this assumes that there is an atomic<something> available -- as soon as you hit a platform where everything hits the fallback, you just have to use a mutex and the cost becomes unbearable
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
if it is not ported to the platform, then nothing is available. I have only recently finished the sparcv9 implementation; itanium and mips are still missing, so they would suffer immediately.

Best regards Helge

Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive.
This is ridiculous. May I invite you to have a look at socket communication via boost.asio?
socket communication and shared memory have quite different performance characteristics. e.g. i would not trust accessing sockets from a real-time thread.
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
if it is not ported to the platform, then nothing is available. I have only recently finished the sparcv9 implementation; itanium and mips are still missing, so they would suffer immediately.
well, the spinlock pool uses boost::detail::spinlock (i hope no platform will rely on the fallback implementation in spinlock_nt.hpp, which seems to be simply wrong). imo, associating one with each emulated atomic<> would not be a huge memory overhead.

iac, if we want boost::atomic<> to behave similarly to std::atomic<>, so that it works as a drop-in replacement, the implementation should follow the standard as closely as possible. in the end it would be great if someone could simply use a different namespace and maybe add/remove the BOOST_ prefix from preprocessor symbols to switch between boost.atomic and a c++11 implementation ...

cheers, tim

On Tuesday 01 November 2011 10:49:55 Tim Blechmann wrote:
Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive.
This is ridiculous. May I invite you to have a look at socket communication via boost.asio?
socket communication and shared memory have quite different performance characteristics.
is there some semantic ambiguity to the word "fallback" that escapes me? or do you expect the "fallback" to have the same performance characteristic as the "optimized" implementation, always? then please explain to me how a fallback for atomic variables using locks is going to preserve your expected performance characteristics
e.g. i would not trust on accessing sockets from a real-time thread.
what makes you believe that message channels in real-time systems were designed so dumb as to make them unusable for real-time purposes?
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
if it is not ported to the platform then nothing is avaialble. I have only recently finished sparcv9 implementation, itanium and mips still missing, so they would suffer immediately.
well, the spinlock pool uses boost::detail::spinlock (i hope no platform will rely on the fallback implementation in spinlock_nt.hpp, which seems to be simply wrong). imo, associating one with each emulated atomic<> would not be a huge memory overhead.
iac, if we want boost::atomic<> to behave similarly to std::atomic<>, so that it works as a drop-in replacement, the implementation should follow the standard as closely as possible.
right, but the standard implementation for gcc does not use a spinlock per object (see __atomic_flag_for_address), which makes all of this moot - there is NO guarantee for std::atomic to be safe interprocess, period

Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive.
This is ridiculous. May I invite you to have a look at socket communication via boost.asio?
socket communication and shared memory have quite different performance characteristics.
is there some semantic ambiguity to the word "fallback" that escapes me? or do you expect the "fallback" to have the same performance characteristic as the "optimized" implementation, always? then please explain to me how a fallback for atomic variables using locks is going to preserve your expected performance characteristics
imo, `fallback' would mean that i can still compile the program, without the need to provide a different implementation for the case that atomics are not lockfree/interprocess safe.
e.g. i would not trust accessing sockets from a real-time thread.
what makes you believe that message channels in real-time systems were designed so dumb as to make them unusable for real-time purposes?
life would be so much easier for me if the users of my software did not run it on off-the-shelf operating systems ;)
right, but the standard implementation for gcc does not use a spinlock per object (see __atomic_flag_for_address), which makes all of this moot - there is NO guarantee for std::atomic to be safe interprocess, period
well, i'd say this is a problem of gcc's implementation of std::atomic. this doesn't justify boost.atomic not following the suggestion of the standard.

cheers, tim

On Tuesday 01 November 2011 12:59:09 Tim Blechmann wrote:
Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive.
This is ridiculous. May I invite you to have a look at socket communication via boost.asio?
socket communication and shared memory have quite different performance characteristics.
is there some semantic ambiguity to the word "fallback" that escapes me? or do you expect the "fallback" to have the same performance characteristic as the "optimized" implementation, always? then please explain to me how a fallback for atomic variables using locks is going to preserve your expected performance characteristics
imo, `fallback' would mean that i can still compile the program, without the need to provide a different implementation for the case that atomics are not lockfree/interprocess safe.
implementing the fallback for something as simple as a "queue" via some socket-based IPC is an entry-level question in a programmer job interview, so the effort required is rather trivial
e.g. i would not trust accessing sockets from a real-time thread.
what makes you believe that message channels in real-time systems were designed so dumb as to make them unusable for real-time purposes?
life would be so much easier for me if the users of my software did not run it on off-the-shelf operating systems ;)
and what makes you believe that the performance characteristics of sockets in off-the-shelf operating systems are unsuitable for real-time, while the process scheduling characteristics of off-the-shelf operating systems are suitable for real-time?

(yes, I understand you want to maximise quality of service while accepting that it is only probabilistically real-time; just pointing out that a) you should back up such statements with measurements and b) you should forget about any hope that your system is going to meet your requirements on any random platform you do not have available and have not tested on).
right, but the standard implementation for gcc does not use a spinlock per object (see __atomic_flag_for_address), which makes all of this moot - there is NO guarantee for std::atomic to be safe interprocess, period
well, i'd say this is a problem of gcc's implementation of std::atomic. this doesn't justify that boost.atomic does not follow the suggestion of the standard.
the standard says "should" not "must" -- the gcc guys have not made this decision without good reasons, and I agree with these reasons note that there is also trouble lurking with run-time selection of whether something like atomic<uint64_t> is atomic via cmpxchg8b: do you really want to make this at minimum 12 bytes in size (though effectively occupying 16 due to alignment) just to save the room for the rarely-if-ever used per-object spinlock? same with atomic<128> and cmpxchg16b ? Similar problem on sparc -- sparcv8 must emulate everything via spinlock, on sparcv9 everything is lockfree, ideally should be decided at runtime. And BTW who says that a spinlock is "good enough"? Are you certain that a priority-inheriting mutex would not be required in some use cases? I would like to repeat the questions that nobody has answered so far: 1. What ratio of inter-thread/inter-process use of atomic variables do you expect? 2. What is wrong with implementing boost::interprocess::atomic<T> that specializes to boost::atomic<T> when possible and uses an interprocess lock otherwise? 3. What is wrong with implementing "my_maybe_lockfree_queue" that specializes to an implementation using boost::atomic<T> when possible and uses classical IPC otherwise? Since the mechanisms required to establish shared memory segments between processes is already platform-specific, why can you not make the distinction between different IPC mechanisms there? I am very much in favor of covering as many use cases as possible, and if the extension comes at no cost I will implement it right away -- but I do not see any good reason to penalize "well-designed" software with additional overhead to cater for the requirements of what I regard as a high-level design problem. Best regards Helge

Helge Bahmann wrote:
the standard says "should" not "must" -- the gcc guys have not made this decision without good reasons, and I agree with these reasons
It says "should" but only for the lock-free case: [ Note: Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes. —end note ] It also says: [ Note: the representation of an atomic specialization need not have the same size as its corresponding argument type. Specializations should have the same size whenever possible, as this reduces the effort required to port existing code. —end note ] The two recommendations are contradictory, so it's a quality of implementation issue.

On Tuesday 01 November 2011 14:06:46 Peter Dimov wrote:
Helge Bahmann wrote:
the standard says "should" not "must" -- the gcc guys have not made this decision without good reasons, and I agree with these reasons
It says "should" but only for the lock-free case:
[ Note: Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes. —end note ]
and boost.atomic provides precisely that -- if the operations are lock-free, they are process-state free.

Best regards Helge

the standard says "should" not "must" -- the gcc guys have not made this decision without good reasons, and I agree with these reasons
It says "should" but only for the lock-free case:
[ Note: Operations that are lock-free should also be address-free. That is, atomic operations on the same memory location via two different addresses will communicate atomically. The implementation should not depend on any per-process state. This restriction enables communication by memory that is mapped into a process more than once and by memory that is shared between two processes. —end note ]
i am neither a native speaker nor a language lawyer. but to me sentence 3 does not seem to be limited to the lock-free state.
[ Note: the representation of an atomic specialization need not have the same size as its corresponding argument type. Specializations should have the same size whenever possible, as this reduces the effort required to port existing code. —end note ]
The two recommendations are contradictory, so it's a quality of implementation issue.
tim

imo, `fallback' would mean that i can still compile the program, without the need to provide a different implementation for the case that atomics are not lockfree/interprocess safe.
implementing the fallback for something as simple as a "queue" via some socket-based IPC is an entry-level question in a programmer job interview, so the effort required is rather trivial
... it may be trivial, but it still takes time ... apart from that, there are more complex communication mechanisms than a `queue'
e.g. i would not trust accessing sockets from a real-time thread.
what makes you believe that message channels in real-time systems were designed so dumb as to make them unusable for real-time purposes?
life would be so much easier for me if the users of my software did not run it on off-the-shelf operating systems ;)
and what makes you believe that the performance characteristics of sockets in off-the-shelf operating systems are unsuitable for real-time, while the process scheduling characteristics of off-the-shelf operating systems are suitable for real-time?
i cannot comment on windows, but both osx and linux provide ways to schedule a thread at a priority such that it won't be preempted by any other thread, so the process scheduler won't interfere.
right, but the standard implementation for gcc does not use a spinlock per object (see __atomic_flag_for_address), which makes all of this moot - there is NO guarantee for std::atomic to be safe interprocess, period
well, i'd say this is a problem of gcc's implementation of std::atomic. this doesn't justify boost.atomic not following the suggestion of the standard.
the standard says "should" not "must" -- the gcc guys have not made this decision without good reasons, and I agree with these reasons
note that there is also trouble lurking with run-time selection of whether something like atomic<uint64_t> is atomic via cmpxchg8b: do you really want to make this at minimum 12 bytes in size (though effectively occupying 16 due to alignment) just to save room for the rarely-if-ever used per-object spinlock? same with a 128-bit atomic<> and cmpxchg16b?
i don't really care about the size of an atomic<>. one other point for runtime selection is compatibility: e.g. you may have two binaries, one compiled with support for double-width CAS, the other without. with compile-time dispatching the behavior would be undefined; with run-time selection it would be handled gracefully ...
2. What is wrong with implementing boost::interprocess::atomic<T> that specializes to boost::atomic<T> when possible and uses an interprocess lock otherwise?
will you provide an implementation? people were asking for boost::interprocess support during the boost.lockfree review. the only missing feature is the atomic<> implementation.

cheers, tim

On Tuesday, November 01, 2011 11:29:22 Helge Bahmann wrote:
On Tuesday 01 November 2011 10:49:55 Tim Blechmann wrote:
socket communication and shared memory have quite different performance characteristics.
is there some semantic ambiguity to the word "fallback" that escapes me? or do you expect the "fallback" to have the same performance characteristic as the "optimized" implementation, always? then please explain to me how a fallback for atomic variables using locks is going to preserve your expected performance characteristics
Lock-based IPC can be much more efficient than socket-based. And that's not to mention the much more sophisticated access models to the shared data that are possible with shared memory. Calling sockets a drop-in replacement for shared memory IPC (whether it uses atomics or locks) is nonsense, IMHO.
right, but the standard implementation for gcc does not use a spinlock per object (see __atomic_flag_for_address) which turns all of this moot - there is NO guarantee for std::atomic to be safe interprocess, period
GCC has the luxury of a shared runtime which can provide a process-wide table of spinlocks. Boost.Atomic is header-only, so in multi-module applications this table has to be exported so that all modules use the same table. I mentioned that in the review. Do you have ideas of how to achieve that? Most obvious would be to link to Boost.Atomic dynamically...

On Tuesday 01 November 2011 19:08:20 Andrey Semashev wrote:
On Tuesday, November 01, 2011 11:29:22 Helge Bahmann wrote:
On Tuesday 01 November 2011 10:49:55 Tim Blechmann wrote:
socket communication and shared memory have quite different performance characteristics.
is there some semantic ambiguity to the word "fallback" that escapes me? or do you expect the "fallback" to have the same performance characteristic as the "optimized" implementation, always? then please explain to me how a fallback for atomic variables using locks is going to preserve your expected performance characteristics
Lock-based IPC can be much more efficient than socket-based.
Well then create a shared memory region and use explicit locking. This is even more efficient than the many implicit locks in minuscule atomic objects. See, by *not* exporting the spinlock of the fallback path of atomic<> objects I just made your IPC even more efficient.
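For example, with Boost.Interprocess a single explicit lock can guard arbitrarily much shared state (a minimal sketch; the segment and object names are made up):

    #include <boost/interprocess/managed_shared_memory.hpp>
    #include <boost/interprocess/sync/interprocess_mutex.hpp>
    #include <boost/interprocess/sync/scoped_lock.hpp>

    using namespace boost::interprocess;

    // one explicit, process-shared lock guarding a whole block of plain
    // data -- instead of one implicit spinlock per tiny atomic variable
    struct shared_state
    {
        interprocess_mutex mutex;
        int counter;
        int other_field;
    };

    int main()
    {
        managed_shared_memory shm(open_or_create, "demo_shm", 4096);
        shared_state *s = shm.find_or_construct<shared_state>("state")();

        scoped_lock<interprocess_mutex> guard(s->mutex);
        ++s->counter;        // any number of fields updated under one lock
        s->other_field = 42;
        return 0;
    }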
right, but the standard implementation for gcc does not use a spinlock per object (see __atomic_flag_for_address) which turns all of this moot - there is NO guarantee for std::atomic to be safe interprocess, period
GCC has the luxury of a shared runtime which can provide a process-wide table of spinlocks. Boost.Atomic is header-only, so in multi-module applications this table has to be exported so that all modules use the same table. I mentioned that in the review. Do you have ideas of how to achieve that? Most obvious would be to link to Boost.Atomic dynamically...
the same way shared_ptr treats its spinlock pool (template specialization, and I guess this leads to a symbol with vague linkage, all instances of which are collapsed into a single instance by the run-time linker) Best regards Helge
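Concretely, the shared_ptr trick looks roughly like this (a sketch with made-up names, modeled on boost::detail::spinlock_pool):

    #include <atomic>
    #include <cstddef>

    struct spinlock
    {
        std::atomic<bool> locked;  // zero-initialized: the pool below
                                   // has static storage duration
        void lock()
        {
            while (locked.exchange(true, std::memory_order_acquire)) {}
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };

    // a static data member of a class template has vague linkage: the
    // linker (and, on ELF platforms, the run-time linker across shared
    // objects) collapses all copies into a single instance, so even a
    // header-only library gets one process-wide pool
    template<int I>
    struct spinlock_pool
    {
        static spinlock pool_[41];
        static spinlock &spinlock_for(void const *p)
        {
            return pool_[reinterpret_cast<std::size_t>(p) % 41];
        }
    };

    template<int I>
    spinlock spinlock_pool<I>::pool_[41];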

On Tuesday, November 01, 2011 19:29:50 Helge Bahmann wrote:
On Tuesday 01 November 2011 19:08:20 Andrey Semashev wrote:
Lock-based IPC can be much more efficient than socket-based.
Well then create a shared memory region and use explicit locking. This is even more efficient than the many implicit locks in minuscule atomic objects.
See, by *not* exporting the spinlock of the fallback path of atomic<> objects I just made your IPC even more efficient.
That's arguable, but I caught your drift. Let's say boost::interprocess::atomic<> will make me happy.
GCC has the luxury of a shared runtime which can provide a process-wide table of spinlocks. Boost.Atomic is header-only, so in multi-module applications this table has to be exported so that all modules use the same table. I mentioned that in the review. Do you have ideas of how to achieve that? Most obvious would be to link to Boost.Atomic dynamically...
the same way shared_ptr treats its spinlock pool (template specialization, and I guess this leads to a symbol with vague linkage, all instances of which are collapsed into a single instance by the run-time linker)
It doesn't work on Windows, where each DLL gets its own copy of a template's static data. shared_ptr is fine there because it actually uses real atomic ops on that platform. IIUC, Boost.Atomic cannot be saved by that.

GCC has the luxury of a shared runtime which can provide a process-wide table of spinlocks. Boost.Atomic is header-only, so in multi-module applications this table has to be exported so that all modules use the same table. I mentioned that in the review. Do you have ideas of how to achieve that? Most obvious would be to link to Boost.Atomic dynamically...
the same way shared_ptr treats its spinlock pool (template specialization, and I guess this leads to a symbol with virtual linkage, the instances all of which are collapsed into a single instance by the run-time linker)
It doesn't work on Windows, where each DLL gets its own copy of a template's static data. shared_ptr is fine there because it actually uses real atomic ops on that platform. IIUC, Boost.Atomic cannot be saved by that.
that's a show-stopper indeed, and I suspect it means a shared library then :( Thanks for the heads-up. Best regards Helge

On Tuesday, November 01, 2011 06:57:20 Helge Bahmann wrote:
On Monday 31 October 2011 22:18:16 Andrey Semashev wrote:
But that doesn't make the inter-process use case a second class citizen, does it? The implementation should support it just as well as intra-process cases.
If adding support for one use case penalizes another one, it is a balancing question and you have to answer "which one is more frequent"; it's as simple as that.
Just to be clear, what penalty are we talking about? Is it +4 bytes to a normally 16-byte atomic<> on x86-64? Because other atomic types are assumed to have native atomic ops and don't need that extra storage for the spinlock. I'm not aware of the state of things on other architectures; what would the penalty be there?
what's wrong with just implementing a platform-specific "ipc queue"?
Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive.
This is ridiculous. May I invite you to have a look at socket communication via boost.asio?
I don't see how this is relevant. If you're implying that shared memory IPC is somehow equivalent to socket IPC, I've got bad news for you.
Besides, there is still the option of implementing something like "interprocess.atomic" that does what you want, without penalizing the process-local case.
Fair enough.
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
if it is not ported to the platform then nothing is available. I have only recently finished the sparcv9 implementation; itanium and mips are still missing, so they would suffer immediately.
I'm not sure I follow. Are you saying that IA64 and MIPS don't have native atomic ops, or that they are currently somehow "supported" by Boost.Atomic without porting? Even if you don't port Boost.Atomic to these platforms you do have to port at least the spinlock.

On Tuesday 01 November 2011 18:44:30 Andrey Semashev wrote:
On Tuesday, November 01, 2011 06:57:20 Helge Bahmann wrote:
On Monday 31 October 2011 22:18:16 Andrey Semashev wrote:
But that doesn't make the inter-process use case a second class citizen, does it? The implementation should support it just as well as intra-process cases.
If adding support for one use case penalizes another one, it is a balancing question and you have to answer "which one is more frequent"; it's as simple as that.
Just to be clear, what penalty are we talking about? Is it +4 bytes to a normally 16-byte atomic<> on x86-64? Because other atomic types are assumed to have native atomic ops and don't need that extra storage for the spinlock. I'm not aware of the state of things on other architectures; what would the penalty be there?
see the other post for more examples (8 bytes for a 4-byte atomic on sparc, >16 bytes overhead on PA-RISC) and note that due to alignment requirements that would effectively make a 16-byte atomic on x86-64 occupy 32 bytes (just put two atomic<tagged_ptr> instances side by side)
what's wrong with just implementing a platform-specific "ipc queue"?
Lots of reasons. I may not have access to all platforms, to begin with. I may not have enough knowledge about the hardware capabilities of all of the platforms. Manual porting to a multitude of platforms may be expensive.
This is ridiculous. May I invite you to have a look at socket communication via boost.asio?
I don't see how this is relevant. If you're implying that shared memory IPC is somehow equivalent to socket IPC, I've got bad news for you.
Upper layers are not supposed to care about the IPC mechanism but its semantics (buffer size, datagram size, datagram loss/ordering, capacity, over-/underflow behavior, blocking, etc.). If in your design upper layers need to be intrinsically aware of the innards of your IPC mechanism, then I have bad news about your architecture.
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
if it is not ported to the platform then nothing is available. I have only recently finished the sparcv9 implementation; itanium and mips are still missing, so they would suffer immediately.
I'm not sure I follow. Are you saying that IA64 and MIPS don't have native atomic ops, or that they are currently somehow "supported" by Boost.Atomic without porting? Even if you don't port Boost.Atomic to these platforms you do have to port at least the spinlock.
falling back to the spinlock pool from shared_ptr Best regards Helge

On Tuesday, November 01, 2011 19:05:07 Helge Bahmann wrote:
On Tuesday 01 November 2011 18:44:30 Andrey Semashev wrote:
On Tuesday, November 01, 2011 06:57:20 Helge Bahmann wrote:
This is ridiculous. May I invite you to have a look at socket communication via boost.asio?
I don't see how this is relevant. If you imply that shared memory IPC is somehow equivalent to socket IPC, I've got bad news for you.
Upper layers are not supposed to care about the IPC mechanism but its semantics (buffer size, datagram size, datagram loss/ordering, capacity, over-/underflow behavior, blocking, etc.).
They don't. Like I said, the data access model is not always restricted to a queue or a stream, and sockets are not well suited for these cases.
If in your design upper layers need to be intrinsically aware of the innards of your IPC mechanism, then I have bad news about your architecture.
Gladly, we don't have to discuss my design choices right now. ;)
if it is not ported to the platform then nothing is available. I have only recently finished the sparcv9 implementation; itanium and mips are still missing, so they would suffer immediately.
I'm not sure I follow. Are you saying that IA64 and MIPS don't have native atomic ops, or that they are currently somehow "supported" by Boost.Atomic without porting? Even if you don't port Boost.Atomic to these platforms you do have to port at least the spinlock.
falling back to the spinlock pool from shared_ptr
Right, and someone has to port it to these platforms. Honestly, I think one of us is missing something here. My initial assertion was that there is no point in Boost.Atomic on architectures without any hardware support for atomic ops. This is irrelevant to the level of support of a particular platform by the library. Obviously, porting is required for the architectures you mentioned; I'm just saying that you are free to assume that you always have some atomic ops on the platform you're trying to support.

might be possible, but the problem is that this assumes that there is atomic<something> available -- as soon as you hit a platform where everything hits the fallback, you just have to use a mutex and the cost becomes unbearable
True. But are there realistic platforms without any support of atomic ops whatsoever today? If there are, I'm not sure the library should support these platforms in the first place.
there is PA-RISC, which only supports LDCW, which must be cacheline-aligned -- as a result, each atomic value would have the size of one cacheline (whether PA-RISC should be supported is another question)
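for a sense of scale, the implied layout would be something like this (illustrative only; a 64-byte cache line is assumed, and the names are made up):

    // with ldcw (load-and-clear word) as the sole atomic primitive and
    // its lock word padded out to a full cache line, every "atomic" int
    // costs a whole cache line
    struct parisc_atomic_int
    {
        int lock_word;   // target of ldcw
        int value;       // the actual payload, guarded by lock_word
        char padding[64 - 2 * sizeof(int)];
    };
    // sizeof(parisc_atomic_int) == 64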
participants (6): Andrey Semashev, Domagoj Saric, Helge Bahmann, Peter Dimov, Phil Endecott, Tim Blechmann