
Sid Sacek wrote:
I don't like the spin-lock approach for the atomic variables.
If the mutex is already locked by another thread, spinning just wastes CPU cycles on a single-processor machine until the current thread's time-slice ends. That's typically a 10-millisecond waste of CPU in the worst case.
Hi Sid,

Yes, but because the variable is locked for just a handful of instructions during the atomic increment/decrement, the probability of being pre-empted while it is locked is small. So spinning is rare, and the average slowdown is small. It is more important to optimise the common case, where the variable is not locked.

In particular, spinning is much better than using pthreads, which is what boost::shared_ptr does at present on ARM Linux; the function-call overhead alone will dwarf the time spent spinning. It also requires that each shared pointer be larger, since it must also store the mutex.

To quantify this, I've timed a simple single-threaded loop that does 100 million shared-pointer increments and decrements. With pthreads it takes 180 seconds, while with my spinning implementation it takes 22 seconds. That's about an additional microsecond per atomic operation. Comparing that with the 10 ms wasted by spinning when an atomic operation is interrupted, the break-even point is when one in every 10 000 atomic operations is interrupted by a thread that will access the same variable. On my hardware, the window during which the variable stores the sentinel value is about 15 ns, so with the scheduler running 100 times per second the probability of interrupting an atomic operation is about one in 666 667. That's a much lower probability than one in 10 000, so the spinning solution is better. [Do let me know if you can see any flaws in my logic.]

An alternative is to simply yield when the variable is found to be locked:

  #include <sched.h>

  static inline long atomic_read_and_lock(long& mem)
  {
      long val;
      do {
          // Equivalent to:
          //   val = mem;
          //   mem = -1;
          asm volatile ("swp\t%0, %1, [%2]"
                        : "=&r" (val)
                        : "r" (-1), "r" (&mem)
                        : "memory");
          if (val != -1) {
              break;
          }
          sched_yield();
      } while (1);
      return val;
  }

In the typical case this executes no more instructions than the spin code, and my test program runs at the same speed. But it does increase the code size, which matters to me on my embedded system without much cache. It is also OS-specific.

The benefits of each approach will depend on the workload, i.e. how often shared pointers are actually used by more than one thread, and I suggest benchmarking your application with each of the alternatives.

Regards,
Phil.
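
For reference, here is a minimal sketch of the spinning variant described above, using the same SWP-based sentinel scheme. The names spin_read_and_lock, atomic_write_and_unlock, atomic_increment and atomic_decrement are illustrative assumptions, not names taken from Boost, and the sketch assumes a single-processor ARM where SWP is available and -1 is never a legitimate reference count.

  // A sketch, not the Boost implementation: same SWP-based sentinel
  // scheme as above, but busy-waiting instead of calling sched_yield().
  static inline long spin_read_and_lock(long& mem)
  {
      long val;
      do {
          // Atomically: val = mem; mem = -1;  (-1 marks the variable locked)
          asm volatile ("swp\t%0, %1, [%2]"
                        : "=&r" (val)
                        : "r" (-1), "r" (&mem)
                        : "memory");
          // If another thread holds the lock we read back the sentinel;
          // keep spinning until a real value appears.
      } while (val == -1);
      return val;
  }

  static inline void atomic_write_and_unlock(long& mem, long val)
  {
      // Storing the new value replaces the sentinel and releases the lock.
      // The empty asm is a compiler barrier so later memory accesses are
      // not moved before the unlocking store.
      mem = val;
      asm volatile ("" : : : "memory");
  }

  static inline long atomic_increment(long& mem)
  {
      long val = spin_read_and_lock(mem) + 1;
      atomic_write_and_unlock(mem, val);
      return val;
  }

  static inline long atomic_decrement(long& mem)
  {
      long val = spin_read_and_lock(mem) - 1;
      atomic_write_and_unlock(mem, val);
      return val;
  }

In the common case each call executes only the swap, a compare and a store; the spin loop is entered only if another thread was pre-empted inside its own atomic operation on the same variable.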