I'm not familiar with IA64 or the HP intrinsics, but a few quick comments.

Baruch Zilber wrote:
    inline void atomic_increment( int * pw )
    {
        _Asm_mf();
        static_cast<int>(_Asm_fetchadd(_FASZ_W, _SEM_REL, (void*)pw, +1, _LDHINT_NONE) + 1);
    }
The mf is redundant; _SEM_REL has the same effect. This should probably be:

    inline void atomic_increment( int * pw )
    {
        _Asm_fetchadd(_FASZ_W, _SEM_REL, pw, +1, _LDHINT_NONE);
    }
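For comparison, here is the same increment sketched with portable std::atomic (my substitution, not the HP intrinsics; fetchadd4.rel corresponds to fetch_add with release ordering):

```cpp
#include <atomic>

// Sketch only: release-ordered increment, the portable analogue of
// _Asm_fetchadd(_FASZ_W, _SEM_REL, pw, +1, _LDHINT_NONE).
inline void atomic_increment( std::atomic<int> * pw )
{
    pw->fetch_add( 1, std::memory_order_release );
}
```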
    inline int atomic_decrement( int * pw )
    {
        _Asm_mf();
        return (static_cast<int>(_Asm_fetchadd(_FASZ_W, _SEM_REL, (void*)pw, -1, _LDHINT_NONE) - 1) - 1);
    }
The leading mf is redundant here, too. In addition, an acquire barrier in the zero case is missing, and there are two -1's where probably just one is needed. In pseudocode, atomic_decrement needs to be:

    int r = fetchadd4.rel( pw, -1 );
    if( r == 1 ) ld4.acq( pw ); // or mf
    return r - 1;

So, a wild guess:

    inline int atomic_decrement( int * pw )
    {
        int r = (int)_Asm_fetchadd( _FASZ_W, _SEM_REL, pw, -1, _LDHINT_NONE );
        if( r == 1 ) _Asm_mf();
        return r - 1;
    }

We might be able to replace the _Asm_mf with _Asm_ld, but I'm not sure whether it will generate ld4.acq.
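The release-decrement-plus-acquire-on-zero pattern above can also be sketched with portable std::atomic (again my substitution, not the intrinsics; an acquire fence stands in for the ld4.acq/mf):

```cpp
#include <atomic>

// Sketch only: decrement with release ordering; when the old value was 1
// (count just hit zero) issue an acquire fence, mirroring the ld4.acq/mf
// in the pseudocode above.
inline int atomic_decrement( std::atomic<int> * pw )
{
    int r = pw->fetch_sub( 1, std::memory_order_release );
    if( r == 1 )
    {
        std::atomic_thread_fence( std::memory_order_acquire );
    }
    return r - 1;
}
```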
    inline int atomic_conditional_increment( int * pw )
    {
        return _Asm_mov_to_ar((_Asm_app_reg)_AREG_CCV, *pw,
                (_Asm_fence)(_UP_CALL_FENCE | _UP_SYS_FENCE | _DOWN_CALL_FENCE | _DOWN_SYS_FENCE)),
            _Asm_mf(),
            (_Asm_cmpxchg((_Asm_sz)4, (_Asm_sem)_SEM_REL, pw, *pw + 1, (_Asm_ldhint)_LDHINT_NONE));
    }
This doesn't look correct to me. In pseudocode, I believe that it needs to be:

    int v = *pw;

    for(;;)
    {
        if( v == 0 ) return 0;

        int r = cmpxchg( pw, v /*old*/, v+1 /*new*/ );

        if( r == v ) return r+1;
        v = r;
    }

The above code seems to implement just the cmpxchg primitive. The mf is redundant in this case, too.
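The compare-and-swap loop in that pseudocode can be sketched with portable std::atomic (my substitution; compare_exchange_weak reloads the current value into v on failure, so the explicit "v = r" step folds into the call):

```cpp
#include <atomic>

// Sketch only: increment the count unless it is zero, returning the new
// value (0 if the count was already zero). This is the loop from the
// pseudocode above, expressed with compare_exchange_weak.
inline int atomic_conditional_increment( std::atomic<int> * pw )
{
    int v = pw->load( std::memory_order_relaxed );

    for(;;)
    {
        if( v == 0 ) return 0;

        // On failure, v is updated to the value currently in *pw.
        if( pw->compare_exchange_weak( v, v + 1, std::memory_order_relaxed ) )
            return v + 1;
    }
}
```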