Date: Tue, 21 Feb 2006 19:03:11 -0600
From: David Greene <greened@obbligato.org>
Subject: Re: [Boost-users] [statechart] Asynchronous Machines
To: boost-users@lists.boost.org
Message-ID: <43FBB84F.70305@obbligato.org>
Content-Type: text/plain; charset=us-ascii
> Nowadays both CPUs tend to have caches. Whether or not the cache
> contents is guaranteed to be written back to the main memory when thread
> A returns depends on the architecture of your hardware. IIRC, on X86
> architectures there is such a guarantee. On other architectures you
> might need to use mutexes or a similar concept to guarantee that thread
> B sees the updates of thread A.
Mutexes don't affect cache coherence. Likely there will have to be
calls to special intrinsics depending on the architecture.
Short Answer: Yes they do.
Very long answer: More precisely, the problem isn't really 'cache
coherence' in the traditional meaning of cache coherency (which
is that the cache, for your CPU, is consistent with main memory, etc.);
it is the order of memory reads and writes. Mutexes are guaranteed
to do whatever is necessary to make sure all queued reads are completed
before you get the mutex lock (i.e. they force a memory 'acquire'
barrier), and to make sure all writes are written before the mutex is
released (a 'release' memory barrier).
The important example is this:
Thread 1:

    bool object_is_constructed = false;
    Object object(17); // construct it
    object_is_constructed = true;

Thread 2:

    if (object_is_constructed)
    {
        use_object(object);
    }
    else
        dont_use_object(); // really we would probably wait, but this is an example
Problem 1 (Thread 1):
object_is_constructed may be set to true (in shared memory) BEFORE all
the constructed bytes of object are seen in shared memory.
This has NOTHING to do with compiler optimizations or 'volatile' or the
OS or other stuff. It is the hardware. (And really, whether
you think of this as cache coherency or read/write order or whatever
doesn't matter much, as long as the rules and concepts make sense.)
Problem 2:
EVEN IF object_is_constructed is written AFTER object is constructed,
Thread 2 might read object BEFORE reading object_is_constructed!
(Similar reasoning. Basically, think about how the processor wants to
optimize the order of memory reads/writes so as not to jump around
memory too much - similar to how disk drivers try to reorder
reads/writes to avoid seeks.)
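For concreteness, here is roughly what that broken example looks like as a
complete program. (The Object type, the stub functions and the pointer used
for sharing are made up for illustration; the point is that nothing here
orders the writes or the reads.)

#include <thread>

struct Object {
    int value;
    explicit Object(int v) : value(v) {}
};

// Stand-ins for the functions in the example above.
void use_object(const Object&) {}
void dont_use_object() {}

Object* object = 0;                  // shared between the threads
bool object_is_constructed = false;  // shared, and NOT synchronized

void thread1() {
    object = new Object(17);         // construct it
    object_is_constructed = true;    // this write may become visible first
}

void thread2() {
    if (object_is_constructed)       // flag may be seen before object's bytes
        use_object(*object);
    else
        dont_use_object();
}

int main() {
    std::thread a(thread1);
    std::thread b(thread2);
    a.join();
    b.join();
}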
So what to do? In a sense, flush the reads/writes, much as
you would need to do for hard drives. Any and all threading
primitives will do this for you implicitly. POSIX threads
(pthreads) actually try to document this, but you can assume it for any
thread library; otherwise the library is essentially useless.
So use locks.
Look at writing first:

    lock mutex
    write object
    write object_is_constructed
    release mutex
Note that in whatever order object and is_constructed are actually
written to main memory, they are both written before the release.
***Those writes CANNOT be reordered to after the mutex release***
because of the 'release' memory barrier inside the mutex release.
Now on reading:

    lock mutex
    read is_constructed
    read object
    release mutex
So the read of object might still come before the read of
is_constructed, but you CANNOT read either of them before the read that
acquires the mutex. And the mutex itself was written (released) only
after those writes, so everything stays in order.
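Written out as a sketch (std::mutex is just a stand-in here; boost::mutex or
pthread_mutex_t give you the same lock-acquire / unlock-release guarantees):

#include <mutex>
#include <thread>

struct Object {
    int value;
    explicit Object(int v) : value(v) {}
};

void use_object(const Object&) {}         // stub, as before

std::mutex m;
Object* object = 0;
bool object_is_constructed = false;

void writer() {
    std::lock_guard<std::mutex> lock(m);  // lock mutex
    object = new Object(17);              // write object
    object_is_constructed = true;         // write object_is_constructed
}                                         // release mutex: both writes complete first

void reader() {
    std::lock_guard<std::mutex> lock(m);  // lock mutex: no reads hoisted above this
    if (object_is_constructed)            // read is_constructed
        use_object(*object);              // read object - ordered by the mutex
}                                         // release mutex

int main() {
    std::thread a(writer);
    std::thread b(reader);
    a.join();
    b.join();
}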
To take it a step further, if you had barrier control, you COULD do
away with the mutex (if you wanted to spin-lock, or just skip using
'object' as in the example), like so:

Thread 1:

    write object
    release barrier
    write is_constructed

The barrier prevents the object write from moving below the barrier.

Thread 2:

    read is_constructed (into temp)
    acquire barrier
    if (temp)
        read object

The acquire barrier prevents the read of 'object' from moving above the acquire.
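In today's C++ that barrier-only version can be spelled with std::atomic and
explicit memory orders (again just a sketch; std::atomic postdates this post,
so read it as a translation of the raw barriers above):

#include <atomic>
#include <thread>

struct Object {
    int value;
    explicit Object(int v) : value(v) {}
};

void use_object(const Object&) {}    // stub, as before

Object* object = 0;
std::atomic<bool> object_is_constructed(false);

void thread1() {
    object = new Object(17);                                       // write object
    object_is_constructed.store(true, std::memory_order_release);  // release, then write flag
}

void thread2() {
    if (object_is_constructed.load(std::memory_order_acquire)) {   // read flag, acquire
        use_object(*object);  // the object read cannot move above the acquire
    }
    // else: don't use the object (or spin and retry)
}

int main() {
    std::thread a(thread1);
    std::thread b(thread2);
    a.join();
    b.join();
}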
Note that, for example, Win32 now has explicit barrier versions of its
Interlocked functions, e.g. InterlockedIncrementAcquire, etc. (The
old ones, it can be assumed, did 'full' barriers.) Mac OS X has
similar functions.
x86 never re-orders writes, but it can re-order reads. The IA64 *spec*
says it can reorder both, although the processor doesn't seem to.
Conventional wisdom is that the spec is explicitly 'looser' so that
they can loosen the implementation in the future.
Oh, also, there are other refinements on the barriers - think of the
combinations - are we only concerned about the order of reads w.r.t.
other reads, and writes w.r.t. other writes? What about the order of
reads vs writes, etc.?
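For what it's worth, those combinations eventually got standard names in
C++'s <atomic>; a rough map (not something that existed when this was written):

// std::memory_order_relaxed - atomicity only, no ordering constraint
// std::memory_order_acquire - later reads/writes can't move above this load
// std::memory_order_release - earlier reads/writes can't move below this store
// std::memory_order_acq_rel - both at once, for read-modify-write operations
// std::memory_order_seq_cst - a single total order (the 'full' barrier flavour)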
Thus back to the short answer: mutexes et al. magically do what is necessary to make it work.
Tony.
P.S. Hope I got all that straight. It is easy to mix up the names
of the barriers relative to their direction and reads vs writes,
etc. I tend to just remember the general principles.