Cache flushing in the constructors

I've been struggling with these thoughts lately, and am curious how the Boost "class" writers deal with it, or whether there's any problem at all to begin with. If I want to write a Boost class that's going to be compiled for all the supported CPU types, do I have to perform a core-cache flush at the end of my class constructors? I know that there are "strong memory model" CPUs out there, like the Intel x86 cores, that automagically take care of cache coherency for you, and "weak memory model" CPUs where the code has to do all of the work for cache coherency. When developing a Boost class that's going to be an object used in multi-threaded applications, do these classes use some Boost utility to perform cache flushes on the newly constructed objects, or am I just thinking about this in the wrong way, which I'm starting to believe is the case? I'm new to thinking about these things, and my head hurts! Your views would be appreciated. Thanks, -Sid

Sid Sacek wrote:
If I want to write a boost class that's going to be compiled for all the supported CPU types, do I have to perform a core-cache flush at the end of my class constructors?
Why would you need to worry about such a thing? That's the responsibility of the compiler/hardware. There may be some esoteric cases, in extremely low-level libraries, where this may be important, but I doubt whether 99% of C++ classes need anything of the sort you describe. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com

Why would you need to worry about such a thing? That's the responsibility of the compiler/hardware. Rob Stewart
Are you sure the compiler would take care of this? I've worked on a number of different CPUs, including the SPARC and the RS/6000, and I've never seen any special assembly emitted by the compiler that would do that. I feel like there's still something missing from the picture. You see, normally entering and exiting a lock takes care of the memory-fencing issues for you, but constructors are naked. By that, I mean there doesn't appear to be a fence at the end of the constructor, and therefore I don't see how the current core informs all other cores that a block of memory has been modified. -Sid
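A minimal sketch of the scenario being described here, assuming a C++11 compiler; Widget, g_widget, and the function names are hypothetical, used only for illustration:

    #include <thread>

    struct Widget {
        int value;
        Widget() : value(42) {}        // no fence is emitted at the end of this body
    };

    Widget* g_widget = 0;              // plain pointer, no synchronization

    void producer() {
        g_widget = new Widget();       // the pointer store may become visible to
                                       // another core before the constructor's
                                       // stores do (on a weakly ordered CPU)
    }

    void consumer() {
        if (g_widget) {                // data race: unsynchronized read
            int v = g_widget->value;   // may observe a partially constructed object
            (void)v;
        }
    }

    int main() {
        std::thread t1(producer);
        std::thread t2(consumer);
        t1.join();
        t2.join();
    }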

Sid Sacek wrote:
Why would you need to worry about such a thing? That's the responsibility of the compiler/hardware. Rob Stewart
Are you sure the compiler would take care of this?
I've worked on a number of different CPUs, including the SPARC and the RS/6000, and I've never seen any special assembly emitted by the compiler that would do that. I feel like there's still something missing from the picture.
How familiar are you with the term ccNUMA? As I'm not too familiar with it myself, I just did a quick look up at Wikipedia (<http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access>):

Cache coherent NUMA (ccNUMA)
Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model. As a result, all NUMA computers sold to the market use special-purpose hardware to maintain cache coherence[citation needed], and thus class as "cache-coherent NUMA", or ccNUMA.

If this information is correct, I see little reason why you or your compiler would need to generate extra code. Regards, Thomas

How familiar are you with the term ccNUMA? As I'm not too familiar with it myself, I just did a quick look up at Wikipedia (<http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access>):
Cache coherent NUMA (ccNUMA) [...] Regards, Thomas
I wasn't familiar with the term, so I read that article. I believe I understand that it's implying that motherboards used in mainstream computers use cache-coherency hardware in order to enforce the ccNUMA architecture. If all motherboards used this ccNUMA architecture, I don't think there would ever be a problem. However, I'm under the impression that Boost targets CPUs, not ccNUMA-based motherboards. I can see the possibility of a "weak memory model" CPU being used in an embedded scenario without a ccNUMA architecture. I don't know whether it's realistic for that hardware to still be using Boost at that point, but I suppose it's possible. My take-away from this is that if the programmer knows the hardware doesn't handle cache coherency, and the code is using Boost classes, the programmer has to make special provisions to deal with potential problems by surrounding constructor calls with fences. Thanks, -Sid

Sid Sacek wrote:
I can see the possibility of a "weak memory model" CPU being used in an embedded scenario without a ccNUMA architecture. I don't know whether it's realistic for that hardware to still be using Boost at that point, but I suppose it's possible.
I think IBM's Cell Broadband Engine (also used in the PS3) is an example of hardware that is not ccNUMA. I'm not sure how parallel other current embedded hardware is, but I guess it is still the exception. However, the other answers also reminded me that even on ccNUMA hardware, simply assuming that modified data will be available in other threads without explicit synchronization wouldn't be correct. (And the fact that it will work most of the time only makes the corresponding bug/race condition more difficult to find.) Regards, Thomas

Sid Sacek wrote:
Why would you need to worry about such a thing? That's the responsibility of the compiler/hardware. Rob Stewart
Are you sure the compiler would take care of this?
I wrote "compiler/hardware" purposely.
I've worked on a number of different CPUs, including the SPARC and the RS/6000, and I've never seen any special assembly emitted by the compiler that would do that. I feel like there's still something missing from the picture.
You see, normally, entering and exiting a lock takes care of the memory fencing issues for you, but constructors are naked. By that, I mean there doesn't appear to be a fence at the end of the constructor, and therefore I don't see how the current core informs all other cores that a block of memory has been modified.
That's the job of cache-coherency logic in the hardware. When you access the object in another thread running on another CPU, the hardware must arrange for that other CPU to see the new state of the memory occupied by the new object, at some point. That also means that the other CPU may not *yet* see the new object's state. When one thread creates an object for another to use, it is your responsibility to use fences or locks to force the consuming thread to see what the producing thread did. Normally, however, an object isn't created by one thread and handed off to another, so fences are rarely warranted in a constructor. _____ Rob Stewart robert.stewart@sig.com Software Engineer, Core Software using std::disclaimer; Susquehanna International Group, LLP http://www.sig.com
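A minimal sketch of the hand-off described above, using a Boost.Thread mutex so the consuming thread is forced to see what the producing thread did; Widget and the free functions are hypothetical, illustrative names, not Boost API:

    #include <boost/thread/mutex.hpp>
    #include <boost/thread/thread.hpp>

    struct Widget {
        int value;
        Widget() : value(42) {}
    };

    boost::mutex g_mutex;
    Widget*      g_widget = 0;

    void publish() {
        Widget* w = new Widget();                 // construct first...
        boost::mutex::scoped_lock lock(g_mutex);
        g_widget = w;                             // ...then publish under the lock;
                                                  // unlocking acts as a release
    }

    void consume() {
        boost::mutex::scoped_lock lock(g_mutex);  // locking acts as an acquire, so
        if (g_widget) {                           // if the pointer is seen, the
            int v = g_widget->value;              // constructor's writes are seen too
            (void)v;
        }
    }

    int main() {
        boost::thread t1(publish);
        boost::thread t2(consume);
        t1.join();
        t2.join();
    }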

Normally, however, an object isn't created by one thread and handed off to another, so fences are rarely warranted in a constructor. Rob Stewart
Right. I'm starting to believe the answer to my question is that it is not the responsibility of constructors to solve potential cache-coherency issues; rather, the code that uses the Boost classes must be aware of the hardware's limitations and make special provisions for it. Thanks, -Sid

On Aug 4, 2010, at 12:01 PM, Sid Sacek <ssacek@securewatch24.com> wrote:
I'm starting to believe the answer to my question is that it is not the responsibility of constructors to solve potential cache-coherency issues; rather, the code that uses the Boost classes must be aware of the hardware's limitations and make special provisions for it.
The idea of the new C++0x memory model is that you only need to know about /its/ limitations. But yeah, trying to do thread safety at the object level is almost always wrong. The necessary locks depend on context. -- Dave Abrahams (mobile) BoostPro Computing http://www.boostpro.com
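A rough sketch of the kind of guarantee the C++0x (now C++11) memory model provides, publishing a newly constructed object through a std::atomic pointer with release/acquire ordering; Widget and the function names are hypothetical:

    #include <atomic>
    #include <thread>

    struct Widget {
        int value;
        Widget() : value(42) {}
    };

    std::atomic<Widget*> g_widget(nullptr);

    void produce() {
        Widget* w = new Widget();                      // ordinary construction
        g_widget.store(w, std::memory_order_release);  // release store: the
                                                       // constructor's writes
                                                       // happen-before this store
    }

    void consume() {
        Widget* w = g_widget.load(std::memory_order_acquire);  // matching acquire load
        if (w) {
            int v = w->value;   // guaranteed to see the constructed value; the
            (void)v;            // programmer never writes an explicit cache flush
        }
    }

    int main() {
        std::thread t1(produce);
        std::thread t2(consume);
        t1.join();
        t2.join();
    }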

Why would you need to worry about such a thing? That's the responsibility of the compiler/hardware.
Are you sure the compiler would take care of this?
Most don't. VC++ uses stores with release semantics for the initialization of volatiles of the standard built-in types (char, short, int, etc.) on most major target platforms (and only as long as the objects are naturally aligned, which isn't necessarily the case for long long on x86 without further precautions). I actually consider that an implementation deficiency, as volatile shouldn't be in effect until after initialization is complete -- but that's the way it is.
I've worked on a number of different CPUs, including the SPARC and the RS/6000, and I've never seen any special assembly emitted by the compiler that would do that. I feel like there's still something missing from the picture.
The point is that most libraries require the user to avoid races on object accesses (except for simultaneous const member function calls).
You see, normally, entering and exiting a lock takes care of the memory fencing issues for you, but constructors are naked. By that, I mean there doesn't appear to be a fence at the end of the constructor, and therefore I don't see how the current core informs all other cores that a block of memory has been modified.
If you access the object from another thread of execution without any synchronization, you have a race on that object, and simply saying the behavior is undefined is common practice in this case. Only if you need guarantees for that case would you have to think about doing something special. If you do have a "normal" synchronization mechanism, it will almost certainly take care of enforcing memory order. -hg
participants (5)
- Dave Abrahams
- Holger Grund
- Sid Sacek
- Stewart, Robert
- Thomas Klimpel