
Hi everyone,

I'm sorry for the late review -- things have been busy as usual. I've briefly gone over the code and think Tim's Lockfree library would make an excellent addition to Boost and should be accepted, provided some points are addressed. First and foremost, I don't think the library can be called ready when it doesn't support Visual C++/x86 in a meaningful way -- which I believe it doesn't right now.

What is your evaluation of the design?
--------------------------------------

The library seems well-designed. As I believe has been discussed in a separate thread, naming is somewhat inconsistent. The failure modes are not clear to me (e.g. does enqueue/push return false or throw an exception when there is no more memory?), but overall I think Tim did an excellent job.

What is your evaluation of the implementation?
----------------------------------------------

As mentioned, it would appear that the bundled atomic implementation has a lot of shortcomings, but one is particularly bad: atomic<tagged_ptr> is not actually lock-free for Visual C++. The DCAS implementation should be straightforward, but seems to be absent. In case you wonder: I did an initial implementation of <(cstd)atomic> and added all required intrinsics for a simple implementation to the compiler. These all made it into VC++ 2010 RTM (IIRC, there was only one 64-bit intrinsic; most were 8- and 16-bit variants).

There are a few points that make me a bit nervous. Specifically, the alignment story is a bit weak. It seems there is only one use of BOOST_LOCKFREE_CACHELINE_ALIGNMENT, and that's for fifo<>::node. Older compilers (GCCs and VC at least) don't dynamically align the stack with __declspec(align)/__attribute__((aligned)). Other classes do not use alignment at all. It seems somewhat inconsistent to add internal padding to move members to different cachelines, but at the same time not ensure alignment (which means potential cacheline overlaps at the back and front of the object).
Also, the code seems to assume that compilers will use an empty base class optimization for noncopyable -- I guess that's fine for most major compilers. Also, it is not clear how it is ensured that allocated nodes are actually aligned. Allocators can and usually do ignore alignment requirements greater than that of intmax_t. Some may just fail to instantiate.

likely/unlikely are often defined as macros, as in the Linux kernel. People probably shouldn't do that for C++, but I have seen it quite a bit. It probably doesn't help that many compilers' inlining oracles don't do very well for modern code.

The name freelist_t is somewhat confusing for a template parameter, and while I'm not sure what harm it could do, it should probably follow the general convention of CamelCase.

What is your evaluation of the documentation?
---------------------------------------------

This is a bit of a weak point right now. There isn't much documentation to begin with. The definition of wait-free and lock-free is a bit weird. Specifically, the term "concurrent operations" is a bit vague. I think it's easier to grasp if you say something along the lines of "at least one thread makes forward progress at any given point in time". With the current definition in the docs (and most others I know of), it may not be immediately clear why a mutex-based implementation wouldn't be lock-free. There's a paragraph:
The synchronization is done completely in user-space with no interaction without any direct interaction with the operating system [1] . This implies that they are not prone to issues like priority inversion (a low-priority thread needs to wait for a high-priority thread).
I think there's some duplication in the first sentence. I'm not sure what lock-free has to do with priority inversion, and I would think that priority inversion means priorities are -- well, inverted. I.e. a high-priority thread is not scheduled, because a lower-priority thread owns a resource needed by the high-priority thread.

This also makes it sound a bit like lock-free is superior to OS mechanisms, but that really depends. A standard lock-free list is not fair for writers, but an implementation making use of OS facilities can be. On the other hand, ticket locks can provide some fairness.

fifo<>::empty and ringbuffer<>::empty say they are not thread-safe. I would assume that means the value returned is potentially stale, but it can't break class invariants -- or can it?

What is your evaluation of the potential usefulness of the library?
-------------------------------------------------------------------

I think the library addresses a very important use-case and could therefore be a very useful addition to Boost. But to reiterate, I would expect Lockfree to actually be lock-free on all mainstream platforms.

Did you try to use the library? With what compiler? Did you have any problems?
------------------------------------------------------------------------------

I don't have Boost libs built here, so I only tried some simple tests with VC++ 10 SP1. No problems. I'm planning to run some perf tests with GCC 4.1 on AS5 x86_64.

How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?
--------------------------------------------------------------------------------------------

I spent about four hours, mostly going over the code and the docs.

Are you knowledgeable about the problem domain?
-----------------------------------------------

I have done quite a lot of lock-free work over the last 8 years.
I did an initial implementation of <atomic> for Microsoft, and my current job involves a lot of tweaking of very low-latency code, where we very much care about upper limits for execution time (but also about throughput). While I'm still constantly surprised by modern CPUs, I do consider myself to be quite knowledgeable about this area.

-hg

Holger Grund, Vice President
Morgan Stanley | MS Strats and Modeling
25 Cabot Square | Canary Wharf | Floor 03
London, E14 4QA
Phone: +44 20 7677-4095
Holger.Grund@morganstanley.com

--------------------------------------------------------------------------
NOTICE: Morgan Stanley is not acting as a municipal advisor and the opinions or views contained herein are not intended to be, and do not constitute, advice within the meaning of Section 975 of the Dodd-Frank Wall Street Reform and Consumer Protection Act. If you have received this communication in error, please destroy all electronic and paper copies and notify the sender immediately. Mistransmission is not intended to waive confidentiality or privilege. Morgan Stanley reserves the right, to the extent permitted under applicable law, to monitor electronic communications. This message is subject to terms available at the following link: http://www.morganstanley.com/disclaimers. If you cannot access these links, please notify us by reply message and we will send the contents to you. By messaging with Morgan Stanley you consent to the foregoing.

hi holger, thanks for the review!
The library seems well-designed. As I believe has been discussed in a separate thread, naming is somewhat inconsistent. The failure modes are not clear to me (e.g. does enqueue/push return false or throw an exception when there is no more memory), but overall I think Tim did an excellent job.
i have already improved the documentation of these functions, but i will check their wording again.
As mentioned, it would appear that the bundled atomic implementation has a lot of shortcomings, but one is particularly bad: atomic<tagged_ptr> is not actually lock-free for Visual C++. The DCAS implementation should be straightforward, but seems to be absent. In case you wonder: I did an initial implementation of <(cstd)atomic> and added all required intrinsics for a simple implementation to the compiler. These all made it into VC++ 2010 RTM (IIRC, there was only one 64-bit intrinsic; most were 8- and 16-bit variants).
i don't have access to any msvc compiler, do you think you could contribute the msvc-specific code for boost.atomic? besides, i have a version that only requires a 32-bit cas.
There are a few points that make me a bit nervous. Specifically, the alignment story is a bit weak. It seems there is only one use of BOOST_LOCKFREE_CACHELINE_ALIGNMENT, and that's for fifo<>::node. Older compilers (GCCs and VC at least) don't dynamically align the stack with __declspec(align)/__attribute__((aligned)). Other classes do not use alignment at all.
i don't know of any standard-compliant c++ way to enforce a specific alignment. it doesn't affect the correctness, but is merely a hint for the compiler
It seems somewhat inconsistent to add internal padding to move members to different cachelines, but at the same time not ensure alignment (which means potential cacheline overlaps at the back and front of the object). Also, the code seems to assume that compilers will use an empty base class optimization for noncopyable -- I guess that's fine for most major compilers.
is there any current compiler that doesn't optimize empty base classes? i read in a document from the early 1990s that most compilers even back then did an empty base class optimization.
Also, it is not clear how it is ensured that allocated nodes are actually aligned. Allocators can and usually do ignore alignment requirements greater than that of intmax_t. Some may just fail to instantiate.
the cache line alignment is again an optimization that should work reasonably well. it shouldn't affect the correctness, though.
likely/unlikely are often defined as macros as in the Linux kernel. People probably shouldn't do that for C++, but I have seen it quite a bit. It probably doesn't help that many compilers' inlining oracles don't do very well for modern code.
why? often the hints are generated by profile-guided optimization, but very few build systems actually make use of them.
The name freelist_t is somewhat confusing for a template parameter, and while I'm not sure what harm it could do, it should probably follow the general convention of CamelCase.
already replaced with boost.parameter-style policies
The definition of wait-free and lock-free is a bit weird. Specifically, the term concurrent operations is a bit vague. I think it's easier to grasp if you say something along the lines of at least one thread makes forward progress at any given point in time. With the current definition in the docs (and most others I know of), it may not be immediately clear why a mutex-based implementation wouldn't be lock-free.
the wait-free and lock-free definitions are almost exactly taken from herlihy&shavit. i can try to explain them a bit more..
There's a paragraph:
The synchronization is done completely in user-space with no interaction without any direct interaction with the operating system [1] . This implies that they are not prone to issues like priority inversion (a low-priority thread needs to wait for a high-priority thread).
I think there's some duplication in the first sentence. I'm not sure what lock-free has to do with priority inversion and I would think that priority inversion means priorities are -- well, inverted. I.e. a high-priority thread is not scheduled, because a lower priority thread owns a resource needed by the high-priority thread.
well, but with lock-free programming, no thread owns a resource.
This also makes it sound a bit like lock-free is superior to OS mechanisms, but that really depends. A standard lock-free list is not fair for writers, but an implementation making use of OS facilities can be. On the other hand, ticket locks can provide some fairness.
it is definitely not the aim to provide a general-purpose concurrent data structure, which is fair and efficient, but to provide data structures that are, as the name states, lock-free. ;) the section about the performance of lock-free data structures already discusses some performance-related aspects, but apparently this is not clear enough.
fifo<>::empty and ringbuffer<>::empty say they are not thread-safe. I would assume that means the value returned is potentially stale, but it can't break class invariants -- or can it?
already improved this part.

cheers, tim

As mentioned, it would appear that the bundled atomic implementation has a lot of shortcomings, but one is particularly bad: atomic<tagged_ptr> is not actually lock-free for Visual C++. The DCAS implementation should be straightforward, but seems to be absent. In case you wonder: I did an initial implementation of <(cstd)atomic> and added all required intrinsics for a simple implementation to the compiler. These all made it into VC++ 2010 RTM (IIRC, there was only one 64-bit intrinsic; most were 8- and 16-bit variants).
i don't have access to any msvc compiler, do you think you could contribute the msvc-specific code for boost.atomic?
FWIW, we have 64-bit atomics support for MSVC implemented. It has been used in a bigger project for quite some time now. I can contribute those changes.

Regards Hartmut
---------------
http://boost-spirit.com

i don't have access to any msvc compiler, do you think you could contribute the msvc-specific code for boost.atomic?
FWIW, we have 64-bit atomics support for MSVC implemented. It has been used in a bigger project for quite some time now. I can contribute those changes.
I'm not sure whether I replied to this. I, for one, would definitely appreciate it if you could contribute those changes. I'll find some time to look over the bits either when you provide them or during the official review that will hopefully happen sometime soon. Thanks! -hg

Thanks Tim, I'll see if I can address individual points later. So just the important stuff :-)
of shortcomings, but one is particularly bad: atomic<tagged_ptr> is not actually lock-free for Visual C++. The DCAS implementation should be straightforward, but seems to be absent. In case you wonder: I did an initial implementation of <(cstd)atomic> and added all required intrinsics for a simple implementation to the compiler. These all made it into VC++ 2010 RTM (IIRC, there was only one 64-bit intrinsic; most were 8- and 16-bit variants).
i don't have access to any msvc compiler, do you think you could contribute the msvc-specific code for boost.atomic? besides, i have a version that only requires 32bit cas.
If you find a Windows OS, there's always the VC++ Express download. Anyhow, I think this should be straightforward. I didn't look at it in detail, but it looked like you only need a specialization for size-8 types that does an _InterlockedCompareExchange64 for most everything. Efficient loads & stores are a bit tricky in that SSE2 is not a requirement for 32-bit Windows. Without it, I think we need to resort to FILD/FISTP, which is a pain. I'll see if I can code this up. No promises, though.
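The CAS-based fallback described here can be sketched roughly as follows. This is a hypothetical illustration, not the Boost.Atomic code: std::atomic stands in for the raw _InterlockedCompareExchange64 intrinsic, and cas_load/cas_store are made-up names.

```cpp
#include <atomic>
#include <cstdint>

// Sketch: synthesizing an atomic 64-bit load from a 64-bit CAS, the
// technique available via _InterlockedCompareExchange64 on pre-SSE2 x86.
std::uint64_t cas_load(std::atomic<std::uint64_t>& cell)
{
    std::uint64_t expected = 0;
    // On failure, compare_exchange writes the current value into
    // 'expected'; if the cell happened to hold 0, the exchange succeeds
    // but writes the same value back. Either way 'expected' ends up
    // holding the cell's contents, observed atomically.
    cell.compare_exchange_strong(expected, expected);
    return expected;
}

// A store emulated the same way: loop until the CAS installs 'desired'.
void cas_store(std::atomic<std::uint64_t>& cell, std::uint64_t desired)
{
    std::uint64_t expected = cell.load(std::memory_order_relaxed);
    while (!cell.compare_exchange_weak(expected, desired)) { /* retry */ }
}
```

On real pre-SSE2 x86 the CAS would be cmpxchg8b under a lock prefix, so both operations are atomic but pay the full bus-lock cost on every access.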
There are a few points that make me a bit nervous. Specifically, the alignment story is a bit weak. It seems there is only one use of BOOST_LOCKFREE_CACHELINE_ALIGNMENT, and that's for fifo<>::node. Older compilers (GCCs and VC at least) don't dynamically align the stack with __declspec(align)/__attribute__((aligned)). Other classes do not use alignment at all.
i don't know of any standard-compliant c++ way to enforce a specific alignment. it doesn't affect the correctness, but is merely a hint for the compiler
Well, I hope that at least atomic<> gets it right. Nonaligned accesses are not guaranteed to be atomic. In this case I guess the atomic<> implementation ends up using a uint64_t. In VC++ __alignof == 8, but that doesn't mean that autos of type uint64_t are necessarily aligned. In our case I would assume that you get memory from the OS that is at least 16- or 8-byte aligned, and the alignment of class types should be fine. We should probably have a closer look, though.

E.g. consider this on x86 with an older GCC:

    void foo() { atomic<tagged_node_ptr<X> > x; }

Is x aligned here? I don't recall the ABI, but I believe it doesn't guarantee anything beyond 4-byte alignment for ESP on entry. So to align x properly in the stack frame, the stack must be dynamically aligned (or some interprocedural optimization may help) -- but I don't think older GCCs do that. Same problem on Windows with VC++ for x86. However, dynamic stack alignment is only done under certain circumstances (I think only when the codegen sees loads/stores of double or __declspec(align)ed objects from/to memory). It doesn't for uint64_t, however. Thanks! -hg

Anyhow, I think this should be straightforward. I didn't look at it in detail, but it looked like you only need a specialization for size-8 types that does an _InterlockedCompareExchange64 for most everything. Efficient loads & stores are a bit tricky in that SSE2 is not a requirement for 32-bit Windows. Without it, I think we need to resort to FILD/FISTP, which is a pain.
iirc, sse2 intrinsics are not guaranteed to be atomic, so sometimes memory access has to be emulated via CAS.
Well, I hope that at least atomic<> gets it right. Nonaligned accesses are not guaranteed to be atomic. In this case I guess the atomic<> implementation ends up using a uint64_t. In VC++ __alignof == 8, but that doesn't mean that autos of type uint64_t are necessarily aligned. In our case I would assume that you get memory from the OS that is at least 16- or 8-byte aligned, and the alignment of class types should be fine.
this is the reason why atomic<>::is_lock_free() is not a static member function.
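To illustrate Tim's point, here is a hypothetical sketch (atomic_sketch is a made-up name, not Boost.Atomic's implementation) of why lock-freedom can be a per-object property: it can depend on where this particular object landed in memory.

```cpp
#include <cstdint>

// Hypothetical sketch of why is_lock_free() is a non-static member:
// whether a given object can use the lock-free path may depend on
// where that particular object was placed.
template <class T>
struct atomic_sketch
{
    T value_;

    bool is_lock_free() const
    {
        // Only an object that happens to be naturally aligned can use
        // the hardware's atomic instructions; a misaligned one (e.g.
        // produced by placement new into a raw char buffer) would have
        // to fall back to a lock.
        return reinterpret_cast<std::uintptr_t>(&value_) % sizeof(T) == 0;
    }
};
```

A properly declared object is aligned by the compiler and reports lock-free; the discussion below about placement new and stack alignment is exactly about the cases where this check could fail.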
E.g. consider this on x86 with an older GCC:

    void foo() { atomic<tagged_node_ptr<X> > x; }
Is x aligned, here? I don't recall the ABI, but I believe it doesn't guarantee anything beyond 4-byte alignment for ESP on entry. So to align x properly in the stack frame, the stack must be dynamically aligned (or some interprocedural optimization may help) -- but I don't think older GCCs do that.
dynamic memory allocation makes it even worse, because you can use placement new to put the data structure at virtually any memory location :/

tim

Efficient loads & stores are a bit tricky in that SSE2 is not a requirement for 32-bit Windows. Without it, I think we need to resort to FILD/FISTP, which is a pain.
iirc, sse2 intrinsics are not guaranteed to be atomic, so sometimes memory access has to be emulated via CAS.
All aligned 64-bit accesses are guaranteed to be atomic on x86. The same is not true for 128-bit loads and stores on x64 (at least there are no architectural guarantees -- though I think most, if not all, Intel & AMD implementations still did them atomically as of 2009).

I'm not really sure how you would implement a fully correct lock-free atomic<int128_t> on x64. A cmpxchg16b requires the underlying page to be writable. A very silly example would be:

    const atomic<int128_t> x = 0; // in a read-only page

    void foo()
    {
        int128_t l = x; // how does this load work? CMPXCHG16B would result in an access violation
    }

In reality, 128-bit atomic reads from shared memory might be interesting. If you only have a read mapping, I don't think there is any way to read 128 bits atomically.
Is x aligned, here? I don't recall the ABI, but I believe it doesn't guarantee anything beyond 4-byte alignment for ESP on entry. So to align x properly in the stack frame, the stack must be dynamically aligned (or some interprocedural optimization may help) -- but I don't think older GCCs do that.
dynamic memory allocation makes it even worse, because you can use placement new to put the data structure at virtually any memory location :/
Well, I would say if you do a placement new it's your responsibility to ensure proper alignment. If you don't, all bets are off -- or "undefined behavior" :-) Most implementations have an operator new/malloc implementation that always returns aligned memory. The tricky part about the x86 ABIs is that they generally only require the stack to be 4-byte aligned. Many implementations therefore fail to get alignment of autos right. E.g.:

    uint64_t x0;                     // static storage duration, ok
    uint64_t* x1 = new uint64_t;     // dynamic storage duration, ok

    char c[8];
    uint64_t* x2 = new (c) uint64_t; // may not be aligned, but that's undefined behavior anyway

    void foo() { uint64_t x3; }      // may not be aligned

I think the last case is where compilers fail to implement the requirements correctly. Anyway, that's really about Boost.Atomic. Let's see if we can get some traction on it. Thanks! -hg
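For the cases where an allocator's alignment cannot be trusted, the classic workaround is to over-allocate and round the pointer up by hand. A minimal sketch, assuming a power-of-two alignment; aligned_alloc_sketch is a made-up name, not a library function:

```cpp
#include <cstdint>
#include <cstdlib>

// Over-allocate by (alignment - 1) bytes and round the pointer up to
// the next multiple of the required alignment. 'raw' receives the
// original malloc pointer, which is what must later be passed to free.
// 'alignment' must be a power of two.
void* aligned_alloc_sketch(std::size_t size, std::size_t alignment, void*& raw)
{
    raw = std::malloc(size + alignment - 1);
    if (!raw) return 0;
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw);
    p = (p + alignment - 1) & ~static_cast<std::uintptr_t>(alignment - 1);
    return reinterpret_cast<void*>(p);
}
```

This only fixes heap objects; it does nothing for the misaligned autos discussed above, which are entirely in the compiler's hands.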

On Wednesday 03 August 2011 09:39:21 Grund, Holger wrote:
Efficient loads & stores are a bit tricky in that SSE2 is not a requirement for 32-bit Windows. Without it, I think we need to resort to FILD/FISTP, which is a pain.
iirc, sse2 intrinsics are not guaranteed to be atomic, so sometimes memory access has to be emulated via CAS.
All aligned 64-bit accesses are guaranteed to be atomic on x86. The same is not true for 128-bit loads and stores on x64 (at least there are no architectural guarantees -- though I think most, if not all, Intel & AMD implementations still did them atomically as of 2009).
I'm not really sure how you would implement a fully correct lock-free atomic<int128_t> on x64. A cmpxchg16b requires the underlying page to be writable.
if the page is not writable, then why would you need an atomic<int128_t> in the first place?

- if the data is unchanging, then it doesn't matter
- if the data is changing (through a writable mapping by someone else to the page), then you have some sort of producer/consumer problem, and that is trivially solvable with word-sized atomic operations

IMHO the same rationale holds for 64-bit atomics on 32-bit, so emulation via DCAS is acceptable -- since the lock prefix is needed anyway before cmpxchg8b/cmpxchg16b, this should also deal with misalignment (even though this incurs a hefty performance penalty)

Helge

I'm not really sure how you would implement a fully correct lock-free atomic<int128_t> on x64. A cmpxchg16b requires the underlying page to be writable.
if the page is not writable, then why would you need an atomic<int128_t> in the first place?
- if the data is unchanging, then it doesn't matter

Well, as I understand it, the current wording requires this to work:

    const atomic<int128_t> x;

    void foo() { int128_t y = x; }
Surely, you wouldn't want to revert to a lock implementation just because of this rather academic case.
- if the data is changing (through a writable mapping by someone else to the page), then you have some sort of producer/consumer problem, and that is trivially solvable with word-sized atomic operations

Really? IMHO, the more bits you can squeeze into your data the better. For instance, look at Tim's tagged_ptr. This assumes 48 bits for the address space, where some silicon already supports 52 bits. I have one case where I'd want to have more than 64 bits without having to revert to CAS.
That said, I have never had the need for 128-bit atomics nor for atomic accesses of read-only pages. But then, when you implement a library you think about these corner cases.
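A tagged pointer of the sort discussed here can be sketched as follows. This is a hypothetical illustration, not Tim's actual tagged_ptr: the name tagged_ptr_sketch and the exact layout are assumptions, and the 48/16 bit split bakes in precisely the address-space limitation noted above.

```cpp
#include <cstdint>

// Sketch of a tagged pointer: pointer in the low 48 bits, ABA tag in
// the high 16 bits, so the whole thing fits in one 64-bit CAS word.
// Hardware using more than 48 significant address bits would break it.
struct tagged_ptr_sketch
{
    std::uint64_t bits;

    static tagged_ptr_sketch pack(void* p, std::uint16_t tag)
    {
        tagged_ptr_sketch t;
        t.bits = (reinterpret_cast<std::uint64_t>(p) & 0x0000FFFFFFFFFFFFULL)
               | (static_cast<std::uint64_t>(tag) << 48);
        return t;
    }

    void* ptr() const
    {
        return reinterpret_cast<void*>(bits & 0x0000FFFFFFFFFFFFULL);
    }

    std::uint16_t tag() const
    {
        return static_cast<std::uint16_t>(bits >> 48);
    }
};
```

The tag is bumped on every update, so a CAS that compares the whole 64-bit word detects the ABA case where the pointer value recurs but the node was recycled in between.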
IMHO the same rationale holds for 64 bit atomics on 32 bit, so emulation via DCAS is acceptable -- since the lock prefix is needed anyway before cmpxchg8b/cmpxchg16b this should also deal with misalignment (even though this incurs a hefty performance penalty)
I don't think you want misaligned data with atomics -- ever. How would you ensure not to cross a cacheline or even a page? Anyway, I think 64-bit atomics on 32-bit are not too hard -- I think you have it already for GCC, and VC++ 2010 has all the necessary intrinsics. And 128-bit atomics on x86_64 just need a DCAS loop for most everything. -hg

On Wednesday 03 August 2011 12:30:14 Grund, Holger wrote:
I'm not really sure how you would implement a fully correct lock-free atomic<int128_t> on x64. A cmpxchg16b requires the underlying page to be writable.
if the page is not writable, then why would you need an atomic<int128_t> in the first place?
- if the data is unchanging, then it doesn't matter
Well, as I understand it, the current wording requires this to work:

    const atomic<int128_t> x;

    void foo() { int128_t y = x; }
which is correct, but since "atomic" has a constructor, the variable wouldn't go into the rodata section anyway; even if it did, I do not see too many use cases for this (in case of shared mappings with one side being read-only you are strictly speaking already operating out-of-spec, but you don't need DCAS then in any case, see below).
Surely, you wouldn't want to revert to a lock implementation just because of this rather academic case.
- if the data is changing (through a writable mapping by someone else to the page), then you have some sort of producer/consumer problem, and that is trivially solvable with word-sized atomic operations
Really?
yes, really.

keep an "even" and an "odd" copy of your data structure, plus an atomically readable "generation counter". on access: read the generation counter, read the data (depending on the parity of the generation counter), then read the generation counter again. if it changed, start over; if it didn't change, you have your data. on modification, update the generation counter as appropriate. (if you are paranoid about counter overflows, you can repeat a similar trick with the counter itself.)

no need for anything larger than word-sized atomics here; the size of the shared data structure does not matter
I don't think you want misaligned data with atomics -- ever. How would you ensure not to cross a cacheline or even a page?
you don't -- if an x86 processor detects the access to be misaligned, it will issue a bus lock instead of doing RMW in cache
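Helge's generation-counter scheme is essentially what is nowadays called a seqlock with two data slots. A minimal single-writer sketch under assumed names (gen_cell and wide_value are made up here); note that a production version needs care about the formally racy data read, which this sketch glosses over:

```cpp
#include <atomic>
#include <cstdint>

// 128 bits of payload, accessed without any 128-bit atomic instruction.
struct wide_value
{
    std::uint64_t lo, hi;
};

struct gen_cell
{
    std::atomic<std::uint32_t> gen; // word-sized generation counter
    wide_value slot[2];             // slot[gen % 2] is the current copy

    gen_cell() : gen(0) { slot[0].lo = slot[0].hi = 0; }

    wide_value read() const
    {
        for (;;)
        {
            std::uint32_t g1 = gen.load(std::memory_order_acquire);
            wide_value v = slot[g1 % 2];           // may race; checked below
            std::uint32_t g2 = gen.load(std::memory_order_acquire);
            if (g1 == g2) return v; // counter unchanged: the copy is consistent
        }
    }

    void write(const wide_value& v) // single writer assumed
    {
        std::uint32_t g = gen.load(std::memory_order_relaxed);
        slot[(g + 1) % 2] = v;                        // fill the inactive slot
        gen.store(g + 1, std::memory_order_release);  // publish it
    }
};
```

Readers never block and never write shared state, which matches Helge's point: only the word-sized counter needs atomic access, regardless of the payload size.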

Hi Helge,
keep an "even" and an "odd" copy of your data structure, keep an atomically readable "generation counter" -- on access, read the generation counter, read the data (depending on parity of generation counter), read the generation counter again
if it changed, start over. if it didn't change, you have your data; on modification, update generation counter as appropriate (if you are paranoid about counter overflows, you can repeat a similar trick with the counter itself)
no need for anything larger than word-sized atomics here, size of shared read-only data structure does not matter
Agreed, this is not impossible, but I still tend to think we should strive for a more efficient implementation if at all possible.
I don't think you want misaligned data with atomics -- ever. How would you ensure not to cross a cacheline or even a page?
you don't -- if an x86 processor detects the access to be misaligned, it will issue a bus lock instead of doing RMW in cache
Oh, I see what you mean now. As you said, that's not a terribly efficient thing to do these days, though.

Anyway, this is diverging quite a bit from Lockfree. I would certainly love to see a useful implementation of <atomic>. We care mostly about Linux here, but Windows would be quite nice, too. If there's anything I can do to get this moving, let me know. Thanks! -hg

On Friday 05 August 2011 20:54:50 Grund, Holger wrote:
Hi Helge,
keep an "even" and an "odd" copy of your data structure, keep an atomically readable "generation counter" -- on access, read the generation counter, read the data (depending on parity of generation counter), read the generation counter again
if it changed, start over. if it didn't change, you have your data; on modification, update generation counter as appropriate (if you are paranoid about counter overflows, you can repeat a similar trick with the counter itself)
no need for anything larger than word-sized atomics here, size of shared read-only data structure does not matter
Agreed, this is not impossible, but I still tend to think we should strive for a more efficient implementation if at all possible.
Where do you see room for improvement? It is a fallacy to assume that "most efficient implementation" always means "there is a machine instruction providing a 1:1 translation of my high-level construct". Look at this from the POV of cache synchronisation cost (which is the real cost, not the number of instructions), and you will realize that there is not much you can do (assuming you can squeeze the data copies as well as the sequence counter into the same cacheline). This approach BTW is already way faster than e.g. using a 64-bit mmx register and paying the cost of mmx->gpr transfers on x86. Regards Helge
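The even/odd generation-counter scheme Helge describes is essentially a seqlock with double buffering. A minimal single-writer/multi-reader sketch in C++ (all names are illustrative, not from Boost.Lockfree; readers that observe a changed counter discard what they read and retry):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the scheme from the thread: two copies of the data plus a
// word-sized generation counter. The counter's parity selects the
// "active" copy; the single writer always writes the inactive copy,
// then publishes it by bumping the counter.
struct seq_data {
    std::atomic<uint32_t> gen{0};  // even/odd selects the active copy
    uint64_t copy[2][2] = {};      // two copies of a 128-bit payload

    // Writer (single producer): fill the inactive copy, then publish.
    void write(uint64_t a, uint64_t b) {
        uint32_t g = gen.load(std::memory_order_relaxed);
        uint64_t* dst = copy[(g + 1) & 1];  // inactive copy
        dst[0] = a;
        dst[1] = b;
        gen.store(g + 1, std::memory_order_release);  // publish
    }

    // Reader: read counter, read data, re-read counter; if the counter
    // changed in between, the data may be torn -- start over.
    void read(uint64_t& a, uint64_t& b) const {
        for (;;) {
            uint32_t g1 = gen.load(std::memory_order_acquire);
            const uint64_t* src = copy[g1 & 1];  // active copy
            a = src[0];
            b = src[1];
            uint32_t g2 = gen.load(std::memory_order_acquire);
            if (g1 == g2) return;  // counter unchanged: data consistent
        }
    }
};
```

Note that a slow reader can race with the writer on the payload arrays (strictly speaking a data race under the C++ memory model), but any torn read is detected by the counter re-check and discarded -- which is exactly the "if it changed, start over" step, and why only word-sized atomics are needed regardless of the payload size.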

Agreed, this is not impossible, but I still tend to think we should strive for a more efficient implementation if at all possible.
Where do you see room for improvement? It is a fallacy to assume that "most efficient implementation" always means "there is a machine instruction providing a 1:1 translation of my high-level construct". Look at this from the POV of cache synchronisation cost (which is the real cost, not the number of instructions), and you will realize that there is not much you can do (assuming you can squeeze the data copies as well as the sequence counter into the same cacheline).
This approach BTW is already way faster than e.g. using a 64-bit mmx register and paying the cost of mmx->gpr transfers on x86.
That doesn't match my experience. Even in the noncontended case, I would be very surprised to see anything "way faster". On the latest and greatest hardware you may see a small edge for the scheme you propose -- and that's assuming I need a roundtrip to a GPR. However, under any kind of contention I do expect the MMX MOVQ version to be significantly faster. And of course, 64 bits is less than 32 + 2 * 64. -hg

On Monday 08 August 2011 10:59:47 Grund, Holger wrote:
Agreed, this is not impossible, but I still tend to think we should
strive
for a more efficient implementation if at all possible.
Where do you see room for improvement? It is a fallacy to assume that "most efficient implementation" always means "there is a machine instruction providing a 1:1 translation of my high-level construct". Look at this from the POV of cache synchronisation cost (which is the real cost, not the number of instructions), and you will realize that there is not much you can do (assuming you can squeeze the data copies as well as the sequence counter into the same cacheline).
This approach BTW is already way faster than e.g. using a 64-bit mmx register and paying the cost of mmx->gpr transfers on x86.
That doesn't match my experience. Even in the noncontended case, I would be very surprised to see anything "way faster".
mmx -> gpr has quite significant latency; don't forget that you need to shuffle around a fair bit, plus the cost of the eventual EMMS. Also don't forget that moving mmx -> gpr defeats a large portion of the CPU's out-of-order and speculative execution capability; CPUs cannot in general track dependencies across different register classes.
However, under any kind of contention I do expect the MMX MOVQ version to be significantly faster.
Assuming that you manage to put everything into a single cache line, I doubt that you will see any difference at all under contention: the real cost is the cache line transfer, and transferring a single one has a latency of ~150 cycles (and that is not likely to decrease on modern CPUs). After that, the number of bytes read out of the cache line is basically not measurable anymore.
And of course, 64 bits is less than 32 + 2 * 64.
It would rather be 32 + 64 + 32 (there is no point in reading the "inactive" copy, but the processor can certainly read it speculatively). Helge

On Wed, Aug 3, 2011 at 7:06 AM, Helge Bahmann <hcb@chaoticmind.net> wrote:
- if the data is changing (through a writable mapping by someone else to the page), then you have some sort of producer-/consumer-problem and that is trivially solvable with word-sized atomic operations
Really?
yes, really
keep an "even" and an "odd" copy of your data structure, keep an atomically readable "generation counter" -- on access, read the generation counter, read the data (depending on parity of generation counter), read the generation counter again
if it changed, start over. if it didn't change, you have your data; on modification, update generation counter as appropriate (if you are paranoid about counter overflows, you can repeat a similar trick with the counter itself)
This is single producer / single consumer? (probably works for spmc too). But not multi-producer. Right? Or do I misunderstand? Tony

On Saturday 06 August 2011 08:30:05 Gottlob Frege wrote:
On Wed, Aug 3, 2011 at 7:06 AM, Helge Bahmann <hcb@chaoticmind.net> wrote:
- if the data is changing (through a writable mapping by someone else to the page), then you have some sort of producer-/consumer-problem and that is trivially solvable with word-sized atomic operations
Really?
yes, really
keep an "even" and an "odd" copy of your data structure, keep an atomically readable "generation counter" -- on access, read the generation counter, read the data (depending on parity of generation counter), read the generation counter again
if it changed, start over. if it didn't change, you have your data; on modification, update generation counter as appropriate (if you are paranoid about counter overflows, you can repeat a similar trick with the counter itself)
This is single producer / single consumer? (probably works for spmc too). But not multi-producer. Right?
right, but you can always build consensus on your data somewhere else for multiple producers, then build consensus on who gets to "commit" the data to the pure consumers (probably this will work in a single step). While this adds another "copy" operation to the producer path, a few stores are lost in the noise compared to even a single interlocked cmpxchg. Helge
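One possible reading of the multi-producer extension sketched above (a hedged, illustrative interpretation -- Helge does not spell out the mechanism, and all names here are invented): producers first agree via a single word-sized CAS on who gets to commit, and the winner then runs the single-writer publish path (e.g. the even/odd generation-counter write):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch: consensus among multiple producers on "who
// commits" for a given round, using one word-sized CAS. The CAS winner
// performs the single-writer commit; losers know another producer won.
struct producer_consensus {
    std::atomic<uint32_t> ticket{0};  // round counter; CAS target

    // Try to claim the commit for `round`. Returns true if this caller
    // won the consensus and executed `commit`.
    template <class Commit>
    bool try_produce(uint32_t round, Commit commit) {
        uint32_t expected = round;
        if (!ticket.compare_exchange_strong(expected, round + 1,
                                            std::memory_order_acq_rel))
            return false;  // another producer already claimed this round
        commit();  // single-writer commit path (e.g. seqlock-style write)
        return true;
    }
};
```

The extra copy a losing producer may have prepared is simply discarded, matching the observation that a few redundant stores are cheap next to an interlocked cmpxchg.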

On Monday, August 8, 2011, Helge Bahmann <hcb@chaoticmind.net> wrote:
On Saturday 06 August 2011 08:30:05 Gottlob Frege wrote:
On Wed, Aug 3, 2011 at 7:06 AM, Helge Bahmann <hcb@chaoticmind.net> wrote:
- if the data is changing (through a writable mapping by someone else to the page), then you have some sort of producer-/consumer-problem and that is trivially solvable with word-sized atomic operations
Really?
yes, really
keep an "even" and an "odd" copy of your data structure, keep an atomically readable "generation counter" -- on access, read the generation counter, read the data (depending on parity of generation counter), read the generation counter again
if it changed, start over. if it didn't change, you have your data; on modification, update generation counter as appropriate (if you are paranoid about counter overflows, you can repeat a similar trick with the counter itself)
This is single producer / single consumer? (probably works for spmc too). But not multi-producer. Right?
right,
OK, I was just checking that I understood what you meant.
but you can always build consensus on your data somewhere else for multiple producers, then build consensus on who gets to "commit" the data to the pure consumers (probably this will work in a single step). While this adds another "copy" operation to the producer path, a few stores are lost in the noise compared to even a single interlocked cmpxchg.
Consensus building sounds like something that might require atomic ops of its own, but I get the idea.
Helge
Tony
participants (5)
- Gottlob Frege
- Grund, Holger
- Hartmut Kaiser
- Helge Bahmann
- Tim Blechmann