
Any volunteers with CodeWarrior and g++ access on a PowerPC? ;-) -- Peter Dimov http://www.pdimov.com

Peter Dimov wrote:
Any volunteers with CodeWarrior and g++ access on a PowerPC? ;-)
I haven't been following this closely.. Are you saying you need someone to do assembly for this? Or is it using the compiler's support? Or is it just testing? Which GCC? Which CodeWarrior? I have the Apple GCC only. And CW-8.3. And I could try my hand at the coding problem. Although it's been a while since I did PPC assembly. -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - Grafik/jabber.org

Rene Rivera wrote:
Peter Dimov wrote:
Any volunteers with CodeWarrior and g++ access on a PowerPC? ;-)
I haven't been following this closely.. Are you saying you need someone to do assembly for this? Or is it using the compiler's support? Or is it just testing?
Which GCC? Which CodeWarrior?
I have the Apple GCC only. And CW-8.3. And I could try my hand at the coding problem. Although it's been a while since I did PPC assembly.
Testing, mostly... and fixing any obvious errors. It's hard to write code without a compiler. :-) sp_counted_base_gcc_ppc.hpp and sp_counted_base_cw_ppc.hpp are now in CVS. Uncomment the appropriate lines in sp_counted_base.hpp to enable them (is __powerpc__ the proper target macro for g++?) Thank you in advance.

Peter Dimov wrote:
Testing, mostly... and fixing any obvious errors. It's hard to write code without a compiler. :-)
Definitely :-) ... Is there any particular test I should run? Or is it basically all the smart_ptr test?
sp_counted_base_gcc_ppc.hpp and sp_counted_base_cw_ppc.hpp are now in CVS. Uncomment the appropriate lines in sp_counted_base.hpp to enable them (is __powerpc__ the proper target macro for g++?)
Good question.. I'll find out when I try it :-)

Rene Rivera wrote:
Peter Dimov wrote:
Testing, mostly... and fixing any obvious errors. It's hard to write code without a compiler. :-)
Definitely :-) ... Is there any particular test I should run? Or is it basically all the smart_ptr test?
All smart_ptr tests, yes. There are some that aren't run by default, shared_ptr_timing_test, shared_ptr_mt_test and weak_ptr_mt_test. It would be interesting to compare the performance of _timing_test with or without -D BOOST_SP_DISABLE_THREADS.
sp_counted_base_gcc_ppc.hpp and sp_counted_base_cw_ppc.hpp are now in CVS. Uncomment the appropriate lines in sp_counted_base.hpp to enable them (is __powerpc__ the proper target macro for g++?)
Good question.. I'll find out when I try it :-)
g++land is so exciting ;-) It seems that the Apple variant defines __ppc__, whereas the Linux/PPC version defines __powerpc__.

In article <013a01c53a26$cae17a70$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
sp_counted_base_gcc_ppc.hpp and sp_counted_base_cw_ppc.hpp are now in CVS. Uncomment the appropriate lines in sp_counted_base.hpp to enable them (is __powerpc__ the proper target macro for g++?)
Good question.. I'll find out when I try it :-)
g++land is so exciting ;-) It seems that the Apple variant defines __ppc__, whereas the Linux/PPC version defines __powerpc__.
Which macro is the right one depends on what you really mean. Are you trying to conditionalize on g++ on Mac OS X? Any C++ compiler on Mac OS X? g++ on any PPC platform? The answers are different for each one of those. Odds are you want __APPLE__ && __MACH__, and not get PPC involved at all. meeroh

Miro Jurisic wrote:
In article <013a01c53a26$cae17a70$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
sp_counted_base_gcc_ppc.hpp and sp_counted_base_cw_ppc.hpp are now in CVS. Uncomment the appropriate lines in sp_counted_base.hpp to enable them (is __powerpc__ the proper target macro for g++?)
Good question.. I'll find out when I try it :-)
g++land is so exciting ;-) It seems that the Apple variant defines __ppc__, whereas the Linux/PPC version defines __powerpc__.
Which macro is the right one depends on what you really mean. Are you trying to conditionalize on g++ on Mac OS X? Any C++ compiler on Mac OS X? g++ on any PPC platform? The answers are different for each one of those. Odds are you want __APPLE__ && __MACH__, and not get PPC involved at all.
Looking at Peter's code it seems he is interested in GCC on any PPC platform. It's selecting the type of assembly code to use, so the OS has nothing to do with it.

Peter Dimov wrote:
Rene Rivera wrote:
Peter Dimov wrote:
Testing, mostly... and fixing any obvious errors. It's hard to write code without a compiler. :-)
Definitely :-) ... Is there any particular test I should run? Or is it basically all the smart_ptr test?
All smart_ptr tests, yes. There are some that aren't run by default, shared_ptr_timing_test, shared_ptr_mt_test and weak_ptr_mt_test. It would be interesting to compare the performance of _timing_test with or without -D BOOST_SP_DISABLE_THREADS.
OK.. done.. Made one minor fix to the implementations for an extra argument to decrement count (checked into CVS). The results for the timing test are:
With BOOST_SP_DISABLE_THREADS..
* cw-8.3 debug, 18.06
* gcc-3.3 debug, 50.98
* gcc-3.3 release, 18.47
Without BOOST_SP_DISABLE_THREADS..
* cw-8.3 debug, 21.08
* gcc-3.3 debug, 51.07
* gcc-3.3 release, 18.52
I didn't build the release of cw-8.3 because it was taking a long time, more than 10 minutes, to compile some of the files. It's most likely one of the unbounded optimizations CW likes getting into.
To me the curious aspect is that CW is faster in debug mode than GCC in release mode. It's nice to see the threaded code be almost as fast as the non-threaded code.
Oh, I guess I only implied that all tests did pass. And I didn't mention that my machine is only a G3-233.

Rene Rivera wrote:
OK.. done.. Made one minor fix to the implementations for an extra argument to decrement count (checked into CVS). The results for the timing test are:
With BOOST_SP_DISABLE_THREADS..
* cw-8.3 debug, 18.06
* gcc-3.3 debug, 50.98
* gcc-3.3 release, 18.47
Without BOOST_SP_DISABLE_THREADS..
* cw-8.3 debug, 21.08
* gcc-3.3 debug, 51.07
* gcc-3.3 release, 18.52
I didn't build the release of cw-8.3 because it was taking a long time, more than 10 minutes, to compile some of the files. It's most likely one of the unbounded optimizations CW likes getting into.
To me the curious aspect is that CW is faster in debug mode than GCC in release mode. It's nice to see the threaded code be almost as fast as the non-threaded code.
Great. ;-) Well-designed uniprocessor systems "should" be as fast in threaded mode; we'll probably see a bigger hit on a dual PPC. We can officially uncomment the #ifs in sp_counted_base.hpp now, I guess.

Peter Dimov wrote: [...]
we'll probably see a bigger hit on a dual PPC.
Try on it something along the lines of
asm long atomic_decrement_weak( register long * pw ) {
    <load-UNreserved>
    <add -1>
    <branch if zero to acquire>
    {lw}sync
loop:
    <load-reserved>
    <add -1>
    <branch if zero to acquire>
    <store-conditional>
    <branch if failed to loop else to done>
acquire:
    isync
done:
    <...>
}
regards, alexander.

Alexander Terekhov wrote:
Peter Dimov wrote: [...]
we'll probably see a bigger hit on a dual PPC.
Try on it something along the lines of
asm long atomic_decrement_weak( register long * pw ) {
<load-UNreserved> <add -1> <branch if zero to acquire> {lw}sync
loop:
<load-reserved> <add -1> <branch if zero to acquire> <store-conditional> <branch if failed to loop else to done>
acquire:
isync
done:
<...> }
and
asm long atomic_decrement_strong( register long * pw ) {
    <load-reserved>
    <add -1>
    <branch if zero to acquire>
    {lw}sync
loop:
    <store-conditional>
    <branch if !failed to done>
    <load-reserved>
    <add -1>
    <branch if !zero to loop>
acquire:
    <store-conditional>
    <branch if failed to loop>
    isync
done:
    <...>
}
regards, alexander.

Alexander Terekhov wrote: [... asm long atomic_decrement_[weak|strong] ...] I suppose that CodeWarrior is smart enough to recognize the presence of msync instructions and will behave accordingly with respect to compiler caching/reordering across these routines. If not, things can go wild, beware. regards, alexander.

On Apr 6, 2005, at 12:36 PM, Alexander Terekhov wrote:
Alexander Terekhov wrote:
[... asm long atomic_decrement_[weak|strong] ...]
I suppose that CodeWarrior is smart enough to recognize the presence of msync instructions and will behave accordingly with respect to compiler caching/reordering across these routines. If not, things can go wild, beware.
Your assumption is correct. CodeWarrior won't reorder across these instructions. -Howard

Alexander Terekhov wrote: [...]
asm long atomic_decrement_strong( register long * pw ) {
<load-reserved> <add -1> <branch if zero to acquire> {lw}sync
loop:
<store-conditional> <branch if !failed to done> <load-reserved> <add -1> <branch if !zero to loop>
acquire:
<store-conditional> <branch if failed to loop> isync
done:
<...> }
I meant
asm long atomic_decrement_strong( register long * pw ) {
    <load-reserved>
    <add -1>
    <branch if zero to acquire>
    {lw}sync
loop1:
    <store-conditional>
    <branch if !failed to done>
loop2:
    <load-reserved>
    <add -1>
    <branch if !zero to loop1>
acquire:
    <store-conditional>
    <branch if failed to loop2>
    isync
done:
    <...>
}
regards, alexander.

Alexander Terekhov wrote:
and
asm long atomic_decrement_strong( register long * pw ) {
loop1:
<load-reserved> <add -1> <branch if zero to acquire> {lw}sync
loop:
<store-conditional> <branch if !failed to done>
loop2:
<load-reserved> <add -1> <branch if !zero to loop>
acquire:
<store-conditional> <branch if failed to loop>
... to either loop1 or loop2 ...
isync
done:
<...> }
but it's either suboptimal (more than one sync) or incorrect (missing sync), I think. It needs a state machine.
loop0:
    lwarx
    add -1
    beq acquire-without-sync
    sync
loop1:
    stwcx.
    beq+ done
loop2:
    lwarx
    add -1
    bne loop1
acquire-with-sync:
    stwcx.
    bne- loop2
    isync
    blr
acquire-without-sync:
    stwcx.
    bne- loop0
    isync
done:
    blr
or something like that. Post-release. The current code is good (and risky) enough. :-)

Peter Dimov wrote:
Alexander Terekhov wrote:
and
asm long atomic_decrement_strong( register long * pw ) {
[... loop -> loop1+loop2 ...]
but it's either suboptimal (more than one sync) or incorrect (missing sync), I think. It needs a state machine.
And how is this
asm long atomic_decrement_strong( register long * pw ) {
    <load-reserved>
    <add -1>
    <branch if zero to acquire>
    {lw}sync
loop1:
    <store-conditional>
    <branch if !failed to done>
loop2:
    <load-reserved>
    <add -1>
    <branch if !zero to loop1>
acquire:
    <store-conditional>
    <branch if failed to loop2>
    isync
done:
    <...>
}
incorrect or suboptimal?
loop0:
lwarx add -1 beq acquire-without-sync
sync
loop1:
stwcx. beq+ done
loop2:
lwarx add -1 bne loop1
acquire-with-sync:
stwcx. bne- loop2 isync blr
acquire-without-sync:
stwcx. bne- loop0 isync
done:
blr
I must be missing something, but it looks to me that you have way too much branching and isync-ing. regards, alexander.

Alexander Terekhov wrote: [... missing sync ...]
loop0:
lwarx add -1 beq acquire-without-sync
sync
loop1:
stwcx. beq+ done
loop2:
lwarx add -1 bne loop1
acquire-with-sync:
stwcx. bne- loop2 isync blr
acquire-without-sync:
stwcx. bne- loop0 isync
done:
blr
I must be missing something, but it looks to me that you have way too much branching and isync-ing.
Got it now. Thanks. regards, alexander.

Alexander Terekhov wrote: [...]
Got it now. Thanks.
asm long atomic_decrement_weak( register long * pw ) {
    <load-UNreserved>
    <add -1>
    <branch if zero to acquire>
    {lw}sync
loop:
    <load-reserved>
    <add -1>
    <branch if zero to acquire>
    <store-conditional>
    <branch if failed to loop else to done>
acquire:
    isync
done:
    <...>
}
asm long atomic_decrement_strong( register long * pw ) {
    // Peter's state machine
loop0:
    <load-reserved>
    <add -1>
    <branch if zero to loop0_acquire>
    {lw}sync
loop1:
    <store-conditional>
    <branch if !failed to done>
loop2:
    <load-reserved>
    <add -1>
    <branch if !zero to loop1>
    <store-conditional>
    <branch if failed to loop2 else to acquire>
loop0_acquire:
    <store-conditional>
    <branch if failed to loop0>
acquire:
    isync
done:
    <...>
}
All right now? regards, alexander.

In article <004e01c53ac9$35901640$6601a8c0@pdimov>, "Peter Dimov" <pdimov@mmltd.net> wrote:
or something like that. Post-release. The current code is good (and risky) enough. :-)
My understanding from having spoken to several Apple engineers whom I consider knowledgeable on this topic is that it's a very bad idea to write assembly code to perform atomic operations on the PPC. There is a variety of CPU-specific idiosyncrasies that make it very difficult to write such code correctly. As you might imagine, writing such code not quite correctly leads to bugs that are very hard to track down.
This recommendation is explicit in Apple's documentation at <http://developer.apple.com/technotes/tn/tn2006.html>:
"Do not use the PowerPC instructions Load Reserved (lwarx) and Store Conditional (stwcx) to implement atomicity in your preemptively threaded application. As described in DTS Technote 1137 Disabling Interrupts on the Traditional Mac OS, these instructions are non-portable and are tricky to use correctly across the full spectrum of PowerPC implementations."
For this reason, Apple supplies a number of atomic primitives in Mac OS 9 and Mac OS X which perform a variety of modify-or-fail operations; Apple takes it upon themselves to make sure that their implementation is correct for the CPU architectures that the OS supports, and I can guarantee you that Apple has more time to develop and test these across a wide range of PPC hardware than we ever will (or should try to).
I cannot overstate how important it is that we use the Apple APIs. They are documented at <http://developer.apple.com/documentation/Hardware/DeviceManagers/pci_srvcs/p... ards_drivers/PCI_BOOK.1a3.html>. Do not be concerned about the fact that this documentation is for PCI drivers; the same APIs are available in userland with #include <CoreServices/CoreServices.h>.
meeroh

Miro Jurisic wrote: [... Apple's lwarx/stwcx. smells-like-FUD stuff ...]
Reading this
http://www.google.de/groups?selm=1dguf31.ssbf6n1ogodzrN%40michel.sophia.dcm....
http://www.google.de/groups?selm=rbarris-ya023280001610980237340001%40206.82...
http://www.google.de/groups?selm=alexr-1610981926250001%40roseal2.apple.com
http://www.google.de/groups?selm=Pine.ULT.3.90.981017011012.17495A-100000%40...
http://www.google.de/groups?selm=Pine.ULT.3.90.981019134835.576A-100000%40st...
http://www.google.de/groups?selm=Pine.ULT.3.90.981020094604.8148A-100000%40s...
http://www.google.de/groups?selm=rang-2010981430230001%40margaret.trillium.a...
I gather that it's all about "hard to grasp" msync stuff, nothing else.
regards, alexander.
P.S. BTW, that Apple's CompareAndSwap (as of 1998) looks really over{i}sync'd beyond reasonable (redundant store aside for a moment).

In article <4254643E.E4AE9DAE@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
Miro Jurisic wrote:
[... Apple's lwarx/stwcx. smells-like-FUD stuff ...]
I gather that it's all about "hard to grasp" msync stuff, nothing else.
regards, alexander.
As I said elsewhere in the thread, I freely admit doubt because I do not have insight into all the CPU-specific issues here, but:
1. Apple has an API that works
2. Apple has an API that doesn't depend on the compiler
3. Apple makes an implicit commitment to continue making this API work as new hardware is released
Leaving aside the question of what the 1998 CompareAndSwap looked like (because that's not what it looks like today), as far as I can tell there is no technical reason to believe that the boost code would be better than Apple's, and there is a good reason that calling Apple's APIs would make our code easier to maintain both in terms of compiler support and in terms of new hardware support.
Let's not succumb to NIH syndrome here. When a library that boost can depend on (OS, STL, ANSI C, etc) provides functionality that we need, as is the case here, we should use it.
meeroh

Miro Jurisic wrote:
As I said elsewhere in the thread, I freely admit doubt because I do not have insight into all the CPU-specific issues here, but:
1. Apple has an API that works
2. Apple has an API that doesn't depend on the compiler
3. Apple makes an implicit commitment to continue making this API work as new hardware is released
4. Apple has not documented the memory synchronization properties of this API.

In article <003c01c53b4c$1ec97680$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
Miro Jurisic wrote:
As I said elsewhere in the thread, I freely admit doubt because I do not have insight into all the CPU-specific issues here, but:
1. Apple has an API that works
2. Apple has an API that doesn't depend on the compiler
3. Apple makes an implicit commitment to continue making this API work as new hardware is released
4. Apple has not documented the memory synchronization properties of this API.
If you give me a specific question to ask of Apple, I will work on getting the answer. Maybe they didn't document it, maybe we haven't found the documentation, maybe the documentation exists but hasn't been published yet. Very often, technical notes are released in direct response to developer questions. meeroh

Miro Jurisic wrote:
In article <003c01c53b4c$1ec97680$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
4. Apple has not documented the memory synchronization properties of this API.
If you give me a specific question to ask of Apple, I will work on getting the answer.
"What are the memory synchronization properties of the atomic primitives?" is one good question, but "Is there a specific reason for not encouraging the developers to use lwarx/stwcx. except memory visibility issues on a multiprocessor?" is a much better one. FWIW, http://www.opensource.apple.com/darwinsource/10.3/gccfast-1614/libjava/sysde... uses lwarx/stwcx./sync/isync in a way that matches our understanding of how they are supposed to be used. There are other examples in Apple's own code, but opensource.apple.com wants me to register to see them. ;-)

Peter Dimov wrote:
There are other examples in Apple's own code, but opensource.apple.com wants me to register to see them. ;-)
But apparently they didn't ask googlebot to register: http://64.233.183.104/search?q=cache:CBs4claC9QsJ:www.opensource.apple.com/d... http://64.233.183.104/search?q=cache:9VZ6d3qK-hAJ:www.opensource.apple.com/d... And another non-Apple example: http://www.opensource.apple.com/darwinsource/7.0b1/samba/samba/source/tdb/sp... No CPU-specific hacks anywhere; just straightforward lwarx/stwcx. use.

In article <00b601c53ba9$8b562e80$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
Peter Dimov wrote:
There are other examples in Apple's own code, but opensource.apple.com wants me to register to see them. ;-)
But apparently they didn't ask googlebot to register:
http://64.233.183.104/search?q=cache:CBs4claC9QsJ:www.opensource.apple.com/d... winsource/Current/xnu-517.3.7/libkern/ppc/OSAtomic.s
http://64.233.183.104/search?q=cache:9VZ6d3qK-hAJ:www.opensource.apple.com/d... winsource/10.3.2/xnu-517.3.7/osfmk/ppc/hw_lock.s
And another non-Apple example:
http://www.opensource.apple.com/darwinsource/7.0b1/samba/samba/source/tdb/sp... lock.c
No CPU-specific hacks anywhere; just straightforward lwarx/stwcx. use.
The answer I got from Apple about your question says:
1. Yes, the documentation for the userland calls is missing specification about memory synchronization behavior in the current version of the developer tools
2. The documentation for the corresponding kernel calls is at <http://developer.apple.com/documentation/Darwin/Reference/KernellibkernFrame... /OSAtomic/> and includes memory synchronization behavior
3. It is safe to assume that the userland call Foo has the same behavior as the kernel call OSFoo for the purpose of this discussion. (This API discrepancy will be addressed in the future.)
If the documentation referenced here does not adequately answer your question, please let me know and I will follow up with Apple.
Finally, I understand what you are saying regarding the fact that Apple's current code is what you expect it to be, but my point about Apple having more resources to keep this code up-to-date with emerging hardware still stands, and therefore I still think that we should use Apple's APIs. (And the engineers I talked to agree with me.)
meeroh

Miro Jurisic wrote: [...]
2. The documentation for the corresponding kernel calls is at <http://developer.apple.com/documentation/Darwin/Reference/KernellibkernFrame... /OSAtomic/> and includes memory synchronization behavior
Invisible e-ink, I suppose. Can't see it. regards, alexander.

In article <42559E8A.EB7E42BF@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
Miro Jurisic wrote: [...]
2. The documentation for the corresponding kernel calls is at <http://developer.apple.com/documentation/Darwin/Reference/KernellibkernFramework/OSAtomic/> and includes memory synchronization behavior
Invisible e-ink, I suppose. Can't see it.
I have a strong personal and professional interest in making sure that boost on Mac OS X does not cause compatibility problems and hard-to-diagnose bugs, not just now, but for the future of products that I ship linked with boost; I also have a strong interest in helping make Apple's documentation accurate and adequate. I am willing to spend my time talking to you and my time and money on getting you an authoritative answer from Apple engineering, but I do not have the resources to put up with your snarky remarks and your less-than-enthusiastic responses to what I believe is a reasonable concern about code quality. You are wasting my good will.
Please write up a specific question that I can give to an Apple engineer, because it's clear that either the one-sentence question I was previously given was not specific enough or that I did not understand it correctly.
meeroh

Miro Jurisic wrote: ... Save your resources and stop playing broken telephone. Revealed details are not funny and it can become even worse. regards, alexander.

Miro Jurisic wrote:
The answer I got from Apple about your question says:
1. Yes, the documentation for the userland calls is missing specification about memory synchronization behavior in the current version of the developer tools
2. The documentation for the corresponding kernel calls is at <http://developer.apple.com/documentation/Darwin/Reference/KernellibkernFramework/OSAtomic/> and includes memory synchronization behavior
3. It is safe to assume that the userland call Foo has the same behavior as the kernel call OSFoo for the purpose of this discussion. (This API discrepancy will be addressed in the future.)
The documentation at the link above does not seem to describe memory synchronization. It only says that the operations are performed atomically. The OSAtomicAdd/Increment/Decrement implementation in OSAtomic.s does not contain any sync or isync barriers. We need a decrement that is at least a release when the new value is nonzero and an acquire when the new value is zero. OSCompareAndSwap (or its alias hw_compare_and_store in hw_lock.s) seems to provide acquire semantics on success (it has a trailing isync). This is also not mentioned in the documentation above. I don't think that we will gain anything from using these primitives. We'll still retain our current ppc versions for non-Apple OSes.
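The requirement stated above (a decrement that is a release when the new value is nonzero and an acquire when it reaches zero) can be sketched in portable terms. This is only an illustration using C++11 std::atomic, which postdates this thread; it is not the code in sp_counted_base_gcc_ppc.hpp, and the names sketch_count, sketch_add_ref and sketch_release are hypothetical.

```cpp
#include <atomic>

// Hypothetical reference count, starting with one owner.
struct sketch_count
{
    std::atomic<long> value{1};
};

// Taking another reference needs atomicity but no ordering in this sketch.
inline void sketch_add_ref(sketch_count& c)
{
    c.value.fetch_add(1, std::memory_order_relaxed);
}

// Drops one reference. Returns true when the caller dropped the last one
// and may destroy the object.
inline bool sketch_release(sketch_count& c)
{
    // Release: everything this thread wrote to the object is published
    // before the count is seen to go down.
    if (c.value.fetch_sub(1, std::memory_order_release) == 1)
    {
        // Acquire, only on the path that reached zero: the destroying
        // thread sees the writes published by every earlier release.
        std::atomic_thread_fence(std::memory_order_acquire);
        return true;
    }
    return false;
}
```

A plain atomic add with no barriers, which is what the OSAtomic.s routines are described as above, would correspond to using memory_order_relaxed for the decrement as well, and that is the gap being pointed out.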

In article <00f901c53bb6$5caf34c0$6501a8c0@pdimov2>, "Peter Dimov" <pdimov@mmltd.net> wrote:
The documentation at the link above does not seem to describe memory synchronization. It only says that the operations are performed atomically.
The OSAtomicAdd/Increment/Decrement implementation in OSAtomic.s does not contain any sync or isync barriers. We need a decrement that is at least a release when the new value is nonzero and an acquire when the new value is zero.
OSCompareAndSwap (or its alias hw_compare_and_store in hw_lock.s) seems to provide acquire semantics on success (it has a trailing isync). This is also not mentioned in the documentation above.
Thanks. I understand the question better now. I am going to get you and Alexander in touch with someone who can answer your questions better than that web page or I can. meeroh

Miro Jurisic wrote: [...]
In article <4254643E.E4AE9DAE@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
Miro Jurisic wrote:
[... Apple's lwarx/stwcx. smells-like-FUD stuff ...]
I gather that it's all about "hard to grasp" msync stuff, nothing else.
regards, alexander.
As I said elsewhere in the thread, I freely admit doubt because I do not have insight into all the CPU-specific issues here, but:
1. Apple has an API that works
2. Apple has an API that doesn't depend on the compiler
3. Apple makes an implicit commitment to continue making this API work as new hardware is released
That's all fine and dandy, except that AFAIK Apple has no refcounting API with optimal msync semantics for basic thread-safety at all. Yet.
Leaving aside the question of what the 1998 CompareAndSwap looked like (because that's not what it looks like today), as far as I can tell there is no technical reason to believe that the boost code would be better than Apple's, and there is a good reason that calling Apple's APIs would make our code easier to maintain both in terms of compiler support and in terms of new hardware support.
I suppose that Apple provides fully-fenced CAS mimicking original IBM's CAS on mainframes. That's suboptimal (even apart from msync) on Power.
Let's not succumb to NIH syndrome here. When a library that boost can depend on (OS, STL, ANSI C, etc) provides functionality that we need, as is the case here, we should use it.
We should use Apple's atomic_decrement_weak(), atomic_decrement_strong(), atomic_decrement_strong(), atomic_increment_naked(), and atomic_increment_naked_if_not_zero(). Drop me a link to Apple's man page for that stuff ASAP, please. regards, alexander.

Alexander Terekhov wrote: [...]
We should use Apple's atomic_decrement_weak(), atomic_decrement_strong(), atomic_decrement_strong(), atomic_increment_naked(), and atomic_increment_naked_if_not_zero().
Err. Double atomic_decrement_strong(). One is meant to be atomic_fetch_naked(). ;-) regards, alexander.
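For readers unfamiliar with the last name in that list, here is one possible reading of atomic_increment_naked_if_not_zero(), sketched with C++11 std::atomic (which postdates this thread). It assumes "naked" means "atomic, with no memory ordering attached"; that reading, and the function name sketch_increment_if_not_zero, are assumptions rather than anything defined in the thread.

```cpp
#include <atomic>

// Increments the count only if it is currently nonzero.
// Returns true if the increment happened, false if the count was zero.
inline bool sketch_increment_if_not_zero(std::atomic<long>& count)
{
    long observed = count.load(std::memory_order_relaxed);
    while (observed != 0)
    {
        // On failure compare_exchange_weak reloads 'observed', so the
        // loop re-checks for zero before trying again.
        if (count.compare_exchange_weak(observed, observed + 1,
                                        std::memory_order_relaxed))
        {
            return true;
        }
    }
    return false;
}
```

The compare-exchange loop plays the role that a lwarx/stwcx. retry loop plays in the PowerPC pseudo-assembly elsewhere in the thread.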

In article <4254EFCC.29894E10@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
Miro Jurisic wrote: [...]
In article <4254643E.E4AE9DAE@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
As I said elsewhere in the thread, I freely admit doubt because I do not have insight into all the CPU-specific issues here, but:
1. Apple has an API that works
2. Apple has an API that doesn't depend on the compiler
3. Apple makes an implicit commitment to continue making this API work as new hardware is released
That's all fine and dandy, except that AFAIK Apple has no refcounting API with optimal msync semantics for basic thread-safety at all. Yet.
I don't know if they do, but if you give me a specific question to ask, I'll ask it. I have enough contacts at Apple that I should be able to help.
Leaving aside the question of what the 1998 CompareAndSwap looked like (because that's not what it looks like today), as far as I can tell there is no technical reason to believe that the boost code would be better than Apple's, and there is a good reason that calling Apple's APIs would make our code easier to maintain both in terms of compiler support and in terms of new hardware support.
I suppose that Apple provides fully-fenced CAS mimicking original IBM's CAS on mainframes. That's suboptimal (even apart from msync) on Power.
This sounds like an attempt at premature optimization to me, frankly. Have you profiled Apple's atomic primitives to determine that they are going to be problematic, or are you just guessing? I would rather have code that runs slightly slower but doesn't need to be revised when new hardware comes out.
Let's not succumb to NIH syndrome here. When a library that boost can depend on (OS, STL, ANSI C, etc) provides functionality that we need, as is the case here, we should use it.
We should use Apple's atomic_decrement_weak(), atomic_decrement_strong(), atomic_decrement_strong(), atomic_increment_naked(), and atomic_increment_naked_if_not_zero().
Drop me a link to Apple's man page for that stuff ASAP, please.
Sorry, I don't know what you are talking about. Not only do I not see such APIs on Mac OS, google also doesn't reveal such APIs anywhere else. meeroh

Miro Jurisic wrote: [...]
We should use Apple's atomic_decrement_weak(), atomic_decrement_strong(), atomic_decrement_strong(), atomic_increment_naked(), and ^^^^^^^^^^^^^^^^^^^^^^^^^
atomic_fetch_naked().
atomic_increment_naked_if_not_zero().
Drop me a link to Apple's man page for that stuff ASAP, please.
Sorry, I don't know what you are talking about. Not only do I not see such APIs on Mac OS, google also doesn't reveal such APIs anywhere else.
Try Google Canada or Germany. http://groups.google.ca/groups?q=atomic_decrement_weak http://groups.google.de/groups?q=atomic_decrement_weak They both seem to work fine. (US population is currently being tortured by "groups-beta" at google.com. ;-) ) regards, alexander.

In article <4255969D.925F93C4@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
Miro Jurisic wrote: [...]
We should use Apple's atomic_decrement_weak(), atomic_decrement_strong(), atomic_decrement_strong(), atomic_increment_naked(), and ^^^^^^^^^^^^^^^^^^^^^^^^^
atomic_fetch_naked().
atomic_increment_naked_if_not_zero().
Drop me a link to Apple's man page for that stuff ASAP, please.
Sorry, I don't know what you are talking about. Not only do I not see such APIs on Mac OS, google also doesn't reveal such APIs anywhere else.
Try Google Canada or Germany.
http://groups.google.ca/groups?q=atomic_decrement_weak http://groups.google.de/groups?q=atomic_decrement_weak
They both seem to work fine. (US population is currently being tortured by "groups-beta" at google.com. ;-) )
In that case, read <http://developer.apple.com/documentation/Darwin/Reference/KernellibkernFrame... /OSAtomic/> and tell me if it answers your questions. As I mention elsewhere in the thread, Apple's documentation is flawed on this matter: they document kernel calls (e.g., OSIncrementAtomic) better than the userland calls (e.g., IncrementAtomic), but this issue will be addressed in the future, and it's safe to assume that the behavior of the userland call (which I believe we should use) is the same as the documented behavior of its kernel counterpart (whose documentation I referred you to above). meeroh

Miro Jurisic wrote: [...]
In that case, read <http://developer.apple.com/documentation/Darwin/Reference/KernellibkernFrame... /OSAtomic/> and tell me if it answers your questions.
It answers neither general msync questions nor specific "demand" for optimal refcounting API. regards, alexander.

Miro Jurisic wrote:
In article <004e01c53ac9$35901640$6601a8c0@pdimov>, "Peter Dimov" <pdimov@mmltd.net> wrote:
or something like that. Post-release. The current code is good (and risky) enough. :-)
My understanding from having spoken to several Apple engineers whom I consider knowledgeable on this topic is that it's a very bad idea to write assembly code to perform atomic operations on the PPC. There is a variety of CPU-specific idiosyncrasies that make it very difficult to write such code correctly. As you might imagine, writing such code not quite correctly leads to bugs that are very hard to track down.
Yes... On the one hand, when the OS developers say that their API should be used, one should listen. On the other hand, none of the technical notes state a concrete problem with lwarx/stwcx. I am inclined to think that the problems with hand-written lwarx/stwcx. code are not caused by CPU-specific idiosyncrasies, but by memory visibility issues. Which we have taken care of. (Memory visibility could be described as CPU-specific, of course.) Google knows about one problem with stwcx. on PPC405, known as "erratum 77". But to the best of my knowledge no Mac has ever used a 405.

Peter Dimov wrote: [...]
Google knows about one problem with stwcx. on PPC405, known as "erratum 77".
Aka "CPU_210". See ftp://ftp.xilinx.com/pub/documentation/misc/ppc405f6v5_2_0.pdf Given that it is (officially) "Category 2" ("Major impact, workaround is impractical to implement, or a substantial risk of encountering the same or additional problems, including performance issues, exist after the workaround is implemented."), I (personally, so to speak) wouldn't even bother with it and simply say "you might have way too busted PPC405 model, please check and replace it if you want threading" or something like that in the docu instead. regards, alexander.

Miro wrote:
In article <004e01c53ac9$35901640$6601a8c0@pdimov>, "Peter Dimov" <pdimov@mmltd.net> wrote:
or something like that. Post-release. The current code is good (and risky) enough. :-)
My understanding from having spoken to several Apple engineers whom I consider knowledgeable on this topic is that it's a very bad idea to write assembly code to perform atomic operations on the PPC. There is a variety of CPU-specific idiosyncrasies that make it very difficult to write such code correctly. As you might imagine, writing such code not quite correctly leads to bugs that are very hard to track down.
This recommendation is explicit in Apple's documentation at <http://developer.apple.com/technotes/tn/tn2006.html>:
"Do not use the PowerPC instructions Load Reserved (lwarx) and Store Conditional (stwcx) to implement atomicity in your preemptively threaded application. As described in DTS Technote 1137 Disabling Interrupts on the Traditional Mac OS, these instructions are non-portable and are tricky to use correctly across the full spectrum of PowerPC implementations."
A couple of posters have referred to this as "smells-like-FUD", and "fails to state any technical reasons". To me, that sounds like denial. Apple has committed to provide routines that work on all PPC CPUs that they ship. Why in the world would you choose not to use them, if they do what you need?
At 1:39 AM +0300 4/7/05, Peter Dimov wrote:
On the one hand, when the OS developers say that their API should be used, one should listen.
On the other hand, none of the technical notes state a concrete problem with lwarx/stwcx.
I am inclined to think that the problems with hand-written lwarx/stwcx. code are not caused by CPU-specific idiosyncrasies, but by memory visibility issues. Which we have taken care of. (Memory visibility could be described as CPU-specific, of course.)
Basing your code on "I'm inclined to think that..." does not sound like sound engineering practice to me. If you check out TN 1137, you see an earlier version of the same warning:
Important: DTS recommends that developers avoid using the PowerPC Load Reserved and Store Conditional instructions. There are two reasons for this. Firstly, these instructions are inherently processor-specific and reduce the portability of your code. Secondly, the behavior of these instructions varies between PowerPC CPU types. Accommodating all these variations is tricky. These instructions do not provide much utility beyond that provided by the Open Transport and DriverServicesLib atomic routines, and Apple ensures that these atomic routines are updated to do the right thing in all cases.
If I was writing this code, I would not want to take on the burden of research, testing, and ongoing maintenance that this implies. Certainly "It works fine on my machine" is not sufficient, given these warnings. Off the top of my head, I would want to test against the 601, 603, 603e, 604, G3 (several revisions), G4 (several revisions) and G5. Apple has the advantage here, in that the code only has to run on a single CPU type - the one in the machine. [ They can (and probably do) have different routines for different CPUs, but only one gets burned into the ROMs that come with the machine. ] A routine that comes with boost has to deal with all the CPUs that Apple has shipped (+ future ones). -- -- Marshall Marshall Clow Idio Software <mailto:marshall@idio.com> It is by caffeine alone I set my mind in motion. It is by the beans of Java that thoughts acquire speed, the hands acquire shaking, the shaking becomes a warning. It is by caffeine alone I set my mind in motion.

Marshall Clow wrote:
"Do not use the PowerPC instructions Load Reserved (lwarx) and Store Conditional (stwcx) to implement atomicity in your preemptively threaded application. As described in DTS Technote 1137 Disabling Interrupts on the Traditional Mac OS, these instructions are non-portable and are tricky to use correctly across the full spectrum of PowerPC implementations."
A couple of posters have referred to this as "smells-like-FUD", and "fails to state any technical reasons". To me, that sounds like denial.
The recommendation above is correct. The instructions are obviously non-portable and are, indeed, tricky to use correctly.
Apple has committed to provide routines that work on all PPC CPUs that they ship. Why in the world would you choose not to use them, if they do what you need?
Because their memory synchronization properties are not documented. They _probably_ are fully fenced. But it's not guaranteed. Even if it were guaranteed, we don't need a full fence.
If I was writing this code, I would not want to take on the burden of research, testing, and ongoing maintenance that this implies. Certainly "It works fine on my machine" is not sufficient, given these warnings. Off the top of my head, I would want to test against the 601, 603, 603e, 604, G3 (several revisions), G4 (several revisions) and G5.
CPUs do happen to have bugs. But this isn't very common, and the existing CPU bugs are usually well-known. "Programmers equate an atomic update with a full fence" is a much better explanation of Apple's warnings... on the basis of the available information. They _could_ have identified a system where lwarx/stwcx. do not work reliably, after all.
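To make the "we don't need a full fence" point above concrete, here is a minimal sketch of a reference count written with C++11 std::atomic, which postdates this thread. The class name ref_count and its members are hypothetical, and the choice of orderings is the commonly used one for reference counting, not a transcription of the Boost headers under discussion. The increment needs no ordering at all, and the decrement needs only release semantics, with an acquire fence on the single thread that drops the count to zero.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical illustration; not the sp_counted_base implementation.
class ref_count
{
    std::atomic<long> count_;

public:
    explicit ref_count(long initial = 1) : count_(initial) {}

    // Taking another reference publishes nothing to other threads,
    // so no ordering is required.
    void add_ref()
    {
        count_.fetch_add(1, std::memory_order_relaxed);
    }

    // Returns true when the caller dropped the last reference and may
    // destroy the managed object.
    bool release()
    {
        // Release: writes made through this reference become visible
        // to whichever thread ends up destroying the object.
        long previous = count_.fetch_sub(1, std::memory_order_release);
        assert(previous > 0 && "release() called more often than add_ref()");

        if (previous == 1)
        {
            // Acquire: only the destroying thread pays for this fence.
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    long use_count() const
    {
        return count_.load(std::memory_order_relaxed);
    }
};
```

A fully fenced atomic routine would also be correct here; it would simply impose more ordering than the count needs, on every increment as well as every decrement.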

Any volunteers with CodeWarrior and g++ access on a PowerPC? ;-)
I have those. What's up? CodeWarrior 8.x, 9.x (including 9.5 beta) gcc 3.3, 3.4, and 4.0 -- -- Marshall Marshall Clow Idio Software <mailto:marshall@idio.com>

Marshall Clow wrote:
Any volunteers with CodeWarrior and g++ access on a PowerPC? ;-)
I have those. What's up?
CodeWarrior 8.x, 9.x (including 9.5 beta) gcc 3.3, 3.4, and 4.0
Just run the smart_ptr tests from the latest CVS. You may need to wait for the changes to propagate to the anonymous CVS, though; it lags behind.
participants (6): Alexander Terekhov, Howard Hinnant, Marshall Clow, Miro Jurisic, Peter Dimov, Rene Rivera