[shared_ptr] PPC implementations updated

I have fixed a 970-specific problem with the PPC implementations that was pointed to us by Philip Koch. Thanks to Miro Jurisic for getting his attention. So the tests will need to be re-run. I hope I didn't make any stupid mistakes. -- Peter Dimov http://www.pdimov.com

Alexander Terekhov wrote:
Peter Dimov wrote:
I have fixed a 970-specific problem with the PPC implementations
I'd like to see official erratum. Quick search yielded nothing. I'd appreciate some help please.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87... http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87... This is what I found. It seems to confirm it.

Peter Dimov wrote:
Alexander Terekhov wrote:
Peter Dimov wrote:
I have fixed a 970-specific problem with the PPC implementations
I'd like to see official erratum. Quick search yielded nothing. I'd appreciate some help please.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87...
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87...
Thanks.
This is what I found. It seems to confirm it.
That (preliminary) notice doesn't really say what's the problem with respect to "dangling" reservation. (Context switch of course must clear reservation; everybody and his dog knows that.) regards, alexander.

Alexander Terekhov wrote:
Peter Dimov wrote:
Alexander Terekhov wrote:
Peter Dimov wrote:
I have fixed a 970-specific problem with the PPC implementations
I'd like to see official erratum. Quick search yielded nothing. I'd appreciate some help please.
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87...
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87...
Thanks.
This is what I found. It seems to confirm it.
That (preliminary) notice doesn't really say what's the problem with respect to "dangling" reservation. (Context switch of course must clear reservation; everybody and his dog knows that.)
I don't know, but all of Apple's code specifically clears the reservation. ;-)

Alexander Terekhov wrote:
Peter Dimov wrote: [...]
I don't know, but all of Apple's code specifically clears the reservation. ;-)
It doesn't make (some) Apple's code less broken (msync wise) though. ;-)
Note also that http://sourceware.org/cgi-bin/cvsweb.cgi/libc/sysdeps/powerpc/bits/atomic.h?... doesn't seem to exhibit that paranoia (and supposedly runs just fine on the same hardware). I suggest that you uncommit the "fix". regards, alexander.

Alexander Terekhov wrote:
Alexander Terekhov wrote:
Peter Dimov wrote: [...]
I don't know, but all of Apple's code specifically clears the reservation. ;-)
It doesn't make (some) Apple's code less broken (msync wise) though. ;-)
Note also that
http://sourceware.org/cgi-bin/cvsweb.cgi/libc/sysdeps/powerpc/bits/atomic.h?...
doesn't seem to exhibit that paranoia (and supposedly runs just fine on the same hardware). I suggest that you uncommit the "fix".
It's good enough for 1.33. A failed lock() is still expensive because I haven't fixed the upper layers to not throw bad_weak_ptr. It might be good enough anyway... it's only two more instructions, after all. stwcx. is uncontended on zero. The paranoia is cheap in this specific case. :-)

Peter Dimov wrote:
It might be good enough anyway... it's only two more instructions, after all. stwcx. is uncontended on zero.
It is contended on strong count and uncontended on weak count. On MP, lock()ers will simply play Ping-Pong with zero fighting each other. Even if you change the code and do that silly "cleanup write" out of the loop, it will still increase traffic, I'm afraid. So, again, get rid of it unless the issue is provably real (not a myth created from miscommunication or whatever). regards, alexander.

In article <4256E629.5C47C47C@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
Peter Dimov wrote:
It might be good enough anyway... it's only two more instructions, after all. stwcx. is uncontended on zero.
It is contended on strong count and uncontended on weak count. On MP, lock()ers will simply play Ping-Pong with zero fighting each other. Even if you change the code and do that silly "cleanup write" out of the loop, it will still increase traffic, I'm afraid. So, again, get rid of it unless the issue is provably real (not a myth created from miscommunication or whatever).
I think that given the choice of better performance and better safety we should err on the side of safety until the faster version is provable safe, not the other way around (as you seem to be suggesting). meeroh

Miro Jurisic wrote: [...]
I think that given the choice of better performance and better safety we should err on the side of safety until the faster version is provable safe, not the other way around (as you seem to be suggesting).
Faster version doesn't become unsafe merely because a preliminary notice "strongly recommends" some silliness not giving any reason for "strong recommendation" whatsoever. regards, alexander.

In article <4256FADB.186931E7@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
Miro Jurisic wrote: [...]
I think that given the choice of better performance and better safety we should err on the side of safety until the faster version is provable safe, not the other way around (as you seem to be suggesting).
Faster version doesn't become unsafe merely because a preliminary notice "strongly recommends" some silliness not giving any reason for "strong recommendation" whatsoever.
Unless you can provide some credible data showing that the strong recommendation by chip manufacturer is safe to ignore, it would be prudent to follow the recommendation. What is it that you know that we don't? So far you have asserted a variety of things, all of which boil down to you being an authority on this subject -- including saying above that we should disregard an explicit strong recommendation from the chip manufacturer -- but unless you can provide some reason why your assertions are valid (preferably in the form of data, but credentials and references will do in a pinch), your argument doesn't really hold much credibility compared to official documentation by IBM. I'd be happy for you to be right, for this issue to be moot, and for us to be able to use the faster code. Please give us a solid reason to believe that the IBM documentation under discussion is wrong (other than "I say so"). meeroh

Miro Jurisic wrote: ... You've (again) got it ass backwards, Miro. My employer should better provide some *reason* to follow that "strong recommendation" or withdraw it (preliminary status aside for a moment). Weak or strong, (preliminary) recommendation does NOT equate to erratum. See also atomic stuff for power* in glibc's sources and pay a little attention to '@...'. regards, alexander.

In article <42574B24.7ECD6280@web.de>, Alexander Terekhov <terekhov@web.de> wrote:
You've (again) got it ass backwards, Miro. My employer should better provide some *reason* to follow that "strong recommendation" or withdraw it (preliminary status aside for a moment). Weak or strong, (preliminary) recommendation does NOT equate to erratum. See also atomic stuff for power* in glibc's sources and pay a little attention to '@...'.
Referring to glibc is not useful unless you give us a reason to believe that glibc was written with specific knowledge about why the recommendation should be ignored. For all we know, glibc's reason for being the way it is is the same as Peter's was -- i.e., not knowing about the IBM notice. Given that you work for IBM, you may have access to someone who knows why this preliminary notice was issued and why it is safe to ignore it, if it indeed is. If you are an authority on this issue, then please explain why the preliminary notice was issued and why it's safe to ignore it. If you are not an authority, then find someone who is and let us have the explanation. Otherwise, you are doing nothing but repeating poorly substantiated claims and urging us take your word for it. Finally, I would appreciate it if you could back off from your condescension. My goal here is not to insult you, prove you wrong, or make you look like a fool. My goal is to do what I can to ensure that when I build boost into a product and ship it to tens of thousands of users, that product's reliability and future compatibility are not compromised by poor engineering. meeroh

Miro Jurisic wrote: [...]
Given that you work for IBM, you may have access to someone who knows why this preliminary notice was issued and why it is safe to ignore it, if it indeed is.
Apart from my access to "someone", what makes you think that it is unsafe to ignore it?
If you are an authority on this issue, then please explain why the preliminary notice was issued and why it's safe to ignore it. If you are not an authority, then find someone who is and let us have the explanation.
Try official IBM channels. The notice says "Verify with your IBM field applications engineer that you have the latest version of this document before finalizing a design." I'm not your field applications engineer. regards, alexander.

< Forward Quoted > Newsgroups: comp.arch Subject: Re: lwarx/stwcx on PowerPC 970 References: <4257CC7A.252E5B2@web.de> Message-ID: <opsoyu88daqm36vk@grunion> Date: Sat, 09 Apr 2005 09:18:34 -0400 From: "Joe Seigh" <jseigh_01@xemaps.com> Joe Seigh wrote:
On Sat, 09 Apr 2005 14:37:14 +0200, Alexander Terekhov <terekhov@web.de> wrote:
http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/3A2D397F9A3202BD87...
Says that "Every larx executed should have an accompanying stcx to clear the reservation." and that "Additional information regarding Larx/Stcx instructions and usage can be found in the "PowerPC Microprocessor Family: Programming Environments Manual for 64 and 32-Bit Microprocessors".
"Additinal information" shows coding examples like this:
<quote>
loop: lwarx r6,0,r3 #load and reserve cmpw r4,r6 #first 2 operands equal ? bne- exit #skip if not stwcx. r5,0,r3 #store new value if still reserved bne- loop #loop if lost reservation exit: mr r4,r6 #return value from memory
</quote>
So, who's right here?
Just "strongly recommended", not required. The reservations obviously aren't recursively counted since the OS has no way of knowing the count in order to "clear" the reservations. Unless there's some performance hit for maintaining the reservation, I can't see any reason to worry about it.
-- Joe Seigh
< /Forward Quoted > On any sensible Power {PC} implementation (especially MP), redundant stwcx will negatively affect performance, not the other way around. regards, alexander.

Peter Dimov wrote: [...]
I don't know, but all of Apple's code specifically clears the reservation. ;-)
http://www.google.de/groups?selm=rang-2010981430230001%40margaret.trillium.a... I gather that they started doing that silly "cleanup" long before PPC970 was put on paper, not even first silicon. Ok, keep the "fix" in, but #ifdef it on Philip_Koch or something and don't penalize all Power clients, please. regards, alexander.

Alexander Terekhov wrote:
Peter Dimov wrote: [...]
I don't know, but all of Apple's code specifically clears the reservation. ;-)
http://www.google.de/groups?selm=rang-2010981430230001%40margaret.trillium.a...
I gather that they started doing that silly "cleanup" long before PPC970 was put on paper, not even first silicon. Ok, keep the "fix" in, but #ifdef it on Philip_Koch or something and don't penalize all Power clients, please.
Macro names should be all uppercase and start with BOOST... How about BOOST_MIRO_JURISIC_AND_PHILIP_KOCH? Jokes aside, I think that we should postpone this decision until after 1.33 is released. Maybe even #ifdef on __APPLE__, if you insist. Are you sure that keeping a reservation alive for extended periods of time does not incur a performance penalty, BTW? Could this be the reason for the preliminary technical note?

Alexander Terekhov wrote:
Peter Dimov wrote: [...]
Are you sure that keeping a reservation alive for extended periods of time does not incur a performance penalty, BTW?
I've never heard that it can.
Could this be the reason for the preliminary technical note?
I'm also puzzled.
Here's a snippet from the official "Programming Environments Manual" referenced by that puzzling preliminary notice itself. It explicitly mentions dangling lwarxs, says that it is good for forward progress, and shows yet another dangling lwarx in "better performance may be obtained" illustration. <quote> E.1 General Information The following points provide general information about the lwarx and stwcx. instructions: - It is acceptable to execute an lwarx instruction for which no stwcx. instruction is executed. Such a dangling lwarx instruction occurs in the example shown in Section E.2.5 , "Test and Set," if the value loaded is not zero. - To increase the likelihood that forward progress is made, it is important that looping on lwarx/stwcx. pairs be minimized. For example, in the sequence shown in Section E.2.5 , "Test and Set," this is achieved by testing the old value before attempting the store -- were the order reversed, more stwcx. instructions might be executed, and reservations might more often be lost between the lwarx and the stwcx. instructions. - The manner in which lwarx and stwcx. are communicated to other processors and mechanisms, and between levels of the memory subsystem within a given processor, is implementation-dependent. In some implementations, performance may be improved by minimizing looping on an lwarx instruction that fails to return a desired value. For example, in the example provided in Section E.2.5 , "Test and Set," if the program stays in the loop until the word loaded is zero, the programmer can change the "bne- $+12" to "bne- loop." In some implementations, better performance may be obtained by using an ordinary load instruction to do the initial checking of the value, as follows: loop: lwz r5,0(r3) #load the word cmpwi r5,0 #loop back if word bne- loop #not equal to 0 lwarx r5,0,r3 #try again, reserving cmpwi r5,0 #(likely to succeed) bne loop #try to store nonzero stwcx. r4,0,r3 # bne- loop #loop if lost reservation [...] E.2.5 Test and Set This version of the test and set primitive atomically loads a word from memory, ensures that the word in memory is a nonzero value, and sets CR0[EQ] according to whether the value loaded is zero. In this example, it is assumed that the address of the word to be tested is in GPR3, the new value (nonzero) is in GPR4, and the old value is returned in GPR5. loop: lwarx r5,0,r3 #load and reserve cmpwi r5, 0 #done if word bne $+12 #not equal to 0 stwcx. r4,0,r3 #try to store non-zero bne- loop #loop if lost reservation </quote> regards, alexander.

Alexander Terekhov wrote: [...]
That (preliminary) notice doesn't really say what's the problem with respect to "dangling" reservation. (Context switch of course must clear reservation; everybody and his dog knows that.)
Just to clarify. Think of simple unconditional increments on UP (count is zero initially). Thread A: <load-reserved> <add 1> [preemptive context switch] Thread B: <load-reserved> <add 1> <store-conditional> // count = 1 . . [another increment, count = 2] . . . [begin yet another increment] <load-reserved> [preemptive context switch] Thread A: <store-conditional> // count = 1 And you've just lost one billion dollars. In any preemptive OS context switch must always clear reservation to ensure that you never get into this sort of trouble. regards, alexander.

Peter Dimov wrote:
So the tests will need to be re-run. I hope I didn't make any stupid mistakes.
All still passing for CW-8.3 and Apple GCC 3.3 on OSX 10.2, G3. I let the release variant compile for cw-83 this time around. Took a while but it eventually finished and works without problems. SO at least it's not an infinite optimization problem.. It's just slow. -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - Grafik/jabber.org
participants (4)
-
Alexander Terekhov
-
Miro Jurisic
-
Peter Dimov
-
Rene Rivera