Re: [Boost-users] Compatibility between shared_ptr and Rational RoseRT

Ross Manges

25 Apr 2005

5:28 p.m.

When running the executable under Rational Purify, it reports many FIM errors (Freeing Invalid Memory), and all of these are again due to the deletion of objects that were owned by a shared_ptr.

Is your program (the code that uses shared_ptrs) running in multiple threads? I am seeing problems with my multithreaded code where the use_count in the shared_ptrs is not accurate, causing all sorts of problems. I'm even seeing core dump backtraces showing problems in the scoped_lock where the shared_ptr use_count is being updated. It appears that there are a lot of restrictions on how shared_ptrs can be used in a multithreaded program, and I may be breaking some of those restrictions as it is easy to do so. In any case, let me know if you find a solution to your FIM errors. I have a feeling I'm seeing symptoms of the same problem.

Thanks.

Peter Dimov

26 Apr

9:54 a.m.

New subject: Compatibility between shared_ptr and Rational RoseRT

Ross Manges wrote:

...

It appears that there are a lot of restrictions on how shared_ptrs can be used in a multithreaded program, and I may be breaking some of those restrictions as it is easy to do so.

Dave.Ware＠seleniacomms.com

10:45 a.m.

Hi Peter,

Thanks for the information. Our model is multi-threaded, so we are now taking a closer look at the way in which we are using shared_ptr, and checking if any of them are used across threads.

Regards,
Dave

Ross Manges wrote:

...

It appears that there are a lot of restrictions on how shared_ptrs can be used in a multithreaded program, and I may be breaking some of those restrictions as it is easy to do so.

There is one restriction: all write accesses to a specific shared_ptr need to be exclusive. This is the usual Posix model, also called "as thread safe as an int" and "basic thread safety". It applies to most shared variables in a program.

But I agree that it is easy to break the rules. And there is always the possibility that your program is fine, but there might be a thread-related error in shared_ptr (these are very difficult to find). So please let me know if you find out more.
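A minimal sketch of the rule Peter describes, written against C++11's std::shared_ptr, std::mutex and std::thread (which follow the same thread-safety rules as boost::shared_ptr) rather than the 2005 Boost API. The names Config, current_config, publish, snapshot and run_demo are hypothetical. Copying the shared pointer instance is a read and assigning to it is a write; because a write must not overlap any other access to the same instance, both go through one mutex. Once a thread holds its own copy, it can use and destroy that copy without locking, since the reference count itself is updated atomically.

```cpp
#include <memory>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical payload type used only for illustration.
struct Config {
    std::string name;
    int version;
};

// One shared_ptr *instance* visible to many threads. Every access to this
// instance (read or write) is serialized by config_mutex.
std::shared_ptr<const Config> current_config =
    std::make_shared<Config>(Config{"initial", 0});
std::mutex config_mutex;

// Write access: replaces the pointer held by the shared instance.
void publish(std::shared_ptr<const Config> next) {
    std::lock_guard<std::mutex> lock(config_mutex);
    current_config = std::move(next);
}

// Read access: copies the shared instance into a caller-owned shared_ptr.
// After the lock is released the caller's copy is independent, so it may be
// used and destroyed without further locking.
std::shared_ptr<const Config> snapshot() {
    std::lock_guard<std::mutex> lock(config_mutex);
    return current_config;
}

// Runs writers and readers concurrently; returns the version seen last.
int run_demo() {
    std::vector<std::thread> threads;
    for (int i = 1; i <= 4; ++i) {
        threads.emplace_back([i] {
            publish(std::make_shared<Config>(Config{"writer", i}));
        });
        threads.emplace_back([] {
            auto local = snapshot();  // independent copy; safe to use unlocked
            (void)local->version;
        });
    }
    for (auto& t : threads) t.join();
    return snapshot()->version;
}
```

The mutex protects the pointer variable, not the Config object; the object is const here, so sharing it between the readers' copies needs no further synchronization.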

Ross Manges

4:48 p.m.

...

But I agree that it is easy to break the rules. And there is always the possibility that your program is fine, but there might be a thread-related error in shared_ptr (these are very difficult to find). So please let me know if you find out more.

I have put together some sample code that shows the problem I am having. What is the best way for me to post the code? Is it appropriate to post a message with the code as an attachment? Should I tar/gzip the code? Thanks for your help!

Peter Dimov

5:20 p.m.

Ross Manges wrote:

...

...
But I agree that it is easy to break the rules. And there is always the possibility that your program is fine, but there might be a thread-related error in shared_ptr (these are very difficult to find). So please let me know if you find out more.

I have put together some sample code that shows the problem I am having. What is the best way for me to post the code? Is it appropriate to post a message with the code as an attachment? Should I tar/gzip the code?

Attaching the code (in whatever form is convenient to you) is fine.

Ross Manges

9:01 p.m.

...

Attaching the code (in whatever form is convenient to you) is fine.

OK, I've attached some sample code I made up to show the kind of problems I'm having (the code doesn't do anything useful). Running this code results in the backtrace below. I am aware that this code breaks the condition that the shared_ptr needs to have exclusive writing modifications; however, I am unsure how to fix the code to make it correct. Please let me know if you have any suggestions, thanks! (BTW, I'm running against the Boost CVS, which I downloaded on Fri, Apr 22nd; if I run against version 1_32, the backtrace is in the scoped_lock instead of atomic_increment.)

Some info about my system: Red Hat Enterprise Linux 3, running 2.4.21-4.ELsmp (on a dual-processor Dell with Xeons, if that matters). Here's the backtrace:

    (gdb) bt
    #0 0x0804a703 in atomic_increment (pw=0x21006) at sp_counted_base_gcc_x86.hpp:59
    #1 0x0804a655 in boost::detail::sp_counted_base::add_ref_copy (this=0x21002) at sp_counted_base_gcc_x86.hpp:133
    #2 0x0804a4d7 in boost::detail::shared_count::operator= (this=0xb191cde8, r=@0xb731e004) at shared_count.hpp:181
    #3 0x08049a58 in boost::shared_ptr<Element>::operator= (this=0xb191cde4, r=@0xb731e000) at shared_ptr.hpp:148
    #4 0x0804b089 in SimpleQueue::pop (this=0x804f290) at ../SimpleQueue.cpp:22
    #5 0x080494ef in CommonThread::operator() (this=0x8050470) at ../CommonThread.cpp:11
    #6 0x0804a06d in boost::detail::function::void_function_obj_invoker0<CommonThread, void>::invoke (function_obj_ptr= {obj_ptr = 0x8050470, const_obj_ptr = 0x8050470, func_ptr = 0x8050470, data = "p"}) at function_template.hpp:136
    #7 0xb75ce283 in boost::thread_group::size () from /home/auser/devel/boost_cvs_install/lib/libboost_thread-gcc-mt-1_32.so.1.32.0
    #8 0xb7493e21 in pthread_start_thread () from /lib/i686/libpthread.so.0
    #9 0xb743208a in clone () from /lib/i686/libc.so.6

Peter Dimov

9:42 p.m.

Ross Manges wrote:

...

...
Attaching the code (in whatever form is convenient to you) is fine.

OK, I've attached some sample code I made up to show the kind of problems I'm having (the code doesn't do anything useful). Running this code results in the backtrace below. I am aware that this code breaks the condition that the shared_ptr needs to have exclusive writing modifications; however, I am unsure how to fix the code to make it correct. Please let me know if you have any suggestions, thanks! (BTW, I'm running against the Boost CVS, which I downloaded on Fri, Apr 22nd;

You should probably redownload since the CVS contained some slightly broken versions for a short time. :-)

...

if I run against version 1_32, the backtrace is in the scoped_lock instead of atomic_increment)

... although your problem probably lies elsewhere. I'm not seeing any violations of the rules in your code; the problem is probably more mundane:

    ElementPtr SimpleQueue::pop()
    {
        ElementPtr retElem;
        if(!Queue.empty())
        {

Since this check is not protected by a mutex lock, it is possible for two or more threads to test for emptiness simultaneously when the queue contains only one element...

            boost::mutex::scoped_lock scoped_lock(mutex);
            retElem = Queue.back();

... and then the second and subsequent threads will access an invalid .back(). This may explain your crash.

            Queue.pop_back();
        }
        else
        {
            cout <<"******************* EMPTY QUEUE" << endl;
        }
        return retElem;
    }
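One way to close the race Peter points out is to take the lock before testing for emptiness, so that the check and the removal form a single critical section. A minimal sketch, assuming a std::vector-like container and using C++11's std::mutex and std::lock_guard in place of boost::mutex::scoped_lock; the Element type, the member names and the count_successful_pops helper are hypothetical, not taken from Ross's attachment.

```cpp
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Element { int id; };                  // hypothetical payload type
typedef std::shared_ptr<Element> ElementPtr;

class SimpleQueue {
public:
    void add(ElementPtr elem) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(elem);
    }

    // The lock is taken *before* empty(), so no other thread can remove the
    // last element between the check and back(). Returns an empty
    // ElementPtr when there is nothing to pop.
    ElementPtr pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        ElementPtr retElem;
        if (!queue_.empty()) {
            retElem = queue_.back();
            queue_.pop_back();
        }
        return retElem;
    }

private:
    std::mutex mutex_;
    std::vector<ElementPtr> queue_;
};

// Several threads race to pop a queue holding `elements` items;
// returns how many pops succeeded (never more than `elements`).
int count_successful_pops(int elements, int threads) {
    SimpleQueue q;
    for (int i = 0; i < elements; ++i)
        q.add(std::make_shared<Element>(Element{i}));
    std::mutex count_mutex;
    int popped = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            if (q.pop()) {
                std::lock_guard<std::mutex> lock(count_mutex);
                ++popped;
            }
        });
    }
    for (auto& w : workers) w.join();
    return popped;
}
```

With one element and eight threads, exactly one pop succeeds and the others receive an empty pointer instead of reading an invalid back().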

Ross Manges

27 Apr

12:45 a.m.

Thanks for getting back to me so quickly! My comments are embedded below.

...

You should probably redownload since the CVS contained some slightly broken versions for a short time. :-)

OK, thanks for the heads up.

...

I'm not seeing any violations of the rules in your code; the problem is probably more mundane:

I moved the mutex as suggested, and the program runs without incident now. However, I have more examples. I made 'retElem' a member variable in the class SimpleQueue. This probably (?) breaks the shared_ptr rules for exclusivity. Nonetheless, the program still runs OK. But I also tried a version of the program using CommonCPP threads and mutexes instead of Boost threads and mutexes, and it dies and dumps a core. The backtrace follows, and both versions of the code base are attached in bzip2 format. If you are so inclined, please review the code and let me know your thoughts. Thanks!!!

    (gdb) bt
    #0 0x080e3f1b in ?? ()
    #1 0x00000023 in ?? ()
    #2 0x0804a97d in boost::detail::shared_count::operator= (this=0xb60dddf8, r=@0xb60dddd8) at shared_count.hpp:182
    #3 0x08049a14 in boost::shared_ptr<Element>::operator= (this=0xb60dddf4, r=@0xb60dddd4) at shared_ptr.hpp:148
    #4 0x0804948b in CommonThread::run (this=0x80ec258) at ../CommonThread.cpp:14
    #5 0xb75ad7be in ost::ThreadImpl::ThreadExecHandler (th=0x80ec258) at thread.cpp:1110
    #6 0xb75ac9bf in ccxx_exec_handler (th=0x80ec258) at thread.cpp:1136
    #7 0xb754ae21 in pthread_start_thread () from /lib/i686/libpthread.so.0
    #8 0xb73f308a in clone () from /lib/i686/libc.so.6

Peter Dimov

10:39 a.m.

Ross Manges wrote:

...

...
I'm not seeing any violations of the rules in your code; the problem is probably more mundane:

I moved the mutex as suggested, and the program runs without incident now. However, I have more examples. I made 'retElem' a member variable in the class SimpleQueue. This probably (?) breaks the shared_ptr rules for exclusivity. Nonetheless, the program still runs OK. But I also tried a version of the program using CommonCPP threads and mutexes instead of Boost threads and mutexes, and it dies and dumps a core.

Your problems are getting more subtle. The Boost threads version is fine, because pop looks like this in pseudocode:

    lock
    write retElem
    return retElem
    implicit unlock

Since all accesses to retElem are protected, this doesn't violate the rules. But the CommonC++ version is:

    lock
    write retElem
    unlock
    return retElem

and as you can see, another thread may enter the critical region and write retElem in parallel with the return statement, which needs to make a copy of retElem and place it in the return value.
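If the explicit unlock-before-return style of the CommonC++ version has to stay, one fix is to copy the member into a local variable while the lock is still held and return the local. A minimal sketch with std::mutex's lock() and unlock() standing in for the CommonC++ mutex calls (their exact API is not shown in the thread); MemberQueue, retElem_ and sum_popped_ids are hypothetical names.

```cpp
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Element { int id; };                  // hypothetical payload type
typedef std::shared_ptr<Element> ElementPtr;

class MemberQueue {
public:
    void add(ElementPtr elem) {
        mutex_.lock();
        queue_.push_back(elem);
        mutex_.unlock();
    }

    ElementPtr pop() {
        mutex_.lock();
        retElem_.reset();
        if (!queue_.empty()) {
            retElem_ = queue_.back();   // write to the shared member
            queue_.pop_back();
        }
        // Copy the member while the lock is still held. Returning retElem_
        // after unlock() would read it while another thread may write it.
        ElementPtr result = retElem_;
        mutex_.unlock();
        return result;                  // returns the local, not the member
    }

private:
    // Explicit lock()/unlock() mirrors the CommonC++ style; std::lock_guard
    // would be the exception-safe choice in new code.
    std::mutex mutex_;
    std::vector<ElementPtr> queue_;
    ElementPtr retElem_;                // member, as in Ross's modified example
};

// Drains the queue from several threads; returns the sum of the ids popped.
long sum_popped_ids(int elements, int threads) {
    MemberQueue q;
    for (int i = 0; i < elements; ++i)
        q.add(std::make_shared<Element>(Element{i}));
    std::mutex sum_mutex;
    long sum = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            while (ElementPtr e = q.pop()) {
                std::lock_guard<std::mutex> lock(sum_mutex);
                sum += e->id;
            }
        });
    }
    for (auto& w : workers) w.join();
    return sum;
}
```

Every element is popped exactly once, so draining ids 0 through 99 yields their sum, 4950, regardless of how the threads interleave.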

Ross Manges

9:37 p.m.

...

and as you can see, another thread may enter the critical region and write retElem in parallel with the return statement, which needs to make a copy of retElem and place it in the return value.

Ahh ha! Thanks for the insight. That has helped me squash a few bugs. Now I do have one last remaining situation, and I have updated my example code to use shared_from_this() in a very precarious way. Is there some way to make this example work without changing how shared_from_this() is being used?

Also, I didn't mean to hijack this thread from the RoseRT topic. It just so happens that we're going to be running this code through Purify in the near future, so I'm curious whether the original poster of this thread has resolved his FIM errors.

Here's my current backtrace:

    (gdb) bt
    #0 0x0804ae63 in atomic_increment (pw=0x1d) at sp_counted_base_gcc_x86.hpp:59
    #1 0x0804ad4f in boost::detail::sp_counted_base::add_ref_copy (this=0x19) at sp_counted_base_gcc_x86.hpp:133
    #2 0x0804ae0d in shared_count (this=0xb15ef014, r=@0x80966b4) at shared_count.hpp:170
    #3 0x0804b8a0 in shared_ptr (this=0xb15ef010, _ctor_arg=@0x80966b0) at ../Element.cpp:20
    #4 0x0804c247 in _Construct<ElementPtr, boost::shared_ptr<Element> > (__p=0xb15ef010, __value=@0x80966b0) at stl_construct.h:78
    <...snip...>

Peter Dimov

29 Apr

2:48 p.m.

Ross Manges wrote:

...

Ahh ha! Thanks for the insight. That has helped me squash a few bugs. Now I do have one last remaining situation, and I have updated my example code to use shared_from_this() in a very precarious way.

Your problem is not related to shared_from_this at all:

    void SimpleQueue1::add(ElementPtr elem)
    {
        queue.push_back(elem);
    }

You just forgot to protect queue.push_back by locking the queue mutex.
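A minimal sketch of add() with the lock in place, again using C++11's std::mutex and std::lock_guard as stand-ins for the Boost or CommonC++ mutex; the member names, the size() accessor and the concurrent_adds helper are assumptions. A std::vector may reallocate during push_back and copy every ElementPtr it holds, so an unlocked push_back racing with another push_back or a pop can leave those copies reading freed memory. The stl_construct.h frame in Ross's last backtrace looks consistent with such an element copy, though that is an inference, not something confirmed in the thread.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct Element { int id; };                  // hypothetical payload type
typedef std::shared_ptr<Element> ElementPtr;

class SimpleQueue1 {
public:
    // The fix: hold the queue mutex for the whole push_back, because the
    // vector may reallocate and copy every ElementPtr while growing.
    void add(ElementPtr elem) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(elem);
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(mutex_);
        return queue_.size();
    }

private:
    std::mutex mutex_;
    std::vector<ElementPtr> queue_;
};

// Several producer threads add concurrently; returns the final queue size.
std::size_t concurrent_adds(int producers, int per_producer) {
    SimpleQueue1 q;
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&q, per_producer] {
            for (int i = 0; i < per_producer; ++i)
                q.add(std::make_shared<Element>(Element{i}));
        });
    }
    for (auto& t : threads) t.join();
    return q.size();
}
```

With the lock, four producers adding 1000 elements each always leave exactly 4000 elements; without it the count and the container's internal state are unpredictable.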

Ross Manges

5 May

6:05 p.m.

...

Your problem is not related to shared_from_this at all:

    void SimpleQueue1::add(ElementPtr elem)
    {
        queue.push_back(elem);
    }

You just forgot to protect queue.push_back by locking the queue mutex.

Just posting a last follow-up. We've finally got our code working well, and as it turns out, none of our shared_ptr problems were actually shared_ptr problems, as you might have guessed. Most of our problems came from vectors or maps that were not locked down appropriately with mutexes (like the example above). One serious bug that had us stumped for a long time involved the rules for the validity of an iterator into an STL vector. We were calling my_vector.erase(it) and then happily using 'it' in a loop, not realizing that 'it' had been invalidated by the erase. Just a simple oversight that cost us a lot of hair-pulling and teeth grinding. It made the debugging even more difficult because all of our backtraces pointed to the shared_ptr scoped_lock, which kept diverting us from the real cause.

Anyway, thanks for all of your help. The Boost community has been great!

--Ross
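The iterator bug Ross describes is a common one: std::vector::erase invalidates the erased iterator and every iterator after it, but it returns an iterator to the element that followed the erased one. A minimal sketch of the usual loop; the predicate (removing null pointers), the Element type and the demo_ids helper are hypothetical choices for illustration. The erase-remove idiom is an equivalent alternative.

```cpp
#include <memory>
#include <vector>

struct Element { int id; };                  // hypothetical payload type
typedef std::shared_ptr<Element> ElementPtr;

// Removes every null ElementPtr from the vector.
// Wrong: calling v.erase(it) and then ++it uses an invalidated iterator.
// Right: continue from the iterator that erase() returns.
void remove_null(std::vector<ElementPtr>& v) {
    for (std::vector<ElementPtr>::iterator it = v.begin(); it != v.end(); ) {
        if (!*it)
            it = v.erase(it);   // erase returns the next valid position
        else
            ++it;               // advance only when nothing was erased
    }
}

// Usage helper: builds {e0, null, null, e3, null}, removes the nulls and
// returns the remaining ids in order. The adjacent nulls are the case a
// naive "erase then ++it" loop would also get wrong.
std::vector<int> demo_ids() {
    std::vector<ElementPtr> v;
    v.push_back(std::make_shared<Element>(Element{0}));
    v.push_back(ElementPtr());
    v.push_back(ElementPtr());
    v.push_back(std::make_shared<Element>(Element{3}));
    v.push_back(ElementPtr());
    remove_null(v);
    std::vector<int> ids;
    for (const ElementPtr& e : v) ids.push_back(e->id);
    return ids;
}
```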

Caleb Epstein

26 Apr

9:52 p.m.

On 4/26/05, Ross Manges <ross.manges@ihmail.com> wrote:

...

...
Attaching the code (in whatever form is convenient to you) is fine.

OK, I've attached some sample code I made up to show the kind of problems I'm having (the code doesn't do anything useful). Running this code results in the backtrace below. I am aware that this code breaks the condition that the shared_ptr needs to have exclusive writing modifications; however, I am unsure how to fix the code to make it correct. Please let me know if you have any suggestions, thanks! (BTW, I'm running against the Boost CVS, which I downloaded on Fri, Apr 22nd; if I run against version 1_32, the backtrace is in the scoped_lock instead of atomic_increment.)

Some info about my system: Red Hat Enterprise Linux 3, running 2.4.21-4.ELsmp (on a dual-processor Dell with Xeons, if that matters)

...

In SimpleQueue::pop, you call Queue.empty() without locking the mutex. Move your scoped lock before the if (!Queue.empty ()) check and all is well. I've verified this on a 2-way Xeon box running an almost-identical kernel.

--
Caleb Epstein
caleb dot epstein at gmail dot com


participants (4)

Caleb Epstein
Dave.Ware＠seleniacomms.com
Peter Dimov
Ross Manges