[ASIO] random crashes

Hello,

I'm using the ASIO library (from boost 1.35) in a network daemon running on a Debian Etch system. The code structure is almost the same as the one described in the HTTP Server example (http://tenermerx.com/Asio/boost_asio_1_3_1/doc/html/boost_asio/examples.html...). One io_service is used, async_accept() creates new "sessions", and so on.

The daemon can run flawlessly for weeks (or only a few hours) and then crash randomly with a segmentation fault. The network load isn't really high: every 5 minutes, a few megabytes of data (characters) are sent to the daemon. I haven't noticed any memory leaks or null pointer accesses.

Among the many crashes, I found two kinds. Since cores are dumped, I tried to debug them, but I don't know how to interpret the result. The binary wasn't linked against debug libraries, so I can only get debug data from the binary itself. Below is the gdb output of the "bt full" command. The daemon has 4 threads; this is the output for the one that causes the crash (I assume it is this one).

First kind of crash (only the relevant part is pasted, the output is huge):

It seems to be related to the way I handle the timeout on a timer. In my daemon, the io_service thread may call the close() function on the socket object while another thread may call cancel() on the same socket object. Could this lead to a crash? (I can post source code if needed.) What is the best way to handle a receive timeout *and* closing the socket and timer from another thread?

Thread 1 (process 27120):
Program terminated with signal 11, Segmentation fault.
#0  0xb7c7c024 in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
No symbol table info available.
#1  0xb7d5f0c6 in pthread_mutex_lock () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
#2  0x08060438 in boost::asio::detail::posix_mutex::lock (this=0x16) at /usr/include/boost/asio/detail/posix_mutex.hpp:71
        error = 0
#3  0x08060559 in scoped_lock (this=0xb796807c, m=@0x16) at /usr/include/boost/asio/detail/scoped_lock.hpp:36
No locals.
#4  0x08060ad0 in boost::asio::detail::epoll_reactor<false>::close_descriptor (this=0x2, descriptor=-1291829448) at /usr/include/boost/asio/detail/epoll_reactor.hpp:297
        lock = {<boost::noncopyable_::noncopyable> = {<No data fields>}, mutex_ = @0x16, locked_ = 212}
        ev = {events = 6, data = {ptr = 0x25, fd = 37, u32 = 37, u64 = 8589934629}}
#5  0x0809b189 in boost::asio::detail::reactive_socket_service<boost::asio::ip::tcp, boost::asio::detail::epoll_reactor<false> >::close (this=0xb3004358, impl=@0xb3037f1c, ec=@0xb7968128) at /usr/include/boost/asio/detail/reactive_socket_service.hpp:210
No locals.
#6  0x0809b277 in boost::asio::stream_socket_service<boost::asio::ip::tcp>::close (this=0xb3000098, impl=@0xb3037f1c, ec=@0xb7968128) at /usr/include/boost/asio/stream_socket_service.hpp:145
No locals.
#7  0x0809b2c5 in boost::asio::basic_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >::close (this=0xb3037f18) at /usr/include/boost/asio/basic_socket.hpp:253
        ec = {m_val = 0, m_cat = 0xb7c74c10}
#8  0x08091129 in Session::handleTimeout (this=0xb3037f18, error=@0xb796821c) at Session.cc:80

Could someone explain this crash to me?

Second kind of crash (I really have no idea where this crash comes from; the backtrace isn't very useful):

(gdb) bt full
#0  boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>::unlock (this=0xb5631dac) at /usr/include/boost/asio/detail/scoped_lock.hpp:58
No locals.
#1  0x00000000 in ?? ()
No symbol table info available.

Could someone give me some advice about how to handle these crashes? How could I get more data in the backtrace?

Thanks in advance for your help.

Regards

Axel wrote:
In my daemon, the io_service thread may call the close() function on the socket object while another thread may call cancel() on the same socket object. Could this lead to a crash? (I can post source code if needed.) What is the best way to handle a receive timeout *and* closing the socket and timer from another thread?
Thread 1 (process 27120):
Program terminated with signal 11, Segmentation fault.
#0  0xb7c7c024 in pthread_mutex_lock () from /lib/tls/i686/cmov/libpthread.so.0
No symbol table info available.
#1  0xb7d5f0c6 in pthread_mutex_lock () from /lib/tls/i686/cmov/libc.so.6
No symbol table info available.
#2  0x08060438 in boost::asio::detail::posix_mutex::lock (this=0x16) at /usr/include/boost/asio/detail/posix_mutex.hpp:71
        error = 0
#3  0x08060559 in scoped_lock (this=0xb796807c, m=@0x16) at /usr/include/boost/asio/detail/scoped_lock.hpp:36
No locals.
#4  0x08060ad0 in boost::asio::detail::epoll_reactor<false>::close_descriptor (this=0x2, descriptor=-1291829448) at /usr/include/boost/asio/detail/epoll_reactor.hpp:297
        lock = {<boost::noncopyable_::noncopyable> = {<No data fields>}, mutex_ = @0x16, locked_ = 212}
        ev = {events = 6, data = {ptr = 0x25, fd = 37, u32 = 37, u64 = 8589934629}}
#5  0x0809b189 in boost::asio::detail::reactive_socket_service<boost::asio::ip::tcp, boost::asio::detail::epoll_reactor<false> >::close (this=0xb3004358, impl=@0xb3037f1c, ec=@0xb7968128) at /usr/include/boost/asio/detail/reactive_socket_service.hpp:210
No locals.
#6  0x0809b277 in boost::asio::stream_socket_service<boost::asio::ip::tcp>::close (this=0xb3000098, impl=@0xb3037f1c, ec=@0xb7968128) at /usr/include/boost/asio/stream_socket_service.hpp:145
No locals.
#7  0x0809b2c5 in boost::asio::basic_socket<boost::asio::ip::tcp, boost::asio::stream_socket_service<boost::asio::ip::tcp> >::close (this=0xb3037f18) at /usr/include/boost/asio/basic_socket.hpp:253
        ec = {m_val = 0, m_cat = 0xb7c74c10}
#8  0x08091129 in Session::handleTimeout (this=0xb3037f18, error=@0xb796821c) at Session.cc:80
Could someone explain this crash to me?
From my experience, this is likely due to trying to invoke a member function on a lock that is already destroyed. You probably have a race condition where you are doing something on the ASIO socket after it's already been destroyed.
Second kind of crash (I really have no idea where this crash comes from; the backtrace isn't very useful):
(gdb) bt full
#0  boost::asio::detail::scoped_lock<boost::asio::detail::posix_mutex>::unlock (this=0xb5631dac) at /usr/include/boost/asio/detail/scoped_lock.hpp:58
No locals.
#1  0x00000000 in ?? ()
No symbol table info available.
Could someone give me some advice about how to handle these crashes? How could I get more data in the backtrace?
This is usually caused by a thread that trashes its stack. That can also be caused by accessing an object that is already destroyed.

--
Jon Biggar
jon@biggar.org
jon@floorboard.com

Jon Biggar wrote:
From my experience, this is likely due to trying to invoke a member function on a lock that is already destroyed. You probably have a race condition where you are doing something on the ASIO socket after it's already been destroyed.
The only thread that can destroy the instance of the object (the asio socket instance is a member of this object) is the thread running the io_service.

The destroy instruction is in a receive error handler: I cancel the timer, close the socket and destroy the instance. Could the timeout handler (which seems to cause the crash) be executed after the destruction of the object?

Maybe my design is bad. What would be a proper way to handle a socket timeout and an explicit socket close?

My design looks like:

explicit_close() {
    timer.cancel();    // triggers handle_timeout
    socket.cancel();
}

handle_timeout() {
    socket.close();    // this triggers the handle_receive() function
}

handle_receive() {
    if (!error) {
    } else {
        timer.cancel();
        socket.close();
        delete this;
    }
}
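The question above (can the timeout handler run after the object is destroyed?) can be explored with a small sketch. It uses only the standard library: ToyLoop, ToySession and run_lifetime_demo are hypothetical stand-ins, not Asio types. It assumes, as in Asio, that cancelling a timer does not discard the pending handler but lets it run later with an "aborted" status (boost::asio::error::operation_aborted in the real library). Under that assumption, "delete this" in the receive handler would leave a queued timeout handler pointing at a destroyed object. The sketch instead lets every queued handler hold a shared_ptr to the session (boost::enable_shared_from_this would play that role in Boost 1.35), so the session is destroyed only after the last handler has run.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for io_service: handlers are queued and run later.
struct ToyLoop {
    std::deque<std::function<void()>> ready;
    void post(std::function<void()> h) {
        assert(h);  // reject empty handlers
        ready.push_back(std::move(h));
    }
    void run() {
        while (!ready.empty()) {
            auto h = std::move(ready.front());
            ready.pop_front();
            h();  // h (and any shared_ptr it holds) is released after this iteration
        }
    }
};

// Hypothetical session; the log records events so the order can be inspected.
struct ToySession : std::enable_shared_from_this<ToySession> {
    ToyLoop& loop;
    std::vector<std::string>& log;
    bool socket_open = true;

    ToySession(ToyLoop& l, std::vector<std::string>& lg) : loop(l), log(lg) {}
    ~ToySession() { log.push_back("destroyed"); }

    // Models timer.cancel(): the pending wait completes later, flagged as aborted.
    void cancel_timer() {
        auto self = shared_from_this();  // keeps the session alive for the handler
        loop.post([self] { self->handle_timeout(true); });
    }

    void handle_timeout(bool aborted) {
        if (aborted) {  // cancelled wait: do not touch the socket
            log.push_back("timeout aborted");
            return;
        }
        socket_open = false;
        log.push_back("timeout closed socket");
    }

    void handle_receive_error() {
        cancel_timer();
        socket_open = false;
        log.push_back("receive error");
        // No "delete this": the last shared_ptr going away destroys the session.
    }
};

std::vector<std::string> run_lifetime_demo() {
    std::vector<std::string> log;
    ToyLoop loop;
    {
        auto s = std::make_shared<ToySession>(loop, log);
        loop.post([s] { s->handle_receive_error(); });
    }  // the local owner is gone; only queued handlers own the session now
    loop.run();
    return log;
}
```

In this sketch the destructor runs only after the aborted timeout handler has finished, which is the ordering the "delete this" design cannot guarantee.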

Axel wrote:
Jon Biggar wrote:
From my experience, this is likely due to trying to invoke a member function on a lock that is already destroyed. You probably have a race condition where you are doing something on the ASIO socket after it's already been destroyed.
The only thread that can destroy the instance of the object (the asio socket instance is a member of this object) is the thread running the io_service.
The destroy instruction is in a receive error handler: I cancel the timer, close the socket and destroy the instance. Could the timeout handler (which seems to cause the crash) be executed after the destruction of the object?
Maybe my design is bad. What would be a proper way to handle a socket timeout and an explicit socket close?
My design looks like:
explicit_close() {
    timer.cancel();    // triggers handle_timeout
    socket.cancel();
}

I recall that deadline_timer.cancel() is not thread safe. If explicit_close() is called from a thread other than the io_service thread, then I believe there is a race condition if the timer is active while being cancelled. I may be wrong, as I am a little rusty on Boost.Asio.

handle_timeout() {
    socket.close();    // this triggers the handle_receive() function
}

handle_receive() {
    if (!error) {
    } else {
        timer.cancel();
        socket.close();
        delete this;
    }
}

HTH
Bill Somerville
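One common way to avoid the cross-thread race described above is to never touch the timer or socket from the other thread, and instead hand the close request to the thread that runs the handlers. The sketch below illustrates that idea with the standard library only; ToyLoop, ToySession and run_close_demo are hypothetical names, and ToyLoop::post() merely imitates what io_service::post() is assumed to offer in Asio (a thread-safe way to queue a function for the io_service thread).

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for io_service: post() may be called from any thread,
// handlers execute only on the thread that calls run().
class ToyLoop {
public:
    void post(std::function<void()> h) {
        assert(h);  // reject empty handlers
        std::lock_guard<std::mutex> lock(mutex_);
        ready_.push_back(std::move(h));
    }

    // Runs queued handlers on the calling thread until the queue is empty.
    void run() {
        for (;;) {
            std::function<void()> h;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                if (ready_.empty()) return;
                h = std::move(ready_.front());
                ready_.pop_front();
            }
            h();  // invoked without holding the lock
        }
    }

private:
    std::mutex mutex_;
    std::deque<std::function<void()>> ready_;
};

// Hypothetical session: the two flags model the timer and the socket.
struct ToySession {
    explicit ToySession(ToyLoop& l) : loop(l) {}

    ToyLoop& loop;
    bool timer_armed = true;
    bool socket_open = true;
    std::thread::id closed_on;  // records which thread did the close

    // Safe from any thread: it only queues the request.
    // Capturing `this` assumes the session outlives the queued handler.
    void explicit_close() {
        loop.post([this] { do_close(); });
    }

    // Runs on the loop thread only, so timer and socket see a single thread.
    void do_close() {
        timer_armed = false;
        socket_open = false;
        closed_on = std::this_thread::get_id();
    }
};

bool run_close_demo() {
    ToyLoop loop;
    ToySession session(loop);
    std::thread other([&session] { session.explicit_close(); });
    other.join();  // the close request is now queued
    loop.run();    // this thread plays the role of the io_service thread
    return !session.socket_open && !session.timer_armed &&
           session.closed_on == std::this_thread::get_id();
}
```

The other thread only enqueues work; the actual cancel and close happen on the loop thread, so no lock around the session members is needed in this sketch.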
participants (3): Axel, Bill Somerville, Jon Biggar