asio stability and scalability
Hello,

The project I am involved with required porting a Windows-only networking application that was using the WSAAsyncSelect mechanism to GNU/Linux, while remaining Windows compatible. We chose Boost asio as the framework to use. After spending time with asio and understanding it, the port was relatively straightforward. It basically works, both on Windows and on GNU/Linux.

However, on both systems we are now running into stack corruptions. We have spent quite a few weeks now trying to track this down, mostly on Windows, as we have access to a great many tools on that platform. We have verified that all underlying Win32 (and Winsock) function calls are valid (all parameters check out), that our heap is not corrupt (all memory allocations and deallocations are clean), and we have eliminated all threading issues by restructuring the code to be runnable on a single io_service::run() thread (and Intel's Thread Checker found no issues either). We have done many other verifications as well, and everything comes out clean.

At this point we have to strongly consider the possibility that it is asio itself that is causing this error, though we are quite at a loss to explain how asio could result in this behavior. So we are wondering: are there known large production systems out there that use asio? Are there high-volume web servers, for instance, written with asio?

I came across this comment: http://dobbscodetalk.com/index.php?option=com_content&task=view&id=1700&Itemid=85 which, if true, would also imply that asio is perhaps not appropriate for high-throughput applications, at least on GNU/Linux systems.

My basic question is: Do people have success stories using asio in production applications? Have people used other frameworks, for instance ACE, and can comment on how they compare? Have others run into random crashes and given up on using asio?

Thanks for any feedback,
-Eric
"Eric Twietmeyer"
My basic question is: Do people have success stories using asio in production applications? Have people used other frameworks, for instance ACE, and can comment on how they compare? Have others run into random crashes and given up on using asio?
We have had good success with asio in our application. We have tested it under long-term, low-volume production use and also short-term, high-volume load testing, and it holds up well. We have not run into any random crashes. -----Scott.
We use asio in a Windows-based project, which involves a lot of networking, including tens or even hundreds of simultaneous data streams, and it works well. However, as Marat said, with asynchronous design one should be very careful about the lifetimes of the involved objects, because any flaw can result in various kinds of memory/stack corruption. That's why asio-based designs usually make extensive use of shared_ptr and the shared_from_this idiom, instead of attempting to manage objects' lifetimes manually.
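To make the shared_from_this idiom concrete, here is a minimal sketch. It deliberately does not use asio: ToyLoop, Session, and demo_session_outlives_caller are hypothetical names invented for this illustration, and ToyLoop is only a stand-in for an event loop such as io_service. The point it tries to show is that the handler holds a shared_ptr to the object whose member buffer the pending operation writes into, so the object cannot be destroyed while the operation is outstanding.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>

// Hypothetical stand-in for an event loop: it only stores completion
// handlers and runs them later.
class ToyLoop {
public:
    void post(std::function<void()> handler) { pending_.push(std::move(handler)); }

    // Run every queued handler; returns how many were executed.
    std::size_t run() {
        std::size_t count = 0;
        while (!pending_.empty()) {
            std::function<void()> handler = std::move(pending_.front());
            pending_.pop();
            handler();
            ++count;
        }
        return count;
    }

private:
    std::queue<std::function<void()>> pending_;
};

// Hypothetical session type. The receive buffer is a member, so it lives
// exactly as long as the session object itself.
class Session : public std::enable_shared_from_this<Session> {
public:
    Session() : buffer_(16, '\0') {}

    // Start a fake asynchronous receive. The handler captures a shared_ptr
    // to this session, so the session (and buffer_) cannot be destroyed
    // before the handler has run.
    void start_receive(ToyLoop& loop, std::string& result) {
        std::shared_ptr<Session> self = shared_from_this();
        loop.post([self, &result] {
            self->buffer_.assign("datagram");  // the "I/O" writes into the member buffer
            result = self->buffer_;
        });
    }

private:
    std::string buffer_;
};

// Creates a session, drops the caller's only reference while the operation
// is still pending, then runs the loop.
std::string demo_session_outlives_caller() {
    ToyLoop loop;
    std::string result;
    {
        std::shared_ptr<Session> session = std::make_shared<Session>();
        session->start_receive(loop, result);
    }  // caller's shared_ptr is gone; the queued handler still owns the session
    const std::size_t executed = loop.run();
    assert(executed == 1);
    return result;
}
```

Calling demo_session_outlives_caller() from main() exercises the sketch. If the lambda captured a raw `this` instead of `self`, the session would already be destroyed when the handler ran, and the write into buffer_ would land in freed memory, which is the kind of flaw described above.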
"Igor R"
We use asio in a Windows-based project, which involves a lot of networking, including tens or even hundreds of simultaneous data streams, and it works well. However, as Marat said, with asynchronous design one should be very careful about the lifetimes of the involved objects, because any flaw can result in various kinds of memory/stack corruption. That's why asio-based designs usually make extensive use of shared_ptr and the shared_from_this idiom, instead of attempting to manage objects' lifetimes manually.
I understand completely. I have written quite a bit of IOCP-based asynchronous networking code in the past, but directly, not using asio. However, all of my previous experience was with TCP. This current project requires UDP as well as TCP, and it seems somehow to be the UDP communication in particular that is initiating the problem (we can sort of remove the UDP portion and then this problem does not occur). We have of course gone over this with a fine-tooth comb, and I did use the shared_ptr and shared_from_this idiom everywhere.

Very perversely, the stack corruption is occurring "out of line", as it were. Using VMware's Replay Debugging we have witnessed in each case that the memory corruption occurs while the user code is doing something completely innocuous (for instance reading memory), or during an internal call to sysenter in the guts of one of the WSARecv/Send functions (where this function is known to have valid inputs and completes successfully with valid output).

Speaking with a Microsoft Support Tech, the only explanation we can come up with is that an IO interrupt is occurring between user space instructions and this IO interrupt processing is somehow trashing user space memory. The MS guy indicates that this can happen (buggy drivers, etc.), but he has never seen it manifest in this way. But as we have seen this occur on several different Windows OS boxes (with various OS versions and various hardware), it seems unlikely to be a driver issue. Also, this same behavior has been witnessed when asio is compiled with BOOST_ASIO_DISABLE_IOCP.

After these several weeks of intense debugging efforts we are basically at a loss. I'm very reluctant to move from asio to ACE or some other framework; I like the way asio is structured, and it would take quite a bit of time to reimplement this.

Thanks for the input.
-Eric
This current project requires UDP as well as TCP, and it seems somehow to be the UDP communication in particular that is initiating the problem (we can sort of remove the UDP portion and then this problem does not occur).
Wow, this is really strange... Maybe the author of ASIO will have some ideas: http://blog.think-async.com/ http://sourceforge.net/mail/?group_id=122478
On Mon, Oct 19, 2009 at 2:51 PM, Igor R
This current project requires UDP as well as TCP, and it seems somehow to be the UDP communication in particular that is initiating the problem (we can sort of remove the UDP portion and then this problem does not occur).
Wow, this is really strange...
Maybe the author of ASIO will have some ideas: http://blog.think-async.com/ http://sourceforge.net/mail/?group_id=122478
For note, I have also been using ASIO for a very active server and I have had no such issues either.
FWIW, when I have a situation this frustrating, I start by suppressing the UDP portion so the program works (sort of). Then I slowly add parts of that code back in until it starts to fail. It's crude and tedious, but sometimes it's been the only way.
Robert Ramey
OvermindDL1 wrote:
On Mon, Oct 19, 2009 at 2:51 PM, Igor R wrote:
This current project requires UDP as well as TCP, and it seems somehow to be the UDP communication in particular that is initiating the problem (we can sort of remove the UDP portion and then this problem does not occur).
Wow, this is really strange...
Maybe the author of ASIO will have some ideas: http://blog.think-async.com/ http://sourceforge.net/mail/?group_id=122478
For note, I have also been using ASIO for a very active server and I have had no such issues either.
On Mon, 19 Oct 2009 22:37:41 +0200, Eric Twietmeyer
[...]After these several weeks of intense debugging efforts we are basically at a loss. I'm very reluctant to move from asio to ACE or some other framework, I like the way asio is structured, and it will take quite a bit of time to reimplement this.
No idea if this helps, but: do you call any asynchronous I/O operation before main(), e.g. in the constructor of a global object? I had to replace the Boost.Asio placeholders with _1 and _2 from Boost.Bind when building a program with VC++ 2008, as the Boost.Asio placeholders crashed the program every time the handler was called. There was no problem with g++ though.
Boris
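For readers unfamiliar with the substitution described above: Boost.Asio offers named placeholders (boost::asio::placeholders::error and boost::asio::placeholders::bytes_transferred), while Boost.Bind offers positional ones (_1, _2). The sketch below shows only the positional style, and as an assumption it uses std::bind and std::placeholders as stand-ins so that it compiles without Boost. Receiver and demo_positional_placeholders are hypothetical names. It does not reproduce or explain the VC++ 2008 crash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <system_error>

// Hypothetical receiver whose completion handler has the usual
// (error, bytes_transferred) shape.
struct Receiver {
    std::size_t total_bytes = 0;
    int errors = 0;

    void on_receive(const std::error_code& ec, std::size_t bytes) {
        if (ec) {
            ++errors;
            return;
        }
        total_bytes += bytes;
    }
};

// Binds the member function with positional placeholders and invokes the
// result twice, the way an event loop would on completion.
std::size_t demo_positional_placeholders() {
    using namespace std::placeholders;  // _1, _2 (std stand-ins for Boost.Bind's)
    Receiver receiver;
    std::function<void(const std::error_code&, std::size_t)> handler =
        std::bind(&Receiver::on_receive, &receiver, _1, _2);

    handler(std::error_code(), 512);                                // success path
    handler(std::make_error_code(std::errc::connection_reset), 0);  // error path
    assert(receiver.errors == 1);
    return receiver.total_bytes;
}
```

Here _1 receives the error code and _2 the byte count, purely by argument position. Note that the bound handler stores a raw pointer to receiver, so in real asynchronous code the receiver would need to outlive the pending operation.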
On Mon, Oct 19, 2009 at 2:37 PM, Eric Twietmeyer
This current project requires UDP as well as TCP, and it seems somehow to be the UDP communication in particular that is initiating the problem (we can sort of remove the UDP portion and then this problem does not occur).
FWIW, we have an application that runs on XP which uses ASIO with both UDP and TCP. We talk to a third-party server that uses UDP broadcasts to tell us what TCP server endpoints (multiple) to connect to. We then connect via TCP, and do fairly high-throughput data transfer (gig-e speeds) using async_read(). The app is constantly receiving UDP broadcasts, via async_receive_from(). I also wrote a simulator for the third-party server that uses ASIO to do non-blocking UDP broadcasts, asynchronously handle TCP connects, and synchronous TCP writes. We have experienced no issues using both TCP and UDP with ASIO within the same app framework. Jon
"Jonathan Franklin"
On Mon, Oct 19, 2009 at 2:37 PM, Eric Twietmeyer wrote:
This current project requires UDP as well as TCP, and it seems somehow to be the UDP communication in particular that is initiating the problem (we can sort of remove the UDP portion and then this problem does not occur).
FWIW, we have an application that runs on XP which uses ASIO with both UDP and TCP. We talk to a third-party server that uses UDP broadcasts to tell us what TCP server endpoints (multiple) to connect to. We then connect via TCP, and do fairly high-throughput data transfer (gig-e speeds) using async_read(). The app is constantly receiving UDP broadcasts, via async_receive_from().
I also wrote a simulator for the third-party server that uses ASIO to do non-blocking UDP broadcasts, asynchronously handle TCP connects, and synchronous TCP writes.
We have experienced no issues using both TCP and UDP with ASIO within the same app framework.
Jon
Thanks to everyone for the feedback. That gives me much more confidence that this really should be working.

In our case the process connects to a "client" via TCP and a "server" via UDP and acts basically as a data cache. So long as the end client keeps requesting data that has not been cached, the UDP pipe will be used to fetch more data from the server, which is then passed over TCP to the client. There is therefore high throughput on both the TCP and UDP sides of our application. If the memory corruption does not occur quickly (as is sometimes the case), then it is possible to simply feed back data via TCP that has been cached and then the application seems stable. This is why I said that it seems to be UDP that is causing the issue, as it only occurs when the system is run in such a manner that the client keeps needing new, uncached data.
On Wed, Oct 21, 2009 at 12:51 PM, Eric Twietmeyer
application. If the memory corruption does not occur quickly (as is sometimes the case), then it is possible to simply feed back data via TCP that has been cached and then the application seems stable. This is why I said that it seems to be UDP that is causing the issue as it only occurs when the system is run in such a manner that the client keeps needing new, uncached, data.
Sounds like the UDP and TCP handling mechanisms may be walking all over each other. You may need to look at your shared resources, and figure out how/when they're being accessed by the two mechanisms. Jon
participants (7)
- Boris Schaeling
- Eric Twietmeyer
- Igor R
- Jonathan Franklin
- OvermindDL1
- Robert Ramey
- Scott Gifford