
I vote to *not* accept Asio as a Boost library. This is based on some specific deficiencies in design/implementation and documentation. Hence if those are remedied I would be willing to change my vote.

What is your evaluation of the design?

Fair -- The overall separation of responsibilities is good and provides for considerable reuse, and possibly independent use, of many of the facilities. Unfortunately many of the interactions and dependencies between the parts are not clear. In many situations I had to resort to reading through the implementation to figure out the various connections. Even though I can see how I would use the library for some tasks, there are others where I just don't see how something could be implemented in an efficient manner. (More below on my specific use case.)

What is your evaluation of the implementation?

Good -- For the given design the implementation is clear and easy to follow. But because of some particular design choices there are serious performance deficiencies which I would think prevent its use in the one area where the library was ostensibly designed to be used: high performance servers.

What is your evaluation of the documentation?

Insufficient -- Even though I'm a proponent of Dox-style automated documentation, in this case the Dox reference documentation is insufficient in describing how the various operations work. The supporting design notes, tutorials, and examples don't, I feel, provide sufficient background to understand how the pieces connect and interact with each other. Specifically, the example code should at minimum explain why calls are made. In summary, either the reference docs need to improve significantly, or the non-reference docs need to take up the slack.

What is your evaluation of the potential usefulness of the library?

Good -- Even with my concerns about efficiency I can see its use in the wider connection-based client/server domain where response efficiency is a minor concern.

Did you try to use the library?

No.

How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

In-depth -- I read a good portion of the documentation, and various parts of the implementation, with a particular use case in mind.

Are you knowledgeable about the problem domain?

Yes. I have done some form of network programming for the past 10 years, in the form of HTTP handlers, multiplayer games, and most recently a client/server version control system.

Now some details about my use case and my performance concern...

I am looking at Asio to provide the socket handling for one of my current projects, the version control system. I'm in the middle of rewriting its networking from the classic thread-per-client model to async endpoints and routers in a star-to-star virtual LAN topology, with a custom reliable message protocol based on UDP. This means that since I'm writing my own router I need the best performance I can get. To give some minimal concrete numbers: given a single gigabit line (a minimal deployment of the server is likely to have multiple lines) and the most favorable situation of handling full 64K UDP packets, one would need to handle about 1500 messages a second (or 2/3 of a millisecond per message). But in practice it would have to be faster than that. This means minimizing the most expensive steps, i.e. memory allocations.

Currently, AFAICT, in order to use the async operations asio requires a model which allocates a new handler for each message received. This might be fine for many situations where those handlers change from message to message. But in my use case I have only one handler that needs to get called for *any* message that comes in. With asio, for each message I receive it would: remove the handler from the demuxer map, call the handler (which does my custom parsing and routing), and push a new async_receive of itself (which creates a new handler object and inserts it again into the demuxer map). This is clearly suboptimal, and will result in considerable performance degradation. For some concrete code, one can look at the Daytime.6 tutorial, which does basically that procedure.

Of course it would be awesome if Christopher would come out and tell me I'm stoned and point out how to achieve the optimum I need. Barring that, for me to use Asio for my project I would end up creating patched versions of epoll_reactor, kqueue_reactor, reactor_op_queue, reactive_socket_service, and win_iocp_socket_service (maybe more). Such hacking of implementation classes to me means a serious design failure. Here's to hoping I'm wrong ;-)

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim - Grafik/jabber.org
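P.S. To make the chain concrete, the pattern I'm describing looks roughly like this (a sketch modeled on the Daytime.6 tutorial, untested; router, route_message, buffer_ and max_message_size are stand-in names for my own types):

    // Self-reposting receive chain: every completion builds a brand new
    // handler object and re-inserts it into the demuxer's map.
    void router::handle_receive(const boost::asio::error& error,
        std::size_t bytes_transferred)
    {
      if (!error || error == boost::asio::error::message_size)
      {
        route_message(buffer_, bytes_transferred); // custom parsing/routing

        // Re-arm: boost::bind creates a fresh function object here, which
        // async_receive copies and queues all over again.
        socket_.async_receive(
            boost::asio::buffer(buffer_, max_message_size), 0,
            boost::bind(&router::handle_receive, this,
              boost::asio::placeholders::error,
              boost::asio::placeholders::bytes_transferred));
      }
    }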

Currently, AFAICT, in order to use the async operations asio requires a model which allocates a new handler for each message received. <snip> For some concrete code, one can look at the Daytime.6 tutorial, which does basically that procedure.
In practice, I tried out asio on Windows using the iocp implementation with the "same handler every time" pattern of usage, and was quite happy with CPU utilization on a 100 Mbit/s peak multicast stream (1000-1500 bytes/message).

I expect that there may be some gains to be had, especially in the epoll/kqueue based reactor implementations, by allowing a user to "preregister" a handler. Maybe there could be overloads of the async operations that accept a pre-bound handler? However, as I'm sure you know, there will always be some per-I/O-operation structures involved, like the buffer, etc. It's pretty common on IOCP servers to post several async reads on the same socket, to avoid queueing and extra copying.

Regards, Dave
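P.S. By posting several reads I mean something along these lines (a sketch only, untested; buffers_, num_outstanding and handle_receive are assumed members):

    // Post several outstanding reads on one socket up front; each
    // completion handler then re-arms its own slot.
    for (std::size_t i = 0; i < num_outstanding; ++i)
    {
      socket_.async_receive(
          boost::asio::buffer(buffers_[i], max_message_size), 0,
          boost::bind(&server::handle_receive, this, i,
            boost::asio::placeholders::error,
            boost::asio::placeholders::bytes_transferred));
    }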

Hi Rene, --- Rene Rivera <grafik.list@redshift-software.com> wrote:
I vote to *not* accept Asio as a Boost library. This is based on some specific deficiencies in design/implementation and documentation. Hence if those are remedied I would be willing to change my vote.
I love a challenge :) <snip>
What is your evaluation of the documentation?
Insufficient -- Even though I'm a proponent of Dox-style automated documentation, in this case the Dox reference documentation is insufficient in describing how the various operations work. The supporting design notes, tutorials, and examples don't, I feel, provide sufficient background to understand how the pieces connect and interact with each other. Specifically, the example code should at minimum explain why calls are made. In summary, either the reference docs need to improve significantly, or the non-reference docs need to take up the slack.
Fair enough. I totally agree that the documentation can be improved, although naturally everyone has a different opinion on what bit of the documentation has the highest priority. It's an ongoing task for me. What else can I say?
Now some details about my use case and my performance concern... <snip> Of course it would be awesome if Christopher would come out and tell me I'm stoned and point out how to achieve the optimum I need. Barring that, for me to use Asio for my project I would end up creating patched versions of epoll_reactor, kqueue_reactor, reactor_op_queue, reactive_socket_service, and win_iocp_socket_service (maybe more). Such hacking of implementation classes to me means a serious design failure.
Yep, custom allocation is something I want to address, and others here (notably Christopher Baus) are taking an active enough interest in it to explore what can be done. However, thanks to the description of your use case, I have had an idea for how to enable what I think it is you want reasonably easily. I'm sure you'll correct me if it's not :)

It requires only a very minor interface change: the addition of a class template which I shall call handler_allocator. All 'new' calls that are associated with a handler in some way will have to be changed to use it. Here is the default implementation, which as you can see simply forwards the calls to the supplied standard allocator:

namespace boost { namespace asio {

template <typename Handler>
class handler_allocator
{
public:
  template <typename Allocator>
  static typename Allocator::pointer allocate(
      Handler& handler, Allocator& allocator,
      typename Allocator::size_type count)
  {
    return allocator.allocate(count);
  }

  template <typename Allocator>
  static void deallocate(
      Handler& handler, Allocator& allocator,
      typename Allocator::pointer pointer,
      typename Allocator::size_type count)
  {
    allocator.deallocate(pointer, count);
  }
};

} } // namespaces

This class can be specialised for specific handler types. Plus I will make the following guarantee: all memory associated with a handler will be deallocated before the handler is invoked. This means that you can reuse the same memory block for a chain of calls (such as the chain of async_receive calls in your case).

This class will also have to be partially specialised for the handlers for composed operations inside asio (such as async_read and async_write) to ensure that the allocate() and deallocate() calls are forwarded to the correct specialisation for the user's own handler.

I did a proof of concept of this for the demuxer::post() operation on Windows using MSVC 7.1. The test program is shown below. In this test, avoiding dynamic memory allocation gives approximately a 15% performance improvement. The program also shows how the handler parameter to the allocate/deallocate functions can be used to get hold of per-handler memory, if you need that level of control.

Cheers, Chris

-------------------------------------------------------------

#include <asio.hpp>
#include <iostream>

class handler;

namespace boost { namespace asio {
template <> class handler_allocator<::handler>;
} } // namespaces

class handler
{
public:
  handler(boost::asio::demuxer& demuxer, int& count, void* buffer)
    : demuxer_(demuxer), count_(count), buffer_(buffer)
  {
  }

  void operator()()
  {
    if (count_ > 0)
    {
      --count_;
      demuxer_.post(*this);
    }
  }

private:
  boost::asio::demuxer& demuxer_;
  int& count_;
  void* buffer_;

  friend class boost::asio::handler_allocator<handler>;
};

template <>
class boost::asio::handler_allocator<::handler>
{
public:
  template <typename Allocator>
  static typename Allocator::pointer allocate(
      handler& h, Allocator& allocator,
      typename Allocator::size_type count)
  {
    return static_cast<typename Allocator::pointer>(h.buffer_);
  }

  template <typename Allocator>
  static void deallocate(
      handler& h, Allocator& allocator,
      typename Allocator::pointer pointer,
      typename Allocator::size_type count)
  {
  }
};

int main()
{
  boost::asio::demuxer d;

  DWORD start = GetTickCount();

  int count = 1000000;
  char buf[1024];
  d.post(handler(d, count, buf));
  d.run();

  DWORD end = GetTickCount();
  std::cout << (end - start) << " ticks\n";

  return 0;
}

On 12/20/05 7:45 AM, "Christopher Kohlhoff" <chris@kohlhoff.com> wrote:
namespace boost { namespace asio { template <> class handler_allocator<::handler>; } } // namespaces
I hope that you're paraphrasing how you included the template argument. You would need a space, like so: "< ::handler>". That's because "<:" is a digraph for the opening square bracket ("["). If that code is not off-the-cuff, then your compiler has some issues with alternative token resolution (or can turn digraphs off). (Why yes, I have complained about this a bunch of times. And yes, I did get bit by this bug once.)

--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
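P.S. The two spellings side by side (illustrative only):

    template <> class handler_allocator<::handler>;  // "<:" is the digraph for
                                                     // "[", so this parses as
                                                     // handler_allocator[ :handler>
    template <> class handler_allocator< ::handler>; // the space keeps "<" and
                                                     // "::" as separate tokens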

Hi Daryle, --- Daryle Walker <darylew@hotmail.com> wrote:
I hope that you're paraphrasing how you included the template argument. You would need a space, like so: "< ::handler>". That's because "<:" is a digraph for the opening square bracket ("["). If that code is not off-the-cuff, then your compiler has some issues with alternative token resolution (or can turn digraphs off).
Lovely. I didn't notice this while hacking out the code. And yes, my compiler (MSVC 8.0) did let this one go through to the keeper. Cheers, Chris

Rene Rivera wrote: [...]
This means that since I'm writing my own router I need the best performance I can get. To give some minimal concrete numbers: given a single gigabit line (a minimal deployment of the server is likely to have multiple lines) and the most favorable situation of handling full 64K UDP packets, one would need to handle about 1500 messages a second (or 2/3 of a millisecond per message). But in practice it would have to be faster than that. This means minimizing the most expensive steps, i.e. memory allocations.
Currently, AFAICT, in order to use the async operations asio requires a model which allocates a new handler for each message received. This might be fine for many situations where those handlers change from message to message. But in my use case I have only one handler that needs to get called for *any* message that comes in. With asio, for each message I receive it would: remove the handler from the demuxer map, call the handler (which does my custom parsing and routing), and push a new async_receive of itself (which creates a new handler object and inserts it again into the demuxer map). This is clearly suboptimal, and will result in considerable performance degradation. For some concrete code, one can look at the Daytime.6 tutorial, which does basically that procedure.
Two things come to mind... 1. What inefficiencies are inherent to the design, and what are simply an implementation detail, and 2.
Did you try to use the library?
No.
It'd probably be helpful if you create a simple throughput test and post the results so that there is a clear target for Asio to match or exceed. This will also validate your objections as they currently seem to be based on intuition. :-)

Peter Dimov wrote:
Rene Rivera wrote: [...] Two things come to mind...
1. What inefficiencies are inherent to the design, and what are simply an implementation detail, and
Good question, and I'm not sure I can currently discern which part is a design problem and which an implementation problem, as the two seem intertwined. From what I understand, the design would allow for writing a custom demuxer/demuxer_service that optimized the dispatching. But doing that would prevent the reuse of the various implementation services (epoll, iocp, kqueue, select), as they are an internal detail of the default demuxer_service. So I think it's a design problem that this particular core functionality can't be reused, or customized, to perform differently.
2.
Did you try to use the library?
No.
It'd probably be helpful if you create a simple throughput test and post the results so that there is a clear target for Asio to match or exceed. This will also validate your objections as they currently seem to be based on intuition. :-)
Working on it... But it's not an intuition-based objection; it's an experiential one. In my experience memory allocations are the biggest bottleneck in algorithms, as they insert, at best, log N steps inside of inner loops.

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim - Grafik/jabber.org

Rene Rivera wrote:
Peter Dimov wrote:
Rene Rivera wrote:
Did you try to use the library?
No.
It'd probably be helpful if you create a simple throughput test and post the results so that there is a clear target for Asio to match or exceed. This will also validate your objections as they currently seem to be based on intuition. :-)
Working on it...
OK, I have now tried it. Two problems I ran into: compile errors with CW-8.3, and compile errors if any windows headers are included before the asio headers.

The test is a single-process, multi-thread, uni-directional UDP timing of 100,000 messages of 1K size on the local loop (127.0.0.1). The small message size and the local loop are there to flood the server with as many messages as possible, as opposed to as much data as possible, since it's the individual message handling that I'm testing. Each test cycle is run 6 times, and the first and last samples are removed, all in the process. The test process is run 3 times and the median is what I'm reporting here.

In the process there are two basically identical variants which are tested. There is a single sync client, which is reused, that sends the messages to the current server thread. There are two types of servers tested: an async server using the callbacks for handling messages, and a sync server doing the reads directly and dispatching manually. Internally those represent almost identical code paths, except that the async one is doing the extra allocations for the deferred dispatch. [code and output attached]

I ran the same 100,000*1K*6*2*3 tests with both debug and release compiled code. As can be seen from the attached output, in the best case (release code) there is a 5.6% "overhead" from the async to sync cases. For the debug code the difference is a more dramatic 25.2%.

In my continued reading and use of the code I concluded that the basic design flaw is that the underlying *socket_service classes are not available as a public interface. It's nice that thought and care have gone into designing a handling and dispatch pattern that can serve many needs. But as it's currently structured, and as was pointed out and/or alluded to in the sync/demuxer threads, not having access to use or customize this lower level forces a single use case. Another restriction this causes is that the various OS handling models are not available as individual handling types. Hence, for example on Linux, it is not possible to choose at runtime whether one should use epoll, kqueue, or select.

I should mention that just because I voted no doesn't mean I won't be using this library myself. The performance for my current application is sufficient. But I'm hoping that more improvements come along to make the optimum uses possible as well.

Thanks to all for reading ;-) And happy _insert_your_preferred_holiday_here_.

PS. The test was on my AMD Athlon 3200+ 2.2GHz, 1GB RAM, Win2K SP4 setup.

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim - Grafik/jabber.org
DEBUG
====== BEGIN OUTPUT ======
--- ASYNC...
### TIME: total = 2.97428; iterations = 100000; iteration = 2.97428e-005; iterations/second = 33621.6
### TIME: total = 3.01433; iterations = 100000; iteration = 3.01433e-005; iterations/second = 33174.8
### TIME: total = 2.98429; iterations = 100000; iteration = 2.98429e-005; iterations/second = 33508.8
### TIME: total = 3.00432; iterations = 100000; iteration = 3.00432e-005; iterations/second = 33285.4
### TIME: total = 2.98429; iterations = 100000; iteration = 2.98429e-005; iterations/second = 33508.8
### TIME: total = 3.05439; iterations = 100000; iteration = 3.05439e-005; iterations/second = 32739.7
-- ...ASYNC: average iterations/second = 33369.5
--- SYNC...
### TIME: total = 2.40346; iterations = 100000; iteration = 2.40346e-005; iterations/second = 41606.8
### TIME: total = 2.41347; iterations = 100000; iteration = 2.41347e-005; iterations/second = 41434.1
### TIME: total = 2.41347; iterations = 100000; iteration = 2.41347e-005; iterations/second = 41434.1
### TIME: total = 2.40346; iterations = 100000; iteration = 2.40346e-005; iterations/second = 41606.8
### TIME: total = 2.42348; iterations = 100000; iteration = 2.42348e-005; iterations/second = 41262.9
### TIME: total = 2.31333; iterations = 100000; iteration = 2.31333e-005; iterations/second = 43227.8
### TIME: total = 0; iterations = 100000; iteration = 0; iterations/second = 1.#INF
-- ...SYNC: average iterations/second = 41793.1
EXIT STATUS: 1
====== END OUTPUT ======
DIFF = 25.2%
RELEASE
====== BEGIN OUTPUT ======
--- ASYNC...
### TIME: total = 2.46354; iterations = 100000; iteration = 2.46354e-005; iterations/second = 40592
### TIME: total = 2.42348; iterations = 100000; iteration = 2.42348e-005; iterations/second = 41262.9
### TIME: total = 2.42348; iterations = 100000; iteration = 2.42348e-005; iterations/second = 41262.9
### TIME: total = 2.4335; iterations = 100000; iteration = 2.4335e-005; iterations/second = 41093.1
### TIME: total = 2.4335; iterations = 100000; iteration = 2.4335e-005; iterations/second = 41093.1
### TIME: total = 2.37341; iterations = 100000; iteration = 2.37341e-005; iterations/second = 42133.4
-- ...ASYNC: average iterations/second = 41178
--- SYNC...
### TIME: total = 2.2933; iterations = 100000; iteration = 2.2933e-005; iterations/second = 43605.3
### TIME: total = 2.27327; iterations = 100000; iteration = 2.27327e-005; iterations/second = 43989.5
### TIME: total = 2.30331; iterations = 100000; iteration = 2.30331e-005; iterations/second = 43415.7
### TIME: total = 2.31333; iterations = 100000; iteration = 2.31333e-005; iterations/second = 43227.8
### TIME: total = 2.30331; iterations = 100000; iteration = 2.30331e-005; iterations/second = 43415.7
### TIME: total = 2.30331; iterations = 100000; iteration = 2.30331e-005; iterations/second = 43415.7
### TIME: total = 0; iterations = 100000; iteration = 0; iterations/second = 1.#INF
-- ...SYNC: average iterations/second = 43492.9
EXIT STATUS: 1
====== END OUTPUT ======
DIFF = 5.6%
#include <cstdio>
#include <iostream>
#include <vector>
#include <valarray>
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <boost/asio.hpp>

const std::size_t message_size = /**/ 1*1024 /*/ 64*1024 /**/;
const int port = 9999;
const std::size_t message_iterations = /** 100 /*/ 100000 /**/;

namespace asio = ::boost::asio;

namespace detail_test {

struct timed_scope
{
  boost::xtime t0;
  boost::xtime t1;
  std::size_t n;
  std::vector<double> & results;

  inline timed_scope(std::vector<double> & r, std::size_t iterations = 1)
    : results(r), n(iterations)
  {
    boost::xtime_get(&t0,boost::TIME_UTC);
  }

  inline ~timed_scope()
  {
    boost::xtime_get(&t1,boost::TIME_UTC);
    double t = double(t1.sec)+double(t1.nsec)/double(1000000000);
    t -= double(t0.sec)+double(t0.nsec)/double(1000000000);
    std::cerr
      << "### TIME"
      << ": total = " << t
      << "; iterations = " << n
      << "; iteration = " << t/double(n)
      << "; iterations/second = " << double(n)/t
      << '\n';
    results.push_back(double(n)/t);
  }
};

template <typename out>
out & result_summary(out & o, const std::vector<double> & results)
{
  std::valarray<double> r(&results[1],results.size()-2);
  o << r.sum()/r.size();
  return o;
}

void sleep_for_secs(int n)
{
  boost::xtime t;
  boost::xtime_get(&t,boost::TIME_UTC);
  t.sec += n;
  boost::thread::sleep(t);
}

}

struct async_server
{
  char io_buffer[message_size];
  boost::asio::demuxer demuxer;
  boost::asio::datagram_socket socket;
  std::auto_ptr<detail_test::timed_scope> timer;
  std::size_t message_count;
  std::size_t message_recount;
  std::auto_ptr<boost::thread> runner;
  std::vector<double> results;

  async_server()
    : socket(this->demuxer,boost::asio::ipv4::udp::endpoint(port))
    , message_count(0)
    , message_recount(0)
  {
    socket.async_receive(
      boost::asio::buffer(io_buffer, message_size), 0,
      boost::bind(handle_receive_from, this,
        boost::asio::placeholders::error,
        boost::asio::placeholders::bytes_transferred));
  }

  void handle_receive_from(
    const boost::asio::error& error, size_t /*bytes_transferred*/)
  {
    if (!error || error == boost::asio::error::message_size)
    {
      if (++message_count == this->timer->n)
      {
        this->clear_timer();
        message_count = 0;
        this->reset_timer(message_recount);
      }
      socket.async_receive(
        boost::asio::buffer(io_buffer, message_size), 0,
        boost::bind(handle_receive_from, this,
          boost::asio::placeholders::error,
          boost::asio::placeholders::bytes_transferred));
    }
  }

  void start()
  {
    this->runner.reset(new boost::thread(
      boost::bind(boost::asio::demuxer::run,&this->demuxer)));
  }

  void stop()
  {
    this->demuxer.interrupt();
    this->runner->join();
    this->clear_timer();
  }

  void reset_timer(std::size_t i = 1)
  {
    this->message_recount = i;
    this->timer.reset(new detail_test::timed_scope(this->results,i));
  }

  void clear_timer()
  {
    this->timer.reset();
  }
};

struct sync_server
{
  char io_buffer[message_size];
  boost::asio::demuxer demuxer;
  boost::asio::datagram_socket socket;
  std::auto_ptr<detail_test::timed_scope> timer;
  std::size_t message_count;
  std::size_t message_recount;
  std::auto_ptr<boost::thread> runner;
  std::vector<double> results;
  volatile bool running;

  sync_server()
    : socket(this->demuxer,boost::asio::ipv4::udp::endpoint(port))
    , message_count(0)
    , message_recount(0)
    , running(false)
  {
  }

  void handle_receive_from(
    const boost::asio::error& error, size_t /*bytes_transferred*/)
  {
    if (!error || error == boost::asio::error::message_size)
    {
      if (++message_count == this->timer->n)
      {
        this->clear_timer();
        message_count = 0;
        this->reset_timer(message_recount);
      }
    }
  }

  void start()
  {
    this->runner.reset(new boost::thread(
      boost::bind(sync_server::run,this)));
  }

  void run()
  {
    this->running = true;
    while (this->running)
    {
      boost::asio::error error;
      std::size_t bytes_transferred = socket.receive(
        boost::asio::buffer(io_buffer, message_size), 0,
        boost::asio::assign_error(error));
      this->handle_receive_from(error,bytes_transferred);
      if (error && error != boost::asio::error::message_size)
        break;
    }
    this->running = false;
  }

  void stop()
  {
    this->running = false;
    this->socket.close();
    this->runner->join();
    this->clear_timer();
  }

  void reset_timer(std::size_t i = 1)
  {
    this->message_recount = i;
    this->timer.reset(new detail_test::timed_scope(this->results,i));
  }

  void clear_timer()
  {
    this->timer.reset();
  }
};

struct sync_client
{
  char io_buffer[message_size];
  boost::asio::demuxer demuxer;
  boost::asio::ipv4::host_resolver host_resolver;
  boost::asio::ipv4::host host;
  boost::asio::ipv4::udp::endpoint receiver_endpoint;
  boost::asio::datagram_socket socket;

  sync_client()
    : host_resolver(this->demuxer)
    , receiver_endpoint(port)
    , socket(this->demuxer,boost::asio::ipv4::udp::endpoint(0))
  {
    host_resolver.get_host_by_name(host, "127.0.0.1");
    receiver_endpoint.address(host.address(0));
  }

  void send()
  {
    socket.send_to(
      boost::asio::buffer(io_buffer, message_size), 0,
      receiver_endpoint);
  }
};

int main()
{
  sync_client c0;
  {
    async_server s0;
    s0.start();
    detail_test::sleep_for_secs(2);
    std::cerr << "--- ASYNC...\n";
    s0.reset_timer(message_iterations);
    for (std::size_t m = 0; m < message_iterations*(1+4+1); ++m)
    {
      c0.send();
    }
    s0.stop();
    detail_test::result_summary(
      std::cerr << "-- ...ASYNC: average iterations/second = ",
      s0.results) << "\n";
  }
  detail_test::sleep_for_secs(2);
  {
    sync_server s0;
    s0.start();
    detail_test::sleep_for_secs(2);
    std::cerr << "--- SYNC...\n";
    s0.reset_timer(message_iterations);
    for (std::size_t m = 0; m < message_iterations*(1+4+1); ++m)
    {
      c0.send();
    }
    s0.stop();
    detail_test::result_summary(
      std::cerr << "-- ...SYNC: average iterations/second = ",
      s0.results) << "\n";
  }
  return 1;
}

On 12/20/05, Rene Rivera <grafik.list@redshift-software.com> wrote:
I ran the same 100,000*1K*6*2*3 tests with both debug and release compiled code. As can be seen from the attached output, in the best case (release code) there is a 5.6% "overhead" from the async to sync cases. For the debug code the difference is a more dramatic 25.2%.
On Linux, the differences between async and sync results are much more striking. Here are the results from Rene's program compiled with gcc 4.0.2 -O2 on Linux 2.6 (epoll). I had to make a number of small changes to get it to compile, and the SYNC test hangs at the end.

--- ASYNC...
### TIME: total = 4.62879; iterations = 100000; iteration = 4.62879e-05; iterations/second = 21603.9
### TIME: total = 5.37136; iterations = 100000; iteration = 5.37136e-05; iterations/second = 18617.3
### TIME: total = 5.03588; iterations = 100000; iteration = 5.03588e-05; iterations/second = 19857.5
### TIME: total = 5.09588; iterations = 100000; iteration = 5.09588e-05; iterations/second = 19623.7
### TIME: total = 4.60645; iterations = 100000; iteration = 4.60645e-05; iterations/second = 21708.7
### TIME: total = 4.55167; iterations = 100000; iteration = 4.55167e-05; iterations/second = 21970
-- ...ASYNC: average iterations/second = 19951.8
--- SYNC...
### TIME: total = 1.38579; iterations = 100000; iteration = 1.38579e-05; iterations/second = 72161.2
### TIME: total = 1.3561; iterations = 100000; iteration = 1.3561e-05; iterations/second = 73741
### TIME: total = 1.34804; iterations = 100000; iteration = 1.34804e-05; iterations/second = 74181.9
### TIME: total = 1.35522; iterations = 100000; iteration = 1.35522e-05; iterations/second = 73788.5
### TIME: total = 1.36956; iterations = 100000; iteration = 1.36956e-05; iterations/second = 73016.4
### TIME: total = 22.2436; iterations = 100000; iteration = 0.000222436; iterations/second = 4495.68
-- ...SYNC: average iterations/second = 73682

I had to interrupt the program by attaching a debugger to get it to run to completion (explaining the low result for the last SYNC loop). One thread seems to get stuck in a "recv" call (sync_server::run) that does not get interrupted by main's call to s0.stop(). Not sure if this is a bug in the test program or in asio.

--
Caleb Epstein
caleb dot epstein at gmail dot com

Hi Caleb, --- Caleb Epstein <caleb.epstein@gmail.com> wrote: <snip>
I had to interrupt the program by attaching a debugger to get it to run to completion (explaining the low result for the last SYNC loop). One thread seems to get stuck in a "recv" call (sync_server::run) that does not get interrupted by main's call to s0.stop(). Not sure if this is a bug in the test program or in asio.
I suspect the problem is that the test program is closing the socket from one thread while the recv call is still in progress. I don't believe you can portably terminate a synchronous receive operation by closing the socket. If you do require behaviour where a receive is terminated by close, the asynchronous receive operation is guaranteed to return as soon as possible with the operation_aborted error. Cheers, Chris
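P.S. In other words, the shutdown path with an asynchronous receive would look something like this (a sketch, not taken from asio itself):

    // Ending a receive chain cleanly when the socket is closed elsewhere.
    void handle_receive(const boost::asio::error& error, std::size_t bytes)
    {
      if (error == boost::asio::error::operation_aborted)
        return; // socket was closed; do not re-arm, the chain ends here
      // ... process the message and call async_receive again ...
    }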

"Caleb Epstein" <caleb.epstein@gmail.com> wrote:
On Linux, the differences between async and sync results are much more striking. Here are the results from Rene's program compiled with gcc 4.0.2-O2 on Linux 2.6 (epoll). I had to make a number of small changes to get it to compile, and the SYNC test hangs at the end
Caleb, Could you post your changes? I'll admit that I'm having a difficult time getting it to compile on g++.

On 12/22/05, christopher baus <christopher@baus.net> wrote:
Could you post your changes? I'll admit that I'm having a difficult time getting it to compile on g++.
Also sent privately, but resending to the list in case there are other interested parties. I had to change some of the boost::bind invocations to use fully qualified &class::method names, and in one case added a member function for use with boost::thread. I also changed the sync_server class to shut down when it reads a 1-byte packet, as the blocking receive call was not being interrupted when the socket was closed from the main thread.

--
Caleb Epstein
caleb dot epstein at gmail dot com

--- Caleb Epstein <caleb.epstein@gmail.com> wrote:
On Linux, the differences between async and sync results are much more striking.
I have spent some time investigating this today, in the course of which I implemented optimisations to eliminate memory allocations and reduce the number of system calls in the asynchronous case. Even with the optimisations, on Linux the test still showed approximately the same results as those reported by Caleb. However, changing the test to have a network between the sender and receiver shows a marked improvement in async's relative and absolute performance. In that case, the performance of sync and async is virtually the same.

My conclusion is that the single-host test exhibits pathological behaviour on Linux (and possibly other OSes). The problem arises due to UDP being an unreliable protocol.

Let's consider the behaviour of the async test. We have:

- One thread performing synchronous sends in a tight loop.
- One thread performing asynchronous receives via the demuxer.

Typically a UDP send will not block, so the synchronous loop performs sends until its timeslice finishes. This will rapidly fill the buffer on the receiving socket, and once that buffer is full the additional datagrams are discarded. The receiver will continue to receive whatever datagrams are available without giving up its timeslice, but once those are gone it will block on select/epoll/etc. The net result is that it takes the receiver more timeslices, and therefore more time, to receive its quota of packets.

The synchronous test, on the other hand, appears to be getting flow control for free from Linux. That is, a thread blocking on a synchronous receive seems to be woken up as soon as data is available, so the socket's buffer never fills.

This is borne out by introducing simple flow control to the async test. I added a short sleep to the synchronous send loop like so:

  if (m % 128 == 0)
  {
    timeval tv;
    tv.tv_sec = 0;
    tv.tv_usec = 1000;
    select(0, 0, 0, 0, &tv);
  }

and the performance of the async test was boosted to approximately 2/3 of the sync test.

A more realistic test involves putting the sender and receiver on different hosts. I did this with the following setup:

- Dedicated 100Mbps ethernet connection
- Sender: Windows XP SP2, 1.7GHz Pentium M, 512MB RAM
- Receiver: Linux 2.6.8 kernel, 900MHz Pentium 3, 256MB RAM

Running the test with packets of 256, 512 and 1024 bytes showed identical performance for the async and sync cases.

I'm not saying that async operations will always perform as well as the equivalent sync operations. A one-socket test like this naturally favours synchronous operations, because an asynchronous implementation involves additional demultiplexing costs. However, in a use case involving multiple sockets, these costs are amortised.

Cheers, Chris

Christopher Kohlhoff wrote:
My conclusion is that the single-host test exhibits pathological behaviour on Linux (and possibly other OSes). The problem arises due to UDP being an unreliable protocol.
UDP is only unreliable from the POV of the receiver, not of the sender. UDP sends behave the same as TCP sends. Hence, as far as the sender is concerned, all sends succeed -- under normal circumstances, that is.
Let's consider the behaviour of the async test. We have:
- One thread performing synchronous sends in a tight loop.
- One thread performing asynchronous receives via the demuxer.
Typically a UDP send will not block, so the synchronous loop performs sends until its timeslice finishes. This will rapidly fill the buffer on the receiving socket, and once that buffer is full the additional datagrams are discarded.
The sender doesn't discard. You might get send failures if you have the send socket in non-blocking mode, though. If it were any other way, the test that I wrote would show "missing" receives. But Caleb's results show all messages arriving, as would be expected from the localhost device.

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim - Grafik/jabber.org

Hi Rene, --- Rene Rivera <grafik.list@redshift-software.com> wrote:
The sender doesn't discard.
I didn't mean that the sender discards, only that the datagram being sent can be discarded immediately by the TCP/IP stack if the receiving socket's buffer is full. The sender is not aware of this.
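(For what it's worth, the buffer in question is the one sized by SO_RCVBUF. A sketch of enlarging it with plain sockets calls, which can postpone, but not eliminate, the drops:)

    // Sketch: enlarge the receiving socket's buffer (plain BSD sockets).
    int size = 1024 * 1024; // request 1 MB; the kernel may clamp this value
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
        reinterpret_cast<const char*>(&size), sizeof(size));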
If it were any other way, the test that I wrote would show "missing" receives.
It does indeed exhibit many missing receives on two of my systems (one Linux, the other Mac OS X, both uniprocessor).
But Caleb's results show all messages arriving, as would be expected from the localhost device.
I suspect that Caleb is using a multiprocessor machine, and so the behaviour I described does not happen for him. However on my systems with "flow control" enabled it is still necessary to include the additional performance optimisations to get the substantial improvement. It would be interesting to rerun the test on a multiprocessor machine with these changes included. Cheers, Chris

On 12/28/05, Christopher Kohlhoff <chris@kohlhoff.com> wrote:
Hi Rene,
--- Rene Rivera <grafik.list@redshift-software.com> wrote:
The sender doesn't discard.
I didn't mean that the sender discards, only that the datagram being sent can be discarded immediately by the TCP/IP stack if the receiving socket's buffer is full. The sender is not aware of this.
If it were any other way, the test that I wrote would show "missing" receives.
It does indeed exhibit many missing receives on two of my systems (one Linux, the other Mac OS X, both uniprocessor).
I think I'm seeing dropped packets when I run the test on an SMP Linux 2.4-based system (see below). With Linux 2.6, the packets all appear to make it to the server, but we have the sync-faster-than-async condition.
But Caleb's results show all messages arriving, as would be expected from the localhost device.
I suspect that Caleb is using a multiprocessor machine, and so the behaviour I described does not happen for him. However on my systems with "flow control" enabled it is still necessary to include the additional performance optimisations to get the substantial improvement. It would be interesting to rerun the test on a multiprocessor machine with these changes included.
I've run the tests on an SMP machine and a single-CPU box. The SMP machine is running a Redhat 2.4.21 kernel, and the test actually doesn't work properly on this platform. It seems that the receiver drops many packets and, because of the way the test is written, the program crashes when trying to print the incomplete result data. I added some debugging print statements that show the async_server receives about 75,000 of the 600,000 packets sent before it is forcefully stopped.

When I run the tests on my single-CPU machine running Linux 2.6.14.3 at home, the test completes properly. I've tried both the epoll and select-based reactors and the results are nearly identical (approx. 3x slower than sync).

I think it's difficult to say where the speed difference comes from in these tests. It may exercise some inefficiencies or limitations of the Linux UDP stack. Perhaps the syscall overhead of using select/epoll is where all the performance is lost. Or it could be the architecture of asio that is at fault. I don't think we can say for sure at this point.

I'll see if I can't knock up a straight socket-based test like this one (e.g. just POSIX socket calls, no asio) and see if it gets similar results to Rene's benchmark. The same benchmark run on some other UNIX-based system (MacOS, *BSD, etc.) would be another interesting data point to have, to see if this is an implementation or a platform issue.

Just a quick back-of-the-envelope calculation: the synchronous test manages to handle about 65,000 1K messages a second, which amounts to a bandwidth of approximately 500 Megabits/second. It would certainly be nice to be able to sustain this level of throughput, but I'm not sure it's an entirely realistic workload.

--
Caleb Epstein
caleb dot epstein at gmail dot com
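P.S. The heart of such a test's synchronous side would be little more than this (a sketch; socket setup and error handling omitted, and handle_message is a placeholder):

    /* Minimal sketch of the planned POSIX-only synchronous receive loop.
       Assumes fd is an already bound AF_INET/SOCK_DGRAM socket. */
    char buf[1024];
    for (;;)
    {
      ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, 0, 0);
      if (n < 0)
        break;                /* error or shutdown */
      handle_message(buf, n); /* count it against the timer */
    }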

On 12/28/05, Caleb Epstein <caleb.epstein@gmail.com> wrote:
I'll see if I can't knock up a straight socket-based test like this one (e.g. just POSIX socket calls, no asio) and see if it gets similar results to Rene's benchmark. The same benchmark run on some other UNIX-based system (MacOS, *BSD, etc.) would be another interesting data point to have, to see if this is an implementation or a platform issue.
OK, as threatened I've attached a POSIX-only version of Rene's test that uses only socket, bind, select, sendto, recvfrom and friends. It looks like, at least in the case of the asio ASYNC test, there is some performance upside to be achieved:

Linux 2.4.21 SMP:

--- ASYNC...
### TIME: total = 1.30585; iterations = 100000; iteration = 1.30585e-05; iterations/second = 76578.8
### TIME: total = 1.24593; iterations = 100000; iteration = 1.24593e-05; iterations/second = 80261
### TIME: total = 1.24527; iterations = 100000; iteration = 1.24527e-05; iterations/second = 80304.2
### TIME: total = 1.24671; iterations = 100000; iteration = 1.24671e-05; iterations/second = 80211.2
### TIME: total = 0.995225; iterations = 100000; iteration = 9.95225e-06; iterations/second = 100480
-- ...ASYNC: average iterations/second = 80258.8
--- SYNC...
### TIME: total = 1.10908; iterations = 100000; iteration = 1.10908e-05; iterations/second = 90164.5
### TIME: total = 1.10027; iterations = 100000; iteration = 1.10027e-05; iterations/second = 90886.4
### TIME: total = 1.1056; iterations = 100000; iteration = 1.1056e-05; iterations/second = 90448.5
### TIME: total = 1.1051; iterations = 100000; iteration = 1.1051e-05; iterations/second = 90490
### TIME: total = 1.10743; iterations = 100000; iteration = 1.10743e-05; iterations/second = 90299.2
### TIME: total = 1.09628; iterations = 100000; iteration = 1.09628e-05; iterations/second = 91217.8
-- ...SYNC: average iterations/second = 90531

Linux 2.6.13.4 uniprocessor:

--- ASYNC...
### TIME: total = 1.81951; iterations = 100000; iteration = 1.81951e-05; iterations/second = 54960
### TIME: total = 1.78175; iterations = 100000; iteration = 1.78175e-05; iterations/second = 56124.6
### TIME: total = 1.80059; iterations = 100000; iteration = 1.80059e-05; iterations/second = 55537.3
### TIME: total = 1.81167; iterations = 100000; iteration = 1.81167e-05; iterations/second = 55197.6
### TIME: total = 1.80915; iterations = 100000; iteration = 1.80915e-05; iterations/second = 55274.6
### TIME: total = 2.77475; iterations = 100000; iteration = 2.77475e-05; iterations/second = 36039.2
-- ...ASYNC: average iterations/second = 55533.5
--- SYNC...
### TIME: total = 1.48353; iterations = 100000; iteration = 1.48353e-05; iterations/second = 67406.7
### TIME: total = 1.37942; iterations = 100000; iteration = 1.37942e-05; iterations/second = 72494
### TIME: total = 1.36727; iterations = 100000; iteration = 1.36727e-05; iterations/second = 73138.3
### TIME: total = 1.37749; iterations = 100000; iteration = 1.37749e-05; iterations/second = 72595.8
### TIME: total = 1.35504; iterations = 100000; iteration = 1.35504e-05; iterations/second = 73798.8
### TIME: total = 1.19966; iterations = 100000; iteration = 1.19966e-05; iterations/second = 83357.3
-- ...SYNC: average iterations/second = 73006.7

--
Caleb Epstein
caleb dot epstein at gmail dot com

Hi Caleb, --- Caleb Epstein <caleb.epstein@gmail.com> wrote: <snip>
I've run the tests on an SMP machine and a single-CPU box. The SMP machine is running a Redhat 2.4.21 kernel and the test actually doesn't work properly on this platform. It seems that the receiver drops many packets and, because of the way the test is written, the program crashes when trying to print the incomplete result data. I added some debugging print statements that show the async_server receives about 75,000 of the 600,000 packets sent before it is forcefully stopped.
Yes, this is precisely the behaviour I am seeing on my single processor machine. My linux kernel version is 2.6.8, so perhaps it is something that changed between 2.6.8 and 2.6.14.
When I run the tests on my single-CPU machine running Linux 2.6.14.3 at home, the test completes properly. I've tried both the epoll and select-based reactors and the results are nearly identical (approx. 3x slower than sync).
Did you run the test using the CVS version of asio? Last night I checked in an optimisation to make a non-blocking call before putting the socket into select/epoll/etc. This reduces the performance of the sync calls a bit, but I have just realised there is a further optimisation that eliminates this cost.

For me, the CVS version of asio (plus a 1000 microsecond sleep every 128 iterations of the sync client, but without the sync call optimisation) gives the following results:

--- ASYNC...
### TIME: total = 3.09885; iterations = 100000; iteration = 3.09885e-05; iterations/second = 32270.1
### TIME: total = 2.37491; iterations = 100000; iteration = 2.37491e-05; iterations/second = 42106.9
### TIME: total = 2.37707; iterations = 100000; iteration = 2.37707e-05; iterations/second = 42068.6
### TIME: total = 0.668798; iterations = 100000; iteration = 6.68798e-06; iterations/second = 149522
-- ...ASYNC: average iterations/second = 42087.7
--- SYNC...
### TIME: total = 2.03626; iterations = 100000; iteration = 2.03626e-05; iterations/second = 49109.7
### TIME: total = 2.02494; iterations = 100000; iteration = 2.02494e-05; iterations/second = 49384.1
### TIME: total = 2.02348; iterations = 100000; iteration = 2.02348e-05; iterations/second = 49419.7
### TIME: total = 2.01938; iterations = 100000; iteration = 2.01938e-05; iterations/second = 49520.1
### TIME: total = 2.01489; iterations = 100000; iteration = 2.01489e-05; iterations/second = 49630.6
### TIME: total = 2.06713; iterations = 100000; iteration = 2.06713e-05; iterations/second = 48376.2
-- ...SYNC: average iterations/second = 49488.6

Note that the test has not been modified to use the custom allocation support either.

Cheers, Chris
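P.S. The optimisation is, in outline, something like the following (illustrative pseudo-code only, not the actual reactor source; deliver_completion and register_read are made-up names):

    // Speculative read: try a non-blocking recv first, and only register
    // the socket with select/epoll if the call would have blocked.
    ssize_t n = recv(fd, buf, len, 0);        // fd is non-blocking
    if (n >= 0)
      deliver_completion(n);                  // no demultiplexer round trip
    else if (errno == EAGAIN || errno == EWOULDBLOCK)
      reactor.register_read(fd, handler);     // wait via select/epoll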

Hi Rene,

In my first reply to your review I outlined a proposal for custom memory allocation. I have now implemented enough of it to re-run your test program (updated program and results attached).

--- Rene Rivera <grafik.list@redshift-software.com> wrote:
I ran the same 100,000*1K*6*2*3 tests with both debug and release compiled code. As can be seen from the attached output, in the best case (release code) there is a 5.6% "overhead" from the async to sync cases.
Machine specs: 1.7GHz Pentium M, 512MB RAM, Windows XP SP2. Compiler: VC8 Express.

First, without custom memory allocation a typical run of the program made approximately 340000 calls to new. With my custom memory allocation there were just 80. In the release build async calls were marginally slower, but the difference was less than 1%.
For the debug code the difference is a more dramatic 25.2%.
Even though a debug build is not really relevant for performance tests, I suspect this difference is mostly due to the number of layers introduced by boost::bind. In my updated test, which uses a hand-crafted function object, the difference was more like 10% when using dynamic allocation. It's just 5% with custom allocation.
In my continued reading and use of the code I concluded that the basic design flaw is that the underlying *socket_service classes are not available as a public interface.
I don't understand how not exposing the implementation details is a design flaw.
Another restriction that this causes is that the various OS handling models are not available as individual handling types. Hence, for example on Linux, it is not possible to choose at runtime whether one should use epoll, kqueue, or select.
The stated intention is to use the most scalable event demultiplexer offered by a platform. Furthermore, runtime polymorphism is likely to limit the opportunity for optimisation. E.g. the custom allocation strategy I have implemented works because the user handler types have not been erased.

I'd be interested to know whether you, or others, find this custom memory allocation interface satisfactory. After having used it I think that I quite like this approach, because it allows the developer to use application-specific knowledge about the number of concurrent asynchronous "chains" when customising memory allocation. This custom memory allocation implementation required no changes to the existing asio public interface or overall design.

Cheers, Chris

#include <cstdio>
#include <iostream>
#include <vector>
#include <valarray>
#include <boost/shared_ptr.hpp>
#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <asio.hpp>

long new_count = 0;

void* operator new(std::size_t size)
{
  InterlockedIncrement(&new_count);
  void* p = malloc(size);
  if (!p) throw std::bad_alloc();
  return p;
}

void operator delete(void* p)
{
  free(p);
}

const std::size_t message_size = /**/ 1*1024 /*/ 64*1024 /**/;
const int port = 9999;
const std::size_t message_iterations = /** 100 /*/ 100000 /**/;

//namespace asio = boost::asio;

namespace detail_test {

struct timed_scope
{
  boost::xtime t0;
  boost::xtime t1;
  std::size_t n;
  std::vector<double> & results;

  inline timed_scope(std::vector<double> & r, std::size_t iterations = 1)
    : results(r), n(iterations)
  {
    boost::xtime_get(&t0,boost::TIME_UTC);
  }

  inline ~timed_scope()
  {
    boost::xtime_get(&t1,boost::TIME_UTC);
    double t = double(t1.sec)+double(t1.nsec)/double(1000000000);
    t -= double(t0.sec)+double(t0.nsec)/double(1000000000);
    std::cerr
      << "### TIME"
      << ": total = " << t
      << "; iterations = " << n
      << "; iteration = " << t/double(n)
      << "; iterations/second = " << double(n)/t
      << '\n';
    results.push_back(double(n)/t);
  }
};

template <typename out>
out & result_summary(out & o, const std::vector<double> & results)
{
  std::valarray<double> r(&results[1],results.size()-2);
  o << r.sum()/r.size();
  return o;
}

void sleep_for_secs(int n)
{
  boost::xtime t;
  boost::xtime_get(&t,boost::TIME_UTC);
  t.sec += n;
  boost::thread::sleep(t);
}

}

struct async_server;

struct async_server_receive_handler
{
  async_server_receive_handler(async_server* this_p) : this_p_(this_p) {}
  void operator()(const asio::error& error, std::size_t);
  async_server* this_p_;
};

template <>
class asio::handler_alloc_hook<async_server_receive_handler>
{
public:
  template <typename Allocator>
  static typename Allocator::pointer allocate(
      async_server_receive_handler& h, Allocator& allocator,
      typename Allocator::size_type count)
  {
    return reinterpret_cast<typename Allocator::pointer>(
        h.this_p_->operation_buffer);
  }

  template <typename Allocator>
  static void deallocate(
      async_server_receive_handler& h, Allocator& allocator,
      typename Allocator::pointer pointer,
      typename Allocator::size_type count)
  {
  }
};

struct async_server
{
  char io_buffer[message_size];
  asio::demuxer demuxer;
  asio::datagram_socket socket;
  std::auto_ptr<detail_test::timed_scope> timer;
  std::size_t message_count;
  std::size_t message_recount;
  std::auto_ptr<boost::thread> runner;
  std::vector<double> results;
  char operation_buffer[1024];

  async_server()
    : socket(this->demuxer,asio::ipv4::udp::endpoint(port))
    , message_count(0)
    , message_recount(0)
  {
    socket.async_receive(
      asio::buffer(io_buffer, message_size), 0,
      async_server_receive_handler(this));
  }

  void start()
  {
    this->runner.reset(new boost::thread(
      boost::bind(&asio::demuxer::run,&this->demuxer)));
  }

  void stop()
  {
    this->demuxer.interrupt();
    this->runner->join();
    this->clear_timer();
  }

  void reset_timer(std::size_t i = 1)
  {
    this->message_recount = i;
    this->timer.reset(new detail_test::timed_scope(this->results,i));
  }

  void clear_timer()
  {
    this->timer.reset();
  }
};

void async_server_receive_handler::operator()(
    const asio::error& error, std::size_t)
{
  if (!error || error == asio::error::message_size)
  {
    if (++this_p_->message_count == this_p_->timer->n)
    {
      this_p_->clear_timer();
      this_p_->message_count = 0;
      this_p_->reset_timer(this_p_->message_recount);
    }
    this_p_->socket.async_receive(
      asio::buffer(this_p_->io_buffer, message_size), 0, *this);
  }
}

struct sync_server
{
  char io_buffer[message_size];
  asio::demuxer demuxer;
  asio::datagram_socket socket;
  std::auto_ptr<detail_test::timed_scope> timer;
  std::size_t message_count;
  std::size_t message_recount;
  std::auto_ptr<boost::thread> runner;
  std::vector<double> results;
  volatile bool running;

  sync_server()
    : socket(this->demuxer,asio::ipv4::udp::endpoint(port))
    , message_count(0)
    , message_recount(0)
    , running(false)
  {
  }

  void handle_receive_from(
    const asio::error& error, size_t /*bytes_transferred*/)
  {
    if (!error || error == asio::error::message_size)
    {
      if (++message_count == this->timer->n)
      {
        this->clear_timer();
        message_count = 0;
        this->reset_timer(message_recount);
      }
    }
  }

  void start()
  {
    this->runner.reset(new boost::thread(
      boost::bind(&sync_server::run,this)));
  }

  void run()
  {
    this->running = true;
    while (this->running)
    {
      asio::error error;
      std::size_t bytes_transferred = socket.receive(
        asio::buffer(io_buffer, message_size), 0,
        asio::assign_error(error));
      this->handle_receive_from(error,bytes_transferred);
      if (error && error != asio::error::message_size)
        break;
    }
    this->running = false;
  }

  void stop()
  {
    this->running = false;
    this->socket.close();
    this->runner->join();
    this->clear_timer();
  }

  void reset_timer(std::size_t i = 1)
  {
    this->message_recount = i;
    this->timer.reset(new detail_test::timed_scope(this->results,i));
  }

  void clear_timer()
  {
    this->timer.reset();
  }
};

struct sync_client
{
  char io_buffer[message_size];
  asio::demuxer demuxer;
  asio::ipv4::host_resolver host_resolver;
  asio::ipv4::host host;
  asio::ipv4::udp::endpoint receiver_endpoint;
  asio::datagram_socket socket;

  sync_client()
    : host_resolver(this->demuxer)
    , receiver_endpoint(port)
    , socket(this->demuxer,asio::ipv4::udp::endpoint(0))
  {
    host_resolver.get_host_by_name(host, "127.0.0.1");
    receiver_endpoint.address(host.address(0));
  }

  void send()
  {
    socket.send_to(
      asio::buffer(io_buffer, message_size), 0,
      receiver_endpoint);
  }
};

int main()
{
  sync_client c0;
  {
    async_server s0;
    s0.start();
    detail_test::sleep_for_secs(2);
    std::cerr << "--- ASYNC...\n";
    s0.reset_timer(message_iterations);
    for (std::size_t m = 0; m < message_iterations*(1+4+1); ++m)
    {
      c0.send();
    }
    s0.stop();
    detail_test::result_summary(
      std::cerr << "-- ...ASYNC: average iterations/second = ",
      s0.results) << "\n";
  }
  detail_test::sleep_for_secs(5);
  {
    sync_server s0;
    s0.start();
    detail_test::sleep_for_secs(2);
    std::cerr << "--- SYNC...\n";
    s0.reset_timer(message_iterations);
    for (std::size_t m = 0; m < message_iterations*(1+4+1); ++m)
    {
      c0.send();
    }
    s0.stop();
    detail_test::result_summary(
      std::cerr << "-- ...SYNC: average iterations/second = ",
      s0.results) << "\n";
  }
  std::cerr << new_count << " calls to new\n";
  return 1;
}

--- ASYNC...
Test output (each row is one run; the six batch figures are iterations/second, each batch being 100000 messages; the reported average discards the first and last batch):

Run  Mode   Batches (iterations/second)                       Average
 1   ASYNC  57060.7 63200.1 63200.1 63200.1 62802.7 62410.1   63100.8
 1   SYNC   55169.2 63200.1 63602.7 62802.6 63602.7 64010.4   63302
 2   ASYNC  55169.2 62802.6 62802.6 62802.7 62802.6 64423.4   62802.6
 2   SYNC   56099   63200.1 62802.7 63200.1 63602.7 63200.1   63201.4
 3   ASYNC  57388.6 62802.7 63602.7 63200.1 63200.1 61261.5   63201.4
 3   SYNC   60887.9 64423.4 64841.7 64841.7 64841.7 64841.7   64737.1
 4   ASYNC  59438.2 63200.1 63200.1 63200.1 63200.1 59438.2   63200.1
 4   SYNC   57060.7 64423.4 64841.7 64423.4 64841.7 65694.9   64632.5
 5   ASYNC  57720.3 63200.1 63602.7 63200.1 63602.7 61261.5   63401.4
 5   SYNC   57388.6 63200.1 63200.1 63602.7 63602.7 67929.4   63401.4
 6   ASYNC  56736.5 62802.6 63200.1 62802.6 63200.1 60887.9   63001.4
 6   SYNC   55475.7 63200.1 63200.1 63200.1 63200.1 63602.7   63200.1
 7   ASYNC  58738.9 62802.6 63200.1 63200.1 63200.1 59794.1   63100.8
 7   SYNC   55169.2 63200.1 63200.1 63200.1 63200.1 65694.9   63200.1
 8   ASYNC  55785.6 63200.1 63200.1 63602.7 63200.1 63602.7   63300.8
 8   SYNC   55785.6 64423.4 64423.4 64841.7 64841.7 67017.6   64632.5
 9   ASYNC  56415.9 63200.1 63200.1 63200.1 62410.1 59794.1   63002.6
 9   SYNC   56099   63602.7 63200.1 64010.4 63602.7 63602.7   63604
10   ASYNC  56415.9 63200.1 63200.1 63200.1 63602.7 60887.9   63300.8
10   SYNC   59086.5 62802.6 63200.1 63200.1 63200.1 63602.7   63100.8

This is great (custom handler allocation). I'm looking forward to getting the post-review asio version! Also, do you know of any explanation for the discrepancy between sync and async timings with the epoll reactor posted by Caleb Epstein? Thanks, Dave

Hi Dave, --- Dave Moore <jdmoore99@gmail.com> wrote:
Also, do you know of any explanation for the discrepancy between sync and async timings with the epoll reactor posted by Caleb Epstein?
Once I get the custom allocation stuff finished I'll do some profiling. Some possibilities include that it currently makes two allocations per operation (once for the reactor and once for the post() queue) or that it makes more system calls (epoll_ctl, epoll_wait and read rather than just read). However it's probably unwise to guess in the absence of profiling data :) Cheers, Chris

On 12/21/05, Christopher Kohlhoff <chris@kohlhoff.com> wrote:
However it's probably unwise to guess in the absence of profiling data :)
Attached please find profiler output for this program when compiled with g++ 4.0.2 -O2 -pg on Linux 2.6 (epoll reactor). -- Caleb Epstein caleb dot epstein at gmail dot com

Christopher Kohlhoff wrote:
Hi Rene,
In my first reply to your review I outlined a proposal for custom memory allocation. I have now implemented enough of it to re-run your test program (updated program and results attached).
OK.
--- Rene Rivera <grafik.list@redshift-software.com> wrote:
I ran the same 100,000*1K*6*2*3 tests with both debug- and release-compiled code. As can be seen from the attached output, in the best case (release code) there is a 5.6% "overhead" from the async to the sync cases.
Machine specs: 1.7GHz Pentium M, 512MB RAM, Windows XP SP2. Compiler: VC8 Express.
First, without custom memory allocation a typical run of the program made approximately 340000 calls to new. With my custom memory allocation there were just 80.
Much better :-)
In the release build async calls were marginally slower, but the difference was less than 1%.
For the debug code the difference is a more dramatic 25.2%.
Even though a debug build is not really relevant for performance
Of course... I was being as complete as possible.
In my continued reading and use of the code I concluded that the basic design flaw is that the underlying *socket_service classes are not available as a public interface.
I don't understand how not exposing the implementation details is a design flaw.
You might consider it an implementation detail. But depending on the context it might not be considered a detail. At minimum, in this case, I would expect to be able to choose which implementation to use. And in some situations I might want to make that choice at runtime, not just compile time. For example, some would consider the std::deque<T> container behind an std::stack<T> to be an implementation detail. But I might want to use a circular_deque<T> instead, for those cases where I want to make certain memory guarantees. After all, it's how we draw the abstraction line that we worry so much about in Boost. In this case I think you've drawn the line too far up, preventing some alternate uses of your library.
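To put that analogy in code (a hedged illustration using only the standard library, not asio): std::stack already draws its abstraction line low enough that the backing container is a template parameter, so a caller with particular memory guarantees can substitute it, and a hypothetical circular_deque would slot in the same way:

#include <stack>
#include <vector>

std::stack<int> a;                         // defaults to std::deque<int>
std::stack<int, std::vector<int> > b;      // caller-substituted implementation
// std::stack<int, circular_deque<int> > c;  // Rene's hypothetical container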
Another restriction that this causes is that the various OS handling models are not available as individual handling types. Hence, for example on Linux, it is not possible to choose at runtime whether one should use epoll, kqueue, or select.
The stated intention is to use the most scalable event demultiplexer offered by a platform.
Sure, but scalability is a contextual measure. In my use case scalability is not measured by how many connections are handled at once, since my connections are all virtual; it is limited only by bandwidth and protocol limits.
Furthermore, runtime polymorphism is likely to limit the opportunity for optimisation. E.g. the custom allocation strategy I have implemented works because the user handler types have not been erased.
I didn't realize I implied runtime polymorphism. I was only thinking of the simple ability to create different kinds of clients/servers based on the OS signaling method. It's not possible right now since one can't instantiate, in a type-safe manner, different datagram_socket_service variants, for example.
I'd be interested to know whether you, or others, find this custom memory allocation interface satisfactory.
It's OK. It solves the allocation problem, but not the reuse problems. -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - Grafik/jabber.org

Hi Rene, --- Rene Rivera <grafik.list@redshift-software.com> wrote:
In my continued reading and use of the code I concluded that the basic design flaw is that the underlying *socket_service classes are not available as a public interface.
I don't understand how not exposing the implementation details is a design flaw.
You might consider it an implementation detail. But depending on the context it might not be considered a detail. At minimum, in this case, I would expect to be able to choose which implementation to use. And in some situations I might want to make that choice at runtime, not just compile time. For example, some would consider the std::deque<T> container behind an std::stack<T> to be an implementation detail. But I might want to use a circular_deque<T> instead, for those cases where I want to make certain memory guarantees. After all, it's how we draw the abstraction line that we worry so much about in Boost. In this case I think you've drawn the line too far up, preventing some alternate uses of your library.
One of the goals of asio is portability. I believe the level of abstraction I have chosen is the lowest possible that is consistent with this goal. The internal implementations are highly platform-specific. In the future, I might consider having these platform-specific implementation classes mature into some sort of secondary public interface. But since that's not one of my design goals it has a low priority. <snip>
I didn't realize I implied runtime polymorphism. I was only thinking of the simple ability to create different kinds of clients/servers based on the OS signaling method. It's not possible right now since one can't instantiate, in a type-safe manner, different datagram_socket_service variants, for example.
If I'm understanding you correctly, what you are after is a single program where you have some sockets using one demuxing method (say epoll) and other sockets using (say) select? Interesting.

Asio does allow you to customise the implementation of things like datagram sockets. For example:

class my_datagram_socket_service
{
  ...
};

typedef basic_datagram_socket<
    my_datagram_socket_service> datagram_socket;

You are free to reuse the existing internals (or not) to do this as you see fit. But I can't imagine that this scenario would be applicable to a broad enough audience to warrant spending the time to make it convenient.
It solves the allocation problem, but not the reuse problems.
As I said, you can reuse the implementation, but the public interface you're asking for is not in the scope of asio's current design goals. Cheers, Chris

On Thu, 22 Dec 2005 23:29:53 +1100 (EST) Christopher Kohlhoff <chris@kohlhoff.com> wrote:
If I'm understanding you correctly, what you are after is a single program where you have some sockets using one demuxing method (say epoll) and other sockets using (say) select? Interesting.
Note that each is good for specific things. I have tests and numbers that show select() and epoll() running circles around each other in performance, depending on the way they are used. Mixing the different types of use-cases causes both methods to degrade. Thus, I think it is very reasonable to put some sockets into a select-based method, and other sockets in an epoll-based method.

BTW, I've yet to have time to do a review (I have a major deadline Dec 23, then xmas stuff for a few days). However, I've tried to follow some of the comments. One of my biggest problems with asio, when I tried it several months ago, was the poor support for datagram/multicast apps. I've seen that you have addressed some concerns (at least according to some comments). Could you please specify some details of the mcast improvements?

Also, the motivation for your own thread implementation is an all-header implementation of asio. Why do you need an all-header implementation? In the earlier requirements for a boost network library, NOT having an all-header implementation was one of the requirements. Personally, I do not like all-header implementations, especially for system related components. They pull way too much junk into the namespace, especially under any flavor of Windows (though linux has its own problems there as well). Lots of macros, and other junk to pollute and cause problems -- not to mention the additional compilation time... What's wrong with a library-based implementation? While I haven't played, I assume it works under BCB ;-)

Jody Hagins wrote:
On Thu, 22 Dec 2005 23:29:53 +1100 (EST) Christopher Kohlhoff <chris@kohlhoff.com> wrote:
If I'm understanding you correctly, what you are after is a single program where you have some sockets using one demuxing method (say epoll) and other sockets using (say) select? Interesting.
Somewhat.
Note that each is good for specific things. I have tests and numbers that show select() and epoll() running circles around each other in performance, depending on the way they are used. Mixing the different types of use-cases causes both methods to degrade. Thus, I think it is very reasonable to put some sockets into a select-based method, and other sockets in an epoll-based method.
Strangely it's not having multiple methods on different FDs at the same time that I had in mind. Although, thinking about it, I can see how having the choice would be good.

My original contention is a compatibility concern. I want to be able to write a *single* program that for compatibility can run on Linux 2.4 and above. In such a situation I need to query the OS to find out the optimal (from my point of view, not Asio's) method and instantiate/use it. Hence I am using more information than is available when Asio makes its optimization choice.

But ultimately my concerns with Asio are about the same as those pointed out by Darryl, much better than I could have. I find it distressing that users are forced to pay for this particular Proactor implementation, in its most generic form, when there are many situations in which it's not needed at all, or when I want to implement a different Proactor or other dispatch implementation. In a strange way it reminds me of libraries like ACE, where to use 5% of it you have to use 90% of it.

Another aspect of the current design I have problems with is that it forces a pattern of use that predicates dynamically creating a state machine path for handling the IO. In my previous test example it's a very simple FSM, where it basically goes back to the same node for all IO messages. But for more complicated interactions, like HTTP, SMTP, IMAP, POP, etc., such a pattern is suboptimal, perhaps not from a performance POV, but from a code modularity, maintenance, and verification POV.

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim - Grafik/jabber.org

Hi Rene, --- Rene Rivera <grafik.list@redshift-software.com> wrote: <snip>
My original contention is a compatibility concern. I want to be able to write a *single* program that for compatibility can run on Linux 2.4 and above. In such a situation I need to query the OS to find out the optimal (from my point of view, not Asio's) method and instantiate/use it. Hence I am using more information than is available when Asio makes its optimization choice.
But if your use case is as you described before, where you only have one or a small number of sockets, then why not just compile for a Linux 2.4 target only? The select-based implementation will do just fine since you've only got a very small number of sockets involved, and you'll get your portability. <snip>
Another aspect of the current design I have problems with is that it forces a pattern of use that predicates dynamically creating a state machine path for handling the IO. In my previous test example it's a very simple FSM, where it basically goes back to the same node for all IO messages. But for more complicated interactions, like HTTP, SMTP, IMAP, POP, etc. such a pattern is suboptimal, perhaps not from a performance POV, but from a code modularity, maintenance, and verification POV.
Actually I believe the asio model lends itself more easily to constructing intuitive, modular abstractions than, say, a reactive model. This is because it allows you to compose a chain of asynchronous operations and hide them behind a single function. For example, I could easily imagine a function:

template <typename Handler>
void async_http_get_content(
    asio::demuxer& demuxer,
    const std::string& url,
    std::vector<char>& content,
    Handler handler);

Internally this might parse the URL, resolve the host name, create a socket, connect, transmit request, collect response, close socket and finally invoke the application handler, all asynchronously. Other protocol interactions could be similarly encapsulated behind an asynchronous interface. Cheers, Chris
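The mechanics of such a chain can be sketched using only primitives that appear elsewhere in this thread (async_receive plus a handler taking (const asio::error&, std::size_t)); the names collect_messages_op and async_collect_messages are purely illustrative, not asio API. The trick is that the intermediate handler re-issues the next operation and the user's handler fires exactly once, when the whole chain is done:

template <typename Handler>
struct collect_messages_op
{
  asio::datagram_socket* socket;  // not owned
  char* chunk;                    // caller-provided; must outlive the chain
  std::size_t chunk_size;
  std::vector<char>* content;     // accumulated result
  std::size_t remaining;          // messages still expected (>= 1)
  Handler handler;                // user's completion handler: handler(error)

  void operator()(const asio::error& error, std::size_t bytes)
  {
    if (!error)
    {
      content->insert(content->end(), chunk, chunk + bytes);
      if (--remaining > 0)
      {
        // Next link in the chain: this op is its own completion handler.
        socket->async_receive(asio::buffer(chunk, chunk_size), 0, *this);
        return;
      }
    }
    handler(error); // chain done (or failed): notify the caller exactly once
  }
};

template <typename Handler>
void async_collect_messages(asio::datagram_socket& s, char* chunk,
    std::size_t chunk_size, std::vector<char>& content,
    std::size_t count, Handler handler)
{
  collect_messages_op<Handler> op =
    { &s, chunk, chunk_size, &content, count, handler };
  s.async_receive(asio::buffer(chunk, chunk_size), 0, op);
}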

On 12/22/05, Christopher Kohlhoff <chris@kohlhoff.com> wrote:
But if your use case is as you described before, where you only have one or a small number of sockets, then why not just compile for a Linux 2.4 target only? The select-based implementation will do just fine since you've only got a very small number of sockets involved, and you'll get your portability.
I'd like to request the addition of a macro or some other means to force the choice of a select-based demuxer, even when compiling on a platform that supports epoll. For example, I might end up compiling binaries on 2.6-based systems but wanting to run them on 2.4-based ones. There doesn't seem to be any way to do this currently. -- Caleb Epstein caleb dot epstein at gmail dot com

Hi Caleb, --- Caleb Epstein <caleb.epstein@gmail.com> wrote:
I'd like to request the addition of a macro or some other means to force the choice of a select-based demuxer, even when compiling on a platform that supports epoll. For example, I might end up compiling binaries on 2.6-based systems but wanting to run them on 2.4-based ones. There doesn't seem to be any way to do this currently.
Yeah, I was just thinking about the same thing myself. Something like a BOOST_ASIO_NO_EPOLL_REACTOR preprocessor check, for example. Cheers, Chris
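As an illustration only: the macro spelling is from Chris's message above, while epoll_reactor is an implementation class named earlier in this thread and select_reactor stands in for whatever the fallback is actually called. Something along these lines:

#if defined(__linux__) && !defined(BOOST_ASIO_NO_EPOLL_REACTOR)
typedef epoll_reactor reactor_type;   // Linux 2.6 and later
#else
typedef select_reactor reactor_type;  // portable fallback, e.g. Linux 2.4
#endif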

Christopher Kohlhoff wrote:
Hi Rene,
--- Rene Rivera <grafik.list@redshift-software.com> wrote: <snip>
My original contention is a compatibility concern. I want to be able to write a *single* program that for compatibility can run on Linux 2.4 and above. In such a situation I need to query the OS to find out the optimal (from my point of view, not Asio's) method and instantiate/use it. Hence I am using more information than is available when Asio makes its optimization choice.
But if your use case is as you described before, where you only have one or a small number of sockets, then why not just compile for a Linux 2.4 target only?
Because this is a concern for everyone. I might be the only person currently asking for this, but I won't be the last. I am raising these issues for the general benefit of the Boost review. Someone has to think about how the library addresses the full spectrum of needs, even if it wasn't originally designed for it. The best designed libraries can accommodate unknown uses. But if your position is that, since it doesn't allow for a certain (to my POV reasonable) modularity standard, one should just "hack" around it, I'd have to say that my vote went from a "no, but I would consider changing it" to a "vehement no". I don't want to see libraries in Boost that fail some basic abstraction uses.

But to make the selection of which method (epoll, select, etc.) to use more salient to you personally, I have a simple question for you if Asio is accepted into Boost: How do you expect to write tests for your library that cover *all* the various methods supported in one platform?

PS. I know I'm probably getting on your nerves by now. But that's what reviews are like ;-)

--
-- Grafik - Don't Assume Anything
-- Redshift Software, Inc. - http://redshift-software.com
-- rrivera/acm.org - grafik/redshift-software.com
-- 102708583/icq - grafikrobot/aim - Grafik/jabber.org

Hi Rene, --- Rene Rivera <grafik.list@redshift-software.com> wrote:
Because this is a concern for everyone. I might be the only person currently asking for this, but I won't be the last. I am raising these issues for the general benefit of the Boost review. Someone has to think about how the library addresses the full spectrum of needs, even if it wasn't originally designed for it. The best designed libraries can accommodate unknown uses. But if your position is that, since it doesn't allow for a certain (to my POV reasonable) modularity standard, one should just "hack" around it, I'd have to say that my vote went from a "no, but I would consider changing it" to a "vehement no". I don't want to see libraries in Boost that fail some basic abstraction uses.
I would have thought that basic abstraction would involve hiding the implementation, not exposing it.

As I understand it, your initial "no" was based on two things: documentation deficiencies (fair enough) and the cost of memory allocations. I have already demonstrated an approach for customising memory allocation that doesn't alter the overall design. The issue on which you are now focusing is that there is no public interface that allows you to select different combinations of platform-specific components.

Firstly, yes, this could be done without any change to the current public interface. As I have said elsewhere, over time I may allow these components to evolve into a secondary public interface that is explicitly noted as having restricted portability. However, this requirement is, from my point of view, very specialised and unlikely to be widely applicable. Furthermore, if tomorrow I find an improved way of using epoll that necessitates a rework of the epoll_reactor interface, then I would not hesitate to do so. Exposing these platform-specific internals may hinder my ability to optimise that which is to me the most important: the portable interface.

If we have a fundamental disagreement, it would be with regard to the "abstraction line" that you mentioned in an earlier post. I appreciate that the level I have set it at may be too high for your purposes. I would ask that you appreciate that asio was always intended to be portable, and I have chosen to set the public interface at the lowest possible portable level. What you are asking for is a public interface that exposes non-portable constructs; this is clearly at variance with my design goals.
But to make the selection of which method, epoll, select, etc., to use more salient to you personally, I have a simple question for you if Asio is accepted into Boost:
How do you expect to write tests for your library that cover *all* the various methods supported in one platform?
By doing what Caleb Epstein suggested and having a different macro passed when building each set of tests. I note that this is how Boost.Threads, for example, differentiates between the native Windows threads and pthreads variants. Is this also insufficient for your needs? Cheers, Chris

"Christopher Kohlhoff" <chris@kohlhoff.com> wrote in message news:20051223091739.62019.qmail@web32605.mail.mud.yahoo.com...
How do you expect to write tests for your library that cover *all* the various methods supported in one platform?
By doing what Caleb Epstein suggested and having a different macro passed when building each set of tests.
I note that this is how Boost.Threads, for example, differentiates between the native Windows threads and pthreads variants.
It is also how Boost.Filesystem chooses between Windows and POSIX variants on platforms (Cygwin) that support both. --Beman

Beman Dawes wrote:
"Christopher Kohlhoff" <chris@kohlhoff.com> wrote in message news:20051223091739.62019.qmail@web32605.mail.mud.yahoo.com...
How do you expect to write tests for your library that cover *all* the various methods supported in one platform?
By doing what Caleb Epstein suggested and having a different macro passed when building each set of tests.
I note that this is how Boost.Threads, for example, differentiates between the native Windows threads and pthreads variants.
It is also how Boost.Filesystem chooses between Windows and POSIX variants on platforms (Cygwin) that support both.
And I dislike that approach in general; I consider it a kludge, because it eventually leads to ODR and dynamic linking problems. I also consider it a design dodge to provide users only one abstraction implementation for one category of OS services, instead of providing one abstraction that uses any implementation of that one category of OS services. -- -- Grafik - Don't Assume Anything -- Redshift Software, Inc. - http://redshift-software.com -- rrivera/acm.org - grafik/redshift-software.com -- 102708583/icq - grafikrobot/aim - Grafik/jabber.org

Hi, I have been enjoying reading reviews of asio, and I have played with it a little bit for the purpose of possibly using it for implementing various protocols, many of which are layered in various ways. It would be nice to find a systematic way to decompose the layers involved. I've been thinking about how to do this in a way similar to the ACE Streams framework, without having to resort to runtime polymorphism and type erasure, but have not found a clean way to do this. This post made me think about this differently. Christopher Kohlhoff wrote:
For example, I could easily imagine a function:
template <typename Handler>
void async_http_get_content(
    asio::demuxer& demuxer,
    const std::string& url,
    std::vector<char>& content,
    Handler handler);
Internally this might parse the URL, resolve the host name, create a socket, connect, transmit request, collect response, close socket and finally invoke the application handler, all asynchronously.
I think this is very useful. To be able to assemble this and other higher abstractions, one would likely need to be able to compose in terms of more fundamental building blocks. For instance:

// Asynchronously get a line of input from 'stream'.
template <typename BUFFER, typename DISPATCHER, typename HANDLER>
void async_get_line(
    BUFFER& buffered_stream,
    DISPATCHER& dispatcher,
    std::vector<char>& line,
    HANDLER handler);

// Asynchronously get http headers, by repeatedly using async_get_line.
template <typename BUFFER, typename DISPATCHER, typename HANDLER>
void async_get_http_headers(
    BUFFER& buffered_stream,
    DISPATCHER& dispatcher,
    std::map<std::string, std::string>& headers,
    HANDLER handler);

// Asynchronously get the http response body.
template <typename BUFFER, typename DISPATCHER, typename HANDLER>
void async_get_http_body(
    BUFFER& buffered_stream,
    DISPATCHER& dispatcher,
    std::vector<char>& body,
    size_t content_length,
    HANDLER handler);

Now, async_get_line would not know how much to consume from the socket, and getting one character at a time would be inappropriate. Thus, I added another template parameter BUFFER, a fictional concept for the purpose of this discussion that supports pushing back unconsumed data for later invocations. I see that there is an asio::buffered_stream. Maybe it can be changed to support pushing back data. It would be driven like this:

asio::stream_socket s;
buffer<asio::stream_socket> buffer(s);
std::vector<char> body;
std::map<std::string, std::string> headers;
...

// Send request, then:
void request_sent(...)
{
  async_get_http_headers(buffer, demuxer, headers, handle_read_headers);
}

void handle_read_headers(...)
{
  size_t content_length = atoi(headers["content-length"].c_str());
  async_get_http_body(buffer, demuxer, body, content_length,
      handle_read_body);
}

Do you think this is a fruitful way to proceed? Will this sort of decomposition be unfavorable with respect to performance? Also, expressing these abstract operations in general enough types allows driving a protocol stack from a different source, such as a test driver instead of a specific socket type.

You have already defined concepts for many aspects of asio. Could the suggested BUFFER concept derive from the Stream concept? What would that entail for the implementation of e.g. asio::buffered_stream? I'm thinking about STL and the way that iterator concepts are hierarchically derived from each other and that algorithms are expressed in terms of a certain iterator concept.

Speaking of concepts. I feel a little bit lost navigating the template type hierarchies. Consistent reuse of concept names where appropriate might help. For instance, you define a concept 'Dispatcher', yet you use the name demuxer_type everywhere.

Finally, I don't see any way to cancel asynchronous operations. While this may be ok for the read_some functions, things get more serious when composing larger operations. What are your thoughts around this? Has cancellation been discussed before? Thanks, Mats
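For what it's worth, a minimal sketch of the push-back behaviour described above (entirely hypothetical, not an existing asio class): bytes read but not consumed are returned to the buffer and handed out again before the underlying stream is touched:

#include <algorithm>
#include <cstddef>
#include <deque>

class putback_buffer
{
public:
  // Return unconsumed bytes so the next consumer sees them first.
  void push_back(const char* data, std::size_t n)
  {
    pending_.insert(pending_.begin(), data, data + n);
  }

  // Drain pending bytes; only when this returns 0 does the caller
  // need to issue a real read on the underlying stream.
  std::size_t consume(char* out, std::size_t n)
  {
    std::size_t count = pending_.size() < n ? pending_.size() : n;
    std::copy(pending_.begin(), pending_.begin() + count, out);
    pending_.erase(pending_.begin(), pending_.begin() + count);
    return count;
  }

private:
  std::deque<char> pending_;
};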

Hi Mats, --- Mats Nilsson <mats.nilsson@xware.se> wrote:
I have been enjoying reading reviews of asio, and I have played with it a little bit for the purpose of possibly using it for implementing various protocols, many of which are layered in various ways. It would be nice to find a systematic way to decompose the layers involved.
Yes, this is something I am very interested in investigating further too, with the ultimate goal of providing an easy-to-use library of reusable protocol implementations.
I've been thinking about how to do this in a way similar to the ACE Streams framework, without having to resort to runtime polymorphism and type erasure, but have not found a clean way to do this.
As it happens my stream layering concepts are intended as a sort of compile-time version of ACE Streams. E.g.:

typedef buffered_read_stream<
    ssl::stream<stream_socket> > my_stream;
This post made me think about this differently. <snip> Now, async_get_line would not know how much to consume from the socket, and getting one character at a time would be inappropriate. Thus, I added another template parameter BUFFER, a fictional concept for the purpose of this discussion that supports pushing back unconsumed data for later invocations. I see that there is an asio::buffered_stream. Maybe it can be changed to support pushing back data.
Yes, although the buffered_stream template does already have a peek() function, which might also be a way to address the problem. BTW, the is_*_buffered type traits are also intended to allow these abstractions to be customised for the buffered and non-buffered cases. <snip>
Do you think this is a fruitful way to proceed? Will this sort of decomposition be unfavorable with respect to performance?
If the calls are too fine-grained then yes, performance could be adversely affected. I suspect the ideal involves a combination of stream layering and operation composition. Some possibilities might include:

- A line_buffered_stream class template that can be wrapped around stream_socket (or other implementations of the Stream concept). This would be optimised for line buffering but without the need for pushing back data on to the buffer. It would issue reads to the underlying stream in large chunks. Line-oriented protocols can be layered on top of this using operation composition.

- An http_connection class template (again wrapped around a Stream) that minimises calls to the underlying Stream, but uses a buffering strategy optimised for HTTP. Higher level functions like a single async_http_get_content function would be implemented in terms of this.

<snip>
You have already defined concepts for many aspects of asio. Could the suggested BUFFER concept derive from the Stream concept?
I see no problem with that.
What would that entail for the implementation of e.g. asio::buffered_stream? I'm thinking about STL and the way that iterator concepts are hierarchically derived from each other and that algorithms are expressed in terms of a certain iterator concept.
I'm happy to see the buffered*stream templates extended to provide better support for these use cases.
Speaking of concepts. I feel a little bit lost navigating the template type hierarchies. Consistent reuse of concept names where appropriate might help. For instance, you define a concept 'Dispatcher', yet you use the name demuxer_type everywhere.
The Dispatcher concept is for general handler dispatching, and the demuxer is one implementation (locking_dispatcher is another). However the classes that refer to a demuxer_type do so because they specifically need a demuxer, not just any old dispatcher.
Finally, I don't see any way to cancel asynchronous operations. While this may be ok for the read_some functions, things get more serious when composing larger operations.
What are your thoughts around this? Has cancellation been discussed before?
Portable cancellation is achieved by closing the socket. Any higher level abstraction would need to offer some sort of cancellation function that forwards the calls to the underlying socket, timer etc. Cheers, Chris
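A bare sketch of that forwarding pattern (Socket and Timer are placeholders for whatever primitives the abstraction owns; the names here are illustrative, with only the close-to-cancel behaviour taken from the reply above):

template <typename Socket, typename Timer>
class composed_operation
{
public:
  composed_operation(Socket& s, Timer& t) : socket_(s), timer_(t) {}

  // Cancel all outstanding work: closing the socket is the portable way
  // to abort its pending asynchronous operations, while the timer can be
  // cancelled without destroying it.
  void cancel()
  {
    socket_.close();
    timer_.cancel();
  }

private:
  Socket& socket_;
  Timer& timer_;
};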

Hi Christopher Kohlhoff wrote:
If the calls are too fine-grained then yes, performance could be adversely affected. I suspect the ideal involves a combination of stream layering and operation composition. Some possibilities might include:
- A line_buffered_stream class template that can be wrapped around stream_socket (or other implementations of the Stream concept). This would be optimised for line buffering but without the need for pushing back data on to the buffer. It would issue reads to the underlying stream in large chunks. Line-oriented protocols can be layered on top of this using operation composition.
Note that some protocols switch out of line mode, for instance HTTP, FTP/TLS and SSH. So unless there is a way to push data back to the buffered stream it would be difficult to account for this.
- An http_connection class template (again wrapped around a Stream) that minimises calls to the underlying Stream, but uses a buffering strategy optimised for HTTP. Higher level functions like a single async_http_get_content function would be implemented in terms of this.
Maybe a combination of asynchronous and synchronous strategies here. The buffered stream layer leads its own async life, requesting large buffers as needed, while providing services for the layers above it through an async interface whose operations may be carried out without a round trip to the dispatcher unless more data is actually needed.
The Dispatcher concept is for general handler dispatching, and the demuxer is one implementation (locking_dispatcher is another). However the classes that refer to a demuxer_type do so because they specifically need a demuxer, not just any old dispatcher.
What aspect makes demuxer_type different from Dispatcher? Could this Demuxer concept derive from Dispatcher? Or do you mean that they specifically need a particular demuxer instance, or need to refer to the very same class as some other entity?
Portable cancellation is achieved by closing the socket. Any higher level abstraction would need to offer some sort of cancellation function that forwards the calls to the underlying socket, timer etc.
Another concept? "CancelableOperation"? Mats

Hi Mats, --- Mats Nilsson <mats.nilsson@xware.se> wrote:
What aspect makes demuxer_type different from Dispatcher?
A demuxer is inherently associated with I/O, whereas a Dispatcher can be implemented without any reference to I/O operations, e.g. just using a thread pool.
Could this Demuxer concept derive from Dispatcher?
Yep, although in asio it's a class template (basic_demuxer<>) rather than a concept.
Portable cancellation is achieved by closing the socket. Any higher level abstraction would need to offer some sort of cancellation function that forwards the calls to the underlying socket, timer etc.
Another concept? "CancelableOperation"?
This might be useful, but couldn't be implemented by lower level classes like sockets. The problem is that on these classes cancellation has a side effect (closure of socket), whereas on timers, for example, it doesn't close the timer. This difference might also apply to higher level objects, I don't know. Cheers, Chris

Note that each is good for specific things. I have tests and numbers that show select() and epoll() running circles around each other in performance, depending on the way they are used. Mixing the different types of use-cases causes both methods to degrade. Thus, I think it is very reasonable to put some sockets into a select-based method, and other sockets in an epoll-based method.
Interesting. Could you give an example where select() is faster than epoll()? I suspect it is with smaller numbers of file descriptors.

On Thu, 22 Dec 2005 21:16:11 -0000 (GMT) "christopher baus" <christopher@baus.net> wrote:
Interesting. Could you give an example where select() is faster than epoll()? I suspect it is with smaller numbers of file descriptors.
I knew someone would ask. I did that work a long time ago, so let me start off concurrent prayers and grep. I was reading a linux networking book, and the performance numbers in that book seemed a bit ludicrous (I *did* work in a unix kernel group for a while, and I've used unix networking for many years). So, I took his examples, and his awesome performance turned out to be just in making the system call (it was actually failing with an error). Since I had started that course of action, I continued with my own tests. Unfortunately, the grep failed to find what I was looking for. I'll keep looking (I may have done it on an old machine that I do not have anymore).

On Thu, 22 Dec 2005 21:16:11 -0000 (GMT) "christopher baus" <christopher@baus.net> wrote:
Interesting. Could you give an example where select() is faster than epoll()? I suspect it is with smaller numbers of file descriptors.
Actually, I do not remember it being a smaller number of FDs. I believe I remember epoll outperforming select() only when there were a few file descriptors ready. In tests of lots of FDs, I think I remember select() actually outperforming epoll() when lots of the file descriptors were ready for input at the same time. IIRC, epoll was FABULOUS with lots of FDs with infrequent activity. However, with the same number of FDs, select() actually outperformed epoll() when there was data ready to be read on most of the file descriptors. Again, this is from memory of work done a long time ago, but I bet if you wrote a quick example of what I describe, you will find similar results. If not, I'll give you your money back on the price you paid me for my opinion ;-)

On Thu, 22 Dec 2005 21:16:11 -0000 (GMT) "christopher baus" <christopher@baus.net> wrote:
Interesting. Could you give an example where select() is faster than epoll()? I suspect it is with smaller numbers of file descriptors.
OK. I have a small example (but I'm out of time to do more). The problem with epoll() is when you have a large number of FDs that are all ready. If you epoll_wait() for as many events as you have FDs, you get MUCH worse performance than poll() or select(). If you just wait on 1 FD with epoll_wait(), you can call the system call a bunch more times; but then again, you have to make a system call for every FD that is ready, which is more expensive over lots of FDs. Does that make sense?
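For readers unfamiliar with the knob being described: the maxevents argument of epoll_wait bounds how many ready descriptors one system call can report. A rough Linux-only sketch (error handling omitted; the function name is illustrative):

#include <sys/epoll.h>

void drain_ready(int epfd, int max_batch)
{
  epoll_event events[64];
  if (max_batch > 64)
    max_batch = 64;

  // One system call reports up to max_batch ready descriptors. A batch of
  // 1 means one epoll_wait per ready FD; a large batch amortises the call
  // across many ready FDs.
  int n = epoll_wait(epfd, events, max_batch, 0);
  for (int i = 0; i < n; ++i)
  {
    // ... dispatch on events[i].data.fd ...
  }
}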

On Thu, 22 Dec 2005 21:16:11 -0000 (GMT) "christopher baus" <christopher@baus.net> wrote:
Interesting. Could you give an example where select() is faster than epoll()? I suspect it is with smaller numbers of file descriptors.
OK. I have a small example (but I'm out of time to do more). The problem with epoll() is when you have a large number of FDs that are all ready. If you epoll_wait() for as many events as you have FDs, you get MUCH worse performance than poll() or select(). If you just wait on 1 FD with epoll_wait(), you can call the system call a bunch more times; but then again, you have to make a system call for every FD that is ready, which is more expensive over lots of FDs.
Ok, I could see that being the case. The /dev/poll, kqueue, and epoll interfaces are all attempts to address the linear search of the FD array. For instance, if there are a large number of FDs being waited on, but the last one is notified repeatedly, then many long searches are required to find the ready FD. But if all the FDs are ready, then no search is required; basically all you have to do is call the handler for each FD. So yes, in the case where there are a lot of FDs that are always ready, select is probably faster because it requires fewer sys calls and no searching.

I do think that is probably a rare case though, especially on internet facing servers. Plus epoll allows edge triggering to prevent being re-notified of readiness events. But you do have a point here. I always just thought "epoll better" but now I see specific cases where that might not be true. The problem is it is really difficult to decide which will be better for most users. I could almost see the same app allowing both epoll and select with the same binary, but I'm not sure the complexity is worth it.

I've read a few of Rene's and Chris's comments on configuring for epoll/select, and my feeling is that on Linux if epoll is there, use it. But it is also possible to test for the existence of epoll at run time so the same binary can run on 2.4 and 2.6 kernels. I don't know of any apps that purposefully choose to use select() in the presence of epoll(), if an epoll() implementation for the app is available. If anything most app developers have been scrambling to provide epoll() support. christopher

On 12/23/05, christopher baus <christopher@baus.net> wrote:
I've read a few of Rene's and Chris's comments on configuring for epoll/select, and my feeling is that on Linux if epoll is there, use it. But it is also possible to test for the existence of epoll at run time so the same binary can run on 2.4 and 2.6 kernels.
If ASIO could do this at runtime, that would be fantastic. This could definitely come later though. -- Caleb Epstein caleb dot epstein at gmail dot com

On Fri, 23 Dec 2005 10:17:00 -0000 (GMT) "christopher baus" <christopher@baus.net> wrote:
I do think that is probably a rare case though, especially on internet facing servers. Plus epoll allows edge triggering to prevent being re-notified of readiness events. But you do have a point here. I always just thought "epoll better" but now I see specific cases where that might not be true.
Maybe on WAN servers, but not on high performance LAN based servers. Running a simple test on 100 file descriptors, here is what I get:

poll() calls per second: 107906
select() calls per second: 104891
epoll() calls per second: 70923

Granted, it's not a huge difference, and other numbers of FDs show differing results (also, waiting on only 1 FD per epoll_wait() call shows much better results, but you have to make more system calls as well).

Anyway, my point is that there surely are use cases where a user would want control over which underlying implementation is used. The programmer should be allowed to determine which implementation is used under the hood, and should be able to use several different ones for several different uses. I do not see this as exposing implementation details to the detriment of encapsulation. On the contrary, it gives the user the ability to have finer control. You can still default it to whatever you want, and if the defaults are good enough, then fine.
The problem is it is really difficult to decide which will be better for most users. I could almost see the same app allowing both epoll and select with the same binary, but I'm not sure the complexity is worth it.
Why not? The complexity would be exposing some mechanism to specify the underlying implementation. It doesn't have to be the implementation itself, just some token representing an implementation. Without the ability to tell the library which implementation to use, this library is severely lacking.
I've read a few of Rene's and Chris's comments on configuring for epoll/select, and my feeling is that on Linux if epoll is there, use it. But it is also possible to test for the existence of epoll at run time so the same binary can run on 2.4 and 2.6 kernels.
That's fine, for default behavior, but I still argue that the programmer should be given the ability to specify the underlying implementation.
I don't know of any apps that purposefully choose to use select() in the presence of epoll(), if an epoll() implementation for the app is available. If anything most app developers have been scrambling to provide epoll() support.
I know of at least 2 ;-)

Maybe on WAN servers, but not on high performance LAN based servers. Running a simple test on 100 file descriptors, here is what I get:
poll() calls per second: 107906
select() calls per second: 104891
epoll() calls per second: 70923
Interesting. Is this the number of calls to the readiness handler? I have to admit that I'm surprised by those results. I guess epoll doesn't really help until the number of FDs gets much higher (for instance in the thousands). This makes me want to try my own tests.

Hi Jody, --- Jody Hagins <jody-boost-011304@atdesk.com> wrote:
Maybe on WAN servers, but not on high performance LAN based servers. Running a simple test on 100 file descriptors, here is what I get:
poll() calls per second: 107906
select() calls per second: 104891
epoll() calls per second: 70923
Granted, it's not a huge difference, and other numbers of FDs show differing results (also, waiting on only 1 FD per epoll_wait() call shows much better results, but you have to make more system calls as well).
I have just been experimenting with modifications to the internal reactor interfaces such that they will attempt a non-blocking operation first, and only add the requested event to epoll if it doesn't complete immediately. In cases where descriptors are chronically ready, this bypasses epoll system calls altogether. Rough indication so far is that it improves performance 10-20%.
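A sketch of that speculative strategy at the system call level (illustrative only, not asio's actual internals; the function name is made up):

#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <sys/socket.h>

// Attempt the receive immediately; only when it would block does the
// descriptor need to be registered with epoll at all.
bool try_speculative_receive(int fd, char* buf, std::size_t len,
    ssize_t& bytes, bool& would_block)
{
  would_block = false;
  bytes = ::recv(fd, buf, len, MSG_DONTWAIT);
  if (bytes >= 0)
    return true; // completed immediately: epoll never entered the picture
  if (errno == EAGAIN || errno == EWOULDBLOCK)
    would_block = true; // fall back to epoll registration and retry later
  return false;
}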
Anyway, my point is that there surely are use cases where a user would want control over which underlying implementation is used. The programmer should be allowed to determine which implementation is used under the hood, and should be able to use several different ones for several different uses.
I'm happy to provide "private" methods to allow you to choose the backend, such as a #define, or something that defers the choice to runtime. My point is that it should not be part of the public interface, since: - These implementations are inherently non portable, and portability is the goal. - Adding things to a public interface implies that they are stable and will not be taken away in the future. Suppose some new demuxing technique comes along -- whether it be a new OS-level feature or a better way of utilising existing ones -- or an existing technique becomes obsolete, should asio's interface provide a guarantee that your chosen implementation is still supported? To me, it should not. Exposing these implementation details is likely to lead to onerous backward compatibility and maintenance requirements, and restrict the ability to optimise for the far more common use cases. As I said above, I'm willing to provide a private mechanism for choosing the implementation. I just don't believe that this is something that I can publish public interfaces for. Cheers, Chris

running a debug build of one of your examples in vc8.0 (Visual Studio beta 2) asserts "list iterators incompatible" from a calling location in hash_map.hpp:

// Insert a new entry into the map.
std::pair<iterator, bool> insert(const value_type& v)
{
  size_t bucket = boost::hash_value(v.first) % num_buckets;
  iterator it = buckets_[bucket].first;
  if (it == values_.end()) // <--- asserts
  ...

Probably just the "safe" C++ library getting in the way. Anyone know a way to disable this checking? Simon

At 11:59 2005-12-24, simon meiklejohn wrote:
running a debug build of one of your examples in vc8.0 (Visual Studio beta 2) asserts "list iterators incompatible" from a calling location in hash_map.hpp:

// Insert a new entry into the map.
std::pair<iterator, bool> insert(const value_type& v)
{
  size_t bucket = boost::hash_value(v.first) % num_buckets;
  iterator it = buckets_[bucket].first;
  if (it == values_.end()) // <--- asserts
  ...
Probably just "Safe" C++ library getting in the way. Anyone know a way to disable this checking?
you may be pleased to know that with the "release" VS2005 there is no assert (debug), but I'm concerned about the program itself. I get different outputs from release and debug.

release:
Successful accept
Successful connect
Successful receive
hello therePress any key to continue . . .
.....long pause before the last line.

debug:
Successful accept
Successful connect
Successful receive
hello there¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦ <snip several hundred '¦' characters> Press any key to continue . . .
......significant pause before the "Press any....."

ok, this intrigued me some....so I put breakpoints before the output of "Successful........" and looked at buf_. When in release, there was a '\0' after the "hello there"; in debug, there wasn't. I then looked up readsome for I/O streams in "The C++ Standard Library" by Nicolai Josuttis... his documentation says readsome() returns a std::streamsize (so you know how many characters are actually in the buffer) AND explicitly says there will be NO '\0' put into the buffer. I haven't started my formal review of asio yet, but something is clearly amiss from looking at your program (and the signature for read_some()).

from 27.6.1.3 of the standard:

streamsize readsome(char_type* s, streamsize n);

30 Effects: Behaves as an unformatted input function (as described in 27.6.1.3, paragraph 1). After constructing a sentry object, if !good() calls setstate(failbit) which may throw an exception, and return. Otherwise extracts characters and stores them into successive locations of an array whose first element is designated by s. If rdbuf()->in_avail() == -1, calls setstate(eofbit) (which may throw ios_base::failure (27.4.4.3)), and extracts no characters; if rdbuf()->in_avail() == 0, extracts no characters; if rdbuf()->in_avail() > 0, extracts min(rdbuf()->in_avail(), n).

31 Returns: The number of characters extracted.

I guess it's spelled differently in asio than the function in the standard streams because there is no way to tell how much data we have. I note there is no .gcount() const function either.
Simon
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

Victor A. Wagner Jr. wrote:
At 11:59 2005-12-24, simon meiklejohn wrote:
running a debug build of one of your examples in vc8.0 (Visual Studio beta 2) asserts "list iterators incompatible" from a calling location in hash_map.hpp:

// Insert a new entry into the map.
std::pair<iterator, bool> insert(const value_type& v)
{
  size_t bucket = boost::hash_value(v.first) % num_buckets;
  iterator it = buckets_[bucket].first;
  if (it == values_.end()) // <--- asserts
  ...
Probably just "Safe" C++ library getting in the way. Anyone know a way to disable this checking?
This is different to the "safe" C and C++ libraries.
you may be pleased to know that with the "release" vs2005 there is no assert (debug), but I'm concerned about the program itself.
Assertions are not compiled in release builds. In the new VS2005 standard library, there are assertion checks to validate various things such as:
* checking that two iterators being compared are from the same container;
* checking that an iterator hasn't gone past the end of the container.
It may be interesting to see if the following asserts (bug in the runtime library):

iterator first = c.begin(), last = c.end();
if( first == last ) // does this assert?
{
}

However, this most likely indicates a bug in that code.
I get different outputs from release and debug
ok, this intrigued me some....so I put breakpoints before the output of "Successful........" and looked at buf_. When in release, there was a '\0' after the "hello there"; in debug, there wasn't. I then looked up readsome for I/O streams in "The C++ Standard Library" by Nicolai Josuttis... his documentation says readsome() returns a std::streamsize (so you know how many characters are actually in the buffer) AND explicitly says there will be NO '\0' put into the buffer.
So is the I/O stream implementation inserting a '\0' in the release build as part of their "safe" runtime library initiative? Or is this in Jonathan's I/O streams library? - Reece

At 03:31 2005-12-27, Reece Dunn wrote:
Victor A. Wagner Jr. wrote:
At 11:59 2005-12-24, simon meiklejohn wrote:
running a debug build of one of your examples in vc8.0 (Visual Studio beta 2) asserts "list iterators incompatible" from a calling location in hash_map.hpp:

// Insert a new entry into the map.
std::pair<iterator, bool> insert(const value_type& v)
{
  size_t bucket = boost::hash_value(v.first) % num_buckets;
  iterator it = buckets_[bucket].first;
  if (it == values_.end()) // <--- asserts
  ...
Probably just "Safe" C++ library getting in the way. Anyone know a way to disable this checking?
This is different to the "safe" C and C++ libraries.
you may be pleased to know that with the "release" vs2005 there is no assert (debug), but I'm concerned about the program itself.
Assertions are not compiled in release builds.
I have _no_ idea why this is mentioned here. the OP said he was using the beta version of the compiler, I'm using the release version of the compiler.
In the new VS2005 standard library, there are assertion checks to validate various things such as:
* checking that two iterators being compared are from the same container;
* checking that an iterator hasn't gone past the end of the container.
It may be interesting to see if the following asserts (bug in the runtime library):

iterator first = c.begin(), last = c.end();
if( first == last ) // does this assert?
{
}
However, this most likely indicates a bug in that code.
I get different outputs from release and debug
ok, this intrigued me some....so I put breakpoints before the output of "Successful........" and looked at buf_. When in release, there was a '\0' after the "hello there"; in debug, there wasn't. I then looked up readsome for I/O streams in "The C++ Standard Library" by Nicolai Josuttis... his documentation says readsome() returns a std::streamsize (so you know how many characters are actually in the buffer) AND explicitly says there will be NO '\0' put into the buffer.
So is the I/O stream implementation inserting a '\0' in the release build as part of their "safe" runtime library initiative? Or is this in Jonathan's I/O streams library?
who knows? I just thought running the program with the final compiler would make sense rather than chasing gremlins in a beta
- Reece
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

--- "Victor A. Wagner Jr." <vawjr@rudbek.com> wrote:
you may be pleased to know that with the "release" VS2005 there is no assert (debug), but I'm concerned about the program itself. I get different outputs from release and debug.

release:
Successful accept
Successful connect
Successful receive
hello therePress any key to continue . . .
.....long pause before the last line.
debug: Successful accept Successful connect Successful receive hello there¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦¦ <snip>
It's a very easy-to-fix bug in the test program :) As you said, handle_recv currently assumes the buffer is NUL-terminated, when in fact it isn't. It would be better written as:

void handle_accept(const error& err)
{
  ...
  socket_.async_read_some(buffer(buf_, sizeof(buf_)),
      boost::bind(&stream_handler::handle_recv, this,
        boost::asio::placeholders::error,
        boost::asio::placeholders::bytes_transferred));
  ...
}

void handle_recv(const error& err, size_t bytes_transferred)
{
  if (err)
  {
    std::cout << "Receive error: " << err << "\n";
  }
  else
  {
    std::cout << "Successful receive\n";
    std::cout.write(buf_, bytes_transferred);
  }
}

Cheers, Chris

Hi Jody, --- Jody Hagins <jody-boost-011304@atdesk.com> wrote:
Note that each is good for specific things. I have tests and numbers that show select() and epoll() running circles around each other in performance, depending on the way they are used. Mixing the different types of use-cases causes both methods to degrade. Thus, I think it is very reasonable to put some sockets into a select-based method, and other sockets in an epoll-based method.
I guess you must be doing some pretty hard-core networking to be concerned about this level of detail. Can you describe a use case that would make such a feature widely useful? Some hard numbers on performance differences would be quite interesting.
BTW, I've yet to have time to do a review (I have a major deadline Dec 23, then xmas stuff for a few days).
However, I've tried to follow some of the comments.
One of my biggest problems with asio, when I tried it several months ago, was the poor support for datagram/multicast apps. I've seen that you have addressed some concerns (at least according to some comments). Could you please specify some details of the mcast improvements?
The standard multicast socket options are implemented in the asio::ipv4::multicast namespace. They are:

add_membership (IPPROTO_IP/IP_ADD_MEMBERSHIP)
drop_membership (IPPROTO_IP/IP_DROP_MEMBERSHIP)
outbound_interface (IPPROTO_IP/IP_MULTICAST_IF)
time_to_live (IPPROTO_IP/IP_MULTICAST_TTL)
enable_loopback (IPPROTO_IP/IP_MULTICAST_LOOP)
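As a rough usage sketch only -- the constructor arguments are a guess rather than the documented API, and the socket type is left as a template parameter to avoid asserting one -- joining a group with these options might look like:

// Assumes the review-era single header has been included, e.g.:
// #include <asio.hpp>
template <typename DatagramSocket>
void join_group(DatagramSocket& sock)
{
    // Join 239.255.0.1 on the default interface (ctor args are guesses).
    sock.set_option(asio::ipv4::multicast::add_membership(
        asio::ipv4::address("239.255.0.1")));
    // Keep datagrams within one router hop.
    sock.set_option(asio::ipv4::multicast::time_to_live(1));
}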
Also, the motivation for your own thread implementation is an all-header implementation of asio. Why do you need an all-header implementation? In the earlier requirements for a boost network library, NOT having an all-header implementation was one of the requirements.
Personally, I do not like all-header implementations, especially for system related components. They pull way too much junk into the namespace, especially under any flavor of Windows (though Linux has its own problems there as well). Lots of macros, and other junk to pollute and cause problems -- not to mention the additional compilation time...
What's wrong with a library-based implementation?
From comments, you'll see that some people prefer header-only and others don't.
It's on my to-do list to investigate a library implementation, but this would be an option, not mandatory. The goal of the library implementation would be to hide away the system headers. However, a few things need to be considered:
- Much of the library is implemented in templates, so it is likely that only the lowest level OS wrappers would be in a library.
- I am very concerned that any compilation firewall technique used does not introduce additional memory allocations. I am not yet sure about the best way to achieve this. Some declaration of system types may be required (albeit in a namespace), for example.
While I haven't played, I assume it works under BCB ;-)
Yes, mostly :) Cheers, Chris

On Fri, 23 Dec 2005 09:09:41 +1100 (EST) Christopher Kohlhoff <chris@kohlhoff.com> wrote:
I guess you must be doing some pretty hard-core networking to be concerned about this level of detail. Can you describe a use case that would make such a feature widely useful? Some hard numbers on performance differences would be quite interesting.
Well, you know... I'm looking for the numbers now. I seem to have not only lost the numbers, but the program as well. If I can't find them, then maybe I'll just rewrite the test (as close as I can come to remembering it) on Christmas Day. I should be allowed to do SOMETHING fun on that day...
The standard multicast socket options are implemented in the asio::ipv4::multicast namespace. They are:
There are some other (not so standard, and ipv6) ones, but that should do for now.
From comments, you'll see that some people prefer header-only and others don't.
Right. Wish there were a way to do both...
It's on my to-do list to investigate a library implementation, but this would be an option, not mandatory. The goal of the library implementation would be to hide away the system headers.
I actually played with a lib implementation using templates. Ended up templatizing on const builtins with values set to the "actual" value from the system header. It's a bit ugly in the implementation, but the usage is quite nice, and a bit more readable. It is, at least, possible to do all your template stuff (like socket options and such) in such a manner, without including the system headers directly.
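One way such a firewall could look -- this is a guess at the general shape, not Jody's actual code -- is to declare the raw values in a header and define them in a library source file that is the only place the system headers get included:

// header (no system includes) ------------------------------------
namespace net_detail
{
    // Defined in the library .cpp, which is the only file that
    // includes <netinet/in.h> and friends.
    extern const int level_ip;
    extern const int option_multicast_ttl;
}

class multicast_ttl_option
{
public:
    explicit multicast_ttl_option(int v) : value_(v) {}
    int level() const { return net_detail::level_ip; }
    int name() const { return net_detail::option_multicast_ttl; }
    const void* data() const { return &value_; }
private:
    int value_;
};

// library .cpp ----------------------------------------------------
// #include <netinet/in.h>
// namespace net_detail
// {
//     const int level_ip = IPPROTO_IP;
//     const int option_multicast_ttl = IP_MULTICAST_TTL;
// }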
However, a few things need to be considered:
- Much of the library is implemented in templates, so it is likely that only the lowest level OS wrappers would be in a library.
That's OK. The part I don't want included is the OS related junk. That's where my namespace gets polluted, and that's where funky macros come from. All your stuff will be nicely wrapped in the boost::asio namespace.
- I am very concerned that any compilation firewall technique used does not introduce additional memory allocations. I am not yet sure about the best way to achieve this. Some declaration of system types may be required (albeit in a namespace), for example.
Yes, I agree. I still think you can do it without the memory allocations. Basically, you are wrapping the system level stuff (anything that needs non-boost headers) in a library, and providing a thin interface (it can even be templates for the most part). There will be a performance cost for the extra function call.

While I've been extremely busy, I did try to read the docs. I'm concerned about all the bind() calls. I can live with it the first time, but it looks like I have to rebind after each operation. This can be quite costly. Maybe I'm missing something...

I'll try to get some time over the next week to review it a bit more, but this time of year is HECTIC. Kids, XMAS, work deadlines, end of quarter deadlines, end of year deadlines, etc... etc... I even canceled my vacation time (went skiing last weekend with the kids, but other than that, it's all been canceled).

Hi Jody, --- Jody Hagins <jody-boost-011304@atdesk.com> wrote: <snip>
The standard multicast socket options are implemented in the asio::ipv4::multicast namespace. They are:
There are some other (not so standard, and ipv6) ones, but that should do for now.
Let me know if you happen to need something else, because they're easy enough to add. <snip>
I actually played with a lib implementation using templates. Ended up templatizing on const builtins with values set to the "actual" value from the system header. It's a bit ugly in the implementation, but the usage is quite nice, and a bit more readable. It is, at least, possible to do all your template stuff (like socket options and such) in such a manner, without including the system headers directly.
Interesting ideas. I may have to call on your experience in this when I get to doing the changes. I'm most worried about things like ipv4::address, which contains an in_addr structure. <snip>
While I've been extremely busy, I did try to read the docs. I'm concerned about all the bind() calls. I can live with it the first time, but it looks like I have to rebind after each operation. This can be quite costly. Maybe I'm missing something...
Basically the handler is just a function object. The boost::bind calls are convenient, for sure, but if you're concerned with performance you can define your own function object that simply forwards the callback, like so:

class my_handler
{
public:
  explicit my_handler(my_class* p) : p_(p) {}
  void operator()(const asio::error& e) { p_->handle_event(e); }
private:
  my_class* p_;
};

Having said that, I've been quite impressed with how well boost::bind is optimised by MSVC for example, so I haven't found it an issue in practice. Cheers, Chris
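To see that such a forwarding handler really is pointer-sized and allocation-free, here is a self-contained toy (no asio involved; fake_async_op and the error struct are our stand-ins for an async operation invoking its completion handler):

#include <iostream>

struct error { int value; };

class my_class
{
public:
    void handle_event(const error& e)
    {
        std::cout << "handled, error=" << e.value << "\n";
    }
};

// Hand-rolled forwarding handler: a single pointer, trivially
// copyable, no heap allocation when it is copied around.
class my_handler
{
public:
    explicit my_handler(my_class* p) : p_(p) {}
    void operator()(const error& e) { p_->handle_event(e); }
private:
    my_class* p_;
};

// Stands in for an async operation that copies and later invokes
// the completion handler.
template <typename Handler>
void fake_async_op(Handler h)
{
    error e = { 0 };
    h(e);
}

int main()
{
    my_class c;
    fake_async_op(my_handler(&c));  // handler copied by value
}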

"Jody Hagins" <jody-boost-011304@atdesk.com> wrote in message news:20051222150748.437a3fe7.jody-boost-011304@atdesk.com...
Also, the motivation for your own thread implementation is an all-header implementation of asio. Why do you need an all-header implementation? In the earlier requirements for a boost network library, NOT having an all-header implementation was one of the requirements.
Personally, I do not like all-header implementations, especially for system related components. They pull way too much junk into the namespace, especially under any flavor of Windows (though Linux has its own problems there as well). Lots of macros, and other junk to pollute and cause problems -- not to mention the additional compilation time...
What's wrong with a library-based implementation?
I was one of the people asking for a library based solution. While I do think that is important, for the reasons you mention, I also think it is something that can be deferred. If the design is strong, a lot of details can be refined over time. That isn't to say that documentation and some of the other issues aren't important for the review, but design is the critical issue. If the design is weak, stuff like header versus library implementation doesn't matter anyhow. --Beman

I'd be interested to know whether you, or others, find this custom memory allocation interface satisfactory. After having used it I think that I quite like this approach because it allows the developer to use application-specific knowledge about the number of concurrent asynchronous "chains" when customising memory allocation.
This custom memory allocation implementation required no changes to the existing asio public interface or overall design.
Chris, I'm trying to catch up here after hacking on the reactor implementation in a corner for a few days. Could you point me to a post that describes the custom memory allocation interface? One thing I've done is written a pooled_list implementation and parametrized the list implementation used by your hash_map. This eliminates a bunch of hits to the global allocator, at the cost of a pre-allocated fixed hash_map size. Considering that the hash_map associates FDs with handlers, I think this is a reasonable trade off. christopher

Hi Christopher, --- christopher baus <christopher@baus.net> wrote:
I'm trying to catch up here after hacking on the reactor implementation in a corner for a few days. Could you point me to a post that describes the custom memory allocation interface?
http://lists.boost.org/Archives/boost/2005/12/98373.php The name has since changed from handler_allocator to handler_alloc_hook -- see the modified test program in my second reply for an example of its use. Cheers, Chris

Hi Chris, I have some doubts about your allocator proposal:
struct async_server_receive_handler
{
  async_server_receive_handler(async_server* this_p) : this_p_(this_p) {}
  void operator()(const asio::error& error, std::size_t);
  async_server* this_p_;
};
template <>
class asio::handler_alloc_hook<async_server_receive_handler>
{
public:
  template <typename Allocator>
  static typename Allocator::pointer allocate(
      async_server_receive_handler& h, Allocator& allocator,
      typename Allocator::size_type count)
  {
    return reinterpret_cast<typename Allocator::pointer>(
        h.this_p_->operation_buffer);
  }

  template <typename Allocator>
  static void deallocate(
      async_server_receive_handler& h, Allocator& allocator,
      typename Allocator::pointer pointer,
      typename Allocator::size_type count)
  {
  }
};
What does the custom allocator allocate? async_server_receive_handlers? Because the function is templatized:

template <typename Allocator>
static typename Allocator::pointer allocate(
    async_server_receive_handler& h, Allocator& allocator,
    typename Allocator::size_type count)
{
  return reinterpret_cast<typename Allocator::pointer>(
      h.this_p_->operation_buffer);
}

so what is the "Allocator& allocator" object, and what does it allocate? Can't we know at compile time what type we are going to manage? Also, should we define different allocators for read, write, accept, and other events? Regards, Ion

Hi Ion, --- Ion Gaztañaga <igaztanaga@gmail.com> wrote:
What does the custom allocator allocate? async_server_receive_handlers?
It allocates objects of type Allocator::value_type.
Because the function is templatized:
template <typename Allocator>
static typename Allocator::pointer allocate(
    async_server_receive_handler& h, Allocator& allocator,
    typename Allocator::size_type count)
{
  return reinterpret_cast<typename Allocator::pointer>(
      h.this_p_->operation_buffer);
}
so what's "Allocator &allocator" object, and what does it allocate?
The Allocator type is rebound from the allocator template parameter on the public asio types. By default this is std::allocator<void>, so the Allocator type would be std::allocator<some_internal_asio_type>. The default implementation of the handler_alloc_hook::allocate function looks like:

return allocator.allocate(count);

and the default implementation of deallocate is:

allocator.deallocate(pointer, count);
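For anyone unfamiliar with rebinding, this is just the standard allocator mechanism. Illustratively (internal_op here stands in for the implementation-defined type; it is not an asio name):

#include <memory>

struct internal_op {};  // stands in for some_internal_asio_type

int main()
{
    // The user supplies std::allocator<void>; the library rebinds it
    // to allocate its own internal operation objects.
    typedef std::allocator<void>::rebind<internal_op>::other op_allocator;
    op_allocator a;
    internal_op* p = a.allocate(1);
    a.deallocate(p, 1);
}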
Can't we know at compile time what type are we going to manage?
Allocator::value_type is an implementation-defined type, e.g. it could be a win_iocp_socket_service::receive_operation object. But you do "know" this type at compile time since it's a template.
Also, should we define different allocators for read, write, accept, and other events?
That's up to you :) I expect judicious use of function object wrapping combined with partial specialisation could be useful to use the same allocation for different operations. E.g.:

template <typename Function>
tagged_handler<Function> tag_handler(Function f);

template <typename Function>
class asio::handler_alloc_hook<tagged_handler<Function> >
{
  ... custom allocation ...
};

async_read(s, bufs, tag_handler(boost::bind(...)));

Cheers, Chris
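A guess at how the tagged_handler wrapper above might be fleshed out -- none of this is asio's documented interface, and the argument forwarding is simplified to the one- and two-argument callback shapes used in this thread:

template <typename Function>
class tagged_handler
{
public:
    explicit tagged_handler(Function f) : f_(f) {}

    // Forward the common completion-handler signatures.
    template <typename A1>
    void operator()(const A1& a1) { f_(a1); }

    template <typename A1, typename A2>
    void operator()(const A1& a1, const A2& a2) { f_(a1, a2); }

private:
    Function f_;
};

template <typename Function>
tagged_handler<Function> tag_handler(Function f)
{
    return tagged_handler<Function>(f);
}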
participants (14)
- Beman Dawes
- Caleb Epstein
- christopher baus
- Christopher Kohlhoff
- Daryle Walker
- Dave Moore
- Ion Gaztañaga
- Jody Hagins
- Mats Nilsson
- Peter Dimov
- Reece Dunn
- Rene Rivera
- simon meiklejohn
- Victor A. Wagner Jr.