
Hi Darryl,

--- Darryl Green <darryl.green@unitab.com.au> wrote:

<snip>
I think the design as described by the concepts section is very good, but it would be better if the implementation offered finer-grained selection of the desired concepts. Perhaps this is just an aesthetic concern, and it would be easy to modify the existing services, but if these really are separate concepts, why are they bundled together? I'm concerned that there may be some lack of flexibility in the implementation/interface actually offered that isn't apparent from the design. The issue of separating out the sync vs async interface has already been raised;
This has been extensively discussed elsewhere, but in summary I can see no compelling reason why there should be a separate low-level sync-only interface, and many advantages to having them combined. However I still intend to rename the demuxer to make its purpose clearer for all sorts of applications.
I would also like to see the ability to expose a reactive rather than proactive interface by using a reactive_stream_socket_service in place of an async_stream_socket_service. The concepts would seem to allow for this.
This is not consistent with the goal of portability. As I mentioned in the "reactor versus proactor" thread, the proactor is a more portable pattern than the reactor. In the longer term I'm not averse to allowing a reactor abstraction to become part of some secondary public interface, where the restricted portability is explicitly noted.
One item that I have difficulty in discerning a clear design/rationale for is the demuxer. The demuxer as implemented is a repository of services (another woolly concept, as far as I can see) and something that provides the run, interrupt and related functions to control the scheduling/dispatching of handlers in reaction to changes in Async_Object state. Is there anything preventing a more policy-based approach to assembling services?
The demuxer is composed of services in a similar way to how std::locale is composed of facets. The demuxer is not necessarily aware of, or coupled to, the service types that it contains. For example, the SSL implementation uses a service that is only added in to the demuxer if SSL is used. The service used by a particular public interface (e.g. basic_stream_socket) is already a policy template parameter. One example program shows how basic_stream_socket can be instantiated with a custom service to perform logging, where this service is in turn implemented using the standard stream_socket_service. There's no reason why, in principle, policy-driven services could not be added to allow further customisation. But these would be inherently non-portable, and so would only be part of some secondary public interface.
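To make the facet analogy a little more concrete, here is a minimal, self-contained sketch of the general technique (this is not asio's implementation; the names service_registry, use_service and hypothetical_ssl_service are purely illustrative): services are created lazily and looked up by type, so the container need not know about any particular service in advance.

  #include <map>
  #include <string>
  #include <typeinfo>
  #include <utility>

  class service_registry
  {
  public:
    ~service_registry()
    {
      for (std::map<std::string, holder_base*>::iterator i =
             services_.begin(); i != services_.end(); ++i)
        delete i->second;
    }

    // Return the service of type Service, creating it on first use.
    // The registry itself is not coupled to any concrete service type.
    template <typename Service>
    Service& use_service()
    {
      std::string key = typeid(Service).name();
      std::map<std::string, holder_base*>::iterator i = services_.find(key);
      if (i == services_.end())
        i = services_.insert(
              std::make_pair(key, new holder<Service>)).first;
      return static_cast<holder<Service>*>(i->second)->service;
    }

  private:
    struct holder_base { virtual ~holder_base() {} };
    template <typename Service>
    struct holder : holder_base { Service service; };
    std::map<std::string, holder_base*> services_;
  };

  // An SSL-specific service, say, is only ever created if some code
  // actually asks for it.
  struct hypothetical_ssl_service { int session_count; };

  int main()
  {
    service_registry registry;
    hypothetical_ssl_service& ssl =
        registry.use_service<hypothetical_ssl_service>();
    ssl.session_count = 0;
  }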
I would like to see the aspects of a socket that have nothing to do with reading/writing better separated, as this would make for a more consistent interface for other forms of I/O such as pipes, message queues etc as well as files.
For stream-oriented I/O, this separation is presented in the form of the Stream concept. This concept is implemented by stream_socket, ssl::stream, buffered_read_stream and so on. It would be implemented by a hypothetical pipe class.
A very simple active object template that supports read and write, but relies on free functions or helper classes for less universal operations such as bind, setting options, etc., would, when used as a base class, allow appropriate (and, in my opinion, necessary) use of runtime polymorphism.
I have to disagree on runtime polymorphism: compile-time polymorphism should always be the default. Runtime polymorphism can impose additional costs (such as virtual function calls on a fine-grained interface), and eliminates potential optimisations. This adds up considerably if there are many layers of runtime polymorphism involved. Runtime polymorphism can be layered on top of compile-time polymorphism using some sort of wrapper. E.g. for the stream concept:

  class abstract_stream { ... };

  template <typename Stream>
  class stream : public abstract_stream { ... };

  stream<ssl::stream<stream_socket> > s(...);
  runtime_polymorphic_function(s);

I am not certain if this is something that should be included in asio or not, since the appropriate place to introduce runtime polymorphism can vary according to the design of an application. What do you think? I'd be happy to add it if you think it generally useful.

<snip>
The intention was for this general idea to support other types of "file descriptor" that don't offer conventional read and write interfaces, and to allow other interfaces (such as proactive ones) to use the same demuxing and related infrastructure.
I had hoped to have time to investigate introducing asio as a replacement/alternative to this library, but the timing hasn't been good for that. In any case, I don't think asio would be a good fit for a system like the following (where the other lib is being used):
Components support TCP and SSL for WAN connections, and Unix domain sockets locally. Throw in some weird stuff like communicating over special-purpose point-to-point links implemented as character devices. There are large chunks of code that have no need to care what exact type of "socket" they are using. Some of this code is itself in dynamically linked libraries. There are multiple processes using these libraries.
Short of placing a wrapper around asio to make it more amenable to separate compilation and to introduce runtime polymorphism, I can't see how I could use asio effectively in this system, or anything similar.
Would the polymorphic wrapper I proposed above address these issues?
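To make that wrapper idea a little more concrete, here is a slightly fuller sketch (a sketch only, not something that exists in asio today; the choice of a single async_read_some operation, its exact signature, and the boost::function-based handler type-erasure are my assumptions, chosen to keep the example small):

  #include <asio.hpp> // asio's convenience header (assumed)
  #include <boost/function.hpp>
  #include <cstddef>

  // The runtime-polymorphic interface. A real version would mirror the
  // whole Stream concept; only one operation is shown for brevity.
  class abstract_stream
  {
  public:
    virtual ~abstract_stream() {}
    virtual void async_read_some(void* data, std::size_t length,
        boost::function<void(const asio::error&, std::size_t)> handler) = 0;
  };

  // Adapts any type modelling the Stream concept to the interface above.
  template <typename Stream>
  class stream : public abstract_stream
  {
  public:
    explicit stream(Stream& next) : next_(next) {}

    virtual void async_read_some(void* data, std::size_t length,
        boost::function<void(const asio::error&, std::size_t)> handler)
    {
      // Forward to the compile-time-polymorphic implementation.
      next_.async_read_some(asio::buffer(data, length), handler);
    }

  private:
    Stream& next_;
  };

  // Code from here down sees only abstract_stream.
  void runtime_polymorphic_function(abstract_stream& s);

  template <typename Stream>
  void wrap_and_use(Stream& next)
  {
    stream<Stream> s(next);
    runtime_polymorphic_function(s); // virtual dispatch from here on
  }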
As far as I can see asio doesn't even allow interchangeable use of an ssl and a tcp stream.
As I mentioned above, asio addresses this using compile time polymorphism, with the Stream concept being implemented by stream_socket and ssl::stream. For example, both work equally well with free functions like asio::async_read.
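For instance (a fragment only, not complete code; header_buf_, header_length and handle_read_header are assumed to be defined as in the examples further down):

  // The same function template serves both transports; no inheritance
  // relationship between stream_socket and ssl::stream is required.
  template <typename Stream>
  void start_header_read(Stream& s)
  {
    async_read(s, buffer(header_buf_, header_length), handle_read_header);
  }

  void start(stream_socket& s) { start_header_read(s); }
  void start(ssl::stream<stream_socket>& s) { start_header_read(s); }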
I think this is unacceptable in a general-purpose library. I'm also concerned about the potential code bloat from unnecessary duplication of code (not from different template parameters, but from outright duplication in different templates).
Can you please point these out? I'm always in favour of reducing code bloat.
Implementation
The code is clean; however, I'm very concerned about the performance (or lack thereof) and the weight that a combination of the implementation and some design decisions have introduced. The documentation mentions the memory usage overhead of the proactor emulation on systems using select/epoll. However, it is clear that there is considerable allocation and construction overhead involved (and not only on those systems). While much discussion has centered on allocator performance, I'm concerned that the user/library constructor code invoked for each I/O operation is likely to be non-trivial as well. For bulk transfers, it is likely that these issues won't be too severe, but for systems where low-latency handling of small messages is critical, it is quite possible that these overheads will be comparable to or larger than any other part of the processing for a message/transaction.
I don't deny that there may be room for performance improvement in the implementation. However, I recently tested the custom allocation with Win32 I/O completion ports and found that it performs almost identically to synchronous I/O, so I don't think the public interface imposes any unnecessary inefficiencies.
The design supports, and the documentation encourages, breaking the handling of structured messages into a small header read followed by a larger read (with an appropriate handler for the expected body/chunk), while the implementation appears to perform some quite heavyweight operations compared to just reading a small header. I find this to be almost an encouragement of an antipattern for efficient I/O.
I don't agree that asio favours this approach over others. But I do believe that this approach is perfectly valid and extremely useful, since it permits asynchronous code to be written in terms of "contracts". I.e. perform this operation (or operations) and don't come back to me until it's all done or an error has occurred. For code where absolute efficiency is not required, this technique allows a clearer and more elegant expression of intent. Furthermore, the cost of multiple small read system calls can be mitigated by leveraging the compile-time polymorphism of the Stream concept and replacing:

  stream_socket my_socket;

with:

  buffered_read_stream<stream_socket> my_socket;
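To show what that substitution looks like in use (a fragment only; my_demuxer, header_buf_, header_length and handle_read_header are assumed to be defined as in the surrounding examples): because buffered_read_stream also models the Stream concept, nothing else needs to change.

  // The constructor argument shown here is an assumption; it is intended
  // to be forwarded to the underlying stream_socket.
  buffered_read_stream<stream_socket> my_socket(my_demuxer);

  // The calling code is otherwise identical; small header reads are now
  // satisfied from the internal buffer rather than by separate read()
  // system calls.
  async_read(my_socket, buffer(header_buf_, header_length),
      handle_read_header);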
I would expect an optimistic reactive approach (read the header when notified, determine the size, handler, etc., and opportunistically attempt a read of the body) to be significantly faster.
Let's take your example of a fixed-size header followed by a body, and compare and contrast a reactive approach with the alternatives offered by asio.

---------------------

1. Reactive.

Here you might have a single handler that is informed when a socket is ready for reading. In this handler you perform non-blocking reads.

  void handle_read(...)
  {
    size_t bytes_read;
    switch (state_)
    {
    case reading_header:
      // Attempt non-blocking read of header.
      bytes_read = read(sock_, header_buf_ + header_bytes_so_far_,
          header_length - header_bytes_so_far_);
      header_bytes_so_far_ += bytes_read;

      // Check if we have complete header.
      if (header_bytes_so_far_ < header_length)
        return; // Keep waiting for more.

      // Got the whole header now.
      body_length_ = decode_body_length(header_buf_);
      body_buf_ = ...; // size body buffer appropriately
      body_bytes_so_far_ = 0;
      state_ = reading_body;

      // Fall through to make opportunistic read of body.

    case reading_body:
      // Attempt non-blocking read of body.
      bytes_read = read(sock_, body_buf_ + body_bytes_so_far_,
          body_length_ - body_bytes_so_far_);
      body_bytes_so_far_ += bytes_read;

      // Check if we have complete body.
      if (body_bytes_so_far_ < body_length_)
        return; // Keep waiting for more.

      // Got whole body now.
      notify_app(...);

      // Start over.
      state_ = reading_header;
    }
  }

2. Asio with multiple reads.

This is the approach where we issue asynchronous reads for exactly the length of the header, and then the body. It may be less efficient, but only because of the additional pass through the demultiplexer, not because it makes more read() system calls than the above. However, feel free to compare in terms of readability :) The main reason it's clearer is that you no longer have to deal with short reads -- async_read's contract takes care of that for you.

  void start_read()
  {
    // Issue read for entire header.
    async_read(sock_, buffer(header_buf_, header_length),
        handle_read_header);
  }

  void handle_read_header(const error& e)
  {
    if (!e)
    {
      // Got header, determine body length.
      body_length_ = decode_body_length(header_buf_);
      body_buf_ = ...; // size body buffer appropriately

      // Issue read for entire body.
      async_read(sock_, buffer(body_buf_, body_length_),
          handle_read_body);
    }
  }

  void handle_read_body(const error& e)
  {
    if (!e)
    {
      // Got whole body now.
      notify_app(...);

      // Start over.
      start_read();
    }
  }

3. Asio with multiple reads + opportunistic body read.

This extends number 2 by adding in the opportunistic read of the body. In effect this makes it equivalent to your reactive read approach, but I think you'll still find it's easier to follow -- again because most of the code to handle short reads is eliminated.

  void start_read()
  {
    // Issue read for entire header.
    async_read(sock_, buffer(header_buf_, header_length),
        handle_read_header);
  }

  void handle_read_header(const error& e)
  {
    if (!e)
    {
      // Got header, determine body length.
      body_length_ = decode_body_length(header_buf_);
      body_buf_ = ...; // size body buffer appropriately

      // Make opportunistic non-blocking read of body.
      error read_error;
      size_t bytes_read = sock_.read_some(
          buffer(body_buf_, body_length_), assign_error(read_error));

      // Did we get whole body?
      if (bytes_read == body_length_ || read_error)
      {
        handle_read_body(read_error);
      }
      else
      {
        // Issue read for rest of body.
        async_read(sock_,
            buffer(body_buf_ + bytes_read, body_length_ - bytes_read),
            handle_read_body);
      }
    }
  }

  void handle_read_body(const error& e)
  {
    if (!e)
    {
      // Got whole body now.
      notify_app(...);

      // Start over.
      start_read();
    }
  }

4. Asio with bulk reads.
This approach is exemplified by the HTTP server example. Essentially it reads as much as is available (up to some limit of course) in one go, and then processes it using some state machine similar to that in number 1. It is generally the most efficient since it requires the fewest number of read system calls or passes through the demuxer.

  void handle_read(const error& e, size_t bytes_read)
  {
    if (!e)
    {
      process_data(buffer_, bytes_read);
      async_read(sock_, buffer(buffer_, buffer_length), handle_read);
    }
  }

5. Asio with transfer_at_least(...).

In certain applications you may be able to combine the opportunistic read into the initial asynchronous read operation, by using the transfer_at_least() completion condition. This executes the operation with a contract that it transfer a minimum number of bytes. In this way you can avoid short read handling for the header, while still maximising the number of bytes transferred in a single operation.

  void start_read()
  {
    // Issue read for entire header, and maybe we'll get some
    // body as a bonus.
    boost::array<mutable_buffer, 2> bufs = {
      buffer(header_buf_, header_length),
      buffer(body_buf, max_body_length) };
    async_read(sock_, bufs, transfer_at_least(header_length),
        handle_read_header);
  }

  ...

---------------------
It is implied that an appropriate demuxer/demuxer service can be used to implement various multi-threaded processing schemes. This may well be true. However, an alternative approach is to simply consider the affinity of an Async_Object and its demuxer(s) (one for read, one for write) to be dynamic and run a number of separate (per thread) demuxers, moving Async_Objects between them as required. In this way the handlers for a given Async_Object will only run in the expected context.
This is not portable. With Win32 I/O completion ports for example, once a socket is associated with an I/O completion port it cannot be disassociated. The demuxer design reflects this.
I imagine other approaches are possible to achieve similar results, but it is not clear (to me) from the design documentation or a quick look at the code that this is the case. I haven't reviewed the code to a level of detail where I can be sure I am understanding the dispatching done by the epoll and select reactors, but it looks to me as though multiple threads may all wait on the same fdset (or equivalent)? Is this right? Intended? I had expected some form of modified leader/followers pattern here?
No, only one thread in the pool waits on select/epoll/kqueue. Any other threads in the pool wait on a condition.
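For anyone trying to picture that arrangement, here is a highly simplified, self-contained sketch of the pattern being described (this is not asio's code, and it uses modern standard-library primitives purely for brevity): one thread at a time blocks in the demultiplexer while the remaining pool threads block on a condition variable until there are handlers to run.

  #include <condition_variable>
  #include <functional>
  #include <mutex>
  #include <queue>

  std::mutex m;
  std::condition_variable cv;
  std::queue<std::function<void()> > ready_handlers;
  bool a_thread_is_polling = false;

  // Hypothetical: blocks in select/epoll/kqueue, pushes completed
  // handlers onto ready_handlers (locking m) and notifies cv.
  void block_in_demultiplexer();

  void pool_thread_run()
  {
    std::unique_lock<std::mutex> lock(m);
    for (;;)
    {
      if (!ready_handlers.empty())
      {
        // Run one completed handler outside the lock.
        std::function<void()> h = ready_handlers.front();
        ready_handlers.pop();
        lock.unlock();
        h();
        lock.lock();
      }
      else if (!a_thread_is_polling)
      {
        // This thread becomes the single thread waiting on the
        // demultiplexer.
        a_thread_is_polling = true;
        lock.unlock();
        block_in_demultiplexer();
        lock.lock();
        a_thread_is_polling = false;
        cv.notify_all();
      }
      else
      {
        // All other threads in the pool wait on the condition.
        cv.wait(lock);
      }
    }
  }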
Conclusion
As I mentioned above, had the review been at a different time I would have at least attempted to port some code (even if only test/benchmark code) from an in-house reactor library to asio. As it is, my comments are based on a fair bit of experience with network and async (in its most general sense) I/O, from the bit level (literally) up to the application level, on a number of OSs/kernels/platforms and libraries (including ACE), and on a few hours spent reading the code and docs. I have a healthy distrust of my (and everyone else's) guesses regarding performance, and the author and review manager should consider my points above regarding performance as areas for investigation only.
A fair amount of the above appears to be a request for reactor support (which it is, partly), but I would consider mechanisms that support very lightweight proactor use highly desirable irrespective of reactor support. In particular, it would seem that some form of re-use of a single handler and buffer for an Async_Object instance, without any allocation/(copy)construction of anything heavier than a pointer, should be quite practical.
Asio's interface already supports "lightweight", reusable handlers, since ultimately a handler is just a function object. If you wish, you can implement your handlers so that they are no more expensive than a pointer:

  class my_handler
  {
  public:
    explicit my_handler(my_class* p) : p_(p) {}
    void operator()(const asio::error& e) { p_->handle_event(e); }
  private:
    my_class* p_;
  };

<snip>

Thanks for taking the time to write such an extensive review.

Cheers,
Chris