Re: [boost] [rfc] I/O Library Design

24 Jun 2007

      Jeremy Maitin-Shepard wrote:
...
- One idea from [Boost.IOStreams] to consider is the
  direct/indirect device distinction.
I never noticed this distinction before. It seems useful, but there are
issues not unlike the AsyncIO issues.
Direct devices provide a different interface. A programmer can take
advantage of this interface for some purposes, but for most, I fear, the
advantages would be lost. Consider:
- A direct device cannot be wrapped by filters that do dynamic data
rewriting (such as (de)compression). The random access aspect would be lost.
- A direct device cannot participate in the larger stack without
propagating the direct access model throughout the stack. (And this
stops at the text level anyway, because the character recoder does
dynamic data rewriting.) Propagating another interface means a lot of
additional implementation effort and complexity.
...
- Binary transport layer issue:
Platforms with unusual features, like 9-bit bytes or inability to
  handle types less than 32-bits in size can possibly still implement
  the interface for a text/character transport layer, possibly on top of
  some other lower-level transport that need not be part of the boost
  library.  Clearly, the text encoding and decoding would have to be
  done differently anyway.
A good point, but it does mean that the text layer dictates how the
binary layer has to work. Not really desirable when pure binary I/O has
nothing to do with text I/O.
One approach that occurs to me would be to make the binary transport
layer use a platform-specific byte type (octets, nonets, whatever) and
have the binary formatting layer convert this into data suitable for
character coding.
...
- Seeking:
Maybe make multiple mark/reset use the same interface as seeking, for
  simplicity.  Just define that a seeking device has the additional
  restriction that the mark type is an offset, and the argument to seek
  need not be the result of a call to tell.
Another issue is whether to standardize the return type from tell,
  like std::ios_base::streampos in the C++ iostreams library.
These are incompatible requirements, and the reason I want to keep the
interfaces separate. Standardizing the tell return type is a good idea:
it is necessary for type erasure to work efficiently and for simple use
of arbitrary values in seek(). The type must be transparent.
The return type of mark(), on the other hand, can and should be opaque.
This allows for many interesting things to be done. For example:
Consider a socket. It has no mark/reset, let alone seeking support.
You have a recursive descent parser that requires multiple mark/reset
support.
You want to parse data coming from the socket using this parser.

Typically, this will mean that you have to receive all data (which may
be tricky, if parsing is required to tell you how much data to expect)
and then parse the buffer.

With the I/O stack, you could instead write a filter that implements
mark/reset support on top of an arbitrary device.

Let's consider the simplest case: a filter that implements single
mark/reset. The filter contains a simple extensible buffer (such as a
std::vector, or perhaps a linked list of fixed-size buffers). When
mark() is first called, it starts buffering all data that is read
through it, in addition to returning it. When reset() is called, it
simply starts feeding the buffered data to read requests until it runs
out, at which point it goes back to calling the underlying device. When
mark() is called again, it can discard all data buffered so far. (E.g.
it might drop from the linked list all completely filled buffers and
free their memory.)
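
A minimal sketch of such a single mark/reset filter, to make the
behaviour concrete. None of these names come from an actual library:
string_source is a hypothetical stand-in for a forward-only device such
as a socket, and mark_reset_filter, its read() signature and the
std::vector buffer are assumptions chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a forward-only device such as a socket.
class string_source {
public:
    explicit string_source(std::string data) : data_(std::move(data)) {}

    // Copies up to n bytes into out and returns how many were copied.
    std::size_t read(char* out, std::size_t n) {
        std::size_t count = 0;
        while (count < n && pos_ < data_.size()) out[count++] = data_[pos_++];
        return count;
    }

private:
    std::string data_;
    std::size_t pos_ = 0;
};

// Single mark/reset on top of any Source offering read(char*, size_t).
template <class Source>
class mark_reset_filter {
public:
    explicit mark_reset_filter(Source& src) : src_(src) {}

    // Start (or restart) buffering at the current read position.
    // Everything already delivered before this point is discarded.
    void mark() {
        buffer_.erase(buffer_.begin(),
                      buffer_.begin() + static_cast<std::ptrdiff_t>(next_));
        next_ = 0;
        marked_ = true;
    }

    // Go back to the marked position; later reads replay the buffer.
    void reset() {
        assert(marked_ && "reset() requires a previous mark()");
        next_ = 0;
    }

    std::size_t read(char* out, std::size_t n) {
        std::size_t count = 0;
        // First feed buffered data until it runs out.
        while (count < n && next_ < buffer_.size()) out[count++] = buffer_[next_++];
        // Then fall back to the underlying device.
        if (count < n) {
            const std::size_t got = src_.read(out + count, n - count);
            if (marked_) {
                // Keep a copy so that a later reset() can replay it.
                buffer_.insert(buffer_.end(), out + count, out + count + got);
                next_ = buffer_.size();
            }
            count += got;
        }
        return count;
    }

private:
    Source& src_;
    std::vector<char> buffer_;  // data read since the last mark()
    std::size_t next_ = 0;      // next buffered byte to deliver
    bool marked_ = false;
};
```

With this sketch, reading "ab" from a source holding "abcdefgh",
calling mark(), reading "cde" and then calling reset() makes the next
read start at "c" again, while the socket-like source itself is never
asked to go back.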

A more complex case is multiple mark/reset. One variant is a filter that
just starts buffering on the first mark() and returns the offset into
the buffer where the data starts (i.e. 0). On every subsequent mark(),
it returns a higher index. A reset() is passed the offset, so it starts
reading from this offset.
The obvious problem is that, from the first mark() on, all data has to
be buffered. In situations where memory is scarce, this may not be
desirable. If the filter knew how many marks still exist, it could
discard buffered data that is no longer needed.
If the mark type is opaque, it can be a smart-pointer-like object with a
reference count. For every buffer chunk, there could be one reference
count. If the count for a chunk drops to zero, the filter knows the data
in that chunk is no longer needed and can free the memory.
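
One way this could look, as a rough sketch only: here std::shared_ptr
supplies the per-chunk reference count, and chunks are linked forward,
so a mark keeps its own chunk and everything after it alive. The names
chunk, multi_mark_filter and mark_type, the string_source test device
and the live counter (present only so that the freeing can be observed)
are all assumptions for illustration, not a proposed interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical forward-only device, standing in for a socket.
class string_source {
public:
    explicit string_source(std::string data) : data_(std::move(data)) {}
    std::size_t read(char* out, std::size_t n) {
        std::size_t count = 0;
        while (count < n && pos_ < data_.size()) out[count++] = data_[pos_++];
        return count;
    }
private:
    std::string data_;
    std::size_t pos_ = 0;
};

// One buffer chunk. The shared_ptr control block is its reference count.
struct chunk {
    static int live;              // instrumentation for this demo only
    chunk() { ++live; }
    ~chunk() { --live; }
    std::string data;             // bytes read from the device
    std::shared_ptr<chunk> next;  // later data; null until it is read
};
int chunk::live = 0;

template <class Source>
class multi_mark_filter {
public:
    // Opaque mark: callers can copy and destroy it, nothing else.
    class mark_type {
        friend class multi_mark_filter;
        mark_type(std::shared_ptr<chunk> c, std::size_t o)
            : chunk_(std::move(c)), offset_(o) {}
        std::shared_ptr<chunk> chunk_;
        std::size_t offset_;
    };

    multi_mark_filter(Source& src, std::size_t chunk_size)
        : src_(src), chunk_size_(chunk_size), cur_(std::make_shared<chunk>()) {
        assert(chunk_size > 0);
    }

    mark_type mark() const { return mark_type(cur_, off_); }

    void reset(const mark_type& m) {
        cur_ = m.chunk_;
        off_ = m.offset_;
    }

    std::size_t read(char* out, std::size_t n) {
        std::size_t count = 0;
        while (count < n) {
            if (off_ == cur_->data.size()) {
                if (!cur_->next) {
                    // Nothing buffered ahead: fetch a new chunk.
                    auto fresh = std::make_shared<chunk>();
                    fresh->data.resize(chunk_size_);
                    const std::size_t got = src_.read(&fresh->data[0], chunk_size_);
                    if (got == 0) break;  // end of input
                    fresh->data.resize(got);
                    cur_->next = fresh;
                }
                // Leaving a chunk that no mark refers to frees it here.
                cur_ = cur_->next;
                off_ = 0;
            }
            out[count++] = cur_->data[off_++];
        }
        return count;
    }

private:
    Source& src_;
    std::size_t chunk_size_;
    std::shared_ptr<chunk> cur_;  // chunk currently being read
    std::size_t off_ = 0;         // read position inside cur_
};
```

In this sketch the filter never has to count marks itself. Once the
last mark pointing into an old chunk is destroyed and the read position
has moved past it, the chunk's count reaches zero and its memory is
released. A transparent offset, as a seeking interface would need,
could not carry that ownership information.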

I want to keep this flexibility. I want mark/reset to stay separate.
...
- Binary formatting (perhaps the name data format would be better?):
Sounds like a good name.
...
I think it is important to provide a way to format
  {uint,int}{8,16,32,64}_t as either little or big endian two's
  complement (and possibly also one's complement).
Yes, that's pretty much the idea behind this layer.
  It might be useful
  to look at the not-yet-official boost endian library in the vault.
I will.
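
To illustrate what such formatting amounts to, here is a small sketch
that writes fixed-width integers as little or big endian two's
complement octets. The names byte_order, format_unsigned and
format_signed are made up for this example, and the code assumes a
platform with 8-bit bytes, which is exactly the assumption the
transport layer discussion above is careful not to make.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

enum class byte_order { little, big };

// Formats an unsigned integer as sizeof(UInt) octets in the given order.
template <class UInt>
std::array<std::uint8_t, sizeof(UInt)> format_unsigned(UInt value, byte_order order) {
    static_assert(std::is_unsigned<UInt>::value, "use format_signed for signed types");
    const std::size_t n = sizeof(UInt);
    std::array<std::uint8_t, sizeof(UInt)> out{};
    for (std::size_t i = 0; i < n; ++i) {
        // Octet i, counted from the least significant end of the value.
        const std::uint8_t octet =
            static_cast<std::uint8_t>((value >> (8 * i)) & 0xFFu);
        out[order == byte_order::little ? i : n - 1 - i] = octet;
    }
    return out;
}

// Signed values: conversion to the unsigned type of the same width is
// defined modulo 2^N, which yields the two's complement bit pattern
// regardless of the native representation.
template <class Int>
std::array<std::uint8_t, sizeof(Int)> format_signed(Int value, byte_order order) {
    static_assert(std::is_signed<Int>::value, "use format_unsigned for unsigned types");
    using UInt = typename std::make_unsigned<Int>::type;
    return format_unsigned<UInt>(static_cast<UInt>(value), order);
}

// Usage example.
inline void demo_format() {
    const auto be = format_unsigned<std::uint32_t>(0x11223344u, byte_order::big);
    assert(be[0] == 0x11 && be[3] == 0x44);
    const auto neg = format_signed<std::int16_t>(-2, byte_order::little);
    assert(neg[0] == 0xFE && neg[1] == 0xFF);
}
```

Because the octets are computed with shifts rather than by copying
memory, the result does not depend on the native byte order. A
platform-aware implementation could replace the loop with a plain copy
where the native format already matches.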
...
It should probably also
  be possible to determine using the library at compile time what the
  native format is.
To what end? If the native format is one of the special predefined ones,
it will hopefully be optimized in the platform-aware special
implementation (well, I can dream) anyway.
- Header vs Precompiled:
I think as much should be separately compiled as possible, but I also
  think that type erasure should not be used in any case where it will
  significantly compromise performance.
I'm thinking of a system where components are templates on the component
they wrap, so as to allow direct calls upwards. I'm thinking of using
the separately compiled template instantiation extension that most
compilers offer to provide pre-compiled versions of the standard
components instantiated with the erasure components. This is very
similar to how Spirit works, except that it doesn't have pre-compiled
stuff. In Spirit, rule is the erasure type, but the various parsers can
be directly linked, too.

Then, if the performance is needed, the programmer can hand-craft his
chain so that no virtual calls are made, at the cost of compiling his
own copy of the components.
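
A sketch of how this could be arranged, assuming the compiler extension
in question is explicit instantiation with "extern template" (a common
extension that later became standard in C++11). All names here
(any_source, erased, counting_filter, string_source) are invented for
the example, and everything is shown in one file although the two
instantiation lines would live in a header and a library source file
respectively.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Erasure component: hides the concrete source behind virtual calls.
class any_source {
public:
    virtual ~any_source() = default;
    virtual std::size_t read(char* out, std::size_t n) = 0;
};

// Adapter that puts any concrete source behind the erasure interface.
template <class Source>
class erased : public any_source {
public:
    explicit erased(Source src) : src_(std::move(src)) {}
    std::size_t read(char* out, std::size_t n) override { return src_.read(out, n); }
private:
    Source src_;
};

// Hypothetical concrete device.
class string_source {
public:
    explicit string_source(std::string data) : data_(std::move(data)) {}
    std::size_t read(char* out, std::size_t n) {
        std::size_t count = 0;
        while (count < n && pos_ < data_.size()) out[count++] = data_[pos_++];
        return count;
    }
private:
    std::string data_;
    std::size_t pos_ = 0;
};

// A component that is a template on the component it wraps.
template <class Source>
class counting_filter {
public:
    explicit counting_filter(Source& src) : src_(src) {}
    std::size_t read(char* out, std::size_t n) {
        // Direct call for a concrete Source, virtual call for any_source.
        const std::size_t got = src_.read(out, n);
        total_ += got;
        return got;
    }
    std::size_t total() const { return total_; }
private:
    Source& src_;
    std::size_t total_ = 0;
};

// Header: promise that the erased version is compiled elsewhere.
extern template class counting_filter<any_source>;
// Library source file: the one pre-compiled instantiation.
template class counting_filter<any_source>;

// Usage example: the same filter, once erased and once hand-crafted.
inline void demo_chain() {
    char buf[8];
    erased<string_source> dev(string_source("world!"));
    counting_filter<any_source> precompiled(dev);  // links to library code
    assert(precompiled.read(buf, 8) == 6);
    string_source raw("hello");
    counting_filter<string_source> direct(raw);    // compiled by the user
    assert(direct.read(buf, 8) == 5);
}
```

The trade-off is the one described above: counting_filter<any_source>
costs a virtual call per read but needs no recompilation, while
counting_filter<string_source> can be inlined completely at the price
of instantiating the template in the user's own translation unit.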

I'm afraid I don't see a better way of doing this. I'm wide open to
suggestions.
...
- The "byte" stream and the character stream, while conceptually
  different, should probably both be considered just "streams" of
  particular POD types.
I have explained in a different post why I don't think this is a good idea.
- Text transport:
I don't think this layer should be restricted to Unicode encodings.
I have no plans to do so. I just consider all encodings as encodings
of the universal character set. An encoding is defined by how it maps
the UCS code points onto groups of octets, words, or other primitives.
...
For full generality, the library should provide facilities
  for converting between any two of a large list of encodings.
No, I'll leave that to a character support library. (Parts of the
library I will have to specify and implement to build the text converter
on, but that can wait. The binary stuff comes first.)
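
To make the "encoding as a mapping from UCS code points to groups of
octets" view concrete, here is a small sketch with two such mappings.
UTF-8 and ISO 8859-1 are used only because their rules are short. The
function names and the convention of returning an empty vector for
code points the encoding cannot represent are assumptions of this
example, not part of any proposed interface.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using octets = std::vector<std::uint8_t>;

// UTF-8: every code point maps to a group of one to four octets.
// Returns an empty group for surrogates and values above U+10FFFF.
inline octets encode_utf8(std::uint32_t cp) {
    auto o = [](std::uint32_t v) { return static_cast<std::uint8_t>(v); };
    if (cp < 0x80) return {o(cp)};
    if (cp < 0x800) return {o(0xC0 | (cp >> 6)), o(0x80 | (cp & 0x3F))};
    if (cp >= 0xD800 && cp <= 0xDFFF) return {};
    if (cp < 0x10000)
        return {o(0xE0 | (cp >> 12)), o(0x80 | ((cp >> 6) & 0x3F)),
                o(0x80 | (cp & 0x3F))};
    if (cp <= 0x10FFFF)
        return {o(0xF0 | (cp >> 18)), o(0x80 | ((cp >> 12) & 0x3F)),
                o(0x80 | ((cp >> 6) & 0x3F)), o(0x80 | (cp & 0x3F))};
    return {};
}

// ISO 8859-1: a non-Unicode encoding seen the same way. It maps the
// first 256 code points to one octet each and nothing else.
inline octets encode_latin1(std::uint32_t cp) {
    if (cp <= 0xFF) return {static_cast<std::uint8_t>(cp)};
    return {};
}

// Usage example: the same code point under two encodings.
inline void demo_encodings() {
    assert((encode_utf8(0xE9) == octets{0xC3, 0xA9}));  // U+00E9
    assert((encode_latin1(0xE9) == octets{0xE9}));
    assert(encode_latin1(0x20AC).empty());              // U+20AC has no Latin-1 form
}
```

Seen this way, converting between two encodings is decoding to code
points with one mapping and encoding with the other, which is why the
conversion tables themselves can be left to a character support
library.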
...
- Text formatting:
For text formatting, I think it would be very useful to look at the
  IBM ICU library.
I have. Some interesting ideas there.
  It may in fact make sense to leave text formatting
  as a separate library
Each of my layers can be considered a separate library. That's the way
I'll implement them, too. I just design and present them at once because
I need to consider the requirements of the higher levels on the lower
levels.
Sebastian Redl