
Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
- One idea from [Boost.IOStreams] to consider is the direct/indirect device distinction.
I never noticed this distinction before. It seems useful, but there are issues not unlike the AsyncIO issues. Direct devices provide a different interface. A programmer can take advantage of this interface for some purposes, but for most, I fear, the advantages would be lost. Consider:
- A direct device cannot be wrapped by filters that do dynamic data rewriting (such as (de)compression). The random access aspect would be lost.
- A direct device cannot participate in the larger stack without propagating the direct access model throughout the stack. (And this stops at the text level anyway, because the character recoder does dynamic data rewriting.) Propagating another interface means a lot of additional implementation effort and complexity.
Okay. I'm inclined to agree with this.
- Binary transport layer issue:
Platforms with unusual features, like 9-bit bytes or the inability to handle types smaller than 32 bits, can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the boost library. Clearly, the text encoding and decoding would have to be done differently anyway.
A good point, but it does mean that the text layer dictates how the binary layer has to work. Not really desirable when pure binary I/O has nothing to do with text I/O.
I'm not sure what you mean by this exactly.
One approach that occurs to me would be to make the binary transport layer use a platform-specific byte type (octets, nonets, whatever) and have the binary formatting layer convert this into data suitable for character coding.
It seems that trying to support unusual architectures at all may be extremely difficult; see my other post. If you can find a clean way to support these unusual architectures, then all the better. But it seems very hard to support e.g. UTF-8 on a platform with 9-bit bytes, or on one that cannot handle types smaller than 32 bits.
- Seeking:
Maybe make multiple mark/reset use the same interface as seeking, for simplicity. Just define that a seeking device has the additional restriction that the mark type is an offset, and the argument to seek need not be the result of a call to tell.
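The unification suggested above can be sketched roughly as follows. This is a hypothetical illustration, not the proposed API: a seekable device whose tell()/seek() double as mark()/reset(), with the mark type being a plain offset.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a seekable device models the mark/reset interface,
// with the extra guarantee that mark_type is a transparent offset and
// seek() accepts values that never came from a call to tell().
class memory_device {
public:
    using mark_type = std::uint64_t;   // transparent offset

    explicit memory_device(std::vector<char> data) : data_(std::move(data)) {}

    std::size_t read(char* out, std::size_t n) {
        std::size_t avail = data_.size() - pos_;
        if (n > avail) n = avail;
        std::copy(data_.begin() + pos_, data_.begin() + pos_ + n, out);
        pos_ += n;
        return n;
    }

    mark_type tell() const { return pos_; }        // doubles as mark()
    void seek(mark_type where) { pos_ = static_cast<std::size_t>(where); } // doubles as reset()

private:
    std::vector<char> data_;
    std::size_t pos_ = 0;
};
```

A component that needs only mark/reset could then accept this device unchanged, since reset(mark()) is just seek(tell()).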
Another issue is whether to standardize the return type from tell, like std::ios_base::streampos in the C++ iostreams library.
These are incompatible requirements, and the reason I want to keep the interfaces separate. Standardizing the return type of tell() is a good idea, and necessary both for type erasure to work efficiently and for the simple use of arbitrary values in seek(). The type must be transparent.
The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support.
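The socket example above can be sketched as a wrapper that supplies mark/reset on top of a non-seekable source by buffering what has been read. All names here (mark_buffer, the source callback) are illustrative assumptions, not part of any proposed interface:

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical sketch: adds mark/reset to a non-seekable source (e.g. a
// socket) by buffering everything read since the oldest outstanding mark,
// so a recursive descent parser can backtrack.
class mark_buffer {
public:
    using source_fn = std::function<std::size_t(char*, std::size_t)>;
    using mark_type = std::size_t;   // would be opaque in the real design

    explicit mark_buffer(source_fn src) : src_(std::move(src)) {}

    std::size_t read(char* out, std::size_t n) {
        std::size_t got = 0;
        while (got < n) {
            if (pos_ < buf_.size()) {            // replay buffered data
                out[got++] = buf_[pos_++];
            } else {                             // pull fresh data from source
                char c;
                if (src_(&c, 1) == 0) break;     // source exhausted
                buf_.push_back(c);
                ++pos_;
                out[got++] = c;
            }
        }
        return got;
    }

    mark_type mark() const { return pos_; }      // remember current position
    void reset(mark_type m) { pos_ = m; }        // rewind to a mark
    // A real implementation would discard buffered data older than the
    // oldest live mark; this sketch keeps everything for simplicity.

private:
    source_fn src_;
    std::deque<char> buf_;
    std::size_t pos_ = 0;
};
```

Because mark() here is just a buffer position, it cannot be fed to a seek() that expects absolute stream offsets, which is the point of keeping it opaque.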
I see. It still seems that using different names means that a component requiring only mark/reset support cannot use a stream providing seek/tell support without an additional intermediate layer. [snip]
It should probably also be possible to determine using the library at compile time what the native format is.
To what end? If the native format is one of the special predefined ones, it will hopefully be optimized in the platform-aware special implementation (well, I can dream) anyway.
The reason would be for a protocol in which little/big endian is specified as part of the message/data, and a typical implementation would always write in native format (and so it would need to determine which is the native format), but support both formats for reading.
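That use case might look like the following sketch. The detection here is a runtime check for illustration; a library could expose the same answer as a compile-time constant. All names are assumptions, not proposed API:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical sketch: a protocol tags each message with its byte order,
// the writer always emits native order, and the reader handles both.
enum class byte_order { little, big };

inline byte_order native_order() {
    const std::uint16_t probe = 0x0102;
    unsigned char first;
    std::memcpy(&first, &probe, 1);   // inspect the lowest-addressed octet
    return first == 0x02 ? byte_order::little : byte_order::big;
}

inline std::uint32_t byteswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0xFF00u)
         | ((v << 8) & 0xFF0000u) | (v << 24);
}

// Reader side: convert a value from the order recorded in the message
// header into native order, swapping only when they differ.
inline std::uint32_t to_native(std::uint32_t wire, byte_order wire_order) {
    return wire_order == native_order() ? wire : byteswap32(wire);
}
```

The writer side is the degenerate case: it records native_order() in the header and writes values untouched.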
- Header vs Precompiled:
I think as much should be separately compiled as possible, but I also think that type erasure should not be used in any case where it will significantly compromise performance.
I'm thinking of a system where components are templates on the component they wrap, so as to allow direct calls upwards. I'm thinking of using the common separately-compiled-template-specialization extension of compilers to provide pre-compiled versions of the standard components instantiated with the erasure components. This is very similar to how Spirit works, except that Spirit doesn't provide pre-compiled instantiations. In Spirit, rule is the erasure type, but the various parsers can be directly linked, too.
Ideally, the cost of the virtual function calls would normally be mitigated by calling e.g. read/write with a large number of elements at once, rather than with only a single element.
Then, if the performance is needed, the programmer can hand-craft his chain so that no virtual calls are made, at the cost of compiling his own copy of the components.
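The two composition styles described above can be sketched side by side: a filter templated on what it wraps (direct, inlinable calls) versus a type-erasure boundary (one virtual call per read, amortized by bulk reads). All names are illustrative assumptions:

```cpp
#include <cstddef>
#include <memory>
#include <utility>

struct counting_source {                       // leaf component
    std::size_t read(char* out, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = static_cast<char>(next_++);
        return n;
    }
    unsigned char next_ = 0;
};

template <class Upstream>                      // filter templated on its upstream
struct xor_filter {
    explicit xor_filter(Upstream up) : up_(std::move(up)) {}
    std::size_t read(char* out, std::size_t n) {
        std::size_t got = up_.read(out, n);    // direct call, can inline
        for (std::size_t i = 0; i < got; ++i) out[i] ^= 0x55;
        return got;
    }
    Upstream up_;
};

struct erased_stream {                         // type-erasure boundary
    struct concept_t {
        virtual ~concept_t() = default;
        virtual std::size_t read(char*, std::size_t) = 0;
    };
    template <class T>
    struct model_t : concept_t {
        explicit model_t(T t) : t_(std::move(t)) {}
        std::size_t read(char* o, std::size_t n) override { return t_.read(o, n); }
        T t_;
    };
    template <class T>
    explicit erased_stream(T t)
        : p_(std::make_unique<model_t<T>>(std::move(t))) {}
    std::size_t read(char* o, std::size_t n) { return p_->read(o, n); }  // one virtual call
    std::unique_ptr<concept_t> p_;
};
```

Reading 4096 bytes through erased_stream costs one virtual dispatch rather than 4096, which is the bulk-call mitigation; a hand-crafted xor_filter<counting_source> chain avoids even that, at the cost of instantiating the templates in the user's own translation unit.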
I'm afraid I don't see a better way of doing this. I'm wide open to suggestions.
- The "byte" stream and the character stream, while conceptually different, should probably both be considered just "streams" of particular POD types.
I have explained in a different post why I don't think this is a good idea.
- Text transport:
I don't think this layer should be restricted to Unicode encodings.
I have no plans of doing so. I just consider all encodings as encodings of the universal character set. An encoding is defined by how it maps the UCS code points onto groups of octets, words, or other primitives.
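As a concrete instance of "an encoding maps UCS code points onto groups of octets", here is a hand-rolled UTF-8 encoder for a single code point (validity checks for surrogates and out-of-range values omitted for brevity):

```cpp
#include <string>

// Illustration: UTF-8 defines how each UCS code point maps to a
// sequence of one to four octets. Other encodings define different
// mappings over (subsets of) the same code point space.
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                            // 1 octet: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                    // 2 octets: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                  // 3 octets
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                    // 4 octets
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

A legacy encoding such as Latin-1 would be described the same way: a mapping from a 256-code-point subset of the UCS onto single octets.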
Is it in fact the case that all character encodings worth supporting encode only a subset of Unicode? (That is, does there exist no useful encoding that can represent a character Unicode cannot?) In any case, it is not clear why one needs to think of an arbitrary character encoding in terms of Unicode, except when explicitly converting between that encoding and a Unicode encoding. [snip] -- Jeremy Maitin-Shepard