
Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
>> I think part of the issue may be the name "binary". A better name may be "byte" I/O or a "byte" stream.

> Originally, the binary transport layer was called the byte transport layer. I decided against this name for the simple reason that, as far as the C++ standard is concerned, a byte is pretty much the same as an unsigned char.
By byte, I really meant octet. You could then call it an octet stream, but it may be useful for at least the basic stream concept to support streams of arbitrary POD types. It seems like it would introduce significant complications to try to support any architecture with a non-octet byte (i.e. an architecture that does not have an 8-bit type). Perhaps it is best for this library to just completely ignore such architectures, since they are extremely rare anyway.
> Because the exact unit of transport is still an open question (and the current tendency I see is toward using octets, leaving native bytes to some other mechanism), I didn't want any such implication in the name. The name "binary" isn't a very good choice either, I admit; in the end, all data is binary. But the distinction between "binary" and "textual" data is important, and not only at the concept level. What I have in mind works something like this: binary data is in terms of octets, bytes, primitives, or PODs, whatever the unit may be.
Perhaps the name "data stream" would be appropriate, or better yet, perhaps just "stream", using the qualified name "text stream" or "character stream" to refer to streams of characters that are somehow marked (either at compile time or run time) with their encoding.

This brings up the following issue, which didn't seem to be addressed very much in your responses: should text streams of arbitrary (non-Unicode) encodings be supported? Also, should text streams support arbitrary base types (i.e. uint8_t, uint16_t, or some other type), or should they be restricted to a single type (like uint16_t)?

The reason for substantial unification of "data streams" and "text streams" is that despite differences in how the data they transport is used, the interface should essentially be the same (basic read/write, as well as things like mark/reset and seeking), and a buffering facility should be exactly the same for both types of streams. Similarly, a facility that provides mark/reset support, via buffering, on top of a stream that lacks it would be exactly the same for both "binary" and "text" streams. Even if seeking may not be as useful for text streams, it still might be useful to some people, and there is no reason to exclude it. In the document you posted, for instance, you essentially just duplicated much of the description of binary streams for the text streams, which suggests a problem.

As I suggest below, a "text stream" might always be a very thin layer on top of a binary stream that simply specifies an encoding. The issue, though, is how it would work to layer something like a regular buffer or a mark/reset-providing buffer on top of a text stream. There shouldn't have to be two mark/reset providers, one for data streams and one for text streams; it should also be possible to layer such a thing on top of a text stream directly and still maintain the encoding annotation.
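To illustrate the unification argument, here is a rough sketch of a single mark/reset provider templated on the underlying stream type. Everything here is hypothetical: the names, and the assumed Stream concept (a unit_type typedef plus a read(unit_type*, std::size_t) member), are invented for the example, not taken from any proposal.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: one mark/reset provider that works over any
// stream type -- "data" or "text" alike -- as long as the stream has a
// unit_type typedef and a read(unit_type*, std::size_t) member.
template <typename Stream>
class mark_reset_buffer {
    Stream& src_;
    std::vector<typename Stream::unit_type> replay_;  // units seen since mark()
    std::size_t replay_pos_ = 0;
    bool marked_ = false;
public:
    using unit_type = typename Stream::unit_type;
    explicit mark_reset_buffer(Stream& s) : src_(s) {}
    void mark()  { marked_ = true; replay_.clear(); replay_pos_ = 0; }
    void reset() { replay_pos_ = 0; }  // rewind to the mark
    std::size_t read(unit_type* buf, std::size_t n) {
        std::size_t got = 0;
        // Serve units recorded since the mark first.
        while (got < n && replay_pos_ < replay_.size())
            buf[got++] = replay_[replay_pos_++];
        // Then pull fresh units from the source, recording them if marked.
        if (got < n) {
            std::size_t fresh = src_.read(buf + got, n - got);
            if (marked_) {
                replay_.insert(replay_.end(), buf + got, buf + got + fresh);
                replay_pos_ = replay_.size();
            }
            got += fresh;
        }
        return got;
    }
};

// A toy in-memory source, just so the template can be exercised.
struct char_source {
    using unit_type = char;
    const char* data;
    std::size_t size;
    std::size_t pos;
    std::size_t read(char* buf, std::size_t n) {
        if (n > size - pos) n = size - pos;
        for (std::size_t i = 0; i < n; ++i) buf[i] = data[pos + i];
        pos += n;
        return n;
    }
};
```

Because the buffer is parameterized on the stream type, the same code would wrap an octet stream or an encoding-annotated text stream, and the annotation would survive as part of the wrapped type.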
> The distinguishing feature of binary data is that each "unit" is meaningful in isolation. It makes sense to fetch a single unit and work on it. It makes sense to jump to an arbitrary position in the stream and interpret the unit there.
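To make that unit-oriented interface concrete, here is a minimal sketch; the names (octet_stream, memory_octet_stream) are invented for illustration and are not part of any proposal.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a minimal binary (octet) stream interface.
struct octet_stream {
    virtual ~octet_stream() = default;
    // Read up to n octets into buf; return the number actually read.
    virtual std::size_t read(std::uint8_t* buf, std::size_t n) = 0;
    // Because every unit is meaningful in isolation, random access to
    // an arbitrary octet position is well-defined.
    virtual void seek(std::size_t pos) = 0;
};

// An in-memory implementation, just to make the concept concrete.
class memory_octet_stream : public octet_stream {
    std::vector<std::uint8_t> data_;
    std::size_t pos_ = 0;
public:
    explicit memory_octet_stream(std::vector<std::uint8_t> d)
        : data_(std::move(d)) {}
    std::size_t read(std::uint8_t* buf, std::size_t n) override {
        std::size_t avail = data_.size() - pos_;
        if (n > avail) n = avail;
        for (std::size_t i = 0; i < n; ++i) buf[i] = data_[pos_ + i];
        pos_ += n;
        return n;
    }
    void seek(std::size_t pos) override { pos_ = pos; }
};
```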
I suppose the question is at what level the library will provide character encoding/decoding/conversion. It seems like the most basic way to create a text stream would be to simply take a data stream and mark it with a particular encoding. This suggests that encoding/decoding/conversion should exist as a "data stream" operation. One thing I haven't figured out, though, is how the underlying unit type of the stream, i.e. uint8_t or uint16_t, would correspond to the encoding. In particular, the issue is which underlying unit type corresponds to each of the following encodings:

- UTF-8, iso-8859-*: it seems obvious that uint8_t would be the choice here.
- UTF-16: uint16_t looks promising, but you need to be able to read this from a file, which might be a uint8_t stream.
- UTF-16-LE/UTF-16-BE: uint16_t looks promising, but uint16_le_t/uint16_be_t (special types that might be defined in the endian library) might be better; furthermore, you need to be able to read this from a file, which might be a uint8_t stream.

Perhaps you have some ideas about a conceptual model that resolves these issues. One solution may be to decide that the unit type of the stream need not be too closely tied to the encoding; in particular, the encoding conversion might not care what types the input and output streams are, within some limits (maybe the size of the unit type of the encoding must be a multiple of the size of the unit type of the stream).
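One way to express the "mark a data stream with an encoding" idea, with the size constraint from the last sentence checked at compile time. The encoding tags and the text_stream template are pure illustration, not proposed names:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative encoding tags: each names its natural code-unit type.
struct utf8  { using unit_type = std::uint8_t;  };
struct utf16 { using unit_type = std::uint16_t; };

// A "text stream" as nothing but a data stream plus a compile-time
// encoding annotation. The only constraint imposed here is the one
// suggested above: the encoding's code-unit size must be a multiple of
// the underlying stream's unit size.
template <typename Encoding, typename DataStream>
class text_stream {
    static_assert(sizeof(typename Encoding::unit_type) %
                      sizeof(typename DataStream::unit_type) == 0,
                  "encoding unit size must be a multiple of the stream unit size");
    DataStream& raw_;
public:
    using encoding = Encoding;
    explicit text_stream(DataStream& s) : raw_(s) {}
    DataStream& raw() { return raw_; }  // the annotated data stream
};

// A minimal octet source, just so the template can be instantiated.
struct byte_source {
    using unit_type = std::uint8_t;
    std::size_t read(std::uint8_t*, std::size_t) { return 0; }
};
```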
> Textual data is far more complex. It's a stream of abstract characters, and those characters don't map cleanly onto the underlying representation primitive. A UTF-8 character maps to one, two, three, or four octets, leaving aside the dilemma of combining accents; a UTF-16 character maps to one or two double-octets. It doesn't make sense to fetch a single primitive and work on it, because it may not be a complete character. It doesn't make sense to jump to an arbitrary position, because you might land in the middle of a character. In my model, the internal character encoding is part of the text stream's type.
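The variable-width point can be made concrete with the UTF-8 lead-byte rules from RFC 3629 (a sketch for discussion, not library code):

```cpp
#include <cstdint>

// Number of octets in a UTF-8 sequence, determined by its lead octet
// (per RFC 3629). Returns 0 for a continuation or invalid lead byte --
// exactly what you may land on by seeking to an arbitrary octet position.
inline int utf8_sequence_length(std::uint8_t lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx continuation, or invalid
}
```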
It seems that it may be useful to allow the encoding to be specified at run time. The size of each encoded unit would still have to be known at compile time, though, so it is not clear exactly how important this is.

[snip]

--
Jeremy Maitin-Shepard