
I think part of the issue may be the name "binary". A better name may be "byte" I/O or "byte" stream. Originally, the binary transport layer was called the byte transport layer. I decided against this name for the simple reason that, as far as the C++ standard is concerned, a byte is pretty much the same as an unsigned char. Because the exact unit of transport is still an open question (and the current tendency I see is toward using octets, and leaving native bytes to some other mechanism), I didn't want any such implication in the name.

The name "binary" isn't a very good choice either, I admit. In the end, all data is binary. But the distinction between "binary" and "textual" data is important, and not only at the concept level. What I have in mind works something like this:

Binary data is in terms of octets, bytes, primitives, or PODs, whatever. The distinguishing feature of binary data is that each "unit" is meaningful in isolation. It makes sense to fetch a single unit and work on it. It makes sense to jump to an arbitrary position in the stream and interpret the unit there.

Textual data is far more complex. It's a stream of abstract characters, and they don't map cleanly to the underlying representation primitives. A UTF-8 character maps to one, two, three, or four octets, leaving aside the dilemma of combining accents. A UTF-16 character maps to one or two double-octets. It doesn't make sense to fetch a single primitive and work on it, because it may not be a complete character. It doesn't make sense to jump to an arbitrary position, because you might land in the middle of a character. For this reason, the internal character encoding is part of the text stream's type in my model.
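As a concrete illustration of why arbitrary positioning is problematic for text (a sketch only, not part of the proposed library): in UTF-8, a randomly chosen octet offset may land on a continuation octet, and the stream has to scan backwards to find the start of the character it belongs to.

    #include <cstddef>
    #include <iostream>
    #include <string>

    // A UTF-8 continuation octet has the bit pattern 10xxxxxx; it is not
    // meaningful in isolation, unlike a unit of binary data.
    bool is_utf8_continuation(unsigned char octet)
    {
        return (octet & 0xC0) == 0x80;
    }

    // Back up from an arbitrary octet offset to the start of the character
    // containing it; this is extra work a binary stream never needs to do.
    std::size_t utf8_character_start(const std::string& data, std::size_t pos)
    {
        while (pos > 0 && is_utf8_continuation(static_cast<unsigned char>(data[pos])))
            --pos;
        return pos;
    }

    int main()
    {
        std::string text = "a\xC3\xA4z"; // 'a', U+00E4 (two octets), 'z'
        // Offset 2 points into the middle of the two-octet character.
        std::cout << utf8_character_start(text, 2) << '\n'; // prints 1
    }

A binary stream, by contrast, can simply seek to any multiple of its unit size and interpret whatever it finds there.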
Jeremy Maitin-Shepard wrote:
Also, I believe the narrow/wide characters and locales stuff is broken beyond all repair, so I wouldn't recommend doing anything related to that.
I believe that this library will attempt to address and properly handle those issues.
I certainly will, especially as this view seems to be generally agreed on by the posters.
I also think text formatting is a different need from I/O. Indeed, it is often necessary to generate a formatted string which is then given to a GUI toolkit or whatever.
Presumably this would be supported by using the text formatting layer on top of an output text sink backed by an in-memory buffer.
That's the idea. Separating formatting and I/O is necessary to avoid an ugly mess of responsibilities, which is why the text formatting layer is a distinct layer. However, having the formatting built on the I/O interfaces instead of string interfaces allows for greater optimization opportunities.

There is not much you can do when you create a string from format instructions. You have two choices, basically.

One is to first find out how much memory is needed, allocate a buffer, and then format the data. This is the approach MFC's CString::Format takes. The obvious problem is that it does all the work twice. The less obvious problem is that it makes the code considerably more complex: either you find a way to turn sub-formatting (that is, evaluating the format instructions for a single parameter, e.g. turning an int into a string) into a dummy operation that just returns the space needed, or you require the formatting methods to provide an explicit "measure-only" operation. The former is very complex and, depending on the exact way formatting works, may even be impossible (or at least will create many little strings whose lengths are read, only for them to be discarded and later re-created); the latter hurts extensibility.

The other way is to just go ahead and format, re-allocating whenever space runs out. And that's just what the I/O-based method does anyway. However, if your formatting is bound to string objects, every formatting operation has to create a string containing all the formatted parameters. This can be a considerable memory/time overhead compared to formatting that works on the I/O interfaces, where formatted data is sent directly to the underlying device. (For efficiency, of course, there may be a buffer in the device chain. That's fine. The buffer is re-used, not allocated once per formatting operation.)

So yes, the formatting layer will be very distinct (and developed after the other parts are complete), but I really believe that basing it on the I/O interfaces is the best solution. There can, of course, be a convenience interface that simply creates a string from formatting instructions and parameters.

Sebastian Redl
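To illustrate the difference described above, here is a minimal sketch of sink-based formatting with a string-producing convenience function layered on top. The names (string_sink, write_int, format_int) are hypothetical and are not the interface proposed for the library; the point is only that formatted data can go straight to a sink, and that "format to a string" falls out as a special case of formatting to an in-memory sink.

    #include <cstddef>
    #include <cstdio>
    #include <iostream>
    #include <string>

    // Hypothetical sink: anything offering write(data, size). Here it is
    // backed by an in-memory buffer that is grown (and re-used) as needed.
    class string_sink
    {
    public:
        void write(const char* data, std::size_t size) { buffer_.append(data, size); }
        const std::string& str() const { return buffer_; }
    private:
        std::string buffer_;
    };

    // Formatting bound to the sink interface: the formatted parameter is
    // written directly into the device chain, with no per-parameter string.
    template <typename Sink>
    void write_int(Sink& sink, int value)
    {
        char tmp[16];
        int len = std::snprintf(tmp, sizeof tmp, "%d", value);
        sink.write(tmp, static_cast<std::size_t>(len));
    }

    // The convenience interface (a string from formatting instructions) is
    // just sink-based formatting applied to an in-memory sink.
    std::string format_int(int value)
    {
        string_sink sink;
        write_int(sink, value);
        return sink.str();
    }

    int main()
    {
        std::cout << format_int(42) << '\n'; // prints 42
    }

A real formatting layer would of course handle arbitrary types and format specifications; the sketch only shows where the data flows.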