
Sebastian Redl <sebastian.redl@getdesigned.at> writes: [snip]
Platforms using 9-bit bytes need binary I/O, too, and they might need to do it in their native 9-bit units. It would be a shame to deprive them of this possibility just because the text streams require octets, especially since we already have a layer in place whose purpose is to convert between low-level data representations.
It seems that the primary interface for the data formatting layer should be in terms of fixed-size types like {u,}int{8,16,32,64}_t. It is more the job of a serialization library to support platform-dependent types like short, int, long, etc., which would be of use primarily for producing serialization output that will only be used as input to the exact same program.

I suppose an alternative is for the read/write functions in the data formatting layer to always specify an explicit number of bits, for example write_{u,}int<32> or read_{u,}int<32>. read_int<N> always returns intN_t, and it is a compile-time error if that type does not exist. write_int<N> casts its argument to intN_t, and thus avoids the issue of multiple names for the same type, like int/long on most 32-bit platforms/compilers. This interface supports architectures with a 36-bit word (e.g. write_int<36>), but since everything is made explicit, it avoids any confusion that might otherwise result from such support.

Floating-point types are somewhat more difficult to handle, and I'm not sure what the best approach is. One possibility is to also specify the number of bits explicitly and to assume the IEEE 754 format will be used as the external format, for example write_float<32>, write_float<64>, or perhaps write_ieee754<32>. It should simply be a compile-time error if the compiler/platform doesn't provide a suitable type. [snip]
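For concreteness, here is a minimal sketch of what such an interface could look like; the names (exact_int, write_int, read_int) and the use of stdio are purely illustrative, not a proposed API:

    // Sketch only; exact_int/write_int/read_int are hypothetical names.
    #include <cstdint>
    #include <cstdio>

    // Map a bit count to the corresponding exact-width type; the primary
    // template has no `type`, so e.g. read_int<24> fails to compile.
    template<int Bits> struct exact_int;
    template<> struct exact_int<8>  { typedef std::int8_t  type; };
    template<> struct exact_int<16> { typedef std::int16_t type; };
    template<> struct exact_int<32> { typedef std::int32_t type; };
    template<> struct exact_int<64> { typedef std::int64_t type; };

    template<int Bits>
    void write_int(std::FILE* out, typename exact_int<Bits>::type value)
    {
        // The parameter is already the exact-width type, so int/long
        // ambiguity on 32-bit platforms never arises at the call site.
        std::fwrite(&value, sizeof value, 1, out);  // native byte order, for brevity
    }

    template<int Bits>
    typename exact_int<Bits>::type read_int(std::FILE* in)
    {
        typename exact_int<Bits>::type value = 0;
        std::fread(&value, sizeof value, 1, in);
        return value;
    }

    int main()
    {
        std::FILE* f = std::fopen("demo.bin", "wb+");
        if (!f) return 1;
        write_int<32>(f, 42);      // fine: int32_t exists
        // write_int<36>(f, 42);   // compile error unless exact_int<36> is specialized
        std::rewind(f);
        std::int32_t x = read_int<32>(f);
        std::fclose(f);
        return x == 42 ? 0 : 1;
    }

A platform with a 36-bit word would presumably add its own exact_int<36> specialization, and everything else would follow.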
It seems like trying to support unusual architectures at all may be extremely difficult. See my other post.
Which other post is this?
My comments there probably weren't very important anyway. I think it is worth considering, though, that given the rarity of platforms without 8-bit bytes, it is probably not worth spending much time supporting them, and more importantly, it is not worth complicating the interface for 8-bit-byte platforms in order to support them.
I suppose if you can find a clean way to support these unusual architectures, then all the better.
It seems that it would be very hard to support e.g. UTF-8 on a platform with 9-bit bytes, or one that cannot handle types smaller than 32 bits.
I think the binary conversion can do it. The system would work approximately like this:

1) Every platform defines its basic I/O byte. This would be 8 bits for most computers (including those where char is 32 bits large), and 9 or some other number of bits for others. The I/O byte is the smallest unit that can be read from a stream.

2) Most platforms will additionally designate an octet type. Probably I will just use uint8_t for this. They will supply a Representation for the formatting layer that can convert a stream of I/O bytes to a stream of octets (e.g. by truncating each byte). If an octet stream is then needed (e.g. for creating a UTF-8 stream), this representation will be inserted.
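For concreteness, a minimal sketch of such a Representation; the io_byte/octet names and the member functions are invented here, not part of the design:

    // Illustrative sketch only, assuming a platform with 9-bit I/O bytes.
    #include <cstdint>
    #include <cstddef>

    typedef unsigned int io_byte;   // stands in for the platform's 9-bit unit
    typedef std::uint8_t octet;

    // Converts between the native I/O byte stream and an octet stream by
    // keeping only the low 8 bits of each unit (point 2 above).
    struct octet_representation
    {
        std::size_t to_octets(const io_byte* in, std::size_t n, octet* out) const
        {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = static_cast<octet>(in[i] & 0xFF);  // truncate
            return n;
        }

        std::size_t from_octets(const octet* in, std::size_t n, io_byte* out) const
        {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = in[i];                             // zero-pad the high bit(s)
            return n;
        }
    };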
This padding/truncating would need to be done as an explicit way of encoding an octet stream as a nonet stream, and should probably not be done implicitly, unless this sort of conversion is always assumed on those platforms.
3) Platforms that do not support octets at all (or simply do not have a primitive type to spare for unambiguous overloads; they could use another 9-bit type and just ignore the additional bit, since character streams, at least, do not perform arithmetic on their units, so overflow is not an issue) do not have support for this. They're out of luck. I think this case is rare enough to be ignored.
Okay.
The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example: Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support.
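As a rough illustration of the idea (all names here are invented), a buffering layer can hand out opaque marks even though the underlying socket cannot seek, which is exactly what a recursive descent parser needs:

    // Sketch only; buffered_input and mark_type are hypothetical names.
    #include <cstddef>
    #include <vector>

    class buffered_input
    {
    public:
        class mark_type {                  // opaque to the caller
            friend class buffered_input;
            std::size_t pos;
            explicit mark_type(std::size_t p) : pos(p) {}
        };

        explicit buffered_input(const std::vector<char>& already_read)
            : buffer_(already_read), pos_(0) {}

        bool get(char& c)
        {
            if (pos_ >= buffer_.size()) return false;  // would refill from the socket here
            c = buffer_[pos_++];
            return true;
        }

        mark_type mark() const            { return mark_type(pos_); }
        void reset(const mark_type& m)    { pos_ = m.pos; }   // any number of live marks

    private:
        std::vector<char> buffer_;        // data already pulled from the socket
        std::size_t pos_;
    };

Because the mark is opaque, a seekable stream could instead return a file offset from mark() without the parser noticing the difference.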
I see. It still seems that using different names means that something that requires only mark/reset support cannot use a stream providing seek/tell support, without an additional intermediate layer.
Well, depends. Let's assume, for example, that the system will be implemented as C++09 templates with heavy use of concepts.
I think it may not be a good idea to target this new I/O library at a language that does not yet exist and, more importantly, is not yet supported by any compiler, except perhaps Douglas Gregor's experimental ConceptGCC, which, as the release notes state, is extremely slow (although they also claim that performance can be improved). I suppose it may work fine to write the library (using the preprocessor) so that it can be compiled under existing compilers without concept support, and to include a small amount of additional functionality, or more convenient syntax, when concept support is available. I would be very, very wary of anything that would increase compile times for users of the library, though. [snip]
The reason would be for a protocol in which little/big endian is specified as part of the message/data, and a typical implementation would always write in native format (and so it would need to determine which is the native format), but support both formats for reading.
Hmm ... makes sense. I'm not really happy, but it makes sense.
What do you mean you're not happy? I think all that would really be needed would be a macro to indicate the endianness. Of course any code that depends on this would likely depend even more on 8-bit bytes, but that is another issue.
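For illustration, a rough sketch of the sort of thing that would be needed; the macro and helper names are invented here, and 8-bit bytes are assumed:

    // Sketch only; MYLIB_LITTLE_ENDIAN and the helpers are hypothetical.
    #include <cstdint>
    #include <cstring>

    // Something a config header or the build system would define once per platform.
    #define MYLIB_LITTLE_ENDIAN 1

    // The message header carries a flag naming the payload's byte order;
    // a reader has to handle both orders.
    std::uint32_t read_uint32(const unsigned char* p, bool payload_is_little_endian)
    {
        if (payload_is_little_endian)
            return  std::uint32_t(p[0])
                 | (std::uint32_t(p[1]) << 8)
                 | (std::uint32_t(p[2]) << 16)
                 | (std::uint32_t(p[3]) << 24);
        return  (std::uint32_t(p[0]) << 24)
             |  (std::uint32_t(p[1]) << 16)
             |  (std::uint32_t(p[2]) << 8)
             |   std::uint32_t(p[3]);
    }

    // A writer just records its own order in the header and emits native bytes.
    void write_uint32_native(unsigned char* p, std::uint32_t v, bool& header_flag_little)
    {
        header_flag_little = (MYLIB_LITTLE_ENDIAN != 0);
        std::memcpy(p, &v, sizeof v);
    }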
Ideally, the cost of the virtual function calls would normally be mitigated by calling e.g. read/write with a large number of elements at once, rather than with only a single element.
Yes, but that's the ideal case. In practice, this means that the application would have to do its own buffering even if it really wants the data unit by unit.
Possibly this issue can be mitigated by exposing only a buffer around a text stream in the types, although I agree that there is no perfect solution.
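As a rough illustration of what the application-side buffering amounts to (the stream interface and names here are hypothetical):

    // Sketch only; one virtual call per refill instead of one per character.
    #include <cstddef>
    #include <vector>

    struct stream
    {
        virtual std::size_t read(char* out, std::size_t n) = 0;  // bulk transfer
        virtual ~stream() {}
    };

    // An application that wants the data character by character keeps its own
    // buffer in front of the stream to amortize the virtual call cost.
    class chunked_reader
    {
    public:
        explicit chunked_reader(stream& s) : s_(s), buf_(4096), pos_(0), end_(0) {}

        bool get(char& c)
        {
            if (pos_ == end_) {                        // refill with one virtual call
                end_ = s_.read(&buf_[0], buf_.size());
                pos_ = 0;
                if (end_ == 0) return false;           // end of stream
            }
            c = buf_[pos_++];
            return true;
        }

    private:
        stream& s_;
        std::vector<char> buf_;
        std::size_t pos_, end_;
    };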
The programmer will not want to construct the complicated full type for this.
newline_filter<encoding_device<utf_8, native_converter<gzip_filter<buffering_filter<file_device> > > > >& chain =
    open_file(filename, read).attach(buffer()).attach(gunzip()).attach(decode(utf_8())).attach(newlines());
The programmer will want to simply write
text_stream<utf_8> chain =
    open_file(filename, read).attach(buffer()).attach(gunzip()).attach(decode(utf_8())).attach(newlines());
I notice that these code examples suggest that all streams will be reference counted (and cheaply copied). Is that the intention? A potential drawback to that approach is that a buffer filter would be forced to allocate its buffer on the heap, when it otherwise might be able to use the stack. [snip]
It is convenient to have a unified concept of a character, independent of its encoding. The Unicode charset provides such a concept. Unicode is also convenient in that it adds classification rules and similar stuff. This decision is not really visible to user code anyway, only to encoding converters: it should be sufficient to provide a conversion from and to Unicode code points to enable a new encoding to be used in the framework.
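As a sketch of what that minimum requirement might look like (the converter interface here is invented for illustration), adding a new encoding would mean supplying a pair of conversions to and from code points:

    // Sketch only; the decode/encode signatures are hypothetical.
    #include <cstddef>
    #include <cstdint>

    typedef std::uint32_t code_point;   // a Unicode code point, encoding-independent

    struct latin1_converter
    {
        // Latin-1 maps 1:1 onto the first 256 code points, so both directions
        // are trivial; a UTF-8 converter would do the variable-length work here.
        std::size_t decode(const unsigned char* in, std::size_t n, code_point* out) const
        {
            for (std::size_t i = 0; i < n; ++i)
                out[i] = in[i];
            return n;
        }

        std::size_t encode(const code_point* in, std::size_t n, unsigned char* out) const
        {
            std::size_t written = 0;
            for (std::size_t i = 0; i < n; ++i)
                if (in[i] < 0x100)          // drop anything Latin-1 cannot express
                    out[written++] = static_cast<unsigned char>(in[i]);
            return written;
        }
    };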
I am basically content using only Unicode for text handling in my own programs, but I think it would be useful to see what others who care about efficiency for certain operations (and work with languages that are not represented very efficiently in UTF-8) think about this.

--
Jeremy Maitin-Shepard