
Sebastian Redl <sebastian.redl@getdesigned.at> writes:
Jeremy Maitin-Shepard wrote:
>> I think part of the issue may be the name "binary". A better name may be "byte" I/O or a "byte" stream.

> Originally, the binary transport layer was called the byte transport layer. I decided against this name for the simple reason that, as far as the C++ standard is concerned, a byte is pretty much the same as an unsigned char.
By byte, I really meant octet. You could then call it an octet stream, but it may be useful for at least the basic stream concept to support streams of arbitrary POD types. It seems like it would introduce significant complications to try to support any architecture with a non-octet byte (i.e. an architecture that does not have an 8-bit type). Perhaps it is best for this library to just completely ignore such architectures, since they are extremely rare anyway.
> Because the exact unit of transport is still an open question (and the current tendency I see is toward using octets, leaving native bytes to some other mechanism), I didn't want any such implication in the name. The name "binary" isn't a very good choice either, I admit; in the end, all data is binary. But the distinction between "binary" and "textual" data is important, and not only at the concept level. What I have in mind works something like this: binary data is in terms of octets, bytes, primitives, or PODs, whatever the unit may be.
Perhaps the name "data stream" would be appropriate, or better yet, perhaps just "stream", using the qualified name "text stream" or "character stream" to refer to streams of characters that are somehow marked (either at compile time or run time) with their encoding.

This brings up the following issue, which didn't seem to be addressed very much in your responses: should text streams of arbitrary (non-Unicode) encodings be supported? Also, should text streams support arbitrary base types (i.e. uint8_t, uint16_t, or some other type), or should they be restricted to a single type (like uint16_t)?

The reason for substantial unification of "data streams" and "text streams" is that despite differences in how the data they transport is used, the interface should essentially be the same (basic read/write, as well as things like mark/reset and seeking), and a buffering facility should be exactly the same for both types of streams. Similarly, a facility that provides mark/reset support, via buffering, on top of a stream that lacks it would be exactly the same for both "binary" and "text" streams. Even if seeking may not be as useful for text streams, it still might be useful to some people, and there is no reason to exclude it. In the document you posted, for instance, you essentially just duplicated much of the description of binary streams for the text streams, which suggests a problem.

As I suggest below, a "text stream" might always be a very thin layer on top of a binary stream that simply specifies an encoding. The issue, though, is how it would work to layer something like a regular buffer or a mark/reset-providing buffer on top of a text stream. There shouldn't have to be two mark/reset providers, one for data streams and one for text streams; it should also be possible to layer such a thing on top of a text stream directly and still maintain the encoding annotation.
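To illustrate the unification argument, here is a rough sketch of a single mark/reset provider templated on the underlying stream type. Everything here is hypothetical: the names, and the assumed Stream concept (a unit_type typedef plus a read(unit_type*, std::size_t) member), are invented for the example, not taken from any proposal.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: one mark/reset provider that works over any
// stream type -- "data" or "text" alike -- as long as the stream has a
// unit_type typedef and a read(unit_type*, std::size_t) member.
template <typename Stream>
class mark_reset_buffer {
    Stream& src_;
    std::vector<typename Stream::unit_type> replay_;  // units seen since mark()
    std::size_t replay_pos_ = 0;
    bool marked_ = false;
public:
    using unit_type = typename Stream::unit_type;
    explicit mark_reset_buffer(Stream& s) : src_(s) {}
    void mark()  { marked_ = true; replay_.clear(); replay_pos_ = 0; }
    void reset() { replay_pos_ = 0; }  // rewind to the mark
    std::size_t read(unit_type* buf, std::size_t n) {
        std::size_t got = 0;
        // Serve units recorded since the mark first.
        while (got < n && replay_pos_ < replay_.size())
            buf[got++] = replay_[replay_pos_++];
        // Then pull fresh units from the source, recording them if marked.
        if (got < n) {
            std::size_t fresh = src_.read(buf + got, n - got);
            if (marked_) {
                replay_.insert(replay_.end(), buf + got, buf + got + fresh);
                replay_pos_ = replay_.size();
            }
            got += fresh;
        }
        return got;
    }
};

// A toy in-memory source, just so the template can be exercised.
struct char_source {
    using unit_type = char;
    const char* data;
    std::size_t size;
    std::size_t pos;
    std::size_t read(char* buf, std::size_t n) {
        if (n > size - pos) n = size - pos;
        for (std::size_t i = 0; i < n; ++i) buf[i] = data[pos + i];
        pos += n;
        return n;
    }
};
```

Because the buffer is parameterized on the stream type, the same code would wrap an octet stream or an encoding-annotated text stream, and the annotation would survive as part of the wrapped type.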
> The distinguishing feature of binary data is that each "unit" is meaningful in isolation. It makes sense to fetch a single unit and work on it. It makes sense to jump to an arbitrary position in the stream and interpret the unit there.
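To make that unit-oriented interface concrete, here is a minimal sketch; the names (octet_stream, memory_octet_stream) are invented for illustration and are not part of any proposal.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch of a minimal binary (octet) stream interface.
struct octet_stream {
    virtual ~octet_stream() = default;
    // Read up to n octets into buf; return the number actually read.
    virtual std::size_t read(std::uint8_t* buf, std::size_t n) = 0;
    // Because every unit is meaningful in isolation, random access to
    // an arbitrary octet position is well-defined.
    virtual void seek(std::size_t pos) = 0;
};

// An in-memory implementation, just to make the concept concrete.
class memory_octet_stream : public octet_stream {
    std::vector<std::uint8_t> data_;
    std::size_t pos_ = 0;
public:
    explicit memory_octet_stream(std::vector<std::uint8_t> d)
        : data_(std::move(d)) {}
    std::size_t read(std::uint8_t* buf, std::size_t n) override {
        std::size_t avail = data_.size() - pos_;
        if (n > avail) n = avail;
        for (std::size_t i = 0; i < n; ++i) buf[i] = data_[pos_ + i];
        pos_ += n;
        return n;
    }
    void seek(std::size_t pos) override { pos_ = pos; }
};
```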
I suppose the question is at what level the library will provide character encoding/decoding/conversion. It seems like the most basic way to create a text stream would be to simply take a data stream and mark it with a particular encoding. This suggests that encoding/decoding/conversion should exist as a "data stream" operation. One thing I haven't figured out, though, is how the underlying unit type of the stream, i.e. uint8_t or uint16_t, would correspond to the encoding. In particular, the issue is which underlying unit type corresponds to each of the following encodings:

- UTF-8, iso-8859-*: it seems obvious that uint8_t would be the choice here.
- UTF-16: uint16_t looks promising, but you need to be able to read this from a file, which might be a uint8_t stream.
- UTF-16-LE/UTF-16-BE: uint16_t looks promising, but uint16_le_t/uint16_be_t (special types that might be defined in the endian library) might be better; furthermore, you need to be able to read this from a file, which might be a uint8_t stream.

Perhaps you have some ideas about a conceptual model that resolves these issues. One solution may be to decide that the unit type of the stream need not be too closely tied to the encoding; in particular, the encoding conversion might not care what types the input and output streams are, within some limits (maybe the size of the unit type of the encoding must be a multiple of the size of the unit type of the stream).
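One way to express the "mark a data stream with an encoding" idea, with the size constraint from the last sentence checked at compile time. The encoding tags and the text_stream template are pure illustration, not proposed names:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative encoding tags: each names its natural code-unit type.
struct utf8  { using unit_type = std::uint8_t;  };
struct utf16 { using unit_type = std::uint16_t; };

// A "text stream" as nothing but a data stream plus a compile-time
// encoding annotation. The only constraint imposed here is the one
// suggested above: the encoding's code-unit size must be a multiple of
// the underlying stream's unit size.
template <typename Encoding, typename DataStream>
class text_stream {
    static_assert(sizeof(typename Encoding::unit_type) %
                      sizeof(typename DataStream::unit_type) == 0,
                  "encoding unit size must be a multiple of the stream unit size");
    DataStream& raw_;
public:
    using encoding = Encoding;
    explicit text_stream(DataStream& s) : raw_(s) {}
    DataStream& raw() { return raw_; }  // the annotated data stream
};

// A minimal octet source, just so the template can be instantiated.
struct byte_source {
    using unit_type = std::uint8_t;
    std::size_t read(std::uint8_t*, std::size_t) { return 0; }
};
```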
> Textual data is far more complex. It's a stream of abstract characters, and those characters don't map cleanly onto the underlying representation primitive. A UTF-8 character maps to one, two, three, or four octets, leaving aside the dilemma of combining accents; a UTF-16 character maps to one or two double-octets. It doesn't make sense to fetch a single primitive and work on it, because it may not be a complete character. It doesn't make sense to jump to an arbitrary position, because you might land in the middle of a character. In my model, the internal character encoding is part of the text stream's type.
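The variable-width point can be made concrete with the UTF-8 lead-byte rules from RFC 3629 (a sketch for discussion, not library code):

```cpp
#include <cstdint>

// Number of octets in a UTF-8 sequence, determined by its lead octet
// (per RFC 3629). Returns 0 for a continuation or invalid lead byte --
// exactly what you may land on by seeking to an arbitrary octet position.
inline int utf8_sequence_length(std::uint8_t lead) {
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx continuation, or invalid
}
```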
It seems that it may be useful to allow the encoding to be specified at run time. The size of each encoded unit would still have to be known at compile time, though, so it is not clear exactly how important this is.

[snip]

--
Jeremy Maitin-Shepard