
Hi John,

I'm responding to both your mails in a single reply (and mixing your quotes), because they are closely interrelated.

John Hayes wrote:
> While working on ordinary web software, there are actually a lot more variations on data encodings than just text and binary:

And not only in web software. This is exactly what the filters and devices are supposed to support. However, with some encodings, the line between filters and devices is a bit blurred.

> A binary format may itself be encoded as bytes (of varying endianness), or in Base64 for email attachments (RFC 2045) or Base32 for URLs or form post data (RFC 3548). I don't think any of these transformations is accurately represented as encoding a byte stream as text. I'll quickly address Base-64 because it's different from the others; this is a bitstream representation that happens to tolerate being interpreted as character data in most scenarios (Base-32 also tolerates case conversion, so it's suitable for HTTP headers).

From my understanding of Base-64, I'd say I disagree. Base-64 is not a bitstream representation that tolerates being interpreted as characters. That would mean that the bit pattern for the Base-64 version of a given blob is defined. That's not the case, though. The Base-64 transformation is defined in terms of abstract characters: the bit-hextet 000000 corresponds to A, 000001 to B, and so on. The actual representation of these characters does not matter - cannot matter! The encoding was designed to survive re-encoding of the resulting text. Therefore, writing a Base64Device that wraps a character stream and provides a binary stream seems to be a very appropriate way of implementing Base-64 to me.
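To make that concrete, here is a rough sketch of such a device. None of the interfaces are settled, so CharSource and its get() member are just placeholders for whatever character-stream concept the library ends up with:

    // Minimal sketch of a Base-64 decoding device: it wraps a character
    // stream and yields bytes. "CharSource" and get() are hypothetical.
    #include <cstdint>
    #include <optional>
    #include <string>

    template <typename CharSource>
    class Base64Device {
    public:
        explicit Base64Device(CharSource& src) : src_(src) {}

        // Read one decoded byte, or nothing at end of stream.
        std::optional<std::uint8_t> get() {
            if (pos_ == count_ && !refill()) return std::nullopt;
            return bytes_[pos_++];
        }

    private:
        bool refill() {
            // Decoding works on abstract characters: we only compare against
            // the Base-64 alphabet, never against a particular bit pattern.
            static const std::string alphabet =
                "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
            std::uint32_t block = 0;
            int chars = 0;
            while (chars < 4) {
                auto c = src_.get();              // hypothetical character read
                if (!c || *c == '=') break;       // end of data or padding
                auto idx = alphabet.find(static_cast<char>(*c));
                if (idx == std::string::npos) continue;  // skip line breaks etc.
                block = (block << 6) | static_cast<std::uint32_t>(idx);
                ++chars;
            }
            if (chars < 2) return false;
            block <<= 6 * (4 - chars);            // pad the group to 24 bits
            count_ = chars - 1;                   // 4 chars -> 3 bytes, 3 -> 2, 2 -> 1
            for (int i = 0; i < count_; ++i)
                bytes_[i] = static_cast<std::uint8_t>((block >> (16 - 8 * i)) & 0xFF);
            pos_ = 0;
            return true;
        }

        CharSource& src_;
        std::uint8_t bytes_[3] = {};
        int pos_ = 0, count_ = 0;
    };

Note that the decoder never looks at the bit pattern of the characters it reads; it only compares them against the Base-64 alphabet, which is exactly why the encoding survives re-encoding of the text.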
> When encoding in a plain-text format (after encoding into a narrow character set), there might still be escaping depending on the container. C, JS, XML attributes, elements and CDATAs, SQL (by database) all have different escaping rules. This fails to mention sillier issues like newline representation.
> For the other escaping, these represent a text-representation of text data.
This is what text filters are for. While non-trivial, it would certainly be possible to implement stateful filters that can escape string literals. Or you can implement simpler filters that do the encoding but are context-insensitive. (Then you're responsible for inserting and removing the filters from your chain as context requires.)
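For example, a context-insensitive escaping filter could be as simple as the sketch below. "Sink" and push() are placeholders for whatever the text stream concept turns out to be:

    // Context-insensitive escaping filter: every character pushed through it
    // is XML-escaped, regardless of context. The caller is responsible for
    // only having it in the chain while writing text content.
    template <typename Sink>
    class XmlEscapeFilter {
    public:
        explicit XmlEscapeFilter(Sink& sink) : sink_(sink) {}

        void push(char32_t c) {
            switch (c) {
                case U'&': write(U"&amp;");  break;
                case U'<': write(U"&lt;");   break;
                case U'>': write(U"&gt;");   break;
                case U'"': write(U"&quot;"); break;
                default:   sink_.push(c);    break;
            }
        }

    private:
        void write(const char32_t* s) {
            for (; *s; ++s) sink_.push(*s);
        }

        Sink& sink_;
    };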
> But the interesting question to ask is what support would make these operations implementable without rebuffering (to perform translations that aren't immediately supported by the stream library).

I think in a system that works by combining components (and I think everything else would be too inflexible) you cannot implement functionality that changes the size of the data without rebuffering, at least into a small buffer. Any quoting means that for an incoming character, two characters might get forwarded. Or one for two, going in the other direction. A Base-64 translator must always buffer some data, because it needs groups of 3 bytes before it can do encoding, and groups of 4 characters before it can do decoding. This means some buffering.
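To illustrate how small that buffer can be: a pull-style escaping filter only ever needs to park a single character. "Source" and its get() are again hypothetical names:

    // A filter that may emit two characters for one input character; the
    // second one is parked in "pending_" until the next read.
    #include <optional>

    template <typename Source>
    class BackslashEscapeFilter {
    public:
        explicit BackslashEscapeFilter(Source& src) : src_(src) {}

        std::optional<char> get() {
            if (pending_) {                    // flush the buffered second char
                char c = *pending_;
                pending_.reset();
                return c;
            }
            auto c = src_.get();
            if (!c) return std::nullopt;       // end of stream
            if (*c == '"' || *c == '\\') {     // one char in, two chars out
                pending_ = *c;
                return '\\';
            }
            return *c;
        }

    private:
        Source& src_;
        std::optional<char> pending_;
    };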
> Buffering is also an interesting problem because in some formats, buffering events (like flush overflow or EOF) have streaming output to indicate an explicit end of stream, minimum remaining distance or differences in distance (like how many bytes to the next chunk in a stream).

That's an excellent and important observation. But I think my current design supports this.

> From my limited research, the most complete description of a stream encoding is hidden in the description of HTTP 1.1 entities - this defines a 3-layer model for streaming:
> Buffering events: How to determine how large the stream is (TE, Content-Length, Trailer headers)

I don't think this should be part of the stream stack. Determining the size looks like an application concern to me. The stream simply supplies the data the application requests. (Or tries to.)
> Transformations: Preprocessing required before the stream can be interpreted (Content-Encoding: gzip, deflate, could include byte encodings)

This would be the domain of filters. However, determining the required transformations is still an application issue. Like above, I don't think the stream stack should build itself based on data it parses. (However, it would be an interesting domain for a support library.)
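If such a support library existed, I imagine the application-facing part would look roughly like this. The FilterSpec enum and the two supported encodings are made up for the example; the point is only that the application, not the stream stack, does the parsing:

    // The application parses the Content-Encoding header and decides which
    // filters to stack; the stream stack itself stays dumb.
    #include <stdexcept>
    #include <string>
    #include <vector>

    enum class FilterSpec { Gzip, Deflate };

    // Returns the encodings in the order they appear in the header; the
    // application then stacks the matching decoding filters accordingly.
    std::vector<FilterSpec> filters_for_content_encoding(const std::string& header) {
        std::vector<FilterSpec> chain;
        std::string token;
        for (char c : header + ",") {
            if (c == ',') {
                if (token == "gzip")         chain.push_back(FilterSpec::Gzip);
                else if (token == "deflate") chain.push_back(FilterSpec::Deflate);
                else if (!token.empty())
                    throw std::runtime_error("unsupported Content-Encoding: " + token);
                token.clear();
            } else if (c != ' ') {
                token += c;
            }
        }
        return chain;
    }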
> Type: What class should further interpret the content, and for text entities, the character set encoding (Content-Type).

Same thing. Let the application find out the encoding and what to do with the data.

> 1. Text encoding - how are numbers formatted (are numbers going direct to primitive encoding), how are strings escaped and delimited in a text stream. If writing to a string buffer, then the stream may terminate here. Text encoding may alter the character set - for example, punycode changes Unicode into ASCII (which simplifies the string encode process).
All except the first could be accomplished using text filters. The first seems to be a very domain-specific question that is better handled by the decision of which interface - the binary or the text - to use in the first place.
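To illustrate what I mean by letting the interface choice answer the number-formatting question - the write_uint32 and write_string calls below are hypothetical placeholders, not actual library calls:

    // Writing the same value through the two interfaces. On the binary side
    // there is no "number formatting" question at all; on the text side the
    // number becomes characters and any text filters in the chain apply.
    #include <cstdint>
    #include <string>

    template <typename BinarySink>
    void write_length_binary(BinarySink& out, std::uint32_t n) {
        out.write_uint32(n);                     // hypothetical call
    }

    template <typename TextSink>
    void write_length_text(TextSink& out, std::uint32_t n) {
        out.write_string(std::to_string(n));     // hypothetical call
    }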
> 2. String encoding - how do strings get reduced to a stream of primitives (if the text format matches the encoding format then there's nothing to do - true for SBCS, MBCS).

This would be the character conversion device.

> How is a variable length string delimited in binary (length prefixes, null termination, maximum size, padding).

This looks like a question of serialization to me, and thus outside the domain of the library.
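As a rough idea of what I mean by a character conversion device - a UTF-8-only sketch, with ByteSource standing in for the underlying byte stream and most error handling reduced to "reject":

    // Character conversion device: wraps a byte stream, yields code points.
    #include <cstdint>
    #include <optional>
    #include <stdexcept>

    template <typename ByteSource>   // get() assumed to return optional<uint8_t>
    class Utf8Device {
    public:
        explicit Utf8Device(ByteSource& src) : src_(src) {}

        std::optional<char32_t> get() {
            auto b = src_.get();
            if (!b) return std::nullopt;
            std::uint8_t lead = *b;
            int extra;
            char32_t cp;
            if      (lead < 0x80)           { extra = 0; cp = lead; }
            else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
            else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
            else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
            else throw std::runtime_error("invalid UTF-8 lead byte");
            for (int i = 0; i < extra; ++i) {
                auto cont = src_.get();
                if (!cont || (*cont & 0xC0) != 0x80)
                    throw std::runtime_error("truncated UTF-8 sequence");
                cp = (cp << 6) | (*cont & 0x3F);
            }
            return cp;
        }

    private:
        ByteSource& src_;
    };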
> 3. Primitive encoding - Endianness, did we really mean IEEE 754 floats,

That's the binary formatting.

> are we sending whole bytes or only a subset of bits (an int is expedient for a memory image, but there are only 8 significant bits),

Interesting idea here. May be binary formatting, may be serialization, may be a simple matter of casting the data before feeding it to the stream. The main problem I see in integrating this into the stream is that it is highly context-dependent. Which int is there because the range is needed, and which is only there because the hardware processes it faster? There could be both kinds within a single structure, which is why I'm inclined to leave this to the application or the serialization.
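For the endianness part, the binary formatting layer would boil down to something like the following sketch (ByteSink and put() are placeholders):

    // Binary formatting: values are reduced to bytes with an explicit,
    // format-defined byte order instead of whatever the host happens to use.
    #include <cstdint>

    enum class ByteOrder { Little, Big };

    template <typename ByteSink>
    void write_uint32(ByteSink& out, std::uint32_t value, ByteOrder order) {
        for (int i = 0; i < 4; ++i) {
            int shift = (order == ByteOrder::Little) ? 8 * i : 8 * (3 - i);
            out.put(static_cast<std::uint8_t>((value >> shift) & 0xFF));
        }
    }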
> are there alignment issues (a file format that was originally a memory image may word-align records or fields).
Another serialization issue.
> 4. Bitstream encoding - if the output is octets then this layer is optional, otherwise chop up bits into Base64 or less.
Binary filters can do this, although as I argued above, I don't think Base-64 is a good example of such a use.

> Tagging the format can most likely be ignored at the stream level. Most file formats will either externally or internally specify their encoding formats.

I don't think it's even possible, with reasonable effort, to support this at the stream level. Tagging is very dependent on the data format.
> The most helpful thing to do is provide factory functions that convert from existing character set descriptors (http://www.iana.org/assignments/character-sets) into an actual operator and allow changing the operators at a specific stream position. This will help most situations where character encoding is specified in a header.

Yes, I agree. The semantics of changing the stack in the middle of the stream must be defined.
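Something along these lines, perhaps - the converter classes and the tiny registry are only illustrative; a real factory would have to cover the whole IANA list and its aliases:

    // Factory mapping IANA character set names to converter objects.
    // "CharConverter" and the two concrete converters are hypothetical.
    #include <cctype>
    #include <functional>
    #include <map>
    #include <memory>
    #include <stdexcept>
    #include <string>

    struct CharConverter { virtual ~CharConverter() = default; };
    struct Utf8Converter   : CharConverter {};
    struct Latin1Converter : CharConverter {};

    std::unique_ptr<CharConverter> make_converter(std::string name) {
        // IANA names are case-insensitive; normalize before lookup.
        for (char& c : name)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

        static const std::map<std::string,
                              std::function<std::unique_ptr<CharConverter>()>> registry = {
            { "utf-8",      [] { return std::make_unique<Utf8Converter>(); } },
            { "iso-8859-1", [] { return std::make_unique<Latin1Converter>(); } },
            { "latin1",     [] { return std::make_unique<Latin1Converter>(); } },  // alias
        };

        auto it = registry.find(name);
        if (it == registry.end())
            throw std::runtime_error("unknown character set: " + name);
        return it->second();
    }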
Sebastian Redl