
Hi John,

I'm responding to both your mails in a single reply (and mixing your quotes), because they are closely interrelated.

John Hayes wrote:
> While working on ordinary web software, there are actually a lot more variations on data encodings than just text and binary:

And not only in web software. This is exactly what the filters and devices are supposed to support. However, with some encodings, the line between filters and devices is a bit blurred.

> A binary format may itself be encoded as bytes (of varying endianness), or in Base64 for email attachments (RFC 2045) or Base32 for URLs or form post data (RFC 3548). I don't think any of these transformations is accurately represented as encoding a byte stream as text. I'll quickly address Base-64 because it's different from the others; this is a bitstream representation that happens to tolerate being interpreted as character data in most scenarios (Base-32 also tolerates case conversion, so it's suitable for HTTP headers).

From my understanding of Base-64, I'd say I disagree. Base-64 is not a bitstream representation that tolerates being interpreted as characters. That would mean that the bit pattern for the Base-64 version of a given blob is defined. That's not the case, though. The Base-64 transformation is defined in terms of abstract characters: the bit-hextet 000000 corresponds to A, 000001 to B, and so on. The actual representation of these characters does not matter - cannot matter! The encoding was designed to survive re-encoding of the resulting text. Therefore, writing a Base64Device that wraps a character stream and provides a binary stream seems to be a very appropriate way of implementing Base-64 to me.
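To make that concrete, here is a rough sketch of such a device. None of the interfaces are settled, so CharSource and its get() member are just placeholders for whatever character-stream concept the library ends up with:

    // Minimal sketch of a Base-64 decoding device: it wraps a character
    // stream and yields bytes. "CharSource" and get() are hypothetical.
    #include <cstdint>
    #include <optional>
    #include <string>

    template <typename CharSource>
    class Base64Device {
    public:
        explicit Base64Device(CharSource& src) : src_(src) {}

        // Read one decoded byte, or nothing at end of stream.
        std::optional<std::uint8_t> get() {
            if (pos_ == count_ && !refill()) return std::nullopt;
            return bytes_[pos_++];
        }

    private:
        bool refill() {
            // Decoding works on abstract characters: we only compare against
            // the Base-64 alphabet, never against a particular bit pattern.
            static const std::string alphabet =
                "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
            std::uint32_t block = 0;
            int chars = 0;
            while (chars < 4) {
                auto c = src_.get();              // hypothetical character read
                if (!c || *c == '=') break;       // end of data or padding
                auto idx = alphabet.find(static_cast<char>(*c));
                if (idx == std::string::npos) continue;  // skip line breaks etc.
                block = (block << 6) | static_cast<std::uint32_t>(idx);
                ++chars;
            }
            if (chars < 2) return false;
            block <<= 6 * (4 - chars);            // pad the group to 24 bits
            count_ = chars - 1;                   // 4 chars -> 3 bytes, 3 -> 2, 2 -> 1
            for (int i = 0; i < count_; ++i)
                bytes_[i] = static_cast<std::uint8_t>((block >> (16 - 8 * i)) & 0xFF);
            pos_ = 0;
            return true;
        }

        CharSource& src_;
        std::uint8_t bytes_[3] = {};
        int pos_ = 0, count_ = 0;
    };

Note that the decoder never looks at the bit pattern of the characters it reads; it only compares them against the Base-64 alphabet, which is exactly why the encoding survives re-encoding of the text.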
> When encoding in a plain-text format (after encoding into a narrow character set), there might still be escaping depending on the container. C, JS, XML attributes, elements and CDATAs, SQL (by database) all have different escaping rules. This fails to mention sillier issues like newline representation.
> For the other escaping, these represent a text-representation of text data.
This is what text filters are for. While non-trivial, it would certainly be possible to implement stateful filters that can escape string literals. Or you can implement simpler filters that do the encoding but are context-insensitive. (Then you're responsible for inserting and removing the filters from your chain as context requires.)
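For example, a context-insensitive escaping filter could be as simple as the sketch below. "Sink" and push() are placeholders for whatever the text stream concept turns out to be:

    // Context-insensitive escaping filter: every character pushed through it
    // is XML-escaped, regardless of context. The caller is responsible for
    // only having it in the chain while writing text content.
    template <typename Sink>
    class XmlEscapeFilter {
    public:
        explicit XmlEscapeFilter(Sink& sink) : sink_(sink) {}

        void push(char32_t c) {
            switch (c) {
                case U'&': write(U"&amp;");  break;
                case U'<': write(U"&lt;");   break;
                case U'>': write(U"&gt;");   break;
                case U'"': write(U"&quot;"); break;
                default:   sink_.push(c);    break;
            }
        }

    private:
        void write(const char32_t* s) {
            for (; *s; ++s) sink_.push(*s);
        }

        Sink& sink_;
    };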
> But the interesting question to ask is what support would make these operations implementable without rebuffering (to perform translations that aren't immediately supported by the stream library).

I think in a system that works by combining components (and I think everything else would be too inflexible) you cannot implement functionality that changes the size of the data without rebuffering, at least into a small buffer. Any quoting means that for an incoming character, two characters might get forwarded. Or one for two, going in the other direction. A Base-64 translator must always buffer some data, because it needs groups of 3 bytes before it can do encoding, and groups of 4 characters before it can do decoding. This means some buffering.
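To illustrate how small that buffer can be: a pull-style escaping filter only ever needs to park a single character. "Source" and its get() are again hypothetical names:

    // A filter that may emit two characters for one input character; the
    // second one is parked in "pending_" until the next read.
    #include <optional>

    template <typename Source>
    class BackslashEscapeFilter {
    public:
        explicit BackslashEscapeFilter(Source& src) : src_(src) {}

        std::optional<char> get() {
            if (pending_) {                    // flush the buffered second char
                char c = *pending_;
                pending_.reset();
                return c;
            }
            auto c = src_.get();
            if (!c) return std::nullopt;       // end of stream
            if (*c == '"' || *c == '\\') {     // one char in, two chars out
                pending_ = *c;
                return '\\';
            }
            return *c;
        }

    private:
        Source& src_;
        std::optional<char> pending_;
    };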
> Buffering is also an interesting problem because in some formats, buffering events (like flush overflow or EOF) have streaming output to indicate an explicit end of stream, minimum remaining distance or differences in distance (like how many bytes to the next chunk in a stream).

That's an excellent and important observation. But I think my current design supports this.

> From my limited research, the most complete description of a stream encoding is hidden in the description of HTTP 1.1 entities - this defines a 3-layer model for streaming:
> Buffering events: How to determine how large the stream is (TE, Content-Length, Trailer headers)

I don't think this should be part of the stream stack. Determining the size looks like an application concern to me. The stream simply supplies the data the application requests. (Or tries to.)
> Transformations: Preprocessing required before the stream can be interpreted (Content-Encoding: gzip, deflate, could include byte encodings)

This would be the domain of filters. However, determining the required transformations is still an application issue. Like above, I don't think the stream stack should build itself based on data it parses. (However, it would be an interesting domain for a support library.)
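If such a support library existed, I imagine the application-facing part would look roughly like this. The FilterSpec enum and the two supported encodings are made up for the example; the point is only that the application, not the stream stack, does the parsing:

    // The application parses the Content-Encoding header and decides which
    // filters to stack; the stream stack itself stays dumb.
    #include <stdexcept>
    #include <string>
    #include <vector>

    enum class FilterSpec { Gzip, Deflate };

    // Returns the encodings in the order they appear in the header; the
    // application then stacks the matching decoding filters accordingly.
    std::vector<FilterSpec> filters_for_content_encoding(const std::string& header) {
        std::vector<FilterSpec> chain;
        std::string token;
        for (char c : header + ",") {
            if (c == ',') {
                if (token == "gzip")         chain.push_back(FilterSpec::Gzip);
                else if (token == "deflate") chain.push_back(FilterSpec::Deflate);
                else if (!token.empty())
                    throw std::runtime_error("unsupported Content-Encoding: " + token);
                token.clear();
            } else if (c != ' ') {
                token += c;
            }
        }
        return chain;
    }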
> Type: What class should further interpret the content, and for text entities, the character set encoding (Content-Type).

Same thing. Let the application find out the encoding and what to do with the data.

> 1. Text encoding - how are numbers formatted (are numbers going direct to primitive encoding), how are strings escaped and delimited in a text stream. If writing to a string buffer, then the stream may terminate here. Text encoding may alter the character set - for example, punycode changes Unicode into ASCII (which simplifies the string encode process).
All except the first could be accomplished using text filters. The first seems to be a very domain-specific question that is better handled by the decision of which interface - the binary or the text - to use in the first place.
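To illustrate what I mean by letting the interface choice answer the number-formatting question - the write_uint32 and write_string calls below are hypothetical placeholders, not actual library calls:

    // Writing the same value through the two interfaces. On the binary side
    // there is no "number formatting" question at all; on the text side the
    // number becomes characters and any text filters in the chain apply.
    #include <cstdint>
    #include <string>

    template <typename BinarySink>
    void write_length_binary(BinarySink& out, std::uint32_t n) {
        out.write_uint32(n);                     // hypothetical call
    }

    template <typename TextSink>
    void write_length_text(TextSink& out, std::uint32_t n) {
        out.write_string(std::to_string(n));     // hypothetical call
    }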
> 2. String encoding - how do strings get reduced to a stream of primitives (if the text format matches the encoding format then there's nothing to do - true for SBCS, MBCS).

This would be the character conversion device.

> How is a variable length string delimited in binary (length prefixes, null termination, maximum size, padding).

This looks like a question of serialization to me, and thus outside the domain of the library.
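As a rough idea of what I mean by a character conversion device - a UTF-8-only sketch, with ByteSource standing in for the underlying byte stream and most error handling reduced to "reject":

    // Character conversion device: wraps a byte stream, yields code points.
    #include <cstdint>
    #include <optional>
    #include <stdexcept>

    template <typename ByteSource>   // get() assumed to return optional<uint8_t>
    class Utf8Device {
    public:
        explicit Utf8Device(ByteSource& src) : src_(src) {}

        std::optional<char32_t> get() {
            auto b = src_.get();
            if (!b) return std::nullopt;
            std::uint8_t lead = *b;
            int extra;
            char32_t cp;
            if      (lead < 0x80)           { extra = 0; cp = lead; }
            else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
            else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
            else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
            else throw std::runtime_error("invalid UTF-8 lead byte");
            for (int i = 0; i < extra; ++i) {
                auto cont = src_.get();
                if (!cont || (*cont & 0xC0) != 0x80)
                    throw std::runtime_error("truncated UTF-8 sequence");
                cp = (cp << 6) | (*cont & 0x3F);
            }
            return cp;
        }

    private:
        ByteSource& src_;
    };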
> 3. Primitive encoding - Endianness, did we really mean IEEE 754 floats,

That's the binary formatting.

> are we sending whole bytes or only a subset of bits (an int is expedient for a memory image, but there are only 8 significant bits),

Interesting idea here. May be binary formatting, may be serialization, may be a simple matter of casting the data before feeding it to the stream. The main problem I see in integrating this into the stream is that it is highly context-dependent. Which int is there because the range is needed, and which is only there because the hardware processes it faster? There could be both kinds within a single structure, which is why I'm inclined to leave this to the application or the serialization.
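For the endianness part, the binary formatting layer would boil down to something like the following sketch (ByteSink and put() are placeholders):

    // Binary formatting: values are reduced to bytes with an explicit,
    // format-defined byte order instead of whatever the host happens to use.
    #include <cstdint>

    enum class ByteOrder { Little, Big };

    template <typename ByteSink>
    void write_uint32(ByteSink& out, std::uint32_t value, ByteOrder order) {
        for (int i = 0; i < 4; ++i) {
            int shift = (order == ByteOrder::Little) ? 8 * i : 8 * (3 - i);
            out.put(static_cast<std::uint8_t>((value >> shift) & 0xFF));
        }
    }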
> are there alignment issues (a file format that was originally a memory image may word-align records or fields).
Another serialization issue.
> 4. Bitstream encoding - if the output is octets then this layer is optional, otherwise chop up bits into Base64 or less.
Binary filters can do this, although as I argued above, I don't think Base-64 is a good example of such a use.

> Tagging the format can most likely be ignored at the stream level. Most file formats will either externally or internally specify their encoding formats.

I don't think it's even possible, with reasonable effort, to support this at the stream level. Tagging is very dependent on the data format.
> The most helpful thing to do is provide factory functions that convert from existing character set descriptors (http://www.iana.org/assignments/character-sets) into an actual operator and allow changing the operators at a specific stream position. This will help most situations where character encoding is specified in a header.

Yes, I agree. The semantics of changing the stack in the middle of the stream must be defined.
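Something along these lines, perhaps - the converter classes and the tiny registry are only illustrative; a real factory would have to cover the whole IANA list and its aliases:

    // Factory mapping IANA character set names to converter objects.
    // "CharConverter" and the two concrete converters are hypothetical.
    #include <cctype>
    #include <functional>
    #include <map>
    #include <memory>
    #include <stdexcept>
    #include <string>

    struct CharConverter { virtual ~CharConverter() = default; };
    struct Utf8Converter   : CharConverter {};
    struct Latin1Converter : CharConverter {};

    std::unique_ptr<CharConverter> make_converter(std::string name) {
        // IANA names are case-insensitive; normalize before lookup.
        for (char& c : name)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

        static const std::map<std::string,
                              std::function<std::unique_ptr<CharConverter>()>> registry = {
            { "utf-8",      [] { return std::make_unique<Utf8Converter>(); } },
            { "iso-8859-1", [] { return std::make_unique<Latin1Converter>(); } },
            { "latin1",     [] { return std::make_unique<Latin1Converter>(); } },  // alias
        };

        auto it = registry.find(name);
        if (it == registry.end())
            throw std::runtime_error("unknown character set: " + name);
        return it->second();
    }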
Sebastian Redl