
I don't think any of these transformations is accurately represented as encoding a byte stream as text. I'll quickly address base-64 first because it's different from the others: it is a bitstream representation that happens to tolerate being interpreted as character data in most scenarios (base-32 also tolerates case conversion, so it's suitable for HTTP headers). The other escapings are text representations of text data. That sounds dumb, but look at it from the perspective of encoding a 32-bit int. When that gets streamed, there are two choices (both sketched in the first code example below):

1. Submit the 32 bits for encoding - the stream requires enough parameters to figure out the endianness, and then how the bits are represented.

2. Convert it into text and submit the text - the string requires the parameters for this conversion (base, leading zeros - the printf parameters), and then downstream the text encoding has its own parameters.

Escaping is just another way of saying "what is the text representation of this text?"

There's more that could be included, but some really basic operators, like escape operators and stream-termination operators, would go a long way toward making it easy to interpret a lot of file formats. These operators act not on binary data but on text data (something that can interpret the grapheme clusters directly); otherwise, escaping will be buggy the first time it's applied to characters outside of the bottom 48.

At some point the line between streaming, serialization, and text parsing gets blurred (the smarter the text parsing, the deeper we go into Unicode issues), but the interesting question to ask is what support would make these operations implementable without rebuffering (to perform translations that aren't immediately supported by the stream library).

The complete stack needed to support all of these requirements has a bunch of layers, but depending on the application most of them are optional (a sketch of the whole stack follows below):

1. Text encoding - how numbers are formatted (or whether numbers go directly to primitive encoding), and how strings are escaped and delimited in a text stream. If writing to a string buffer, the stream may terminate here. Text encoding may alter the character set - for example, Punycode changes Unicode into ASCII (which simplifies the string-encoding step).

2. String encoding - how strings get reduced to a stream of primitives (if the text format matches the encoding format then there's nothing to do - true for SBCS and MBCS), and how a variable-length string is delimited in binary (length prefixes, null termination, maximum size, padding).

3. Primitive encoding - endianness; whether we really meant IEEE 754 floats; whether we're sending whole bytes or only a subset of bits (an int is expedient for a memory image, but there may be only 8 significant bits); whether there are alignment issues (a file format that was originally a memory image may word-align records or fields).

4. Bitstream encoding - if the output is octets then this layer is optional; otherwise chop the bits up into base-64 or smaller.

Tagging the format can most likely be ignored at the stream level. Most file formats will specify their encoding either externally or internally. The most helpful thing to do is provide factory functions that convert existing character-set descriptors (http://www.iana.org/assignments/character-sets) into an actual operator, and to allow changing the operators at a specific stream position (see the last sketch below). This will help most situations where the character encoding is specified in a header.
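To make the two paths for a 32-bit int concrete, here's a minimal C++ sketch (the function names and the endian enum are invented for illustration, not a proposed interface):

#include <cstdint>
#include <cstdio>
#include <string>

// Choice 1: submit the 32 bits themselves. The stream layer needs an
// endianness parameter to know how to lay the four octets out.
enum class endian { little, big };

void encode_bits(uint32_t v, endian e, unsigned char out[4]) {
    for (int i = 0; i < 4; ++i) {
        int shift = (e == endian::little) ? 8 * i : 8 * (3 - i);
        out[i] = static_cast<unsigned char>((v >> shift) & 0xFF);
    }
}

// Choice 2: convert to text first. This conversion has its own
// parameters (base, leading zeros - the printf knobs), and the text
// encoding further downstream then has its own parameters on top.
std::string encode_text(uint32_t v, int base, int min_digits) {
    char buf[16];
    std::snprintf(buf, sizeof buf, base == 16 ? "%0*x" : "%0*u",
                  min_digits, static_cast<unsigned>(v));
    return buf;
}

Notice that choice 2 still ends up passing through choice 1's machinery eventually (the text itself has to be encoded), which is exactly why escaping is a text-to-text question.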
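And here is a rough sketch of the four-layer stack, with trivially small stand-ins for each layer (all of these names are hypothetical; layer 3 is the identity here because the string encoding already produced octets):

#include <cstdint>
#include <string>
#include <vector>

// 1. Text encoding: format numbers, escape/delimit strings in text.
std::string text_encode(uint32_t v) { return std::to_string(v); }

// 2. String encoding: reduce text to a stream of primitives (trivial
//    here because the text is ASCII; a UTF-16 encoder would emit
//    uint16_t units instead).
std::vector<uint8_t> string_encode(const std::string& s) {
    return std::vector<uint8_t>(s.begin(), s.end());
}

// 3. Primitive encoding: endianness, IEEE 754, alignment - identity
//    in this sketch since the units are already single octets.

// 4. Bitstream encoding: optional when the output is already octets;
//    otherwise chop the bits up, e.g. into base-64.
std::string bitstream_encode(const std::vector<uint8_t>& octets) {
    static const char* alpha =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    for (std::size_t i = 0; i < octets.size(); i += 3) {
        uint32_t n = octets[i] << 16;
        if (i + 1 < octets.size()) n |= octets[i + 1] << 8;
        if (i + 2 < octets.size()) n |= octets[i + 2];
        out += alpha[(n >> 18) & 63];
        out += alpha[(n >> 12) & 63];
        out += (i + 1 < octets.size()) ? alpha[(n >> 6) & 63] : '=';
        out += (i + 2 < octets.size()) ? alpha[n & 63] : '=';
    }
    return out;
}

int main() {
    // Composing the layers: 3735928559 -> "3735928559" -> octets -> base-64.
    std::string encoded = bitstream_encode(string_encode(text_encode(3735928559u)));
    (void)encoded;
    return 0;
}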
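Finally, a sketch of what I mean by factory functions from IANA descriptors to operators, and by swapping the operator at a stream position (everything here - encode_op, make_encoder, the two toy encoders - is invented for illustration):

#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// An "operator" here is just something that encodes code points to octets.
using encode_op = std::function<std::vector<uint8_t>(const std::u32string&)>;

std::vector<uint8_t> encode_us_ascii(const std::u32string& s) {
    std::vector<uint8_t> out;
    for (char32_t c : s) {
        if (c > 0x7F) throw std::range_error("not representable in US-ASCII");
        out.push_back(static_cast<uint8_t>(c));
    }
    return out;
}

std::vector<uint8_t> encode_latin1(const std::u32string& s) {
    std::vector<uint8_t> out;
    for (char32_t c : s) {
        if (c > 0xFF) throw std::range_error("not representable in ISO-8859-1");
        out.push_back(static_cast<uint8_t>(c));
    }
    return out;
}

// Factory: map IANA character-set names
// (http://www.iana.org/assignments/character-sets) to operators.
encode_op make_encoder(const std::string& iana_name) {
    static const std::map<std::string, encode_op> table = {
        {"US-ASCII", encode_us_ascii},
        {"ISO-8859-1", encode_latin1},
        // ... one entry per supported descriptor
    };
    auto it = table.find(iana_name);
    if (it == table.end())
        throw std::runtime_error("unknown charset: " + iana_name);
    return it->second;
}

int main() {
    // The header is US-ASCII and names the charset of everything after it,
    // so the operator is swapped at that stream position.
    std::vector<uint8_t> out = make_encoder("US-ASCII")(U"charset=ISO-8859-1\n");
    std::vector<uint8_t> body = make_encoder("ISO-8859-1")(U"caf\u00E9\n");
    out.insert(out.end(), body.begin(), body.end());
    return 0;
}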
John

On 6/22/07, Jeremy Maitin-Shepard <jbms@cmu.edu> wrote:
"John Hayes" <john.martin.hayes@gmail.com> writes:
While working on ordinary web software, there are actually a lot more variations on data encodings than just text and binary:
It seems fairly logical to me to have the following organization:
- Streams of arbitrary POD types
For instance, you might have uint8_t streams, uint16_t streams, etc.
- A byte stream would be a uint8_t stream.
- A text stream holding utf-16 encoded text would be a uint16_t stream, while a text stream holding utf-8 encoded text would be a uint8_t stream. A text stream holding iso-8859-1 encoded text would also be a uint8_t stream.
There is the issue of whether it is useful to have a special text stream type that is tagged (either at compile-time or at run-time) with the encoding in which the data going into or out of it are supposed to be. How exactly this tagging should be done, and to what extent it would be useful, remains to be explored.
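A minimal sketch of this organization (the types are invented for illustration, assuming compile-time tagging) might look like:

#include <cstddef>
#include <cstdint>

// Streams parameterized on their unit POD type; a text stream is a
// stream of units whose encoding is carried as a compile-time tag.
template <typename Unit>
struct stream {
    virtual ~stream() {}
    virtual std::size_t read(Unit* buf, std::size_t n) = 0;
};

using byte_stream = stream<uint8_t>;  // a byte stream is a uint8_t stream

// Encoding tags: utf-16 text rides a uint16_t stream, while utf-8 and
// iso-8859-1 text ride uint8_t streams.
struct utf8       { using unit = uint8_t;  };
struct utf16      { using unit = uint16_t; };
struct iso_8859_1 { using unit = uint8_t;  };

template <typename Encoding>
struct text_stream : stream<typename Encoding::unit> {};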
It seems that your various examples of filters/encoding, like BASE-64, URL encoding, CDATA escaping, and C++ string escaping, might well fit into the framework I described in the previous paragraphs. Many of these filters can be viewed as encoding a byte stream as text.
Let me know your thoughts, though.