
Jeremy Maitin-Shepard wrote:
- One idea from [Boost.IOStreams] to consider is the direct/indirect device distinction.
I never noticed this distinction before. It seems useful, but there are issues not unlike the AsyncIO issues. Direct devices provide a different interface. A programmer can take advantage of this interface for some purposes, but for most, I fear, the advantages would be lost. Consider:
- A direct device cannot be wrapped by filters that do dynamic data rewriting (such as (de)compression). The random access aspect would be lost.
- A direct device cannot participate in the larger stack without propagating the direct access model throughout the stack. (And this stops at the text level anyway, because the character recoder does dynamic data rewriting.) Propagating another interface means a lot of additional implementation effort and complexity.
- Binary transport layer issue:
Platforms with unusual features, like 9-bit bytes or an inability to handle types smaller than 32 bits, can possibly still implement the interface for a text/character transport layer, possibly on top of some other lower-level transport that need not be part of the Boost library. Clearly, the text encoding and decoding would have to be done differently anyway.
A good point, but it does mean that the text layer dictates how the binary layer has to work. Not really desirable when pure binary I/O has nothing to do with text I/O. One approach that occurs to me would be to make the binary transport layer use a platform-specific byte type (octets, nonets, whatever) and have the binary formatting layer convert this into data suitable for character coding.
- Seeking:
Maybe make multiple mark/reset use the same interface as seeking, for simplicity. Just define that a seeking device has the additional restriction that the mark type is an offset, and the argument to seek need not be the result of a call to tell.
Another issue is whether to standardize the return type from tell, like std::ios_base::streampos in the C++ iostreams library.
These are incompatible requirements, and the reason I want to keep the interfaces separate. Standardizing the tell return type is a good idea, and it is necessary both for type erasure to work efficiently and for arbitrary values to be usable in seek(). The type must be transparent. The return type of mark(), on the other hand, can and should be opaque. This allows for many interesting things to be done. For example:
Consider a socket. It has no mark/reset, let alone seeking support. You have a recursive descent parser that requires multiple mark/reset support. You want to parse data coming from the socket using this parser. Typically, this will mean that you have to receive all the data (which may be tricky, if parsing is required to tell you how much data to expect) and then parse the buffer. With the I/O stack, you could instead write a filter that implements mark/reset support on top of an arbitrary device.
Let's consider the simplest case: a filter that implements single mark/reset. The filter contains a simple extensible buffer (such as a std::vector, or perhaps a linked list of fixed-size buffers). When mark() is first called, it starts buffering all data that is read through it, in addition to returning it. When reset() is called, it simply starts feeding the buffered data to read requests until it runs out, at which point it goes back to calling the underlying device. When mark() is called again, it can discard all data buffered so far. (E.g. it might drop from the linked list all completely filled buffers and free their memory.)
A more complex case is multiple mark/reset. One variant is a filter that just starts buffering on the first mark() and returns the offset into the buffer where the data starts (i.e. 0). On every subsequent mark(), it returns a higher index. A reset() is passed the offset, so reading resumes from this offset. The obvious problem is that, from the first mark() on, all data has to be buffered. In situations where memory is scarce, this may not be desirable.
If the filter knew how many marks still exist, it could discard buffered data that is no longer needed. If the mark type is opaque, it can be a smart-pointer-like object with a reference count. For every buffer chunk, there could be one reference count. If the count for a chunk drops to zero, the filter knows the data in that chunk is no longer needed and can free the memory.
I want to keep this flexibility. I want mark/reset to stay separate.
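To make the opaque-mark idea a bit more concrete, here is a rough sketch of such a filter. Everything in it is hypothetical and not part of any proposed interface: string_source stands in for the socket, mark_reset_filter is the filter, and mark_type is the opaque handle. For brevity it keeps one reference count per mark instead of one per buffer chunk, and it only discards data on the next read(), but the principle is the same: once no live mark can reach a stretch of buffered data, that data is dropped.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a socket: forward-only, no mark/reset, no seek.
class string_source {
public:
    explicit string_source(std::string data) : data_(std::move(data)) {}

    // Copies up to n chars into out and returns how many were copied (0 at end).
    std::size_t read(char* out, std::size_t n) {
        n = std::min(n, data_.size() - pos_);
        std::copy_n(data_.data() + pos_, n, out);
        pos_ += n;
        return n;
    }

private:
    std::string data_;
    std::size_t pos_ = 0;
};

// Hypothetical filter adding multiple mark/reset on top of any such source.
template <class Source>
class mark_reset_filter {
public:
    // Opaque to callers: a reference-counted handle to a stream offset.
    class mark_type {
        friend class mark_reset_filter;
        std::shared_ptr<std::size_t> pos_;
    };

    explicit mark_reset_filter(Source& src) : src_(src) {}

    mark_type mark() {
        mark_type m;
        m.pos_ = std::make_shared<std::size_t>(pos_);
        marks_.push_back(m.pos_);  // stored as weak_ptr: does not keep the mark alive
        return m;
    }

    void reset(const mark_type& m) { pos_ = *m.pos_; }

    std::size_t read(char* out, std::size_t n) {
        std::size_t got = 0;
        // First replay anything already buffered past the current position.
        while (got < n && pos_ < base_ + buf_.size()) {
            out[got++] = buf_[pos_ - base_];
            ++pos_;
        }
        // Then go to the underlying source, remembering what it returns.
        if (got < n) {
            std::size_t fresh = src_.read(out + got, n - got);
            buf_.insert(buf_.end(), out + got, out + got + fresh);
            pos_ += fresh;
            got += fresh;
        }
        trim();
        return got;
    }

    std::size_t buffered() const { return buf_.size(); }

private:
    // Discards buffered data that neither a live mark nor a pending replay can reach.
    void trim() {
        marks_.erase(std::remove_if(marks_.begin(), marks_.end(),
                         [](const std::weak_ptr<std::size_t>& w) { return w.expired(); }),
                     marks_.end());
        std::size_t keep_from = pos_;
        for (const auto& w : marks_)
            if (auto p = w.lock()) keep_from = std::min(keep_from, *p);
        assert(keep_from >= base_);
        buf_.erase(buf_.begin(),
                   buf_.begin() + static_cast<std::ptrdiff_t>(keep_from - base_));
        base_ = keep_from;
    }

    Source& src_;
    std::deque<char> buf_;   // data from offset base_ up to the furthest byte read
    std::size_t base_ = 0;   // stream offset of buf_[0]
    std::size_t pos_ = 0;    // current logical read offset
    std::vector<std::weak_ptr<std::size_t>> marks_;
};
```

A parser would call mark() before trying an alternative, reset() with that mark on failure, and simply let the mark go out of scope on success. Because the mark is not a plain offset, the filter can tell when it has been dropped, which an offset-based seek interface could not offer.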
- Binary formatting (perhaps the name data format would be better?):
Sounds like a good name.
I think it is important to provide a way to format {uint,int}{8,16,32,64}_t as either little or big endian two's complement (and possibly also one's complement).
Yes, that's pretty much the idea behind this layer.
It might be useful to look at the not-yet-official boost endian library in the vault.
I will.
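As a rough illustration of what formatting a fixed-width integer in a chosen byte order means, here is a minimal sketch for the 32-bit case. The names byte_order, format_u32, format_i32 and parse_u32 are made up for this example; they are not the interface of the endian library in the vault, nor of the layer being designed. The shifts operate on the value, so the result does not depend on the host's native byte order.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class byte_order { little, big };

// Hypothetical helper: writes v as four octets in the requested order.
inline std::array<std::uint8_t, 4> format_u32(std::uint32_t v, byte_order order) {
    std::array<std::uint8_t, 4> out{};
    for (std::size_t i = 0; i < 4; ++i) {
        // Octet i holds the least significant bits first (little) or last (big).
        std::size_t shift = (order == byte_order::little) ? 8 * i : 8 * (3 - i);
        out[i] = static_cast<std::uint8_t>(v >> shift);
    }
    return out;
}

// Signed values: converting to the unsigned type is defined modulo 2^32,
// which yields exactly the two's complement bit pattern.
inline std::array<std::uint8_t, 4> format_i32(std::int32_t v, byte_order order) {
    return format_u32(static_cast<std::uint32_t>(v), order);
}

// Hypothetical inverse of format_u32.
inline std::uint32_t parse_u32(const std::array<std::uint8_t, 4>& in, byte_order order) {
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        std::size_t shift = (order == byte_order::little) ? 8 * i : 8 * (3 - i);
        v |= static_cast<std::uint32_t>(in[i]) << shift;
    }
    return v;
}
```

A one's complement variant would need an explicit conversion step for negative values; that is left out here.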
It should probably also be possible to determine using the library at compile time what the native format is.
To what end? If the native format is one of the special predefined ones, it will hopefully be optimized in the platform-aware special implementation (well, I can dream) anyway.
- Header vs Precompiled:
I think as much should be separately compiled as possible, but I also think that type erasure should not be used in any case where it will significantly compromise performance.
I'm thinking of a system where components are templates on the component they wrap, so as to allow direct calls upwards. I'm thinking of using the common separately compiled template specialization extension of compilers to provide pre-compiled versions of the standard components instantiated with the erasure components. This is very similar to how Spirit works, except that it doesn't have pre-compiled stuff. In Spirit, rule is the erasure type, but the various parsers can be directly linked, too. Then, if the performance is needed, the programmer can hand-craft his chain so that no virtual calls are made, at the cost of compiling his own copy of the components. I'm afraid I don't see a better way of doing this. I'm wide open to suggestions.
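A small sketch of how I imagine this, with all names (memory_source, any_source, upper_filter) invented for the example. The filter is a template on the component it wraps, so a hand-crafted chain calls straight through. The same template instantiated with the erasure type pays one virtual call per read, and that instantiation is the one a library could ship pre-compiled.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical concrete device reading from a string.
class memory_source {
public:
    explicit memory_source(std::string data) : data_(std::move(data)) {}
    std::size_t read(char* out, std::size_t n) {
        n = std::min(n, data_.size() - pos_);
        std::copy_n(data_.data() + pos_, n, out);
        pos_ += n;
        return n;
    }
private:
    std::string data_;
    std::size_t pos_ = 0;
};

// Hypothetical erasure component: hides any source behind a virtual call.
class any_source {
public:
    template <class Source>
    explicit any_source(Source s)
        : impl_(std::make_unique<model<Source>>(std::move(s))) {}
    std::size_t read(char* out, std::size_t n) { return impl_->read(out, n); }
private:
    struct concept_t {
        virtual ~concept_t() = default;
        virtual std::size_t read(char* out, std::size_t n) = 0;
    };
    template <class Source>
    struct model final : concept_t {
        explicit model(Source s) : src(std::move(s)) {}
        std::size_t read(char* out, std::size_t n) override { return src.read(out, n); }
        Source src;
    };
    std::unique_ptr<concept_t> impl_;
};

// Hypothetical filter, a template on the component it wraps.
template <class Source>
class upper_filter {
public:
    explicit upper_filter(Source s) : src_(std::move(s)) {}
    std::size_t read(char* out, std::size_t n) {
        std::size_t got = src_.read(out, n);  // direct call unless Source is any_source
        std::transform(out, out + got, out, [](unsigned char c) {
            return static_cast<char>(std::toupper(c));
        });
        return got;
    }
private:
    Source src_;
};

// Explicit instantiation with the erasure type. In a real library this line
// would sit in one source file, with a matching "extern template" declaration
// in the header, so that users link against the pre-compiled copy.
template class upper_filter<any_source>;
```

upper_filter<memory_source> compiles its own copy and makes no virtual calls; upper_filter<any_source> reuses the shipped instantiation. The cost of the first option is exactly the extra compilation mentioned above.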
- The "byte" stream and the character stream, while conceptually different, should probably both be considered just "streams" of particular POD types.
I have explained in a different post why I don't think this is a good idea.
- Text transport:
I don't think this layer should be restricted to Unicode encodings.
I have no plans of doing so. I just consider all encodings as encodings of the universal character set. An encoding is defined by how it maps the UCS code points onto groups of octets, words, or other primitives.
For full generality, the library should provide facilities for converting between any two of a large list of encodings.
No, I'll leave that to a character support library. (Parts of the library I will have to specify and implement to build the text converter on, but that has time. The binary stuff comes first.)
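To illustrate the view of an encoding as a mapping from UCS code points onto groups of octets, here is a small sketch with two such mappings. The function names encode_utf8 and encode_latin1 are invented for the example and are not a proposed interface. UTF-8 maps every code point to one to four octets; ISO-8859-1 maps only U+0000 to U+00FF, each to a single octet.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical encoder: one UCS code point to its UTF-8 octets.
inline std::string encode_utf8(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Hypothetical encoder: the same code points, mapped by ISO-8859-1.
// The mapping is partial: anything above U+00FF has no representation.
inline std::string encode_latin1(char32_t cp) {
    if (cp > 0xFF)
        throw std::invalid_argument("code point not representable in ISO-8859-1");
    return std::string(1, static_cast<char>(cp));
}
```

Seen this way, a non-Unicode encoding is just another (possibly partial) mapping of the same kind, so the text layer does not need a special case for it.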
- Text formatting:
For text formatting, I think it would be very useful to look at the IBM ICU library.
I have. Some interesting ideas there.
It may in fact make sense to leave text formatting as a separate library.
Each of my layers can be considered a separate library. That's the way I'll implement them, too. I just design and present them at once because I need to consider the requirements of the higher levels on the lower levels.
Sebastian Redl