
I think we can agree to the following:

- Openable and Closeable interfaces. (Nasty spelling - like useable/usable, both closeable and closable are widely used; the same probably applies to localizeable/localizable, and probably localisable too in England.)

- Either all filters are nonblocking, or some filters are blocking and can't be attached to nonblocking sources/sinks. (Is there an easy way to make the boilerplate for the non-bufferable type simple enough that copy/paste isn't needed? Obviously the preprocessor can always be used, but it isn't self-documenting.)

Personally, I favor eliminating blocking filters; they aren't that much simpler, IMO. I guess we're more or less talking about cooperative vs. preemptive multitasking: if you want to use simple blocking filters, you're forced to use a thread for each blocking source/sink, rather than nonblocking/multiplexed I/O with nonblocking filters.
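For concreteness, here is one shape the all-nonblocking option could take - a sketch only, and every name in it (nb_status, nonblocking_source) is hypothetical rather than anything proposed so far:

    #include <cstddef>

    // Tri-state result: data available, end of stream, or "try again
    // later" (the EAGAIN case), so no component ever has to block.
    enum nb_status { nb_ok, nb_eof, nb_would_block };

    struct nb_result {
        nb_status   status;
        std::size_t count;   // characters transferred when status == nb_ok
    };

    struct nonblocking_source {
        // Never blocks: returns nb_would_block instead of waiting, so any
        // filter layered on top must be prepared to give up and resume later.
        virtual nb_result read(char* s, std::size_t n) = 0;
        virtual ~nonblocking_source() {}
    };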
Suppose I have an input filter f which expects that each time it invokes get() on the component immediately downstream it will block until the next character is available or the end of the stream is reached. Suppose that g is immediately upstream from f and wants to call f.get(). Supposing that no input is going to be available for a long time, how can you turn that into a non-blocking call?
This is tricky. I believe I already described the method for a blocking output filter, and the same strategy *almost* works for a blocking input filter as well. Consider

    source . A . B

where A is a blocking filter and source is a nonblocking source (it can return EAGAIN when you get/read from it). B is either the user's code or another nonblocking filter. Now insert wrappers:

    source . inWrapper . A . outWrapper . B

inWrapper communicates secretly with outWrapper, bypassing A entirely and just returning EAGAIN to B whenever inWrapper wouldn't have anything to give A. But we can't predict ahead of time how much input A may consume in order to generate a single output character in response to get(), so the method of having outWrapper bypass A whenever inWrapper has nothing for A actually can't work. We could still create a thread somewhere and patch things up, but let's just say that it's too complicated for now - besides, you can just use blocking components and put the whole process that reads from source into its own thread. Threads + blocking is easier than asynchronous/cooperative anyway, so such a simulation would be of dubious value.

This discussion has caused me to seriously question the need for a simple InputFilter interface, or even a "pull" InputFilter at all. The Symmetric Filter idea seems most appropriate to me (and sufficiently general, if you allow arbitrary iterator types instead of just char *). Since we now allow boost::read to return fewer characters than requested (even 0) without meaning EOF/error, if you really only want to output one character and then return, you can conform to the MultiCharacter (Buffered) interface but ignore n, always returning at most one character. Additionally, except for the simplest filters, you still have to deal with read requests that are smaller than what you would prefer to output, and due to the lack of continuations/coroutines in the language, you'll have to reify your current state and resume the next time somebody tries to read you.

I suppose that an InputFilter really does allow you to write your parsing/recognition code in buffer-oblivious form (that is, you keep reading until you get a whole block of compressed data, or until you completely match your pattern), and you can reify your output state by simply buffering a whole chunk of output internally. But what's the problem with buffering input if you need to? You already have to potentially buffer output.

There is a pretty well known stream/message pipeline mechanism: linked lists of buffer chunks allocated by the producer, passed by reference/pointer to the consumer, and freed when totally consumed. (You could allow a transfer-ownership operation, which would be nice for in-place transforms, but you don't really need one.) Think message passing/queuing, but ignoring message boundaries and interpreting the data as a stream. Fixed-size buffer chunks (that may be partially full) are usually called "mbufs", and a list of them is an mbuf chain. The only complexity involved in processing chains is that you have to either write your filter so that it operates character by character (a convenience iterator could step through the chain), or reassemble the mbuf fragments into one contiguous buffer for reading. You can also adopt more complex logic that works directly on a sequence of mbufs (for instance, POSIX supports writing/reading files/sockets to/from a sequence of noncontiguous buffers - so-called "scatter/gather I/O"). You can imagine variations where you also allow circular buffering rather than one-pass use of mbufs.
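To make the mbuf idea concrete, here is a rough sketch - every name in it is mine, not from any existing library - of fixed-size chunks linked into a chain, plus a convenience reader that steps through the chain character by character and frees each chunk once it's fully consumed:

    #include <cstddef>

    // One fixed-size chunk; the producer allocates these with new, fills
    // [begin, end), and links them into a chain. A chunk may be partially full.
    struct mbuf {
        enum { capacity = 2048 };
        char        data[capacity];
        std::size_t begin;  // first unconsumed byte
        std::size_t end;    // one past the last valid byte
        mbuf*       next;   // next chunk in the chain, or 0
    };

    // Convenience "iterator": yields one character at a time across chunk
    // boundaries, deleting each chunk as soon as it is completely consumed.
    struct mbuf_reader {
        mbuf* cur;
        explicit mbuf_reader(mbuf* chain) : cur(chain) {}
        bool get(char& c) {              // false == chain exhausted
            while (cur && cur->begin == cur->end) {
                mbuf* spent = cur;
                cur = cur->next;
                delete spent;            // freed when totally consumed
            }
            if (!cur) return false;
            c = cur->data[cur->begin++];
            return true;
        }
    };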
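Going back to the point about conforming to the MultiCharacter interface while only ever producing one character per call: a minimal sketch, assuming the read() signature from the my_filter example quoted further down, and assuming (my invention, purely for illustration) that Source exposes get() returning the next character or -1 at EOF:

    #include <cctype>
    #include <ios>  // std::streamsize

    // Conforms to the multi-character read() convention (count read, -1
    // for EOF) but deliberately ignores n, returning at most one character
    // per call. The uppercasing is just a placeholder transformation.
    struct one_char_at_a_time_filter {
        template<typename Source>
        std::streamsize read(Source& src, char* s, std::streamsize n) {
            if (n < 1)
                return 0;                // 0 means "nothing yet", not EOF
            int c = src.get();           // assumed Source operation
            if (c == -1)
                return -1;               // EOF
            s[0] = static_cast<char>(std::toupper(c));
            return 1;                    // at most one character, ignoring n
        }
    };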
This reminds me: please consider the possibility of (someday in the future) automagically running pipelined filters in their own threads (properly synchronized, of course), and possibly some policy allowing the granularity of work to be increased (generally, larger buffers would allow this) to minimize context-switching overhead - which exists even for nonthreaded pipelines, because of cache coherency, state/continuation reification, and, least important, function call overhead. mbuf chains might be easier to synchronize efficiently in a multithreaded environment than semaphores/circular buffers (wild speculation).
I guess you can repeatedly open() and close() fstreams that way, although I've never wanted to.
I want to provide the same functionality that the standard library provides. People would complain if you couldn't close and reopen an fstream, don't you think?
Suffice it to say that I was heretofore unaware that such a capability existed, and you're obviously right to want it for your library :)
struct my_filter : input_filter {
    template<typename Source>
    std::streamsize read(Source& src, char* s, std::streamsize n);
};
Here the interpretation of the return value is clear: return the number of characters read, or -1 for EOF. The trouble is, there are some simple filters that are very hard to express with the above interface.
As I mentioned earlier, this wouldn't add any difficulties at all. You can simply return at most one character, no matter how many are requested, if that really simplifies your implementation. Naturally, there's some runtime overhead that wouldn't be necessary with your separate one-character interface, but doing things that way is generally less efficient anyway - you've already opted for programmer time over machine time.

Sorry this has gotten so involved - I am perfectly happy to just use whatever interfaces you have now; it's better than what I have (i.e., nothing). The binary I/O you propose (in "future directions") sounds cool too - especially if the library had both native and "network byte order" variants.

Obviously, anything involving passing mbuf chains instead of internal buffering would be a huge change to the interface and implementation; I don't think it needs to happen, and if it does, there would be a place for it as a second library that also handles message-passing pipelines (where message boundaries are preserved rather than being mere implementation details of streams).

Yrs,
Jonathan