
Christopher Kohlhoff wrote:
From my reading of the documentation, if I get a partial match and I want to continue to try for a full match I must buffer the entire data from the beginning of the partial match. This means that in the partial_regex_grep example it cannot find a substring match that is greater then 4096 characters long, because that is all the data it will buffer.
For the example in the docs, that's correct.
Furthermore, each time I want to retest the input against the expression it must process the whole input string again. This is not ideal from an efficiency point of view, since I could potentially receive input data one byte at a time.
True.
What I want is a stateful regular expression-based decoder object state. I can feed it more input which will cause more state transitions, and it will tell me when it reaches a terminal state.
Understood. Perhaps what is called for is a "pull" iterator that buffers chunks of data at a time, and when the buffer underflows, it fetches another chunk of data. That way, libraries like xpressive and Spirit can keep their iterator-based interface and not worry whether or not "++begin" goes to disk for the next 4Kb, or reads from a socket, or whatever. Would that address your problem?
I never have to buffer more input than the block just read because earlier input will have been fully consumed by the decoder.
That's not the case for a backtracking regex engine like Boost.Regex or xpressive. These libraries require bidirectional iterators because they may need to back out state transitions and decrement the iterator to try a different alternative. You'll need to buffer everything read so far, or else write it to a tmp file so you can get it back should you need it. And the problem of returning a partial match and persisting the current state of the state machine is a hard one. Some implementations maintain their state on the program stack, so returning effectively wipes out all that information. These implementations would need to somehow serialize the state stored on the program stack, and then de-serialize it in order to begin executing where it left off. Tricky stuff. -- Eric Niebler Boost Consulting www.boost-consulting.com