regex and stream processing

29 Sep 2003

      Message forwarded from Dave Gomboc:
...
We have a stream of text data (actually it is a chunked 
response of a webserver, but it might be a file or 
something...) The data size may be very large and we need to 
do some find/replace in this stream using regular 
expressions. It is not possible to load entire stream into 
the memory. I'm not even hope that Regex++ could work with 
such kind of streams, but maybe someone can suggest me a 
solution to this problem?
This is possible without much trouble _if_ the text strings you are 
trying to match will fit into main memory, even if all the data won't. 
Use a std::list as a buffer, so that deleting and inserting is 
inexpensive, and you don't have to perform copies and update iterators 
when reading in more input.  Fill your (sliding window) buffer, then 
regex it, using the partial match feature whenever you have not already 
detected the end of the file.  If a match occurs, any characters 
occuring before the match can be processed (serialized to disk, or 
whatever) and removed from the buffer.  Repeat this looking for a match 
until you reach the end of current input.  You'll either have "no match" 
or "partial match" at this point: remove non-matching characters (that 
could be all of them) at this point, then re-fill your buffer with more 
data and keep going.  If you're just doing one substitution, that's 
fine, but if more, there are a few things to keep in mind. 

I have done over 60 distinct substitutions at a time on a stream of 0.1 
gigabytes using this technique, and it works quite quickly.  (I used a 
buffer that held about 1 million characters (2^20), but I also tested 
with buffers as small as 256 characters  -- the strings I was matching 
were not large).  Just read in data, process all the substitutions in 
order, write the data out and repeat.  If you're doing multiple 
substitutions, you (of course) can't flush any data from memory until 
you've tried all the substitutions on the current region of interest. 
Also, for each later substituion, the current region of interest shrinks 
because you have to acknowledge partial matches from earlier 
subsitutions.  For example, say that there is no partial match at the 
end of substituions 1 through 6 (there might have been full matches 
where substitution was performed, that's okay.)  But substituion 7 has a 
partial match at the last five characters.  Now, your region of interest 
for substitution 8 is five characters smaller, because you don't want to 
process those last five characters because substitution 7 might change 
them once you've read in more input.  The order of your substituions is 
important if the outputs of some desired subsitutions overlap with the 
inputs of others. 

If you are not grokking my meaning, I can fish around for the source 
code.  Perhaps it's easier to explain in code than in words, my code for 
the entire process is maybe 10 or 15 lines.  An alternative is to try to 
do some streambuf hacking, but the above method was simpler and adequate 
for my purposes. 

Dave 

---------------------------------
Want to chat instantly with your online friends? Get the FREE Yahoo!Messenger

John Maddock

tags

participants (1)