Message forwarded from Dave Gomboc:
We have a stream of text data (actually it is a chunked response of a webserver, but it might be a file or something...) The data size may be very large and we need to do some find/replace in this stream using regular expressions. It is not possible to load entire stream into the memory. I'm not even hope that Regex++ could work with such kind of streams, but maybe someone can suggest me a solution to this problem?
This is possible without much trouble _if_ the text strings you are trying to match will fit into main memory, even if all the data won't. Use a std::list as a buffer, so that deleting and inserting is inexpensive, and you don't have to perform copies and update iterators when reading in more input. Fill your (sliding window) buffer, then regex it, using the partial match feature whenever you have not already detected the end of the file. If a match occurs, any characters occuring before the match can be processed (serialized to disk, or whatever) and removed from the buffer. Repeat this looking for a match until you reach the end of current input. You'll either have "no match" or "partial match" at this point: remove non-matching characters (that could be all of them) at this point, then re-fill your buffer with more data and keep going. If you're just doing one substitution, that's fine, but if more, there are a few things to keep in mind. I have done over 60 distinct substitutions at a time on a stream of 0.1 gigabytes using this technique, and it works quite quickly. (I used a buffer that held about 1 million characters (2^20), but I also tested with buffers as small as 256 characters -- the strings I was matching were not large). Just read in data, process all the substitutions in order, write the data out and repeat. If you're doing multiple substitutions, you (of course) can't flush any data from memory until you've tried all the substitutions on the current region of interest. Also, for each later substituion, the current region of interest shrinks because you have to acknowledge partial matches from earlier subsitutions. For example, say that there is no partial match at the end of substituions 1 through 6 (there might have been full matches where substitution was performed, that's okay.) But substituion 7 has a partial match at the last five characters. Now, your region of interest for substitution 8 is five characters smaller, because you don't want to process those last five characters because substitution 7 might change them once you've read in more input. The order of your substituions is important if the outputs of some desired subsitutions overlap with the inputs of others. If you are not grokking my meaning, I can fish around for the source code. Perhaps it's easier to explain in code than in words, my code for the entire process is maybe 10 or 15 lines. An alternative is to try to do some streambuf hacking, but the above method was simpler and adequate for my purposes. Dave --------------------------------- Want to chat instantly with your online friends? Get the FREE Yahoo!Messenger
participants (1)
-
John Maddock