[Regex | Xpressive] Efficiently "grepping" large files

Hello,

I am looking into Boost.Regex and Boost.Xpressive for implementing grep-like functionality on large files. More specifically, I want to check whether a file contains a match for an expression which does not need much context (a few KB). regex_search with BidirectionalIterators over the whole file works, but there is a complication: I want to perform other operations on the file in parallel in other threads (such as calculating hashes and extracting other data). Since this is mostly I/O bound, I would like to avoid rereading portions of the file as much as possible. My idea was to implement a circular buffer which may block the workers (and the reader, if data is ever read faster than it is analyzed). The circular buffer needs to know when old data is no longer required, though, that is, when regex_search no longer works on it or backtracks into it.

The best idea so far is to read blocks of data and run slightly overlapping calls of regex_search on those blocks. This approach has at least two problems:

1) Matches and context are potentially limited to the size of the overlap.
2) The semantics of the match might change if I choose an arbitrary cut for the overlap (e.g. ^ and $ might behave differently because I cut the beginning or end of a line).

I would not mind specifying a maximum context of a few KB up to maybe 1 MB, but I do not want to change semantics. Is there a better approach? I was thinking about supplying some intelligent iterators, but Regex/Xpressive might keep copies around longer than I anticipate, so I still could not decide where they are and how far they may still backtrack. If that is not really possible with the Boost libraries, I would be interested in alternatives, too; I am already considering looking into some grep implementations.

Thanks in advance!

Thomas Luzat
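For reference, the overlapped-block idea can be sketched as follows. std::regex is used only to keep the sketch self-contained, and grep_blocks and its parameters are illustrative names, not an existing API. Each window re-scans the last `overlap` bytes of its predecessor, and matches starting inside that tail are deferred to the next window so nothing is reported twice:

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Scan `text` as if it arrived in blocks of `block` bytes, re-searching the
// last `overlap` bytes of each window together with the next one.
// Requires overlap < block. Matches longer than `overlap` may be truncated
// or missed entirely: exactly problem (1) above.
std::vector<std::string> grep_blocks(const std::string& text,
                                     const std::regex& re,
                                     std::size_t block,
                                     std::size_t overlap)
{
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        const std::size_t len = std::min(block, text.size() - pos);
        const bool last = (pos + len == text.size());
        // Matches starting in [fresh_end, pos + len) are deferred to the
        // next window, which begins at fresh_end, so nothing is reported
        // twice.
        const std::size_t fresh_end = last ? pos + len : pos + len - overlap;
        const std::string window = text.substr(pos, len);
        for (std::sregex_iterator it(window.begin(), window.end(), re), end;
             it != end; ++it)
            if (pos + it->position() < fresh_end)
                out.push_back(it->str());
        if (last) break;
        pos = fresh_end;
    }
    return out;
}
```

As problem (1) predicts, any match longer than the overlap can be truncated or missed, and as in problem (2), ^ and $ see artificial buffer boundaries rather than the real line structure of the file.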

I am looking into Boost.Regex and Boost.Xpressive for implementing grep-like functionality on large files. More specifically I want to [...]
Have you looked at the "partial match" feature of Boost.Regex?
http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/boost_regex/partial...

HTH, John.

On 2011-08-17 09:51, John Maddock wrote:
I am looking into Boost.Regex and Boost.Xpressive for implementing grep-like functionality on large files. More specifically I want to [...] Have you looked at the "partial match" feature of Boost.Regex: http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/boost_regex/partial...
This looks great! I was looking into the flags, but misunderstood the meaning of match_partial. This page really helped and gives a great example.

Cheers,
Thomas Luzat

--
Thomas Luzat Softwareentwicklung    USt-IdNr.: DE255529066
Kaiserring 2a                       Mobile: +49 173 2575816     Web: http://luzat.com
46483 Wesel                         Fax: +49 173 502575816      E-Mail: thomas@luzat.com

On Tue, Aug 16, 2011 at 3:44 PM, Thomas Luzat wrote:
The circular buffer needs to know when old data is no longer required, though, that is regex_search no longer works on it or backtracks into it. [...] I am already considering looking into some grep implementations.
Have you considered mmap'ing the file and allowing all your activity to occur on the mmap'd file? That way the VM subsystem would worry about paging things in or out as necessary, and there wouldn't be any issues with contention across multiple threads. Of course, if you don't have mmap on your system...

On 2011-08-17 14:43, Chris Cleeland wrote:
Have you considered mmap'ing the file and allowing all your activity to occur on the mmap'd file? [...]
I have considered mmapping or reading through the whole file, but benchmarking so far has shown that I am mostly I/O-limited. By working on blocks synchronously in parallel I avoid disk seeks as much as possible. I might offer such an implementation for cases where seeks are not that expensive (such as for SSDs or slower CPUs).

Another problem is that mmap alone is not a complete solution on 32-bit systems, given that files may very well be larger than a few GB, but this can be solved now, too.

Cheers,
Thomas Luzat

On Wed, Aug 17, 2011 at 7:53 AM, Thomas Luzat wrote:
On 2011-08-17 14:43, Chris Cleeland wrote:
Have you considered mmap'ing the file and allowing all your activity to occur on the mmap'd file? [...]
I have considered mmaping or reading through the whole file, but benchmarking so far has shown that I am mostly I/O-limited. By synchronously working on blocks in parallel I avoid disk seeks as much as possible.
Ah, very well. I've also had similar situations wherein mmap provided no performance benefit over reading through well-tuned buffered I/O, since accesses were mostly sequential.
I might offer such an implementation for cases where seeks are not that expensive (such as for SSDs or slower CPUs).
If main memory is large enough, a seek is likely to hit an existing page rather than something that must be paged in.
Another problem is that mmap alone is not a complete solution in itself on 32 bit systems given that files may very well be larger than a few GB, but this can be solved now, too.
That's a much more difficult issue to deal with. Not impossible, but definitely more difficult, and would nudge me towards the solution you originally inquired about.
participants (3)
- Chris Cleeland
- John Maddock
- Thomas Luzat