[iostreams] regex_filter how-to
Hello,
I have the following problem:
I need to filter some records from one file and save them to another in
my C++ application.
For example from this file:
/////////////////// in.txt ////////////////////////////////
http://google.com
http://yahoo.com
...
http://google.com/analytics
////////////////////////////////////////////////////////////
I want to extract only the lines that match this regex:
^(?:http://google.com).*
to get:
//////////////// out.txt ///////////////////////////////
http://google.com
...
http://google.com/analytics
/////////////////////////////////////////////////////////
So I wrote something like this:
class Writer
{
public:
Writer()
:matchesCount_(0){}
virtual std::string operator() (const boost::match_results
Michał wrote:
So I wrote something like this: [snip]
filtering_istream first(boost::iostreams::regex_filter(match_lower, FileWriter(&out))); [snip]
It works fine for short files (IMO for files whose size is smaller than the size of the stream buffer). But I work with very large files (~4.7 GB), and for those this is not a good solution. Do you have any idea how to solve it?
IOStreams' regex_filter loads the whole file into memory before applying the regex to it, because the regex algorithms require a bidirectional iterator, IIRC.

If your pattern always matches on a single line, you could use getline() and then apply the regex to each line separately.

Alternatively, take a look at the Boost.Regex "partial match" feature (http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/partial...), which will allow you to apply the regex to "chunks".

HTH,
Éric Malenfant
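The getline() suggestion above could look roughly like the following. This is only a sketch under a few assumptions: it uses std::regex from the standard library in place of boost::regex so that it has no external dependencies, and the function name filter_lines is made up for illustration. It holds one line in memory at a time, so the size of the input file should not matter.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <regex>
#include <string>

// Hypothetical helper (not part of Boost): copies every line of `in` that
// matches `pattern` in full to `out` and returns how many lines were written.
// Only the current line is kept in memory.
std::size_t filter_lines(std::istream& in, std::ostream& out,
                         const std::regex& pattern)
{
    std::size_t matches = 0;
    std::string line;
    while (std::getline(in, line)) {
        // regex_match requires the whole line to match, which suits a
        // pattern of the form ^(?:prefix).*
        if (std::regex_match(line, pattern)) {
            out << line << '\n';
            ++matches;
        }
    }
    return matches;
}
```

With the pattern from the original post, this would be called with a std::ifstream opened on in.txt, a std::ofstream opened on out.txt, and std::regex("^(?:http://google.com).*"). Note that the unescaped dots in that pattern match any character, not only a literal dot.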
IOStreams' regex_filter loads the whole file into memory before applying the regex to it.
No, it doesn't. As I said before, when you compile my code and run it against a very large file in which all matching lines lie near the end, the output file will be empty. It can't load the whole file because streams use buffers.
If your pattern always matches on a single line, you could use getline().
I can't. I want to use the iostreams regex_filter because it applies the regex to large chunks of data, and I hope it will be much faster than applying it line by line. Otherwise I would not use iostreams at all.
Alternatively, take a look at the Boost.Regex "partial match"
I will take a look. -- Regards Michał Nowotka
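One possible middle ground between line-by-line reading and loading the whole file is to read large blocks and match the complete lines inside each block, carrying any unfinished line over to the next block. The sketch below makes several assumptions: it uses std::regex instead of boost::regex, the name filter_chunked and the default block size of 1 MiB are invented for illustration, lines end in '\n', and the pattern never needs to match across a line break. Whether it is actually faster than getline() would have to be measured.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): reads `in` in blocks of
// `chunk_size` bytes and writes every line that matches `pattern` in full
// to `out`. Memory use is bounded by the block size plus the longest line.
std::size_t filter_chunked(std::istream& in, std::ostream& out,
                           const std::regex& pattern,
                           std::size_t chunk_size = 1 << 20)
{
    std::size_t matches = 0;
    std::string pending;                  // unprocessed data, ends with a partial line
    std::vector<char> block(chunk_size);

    // Matches the character range [first, last) without copying it.
    auto emit_if_match = [&](const char* first, const char* last) {
        if (std::regex_match(first, last, pattern)) {
            out.write(first, last - first);
            out.put('\n');
            ++matches;
        }
    };

    while (in) {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0) break;
        pending.append(block.data(), static_cast<std::size_t>(got));

        // Process every complete line currently in the buffer.
        std::size_t start = 0;
        for (std::size_t nl;
             (nl = pending.find('\n', start)) != std::string::npos;
             start = nl + 1) {
            emit_if_match(pending.data() + start, pending.data() + nl);
        }
        pending.erase(0, start);          // keep only the trailing partial line
    }

    // A final line without a terminating newline is still checked.
    if (!pending.empty())
        emit_if_match(pending.data(), pending.data() + pending.size());
    return matches;
}
```

For binary-exact block reads the input stream would normally be opened with std::ios::binary, in which case Windows-style "\r\n" line endings would leave a trailing '\r' on each line that the pattern has to tolerate.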
Michał Nowotka wrote:
IOStreams' regex_filter loads the whole file into memory before applying the regex to it.
No, it doesn't.
Yes it does. Or I misread the docs and code :)
basic_regex_filter is an aggregate_filter. From the IOStreams docs on aggregate_filter (http://www.boost.org/doc/libs/1_40_0/libs/iostreams/doc/classes/aggregate.html):
"The class template aggregate_filter is a DualUseFilter for use as a base class by Filters which filter an entire character sequence at once. Because a aggregate_filter must read an entire character sequence into memory before it begins to filter,[...]"
As I said before, when you compile my code and run it against a very large file in which all matching lines lie near the end, the output file will be empty. It can't load the whole file because streams use buffers.
That surprises me. Looking at the implementation of aggregate_filter (http://www.boost.org/doc/libs/1_40_0/boost/iostreams/filter/aggregate.hpp), I don't see any provisions for out-of-memory conditions, so I would have expected an attempt to read a multi-gigabyte file to result in a std::bad_alloc exception.
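To make the documented behaviour concrete, here is a deliberately simplified model of the aggregate pattern the quoted docs describe: accumulate everything first, filter once at the end. It is not Boost's aggregate_filter; the class AggregateModel and its members are hypothetical and use only the standard library. The point is that the buffer grows with every write, so memory use is proportional to the total input size.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the "aggregate" idea, NOT Boost's implementation:
// write() only stores data, and the transformation runs once in close().
class AggregateModel {
public:
    using Transform = std::function<std::string(const std::string&)>;

    explicit AggregateModel(Transform transform)
        : transform_(std::move(transform)) {}

    // Nothing is filtered or emitted here; the data is just appended.
    void write(const std::string& data) { buffer_ += data; }

    // Number of bytes currently held in memory.
    std::size_t buffered() const { return buffer_.size(); }

    // Applies the transformation to the complete sequence and resets the buffer.
    std::string close()
    {
        std::string result = transform_(buffer_);
        buffer_.clear();
        return result;
    }

private:
    Transform transform_;
    std::string buffer_;
};
```

With a design like this, a multi-gigabyte input has to fit into the buffer before any filtering starts, which is why an allocation failure would be the expected outcome on a machine without enough memory.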
Michał Nowotka wrote:
Hello, I have the following problem:
I need to filter some records from one file and save them to another in my C++ application.
Hi, Spirit might also be an alternative. I rewrote the sample from here http://spirit.sourceforge.net/distrib/spirit_1_8_5/libs/spirit/example/funda... into the attached program. It's not nice and it has global variables, but it probably shows how this could be done. Regards, Roland
participants (3)
- Eric MALENFANT
- Michał Nowotka
- Roland Bock