----------------------------------------
To: boost-users@lists.boost.org
Date: Mon, 30 Mar 2009 12:48:48 -0400
From: dfs@savarese.org
Subject: Re: [Boost-users] regex iterator question
In message , "Robert Ramey" writes:
Thinking about it, this problem must come up very often. How is it usually addressed? There must be a simple bridge across this. In a pinch, I'll just have to load the whole file into some sort of collection, but I prefer the ultimate unlimited-file-size solution.
I always like unbounded solutions too, but if you try to stream a file past a regex engine it is likely to be slow and, as others have pointed out, not even reasonable in the general case. You may still want to think about specialized cases that could benefit from whatever restricted set of regexes you have. You would have to think about "strategies" or similar notions that look at the problem and pick an approach or specific implementation based on its parameters. I originally came here to run thousands of regex queries on megabyte strings and used Boost and Greta for testing, but I quickly found ways to compile query/sample vectors and implement restricted searches once I saw that all the not-so-regular expressions fit a given constraint, or that I could do simple things like sorts to preserve locality later. There are a lot of potential performance limitations depending on the specific task parameters and machine. But, yes, it would be nice if someone had a general "strategy" library. LOL.
In the worst case, if you're using Perl-style expressions (or any style that isn't strictly "regular" and requires backtracking; lookahead assertions are a common culprit), the entire input may have to be consumed and buffered even if the expression ultimately matches only a few characters (see "On the Use of Regular Expressions for Searching Text", Clark and Cormack, ACM Transactions on Programming Languages and Systems, Vol. 19, No. 3, pp. 413-426). Therefore, if you're dealing with small files, you may as well buffer the entire file in a char array and use regex_token_iterator. If you're dealing with large files, memory-map the file instead, so the matcher still sees one contiguous character range without your having to copy the file into a buffer yourself.
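For the small-file case, a rough sketch of what I mean follows. It is written against std::regex from C++11 so that it compiles without Boost, which is an assumption on my part; with Boost.Regex the equivalent type would be boost::cregex_token_iterator. The names slurp and find_tokens are made up for illustration and are not part of either library.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: read a whole (small) file into a char buffer.
// Returns an empty buffer if the file cannot be opened.
std::vector<char> slurp(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Hypothetical helper: collect every match of `re` in the contiguous
// character range [first, last). Submatch index 0 selects the whole match.
std::vector<std::string> find_tokens(const char* first, const char* last,
                                     const std::regex& re) {
    assert(first <= last);  // the range must be well formed
    std::vector<std::string> out;
    std::cregex_token_iterator it(first, last, re, 0);
    std::cregex_token_iterator end;  // default-constructed: end of sequence
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

Because find_tokens only needs a pair of const char* pointers, the same call should work unchanged on a memory-mapped region: pass the mapping's base pointer and base pointer plus length (for example from POSIX mmap, or from data() and size() of a boost::iostreams::mapped_file_source) instead of the buffer returned by slurp.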
daniel
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users