Re: [Boost-users] regex iterator question
----------------------------------------
To: boost-users@lists.boost.org Date: Mon, 30 Mar 2009 12:48:48 -0400 From: dfs@savarese.org Subject: Re: [Boost-users] regex iterator question
In message , "Robert Ramey" writes:
Thinking about it, this problem must come up very often. How is it usually addressed? There must be a simple bridge across this. In a pinch, I'll just have to load the whole file into some sort of collection, but I prefer the ultimate unlimited-file-size solution.
I always like unbounded solutions too, but if you try to stream a file past a regex engine it is likely to be slow and, as others have pointed out, not even reasonable in the general case. You may still want to think about specialized cases that benefit from whatever restricted set of regexes you have. You would have to think about "strategies" or similar notions that look at the problem and pick an approach or a specific implementation based on its parameters.
I originally came here to run thousands of regex queries on megabyte strings and used Boost and Greta for testing, but I quickly found ways to compile query/sample vectors and implement restricted searches once I found that all the not-so-regular expressions fit a given constraint, or that I could even do simple things like sorts to preserve locality later. There are a lot of potential performance limitations depending on the specific task parameters and the machine. But, yes, it would be nice if someone had a general "strategy" library. LOL.
In the worst case, if you're using Perl-style expressions (or any style that isn't strictly "regular" and requires backtracking; lookahead assertions are a common culprit), the entire input may have to be consumed and buffered even if the expression ultimately matches only a few characters (see "On the Use of Regular Expressions for Searching Text", Clarke and Cormack, ACM Transactions on Programming Languages and Systems, Vol. 19, No. 3, pp. 413-426). Therefore, if you're dealing with small files, you may as well buffer the entire file in a char array and use regex_token_iterator. If you're dealing with large files, memory-map the file instead.
daniel
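For the small-file case, a minimal sketch of the buffer-then-iterate approach might look like the following. It uses std::regex and std::sregex_token_iterator from the standard library so that it compiles without Boost; boost::regex_token_iterator is used in much the same way. The helper names slurp and find_all are made up for illustration and do not come from either library.

```cpp
#include <fstream>
#include <iterator>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: read the whole stream into memory.
// Reasonable for small inputs only; a memory-mapped file would instead
// hand the regex code a pointer pair over the mapped region.
std::string slurp(std::istream& in) {
    std::ostringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

// Hypothetical helper: collect every substring of `text` matching `pattern`.
std::vector<std::string> find_all(const std::string& text,
                                  const std::string& pattern) {
    const std::regex re(pattern);  // must outlive the iterators below
    std::vector<std::string> out;
    // Submatch index 0 selects the whole match for each hit.
    std::sregex_token_iterator it(text.begin(), text.end(), re, 0);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

A caller would open a std::ifstream, pass it to slurp, and hand the resulting string to find_all. The regex iterators need bidirectional access to the text, which is why the input is buffered first rather than read straight from the stream.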
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Thanks for all the suggestions.
Actually, this is a one-off program for which I want a correct solution in
the minimal lines of code. So paramount for me is just sticking together
that which is known to work.
Investigating this further, here is what I've found.
The spirit multi-pass iterator is almost ideal. It is a one liner - I like
that.
One can specify a different storage policy - fixed_queue
----------------------------------------
To: boost-users@lists.boost.org From: my computer screen Date: Mon, 30 Mar 2009 10:25:21 -0800 Subject: Re: [Boost-users] regex iterator question
Thanks for all the suggestions.
Actually, this is a one-off program for which I want a correct solution in the minimal lines of code. So paramount for me is just sticking together that which is known to work.
Normally, that is what scripts are for. I share various concerns about learning curves. General-purpose tools are often helpful even if they don't scale well. Boost ends up just dropping in for many tasks, but if you don't care about performance, have you considered Java? http://www.google.com/search?hl=en&q=java+regex+site%3Asun.com&btnG=Google+Search&aq=f&oq=
participants (2)
- Mike Marchywka
- Robert Ramey