Regex++ newbie problems
Hi,

I've just started using Regex++ (from boost 1.29.0) and I'm experiencing some strangeness that doesn't seem to be mentioned in the FAQ.

Firstly, I found that [-A-Za-z]+ matched spaces and punctuation characters unexpectedly, rather than only plain alphabetic characters and hyphens as desired. Reading the documentation, I altered this to [-:alpha:] and [-:upper::lower] with no effect. So I decided to experiment with adding ^[:space:]. When I finally reached the expression below, I got a coredump where the expression was declared. The intention of this expression was to strip and keep leading and trailing punctuation and spaces, as well as extracting a word from the middle.

static const boost::regex Word_expression("([:punct::space:]*)([-:upper::lower:^[:punct::space:]]+)([:punct::space:]*)");

Is it right that 'bad' expressions should coredump? And if so, in what way is the above expression bad? (As an aside, maybe we could catch bad ones better by replacing regex strings with overloaded operators, the way streams have superseded printf.)

I found I still get rogue matches on punctuation and spaces when I use the manually expanded form below:

([-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]+)

What is going wrong?

Regards,

Bruce A.

Full source attached. E.g. try running:
ReadWord ",; token-alpha; ,"

I would desire the matches to be:

what[1] -> ",; "
what[2] -> "token-alpha"
what[3] -> "; ,"

----cut here----
#include <iostream>
#include <fstream>
#include "boost/regex.hpp"

int main(int argc,const char* const argv[])
{
  int Status = 0;

  // static const boost::regex Word_expression("[a-zA-Z]+");
  // causes coredump
  // static const boost::regex Word_expression("([:punct::space:]*)([-:upper::lower:^[:punct::space:]]+)([:punct::space:]*)");
  static const boost::regex Word_expression("([:punct::space:]*)([-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]+)([:punct::space:]*)");

  // dump arguments
  for(int argNo=0;argNo != argc;argNo++)
  {
    std::cout << "argNo " << argNo << " = '" << argv[argNo] << "'" << std::endl;
    boost::cmatch what;
    if(regex_search(argv[argNo], what, Word_expression))
    {
      std::cout << "Whole = " << what[0].first << std::endl;
      int resultNo = 1;
      while(what[resultNo].matched == true)
      {
        std::cout << "sub[" << resultNo << "] = "
                  << "'" << what[resultNo].first << "'" << std::endl;
        resultNo++;
      }
    }
  }

  if (argc <= 1)
  {
    std::cout << "Usage: ReadWord <filename>..." << std::endl;
    Status = 1;
  }
  else
  {
    for(int argNo=1;argNo != argc;argNo++)
    {
      std::ifstream In(argv[argNo]);
      if (!In)
      {
        std::cout << "Error: could not open file: " << argv[argNo] << std::endl;
      }
      else while((In) && (In.eof() == false))
      {
        std::string InputLine;
        In >> InputLine;
        std::cout << InputLine << std::endl;
      }
    }
  }
  return Status;
} //main
> I've just started using Regex++ (from boost 1.29.0) and I'm experiencing some strangeness that doesn't seem to be mentioned in the FAQ.
>
> Firstly, I found that [-A-Za-z]+ matched spaces and punctuation characters unexpectedly, rather than only plain alphabetic characters and hyphens as desired. Reading the documentation, I altered this to [-:alpha:] and [-:upper::lower] with no effect. So I decided to experiment with adding ^[:space:]. When I finally reached the expression below, I got a coredump where the expression was declared. The intention of this expression was to strip and keep leading and trailing punctuation and spaces, as well as extracting a word from the middle.
>
> static const boost::regex Word_expression("([:punct::space:]*)([-:upper::lower:^[:punct::space:]]+)([:punct::space:]*)");
>
> Is it right that 'bad' expressions should coredump?
boost::regex will throw an exception if you pass it an invalid expression. You need to catch it, or else yes, your program will core dump. It's an invalid expression because:

[:punct::space:]* should be [[:punct:][:space:]]*

and in

[-:upper::lower:^[:punct::space:]]

you can't nest character classes like that (in any regular expression language that I know of).
> And if so, in what way is the above expression bad? (As an aside, maybe we could catch bad ones better by replacing regex strings with overloaded operators, the way streams have superseded printf.)
>
> I found I still get rogue matches on punctuation and spaces when I use the manually expanded form below:
You are using the member first of each sub-match (what[n].first) as a null-terminated string. It is *not* a copy of the string matched, nor a null-terminated string; it is an iterator into your text. Either use the sequence [first, second), or call match_results::str() to get a std::string object.

John.
participants (2)
- Bruce Adams [TSP Sunbury]
- John Maddock