
[Max]
I'm using boost::tokenizer to do some simple parsing of a data file in a format specified by the following rules:
- One record of several fields in a single line
- Adjacent data fields in a record are separated by whitespace characters (space or tab), with or without ","
- String without space(s), with or without quotation marks
- String with space(s), with quotation marks

One example of a 4-field-per-record file is like:

    "string 2" 3 4 5 4.3
    "String", 2, 3.04 4 3
    AnyOtherText, 2, 3.04 4 3
I've indeed tried with boost.Regex, on a slightly different path though - I was using boost::regex_search instead.
Never call regex_search() in a loop by incrementing iterators - doing so can trigger infinite loops and incorrect results. Read Pete Becker's TR1 book for the gory details (consider what happens with zero-length matches, for example). Always use regex_iterator or regex_token_iterator instead.
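To make the zero-length case concrete, here is a minimal sketch using std::regex (boost::regex offers the same iterator interface). The function name find_digit_runs and the pattern \d* are made up for illustration; the pattern is chosen only because it can match the empty string. A hand-written regex_search() loop that resumes at the end of the previous match would never advance past such a match, whereas the iterator takes care of stepping forward.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every non-empty run of digits in `text`.
// The pattern \d* can match the empty string, which is the case that
// trips up a hand-written regex_search() loop: after a zero-length match
// the search position does not move unless the loop handles it itself.
// std::sregex_iterator does that bookkeeping internally.
std::vector<std::string> find_digit_runs(const std::string& text) {
    static const std::regex digits("\\d*");
    std::vector<std::string> runs;
    for (std::sregex_iterator it(text.begin(), text.end(), digits), end;
         it != end; ++it) {
        if (it->length() > 0) {   // ignore the zero-length matches
            runs.push_back(it->str());
        }
    }
    return runs;
}
```

Called on a string such as "a12b345", the loop terminates even though the pattern matches emptily at several positions along the way.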
But I still cannot understand it, after reading through http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/... The part I could not interpret is:

    ^|[\s,]

and

    $|[\s,]
The docs say:
A '^' character shall match the start of a line. A '$' character shall match the end of a line.
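Put together with alternation, ^|[\s,] reads as "either the start of the line, or one whitespace or comma character", and $|[\s,] as "either the end of the line, or one whitespace or comma character". The sketch below shows the two pieces bracketing a field. It uses std::regex, the helper name contains_field is hypothetical, and it assumes the word contains no regex metacharacters. Because | has the lowest precedence, each alternation has to be wrapped in parentheses.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `word` appears in `line` as a complete
// field, i.e. preceded by the start of the line or a delimiter, and
// followed by the end of the line or a delimiter.
// Assumption: `word` contains no regex metacharacters.
bool contains_field(const std::string& line, const std::string& word) {
    // (^|[\s,])  start of line, or one whitespace/comma character
    // ($|[\s,])  end of line, or one whitespace/comma character
    const std::regex r("(^|[\\s,])" + word + "($|[\\s,])");
    return std::regex_search(line, r);
}
```

With this, "foo" is found in "a, foo,b" and in "foo" alone, but not in "foobar", where the text after the word is neither a delimiter nor the end of the line.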
It depends on how strict you want to be (see the unusual examples below, especially involving empty fields). One approach is to describe the fields you're interested in, and let regex_iterator find them. (Another approach, activating regex_token_iterator's magical field splitting ability, doesn't seem to be applicable here because you want to handle quoted strings - if I'm wrong about that I'd love to find out). I suggest the following (I've used VC10 RTM std::regex here, but boost::regex will behave identically):

    C:\Temp>type meow.cpp
    #include <iostream>
    #include <ostream>
    #include <regex>
    #include <string>
    #include <vector>
    using namespace std;

    int main() {
        const string reg("\"([^\"]*)\"|([^\\s,\"]+)");
        const regex r(reg);

        cout << "r: " << reg << endl << endl;

        for (string s; getline(cin, s); ) {
            if (s == "bye") {
                break;
            }

            vector<string> v;

            for (sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
                const smatch& m = *i;
                v.push_back(m[1].matched ? m[1] : m[2]);
            }

            for (vector<string>::const_iterator i = v.begin(); i != v.end(); ++i) {
                cout << "[" << *i << "]";
            }

            cout << endl << endl;
        }
    }

    C:\Temp>cl /EHsc /nologo /W4 meow.cpp
    meow.cpp

    C:\Temp>meow
    r: "([^"]*)"|([^\s,"]+)

    "string 2" 3 4 5 4.3
    [string 2][3][4][5][4.3]

    "String", 2, 3.04 4 3
    [String][2][3.04][4][3]

    AnyOtherText, 2, 3.04 4 3
    [AnyOtherText][2][3.04][4][3]

    commas,without,spaces,"and","cute fluffy kittens"
    [commas][without][spaces][and][cute fluffy kittens]

    leading whitespace and (invisible) trailing whitespace
    [leading][whitespace][and][(invisible)][trailing][whitespace]

    empty "" quotes
    [empty][][quotes]

    really"bizarre"strings"like""this"
    [really][bizarre][strings][like][this]

    empty,,,fields, , , like this
    [empty][fields][like][this]

    bye

    C:\Temp>

Stephan T. Lavavej
Member of the Society for Regex Simplicity, I mean, Visual C++ Libraries Developer