Re: [boost] [Tokenizer]Usage and documentation

9 Feb 2011

      [Max]
...
I'm using boost::tokenizer to do some simple parsing of a data file in a format specified by the following rules:
- One record of several fields on a single line
- Adjacent data fields in a record are separated by whitespace characters (space or tab), with or without a ","
- A string without spaces may appear with or without quotation marks
- A string with spaces must appear within quotation marks
One example of a 4-field-per-record file looks like:
"string 2" 3 4 5 4.3
"String", 2, 3.04 4 3
AnyOtherText, 2, 3.04 4 3
...
I have indeed tried Boost.Regex, though on a slightly different path: I
was using boost::regex_search instead.
Never call regex_search() in a loop by incrementing iterators - doing so can trigger infinite loops and incorrect results. Read Pete Becker's TR1 book for the gory details (consider what happens with zero-length matches, for example). Always use regex_iterator or regex_token_iterator instead.
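
As a minimal sketch of the difference (using std::regex; the helper name all_matches and the pattern x* in the usage note are made up for illustration): a pattern that can match the empty string stalls the hand-written loop, because after a zero-length match m[0].second is the same position the search started from. regex_iterator deals with that case internally.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of every match of r in s.
std::vector<std::string> all_matches(const std::string& s, const std::regex& r) {
    std::vector<std::string> out;

    // The hazardous hand-written version would look like this:
    //
    //     std::smatch m;
    //     std::string::const_iterator pos = s.begin();
    //     while (std::regex_search(pos, s.end(), m, r)) {
    //         out.push_back(m[0]);
    //         pos = m[0].second;   // does not advance after an empty match
    //     }
    //
    // sregex_iterator advances past zero-length matches on its own.
    for (std::sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
        out.push_back(i->str());
    }

    return out;
}
```

Calling all_matches("abxxc", std::regex("x*")) terminates and reports "xx" as the only non-empty match, alongside some empty ones; the commented-out loop would never get past the first empty match.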
...
But I still cannot understand it, after reading through
http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/...
The part I could not interpret is:
^|[\s,]
And 
$|[\s,]
The docs say:
...
A '^' character shall match the start of a line. 
A '$' character shall match the end of a line.
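
Read as an alternation, ^|[\s,] means "either the start of the line, or a single whitespace or comma character", and $|[\s,] means "either the end of the line, or one such separator". Placed around a field pattern, they require the field to be delimited on both sides. Here is a minimal sketch using std::regex; the function name and the \d+ field pattern are made up for illustration and are not the regex from earlier in this thread:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: finds the first run of digits that is delimited on
// the left by the start of the line or a separator, and on the right by the
// end of the line or a separator. Returns "" if there is none.
std::string first_delimited_number(const std::string& line) {
    static const std::regex r("(?:^|[\\s,])(\\d+)(?:$|[\\s,])");
    std::smatch m;
    return std::regex_search(line, m, r) ? m[1].str() : std::string();
}
```

Under these assumptions, first_delimited_number("abc 42") yields 42, while "abc42" yields nothing, because the digits are preceded by neither a separator nor the start of the line.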
It depends on how strict you want to be (see the unusual examples below, especially those involving empty fields). One approach is to describe the fields you're interested in, and let regex_iterator find them. (Another approach, activating regex_token_iterator's magical field splitting ability, doesn't seem to be applicable here because you want to handle quoted strings. If I'm wrong about that, I'd love to find out.)

I suggest the following (I've used VC10 RTM std::regex here, but boost::regex will behave identically):

C:\Temp>type meow.cpp
#include <iostream>
#include <ostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;

int main() {
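    // Alternative 1: a quoted string, capturing its contents without the quotes.
    // Alternative 2: a run of characters that are not whitespace, commas, or quotes.
    const string reg("\"([^\"]*)\"|([^\\s,\"]+)");
    const regex r(reg);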

    cout << "r: " << reg << endl << endl;

    for (string s; getline(cin, s); ) {
        if (s == "bye") {
            break;
        }

        vector<string> v;

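        // Each match is one field; exactly one of the two capture groups participates.
        for (sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
            const smatch& m = *i;

            // Group 1 is set for quoted fields (possibly empty), group 2 for bare fields.
            v.push_back(m[1].matched ? m[1] : m[2]);
        }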

        for (vector<string>::const_iterator i = v.begin(); i != v.end(); ++i) {
            cout << "[" << *i << "]";
        }

        cout << endl << endl;
    }
}

C:\Temp>cl /EHsc /nologo /W4 meow.cpp
meow.cpp

C:\Temp>meow
r: "([^"]*)"|([^\s,"]+)

"string  2"   3  4        5  4.3
[string  2][3][4][5][4.3]

"String",     2,  3.04    4  3
[String][2][3.04][4][3]

AnyOtherText, 2,  3.04    4  3
[AnyOtherText][2][3.04][4][3]

commas,without,spaces,"and","cute fluffy kittens"
[commas][without][spaces][and][cute fluffy kittens]

  leading whitespace and (invisible) trailing whitespace
[leading][whitespace][and][(invisible)][trailing][whitespace]

empty "" quotes
[empty][][quotes]

really"bizarre"strings"like""this"
[really][bizarre][strings][like][this]

empty,,,fields, , , like this
[empty][fields][like][this]

bye

C:\Temp>
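
For comparison, here is a minimal sketch of the regex_token_iterator field-splitting mode mentioned above (submatch index -1, which yields the text between separator matches). The function name and the separator pattern are made up for illustration. Because this mode only knows about separators, a quoted string containing spaces is split apart and keeps its quote characters, which is why it does not fit this format:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits s on runs of whitespace and commas.
std::vector<std::string> split_on_separators(const std::string& s) {
    static const std::regex sep("[\\s,]+");
    std::vector<std::string> out;

    // Submatch index -1 selects the text between matches instead of the matches.
    for (std::sregex_token_iterator i(s.begin(), s.end(), sep, -1), end; i != end; ++i) {
        out.push_back(i->str());
    }

    return out;
}
```

Given AnyOtherText, 2, 3.04 4 3 this produces the same five fields as meow.cpp, but given "string 2" 3 it produces the three pieces "string and 2" and 3, with the quotes left in.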

Stephan T. Lavavej
Member of the Society for Regex Simplicity, I mean, Visual C++ Libraries Developer