[Tokenizer]Usage and documentation

Max

8 Feb 2011

1:13 p.m.

Hello,

I'm using boost::tokenizer to do some simple parsing of a data file in a format specified by the following rules:

- One record of several fields in a single line

- Adjacent data fields in a record separated by space char's(space or tab), with or without ","

- String without space(s), with or without quotation marks

- String with space(s), with quotation marks

One example of a 4-field-per-record file is like:

"string 2" 3 4 5 4.3

"String", 2, 3.04 4 3

AnyOtherText, 2, 3.04 4 3

I am using the following code to get a line at first, supposing 'input' has the contents of the data file:

    typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
    boost::char_separator<char> sep("\n", " ");
    tokenizer tokens(input, sep);
    for(tokenizer::iterator beg=tokens.begin(); beg!=tokens.end(); ++beg)
    {
    }

Then for each *beg, I parse each line with this:

    typedef boost::tokenizer<boost::char_separator<char> > tokenizer;
    tokenizer tokens (*beg, boost::char_separator<char>(", "));
    tokenizer::iterator it= tokens.begin();

But I cannot get the expected output.

And, in the meantime, I found the doc of boost::tokenizer quite slim, and it is not easy to find the information that I need. Does anybody else have the same feeling? Or is it the case that nobody is actually using it, but turning to some other, better lib?

Thanks for any help.

B/Rgds
Max


Michael Caisse

8 Feb

5:46 p.m.

On 2/8/2011 5:13 AM, Max wrote:

...

Hello,

I'm using boost::tokenizer to do some simple parsing of data file in a format specified by the following rules:

- One record of several fields in a single line

- Adjacent data fields in a record separated by space char's(space or tab), with or without ","

- String without space(s), with or without quotation marks

- String with space(s), with quotation marks

Hi Max -

I would use Spirit Qi for this task. You can find the documentation here:

<http://www.boost.org/doc/libs/1_45_0/libs/spirit/doc/html/index.html>

michael

--

----------------------------------
Michael Caisse
Object Modeling Designs
www.objectmodelingdesigns.com

Max

9 Feb

9:50 a.m.

Thank you Michael for your pointer.

I did know Spirit before, but not much. It seemed like a cannon, or much more, while what I need is a gun.

Its power and elegance seem worth a try, even though the learning curve is a little bit steep.

(To play with a simple toy program closely resembling the samples presented in the tutorial is not difficult, but having a full grasp, or nearly, is far from an easy task, especially when one comes across compile errors - the scenario that I believe everyone here can imagine, which is probably the biggest drawback of the powerful high order programming)

B/Rgds
Max


Michael Caisse

7:43 p.m.


On 2/9/2011 1:50 AM, Max wrote: Thank you Michael for your pointer.

I did know Spirit before, but not much. It seemed like a cannon, or much more, while what I need is a gun.

This, unfortunately, seems to be a common misperception. If we were talking about using lex, yacc, or bison I would understand. Spirit is a DSEL, and a simple include enables the functionality. It can't be that the compiled result is a 'cannon' compared to the other solutions you are looking at. The resulting code is tight and fast.

Its power and elegance seem worth a try, even though the learning curve is a little bit steep.

(To play with a simple toy program closely resembling the samples presented in the tutorial is not difficult, but having a full grasp, or nearly, is far from an easy task, especially when one comes across compile errors - the scenario that I believe everyone here can imagine, which is probably the biggest drawback of the powerful high order programming)

B/Rgds Max

I think this hits the problem. The library can be intimidating at first. Any DSEL can just look odd initially. I personally find Qi to be close enough to EBNF that it reads nicely.

The compiler errors are definitely an issue, especially when you are first beginning. Looks like you have found the tutorial. If you are interested in a crash course you can find slides here: <http://www.objectmodelingdesigns.com/boostcon10/> and the BoostCon video here: <http://blip.tv/file/4143337>.

There is a Spirit ML and a bunch of us hang out on the Boost IRC channel. The community is friendly and always eager to help a new convert. Regardless of the approach you take, I wish you luck!

designated benevolent spirit evangelist - michael

--

----------------------------------
Michael Caisse
Object Modeling Designs
www.objectmodelingdesigns.com

Max

10 Feb

7:53 a.m.

Hello Michael

...

<please don't top-post> <http://www.boost.org/community/policy.html#quoting>

My apologies.

I did know Spirit before, but not much. It seemed like a cannon, or much more, while what I need is a gun.

This, unfortunately, seems to be a common misperception. If we were talking about using lex, yacc, or bison I would understand. Spirit is a DSEL, and a simple include enables the functionality.

It can't be that the compiled result is a 'cannon' compared to the other solutions you are looking at. The resulting code is tight and fast.

Its power and elegance seem worth a try, even though the learning curve is a little bit steep.

(To play with a simple toy program closely resembling the samples presented in the tutorial is not difficult, but having a full grasp, or nearly, is far from an easy task, especially when one comes across compile errors - the scenario that I believe everyone here can imagine, which is probably the biggest drawback of the powerful high order programming)

It's nice to hear that.

I think this hits the problem. The library can be intimidating at first. Any DSEL can just look odd initially. I personally find Qi to be close enough to EBNF that it reads nicely.

The compiler errors are definitely an issue, especially when you are first beginning. Looks like you have found the tutorial. If you are interested in a crash course you can find slides here: <http://www.objectmodelingdesigns.com/boostcon10/> and the BoostCon video here: <http://blip.tv/file/4143337>.

There is a Spirit ML and a bunch of hang out on the Boost IRC channel. The community is friendly and always eager to help a new convert. Regardless of the approach you take, I wish you luck!

In case I come across problems I'll definitely join the ML and ask for your help.

Thank you very much for your pointer - I'll have a careful read of the presentations for both Spirit and Asio, in which both libraries and many other Boost libraries are shown in a beautiful real-life collaboration.

And... I'm so happy to get to know what you three (Michael, Hartmut and Joel) look like from the photo at the bottom left of the page. :-) I have known of you guys for a long time, but this is the first time I have seen your photo.

...

designated benevolent spirit evangelist - michael

Yechezkel Mett

9 Feb

11:41 a.m.

On Tue, Feb 8, 2011 at 3:13 PM, Max <more4less@sina.com> wrote:

...

I'm using boost::tokenizer to do some simple parsing of data file in a format specified by the following rules:

- One record of several fields in a single line

- Adjacent data fields in a record separated by space char's(space or tab), with or without ","

- String without space(s), with or without quotation marks

- String with space(s), with quotation marks

One example of a 4-field-per-record file is like:

"string 2" 3 4 5 4.3

"String", 2, 3.04 4 3

AnyOtherText, 2, 3.04 4 3

I normally use boost.regex's regex_token_iterator for this sort of task. Try the following regex:

"([^"]*)"|(?:^|[[:space:],])+([^[:space:],]+)(?:$|[[:space:],])+

and tell regex_token_iterator to extract matches 1 and 2.

The above regex has a couple of quirks: "a""b" will be taken as two fields, "a" and "b". a,,b will be taken as two fields, not three.

To read the file line by line, simply use std::getline.

Yechezkel Mett
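The line-by-line reading suggested above could look roughly like the following. This is only a sketch: std::getline is the standard call named in the message, while the helper name read_records and the choice to skip blank lines are assumptions made for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): collect each non-empty line of
// a stream as one record, leaving field splitting to a later step.
std::vector<std::string> read_records(std::istream& in) {
    std::vector<std::string> records;
    for (std::string line; std::getline(in, line); ) {
        if (!line.empty()) {   // assumption: blank lines carry no record
            records.push_back(line);
        }
    }
    return records;
}
```

Each returned record can then be handed to whichever field splitter is chosen (regex, tokenizer, or Spirit), so no tokenizer is needed just to find the line breaks.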

Max

1:22 p.m.

Thank you Yechezkel.

I've indeed tried with Boost.Regex, on a slightly different path though - I was using boost::regex_search instead.

One drawback of the regex approach, IMO, is that the code feels a little bit rigid, or lacking in flexibility; in other words, it's not what I feel it should be - even though I cannot actually tell in what respect.

Thanks for yet another regex approach. I'm trying to rewrite your regex

"([^"]*)"|(?:^|[[:space:],])+([^[:space:],]+)(?:$|[[:space:],])+

in a form that I'm more familiar with:

"([^"]*)"|(?:^|[\s,])+([^\s,]+)(?:$|[\s,])+

But I still cannot understand it, after reading through
http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

The part I could not interpret is:

^|[\s,]

And

$|[\s,]

:-( (this is not a part of the regex, part of my expression instead.)

Thanks.
Max


Stephan T. Lavavej

2:46 p.m.

[Max]

...

I'm using boost::tokenizer to do some simple parsing of data file in a format specified by the following rules:

- One record of several fields in a single line

- Adjacent data fields in a record separated by space char's(space or tab), with or without ","

- String without space(s), with or without quotation marks

- String with space(s), with quotation marks

One example of a 4-field-per-record file is like:

"string 2" 3 4 5 4.3

"String", 2, 3.04 4 3

AnyOtherText, 2, 3.04 4 3

...

I've indeed tried with boost.Regex, on a slightly different path though - I was using boost::regex_search instead.

Never call regex_search() in a loop by incrementing iterators - doing so can trigger infinite loops and incorrect results. Read Pete Becker's TR1 book for the gory details (consider what happens with zero-length matches, for example). Always use regex_iterator or regex_token_iterator instead.
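As a minimal sketch of the iterator-based loop recommended above, using std::regex (the helper name all_matches is an assumption for illustration; boost::sregex_iterator is used the same way):

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return every match of `re` in `text`, in order.
// std::sregex_iterator takes care of stepping past zero-length matches,
// which a hand-written regex_search() loop would have to handle itself.
std::vector<std::string> all_matches(const std::string& text,
                                     const std::regex& re) {
    std::vector<std::string> out;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

With a pattern such as x*, which can match the empty string, the loop still terminates because the iterator advances the search position on its own.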

...

But I still cannot understand it, after reading through http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/... The part I could not interpret is: ^|[\s,] And $|[\s,]

The docs say:

...

A '^' character shall match the start of a line. A '$' character shall match the end of a line.

It depends on how strict you want to be (see the unusual examples below, especially involving empty fields). One approach is to describe the fields you're interested in, and let regex_iterator find them. (Another approach, activating regex_token_iterator's magical field splitting ability, doesn't seem to be applicable here because you want to handle quoted strings - if I'm wrong about that I'd love to find out). I suggest the following (I've used VC10 RTM std::regex here, but boost::regex will behave identically):

C:\Temp>type meow.cpp
#include <iostream>
#include <ostream>
#include <regex>
#include <string>
#include <vector>
using namespace std;

int main() {
    const string reg("\"([^\"]*)\"|([^\\s,\"]+)");

    const regex r(reg);

    cout << "r: " << reg << endl << endl;

    for (string s; getline(cin, s); ) {
        if (s == "bye") {
            break;
        }

        vector<string> v;

        for (sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i) {
            const smatch& m = *i;

            v.push_back(m[1].matched ? m[1] : m[2]);
        }

        for (vector<string>::const_iterator i = v.begin(); i != v.end(); ++i) {
            cout << "[" << *i << "]";
        }

        cout << endl << endl;
    }
}

C:\Temp>cl /EHsc /nologo /W4 meow.cpp
meow.cpp

C:\Temp>meow
r: "([^"]*)"|([^\s,"]+)

"string 2" 3 4 5 4.3
[string 2][3][4][5][4.3]

"String", 2, 3.04 4 3
[String][2][3.04][4][3]

AnyOtherText, 2, 3.04 4 3
[AnyOtherText][2][3.04][4][3]

commas,without,spaces,"and","cute fluffy kittens"
[commas][without][spaces][and][cute fluffy kittens]

leading whitespace and (invisible) trailing whitespace
[leading][whitespace][and][(invisible)][trailing][whitespace]

empty "" quotes
[empty][][quotes]

really"bizarre"strings"like""this"
[really][bizarre][strings][like][this]

empty,,,fields, , , like this
[empty][fields][like][this]

bye

C:\Temp>

Stephan T. Lavavej
Member of the Society for Regex Simplicity, I mean, Visual C++ Libraries Developer

Max

10 Feb

7:32 a.m.

Hello Stephan,

Thank you so much for your detailed information, which is exactly what I need right now.

After a glimpse at your email address, it came to me that you might be the guy in a series of STL lectures I found on the net. It's really you! Even though I'm not a complete beginner with the STL, I've watched some of the videos both for revisiting the STL and for English practice. :-)

Thank you also for your STL lectures.

...

...
The part I could not interpret is: ^|[\s,] And $|[\s,]

The docs say:

...
A '^' character shall match the start of a line. A '$' character shall match the end of a line.

Yes, I'm aware of this. But even with this in mind, I cannot interpret "^|[\s,]" and "$|[\s,]". For the former, I know '|' means alternation, but how can it come after '^'? For the latter, how can "|[\s,]" be expected after the end of a line (and the same confusion as above)?

It depends on how strict you want to be (see the unusual examples below, especially involving empty fields). One approach is to describe the fields you're interested in, and let regex_iterator find them. (Another approach, activating regex_token_iterator's magical field splitting ability, doesn't seem to be applicable here because you want to handle quoted strings - if I'm wrong about that I'd love to find out). I suggest the following (I've used VC10 RTM std::regex here, but boost::regex will behave identically):

C:\Temp>type meow.cpp

[code snippet]

C:\Temp>

Stephan T. Lavavej Member of the Society for Regex Simplicity, I mean, Visual C++ Libraries Developer

One more question - with your code, any empty 'token' between two contiguous ',' is ignored; what if someday I'd like to pick them up?

B/Rgds
Max

Yechezkel Mett

9:40 a.m.

On Thu, Feb 10, 2011 at 9:32 AM, Max <more4less@sina.com> wrote: [Stephan T. Lavavej <stl@exchange.microsoft.com> wrote:]

...

...
[Max]

...
The part I could not interpret is: ^|[\s,] And $|[\s,]

The docs say:

...
A '^' character shall match the start of a line. A '$' character shall match the end of a line.

Yes, I'm aware of this. But even with this in mind, I cannot interpret "^|[\s,]" and "$|[\s,]". For the former, I know '|' means alternation, but how can it come after '^'? For the latter, how can "|[\s,]" be expected after the end of a line (and the same confusion as above)?

^|[\s,]

means _either_ the beginning of the line _or_ a space or comma. In other words the field starts either at the beginning of the line or after a space or comma.

Likewise

$|[\s,]

The field ends either at the end of the line or before a space or comma.
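To see the "start of line or separator" alternation at work in isolation, here is a small sketch using std::regex. The pattern is a cut-down variant written for this illustration (it is not the full regex from the thread and ignores quoted strings), and the helper name simple_fields is an assumption.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: a field is a run of non-separator characters that
// begins either at the start of the line (^) or right after one separator
// character (whitespace or comma). Group 1 holds the field text.
std::vector<std::string> simple_fields(const std::string& line) {
    static const std::regex re(R"((?:^|[\s,])([^\s,]+))");
    std::vector<std::string> out;
    for (std::sregex_iterator it(line.begin(), line.end(), re), end;
         it != end; ++it) {
        out.push_back((*it)[1].str());
    }
    return out;
}
```

The first field is found through the ^ branch, since nothing precedes it; every later field is found through the [\s,] branch, which consumes the separator in front of it.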

...

One more question - with your code, any empty 'token' between two contiguous ',' is ignored, what if someday I'd like to pick them up?

"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$

I'm presuming an empty line should count as no tokens; if you don't mind an empty line being one token it can be simplified to

"([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)

Not really that much simpler.

Yechezkel Mett

Max

1:46 p.m.

...

From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Yechezkel Mett Sent: Thursday, February 10, 2011 5:41 PM To: boost@lists.boost.org Subject: Re: [boost] [Tokenizer]Usage and documentation

^|[\s,]

means _either_ the beginning of the line _or_ a space or comma. In other words the field starts either at the beginning of the line or after a space or comma.

Likewise

$|[\s,]

The field ends either at the end of the line or before a space or comma.

I indeed never realized that ^ and $ could be used in combination with | in that way before. I didn't use RE that frequently though.

...

...
One more question - with your code, any empty 'token' between two contiguous ',' is ignored, what if someday I'd like to pick them up?

"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$

I'm presuming an empty line should count as no tokens; if you don't mind an empty line being one token it can be simplified to

I have 3 versions of the RE sitting side by side, attempting to figure out the difference between them.

"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$ // (1)
"([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,) // (2)
"([^"]*)"|([^\s,"]+) // (3) original version offered by Stephan

But, unfortunately, I still cannot fully grasp the meaning of (1) and (2).

But by testing (1) with Stephan's code, I get:

r: "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$

empty,,,fields, , , like this
[empty][][fields][][like][this]

,,,
[][]

There are 2 empty tokens in between each 3 contiguous ',' but only one for each is detected.

Likewise, for (2), I get:

r: "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)

empty,,,fields, , , like this
[empty][fields][like][this]

This time, the behavior is no different than the 'original' version.

Thank you Yechezkel for your help.

BTW, it seems that by reading
http://www.boost.org/doc/libs/1_45_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
I cannot get a full view of the regex grammar. Maybe I need a whole book on it? :-) Is there any *complete* introduction available on the net?

B/Rgds
Max

Yechezkel Mett

13 Feb

11:44 a.m.

On Thu, Feb 10, 2011 at 3:46 PM, Max <more4less@sina.com> wrote:

...

I have 3 versions of the RE sitting side by side, attempting to figure out the difference between them.

"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$ // (1)
"([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,) // (2)
"([^"]*)"|([^\s,"]+) // (3) original version offered by Stephan

But, unfortunately, I still cannot fully grasp the meaning of (1) and (2).

,\s*(), means find a ',' followed by any number of spaces followed by a ',' and capture an empty string. The others are similar.

...

r: "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$

empty,,,fields, , , like this
[empty][][fields][][like][this]

,,,
[][]

There are 2 empty tokens in between each 3 contiguous ',' but only one for each is detected.

Yes, that's a mistake. When matching ,, as an empty field the second ',' is eaten and can no longer be used as the beginning of the next field.

"([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$

should work. (?=) is a lookahead, it checks that the pattern (',' in this case) matches at this point, but doesn't eat any input.
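A sketch of how the lookahead version could be plugged into the same regex_iterator loop, here with std::regex. The pattern is copied from the message above; the helper name split_keep_empty and the rule "no match in group 1 or 2 means an empty field" are assumptions made for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split one line into fields, keeping the empty fields
// that sit between consecutive commas.
std::vector<std::string> split_keep_empty(const std::string& line) {
    // Groups 1 and 2 capture quoted and bare fields. Groups 3 to 5 capture
    // nothing; they only mark a position where an empty field was found.
    static const std::regex re(
        R"re("([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$)re");
    std::vector<std::string> out;
    for (std::sregex_iterator it(line.begin(), line.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        if (m[1].matched) {
            out.push_back(m[1].str());      // quoted field, quotes stripped
        } else if (m[2].matched) {
            out.push_back(m[2].str());      // bare field
        } else {
            out.push_back(std::string());   // empty field
        }
    }
    return out;
}
```

Because (?=,) only peeks at the next comma, that comma is still available to open the following empty field, so a line such as a,,,b yields two empty fields between a and b.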

...

Likewise, for (2), I get:

r: "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)

empty,,,fields, , , like this
[empty][fields][like][this]

This time, the behavior is no different than the 'original' version.

I get the same results as the first version. Perhaps it wasn't escaped properly?

Yechezkel Mett

Max

16 Feb

1:01 p.m.

[Yechezkel Mett]

...

,\s*(),

means find a ',' followed by any number of spaces followed by a ',' and capture an empty string.

Yes, now I see. Thank you, Yechezkel.

...

The others are similar.

...
r: "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$

empty,,,fields, , , like this
[empty][][fields][][like][this]

,,,
[][]

There are 2 empty tokens in between each 3 contiguous ',' but only one for each is detected.

Yes, that's a mistake. When matching ,, as an empty field the second ',' is eaten and can no longer be used as the beginning of the next field.

"([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$

should work. (?=) is a lookahead, it checks that the pattern (',' in this case) matches at this point, but doesn't eat any input.

...

...
Likewise, for (2), I get:

r: "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)

empty,,,fields, , , like this
[empty][fields][like][this]

This time, the behavior is no different than the 'original' version.

I get the same results as the first version. Perhaps it wasn't escaped properly?

Yes, its behavior is exactly as you expected. You are right: my different result came from unintentionally incorrect escaping.

B/Rgds
Max

P.S. I've found some 'complete' references (books) on RE. However, it's this thread of discussion that has really triggered a leap in my understanding of RE. I have also revisited Spirit.Qi, not so deeply though, following Michael's direction. (Qi is a power tool that I believe I will definitely use, along with its siblings.)

Now I'm able to comprehend quite 'complex' expressions, including those that appeared in this thread.

Thank you Michael, Yechezkel, Stephan for your kind help!
