Re: [Boost-users] Tokenizer Question
At 19:19 2005-06-11, you wrote:
for tokenizing on whitespace, simple stream input (>>) to a std::string suffices.
My own tokenizer does just that--and puts the tokens into a deque.
IMO, it's hardly worth troubling yourself with a tokenizer for whitespace.
Well, not really. When parsing line-oriented output and semi-known structured lines it's handy to be able to sometimes work with a line's tokens as if they were in a vector or deque.
string yourline;
istringstream is(yourline);
deque<string> yourvec((istream_iterator<std::string>(is)),
                      istream_iterator<std::string>());

voila, a deque. It would be interesting to profile that against the hypothetical indexable tokenizer.
In fact, I was going to add a suggestion that the tokenizer also have the [] operator so that the individual tokens could be addressed as tok[1], etc.
-Tom
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"
Yes, we can all roll our own (most of the time), but I thought one of the purposes of Boost was to avoid that. -Tom
Victor, my humble tokenizer:

    string input_to_be_tokenized;
    istringstream ss(input_to_be_tokenized);
    string s;
    deque<string> tokens;
    while (ss >> s)
        tokens.push_back(s);

I made 3 test programs:

    string inp;
    for (int i = 0; i < 10000000; ++i)
        inp += " a";
    // generate a container of tokens from inp with one of three methods:
    //   my method (see above)
    //   Victor's method
    //   Boost, using a separator of (" \n\t")
    // if desired, loop over all the tokens using operator[] for the two
    // deques, and the iterator for Boost's container

Then I compiled them (gcc 3.4.2, Fedora Core 3, i386):

    g++ -pg -o progN progN.cc

Ran them without the final loop and saved 'gmon.out' under a unique name. Ran them again with the final loop and saved 'gmon.out' under a unique name. Ran gprof on all six gmon files and saved the outputs to unique files:

    gprof progN gmonX > X.prof

The accumulated times (sec) are surprising:

            Boost   Victor's    Mine
            =====   ========   =====
no loop      1.50      38.84   20.13
loop       131.91      38.89   23.91

Granted, I didn't do the tests multiple times, but it seems to me that the Boost tokenizer is great if you don't need to iterate through it, but it is the pits if you do.

-Tom

I'll send you my code and results if interested.

-Tom

I generated a string with 10,000,000 tokens (" a"): " a a a ... a", and timed your tokenizer against mine 10 times. Mine beat yours by 2 to 3 seconds every time. Then I used the Boost tokenizer and the timings went WAY down. So I think the benefits of the Boost tokenizer are well worth it, even for trivial tokenizing.

-Tom

_____
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Victor A. Wagner Jr.
Sent: Sunday, June 12, 2005 1:20 AM
To: boost-users@lists.boost.org
Subject: Re: [Boost-users] Tokenizer Question

At 19:19 2005-06-11, you wrote:
for tokenizing on whitespace, simple stream input (>>) to a std::string suffices.
My own tokenizer does just that--and puts the tokens into a deque.
IMO, it's hardly worth troubling yourself with a tokenizer for whitespace.
Well, not really. When parsing line-oriented output and semi-known structured lines it's handy to be able to sometimes work with a line's tokens as if they were in a vector or deque.

    string yourline;
    istringstream is(yourline);
    deque<string> yourvec((istream_iterator<std::string>(is)),
                          istream_iterator<std::string>());

voila, a deque. It would be interesting to profile that against the hypothetical indexable tokenizer.

In fact, I was going to add a suggestion that the tokenizer also have the [] operator so that the individual tokens could be addressed as tok[1], etc.

-Tom
"Tom Browder" writes:
The accumulated times (sec) are surprising:
            Boost   Victor's    Mine
            =====   ========   =====
no loop      1.50      38.84   20.13
loop       131.91      38.89   23.91
Granted, I didn't do the tests multiple times, but it seems to me that the Boost tokenizer is great if you don't need to iterate through it, but it is the pits if you do.
-Tom
I'll send you my code and results if interested.
-Tom
I'd be interested in seeing profiling output from your test --- when the tokenizer is reading from an input stream it builds up the strings character by character with operator +=. It can avoid creating the tokens character by character when it isn't dealing with input streams (it checks for an input stream by checking for std::input_iterator_tag, which I assume is set for istringstream). Robert Zeh http://home.earthlink.net/~rzeh
At 06:37 2005-06-13, Robert Zeh wrote:
"Tom Browder" writes:

The accumulated times (sec) are surprising:
            Boost   Victor's    Mine
            =====   ========   =====
no loop      1.50      38.84   20.13
loop       131.91      38.89   23.91
Granted, I didn't do the tests multiple times, but it seems to me that the Boost tokenizer is great if you don't need to iterate through it, but it is the pits if you do.
-Tom
I'll send you my code and results if interested.
-Tom
I'd be interested in seeing profiling output from your test --- when the tokenizer is reading from an input stream it builds up the strings character by character with operator +=.
It can avoid creating the tokens character by character when it isn't dealing with input streams (it checks for an input stream by checking for std::input_iterator_tag, which I assume is set for istringstream).
I'm still stunned that the range constructor is slower than Tom's loop. Substantially slower.
Robert Zeh http://home.earthlink.net/~rzeh
Victor A. Wagner Jr. wrote:
I'm still stunned that the range constructor is slower than Tom's loop. Substantially slower.
He used the profiler to time it - the input iterator's overhead is probably exaggerated by the profiling code. Turn off profiling and turn up the optimization and the two versions should run at about the same speed.
-----Original Message----- From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Robert Zeh
I'd be interested in seeing profiling output from your test
I'm sending my tokenizer test package to Robert. Anyone else wanting it, please ask. -Tom
"Tom Browder" writes:
-----Original Message----- From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Robert Zeh
I'd be interested in seeing profiling output from your test
I'm sending my tokenizer test package to Robert. Anyone else wanting it, please ask.
-Tom
Tom, thank you for the test package. It is very well put together, and I was able to start using it out of the box.

I tried things on my box. Without compiler optimization enabled (-O3) and no profiling I get the following timings (running each executable with -i):

    ttb (Boost):           28.93 user  0.38 system
    tts (Tom Browder's):   24.25 user  0.92 system
    ttv (Victor Wagner's): 24.66 user  1.05 system

With compiler optimization I get much better results:

    razeh@squirmy:~/timing/tokenizer_test$ make time
    ttb (Boost):           16.49 user  0.33 system
    tts (Tom Browder's):   21.46 user  0.95 system
    ttv (Victor Wagner's): 21.91 user  1.00 system

I only ran it twice; the second time the durations didn't change by more than half a second.

Robert Zeh
http://home.earthlink.net/~rzeh

P.S. My machine specifications: a 650 MHz Duron running Debian:

    razeh@squirmy:~/timing/tokenizer_test$ g++ -v
    Reading specs from /usr/lib/gcc-lib/i486-linux/3.3.5/specs
    Configured with: ../src/configure -v --enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared --enable-__cxa_atexit --with-system-zlib --enable-nls --without-included-gettext --enable-clocale=gnu --enable-debug --enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc i486-linux
    Thread model: posix
    gcc version 3.3.5 (Debian 1:3.3.5-12)
    razeh@squirmy:~/timing/tokenizer_test$ uname -a
    Linux squirmy 2.6.8-2-k7 #1 Mon Jan 24 03:29:52 EST 2005 i686 GNU/Linux
-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of Robert Zeh
Sent: Friday, June 17, 2005 10:00 PM
To: boost-users@lists.boost.org

Tom, thank you for the test package. It is very well put together, and I was able to start using it out of the box.
You're welcome. Glad it worked.
I tried things on my box. Without compiler optimization enabled (-O3) and no profiling I get the following timings: (running each executable with -i):
Iterating over the token container after it is instantiated:
    ttb (Boost):           28.93 user  0.38 system
    tts (Tom Browder's):   24.25 user  0.92 system
    ttv (Victor Wagner's): 24.66 user  1.05 system
With compiler optimization I get much better results:

    razeh@squirmy:~/timing/tokenizer_test$ make time
    ttb (Boost):           16.49 user  0.33 system
    tts (Tom Browder's):   21.46 user  0.95 system
    ttv (Victor Wagner's): 21.91 user  1.00 system
I'll bet if the boost::tokenizer developer made a special whitespace tokenizer it would do better yet: that was the reason for my post to this list in the first place.

-Tom Browder
On Sun, 19 Jun 2005 09:53:30 -0500, Tom Browder wrote
I'll bet if the boost::tokenizer developer made a special whitespace tokenizer it would do better yet: that was the reason for my post to this list in the first place.
As far as I know there hasn't been much/any change to tokenizer in several releases -- so I'm not sure the 'maintainer' is actively maintaining. My guess is that if you want an optimized version you'll have to write it yourself. If you do, submit it and either we'll track down Mr. Bandela -- or someone else will take a look at incorporating it. Jeff
"Jeff Garland"
On Sun, 19 Jun 2005 09:53:30 -0500, Tom Browder wrote
I'll bet if the boost::tokenizer developer made a special whitespace tokenizer it would do better yet: that was the reason for my post to this list in the first place.
As far as I know there hasn't been much/any change to tokenizer in several releases -- so I'm not sure the 'maintainer' is actively maintaining. My guess is that if you want an optimized version you'll have to write it yourself. If you do, submit it and either we'll track down Mr. Bandela -- or someone else will take a look at incorporating it.
Jeff
John was more than willing to incorporate the changes I submitted a while back. It's just a matter of sending him some email. Robert Zeh
My own tokenizer does just that--and puts the tokens into a deque.
Have you checked the string_algo library, especially the split function?
string str1("hello abc-*-ABC-*-aBc goodbye");
// ranges with iterators in original string
vector
participants (6)
- Daniel James
- Jeff Garland
- Martin
- Robert Zeh
- Tom Browder
- Victor A. Wagner Jr.