[boost-users] tokenizer vs string algorithm split.

Hi, I was wondering which one is better and faster for splitting a file of CSV number values and putting them into a container of double.

1.) Which option is better?

// method 1
std::vector<std::string> split_string;
boost::algorithm::trim(flist);
boost::algorithm::split(split_string, flist, boost::algorithm::is_any_of(","));
std::vector<double> elements;
BOOST_FOREACH(std::string s, split_string)
{
    elements += boost::lexical_cast<double>(s);
}

// method 2
boost::char_separator<char> sep(",");
boost::tokenizer<boost::char_separator<char> > tokens(flist, sep);
std::vector<double> elements;
BOOST_FOREACH(std::string token, tokens)
{
    elements += boost::lexical_cast<double>(token);
}

2.) When is it better to use string algorithm split instead of tokenizer, and vice versa?

My limited experience is that tokenizer is faster. I have tried it several times in different schemes, but tokenizer always seems to come out faster by more than a little. I would prefer the split() scheme, but I haven't found a way to make it go faster.

Larry

This may not matter for the CSV file you're parsing, but for a more general CSV-processing solution you'd also have to handle fields that are surrounded by quotes and may even contain embedded commas. I don't know whether split or tokenizer can handle that.

-- Bill

Bill Buklis wrote:
This may not matter for the CSV file you’re parsing, but at least for a more general solution for CSV processing, you’d also have to handle fields that are surrounded by quotes and may even contain embedded commas. I don’t know if split or tokenizer can handle that.
Tokenizer's escaped_list_separator handles quotes and embedded commas properly.

If your CSV has empty fields (e.g., data,data,,data), the only way I found to handle the empty field was to handle the separators yourself with the tokenizer; otherwise the tokenizer would skip the field (a la strtok()).

For CSVs I also tried Spirit and came up with a scheme (with lots of help, I would add) that seemed to work, in not many lines of code. It took more time than I was interested in spending to figure it out.

Larry

Hi Larry, can you share the code that handles empty fields?

Thanks,
Christian

This was more of a brute-force approach that I used when I first started using Boost a few years ago. There are probably better and/or more efficient ways to do it; it was sufficient for what I was doing.

//-----------------------------------------------------------------
// Using tokenizer: keep the "," separators as tokens and count them
// yourself, so that empty fields can be detected.
using namespace boost;

typedef char_separator<char> CharTokens;
typedef tokenizer<CharTokens> SepTokenizer;
typedef tokenizer<CharTokens>::iterator SepIterator;

CharTokens cs("", ",");   // drop nothing; keep "," as its own token
std::string str;          // this holds the CSV input line
SepTokenizer st(str, cs);

int field_number = 0;
for (SepIterator eti = st.begin(); eti != st.end(); ++eti)
{
    if (*eti == ",")      // see if this is a separator
    {
        field_number++;
    }
    else
    {
        // *eti is a value, which could follow an empty field;
        // field_number is the field's position in the list
    }
}

//-----------------------------------------------------------------
// Using Spirit (classic).
//
// The result is a vector of items, much like split() - including
// empty strings in the vector for empty fields.
//
// Probably could be used with any<>
using namespace boost::spirit;

char *plist_csv = new char[4096];   // gets filled with the CSV line
rule<> list_csv, list_csv_item;
std::vector<std::string> vec_item, vec_list;
parse_info<> result;

list_csv_item =
    confix_p('\"', *c_escape_ch_p, '\"')
    | longest_d(real_p | int_p | *(alnum_p | ch_p('_')))
    ;
list_csv =
    list_p((!list_csv_item)[append(vec_item)], ',')[append(vec_list)]
    ;

result = parse(plist_csv, list_csv);
if (result.hit)
{
    // Got at least part of the line
    if (result.full)
    {
        // All fields present
    }
}

Thanks, Larry.

chun ping wang wrote:
Hi I was wondering which one is better and faster to split a file of csv value of number and put it into container of double. [code snipped]
2.) When is it better to use string algorithm split instead of tokenizer and vice versa.
Hi,

I didn't make any speed comparison between split and tokenizer, but there are ways to get significant speed improvements when using the split algorithm.

Most speed problems result from unwanted copying of strings. This is a quite costly operation and it should be avoided at all cost if speed is important.

First, there is an obvious problem in your code: in BOOST_FOREACH you are missing a reference in the string parameter. This means that every string will be copied in the loop.

You can improve the actual usage of the split algorithm as well. A quite significant speedup can be achieved if you use std::vector<boost::iterator_range<std::string::iterator> > to hold the results instead of a vector of strings. This way the split algorithm will only store references to the tokens in the original string, avoiding any copying until it is really needed.

Going one step further, you can avoid the intermediate vector altogether and use split_iterator directly:

split_iterator<string::iterator> siter =
    make_split_iterator(flist, token_finder(is_any_of(","), token_compress_off));

BOOST_FOREACH(
    iterator_range<string::iterator> rngToken,
    make_iterator_range(siter, split_iterator<string::iterator>()))
{
    // Do whatever you want with the token here. It is represented by
    // an iterator_range, so no copying has been done yet.

    // You can make a copy if necessary:
    string strToken = copy_range<string>(rngToken);
}

Best Regards,
Pavol.

Sorry, I'm kind of confused about your last example and how it helps me store the values in an STL container of double.

Thanks.

Hi,

The actual storing of the values in the STL container is an implementation detail on your side; I was not writing about that. A simple implementation can look like this:

elements.push_back( lexical_cast<double>(rngToken) );

Regards,
Pavol.
_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users
participants (6)
- Bill Buklis
- Christian Henning
- chun ping wang
- Edward Diener
- Larry
- Pavol Droba