Hi,

Two questions related to string tokenization.

1. Which is preferred: "split" or the "tokenizer" class?

2. Both of these methods seem geared towards splitting on characters rather than
splitting on substrings. Is there yet another method that is preferred for
splitting a string on an exact substring?

If I want to split "I<mark>Am<mark>A<mark>Test" into I, Am, A, Test, what is the
best way? It seems that for split I'll have to write my own predicate, and for
tokenizer, I'll have to write my own TokenizerFunction.

-- Seth N.
Two questions related to string tokenization.
1. Which is preferred: "split" or the "tokenizer" class?

2. Both of these methods seem geared towards splitting on characters rather than
splitting on substrings. Is there yet another method that is preferred for
splitting a string on an exact substring?

If I want to split "I<mark>Am<mark>A<mark>Test" into I, Am, A, Test, what is the
best way? It seems that for split I'll have to write my own predicate, and for
tokenizer, I'll have to write my own TokenizerFunction.
Have a look at the Boost Spirit parser framework. You can define arbitrarily
complex grammars that can be decorated with actions to suit your needs. The only
thing that it doesn't have "out of the box" that I need is short-circuit
evaluation (such that "show", "sho", and "sh" automatically map to the same
action), but that's a very small thing compared to the wonderful flexibility of a
complete recursive descent parser in C++.

Eric
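For what it's worth, a minimal sketch of that idea (not Eric's code; this uses
Spirit's newer Qi interface, and the grammar and names here are illustrative
assumptions only) might look like:

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        namespace qi = boost::spirit::qi;

        std::string input = "I<mark>Am<mark>A<mark>Test";
        std::vector<std::string> tokens;

        std::string::const_iterator first = input.begin(), last = input.end();

        // A token is one or more characters that do not begin the "<mark>"
        // delimiter; tokens are separated by the literal "<mark>".
        bool ok = qi::parse(first, last,
                            +(qi::char_ - qi::lit("<mark>")) % qi::lit("<mark>"),
                            tokens);

        if (ok)
            for (std::size_t i = 0; i < tokens.size(); ++i)
                std::cout << tokens[i] << '\n';   // I, Am, A, Test
        return 0;
    }

Spirit is heavier machinery than the string algorithms for a plain substring
split, but it pays off once the tokens have real grammatical structure.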
Seth Nielson wrote:
Hi
Two questions related to string tokenization.
1. Which is preferred: "split" or the "tokenizer" class?

2. Both of these methods seem geared towards splitting on characters rather than
splitting on substrings. Is there yet another method that is preferred for
splitting a string on an exact substring?

If I want to split "I<mark>Am<mark>A<mark>Test" into I, Am, A, Test, what is the
best way? It seems that for split I'll have to write my own predicate, and for
tokenizer, I'll have to write my own TokenizerFunction.
For split(), and simple <mark> cases, you can use existing predicates. For instance:

    boost::split(splitVec, submitData, boost::algorithm::is_any_of(","));

which makes it very simple to tokenize a string. I have used this approach in a
multi-level parsing algorithm. I don't know how the performance stacks up against
other approaches, but it serves my purpose and I can still understand it when I go
back six months later and look at it again. :-)

- Rush
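For reference, a self-contained version of that character-class split might look
like this (the includes, main() scaffolding, and sample input are assumptions
added for illustration, not part of Rush's post):

    #include <boost/algorithm/string.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        std::string submitData = "a,b,c";   // hypothetical sample input
        std::vector<std::string> splitVec;

        // is_any_of(",") matches any single character in the set -- here just ','.
        boost::split(splitVec, submitData, boost::algorithm::is_any_of(","));

        for (std::size_t i = 0; i < splitVec.size(); ++i)
            std::cout << splitVec[i] << '\n';   // a, b, c
        return 0;
    }

Note that is_any_of() treats its argument as a set of single characters, which is
exactly why it doesn't cover a multi-character delimiter like "<mark>".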
No, that won't suit my needs. Did you read my message? I said I need MARK to be a
substring, not a single character.

-- Seth N.

Rush Manbert wrote:
Seth Nielson wrote:
[...]
For split(), and simple <mark> cases, you can use existing predicates. For instance: boost::split (splitVec, submitData, boost::algorithm::is_any_of (","));
which makes it very simple to tokenize a string. I have used this approach in a multi-level parsing algorithm. I don't know how the performance stacks up against other approaches, but it serves my purpose and I can still understand it when I go back 6 months later and look at it again. :-)
- Rush
On Mon, 24 Jul 2006 14:53:51 -0500, Seth Nielson
No, that won't suit my needs. Did you read my message? I said I need MARK to be a substring, not a single character.
Use a split iterator:

    string str = "1ab2ab3";
    vector<string> results;

    typedef boost::split_iterator<std::string::iterator> string_split_iterator;

    for (string_split_iterator it = make_split_iterator(str, first_finder("ab"));
         it != string_split_iterator(); ++it)
    {
        results.push_back(*it);
    }

As for which one is preferable, I like the string algo library much better than I
like the tokenizer library. It's very, very good.

-- Be seeing you.
Thanks. For the record, I had to make one small correction:

    { results.push_back(copy_range<std::string>(*it)); }

because *it was just a range. (A self-contained version with this fix folded in
follows the quoted message below.)

-- Seth N.

Thore B.Karlsen wrote:
On Mon, 24 Jul 2006 14:53:51 -0500, Seth Nielson
wrote: No, that won't suit my needs. Did you read my message? I said I need MARK to be a substring, not a single character.
Use a split iterator:
string str = "1ab2ab3"; vector<string> results;
typedef boost::split_iteratorstd::string::iterator string_split_iterator;
for (string_split_iterator it = make_split_iterator(str, first_finder("ab")); it != string_split_iterator(); ++it) { results.push_back(*it); }
As for which one is preferable, I like the string algo library much better than I like the tokenizer library. It's very, very good.
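Putting the split_iterator loop and that copy_range correction together, a
self-contained sketch could look like this (the includes, main() scaffolding, and
sample input are assumptions added for illustration):

    #include <boost/algorithm/string.hpp>
    #include <boost/range/iterator_range.hpp>   // boost::copy_range
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        using namespace boost::algorithm;

        std::string str = "I<mark>Am<mark>A<mark>Test";
        std::vector<std::string> results;

        typedef split_iterator<std::string::iterator> string_split_iterator;

        // *it is an iterator_range into str, so convert each piece to a string.
        for (string_split_iterator it = make_split_iterator(str, first_finder("<mark>"));
             it != string_split_iterator(); ++it)
        {
            results.push_back(boost::copy_range<std::string>(*it));
        }

        for (std::size_t i = 0; i < results.size(); ++i)
            std::cout << results[i] << '\n';   // I, Am, A, Test
        return 0;
    }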
Seth Nielson wrote:
No, that won't suit my needs. Did you read my message? I said I need MARK to be a substring, not a single character.
This is implemented in my proposed super_string class. Docs at:

http://www.crystalclearsoftware.com/libraries/super_string/index.html

code for download in the boost vault at:

http://tinyurl.com/dbcye

Using super_string it looks like:

    super_string s("<mark>Am<mark>A<mark>Test");
    super_string::string_vector out_vec;
    s.split("<mark>", out_vec);
    //iterate on out_vec and process.

super_string is just a wrapper around std::string and boost libraries that do the
heavy lifting. If you want to do it yourself, you can use string_algo directly.
The relevant code that makes this work is:

    namespace alg = boost::algorithm;
    string predicate("<mark>");
    string input("<mark>Am<mark>A<mark>Test");
    vector<string> result;
    alg::iter_split(result, input, alg::first_finder(predicate, alg::is_equal()));
    //iter_split will return entire string if no matches found

You can look at the implementation of super_string for hints...

http://www.crystalclearsoftware.com/libraries/super_string/super__string_8hp...

HTH,
Jeff
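A self-contained version of that string_algo snippet (with the includes and
main() scaffolding added here as assumptions, and the input taken from the
original question) might look like:

    #include <boost/algorithm/string.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        namespace alg = boost::algorithm;

        std::string predicate("<mark>");
        std::string input("I<mark>Am<mark>A<mark>Test");
        std::vector<std::string> result;

        // Splits on every occurrence of the whole substring "<mark>";
        // if no match is found, result holds the entire input string.
        alg::iter_split(result, input,
                        alg::first_finder(predicate, alg::is_equal()));

        for (std::size_t i = 0; i < result.size(); ++i)
            std::cout << result[i] << '\n';   // I, Am, A, Test
        return 0;
    }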
On Mon, 24 Jul 2006 14:32:11 -0700, Jeff Garland
alg::iter_split(result, input, alg::first_finder(predicate, alg::is_equal())); //iter_split will return entire string if no matches found
iter_split() doesn't seem to be documented in the string_algo documentation. Am I just not seeing it, or is there a reason for this? -- Be seeing you.
Thore B.Karlsen wrote:
On Mon, 24 Jul 2006 14:32:11 -0700, Jeff Garland
wrote: [...]
alg::iter_split(result, input, alg::first_finder(predicate, alg::is_equal())); //iter_split will return entire string if no matches found
iter_split() doesn't seem to be documented in the string_algo documentation. Am I just not seeing it, or is there a reason for this?
I'd guess it's a bug in the docs... Jeff
Hi,

Jeff Garland wrote:
Thore B.Karlsen wrote:
On Mon, 24 Jul 2006 14:32:11 -0700, Jeff Garland
wrote: [...]
alg::iter_split(result, input, alg::first_finder(predicate, alg::is_equal())); //iter_split will return entire string if no matches found

iter_split() doesn't seem to be documented in the string_algo documentation. Am I just not seeing it, or is there a reason for this?
I'd guess it's a bug in the docs...
Your guess is not right. iter_split is intentionally not documented. It is only
used as an implementation helper for split. We want to encourage the use of
find/split_iterator. This was one of the results of the library review. The same
thing is happening in regex.

Regards,
Pavol
Seth Nielson wrote:
Hi
Two questions related to string tokenization.
1. Which is preferred: "split" or the "tokenizer" class?

2. Both of these methods seem geared towards splitting on characters rather than
splitting on substrings. Is there yet another method that is preferred for
splitting a string on an exact substring?

If I want to split "I<mark>Am<mark>A<mark>Test" into I, Am, A, Test, what is the
best way? It seems that for split I'll have to write my own predicate, and for
tokenizer, I'll have to write my own TokenizerFunction.
Or a third way: you could use a regex_token_iterator and split on regexes.

John.
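A minimal sketch of that regex approach (the scaffolding and names here are
assumptions, not from John's post) could be:

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    int main()
    {
        std::string input = "I<mark>Am<mark>A<mark>Test";
        boost::regex delim("<mark>");

        // Submatch index -1 selects the text *between* matches, i.e. the tokens.
        boost::sregex_token_iterator it(input.begin(), input.end(), delim, -1), end;

        std::vector<std::string> tokens(it, end);
        for (std::size_t i = 0; i < tokens.size(); ++i)
            std::cout << tokens[i] << '\n';   // I, Am, A, Test
        return 0;
    }

Unlike the header-only string algorithms, Boost.Regex traditionally needs to be
linked as a compiled library.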
participants (7)
- Eric Hill
- Jeff Garland
- John Maddock
- Pavol Droba
- Rush Manbert
- Seth Nielson
- Thore B.Karlsen