
Hi,

Two questions related to string tokenization.

1. Which is preferred: using "split" or the "tokenizer" class?
2. Both of these methods seem geared towards splitting on characters rather than splitting on substrings. Is there yet another method that is preferred for splitting a string on an exact substring?

If I want to split "I<mark>Am<mark>A<mark>Test" into I, Am, A, Test, what is the best way? It seems that for split I'll have to write my own predicate, and for tokenizer I'll have to write my own TokenizerFunction.

-- Seth N.

Have a look at the Boost Spirit parser framework. You can define an arbitrarily complex grammar and decorate it with actions to suit your needs. The only thing I need that it doesn't have "out of the box" is short-circuit evaluation (such that "show", "sho", and "sh" automatically map to the same action), but that's a very small thing compared to the wonderful flexibility of a complete recursive descent parser in C++.

Eric

Seth Nielson wrote:
For split() and simple <mark> cases, you can use existing predicates. For instance:

    boost::split(splitVec, submitData, boost::algorithm::is_any_of(","));

which makes it very simple to tokenize a string. I have used this approach in a multi-level parsing algorithm. I don't know how the performance stacks up against other approaches, but it serves my purpose and I can still understand it when I go back 6 months later and look at it again. :-)

- Rush

On Mon, 24 Jul 2006 14:53:51 -0500, Seth Nielson <sethjn@gmail.com> wrote:
No, that won't suit my needs. Did you read my message? I said I need MARK to be a substring, not a single character.
Use a split iterator:

    std::string str = "1ab2ab3";
    std::vector<std::string> results;
    typedef boost::split_iterator<std::string::iterator> string_split_iterator;
    for (string_split_iterator it = boost::make_split_iterator(str, boost::first_finder("ab"));
         it != string_split_iterator(); ++it)
    {
        // *it is an iterator_range, so copy it into a std::string explicitly
        results.push_back(boost::copy_range<std::string>(*it));
    }

As for which one is preferable, I like the string algo library much better than I like the tokenizer library. It's very, very good.

-- Be seeing you.

Seth Nielson wrote:
No, that won't suit my needs. Did you read my message? I said I need MARK to be a substring, not a single character.
This is implemented in my proposed super_string class. Docs at:
http://www.crystalclearsoftware.com/libraries/super_string/index.html
Code for download in the Boost vault at:
http://tinyurl.com/dbcye

Using super_string it looks like:

    super_string s("<mark>Am<mark>A<mark>Test");
    super_string::string_vector out_vec;
    s.split("<mark>", out_vec);
    // iterate on out_vec and process

super_string is just a wrapper around std::string and the Boost libraries that do the heavy lifting. If you want to do it yourself, you can use string_algo directly. The relevant code that makes this work is:

    namespace alg = boost::algorithm;
    std::string predicate("<mark>");
    std::string input("<mark>Am<mark>A<mark>Test");
    std::vector<std::string> result;
    alg::iter_split(result, input, alg::first_finder(predicate, alg::is_equal()));
    // iter_split will return the entire string if no matches are found

You can look at the implementation of super_string for hints...
http://www.crystalclearsoftware.com/libraries/super_string/super__string_8hp...

HTH, Jeff
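For comparison, the same substring split can be sketched with the standard library alone. This is not Boost or super_string code: the function name split_on_substring_std is hypothetical, and returning the whole input for an empty separator is an assumption made here to avoid an endless loop.

```cpp
#include <string>
#include <vector>

// Hypothetical standard-library-only sketch: split `input` on every exact
// occurrence of `separator` using std::string::find.
std::vector<std::string> split_on_substring_std(const std::string& input,
                                                const std::string& separator)
{
    std::vector<std::string> parts;
    if (separator.empty()) {
        // Assumption: an empty separator means "do not split".
        parts.push_back(input);
        return parts;
    }
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = input.find(separator, start);
        if (pos == std::string::npos) {
            // No further separator: the remainder is the last piece.
            parts.push_back(input.substr(start));
            break;
        }
        parts.push_back(input.substr(start, pos - start));
        start = pos + separator.size();  // continue after the separator
    }
    return parts;
}
```

In this sketch, an input that begins with the separator, such as "<mark>Am<mark>A<mark>Test", produces an empty first element followed by Am, A, Test, and an input with no match comes back as a single element.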

Hi, Jeff Garland wrote:
Your guess is not right. iter_split is intentionally not documented. It is only used as an implementation helper for split. We want to encourage people to use find_iterator and split_iterator instead. This was one of the results of the library review. The same thing is happening in regex.

Regards, Pavol
participants (7)
- Eric Hill
- Jeff Garland
- John Maddock
- Pavol Droba
- Rush Manbert
- Seth Nielson
- Thore B. Karlsen