On Tue, Jul 14, 2009 at 11:48 AM, Polder, Matthew J wrote:
The Tokenizer library's char_separator has options to keep delimiters, drop delimiters, and keep or drop empty tokens. With escaped_list_separator, however, the only behavior is to keep empty tokens. While that is the obvious behavior for parsing CSV and similar files, it would be nice to also be able to drop empty tokens when constructing an escaped_list_separator.
I have a command line parser that reads its arguments either from the command line itself or from a text file supplied on the command line. In the file I'm passing in formats for the Date Time library's I/O routines. The formats contain spaces, which I escape so that each format stays a single token, and Tokenizer does handle that correctly. But I sometimes use multiple tabs to separate my fields so the file looks pretty in a text editor, and escaped_list_separator keeps the resulting empty tokens. My workaround for now is a switch in my command line parser that selects which separator to use.
In the loop where you process tokens, you should be able to deal with this by simply skipping empty tokens:

    mytok::iterator begin = toker.begin();
    mytok::iterator end = toker.end();
    while (begin != end) {
        if (!begin->empty()) {
            // do normal token processing
        }
        ++begin;
    }

I guess escaped_list_separator keeps empty tokens because it was originally designed to support CSV files, which can contain empty fields: ",," in a CSV represents an empty field, so in your case <space><space> represents an empty field too. But since an empty field makes *iter the empty string, and there is no other case in which *iter evaluates to the empty string, testing for empty() is reliable. If nothing else it's a not-too-hackish workaround, but a constructor argument such as bool ignore_empty_fields with a default value of false would be nice too.