[boost] token_iterator comments, suggestions

5 Jun 2004

      Hi,

For long time in my daytime projects I was using my class token_iterator
designed based on old iterator_adaptor design (adopted for old compilers).
Now when need arose for such functionality in Boost.Test. I looked in
direction of boost::token_iterator. After some struggle I end up adopting my
version to new design. Here some of comments and issues  that made me opt
so.

1. token_iterator follows in a sense a Policy-based design (PBD). Where
TokenizerFunc is a Policy that implements most of the functionality. The
problem here IMO (unfortunately you see it now quite frequently), that I do
not see a reason to use such a design. PBD makes sense it your generic
component do a lot of work, provide common interface, combine several
independent policies. In this case most what token_iterator does is just a
forwarding (it become especially clear if you switch to use
input_iterator_adaptor).
  It wouldn't be that bad, but PBD in this case causes significant usability
inconvenience. Instead of plain and simple:
    token_iterator it( "a:b", ":" );

  I have to write something like:

  token_iterator<char_delimiters_separator<char> > it(
char_delimiters_separator<char>( ":" ), "a:b" ).

All of that just because we have separate notion of Tokenizer function. IMO
it too big price to pay.

My suggestion is to get rid of the TokenizerFunction notion. Just implement
many different iterators with the same scope of interest: tokenization. We
will need to find proper names for iterators and instead of TokenizerFunc
model description we may have page explaining how to write custom token
iterators (if necessary, which I doubt).

2. token_iterator deals with begin/end iterators.
I understand that in some rare cases there is need to iterate over range of
iterators (BTW, std::input_iterator gonna be bad example - did you ever
tried to use token_iterator over it?). But in most case we want to tokenize
a string. And in most cases it either std::string or [char|wchar_t] const*.
The fact that Iterator is a template parameter makes token_iterator usage
quite inconvenient. In most cases the only template parameter that is
required is IMO character type.

My suggestion: lets have 2 different types if token iterators: one to
iterate over string and separate to iterate over range of iterators.

3. I found it confusing in many cases how delimiters gets specified in
token_iterator:
token_iterator( "...", ":", " " )

Which  string represent dropped delimiters, which kept?
And what if I just want empty tokens? I need to write
token_iterator( "...", "", "", keep_empty_tokens )
What are these two empty strings doing here?My suggestion is to use named
parameters scheme. Like this:
token_iterator( "...", dropped_delimiters = ":", kept_delimiters = " " );

4. 2 Tokenizer functions use Traits as *policy* parameter.
What actually is required in Compare predicate, that may or may not be
implemented using char_traits. While std::basic_string references inside
tokenizer function implementation should refer to traits by name
std::char_traits

5. What if I want to use isspace as kept delimiters policy? There is no way
to specify that.
IMO delimiter policy should be separated (as runtime policy).

6. You use iterator_category as a selector of assigner. It's better to use
traversal instead

7. In many places you use
#if     !defined(BOOST_MSVC) || BOOST_MSVC > 1300
        typename
#endif

use BOOST_DEDUCED_TYPENAME instead

8. What is the purpose of token_iterator_generator. I do not see a need for
one with new iterator facade design.

9. Why would we need all the access methods?
      Iterator base()const{return begin_;}
      Iterator end()const{return end_;};
      TokenizerFunc tokenizer_function()const{return f_;}
      Type current_token()const{return tok_;}
      bool at_end()const{return valid_;}

I couldn't imagine why do we need them.

10 Token iterator asserts in attempts to dereference or iterate end
iterator.
I don't believe it's correct. In many cases (mostly in design that assumes
some optional tokens in the end) it's much more convenient this does not
happened.

My suggestion is to return empty token on dereference and stay quite on
iteration.

11. In simpler design there won't be need in template copy constructor for
token_iterator

12. Documentation need to be updated IMO, cause even as things stands now it
dies not represent the best practice moved from type generators usage.

13. Files under libs/tokenizer lacks any structure required by boost
standards.

14. There is no real test suite only some small examples supplied.

15. Why escaped_list_separator was named this way. It seems strange to
emphasize in a name the fact that it use escape symbols, while major
separations is that it is CSV list iterator - and this is how I would name
it

I implemented string_token_iterator and range_token_iterator for use in
Boost.Test. See here (in a couple hours) http://tinyurl.com/28db9 for
implementation and here http://tinyurl.com/287uk for test suite.

Regards,

Gennadiy.

[boost] token_iterator comments, suggestions

Gennadiy Rozental