token_iterator comments, suggestions

Hi, For long time in my daytime projects I was using my class token_iterator designed based on old iterator_adaptor design (adopted for old compilers). Now when need arose for such functionality in Boost.Test. I looked in direction of boost::token_iterator. After some struggle I end up adopting my version to new design. Here some of comments and issues that made me opt so. 1. token_iterator follows in a sense a Policy-based design (PBD). Where TokenizerFunc is a Policy that implements most of the functionality. The problem here IMO (unfortunately you see it now quite frequently), that I do not see a reason to use such a design. PBD makes sense it your generic component do a lot of work, provide common interface, combine several independent policies. In this case most what token_iterator does is just a forwarding (it become especially clear if you switch to use input_iterator_adaptor). It wouldn't be that bad, but PBD in this case causes significant usability inconvenience. Instead of plain and simple: token_iterator it( "a:b", ":" ); I have to write something like: token_iterator<char_delimiters_separator<char> > it( char_delimiters_separator<char>( ":" ), "a:b" ). All of that just because we have separate notion of Tokenizer function. IMO it too big price to pay. My suggestion is to get rid of the TokenizerFunction notion. Just implement many different iterators with the same scope of interest: tokenization. We will need to find proper names for iterators and instead of TokenizerFunc model description we may have page explaining how to write custom token iterators (if necessary, which I doubt). 2. token_iterator deals with begin/end iterators. I understand that in some rare cases there is need to iterate over range of iterators (BTW, std::input_iterator gonna be bad example - did you ever tried to use token_iterator over it?). But in most case we want to tokenize a string. And in most cases it either std::string or [char|wchar_t] const*. The fact that Iterator is a template parameter makes token_iterator usage quite inconvenient. In most cases the only template parameter that is required is IMO character type. My suggestion: lets have 2 different types if token iterators: one to iterate over string and separate to iterate over range of iterators. 3. I found it confusing in many cases how delimiters gets specified in token_iterator: token_iterator( "...", ":", " " ) Which string represent dropped delimiters, which kept? And what if I just want empty tokens? I need to write token_iterator( "...", "", "", keep_empty_tokens ) What are these two empty strings doing here?My suggestion is to use named parameters scheme. Like this: token_iterator( "...", dropped_delimiters = ":", kept_delimiters = " " ); 4. 2 Tokenizer functions use Traits as *policy* parameter. What actually is required in Compare predicate, that may or may not be implemented using char_traits. While std::basic_string references inside tokenizer function implementation should refer to traits by name std::char_traits 5. What if I want to use isspace as kept delimiters policy? There is no way to specify that. IMO delimiter policy should be separated (as runtime policy). 6. You use iterator_category as a selector of assigner. It's better to use traversal instead 7. In many places you use #if !defined(BOOST_MSVC) || BOOST_MSVC > 1300 typename #endif use BOOST_DEDUCED_TYPENAME instead 8. What is the purpose of token_iterator_generator. I do not see a need for one with new iterator facade design. 9. Why would we need all the access methods? Iterator base()const{return begin_;} Iterator end()const{return end_;}; TokenizerFunc tokenizer_function()const{return f_;} Type current_token()const{return tok_;} bool at_end()const{return valid_;} I couldn't imagine why do we need them. 10 Token iterator asserts in attempts to dereference or iterate end iterator. I don't believe it's correct. In many cases (mostly in design that assumes some optional tokens in the end) it's much more convenient this does not happened. My suggestion is to return empty token on dereference and stay quite on iteration. 11. In simpler design there won't be need in template copy constructor for token_iterator 12. Documentation need to be updated IMO, cause even as things stands now it dies not represent the best practice moved from type generators usage. 13. Files under libs/tokenizer lacks any structure required by boost standards. 14. There is no real test suite only some small examples supplied. 15. Why escaped_list_separator was named this way. It seems strange to emphasize in a name the fact that it use escape symbols, while major separations is that it is CSV list iterator - and this is how I would name it I implemented string_token_iterator and range_token_iterator for use in Boost.Test. See here (in a couple hours) http://tinyurl.com/28db9 for implementation and here http://tinyurl.com/287uk for test suite. Regards, Gennadiy.

Hi, On Sat, Jun 05, 2004 at 08:16:17AM -0400, Gennadiy Rozental wrote:
Hi,
For long time in my daytime projects I was using my class token_iterator designed based on old iterator_adaptor design (adopted for old compilers). Now when need arose for such functionality in Boost.Test. I looked in direction of boost::token_iterator. After some struggle I end up adopting my version to new design. Here some of comments and issues that made me opt so.
[snip] This might seem a little bit out of topic, but some of the points you have addressed are already solved by find/split_iterator provided in the string algo library. Maybe you can have a look there. Unfortunately, the documentation is not 100% ready yet. find_iterator is also based on generic iterator concept, however, the iterator is the only template parameter, so you can easily specialize for string. Simple usage looks like this: #include <boost/algorithm/string.hpp> using namespace boost; using namespace std; .... typedef split_iterator<string::iterator> splitter_t; { string str("hello;how are you,bye"); for(splitter_t it=splitter_t(str, token_finder(is_any_of(";,"))); it!=splitter_t(); ++it) { cout << copy_iterator_range<string>(*it) << endl; } } I hope, that there are not too many syntax errors in the example. What can be seen there: - find/split_iterator is templated only by an iterator. - Finding algorithm is provided at runtime in the form of a "finder". There is a coupe of those defined in the string_algo lib. Specification can be seen here: http://tinyurl.com/29dg3 - An example here uses token_finder. It takes a predicate that recognizes valid tokens (separators). Thare is also a bunch of classification predicates there to help here (boost/algorithm/string/classification.hpp) - Default initialized find_iterator is an end mark for all iterations. Once the iterator reaches the end position, it will not assert on later increments, rather it will not change. - A value of the iterator is an iterator_range. It a a pair of iterators delimiting begin and end of match. Just have a look if you are interested. Regards, Pavol.

Hi,
On Sat, Jun 05, 2004 at 08:16:17AM -0400, Gennadiy Rozental wrote:
Hi,
For long time in my daytime projects I was using my class token_iterator designed based on old iterator_adaptor design (adopted for old compilers). Now when need arose for such functionality in Boost.Test. I looked in direction of boost::token_iterator. After some struggle I end up adopting my version to new design. Here some of comments and issues that made me opt so.
[snip]
This might seem a little bit out of topic, but some of the points you have addressed are already solved by find/split_iterator provided in the string algo
"Pavol Droba" <droba@topmail.sk> wrote in message news:20040605214012.GH19968@lenin.felcer.sk... library.
Maybe you can have a look there. Unfortunately, the documentation is not 100% ready yet.
Looks like I was not able to express myself clearly in original post. But actually what you suggesting is especially what I was arguing *against*. 1. Your library is dedicated to string algorithms. Why is it use range if iterators so frequently then? As an input and an output. I did found that having range of it iterators as a model of substring is very convenient. That what my basic_cstring is used for (actually, I may say that in my development this is single most widely used class). But it is full-fledged string as good as std::basic_string. And it does not based in iterators, but on character type. Iterator range does may be usable, BUT outside of string algo library. 2. Finder concept maybe useful in implementing different string search algorithm. BUT as implementation detail or generic case solution. I do not see any reason for interface that assumes explicit specification of finders provided by the library. There should be a generic algorithm/iterator that expect User defined finder, but there should be explicit algorithms/iterators dedicated for each type of search. I do that for algorithms, why not for iterators?
find_iterator is also based on generic iterator concept, however, the iterator is the only template parameter, so you can easily specialize for string.
Actually when I work with strings I do not want to know about iterators at all.
Simple usage looks like this:
In fact from your sources it seems that I have to write like this boost::split_iterator<std::string::iterator> it( str.begin(), str.end(), boost::token_finder(boost::is_any_of(";,")); IMO it's way to verbose, while still missing some of the functionality provided by the tokenizer library. My choice: boost::token_iterator it( str, boost::dropped_delimiters = ";," ); This looks shorter and clearer IMO. Second concern about your solution, that makes it in a sense even worse than the one provided by boost::tokenizer, is that you actually does not specify type of the Finder in the iterator specification. The price is that eventually you have to, this way or another, pay with runtime overhead. This would be unacceptable to me. There are couple of things that your solution provide in addition to what my and boost::tokenizer solution provide. Particularly it's an ability to specify several addition delimitation policies. I thought about this. But it isn't on top of my priorities (mostly because it is comparatively rarely needed). In any case this generic solution should not interfere with most convenient one used in majority of the cases. Regards, Gennadiy.

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> wrote in message news:c9tphv$aib$1@sea.gmane.org... | "Pavol Droba" <droba@topmail.sk> wrote in message [snip] | 1. Your library is dedicated to string algorithms. [snip] | Iterator range does may be usable, BUT outside of string algo library. (We have agreed to move iterator_range to Boost.Range.) This is a discussion we need to have at some point. Either now or as part of the post-review of the string algorithms. Consider for example one of Pavols algorithms in the string library: (see http://lenin.felcer.sk/~droba/boost-book/string_algo.reference.html#header.b... ) template<typename InputContainerT, typename SearchContainerT> inline iterator_range< typename result_iterator_of< InputContainerT >::type > find_first(InputContainerT &, const SearchContainerT &); compared to what I tend to like better template<typename InputContainerT, typename SearchContainerT> inline range< InputContainerT > find_first(InputContainerT &, const SearchContainerT &); I think that is a drastic reduction in interface complexity. In the former, the algorithm is range based yet the return value have been specified with iterators. My current view is that Boost.Range should contain both itarator_range<> and range<>, but that range-based algorithms should use range<> | There are couple of things that your solution provide in addition to what my | and boost::tokenizer solution provide. Particularly it's an ability to | specify several addition delimitation policies. I thought about this. But | it isn't on top of my priorities (mostly because it is comparatively rarely | needed). one token separator policy that could be usefull is the everything_but_one_of( ",; " ); but maybe we're starting to overlap with regex_token_iterator here? br

"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
My current view is that Boost.Range should contain both itarator_range<> and range<>, but that range-based algorithms should use range<>
If I understand you correctly, it sounds a bit rigid. Would you not provide a range-based lower_bound as a matter of principle, or what? -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

"David Abrahams" <dave@boost-consulting.com> wrote in message news:ur7stqvrd.fsf@boost-consulting.com... | "Thorsten Ottosen" <nesotto@cs.auc.dk> writes: | | > My current view is that Boost.Range should contain both itarator_range<> and range<>, | > but that range-based algorithms should use range<> | | If I understand you correctly, it sounds a bit rigid. Would you not | provide a range-based lower_bound as a matter of principle, or what? I'm not sure I understand. Could you give an example? (remark: range<> would be less flexible than iterator_range<> since its template parameter is an ExternalRange, but it would provide a higher abstraction level. ) -Thorsten

"Thorsten Ottosen" <nesotto@cs.auc.dk> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:ur7stqvrd.fsf@boost-consulting.com... | "Thorsten Ottosen" <nesotto@cs.auc.dk> writes: | | > My current view is that Boost.Range should contain both itarator_range<> and range<>, | > but that range-based algorithms should use range<> | | If I understand you correctly, it sounds a bit rigid. Would you not | provide a range-based lower_bound as a matter of principle, or what?
I'm not sure I understand. Could you give an example?
It sounds like you're saying that algorithms that traffic in ranges shouldn't also deal in iterators (I could be mistaken), but it seems to me that for some algorithms (e.g. lower_bound), operating on a range and returning an iterator is just right.
(remark: range<> would be less flexible than iterator_range<> since its template parameter is an ExternalRange, but it would provide a higher abstraction level. )
This part's beyond me. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

"David Abrahams" <dave@boost-consulting.com> wrote in message news:uhdtpqogn.fsf@boost-consulting.com... | > I'm not sure I understand. Could you give an example? | | It sounds like you're saying that algorithms that traffic in ranges | shouldn't also deal in iterators (I could be mistaken), but it seems | to me that for some algorithms (e.g. lower_bound), operating on a | range and returning an iterator is just right. yeah, I'm saying that algorithms that traffic in ranges (eg vector<T>) and return iterator pairs should return range< vector<T> > and not iterator_range< typename iterator_of< vector<T> >::type > br Thorsten

I disagree. specifically I belive we should take the time to write the syntactic sugar which allows ALL of the current algorithms to accept range<>s (that's the plural of range<>, not some variable "s"). At Saturday 2004-06-05 22:12, you wrote:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:uhdtpqogn.fsf@boost-consulting.com...
| > I'm not sure I understand. Could you give an example? | | It sounds like you're saying that algorithms that traffic in ranges | shouldn't also deal in iterators (I could be mistaken), but it seems | to me that for some algorithms (e.g. lower_bound), operating on a | range and returning an iterator is just right.
yeah, I'm saying that algorithms that traffic in ranges (eg vector<T>) and return iterator pairs should return
range< vector<T> >
and not
iterator_range< typename iterator_of< vector<T> >::type >
br
Thorsten
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

"Victor A. Wagner Jr." <vawjr@rudbek.com> wrote in message news:6.1.0.6.2.20040607070211.0486b980@mail.rudbek.com... | I disagree. with what? It's hard to know when you top-post :-( | specifically I belive we should take the time to write the syntactic sugar | which allows ALL of the current algorithms to accept range<>s (that's the | plural of range<>, not some variable "s"). I can't figure out what that has to do with the discussion :-) I too like range version of std::copy() etc. To recap, I'm saying pair<iterator,iterator> foo( iterator, iterator ); in its range version should be range< Range > foo( Range& ); and not iterator_range< typename iterator_of<Range>::type > foo( Range& ); Pavol disagrees, that makes the situation a 1 against 1 until other people start having an opinion. br Thorsten

At Monday 2004-06-07 07:26, you wrote:
"Victor A. Wagner Jr." <vawjr@rudbek.com> wrote in message news:6.1.0.6.2.20040607070211.0486b980@mail.rudbek.com... | I disagree.
with what? It's hard to know when you top-post :-(
| specifically I belive we should take the time to write the syntactic sugar | which allows ALL of the current algorithms to accept range<>s (that's the | plural of range<>, not some variable "s").
I can't figure out what that has to do with the discussion :-) I too like range version of std::copy() etc. To recap, I'm saying
pair<iterator,iterator> foo( iterator, iterator );
in its range version should be
range< Range > foo( Range& );
Range foo(Range const&); // is perhaps this what you meant? or perhaps: Range foo(Range); I'm somewhat confused as I'd expected typedef pair<iterator, iterator> Range; to be the definition preceding your declaration of foo(...); I'm not sure what a range<Range> would be
and not
iterator_range< typename iterator_of<Range>::type > foo( Range& );
Pavol disagrees, that makes the situation a 1 against 1 until other people start having an opinion.
br
Thorsten
_______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Victor A. Wagner Jr. http://rudbek.com The five most dangerous words in the English language: "There oughta be a law"

On Mon, Jun 07, 2004 at 08:37:48AM -0700, Victor A. Wagner Jr. wrote:
At Monday 2004-06-07 07:26, you wrote:
"Victor A. Wagner Jr." <vawjr@rudbek.com> wrote in message news:6.1.0.6.2.20040607070211.0486b980@mail.rudbek.com... | I disagree.
with what? It's hard to know when you top-post :-(
| specifically I belive we should take the time to write the syntactic sugar | which allows ALL of the current algorithms to accept range<>s (that's the | plural of range<>, not some variable "s").
I can't figure out what that has to do with the discussion :-) I too like range version of std::copy() etc. To recap, I'm saying
pair<iterator,iterator> foo( iterator, iterator );
in its range version should be
range< Range > foo( Range& );
Range foo(Range const&); // is perhaps this what you meant? or perhaps: Range foo(Range);
I'm somewhat confused as I'd expected typedef pair<iterator, iterator> Range; to be the definition preceding your declaration of foo(...); I'm not sure what a range<Range> would be
and not
iterator_range< typename iterator_of<Range>::type > foo( Range& );
Pavol disagrees, that makes the situation a 1 against 1 until other people start having an opinion.
There is a need to eliminate a confusion. We had a private discussion with Thorsten, and the summary is here: A problem: - Find algorithms have somewhat complicated return value. They return iterator_range< typename result_iterator_of<Range>::type >. There are two issues regarding this: 1) How to efectively store the result. Now the user must type rather complicated variable type. (i.e. iterator_range<string::iterator>) 2) Simplify the algorithm signature. - Suggestions a) Thorsten is proposing to define another iterator_range-like class that will have different signature: range<Range>. It should generaly provide automatic iterator deduction based on Range parameter. Then to use this class as a return value for find algorithms b) I think, that it is important to solve the issue 1) but the issue 2) is only cosmetic and a change would have several implication on the consitency of the overal library desing. To solve the issue 1) I suggest to define the class that Thorsten proposed. I would call it sub_range<T>. Name rationale is that it delimits a sub-part of a range defined by type T. It will be a simple derivate of iterator_range. It will have some convinient constructors. Both classes should be convertible to each other. So it will be possible to write: sub_range<string> r=find_first(str, "hello"); or sub_range<const string> r=find_first(str, "hello"); In addition, it seems reasonable to allow impliciti conversion of iterator_range to tuple<iterator,iterator>, as Thorsten suggested. So you can write: tie(iterator,iterator) = find_first(...); In addition to sub_range, other variants like reverse_sub_range can be defined. Regards, Pavol

Hi, On Sun, Jun 06, 2004 at 11:35:44AM +1000, Thorsten Ottosen wrote:
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> wrote in message news:c9tphv$aib$1@sea.gmane.org... | "Pavol Droba" <droba@topmail.sk> wrote in message
[snip]
| 1. Your library is dedicated to string algorithms.
[snip]
| Iterator range does may be usable, BUT outside of string algo library.
(We have agreed to move iterator_range to Boost.Range.)
This is a discussion we need to have at some point. Either now or as part of the post-review of the string algorithms.
Consider for example one of Pavols algorithms in the string library: (see http://lenin.felcer.sk/~droba/boost-book/string_algo.reference.html#header.b... )
template<typename InputContainerT, typename SearchContainerT> inline iterator_range< typename result_iterator_of< InputContainerT >::type > find_first(InputContainerT &, const SearchContainerT &);
compared to what I tend to like better
template<typename InputContainerT, typename SearchContainerT> inline range< InputContainerT > find_first(InputContainerT &, const SearchContainerT &);
I think that is a drastic reduction in interface complexity. In the former, the algorithm is range based yet the return value have been specified with iterators.
I is clear that iterator_range must work on the iterator basis, because there is not one-to-one relation between containers and iterators. The range<> you're proposing is only a syntactic sugar. It migh be good, but I'm not sure, if it is worth it.
My current view is that Boost.Range should contain both itarator_range<> and range<>, but that range-based algorithms should use range<>
| There are couple of things that your solution provide in addition to what my | and boost::tokenizer solution provide. Particularly it's an ability to | specify several addition delimitation policies. I thought about this. But | it isn't on top of my priorities (mostly because it is comparatively rarely | needed).
one token separator policy that could be usefull is the everything_but_one_of( ",; " ); but maybe we're starting to overlap with regex_token_iterator here?
You can already do the think like this. Simply write boost::token_iterator( !is_any_of(",; ") ); There are also operators && || that can be used to combine classification predicates. Regards, Pavol

"Pavol Droba" <droba@topmail.sk> wrote in message news:20040606113559.GL19968@lenin.felcer.sk... [snip] | > template<typename InputContainerT, typename SearchContainerT> | > inline iterator_range< typename result_iterator_of< InputContainerT >::type > | > find_first(InputContainerT &, const SearchContainerT &); | > | > compared to what I tend to like better | > | > template<typename InputContainerT, typename SearchContainerT> | > inline range< InputContainerT > | > find_first(InputContainerT &, const SearchContainerT &); | > | > I think that is a drastic reduction in interface complexity. In the former, the algorithm is range based yet the | > return value have been specified with iterators. | | I is clear that iterator_range must work on the iterator basis, because there is not one-to-one relation between | containers and iterators. yes, that is why I suggest we keep both. | The range<> you're proposing is only a syntactic sugar. It migh be good, but I'm not sure, | if it is worth it. I think it would be an improvement in this case. Remember that the main reason for range-based algorithms were "syntactic sugar". By having range arguments we avoid dealing with iterators until we have to or want to; the same should hold for the return values of such algorithms for the whole concept to be consistent. If the return value is iterator-based, we take a step back in abstraction level. | > one token separator policy that could be usefull is the everything_but_one_of( ",; " ); | > but maybe we're starting to overlap with regex_token_iterator here? | > | | You can already do the think like this. Simply write boost::token_iterator( !is_any_of(",; ") ); | There are also operators && || that can be used to combine classification predicates. aah, is_any_of() is part of the string library, right? br Thorsten

On Sat, Jun 05, 2004 at 08:46:54PM -0400, Gennadiy Rozental wrote:
"Pavol Droba" <droba@topmail.sk> wrote in message news:20040605214012.GH19968@lenin.felcer.sk...
Hi,
On Sat, Jun 05, 2004 at 08:16:17AM -0400, Gennadiy Rozental wrote:
Hi,
For long time in my daytime projects I was using my class token_iterator designed based on old iterator_adaptor design (adopted for old compilers). Now when need arose for such functionality in Boost.Test. I looked in direction of boost::token_iterator. After some struggle I end up adopting my version to new design. Here some of comments and issues that made me opt so.
[snip]
This might seem a little bit out of topic, but some of the points you have addressed are already solved by find/split_iterator provided in the string algo library. Maybe you can have a look there. Unfortunately, the documentation is not 100% ready yet.
Looks like I was not able to express myself clearly in original post. But actually what you suggesting is especially what I was arguing *against*.
Maybe I misunderstood you. Maybe you misunderstood the concepts in the string_algo library.
1. Your library is dedicated to string algorithms. Why is it use range if iterators so frequently then? As an input and an output. I did found that having range of it iterators as a model of substring is very convenient. That what my basic_cstring is used for (actually, I may say that in my development this is single most widely used class). But it is full-fledged string as good as std::basic_string. And it does not based in iterators, but on character type. Iterator range does may be usable, BUT outside of string algo library.
It was clearly stated during the review the "string" is not equal to std::basic_string. The string algorithm library is not std::basic_string extension, rather it is a set of string-related alorithms. However with the help of collection/range traits it provides almost the same level of convinience as if it would be specialized just for a single string class. Narrowing the implementation would be a giant leap back. And for the iterator range: it provides a reference into the original string. It if most efficient way to doing so, because you don't pay anything if you want just read the content there. For other operation you can simply use boost::copy_iterator_range, to extract the match to your favorite container.
2. Finder concept maybe useful in implementing different string search algorithm. BUT as implementation detail or generic case solution. I do not see any reason for interface that assumes explicit specification of finders provided by the library. There should be a generic algorithm/iterator that expect User defined finder, but there should be explicit algorithms/iterators dedicated for each type of search. I do that for algorithms, why not for iterators?
They are there actualy for most of the algorithms. find/split_iterator is quite new stuff. Originaly it has been encapsulated by find_all and split algorithms (they are still there, see split.hpp). During the review it has been expressed, that pure iterator-base interface is better for split operations. Therefor I have refactored the implementation so it can be used directly.
find_iterator is also based on generic iterator concept, however, the iterator is the only template parameter, so you can easily specialize for string.
Actually when I work with strings I do not want to know about iterators at all.
You don't have to, except for the one typedef.
Simple usage looks like this:
In fact from your sources it seems that I have to write like this
boost::split_iterator<std::string::iterator> it( str.begin(), str.end(), boost::token_finder(boost::is_any_of(";,"));
IMO it's way to verbose, while still missing some of the functionality provided by the tokenizer library.
Actualy you have missed the important constructor. You can write: boost::split_iterator<std::string::iterator> it( str, boost::token_finder(boost::is_any_of(";,")); str will be expanded automaticaly. And you can also use char*, wchar_t of vector<char>.
My choice:
boost::token_iterator it( str, boost::dropped_delimiters = ";," );
It is very easy to implement forwarder to this kind of syntax. As I said, find_iterator is relatively new, so I have extended the interface to full extend yet. I will keep this in mind. Thanks for idea.
This looks shorter and clearer IMO.
Second concern about your solution, that makes it in a sense even worse than the one provided by boost::tokenizer, is that you actually does not specify type of the Finder in the iterator specification. The price is that eventually you have to, this way or another, pay with runtime overhead. This would be unacceptable to me.
This is a design point. First implemenation had templated specification of the finder. However, the usage of such a class was very incovinient due to complicated specification. Therefor it was changed to the current state. The runtime overhead implied by the current implementation is rather small. I actualy only adds one indirection in the increment operation. So it is neglible.
There are couple of things that your solution provide in addition to what my and boost::tokenizer solution provide. Particularly it's an ability to specify several addition delimitation policies. I thought about this. But it isn't on top of my priorities (mostly because it is comparatively rarely needed). In any case this generic solution should not interfere with most convenient one used in majority of the cases.
Rationale in find_iterator is quite simple. There is a well defined concept of Finder in the lib and there is also a bunch of finders already implemented there. find_iterator is a natural candidate to use this facility. I'll be greatful if you can show me where do you think, the string_algo lib is lacking in the usablility. Preferably, provide also codesnippes how it is and how it should look like. Both sides can benefit from improvements. Regards, Pavol

1. Your library is dedicated to string algorithms. Why is it use range if iterators so frequently then? As an input and an output. I did found that having range of it iterators as a model of substring is very convenient. That what my basic_cstring is used for (actually, I may say that in my development this is single most widely used class). But it is full-fledged string as good as std::basic_string. And it does not based in iterators, but on character type. Iterator range does may be usable, BUT outside of string algo library.
It was clearly stated during the review the "string" is not equal to std::basic_string. The string algorithm library is not std::basic_string extension, rather it is a set of string-related algorithms. However with the help of
collection/range > traits it provides almost the same level of convenience as if it would > be specialized just for a single string class. > > Narrowing the implementation would be a giant leap back.
I don't propose to narrow implementation only to std::string. My point is: String algorithm always take string as a whole. And if I do not want it, there should not be single ::iterator or .begin() assumed by interface to be used by user.
And for the iterator range: it provides a reference into the original string. It if most efficient way to doing so, because you don't pay anything if you want just read the content there. For other operation you can simply use boost::copy_iterator_range, to extract the match to your favorite container.
2. Finder concept maybe useful in implementing different string search algorithm. BUT as implementation detail or generic case solution. I do not see any reason for interface that assumes explicit specification of finders provided by the library. There should be a generic algorithm/iterator
I agree that we need something to efficiently model substring. IMO something like basic_cstring is better choice for use with string algorithms. It may be less generic than iterator_range, but should be generic enough to cover 99% cases. that
expect User defined finder, but there should be explicit algorithms/iterators dedicated for each type of search. I do that for algorithms, why not for iterators?
They are there actually for most of the algorithms. find/split_iterator is quite new stuff. Originally it has been encapsulated by find_all and split algorithms (they are still there, see split.hpp). During the review it has been expressed, that pure iterator-base interface is better for split operations. Therefore I have refactored the implementation so it can be used directly.
IMO to be convenient you should've split iterators on many as you did with algorithms. Also you can keep generic find_iterator for use with custom finder.
find_iterator is also based on generic iterator concept, however, the iterator is the only template parameter, so you can easily specialize for string.
Simple usage looks like this:
In fact from your sources it seems that I have to write like this
boost::split_iterator<std::string::iterator> it( str.begin(), str.end(), boost::token_finder(boost::is_any_of(";,"));
IMO it's way to verbose, while still missing some of the functionality provided by the tokenizer library.
Actually you have missed the important constructor. You can write:
boost::split_iterator<std::string::iterator> it( str, boost::token_finder(boost::is_any_of(";,"));
str will be expanded automatically. And you can also use char*, wchar_t of vector<char>.
Did I? Could you point on file:line? Even this is too much IMO. There should not be ::iterator. I prefer boost::basic_split_iterator<char> ( = split_iterator ).
My choice:
boost::token_iterator it( str, boost::dropped_delimiters = ";," );
It is very easy to implement forwarder to this kind of syntax. As I said, find_iterator is relatively new, so I have extended the interface to full extend yet. I will keep this in mind. Thanks for idea.
Second concern about your solution, that makes it in a sense even worse
It should be separate class for every predefined finder IMO. Also keep in mind that boost::tokenizer provide additional functionality that is not covered by above finder. than
the one provided by boost::tokenizer, is that you actually does not specify type of the Finder in the iterator specification. The price is that eventually you have to, this way or another, pay with runtime overhead. This would be unacceptable to me.
This is a design point. First implementation had templated specification of the finder. However, the usage of such a class was very inconvenient due to complicated specification. Therefore it was changed to the current state. The runtime overhead implied by the current implementation is rather small. I actually only adds one indirection in the increment operation. So it is neglible.
You had to make this decision namely because you are trying to pull several independent (though similar) algorithms under the same interface hood and do not want burden of explicit type Finder type specification. But what is so common among these iterators to do so? On the other hand, in many cases all that the increment does is a couple comparisons. There is chance it could be inlined. Now you need to pay for function invocation.
There are couple of things that your solution provide in addition to what my and boost::tokenizer solution provide. Particularly it's an ability to specify several addition delimitation policies. I thought about this. But it isn't on top of my priorities (mostly because it is comparatively rarely needed). In any case this generic solution should not interfere with most convenient one used in majority of the cases.
Rationale in find_iterator is quite simple. There is a well defined concept of Finder in the lib and there is also a bunch of finders already implemented there. find_iterator is a natural candidate to use this facility.
I'll be grateful if you can show me where do you think, the string_algo
I don't mind find_iterator by itself. But there should be separate token_iterator with different tradeoffs: It more specific (less flexible), but more convenient and efficient. lib is
lacking in the usability. Preferably, provide also codesnippes how it is and how it should look like. Both sides can benefit from improvements.
I thought I did.
Regards,
Pavol
Regards, Gennadiy.

Hi Sunday, June 6, 2004, 10:32:10 PM, you wrote: [snip]
I don't propose to narrow implementation only to std::string. My point is:
String algorithm always take string as a whole. And if I do not want it, there should not be single ::iterator or .begin() assumed by interface to be used by user.
I don't know if I understand you correctly. User will almost never be forced to use .begin(). Collection traits are incorporated in the whole lib, so .begin() and etc. is used automaticaly, depending on the type of the collection. ::iterator is used only once in the defintion of the find_iterator type. It it only a single typedef, that user has to use. There are more options then char or wchar_t, and it is up to the user to decide. For instance, you can iterate over const_iterator of mutable iterator. That's the reason why these typedef are not part of the library. If it would be requested, I can include most common ones.
And for the iterator range: it provides a reference into the original string. It if most efficient way to doing so, because you don't pay anything if you want just read the content there. For other operation you can simply use boost::copy_iterator_range, to extract the match to your favorite container.
I agree that we need something to efficiently model substring. IMO something like basic_cstring is better choice for use with string algorithms. It may be less generic than iterator_range, but should be generic enough to cover 99% cases.
I have checked your basic_cstring. IMHO it is not usable in the context of the string_algo lib. It is restricted to char* (I definitely need iterators). It bundles algorithms inside a class - this violates the decapulated design policy. I don't realy know what do you have agains iterator_range. It is used only as a reference to the searched input. It is up to the user what will do with this reference. Why whould we want to enforce the user to use something, she does not want to? Can you please be more specific, what features/use-cases are you lacking with the iterator_range approach when compared to basic_cstring?
2. Finder concept maybe useful in implementing different string search algorithm. BUT as implementation detail or generic case solution. I do not see any reason for interface that assumes explicit specification of finders provided by the library. There should be a generic algorithm/iterator that expect User defined finder, but there should be explicit algorithms/iterators dedicated for each type of search. I do that for algorithms, why not for iterators?
They are there actually for most of the algorithms. find/split_iterator is quite new stuff. Originally it has been encapsulated by find_all and split algorithms (they are still there, see split.hpp). During the review it has been expressed, that pure iterator-base interface is better for split operations. Therefore I have refactored the implementation so it can be used directly.
IMO to be convenient you should've split iterators on many as you did with algorithms. Also you can keep generic find_iterator for use with custom finder.
I will do this. Actualy it is not needed to specify a class for every type of iterator. Just a constructor functions.
find_iterator is also based on generic iterator concept, however, the iterator is the only template parameter, so you can easily specialize for string.
Simple usage looks like this:
In fact from your sources it seems that I have to write like this
boost::split_iterator<std::string::iterator> it( str.begin(), str.end(), boost::token_finder(boost::is_any_of(";,"));
IMO it's way to verbose, while still missing some of the functionality provided by the tokenizer library.
Actually you have missed the important constructor. You can write:
boost::split_iterator<std::string::iterator> it( str, boost::token_finder(boost::is_any_of(";,"));
str will be expanded automatically. And you can also use char*, wchar_t of vector<char>.
Did I? Could you point on file:line?
I'm sorry, I'm starting to have a little mess. I forgot, that I haven't commited the latest update. It is there now.
Even this is too much IMO. There should not be ::iterator. I prefer boost::basic_split_iterator<char> ( = split_iterator ).
And how do you distinguish between "const_iterator" and "iterator" ? And how can you distinguish between iteration over vector<char>, rope<char> or basic_string<char>. Or even (with a little adaptation in collection_traits) CString?
My choice:
boost::token_iterator it( str, boost::dropped_delimiters = ";," );
It is very easy to implement forwarder to this kind of syntax. As I said, find_iterator is relatively new, so I have extended the interface to full extend yet. I will keep this in mind. Thanks for idea.
It should be separate class for every predefined finder IMO. Also keep in mind that boost::tokenizer provide additional functionality that is not covered by above finder.
I didn't mean to mimic tokenizer. Although I will try to incorporate as much of its functionality as possible. As I have written above, there is not a need for a separate class for every type of finder.
Second concern about your solution, that makes it in a sense even worse than the one provided by boost::tokenizer, is that you actually does not specify type of the Finder in the iterator specification. The price is that eventually you have to, this way or another, pay with runtime overhead. This would be unacceptable to me.
This is a design point. First implementation had templated specification of the finder. However, the usage of such a class was very inconvenient due to complicated specification. Therefore it was changed to the current state. The runtime overhead implied by the current implementation is rather small. I actually only adds one indirection in the increment operation. So it is neglible.
You had to make this decision namely because you are trying to pull several independent (though similar) algorithms under the same interface hood and do not want burden of explicit type Finder type specification. But what is so common among these iterators to do so? On the other hand, in many cases all that the increment does is a couple comparisons. There is chance it could be inlined. Now you need to pay for function invocation.
You might have a look into the implementation. There is realy only one additional dereference per increment. Finding operation is usualy at least O(n), so one far jmp does not bring any degradation in performance. find_iterator needs to consult the finder only there. Local state is stored directly in the iterator so comparisons and dereference have no additional overhead.
There are couple of things that your solution provide in addition to what my and boost::tokenizer solution provide. Particularly it's an ability to specify several addition delimitation policies. I thought about this. But it isn't on top of my priorities (mostly because it is comparatively rarely needed). In any case this generic solution should not interfere with most convenient one used in majority of the cases.
Rationale in find_iterator is quite simple. There is a well defined concept of Finder in the lib and there is also a bunch of finders already implemented there. find_iterator is a natural candidate to use this facility.
I don't mind find_iterator by itself. But there should be separate token_iterator with different tradeoffs: It more specific (less flexible), but more convenient and efficient.
Maybe thats the direction where boost::tokenizer should head. find_iterator is an extension to Finder. If you have a bunch of finders, it provides a useful functionality. However, I'd like to offer as much of usablity and convenience as possible. Regards, Pavol
participants (5)
-
David Abrahams
-
Gennadiy Rozental
-
Pavol Droba
-
Thorsten Ottosen
-
Victor A. Wagner Jr.