[tokenizer and token_iterator] request for policy or redesign

Hi John,
I vaguely recall that the standard committee rejected the tokenizer proposal because of performance problems. One thing that would boost performance a lot would be to allow the user to specify a "token match type". So I have some stupid questions:
1. Why are the three template parameters not defined as
   template < class TokenizerFunc = char_delimiters_separator<char>,
              class Type = std::string,
              class Iterator = typename range_const_iterator<Type>::type >
   class tokenizer
   ??
2. What exactly is the requirement on the Type parameter?
3. Would it not be possible to allow
   typedef boost::tokenizer< fun, boost::sub_range<string> > tokenizer;
   ? I believe this would give a huge speedup.
br
Thorsten
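
To make the "token match type" idea concrete, here is a minimal sketch, assuming C++17 and standard-library types only. The function name split_tokens and its signature are hypothetical and are not the Boost.Tokenizer interface; the point is only that the caller picks the token type, so a non-owning view can replace a std::string copy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not Boost.Tokenizer): splits `input` on any character
// in `delims` and builds each token as Token(pointer, length).
// Token = std::string copies the characters; Token = std::string_view does not.
template <class Token = std::string>
std::vector<Token> split_tokens(std::string_view input, std::string_view delims)
{
    std::vector<Token> tokens;
    std::size_t pos = 0;
    while (pos < input.size()) {
        // Skip any run of delimiters in front of the next token.
        pos = input.find_first_not_of(delims, pos);
        if (pos == std::string_view::npos) break;
        // The token ends at the next delimiter or at the end of the input.
        std::size_t end = input.find_first_of(delims, pos);
        if (end == std::string_view::npos) end = input.size();
        tokens.push_back(Token(input.data() + pos, end - pos));
        pos = end;
    }
    return tokens;
}
```

Calling split_tokens<std::string_view>(text, ",;") yields views into text, so the views are only valid while text is alive; calling split_tokens(text, ",;") yields independent std::string copies.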

Hi,
Did you have a chance to look at token_iterator in the Boost.Test directory? This is the result of my redesign of boost::token_iterator. My version is supposed to be efficient (comparatively).
Let me know what you think.
Gennadiy.

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> wrote in message news:cjcb4i$mke$1@sea.gmane.org...
| Hi,
|
| Did you have a chance to look at token_iterator in the Boost.Test directory?
I have now :-)
| This is the result of my redesign of boost::token_iterator. My version is supposed
| to be efficient (comparatively)
|
| Let me know what you think.
I guess your basic_cstring<> has a large say in why it would be faster?

| This is the result of my redesign of boost::token_iterator. My version is supposed
| to be efficient (comparatively)
|
| Let me know what you think.
I guess your basic_cstring<> has a large say in why it would be faster?
You guess right.
Gennadiy.

Sorry, pressed "send" too quickly...
Why have you provided basic_string_token_iterator and range_token_iterator?
I don't really have time for a detailed look, sorry.
br
Thorsten

"Thorsten Ottosen" <nesotto@cs.auc.dk> wrote in message news:cjce7f$vqh$1@sea.gmane.org...
Sorry, pressed "send" too quickly...
Why have you provided
basic_string_token_iterator
and
range_token_iterator
?
I don't really have time for a detailed look, sorry.
br
Thorsten
This was the second goal I was trying to achieve: best possible usability. If you try to combine these two slightly different functionalities under the same hood, you end up with an interface that depends on the iterator type:
token_iterator<std::string::iterator> it( str );
I see that as an unacceptable burden (and that is why I do not like the StringAlgo lib solution, but I plan to get back to that conversation later - I have saved the message I need to answer). IMO the only acceptable template parameter should be the character type, which in turn should be typedefed like the STL does. IOW my interface of choice is:
token_iterator tit( str );
Token iterator construction based on two iterators is IMO useful only for istream tokenization. I do provide this interface, but as a stand-alone class.
Regards,
Gennadiy.
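
The "character type only, typedefed like the STL" interface can be sketched as follows, assuming C++17. This is not the Boost.Test implementation: the class name basic_token_iterator, the at_end() member and the explicit delimiter argument are all hypothetical, chosen only to mirror the std::basic_string / std::string naming pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical sketch: the only template parameter is the character type.
template <class CharT>
class basic_token_iterator {
public:
    typedef std::basic_string_view<CharT> view_type;

    // `text` and `delims` are viewed, not copied; both must outlive the iterator.
    basic_token_iterator(view_type text, view_type delims)
        : rest_(text), delims_(delims) { advance(); }

    bool at_end() const { return done_; }
    view_type operator*() const { return token_; }
    basic_token_iterator& operator++() { advance(); return *this; }

private:
    // Moves to the next token, or marks the iterator as finished.
    void advance() {
        std::size_t first = rest_.find_first_not_of(delims_);
        if (first == view_type::npos) {
            done_ = true;
            token_ = view_type();
            rest_ = view_type();
            return;
        }
        rest_.remove_prefix(first);
        std::size_t len = rest_.find_first_of(delims_);
        if (len == view_type::npos) len = rest_.size();
        token_ = rest_.substr(0, len);
        rest_.remove_prefix(len);
    }

    view_type rest_;    // input not yet consumed
    view_type delims_;  // set of delimiter characters
    view_type token_;   // current token
    bool done_ = false;
};

// User-facing names hide the template parameter, as std::string does.
typedef basic_token_iterator<char> token_iterator;
typedef basic_token_iterator<wchar_t> wtoken_iterator;
```

With these typedefs the call site reads token_iterator tit( str, " " ); and no iterator type appears in user code. The separator still has to come from somewhere, which is why this sketch takes it as a second constructor argument.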

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> wrote in message news:cjcit7$dpe$1@sea.gmane.org...
| Token iterator construction based on two iterators is IMO useful only for
| istream tokenization. I do provide this interface, but as a stand-alone class.
I think I disagree. Consider the match of some regular expression applied to each line in a file. If you then want to tokenize each match, you don't want to copy strings around but to work on a view.
br
Thorsten

| Token iterator construction based on two iterators is IMO useful only for
| istream tokenization. I do provide this interface, but as a stand-alone class.
I think I disagree. Consider the match of some regular expression applied to each line in a file. If you then want to tokenize each match, you don't want to copy strings around but to work on a view.
br
Thorsten
Yeah. You're right. I should've said "mostly useful". There are some (comparatively rare) cases when you want to use the iterator interface. But that does not change my point: it still does not justify polluting the main-line interface.
Gennadiy.

Hi, On Tue, Sep 28, 2004 at 04:51:51PM -0400, Gennadiy Rozental wrote:
This was the second goal I was trying to achieve: best possible usability. If you try to combine these two slightly different functionalities under the same hood, you end up with an interface that depends on the iterator type:
token_iterator<std::string::iterator> it( str );
I see that as an unacceptable burden (and that is why I do not like the StringAlgo lib solution, but I plan to get back to that conversation later - I have saved the message I need to answer). IMO the only acceptable template parameter should be the character type, which in turn should be typedefed like the STL does. IOW my interface of choice is:
token_iterator tit( str );
You will need to specify the separator, right? So you will need at least one more parameter.
Token iterator construction based on two iterators is IMO useful only for istream tokenization. I do provide this interface, but as a stand-alone class.
Just a few words about the StringAlgo solution. I remember our discussion, but AFAIR we didn't come to a conclusion. So here are a few proposed changes as well as the current status of things.
First of all, it is quite easy to get rid of the template parameters you don't like:
typedef boost::find_iterator<std::string::iterator> string_find_iterator;
typedef boost::find_iterator<std::wstring::iterator> wstring_find_iterator;
Two iterators have not been needed for a long time already; find_iterators accept a range as a parameter. Similarly, other parameters can be automated (for instance, I can implement it so that a set of separators is accepted and token_finder is instantiated automatically).
If I remember correctly, your last complaint was that dereferencing the iterator does not yield a string, so you cannot manipulate it easily.
find_iterator will dereference to an iterator_range. There are plenty of goodies in the Boost.Range library that help you live with it. For instance, there are comparison operators between arbitrary range types, and if you need more, you can always convert the range to something else using the copy_iterator_range function (maybe something shorter and easier to use can be provided as well).
Please consider these options. Some of them are not currently available, since the StringAlgo lib has not been merged with the Range lib yet (I was afraid to do it before the release). Any other ideas are more than welcome.
Regards,
Pavol
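
The "dereference to a range, compare in place, convert only when needed" approach can be illustrated with a small self-contained sketch. It deliberately avoids Boost: simple_range and copy_to are hypothetical stand-ins for an iterator range and a range-to-container conversion function, and std::equal with four iterators assumes C++14 or later.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for an iterator range: a non-owning [first, last) pair.
template <class Iterator>
struct simple_range {
    Iterator first, last;
    Iterator begin() const { return first; }
    Iterator end() const { return last; }
};

// Element-wise comparison against any other range; nothing is copied.
template <class Iterator, class Range>
bool operator==(const simple_range<Iterator>& lhs, const Range& rhs)
{
    return std::equal(lhs.begin(), lhs.end(), rhs.begin(), rhs.end());
}

// Free-standing conversion: builds a Target from the range's iterators.
template <class Target, class Range>
Target copy_to(const Range& r)
{
    return Target(r.begin(), r.end());
}
```

A token found inside a larger string can then be tested with token == std::string("beta") without allocating, and turned into an owning string with copy_to<std::string>(token) only where a real string is required.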

"Pavol Droba" <droba@topmail.sk> wrote in message news:20040928222854.GU29008@lenin.felcer.sk...
| If I remember correctly, your last complaint was that dereferencing the iterator
| does not yield a string, so you cannot manipulate it easily.
|
| find_iterator will dereference to an iterator_range. There are plenty of goodies
| in the Boost.Range library that help you live with it. For instance, there
| are comparison operators between arbitrary range types, and if you need
| more, you can always convert the range to something else using the
| copy_iterator_range function (maybe something shorter and easier to use can
| be provided as well).
copy_range<string>( *it ) should do the work.
| Please consider these options. Some of them are not currently available, since
| the StringAlgo lib has not been merged with the Range lib yet (I was afraid
| to do it before the release).
This is a bit unfortunate, since we now have two iterator_range classes hanging around. Is yours in namespace boost::algorithm::string? If so, a small note about how to avoid clashes would be good in the string algo docs.
br
Thorsten

On Wed, Sep 29, 2004 at 09:33:23AM +0200, Thorsten Ottosen wrote:
"Pavol Droba" <droba@topmail.sk> wrote in message news:20040928222854.GU29008@lenin.felcer.sk...
| If I remember correctly, your last complaint was that dereferencing the iterator
| does not yield a string, so you cannot manipulate it easily.
|
| find_iterator will dereference to an iterator_range. There are plenty of goodies
| in the Boost.Range library that help you live with it. For instance, there
| are comparison operators between arbitrary range types, and if you need
| more, you can always convert the range to something else using the
| copy_iterator_range function (maybe something shorter and easier to use can
| be provided as well).
copy_range<string>( *it ) should do the work.
Actually, I thought that we might add a member function to the iterator range, so that the syntax would be like:
it->copy<string>();
| Please consider these options. Some of them are not currently available, since
| the StringAlgo lib has not been merged with the Range lib yet (I was afraid
| to do it before the release).
This is a bit unfortunate, since we now have two iterator_range classes hanging around. Is yours in namespace boost::algorithm::string? If so, a small note about how to avoid clashes would be good in the string algo docs.
Good point. If I had known that it would take so long to make a release, I would have converted to Boost.Range already. Actually, is there some estimate of when the release will happen?
Regards,
Pavol

"Pavol Droba" <droba@topmail.sk> wrote in message news:20040929104310.GY29008@lenin.felcer.sk...
| On Wed, Sep 29, 2004 at 09:33:23AM +0200, Thorsten Ottosen wrote:
| >
| > copy_range<string>( *it ) should do the work.
| >
|
| Actually, I thought that we might add a member function to the iterator range,
| so that the syntax would be like:
|
| it->copy<string>();
Inside a dependent type context this would be
it-> template copy<string>();
I think. Besides, I don't like the idea of adding more stuff to iterator_range. The Pavol Droba I knew used to be a proponent of free-standing functions :-)
br
Thorsten

On Wed, Sep 29, 2004 at 01:58:06PM +0200, Thorsten Ottosen wrote:
"Pavol Droba" <droba@topmail.sk> wrote in message news:20040929104310.GY29008@lenin.felcer.sk...
| On Wed, Sep 29, 2004 at 09:33:23AM +0200, Thorsten Ottosen wrote:
| >
| > copy_range<string>( *it ) should do the work.
| >
|
| Actually, I thought that we might add a member function to the iterator range,
| so that the syntax would be like:
|
| it->copy<string>();
Inside a dependent type context this would be
it-> template copy<string>();
I think. Besides, I don't like the idea of adding more stuff to iterator_range. The Pavol Droba I knew used to be a proponent of free-standing functions :-)
Oh yeah. That was probably just one of those ideas that popped up in my mind and never should have got out of it. I was a little bit misled by Gennadiy's effort to simplify the handling. This is obviously not the way.
Regards,
Pavol.
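
The point about dependent contexts can be shown with a short sketch. The types here are hypothetical (simple_range with a member template copy, plus a free function copy_range_sketch for comparison) and are not the Boost.Range classes; what the sketch demonstrates is only the C++ rule that a member template called through a dependent type needs the template keyword, while the free-function form does not.

```cpp
#include <cassert>
#include <string>

// Hypothetical range type with a member template, as floated in the thread.
template <class Iterator>
struct simple_range {
    Iterator first, last;

    template <class Target>
    Target copy() const { return Target(first, last); }
};

// Inside a template, the type of `it` is dependent, so the call must be
// spelled with `template`; without it, `<` is parsed as less-than.
template <class RangePtr>
std::string token_via_member(RangePtr it)
{
    return it->template copy<std::string>();
}

// Free-standing alternative: no `template` keyword is needed at the call site.
template <class Target, class Iterator>
Target copy_range_sketch(const simple_range<Iterator>& r)
{
    return Target(r.first, r.last);
}

template <class Iterator>
std::string token_via_free_function(const simple_range<Iterator>& r)
{
    return copy_range_sketch<std::string>(r);
}
```

Both helper functions return the same string; the difference is purely in what generic code has to write at the call site.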

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
Hi,
Did you have a chance to look at token_iterator in the Boost.Test directory? This is the result of my redesign of boost::token_iterator. My version is supposed to be efficient (comparatively).
Let me know what you think.
Gennadiy, your mailer is squeezing spaces out of subject lines again, it looks like. -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
Hi,
Did you have a chance to look at token_iterator in the Boost.Test directory? This is the result of my redesign of boost::token_iterator. My version is supposed to be efficient (comparatively).
Let me know what you think.
Gennadiy,
your mailer is squeezing spaces out of subject lines again, it looks like.
I am using Outlook Express News Reader now. Do you know any options that manage that? Gennadiy.

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
Hi,
Did you have a chance to look at token_iterator in the Boost.Test directory? This is the result of my redesign of boost::token_iterator. My version is supposed to be efficient (comparatively).
Let me know what you think.
Gennadiy,
your mailer is squeezing spaces out of subject lines again, it looks like.
I am using Outlook Express News Reader now. Do you know any options that manage that?
No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing). -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
your mailer is squeezing spaces out of subject lines again, it looks like.
I am using Outlook Express News Reader now. Do you know any options that manage that?
No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).
I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html. It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space. I tried cutting/pasting in a tab into OE, but it is received as text. Jeff F

"Jeff Flinn" <TriumphSprint2000@hotmail.com> wrote in message news:cjf191$cqe$1@sea.gmane.org...
| I've tried to reproduce the problem on gmane.test, but I do not get the
| "squeezed" subject text using OutlookExpress. I tried both plain and html.
|
| It's not just Gennadiy, but also Thorsten and some others on the list as
| well. The subject of the root of this thread had a carriage return and/or
| linefeed and tab as white space. I tried cutting/pasting in a tab into OE,
| but it is received as text.
yeah, I don't get it...some mails are fine...let's see how this goes
-Thorsten

"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
your mailer is squeezing spaces out of subject lines again, it looks like.
I am using Outlook Express News Reader now. Do you know any options that manage that?
No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).
I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html.
It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space.
Could you explain what you mean in more detail?
I tried cutting/pasting in a tab into OE, but it is received as text.
Can you explain that one too? Thanks, -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt44x3id.fsf@boost-consulting.com...
"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
your mailer is squeezing spaces out of subject lines again, it looks like.
I am using Outlook Express News Reader now. Do you know any options that manage that?
No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).
I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html.
It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space.
Could you explain what you mean in more detail?
Thorsten's original subject for this thread:
"[tokenizer and token_iterator] request for policy or redesign"
Is actually broken into two lines, the second prepended with a '^H' tab character, as in:
line1>[tokenizer and token_iterator] request for policy or
line2> redesign
When viewing the original message source via the OutlookExpress(OE) File >> Properties >> Details >> Message Source.
Any replies created using OE for a message with a subject containing '^H' follow this pattern:
line1>[tokenizer and token_iterator] request for policy or
line2> redesign
reply:
line1>[tokenizer and token_iterator] request for policy
line2> orredesign
reply:
line1>[tokenizer and token_iterator] request for
line2> policyorredesign
...
I tried cutting/pasting in a tab into OE, but it is received as text.
Can you explain that one too?
I meant to say that the tab was replaced by equivalent spaces rather than the explicit '^H' tab character. I created text in an editor " test^Htest ", and pasted it into the subject line in OutlookExpress(OE), to see if this might be the cause of the problem.
Thanks,
Jeff F

"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt44x3id.fsf@boost-consulting.com...
"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:
your mailer is squeezing spaces out of subject lines again, it looks like.
I am using Outlook Express News Reader now. Do you know any options that manage that?
No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).
I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html.
It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space.
Could you explain what you mean in more detail?
Thorsten's original subject for this thread:
"[tokenizer and token_iterator] request for policy or redesign"
Is actually broken into two lines, the second prepended with a '^H' tab character, as in:
That's odd. ^H is usually a backspace. Tab is ^I.
line1>[tokenizer and token_iterator] request for policy or line2> redesign
I see a newline followed by a tab. Is that legal in an RFC822 message header?
When viewing the original message source via the OutlookExpress(OE) File >> Properties >> Details >> Message Source.
Any replies created using OE for a message with a subject containing '^H' follow this pattern:
line1>[tokenizer and token_iterator] request for policy or line2> redesign
reply:
line1>[tokenizer and token_iterator] request for policy line2> orredesign
reply:
line1>[tokenizer and token_iterator] request for line2> policyorredesign
...
I tried cutting/pasting in a tab into OE, but it is received as text.
Can you explain that one too?
I meant to say that the tab was replaced by equivalent spaces rather than the explicit '^H' tab character. I created text in an editor " test^Htest ", and pasted it into the subject line in OutlookExpress(OE), to see if this might be the cause of the problem.
Thanks for the explanation. Maybe Thorsten's mailer is at fault for inserting the newline? -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Is actually broken into two lines, the second prepended with a '^H' tab character, as in:
That's odd. ^H is usually a backspace. Tab is ^I.
It's only odd if it's not "When I say ^H I mean ^I day". Hmm, this must only be a local custom here. Sorry for the confusion. Thanks, Jeff

From: David Abrahams <dave@boost-consulting.com>
"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:
Thorsten's original subject for this thread:
"[tokenizer and token_iterator] request for policy or redesign"
Is actually broken into two lines, the second prepended with a '^H' tab character, as in:
That's odd. ^H is usually a backspace. Tab is ^I.
line1>[tokenizer and token_iterator] request for policy or line2> redesign
I see a newline followed by a tab. Is that legal in an RFC822 message header?
I see the same things. RMAIL in emacs handles it just fine. RFC822, 3.1.1. LONG HEADER FIELDS, specifically permits folding long lines like that. OE is broken. -- Rob Stewart stewart@sig.com Software Engineer http://www.sig.com Susquehanna International Group, LLP using std::disclaimer;
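
For reference, the unfolding rule in RFC 822 section 3.1.1 treats a CRLF that is immediately followed by a space or tab as equivalent to that whitespace character alone. The sketch below is an illustration of that rule only: the function name unfold_header is hypothetical, and it assumes the raw header uses CRLF line endings (bare LF is not handled).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Unfolds one raw header field: every CRLF that is immediately followed by a
// space or horizontal tab is dropped, and the whitespace character is kept.
std::string unfold_header(const std::string& raw)
{
    std::string out;
    for (std::size_t i = 0; i < raw.size(); ++i) {
        const bool fold = raw[i] == '\r'
            && i + 2 < raw.size()
            && raw[i + 1] == '\n'
            && (raw[i + 2] == ' ' || raw[i + 2] == '\t');
        if (fold) {
            ++i;       // skip the LF as well; the loop increment moves to the whitespace
            continue;  // the whitespace itself is appended on the next iteration
        }
        out += raw[i];
    }
    return out;
}
```

Applied to the folded subject from this thread, the tab survives, so "policy or" and "redesign" stay separated. A client that removes the whitespace together with the line break produces the "orredesign" pattern described above.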

Hi, On Tue, Sep 28, 2004 at 08:03:41PM +0200, Thorsten Ottosen wrote:
3. would it not be possible to allow
typedef boost::tokenizer< fun, boost::sub_range<string> > tokenizer;
A tokenizer like this is already in the StringAlgo library. It's called find_iterator. It does almost precisely what you requested. When you dereference it, you get an iterator_range. Currently it still works with the internal version, but after the release I will convert it to use Boost.Range.
Regards,
Pavol.
participants (6)
- David Abrahams
- Gennadiy Rozental
- Jeff Flinn
- Pavol Droba
- Rob Stewart
- Thorsten Ottosen