[tokenizer and token_iterator] request for policy or redesign


Thorsten Ottosen

28 Sep 2004

6:03 p.m.

Hi John,

I vaguely recall that the standards committee rejected the tokenizer proposal because of performance problems. A thing that would boost performance a lot would be to allow the user to specify a "token match type". So I have some stupid questions:

1. Why are the three template parameters not defined as

    template < class TokenizerFunc = char_delimiters_separator<char>,
               class Type = std::string,
               class Iterator = typename range_const_iterator<Type>::type

...

class tokenizer ??

2. What exactly is the requirement on the Type parameter?

3. Would it not be possible to allow

    typedef boost::tokenizer< fun, boost::sub_range<string> > tokenizer;

? I believe this would give a huge speedup.

br

Thorsten
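To illustrate why a view-like token type could help, here is a minimal sketch that is not taken from Boost: a splitter whose tokens are std::string_view objects pointing into the original buffer, so no per-token std::string is allocated. The function name split_views and the use of C++17 std::string_view (which postdates this thread) are assumptions made for the example; boost::sub_range would play the analogous role in the typedef from question 3.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: split `text` on any character in `delims`,
// returning views into `text` rather than freshly allocated strings.
// Empty tokens are skipped.
std::vector<std::string_view> split_views(std::string_view text,
                                          std::string_view delims)
{
    std::vector<std::string_view> tokens;
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Find the start of the next token.
        const std::size_t begin = text.find_first_not_of(delims, pos);
        if (begin == std::string_view::npos)
            break;
        // Find its end (or the end of the input).
        std::size_t end = text.find_first_of(delims, begin);
        if (end == std::string_view::npos)
            end = text.size();
        tokens.push_back(text.substr(begin, end - begin));
        pos = end;
    }
    return tokens;
}
```

The views are only valid while the underlying string is alive and unmodified, which is the usual trade-off for avoiding the copies.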


Gennadiy Rozental

28 Sep

6:39 p.m.

New subject: [tokenizer and token_iterator] request for policy orredesign

Hi,

Did you have a chance to look at token_iterator in the Boost.Test directory? This is the result of my redesign of boost::token_iterator. My version is supposed to be (comparatively) efficient.

Let me know what you think.

Gennadiy.

Thorsten Ottosen

7:27 p.m.

New subject: [tokenizer and token_iterator] request for policyorredesign

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> wrote in message news:cjcb4i$mke$1@sea.gmane.org... | Hi, | | Did you have a chance to look on token_iterator in Boost.Test directory. I have now :-) | This is result of my redesign on boost::token_iterator. My version supposed | to be efficient (comparatevely) | | Let me know what you think. I guess you basic_cstring<> has a large say in why it would be faster?

Gennadiy Rozental

8:44 p.m.

New subject: [tokenizer and token_iterator] request forpolicyorredesign

...

| This is result of my redesign on boost::token_iterator. My version supposed
| to be efficient (comparatively)
|
| Let me know what you think.

I guess your basic_cstring<> has a large say in why it would be faster?

You guess right.

Gennadiy.

Thorsten Ottosen

7:30 p.m.

New subject: [tokenizer and token_iterator] request for policyorredesign

Sorry, pressed "send" too quickly...

Why have you provided basic_string_token_iterator and range_token_iterator?

I don't really have time for a detailed look, sorry.

br

Thorsten

Gennadiy Rozental

8:51 p.m.

New subject: [tokenizer and token_iterator] request forpolicyorredesign

"Thorsten Ottosen" <nesotto@cs.auc.dk> wrote in message news:cjce7f$vqh$1@sea.gmane.org...

...

Sorry pressed "send too quickly"...

Why have you provided

basic_string_token_iterator

and

range_token_iterator

?

I don't really have time for a detailed look, sorry.

br

Thorsten

This was the second goal I was trying to achieve: best possible usability. If you try to combine these two slightly different functionalities under the same hood, you end up with an interface that depends on the iterator type:

    token_iterator<std::string::iterator> it( str );

I see that as an unacceptable burden (and that is why I do not like the StringAlgo lib solution, but I plan to get back to that conversation later; I have saved the message I need to answer). IMO the only acceptable template parameter is the character type, which in turn should be typedefed the way the STL does it. IOW my interface of choice is:

    token_iterator tit( str );

Token iterator construction based on two iterators is IMO useful only for istream tokenization. I do provide this interface, but as a stand-alone class.

Regards,

Gennadiy.
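A rough sketch of what such an interface could look like follows. It is not the Boost.Test implementation: the class name basic_token_iterator, its members, and the choice to pass the delimiters as a constructor argument are all assumptions for illustration, and std::basic_string_view (C++17) stands in for basic_cstring<>. The only template parameter is the character type, hidden behind typedefs in the style of std::string and std::wstring.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical class, not the Boost.Test code: the only template
// parameter is the character type, as with std::basic_string.
template <class CharT>
class basic_token_iterator {
public:
    using view_type = std::basic_string_view<CharT>;

    // A default-constructed iterator acts as the end marker.
    basic_token_iterator() = default;

    basic_token_iterator(view_type text, view_type delims)
        : rest_(text), delims_(delims), at_end_(false)
    {
        advance();
    }

    // Dereferencing yields a view of the current token (no copy).
    view_type operator*() const { return token_; }

    basic_token_iterator& operator++()
    {
        advance();
        return *this;
    }

    // Only comparison against the end marker is meaningful in this sketch.
    friend bool operator==(const basic_token_iterator& a,
                           const basic_token_iterator& b)
    {
        return a.at_end_ == b.at_end_ &&
               (a.at_end_ || a.token_.data() == b.token_.data());
    }
    friend bool operator!=(const basic_token_iterator& a,
                           const basic_token_iterator& b)
    {
        return !(a == b);
    }

private:
    // Move to the next token in rest_, or become the end marker.
    void advance()
    {
        const std::size_t begin = rest_.find_first_not_of(delims_);
        if (begin == view_type::npos) {
            at_end_ = true;
            token_ = view_type();
            return;
        }
        std::size_t end = rest_.find_first_of(delims_, begin);
        if (end == view_type::npos)
            end = rest_.size();
        token_ = rest_.substr(begin, end - begin);
        rest_.remove_prefix(end);
    }

    view_type rest_;
    view_type delims_;
    view_type token_;
    bool at_end_ = true;
};

// Typedefs in the style of std::string / std::wstring.
typedef basic_token_iterator<char>    token_iterator;
typedef basic_token_iterator<wchar_t> wtoken_iterator;
```

With these typedefs a caller writes `token_iterator tit( str, " " );` and never names an iterator type. The string passed in must outlive the iterator, since only views are stored.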

Thorsten Ottosen

9:20 p.m.

New subject: [tokenizer and token_iterator] requestforpolicyorredesign

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> wrote in message news:cjcit7$dpe$1@sea.gmane.org... | Token iterator construction based on two iterators IMO is useful only for | istream tokenization. I do provide this interface but as stand alone class. I think I disagree. Consider the match of some regular expression applied to a each line in a file. If you then want to tokenize each macth, then you don't want copy strings around but work on a view. br Thorsten

Gennadiy Rozental

9:35 p.m.

New subject: [tokenizer and token_iterator]requestforpolicyorredesign

...

| Token iterator construction based on two iterators IMO is useful only for | istream tokenization. I do provide this interface but as stand alone class.

I think I disagree. Consider the match of some regular expression applied to each line in a file. If you then want to tokenize each match, you don't want to copy strings around but to work on a view.

br

Thorsten

Yeah, you're right. I should've said "mostly useful". There are some (comparatively rare) cases where you want to use the iterator interface. But that does not change my point: it still does not justify polluting the main-line interface.

Gennadiy.

Pavol Droba

10:28 p.m.

New subject: [tokenizer and token_iterator] request forpolicyorredesign

Hi, On Tue, Sep 28, 2004 at 04:51:51PM -0400, Gennadiy Rozental wrote:

...

This was the second goal, I was trying to achieve: best possible usability. If you try to combine these two slightly different functionalities under the same hood, you end up with interface that depends on iterator type:

token_iterator<std::string::iterator> it( str );

I see it as unacceptable burden (and that is why I do not like StringAlgo lib solution, but I plan to get back to that conversation later - I have saved message I need to answer). IMO the only acceptable template parameter should be character type, which in turn should be typedefed like STL do. IOW my interface of choice is:

token_iterator tit( str );

You will need to specify the separator, right? So you will need at least one more parameter.

...

Token iterator construction based on two iterators IMO is useful only for istream tokenization. I do provide this interface but as stand alone class.

Just a few words about the StringAlgo solution. I remember our discussion, but AFAIR we didn't come to a conclusion. So here are a few proposed changes as well as the current status of things.

First of all, it is quite easy to get rid of the template parameters you don't like:

    typedef boost::find_iterator<std::string::iterator> string_find_iterator;
    typedef boost::find_iterator<std::wstring::iterator> wstring_find_iterator;

Two iterators have not been needed for a long time already; find_iterators accept a range as a parameter. Similarly, other parameters can be automated (for instance, I can arrange for a set of separators to be accepted, with token_finder instantiated automatically).

If I remember correctly, your last complaint was that dereferencing the iterator does not yield a string, so you cannot manipulate it easily.

find_iterator will dereference to an iterator_range. There are plenty of goodies in the Boost.Range library that help you live with it. For instance, there are comparison operators between arbitrary range types, and if you need more, you can always convert the range to something else using the copy_iterator_range function (maybe something shorter and easier to use can be provided as well).

Please consider these options. Some of them are not currently available, since the StringAlgo lib has not been merged with the Range lib yet (I was afraid to do it before the release). Any other ideas are more than welcome.

Regards,

Pavol
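The idea of dereferencing to a range and converting only on demand can be sketched with the standard library alone. Everything in namespace sketch below is a simplified stand-in written for this example; it mirrors the roles of an iterator range, a comparison operator, and a copying free function, but it is not the Boost.Range or StringAlgo code and the names are assumptions.

```cpp
#include <algorithm>
#include <string>

namespace sketch {

// Stand-in for an iterator range: a pair of iterators, no ownership.
template <class Iterator>
struct iterator_range {
    Iterator first;
    Iterator last;
    Iterator begin() const { return first; }
    Iterator end() const { return last; }
};

// Compare a range with any other range-like object, element by element.
template <class Iterator, class Range>
bool operator==(const iterator_range<Iterator>& lhs, const Range& rhs)
{
    return std::equal(lhs.begin(), lhs.end(), rhs.begin(), rhs.end());
}

// Free function that copies a range into a container of the caller's choice.
template <class Container, class Range>
Container copy_range(const Range& r)
{
    return Container(r.begin(), r.end());
}

// Typedef that hides the iterator template parameter from users.
typedef iterator_range<std::string::const_iterator> string_range;

} // namespace sketch
```

A token found in a string can then be compared in place, and a std::string is built only when the caller asks for one via `sketch::copy_range<std::string>(token)`.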

Thorsten Ottosen

29 Sep

7:33 a.m.

New subject: [tokenizer and token_iterator] requestforpolicyorredesign

"Pavol Droba" <droba@topmail.sk> wrote in message news:20040928222854.GU29008@lenin.felcer.sk... | If I remember correctly your last complain was, that dereferencing the iterator | does not yield a string, so you cannot manipulate it easily. | | find_iterator will dereference to an iterator_range. There is plenty of goodies | in the Boost.Range library that help you to live with it. For instance, there | are comparison operators between arbitrary range types, and if you need | more, you can always convert the range to something else using the | copy_iterator_range function (maybe something shorter and easier to use can | be provided as well). copy_range<string>( *it ) should do the work. | Please consider these options. Some of them are not currently available since, | the StringAlgo lib has not been merged yet with the Range lib (I was afraid | to do it before release). this is a bit unfortunate since we now have two iterator_range classes hanging around. Is yours in namespace boost::algorithm::string ? If so, a small note about how to avoid clashes would be good in the string algo docs. br Thorsten

Pavol Droba

10:43 a.m.

New subject: [tokenizer and token_iterator] requestforpolicyorredesign

On Wed, Sep 29, 2004 at 09:33:23AM +0200, Thorsten Ottosen wrote:

...

"Pavol Droba" <droba@topmail.sk> wrote in message news:20040928222854.GU29008@lenin.felcer.sk...

| If I remember correctly your last complain was, that dereferencing the iterator | does not yield a string, so you cannot manipulate it easily. | | find_iterator will dereference to an iterator_range. There is plenty of goodies | in the Boost.Range library that help you to live with it. For instance, there | are comparison operators between arbitrary range types, and if you need | more, you can always convert the range to something else using the | copy_iterator_range function (maybe something shorter and easier to use can | be provided as well).

copy_range<string>( *it ) should do the work.

Actually, I thought that we might add a member function to the iterator range, so that the syntax would be like:

    it->copy<string>();

...

| Please consider these options. Some of them are not currently available since, | the StringAlgo lib has not been merged yet with the Range lib (I was afraid | to do it before release).

this is a bit unfortunate since we now have two iterator_range classes hanging around. Is yours in namespace boost::algorithm::string ? If so, a small note about how to avoid clashes would be good in the string algo docs.

Good point. If I had known that it would take so long to make a release, I would have converted to Boost.Range already. Actually, is there an estimate of when the release will happen?

Regards,

Pavol

Thorsten Ottosen

11:58 a.m.

New subject: [tokenizer and token_iterator]requestforpolicyorredesign

"Pavol Droba" <droba@topmail.sk> wrote in message news:20040929104310.GY29008@lenin.felcer.sk... | On Wed, Sep 29, 2004 at 09:33:23AM +0200, Thorsten Ottosen wrote: | > | > copy_range<string>( *it ) should do the work. | > | | Actualy I thought, that we may add a member function to the iterator range. | So that the syntax would be like: | | it->copy<string>(); Inside a dependent type context this would be it-> template copy<string>(); I think. besides, I don't like the idea of adding more stuff to iterator_range. The Pavol Droba I knew used to be a proponent of free-standing functions :-) br Thorsten

Pavol Droba

2 p.m.

New subject: [tokenizer and token_iterator]requestforpolicyorredesign

On Wed, Sep 29, 2004 at 01:58:06PM +0200, Thorsten Ottosen wrote:

...

"Pavol Droba" <droba@topmail.sk> wrote in message news:20040929104310.GY29008@lenin.felcer.sk... | On Wed, Sep 29, 2004 at 09:33:23AM +0200, Thorsten Ottosen wrote: | >

| > copy_range<string>( *it ) should do the work. | > | | Actualy I thought, that we may add a member function to the iterator range. | So that the syntax would be like: | | it->copy<string>();

Inside a dependent type context this would be

it-> template copy<string>();

I think. besides, I don't like the idea of adding more stuff to iterator_range. The Pavol Droba I knew used to be a proponent of free-standing functions :-)

Oh yeah. That was probably just one of those ideas that popped into my mind and should never have gotten out of it. I was a little bit misled by Gennadiy's effort to simplify the handling. This is obviously not the way.

Regards,

Pavol.

David Abrahams

11:28 a.m.

New subject: [tokenizer and token_iterator] request for policy orredesign

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...

Hi,

Did you have a chance to look on token_iterator in Boost.Test directory. This is result of my redesign on boost::token_iterator. My version supposed to be efficient (comparatevely)

Let me know what you think.

Gennadiy,

your mailer is squeezing spaces out of subject lines again, it looks like.

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Gennadiy Rozental

2:55 p.m.

New subject: [tokenizer and token_iterator] request for policyorredesign

"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...

...

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...
Hi,

Did you have a chance to look on token_iterator in Boost.Test directory. This is result of my redesign on boost::token_iterator. My version supposed to be efficient (comparatevely)

Let me know what you think.

Gennadiy,

your mailer is squeezing spaces out of subject lines again, it looks like.

I am using Outlook Express News Reader now. Do you know any options that manage that? Gennadiy.

David Abrahams

5:13 p.m.

New subject: [tokenizer and token_iterator] request for policyorredesign

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...

"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...
Hi,

Did you have a chance to look on token_iterator in Boost.Test directory. This is result of my redesign on boost::token_iterator. My version supposed to be efficient (comparatevely)

Let me know what you think.

Gennadiy,

your mailer is squeezing spaces out of subject lines again, it looks like.

I am using Outlook Express News Reader now. Do you know any options that manage that?

No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Jeff Flinn

7:09 p.m.

New subject: [tokenizer and token_iterator] request forpolicyorredesign

"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...

...

"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

your mailer is squeezing spaces out of subject lines again, it looks like.

I am using Outlook Express News Reader now. Do you know any options that manage that?

No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).

I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html. It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space. I tried cutting/pasting in a tab into OE, but it is received as text. Jeff F

Thorsten Ottosen

7:19 p.m.

New subject: [tokenizer and token_iterator] request for policy or redesign [foo]"\br\cr"

"Jeff Flinn" <TriumphSprint2000@hotmail.com> wrote in message news:cjf191$cqe$1@sea.gmane.org... | I've tried to reproduce the problem on gmane.test, but I do not get the | "squeezed" subject text using OutlookExpress. I tried both plain and html. | | It's not just Gennadiy, but also Thorsten and some others on the list as | well. The subject of the root of this thread had a carriage return and/or | linefeed and tab as white space. I tried cutting/pasting in a tab into OE, | but it is received as text. yeah, I don't get it...some mails are fine...let's see how this goes -Thorsten

David Abrahams

7:46 p.m.

New subject: [tokenizer and token_iterator] request forpolicyorredesign

"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:

...

"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

your mailer is squeezing spaces out of subject lines again, it looks like.

I am using Outlook Express News Reader now. Do you know any options that manage that?

No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).

I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html.

It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space.

Could you explain what you mean in more detail?

...

I tried cutting/pasting in a tab into OE, but it is received as text.

Can you explain that one too? Thanks, -- Dave Abrahams Boost Consulting http://www.boost-consulting.com

Jeff Flinn

30 Sep

12:50 p.m.

New subject: [tokenizer and token_iterator] requestforpolicyorredesign

"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt44x3id.fsf@boost-consulting.com...

...

"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:

...
"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

your mailer is squeezing spaces out of subject lines again, it looks like.

I am using Outlook Express News Reader now. Do you know any options that manage that?

No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).

I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html.

It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space.

Could you explain what you mean in more detail?

Thorsten's original subject for this thread:

"[tokenizer and token_iterator] request for policy or redesign"

is actually broken into two lines, the second prepended with a '^H' tab character, as in:

line1>[tokenizer and token_iterator] request for policy or
line2> redesign

when viewing the original message source via the OutlookExpress (OE) File >> Properties >> Details >> Message Source.

Any replies created using OE for a message with a subject containing '^H' follow this pattern:

line1>[tokenizer and token_iterator] request for policy or
line2> redesign

reply:

line1>[tokenizer and token_iterator] request for policy
line2> orredesign

reply:

line1>[tokenizer and token_iterator] request for
line2> policyorredesign

...

...

...
I tried cutting/pasting in a tab into OE, but it is received as text.

Can you explain that one too?

I meant to say that the tab was replaced by equivalent spaces rather than the explicit '^H' tab character. I created text in an editor, " test^Htest ", and pasted it into the subject line in OutlookExpress (OE) to see if this might be the cause of the problem.

Thanks,

Jeff F

David Abrahams

1:30 p.m.

New subject: [tokenizer and token_iterator] requestforpolicyorredesign

"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:

...

"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt44x3id.fsf@boost-consulting.com...

...
"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:

...
"David Abrahams" <dave@boost-consulting.com> wrote in message news:u7jqdyp4m.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

...
"David Abrahams" <dave@boost-consulting.com> wrote in message news:upt452u2a.fsf@boost-consulting.com...

...
"Gennadiy Rozental" <gennadiy.rozental@thomson.com> writes:

your mailer is squeezing spaces out of subject lines again, it looks like.

I am using Outlook Express News Reader now. Do you know any options that manage that?

No, but I used the same newsreader for years and never experienced this problem. I suggest Cc'ing yourself on your next postings to see if it's happening at your end or elsewhere. I also suggest making sure you post in plain text and not HTML (in case that's what you're doing).

I've tried to reproduce the problem on gmane.test, but I do not get the "squeezed" subject text using OutlookExpress. I tried both plain and html.

It's not just Gennadiy, but also Thorsten and some others on the list as well. The subject of the root of this thread had a carriage return and/or linefeed and tab as white space.

Could you explain what you mean in more detail?

Thorsten's original subject for this thread:

"[tokenizer and token_iterator] request for policy or redesign"

Is actually broken into two lines, the second prepended with a '^H' tab character, as in:

That's odd. ^H is usually a backspace. Tab is ^I.

...

line1>[tokenizer and token_iterator] request for policy or line2> redesign

I see a newline followed by a tab. Is that legal in an RFC822 message header?

...

When viewing the original message source via the OutlookExpress(OE) File >> Properties >> Details >> Message Source.

Any replies created using OE for a message with a subject containing '^H' follow this pattern:

line1>[tokenizer and token_iterator] request for policy or line2> redesign

reply:

line1>[tokenizer and token_iterator] request for policy line2> orredesign

reply:

line1>[tokenizer and token_iterator] request for line2> policyorredesign

...

...
...
I tried cutting/pasting in a tab into OE, but it is received as text.

Can you explain that one too?

I meant to say that the tab was replaced by equivalent spaces rather than the explicit '^H' tab character. I created text in an editor " test^Htest ", and pasted it into the subject line in OutlookExpress(OE), to see if this might be the cause of the problem.

Thanks for the explanation. Maybe Thorsten's mailer is at fault for inserting the newline?

--
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Jeff Flinn

1:48 p.m.

New subject: [tokenizer and token_iterator]requestforpolicyorredesign

...

...
Is actually broken into two lines, the second prepended with a '^H' tab character, as in:

That's odd. ^H is usually a backspace. Tab is ^I.

It's only odd if it's not "When I say ^H I mean ^I day". Hmm, this must only be a local custom here. Sorry for the confusion. Thanks, Jeff

Rob Stewart

5:03 p.m.

New subject: [tokenizer and token_iterator] requestforpolicyorredesign

From: David Abrahams <dave@boost-consulting.com>

...

"Jeff Flinn" <TriumphSprint2000@hotmail.com> writes:

...
...
Thorsten's original subject for this thread:

"[tokenizer and token_iterator] request for policy or redesign"

Is actually broken into two lines, the second prepended with a '^H' tab character, as in:

That's odd. ^H is usually a backspace. Tab is ^I.

...
line1>[tokenizer and token_iterator] request for policy or line2> redesign

I see a newline followed by a tab. Is that legal in an RFC822 message header?

I see the same things. RMAIL in emacs handles it just fine.

RFC822, 3.1.1. LONG HEADER FIELDS, specifically permits folding long lines like that. OE is broken.

--
Rob Stewart stewart@sig.com
Software Engineer http://www.sig.com
Susquehanna International Group, LLP using std::disclaimer;
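For reference, RFC 822 unfolding treats a CRLF that is immediately followed by a space or tab as equivalent to that space or tab alone. The sketch below contrasts that rule with a lossy variant that also swallows the whitespace. The function names are made up for this example, and the lossy variant is only a guess at the kind of behaviour that would produce the squeezed subjects seen in this thread; it is not a description of how any real mail client is implemented.

```cpp
#include <cstddef>
#include <string>

// Linear white space characters as used in header folding.
bool is_lwsp(char c) { return c == ' ' || c == '\t'; }

// True if a CRLF at position i is followed by at least one space or tab.
bool is_fold_at(const std::string& s, std::size_t i)
{
    return s.compare(i, 2, "\r\n") == 0 && i + 2 < s.size() &&
           is_lwsp(s[i + 2]);
}

// Unfold a raw header value: drop each folding CRLF but keep the
// whitespace character that follows it.
std::string unfold_header(const std::string& raw)
{
    std::string out;
    std::size_t i = 0;
    while (i < raw.size()) {
        if (is_fold_at(raw, i))
            i += 2;              // skip CR and LF only
        else
            out += raw[i++];
    }
    return out;
}

// Assumed faulty variant: drops the CRLF and the whitespace after it,
// which glues the words on either side of the fold together.
std::string unfold_header_lossy(const std::string& raw)
{
    std::string out;
    std::size_t i = 0;
    while (i < raw.size()) {
        if (is_fold_at(raw, i)) {
            i += 2;
            while (i < raw.size() && is_lwsp(raw[i]))
                ++i;             // also discard the leading whitespace
        } else {
            out += raw[i++];
        }
    }
    return out;
}
```

Applied to a subject folded as "request for policy or", CRLF, tab, "redesign", the first function keeps the tab between "or" and "redesign", while the lossy one yields "request for policy orredesign".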

Pavol Droba

28 Sep

7:35 p.m.

Hi, On Tue, Sep 28, 2004 at 08:03:41PM +0200, Thorsten Ottosen wrote:

...

3. would it not be possible to allow

typedef boost::tokenizer< fun, boost::sub_range<string> > tokenizer;

A tokenizer like this is already in the StringAlgo library. It's called find_iterator. It does almost precisely what you requested. When you dereference it, you get an iterator_range. Currently it is still working with the internal version, but after the release I will convert it to use Boost.Range.

Regards,

Pavol.
