Re: [boost] [regex and string algo] again strange split behaviour


Jan Hermelink

13 Jul 2005

6:59 p.m.

<Jan.Hermelink@metalogic.de> wrote:

...

...
It looks more like a bug than by design if you ask me.

...

I don't think so - this behaviour is specified in the standardization proposal. To quote from http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1429.htm, Chapter "RE.8.2 Template class regex_token_iterator":

"If the end of sequence is reached (regex_search returns false), the iterator becomes equal to the end-of-sequence iterator value, unless the sub-expression being enumerated has index -1: In which case the iterator enumerates one last string that contains all the characters from the end of the last regular expression match to the end of the input sequence being enumerated, provided that this would not be an empty string."

!!! "provided that this would not be an empty string" !!!
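To make the quoted rule concrete, here is a minimal sketch. It uses std::sregex_token_iterator from the C++ standard library rather than Boost.Regex, on the assumption that the standardized iterator follows the wording quoted above; split_tokens is a hypothetical helper name, not part of either library.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `input` on the regular expression `delimiter`
// by enumerating sub-expression index -1 (the text between matches).
std::vector<std::string> split_tokens(const std::string& input,
                                      const std::string& delimiter) {
    const std::regex re(delimiter);
    std::sregex_token_iterator it(input.begin(), input.end(), re, -1);
    const std::sregex_token_iterator end;  // end-of-sequence iterator
    std::vector<std::string> fields;
    for (; it != end; ++it) {
        fields.push_back(it->str());
    }
    return fields;
}
```

With this helper, "/abc/abc" yields a leading empty field followed by "abc" and "abc", while "abc/abc/" yields only "abc" and "abc", because the final suffix would be an empty string.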

...

How about this string: "/abc/abc". Would this result in "", "abc", "abc"?

Yes

...

Yet "abc/abc/" would result in "abc", "abc"?

Yes

...

That seems terribly unbalanced to me, and this is not the behavior I would expect.

Yes, you may have a point here.

Or is it somewhat modeled after the C++ initializer syntax: { a, b, } is the same as { a, b } but { , a, b } isn't the same ...

Maybe John can comment?

Jan


John Maddock

14 Jul 2005

12:32 p.m.


Apologies, I'm still catching up with this thread.

...

...
!!! "provided that this would not be an empty string" !!!

Correct.

...

...
How about this string: "/abc/abc". Would this result in "", "abc", "abc"?

Yes

...
Yet "abc/abc/" would result in "abc", "abc"?

Yes

...
That seems terribly unbalanced to me, and this is not the behavior I would expect.

Yes, you may have a point here.

Or is it somewhat modeled after the C++ initializer syntax: { a, b, } is the same as { a, b } but { , a, b } isn't the same ...

Maybe John can comment?

The original rationale was "do the same thing as perl", for example:

    perl -e "print join(':', split(/;/, '')) .\"\\n\". join(':', split(/;/, ';')) .\"\\n\". join(':', split(/;/, '1;2')) .\"\\n\". join(':', split(/;/, '1;2;')) .\"\\n\". join(':', split(/;/, ';1;2;'))"

Outputs:

    1:2
    1:2
    :1:2

Note no trailing blank fields, the Perl manual says:

"split /PATTERN/,EXPR,LIMIT
split /PATTERN/,EXPR
split /PATTERN/
split

Splits a string into a list of strings and returns that list. By default, empty leading fields are preserved, and empty trailing ones are deleted."

It also kind of makes sense to me: if you want to split on a delimiter, then a trailing delimiter does not normally mean you want a trailing blank field: indeed trailing delimiters are quite commonly used (think C++ array syntax as one example). I believe in Perl you can get the empty trailing field if you specify an arbitrarily large argument as the split field limit.

As far as Boost.Regex is concerned, regex_token_iterator could be used to get either behaviour, given either definition as the starting point, with equivalent ease. Given:

    typedef boost::regex_token_iterator< ...args... > iterator_type;
    iterator_type i( ...args... );

Then given the current behaviour (stripping a trailing empty field not followed by a delimiter), we know that a trailing field has been stripped if:

    (i++ == iterator_type()) && (i->second != end_of_string_sequence)

Alternatively, if trailing empty fields were to be preserved, then we could spot them when they happen with:

    (i++ == iterator_type()) && (i->first == i->second)

So for me, the question is which behaviour is more commonly required? At present I can't think of any real world use cases where a trailing empty field would be important, so here's the challenge: can anyone think of a file format, or transmission format or command line syntax or whatever where the trailing field is actually required?

Real world cases only please, but first two data points: CSV files, and the Unicode character database don't require the output of trailing blanks (and parsing of the latter would certainly break if they were considered).

One more very unscientific data point: historically this has always been the behaviour of regex_split (now deprecated), and its replacement regex_token_iterator, and no one has ever complained: until now that is! :-)

Still sticking to my guns for now....

John.
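The detection idea above (compare the end of the last token with the end of the input) can be sketched roughly as follows. This is an illustration only: it uses std::sregex_token_iterator in place of Boost.Regex, and SplitResult and split_and_detect are hypothetical names introduced for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type for this sketch.
struct SplitResult {
    std::vector<std::string> fields;
    bool trailing_empty_stripped;  // true if the input ended in a delimiter match
};

// Split with sub-expression index -1 and remember where the last token ended.
// If the last token stops short of the end of the input, the remaining
// characters were a delimiter and an empty trailing field was not reported.
SplitResult split_and_detect(const std::string& input,
                             const std::regex& delimiter) {
    SplitResult result{{}, false};
    std::string::const_iterator last_end = input.begin();
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(input.begin(), input.end(), delimiter, -1);
         it != end; ++it) {
        result.fields.push_back(it->str());
        last_end = it->second;  // one past the last character of this token
    }
    result.trailing_empty_stripped = (last_end != input.end());
    return result;
}
```

A caller that wants the trailing empty field can push an empty string onto fields whenever trailing_empty_stripped is true; a caller that does not want it can ignore the flag.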

Pavol Droba

1:14 p.m.


Hi John,

On Thu, Jul 14, 2005 at 01:32:15PM +0100, John Maddock wrote:

[snip]

...

At present I can't think of any real world use cases where a trailing empty field would be important, so here's the challenge: can anyone think of a file format, or transmission format or command line syntax or whatever where the trailing field is actually required? Real world cases only please, but first two data points:

CSV files, and the Unicode character database don't require the output of trailing blanks (and parsing of the latter would certainly break if they were considered).

I have one case where it is natural to have: a CSV file with tabular data in it. When you are parsing this structure, it is natural to have a trailing blank, since it represents a valid data point.

Another example is file path parsing. There is a fundamental difference between

    /folder1/folder2/file

and

    /folder1/folder2/folder3/

Yet after splitting (and not considering the last blank) this difference is lost.

Sure, it can almost always be worked around. But I assume that it is much easier to skip a blank token than to always check whether it was supposed to be there.

Regards,
Pavol.
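A rough sketch of the path example, assuming a plain single-character splitter that keeps every field. Both function names are hypothetical and neither comes from Boost.

```cpp
#include <string>
#include <vector>

// Hypothetical splitter: keeps every field, including empty leading and
// trailing ones, so N delimiters always produce N + 1 fields.
std::vector<std::string> split_keep_all(const std::string& input, char delimiter) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (true) {
        const std::string::size_type pos = input.find(delimiter, start);
        if (pos == std::string::npos) {
            fields.push_back(input.substr(start));  // last field, possibly empty
            break;
        }
        fields.push_back(input.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// True when the last field is empty, for example a path ending in '/'.
// Note that an empty input also reports true, since its only field is empty.
bool has_trailing_empty_field(const std::string& path) {
    return split_keep_all(path, '/').back().empty();
}
```

Here "/folder1/folder2/file" splits into four fields and "/folder1/folder2/folder3/" into five, the last one empty, so the two forms stay distinguishable after splitting.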

Thore Karlsen

2:07 p.m.


On Thu, 14 Jul 2005 13:32:15 +0100, "John Maddock" <john@johnmaddock.co.uk> wrote: [boost.regex dropping last empty token]

...

The original rationale was "do the same thing as perl", for example:

    perl -e "print join(':', split(/;/, '')) .\"\\n\". join(':', split(/;/, ';')) .\"\\n\". join(':', split(/;/, '1;2')) .\"\\n\". join(':', split(/;/, '1;2;')) .\"\\n\". join(':', split(/;/, ';1;2;'))"

Outputs:

    1:2
    1:2
    :1:2

Note no trailing blank fields, the Perl manual says:

"split /PATTERN/,EXPR,LIMIT
split /PATTERN/,EXPR
split /PATTERN/
split

Splits a string into a list of strings and returns that list. By default, empty leading fields are preserved, and empty trailing ones are deleted."

But if I'm not mistaken, you're not really doing the same thing. Perl drops _all_ empty trailing fields, and from the Boost.Regex description it looks like you are only dropping the very last one. Perl also has the option of keeping all empty trailing fields by using a negative number for LIMIT, as you mentioned.
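The difference can be seen on an input with two trailing delimiters. The sketch below uses std::sregex_token_iterator as a stand-in for Boost.Regex (an assumption), and drop_all_trailing_empty is a hypothetical helper that applies the default described in the Perl manual quote, deleting every empty trailing field.

```cpp
#include <regex>
#include <string>
#include <vector>

// Split with sub-expression index -1; only a final empty suffix is omitted.
std::vector<std::string> token_iterator_split(const std::string& input,
                                              const std::regex& delimiter) {
    std::vector<std::string> fields;
    const std::sregex_token_iterator end;
    for (std::sregex_token_iterator it(input.begin(), input.end(), delimiter, -1);
         it != end; ++it) {
        fields.push_back(it->str());
    }
    return fields;
}

// Hypothetical helper: remove every empty field at the end of the list,
// leaving empty leading and interior fields untouched.
std::vector<std::string> drop_all_trailing_empty(std::vector<std::string> fields) {
    while (!fields.empty() && fields.back().empty()) {
        fields.pop_back();
    }
    return fields;
}
```

For "1;2;;" the token iterator reports "1", "2" and one empty field, whereas applying drop_all_trailing_empty to that result leaves just "1" and "2".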

...

It also kind of makes sense to me: if you want to split on a delimiter, then a trailing delimiter does not normally mean you want a trailing blank field: indeed trailing delimiters are quite commonly used (think C++ array syntax as one example).

I can't speak for everyone else, but I can say that in many of my splits I would want the last empty field to be retained. I'm parsing comma/tab/semicolon-separated log lines, CSV files, custom protocols, and other things where the last field is important, empty or not.

An empty field is still valid data, and the field count in my cases can determine how I need to parse the data. (For keeping compatibility with old log file formats, for instance.)

--
Be seeing you.
