Re: [boost] [Review] xpressive

Scott Woods

14 Sep 2005

11:38 p.m.

* What is your evaluation of the design?

Good. It has taken some powerful, established technologies (e.g. regular expressions and parser generators) and delivered a C++ facility that uses some of the latest C++ techniques. Hosting these technologies within C++ syntax has resulted in some slightly jarring constructs (e.g. prefix "one-or-more", tag assignment) but the set of design decisions made seems very reasonable.

There were some corners that got me curious; the following queries are therefore more about curiosity than suggestions for change.

1. What is the benefit of providing the complete match in the first entry of the results? e.g. "what[0]". While this is consistent with a long tradition in RE, after some time with STL its presence at position zero wasn't as comfortable as I expected.

2. Why the slash syntax in dynamic regex? The resulting requirement for a double slash is fairly ugly. It may be consistent with something (Perl/ECMA/..?) but on balance is it worth it?

3. Why ">>" and not "," (comma)? Did the "set" facility take priority or does the low precedence of comma just result in a different ugliness (sorry, not really the word I want to use :-).

4. C++ operators and RE operators are not really comfortable bed-fellows. While the operator overloading technique has given something useful and concise, was there no flirtation with "key words" (i.e. in the manner of "as_xpr")? "+" could have been "one_or_more( xpr )". Maybe the keyword version is viable as an alternate targeted at beginners? Unjustifiably complex?

5. There didn't appear to be much specific thought given to file processing. Is this another "not yet implemented"? In particular, elegant integration with any async I/O facility arising from sockets and file I/O initiatives.

6. Very interested in the future of "semantic actions". Actions and file processing probably go together?

* What is your evaluation of the implementation?

Cannot say. Not used.

* What is your evaluation of the documentation?

Good. I liked the balance of technical terminology combined with some informality. There appeared to be some minor problems, such as initially not knowing what "_w" and "_d" meant. Was this my failing or was there a missing definition? Was the later use of "+alpha" and others inconsistent?

* What is your evaluation of the potential usefulness of the library?

Huge. Some of that answer refers to the usefulness of RE but xpressive is obviously more. There has been some discussion about overlap with regex. It would be nice to clarify that. The uptake of RE within the development community (at least, the one where I live) is almost embarrassingly low. I have seen many circumstances that cried out for the power and rigor of RE that instead got limited, buggy hand-written solutions. The ease-of-use (installation, build and coding) of xpressive might help?

* Did you try to use the library? With what compiler? Did you have any problems?

No.

* How much effort did you put into your evaluation? A glance? A quick reading? In-depth study?

An in-depth study with a glaring omission; the library was not used. Lack of time, old compiler, employer resistance to boost.

* Are you knowledgeable about the problem domain?

Yes.

* Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion.

Yes.

Cheers.


Joel de Guzman

15 Sep

12:56 a.m.

New subject: [Review] xpressive

Scott Woods wrote:

...

3. Why ">>" and not "," (comma). Did the "set" facillity take priority or does the low precedence of comma just result in a different ugliness (sorry, not really the word I want to use :-).

You must've missed the lengthy discussion when Spirit was being reviewed. Yes, one of the main reasons is because of precedence. The ">>" has just about the right level of precedence. I would have loved to be able to use the ','. Too bad, it has the lowest precedence in C/C++, which makes it virtually useless. It even has lower precedence than the '='. Also, there's prior practice: Spirit.

Cheers,
--
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net
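The precedence point can be checked with a toy type. The sketch below is not Spirit or xpressive code; `Expr`, `assign_with_shift` and `assign_with_comma` are names made up for illustration. It overloads both `>>` and `,` as a "sequence" operator and shows what ends up assigned in each case.

```cpp
#include <string>

// Toy expression type; it only stands in for a parser/regex expression object.
struct Expr {
    std::string text;
};

// Sequencing spelled with '>>' (the choice made by Spirit and xpressive).
Expr operator>>(const Expr& a, const Expr& b) { return Expr{a.text + " " + b.text}; }

// Sequencing spelled with ',' (the alternative being asked about).
Expr operator,(const Expr& a, const Expr& b) { return Expr{a.text + " " + b.text}; }

// '>>' binds tighter than '=', so the whole sequence is assigned.
Expr assign_with_shift(const Expr& a, const Expr& b) {
    Expr r;
    r = a >> b;   // parsed as r = (a >> b)
    return r;
}

// ',' binds looser than '=', so only 'a' is assigned.
Expr assign_with_comma(const Expr& a, const Expr& b) {
    Expr r;
    r = a, b;     // parsed as (r = a), b : the sequence is built, then discarded
    return r;
}
```

With the comma spelling, every rule definition would need extra parentheses around its right-hand side to get the intended result.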

Joel de Guzman

1:10 a.m.

New subject: [Review] xpressive

Scott Woods wrote:

...

4. C++ operators and RE operators are not really comfortable bed-fellows. While the operator overloading technique has given something useful and concise was there no flirtation with "key words" (i.e. in the manner of "as_xpr"). "+" could have been "one_or_more( xpr )".

Again, we've had this discussion before. You might want to look at the archives when Spirit was being reviewed. I'm glad Eric chose to follow existing practice. Doing otherwise would have resulted in difficulties in unification with Spirit. I can assure you, after a while, it grows on you and it becomes natural. WRT Spirit, no one's complaining now. We're way past that initial syntax hurdle. And I'm glad. Check out a fairly complex Spirit grammar, say, the Wave CPP. Replace all '+' with 'one_or_more'. You'll see what I mean.

Cheers,
--
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net

Scott Woods

1:43 a.m.

New subject: [Review] xpressive

----- Original Message -----
From: "Joel de Guzman" <joel@boost-consulting.com>
To: <boost@lists.boost.org>
Sent: Thursday, September 15, 2005 1:10 PM
Subject: Re: [boost] [Review] xpressive

[snip]

...

...
something useful and concise was there no flirtation with "key words" (i.e. in the manner of "as_xpr"). "+" could have been "one_or_more( xpr )".

Again, we've had this discussion before. You might want to look at the archives when Spirit was being reviewed. I'm glad Eric chose

Cool. Was searching after your last post. Also looking for explicit statements in the xpressive docs about conformance (e.g. with Spirit, Perl, ECMAScript,...). I remember reading them. But not what they said :-) All good.

Eric Niebler

3:05 a.m.

New subject: [Review] xpressive

Answers inline... Scott Woods wrote:

...

1. What is the benefit of providing the complete match in the first entry of the results? e.g. "what[0]". While this is consistent with a long tradition in RE, after some time with STL it's presence at position zero wasnt as comfortable as I expected.

I'm curious, what did your experience with STL lead you to expect? I did it this way because TR1 regex does it that way. Although xpressive is not a fully compliant TR1 regex implementation, minimizing gratuitous differences can only help.
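For reference, the same indexing convention can be seen in `std::regex`, the standardized descendant of TR1 regex. The sketch below uses `std::regex` rather than xpressive, and `match_entries` is a helper name invented here. It collects the match results in index order so that the role of entry 0 is visible.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns every entry of the match results in index order, or an empty
// vector when the whole input does not match the pattern.
std::vector<std::string> match_entries(const std::string& input,
                                       const std::string& pattern) {
    const std::regex rex(pattern);
    std::smatch what;
    std::vector<std::string> entries;
    if (std::regex_match(input, what, rex)) {
        for (std::size_t i = 0; i < what.size(); ++i) {
            entries.push_back(what[i].str());   // i == 0 is the whole match
        }
    }
    return entries;
}
```

Calling `match_entries("hello world!", "(\\w+) (\\w+)!")` yields three entries: the whole match first, then the two captures.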

...

2. Why the slash syntax in dynamic regex? The resulting requirement for a double is fairly ugly. It may be consistent with something (Perl/ECMA/..?) but on balance is it worth it?

I'm following the lead of every other regex package for C and C++ out there. Anything else, and there would be riots in the streets. I agree that the double-slashes are hard on the eyes, though. (So use static regexes instead. :-)

...

3. Why ">>" and not "," (comma). Did the "set" facillity take priority or does the low precedence of comma just result in a different ugliness (sorry, not really the word I want to use :-).

As Joel already said, operator precedence. Also, I completely stole Spirit's choice of operators, lock, stock and barrel. That's a conscious decision (made after much debate and hand-wringing) to ease any future unification, and so that Spirit users can be productive with xpressive with a minimum of fuss.

...

4. C++ operators and RE operators are not really comfortable bed-fellows. While the operator overloading technique has given something useful and concise was there no flirtation with "key words" (i.e. in the manner of "as_xpr"). "+" could have been "one_or_more( xpr )".

Joel already answered this one, too. Concise syntax is part of what makes regexes powerful. And don't underestimate the power of an infix notation to *greatly* reduce the syntactic burden. Imagine if the sequence operator were spelled "sequence()" instead of ">>". Instead of "a >> b >> c >> d" you would have "sequence(a, sequence(b, sequence(c, d)))". This quickly becomes unmanageable.
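The difference in syntactic burden can be tried out with a few lines of toy code. This is a minimal sketch under stated assumptions, not how xpressive is implemented: `Matcher`, `ch`, `sequence` and `matches_all` are hypothetical names, and the matcher is a plain `std::function` rather than an expression template.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// A toy matcher: given the input and a start position, it returns the
// position just past the match, or std::string::npos on failure.
struct Matcher {
    std::function<std::size_t(const std::string&, std::size_t)> run;
};

// Matches exactly one given character.
Matcher ch(char c) {
    return Matcher{[c](const std::string& s, std::size_t pos) {
        return (pos < s.size() && s[pos] == c) ? pos + 1 : std::string::npos;
    }};
}

// Function-style sequencing: match 'first', then 'second' where it stopped.
Matcher sequence(Matcher first, Matcher second) {
    return Matcher{[first, second](const std::string& s, std::size_t pos) {
        const std::size_t mid = first.run(s, pos);
        return mid == std::string::npos ? mid : second.run(s, mid);
    }};
}

// Infix spelling of the same operation.
Matcher operator>>(Matcher lhs, Matcher rhs) {
    return sequence(std::move(lhs), std::move(rhs));
}

// True when the matcher consumes the entire input.
bool matches_all(const Matcher& m, const std::string& s) {
    return m.run(s, 0) == s.size();
}
```

With these definitions, `ch('a') >> ch('b') >> ch('c') >> ch('d')` and `sequence(ch('a'), sequence(ch('b'), sequence(ch('c'), ch('d'))))` accept the same input. The infix form groups to the left and the nested form to the right, which makes no difference for plain sequencing.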

...

5. There didnt appear to be much specific thought given to file processing. Is this another "not yet implemented"? In particular elegant integration with any async I/O facillity arising from sockets and file I/O initiatives.

xpressive works generically with iterators. Spirit has a file iterator. That would be the way to go, IMO.

...

6. Very interested in the future of "semantic actions". Actions and file processing probably go together?

They're orthogonal, AFAICT.

...

Appeared to be some minor problems such as initially not knowing what "_w" and "_d" meant. Was this my failing or was there a missing definition. Was the later use of "+alpha" and others inconsistent?

The docs can certainly be more clear on this point. It's been brought to my attention that I never say anywhere exactly what _w means. I'll clean all this up. Thanks.

...

* What is your evaluation of the potential usefulness of the library? Huge. Some of that answer refers to the usefulness of RE but xpressive is obviously more. There has been some discussion about overlap with regex. It would be nice to clarify that.

I tried to clarify the issue of overlap in the following messages: http://article.gmane.org/gmane.comp.lib.boost.devel/131005 http://article.gmane.org/gmane.comp.lib.boost.devel/131072 Do you have a specific question or concern that hasn't been addressed in these messages? <snip>

...

* Do you think the library should be accepted as a Boost library? Be sure to say this explicitly so that your other comments don't obscure your overall opinion. Yes.

-- Eric Niebler Boost Consulting www.boost-consulting.com

Scott Woods

4:40 a.m.

New subject: [Review] xpressive

----- Original Message -----
From: "Eric Niebler" <eric@boost-consulting.com>
To: <boost@lists.boost.org>
Sent: Thursday, September 15, 2005 3:05 PM
Subject: Re: [boost] [Review] xpressive

...

Answers inline...

Thanks.

...

...
1. What is the benefit of providing the complete match in the first entry of the results? e.g. "what[0]". While this is consistent with a long tradition in RE, after some time with STL it's presence at position zero wasnt as comfortable as I expected.

I'm curious, what did your experience with STL lead you to expect?

I did it this way because TR1 regex does it that way. Although xpressive is not a fully compliant TR1 regex implementation, minimizing gratuitous differences can only help.

Yep, agreed.

Going back to the "what[0] / STL" thing and starting with your (snipped) example;

    std::string hello( "hello world!" );
    sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
    smatch what;
    if( regex_match( hello, what, rex ) )
    {
        std::cout << what[0] << '\n'; // whole match
        std::cout << what[1] << '\n'; // first capture
        std::cout << what[2] << '\n'; // second capture
    }

"What[ 0 ]" is the odd one out; it does not have an implicit mapping to a manifest sub-expression. To RE-philes (I think my first exposure to $0 was in "vi"?) it's de rigueur. To those C++ developers that were born more recently but are familiar with STL, it's a wrinkle. Does processing of "what" always involve "++what.begin()" only because "what.complete()" fails to compete with tradition?

Please don't take my quoted code snippets literally. Or imagine I side with the next generation :-)

...

...
2. Why the slash syntax in dynamic regex? The resulting requirement for a double is fairly ugly. It may be consistent with something (Perl/ECMA/..?) but on balance is it worth it?

...
I'm following the lead of every other regex package for C and C++ out there. Anything else, and there would be riots in the streets. I agree that the double-slashes are hard on the eyes, though. (So use static regexes instead. :-)

Ha, cool.

...

...
3. Why ">>" and not "," (comma). Did the "set" facillity take priority or does the low precedence of comma just result in a different ugliness (sorry, not really the word I want to use :-).

As Joel already said, operator precedence. Also, I completely stole Spirit's choice of operators, lock, stock and barrel. That's a conscious decision (made after much debate and hand-wringing) to ease any future unification, and so that Spirit users can be productive with xpressive with a minimum of fuss.

...

...
Yep, sorry to have missed the evolution of Spirit. I'm a fairly recent Booster who only bothered to search my archives for xpressive before writing the review. Seems kinda dumb now; hope to do better next time.

...
5. There didn't appear to be much specific thought given to file processing.

...

...
Is this another "not yet implemented"? In particular elegant integration with any async I/O facillity arising from sockets and file I/O initiatives.

xpressive works generically with iterators. Spirit has a file iterator. That would be the way to go, IMO.

For "normal" file processing this is fine. Well actually its marvelous. But for another circumstance see below.

...

...
6. Very interested in the future of "semantic actions". Actions and file processing probably go together?

They're orthogonal, AFAICT.

...

Yes they are. But I need to be clearer.

I was associating files of input with semantic actions because processing of a file with xpr has a good chance of involving a complex xpr. And getting the right code to run at the right time with such an xpr, without embedded actions, involves contortions (even unnecessary CPU cycles?). I'm sure you are fully aware of all this. Sorry, it was an idle association.

Also, a recurring problem with related tools such as lex, flex, yacc and bison is that they are architected to be "superior" to the "sub-ordinate" input/buffering scheme. On one hand, this is great because in a traditional parser it hid a significant sub-system and often did an efficient job of it. OTOH it is often difficult/impossible to present data blocks to such a parser in an async fashion. A role reversal is required.

Borrowing your example again;

    // Sometime before establishing a TCP connection
    sregex rex = sregex::compile( "(\\w+) (\\w+)\\n" ); // Two words per line
    smatch what;

    // On an FD_READ
    // Load available bytes into char buffer[] and;
    while( regex_accumulate( buffer, what, rex ) )
    {
        // The pattern has been matched.
        // This loop body may be entered 0
        // or more times, for each FD_READ
        string command = what[ 1 ];
        string argument = what[ 2 ];
    }

Structured this way, the application processing the commands is completely impervious to changing MTUs and block sizes. But something needs to carry the xpressive state between invocations of "regex_accumulate"? Hell, would the xpr lib work as is!?

Cheers.
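One way to get the calling pattern of the hypothetical "regex_accumulate" with today's tools is to carry the state as buffered bytes. The sketch below assumes `std::regex` in place of xpressive, and `RegexAccumulator` is a name made up here. Note that it buffers and re-scans the pending input on every call, so it shows the interface, not an incremental matcher.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Carries unconsumed input between reads; the buffered bytes are the "state".
class RegexAccumulator {
public:
    explicit RegexAccumulator(std::regex rex) : rex_(std::move(rex)) {}

    // Append bytes that just arrived (e.g. on an FD_READ).
    void feed(const std::string& bytes) { pending_ += bytes; }

    // Try to match the pattern at the front of the buffered input.
    // On success, copy the sub-matches into 'groups' (index 0 is the whole
    // match), drop the consumed bytes, and return true.
    bool next(std::vector<std::string>& groups) {
        std::smatch what;
        if (!std::regex_search(pending_, what, rex_,
                               std::regex_constants::match_continuous)) {
            return false;   // no complete match yet; keep the bytes and wait
        }
        groups.clear();
        for (const auto& sub : what) groups.push_back(sub.str());
        // Copy first, erase second: erasing invalidates the iterators in 'what'.
        pending_.erase(0, static_cast<std::size_t>(what.length(0)));
        return true;
    }

private:
    std::regex rex_;
    std::string pending_;
};
```

A read handler would call `feed()` once with the new bytes and then loop on `next()`, which may succeed zero or more times per read. Input that can never match stays in the buffer indefinitely in this sketch; a real implementation would need a size limit or an error path.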

Eric Niebler

5:10 p.m.

New subject: [Review] xpressive

Answering an old question that got lost in the fray ... Scott Woods wrote:

...

"What[ 0 ]" is the odd one out; it does not have an implicit mapping to a manifest sub-expression. To RE-philes (I think my first exposure to $0 was in "vi"?) it's de rigueur. To those C++ developers that were born more recently but are familiar with STL, it's a wrinkle. Does processing of "what" always involve "++what.begin()" only because "what.complete()" fails to compete with tradition.

I understand what you're saying, but I don't agree that it's that way solely for legacy reasons. In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See?

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

Scott Woods

8:22 p.m.

New subject: [Review] xpressive

----- Original Message -----
From: "Eric Niebler" <eric@boost-consulting.com>
To: <boost@lists.boost.org>
Sent: Friday, September 16, 2005 5:10 AM
Subject: Re: [boost] [Review] xpressive

...

...
with STL, it's a wrinkle. Does processing of "what" always involve "++what.begin()" only because "what.complete()" fails to compete with tradition.

I understand what you're saying, but I don't agree that it's that way solely for legacy reasons. In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See?

For sure. Interesting though; in the accesses listed you do not make use of the zero'th element and yet they must all acknowledge its presence. Both approaches obviously work. I can only guess that recent immersion in STL/concepts/boost has left me with some kind of halo. Soon get rid of that! It's making me notice minutiae such as non-zero-based use of a vector. Xpressive is great (as are spirit and regex). The async I/O thing was a tangent but hopefully useful beyond the acceptance of your lib. Thanks to Chris and Felipe. Cheers.

Daryle Walker

16 Sep

5:59 a.m.

New subject: [Review] xpressive

On 9/15/05 1:10 PM, "Eric Niebler" <eric@boost-consulting.com> wrote:

...

Answering an old question that got lost in the fray ...

Scott Woods wrote:

...
"What[ 0 ]" is the odd one out; it does not have an implicit mapping to a manifest sub-expression. To RE-philes (I think my first exposure to $0 was in "vi"?) it's de rigueur. To those C++ developers that were born more recently but are familiar with STL, it's a wrinkle. Does processing of "what" always involve "++what.begin()" only because "what.complete()" fails to compete with tradition.

I understand what you're saying, but I don't agree that it's that way solely for legacy reasons. In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See?

Didn't the serialization library run into a similar snag to this? That a lot of people were asking about how to do a particular task, because the author was the only one who never considered that task, so he never made the interface provide that task easily. Similarly, we may run into a situation where people commonly do non-random iterations through your "what" list.

It looks like the current setup is not STL-friendly. Most of the "what" list is one type of thing, the in-order pieces of the regex parse. The first item of the list doesn't match that pattern (since it's the whole parse). I'm guessing that this "old" way wasn't a problem because people expected 1-based arrays, so the 0-index could be special. That doesn't work in a 0-based array culture, like C++ (or C). C++ people would expect the 0-index element to match the general rule of the list. This mixing of element types mixes concerns (violating "keep it simple, silly").

An STL-friendly alternative would be to have separate member functions for the whole-parse and the list-of-parse-pieces, then have a special function (member or non-member) that generates a regex-culture combined list.

--
Daryle Walker
Mac, Internet, and Video Game Junkie
darylew AT hotmail DOT com
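The split interface described here can be approximated today with non-member helpers layered over the existing results object. The sketch below is only an illustration under assumptions: it uses `std::smatch` rather than xpressive's match results, and `whole_match` and `captures` are hypothetical names, not part of any library.

```cpp
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: the text matched by the entire expression.
std::string whole_match(const std::smatch& what) {
    return what.empty() ? std::string() : what[0].str();
}

// Hypothetical helper: only the parenthesized sub-matches, in order,
// re-indexed from zero so that captures(what)[0] is the first capture.
std::vector<std::string> captures(const std::smatch& what) {
    std::vector<std::string> out;
    if (what.empty()) return out;
    for (auto it = std::next(what.begin()); it != what.end(); ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

The copy into a vector costs allocations that indexing `what` directly does not; a lighter variant could return an iterator range starting at `std::next(what.begin())`.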

Darren Cook

11:26 p.m.

New subject: [Review] xpressive

...

...
I understand what you're saying, but I don't agree that it's that way solely for legacy reasons. In practice, the only reason why you might iterate over all sub-matches is to print them out. Otherwise, the sub-matches are accessed randomly, because (for example) the 1st sub-match is a date and the 3rd sub-match is an email address, and I'm not interested in the 2nd. See?

... interface provide that task easily. Similarly, we may run into a situation where people commonly do non-random iterations through your "what" list.

It is hard to think of when you will need to: normally you are using a regex because the data in different parts of the string has different meaning. When the fields are regular you'll normally use regex_token_iterator<> instead, and that is STL-friendly.

...

It looks like the current setup is not STL-friendly. Most of the "what" list is one type of thing, the in-order pieces of the regex parse.

Skipping over the first element is not that hard for the unusual cases where regex_token_iterator<> is not what you needed. Darren
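For the regular-fields case, a token iterator gives a plain forward sequence with no special first element. The sketch below uses `std::sregex_token_iterator` as a stand-in for the Boost/xpressive `regex_token_iterator<>` mentioned above; `all_fields` is a helper name invented for the example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every field matching 'field' from 'input', in order of appearance.
// The regex is taken by reference because the iterator keeps a pointer to it.
std::vector<std::string> all_fields(const std::string& input,
                                    const std::regex& field) {
    std::vector<std::string> out;
    const std::sregex_token_iterator end;   // default-constructed: end of sequence
    for (std::sregex_token_iterator it(input.begin(), input.end(), field);
         it != end; ++it) {
        out.push_back(it->str());           // each element is one whole match
    }
    return out;
}
```

Given a regex for `\d+` and the input `"10, 20, 30"`, this returns the three numbers as strings, and the range can be handed to any STL algorithm as-is.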

Christopher Kohlhoff

15 Sep

5:45 a.m.

New subject: [Review] xpressive

Hi Eric, Scott, Sorry if this is getting a bit off-topic for the xpressive review... --- Eric Niebler <eric@boost-consulting.com> wrote:

...

Scott Woods wrote:

...
5. There didnt appear to be much specific thought given to file processing. Is this another "not yet implemented"? In particular elegant integration with any async I/O facillity arising from sockets and file I/O initiatives.

xpressive works generically with iterators. Spirit has a file iterator. That would be the way to go, IMO.

I'm not totally familiar with the xpressive API, but as far as I understand the regex APIs (even with iterators) are not sufficient for clean integration with async I/O.

What you need is a stateful decoder that you can feed data to as you receive it off the wire, and it tells you when it is done, or when it needs you to feed it more data.

For example, a Decoder concept might look like:

    class Decoder
    {
    public:
      template <typename InputIterator>
      tuple<tribool, InputIterator> decode(
          InputIterator begin, InputIterator end);
    };

The begin and end parameters specify the range of input. The tribool return value is true when a complete "message" has been received, false when the data is known to be invalid, indeterminate when more data is required. The InputIterator return value is used to indicate how much of the input has been consumed.

It would be used something like this:

    class server
    {
      //...
      void handle_read(const error& e, size_t bytes_read)
      {
        tribool ok;
        tie(ok, tuples::ignore) = decoder_.decode(buf_, buf_ + bytes_read);
        if (ok)
        {
          // Successfully decoded message.
        }
        else if (!ok)
        {
          // Error in message.
        }
        else
        {
          // Need more data.
          sock_.async_read(buf_, 1024,
              bind(&server::handle_read, this, _1, _2));
        }
      }
      //...
    private:
      stream_socket sock_;
      char buf_[1024];
      Decoder decoder_;
    };

I have used hand-crafted implementations of this concept with great success. I would be *very* keen to see it supported in a generic regex mechanism, as it would make it so easy to implement text-based protocols efficiently and safely.

Cheers,
Chris

Scott Woods

5:58 a.m.

New subject: [Review] xpressive

----- Original Message -----
From: "Christopher Kohlhoff" <chris@kohlhoff.com>
To: <boost@lists.boost.org>
Sent: Thursday, September 15, 2005 5:45 PM
Subject: Re: [boost] [Review] xpressive

...

Hi Eric, Scott,

Sorry if this is getting a bit off-topic for the xpressive review...

Maybe it is. But it's also bang on what I've been trying to figuratively inject into the "Not Yet Implemented" appendix of the xpressive docs. :-) [snip]

Eric Niebler

5:59 a.m.

New subject: [Review] xpressive

Christopher Kohlhoff wrote:

...

Hi Eric, Scott,

Sorry if this is getting a bit off-topic for the xpressive review...

Mmm, drifting a bit, perhaps, but it's worth clearing this up.

...

I'm not totally familiar with the xpressive API, but as far as I understand the regex APIs (even with iterators) are not sufficient for clean integration with async I/O.

What you need is a stateful decoder that you can feed data to as you receive it off the wire, and it tells you when it is done, or when it needs you to feed it more data.

What you are describing, at least in regex terms, is a partial match. Ordinarily, a regex match will give you a "yes" or a "no" answer. With a partial match, you can get a "maybe" if the input sequence is exhausted before the regex state machine has reached its final state. Boost.Regex and xpressive both support partial matches via the match_partial switch, which is documented (for Boost.Regex) here: http://boost.org/libs/regex/doc/partial_matches.html Note that TR1 regex does not support partial matches, for the simple reason that we weren't able to come up with satisfactory standardese in time. -- Eric Niebler Boost Consulting www.boost-consulting.com

Christopher Kohlhoff

6:43 a.m.

New subject: [Review] xpressive

Hi Eric, --- Eric Niebler <eric@boost-consulting.com> wrote:

...

What you are describing, at least in regex terms, is a partial match. Ordinarily, a regex match will give you a "yes" or a "no" answer. With a partial match, you can get a "maybe" if the input sequence is exhausted before the regex state machine has reached its final state.

...

I was aware of partial matches from perusing the documentation a while back, and I'm not sure that it's exactly the same thing -- please correct me if I'm wrong.

From my reading of the documentation, if I get a partial match and I want to continue to try for a full match I must buffer the entire data from the beginning of the partial match. This means that in the partial_regex_grep example it cannot find a substring match that is greater than 4096 characters long, because that is all the data it will buffer. Furthermore, each time I want to retest the input against the expression it must process the whole input string again. This is not ideal from an efficiency point of view, since I could potentially receive input data one byte at a time.

What I want is a stateful regular expression-based decoder object (since in theory it's just a state machine and can remember its current state). I can feed it more input which will cause more state transitions, and it will tell me when it reaches a terminal state. I never have to buffer more input than the block just read because earlier input will have been fully consumed by the decoder.

As I said, this is an area I am very interested in exploring further when time permits (and not just in relation to regular expressions, but also things like Boost.Serialization), but that definitely belongs in its own thread.

Cheers,
Chris
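To make the idea concrete, here is a hand-written state machine of the kind described, for messages of the form "word, space, word, newline". It is a sketch under assumptions, not a regex engine: `CommandDecoder` and `Status` are names invented here, the `Status` enum replaces a tribool, and the decoder keeps only the two captured words rather than the raw input.

```cpp
#include <cctype>
#include <string>
#include <utility>

enum class Status { need_more, complete, invalid };

// Stateful decoder for "<word> <word>\n". Each byte is examined exactly once;
// the current state and the captured words survive between calls.
class CommandDecoder {
public:
    // Consumes bytes from [begin, end) until the message completes, an invalid
    // byte is seen, or the input runs out. Returns the status together with an
    // iterator to the first unconsumed byte.
    template <typename InputIterator>
    std::pair<Status, InputIterator> decode(InputIterator begin, InputIterator end) {
        while (state_ != State::done && state_ != State::failed && begin != end) {
            const char c = *begin;
            ++begin;
            step(c);
        }
        if (state_ == State::done) return {Status::complete, begin};
        if (state_ == State::failed) return {Status::invalid, begin};
        return {Status::need_more, begin};
    }

    const std::string& command() const { return command_; }
    const std::string& argument() const { return argument_; }

private:
    enum class State { command, argument, done, failed };

    static bool is_word(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    }

    // One state transition per input byte.
    void step(char c) {
        if (state_ == State::command) {
            if (is_word(c)) command_ += c;
            else if (c == ' ' && !command_.empty()) state_ = State::argument;
            else state_ = State::failed;
        } else {  // State::argument
            if (is_word(c)) argument_ += c;
            else if (c == '\n' && !argument_.empty()) state_ = State::done;
            else state_ = State::failed;
        }
    }

    State state_ = State::command;
    std::string command_;
    std::string argument_;
};
```

Feeding it "hel", then "lo wor", then "ld\n" reports need_more twice and then complete, without any block being looked at again. Generating such a machine from an arbitrary regular expression is the part that the thread leaves open.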

Felipe Magno de Almeida

7:02 a.m.

New subject: [Review] xpressive

Hi Christopher, On 9/15/05, Christopher Kohlhoff <chris@kohlhoff.com> wrote: [snip]

...

...
From my reading of the documentation, if I get a partial match and I want to continue to try for a full match I must buffer the entire data from the beginning of the partial match. This means that in the partial_regex_grep example it cannot find a substring match that is greater then 4096 characters long, because that is all the data it will buffer.

the same for spirit (If I'm wrong then I'm using it very inefficiently)

...

Furthermore, each time I want to retest the input against the expression it must process the whole input string again. This is not ideal from an efficiency point of view, since I could potentially receive input data one byte at a time.

indeed

...

What I want is a stateful regular expression-based decoder object (since in theory it's just a state machine and can remember its current state). I can feed it more input which will cause more state transitions, and it will tell me when it reaches a terminal state. I never have to buffer more input than the block just read because earlier input will have been fully consumed by the decoder.

IMO, it would need to be possible to have more than one grammar, one for success and another for failure (it would be interesting to allow more than two), and to keep waiting for more data until one of them matches...

...

As I said, this is an area I am very interested in exploring further when time permits (and not just in relation to regular expressions, but also things like Boost.Serialization), but that definitely belongs in its own thread.

IMO, it should be another library that should be able to interact with AIO, spirit and xpressive. I already own some helper classes for something like this; when they're called from the AIO components, they forward to the grammars and verify success or failure, but they re-do all the grammar processing again. I would be willing to volunteer to help in a library like this, since I use it a lot for processing protocols with AIO (sockets) and spirit and already have some code written. However, I don't have enough time to create such a library all by myself.

...

Cheers, Chris

best regards, -- Felipe Magno de Almeida Developer from synergy and Computer Science student from State University of Campinas(UNICAMP). Unicamp: http://www.ic.unicamp.br Synergy: http://www.synergy.com.br "There is no dark side of the moon really. Matter of fact it's all dark."

Scott Woods

9:19 p.m.

New subject: Async xpressive, regex and spirit

----- Original Message -----
From: "Felipe Magno de Almeida" <felipe.m.almeida@gmail.com>
To: <boost@lists.boost.org>
Sent: Thursday, September 15, 2005 7:02 PM
Subject: Re: [boost] [Review] xpressive

...

...
What I want is a stateful regular expression-based decoder object (since in theory it's just a state machine and can remember its current state). I can feed it more input which will cause more state transitions, and it will tell me when it reaches a terminal state. I never have to buffer more input than the block just read because earlier input will have been fully consumed by the decoder.

Exactly. I only suspected there was some difficulty here. Thanks for the full story.

...

IMO, it would be needed the possibility to have more than one grammar, one for success and another for failure(Would be interesting to have the possibility for more than two), until no-one has a match continue to wait for more data...

Sorry. Don't follow the duplex grammar idea.

...

...
As I said, this is an area I am very interested in exploring further when time permits (and not just in relation to regular expressions, but also things like Boost.Serialization), but that definitely belongs in its own thread.

IMO, it should be another library that should be able to interact with AIO, spirit and xpressive. I already own some helper classes for something like this, when they're called from the AIO components, they forward to the grammars and verify success or failure, but they re-do all the grammar processing again. I would be willing to volunteer to help in a library like this, since I use it a lot for processing protocols with AIO(sockets) and spirit and already have some code writed. However, I dont have enough time to create such a library all by myself.

I'm in a very similar position. Except I have been forced to write my own version of AIO and regex. If that experience is any use then I'm pleased to contribute. Cheers.

Felipe Magno de Almeida

9:40 p.m.

New subject: Async xpressive, regex and spirit

On 9/15/05, Scott Woods <scottw@qbik.com> wrote:

...

----- Original Message ----- From: "Felipe Magno de Almeida" <felipe.m.almeida@gmail.com> To: <boost@lists.boost.org> Sent: Thursday, September 15, 2005 7:02 PM Subject: Re: [boost] [Review] xpressive

[snip]

...

Exactly. I only suspected there was some difficulty here. Thanks for the full story.

...
IMO, it would be needed the possibility to have more than one grammar, one for success and another for failure(Would be interesting to have the possibility for more than two), until no-one has a match continue to wait for more data...

Sorry. Don't follow the duplex grammar idea.

Imagine the SMTP protocol: a command has two or more possible responses, such as success, warning and complete failure. That way three grammars are necessary, one for each. If nothing matched, then we probably have a real problem. But one of the three should match, in general.
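The three-grammar idea can be sketched with three patterns tried in turn against a complete reply line. Everything here is an assumption made for illustration: `std::regex` replaces Spirit grammars, `classify_reply` and `ReplyKind` are invented names, and the mapping of 2xx to success, 4xx to warning and 5xx to failure is one possible reading of the three cases (3xx intermediate replies and multi-line replies are left out).

```cpp
#include <regex>
#include <string>

enum class ReplyKind { success, warning, failure, unrecognized };

// Tries one pattern per reply class against a single, complete reply line.
ReplyKind classify_reply(const std::string& line) {
    // Assumed shape: three-digit code, space, free text, CRLF.
    static const std::regex success_rx("2\\d\\d [^\\r\\n]*\\r\\n");
    static const std::regex warning_rx("4\\d\\d [^\\r\\n]*\\r\\n");
    static const std::regex failure_rx("5\\d\\d [^\\r\\n]*\\r\\n");

    if (std::regex_match(line, success_rx)) return ReplyKind::success;
    if (std::regex_match(line, warning_rx)) return ReplyKind::warning;
    if (std::regex_match(line, failure_rx)) return ReplyKind::failure;
    return ReplyKind::unrecognized;   // none matched: "a real problem"
}
```

In an asynchronous setting the function would only be called once a whole line has arrived; deciding "no grammar can match any more" on partial input is exactly the open problem discussed above.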

...

[snip]

...

...
IMO, it should be another library that should be able to interact with AIO, spirit and xpressive. I already own some helper classes for something like this, when they're called from the AIO components, they forward to the grammars and verify success or failure, but they re-do all the grammar processing again. I would be willing to volunteer to help in a library like this, since I use it a lot for processing protocols with AIO(sockets) and spirit and already have some code writed. However, I dont have enough time to create such a library all by myself.

I'm in a very similar position. Except I have been forced to write my own version of AIO and regex. If that experience is any use then I'm pleased to contribute.

I implemented AIO and a glue between the grammars and the AIO system.

...

Cheers.


-- Felipe Magno de Almeida Developer from synergy and Computer Science student from State University of Campinas(UNICAMP). Unicamp: http://www.ic.unicamp.br Synergy: http://www.synergy.com.br "There is no dark side of the moon really. Matter of fact it's all dark."

Scott Woods

10:18 p.m.

New subject: Async xpressive, regex and spirit

----- Original Message ----- From: "Felipe Magno de Almeida" <felipe.m.almeida@gmail.com> To: <boost@lists.boost.org> Sent: Friday, September 16, 2005 9:40 AM Subject: Re: [boost] Async xpressive, regex and spirit

...

...
...
IMO, it would need the possibility of having more than one grammar, one for success and another for failure (it would be interesting to allow more than two); until one of them matches, continue to wait for more data...

Sorry. Don't follow the duplex grammar idea.

Imagine the SMTP protocol: a command has two or more possible responses, indicating success, warning, or complete failure. That makes three grammars necessary, one for each. If nothing matched, then we probably have a real problem. But one of the three should match, in general.

I suspect we have different understandings of what a grammar is. Other possible misunderstandings make this confusing, e.g. software that receives SMTP responses never receives the related commands (client vs server) so a grammar that spans these domains does not make sense? Also, are you applying language technology (i.e. grammars) to the overall processing of protocol signals/messages? While I find the idea compelling (my first use of yacc a long time ago was for exactly this) it may not be the most appropriate, e.g. FSMs are "lingua franca" in telephony protocols whereas grammars do not get a mention. But grammars are cool :-) Cheers.

Felipe Magno de Almeida

16 Sep

12:08 p.m.

New subject: Async xpressive, regex and spirit

On 9/15/05, Scott Woods <scottw@qbik.com> wrote:

...

----- Original Message ----- From: "Felipe Magno de Almeida" <felipe.m.almeida@gmail.com> To: <boost@lists.boost.org> Sent: Friday, September 16, 2005 9:40 AM Subject: Re: [boost] Async xpressive, regex and spirit

...
...
...
IMO, it would need the possibility of having more than one grammar, one for success and another for failure (it would be interesting to allow more than two); until one of them matches, continue to wait for more data...

Sorry. Don't follow the duplex grammar idea.

Imagine the SMTP protocol: a command has two or more possible responses, indicating success, warning, or complete failure. That makes three grammars necessary, one for each. If nothing matched, then we probably have a real problem. But one of the three should match, in general.

I suspect we have different understandings of what a grammar is. Other possible misunderstandings make this confusing, e.g. software that receives SMTP responses never receives the related commands (client vs server) so a grammar that spans these domains does not make sense? Also, are you applying language technology (i.e. grammars) to the overall processing of protocol signals/messages?

I was referring to Spirit grammars. I didn't understand what you meant by "software that receives SMTP responses never receives the related commands (client vs server) so a grammar that spans these domains does not make sense?"

...

While I find the idea compelling (my first use of yacc a long time ago was for exactly this) it may not be the most appropriate, e.g. FSMs are "lingua franca" in telephony protocols whereas grammars do not get a mention.

I use Spirit grammars because they make it easiest to write RFC-conforming code, since the EBNF grammars are already written out in the RFCs.

...

But grammars are cool :-)

Agreed :P

...

Cheers.

regards,

Felipe Magno de Almeida

17 Sep

5:25 a.m.

New subject: Async xpressive, regex and spirit

Hi Scott Woods. In your opinion, is my use of Spirit grammars not the best way to implement a text-based protocol? Or was it only a problem of the different meanings of "grammar" that we were using? If you really think it isn't the right way, I would like to hear from you why, and what would be better in this case. I'm only using it because it felt right to do so, and it really helped me write RFC-compliant code. regards,

Scott Woods

18 Sep

10:52 p.m.

New subject: Async xpressive, regex and spirit

Hi Felipe, ----- Original Message ----- From: "Felipe Magno de Almeida" <felipe.m.almeida@gmail.com> To: <boost@lists.boost.org>

...

In your opinion, is my use of Spirit grammars not the best way to implement a text-based protocol?

No; it's a great use of language technology.

...

Or was it only a problem of the different meanings of "grammar" that we were using?

Exactly.

...

If you really think it isn't the right way, I would like to hear from you why, and what would be better in this case. I'm only using it because it felt right to do so, and it really helped me write RFC-compliant code.

I'm sure your approach will have improved robustness and ease of maintenance.

From the number of follow-up posts relating to the integration of parsing (Xpressive) and I/O (asio) issues, I'm reassured that there is something worth picking away at. The application of language technology to network messaging (e.g. the TCP application suite) is lovely but to actually bring it all together takes quite a bit of glue. I'm hoping that we can eventually all be using the same glue and that that glue will be better than my glue :-) Cheers.

Scott Woods

11:44 p.m.

New subject: Async xpressive, regex and spirit

----- Original Message ----- From: "Scott Woods" <scottw@qbik.com> To: <boost@lists.boost.org> Sent: Monday, September 19, 2005 10:52 AM Subject: Re: [boost] Async xpressive, regex and spirit

...

Hi Felipe,

----- Original Message ----- From: "Felipe Magno de Almeida" <felipe.m.almeida@gmail.com> To: <boost@lists.boost.org>

...
In your opinion, is my use of Spirit grammars not the best way to implement a text-based protocol?

No; it's a great use of language technology.

A slightly bizarre question that might clear up some misunderstandings on my part; how do you handle EOF for your "RFC-compliant" grammars? Is EOF something you generate at the end of each command or on close of a session?

Eric Niebler

15 Sep

4:59 p.m.

New subject: partial matches and buffered input (was: [Review] xpressive)

Christopher Kohlhoff wrote:

...

From my reading of the documentation, if I get a partial match and I want to continue to try for a full match I must buffer the entire data from the beginning of the partial match. This means that in the partial_regex_grep example it cannot find a substring match that is greater than 4096 characters long, because that is all the data it will buffer.

For the example in the docs, that's correct.

...

Furthermore, each time I want to retest the input against the expression it must process the whole input string again. This is not ideal from an efficiency point of view, since I could potentially receive input data one byte at a time.

True.
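The cost of retesting can be sketched as follows. This is a minimal illustration using std::regex rather than the partial-match facility under discussion; the class RetestingMatcher and its members are hypothetical, and the scanned() counter is only an upper bound on how much input the engine is handed, not a measurement of its actual work.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical matcher that re-tests the whole buffer after every chunk.
class RetestingMatcher {
public:
    explicit RetestingMatcher(const std::string& pattern) : re_(pattern) {
        // An empty pattern matches anything, which would hide the effect.
        assert(!pattern.empty());
    }

    // Append a chunk, then search the entire buffer again from the start.
    bool feed(const std::string& chunk) {
        buffer_ += chunk;
        scanned_ += buffer_.size();  // total length handed to the engine so far
        return std::regex_search(buffer_, re_);
    }

    std::size_t buffered() const { return buffer_.size(); }
    std::size_t scanned() const { return scanned_; }

private:
    std::regex re_;
    std::string buffer_;       // grows without bound until a match is found
    std::size_t scanned_ = 0;
};
```

With input arriving in small pieces, the total handed to the engine grows roughly quadratically with the amount buffered, which is the inefficiency being described.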

...

What I want is a stateful, regular expression-based decoder object. I can feed it more input which will cause more state transitions, and it will tell me when it reaches a terminal state.

Understood. Perhaps what is called for is a "pull" iterator that buffers chunks of data at a time, and when the buffer underflows, it fetches another chunk of data. That way, libraries like xpressive and Spirit can keep their iterator-based interface and not worry whether or not "++begin" goes to disk for the next 4Kb, or reads from a socket, or whatever. Would that address your problem?
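A rough sketch of such a "pull" buffer is shown below. Everything in it is an assumption made for illustration: the names PullBuffer and PullCursor are hypothetical, the source callback returns an empty string to signal end of input, and the cursor is deliberately minimal rather than a standard-conforming bidirectional iterator. Note that the fetch is a blocking call made from inside the matcher's thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical chunked input: `source` returns the next chunk, or an empty
// string once the input is exhausted.
class PullBuffer {
public:
    explicit PullBuffer(std::function<std::string()> source)
        : source_(std::move(source)) {}

    // True if position `pos` exists, fetching further chunks on underflow.
    bool available(std::size_t pos) {
        while (pos >= data_.size() && !exhausted_) {
            std::string chunk = source_();
            if (chunk.empty()) exhausted_ = true;
            else data_ += chunk;
        }
        return pos < data_.size();
    }

    char at(std::size_t pos) {
        const bool ok = available(pos);
        assert(ok && "position past end of input");
        (void)ok;
        return data_[pos];
    }

    std::size_t buffered() const { return data_.size(); }

private:
    std::function<std::string()> source_;
    std::string data_;        // everything read so far is kept
    bool exhausted_ = false;
};

// Minimal cursor: moving is free, reading past the buffered data fetches.
class PullCursor {
public:
    PullCursor(PullBuffer& buf, std::size_t pos) : buf_(&buf), pos_(pos) {}
    char operator*() const { return buf_->at(pos_); }
    PullCursor& operator++() { ++pos_; return *this; }
    PullCursor& operator--() { --pos_; return *this; }
    bool at_end() const { return !buf_->available(pos_); }

private:
    PullBuffer* buf_;
    std::size_t pos_;
};
```

Because nothing is ever discarded, the cursor can be decremented freely, at the price of holding the whole input in memory.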

...

I never have to buffer more input than the block just read because earlier input will have been fully consumed by the decoder.

That's not the case for a backtracking regex engine like Boost.Regex or xpressive. These libraries require bidirectional iterators because they may need to back out state transitions and decrement the iterator to try a different alternative. You'll need to buffer everything read so far, or else write it to a tmp file so you can get it back should you need it. And the problem of returning a partial match and persisting the current state of the state machine is a hard one. Some implementations maintain their state on the program stack, so returning effectively wipes out all that information. These implementations would need to somehow serialize the state stored on the program stack, and then de-serialize it in order to begin executing where it left off. Tricky stuff. -- Eric Niebler Boost Consulting www.boost-consulting.com
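A small example of why earlier input has to stay available: the sketch below uses std::regex (not xpressive or Boost.Regex) with a pattern whose first alternative looks fine at first and must later be abandoned. The function name first_alternative_taken is hypothetical. std::regex_match likewise requires bidirectional iterators.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Returns what the first capture group matched, or "" if there is no match.
inline std::string first_alternative_taken(const std::string& input) {
    // Alternatives are tried left to right: "a" first, then "ab".
    static const std::regex re("(a|ab)c");
    std::smatch what;
    if (!std::regex_match(input, what, re)) return std::string();
    assert(what.size() == 2);  // whole match plus one group
    return what[1].str();
}
```

For the input "abc" the group first takes "a", the following "c" fails against "b", and the engine has to return to the start and take "ab" instead. If the leading "a" had already been discarded, that second attempt would not be possible.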

Christopher Kohlhoff

16 Sep

12:58 a.m.

New subject: partial matches and buffered input (was: [Review] xpressive)

Hi Eric, --- Eric Niebler <eric@boost-consulting.com> wrote:

...

Understood. Perhaps what is called for is a "pull" iterator that buffers chunks of data at a time, and when the buffer underflows, it fetches another chunk of data. That way, libraries like xpressive and Spirit can keep their iterator-based interface and not worry whether or not "++begin" goes to disk for the next 4Kb, or reads from a socket, or whatever.

Would that address your problem?

I'm not sure, however your comments further down suggest probably not. If the read from socket is to use async I/O then the regex code has to give up control of the thread.

...

That's not the case for a backtracking regex engine like Boost.Regex or xpressive. These libraries require bidirectional iterators because they may need to back out state transitions and decrement the iterator to try a different alternative. You'll need to buffer everything read

...

so far, or else write it to a tmp file so you can get it back should you need it.

Ok, I didn't realise it required backtracking. Perhaps xpressive can be wrapped with something that does the buffering from the correct position in the input stream automatically in this case, but...

...

And the problem of returning a partial match and persisting the current state of the state machine is a hard one. Some implementations maintain their state on the program stack, so returning effectively wipes out all that information.

Does this mean that xpressive stores its state on the program stack?

...

These implementations would need to somehow serialize the state stored on the program stack, and then de-serialize it in order to begin executing where it left off. Tricky stuff.

This does confirm my feeling that there is call for an async I/O friendly "regular expression" library, and that it:

- Only supports expressions that can be mapped to FSMs without requiring backtracking.
- Does not store any state on the program stack.

I don't believe it needs to be anything like as rich in functionality as xpressive, say, and so I'm quite happy to drop support for the "hard" stuff in order to make it async I/O friendly. If that level of functionality is required a user can always do the processing in two steps, where the second step passes a complete message through something like xpressive.

Cheers, Chris
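The two properties listed above can be sketched with a hand-written state machine. This is only an illustration of the idea, not a proposed library interface: the class ReplyLineDecoder, its Status values and the "three digits, text, CRLF" line format are all assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical resumable decoder for lines of the form "DDD<text>\r\n".
// All matching state lives in the object, none on the program stack.
class ReplyLineDecoder {
public:
    enum class Status { NeedMoreData, Done, Error };

    // Consume bytes until the block runs out, a full line is recognised,
    // or an unexpected byte is seen. Each byte is examined exactly once.
    Status feed(const char* data, std::size_t size) {
        for (std::size_t i = 0; i < size && status_ == Status::NeedMoreData; ++i)
            step(data[i]);
        return status_;
    }

    int code() const { return code_; }
    const std::string& text() const { return text_; }

private:
    enum class State { Digits, Text, SawCR };

    void step(char c) {
        switch (state_) {
        case State::Digits:
            if (c >= '0' && c <= '9') {
                code_ = code_ * 10 + (c - '0');
                if (++digits_ == 3) state_ = State::Text;
            } else {
                status_ = Status::Error;
            }
            return;
        case State::Text:
            if (c == '\r') state_ = State::SawCR;
            else text_ += c;
            return;
        case State::SawCR:
            status_ = (c == '\n') ? Status::Done : Status::Error;
            return;
        }
        assert(false && "unreachable state");
    }

    State state_ = State::Digits;
    Status status_ = Status::NeedMoreData;
    int code_ = 0;
    int digits_ = 0;
    std::string text_;
};
```

Since feed() returns whenever the block is exhausted, the caller is free to hand the thread back to the I/O layer and resume later with the next block. Bytes that follow a completed line in the same block are not consumed here; a fuller version would report how many bytes it used.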

David Abrahams

15 Sep

1:52 p.m.

New subject: [Review] xpressive

"Eric Niebler" <eric@boost-consulting.com> writes:

...

Scott Woods wrote:

...
1. What is the benefit of providing the complete match in the first entry of the results? e.g. "what[0]". While this is consistent with a long tradition in RE, after some time with STL it's presence at position zero wasnt as comfortable as I expected.

I'm curious, what did your experience with STL lead you to expect?

I did it this way because TR1 regex does it that way.

Python does it that way, too, FWIW. -- Dave Abrahams Boost Consulting www.boost-consulting.com
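For reference, the convention being discussed looks like this with std::regex, the standardized descendant of TR1 regex. The example does not use xpressive itself, and the helper name submatches is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Returns every sub-match of a "key=value" string, in index order.
inline std::vector<std::string> submatches(const std::string& input) {
    static const std::regex re("(\\w+)=(\\w+)");
    std::smatch what;
    std::vector<std::string> out;
    if (std::regex_match(input, what, re)) {
        assert(what.size() == 3);  // whole match plus two groups
        for (std::size_t i = 0; i < what.size(); ++i)
            out.push_back(what[i].str());
    }
    return out;
}
```

For "key=value" the result holds the complete match at index 0, followed by "key" and "value" at indices 1 and 2, so the numbered groups in the pattern line up with their indices in the results.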

24 comments

8 participants

participants (8)

Christopher Kohlhoff
Darren Cook
Daryle Walker
David Abrahams
Eric Niebler
Felipe Magno de Almeida
Joel de Guzman
Scott Woods