[regex] any interest in finite automata-based regex engine?

Hello. Are there any plans to add a finite automata-based subengine to the regex library? If there are, I can submit the algorithm/implementation of a modified nondeterministic FA which takes care of:

* greedy and lazy quantifiers,
* eagerness of the alternation operator,
* non-longest matches (`best' in the Perl-ish sense with respect to greediness/laziness/eagerness),
* substring capturing.

All these features being included, the time cost of a match is O(N(E+C)L), where N is the number of states, E is the average epsilon-closure size per state, C is the number of capturing parentheses, and L is the input length. As far as I know, that's the lowest order of time complexity in this area (corrections are welcome), though, of course, a deterministic FA would be even faster.

I suppose there's no point in adding another regex library, but being able to guarantee linear time cost in an existing one would be nice.

-- Best regards, Vladimir Pozdyayev.
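For readers unfamiliar with the approach, the core idea behind such linear-time bounds can be sketched independently of the ANFA specifics: simulate all live NFA states at once, taking epsilon closures as you go, so that no backtracking is ever needed. The following is a textbook-style illustration only; it is not the ANFA algorithm, it handles neither captures nor greedy/lazy ordering, and its instruction set is invented for the sketch:

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct Inst {
        enum Op { Char, Split, Jump, Match } op;
        char c;     // for Char
        int x, y;   // jump targets for Split/Jump
    };

    // Add state pc to a thread list, following Split/Jump edges (the
    // epsilon closure) and deduplicating, so each state appears at most
    // once per input position.
    void add_thread( const std::vector<Inst> &prog, int pc,
                     std::vector<int> &list, std::vector<char> &seen ) {
        if ( seen[ pc ] ) return;
        seen[ pc ] = 1;
        if ( prog[ pc ].op == Inst::Jump ) {
            add_thread( prog, prog[ pc ].x, list, seen );
        } else if ( prog[ pc ].op == Inst::Split ) {
            add_thread( prog, prog[ pc ].x, list, seen );
            add_thread( prog, prog[ pc ].y, list, seen );
        } else {
            list.push_back( pc );
        }
    }

    // True iff the whole input matches: each input character is examined
    // once per live state, never revisited by backtracking.
    bool run( const std::vector<Inst> &prog, const std::string &input ) {
        std::vector<int> cur, next;
        std::vector<char> seen( prog.size(), 0 );
        add_thread( prog, 0, cur, seen );
        for ( std::size_t i = 0; i < input.size(); ++i ) {
            next.clear();
            std::fill( seen.begin(), seen.end(), 0 );
            for ( std::size_t t = 0; t < cur.size(); ++t )
                if ( prog[ cur[ t ] ].op == Inst::Char &&
                     prog[ cur[ t ] ].c == input[ i ] )
                    add_thread( prog, cur[ t ] + 1, next, seen );
            cur.swap( next );
        }
        for ( std::size_t t = 0; t < cur.size(); ++t )
            if ( prog[ cur[ t ] ].op == Inst::Match ) return true;
        return false;
    }

    int main() {
        // Program for "a*b":
        //   0: split 1, 3    1: char 'a'    2: jump 0
        //   3: char 'b'      4: match
        std::vector<Inst> prog( 5 );
        prog[ 0 ].op = Inst::Split; prog[ 0 ].x = 1; prog[ 0 ].y = 3;
        prog[ 1 ].op = Inst::Char;  prog[ 1 ].c = 'a';
        prog[ 2 ].op = Inst::Jump;  prog[ 2 ].x = 0;
        prog[ 3 ].op = Inst::Char;  prog[ 3 ].c = 'b';
        prog[ 4 ].op = Inst::Match;
        return run( prog, "aaab" ) ? 0 : 1;   // matches, so returns 0
    }

Read against this sketch, the O(N(E+C)L) bound above is plausibly the cost of each of the N states traversing its epsilon closure (size E) and copying C capture positions at each of the L input positions.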

Vladimir Pozdyayev wrote:
Hello.
Are there any plans to add a finite automata-based subengine to the regex library? If there are, I can submit the algorithm/implementation of a modified nondeterministic FA which takes care of:

* greedy and lazy quantifiers,
* eagerness of the alternation operator,
* non-longest matches (`best' in the Perl-ish sense with respect to greediness/laziness/eagerness),
* substring capturing.

All these features being included, the time cost of a match is O(N(E+C)L), where N is the number of states, E is the average epsilon-closure size per state, C is the number of capturing parentheses, and L is the input length.
Interesting. Is the algorithm from some published paper that you could link/post? Or do you have a formal description? - Volodya

-----Original Message-----
From: boost-bounces@lists.boost.org [mailto:boost-bounces@lists.boost.org] On Behalf Of Vladimir Pozdyayev
Sent: Monday, October 25, 2004 2:09 AM
To: boost@lists.boost.org
Subject: [boost] [regex] any interest in finite automata-based regex engine?
Hello.
Are there any plans to add a finite automata-based subengine to the regex library? If there are, I can submit the algorithm/implementation of a modified nondeterministic FA which takes care of:

* greedy and lazy quantifiers,
* eagerness of the alternation operator,
* non-longest matches (`best' in the Perl-ish sense with respect to greediness/laziness/eagerness),
* substring capturing.

All these features being included, the time cost of a match is O(N(E+C)L), where N is the number of states, E is the average epsilon-closure size per state, C is the number of capturing parentheses, and L is the input length. As far as I know, that's the lowest order of time complexity in this area (corrections are welcome), though, of course, a deterministic FA would be even faster.
I suppose there's no point in adding another regex library, but being able to guarantee linear time cost in an existing one would be nice.
Very nice. I've banged up against that one painfully recently (not with Boost's regex engine, be it noted <g>), and frankly, that's quite a guarantee given the features you mention. Reid

Vladimir Pozdyayev wrote:
Hello.
Are there any plans to add a finite automata-based subengine to the regex library? If there are, I can submit the algorithm/implementation of a modified nondeterministic FA which takes care of:

* greedy and lazy quantifiers,
* eagerness of the alternation operator,
* non-longest matches (`best' in the Perl-ish sense with respect to greediness/laziness/eagerness),
* substring capturing.

All these features being included, the time cost of a match is O(N(E+C)L), where N is the number of states, E is the average epsilon-closure size per state, C is the number of capturing parentheses, and L is the input length. As far as I know, that's the lowest order of time complexity in this area (corrections are welcome), though, of course, a deterministic FA would be even faster.
I suppose there's no point in adding another regex library, but being able to guarantee linear time cost in an existing one would be nice.
Interesting. You say it supports substring capturing. Does it also support backreferences? By which I mean, can you refer to the captured substrings from within the pattern itself, as in the perl regex /(.)\1/?

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

Interesting. You say it supports substring capturing. Does it also support backreferences? By which I mean, can you refer to the captured substrings from within the pattern itself, as in the perl regex /(.)\1/?
Not unless he's proved that NP=P :-) John.

To Vladimir Prus:
Interesting. Is the algorithm from some published paper that you could link/post? Or do you have a formal description?
The paper describing this algorithm has been submitted for publication (offline) but is not out yet. I am considering ways to publish another version online; can you advise me on the best way to do it (other than creating & promoting my own site :) )?

To Reid Sweatman:
Very nice. I've banged up against that one painfully recently (not with Boost's regex engine, be it noted <g>), and frankly, that's quite a guarantee given the features you mention.
Well, to begin with, I developed this one without boost.regex in mind. It's just that I don't feel like creating a whole new regex library from scratch, and regex seems to be the right place to fit this in.

To Eric Niebler:
Interesting. You say it supports substring capturing. Does it also support backreferences?
Unfortunately, no. It's been proven somewhere out there that regex matching with backrefs is NP-hard, so, I guess, no linear guarantee ever. <sob>

-- Best regards, Vladimir Pozdyayev.

I'd like to shed some light on a couple of points. The stuff I offer is dedicated to two tasks:

* building an ANFA (that's Augmented NFA) from an expression tree of a given regex;
* running the result against a given input string.

What such code desperately needs is the following:

* a syntactical front-end: a class that would parse the actual regex string and build its expression tree;
* a character back-end: a class that would allow checking whether a given character is contained in a given character set, respecting encodings, locales, etc.

Boost.regex employs quite a general approach to these components. Reusing them and connecting my code to them is what I have in mind. The only snag is, I'm not familiar with boost.regex internals. So, any help in that field would be appreciated.

-- Best regards, Vladimir Pozdyayev.

The stuff I offer is dedicated to two tasks:

* building an ANFA (that's Augmented NFA) from an expression tree of a given regex;
* running the result against a given input string.

What such code desperately needs is the following:

* a syntactical front-end: a class that would parse the actual regex string and build its expression tree;
* a character back-end: a class that would allow checking whether a given character is contained in a given character set, respecting encodings, locales, etc.

Boost.regex employs quite a general approach to these components. Reusing them and connecting my code to them is what I have in mind.
The only snag is, I'm not familiar with boost.regex internals. So, any help in that field would be appreciated.
The regex internals are in the process of being completely rewritten (code is in cvs in the regex5 branch), I hope to merge this to the main trunk in the next few weeks: mainly it's the docs that I need to bring up to date. Regex parsing and state machine construction should now be quite straightforward to understand (within limits for a regex engine obviously!), so I would urge you to take a look (I can send you a zip if you don't have cvs access).

I think the main problem is providing the same feature set as the existing engine - my understanding is that no machine can have the complexity you claim and still match backrefs, or even I believe wide characters (because the character set is too large to realistically build a table based NFA). Is that correct?

BTW, I have always thought that there was room for multiple regex engines in Boost that would offer increasingly fewer features, but gain in worst-case performance. I suppose I should have tried to separate the parser from the back-end state machine format more, so that different engines can be plugged in at will, but there are only so many times I think I can stand to rewrite this stuff :-/

John.

I suppose I should have tried to separate the parser from the back-end state machine format more, so that different engines can be plugged in at will, but there are only so many times I think I can stand to rewrite this stuff :-/
On the other hand, and on reflection: take a look at the public interfaces for perl_matcher and basic_regex_creator in the new code; if your state machine code could emulate these interfaces, then it might be possible to slot it straight in as a possible alternative (with a bit of traits class magic to choose which back end actually gets used). John.

JM> The regex internals are in the process of being completely rewritten (code
JM> is in cvs in the regex5 branch), I hope to merge this to the main trunk in
JM> the next few weeks: mainly it's the docs that I need to bring up to date.
JM> Regex parsing and state machine construction should now be quite
JM> straightforward to understand (within limits for a regex engine obviously!),
JM> so I would urge you to take a look (I can send you a zip if you don't have
JM> cvs access).

Which cvs do you mean? I suppose it's not the main boost cvs, as I can't see anything named "regex5" there. I think the easiest way would be mailing a zip, yes. Though the cvs address would do, if there's anonymous access.

JM> I think the main problem is providing the same feature set as the existing
JM> engine - my understanding is that no machine can have the complexity you
JM> claim and still match backrefs, or even I believe wide characters (because
JM> the character set is too large to realistically build a table based NFA).
JM> Is that correct?

There are no backrefs, of course. Speaking of wide characters, though, they do not cause that much trouble: character sets do not have to be bit vectors or anything like that. Any functional object that takes a character and returns a boolean would be OK. This way charsets can even use locale-related system calls and the like; let's call them `algorithmical' charsets, as opposed to potentially huge table-based ones.

-- Best regards, Vladimir Pozdyayev.
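To make that concrete, here is a minimal sketch of what such a predicate-based charset could look like (the names char_set, alpha_set, and literal_set are invented for the illustration; the actual ANFA sources may differ):

    #include <locale>
    #include <string>

    // A character class is just a predicate over characters, so
    // wide-character sets need no tables at all.
    template <class CharT>
    struct char_set {
        virtual ~char_set() {}
        virtual bool contains( CharT c ) const = 0;   // the whole interface
    };

    // A set defined by a locale query, e.g. "all alphabetic characters":
    // hopeless as an explicit table over wchar_t, trivial as a predicate.
    struct alpha_set : char_set<wchar_t> {
        std::locale loc;
        explicit alpha_set( const std::locale &l = std::locale() ) : loc( l ) {}
        bool contains( wchar_t c ) const { return std::isalpha( c, loc ); }
    };

    // A small, explicitly enumerated set.
    struct literal_set : char_set<wchar_t> {
        std::wstring chars;
        explicit literal_set( const std::wstring &s ) : chars( s ) {}
        bool contains( wchar_t c ) const
            { return chars.find( c ) != std::wstring::npos; }
    };

The matcher then only ever asks contains( c ) and never needs to know how big the alphabet is.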

Which cvs do you mean? I suppose it's not the main boost cvs, as I can't see anything named "regex5" there. I think the easiest way would be mailing a zip, yes. Though the cvs address would do, if there's anonymous access.
Yes, the main Boost cvs repository: check out the current development code, then

    cvs update -r regex5 boost/regex libs/regex

But I'll send you the zip privately. John.

John, I've studied the interfaces you suggested, and here are my observations. Please correct me if I'm wrong.

<perl_matcher> is a collection of algorithms which sort of hack into the underlying <basic_regex>'s internal data structures and use them to perform matching or whatever they're up to. <basic_regex_creator> is a "syntax features to internals" converter which is called directly by the parser. In theory, implementing it should be the better way to initialize customized structures. However, the way it is used is somewhat tricky: it fills not its own data structures, but structures of the class that called the parser itself, the <basic_regex_implementation>. So this one would need reimplementing, too.

Now for the real problem. Both <perl_matcher> and <basic_regex_creator> deal with the already compiled state machine or its elements. The first one works directly with the regex internals; the second one gets <append_state> and similar calls from the parser. The trouble is, some of my algorithms' calculations have to be performed directly on the expression tree; the compiled state machine won't help. Is it possible to restore the tree from the information provided by the library? That is, given the regex "((?:a|b)*?)(b+)", end up with an object like

    new cat( new match( new kleene_lazy( new alt( new charset( "a" ),
                                                  new charset( "b" ) ) ) ),
             new match( new repeat( new charset( "b" ) ) ) )

And now for something completely different. The following program outputs ' aa', where the first char is \0. If we replace <smatch> by <cmatch>, the output is ok. That holds for the regex5 as well as the regex library in boost 1.31.0. Am I missing something? (MSVC 7.0)

    #include <boost/regex.hpp>
    #include <iostream>

    int main() {
        boost::smatch m;
        boost::regex_match( "aaa", m, boost::regex( ".*" ) );
        std::cout << m[ 0 ] << "\n";
    }

-- Best regards, Vladimir Pozdyayev.

Vladimir Pozdyayev wrote:
some of my algorithms' calculations have to be performed directly on the expression tree; the compiled state machine won't help. Is it possible to restore the tree from the information provided by the library? That is, given the regex "((?:a|b)*?)(b+)", end up with an object like
    new cat( new match( new kleene_lazy( new alt( new charset( "a" ),
                                                  new charset( "b" ) ) ) ),
             new match( new repeat( new charset( "b" ) ) ) )
FYI, xpressive builds such a parse tree (see http://boost-sandbox.sf.net/libs/xpressive). And xpressive's compile-time design allows for multiple back-ends (NFA, backtracking NFA, DFA) without paying for ones that aren't used. However, xpressive is hard-coded to use a backtracking NFA for now, and there is no clean interface for plugging in a different back-end. It is also not as mature or as widely used as Boost.Regex. Feel free to contact me offline if you want more information.
And now for something completely different.
The following program outputs ' aa', where the first char is \0. If we replace <smatch> by <cmatch>, the output is ok. That holds for the regex5 as well as the regex library in boost 1.31.0. Am I missing something? (MSVC 7.0)
    #include <boost/regex.hpp>
    #include <iostream>

    int main() {
        boost::smatch m;
        boost::regex_match( "aaa", m, boost::regex( ".*" ) );
        std::cout << m[ 0 ] << "\n";
    }
I think there are two overloads of the regex_match algorithm coming into play here. They are:

    bool regex_match( std::string const &, smatch &, regex const & );
    bool regex_match( char const *, cmatch &, regex const & );

When you pass an smatch, the first overload gets chosen, so the string literal is converted into a std::string temporary, which gets destroyed when the regex_match algorithm returns. That leaves the smatch object holding invalid iterators, which is why you are seeing garbage. When you pass a cmatch, the second overload gets chosen, and no temporary string object is created: the cmatch object will hold pointers into the string literal.

HTH

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

I've studied the interfaces you suggested, and here are my observations. Please correct me if I'm wrong.
<perl_matcher> is a collection of algorithms which sort of hack into the underlying <basic_regex>'s internal data structures and use them to perform matching or whatever they're up to.
Yes, it's responsible for the actual matching.
<basic_regex_creator> is a "syntax features to internals" converter which is called directly by the parser. In theory, implementing it should be the better way to initialize customized structures. However, the way it is used is somewhat tricky: it fills not its own data structures, but structures of the class that called the parser itself, the <basic_regex_implementation>. So this one would need reimplementing, too.
Now for the real problem. Both <perl_matcher> and <basic_regex_creator> deal with the already compiled state machine or its elements. The first one works directly with the regex internals; the second one gets <append_state> and similar calls from the parser. The trouble is, some of my algorithms' calculations have to be performed directly on the expression tree; the compiled state machine won't help. Is it possible to restore the tree from the information provided by the library? That is, given the regex "((?:a|b)*?)(b+)", end up with an object like
    new cat( new match( new kleene_lazy( new alt( new charset( "a" ),
                                                  new charset( "b" ) ) ) ),
             new match( new repeat( new charset( "b" ) ) ) )
I see what you mean, no, there never is a parse tree like that: it's never been necessary (until now obviously).
And now for something completely different.
The following program outputs ' aa', where the first char is \0. If we replace <smatch> by <cmatch>, the output is ok. That holds for the regex5 as well as the regex library in boost 1.31.0. Am I missing something? (MSVC 7.0)
    #include <boost/regex.hpp>
    #include <iostream>

    int main() {
        boost::smatch m;
        boost::regex_match( "aaa", m, boost::regex( ".*" ) );
        std::cout << m[ 0 ] << "\n";
    }
Well, smatch is the wrong type to use in this situation: std::string::const_iterator and const char* are not the same type; you should be using match_results<const char*> in this case, which is the same type as the typedef cmatch (I hope that makes sense). For conforming compilers, the code you posted does not compile (which is the correct behaviour): but some workarounds for bugs present in VC6 and VC7 cause a temporary string to be created in this case, and the call to go through an overload that ideally would not have been found, so the iterators you get back are iterators into a string that's already been destroyed. John.
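For the record, here are two ways to rewrite the snippet so the match results never outlive the text they point into (both use the ordinary Boost.Regex overloads Eric listed above):

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>

    int main() {
        // Match the literal directly with cmatch, i.e.
        // match_results<const char*>: the results point into the
        // literal itself, which outlives them.
        boost::cmatch cm;
        boost::regex_match( "aaa", cm, boost::regex( ".*" ) );
        std::cout << cm[ 0 ] << "\n";

        // Or keep smatch, but bind the text to a named std::string
        // whose lifetime covers every use of the results.
        std::string s( "aaa" );
        boost::smatch sm;
        boost::regex_match( s, sm, boost::regex( ".*" ) );
        std::cout << sm[ 0 ] << "\n";
    }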

To Eric Niebler and John Maddock on the subject of "something completely different":

Oh well. Mea culpa. Thanks.

And now for something more completely different.

JM> I forgot to say: without understanding the algorithm you are using, and the
JM> data it needs, it's hard to know whether or not Boost.Regex can support it.
JM> I think you may have been asked this already, but do you have a description
JM> anywhere of the algorithm you are using?

http://groups.yahoo.com/group/boost/files/anfa-regex/

EN> ... xpressive's
EN> compile-time design allows for multiple back-ends (NFA, backtracking
EN> NFA, DFA) without paying for ones that aren't used. However, xpressive
EN> is hard-coded to use a backtracking NFA for now, and there is no clean
EN> interface for plugging in a different back-end...

Maybe it's a good time to officially separate the regex components from each other and define the way they should interact, thus allowing users to connect them as they prefer? I can name three of them: the syntax parser, the state machine, and character sets. The first two interact through either tree data structures (the passive/declarative way), or callbacks imitating the parser's shift-reduce activity (the active/imperative way; the better one, I think). The other two interact through "bool contains( char_t c )" (and, possibly, additional functions to deal with more subtle stuff). I understand that some algorithms could prefer to treat those three as if they were closely related... well, whether that will become a problem is a question of design. Opinions on this, as well as on the whole subject, are welcome.

-- Best regards, Vladimir Pozdyayev.

Vladimir Pozdyayev wrote:
I had a very cursory look at the code. It appears to do lots of vector::push_back calls and to make lots of virtual calls while matching a pattern. You commented on the algorithmic complexity, but have you benchmarked against Boost.regex? I'd be very curious to see how your back-end stacks up performance-wise.

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

Eric Niebler wrote:

EN> ... It appears to do lots of
EN> vector::push_back calls and to make lots of virtual calls while matching a
EN> pattern. You commented on the algorithmic complexity, but have you
EN> benchmarked against Boost.regex? I'd be very curious to see how your
EN> back-end stacks up performance-wise.

The code does _no_ virtual calls while matching (<operator()>). It's the regex compilation (<build(node*)>) that uses them. And <push_back>s are rather cheap when the memory is already allocated by earlier <resize>'s and the like, aren't they?

Oh, and benchmarking is still on my TODO list. I intend to do it as soon as the syntax parser is able to accept the benchmarking tests.

-- Best regards, Vladimir Pozdyayev.

Thanks, I think I can see where you're coming from now, but it's going to take me a while to digest this.
EN> ... xpressive's
EN> compile-time design allows for multiple back-ends (NFA, backtracking
EN> NFA, DFA) without paying for ones that aren't used. However, xpressive
EN> is hard-coded to use a backtracking NFA for now, and there is no clean
EN> interface for plugging in a different back-end...
Maybe it's a good time to officially separate the regex components from each other and define the way they should interact, thus allowing users to connect them as they prefer?
I can name three of them: the syntax parser, the state machine, and character sets. The first two interact through either tree data structures (the passive/declarative way), or callbacks imitating the parser's shift-reduce activity (the active/imperative way; the better one, I think). The other two interact through "bool contains( char_t c )" (and, possibly, additional functions to deal with more subtle stuff). I understand that some algorithms could prefer to treat those three as if they were closely related... well, whether that will become a problem is a question of design. Opinions on this, as well as on the whole subject, are welcome.
The character set functionality is already separated out into the regex traits classes: the new style traits class design in the regex5 code is a lot better defined and simplified compared to the old design, BTW. The problem with separating the parser out is that it can often benefit from linkage to the state machine format: it needs to know what "kinds" of states are supported - whether the state machine uses a "complex" or "reduced" instruction set (for example, whether to expand +, ?, and {a,b} into just | and * operations, as most DFA implementations require; I haven't looked to see if yours does, though). I'm not saying it's impossible, just difficult; it's the details that really get you ;-) John.
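To illustrate the expansion John mentions, here is a sketch that reduces any counted repeat to concatenation, alternation with the empty expression, and plain Kleene star; it rewrites pattern strings purely for compactness (a real compiler would rewrite tree nodes instead), and expand_repeat is a made-up name:

    #include <string>

    // e is assumed to be self-delimiting, e.g. "(?:a|b)" or a single
    // character. hi == -1 means "unbounded".
    std::string expand_repeat( const std::string &e, int lo, int hi ) {
        std::string out;
        for ( int i = 0; i < lo; ++i )
            out += e;                       // the lo mandatory copies
        if ( hi < 0 )
            out += e + "*";                 // e{lo,}  ->  e...e e*
        else
            for ( int i = lo; i < hi; ++i )
                out += "(?:" + e + "|)";    // each extra copy: e? == (e|empty)
        return out;
    }

    // expand_repeat( "a", 1, -1 )  ->  "aa*"              (a+)
    // expand_repeat( "a", 0,  1 )  ->  "(?:a|)"           (a?)
    // expand_repeat( "a", 2,  4 )  ->  "aa(?:a|)(?:a|)"   (a{2,4})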

John Maddock wrote:

JM> The problem with separating the parser out is that it can often benefit
JM> from linkage to the state machine format: it needs to know what "kinds" of
JM> states are supported - whether the state machine uses a "complex" or
JM> "reduced" instruction set

This is why I suggested that the better way (I do not claim it's the best way, though) of connecting the parser to the state machine is by having the regex_creator perform the shift-reduce activity. These actions must be present somewhere, in some form, in any case, and moving them to the regex_creator allows us to get by without creating additional structures (read: trees) if we don't need them explicitly.

The following sample creates a tree as the parser calls the creator's feature functions. Can your regex compilation process be represented this way?

    class abstract_regex_creator {
    public:
        void charset( const charset & ) { throw exception_unsupported; }
        void repeat( int, int, bool ) { throw exception_unsupported; }
        void cat() { throw exception_unsupported; }
        // ...
    };

    class sample_regex_creator: public abstract_regex_creator {
    protected:
        // declare node_stack here
    public:
        void charset( const charset &cs ) {
            // push the charset node onto the stack
        }
        void repeat( int lo, int hi, bool lazy ) {
            // pop the top node N, repeat it (expanding into whatever
            // subexpression is necessary), and push the resulting node back
        }
        void cat() {
            // pop two nodes, push the cat node back
        }
        // and so on; if any regex feature is not supported,
        // just don't mention it here, so it will throw an exception
    };

JM> (for example, whether to expand +, ?, and {a,b}
JM> into just | and * operations, as most DFA implementations require;
JM> I haven't looked to see if yours does, though).

Types of regex elements that are directly supported are defined in "anfa-nodes.h": null/charset leaves; +, +?, ?, ??, () unary operations; concatenation and alternation. * (*?) is expanded into + and ? (+? and ??). {a,b} is not in the source yet; it has to be expanded into a sequence of ?'s.

-- Best regards, Vladimir Pozdyayev.
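For illustration, here is the postfix (shift-reduce) callback sequence a parser might emit against such a creator for the regex "(a|b)+c"; alt() and group() are assumed to be among the members elided by the "// ..." above, and charset is assumed constructible from a string:

    void compile_sample( sample_regex_creator &c ) {
        c.charset( charset( "a" ) );   // push  a
        c.charset( charset( "b" ) );   // push  b
        c.alt();                       // pop both, push  a|b
        c.group( 1 );                  // wrap the top node: capture #1
        c.repeat( 1, -1, false );      // greedy +  (lo = 1, hi = -1: unbounded)
        c.charset( charset( "c" ) );   // push  c
        c.cat();                       // pop two nodes, push  (a|b)+c
    }

Any feature the back-end does not support then fails at regex-compile time with exception_unsupported, exactly as the abstract base intends.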

This is why I suggested that the better way (I do not claim it's the best way, though) of connecting the parser to the state machine is by having the regex_creator perform the shift-reduce activity. These actions must be present somewhere, in some form, in any case, and moving them to the regex_creator allows us to get by without creating additional structures (read: trees) if we don't need them explicitly.
The following sample creates a tree as the parser calls the creator's feature functions. Can your regex compilation process be represented this way?
    class abstract_regex_creator {
    public:
        void charset( const charset & ) { throw exception_unsupported; }
        void repeat( int, int, bool ) { throw exception_unsupported; }
        void cat() { throw exception_unsupported; }
        // ...
    };
I think I quite like that idea, although it would require quite a bit of rewriting on my part to support it. I'm also concerned that there may be some unseen showstoppers hiding somewhere, but that may prove to be unfounded. Right now, I think I really need to concentrate on getting another release out; it's already slipped by 9 months or so, and I simply must get my head down and write some code! This is an area I want to explore though: if I can get this next lot of code out the door, then I'll create a cvs branch to experiment with this; if you want to suggest / experiment with a design for the abstract creator in the meantime, then go for it. John.

On Tue, 2 Nov 2004 10:11:45 +0300, Vladimir Pozdyayev <ardatur@mail.ru> wrote:
To Eric Niebler and John Maddock on the subject of "something completely different":
Oh well. Mea culpa. Thanks.
And now for something more completely different.
JM> I forgot to say: without understanding the algorithm you are using, and the
JM> data it needs, it's hard to know whether or not Boost.Regex can support it.
JM> I think you may have been asked this already, but do you have a description
JM> anywhere of the algorithm you are using?
http://groups.yahoo.com/group/boost/files/anfa-regex/
EN> ... xpressive's
EN> compile-time design allows for multiple back-ends (NFA, backtracking
EN> NFA, DFA) without paying for ones that aren't used. However, xpressive
EN> is hard-coded to use a backtracking NFA for now, and there is no clean
EN> interface for plugging in a different back-end...
Maybe it's a good time to officially separate the regex components from each other and define the way they should interact, thus allowing users to connect them as they prefer?
I can name three of them: the syntax parser, the state machine, and character sets. The first two interact through either tree data structures (the passive/declarative way), or callbacks imitating the parser's shift-reduce activity (the active/imperative way; the better one, I think). The other two interact through "bool contains( char_t c )" (and, possibly, additional functions to deal with more subtle stuff). I understand that some algorithms could prefer to treat those three as if they were closely related... well, whether that will become a problem is a question of design. Opinions on this, as well as on the whole subject, are welcome.
For what it's worth, I think that's a really promising idea. As is, the library is a black box. Nobody can introduce a new syntax parser without access to the library internals; this is directly contrary to the way that the STL separates containers and algorithms. The question is whether it's possible to define an interface, or a family of interfaces, for creating and executing NDFAs. (Or maybe even DFAs in some simple cases?) Certainly the fact that better algorithms are possible for regexes that don't have backreferences than for regexes that do is interesting; to me, at least, this suggests that we might want to provide both algorithms and let users choose whichever one is appropriate. --Matt

For what it's worth, I think that's a really promising idea. As is, the library is a black box. Nobody can introduce a new syntax parser without access to the library internals; this is directly contrary to the way that the STL separates containers and algorithms.
If you want to be able to access the state machine internals for a particular implementation, that's probably not too hard: in fact, it's on my Todo list. The hard part is defining an abstract state machine interface that can be used by any state machine implementation rather than just one specific one. In fact, I once tried to do this and gave up after about 6 months (wasn't really getting anywhere), so I guess you could say I have history which colors my perspective here :-/ I am in violent agreement that this is a good idea though; it's just getting it done that's hard. John.
The question is whether it's possible to define an interface, or a family of interfaces, for creating and executing NDFAs. (Or maybe even DFAs in some simple cases?) Certainly the fact that better algorithms are possible for regexes that don't have backreferences than for regexes that do is interesting; to me, at least, this suggests that we might want to provide both algorithms and let users choose whichever one is appropriate.
--Matt

Now for the real problem. Both <perl_matcher> and <basic_regex_creator> deal with the already compiled state machine or its elements. The first one works directly with the regex internals; the second one gets <append_state> and similar calls from the parser. The trouble is, some of my algorithms' calculations have to be performed directly on the expression tree; the compiled state machine won't help. Is it possible to restore the tree from the information provided by the library? That is, given the regex "((?:a|b)*?)(b+)", end up with an object like
I forgot to say: without understanding the algorithm you are using, and the data it needs, it's hard to know whether or not Boost.Regex can support it. I think you may have been asked this already, but do you have a description anywhere of the algorithm you are using? John.
participants (6)

- Eric Niebler
- John Maddock
- Matt Austern
- Reid Sweatman
- Vladimir Pozdyayev
- Vladimir Prus