
The stuff I offer is dedicated to two tasks:

* building an ANFA (that is, an Augmented NFA) from the expression tree of a given regex;
* running the result against a given input string.

What such code desperately needs is the following:

* a syntactical front-end: a class that parses the actual regex string and builds its expression tree;
* a character back-end: a class that checks whether a given character is contained in a given character set, respecting encodings, locales, etc.

Boost.Regex takes quite a general approach to these components. Reusing them and connecting my code to them is what I have in mind.
The only snag is that I'm not familiar with the Boost.Regex internals, so any help in that area would be appreciated.
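To make the "character back-end" request concrete, here is a minimal sketch of what such a class might look like. It is not a Boost.Regex type: the name `range_char_set` and its members are hypothetical, and it assumes characters have already been decoded to 32-bit code points, so it ignores the locale and collation issues a real back-end would have to handle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical character back-end (not part of Boost.Regex): a character
// set stored as sorted, non-overlapping closed ranges of code points.
class range_char_set {
public:
    using code_point = std::uint32_t;

    // Add the closed range [lo, hi], merging it with any range it
    // overlaps or directly adjoins.
    void add_range(code_point lo, code_point hi) {
        assert(lo <= hi);
        ranges_.push_back({lo, hi});
        std::sort(ranges_.begin(), ranges_.end(),
                  [](const range& a, const range& b) { return a.lo < b.lo; });

        std::vector<range> merged;
        for (const range& r : ranges_) {
            // Widen to 64 bits so that hi + 1 cannot wrap around.
            if (!merged.empty() &&
                static_cast<std::uint64_t>(merged.back().hi) + 1 >= r.lo) {
                merged.back().hi = std::max(merged.back().hi, r.hi);
            } else {
                merged.push_back(r);
            }
        }
        ranges_.swap(merged);
    }

    // Membership test by binary search over the range list.
    bool contains(code_point c) const {
        // Find the first range whose lower bound is greater than c;
        // the only candidate that can hold c is the one just before it.
        auto it = std::upper_bound(
            ranges_.begin(), ranges_.end(), c,
            [](code_point value, const range& r) { return value < r.lo; });
        if (it == ranges_.begin()) return false;
        --it;
        return c <= it->hi;
    }

    // Number of disjoint ranges currently stored.
    std::size_t range_count() const { return ranges_.size(); }

private:
    struct range {
        code_point lo;
        code_point hi;
    };
    std::vector<range> ranges_;
};
```

Storing ranges rather than one flag per character keeps the memory cost proportional to the number of ranges in the set, not to the size of the alphabet, which is one way to keep wide character sets manageable.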
The regex internals are in the process of being completely rewritten (the code is in cvs, on the regex5 branch). I hope to merge this into the main trunk in the next few weeks; mainly it's the docs that I need to bring up to date. Regex parsing and state machine construction should now be quite straightforward to understand (within limits for a regex engine, obviously!), so I would urge you to take a look (I can send you a zip if you don't have cvs access).

I think the main problem is providing the same feature set as the existing engine. My understanding is that no machine can have the complexity you claim and still match backrefs, or even, I believe, wide characters (because the character set is too large to realistically build a table-based NFA). Is that correct?

BTW, I have always thought that there was room for multiple regex engines in Boost, each offering fewer features in exchange for better worst-case performance. I suppose I should have tried to separate the parser from the back-end state machine format more, so that different engines could be plugged in at will, but there are only so many times I think I can stand to rewrite this stuff :-/

John.
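As a rough illustration of the table-size concern with wide characters, the arithmetic can be sketched as below. The function name, the alphabet constants, and the assumption of one fixed-size entry per (state, character) pair are illustrative only; they do not describe how either engine actually stores its transitions.

```cpp
#include <cstdint>

// Bytes needed for a dense transition table with one next-state entry
// per (state, character) pair. Hypothetical helper for illustration.
constexpr std::uint64_t dense_table_bytes(std::uint64_t states,
                                          std::uint64_t alphabet_size,
                                          std::uint64_t bytes_per_entry) {
    return states * alphabet_size * bytes_per_entry;
}

constexpr std::uint64_t narrow_alphabet  = 256;       // 8-bit char
constexpr std::uint64_t wide16_alphabet  = 65536;     // 16-bit wide char
constexpr std::uint64_t unicode_alphabet = 0x110000;  // all Unicode code points

// Example: a 100-state machine with 4-byte entries.
constexpr std::uint64_t narrow_bytes  = dense_table_bytes(100, narrow_alphabet, 4);
constexpr std::uint64_t wide16_bytes  = dense_table_bytes(100, wide16_alphabet, 4);
constexpr std::uint64_t unicode_bytes = dense_table_bytes(100, unicode_alphabet, 4);
```

Under these assumptions the 100-state table takes 102,400 bytes for 8-bit characters, 26,214,400 bytes (25 MiB) for 16-bit characters, and 445,644,800 bytes (425 MiB) for the full Unicode range, which is why a dense table stops being realistic once the alphabet grows.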