REGEXP - Serializing regexp ?

Hi I'm trying to use the boost-regexp package and I'm facing an issue with this. I'm basically applying a replace algo on a list of strings using a huge list of regular expressions. It seems that the boost-regexp API allows me only to either load all regular expressions in memory (i.e creating the regexp at process start) or load one regexp at a time but re-compile it each time. Is there a way to save to disk the compiled version of the regexp (so that the regexp package would not have to re-create the statemachine) ? If not, what would be the best approach for doing this ? Thanks /jog

Jacques-Olivier Goussard wrote:
Hi I'm trying to use the boost-regexp package and I'm facing an issue with this. I'm basically applying a replace algo on a list of strings using a huge list of regular expressions. It seems that the boost-regexp API allows me only to either load all regular expressions in memory (i.e creating the regexp at process start) or load one regexp at a time but re-compile it each time. Is there a way to save to disk the compiled version of the regexp (so that the regexp package would not have to re-create the statemachine) ?
No, it's something I've often wondered about, but no one has really asked for it: until now! :-) The regex state machine isn't all that easily serializable since it's basically a linked list, but more importantly it both contains locale-specific information and caches some of std::locale's facet data. In general that makes serialization dangerous in that it gives an illusion of portability that isn't really there: imagine what happens if you build a regex under one locale and read back in under another, sadly bad things will very likely happen :-(
If not, what would be the best approach for doing this ?
How desperate are you? Are there really that many regexes that loading them from their string representation is a problem?
John.
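A minimal sketch of the "compile everything once at process start" option discussed above, so that no pattern is recompiled per input string. It uses std::regex from the C++ standard library as a stand-in for Boost.Regex, and the names Rule, compile_rules and apply_rules are hypothetical, not part of either library. Timing compile_rules on the real pattern set would be one way to find out whether loading from string form is actually a problem.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// A compiled pattern paired with its replacement text (hypothetical type).
struct Rule {
    std::regex pattern;
    std::string replacement;
};

// Compile every (pattern, replacement) pair exactly once.
std::vector<Rule> compile_rules(
    const std::vector<std::pair<std::string, std::string>>& sources) {
    std::vector<Rule> rules;
    rules.reserve(sources.size());
    for (const auto& src : sources) {
        rules.push_back(Rule{std::regex(src.first), src.second});
    }
    return rules;
}

// Apply all precompiled rules, in order, to each input string.
std::vector<std::string> apply_rules(const std::vector<Rule>& rules,
                                     std::vector<std::string> inputs) {
    for (auto& text : inputs) {
        for (const auto& rule : rules) {
            text = std::regex_replace(text, rule.pattern, rule.replacement);
        }
    }
    return inputs;
}
```

The memory cost is one compiled state machine per pattern for the lifetime of the process, which is the trade-off the original question is worried about.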

Thanks for taking the time to answer John.
How desperate are you? Are there really that many regexes that loading them from their string representation is a problem?
Well, not *that* desperate - I'm looking at the existing options for now. The number of regexps is roughly 4000 currently but unconstrained a priori, and a lot of them are quite huge. Furthermore, the match is done for all those regexps on a list of strings that can contain around 1 million strings (so recompiling each regexp every time is likely to be a problem).
The problem is the following: I'm supposed to implement a match on general tokens, i.e. being able to code regexps that would contain tokens like: (CITY) where CITY is defined elsewhere as a list of possible cities. The only way I see to do this with boost-regexp is to translate those pseudo-regexps into ones containing (boston|chicago|.....) I.e. replace all references to generic tokens with their expanded value - I'm afraid that will use too much mem (if all loaded) or too much time (if recompiled each time) - unless there is a way to refer to another regexp in a regexp ? Note that I'm just foreseeing problems here - if you tell me there's no solution then I'll implement the expansion and see what the performance looks like. Just trying here to code it right the 1st time :)
On a side note:
regex under one locale and read back in under another, sadly bad things will very likely happen :-(
Not if you save the locale in the compiled version and throw if a mismatch occurs, but anyway I understand serializing is not an easy thing to do.
Cheers /jog
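The token expansion described in this message can be sketched as a plain string rewrite done before the pattern is handed to the regex library. This is only an illustration under stated assumptions: make_alternation and expand_tokens are hypothetical helper names, tokens are written exactly as "(NAME)", and the expanded words contain no regex metacharacters (nothing is escaped here).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Join alternatives into a group: {"boston", "chicago"} -> "(boston|chicago)".
// Assumption: the words contain no regex metacharacters, so none are escaped.
std::string make_alternation(const std::vector<std::string>& words) {
    std::string group = "(";
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i != 0) group += '|';
        group += words[i];
    }
    group += ')';
    return group;
}

// Replace every "(NAME)" in the pseudo-regex whose NAME is a known token.
std::string expand_tokens(
    std::string pattern,
    const std::map<std::string, std::vector<std::string>>& tokens) {
    for (const auto& entry : tokens) {
        const std::string needle = "(" + entry.first + ")";
        const std::string expansion = make_alternation(entry.second);
        std::size_t pos = 0;
        while ((pos = pattern.find(needle, pos)) != std::string::npos) {
            pattern.replace(pos, needle.size(), expansion);
            pos += expansion.size();  // skip past the text just inserted
        }
    }
    return pattern;
}
```

With a large city list every expanded pattern carries its own copy of the alternation, which is where the memory concern raised above comes from.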

Hello,
Did you consider using xpressive? It is pretty powerful. You could divide
static parts of regex(es) across compilation units, so the compilation time
will decrease, but the linking time might increase. But anyway if you know
your expressions at compile time you will probably do better with xpressive.
Give it a try.
Best Regards,
Ovanes
On 10/30/07, Jacques-Olivier Goussard
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

Jacques-Olivier Goussard wrote:
The problem is the following: I'm supposed to implement a match on general tokens, i.e. being able to code regexps that would contain tokens like: (CITY) where CITY is defined elsewhere as a list of possible cities. The only way I see to do this with boost-regexp is to translate those pseudo-regexp into ones containing (boston|chicago|.....) I.e. replace all references to generic tokens to their expanded value - I'm afraid that will use too much mem (if all loaded) or too much time (if recompiled each time) - unless there is a way to refer to another regexp in a regexp ?
I'll second the suggestion to look into xpressive. With xpressive, you can refer to one regex from another. And with the latest version (in the Boost Sandbox) you can put your list of cities into a symbol table and get very fast look-up -- much better than just a bunch of alternates.
You can read about xpressive's symbol tables here:
http://boost-sandbox.sourceforge.net/libs/xpressive/doc/html/boost_xpressive...
HTH,
-- Eric Niebler Boost Consulting www.boost-consulting.com
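For intuition on why a table look-up can beat a long alternation: a backtracking matcher generally tries the alternatives of (boston|chicago|...) one after another, while a trie walks the input once regardless of how many words are stored. The sketch below is a generic character trie written for illustration only. SymbolTrie and longest_match are hypothetical names, and nothing in this thread says how xpressive implements its symbol tables, so treat this as an assumption about the general idea rather than a description of the library.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// A plain character trie (illustrative; xpressive's own structure may differ).
class SymbolTrie {
public:
    // Store one word, creating trie nodes as needed.
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->terminal = true;
    }

    // Length of the longest stored word starting at text[pos], or 0 if none.
    std::size_t longest_match(const std::string& text, std::size_t pos) const {
        const Node* node = &root_;
        std::size_t best = 0;
        for (std::size_t i = pos; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->terminal) best = i - pos + 1;
        }
        return best;
    }

private:
    struct Node {
        bool terminal = false;
        std::map<char, std::unique_ptr<Node>> children;
    };
    Node root_;
};
```

The cost of one look-up is bounded by the length of the longest stored word, not by the number of cities in the table.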

Thanks a lot, I'll take a look - it seems very promising.
Does it support wchar_t ? Another requirement of mine I'm afraid :)
/jog

Yes. ;)

Eric Niebler wrote:
I'll second the suggestion to look into xpressive. With xpressive, you can refer to one regex from another. And with the latest version (in the Boost Sandbox) you can put your list of cities into a symbol table and get very fast look-up -- much better than just a bunch of alternates.
Ok, I've taken a look and that doesn't seem to be what I need after all. The regexps used must follow Perl semantics and are only known at runtime, so I would go for what you call dynamic expressions. However, I'm not sure I see how I can use your neat symbol table trick with those, unless they are disconnected. I.e. I can easily translate "in (CITY) on westbrook" into
    sregex leftre = as_xpr("in ");
    sregex rightre = as_xpr(" on westbrook");
    sregex re = leftre >> (a1 = city_map) >> rightre;
but what to do with "in (CITY|heaven) on westbrook" ?
/jog

Jacques-Olivier Goussard wrote:
Ok, I've taken a look and that doesn't seem to be what I need after all. The regexps used must follow Perl semantics and are only known at runtime, so I would go for what you call dynamic expressions. However, I'm not sure I see how I can use your neat symbol table trick with those, unless they are disconnected. I.e. I can easily translate "in (CITY) on westbrook" into
    sregex leftre = as_xpr("in ");
    sregex rightre = as_xpr(" on westbrook");
    sregex re = leftre >> (a1 = city_map) >> rightre;
but what to do with "in (CITY|heaven) on westbrook" ?
You're right, you can't /directly/ use symbol tables in dynamic regexes, but you can embed a static regex into a dynamic one. See the section on Dynamic Regex Grammars here: http://tinyurl.com/2nacj7.
In particular, you can:
    sregex_compiler comp;
    // static regex, creates a symbol table from city_map:
    comp["CITY"] = (a1 = city_map);
    // dynamic regex calls static regex
    sregex re = comp.compile("in ((?$CITY)|heaven) on westbrook");
HTH,
-- Eric Niebler Boost Consulting www.boost-consulting.com

Great! I think I'm all set with this.
Thanks a lot - that's an amazing package.
/jog
On 10/31/07, Eric Niebler
participants (4)
- Eric Niebler
- Jacques-Olivier Goussard
- John Maddock
- Ovanes Markarian