REGEXP - Serializing regexp ? - Boost-users

REGEXP - Serializing regexp ?

Jacques-Olivier Goussard

30 Oct 2007

2:37 p.m.

Hi,

I'm trying to use the boost-regexp package and I'm facing an issue with it. I'm basically applying a replace algorithm to a list of strings using a huge list of regular expressions. It seems that the boost-regexp API only allows me either to load all regular expressions in memory (i.e. create the regexps at process start) or to load one regexp at a time and recompile it each time.

Is there a way to save the compiled version of a regexp to disk (so that the regexp package would not have to re-create the state machine)? If not, what would be the best approach for doing this?

Thanks
/jog


John Maddock

30 Oct 2007

5:17 p.m.

Jacques-Olivier Goussard wrote:

> Hi I'm trying to use the boost-regexp package and I'm facing an issue with this. I'm basically applying a replace algo on a list of strings using a huge list of regular expressions. It seems that the boost-regexp API allows me only to either load all regular expressions in memory (i.e. creating the regexp at process start) or load one regexp at a time but re-compile it each time. Is there a way to save to disk the compiled version of the regexp (so that the regexp package would not have to re-create the state machine) ?

No, it's something I've often wondered about, but no one has really asked for it: until now! :-)

The regex state machine isn't all that easily serializable since it's basically a linked list, but more importantly it both contains locale-specific information and caches some of std::locale's facet data. In general that makes serialization dangerous in that it gives an illusion of portability that isn't really there: imagine what happens if you build a regex under one locale and read it back in under another. Sadly, bad things will very likely happen :-(

> If not, what would be the best approach for doing this ?

How desperate are you? Are there really that many regexes that loading them from their string representation is a problem?

John.
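
As a rough illustration of the "load them from their string representation" option John refers to, the sketch below compiles each pattern once at startup and then reuses the compiled objects for every input string. It uses std::regex as a stand-in for boost::regex so that it builds without Boost; the Rule struct and the compile_rules and apply_rules helpers are hypothetical names, not part of either library.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// One rewrite rule: a compiled pattern plus its replacement text.
struct Rule {
    std::regex pattern;
    std::string replacement;
};

// Compile every (pattern, replacement) pair from its string form exactly once.
std::vector<Rule> compile_rules(
    const std::vector<std::pair<std::string, std::string>>& sources) {
    std::vector<Rule> rules;
    rules.reserve(sources.size());
    for (const auto& src : sources) {
        rules.push_back(Rule{std::regex(src.first), src.second});
    }
    return rules;
}

// Apply all rules, in order, to each input string and return the results.
// No pattern is recompiled here, however many strings are processed.
std::vector<std::string> apply_rules(const std::vector<Rule>& rules,
                                     std::vector<std::string> inputs) {
    for (auto& text : inputs) {
        for (const auto& rule : rules) {
            text = std::regex_replace(text, rule.pattern, rule.replacement);
        }
    }
    return inputs;
}
```

Whether the memory cost of keeping every compiled pattern alive is acceptable depends on the pattern set, so this is something to measure rather than assume.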

Jacques-Olivier Goussard

5:46 p.m.

Thanks for taking the time to answer, John.

> How desperate are you? Are there really that many regexes that loading them from their string representation is a problem?

Well, not *that* desperate - I'm looking at the existing options for now. The number of regexps is roughly 4000 currently but unconstrained a priori, and a lot of them are quite huge. Furthermore, the match is done for all those regexps on a list of strings that can contain around 1 million strings (so recompiling the regexp each time is likely to be a problem).

The problem is the following: I'm supposed to implement a match on general tokens, i.e. being able to code regexps that would contain tokens like (CITY), where CITY is defined elsewhere as a list of possible cities. The only way I see to do this with boost-regexp is to translate those pseudo-regexps into ones containing (boston|chicago|.....), i.e. replace all references to generic tokens with their expanded value. I'm afraid that will use too much memory (if all loaded) or too much time (if recompiled each time) - unless there is a way to refer to another regexp in a regexp ?

Note that I'm just foreseeing problems here - if you tell me there's no solution then I'll implement the expansion and see what the performance looks like. Just trying here to code it right the 1st time :)

On a side note:

> ... regex under one locale and read back in under another, sadly bad things will very likely happen :-(

Not if you save the locale in the compiled version and throw if a mismatch occurs, but anyway I understand serializing is not an easy thing to do.

Cheers
/jog
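
The expansion fallback described in the message above (replacing a token such as CITY with its list of alternatives before compiling) could look roughly like the sketch below. It is only an illustration under stated assumptions: std::regex stands in for boost::regex, TokenTable, expand_tokens and compile_with_tokens are hypothetical names, token names are matched as plain substrings, and the alternatives are assumed to contain no regex metacharacters.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token table: token name -> list of literal alternatives.
using TokenTable = std::map<std::string, std::vector<std::string>>;

// Replace every occurrence of a token name (e.g. CITY) in the pattern with a
// non-capturing alternation such as (?:boston|chicago).
std::string expand_tokens(std::string pattern, const TokenTable& tokens) {
    for (const auto& entry : tokens) {
        const std::string& name = entry.first;
        assert(!name.empty());  // an empty name would match at every position

        // Build the alternation text for this token.
        std::string alternation = "(?:";
        for (std::size_t i = 0; i < entry.second.size(); ++i) {
            if (i != 0) alternation += '|';
            alternation += entry.second[i];
        }
        alternation += ')';

        // Substitute each occurrence, resuming the search after the inserted text.
        for (std::size_t pos = pattern.find(name); pos != std::string::npos;
             pos = pattern.find(name, pos + alternation.size())) {
            pattern.replace(pos, name.size(), alternation);
        }
    }
    return pattern;
}

// Expand the tokens, then compile the result once so it can be reused.
std::regex compile_with_tokens(const std::string& pattern,
                               const TokenTable& tokens) {
    return std::regex(expand_tokens(pattern, tokens));
}
```

With a table mapping CITY to boston and chicago, "in (CITY|heaven) on westbrook" expands to "in ((?:boston|chicago)|heaven) on westbrook". The size of the expanded pattern grows with the token list, which is exactly the memory concern raised in the message.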

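The locale check suggested in the side note can be tried on the pattern source rather than on the compiled state machine (which, per John's reply, Boost.Regex does not serialize). A minimal sketch, assuming a hypothetical two-line record format (locale name, then pattern) and using std::regex in place of boost::regex; save_pattern and load_pattern are illustrative names only:

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>

// Build a record holding the locale name on the first line and the pattern
// source on the second. The pattern must not contain a newline.
std::string save_pattern(const std::string& pattern, const std::locale& loc) {
    assert(pattern.find('\n') == std::string::npos);
    return loc.name() + "\n" + pattern + "\n";
}

// Parse a record, refuse it if it was saved under a different locale, and
// otherwise recompile the pattern under the current locale.
std::regex load_pattern(const std::string& record, const std::locale& current) {
    std::istringstream in(record);
    std::string saved_locale;
    std::string pattern;
    if (!std::getline(in, saved_locale) || !std::getline(in, pattern)) {
        throw std::runtime_error("malformed pattern record");
    }
    if (saved_locale != current.name()) {
        throw std::runtime_error("locale mismatch: record saved under '" +
                                 saved_locale + "'");
    }
    std::regex re;
    re.imbue(current);   // set the locale before compiling
    re.assign(pattern);  // recompilation still happens here
    return re;
}
```

This only guards against loading patterns under the wrong locale; it does not avoid the compilation cost that motivated the original question.
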
Ovanes Markarian

6:06 p.m.

Hello,

did you consider using xpressive? It is pretty powerful. You could divide static parts of the regex(es) across compilation units, so the compilation time will decrease, but the linking time might increase. But anyway, if you know your expressions at compile time you will probably do better with xpressive. Give it a try.

Best Regards,
Ovanes

_______________________________________________
Boost-users mailing list
Boost-users@lists.boost.org
http://lists.boost.org/mailman/listinfo.cgi/boost-users

Eric Niebler

7:08 p.m.

Jacques-Olivier Goussard wrote:

> Well, not *that* desperate - I'm looking at the existing options for now. The number of regexps is roughly 4000 currently but unconstrained a priori, and a lot of them are quite huge. Furthermore, the match is done for all those regexps on a list of strings that can contain around 1 million strings (so recompiling the regexp each time is likely to be a problem).
>
> The problem is the following: I'm supposed to implement a match on general tokens, i.e. being able to code regexps that would contain tokens like (CITY), where CITY is defined elsewhere as a list of possible cities. The only way I see to do this with boost-regexp is to translate those pseudo-regexps into ones containing (boston|chicago|.....), i.e. replace all references to generic tokens with their expanded value. I'm afraid that will use too much memory (if all loaded) or too much time (if recompiled each time) - unless there is a way to refer to another regexp in a regexp ?

I'll second the suggestion to look into xpressive. With xpressive, you can refer to one regex from another. And with the latest version (in the Boost Sandbox) you can put your list of cities into a symbol table and get very fast look-up -- much better than just a bunch of alternates.

You can read about xpressive's symbol tables here:
http://boost-sandbox.sourceforge.net/libs/xpressive/doc/html/boost_xpressive...

HTH,

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

Jacques-Olivier Goussard

8:01 p.m.

Thanks a lot, I'll take a look - it seems very promising. Does it support wchar_t ? Another requirement of mine, I'm afraid :)

/jog

Ovanes Markarian

8:12 p.m.

> Does it support wchar_t ?

Yes. ;)

Jacques-Olivier Goussard

31 Oct 2007

2:36 p.m.

Eric Niebler wrote:

> I'll second the suggestion to look into xpressive. With xpressive, you can refer to one regex from another. And with the latest version (in the Boost Sandbox) you can put your list of cities into a symbol table and get very fast look-up -- much better than just a bunch of alternates.

OK, I've taken a look and that doesn't seem to be what I need after all. The regexps must use Perl semantics and are only known at runtime, so I would go for what you call dynamic expressions. However, I'm not sure I see how I can use your neat symbol table trick with those, unless they are disconnected. I.e. I can easily translate "in (CITY) on westbrook" into

    sregex leftre  = as_xpr("in ");
    sregex rightre = as_xpr(" on westbrook");
    sregex re      = leftre >> (a1 = city_map) >> rightre;

but what to do with "in (CITY|heaven) on westbrook" ?

/jog

Eric Niebler

4:40 p.m.

Jacques-Olivier Goussard wrote:

> Ok, I've taken a look and that doesn't seem to be what I need after all. The regexp used must use the Perl semantic and are only known at runtime, so I would go for what you call dynamic expressions, however I'm not sure I see how I can use your symbol table neat trick with those, unless they are disconnected.
>
> but what to do with "in (CITY|heaven) on westbrook" ?

You're right, you can't /directly/ use symbol tables in dynamic regexes, but you can embed a static regex into a dynamic one. See the section on Dynamic Regex Grammars here: http://tinyurl.com/2nacj7.

In particular, you can:

    sregex_compiler comp;

    // static regex, creates a symbol table from city_map:
    comp["CITY"] = (a1 = city_map);

    // dynamic regex calls static regex
    sregex re = comp.compile("in ((?$CITY)|heaven) on westbrook");

HTH,

--
Eric Niebler
Boost Consulting
www.boost-consulting.com

Jacques-Olivier Goussard

5:11 p.m.

Great! I think I'm all set with this. Thanks a lot - that's an amazing package.

/jog
