Hello folks,

I believe my question applies to regular expression libraries generally and not just to regexp++. I want to know if it is possible to refer to or use one regular expression within another.

What I wish to do is parse a string of HTML code for img tags, and if they have the clause ALT="whatever", replace the whole image tag with 'whatever'. So I decided to make a regular expression for an image tag which had an ALT part, and have a sub-match on the contents of the quoted part of the ALT. (I broke this up a bit, explaining what each part is for . . .)

This will match img tags:

    static const boost::regex find_imgs(
        "<\\s*img"       // Matches <, 0 or more whitespace, IMG
        "\\s+src\\s*"    // Matches 1 or more whitespace, SRC, 0 or more whitespace
        "=\\s*"          // Matches = followed by 0 or more whitespace
        "\"\\s*"         // Matches " followed by 0 or more whitespace
        "([^\"]*)"       // Matches any number of characters that are not "
        "([^>]*>)",      // Matches any number of chars that are not >, followed by a >
        boost::regbase::normal | boost::regbase::icase);

Ok, so now look at this one. I'm trying to do the same as above except I want to sub-match the quoted part of the alt part so I can use it. I can't do "anything except the word 'alt'" because it will interpret the [^(alt)] as "anything except 'a', 'l', or 't'".

    static const boost::regex find_imgs_with_alt(
        "<\\s*img"       // Matches <, 0 or more whitespace, IMG
        "\\s+src\\s*"    // Matches 1 or more whitespace, SRC, 0 or more whitespace
        "=\\s*"          // Matches = followed by 0 or more whitespace
        "\"\\s*"         // Matches " followed by 0 or more whitespace
        "[^\"]*"         // Matches any number of chars not "
        "\\s+"           // Matches 1 or more whitespace
        "[^alt]*"        // I would like to match anything except the word ALT,
                         // but the regexp stuff interprets this as anything
                         // but 'a', 'l', or 't'
        "alt\\s*="       // Matches ALT, 0 or more whitespace, =
        "\"([^\"]*)\""   // Matches ", anything except a " as a group that I
                         // can reference, then another "
        "[^>]*>",        // Matches any number of chars not >, followed by a >
        boost::regbase::normal | boost::regbase::icase);

So what I want to do is make another regular expression which matches "alt", and in the part that says

    [^alt]*

do instead something like

    [^@alt]*

where '@' would indicate that 'alt' was the name of another regular expression, such as

    static const boost::regex alt("alt", boost::regbase::normal | boost::regbase::icase);

I can see how to do what I want to do without this; I would get the whole IMG tag and do a separate regexp_search on the match. But it seems it would be so much easier if it were possible, especially leaving me with fewer lines of regular expression code to have bugs in.

If this is possible I'd like to know. Thanks in advance, and I'll post the regular expressions I end up using here if anyone might find them of use.

--Rob
Once you find your IMG match and pick up your ALT sub-match from it, you can
use regex_format to change your IMG match to whatever you like based on the
string you find in your ALT sub-match.
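The substitution described above can be sketched in a few lines. This is only a minimal sketch: it uses std::regex and std::regex_replace from the C++ standard library in place of Boost's regex_format (an assumption made so the example is self-contained), and the helper name replace_imgs_with_alt and the pattern itself are illustrative, not taken from the thread.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replace every <img ... ALT="text" ...> tag with text.
// Tags without an ALT attribute are left untouched.
std::string replace_imgs_with_alt(const std::string& html) {
    static const std::regex img_with_alt(
        "<\\s*img\\b"          // <, optional whitespace, IMG
        "[^>]*?"               // other attributes, never crossing a >
        "\\salt\\s*=\\s*\""    // whitespace, ALT, =, opening quote
        "([^\"]*)\""           // sub-match 1: the quoted ALT text
        "[^>]*>",              // rest of the tag up to the closing >
        std::regex::ECMAScript | std::regex::icase);
    // "$1" in the format string stands for sub-match 1.
    return std::regex_replace(html, img_with_alt, "$1");
}
```

Calling replace_imgs_with_alt on a string replaces each matching tag in one pass, so no manual loop over the matches is needed.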
"Edward" == Edward Diener writes:

Edward> Once you find your IMG match and pick up your ALT sub-match
Edward> from it, you can use regex_format to change your IMG match to
Edward> whatever you like based on the string you find in your ALT
Edward> sub-match.
My post was poorly written; my point was buried under a lot of text. My apologies for that, and let me try to be clearer.

Consider these image tags, and how to submatch on the phrase "alternate text":

1) <img SRC="x.gif" ALT="alternate text">
2) <img SRC="x.gif" BORDER="0" ALIGN="left" ALT="alternate text">
3) <img SRC="x.gif" whocares="nobody" another_attribute="why?" ALT="alternate text">

I think I can write one that submatches on 1. For two and three, I would like to have a part of my regular expression that matched anything except whitespace, ALT, =, ". I can write a regular expression that matches anything but one character, or anything but a set of characters, but how do I write one that would match anything but a word?

It seems to me that the best way would be to make a regular expression for the word, and negate it in the regular expression you actually use:

    static const boost::regex start_of_alt(
        "\\s+"       /* at least one whitespace */
        "alt\\s*"    /* alt followed by 0 or more whitespace */
        "=\\s*"      /* = followed by 0 or more whitespace */
        "\"",        /* a quote */
        boost::regbase::normal | boost::regbase::icase);

    static const boost::regex img_tag_with_alt_submatch(
        "<\\s*"            /* a <, followed by 0 or more whitespace */
        "img\\s+"          /* IMG, followed by at least one whitespace */
        "^@start_of_alt"   /* anything that doesn't match the previously
                              defined regular expression start_of_alt */
        "@start_of_alt"    /* the start_of_alt regular expression defined
                              above */
        "([^\"]*)"         /* 0 or many instances of not-a-quote, in the
                              sub-matching parens */
        "[^>]*>",          /* anything except a >, and then the > */
        boost::regbase::normal | boost::regbase::icase);

Is there a way to refer to previous regular expressions within another regular expression, as I did in the lines that use "@start_of_alt"?

--Rob
--- yg-boost-users@m.gmane.org wrote:
Consider these image tags, and how to submatch on the phrase "alternate text":
1) <img SRC="x.gif" ALT="alternate text">
2) <img SRC="x.gif" BORDER="0" ALIGN="left" ALT="alternate text">
3) <img SRC="x.gif" whocares="nobody" another_attribute="why?" ALT="alternate text">
I think I can write one that submatches on 1. For two and three, I would like to have a part of my regular expression that matched anything except whitespace, ALT, =, ". I can write a regular expression that matches anything but one character, or anything but a set of characters, but how do I write one that would match anything but a word?
If you use an alternative, you might write something like:

    (alt_subexpr|non-alt_subexpr)*

although you probably don't want the whole thing to match, so maybe

    (?:alt_subexpr|non-alt_subexpr)*

If you substitute your own expressions into this, you get:

    (?:                            /* this subexpr matches once per attribute, */
                                   /* but we discard the match: */
      \\s+alt\\s*=\"([^\"]*)\"     /* if the alt expression matches, */
                                   /* we spit out a sub_match with the text */
      |                            /* OTHERWISE */
      \\s+[a-z]+\\s*=\"[^\"]*\"    /* we match any other attribute, and discard */
    )*                             /* we can have many attributes */

which should try to match the first alternative, i.e., try to match an ALT="..." expression (and spit out a sub_match for the quoted text), or match and discard any other attribute XYZ="...".

I haven't tested this, by the way, but it feels right. It assumes the first alternative (the alt one) is matched if possible, and the more general one is tried only if that fails.
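The alternation idea can be tried out roughly as follows. This is a sketch under stated assumptions, not a tested Boost solution: it uses std::regex from the standard library instead of Boost.Regex, the helper name find_alt is hypothetical, and the attribute-name class is widened to [A-Za-z_]+ so that names such as another_attribute from the earlier examples are accepted.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true and stores the ALT text if the tag has one.
bool find_alt(const std::string& tag, std::string& alt_text) {
    static const std::regex img_attrs(
        "<\\s*img"
        "(?:"                                      // once per attribute, group not captured
            "\\s+alt\\s*=\\s*\"([^\"]*)\""         // ALT="..." sets sub-match 1
        "|"                                        // otherwise
            "\\s+[A-Za-z_]+\\s*=\\s*\"[^\"]*\""    // any other attribute, discarded
        ")*"                                       // any number of attributes
        "\\s*>",
        std::regex::ECMAScript | std::regex::icase);
    std::smatch m;
    if (!std::regex_search(tag, m, img_attrs) || !m[1].matched)
        return false;                              // no img tag, or no ALT attribute
    alt_text = m[1].str();
    return true;
}
```

Whether sub-match 1 keeps its value when further attributes follow ALT depends on how the engine treats captures inside a repeated group, so the sketch is best exercised with ALT as the last attribute, as in the three example tags quoted above.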
> [^alt]* I would like match anything except the word ALT, but the regexp stuff interprets this as anything but 'a', 'l', or 't'

You could use forward lookahead asserts: "(?!\ )*" matches a sequence of chars that are not "\ ", although this is rather slow I admit...

> So what I want to do is make another regular expression which matches "alt", and in the part that says
> [^alt]*
> do instead something like
> [^@alt]*
> where '@' would indicate that 'alt' was the name of another regular expression, such as
> static const boost::regex alt("alt", boost::regbase::normal | boost::regbase::icase);
>
> If this is possible I'd like to know.

You can't do that right now - the main problem is how would the library find an expression called "alt"? Interpreted languages with reflexive abilities can do this (perl for example), but compiled languages can't.

At present I'm in the middle of rewriting the regex matching code (for those that follow these things it's about 90% done and up to 10x faster than the current version). Once I've got that out the door there are a couple of extensions that I will be able to add:

1) recursive regexes (a regex that can jump to an arbitrary part in its own state machine).
2) registered/named regexes: you would call boost::regex::register to register a named regular expression, which can then be called from as many other regexes as you want (basically it lets one state machine call another).

There are limitations to be figured out, but I'm actually pretty excited about this one - and it happens to solve your problem as well - or at least almost, I admit I hadn't thought of referring to negated regexes as you want to do, that's actually quite tricky :-(

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm
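The lookahead suggestion can be illustrated with a generic form of the technique: consume one character at a time, but only while the unwanted word does not start at the current position. The exact pattern from the message did not survive in the archive, so the pattern below is an assumption, written for std::regex (standard library) rather than Boost.Regex, and the helper name alt_via_lookahead is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: extract the ALT text of the first img tag that has one.
bool alt_via_lookahead(const std::string& html, std::string& alt_text) {
    static const std::regex img_with_alt(
        "<\\s*img\\s"
        "(?:(?!alt\\s*=)[^>])*"   // any char except >, as long as "alt =" does not start here
        "alt\\s*=\\s*\""          // ALT, =, opening quote
        "([^\"]*)\""              // sub-match 1: the quoted ALT text
        "[^>]*>",                 // rest of the tag
        std::regex::ECMAScript | std::regex::icase);
    std::smatch m;
    if (!std::regex_search(html, m, img_with_alt))
        return false;
    alt_text = m[1].str();
    return true;
}
```

The negative assertion is re-evaluated at every character of the skipped region, which is consistent with the remark that this approach is rather slow.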
How are you saving 2)? In memory or permanently in a file? If permanently in a file, how does the end-user reuse named regexes in other situations from the one in which he created a name for a regular expression? Inquiring minds want to know <g>.

Named regexes are something I have intermittently thought about for my Regular Expression Component Library built using Boost Regex++. The difficulty is the practical decision of how to save named regexes so that they can be used again in other invocations of the Boost Regex++ library. However one saves them, it seems the end-user must transport such permanent storage around with the Regex++ implementation, else the named regexes will be lost.
Be aware that we are talking about vapourware here ;-)

I wasn't planning to save them at all - it's up to the programmer to call a specific API to make them available to subsequent regexes. One could imagine a "library" of such expressions though (a specific source module that you would link to, and which at initialisation time would register its expressions).

John Maddock
http://ourworld.compuserve.com/homepages/john_maddock/index.htm

"John Maddock" wrote:
> Be aware that we are talking about vapourware here ;-)
Of course <g>.
> I wasn't planning to save them at all - it's up to the programmer to call a specific API to make them available to subsequent regexes. One could imagine a "library" of such expressions though (a specific source module that you would link to, and which at initialisation time would register its expressions).
Yes, of course it would be helpful to refer to a regular expression by a memorable name. I was fishing for how you plan to accomplish this "specific API" in order to somehow make the named expressions available to subsequent regexes. Or do you mean to say that it is up to the user to register named regular expressions when using Regex++, and however he accomplishes it is his own business, while once a named regular expression is registered it can be used as an alias for the actual regular expression?
Yes, exactly :-)
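Since the registered-regex feature discussed here is described as vapourware, the closest thing a user can build today is a purely textual registry. The following is a hypothetical sketch, not a Boost API: the class pattern_registry, its member names, and the "@name" reference syntax are all invented for illustration, and std::regex is used so the example is self-contained. A name is registered in memory by the program, and each "@name" in a later pattern is expanded to the registered text before compilation.

```cpp
#include <cctype>
#include <map>
#include <regex>
#include <string>

// Hypothetical in-memory registry mapping a name to pattern text.
class pattern_registry {
public:
    void add(const std::string& name, const std::string& pattern) {
        patterns_[name] = pattern;
    }

    // Expand each known "@name" to "(?:pattern)" and compile the result.
    // Unknown names are left in the pattern text unchanged.
    std::regex compile(const std::string& source) const {
        std::string expanded;
        for (std::string::size_type i = 0; i < source.size(); ) {
            if (source[i] == '@') {
                std::string::size_type j = i + 1;
                while (j < source.size() &&
                       (std::isalnum(static_cast<unsigned char>(source[j])) ||
                        source[j] == '_'))
                    ++j;
                auto it = patterns_.find(source.substr(i + 1, j - i - 1));
                if (it != patterns_.end()) {
                    expanded += "(?:" + it->second + ")";
                    i = j;
                    continue;
                }
            }
            expanded += source[i++];
        }
        return std::regex(expanded, std::regex::ECMAScript | std::regex::icase);
    }

private:
    std::map<std::string, std::string> patterns_;
};
```

Wrapping the expansion in a non-capturing group keeps the referenced pattern from disturbing quantifiers placed after the reference. A negated reference, as asked for earlier in the thread, is not covered by this kind of textual expansion.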
Thanks for all the responses, and I am reading them with interest. Meanwhile, I thought I would post the way I am currently solving the specific problem of finding all HTML img tags in a string and replacing them with the value of the alt attribute. Instead of trying to match "not the word alt" I matched anything ( ".*" ) and then the word alt.

    static const boost::regex find_imgs_with_alt(
        "<\\s*"          // matches < followed by 0 or more whitespace
        "img\\s+"        // matches IMG followed by at least 1 whitespace
        ".*"             // any amount of stuff
        "alt\\s*"        // ALT, 0 or more whitespace
        "=\\s*\""        // =, 0 or more whitespace, and a quote
        "([^\"]*)\""     // any number of non-quotes in a sub-match, then a quote
        "[^>]*>",        // anything not >, then a >
        boost::regbase::normal | boost::regbase::icase);

    string tmp;  // This is what holds the html; just pretend it is
                 // filled, that code doesn't matter here
    match_results<string::iterator> img;  // will hold the whole img tag
    boost::match_flag_type flags = boost::match_default;  // declaration not
                 // shown in the original post
    string::iterator b = tmp.begin();
    string::iterator e = tmp.end();
    while (regex_search(b, e, img, find_imgs_with_alt, flags)) {
        // string of whole img tag; printed out for debugging, not really used
        string img_str = string(img[0].first, img[0].second);
        string alt_contents = string(img[1].first, img[1].second);
        // Offset at which the next search starts: just past the alt text
        // that is left behind in place of the tag.
        string::size_type next = (img[0].first - tmp.begin()) + alt_contents.size();
        // Erase everything after the contents of the alt attribute, to
        // the end of the img tag, and then erase from the front to the
        // beginning of the contents of the alt attribute. The order
        // matters: erasing the front first shifts the text and
        // invalidates the iterators that mark the tail.
        tmp.erase(img[1].second, img[0].second);
        tmp.erase(img[0].first, img[1].first);
        // erase() invalidates b and e, so recompute them; this is what
        // makes the regex_search call in the while loop go on to the
        // rest of the string.
        b = tmp.begin() + next;
        e = tmp.end();
        // These two additions to flags make the rest of the search
        // faster, but you can't use them the first time.
        flags |= boost::match_prev_avail;
        flags |= boost::match_not_bob;
    }

Maybe that will be useful to someone.
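One thing worth checking with the ".*" approach is what happens when two img tags share a line and only the later one has an ALT attribute: a greedy ".*" is free to run past the first tag's closing >. The sketch below compares the two forms. It assumes std::regex instead of Boost.Regex, and the names first_match, greedy_pattern and bounded_pattern are hypothetical.

```cpp
#include <regex>
#include <string>

// Returns the text matched by the first hit of `pattern` in `html`, or "".
std::string first_match(const std::string& pattern, const std::string& html) {
    const std::regex re(pattern, std::regex::ECMAScript | std::regex::icase);
    std::smatch m;
    return std::regex_search(html, m, re) ? m[0].str() : std::string();
}

// The ".*" form from the post, and a variant that cannot cross a '>'.
const char* const greedy_pattern =
    "<\\s*img\\s+.*alt\\s*=\\s*\"([^\"]*)\"[^>]*>";
const char* const bounded_pattern =
    "<\\s*img\\s+[^>]*alt\\s*=\\s*\"([^\"]*)\"[^>]*>";
```

For an input such as `<img src="a.gif"> text <img src="b.gif" alt="B">`, the greedy form matches from the first tag through the end of the second, so replacing the match would also remove " text "; the bounded form matches only the second tag.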
Back to the issue of whether there should be a way to refer to one regular expression within another. I have read John Maddock's and Edward Diener's posts. I am trying to come up with a good example where using a named regular expression inside another would be any more complicated than a string replacement that could be done in a pre-processor language. If the referred-to expression had labeled sub-matches, then the person writing the larger expression might mis-count which sub-match he wanted to index. It might be a good first step to collect a sampling of regex problems that are easy to write understandably and bug-free with the ability to refer to other expressions, and hard to write without that ability. Then perhaps someone will understand how to properly craft what we want.

--Rob
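The sub-match counting concern can be made concrete with plain string splicing. This is a sketch assuming std::regex rather than Boost.Regex; the fragment names and the helper make_img_regex are hypothetical. The same outer pattern is built around two versions of a fragment, one with its own capturing group and one without, and mark_count() reports how many sub-matches the combined expression ends up with.

```cpp
#include <regex>
#include <string>

// Fragments are kept as pattern text so they can be spliced into larger patterns.
const std::string start_of_alt_capturing = "\\s+(alt)\\s*=\\s*\"";  // has its own group
const std::string start_of_alt_plain     = "\\s+alt\\s*=\\s*\"";    // no group

// Hypothetical helper: build the img pattern around a given fragment.
// The author of this outer pattern intends the ALT text to be sub-match 1.
std::regex make_img_regex(const std::string& start_of_alt) {
    return std::regex("<\\s*img[^>]*?" + start_of_alt + "([^\"]*)\"[^>]*>",
                      std::regex::ECMAScript | std::regex::icase);
}
```

With the plain fragment the ALT text is sub-match 1; with the capturing fragment it silently moves to sub-match 2, which is exactly the kind of mis-count described above. Writing shared fragments with non-capturing groups only is one way to sidestep it.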
participants (4)
- Edward Diener
- John Maddock
- Simon J Turner
- yg-boost-users@m.gmane.org