Re: [Boost-users] Running Regular expression (RE) over a list, array or map

Hi,
I try to use pattern matching approach for a dataset with multilevel tags.
Suppose we have 2 sets of features: F and H with the following tags F={f0,
f1, f2, f3} and H={h0, h1, h2, h3, h4} to describe a set of elements E={e1,
e2, e3.....en}. We have in a table a sequence of E elements describe as
following:
E e1 e2 e3 e4 e5 e6 e7 e8 ...........en
F *f3 f0 f0 f0* f2 *f3 f0 * f2 f0
H h4 h0 h4 *h1* h0 h2 *h3* h2 h1
The order in which e1, e2 e3...... appear is important as words in a
sentence.
Suppose we have the following RE with f3 and f0 as following: f3f0+ . It
will match the corresponding sequence e1, e2, e3, e4 and e6, e7 in E. Now
we want to add additional constraint as the last f0 should map a h1 in H.
In this case the final result will be only the sequence e1, e2, e3, e4
because in the last sequence e6, e7 the f0 map h3 in H.The final result is
always E sequences but the RE and constraint can be based on E, F or H.
May be it becomes a bit clear. Thank for your help.
Regards
Olivier
2013/9/27 Anthony Foiani
Olivier, greetings --
Olivier Austina
writes: I am wondering if it is possible to run directly regular expression over a list and getting the indexes (begging and end) of the match. For example, I have a list of strings and a regular expression and I want to know which part of the list matches the RE and get the corresponding indexes in the list.
It looks like you got some other suggestions, but if they're not what you're looking for, you might want to clarify your request.
In particular, it's not clear to me whether you want to match the RE against each individual item (which can be parallelized, but the result is a membership bitmap or subset, not a range), or if you want to match the RE against the concatenated value, or the largest span of continuous values (which are the requests that make the most sense if you want a starting and ending index).
Some sample code would probably clarify things.
E.g., given:
typedef std::vector< std::string > string_vec; const string_vec sv{ "foo", "bar", "baz" };
And you wanted to match:
const boost::regex re{ "ba.*" };
What answer do you want to see?
* The membership bitmap would be something like: [ 0, 1, 1 ]
* The subset would be: { "bar", "baz" } (This is basically a "grep" operator.)
* The "range" answer would be sv.begin()+1, sv.begin()+3 (since "barbaz" matches).
* The other "range" answer would be the same, but because "bar" matches, and "baz" matches, so the range represents the (longest?) set of elements that individually match the given regex. This ends up being something of a meta-regex, or if you prefer, the result of searching the concatenated string for instances of "(?:re)+".
Or is there some other interpretation that you're trying to get at?
Happy hacking, Tony
p.s. Heh. Guess who just finished a few interviews where "spot the under-specified problem" was important... _______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users

Olivier --
Olivier Austina
I try to use pattern matching approach for a dataset with multilevel tags. Suppose we have 2 sets of features: F and H with the following tags F={f0, f1, f2, f3} and H={h0, h1, h2, h3, h4} to describe a set of elements E={e1, e2, e3.....en}. We have in a table a sequence of E elements describe as following:
E e1 e2 e3 e4 e5 e6 e7 e8 ...........en F f3 f0 f0 f0 f2 f3 f0 f2 f0 H h4 h0 h4 h1 h0 h2 h3 h2 h1
The order in which e1, e2 e3...... appear is important as words in a sentence.
Suppose we have the following RE with f3 and f0 as following: f3f0+ . It will match the corresponding sequence e1, e2, e3, e4 and e6, e7 in E. Now we want to add additional constraint as the last f0 should map a h1 in H. In this case the final result will be only the sequence e1, e2, e3, e4 because in the last sequence e6, e7 the f0 map h3 in H.The final result is always E sequences but the RE and constraint can be based on E, F or H.
May be it becomes a bit clear. Thank for your help.
This does make more sense. I suspect that what you will want to do is to use a regex iterator for each constraint that you're trying to apply. Each time a given iterator matches, it will make a sequence (begin+end) available; you can then apply whatever other constraints you need, to determine whether you want to save (or otherwise note) the match. Are F and H distinct types? If so, it gets ugly, because you need a different regex template for each type you're trying to match against, and those templates are not related (so you can't easily store an arbitrary list of them). Maybe variant would help. If you really only have two tags, a cheap-and-dirty trick is to try to fit them into bitfields, then match on character sets. E.g., f0 = 0x00 h0 = 0x00 f1 0x10 h1 0x01 f2 0x20 h2 0x02 ... ... f15 0xf0 h15 0x0f So your sample vector becomes: element: e1 e2 e3 e4 e5 ... tags: 0x 34 00 04 01 20 ... And you can do your matches by slicing the bitfield either way: f3f0+ [\x30-\x3f][\x00-\x0f]+ "last f0 should also be h1" .*?(:>\x01[^\x00-\x0f]) Or something like that, it's easier if you know that will always come after the first filter, or you can combine them: f3f0+ where last f0 should also be h1 [\x30-\x3f][\x00-\x0f]*\x01 But this is insane, of course, and probably not applicable. (And "\x00" in a string literal is just asking for trouble.) You could play games with fixed-width representations as well, depending on your exact requirements. Instead of having two rows, you interleave the tags: E e1 e2 e3 e4 e5 e6 e7 e8 ...........en FH f3h4f0h0f0h4f0h1f2h0f3h2f0h3f2h2 f0h1 And you can then use the same tricks as above, but with '.' as a "don't care" value: "f3f0+" becomes "(?:f3..)(?:f0..)+" And the additional constraint is even more straightforward: "(?:f3..)(?:f0..)*(?:f0h1)" This will probably be a bit less efficient, but... If you really do want to use the regex machinery on the raw tag types, I think you need to do all of this: 1. Create a boost::regex instantiation that knows about your tags, which probably means at least creating a traits class for each of your tags: http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/html/boost_regex/ref/con... 2. Build something that will turn an iterator across your table into an iterator across just one row of that table, and vice-versa. 3. Make a struct or class that manages those two for a given tag type and row; 4. Make all those structs related by inheritance to a base struct (or use some sort of type erasure technique, or lambdas maybe.) Whether it's worth all that work would depend on how many criteria you want to express, IMHO. (And even then, I'm not sure it will work gracefully, since regex mingles data (regular characters) with control (meta characters). Since your tags aren't really "characters" in this sense, I'm not sure this will work.) It might be that using Spirit might be easier, or doing the state machine explicitly.

Hi Anthony Foiani,
Thank you for the reply.
Yes, you understands what I am trying to do.
F and H are distinct types. It can have more than 2 types. Type are not
necessarily character, it can be string also. Is it easier to get indexes
with Spirit?
Regards
Olivier
2013/9/28 Anthony Foiani
Olivier --
Olivier Austina
writes: I try to use pattern matching approach for a dataset with multilevel tags. Suppose we have 2 sets of features: F and H with the following tags F={f0, f1, f2, f3} and H={h0, h1, h2, h3, h4} to describe a set of elements E={e1, e2, e3.....en}. We have in a table a sequence of E elements describe as following:
E e1 e2 e3 e4 e5 e6 e7 e8 ...........en F f3 f0 f0 f0 f2 f3 f0 f2 f0 H h4 h0 h4 h1 h0 h2 h3 h2 h1
The order in which e1, e2 e3...... appear is important as words in a sentence.
Suppose we have the following RE with f3 and f0 as following: f3f0+ . It will match the corresponding sequence e1, e2, e3, e4 and e6, e7 in E. Now we want to add additional constraint as the last f0 should map a h1 in H. In this case the final result will be only the sequence e1, e2, e3, e4 because in the last sequence e6, e7 the f0 map h3 in H.The final result is always E sequences but the RE and constraint can be based on E, F or H.
May be it becomes a bit clear. Thank for your help.
This does make more sense.
I suspect that what you will want to do is to use a regex iterator for each constraint that you're trying to apply. Each time a given iterator matches, it will make a sequence (begin+end) available; you can then apply whatever other constraints you need, to determine whether you want to save (or otherwise note) the match.
Are F and H distinct types? If so, it gets ugly, because you need a different regex template for each type you're trying to match against, and those templates are not related (so you can't easily store an arbitrary list of them). Maybe variant would help.
If you really only have two tags, a cheap-and-dirty trick is to try to fit them into bitfields, then match on character sets. E.g.,
f0 = 0x00 h0 = 0x00 f1 0x10 h1 0x01 f2 0x20 h2 0x02 ... ... f15 0xf0 h15 0x0f
So your sample vector becomes:
element: e1 e2 e3 e4 e5 ... tags: 0x 34 00 04 01 20 ...
And you can do your matches by slicing the bitfield either way:
f3f0+ [\x30-\x3f][\x00-\x0f]+ "last f0 should also be h1" .*?(:>\x01[^\x00-\x0f])
Or something like that, it's easier if you know that will always come after the first filter, or you can combine them:
f3f0+ where last f0 should also be h1 [\x30-\x3f][\x00-\x0f]*\x01
But this is insane, of course, and probably not applicable. (And "\x00" in a string literal is just asking for trouble.)
You could play games with fixed-width representations as well, depending on your exact requirements. Instead of having two rows, you interleave the tags:
E e1 e2 e3 e4 e5 e6 e7 e8 ...........en FH f3h4f0h0f0h4f0h1f2h0f3h2f0h3f2h2 f0h1
And you can then use the same tricks as above, but with '.' as a "don't care" value:
"f3f0+" becomes "(?:f3..)(?:f0..)+"
And the additional constraint is even more straightforward:
"(?:f3..)(?:f0..)*(?:f0h1)"
This will probably be a bit less efficient, but...
If you really do want to use the regex machinery on the raw tag types, I think you need to do all of this:
1. Create a boost::regex instantiation that knows about your tags, which probably means at least creating a traits class for each of your tags:
http://www.boost.org/doc/libs/1_54_0/libs/regex/doc/html/boost_regex/ref/con...
2. Build something that will turn an iterator across your table into an iterator across just one row of that table, and vice-versa.
3. Make a struct or class that manages those two for a given tag type and row;
4. Make all those structs related by inheritance to a base struct (or use some sort of type erasure technique, or lambdas maybe.)
Whether it's worth all that work would depend on how many criteria you want to express, IMHO.
(And even then, I'm not sure it will work gracefully, since regex mingles data (regular characters) with control (meta characters). Since your tags aren't really "characters" in this sense, I'm not sure this will work.)
It might be that using Spirit might be easier, or doing the state machine explicitly.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
participants (2)
-
Anthony Foiani
-
Olivier Austina