Mismatch and regex newbie problem still problem
Yes i think there were some misunderstanding here.. I think that comes by the definition you have of mistake. A mistake for me is as follows: Regex: "testing" String_to_search: "tastung". The output should be that the regex testing was found but with 2 mismatches that are "a" and "u". So a mismatch is a letter that was not found. It may sound weird to you but the way i'm using the regex is to identify genomic regions, so in other words for biological applications. In some cases my regex is a piece of DNA such as "atgcta" and i want to search for this regex in another piece of DNA. Given the fact that the regex "atgcta" can be found in the genome many times i will get probably get a lot of matches. But in some cases i want to be able to search for "atgcta" but i want to allow some mismatches. Obviuously i will even get more matches but i think regex can be a more much efficient way that by building ip aligment matrices. ANy idea how to handle the example above
-----Original Message----- From: boost-users-bounces_at_[hidden] [mailto:boost-users- bounces_at_[hidden]] On Behalf Of david v Sent: Tuesday, August 29, 2006 9:08 AM To: boost-users_at_[hidden] Subject: [Boost-users] Mismatch and regex newbie problem still problem
So to sum-up. If the regex i'm looking for is "testing" and the string to search the regex for is "tastung" (obviously this is a short example but i'm dealing with more complex regular expressions.
how can i get the number of mismatches. Basically the output of the program would tell me:
2 mismatches found in string "tastung" at position 2 (a) and 5(u).
[Nat] Maybe I'm completely misunderstanding you. If so, please forgive me. I think you're saying that you want to start with the regex "testing" and have the library detect that the string "tastung" is somehow similar but nonetheless distinct. My belief is that, given the regex "testing", the library will not recognize the string "tastung" in any way. It will simply report that no match was found. You could construct a more complex regex that would handle this particular example. You could, for instance, say that you want to match a "t", followed by an arbitrary character, followed by "st", followed by another arbitrary character, followed by "ng". The library would report that the string "tastung" matches that regex, and you could ask it to tell you the specific substrings matching the variable parts. But if you want to allow arbitrary variance in any character position -- as long as some other set of character positions matches -- then I'm a little perplexed as to how to express that in a regular expression. Maybe an exhaustive family of acceptable alternatives? But if you're dealing with longer expression strings, that could explode really quickly. I think you need to get really specific about the rules you want to use to detect a "mismatch." Then you need to figure out whether the regex library is the right tool to help you apply those rules. Again, sorry if I'm way off base here.
I want to point out that regex may not be the right base to start from. The key thing that I am seeing is ALL of your sample strings to look for have been simple strings, with NO "regular expressions" in them (things like optional strings, alternates, or repeats). When you include those options theanswer you are looking for become not practically computable as there may be many possible "difference sets" to make the match. Also, how close do you need? Can I say that I find fun in foobar with a mismatch on the second and third characters? (if so you will hit a LOT of partial matchs). As has been pointed out, you need a precise definition of the mismatches allowed and then you can work for that. I personally do NOT think regex is apt to give you the results you want, because I get the feeling that you are going to want to know how close the match is, and regex only really know if it matched or not. Richard Damon.
Just a quick note, if you're interested in efficient string searching,
there's some interesting stuff in one of the Topcoder Intel Multi-Threaded
Marathon matches. The competition was called String Search. You've got to
be a member to view the problem statement and see the various solutions, but
its all free.
Anyway, the guy that took first had an interesting use of DSP math in his
algorithm. Might be worth taking a look. Other than that, I'd recommend
adapting one of the exact string matching algorithms to count mismatches.
I'd have to agree that regex probably isn't what you want.
On 8/29/06, Richard Damon
I want to point out that regex may not be the right base to start from. The key thing that I am seeing is ALL of your sample strings to look for have been simple strings, with NO "regular expressions" in them (things like optional strings, alternates, or repeats). When you include those options theanswer you are looking for become not practically computable as there may be many possible "difference sets" to make the match. Also, how close do you need? Can I say that I find fun in foobar with a mismatch on the second and third characters? (if so you will hit a LOT of partial matchs). As has been pointed out, you need a precise definition of the mismatches allowed and then you can work for that. I personally do NOT think regex is apt to give you the results you want, because I get the feeling that you are going to want to know how close the match is, and regex only really know if it matched or not.
Richard Damon.
_______________________________________________ Boost-users mailing list Boost-users@lists.boost.org http://lists.boost.org/mailman/listinfo.cgi/boost-users
david v wrote:
Yes i think there were some misunderstanding here.. I think that comes by the definition you have of mistake. A mistake for me is as follows:
Regex: "testing" String_to_search: "tastung". The output should be that the regex testing was found but with 2 mismatches that are "a" and "u". So a mismatch is a letter that was not found.
It may sound weird to you but the way i'm using the regex is to identify genomic regions, so in other words for biological applications. In some cases my regex is a piece of DNA such as "atgcta" and i want to search for this regex in another piece of DNA. Given the fact that the regex "atgcta" can be found in the genome many times i will get probably get a lot of matches. But in some cases i want to be able to search for "atgcta" but i want to allow some mismatches. Obviuously i will even get more matches but i think regex can be a more much efficient way that by building ip aligment matrices.
ANy idea how to handle the example above
David, The problem you describe is fundamentally different than that of Regular Expressions. A regex pattern "can't count", and is limited to anything that can be found with a 1-symbol language. (Sorry, obligatory obscure reference to theory.) In plain English, a regex can only list the possible things it matches against. It's sort of like having a long list of possibilites, even an infinite list, but a list nonetheless. It's critical to understand that no state can be held. In other words, while we can form a regex to find a*b*, we can't use a regex to find out if the pattern matches (a^n)(b^n) where n in any arbitrary value that we don't specify. A regular expression would have to be able to store the number of found 'a' in some state and then compare the number of found 'b' against it. Regular expressions are not capable of storing state, so trying to find the answer to your question is in a region of research called "approximate string matching". Definitions: Edit distance: the number of single character changes to change one string S into another string R. To be a little more precise, the two most popular edit distance definitions are: * Levenshtein edit distance, which allows for the single-character insertion, deletion, or substitution operations. * Damerau edit distance, which also allows for the transposition of adjoining characters as a single operation. Given that the operations are all treated as having cost 1, then the "cost" of changing S to R is the "edit distance" between the strings. Your problem is not a regular expression problem, but you seem to want to search for all strings {R} which are within an edit distance of N. (Most people choose 2, and use an algorithm that blows up badly for N>3.) Depending on the constraints in your actual problem, you are left with various approaches. Let me ask some questions: 1) What's the longest string you're searching for? 2) What's the smallest bit-width of any cpu on which you're running your code? 3) How familiar with string searching/matching are you? After seeing the answers to your questions, I may be able to point you to one or more papers which can help you out. If you want to just start reading, you can google the names Navarro, Hyyro, and the term "edit distance". If you want a historical perspective, the name "Esko Ukkonen" is canonical. Hope this helps, Brian
david v wrote:
Yes i think there were some misunderstanding here.. I think that comes by the definition you have of mistake. A mistake for me is as follows:
Regex: "testing" String_to_search: "tastung". The output should be that the regex testing was found but with 2 mismatches that are "a" and "u". So a mismatch is a letter that was not found.
As others have already said, the regular expressions aren't really what you want, approximate matching to some edit-distance is. It's a pity we don't have a Boost library for this actually as it's a pretty well studied area. John.
participants (5)
-
Brian Allison
-
david v
-
John Maddock
-
Paul Davis
-
Richard Damon