david v wrote:
Yes i think there were some misunderstanding here.. I think that comes by the definition you have of mistake. A mistake for me is as follows:
Regex: "testing" String_to_search: "tastung". The output should be that the regex testing was found but with 2 mismatches that are "a" and "u". So a mismatch is a letter that was not found.
It may sound weird to you but the way i'm using the regex is to identify genomic regions, so in other words for biological applications. In some cases my regex is a piece of DNA such as "atgcta" and i want to search for this regex in another piece of DNA. Given the fact that the regex "atgcta" can be found in the genome many times i will get probably get a lot of matches. But in some cases i want to be able to search for "atgcta" but i want to allow some mismatches. Obviuously i will even get more matches but i think regex can be a more much efficient way that by building ip aligment matrices.
ANy idea how to handle the example above
David, The problem you describe is fundamentally different than that of Regular Expressions. A regex pattern "can't count", and is limited to anything that can be found with a 1-symbol language. (Sorry, obligatory obscure reference to theory.) In plain English, a regex can only list the possible things it matches against. It's sort of like having a long list of possibilites, even an infinite list, but a list nonetheless. It's critical to understand that no state can be held. In other words, while we can form a regex to find a*b*, we can't use a regex to find out if the pattern matches (a^n)(b^n) where n in any arbitrary value that we don't specify. A regular expression would have to be able to store the number of found 'a' in some state and then compare the number of found 'b' against it. Regular expressions are not capable of storing state, so trying to find the answer to your question is in a region of research called "approximate string matching". Definitions: Edit distance: the number of single character changes to change one string S into another string R. To be a little more precise, the two most popular edit distance definitions are: * Levenshtein edit distance, which allows for the single-character insertion, deletion, or substitution operations. * Damerau edit distance, which also allows for the transposition of adjoining characters as a single operation. Given that the operations are all treated as having cost 1, then the "cost" of changing S to R is the "edit distance" between the strings. Your problem is not a regular expression problem, but you seem to want to search for all strings {R} which are within an edit distance of N. (Most people choose 2, and use an algorithm that blows up badly for N>3.) Depending on the constraints in your actual problem, you are left with various approaches. Let me ask some questions: 1) What's the longest string you're searching for? 2) What's the smallest bit-width of any cpu on which you're running your code? 3) How familiar with string searching/matching are you? After seeing the answers to your questions, I may be able to point you to one or more papers which can help you out. If you want to just start reading, you can google the names Navarro, Hyyro, and the term "edit distance". If you want a historical perspective, the name "Esko Ukkonen" is canonical. Hope this helps, Brian