Re: [Boost-users] Mismatch and regex newbie problem still problem

29 Aug 2006

      david v wrote:
...
Yes i think there were some misunderstanding here.. I think that comes by 
the definition you have of mistake. A mistake for me is as follows:
Regex: "testing"
String_to_search: "tastung".
The output should be that the regex testing was found but with 2 mismatches 
that are "a" and "u". So a mismatch is a letter that was not found.
It may sound weird to you but the way i'm using the regex is to identify 
genomic regions, so in other words for biological applications.
In some cases my regex is a piece of DNA such as "atgcta" and i want to 
search for this regex in another piece of DNA. Given the fact that the regex 
"atgcta" can be found in the genome many times i will get probably get a lot 
of matches. But in some cases i want to be able to search for "atgcta" but i 
want to allow some mismatches. Obviuously i will even get more matches but i 
think regex can be a more much efficient way that by building ip aligment 
matrices.
ANy idea how to handle the example above
David,

  The problem you describe is fundamentally different than that of 
Regular Expressions.

  A regex pattern "can't count", and is limited to anything that can be 
found with a 1-symbol language. (Sorry, obligatory obscure reference to 
theory.)

  In plain English, a regex can only list the possible things it matches 
against. It's sort of like having a long list of possibilites, even an 
infinite list, but a list nonetheless. It's critical to understand that 
no state can be held. In other words, while we can form a regex to find 
a*b*, we can't use a regex to find out if the pattern matches (a^n)(b^n) 
where n in any arbitrary value that we don't specify. A regular 
expression would have to be able to store the number of found 'a' in 
some state and then compare the number of found 'b' against it.
  Regular expressions are not capable of storing state, so trying to 
find the answer to your question is in a region of research called 
"approximate string matching".

Definitions:
  Edit distance: the number of single character changes to change one 
string S into another string R.

  To be a little more precise, the two most popular edit distance 
definitions are:
    * Levenshtein edit distance, which allows for the single-character 
insertion, deletion, or substitution operations.
    * Damerau edit distance, which also allows for the transposition of 
adjoining characters as a single operation.

  Given that the operations are all treated as having cost 1, then the 
"cost" of changing S to R is the "edit distance" between the strings.

  Your problem is not a regular expression problem, but you seem to want 
to search for all strings {R} which are within an edit distance of N. 
(Most people choose 2, and use an algorithm that blows up badly for N>3.)

  Depending on the constraints in your actual problem, you are left with 
various approaches. Let me ask some questions:

  1) What's the longest string you're searching for?
  2) What's the smallest bit-width of any cpu on which you're running 
your code?
  3) How familiar with string searching/matching are you?

  After seeing the answers to your questions, I may be able to point you 
to one or more papers which can help you out. If you want to just start 
reading, you can google the names Navarro, Hyyro, and the term "edit 
distance".
  If you want a historical perspective, the name "Esko Ukkonen" is 
canonical.

  Hope this helps,
Brian

Re: [Boost-users] Mismatch and regex newbie problem still problem

Brian Allison