Re: [Boost-users] Mismatch and regex newbie problem still problem
-----Original Message-----
From: boost-users-bounces@lists.boost.org [mailto:boost-users-bounces@lists.boost.org] On Behalf Of david v
Sent: Tuesday, August 29, 2006 10:02 AM
To: boost-users@lists.boost.org
Subject: [Boost-users] Mismatch and regex newbie problem still problem
It may sound weird to you, but I'm using regex to identify genomic regions, in other words for biological applications. In some cases my regex is a piece of DNA such as "atgcta" and I want to search for this regex in another piece of DNA.
[Nat] That's not weird. I wasn't questioning your desired processing, just trying to figure out where the disconnect lies.
In some cases I want to be able to search for "atgcta" but allow some mismatches. Obviously I will get even more matches, but I think regex can be a much more efficient way than building up alignment matrices.
[Nat] This is where you have to get really specific about what kinds of mismatches you want to recognize. For example, will the sequence always begin with "at" and end with "ta", with exactly two items in between? ("atxxta") In that case regex is perfect for the problem. If the variable items are at fixed positions within the original pattern, it's easy.
But since I don't yet know the full set of cases you intend to allow as a "mismatch," there's room for me to speculate that you might want to find "xtgcta" or "axgcta" or "atxcta" or "atgxta" or "atgcxa" or "atgctx" (one item wrong) or "xxgcta" or "xtxcta" or ... That's for a sequence length of 6. The full list of variations is so long that you'd want to generate it programmatically. Step up to a length of 7 and that list gets alarmingly longer. That's what I meant when I said that it could quickly explode.
How many items must match the original pattern before you recognize a valid "mismatch"? Presumably it's at least 1, otherwise you'd validly "mismatch" every position in every string. In practice, matching only 1 item out of 6 makes little sense to me either. You need to define what does make sense. It may be that regex is still the best tool for the job. But depending on the full set of possibilities that you mean by "mismatch," you might have to hand-code the (mis)match testing instead. I can imagine generating a regex string that would choke the library.
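To make the hand-coded alternative concrete, here is a minimal sketch, assuming that "mismatch" means substitutions only (no insertions or deletions) and that a hit is any window differing from the pattern in at most a chosen number of positions. It uses only the C++ standard library rather than Boost.Regex, and the names find_with_mismatches and count_wildcard_patterns are hypothetical, not part of any library. The second function counts how many "x"-style wildcard patterns a regex alternation would need under the same assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: return every start offset in `text` where `pattern`
// matches with at most `max_mismatches` substituted positions.
// Assumption: substitutions only, no insertions or deletions.
std::vector<std::size_t> find_with_mismatches(const std::string& text,
                                              const std::string& pattern,
                                              std::size_t max_mismatches)
{
    std::vector<std::size_t> hits;
    if (pattern.empty() || text.size() < pattern.size())
        return hits;

    for (std::size_t start = 0; start + pattern.size() <= text.size(); ++start) {
        std::size_t mismatches = 0;
        for (std::size_t i = 0; i < pattern.size(); ++i) {
            // Stop scanning this window as soon as the budget is exceeded.
            if (text[start + i] != pattern[i] && ++mismatches > max_mismatches)
                break;
        }
        if (mismatches <= max_mismatches)
            hits.push_back(start);
    }
    return hits;
}

// Hypothetical helper: number of distinct wildcard patterns ("xtgcta",
// "xxgcta", ...) for a pattern of length n with at most k wildcard
// positions, i.e. the sum of C(n, i) for i = 0..k.
unsigned long long count_wildcard_patterns(unsigned n, unsigned k)
{
    unsigned long long total = 0;
    unsigned long long binom = 1;  // C(n, 0)
    for (unsigned i = 0; i <= k && i <= n; ++i) {
        total += binom;
        binom = binom * (n - i) / (i + 1);  // C(n, i) -> C(n, i + 1)
    }
    return total;
}
```

With these definitions, find_with_mismatches("ccatgctaggatgcaatt", "atgcta", 1) reports offsets 2 (exact) and 10 ("atgcaa", one substitution), and count_wildcard_patterns(6, 2) is 22. The scan costs on the order of text length times pattern length regardless of how many mismatches are allowed, whereas the number of alternatives a generated regex would need grows with both the pattern length and the mismatch budget.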
Participants (1): Nat Goodspeed