6 - Approximate matching
Published online by Cambridge University Press: 18 December 2014
Summary
Basic concepts
Approximate string matching, also called “string matching allowing errors,” is the problem of finding a pattern p in a text T when a limited number k of differences is permitted between the pattern and its occurrences in the text.
From the many existing models defining a “difference,” we focus on the most popular one, called Levenshtein distance or edit distance [Lev65]. Other more complex models exist, especially in computational biology, but the edit distance model has received the most attention and the most effective algorithms have been developed for it. Some of these algorithms can be extended to more complex models.
Under edit distance, one difference equals one edit operation: a character insertion, deletion, or substitution. That is, the edit distance between two strings x and y, ed(x, y), is the minimum number of edit operations required to convert x into y, or vice versa. For example, ed("annual", "annealing") = 4: one substitution (u by e) followed by three insertions (i, n, g). The approximate string matching problem becomes that of finding all occurrences in T of every string p′ that satisfies ed(p, p′) ≤ k. To ensure a linear-size output, it is customary to report only the starting or ending positions of the occurrences.
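The definition of ed(x, y) can be made concrete with a short sketch. The function below is an illustration only, not code from the chapter: the name `edit_distance` and the row-by-row layout are assumptions, and it uses the standard dynamic programming recurrence for Levenshtein distance.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance between x and y (illustrative sketch)."""
    # prev[j] holds ed(x[:i-1], y[:j]); the first row is ed("", y[:j]) = j insertions.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # ed(x[:i], "") = i deletions
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx from x
                            curr[j - 1] + 1,      # insert cy into x
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("annual", "annealing"))  # 4
```

Keeping only two rows uses O(|y|) memory; the time is O(|x| · |y|).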
Note that the problem only makes sense for 0 < k < m, where m is the length of p, because for k ≥ m every text substring of length m can be converted into p by substituting the m characters, so every text position would match. The case k = 0 corresponds to exact string matching. We call α = k/m the “error level.” It gives a measure of the “fraction” of the pattern that can be altered.
We concentrate on algorithms that are the fastest in the cases that are likely to be of use in some foreseeable application, particularly text retrieval and computational biology. In particular, α < 1/2 in most cases of interest.
We present four approaches. The first approach, which is also the oldest and most flexible, adapts a dynamic programming algorithm that computes edit distance.
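As a rough illustration of how an edit distance computation can be adapted to searching, the sketch below processes the text one character at a time and keeps a single column of the dynamic programming matrix. The top cell of the column stays at 0, so that an occurrence may begin at any text position. The function name `approx_search` and the choice to report 0-based ending positions are assumptions made for this example; this is the classical textbook scheme, not necessarily the exact formulation used in the chapter.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """End positions j (0-based) where some substring of text ending at j
    is within k edit operations of pattern (illustrative sketch)."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and a suffix of the
    # text read so far; before any text is read this is i.
    col = list(range(m + 1))
    ends = []
    for j, ct in enumerate(text):
        prev_diag = col[0]  # col[0] is always 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]  # value from the previous text position
            cost = 0 if pattern[i - 1] == ct else 1
            col[i] = min(old + 1,           # ct is an extra text character
                         col[i - 1] + 1,    # pattern[i-1] is missing from the text
                         prev_diag + cost)  # match or substitute
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_search("abc", "xxabcxx", 1))  # [3, 4, 5]
```

With k = 0 the same call reports only the exact occurrence ending at position 4, and with k ≥ m every text position is reported, which matches the remark that the problem is only meaningful for 0 < k < m. The running time is O(mn) for a text of length n.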
- Type: Chapter
- Information: Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences, pp. 145-184
- Publisher: Cambridge University Press
- Print publication year: 2002