6 - Approximate matching

Published online by Cambridge University Press:  18 December 2014

Gonzalo Navarro
Affiliation:
Universidad de Chile
Mathieu Raffinot
Affiliation:
Centre National de la Recherche Scientifique (CNRS), Paris

Summary

Basic concepts

Approximate string matching, also called “string matching allowing errors,” is the problem of finding a pattern p in a text T when a limited number k of differences is permitted between the pattern and its occurrences in the text.

From the many existing models defining a “difference,” we focus on the most popular one, called Levenshtein distance or edit distance [Lev65]. Other more complex models exist, especially in computational biology, but the edit distance model has received the most attention and the most effective algorithms have been developed for it. Some of these algorithms can be extended to more complex models.

Under edit distance, one difference equals one edit operation: a character insertion, deletion, or substitution. That is, the edit distance between two strings x and y, ed(x, y), is the minimum number of edit operations required to convert x into y, or vice versa. For example, ed(annual, annealing) = 4. The approximate string matching problem becomes that of finding all occurrences in T of every p′ that satisfies ed(p,p′) ≤ k. To ensure a linear size output it is customary to report only the starting or ending positions of the occurrences.
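The definition of ed(x, y) can be made concrete with a short dynamic programming sketch. The function name `edit_distance` and the row-by-row layout are illustrative choices rather than something taken from the chapter. The code assumes unit cost for each insertion, deletion, and substitution, as in the definition above.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance between x and y, computed one row at a time."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    # Initially the prefix of x is empty, so the distance is j insertions.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance between x[:i] and the empty string: i deletions
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("annual", "annealing"))  # 4
```

The value 4 agrees with the example in the text: one substitution (u to e) followed by three insertions (i, n, g).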

Note that the problem only makes sense for 0 < k < m, where m is the length of p. If k ≥ m, every text substring of length m can be converted into p by substituting its m characters, so every text position would match. The case k = 0 corresponds to exact string matching. We call α = k/m the “error level.” It gives a measure of the “fraction” of the pattern that can be altered.

We concentrate on algorithms that are the fastest in the cases that are likely to be of use in some foreseeable application, particularly text retrieval and computational biology. In particular, α < 1/2 in most cases of interest.

We present four approaches. The first approach, which is also the oldest and most flexible, adapts a dynamic programming algorithm that computes edit distance.
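As a rough illustration of how an edit distance computation can be adapted to searching, the sketch below uses the classical column-wise formulation in which an occurrence may start at any text position at no cost. This is an assumption about the general shape of the technique, not a transcription of the chapter's algorithm. The name `approx_search` and the choice to report 1-based ending positions are hypothetical.

```python
from typing import List


def approx_search(p: str, t: str, k: int) -> List[int]:
    """Return 1-based text positions where an occurrence of p with at most
    k differences ends."""
    m = len(p)
    # col[i] = minimum edit distance between p[:i] and any text substring
    # ending at the current text position. Before reading any text, p[:i]
    # can only be matched against the empty string, at a cost of i.
    col = list(range(m + 1))
    ends = []
    for j, ct in enumerate(t, start=1):
        diag = col[0]  # value of col[i - 1] from the previous text position
        # col[0] stays 0: an occurrence may begin anywhere in the text.
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if p[i - 1] == ct else 1
            col[i] = min(diag + cost,     # match or substitute
                         old + 1,         # extra text character
                         col[i - 1] + 1)  # missing pattern character
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends


print(approx_search("abc", "xxabcxx", 0))  # [5]
print(approx_search("abc", "abd", 1))      # [2, 3]
```

The only change from a plain edit distance computation is that the entry for the empty pattern prefix is held at 0 for every text position. The sketch takes O(mn) time for a text of length n and keeps a single column of m + 1 values.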

Chapter in Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences, pp. 145-184
Publisher: Cambridge University Press
Print publication year: 2002

  • Approximate matching
  • Gonzalo Navarro, Universidad de Chile, Mathieu Raffinot, Centre National de la Recherche Scientifique (CNRS), Paris
  • Book: Flexible Pattern Matching in Strings
  • Online publication: 18 December 2014
  • Chapter DOI: https://doi.org/10.1017/CBO9781316135228.006