Abstract
Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/w⌉n), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since w−m bits in the computer words are unused. In this paper, we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all of them simultaneously. Instead of spending O(rn) time to search for r patterns of length m≤w/2, we need O(⌈rm/w⌉n) time. Second, we show how the mechanism permits boosting the search for a single pattern of length m≤w/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives, especially on short patterns and a moderate number of differences allowed. This work fills an important gap in the field, where little work has focused on very short patterns.
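The packing idea can be illustrated on the simplest problem the abstract mentions, multiple exact string matching. The following is a minimal sketch, not the paper's algorithm: it applies the classical Shift-And recurrence to r equal-length patterns laid side by side in one w-bit word, so that each text character costs a constant number of word operations as long as rm ≤ w. The function name `packed_shift_and`, the parameter `w`, and the bit layout are assumptions made for illustration; the approximate-matching variants described above need more machinery than is shown here.

```python
def packed_shift_and(patterns, text, w=64):
    """Return (end_position, pattern_index) for every exact occurrence.

    Illustrative sketch: pattern i occupies bits i*m .. i*m + m - 1 of a
    single simulated w-bit word (Python ints are unbounded, so we mask).
    """
    if not patterns:
        return []
    m = len(patterns[0])
    r = len(patterns)
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("patterns must be non-empty and of equal length")
    if r * m > w:
        raise ValueError("patterns do not fit in one w-bit word")

    word_mask = (1 << w) - 1
    init = 0    # lowest bit of every pattern field
    final = 0   # highest bit of every pattern field
    table = {}  # table[c] has bit i*m + j set iff patterns[i][j] == c
    for i, p in enumerate(patterns):
        base = i * m
        init |= 1 << base
        final |= 1 << (base + m - 1)
        for j, c in enumerate(p):
            table[c] = table.get(c, 0) | (1 << (base + j))

    state = 0
    hits = []
    for pos, c in enumerate(text):
        # One step advances all r automata at once. A bit shifted out of
        # field i lands on the start bit of field i+1, which `init` sets
        # anyway, so neighbouring fields do not interfere.
        state = (((state << 1) | init) & table.get(c, 0)) & word_mask
        found = state & final
        while found:
            low = found & -found  # isolate the lowest set bit
            hits.append((pos, (low.bit_length() - 1) // m))
            found ^= low
    return hits


if __name__ == "__main__":
    print(packed_shift_and(["abc", "bca", "cab"], "abcabc"))
```

With three patterns of length 3, only 9 of the 64 bits are used, yet all three searches advance in a single update per text character. This is the saving the O(⌈rm/w⌉n) bound expresses for the multi-pattern case.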
Index Terms
- Increased bit-parallelism for approximate and multiple string matching