skip to main content
article

Increased bit-parallelism for approximate and multiple string matching

Published:31 December 2005Publication History
Skip Abstract Section

Abstract

Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/wn), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since wm bits in the computer words are unused. In this paper, we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all them simultaneously. Instead of spending O(rn) time to search for r patterns of length mw/2, we need O(⌈rm/wn) time. Second, we show how the mechanism permits boosting the search for a single pattern of length mw/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives, especially on short patterns and moderate number of differences allowed. This work fills an important gap in the field, where little work has focused on very short patterns.

References

  1. Aho, A. and Corasick, M. 1975. Efficient string matching: an aid to bibliographic search. Communications of the ACM 18, 6, 333--340. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Allauzen, C. and Raffinot, M. 1999. Factor oracle of a set of words. Technical Report 99--11, Institut Gaspard-Monge, Université de Marne-la-Vallée.Google ScholarGoogle Scholar
  3. Allison, L. and Dix, T. L. 1986. A bit-string longest common subsequence algorithm. Information Processing Letters 23, 305--310. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Baeza-Yates, R. and Gonnet, G. 1992. A new approach to text searching. Communications of the ACM 35, 10, 74--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Baeza-Yates, R. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.Google ScholarGoogle ScholarCross RefCross Ref
  6. Crochemore, M., Iliopoulos, C. S., Pinzon, Y. J., and Reid, J. F. 2001. A fast and practical bit-vector algorithm for the longest common subsequence problem. Information Processing Letters 80, 279--285. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Fredriksson, K. 2003. Row-wise tiling for the Myers' bit-parallel dynamic programming algorithm. In Proc. 10th International Symposium on String Processing and Information Retrieval (SPIRE '03). LNCS 2857. Berlin, Germany, Springer, New York. 66--79.Google ScholarGoogle ScholarCross RefCross Ref
  8. Fredriksson, K. and Navarro, G. 2004. Average-optimal single and multiple approximate string matching. ACM Journal of Experimental Algorithmics (JEA). 9, 1.4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Horspool, R. N. 1980. Practical fast searching in strings. Software Practice and Experience 10, 6, 501--506.Google ScholarGoogle ScholarCross RefCross Ref
  10. Hyyrö, H. 2001. Explaining and extending the bit-parallel approximate string matching algorithm of Myers. Tech. Rep. A-2001-10, Department of Computer and Information Sciences, University of Tampere, Tampere, Finland.Google ScholarGoogle Scholar
  11. Hyyrö, H. 2003. A bit-vector algorithm for computing Levenshtein and Damerau edit distances. Nordic Journal of Computing 10, 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Hyyrö, H. 2004. Bit-parallel LCS-length computation revisited. In Proc. 15th Australasian Workshop on Combinatorial Algorithms (AWOCA '04). 16--27.Google ScholarGoogle Scholar
  13. Hyyrö, H. and Navarro, G. 2002. Faster bit-parallel approximate string matching. In Proc. 13th Combinatorial Pattern Matching (CPM '02). LNCS 2373. Berlin, Germany, Springer, New York. 203--224. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Hyyrö, H. and Navarro, G. 2005. Bit-parallel witnesses and their applications to approximate string matching. Algorithmica 41, 3, 203--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10, 8, 707--710. {Original in Russian in Doklady Akademii Nauk SSSR, 163(4):845--848, 1965}.Google ScholarGoogle Scholar
  16. Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proc. 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). LNCS 1075. Springer-Verlag, New York. 75--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Myers, G. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic progamming. Journal of the ACM 46, 3, 395--415. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33, 1, 31--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Navarro, G. and Baeza-Yates, R. 1999. Very fast and simple approximate string matching. Information Processing Letters 72, 65--70. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Navarro, G. and Baeza-Yates, R. 2001. Improving an algorithm for approximate string matching. Algorithmica 30, 4, 473--502.Google ScholarGoogle ScholarCross RefCross Ref
  21. Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA). 5, 4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Navarro, G. and Raffinot, M. 2002. Flexible Pattern Matching in Strings---Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK. ISBN 0-521-81307-7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Sellers, P. 1980. The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms 1, 359--373.Google ScholarGoogle ScholarCross RefCross Ref
  24. Ukkonen, E. 1985a. Algorithms for approximate string matching. Information and Control 64, 1--3, 100--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ukkonen, E. 1985b. Finding approximate patterns in strings. Journal of Algorithms 6, 132--137.Google ScholarGoogle ScholarCross RefCross Ref
  26. Warren, H. S. 2003. Hacker's Delight. Addison-Wesley, Boston, MA. ISBN 0-201-91465-4. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Wu, S. and Manber, U. 1992. Fast text searching allowing errors. Communications of the ACM 35, 10, 83--91. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Increased bit-parallelism for approximate and multiple string matching

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Journal of Experimental Algorithmics
          ACM Journal of Experimental Algorithmics  Volume 10, Issue
          2005
          291 pages
          ISSN:1084-6654
          EISSN:1084-6654
          DOI:10.1145/1064546
          Issue’s Table of Contents

          Copyright © 2005 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 31 December 2005
          Published in jea Volume 10, Issue

          Qualifiers

          • article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader