research-article

Suffix Array Construction in External Memory Using D-Critical Substrings

Published: 01 January 2014

Abstract

We present a new suffix array construction algorithm that builds, in external memory, the suffix array for an input string of length n, where n is on the order of tens of giga-characters, over a constant or integer alphabet. The core of the algorithm is adapted from the framework of the original internal-memory SA-DS algorithm, which samples fixed-size d-critical substrings. The new external-memory algorithm, called EM-SA-DS, uses novel cache data structures to construct a suffix array by sequential scanning with good spatial locality: data is read from or written to disk sequentially. On the assumed external-memory model with RAM capacity Ω((nB)^0.5), disk capacity O(n), and I/O block size B, all measured in (log n)-bit words, the I/O complexity of EM-SA-DS is O(n/B). This work provides a general cache-based solution that could be further exploited to develop external-memory solutions for other suffix-array-related problems, such as computing the longest-common-prefix array, on a modern personal computer with a typical configuration of 4 GB RAM and a single disk.
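
To make the quantities in the abstract concrete, the following minimal Python sketch shows what a suffix array is and evaluates the stated model bounds for sample values. It is not EM-SA-DS: `naive_suffix_array` is a simple in-memory sort used only to illustrate the output, and `model_estimates` is a hypothetical helper that drops the constants hidden by the Ω and O notation. The sample values of n and B are assumptions chosen for illustration.

```python
import math


def naive_suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order.

    Illustration only: this sorts whole suffixes in memory and is unrelated
    to the external-memory algorithm described in the abstract.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def model_estimates(n, block_words):
    """Evaluate the abstract's bounds with all hidden constants dropped.

    Returns (ram_words, io_blocks), where ram_words is floor(sqrt(n * B))
    and io_blocks is ceil(n / B). Both are orders of magnitude, not
    measured costs.
    """
    ram_words = math.isqrt(n * block_words)
    io_blocks = -(-n // block_words)  # ceiling division
    return ram_words, io_blocks


if __name__ == "__main__":
    print(naive_suffix_array("banana"))
    # Assumed sample values: n = 2**34 words, B = 2**16 words per block.
    print(model_estimates(2**34, 2**16))
```

With these assumed values, the RAM bound is on the order of 2^25 words and the I/O bound on the order of 2^18 block transfers, which suggests why an input of tens of giga-characters could fit the model on a machine with a few gigabytes of RAM.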



        • Published in

          ACM Transactions on Information Systems  Volume 32, Issue 1
          January 2014
          123 pages
          ISSN:1046-8188
          EISSN:1558-2868
          DOI:10.1145/2576772

          Copyright © 2014 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 1 January 2014
          • Accepted: 1 August 2013
          • Revised: 1 June 2013
          • Received: 1 November 2012


          Qualifiers

          • research-article
          • Research
          • Refereed
