Abstract
We present a new suffix array construction algorithm designed to build, in external memory, the suffix array of an input string of length n, where n is on the order of tens of giga-characters, over a constant or integer alphabet. The core of the algorithm is adapted from the framework of the original internal-memory SA-DS algorithm, which samples fixed-size d-critical substrings. The new external-memory algorithm, called EM-SA-DS, uses novel cache data structures to construct the suffix array by sequential scanning with good spatial locality: data is read from and written to disk sequentially. In the assumed external-memory model, with RAM capacity Ω((nB)^0.5), disk capacity O(n), and I/O block size B, all measured in words of log n bits, the I/O complexity of EM-SA-DS is O(n/B). This work provides a general cache-based approach that could be further exploited to develop external-memory solutions for other suffix-array-related problems, such as computing the longest-common-prefix array, on a modern personal computer with a typical configuration of 4GB RAM and a single disk.
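For readers unfamiliar with the object being constructed, the following minimal Python sketch shows what a suffix array is. It is a naive in-memory reference and not EM-SA-DS: it uses no d-critical sampling, caches, or disk I/O, and the function name `naive_suffix_array` is chosen here purely for illustration.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`,
    ordered so that the suffixes are in lexicographic order.

    Illustrative only: comparing whole suffixes costs up to O(n) per
    comparison, so this is far slower than linear-time or
    external-memory constructions and only suits short strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    # Suffixes of "banana" in sorted order:
    # a(5), ana(3), anana(1), banana(0), na(4), nana(2)
    print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The sketch holds every suffix in RAM and accesses the text at arbitrary positions, which is exactly what becomes impractical at the input sizes the abstract targets and what motivates a sequential-scan, external-memory design.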