ABSTRACT
The Burrows–Wheeler transform (BWT) is an invertible text transformation that, given a text T of length n, permutes its symbols according to the lexicographic order of suffixes of T. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length n, occupying O(n/log n) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n) time and O(n/log n) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n) time. Despite the clearly suboptimal running time, the existing techniques appear to have reached their limits.
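As a point of reference for the definition above, the following is a minimal sketch of a naive BWT obtained by sorting suffixes explicitly, together with a naive inverse that illustrates invertibility. It is not the algorithm proposed in the paper and runs in far worse than linear time. The function names and the `$` sentinel are assumptions made for illustration only.

```python
def bwt_naive(text: str, sentinel: str = "$") -> str:
    """Naive reference BWT via explicit suffix sorting (illustration only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Suffix array: starting positions ordered by the suffix that begins there.
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # Each sorted suffix contributes the symbol that precedes it (cyclically).
    return "".join(t[i - 1] for i in suffix_array)


def inverse_bwt_naive(bwt: str, sentinel: str = "$") -> str:
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        # Prepend the BWT column to every row, then re-sort the rows.
        table = sorted(bwt[i] + table[i] for i in range(len(bwt)))
    # The row ending with the sentinel is the original text plus the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


if __name__ == "__main__":
    transformed = bwt_naive("banana")
    print(transformed)
    print(inverse_bwt_naive(transformed))
```

Sorting whole suffixes keeps the sketch close to the definition; practical constructions avoid materializing the suffixes.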
In this paper, we propose the first algorithm that breaks the O(n)-time barrier for BWT construction. Given a binary string of length n, our procedure builds the Burrows–Wheeler transform in O(n/√(log n)) time and O(n/log n) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(m√(log m))-time solution by Chan and Pătraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n/log n) that answers Longest Common Extension queries (LCE queries) in O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/log n) time.
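To make the LCE query mentioned above concrete, here is a minimal sketch that answers it by direct character comparison. It only illustrates what the query returns: it takes time proportional to the answer, whereas the data structure described in the abstract answers in O(1) time. The function name `lce_naive` and the 0-based indexing are assumptions for illustration.

```python
def lce_naive(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    length = 0
    # Extend the match while both positions stay inside the text and agree.
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


if __name__ == "__main__":
    print(lce_naive("abracadabra", 0, 7))
```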
- Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. 2019. Quasi-linear-time algorithm for longest common circular factor. In Proc. CPM. arXiv:1901.11305. To appear.
- Alberto Apostolico. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. 85–96.
- Maxim Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. 2015. Wavelet trees meet suffix trees. In Proc. SODA. 572–591.
- Djamal Belazzougui. 2014. Linear time construction of compressed text indices in compact space. In Proc. STOC. 148–193.
- Oren Ben-Kiki, Philip Bille, Dany Breslauer, Leszek Gąsieniec, Roberto Grossi, and Oren Weimann. 2014. Towards optimal packed string matching. Theor. Comput. Sci. 525 (2014), 111–129.
- Michael A. Bender, Martin Farach-Colton, Giridhar Pemmasani, Steven Skiena, and Pavel Sumazin. 2005. Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms 57, 2 (2005), 75–94.
- Or Birenzwige, Shay Golan, and Ely Porat. 2018. Locally consistent parsing for text indexing in small space. arXiv:1812.00359.
- Michael Burrows and David J. Wheeler. 1994. A block-sorting lossless data compression algorithm. Technical Report 124. Digital Equipment Corporation.
- Timothy M. Chan and Mihai Pătraşcu. 2010. Counting inversions, offline orthogonal range counting, and related problems. In Proc. SODA. 161–173.
- Yu-Feng Chien, Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2015. Geometric BWT: compressed text indexing via sparse suffixes and range searching. Algorithmica 71, 2 (2015), 258–278.
- Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. 2007. Algorithms on strings. Cambridge University Press, Cambridge, UK.
- Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. 2000. On the sorting-complexity of suffix tree construction. J. ACM 47, 6 (2000), 987–1011.
- Paolo Ferragina and Giovanni Manzini. 2005. Indexing compressed text. J. ACM 52, 4 (2005), 552–581.
- Nathan J. Fine and Herbert S. Wilf. 1965. Uniqueness theorems for periodic functions. Proc. Am. Math. Soc. 16, 1 (1965), 109–114.
- Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proc. SODA. 841–850. http://dl.acm.org/citation.cfm?id=644108.644250
- Roberto Grossi and Jeffrey Scott Vitter. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2 (2005), 378–407.
- Dan Gusfield. 1997. Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge, UK.
- Torben Hagerup. 1998. Sorting and searching on the word RAM. In Proc. STACS, Vol. 1373. 366–398.
- Yijie Han. 2004. Deterministic sorting in O(n log log n) time and linear space. J. Algorithms 50, 1 (2004), 96–105.
- Dov Harel and Robert Endre Tarjan. 1984. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13, 2 (1984), 338–355.
- Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. 2009. Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput. 38, 6 (2009), 2162–2178.
- Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proc. FOCS. 549–554.
- Artur Jeż. 2016. Recompression: A simple and powerful technique for word equations. J. ACM 63, 1, Article 4 (2016).
- Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. 2006. Linear work suffix array construction. J. ACM 53, 6 (2006), 918–936.
- Dominik Kempa. 2019. Optimal construction of compressed indexes for highly repetitive texts. In Proc. SODA. 1344–1357.
- Dominik Kempa and Tomasz Kociumaka. 2019. String synchronizing sets: Sublinear-time BWT construction and optimal LCE queries. arXiv:1904.04228.
- Tomasz Kociumaka. 2018. Efficient data structures for internal queries in texts. PhD thesis. University of Warsaw, Warsaw, Poland. https://www.mimuw.edu.pl/~kociumaka/files/phd.pdf
- Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. 2015. Internal pattern matching queries in a text and applications. In Proc. SODA. 532–551.
- Gad M. Landau and Uzi Vishkin. 1988. Fast string matching with k differences. J. Comput. Syst. Sci. 37, 1 (1988), 63–78.
- Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, 3, Article R25 (2009).
- Heng Li and Richard Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 14 (2009), 1754–1760.
- Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. 2015. Genome-scale algorithm design: Biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, Cambridge, UK.
- Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (1993), 935–948.
- J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. 2017. Space-efficient construction of compressed indexes in deterministic linear time. In Proc. SODA. 408–424.
- J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. 2017. Text indexing and searching in sublinear time. arXiv:1712.07431.
- J. Ian Munro, Yakov Nekrich, and Jeffrey Scott Vitter. 2016. Fast construction of wavelet trees. Theor. Comput. Sci. 638 (2016), 91–97.
- Gonzalo Navarro. 2014. Wavelet trees for all. J. Discrete Algorithms 25 (2014), 2–20.
- Gonzalo Navarro. 2016. Compact data structures: A practical approach. Cambridge University Press, Cambridge, UK.
- Enno Ohlebusch. 2013. Bioinformatics algorithms: Sequence analysis, genome rearrangements, and phylogenetic reconstruction. Oldenbusch Verlag, Ulm, Germany.
- Süleyman Cenk Sahinalp and Uzi Vishkin. 1994. On a parallel-algorithms method for string matching problems. In Proc. CIAC. 22–32.
- Yuka Tanimura, Takaaki Nishimoto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. 2017. Small-space LCE data structure with constant-time queries. In Proc. MFCS.