String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure

Authors:
Dominik Kempa

University of Warwick, UK

Tomasz Kociumaka

University of Warsaw, Poland / Bar-Ilan University, Israel

STOC 2019: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, June 2019, Pages 756–767. https://doi.org/10.1145/3313276.3316368

Published: 23 June 2019

ABSTRACT

The Burrows–Wheeler transform (BWT) is an invertible text transformation that, given a text T of length n, permutes its symbols according to the lexicographic order of suffixes of T. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length n, occupying O(n/log n) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n) time and O(n/log n) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n) time. Despite the clearly suboptimal running time, the existing techniques appear to have reached their limits.
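
As an illustration of the definition above (and not of any algorithm from the paper), the following minimal Python sketch computes the BWT by explicitly sorting suffixes and then inverts it. The function names `bwt_naive` and `inverse_bwt` and the `$` end-of-text sentinel are assumptions made for this example; the suffix sort is far slower than the O(n) and sublinear constructions discussed here.

```python
def bwt_naive(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all suffixes of text + sentinel, then output the
    symbol that cyclically precedes each suffix in sorted order."""
    # Assumption of this sketch: the sentinel is strictly smaller than
    # every character of the text (and therefore does not occur in it).
    if any(c <= sentinel for c in text):
        raise ValueError("sentinel must be smaller than every text character")
    t = text + sentinel
    # Suffix array by direct comparison of suffixes (quadratic or worse).
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # For i == 0, t[i - 1] is t[-1], i.e. the sentinel (cyclic predecessor).
    return "".join(t[i - 1] for i in suffix_array)


def inverse_bwt(bwt: str) -> str:
    """Invert a BWT produced by bwt_naive (unique, smallest sentinel)."""
    n = len(bwt)
    # Stable sort of positions by character: order[k] is the position in
    # the BWT of the character that starts the k-th smallest suffix.
    order = sorted(range(n), key=lambda i: bwt[i])
    out = []
    k = order[0]  # row 0 is the sentinel suffix; order[0] leads to text[0]
    for _ in range(n - 1):
        out.append(bwt[order[k]])
        k = order[k]
    return "".join(out)


if __name__ == "__main__":
    transformed = bwt_naive("banana")
    print(transformed)               # annb$aa
    print(inverse_bwt(transformed))  # banana
```

The round trip shows why the transform is called invertible: the permuted string alone, together with the sentinel convention, determines the original text.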

In this paper, we propose the first algorithm that breaks the O(n)-time barrier for BWT construction. Given a binary string of length n, our procedure builds the Burrows–Wheeler transform in O(n/√(log n)) time and O(n/log n) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(m√(log m))-time solution by Chan and Pătraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n/log n) that answers Longest Common Extension queries (LCE queries) in O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/log n) time.
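
To make the two auxiliary problems named above concrete, here is a small Python sketch under stated assumptions. `lce_naive` answers a Longest Common Extension query by direct character comparison (time linear in the answer, with no preprocessing, unlike the O(1)-time structure described in the abstract), and `count_inversions` uses the textbook merge-sort method running in O(m log m) time, not the Chan and Pătraşcu algorithm. Both function names and the 0-based indexing are choices made for this example.

```python
from typing import List, Sequence, Tuple


def lce_naive(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length


def count_inversions(values: Sequence[int]) -> int:
    """Number of pairs (p, q) with p < q and values[p] > values[q],
    counted while merge sorting."""

    def sort_count(a: List[int]) -> Tuple[List[int], int]:
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, inv_left = sort_count(a[:mid])
        right, inv_right = sort_count(a[mid:])
        merged: List[int] = []
        inversions = inv_left + inv_right
        li = ri = 0
        while li < len(left) and ri < len(right):
            if left[li] <= right[ri]:
                merged.append(left[li])
                li += 1
            else:
                # right[ri] is smaller than every remaining left element,
                # so it forms an inversion with each of them.
                merged.append(right[ri])
                ri += 1
                inversions += len(left) - li
        merged.extend(left[li:])
        merged.extend(right[ri:])
        return merged, inversions

    return sort_count(list(values))[1]


if __name__ == "__main__":
    print(lce_naive("abracadabra", 0, 7))     # 4 ("abra")
    print(count_inversions([2, 4, 1, 3, 5]))  # 3
```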

References

Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. 2019. Quasi-linear-time algorithm for longest common circular factor. In Proc. CPM. arXiv: 1901.11305. To appear.
Alberto Apostolico. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. 85–96.
Maxim Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. 2015. Wavelet trees meet suffix trees. In Proc. SODA. 572–591.
Djamal Belazzougui. 2014. Linear time construction of compressed text indices in compact space. In Proc. STOC. 148–193.
Oren Ben-Kiki, Philip Bille, Dany Breslauer, Leszek Gąsieniec, Roberto Grossi, and Oren Weimann. 2014. Towards optimal packed string matching. Theor. Comput. Sci. 525 (2014), 111–129.
Michael A. Bender, Martin Farach-Colton, Giridhar Pemmasani, Steven Skiena, and Pavel Sumazin. 2005. Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms 57, 2 (2005), 75–94.
Or Birenzwige, Shay Golan, and Ely Porat. 2018. Locally consistent parsing for text indexing in small space. arXiv: 1812.00359
Michael Burrows and David J. Wheeler. 1994. A block-sorting lossless data compression algorithm. Technical Report 124. Digital Equipment Corporation.
Timothy M. Chan and Mihai Pătraşcu. 2010. Counting inversions, offline orthogonal range counting, and related problems. In Proc. SODA. 161–173.
Yu-Feng Chien, Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2015. Geometric BWT: compressed text indexing via sparse suffixes and range searching. Algorithmica 71, 2 (2015), 258–278.
Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. 2007. Algorithms on strings. Cambridge University Press, Cambridge, UK.
Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. 2000. On the sorting-complexity of suffix tree construction. J. ACM 47, 6 (2000), 987–1011.
Paolo Ferragina and Giovanni Manzini. 2005. Indexing compressed text. J. ACM 52, 4 (2005), 552–581.
Nathan J. Fine and Herbert S. Wilf. 1965. Uniqueness theorems for periodic functions. Proc. Am. Math. Soc. 16, 1 (1965), 109–114.
Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proc. SODA. 841–850. http://dl.acm.org/citation.cfm?id=644108.644250
Roberto Grossi and Jeffrey Scott Vitter. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2 (2005), 378–407.
Dan Gusfield. 1997. Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge, UK.
Torben Hagerup. 1998. Sorting and searching on the word RAM. In Proc. STACS, Vol. 1373. 366–398.
Yijie Han. 2004. Deterministic sorting in O(n log log n) time and linear space. J. Algorithms 50, 1 (2004), 96–105.
Dov Harel and Robert Endre Tarjan. 1984. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13, 2 (1984), 338–355.
Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. 2009. Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput. 38, 6 (2009), 2162–2178.
Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proc. FOCS. 549–554.
Artur Jeż. 2016. Recompression: A simple and powerful technique for word equations. J. ACM 63, 1, Article 4 (2016).
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. 2006. Linear work suffix array construction. J. ACM 53, 6 (2006), 918–936.
Dominik Kempa. 2019. Optimal construction of compressed indexes for highly repetitive texts. In Proc. SODA. 1344–1357.
Dominik Kempa and Tomasz Kociumaka. 2019. String synchronizing sets: Sublinear-time BWT construction and optimal LCE queries. arXiv: 1904.04228
Tomasz Kociumaka. 2018. Efficient data structures for internal queries in texts. PhD thesis. University of Warsaw, Warsaw, Poland. https://www.mimuw.edu.pl/~kociumaka/files/phd.pdf
Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. 2015. Internal pattern matching queries in a text and applications. In Proc. SODA. 532–551.
Gad M. Landau and Uzi Vishkin. 1988. Fast string matching with k differences. J. Comput. Syst. Sci. 37, 1 (1988), 63–78.
Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, 3, Article R25 (2009).
Heng Li and Richard Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 14 (2009), 1754–1760.
Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. 2015. Genome-scale algorithm design: Biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, Cambridge, UK.
Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (1993), 935–948.
J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. 2017. Space-efficient construction of compressed indexes in deterministic linear time. In Proc. SODA. 408–424.
J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. 2017. Text indexing and searching in sublinear time. arXiv: 1712.07431
J. Ian Munro, Yakov Nekrich, and Jeffrey Scott Vitter. 2016. Fast construction of wavelet trees. Theor. Comput. Sci. 638 (2016), 91–97.
Gonzalo Navarro. 2014. Wavelet trees for all. J. Discrete Algorithms 25 (2014), 2–20.
Gonzalo Navarro. 2016. Compact data structures: A practical approach. Cambridge University Press, Cambridge, UK.
Enno Ohlebusch. 2013. Bioinformatics algorithms: Sequence analysis, genome rearrangements, and phylogenetic reconstruction. Oldenbusch Verlag, Ulm, Germany.
Süleyman Cenk Sahinalp and Uzi Vishkin. 1994. On a parallel-algorithms method for string matching problems. In Proc. CIAC. 22–32.
Yuka Tanimura, Takaaki Nishimoto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. 2017. Small-space LCE data structure with constant-time queries. In Proc. MFCS.

Index Terms

String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure
1. Theory of computation
  1. Computational complexity and cryptography
    1. Problems, reductions and completeness
  2. Design and analysis of algorithms
    1. Data structures design and analysis

Published in
STOC 2019: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing
June 2019
1258 pages
ISBN: 9781450367059
DOI: 10.1145/3313276
General Chair: Moses Charikar, Stanford University
Program Chair: Edith Cohen, Google, USA / Tel Aviv University, Israel
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Burrows-Wheeler transform
Longest Common Extension queries
Longest Common Prefix queries
packed strings