DOI: 10.1145/3313276.3316368
research-article

String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure

Published: 23 June 2019

ABSTRACT

Burrows–Wheeler transform (BWT) is an invertible text transformation that, given a text T of length n, permutes its symbols according to the lexicographic order of suffixes of T. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length n, occupying O(n / log n) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n) time and O(n / log n) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n) time. Despite the clearly suboptimal running time, the existing techniques appear to have reached their limits.
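To make the definition concrete, the following is a minimal sketch of a naive BWT construction and its inverse. It is not the algorithm of the paper or of any cited work: it sorts suffixes directly, which is far slower than the O(n) bound discussed above. The function names and the use of `$` as an end-of-text sentinel (assumed to be smaller than every text symbol and absent from the text) are illustrative assumptions.

```python
def bwt_naive(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all suffixes of text + sentinel, then emit the
    symbol that precedes each suffix in sorted order."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    # Suffix array by direct comparison of suffixes (illustrative only).
    suffix_array = sorted(range(len(t)), key=lambda i: t[i:])
    # t[i - 1] with i == 0 wraps around to the sentinel.
    return "".join(t[i - 1] for i in suffix_array)


def inverse_bwt_naive(bwt: str, sentinel: str = "$") -> str:
    """Naive inversion: rebuild the sorted rotation table column by column."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(c + row for c, row in zip(bwt, table))
    # The rotation that ends with the sentinel is the original text.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt_naive("banana")
    print(transformed)
    print(inverse_bwt_naive(transformed))
```

The round trip through `inverse_bwt_naive` illustrates the invertibility mentioned in the abstract; the same functions work on binary strings such as `"0110100"` because `$` sorts before `0` and `1` in ASCII.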

In this paper, we propose the first algorithm that breaks the O(n)-time barrier for BWT construction. Given a binary string of length n, our procedure builds the Burrows–Wheeler transform in O(n / √(log n)) time and O(n / log n) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(m √(log m))-time solution by Chan and Pătraşcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n / log n) that answers Longest Common Extension queries (LCE queries) in O(1) time and, furthermore, can be deterministically constructed in the optimal O(n / log n) time.
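For readers unfamiliar with the two auxiliary problems named above, here is a minimal sketch of what they compute. Neither function reflects the paper's techniques: `lce_naive` answers a Longest Common Extension query by direct comparison instead of in O(1) time, and `count_inversions` uses textbook merge sort in O(m log m) time, not the O(m √(log m)) method of Chan and Pătraşcu. The function names and 0-based indexing are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def lce_naive(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def count_inversions(values: Sequence[int]) -> int:
    """Number of pairs (a, b) with a < b and values[a] > values[b]."""

    def sort_and_count(a: List[int]) -> Tuple[List[int], int]:
        if len(a) <= 1:
            return a, 0
        mid = len(a) // 2
        left, left_inv = sort_and_count(a[:mid])
        right, right_inv = sort_and_count(a[mid:])
        merged: List[int] = []
        inversions = left_inv + right_inv
        x = y = 0
        while x < len(left) and y < len(right):
            if left[x] <= right[y]:
                merged.append(left[x])
                x += 1
            else:
                # right[y] is smaller than every remaining element of left.
                merged.append(right[y])
                y += 1
                inversions += len(left) - x
        merged.extend(left[x:])
        merged.extend(right[y:])
        return merged, inversions

    return sort_and_count(list(values))[1]


if __name__ == "__main__":
    print(lce_naive("abracadabra", 0, 7))
    print(count_inversions([2, 4, 1, 3, 5]))
```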

References

  1. Mai Alzamel, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Juliusz Straszyński, Tomasz Waleń, and Wiktor Zuba. 2019. Quasi-linear-time algorithm for longest common circular factor. In Proc. CPM. arXiv:1901.11305. To appear.
  2. Alberto Apostolico. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. 85–96.
  3. Maxim Babenko, Paweł Gawrychowski, Tomasz Kociumaka, and Tatiana Starikovskaya. 2015. Wavelet trees meet suffix trees. In Proc. SODA. 572–591.
  4. Djamal Belazzougui. 2014. Linear time construction of compressed text indices in compact space. In Proc. STOC. 148–193.
  5. Oren Ben-Kiki, Philip Bille, Dany Breslauer, Leszek Gąsieniec, Roberto Grossi, and Oren Weimann. 2014. Towards optimal packed string matching. Theor. Comput. Sci. 525 (2014), 111–129.
  6. Michael A. Bender, Martin Farach-Colton, Giridhar Pemmasani, Steven Skiena, and Pavel Sumazin. 2005. Lowest common ancestors in trees and directed acyclic graphs. J. Algorithms 57, 2 (2005), 75–94.
  7. Or Birenzwige, Shay Golan, and Ely Porat. 2018. Locally consistent parsing for text indexing in small space. arXiv:1812.00359.
  8. Michael Burrows and David J. Wheeler. 1994. A block-sorting lossless data compression algorithm. Technical Report 124. Digital Equipment Corporation.
  9. Timothy M. Chan and Mihai Pătraşcu. 2010. Counting inversions, offline orthogonal range counting, and related problems. In Proc. SODA. 161–173.
  10. Yu-Feng Chien, Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. 2015. Geometric BWT: compressed text indexing via sparse suffixes and range searching. Algorithmica 71, 2 (2015), 258–278.
  11. Maxime Crochemore, Christophe Hancart, and Thierry Lecroq. 2007. Algorithms on strings. Cambridge University Press, Cambridge, UK.
  12. Martin Farach-Colton, Paolo Ferragina, and S. Muthukrishnan. 2000. On the sorting-complexity of suffix tree construction. J. ACM 47, 6 (2000), 987–1011.
  13. Paolo Ferragina and Giovanni Manzini. 2005. Indexing compressed text. J. ACM 52, 4 (2005), 552–581.
  14. Nathan J. Fine and Herbert S. Wilf. 1965. Uniqueness theorems for periodic functions. Proc. Am. Math. Soc. 16, 1 (1965), 109–114.
  15. Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. 2003. High-order entropy-compressed text indexes. In Proc. SODA. 841–850. http://dl.acm.org/citation.cfm?id=644108.644250
  16. Roberto Grossi and Jeffrey Scott Vitter. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2 (2005), 378–407.
  17. Dan Gusfield. 1997. Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge, UK.
  18. Torben Hagerup. 1998. Sorting and searching on the word RAM. In Proc. STACS, Vol. 1373. 366–398.
  19. Yijie Han. 2004. Deterministic sorting in O(n log log n) time and linear space. J. Algorithms 50, 1 (2004), 96–105.
  20. Dov Harel and Robert Endre Tarjan. 1984. Fast algorithms for finding nearest common ancestors. SIAM J. Comput. 13, 2 (1984), 338–355.
  21. Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. 2009. Breaking a time-and-space barrier in constructing full-text indices. SIAM J. Comput. 38, 6 (2009), 2162–2178.
  22. Guy Jacobson. 1989. Space-efficient static trees and graphs. In Proc. FOCS. 549–554.
  23. Artur Jeż. 2016. Recompression: A simple and powerful technique for word equations. J. ACM 63, 1, Article 4 (2016).
  24. Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. 2006. Linear work suffix array construction. J. ACM 53, 6 (2006), 918–936.
  25. Dominik Kempa. 2019. Optimal construction of compressed indexes for highly repetitive texts. In Proc. SODA. 1344–1357.
  26. Dominik Kempa and Tomasz Kociumaka. 2019. String synchronizing sets: Sublinear-time BWT construction and optimal LCE queries. arXiv:1904.04228.
  27. Tomasz Kociumaka. 2018. Efficient data structures for internal queries in texts. PhD thesis. University of Warsaw, Warsaw, Poland. https://www.mimuw.edu.pl/~kociumaka/files/phd.pdf
  28. Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. 2015. Internal pattern matching queries in a text and applications. In Proc. SODA. 532–551.
  29. Gad M. Landau and Uzi Vishkin. 1988. Fast string matching with k differences. J. Comput. Syst. Sci. 37, 1 (1988), 63–78.
  30. Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L Salzberg. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, 3, Article R25 (2009).
  31. Heng Li and Richard Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 14 (2009), 1754–1760.
  32. Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. 2015. Genome-scale algorithm design: Biological sequence analysis in the era of high-throughput sequencing. Cambridge University Press, Cambridge, UK.
  33. Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5 (1993), 935–948.
  34. J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. 2017. Space-efficient construction of compressed indexes in deterministic linear time. In Proc. SODA. 408–424.
  35. J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. 2017. Text indexing and searching in sublinear time. arXiv:1712.07431.
  36. J. Ian Munro, Yakov Nekrich, and Jeffrey Scott Vitter. 2016. Fast construction of wavelet trees. Theor. Comput. Sci. 638 (2016), 91–97.
  37. Gonzalo Navarro. 2014. Wavelet trees for all. J. Discrete Algorithms 25 (2014), 2–20.
  38. Gonzalo Navarro. 2016. Compact data structures: A practical approach. Cambridge University Press, Cambridge, UK.
  39. Enno Ohlebusch. 2013. Bioinformatics algorithms: Sequence analysis, genome rearrangements, and phylogenetic reconstruction. Oldenbusch Verlag, Ulm, Germany.
  40. Süleyman Cenk Sahinalp and Uzi Vishkin. 1994. On a parallel-algorithms method for string matching problems. In Proc. CIAC. 22–32.
  41. Yuka Tanimura, Takaaki Nishimoto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. 2017. Small-space LCE data structure with constant-time queries. In Proc. MFCS.

Published in

STOC 2019: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing
June 2019
1258 pages
ISBN: 9781450367059
DOI: 10.1145/3313276

Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
