Abstract
Scaffolding is the final step in assembling Next Generation Sequencing data, in which pre-assembled contiguous regions (”contigs”) are oriented and ordered using information that links them (for example, mapping of paired-end reads). As the genome of some species is highly repetitive, we allow placing some contigs multiple times, thereby generalizing established computational models for this problem. We study the subsequent problems induced by the translation of solutions of the model back to actual sequences, proposing models and analyzing the complexity of the resulting computational problems. We find both polynomial-time and \(\mathcal {NP}\)-hard special cases like planarity or bounded degree. Finally, we propose two polynomial-time approximation algorithms according to cut/weight score.
Similar content being viewed by others
Notes
Solution graphs differ from scaffold graphs in that the multiplicity function is defined on all edges and not just on matching edges.
A sequences is called chimeric if it does not occur in the target genome, but is made up of chunks picked from different chromosomes or regions of the genome.
N50 is a statistical measure on contig lengths: given a set of contigs, the N50 is defined as the sequence length of the shortest contig at 50% of the total genome length.
A graph is co-bipartite if its vertices can be partitioned into two cliques.
The (widely believed) “Exponential Time Hypothesis” (ETH) states that the boolean satisfyability problem (SAT) cannot be solved in \(2^{o(n)}\) time, where n is the number of variables of the input formula.
References
Anselmetti, Y., Berry, V., Chauve, C., Chateau, A., Tannier, E., Bérard, S.: Ancestral gene synteny reconstruction improves extant species scaffolding. BMC Genom. 16(10), S11 (2015)
Berg, M.D., Khosravi, A.: Optimal binary space partitions for segments in the plane. Int. J. Comput. Geom. Appl. 22(3), 187–206 (2012)
Berman, P., Karpinski, M.: On some tighter inapproximability results (extended abstract). In: Proceedings of the 26th International Colloquium on Automata, Languages and Programming, pp. 200–209 (1999)
Berman, P., Karpinski, M., Scott, A.D.: Approximation hardness and satisfiability of bounded occurrence instances of SAT. In: Electronic Colloquium on Computational Complexity (ECCC) 10(022) (2003)
Burton, J.N., Adey, A., Patwardhan, R.P., Qiu, R., Kitzman, J.O., Shendure, J.: Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat. Biotechnol. 31, 1119–1125 (2013)
Cao, M.D., Nguyen, S.H., Ganesamoorthy, D., Elliott, A.G., Cooper, M.A., Coin, L.J.M.: Scaffolding and completing genome assemblies in real-time with nanopore sequencing. Nat. Commun. 8, 14515 (2017)
Chateau, A., Giroudeau, R.: A complexity and approximation framework for the maximization scaffolding problem. Theor. Comput. Sci. 595, 92–106 (2015)
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithms Mol. Biol. 8, 22 (2013)
Crescenzi, P.: A short guide to approximation preserving reductions. In: Proceedings of the Twelfth Annual IEEE Conference on Computational Complexity, Ulm, Germany, 24–27 June 1997, pp 262–273 (1997)
Dayarian, A., Michael, T.P., Sengupta, A.M.: SOPRA: Scaffolding algorithm for paired reads via statistical optimization. BMC Bioinform. 11, 345 (2010)
Dinur, I., Safra, S.: On the hardness of approximation minimum vertex cover. Ann. Math. 162(1), 439–485 (2005)
Donmez, N., Brudno, M.L.: SCARPA: scaffolding reads with practical algorithms. Bioinformatics 29(4), 428–434 (2013)
Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York (1979)
Gritsenko, A.A., Nijkamp, J.F., Reinders, M.J.T., de Ridder, D.: GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics 28(11), 1429–1437 (2012)
Håstad, J.: Some optimal inapproximability results. J. ACM 48(4), 798–859 (2001)
Hunt, M., Newbold, C., Berriman, M., Otto, T.: A comprehensive evaluation of assembly scaffolding tools. Genome Biol. 15(3), 42 (2014)
Impagliazzo, R., Paturi, R., Zane, F.: Which problems have strongly exponential complexity? J. Comput. Syst. Sci. 63(4), 512–530 (2001)
Khot, S., Regev, O.: Vertex cover might be hard to approximate to within 2-epsilon. J. Comput. Syst. Sci. 74(3), 335–349 (2008)
Khot, S., Kindler, G., Mossel, E., O’Donnell, R.: Optimal inapproximability results for MAX-CUT and other 2-variable CSPs? SIAM J. Comput. 37(1), 319–357 (2007)
Kolodner, R., Tewari, K.K.: Inverted repeats in chloroplast DNA from higher plants*. Proc. Natl. Acad. Sci. U. S. A. 76(1), 41–45 (1979)
Koren, S., Treangen, T.J., Pop, M.: Bambus 2: scaffolding metagenomes. Bioinformatics 27(21), 2964–2971 (2011)
Lerat, E.: Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 104(6), 520–533 (2010)
Mandric, I., Zelikovsky, A.: ScaffMatch: scaffolding algorithm based on maximum weight matching. Bioinformatics 31(16), 2632–2638 (2015)
Mandric, I., Lindsay, J., Măndoiu, I.I., Zelikovsky, A.: Scaffolding algorithms, chap 5. In: Măndoiu, I., Zelikovsky, A. (eds.) Computational Methods for Next Generation Sequencing Data Analysis, pp. 107–132. Wiley, Hoboken (2016)
Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010)
Morey, M., Fernández-Marmiesse, A., Castiñeiras, D., Fraga, J.M., Couce, M.L., Cocho, J.A.: A glimpse into past, present, and future DNA sequencing. Mol. Genet. Metab. 110(1), 3–24 (2013). (Special Issue: Diagnosis)
Mostovoy, Y., Levy-Sakin, M., Lam, J., Lam, E.T., Hastie, A.R., Marks, P., Lee, J., Chu, C., Lin, C., Dzakula, Z., Cao, H., Schlebusch, S.A., Giorda, K., Schnall-Levin, M., Wall, J.D., Kwok, P.Y.: A hybrid approach for de novo human genome sequence assembly and phasing. Nat. Meth. 13(7), 587–590 (2016)
Papadimitriou, C.H., Yannakakis, M.: Optimization, approximation, and complexity classes. J. Comput. Syst. Sci. 43(3), 425–440 (1991)
Phillippy, A.M.: New advances in sequence assembly. Genome Res. 27(5), 11–13 (2017)
Sahlin, K., Vezzi, F., Nystedt, B., Lundeberg, J., Arvestad, L.: BESST—efficient scaffolding of large fragmented assemblies. BMC Bioinform. 15(1), 281 (2014)
Tabary, D., Davot, T., Weller, M., Chateau, A., Giroudeau, R.: New results about the linearization of scaffolds sharing repeated contigs. In: Combinatorial Optimization and Applications—12th International Conference, COCOA 2018, Atlanta, GA, USA, 15–17 Dec 2018, Proceedings, pp 94–107 (2018)
Tang, H.: Genome assembly, rearrangement, and repeats. Chem. Rev. 107(8), 3391–3406 (2007)
Treangen, T.J., Salzberg, S.L.: Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13(1), 36–46 (2012)
Vezzi, F., Narzisi, G., Mishra, B.: Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS ONE 7(12), 52210 (2012)
Weller, M., Chateau, A., Giroudeau, R.: Exact approaches for scaffolding. BMC Bioinform. 16(Suppl 14), S2 (2015)
Weller, M., Chateau, A., Giroudeau, R.: On the linearization of scaffolds sharing repeated contigs. In: Proceedings of the 11th COCOA’17, pp 509–517 (2017)
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008)
Acknowledgements
This work was partially supported by the Région Occitanie.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Davot, T., Chateau, A., Giroudeau, R. et al. Producing Genomic Sequences after Genome Scaffolding with Ambiguous Paths: Complexity, Approximation and Lower Bounds. Algorithmica 83, 2063–2095 (2021). https://doi.org/10.1007/s00453-021-00819-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-021-00819-6