Producing Genomic Sequences after Genome Scaffolding with Ambiguous Paths: Complexity, Approximation and Lower Bounds

Algorithmica

Abstract

Scaffolding is the final step in assembling Next Generation Sequencing data, in which pre-assembled contiguous regions (“contigs”) are oriented and ordered using information that links them (for example, mapping of paired-end reads). As the genomes of some species are highly repetitive, we allow some contigs to be placed multiple times, thereby generalizing established computational models for this problem. We study the subsequent problems induced by translating solutions of the model back to actual sequences, proposing models and analyzing the complexity of the resulting computational problems. We identify both polynomial-time solvable and \(\mathcal {NP}\)-hard special cases, under restrictions such as planarity or bounded degree. Finally, we propose two polynomial-time approximation algorithms according to cut/weight score.

Notes

  1. Solution graphs differ from scaffold graphs in that the multiplicity function is defined on all edges and not just on matching edges.

  2. A sequence is called chimeric if it does not occur in the target genome, but is made up of chunks picked from different chromosomes or regions of the genome.

  3. N50 is a statistical measure on contig lengths: given a set of contigs sorted by decreasing length, the N50 is the length of the shortest contig in the smallest prefix of this order whose lengths sum to at least 50% of the total length of all contigs.

  4. A graph is co-bipartite if its vertices can be partitioned into two cliques.

  5. The (widely believed) “Exponential Time Hypothesis” (ETH) states that the Boolean satisfiability problem (SAT) cannot be solved in \(2^{o(n)}\) time, where \(n\) is the number of variables of the input formula.
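
The definitions in Notes 3 and 4 can be made concrete with a small sketch. It is an illustration only and is not taken from the article: the function names `n50` and `is_co_bipartite` are hypothetical, the N50 is computed relative to the summed contig lengths (an assumption, since the true genome length is usually unknown), and co-bipartiteness is tested through the equivalent condition that the complement graph is bipartite.

```python
from collections import deque


def n50(contig_lengths):
    """Length of the shortest contig in the smallest set of longest contigs
    whose lengths sum to at least half of the total contig length."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached 50% of the total length
            return length


def is_co_bipartite(vertices, edges):
    """True if the vertices can be split into two cliques (one may be empty),
    which holds exactly when the complement graph is bipartite."""
    vertices = list(vertices)
    adjacent = {v: set() for v in vertices}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    side = {}  # 2-coloring of the complement graph
    for start in vertices:
        if start in side:
            continue
        side[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in vertices:
                if w == u or w in adjacent[u]:
                    continue  # only non-edges belong to the complement
                if w not in side:
                    side[w] = 1 - side[u]
                    queue.append(w)
                elif side[w] == side[u]:
                    return False  # odd cycle in the complement
    return True
```

For example, contigs of lengths 2, 3, 4, 5 and 6 have total length 20; the two longest contigs already cover 11, so the N50 is 5. The complement-based test is quadratic in the number of vertices, which is adequate for illustration but not tuned for large graphs.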

Acknowledgements

This work was partially supported by the Région Occitanie.

Author information

Corresponding author

Correspondence to Tom Davot.

About this article

Cite this article

Davot, T., Chateau, A., Giroudeau, R. et al. Producing Genomic Sequences after Genome Scaffolding with Ambiguous Paths: Complexity, Approximation and Lower Bounds. Algorithmica 83, 2063–2095 (2021). https://doi.org/10.1007/s00453-021-00819-6

  • DOI: https://doi.org/10.1007/s00453-021-00819-6
