SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads

Abstract

Structural variants (SVs) account for a large amount of sequence variability across genomes and play an important role in human genomics and precision medicine. Despite intense efforts over the years, the discovery of SVs in individuals remains challenging because of the diploid and highly repetitive structure of the human genome and the presence of SVs that vastly exceed sequencing read lengths. However, the recent introduction of low-error long-read sequencing technologies such as PacBio HiFi may finally enable these barriers to be overcome. Here we present SV discovery with sample-specific strings (SVDSS), a method for discovery of SVs from long-read sequencing technologies (for example, PacBio HiFi) that combines mapping-free, mapping-based and assembly-based methodologies for overall superior SV discovery performance. Our experiments on several human samples show that SVDSS outperforms state-of-the-art mapping-based methods for discovery of insertion and deletion SVs in PacBio HiFi reads and achieves notable improvements in calling SVs in repetitive regions of the genome.

Fig. 1: Overview of the SVDSS SV prediction pipeline.
Fig. 2: Extended comparative analysis of SV calls across methods.
Fig. 3: Example of an SV at a medically relevant gene that has been correctly called exclusively by SVDSS.

Data availability

All described datasets are publicly available through the corresponding repositories. Our experimental evaluation used the following publicly available data:

  • GRCh38 reference genome: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
  • GRCh37 reference genome: http://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
  • HG002 PacBio HiFi data: https://storage.googleapis.com/brain-genomics-public/research/deepconsensus/publication/deepconsensus_predictions/hg002_15kb/two_smrt_cells/HG002_15kb_222723_002822_2fl_DC_hifi_reads.fastq
  • HG002 assembly: https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication/analysis/genome_assembly/hg002_15kb/two_smrt_cells/dc
  • CMRG callset: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/
  • HG007 PacBio HiFi data: https://storage.googleapis.com/brain-genomics-public/research/deepconsensus/publication/deepconsensus_predictions/hg007_15kb/three_smrt_cells/HG007_230654_115437_2fl_DC_hifi_reads.fastq
  • HG007 assembly: https://console.cloud.google.com/storage/browser/brain-genomics-public/research/deepconsensus/publication/analysis/genome_assembly/hg007_15kb/two_smrt_cells/dc
  • CHM13 PacBio HiFi data: https://github.com/marbl/CHM13#hifi-data
  • CHM13 T2T assembly v1.1: https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chm13.draft_v1.1.fasta.gz

The three callsets built using dipcall are available at https://github.com/ldenti/SVDSS-experiments.

Code availability

SVDSS is open source and publicly available at https://github.com/Parsoa/SVDSS. Scripts to reproduce the experimental evaluations described in the manuscript are available at https://github.com/ldenti/SVDSS-experiments. Other software tools used in the study are either referenced or provided as links here: pbmm2 (https://github.com/PacificBiosciences/pbmm2) and pbsv (https://github.com/PacificBiosciences/pbsv).

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements Nos. 872539 and 956229 (P.B. and R.C.). This work has also been supported in part by NSF award DBI-2042518 to F.H. R.C. was supported by ANR Transipedia, SeqDigger, GenoPIM, Inception and PRAIRIE grants (ANR-18-CE45-0020, ANR-19-CE45-0008, ANR-21-CE46-0012, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001). This project has received funding from the European Union’s Horizon Europe programme for research and innovation under grant agreement No. 101047160. The funding bodies had no role in the design of the study; in the collection, analysis and interpretation of data; or in writing the manuscript.

Author information

Contributions

L.D. and P.K. devised and implemented the approach and performed the experimental evaluation. P.B., F.H. and R.C. conceived the study and supervised and coordinated the work. All authors wrote, reviewed, edited and approved the manuscript.

Corresponding authors

Correspondence to Paola Bonizzoni, Fereydoun Hormozdiari or Rayan Chikhi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Andrew Carroll and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Lin Tang, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Sections A–D, Figs. 1–14 and Tables 1–3.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Denti, L., Khorsand, P., Bonizzoni, P. et al. SVDSS: structural variation discovery in hard-to-call genomic regions using sample-specific strings from accurate long reads. Nat Methods 20, 550–558 (2023). https://doi.org/10.1038/s41592-022-01674-1

  • DOI: https://doi.org/10.1038/s41592-022-01674-1
