  • Technical Report

Rapid genotype imputation from sequence with reference panels

Abstract

Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.
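The read-partitioning idea in the abstract can be illustrated with a toy Gibbs sampler. This is a minimal sketch under strong simplifying assumptions, not QUILT's implementation: each of the two read sets is assumed to copy a single reference haplotype with no recombination, bases are observed with a fixed error rate, and every name here (`read_loglik`, `gibbs_partition`, `eps`, the two-haplotype toy panel) is hypothetical.

```python
import math
import random


def read_loglik(read, hap, eps):
    """Log-probability of a read's (site, allele) observations given a haplotype."""
    return sum(math.log(1 - eps if hap[site] == allele else eps)
               for site, allele in read)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def gibbs_partition(reads, panel, eps=0.01, n_iter=50, seed=0):
    """Sample a 0/1 label per read; each label set copies one panel haplotype."""
    rng = random.Random(seed)
    n_haps = len(panel)
    # per_read[r][k]: log-likelihood of read r under panel haplotype k
    per_read = [[read_loglik(r, h, eps) for h in panel] for r in reads]
    labels = [rng.randrange(2) for _ in reads]
    # totals[s][k]: summed log-likelihood of the reads in set s under haplotype k
    totals = [[0.0] * n_haps for _ in range(2)]
    for r, z in enumerate(labels):
        for k in range(n_haps):
            totals[z][k] += per_read[r][k]
    for _ in range(n_iter):
        for r in range(len(reads)):
            # Remove read r from its current set before resampling its label.
            z = labels[r]
            for k in range(n_haps):
                totals[z][k] -= per_read[r][k]
            # Log weight of each label: marginal likelihood of both sets,
            # averaging over which panel haplotype each set copies.
            logw = []
            for s in range(2):
                with_r = [totals[s][k] + per_read[r][k] for k in range(n_haps)]
                logw.append(logsumexp(with_r) + logsumexp(totals[1 - s]))
            diff = min(logw[1] - logw[0], 700.0)  # guard against overflow
            p0 = 1.0 / (1.0 + math.exp(diff))
            z = 0 if rng.random() < p0 else 1
            labels[r] = z
            for k in range(n_haps):
                totals[z][k] += per_read[r][k]
    return labels


# Toy data: two reference haplotypes over 10 sites, and error-free reads
# of 3 sites each drawn alternately from the first and second haplotype.
panel = [[0] * 10, [1] * 10]
starts = [0, 2, 4, 6, 1, 7]
reads_a = [[(s + j, 0) for j in range(3)] for s in starts]
reads_b = [[(s + j, 1) for j in range(3)] for s in starts]
reads = [r for pair in zip(reads_a, reads_b) for r in pair]
labels = gibbs_partition(reads, panel)
```

In this toy setting, reads drawn from the same haplotype should end up sharing a label; which set is called 0 and which 1 is arbitrary. A realistic sampler would replace the single-haplotype likelihood with a haploid imputation model that allows switching between reference haplotypes along the chromosome.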

Fig. 1: Schematic of the QUILT model.
Fig. 2: Assessment of read label partitioning.
Fig. 3: Imputation accuracy of the NA12878 sample.
Fig. 4: Imputation accuracy of 5-Family, GBR and ONT samples.
Fig. 5: Imputation accuracy of the 1000 Genomes Project samples.
Fig. 6: Imputation accuracy of HLA loci.
Fig. 7: Relative increase in effective sample size and power using lcWGS and QUILT.

Data availability

The HRC haplotypes are available at the European Genome-phenome Archive under accession no. EGAD00001002729; they are available through the Sanger Institute under controlled access. The high-coverage, whole-genome sequence data from the 1000 Genomes NYGC collection are available at https://www.internationalgenome.org/data-portal/data-collection/30x-grch38. Specifically, we used the file http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index. High-coverage ONT data from Bowden et al.25 are available through the ENA under accession no. PRJEB30620. High-coverage ONT and Illumina (10×) samples from Shafin et al.27 are available through https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=. gnomAD SNP frequencies from the version 3.0 release were downloaded as detailed at https://gnomad.broadinstitute.org/downloads from URLs such as https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz. IPD-IMGT/HLA data were downloaded from their GitHub repository (https://github.com/ANHIG/IMGTHLA), specifically v.3.39 through https://github.com/ANHIG/IMGTHLA/blob/032815608e6312b595b4aaf9904d5b4c189dd6dc/Alignments_Rel_3390.zip?raw=true. Previously inferred HLA types for 1000 Genomes Project participants (v.20181129) were downloaded from the 1000 Genomes FTP site (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HLA_types/20181129_HLA_types_full_1000_Genomes_Project_panel.txt). Recombination rates for the CEU 1000 Genomes Project samples were downloaded from ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130507_omni_recombination_rates/CEU_omni_recombination_20130507.tar. All new high- and low-coverage sequencing data generated for this study are available at the Sequence Read Archive under BioProject accession no. PRJNA669554.

Code availability

QUILT is available from https://github.com/rwdavies/QUILT under a General Public License. The specific versions of QUILT used in this manuscript are available from Figshare40.

References

  1. Pasaniuc, B. & Price, A. L. Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet. 18, 117–127 (2017).

  2. Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013).

  3. Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 19, 581–590 (2018).

  4. Burton, P. R. et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).

  5. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

  6. Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).

  7. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).

  8. Browning, B. L., Zhou, Y. & Browning, S. R. A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet. 103, 338–348 (2018).

  9. O’Connell, J. et al. Haplotype estimation for biobank-scale data sets. Nat. Genet. 48, 817–820 (2016).

  10. Loh, P.-R., Palamara, P. F. & Price, A. L. Fast and accurate long-range phasing in a UK Biobank cohort. Nat. Genet. 48, 811–816 (2016).

  11. Delaneau, O., Zagury, J.-F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (2019).

  12. Pasaniuc, B. et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat. Genet. 44, 631–635 (2012).

  13. Cai, N. et al. Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature 523, 588–591 (2015).

  14. Nicod, J. et al. Genome-wide association of multiple complex traits in outbred mice by ultra-low-coverage sequencing. Nat. Genet. 48, 912–918 (2016).

  15. Rohland, N. & Reich, D. Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Genome Res. 22, 939–946 (2012).

  16. Meier, J. I. et al. Haplotype tagging reveals parallel formation of hybrid races in two butterfly species. Proc. Natl. Acad. Sci. USA https://doi.org/10.1073/pnas.2015005118 (2021).

  17. Davies, R. W., Flint, J., Myers, S. & Mott, R. Rapid genotype imputation from sequence without reference panels. Nat. Genet. 48, 965–969 (2016).

  18. Browning, B. L. & Browning, S. R. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98, 116–126 (2016).

  19. Spiliopoulou, A., Colombo, M., Orchard, P., Agakov, F. & McKeigue, P. GeneImp: fast imputation to large reference panels using genotype likelihoods from ultralow coverage sequencing. Genetics 206, 91–104 (2017).

  20. Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J. & Delaneau, O. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat. Genet. 53, 120–126 (2021).

  21. VanRaden, P. M., Sun, C. & O’Connell, J. R. Fast imputation using medium or low-coverage sequence data. BMC Genet. 16, 82 (2015).

  22. Ros-Freixedes, R. et al. Accuracy of whole-genome sequence imputation using hybrid peeling in large pedigreed livestock populations. Genet. Sel. Evol. 52, 17 (2020).

  23. Zheng, C., Boer, M. P. & van Eeuwijk, F. A. Accurate genotype imputation in multiparental populations from low-coverage sequence. Genetics 210, 71–82 (2018).

  24. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

  25. Bowden, R. et al. Sequencing of human genomes with nanopore technology. Nat. Commun. 10, 1869 (2019).

  26. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  27. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).

  28. Jia, X. et al. Imputing amino acid polymorphisms in human leukocyte antigens. PLoS ONE 8, e64683 (2013).

  29. Karnes, J. H. et al. Comparison of HLA allelic imputation programs. PLoS ONE 12, e0172444 (2017).

  30. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).

  31. Robinson, J. et al. IPD-IMGT/HLA Database. Nucleic Acids Res. 48, D948–D955 (2020).

  32. Luo, Y. et al. A high-resolution HLA reference panel capturing global population diversity enables multi-ethnic fine-mapping in HIV host response. Preprint at medRxiv https://doi.org/10.1101/2020.07.16.20155606 (2020).

  33. Durvasula, A. & Lohmueller, K. E. Negative selection on complex traits limits phenotype prediction accuracy between populations. Am. J. Hum. Genet. 108, 620–631 (2021).

  34. Wainschtein, P. et al. Recovery of trait heritability from whole genome sequence data. Preprint at bioRxiv https://doi.org/10.1101/588020 (2019).

  35. Snyder, M. W. et al. Copy-number variation and false positive prenatal aneuploidy screening results. N. Engl. J. Med. 372, 1639–1645 (2015).

  36. Liu, S. et al. Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history. Cell 175, 347–359.e14 (2018).

  37. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  38. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

  39. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).

  40. Davies, R. QUILT source code from manuscript. figshare https://doi.org/10.6084/m9.figshare.14401904.v1 (2021).

  41. Abi-Rached, L. et al. Immune diversity sheds light on missing variation in worldwide genetic diversity panels. PLoS ONE 13, e0206512 (2018).

Acknowledgements

We thank C. Lanz, R. Schwab, O. Weichenrieder and I. Bezrukov at the MPI Developmental Biology for assistance with high-throughput sequencing and associated data processing and A. Noll and the MPI Tübingen IT team for computational support. We used high-coverage resequencing of 1000 Genomes Project data performed by the NYGC. These data were generated at the NYGC with funds provided by National Human Genome Research Institute grant no. 3UM1HG008901-03S1. The research was supported by the Wellcome Trust Core Award Grant no. 203141/Z/16/Z with additional support from the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre and by Wellcome Trust grant nos. 200186/Z/15/Z and 212284/Z/18/Z (to S.M.). The views expressed are those of the author(s) and not necessarily those of the NHS, NIHR or the Department of Health. We acknowledge the contribution and support from affected persons and their families who contributed to the Bloom Syndrome Repository. We thank the New York Community Trust and Weill Cornell Medicine’s Clinical and Translational Science Center for providing funding. M.K., D.S. and Y.F.C. are supported by the Max Planck Society and a European Research Council Starting Grant (no. 639096 HybridMiX).

Author information

Contributions

R.W.D. developed and implemented QUILT. M.K., D.S. and Y.F.C. developed haplotagging. R.W.D., M.K. and S.S. performed the analyses. M.F. and C.M.C. developed the 5-Family dataset. S.M. developed and implemented the QUILT-HLA typer. R.W.D., Y.F.C. and S.M. wrote the paper. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Robert W. Davies.

Ethics declarations

Competing interests

M.K. and Y.F.C. declare competing interests in the form of patent and employment by the Max Planck Society. The remaining authors declare no competing interests.

Additional information

Peer review information: Nature Genetics thanks Sayantan Das and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note, Tables 1–10 and Figs. 1–6

Reporting Summary

Peer Review Information

About this article

Cite this article

Davies, R.W., Kucka, M., Su, D. et al. Rapid genotype imputation from sequence with reference panels. Nat Genet 53, 1104–1111 (2021). https://doi.org/10.1038/s41588-021-00877-0

  • DOI: https://doi.org/10.1038/s41588-021-00877-0
