Rapid genotype imputation from sequence with reference panels

Davies, Robert W.; Kucka, Marek; Su, Dingwen; Shi, Sinan; Flanagan, Maeve; Cunniff, Christopher M.; Chan, Yingguang Frank; Myers, Simon

doi:10.1038/s41588-021-00877-0

Technical Report
Published: 03 June 2021


Nature Genetics volume 53, pages 1104–1111 (2021)


Abstract

Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.
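
The read-partitioning idea in the abstract can be illustrated with a toy sketch. This is not QUILT's implementation: it swaps the reference-panel haplotype model for simple per-label allele counts, and the function names (`label_weights`, `gibbs_partition`), the pseudocount, and the read encoding (a dict from site index to observed allele) are all assumptions made for illustration. Each sweep removes one read from the counts, scores how well it matches the alleles already assigned to each of the two labels, and resamples its label in proportion to those scores.

```python
import random


def label_weights(read, counts, pseudo=0.5):
    """Unnormalised weight of each label (0 or 1) for a single read.

    `read` maps site index -> observed allele (0 or 1).
    `counts[k][site][allele]` holds allele counts from the OTHER reads
    currently assigned to label k. `pseudo` is an illustrative pseudocount.
    """
    weights = []
    for k in (0, 1):
        w = 1.0
        for site, allele in read.items():
            n_match = counts[k][site][allele]
            n_total = counts[k][site][0] + counts[k][site][1]
            w *= (n_match + pseudo) / (n_total + 2 * pseudo)
        weights.append(w)
    return weights


def gibbs_partition(reads, n_sites, n_iter=50, seed=1):
    """Toy Gibbs sampler that assigns each read to one of two labels."""
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in reads]

    # Allele counts per label and site, built from the initial random labels.
    counts = [[[0, 0] for _ in range(n_sites)] for _ in (0, 1)]
    for read, z in zip(reads, labels):
        for site, allele in read.items():
            counts[z][site][allele] += 1

    for _ in range(n_iter):
        for r, read in enumerate(reads):
            # Remove read r so its label is sampled given all other reads.
            for site, allele in read.items():
                counts[labels[r]][site][allele] -= 1
            w0, w1 = label_weights(read, counts)
            labels[r] = 0 if rng.random() < w0 / (w0 + w1) else 1
            # Add the read back under its newly sampled label.
            for site, allele in read.items():
                counts[labels[r]][site][allele] += 1
    return labels


if __name__ == "__main__":
    # Four hypothetical reads over three heterozygous sites.
    toy_reads = [{0: 0, 1: 0}, {0: 1, 1: 1}, {1: 0, 2: 1}, {1: 1, 2: 0}]
    print(gibbs_partition(toy_reads, n_sites=3))
```

The labels are arbitrary (swapping 0 and 1 gives an equivalent partition), and a sampler of this kind can mix slowly when reads overlap at few sites, so the output of a short run should be read as one draw rather than a definitive assignment.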


**Fig. 1: Schematic of the QUILT model.**

**Fig. 2: Assessment of read label partitioning.**

**Fig. 3: Imputation accuracy of the NA12878 sample.**

**Fig. 4: Imputation accuracy of 5-Family, GBR and ONT samples.**

**Fig. 5: Imputation accuracy of the 1000 Genomes Project samples.**

**Fig. 6: Imputation accuracy of HLA loci.**

**Fig. 7: Relative increase in effective sample size and power using lcWGS and QUILT.**


Data availability

The HRC haplotypes are available at the European Genome-phenome Archive under accession no. EGAD00001002729; they are available through the Sanger Institute under controlled access. The high-coverage, whole-genome sequence from the 1000 Genomes NYGC collection is available at https://www.internationalgenome.org/data-portal/data-collection/30x-grch38. Specifically, we used the file http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/1000G_2504_high_coverage.sequence.index. High-coverage ONT data from Bowden et al.²⁵ are available through the ENA under accession no. PRJEB30620. High-coverage ONT and Illumina (10×) samples from Shafin et al.²⁷ are available through https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=. gnomAD SNP frequencies from the version 3.0 release were downloaded as detailed at https://gnomad.broadinstitute.org/downloads from URLs such as https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz. IPD-IMGT/HLA data were downloaded through its GitHub repository (https://github.com/ANHIG/IMGTHLA), specifically v.3.39 through https://github.com/ANHIG/IMGTHLA/blob/032815608e6312b595b4aaf9904d5b4c189dd6dc/Alignments_Rel_3390.zip?raw=true. Previously inferred HLA types for 1000 Genomes Project participants (v.20181129) were downloaded from the 1000 Genomes FTP (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HLA_types/20181129_HLA_types_full_1000_Genomes_Project_panel.txt). Recombination rates for the CEU 1000 Genomes Project samples were downloaded from ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130507_omni_recombination_rates/CEU_omni_recombination_20130507.tar. All new high- and low-coverage sequencing data generated for this study are available at the Sequence Read Archive under BioProject accession no. PRJNA669554.

Code availability

QUILT is available from https://github.com/rwdavies/QUILT under a General Public License. The specific versions of QUILT used in this manuscript are available from Figshare⁴⁰.

References

1. Pasaniuc, B. & Price, A. L. Dissecting the genetics of complex traits using summary association statistics. Nat. Rev. Genet. 18, 117–127 (2017).
2. Dudbridge, F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 9, e1003348 (2013).
3. Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nat. Rev. Genet. 19, 581–590 (2018).
4. Burton, P. R. et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
5. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
6. Li, N. & Stephens, M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165, 2213–2233 (2003).
7. McCarthy, S. et al. A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283 (2016).
8. Browning, B. L., Zhou, Y. & Browning, S. R. A one-penny imputed genome from next-generation reference panels. Am. J. Hum. Genet. 103, 338–348 (2018).
9. O’Connell, J. et al. Haplotype estimation for biobank-scale data sets. Nat. Genet. 48, 817–820 (2016).
10. Loh, P.-R., Palamara, P. F. & Price, A. L. Fast and accurate long-range phasing in a UK Biobank cohort. Nat. Genet. 48, 811–816 (2016).
11. Delaneau, O., Zagury, J.-F., Robinson, M. R., Marchini, J. L. & Dermitzakis, E. T. Accurate, scalable and integrative haplotype estimation. Nat. Commun. 10, 5436 (2019).
12. Pasaniuc, B. et al. Extremely low-coverage sequencing and imputation increases power for genome-wide association studies. Nat. Genet. 44, 631–635 (2012).
13. Cai, N. et al. Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature 523, 588–591 (2015).
14. Nicod, J. et al. Genome-wide association of multiple complex traits in outbred mice by ultra-low-coverage sequencing. Nat. Genet. 48, 912–918 (2016).
15. Rohland, N. & Reich, D. Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture. Genome Res. 22, 939–946 (2012).
16. Meier, J. I. et al. Haplotype tagging reveals parallel formation of hybrid races in two butterfly species. Proc. Natl. Acad. Sci. USA https://doi.org/10.1073/pnas.2015005118 (2021).
17. Davies, R. W., Flint, J., Myers, S. & Mott, R. Rapid genotype imputation from sequence without reference panels. Nat. Genet. 48, 965–969 (2016).
18. Browning, B. L. & Browning, S. R. Genotype imputation with millions of reference samples. Am. J. Hum. Genet. 98, 116–126 (2016).
19. Spiliopoulou, A., Colombo, M., Orchard, P., Agakov, F. & McKeigue, P. GeneImp: fast imputation to large reference panels using genotype likelihoods from ultralow coverage sequencing. Genetics 206, 91–104 (2017).
20. Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J. & Delaneau, O. Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nat. Genet. 53, 120–126 (2021).
21. VanRaden, P. M., Sun, C. & O’Connell, J. R. Fast imputation using medium or low-coverage sequence data. BMC Genet. 16, 82 (2015).
22. Ros-Freixedes, R. et al. Accuracy of whole-genome sequence imputation using hybrid peeling in large pedigreed livestock populations. Genet. Sel. Evol. 52, 17 (2020).
23. Zheng, C., Boer, M. P. & van Eeuwijk, F. A. Accurate genotype imputation in multiparental populations from low-coverage sequence. Genetics 210, 71–82 (2018).
24. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
25. Bowden, R. et al. Sequencing of human genomes with nanopore technology. Nat. Commun. 10, 1869 (2019).
26. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
27. Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat. Biotechnol. 38, 1044–1053 (2020).
28. Jia, X. et al. Imputing amino acid polymorphisms in human leukocyte antigens. PLoS ONE 8, e64683 (2013).
29. Karnes, J. H. et al. Comparison of HLA allelic imputation programs. PLoS ONE 12, e0172444 (2017).
30. Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
31. Robinson, J. et al. IPD-IMGT/HLA Database. Nucleic Acids Res. 48, D948–D955 (2020).
32. Luo, Y. et al. A high-resolution HLA reference panel capturing global population diversity enables multi-ethnic fine-mapping in HIV host response. Preprint at medRxiv https://doi.org/10.1101/2020.07.16.20155606 (2020).
33. Durvasula, A. & Lohmueller, K. E. Negative selection on complex traits limits phenotype prediction accuracy between populations. Am. J. Hum. Genet. 108, 620–631 (2021).
34. Wainschtein, P. et al. Recovery of trait heritability from whole genome sequence data. Preprint at bioRxiv https://doi.org/10.1101/588020 (2019).
35. Snyder, M. W. et al. Copy-number variation and false positive prenatal aneuploidy screening results. N. Engl. J. Med. 372, 1639–1645 (2015).
36. Liu, S. et al. Genomic analyses from non-invasive prenatal testing reveal genetic associations, patterns of viral infections, and Chinese population history. Cell 175, 347–359.e14 (2018).
37. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
38. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
39. DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
40. Davies, R. QUILT source code from manuscript. figshare https://doi.org/10.6084/m9.figshare.14401904.v1 (2021).
41. Abi-Rached, L. et al. Immune diversity sheds light on missing variation in worldwide genetic diversity panels. PLoS ONE 13, e0206512 (2018).


Acknowledgements

We thank C. Lanz, R. Schwab, O. Weichenrieder and I. Bezrukov at the MPI Developmental Biology for assistance with high-throughput sequencing and associated data processing and A. Noll and the MPI Tübingen IT team for computational support. We used high-coverage resequencing of 1000 Genomes Project data performed by the NYGC. These data were generated at the NYGC with funds provided by National Human Genome Research Institute grant no. 3UM1HG008901-03S1. The research was supported by the Wellcome Trust Core Award Grant no. 203141/Z/16/Z with additional support from the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre and by Wellcome Trust grant nos. 200186/Z/15/Z and 212284/Z/18/Z (to S.M.). The views expressed are those of the author(s) and not necessarily those of the NHS, NIHR or the Department of Health. We acknowledge the contribution and support from affected persons and their families who contributed to the Bloom Syndrome Repository. We thank the New York Community Trust and Weill Cornell Medicine’s Clinical and Translational Science Center for providing funding. M.K., D.S. and Y.F.C. are supported by the Max Planck Society and a European Research Council Starting Grant (no. 639096 HybridMiX).

Author information

These authors contributed equally: Yingguang Frank Chan, Simon Myers.

Authors and Affiliations

Department of Statistics, University of Oxford, Oxford, UK
Robert W. Davies, Sinan Shi & Simon Myers
Friedrich Miescher Laboratory of the Max Planck Society, Tübingen, Germany
Marek Kucka, Dingwen Su & Yingguang Frank Chan
Department of Pediatrics, Weill Cornell Medical College, New York, NY, USA
Maeve Flanagan & Christopher M. Cunniff
The Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
Simon Myers

Contributions

R.W.D. developed and implemented QUILT. M.K., D.S. and Y.F.C. developed haplotagging. R.W.D., M.K. and S.S. performed the analyses. M.F. and C.M.C. developed the 5-Family dataset. S.M. developed and implemented the QUILT-HLA typer. R.W.D., Y.F.C. and S.M. wrote the paper. All authors reviewed and approved the final manuscript.

Corresponding author

Correspondence to Robert W. Davies.

Ethics declarations

Competing interests

M.K. and Y.F.C. declare competing interests in the form of patent and employment by the Max Planck Society. The remaining authors declare no competing interests.

Additional information

Peer review information Nature Genetics thanks Sayantan Das and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Davies, R.W., Kucka, M., Su, D. et al. Rapid genotype imputation from sequence with reference panels. Nat Genet 53, 1104–1111 (2021). https://doi.org/10.1038/s41588-021-00877-0

Received: 14 July 2020
Accepted: 23 April 2021
Published: 03 June 2021
Issue Date: July 2021
DOI: https://doi.org/10.1038/s41588-021-00877-0

