Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data

Cacho, Ashley; Yao, Weixin; Cui, Xinping

doi:10.1007/s12561-017-9190-3

Published: 25 March 2017

Volume 10, pages 3–19 (2018)

Statistics in Biosciences

Abstract

The emergence of next-generation sequencing technology has greatly influenced research in biology and clinical applications. This technology allows millions of DNA fragments to be sequenced in parallel, reducing costs and increasing throughput. One of the most widely used DNA sequencing platforms is Illumina's, which uses a sequencing-by-synthesis method involving a series of chemical reactions and image processing. However, it suffers from biases inherent in the complex chemical processes involved. Base-calling is the process of converting the fluorescence intensity output of sequencing-by-synthesis into nucleotide bases. The resulting DNA sequences are used in downstream analyses such as genome assembly and variant detection, where the accuracy and quality of the called bases affect the results. In this paper, we introduce a random effects mixture model that captures the sequencing process and compare its performance to a model with fixed effects.
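To make the idea of base-calling concrete, the following is a minimal sketch of calling bases from four-channel fluorescence intensities with a simple mixture model. It is not the random effects model proposed in the paper: it assumes a four-component spherical Gaussian mixture with known component means, a shared standard deviation `sigma`, and equal mixing weights, all of which are illustrative assumptions. The function name `call_bases` and the toy intensity values are hypothetical.

```python
import numpy as np

BASES = np.array(["A", "C", "G", "T"])

def call_bases(intensities, means, sigma=0.5, weights=None):
    """Call one base per cycle from four-channel intensities.

    Illustrative assumption: each base generates intensities from a
    spherical Gaussian centred on its row of `means`, with a shared
    standard deviation `sigma`. This is a plain fixed-parameter mixture,
    not the random effects model described in the abstract.
    """
    x = np.asarray(intensities, dtype=float)   # shape (n_cycles, 4)
    mu = np.asarray(means, dtype=float)        # shape (4, 4)
    w = np.full(4, 0.25) if weights is None else np.asarray(weights, dtype=float)

    # Squared Euclidean distance from every observation to every component mean.
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)

    # Unnormalised log posterior: log weight plus Gaussian log density
    # (terms shared by all components are dropped).
    log_post = np.log(w)[None, :] - d2 / (2.0 * sigma ** 2)

    # Normalise in a numerically stable way.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    # The called base is the component with the highest posterior probability.
    calls = BASES[post.argmax(axis=1)]
    return calls, post

# Toy data: idealised component means (one dominant channel per base)
# and three cycles of noisy intensities.
means = np.eye(4)
x = np.array([[0.9, 0.1, 0.0, 0.1],
              [0.1, 0.2, 1.1, 0.0],
              [0.0, 0.1, 0.2, 0.8]])
calls, post = call_bases(x, means)
sequence = "".join(calls)
```

The posterior probabilities in `post` indicate how confident each call is, which is the kind of information a quality score would be derived from. In real data the component means, variances, and weights would have to be estimated, and effects such as cross-talk between channels would need to be modelled.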

Acknowledgements

The authors thank the Institute for Integrative Genome Biology Bioinformatics Facility at the University of California, Riverside, for providing the bioinformatics cluster. This material was based upon work partially supported by the National Science Foundation (DMS ATD-1222718) and the University of California, Riverside (AES-CE RSAP A01869) for X.C. and A.C.

Author information

Authors and Affiliations

University of California Riverside, Riverside, USA
Ashley Cacho, Weixin Yao & Xinping Cui

Corresponding author

Correspondence to Xinping Cui.

About this article

Cite this article

Cacho, A., Yao, W. & Cui, X. Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data. Stat Biosci 10, 3–19 (2018). https://doi.org/10.1007/s12561-017-9190-3

Received: 12 April 2016
Revised: 16 January 2017
Accepted: 15 March 2017
Published: 25 March 2017
Issue Date: April 2018
DOI: https://doi.org/10.1007/s12561-017-9190-3
