Abstract
The emergence of next-generation sequencing technology has greatly influenced research in biology and clinical applications. The technology allows millions of DNA fragments to be sequenced in parallel, reducing costs and increasing throughput. One of the most widely used DNA sequencing platforms is Illumina, which relies on a novel sequencing-by-synthesis method involving a series of chemical reactions and image processing. However, the method suffers from biases inherent in the complex chemical processes involved. Base-calling is the process of converting the fluorescence intensity output of sequencing-by-synthesis into nucleotide bases. The resulting DNA sequences are used in downstream analyses such as genome assembly and variant detection, where the accuracy and quality of the called bases affect the results. In this paper, we introduce a random effects mixture model that captures the sequencing process and compare its performance with that of a fixed effects model.
Acknowledgements
The authors thank the Institute for Integrative Genome Biology Bioinformatics Facility at the University of California, Riverside, for providing the bioinformatics cluster. This material is based upon work partially supported by the National Science Foundation (DMS ATD-1222718) and the University of California, Riverside (AES-CE RSAP A01869) for X.C. and A.C.
Cite this article
Cacho, A., Yao, W. & Cui, X. Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data. Stat Biosci 10, 3–19 (2018). https://doi.org/10.1007/s12561-017-9190-3
DOI: https://doi.org/10.1007/s12561-017-9190-3