Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data

Statistics in Biosciences

Abstract

The emergence of next-generation sequencing technology has greatly influenced research in biology and clinical applications. The technology allows millions of DNA fragments to be sequenced in parallel, which reduces costs and increases throughput. One of the most widely used DNA sequencing platforms is Illumina, which uses a sequencing-by-synthesis method involving a series of chemical reactions and image processing. However, the platform suffers from biases inherent in the complex chemical processes involved. Base-calling is the process of converting the fluorescence intensity output of sequencing-by-synthesis into nucleotide bases. The resulting DNA sequences are used in downstream analyses such as genome assembly and variant detection, where the accuracy and quality of the called bases affect the results. In this paper, we introduce a random effects mixture model that captures the sequencing process and compare its performance to a model with fixed effects.
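To make the base-calling task concrete, the following is a minimal sketch of the simplest possible caller: at each sequencing cycle it picks the channel with the highest fluorescence intensity. This is not the random effects mixture model introduced in the paper, and it ignores the chemical biases the abstract mentions. The channel order `ACGT`, the function name `naive_base_call`, and the example intensities are all assumptions made for illustration.

```python
# Assumed channel order: one fluorescence channel per base (hypothetical).
BASES = "ACGT"


def naive_base_call(intensities):
    """Call one base per cycle by taking the brightest channel.

    `intensities` is a sequence of cycles; each cycle holds four
    intensity values ordered as in BASES. This is a naive baseline,
    not the model described in the paper.
    """
    calls = []
    for cycle in intensities:
        if len(cycle) != len(BASES):
            raise ValueError("each cycle needs exactly four intensities")
        # Index of the channel with the largest intensity in this cycle.
        best = max(range(len(BASES)), key=lambda i: cycle[i])
        calls.append(BASES[best])
    return "".join(calls)


# Made-up intensities for three cycles of a single cluster.
example = [
    [900.0, 120.0, 80.0, 60.0],
    [100.0, 150.0, 870.0, 90.0],
    [70.0, 60.0, 110.0, 950.0],
]
called = naive_base_call(example)
```

A model-based caller replaces the per-cycle maximum with a probabilistic assignment, which is where accounting for the biases of the sequencing process comes in.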

Acknowledgements

The authors thank the Institute for Integrative Genome Biology Bioinformatics Facility at the University of California, Riverside, for providing the bioinformatics cluster. This material is based upon work partially supported by the National Science Foundation (DMS ATD-1222718) and the University of California, Riverside (AES-CE RSAP A01869) for X.C. and A.C.

Author information

Corresponding author

Correspondence to Xinping Cui.

About this article

Cite this article

Cacho, A., Yao, W. & Cui, X. Base-Calling Using a Random Effects Mixture Model on Next-Generation Sequencing Data. Stat Biosci 10, 3–19 (2018). https://doi.org/10.1007/s12561-017-9190-3

  • DOI: https://doi.org/10.1007/s12561-017-9190-3
