Skip to main content
Log in

Next-generation sequencing data analysis on cloud computing

  • Review
  • Published:
Genes & Genomics Aims and scope Submit manuscript

Abstract

With the advent of next-generation sequencing (NGS), including whole genome sequencing (WGS), RNA sequencing (RNA-seq), and chromatin immunoprecipitation followed by sequencing (ChIP-seq), many biologists and computer scientists are highlighting the urgent need for computing power, storage, and various bioinformatics software for analyzing large quantities of sequence data. Currently, building the computational infrastructure required for massive data processing and providing maintenance services are among the most important tasks. However, technology platforms for handling large amounts of information pose multiple challenges for data access and processing. To overcome these challenges, cloud computing technologies are emerging as a possible infrastructure for tackling the intensive use of computing power and communication resources in NGS data analysis. Thus, in this review, we explain the concepts and key technologies of cloud computing, such as Map and Reduce, and discuss the problem of data transfer. To reveal the performance and usefulness of these technologies, we analyzed NGS data using cloud platforms and compared them with a local cluster. From the benchmark results, we concluded that cloud computing is still more expensive than local cluster, but provides reasonable performance for NGS data analysis with acceptable prices and could be a good alternative to local cluster systems.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6

Similar content being viewed by others

References

  • Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073

    Article  PubMed  Google Scholar 

  • Afgan E, Baker D, Coraor N, Chapman B, Nekrutenko A, Taylor J (2010) Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinform 11(Suppl 12):S4

  • Angiuoli SV, Matalka M, Gussman A, Galens K, Vangala M, Riley DR, Arze C, White JR, White O, Fricke WF (2011) CloVR: a virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinform 12:356

  • Asmann YW, Middha S, Hossain A, Baheti S, Li Y, Chai HS, Sun Z, Duffy PH, Hadad AA, Nair A et al (2012) TREAT: a bioinformatics tool for variant annotations and visualizations in targeted and exome sequencing data. Bioinformatics 28:277–278

  • Baker M (2010) Next-generation sequencing: adjusting to data overload. Nat Methods 7:495–499

    Article  CAS  Google Scholar 

  • Blow N (2009) Transcriptomics: the digital generation. Nature 458:239–242

    Article  CAS  PubMed  Google Scholar 

  • Dai L, Gao X, Guo Y, Xiao J, Zhang Z (2012) Bioinformatics clouds for big data manipulation Biol Direct 7:43; discussion 43

  • Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Paper presented at the proceedings of the 6th conference on symposium on operating systems design & implementation vol 6, San Francisco

  • Dudley JT, Pouliot Y, Chen R, Morgan AA, Butte AJ (2010) Translational bioinformatics in the cloud: an affordable alternative. Genome Med 2:51

    Article  PubMed Central  PubMed  Google Scholar 

  • Feng X, Grossman R, Stein L (2011) PeakRanger: a cloud-enabled peak caller for ChIP-seq data. BMC Bioinform 12:139

  • Fischer M, Snajder R, Pabinger S, Dander A, Schossig A, Zschocke J, Trajanoski Z, Stocker G (2012) SIMPLEX: cloud-enabled pipeline for the comprehensive analysis of exome sequencing data. PLoS One 7:e41948

  • Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA 89:1827–1831

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  • Fusaro VA, Patil P, Gafni E, Wall DP, Tonellato PJ (2011) Biomedical cloud computing with Amazon Web Services. PLoS Comput Biol 7:e1002147

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  • Habegger L, Balasubramanian S, Chen DZ, Khurana E, Sboner A, Harmanci A, Rozowsky J, Clarke D, Snyder M, Gerstein M (2012) VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment. Bioinformatics 28:2267–2269

  • Herman JG, Graff JR, Myohanen S, Nelkin BD, Baylin SB (1996) Methylation-specific PCR: a novel PCR assay for methylation status of CpG islands. Proc Natl Acad Sci USA 93:9821–9826

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  • Hong D, Rhie A, Park SS, Lee J, Ju YS, Kim S, Yu SB, Bleazard T, Park HS, Rhee H et al (2012) FX: an RNA-Seq analysis tool on the cloud. Bioinformatics 28:721–723

  • Horner DS, Pavesi G, Castrignano T, De Meo PD, Liuni S, Sammeth M, Picardi E, Pesole G (2010) Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing. Brief Bioinform 11:181–197

    Article  CAS  PubMed  Google Scholar 

  • Hu F, Qiu M, Li J, Grant T, Tylor D, McCaleb S, Butler L, Hamner R (2011) A review on cloud computing: design challenges in architecture and security. CIT 19(1):25–55

    Article  CAS  Google Scholar 

  • Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 316:1497–1502

    Article  CAS  PubMed  Google Scholar 

  • Jourdren L, Bernard M, Dillies MA, Le Crom S (2012) Eoulsan: a cloud computing-based framework facilitating high throughput sequencing analyses. Bioinformatics 28:1542–1543

  • Lam HY, Pan C, Clark MJ, Lacroute P, Chen R, Haraksingh R, O'Huallachain M, Gerstein MB, Kidd JM, Bustamante CD et al (2012) Detecting and annotating genetic variations using the HugeSeq pipeline. Nat Biotechnol 30:226–229

  • Langmead B, Hansen KD, Leek JT (2010) Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol 11:R83

  • Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL (2009) Searching for SNPs with cloud computing. Genome Biol 10:R134

  • Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760

  • Maher CA, Kumar-Sinha C, Cao X, Kalyana-Sundaram S, Han B, Jing X, Sam L, Barrette T, Palanisamy N, Chinnaiyan AM (2009) Transcriptome sequencing to detect gene fusions in cancer. Nature 458:97–101

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  • Marozzo F, Talia D, Trunfio P (2012) P2P-MapReduce: parallel data processing in dynamic Cloud environments. J Comput Syst Sci 78:1382–1402

    Article  Google Scholar 

  • Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621–628

    Article  CAS  PubMed  Google Scholar 

  • Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE et al (2009) Targeted capture and massively parallel sequencing of 12 human exomes. Nature 461:272–276

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  • Ng SB, Bigham AW, Buckingham KJ, Hannibal MC, McMillin MJ, Gildersleeve HI, Beck AE, Tabor HK, Cooper GM, Mefford HC et al (2010) Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet 42:790–793

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  • Nguyen T, Shi W, Ruden D (2011) CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping. BMC Res Notes 4:171

  • Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher, Zschocke J, Trajanoski Z (2014) A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform 15(2):256–278

  • Peter M, Timothy G (2011) The NIST Definition of Cloud Computing. National Institute of Standards and Technology, Gaithersburg

  • Schatz MC (2009) CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 25:1363–1369

  • Schatz MC, Langmead B, Salzberg SL (2010) Cloud computing and the DNA data race. Nat Biotechnol 28:691–693

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  • Schmid CD, Bucher P (2007) ChIP-Seq data reveal nucleosome architecture of human promoters. Cell 131:831–832 author reply 832–833

    Article  CAS  PubMed  Google Scholar 

  • Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63

    Article  PubMed Central  CAS  PubMed  Google Scholar 

Download references

Acknowledgments

This work was supported by the Research of Korea Centers for Disease Control and Prevention [4847-300-210-13]. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflict of interest

The authors declare that they have no conflict of interest.

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Won Kim or Dae-Won Kim.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kwon, T., Yoo, W.G., Lee, WJ. et al. Next-generation sequencing data analysis on cloud computing. Genes Genom 37, 489–501 (2015). https://doi.org/10.1007/s13258-015-0280-7

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13258-015-0280-7

Keywords

Navigation