Differential Expression for RNA Sequencing (RNA-Seq) Data: Mapping, Summarization, Statistical Analysis, and Experimental Design

Abstract

RNA sequencing (RNA-seq) is an exciting technique that gives experimenters unprecedented access to information on transcriptome complexity. The costs are decreasing, data analysis methods are maturing, and the flexibility that RNA-seq affords will allow it to become the platform of choice for gene expression analysis. Here, we focus on differential expression (DE) analysis using RNA-seq, highlighting aspects of mapping reads to a reference transcriptome, quantification of expression levels, normalization for composition biases, statistical modeling to account for biological variability, and experimental design considerations. We also comment on recent developments beyond the analysis of DE using RNA-seq.

Author information

Corresponding author

Correspondence to Mark D. Robinson.

Copyright information

© 2012 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Young, M.D., McCarthy, D.J., Wakefield, M.J., Smyth, G.K., Oshlack, A., Robinson, M.D. (2012). Differential Expression for RNA Sequencing (RNA-Seq) Data: Mapping, Summarization, Statistical Analysis, and Experimental Design. In: Rodríguez-Ezpeleta, N., Hackenberg, M., Aransay, A. (eds) Bioinformatics for High Throughput Sequencing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-0782-9_10
