Abstract

Competitive gene set tests are commonly used in molecular pathway analysis to test for enrichment of a particular gene annotation category amongst the differential expression results from a microarray experiment. Existing gene set tests that rely on gene permutation are shown here to be extremely sensitive to inter-gene correlation. Several data sets are analyzed to show that inter-gene correlation is non-ignorable even for experiments on homogeneous cell populations using genetically identical model organisms. A new gene set test procedure (CAMERA) is proposed based on the idea of estimating the inter-gene correlation from the data, and using it to adjust the gene set test statistic. An efficient procedure is developed for estimating the inter-gene correlation and characterizing its precision. CAMERA is shown to control the type I error rate correctly regardless of inter-gene correlations, yet retains excellent power for detecting genuine differential expression. Analysis of breast cancer data shows that CAMERA recovers known relationships between tumor subtypes in very convincing terms. CAMERA can be used to analyze specified sets or as a pathway analysis tool using a database of molecular signatures.

INTRODUCTION

A gene set test is a differential expression analysis in which a P -value is assigned to a set of genes as a unit. Gene set tests are important because they permit differential expression questions to be posed in terms of ensembles of genes representing pathways or other biologically interpretable processes. Although a great many statistical procedures for gene set testing have been proposed in the literature, they fall into two major categories ( 1 , 2 ). Using the terminology of Goeman and Buhlmann ( 1 ), ‘self-contained’ gene set tests examine a set of genes in their own right without reference to other genes in the genome ( 3–8 ), whereas ‘competitive’ gene set tests compare genes in the test set relative to all other genes. Self-contained tests are of interest for assessing the relevance of an individual biological process to the experiment at hand ( 8 ), whereas the competitive tests focus more on distinguishing the most important biological processes from those that are less important. Competitive tests are overwhelmingly more commonly used in the genomic literature ( 9 ).

Most competitive gene set tests assume independence of genes ( 1 , 9 ), because they evaluate P -values by permutation of gene labels ( 2 , 10 , 11 ), or because they rely on parametric approximations that are asymptotically equivalent to gene permutation ( 12–16 ). A handful of articles have demonstrated that competitive tests are sensitive to inter-gene correlations, and even quite modest correlations can dangerously inflate the apparent false discovery rate (FDR) ( 9 , 10 , 17 , 18 ). Nevertheless, these warnings do not seem to have limited the popularity of such tests, probably because of the lack of satisfactory alternatives that maintain the direct interpretation of competitive tests.

A number of authors have concluded that gene set test P -values should always be evaluated by permutation or resampling of RNA samples, even when a competitive hypothesis is of interest, because sample permutation preserves the inter-gene correlation structure and therefore protects the test against such correlation ( 9 , 17 , 19 , 20 ). This strategy, however, cannot be applied to experiments with small number of biological replicates and, even more crucially, it inevitably alters the hypothesis that is being tested and therefore affects the essential clarity and interpretability of competitive gene set testing. The very popular Gene Set Enrichment Analysis (GSEA) procedure ( 20 ) is of this type. It uses sample permutation to test the significance of a competitive gene test statistic, resulting in a hybrid test for which the null and alternative hypotheses are difficult to characterize in terms of population parameters ( 1 ).

A different attempt is to de-correlate genes in the test set ( 21 ), but such an approach requires the estimation of a covariance matrix with many entries, so the methodology is likely to be limited in practice to experiments with relatively large numbers, at least dozens, of replicate samples. A related approach is to try to approximate covariances between genes using random effects ( 22 , 23 ), but this methodology is limited to special experimental designs and is dependent on commercial statistical software.

This article proposes a new competitive gene set testing procedure that maintains the direct interpretation of competitive tests associated with gene permutation but remains valid even when the genes in the test set are correlated. We call this new method CAMERA, an acronym for Correlation Adjusted MEan RAnk gene set test. The procedure is based on the idea of estimating the variance inflation factor associated with inter-gene correlation, and incorporating this into parametric or rank-based test procedures. The procedure does not assume any particular correlation structure and is suitable for any experiment that can be represented by genewise linear models, not just two-sample comparisons. As only one correlation parameter is being estimated, the procedure remains stable and correctly controls the type I error rate even for experiments with only a small number of biological replicates.

MATERIALS AND METHODS

Linear models

One of the advantages of competitive gene set tests is that they can be applied just as easily to any genewise test statistic, no matter how complex. There is no need to be limited to two-group comparisons, for example. To be as general as possible, we assume throughout this article a linear model setup similar to that described previously ( 8 , 24 ). Suppose that a gene expression experiment has been conducted resulting in log-expression values ygi for genes g  = 1, … ,  G and RNA samples i  = 1, … ,  n . We assume a linear model for the expected value of each expression value given the experimental design,
where the xij are covariates or design variables specifying which treatment condition is associated with each RNA sample, and the α gj are unknown regression coefficients representing expression log-fold changes (logFCs) between conditions in the experiment.

Each gene is assumed to have its own variance, forumla . Expression values from different arrays are assumed to be independent, but expression values for different genes from the same RNA sample are generally not. The correlations cor( ygi , ygi ) = ρ g , g are generally non-zero. Note that the ρ g , g here represent residual correlations between genes across replicate samples, after the treatment effects μ gi have been removed.

Genewise test statistics

We assume that a specified contrast forumla of the coefficients is of primary interest, and genewise statistical tests will be conducted of the null hypothesis H0  : β g =0. For example, the contrast might extract the logFC between two specified treatment conditions. More generally, it might represent an interaction term, or any similar quantity of interest to the study at hand. We will use the notation zg to represent any genewise statistic used to test this hypothesis. A number of different genewise statistics will be considered. First, the least squares estimate forumla is the observed logFC. Second, the ordinary t -statistic forumla , where sg is the residual standard error for gene g and v is obtained from the covariates xij , is the classic univariate test statistic. For a simple two-group comparison, v2  = 1/ n1  + 1/ n2 where n1 and n2 are the group sample sizes. Third, the moderated t -statistic forumla , where forumla is the empirical Bayes posterior estimator of forumla , generally outperforms the ordinary t -statistic for genomic experiments ( 24 ). Under the null hypothesis, forumla follows a t -distribution on d  +  d0 degrees of freedom (df), where d  =  n  −  p is the residual degrees of freedom (df) from the linear model and d0 is the prior df, the latter estimated as part of the empirical Bayes procedure ( 24 ). Fourth, a normalized version forumla of the moderated t -statistics is considered. Here, F and Ft are the cumulative distribution functions of the standard normal and t -distribution on d  +  d0 df, respectively. Under the null hypothesis, zg follows a standard normal distribution. Finally, in the context of genomic experiments, it is of interest to consider zg as the rank of forumla amongst all genes in the experiment.

Existing competitive gene set tests

In this article, we compare our new proposals to four existing gene set tests: PAGE ( 13 ), sigPathway ( 2 ) and two versions of the ‘geneSetTest’ procedure implemented in the limma software package ( 25 ) of the Bioconductor project ( 26 ). The two geneSetTest versions will be denoted geneSetTest-modt and geneSetTest-ranks, respectively. PAGE is implemented as a Python script obtained from the authors. sigPathway is implemented in a Bioconductor package of the same name. The software implementations of PAGE and sigPathway do not support linear models and, therefore, are restricted to two-group comparisons, although generalizing the procedures to a linear model context is straightforward in principle.

All four gene set procedures conduct global tests comparing genes in the test set to genes not in the test set using the genewise test statistics as observations. Specifically, they determine whether the mean forumla of the genewise statistics is significantly different for genes in the test set versus genes not in the set. PAGE uses logFC as zg whereas sigPathway uses ordinary t -statistics. geneSetTest can accept any genewise statistic, but is most commonly used with moderated t -statistics. sigPathway and geneSetTest-modt evaluate P -values by randomly permuting the gene labels. PAGE conducts a one-sample z -test, treating forumla as normally distributed, with the global mean and standard deviation of the genewise statistics across all genes in the experiment as population parameters under the null hypothesis. geneSetTest-ranks conducts a Wilcoxon–Mann–Whitney (WMW) rank sum test to compare the zg for genes in the set to those for genes not in the set. This amounts to a two-sample z -test using the ranks of the moderated t -statistics as zg . geneSetTest-ranks has also been called ‘mean-rank gene set enrichment’ ( 27 ). For results reported in this article, sigPathway and geneSetTest-modt were set to use 10 000 gene permutations, a number sufficient to generate highly repeatable P -values.

GSEA ( 20 ) differs from the other procedures considered in this article because it tests a hybrid hypothesis and because it is designed to test a battery of gene sets against one another rather than to test an individual gene set against background. For these reasons, GSEA is not included in the simulations but is compared to CAMERA on an example data set. GSEA was implemented using the R-GSEA script from http://www.broadinstitute.org/gsea (16 February 2011, date last accessed).

Variance inflation under correlation

Consider a set of m genewise statistics z1 , … ,  zm . The variance of the mean of the statistics is
where τ i is the standard deviation of zi and the ρ ij are the pairwise correlations. The second term represents the increase in the variance of the mean that arises from correlations between the genes. If the τ i are all equal to τ, then
where VIF is the variance inflation factor forumla and forumla is the average of the ρ ij . Note that VIF can be greater or less than one depending on the sign of forumla . It cannot, however, be <0 or > m .

Two-sided t -test allowing for correlation

Consider two groups of genewise statistics, with means forumla and forumla . The standard two-sample t -statistic for comparing the two groups has the form T  = δ/{ sp (1/ m1  + 1/ m2 ) 1/2 } where forumla is the difference in means, sp is the pooled residual standard deviation and m1 and m2 are the sample sizes in the first and second groups, respectively. The standard test assumes all observations are statistically independent.

We now extend the t -test to cover the possibility that observations in the first group are not independent. Write forumla for the average of all pairwise correlations between genes in the first group. If forumla is non-zero, the t -statistic remains asymptotically standard normal if it is redefined to be
where VIF forumla is the variance inflation factor for the first group and forumla is a consistent estimator of forumla . If m1  +  m2 is large, and ρ is estimated on d df, then T is approximately t -distributed on d df, under the null hypothesis of no mean difference between the two populations.

Write G  =  m1  +  m2 , and write forumla and s for the mean and standard deviation of all G z -values taken as a single group. For the applications in this article, it is convenient to express T in terms of the first group and the global statistics by writing forumla and forumla .

Wilcoxon rank sum test allowing for correlation

The WMW rank sum test is a well-known statistical test of whether one set of values is highly ranked relative to another, against the null hypothesis that all values are exchangeable. We implemented the usual WMW test following the formulas given in Section 8.10 of Zar ( 28 ), including corrections for ties and for continuity. The usual WMW test computes a standard normal statistic by dividing the rank sum statistic RankSum by the square root of its variance, var(RankSum) =  m1m2 ( m1  +  m2  + 1)/12. The z -statistic is then treated as standard normal.

The usual WMW test assumes that all the values being ranked are statistically independent. We extended the WMW test to allow for the possibility that the values in the first group are correlated, while values in the second group remain independent. If the values are normally distributed, and values in the first group share pairwise correlation ρ , then the variance of the WMW rank sum statistic can be shown to be
This result can be derived from Equation (4.5) of Barry et al . ( 29 ). In our extension of the WMW test, var(RankSum;ρ) replaces var(RankSum) in the denominator of the z -statistic, which is then treated as t -distributed on d df, where d is a measure of the precision with which ρ is estimated. Our test reduces to the usual WMW when forumla and d  = ∞.

Estimating the inter-gene correlation

Write Y  = { ygi } for the m  ×  n matrix of expression values for genes in the test set. Here, rows correspond to genes and columns to RNA samples. We assume that the expression values can be represented by genewise linear models with n  ×  p design matrix X  = { xij }. The rows of the design matrix correspond to RNA samples and the columns to coefficients of the linear model. There are d  =  n  −  p df available for estimating the genewise variances forumla and the correlations ρg , g between pairs of expression values ygi and ygi for the same sample i .

Our aim is to estimate the average correlation forumla , where the average is over pairwise correlations ρg , g . The first step is to compute d independent residuals for each gene. Write X  =  QR for the QR-decomposition of the design matrix, where Q is n  ×  n and R is n  ×  p . Here, R is upper-triangular and Q satisfies QTQ  =  I . An m  ×  d matrix of independent residuals is obtained by U  =  YQ2 , where Q2 represents the trailing d columns of Q . Note that the matrix U is already available as a by-product of fitting genewise linear models to the expression values using standard numerical algorithms. Extracting it requires no extra computation.

The residual standard error sg for gene g is equal to the root mean square of the corresponding row of U . We standardize each row of U by dividing by sg .

At this point, we could obtain the correlation matrix for the m genes from C  =  UUT ; however, this is a numerically inefficient procedure if m is large. A numerically superior algorithm is to compute the column means ū · k of U . Then
estimates the VIF. Note that forumla , which is concordant with the range of theoretical values for the VIF. An estimate of the average correlation can be obtained by solving forumla for forumla . This forumla is in fact numerically equal to the average of all pairwise correlations in the matrix C , although the need to explicitly form these pairwise correlations has been by-passed.

If m and d are both reasonably large, and forumla is relatively small, then forumla is approximately distributed as forumla . This implies that the standard deviation of forumla is approximately VIF(2/ d ) 1/2 , and that the standard deviation of forumla is approximately VIF(2/ d ) 1/2 /( m  − 1).

Simulations

Simulated data sets were generated with a total of G  = 10 000 genes and either two or three groups of RNA samples. Log-expression values were multivariate normal. Genewise variances forumla were generated from an inverse-chisquare distribution on 4 df. Specifically, forumla with d0  = 4 and s0  = 0.25, generating a distribution typical of microarray experiments.

Breast cancer data

Expression profiles of human breast tumors were downloaded from GEO series GSE3165. In order to standardize on one microarray platform, only the 94 arrays of platform GPL887 (Agilent Human 1A Microarray V2) were included in the analysis. Each tumor was classified to one of six molecular subtypes, namely basal-like, luminal A, luminal B, Her2, normal-like and claudin-low ( 30 ). Expression values were normalized and filtered as described previously ( 31 ).

Mammary epithelial cell data

Mammary epithelial cells and stroma cells from three human patients were sorted into four cell populations. RNA samples were profiled on two Illumina HumanWG-6 V3 BeadChips, comprising 12 microarrays. Expression values were normalized and filtered as described previously ( 31 ). The data is available as series GSE16997 in the GEO database ( http://www.ncbi.nlm.nih.gov/geo ).

Mouse hemapoietic stem cell data

Hematopoietic stem cells were isolated from four strains (one wild-type and three mutant strains) of inbred laboratory mice. Cells were further sorted into long-term, short-term and multi-potential progenitors. Between two and four biological replicates were available for each strain and cell type, making a total of 35 RNA samples from 12 experimental groups. RNA was hybridized to Illumina Mouse WG-6 Version 2 microarrays. Intensity values were normexp background corrected and quantile normalized using control probes ( 32 ). Probes that failed to reach a detection P -value of 0.05 on at least two arrays were filtered as not expressed, leaving 25 308 probes.

Ortholog mapping of the molecular signatures database

The Molecular Signatures Database (MSigDB) v3.0 was downloaded from http://www.broadinstitute.org/gsea/msigdb (28 September 2010, date last accessed). Ortholog mapping was used to prepare a pure human version of the MSigDB for use with the human expression data and a pure mouse version for use with the mouse expression data. All gene symbols were updated to latest official symbols using the human and mouse Bioconductor annotation packages ( 26 ). The resulting pure human and mouse gene set collections can be downloaded as R objects from http://bioinf.wehi.edu.au/software/MSigDB . The curated C2 gene set collection contains 3269 gene sets. The human C2 gene sets average 85 genes (median 35, maximum 2282) while the mouse C2 sets average 80 genes (median 34, maximum 1968). Inter-gene correlations were computed for sets containing at least 5 genes (3265 human and 3240 mouse).

RESULTS

Competitive gene set tests

Suppose that a gene expression experiment has been conducted, resulting in expression values for each of G genes (or probes or transcripts) in each of n target RNA samples. The total number of genes is assumed to be large, typically representative of the entire genome. The expression values should be at least roughly normally distributed. Typically, they will be normalized log-intensity values from microarrays. In order to be completely general, we assume that the assignment of experimental conditions to RNA samples can be described by a linear model (See ‘Materials and Methods’ section). This covers all common experimental situations. We assume that genewise tests of differential expression have been conducted, resulting in test statistics zg , one for each gene g .

We assume that a particular set of genes is of prior interest. This a priori specified gene set might represent a molecular pathway believed to be relevant to the experiment, or it might be a gene list from a previous microarray experiment hypothesized to be related to the current experiment. We want to test whether the genes in the gene set are highly ranked in terms of differential expression, that is, whether they tend to be associated with higher than average values of the test statistic. This type of test is inherently ‘competitive’ between genes, because the genes in the set are being compared with genes not in the set. We wish to test whether the test statistics associated with genes in the set tend to be more extreme than those associated with genes not in the set.

The competitive null hypothesis

The hypotheses tested by competitive tests have often not been explicitly stated, or else have been stated informally in terms of test statistics as we have just done. It is often unclear, therefore, exactly what is being tested. We now give a formal statement, in terms of model parameters, of what we consider to be a biologically meaningful null hypothesis for a competitive gene test. Suppose that forumla represents the indices of genes in the set of interest, and forumlac is the complementary set of indices of genes not in the set. For a ‘non-directional’ gene set test, we consider the null hypothesis to be that the average absolute-logFC of genes in the set is the same as that for genes not in the set, i.e. the mean |β g | for g  ∈  forumla is equal to the mean |β g | for g  ∈  forumlac . In intuitive terms, this means that genes in the set are no more differentially expressed on average than genes not in the set. The alternative hypothesis is that the average logFC is greater in absolute size for genes in the set than for those not in the set. Note that the hypotheses are in terms of the true unobserved logFC, not the observed expression log-ratios.

For a ‘directional’ gene set test, we consider the null hypothesis to be that the average logFC of genes in the set is the same as the average logFC of genes not in the set, i.e. the mean β g for g  ∈  forumla , forumla , is equal to the mean β g for g  ∈  forumlac , forumla . The directional hypothesis allows for one-sided or two-sided tests. Unless otherwise stated, all gene set tests in this article will be two-sided directional tests with alternative hypothesis forumla .

Note that these null hypotheses are more general than supposing the genes in the gene set to be a random sample from the genome or from all those genes on the array. This is because the hypotheses make statements only about the fold changes, not about variances or correlations or other distributional aspects. Note also that the competitive null hypothesis differs from the null hypothesis of self-contained tests. A self-contained gene set test would test the null hypothesis that the logFCs β g are all zero for g  ∈  forumla ( 8 ), whereas the competitive null hypothesis may be true even when all or most genes in the set is differentially expressed.

P -values from gene permutation

Competitive gene set tests are usually conducted by permuting gene labels. Typically, the gene set test statistic is the average zg for g  ∈  forumla , which we will denote as forumla . A P -value is assigned by drawing random gene sets of the same size m from the genes on the array. For the one-sided directional test forumla versus the alternative forumla , the P -value is the proportion of the gene sets, combining the random sets forumla * with the original forumla ( 33 ), for which the mean statistic forumla . For a two-sided test, the P -value is twice the minimum of the two one-sided P -values.

The genewise test statistic zg might be an ordinary two-sample t -statistic ( 2 ), or a moderated t -statistic ( 24 ), or the estimated logFC ( 13 ). Any relevant genewise statistic could be used. The need to draw random gene sets to estimate the P -value is computationally intensive and can be short-circuited in two ways. First, one can replace the zg by their ranks across all the genes on the array, in which case the permutation P -value can be approximated very accurately using the well-known WMW rank sum test ( 27 ). Second, if the zg are roughly normal, then the permutation P -value can be well-approximated by the standard normal tail probability of forumla where forumla is the average zg for genes in forumla and forumla and s are the mean and standard deviation of the zg over all genes on the array.

In this article, we examine four existing gene set tests. sigPathway ( 2 ) and geneSetTest-modt are permutation methods using the ordinary and moderated t -statistics, respectively. geneSetTest-ranks performs a WMW test using ranks from moderated t -statistics ( 27 ). PAGE uses the standard normal approximation for Z computed from the logFCs ( 13 ).

Inter-gene correlation increases the type I error rate

The process of generating P -values from permutation of gene labels treats all genes on the array as equivalent under the null hypothesis. This assumption will be violated, however, if the genes in the test set are more highly correlated with one another than a random set of genes would be. To illustrate this, we simulated data sets with no differentially expressed genes but for which the 100 genes in the test set share an inter-gene correlation of 0.05, whereas all other genes on the arrays are uncorrelated. Even though the null hypothesis is true, all four existing competitive gene set tests give P -values that are not uniformly distributed but instead highly skewed towards small values ( Figure 1 A–D). None of the existing methods come at all close to controlling the type I error rate correctly, yielding type I error rates many times the nominal rate ( Table 1 , first four lines). This shows that even a small inter-gene correlation dramatically increases the type I error rates to dangerous levels.

Figure 1.

Histograms of P -values from different gene set tests in the absence of any true differential expression, but with a small inter-gene correlation in the test set. The simulation setup and order of test methods is as for Table 1 . Test methods are ( A ) geneSetTest (mod t), ( B ) geneSetTest (ranks of mod t), ( C ) sigPathway, ( D ) PAGE, ( E ) CAMERA (modt) and ( F ) CAMERA (ranks of modt). Existing methods A–D give results highly skewed towards small and falsely significant P -values, whereas CAMERA gives uniformly distributed values.

Table 1.

Type I error rates of gene set tests when genes in the set are correlated

Test method Nominal P -value
0.010.020.050.10
geneSetTest (modt)0.27790.32750.41570.4950
geneSetTest (ranks of modt)0.28260.33190.41440.4955
sigPathway (t)0.25240.30250.38800.4704
PAGE (logFC)0.24410.29000.37090.4503
CAMERA (modt)0.00870.01870.04770.0990
CAMERA (ranks of modt)0.00860.01730.04730.1003
Test method Nominal P -value
0.010.020.050.10
geneSetTest (modt)0.27790.32750.41570.4950
geneSetTest (ranks of modt)0.28260.33190.41440.4955
sigPathway (t)0.25240.30250.38800.4704
PAGE (logFC)0.24410.29000.37090.4503
CAMERA (modt)0.00870.01870.04770.0990
CAMERA (ranks of modt)0.00860.01730.04730.1003

CAMERA holds its size correctly whereas existing methods are highly liberal.

Entries are probabilities of rejecting the null hypothesis when conducting a gene set test to compare two groups of four arrays. Set size is 100 with inter-gene correlation 0.05. The remainder of 10 000 genes are uncorrelated. Results are based on 10 000 simulated data sets, so the standard error with which the error rate is estimated ranges from slightly < 0.001 (for rates near 0.01) to slightly < 0.005 (for rates near 0.5).

Table 1.

Type I error rates of gene set tests when genes in the set are correlated

Test method Nominal P -value
0.010.020.050.10
geneSetTest (modt)0.27790.32750.41570.4950
geneSetTest (ranks of modt)0.28260.33190.41440.4955
sigPathway (t)0.25240.30250.38800.4704
PAGE (logFC)0.24410.29000.37090.4503
CAMERA (modt)0.00870.01870.04770.0990
CAMERA (ranks of modt)0.00860.01730.04730.1003
Test method Nominal P -value
0.010.020.050.10
geneSetTest (modt)0.27790.32750.41570.4950
geneSetTest (ranks of modt)0.28260.33190.41440.4955
sigPathway (t)0.25240.30250.38800.4704
PAGE (logFC)0.24410.29000.37090.4503
CAMERA (modt)0.00870.01870.04770.0990
CAMERA (ranks of modt)0.00860.01730.04730.1003

CAMERA holds its size correctly whereas existing methods are highly liberal.

Entries are probabilities of rejecting the null hypothesis when conducting a gene set test to compare two groups of four arrays. Set size is 100 with inter-gene correlation 0.05. The remainder of 10 000 genes are uncorrelated. Results are based on 10 000 simulated data sets, so the standard error with which the error rate is estimated ranges from slightly < 0.001 (for rates near 0.01) to slightly < 0.005 (for rates near 0.5).

Variance inflation factors

Write forumla for the average of all pairwise correlations ρg , g for genes in the test set. Also write forumla for the background correlation, the average of all pairwise correlations ρg , g for all genes in the genome, or all genes on the array. Normalization of microarray profiles will usually guarantee that forumla is close to zero. However, forumla is quite likely to be >0 when the test set represents co-regulated genes associated with some biological process.

Consider now the effect of these correlations on the gene set tests considered earlier. Suppose that the genewise statistics zg are ordinary or moderated t -statistics. Under the null hypothesis of no differential expression, these statistics have the same variance τ 2 for every gene. If the zg were all independent, then the variance of forumla would be τ2 / m and the sample variance s2 of the zg for all genes on the array would be good estimate of τ2 . If instead forumla is non-zero, then it is straightforward to show (See ‘Materials and Methods’ section) that the variance is not τ2 / m but rather is increased by the variance inflation factor
It can be seen from the formula that VIF can be much larger than one when the test set contains many genes ( m large), even if the inter-gene correlation is quite small. It is this fact that drives the large type I error rates seen in Table 1 . In practice, the correlations between genewise t -statistics are little different from those between the log-expression values ( 29 ), so we will assume the same correlations ρg , g hold between genewise t -statistics tg and tg as between expression values ygi and ygi . The above VIF can be taken to apply to the sigPathway and geneSetTest-modt tests based on ordinary or moderated t -statistics.

The PAGE parametric test uses logFCs as genewise statistics zg . These do not have equal variances under the null hypothesis, so the VIF is somewhat more complicated (See ‘Materials and Methods’ section). Nevertheless, the formula gives good guidance and holds approximately when the genewise variances forumla are roughly equal.

When zg is a rank, the VIF is slightly smaller than the formula above, because converting statistics to relative ranks tends to reduce positive correlations, or induces a small negative correlation if the original statistics were uncorrelated. This agrees with theoretical results that the WMW test is less sensitive to correlation than the t -test ( 34 ). An exact formula can be derived for the variance of the mean rank when the statistics zg are normally distributed and the inter-gene correlation is symmetric in the test set (See ‘Materials and Methods’ section). This exact formula is used in the rank version of the CAMERA procedure.

Estimating the inter-gene correlation

The CAMERA procedure is based on the idea of estimating the VIF from the data. The VIF and mean correlation can be estimated directly from residuals from the linear model for genes in the test set (See ‘Materials and Methods’ section). Briefly, the procedure is to extract a set of d  =  n  −  p independent residuals for each gene in the test test. The residuals are standardized to have equal variances, then summed over genes, and the mean square of these sums estimates the VIF. This procedure is equivalent to computing the average of all possible pairwise correlations between genes in the test set, but is numerically more efficient.

The CAMERA procedure treats the estimated VIF as an unbiased estimator of the true VIF, and as having the precision of a scaled chisquare distribution on d df (See ‘Materials and Methods’ section). Simulations show that this approximation is excellent when d and m are large and forumla is relatively small (data not shown). More generally, however, the VIF and mean correlation estimators are somewhat more precise than these distributional approximations imply ( Table 2 ). This ensures that the CAMERA test procedure will be conservative in small sample situations, and will be close to optimal when more replicates are available and the set size is moderate to large.

Table 2.

Correlation estimates are more precise than implied by the nominal chisquare approximation

CorrelationMean estimateEmpirical SDTheoretical SD
0−0.000070.006880.00698
0.020.01960.01170.0124
0.050.04900.01900.0206
0.10.09810.03000.0342
0.20.19610.04810.0614
CorrelationMean estimateEmpirical SDTheoretical SD
0−0.000070.006880.00698
0.020.01960.01170.0124
0.050.04900.01900.0206
0.10.09810.03000.0342
0.20.19610.04810.0614

Columns 2 and 3 give the mean and standard deviation (SD) of correlation estimates over 10 000 simulated data sets with set size of m  = 40 and residual df d  = 27. The empirical SDs are consistently less than the theoretical values. The simulation standard error with which the empirical SD is estimated is about 1.4%.

Table 2.

Correlation estimates are more precise than implied by the nominal chisquare approximation

CorrelationMean estimateEmpirical SDTheoretical SD
0−0.000070.006880.00698
0.020.01960.01170.0124
0.050.04900.01900.0206
0.10.09810.03000.0342
0.20.19610.04810.0614
CorrelationMean estimateEmpirical SDTheoretical SD
0−0.000070.006880.00698
0.020.01960.01170.0124
0.050.04900.01900.0206
0.10.09810.03000.0342
0.20.19610.04810.0614

Columns 2 and 3 give the mean and standard deviation (SD) of correlation estimates over 10 000 simulated data sets with set size of m  = 40 and residual df d  = 27. The empirical SDs are consistently less than the theoretical values. The simulation standard error with which the empirical SD is estimated is about 1.4%.

CAMERA

We develop two versions of CAMERA: a parametric version analogous to PAGE and a rank-based version analogous to geneSetTest. Both versions use genewise moderated t -statistics ( 24 ). The parametric version transforms the genewise t -statistics to z -statistics, zg , that are unit normal under the null hypothesis (See ‘Materials and Methods’ section), then conducts a global two-sample t -test to compare the zg values for genes in the test set to those for genes not in the set. The two-sample t -test is adjusted for correlation between genes in the test set, using the extended t -statistic described in the Materials and Methods section. In this procedure, the VIF estimate for the test set of genes is inserted to inflate the standard error of the two-sample t -statistic. The P -value is evaluated by comparing the t -statistic to the t -distribution on d df.

The rank-based version conducts a two-sample non-parametric test instead of the t -test. It replaces the two-sample t -test with an extended version of the WMW test that allows for a correlation between members of one of the two groups being compared (See ‘Materials and Methods’ section). The extended WMW test uses an exact formula for the variance of the rank-sum statistic under correlation. Again, the P -value is evaluated by comparing the rank-sum statistic to the t -distribution on d df.

CAMERA controls type I error correctly

Both the parametric (mod t ) and rank-based versions of CAMERA control the type I error rate correctly, or are very slightly conservative ( Table 1 , lines 5,6). They generate P -values that are uniformly distributed under the null hypothesis of no differential expression even when genes in the test set are positively correlated ( Figure 1 E and F). The results shown in Figure 1 and Table 1 are for set size m  = 100, mean correlation forumla and residual df d  =  n  −  p  = 6. The results remain essentially unchanged as these parameters are varied. The two CAMERA procedures continue to hold their size correctly regardless of the experimental setup in all simulations that we have conducted (data not shown).

The simulation setup of Figure 1 and Table 1 assumes that genes not in the test set are uncorrelated. In reality, the background genes may themselves belong to co-regulated pathways that may induce a more complex correlation structure. We, therefore, investigated a genome-wide correlation structure in which the whole genome can be partitioned into distinct molecular pathways, each consisting of 200 genes. Each pathway was assumed to have an inter-gene correlation of the same size as that for the test set. Within each pathway, half the genes were assumed to be up-regulated and half down-regulated by the pathway. Genes regulated in the same direction were assumed to be positively correlated whereas genes regulated in opposite directions were assumed to be negatively correlated. This clumpy genome-wide correlation did not affect the type I error rates for CAMERA. Both parametric and rank-based versions of CAMERA continued to return P -values that were uniformly distributed under the null hypothesis of no differential expression (data not shown).

CAMERA retains good power

CAMERA retains good power to detect small but consistent fold-changes in the test gene set. Table 3 gives power results for four different scenarios. The logFCs shown in the table were selected as the smallest changes for which power in the range of 50–80% was achieved. Further simulations show that power increases very rapidly for larger fold changes (data not shown).

Table 3.

CAMERA has excellent power to detect sets with small but consistent expression fold-changes

CorPercent DE genes log 2 FC df = 6
df = 27
ModtRanksModtRanks
01000.050.5870.5880.700.68
0250.200.5620.5150.690.58
0.051000.100.4520.4520.530.54
0.05250.250.6450.5330.770.66
CorPercent DE genes log 2 FC df = 6
df = 27
ModtRanksModtRanks
01000.050.5870.5880.700.68
0250.200.5620.5150.690.58
0.051000.100.4520.4520.530.54
0.05250.250.6450.5330.770.66

Columns 4–7 give probabilities of rejecting the null hypothesis at P <0.05. Set size is m  = 100 with either 100% or 25% of genes in the set differentially expressed between two groups of four arrays. Residual df is either 6 or 27 depending on whether or not the experiment includes a third group of 22 arrays. Inter-gene correlation is either 0 or 0.05. ‘Mod-t’ and ‘Ranks’ refer to parametric and rank-based CAMERA procedures, respectively. Results based on 1000 simulated data sets for each scenario.

Table 3.

CAMERA has excellent power to detect sets with small but consistent expression fold-changes

CorPercent DE genes log 2 FC df = 6
df = 27
ModtRanksModtRanks
01000.050.5870.5880.700.68
0250.200.5620.5150.690.58
0.051000.100.4520.4520.530.54
0.05250.250.6450.5330.770.66
CorPercent DE genes log 2 FC df = 6
df = 27
ModtRanksModtRanks
01000.050.5870.5880.700.68
0250.200.5620.5150.690.58
0.051000.100.4520.4520.530.54
0.05250.250.6450.5330.770.66

Columns 4–7 give probabilities of rejecting the null hypothesis at P <0.05. Set size is m  = 100 with either 100% or 25% of genes in the set differentially expressed between two groups of four arrays. Residual df is either 6 or 27 depending on whether or not the experiment includes a third group of 22 arrays. Inter-gene correlation is either 0 or 0.05. ‘Mod-t’ and ‘Ranks’ refer to parametric and rank-based CAMERA procedures, respectively. Results based on 1000 simulated data sets for each scenario.

Not unexpectedly, power is greatest when the genes are uncorrelated and all the genes in the test set are differentially expressed. However, power is still acceptable, even when the genes are correlated, only a subset of genes are actively differentially expressed, and the number of RNA samples is relatively small.

Interestingly, CAMERA loses relatively little power compared with existing unadjusted gene set tests when the genes are in fact independent. In the scenario of the first row of Table 3 , geneSetTest-modt and geneSetTest-ranks give powers 0.72 and 0.71, respectively, only slightly better than CAMERA with df=27. In the scenario of the second row, geneSetTest-modt and geneSetTest-ranks have powers 0.75 and 0.64, respectively, again only slightly better than CAMERA.

Even well-controlled data sets show inter-gene correlations

It should not be surprising that co-regulated genes will typically be positively correlated across diverse RNA samples. We wanted to explore how much inter-gene correlation remains in well-controlled experimental situations, for example, when RNA samples are extracted from sorted homogeneous cells from genetically identical model animals in controlled laboratory conditions. We examined three microarray data sets showing different degrees of biological variation between replicates. The first data set profiles 94 breast cancer tumors classified into six molecular subtypes ( 30 , 31 ). The second data set profiles four types of mammary epithelial progenitor cells from three human subjects ( 31 ). The third data set profiles three types of hemapoeitic stem cells from four strains of genetically identical mice. This dataset has 2–4 biological replicates for each strain and cell type for a total of 35 microarrays. After fitting linear models to remove treatment effects, the three data sets have, respectively, 94 − 6 = 88, 12 − 4 = 8 and 35 − 12 = 23 residual df available for estimating inter-gene correlations. The cancer data should show the most biological variability because replicates represent genetically different tumors, even within a molecular subtype. The mouse data should show the least, because the replicate samples are sorted homogeneous cells from genetically identical mice.

Inter-gene correlations and VIFs were computed for all gene sets containing five or more genes from the C2 collection of the Molecular Signatures Database Version 3.0 (MSigDB) ( 35 ). Although the average correlation between all genes on the arrays was close to zero (0.0026 for the tumor data, 0.0009 for the human cell data, 0.0029 for the mouse cell data), the correlations for the curated gene sets were overwhelming >0, ranging up to 0.71 ( Figure 2 ). For the tumor data, 96% of gene sets had positive correlation and the great majority of VIFs were significantly >1 ( Figure 2 , top right). For the human cell data, 86% of gene sets had positive correlation and more than half the VIFs were significantly >1 ( Figure 2 , bottom left). Even for the mouse data, nearly half of the VIFs were significantly >1, according to a conservative 5% P -value cutoff ( Figure 2 , bottom right). This demonstrates that positive inter-gene correlations and non-ignorably large VIFs are typical for sets of co-regulated genes, even for highly controlled experiments with genetically identical animals.

Figure 2.

Inter-gene correlations for MSigDB gene sets in three microarray data sets. Top-left panel shows correlations. The other three panels show VIFs for breast cancers, human mammary epithelial cells and mouse hemapoeitic stem cells, respectively. The VIF plots show the cumulative distribution of VIFs over all gene sets. Solid and dotted horizontal lines show the mean and upper 95% quantile under the assumption of zero correlation.

Molecular signature of basal-like breast cancer

Basal-like breast cancer has the worst prognosis of any of six well-accepted subtypes of breast cancer ( 30 , 31 ). To demonstate the ability of CAMERA to recover biologically meaningful results, we contrasted basal-like cancers with the other five cancer subtypes. That is, we formed a contrast for the logFC in expression between basal-like tumors and the average of the other five tumor subtypes. Note that this is more powerful than simply pooling the other five tumor subtypes, in that all the subtypes are still modeled by the linear model and between subtype variability is still removed from the analysis. We ran CAMERA for this contrast for all the gene sets in the curated C2 collection of the MSigDB. CAMERA found 74 signatures, using the Benjamini–Hochberg algorithm to control the FDR at 0.05.

The CAMERA results recapitulate our knowledge of basal-like cancer in the strongest possible terms ( Table 4 ). The basal-like signature itself is the top set, and the negative basal-like signature is third. No fewer than 30 out of the top 35 gene sets are explicitly breast cancer derived, even though there are only 127 such sets in the entire MSigDB. All basal-related signatures are up-regulated whereas signatures associated with other subtypes are down-regulated. In particular, signatures associated with BRCA1 mutations are up-regulated, confirming that this is a defining characteristic of basal-like cancer. Signatures associated with ESR1 are down-regulated, confirming that ESR1 expression is associated with the luminal A subtype and good prognosis. Three other gene sets show an embryonic stem cell-like signature in basal-like cancer, a known characteristic of basal-like cancer ( 30 , 36 , 37 ). Other sets show associations with early onset and metastasis, concordant with the poor patient outcomes associated with basal-like cancer. Note that all the top 35 gene sets show non-ignorable positive correlations.

Table 4.

Molecular signatures distinguishing basal-like from other breast cancer subtypes

Gene setN GenesCorrelationDirectionP -value FDR
Smid_Breast_Cancer_Basal_Up5800.039up1.2e-091.9e-08
Doane_Breast_Cancer_Esr1_Up980.063down1.4e-091.9e-08
Smid_Breast_Cancer_Basal_Dn5690.035down1.5e-091.9e-08
Vantveer_Breast_Cancer_Esr1_Up1160.062down3.2e-093.2e-08
Smid_Breast_Cancer_Relapse_In_Bone_Up850.044down1.9e-081.3e-07
Benporath_Es_Core_Nine_Correlated950.057up1.9e-081.3e-07
Smid_Breast_Cancer_Relapse_In_Brain_Up380.065up2.4e-081.3e-07
Yang_Breast_Cancer_Esr1_Up240.122down3.0e-081.5e-07
Smid_Breast_Cancer_Relapse_In_Brain_Dn680.059down3.7e-081.6e-07
Doane_Breast_Cancer_Esr1_Dn460.099up8.9e-083.6e-07
Yang_Breast_Cancer_Esr1_Bulk_Up150.109down3.0e-071.1e-06
Smid_Breast_Cancer_Relapse_In_Bone_Dn2810.044up3.8e-071.3e-06
Vantveer_Breast_Cancer_Esr1_Dn1950.069up4.1e-071.3e-06
Vantveer_Breast_Cancer_Metastasis_Up370.051down8.0e-072.3e-06
Benporath_Es_Core_Nine90.097up9.0e-072.4e-06
Smid_Breast_Cancer_Luminal_B_Up1440.049down3.2e-068.0e-06
Doane_Breast_Cancer_Classes_Up580.095down4.7e-061.1e-05
Smid_Breast_Cancer_Luminal_A_Dn160.174up4.9e-061.1e-05
Yang_Breast_Cancer_Esr1_Bulk_Dn150.066up6.1e-061.3e-05
Yang_Breast_Cancer_Esr1_Laser_Up240.053down8.1e-061.6e-05
Yang_Breast_Cancer_Esr1_Dn190.175up1.1e-052.2e-05
Benporath_Es_13190.024up1.7e-053.2e-05
Vecchi_Gastric_Cancer_Early_Up3420.053up2.1e-053.6e-05
Smid_Breast_Cancer_Relapse_In_Lung_Up210.051up3.2e-055.4e-05
Sotiriou_Breast_Cancer_Grade_1_Vs_3_Dn400.064down3.4e-055.5e-05
Landemaine_Lung_Metastasis150.139up4.5e-056.9e-05
Lien_Breast_Carcinoma_Metaplastic_Vs_Ductal_Dn900.097down4.8e-057.1e-05
Charafe_Breast_Cancer_Luminal_Vs_Basal_Up2760.034down5.2e-057.4e-05
Vantveer_Breast_Cancer_Metastasis_Dn920.119up7.1e-059.8e-05
Pujana_Breast_Cancer_With_Brca1_Mutated_Up500.144up1.3e-041.7e-04
Chiang_Liver_Cancer_Subclass_Proliferation_Up1320.071up1.5e-041.9e-04
Vantveer_Breast_Cancer_Brca1_Up270.039up2.1e-042.6e-04
Naderi_Breast_Cancer_Prognosis_Up370.123up2.8e-043.2e-04
Doane_Breast_Cancer_Classes_Dn310.072up2.8e-043.2e-04
Smid_Breast_Cancer_Luminal_A_Up740.122down2.9e-043.2e-04

Luminal progenitor up2970.032up0.00012
Luminal progenitor down1570.040down0.00049
Gene setN GenesCorrelationDirectionP -value FDR
Smid_Breast_Cancer_Basal_Up5800.039up1.2e-091.9e-08
Doane_Breast_Cancer_Esr1_Up980.063down1.4e-091.9e-08
Smid_Breast_Cancer_Basal_Dn5690.035down1.5e-091.9e-08
Vantveer_Breast_Cancer_Esr1_Up1160.062down3.2e-093.2e-08
Smid_Breast_Cancer_Relapse_In_Bone_Up850.044down1.9e-081.3e-07
Benporath_Es_Core_Nine_Correlated950.057up1.9e-081.3e-07
Smid_Breast_Cancer_Relapse_In_Brain_Up380.065up2.4e-081.3e-07
Yang_Breast_Cancer_Esr1_Up240.122down3.0e-081.5e-07
Smid_Breast_Cancer_Relapse_In_Brain_Dn680.059down3.7e-081.6e-07
Doane_Breast_Cancer_Esr1_Dn460.099up8.9e-083.6e-07
Yang_Breast_Cancer_Esr1_Bulk_Up150.109down3.0e-071.1e-06
Smid_Breast_Cancer_Relapse_In_Bone_Dn2810.044up3.8e-071.3e-06
Vantveer_Breast_Cancer_Esr1_Dn1950.069up4.1e-071.3e-06
Vantveer_Breast_Cancer_Metastasis_Up370.051down8.0e-072.3e-06
Benporath_Es_Core_Nine90.097up9.0e-072.4e-06
Smid_Breast_Cancer_Luminal_B_Up1440.049down3.2e-068.0e-06
Doane_Breast_Cancer_Classes_Up580.095down4.7e-061.1e-05
Smid_Breast_Cancer_Luminal_A_Dn160.174up4.9e-061.1e-05
Yang_Breast_Cancer_Esr1_Bulk_Dn150.066up6.1e-061.3e-05
Yang_Breast_Cancer_Esr1_Laser_Up240.053down8.1e-061.6e-05
Yang_Breast_Cancer_Esr1_Dn190.175up1.1e-052.2e-05
Benporath_Es_13190.024up1.7e-053.2e-05
Vecchi_Gastric_Cancer_Early_Up3420.053up2.1e-053.6e-05
Smid_Breast_Cancer_Relapse_In_Lung_Up210.051up3.2e-055.4e-05
Sotiriou_Breast_Cancer_Grade_1_Vs_3_Dn400.064down3.4e-055.5e-05
Landemaine_Lung_Metastasis150.139up4.5e-056.9e-05
Lien_Breast_Carcinoma_Metaplastic_Vs_Ductal_Dn900.097down4.8e-057.1e-05
Charafe_Breast_Cancer_Luminal_Vs_Basal_Up2760.034down5.2e-057.4e-05
Vantveer_Breast_Cancer_Metastasis_Dn920.119up7.1e-059.8e-05
Pujana_Breast_Cancer_With_Brca1_Mutated_Up500.144up1.3e-041.7e-04
Chiang_Liver_Cancer_Subclass_Proliferation_Up1320.071up1.5e-041.9e-04
Vantveer_Breast_Cancer_Brca1_Up270.039up2.1e-042.6e-04
Naderi_Breast_Cancer_Prognosis_Up370.123up2.8e-043.2e-04
Doane_Breast_Cancer_Classes_Dn310.072up2.8e-043.2e-04
Smid_Breast_Cancer_Luminal_A_Up740.122down2.9e-043.2e-04

Luminal progenitor up2970.032up0.00012
Luminal progenitor down1570.040down0.00049

CAMERA results for the top 35 gene sets from the MSigDB when comparing basal-like cancers to the average of the other five subtypes. Output includes the size of each set, the estimated inter-gene correlation, two-sided P -value and FDR. Also given are results for mammary luminal progentor cell signatures.

Table 4.

Molecular signatures distinguishing basal-like from other breast cancer subtypes

Gene setN GenesCorrelationDirectionP -value FDR
Smid_Breast_Cancer_Basal_Up5800.039up1.2e-091.9e-08
Doane_Breast_Cancer_Esr1_Up980.063down1.4e-091.9e-08
Smid_Breast_Cancer_Basal_Dn5690.035down1.5e-091.9e-08
Vantveer_Breast_Cancer_Esr1_Up1160.062down3.2e-093.2e-08
Smid_Breast_Cancer_Relapse_In_Bone_Up850.044down1.9e-081.3e-07
Benporath_Es_Core_Nine_Correlated950.057up1.9e-081.3e-07
Smid_Breast_Cancer_Relapse_In_Brain_Up380.065up2.4e-081.3e-07
Yang_Breast_Cancer_Esr1_Up240.122down3.0e-081.5e-07
Smid_Breast_Cancer_Relapse_In_Brain_Dn680.059down3.7e-081.6e-07
Doane_Breast_Cancer_Esr1_Dn460.099up8.9e-083.6e-07
Yang_Breast_Cancer_Esr1_Bulk_Up150.109down3.0e-071.1e-06
Smid_Breast_Cancer_Relapse_In_Bone_Dn2810.044up3.8e-071.3e-06
Vantveer_Breast_Cancer_Esr1_Dn1950.069up4.1e-071.3e-06
Vantveer_Breast_Cancer_Metastasis_Up370.051down8.0e-072.3e-06
Benporath_Es_Core_Nine90.097up9.0e-072.4e-06
Smid_Breast_Cancer_Luminal_B_Up1440.049down3.2e-068.0e-06
Doane_Breast_Cancer_Classes_Up580.095down4.7e-061.1e-05
Smid_Breast_Cancer_Luminal_A_Dn160.174up4.9e-061.1e-05
Yang_Breast_Cancer_Esr1_Bulk_Dn150.066up6.1e-061.3e-05
Yang_Breast_Cancer_Esr1_Laser_Up240.053down8.1e-061.6e-05
Yang_Breast_Cancer_Esr1_Dn190.175up1.1e-052.2e-05
Benporath_Es_13190.024up1.7e-053.2e-05
Vecchi_Gastric_Cancer_Early_Up3420.053up2.1e-053.6e-05
Smid_Breast_Cancer_Relapse_In_Lung_Up210.051up3.2e-055.4e-05
Sotiriou_Breast_Cancer_Grade_1_Vs_3_Dn400.064down3.4e-055.5e-05
Landemaine_Lung_Metastasis150.139up4.5e-056.9e-05
Lien_Breast_Carcinoma_Metaplastic_Vs_Ductal_Dn900.097down4.8e-057.1e-05
Charafe_Breast_Cancer_Luminal_Vs_Basal_Up2760.034down5.2e-057.4e-05
Vantveer_Breast_Cancer_Metastasis_Dn920.119up7.1e-059.8e-05
Pujana_Breast_Cancer_With_Brca1_Mutated_Up500.144up1.3e-041.7e-04
Chiang_Liver_Cancer_Subclass_Proliferation_Up1320.071up1.5e-041.9e-04
Vantveer_Breast_Cancer_Brca1_Up270.039up2.1e-042.6e-04
Naderi_Breast_Cancer_Prognosis_Up370.123up2.8e-043.2e-04
Doane_Breast_Cancer_Classes_Dn310.072up2.8e-043.2e-04
Smid_Breast_Cancer_Luminal_A_Up740.122down2.9e-043.2e-04

Luminal progenitor up2970.032up0.00012
Luminal progenitor down1570.040down0.00049
Gene setN GenesCorrelationDirectionP -value FDR
Smid_Breast_Cancer_Basal_Up5800.039up1.2e-091.9e-08
Doane_Breast_Cancer_Esr1_Up980.063down1.4e-091.9e-08
Smid_Breast_Cancer_Basal_Dn5690.035down1.5e-091.9e-08
Vantveer_Breast_Cancer_Esr1_Up1160.062down3.2e-093.2e-08
Smid_Breast_Cancer_Relapse_In_Bone_Up850.044down1.9e-081.3e-07
Benporath_Es_Core_Nine_Correlated950.057up1.9e-081.3e-07
Smid_Breast_Cancer_Relapse_In_Brain_Up380.065up2.4e-081.3e-07
Yang_Breast_Cancer_Esr1_Up240.122down3.0e-081.5e-07
Smid_Breast_Cancer_Relapse_In_Brain_Dn680.059down3.7e-081.6e-07
Doane_Breast_Cancer_Esr1_Dn460.099up8.9e-083.6e-07
Yang_Breast_Cancer_Esr1_Bulk_Up150.109down3.0e-071.1e-06
Smid_Breast_Cancer_Relapse_In_Bone_Dn2810.044up3.8e-071.3e-06
Vantveer_Breast_Cancer_Esr1_Dn1950.069up4.1e-071.3e-06
Vantveer_Breast_Cancer_Metastasis_Up370.051down8.0e-072.3e-06
Benporath_Es_Core_Nine90.097up9.0e-072.4e-06
Smid_Breast_Cancer_Luminal_B_Up1440.049down3.2e-068.0e-06
Doane_Breast_Cancer_Classes_Up580.095down4.7e-061.1e-05
Smid_Breast_Cancer_Luminal_A_Dn160.174up4.9e-061.1e-05
Yang_Breast_Cancer_Esr1_Bulk_Dn150.066up6.1e-061.3e-05
Yang_Breast_Cancer_Esr1_Laser_Up240.053down8.1e-061.6e-05
Yang_Breast_Cancer_Esr1_Dn190.175up1.1e-052.2e-05
Benporath_Es_13190.024up1.7e-053.2e-05
Vecchi_Gastric_Cancer_Early_Up3420.053up2.1e-053.6e-05
Smid_Breast_Cancer_Relapse_In_Lung_Up210.051up3.2e-055.4e-05
Sotiriou_Breast_Cancer_Grade_1_Vs_3_Dn400.064down3.4e-055.5e-05
Landemaine_Lung_Metastasis150.139up4.5e-056.9e-05
Lien_Breast_Carcinoma_Metaplastic_Vs_Ductal_Dn900.097down4.8e-057.1e-05
Charafe_Breast_Cancer_Luminal_Vs_Basal_Up2760.034down5.2e-057.4e-05
Vantveer_Breast_Cancer_Metastasis_Dn920.119up7.1e-059.8e-05
Pujana_Breast_Cancer_With_Brca1_Mutated_Up500.144up1.3e-041.7e-04
Chiang_Liver_Cancer_Subclass_Proliferation_Up1320.071up1.5e-041.9e-04
Vantveer_Breast_Cancer_Brca1_Up270.039up2.1e-042.6e-04
Naderi_Breast_Cancer_Prognosis_Up370.123up2.8e-043.2e-04
Doane_Breast_Cancer_Classes_Dn310.072up2.8e-043.2e-04
Smid_Breast_Cancer_Luminal_A_Up740.122down2.9e-043.2e-04

Luminal progenitor up2970.032up0.00012
Luminal progenitor down1570.040down0.00049

CAMERA results for the top 35 gene sets from the MSigDB when comparing basal-like cancers to the average of the other five subtypes. Output includes the size of each set, the estimated inter-gene correlation, two-sided P -value and FDR. Also given are results for mammary luminal progentor cell signatures.

For comparison, we used the popular GSEA software ( 20 ) to compare basal-like cancers to the other cancer subtypes, treating the non-basal tumors as a single group, yielding 124 up-regulated gene sets at 5% FDR and 13 significantly down-regulated sets at 5%. GSEA does not perform a two-sided test, so the FDR control is not as stringent as for CAMERA. The sets found by GSEA were less enriched for breast cancer sets and, in particular, did not include the basal-like signatures themselves.

Our previous interest in mammary luminal progenitor cells as the putative ‘cell of origin’ for basal-like cancer lead us to test the luminal progenitor signatures derived by ( 31 ). This confirmed the strong presence of the luminal progenitor signature in the basal-like tumors as opposed to the other tumor subtypes ( Table 4 ), confirming luminal progenitors as the likely cell of origin for basal cancers ( 31 , 38 ). CAMERA allows us to formally take genewise correlation into account in evaluating statistical significance, which we were not able to do in our earlier publication ( 31 ).

DISCUSSION

The majority of gene set tests that appear in the biological literature are of a competitive nature, in that they compare one category of genes to all other genes in the genome or on the microarray. Many or most methods of pathway analysis can be viewed as competitive gene set tests. This includes well-known contingency table tests, such as Fisher's exact test, that are often used to test for enrichment of a gene annotation category in a list of differentially expressed genes. These tests can be viewed as competitive gene set tests, according to the framework of this article, with the genewise statistic taking values one or zero depending on whether the gene is ranked in the top list ( 29 ). Another popular pathway analysis methodology, GSEA software ( 20 ), uses array permutation when the number of samples is large but gives the option of gene permutation when the number of samples is small. It, therefore, switches to a pure competitive test when the sample size is small. All these tests are likely to share the sensitivity to inter-gene correlation that was demonstrated in this article for existing competitive gene set tests.

Our results show positive inter-gene correlation to be prevalent for co-regulated genes, even for highly homogeneous cells and even for genetically identical animals under laboratory conditions. This agrees with Gatti et al . ( 9 ), who surveyed inter-gene correlations for common Kyoto Encyclopedia of Genes and Genomes pathways and gene ontology terms for over 200 data sets from the Gene Expression Omnibus. Gatti et al . used Pearson correlations coefficients, which measure total correlation between two genes, whereas we have computed residual correlations after removing treatment effects, and these are generally smaller. Our results show that inter-gene correlations remain prevalent even across replicates for homogeneous treatment groups. This suggests that inter-gene correlations cannot routinely be ignored in any commonly occuring molecular biology context.

A crucial aspect of the CAMERA procedure is to be able to estimate the inter-gene correlations efficiently, and to be able to characterize the precision of the resulting estimator, so that the uncertainty of estimation can be taken into account when evaluating the significance of the test. This article has shown that the variability of the estimated VIF can be bounded above by the variability of a chisquare distribution, meaning that the CAMERA test can be based on a t -distribution. This ensures that CAMERA controls the type I error rate correctly even for small sample sizes. CAMERA controlled the type I error rate correctly in all simulations we have conducted.

CAMERA continued to hold its size correctly even when all genes in the genome were assumed to belong to co-regulated pathways, each with their own inter-gene correlation structure. In our simulations, inter-gene correlations for background pathways were assumed to be of the same size as that for the test set. In most real situations, we expect that background genes will tend to be less highly correlated than those in the test set, because the test set is typically chosen specifically to contain co-regulated genes based on prior information. Hence, our simulations provide strong support for CAMERA to hold its size correctly in practical situations.

The necessity to estimate the inter-gene correlation from the data inevitably incurs some loss of statistical power, reflected in the use of a t -distribution instead of the standard normal distribution for evaluating the P -value. Yet, simulations show that CAMERA retains surprisingly good power compared to existing competitive tests when those methods are applicable, i.e. when the inter-gene correlation actually is zero. CAMERA actually has greater power than PAGE or sigPathway when the test set contains a large number of genes. One factor that contributes to this retention of statistical power is the use of genewise statistics that are normally distributed with equal variances under the null hypothesis of no differential expression. Previous parametric gene set tests have been based on genewise logFCs, which typically have different variances for different genes ( 13 , 16 ). A consequence is that the arithmetic average of these quantities over genes in the test set is less precise than would be a similar average of equal-variance quantities. Previous gene permutation gene set tests have been based on ordinary t -statistics that can be far from normally distributed when the sample sizes are small ( 2 ). Again, taking the arithmetic average of non-normal quantities is not generally a statistically efficient summary of their average size. Another factor contributing to power is the fact the CAMERA compares genes in the test set versus the complementary set of genes, rather than comparing the test set of genes to the background of all genes. This ensures that strong non-null effects in the test set do not contaminate the background set that is used to generate the null distribution.

CAMERA was applied to the breast cancer subtype data, and shown to be a very effective alternative to existing gene set enrichment analysis software ( 17 , 20 ) for interrogating a data set with a database of molecular signatures. CAMERA has greater statistical power than GSEA procedures based on array permutation when the number of RNA samples is not large.

Competitive tests have been used in the literature for their intuitive interpretation. To our knowledge, the null statistical hypotheses being tested have either not been stated or have been stated in operational terms. In effect, the hypothesis has been defined by the test procedure. This is especially true of competitive tests that evaluate P -values by array permutation ( 17 , 19 , 20 ). For these procedures, the null hypothesis being tested is difficult to characterize in parametric terms. To our knowledge, this article offers the first specification of null and alternative hypotheses for a gene set test in parametric terms. Crucially, the null and alternative hypothesis state relationships between logFCs for genes in or out of the test set, and do not involve other distributional aspects of the expression values such as correlations or variances. Previous statements of competitive null hypotheses in terms of random sampling of gene sets makes this distinction impossible.

Compared to the existing competitive gene set tests, CAMERA assigns less significance to gene sets that show positive inter-gene correlation. Positive correlation is an indication that genes are co-regulated and possibly functionally related, and it has been argued elsewhere that detection of co-regulated sets is of interest in itself ( 13 ). Our view is that inter-gene correlation reflects non-specific co-regulation, unrelated to the treatment conditions of the current experiment, whereas a gene set test should focus on co-regulation that is specific to the treatment comparison of interest.

To be completely general, CAMERA has been developed in a linear model context. This means that it is not limited to two-group comparisons, but can be used to test the behavior of gene sets across any contrast or interaction in a linear model context.

CAMERA is computationally extremely fast. The analysis presented in Table 4 , for example, requiring gene set tests for nearly 4000 gene sets, took only a couple of seconds on a laptop computer. By comparison, the equivalent GSEA analysis took 2 h on a high-performance large memory 16-core computer.

CONCLUSION

CAMERA is a competitive gene set test that controls type I error correctly regardless of inter-gene correlations, yet retains good statistical power. It has good performance for both focused testing of individual gene sets of special interest, and for gene set enrichment analysis using databases of gene annotation categories or transcriptional signatures. CAMERA is freely available as a function in the limma software package available from Bioconductor ( 26 ).

FUNDING

National Health and Medical Research Council [490037 and Research Fellowship to G.K.S.]; Australian Government [Australian Postgraduate Research Award to D.W.]. Funding for open access charge: National Health and Medical Research Council [490037].

Conflict of interest statement . None declared.

ACKNOWLEDGEMENTS

Thanks to our collaborators Benjamin Kile, Libby Kruse, Doug Hilton, Tracey Baldwin and Irina Pleines for allowing us to use data from the mouse microarray experiment conducted in their laboratory, to Yifang Hu for ortholog mapping and preparation of the MSigDB gene sets in R and for pre-processing of the mouse stem cell data, to Seon-Young Kim for providing Python code for PAGE, and to Belinda Phipson and Matthew Ritchie for helpful comments on the manuscript. DW is grateful to Jun S. Liu and Robert Plenge at Harvard University for supporting her as a postdoctoral researcher during the writing of this article.

REFERENCES

1
Goeman
JJ
Bühlmann
P
Analyzing gene expression data in terms of gene sets: methodological issues
Bioinformatics
2007
, vol. 
23
 (pg. 
980
-
987
)
2
Tian
L
Greenberg
SA
Kong
SW
Altschuler
J
Kohane
IS
Park
PJ
Discovering statistically significant pathways in expression profiling studies
Proc. Natl Acad. Sci. U.S.A
2005
, vol. 
102
 (pg. 
13544
-
13549
)
3
Goeman
JJ
van de Geer
SA
de Kort
F
van Houwelingen
HC
A global test for groups of genes: testing association with a clinical outcome
Bioinformatics
2004
, vol. 
20
 (pg. 
93
-
99
)
4
Saxena
V
Orgill
D
Kohane
I
Absolute enrichment: gene set enrichment analysis for homeostatic systems
Nucleic Acids Res.
2006
, vol. 
34
 pg. 
e151
 
5
Dinu
I
Potter
JD
Mueller
T
Liu
Q
Adewale
AJ
Jhangri
GS
Einecke
G
Famulski
KS
Halloran
P
Yasui
Y
Improving gene set analysis of microarray data by SAM-GS
BMC Bioinformatics
2007
, vol. 
8
 pg. 
242
 
6
Jiang
Z
Gentleman
R
Extensions to gene set enrichment
Bioinformatics
2007
, vol. 
23
 (pg. 
306
-
313
)
7
Oron
AP
Jiang
Z
Gentleman
R
Gene set enrichment analysis using linear models and diagnostics
Bioinformatics
2008
, vol. 
24
 (pg. 
2586
-
2591
)
8
Wu
D
Lim
E
Vaillant
F
Asselin-Labat
ML
Visvader
JE
Smyth
GK
ROAST: rotation gene set tests for complex microarray experiments
Bioinformatics
2010
, vol. 
26
 (pg. 
2176
-
2182
)
9
Gatti
DM
Barry
WT
Nobel
AB
Rusyn
I
Wright
FA
Heading down the wrong pathway: on the influence of correlation within gene sets
BMC Genomics
2010
, vol. 
11
 pg. 
574
 
10
Breslin
T
Edén
P
Krogh
M
Comparing functional annotation analyses with Catmap
BMC Bioinformatics
2004
, vol. 
5
 pg. 
193
 
11
Volinia
S
Evangelisti
R
Francioso
F
Arcelli
D
Carella
M
Gasparini
P
GOAL: automated Gene Ontology analysis of expression profiles
Nucleic Acids Res.
2004
, vol. 
32
 (pg. 
W492
-
W499
)
12
Smid
M
Dorssers
LC
GO-Mapper: functional analysis of gene expression data using the expression level as a score to evaluate Gene Ontology terms
Bioinformatics
2004
, vol. 
20
 (pg. 
2618
-
2625
)
13
Kim
SY
Volsky
DJ
PAGE: parametric analysis of gene set enrichment
BMC Bioinformatics
2005
, vol. 
6
 pg. 
144
 
14
Boorsma
A
Foat
BC
Vis
D
Klis
F
Bussemaker
HJ
T-profiler: scoring the activity of predefined groups of genes using gene expression data
Nucleic Acids Res.
2005
, vol. 
33
 (pg. 
W592
-
W595
)
15
Nam
D
Kim
SB
Kim
SK
Yang
S
Kim
SY
Chu
IS
ADGO: analysis of differentially expressed gene sets using composite GO annotation
Bioinformatics
2006
, vol. 
22
 (pg. 
2249
-
2253
)
16
Luo
W
Friedman
MS
Shedden
K
Hankenson
KD
Woolf
PJ
GAGE: generally applicable gene set enrichment for pathway analysis
BMC Bioinformatics
2009
, vol. 
10
 (pg. 
1471
-
2105
(Electronic) 1471–2105 (Linking)
17
Efron
B
Tibshirani
R
On testing the significance of sets of genes
Ann. Appl. Statist.
2007
, vol. 
1
 (pg. 
107
-
129
)
18
Dørum
G
Snipen
L
Solheim
M
Saebo
S
Rotation testing in gene set enrichment analysis for small direct comparison experiments
Stat. Appl. Genet. Mol. Biol.
2009
, vol. 
8
  
Article34
19
Barry
WT
Nobel
AB
Wright
FA
Significance analysis of functional categories in gene expression studies: a structured permutation approach
Bioinformatics
2005
, vol. 
21
 (pg. 
1943
-
1949
)
20
Subramanian
A
Tamayo
P
Mootha
VK
Mukherjee
S
Ebert
BL
Gillette
MA
Paulovich
A
Pomeroy
SL
Golub
TR
Lander
ES
, et al. 
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
Proc. Natl Acad. Sci. U.S.A
2005
, vol. 
102
 (pg. 
15545
-
15550
)
21
Nam
D
De-correlating expression in gene-set analysis
Bioinformatics
2010
, vol. 
26
 (pg. 
i511
-
i516
)
22
Wang
L
Zhang
B
Wolfinger
RD
Chen
X
An integrated approach for the analysis of biological pathways using mixed models
PLoS Genet.
2008
, vol. 
4
 pg. 
e1000115
 
23
Wang
L
Chen
X
Wolfinger
R
Franklin
J
Coffey
R
Zhang
B
A unified mixed effects model for gene set analysis of time course microarray experiments
Stat. Appl. Genet. Mol. Biol.
2009
, vol. 
8
  
Article 47
24
Smyth
GK
Linear models and empirical Bayes methods for assessing differential expression in microarray experiments
Stat. Appl. Genet. Mol. Biol.
2004
, vol. 
3
  
Article3
25
Smyth
GK
Gentleman
R
Carey
V
Dudoit
S
Irizarry
R
Huber
W
Limma: linear models for microarray data
Bioinformatics and Computational Biology Solutions using R and Bioconductor
2005
NY
Springer
(pg. 
397
-
420
)
26
Gentleman
R
Carey
V
Bates
D
Bolstad
B
Dettling
M
Dudoit
S
Ellis
B
Gautier
L
Ge
Y
Gentry
J
, et al. 
Bioconductor: open software development for computational biology and bioinformatics
Genome Biol.
2004
, vol. 
5
 pg. 
R80
 
27
Michaud
J
Simpson
K
Escher
R
Buchet-Poyau
K
Beissbarth
T
Carmichael
C
Ritchie
M
Schutz
F
Cannon
P
Liu
M
, et al. 
Integrative analysis of RUNX1 downstream pathways and target genes
BMC Genomics
2008
, vol. 
9
 pg. 
363
 
28
Zar
J
Biostatistical Analysis
1999
4th edn
Upper Saddle River, New Jersey
Pentice-Hall International
29
Barry
WT
Nobel
AB
Wright
FA
A statistical framework for testing functional categories in microarray data
Ann. Appl. Stat.
2008
, vol. 
2
 (pg. 
286
-
315
)
30
Herschkowitz
JI
Simin
K
Weigman
VJ
Mikaelian
I
Usary
J
Hu
Z
Rasmussen
KE
Jones
LP
Assefnia
S
Chandrasekharan
S
, et al. 
Identification of conserved gene expression features between murine mammary carcinoma models and human breast tumors
Genome Bio.
2007
, vol. 
8
 pg. 
R76
 
31
Lim
E
Vaillant
F
Wu
D
Forrest
NC
Pal
B
Hart
AH
Asselin-Labat
ML
Gyorki
DE
Ward
T
Partanen
A
, et al. 
Aberrant luminal progenitors as the candidate target population for basal tumor development in BRCA1 mutation carriers
Nat. Med.
2009
, vol. 
15
 (pg. 
907
-
913
)
32
Shi
W
Oshlack
A
Smyth
G
Optimizing the noise versus bias trade-off for Illumina Whole Genome Expression BeadChips
Nucleic Acids Res.
2010
, vol. 
38
 pg. 
e204
 
33
Phipson
B
Smyth
G
Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn
Stat. Appl. Genet. Mol. Biol.
2010
, vol. 
9
  
Article 39
34
Gastwirth
JL
Rubin
H
Effect of dependence on the level of some one-sample tests
J. Am. Stat. Assoc.
1971
, vol. 
66
 (pg. 
816
-
820
)
35
Liberzon
A
Subramanian
A
Pinchback
R
Thorvaldsdóttir
H
Tamayo
P
Mesirov
J
Molecular signatures database (MSigDB) 3.0
Bioinformatics
2011
, vol. 
27
 (pg. 
1739
-
1740
)
36
Ben-Porath
I
Thomson
M
Carey
V
Ge
R
Bell
G
Regev
A
Weinberg
R
An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors
Nat. Genet.
2008
, vol. 
40
 (pg. 
499
-
507
)
37
Spike
B
Engle
D
Lin
J
Cheung
S
La
J
Wahl
G
A mammary stem cell population identified and characterized in late embryogenesis reveals similarities to human breast cancer
Cell Stem Cell
2012
, vol. 
10
 (pg. 
183
-
197
)
38
Molyneux
G
Geyer
F
Magnay
F
McCarthy
A
Kendrick
H
Natrajan
R
Mackay
A
Grigoriadis
A
Tutt
A
Ashworth
A
, et al. 
BRCA1 basal-like breast cancers originate from luminal epithelial progenitors and not from basal stem cells
Cell Stem Cell
2010
, vol. 
7
 (pg. 
403
-
417
)

Author notes

Correspondence may also be addressed to Di Wu. Tel: +1 617 495 5496; Fax: +1 617 496 8057; Email: dwu@fas.harvard.edu

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Comments

0 Comments
Submit a comment
You have entered an invalid code
Thank you for submitting a comment on this article. Your comment will be reviewed and published at the journal's discretion. Please check for further notifications by email.