Parsimonious and powerful composite likelihood testing for group difference and genotype–phenotype association
Introduction
Testing for a population difference between two groups of multivariate data is common in many fields of statistical research. Owing to recent advances in data acquisition technologies, increasingly complex data (e.g. data with temporal or spatial dependence among the sample units) can now be readily collected for statistical analysis. Analyzing such data, however, requires tractable statistical models, which are not easily available. In particular, it may be difficult or even impossible to specify the full likelihood function for testing the group difference. These challenges are common in analyzing case-control data in genotype–phenotype association studies, where, for example, we test associations between a binary breast cancer phenotype and genotype variants known as single nucleotide polymorphisms (SNPs). Testing genotype–phenotype association from case-control data can be formulated as a two-sample testing problem, but testing many genotype variants jointly entails a high-dimensional statistical model, which makes it difficult to formulate a computationally tractable full likelihood (Han and Pan, 2012).
These issues naturally suggest approximating the full likelihood function by a computationally tractable one when constructing test statistics for association testing. A well-developed approximation is based on the maximum composite likelihood estimator (MCLE), obtained by maximizing the product of low-dimensional sub-likelihood objects instead of the full likelihood. Besag (1974) proposed composite likelihood estimation for spatial data, while Lindsay (1988) developed composite likelihood estimation in full generality. Over the years, composite likelihood methods have proved useful in many applied fields, including geostatistics, spatial extremes and statistical genetics. See Varin et al. (2011) for a comprehensive survey of methods and applications.
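For orientation, the composite likelihood described above can be written in its standard weighted form. The notation here (m sub-likelihoods L_j, weights w_j, parameter θ, data y) is generic and assumed for illustration; it is not necessarily the notation used in the paper's own sections.

```latex
\[
  L_C(\theta; y) \;=\; \prod_{j=1}^{m} L_j(\theta; y)^{w_j},
  \qquad
  \hat{\theta}_C \;=\; \arg\max_{\theta} \sum_{j=1}^{m} w_j \log L_j(\theta; y),
\]
% L_j : likelihood of the j-th low-dimensional data subset (e.g. a pair of variables)
% w_j : non-negative weight; with w_j in {0, 1} the weights simply switch
%       individual sub-likelihoods on or off
```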
Like the familiar maximum likelihood estimator (MLE), the MCLE is asymptotically unbiased and asymptotically normal under regularity conditions. This property, which underlies the Wald-type statistics used for testing group differences (see Geys et al., 1999 and Molenberghs and Verbeke, 2005, among others), can also be exploited in MCLE-based testing. The standard approach is to form a statistic using all the available data subsets, so that the MCLE is computed by combining all the feasible sub-likelihood components. Although the resulting Wald test has a known limiting null distribution owing to the asymptotic normality of the MCLE, it may exhibit unsatisfactory power when the number of parameters in the model is moderate or large relative to the sample size.
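The asymptotic result and the Wald statistic referred to above take the following standard form in the composite likelihood literature (see, e.g., Varin et al., 2011). The symbols are generic and assumed for illustration: n is the sample size, p the dimension of θ, H the sensitivity matrix, J the variability matrix and G the Godambe information.

```latex
\[
  \sqrt{n}\,\bigl(\hat{\theta}_C - \theta\bigr) \xrightarrow{d}
  N\bigl(0,\; G(\theta)^{-1}\bigr),
  \qquad
  G(\theta) \;=\; H(\theta)\, J(\theta)^{-1} H(\theta),
\]
\[
  W \;=\; n\,\bigl(\hat{\theta}_C - \theta_0\bigr)^{\top}
          G(\hat{\theta}_C)\,
          \bigl(\hat{\theta}_C - \theta_0\bigr)
  \xrightarrow{d} \chi^2_{p}
  \quad \text{under } H_0 : \theta = \theta_0 .
\]
% H(theta) = E[-Hessian of the composite log-likelihood] (per observation)
% J(theta) = Var[composite score]                        (per observation)
```

Because the degrees of freedom p grow with every sub-likelihood parameter included, the critical value of the test grows as well, whether or not the added parameters carry any signal.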
In our view, forming a test statistic using all the available sub-likelihoods is not always well justified from either a statistical or a computational perspective. Specifically, when the data are noisy and the statistical model is very complex, including sub-likelihoods that do not explain group differences mainly adds noise to the Wald statistic. This unwanted noise reduces the overall power of the test. A better strategy is to choose only informative sub-likelihoods relevant to group differences, while dropping noisy or redundant components as much as possible.
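The power loss described above can be illustrated with a small Monte Carlo sketch. It is not taken from the paper: it assumes hypothetical independent sub-likelihood contributions, each a 2-degree-of-freedom chi-square term, with only the first one carrying a mean shift (`shift`, chosen arbitrarily). All function names are illustrative.

```python
import math
import random


def chi2_sf_even(x, df):
    """P(X > x) for a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # running value of (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def rejection_rate(n_blocks, shift, n_rep=2000, alpha=0.05, seed=1):
    """Monte Carlo rejection rate of a test that sums n_blocks independent
    2-df contributions; only block 0 carries a mean shift (an assumption)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_rep):
        stat = 0.0
        for b in range(n_blocks):
            mu = shift if b == 0 else 0.0
            z1 = rng.gauss(mu, 1.0)   # shifted coordinate when b == 0
            z2 = rng.gauss(0.0, 1.0)  # pure-noise coordinate
            stat += z1 * z1 + z2 * z2
        if chi2_sf_even(stat, 2 * n_blocks) < alpha:
            rejections += 1
    return rejections / n_rep


power_oracle = rejection_rate(n_blocks=1, shift=3.0)  # informative block only
power_all = rejection_rate(n_blocks=10, shift=3.0)    # plus nine noise blocks
size_all = rejection_rate(n_blocks=10, shift=0.0)     # no shift: close to alpha
print(power_oracle, power_all, size_all)
```

In this toy setting the statistic built from the single informative block rejects far more often than the one that also sums nine uninformative blocks, while both tests keep their nominal size. The "oracle" choice of block is of course unavailable in practice, which is the gap a data-driven selection procedure is meant to close.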
Prompted by the above discussion, we propose a new approach to group difference testing, referred to as forward step-up composite likelihood (FS-CL) testing. Given a set of candidate data subsets used for constructing the sub-likelihood objects, our FS-CL method carries out testing and noise reduction simultaneously by selecting a best set of sub-likelihoods so as to improve the power of the resulting test. Unlike existing approaches, we impose a sparsity requirement on the alternative hypothesis, reflecting the notion that only a portion of the data subsets fundamentally explains the difference between groups. While testing the null hypothesis of no difference between groups, our method makes efficient use of the data by dropping as many noisy or redundant data subsets as possible. The procedure is implemented by a forward search algorithm which, like well-established methods in variable selection, includes one more sub-likelihood at each step until no significant improvement in power is observed.
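The forward search idea can be sketched as a greedy loop. This is only an illustration of the general step-up pattern, not the paper's FS-CL algorithm: it assumes hypothetical independent per-sub-likelihood Wald contributions (`contribs`), each with 2 degrees of freedom, and uses a stand-in stopping rule (stop when the naive chi-square tail probability of the cumulative statistic no longer decreases). The paper's own selection criterion and null calibration are not reproduced here, and the naive tail probability is not a valid p-value after data-driven selection.

```python
import math


def chi2_sf_even(x, df):
    """P(X > x) for a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # running value of (x/2)^i / i!
        total += term
    return math.exp(-half) * total


def forward_step_up(contribs, df_each=2):
    """Greedy forward search over additive, independent contributions.

    contribs : hypothetical per-sub-likelihood Wald contributions
    df_each  : degrees of freedom per contribution (must keep total df even)
    Returns (selected indices, cumulative statistic, naive tail probability).
    """
    # With additive contributions of equal df, the best candidate at each
    # step is simply the largest remaining contribution.
    order = sorted(range(len(contribs)), key=lambda j: contribs[j], reverse=True)
    selected, stat, best_score = [], 0.0, 1.0
    for j in order:
        new_stat = stat + contribs[j]
        new_score = chi2_sf_even(new_stat, df_each * (len(selected) + 1))
        if new_score >= best_score:   # stand-in rule: no further improvement
            break
        selected.append(j)
        stat, best_score = new_stat, new_score
    return selected, stat, best_score


# Two strong components among five candidates (made-up numbers).
selected, stat, score = forward_step_up([0.5, 12.0, 1.0, 9.0, 0.2])
print(selected, stat, score)
```

With these made-up inputs the loop keeps the two large contributions and stops when the third candidate would raise the score. A usable procedure would additionally have to account for the selection step when calibrating the null distribution.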
The proposed approach can be extended to general linear hypothesis testing (cf. Chapter 7 of Lehmann and Romano, 2005) without fundamental difficulty, but this extension is not pursued in detail in this paper. The remainder of the paper is organized as follows. In Section 2, we describe the main framework for composite likelihood estimation and review the existing Wald-type association tests. In Section 3, we describe the new FS-CL methodology and propose the forward search algorithm. In Section 4, we study the finite-sample properties of our method in terms of Type I error probability and power using simulated data. In Section 4.4, we apply our test to the case-control genotype data from the Australian Breast Cancer Family Study. In Section 5, we conclude the paper with some final remarks.
Section snippets
Sparse composite likelihood estimation
Consider a random sample of observations on a -dimensional random vector following a probability density function , with unknown parameter and . Let be the profiled maximum composite likelihood estimator (MCLE) of , obtained by maximizing the composite likelihood function where is the total number of sub-likelihood objects being considered, is a vector of binary weights referred to
Optimal Wald composite test under sparse local alternatives
Recall that as defined in Section 2.2 is equivalent to , which is a vector of effective dimension giving the group difference. Given a composition rule which is an -vector of 1s and 0s, let be the corresponding sub-vector of . Following the discussion in Section 2.1, we still use to denote the effective dimension of knowing that , and we want to test against . Since some sub-likelihoods for the data of
Example 1: normally distributed MCLEs
Consider a simulated example of MCLE-based testing for the difference of group means from two samples of 40-dimensional normal data with known covariance matrix. This is equivalent to testing against , where is the MCLE of having 40 elements. We set the considered composite likelihood function to comprise up to independent pairwise sub-likelihoods, where each candidate sub-likelihood is for a 2-dimensional subset of the data and contains just two
Conclusion and discussion
Building on the well-established composite likelihood estimation framework, we have developed a method of simultaneous composition rule selection and group difference testing in multivariate parametric models for high-dimensional data. The method is particularly useful for multiple genotype–phenotype association testing in case-control studies. It constructs sparse composite likelihood by including a small number of informatively selected sub-likelihoods, while dropping redundant or noisy
References (20)
- The bootstrap
- et al., Distribution of a sum of order statistics, Scand. J. Statist. (1979)
- Spatial interaction and the statistical analysis of lattice systems, J. R. Stat. Soc. Ser. B Stat. Methodol. (1974)
- et al., Statistics for High-Dimensional Data: Methods, Theory and Applications (2011)
- et al., Theoretical Statistics (1979)
- et al., Composite likelihood Bayesian information criteria for model selection in high-dimensional data, J. Amer. Statist. Assoc. (2010)
- et al., Pseudolikelihood modeling of multivariate outcomes in developmental toxicology, J. Amer. Statist. Assoc. (1999)
- An optimum property of regular maximum likelihood estimation, Ann. Math. Statist. (1960)
- et al., A composite likelihood approach to latent multivariate Gaussian modeling of SNP data with application to genetic association testing, Biometrics (2012)
- et al., Testing Statistical Hypotheses (2005)