Parsimonious and powerful composite likelihood testing for group difference and genotype–phenotype association

https://doi.org/10.1016/j.csda.2016.12.004

Abstract

Studying the association between a phenotype and a number of genetic variants from case-control data is an important goal in many genetic studies. Association analysis is often carried out by testing the null hypothesis that two groups of multi-dimensional data are drawn from the same population. Testing based on genotype data is challenging because the full likelihood of the data is usually intractable. This difficulty may be tackled by maximum composite likelihood (MCL) tests, which do not require the full likelihood. However, currently available MCL tests suffer severe power loss because they include non-informative or redundant sub-likelihoods. To reduce this power loss, a forward search and test method is developed that simultaneously performs powerful group difference testing and composes informative sub-likelihoods. The new method constructs a sequence of Wald-type test statistics by progressively including only informative sub-likelihoods, so as to improve the test power under local sparsity alternatives. Numerical studies show that it achieves considerable improvement over the available tests as the modeling complexity grows. The new method is illustrated through an analysis of genotype data from a case-control study on breast cancer.

Introduction

Testing population difference between two groups of multivariate data is common in many fields of statistical research. Due to the significant development of data acquisition technologies in recent years, increasingly complex data (e.g. involving temporal or spatial dependence among the sample units) can now be readily collected for statistical analysis. However, analyzing such data calls for tractable statistical models, which are not easily available. In particular, it may be difficult or even impossible to specify the full likelihood function for testing the group difference. These challenges are common in analyzing case-control data in genotype–phenotype association studies, where, for example, we test associations between a binary breast cancer phenotype and various genotype variants known as single nucleotide polymorphisms (SNPs). Note that testing genotype–phenotype association from case-control data can be formulated as a two-sample testing problem. But testing association for many genotype variants jointly entails a high-dimensional statistical model, which makes it difficult to formulate a computationally tractable full likelihood (Han and Pan, 2012).

These issues naturally suggest approximating the full likelihood function by a computationally tractable one for constructing the test statistics for association testing. A well-developed approximation is based on the maximum composite likelihood estimator (MCLE), obtained by maximizing the product of low-dimensional sub-likelihood objects instead of the full likelihood. Besag (1974) proposed composite likelihood estimation for spatial data while Lindsay (1988) developed composite likelihood estimation in its generality. Over the years, composite likelihood methods have proved useful in many applied fields, including geo-statistics, spatial extremes and statistical genetics. See Varin et al. (2011) for a comprehensive survey on methods and applications.

Like the familiar maximum likelihood estimator (MLE), the MCLE is asymptotically unbiased and normally distributed under regularity conditions. This property, which is useful for constructing Wald-type statistics for testing group differences (see Geys et al., 1999 and Molenberghs and Verbeke, 2005, among others), carries over to MCLE-based testing. The standard approach here is to form a statistic using all the available data subsets (so that the MCLE is computed by combining all the feasible sub-likelihood components). Although the resulting Wald test has a known null distribution in the limit due to the asymptotic normality of the MCLE, it may exhibit unsatisfactory power when the number of parameters in the model is moderate or large relative to the sample size.

In our view, forming a test statistic using all the available sub-likelihoods is not always well-justified from either a statistical or computational perspective. Specifically, when the noise in the data is evident and the statistical model considered is very complex, inclusion of sub-likelihoods that do not explain group differences will mainly be adding noise to the Wald statistic. Clearly, this unwanted noise has the undesirable effect of deteriorating the overall test power. A better strategy would be to choose only informative sub-likelihoods relevant to group differences, while dropping noisy or redundant components as much as possible.

Prompted by the above discussion, we propose a new approach, referred to as forward step-up composite likelihood (FS-CL) testing, for group difference testing. Given a set of candidate data subsets used for constructing the sub-likelihood objects, our FS-CL method carries out simultaneous testing and data noise reduction by selecting a best set of sub-likelihoods so as to improve the resulting test power. Unlike existing approaches, we impose a sparsity requirement on our alternative hypothesis, reflecting the notion that only a certain portion of the data subsets fundamentally explains the difference between groups. While testing the null hypothesis of no difference between groups, our method makes efficient use of the data by dropping noisy or redundant data subsets to the maximum extent. This procedure is implemented by a forward search algorithm which, similar to well-established methods in variable selection, progressively includes one more sub-likelihood at each step until no significant improvement in terms of power is observed.
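The forward search idea described above can be sketched as a generic greedy selection loop. This is only an illustration under stated assumptions, not the FS-CL stopping rule of the paper: the function `forward_search`, the `score` callable, the `min_gain` threshold and the toy criterion `toy_score` are all hypothetical names introduced here, and `score` stands in for whatever power-related criterion is actually used.

```python
def forward_search(n_candidates, score, min_gain=0.0):
    """Greedy forward selection over candidate indices 0..n_candidates-1.

    score:    callable taking a tuple of selected indices and returning a
              number (larger is better); a placeholder for a power-related
              criterion, not the paper's actual statistic.
    min_gain: stop when the best one-step improvement does not exceed this.
    """
    selected = []
    current = score(tuple(selected))
    remaining = set(range(n_candidates))
    while remaining:
        # Score every one-step extension of the current selection.
        scores = {k: score(tuple(selected + [k])) for k in sorted(remaining)}
        best_k = max(scores, key=scores.get)
        gain = scores[best_k] - current
        if gain <= min_gain:
            break  # no candidate improves the criterion enough
        selected.append(best_k)
        remaining.remove(best_k)
        current = scores[best_k]
    return selected


# Hypothetical toy criterion: each candidate contributes some signal,
# and every included candidate costs one unit (a stand-in for added noise).
CONTRIB = [5.0, 0.2, 3.0, 0.1]


def toy_score(subset):
    return sum(CONTRIB[k] for k in subset) - 1.0 * len(subset)


chosen = forward_search(len(CONTRIB), toy_score)
print(chosen)
```

With this toy criterion the loop keeps the two candidates whose contribution exceeds the unit cost and stops once the best remaining candidate would lower the score, which mirrors the idea of dropping weakly informative sub-likelihoods.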

The proposed approach can be extended to general linear hypothesis testing (cf. Chapter 7 of Lehmann and Romano, 2005) without fundamental difficulty, but this extension is not pursued in detail in this paper. The remainder of the paper is organized as follows. In Section 2, we describe the main framework for composite likelihood estimation and give an overview of the existing Wald-type association tests. In Section 3, we describe the new FS-CL methodology and propose the forward search algorithm. In Section 4, we study the finite-sample properties of our method in terms of Type I error probability and power using simulated data. In Section 4.4, we apply our test to the case-control genotype data from the Australian Breast Cancer Family Study. In Section 5, we conclude the paper with some final remarks.

Section snippets

Sparse composite likelihood estimation

Consider a random sample of $n$ observations on a $d$-dimensional random vector $Y=(Y_1,\dots,Y_d)^T$ following a probability density function $f(y;\theta)$, with unknown parameter $\theta\in\Theta\subseteq\mathbb{R}^q$ and $q=\dim(\Theta)\geq 1$. Let $\hat{\theta}(w)$ be the profiled maximum composite likelihood estimator (MCLE) of $\theta$, obtained by maximizing the composite likelihood function $cl(\theta;w)=\left(\sum_{k=1}^{N_{cl}}w_k\right)^{-1}\sum_{k=1}^{N_{cl}}w_k\,\ell_k(\theta)$, where $N_{cl}$ is the total number of sub-likelihood objects being considered, $w=(w_1,\dots,w_{N_{cl}})^T\in\Omega=\{0,1\}^{N_{cl}}$ is a vector of binary weights referred to
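As a minimal sketch of the weighted composite likelihood just described, the code below averages the sub-log-likelihoods switched on by a binary weight vector and maximizes the result over a grid. Everything specific here is an assumption for illustration: the helper names `normal_loglik` and `composite_loglik`, the toy data, the choice of unit-variance normal marginals sharing one mean parameter, and the grid search used in place of a proper optimizer.

```python
import math


def normal_loglik(xs, mu, sigma=1.0):
    """Log-likelihood of i.i.d. N(mu, sigma^2) observations."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)


def composite_loglik(theta, sub_logliks, w):
    """Average of the sub-log-likelihoods selected by binary weights w.

    sub_logliks: list of callables ell_k(theta)
    w:           list of 0/1 weights (the composition rule), same length
    """
    if len(w) != len(sub_logliks):
        raise ValueError("w and sub_logliks must have the same length")
    n_selected = sum(w)
    if n_selected == 0:
        raise ValueError("at least one sub-likelihood must be selected")
    total = sum(wk * ell(theta) for wk, ell in zip(w, sub_logliks))
    return total / n_selected


# Toy data: three data subsets, each giving one marginal sub-likelihood
# for a common mean theta (hypothetical values).
data = [[0.9, 1.1, 1.0], [2.0, 2.2, 1.8], [5.0, -3.0, 4.0]]
sub_logliks = [lambda th, xs=xs: normal_loglik(xs, th) for xs in data]

# Composition rule that keeps the first two subsets and drops the third.
w = [1, 1, 0]

# Crude grid maximization; for this toy model the maximizer is the pooled
# mean of the two selected subsets.
grid = [i / 100 for i in range(0, 301)]
theta_hat = max(grid, key=lambda th: composite_loglik(th, sub_logliks, w))
print(theta_hat)
```

Changing `w` changes which data subsets contribute to the estimate, which is the sense in which the composition rule controls the estimator.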

Optimal Wald composite test under sparse local alternatives

Recall that $\delta=\theta_1-\theta_0$ as defined in Section 2.2 is equivalent to $\delta(w_{all})=\theta_1(w_{all})-\theta_0(w_{all})$, which is a vector of effective dimension $q$ giving the group difference. Given a composition rule $w$ which is an $N_{cl}$-vector of 1s and 0s, let $\delta(w)=\theta_1(w)-\theta_0(w)$ be the corresponding sub-vector of $\delta(w_{all})$. Following the discussion in Section 2.1, we still use $d_w$ to denote the effective dimension of $\delta(w)$ knowing that $d_w\leq q$, and we want to test $H_0:\delta=0$ against $H_1:\delta\neq 0$. Since some sub-likelihoods for the data of
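To make the role of the sub-vector concrete, here is a sketch of a standard Wald statistic computed only on the components picked out by a composition rule. It assumes the textbook quadratic form, n times the estimated difference times the inverse covariance times the estimated difference, and it assumes that the relevant covariance is the matching sub-matrix of a known matrix `v`; the statistic used in the paper may differ. The helper names `solve` and `wald_statistic` are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def wald_statistic(delta_hat, v, n, idx):
    """Wald statistic n * d' V_sub^{-1} d on the components listed in idx.

    delta_hat: estimated group difference (list of floats)
    v:         covariance matrix of sqrt(n) * delta_hat (list of lists)
    n:         sample size
    idx:       indices of the components kept by the composition rule
    """
    d = [delta_hat[i] for i in idx]
    v_sub = [[v[i][j] for j in idx] for i in idx]
    z = solve(v_sub, d)  # V_sub^{-1} d without forming the inverse
    return n * sum(di * zi for di, zi in zip(d, z))


# Hypothetical numbers: three components, identity covariance, n = 100,
# and a composition rule that keeps only the first two components.
identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_stat = wald_statistic([0.2, -0.1, 0.0], identity3, 100, [0, 1])
print(w_stat)
```

Under the null hypothesis and a fixed (not data-driven) choice of `idx`, such a statistic would be compared with a chi-square distribution whose degrees of freedom equal the number of kept components; a data-driven selection needs an adjusted reference distribution.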

Example 1: normally distributed MCLEs

Consider a simulated example of MCLE-based testing for the difference of group means from two samples of 40-dimensional normal data with known covariance matrix. This is equivalent to testing $H_0:\sqrt{n}\hat{\delta}\sim N_{40}(0,V)$ against $H_1:\sqrt{n}\hat{\delta}\sim N_{40}(\delta,V)$, where $\hat{\delta}$ is the MCLE of $\delta$ having 40 elements. We set the considered composite likelihood function to comprise up to $N_{cl}=20$ independent pairwise sub-likelihoods, where each candidate sub-likelihood is for a 2-dimensional subset of the data and contains just two
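A setting of this kind can be mimicked with a short simulation. The sketch below is not the paper's experiment: it assumes an identity covariance matrix, a hypothetical sparse signal in three of the 20 pairs, and an arbitrary seed, and it only computes per-pair Wald statistics and ranks the pairs. The names `simulate_scaled_mcle`, `pairwise_wald` and `chi2_2df_sf` are introduced here for illustration.

```python
import math
import random

N_CL = 20      # number of candidate pairwise sub-likelihoods
PAIR_DIM = 2   # each sub-likelihood covers two coordinates


def simulate_scaled_mcle(delta, rng):
    """Draw the scaled estimate from N(delta, I); identity V is an assumption."""
    return [d + rng.gauss(0.0, 1.0) for d in delta]


def pairwise_wald(z):
    """Per-pair Wald statistics: sum of squares within each 2-dim block."""
    return [sum(v * v for v in z[PAIR_DIM * k: PAIR_DIM * (k + 1)])
            for k in range(N_CL)]


def chi2_2df_sf(x):
    """Survival function of a chi-square with 2 df: P(X > x) = exp(-x/2)."""
    return math.exp(-x / 2.0)


rng = random.Random(2016)  # arbitrary seed for reproducibility

# Hypothetical sparse alternative: only the first three pairs differ.
delta = [0.0] * (N_CL * PAIR_DIM)
for k in (0, 1, 2):
    delta[2 * k] = delta[2 * k + 1] = 3.0

z = simulate_scaled_mcle(delta, rng)
stats = pairwise_wald(z)

# Rank the candidate sub-likelihoods from strongest to weakest evidence.
ranking = sorted(range(N_CL), key=lambda k: stats[k], reverse=True)

# Marginal p-values for each pair taken on its own (no selection adjustment).
p_values = [chi2_2df_sf(s) for s in stats]
print(ranking[:5])
```

The marginal p-values ignore the fact that the pairs are ranked using the same data; a test built on the top-ranked pairs needs a reference distribution that accounts for the ordering.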

Conclusion and discussion

Building on the well-established composite likelihood estimation framework, we have developed a method of simultaneous composition rule selection and group difference testing in multivariate parametric models for high-dimensional data. The method is particularly useful for multiple genotype–phenotype association testing in case-control studies. It constructs sparse composite likelihood by including a small number of informatively selected sub-likelihoods, while dropping redundant or noisy

References (20)

  • J.L. Horowitz. The bootstrap.
  • K. Alam et al. Distribution of a sum of order statistics. Scand. J. Statist. (1979)
  • J. Besag. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B Stat. Methodol. (1974)
  • P. Bühlmann et al. Statistics for High-Dimensional Data: Methods, Theory and Applications. (2011)
  • D.R. Cox et al. Theoretical Statistics. (1979)
  • X. Gao et al. Composite likelihood Bayesian information criteria for model selection in high-dimensional data. J. Amer. Statist. Assoc. (2010)
  • H. Geys et al. Pseudolikelihood modeling of multivariate outcomes in developmental toxicology. J. Amer. Statist. Assoc. (1999)
  • V.P. Godambe. An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. (1960)
  • F. Han et al. A composite likelihood approach to latent multivariate Gaussian modeling of SNP data with application to genetic association testing. Biometrics (2012)
  • E. Lehmann et al. Testing Statistical Hypotheses. (2005)
There are more references available in the full text version of this article.
