A note on nonparametric estimation for clustered data

https://doi.org/10.1016/j.spl.2004.06.010

Abstract

The generalized estimating equation approach to nonparametric estimation using clustered data has a counterintuitive feature: the most asymptotically efficient estimator ignores the dependence within the clusters (J. Amer. Statist. Assoc. 95 (2000) 520). Using an example with highly dependent data, we present an alternate procedure that does account for dependence. The superior results of this procedure suggest a flaw in previous extensions of the generalized estimating equations to nonparametric inference. We show that efficient nonparametric estimation requires a different class of estimating equations, wherein dependence is incorporated via an offset term.

Introduction

Lin and Carroll (2000), and the references therein, discuss estimators arising from nonparametric estimation for clustered data. They show that the most efficient estimators ignore the within-cluster correlation. This is a counterintuitive result for many practitioners, especially those accustomed to applying results from the parametric setting. For example, Liang and Zeger (1986) show, in the parametric setting, that the most efficient estimator arises when the true covariance matrix is included in the estimating equations. Thus, for example, if cubic splines with many fixed knots are used to model the mean function flexibly, the most efficient estimator incorporates the dependence; but if similarly flexible nonparametric estimators are used, the most efficient estimator does not incorporate it.

To illustrate the drawbacks of ignoring the dependence, a simple example involving 2 sampling units, each containing 4 observations, is displayed in Fig. 1. Here two observations in each sampling unit have large values of the explanatory variable (100 and 90), whilst the other two have smaller values: 65 and 60 in one unit and 45 and 40 in the other. The mean curve was μ(x) = 20 + 0.7x and, for observation j in sampling unit i, the model was y_ij = μ(x_ij) + ε_i + δ_ij. The data displayed in Fig. 1 were simulated with σ_1^2 = Var(ε) = 100 and σ_2^2 = Var(δ) = 1. Thus the correlation between observations in the same sampling unit was very high, being 100/101 ≈ 0.99. In the particular outcome examined, the observations in one sampling unit were above the curve and those in the other were below.
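To make the set-up concrete, the following is a minimal sketch (not the authors' code) of how data of this kind can be simulated. The use of NumPy and the random seed are assumptions; the design points and variance components follow the description above.

```python
# Minimal sketch of the simulation set-up described above: two sampling units
# of four observations each, true mean curve mu(x) = 20 + 0.7 x, a shared
# unit-level error with variance 100 and observation-level noise with variance 1.
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility

# Design points: each unit has observations at x = 100 and 90, plus its own
# pair of smaller values (65, 60 in one unit; 45, 40 in the other).
x_units = [np.array([100.0, 90.0, 65.0, 60.0]),
           np.array([100.0, 90.0, 45.0, 40.0])]

def mu(x):
    return 20.0 + 0.7 * x          # true mean curve

sigma1_sq, sigma2_sq = 100.0, 1.0  # Var(eps), Var(delta)
rho = sigma1_sq / (sigma1_sq + sigma2_sq)   # within-unit correlation = 100/101

y_units = []
for x in x_units:
    eps = rng.normal(0.0, np.sqrt(sigma1_sq))                 # shared unit effect
    delta = rng.normal(0.0, np.sqrt(sigma2_sq), size=len(x))  # observation noise
    y_units.append(mu(x) + eps + delta)
```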

Suppose that μ is an unknown function of a covariate x, and it is wished to estimate μ(x) by local polynomial methods. That is, for x near x_0, μ(x) ≈ b_0 + b_1(x − x_0) + ⋯ + b_p(x − x_0)^p. For scalar u, let G_p(u) = (1, u, u^2, …, u^p), so that for x near x_0, μ(x) ≈ ∑_{k=0}^{p} b_k (x − x_0)^k = G_p(x − x_0)b, where b = (b_0, b_1, …, b_p)^T. Then μ(x_0) = b_0, and if b is estimated by b̂, μ(x_0) can be estimated by μ̂(x_0) = b̂_0. For u = (u_1, …, u_n), write
\[
G_p(u) = \begin{pmatrix} 1 & u_1 & u_1^2 & \cdots & u_1^p \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & u_n & u_n^2 & \cdots & u_n^p \end{pmatrix}.
\]
Then the mean vector of a sampling unit will be approximately G_p(x_i − x_0)b, provided the elements of x_i are not too far from x_0. For a vector x_i, let W_i(x_0) = h^{-1} W((x_i − x_0)/h) be a diagonal weight matrix associated with x_i and x_0. Suppose N sampling units (x_1, Y_1), …, (x_N, Y_N) are observed. Following Lin and Carroll (2000), an extension of the generalised estimating equations (GEEs) is
\[
\sum_{i=1}^{N} G_p^T(x_i - x_0)\, W_i\, \{ Y_i - G_p(x_i - x_0) b \} = 0, \qquad (1)
\]
where W_i = W_i^{1/2}(x_0) R_i^{-1} W_i^{1/2}(x_0) for a working covariance matrix R_i. The results of Lin and Carroll (2000) show that, for this class of estimating equations, the optimal R_i is the identity matrix, which ignores information on the correlation between individuals.
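With R_i the identity, solving Eq. (1) amounts to kernel-weighted least squares pooled over all observations. The sketch below (continuing the simulation sketch above) illustrates this working-independence fit; the Gaussian kernel is an assumption, as the text does not state which kernel was used.

```python
# Working-independence local polynomial fit: with R_i = I, the estimating
# equations (1) reduce to a kernel-weighted least squares problem pooling all
# observations.  The Gaussian kernel below is an assumption.
import numpy as np

def local_poly_fit(x_all, y_all, x0, h=40.0, p=3):
    """Estimate mu(x0) by a local polynomial of degree p with bandwidth h."""
    u = x_all - x0
    G = np.vander(u, N=p + 1, increasing=True)   # rows (1, u, u^2, ..., u^p)
    w = np.exp(-0.5 * (u / h) ** 2) / h          # kernel weights, W_i(x0) stacked
    WG = G * w[:, None]
    b = np.linalg.solve(G.T @ WG, WG.T @ y_all)  # weighted least squares solve
    return b[0]                                  # mu_hat(x0) = b_hat_0

# Evaluate on the grid x = 30, 32, ..., 100 used in the text
# (x_units and y_units come from the simulation sketch above).
x_all = np.concatenate(x_units)
y_all = np.concatenate(y_units)
grid = np.arange(30, 101, 2)
mu_hat_gee = np.array([local_poly_fit(x_all, y_all, x0) for x0 in grid])
```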

The estimated mean function for the data of Fig. 1, using a local cubic model with a bandwidth of h = 40, was computed at the points x = 30, 32, …, 98, 100 and is plotted in Fig. 2. The estimated mean function is quite different from the true function and essentially tracks the observations. However, the dip around x = 65 could be anticipated from the observations at the high x values in that sampling unit, which lie below the mean function, together with the high correlation between observations in the same sampling unit. Thus, one could reasonably conclude that the large dip in the estimated mean function near x = 65 is an artifact of the method rather than a real dip.

The GEE estimating equations of Liang and Zeger (1986) are derived from the maximum likelihood estimating equations for the multivariate normal model. Now, the multivariate normal density may be written as the product of conditional densities f(y_n, …, y_1) = f(y_n | y_{n−1}, …, y_1) f(y_{n−1} | y_{n−2}, …, y_1) ⋯ f(y_1), so that the log-likelihood estimating equations are combinations of linear regression equations.

Let Y = (Y_1, …, Y_n)^T and suppose Y ~ MVN(μ, Ω). To define estimating equations with a conditional offset that does not depend on the order of the observations within a sampling unit, we proceed as follows:

  • Write Y_{−j} = (Y_1, …, Y_{j−1}, Y_{j+1}, …, Y_n)^T and μ_{−j} = (μ_1, …, μ_{j−1}, μ_{j+1}, …, μ_n)^T.

  • Decompose Ω into Var(Y_j) = Ω_{jj}, Cov(Y_j, Y_{−j}) = Ω_{j,−j}, Cov(Y_{−j}, Y_j) = Ω_{−j,j} and Cov(Y_{−j}, Y_{−j}) = Ω_{−j,−j}.

  • Note that E(Y_j | Y_{−j}) = μ_j + Ω_{j,−j} Ω_{−j,−j}^{−1}(Y_{−j} − μ_{−j}) and define the corrected values Ỹ_j = Y_j − Ω_{j,−j} Ω_{−j,−j}^{−1}(Y_{−j} − μ_{−j}), so that E(Ỹ_j) = E{E(Ỹ_j | Y_{−j})} = μ_j.

  • Let Ỹ = (Ỹ_1, …, Ỹ_n)^T.

  • When Ω is known, consider the alternate estimating equations
    \[
    \sum_{i=1}^{N} G_p^T(x_i - x_0)\, W_i\, \{ \tilde{Y}_i - G_p(x_i - x_0) b \} = 0.
    \]

These estimating equations are akin to regression equations with an offset of Ω_{j,−j} Ω_{−j,−j}^{−1}(Y_{−j} − μ_{−j}) for observation j. Here W_i is as in Eq. (1) and R_i was again taken to be an identity matrix.
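The offset correction itself is straightforward to compute. The sketch below continues the earlier sketches; treating Ω as the known within-unit covariance implied by the variance components of the simulation model is an assumption suited to the example, not a general recipe.

```python
# Corrected values for one sampling unit:
#   Y~_j = Y_j - Omega_{j,-j} Omega_{-j,-j}^{-1} (Y_{-j} - mu_{-j}),
# with Omega taken as the known within-unit covariance implied by the
# variance components of the simulation model (an assumption for this sketch).
import numpy as np

def unit_covariance(n, sigma1_sq=100.0, sigma2_sq=1.0):
    """Exchangeable covariance: sigma1^2 (shared effect) + sigma2^2 (noise)."""
    return sigma1_sq * np.ones((n, n)) + sigma2_sq * np.eye(n)

def corrected_values(y, mu_fit, omega):
    """Return Y~ for a unit given its observed y, fitted means and covariance."""
    n = len(y)
    y_tilde = np.empty(n)
    for j in range(n):
        rest = [k for k in range(n) if k != j]
        offset = omega[j, rest] @ np.linalg.solve(omega[np.ix_(rest, rest)],
                                                  y[rest] - mu_fit[rest])
        y_tilde[j] = y[j] - offset
    return y_tilde
```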

The associated estimates may be computed iteratively: first compute the GEE estimate μ̂_1(x), then Ỹ, and then μ̂_2^(1)(x). Thereafter, μ̂_2^(i−1)(x) was used to compute a new Ỹ and hence μ̂_2^(i)(x), with this process continuing until convergence was achieved. The estimated mean functions from both methods for the data of Fig. 1, with a bandwidth of h = 40, were computed and are plotted in Fig. 3. It is seen in this figure that the bias is noticeably reduced when the alternate procedure is used; the estimated mean function is close to a straight line. The bias as a whole is due to the observations in one sampling unit being noticeably lower than the actual mean function. To compare the two procedures, an approximation to the integrated square error, ISE = 36^{−1} ∑_{t=0}^{35} {μ̂(30 + 2t) − μ(30 + 2t)}^2, was computed as 37.3 for the GEE method and 13.4 for the alternate method.
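The iteration just described can be sketched as below, reusing the helper functions from the earlier sketches; the convergence tolerance and iteration cap are assumptions not given in the text.

```python
# Iterative computation of the alternate estimate: start from the
# working-independence GEE fit, form the corrected values Y~, refit the local
# polynomial to Y~, and repeat until the fitted curve stops changing.
import numpy as np

def alternate_fit(x_units, y_units, grid, h=40.0, p=3,
                  sigma1_sq=100.0, sigma2_sq=1.0, tol=1e-4, max_iter=50):
    x_all = np.concatenate(x_units)
    y_work = np.concatenate(y_units)            # current working response
    omegas = [unit_covariance(len(y), sigma1_sq, sigma2_sq) for y in y_units]

    def fit_at(points, y_vec):
        return np.array([local_poly_fit(x_all, y_vec, x0, h, p) for x0 in points])

    mu_grid = fit_at(grid, y_work)              # initial GEE estimate, mu_hat_1
    for _ in range(max_iter):
        # Evaluate the current fit at each unit's design points, correct Y,
        # then refit the local polynomial to the corrected values.
        y_tilde = np.concatenate([
            corrected_values(y, fit_at(x, y_work), omega)
            for x, y, omega in zip(x_units, y_units, omegas)])
        new_mu_grid = fit_at(grid, y_tilde)
        if np.max(np.abs(new_mu_grid - mu_grid)) < tol:
            mu_grid = new_mu_grid
            break
        y_work, mu_grid = y_tilde, new_mu_grid
    return mu_grid

# Approximate integrated squared error over the grid x = 30, 32, ..., 100,
# comparing the two fits (mu_hat_gee comes from the earlier sketch).
grid = np.arange(30, 101, 2)
mu_true = 20.0 + 0.7 * grid
ise_alt = np.mean((alternate_fit(x_units, y_units, grid) - mu_true) ** 2)
ise_gee = np.mean((mu_hat_gee - mu_true) ** 2)
```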

In Table 1, the results of a small simulation study are reported, with data generated according to the above model, the design points fixed as in Fig. 1, σ_1^2 = 10, and several different values of σ_2^2. The mean integrated square error based on 100 simulations was used to compare the estimators. It is apparent that for highly dependent data the alternate procedure performs better, but as the dependence reduces the performance of the GEE procedure improves, so that neither approach is uniformly the best.
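A comparison of this kind can be sketched along the following lines, again reusing the earlier functions; only the number of replications (100), σ_1^2 = 10 and the fixed design are from the text, and the seed and choice of σ_2^2 values are assumptions.

```python
# Sketch of the simulation comparison: for fixed design points and sigma_1^2 = 10,
# generate 100 data sets for a chosen sigma_2^2, fit both estimators, and average
# the approximate ISE.  Reuses local_poly_fit and alternate_fit from above.
import numpy as np

def mean_ise(sigma2_sq, sigma1_sq=10.0, n_sim=100, seed=1):
    rng = np.random.default_rng(seed)
    grid = np.arange(30, 101, 2)
    mu_true = 20.0 + 0.7 * grid
    x_all = np.concatenate(x_units)
    ise_gee, ise_alt = [], []
    for _ in range(n_sim):
        ys = [20.0 + 0.7 * x
              + rng.normal(0.0, np.sqrt(sigma1_sq))                # unit effect
              + rng.normal(0.0, np.sqrt(sigma2_sq), size=len(x))   # noise
              for x in x_units]
        y_all = np.concatenate(ys)
        mu_gee = np.array([local_poly_fit(x_all, y_all, x0) for x0 in grid])
        mu_alt = alternate_fit(x_units, ys, grid,
                               sigma1_sq=sigma1_sq, sigma2_sq=sigma2_sq)
        ise_gee.append(np.mean((mu_gee - mu_true) ** 2))
        ise_alt.append(np.mean((mu_alt - mu_true) ** 2))
    return np.mean(ise_gee), np.mean(ise_alt)
```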


Conclusion

It has been demonstrated that there are situations where previously considered extensions of the GEEs are not the most efficient, and that in these situations efficiency can be improved by taking dependence into account. The size of the improvement appears to be related to the strength of the dependence. However, highly dependent data can occur in repeated measures, twin, family and growth studies, for example, and efficiency in these settings is important.

Acknowledgements

This work was inspired by discussions with Professor Raymond J. Carroll.

References

  • K.Y. Liang, S.L. Zeger, Longitudinal data analysis using generalized linear models, Biometrika (1986).