A note on nonparametric estimation for clustered data
Introduction
Lin and Carroll (2000), and the references therein, discuss estimators arising from nonparametric estimation for clustered data. They show that the most efficient estimators ignore the within-cluster correlation. This result is counterintuitive to many practitioners, especially those accustomed to applying results from the parametric setting. For example, Liang and Zeger (1986) show in the parametric setting that the most efficient estimator arises when the true covariance matrix is included in the estimating equations. Thus, if cubic splines with many fixed knots are used to model the mean function flexibly, the most efficient estimator incorporates the dependence; but if similarly flexible nonparametric estimators are used, the estimator that incorporates the dependence is not the most efficient one.
To illustrate the drawbacks of ignoring the dependence, a simple example involving two sampling units, each containing four observations, is displayed in Fig. 1. Here two observations in each sampling unit have large values of the explanatory variable (100 and 90), whilst the other two have smaller values, 65 and 60 in one unit and 45 and 40 in the other. The mean curve was μ(x) = 20 + 0.7x and, for observation j in sampling unit i, the model was y_{ij} = μ(x_{ij}) + ε_i + δ_{ij}. The data displayed in Fig. 1 arose from simulating data with σ_1^2 = Var(ε) = 100 and σ_2^2 = Var(δ) = 1. Thus the correlation between observations in the same sampling unit was very high, being 100/101 ≈ 0.99. In the particular outcome examined, the observations in one sampling unit were above the curve and those in the other were below.
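The data-generating model above can be sketched in a few lines; the seed, the array layout and the NumPy-based style are my own choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Design points from the example: two sampling units of four observations each
x = np.array([[100.0, 90.0, 65.0, 60.0],
              [100.0, 90.0, 45.0, 40.0]])

def mu(x):
    """True mean curve mu(x) = 20 + 0.7 x."""
    return 20.0 + 0.7 * x

sigma1 = 10.0   # sd of the unit-level effect eps_i   (sigma_1^2 = 100)
sigma2 = 1.0    # sd of the observation-level error delta_ij (sigma_2^2 = 1)

# y_ij = mu(x_ij) + eps_i + delta_ij, with eps_i shared within each unit
eps = rng.normal(0.0, sigma1, size=(2, 1))
delta = rng.normal(0.0, sigma2, size=x.shape)
y = mu(x) + eps + delta

# Within-unit correlation sigma_1^2 / (sigma_1^2 + sigma_2^2) = 100/101
rho = sigma1**2 / (sigma1**2 + sigma2**2)
```

The shared ε_i term is what makes the two observations in a unit move above or below the curve together, as in the particular outcome described.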
Suppose that μ is an unknown smooth function of a covariate x, and it is wished to estimate μ at a point x0 by fitting a local polynomial of degree p. For scalar u, let G_p(u) = (1, u, u^2, …, u^p)^T, so that for x near x0, μ(x) ≈ G_p(x − x0)^T b, where b = (b_0, b_1, …, b_p)^T. Then μ(x0) = b_0, and if b is estimated by b̂, μ(x0) can be estimated by b̂_0. For a sampling unit with covariate vector x_i = (x_{i1}, …, x_{in})^T and response vector Y_i, write X_i(x0) = (G_p(x_{i1} − x0), …, G_p(x_{in} − x0))^T. Then the mean vector of the sampling unit is approximately X_i(x0) b when the elements of x_i are not too far from x0. Let W_i(x0) be the diagonal matrix of kernel weights K_h(x_{ij} − x0) associated with x_i and x0. Suppose N sampling units are observed. Following Lin and Carroll (2000), an extension of the generalised estimating equations (GEEs) is

∑_{i=1}^{N} X_i(x0)^T W_i {Y_i − X_i(x0) b} = 0,

where W_i = W_i^{1/2}(x0) R_i^{−1} W_i^{1/2}(x0) for a working covariance matrix R_i. The results of Lin and Carroll (2000) show that, for this class of estimating equations, the optimal R_i is the identity matrix, which ignores information on the correlation between observations within a sampling unit.
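A minimal sketch of these kernel-weighted GEEs follows, assuming an Epanechnikov kernel (the paper does not specify a kernel) and a user-supplied working correlation R_i; the function name `local_poly_gee` is illustrative, not from the paper:

```python
import numpy as np

def local_poly_gee(x0, xs, ys, h, p=3, R=None):
    """Kernel-weighted GEE estimate of mu(x0) via a local polynomial of degree p.

    xs, ys: lists of per-unit covariate / response vectors.
    R: working correlation matrix per unit (None -> identity, the choice
       shown by Lin and Carroll (2000) to be optimal in this class).
    Solves sum_i X_i^T W_i (y_i - X_i b) = 0 for b; returns b_0 = muhat(x0).
    """
    A = np.zeros((p + 1, p + 1))
    c = np.zeros(p + 1)
    for xi, yi in zip(xs, ys):
        # Rows of Xi are G_p(x_ij - x0) = (1, u, u^2, ..., u^p)
        Xi = np.vander(xi - x0, N=p + 1, increasing=True)
        u = (xi - x0) / h
        # Epanechnikov kernel weights K_h(x_ij - x0)
        k = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)
        Khalf = np.diag(np.sqrt(k / h))                  # W_i^{1/2}(x0)
        Ri = np.eye(len(xi)) if R is None else R
        Wi = Khalf @ np.linalg.inv(Ri) @ Khalf           # W_i
        A += Xi.T @ Wi @ Xi
        c += Xi.T @ Wi @ yi
    b = np.linalg.solve(A, c)
    return b[0]
```

With p = 3 this is the local cubic fit used for Fig. 2; evaluating it over a grid of x0 values traces out the estimated mean function.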
The estimated mean function, using a local cubic model, for the data of Fig. 1 with a bandwidth of h = 40 was computed at the points x = 30, 32, …, 98, 100 and is plotted in Fig. 2. The estimated mean function is quite different from the true function and essentially tracks the observations. However, the dip around 65 could be predicted from the high-x observations in that sampling unit, which lie below the mean function, combined with the high correlation between observations in the same sampling unit. Thus, one could reasonably conclude that the large dip in the estimated mean function near 65 is an artifact of the method rather than a real dip.
The GEE estimating equations of Liang and Zeger (1986) are derived from the maximum likelihood estimating equations for the multivariate normal model. Now, the multivariate normal density may be written as the product of conditional densities f(y_n, …, y_1) = f(y_n | y_{n−1}, …, y_1) f(y_{n−1} | y_{n−2}, …, y_1) ⋯ f(y_1), and each conditional density is normal with a mean that is linear in the conditioning variables, so the log-likelihood estimating equations are combinations of linear regression equations.
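The factorisation can be checked numerically in the bivariate case, where f(y_2 | y_1) is normal with mean ρ y_1 and variance 1 − ρ^2; the specific numbers below are an arbitrary illustration:

```python
from math import sqrt, pi, exp

rho = 0.6            # correlation; arbitrary illustrative value
y1, y2 = 0.8, -0.3   # arbitrary evaluation point

def norm_pdf(z, m=0.0, s=1.0):
    """Univariate normal density with mean m and sd s."""
    return exp(-0.5 * ((z - m) / s) ** 2) / (s * sqrt(2.0 * pi))

# Joint standard bivariate normal density with correlation rho
det = 1.0 - rho**2
joint = exp(-(y1**2 - 2.0 * rho * y1 * y2 + y2**2) / (2.0 * det)) \
        / (2.0 * pi * sqrt(det))

# Factorised form f(y2 | y1) f(y1): the conditional is normal with
# mean rho*y1 and variance 1 - rho^2
prod = norm_pdf(y2, m=rho * y1, s=sqrt(1.0 - rho**2)) * norm_pdf(y1)
```

The two quantities agree to machine precision, and the conditional mean ρ y_1 is exactly a linear regression of y_2 on y_1.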
Let Y = (Y_1, …, Y_n)^T denote the vector of responses in a sampling unit, with mean μ = (μ_1, …, μ_n)^T, and suppose Var(Y) = Σ. To define the estimating equations with a conditional offset that does not depend on the order of the observations within a sampling unit, we proceed as follows:
- Write Y_{-j} = (Y_1, …, Y_{j−1}, Y_{j+1}, …, Y_n)^T and μ_{-j} = (μ_1, …, μ_{j−1}, μ_{j+1}, …, μ_n)^T.
- Decompose Σ into σ_j^2 = Var(Y_j), σ_{j,-j} = Cov(Y_j, Y_{-j}) and Σ_{-j} = Var(Y_{-j}).
- Note E(Y_j | Y_{-j}) = μ_j + σ_{j,-j} Σ_{-j}^{−1} (Y_{-j} − μ_{-j}) and define corrected values Y_j^* = Y_j − σ_{j,-j} Σ_{-j}^{−1} (Y_{-j} − μ_{-j}), so that E(Y_j^*) = μ_j.
- Let Y^* = (Y_1^*, …, Y_n^*)^T.
- When Σ is known, consider the alternate estimating equations in which Y_i is replaced by Y_i^*:

∑_{i=1}^{N} X_i(x0)^T W_i(x0) {Y_i^* − X_i(x0) b} = 0.
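The corrected values can be sketched directly from their definition; the function name `corrected_values` and the NumPy layout are my own illustration, assuming Σ is known for the unit:

```python
import numpy as np

def corrected_values(y, mu, Sigma):
    """Corrected values y*_j = y_j - sigma_{j,-j} Sigma_{-j}^{-1} (y_{-j} - mu_{-j}),
    so that E(y*_j) = mu_j.  y, mu: length-n arrays for one sampling unit;
    Sigma: its n x n covariance matrix."""
    n = len(y)
    ystar = np.empty(n)
    for j in range(n):
        idx = [k for k in range(n) if k != j]
        s = Sigma[j, idx]                  # sigma_{j,-j}
        S = Sigma[np.ix_(idx, idx)]        # Sigma_{-j}
        offset = s @ np.linalg.solve(S, y[idx] - mu[idx])
        ystar[j] = y[j] - offset
    return ystar
```

Because every observation is corrected by conditioning on all the others, the result does not depend on how the observations within a unit are ordered; when Σ is diagonal the offset vanishes and y* = y.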
The associated estimates may be computed iteratively by first computing the GEE estimate of μ, then the implied estimate of Σ, and then Y^*. Thereafter, Y^* was used to compute a new estimate of μ, and hence of Σ and Y^*, with this process continuing until convergence was achieved. The estimated mean functions using both methods for the data of Fig. 1 with a bandwidth of h = 40 were computed and are plotted in Fig. 3. It is seen in this figure that the bias has been noticeably reduced when the alternate procedure was used: the estimated mean function is close to a straight line. The bias as a whole is due to the observations in one sampling unit being noticeably lower than the actual mean function. To compare the two procedures, an approximation to the integrated square error was computed as 37.3 for the GEE method and 13.4 for the alternate method.
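The integrated square error can be approximated by a Riemann sum over the evaluation grid x = 30, 32, …, 100 used earlier; the function name is illustrative and the paper does not state which quadrature it used:

```python
import numpy as np

def integrated_square_error(muhat, mu_true, grid):
    """Riemann-sum approximation to ISE = integral of (muhat - mu)^2 dx
    on an equally spaced grid."""
    dx = grid[1] - grid[0]
    return dx * np.sum((muhat - mu_true) ** 2)

grid = np.arange(30.0, 102.0, 2.0)      # x = 30, 32, ..., 100
mu_true = 20.0 + 0.7 * grid
# Sanity check: a fit that is uniformly off by 1 over 36 grid points
ise = integrated_square_error(mu_true + 1.0, mu_true, grid)
```

Applying the same summary to both fitted curves gives a single number per method, which is how the 37.3 versus 13.4 comparison above should be read.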
In Table 1, the results of a small simulation study are reported, with data generated according to the above model, the design points fixed as in Fig. 1, σ_1^2 = 10 and several different values of σ_2^2. The mean integrated square error based on 100 simulations was used to compare the estimators. It is apparent that for highly dependent data the alternate procedure performs better, but as the dependence reduces, the performance of the GEE procedure improves, so that neither approach is uniformly the best.
Conclusion
It has been demonstrated that there are situations where previously considered extensions of the GEEs are not the most efficient, and that in these situations efficiency can be improved by taking the dependence into account. The improvement in efficiency appears to be related to the strength of the dependence. However, highly dependent data occur in repeated measures, twin, family and growth studies, for example, and efficiency in these settings is important.
Acknowledgements
This work was inspired by discussions with Professor Raymond J. Carroll.
References
- Liang, K.-Y., Zeger, S.L., 1986. Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
- Lin, X., Carroll, R.J., 2000. Nonparametric function estimation for clustered data when the predictor is measured without/with error. Journal of the American Statistical Association 95, 520–534.