Using link-preserving imputation for logistic partially linear models with missing covariates

doi:10.1016/j.csda.2016.03.004

Computational Statistics & Data Analysis

Volume 101, September 2016, Pages 174-185

https://doi.org/10.1016/j.csda.2016.03.004

Abstract

To handle missing data, one needs to specify auxiliary models such as the probability of observation or the imputation model. The doubly robust (DR) method uses both auxiliary models and produces consistent estimates when either of the models is correctly specified. While the DR method in estimating equation approaches can be easy to implement in the case of missing outcomes, it is computationally cumbersome in the case of missing covariates, especially in the context of semiparametric regression models. In this paper, we propose a new kernel-assisted estimating equation method for logistic partially linear models with missing covariates. We replace the conditional expectation in the DR estimating function with an unbiased estimating function constructed using the conditional mean of the outcome given the observed data, and impute the missing covariates using the so-called link-preserving imputation models to simplify the estimation. The proposed method is valid when the response model is correctly specified and is more efficient than the kernel-assisted inverse probability weighting estimator of Liang (2008). The proposed estimator is consistent and asymptotically normal. We evaluate the finite sample performance in terms of efficiency and robustness, and illustrate the application of the proposed method to health insurance data from the 2011–2012 National Health and Nutrition Examination Survey, in which data were collected in two phases and some covariates were partially missing in the second phase.

Introduction

Generalized partially linear models (GPLMs) have recently drawn considerable attention (Severini and Staniswalis, 1994; Carroll et al., 1997; Liang et al., 2004). The GPLMs include a nonparametric covariate effect in an otherwise generalized linear model. The logistic partially linear models (LPLMs), as a special case of the GPLMs for binary data, relax the structure of the mean in a logistic regression to be partially linear. Specifically, let $Y$ be the binary outcome, $X$ be parametrically modeled covariates and $Z$ be a nonparametrically modeled covariate. The conditional mean of $Y$ is assumed to be a twice differentiable function of the linear predictor $X^{T} β + ν(Z)$, where $β$ are unknown parameters and $ν(\cdot)$ is a smooth unknown function of $Z$. In this paper, we investigate the estimation of the LPLM when $Y$ and $Z$ are fully observed but some of $X$ are partially missing.
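As a minimal sketch of the mean structure just described, the following Python snippet evaluates $E(Y | X, Z)$ as the inverse logit of $X^{T} β + ν(Z)$. The function names, the coefficient values and the choice of $ν$ are hypothetical and serve only as an illustration.

```python
import numpy as np

def expit(t):
    """Inverse logit function."""
    return 1.0 / (1.0 + np.exp(-t))

def lplm_mean(X, Z, beta, nu):
    """E(Y | X, Z) under a logistic partially linear model.

    X    : (n, d) array of parametrically modeled covariates
    Z    : (n,) array of the nonparametrically modeled covariate
    beta : (d,) coefficient vector
    nu   : callable, the smooth function nu(.)
    """
    eta = X @ beta + nu(Z)  # partially linear predictor
    return expit(eta)

# Hypothetical inputs, chosen only for illustration.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Z = np.array([0.0, 0.25, 0.5])
beta = np.array([1.0, -0.7])
nu = lambda z: np.cos(2 * np.pi * z)

mu = lplm_mean(X, Z, beta, nu)
print(mu)
```

Only the parametric part $X^{T} β$ enters linearly; replacing `nu` by any other smooth function leaves the rest of the computation unchanged.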

When there are missing data, a likelihood method can naturally handle the problem by integrating over the missing data and maximizing the integrated marginal likelihood function. However, for non-likelihood methods, the same technique cannot be used. There are two paradigms for handling missing data in estimating equation approaches to construct unbiased estimating functions, namely, imputation (e.g. Reilly and Pepe, 1995; Paik, 1997) and inverse probability weighting (IPW, e.g. Robins et al., 1994; Robins et al., 1995). The imputation method fills in missing statistics with their 'best' guess, the conditional expectation. The IPW method weights the observed records by the inverse of the observation probability to properly represent the whole data, and has been very popular in various settings since it is easy to implement. The validity of the inference in both paradigms depends on the correctness of the assumptions on the auxiliary models: the imputation model in the case of the imputation approach, or the response model in the case of the IPW approach. The imputation method is generally more efficient than the IPW method, especially when there is a potent predictor for the missing data (Wang and Paik, 2006). The efficiency of the IPW method can be effectively improved by subtracting the projection onto the nuisance tangent space (Robins et al., 1994; Robins et al., 1995), but the projection term involves the conditional mean of the estimating function. The projection method requires assumptions on both auxiliary models, but the inference is valid when either one of the assumptions is correct. Because of this property, it is called the doubly robust (DR) method. In the case of missing outcomes, simple implementations of the DR method are discussed in Bang and Robins (2005), Scharfstein et al. (1999), and Little and An (2004). Although the same principles apply to missing outcomes and missing covariates, the imputation method and the DR method, in the case of missing covariates, require evaluation of the conditional expectation of the product of the missing covariates and the conditional mean of the outcomes given the observed data, which is a main hurdle for computation.
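To make the IPW idea concrete, the sketch below fits an ordinary (fully parametric) logistic regression in which a covariate is observed with a probability that depends on the outcome. It is an illustration under simplifying assumptions, not the kernel-assisted estimator studied in this paper: the observation probabilities are treated as known, there is no nonparametric component, and the sample size, coefficients and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def weighted_logistic(D, y, w, n_iter=25):
    """Solve sum_i w_i * D_i * (y_i - expit(D_i @ b)) = 0 by Newton-Raphson."""
    b = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = expit(D @ b)
        score = D.T @ (w * (y - p))                      # weighted score
        info = (D * (w * p * (1 - p))[:, None]).T @ D    # weighted information
        b = b + np.linalg.solve(info, score)
    return b

# Simulated full data (hypothetical setup).
n = 20000
beta_true = np.array([-0.5, 1.0])
x = rng.normal(size=n)
D = np.column_stack([np.ones(n), x])
y = rng.binomial(1, expit(D @ beta_true))

# Covariate x is observed with a probability that depends on Y only.
pi = np.where(y == 1, 0.8, 0.4)
r = rng.binomial(1, pi)          # r = 1 means x is observed
obs = r == 1

# Complete-case analysis: unweighted fit on the observed records.
beta_cc = weighted_logistic(D[obs], y[obs], np.ones(obs.sum()))
# IPW: each observed record is weighted by 1 / pi.
beta_ipw = weighted_logistic(D[obs], y[obs], 1.0 / pi[obs])

print("complete case:", beta_cc)
print("IPW:          ", beta_ipw)
```

In this setup the complete-case intercept is shifted away from its true value because records with $Y = 1$ are over-represented among the observed ones, while the weighted fit restores the balance of the full sample.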

The missing data problem becomes even more computationally demanding in the context of semiparametric regression models. When outcomes are missing, Chen et al. (2006) and Wang et al. (2010) proposed weighted kernel estimating equations for the GPLMs. Wang et al. (1998) is one of the first works to tackle the missing covariate problem in a nonparametric regression model using the IPW approach. Liang et al. (2004) considered estimation of a partially linear model with missing covariates using an IPW-type kernel-based method. Liang (2008) proposed a kernel-assisted IPW method for the GPLMs with missing covariates and derived asymptotic properties of the DR estimator, but discouraged using the DR estimator due to the complexity of its implementation. Qin et al. (2012) also considered an IPW-type approach, using a regression spline, for GPLMs with missing covariates that are robust in the sense of Huber.

In this paper we propose a new kernel-assisted estimating equation approach to handle missing covariates in the context of LPLMs. The proposed method modifies the DR estimating function by replacing the conditional expectation with an unbiased estimating function constructed using the mean of the outcome conditioning on the observed covariates but marginalizing out the missing covariates. This marginal mean usually is not easy to evaluate. To overcome this, we introduce the concept of link-preserving imputation. We call imputation models link-preserving if the part of the linear predictor concerning completely observed covariates is preserved under the same link function. Under link-preserving imputation, the marginal mean can be easily obtained by replacing the missing covariate with some imputation value, which allows simple implementation of the proposed method via data augmentation. Use of the marginal mean coupled with link-preserving imputation greatly reduces the computational difficulty in solving the estimating equations for both the parametric and the nonparametric parts. The proposed estimator is more efficient than the kernel-assisted IPW estimator by Liang (2008).
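The following sketch illustrates, in one simple special case, how a marginal mean can keep the logit link. It is not the paper's general construction, whose formula is not reproduced in this excerpt. The assumptions are: the missing covariate $X_{1}$ is binary, its conditional log-odds given the remaining data is $a + β_{1} Y$ for some quantity $a$ depending on $(X_{2}, Z)$, and $h = X_{2}^{T} β_{2} + ν(Z)$. Under these assumptions one can check that $logit\{E(Y | X_{2}, Z)\} = β_{1} x_{1}^{*} + h$ with $x_{1}^{*} = \log\{(1 + e^{a + β_{1}}) / (1 + e^{a})\} / β_{1}$. All names and numerical values are hypothetical.

```python
import numpy as np

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def marginal_mean_direct(a, beta1, h):
    """P(Y = 1 | X2, Z) obtained by summing the joint pmf of (X1, Y) over X1.

    The joint pmf is proportional to exp(beta1*x1*y + h*y + a*x1), which gives
    logit P(Y=1 | X1, X2, Z) = beta1*X1 + h and
    logit P(X1=1 | Y, X2, Z) = a + beta1*Y.
    """
    w = {(x1, y): np.exp(beta1 * x1 * y + h * y + a * x1)
         for x1 in (0, 1) for y in (0, 1)}
    return (w[(0, 1)] + w[(1, 1)]) / sum(w.values())

def imputed_x1(a, beta1):
    """Value x1* such that logit P(Y=1 | X2, Z) = beta1 * x1* + h."""
    return (np.log1p(np.exp(a + beta1)) - np.log1p(np.exp(a))) / beta1

# Illustrative values only.
beta1, beta2 = 1.0, -0.7
x2, z = 1.0, 0.3
a = -1.0 - 0.2 * x2 + 0.5 * z   # assumed form of a(X2, Z)
h = beta2 * x2 + z              # assumes nu(z) = z

x1_star = imputed_x1(a, beta1)
direct = marginal_mean_direct(a, beta1, h)
plugged_in = expit(beta1 * x1_star + h)
print(x1_star, direct, plugged_in)
```

The two probabilities printed at the end agree, so in this special case plugging the imputed value into the original logistic mean reproduces the marginal mean exactly.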

The rest of the paper is organized as follows. In Section 2, we briefly describe the notation and framework. We propose new methods in Section 3. Simulation studies follow in Section 4. In Section 5, we present an application to the health insurance coverage problem using data from the 2011–2012 National Health and Nutrition Examination Survey. Concluding remarks follow in Section 6.

Section snippets

Notation and framework

Suppose that there are $n$ independent and identically distributed observations $\{(Y_{i}, X_{i}^{T}, Z_{i})^{T}, i = 1, \dots, n\}$. Let $Y_{i}$ denote a binary outcome variable for the $i$th subject, $Z_{i}$ denote a single nonparametrically modeled covariate associated with the $i$th subject, and $X_{i} = (X_{i1}^{T}, X_{i2}^{T})^{T}$, where $X_{i1}$ and $X_{i2}$ denote vectors of parametrically modeled covariates for the $i$th subject with $p$ and $q$ elements, respectively. We consider the following logistic partially linear model, $\operatorname{logit}\{E(Y_{i} | X_{i}, Z_{i})\} = \log \frac{P(Y_{i} = 1 | X_{i}, Z_{i})}{P(Y_{i} = 0 | X_{i}, Z_{i})}$

Link-preserving imputation model

The main idea of the proposed method starts from the simple fact that we can still estimate $E(Y | X_{2}, Z)$ even when $X_{1}$ is missing in the context of LPLMs. Evaluating $E(Y | X_{2}, Z)$ is not always straightforward, but under a certain class of imputation models it can be manageable. We define link-preserving imputation models as follows. Let $η = \operatorname{logit}\{E(Y | X, Z; β, ν)\} = X_{1}^{T} β_{1} + X_{2}^{T} β_{2} + ν(Z)$. We call imputation models for $X_{1}$ link-preserving if they produce a form of $E(Y | X_{2}, Z; β, ν, γ)$ such that $η^{*} = logit {E (Y | X_{2}, Z;$

Design

We perform a simulation study to evaluate the finite sample performance of the proposed methods. We generate data from model (1) with $ν(Z) = Z$ or $ν(Z) = \cos(2 π Z)$, and generate $Z \sim \mathrm{Uniform}(0, 1)$ and $X_{2} \sim \mathrm{Bernoulli}(0.3)$. Note that $X_{1}^{*}$ can be computed from $X_{2}$ and $Z$ as described in Section 3.1. Given $X_{1}^{*}$, $X_{2}$ and $Z$, $Y$ is generated. We consider two scenarios:

(S1):

$p = 1$; $β_{1} = 1$; $β_{2} = -0.7$; and $\operatorname{logit}\{E(X_{1} | X_{2}, Y, Z)\} = -1 - 0.2 X_{2} + Y + 0.5 Z$.

(S2):

$p = 2; β_{1} = {(- 1, 0.2)}^{T}; β_{2} = 0.75$ ; and $(X_{1} | X_{2}, Y, Z) \sim {MVN}_{2} (μ (X_{2}, Y, Z), Σ)$ , where $μ (X_{2}, Y, Z) = (\begin{matrix} 0.5 - \end{matrix}$

Analysis of the health insurance data

In this application, we study the association between ethnicity and health insurance coverage while controlling for the effect of age, gender, country of birth, and general health condition using the 2011–2012 National Health and Nutrition Examination Survey data (NHANES, Centers for Disease Control and Prevention). Our study sample contains individuals who were 18 years or older at screening, where individuals 80 and over were topcoded at 80 years of age. Survey participants were asked in

Conclusion

We propose two kernel-assisted estimating equation estimators using link-preserving imputation for logistic partially linear models with missing covariates. The first estimator under this approach is easy to implement by modifying built-in functions for complete data in statistical software via data augmentation. The second estimator is an extension of the first estimator but is guaranteed to be more efficient than the IPW. Our proposed estimators are valid when the response model is correct,

Acknowledgments

This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2013R1A2A2A01067262) and the Seoul National University Research Grant.

References (20)

H. Liang, Generalized partially linear models with missing covariates, J. Multivariate Anal. (2008)
H. Bang et al., Doubly robust estimation in missing data and causal inference models, Biometrics (2005)
R.J. Carroll et al., Generalized partially linear single-index models, J. Amer. Statist. Assoc. (1997)
J. Chen et al., Local quasi-likelihood estimation with data missing at random, Statist. Sinica (2006)
H. Liang et al., Empirical likelihood-based inferences for generalized partially linear models, Scand. J. Statist. (2009)
H. Liang et al., Estimation in partially linear models with missing covariates, J. Amer. Statist. Assoc. (2004)
R.J.A. Little et al., Robust likelihood-based analysis of multivariate data with missing values, Statist. Sinica (2004)
M.C. Paik, The generalized estimating equation approach when data are not missing completely at random, J. Amer. Statist. Assoc. (1997)
M. Reilly et al., A mean score method for missing and auxiliary covariate data in regression models, Biometrika (1995)
J.M. Robins et al., Estimation of regression coefficients when some regressors are not always observed, J. Amer. Statist. Assoc. (1994)

There are more references available in the full text version of this article.
