Partially parametric techniques for multiple imputation

https://doi.org/10.1016/0167-9473(95)00057-7

Abstract

Multiple imputation is a technique for handling data sets with missing values. The method fills in the missing values several times, creating several completed data sets for analysis. Each data set is analyzed separately using techniques designed for complete data, and the results are then combined in such a way that the variability due to imputation may be incorporated. Methods of imputing the missing values can vary from fully parametric to nonparametric. In this paper, we compare partially parametric and fully parametric regression-based multiple-imputation methods. The fully parametric method that we consider imputes missing regression outcomes by drawing them from their predictive distribution under the regression model, whereas the partially parametric methods are based on imputing outcomes or residuals for incomplete cases using values drawn from the complete cases. For the partially parametric methods, we suggest a new approach to choosing complete cases from which to draw values. In a Monte Carlo study in the regression setting, we investigate the robustness of the multiple-imputation schemes to misspecification of the underlying model for the data. Sources of model misspecification considered include incorrect modeling of the mean structure as well as incorrect specification of the error distribution with regard to heaviness of the tails and heteroscedasticity. The methods are compared with respect to the bias and efficiency of point estimates and the coverage rates of confidence intervals for the marginal mean and distribution function of the outcome. We find that when the mean structure is specified correctly, all of the methods perform well, even if the error distribution is misspecified. The fully parametric approach, however, produces slightly more efficient estimates of the marginal distribution function of the outcome than do the partially parametric approaches. 
When the mean structure is misspecified, all of the methods still perform well for estimating the marginal mean, although the fully parametric method shows slight increases in bias and variance. For estimating the marginal distribution function, however, the fully parametric method breaks down in several situations, whereas the partially parametric methods maintain their good performance. In an application to AIDS research in a setting that is similar to, although slightly more complicated than, that of the Monte Carlo study, we examine how estimates for the distribution of the time from infection with HIV to the onset of AIDS vary with the method used to impute the residual time to AIDS for subjects with right-censored data. The fully parametric and partially parametric techniques produce similar results, suggesting that the model selection used for fully parametric imputation was adequate. Our application provides an example of how multiple imputation can be used to combine information from two cohorts to estimate quantities that cannot be estimated directly from either one of the cohorts separately.
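As a rough illustration of the two families of methods compared above, the following minimal sketch imputes missing regression outcomes either by adding normal errors to the fitted line (fully parametric) or by adding residuals drawn from the complete cases (partially parametric), and then combines the completed-data results with the standard multiple-imputation combining rules. Several choices here are assumptions of the sketch rather than details taken from the paper: the single-covariate linear model, the bootstrap of complete cases used to reflect uncertainty in the fitted line, the uniform draw of residuals from all complete cases (the paper proposes its own rule for choosing complete cases, which is not reproduced here), and all function names.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residuals)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, resid


def impute_once(x_obs, y_obs, x_mis, rng, parametric):
    """Return one set of imputed outcomes for the incomplete cases."""
    # Assumption of this sketch: bootstrap the complete cases before fitting,
    # so that uncertainty about the fitted line carries into the imputations.
    idx = [rng.randrange(len(x_obs)) for _ in x_obs]
    a, b, resid = fit_line([x_obs[i] for i in idx], [y_obs[i] for i in idx])
    if parametric:
        # Fully parametric: normal errors with the fitted residual spread.
        sd = (sum(r * r for r in resid) / (len(resid) - 2)) ** 0.5
        return [a + b * xm + rng.gauss(0.0, sd) for xm in x_mis]
    # Partially parametric: reuse residuals observed among the complete cases
    # (drawn uniformly here, which is a simplification).
    return [a + b * xm + rng.choice(resid) for xm in x_mis]


def combine(estimates, variances):
    """Combine m completed-data results into (point estimate, total variance)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)         # average point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance
    return q_bar, within + (1 + 1 / m) * between


def mi_mean(x_obs, y_obs, x_mis, m=5, parametric=True, seed=0):
    """Multiple-imputation estimate of the marginal mean of the outcome."""
    rng = random.Random(seed)
    n = len(y_obs) + len(x_mis)
    estimates, variances = [], []
    for _ in range(m):
        y_full = list(y_obs) + impute_once(x_obs, y_obs, x_mis, rng, parametric)
        estimates.append(statistics.fmean(y_full))
        variances.append(statistics.variance(y_full) / n)
    return combine(estimates, variances)


if __name__ == "__main__":
    # Simulated data (hypothetical): the last 50 outcomes are treated as missing.
    rng = random.Random(1)
    x = [rng.uniform(0, 10) for _ in range(200)]
    y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]
    x_obs, y_obs, x_mis = x[:150], y[:150], x[150:]
    for parametric in (True, False):
        print(parametric, mi_mean(x_obs, y_obs, x_mis, parametric=parametric))
```

In this sketch the two methods differ only in where the error term comes from, which is the part of the model whose misspecification (heavy tails, heteroscedasticity) the Monte Carlo study examines; the mean structure is shared by both.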


This work was partially supported by National Institutes of Health grants AI29196 and CA64235, by US Bureau of the Census Joint Statistical Agreement 89-17, by American Foundation for AIDS Research Grant 01065-7-RG, and by the IBM/UCLA Joint Project in Supercomputing. The authors thank Laura Lazzeroni for programming assistance. The authors' names are listed in alphabetical order.
