Monte Carlo EM algorithm in logistic linear models involving non-ignorable missing data

https://doi.org/10.1016/j.amc.2007.07.080

Abstract

Many data sets obtained from surveys or medical trials include missing observations. Since ignoring the missing information usually causes bias and inefficiency, we propose an algorithm for estimating parameters based on a likelihood function that takes the missing information into account. A binomial response and a normal explanatory model for the missing data are assumed. We fit the model using the Monte Carlo EM (Expectation and Maximization) algorithm. In the E-step, the Metropolis–Hastings algorithm is used to generate a sample for the missing data, and in the M-step, Newton–Raphson iteration is used to maximize the likelihood function. Asymptotic variances and standard errors of the maximum likelihood estimates (MLEs) of the parameters are derived using the observed Fisher information.

Introduction

Many data sets obtained from surveys or medical trials include missing observations [1]. When such data sets are analyzed, it is common to use only the complete cases, discarding every record with a missing value. However, this can cause problems if missingness is related to the value of the missing item [2]: the parameter estimates may be biased and inefficient [3]. We therefore need methods that use the partial information contained in incomplete records instead of ignoring them. Little and Rubin [3] described many statistical methods for dealing with missing data. Baker and Laird [4] used the EM (Expectation and Maximization) algorithm to obtain maximum likelihood estimates (MLEs) of parameters from incomplete data. Ibrahim and Lipsitz [5], [6] presented Bayesian methods for estimation in generalized linear models. Our proposed method stems from [5], [6] and can be viewed as an extended and modified version for a different model.

There are two types of missing data: ignorable and non-ignorable [3]. Missing data are called ignorable (non-ignorable) if the probability of observing a data item is independent of (dependent on) the value of that item. Data that are missing at random are ignorable, whereas non-ignorable missing data are not missing at random.

In this paper, we propose a method for estimating parameters in logistic linear models involving non-ignorable missing data. A binomial response and a normal covariate model for the missing data are assumed. The Monte Carlo EM algorithm is used to estimate the parameters [7]. In the E-step, the Metropolis–Hastings algorithm generates a sample for the missing data; in the M-step, Newton–Raphson iteration solves the score equation to maximize the conditional expectation of the log-likelihood function. The standard errors of the estimates are calculated from the observed Fisher information matrix.

The rest of this paper is organized as follows. In Section 2, the notation and model are stated. In Section 3, we derive the E- and M-steps of the Monte Carlo EM algorithm, including the Metropolis–Hastings algorithm and Newton–Raphson iteration. The calculation of standard errors is described in Section 4. In Section 5, we illustrate our method with an example. A summary is given in the last section. Details of the derivatives for the Newton–Raphson iteration and the formulas for the elements of the observed Fisher information matrix are given in Appendix 1 (derivatives) and Appendix 2 (observed Fisher information matrix).

Section snippets

Notation and model

Suppose that y1, …, yn are independent observations, where each yi has a binomial distribution with sample size mi and success probability πi. Let Xi = (X1i, X2i)^t be a 2 × 1 random vector of covariates, where X1i and X2i are independent and follow normal distributions with means μ1, μ2 and variances σ1², σ2², respectively. Further, let β^t = (β0, β1, β2) be the regression coefficients, assumed to include an intercept. It is also assumed that

logit(πi) = log[πi / (1 − πi)] = Xi^t β, and p(yi | Xi, β) = exp{yi Xi^t β} / (1 + exp{Xi^t β})^mi.
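As a minimal numerical sketch of the model above (all parameter values, sample sizes, and variable names here are our own illustration, not the paper's), the covariates, success probabilities, and binomial log-likelihood kernel can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values for illustration only.
mu1, mu2, sig1, sig2 = 0.0, 1.0, 1.0, 0.5
beta = np.array([-0.5, 1.0, -1.0])   # (beta0, beta1, beta2), intercept included
n, m = 50, 10                        # n units, common binomial size m_i = m

# Covariates X_i = (X1i, X2i)^t: independent normals.
x1 = rng.normal(mu1, sig1, n)
x2 = rng.normal(mu2, sig2, n)
X = np.column_stack([np.ones(n), x1, x2])   # design row (1, X1i, X2i)

# logit(pi_i) = X_i^t beta, so pi_i is the inverse logit of the linear predictor.
eta = X @ beta
pi = 1.0 / (1.0 + np.exp(-eta))

# Binomial responses y_i ~ Bin(m, pi_i).
y = rng.binomial(m, pi)

def binom_loglik(beta, X, y, m):
    """Binomial log-likelihood kernel: sum_i [y_i X_i^t beta - m log(1+exp(X_i^t beta))],
    i.e. the formula in the text up to the constant log C(m, y_i)."""
    eta = X @ beta
    return float(np.sum(y * eta - m * np.log1p(np.exp(eta))))
```

The `log1p` form of the normalizing term keeps the computation numerically stable for moderate linear predictors.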

Algorithm formulation

The MLE of β and the other components of θ maximizes the observed-data likelihood L(θ | (y, X)obs, ri, si), which has a quite intractable analytical form, where (y, X)obs denotes the observed components of (y, X). Rather than directly differentiating L(θ | (y, X)obs, ri, si) with respect to θ, we compute the MLE of θ using an EM algorithm [8], which involves iterative evaluation and maximization of the conditional expectation of the complete-data log-likelihood l(θ). If the conditional expectation involved
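The paper's full algorithm also models the missingness mechanism (through ri, si) and updates all components of θ. As a deliberately simplified sketch of just the MCEM loop shape, assuming only a missing covariate x2, a known normal covariate model, and a Metropolis–Hastings E-step followed by one Newton–Raphson M-step per iteration (all names, settings, and simplifications are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# --- toy data: x2 unobserved for some units (the missingness-mechanism
# --- parameters of the paper are omitted in this sketch) ---
n, m = 40, 10
beta_true = np.array([-0.5, 1.0, -1.0])
x1 = rng.normal(0.0, 1.0, n)
x2 = rng.normal(1.0, 0.5, n)
y = rng.binomial(m, 1 / (1 + np.exp(-(beta_true[0] + beta_true[1]*x1 + beta_true[2]*x2))))
miss = rng.random(n) < 0.3              # indicator: x2_i unobserved

def mh_draw_x2(x_cur, i, beta, mu2=1.0, sig2=0.5, steps=20):
    """A few Metropolis-Hastings random-walk steps targeting
    p(x2 | y_i, x1_i; beta) ∝ p(y_i | x) p(x2)."""
    def logpost(v):
        eta = beta[0] + beta[1]*x1[i] + beta[2]*v
        return y[i]*eta - m*np.log1p(np.exp(eta)) - 0.5*((v - mu2)/sig2)**2
    x = x_cur
    for _ in range(steps):
        prop = x + rng.normal(0.0, 0.5)
        if np.log(rng.random()) < logpost(prop) - logpost(x):
            x = prop
    return x

beta = np.zeros(3)
x2_cur = x2.copy()
x2_cur[miss] = 1.0                      # start missing entries at the prior mean
K = 30                                  # Monte Carlo sample size per E-step
for it in range(15):                    # MCEM iterations
    # E-step: run the MH chain, collecting K imputed design matrices.
    Xs = []
    for _ in range(K):
        for i in np.where(miss)[0]:
            x2_cur[i] = mh_draw_x2(x2_cur[i], i, beta)
        Xs.append(np.column_stack([np.ones(n), x1, x2_cur.copy()]))
    # M-step: one Newton-Raphson update on the MC-averaged score and Hessian.
    score, hess = np.zeros(3), np.zeros((3, 3))
    for Xk in Xs:
        p = 1 / (1 + np.exp(-(Xk @ beta)))
        score += Xk.T @ (y - m*p) / K
        hess -= (Xk * (m * p * (1 - p))[:, None]).T @ Xk / K
    beta = beta - np.linalg.solve(hess, score)
```

Since the Hessian of the binomial log-likelihood is negative definite here, the Newton step `beta - solve(hess, score)` moves uphill on the Monte Carlo estimate of the expected complete-data log-likelihood.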

Standard errors of estimates

It is well known that the maximum likelihood estimate θ̂ asymptotically follows a normal distribution MVN(θ, V(θ)) under some regularity conditions. The expected Fisher information matrix I(θ̂), whose inverse gives the variance matrix of θ̂, is approximated by the observed information matrix J(θ̂):

V(θ̂)^{−1} = E[−∂² log L(θ)/∂θ ∂θ^t]|_{θ=θ̂} ≈ Σ_{i=1}^{n} [−∂² log Li(θ)/∂θ ∂θ^t]|_{θ=θ̂} ≡ J(θ̂).

We apply the result of [12] on the information of θ: observed information = complete information − missing information.
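The paper derives the observed information in closed form (Appendix 2). As a generic, hypothetical sketch of the same idea, one can approximate J(θ̂) = −∂² log L/∂θ ∂θ^t at the MLE by central finite differences and invert it for standard errors; the toy check below uses the N(μ, 1) log-likelihood, whose observed information for μ is exactly n:

```python
import numpy as np

def observed_info(loglik, theta_hat, eps=1e-4):
    """Observed information J = -(Hessian of log-likelihood) at theta_hat,
    via central finite differences (a generic numerical sketch, not the
    paper's closed-form decomposition)."""
    p = len(theta_hat)
    H = np.zeros((p, p))
    t = np.asarray(theta_hat, dtype=float)
    for i in range(p):
        for j in range(p):
            def f(di, dj):
                t2 = t.copy()
                t2[i] += di
                t2[j] += dj
                return loglik(t2)
            # Central cross-difference; for i == j this reduces to the
            # usual second difference with step 2*eps.
            H[i, j] = (f(eps, eps) - f(eps, -eps)
                       - f(-eps, eps) + f(-eps, -eps)) / (4 * eps**2)
    return -H

# Toy check: for x_1..x_n ~ N(mu, 1), log L(mu) = -0.5 * sum (x_i - mu)^2 + const,
# so the observed information at the MLE (the sample mean) equals n.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
loglik = lambda th: -0.5 * np.sum((x - th[0])**2)
J = observed_info(loglik, np.array([x.mean()]))
se = np.sqrt(np.linalg.inv(J)[0, 0])      # standard error of the mean, ~ 1/sqrt(n)
```

Here `J[0, 0]` should be close to n = 100 and `se` close to 0.1, matching the familiar 1/√n rate.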

An illustration

In this section, we present an example to illustrate the MCEM algorithm with a missing response variable and a missing covariate in the logistic regression model. First, we generate the covariates x1i, x2i independently at random, where x1i and x2i follow the normal distributions N(μ1, σ1²) and N(μ2, σ2²), respectively. The response variable yi is then generated from a binomial distribution with sample size mi and probability πi = exp{xi^t β}/(1 + exp{xi^t β}). We apply a missing data mechanism to generate missing
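A simulation of this kind can be sketched as follows; the missingness coefficients `alpha` are hypothetical choices of ours, used only to illustrate a non-ignorable mechanism in which the probability that x2 is unobserved depends on the value of x2 itself:

```python
import numpy as np

rng = np.random.default_rng(2)

# Generate covariates and binomial responses as described in the text.
n, m = 200, 10
mu1, sig1, mu2, sig2 = 0.0, 1.0, 1.0, 0.5
beta = np.array([-0.5, 1.0, -1.0])
x1 = rng.normal(mu1, sig1, n)
x2 = rng.normal(mu2, sig2, n)
eta = beta[0] + beta[1]*x1 + beta[2]*x2
y = rng.binomial(m, 1 / (1 + np.exp(-eta)))

# Non-ignorable missingness: P(x2_i missing) depends on x2_i itself
# through a logistic model (alpha values are illustrative only).
alpha = np.array([-1.0, 1.5])
p_miss = 1 / (1 + np.exp(-(alpha[0] + alpha[1]*x2)))
r = rng.random(n) < p_miss            # r_i = True: x2_i is missing

# Because larger x2 values are more likely to go missing, the observed
# x2 values systematically under-represent the upper tail.
obs_mean, full_mean = x2[~r].mean(), x2.mean()
```

Comparing `obs_mean` with `full_mean` shows the selection bias that a complete-case analysis would inherit, which is why the likelihood must incorporate the missingness mechanism.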

Summary

An algorithm for estimating parameters in logistic linear models is proposed for incomplete data. When some of the response and covariate observations are missing non-ignorably, maximum likelihood estimation is considered. In the proposed Monte Carlo EM (Expectation and Maximization) algorithm, the Metropolis–Hastings (MH) algorithm is implemented to compute the conditional expectation of the log-likelihood function, and Newton–Raphson iteration is used in the M-step. For the

Acknowledgements

This paper was started while the third author was visiting the Department of Statistics, La Trobe University, Australia. She thanks Dr. Richard Huggins and the staff in the department for their hospitality and support.



This work was supported by a Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (R01-2006-000-11087-0).
