Monte Carlo EM algorithm in logistic linear models involving non-ignorable missing data☆
Introduction
Many data sets obtained from surveys or medical trials include missing observations [1]. When such data sets are analyzed, it is common to use only the complete cases, discarding every case with a missing value. However, this may cause problems if the missingness is related to the values of the missing items [2]: the parameter estimates can be biased and inefficient [3]. We therefore need methods that exploit the partial information contained in the incomplete cases instead of ignoring them. Little and Rubin [3] described many statistical methods for dealing with missing data. Baker and Laird [4] used the EM (Expectation–Maximization) algorithm to obtain maximum likelihood estimates (MLEs) of parameters from incomplete data. Ibrahim and Lipsitz [5], [6] presented Bayesian methods for estimation in generalized linear models. Our proposed method stems from [5], [6] and can be regarded as an extended and modified version for a different model.
There are two types of missing data: ignorable and non-ignorable [3]. Missing data are called ignorable (non-ignorable) if the probability of observing a data item is independent of (dependent on) the value of that item. Data that are missing at random are ignorable, whereas non-ignorable missing data are not missing at random.
In this paper, we propose a method for estimating the parameters of logistic linear models with non-ignorable missing data. A binomial response, normal covariates, and a model for the missing-data mechanism are assumed. The Monte Carlo EM algorithm is used to estimate the parameters [7]. In the E-step, a Metropolis–Hastings algorithm is used to generate a sample of the missing data; in the M-step, Newton–Raphson iteration is used to solve the score equation and thereby maximize the conditional expectation of the complete-data log-likelihood. The standard errors of the estimates are calculated from the observed Fisher information matrix.
The rest of this paper is organized as follows. In Section 2, the notation and the model are stated. In Section 3 we derive the E- and M-steps of the Monte Carlo EM algorithm, including the Metropolis–Hastings algorithm and the Newton–Raphson iteration. The calculation of standard errors is described in Section 4. In Section 5 we illustrate our method with an example. A summary is given in the last section. Details of the derivatives for the Newton–Raphson iteration and formulas for the elements of the observed Fisher information matrix are given in Appendix 1 and Appendix 2, respectively.
Section snippets
Notation and model
Suppose that y1, … , yn are independent observations, where each yi has a binomial distribution with sample size mi and success probability πi. Let Xi = (X1i, X2i)t be a 2 × 1 random vector of covariates, where X1i and X2i are independent observations following normal distributions with means μ1, μ2 and variances σ1², σ2², respectively. Further, let βt = (β0, β1, β2) be the vector of regression coefficients, which is assumed to include an intercept. It is also assumed that
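To make the model concrete, the following minimal sketch (in Python, with our own function name and argument layout rather than the paper's notation) writes out the complete-data log-likelihood contribution of a single case implied by these assumptions: a binomial response with a logistic link and two independent normal covariates.

```python
import numpy as np
from scipy.stats import binom, norm

def complete_loglik_i(yi, mi, xi, beta, mu, sigma2):
    """Complete-data log-likelihood contribution of case i (response + covariates).

    xi     : (1, x1i, x2i), design vector including the intercept
    beta   : (beta0, beta1, beta2), regression coefficients
    mu     : (mu1, mu2), means of the normal covariates
    sigma2 : (sig1^2, sig2^2), variances of the normal covariates
    """
    eta = xi @ beta                                        # linear predictor x_i' beta
    pi_i = 1.0 / (1.0 + np.exp(-eta))                      # logistic link
    ll = binom.logpmf(yi, mi, pi_i)                        # y_i ~ Binomial(m_i, pi_i)
    ll += norm.logpdf(xi[1], mu[0], np.sqrt(sigma2[0]))    # X_1i ~ N(mu1, sig1^2)
    ll += norm.logpdf(xi[2], mu[1], np.sqrt(sigma2[1]))    # X_2i ~ N(mu2, sig2^2)
    return ll
```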
Algorithm formulation
The MLEs of β and the other components of θ are the values maximizing the observed-data likelihood L(θ∣(y, X)obs, ri, si), which has a quite intractable analytical form, where (y, X)obs denotes the observed components of (y, X). Rather than directly differentiating L(θ∣(y, X)obs, ri, si) with respect to θ, we compute the MLE of θ using an EM algorithm [8], which involves iterative evaluation and maximization of the conditional expectation of the complete-data log-likelihood l(θ). If the conditional expectation involved
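As a rough illustration of one iteration of the kind of Monte Carlo EM algorithm described here, the sketch below is our own simplification, not the paper's implementation: it treats only x1 as missing, reuses complete_loglik_i from the sketch in Section 2, omits the missing-data-mechanism factors ri, si from the target density, and uses a BFGS optimizer as a stand-in for the paper's Newton–Raphson M-step.

```python
import numpy as np
from scipy.optimize import minimize

def mcem(y, m, X, miss1, theta0, n_draws=200, n_iter=30, step=0.5, seed=1):
    """Monte Carlo EM sketch: cases in `miss1` have x1 missing; y and x2 observed.

    theta = (b0, b1, b2, mu1, mu2, log s1^2, log s2^2); the paper's theta also
    contains the parameters of the missing-data mechanism, omitted here.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, float)

    def unpack(th):
        return th[:3], th[3:5], np.exp(th[5:7])        # log-variances keep s^2 > 0

    def loglik_case(i, x1i, th):
        beta, mu, s2 = unpack(th)
        return complete_loglik_i(y[i], m[i], np.array([1.0, x1i, X[i, 2]]), beta, mu, s2)

    for _ in range(n_iter):
        # E-step: a Metropolis-Hastings chain of length n_draws for each missing x1i
        draws = {}
        for i in miss1:
            cur, chain = theta[3], np.empty(n_draws)   # start the chain at the current mu1
            for t in range(n_draws):
                prop = cur + step * rng.normal()       # random-walk proposal
                if np.log(rng.uniform()) < loglik_case(i, prop, theta) - loglik_case(i, cur, theta):
                    cur = prop                         # accept the proposal
                chain[t] = cur
            draws[i] = chain

        # M-step: maximize the Monte Carlo estimate of Q(theta | current theta)
        def neg_Q(th):
            q = sum(loglik_case(i, X[i, 1], th) for i in range(len(y)) if i not in miss1)
            q += sum(np.mean([loglik_case(i, v, th) for v in draws[i]]) for i in miss1)
            return -q

        theta = minimize(neg_Q, theta, method="BFGS").x
    return theta
```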
Standard errors of estimates
It is well known that, under some regularity conditions, the distribution of the maximum likelihood estimator θ̂ is asymptotically normal, MVN(θ, V(θ)). The expected Fisher information matrix, which gives the inverse of the variance matrix of θ̂, is approximated by the observed information matrix I(θ̂). We apply the result of [12] on the information of θ:
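The snippet below is a generic numerical stand-in for this calculation, assuming a routine obs_loglik(theta) that evaluates the (possibly Monte Carlo) observed-data log-likelihood at a parameter vector: it approximates the observed information as minus the Hessian of that log-likelihood at the MLE by central finite differences and obtains standard errors from the diagonal of its inverse. The paper instead uses the closed-form elements given in Appendix 2.

```python
import numpy as np

def observed_information(obs_loglik, theta_hat, eps=1e-5):
    """Observed information I(theta_hat) = -Hessian of the observed-data
    log-likelihood at the MLE (theta_hat given as a NumPy array),
    approximated by central finite differences."""
    p = len(theta_hat)
    H = np.zeros((p, p))
    for j in range(p):
        for k in range(p):
            ej, ek = np.zeros(p), np.zeros(p)
            ej[j], ek[k] = eps, eps
            H[j, k] = (obs_loglik(theta_hat + ej + ek)
                       - obs_loglik(theta_hat + ej - ek)
                       - obs_loglik(theta_hat - ej + ek)
                       + obs_loglik(theta_hat - ej - ek)) / (4.0 * eps ** 2)
    info = -H                                           # observed information matrix
    se = np.sqrt(np.diag(np.linalg.inv(info)))          # asymptotic standard errors
    return info, se
```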
An illustration
In this section, we present an example to illustrate the MCEM algorithm with a missing response variable and a missing covariate in a logistic regression model. First, we generate the covariates x1i and x2i independently at random, with x1i and x2i following the normal distributions N(μ1, σ1²) and N(μ2, σ2²), respectively. The response variable yi is then generated from a binomial distribution with sample size mi and success probability πi = exp{xitβ}/(1 + exp{xitβ}). We apply a missing-data mechanism to generate missing
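The paper's actual simulation settings are not visible in this snippet; the sketch below uses hypothetical parameter values of our own choosing and a logistic selection model in which the probability that an item is missing depends on the value of that item, which is what makes the mechanism non-ignorable.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical settings chosen only for illustration (not the paper's values)
n, mi = 300, 10
beta = np.array([-0.5, 1.0, -0.8])                     # (beta0, beta1, beta2)
mu1, mu2, sig1, sig2 = 0.0, 1.0, 1.0, 0.5

x1 = rng.normal(mu1, sig1, n)                          # x1i ~ N(mu1, sig1^2)
x2 = rng.normal(mu2, sig2, n)                          # x2i ~ N(mu2, sig2^2)
X = np.column_stack([np.ones(n), x1, x2])
pi = 1.0 / (1.0 + np.exp(-(X @ beta)))                 # pi_i = exp(x_i' b)/(1 + exp(x_i' b))
y = rng.binomial(mi, pi)                               # y_i ~ Binomial(m_i, pi_i)

# Non-ignorable mechanism: P(missing) depends on the value that may go missing.
p_y_miss = 1.0 / (1.0 + np.exp(-(-2.0 + 0.3 * y)))     # missingness of the response
p_x_miss = 1.0 / (1.0 + np.exp(-(-2.0 + 0.8 * x1)))    # missingness of the covariate x1
r = rng.uniform(size=n) < p_y_miss                     # r_i = 1 means y_i missing (our convention)
s = rng.uniform(size=n) < p_x_miss                     # s_i = 1 means x1i missing (our convention)

y_obs = np.where(r, np.nan, y.astype(float))           # observed data with NaN for missing values
x1_obs = np.where(s, np.nan, x1)
```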
Summary
An algorithm for estimating the parameters of logistic linear models from incomplete data is proposed. Maximum likelihood estimation (MLE) is considered when some of the response and covariate observations are missing non-ignorably. In the proposed Monte Carlo EM (Expectation–Maximization) algorithm, a Metropolis–Hastings (MH) algorithm is implemented to compute the conditional expectation of the log-likelihood function, and Newton–Raphson iteration is used in the M-step. For the
Acknowledgements
This paper was started while the third author was visiting the Department of Statistics, La Trobe University, Australia. She thanks Dr. Richard Huggins and the staff in the department for their hospitality and support.
References (12)
- et al., Data envelopment analysis with missing values: An interval DEA approach, Appl. Math. Comput. (2006)
- et al., Indirect methods of imputation of missing data based on available units, Appl. Math. Comput. (2005)
- et al., Using Monte Carlo method for ranking efficient DMUs, Appl. Math. Comput. (2005)
- R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data (2002)
- S.G. Baker, N.M. Laird, Regression analysis for categorical variables with outcome subject to nonignorable nonresponse, J. Am. Stat. Assoc. (1988)
- J.G. Ibrahim, S.R. Lipsitz, Parameter estimation from incomplete data in binomial regression when the missing data mechanism is nonignorable, Biometrics (1996)
☆ This work was supported by Korean Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (R01-2006-000-11087-0).