Background

An increasing number of cohort studies are conducting repeated waves of follow-up in order to update information on their participants. This collection of data allows researchers to assess the association between change in an exposure variable, measured prospectively, and the risk of a given outcome variable. However, as with all epidemiological analyses, longitudinal follow-up studies are plagued by the problem of missing data. A recent review of cohort studies that analysed the association between a change in a prospectively measured exposure and an outcome variable found the majority of studies had at least 10% of participants with some missing data; the amount of missing data ranged from 4 to 80% with a median of 25% [1]. Similarly to epidemiological studies with only one wave of data collection, studies with repeated waves of data collection typically exclude participants with any missing data from the epidemiological analysis [1, 2]. Excluding participants with missing information, commonly termed a complete-case analysis, reduces the precision of the estimate of the exposure-outcome association and might bias the estimate if the participants with missing data are not similar to those with complete data [36]. Multiple imputation (MI) is an alternative method for handling missing data, which has become increasingly accessible to researchers in a number of statistical software packages (e.g. SAS [7] and Stata [8]).

In MI, imputed values for the missing data are generated multiple times resulting in multiple ‘completed’ (observed plus imputed) versions of the dataset; imputed values are sampled from an imputation model that accounts for the uncertainty introduced by the missing data. Each of the completed datasets is then analysed using an appropriate statistical model for the epidemiological analysis and the resulting estimates are combined to produce one overall MI estimate, with a standard error that accounts appropriately for both between- and within-completed data variability of the estimates [9].

Simulation studies present an opportunity to compare the performance of alternative statistical methods to handle missing data since the true values of the parameters being estimated are known. The statistical methods can be evaluated and compared for different amounts of missing data and different missing data scenarios. Increasingly, researchers are basing their simulation studies on data generation models designed to mimic actual epidemiologic studies [3, 1016]. Simulation studies evaluating methods to handle missing covariate data (i.e. missing data in the main exposure variable or confounder variable(s)) have focussed on covariate data measured once for each participant in the study [3, 1013, 15, 16]. Research investigating methods to handle missing data in repeated waves of data collection have focussed specifically on missing data in repeated measures of the outcome; these methods include extensions of weighted estimating equations, weighted least squares and random effects [14, 17, 18]. These proposed methods exclude the participants with missing exposure data in the analysis of the exposure-outcome association. Our focus is on how to handle longitudinal data where the repeated exposure measures are subject to missing values but the outcome is completely observed for all participants. To our knowledge this has not been investigated in an extensive simulation study.

The study reported in this paper is based on the Melbourne Collaborative Cohort Study [19] and was motivated by our interest in estimating the magnitude of the association between change in waist circumference, a numerical exposure derived from two measurements of waist circumference, and risk of colorectal cancer, a fully observed time-to-event outcome.

Waist circumference is an established marker for measuring obesity and the association between waist circumference measured at a single point in time and the risk of colorectal cancer is well established [2025]. Estimating the magnitude of the association between change in waist circumference, and colorectal cancer is important to establish whether the long-term effects of fat loss decrease the risk of colorectal cancer. In the Melbourne Collaborative Cohort Study, waist circumference was measured at baseline and at follow-up, 13 years later, and as with many other cohorts, a significant proportion of participants have missing data at the follow-up assessment and the true missing data mechanism is not known.

The primary aim of this paper was to determine if there was any gain in using MI compared to complete-case analysis for handling up to 50% missing data in a repeatedly measured exposure of interest with a fully observed outcome. Our secondary aim was to assess the impact of including a strong auxiliary variable in MI (i.e. a variable included in the imputation model that is highly correlated with the variable with missing data and is not included in the analysis model). We report the results of a simulation study that compares the performance of MI and complete-case analysis for handling missing data in waist circumference when estimating two associations: change in waist circumference and the incidence of colorectal cancer, and waist circumference at wave 2 and colorectal cancer, with data simulated using two Hazard Ratios (HRs) representing a weak and strong association between change in waist circumference and colorectal cancer, where there were different amounts of missing data according to various missing data mechanisms. Note, throughout this paper the epidemiological analysis of ‘change in waist circumference’ refers to an analysis where the exposure of interest is the absolute change in waist circumference adjusted for waist circumference at baseline.

Methods

Motivating example: Melbourne Collaborative Cohort Study

The Melbourne Collaborative Cohort Study is a prospective cohort study of 41,514 people (24,469 women), recruited in Melbourne, Australia, aged between 40 and 69 years at baseline (wave 1). Details of the Melbourne Collaborative Cohort Study have been published elsewhere [19]. In brief, participants were recruited between 1990 and 1994 and extensive baseline demographic, lifestyle (e.g. smoking status, alcohol consumption and physical activity) and dietary information were collected. Anthropometric measures (height, weight, waist and hip circumference) were directly measured by nurses according to written protocols based on standard procedures [26]. A second active wave of follow-up (wave 2) was conducted between 2003 and 2007 in which wave 1 information was updated and anthropometric measures were reassessed. Table 1 provides details of the information collected at each wave. Waist circumference was measured for 99% (41,476) of participants at wave 1 and for 60% (26,846) of participants alive at wave 2. The vital status of the Melbourne Collaborative Cohort Study participants was determined by record linkage to the National Death Index. Colorectal cancer cases were ascertained by record linkage to the population-based Victorian Cancer Registry and to the Australian Cancer Database to identify cases diagnosed in other states of Australia.

Table 1 Data structure for the Melbourne Collaborative Cohort Study dataset

Epidemiological analysis

Two Cox proportional hazards models were fitted to each of the simulated datasets. For both models, age was the time metric and follow-up began at wave 2 and continued until diagnosis of colorectal cancer, date of death or 30 June 2011, whichever came first. The first model (analysis (a)) included absolute change in waist circumference as the covariate of interest (i.e. wave 2 minus wave 1 waist circumference, rescaled so that the estimated HR corresponded to a difference of 10 cm in change in waist circumference), with adjustment for waist circumference at wave 1 as well as for sex, country of birth and education (Equation 1). The second model (analysis (b)) included waist circumference at wave 2 as the exposure of interest, also rescaled so that the estimated HR corresponded to an increase of 10cm in waist circumference at wave 2, with adjustment for sex, country of birth and education (Equation 2).

log ( h a ( t ) ) = log ( h a 0 ( t ) ) + β a 1 ( WC 2 WC 1 ) + β a 2 WC 1 + β a 3 Female + β a 4 COB UK + β a 5 COB Mediterranean + β a 6 Education Secondary + β a 7 Education Tertiary
(1)
log ( h b ( t ) ) = log ( h b 0 ( t ) ) + β b 1 WC 2 + β b 2 Female + β b 3 COB UK + β b 4 COB Mediterranean + β b 5 Education Secondary + β b 6 Education Tertiary ;
(2)

where t is the time, β ai and β bi are the vectors of regression coefficients, h a (t) and h b (t) are the hazard functions, and ha 0(t) and hb 0(t) are the baseline hazard functions, i.e. the hazard function for x = 0, for analysis (a) and (b), respectively.

In the second model there was no adjustment for waist circumference at wave 1, enabling us to assess the impact of an auxiliary variable (i.e. a variable included in the imputation model but not included in the epidemiological analysis) that has a strong association with the variable with missing data (ρ = 0.81, Table 1).

Generating the datasets

This simulation study was based on the 41,476 Melbourne Collaborative Cohort Study participants who had their waist circumference measured at wave 1. We used the observed demographic (i.e. age, sex, country of birth and education) and lifestyle information (i.e. smoking status, physical activity and alcohol consumption) for each participant (see Table 1 for a description of the variables), and simulated data for waist circumference at waves 1 and 2, the outcome of colorectal cancer and the censoring variable of death using relationships from the observed data as described below. We simulated the waist circumference data at waves 1 and 2 to preserve the correlation between the waist circumference measurements at the two waves of data collection conditional on the fixed demographic covariates at baseline. We repeated the simulation process 1,000 times, which allowed us to estimate the log of the HR for change in waist circumference and colorectal cancer to within 1.5% accuracy of the true value [27]. Independent random samples were generated using different starting seeds that were separated by 41,500 units (i.e. a value larger than the sample size) [27]. This produced 1,000 complete simulated datasets of 41,476 individuals for each HR, differing only in terms of waist circumference (at waves 1 and 2), death and the outcome of colorectal cancer.

To preserve the correlations of the exposure variables and the covariates, we simulated waist circumference at waves 1 and 2 from a multivariate normal distribution conditional on continuous (i.e. age) and categorical covariates (i.e. the demographic variables: sex, country of birth, education; and lifestyle factors: smoking status, physical activity and alcohol consumption) [28]. The means and variance-covariance matrix used for the multivariate normal distribution were obtained from the observed data of the 26,846 participants who attended both waves.

Once waist circumference was simulated, we also simulated time to the outcome of colorectal cancer and time to death (i.e. censoring) using two Weibull models [29] with the linear predictors defined in equations (3) and (4) below for colorectal cancer and death respectively based on the analysis of the Melbourne Collaborative Cohort Study data; where α and δ are the vectors of coefficients relating to the vector of covariates X:

For time to colorectal cancer:

α X colorectal = α 1 ( WC 2 WC 1 ) + α 2 WC 1 + α 3 Female + α 4 Age + α 5 Education Secondary + α 6 Education Tertiary + α 7 COB UK + α 8 COB Mediterranean
(3)

For time to death:

δ X death = δ 1 Female+ δ 2 Age
(4)

We arbitrarily chose to set the HR associated with a 10cm change in waist circumference to either 1.1 or 1.5, representing a weak and a strong relationship, respectively, with the outcome of colorectal cancer, adjusted for waist circumference at wave 1. The HRs for the remaining covariates, as well as the scale and shape parameters for the Weibull models were set to the values obtained from fitting a Weibull model to the complete Melbourne Collaborative Cohort Study dataset, with time on study set as the time metric.

Once time to event was simulated, participants were classified as either having a colorectal cancer event, or were censored at date of death or end of study (30 June 2011), whichever came first. The date of attendance at wave 2 is required for the epidemiological analysis as this represents the start of follow-up (see Section Epidemiological analysis) and was set to the actual date of attendance for the participants who attended wave 2 and to 30 May 2005 (median date of attendance) for those who did not attend.

Introduction of missing data

For each of the 1,000 simulated datasets simulated under each “true” HR of 1.1 and 1.5 (i.e. a total of 2,000 datasets) we assigned 15, 30 and 50% of the waist circumference data at wave 2 to missing. These values were chosen to bracket the actual proportion of Melbourne Collaborative Cohort Study participants who did not have their waist circumference measured at wave 2, which was 30%. The waist circumference data were set to missing according to three different scenarios: MCAR and two missing at random scenarios, in both of which missingness is dependent on the covariates, and will be referred to by Little’s [30] terminology ‘covariate-dependent MAR’. For data to be MCAR we selected a random sample of the desired proportion and set their waist circumference at wave 2 to missing.

To generate data according to a covariate-dependent MAR scenario (i.e. such that the distribution of the missingness indicator can be explained by observed variables in the dataset, independently of the missing values themselves), we assumed that the probability of a value being missing followed a logistic regression model (Equation 5) that included the covariates and corresponding parameters defined in Table 2. The predictors of missingness selected for the covariate-dependent MAR scenarios were those observed in the Melbourne Collaborative Cohort Study to be predictors of non-attendance at wave 2. The intercept for the logistic regression model, γ0 in (5) below, was determined (by iteration) so that the number of observations with missing data was approximately 15, 30 or 50%. In the first covariate-dependent MAR scenario (standard covariate-dependent MAR) we aimed to reflect the missingness pattern observed in the Melbourne Collaborative Cohort Study data using the Odds Ratios (ORs) shown in Table 2. For the second covariate-dependent MAR scenario (enhanced covariate-dependent MAR scenario), we aimed to create a more extreme but still realistic scenario by doubling the log of the ORs used in the first covariate-dependent MAR scenario (equivalent to squaring the ORs).

logit Pr(missing) = γ 0 + γ 1 WC 1 + γ 2 Age + γ 3 Female + γ 4 Education secondary + γ 5 Education tertiary + γ 6 COB UK + γ 7 COB Mediterranean + γ 8 Alcohol low + γ 9 Alcohol moderate + γ 10 Alcohol high + γ 11 Smoking former + γ 12 Smoking current + γ 13 Physical low + γ 14 Physical moderate + γ 15 Physical high
(5)
Table 2 Specification of the logistic regression models used to impose missing data under the two covariate-dependent MAR scenarios

Methods to handle missing data

We compared two methods to handle missing data: complete-case analysis and MI. In the complete-case analysis the HRs were estimated in each of the 1,000 datasets after excluding individuals with a missing waist circumference at wave 2. For MI, 20 sets of imputed values were generated for the missing data in each of the 1,000 datasets, by sampling from a Gaussian normal regression model (using Stata’s ‘mi impute regress’ command). The imputation model included the covariates used to generate the covariate-dependent MAR missing data (i.e. the covariates that were included in the Cox regression, which ensures the maximum recovery of information about the associations of interest [31]), as well as additional auxiliary variables measured at wave 1, an indicator for whether a participant had colorectal cancer or was censored at the time of analysis, and the baseline hazard generated using the Nelson-Aalen method [3, 32]. All categorical covariates included in the imputation model were represented as k-1 indicator variables (where k represents the number of categories). Cox proportional hazards models were then fitted separately to each of the 20 ‘completed’ datasets, and the resulting estimates (log(HR)) of the effect of interest were averaged to produce an overall MI estimate with a corresponding standard error calculated using Rubin’s rules [33, 34].

Evaluation of biases, precision and coverage

We focussed on the estimation of the exposure-outcome relationship of primary interest, the HR for change in waist circumference and colorectal cancer (analysis (a)) and for waist circumference at wave 2 and colorectal cancer (analysis (b)). For analysis (a), the true value for log(HR) associated with the exposure of interest was the value that was used in Equation (1) (i.e. log(1.1) and log(1.5)). However, analysis (b) fits a different model than that under which the data were simulated (not adjusted for waist circumference at wave 1), and therefore, the “complete” value of the log(HR) was taken to be that obtained by running the Cox proportional hazard model on 10,000 simulated datasets, before any values were assigned to missing. This gave “complete” values of log(1.14) and log(1.54). In both cases, the average of the standard errors of the estimated log(HR) from 10,000 simulated datasets prior to assigning any data to missing was considered the “complete” standard error. We compared the performance of the complete-case analysis with MI for handling missing data using: absolute bias of the log(HR), the difference between the true/complete value and the average of the estimated log(HR) calculated for each of the 1,000 simulated datasets; the empirical standard error, the average standard deviation of the estimates of interest from 1,000 simulations; and the coverage, the percentage of the 95% confidence intervals that included the true value [27].

The data were simulated and the statistical analyses were performed using Stata version 11.2 [8].

Results

With increasing proportions of missing data in waist circumference at wave 2 there was no bias using complete-case analysis under any missing data scenario (i.e. MCAR, standard covariate-dependent MAR or enhanced covariate-dependent MAR) for analysis (a), whereas slight bias was observed using MI to handle the missing data. However, even with up to 50% missing data the bias was minimal and did not exceed 0.01 (absolute change in log(HR)). The bias observed under analysis model (b) was negligible using both complete-case analysis and MI to handle the missing data and did not exceed 0.005 (Figure 1). The Monte Carlo error, i.e. the noise from the finite number of simulations, did not exceed 0.4%.

Figure 1
figure 1

Absolute bias for complete-case analysis and MI for increasing proportions of missing data (0.15, 0.3, 0.5) under three missing data scenarios (a) corresponds to the epidemiological analysis of absolute change in waist circumference adjusting for waist circumference at wave 1; (b) corresponds to the epidemiological analysis of waist circumference at wave 2 (without adjusting for waist circumference at wave 1); CD-MAR refers to Covariate-dependent MAR scenario.

Figure 2 shows the empirical standard errors of the exposure-outcome estimate. The empirical standard errors are, as expected, greater than the “complete” standard error for all proportions of missing data and increase with increasing proportions of missing data using the complete-case analysis and MI. For analysis (a), MI showed minimal, if any, gain in precision for up to 50% missing data compared to complete-case analysis. There were considerably larger gains from MI with analysis (b), which included the strong auxiliary variable, waist circumference at wave 1, in the imputation model (Figure 2).

Figure 2
figure 2

Empirical standard error for complete-case analysis and MI for increasing proportions of missing data (0.15, 0.3, 0.5) under three missing data scenarios (a) corresponds to the epidemiological analysis of absolute change in waist circumference adjusting for waist circumference at wave 1; (b) corresponds to the epidemiological analysis of waist circumference at wave 2 (without adjusting for waist circumference at wave 1); CD-MAR refers to Covariate-dependent MAR scenario.

The coverage for the different missing data methods in relation to an increasing proportion of missing data is shown in Figure 3. The coverage of the true values for analysis models (a) and (b) remained near the nominal 95% level with up to 50% missing data for the two methods used to handle the missing data.

Figure 3
figure 3

Coverage of the regression coefficient estimates for complete-case analysis and MI for increasing proportions of missing data (0.15, 0.3, 0.5) under three missing data scenarios (a) corresponds to the epidemiological analysis of change in waist circumference adjusting for waist circumference at wave 1; (b) corresponds to the epidemiological analysis of waist circumference at wave 2 (without adjusting for waist circumference at wave 1); CD-MAR refers to Covariate-dependent MAR scenario.

Discussion

Using a simulation study based on the Melbourne Collaborative Cohort Study we assessed two methods (i.e. complete-case analysis and MI) for handling up to 50% missing data in a repeatedly measured exposure of interest, in the context of a Cox proportional hazards model investigating two epidemiological associations. We found very little bias and the coverage remained around 95% from both complete-case analysis and MI for both associations. For the analysis that included absolute change in waist circumference as the exposure of interest and adjusted for waist circumference at wave 1 (i.e. analysis (a)) there was no gain in precision when using MI instead of a complete-case analysis. However, there were slight gains in precision (i.e. reduction in the standard error) for MI over a complete-case analysis in analysis (b), where there was a strong auxiliary variable in the imputation model.

The simulation study that we developed was based on a large existing cohort study, the Melbourne Collaborative Cohort Study. This approach for designing simulation studies (i.e. based on real studies), which is becoming increasingly common in the literature, allows researchers to incorporate complex and realistic associations within the data structure while simplifying the data generation process compared to a fully simulated scenario [3, 1012, 15, 35, 36]. As with all simulation studies of this type, the generalisability of our findings is limited since our simulated data are based on only a single cohort. Undoubtedly, further exploration of simulation models based on other real data settings would be useful. As well, there is scope to investigate datasets with more than two waves of data collection to ascertain whether there is any gain in using MI compared to complete-case analysis in these scenarios.

The true HRs that we chose for the association between a 10cm change in waist circumference and risk of colorectal cancer (i.e. 1.1 and 1.5) were based on realistic HRs that are typically observed for this anthropometric measure [25, 37, 38]. These HRs of moderate magnitude may have minimised the bias that we observed but our results are consistent with previous work that investigated HRs of similar magnitude; Demissie et al. [10] found little bias for a HR of one for a dichotomous exposure variable and Marshall et al. [3] found minimal bias for an exposure with a HR of one for a unit change in a continuous exposure variable.

We imposed our missing data on the exposure of interest under MCAR, standard and enhanced covariate-dependent MAR scenarios. Consistent with previously published studies, we found that under the MCAR scenario complete-case analysis produced negligible bias and good coverage of the estimate [3, 10, 11]. The covariate-dependent MAR scenarios that we investigated were based on the variables observed to be predictors of non-attendance at wave 2 in the Melbourne Collaborative Cohort Study (i.e. missing waist circumference at wave 2) and the coefficients for the standard covariate-dependent MAR scenario were set to the same values as the estimates of the regression coefficients from the logistic regression model of missingness indicator at wave 2. This realistic missing data scenario allowed us to evaluate the two methods for handling missing data under weak associations of covariate-dependent MAR, which are more likely to be observed in real studies than the more extreme covariate-dependent MAR scenarios that are often reported in the missing data literature [11, 12, 39, 40]; for example, Donders et al. [39] assigned 40% of the simulated data to missing with all of the missing data occurring in the unexposed group.

We found very little bias in the log(HR) using complete-case analysis and MI to handle the missing data. The slight bias observed in analysis (a) may be a result of the imputation model being semi-compatible with the analysis model (i.e. the exposure of interest in the analysis model is change in waist circumference, however, waist circumference at wave 2 is imputed in the imputation model) [41]. We decided to impute waist circumference at wave 2 instead of change in waist circumference in order to represent the real epidemiological analysis (i.e. in a study where the variable is fully observed at wave 1 and missing at wave 2, the analyst is more likely to impute the variable with missing data and then calculate the absolute change between the two variables). The imputation model that we used included an indicator variable for whether a participant had colorectal cancer or was censored at the time of analysis, and the baseline hazard generated using the Nelson-Aalen method in the imputation model. Although Marshall et al. [3] suggest that this may be a better method to use in the imputation model than including a log transformation of the survival time and event status it may have introduced bias into our results [32]. Further, our MAR scenario was a covariate-dependent scenario, which may be specific to our study and research looking at MAR scenarios dependent on both covariates and the outcome should be considered. Previous published simulation studies, which reported biased estimates using complete-case analysis or MI, induced a missingness mechanism dependent on the exposure and outcome variables and under more extreme missingness scenarios [3, 10, 11, 15, 40].

The auxiliary variables included in our imputation model (i.e. variables not included in the epidemiological analyses) were alcohol intake, smoking status, and physical activity at baseline. These variables had only weak to moderate associations with waist circumference at wave 2. To assess the impact of an auxiliary variable that has a strong association with the exposure of interest we compared MI with a complete-case analysis for handling missing data under two scenarios: (a) the association between change in waist circumference and risk of colorectal cancer adjusted for waist circumference at wave 1, and (b) waist circumference at wave 2 and the risk of colorectal cancer, not adjusted for waist circumference at wave 1. For analysis (b), waist circumference at wave 1 was included in the imputation model as a strong auxiliary variable, with no strong auxiliary variables in model (a). MI provided no gain in precision of the estimate compared to complete-case analysis in analysis (a) where the imputation procedure only included auxiliary variables with weak and moderate associations with the variable with missing data. However, slight gains in precision were observed for the MI estimate compared to the complete-case estimate in analysis (b). Graham and Collins [42] used simulations of artificial data to show that strong auxiliary variables included in the imputation model for MI restored some of the power lost due to missing data. Real data examples are less likely to have auxiliary variables that are strongly associated with the variable subject to missing values; for example Marshall et al. [3] reported correlations in the range of 0.3 and 0.4, and Lee and Carlin [43] reported correlations between 0.1 and 0.6 between the covariates and the variable with missing data. Incorporating auxiliary variables with weak or moderate associations with the variables with missing data into the imputation model will result in large between-imputation variance leading to larger standard errors for the MI estimates and thus, smaller gains (if any) from using MI compared to a complete-case analysis.

Complete-case analysis is the default method for most software packages for handling missing data in statistical analyses. However, MI is now available and easy to implement in many software packages (e.g. Stata, R, SAS and SPSS [7, 8, 44, 45]). This increased accessibility has led to an increase in the use of MI for dealing with missing data in epidemiological studies [34, 46]. MI produces unbiased estimates if the missing data mechanism is MAR, which encompasses the more specific scenario of covariate-dependent MAR [4, 47]. Data Missing Not at Random (MNAR) occur when the study participants with missing data differ from the study participants with complete data in a manner that cannot be explained by the observed data [9]. In our simulation study we did not include an MNAR missing data scenario. It has been suggested that for cohort studies that collect a large amount of information from their participants, the observed data can provide a lot of information about the missing data. Further, the imputation model may include combinations of observed variables that represent surrogate measures of the unobserved variables that are related to the missingness mechanism [48]. However, whether the data are MAR, either covariate-dependent or more generally, is untestable and therefore, further research investigating alternative approaches that explore the sensitivity of conclusions to plausible MNAR mechanisms or simultaneously estimate the missing data model and the analysis model will be important [49].

Conclusion

The findings from this simulation study, which used a data generation model designed to replicate a large longitudinal cohort study, demonstrate that for an epidemiological study assessing the association between change in an exposure measure assessed at two waves of data collection, with up to 50% missing data at the second measurement, and a fully observed time to event outcome, there is no advantage in using MI over the simple complete-case analysis approach when the missing data are only associated with other measured covariates and not the outcome. MI did show gains in precision for an analysis that ignored a strongly predictive baseline measure, which became a strong auxiliary variable when included in the imputation model. However, the latter example was somewhat artificial and it seems unlikely that in many real data scenarios there will be auxiliary variables (i.e. variables not already included as covariates in the analysis model) that exhibit such strong associations with the variable subject to missing data.