Statistical modelling for falls count data

https://doi.org/10.1016/j.aap.2009.08.018

Abstract

Falls and their injury outcomes have count distributions that are highly skewed toward the right with clumping at zero, posing analytical challenges. Different modelling approaches have been used in the published literature to describe falls count distributions, often without consideration of the underlying statistical and modelling assumptions. This paper compares the use of modified Poisson and negative binomial (NB) models as alternatives to Poisson (P) regression, for the analysis of fall outcome counts. Four different count-based regression models (P, NB, zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB)) were each individually fitted to four separate fall count datasets from Australia, New Zealand and the United States. The finite mixtures of P and NB regression models were also compared to the standard NB model. Both analytical (F, Vuong and bootstrap tests) and graphical approaches were used to select and compare models. Simulation studies assessed the size and power of each model fit. This study confirms that falls count distributions are over-dispersed, but that the over-dispersion is not due to excess zero counts or a heterogeneous population. Accordingly, the P model generally provided the poorest fit to all datasets. The fit improved significantly with NB and both zero-inflated models. The fit was also improved with the NB model, compared to finite mixtures of both P and NB regression models. Although there was little difference in fit between NB and ZINB models, in the interests of parsimony it is recommended that future studies involving modelling of falls count data routinely use the NB model in preference to the P, ZINB or finite mixture models. The fact that these conclusions apply across four separate datasets from four different samples of older people participating in studies of different methodology adds strength to this general guiding principle.

Introduction

Falls can have common and serious consequences for older people (Robertson et al., 2005). With an ageing population, the rise in the number of falls and the cost of their treatment is predicted to lead to a huge burden on the individual and the community (Moller, 2005). Falls epidemiology data describing the magnitude of, and trends in, the problem has largely been descriptive in nature (Boufous et al., 2006, Boufous et al., 2004). It is important that good statistical models are used to generate accurate and reliable information to guide policy decisions in relation to priority setting and intervention investments to tackle the fall injury problem. As with other areas of public health, there has been an increased interest in statistical modelling of injury count data, including falls outcomes, in recent years (Chin and Quddus, 2003, Lord et al., 2004, Lord et al., 2005, Robertson et al., 2005).

Datasets of the number of falls and fall-related injuries take the form of discrete count data characterized by a large proportion of zero counts, with the remaining values being highly skewed toward the right. This is because fall incidents are relatively rare and most people will not sustain a serious injury if they do fall. Moreover, falls can also be recurrent events, in that over a period of time an individual may experience one or more falls (Williamson et al., 1996, Stalenhoef et al., 2002), and this recurrence aspect needs to be incorporated into appropriate statistical models of fall counts. In a very recent systematic review (Donaldson et al., 2009), fewer than one-third of the 83 reviewed papers used appropriate statistical methods to analyse falls as a recurrent event.

To further progress falls epidemiology, there is a need for a unified and justified approach to the use of appropriate statistical models for these data, taking into account the large proportion of zero counts and the possibility of recurrent falls. A number of published studies have incorrectly assumed a normal distribution when modelling falls count data and used Student's t-test, linear regression, or analysis of variance, as has been highlighted elsewhere (Robertson et al., 2005). Other analysts have argued that falls count data does not meet the usual normality assumption required of many standard statistical tests and have therefore relied on a transformation to induce normality (Slymen et al., 2006). This can be problematic in that transformations often do not yield normally distributed data and can make the interpretation of regression coefficients difficult because they are not estimated on the original scale (Byers et al., 2003).

An alternative, more common, approach has been to assume a Poisson (P) model, which is better suited to fall count processes and has become quite widespread in public health to model the number of events or rates (Mwalili et al., 2008), especially when there are few incidents and hence many observed zeros (Shankar et al., 1997). However, the P model requires that the mean is equal to the variance, and if the number of observed zeros far exceeds the number expected under that model, this key feature of the P structure is violated. Often, falls count data exhibit more variability than the nominal variance under the P model, a condition called over-dispersion (in that the sample variance exceeds the mean). Such over-dispersion in count data can occur because of excess zeros, unexplained heterogeneity, or temporal dependency (Cameron and Trivedi, 1998). With regard to recurrent events, the P model assumes that such events occur independently of each other. This assumption is violated for fall outcomes, as a major risk factor for a subsequent fall is a previous fall (Donaldson et al., 2009, Hill et al., 1999).
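
The equal mean and variance property, and the comparison of observed with expected zeros, can be checked directly on a set of counts before any model is fitted. The following is a minimal Python sketch under stated assumptions: the function name `dispersion_summary` and the twenty falls counts are hypothetical and purely illustrative, and are not taken from the datasets analysed in this paper.

```python
import math

def dispersion_summary(counts):
    """Summarise a list of counts against what a Poisson model would imply."""
    n = len(counts)
    mean = sum(counts) / n
    # Sample variance (divisor n - 1)
    variance = sum((y - mean) ** 2 for y in counts) / (n - 1)
    observed_zero = sum(1 for y in counts if y == 0) / n
    # Under a Poisson model with this mean, P(Y = 0) = exp(-mean)
    poisson_zero = math.exp(-mean)
    return {
        "mean": mean,
        "variance": variance,
        "ratio": variance / mean,  # close to 1 under a Poisson model
        "observed_zero": observed_zero,
        "poisson_zero": poisson_zero,
    }

# Hypothetical falls counts for 20 people (illustrative only)
falls = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 5, 9]
summary = dispersion_summary(falls)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A variance-to-mean ratio well above 1, together with an observed zero proportion above the Poisson expectation, is the pattern described above as over-dispersion. The sketch is descriptive only and does not replace a formal test.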

The negative binomial (NB) model has a built-in dispersion parameter that can account for situations where the variance is greater than the mean (Chin and Quddus, 2003). A number of studies have therefore argued for the NB model as an alternative to the P model when count data are over-dispersed in relation to the mean (Bliss and Fisher, 2003, Byers et al., 2003, White and Bennetts, 1996). Such a modelling approach can also be appropriate when count data are recurrent (Glynn and Buring, 1996). The NB model explicitly accounts for the heterogeneity by modelling the Poisson mean as a Gamma random variable and introducing an extra dispersion parameter (Johnson et al., 2005, Lord, 2006).
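
To illustrate how the extra dispersion parameter allows the variance to exceed the mean, the sketch below evaluates the NB probability mass function numerically. It assumes the common parameterisation with mean μ and dispersion parameter k, under which the variance is μ + μ²/k. This parameterisation, the function names and the values μ = 1.2 and k = 0.5 are assumptions made for illustration and may differ from the form used in the full text.

```python
import math

def nb_pmf(y, mu, k):
    """NB probability of count y, with mean mu and dispersion parameter k.

    Assumed parameterisation: variance = mu + mu**2 / k, so the model
    approaches the Poisson as k grows large.
    """
    log_p = (
        math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
        + k * math.log(k / (k + mu))
        + y * math.log(mu / (k + mu))
    )
    return math.exp(log_p)

def moments(pmf, upper=500):
    """Mean and variance by direct summation over counts 0 .. upper - 1."""
    mean = sum(y * pmf(y) for y in range(upper))
    variance = sum((y - mean) ** 2 * pmf(y) for y in range(upper))
    return mean, variance

mu, k = 1.2, 0.5  # illustrative values only
mean, variance = moments(lambda y: nb_pmf(y, mu, k))
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
print(f"mu + mu^2/k = {mu + mu ** 2 / k:.4f}")
```

With a small k the variance is several times the mean, which a P model with the same mean cannot reproduce.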

Although P and NB models have been the most common choices to date, it is possible that they could still fail to fit a set of data with a lot of zeros because of zero-inflation, over-dispersion, or both (Deng and Paul, 2005). As an extension of standard P and NB models, zero-inflated count models have gained considerable recognition as an alternative means of handling count data with a preponderance of zeros (Lambert, 1992, Gupta et al., 1996, Li et al., 1999, Lord et al., 2004, Lord, 2006). For this type of count data, more zeros are observed than would be predicted by a standard P or NB process (Park and Lord, 2009, Lord et al., 2007, Warton, 2005). It is generally believed that data with excess zeros come from two sources or two distinct distributions, hence the aptly named dual-state process. The underlying assumption of this two-state process gives a simple two-component mixture distribution with the first state having only zeros, while the other state leads to a standard P or NB count model. In general, the zeros from the first state are called structural zeros and those from the P or NB models are called sampling zeros or non-structural zeros.
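
The two-state mixture can be written down compactly for the Poisson case. The sketch below assumes the usual ZIP form, in which a structural zero occurs with probability ϕ and a standard P count with mean μ is drawn otherwise. The function names and parameter values are hypothetical and are chosen only to show how structural and sampling zeros combine.

```python
import math

def poisson_pmf(y, mu):
    """Standard Poisson probability of count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zip_pmf(y, mu, phi):
    """Zero-inflated Poisson probability of count y.

    phi is the probability of the zero-only state (structural zeros);
    with probability 1 - phi the count comes from a Poisson(mu), which
    can itself produce zeros (sampling zeros).
    """
    structural = phi if y == 0 else 0.0
    return structural + (1.0 - phi) * poisson_pmf(y, mu)

mu, phi = 1.2, 0.3  # illustrative values only
print(f"P(Y = 0), Poisson: {poisson_pmf(0, mu):.4f}")
print(f"P(Y = 0), ZIP:     {zip_pmf(0, mu, phi):.4f}")
```

Setting ϕ = 0 recovers the standard P model, which is why the zero-inflated models are described as extensions of the standard models.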

In recent years, there has been considerable interest in regression models based on zero-inflated count models. Much of this interest stems from the seminal paper of Lambert (1992) though this type of model appears to have originated in the econometrics literature. Mullahy (1986) first formulated the zero-inflated Poisson (ZIP) regression model and such models have since been applied in many topic areas: the number of defects in a manufacturing process (Lambert, 1992); the abundance of rare species (Welsh et al., 1996); road accident frequencies (Shankar et al., 1997, Shankar et al., 2003, Qin et al., 2004, Kumara and Chin, 2003, Lee and Mannering, 2002); dental caries epidemiology (Bohning et al., 1999); pharmaceutical utilization and expenditure (Street et al., 1999); early growth and motor development (Cheung, 2002); and physical activity (Slymen et al., 2006).

In addition to zero-inflated models, there are many further extensions to the classical P and NB models, such as finite mixture models. These finite mixture models are particularly useful for heterogeneous populations that incorporate a combination of counts and continuous representation of population heterogeneity. For a mathematical derivation and discussion of the application of finite mixture models, readers are referred to McLachlan and Peel (2000). Most recently, Park and Lord (2009) have proposed finite mixtures of P and NB models for analyzing motor vehicle crash data.

The modelling considerations raised above have significant implications for the description of falls data and published studies have used a variety of statistical approaches. To our knowledge, a full range of P and modified P (i.e. NB and zero-inflated) models have not been formally compared in terms of their applicability to falls data. Although Robertson et al. (2005) used the NB model in their consideration of statistical models for falls intervention trials, they compared it to two survival analysis models (the Andersen-Gill and marginal Cox regression) and not directly to other count distributions.

The aim of this paper is therefore to compare the applicability of statistical count distributions to falls count data and to provide a clear rationale for future falls distribution-modelling approaches. In doing so, this study provides defensible guidance on how to appropriately model falls data in studies aiming to describe trends in injury numbers and rates. The paper has five objectives: to (1) overview the rationale for, and use of, P, NB, ZIP and zero-inflated negative binomial (ZINB) models, (2) apply the four models to real-world falls count data and compare how well the various models approximate these data, (3) formally compare the four models, (4) report a statistical simulation experiment as a means of assessing the size and power of the model fit, and (5) compare the NB model with finite mixtures of P or NB estimated using the same data.

Methods

A description of the data used in the example is first presented, so that the relevant features of the four regression models can be later described in the specific context of these data.

Model estimation framework

For each of the four model types, the maximum likelihood estimation (MLE) method was used to estimate μ, k and ϕ parameters and their corresponding standard errors and confidence limits for the falls count data, as relevant. The MLE was chosen, compared to other estimators, because it has properties of consistency, asymptotic normality and minimum variance for large samples. The MLE method was used to fit the falls data by applying a generalised linear model from underlying P, NB or

Model accuracy

The most common criterion for evaluating the performance of a statistical model is its accuracy in terms of fitting the data. Let $f_i$ denote the observed frequency of the $i$th fall and $\hat{f}_i$ denote the fitted frequency. The error is defined as $e_i = f_i - \hat{f}_i$ and the percentage error is $p_i = 100 e_i / f_i$. Percentage errors have the advantage of being scale independent, so they are frequently used to compare model performance between different data series (Hyndman and Koehler, 2006). The most widely used measures
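
As a worked illustration of these error definitions, the sketch below takes the error as the observed minus the fitted frequency and computes percentage errors together with their mean absolute value. The choice of the mean absolute percentage error as the summary is an assumption, since the snippet above is truncated before the measures are named. The frequencies and function names are hypothetical.

```python
def percentage_errors(observed, fitted):
    """Errors (observed minus fitted) and percentage errors per count value.

    Cells with an observed frequency of zero are skipped for the
    percentage error, which is undefined there.
    """
    errors = [f - f_hat for f, f_hat in zip(observed, fitted)]
    pct = [100.0 * e / f for e, f in zip(errors, observed) if f != 0]
    return errors, pct

def mean_absolute_percentage_error(pct):
    """Average of the absolute percentage errors."""
    return sum(abs(p) for p in pct) / len(pct)

# Hypothetical observed and fitted frequencies for 0, 1, 2 and 3 falls
observed = [120, 40, 20, 10]
fitted = [114.0, 44.0, 21.0, 9.0]

errors, pct = percentage_errors(observed, fitted)
mape = mean_absolute_percentage_error(pct)
print(errors, pct, mape)
```

Because the percentage error divides by the observed frequency, sparse cells in the right tail of a falls distribution can dominate such a summary, which is one reason to inspect the individual errors as well.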

Comparing models

Four criteria were used to compare and select among the considered models: likelihood ratio, F-test, Vuong statistic and bootstrap test. The likelihood ratio test is well understood and is not discussed further. The basic criterion of the F and bootstrap tests is to compare two models where one model should be nested within the other model (i.e. when one model is an extension to the other). For example, the P model is nested within the NB model and there is therefore a need to test if there is
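
The Vuong statistic is commonly applied to pairs of models that are not nested, such as a zero-inflated model and its standard counterpart. A minimal sketch of the calculation is given below, assuming that the log-likelihood contribution of each observation is already available from two fitted models. The function name and the three log-likelihood values are hypothetical, and the standard deviation uses the divisor n, whereas some software uses n − 1.

```python
import math

def vuong_statistic(loglik_1, loglik_2):
    """Vuong statistic from per-observation log-likelihoods of two models.

    Large positive values favour model 1, large negative values favour
    model 2; the statistic is referred to a standard normal distribution.
    """
    if len(loglik_1) != len(loglik_2):
        raise ValueError("both models must be evaluated on the same observations")
    n = len(loglik_1)
    # Pointwise log-likelihood ratios
    m = [a - b for a, b in zip(loglik_1, loglik_2)]
    m_bar = sum(m) / n
    # Standard deviation of the ratios (divisor n); zero if all ratios are equal
    omega = math.sqrt(sum((x - m_bar) ** 2 for x in m) / n)
    return math.sqrt(n) * m_bar / omega

# Hypothetical per-observation log-likelihoods (illustrative only)
ll_model_1 = [-1.0, -2.0, -3.0]
ll_model_2 = [-1.5, -2.0, -3.5]
v = vuong_statistic(ll_model_1, ll_model_2)
print(f"Vuong statistic: {v:.4f}")
```

In practice the inputs would come from the maximum likelihood fits of the two competing models to the same falls data, and a value between about −1.96 and 1.96 would indicate that neither model is preferred at the 5% level.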

Simulation framework

Simulation studies are increasingly being used in the public health literature for a wide variety of situations (Vaeth and Skovlund, 2004). There are several advantages of simulations compared with collecting and/or analyzing real data (Burton et al., 2006, Demirtas, 2007). Firstly, a large number of samples of representative falls data can be created rather than being restricted to using only one (or just a few) dataset and this enables the distributions of statistical parameters to be

Comparison of finite mixture models with standard and zero-inflated P and NB models

The Poisson and NB mixture models with a fixed number of components (K = 2, 3) were estimated with the expectation-maximization (EM) algorithm within a maximum likelihood framework and with Markov Chain Monte Carlo (MCMC) sampling within a Bayesian framework (Stasinopoulos and Rigby, 2007, Leisch, 2004). Models were compared using a penalised-likelihood approach for model selection: Akaike's information criterion (AIC) and the Bayesian information criterion (BIC) (Park and Lord, 2009, Warton, 2005
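
The two penalised-likelihood criteria can be computed from a maximised log-likelihood, the number of estimated parameters and the sample size. The sketch below uses the standard definitions, AIC = 2p − 2 ln L and BIC = p ln n − 2 ln L. The log-likelihoods, parameter counts and sample size are hypothetical values chosen for illustration and are not results from this study.

```python
import math

def aic(loglik, n_params):
    """Akaike's information criterion: smaller values indicate a better trade-off."""
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: penalises parameters more heavily as n grows."""
    return n_params * math.log(n_obs) - 2 * loglik

# Hypothetical fits: (maximised log-likelihood, number of parameters)
candidates = {
    "NB": (-512.3, 3),
    "2-component P mixture": (-511.9, 5),
}
n_obs = 400  # hypothetical sample size

scores = {
    name: (aic(ll, p), bic(ll, p, n_obs))
    for name, (ll, p) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
for name, (a, b) in scores.items():
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")
```

In this made-up example the mixture has a slightly higher log-likelihood but more parameters, so both criteria favour the simpler model. This is the kind of trade-off that penalised-likelihood selection is designed to expose.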

Conclusions

There are several well-developed potential statistical models for analyzing falls count data but, to date, there has been little guidance on which is the most appropriate approach to use, and there are many published studies that have used incorrect statistical models for analyzing over-dispersion and recurrent fall events (Donaldson et al., 2009). Robertson et al. (2005) compared the NB model to two survival analysis models using two datasets, and concluded that the NB model was as appropriate

Acknowledgements

Project work supported by a grant from the Australian Government Department of Health and Ageing to undertake falls modelling research provided the impetus for this paper. John Campbell and Clare Robertson, Department of Medical and Surgical Sciences, Dunedin School of Medicine, University of Otago, New Zealand provided the falls data from the New Zealand trial used in this study. Dominique Lord, Zachry Department of Civil Engineering, Texas A&M University and Byung-Jung Park, Texas

References (58)

  • C.I. Bliss et al. Fitting the negative binomial distribution to biological data. Biometrics (2003)
  • D. Bohning et al. The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. Journal of the Royal Statistical Society A (1999)
  • S. Boufous et al. Incidence of hip fracture in New South Wales: are our efforts having an effect? The Medical Journal of Australia (2004)
  • S. Boufous et al. The epidemiology of hospitalised wrist fractures in older people, New South Wales, Australia. Bone (2006)
  • A. Burton et al. The design of simulation studies in medical statistics. Statistics in Medicine (2006)
  • A.L. Byers et al. Application of negative binomial modelling for discrete outcomes: a case study in ageing research. Journal of Clinical Epidemiology (2003)
  • A.C. Cameron et al. Regression Analysis of Count Data (1998)
  • Y.B. Cheung. Zero-inflated models for regression analysis of count data: a study of growth and development. Statistics in Medicine (2002)
  • H.C. Chin et al. Modeling count data with excess zeroes: an empirical application to traffic accidents. Sociological Methods and Research (2003)
  • R. Davidson et al. The power of bootstrap and asymptotic tests. Journal of Econometrics (2006)
  • H. Demirtas. Letter to the Editor re: the design of simulation studies in medical statistics. Statistics in Medicine (2007)
  • D. Deng et al. Score tests for zero-inflation and over-dispersion in generalized linear models. Statistica Sinica (2005)
  • M.G. Donaldson et al. Analysis of recurrent events: a systematic review of randomised controlled trials of interventions to prevent falls. Age and Ageing (2009)
  • R.J. Glynn et al. Ways of measuring rates of recurrent events. British Medical Journal (1996)
  • W.H. Greene. Econometric Analysis (2000)
  • P.L. Gupta et al. Analysis of zero-adjusted count data. Computational Statistics and Data Analysis (1996)
  • K. Hill et al. Falls among healthy, community-dwelling, older women: a prospective study of frequency, circumstances, consequences and prediction accuracy. Australian and New Zealand Journal of Public Health (1999)
  • R.J. Hyndman et al. Another look at measures of forecast accuracy. International Journal of Forecasting (2006)
  • N.L. Johnson et al. Univariate Discrete Distributions (2005)
  • S.S. Kumara et al. Modeling accident occurrence at signalized tee intersections with special emphasis on excess zeros. Traffic Injury Prevention (2003)
  • D. Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics (1992)
  • J. Lee et al. Impact of roadside features on the frequency and severity of run-off-roadway accidents: An empirical analysis. Accident Analysis and Prevention (2002)
  • F. Leisch. FlexMix: a general framework for finite mixture models and latent class regression in R. Journal of Statistical Software (2004)
  • C. Li et al. Multivariate zero-inflated Poisson models and their applications. Technometrics (1999)
  • D. Lord. Modeling motor vehicle crashes using Poisson-gamma models: examining the effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter. Accident Analysis and Prevention (2006)
  • Lord, D., Washington, S.P., Ivan, J.N., 2004. Statistical challenges with modeling motor vehicle crashes: understanding...
  • D. Lord et al. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis and Prevention (2005)
  • D. Lord et al. Further notes on the application of zero-inflated models in highway safety. Accident Analysis and Prevention (2007)
  • S. Makridakis et al. Forecasting: Methods and Applications (1998)