Model selection for time series of count data
Introduction
There is a plethora of integer-valued time series models for modelling low count time series data. There are two broad classes of approaches for constructing integer-valued time series models: observation-driven (e.g. McKenzie (2003); Neal and Subba Rao (2007); Enciso-Mora et al. (2009a)) and parameter-driven (e.g. Davis et al. (2003)) models; see Davis et al. (2015) for an overview.
The INAR($p$) model, the $p$th order integer autoregressive model, is a prime example of an observation-driven model. These models are motivated by real-valued time series models, primarily ARMA (autoregressive-moving average) models, and the desire to adapt such models to an integer-valued scenario. A time series $\{X_t\}$ is said to follow an INAR($p$) model if, for $t \in \mathbb{Z}$,
$$X_t = \sum_{i=1}^{p} \alpha_i \circ X_{t-i} + Z_t, \qquad (1.1)$$
where $\alpha_1, \ldots, \alpha_p$ are the autoregressive parameters, $\circ$ denotes a thinning operator and $\{Z_t\}$ are independent (typically identically distributed), integer-valued random variables. Note that if $\circ$ in (1.1) represented multiplication and the $Z_t$ were Gaussian, then we would recover a standard real-valued AR($p$) process. The thinning operator, a generalised Steutel and van Harn operator (Steutel and van Harn, 1979), ensures that $\alpha_i \circ X_{t-i}$ is an integer-valued random variable. A binomial thinning operator is the most common choice, such that $\alpha_i \circ X_{t-i} \sim \mathrm{Bin}(X_{t-i}, \alpha_i)$. The most common choice of $Z_t$ is a Poisson distribution with mean $\lambda$, which combined with the binomial thinning operator and the condition $\sum_{i=1}^{p} \alpha_i < 1$ leads to the stationary distribution of $X_t$ being Poisson with mean $\lambda / (1 - \sum_{i=1}^{p} \alpha_i)$.
Parameter-driven models are based on the observed counts being driven by some underlying, unobserved latent process, $\{W_t\}$ (Durbin and Koopman (2000) and Davis et al. (2003)), for example a real-valued ARMA process, see Dunsmuir (2015). With Poisson distributed counts, a log-link function is used to link the latent process, $\{W_t\}$, and the observed count process, $\{X_t\}$.
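This results in a generalised linear model with
$$X_t \mid W_t \sim \mathrm{Po}\left(\exp(\mu + W_t)\right), \qquad (1.2)$$
where $\{X_t\}$ is the observed count process and $\{W_t\}$ is the latent process. The observation and parameter driven models described above can be extended in many ways, for example by including time dependent covariates in the INAR parameters (Enciso-Mora et al., 2009b) or in the log-mean in (1.2) through a linear predictor (Davis et al., 2003). Other examples are the development of INARMA extensions of (1.1), see Neal and Subba Rao (2007), and INGARCH models (Fokianos, 2011), where for $t \in \mathbb{Z}$,
$$X_t \mid \mathcal{F}_{t-1} \sim \mathrm{Po}(\lambda_t), \qquad (1.3)$$
with $\lambda_t = \delta + \sum_{i=1}^{p} \alpha_i \lambda_{t-i} + \sum_{j=1}^{q} \beta_j X_{t-j}$ and $\mathcal{F}_t$ is the $\sigma$-field generated by the counts and conditional means up to time $t$. The INGARCH model seeks to mimic the behaviour of GARCH models, with alternative forms of $\lambda_t$ considered in Fokianos (2011). It should be noted that for $p = 0$, the INGARCH model reduces to an INAR($q$) model with Poisson thinning, $\beta_j \circ X_{t-j} \sim \mathrm{Po}(\beta_j X_{t-j})$, and $Z_t \sim \mathrm{Po}(\delta)$. For parameter driven models there are alternative latent process formulations in which the log-linked latent process is replaced by a Markovian process, see Aktekin et al. (2013) and Aktekin et al. (in press). Negative binomially distributed counts, as opposed to Poisson distributed counts, can also be included in (1.2), see for example Windle et al. (2013).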
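As an illustration of the INAR model (1.1) with binomial thinning and Poisson innovations described above, the following is a minimal simulation sketch. The function names, the burn-in length and the zero initial state are assumptions made for this example and are not taken from the paper or its supplementary R code.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson variate with Knuth's multiplication method (suits small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def binomial_thin(rng, alpha, x):
    """Binomial thinning: the sum of x independent Bernoulli(alpha) draws."""
    return sum(rng.random() < alpha for _ in range(x))


def simulate_inar(alphas, lam, n, burn_in=200, seed=1):
    """Simulate n values from an INAR(p) process with Poisson(lam) innovations.

    alphas[i] is the thinning probability applied to the value i + 1 steps back.
    The chain starts from zeros (an assumption) and a burn-in is discarded.
    """
    rng = random.Random(seed)
    x = [0] * len(alphas)
    for _ in range(burn_in + n):
        survivors = sum(binomial_thin(rng, a, x[-i - 1]) for i, a in enumerate(alphas))
        x.append(survivors + poisson(rng, lam))
    return x[-n:]


series = simulate_inar([0.3, 0.2], lam=1.0, n=2000)
print(sum(series) / len(series))  # sample mean of the simulated counts
```

For these illustrative values the thinning probabilities sum to 0.5, so the sample mean can be compared with the stationary mean of 1.0 / (1 - 0.5) = 2.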
Given the wide range of models for integer-valued time series, a key question is: what is the most appropriate model for a given data set? This leads to a secondary question of the appropriate order $p$ for an INAR($p$) model or an AR($p$) latent process. For INAR models, efficient reversible jump MCMC algorithms (Green, 1995) have been developed in Enciso-Mora et al. (2009a) and Enciso-Mora et al. (2009b) for determining the order of the model and the inclusion/exclusion of covariates. Reversible jump MCMC could also be employed for determining the most appropriate order of an AR($p$) latent process. However, it is far more difficult to employ reversible jump MCMC for comparing different classes of models, due to the need to develop efficient trans-dimensional moves between different models, see Brooks et al. (2003). Therefore in this work we focus primarily on choosing between different classes of integer-valued time series models, although we also illustrate our approach for determining the model order $p$.
In this paper we consider model selection in a Bayesian framework, via direct computation of the marginal likelihood, also known as the model evidence, and alternatively using the deviance information criterion (DIC), Spiegelhalter et al. (2002). We focus for illustration purposes on three models: the INAR model (1.1), the AR Poisson regression model given by (1.2) with the latent process being an AR($p$) process, and the INGARCH model (1.3). To estimate the marginal likelihood, we extend the two stage algorithm given in Touloupou et al. (in press), which first estimates the posterior distribution using MCMC and then uses a parametric approximation of the posterior distribution to estimate the marginal likelihood via importance sampling. This leads to the two key novel contributions of this paper. Firstly, we introduce a particle MCMC algorithm (Andrieu et al., 2010) for estimating the parameters of the AR Poisson regression model. This involves using a particle filter (Gordon et al., 1993) to estimate the likelihood of the data given $\theta$, where $\theta$ denotes the parameters of the model. Secondly, the use of the particle filter to estimate the likelihood is exploited both in the effective estimation of the marginal likelihood using the algorithm of Touloupou et al. (in press) and in giving a mechanism for estimating the DIC without the need to resort to data augmentation and the problems that this potentially entails, Celeux et al. (2006).
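The two stage idea described above (MCMC first, then importance sampling from a parametric approximation of the posterior) can be sketched on a toy problem where the marginal likelihood is known exactly. Everything specific here is an assumption chosen for illustration: i.i.d. Poisson counts with a Gamma(a, b) prior on the rate, a random walk Metropolis sampler on the log rate, and a Student-t proposal with inflated scale. It is not the algorithm of Touloupou et al. as implemented in the paper.

```python
import math
import random


def log_target(phi, data, a, b):
    """Log of likelihood times prior for phi = log(rate), including the Jacobian."""
    lam = math.exp(phi)
    loglik = sum(x * phi - lam - math.lgamma(x + 1) for x in data)
    logprior = a * math.log(b) - math.lgamma(a) + (a - 1) * phi - b * lam
    return loglik + logprior + phi


def exact_log_evidence(data, a=2.0, b=1.0):
    """Closed-form log marginal likelihood for the Poisson-gamma toy model."""
    s, n = sum(data), len(data)
    return (a * math.log(b) - math.lgamma(a) + math.lgamma(a + s)
            - (a + s) * math.log(b + n) - sum(math.lgamma(x + 1) for x in data))


def log_evidence_is(data, a=2.0, b=1.0, n_mcmc=5000, n_is=5000,
                    inflate=1.5, nu=5.0, seed=1):
    rng = random.Random(seed)
    # Stage 1: random walk Metropolis on phi to learn posterior location and scale.
    phi = math.log(sum(data) / len(data) + 0.5)
    cur = log_target(phi, data, a, b)
    draws = []
    for _ in range(n_mcmc):
        prop = phi + rng.gauss(0.0, 0.5)
        new = log_target(prop, data, a, b)
        if rng.random() < math.exp(min(0.0, new - cur)):
            phi, cur = prop, new
        draws.append(phi)
    draws = draws[n_mcmc // 5:]  # discard burn-in
    m = sum(draws) / len(draws)
    s = math.sqrt(sum((d - m) ** 2 for d in draws) / (len(draws) - 1))
    # Stage 2: importance sampling from a heavier-tailed Student-t approximation.
    scale = inflate * s
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(scale))
    logw = []
    for _ in range(n_is):
        t = rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(nu / 2, 2.0) / nu)
        logq = log_c - (nu + 1) / 2 * math.log1p(t * t / nu)
        logw.append(log_target(m + scale * t, data, a, b) - logq)
    mx = max(logw)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(w - mx) for w in logw) / n_is)


data = [2, 0, 3, 1, 4, 2, 1, 0, 2, 3]
print(log_evidence_is(data), exact_log_evidence(data))
```

The heavier-tailed proposal is a design choice that keeps the importance weights bounded in the tails; when the likelihood is not available in closed form, the call to `log_target` would use a likelihood estimate instead.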
The remainder of this paper is structured as follows. In Section 2 we introduce the particle MCMC algorithm for the AR Poisson regression model. Given that Neal and Subba Rao (2007) provide an effective data augmentation MCMC algorithm for INAR models, we utilise the algorithm provided there in our analysis, whilst in Section 3 we give brief details of an MCMC algorithm for the INGARCH model, which is particularly straightforward to implement as no data augmentation is required. In Section 4 we present the generic approach to model selection which is employed for all three integer-valued time series models under consideration. In Section 5 we conduct a simulation study which demonstrates the ability of the approaches described in Section 4 to determine the true model. The simulation study also provides insights into the AR Poisson regression model and issues associated with identifying the autoregressive parameters in the latent process. In Section 6 we apply the AR Poisson regression, INAR and INGARCH models to two real-life data sets: monthly US polio cases (1970–1983) and monthly benefit claims from the logging industry to the British Columbia Workers Compensation Board (1985–1994). We show that an AR Poisson regression model is preferred for the polio data, and that the inclusion of the covariates proposed by Zeger (1988) for these data leads to only a small increase in the marginal likelihood. By contrast, the INGARCH model is preferred for the benefit claims data, with significant evidence for the inclusion of a summer effect. All the data sets analysed in Sections 5 and 6, along with the R code used for the analysis, are provided as supplementary material. Finally, in Section 7 we make some concluding observations.
Section snippets
AR Poisson regression model
In this Section we introduce an adaptive particle MCMC algorithm for obtaining samples from the posterior distribution of AR Poisson regression models (Zeger (1988), Davis et al. (2003)). The AR Poisson regression model assumes that we observe a (Poisson) count process which depends upon a (typically unobserved) latent AR process. Specifically, we assume that the counts are conditionally Poisson given the latent process, with a mean that depends upon explanatory variables
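To make the role of the particle filter concrete, the following is a minimal bootstrap particle filter sketch that estimates the log-likelihood of an AR(1) Poisson regression model. The specific form used here, counts distributed as Poisson with mean exp(mu + W_t) and W_t = a W_{t-1} + Gaussian noise with standard deviation sigma, together with the parameter names, the absence of covariates and multinomial resampling, are assumptions for illustration rather than the paper's exact algorithm.

```python
import math
import random


def log_poisson_pmf(x, log_mean):
    """Log probability of count x under a Poisson with mean exp(log_mean)."""
    return x * log_mean - math.exp(log_mean) - math.lgamma(x + 1)


def particle_loglik(counts, mu, a, sigma, n_particles=500, seed=1):
    """Bootstrap particle filter estimate of the log-likelihood (requires |a| < 1)."""
    rng = random.Random(seed)
    # Start the particles from the stationary distribution of the latent AR(1).
    sd0 = sigma / math.sqrt(1.0 - a * a)
    particles = [rng.gauss(0.0, sd0) for _ in range(n_particles)]
    loglik = 0.0
    for t, x in enumerate(counts):
        if t > 0:
            # Propagate every particle through the latent AR(1) dynamics.
            particles = [a * w + rng.gauss(0.0, sigma) for w in particles]
        # Weight each particle by the Poisson probability of the observed count.
        logw = [log_poisson_pmf(x, mu + w) for w in particles]
        mx = max(logw)
        weights = [math.exp(lw - mx) for lw in logw]
        # The average weight estimates the one-step predictive probability.
        loglik += mx + math.log(sum(weights) / n_particles)
        # Multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return loglik


counts = [1, 0, 2, 3, 1, 0, 1, 2]
print(particle_loglik(counts, mu=0.2, a=0.5, sigma=0.4))
```

In a particle MCMC scheme, an estimate of this kind would take the place of the exact likelihood inside the Metropolis-Hastings acceptance ratio.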
INGARCH model
In this Section we briefly discuss an MCMC algorithm for the INGARCH model. Under the INGARCH model each observation is conditionally Poisson given the past, so the likelihood is a product of Poisson probabilities that can be evaluated directly. Consequently, no data augmentation is required for analysing this model using MCMC, and given priors on the parameters we can construct a random walk Metropolis algorithm to explore the posterior distribution. We choose independent gamma
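Because the INGARCH likelihood is available without data augmentation, it can be evaluated with a single pass over the data. The sketch below assumes the standard linear INGARCH(1,1) form, in which the conditional mean is delta + alpha times the previous conditional mean + beta times the previous count, and it initialises the conditional mean at its stationary value. Both the parameter names and the initialisation are assumptions for illustration.

```python
import math


def ingarch_loglik(counts, delta, alpha, beta):
    """Log-likelihood of a linear INGARCH(1,1) model for a sequence of counts."""
    if delta <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need delta > 0, alpha, beta >= 0 and alpha + beta < 1")
    lam = delta / (1.0 - alpha - beta)  # assumed starting value: stationary mean
    total = 0.0
    for t, x in enumerate(counts):
        if t > 0:
            # Update the conditional mean from the previous mean and count.
            lam = delta + alpha * lam + beta * counts[t - 1]
        # Add the Poisson log probability of the observed count.
        total += x * math.log(lam) - lam - math.lgamma(x + 1)
    return total


print(ingarch_loglik([1, 0, 2, 3, 1], delta=1.0, alpha=0.2, beta=0.3))
```

A random walk Metropolis sampler would call a function of this kind once per proposed parameter value and add the log prior density.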
Model selection
In this Section we consider model selection tools for choosing between competing integer valued time series models. We highlight a range of model selection tools in the Bayesian paradigm.
Reversible jump MCMC (Green, 1995) extends MCMC to allow trans-dimensional moves, enabling the comparison of different models within a single MCMC algorithm. Reversible jump MCMC is particularly well suited for moving between nested models where effective trans-dimensional moves can be identified and has
Simulation study
In this Section we present a simulation study which investigates the effectiveness of the model selection techniques in selecting the order of an AR Poisson regression model and in identifying the true model for three data sets, one simulated from each of an AR Poisson regression model, an INAR model and an INGARCH model.
A time series of length 200 was generated from an AR Poisson regression model with , and (no covariate data). The data
Concluding remarks
In this paper we have shown how a particle filter algorithm can be successfully applied to estimate the likelihood for an AR Poisson regression model. The particle filter is then utilised both within a particle MCMC algorithm and for computation of the marginal likelihood and the DIC for model selection. This has enabled us to conduct model selection both within AR Poisson regression models to select the appropriate order of the model and between AR Poisson regression, INAR
Acknowledgements
NA was supported by a Ph.D. scholarship from the Saudi Arabian Government (1008074575).
We thank an associate editor and two anonymous referees for insightful comments which helped improve the paper.
References (36)
- et al. Sequential Bayesian analysis of multivariate count data. Bayesian Anal. (2017)
- et al. Assessment of mortgage default risk via Bayesian state space models. Ann. Appl. Stat. (2013)
- et al. Particle Markov chain Monte Carlo methods (with discussion). J. Roy. Soc. Ser. B (2010)
- et al. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist. (2009)
- et al. Time Series: Theory and Methods. (1996)
- et al. Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J. Roy. Soc. Ser. B (2003)
- et al. On Gibbs sampling for state space models. Biometrika (1994)
- et al. Particle learning and smoothing. Statist. Sci. (2010)
- et al. Deviance information criteria for missing data models. Bayesian Anal. (2006)
- Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc.
- Marginal likelihood from the Metropolis–Hastings output. J. Amer. Statist. Assoc.
- Observation-driven models for Poisson counts. Biometrika
- On autocorrelation in a Poisson regression model. Biometrika
- Handbook of Discrete-Valued Time Series
- Generalized linear autoregressive moving average models
- Time series analysis of non-Gaussian observations on state space models from both classical and Bayesian perspectives. J. R. Stat. Soc. Ser. B Stat. Methodol.
- Efficient order selection algorithms for integer valued ARMA processes. J. Time Ser. Anal.