Aggregating the response in time series regression models, applied to weather-related cardiovascular mortality

https://doi.org/10.1016/j.scitotenv.2018.02.014

Highlights

  • Aggregating the response helps reduce its noise.

  • Modelling the autocorrelation induced by the aggregation improves the fit of the model.

  • The methodology with an aggregated response achieves a better fit than the classical DLNM.

  • Using an aggregation with decreasing weights is well adapted to weather-related mortality issues.

Abstract

In environmental epidemiology studies, health response data (e.g. hospitalization or mortality) are often noisy because of hospital organization and other social factors. The noise in the data can hide the true signal related to the exposure. The signal can be unveiled by performing a temporal aggregation on health data and then using the aggregated series as the response in regression analysis. A general methodology is introduced to account for the particularities of an aggregated response in a regression setting. This methodology can be used with the regression models commonly applied in weather-related health studies, such as generalized additive models (GAM) and distributed lag nonlinear models (DLNM). In particular, the residuals are modelled using an autoregressive-moving average (ARMA) model to account for the temporal dependence. The proposed methodology is illustrated by modelling the influence of temperature on cardiovascular mortality in Canada. A comparison with classical DLNMs is provided and several aggregation methods are compared. Results show that the fit quality increases when the response is aggregated, and that the estimated relationship focuses more on the outcome over several days than the classical DLNM does. More precisely, among the various aggregation schemes investigated, an aggregation with an asymmetric Epanechnikov kernel was found to be better suited for studying the temperature-mortality relationship.

Introduction

In environmental epidemiology, studies on the health effects of various environmental exposures often rely on regression models applied to time series data (Gasparrini and Armstrong, 2013). Environmental exposure variables include atmospheric pollutant levels and temperature, while health issues include various diseases (Barreca and Shimshack, 2012; Blangiardo et al., 2011; Braga et al., 2002; Knowlton et al., 2009; Martins et al., 2006; Nitschke et al., 2011; Szpiro et al., 2014; Yang et al., 2015). In this context, the exposure-response relationship is complex since, among other reasons, the effect of exposure on health issues lasts several days. This is why models have often used exposure windows in the form of moving averages (MA, e.g. Armstrong, 2006) or distributed lags (DL, Schwartz, 2000a). In particular, the latter have been extended to deal with nonlinear relationships (distributed lag nonlinear models, DLNM, Gasparrini et al., 2010), which are now widely used in weather-related population health studies (e.g. Phung et al., 2016; Vanos et al., 2015; Wu et al., 2013).

The health response, however, is almost always used directly as a daily time series. This could lead to several drawbacks in the regression models of environmental exposure on a health issue. First, the response to an exposure can also be spread across several days (Lipfert, 1993), which means that it would seem more realistic to consider a health time window in response to an associated exposure window. Second, health time series data used in epidemiologic studies are often noisy. The noise can conceal the true signal of the response to an exposure, especially in areas with small populations where the number of cases (mortality or morbidity) is low. Sources of noise include diverse organizational factors such as weekends and holidays (Suissa et al., 2014; Wong et al., 2009), slight changes in the definition of diseases (e.g. Antman et al., 2000) as well as behavioral and technological changes. In the end, the noise in the response can reduce the accuracy of the model and the conclusions (e.g. Todeschini et al., 2004).

In order to assess a more realistic relationship between an exposure and a health issue, as well as to reduce the impact of noise in the health response, it is proposed to apply an aggregation window over time to the health response as well as to the exposure. More precisely, moving aggregation is considered here, i.e. the time step of data points in the obtained series remains the same, as opposed to aggregation in which the temporal resolution is reduced (e.g. from daily values to monthly values).

Aggregating the response series is expected to have two advantages: (1) better representing the spread of the health response to an exposure and (2) reducing the noise in the health series. Indeed, aggregated series are less sensitive to random perturbation in the data. An aggregated response should make regression models more robust to variations induced by noise, leading to more reliable relationship estimates. This idea is consistent with the results of Cristobal et al. (1987) in a non-time series context, which showed that pre-smoothing a response variable to remove noise leads to consistent estimates with low variance in linear regression. In a similar study, Sarmento et al. (2011) concluded that regression models are more robust to noise when both the response and the exposure are aggregated.
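As a minimal sketch of moving aggregation, the Python code below applies a trailing moving average to a simulated daily series. The simulated data, the 7-day window and the choice of a trailing (rather than centred) window are assumptions made for illustration only; they are not taken from the paper.

```python
import random
import statistics


def moving_average(series, window):
    """Trailing moving average that keeps one value per original time step.

    The first `window - 1` positions have too little history and are
    returned as None so the output stays aligned with the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    out = [None] * len(series)
    running = 0.0
    for t, value in enumerate(series):
        running += value
        if t >= window:
            running -= series[t - window]  # drop the value leaving the window
        if t >= window - 1:
            out[t] = running / window
    return out


# Hypothetical daily counts: a constant signal of 10 plus Gaussian noise.
rng = random.Random(0)
daily = [10 + rng.gauss(0, 3) for _ in range(1000)]

aggregated = moving_average(daily, window=7)
valid = [v for v in aggregated if v is not None]

# The aggregated series keeps the daily time step but varies less.
print(len(aggregated) == len(daily))
print(statistics.pstdev(daily) > statistics.pstdev(valid))
```

The output has the same length as the input, which is what distinguishes moving aggregation from aggregation to a coarser time step, and its spread around the constant signal is smaller than that of the raw series.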

There have been few preceding cases of aggregated responses (Roberts, 2015; Sarmento et al., 2011; Schwartz, 2000b), but the regression models applied did not account for the specificities of an aggregated response. These specificities include the presence of extra autocorrelation in the residuals and a modification of their distribution. Therefore, the objective of the present paper is to introduce a general methodology dealing with an aggregated response. The methodology allows the use of a DLNM with an aggregated response and deals with the autocorrelation created by the aggregation. The exposure-response surface of a DLNM with aggregated response is then compared to the surface of a classical DLNM in order to assess the impact on the estimated relationship. In past studies, only the moving average (Roberts, 2015; Sarmento et al., 2011) and Loess (Schwartz, 2000b) have been considered to aggregate the response. In the present paper, other aggregations are considered, in particular Nadaraya-Watson kernel smoothing (Nadaraya, 1964; Watson, 1964) with different kernels including the Epanechnikov kernel (Epanechnikov, 1969) and an asymmetric kernel proposed in Michels (1992).
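For readers unfamiliar with Nadaraya-Watson smoothing, the following Python sketch computes kernel-weighted moving aggregation on an equally spaced daily grid using the standard symmetric Epanechnikov kernel. The function names, the bandwidth value and the toy series are illustrative assumptions. The asymmetric kernel of Michels (1992) is not reproduced here, since its form is not given in this text.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def nadaraya_watson(series, bandwidth, kernel=epanechnikov):
    """Kernel-weighted moving aggregation on an equally spaced time grid.

    Each output value is sum_s K((s - t) / h) * y_s / sum_s K((s - t) / h),
    where h is the bandwidth. The kernel is assumed to vanish outside
    [-1, 1], so only days within `bandwidth` of t are visited.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(series)
    reach = int(bandwidth)  # days farther away than this get zero weight
    smoothed = []
    for t in range(n):
        num = 0.0
        den = 0.0
        for s in range(max(0, t - reach), min(n, t + reach + 1)):
            w = kernel((s - t) / bandwidth)
            num += w * series[s]
            den += w
        smoothed.append(num / den)  # den > 0 because K(0) = 0.75
    return smoothed


# Toy series (hypothetical): a single spike is spread over neighbouring days.
spike = [0, 0, 0, 8, 0, 0, 0]
smoothed = nadaraya_watson(spike, bandwidth=2)
print([round(v, 3) for v in smoothed])
```

Unlike the plain moving average, the weights decrease with the distance from the day of interest, and the bandwidth controls how many neighbouring days contribute.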

The paper is organized as follows. Section 2 introduces the proposed methodology for an aggregated response. Section 3 illustrates the methodology and its benefits by applying it to a weather-related cardiovascular mortality case. The methodology is first compared to models with a non-aggregated response, and then different aggregation strategies are compared. The results are discussed in Section 4 and the conclusions are presented in Section 5.

Section snippets

Methods

This section introduces the statistical methodology, which consists of (1) performing a temporal aggregation on the response time series y_t and (2) modelling the aggregated response ỹ_t as a function of an exposure x_t through a regression model.
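The two steps, and the reason the residuals then need a time series model, can be illustrated with a deliberately simplified Python sketch. It is not the paper's model: a simple linear regression stands in for the DLNM, the exposure is aggregated with the same window as the response so that the slope is preserved, and the data are simulated. All names and values are assumptions. For a moving average of independent noise over a window of m days, the lag-1 autocorrelation is (m - 1) / m, which the sketch checks empirically.

```python
import random


def trailing_mean(series, window):
    """Moving average over the current and previous `window - 1` values."""
    return [sum(series[t - window + 1:t + 1]) / window
            for t in range(window - 1, len(series))]


def ols(x, y):
    """Intercept and slope of a simple least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """Residuals of the least-squares line of y on x."""
    intercept, slope = ols(x, y)
    return [b - intercept - slope * a for a, b in zip(x, y)]


def lag1_autocorr(values):
    """Sample autocorrelation at lag 1."""
    n = len(values)
    m = sum(values) / n
    num = sum((values[t] - m) * (values[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in values)
    return num / den


# Hypothetical data: linear effect of exposure plus independent noise.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [2.0 + 0.5 * a + rng.gauss(0, 1) for a in x]

# Step 1: aggregate (here both series, with a 7-day trailing window).
window = 7
x_agg = trailing_mean(x, window)
y_agg = trailing_mean(y, window)

# Step 2: regress, then inspect the temporal dependence of the residuals.
r_raw = lag1_autocorr(residuals(x, y))
r_agg = lag1_autocorr(residuals(x_agg, y_agg))
print(round(r_raw, 2), round(r_agg, 2))
```

The residuals of the non-aggregated fit are close to uncorrelated, whereas those of the aggregated fit are strongly autocorrelated. This is the dependence that the methodology handles by modelling the residuals with an ARMA model.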

Application and comparison

In Canada, cardiovascular diseases remain the main cause of mortality and put an increasing burden on the public health system (Wielgosz et al., 2009). It has already been shown that temperature affects cardiovascular mortality and morbidity (e.g. Bayentin et al., 2010; Bustinza et al., 2013; Masselot et al., 2018). Therefore, in order to efficiently organize private and public health services and mitigate the effect of temperature on cardiovascular diseases, it is important to understand every

Discussion

The CVD mortality and temperature data were used to compare DLNM without aggregated response to DLNM with aggregated response, with and without modelling the created temporal dependence. Results show that when the temporal dependence is not modelled, results are quite similar between aggregated response (model MA) and non-aggregated response (model C), although the former smooths the relationship. For the latter, it is important to note that the results and interpretation are very similar to

Conclusions

The present paper proposes to aggregate the health response in environmental epidemiology studies, in order to reduce the importance of noise in the health data. The proposed methodology consists of aggregating the response and then applying a time series regression model to account for the temporal dependence created by the aggregation. This model is general and therefore not limited to linear regression and allows the use of DLNMs. The proposed methodology is then applied to the practical

Acknowledgements

The authors are thankful to the Fonds Vert du Québec for funding this study and to the Institut national de santé publique du Québec for data access. The authors also wish to thank Jean-Xavier Giroux (INRS-ETE) for his help with database building, Yohann Chiu (INRS-ETE) for all his relevant comments during the project as well as two anonymous reviewers for their helpful comments in improving the quality of the paper. All the analyses were performed using the R software (R Core Team, 2015) with

References (65)

  • C. Yang et al. Long-term variations in the association between ambient temperature and daily cardiovascular mortality in Shanghai, China. Sci. Total Environ. (2015)
  • A.C. Aitken. On least squares and linear combination of observations. Proc. R. Soc. Edinb. (1935)
  • H. Akaike. A new look at the statistical model identification. IEEE Trans. Autom. Control (1974)
  • B. Armstrong. Models for the relationship between ambient temperature and daily mortality. Epidemiology (2006)
  • A.I. Barreca et al. Absolute humidity, temperature, and influenza mortality: 30 years of county-level evidence from the United States. Am. J. Epidemiol. (2012)
  • L. Bayentin et al. Spatial variability of climate effects on ischemic heart disease hospitalization rates for the period 1989–2006 in Quebec, Canada. Int. J. Health Geogr. (2010)
  • P. Billingsley. Probability and measure.
  • M. Blangiardo et al. A Bayesian analysis of the impact of air pollution episodes on cardio-respiratory hospital admissions in the Greater London area. Stat. Methods Med. Res. (2011)
  • G.E.P. Box et al. Time series analysis: forecasting and control.
  • A.L.F. Braga et al. The effect of weather on respiratory and cardiovascular deaths in 12 U.S. cities. Environ. Health Perspect. (2002)
  • M.J. Brewer et al. The relative performance of AIC, AICC and BIC in the presence of unobserved heterogeneity. Methods Ecol. Evol. (2016)
  • K.P. Burnham et al. Multimodel inference: understanding AIC and BIC in model selection. Sociol. Methods Res. (2004)
  • R. Bustinza et al. Health impacts of the July 2010 heat wave in Quebec, Canada. BMC Public Health (2013)
  • F. Chebana et al. A general and flexible methodology to define thresholds for heat health watch and warning systems, applied to the province of Québec (Canada). Int. J. Biometeorol. (2012)
  • Y. Chiu et al. Mortality and morbidity peaks modeling: an extreme value theory approach. Stat. Methods Med. Res. (2016)
  • A.H. Choudhury et al. Understanding time-series regression estimators. Am. Stat. (1999)
  • W.S. Cleveland et al. Locally weighted regression: an approach to regression analysis by local fitting. J. Am. Stat. Assoc. (1988)
  • D. Cochrane et al. Application of least squares regression to relationships containing auto-correlated error terms. J. Am. Stat. Assoc. (1949)
  • J.A.C. Cristobal et al. A class of linear regression parameter estimators constructed by nonparametric estimation. Ann. Stat. (1987)
  • I. Daubechies. Ten Lectures on Wavelets (1992)
  • B. Doyon et al. The potential impact of climate change on annual and seasonal mortality for three cities in Québec, Canada. Int. J. Health Geogr. (2008)
  • V.A. Epanechnikov. Non-parametric estimation of a multivariate probability density. Theory Probab. Appl. (1969)