EMD-regression for modelling multi-scale relationships, and application to weather-related cardiovascular mortality
Introduction
In a number of scientific fields (e.g. hydrology, environmental health, ecology), it is of interest to understand the effect of one or several predictor variables on a response variable. Regression analysis is the classical class of models for this purpose (e.g. Nelder and Wedderburn, 1972). However, the variables of interest are often represented by time series processes, which can lead to modelling and accuracy issues. The multi-scale nature of some time series processes found in applications such as climatology and public health is of special interest. Indeed, such time series are often non-stationary (i.e. their moments vary with time), and dominant patterns in the series (e.g. annual cycles) create a large amount of multicollinearity among the exposure time series when several covariates are considered. If a regression model does not take these issues into account, the variability of the parameter estimates increases, making the final result less reliable (e.g. Ventosa-Santaulària, 2009). This also increases the risk of drawing the wrong conclusion about whether a predictor influences the response (the so-called “spurious regression” issue; see Granger and Newbold, 1974; Phillips, 1986; Hoover, 2003).
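The spurious regression issue mentioned above can be illustrated with a small simulation. The following Python sketch is only an illustration under stated assumptions: the function names, the series length, the number of replications and the seed are hypothetical choices, not taken from the paper. It regresses pairs of independent random walks (non-stationary) and pairs of independent white-noise series (stationary) on each other and compares the average coefficient of determination.

```python
import random


def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (squared correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def random_walk(length, rng):
    """Non-stationary series: cumulative sum of Gaussian shocks."""
    level, path = 0.0, []
    for _ in range(length):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(length, rng):
    """Stationary series: independent Gaussian shocks."""
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


def mean_r2(simulate, n_pairs=200, length=200, seed=0):
    """Average R^2 over pairs of series that are independent by construction."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        total += r_squared(simulate(length, rng), simulate(length, rng))
    return total / n_pairs


if __name__ == "__main__":
    print("independent random walks:", mean_r2(random_walk))
    print("independent white noise: ", mean_r2(white_noise))
```

Although no pair of series is related by construction, the random walks typically yield a much larger average R² than the white-noise series, which is the behaviour described by Granger and Newbold (1974).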
The present paper proposes to address the issue of multi-scale time series data in regression by decomposing the series into intrinsic mode functions (IMFs) through the empirical mode decomposition algorithm (EMD, Huang et al., 1998). The resulting IMFs are the basic oscillation modes of the time series and can be used as variables in a regression analysis. The proposed method therefore combines EMD and regression, as illustrated in Table 1, and is hereafter called “EMD-regression” (EMD-R). The proposed approach differs significantly from other methods commonly used to address non-stationarity, such as removing the trend and the seasonality (detrending and deseasonalisation), applying a difference operator, or adding a smooth time variable. The main difference is that no information is removed from the data. Instead, EMD-R acts as a scan of the relationship over all time scales present in the data. This makes it possible to isolate the most important time scales for a better understanding of the relationship, and even to unveil signals that may be hidden by the dominant frequencies.
Transforming the data prior to regression analysis is common in the literature, for instance through the use of principal components (e.g. Jolliffe, 1982). Several spectral decomposition approaches better suited to time series data have also been suggested, such as the STL (Seasonal-Trend decomposition using Loess) algorithm (e.g. Schwartz, 2000b), Fourier decomposition (e.g. Dominici et al., 2003) or the wavelet transform (e.g. Kucuk and Agiralioglu, 2006; Kişi, 2009). The main advantage of EMD for the decomposition is that it is entirely data-adaptive (Huang et al., 1998). The algorithm automatically determines the time scales that are present in the data, thus avoiding the a priori choice that is necessary in the STL algorithm (Cleveland et al., 1990), for instance. In addition, no predetermined function is used to perform the decomposition, unlike Fourier and wavelet based decompositions. This allows EMD to decompose non-stationary and non-linear time series into a small number of components (Huang and Wu, 2008).
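To make the data-adaptive nature of the decomposition more concrete, the sifting idea behind EMD can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the implementation used in the paper: the envelopes are piecewise linear rather than the cubic splines of Huang et al. (1998), the number of sifting iterations is fixed instead of governed by a stopping criterion, and all function names, default values and the synthetic signal are hypothetical.

```python
import math


def local_extrema(x):
    """Indices of interior local maxima and minima of a sequence."""
    maxima, minima = [], []
    for i in range(1, len(x) - 1):
        if x[i - 1] < x[i] >= x[i + 1]:
            maxima.append(i)
        elif x[i - 1] > x[i] <= x[i + 1]:
            minima.append(i)
    return maxima, minima


def envelope(idx, x):
    """Piecewise-linear envelope through x at idx (simplification: the
    original algorithm uses cubic splines). End values are held constant."""
    n = len(x)
    knots = [0] + idx + [n - 1]
    vals = [x[idx[0]]] + [x[i] for i in idx] + [x[idx[-1]]]
    env, k = [], 0
    for t in range(n):
        while k < len(knots) - 2 and t > knots[k + 1]:
            k += 1
        w = (t - knots[k]) / (knots[k + 1] - knots[k])
        env.append(vals[k] * (1 - w) + vals[k + 1] * w)
    return env


def sift(x, n_sift=10):
    """Repeatedly subtract the mean of the upper and lower envelopes."""
    h = list(x)
    for _ in range(n_sift):
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper, lower = envelope(maxima, h), envelope(minima, h)
        h = [v - (u + l) / 2 for v, u, l in zip(h, upper, lower)]
    return h


def emd(x, max_imfs=8):
    """Extract components from fastest to slowest; return (imfs, residual)."""
    imfs, residual = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residual)
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few oscillations left: treat the rest as the trend
        imf = sift(residual)
        imfs.append(imf)
        residual = [r - c for r, c in zip(residual, imf)]
    return imfs, residual


# Synthetic multi-scale signal: fast cycle + slow cycle + linear trend.
signal = [math.sin(2 * math.pi * t / 10) + math.sin(2 * math.pi * t / 100)
          + 0.002 * t for t in range(400)]
imfs, residual = emd(signal)
```

No basis function or time scale is specified in advance: the components are driven only by the local extrema of the data. By construction, the extracted components and the final residual sum back to the original signal, which reflects the point made above that no information is removed.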
In addition to being widely applied in several fields such as geosciences (Huang and Wu, 2008) and mechanical engineering (Lei et al., 2013), the EMD algorithm has been successfully combined with other established statistical methods. For instance, Lee and Ouarda (2010, 2011) combined EMD and k-nearest neighbour simulations to predict climatic oscillations. Chen et al. (2012) applied an artificial neural network to forecast the IMFs of a tourism demand series. Lee and Ouarda (2012a) also combined EMD and principal component analysis to separate meaningful signals from noise in climatic applications. EMD has also been used to study the relationship between two variables. For instance, Durocher et al. (2016) used a combination of EMD and cross-wavelet analysis to study the relationship between two time series. For the same purpose, Biswas and Si (2011) and then Hu and Si (2013) used EMD before computing correlation coefficients on the IMFs. A more general method is developed in Chen et al. (2010) to study the correlation between two time series through the use of EMD.
EMD and linear regression have previously been combined by Yang et al. (2011a, 2011b). In a recent article, Qin et al. (2016) proposed to use the Lasso approach to select the most relevant IMFs for predicting the response series. The present work goes a step further by broadening the scope of the procedure and proposing a number of generalisations. In particular, the previous studies decomposed only one predictor series, while the present work is not limited to a single predictor. In addition, two models are proposed here, one of which also decomposes the response series, allowing its prediction in the frequency space in order to gain insights at hidden variation scales. A sensitivity score for the predictors' IMFs is also described as an interpretation tool for practitioners. Finally, unlike the cited studies, a comparison to state-of-the-art regression methods is provided.
The EMD-regression method consists of two steps: i) decomposing the time series into their IMFs through EMD, and ii) using the IMFs as variables in a regression analysis. More specifically, two different designs are introduced: a) only the predictors are decomposed and all their IMFs are used as alternative predictors (EMD-R1), as in Qin et al. (2016), and b) both the response and the predictors are decomposed and each IMF of the response is modelled according to the predictors' IMFs of the same order (EMD-R2). The new EMD-R2 procedure therefore provides more details concerning the relationship between the predictors and the response variable than the EMD-R1 procedure.
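The difference between the two designs can be sketched by showing how the regression inputs are assembled once the IMFs are available. The Python sketch below is an illustration under assumptions: the IMFs are taken as already computed, the function and column names are hypothetical, and the handling of predictors that have fewer IMFs than the response (they are simply skipped at that order) is an assumption rather than a rule stated in the text.

```python
def emd_r1_design(predictor_imfs):
    """EMD-R1: every IMF of every predictor becomes one column; the
    response is left undecomposed.

    predictor_imfs maps a predictor name to its list of IMFs, each IMF
    being a list of floats of the same length.
    Returns (column_names, rows).
    """
    names, columns = [], []
    for name, imfs in predictor_imfs.items():
        for order, imf in enumerate(imfs, start=1):
            names.append(f"{name}_IMF{order}")
            columns.append(imf)
    rows = [list(row) for row in zip(*columns)]
    return names, rows


def emd_r2_designs(response_imfs, predictor_imfs):
    """EMD-R2: one regression per response IMF, using only the
    predictors' IMFs of the same order.

    Returns {order: (response_imf, column_names, rows)}.
    """
    designs = {}
    for order, y_imf in enumerate(response_imfs, start=1):
        names, columns = [], []
        for name, imfs in predictor_imfs.items():
            if order <= len(imfs):  # assumption: skip missing orders
                names.append(f"{name}_IMF{order}")
                columns.append(imfs[order - 1])
        rows = [list(row) for row in zip(*columns)]
        designs[order] = (y_imf, names, rows)
    return designs


# Toy IMFs for two hypothetical predictors and a response (3 time steps).
toy_predictors = {
    "temp": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    "hum": [[7.0, 8.0, 9.0]],
}
toy_response = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]

r1_names, r1_rows = emd_r1_design(toy_predictors)
r2 = emd_r2_designs(toy_response, toy_predictors)
```

EMD-R1 yields a single, wide design matrix, whereas EMD-R2 yields one smaller regression problem per time scale, which is what gives the scale-by-scale detail described above.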
The present study is motivated by an application in weather-related health, which contains typical examples of multi-scale processes. Such studies often control for seasonality and trend by using a time variable in order to focus on the day-to-day variations in the health issue of interest (Bhaskaran et al., 2013). EMD-regression provides a tool for assessing the long-term effects of climatic variables through the low-frequency IMFs. This assessment represents a major challenge for the planning of future public health conditions (Xun et al., 2010) and for setting more appropriate public health alerts, especially under climate change conditions. It is hoped that the use of EMD-regression may also unveil hidden features of the weather-health relationship, such as the influence of weather factors at non-dominant time scales.
The present paper is organized as follows. The background material associated with the EMD-R methodology and the details of the EMD-R approach are introduced in Section 2. In Section 3, both the EMD-R1 and EMD-R2 methods are applied to the weather-related cardiovascular mortality issue in the census metropolitan area (CMA) of Montréal (Canada). Since the motivating context for the present study is weather-related health, the EMD-R methods are then compared to models commonly used in this type of study. The results of the application are discussed in Section 4, and the conclusions are presented in Section 5.
EMD-regression (EMD-R)
The EMD-regression methodology aims at explaining the effects of covariates Xj on a response variable Y by: 1) decomposing the time series using EMD and 2) using the IMFs as new variables in a sparse regression model, namely the Lasso (least absolute shrinkage and selection operator, Tibshirani, 1996). The methodology is summarized in Fig. 1.
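The second step can be sketched with a small coordinate-descent Lasso applied to IMF-like columns. This is a minimal illustration under assumptions, not the authors' implementation: the objective is taken as (1/2n)·RSS + lam·‖beta‖₁ (a common convention), the columns and the response are centred so that no intercept is fitted, and the function names, the penalty value `lam` and the synthetic "IMFs" are hypothetical.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, y, lam, n_iter=200):
    """Minimise (1/2n) * RSS + lam * sum(|beta_j|) by cyclic coordinate
    descent. columns is a list of predictor columns (e.g. IMFs)."""
    n = len(y)
    y_mean = sum(y) / n
    residual = [v - y_mean for v in y]  # all coefficients start at zero
    centred = []
    for col in columns:
        col_mean = sum(col) / n
        centred.append([v - col_mean for v in col])
    scale = [sum(v * v for v in col) / n for col in centred]
    beta = [0.0] * len(centred)
    for _ in range(n_iter):
        for j, col in enumerate(centred):
            if scale[j] == 0.0:
                continue  # constant column carries no information
            # Correlation of column j with the partial residual.
            rho = sum(v * (r + v * beta[j])
                      for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / scale[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, col)]
                beta[j] = new_beta
    return beta


# Synthetic example: the response depends only on the slow component.
n_obs = 360
fast = [math.sin(2 * math.pi * t / 7) for t in range(n_obs)]
slow = [math.sin(2 * math.pi * t / 90) for t in range(n_obs)]
response = [2.0 * s for s in slow]
coefficients = lasso_coordinate_descent([fast, slow], response, lam=0.05)
```

In this toy setting, the penalty sets the coefficient of the irrelevant fast component to zero and shrinks the coefficient of the slow component slightly below its true value of 2, which illustrates how a sparse regression can single out the time scales that matter.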
Application to weather-related cardiovascular mortality
The literature abounds with studies documenting the potentially harmful impacts of climate change. Among these impacts, an increase in weather-related mortality is expected. Cardiovascular diseases (CVD) are among the diseases most affected by climate change since they are impacted by extreme weather (e.g. Braga et al., 2002; Bustinza et al., 2013). CVD are already the main cause of mortality in Canada and could represent an increasing burden on the Canadian public health
Discussion
The results of the weather-related cardiovascular mortality application presented in Section 3 already show one advantage of EMD-R: its ability to reveal hidden aspects of the relationship. In this case, the effect of humidity found during the spring season and at very large time scales (i.e. periodicities of several years) is quite new in the field of environmental epidemiology. Indeed, no significant association between relative humidity and mortality has been found when studying as a variable of
Conclusion
The present paper introduces a general EMD-regression methodology for time series data (and more generally for all autocorrelated data), which are often found in environmental sciences. The purpose of the EMD-R approach is to understand a relationship between variables from a different point of view, namely that of time scales. This point of view acknowledges the complexity of many real-world time series, which contain a significant amount of information in their variations.
Acknowledgements
The authors are thankful to the Fonds Vert du Québec for funding this study and to the Institut national de santé publique du Québec for data access. The authors also thank Jean-Xavier Giroux (INRS-ETE) for his help in building the database, as well as Yohann Chiu (INRS-ETE) for his relevant comments during the project. The authors are grateful to Scott Sheridan, the associate editor of Science of the Total Environment, as well as three anonymous reviewers for their judicious comments and
References (77)
- et al. Toward creating simpler hydrological models: a LASSO subset selection approach. Environ. Model Softw. (2015)
- et al. Forecasting tourism demand based on empirical mode decomposition and neural network. Knowl.-Based Syst. (2012)
- et al. Mortality risk attributable to high and low ambient temperature: a multicountry observational study. Lancet (2015)
- et al. Spurious regressions in econometrics. J. Econ. (1974)
- et al. Soil water prediction based on its scale-specific control using multivariate empirical mode decomposition. Geoderma (2013)
- et al. An investigation of thresholds in air pollution-mortality effects. Environ. Model Softw. (2006)
- et al. A review on empirical mode decomposition in fault diagnosis of rotating machinery. Mech. Syst. Signal Process. (2013)
- Understanding spurious regressions in econometrics. J. Econ. (1986)
- et al. The effects of high temperature on cardiovascular admissions in the most populous tropical city in Vietnam. Environ. Pollut. (2016)
- et al. Temperature–mortality relationship in four subtropical Chinese cities: a time-series study using a distributed lag non-linear model. Sci. Total Environ. (2013)