Deep learning for short-term traffic flow prediction
Introduction
Real-time spatio-temporal measurements of traffic flow speed are available from in-ground loop detectors or GPS probes. Commercial traffic data providers, such as Bing Maps (Microsoft Research, 2016), rely on traffic flow data and machine learning to predict speeds for each road segment. Real-time (15–40 min) forecasting gives travelers the ability to choose better routes and authorities the ability to manage the transportation system. Deep learning is a form of machine learning which provides good short-term forecasts of traffic flows by exploiting the dependency in the high dimensional set of explanatory variables. In this way, we capture the sharp discontinuities in traffic flow that arise in large-scale networks. We provide a variable selection methodology based on sparse models and dropout.
The goal of our paper is to model the nonlinear spatio-temporal effects in recurrent and non-recurrent traffic congestion patterns. These arise due to conditions at construction zones, weather, special events, and traffic incidents. Quantifying travel time uncertainty requires real-time forecasts. Traffic managers use model-based forecasts to regulate ramp metering, apply speed harmonization, and regulate road pricing as a congestion mitigation strategy, whereas the general public adjusts travel decisions on departure times and travel route choices, among other things.
Deep learning forecasts congestion propagation given a bottleneck location, and can provide accurate forty-minute forecasts for days with recurrent and non-recurrent traffic conditions. Deep learning can also incorporate other data sources, such as weather forecasts and police reports, to produce more accurate forecasts. We illustrate our methodology on traffic flows during two special events: a Chicago Bears football game and an extreme snow storm event.
To perform variable selection, we develop a hierarchical sparse vector auto-regressive technique (Dellaportas et al., 2012, Nicholson et al., 2014) as the first deep layer. Predictor selection then proceeds via dropout (Hinton and Salakhutdinov, 2006). The sharp discontinuities in traffic flow are modeled as a superposition of univariate non-linear activation functions with affine arguments. Our procedure is scalable and estimation follows traditional optimization techniques, such as stochastic gradient descent.
The rest of our paper is outlined as follows. Section 1.2 discusses connections with existing work. Section 1.3 reviews fundamentals of deep learning. Section 2 develops deep learning predictors for forecasting traffic flows. Section 3 discusses fundamental characteristics of traffic flow data and illustrates our methodology with the study of traffic flow on Chicago’s I-55. Finally, Section 4 concludes with directions for future research.
Short-term traffic flow prediction has a long history in the transportation literature. Deep learning is a form of machine learning that can be viewed as a nested hierarchical model which includes traditional neural networks. Karlaftis and Vlahogianni (2011) provides an overview of traditional neural network approaches, and Kamarianakis et al. (2012) shows that model training is computationally expensive, with frequent updating being prohibitive. On the other hand, deep learning with dropout can find a sparse model which can be frequently updated in real time. There are several analytical approaches to traffic flow modeling (Anacleto et al., 2013, Blandin et al., 2012, Chiou et al., 2014, Polson and Sokolov, xxxx, Polson and Sokolov, 2015, Work et al., 2010). These approaches can perform very well on filtering and state estimation. The caveat is that they are hard to implement on large scale networks. Bayesian approaches have been shown to be efficient for handling large scale transportation network state estimation problems (Tebaldi and West, 1998). Westgate et al. (2013) discusses ambulance travel time reliability using noisy GPS data for both path travel time and individual road segment travel time distributions. Anacleto et al. (2013) provides a dynamic Bayesian network to model external intervention techniques to accommodate situations with suddenly changing traffic variables.
Statistical and machine learning methods for traffic forecasting are compared in Smith and Demetsky (1997). Sun et al. (2006) provides a Bayes network algorithm, in which the conditional probability of a traffic state on a given road, given the states of its topological neighbors on the road network, is calculated. The resulting joint probability distribution is a mixture of Gaussians. Bayes networks for estimating travel times were suggested by Horvitz et al.; this work eventually became a commercial product that led to the start of Inrix, a traffic data company. Wu et al. (2004) uses a machine-learning method, the support vector machine (SVM) (Polson and Scott, 2011), to forecast travel times, and Quek et al. (2006) proposes a fuzzy neural-network approach to address nonlinearities in traffic data. Rice and van Zwet (2004) argues that there is a linear relation between future travel times and currently estimated conditions, and uses a time-varying coefficient regression model to predict travel times.
Integrated auto-regressive moving average (ARIMA) and exponential smoothing (ES) models for traffic forecasting are studied in Tan et al. (2009) and Van Der Voort et al. (1996). A Kohonen self-organizing map is proposed as an initial classifier. Van Lint (2008) addresses real-time parameter learning and improves the quality of forecasts using an extended Kalman filter. Ban et al. (2011) proposes a method for estimating queue lengths at controlled intersections, based on the travel time data measured by GPS probes. The method relies on detecting discontinuities and changes of slope in travel time data. Ramezani and Geroliminis (2015) combines traffic flow shockwave analysis with data mining techniques. Oswald et al. (2000) argues that non-parametric methods produce better forecasts than parametric models due to their ability to better capture spatial-temporal relations and non-linear effects. Vlahogianni et al. (2014) provides an extensive recent review of the literature on short-term traffic predictions.
There are several issues not addressed in the current literature (Vlahogianni et al., 2014). The first is making predictions at a network level using data-driven approaches. There are two situations when a data-driven approach might be preferable to methodologies based on traffic flow equations. First, estimating boundary conditions is a challenging task, even in systems that rely on loop detectors as traffic sensors, because detectors are typically not installed on ramps. Missing data problems are usually addressed using data imputation (Muralidharan and Horowitz, 2009) or weak formulations of boundary conditions (Strub and Bayen, 2006). Our results show that a data-driven approach can efficiently forecast flows without boundary measurements from ramps. Second, physics-based approaches have a limited ability to model urban arterials. For example, Qiao et al. (2001) shows that analytical approaches fail to provide good forecasts. Another challenge is to identify spatio-temporal relations in flow patterns; see Vlahogianni et al. (2014) for further discussion. Data-driven approaches provide a flexible alternative to physical laws of traffic flows.
The challenge is to perform model selection and residual diagnostics (Vlahogianni et al., 2014). Model selection can be tackled by regularizing the loss function and using cross-validation to select the optimal penalty weight. To address this issue, we construct our deep learning architecture as follows. First, we use a regularized vector autoregressive model to perform predictor selection. Then, our deep learning model addresses the issue of non-linear and non-stationary relations between variables (speed measurements) using a series of activation functions.
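The predictor-selection step described above can be illustrated with a minimal sketch. It assumes a plain lasso penalty solved by proximal gradient descent (ISTA); the estimator and solver actually used in the paper are not given in this section, so the function names, the penalty weight `lam`, and the synthetic data are all hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2T) * ||y - X b||_2^2 + lam * ||b||_1 by ISTA."""
    T, p = X.shape
    # Step size = 1 / Lipschitz constant of the smooth part's gradient.
    step = T / (np.linalg.norm(X, 2) ** 2)
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / T
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Synthetic stand-in for lagged speed measurements: only predictors 2 and 7 matter.
rng = np.random.default_rng(0)
T, p = 400, 20
X = rng.standard_normal((T, p))
true_b = np.zeros(p)
true_b[[2, 7]] = [1.5, -2.0]
y = X @ true_b + 0.1 * rng.standard_normal(T)

b_hat = lasso_ista(X, y, lam=0.1)
selected = np.flatnonzero(np.abs(b_hat) > 1e-8)  # indices of retained predictors
print(selected)
```

In practice the penalty weight would be chosen by cross-validation, as the text notes, and the retained predictors would then feed the non-linear layers.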
Breiman (2003) describes the trade-off between machine learning and traditional statistical methods. Machine learning has been widely applied (Ripley, 1996) and shown to be particularly successful in traffic pattern recognition. For example, shallow neural networks for traffic applications (Chen and Grant-Muller, 2001) use a memory-efficient dynamic neural network based on a resource allocating network (RAN) with a single hidden layer of Gaussian radial basis function activation units. Zheng et al. (2006) develops several one-hidden-layer networks to produce fifteen-minute forecasts. Two types of networks were developed, which differ in their activation function; one of them uses a Gaussian radial basis function. Several forecasts were combined using Bayes factors that calculate an odds ratio for each of the models dynamically. Van Lint et al. (2005) proposes a state-space neural network and a multiple hypothesis approach that relies on using several neural network models at the same time (van Hinsbergen et al., 2009). Using day of the week and time of day as inputs to a neural network was proposed in Çetiner et al. (2010). Our work is closely related to Lv et al. (2015), which demonstrates that deep learning can be effective for traffic forecasts. A stacked auto-encoder was used to learn the spatial-temporal patterns in the traffic data, with training performed in a greedy layer-wise fashion. Ma et al. (2015) proposed a recurrent architecture, a Long Short-Term Memory Neural Network (LSTM), for travel speed prediction. Our approach builds on this by showing an additional advantage of deeper hidden layers together with sparse autoregressive techniques for variable selection.
Deep learning learns a high dimensional function via a sequence of semi-affine non-linear transformations. The deep architecture is organized as a graph. The nodes of the graph are units, connected by links that propagate activation, calculated at the origin, to the destination units. Each link has a weight that determines the relative strength and sign of the connection, and each unit applies an activation function to the weighted sum of all incoming activations. The activation function is given, such as a hard threshold or a sigmoid function. A particular class of deep learning models that uses a directed acyclic graph structure is called a feed-forward neural network. There is a vast literature on this topic; earlier works include Bishop (1995) and Haykin (2004).
Deep learning allows for efficient modeling of nonlinear functions; see the original problem of Poincaré and Hilbert. The advantage of deep hidden layers for a high dimensional input variable is that the activation functions are univariate, which implicitly requires the specification of the number of hidden units for each layer l.
The Kolmogorov-Arnold representation theorem (Kolmogorov, 1956) provides the theoretical motivation for deep learning. The theorem states that any continuous function of n variables, defined by $F(x)$, can be represented as
$$F(x) = \sum_{j=1}^{2n+1} g_j\left(\sum_{i=1}^{n} h_{ij}(x_i)\right),$$
where $g_j$ and $h_{ij}$ are continuous functions, and $h_{ij}$ is a universal basis that does not depend on F. This remarkable representation result implies that any continuous function can be represented using operations of summation and function composition. For a neural network, it means that any function of n variables can be represented as a neural network with one hidden layer and $2n+1$ activation units. The difference between the theorem and neural network representations is that the functions $h_{ij}$ are not necessarily affine. Much research has focused on how to find such a basis. In their original work, Kolmogorov and Arnold develop the functions in a constructive fashion. Diaconis and Shahshahani (1984) characterizes projection pursuit functions for specific types of input functions.
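A deep learning predictor, denoted by $\hat{y}(x)$, takes an input vector $x = (x_1, \ldots, x_p)$ and outputs y via different layers of abstraction that employ hierarchical predictors by composing L non-linear semi-affine transformations. Specifically, a deep learning architecture is as follows. Let $f_1, \ldots, f_L$ be given univariate activation link functions, e.g. sigmoid ($1/(1+e^{-x})$), Heaviside gate functions ($\mathbb{I}(x>0)$), rectified linear units ($\max\{x,0\}$), or indicator functions ($\mathbb{I}(x \in R)$) for trees. The composite map is defined by
$$\hat{y}(x) := F(x) = \left(f_{w_L,b_L} \circ \cdots \circ f_{w_1,b_1}\right)(x),$$
where $f_{w_l,b_l}$ is a semi-affine activation rule defined by
$$f_{w_l,b_l}(x) = f_l\left(\sum_{j=1}^{N_l} w_{lj} x_j + b_l\right) = f_l\left(w_l^T x + b_l\right).$$
Here $N_l$ denotes the number of units at layer l. The weights $w_l$ and offsets $b_l$ need to be learned from training data.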
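Data dimension reduction of a high dimensional map F is performed via the composition of univariate semi-affine functions. Let $Z^{(l)}$ denote the l-th layer hidden features, with $Z^{(0)} = x$. The final output is the response y, which can be numeric or categorical. The explicit structure of a deep prediction rule is then
$$Z^{(l)} = f_l\left(w_l Z^{(l-1)} + b_l\right), \quad l = 1, \ldots, L, \qquad \hat{y}(x) = Z^{(L)},$$
where $f_l$ is applied componentwise and $w_l$, $b_l$ are the weights and offsets of layer l.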
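The composition of semi-affine transformations described above can be sketched as a short forward pass. This is a minimal illustration, not the paper's architecture: the layer sizes, the choice of activations, and the random weights are assumptions made only so the example runs.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_predictor(x, weights, offsets, activations):
    """Apply Z_l = f_l(W_l Z_{l-1} + b_l) for l = 1..L, starting from Z_0 = x."""
    z = x
    for W, b, f in zip(weights, offsets, activations):
        z = f(W @ z + b)
    return z

rng = np.random.default_rng(1)
sizes = [6, 4, 3, 1]  # hypothetical: 6 inputs, hidden layers of 4 and 3 units, scalar output
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
offsets = [np.zeros(m) for m in sizes[1:]]
activations = [relu, sigmoid, lambda z: z]  # identity on the last layer for a numeric response

x = rng.standard_normal(sizes[0])
y_hat = deep_predictor(x, weights, offsets, activations)
print(y_hat)
```

Each hidden layer maps its input to a lower dimensional feature vector, which is the sense in which the composition performs dimension reduction.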
In many cases there is an underlying probabilistic model, denoted by $p(y \mid \hat{y}(x))$. This leads to a training problem given by the optimization problem
$$\min_{w,b} \; \frac{1}{T}\sum_{i=1}^{T} -\log p\left(y_i \mid \hat{y}_{w,b}(x_i)\right),$$
where $p(y_i \mid \hat{y}_{w,b}(x_i))$ is the probability density function given by the specification $y_i = F(x_i) + \epsilon_i$. For example, if $\epsilon_i$ is normal, we will be training via an $\ell_2$-norm, $\min_{w,b} \|y - \hat{y}_{w,b}(x)\|_2^2$. One of the key advantages of deep learning is that derivative information is available in closed form via the chain rule. Typically, a regularization penalty, defined by $\lambda\phi(w,b)$, is added to control the bias-variance trade-off and provide good out-of-sample predictive performance. An optimal regularization parameter, $\lambda$, can be chosen using out-of-sample cross-validation techniques. One of the advantages of the penalized least squares formulation is that it leads to a convex, though non-smooth, optimization problem. Efficient algorithms (Kim et al., 2007) exist to solve those problems, even for high dimensional cases.
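To make the chain-rule remark concrete, the sketch below computes a penalized squared-error loss for a network with one hidden layer and checks one analytic derivative against a finite difference. The tanh activation, the $\ell_1$ penalty on the weights, and all names and sizes are assumptions chosen for illustration, not the paper's specification.

```python
import numpy as np

def loss_and_grads(W1, b1, w2, b2, X, y, lam):
    """Squared-error loss with an l1 penalty on the weights, plus chain-rule gradients.
    A subgradient (the sign) is used for the non-smooth l1 term."""
    T = X.shape[0]
    A = X @ W1.T + b1        # pre-activations, shape (T, H)
    Z = np.tanh(A)           # hidden features
    y_hat = Z @ w2 + b2      # linear output layer
    r = y_hat - y
    loss = 0.5 * np.mean(r ** 2) + lam * (np.abs(W1).sum() + np.abs(w2).sum())

    d_yhat = r / T                               # d loss / d y_hat
    g_w2 = Z.T @ d_yhat + lam * np.sign(w2)
    g_b2 = d_yhat.sum()
    d_A = np.outer(d_yhat, w2) * (1.0 - Z ** 2)  # back through tanh
    g_W1 = d_A.T @ X + lam * np.sign(W1)
    g_b1 = d_A.sum(axis=0)
    return loss, (g_W1, g_b1, g_w2, g_b2)

rng = np.random.default_rng(3)
T, p, H = 50, 4, 3
X = rng.standard_normal((T, p))
y = rng.standard_normal(T)
W1, b1 = rng.standard_normal((H, p)), np.zeros(H)
w2, b2 = rng.standard_normal(H), 0.0

loss, (g_W1, g_b1, g_w2, g_b2) = loss_and_grads(W1, b1, w2, b2, X, y, lam=0.01)

# Central finite-difference check of one weight's derivative.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(W1_plus, b1, w2, b2, X, y, 0.01)[0]
           - loss_and_grads(W1_minus, b1, w2, b2, X, y, 0.01)[0]) / (2 * eps)
print(abs(numeric - g_W1[0, 0]))
```

These gradients are what a stochastic gradient descent step would use; the penalty weight `lam` plays the role of the regularization parameter chosen by cross-validation.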
There is a strong connection with nonlinear multivariate non-parametric models, which we now explore. In a traditional statistical framework, the non-parametric approach seeks to approximate the unknown map F using a family of functions defined by the following expression
$$F(x) = \sum_{k=1}^{N} w_k f_k(x). \qquad (2)$$
Functions $f_k$ are called basis functions and play a role similar to that of a functional space basis, i.e. they are chosen to give a good approximation to the unknown map F. In some cases $f_k$ actually do form a basis of a space, e.g., Fourier and wavelet bases. Multi-variate basis functions are usually constructed using functions of a single variable. Four examples are radial functions, ridge functions, kernel functions and indicator functions:
$$f_k(x) = \kappa\left(\|x-\gamma_k\|_2\right), \quad f_k(x) = \kappa\left(\beta_k^T x + \beta_{0k}\right), \quad f_k(x) = \kappa\left(\frac{x-\gamma_k}{h}\right), \quad f_k(x) = \mathbb{I}(x \in C_k). \qquad (3)$$
Here $\kappa$ is typically chosen to be a bell-shaped function (e.g., $e^{-x^2}$ or $1/\cosh(x)$). The ridge function, a composition of inner-product and non-linear univariate functions, is arguably one of the simplest non-linear multi-variate functions. Two of the most popular types of neural networks are constructed as a composition of radial or ridge functions. Popular non-parametric tree-based models (Breiman et al., 1984) can be represented as (2) by choosing $f_k$ to be the indicator functions in Eq. (3). In tree-based regression, the weights $w_k$ are the averages of the responses $y_i$ with $x_i \in C_k$, and $C_k$ is a box set in $\mathbb{R}^p$ with zero or more extreme directions (open sides).
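As a small illustration of the basis-function view, the sketch below defines radial, ridge and box-indicator basis functions and fits a two-leaf regression tree as a weighted sum of indicators, with each weight equal to the average response in its box. The particular bell-shaped functions, the split at 0.5, and the synthetic data are assumptions for the example only.

```python
import numpy as np

def radial(x, center):
    """Radial basis function with bell-shaped kappa(u) = exp(-u^2)."""
    return np.exp(-np.sum((x - center) ** 2, axis=-1))

def ridge(x, direction, offset):
    """Ridge function with bell-shaped kappa(u) = 1 / cosh(u)."""
    return 1.0 / np.cosh(x @ direction + offset)

def box_indicator(x, lower, upper):
    """Indicator of the box {lower <= x < upper}; infinite bounds give open sides."""
    return np.all((x >= lower) & (x < upper), axis=-1).astype(float)

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.where(X[:, 0] < 0.5, 1.0, 3.0) + 0.1 * rng.standard_normal(200)

# Two boxes splitting on the first coordinate; the second coordinate is unconstrained.
boxes = [
    (np.array([-np.inf, -np.inf]), np.array([0.5, np.inf])),
    (np.array([0.5, -np.inf]), np.array([np.inf, np.inf])),
]
Phi = np.column_stack([box_indicator(X, lo, hi) for lo, hi in boxes])  # shape (200, 2)
w = (Phi.T @ y) / Phi.sum(axis=0)  # per-box averages of y
F_hat = Phi @ w                    # tree prediction as a sum of weighted indicators
print(w)
```

Replacing the indicator columns of `Phi` with `radial` or `ridge` evaluations gives the other expansions of the same form.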
Another set of basis functions is the Fourier series, used primarily for time series analysis. A spline approximation can also be derived by using polynomial functions with finite support as a basis.
Ridge-based models can efficiently represent high-dimensional data sets with a small number of parameters. We can think of deep features (outputs of hidden layers) as projections of the input data into a lower dimensional space. Deep learners can deal with the curse of dimensionality because ridge functions determine directions in input space where the variance is very high. Those directions are chosen as global ones and represent the most significant patterns in the data. This approach resembles other well-studied techniques such as projection pursuit (Friedman and Tukey, 1974) and principal component analysis.
Deep learning for traffic flow prediction
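Let $\hat{y}^{t}_{t+h}$ be the forecast of traffic flow speeds at time $t+h$, given measurements up to time t. Our deep learning traffic architecture looks like
$$\hat{y}^{t}_{t+h} = \begin{pmatrix} x_{1,t+h} \\ \vdots \\ x_{n,t+h} \end{pmatrix}.$$
To model traffic flow data we use predictors x given by
$$x^{t} = \mathrm{vec}\begin{pmatrix} x_{1,t-k} & \cdots & x_{1,t} \\ \vdots & & \vdots \\ x_{n,t-k} & \cdots & x_{n,t} \end{pmatrix}.$$
Here n is the number of locations on the network (loop detectors) and $x_{i,t}$ is the cross-section traffic flow speed at location i at time t. We use vec to denote the vectorization transformation, which converts the matrix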
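The construction of the predictor vector from speed measurements can be sketched as follows. This is an illustration under stated assumptions: the number of lags `k`, the forecast horizon `h`, the column-major ordering of vec, and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def lagged_predictors(speeds, t, k):
    """Stack the n x (k+1) block of speeds at times t-k..t into one vector (column-major vec)."""
    block = speeds[:, t - k : t + 1]
    return block.flatten(order="F")

def target(speeds, t, h):
    """Speeds at all n locations at time t + h, the quantity to be forecast."""
    return speeds[:, t + h]

n, T = 3, 10
speeds = np.arange(n * T, dtype=float).reshape(n, T)  # toy data: row i is location i

x_t = lagged_predictors(speeds, t=5, k=2)
y_t = target(speeds, t=5, h=2)
print(x_t)
print(y_t)
```

With n locations and k lags the predictor vector has n(k+1) entries, which is why a sparse selection step is useful before the non-linear layers.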
Chicago traffic flow during special events
To illustrate our methodology, we use data from twenty-one loop-detectors installed on a northbound section of Interstate I-55. Those loop-detectors span 13 miles of the highway. Traffic flow data is available from the Illinois Department of Transportation (see Lake Michigan Interstate Gateway Alliance http://www.travelmidwest.com/, formerly the Gary-Chicago-Milwaukee Corridor, or GCM). The data is measured by loop-detector sensors installed on interstate highways. Loop-detector is a simple
Discussion
The main contribution of this paper is the development of an innovative deep learning architecture to predict traffic flows. The architecture combines a linear model that is fitted using regularization and a sequence of layers. The first layer identifies spatio-temporal relations among predictors and other layers model nonlinear relations. The improvements in our understanding of short-term traffic forecasts from deep learning are twofold. First, we demonstrate that deep learning provides a
References (74)
- et al. On sequential data assimilation for scalar macroscopic traffic flow models. Phys. D: Nonlin. Phenom. (2012)
- et al. Use of sequential learning for short-term traffic flow forecasting. Transport. Res. Part C: Emerg. Technol. (2001)
- et al. Statistical methods versus neural networks in transportation research: differences, similarities and some insights. Transport. Res. Part C: Emerg. Technol. (2011)
- et al. Testing for neglected nonlinearity in time series models: a comparison of neural network methods and alternative tests. J. Economet. (1993)
- et al. Long short-term memory neural network for traffic speed prediction using remote microwave sensor data. Transport. Res. Part C: Emerg. Technol. (2015)
- et al. Intelligent simulation and prediction of traffic flow dispersion. Transport. Res. Part B: Methodol. (2001)
- et al. Nonlinear total variation based noise removal algorithms. Phys. D: Nonlin. Phenom. (1992)
- et al. Combining Kohonen maps with Arima time series models to forecast traffic flow. Transport. Res. Part C: Emerg. Technol. (1996)
- et al. Bayesian committee of neural networks to predict travel times with confidence intervals. Transport. Res. Part C: Emerg. Technol. (2009)
- et al. Accurate freeway travel time prediction with state-space neural networks under missing data. Transport. Res. Part C: Emerg. Technol. (2005)
- Optimized and meta-optimized neural networks for short-term traffic flow prediction: a genetic approach. Transport. Res. Part C: Emerg. Technol.
- Short-term traffic forecasting: where we are and where we’re going. Transport. Res. Part C: Emerg. Technol.
- Multivariate forecasting of road traffic flows in the presence of heteroscedasticity and measurement errors. J. R. Statist. Soc.: Ser. C (Appl. Statist.)
- Real time queue length estimation for signalized intersections using travel times from mobile sensors. Transport. Res. Part C: Emerg. Technol.
- Neural Networks for Pattern Recognition
- Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Am. Statist. Assoc.
- Statistical modeling: the two cultures. Qual. Control Appl. Statist.
- Classification and Regression Trees
- A neural network based traffic-flow prediction model. Math. Comput. Appl.
- A novel method to predict traffic features based on rolling self-structured traffic patterns. J. Intell. Transport. Syst.
- Joint specification of model space and parameter space prior distributions. Statist. Sci.
- On nonlinear functions of linear combinations. SIAM J. Scient. Statist. Comput.
- A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comp.
- The variable selection problem. J. Am. Statist. Assoc.
- LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Trans. Neural Netw.
- Offline handwriting recognition with multidimensional recurrent neural networks
- A comprehensive foundation. Neural Netw.
- On the inductive bias of dropout. J. Mach. Learn. Res.
- Reducing the dimensionality of data with neural networks. Science
- Long short-term memory. Neural Comput.
- Real-time road traffic forecasting using regime-switching space-time models and adaptive LASSO. Appl. Stoch. Mod. Bus. Indust.
- Real-time road traffic forecasting using regime-switching space-time models and adaptive LASSO. Appl. Stoch. Mod. Bus. Indust.
- An interior-point method for large-scale $\ell_1$-regularized least squares. IEEE J. Select. Top. Sig. Process.