Deep learning for short-term traffic flow prediction

https://doi.org/10.1016/j.trc.2017.02.024

Abstract

We develop a deep learning model to predict traffic flows. The main contribution is the development of an architecture that combines a linear model fitted using ℓ1 regularization and a sequence of tanh layers. The challenge in predicting traffic flows is the sharp nonlinearities due to transitions between free flow, breakdown, recovery, and congestion. We show that deep learning architectures can capture these nonlinear spatio-temporal effects. The first layer identifies spatio-temporal relations among predictors and other layers model nonlinear relations. We illustrate our methodology on road sensor data from Interstate I-55 and predict traffic flows during two special events: a Chicago Bears football game and an extreme snowstorm. Both cases exhibit sharp traffic flow regime changes that occur very suddenly, and we show how deep learning provides precise short-term traffic flow predictions.

Introduction

Real-time spatio-temporal measurements of traffic flow speed are available from in-ground loop detectors or GPS probes. Commercial traffic data providers, such as Bing maps (Microsoft Research, 2016), rely on traffic flow data and machine learning to predict speeds for each road segment. Real-time (15–40 min) forecasting gives travelers the ability to choose better routes and authorities the ability to manage the transportation system. Deep learning is a form of machine learning that provides good short-term forecasts of traffic flows by exploiting the dependency in the high-dimensional set of explanatory variables; it captures the sharp discontinuities in traffic flow that arise in large-scale networks. We provide a variable selection methodology based on sparse models and dropout.

The goal of our paper is to model the nonlinear spatio-temporal effects in recurrent and non-recurrent traffic congestion patterns. These arise due to conditions at construction zones, weather, special events, and traffic incidents. Quantifying travel time uncertainty requires real-time forecasts. Traffic managers use model-based forecasts to regulate ramp metering, apply speed harmonization, and regulate road pricing as a congestion mitigation strategy; whereas, the general public adjusts travel decisions on departure times and travel route choices, among other things.

Deep learning forecasts congestion propagation given a bottleneck location, and can provide accurate forty-minute forecasts for days with recurrent and non-recurrent traffic conditions. Deep learning can also incorporate other data sources, such as weather forecasts and police reports, to produce more accurate forecasts. We illustrate our methodology on traffic flows during two special events: a Chicago Bears football game and an extreme snowstorm.

To perform variable selection, we develop a hierarchical sparse vector autoregressive technique (Dellaportas et al., 2012, Nicholson et al., 2014) as the first deep layer. Predictor selection then proceeds via dropout (Hinton and Salakhutdinov, 2006). Deep learning models the sharp discontinuities in traffic flow as a superposition of univariate non-linear activation functions with affine arguments. Our procedure is scalable, and estimation follows traditional optimization techniques such as stochastic gradient descent.
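The dropout step can be sketched as follows. This is a generic "inverted dropout" implementation in Python/NumPy, not the paper's exact code; the keep probability and the pruning heuristic noted in the comments are illustrative assumptions.

```python
import numpy as np

def dropout(z, p=0.5, rng=None, train=True):
    """Inverted dropout: zero each unit with probability p during training,
    and rescale the survivors so the expected activation is unchanged."""
    if not train:
        return z                                 # no-op at prediction time
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(z.shape) >= p              # keep each unit with prob 1 - p
    return z * mask / (1.0 - p)

# Predictors whose weights are driven to zero under repeated dropout training
# can be pruned, which is how dropout doubles as a predictor-selection device.
z = np.array([1.0, -2.0, 3.0])
z_train = dropout(z, p=0.5, rng=np.random.default_rng(0))
z_test = dropout(z, train=False)                 # identical to z
```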

The rest of our paper is outlined as follows. Section 1.2 discusses connections with existing work. Section 1.3 reviews fundamentals of deep learning. Section 2 develops deep learning predictors for forecasting traffic flows. Section 3 discusses fundamental characteristics of traffic flow data and illustrates our methodology with the study of traffic flow on Chicago’s I-55. Finally, Section 4 concludes with directions for future research.

Short-term traffic flow prediction has a long history in the transportation literature. Deep learning is a form of machine learning that can be viewed as a nested hierarchical model which includes traditional neural networks. Karlaftis and Vlahogianni (2011) provides an overview of traditional neural network approaches, and Kamarianakis et al. (2012) shows that model training is computationally expensive, with frequent updating being prohibitive. On the other hand, deep learning with dropout can find a sparse model that can be frequently updated in real time. There are several analytical approaches to traffic flow modeling (Anacleto et al., 2013, Blandin et al., 2012, Chiou et al., 2014, Polson and Sokolov, xxxx, Polson and Sokolov, 2015, Work et al., 2010). These approaches can perform very well on filtering and state estimation. The caveat is that they are hard to implement on large-scale networks. Bayesian approaches have been shown to be efficient for handling large-scale transportation network state estimation problems (Tebaldi and West, 1998). Westgate et al. (2013) discusses ambulance travel time reliability using noisy GPS data for both path travel time and individual road segment travel time distributions. Anacleto et al. (2013) provides a dynamic Bayesian network to model external intervention techniques that accommodates situations with suddenly changing traffic variables.

Statistical and machine learning methods for traffic forecasting are compared in Smith and Demetsky (1997). Sun et al. (2006) provides a Bayes network algorithm, in which the conditional probability of a traffic state on a given road, given the states of its topological neighbors in the road network, is calculated. The resulting joint probability distribution is a mixture of Gaussians. Bayes networks for estimating travel times were suggested by Horvitz et al., which eventually became a commercial product that led to the start of Inrix, a traffic data company. Wu et al. (2004) applies a machine-learning method, the support vector machine (SVM) (Polson and Scott, 2011), to forecast travel times, and Quek et al. (2006) proposes a fuzzy neural-network approach to address nonlinearities in traffic data. Rice and van Zwet (2004) argues that there is a linear relation between future travel times and currently estimated conditions and proposes a time-varying coefficients regression model to predict travel times.

Autoregressive integrated moving average (ARIMA) and exponential smoothing (ES) models for traffic forecasting are studied in Tan et al. (2009) and Van Der Voort et al. (1996). A Kohonen self-organizing map is proposed as an initial classifier. Van Lint (2008) addresses real-time parameter learning and improves the quality of forecasts using an extended Kalman filter. Ban et al. (2011) proposes a method for estimating queue lengths at controlled intersections, based on travel time data measured by GPS probes. The method relies on detecting discontinuities and changes of slope in the travel time data. Ramezani and Geroliminis (2015) combines traffic flow shockwave analysis with data mining techniques. Oswald et al. (2000) argues that non-parametric methods produce better forecasts than parametric models due to their ability to better capture spatial-temporal relations and non-linear effects. Vlahogianni et al. (2014) provides an extensive recent review of the literature on short-term traffic predictions.

There are several issues not addressed in the current literature (Vlahogianni et al., 2014). One is prediction at a network level using data-driven approaches. There are two situations when a data-driven approach might be preferable to methodologies based on traffic flow equations. First, estimating boundary conditions is a challenging task, since even in systems that rely on loop detectors as traffic sensors, detectors are typically not installed on ramps. Missing data problems are usually addressed using data imputation (Muralidharan and Horowitz, 2009) or weak formulations of boundary conditions (Strub and Bayen, 2006). Our results show that a data-driven approach can efficiently forecast flows without boundary measurements from ramps. Second, physics-based approaches have a limited ability to model urban arterials; for example, Qiao et al. (2001) shows that analytical approaches fail to provide good forecasts. Another challenge is to identify spatio-temporal relations in flow patterns; see Vlahogianni et al. (2014) for further discussion. Data-driven approaches provide a flexible alternative to physical laws of traffic flows.

A further challenge is to perform model selection and residual diagnostics (Vlahogianni et al., 2014). Model selection can be tackled by regularizing the loss function and using cross-validation to select the optimal penalty weight. To address this issue, we construct our deep learning architecture as follows. First, we use a regularized vector autoregressive model to perform predictor selection. Then, our deep learning model addresses the non-linear and non-stationary relations between variables (speed measurements) using a series of activation functions.

Breiman (2003) describes the trade-off between machine learning and traditional statistical methods. Machine learning has been widely applied (Ripley, 1996) and shown to be particularly successful in traffic pattern recognition. For example, shallow neural networks for traffic applications (Chen and Grant-Muller, 2001) use a memory-efficient dynamic neural network based on a resource allocating network (RAN) with a single hidden layer of Gaussian radial basis function activation units. Zheng et al. (2006) develops several one-hidden-layer networks to produce fifteen-minute forecasts. Two types of networks, one with a tanh activation function and one with a Gaussian radial basis function, were developed. Several forecasts were combined using Bayes factors that calculate an odds ratio for each of the models dynamically. Van Lint et al. (2005) proposes a state-space neural network, and a multiple-hypothesis approach relies on using several neural network models at the same time (van Hinsbergen et al., 2009). Day of the week and time of day as inputs to a neural network were proposed in Çetiner et al. (2010). Our work is closely related to Lv et al. (2015), which demonstrates that deep learning can be effective for traffic forecasts. A stacked auto-encoder was used to learn the spatial-temporal patterns in the traffic data, with training performed in a greedy layer-wise fashion. Ma et al. (2015) proposed a recurrent architecture, a Long Short-Term Memory neural network (LSTM), for travel speed prediction. Our approach builds on this by showing the additional advantage of deeper hidden layers together with sparse autoregressive techniques for variable selection.

Deep learning learns a high-dimensional function via a sequence of semi-affine non-linear transformations. The deep architecture is organized as a graph. The nodes of the graph are units, connected by links that propagate activation, calculated at the origin, to the destination units. Each link has a weight that determines the relative strength and sign of the connection, and each unit applies an activation function to the weighted sum of incoming activations. The activation function is given in advance, such as a hard threshold, a sigmoid function, or a tanh. A particular class of deep learning models that uses a directed acyclic graph structure is called a feed-forward neural network. There is a vast literature on this topic; earlier works include Bishop (1995) and Haykin (2004).

Deep learning allows for efficient modeling of nonlinear functions; see the original problem of Poincaré and Hilbert. The advantage of deep hidden layers for a high-dimensional input variable x = (x_1, …, x_p) is that the activation functions are univariate, which implicitly requires the specification of the number of hidden units N_l for each layer l.

The Kolmogorov-Arnold representation theorem (Kolmogorov, 1956) provides the theoretical motivation for deep learning. The theorem states that any continuous function F(x) of n variables can be represented as

F(x) = \sum_{j=1}^{2n+1} g_j \Big( \sum_{i=1}^{n} h_{ij}(x_i) \Big),

where g_j and h_{ij} are continuous functions, and the h_{ij} form a universal basis that does not depend on F. This remarkable representation result implies that any continuous function can be represented using operations of summation and function composition. For a neural network, it means that any function of n variables can be represented as a neural network with one hidden layer and 2n+1 activation units. The difference between the theorem and neural network representations is that the functions h_{ij} are not necessarily affine. Much research has focused on how to find such a basis. In their original work, Kolmogorov and Arnold develop the functions in a constructive fashion. Diaconis and Shahshahani (1984) characterizes projection pursuit functions for specific types of input functions.

A deep learning predictor, denoted by \hat{y}(x), takes an input vector x = (x_1, …, x_p) and outputs y via different layers of abstraction that employ hierarchical predictors by composing L non-linear semi-affine transformations. Specifically, a deep learning architecture is as follows. Let f_1, …, f_n be given univariate activation link functions, e.g. the sigmoid (1/(1+e^{-x})), cosh(x), tanh(x), Heaviside gate functions (I(x > 0)), rectified linear units (max{x, 0}), or indicator functions (I(x ∈ R)) for trees. The composite map is defined by

\hat{y}(x) := F(x) = (f_{w_n,b_n} \circ \cdots \circ f_{w_1,b_1})(x),

where f_{w_l,b_l} is a semi-affine activation rule defined by

f_{w_l,b_l}(x) = f\Big( \sum_{j=1}^{N_l} w_{lj} x_j + b_l \Big) = f(w_l^T x + b_l), \quad l = 1, …, n.

Here N_l denotes the number of units at layer l. The weights w_l ∈ R^{N_l × N_{l-1}} and offsets b_l ∈ R need to be learned from training data.

Data dimension reduction of a high-dimensional map F is performed via the composition of univariate semi-affine functions. Let z_l denote the l-th layer of hidden features, with x = z_0. The final output is the response y, which can be numeric or categorical. The explicit structure of a deep prediction rule is then

z_1 = f(w_0^T x + b_0)
z_2 = f(w_1^T z_1 + b_1)
⋮
z_n = f(w_{n-1}^T z_{n-1} + b_{n-1})
\hat{y}(x) = w_n^T z_n + b_n.
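The layered prediction rule above can be sketched in a few lines of Python/NumPy; the layer sizes, tanh activation, and random weights below are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

def deep_predictor(x, weights, biases, f=np.tanh):
    """Compose semi-affine layers z_l = f(w_{l-1}^T z_{l-1} + b_{l-1}),
    then apply a final affine output layer y = w_n^T z_n + b_n."""
    z = x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = f(W @ z + b)                  # hidden layer: nonlinear activation
    return weights[-1] @ z + biases[-1]   # output layer: affine, no activation

# Illustrative sizes: p = 3 inputs, one hidden layer of N_1 = 4 tanh units, 1 output.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(1, 4))]
biases = [np.zeros(4), np.zeros(1)]
y_hat = deep_predictor(np.array([0.5, -1.0, 2.0]), weights, biases)
```

Adding more (W, b) pairs to the lists deepens the network without changing the loop.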

In many cases there is an underlying probabilistic model, denoted by p(y | \hat{y}(x)). This leads to a training problem given by the optimization problem

\min_{w,b} \frac{1}{T} \sum_{i=1}^{T} -\log p(y_i | \hat{y}_{w,b}(x_i)),

where p(y_i | \hat{y}(x)) is the probability density function given by the specification y_i = F(x_i) + ε_i. For example, if ε_i is normal, we train \hat{w}, \hat{b} via an ℓ2-norm, \min_{w,b} \|y - F_{w,b}(x)\|^2 = \sum_{i=1}^{T} (y_i - F_{w,b}(x_i))^2. One of the key advantages of deep learning is that the derivative information \nabla_{w,b} \, l(y, \hat{y}_{w,b}(x)) is available in closed form via the chain rule. Typically, a regularization penalty, defined by λφ(w, b), is added to introduce a bias-variance trade-off and provide good out-of-sample predictive performance. An optimal regularization parameter, λ, can be chosen using out-of-sample cross-validation techniques. One of the advantages of the ℓ1-penalized least squares formulation is that it leads to a convex, though non-smooth, optimization problem. Efficient algorithms (Kim et al., 2007) exist to solve these problems, even for high-dimensional cases.
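To illustrate the ℓ1-penalized least squares problem for a linear layer, the sketch below uses proximal gradient descent (ISTA) rather than the interior-point method of Kim et al. (2007); the synthetic data, penalty weight, and iteration count are assumptions chosen for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise shrinkage toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_least_squares(X, y, lam=0.1, steps=500):
    """Solve min_w (1/2T) * ||y - X w||^2 + lam * ||w||_1 by proximal gradient."""
    T, p = X.shape
    w = np.zeros(p)
    eta = T / np.linalg.norm(X, 2) ** 2           # step size from Lipschitz bound
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / T              # gradient of the smooth part
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

# Synthetic example: only the first two of ten predictors matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10); w_true[:2] = [2.0, -3.0]
y = X @ w_true + 0.01 * rng.normal(size=200)
w_hat = l1_least_squares(X, y)                    # trailing entries shrink to zero
```

The convexity noted in the text is what makes this simple iteration reliable: any stationary point is the global minimizer, though the ℓ1 penalty introduces a small shrinkage bias in the nonzero coefficients.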

There is a strong connection with nonlinear multivariate non-parametric models, which we now explore. In a traditional statistical framework, the non-parametric approach seeks to approximate the unknown map F using a family of functions defined by the expression

F(x) = \sum_{k=1}^{N} w_k f_k(x).

The functions f_k are called basis functions and play a role similar to a functional space basis, i.e. they are chosen to give a good approximation to the unknown map F. In some cases \{f_k\}_{k=1}^{N} actually do form a basis of a space, e.g. Fourier (f_k(x) = \cos(kx)) and wavelet bases. Multivariate basis functions are usually constructed using functions of a single variable. Four examples are radial functions, ridge functions, kernel functions, and indicator functions:

f_k(x) = κ(\|x - γ_k\|^2)  (radial function)
f_k(x) = κ(w^T x + w_0)  (ridge function)
f_k(x) = κ(\|x - γ_k\| / h)  (kernel estimator)
f_k(x) = I(x ∈ C_k)  (tree indicator function)

Here κ is typically chosen to be a bell-shaped function (e.g., e^{-x^2} or 1/\cosh(x)). The ridge function, a composition of an inner product and a non-linear univariate function, is arguably one of the simplest non-linear multivariate functions. Two of the most popular types of neural networks are constructed as compositions of radial or ridge functions. Popular non-parametric tree-based models (Breiman et al., 1984) can be represented in this form by choosing f_k to be a tree indicator function. In tree-based regression, the weights α_k = \bar{Y}_k are the averages of (Y_i | X_i ∈ C_k) and C_k is a box set in R^p with zero or more extreme directions (open sides).
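The four multivariate basis constructions translate directly into code. The choice of κ, the centers γ_k, the bandwidth h, and the box bounds below are illustrative assumptions.

```python
import numpy as np

kappa = lambda u: np.exp(-u**2)          # a bell-shaped univariate function

def radial(x, gamma):                    # kappa(||x - gamma||^2)
    return kappa(np.sum((x - gamma) ** 2))

def ridge(x, w, w0):                     # kappa(w^T x + w0)
    return kappa(w @ x + w0)

def kernel(x, gamma, h):                 # kappa(||x - gamma|| / h)
    return kappa(np.linalg.norm(x - gamma) / h)

def tree_indicator(x, lo, hi):           # I(x in the box C_k = [lo, hi])
    return float(np.all((lo <= x) & (x <= hi)))

x = np.array([0.2, -0.1])
r = radial(x, gamma=np.zeros(2))         # largest when x sits at the center gamma
t = tree_indicator(x, lo=np.array([-1.0, -1.0]), hi=np.array([1.0, 1.0]))
```

An approximation F(x) ≈ Σ_k w_k f_k(x) is then a weighted sum of such terms, with the weights fitted to data.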

Another set of basis functions is the Fourier series, used primarily for time series analysis, where f_k(x) = \cos(kx). A spline approximation can also be derived by using polynomial functions with finite support as a basis.

Ridge-based models can efficiently represent high-dimensional data sets with a small number of parameters. We can think of deep features (outputs of hidden layers) as projections of the input data into a lower-dimensional space. Deep learners can deal with the curse of dimensionality because ridge functions determine directions in the (z_{k-1}, z_k) input space where the variance is very high. These directions are chosen as global ones and represent the most significant patterns in the data. This approach resembles other well-studied techniques such as projection pursuit (Friedman and Tukey, 1974) and principal component analysis.

Deep learning for traffic flow prediction

Let x_{t+h|t} be the forecast of traffic flow speeds at time t + h, given measurements up to time t. Our deep learning traffic architecture is then

y(x) := x_{t+40|t} = (x_{1,t+40|t}, …, x_{n,t+40|t}).

To model the traffic flow data x_t = (x_{t-k}, …, x_t) we use predictors x given by

x_t = \mathrm{vec} \begin{pmatrix} x_{1,t-40} & \cdots & x_{1,t} \\ \vdots & & \vdots \\ x_{n,t-40} & \cdots & x_{n,t} \end{pmatrix}.

Here n is the number of locations on the network (loop detectors) and x_{i,t} is the cross-section traffic flow speed at location i at time t. We use vec to denote the vectorization transformation, which converts the matrix
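A sketch of how such a lagged predictor vector could be assembled from a detector-by-time speed array; the 40-minute window is taken from the text, while the 5-minute sampling interval (hence 8 lags) and the synthetic speeds are assumptions for illustration.

```python
import numpy as np

def build_predictor(speeds, t, lags=8):
    """vec of the n x (lags + 1) matrix [x_{i,t-lags}, ..., x_{i,t}] over detectors i.

    speeds : (n_detectors, n_times) array of cross-section speeds.
    Returns a vector of length n_detectors * (lags + 1).
    """
    window = speeds[:, t - lags : t + 1]   # lagged measurements up to time t
    return window.flatten()                # row-major vec: one detector after another

# e.g. n = 21 loop detectors on I-55, 100 time steps of synthetic speeds (mph)
rng = np.random.default_rng(2)
speeds = rng.uniform(20.0, 70.0, size=(21, 100))
x_t = build_predictor(speeds, t=50)        # length 21 * 9 = 189
```

This vector is the input to the first (sparse linear) layer, which selects which detector-lag pairs matter for the 40-minute-ahead forecast.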

Chicago traffic flow during special events

To illustrate our methodology, we use data from twenty-one loop detectors installed on a northbound section of Interstate I-55. These loop detectors span 13 miles of the highway. Traffic flow data is available from the Illinois Department of Transportation (see the Lake Michigan Interstate Gateway Alliance, http://www.travelmidwest.com/, formerly the Gary-Chicago-Milwaukee Corridor, or GCM). The data is measured by loop-detector sensors installed on interstate highways. A loop detector is a simple

Discussion

The main contribution of this paper is development of an innovative deep learning architecture to predict traffic flows. The architecture combines a linear model that is fitted using 1 regularization and a sequence of tanh layers. The first layer identifies spatio-temporal relations among predictors and other layers model nonlinear relations. The improvements in our understanding of short-term traffic forecasts from deep learning are twofold. First, we demonstrate that deep learning provides a

References (74)

  • E.I. Vlahogianni et al.

    Optimized and meta-optimized neural networks for short-term traffic flow prediction: a genetic approach

    Transport. Res. Part C: Emerg. Technol.

    (2005)
  • E.I. Vlahogianni et al.

    Short-term traffic forecasting: where we are and where we’re going

    Transport. Res. Part C: Emerg. Technol.

    (2014)
  • Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et...
  • O. Anacleto et al.

    Multivariate forecasting of road traffic flows in the presence of heteroscedasticity and measurement errors

    J. R. Statist. Soc.: Ser. C (Appl. Statist.)

    (2013)
  • X.J. Ban et al.

    Real time queue length estimation for signalized intersections using travel times from mobile sensors

    Transport. Res. Part C: Emerg. Technol.

    (2011)
  • Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I.J., Bergeron, A., Bouchard, N., Bengio, Y., 2012....
  • C.M. Bishop

    Neural Networks for Pattern Recognition

    (1995)
  • G.E. Box et al.

    Distribution of residual autocorrelations in autoregressive-integrated moving average time series models

    J. Am. Statist. Assoc.

    (1970)
  • L. Breiman

    Statistical modeling: the two cultures

    Qual. Control Appl. Statist.

    (2003)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • B.G. Çetiner et al.

A neural network based traffic flow prediction model

    Math. Comput. Appl.

    (2010)
  • Y.-C. Chiou et al.

    A novel method to predict traffic features based on rolling self-structured traffic patterns

    J. Intell. Transport. Syst.

    (2014)
  • P. Dellaportas et al.

    Joint specification of model space and parameter space prior distributions

    Statist. Sci.

    (2012)
  • P. Diaconis et al.

    On nonlinear functions of linear combinations

    SIAM J. Scient. Statist. Comput.

    (1984)
  • J.H. Friedman et al.

    A projection pursuit algorithm for exploratory data analysis

    IEEE Trans. Comp.

    (1974)
  • E.I. George

    The variable selection problem

    J. Am. Statist. Assoc.

    (2000)
  • F. Gers et al.

    LSTM recurrent networks learn simple context-free and context-sensitive languages

    IEEE Trans. Neural Netw.

    (2001)
  • A. Graves et al.

    Offline handwriting recognition with multidimensional recurrent neural networks

  • Hayashi, F., 2000. Econometrics. Princeton University Press, pp. 60–69 (Section...
  • S. Haykin

    A comprehensive foundation

    Neural Netw.

    (2004)
  • D.P. Helmbold et al.

    On the inductive bias of dropout

    J. Mach. Learn. Res.

    (2015)
  • G.E. Hinton et al.

    Reducing the dimensionality of data with neural networks

    Science

    (2006)
  • S. Hochreiter et al.

    Long short-term memory

    Neural Comput.

    (1997)
  • Horvitz, E.J., Apacible, J., Sarin, R., Liao, L. Prediction, Expectation, and Surprise: Methods, Designs, and Study of...
  • Y. Kamarianakis et al.

    Real-time road traffic forecasting using regime-switching space-time models and adaptive LASSO

    Appl. Stoch. Mod. Bus. Indust.

    (2012)
  • S.J. Kim et al.

An interior-point method for large-scale ℓ1-regularized least squares

    IEEE J. Select. Top. Sig. Process.

    (2007)