Real-Time Traffic Flow Forecasting via a Novel Method Combining Periodic-Trend Decomposition

Abstract: Accurate and timely traffic flow forecasting is a critical task of the intelligent transportation system (ITS). The predicted results offer the necessary information to support the decisions of administrators and travelers. To investigate the trend and periodic characteristics of traffic flow and develop a more accurate prediction, a novel method combining periodic-trend decomposition (PTD) is proposed in this paper. This hybrid method is based on the principle of "decomposition first and forecasting last". The well-designed PTD approach can decompose the original traffic flow into three components: trend, periodicity, and remainder. The periodicity is a strict periodic function and is predicted by cycling, while the trend and remainder are predicted by modelling. To demonstrate the universal applicability of the hybrid method, four prevalent models are separately combined with PTD to establish hybrid models. Traffic volume data are collected from the Minnesota Department of Transportation (Mn/DOT) and used to conduct experiments. Empirical results show that the mean absolute error (MAE), mean absolute percentage error (MAPE), and mean square error (MSE) of the hybrid models are reduced by 17%, 17%, and 29% on average compared with the individual models, respectively. In addition, the hybrid method is robust for multi-step prediction. These findings indicate that the proposed method combining PTD is promising for traffic flow forecasting.


Introduction
Over the past decades, with the sharp increase in car ownership, congestion has become one of the most troubling problems in urban areas. It has seriously affected travel quality through prolonged travel times, increased fuel consumption, and automobile exhaust pollution. To solve this series of traffic problems, one effective way, universally acknowledged in present research [1,2], is to develop intelligent transportation systems (ITS). A tremendous number of sensors are distributed throughout road networks, and the transportation information is integrated into the ITS. Thus, the ITS makes it possible for administrators to monitor real-time traffic status and respond to traffic incidents in a timely manner. Traffic flow forecasting is regarded as the most fundamental task of ITS. For administrators, the predicted results can give early warnings of bursts in traffic flow and provide reliable information for reasonable management. Meanwhile, for travelers, the predicted traffic flow can serve as a reference for choosing the route with the shortest travel time.
Generally, traffic flow is defined as a measure of traffic status (i.e., total volume [3][4][5][6][7], average speed [1,[8][9][10], average travel time [11,12]) over a constant time interval at a target road segment. Obviously, traffic flow is time-varying, so traffic flow forecasting tasks can be addressed as time series problems. Present studies [13,14] have demonstrated that traffic flow has significant autocorrelation, which means it is highly related to past data. Beyond that, trend and periodicity have been identified as important characteristics of traffic flow. Since Ahmed and Cook [15] first proposed the traffic flow forecasting problem in 1979, much effort has been devoted to developing methods to predict traffic flow more accurately. According to several studies [16][17][18], prediction methods are mainly divided into three categories: statistical models, artificial intelligent (AI) models, and hybrid models. Owing to their flexible frameworks, hybrid models have attracted more attention than the others, and numerous studies have demonstrated that combining different models is an effective approach to improving prediction accuracy. In recent studies [3,10,[19][20][21][22][23][24][25], methods combining time series decomposition approaches have attracted extensive interest. The main idea of this kind of model is "decomposition first and forecasting last". In detail, a decomposition approach is employed to disaggregate the original complex time series into several regular and simple components, and then these components are predicted separately. The final outcomes are obtained by summing the predicted results. However, most decomposition approaches were originally developed to analyze digital signals, so the decomposition results are a series of components with different frequencies and amplitudes.
In this paper, in order to give an insight into the characteristics of traffic flow, we propose a novel method combining periodic-trend decomposition (PTD). PTD is developed to disaggregate the original data into three additive components: trend, periodicity, and remainder. These three components show different characteristics of the original traffic flow: the trend represents the day-to-day fluctuation, the periodicity represents the variation within a day, and the remainder represents noise. In the prediction stage, the periodicity is a strict periodic function and is predicted by cycling; the trend and remainder are predicted by modelling. To test and evaluate the proposed method, traffic volume data from three detection stations, provided by the Minnesota Department of Transportation (Mn/DOT), are utilized to conduct experiments and comparative analyses.
The rest of this paper is organized as follows: Section 2 reviews the existing literature on traffic flow forecasting. Section 3 formulates the proposed PTD approach and the hybrid prediction method. Section 4 describes the datasets and the experimental process. Section 5 presents the results and discussion. Finally, Section 6 concludes this work.

Traffic Flow Forecasting Models
Due to its time-varying nature, traffic flow is usually regarded as a kind of time series in most research. Over the past several decades, many models have been introduced and developed to improve prediction accuracy. A summary of the studied literature is presented in Table 1. According to studies [16][17][18], these models are roughly classified into three categories: statistical models, artificial intelligent (AI) models, and hybrid models.

1.
Statistical models. In early studies, many researchers proposed statistical models to predict traffic flow under the assumptions of linearity and stationarity. The classical statistical models include the autoregressive integrated moving average (ARIMA) [14,15,33,44], Kalman filtering (KF) [11,26], historical average (HA) [33], and exponential smoothing [39,40]. These models are simple and have low computational complexity. However, traffic flow contains both linear and nonlinear components, and is complex and volatile [4,10]. For example, traffic volume increases rapidly in rush hours and decreases rapidly afterwards. Due to the assumption of linearity, these statistical models cannot learn such characteristics of traffic flow well.

2.
Artificial intelligent (AI) models. In later studies, it has been proven that traffic flow is nonlinear and nonstationary [4,5,31]. Many researchers developed AI models to capture the nonlinearity of traffic flow. These models mainly include the k-nearest neighbor (KNN) [27,28,33], support vector regression (SVR) [5,12,29,41], artificial neural network (ANN) [7,30,31,33], wavelet neural network (WNN) [7,37], and extreme learning machine (ELM) [23,40]. Due to their nonlinear modelling ability, the AI models usually perform better than the statistical models when the data scale is large. Beyond that, deep learning models, which are improved from ANN and have more complex neural network structures, have been introduced and developed in recent years to realize more accurate prediction. Famous deep learning models include stacked auto-encoders (SAE) [4], the long short-term memory neural network (LSTM) [9,34,45], the deep belief network (DBN) [16,42], and the convolutional neural network (CNN) [32]. LSTM in particular has shown superiority in traffic flow forecasting tasks due to its temporal modelling ability [9,34,45]. Because of their universal approximation capability, deep learning models can approximate any nonlinear function and have shown outstanding performance on multi-source data. Nevertheless, owing to high computational complexity, these models consume a large amount of time during training.

3.
Hybrid models. Any method combining two or more models can be treated as a hybrid model. Hybrid models have flexible frameworks and can integrate the merits of different methods. In addition, both theoretical and empirical findings have indicated that hybridizing different models is an effective way of improving prediction accuracy. Thus, hybrid models have received more attention than the two aforementioned kinds. Zeng et al. [35] and Tan et al. [36] both proposed combining ARIMA with ANN, to capture both the linearity and nonlinearity of traffic flow. In order to mine the spatial-temporal features of traffic flow, Luo et al. [6] proposed a hybrid model of KNN and LSTM, and Li et al. [38] proposed a hybrid model of ARIMA and SVR. To enhance prediction performance, a genetic algorithm (GA) was employed by Hong et al. [41] and Zhang et al. [42] to optimize the hyperparameters of SVR and DBN, and Feng et al. [43] employed particle swarm optimization (PSO) to optimize the hyperparameters of SVR.
As a novel direction for hybrid models, methods combining time series decomposition approaches have received great attention. These hybrid models are based on the principle of "decomposition first and forecasting last": a decomposition approach is utilized to extract several components, and then these components are predicted respectively. The decomposition approaches that have been successfully applied to the analysis and prediction of traffic flow include the Fourier transform (FT) [19], singular spectrum analysis (SSA) [21,23], empirical mode decomposition (EMD) [3,10,22], and wavelet decomposition (WD) [25]. Luo et al. [19] employed FT to decompose the original traffic flow into trend and residual series; they predicted the trend series by extreme extrapolation and the residual series by SVR. Guo et al. [21] implemented SSA to disaggregate traffic volume into a smoothed series and a residue; they forecasted the smoothed series by KNN and predicted the residue by historical average. Shang et al. [23] applied SSA to disaggregate the traffic volume series into several components and predicted these components by ELM. Chen et al. [3], Wang et al. [10], and Chen et al. [22] all utilized EMD to decompose the original traffic flow into several intrinsic mode functions (IMFs), then predicted each IMF by ANN [3], ARIMA [10], and SVR [22], respectively. Zhang et al. employed WD to disaggregate the traffic speed series into one low-frequency sequence and three high-frequency sequences; they proposed a graph convolutional recurrent neural network (GCRNN) to forecast the low-frequency component and used ARIMA to forecast the high-frequency components.
In addition, many researchers [13,[46][47][48] have suggested that traffic flow has inherent trend and periodic characteristics, and that accounting for these characteristics during modelling can improve prediction accuracy. The aforementioned decomposition approaches, however, were initially designed to analyze digital signals, and their decomposed results are a set of sub-time series with different frequencies and amplitudes. These approaches fail to extract the trend and periodic characteristics of traffic flow.

STL and the Proposed PTD Approach
As a time series decomposition approach, the seasonal-trend decomposition procedure based on LOESS (STL) [49] can disaggregate the original data into trend, seasonal, and remainder components. STL was first proposed by Cleveland [49] to decompose long-term monthly CO2 concentration sequences with seasonal characteristics. Because of its strong robustness to outliers, STL is extensively employed to decompose time series with significant seasonality [50][51][52][53][54]. In addition, some studies have found that combining STL with prediction models can improve the accuracy of time series forecasting tasks. For example, Xiong et al. [53] utilized STL to decompose seasonal agricultural commodity prices and employed extreme learning machines (ELM) to predict every component. In a similar way, Qin et al. [54] combined STL with an echo state network (ESN) to improve the performance of monthly passenger flow forecasting. However, STL has two shortcomings which make it fail in the analysis of real-time traffic flow. One is that the decomposed seasonal component differs slightly between periods; in other words, the seasonal component is not a strict periodic function. The other is that STL is specially designed for macroscopic time series: it is a static procedure for historical data only and is unable to deal with out-of-sample data. Specifically, STL is applicable to complete and static time series without updating, whereas real-time traffic flow is dynamic, with new data arriving as time goes on.
In this study, by overcoming these two shortcomings of STL, the periodic-trend decomposition (PTD) approach is proposed and formulated in Section 3.1. For the first shortcoming, PTD adjusts the seasonality into a periodicity that is a strict periodic function. For the second, a procedure for out-of-sample decomposition is additionally formulated to deal with out-of-sample data. Thus, with full consideration of the dynamicity of real-time traffic flow, PTD can decompose the original data into trend, periodicity, and remainder.

Contributions
In this paper, we propose a novel method combining PTD to decompose and predict real-time traffic flow. In order to demonstrate the universal applicability of the PTD approach, both statistical and AI models (i.e., ARIMA, SVR, ANN, LSTM) are combined with PTD to establish hybrid models (i.e., PTD-ARIMA, PTD-SVR, PTD-ANN and PTD-LSTM). These models are tested and evaluated based on real-world traffic volume data collected from Mn/DOT. The main contributions of this study are summarized as follows:

1.
A novel PTD approach is specially formulated for real-time traffic flow decomposition. We develop the PTD approach to extract the trend and periodic characteristics of traffic flow. Fully considering the dynamicity of traffic flow in the real world, PTD is adapted from STL to decompose traffic flow data. PTD can decompose the original traffic flow into three additive components: trend, periodicity, and remainder. These components reveal the inner characteristics of traffic flow.

2.
A novel method combining PTD is developed to predict traffic flow more accurately. After completing decomposition, the periodicity is predicted by cycling, and the trend and remainder are respectively predicted by modelling. Then, the three predicted results are summed as final outcomes.

3.
Multi-step prediction is implemented to provide more information about future traffic flow. Traditional single-step prediction is unable to provide enough information for the further plans and decisions of ITS; thus, multi-step prediction is necessary. Based on an iterated strategy, a multi-step prediction approach is developed to extend the proposed hybrid method.

Methodology
In this section, the method combining PTD to forecast traffic flow is elaborated. First, the PTD method is formulated. Then, four prevalent prediction models are presented. Finally, the hybrid prediction method is illustrated.

Locally Weighted Regression (LOWESS)
Locally weighted regression (LOWESS, also known as LOESS) was first proposed by Cleveland [49] for smoothing scatterplots and serves as the core smoother in STL. In this study, LOWESS is also employed in PTD to smooth curves, in order to extract the trend and periodic components. As shown in Figure 1, LOWESS is a nonparametric regression model: the target point is estimated from its k nearest neighbors with distance-based weights.
Suppose a group of independent variables x_i and dependent variables y_i (i = 1, 2, ...). The estimated value ŷ_i is evaluated from the K nearest neighbors of y_i by:

ŷ_i = ( Σ_{k=1}^{K} w_k y_k ) / ( Σ_{k=1}^{K} w_k )   (1)

where w_k is the weight of the kth near neighbor y_k, computed as:

w_k = D( |x_i − x_k| / λ )   (2)

In this equation, D(x) denotes the Epanechnikov kernel function, expressed as:

D(x) = (3/4)(1 − x²) if |x| ≤ 1, and D(x) = 0 otherwise   (3)

where λ denotes the kernel width of the Epanechnikov kernel and represents the distance between x_i and its Kth nearest neighbor x_[K], expressed as:

λ = |x_i − x_[K]|   (4)
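The kernel-weighted LOWESS estimate described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; function names are ours.

```python
def epanechnikov(x):
    """Epanechnikov kernel D(x): (3/4)(1 - x^2) for |x| <= 1, else 0."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0

def lowess_estimate(xs, ys, xi, K):
    """Estimate y at xi as the kernel-weighted average of its K nearest neighbors."""
    # K nearest neighbors of xi along the x-axis
    neighbors = sorted(range(len(xs)), key=lambda j: abs(xs[j] - xi))[:K]
    # kernel width lambda = distance to the Kth nearest neighbor
    lam = max(abs(xs[j] - xi) for j in neighbors)
    weights = [epanechnikov(abs(xs[j] - xi) / lam) for j in neighbors]
    return sum(w * ys[j] for w, j in zip(weights, neighbors)) / sum(weights)
```

Because the Kth neighbor sits exactly at distance λ, its kernel weight is zero; nearer points dominate the local fit, which is what gives LOWESS its smoothing behavior.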


Decomposition for In-Sample
The in-sample data are the historical traffic flow over several periods. Given an in-sample traffic flow Y(t), the PTD method disaggregates it into three additive components: trend T(t), periodicity P(t), and remainder R(t):

Y(t) = T(t) + P(t) + R(t)   (5)

For in-sample decomposition, PTD is an iterative method similar to STL. The flow chart of PTD is shown in Figure 2. During each iteration, the periodicity and trend are respectively updated by periodic smoothing and trend smoothing. Suppose Y(t) denotes the original traffic flow, and C and M denote the length of one period and the number of periods, respectively; the number of samples in Y(t) is then MC. In the initialization, the trend is set as T(t) = 0, t = 1, 2, ..., MC. The loop process is composed of six steps, as follows:

Step 1: Detrending. The trend T(t) is removed from the original data Y(t) to obtain the detrended series Ỹ(t):

Ỹ(t) = Y(t) − T(t)

Step 2: Cycle-subseries smoothing. The details of this step are displayed in Figure 2b. Each cycle-subseries S_c(t) of Ỹ(t) is smoothed by LOWESS with K1 and extended one period forward and backward. Specifically, for each subseries, the independent variables are X = [1, 2, 3, 4, ..., M]^T and the dependent variables are the M values of S_c(t); evaluating the LOWESS fit at positions 0 and M + 1 provides the one-period extensions. The result is the temporary series Ŝ(t), t = −C + 1, −C + 2, ..., MC + C.

Step 3: Low-pass filtering of the smoothed cycle-subseries. The temporary series Ŝ(t), t = −C + 1, −C + 2, ..., MC + C is smoothed by a low-pass filter to obtain the series L(t), t = 1, 2, ..., MC. The low-pass filter consists of a moving average of length C, followed by another moving average of length C, followed by a moving average of length 3, followed by a LOWESS smooth with K2.
Step 4: Detrending of the smoothed cycle-subseries. The preliminary periodicity F(t) is calculated as:

F(t) = Ŝ(t) − L(t), t = 1, 2, ..., MC

In the original STL method [49], F(t) is regarded as the seasonal component, which shows obvious periodicity. However, the values at the same position within each period differ slightly, so future values cannot be forecasted directly by cycling. In this study, this component is therefore normalized as follows: firstly, a one-period-length periodicity series Q(t), t = 1, 2, ..., C, is computed by averaging F(t) over the M periods:

Q(t) = (1/M) Σ_{m=0}^{M−1} F(t + mC), t = 1, 2, ..., C   (9)

Then, Q(t) is extended to M periods to obtain the whole periodicity P(t), t = 1, 2, ..., MC.
Step 5: De-periodicity. The periodicity P(t) is subtracted from the original series Y(t) to obtain a periodically adjusted series P̃(t):

P̃(t) = Y(t) − P(t)   (10)

Step 6: Trend smoothing. P̃(t) is smoothed by LOWESS with K3 to obtain the trend T(t), t = 1, 2, ..., MC.
After completing the iterations, the remainder R(t), t = 1, 2, ..., MC, is calculated as:

R(t) = Y(t) − T(t) − P(t)

Following the above steps, the original time series Y(t) is decomposed into three additive components: trend T(t), periodicity P(t), and remainder R(t), as expressed in Equation (5).
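The six-step loop can be sketched in heavily simplified form. This is a rough sketch under stated simplifications: plain period averaging stands in for cycle-subseries smoothing and the low-pass filter, a centered moving average stands in for LOWESS trend smoothing, and all names are ours rather than the paper's.

```python
def ptd_insample(Y, C, M, n_iter=5):
    """Simplified in-sample decomposition: Y(t) -> trend T, periodicity P, remainder R."""
    n = M * C
    T = [0.0] * n                                    # initialization: T(t) = 0
    for _ in range(n_iter):
        detrended = [Y[t] - T[t] for t in range(n)]  # Step 1: detrending
        # Steps 2-4 collapsed: average each cycle-subseries to get one period Q
        Q = [sum(detrended[m * C + t] for m in range(M)) / M for t in range(C)]
        P = [Q[t % C] for t in range(n)]             # extend Q to all M periods
        adjusted = [Y[t] - P[t] for t in range(n)]   # Step 5: de-periodicity
        # Step 6: trend smoothing (centered moving average stands in for LOWESS)
        h = C // 2
        T = [sum(adjusted[max(0, t - h):t + h + 1])
             / len(adjusted[max(0, t - h):t + h + 1]) for t in range(n)]
    R = [Y[t] - T[t] - P[t] for t in range(n)]       # remainder
    return T, P, R
```

By construction the three outputs sum exactly back to Y(t), and P(t) repeats exactly every C samples, which is the property that later allows the periodicity to be predicted by cycling.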

Decomposition for Out-Of-Sample
Owing to the dynamicity of real-time traffic flow, new data arrive as time goes on, and the traffic flow series is updated simultaneously. The newly arriving data are regarded as out-of-sample. The proposed PTD also provides a method to decompose out-of-sample data in a timely manner, which satisfies the demands of real-time traffic flow forecasting.
A new incoming datum z_t at time t can also be disaggregated into three additive components, trend t_t, periodicity p_t, and remainder r_t, via the following four steps: Step 1: Calculating the periodicity. The periodicity p_t is taken from Q(t) (obtained from Equation (9)) according to the position of time t within the period.
Step 2: De-periodicity. The periodicity p_t is subtracted from the original datum z_t to obtain a periodically adjusted datum z̃_t:

z̃_t = z_t − p_t

Step 3: Calculating the trend component. The periodically adjusted datum z̃_t is appended to the end of the periodically adjusted series P̃(t) (obtained from Equation (10)). The next point of the new series is evaluated by LOWESS with K4 to obtain the trend t_t.
Step 4: Calculating the remainder. The remainder r_t is computed as:

r_t = z_t − p_t − t_t

Following the above four steps, the out-of-sample traffic flow datum is decomposed into trend t_t, periodicity p_t, and remainder r_t.
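The four out-of-sample steps can be sketched as follows, assuming an in-sample run has already produced one period of periodicity `Q` and the periodically adjusted history `adjusted`. A short trailing mean stands in for the one-step LOWESS extension with K4; all names are ours, not the paper's.

```python
def ptd_out_of_sample(z_t, t, Q, adjusted, K4=5):
    """Decompose a new datum z_t arriving at time t into (t_t, p_t, r_t)."""
    C = len(Q)
    p_t = Q[t % C]                   # Step 1: periodicity by position within the period
    z_adj = z_t - p_t                # Step 2: de-periodicity
    history = adjusted + [z_adj]     # Step 3: append to the adjusted series and
    k = min(K4, len(history))        #         extrapolate the trend (trailing mean
    t_t = sum(history[-k:]) / k      #         stands in for the LOWESS extension)
    r_t = z_adj - t_t                # Step 4: remainder
    return t_t, p_t, r_t
```

Note that t_t + p_t + r_t reproduces z_t exactly, so the decomposition loses no information about the incoming datum.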

Prediction Models for the Decomposed Components
After completing the decomposition, prediction models are implemented to forecast the trend and remainder. In order to test the universal applicability of the proposed hybrid method, four well-established prediction models, including both statistical and AI models, are separately combined with PTD: ARIMA, SVR, ANN, and LSTM. The periodicity is predicted by cycling in place of modelling, since it is a strict periodic function.

Autoregressive Integrated Moving Average (ARIMA)
ARIMA is a statistical time series model [55], based on the assumption of stationarity and linearity of the time series. Generally, ARIMA is composed of three parts: autoregressive (AR), integration (I), and moving average (MA). The performance of ARIMA is governed by three parameters: the autoregressive order p, the difference order d, and the moving average order q. The ARIMA(p, d, q) model can be expressed as follows:

(1 − φ_1 B − φ_2 B² − ... − φ_p B^p) ∇^d x_t = (1 − θ_1 B − θ_2 B² − ... − θ_q B^q) ε_t

where B is the backward shift operator (B x_t = x_{t−1}) and ∇ = (1 − B) is the differencing operator; x_t is the predicted value, and ε_t is the residual, following a Gaussian distribution with zero mean and variance σ². The difference order d corresponds to the differencing that converts a nonstationary time series into a stationary one. In this study, the augmented Dickey-Fuller (ADF) test (unit root test) is utilized to diagnose the stationarity of traffic flow and determine the parameter d. In addition, the parameters p and q are selected based on the minimum Bayesian information criterion (BIC) value:

BIC = k ln n − 2 ln L

where k is the number of model parameters, n is the number of samples, and L is the maximized likelihood value.
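Order selection by minimum BIC can be illustrated with a pure-AR sketch. This is a simplification of full ARIMA fitting (in practice, a library such as statsmodels fits ARIMA by maximum likelihood and reports BIC directly); the function names and the Gaussian likelihood shortcut below are ours.

```python
import numpy as np

def ar_bic(x, p):
    """Fit AR(p) by least squares and return BIC = k*ln(n) - 2*ln(L) (constants dropped)."""
    # Design matrix: columns are lags 1..p plus an intercept
    X = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)]
                        + [np.ones(len(x) - p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    n = len(y)
    sigma2 = max(float(np.mean(resid ** 2)), 1e-12)  # floor avoids log(0) on exact fits
    # Gaussian log-likelihood up to a constant: -2 ln L ~ n*ln(sigma2) + n
    return (p + 1) * np.log(n) + n * np.log(sigma2) + n

def select_order(x, max_p=5):
    """Choose the autoregressive order with the minimum BIC."""
    return min(range(1, max_p + 1), key=lambda p: ar_bic(x, p))
```

The k ln n penalty grows with the number of parameters, so BIC favors the smallest order that explains the series: higher orders only win if they reduce the residual variance enough to offset the penalty.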

Support Vector Regression (SVR)
Initially, the support vector machine (SVM) was proposed by Boser et al. [56] to solve classification problems; Cortes and Vapnik [57] extended SVM to regression problems, where it is known as support vector regression (SVR). SVR is based on the structural risk minimization principle, which minimizes the upper bound of the generalization error by using a subset of the data points (i.e., the support vectors). For time series modelling, the output of SVR is expressed as:

x̂_t = w^T φ(x_{t−1}) + b

where x̂_t is the predicted value; x_{t−1} = [x_{t−1}, x_{t−2}, ..., x_{t−n}]^T is the input vector with n time lags; w and b are the weights and bias, respectively; and φ(x) is the nonlinear mapping function that maps the input space into a high-dimensional feature space. In this study, ε-SVR is employed, and slack variables ξ and ξ* are introduced. The slack variables measure the errors incurred by values outside the boundaries, which are determined by the ε-tube. SVR is then the optimization problem with the objective function of Equation (17) and the constraints of Equation (18):

min_{w, b, ξ, ξ*}  (1/2)‖w‖² + H Σ_{i=1}^{m} (ξ_i + ξ_i*)   (17)

subject to:

y_i − w^T φ(x_i) − b ≤ ε + ξ_i,
w^T φ(x_i) + b − y_i ≤ ε + ξ_i*,
ξ_i, ξ_i* ≥ 0, i = 1, 2, ..., m   (18)

where H (H > 0) represents the regularization factor and m represents the number of training samples. This optimization problem can be transformed into its dual problem, and the SVR function can then be expressed as:

f(x) = Σ_{i=1}^{m} (α_i − α_i*) κ(x_i, x) + b

where α_i and α_i* are the Lagrange multipliers, and κ(x_i, x) is the kernel function, equal to the inner product of φ(x_i) and φ(x). The radial basis function (RBF) is widely used as the kernel function in previous literature [5,12,19,41] and is also adopted in this study. The RBF is expressed as:

κ(x_i, x) = exp(−γ ‖x_i − x‖²)

where γ is the coefficient of the RBF.
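The dual-form SVR prediction with an RBF kernel can be sketched as follows. The support vectors, dual coefficients, bias, and γ in the usage below are made-up toy values for illustration, not fitted ones; a real model would obtain them by solving the dual problem (e.g., with scikit-learn's SVR).

```python
import math

def rbf_kernel(x_i, x, gamma):
    """RBF kernel: kappa(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x_i, x)))

def svr_predict(support_vectors, dual_coefs, b, x, gamma):
    """Dual-form prediction: f(x) = sum_i (alpha_i - alpha_i*) * kappa(x_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + b

# Toy usage (illustrative values): two 1-D support vectors with coefficients +1 and -1
pred = svr_predict([[0.0], [1.0]], [1.0, -1.0], 0.5, [0.0], 1.0)
```

Only points with nonzero α_i − α_i* (the support vectors) contribute to the sum, which is why the fitted model can be sparse in the training data.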

Artificial Neural Network (ANN)
ANN is abstracted from the neurons in the human brain. ANN has the universal approximation ability, meaning it can estimate any nonlinear continuous function to any desired degree of accuracy, so the model is well suited to nonlinear data. Because of this advantage, ANN has been widely introduced and developed for prediction tasks, including traffic flow forecasting. A typical ANN model consists of three parts: an input layer, a hidden layer, and an output layer. The structure of an ANN is flexible and depends on the number of neurons in each layer. In this paper, a one-hidden-layer, single-output neural network is employed. The relationship between the output x_t and the inputs x_{t−1}, x_{t−2}, ..., x_{t−n} is expressed as follows:

x_t = b_0 + Σ_{i=1}^{u} b_i σ( a_{i0} + Σ_{j=1}^{n} a_{ij} x_{t−j} ) + ε_t

where a_{ij}, j = 1, 2, ..., n, represent the connection weights from the input layer to the ith hidden neuron; a_{i0} represents the bias of the ith hidden neuron; b_i, i = 1, 2, ..., u, represent the connection weights from the hidden layer to the output neuron; b_0 represents the bias of the output neuron; n and u are the numbers of input and hidden neurons, respectively; and ε_t is the error term. σ(x) is the sigmoid activation function, formulated as follows:

σ(x) = 1 / (1 + e^{−x})   (22)

Therefore, the ANN model in fact performs a nonlinear functional mapping f from the historical values to the future value:

x_t = f(x_{t−1}, x_{t−2}, ..., x_{t−n}) + ε_t
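The one-hidden-layer forward pass described above can be sketched directly. The weights in practice would come from training (e.g., backpropagation); the function names and the toy values in the usage are ours.

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def ann_forward(lags, a, a0, b, b0):
    """One-hidden-layer, single-output forward pass:
    output = b0 + sum_i b_i * sigmoid(a_i0 + sum_j a_ij * lag_j)."""
    hidden = [sigmoid(a0[i] + sum(a[i][j] * lags[j] for j in range(len(lags))))
              for i in range(len(b))]
    return b0 + sum(b[i] * hidden[i] for i in range(len(b)))
```

With all input-to-hidden weights at zero, each hidden unit outputs sigmoid(0) = 0.5, so the network reduces to a constant; training moves the weights away from such degenerate settings.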

Long Short-Term Memory Neural Network (LSTM)
LSTM was proposed by Hochreiter and Schmidhuber [58] to model time series data. LSTM differs from a traditional ANN in its hidden layers: the hidden neurons of LSTM are specially designed to capture temporal characteristics. This is realized through one memory cell and three gates: an input gate, a forget gate, and an output gate. At time t, the input gate processes the newly arriving data together with the previous hidden state. The forget gate decides when to forget the information in the memory cell and selects the optimal lag for the input time series. The output gate controls the output information transmitted to the next neural layer. Suppose x_t denotes the input at time t; i_t, f_t, and o_t denote the states of the input gate, forget gate, and output gate, respectively; and c_{t−1} denotes the state of the memory cell at time t−1. The equations to update the three gates are expressed as follows:

i_t = σ(w_i x_t + u_i h_{t−1} + b_i)
f_t = σ(w_f x_t + u_f h_{t−1} + b_f)
o_t = σ(w_o x_t + u_o h_{t−1} + b_o)

At time t, the state of the memory cell c_t is controlled by the input gate i_t, the forget gate f_t, and the previous cell state c_{t−1}:

c_t = f_t • c_{t−1} + i_t • tanh(w_c x_t + u_c h_{t−1} + b_c)

The information the LSTM unit outputs to the next layer is controlled by the output gate o_t and the cell state c_t:

h_t = o_t • tanh(c_t)

where h_t represents the final output of the hidden neuron; the operator "•" is the Hadamard product; w_i, w_f, w_o, w_c, u_i, u_f, u_o, and u_c are connection weights; b_i, b_f, b_o, and b_c are activation biases; and σ(x) and tanh(x) are activation functions, with σ(x) expressed as Equation (22) and tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}).
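The gate updates above can be sketched for a single scalar unit (real implementations are vectorized; the dict-of-weights layout and names below are ours for readability).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, u, b):
    """One scalar LSTM update; w, u, b are dicts keyed by 'i', 'f', 'o', 'c'."""
    i_t = sigmoid(w['i'] * x_t + u['i'] * h_prev + b['i'])    # input gate
    f_t = sigmoid(w['f'] * x_t + u['f'] * h_prev + b['f'])    # forget gate
    o_t = sigmoid(w['o'] * x_t + u['o'] * h_prev + b['o'])    # output gate
    g_t = math.tanh(w['c'] * x_t + u['c'] * h_prev + b['c'])  # candidate cell state
    c_t = f_t * c_prev + i_t * g_t    # memory cell update
    h_t = o_t * math.tanh(c_t)        # output passed to the next layer
    return h_t, c_t
```

The additive form of the cell update, c_t = f_t • c_{t−1} + i_t • g_t, is what lets gradients flow over long horizons, which is the source of LSTM's advantage in temporal modelling noted above.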

Multi-Step Prediction
One-step prediction has been extensively studied in previous work [6,7,12,19,32,38]; however, it cannot always satisfy the applications of ITS. In this study, a multi-step prediction strategy is designed for the proposed hybrid method.
Firstly, the periodicity is predicted. Periodicity is predicted by cycling: since the periodicity P(t) extracted by PTD is a strictly periodic function with period length C (see Equation (9)), the predicted value at time t is the same as the historical value at time t − C:

p̂_t = p_{t−C}

Secondly, the trend and remainder are predicted respectively. The iterated strategy, which is widely used in multi-step time series forecasting, is adopted for the multi-step prediction of the trend and remainder. Based on an already constructed one-step prediction model, the iterated strategy uses the predicted value as an input to the same model to forecast the value at the next time step (see Equation (30)). This process is iterated up to the maximum prediction horizon. The iterated strategy has two outstanding advantages: only one model must be established, and the number of prediction steps is unlimited.
t̂_{t+1} = f_T(t_t, t_{t−1}, ..., t_{t−n+1}),    r̂_{t+1} = f_R(r_t, r_{t−1}, ..., r_{t−n+1})
t̂_{t+2} = f_T(t̂_{t+1}, t_t, ..., t_{t−n+2}),    r̂_{t+2} = f_R(r̂_{t+1}, r_t, ..., r_{t−n+2})
···   (30)

where t_t and r_t respectively represent the true values of the trend and remainder at time t; t̂_t and r̂_t respectively represent their predicted values; and f_T and f_R represent the already constructed models for the trend and remainder, respectively. Lastly, the predicted values of the three components are summed to give the final outcomes ŷ_t:

ŷ_t = t̂_t + p̂_t + r̂_t
ŷ_{t+1} = t̂_{t+1} + p̂_{t+1} + r̂_{t+1}
···   (31)
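The iterated strategy can be sketched generically: each one-step prediction is fed back as an input for the next step. `model` below is any callable mapping the last n values to the next one, a stand-in for the trained trend or remainder model; the names are ours.

```python
def iterated_forecast(model, history, n_lags, horizon):
    """Multi-step forecast by feeding each one-step prediction back as an input."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(horizon):
        nxt = model(window)            # one-step prediction from the current window
        preds.append(nxt)
        window = window[1:] + [nxt]    # slide the window over the predicted value
    return preds

# Toy usage: a "model" that always predicts the last value plus one
preds = iterated_forecast(lambda w: w[-1] + 1.0, [1.0, 2.0, 3.0], 2, 3)
```

The trade-off of the iterated strategy, not needed for the toy model above but relevant in practice, is that prediction errors compound across steps, which is why the paper evaluates robustness over multiple horizons.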

Hybrid Prediction Models
Based on the principle of "decomposition first and forecasting last", the procedure of the proposed hybrid method for traffic flow forecasting is displayed in Figure 3; it contains three steps, as listed: Step 1: In-sample decomposition and models training. Based on the in-sample decomposition of PTD, the in-sample data is decomposed into three components: trend, periodicity, and remainder. Then, each component is used to train the models (described in Section 3.2) separately.
Step 2: Out-of-sample decomposition and multi-step prediction. Based on the out-of-sample decomposition of PTD, the out-of-sample data is also decomposed into trend, periodicity, and remainder. Then, each component of out-of-sample decomposition is input into the corresponding trained model. The predicted results are obtained from the output of the models.
Step 3: Integration for the final outcomes. The predicted results of the three components are summed as the final outcomes.

Data Description
In this study, the traffic flow data are obtained from the Minnesota Department of Transportation (Mn/DOT) [59]. Mn/DOT provides traffic volume and occupancy data at 30-second intervals for freeways in Minnesota, USA. In our experiments, only traffic volume data are used; the data are collected from three detection stations (i.e., S188, S195, and 818) at the interchange of I-494 and TH100 (see Figure 4). As shown in Figure 4, S188 and S195 are located in the two opposite directions of the I-94 freeway and 818 is located at an exit ramp. The three cases with different patterns are used to test the adaptability of the hybrid models. We selected four weeks of data, from 29 April to 24 May 2019. Since the patterns of traffic volume on weekdays and weekends differ, the weekend data are discarded. The original traffic volume data are aggregated into 5-minute intervals, as in previous studies [4,5,14,21,60]. The data of the first two weeks (29 April to 10 May, 2880 samples in total) are used as the training dataset to train the models. The next week (12 to 16 May, 1440 samples in total) is used as the validation dataset to determine the hyperparameters of the models. The last week (20 to 24 May, 1440 samples in total) is used as the testing dataset to evaluate the prediction results.
The training dataset of the collected traffic volume is shown in Figure 5. It can be seen that the three stations have different distributions, which allows us to evaluate the adaptability of the proposed hybrid models. The distribution features of the traffic flow agree with the station locations: S188 and S195 are located on the four-lane highway with large daytime traffic volume, and the peak volumes of both stations exceed 600 vehicles per 5 minutes; 818 is located on the single-lane ramp, where the traffic volume is much smaller, with a peak volume of about 180 vehicles per 5 minutes. Furthermore, S188 has two obvious peaks, in the morning and afternoon, S195 has only one peak in the morning, and 818 has only one peak in the afternoon. Despite their different distributions and peak patterns, all three stations exhibit an obvious one-day periodicity.


Design of Experiments
To demonstrate the contribution of the proposed PTD approach, four hybrid models combining PTD (i.e., PTD-ARIMA, PTD-SVR, PTD-ANN and PTD-LSTM) and four individual models (i.e., ARIMA, SVR, ANN and LSTM) are applied to the prepared data and comparatively analyzed.
In the PTD procedure, the training datasets are regarded as in-sample, and the validation and testing datasets are regarded as out-of-sample, to simulate real-time traffic flow. The period length C is 288 (i.e., 24 (hours) × 60 (minutes) / 5 (minutes) = 288), and the period number M is 10 (i.e., the workdays of two weeks). According to previous studies [49,53,54], the nearest-neighbour number K of LOWESS should be set to cover one period length. In our study, for in-sample decomposition, LOWESS smooths the original series using both past and future observations; consequently, K1, K2, and K3 (see Steps 2, 3, and 6 in Section 3.1.2) are all set to 144. Meanwhile, for out-of-sample decomposition, LOWESS smooths the original series using only past observations, and K4 (see Step 4 in Section 3.1.3) is set to 288.
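To illustrate what the nearest-neighbour number K controls, and how the bilateral (in-sample) and past-only (out-of-sample) windows differ, the following stand-in replaces LOWESS's locally weighted regression with a plain local mean. This is a deliberate simplification for illustration, not the paper's implementation.

```python
import numpy as np

def knn_smooth(series, k, causal=False):
    """Crude local-mean stand-in for LOWESS, illustrating the role of the
    nearest-neighbour number K. With causal=True only past observations
    are used (the out-of-sample setting); otherwise the window extends to
    both sides of each point (the in-sample setting)."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        if causal:
            window = y[max(0, i - k + 1):i + 1]                # past points only
        else:
            half = k // 2
            window = y[max(0, i - half):min(n, i + half + 1)]  # both sides
        out[i] = window.mean()
    return out
```

With the paper's settings, the in-sample trend would correspond to `knn_smooth(series, 144)` and the out-of-sample trend to `knn_smooth(series, 288, causal=True)`.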
In the prediction stage, the Min-Max normalization technique (expressed as Equation (32)) is employed to scale the data into the range [0, 1] before it is fed into the model. This technique can accelerate learning and convergence during model training. The outputs of the model are rescaled by the inverse Min-Max normalization (expressed as Equation (33)) to obtain the final predicted results:
x′ = (x − min(x)) / (max(x) − min(x)) (32)
x = x′ × (max(x) − min(x)) + min(x) (33)
It is worth noting that the hyper-parameters of the prediction models can significantly affect their performance. Hence, the well-established grid-search and cross-validation method is implemented to select appropriate hyper-parameters for each model. Specifically, all models with different hyper-parameters are trained on the same training dataset, but only the model with the best performance on the validation dataset is selected. The models with the determined hyper-parameters are then evaluated on the testing dataset. The critical hyper-parameters of each model are as follows:
• ARIMA: the difference order d is set based on the ADF unit root test, and the autoregressive order p and moving average order q are both selected from 0 to 24 based on the minimum BIC value (see Equation (15)).
• SVR: the RBF kernel coefficient γ, the regularization factor H, and the tube width ε are all selected from {10^−5, 10^−4, 10^−3, 10^−2, 10^−1, 1, 10, 10^2, 10^3, 10^4}.
• ANN: according to previous research [30,31], a one-hidden-layer neural network is frequently utilized and is also adopted in our experiments. The number of hidden neurons u is selected from 2 to 40 with a step of 2. The model is optimized by the Adam algorithm with a mean square error (MSE) loss function. The learning rate is set to 0.001, the batch size to 256, and the number of epochs to 500.
• LSTM: a one-hidden-layer LSTM is utilized, with reference to existing studies [9,34,60]. The number of hidden neurons is also selected from 2 to 40 with a step of 2, and the other hyper-parameters are the same as for the ANN.
In addition, the input step is set to 12, the output step to 1, and the prediction horizon to 6. The final determined hyper-parameters are shown in Table 2. The periodicity is not included in this table because that component is predicted by cycling.
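Equations (32) and (33), the Min-Max scaling and its inverse, can be sketched as a pair of helpers. The function names are hypothetical; the key point is that the scaling constants should be taken from the training data only and reused for validation and testing.

```python
import numpy as np

def minmax_fit(x):
    """Record the scaling constants of Equation (32) from the training data."""
    return float(np.min(x)), float(np.max(x))

def minmax_scale(x, lo, hi):
    # Equation (32): map values into [0, 1].
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

def minmax_unscale(z, lo, hi):
    # Equation (33): invert the scaling to recover the original units.
    return np.asarray(z, dtype=float) * (hi - lo) + lo
```

Fitting the constants on the training set and applying them unchanged to later data avoids leaking information from the out-of-sample period into the model.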

Performance Measures
In our study, three widely used measures are utilized to evaluate the performance of the models: mean absolute error (MAE), mean absolute percentage error (MAPE), and mean square error (MSE). Denote the true value and the predicted value as y_i and ŷ_i, respectively, and let n be the number of testing samples. The measures are expressed as follows:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i| (34)
MAPE = (1/n) Σ_{i=1}^{n} |(y_i − ŷ_i) / y_i| × 100% (35)
MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2 (36)
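The three measures follow directly from their standard definitions; a minimal sketch with hypothetical helper names:

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    """Mean absolute percentage error (sensitive to small true values)."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return np.mean(np.abs((y - yhat) / y)) * 100.0

def mse(y, yhat):
    """Mean square error."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return np.mean((y - yhat) ** 2)
```

Note that `mape` divides by the true values, which is why it behaves differently from MAE and MSE on the low-volume ramp station discussed later.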

Analysis of Decomposition Results
The decomposition results are shown in Figure 6. It can be seen that the original traffic flow is decomposed into trend, periodic, and remainder components. The trend component captures the day-to-day characteristics and fluctuates moderately. Furthermore, the trend extracted through in-sample decomposition (i.e., the training dataset) is smoother than that through out-of-sample decomposition (i.e., the validation and testing datasets). The reason is that during in-sample decomposition the trend is extracted by bilateral smoothing, using both past and future observations, whereas during out-of-sample decomposition the trend is extracted using only past observations. The periodicity captures the within-day characteristics and is a strict periodic function, meaning that the values at the same time on different days are identical. The peak features of the traffic flow are also distinctly displayed in the periodicity, consistent with the original data. As for the remainder, it fluctuates irregularly and can be regarded as noise.

Analysis of Prediction Results
Each component is predicted separately, and the final outcomes are obtained by summing the three predicted results. Taking the first day of the testing dataset (i.e., 20 May 2019) as an example, the true values and the one-step predicted results are shown in Figure 7. The left subplots show the whole-day predicted results, and the right subplots zoom in on the two rush hours in the evening. It is apparent that the predicted results of the hybrid models are closer to the true values than those of the individual models, which indicates that the hybrid models perform better. In addition, the predicted results of the four hybrid models are almost identical, which suggests that the four models combining PTD achieve very similar performance.
Table 3 provides the average measures evaluated across all six prediction horizons. Overall, all hybrid models perform better than the individual models. Among the hybrid models, no specific model always performs best across all testing cases. However, further investigation shows that the gaps among the different hybrid models are quite small; in particular, the differences in MAE among them are less than 1. This finding suggests that combination with PTD is a universally applicable method to improve prediction accuracy effectively.
In addition, the improvement items represent the performance gain of the hybrid model over the corresponding individual model. This item can also be regarded as the contribution of the PTD approach to reducing the prediction errors. On average, the MAE, MAPE, and MSE are reduced by 23%, 25%, and 39% on station S188; 11%, 11%, and 19% on station S195; and 16%, 13%, and 28% on station 818, respectively. Over all testing stations, the hybrid method reduces the MAE, MAPE, and MSE by 17%, 17%, and 29%. These results demonstrate that the hybrid models combining PTD are superior to the individual models. One point needs explanation: the models do not seem to perform as well on station 818 in terms of MAPE, where the MAPE of the best model, PTD-LSTM, is close to 20%. This is because MAPE is sensitive to small true values (i.e., the y_i in Equation (35)). The traffic volume of 818 is much smaller than that of S188 and S195 (see Figure 5). When the value in the denominator is small, a small deviation in error creates a large change in the absolute percentage error (APE). For example, suppose two samples y_1 = 10 and y_2 = 100 with predicted values ŷ_1 = 20 and ŷ_2 = 110. Their absolute errors (AE) and square errors (SE) are equal (i.e., AE_1 = AE_2 = 10 and SE_1 = SE_2 = 10^2 = 100), but their APEs are quite different (i.e., APE_1 = 100% and APE_2 = 10%). When the AEs or SEs are equal, smaller true values yield larger APEs. As for the MSEs and MAEs, these results are normal because the models are trained with an MSE loss.
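The numerical example can be checked in a few lines:

```python
# Reproducing the example above: equal absolute and square errors can
# correspond to very different absolute percentage errors.
samples = [(10, 20), (100, 110)]                    # (true value, predicted value)
ae = [abs(y - p) for y, p in samples]               # absolute errors
se = [(y - p) ** 2 for y, p in samples]             # square errors
ape = [abs(y - p) / y * 100 for y, p in samples]    # absolute percentage errors
print(ae, se, ape)   # [10, 10] [100, 100] [100.0, 10.0]
```

Both samples contribute identically to MAE and MSE, yet the low-volume sample dominates MAPE, mirroring the behaviour observed at station 818.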

Analysis of Multi-Step Prediction Errors
In order to analyze the multi-step prediction errors, the evaluation measures of each prediction horizon are presented in Figure 8. It should be noted that all prediction errors increase with the prediction horizon. This is owing to cumulative errors, which stem from feeding predicted values back into the models during multi-step prediction. What stands out in this figure is that the errors of the individual models increase more rapidly with the prediction horizon than those of the hybrid models. Furthermore, the growth rates of the hybrid models are almost the same. These findings indicate that, no matter which prediction model it is integrated with, PTD can always reduce cumulative errors effectively.
To elaborate on the contribution of PTD, the rate of reduced error is introduced, defined as the reduced error divided by the error of the individual model. The rates are computed and plotted in Figure 9. A negative rate indicates that the hybrid model performs better than the corresponding individual model, and the larger the absolute value of the rate, the greater the improvement. It is apparent that the rates of reduced error decline with the prediction horizon, which indicates that the hybrid models are able to reduce cumulative errors. The PTD approach decomposes the complex traffic flow into trend, periodicity, and remainder; these components vary more regularly and are modelled more easily, which is why the hybrid method improves the prediction performance. These findings demonstrate that the method combining PTD is robust for multi-step traffic flow forecasting.
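The rate of reduced error is a simple relative change; a one-line helper (hypothetical name) makes the sign convention explicit:

```python
def reduced_error_rate(err_hybrid, err_individual):
    """Relative error change of the hybrid model versus the individual
    model; a negative value means the hybrid model is more accurate."""
    return (err_hybrid - err_individual) / err_individual

# e.g. a hypothetical individual-model MAE of 20 reduced to 15 by the
# hybrid model corresponds to a rate of -0.25, i.e. a 25% reduction.
rate = reduced_error_rate(15.0, 20.0)
```

The same helper applies to MAE, MAPE, or MSE at any prediction horizon, which is how the curves in Figure 9 are constructed.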

Conclusions
In this paper, to extract the trend and periodic characteristics of traffic flow and achieve more accurate forecasting, a novel method combining PTD is proposed. PTD is developed to decompose the original traffic flow into trend, periodicity, and remainder. The periodicity is then predicted by cycling, while the trend and remainder are predicted by modelling. To test the applicability of the proposed method to different traffic flow patterns, three datasets with different distributions collected from Mn/DOT are utilized to conduct experiments and comparative analysis. Our main work is concluded as follows:
1. A novel decomposition approach for traffic flow, PTD, is formulated. Fully considering the dynamicity of real-time traffic flow, PTD is formulated to decompose both in-sample and out-of-sample data. This approach can decompose the original traffic flow into trend, periodicity, and remainder.
2. A novel hybrid method combining PTD for traffic flow forecasting is developed. To demonstrate the universal applicability of the PTD approach, both statistical and AI models (i.e., ARIMA, SVR, ANN and LSTM) are combined with PTD to establish hybrid models (i.e., PTD-ARIMA, PTD-SVR, PTD-ANN and PTD-LSTM). Empirical results show that the MAE, MAPE, and MSE of the hybrid models are on average reduced by 17%, 17%, and 29% relative to the individual models, respectively.
3. After investigating the multi-step prediction results, it is found that the hybrid models combining PTD not only reduce the prediction errors but also reduce the cumulative errors. This suggests that the proposed hybrid method is robust for multi-step traffic flow forecasting.

Three limitations remain in our study and will be addressed in future work. Firstly, only traffic flow data on workdays are studied; the patterns of traffic flow on weekdays differ from those on weekends, so in future studies the PTD approach will be improved for whole-week traffic flow decomposition. Secondly, only traffic volume data are tested and evaluated; in further studies, other measures of traffic status (such as traffic speed and travel time) will be collected to carry out experiments. Thirdly, only historical data are considered; in further studies, the spatial-temporal or topological features of the road network will be considered for more accurate prediction.