Distance-To-Mean Continuous Conditional Random Fields: Case Study in Traffic Congestion

Abstract: Traffic prediction techniques are classified as having parametric, non-parametric, or a combination of parametric and non-parametric characteristics. The extreme learning machine (ELM) is a non-parametric technique that is commonly used to enhance traffic prediction problems. In this study, a modified probability approach, continuous conditional random fields (CCRF), is proposed, implemented with the ELM, and then utilized to assess highway traffic data. The modification is conducted to improve the performance of non-parametric techniques, in this case the ELM method. The proposed method is called the distance-to-mean continuous conditional random fields (DM-CCRF). The experimental results show that the proposed technique suppresses the prediction error of the prediction model compared to the standard CCRF. A comparison between the ELM as a baseline regressor, the standard CCRF, and the modified CCRF is presented. The performance of the techniques is evaluated by analyzing their mean absolute percentage error (MAPE) values. The DM-CCRF is able to suppress the prediction model error to ~17.047%, which is twice as good as that of the standard CCRF method. Based on the attributes of the dataset, the DM-CCRF method is better for the prediction of highway traffic than the standard CCRF method and the baseline regressor.


Introduction
The construction of highways is one of the proposed solutions to overcome the problems of vehicle congestion and increasing air pollution in metropolitan areas [1]. Highways can shorten the travel time of a vehicle compared with normal roadways. Therefore, highways are an ideal alternative for long-distance driving. However, several factors can cause congestion of vehicles on highways, including exceeding the vehicle capacity of highways and irregular flow of vehicles on the highways. Research to predict traffic flow on highways can be done to study the problem of vehicle congestion [2] by analyzing traffic flow data. The traffic flow can be assumed to be analogous to fluid flow and can be viewed as a continuum whose characteristics correspond to fluid physics characteristics [3]. There are two early categorizations of traffic flow models: macroscopic and microscopic. The macroscopic model treats traffic as a fluid moving along a duct (the highway), and the microscopic model considers the movement of each individual vehicle as the vehicles interact [4].
Traffic prediction techniques based on models are classified as having parametric, non-parametric, or a combination of parametric and non-parametric characteristics [5][6][7]. Parametric techniques: (1) capture all information about the traffic status within parameters, (2) use the training data to adjust some finite and fixed set of model parameters, (3) use the model to estimate the traffic states for a set of test data, (4) are the simplest approach, (5) define the structure in advance, and (6) are based on time-series analysis. Non-parametric techniques, in contrast, include an unspecified number of parameters, with a structure that is determined by the training data rather than fixed in advance.

A standard CCRF modification was implemented on aerosol optical depth (AOD) data by using two prediction results, namely statistical models and deterministic methods [38]. The modifications were made on the edge features to capture information from the AOD data. Baltrušaitis et al. [39] used the CCRF combined with the support vector machine (SVM) for regression cases, and the method was modified by making a baseline prediction with neural networks. This modified method was hereinafter known as continuous conditional neural fields (CCNF) [40,41]. Another modification of the CCRF was proposed by Banda et al. [44], who conducted a continuous dimensional emotion prediction task utilizing a continuous conditional recurrent neural field (CCRNF). The method evaluated audio and physiological emotion data, and the results were compared with other methods such as Long Short-Term Memory (LSTM). Zhou et al. [45] proposed a deep continuous conditional random fields (DCCRF) approach to tackle online multi-object tracking (MOT) problems, such as detached inter-object relations and manually tuned relations, which produced non-optimal settings. The method implemented an asymmetric pairwise term to regularize the final displacement.

Extreme Learning Machine (ELM)
The ELM method has been implemented in traffic flow prediction research, with or without modification or combination. A method based on the extreme learning machine was proposed to enhance a real-time traffic problem by Ban et al. [33]. Owing to the efficiency and effectiveness of the ELM over a wide area, a modification of the ELM in which a kernel function substitutes the hidden layer of the ELM was proposed [31]. The aim was to improve the accuracy of the prediction in the case of traffic flow. A novel prediction model implemented the extreme learning machine with the addition of bidirectional back propagation, where the parameters in these techniques were not tuned by experience [32]. This technique, known as the incremental extreme learning machine (I-ELM), aimed to overcome the drawbacks of previous techniques, namely, (1) time consumption and (2) hidden nodes that lead the trained model to be over-fitted. Zhang et al. [34] implemented the extreme learning machine to carry out traffic flow prediction based on real heterogeneous data sources. A time-series model was also included among the techniques and was used as a benchmark.
A method based on the extreme learning machine was built and applied to the prediction of the urban traffic congestion problem [35]. A symmetric-ELM-cluster (S-ELM-Cluster) transformed the complex learning problem into separate sub-problems on small- and medium-scale data sets. Yang et al. [36] utilized the Taguchi method, which is known as a robust and systematic optimization approach, to find the optimal configuration of their proposed exponential smoothing and extreme learning machine forecasting model. This developed model was then applied to highway traffic data. Feng et al. [37] proposed a combination of the wavelet function and the extreme learning machine to optimize the short-term traffic flow forecasting method.

Standard CCRF
Probabilistic graphical models (PGM) are a framework that relies on three main components of an intelligent system: representation, inference, and learning. The PGM framework is capable of supporting natural representation, making effective inferences, and acquiring a decent model. These three components give the framework the ability to handle domain renewal [30]. The continuous conditional random field (CCRF) is a part of PGM that is able to accommodate sequential prediction problems with many variables. This method was first introduced by Qin et al. [47] and is a regression form of the conditional random field (CRF) model. The CCRF model is a conditional probability distribution that represents a mapping of the selected data to their ranking values, where the ranking values are expressed as continuous variables. In the CCRF, information about the data and the relationships between data points is used as features. The structure of the standard CCRF is illustrated in Figure 1. The probability density function (PDF) is an exponential model that contains features based on input and output. It is assumed that there is a connection between adjacent labels in the output. The CCRF forms a connection between a point and its neighboring points. These points represent the predicted values per time unit and are generated by a conventional predictor algorithm as the baseline. Because the CCRF works on regression cases, the baseline used is referred to as the baseline regressor. The baseline regressors that could be used in this method include the support vector machine (SVM), neural networks, or trees.
In general, the CCRF serves to strengthen the probabilities of weak predictive values. The CCRF model can be written as [39]

P(y|X) = (1/η(X)) exp(Ψ(y, X, α, β)).    (1)

Here, y = (y_1, y_2, ..., y_N) is a set of predictive values (output), N denotes the number of observed samples, and X is a vector of independent random variables called predictor vectors. The function Ψ is the potential function of the CCRF, which defines an interaction between the variables on a clique. A clique is a maximal subgraph, i.e., a set of vertices of a graph that has an edge for every pair of vertices [48]. The function η is the normalizer that keeps the probability value P(y|X) between 0 and 1 and is defined as

η(X) = ∫_y exp(Ψ(y, X, α, β)) dy.    (2)

The potential function Ψ is defined as [49]

Ψ(y, X, α, β) = Σ_i Σ_{k=1}^{K_1} α_k F_k(y_i, X) + Σ_{i,j} Σ_{k=1}^{K_2} β_k G_k(y_i, y_j, X),    (3)

where F is the CCRF feature variable function, referred to as the association potential; G is the CCRF edge feature function, called the interaction potential; i, j = 1, 2, ..., N denote the observation samples; and α, β are the contribution parameters of the feature variable and the edge feature, respectively. The feature variable function F and the edge feature G are the two sources of information used in the CCRF. The feature function F represents prior knowledge for the CCRF and evaluates the predictive results formed by the baseline regressors; generally, it tests the prediction results by using an error evaluation function such as the mean square error (MSE). Meanwhile, the edge feature G expresses the interactions between prediction values. The functions F and G are defined as shown in (4) and (5), respectively:

F_k(y_i, X) = −(y_i − f_k(X))²,    (4)

G_k(y_i, y_j, X) = −(1/2) S_{i,j}^{(k)} (y_i − y_j)²,    (5)

where S^{(k)} is the k-th similarity measure between feature vectors. The integers K_1 and K_2 represent the number of baseline regressors and the number of similarity measurements between feature vectors, respectively. The function f_k(X) is an unstructured model that predicts a single output y_i based on the input X. Simply stated, f_k(X) is a function that maps the input x_i ∈ X to a prediction value y_i, which is referred to as the prediction function of a baseline regressor.
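To make the association and interaction potentials concrete, the following minimal sketch (not the authors' implementation; the toy values of `f_pred` and `S` are invented for illustration) evaluates the CCRF potential for a short sequence with one baseline regressor:

```python
import numpy as np

def ccrf_potential(y, f_pred, S, alpha, beta):
    """Potential with one association and one interaction feature:
    association:  F(y_i, X)      = -(y_i - f(X)_i)^2   (trust the baseline)
    interaction:  G(y_i, y_j, X) = -0.5 * S_ij * (y_i - y_j)^2 (similar points agree)
    """
    association = -alpha * np.sum((y - f_pred) ** 2)
    diff = y[:, None] - y[None, :]               # all pairwise differences y_i - y_j
    interaction = -0.5 * beta * np.sum(S * diff ** 2)
    return association + interaction

# Toy example: 3 time steps, chain-neighbour similarity matrix.
y = np.array([1.0, 1.2, 1.1])
f_pred = np.array([1.0, 1.1, 1.0])               # hypothetical baseline outputs
S = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)           # only neighbouring points interact

print(ccrf_potential(y, f_pred, S, alpha=1.0, beta=0.5))
```

A higher (less negative) potential means the candidate sequence y agrees more with both the baseline predictions and its neighbours.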

DM-CCRF
The distance-to-mean continuous conditional random fields (DM-CCRF) method modifies the edge feature function of the CCRF, as shown in Figure 2. It aims to improve the performance of the CCRF in predicting time-series data. The modification is carried out under the assumption that the average of the prediction values over the preceding part of the time-series sequence carries information about an event. Using this assumption, the belief probability of the emerging prediction model is expected to increase. The assumption is formulated by defining a new edge feature H:

H_k(y_i, m_i, X) = −(1/2)(y_i − m_i)²,  k = 1, 2, ..., K_3.    (6)

Here, the integer K_3 is the length of the calculated sequence, γ is the contribution variable of the modified edge feature, and m_i is the average of the prediction values up to y_{i−1}. The variable m_i can be formulated as

m_i = (1/s) Σ_{t=i−s}^{i−1} y_t,    (7)

where the integer s is the length of the sequence of events. The structure of the DM-CCRF is illustrated in Figure 3.
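The running mean m_i and the distance-to-mean edge feature can be sketched as follows. Treating the window as the whole history up to y_{i−1}, and falling back to y_0 for the first point, are assumptions of this sketch rather than the paper's exact convention:

```python
import numpy as np

def running_means(y):
    """m_i = average of the prediction values up to y_{i-1}.
    The first point has no history, so m_0 = y_0 (an assumed fallback)."""
    m = np.empty_like(y, dtype=float)
    m[0] = y[0]
    for i in range(1, len(y)):
        m[i] = np.mean(y[:i])
    return m

def dm_edge_feature(y, m):
    """h(y_i, m_i) = -0.5 * (y_i - m_i)^2: penalise predictions far from the mean."""
    return -0.5 * (y - m) ** 2

y = np.array([2.0, 4.0, 6.0])
m = running_means(y)                 # [2.0, 2.0, 3.0]
print(dm_edge_feature(y, m))         # [-0.0, -2.0, -4.5]
```

The feature is largest (zero) when a prediction equals the mean of its predecessors and grows more negative the further it strays, which is exactly the "belief" the modification injects.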
With the formation of the new edge feature function, the DM-CCRF potential function is defined as

Ψ_DM(y, X, α, β, γ) = Σ_i Σ_{k=1}^{K_1} α_k F_k(y_i, X) + Σ_{i,j} Σ_{k=1}^{K_2} β_k G_k(y_i, y_j, X) + Σ_i Σ_{k=1}^{K_3} γ_k H_k(y_i, m_i, X).    (8)

In the form of conditional probabilities, the DM-CCRF can be written as

P(y|X) = (1/η(X)) exp(Ψ_DM(y, X, α, β, γ)).    (9)

Thus, the form of the DM-CCRF formulation can be written as

P(y|X) = exp(Ψ_DM(y, X, α, β, γ)) / ∫_y exp(Ψ_DM(y, X, α, β, γ)) dy.    (10)

By substituting Equations (4)–(6) into Equation (10),

P(y|X) = (1/η(X)) exp(−Σ_i Σ_k α_k (y_i − f_k(X))² − (1/2) Σ_{i,j} Σ_k β_k S_{i,j}^{(k)} (y_i − y_j)² − (1/2) Σ_i Σ_k γ_k (y_i − m_i)²).    (11)

Because the exponent in Equation (11) is quadratic in y, the model is a multivariate Gaussian. In matrix form, a simplification of Equation (11) can be written as [38]

P(y|X) = (1/((2π)^{N/2} |σ|^{1/2})) exp(−(1/2)(y − μ(X))ᵀ σ⁻¹ (y − μ(X))),    (12)

where μ(X) = σu and σ⁻¹ = 2(A + B). The matrix σ⁻¹ contains the contribution variables of the entire DM-CCRF feature function, |σ| is the determinant of the matrix σ, and μ denotes the average predictor variable. The matrix A is a diagonal matrix that contains the elements

A_{i,i} = Σ_{k=1}^{K_1} α_k + (1/2) Σ_{k=1}^{K_3} γ_k.

The matrix B is a symmetric matrix whose elements consist of

B_{i,j} = Σ_{k=1}^{K_2} β_k (δ_{i,j} Σ_{h=1}^{N} S_{i,h}^{(k)} − S_{i,j}^{(k)}),

where δ_{i,j} is the Kronecker delta; the additive constant U left over from completing the square is absorbed into the normalizer. The vector u contains the elements

u_i = 2 Σ_{k=1}^{K_1} α_k f_k(X) + Σ_{k=1}^{K_3} γ_k m_i.
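A minimal numerical sketch of this matrix form follows. It is not the authors' implementation: the element formulas for A, B, and u are inferred here from a quadratic expansion of squared-error association, similarity-weighted interaction, and distance-to-mean features with single parameters α, β, γ:

```python
import numpy as np

def dm_ccrf_gaussian(f_pred, S, m, alpha, beta, gamma):
    """Collect the quadratic exponent into a Gaussian N(mu(X), sigma),
    with sigma^{-1} = 2 * (A + B) and mu(X) = sigma @ u."""
    n = len(f_pred)
    # A: diagonal contributions from the association (alpha) and
    # distance-to-mean (gamma) terms.
    A = (alpha + 0.5 * gamma) * np.eye(n)
    # B: graph-Laplacian-like contribution from the interaction (beta) term.
    B = beta * (np.diag(S.sum(axis=1)) - S)
    sigma = np.linalg.inv(2.0 * (A + B))
    # u: linear term of the exponent, mixing baseline outputs and running means.
    u = 2.0 * alpha * f_pred + gamma * m
    return sigma @ u, sigma                       # mu(X), sigma

# Sanity check: with beta = gamma = 0 the mean collapses to the baseline output.
f_pred = np.array([1.0, 2.0, 3.0])
mu, _ = dm_ccrf_gaussian(f_pred, np.zeros((3, 3)), np.zeros(3), 1.0, 0.0, 0.0)
print(mu)
```

With a nonzero γ, the mean μ(X) is pulled away from the raw baseline predictions toward the running means m, which is the mechanism by which the modified edge feature reshapes the prediction.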

Learning and Inference in DM-CCRF
The learning process aims to select the optimum feature variable values, which will maximize the conditional probability value [48]. In the DM-CCRF, the learning process aims to choose the optimum values of the variables α and γ such that P(y|X) reaches its maximum value. Given training data {(X^(q), y^(q))} drawn from some probability distribution [47], X^(q) is the input vector that corresponds to the q-th data point, and y^(q) is the set of predictive values that correspond to the q-th data point. The values of the DM-CCRF feature variables θ = (α, γ) can then be estimated. A conditional logarithmic likelihood function corresponding to the DM-CCRF model is defined from the observational data, i.e.,

L(θ) = Σ_{q=1}^{N} log P(y^(q)|X^(q); θ).    (18)
Equation (19) is obtained by substituting Equation (9) into Equation (18):

L(θ) = Σ_{q=1}^{N} [Ψ_DM(y^(q), X^(q); θ) − log η(X^(q))].    (19)

The learning process for the training data in the DM-CCRF can then be written as

θ* = argmax_θ L(θ).    (20)

Stochastic gradient ascent is an algorithm that can process thousands of datasets containing hundreds of features; therefore, the optimal values of the variables can be determined using stochastic gradient ascent [39]. The partial derivatives of the conditional logarithmic likelihood with respect to α_k and γ_k, ∂L(θ)/∂α_k and ∂L(θ)/∂γ_k, drive the updates. To guarantee that the partition function can be integrated [38,39], the constraint

α_k > 0 and γ_k > 0    (24)

is imposed. The constraint in Equation (24) is satisfied by taking the partial derivatives with respect to log α_k and log γ_k [47], i.e., ∂L(θ)/∂ log α_k and ∂L(θ)/∂ log γ_k in Equations (25) and (26). Using Equations (25) and (26), the most recent values of α_k and γ_k in each iteration of the gradient ascent are calculated by Equations (27) and (28), respectively:

α_k^new = α_k · exp(ζ · ∂L(θ)/∂ log α_k),    (27)

γ_k^new = γ_k · exp(ζ · ∂L(θ)/∂ log γ_k),    (28)
where ζ, commonly known as the learning rate, is a constant that determines how large the variable updates are in each iteration. If ζ is very large, then there is a possibility of premature convergence, whereas if ζ is very small, then the optimization process will take a very long time to converge. In the inference process, the desired predictive value is determined [39]. The inference process in the DM-CCRF aims to find the predictive value of y for each given input X such that the conditional probability P(y|X) reaches its maximum value. The optimal estimate of y corresponding to the maximum of P(y|X) equals the expected value μ(X). The prediction value ŷ in the DM-CCRF can therefore be formulated as

ŷ = argmax_y P(y|X; θ) = μ(X).    (29)
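The positivity-preserving update of Equations (27) and (28) can be illustrated on a toy objective. The objective L(α) below is invented purely to show the mechanism: taking the gradient with respect to log α and updating multiplicatively keeps α strictly positive throughout:

```python
import numpy as np

def update_positive(theta, grad_log_theta, lr):
    """One gradient-ascent step taken with respect to log(theta):
    theta_new = theta * exp(lr * dL/d log theta), so theta never leaves (0, inf)."""
    return theta * np.exp(lr * grad_log_theta)

# Toy concave objective L(alpha) = -(log(alpha) - 1)^2, maximised at alpha = e.
def grad_wrt_log_alpha(alpha):
    return -2.0 * (np.log(alpha) - 1.0)      # dL / d(log alpha)

alpha = 0.5                                  # any positive starting value works
for _ in range(200):
    alpha = update_positive(alpha, grad_wrt_log_alpha(alpha), lr=0.1)
print(alpha)                                 # approaches e ~ 2.718, always > 0
```

An additive update on α itself could step below zero and invalidate the partition function; the log-space update enforces the constraint of Equation (24) by construction.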

Dataset
The traffic data used in this study were obtained from the Department for Transport, United Kingdom, and were collected using hundreds of sensors operating 24 h a day. The sensors were placed on road segments and operated in a real-time scenario, so the dataset grows over time. The utilized data were traffic data from the Highways Agency that provide the traffic flow, the average traffic speed, and the average trip time for periods of 15 min [50]. The data were collected from 2009 to 2013. Of the 270,000,000 observation data points, only 2760 data points were used in the experiment; these were located at a latitude of 50.832657°. These traffic data had ten attributes: source latitude, source longitude, destination latitude, destination longitude, time, date, period, vehicle speed, distance, and traffic flow. Traffic was light in the morning, congested at midday, and started to drop off at the end of the day [51].
The combination of automatic number plate recognition (ANPR) cameras, in-vehicle Global Positioning System (GPS) receivers, and inductive loops was utilized to calculate the travel time and average speed attributes. Furthermore, the travel time attribute was derived from real vehicle observations and calculated using the adjacent time periods. The date attribute was converted to the number of the day in one week: 1 for Monday, 2 for Tuesday, and so on. The time attribute was graded from 1 to 96, representing the 15-min intervals of 00:00–24:00. The period attribute was given in seconds, the vehicle speed attribute in km/h, the distance attribute in km, and the traffic flow as the number of vehicles. The process aimed to predict the number of vehicles (the traffic flow).
The cleaning process was conducted by removing empty attributes and attributes whose values were all zero. In addition, if the dataset contained missing values, data preprocessing through imputation techniques was conducted. After the data cleaning process was complete, the remaining attributes were used as random variables with one target variable. The target variable in question was the traffic flow prediction variable. Furthermore, the clean dataset was normalized to numerical values between 0 and 1; this was done to prevent outliers and the huge range of the data from dominating the learning.
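A minimal sketch of this cleaning step follows. Mean imputation and min-max scaling are one common reading of the description above, not the authors' documented pipeline, and the flow values are hypothetical:

```python
import numpy as np

def preprocess(column):
    """Mean-impute missing entries, then min-max scale the column to [0, 1]."""
    col = np.asarray(column, dtype=float)
    col = np.where(np.isnan(col), np.nanmean(col), col)   # fill gaps with the mean
    lo, hi = col.min(), col.max()
    return (col - lo) / (hi - lo) if hi > lo else np.zeros_like(col)

# Hypothetical 15-min traffic-flow counts with one missing reading.
flow = [120.0, float("nan"), 300.0, 60.0]
print(preprocess(flow))   # [0.25, 0.41666..., 1.0, 0.0]
```

After scaling, every attribute lives on the same [0, 1] range, so no single attribute dominates the regressor simply because of its units.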

Baseline Regressor
A baseline regressor was formed before data processing with the DM-CCRF. The extreme learning machine (ELM), which is a regressor based on neural networks, was chosen as the baseline regressor. The original ELM, a machine learning algorithm for single-hidden-layer feedforward networks (SLFNs), was proposed by Huang et al. [52]. The learning parameters in the ELM, namely, the input weights and biases of a hidden node, can be set randomly without needing to be tuned in each iteration [53]. The output weights of the ELM can be determined analytically by a simple inverse operation. The only parameter that must be defined first in the ELM is the number of hidden nodes. The ELM has better performance than other SLFN algorithms, especially in terms of the duration of the learning process. In the ELM, it is assumed that any non-linear function can be used in the hidden layer. Figure 4 is an illustration of the ELM, where the enlarged part shows a hidden neuron in the ELM that can contain sub-hidden neurons. Several variations of ELM parameters were used to form various scenarios. These scenarios produced baseline regressors of diverse quality.
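The ELM training procedure described above can be sketched in a few lines. This is a generic textbook ELM, not the authors' code; the sine target is a toy stand-in for the traffic data, and the sigmoid activation is one admissible choice of non-linear hidden function:

```python
import numpy as np

def elm_train(X, y, n_hidden, seed=0):
    """Basic ELM for an SLFN: random input weights and biases (never tuned),
    sigmoid hidden layer, output weights solved in one shot via the
    Moore-Penrose pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer activations
    beta = np.linalg.pinv(H) @ y             # the only learned parameters
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy 1-D regression: fit one period of a sine wave.
X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y = np.sin(2.0 * np.pi * X[:, 0])
W, b, beta = elm_train(X, y, n_hidden=30)
mse = np.mean((elm_predict(X, W, b, beta) - y) ** 2)
print(mse)
```

Because only the output weights are solved (by least squares), training is a single linear-algebra step, which is why the ELM's learning duration is so short compared with iteratively trained SLFNs.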
Furthermore, the behavior of the DM-CCRF during interaction with the various baselines was observed. Each scenario was evaluated by the mean absolute percentage error (MAPE), which was used as a benchmark for the DM-CCRF. The MAPE formulation can be written as

MAPE = (100%/N) Σ_{i=1}^{N} |y_i − ŷ_i| / |y_i|,

where y_i is the actual value and ŷ_i the predicted value. In the case of the classical regression model, one way to choose the best model is to analyze the MAPE value, where the best model minimizes the MAPE value.
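As a direct implementation of the standard MAPE definition (the sample values below are invented):

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Assumes all actual values are non-zero (division by y_i)."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Example: per-point errors of 10%, 10%, and 0% average to ~6.67%.
print(mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0]))
```

Because each error is scaled by the actual value, MAPE is comparable across scenarios with very different traffic volumes, which is why it is used as the common benchmark here.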

Results and Discussion
Given several variations of kernel parameters with the ELM as the baseline regressor, the results of the interaction of the DM-CCRF with each parameter were obtained. These are represented by regularization coefficient values. The scenarios were built on the variation of the baseline regressor, which combined the number of kernel parameters and the regularization coefficient, and were obtained by fine-tuning the ELM results. Table 1 presents the variations of the ELM parameters as baseline regressors for the DM-CCRF in the scenarios. Fifteen scenarios were investigated, with the smallest number of kernels being 1 and the largest being 1,000,000. The regularization coefficient values ranged from 1 up to 1,000,000.

[Table 1. ELM parameter variations per scenario (number of kernels, regularization coefficient); only fragments survived extraction, e.g., scenarios 10–15 use 1,000,000 kernels with regularization coefficients of 5, 10, 50, 100, 1000, and 10,000.]

The evaluation of the performance of the ELM as a baseline regressor for the various scenarios is shown in Figure 5. The highest MAPE value for the ELM implementation, 184.76%, was given by the 15th scenario, while the lowest, 47.33%, was given by the 8th scenario. In the first scenario, the ELM gave a reasonably high MAPE value, which then decreased gradually until the 8th scenario. From the 9th to the 11th scenario, the MAPE value for the ELM rose, although not significantly. It rose significantly in the 12th scenario and then stabilized with a small increase. The performance of the DM-CCRF was then compared with the standard CCRF and the ELM as the baseline regressor. The same baseline regressor was implemented on the Highways Agency, United Kingdom, dataset [50] for both methods. Based on Figure 6, it can be seen that the DM-CCRF and the CCRF showed significant performance improvements compared with the performance of the baseline regressor in each scenario.

The results provided by the standard CCRF show its ability to suppress the errors obtained by the baseline regressor. Almost every scenario showed a decreased MAPE value for the standard CCRF compared with the ELM as the baseline regressor, except for the fourth and fifth scenarios. However, the DM-CCRF provided better results than the standard CCRF in terms of minimizing the MAPE values. Every scenario showed a decreased MAPE value with the DM-CCRF technique compared with the standard CCRF and the ELM. In Table 2, a direct comparison between the results of the DM-CCRF, the standard CCRF, and the baseline regressor is displayed. For each scenario, the DM-CCRF gave superior results: the MAPE values achieved by the DM-CCRF were consistently lower than those of the standard CCRF and the baseline regressor.
These results show the superiority of the DM-CCRF in suppressing error values in each scenario of traffic flow prediction. Whereas the standard CCRF suppressed errors by only ~7.63% compared with the baseline regressor, the DM-CCRF was able to suppress errors by up to ~17.047%.

[Table 2. Head-to-head comparison of the performance evaluation (MAPE) of the ELM, CCRF, and DM-CCRF per scenario; the numeric entries of the table did not survive extraction.]

The best performance of the DM-CCRF was achieved in the 15th scenario, where the DM-CCRF provided a difference in results of 17.047% compared with the baseline regressor. The lowest difference from the baseline regressor, 2.365%, was obtained in the 8th scenario. Compared with the standard CCRF, the DM-CCRF had the biggest difference in the 14th scenario, where the difference in error was 9.465%. The smallest difference, 1.299%, was found in the 8th scenario. Hence, it can be concluded that the DM-CCRF provided better predictive results for the traffic flow dataset than the standard CCRF or the ELM baseline regressor.

Conclusions
A modification of a probability approach, continuous conditional random fields (CCRF), was proposed, implemented with the ELM, and then utilized to assess highway traffic data. The modification was conducted to improve the performance of the ELM method. The experimental results showed that the proposed technique was better at suppressing the prediction error of the prediction model than the standard CCRF. A comparison between the ELM as the baseline regressor, the standard CCRF, and the modified CCRF was presented. The performance of the techniques was evaluated by analyzing their mean absolute percentage error (MAPE) values. The DM-CCRF was able to suppress the prediction model error twice as well as the standard CCRF method. Based on the attributes of the dataset, the DM-CCRF method was better for the prediction of highway traffic than the standard CCRF method and the ELM baseline regressor.

Future Work
In further research, we will observe whether the belief probability of the emerging prediction model continues to increase even when the gain in the belief level is not significant. Another open problem is that, although the DM-CCRF is superior to the standard CCRF, the modified method still yields a fairly large error value.