Road Traffic Prediction Model Using Extreme Learning Machine: The Case Study of Tangier, Morocco

Abstract: An efficient and credible approach to road traffic management and prediction is a crucial aspect of Intelligent Transportation Systems (ITS). It can strongly influence the development of road structures and projects. It is also essential for route planning and traffic regulations. In this paper, we propose a hybrid model that combines an extreme learning machine (ELM) and ensemble-based techniques to predict the future hourly traffic of a road section in Tangier, a city in the north of Morocco. The model was applied to a real-world historical data set extracted from fixed sensors over a 5-year period. Our approach is based on a type of Single hidden Layer Feed-forward Neural Network (SLFN) known for being a high-speed machine learning algorithm. The model was then compared to other well-known algorithms in the prediction literature. Experimental results demonstrated that, according to the most commonly used error measurements (RMSE, MAE, and MAPE), our model performs better in terms of prediction accuracy. The use of Akaike's Information Criterion (AIC) has also shown that the proposed model has higher performance.


Introduction
The growing size of cities and increasing population mobility have caused a rapid increase in the number of vehicles on the roads. Road traffic management now faces such challenges as traffic congestion, accidents, and air pollution. Recent statistics reveal that the majority of vehicle crashes usually happen in the areas around congested roads. This is due to the fact that the drivers tend to drive faster, before or after encountering congestions, in order to compensate for the experienced delay. As a consequence, researchers from both industry and academia focused on making traffic management systems (TMS) more efficient to cope with the above issues. Traffic data are growing fast, and their analysis is a key component in developing a road network strategy. The most important question is how to analyze and benefit from this gold mine of information in order to bring out predictions of future data. An accurate traffic prediction system is one of the critical steps in the operations of an Intelligent Transportation System (ITS) and is extremely important for practitioners to perform route planning and traffic regulations.
In this context, the primary motivation of this work is to exploit the mass of historical information collected from a road section in order to predict the next hourly traffic. The fundamental challenge of road safety is predicting the possible future states of road traffic with pinpoint accuracy. Predicted information helps to prevent the occurrence of incidents such as congestion and other road disturbances, but prediction methods require a large amount of traffic data. Some scientists have considered hybrid schemes that combine different models to forecast traffic, pointing out that single models are not efficient enough to handle all kinds of traffic situations [23]. While these models have had various levels of success, complex ones are time-consuming and not simple to implement.
In this study, we propose a hybrid model that combines an extreme learning machine (ELM) and ensemble-based techniques for road traffic prediction. Compared with traditional neural networks and support vector machines, ELM offers significant advantages such as fast learning speed, ease of implementation, and minimal human intervention [24,25]. We conducted a comparative study with traditional learning algorithms (MLP, SVR, and ARIMA) and concluded that our model performs better in terms of prediction accuracy. The experimental results showed that our system is robust to outliers and has superior generalization capability.
The data utilized are extracted from sensors in fixed positions, which are able to detect the presence of nearby vehicles. Initially, inductive loop detectors were the most popular, but nowadays a wide variety of sensors have become available [26]. The recorded data consist of the number of vehicles passing through a road section, together with date and time information, and cover a 5-year period from 2013 to 2017. The road section considered is about 13 km long and is located in Tangier, a major city of northern Morocco situated on a bay facing the Strait of Gibraltar, 17 miles (27 km) from the southern tip of Spain. The city is linked to Europe through its port, and the road section connects it to the rest of the country.
This paper attempts to provide satisfying answers to the following questions:
• What are the necessary features for road traffic prediction?
• How can we build a dataset based on fixed sensing data?
• How can we process our dataset using machine learning algorithms?
• How can we improve the predictability of our model?
The rest of the paper is organized as follows: Section 2 presents the methodology and the proposed approach; Section 3 shows the simulation and prediction results with a comparison of the models; finally, we conclude the paper in Section 4.

Methodology and the Proposed Approach
This section presents the proposed methodology and the process of building the dataset used in our study.

Data Collection
The source of data was obtained from the Moroccan Center for Road Studies and Research. The analysis of traffic data is an essential and critical input to the country's infrastructure and transport system strategies. The center uses either permanent or temporary inductive loops embedded under the road surface to collect a massive amount of statistics about traffic flow on some of the major roads in Morocco. The dataset contains recorded traffic flow over a 5-year period from 2013 to 2017 and is composed of a set of separate files. Each file contains traffic information about a specific time period. The meta-data of the files includes the date (day and hour) of the start and end of the recording, the number of lanes, and the traffic decomposition by vehicle class based on vehicle length L (Tables 1 and 2).

Table 2. Vehicle classification based on their length (L).

Data Preprocessing
The first step in the process of building the dataset consists of merging the road traffic data into a single file. In our study, we consider the sum of hourly traffic for all classes of vehicles. The data are then preprocessed by first adding some appropriate features that may influence the type of traffic flow and that constitute the input layer of our model (Figure 1).

The second step is to remove the data records that have missing values, since some observations may be lacking due to sensor failure or the absence of some feature data. The next step of data preprocessing is data standardization. By standardizing the data, all data values fall within a given range, and the influence of the input factors can be assessed based on the patterns they show rather than on their numeric magnitude. Data standardization has two main advantages: it reduces the scale of the network, and it ensures that input factors with large numerical values do not eclipse those with smaller numerical values. Machine learning algorithms may also be affected by the data distribution.
Thus, the model might end up with reduced accuracy. Given the use of small weights in the model and the error between predicted and expected values, the scale of the inputs and outputs used to train the model is a crucial factor. Unscaled input variables may result in a slow or unstable learning process, while unscaled target variables in regression problems may result in exploding gradients, causing the learning process to fail. It is therefore highly recommended to rescale the data to the same range.
One of the techniques that permits scaling data is Min-Max normalization; the default scaling range is 0 to 1:

$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$

The feature data were normalized using the previous formula, i.e., the default scaling range. The target data were normalized to the [−1, 1] range given by:

$x' = a + \frac{(x - x_{\min})(b - a)}{x_{\max} - x_{\min}}$

where a and b represent, respectively, the min and the max of the range, which in our case are −1 and 1.
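As a concrete illustration, the two normalizations above can be sketched in a few lines of Python (the helper name `min_max_scale` and the sample values are ours, not from the paper):

```python
def min_max_scale(values, a=0.0, b=1.0):
    """Rescale a list of numbers to the range [a, b] (Min-Max normalization)."""
    lo, hi = min(values), max(values)
    return [a + (x - lo) * (b - a) / (hi - lo) for x in values]

# Features use the default [0, 1] range; targets use [-1, 1], as in the paper.
features_scaled = min_max_scale([10, 20, 30, 40])            # -> [0.0, 0.333..., 0.666..., 1.0]
targets_scaled = min_max_scale([100, 150, 200], a=-1, b=1)   # -> [-1.0, 0.0, 1.0]
```

The same min/max computed on the training set would normally be reused to scale the test set, so that both live in the same range.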

Extreme Learning Machine
The extreme learning machine (ELM) is a novel and fast learning method for SLFNs (Figure 2), where the input weights and the hidden layer biases are randomly chosen [24]. The optimization problem arising in learning the parameters of an ELM model can be solved analytically, resulting in a closed-form solution involving only matrix multiplication and inversion [27]. Hence, the learning process can be carried out efficiently without requiring an iterative algorithm, such as a backpropagation neural network (BP-NN), or the solution of a quadratic programming problem as in the standard formulation of SVM. In ELM, the input weights (linking the input layer to the hidden layer) and hidden biases are randomly chosen, and the output weights (linking the hidden layer to the output layer) are analytically determined using the Moore-Penrose (MP) generalized inverse [28].

Given a finite training set $\{(x_j, y_j)\}_{j=1}^{M} \subset \mathbb{R}^n \times \mathbb{R}^m$, where $x_j$ are the input variables and $y_j$ the output variables, the SLFN with N hidden layer neurons is written in the following general form:

$\sum_{i=1}^{N} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, \dots, M$

where $w_i$ is the input weight vector of the ith hidden layer neuron, $\beta_i$ is the output weight vector of the ith hidden layer neuron, $b_i$ is the threshold of the ith hidden layer neuron, and $g(x)$ is the activation function. ELM seeks to minimize the training error:

$\sum_{j=1}^{M} \| o_j - y_j \|$

which is equivalent to the existence of $\beta_i$, $w_i$, $b_i$ such that:

$\sum_{i=1}^{N} \beta_i \, g(w_i \cdot x_j + b_i) = y_j, \quad j = 1, \dots, M$

The above equations can be written compactly as:

$H \beta = Y$

where H is the output matrix of the hidden layer, with entries $H_{ji} = g(w_i \cdot x_j + b_i)$. The output weight matrix is calculated by:

$\beta = H^{\dagger} Y$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of H.
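The closed-form training procedure above can be sketched in a few lines of NumPy. This is a minimal illustrative ELM, not the authors' implementation; the function names, the tanh activation, and the standard-normal initialization are our assumptions:

```python
import numpy as np

def elm_fit(X, Y, n_hidden, seed=0):
    """Train an ELM: draw random input weights and biases, then solve the
    output weights analytically via the Moore-Penrose pseudoinverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # input weights w_i (random)
    b = rng.standard_normal(n_hidden)                # hidden biases b_i (random)
    H = np.tanh(X @ W + b)                           # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ Y                     # beta = H^+ Y
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Forward pass: hidden activations times the learned output weights."""
    return np.tanh(X @ W + b) @ beta
```

Training reduces to one matrix product and one pseudoinverse call, which is why ELM is fast compared to iterative backpropagation.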

Ensemble Based Systems in Decision Making
Ensemble Based Systems in Decision Making (EBDM) use a set of models, each obtained by applying a machine learning technique to specific data, and then integrate these models, each weighted differently, in order to get the final prediction. They have been widely used in machine learning and pattern recognition applications, since they have been shown to perform favorably and to significantly increase prediction accuracy compared to single models. There are many reasons to use EBDM, including statistical reasons, very large volumes of data, too little data, or data fusion. More details are available in [29]. There are several techniques, including methods based on combination rules, such as algebraic combination, voting-based approaches, and decision templates, and methods such as Bagging, Boosting, and AdaBoost that require modifications to the learning process [30]. The technique adopted in this paper is arithmetic averaging, because it is simple and gives powerful results [2]. The general formula is:

$\mu_i = \frac{1}{N} \sum_{k=1}^{N} y_i^k$

where N is the number of ensemble members, $y_i^k$ is the output of the kth model for the ith sample, and $\mu_i$ is the averaged output to be considered.
The ELM model generates different values for each application on the test dataset due to the random generation of its input weights and the bias of the hidden layer. However, these variations in predictions are negligible compared to those of other ANNs. Still, to solve this issue, we will consider the ensemble methods that are used to reduce these small fluctuations and make the prediction more robust than the simple model with better precision. To do this, we will get several ELM models that we will apply to our data while keeping the test and training datasets the same. Since the prediction results are different for each model, the final predicted value considered will be the algebraic mean of all the results obtained.
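The averaging step of this scheme can be sketched as follows (the function name and the toy forecast values are illustrative assumptions, not the paper's data):

```python
import numpy as np

def ensemble_average(predictions):
    """Arithmetic averaging of N predictors: mu = (1/N) * sum_k y^k."""
    return np.mean(np.asarray(predictions), axis=0)

# e.g., three ELM runs (different random input weights) producing
# slightly different hourly forecasts for the same two test hours:
run_outputs = [np.array([410.0, 388.0]),
               np.array([402.0, 395.0]),
               np.array([406.0, 390.0])]
final = ensemble_average(run_outputs)  # -> array([406., 391.])
```

The average damps the run-to-run fluctuations caused by the random input weights while leaving the shared signal intact.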

Autoregressive Integrated Moving Average (ARIMA)
The ARIMA (Autoregressive Integrated Moving Average) model has been used extensively in time series modeling, as it is robust and easy to implement. The autoregression (AR) part of ARIMA is a linear regression on prior values of the variable of interest and can predict future values. The moving average (MA) part is a linear regression on the past forecast residual errors. The I in ARIMA stands for "integrated" and refers to differencing, i.e., computing the differences between consecutive observations to eliminate non-stationarity. The full ARIMA(p, d, q) model can be written as:

$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$

where $y'_t$ is the d-times differenced series, $\phi_i$ are the AR coefficients, $\theta_i$ the MA coefficients, and $\varepsilon_t$ the white-noise error term. To determine the values of p and q, one can use the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots. The ACF plot shows the autocorrelation between $y'_t$ and $y'_{t-k}$, and the PACF shows the autocorrelation between $y'_t$ and $y'_{t-k}$ after removing the effects of lags 1, 2, 3, ..., k−1.
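The differencing step and the sample autocorrelation used when choosing p and q can be sketched as follows (the helper names and toy data are our own; a production ARIMA fit would typically rely on a statistics library):

```python
import numpy as np

def difference(y, d=1):
    """Apply d-th order differencing to remove non-stationarity (the 'I' in ARIMA)."""
    y = np.asarray(y, dtype=float)
    for _ in range(d):
        y = np.diff(y)
    return y

def acf(y, k):
    """Sample autocorrelation between y_t and y_{t-k}, for lag k >= 1."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    return float(np.sum(y[k:] * y[:-k]) / np.sum(y * y))
```

For example, a series with a pure linear trend becomes constant after one difference: `difference([1, 3, 5, 7, 9])` yields `[2, 2, 2, 2]`.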

Support Vector Regression
Support Vector Regression (SVR) is a regression algorithm that applies a technique similar to that of support vector machines (SVM) to regression analysis. Regression data contain continuous real numbers. To model such data, SVR fits the regression line so that it approximates the best values within a given margin called the ε-tube (epsilon ε identifies the tube width). SVR attempts to fit the hyperplane that contains the maximum number of points while minimizing the error. The ε-tube takes into consideration the model complexity and the error rate, which is the difference between the actual and predicted outputs. The constraints of the SVR problem are described as:

$|y_i - (w \cdot x_i + b)| \le \varepsilon$

where the equation of the hyperplane is:

$\hat{y} = w \cdot x + b$

The objective function of SVR is to minimize the coefficients:

$\min_{w} \; \frac{1}{2} \|w\|^2$
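The ε-tube behavior can be illustrated with a short sketch of the ε-insensitive loss (the function name, epsilon value, and sample residuals below are assumptions for illustration):

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR loss: residuals inside the epsilon-tube cost nothing;
    outside the tube, the cost grows linearly with the excess."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(residual - epsilon, 0.0)

# Residuals of 0.05 (inside the tube) and 0.3 (outside) with epsilon = 0.1:
loss = epsilon_insensitive_loss([1.0, 2.0], [1.05, 2.3], epsilon=0.1)  # approximately [0.0, 0.2]
```

Points inside the tube thus never influence the fit, which is what makes SVR robust to small noise.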

Multi-Layer Perceptron
Multi-layer Perceptron (MLP) is a type of feedforward neural network (FNN) trained with a supervised learning algorithm. It can learn a non-linear function approximator for either classification or regression. An MLP consists of at least three layers of nodes: an input layer, one or more hidden layers, and an output layer. MLP has the following disadvantages: it requires tuning a number of hyperparameters, such as the number of hidden neurons, layers, and iterations, and it is sensitive to feature scaling.
For regression problems, MLP uses backpropagation (BP) to compute the gradients that minimize the squared-error loss function. Since BP has a high computational complexity, it is advisable to start training with a small number of hidden neurons and few hidden layers. However, with too few (in contrast to too many) neurons, it might have a high training error due to underfitting (in contrast to overfitting). Therefore, BP-based ANNs tend to exhibit sub-optimal generalization performance [24,31].

Evaluation Metrics
The built model may be evaluated by different criteria, which give a clear view of its performance and of how well it predicts the results. For regression models that predict numerical values, there are several metrics, such as RMSE, MAE, and MAPE [32]. In this paper, these three metrics were used to evaluate model accuracy. In addition, the AIC was used to perform model comparisons. The formulas of these indices are given as follows.

RMSE (Root Mean Square Error):

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2}$

MAE (Mean Absolute Error):

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| x_i - \hat{x}_i \right|$

MAPE (Mean Absolute Percentage Error):

$\mathrm{MAPE} = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{x_i - \hat{x}_i}{x_i} \right|$

where N represents the number of observations in the test dataset, $x_i$ is the real output of the ith observation, and $\hat{x}_i$ is the predicted value for the ith observation.

AIC: Akaike's Information Criterion [33] is used to check the complexity of models. The model with the minimum AIC value is considered the best and least complex among the alternative models. The AIC formula is defined by:

$\mathrm{AIC} = n \ln\left(\frac{RSS}{n}\right) + 2K$

where RSS is the Residual Sum of Squares, K is the total number of parameters in the model, and n is the number of observations.
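The four evaluation formulas translate directly into code; this sketch uses our own helper names and toy values, not the paper's data:

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(sum((x - p) ** 2 for x, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(x - p) for x, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    return 100.0 / len(actual) * sum(abs((x - p) / x) for x, p in zip(actual, predicted))

def aic(rss, n, k):
    """Akaike's Information Criterion for a least-squares fit: lower is better."""
    return n * math.log(rss / n) + 2 * k
```

For instance, with actual values [100, 200] and predictions [110, 190], RMSE and MAE are both 10.0 and MAPE is 7.5%.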

Proposed Framework and Prediction Evaluation of ELM Model
In order to reduce the fluctuations generated by the ELM model, we have considered ensemble methods. The purpose is to make the final decision of the model more accurate. The proposed framework combines the high performance of ELM with the advantages of ensemble methods. In the first step, we fit N times the ELM model on our training data, where N is the number of ensembles. The choice of the number of neurons in the hidden layer of ELM, as well as N, is made based on the quality of the results. Subsequently, the outputs of the ensemble are algebraically combined by averaging, which can reduce the risk of bad selection of a poor predictor (Figure 3).
It is generally known that several parameters may influence the accuracy of the model, such as the number of hidden neurons, the size of the data, and the activation function used. Researchers and specialists in this field generally test several combinations or use approved approaches. The ELM algorithm may give different results for different parameters. Therefore, we have considered 20, 50, and 100 trials for each fixed size of the ELM model, and the result taken into account is the average of the result values. Our ELM network is composed of 10 input layer neurons (attributes), a hidden layer, and a single output layer node giving the 1 h traffic prediction. The choice of the size of the hidden layer is based on the results of several tests; the number of neurons giving the best performance is then considered.

Figure 4a illustrates the average test error (RMSE). As shown in the figure, the outputs are normalized between 0 and 1; it represents the RMSE on the traffic test set. After 10 tests, we found that the optimal number of hidden nodes for our learning set varies between 80 and 360, as described by [24]. We observe that the RMSE decreases as the number of neurons increases and is optimized at 180 neurons in the hidden layer. Due to the random selection of hidden neurons, these results are obtained over several trials, and the average value is taken.
During the preparation of our dataset, it was decided to add information on features that better describe the observations (see Section 2.2). This resulted in incomplete records due to the absence of this information, representing 2.8% of the dataset size. These records were therefore discarded, and our model then uses 25,323 observations, of which 22,790 are allocated to the learning phase and 2533 to testing the model.
Then, the objective was to determine the optimal parameters for our model, namely the number of hidden neurons and the number of ensembles by considering the evaluation criteria that show less fluctuation.
The ELM algorithm was then evaluated several times to determine the best number of hidden neurons. For this purpose, the model was tested on different sizes of the hidden layer in the range of 10 to 400 with a step of 10. For each hidden layer size, the model was trained, tested, and evaluated successively 20, 50, and 100 times. The evaluation criteria (RMSE, MAE, and MAPE) were then calculated and their average values taken. The evaluation metrics for each number of trials are presented in Figure 4. It can be observed that the curves for 50 and 100 trials have almost the same pattern with little fluctuation, which means that they give nearly the same quality of results.
Furthermore, the pattern for 20 trials shows fluctuations in the RMSE and MAE measurements, which means that it will probably not be retained as the number of ensembles. On the other hand, the execution time for 50 trials is significantly shorter than for 100 trials. Consequently, the number of ensembles considered in this study is 50.
The evolution of the model performance according to the number of hidden neurons for 50 trials (Figure 4b,c) shows that, for our training set, the optimal number of hidden neurons giving the best results with the least fluctuation is 250.
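The selection procedure described above (train and evaluate each candidate hidden-layer size over several random trials, then keep the size with the lowest average error) can be sketched generically; `train_fn` and `eval_fn` are hypothetical callbacks we introduce for illustration, not functions from the paper:

```python
import numpy as np

def select_hidden_size(train_fn, eval_fn, sizes=range(10, 401, 10), trials=50):
    """For each hidden-layer size, average the test error over several
    random-initialization trials, and keep the size with the lowest mean."""
    mean_errors = {}
    for n_hidden in sizes:
        errors = [eval_fn(train_fn(n_hidden, seed)) for seed in range(trials)]
        mean_errors[n_hidden] = float(np.mean(errors))
    best = min(mean_errors, key=mean_errors.get)
    return best, mean_errors
```

Averaging over trials is what smooths out the fluctuations caused by ELM's random input weights before the comparison between sizes is made.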
The simulation results were obtained on a 2.5 GHz Intel Core i5-7200U computer running Python 3.6.

Model Evaluation and Comparison to Other Methods
We tested the ELM model based on the obtained optimal number of neurons in the hidden layer. The performance of our model has been compared to SVR and ARIMA. We tested several combinations of the cost parameter C and the kernel parameter γ. The authors of [30] recommend choosing the best combination among the values $C = 2^{12}, 2^{11}, \dots, 2^{-1}, 2^{-2}$ and $\gamma = 2^{4}, 2^{3}, \dots, 2^{-9}, 2^{-10}$ based on the cross-validation method.
In Figure 5, the x-axis represents the predicted hourly road traffic in chronological order. The y-axis shows the residuals of the predicted values with respect to the real ones. We observe that there is no particular pattern for the five models, which indicates a good fit. The shapes of the curves show that the hybrid EBDM model gives a remarkable improvement over the single ELM model. Moreover, the residuals of the latter are clustered close to the zero line. These smaller values indicate that our model is more accurate. This is due to the right choice of relevant parameters as features, the high quality of the data after preprocessing, and the number of neurons.

Figure 5. The absolute residual error in the test phase.

Figure 6 and Table 3 show that our model using EBDM generates better values, with higher accuracy than ELM itself. The y-axis represents the real value, the predicted one, and the residuals of the predicted values with respect to the real ones. It can be seen that the fluctuations generated by the random generation of the input parameters are reduced. EBDM actually reduces the prediction error of ELM while maintaining an acceptable execution time, thanks to its ability to adapt to the small fluctuations generated by ELM at each prediction.

The Akaike Information Criterion (AIC) takes into account both the complexity and the performance of the model; according to the AIC values in Table 3, the EBDM model has the lowest cost in the training phase. This confirms that our approach is of higher quality than the SVR, ARIMA, and ANN models. As can be seen in Figure 6 (ELM-EBDM), the hybrid model reduces the fluctuations generated by the ELM model, which improves the quality of the prediction results.

Conclusions
In this paper, we have discussed and employed the extreme learning machine algorithm to design an accurate model for predicting hourly road traffic. The study uses historical data from the past 5 years on a road section in Tangier, Morocco. The main contribution of our work is the development of a hybrid model, called ELM-EBDM, that shows far better performance than a single ELM without affecting the speed of the learning process. Our proposed model has been evaluated by comparison with the well-known algorithms MLP, SVR, and ARIMA. The simulation results demonstrated that it gives better performance in terms of both accuracy and stability. The proposed traffic prediction model represents a reliable tool for engineers and managers to plan roads and regulate traffic. It will also allow road users to obtain prior information about traffic, so that they can plan trips and manage time efficiently by avoiding congestion hours.
The loop detectors used may suffer malfunctions or power failures, which can lead to incorrect or incomplete recordings. Therefore, our model will also serve to reconstitute lost or destroyed data with the predicted values. This allows filling in the gaps in the traffic matrix used to calculate the common traffic indicator, the Annual Average Daily Traffic (AADT). This indicator is widely used for determining the type of road maintenance operations to be planned in the future. In our future work, we may consider some improvements by adding more relevant information about traffic, such as specific events or weather conditions. Many adjacent roads and their traffic attributes may influence a particular road; therefore, we will consider different numbers of nearby highways and traffic attributes for target roads to further improve the performance of our approach. We also plan to develop an algorithm to automatically determine the parameters (number of neurons and number of ensembles) as an alternative to the tedious manual selection method.