Applying Hybrid LSTM-GRU Model Based on Heterogeneous Data Sources for Traffic Speed Prediction in Urban Areas

With the advent of the Internet of Things (IoT), it has become possible to have a variety of data sets generated through numerous types of sensors deployed across large urban areas, thus empowering the notion of smart cities. In smart cities, various types of sensors may fall into different administrative domains and may be accessible through exposed Application Program Interfaces (APIs). In such setups, for traffic prediction in Intelligent Transport Systems (ITS), one of the major prerequisites is the integration of heterogeneous data sources within a preprocessing data pipeline, resulting in a hybrid feature space. In this paper, we first present a comprehensive algorithm to integrate heterogeneous data obtained from sensors, services, and exogenous data sources into a hybrid spatial–temporal feature space. Following a rigorous exploratory data analysis, we apply a variety of deep learning algorithms specialized for time series geospatial data and perform a comparative analysis of Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and their hybrid combinations. The hybrid LSTM–GRU model outperforms the rest with a Root Mean Squared Error (RMSE) of 4.5 and a Mean Absolute Percentage Error (MAPE) of 6.67%.


Introduction
Recently, the explosive growth of sensors, the internet, and data generation has provided new avenues for the storage, execution, and implementation of IoT-based applications. While the real-time availability of heterogeneous data generated from a huge number of sensors has brought about novel possibilities, a new set of challenges has also emerged, primarily related to the need for methodologies and capabilities to optimally harness the heterogeneity of data. Unveiling new interpretations by first integrating various forms of heterogeneous data and then applying intelligent algorithms to it is one of the most exciting possibilities of the notion of the Connected World [1].
Intelligent Transportation System (ITS) plays an enabling role in the realization of the concept of smart cities [2]. ITS has a huge demand to integrate data directly from various sensors, services built upon sensors, and from a variety of other exogenous data sources. ITS heavily relies on the power of IoT to build multisource, multisensor, and multimodel service systems for the prediction of traffic speed [3,4].
Major challenges faced during the activities of the data integration pipeline include sparsity handling, anomaly detection and rectification, normalization, map matching, and data resolution. The resolution quality of the fused and integrated traffic data plays a significant role when prediction algorithms are applied to it. The spatial-temporal nature of traffic data defines the two corresponding facets of its resolution on GIS maps. Spatial resolution identifies the length of the road segment on which prediction is computed, whereas temporal resolution is the minimum time interval for which the prediction is made. The spatial-temporal structure is used to integrate traffic data from different sources, e.g., Floating Car Data (FCD) and Google in our case. Data elements from different sources at specific road segments during a specific time interval are fused together, thus improving accuracy. Necessary transformations from Estimated Time of Arrival (ETA) to speed or vice versa are performed before the fusion. This may result in spatiotemporal gaps, due to the varying sparsity of data sources, that need to be addressed. Data from the exogenous sources, e.g., weather, holidays, or peak hours, are also woven into the same spatial-temporal framework in a similar fashion.
In this research, a detailed workflow is developed for the data integration pipeline that transforms raw data extracted from different heterogeneous data sources, passing through various preprocessing and integration activities and finally resulting in an integrated hybrid feature space. OpenStreetMap (OSM) was selected as the GIS mapping technology. The integrated data are mapped on the road segments between every two adjacent OSM nodes within time intervals of 15 min for the whole 24 h of the day. Keeping in mind the spatial-temporal nature of the data, machine learning and deep learning algorithms specialized for time series data were applied, and their results were compared.
The main contributions of the paper can be summarized as follows:
• We integrated the heterogeneous data sources of intelligent transportation systems for data collected from a particular city in Pakistan and built the hybrid LSTM-GRU model.
• We predicted the traffic speed on the basis of heterogeneous traffic data sources, including exogenous data sources, e.g., weather, events, and peak hours.
• We applied the hybrid LSTM-GRU model on time intervals varying from 15 min to 1 h and evaluated the effectiveness of the model.
The remainder of the paper is organized as follows: Section 2 presents the related work. Section 3 presents the proposed methodology based on Hybrid LSTM-GRU model. Section 4 provides the results and discussions. Section 5 concludes the paper and identifies the future research direction.

Related Work
The related work is organized around the integration of heterogeneous data sources obtained from a variety of sensors, leading to deep learning techniques applied to spatial-temporal traffic data. Salanova Grau et al. [5] presented a novel auto-DL approach for tuning the hyperparameters of the LSTM model in order to reduce time consumption. However, the authors worked only on temporal features, not spatial features, and did not focus on multiple data sources.
Tang et al. [6] improved the fuzzy neural network model in order to enhance the learning capability of the model. The proposed model consists of a supervised and an unsupervised learning process. In unsupervised learning, k-means is used to partition input samples, and in supervised learning, a weighted recursive least squares estimator is used for the optimization of hyperparameters. However, spatial patterns were not considered, while periodic features were extracted from traffic flow data. The authors did not deal with heterogeneous data sources of the arterial road network.
Ma, Tao, Wang, and Yu [7] proposed the novel LSTM-NN approach to overcome the issue of backpropagated error decay and to automatically determine the optimal time window for the time series data set. That is, the NN handles the short-term prediction of speed and the LSTM handles the long-term prediction of speed. The authors used microwave detector data for the prediction of speed. However, they addressed only the temporal features. Spatial features still need to be addressed, and different data aggregation levels should be investigated as an additional input that could make speed prediction more accurate and consistent.
Salanova Grau et al. [1] proposed a framework that collects data from conventional (cameras, radars, and loops) as well as innovative (FCD and Bluetooth devices) technologies and analyzes the ETA and traffic flow on road networks. They focused on data collection, filtration, and fusion of big data. They analyzed the conventional and innovative types of data sources with the help of statistical approaches such as correlation coefficients, R-square, absolute values, and percentage ranges. However, there is a need to check the spatial-temporal features of big data on road networks with the help of machine and deep learning models that are specially designed to handle time series and spatial patterns.
Bratsas et al. [8] applied different machine learning models and assessed their forecasting effectiveness on randomly selected dates, on randomly selected roads over eight consecutive 15 min intervals, and on the whole day. The experiments show that SVR performed well under stable conditions and MLP performed well under greater variations. However, the paper does not analyze the specialized deep learning models that are designed to deal with the spatial-temporal trends of the data.
Yang et al. [9] present the effects of heterogeneous data sources (parking meter transactions, traffic conditions, and weather conditions) and spatial-temporal nature of data on the parking occupancy by linking Graph CNN and LSTM deep learning models. However, the authors did not incorporate ETA, event, holiday, calendar, OSM, and count data sources in their work. The authors focused on single-scale occupancy prediction rather than multiple scales, i.e., parking meters to the level of aggregated zones.
Zhang et al. [24] proposed a deep learning-based multitask learning model to predict traffic speed on road networks. The authors used a hybrid approach to increase the performance of the proposed model on taxi GPS data. They extracted the spatial-temporal patterns with the help of the nonlinear Granger causality analysis method, and for hyperparameter tuning, they used the Bayesian optimization technique. However, the authors did not consider other modes of traffic and their influence on the prediction of speed.
Kong et al. [25] proposed an intelligent traffic recommendation system based on the LSTM deep learning model. However, they did not consider the performance gains that a multimodel approach can bring to intelligent traffic information, and they did not compare against other well-known deep learning models, e.g., CNN [26], GRU, and hybrid models.
Authors in [27–29] used hybrid graph convolutional neural networks, CNN-LSTM, and LSTM-NN models on real-time traffic speed. The hybrid models produced promising results; however, other data sources related to road traffic that directly or indirectly influence the prediction of speed on the road network were not taken into account.
Authors in [30–32] used k-means clustering, PCA, SOM, SVM, and SVR for the prediction of traffic speed at regular or irregular intervals. However, their prediction models can be improved with a fusion of nonrecurrent events (e.g., calendar, special events, accidents, and weather) and traffic flow analysis.
Li et al. [33] proposed a transfer-learning model to address missing data, data insufficiency, and model overfitting, and used a stacked LSTM to capture time series patterns. However, the authors did not analyze exogenous data sources, traffic types, or data sets for different traffic modes. A hybrid feature space may be helpful for the construction of rules for applying transfer learning to a specific area while considering spatial factors.
Mena-Oreja et al. [34] discussed the formation of congestion by using state-of-the-art error recurrent and deep learning models. The authors conducted a survey in the field of transportation and identified which deep learning and error recurrent models are helpful when considering spatial-temporal factors and other traffic conditions. The authors applied the error recurrent and deep learning models to real traffic data sets in order to generate a common benchmark under traffic congestion conditions and demonstrated that the error recurrent model shows better accuracy than the deep learning models. However, the authors do not focus on the impact of statistical and ensemble deep and machine learning models on traffic congestion conditions using real traffic data sets, and they do not discuss the hybrid feature space and its impact on traffic congestion prediction.
Ren et al. [2] propose a hybrid integrated-DL model in order to capture both spatial and temporal dependencies in the prediction of citywide spatial-temporal flow volume. The authors proposed a hybrid LSTM and ResNet model in order to deal with spatial-temporal effects on traffic volume. However, the authors reported that the model shows a large prediction error in sparse spatial areas and during sleeping hours. The authors do not focus on the applicability to other types of flows, e.g., passenger flow, bike flow, etc.
Yu et al. [35] introduced a piece-wise correlation function and the Jenks clustering method with dynamic programming to fix the relationship of segment intervals. They considered heterogeneous data sources, e.g., speed, traffic flow, density, and road occupancy, for short-term speed prediction on 5 and 10 min time intervals. They prepared the results on the basis of only three days, i.e., 1 February to 3 February 2015 (from Sunday to Tuesday). They did not discuss the effects of weekdays and nonweekdays on road traffic. The amount of data is very low, and it is difficult to explore the correlation function across the whole week.
Liu et al. [36] used an attention CNN to forecast traffic speed. They used 29,952 records as the training set and 5760 records as the test set. The amount of data is very low because only a single road is covered, which makes it difficult to explore traffic trends on adjacent roads. In an earlier work [37], we built a solution that used traffic and weather data to predict traffic congestion using Estimated Time of Arrival (ETA).
We propose a hybrid LSTM-GRU approach on heterogeneous data sources comprising 7,343,362 records from September 2020. We make a comprehensive comparison among well-known deep learning models, e.g., CNN, GRU, and their combinations.

Proposed Methodology Based on Hybrid LSTM-GRU Model
We obtained Floating Car Data (FCD) from a regional tracker company. A data integration pipeline was developed to clean, preprocess, and integrate these data with other data sources. The pipeline handles data from the FCD, Google, weather, peak hour, holiday, and OSM data sources. We addressed several issues related to the FCD data source, including zero speed adjustment, outlier removal, and map matching. A feature in the FCD data provides the reason a signal was generated by the tracker, e.g., regular time interval, ignition on/off, or turn. A map-matching technique was used for node correction, whereas a threshold value, parking information, and ignition on/off details were used for zero speed adjustment. Finally, all the data sources were aggregated on the basis of the wayid attribute of OSM and the time attribute at a regular interval of 15 min. This was followed by the normalization of road speed through the Permissible Speed Limit ($S_{PL}$) attribute and the Speed Performance Index (SPI).

Data Sources
Data from the following data sources were obtained and integrated.

FCD Data Source
FCD data were obtained from a regional tracker company. The data set contained events generated by 2895 unique tracker ids for the whole month of September 2020. The tracker units were fitted with a GSM modem (Quectel M95) and a GPS chipset (U-blox EVA-M8M). The key features of Floating Car Data (FCD) include latitude, longitude, date time, address, location, direction, speed, reason, and unit id. We faced multiple issues related to the preparation of the FCD data source. The following issues were identified in the FCD data:
• There was off-road mapping of cars. This could be due to two reasons: either the car appears off-road because of inherent GPS error, or the car was actually parked somewhere off the road.
• A large number of speed values generated by trackers were zero. This again could be due to two reasons: either the car was parked or it was stuck in congestion. The congestion data needed to be distinguished from the data related to parked cars.
• There was duplication of tuples.
• There were missing values causing spatial sparsity. This is because the FCD does not cover all road segments of the road network.
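The duplicate and zero-speed issues listed above can be illustrated with a minimal sketch. It is not the procedure of Algorithm 1: the record fields (`unit_id`, `timestamp`, `speed`, `reason`), the reason strings, and the 5-minute threshold are all assumptions made for illustration, and the rule used here (a zero-speed run counts as parked if the ignition was off during the run or the run lasts longer than the threshold) is only one plausible reading of the text.

```python
from datetime import datetime, timedelta


def drop_duplicates(records):
    """Remove exact duplicate tuples, keeping the first occurrence."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def label_zero_speed(records, threshold=timedelta(minutes=5)):
    """Label the time-ordered records of ONE tracker.

    Returns one label per record: 'moving', 'congestion', or 'parked'.
    The threshold value is hypothetical; the paper does not state it.
    """
    # Pass 1: ignition state in force at each record.
    ignition_on = True
    states = []
    for rec in records:
        if rec["reason"] == "ignition-ON":
            ignition_on = True
        elif rec["reason"] == "ignition-OFF":
            ignition_on = False
        states.append(ignition_on)

    # Pass 2: classify each run of consecutive zero-speed records.
    labels = ["moving"] * len(records)
    i = 0
    while i < len(records):
        if records[i]["speed"] != 0:
            i += 1
            continue
        j = i
        while j + 1 < len(records) and records[j + 1]["speed"] == 0:
            j += 1
        duration = records[j]["timestamp"] - records[i]["timestamp"]
        ignition_off = not all(states[i:j + 1])
        label = "parked" if ignition_off or duration > threshold else "congestion"
        for k in range(i, j + 1):
            labels[k] = label
        i = j + 1
    return labels
```

Only records labeled as congestion would then contribute zero speeds to the segment averages; parked records would be excluded.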
The issues pertaining to the missing values and off-road mapping of cars are resolved through the map-matching process described in Algorithm 1. When the nodes are placed correctly on the map, the in-between missing values can be suggested. The zero speed issue is resolved through the zero speed adjustment section described in Algorithm 1. Outliers are identified and rectified through the max speed attribute obtained from OSM. Any speed exceeding the max speed on a segment of the road is replaced with the max speed value. Speed Performance Index is used for the normalization of data.

Algorithm 1 Preprocessing and Data Integration Algo
    ...
    Initialize way-id to zero
    nodes-seg["start-point", "end-point"] := get_nearest_seg(latitude, longitude)
    if nodes-seg["start-point"] is zero then
        Check end-point value and update the start-point value with previous node value
    else
        nodes-seg["start-point"] and nodes-seg["end-point"] represent the same road points
    end if
end for
Update way-id
data.append(way-id)
Update data
take input records from data
F = 0
for t = 1 to n do
    Initialize speed = data["FCD-speed"]
    ...
    if reason is "ignition-ON" then
        F = 1
        if next reason is "ignition-OFF" then
            if speed is zero then
                if elapsed time > threshold then
                    continue F = 1
                end if
                if next speed is zero then
                ...
Initialize the data to empty
for each seg do
    for each agg-min each day do
        Compute avg-speed = $\frac{1}{n} \sum x_i$
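The outlier rule and the final averaging step of Algorithm 1 can be sketched as follows. This is an illustrative sketch only: the record fields (`way_id`, `timestamp`, `speed`) and the `max_speed_by_way` lookup are hypothetical names, and the 15 min slot is taken from the aggregation interval stated in the text.

```python
from collections import defaultdict
from datetime import datetime


def cap_speed(speed, max_speed):
    """Replace any speed above the segment's max speed with the max speed."""
    return min(speed, max_speed)


def slot_start(ts, minutes=15):
    """Floor a timestamp to the start of its aggregation slot."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def average_speed_per_slot(records, max_speed_by_way, minutes=15):
    """Average the capped speeds per (way_id, slot), i.e. avg-speed = (1/n) * sum(x_i)."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for rec in records:
        speed = cap_speed(rec["speed"], max_speed_by_way[rec["way_id"]])
        key = (rec["way_id"], slot_start(rec["timestamp"], minutes))
        sums[key][0] += speed
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Capping before averaging keeps a single erroneous GPS speed from inflating the mean of a sparsely covered segment.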

ETA Data Source
We obtained Google Maps data for more than 500 points of interest on important roads of Islamabad by sending start and end points to the Google Maps API. The Google data source is an authentic data source that provides Estimated Time of Arrival (ETA) information. Data obtained from Google can be easily mapped on OSM maps. The key features of the Google data are Source Latitude, Source Longitude, Max Speed, Date, Time, Destination Latitude, Destination Longitude, and Estimated Time of Arrival.
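Because the Google source reports ETA while the FCD source reports speed, one of the two has to be converted before fusion. A minimal sketch of that conversion is given below; it assumes the length of the road segment in metres is already known (for example from the road network data), which is an assumption made here rather than something the text specifies.

```python
def eta_to_speed_kmh(distance_m, eta_s):
    """Convert an ETA in seconds over a known distance in metres to km/h."""
    if eta_s <= 0:
        raise ValueError("ETA must be positive")
    return distance_m / eta_s * 3.6


def speed_to_eta_s(distance_m, speed_kmh):
    """Inverse conversion: travel time in seconds for a given speed in km/h."""
    if speed_kmh <= 0:
        raise ValueError("speed must be positive")
    return distance_m / (speed_kmh / 3.6)
```

For example, a 1 km segment with an ETA of 60 s corresponds to a speed of about 60 km/h.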

OSM Data Source
The road network attributes were fetched from OSM's Overpass API (Overpass Turbo). It provides the tags for Islamabad, which include start node, end node, highway types, max speed, min speed, way_id, and max length. Max speed is a feature of special importance that is utilized in the normalization of data and in outlier identification and removal. Some roads do not have a max speed feature associated with them; therefore, we had to insert it manually.

Calendar Data Source
In order to identify the effect of holidays on traffic at a particular location, we need a calendar data source. Behaviors and patterns of traffic are highly dependent on holiday data. The calendar data source features include DateTime, Name, and Type.

Weather Data Source
Real-time weather data are gathered from the Yahoo and Dark Sky APIs on the basis of time, latitude, and longitude. Table 1 shows the most relevant features of the proposed hybrid model. The maxspeed-real attribute is used to detect anomalies and normalize the speed.
From the feature engineering process, we derived the input feature set and then applied machine and deep learning models that are capable of capturing spatial-temporal effects in transportation data sources. We also applied hybrid approaches such as LSTM-GRU, GRU-LSTM, CNN-LSTM, and CNN-GRU, as these hybrid approaches enhance the performance of models and provide a more accurate prediction of the speed of a specific road at a specific time. The SPI is computed as in Equation (1), where $S_{t}^{i}$ is the observed speed and $S_{PL}$ is the permissible speed limit of the road segment:
$$\mathrm{SPI} = \frac{S_{t}^{i}}{S_{PL}} \times 100 \quad (1)$$
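The SPI normalization amounts to a one-line computation. The sketch below is illustrative; the function and parameter names are not taken from the paper.

```python
def speed_performance_index(speed, permissible_speed):
    """SPI as a percentage: observed speed divided by the permissible speed limit."""
    if permissible_speed <= 0:
        raise ValueError("permissible speed must be positive")
    return speed / permissible_speed * 100.0
```

A vehicle travelling at 40 km/h on a segment with an 80 km/h limit gives an SPI of 50, so roads with different limits become comparable on one scale.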

Data Integration Pipeline
In the integration pipeline, we performed transformations on the FCD and Google data sets. The map-matching process was used to obtain the data in spatial format; hence, the first transformation in the pipeline is map matching. Figure 1 illustrates the data integration procedure. Data are collected from different data sources such as Google, the tracking company, OSM road data, weather data, holiday data, and peak hour data. In addition to the map-matching procedure, the following activities were required for data transformation before the machine learning algorithms could be applied:
• Map matching of GPS points
• Handling the abnormal behavior of data
• Data generalization and transformation
• Calculating the average speed of each road section.
In the Google data set, the data points needed to be mapped on the OSM nodes. In this way, we could divide long roads into smaller segments, with each segment marked by two adjacent OSM nodes. For this purpose, the nearest API of OSRM was used. We verified the mapping results by visualizing the nodes on OSM maps. The mapping of FCD data and Google traffic data on OSM nodes provided a mechanism to spatially unify both traffic data sets. For temporal aggregation, the data points of both data sources were aggregated every 15 min for all the road segments defined on the OSM road network. The integrated data were then merged with holiday data based on the date field. The purpose of this integration was to account for the special effects of congestion during working hours, weekends, and holidays. These merged data were further integrated with road attributes on the basis of wayid. Furthermore, to handle environmental effects, we combined these integrated data with only those weather parameters that affect the behavior of the traffic, e.g., rain and visibility.
Algorithm 1 represents the procedure of map matching. The data contain the coordinate points with the latitude and longitude of the located geographical positions. First, these coordinates were sent to the nearest API of the OSRM server to obtain the pair of nodes of the segment containing the location of the driving vehicle that had already been determined from the different sources. This was followed by the ordering of nodes on OSM maps to identify whether the road is incoming or outgoing. Occasionally, the OSRM nearest API returns a zero value in place of the start node, which might be due to multiple options for the nearest start node owing to junctions on roads. For zero values, the end-node value was traced back in the OSM road information, and the immediately preceding node was assigned as the start-node value.
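The trace-back rule for a zero start node can be sketched as below. The sketch assumes the ordered list of node ids of the OSM way is available as `way_nodes`; that name, and the function itself, are hypothetical, and the call to the OSRM nearest service is deliberately left out.

```python
def resolve_segment(start_node, end_node, way_nodes):
    """Return a (start, end) node pair, repairing a zero start node.

    way_nodes is the ordered list of node ids along the OSM way
    (a hypothetical input, extracted from the OSM road information).
    """
    if start_node != 0:
        return start_node, end_node
    idx = way_nodes.index(end_node)
    if idx == 0:
        raise ValueError("end node is the first node of the way; nothing precedes it")
    # Assign the node immediately preceding the end node as the start node.
    return way_nodes[idx - 1], end_node
```

Raising an error when no preceding node exists makes such records visible instead of silently producing a degenerate segment.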
In the following subsections we briefly describe the existing and well-known LSTM and GRU techniques and then move on to describe our hybrid model based on combining these existing techniques.

LSTM
Long Short-Term Memory (LSTM) [42] is a variant of the recurrent neural network (RNN) [5,45] that is specialized for time series data. A generalized LSTM unit consists of a cell and three gates (input, output, and forget). The cell memorizes values and passes information to the output and forget gates. LSTM is used to address the vanishing gradient problem.

GRU
Gated Recurrent Unit (GRU) [43] is a more recent and simpler variant of LSTM and is also a type of recurrent neural network. It uses fewer parameters because it has only two gates, a reset gate and an update gate, in contrast to the three gates of LSTM. The update gate and reset gate are vectors that decide which information should be passed to the output.

Hybrid LSTM-GRU Model Description
Since our problem is a time series regression problem, we generated results using both classical and deep learning techniques. Our hybrid approach yielded a low RMSE and is thus more effective than classical regression techniques, e.g., KNN, XGBoost, linear regression, ANN, and MLP. In our hybrid LSTM-GRU model, we first applied LSTM, which is used to tackle the problem of vanishing gradients in backpropagation. LSTM contains three gates, i.e., the input gate ($i_g$), forget gate ($f_g$), and output gate ($o_g$). Gates control which information is stored in memory. The information is stored in analog format. The gate values are produced by the sigmoid function, which ranges between 0 and 1, and are applied by element-wise multiplication. If the value of a gate is zero, the information is ignored or discarded; otherwise, it remains in memory. Tanh [46] is a well-known nonlinear activation function that ranges between −1 and +1. In order to avoid information fading, a second derivative is used. The sigmoid function [47] is also a well-known nonlinear activation function, with values between 0 and 1. It is used to decide which information should stay in or be dropped from memory. The mathematical equations of the input gate ($i_g$), forget gate ($f_g$), and output gate ($o_g$) are taken and adapted from the literature and are explained in Equations (2)-(4). GRU, in contrast, contains two gates, i.e., the update gate ($u_g$) and reset gate ($r_g$). In this approach, the output of the LSTM is passed to the GRU. $x_{t}^{i}$ is the input feature set that contains the hybrid feature space (start-node, end-node, way-id, day, hour, agg-minutes, quarter, holiday, peakhour, maxspeed-real) at a specific time and location. The aggregate speed is the target or output label, similar to [48], in which the authors used LSTM and bidirectional LSTM models to predict stock prices.
The details and equations presented are taken and adapted from the literature such as [42,43].
Input Gate: $i_g$ → represents input gate
Forget Gate: $f_g$ → represents forget gate
Output Gate: $o_g$ → represents output gate
Update Gate: $u_g$ → represents update gate
Reset Gate: $r_g$ → represents reset gate
$\sigma$ → represents sigmoid function
$w_x$ → represents weight for the respective gate ($x$)
$h_{t-1}$ → output of the previous LSTM block at timestamp $t-1$
$x_t$ → input at current timestamp
$b_x$ → biases for the respective gates ($x$)
Cell Output: $C_t$ → memory at timestamp ($t$)
Cell Input: $\tilde{C}_t$ → represents candidate for memory at timestamp ($t$)
The cell input state is $\tilde{C}_t$, the cell output state is $C_t$, and LSTM consists of three gates $i_g$, $f_g$, and $o_g$. GRU consists of two gates $u_g$ and $r_g$. The hidden layers of the LSTM-GRU model are $\tilde{C}_t$, $\tilde{h}_t$, and $h_t$. The weights of LSTM are $w_i$, $w_f$, $w_o$, and $w_c$. The weights of GRU are $w_u$, $w_r$, $w_o$, and $w_{C_t}$. The LSTM-GRU model has biases $b_i$, $b_f$, $b_o$, and $b_c$. tanh is the hyperbolic tangent function, defined as the ratio of the hyperbolic sine to the corresponding hyperbolic cosine. The scalar product of two vectors is represented as •.
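For reference, the textbook forms of the LSTM and GRU updates can be written with the symbols defined above. These are the standard formulations from the literature and are not claimed to match the paper's Equations (2)-(11) term by term. Three notational choices here are assumptions: $[h_{t-1}, x_t]$ denotes concatenation, $\odot$ denotes element-wise multiplication, and $z_t$ (the input the GRU receives from the LSTM layer) and $w_h$ (the GRU candidate weight) are illustrative symbols. The GRU is written without biases, and the sign convention of the update gate varies between references.

```latex
% LSTM layer
\begin{align*}
i_g &= \sigma\left(w_i \cdot [h_{t-1}, x_t] + b_i\right) \\
f_g &= \sigma\left(w_f \cdot [h_{t-1}, x_t] + b_f\right) \\
o_g &= \sigma\left(w_o \cdot [h_{t-1}, x_t] + b_o\right) \\
\tilde{C}_t &= \tanh\left(w_c \cdot [h_{t-1}, x_t] + b_c\right) \\
C_t &= f_g \odot C_{t-1} + i_g \odot \tilde{C}_t \\
h_t &= o_g \odot \tanh(C_t)
\end{align*}

% GRU layer, fed with z_t = output of the LSTM layer at time t;
% h here denotes the GRU's own hidden state
\begin{align*}
u_g &= \sigma\left(w_u \cdot [h_{t-1}, z_t]\right) \\
r_g &= \sigma\left(w_r \cdot [h_{t-1}, z_t]\right) \\
\tilde{h}_t &= \tanh\left(w_h \cdot [r_g \odot h_{t-1}, z_t]\right) \\
h_t &= (1 - u_g) \odot h_{t-1} + u_g \odot \tilde{h}_t
\end{align*}
```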
When $x_t$ is passed to the input gate, it is multiplied by its own weight ($w_i$); $h_{t-1}$ is also multiplied by its own weight ($w_i$), and the bias ($b_i$) is then added. $h_{t-1}$ holds the information of the previous unit at $t-1$. The result passes through the sigmoid function, which converts it to a value between 0 and 1, and updates the status of the cell. The details and equations presented are taken and adapted from the literature such as [42,48,49].
Equations (5) and (6) describe how to produce a result between 0 and 1 using the sigmoid activation function. $\tilde{C}_t$ and $C_t$ are used to decide what information is kept in memory and what information is forgotten. $\tilde{C}_t$ is computed with the tanh function, which decides which values are more significant.
The details and equations presented are taken and adapted from the literature such as [43,48]. Equations (7) and (8) explain that $C_t$ is passed as an input to the first layer of GRU ($u_g$), whereas $u_g$ and $h_{t-1}$ are multiplied by their weights, and this information is forwarded to the reset gate ($r_g$).
$h_t$ decides which information is kept. The retained information is then passed to the output layer. The same layer contains tanh as an activation function and is used to predict road traffic speed at a specific time and location. This is also discussed in Equations (9)-(11). We used Adam as the optimizer and mean squared error as the loss function in this regression problem.
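The flow from the LSTM into the GRU can be made concrete with a scalar sketch, in which each layer has a single unit. It uses the standard textbook updates rather than the paper's exact equations, and the weight layout (a dictionary mapping each gate to a pair of weights for the previous state and the current input) is an assumption made for readability. GRU biases are omitted, and the update-gate convention shown is one of the two in common use.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One scalar LSTM step; w[g] = (weight on h_prev, weight on x_t)."""
    i_g = sigmoid(w["i"][0] * h_prev + w["i"][1] * x_t + b["i"])
    f_g = sigmoid(w["f"][0] * h_prev + w["f"][1] * x_t + b["f"])
    o_g = sigmoid(w["o"][0] * h_prev + w["o"][1] * x_t + b["o"])
    c_tilde = math.tanh(w["c"][0] * h_prev + w["c"][1] * x_t + b["c"])
    c_t = f_g * c_prev + i_g * c_tilde
    h_t = o_g * math.tanh(c_t)
    return h_t, c_t


def gru_step(z_t, h_prev, w):
    """One scalar GRU step fed with z_t, the LSTM output at this timestamp."""
    u_g = sigmoid(w["u"][0] * h_prev + w["u"][1] * z_t)
    r_g = sigmoid(w["r"][0] * h_prev + w["r"][1] * z_t)
    h_tilde = math.tanh(w["h"][0] * (r_g * h_prev) + w["h"][1] * z_t)
    return (1.0 - u_g) * h_prev + u_g * h_tilde


def hybrid_forward(xs, lstm_w, lstm_b, gru_w):
    """Run a sequence through the LSTM and pass each LSTM output to the GRU."""
    h_lstm = c_lstm = h_gru = 0.0
    for x_t in xs:
        h_lstm, c_lstm = lstm_step(x_t, h_lstm, c_lstm, lstm_w, lstm_b)
        h_gru = gru_step(h_lstm, h_gru, gru_w)
    return h_gru
```

In the full model the final GRU state would go through a dense output unit to produce the predicted speed; that layer is omitted here.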
In this study, the proposed LSTM-GRU model was applied to the data collected from Google and FCD, which comprise 7,343,362 records from September 2020. The traffic condition data were captured every fifteen minutes on arterial roads in Islamabad, Pakistan. The proposed stacked LSTM-GRU architecture consisted of four hidden layers with 256 hidden units each. tanh was used as the activation function in all hidden layers. In the dense layer, we used one unit. The linear activation function was used in the output layer. We employed holdout validation to split the data set into training and test sets. Then, we trained the model on the training data set by a batch learning approach using a batch size of 512. This was followed by checking the generalization of the model on the test data set. To evaluate the performance of the proposed deep architecture, we adopted RMSE, MAE, and MAPE as performance measures. The final configuration of the proposed LSTM-GRU model is summarized in Table 2. In the proposed LSTM-GRU model, we used four hidden layers (two for LSTM and the rest for GRU). We tested various configurations of hidden units, i.e., 4, 16, 32, 64, 128, and 256, and achieved the optimal results with 256 hidden units. Likewise, the tanh activation function in the hidden layers and the linear activation function in the output layer gave optimal results. Similarly, we varied the number of epochs from 5 to 50 and chose 10 as the optimal value. Moreover, we used a batch size of 512, a learning rate of 0.001, and the Adam optimizer as the optimal parameter values.
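The reported configuration and the holdout procedure can be summarized in a short sketch. The values in `CONFIG` are those stated in the text; the dictionary keys, the 20% test fraction, and the choice to split chronologically (rather than at random) are assumptions made for illustration, since the split ratio and ordering are not given.

```python
CONFIG = {
    "lstm_layers": 2,
    "gru_layers": 2,
    "hidden_units": 256,
    "hidden_activation": "tanh",
    "output_units": 1,
    "output_activation": "linear",
    "optimizer": "adam",
    "learning_rate": 0.001,
    "loss": "mse",
    "batch_size": 512,
    "epochs": 10,
}


def holdout_split(samples, test_fraction=0.2):
    """Chronological holdout: the last test_fraction of samples is the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = round(len(samples) * test_fraction)
    cut = len(samples) - n_test
    return samples[:cut], samples[cut:]


def batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]
```

A chronological split avoids evaluating on time slots that precede the training data, which matters for time series even though a random holdout is also common.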

Performance Measures for Proposed Hybrid LSTM-GRU Model
Well-known performance evaluation metrics are used to measure model performance and to identify the model that best predicts the output label. To evaluate the solution, we used existing performance evaluation metrics [50], namely RMSE (Root Mean Square Error), MAPE (Mean Absolute Percentage Error), and MAE (Mean Absolute Error).
The details and equations for the existing performance evaluation metrics presented below are taken and adapted from the literature such as [50–52]. RMSE is also known as RMSD (root-mean-square deviation). It is the square root of the mean squared difference between the desired output and the predicted output, as given in Equation (12):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{ds_i} - Y_{ps_i}\right)^{2}} \quad (12)$$
where $Y_{ds_i}$ = desired speed, $Y_{ps_i}$ = predicted speed, and $n$ = number of observations.
MAPE (mean absolute percentage error) is robust to large outliers. It eliminates the scaling factor and expresses the error as a percentage. The formula of MAPE is given in Equation (13):
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{Y_{ds_i} - Y_{ps_i}}{Y_{ds_i}}\right| \quad (13)$$
In our scenario, MAE (mean absolute error) is used to define how far our predicted output speed is from the desired output speed. Mathematically, this is expressed in Equation (14):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_{ds_i} - Y_{ps_i}\right| \quad (14)$$
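The three metrics described above follow their standard definitions and can be sketched directly. The function names are illustrative, and the MAPE sketch assumes that no desired speed is zero.

```python
import math


def rmse(desired, predicted):
    """Root mean squared error between desired and predicted speeds."""
    n = len(desired)
    return math.sqrt(sum((d - p) ** 2 for d, p in zip(desired, predicted)) / n)


def mae(desired, predicted):
    """Mean absolute error between desired and predicted speeds."""
    n = len(desired)
    return sum(abs(d - p) for d, p in zip(desired, predicted)) / n


def mape(desired, predicted):
    """Mean absolute percentage error; desired values must be nonzero."""
    if any(d == 0 for d in desired):
        raise ValueError("MAPE is undefined when a desired value is zero")
    n = len(desired)
    return 100.0 * sum(abs((d - p) / d) for d, p in zip(desired, predicted)) / n
```

For desired speeds of 50, 60, and 40 and predictions of 45, 66, and 40, the errors are 5, 6, and 0, giving an MAE of about 3.67, a MAPE of about 6.67%, and an RMSE of about 4.51.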

Results and Discussion
In this study, we worked on a regression data set with multiple timestamps and a single label. Some features related to traffic patterns, such as the minimum estimated time and the maximum speed per segment, were derived from the integrated data for each segment at specific time intervals.
Data were captured at a temporal resolution of 15 min and a spatial resolution of less than or equal to 1 km.
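One common way to obtain a data set with multiple timestamps and a single label is a sliding window over each segment's 15 min series. The sketch below illustrates that idea only; the text does not describe the windowing used, so the `lookback` length and the function name are hypothetical.

```python
def make_windows(series, lookback):
    """Turn one segment's speed series into (inputs, label) pairs.

    Each sample uses `lookback` consecutive intervals as inputs and the
    next interval as the single label.
    """
    if lookback < 1:
        raise ValueError("lookback must be at least 1")
    samples = []
    for end in range(lookback, len(series)):
        samples.append((series[end - lookback:end], series[end]))
    return samples
```

With a lookback of 4, for instance, each label would be predicted from the preceding hour of 15 min observations.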

Exploratory Data Analysis
We analyzed the influence of weekdays and weekends on traffic patterns and speed fluctuation. The Speed Performance Index (SPI) [53] is a derived feature calculated using Equation (15):
$$\mathrm{SPI} = \frac{V_O}{V_m} \times 100 \quad (15)$$
where $V_O$ = the current speed for the road segment and $V_m$ = the permissible max speed for the road segment.
SPI also provides a normalized expected speed, which prevents extreme values. Figure 2 presents the average SPI vs. time of day on weekdays. Different working days, i.e., Monday to Thursday, are shown with different colored lines. Figure 2 shows that the SPI changes considerably over different hours of the day. In the morning rush hours (5:00-6:00 a.m., 9:00-10:00 a.m., and 12:00-1:00 p.m.), the SPI is 66%, 64.6%, and 63%, respectively, which is higher than the average of the morning hours. During the evening rush hour (around 8:00-9:00 p.m.), the SPI is 55%. The differing trend of the SPI across time slots is relevant not only for model training but also for averaging the SPI for each time slot and for predicting speed on the basis of historical data. Various time slots, including 5:00-6:00 a.m., 9:00-10:00 a.m., 12:00-1:00 p.m., and 8:00-9:00 p.m., show high traffic congestion. Friday's SPI trend is different from that of other working days. On Friday, during the morning rush hour (10:00 a.m.-12:00 p.m.), the SPI is 65.7%, which is higher than the average of the morning hours. Friday traffic differs from other weekdays because of the Friday prayer, which is offered during 1:00-2:00 p.m. During the evening rush hour (9:00-10:00 p.m.), the SPI is 60%, which is higher than at any other time of day.

Feature Selection
Feature selection was used to identify the most relevant feature set by means of correlation. The correlation between each feature and the target variable was calculated. The correlation coefficient ranges between −1 and +1: +1 indicates a perfect positive correlation, −1 indicates a perfect negative correlation, and zero indicates no correlation. Both our input features and the target feature were numeric. Because correlation-based and mutual-information-based feature selection are among the most widely used techniques for numeric data, we used both in our study.

Correlation Feature Selection Technique
Correlation is a statistical measure of how two features change together. We used the Pearson correlation coefficient (PCC) [54] in our scenario.
Equation (16) defines the PCC, a standard measure of linear correlation between two features, as the covariance of the two features divided by the product of their standard deviations:
$$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} \qquad (16)$$
where $X$ and $Y$ are the two features. The PCC is normalized and ranges between −1 and +1. Table 3 shows that start-node, way-id, hour, peak hour, and max speed-real are the most relevant features, as they are positively correlated with the target. In the transportation domain, start-node and way-id capture the spatial impact, whereas hour and peak hour capture the temporal impact. Max speed-real is the permissible speed of the road. We selected the 9 most correlated features from the 29 combined features of all heterogeneous data sources. The correlation indicates how closely each feature is related to the target feature.
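Correlation-based selection of this kind can be sketched as follows. The sketch computes the PCC as covariance over the product of standard deviations and keeps the k features with the largest absolute correlation to the target. Ranking by absolute value is an assumption of this example (the text emphasizes positively correlated features), and the feature names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant input: correlation undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)


def select_top_k(features, target, k):
    """Return the k feature names with the largest |PCC| and all scores."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:k], scores


# Hypothetical columns; the names only loosely mirror those in the text.
features = {
    "max_speed_real": [30, 50, 60, 80, 100],
    "noise": [3, 1, 4, 1, 5],
    "constant": [1, 1, 1, 1, 1],
}
agg_speed = [25, 40, 50, 65, 85]
selected, scores = select_top_k(features, agg_speed, k=2)
```

In the paper's setting, `k` would be 9 and `features` would hold the 29 combined columns.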

Mutual Information Regression Feature Selection Technique
The mutual information feature selection technique is based on information gain [55]. It follows the decision tree principle and computes the information gain of each feature from its entropy. Entropy measures the disorder and uncertainty in the available data set and helps the decision tree draw boundaries. The most relevant feature has the highest information gain. Figure 4 indicates that start-node, end-node, way-id, and max speed have the highest information gain and therefore form the most relevant feature set.
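To make the entropy-based score concrete, the following sketch estimates mutual information between a numeric feature and a numeric target by discretizing both into equal-width bins. This histogram estimator is an assumption chosen for simplicity and is not necessarily the estimator used in the study; library implementations such as scikit-learn's `mutual_info_regression` use a nearest-neighbor estimator instead. The function names and the default number of bins are illustrative.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant column: a single bin
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=4):
    """Histogram estimate of I(X; Y) in nats."""
    bx = equal_width_bins(x, n_bins)
    by = equal_width_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[i] * py[j]))
    return mi


x = list(range(8))
y_dependent = [2 * v for v in x]  # fully determined by x
mi_dependent = mutual_information(x, y_dependent)
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], n_bins=2)
```

A feature that fully determines the target reaches the maximum score for the chosen binning, while a feature that is independent of the target scores zero, so ranking features by this score orders them by relevance.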

Heat Map of Hybrid Feature Space
A heat map is a visualization for analyzing the intensity, density, patterns, outliers, and variance of the feature set. It shows the correlation among all features. Figure 5 shows that maxspeed-real has the highest correlation with the target agg-speed, followed by way-id, day, and holiday, among the heterogeneous features. Here, maxspeed-real denotes the permissible road speed limit. It helps to identify outliers, trends, and patterns in road speed. In this way, we can handle missing speed values and also normalize the data set in order to predict the true speed of a specific road.
Figure 6 shows that LSTM-GRU yields the lowest RMSE, i.e., 4.5, compared with an RMSE of 4.86 for the deep learning technique LSTM and an RMSE of 6.03 for classical regression techniques such as KNN. The LSTM-GRU model is specialized for time series data and handles both spatial and temporal effects: LSTM captures the temporal effects, and GRU captures the spatial effects. This matters for our transport data set because, in the transportation domain, the prediction of speed depends strongly on the specific time and location. Table 4 reports three performance metrics, i.e., RMSE, MAE, and MAPE, for the individual and hybrid deep learning models, including the models with the lowest RMSE, i.e., the hybrid LSTM-GRU and GRU-LSTM models.
In Figure 7, the prediction horizon indicates the data capture intervals, i.e., 15, 30, 45, and 60 min. This helps in analyzing the impact of time resolution on model training. As we increased the sliding window, performance on the test data improved, which suggests that performance improves with the size of the sliding window. According to Table 4, the hybrid LSTM-GRU model produced the lowest errors, with an RMSE of 4.5, an MAE of 2.03, and a MAPE of 6.67%.
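The sliding-window framing and the three error metrics can be sketched as follows. The window construction, the mapping of one step to one 15 min interval, and the persistence baseline (predicting the last observed speed) are assumptions used only to exercise the metrics; they do not reproduce the LSTM-GRU model or the reported results.

```python
import math


def make_windows(series, window, horizon=1):
    """Build (inputs, label) pairs: `window` past steps predict the value
    `horizon` steps after the end of the window."""
    samples = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        label = series[start + window + horizon - 1]
        samples.append((inputs, label))
    return samples


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; true values must be nonzero."""
    return 100.0 * sum(
        abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up speed series at an assumed 15 min step.
speeds = [40, 42, 38, 35, 30, 33]
samples = make_windows(speeds, window=3, horizon=1)

# Persistence baseline: predict the last value seen in each window.
y_true = [label for _, label in samples]
y_pred = [inputs[-1] for inputs, _ in samples]
errors = {
    "rmse": rmse(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "mape": mape(y_true, y_pred),
}
```

Under the stated assumption of a 15 min step, setting `horizon` to 2, 3, or 4 would correspond to 30, 45, or 60 min ahead.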

Conclusions
This paper uses the speed performance index (SPI) as the indicator for evaluating the state of the road network. We integrated heterogeneous data, i.e., traffic, GPS, weather, special condition, and OSM data, obtained from a variety of sensors and services. We analyzed the behavior of the transportation data sources with the help of different machine learning and deep learning algorithms. The hybrid LSTM-GRU model proved to be the most effective among all the time series deep learning and classical machine learning models, with an RMSE of 4.5. In the future, we plan to label the classes automatically using fuzzy logic and k-means clustering, and then to analyze the results automatically using hyperparameter optimization and statistical models.