A Distributed Machine Learning-Based Scheme for Real-Time Highway Traffic Flow Prediction in Internet of Vehicles

Alnami, Hani; Mahgoub, Imad; Al-Najada, Hamzah; Alalwany, Easa

doi:10.3390/fi17030131

¹ Electrical Engineering & Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

² Computer Science Department, Jazan University, Jazan 82917, Saudi Arabia

³ College of Computer Science and Engineering, Taibah University, Yanbu 46421, Saudi Arabia

* Author to whom correspondence should be addressed.

Future Internet 2025, 17(3), 131; https://doi.org/10.3390/fi17030131

Submission received: 24 January 2025 / Revised: 8 March 2025 / Accepted: 14 March 2025 / Published: 19 March 2025

(This article belongs to the Special Issue Joint Design and Integration in Smart IoT Systems)

Abstract

Abnormal traffic flow prediction is crucial for reducing traffic congestion. Most recent studies have utilized machine learning models in traffic flow detection systems. However, these detection systems do not support real-time analysis. Centralized machine learning methods face a number of challenges due to the sheer volume of traffic data that needs to be processed in real-time. Thus, they are not scalable and lack fault tolerance and data privacy. This study designs and evaluates a scalable distributed machine learning-based scheme to predict highway traffic flows in real-time. The proposed system is segment-based, where the vehicles in each segment form a cluster. We train and validate a local Random Forest Regression (RFR) model for each vehicle cluster (highway segment) using six different hyper-parameters. Due to the variance of traffic flow patterns between segments, we build a global Distributed Machine Learning Random Forest (DMLRF) regression model to improve the system performance for abnormal traffic flows. Kappa Architecture is utilized to enable real-time prediction. The proposed model is evaluated and compared to other baseline models, namely Linear Regression (LR), Logistic Regression (LogR), and K Nearest Neighbor (KNN) regression, in terms of Mean Square Error (MSE), Root Mean Square Error (RMSE), Mean Absolute Error (MAE), R-squared (R²), and Adjusted R-Squared (AR²). The proposed scheme demonstrates high accuracy in predicting abnormal traffic flows while maintaining scalability and data privacy.

Keywords:

internet of vehicles; real-time traffic flow prediction; intelligent transportation systems; big data processing; distributed machine learning; distributed architecture; Apache Spark

1. Introduction

Transportation systems provide huge amounts of data that can be obtained in real-time from different sources such as sensors, detectors, smart cards, and video cameras [1,2,3]. By utilizing this big traffic data, researchers can discover the reasons for traffic accidents and congestion, recurrent traffic jams in certain areas, traffic delays at intersections, and several other recurrent issues in transportation systems [4,5,6,7,8,9,10,11]. However, the amount of traffic data is increasing rapidly to the petabyte scale, which exposes centralized machine learning prediction methods [12,13] to a major scalability limitation.

Several big data techniques are utilized for processing and mining, such as Artificial Intelligence tools, Machine Learning, and Data Mining [14]. Big data techniques are used in different fields of study, where researchers have achieved outstanding results [15,16,17,18]. In transportation, decision makers should provide optimal solutions in real-time since vehicles move very fast. Apache Spark (3.x) and Hadoop (3.3.x) can process and store large amounts of data with low computation time thanks to Hadoop MapReduce and Spark in-memory parallel distributed computing. These two big data platforms are widely used in industry and academia [19,20]. Studying the recurrent issues of traffic flow in historical data helps big data scientists provide optimal traffic solutions for drivers [21,22,23,24,25,26,27,28].

This research proposes a scalable real-time distributed machine learning-based scheme for the Internet of Vehicles to predict highway traffic average speed and detect abnormal traffic flow patterns. The proposed system consists of two main stages: the learning stage and the real-time stage. In the learning stage, a highway is divided into small complex segments, and each Complex Segment (CS) consists of several segments. Each CS is located between two exits. We train and validate a local Random Forest Regression (RFR) model for each vehicle cluster (highway segment) using six different hyper-parameters: max depth, min samples split, min samples leaf, number of estimators, bootstrap sample, and max features. The Randomized Search Cross Validation (RSCV) technique is utilized to determine the values of the six hyper-parameters for each segment’s model. By utilizing Stacking ensemble learning, all RFR models are used as inputs to build the Distributed Machine Learning Random Forest (DMLRF) model. Kappa Architecture (KA) is utilized to enable real-time traffic flow predictions [29]. The proposed system is implemented on Google Cloud Platform (GCP) using a private computing cluster of 17 nodes, where 16 nodes are assigned as worker nodes (each highway segment is assigned to one computing node) and one node is the master node that controls the processing [30]. In the real-time stage, the Discretized Stream (DStream) of Apache Spark Streaming is utilized to process real-life traffic data records and store them as batches. The real-time stream records are sent to the streaming layer of KA, where Spark Streaming processes the real-life data and then uses the saved DMLRF model to provide real-time traffic flow predictions. This research utilizes real-life traffic data obtained from the Florida Department of Transportation District 4 (FDOT-D4) [31]. The contributions of this study are as follows:

Build a scalable distributed machine learning system that can predict highway traffic flows. The scheme divides a highway into small complex segments, where each complex segment is located between two exits, to control the level of congestion and allow vehicles to exit the highway before entering the congestion zone.
Build a distributed machine learning model to improve the system performance for abnormal traffic flows. The distributed machine learning model is trained with different hyper-parameter values, which are determined from the traffic flow patterns of each segment of the complex segment. Each segment therefore has a model whose hyper-parameter values are tuned with the RSCV technique. All segment models then participate in building the proposed distributed learning model. The Stacking ensemble learning method is utilized to build the DMLRF model.

The remainder of this paper is organized as follows. Section 2 describes the most recent related work. Section 3 presents the preliminaries of this study. Section 4 shows the research methodology, model design, and challenges. Section 5 describes the real-life traffic dataset and all preprocessing steps. Section 6 presents our distributed machine learning architecture workflow and its components. Section 7 shows the performance evaluation and results. Section 8 concludes the paper and outlines future work.

2. Related Work

Distributed machine learning architecture has attracted researchers due to its high scalability and reliability. However, very few papers have discussed traffic flow prediction using distributed machine learning and parallel analysis to deal with big traffic data challenges. Providing an optimal traffic flow prediction is challenging for a large-scale traffic network in which numerous vehicle detection systems are geographically distributed along highways to record traffic and collect its data.

Recent studies have investigated diverse Artificial Intelligence (AI) and deep learning techniques for improving traffic flow prediction and event detection in vehicular networks. Tan et al. [32] presented a comprehensive analysis of machine learning applications in vehicular networks, including mobility management, handover optimization, and predictive routing to address dynamic network challenges in high-mobility scenarios. Olugbade et al. [33] explored AI and ML-based incident detection systems in road transport, highlighting significant advancements in real-time vehicle monitoring, route optimization, and predictive fleet maintenance, all of which enhance road safety. Abdullah et al. [34] proposed a soft GRU-based recurrent neural network model for predicting congestion in smart cities, utilizing historical traffic data and sensor inputs to enhance traffic flow and reduce congestion. These studies demonstrate the growing role of AI-driven methodologies in optimizing traffic management and improving transportation efficiency.

A distributed machine learning method was developed in [35] to predict traffic flow and deal with big traffic data challenges. A deep convolutional neural network model was implemented for use in the authors’ parallel training approach. The framework was implemented on an Apache Spark cluster to perform parallel distributed machine learning. Real traffic data was obtained from the California Department of Transportation Performance Measurement System (PeMS). The main dataset, collected from 12 highways, was decomposed into sub-datasets to be utilized by different Spark nodes (machines) to perform parallel computations. Nine months of traffic data records from 2016 were used for training and one month was used for testing. The models were trained on the sub-datasets and produced local training parameters, which were sent to the master node to be averaged into global learning parameters. These global learning parameters were sent back to the slave nodes for the subsequent iteration. This process continued until the best accuracy was achieved. The performance of the model was evaluated using MSE, RMSE, and MAE over different time intervals. Based on the results, the proposed model improved the efficiency of the training process.

A Distributed Automatic Long Short-Term Memory Customization Algorithm (DALC) for predicting highway traffic speed was developed in [36]. The authors started by designing the algorithm for a single detector, using a Markov decision process formulation to define the best values of two hyperparameters: the number of hidden layers and the number of epochs. Based on the Markov decision process formulation, they designed the ALC algorithm for a single detector. They then proposed that each detector should have its own algorithm that can work in parallel, due to the different traffic speed patterns. After analyzing five weekdays of traffic speed records starting from 16 October 2017, they found that the traffic patterns in the morning were different from the traffic patterns at night. So, they proposed that each detector should have two models. They utilized a dataset obtained from the California Department of Transportation Performance Measurement System (PeMS) that was collected from 60 detectors. They utilized Apache Hadoop YARN and Spark to build a private cluster of 30 nodes, where each node had two DALC algorithms. Their algorithm was compared with five different distributed machine learning algorithms that were built with Apache Spark and used the same dataset. The proposed algorithm outperformed the other algorithms in terms of average absolute relative error.

To implement a traffic flow prediction model that satisfies real-world application requirements, Boukerche and Wang focused on improving the prediction model features by implementing a hybrid deep learning model that combines a Graph Convolutional Network (GCN) and the deep aggregation structure to predict traffic flow [37]. This approach used parallel training that combined several edge devices with a central system to improve the accuracy and efficiency of the system. To improve the performance of the model across different lengths of time lag, a refinement learning approach was applied. The proposed system used traffic data obtained from PeMS and was evaluated in terms of MAE, R-squared, and RMSE.

In [38], the authors propose hybrid designs that integrate Convolutional Neural Networks (CNNs), Graph Convolutional Networks (GCNs), and transformer-based models for large-scale road networks. These hybrid models effectively combine spatial and temporal analyses, leveraging attention mechanisms to dynamically capture dependencies in traffic data. The study highlights significant improvements in prediction accuracy and scalability, demonstrated through evaluations on datasets such as PeMS-Bay and Metr-La, with models like SAGCN-SST and GMAN showing notable performance.

Alsubai et al. [39] present the Improved Arithmetic Optimization Algorithm-based Deep Learning Traffic Congestion Control (IAOADL-TCC) model, which employs a hybrid CNN-LSTM architecture to optimize traffic flow prediction. The model applies a hyperparameter tuning approach using arithmetic optimization to enhance the CNN-ALSTM framework, achieving a notable average accuracy of 98.03% on a Kaggle road traffic dataset. The IAOADL-TCC model outperformed previous approaches, including UAV-CNN, SVM, and fuzzy logic models, by effectively capturing spatio-temporal patterns. Another study [40] explores the development of traffic flow prediction methods, transitioning from statistical models like ARMA to advanced deep learning approaches such as LSTM, CNN, and GCN. Hybrid models, including IGA-LSTM and DAFF-Net, have improved effectiveness by capturing spatio-temporal patterns and addressing inefficiencies. GCN-based models, such as FAST-GCN, effectively manage non-Euclidean traffic data, leading to notable improvements in accuracy. However, challenges persist in handling sparse data and ensuring real-time adaptability.

Fouladgar et al. proposed a decentralized deep learning method to provide scalable, real-time traffic flow predictions [41]. They used a dataset obtained from the Caltrans Performance Measurement System (CPMS) in California, where 48 days of data from 51 different locations were used to train their model and the next 12 days were selected for validation. To avoid the class imbalance problem of their dataset, in which heavy traffic was significantly less frequent than light traffic, they adopted a regularized Euclidean loss function. CNN and LSTM were used to build their decentralized deep learning architecture. The accuracy of both models was high, but it decreased during rush hours due to the unbalanced dataset.

An improved hybrid traffic flow prediction model was proposed in [42]. The hybrid model combined a decomposition process and a prediction process to predict traffic flow. The process was as follows: ensemble empirical mode decomposition was utilized to decompose the original nonlinear and complex highway traffic flow data. Then, a new reconstructed component was obtained using an improved weighted permutation entropy (IWPE). To create the reconstructed component, a gray wolf optimizer was used to optimize the least-squares support vector machine. According to the authors’ analysis and results, the proposed model provided useful traffic flow information.

Building on these studies, this research proposes a scalable, real-time distributed machine learning-based scheme for the Internet of Vehicles to predict highway traffic average speed and detect abnormal traffic flow patterns. The application of distributed machine learning to highway traffic flow prediction is still in its infancy. Only a small fraction of existing work utilizes distributed machine learning, and that work builds its schemes on a local level of modeling using small datasets. In this study, we propose a scalable distributed machine learning-based scheme to predict highway traffic flows in real-time. The system’s models are trained on the first seven years of data and validated on the last year of data. To enhance the performance of the model for abnormal traffic flows, we introduce a global stage where the model is further trained with different hyper-parameter values based on the different traffic flow patterns. Randomized Search Cross Validation is performed to determine the hyper-parameter values due to its efficiency in tuning the hyper-parameters of random forest regression models.

3. Preliminaries

3.1. Kappa Architecture

Jay Kreps introduced the Kappa Architecture (KA) in 2014 to reduce the code complexity of the Lambda Architecture [29,43]. Lambda Architecture is a scalable, fault-tolerant framework with three layers. The batch-processing layer aggregates historical data but is time-consuming, while the stream-processing layer handles real-time data for quick insights. The serving layer merges both outputs for efficient querying. KA consists of a single machine learning pipeline in which all data is treated as a stream. KA eliminates the batch layer of the Lambda Architecture and therefore consists of the streaming and serving layers only. As represented in Figure 1, KA works as follows: real-time data is processed by a message-passing system such as Kafka, which creates the real-life stream and sends it to the streaming layer. In the streaming layer, Apache Spark MLlib can be utilized to build machine learning models for real-time predictions. The results are sent to the serving layer, where they can be stored in most well-known storage management systems such as HBase, Cassandra, and MySQL. In the serving layer, real-time analysis and decisions are provided to users. We use KA in our research to build a real-time prediction system.

KA can be implemented with Apache Spark and its Spark Streaming component. Spark is a big data engine that includes different Application Programming Interfaces (APIs) such as MLlib, Spark Streaming, Spark SQL, and GraphX [44]. Spark has been used recently to build big data applications for ITS due to its ability to process large-scale distributed data in real-time. Spark is known for in-memory computation and for processing data up to 100 times faster than Apache Hadoop. In every Spark application, a Spark context is created to manage all tasks. Once the Spark context creates Resilient Distributed Datasets (RDDs), the cluster manager distributes the RDDs among the available worker nodes in the cluster to be processed in parallel and provides the results in real-time. Figure 2 demonstrates the Apache Spark architecture.

3.2. Ensemble Learning and Hyper-Parameters Tuning

3.2.1. Ensemble Learning

Most modern applications require their problems to be solved with high accuracy, which can be achieved by using a centralized machine learning architecture. However, when the centralized method is not an option for a big data problem and distributed machine learning is capable of dealing with it, the learning process is carried out in two stages: a local stage and a global stage. In the local stage, the machine learning models are expected to perform worse than a centralized learning model that has access to all the data. In the global stage, the machine learning models are aggregated using an ensemble method to improve the performance [45]. More information on ensemble learning can be found in [46]. The Stacking ensemble learning technique is applied in our proposed architecture to build our distributed learning model. The following are well-known ensemble methods [47]:

Boosting: Models are trained sequentially. When some data are misclassified by a model during the training process, a new model is trained with more emphasis on the misclassified data in order to improve the classification accuracy.
Bagging: The process of training several machine learning models on bootstrap samples of the training data and combining them into one model.
Bucketing: Selecting the model that achieves the best evaluation metrics.
Stacking: The process of reducing the variance of machine learning classifiers by taking the outputs of multiple classifiers/regressors as the inputs of a new classifier/regressor.
Random Forests: Increases the overall performance by averaging the predictions of multiple decision trees into a single prediction [48].

3.2.2. Hyper-Parameters Tuning

Hyper-parameter tuning is a technique for improving a machine learning algorithm’s performance. Hyper-parameters and their values can be chosen arbitrarily. However, randomized search cross validation and grid search cross validation can be adopted to determine the best values for the selected hyper-parameters automatically [49,50]. These two techniques test an algorithm several times, once per cross-validation fold, for each combination of hyper-parameter values. After a certain number of iterations, the best hyper-parameter values for the dataset are determined and utilized for training the algorithm. The difference between the two approaches is that grid search cross validation consumes more time in determining the values of the hyper-parameters, since it validates every combination of values in the grid, whereas RSCV samples combinations at random. RSCV is applied in our system to select the best hyper-parameter values for our local random forest models. More details on these two techniques are available in [51].
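The cost difference between the two search strategies can be illustrated with a small sketch. The search space below uses the six RFR hyper-parameters named in this paper, but the candidate values, the number of sampled combinations, and the fold count are illustrative assumptions rather than the settings used in the study.

```python
import itertools
import random

# Hypothetical search space for the six RFR hyper-parameters named in the text;
# the candidate values are illustrative assumptions, not the paper's settings.
SEARCH_SPACE = {
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "n_estimators": [50, 100, 200],
    "bootstrap": [True, False],
    "max_features": ["sqrt", "log2"],
}


def grid_candidates(space):
    """Enumerate every combination of values, as grid search does."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]


def random_candidates(space, n_iter, seed=0):
    """Draw n_iter combinations at random, as randomized search does."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n_iter)]


def fits_required(n_candidates, cv_folds):
    """Each candidate is trained once per cross-validation fold."""
    return n_candidates * cv_folds


grid = grid_candidates(SEARCH_SPACE)
sampled = random_candidates(SEARCH_SPACE, n_iter=20)
print("grid search:", len(grid), "candidates,", fits_required(len(grid), 5), "fits")
print("random search:", len(sampled), "candidates,", fits_required(len(sampled), 5), "fits")
```

Under these assumed values, the grid grows multiplicatively with every added hyper-parameter, while the randomized search keeps the number of model fits fixed by the chosen number of iterations.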

4. Research Methodology

4.1. Research Methodology

Intelligent Transportation Systems aim to provide a high level of safety and efficiency for drivers, and monitoring highway traffic in real-time is one of their highest-priority applications. However, highways extend for hundreds of miles and vehicles move very fast, which makes monitoring and predicting traffic flow patterns in real-time with a centralized system very challenging [52,53]. This research proposes a scalable distributed machine learning-based scheme for the Internet of Vehicles to predict highway traffic average speed and detect abnormal traffic flow patterns. The experimental environment included Python (3.10) for data preprocessing, Google Cloud Platform (GCP) for deployment, and Apache Spark for data processing and real-time analytics using Kappa Architecture. The proposed model was evaluated using the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-Squared (R²), and Adjusted R-Squared (AR²) metrics.

We assume that the I-95 highway is divided into Complex Segments (CSs), where each CS is located between two exits as represented in Figure 3. Each CS consists of 4 segments and each segment has a Road Side Unit (RSU). In each segment (cluster), a leader vehicle (cluster head) collects traffic information from the vehicles in that segment using Vehicle to Vehicle (V2V) communication. This information is sent to the RSU, where machine learning models are deployed and ready to provide prediction results in real-time. This method allows us to break down a complex problem into small sub-problems, where different machine learning algorithms can collaborate and provide the traffic prediction results in parallel. Since the geographical location of each cluster produces different traffic flow patterns, different learning algorithms trained with different hyper-parameter values are utilized to build a distributed machine learning model. These hyper-parameter values are determined from the different traffic flow patterns, so each segment has specific hyper-parameter values based on its historical traffic flow data. The block diagram in Figure 4 provides an overview of the DMLRF model’s structure for predicting the average speed of highway traffic. More details on the proposed model are available in Section 6.
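The five evaluation metrics named in this section can be computed as in the following minimal sketch. It uses the standard textbook definitions; the function name, the sample speeds, and the choice of three input features (total volume, average occupancy, and average delay) for the adjusted R² are assumptions made for illustration.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": r2,
        "AR2": adj_r2,
    }


# Made-up observed and predicted average speeds (mph), for illustration only.
observed = [60, 55, 30, 65, 20, 45]
predicted = [58, 57, 33, 64, 25, 44]
metrics = regression_metrics(observed, predicted, n_features=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the adjustment divides by n minus the number of features minus one, the adjusted value is only defined when there are more observations than predictors plus one.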

4.2. Model Design

4.2.1. Random Forest Regression

Random forest regression is an ensemble learning algorithm that combines a large set of regression trees. A regression tree represents a set of conditions that are hierarchically organized from the root to a leaf of the tree [54]. The random forest trains each tree on a bootstrap sample drawn from the original training dataset. Algorithm 1 works as follows: for each tree in the forest, a bootstrap sample is selected from $T$, where $T^{(i)}$ denotes the $i$th bootstrap sample. A modified decision tree is then learned from this sample. The modification works as follows: at each node of the tree, a subset of the features $z \subseteq Z$, where $Z$ is the set of features, is selected to examine a possible split instead of using all the features. Rather than splitting the node on $Z$, the algorithm uses $z$ to define the split on the best feature. Determining which feature to split on is the most computationally expensive part of decision tree learning. The predicted value of an observation is calculated by averaging the predictions of all the trees. Algorithm 1 is utilized to build the base models in Step 1(b) of Algorithm 2.

Algorithm 1 Random Forest Base-line Model

Precondition: A training set $T := \{(x_{1}, y_{1}), \ldots, (x_{n}, y_{n})\}$, features $Z$, and number of trees in the forest $B$.
function RandomForest($T$, $Z$)
    $H \leftarrow \emptyset$
    for $i \in 1, \ldots, B$ do
        $T^{(i)} \leftarrow$ a bootstrap sample of $T$
        $DT_{i} \leftarrow$ RandomizedTreeLearn($T^{(i)}$, $Z$)
        $H \leftarrow H \cup \{DT_{i}\}$
    end for
    return $H$
end function
function RandomizedTreeLearn($T$, $Z$)
    At each node:
        $z \leftarrow$ very small subset of $Z$
        split on best feature in $z$
    return the learned tree
end function
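The structure of Algorithm 1 can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the study: each tree is reduced to a depth-one regression stump, so the feature subset is drawn once per tree (its only node), and the rows of (total volume, average occupancy, average delay) with their average speeds are made-up values.

```python
import random


def fit_stump(rows, targets, feature_ids):
    """Find the best single (feature, threshold) split by squared error."""
    best = None  # (sse, feature, threshold, left_mean, right_mean)
    for f in feature_ids:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[f] <= threshold]
            right = [y for r, y in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # no valid split: fall back to a constant prediction
        mean = sum(targets) / len(targets)
        return (feature_ids[0], float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean


def random_forest(rows, targets, n_trees, n_sub_features, seed=0):
    """Build H: one stump per bootstrap sample and random feature subset."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample T^(i)
        z = rng.sample(range(n_features), n_sub_features)   # feature subset z of Z
        forest.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], z))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees in the forest."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


# Made-up records: (total volume, average occupancy %, average delay) -> speed (mph).
ROWS = [(120, 5, 0), (300, 12, 1), (650, 30, 6), (800, 45, 12), (200, 8, 0), (720, 38, 9)]
SPEEDS = [68, 64, 35, 18, 66, 25]
forest = random_forest(ROWS, SPEEDS, n_trees=25, n_sub_features=2)
print(round(predict_forest(forest, (750, 40, 10)), 1))
```

A full regression tree would apply the same sampling of a feature subset recursively at every node; the stump keeps the sketch short while preserving the bootstrap, feature-subset, and averaging steps.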

Algorithm 2 DMLRF model creation

Input: Split traffic dataset $D$ into $K$ subsets $D = \{D_{1}, D_{2}, D_{3}, \ldots, D_{K}\}$, such that $D_{i}$ is the data subset of Segment $i$. Combine segments between exits into $J$ complex segments.
Output: A Distributed Machine Learning Random Forest (DMLRF) model for real-time traffic flow prediction.
Step 1: for $i \leftarrow 1$ to $K$:
    (a) Determine the hyper-parameters’ values using RSCV for data subset $D_{i}$.
    (b) First-level training: train a random forest regression (RFR) model for data subset $D_{i}$.
End for
Step 2: for $i \leftarrow 1$ to $J$:
    Second-level training: stack the outputs of the individual estimators (RFR) and use a meta model (DMLRF) to compute the final prediction for Complex Segment $i$.
End for
return DMLRF

4.2.2. Stacking Ensemble Learning

In this section, we discuss the proposed prediction model. As shown in Algorithm 2, the main traffic dataset $D$ is divided into sub-datasets $D = \{D_{1}, D_{2}, D_{3}, \ldots, D_{K}\}$, where $K$ denotes the number of segments, and each segment’s input parameters are total volume, average occupancy, and average delay. The output parameter of each segment is average speed. For each segment subset $D_{i}$, an RFR model is utilized as a base regressor, as demonstrated in Algorithm 1. In Step 1, the base RFR models are trained with the tuned hyper-parameter values determined by the RSCV technique. Due to the variance between the traffic flow patterns of the sub-datasets, the hyper-parameter values of the RFR models differ. In Step 2, the meta regression model is developed. The Stacking ensemble learning technique is used to build the DMLRF model: it takes the outputs of the individual estimators (RFRs) as the inputs of a meta model (DMLRF), which computes the final prediction and is trained using cross-validated predictions of the base estimators [55].
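The two training levels can be sketched as follows under strong simplifying assumptions. Least-squares lines on a single feature stand in for the per-segment RFR base models, a ridge-regularized linear model stands in for the meta model, the meta model is fitted on the training records rather than on cross-validated predictions, and all names and data values are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line, standing in for a per-segment RFR model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def fit_meta(base_preds, targets, ridge=1e-3):
    """Ridge least squares on [1, p_1, ..., p_K], the stacked base outputs."""
    rows = [[1.0] + list(p) for p in base_preds]
    d = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    aty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(d)]
    return solve(ata, aty)


# Made-up (average occupancy %, average speed mph) records for two segments
# of one complex segment.
segments = {
    "seg_1": ([5, 10, 20, 40], [68, 63, 52, 30]),
    "seg_2": ([8, 15, 30, 50], [66, 60, 41, 15]),
}
# First level: one base model per segment, trained only on that segment's data.
base_models = {name: fit_line(xs, ys) for name, (xs, ys) in segments.items()}


def base_outputs(occupancy):
    return [slope * occupancy + intercept for slope, intercept in base_models.values()]


# Second level: the meta model is trained on the stacked base outputs.
all_x = [x for xs, _ in segments.values() for x in xs]
all_y = [y for _, ys in segments.values() for y in ys]
meta_w = fit_meta([base_outputs(x) for x in all_x], all_y)


def predict(occupancy):
    feats = [1.0] + base_outputs(occupancy)
    return sum(w * f for w, f in zip(meta_w, feats))


print(round(predict(10), 1), round(predict(45), 1))
```

Since both stand-in base models are lines in the same feature, their outputs are collinear; the small ridge term keeps the meta model's linear system solvable in that case. With nonlinear base models such as random forests, the stacked outputs would generally carry more distinct information.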

4.3. Research Challenges

The first challenge is analyzing the big traffic data. For instance, our traffic data exceeds 18 million traffic flow records obtained from 16 detectors located on Florida’s I-95 highway. The second challenge is maintaining high accuracy for the utilized models, because the distributed architecture divides the main data into small datasets, so the machine learning models are trained on small fractions of the main dataset. The third challenge is the high demand for computational resources. Since our architecture is based on a distributed learning framework, every segment (cluster) has a machine learning model that needs to be implemented and evaluated, which consumes more computational resources. So, each segment has a separate worker node (machine) that is responsible for all the analysis and evaluation of its machine learning model.

5. Data Description and Preprocessing

5.1. Data Description

In this research, we utilize a real-life traffic dataset, which we obtained directly from the Florida Department of Transportation District 4 Regional Transportation Management Center (FDOT-D4 RTMC). It is commonly referred to as SunGuide traffic data. While it is not publicly available, it can be requested from FDOT-D4 RTMC. Over 400 vehicle detection systems were used to collect the traffic data from three main highways of Florida (I-95, I-75, and I-595). The detectors are located every half mile on the three highways in both directions. The highway traffic flow dataset covers 8 years of traffic records, from 2008 until November 2015. It provides real-time traffic flow information including occupancy, average speed, total volume, average delay, and other valuable information for each lane of the freeways. FDOT-D4 collected the traffic data at a 15-min time interval, which gave us more opportunity for subsequent data processing. Table 1 lists the important parameters of the road traffic records and their descriptions. In this study, we focus on traffic flow data obtained from 16 detectors located on the I-95 highway for 8 years starting from 2008. The total number of traffic flow observations across all detectors is 18,812,799 records.

5.2. Data Preprocessing and Transformation

Data preprocessing and transformation are essential steps for any machine learning algorithm to provide accurate prediction results. Removing null values and detecting outliers can significantly affect an algorithm’s accuracy. Traffic data usually suffer from class imbalance issues, which can be addressed by applying over-sampling or under-sampling techniques such as SMOTE, Near-Miss, and ROSE [56,57,58]. More details on the preprocessing and transformation of our datasets are available in our previous studies [59,60]. We trained and validated our proposed model using seven years of traffic data from 2008 to 2014, which constitutes 99.8% of the total dataset. The training-validation split was 80% for training and 20% for validation. We then tested our model on one week of traffic flow patterns from 2015, which accounts for 0.2% of the original dataset.

6. Proposed System Architecture and Components

6.1. Proposed System Architecture

The proposed system consists of two main stages: the learning stage and the real-time stage. The main task of the learning stage is to build and update our proposed DMLRF model. As demonstrated in Figure 5, the FDOT-D4 historical traffic dataset is divided into small parts based on the available detectors. Each detector covers half a mile, so the highway is divided into half-mile segments. An RFR model is assigned to each segment, and each model has six hyper-parameters: max depth, min samples split, min samples leaf, number of estimators, bootstrap sample, and max features. A full description of each hyper-parameter is available in Table 2. The RSCV technique is utilized to determine the values of the six hyper-parameters for each segment’s model. Due to the variance of highway traffic flow patterns across these segments, each RFR model is trained with different hyper-parameter values. By utilizing Stacking ensemble learning, all RFR models are used as inputs to build the DMLRF model. The DMLRF model is therefore built from models with different hyper-parameter values, which makes it a distributed machine learning model.

In the real-time stage, also illustrated in Figure 5, the Discretized Stream (DStream) of Apache Spark Streaming is used to process real-life traffic data records and store them as batches. The real-time stream records are sent to the streaming layer of the Kappa Architecture (KA), where Spark Streaming processes the data and uses the saved DMLRF model to provide real-time traffic flow predictions. Prediction results are sent to the serving layer of KA for analysis. In the serving layer, the system calculates the Speed Reduction Index (SRI) and compares it with a threshold value. The SRI ranges from 0 to 10, where 0 means free-flowing traffic and 10 means vehicles are not moving. A value of 4 marks the beginning of congestion and is our threshold value. If the SRI exceeds the threshold, the system calculates the traffic delay time on each segment. A total delay time is calculated for multiple adjacent complex segments, each located between two highway exits, and is shared with vehicles that have not yet entered the affected area. Internet of Vehicles communication schemes can be used to send and receive traffic beacon information (Average Speed, Total Volume, and Average Delay) between a cluster head and cluster nodes. Approaching vehicles can then decide either to stay on the highway or to use alternative roads. If the SRI exceeds 9 for each segment, the total delay may exceed one hour and 30 min for some complex segments of the I-95 highway. The SRI formula is presented in Equation (1). The traffic flow prediction results are also used to update the historical database, so that the system remains up to date.

\mathrm{SRI} = \left(1 - \frac{\mathrm{AveragePredictedSpeed}}{\mathrm{SpeedLimit}}\right) \times 10

(1)
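As a rough illustration of the serving-layer check, the sketch below assumes the SRI is computed as (1 - predicted speed / speed limit) × 10, the form consistent with the 0-to-10 scale described in the text. The function names, the 65 mph default, and the clamping of out-of-range values are assumptions made for this example.

```python
CONGESTION_THRESHOLD = 4.0  # SRI value the text treats as the onset of congestion


def speed_reduction_index(predicted_speed_mph, speed_limit_mph=65.0):
    """Return the SRI on a 0-10 scale (0 = free flow, 10 = stationary)."""
    sri = (1.0 - predicted_speed_mph / speed_limit_mph) * 10.0
    # Clamp so that speeds above the limit map to 0 (an assumption of this sketch).
    return min(10.0, max(0.0, sri))


def needs_delay_estimate(predicted_speed_mph, speed_limit_mph=65.0,
                         threshold=CONGESTION_THRESHOLD):
    """True when the SRI exceeds the threshold, i.e. a delay should be computed."""
    return speed_reduction_index(predicted_speed_mph, speed_limit_mph) > threshold
```

Under these assumptions, a predicted speed of 39 mph on a 65 mph segment sits at the threshold, and any lower predicted speed would trigger the delay calculation.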

6.2. Proposed System Components

The proposed architecture consists of two main components and their software modules. The ultimate objective of this system is to provide a high level of safety and increase the efficiency of the transportation system. The following points summarize the system components:

Kappa Architecture: KA is a real-time streaming component that processes the real-life data obtained from a cluster (segment) head node. It is a single-pipeline machine learning process that consists of two main layers: the streaming layer and the serving layer.
Distributed machine learning model: The proposed model, built from several random forest models with different hyper-parameter values. The Stacking ensemble learning approach is used to build it.

7. Performance Evaluation and Results

In this section, we demonstrate the efficiency of our model’s performance on abnormal traffic flows.

7.1. Classification Models

We evaluated our models on the features most relevant to the target parameter, extracted using the Boruta feature selection method [61]. The RFR algorithm was adopted in this study because of its efficient performance and its feature selection capability; it achieved outstanding results in our previous studies [59,60]. After data preparation, the RFR models were trained on 18,812,799 traffic flow records for 16 detectors separately, where the target parameter is average speed and the input parameters are total volume, average occupancy, and average delay. The Stacking ensemble learning approach was adopted to implement the DMLRF model, where the RFR models, each with different hyper-parameter values, are used as inputs to train the final model. We also implemented baseline regression models to compare with the proposed model: Linear Regression (LR), Logistic Regression (LogR), and K Nearest Neighbor (KNN). All models were trained and validated on seven years of traffic data, from 2008 to 2014, and tested on one week from 2015, from 11 June 2015 to 17 June 2015. This week was selected because it contains more abnormal traffic flow cases.
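The stacking idea, in which the outputs of already trained segment models become the inputs of a final model, can be sketched without external libraries. This is a simplified stand-in rather than the DMLRF implementation: the base models are arbitrary callables, and the final model here is a linear combiner fitted by least squares, which is an assumption chosen to keep the example short. The names `fit_stacked_model` and `solve_linear_system` are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_stacked_model(base_models, features, targets):
    """Fit a linear meta-learner on the predictions of the base models."""
    # Meta-features: a constant 1 (intercept) plus one prediction per base model.
    z = [[1.0] + [model(x) for model in base_models] for x in features]
    k = len(z[0])
    # Normal equations (Z^T Z) w = Z^T y; assumes Z^T Z is non-singular.
    ztz = [[sum(row[i] * row[j] for row in z) for j in range(k)] for i in range(k)]
    zty = [sum(row[i] * y for row, y in zip(z, targets)) for i in range(k)]
    weights = solve_linear_system(ztz, zty)

    def predict(x):
        preds = [model(x) for model in base_models]
        return weights[0] + sum(w * p for w, p in zip(weights[1:], preds))

    return predict, weights
```

In practice a library implementation such as scikit-learn's stacking estimators, which the reference list points to, would replace this hand-written combiner.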

7.2. Performance Metrics

Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R Squared, Adjusted R Squared, and Mean Absolute Error (MAE) are utilized as evaluation metrics to compare different prediction models. All the evaluation metrics are explained in Equations (2)–(6).

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (X_{i} - Z_{i})^{2}

(2)

where n represents the number of data points, X_{i} represents the observed values, and Z_{i} represents the predicted values. MSE measures the average of the squared errors between the prediction results and the actual values.

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_{i} - Z_{i})^{2}}

(3)

where X_{i} are the observed values, Z_{i} are the predicted values, and n represents the number of observations, as in Equation (2). Because the errors are squared before averaging, RMSE penalizes large differences between the predicted and actual values more heavily, and it is expressed in the same units as the target variable.

\mathrm{MAE} = \frac{\sum_{i=1}^{n} | y_{i} - x_{i} |}{n}

(4)

where y_{i} represents the predicted value, x_{i} represents the true value, and n is the total number of data points. The MAE is the average absolute difference between the prediction results and the real data.

R^{2} = 1 - \frac{\sum_{i=1}^{n} (x_{i} - y_{i})^{2}}{\sum_{i=1}^{n} (x_{i} - \bar{x})^{2}}

(5)

R-squared (R^{2}) measures the proportion of the variance in the true values x_{i} that is explained by the predicted values y_{i}, where \bar{x} is the mean of the true values. For a model that fits at least as well as predicting the mean, the output ranges from 0 to 1, that is, from 0% to 100%.

\mathrm{Adjusted}\ R^{2} = 1 - \frac{(1 - R^{2})(N - 1)}{N - p - 1}

(6)

Adjusted R^{2} shows how a regression model is affected by uninformative variables: if the model includes a feature that is unrelated to the target, the adjusted value decreases. Here N represents the number of samples in the dataset, and p is the number of features.
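The five metrics can be computed together, as in the following minimal sketch. It uses the standard definition of R^{2} (one minus the residual sum of squares over the total sum of squares); the function name `regression_metrics` and the dictionary keys are hypothetical.

```python
import math


def regression_metrics(observed, predicted, num_features):
    """Return MSE, RMSE, MAE, R-squared, and adjusted R-squared."""
    n = len(observed)
    errors = [x - z for x, z in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((x - mean_obs) ** 2 for x in observed)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Requires n > num_features + 1, otherwise the denominator is not positive.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - num_features - 1)
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}


# Small worked example with four observations and one feature.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], num_features=1)
```

For this toy input the errors are 1, 0, -1, and 0, so the MSE and MAE are both 0.5, R^{2} is 0.9, and the adjusted R^{2} is 0.85.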

7.3. Results and Discussion

The proposed distributed machine learning system achieved notable results in traffic flow prediction, especially for abnormal traffic flow. In this study, we evaluated four complex segments (CSs); each CS consists of four segments and is located between two exits. Each segment covers half a mile, so each CS spans 2 miles. Based on Equation (7), the normal travel time over half a mile is about 28 s for a vehicle traveling at 65 mph, so the normal travel time for one CS is about 112 s, or almost 2 min.

\mathrm{NormalTravelTime} = \frac{D_{c}}{S} \times 3600

(7)

where D_{c} is the length of a segment, which is 0.5 mile, S represents the speed limit, which is 65 mph, and 3600 is the number of seconds in one hour.
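Equation (7) and the resulting delay can be sketched as follows. The delay definition used here, travel time at the predicted speed minus the normal travel time, is inferred from the way the text derives a total delay from a total travel time; the function names and the treatment of speeds at or above the limit are assumptions.

```python
def travel_time_seconds(length_miles, speed_mph):
    """Travel time in seconds over a segment; speed_mph must be positive."""
    return length_miles / speed_mph * 3600.0


def complex_segment_delay(predicted_speeds_mph, segment_length_miles=0.5,
                          speed_limit_mph=65.0):
    """Total delay in seconds over the segments of one complex segment."""
    normal = travel_time_seconds(segment_length_miles, speed_limit_mph)
    total = 0.0
    for speed in predicted_speeds_mph:
        actual = travel_time_seconds(segment_length_miles, speed)
        # Speeds at or above the limit contribute no delay in this sketch.
        total += max(0.0, actual - normal)
    return total
```

Calling `travel_time_seconds(0.5, 65.0)` gives roughly 27.7 s, which rounds to the 28 s quoted for one half-mile segment.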

The effective performance of the DMLRF model in predicting both high and low values can be attributed to robust data preprocessing, segment-based modeling, optimized hyper-parameters using Randomized Search Cross-Validation (RSCV), and the use of Stacking ensemble learning. These factors collectively contribute to the model’s ability to accurately capture the unique traffic patterns across different segments.

7.3.1. Complex Segment 1

The proposed model achieved its best results on Segment C of CS1, where R^{2} and adjusted R^{2} (AR^{2}) exceeded 80% for highway traffic flow prediction. We observed that traffic flow patterns in this complex segment become more abnormal for the segments that are close to the exit. Figure 6 shows the DMLRF predicted delay time compared with the actual delay time for all segments of CS1. The results for Segment C of CS1 were compared with those of the baseline regression models in terms of predicting one week of traffic flow, as shown in Table 3, which covers all complex segments. The proposed model outperforms the other models in all the segments. It achieved 85% in terms of R^{2} and AR^{2} for Segment C, which is the highest accuracy, and its lowest MSE, RMSE, and MAE were obtained for Segment B. Figure 7 compares the DMLRF predicted average speed with the actual average speed for all segments of CS1. Figure 8 compares the DMLRF predictions for one week of traffic flow with the actual SRI values for all segments of CS1. If the SRI of the predicted average speed is 4 (the threshold value) for each detector of CS1, the total travel time will be around 4 min, so the total delay is 2 min. However, if the SRI reaches 9 to 10, the total travel time will range from 24 min to 2 h, so an early warning message should be sent to all approaching vehicles before they enter this area. For this complex segment, the highest SRI value was 4 and occurred in Segment C. Segment C is located at the beginning of a curved road and is close to an exit, which increases the level of congestion in this area of the I-95 highway. The adjacent complex segments are analyzed and discussed next.

7.3.2. Complex Segment 2

To provide more realistic scenarios, we obtained the prediction results of the DMLRF model for several adjacent complex segments. This allows drivers to plan their trip in advance by identifying the CS with the highest delay time and exiting the highway before entering it. RMSE and MAE are low for all the algorithms, but the R^{2} and adjusted R^{2} for Segments A and B are below 60%. However, the traffic patterns for these two segments are mostly normal. The DMLRF model is most accurate when traffic flow patterns are abnormal, and it still achieves good results under normal conditions. Figure 9 compares the DMLRF predictions for one week of traffic flow with the actual SRI values of CS2. The actual SRI values increased when the predicted average speed values decreased. Table 3 compares the performance of the proposed model for Segment D of CS2 with that of the baseline regression models. Overall, the proposed model outperformed the other models in all evaluation metrics and for all segments. In Segment D, the DMLRF model achieved its highest accuracy for this complex segment, 76%. The performance of the LogR and KNN models was very low compared with the DMLRF model for all segments. LR performed better than LogR and KNN in all evaluation metrics for most of the segments.

7.3.3. Complex Segment 3

Traffic flow patterns in this complex segment were mostly normal for all the segments. The DMLRF performed better than the baseline models in terms of all evaluation metrics for all segments, as shown in Table 3. However, the performance of the proposed model in this complex segment was lower than in the previous complex segments in terms of R^{2} and AR^{2}. The R^{2} and AR^{2} results were 59% and 60%, respectively. The MAE, MSE, and RMSE results were nevertheless very low, especially in Segment B. The predicted and actual average speed values were very close. We observed that there is no curved road in this complex segment and that the actual average speed was mostly over 65 mph.

Table 3. The Prediction Results of the DMLRF Model and Baseline Models for Segments A, B, C, and D in CS1, CS2, CS3, and CS4.

Complex Segment	Algorithm	Segment	MSE	RMSE	MAE	R²	AR²
Complex Segment 1	DMLRF	Segment A	7.9	2.8	2.1	61%	61%
Complex Segment 1	LR	Segment A	8.4	2.9	2.2	59%	58%
Complex Segment 1	LogR	Segment A	9.9	3.1	2.3	15%	14%
Complex Segment 1	KNN	Segment A	10.9	3.3	2.6	7%	7%
Complex Segment 1	DMLRF	Segment B	1.3	1.1	0.9	75%	75%
Complex Segment 1	LR	Segment B	4.3	2.0	1.7	18%	17%
Complex Segment 1	LogR	Segment B	6.5	2.5	1.4	22%	22%
Complex Segment 1	KNN	Segment B	1.5	1.2	1.0	71%	71%
Complex Segment 1	DMLRF	Segment C	3.5	1.8	1.5	85%	85%
Complex Segment 1	LR	Segment C	10.5	3.2	2.7	56%	55%
Complex Segment 1	LogR	Segment C	21.1	4.5	3.5	12%	8%
Complex Segment 1	KNN	Segment C	4.9	2.2	1.7	79%	79%
Complex Segment 1	DMLRF	Segment D	5.0	2.2	1.8	71%	71%
Complex Segment 1	LR	Segment D	7.9	2.8	2.2	54%	54%
Complex Segment 1	LogR	Segment D	9.2	3.0	2.2	46%	13%
Complex Segment 1	KNN	Segment D	10.5	2.6	3.2	38%	38%
Complex Segment 2	DMLRF	Segment A	7.0	2.6	2.0	52%	52%
Complex Segment 2	LR	Segment A	11.4	3.3	2.4	22%	23%
Complex Segment 2	LogR	Segment A	15.1	3.8	2.5	11%	11%
Complex Segment 2	KNN	Segment A	10.3	3.2	2.4	30%	30%
Complex Segment 2	DMLRF	Segment B	5.2	2.2	1.7	52%	52%
Complex Segment 2	LR	Segment B	6.7	2.6	1.9	38%	38%
Complex Segment 2	LogR	Segment B	7.7	2.7	2.0	30%	17%
Complex Segment 2	KNN	Segment B	8.4	2.9	2.2	23%	23%
Complex Segment 2	DMLRF	Segment C	5.6	2.3	1.8	69%	68%
Complex Segment 2	LR	Segment C	7.6	2.7	2.1	57%	57%
Complex Segment 2	LogR	Segment C	8.4	2.9	2.1	52%	16%
Complex Segment 2	KNN	Segment C	11.6	3.4	2.5	35%	35%
Complex Segment 2	DMLRF	Segment D	8.6	2.9	2.3	76%	76%
Complex Segment 2	LR	Segment D	18.4	4.2	3.2	49%	49%
Complex Segment 2	LogR	Segment D	26.3	5.1	3.7	28%	9%
Complex Segment 2	KNN	Segment D	33.0	5.7	4.5	9%	8%
Complex Segment 3	DMLRF	Segment A	5.9	2.4	1.8	35%	35%
Complex Segment 3	LR	Segment A	7.3	2.7	2.0	20%	19%
Complex Segment 3	LogR	Segment A	7.1	2.6	1.9	21%	17%
Complex Segment 3	KNN	Segment A	13.2	3.6	2.8	13%	13%
Complex Segment 3	DMLRF	Segment B	6.4	2.5	1.9	59%	60%
Complex Segment 3	LR	Segment B	11.4	3.3	2.5	30%	30%
Complex Segment 3	LogR	Segment B	15.7	3.9	2.8	13%	13%
Complex Segment 3	KNN	Segment B	7.7	2.7	2.1	50%	50%
Complex Segment 3	DMLRF	Segment C	8.6	2.9	2.2	50%	50%
Complex Segment 3	LR	Segment C	16.4	4.0	3.1	5%	5%
Complex Segment 3	LogR	Segment C	7.7	2.7	2.0	55%	17%
Complex Segment 3	KNN	Segment C	19.3	4.4	3.2	12%	12%
Complex Segment 3	DMLRF	Segment D	6.8	2.6	1.9	58%	58%
Complex Segment 3	LR	Segment D	11.3	3.3	2.4	31%	30%
Complex Segment 3	LogR	Segment D	14.2	3.7	2.6	13%	13%
Complex Segment 3	KNN	Segment D	10.2	3.1	2.3	37%	37%
Complex Segment 4	DMLRF	Segment A	9.7	3.1	2.3	44%	43%
Complex Segment 4	LR	Segment A	13.9	3.7	2.7	20%	19%
Complex Segment 4	LogR	Segment A	16.4	4.0	2.8	14%	14%
Complex Segment 4	KNN	Segment A	11.9	3.4	2.6	31%	31%
Complex Segment 4	DMLRF	Segment B	5.4	2.3	1.9	80%	80%
Complex Segment 4	LR	Segment B	7.2	2.6	2.0	73%	72%
Complex Segment 4	LogR	Segment B	8.1	2.8	2.1	69%	20%
Complex Segment 4	KNN	Segment B	13.8	3.7	2.8	48%	48%
Complex Segment 4	DMLRF	Segment C	10.7	3.2	2.5	70%	70%
Complex Segment 4	LR	Segment C	12.3	3.5	2.8	65%	65%
Complex Segment 4	LogR	Segment C	12.2	3.5	2.7	65%	10%
Complex Segment 4	KNN	Segment C	20.0	4.4	3.3	43%	43%
Complex Segment 4	DMLRF	Segment D	6.8	2.6	2.0	81%	81%
Complex Segment 4	LR	Segment D	9.7	3.1	2.5	72%	72%
Complex Segment 4	LogR	Segment D	12.6	3.5	2.8	64%	10%
Complex Segment 4	KNN	Segment D	13.7	3.7	2.8	61%	61%

7.3.4. Complex Segment 4

In CS4, the abnormal traffic flow patterns occurred in Segments C and D, which are located close to an exit, similar to the traffic flow patterns in CS1. This leads to the observation that abnormal traffic flow on the I-95 highway usually occurs on curved roads that have an exit within the next half mile. The proposed model achieved 80% in terms of R^{2} and AR^{2} and outperformed all the baseline models in all the segments.

8. Conclusions

Real-time highway traffic flow prediction is considered one of the highest priority applications for intelligent transportation systems, since abnormal traffic flow can have severe impacts on people's time, money, and lives. Providing traffic flow information in real time helps drivers reach their destinations faster and more safely. However, traffic data are very challenging to process and evaluate in real time because of their huge size. In this study, we designed and evaluated a scalable distributed machine learning-based scheme to predict highway traffic flow in (near-) real-time. The system maintains security and privacy since the data are not shared with a centralized node. Stacking ensemble learning is used to implement the DMLRF model, which takes the RFR models with different hyper-parameter values as inputs. KA and Apache Spark are used to process the data and provide the prediction results in real time. A dataset obtained from FDOT-D4 is used to build all the models in this research to predict highway traffic flows. The proposed model was evaluated and compared with the baseline models LR, LogR, and KNN in terms of RMSE, MSE, MAE, R^{2}, and adjusted R^{2}. The proposed scheme demonstrated high accuracy in predicting abnormal traffic flows and outperformed the baseline models while maintaining scalability and data privacy. For future work, we will design and evaluate a real-time accident prediction scheme using distributed machine learning.

Author Contributions

Conceptualization, H.A., I.M. and H.A.-N.; methodology, H.A., I.M. and H.A.-N.; software, H.A.; validation, H.A., I.M. and H.A.-N.; formal analysis, H.A., I.M. and H.A.-N.; investigation, H.A., I.M. and H.A.-N.; resources, H.A. and I.M.; writing—original draft preparation, H.A.; writing—review and editing, H.A., I.M., H.A.-N. and E.A.; visualization, H.A. and E.A.; supervision, I.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data available upon request.

Acknowledgments

This work is part of the Smart Drive initiative at Tecore Networks Lab at Florida Atlantic University.

Conflicts of Interest

The authors declare that they have no conflict of interest.

References

1. Shi, Q.; Abdel-Aty, M. Big data applications in real-time traffic operation and safety monitoring and improvement on urban expressways. Transp. Res. Part C Emerg. Technol. 2015, 58, 380–394.
2. Abdullah, T.; Nyalugwe, S. A Data Mining Approach for Analysing Road Traffic Accidents. In Proceedings of the 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 1–3 May 2019; pp. 1–6.
3. Micko, K.; Papcun, P.; Zolotova, I. Review of IoT sensor systems used for monitoring the road infrastructure. Sensors 2023, 23, 4469.
4. Ozbayoglu, M.; Kucukayan, G.; Dogdu, E. A real-time autonomous highway accident detection model based on big data processing and computational intelligence. In Proceedings of the 2016 IEEE International Conference on Big Data (Big Data), Washington, DC, USA, 5–8 December 2016; pp. 1807–1813.
5. Dong, C.; Shao, C.; Li, J.; Xiong, Z. An Improved Deep Learning Model for Traffic Crash Prediction. J. Adv. Transp. 2018, 2018, 1–13.
6. Ifthikar, A.; Hettiarachchi, S. Analysis of historical accident data to determine accident prone locations and cause of accidents. In Proceedings of the 2018 8th International Conference on Intelligent Systems, Modelling and Simulation (ISMS), Kuala Lumpur, Malaysia, 8–10 May 2018; pp. 11–15.
7. Kumeda, B.; Zhang, F.; Zhou, F.; Hussain, S.; Almasri, A.; Assefa, M. Classification of Road Traffic Accident Data Using Machine Learning Algorithms. In Proceedings of the 2019 IEEE 11th International Conference on Communication Software and Networks (ICCSN), Chongqing, China, 12–15 June 2019; pp. 682–687.
8. Huang, W.; Song, G.; Hong, H.; Xie, K. Deep Architecture for Traffic Flow Prediction: Deep Belief Networks With Multitask Learning. IEEE Trans. Intell. Transp. Syst. 2014, 15, 2191–2201.
9. Chen, Y.; Chen, H.; Ye, P.; Lv, Y.; Wang, F.Y. Acting as a Decision Maker: Traffic-Condition-Aware Ensemble Learning for Traffic Flow Prediction. IEEE Trans. Intell. Transp. Syst. 2022, 23, 3190–3200.
10. Jiang, Y.; Fan, J.; Liu, Y.; Zhang, X. Deep Graph Gaussian Processes for Short-Term Traffic Flow Forecasting From Spatiotemporal Data. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20177–20186.
11. Oladimeji, D.; Gupta, K.; Kose, N.A.; Gundogan, K.; Ge, L.; Liang, F. Smart transportation: An overview of technologies and applications. Sensors 2023, 23, 3880.
12. Al Najada, H.; Mahgoub, I. Anticipation and alert system of congestion and accidents in VANET using Big Data analysis for Intelligent Transportation Systems. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; pp. 1–8.
13. Al Najada, H.; Mahgoub, I.; Mohammed, I. Highway cluster density and average speed prediction in vehicular ad hoc networks (VANETs). In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 96–103.
14. Bello-Orgaz, G.; Jung, J.J.; Camacho, D. Social big data: Recent achievements and new challenges. Inf. Fusion 2016, 28, 45–59.
15. Chen, M.; Mao, S.; Liu, Y. Big data: A survey. Mob. Netw. Appl. 2014, 19, 171–209.
16. An, S.H.; Lee, B.H.; Shin, D.R. A survey of intelligent transportation systems. In Proceedings of the 2011 Third International Conference on Computational Intelligence, Communication Systems and Networks, Bali, Indonesia, 26–28 July 2011; pp. 332–337.
17. El Faouzi, N.E.; Leung, H.; Kurian, A. Data fusion in intelligent transportation systems: Progress and challenges–A survey. Inf. Fusion 2011, 12, 4–10.
18. Zhang, J.; Wang, F.Y.; Wang, K.; Lin, W.H.; Xu, X.; Chen, C. Data-driven intelligent transportation systems: A survey. IEEE Trans. Intell. Transp. Syst. 2011, 12, 1624–1639.
19. Lin, X.; Wang, P.; Wu, B. Log analysis in cloud computing environment with Hadoop and Spark. In Proceedings of the 2013 5th IEEE International Conference on Broadband Network & Multimedia Technology, Guilin, China, 17–19 November 2013; pp. 273–276.
20. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; Mccauley, M.; Franklin, M.; Shenker, S.; Stoica, I. Fast and interactive analytics over Hadoop data with Spark. Usenix Login 2012, 37, 45–51.
21. Shahriari, S.; Ghasri, M.; Sisson, S.; Rashidi, T. Ensemble of ARIMA: Combining parametric and bootstrapping technique for traffic flow prediction. Transp. A Transp. Sci. 2020, 16, 1552–1573.
22. Sun, P.; Aljeri, N.; Boukerche, A. Machine learning-based models for real-time traffic flow prediction in vehicular networks. IEEE Netw. 2020, 34, 178–185.
23. Hou, Q.; Leng, J.; Ma, G.; Liu, W.; Cheng, Y. An adaptive hybrid model for short-term urban traffic flow prediction. Phys. A Stat. Mech. Its Appl. 2019, 527, 121065.
24. Zhao, C.; Chen, C.; Cai, Z.; Shi, M.; Du, X.; Guizani, M. Classification of small UAVs based on auxiliary classifier wasserstein GANs. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 9–13 December 2018; pp. 206–212.
25. Ahn, J.; Ko, E.; Kim, E.Y. Highway traffic flow prediction using support vector regression and Bayesian classifier. In Proceedings of the 2016 International Conference on Big Data and Smart Computing (BigComp), Hong Kong, China, 18–20 January 2016; pp. 239–244.
26. Lv, Y.; Duan, Y.; Kang, W.; Li, Z.; Wang, F.Y. Traffic flow prediction with big data: A deep learning approach. IEEE Trans. Intell. Transp. Syst. 2014, 16, 865–873.
27. Xu, J.; Deng, D.; Demiryurek, U.; Shahabi, C.; Van der Schaar, M. Mining the situation: Spatiotemporal traffic prediction with big data. IEEE J. Sel. Top. Signal Process. 2015, 9, 702–715.
28. Jiber, M.; Lamouik, I.; Ali, Y.; Sabri, M.A. Traffic flow prediction using neural network. In Proceedings of the 2018 International Conference on Intelligent Systems and Computer Vision (ISCV), Fez, Morocco, 2–4 April 2018; pp. 1–4.
29. Lin, J. The lambda and the kappa. IEEE Internet Comput. 2017, 21, 60–66.
30. Bisong, E. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: New York, NY, USA, 2019.
31. Florida Department of Transportation. S.S.F.D. of Transportation District 4 (FDOT-D4) US Department of Energy, Florida Department of Transportation - District 4 (FDOT-D4). SSFDOT. 2015. Available online: https://www.d4fdot.com/ (accessed on 23 January 2025).
32. Tan, K.; Bremner, D.; Le Kernec, J.; Zhang, L.; Imran, M. Machine learning in vehicular networking: An overview. Digit. Commun. Netw. 2022, 8, 18–24.
33. Olugbade, S.; Ojo, S.; Imoize, A.L.; Isabona, J.; Alaba, M.O. A review of artificial intelligence and machine learning for incident detectors in road transport systems. Math. Comput. Appl. 2022, 27, 77.
34. Abdullah, S.M.; Periyasamy, M.; Kamaludeen, N.A.; Towfek, S.; Marappan, R.; Kidambi Raju, S.; Alharbi, A.H.; Khafaga, D.S. Optimizing traffic flow in smart cities: Soft GRU-based recurrent neural networks for enhanced congestion prediction using deep learning. Sustainability 2023, 15, 5949.
35. Zhang, Y.; Zhou, Y.; Lu, H.; Fujita, H. Traffic network flow prediction using parallel training for deep convolutional neural networks on spark cloud. IEEE Trans. Ind. Inform. 2020, 16, 7369–7380.
36. Lee, M.C.; Lin, J.C. DALC: Distributed automatic LSTM customization for fine-grained traffic speed prediction. In Proceedings of the International Conference on Advanced Information Networking and Applications; Springer: Berlin/Heidelberg, Germany, 2020; pp. 164–175.
37. Boukerche, A.; Wang, J. A performance modeling and analysis of a novel vehicular traffic flow prediction system using a hybrid machine learning-based model. Ad Hoc Netw. 2020, 106, 102224.
38. Zheng, G.; Chai, W.K.; Duanmu, J.L.; Katos, V. Hybrid deep learning models for traffic prediction in large-scale road networks. Inf. Fusion 2023, 92, 93–114.
39. Alsubai, S.; Dutta, A.K.; Sait, A.R.W. Hybrid deep learning-based traffic congestion control in IoT environment using enhanced arithmetic optimization technique. Alex. Eng. J. 2024, 105, 331–340.
40. Ning, Y.; Samonte, M.J.C.; Li, Y. A Review of Research on Traffic Flow Prediction Methods Based on Deep Learning. In Proceedings of the 2024 International Conference on Digital Society and Artificial Intelligence, Qingdao, China, 24–26 May 2024; pp. 166–170.
41. Fouladgar, M.; Parchami, M.; Elmasri, R.; Ghaderi, A. Scalable deep traffic flow neural networks for urban traffic congestion prediction. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2251–2258.
42. Wang, Z.; Chu, R.; Zhang, M.; Wang, X.; Luan, S. An improved hybrid highway traffic flow prediction model based on machine learning. Sustainability 2020, 12, 8298.
43. Warren, J.; Marz, N. Big Data: Principles and Best Practices of Scalable Realtime Data Systems; Simon and Schuster: New York, NY, USA, 2015.
44. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65.
45. Ji, G.; Ling, X. Ensemble learning based distributed clustering. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining; Springer: Berlin/Heidelberg, Germany, 2007; pp. 312–321.
46. Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258.
47. Fernández-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 2014, 15, 3133–3181.
48. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
49. Muntasir Nishat, M.; Faisal, F.; Jahan Ratul, I.; Al-Monsur, A.; Ar-Rafi, A.M.; Nasrullah, S.M.; Reza, M.T.; Khan, M.R.H. A Comprehensive Investigation of the Performances of Different Machine Learning Classifiers with SMOTE-ENN Oversampling Technique and Hyperparameter Optimization for Imbalanced Heart Failure Dataset. Sci. Program. 2022, 2022, 3649406.
50. Asif, M.; Nishat, M.M.; Faisal, F.; Dip, R.R.; Udoy, M.H.; Shikder, M.; Ahsan, R. Performance Evaluation and Comparative Analysis of Different Machine Learning Algorithms in Predicting Cardiovascular Disease. Eng. Lett. 2021, 29, 1–11.
51. Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
52. Lana, I.; Del Ser, J.; Velez, M.; Vlahogianni, E.I. Road traffic forecasting: Recent advances and new challenges. IEEE Intell. Transp. Syst. Mag. 2018, 10, 93–109.
53. Cheng, Z.; Pang, M.S.; Pavlou, P.A. Mitigating traffic congestion: The role of intelligent transportation systems. Inf. Syst. Res. 2020, 31, 653–674.
54. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Routledge: London, UK, 2017.
55. Combine Predictors Using Stacking. Available online: https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html (accessed on 23 January 2025).
56. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
57. Mqadi, N.M.; Naicker, N.; Adeliyi, T. Solving Misclassification of the Credit Card Imbalance Problem Using Near Miss. Math. Probl. Eng. 2021, 2021, 7194728.
58. Cano, A.; Krawczyk, B. ROSE: Robust Online Self-Adjusting Ensemble for Continual Learning on Imbalanced Drifting Data Streams. Mach. Learn. 2022, 111, 2561–2599.
59. Alnami, H.M.; Mahgoub, I.; Al-Najada, H. Highway Accident Severity Prediction for Optimal Resource Allocation of Emergency Vehicles and Personnel. In Proceedings of the 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 27–30 January 2021; pp. 1231–1238.
60. Alnami, H.M.; Mahgoub, I.; Al Najada, H. Segment Based Highway Traffic Flow Prediction in VANET Using Big Data Analysis. In Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–8.
61. Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13.

Figure 1. Kappa Architecture.

Figure 2. Apache Spark Architecture.

Figure 3. A highway distributed machine learning architecture.

Figure 4. Structure overview of the DMLRF model implementation using Stacking ensemble.

Figure 5. Proposed System Workflow.

Figure 6. DMLRF predicted delay compared to actual delay for Segments A, B, C, and D of complex Segment 1 (tested on one week with more abnormal traffic flow cases).

Figure 7. DMLRF Predicted average speed compared to actual average speed for Segments A, B, C, and D of complex Segment 1 (tested on one week with more abnormal traffic flow cases).

Figure 8. DMLRF predicted average speed compared to actual SRI values for Segments A, B, C, and D of complex Segment 1 (tested on one week with more abnormal traffic flow cases).

Figure 9. DMLRF predicted average speed compared to actual SRI values for Segments A, B, C, and D of complex Segment 2 (tested on one week with more abnormal traffic flow cases).

Table 1. Important Parameters Of Traffic Data.

#	Parameters	Description
1	Detector ID	The ID number of the detector
2	Total Volume	The total number of vehicles detected within a 15-min interval that was recorded by the available detector.
3	Average Speed	The average speed of vehicles that are calculated based on a 15-min interval for all the lanes.
4	Archive Lane ID	ID of each lane on each bound of the highway. Each lane on each segment has a different lane id.
5	Average Occupancy	Average number of vehicles that occupy the segment.
6	Detector Lat and Long	Latitude and Longitude of the detector.
7	Traffic Timestamp	The time-stamp of the traffic record (Year-Month-Day, Hour:Minute:Second).
8	Average Latency	The average latency recorded by a detector.

Table 2. Random Forest hyper-parameters description.

Hyper-Parameters	Description
N estimators	Number of trees in the forest
Max features	Max number of features considered for splitting a node
Max depth	Max number of levels in each decision tree
Min samples split	Min number of data points placed in a node before the node is split
Min samples leaf	Min number of data points allowed in a leaf node
Bootstrap	Method for sampling data points (with or without replacement)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
