An intelligent intrusion detection system for 5G-enabled internet of vehicles

Abstract: The deployment of 5G technology has drawn attention to different computer-based scenarios. It is useful in the context of Smart Cities, the Internet of Things (IoT), and Edge Computing, among other systems. With the high number of connected vehicles, providing network security solutions for the Internet of Vehicles (IoV) is not a trivial process due to its decentralized management structure and heterogeneous characteristics (e.g., connection time and high-frequency changes in network topology due to high mobility, among others). Machine learning (ML) algorithms have the potential to extract patterns to better cover security requirements and to detect/classify malicious behavior in a network. Based on this, in this work we propose an Intrusion Detection System (IDS) for detecting Flooding attacks in vehicular scenarios. We also simulate 5G-enabled vehicular scenarios using the Network Simulator 3 (NS-3). We generate four datasets considering different numbers of nodes, attackers, and mobility patterns extracted from Simulation of Urban MObility (SUMO). Furthermore, our tests show that the proposed IDS achieved F1 scores of 1.00 and 0.98 using decision trees and random forests, respectively. This means that it was able to properly classify the Flooding attack in the 5G vehicular environment considered.


Introduction
The efforts of industry and academia to find solutions to urban traffic problems (such as traffic congestion, cargo theft, and optimized public transportation, among others) have allowed vehicles to become more than just transportation machines [1]. Modern vehicles are equipped with novel communication devices (e.g., wireless antenna and cellular technology) that make it possible to communicate with surrounding vehicles, send and receive messages, and access remote applications through an Internet connection. Regarding cellular technology, vehicles commonly use 4G/LTE (Long Term Evolution) or 5G (fifth-generation mobile network). Some studies have already considered the forthcoming 6G [2] in the vehicular context.
The vehicular network, initially referred to as a Vehicular Ad hoc Network (VANET), has evolved into the Internet of Vehicles (IoV) [3] as a result of its integration with other technologies, namely the Internet of Things (IoT) [4]. Furthermore, IoV has different types of communication: Vehicle to Vehicle (V2V), Vehicle to Infrastructure (V2I), Vehicle to Sensor (V2S), Vehicle to Roadside Unit (V2R), Vehicle to Pedestrian (V2P), and Vehicle to Everything (V2X) [5]. These communication types have drawn attention to security requirements, since each can require different layers of security mechanisms. Figure 1 shows an overview of the different kinds of vehicular communications, where other technologies can also be integrated, namely cloud and edge computing [6].
Although vehicles have fewer computing resources (e.g., processing power or storage) than traditional computers, they have caught the attention of malicious users who can adapt the methodology of computer-based attacks (e.g., Denial-of-Service (DoS) [7], Sybil [8], Jamming [9], Fuzzy [10], Spoofing [11], and Eavesdropping [12], among others) to vehicular networks. In addition, vehicular networks have the potential to generate valuable data about their users (e.g., vehicles' routes, Global Positioning System (GPS) coordinates, vehicles' identities, or most visited places) that can also be valuable to malicious users. On the other hand, providing/adapting network security tools for IoV is not a trivial task due to characteristics that cannot be ignored, such as rapid network topology change, nodes with high mobility, and short connection duration.
Machine Learning (ML) algorithms have been explored to maximize the potential of identifying malicious users and network security breaches. However, in the IoV context, the use of ML-based solutions faces a crucial challenge: finding publicly available network datasets. In an ideal model-building process, the ML model would be trained with data extracted from a real vehicular testbed. However, generating datasets with real data is challenging because public sources are not widely available. Nonetheless, there are well-known, publicly available vehicular network simulators that can be used to create private and public datasets, such as Network Simulator 3 (NS-3) [13] and Veins [14].
Network security tools use different strategies for identifying malicious activities, where ML algorithms can help them to increase the detection rate. An Intrusion Detection System (IDS) is a good example of a security tool that merges its functionalities with ML algorithms. For this purpose, the right choice of vehicular network datasets represents an important step in correctly labelling malicious behaviors in such a vehicular scenario.
Availability is one of the three pillars of data and information security; the other two are confidentiality and integrity. As aforementioned, some attacks can cause network disruption, especially the Flooding attack. By performing this attack, malicious vehicles can stop legitimate messages from reaching their destination on the network. Furthermore, this attack can also lengthen the time for receiving useful messages, such as those sent by vehicular safety applications requiring low latency. Based on this problem, in our work we developed an IDS that uses ML algorithms to detect the Flooding attack in 5G-enabled vehicular networks. Our contributions are the following:

• We propose four new labelled datasets of 5G-enabled vehicular networks with 16 features, which have Flooding attack characteristics.

• We build a decision tree model that outperforms, in terms of accuracy, precision, recall, and F1, some works that use more complex ML algorithms.
The remainder of the work is organized as follows. Section 2 presents the background and related work. In Section 3, we present the vehicular scenarios used to generate our datasets. Section 4 presents our experimental setup and Section 5 reports the obtained results. Finally, Section 6 concludes the paper.

Background and Related Work
Conducting cyber-attacks on vehicular networks can compromise the entire communication structure between vehicles, by interrupting vehicles from receiving safety messages or by consuming network resources such as bandwidth, hence putting human lives at risk [15]. The lack of security mechanisms for vehicles can cause chaos in a city, where stopping 20% of the vehicles during heavy traffic would be enough for this disaster to occur [16]. Different studies have been conducted by the scientific community bearing in mind the seriousness of this threat [17,18]. The dynamic nature of these networks presents characteristics that cannot be ignored, such as high mobility, the number of vehicles in a given area, and connection time [19].
Attacks on in-vehicle communications, such as espionage, injection, bus-off, and DoS attacks, aim to cause Engine Control Unit (ECU) malfunctions [20]. The ECU provides different services for passengers, such as entertainment, system information, and import of multimedia content. For instance, an espionage attack occurs when an attacker can access the vehicle's messages, identifying patterns in the legitimate messages exchanged over the Controller Area Network (CAN). Since CAN messages are not authenticated, the injection attack enables attackers to access the vehicle through the On-Board Diagnostic II (OBD-II) port, ECU ports, or entertainment services, allowing the injection of malicious messages into the network or devices. The bus-off attack aims to turn off the ECU by continuously sending bits that cause the ECU error counter to increase.
There are three types of IDS: network-based (Network-based Intrusion Detection System, NIDS), host-based (Host-based Intrusion Detection System, HIDS), and hybrid [21,22]. NIDS aims to monitor the network on which the devices are connected. HIDS seeks to detect anomalies that may occur in the device in which the IDS was configured. Moreover, the hybrid approach combines the characteristics of the other two. However, an IDS that applies ML techniques uses datasets generated with data from real or simulated networks to train the anomaly classifier [23].
Although vehicles can use different communication technologies to share information (e.g., Wi-Fi), they mainly use the IEEE 802.11p communication standard. However, as vehicular applications become more robust, there is a need for new technologies that enable low delay and high throughput, such as 5G technology. As highlighted in [24], applying 5G technology in vehicular scenarios can expand the integration of systems that use 3G, 4G, Wi-Fi, ZigBee, and Bluetooth. In addition, vehicular safety applications demand messages with low latency. For example, a collision avoidance application can avoid an accident by receiving timely messages before the driver reacts to the behavior of an adversary vehicle.
Deep Learning (DL) has revolutionized how ML optimizes information processing, enabling it to be used in different areas of knowledge. Tangade et al. [25] applied DL in vehicular networks, highlighting the possibility of increasing reliability, reducing latency, and detecting security problems.
The particularities of an inter-vehicle network can directly affect the accuracy of building an ML model. For example, each vehicular environment has its own heterogeneous characteristics (e.g., number of nodes, network topology, and available resources) that can influence how the ML model will react to the behavior of the entire network.
Seeking to provide public datasets, Gonçalves et al. [26] generated different datasets for IoV, where they performed DoS and Fabrication attacks (i.e., false acceleration, speed, and direction data). Aiming to validate the generated datasets, they proposed a hierarchical IDS that uses ML algorithms [27] to identify malicious behaviors in the network. Each generated dataset has a total of 18 columns/features, including the attack class label [26].
In the context of Smart cities and electric vehicles, Aloqaily et al. [28] proposed the identification of Probing, User to Root (U2R), Remote to User (R2U), and DoS attacks in Connected Vehicular Network (CVN) using an IDS. The strategy used consisted of grouping vehicles into clusters [29], for which the algorithm selects a cluster head (CH) that is responsible for communicating with the trusted third parties (TTP) that are not available in the cluster. They use deep belief network (DBN) and decision tree (DT) algorithms for identifying and classifying anomalies. In the proposed IDS, the authors use a hybrid dataset (network data from NS-3 and NSL-KDD dataset) as input. For the classification of anomalous or normal behavior, the network data packets are processed by the DBN algorithm, which aims to reduce unnecessary network data packets. Finally, the DT algorithm classifies network packets into anomalies or legitimate packets. Additionally, it is pointed out that the NS-3 network data are only used to add normal traffic to the dataset. Apart from the work done, both datasets have the same format. It is important to highlight that NSL-KDD does not use vehicular network data. As already mentioned, vehicular networks have their own characteristics that should not be ignored. Finally, it is not mentioned which features the hybrid dataset has and how important each feature is after DT classification.
Privacy issues in vehicular networks should be addressed at different levels of the vehicular network architecture, since the attacker can harm users in different ways, such as spreading false information, receiving and collecting/processing unauthorized data, and so forth. For example, the Sybil attack can create different identities, and each identity can simulate a vehicle on the road. For example, a legitimate vehicle may not receive an important message about the road conditions in this case. Liang et al. [30] proposed an IDS for identifying False Information and Sybil attacks. The proposed tool was used in two scenarios for data collection (conducting training) and testing. The first scenario did not contain anomalies, and the second one did, to perform the training of the anomaly detection algorithm. The detection algorithm used is called growing hierarchical self-organizing map (GHSOM), which is a neural network.
Garip et al. [31] presented the first adaptive botnet detection mechanism, called SHIELDNET. For the proposed solution, they simulate different scenarios in the Veins tool, which includes the Simulation of Urban MObility (SUMO) and OMNeT simulators, and ML algorithms to identify botnets on the network.
Adhikary et al. [32] proposed a hybrid algorithm to detect distributed DoS (DDoS) attacks in VANETs, where their solution combines support vector machine (SVM) kernels, namely AnovaDot and RBFDot. Their simulation has a total of 5 RSUs and 1000 vehicles, where the vehicles are displaced every 100 to 500 ms. To evaluate their solution, they also generated a dataset with two classes, 0 (normal behavior) and 1 (victim or DDoS attacker). First, they evaluated the accuracy for each RSU considering only AnovaDot, RBFDot, and the hybrid algorithm. Second, they also considered the Gini coefficient, Kolmogorov-Smirnov (which measures the empirical distance between two sample datasets), Hand Measure (an alternative performance measure to the Area Under the Curve, AUC), and minimum error rate.
As a proposed solution to the black-hole attack in vehicles with the auto-driving system, Alheeti et al. [33] developed an IDS that uses neural networks and fuzzified data to identify and correct the problem. For the simulation of message exchange between different vehicles and between vehicles and RSUs, the NS2 simulator was used, which had as input the data generated by SUMO and MObilty VEhicles (MOVE) [34] simulators. A statistical approach was also used to extract relevant information in the tracing files generated by the NS2, called Proportional Overlapping Scores (POS).
Kosmanos et al. [35] developed an IDS to identify spoofing attacks in electric vehicles. In addition to using ML, they also employ Position Verification using Relative Speed (PVRS) to optimize the results obtained. An attacker performs some actions on the vehicle or network through the spoofing attack, such as data theft, sending false information, and sending false GPS information (i.e., GPS spoofing).
Polat et al. [36] proposed an IDS solution to detect DDoS attacks on Software-Defined Networking (SDN)-based VANETs, where SDN is a key enabler of 5G. To detect the DDoS attack, they used a stacked sparse autoencoder (SSAE) + Softmax classifier deep network model.
In addition, Otoum et al. [37] developed a transfer-learning-driven intrusion detection for IoV, where they used deep neural networks and Convolutional Neural Network (CNN) in two datasets, namely, CICIDS2017 and CSE-CIC-IDS2018. Their solution aims to classify DoS, DDoS, Botnet, Brute-force, Infiltration, Web Attacks, and Port Scan attacks. Table 1 summarizes the related work and our proposed IDS, where we emphasize that ours is the only approach that uses 5G technology. The related work described above lacks discussion on non-trivial issues in ML, such as data distribution and how the data are balanced among classes. These are important themes, since poorly distributed and/or unbalanced datasets can pose serious difficulties to proper model training and consequent performance. Furthermore, most of the related work also seems to completely disregard the usefulness of the simplest and most interpretable ML models, such as decision trees, and how proper parameter settings can improve the quality of the models when regarded through different metrics. As the reader will see in Sections 4 and 5, in our work we explore the parameterization of simple ML algorithms, combine different datasets in order to improve data distribution, and evaluate the results using different metrics that are robust to unbalanced data.

Simulated Scenarios
The proposed scenarios are simulated in a virtual machine running Ubuntu 20.04.5 LTS with an Intel(R) Core(TM) i5-8300H, four cores at 2.3 GHz, and 8 GB RAM. The simulation parameters are listed in Table 2. We use the NS-3 network simulator, which is open-source. We used the 5G-LENA module, i.e., a GPLv2 New Radio (NR) module [38], called nr, which also allows simulating 4G and 5G networks and V2X-based 5G communication. The simulator allows simulating some network actors, such as remote hosts that can connect to the Packet Gateway and Service Gateway through a link and send data to the gNodeB, and user equipment (i.e., vehicles). Additionally, the nr module is described as a "hard fork" of the millimeter-wave (mmWave) simulator [38], which enables simulating the physical (PHY) layer and medium access control (MAC), the mmWave channel [39], propagation, beamforming [40], and antenna models.
The simulations are designed as follows:
• All vehicles are equipped with 5G technology, where SUMO is used to generate mobility.

• There are two distinct groups of vehicles: senders and receivers.
As previously stated, vehicles are separated into two groups (i.e., senders and receivers), and we generated four maps:
• the first map has a total of 45 vehicles, where 10 are senders (from this total, two vehicles are attackers) and 35 are receivers;
• the second map also has 45 vehicles, where 10 are senders (from this total, four vehicles are attackers) and 35 are receivers;
• the third map has a total of 70 vehicles, where 15 are senders (from this total, seven are attackers) and 55 are receivers;
• finally, the fourth map has 100 vehicles, where 19 are senders (from this total, nine vehicles are attackers) and 81 are receivers.
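As a quick sanity check, the four map configurations above can be summarized programmatically using only the numbers stated in the text, confirming that senders and receivers add up to the total in each map:

```python
# Map configurations as stated in the text: total vehicles, senders,
# attackers (a subset of the senders), and receivers.
MAPS = {
    1: {"total": 45, "senders": 10, "attackers": 2, "receivers": 35},
    2: {"total": 45, "senders": 10, "attackers": 4, "receivers": 35},
    3: {"total": 70, "senders": 15, "attackers": 7, "receivers": 55},
    4: {"total": 100, "senders": 19, "attackers": 9, "receivers": 81},
}

for map_id, cfg in MAPS.items():
    # Every vehicle is either a sender or a receiver.
    assert cfg["senders"] + cfg["receivers"] == cfg["total"], map_id
    # Attackers are drawn from the sender group.
    assert cfg["attackers"] <= cfg["senders"], map_id
```
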
In addition, each simulation lasts a total of 230 s. However, to allow the vehicles more time to move and disperse, they only start exchanging packets at second 170.
Additionally, all datasets have the following features:
• timeSec: the simulation time at which a packet is sent or received. In our dataset, we consider only metrics of received packets;
• txRx: a tag indicating whether a packet was sent (tx) or received (rx);
• nodeId: the receiver node ID;
• imsi: the International Mobile Subscriber Identity, an identifier assigned with the SIM (Subscriber Identity Module) card;
• srcIp: the IP address of a sender node;
• dstIp: the IP address of a receiver node;
• packetSizeBytes: the packet size in bytes. Each sender node uses a different size to increase randomness;
• srcPort: the port from which the sender nodes send the packets;
• dstPort: the port on which the receiver nodes receive the packets;
• pktSeqNum: the sequence number of transmitted packets;
• delay: the difference between the reception time of a packet and its sending time;
• jitter: computed in the RFC 1889 [41] format;
• coord_x: the "x" coordinate on the map generated in SUMO;
• coord_y: the "y" coordinate on the map generated in SUMO;
• speed: the speed of the vehicle in meters per second;
• isAttack: the class label, benign (class 0) or malign (class 1) packet.
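For reference, the 16 dataset columns above can be captured in code. The small helper below is an illustrative sketch (not part of the published pipeline) that separates the class label from the input features of one row:

```python
# The 16 features present in every generated dataset (names as used in the text).
FEATURES = [
    "timeSec", "txRx", "nodeId", "imsi", "srcIp", "dstIp",
    "packetSizeBytes", "srcPort", "dstPort", "pktSeqNum",
    "delay", "jitter", "coord_x", "coord_y", "speed", "isAttack",
]

LABEL = "isAttack"  # class 0 = benign packet, class 1 = malign packet

def split_features_label(row: dict):
    """Separate the input features from the class label of one dataset row."""
    x = {k: v for k, v in row.items() if k != LABEL}
    y = row[LABEL]
    return x, y
```
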
Furthermore, seeking to generate heterogeneous data, we used SUMO, as it permits the modeling of intermodal traffic systems, to generate four different maps. Each map can have a different number of nodes and a different coverage area. The four maps simulate some regions of Lisbon, Portugal (see Figure 2). Table 3 shows the total number of rows and the class distribution of each dataset.

Experimental Setup
Our experiments are divided into two parts: first, we use one of the simplest learning methods available, i.e., decision trees, to explore different combinations of our datasets while obtaining preliminary baseline results; then, we explore other, more complex learning methods, namely random forests (an ensemble method) and the multilayer perceptron (a neural network). We use scikit-learn (version 1.1.1) [42] for all our experiments.
We name our four datasets according to the number of attackers simulated in each, specifically 2, 4, 7, and 9. In the first experiment, we train classifiers on each of them and then test these classifiers on the remaining three datasets. We measure the F1 score separately for each test set and also obtain the F1 score for a larger test set that joins the three sets. We prefer the F1 score to accuracy since some datasets are unbalanced (see Table 3). For choosing the decision tree depth, we perform a grid search with 10-fold cross-validation on the depths {2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55}. This means that the training set is split into ten parts, ten different models are trained for each tree depth, each using nine parts for training and one part for validation, and the best tree depth is then chosen based on the F1 scores obtained. From all the features included in the datasets (see Section 3), we use: nodeId, imsi, packetSizeBytes, dstPort, delay, jitter, coord_x, coord_y, and speed. The remaining ones were not used because they caused overfitting. In the second experiment, we use a mix of two datasets to train and then test on the remaining two, separately and joined. The third experiment uses a mix of three datasets for training and the remaining one for testing. In these experiments, the tree depth is not chosen with regular k-fold cross-validation, but rather with what scikit-learn calls GroupKFold cross-validation, in which each group is the set of samples from one dataset and samples from the same group cannot appear in both the training and validation parts. Table 4 presents the tested parameters for all algorithms.
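The depth-selection procedure described above can be sketched with scikit-learn. The data below is synthetic (our real datasets are described in Section 3), so only the search structure mirrors our setup and the selected depth is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of our datasets: 9 features, matching the number
# of features actually used (nodeId, imsi, packetSizeBytes, dstPort, delay,
# jitter, coord_x, coord_y, speed).
X, y = make_classification(n_samples=500, n_features=9, random_state=0)

# Grid of candidate tree depths, as listed in the text.
depths = [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]

# 10-fold cross-validation, selecting the depth with the best F1 score.
# For the mixed-dataset experiments, cv=10 would instead be a GroupKFold
# splitter, with groups identifying each sample's originating dataset.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths},
    scoring="f1",
    cv=10,
)
search.fit(X, y)
print(search.best_params_)  # best depth found on this synthetic data
```
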

Results
Following the experimental setup, first we report a complete set of results obtained with decision trees using different combinations of our datasets, and then we report the main results obtained using other learning methods. Finally, we compare our results to others that reported the same metrics in similar work. Table 5 reports the results obtained by training on a single dataset and then testing on the other three. As described above (see Experimental Setup), the training and validation folds for choosing tree depth belong to the same dataset. The reported test F1 scores refer to the three F1 scores measured separately for each dataset and to the F1 score measured on the dataset that joins these three (shown between parentheses). The first case reported is training on dataset 2, which generalizes very well to datasets 4 and 7, and a bit less to dataset 9. Training on dataset 4 generalizes better for datasets 7 and 9 than for dataset 2, and always worse than the values obtained in the first case; on the joined dataset, the F1 is almost 9% lower than the score obtained in the first case. Training on dataset 7 generalizes better for dataset 9, then for dataset 4, and finally for dataset 2, with these values being generally worse than those obtained in the second case; on the joined dataset, 5.6% lower. Finally, training on dataset 9 generalizes well for all other datasets, only 1.5% lower than the first case on the joined dataset. It is worth remarking that: (1) larger training datasets (see Table 3) tend to provide better generalization ability; (2) generalization seems to be easier for datasets where the number of attackers is closer to, but not less than, the number of attackers in the training dataset; (3) the trees become smaller as the number of attackers grows; (4) lack of Precision is the main culprit for the worst results.
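Note that the F1 score on the joined test set is not simply the average of the per-set F1 scores, since precision and recall are recomputed over the pooled predictions. A toy example with hand-made labels (not taken from our datasets) illustrates the difference:

```python
def f1(y_true, y_pred):
    """F1 score for binary labels (1 = attack)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two small, hand-made test sets.
true_a, pred_a = [1, 1, 0, 0], [1, 0, 0, 0]   # per-set F1 = 2/3
true_b, pred_b = [1, 0], [1, 0]               # per-set F1 = 1.0

# F1 on the joined set (0.8) differs from the mean of the
# two per-set scores (about 0.83).
joined = f1(true_a + true_b, pred_a + pred_b)
assert abs(joined - 0.8) < 1e-9
```
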
Another thing to notice is the high dispersion of F1 values obtained by the same classifier on different test sets (e.g., between 0.75 and 0.97 by the classifier trained on dataset 4), which renders its predictions highly unreliable. Table 6 reports the results obtained by training on a mix of two datasets and then testing on the other two. In this case, the training and validation folds for choosing tree depth come from different datasets. The immediate difference we can observe is that the decision trees are now extremely shallow. This means that the characteristics of the two training datasets are substantially different, and deeper trees obtained for one dataset do not generalize well on the other. However, these small trees can do a good job. Their results show much less dispersion, with all F1 values between 0.95 and 0.99 (except for the classifier trained on datasets 4 + 7 when tested on dataset 9). Therefore, they are much more reliable. Table 7 reports the results obtained by training on a mix of three datasets and testing on the remaining one. Here, too, the trees are very shallow (always depth 3) and the results are very good (all scores between 0.95 and 0.999, rounded to 1.00 in the table). Dataset 9 is the one where generalization is most difficult. In order to draw some conclusions, Figure 3 rearranges the previous results in the following way. We collect several results for each test dataset presented in the previous tables. For example, for dataset 2, we look at the results obtained when training with dataset 4 (0.75), training with dataset 7 (0.77), and training with dataset 9 (0.96), and average these into a single value (0.83) that represents the F1 obtained on test dataset 2 when the training was done using only one dataset. We then look at the results obtained when training with the mixes of two datasets (in this case, 4 + 7, 4 + 9, 7 + 9) and again calculate the average (0.96).
Finally, we examine the result obtained when training with a mix of three datasets (0.98). These three averages are the first three bars of the plot, referring to test dataset 2. To complete the plot, we do the same for the other test datasets (i.e., 4, 7, 9). Looking at Figure 3, it becomes obvious that training with two datasets is better than training with only one, and training with three datasets is even better than training with two. The increasing diversity in the training data results in better generalization ability on the test set. This is observed in all test sets, but the effect is more dramatic for those with fewer attackers. Furthermore, we now clearly observe that generalization is more difficult on test set 9 than, for example, on test set 7. This can be explained by the number of rows in both datasets (see Table 3). For example, set 7 has, proportionally, almost 12% more records from attackers than set 9. Finally, Figures 4 and 5 show receiver operating characteristic (ROC) curves and confusion matrices, respectively, for each experiment presented in Table 7 (using three datasets for training).
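The averaging used to build Figure 3 is straightforward. For test dataset 2 (the only case whose individual scores are quoted above), it reduces to:

```python
# F1 scores on test dataset 2 when training on a single dataset (4, 7, or 9),
# as reported in the text.
single = [0.75, 0.77, 0.96]

# Average F1 for the "one training dataset" bar of Figure 3 (test dataset 2).
avg_single = sum(single) / len(single)
assert round(avg_single, 2) == 0.83
```
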

Exploring Methods
To compare the previous results with those obtained with different learning algorithms, here we report the performance of random forests and the multilayer perceptron when trained with the mix of three datasets. Table 8 shows the results of random forests, including the chosen hyperparameters. When trained with the joined datasets 2 + 4 + 7, a random forest with 30 trees of maximum depth 5 achieves an F1 score of 0.97, which is 2.1% higher than the result of decision trees (0.95). When using dataset 2 + 4 + 9 for training, a larger random forest of 50 trees with a maximum depth of 7 achieves 0.98, which is the same score obtained by a decision tree in the previous results. In the remaining two cases, both random forests achieve 0.97, which is lower than the 1.00 and 0.98 achieved by the respective decision trees in the previous results.
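A random forest with the hyperparameters reported above for the 2 + 4 + 7 training mix (30 trees of maximum depth 5) can be instantiated in scikit-learn as follows. The data here is synthetic, so only the hyperparameters mirror the paper and the printed score is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the joined training datasets (9 features, as in
# the feature subset used in our experiments).
X, y = make_classification(n_samples=600, n_features=9, random_state=0)
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

# Hyperparameters chosen for the 2 + 4 + 7 mix: 30 trees, maximum depth 5.
forest = RandomForestClassifier(n_estimators=30, max_depth=5, random_state=0)
forest.fit(X_train, y_train)
print(f1_score(y_test, forest.predict(X_test)))
```
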
Given the reported results, there seems to be no advantage in using random forests instead of decision trees. However, these results present less variance than those obtained by decision trees. Therefore, we can argue that a random forest model may be more reliable than a decision tree for a diverse range of test sets.
Regarding the multilayer perceptron, the results obtained are worse than those achieved by either decision trees or random forests. Table 9 reports the results as well as the chosen hyperparameters returned by grid search. Whereas decision trees previously obtained results between 0.95 and 1.00, and random forests obtained results between 0.97 and 0.98, the multilayer perceptron ranged from 0.70 on test dataset 2 to 0.90 on test dataset 7, with the remaining tests resulting in 0.82 and 0.83. Since a high Recall score was always obtained, the problem is lack of Precision. Table 7. (a) Training with datasets 2 + 4 + 7 and test with dataset 9; (b) Training with datasets 2 + 4 + 9 and test with dataset 7; (c) Training with datasets 2 + 7 + 9 and test with dataset 4; and (d) Training with datasets 4 + 7 + 9 and test with dataset 2.
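For completeness, a comparable multilayer perceptron baseline can be set up in scikit-learn like this. The data is again synthetic, and the layer sizes below are illustrative assumptions, not the hyperparameters returned by our grid search:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=9, random_state=0)
X_train, y_train = X[:500], y[:500]
X_test, y_test = X[500:], y[500:]

# A small MLP baseline; the (32, 16) hidden-layer sizes are illustrative only.
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
mlp.fit(X_train, y_train)
pred = mlp.predict(X_test)

# As in the paper's analysis, inspect Precision and Recall separately: the
# MLP's weakness in our experiments was low Precision despite high Recall.
print(precision_score(y_test, pred), recall_score(y_test, pred))
```
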


Comparison with Similar Work
Seeking to compare our results to what was reported by others in similar work, here we show a comparison with [36,37], the two works that report the same metrics. Going back to Table 7, by averaging all the rows for each metric we obtain an accuracy of 0.97, precision of 0.96, recall of 0.98, and F1 of 0.97. For [36], we average the results reported for the four variants presented, obtaining an accuracy of 0.95, with precision, recall, and F1 all equal to 0.94. For [37], we average the results reported for the two datasets and the two methods used, obtaining an accuracy of 0.98, precision of 0.94, recall of 0.89, and F1 of 0.90. Figure 6 shows these values in a comparative plot. It should be noted that our work and both of the others use unbalanced datasets. However, the other works highlight the accuracy achieved, which is well known to be an unreliable metric for unbalanced data. As visible in the plot, except for accuracy, our results are the best, despite relying on much simpler methods.
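The per-work figures above are plain arithmetic means over the reported rows, one mean per metric. A minimal sketch of this aggregation follows; the two per-row entries are hypothetical placeholders (not the actual rows of Table 7), chosen only so the output reproduces the averages quoted above:

```python
from statistics import mean

# Hypothetical per-test-set metrics; the real values are the rows of Table 7.
rows = [
    {"accuracy": 0.96, "precision": 0.95, "recall": 0.97, "f1": 0.96},
    {"accuracy": 0.98, "precision": 0.97, "recall": 0.99, "f1": 0.98},
]

# Average each metric across rows, as done for the comparison in Figure 6.
averages = {m: round(mean(r[m] for r in rows), 2) for m in rows[0]}
print(averages)
```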

Conclusions
We have generated four vehicular network datasets by performing simulations in NS-3 using 5G communication technology. Each dataset corresponds to a different scenario with a different number of sender and receiver vehicles, and in each one a Flooding attack was simulated with a different number of attackers. The generated datasets were extensively combined and tested to report reliable accuracy, precision, recall, and F1 scores for classifying Flooding behavior in the simulated scenarios. We used different machine learning algorithms, namely decision trees, random forests, and multilayer perceptrons, and compared their performance. The first conclusion of our study is that data diversity in the training set is important: the more diversity it contains, the better the generalization achieved on the test data. The second conclusion is that complex learning methods do not necessarily provide better results than simple methods such as decision trees. While the multilayer perceptron performed poorly, the random forests performed at the same level as decision trees, with the advantage of being more consistent across different test sets.
As future work, we will generate more complex data covering different attacks and scenarios, e.g., false-information and black-hole attacks, and platooning scenarios. We will also explore other ensemble methods, such as extreme gradient boosting (XGBoost).
