Model Optimization and Robustness
To ensure high detection accuracy and generalizability, we rigorously optimized the Random Forest classifier through hyperparameter tuning and feature importance analysis. Grid search was employed to select optimal parameters (e.g., tree depth, estimator count), while Gini impurity metrics identified the most discriminative features. Robustness was validated across varied attack intensities (20–40% malicious nodes) and network scales. This section details the methodology, performance trade-offs, and stability of the model under dynamic VANET conditions, addressing potential overfitting risks observed in SVM-based approaches. Results confirm the model’s reliability for real-world deployment.
The following pseudocode details the training and evaluation process of the attack detection model. This algorithm represents the steps followed for classification, including data splitting, preprocessing, and cross-validation (Figure 5).
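For illustration, a minimal runnable Python sketch of that training and evaluation loop is given below. The synthetic data, class proportions, and parameter values are stand-ins assumed for the example, not the exact experimental artifacts of this work.

# Sketch of the Figure 5 training/evaluation loop.
# Assumptions: numeric features, binary labels (0 = normal, 1 = gray hole);
# make_classification stands in for the extracted VANET feature matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the 20 extracted VANET features.
X, y = make_classification(n_samples=13000, n_features=20,
                           weights=[0.74, 0.26], random_state=42)

# Scaling is fit inside the pipeline so each fold sees only its training split.
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=10, max_depth=10,
                                             random_state=42))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(model, X, y, cv=cv,
                        scoring=["precision_macro", "recall_macro", "f1_macro"])
for metric, values in scores.items():
    if metric.startswith("test_"):
        print(f"{metric}: {np.mean(values):.4f}")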
Several machine learning algorithms were used to predict the behavior of attacking nodes: kernelized Support Vector Machine (SVM), Random Forest, Logistic Regression, and Naive Bayes classifiers. Multiple measures were used to assess these algorithms, such as precision, recall, and F1-score with macro averaging, as stated in the literature [23,24]. Precision (p) and recall (r) can be computed as:

$$p = \frac{TP}{TP + FP} \tag{1}$$

$$r = \frac{TP}{TP + FN} \tag{2}$$

where TP (TN) stands for true positive (true negative) and FP (FN) for false positive (false negative). The $F_1$ score can be interpreted as a weighted harmonic average of the precision and recall [23]:

$$F_1 = \frac{2\,p\,r}{p + r} \tag{3}$$
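As a worked illustration of Equations (1)–(3) with macro averaging, using the per-class counts reported later in Table 6 and taking each class in turn as positive: for the normal class, $p_A = r_A = 9682/(9682+198) \approx 0.980$; for the adversarial class, $p_B = r_B = 2922/(2922+198) \approx 0.937$; the macro-averaged $F_1$ is therefore $(0.980 + 0.937)/2 \approx 0.958$.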
Table 3 and Table 4 summarize the results of the measurements performed with the classification algorithms under different scaling methods (Standard and Min-Max). In all algorithms, 10-fold stratified cross-validation was used to separate training and test data, with folds drawn randomly, as is regularly done in the literature [22]. To prevent data leakage and ensure the integrity of model evaluation, samples from the same simulation period were kept within the same fold, and data preprocessing was fitted exclusively on the training set of each fold. Both tables show that the best algorithm for classifying node misbehavior under a Gray Hole attack in a VANET is Random Forest. The Random Forest classifier used has 10 estimators with a maximum depth of 10; tuning these two parameters to 15 estimators and a depth of 15, Random Forest achieved 0.9927 in both precision and F1 score.
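For illustration, a leakage-safe version of this protocol can be sketched as follows; the grouping variable (one identifier per simulation period) and the parameter grid are assumptions made for the example, not the exact configuration of this study.

# Sketch: leakage-safe fold construction plus the 10 -> 15 estimator/depth tuning.
# Assumption: "sim_period" labels which simulation period each sample came from.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedGroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(13000, 20))                  # placeholder feature matrix
y = (rng.random(13000) < 0.26).astype(int)        # ~26% malicious labels
sim_period = rng.integers(0, 130, size=13000)     # hypothetical period IDs

# Samples sharing a simulation period never straddle a train/test boundary.
cv = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=42)

pipe = Pipeline([("scale", MinMaxScaler()),       # fitted on training folds only
                 ("rf", RandomForestClassifier(random_state=42))])

grid = GridSearchCV(pipe,
                    param_grid={"rf__n_estimators": [10, 15],
                                "rf__max_depth": [10, 15]},
                    scoring="f1_macro", cv=cv)
grid.fit(X, y, groups=sim_period)
print(grid.best_params_, round(grid.best_score_, 4))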
Table 5 shows the list of the five best selected features and the corresponding weight of each feature. To find the best feature set for detecting Gray Hole attacks from the 20 extracted features, the Random Forest Regressor class from scikit-learn was used [21]. As Table 5 shows, the transmission rate and the amount of information transmitted in bytes and packets are the most influential features. Incorporating kernel functions into the Support Vector Machine (SVM) model significantly improves its learning capability; however, as seen in Table 4, the predictive performance of the SVM can be low in certain scenarios. Each kernel function has its own strengths and weaknesses in terms of learning and generalization. Specifically, the Radial Basis Function (RBF) kernel stands out for its strong interpolation capabilities and for capturing local properties of the data, but its main limitation is that it is not effective at extracting global features from the entire training set [25]. In fact, overfitting is an inherent risk when using the RBF kernel if corrective measures are not applied [26].
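For illustration, the Gini-based ranking described above can be reproduced with a short scikit-learn sketch; the feature names and data below are hypothetical placeholders for the 20 extracted VANET features.

# Sketch: rank the 20 extracted features by Gini importance and keep the top 5.
# Feature names are hypothetical placeholders (e.g. tx_rate, bytes_sent, ...).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f"feat_{i:02d}" for i in range(20)]
X, y = make_classification(n_samples=13000, n_features=20, n_informative=5,
                           weights=[0.74, 0.26], random_state=42)

rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=42)
rf.fit(X, y)

# feature_importances_ holds the mean decrease in Gini impurity per feature.
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda kv: kv[1], reverse=True)
for name, weight in ranking[:5]:
    print(f"{name}: {weight:.4f}")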
To justify our choice of Random Forest, we first analyzed the limitations of SVM. Given the imbalanced dataset (74% normal vs. 26% malicious), SVM variants suffer from overfitting, as shown in Equations (4)–(15). This theoretical insight supports our empirical results (Table 3 and Table 4), where Random Forest consistently outperforms SVM.
Our contribution lies in identifying the optimal ML algorithm (Random Forest) and the set of features for detecting the Gray Hole attack, rather than modifying the algorithms themselves. The superiority of Random Forest is demonstrated through comparative analysis and the feature importance ranking (Table 5).
Adjusting the gamma scale parameter, as stated in [27], produces no change in RBF performance, so, given the extracted features, the precision limit is about 74.6%. Conducting an analysis similar to that of [26] on the similarity between characteristics yields the following.
Given: a training dataset with m samples $\{(x_i, y_i)\}_{i=1}^{m}$, where $x_i \in \mathbb{R}^n$ is a sample with a two-class label $y_i \in \{-1, +1\}$.

Two known metrics: $d_E(x_i, x_j) = \lVert x_i - x_j \rVert_2$, the Euclidean distance between two different samples, and $d_k(x_i, x_j) = \lvert x_{ik} - x_{jk} \rvert$, the Manhattan distance between two given features from different samples, defined as the absolute difference between $x_i$ and $x_j$ at the k-th feature.

Let $d_M(x_i, x_j) = \max_{k} \lvert x_{ik} - x_{jk} \rvert$, the maximum absolute difference (MAD), which measures the maximum difference across all features for two samples.

It is easy to find [25] that, for all the samples in the dataset, the ratio between the Euclidean distance and the MAD is always between 1 and $\sqrt{n}$, which means:

$$1 \le \frac{d_E(x_i, x_j)}{d_M(x_i, x_j)} \le \sqrt{n} \tag{4}$$
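This bound is easy to verify numerically; the following sketch checks it on synthetic data, assuming only the feature count n = 20 used in this work.

# Sketch: empirically verify 1 <= d_E / d_M <= sqrt(n) on random samples.
import numpy as np

rng = np.random.default_rng(1)
n = 20                                   # number of features, as in this work
X = rng.normal(size=(500, n))            # synthetic samples

diffs = X[:, None, :] - X[None, :, :]    # pairwise differences
d_E = np.sqrt((diffs ** 2).sum(axis=2))  # Euclidean distances
d_M = np.abs(diffs).max(axis=2)          # maximum absolute differences (MAD)

iu = np.triu_indices(len(X), k=1)        # distinct pairs only
ratio = d_E[iu] / d_M[iu]
print(ratio.min() >= 1.0, ratio.max() <= np.sqrt(n))   # prints: True True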
Using this result leads to the next affirmation: given any sample $x_i$, there exists a sample $x_j$ such that the ratio is equal to 1, so that $d_E(x_i, x_j)$ is minimal, $d_E(x_i, x_j) = d_M(x_i, x_j)$. Substituting this result into expression (4) gives:

$$d_E(x_i, x_j)^2 \ge d_M(x_i, x_j)^2 \tag{5}$$
The Euclidean distance between training samples $x_i$ and $x_j$ in the feature space of RBF-SVM determines the $(i, j)$-th entry in the learning machine's kernel matrix in the learning process, so:

$$K_{ij} = K(x_i, x_j) = \exp\left( -\gamma \lVert x_i - x_j \rVert^2 \right) \tag{6}$$

Equation (6) can be rewritten based on the Euclidean distance:

$$K_{ij} = \exp\left( -\gamma\, d_E(x_i, x_j)^2 \right) \tag{7}$$
Equations (5) and (7) can be combined to obtain the next result:

$$K_{ij} \le \exp\left( -\gamma\, d_M(x_i, x_j)^2 \right) \tag{8}$$

Accordingly, by expression (8), each entry in the machine's kernel matrix in the learning process is upper bounded by the term $\exp(-\gamma\, d_M(x_i, x_j)^2)$. Choosing $\gamma = 1/n$, as stated in the literature, leads to the expression $\exp(-d_M(x_i, x_j)^2 / n)$ as the upper limit in the training process. It is interesting to see that a large $d_M(x_i, x_j)$ leads to a value of $K_{ij}$ close to zero.
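The practical effect of this bound can be illustrated numerically: when the pairwise distances are large relative to $1/\gamma$, the off-diagonal kernel entries collapse toward zero and the kernel matrix approaches the identity. The data below are synthetic, and the scale is chosen only to exaggerate the effect.

# Sketch: off-diagonal RBF kernel entries vanish when distances are large,
# leaving a near-identity kernel matrix (a symptom of RBF overfitting).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
n = 20
X = rng.normal(scale=5.0, size=(200, n))     # well-separated synthetic samples

K = rbf_kernel(X, gamma=1.0 / n)             # gamma = 1/n, as discussed above
off_diag = K[~np.eye(len(X), dtype=bool)]
print(f"mean off-diagonal entry: {off_diag.mean():.2e}")   # close to zero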
Typically, some of the Lagrange multipliers $\alpha_i$ become zero during optimization [28]. But what happens when almost all kernel entries $K_{ij}$ are zero or near zero? The RBF-SVM classifier will always overfit and will always predict data as the majority class in the training data, causing the RBF kernel to lose generalization.
To prove this, given the training set, let $m_+$ be the number of samples in the training set where $y_i = +1$, $m_-$ be the number of samples in the training set where $y_i = -1$, and $m$ be the total number of samples. As stated in the literature [26,27], the decision function of the SVM is computed as:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \alpha_i\, y_i\, K(x_i, x) + b \right) \tag{9}$$
If almost all $K(x_i, x)$ are zero or near zero, then the decision function will completely depend on the threshold b. This is clear because b is defined through the weight vector $w = \sum_{i=1}^{m} \alpha_i\, y_i\, \phi(x_i)$, via the Karush–Kuhn–Tucker (KKT) condition [29]:

$$\alpha_j \left[\, y_j \left( w^\top \phi(x_j) + b \right) - 1 \,\right] = 0, \qquad j = 1, \dots, m \tag{10}$$

which yields:

$$b = \frac{1}{m} \sum_{j=1}^{m} \left( y_j - \sum_{i=1}^{m} \alpha_i\, y_i\, K(x_i, x_j) \right) \tag{11}$$
Equation (11), as stated in the literature, is a mean over all samples in the training set. But when $K(x_i, x_j) \approx 0$ for $i \neq j$:

$$b \approx \frac{1}{m} \sum_{j=1}^{m} y_j \tag{12}$$

Using the fact that $y_j \in \{-1, +1\}$, this leads to the next result:

$$b \approx \frac{m_+ - m_-}{m} \tag{13}$$
According to the definition of the sign function [26], the decision function is further simplified as

$$f(x) \approx \operatorname{sign}(b) \tag{14}$$

$$f(x) \approx \operatorname{sign}\!\left( \frac{m_+ - m_-}{m} \right) = \operatorname{sign}\left( m_+ - m_- \right) \tag{15}$$

where the predicted class would be based solely on which class has the most samples in the training set. If there is no majority class, the learning machine cannot determine the class of the input sample. This means that inspecting the Euclidean distances between samples in the training set, even before training, gives a quick indication of whether an RBF-SVM will overfit. The data extracted in this work show a majority of one class (normal behavior) at 74%, while the other (misbehavior) is at 26%. The RBF-SVM results in Table 4 approximate this prediction very closely, consistent with the roughly 74% precision limit noted above: the Euclidean distances between different samples in the training data are large enough to drive the kernel entries toward zero, which leads to overfitting in the RBF-SVM algorithm.
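This failure mode is straightforward to reproduce: with widely separated synthetic samples and an imbalanced 74/26 split, an RBF-SVM degenerates to predicting the majority class, matching the $\operatorname{sign}(m_+ - m_-)$ argument above. All settings below are illustrative.

# Sketch: an RBF-SVM degenerating to the majority class when kernel
# entries vanish, matching the sign(m+ - m-) argument above.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
n = 20
X = rng.normal(scale=50.0, size=(1000, n))            # far-apart samples
y = (rng.random(1000) < 0.26).astype(int)             # ~74% / 26% imbalance

clf = SVC(kernel="rbf", gamma=1.0 / n).fit(X, y)
X_test = rng.normal(scale=50.0, size=(200, n))
preds = clf.predict(X_test)
print(np.unique(preds, return_counts=True))           # only the majority class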
Referring to the results of the polynomial kernel (POL-SVM), applying the Taylor series expansion to Equation (6) shows the relationship between the RBF kernel and the polynomial kernel:

$$K(x_i, x_j) = e^{-\gamma \lVert x_i \rVert^2}\, e^{-\gamma \lVert x_j \rVert^2} \sum_{k=0}^{\infty} \frac{(2\gamma)^k}{k!} \left( x_i^\top x_j \right)^k \tag{16}$$

Each term of the series is a monomial of $x_i$ and $x_j$. Representing the RBF kernel as an inner product of two functions leads to

$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle \tag{17}$$

$$\phi(x) = e^{-\gamma \lVert x \rVert^2} \left( \sqrt{\frac{(2\gamma)^k}{k!}}\; x^{\otimes k} \right)_{k=0}^{\infty} \tag{18}$$

where each component $x^{\otimes k}$ collects the degree-k monomials of the feature space. Although expressions (17) and (18) show an infinite sum, Taylor's theorem allows a truncation to be carried out to obtain a polynomial of degree $d$; the error introduced by this truncation is bounded in [28]:

$$\phi_d(x) = e^{-\gamma \lVert x \rVert^2} \left( \sqrt{\frac{(2\gamma)^k}{k!}}\; x^{\otimes k} \right)_{k=0}^{d} \tag{19}$$

On the other hand, the polynomial kernel is also a sum of monomials of $x_i$ and $x_j$. More precisely, the characteristics that correspond to the polynomial kernel, $K_{\text{poly}}(x_i, x_j) = (x_i^\top x_j + c)^d$, can also be written as

$$\psi(x) = \left( \sqrt{\binom{d}{k}\, c^{\,d-k}}\; x^{\otimes k} \right)_{k=0}^{d} \tag{20}$$
Paying attention to Equations (19) and (20), although they are similar, there are differences in the scaling of the two polynomials. The most important difference is the scaling that depends on the degree k of the monomial: in the RBF expansion it includes the inverse of the factorial of the degree, so the scaling of higher-degree monomials is smaller and lower-degree monomials dominate. Therefore, the POL-SVM kernel, whose feature space is built from the same monomials, likewise tends to rely more on its lower-degree monomials. It follows that the behavior of the polynomial kernel asymptotically approaches that of the RBF kernel [25,26,28,30], and increasing the order of the polynomial does not affect this fact, since the scaling factor of the added terms approaches zero and has no major impact on performance. In view of this result, the behavior of the POL-SVM algorithm in Table 3 is consistent with these experiments and with the results reported by other authors [28].
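The closeness of the two kernels can be illustrated by comparing the exact RBF value with its truncated Taylor expansion; the vectors and truncation degrees below are arbitrary choices for the example.

# Sketch: truncated Taylor expansion of the RBF kernel vs. the exact value,
# showing how quickly the 1/k! scaling suppresses higher-degree terms.
import math
import numpy as np

rng = np.random.default_rng(4)
n = 20
xi, xj = rng.normal(scale=0.3, size=(2, n))
gamma = 1.0 / n

exact = math.exp(-gamma * np.sum((xi - xj) ** 2))
prefactor = math.exp(-gamma * xi @ xi) * math.exp(-gamma * xj @ xj)

for d in (1, 2, 4, 8):
    truncated = prefactor * sum((2 * gamma) ** k / math.factorial(k)
                                * (xi @ xj) ** k for k in range(d + 1))
    print(f"degree {d}: error = {abs(exact - truncated):.2e}")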
The confusion matrix of the Random Forest model shows high performance in classifying normal and adversarial nodes. Of the 9880 normal nodes (Class A), the model correctly classified 9682 (98% specificity), with 198 errors (false positives). Of the 3120 adversarial nodes (Class B), it correctly identified 2922 (94% recall), with 198 false negatives. This indicates that the model is both accurate and sensitive, with a slight trade-off between false positives (FP) and false negatives (FN), making it well suited to environments where both error types carry a similar cost; see Table 6.
Table 7 compares the performance of the proposed method with recent studies addressing similar attacks in VANETs. Notably, the Random Forest classifier, with only 20 selected features, achieves an F1-score of 0.9927, outperforming previous works such as [4] (F1-score: 0.88) and [3] (accuracy: 0.95). The table also highlights the limitations of other approaches, such as the high computational complexity of hybrid models or their reliance on generic features. These results reinforce the superiority of the proposed model in terms of accuracy, adaptability to intermittent attacks, and computational efficiency, consolidating it as a robust solution for detecting Gray Hole attacks in VANET environments.