Article

Handling Data Structure Issues with Machine Learning in a Connected and Autonomous Vehicle Communication System

by Pranav K. Jha 1 and Manoj K. Jha 2,*
1 MKJHA Consulting, Unit: Basement, 419 Blairfield Court, Severn, MD 21144, USA
2 Department of Information Technology, University of Maryland Global Campus, 3501 University Blvd. East, Adelphi, MD 20783, USA
* Author to whom correspondence should be addressed.
Vehicles 2025, 7(3), 73; https://doi.org/10.3390/vehicles7030073
Submission received: 30 April 2025 / Revised: 30 June 2025 / Accepted: 9 July 2025 / Published: 11 July 2025

Abstract

Connected and Autonomous Vehicles (CAVs) remain vulnerable to cyberattacks due to inherent security gaps in the Controller Area Network (CAN) protocol. We present a structured Python (3.11.13) framework that repairs structural inconsistencies in a public CAV dataset to improve the reliability of machine learning-based intrusion detection. We assess the effect of training data volume and compare Random Forest (RF) and Extreme Gradient Boosting (XGBoost) classifiers across four attack types: DoS, Fuzzy, RPM spoofing, and GEAR spoofing. XGBoost outperforms RF, achieving 99.2% accuracy on the DoS dataset and 100% accuracy on the Fuzzy, RPM, and GEAR datasets. The Synthetic Minority Oversampling Technique (SMOTE) further enhances minority-class detection without compromising overall performance. This methodology provides a generalizable framework for anomaly detection in other connected systems, including smart grids, autonomous defense platforms, and industrial control networks.

1. Introduction

Connected and Autonomous Vehicles (CAVs) have the potential to revolutionize ground transportation systems, making them safer, more efficient, and more convenient. However, their reliance on interconnected systems, such as the Controller Area Network (CAN) protocol, introduces critical cybersecurity vulnerabilities. The CAN protocol, a widely adopted standard for intra-vehicle communication between electronic control units (ECUs), lacks inherent security features such as message authentication or encryption, making it susceptible to exploitation by malicious actors [1]. Cyberattacks targeting CAN buses—such as message injection, spoofing, or denial-of-service (DoS)—can manipulate vehicle behavior, leading to catastrophic outcomes, including accidents, system malfunctions, or loss of driver control [2]. Machine learning (ML) has emerged as a promising tool for detecting anomalous CAN traffic and mitigating these risks. However, the efficacy of ML models heavily depends on the availability of high-quality training datasets that accurately reflect real-world attack scenarios. Prior studies [3,4] have demonstrated the feasibility of CAN bus attacks, yet many datasets suffer from structural inconsistencies, missing labels, or insufficient representation of adversarial patterns. These limitations hinder the development of reliable intrusion detection systems (IDSs) for CAVs [5,6]. In our previous work [7], we analyzed experimental CAV datasets from the Hacking and Countermeasure Research Laboratory (HCRL), specifically the “Car-Hacking Dataset”, identifying significant data quality issues that pose a major challenge in developing reliable and efficient supervised ML models. The dataset can be accessed via the following link: https://ocslab.hksecurity.net/Datasets/car-hacking-dataset, accessed on 8 July 2025.
Figure 1 shows the typical setup for detecting intrusions in a CAN bus system. The diagram highlights key parts of the system, including how ECUs communicate and how an ML-based IDS monitors the traffic for suspicious activity. This setup emphasizes the importance of protecting CAN bus networks, especially in cars, where secure communication is essential to prevent unauthorized access and ensure the system works correctly.
To address these challenges, this research proposes a novel methodology to repair and optimize CAN datasets using custom scenario-driven logic implemented in a Python programming environment. By reconstructing training data to align with realistic attack scenarios, we enhance the integrity and usability of the dataset for ML applications. A sensitivity analysis is conducted to evaluate the relationship between training dataset size and prediction accuracy across multiple ML algorithms, including Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machines (SVM). The results demonstrate that dataset repairing significantly improves the detection of malicious CAN messages, maximizing true positives (TP) while minimizing false positives (FP).
The implications of this work extend beyond automotive systems. The proposed framework can be adapted to secure other interconnected cyber-physical systems, such as smart power grids, military combat vehicles, and distributed computing networks, where data quality and ML robustness are critical to resilience [8,9].

2. Literature Review

The security of CAVs and their communication systems is a growing concern as connectivity increases and cyber threats become more sophisticated. This section explores the current literature and research on vulnerabilities in the CAN protocol, ML-based intrusion detection methods, challenges related to datasets, and their applications in other cyber-physical systems.

2.1. CAN Protocol Vulnerabilities

The CAN bus remains the de facto in-vehicle communication backbone, yet it was never designed with security considerations. Indeed, CAN is “inherently insecure”, lacking any form of authentication, authorization, or encryption [10,11]. Consequently, attackers can readily inject fabricated CAN frames. Empirical studies have demonstrated that such malicious injections may override legitimate Electronic Control Unit (ECU) commands—manipulating braking, steering, and other safety-critical functions—without triggering detection. Furthermore, modern vehicles may incorporate hundreds of ECUs, thereby expanding the CAN attack surface; adversaries can exploit this by altering message timing or payloads to induce vehicle malfunctions. Remote exploitation vectors have also been identified: compromised telematics or infotainment systems can serve as entry points for CAN-based attacks, highlighting the necessity of an in-vehicle intrusion detection system (IDS) to supplement network security [12]. In summary, the absence of intrinsic security mechanisms in CAN—specifically, message authentication and encryption—renders automotive networks vulnerable to injection, replay, denial-of-service, and spoofing attacks.

2.2. ML-Based Intrusion Detection for CAN Networks

Because of CAN’s inherent insecurity, ML has become popular for detecting anomalies in CAN traffic. Traditional rule- or signature-based detectors fail to adapt to new attack methods, so many recent works employ supervised or deep learning models. For instance, deep recurrent neural networks (LSTM, GRU) have been used to model CAN sequences; one study reports LSTM achieving ~99.9% accuracy in distinguishing normal vs. malicious frames [11]. Classical ML methods (RF, SVM, etc.) have also been effective on curated CAN datasets. However, most published results rely on idealized or small datasets, which raises concerns. In fact, it has been noted that many ML-IDS studies use “small, outdated and balanced” datasets [13]. Such limited data can give misleadingly high accuracy. Effectively, an IDS’s performance on real in-vehicle traffic depends heavily on data preprocessing and model choice. Recent research highlights that extensive data cleaning, feature selection, and balancing are often required. For example, the authors of [14] developed a CAN IDS using comprehensive feature engineering—including custom feature scaling and random forest-based feature selection—which yielded nearly perfect detection rates. Likewise, data-preprocessing pipelines (normalization, SMOTE oversampling, dimensionality reduction) have been shown to dramatically improve IDS accuracy. In one study, combining SMOTE-based oversampling with feature reduction raised IDS F1-scores by over 30% [15]. Overall, the shift to ML-based IDS has enabled very accurate CAN intrusion detection under clean data conditions (often >99% accuracy) [11,15], but these results assume high-quality training data.

2.3. Feature Engineering and Preprocessing in IDS

The importance of feature engineering and preprocessing in IDS has been stressed in recent work. As one survey notes, “Data preprocessing is a crucial step for building successful models and can improve data quality and model performance” [14]. In practice, this can involve cleaning raw CAN logs, filling missing timestamps, normalizing signals, balancing classes, and extracting or selecting informative features. For example, the authors of [15] introduced a CAN-IDS that applied novel feature scaling and random-forest-based feature selection before training boosting classifiers. Other authors built pipelines that automatically detect and correct issues such as class imbalance or feature redundancy [16]. The authors of [17] proposed an ensemble-based IDS that first balances the dataset (using undersampling, oversampling, SMOTE bagging) and then decorrelates features before classification; this approach significantly boosted detection quality [14,15]. In short, careful preprocessing and feature engineering are widely recognized as essential: without them, even powerful ML models struggle.

2.4. ML Model Benchmarking and Evaluation

Recent works also focus on rigorously benchmarking IDS models on CAN datasets. One contribution is the construction of the can-train-and-test dataset, which provides labeled CAN traffic from multiple vehicle models and nine realistic attack types (DoS, GEAR spoofing, etc.) [10]. This dataset is explicitly designed to evaluate how well ML-IDS generalize across vehicles and unseen attacks: in experiments, 18 different ML classifiers were tested on it to stress-test their generality. Similarly, the authors of [12] benchmarked four time-based IDS approaches on the new ROAD CAN dataset (with real, stealthy injection attacks). They found that non-parametric (distribution-agnostic) anomaly detectors greatly outperformed classical hypothesis-test-based methods—e.g., improving precision-recall AUC by ~55% on realistic attack data. These results highlight that model choice and evaluation framework matter: some IDS methods excel only on synthetic attacks, while others perform better on complex real-world data. On the deep-learning side, multiple works confirm that sequence models (e.g., LSTM, bi-LSTM) achieve excellent detection rates on labeled CAN attack datasets [11]. However, these high accuracies are typically obtained under controlled dataset conditions. In general, comparative studies underscore the importance of using diverse, realistic data for benchmarking ML models in CAN/CAV security.

2.5. Dataset Challenges in CAN Security Research

The quality and diversity of CAN datasets represent a key challenge in the development of effective IDS for connected vehicles. Many publicly available datasets are constrained in scope, limiting their ability to capture the complexity of real-world vehicular communication. As noted by [18], most existing datasets suffer from inherent biases—such as attacks being injected without accounting for normal driving behavior—resulting in unrealistic scenarios. These limitations hinder the training and generalization capabilities of ML models, thereby affecting the robustness and reliability of IDS solutions. Similarly, it is noted in [19] that many IDS datasets are “outdated and lack sufficient diversity”, so newer attack patterns may not be represented. These dataset shortcomings are acute for CAN: older datasets often have only one attack type (e.g., a simple DoS or fuzzing) under static conditions, with little real driving data. Moreover, data collection issues plague some CAN logs and several datasets either omit timestamp fields or encode the CAN data bytes inconsistently, which complicates merging and processing [20]. In sum, CAN security datasets often suffer missing or inconsistent fields (timestamps, labels, data bytes), imbalanced class distributions, and a lack of realistic attack diversity. These data quality issues can severely limit ML performance. As a recent analysis emphasizes, up to 80% of effort in building an IDS goes into preprocessing the data (cleaning, normalizing, balancing) [15]. Without careful data repair, ML models may learn spurious artifacts or fail to detect novel threats.

2.6. Intrusion Detection in IoT and Other Cyber-Physical Systems

Many of the challenges in CAN security echo those in other IoT and CPS domains. For example, false data injection attacks in smart power grids aim to corrupt sensor measurements, and ML-based detectors have been widely adopted to counter them. The authors of [21] review these grid attacks and note that ML algorithms have been “widely adopted for detecting malicious manipulation of sensor data” thanks to their speed and accuracy. Similarly, in IoT networks, researchers point out that existing datasets are often inadequate. Authors in [8] introduced the TON_IoT dataset after observing that “there is a lack of benchmark Internet of Things (IoT) and Industrial Internet of Things (IIoT) datasets” with realistic attacks and labels. Without representative data, IoT anomaly detectors cannot be properly validated. In the connected vehicles context, secure communication extends to domains like intelligent transportation or even military vehicles. These settings similarly require IDS models that distinguish between benign and malicious signals under noisy, dynamic conditions. Thus, lessons from CAN IDS (the need for data preprocessing, balancing, and diverse attacks) are broadly applicable across cyber-physical systems—from smart grids to industrial IoT to battlefield networks.

2.7. Gaps and Research Opportunities

Several key gaps remain in CAN and CAV security research. First, most ML-focused work has prioritized model architectures over data issues. However, as the authors of [15] observe, without high-quality input data, even the best algorithms underperform. In practice, very little work has systematically improved CAN dataset quality (e.g., by imputing missing values or correcting labels) before training. Second, there are few studies analyzing how dataset size, feature selection, or preprocessing techniques affect IDS accuracy. For instance, it is well known that training on small, curated datasets can give overly optimistic accuracy; yet the impact of scaling up data or choosing particular CAN signals remains underexplored [13]. Third, most available CAN datasets lack diversity in attack types and scenarios, as highlighted by [18,19]. This makes it difficult to train ML models that generalize to new threats. These gaps motivate our approach: we propose a data repair framework for CAN (and similar CPS) datasets to address missing or inconsistent values and improve balance. We will also systematically vary dataset size and feature sets—including using synthetic oversampling (SMOTE)—to quantify effects on IDS performance. By focusing on data quality and diversity, we aim to build more robust ML-IDS that advance the state-of-the-art in connected vehicle security.

3. Research Methodology

In this study, we adopt a systematic approach to analyzing and enhancing the security of automotive communication systems. Our methodology consists of four key steps: (1) data acquisition, (2) preprocessing, (3) model selection, and (4) evaluation, ensuring a robust framework for detecting anomalies in CAN messages, as illustrated in Figure 2.
The dataset was collected from the HCRL website [5]. The process involved downloading the dataset, saving it locally, and loading it into a Jupyter Notebook for analysis. The dataset is composed of CAN traffic logs obtained from a real vehicle through the OBD-II port, while message injection attacks were performed.
The dataset includes four types of cyberattacks:
  • DoS Attack: Injection of CAN messages with an ID of 0000 every 0.3 ms, which dominates the bus and disrupts normal communication.
  • Fuzzy Attack: Injection of random CAN IDs and data values every 0.5 ms, causing unpredictable behavior in the vehicle.
  • Spoofing Attack (RPM/GEAR): Injection of specific CAN messages related to Revolutions per Minute (RPM) and GEAR information every 1 ms, misleading the vehicle’s control system.
Each dataset contains 300 instances of message injection intrusions, performed for 3–5 s per attack. The entire dataset spans 30–40 min of CAN traffic.
In the Data Cleaning and Preprocessing stage in Section 4, we analyze the dataset structure and clean it to ensure it is well-suited for developing robust and efficient ML models. The dataset is split into 80% training and 20% testing. Since the dataset is highly imbalanced—over 80% of instances are attack-free, while less than 20% contain attack data—we apply SMOTE to balance the data distribution. We train and evaluate multiple ML models, including
  • SVM
  • RF
  • XGBoost
The performance of these models is compared with and without SMOTE, and we present the results for the two best-performing models to demonstrate the impact of handling class imbalance effectively.
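The split-train-evaluate workflow described above can be sketched as follows. The data here are synthetic stand-ins (random payload bytes with a toy attack signature), not the HCRL dataset, so the numbers are illustrative only:

```python
# Sketch of the 80/20 stratified split and model training pipeline (Section 3).
# Synthetic data: random DATA0..DATA7 bytes with an artificial attack marker.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X = rng.integers(0, 256, size=(n, 8))    # DATA0..DATA7 payload bytes
y = (rng.random(n) < 0.16).astype(int)   # ~16% injected frames (imbalanced)
X[y == 1, 0] = 0                         # toy attack signature in DATA0

# 80% training / 20% testing, stratified to preserve the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
```

SVM and XGBoost would be slotted into the same pipeline in place of the Random Forest; only the classifier object changes.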

4. Data Cleaning and Preprocessing

To ensure dataset reliability, we performed several preprocessing steps, including handling missing values, removing duplicates, filling null values, and converting hexadecimal values to integers, as detailed in later sections. Additionally, we applied feature engineering techniques—such as bitwise and arithmetic transformations, global time delta calculations, and rolling count windows for time-sensitive analysis—to evaluate their impact on model performance.
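The null-filling and hexadecimal-to-integer conversion steps can be sketched with pandas; the two toy frames below are illustrative, not rows from the actual dataset:

```python
# Minimal sketch of null filling and hex-to-integer conversion (Section 4).
import pandas as pd

df = pd.DataFrame({
    "CAN ID": ["0316", "0000"],
    "DATA0": ["05", "ff"],
    "DATA1": ["21", None],   # a missing byte, as found in the raw logs
})

df["DATA1"] = df["DATA1"].fillna("00")        # fill nulls with hex "00"
for col in ["DATA0", "DATA1"]:                # hex strings -> integers
    df[col] = df[col].apply(lambda v: int(v, 16))
```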
Table 1 presents the data statistics for the car hacking dataset [5], categorized by attack types and providing details on both normal and injected messages.

4.1. DoS Attack Data Cleaning

For the DoS attack data, we identified inconsistencies in the data structure where the DLC column has two unique values: 2 and 8. Rows with DLC = 2 contain NaN values from DATA3 to DATA7, as a DLC value of 2 indicates that only DATA0 and DATA1 should be populated.
Due to this structural error, the DATA2 column contains “R”, which is actually the value intended for the Flag column. However, because of the misalignment in the data structure, this value has been incorrectly placed in DATA2, as shown in Table 2.
The data was cleaned using the following methodology: First, the value in the “DATA2” column was transferred to the “Flag” column to correct the misalignment caused by structural inconsistencies. Subsequently, the “DATA2” column was populated with the hexadecimal value “00” for missing entries, ensuring data integrity. Finally, all “NaN” values across the dataset were systematically replaced with hex “00” to maintain uniformity and consistency in the data structure.
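A pandas sketch of this repair logic is shown below. The rows are toy examples with the real column layout: for DLC = 2 frames, the flag value that landed in DATA2 is moved to Flag, and the payload bytes are backfilled with hex "00":

```python
# Sketch of the DoS structural repair (Section 4.1): relocate the misaligned
# flag, restore DATA2, and replace remaining NaN with hex "00".
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DLC":   [2, 8],
    "DATA0": ["05", "00"], "DATA1": ["21", "17"],
    "DATA2": ["R", "ea"],  # "R" is the flag value misplaced into DATA2
    "DATA3": [np.nan, "0a"], "DATA4": [np.nan, "00"],
    "DATA5": [np.nan, "ff"], "DATA6": [np.nan, "00"], "DATA7": [np.nan, "00"],
    "Flag":  [np.nan, "R"],
})

mask = df["DLC"] == 2
df.loc[mask, "Flag"] = df.loc[mask, "DATA2"]  # move the misaligned flag
df.loc[mask, "DATA2"] = "00"                  # restore the payload byte
df = df.fillna("00")                          # remaining NaN -> hex "00"
```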
The cleaned data is presented in Table 3, where structural inconsistencies have been corrected to ensure proper alignment of values across columns. Additionally, Table 4 provides an example of DoS attack data with DLC = 8.
Table 5 below presents the summary statistics of the cleaned CAN bus data for the DoS attack.

4.2. Fuzzy Attack Data Cleaning

Following the same approach as the DoS attack, we cleaned the Fuzzy attack data, which also exhibited similar inconsistencies but for DLC values 2, 5, and 6. Table 6 below shows the summary statistics of the cleaned CAN bus data for the Fuzzy attack.

4.3. RPM and GEAR Spoofing Attack Data Cleaning

Both the RPM and GEAR Spoofing attack data exhibited the same inconsistencies observed in the DoS attack—DLC = 2 entries contained only two valid inputs, with the remaining values as NaN. We resolved these issues using the same data cleaning approach as for the DoS attack. The summary statistics for the cleaned CAN bus data are presented in Table 7 for the RPM attack and in Table 8 for the GEAR Spoofing attack.
Aside from these inconsistencies, no additional issues were identified in the DoS, Fuzzy, or RPM/GEAR attack datasets.

5. Sensitivity Analysis and Model Performance

5.1. Analysis of Attack Characteristics

To evaluate the effectiveness of different ML models in detecting attacks, we tested SVM, RF, and XGBoost. Since SVM consistently underperformed compared to RF, we excluded it from further analysis. Our primary focus was on RF and XGBoost, with the latter demonstrating several advantages, particularly in handling imbalanced datasets.
Figure 3a,b illustrate the data distribution for DoS and Fuzzy attacks, respectively. The RPM/GEAR attacks shown in Figure 3c,d exhibit a distribution pattern similar to that of the DoS attacks, as they rely on a limited set of specific CAN IDs and are not time-sensitive. In contrast, the Fuzzy attack continuously generates random CAN IDs, as detailed in Section 3, leading to a more widely dispersed distribution along the x-axis.
The correlation matrices in Figure 4 illustrate the relationships between the DATA0–DATA7 columns in the cleaned DoS, Fuzzy, RPM, and GEAR attack datasets. For instance, in the DoS dataset, moderate correlations are observed between DATA3 and DATA5 (0.62) and between DATA2 and DATA4 (0.37). However, most correlation values remain within acceptable limits, suggesting no significant multicollinearity.
At this stage, we have opted to retain all columns, as each variable may provide valuable insights for further analysis. Removing any features could compromise the dataset’s integrity, so we will keep all variables to ensure a thorough evaluation in subsequent modeling and anomaly detection steps.
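The correlation screening described above is a one-liner in pandas. The frame below uses random synthetic payload bytes, so its coefficients will differ from those in Figure 4; the threshold used to flag near-collinear pairs is our own illustrative choice:

```python
# Sketch of the multicollinearity check on the DATA0-DATA7 columns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 256, size=(200, 8)),
                  columns=[f"DATA{i}" for i in range(8)])

corr = df.corr()                                # Pearson correlation matrix
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)  # flag near-collinear pairs
```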
In our dataset, the proportion of injected data is as follows: 16.03% for DoS attacks, 12.81% for Fuzzy attacks, 14.17% for RPM gauge spoofing, and 13.44% for GEAR Drive Spoofing, as shown in Table 1.

5.2. RF and XGBoost-Based Modeling

In this section, we begin by analyzing DoS attacks, explaining how we can identify these attacks using the features in our dataset. The target variable, found in the “Flag” column, is represented by binary values: 0 for normal behavior and 1 for attack events. After the analysis of DoS attacks, we move on to Fuzzy attacks, RPM attacks, and GEAR-based attacks, as the approach for analyzing these attacks is similar to that used for DoS.

5.2.1. Baseline RF-Based Classifier

In preparing our dataset, we intentionally excluded the “CAN ID” field from our feature set. While CAN IDs provide valuable information for detecting specific attacks like DoS, RPM manipulation, and GEAR spoofing, their inclusion could lead to data leakage that might artificially inflate model performance on test datasets. To prevent this potential overfitting risk and ensure our models learn genuine patterns rather than protocol-specific identifiers, we ultimately removed this feature. Our analysis now focuses exclusively on the hexadecimal payload values (DATA0 through DATA7) as the foundation for training our detection models. The baseline Random Forest performance metrics are presented in Table 9. These metrics were calculated using stratified K-fold cross-validation with K = 10.
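The baseline evaluation can be sketched as stratified 10-fold cross-validation on payload-only features (CAN ID deliberately dropped). The data below are synthetic stand-ins, and the classifier settings only mirror the flavor of Table 10 (class weighting, bounded depth, all cores), not its exact values:

```python
# Sketch of the baseline RF evaluation with stratified K-fold CV (K = 10).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 600
X = rng.integers(0, 256, size=(n, 8)).astype(float)  # DATA0..DATA7 only
y = (rng.random(n) < 0.16).astype(int)               # imbalanced labels
X[y == 1, 1] = 0.0                                   # toy attack signature

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,              # conservative depth constraint
    class_weight="balanced",   # imbalance handling via class weighting
    n_jobs=-1,                 # full computational resource utilization
    random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
```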
The baseline Random Forest classifier achieved perfect classification performance using the configuration parameters shown in Table 10, as demonstrated by the evaluation metrics.
  • Flawless separation of classes in the confusion matrix:
    [ 615,650        0 ]
    [       0  117,504 ]
  • Maximum achievable scores across all evaluation metrics (precision, recall, and F1-score)
  • Ideal class-specific performance for both majority (normal) and minority (attack) classes
Notable characteristics of this preliminary model:
  • Effective handling of class imbalance (1:5.24 ratio) through class weighting
  • Conservative depth constraint preventing excessive tree growth
  • Full computational resource utilization during training

5.2.2. Fine-Tuning Class Weights for Model Performance

While the initial results demonstrated theoretically perfect performance, they also suggested potential overfitting risks that required further investigation. To address these concerns, additional validation strategies were implemented. Specifically, class weight tuning was performed by manually calculating the class weights based on the exact class ratio (approximately 5.24:1), computed as
Ratio = 3,078,250 / 587,521 ≈ 5.24
Consequently, the class_weight parameter was set to these manual weights instead of using the default “balanced” setting. Table 11 presents the optimized classification performance metrics, where manual class weighting was applied using an empirical ratio of 5.24.
The parameter refinement yielded more realistic performance while maintaining critical attack detection capabilities:
  • Preserved perfect attack recall (1.00) despite precision trade-off:
    [ 598,520   17,130 ]
    [       0  117,504 ]
  • Achieved balanced accuracy ((0.97 + 1.00) / 2 = 0.985)
  • Reduced false positive rate for normal class:
    FPR = 17,130 / 615,650 ≈ 0.028
Parameter adjustments:
  • Manual class weighting: class_weight = {0: 1, 1: 5.24}
Key Observations:
  • 2.8% false positive rate suggests residual tuning potential for normal traffic
  • Maintained 100% attack recall meets critical security requirements
  • 13% absolute precision decrease in attack class indicates more conservative anomaly flagging
This refined model demonstrates improved operational realism while preserving essential attack detection capabilities, with the accuracy metric decreasing from 1.0 to 0.977 reflecting more balanced performance characteristics.
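The weight calculation can be made explicit in code. The class counts below are those quoted in the ratio above, and passing the resulting dictionary via class_weight mirrors the manual adjustment described:

```python
# Sketch of manual class-weight tuning (Section 5.2.2): the attack-class
# weight is the majority/minority frame-count ratio (~5.24).
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

counts = Counter({0: 3_078_250, 1: 587_521})  # normal vs. attack frames
ratio = counts[0] / counts[1]                 # ≈ 5.24

clf = RandomForestClassifier(
    class_weight={0: 1, 1: round(ratio, 2)},  # manual weights, not "balanced"
    random_state=0)
```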

5.2.3. XGBoost-Based Classifier

Given the potential overfitting risks observed with the initial Random Forest implementation, we further experimented with an alternative model—XGBoost. Without any hyperparameter tuning, the XGBoost model achieved performance comparable to that of the Random Forest model, using the configuration shown in Table 12.
The XGBoost model was able to generalize well while maintaining strong predictive performance. Notably, the use of scale_pos_weight = 8 instead of the exact ratio of 5.24 further accounted for class imbalance, while regularization settings such as max_depth = 6, subsample = 0.8, and colsample_bytree = 0.8 helped mitigate overfitting risks.
These findings suggest that XGBoost, even without hyperparameter optimization, can serve as a robust alternative to Random Forest for this classification task.
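The settings quoted above can be collected into a parameter dictionary. This is a configuration sketch restating the Table 12 values reported in the text; the commented line assumes the xgboost package is available:

```python
# Configuration sketch of the XGBoost settings reported in Section 5.2.3.
xgb_params = {
    "scale_pos_weight": 8,    # coarser than the exact 5.24 imbalance ratio
    "max_depth": 6,           # shallow trees limit model complexity
    "subsample": 0.8,         # row subsampling per boosting round
    "colsample_bytree": 0.8,  # feature subsampling per tree
}
# clf = xgboost.XGBClassifier(**xgb_params)  # requires the xgboost package
```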

5.2.4. Feature Engineering

Based on the feature–importance analysis shown in Figure 5—generated using a simple train–test split with the XGBoost model (without K-Fold cross-validation or SMOTE)—we applied additional feature engineering. The bar plots in Figure 5 reveal that, for DoS attacks, DATA1 and DATA0 are the most influential features, while Fuzzy attacks depend primarily on DATA3 and DATA1. Similarly, RPM-based attacks are dominated by DATA7 and DATA3, and GEAR-based attacks by DATA1 and DATA3.
The following transformations were applied to the DoS attack data; analogous transformations were then applied to each of the other attack types, guided by their respective feature–importance graphs:
  • Bitwise XOR transformation: DATA0 ⊕ DATA1 and DATA5 ⊕ DATA7
  • Arithmetic summation: DATA0 + DATA1
These newly derived features were incorporated into the dataset:
  • DATA01_XOR = DATA0 ⊕ DATA1,
  • DATA01_SUM = DATA0 + DATA1,
  • DATA57_XOR = DATA5 ⊕ DATA7
The model was retrained using the updated feature set:
X_new = {DATA0, DATA1, DATA5, DATA7, DATA01_XOR, DATA01_SUM, DATA57_XOR}
After training, the evaluation metrics demonstrated a slight improvement, as presented in Table 13 below:
While the improvement was minor, these results suggest that feature engineering can help refine model performance.
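The three derived features are computed directly on the integer-converted payload columns; the toy values below are illustrative:

```python
# Sketch of the derived-feature construction (Section 5.2.4): bitwise XOR
# and arithmetic sum of selected payload bytes.
import pandas as pd

df = pd.DataFrame({"DATA0": [5, 255], "DATA1": [33, 0],
                   "DATA5": [1, 2],   "DATA7": [4, 2]})

df["DATA01_XOR"] = df["DATA0"] ^ df["DATA1"]  # bitwise XOR transformation
df["DATA01_SUM"] = df["DATA0"] + df["DATA1"]  # arithmetic summation
df["DATA57_XOR"] = df["DATA5"] ^ df["DATA7"]
```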

5.2.5. Temporal Feature Engineering

To further investigate temporal patterns, we derived and evaluated the following time-based features, as shown in the heatmaps in Figure 6:
  • Global Time Delta: Difference in timestamps between consecutive CAN messages.
  • Rolling Counts (1 ms, 10 ms, 0.3 ms): Number of messages observed within sliding windows of 1 ms, 10 ms, and 0.3 ms.
  • Hour of Message: Hour component extracted from each message timestamp.
  • Time Since First Message: Elapsed time since the initial message in the trace.
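The temporal features listed above can be derived with pandas time-based rolling windows. The timestamps below are toy values in seconds and the column names are our own illustrative choices:

```python
# Sketch of the temporal feature derivation (Section 5.2.5).
import pandas as pd

df = pd.DataFrame({"Timestamp": [0.0000, 0.0003, 0.0006, 0.0009, 0.0105]})

# Global time delta: gap between consecutive CAN messages
df["global_dt"] = df["Timestamp"].diff().fillna(0.0)

# Rolling count: messages observed within a sliding 1 ms window
idx = pd.DatetimeIndex(pd.to_datetime(df["Timestamp"], unit="s"))
df["count_1ms"] = pd.Series(1.0, index=idx).rolling("1ms").count().to_numpy()

# Time since first message in the trace
df["t_since_first"] = df["Timestamp"] - df["Timestamp"].iloc[0]
```

The 10 ms and 0.3 ms windows follow the same pattern with a different window string, and the hour feature is simply `idx.hour`.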
The analysis uncovered distinct behavioral characteristics across different types of cyberattacks:
  • Fuzzy Attacks: Demonstrated a strong dependence on temporal features, particularly the hour of occurrence. This is consistent with their tendency to inject random CAN IDs at irregular intervals, making timing a key factor in detection.
  • DoS and RPM/GEAR Attacks: Showed limited reliance on temporal features. Detection in these cases was primarily driven by the characteristics of the payload data rather than any timing anomalies.
Given the already high performance of the XGBoost classifier across all attack types, and as illustrated in Figure 6, the newly derived temporal features contribute minimally to detection performance—except for the Hour feature in the case of Fuzzy attacks. However, since this is a classification problem and the model is already achieving excellent results, we chose not to pursue further temporal analysis. As a result, no additional feature engineering was applied to the temporal attributes.

5.2.6. Performance Analysis with SMOTE and Stratified K-Fold

To further address class imbalance, Synthetic Minority Over-sampling Technique (SMOTE) was applied in conjunction with Stratified K-Fold cross-validation. This approach ensured balanced training splits while preserving the distribution of target classes.
The model achieved the following average performance across folds:
  • Average Accuracy: 0.9924
  • Average ROC AUC Score: 0.9998
The aggregated confusion matrix across folds was
[ 3,050,684    27,565 ]
[       285   587,236 ]
An example classification report from a single fold is presented in Table 14.
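For illustration, the core idea of SMOTE (synthesizing minority samples by interpolating between a minority point and one of its nearest minority neighbors) can be hand-rolled in a few lines. This is a simplified sketch, not the library implementation used in practice:

```python
# Minimal SMOTE-style oversampler: each synthetic sample lies on the segment
# between a random minority sample and one of its k nearest minority neighbors.
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]   # k nearest, skipping the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                 # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(out)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 8))           # toy minority-class samples
X_syn = smote_like(X_min, n_new=30, rng=rng)
```

In the stratified K-fold setup, oversampling is applied to the training split of each fold only, so the test split retains the original class distribution.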

6. Results and Discussion

Figure 7 summarizes the confusion matrices for DoS, Fuzzy, RPM, and GEAR attacks using XGBoost without SMOTE (top row) and with SMOTE (bottom row). Across both configurations, the model achieves high true-positive rates and low false-positive rates for all attack types. For example, on the DoS dataset, the configuration without SMOTE yields 117,516 true positives and 17,417 false positives, corresponding to an accuracy of 97.66% and a ROC AUC score of 0.9859. After applying SMOTE, true positives increase to 587,236 and false positives to 27,565, boosting the average accuracy to 99.24% and the ROC AUC to 0.9998. A similar improvement is observed for the Fuzzy dataset, where both accuracy and ROC AUC approach perfect scores following SMOTE; however, for the RPM and GEAR datasets—which already achieve 100% accuracy and ROC AUC without SMOTE—the application of SMOTE yields no further performance gains.
These results represent a notable improvement over our previous study [7], where the prediction accuracies for the DoS, Fuzzy, GEAR, and RPM datasets were 0.93, 0.99, 1.00, and 1.00, respectively. In the current analysis, the DoS detection accuracy using the XGBoost classifier has increased to 0.992, while the Fuzzy, GEAR, and RPM datasets continue to achieve perfect scores of 1.00. This improvement highlights the effectiveness of incorporating SMOTE in enhancing the classifier’s ability to correctly identify minority-class instances, without compromising its performance on already well-balanced datasets.
The RF model also performed comparably, achieving similar levels of accuracy. However, it required longer training times and more extensive hyperparameter tuning compared to XGBoost.

7. Conclusions

In this paper, we presented a comprehensive framework for enhancing and evaluating ML-based intrusion detection in CAV networks. We first addressed structural deficiencies in publicly available CAN datasets by applying scenario-driven repair and augmentation rules, thereby improving data quality and representativeness. Through sensitivity analyses, we quantified the impact of training set size on model performance, highlighting the importance of sufficient minority-class samples for reliable detection.
We benchmarked two state-of-the-art classifiers, RF and XGBoost, on four attack scenarios (DoS, Fuzzy, RPM, and GEAR). Our results demonstrate that XGBoost consistently outperforms RF, achieving up to 99.24% accuracy on the DoS dataset and 100% accuracy on the Fuzzy, RPM, and GEAR datasets. XGBoost's computational efficiency on standard CPU hardware, combined with its native support for imbalanced data and optimized gradient-boosting routines, resulted in substantially lower training and inference times than RF. Moreover, the application of SMOTE further improved minority-class detection without degrading performance on already balanced attack classes.
The scenario-driven data repair methodology and evaluation pipeline introduced here are directly transferable to other cyber-physical systems—such as power grids, military ground vehicles, and networked industrial controls—to strengthen anomaly detection capabilities. Future work will investigate the integration of deep learning architectures (e.g., LSTM and transformer-based models) and the deployment of real-time inference pipelines, with the goal of enabling adaptive, low-latency cyberattack mitigation in live CAV environments.

Author Contributions

Conceptualization, P.K.J. and M.K.J.; methodology, P.K.J. and M.K.J.; software, P.K.J.; validation, P.K.J. and M.K.J.; formal analysis, P.K.J. and M.K.J.; investigation, P.K.J. and M.K.J.; resources, P.K.J. and M.K.J.; data curation, P.K.J.; writing—original draft preparation, P.K.J.; writing—review and editing, P.K.J. and M.K.J.; visualization, P.K.J.; supervision, M.K.J.; project administration, M.K.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is publicly available at https://ocslab.hksecurity.net/Datasets/car-hacking-dataset, as also referenced in Section 1.

Conflicts of Interest

Author Pranav K. Jha was employed by the company MKJHA Consulting. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Koscher, K.; Czeskis, A.; Roesner, F.; Patel, S.; Kohno, T.; Checkoway, S.; McCoy, D.; Kantor, B.; Anderson, D.; Shacham, H.; et al. Experimental Security Analysis of a Modern Automobile. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 12–14 May 2010; pp. 447–462. [Google Scholar] [CrossRef]
  2. Miller, C.; Valasek, C. Adventures in Automotive Networks and Control Units; IOActive Inc.: Seattle, WA, USA, 2013. [Google Scholar]
  3. Checkoway, S.; McCoy, D.; Kantor, B.; Anderson, D.; Shacham, H.; Savage, S.; Koscher, K.; Czeskis, A.; Roesner, F.; Kohno, T. Comprehensive Experimental Analyses of Automotive Attack Surfaces. In Proceedings of the 20th USENIX Security Symposium (USENIX Security 11), San Francisco, CA, USA, 8–12 August 2011. [Google Scholar]
  4. Miller, C.; Valasek, C. Remote Exploitation of an Unaltered Passenger Vehicle. In Proceedings of the Black Hat Briefings, Las Vegas, NV, USA, 1–6 August 2015; pp. 1–91. [Google Scholar]
  5. Seo, E.; Song, H.M.; Kim, H.K. GIDS: GAN Based Intrusion Detection System for In-Vehicle Network. In Proceedings of the 2018 16th Annual Conference on Privacy, Security and Trust (PST), Belfast, Ireland, 28–30 August 2018; pp. 1–6. [Google Scholar] [CrossRef]
  6. Song, H.M.; Woo, J.; Kim, H.K. In-Vehicle Network Intrusion Detection Using Deep Convolutional Neural Network. Veh. Commun. 2020, 21, 100198. [Google Scholar] [CrossRef]
  7. Jha, M.K.; Jaiswal, R. A Machine Learning Model to Predict Cyberattacks in Connected and Autonomous Vehicles. J. Comput. Cogn. Eng. 2024, 3, 307–315. [Google Scholar] [CrossRef]
  8. Alsaedi, A.; Moustafa, N.; Tari, Z.; Mahmood, A.; Anwar, A. TON_IoT Telemetry Dataset: A New Generation Dataset of IoT and IIoT for Data-Driven Intrusion Detection Systems. IEEE Access 2020, 8, 165130–165150. [Google Scholar] [CrossRef]
  9. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009; pp. 1–758. [Google Scholar] [CrossRef]
  10. Lampe, B.; Meng, W. can-train-and-test: A Curated CAN Dataset for Automotive Intrusion Detection. Comput. Secur. 2024, 140, 103777. [Google Scholar] [CrossRef]
  11. Rai, R.; Grover, J.; Sharma, P.; Pareek, A. Securing the CAN bus using deep learning for intrusion detection in vehicles. Sci. Rep. 2025, 15, 13820. [Google Scholar] [CrossRef] [PubMed]
  12. Blevins, D.H.; Moriano, P.; Bridges, R.A.; Verma, M.E.; Iannacone, M.D.; Hollifield, S.C. Time-Based CAN Intrusion Detection Benchmark. arXiv 2021, arXiv:2101.05781. [Google Scholar]
  13. Talukder, M.A.; Islam, M.M.; Uddin, M.A.; Hasan, K.F.; Sharmin, S.; Alyami, S.A.; Moni, M.A. Machine Learning-Based Network Intrusion Detection for Big and Imbalanced Data Using Oversampling, Stacking Feature Embedding and Feature Extraction. J. Big Data 2024, 11, 33. [Google Scholar] [CrossRef]
  14. Rai, H.M.; Yoo, J.; Agarwal, S. The Improved Network Intrusion Detection Techniques Using the Feature Engineering Approach with Boosting Classifiers. Mathematics 2024, 12, 24. [Google Scholar] [CrossRef]
  15. Semenov, S.; Krupska-Klimczak, M.; Czapla, R.; Krzaczek, B.; Gavrylenko, S.; Poltorazkiy, V.; Zozulia, V. Intrusion Detection Method Based on Preprocessing of Highly Correlated and Imbalanced Data. Appl. Sci. 2025, 15, 4243. [Google Scholar] [CrossRef]
  16. Hossain, M.A.; Islam, M.S. Ensuring Network Security with a Robust Intrusion Detection System Using Ensemble-Based Machine Learning. Array 2023, 19, 100306. [Google Scholar] [CrossRef]
  17. Louk, M.H.L.; Tama, B.A. Exploring Ensemble-Based Class Imbalance Learners for Intrusion Detection in Industrial Control Networks. Big Data Cogn. Comput. 2021, 5, 72. [Google Scholar] [CrossRef]
  18. Lee, S.; Choi, W.; Kim, I.; Lee, G.; Lee, D.H. A Comprehensive Analysis of Datasets for Automotive Intrusion Detection Systems. Comput. Mater. Contin. 2023, 76, 3. [Google Scholar] [CrossRef]
  19. Liu, I.-H.; Hsieh, C.-E.; Lin, W.-M.; Li, J.-S.; Li, C.-F. Data-Balancing Algorithm Based on Generative Adversarial Network for Robust Network Intrusion Detection. J. Robot. Netw. Artif. Life 2022, 9, 303–308. [Google Scholar]
  20. Kidmose, B.; Meng, W. CAN-sleuth: Investigating and Evaluating Automotive Intrusion Detection Datasets. In Proceedings of the 2024 European Interdisciplinary Cybersecurity Conference, Xanthi, Greece, 5–6 June 2024; pp. 19–28. [Google Scholar] [CrossRef]
  21. Sayghe, A.; Hu, Y.; Zografopoulos, I.; Liu, X.; Dutta, R.G.; Jin, Y.; Konstantinou, C. Survey of Machine Learning Methods for Detecting False Data Injection Attacks in Power Systems. IET Smart Grid 2020, 3, 581–595. [Google Scholar] [CrossRef]
Figure 1. Typical architecture of intrusion detection of a CAN bus network.
Figure 2. Methodological framework.
Figure 3. Distribution of attack data in the car hacking dataset: (a) DoS; (b) Fuzzy; (c) RPM; (d) GEAR attacks. These figures show the frequency and characteristics of normal and injected messages for each attack type.
Figure 4. Correlation matrices for the cleaned attack datasets: (a) DoS attack dataset, (b) Fuzzy attack dataset, (c) RPM attack dataset, and (d) GEAR attack dataset for DATA0–DATA7.
Figure 5. Feature importance analysis for attack types: (a) DoS attack; (b) Fuzzy attack, showing the most influential features for each attack.
Figure 6. Feature importance heatmaps for the car hacking dataset: (a) DoS, (b) Fuzzy, (c) RPM, and (d) GEAR attacks. The heatmaps highlight the most influential features for detecting each attack type.
Figure 7. Confusion matrices for XGBoost on the car hacking dataset. Top row (a–d): without SMOTE for (a) DoS, (b) Fuzzy, (c) RPM gauge spoofing, and (d) GEAR spoofing. Bottom row (e–h): with SMOTE applied for (e) DoS, (f) Fuzzy, (g) RPM-based, and (h) GEAR-based attacks.
Table 1. Data statistics.

Attack Type | Number of Messages | Normal Messages | Injected Messages | Injected (%)
DoS   | 3,665,771 | 3,078,250 | 587,521 | 16.03%
Fuzzy | 3,838,860 | 3,347,013 | 491,847 | 12.81%
RPM   | 4,621,702 | 3,966,805 | 654,897 | 14.17%
GEAR  | 4,443,142 | 3,845,890 | 597,252 | 13.44%
Table 2. Example of inconsistent data structure in DoS attack data.

CAN ID | DLC | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
Table 3. Example of cleaned DoS attack data with DLC = 2.

CAN ID | DLC | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Table 4. Example of DoS attack data with DLC = 8.

CAN ID | DLC | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
018f | 8 | fe | 5b | 00 | 00 | 00 | 3c | 00 | 00 | R
0260 | 8 | 19 | 21 | 22 | 30 | 08 | 8e | 6d | 3a | R
02a0 | 8 | 64 | 00 | 9a | 1d | 97 | 02 | bd | 00 | R
0329 | 8 | 40 | bb | 7f | 14 | 11 | 20 | 00 | 14 | R
0545 | 8 | d8 | 00 | 00 | 8a | 00 | 00 | 00 | 00 | R
Table 5. Summary statistics of cleaned CAN bus data for DoS attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770
Unique | 27 | 108 | 71 | 75 | 26 | 190 | 256 | 75 | 256 | 2
Top    | 0000 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 587,521 | 1,623,283 | 1,850,549 | 2,391,587 | 1,989,884 | 1,973,752 | 1,595,567 | 2,266,679 | 2,171,246 | 3,078,249
Table 6. Summary statistics of cleaned CAN bus data for Fuzzy attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859
Unique | 2,048 | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 2
Top    | 0316 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 182,121 | 1,141,824 | 1,375,200 | 1,982,150 | 1,497,877 | 1,489,162 | 1,136,017 | 1,857,467 | 1,753,764 | 3,347,009
Table 7. Summary statistics of cleaned CAN bus data for RPM Gauge Spoofing attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701
Unique | 26 | 113 | 85 | 88 | 28 | 192 | 256 | 80 | 256 | 2
Top    | 0316 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 871,230 | 1,467,124 | 1,710,232 | 2,406,008 | 1,888,765 | 1,790,965 | 1,298,354 | 2,822,119 | 2,131,776 | 3,966,804
Table 8. Summary statistics of cleaned CAN bus data for GEAR Drive Spoofing attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142
Unique | 26 | 150 | 130 | 97 | 36 | 221 | 256 | 107 | 256 | 2
Top    | 043f | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 804,541 | 1,260,708 | 1,654,137 | 2,331,855 | 1,827,363 | 1,736,104 | 1,858,690 | 2,702,180 | 2,662,037 | 3,845,889
Table 9. Initial random forest performance metrics.

Class | Precision | Recall | F1-Score | Support
Normal (0) | 1.000 | 1.000 | 1.000 | 615,650
Attack (1) | 1.000 | 1.000 | 1.000 | 117,504
Total | 1.000 | 1.000 | 1.000 | 733,154

Note: Metrics were calculated using stratified 10-fold cross-validation.
Table 10. Model configuration parameters.

Parameter | Value
Trees (n_estimators) | 100
Class weighting | Balanced
Max tree depth | 10
Parallelization | Full (n_jobs = -1)
Random seed | 42
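The Table 10 settings map onto a scikit-learn call like the following; this is a configuration sketch only, with the training data and surrounding pipeline assumed.

```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest configured with the Table 10 parameters.
rf = RandomForestClassifier(
    n_estimators=100,         # number of trees
    class_weight="balanced",  # reweight classes inversely to their frequency
    max_depth=10,             # cap tree depth to limit overfitting
    n_jobs=-1,                # use all available CPU cores
    random_state=42,          # reproducible results
)
```

With `class_weight="balanced"`, scikit-learn computes per-class weights as n_samples / (n_classes × class count), which serves the same purpose as the manual ratio discussed below Table 11.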
Table 11. Optimized classification performance metrics.

Class | Precision | Recall | F1-Score | Support
Normal (0) | 1.00 | 0.97 | 0.99 | 615,650
Attack (1) | 0.87 | 1.00 | 0.93 | 117,504
Total | 0.98 | 0.98 | 0.98 | 733,154

Note: Manual class weighting was applied using the empirical ratio 3,078,250 / 587,521 ≈ 5.24.
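The ratio in the note follows directly from the DoS class counts in Table 1; the weight dictionary below is one illustrative way to apply it, not necessarily the authors' exact call.

```python
normal_msgs = 3_078_250  # normal messages in the DoS dataset (Table 1)
attack_msgs = 587_521    # injected messages in the DoS dataset (Table 1)

ratio = normal_msgs / attack_msgs
print(f"imbalance ratio: {ratio:.2f}")  # ≈ 5.24

# e.g. upweight the minority (attack) class by the empirical ratio
class_weight = {0: 1.0, 1: round(ratio, 2)}
```

A dictionary of this form can be passed as the `class_weight` argument of scikit-learn classifiers in place of the automatic "balanced" mode.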
Table 12. XGBoost model configuration parameters.

Parameter | Value
scale_pos_weight | 8
objective | binary:logistic
eval_metric | aucpr
max_depth | 6 (prevents overfitting)
subsample | 0.8
colsample_bytree | 0.8
random_state | 42

Note: XGBoost performed comparably to Random Forest without any additional tuning.
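The Table 12 settings correspond to a parameter dictionary like the one below; passing it to `xgboost.XGBClassifier(**params)` is an assumed usage sketch, not the authors' verbatim code.

```python
# XGBoost configuration mirroring Table 12 (keys are xgboost parameter names).
params = {
    "scale_pos_weight": 8,           # upweight the minority (attack) class
    "objective": "binary:logistic",  # binary classification with probabilities
    "eval_metric": "aucpr",          # area under the precision-recall curve
    "max_depth": 6,                  # shallow trees to prevent overfitting
    "subsample": 0.8,                # row subsampling per boosting round
    "colsample_bytree": 0.8,         # feature subsampling per tree
    "random_state": 42,
}
```

Using `aucpr` rather than plain `auc` as the evaluation metric is a common choice for imbalanced data, since precision-recall curves are more sensitive to minority-class errors.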
Table 13. Performance before and after feature engineering.

Metric | Before Feature Engineering | After Feature Engineering
Accuracy | 0.9766 | 0.9769
Confusion matrix | [[598,520, 17,130], [0, 117,504]] | [[598,221, 17,417], [0, 117,516]]
F1-Score (Class 0) | 0.99 | 0.99
F1-Score (Class 1) | 0.93 | 0.93
Macro Avg F1-Score | 0.96 | 0.96
Weighted Avg F1-Score | 0.98 | 0.98
Table 14. Classification report (example fold).

Class | Precision | Recall | F1-Score
0 (Negative) | 1.00 | 0.99 | 1.00
1 (Positive) | 0.95 | 1.00 | 0.98
Overall Accuracy | – | – | 0.99
Macro Avg | 0.98 | 1.00 | 0.99
Weighted Avg | 0.99 | 0.99 | 0.99
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
