Article

Handling Data Structure Issues with Machine Learning in a Connected and Autonomous Vehicle Communication System

by Pranav K. Jha 1 and Manoj K. Jha 2,*
1 MKJHA Consulting, Unit: Basement, 419 Blairfield Court, Severn, MD 21144, USA
2 Department of Information Technology, University of Maryland Global Campus, 3501 University Blvd. East, Adelphi, MD 20783, USA
* Author to whom correspondence should be addressed.
Vehicles 2025, 7(3), 73; https://doi.org/10.3390/vehicles7030073
Submission received: 30 April 2025 / Revised: 30 June 2025 / Accepted: 9 July 2025 / Published: 11 July 2025

Abstract

Connected and Autonomous Vehicles (CAVs) remain vulnerable to cyberattacks due to inherent security gaps in the Controller Area Network (CAN) protocol. We present a structured Python (3.11.13) framework that repairs structural inconsistencies in a public CAV dataset to improve the reliability of machine learning-based intrusion detection. We assess the effect of training data volume and compare Random Forest (RF) and Extreme Gradient Boosting (XGBoost) classifiers across four attack types: DoS, Fuzzy, RPM spoofing, and GEAR spoofing. XGBoost outperforms RF, achieving 99.2% accuracy on the DoS dataset and 100% accuracy on the Fuzzy, RPM, and GEAR datasets. The Synthetic Minority Oversampling Technique (SMOTE) further enhances minority-class detection without compromising overall performance. This methodology provides a generalizable framework for anomaly detection in other connected systems, including smart grids, autonomous defense platforms, and industrial control networks.

1. Introduction

Connected and Autonomous Vehicles (CAVs) have the potential to revolutionize ground transportation systems, making them safer, more efficient, and more convenient. However, their reliance on interconnected systems, such as the Controller Area Network (CAN) protocol, introduces critical cybersecurity vulnerabilities. The CAN protocol, a widely adopted standard for intra-vehicle communication between electronic control units (ECUs), lacks inherent security features such as message authentication or encryption, making it susceptible to exploitation by malicious actors [1]. Cyberattacks targeting CAN buses—such as message injection, spoofing, or denial-of-service (DoS)—can manipulate vehicle behavior, leading to catastrophic outcomes, including accidents, system malfunctions, or loss of driver control [2]. Machine learning (ML) has emerged as a promising tool for detecting anomalous CAN traffic and mitigating these risks. However, the efficacy of ML models heavily depends on the availability of high-quality training datasets that accurately reflect real-world attack scenarios. Prior studies [3,4] have demonstrated the feasibility of CAN bus attacks, yet many datasets suffer from structural inconsistencies, missing labels, or insufficient representation of adversarial patterns. These limitations hinder the development of reliable intrusion detection systems (IDSs) for CAVs [5,6]. In our previous work [7], we analyzed experimental CAV datasets from the Hacking and Countermeasure Research Laboratory (HCRL), specifically the “Car-Hacking Dataset”, identifying significant data quality issues that pose a major challenge in developing reliable and efficient supervised ML models. The dataset can be accessed via the following link: https://ocslab.hksecurity.net/Datasets/car-hacking-dataset, accessed on 8 July 2025.
Figure 1 shows the typical setup for detecting intrusions in a CAN bus system. The diagram highlights key parts of the system, including how ECUs communicate and how an ML-based IDS monitors the traffic for suspicious activity. This setup emphasizes the importance of protecting CAN bus networks, especially in cars, where secure communication is essential to prevent unauthorized access and ensure the system works correctly.
To address these challenges, this research proposes a novel methodology to repair and optimize CAN datasets using custom scenario-driven logic implemented in a Python programming environment. By reconstructing training data to align with realistic attack scenarios, we enhance the integrity and usability of the dataset for ML applications. A sensitivity analysis is conducted to evaluate the relationship between training dataset size and prediction accuracy across multiple ML algorithms, including Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machines (SVM). The results demonstrate that dataset repairing significantly improves the detection of malicious CAN messages, maximizing true positives (TP) while minimizing false positives (FP).
The implications of this work extend beyond automotive systems. The proposed framework can be adapted to secure other interconnected cyber-physical systems, such as smart power grids, military combat vehicles, and distributed computing networks, where data quality and ML robustness are critical to resilience [8,9].

2. Literature Review

The security of CAVs and their communication systems is a growing concern as connectivity increases and cyber threats become more sophisticated. This section explores the current literature and research on vulnerabilities in the CAN protocol, ML-based intrusion detection methods, challenges related to datasets, and their applications in other cyber-physical systems.

2.1. CAN Protocol Vulnerabilities

The CAN bus remains the de facto in-vehicle communication backbone, yet it was never designed with security considerations. Indeed, CAN is “inherently insecure”, lacking any form of authentication, authorization, or encryption [10,11]. Consequently, attackers can readily inject fabricated CAN frames. Empirical studies have demonstrated that such malicious injections may override legitimate Electronic Control Unit (ECU) commands—manipulating braking, steering, and other safety-critical functions—without triggering detection. Furthermore, modern vehicles may incorporate hundreds of ECUs, thereby expanding the CAN attack surface; adversaries can exploit this by altering message timing or payloads to induce vehicle malfunctions. Remote exploitation vectors have also been identified: compromised telematics or infotainment systems can serve as entry points for CAN-based attacks, highlighting the necessity of an in-vehicle intrusion detection system (IDS) to supplement network security [12]. In summary, the absence of intrinsic security mechanisms in CAN—specifically, message authentication and encryption—renders automotive networks vulnerable to injection, replay, denial-of-service, and spoofing attacks.

2.2. ML-Based Intrusion Detection for CAN Networks

Because of CAN’s inherent insecurity, ML has become popular for detecting anomalies in CAN traffic. Traditional rule- or signature-based detectors fail to adapt to new attack methods, so many recent works employ supervised or deep learning models. For instance, deep recurrent neural networks (LSTM, GRU) have been used to model CAN sequences; one study reports LSTM achieving ~99.9% accuracy in distinguishing normal vs. malicious frames [11]. Classical ML methods (RF, SVM, etc.) have also been effective on curated CAN datasets. However, most published results rely on idealized or small datasets, which raises concerns. In fact, it has been noted that many ML-IDS studies use “small, outdated and balanced” datasets [13]. Such limited data can give misleadingly high accuracy. Effectively, an IDS’s performance on real in-vehicle traffic depends heavily on data preprocessing and model choice. Recent research highlights that extensive data cleaning, feature selection, and balancing are often required. For example, the authors of [14] developed a CAN IDS using comprehensive feature engineering—including custom feature scaling and random forest-based feature selection—which yielded nearly perfect detection rates. Likewise, data-preprocessing pipelines (normalization, SMOTE oversampling, dimensionality reduction) have been shown to dramatically improve IDS accuracy. In one study, combining SMOTE-based oversampling with feature reduction raised IDS F1-scores by over 30% [15]. Overall, the shift to ML-based IDS has enabled very accurate CAN intrusion detection under clean data conditions (often >99% accuracy) [11,15], but these results assume high-quality training data.

2.3. Feature Engineering and Preprocessing in IDS

The importance of feature engineering and preprocessing in IDS has been stressed in recent work. As one survey notes, “Data preprocessing is a crucial step for building successful models and can improve data quality and model performance” [14]. In practice, this can involve cleaning raw CAN logs, filling missing timestamps, normalizing signals, balancing classes, and extracting or selecting informative features. For example, the authors of [15] introduced a CAN-IDS that applied novel feature scaling and random-forest-based feature selection before training boosting classifiers. Other authors built pipelines that automatically detect and correct issues such as class imbalance or feature redundancy [16]. The authors of [17] proposed an ensemble-based IDS that first balances the dataset (using undersampling, oversampling, SMOTE bagging) and then decorrelates features before classification; this approach significantly boosted detection quality [14,15]. In short, careful preprocessing and feature engineering are widely recognized as essential: without them, even powerful ML models struggle.

2.4. ML Model Benchmarking and Evaluation

Recent works also focus on rigorously benchmarking IDS models on CAN datasets. One contribution is the construction of the can-train-and-test dataset, which provides labeled CAN traffic from multiple vehicle models and nine realistic attack types (DoS, GEAR spoofing, etc.) [10]. This dataset is explicitly designed to evaluate how well ML-IDS generalize across vehicles and unseen attacks: in experiments, 18 different ML classifiers were tested on it to stress-test their generality. Similarly, the authors of [12] benchmarked four time-based IDS approaches on the new ROAD CAN dataset (with real, stealthy injection attacks). They found that non-parametric (distribution-agnostic) anomaly detectors greatly outperformed classical hypothesis-test-based methods—e.g., improving precision-recall AUC by ~55% on realistic attack data. These results highlight that model choice and evaluation framework matter: some IDS methods excel only on synthetic attacks, while others perform better on complex real-world data. On the deep-learning side, multiple works confirm that sequence models (e.g., LSTM, bi-LSTM) achieve excellent detection rates on labeled CAN attack datasets [11]. However, these high accuracies are typically obtained under controlled dataset conditions. In general, comparative studies underscore the importance of using diverse, realistic data for benchmarking ML models in CAN/CAV security.

2.5. Dataset Challenges in CAN Security Research

The quality and diversity of CAN datasets represent a key challenge in the development of effective IDS for connected vehicles. Many publicly available datasets are constrained in scope, limiting their ability to capture the complexity of real-world vehicular communication. As noted by [18], most existing datasets suffer from inherent biases—such as attacks being injected without accounting for normal driving behavior—resulting in unrealistic scenarios. These limitations hinder the training and generalization capabilities of ML models, thereby affecting the robustness and reliability of IDS solutions. Similarly, it is noted in [19] that many IDS datasets are “outdated and lack sufficient diversity”, so newer attack patterns may not be represented. These dataset shortcomings are acute for CAN: older datasets often have only one attack type (e.g., a simple DoS or fuzzing) under static conditions, with little real driving data. Moreover, data collection issues plague some CAN logs and several datasets either omit timestamp fields or encode the CAN data bytes inconsistently, which complicates merging and processing [20]. In sum, CAN security datasets often suffer missing or inconsistent fields (timestamps, labels, data bytes), imbalanced class distributions, and a lack of realistic attack diversity. These data quality issues can severely limit ML performance. As a recent analysis emphasizes, up to 80% of effort in building an IDS goes into preprocessing the data (cleaning, normalizing, balancing) [15]. Without careful data repair, ML models may learn spurious artifacts or fail to detect novel threats.

2.6. Intrusion Detection in IoT and Other Cyber-Physical Systems

Many of the challenges in CAN security echo those in other IoT and CPS domains. For example, false data injection attacks in smart power grids aim to corrupt sensor measurements, and ML-based detectors have been widely adopted to counter them. The authors of [21] review these grid attacks and note that ML algorithms have been “widely adopted for detecting malicious manipulation of sensor data” thanks to their speed and accuracy. Similarly, in IoT networks, researchers point out that existing datasets are often inadequate. Authors in [8] introduced the TON_IoT dataset after observing that “there is a lack of benchmark Internet of Things (IoT) and Industrial Internet of Things (IIoT) datasets” with realistic attacks and labels. Without representative data, IoT anomaly detectors cannot be properly validated. In the connected vehicles context, secure communication extends to domains like intelligent transportation or even military vehicles. These settings similarly require IDS models that distinguish between benign and malicious signals under noisy, dynamic conditions. Thus, lessons from CAN IDS (the need for data preprocessing, balancing, and diverse attacks) are broadly applicable across cyber-physical systems—from smart grids to industrial IoT to battlefield networks.

2.7. Gaps and Research Opportunities

Several key gaps remain in CAN and CAV security research. First, most ML-focused work has prioritized model architectures over data issues. However, as the authors of [15] observe, without high-quality input data, even the best algorithms underperform. In practice, very little work has systematically improved CAN dataset quality (e.g., by imputing missing values or correcting labels) before training. Second, there are few studies analyzing how dataset size, feature selection, or preprocessing techniques affect IDS accuracy. For instance, it is well known that training on small, curated datasets can give overly optimistic accuracy; yet the impact of scaling up data or choosing particular CAN signals remains underexplored [13]. Third, most available CAN datasets lack diversity in attack types and scenarios, as highlighted by [18,19]. This makes it difficult to train ML models that generalize to new threats. These gaps motivate our approach: we propose a data repair framework for CAN (and similar CPS) datasets to address missing or inconsistent values and improve balance. We will also systematically vary dataset size and feature sets—including using synthetic oversampling (SMOTE)—to quantify effects on IDS performance. By focusing on data quality and diversity, we aim to build more robust ML-IDS that advance the state-of-the-art in connected vehicle security.

3. Research Methodology

In this study, we adopt a systematic approach to analyzing and enhancing the security of automotive communication systems. Our methodology consists of four key steps: (1) data acquisition, (2) preprocessing, (3) model selection, and (4) evaluation, ensuring a robust framework for detecting anomalies in CAN messages, as illustrated in Figure 2.
The dataset was collected from the HCRL website [5]. The process involved downloading the dataset, saving it locally, and loading it into a Jupyter Notebook for analysis. The dataset is composed of CAN traffic logs obtained from a real vehicle through the OBD-II port, while message injection attacks were performed.
The dataset includes four types of cyberattacks:
  • DoS Attack: Injection of CAN messages with an ID of 0000 every 0.3 ms, which dominates the bus and disrupts normal communication.
  • Fuzzy Attack: Injection of random CAN IDs and data values every 0.5 ms, causing unpredictable behavior in the vehicle.
  • Spoofing Attack (RPM/GEAR): Injection of specific CAN messages related to Revolutions per Minute (RPM) and GEAR information every 1 ms, misleading the vehicle’s control system.
Each dataset contains 300 instances of message injection intrusions, performed for 3–5 s per attack. The entire dataset spans 30–40 min of CAN traffic.
In the Data Cleaning and Preprocessing stage in Section 4, we analyze the dataset structure and clean it to ensure it is well-suited for developing robust and efficient ML models. The dataset is split into 80% training and 20% testing. Since the dataset is highly imbalanced—over 80% of instances are attack-free, while less than 20% contain attack data—we apply SMOTE to balance the data distribution. We train and evaluate multiple ML models, including
  • SVM
  • RF
  • XGBoost
The performance of these models is compared with and without SMOTE, and we present the results for the two best-performing models to demonstrate the impact of handling class imbalance effectively.
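The split-train-evaluate workflow described above can be sketched as follows. The data here are synthetic stand-ins (random payload bytes with a toy attack signature), not the HCRL dataset, so the numbers are illustrative only:

```python
# Sketch of the 80/20 stratified split and model training pipeline (Section 3).
# Synthetic data: random DATA0..DATA7 bytes with an artificial attack marker.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
X = rng.integers(0, 256, size=(n, 8))    # DATA0..DATA7 payload bytes
y = (rng.random(n) < 0.16).astype(int)   # ~16% injected frames (imbalanced)
X[y == 1, 0] = 0                         # toy attack signature in DATA0

# 80% training / 20% testing, stratified to preserve the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
```

SVM and XGBoost would be slotted into the same pipeline in place of the Random Forest; only the classifier object changes.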

4. Data Cleaning and Preprocessing

To ensure dataset reliability, we performed several preprocessing steps, including handling missing values, removing duplicates, filling null values, and converting hexadecimal values to integers, as detailed in later sections. Additionally, we applied feature engineering techniques—such as bitwise and arithmetic transformations, global time delta calculations, and rolling count windows for time-sensitive analysis—to evaluate their impact on model performance.
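The null-filling and hexadecimal-to-integer conversion steps can be sketched with pandas; the two toy frames below are illustrative, not rows from the actual dataset:

```python
# Minimal sketch of null filling and hex-to-integer conversion (Section 4).
import pandas as pd

df = pd.DataFrame({
    "CAN ID": ["0316", "0000"],
    "DATA0": ["05", "ff"],
    "DATA1": ["21", None],   # a missing byte, as found in the raw logs
})

df["DATA1"] = df["DATA1"].fillna("00")        # fill nulls with hex "00"
for col in ["DATA0", "DATA1"]:                # hex strings -> integers
    df[col] = df[col].apply(lambda v: int(v, 16))
```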
Table 1 presents the data statistics for the car hacking dataset [5], categorized by attack types and providing details on both normal and injected messages.

4.1. DoS Attack Data Cleaning

For the DoS attack data, we identified inconsistencies in the data structure where the DLC column has two unique values: 2 and 8. Rows with DLC = 2 contain NaN values from DATA3 to DATA7, as a DLC value of 2 indicates that only DATA0 and DATA1 should be populated.
Due to this structural error, the DATA2 column contains “R”, which is actually the value intended for the Flag column. However, because of the misalignment in the data structure, this value has been incorrectly placed in DATA2, as shown in Table 2.
The data was cleaned using the following methodology: First, the value in the “DATA2” column was transferred to the “Flag” column to correct the misalignment caused by structural inconsistencies. Subsequently, the “DATA2” column was populated with the hexadecimal value “00” for missing entries, ensuring data integrity. Finally, all “NaN” values across the dataset were systematically replaced with hex “00” to maintain uniformity and consistency in the data structure.
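A pandas sketch of this repair logic is shown below. The rows are toy examples with the real column layout: for DLC = 2 frames, the flag value that landed in DATA2 is moved to Flag, and the payload bytes are backfilled with hex "00":

```python
# Sketch of the DoS structural repair (Section 4.1): relocate the misaligned
# flag, restore DATA2, and replace remaining NaN with hex "00".
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DLC":   [2, 8],
    "DATA0": ["05", "00"], "DATA1": ["21", "17"],
    "DATA2": ["R", "ea"],  # "R" is the flag value misplaced into DATA2
    "DATA3": [np.nan, "0a"], "DATA4": [np.nan, "00"],
    "DATA5": [np.nan, "ff"], "DATA6": [np.nan, "00"], "DATA7": [np.nan, "00"],
    "Flag":  [np.nan, "R"],
})

mask = df["DLC"] == 2
df.loc[mask, "Flag"] = df.loc[mask, "DATA2"]  # move the misaligned flag
df.loc[mask, "DATA2"] = "00"                  # restore the payload byte
df = df.fillna("00")                          # remaining NaN -> hex "00"
```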
The cleaned data is presented in Table 3, where structural inconsistencies have been corrected to ensure proper alignment of values across columns. Additionally, Table 4 provides an example of DoS attack data with DLC = 8.
Table 5 below presents the summary statistics of the cleaned CAN bus data for the DoS attack.

4.2. Fuzzy Attack Data Cleaning

Following the same approach as the DoS attack, we cleaned the Fuzzy attack data, which also exhibited similar inconsistencies but for DLC values 2, 5, and 6. Table 6 below shows the summary statistics of the cleaned CAN bus data for the Fuzzy attack.

4.3. RPM and GEAR Spoofing Attack Data Cleaning

Both the RPM and GEAR Spoofing attack data exhibited the same inconsistencies observed in the DoS attack—DLC = 2 entries contained only two valid inputs, with the remaining values as NaN. We resolved these issues using the same data cleaning approach as for the DoS attack. The summary statistics for the cleaned CAN bus data are presented in Table 7 for the RPM attack and in Table 8 for the GEAR Spoofing attack.
Aside from these inconsistencies, no additional issues were identified in the DoS, Fuzzy, or RPM/GEAR attack datasets.

5. Sensitivity Analysis and Model Performance

5.1. Analysis of Attack Characteristics

To evaluate the effectiveness of different ML models in detecting attacks, we tested SVM, RF, and XGBoost. Since SVM consistently underperformed compared to RF, we excluded it from further analysis. Our primary focus was on RF and XGBoost, with the latter demonstrating several advantages, particularly in handling imbalanced datasets.
Figure 3a,b illustrate the data distribution for DoS and Fuzzy attacks, respectively. The RPM/GEAR attacks shown in Figure 3c,d exhibit a distribution pattern similar to that of the DoS attacks, as they rely on a limited set of specific CAN IDs and are not time-sensitive. In contrast, the Fuzzy attack continuously generates random CAN IDs, as detailed in Section 3, leading to a more widely dispersed distribution along the x-axis.
The correlation matrices in Figure 4 illustrate the relationships between the DATA0–DATA7 columns in the cleaned DoS, Fuzzy, RPM, and GEAR attack datasets. For instance, in the DoS dataset, moderate correlations are observed between DATA3 and DATA5 (0.62) and between DATA2 and DATA4 (0.37). However, most correlation values remain within acceptable limits, suggesting no significant multicollinearity.
At this stage, we have opted to retain all columns, as each variable may provide valuable insights for further analysis. Removing any features could compromise the dataset’s integrity, so we will keep all variables to ensure a thorough evaluation in subsequent modeling and anomaly detection steps.
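The correlation screening described above is a one-liner in pandas. The frame below uses random synthetic payload bytes, so its coefficients will differ from those in Figure 4; the threshold used to flag near-collinear pairs is our own illustrative choice:

```python
# Sketch of the multicollinearity check on the DATA0-DATA7 columns.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(0, 256, size=(200, 8)),
                  columns=[f"DATA{i}" for i in range(8)])

corr = df.corr()                                # Pearson correlation matrix
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)  # flag near-collinear pairs
```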
In our dataset, the proportion of injected data is as follows: 16.03% for DoS attacks, 12.81% for Fuzzy attacks, 14.17% for RPM gauge spoofing, and 13.44% for GEAR Drive Spoofing, as shown in Table 1.

5.2. RF and XGBoost-Based Modeling

In this section, we begin by analyzing DoS attacks, explaining how we can identify these attacks using the features in our dataset. The target variable, found in the “Flag” column, is represented by binary values: 0 for normal behavior and 1 for attack events. After the analysis of DoS attacks, we move on to Fuzzy attacks, RPM attacks, and GEAR-based attacks, as the approach for analyzing these attacks is similar to that used for DoS.

5.2.1. Baseline RF-Based Classifier

In preparing our dataset, we intentionally excluded the “CAN ID” field from our feature set. While CAN IDs provide valuable information for detecting specific attacks like DoS, RPM manipulation, and GEAR spoofing, their inclusion could lead to data leakage that might artificially inflate model performance on test datasets. To prevent this potential overfitting risk and ensure our models learn genuine patterns rather than protocol-specific identifiers, we ultimately removed this feature. Our analysis now focuses exclusively on the hexadecimal payload values (DATA0 through DATA7) as the foundation for training our detection models. The baseline Random Forest performance metrics are presented in Table 9. These metrics were calculated using stratified K-fold cross-validation with K = 10.
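The baseline evaluation can be sketched as stratified 10-fold cross-validation on payload-only features (CAN ID deliberately dropped). The data below are synthetic stand-ins, and the classifier settings only mirror the flavor of Table 10 (class weighting, bounded depth, all cores), not its exact values:

```python
# Sketch of the baseline RF evaluation with stratified K-fold CV (K = 10).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n = 600
X = rng.integers(0, 256, size=(n, 8)).astype(float)  # DATA0..DATA7 only
y = (rng.random(n) < 0.16).astype(int)               # imbalanced labels
X[y == 1, 1] = 0.0                                   # toy attack signature

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,              # conservative depth constraint
    class_weight="balanced",   # imbalance handling via class weighting
    n_jobs=-1,                 # full computational resource utilization
    random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
```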
The baseline Random Forest classifier achieved perfect classification performance using the configuration parameters shown in Table 10, as demonstrated by the evaluation metrics.
  • Flawless separation of classes in the confusion matrix:
    [ 615,650        0 ]
    [       0  117,504 ]
  • Maximum achievable scores across all evaluation metrics (precision, recall, and F1-score)
  • Ideal class-specific performance for both majority (normal) and minority (attack) classes
Notable characteristics of this preliminary model:
  • Effective handling of class imbalance (1:5.24 ratio) through class weighting
  • Conservative depth constraint preventing excessive tree growth
  • Full computational resource utilization during training

5.2.2. Fine-Tuning Class Weights for Model Performance

While the initial results demonstrated theoretically perfect performance, they also suggested potential overfitting risks that required further investigation. To address these concerns, additional validation strategies were implemented. Specifically, class weight tuning was performed by manually calculating the class weights based on the exact class ratio (approximately 5.24:1), computed as
Ratio = 3,078,250 / 587,521 ≈ 5.24
Consequently, the class_weight parameter was set to these manual weights instead of using the default “balanced” setting. Table 11 presents the optimized classification performance metrics, where manual class weighting was applied using an empirical ratio of 5.24.
The parameter refinement yielded more realistic performance while maintaining critical attack detection capabilities:
  • Preserved perfect attack recall (1.00) despite precision trade-off:
    [ 598,520   17,130 ]
    [       0  117,504 ]
  • Achieved balanced accuracy ((0.97 + 1.00) / 2 = 0.985)
  • Reduced false positive rate for normal class:
    FPR = 17,130 / 615,650 ≈ 0.028
Parameter adjustments:
  • Manual class weighting: class_weight = {0: 1, 1: 5.24}
Key Observations:
  • 2.8% false positive rate suggests residual tuning potential for normal traffic
  • Maintained 100% attack recall meets critical security requirements
  • 13% absolute precision decrease in attack class indicates more conservative anomaly flagging
This refined model demonstrates improved operational realism while preserving essential attack detection capabilities, with the accuracy metric decreasing from 1.0 to 0.977 reflecting more balanced performance characteristics.
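The weight calculation can be made explicit in code. The class counts below are those quoted in the ratio above, and passing the resulting dictionary via class_weight mirrors the manual adjustment described:

```python
# Sketch of manual class-weight tuning (Section 5.2.2): the attack-class
# weight is the majority/minority frame-count ratio (~5.24).
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

counts = Counter({0: 3_078_250, 1: 587_521})  # normal vs. attack frames
ratio = counts[0] / counts[1]                 # ≈ 5.24

clf = RandomForestClassifier(
    class_weight={0: 1, 1: round(ratio, 2)},  # manual weights, not "balanced"
    random_state=0)
```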

5.2.3. XGBoost-Based Classifier

Given the potential overfitting risks observed with the initial Random Forest implementation, we further experimented with an alternative model—XGBoost. Without any hyperparameter tuning, the XGBoost model achieved performance comparable to that of the Random Forest model, using the configuration shown in Table 12.
The XGBoost model was able to generalize well while maintaining strong predictive performance. Notably, the use of scale_pos_weight = 8 instead of the exact ratio of 5.24 further accounted for class imbalance, while regularization settings such as max_depth = 6, subsample = 0.8, and colsample_bytree = 0.8 helped mitigate overfitting risks.
These findings suggest that XGBoost, even without hyperparameter optimization, can serve as a robust alternative to Random Forest for this classification task.
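The settings quoted above can be collected into a parameter dictionary. This is a configuration sketch restating the Table 12 values reported in the text; the commented line assumes the xgboost package is available:

```python
# Configuration sketch of the XGBoost settings reported in Section 5.2.3.
xgb_params = {
    "scale_pos_weight": 8,    # coarser than the exact 5.24 imbalance ratio
    "max_depth": 6,           # shallow trees limit model complexity
    "subsample": 0.8,         # row subsampling per boosting round
    "colsample_bytree": 0.8,  # feature subsampling per tree
}
# clf = xgboost.XGBClassifier(**xgb_params)  # requires the xgboost package
```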

5.2.4. Feature Engineering

Based on the feature–importance analysis shown in Figure 5—generated using a simple train–test split with the XGBoost model (without K-Fold cross-validation or SMOTE)—we applied additional feature engineering. The bar plots in Figure 5 reveal that, for DoS attacks, DATA1 and DATA0 are the most influential features, while Fuzzy attacks depend primarily on DATA3 and DATA1. Similarly, RPM-based attacks are dominated by DATA7 and DATA3, and GEAR-based attacks by DATA1 and DATA3.
The following transformations were applied to the DoS attack data; analogous transformations were then applied to each of the other attack types, guided by their respective feature–importance graphs:
  • Bitwise XOR transformation: DATA0 ⊕ DATA1 and DATA5 ⊕ DATA7
  • Arithmetic summation: DATA0 + DATA1
These newly derived features were incorporated into the dataset:
  • DATA01_XOR = DATA0 ⊕ DATA1,
  • DATA01_SUM = DATA0 + DATA1,
  • DATA57_XOR = DATA5 ⊕ DATA7
The model was retrained using the updated feature set:
X_new = {DATA0, DATA1, DATA5, DATA7, DATA01_XOR, DATA01_SUM, DATA57_XOR}
After training, the evaluation metrics demonstrated a slight improvement, as presented in Table 13 below:
While the improvement was minor, these results suggest that feature engineering can help refine model performance.
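The three derived features are computed directly on the integer-converted payload columns; the toy values below are illustrative:

```python
# Sketch of the derived-feature construction (Section 5.2.4): bitwise XOR
# and arithmetic sum of selected payload bytes.
import pandas as pd

df = pd.DataFrame({"DATA0": [5, 255], "DATA1": [33, 0],
                   "DATA5": [1, 2],   "DATA7": [4, 2]})

df["DATA01_XOR"] = df["DATA0"] ^ df["DATA1"]  # bitwise XOR transformation
df["DATA01_SUM"] = df["DATA0"] + df["DATA1"]  # arithmetic summation
df["DATA57_XOR"] = df["DATA5"] ^ df["DATA7"]
```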

5.2.5. Temporal Feature Engineering

To further investigate temporal patterns, we derived and evaluated the following time-based features, as shown in the heatmaps in Figure 6:
  • Global Time Delta: Difference in timestamps between consecutive CAN messages.
  • Rolling Counts (1 ms, 10 ms, 0.3 ms): Number of messages observed within sliding windows of 1 ms, 10 ms, and 0.3 ms.
  • Hour of Message: Hour component extracted from each message timestamp.
  • Time Since First Message: Elapsed time since the initial message in the trace.
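The temporal features listed above can be derived with pandas time-based rolling windows. The timestamps below are toy values in seconds and the column names are our own illustrative choices:

```python
# Sketch of the temporal feature derivation (Section 5.2.5).
import pandas as pd

df = pd.DataFrame({"Timestamp": [0.0000, 0.0003, 0.0006, 0.0009, 0.0105]})

# Global time delta: gap between consecutive CAN messages
df["global_dt"] = df["Timestamp"].diff().fillna(0.0)

# Rolling count: messages observed within a sliding 1 ms window
idx = pd.DatetimeIndex(pd.to_datetime(df["Timestamp"], unit="s"))
df["count_1ms"] = pd.Series(1.0, index=idx).rolling("1ms").count().to_numpy()

# Time since first message in the trace
df["t_since_first"] = df["Timestamp"] - df["Timestamp"].iloc[0]
```

The 10 ms and 0.3 ms windows follow the same pattern with a different window string, and the hour feature is simply `idx.hour`.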
The analysis uncovered distinct behavioral characteristics across different types of cyberattacks:
  • Fuzzy Attacks: Demonstrated a strong dependence on temporal features, particularly the hour of occurrence. This is consistent with their tendency to inject random CAN IDs at irregular intervals, making timing a key factor in detection.
  • DoS and RPM/GEAR Attacks: Showed limited reliance on temporal features. Detection in these cases was primarily driven by the characteristics of the payload data rather than any timing anomalies.
Given the already high performance of the XGBoost classifier across all attack types, and as illustrated in Figure 6, the newly derived temporal features contribute minimally to detection performance—except for the Hour feature in the case of Fuzzy attacks. However, since this is a classification problem and the model is already achieving excellent results, we chose not to pursue further temporal analysis. As a result, no additional feature engineering was applied to the temporal attributes.

5.2.6. Performance Analysis with SMOTE and Stratified K-Fold

To further address class imbalance, Synthetic Minority Over-sampling Technique (SMOTE) was applied in conjunction with Stratified K-Fold cross-validation. This approach ensured balanced training splits while preserving the distribution of target classes.
The model achieved the following average performance across folds:
  • Average Accuracy: 0.9924
  • Average ROC AUC Score: 0.9998
The aggregated confusion matrix across folds was
[ 3,050,684    27,565 ]
[       285   587,236 ]
An example classification report from a single fold is presented in Table 14.
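For illustration, the core idea of SMOTE (synthesizing minority samples by interpolating between a minority point and one of its nearest minority neighbors) can be hand-rolled in a few lines. This is a simplified sketch, not the library implementation used in practice:

```python
# Minimal SMOTE-style oversampler: each synthetic sample lies on the segment
# between a random minority sample and one of its k nearest minority neighbors.
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(dist)[1:k + 1]   # k nearest, skipping the point itself
        j = rng.choice(nbrs)
        lam = rng.random()                 # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(out)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 8))           # toy minority-class samples
X_syn = smote_like(X_min, n_new=30, rng=rng)
```

In the stratified K-fold setup, oversampling is applied to the training split of each fold only, so the test split retains the original class distribution.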

6. Results and Discussion

Figure 7 summarizes the confusion matrices for DoS, Fuzzy, RPM, and GEAR attacks using XGBoost without SMOTE (top row) and with SMOTE (bottom row). Across both configurations, the model achieves high true-positive rates and low false-positive rates for all attack types. For example, on the DoS dataset, the configuration without SMOTE yields 117,516 true positives and 17,417 false positives, corresponding to an accuracy of 97.66% and a ROC AUC score of 0.9859. After applying SMOTE, true positives increase to 587,236 and false positives to 27,565, boosting the average accuracy to 99.24% and the ROC AUC to 0.9998. A similar improvement is observed for the Fuzzy dataset, where both accuracy and ROC AUC approach perfect scores following SMOTE; however, for the RPM and GEAR datasets—which already achieve 100% accuracy and ROC AUC without SMOTE—the application of SMOTE yields no further performance gains.
These results represent a notable improvement over our previous study [7], where the prediction accuracies for the DoS, Fuzzy, GEAR, and RPM datasets were 0.93, 0.99, 1.00, and 1.00, respectively. In the current analysis, the DoS detection accuracy using the XGBoost classifier has increased to 0.992, while the Fuzzy, GEAR, and RPM datasets continue to achieve perfect scores of 1.00. This improvement highlights the effectiveness of incorporating SMOTE in enhancing the classifier’s ability to correctly identify minority-class instances, without compromising its performance on already well-balanced datasets.
The RF model also performed comparably, achieving similar levels of accuracy. However, it required longer training times and more extensive hyperparameter tuning compared to XGBoost.

7. Conclusions

In this paper, we presented a comprehensive framework for enhancing and evaluating ML-based intrusion detection in CAV networks. We first addressed structural deficiencies in publicly available CAN datasets by applying scenario-driven repair and augmentation rules, thereby improving data quality and representativeness. Through sensitivity analyses, we quantified the impact of training set size on model performance, highlighting the importance of sufficient minority-class samples for reliable detection.
We benchmarked two state-of-the-art classifiers, RF and XGBoost, on four attack scenarios (DoS, Fuzzy, RPM, and GEAR). Our results demonstrate that XGBoost consistently outperforms RF, achieving up to 99.24% accuracy on the DoS dataset and 100% accuracy on the Fuzzy, RPM, and GEAR datasets. XGBoost's computational efficiency on standard CPU hardware, combined with its native support for imbalanced data and optimized gradient-boosting routines, resulted in substantially lower training and inference times than RF. Moreover, the application of SMOTE further improved minority-class detection without degrading performance on already balanced attack classes.
The scenario-driven data repair methodology and evaluation pipeline introduced here are directly transferable to other cyber-physical systems—such as power grids, military ground vehicles, and networked industrial controls—to strengthen anomaly detection capabilities. Future work will investigate the integration of deep learning architectures (e.g., LSTM and transformer-based models) and the deployment of real-time inference pipelines, with the goal of enabling adaptive, low-latency cyberattack mitigation in live CAV environments.

Author Contributions

Conceptualization, P.K.J. and M.K.J.; methodology, P.K.J. and M.K.J.; software, P.K.J.; validation, P.K.J. and M.K.J.; formal analysis, P.K.J. and M.K.J.; investigation, P.K.J. and M.K.J.; resources, P.K.J. and M.K.J.; data curation, P.K.J.; writing—original draft preparation, P.K.J.; writing—review and editing, P.K.J. and M.K.J.; visualization, P.K.J.; supervision, M.K.J.; project administration, M.K.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study is publicly available at https://ocslab.hksecurity.net/Datasets/car-hacking-dataset, as also referenced in Section 1.

Conflicts of Interest

Author Pranav K. Jha was employed by the company MKJHA Consulting. The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Koscher, K.; Czeskis, A.; Roesner, F.; Patel, S.; Kohno, T.; Checkoway, S.; McCoy, D.; Kantor, B.; Anderson, D.; Shacham, H.; et al. Experimental Security Analysis of a Modern Automobile. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 12–14 May 2010; pp. 447–462. [Google Scholar] [CrossRef]
  2. Miller, C.; Valasek, C. Adventures in Automotive Networks and Control Units; IOActive Inc.: Seattle, WA, USA, 2013. [Google Scholar]
  3. Checkoway, S.; McCoy, D.; Kantor, B.; Anderson, D.; Shacham, H.; Savage, S.; Koscher, K.; Czeskis, A.; Roesner, F.; Kohno, T. Comprehensive Experimental Analyses of Automotive Attack Surfaces. In Proceedings of the 20th USENIX Security Symposium (USENIX Security 11), San Francisco, CA, USA, 8–12 August 2011. [Google Scholar]
  4. Miller, C.; Valasek, C. Remote Exploitation of an Unaltered Passenger Vehicle. In Proceedings of the Black Hat Briefings, Las Vegas, NV, USA, 1–6 August 2015; pp. 1–91. [Google Scholar]
  5. Seo, E.; Song, H.M.; Kim, H.K. GIDS: GAN Based Intrusion Detection System for In-Vehicle Network. In Proceedings of the 2018 16th Annual Conference on Privacy, Security and Trust (PST), Belfast, Ireland, 28–30 August 2018; pp. 1–6. [Google Scholar] [CrossRef]
  6. Song, H.M.; Woo, J.; Kim, H.K. In-Vehicle Network Intrusion Detection Using Deep Convolutional Neural Network. Veh. Commun. 2020, 21, 100198. [Google Scholar] [CrossRef]
  7. Jha, M.K.; Jaiswal, R. A Machine Learning Model to Predict Cyberattacks in Connected and Autonomous Vehicles. J. Comput. Cogn. Eng. 2024, 3, 307–315. [Google Scholar] [CrossRef]
  8. Alsaedi, A.; Moustafa, N.; Tari, Z.; Mahmood, A.; Anwar, A. TON_IoT Telemetry Dataset: A New Generation Dataset of IoT and IIoT for Data-Driven Intrusion Detection Systems. IEEE Access 2020, 8, 165130–165150. [Google Scholar] [CrossRef]
  9. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009; pp. 1–758. [Google Scholar] [CrossRef]
  10. Lampe, B.; Meng, W. can-train-and-test: A Curated CAN Dataset for Automotive Intrusion Detection. Comput. Secur. 2024, 140, 103777. [Google Scholar] [CrossRef]
  11. Rai, R.; Grover, J.; Sharma, P.; Pareek, A. Securing the CAN bus using deep learning for intrusion detection in vehicles. Sci. Rep. 2025, 15, 13820. [Google Scholar] [CrossRef] [PubMed]
  12. Blevins, D.H.; Moriano, P.; Bridges, R.A.; Verma, M.E.; Iannacone, M.D.; Hollifield, S.C. Time-Based CAN Intrusion Detection Benchmark. arXiv 2021, arXiv:2101.05781. [Google Scholar]
  13. Talukder, M.A.; Islam, M.M.; Uddin, M.A.; Hasan, K.F.; Sharmin, S.; Alyami, S.A.; Moni, M.A. Machine Learning-Based Network Intrusion Detection for Big and Imbalanced Data Using Oversampling, Stacking Feature Embedding and Feature Extraction. J. Big Data 2024, 11, 33. [Google Scholar] [CrossRef]
  14. Rai, H.M.; Yoo, J.; Agarwal, S. The Improved Network Intrusion Detection Techniques Using the Feature Engineering Approach with Boosting Classifiers. Mathematics 2024, 12, 24. [Google Scholar] [CrossRef]
  15. Semenov, S.; Krupska-Klimczak, M.; Czapla, R.; Krzaczek, B.; Gavrylenko, S.; Poltorazkiy, V.; Zozulia, V. Intrusion Detection Method Based on Preprocessing of Highly Correlated and Imbalanced Data. Appl. Sci. 2025, 15, 4243. [Google Scholar] [CrossRef]
  16. Hossain, M.A.; Islam, M.S. Ensuring Network Security with a Robust Intrusion Detection System Using Ensemble-Based Machine Learning. Array 2023, 19, 100306. [Google Scholar] [CrossRef]
  17. Louk, M.H.L.; Tama, B.A. Exploring Ensemble-Based Class Imbalance Learners for Intrusion Detection in Industrial Control Networks. Big Data Cogn. Comput. 2021, 5, 72. [Google Scholar] [CrossRef]
  18. Lee, S.; Choi, W.; Kim, I.; Lee, G.; Lee, D.H. A Comprehensive Analysis of Datasets for Automotive Intrusion Detection Systems. Comput. Mater. Contin. 2023, 76, 3. [Google Scholar] [CrossRef]
  19. Liu, I.-H.; Hsieh, C.-E.; Lin, W.-M.; Li, J.-S.; Li, C.-F. Data-Balancing Algorithm Based on Generative Adversarial Network for Robust Network Intrusion Detection. J. Robot. Netw. Artif. Life 2022, 9, 303–308. [Google Scholar]
  20. Kidmose, B.; Meng, W. CAN-sleuth: Investigating and Evaluating Automotive Intrusion Detection Datasets. In Proceedings of the 2024 European Interdisciplinary Cybersecurity Conference, Xanthi, Greece, 5–6 June 2024; pp. 19–28. [Google Scholar] [CrossRef]
  21. Sayghe, A.; Hu, Y.; Zografopoulos, I.; Liu, X.; Dutta, R.G.; Jin, Y.; Konstantinou, C. Survey of Machine Learning Methods for Detecting False Data Injection Attacks in Power Systems. IET Smart Grid 2020, 3, 581–595. [Google Scholar] [CrossRef]
Figure 1. Typical architecture of intrusion detection of a CAN bus network.
Figure 2. Methodological framework.
Figure 3. Distribution of attack data in the car hacking dataset: (a) DoS; (b) Fuzzy; (c) RPM; (d) GEAR attacks. These figures show the frequency and characteristics of normal and injected messages for each attack type.
Figure 4. Correlation matrices for the cleaned attack datasets: (a) DoS attack dataset, (b) Fuzzy attack dataset, (c) RPM attack dataset, and (d) GEAR attack dataset for DATA0–DATA7.
Figure 5. Feature importance analysis for attack types: (a) DoS attack; (b) Fuzzy attack, showing the most influential features for each attack.
Figure 6. Feature importance heatmaps for the car hacking dataset: (a) DoS, (b) Fuzzy, (c) RPM, and (d) GEAR attacks. The heatmaps highlight the most influential features for detecting each attack type.
Figure 7. Confusion matrices for XGBoost on the car hacking dataset. Top row (a–d): without SMOTE for (a) DoS, (b) Fuzzy, (c) RPM gauge spoofing, and (d) GEAR spoofing. Bottom row (e–h): with SMOTE applied for (e) DoS, (f) Fuzzy, (g) RPM-based, and (h) GEAR-based attacks.
Table 1. Data statistics.

Attack Type | Number of Messages | Normal Messages | Injected Messages | Injected (%)
DoS   | 3,665,771 | 3,078,250 | 587,521 | 16.03%
Fuzzy | 3,838,860 | 3,347,013 | 491,847 | 12.81%
RPM   | 4,621,702 | 3,966,805 | 654,897 | 14.17%
GEAR  | 4,443,142 | 3,845,890 | 597,252 | 13.44%
Table 2. Example of inconsistent data structure in DoS attack data.

CAN ID | DLC | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
05f0 | 2 | 01 | 00 | R | NaN | NaN | NaN | NaN | NaN | NaN
Table 3. Example of cleaned DoS attack data with DLC = 2.

CAN ID | DLC | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
05f0 | 2 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Table 4. Example of DoS attack data with DLC = 8.

CAN ID | DLC | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
018f | 8 | fe | 5b | 00 | 00 | 00 | 3c | 00 | 00 | R
0260 | 8 | 19 | 21 | 22 | 30 | 08 | 8e | 6d | 3a | R
02a0 | 8 | 64 | 00 | 9a | 1d | 97 | 02 | bd | 00 | R
0329 | 8 | 40 | bb | 7f | 14 | 11 | 20 | 00 | 14 | R
0545 | 8 | d8 | 00 | 00 | 8a | 00 | 00 | 00 | 00 | R
Table 5. Summary statistics of cleaned CAN bus data for DoS attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770 | 3,665,770
Unique | 27 | 108 | 71 | 75 | 26 | 190 | 256 | 75 | 256 | 2
Top    | 0000 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 587,521 | 1,623,283 | 1,850,549 | 2,391,587 | 1,989,884 | 1,973,752 | 1,595,567 | 2,266,679 | 2,171,246 | 3,078,249
Table 6. Summary statistics of cleaned CAN bus data for Fuzzy attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859 | 3,838,859
Unique | 2,048 | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 256 | 2
Top    | 0316 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 182,121 | 1,141,824 | 1,375,200 | 1,982,150 | 1,497,877 | 1,489,162 | 1,136,017 | 1,857,467 | 1,753,764 | 3,347,009
Table 7. Summary statistics of cleaned CAN bus data for RPM Gauge Spoofing attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701 | 4,621,701
Unique | 26 | 113 | 85 | 88 | 28 | 192 | 256 | 80 | 256 | 2
Top    | 0316 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 871,230 | 1,467,124 | 1,710,232 | 2,406,008 | 1,888,765 | 1,790,965 | 1,298,354 | 2,822,119 | 2,131,776 | 3,966,804
Table 8. Summary statistics of cleaned CAN bus data for GEAR Drive Spoofing attack.

Metric | CAN ID | DATA0 | DATA1 | DATA2 | DATA3 | DATA4 | DATA5 | DATA6 | DATA7 | Flag
Count  | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142 | 4,443,142
Unique | 26 | 150 | 130 | 97 | 36 | 221 | 256 | 107 | 256 | 2
Top    | 043f | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | R
Freq   | 804,541 | 1,260,708 | 1,654,137 | 2,331,855 | 1,827,363 | 1,736,104 | 1,858,690 | 2,702,180 | 2,662,037 | 3,845,889
Table 9. Initial random forest performance metrics.

Class | Precision | Recall | F1-Score | Support
Normal (0) | 1.000 | 1.000 | 1.000 | 615,650
Attack (1) | 1.000 | 1.000 | 1.000 | 117,504
Total | 1.000 | 1.000 | 1.000 | 733,154

Note: Metrics were calculated using stratified 10-fold cross-validation.
Table 10. Model configuration parameters.

Parameter | Value
Trees (n_estimators) | 100
Class weighting | Balanced
Max tree depth | 10
Parallelization | Full (n_jobs = -1)
Random seed | 42
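The Table 10 settings map onto a scikit-learn call like the following; this is a configuration sketch only, with the training data and surrounding pipeline assumed.

```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest configured with the Table 10 parameters.
rf = RandomForestClassifier(
    n_estimators=100,         # number of trees
    class_weight="balanced",  # reweight classes inversely to their frequency
    max_depth=10,             # cap tree depth to limit overfitting
    n_jobs=-1,                # use all available CPU cores
    random_state=42,          # reproducible results
)
```

With `class_weight="balanced"`, scikit-learn computes per-class weights as n_samples / (n_classes × class count), which serves the same purpose as the manual ratio discussed below Table 11.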
Table 11. Optimized classification performance metrics.

Class | Precision | Recall | F1-Score | Support
Normal (0) | 1.00 | 0.97 | 0.99 | 615,650
Attack (1) | 0.87 | 1.00 | 0.93 | 117,504
Total | 0.98 | 0.98 | 0.98 | 733,154

Note: Manual class weighting was applied using the empirical ratio 3,078,250 / 587,521 ≈ 5.24.
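The ratio in the note follows directly from the DoS class counts in Table 1; the weight dictionary below is one illustrative way to apply it, not necessarily the authors' exact call.

```python
normal_msgs = 3_078_250  # normal messages in the DoS dataset (Table 1)
attack_msgs = 587_521    # injected messages in the DoS dataset (Table 1)

ratio = normal_msgs / attack_msgs
print(f"imbalance ratio: {ratio:.2f}")  # ≈ 5.24

# e.g. upweight the minority (attack) class by the empirical ratio
class_weight = {0: 1.0, 1: round(ratio, 2)}
```

A dictionary of this form can be passed as the `class_weight` argument of scikit-learn classifiers in place of the automatic "balanced" mode.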
Table 12. XGBoost model configuration parameters.

Parameter | Value
scale_pos_weight | 8
objective | binary:logistic
eval_metric | aucpr
max_depth | 6 (prevents overfitting)
subsample | 0.8
colsample_bytree | 0.8
random_state | 42

Note: XGBoost performed comparably to Random Forest without any additional tuning.
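The Table 12 settings correspond to a parameter dictionary like the one below; passing it to `xgboost.XGBClassifier(**params)` is an assumed usage sketch, not the authors' verbatim code.

```python
# XGBoost configuration mirroring Table 12 (keys are xgboost parameter names).
params = {
    "scale_pos_weight": 8,           # upweight the minority (attack) class
    "objective": "binary:logistic",  # binary classification with probabilities
    "eval_metric": "aucpr",          # area under the precision-recall curve
    "max_depth": 6,                  # shallow trees to prevent overfitting
    "subsample": 0.8,                # row subsampling per boosting round
    "colsample_bytree": 0.8,         # feature subsampling per tree
    "random_state": 42,
}
```

Using `aucpr` rather than plain `auc` as the evaluation metric is a common choice for imbalanced data, since precision-recall curves are more sensitive to minority-class errors.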
Table 13. Performance before and after feature engineering.

Metric | Before Feature Engineering | After Feature Engineering
Accuracy | 0.9766 | 0.9769
Confusion matrix | [[598,520, 17,130], [0, 117,504]] | [[598,221, 17,417], [0, 117,516]]
F1-Score (Class 0) | 0.99 | 0.99
F1-Score (Class 1) | 0.93 | 0.93
Macro Avg F1-Score | 0.96 | 0.96
Weighted Avg F1-Score | 0.98 | 0.98
Table 14. Classification report (example fold).

Class | Precision | Recall | F1-Score
0 (Negative) | 1.00 | 0.99 | 1.00
1 (Positive) | 0.95 | 1.00 | 0.98
Overall Accuracy | – | – | 0.99
Macro Avg | 0.98 | 1.00 | 0.99
Weighted Avg | 0.99 | 0.99 | 0.99
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
