Abstract
Connected and Autonomous Vehicles (CAVs) are exposed to increasingly sophisticated cyber threats hidden within high-dimensional, heterogeneous network traffic. A critical bottleneck in existing Intrusion Detection Systems (IDS) is the feature heterogeneity gap: discrete protocol signatures (e.g., flags, services) and continuous traffic statistics (e.g., flow duration, packet rates) reside in disjoint latent spaces. Traditional deep learning approaches typically rely on naive feature concatenation, which fails to capture the intricate, non-linear semantic dependencies between these modalities, leading to suboptimal performance on long-tail, minority attack classes. This paper proposes HCA-IDS, a novel framework centered on Semantics-Aware Cross-Modal Alignment. Unlike heavy-weight models, HCA-IDS adopts a streamlined Multi-Layer Perceptron (MLP) backbone optimized for edge deployment. We introduce a dedicated Multi-Head Cross-Attention mechanism that explicitly utilizes static “Pattern” features to dynamically query and re-weight relevant dynamic “State” behaviors. This architecture forces the model to learn a unified semantic manifold where protocol anomalies are automatically aligned with their corresponding statistical footprints. Empirical assessments on the NSL-KDD and CIC-IDS2018 datasets, validated through rigorous 5-Fold Cross-Validation, substantiate the robustness of this approach. The model achieves a Macro-F1 score of over 94% on 7 consolidated attack categories, exhibiting exceptional sensitivity to minority attacks (e.g., Web Attacks and Infiltration). Crucially, HCA-IDS is ultra-lightweight, with a model size of approximately 1.00 MB and an inference latency of 0.0037 ms per sample. These results confirm that explicit semantic alignment combined with a lightweight architecture is key to robust, real-time intrusion detection in resource-constrained CAVs.
1. Introduction
Connected and Autonomous Vehicles (CAVs) have emerged as a paradigm shift in intelligent transportation systems, facilitating ubiquitous Vehicle-to-Everything (V2X) communication through cutting-edge technologies like 5G and next-generation Wi-Fi. In recent years, advances in onboard sensing and V2V communication have driven CAV research beyond single-vehicle control towards increasingly complex cooperative maneuvers in dynamic traffic scenarios, such as lane changing, merging, collision avoidance, and platoon evolution based on multi-agent hierarchical architectures [1]. Furthermore, the rapid development of CAVs has empowered fine-grained urban traffic management, exemplified by recent advances in privacy-preserving cycle-based arrival profile estimation for cross-company collaboration [2]. These V2X services facilitate the sharing of attribute information and dynamic status (e.g., location, speed, driving intentions), laying the foundation for high-level information interaction and autonomous driving [3]. However, this expanded connectivity surface also exposes CAVs to severe security threats; potential cyber intrusions can lead to uncontrollable and dangerous behaviors [4]. For instance, a report by Tencent Keen Security Lab demonstrated that attackers could exploit vulnerabilities in Bluetooth and diagnostic functions to remotely control a Lexus vehicle, performing unintended physical operations [5]. This evidence underscores that illegal access to vehicle control systems poses a life-threatening risk to drivers and passengers. Therefore, the implementation of advanced Intrusion Detection Systems (IDS) has emerged as an imperative countermeasure against such cyber threats. By facilitating real-time analysis of communication dynamics, IDSs enable the rapid detection of adversarial incursions, effectively safeguarding the operational integrity and reliability of CAV networks [6,7].
Network Intrusion Detection Systems (NIDS) are conventionally classified into two primary paradigms: signature-based and anomaly-based. Specifically, Signature-based IDS (SIDS) operates by scrutinizing real-time traffic flows and cross-referencing patterns against a repository of known threat signatures. While effective for known threats, the efficacy of SIDS is contingent upon the comprehensiveness of signature databases, rendering them ineffective against zero-day attacks or polymorphic threats. It is thus essential to ensure the periodic maintenance and updating of the system [8]. Conversely, Anomaly-based Intrusion Detection Systems (AIDS) identify intrusions by modeling normal network behavior, thereby providing protection against unknown attacks [9]. In the design of AIDS, machine learning (ML) techniques such as Decision Tree (DT) [10] and Support Vector Machine (SVM) [11] have been extensively employed. However, these traditional ML methods often rely on manual feature engineering, which suffers from prolonged processing times and limited scalability when identifying complex attack patterns in high-dimensional data.
Driven by its transformative success in Natural Language Processing (NLP) and Computer Vision (CV), Deep Learning (DL) has emerged as a dominant paradigm for Intrusion Detection Systems (IDS) in Connected and Autonomous Vehicles (CAVs). The increasing prevalence of DL stems from its superior efficacy in automated feature extraction and its robustness in handling high-dimensional, heterogeneous data. Recent literature highlights a diverse array of architectures employed to secure vehicular networks, including Convolutional Neural Networks (CNN) [12], Autoencoders (AE) [13], Recurrent Neural Networks (RNN) [14], Long Short-Term Memory (LSTM) [15], Generative Adversarial Networks (GAN) [16], and Deep Belief Networks (DBN) [17].
Despite these advancements, a critical research gap remains: the majority of current efforts concentrate on stacking complex model architectures while treating network traffic as a homogeneous data stream. In reality, CAV network traffic is inherently heterogeneous, consisting of discrete Pattern data and continuous State data. Existing methods typically use simple concatenation to fuse these features, ignoring the intrinsic semantic misalignment between them. For example, a specific protocol flag (Pattern) usually dictates a specific statistical threshold (State). Ignoring this correlation prevents the model from fully exploring the potential information in the data, limiting its robustness and generalization ability [18]. Crucially, this limitation leads to poor performance on minority attack classes in imbalanced datasets. Furthermore, complex architectures often incur high computational overhead and latency. Since CAVs operate in resource-constrained edge environments requiring millisecond-level responses, simply increasing model parameters to improve accuracy is impractical. Therefore, it is necessary to reconstruct the IDS architecture from a “data-first” perspective, balancing high-precision semantic alignment with ultra-lightweight deployment capabilities.
To address the critical feature heterogeneity gap in existing IDS, this work proposes HCA-IDS, a novel framework centered on Semantics-Aware Cross-Modal Alignment. Existing methods often ignore that discrete protocol signatures and continuous traffic statistics reside in disjoint latent spaces, relying on naive concatenation that fails to capture intricate semantic dependencies. In contrast, our core innovation lies in explicitly decoupling these features and introducing a Multi-Head Cross-Attention mechanism. This architecture forces the model to learn a unified semantic manifold where static “Pattern” features can dynamically query and re-weight relevant dynamic “State” behaviors, achieving precise alignment that previous methods failed to capture. We evaluate our model on NSL-KDD and the complex, highly imbalanced CIC-IDS2018 dataset [19]. The results affirm that grounding IDS architecture in the intrinsic structure of traffic data yields a more accurate and robust solution.
The main contributions of this paper are as follows:
- (1) We propose HCA-IDS, a lightweight framework optimized for the resource-constrained edge environments of CAVs. Unlike traditional approaches that rely on heavy convolutional layers, we utilize a streamlined MLP backbone to process heterogeneous data streams (Patterns and States). This design maximizes computational efficiency while preserving the distinct statistical properties of different feature modalities.
- (2) We introduce a novel Semantics-Aware Cross-Modal Alignment mechanism to bridge the feature heterogeneity gap. By leveraging Multi-Head Cross-Attention, the model explicitly utilizes static protocol patterns to dynamically query relevant traffic statistics. This allows for the learning of a unified semantic manifold, significantly enhancing the detection of long-tail minority attacks without increasing model depth.
- (3) We conduct comprehensive experiments on NSL-KDD and CIC-IDS2018 datasets using rigorous 5-Fold Cross-Validation. HCA-IDS demonstrates exceptional robustness, achieving a Macro-F1 score of over 94% on 7 consolidated attack categories. Crucially, with a model size of 1.00 MB and ultra-low inference latency (0.0037 ms), our solution proves its viability for real-time deployment, striking an optimal balance between high-precision semantic alignment and efficiency.
The remainder of the paper is structured as follows: Section 2 introduces related research work; Section 3 describes the dataset analysis and preprocessing methods; Section 4 details the structure of the proposed HCA-IDS model. Section 5 presents the experimental setup, evaluation criteria, and a comprehensive discussion of the results, specifically analyzing the performance on minority classes. Finally, Section 6 summarizes the work of this paper.
2. Related Work
The evolution of Intrusion Detection Systems (IDS) has progressed from conventional statistical methods to advanced deep learning architectures. Recent research in CAVs has moved beyond single-vehicle intelligence to complex cooperative strategies. For example, Liang et al. [1] proposed a hierarchical architecture based on Multi-Agent Systems (MAS) to optimize the cooperative control of CAV platoons, demonstrating the potential of V2X communication in dynamic traffic management. However, as the complexity of these cooperative interactions increases, so does the attack surface. The reliance on frequent data exchange makes the communication links vulnerable to malicious injections, necessitating robust security mechanisms like IDS. This section reviews existing methodologies, categorized into traditional machine learning approaches, deep learning-based hybrid models, and recent advances in attention mechanisms, while highlighting the limitations that motivate our proposed framework.
2.1. Traditional Machine Learning and Ensemble Approaches
Early research heavily relied on conventional Machine Learning (ML) algorithms, which depend on rigorous data analysis and manual feature engineering. Al-Saud et al. [20] proposed an outlier detection model based on One-Class SVM (OCSVM), utilizing ID frequency as a pivotal feature and employing the Social Spider Optimization algorithm to tune hyperparameters. Similarly, Wazirali et al. [21] introduced a semi-supervised approach for KNN hyperparameter tuning. By evaluating the distance metrics and weights of nearest neighbors, their method effectively reduced false alarms on unlabeled data. To address the challenges of high-dimensional data, Zhang et al. [22] developed the IDTSRF model, integrating three-way decision theory with Random Forest. This approach optimizes feature selection by evaluating attribute importance via decision boundary entropy. Furthermore, ensemble learning has shown promise in improving robustness. Yang et al. [23] proposed the LCCDE framework, a hybrid ensemble strategy that dynamically selects the best-performing model (XGBoost, LightGBM, or CatBoost) for each specific attack category based on prediction confidence. While these ML-based methods demonstrate high accuracy in specific scenarios, they fundamentally struggle with the “curse of dimensionality” and require extensive domain knowledge for manual feature extraction [24].
2.2. Deep Learning and Hybrid Architectures
Deep Learning (DL) has transformed IDS design by automating feature extraction, leveraging successes in Natural Language Processing (NLP) [25] and Computer Vision [26]. Recurrent neural networks, particularly LSTMs, are widely used to capture temporal dependencies in traffic flows. For instance, the deployment of Bidirectional LSTMs (BiLSTM) has proven effective in boosting detection accuracy, as evidenced by Sri et al. [27] and Imrana et al. [28]. Specifically, Imrana et al. highlighted that BiDLSTM models offer more robust generalization than standard LSTMs when mitigating specific threats like User-to-Root (U2R) and Remote-to-Local (R2L) intrusions on the NSL-KDD dataset.
Recognizing that network traffic contains both spatial and temporal characteristics, researchers have moved towards hybrid architectures. Halbouni et al. [29] developed a hybrid architecture that integrates CNNs for spatial feature abstraction with LSTMs for temporal sequence learning. Evaluated on CIC-IDS2017 and UNSW-NB15, their model achieved superior F1-scores compared to single-model baselines. However, most hybrid models employ a homogeneous fusion strategy, typically concatenating features from different encoders into a single vector. This “loose coupling” ignores the inherent semantic heterogeneity between discrete protocol fields (Pattern) and continuous traffic statistics (State), limiting the model’s ability to learn complex, non-linear correlations.
Although the aforementioned DL-based methods (e.g., CNNs, LSTMs, and Transformers) have achieved promising detection accuracy, they often suffer from significant limitations in the context of CAVs. First, most architectures rely on computationally expensive operations (e.g., convolution and recurrence), resulting in high parameter counts and inference latency that exceed the constraints of onboard edge devices. Second, they overlook the possibility that simpler architectures like Multi-Layer Perceptrons (MLP), when combined with proper feature alignment, can achieve comparable performance at a fraction of the computational cost. Therefore, unlike existing works that stack complex layers, this paper seeks to explore a lightweight, “data-centric” approach that balances semantic alignment with real-time efficiency.
2.3. Attention Mechanisms and Data Challenges
2.3.1. Attention Mechanisms in IDS
To enhance model interpretability and capture long-range dependencies, attention mechanisms have been increasingly integrated into IDS. Early approaches, such as the CLAM model proposed by Sun et al. [30], combined CNN and LSTM with a Temporal Attention mechanism. This approach calculates weights for LSTM hidden states to identify locally significant time steps, effectively filtering out noise in traffic sequences.
More recently, with the success of Transformers [31], Self-Attention mechanisms have been applied to packet analysis to capture global contextual correlations. For instance, recent studies have utilized multi-head self-attention to model the interactions between bytes within a packet or flows within a session. Similarly, in CNN-based IDSs, Channel Attention (e.g., Squeeze-and-Excitation networks) is often used to recalibrate the importance of spatial feature maps.
However, a critical limitation remains in existing literature: most works apply attention in an intra-modal manner (i.e., attending to dependencies within a single data representation, such as a sequence of bytes or a vector of statistics). Few studies have explored “Cross-Attention” between heterogeneous data modalities, specifically how a discrete protocol pattern (e.g., HTTP header structure) explicitly influences the semantic significance of a continuous traffic statistic (e.g., flow duration). This “semantic gap” limits the model’s ability to correlate multi-source features effectively.
2.3.2. Data Imbalance and Generalization
Furthermore, the performance of data-driven models is critically dependent on data quality and distribution [32]. While most studies validate performance on the synthetic NSL-KDD dataset, real-world vehicular networks face more complex threats. The CIC-IDS2018 dataset [19] provides a more realistic benchmark with modern attacks like Botnet and DDoS. A major challenge in CIC-IDS2018 is the extreme class imbalance, where minority attacks constitute less than 0.01% of traffic. Standard DL models often fail to detect these rare events, yielding near-zero F1-scores for minority classes due to the dominance of benign traffic gradients. Current research rarely addresses this cross-modal semantic alignment and class imbalance issue simultaneously.
To bridge these gaps, our work diverges from the architecture-centric trend. We propose a data-characteristic-aware framework that explicitly decomposes traffic into Pattern and State streams. By employing a Heterogeneous Cross-Attention mechanism for semantic alignment, our model not only captures the interaction between heterogeneous features but also significantly improves robustness against long-tail, minority attacks in realistic datasets like CIC-IDS2018.
3. Dataset and Data Preprocessing
To ensure the proposed model is robust across both established benchmarks and realistic modern traffic scenarios, we employ two distinct datasets: the classic NSL-KDD and the state-of-the-art CIC-IDS2018.
3.1. NSL-KDD Dataset
The NSL-KDD dataset [33] serves as a statistically refined iteration of the legacy KDD Cup 99 archive [34]. By eliminating the redundant and duplicate records that skewed the original distribution, NSL-KDD offers a more rigorous benchmark for evaluating IDS performance. The dataset is structured into a training set (KDDTrain+) containing 125,973 samples and a testing set (KDDTest+) with 22,544 samples. Each instance is defined by a 43-dimensional vector, comprising 41 traffic attributes, a class label, and a difficulty score. As detailed in Table 1, the attack scenarios are taxonomized into four primary families:
Table 1.
Composition of the NSL-KDD dataset.
As summarized in Table 1, the dataset encompasses four distinct attack families alongside normal traffic. These categories include Denial of Service (DoS), which aims to deplete target resources to disrupt legitimate access; Probe, involving surveillance activities for network reconnaissance; Remote to Local (R2L), representing unauthorized remote access attempts such as password guessing; and User to Root (U2R), where local users exploit vulnerabilities to escalate privileges. The statistical distribution reveals a significant class imbalance: while Normal traffic (53.46%) and DoS attacks (36.46%) dominate the training set, the U2R and R2L categories represent a minority, constituting only 0.04% and 0.79% of the training records, respectively. This scarcity poses a substantial challenge for detection algorithms.
3.2. CIC-IDS2018 Dataset
While NSL-KDD serves as a standard baseline, it does not fully reflect modern encrypted traffic and complex attack patterns. To validate the technical quality and generalizability of our model, we incorporate the CSE-CIC-IDS2018 dataset [19]. Generated on a realistic AWS network topology, this dataset captures diverse modern attacks including Brute Force, Botnet, DoS, DDoS, Web Attacks, and Infiltration.
A critical challenge in CIC-IDS2018 is the extreme class imbalance. As shown in our experimental setup later, attacks like Heartbleed and Web Attacks constitute less than 0.01% of the total traffic. This reflects real-world CAV scenarios where malicious signals are rare but fatal. We utilize the raw CSV files generated by CICFlowMeter, which provide roughly 80 statistical features per flow (e.g., Flow Duration, Flow IAT, Packet Length Std). To ensure statistical reliability during evaluation, we aggregated the sub-attacks into major distinct categories (detailed in the experimental setup), allowing for a robust assessment of the model’s ability to detect both voluminous and minority threats.
3.3. Heterogeneous Data Preprocessing
The quality of data representation fundamentally dictates model performance. As illustrated in Table 2, network traffic features are inherently heterogeneous, consisting of:
Table 2.
Distribution of feature variables for the NSL-KDD dataset.
- (1) Nominal/Binary Variables: Discrete attributes representing protocols, service types, or status flags. These encode the structural “Pattern” of the traffic.
- (2) Interval/Ratio Variables: Continuous statistical metrics such as duration, byte counts, and error rates. These encode the dynamic “State” of the traffic.
Directly feeding these mixed types into a single neural network layer often leads to suboptimal convergence, as high-variance continuous values can dominate the gradients over sparse binary features. To address this, we implement a Semantics-Aware Preprocessing Pipeline (as shown in Figure 1) to decouple and normalize these features:
Figure 1.
The proposed Heterogeneous Data Preprocessing Pipeline. Raw traffic is decoupled into Pattern (Discrete) and State (Continuous) streams, processed via One-Hot Encoding and Normalization respectively, to prepare for the dual-encoder architecture.
Feature Decoupling and Encoding
First, irrelevant features are removed to prevent the model from learning distinct identifiers. The remaining features are processed based on their semantic type:
Pattern Data Processing: Nominal variables are mapped to high-dimensional sparse vectors using One-Hot Encoding. Binary variables are retained as is. These form the Pattern Input ($X_P$).
State Data Processing: Continuous variables (e.g., Serror_rate) exhibit large variations in scale. We apply Min-Max Normalization to map these values into the range $[0, 1]$. This prevents numerical instability and ensures that the State Input ($X_S$) is comparable across different dimensions:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ denote the minimum and maximum observed values of the feature.
This explicit separation allows our subsequent Heterogeneous Encoders to extract features from $X_P$ and $X_S$ independently before semantic alignment, thereby maximizing the information gain from both modalities.
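The decoupling step can be illustrated with a minimal, standard-library sketch. It assumes flows arrive as dictionaries; the helper names (`one_hot`, `min_max`, `split_streams`) and the toy values are hypothetical, and the field names are borrowed from NSL-KDD for illustration only. It is not the authors' pipeline.

```python
from typing import Dict, List, Sequence, Tuple


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """One-hot encode `value`; values outside the vocabulary map to all zeros."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def min_max(column: Sequence[float]) -> List[float]:
    """Scale a column into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def split_streams(
    records: Sequence[Dict[str, object]],
    nominal: Sequence[str],
    binary: Sequence[str],
    continuous: Sequence[str],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (pattern, state): one-hot/binary vectors and min-max scaled vectors."""
    # Vocabularies and min/max are fitted on the records passed in; in a real
    # experiment they would be fitted on the training split only (assumption).
    vocab = {f: sorted({str(r[f]) for r in records}) for f in nominal}
    scaled = {f: min_max([float(r[f]) for r in records]) for f in continuous}
    pattern, state = [], []
    for i, r in enumerate(records):
        p: List[float] = []
        for f in nominal:
            p.extend(one_hot(str(r[f]), vocab[f]))
        p.extend(float(r[f]) for f in binary)  # binary flags are kept as is
        pattern.append(p)
        state.append([scaled[f][i] for f in continuous])
    return pattern, state


# Toy flows with made-up values.
records = [
    {"protocol_type": "tcp", "flag": "SF", "logged_in": 1, "duration": 0, "src_bytes": 200},
    {"protocol_type": "udp", "flag": "SF", "logged_in": 0, "duration": 10, "src_bytes": 100},
    {"protocol_type": "tcp", "flag": "S0", "logged_in": 0, "duration": 5, "src_bytes": 0},
]
pattern, state = split_streams(
    records, ["protocol_type", "flag"], ["logged_in"], ["duration", "src_bytes"]
)
print(pattern[0], state[0])
```

Keeping the two outputs as separate lists mirrors the dual-encoder design: the sparse 0/1 vector and the scaled statistics never share an input layer.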
4. HCA-IDS Architecture
A visual representation of the proposed HCA-IDS architecture is provided in Figure 2. Following the heterogeneous data preprocessing, the input traffic flow is decomposed into two distinct streams: Pattern Data ($X_P$), comprising binary variables that encode discrete communication semantics, and State Data ($X_S$), consisting of normalized continuous variables reflecting behavioral dynamics.
Figure 2.
The overall architecture of HCA-IDS (Refined with MLP Backbones).
To address the limitations of static concatenation and high computational costs in CAVs, we design a lightweight dual-branch architecture. This consists of dedicated heterogeneous MLP-based encoders to extract modality-specific features efficiently, followed by a Semantics-Aware Cross-Attention Fusion Module. This module enables the model to dynamically utilize protocol information (Pattern) to query and highlight anomalous statistical behaviors (State). Finally, a compact classification head predicts the specific attack category.
4.1. Heterogeneous Feature Encoders
We design two parallel encoders. Unlike previous works that employ heavy convolutional layers, we adopt streamlined Multi-Layer Perceptrons (MLP) for both branches to minimize inference latency on edge devices while maintaining feature expressiveness.
4.1.1. Pattern Encoder (MLP-Based)
The pattern input $X_P$ (where $X_P \in \mathbb{R}^{114}$) is sparse and categorical. To capture the non-linear co-occurrence between protocols and service flags, we employ a fully connected (FC) network. As shown in Table 3, the encoder consists of three layers with decreasing dimensions (114 → 48 → 16 → 8). This “bottleneck” design forces the network to learn a compact semantic embedding $E_P$, filtering out noise from the sparse one-hot vectors.
Table 3.
Configuration of Heterogeneous Encoders (Lightweight MLP Design).
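To make the bottleneck concrete, the following sketch counts the parameters of a 114 → 48 → 16 → 8 fully connected stack and runs one forward pass in plain Python. The random initialization, the ReLU after every layer, and the absence of normalization layers are assumptions made for illustration; they are not taken from Table 3.

```python
import random

LAYER_DIMS = [114, 48, 16, 8]  # pattern-encoder widths quoted in the text


def count_parameters(dims):
    """Weights plus biases of a fully connected stack with the given widths."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


def init_mlp(dims, seed=0):
    """Create (weights, biases) pairs with small random weights (illustrative only)."""
    rng = random.Random(seed)
    layers = []
    for d_in, d_out in zip(dims, dims[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        layers.append((weights, [0.0] * d_out))
    return layers


def forward(layers, x):
    """Apply each dense layer followed by ReLU (assumed activation)."""
    for weights, biases in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x


# A sparse 0/1 input with three active positions (made-up indices).
x_pattern = [0.0] * 114
for active in (3, 80, 110):
    x_pattern[active] = 1.0

encoder = init_mlp(LAYER_DIMS)
embedding = forward(encoder, x_pattern)
n_params = count_parameters(LAYER_DIMS)
print(n_params, len(embedding))
```

Under these assumptions the pattern branch holds only a few thousand parameters, so it accounts for a small share of the overall model.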
4.1.2. State Encoder (MLP-Based)
The state input represents continuous statistical metrics (e.g., flow duration, packet inter-arrival times) reflecting the global behavioral profile of the traffic. To efficiently capture the complex, non-linear correlations among these continuous variables, we employ a high-efficiency MLP Encoder.
This encoder consists of three stacked dense layers, each equipped with Batch Normalization (BN) and ReLU activation to ensure rapid convergence and stable gradient propagation. As detailed in Table 3, the network progressively projects the high-dimensional statistical features into a lower-dimensional latent space. This hierarchical compression extracts the most discriminative behavioral features while discarding redundant noise, yielding a robust state embedding vector $E_S$ optimized for the subsequent cross-modal alignment.
4.2. Semantics-Aware Cross-Attention Fusion
This module is the core innovation of our framework. Traditional methods simply concatenate the pattern embedding $E_P$ and the state embedding $E_S$, which assumes a fixed, linear relationship between protocol types and traffic statistics. However, the relevance of a statistical feature often depends on the protocol context. For instance, a high Source Byte Count might be normal for an FTP transfer (Pattern context) but highly anomalous for a DNS request.
To model this dynamic dependency, we adopt a Multi-Head Cross-Attention mechanism. Unlike standard self-attention that operates on temporal sequences, our mechanism is designed to align heterogeneous feature vectors. We designate the Pattern embedding $E_P$ as the Query (Q), representing the “semantic context,” and the State embedding $E_S$ as both Key (K) and Value (V), representing the “behavioral evidence.”
To enable fine-grained alignment, we project the input vectors into h distinct semantic subspaces (heads). Mathematically, for the i-th head:
$$Q_i = E_P W_i^{Q}, \quad K_i = E_S W_i^{K}, \quad V_i = E_S W_i^{V}$$
where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are learnable projection matrices. This projection effectively decomposes the global state vector into multiple latent behavioral aspects.
The attention weights are calculated to measure the semantic compatibility between the protocol pattern and each behavioral aspect:
$$\alpha_i = \sigma\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right), \qquad \mathrm{head}_i = \alpha_i V_i$$
Here, $d_k$ is the per-head dimension and $\sigma(\cdot)$ is a Sigmoid-like activation function (or Softmax across heads) that determines the relevance score. This creates a “gating” effect, where the protocol context dynamically re-weights specific dimensions of the state features, amplifying relevant statistics while suppressing noise.
Finally, the outputs from all heads are concatenated and passed through a linear projection $W^{O}$. To prevent feature degradation and ensure stable gradient flow, we employ a residual connection followed by Layer Normalization:
$$Z = \mathrm{LayerNorm}\big(E_P + \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}\big)$$
This fused vector $Z$ contains the pattern information enriched by contextually aligned state details, providing a robust representation for the final classification.
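The fusion step can be sketched in plain Python as follows. This is one possible reading of the description, not the authors' implementation: it assumes a single pattern vector and a single state vector per flow, a sigmoid gate per head, a residual connection from the pattern embedding, and Layer Normalization without learnable scale and shift. The class name `CrossAttentionFusion`, the dimensions (8 and 16), and the random weights are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_norm(vector, eps=1e-5):
    """Normalize to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]


class CrossAttentionFusion:
    """Pattern embedding acts as Query; state embedding supplies Key and Value."""

    def __init__(self, d_pattern, d_state, num_heads, seed=0):
        assert d_pattern % num_heads == 0, "heads must evenly split d_pattern"
        rng = random.Random(seed)
        self.d_head = d_pattern // num_heads
        self.heads = [
            {
                "W_q": random_matrix(self.d_head, d_pattern, rng),
                "W_k": random_matrix(self.d_head, d_state, rng),
                "W_v": random_matrix(self.d_head, d_state, rng),
            }
            for _ in range(num_heads)
        ]
        self.W_o = random_matrix(d_pattern, d_pattern, rng)

    def __call__(self, e_pattern, e_state):
        concat, gates = [], []
        for head in self.heads:
            q = matvec(head["W_q"], e_pattern)
            k = matvec(head["W_k"], e_state)
            v = matvec(head["W_v"], e_state)
            # Scaled dot product between pattern query and state key.
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(self.d_head)
            gate = sigmoid(score)  # relevance of this behavioral aspect
            gates.append(gate)
            concat.extend(gate * x for x in v)
        projected = matvec(self.W_o, concat)
        # Residual connection from the pattern embedding, then LayerNorm.
        fused = layer_norm([p + o for p, o in zip(e_pattern, projected)])
        return fused, gates


fusion = CrossAttentionFusion(d_pattern=8, d_state=16, num_heads=2)
e_pattern = [0.1 * i for i in range(8)]        # made-up pattern embedding
e_state = [0.05 * (i % 5) for i in range(16)]  # made-up state embedding
fused, gates = fusion(e_pattern, e_state)
print(len(fused), [round(g, 3) for g in gates])
```

Because each flow contributes one query and one key, a softmax over a single score would always equal 1; the sigmoid gate (or a softmax taken across heads) is what lets the pattern context scale each head's contribution.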
4.3. Classification Head
The aligned feature vector is fed into a classification head consisting of three fully connected layers (e.g., 64 → 32 → $C$, where $C$ is the number of attack classes). To mitigate overfitting, especially given the high complexity of the fusion module, we apply Dropout with a rate of 0.5 before the final layer. The output layer utilizes a Softmax function to produce the probability distribution over the attack classes.
This architecture ensures that the final decision is based on semantically aligned heterogeneous features, significantly enhancing the detection capability for minority classes with subtle anomalies.
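As a small illustration of the output layer, the sketch below converts raw class scores into a probability distribution with a numerically stable Softmax and picks the most likely class. The seven logits are made-up values chosen only to match the number of consolidated categories.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, -1.0, 0.0, 0.1, -0.3, 1.2]  # hypothetical scores for 7 classes
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 3))
```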
5. Experiments
5.1. Experimental Settings
5.1.1. Data Partitioning and Balancing
To rigorously evaluate HCA-IDS, we adopted distinct partitioning strategies for the two datasets to align with their respective benchmarks.
NSL-KDD Setup: We followed the standard benchmark protocol, utilizing the predefined KDDTrain+ set for training and the KDDTest+ set for performance evaluation. This ensures our results are directly comparable with existing literature.
CIC-IDS2018 Setup: For the CIC-IDS2018 dataset, we consolidated the attack types into 7 major categories to ensure statistical significance. Specifically, we excluded ultra-minority classes (e.g., Heartbleed, <0.001%) where sample sizes were insufficient for reliable variance analysis. We implemented a Stratified 5-Fold Cross-Validation. The dataset was shuffled and split into 5 folds; in each iteration, 4 folds were used for training and 1 for testing.
Handling Imbalance: As detailed in Table 4, the raw CIC-IDS2018 distribution is highly skewed. To prevent the model from biasing towards majority classes, we applied adaptive oversampling (SMOTE) exclusively to the training folds. The test folds remained strictly comprised of original, raw samples to reflect real-world detection difficulty. Table 4 lists the exact sample sizes used in our 5-fold experiments.
Table 4.
Detailed Sample Distribution for CIC-IDS2018 (Average per Fold in 5-Fold CV). Note that oversampling is applied only to the training set to prevent data leakage.
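The protocol of oversampling only the training folds can be sketched with the standard library. This is a simplified illustration under stated assumptions: the toy data, the choice to balance every class up to the majority count, and the neighbor count are hypothetical, and `smote` here is a compact re-implementation of the interpolation idea rather than the library routine used in the experiments.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Return k index lists, each with roughly the same class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def smote(samples, n_new, k_neighbors, rng):
    """Synthesize points on segments between a sample and one of its nearest neighbors."""
    if n_new <= 0:
        return []
    if len(samples) < 2:
        return [list(samples[0]) for _ in range(n_new)]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbor = rng.choice(others[:k_neighbors])
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbor)])
    return synthetic


def oversample(features, labels, rng, k_neighbors=5):
    """Balance every class up to the majority count (assumed target)."""
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    target = max(len(v) for v in by_class.values())
    out_x, out_y = list(features), list(labels)
    for y, samples in by_class.items():
        extra = smote(samples, target - len(samples), k_neighbors, rng)
        out_x.extend(extra)
        out_y.extend([y] * len(extra))
    return out_x, out_y


rng = random.Random(0)
labels = ["benign"] * 40 + ["web"] * 10          # toy imbalance
features = [[rng.random(), rng.random()] for _ in labels]
folds = stratified_folds(labels, k=5)
for test_fold in folds:
    held_out = set(test_fold)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    train_x, train_y = oversample(
        [features[i] for i in train_idx], [labels[i] for i in train_idx], rng
    )
    test_x = [features[i] for i in test_fold]    # original samples only
    test_y = [labels[i] for i in test_fold]
```

Synthetic points are created after the split and only from training samples, so no interpolated copy of a test flow can reach the training set.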
5.1.2. Implementation Details
The proposed HCA-IDS is implemented using PyTorch 1.10 on a workstation equipped with an NVIDIA RTX 3090 GPU.
Parameter Settings: The MLP encoders and attention modules are optimized using the Adam optimizer with a learning rate of and a batch size of 256. To ensure lightweight deployment suitable for CAVs, the hidden dimensions of the MLP encoders were restricted (as detailed in Section 4), resulting in a total model size of approx. 1.00 MB, consistent with the computational profile reported in Section 5.4. The training process runs for 50 epochs with an early stopping mechanism to prevent overfitting.
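The early stopping rule can be sketched as a patience counter over per-epoch validation losses. The patience value, the use of validation loss as the monitored quantity, and the function name `early_stopping_epoch` are assumptions; the text specifies only the 50-epoch budget.

```python
def early_stopping_epoch(val_losses, max_epochs=50, patience=5):
    """Return (stop_epoch, best_epoch) given per-epoch validation losses."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop: no improvement for `patience` epochs
    return min(len(val_losses), max_epochs), best_epoch


# Made-up loss curve: improvement stalls after epoch 3.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.5]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```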
5.2. Evaluation Metrics
To strictly evaluate the performance on minority classes, we employ standard metrics including Accuracy, Precision, Recall, and F1-score. Furthermore, we report the Macro-Average F1, which treats all classes equally regardless of their sample size, providing a fairer assessment of the model’s robustness against long-tail attacks than Weighted-Average metrics.
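The difference between the two averages is easiest to see on a toy example. The sketch below computes per-class F1, then the macro and support-weighted averages, for made-up predictions in which a classifier misses most of a rare class; the labels and counts are illustrative only.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} computed from true/false positives and false negatives."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# 95 benign flows, 5 rare attacks; only 1 of the 5 attacks is detected.
y_true = ["benign"] * 95 + ["web"] * 5
y_pred = ["benign"] * 95 + ["benign"] * 4 + ["web"] * 1
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(round(macro, 4), round(weighted, 4))
```

In this example the weighted average stays near 0.95 even though four of five attacks are missed, whereas the macro average drops to roughly 0.66, which is why Macro-F1 is the more informative metric for long-tail classes.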
5.3. Experimental Results and Analysis
5.3.1. Performance Evaluation on NSL-KDD Benchmark
To establish a comparative baseline, we first evaluated HCA-IDS on the NSL-KDD dataset using Stratified 5-Fold Cross-Validation. Table 5 details the precision, recall, and F1-score for each category, reported as the mean value with standard deviation across folds.
Table 5.
Classification Performance on NSL-KDD (Mean ± Std Dev over 5 Folds).
As indicated in Table 5, the model achieves robust performance on the majority classes (Normal and DoS), with F1-scores exceeding 95%. For the R2L category, which involves stealthy unauthorized access attempts, the model maintains a high F1-score of 87.96%. It is noted that the U2R category exhibits a relatively lower performance (F1 ≈ 54.8%). This is a known characteristic of the NSL-KDD dataset, where the extreme scarcity of U2R samples in the training set limits the model’s ability to generalize boundaries. However, compared to standard baselines that often yield near-zero detection for U2R without synthetic augmentation, our model demonstrates a capability to capture minority patterns effectively.
5.3.2. Performance Evaluation on CIC-IDS2018
To validate the model’s efficacy in modern CAV environments characterized by high-dimensional and encrypted traffic, we conducted experiments on the CIC-IDS2018 dataset. Table 6 presents the quantitative results across 7 consolidated categories.
Table 6.
Classification Performance on CIC-IDS2018 (Mean ± Std Dev over 5 Folds).
Analysis of Long-Tail Distributions: A critical objective of this study was to address the detection of minority attacks, which often degrades under highly imbalanced traffic distributions. As detailed in Table 6:
- Statistical Stability: The low standard deviations across all metrics (e.g., for the Benign and Infiltration classes) indicate that the 5-fold cross-validation yields consistent results, mitigating concerns regarding data leakage or random split bias.
- Minority Class Detection: Despite Web Attacks constituting only 0.15% of the dataset, the model achieves a mean Recall of 90.65% and an F1-score of 82.00%. Similarly, Infiltration attacks are detected with an F1-score of 89.02%.
These results suggest that the proposed Semantics-Aware Cross-Attention mechanism successfully aligns the discrete protocol patterns with continuous flow statistics, thereby amplifying the feature representation of rare attack classes against the background of dominant benign traffic.
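As a rough sanity check on the Web Attacks figures, the precision implied by the reported Recall and F1-score can be recovered from the definition F1 = 2PR / (P + R). The sketch below is illustrative only: the function name is hypothetical, and because the reported values are means over five folds, the identity holds only approximately for them.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P, with F1 and R given as fractions."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision: F1 must be less than 2 * recall")
    return f1 * recall / denom


# Reported Web Attacks means: Recall = 90.65%, F1 = 82.00%.
web_precision = implied_precision(f1=0.8200, recall=0.9065)
print(f"implied precision ≈ {web_precision:.4f}")
```

Under these assumptions the implied precision is roughly 75%, which suggests that the remaining error on this class comes mainly from false alarms rather than from missed attacks.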
5.4. Computational Complexity and Efficiency Analysis
For Intrusion Detection Systems (IDS) deployed in Connected and Autonomous Vehicles (CAVs), the model must strike a balance between detection accuracy and computational overhead. The strictly limited resources of On-Board Units (OBUs) and Electronic Control Units (ECUs) require the algorithm to be lightweight and capable of real-time processing.
To quantitatively assess the deployment feasibility of HCA-IDS, we analyzed its computational complexity and inference speed on an NVIDIA RTX 3090 GPU. The detailed computational profile is presented in Table 7.
Table 7. Computational Profile of HCA-IDS.
Analysis of Real-Time Feasibility:
- Storage Efficiency: As shown in Table 7, the total parameter count of HCA-IDS is 262,343, resulting in a physical storage footprint of approximately 1.00 MB. This compact size is significantly below the storage limits of typical automotive-grade microcontrollers and edge gateways, allowing for seamless integration without requiring hardware upgrades.
- Inference Speed: The average inference time is 0.0037 ms per sample, corresponding to a throughput of over 269,000 flows per second (FPS). In real-world vehicular networks, even high-load CAN FD or Automotive Ethernet traffic rarely reaches such rates. This microsecond-level latency ensures that HCA-IDS can analyze traffic and flag anomalies continuously with negligible delay, leaving ample safety margins for the vehicle control systems to react.
These results confirm that the proposed architecture achieves the design goal of being lightweight and fast, making it highly suitable for real-time deployment in resource-constrained CAV environments.
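The storage and throughput figures above follow from simple arithmetic, sketched below. The 4-byte parameter width is an assumption (32-bit floating-point weights); the precision of the stored weights is not stated in this section.

```python
PARAM_COUNT = 262_343      # total parameter count reported for HCA-IDS
BYTES_PER_PARAM = 4        # assumption: 32-bit floating-point weights
LATENCY_MS = 0.0037        # reported average inference time per sample

# Storage footprint of the raw weights.
size_bytes = PARAM_COUNT * BYTES_PER_PARAM
size_mib = size_bytes / (1024 ** 2)

# Throughput implied by the per-sample latency (1000 ms per second).
flows_per_second = 1000.0 / LATENCY_MS

print(f"weights: {size_bytes} bytes ≈ {size_mib:.2f} MiB")
print(f"throughput ≈ {flows_per_second:,.0f} flows per second")
```

Under the float32 assumption, the weights occupy 1,049,372 bytes, which matches the reported footprint of approximately 1.00 MB when measured in binary megabytes. The reciprocal of the rounded latency gives roughly 270,000 flows per second, consistent with the stated throughput of over 269,000.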
5.5. Comparison with State-of-the-Art Approaches
To assess the relative standing of HCA-IDS within the current literature, we compared its performance against representative Deep Learning-based IDS models. The baselines include CNN-LSTM [29], a standard hybrid spatial-temporal model; CLAM [30], which incorporates attention mechanisms; and LCCDE [23], a high-performance ensemble framework based on tree boosting.
5.5.1. Binary Classification Performance
Table 8 presents the comparison of F1-Scores on both the NSL-KDD and CIC-IDS2018 benchmarks.
Table 8. Comparison of Binary Classification Performance (F1-Score).
While all models exhibit high proficiency on the NSL-KDD dataset, significant divergence is observed on the more challenging CIC-IDS2018 dataset. The F1-score of the CNN-LSTM baseline drops to approximately 94.87%, indicating a struggle to generalize to modern encrypted traffic patterns. In contrast, HCA-IDS maintains a superior F1-score of 99.80%. This result suggests that the proposed heterogeneous feature decoupling provides a more robust representation for binary anomaly detection than traditional homogeneous architectures.
5.5.2. Multi-Class Performance Analysis
The capability to detect specific attack categories, particularly those with low prevalence, is a more rigorous metric for IDS evaluation. Table 9 details the Macro-Average F1-Score and the detection performance on key minority classes in the CIC-IDS2018 dataset.
Table 9. Comparison of Multi-Class Performance (Macro-F1) on CIC-IDS2018.
As shown in Table 9, HCA-IDS achieves the highest Macro-F1 score of 94.06%.
- Addressing the Long-Tail Problem: Conventional Deep Learning models often exhibit a bias towards majority classes. For instance, CNN-LSTM achieves only 75.4% F1-score on the Botnet category. HCA-IDS significantly improves this to 99.4%, demonstrating that the semantic alignment mechanism effectively preserves the gradient signals of minority classes during training.
- Efficiency vs. Accuracy Trade-off: While the ensemble-based LCCDE achieves a slightly higher score on Web Attacks (86.2% vs. 82.0%), it typically requires significantly higher computational resources (storage and memory for multiple tree structures). HCA-IDS offers a competitive detection rate with a lightweight MLP-based architecture (1.00 MB), presenting a more favorable trade-off for resource-constrained CAV deployment.
6. Conclusions
The integration of Connected and Autonomous Vehicles (CAVs) into modern intelligent transportation systems has exacerbated cybersecurity vulnerabilities, necessitating Intrusion Detection Systems (IDS) that can simultaneously satisfy high detection accuracy and strict resource constraints. In this study, we addressed the limitations of existing deep learning-based IDSs, which often suffer from feature misalignment due to homogeneous data processing and high computational overhead. We proposed HCA-IDS, a novel Semantics-Aware Heterogeneous Cross-Attention framework that fundamentally shifts the detection paradigm from an architecture-centric to a data-centric approach. By explicitly decoupling network traffic into discrete protocol patterns and continuous behavioral states, and subsequently aligning them via a lightweight Cross-Attention mechanism, our model effectively resolves the semantic heterogeneity gap. Furthermore, the replacement of heavy convolutional layers with streamlined MLP-based encoders significantly reduces the model’s complexity, ensuring its viability for deployment on resource-constrained vehicular edge computing units.
To strictly validate the robustness and generalization capability of the proposed framework, we conducted extensive experiments using Stratified 5-Fold Cross-Validation on the benchmark NSL-KDD and the high-dimensional CIC-IDS2018 datasets. The experimental results demonstrate that HCA-IDS achieves state-of-the-art performance, recording a Macro-F1 score of 94.06% on the CIC-IDS2018 dataset. Crucially, the model exhibits exceptional resilience against the long-tail distribution problem inherent in real-world traffic, significantly outperforming traditional hybrid baselines in detecting minority threats such as Web Attacks and Botnets. The analysis confirms that the proposed semantic alignment mechanism successfully preserves and amplifies the gradient signals of rare attack classes, preventing them from being overwhelmed by dominant benign traffic.
In terms of deployment feasibility, HCA-IDS demonstrates a superior trade-off between accuracy and efficiency. With a compact model size of approximately 1.00 MB and an average inference latency of just 0.0037 ms per sample, the system supports a throughput exceeding 269,000 flows per second, satisfying the real-time processing requirements of automotive electronic control units. These findings suggest that the proposed lightweight, semantics-aware architecture offers a robust and practical solution for securing next-generation vehicular networks. Future work will focus on evaluating the model’s resilience against adversarial examples to further enhance its security in hostile environments, as well as exploring its implementation on automotive-grade hardware for hardware-in-the-loop validation.
Author Contributions
Q.H., Y.Z., J.L. and Q.L. participated in the discussion of research methodology, designed and performed the experiments, and wrote the original draft. W.Z., M.H., T.Z. and A.X. verified the experimental results, refined the experimental procedures, and reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
This work was financially supported by the National Natural Science Foundation of China (No. 42201464), the Hubei Provincial Natural Science Foundation (No. JCZRQN202500217), and the Youth Talent Support Program of the China Association for Science and Technology. The Hubei Provincial Key Laboratory of Green Intelligent Computing Power Network provided experimental facilities and technical assistance.
Data Availability Statement
Publicly available datasets were analyzed in this study. These data can be found at: NSL-KDD https://www.unb.ca/cic/datasets/nsl.html (accessed on 15 January 2025) and CSE-CIC-IDS2018 https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 20 January 2025).
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
References
- Liang, J.; Li, Y.; Yin, G.; Xu, L.; Lu, Y.; Feng, J.; Shen, T.; Cai, G. A MAS-based hierarchical architecture for the cooperation control of connected and automated vehicles. IEEE Trans. Veh. Technol. 2022, 72, 1559–1573.
- Tan, C.; Yao, J.; Tang, K.; Liang, J.; Yin, G. Privacy-Preserving Cycle-Based Arrival Profile Estimation Based on Cross-Company Connected Vehicles. IEEE Trans. Consum. Electron. 2025, 71, 6167–6182.
- Kim, K.; Kim, J.S.; Jeong, S.; Park, J.-H.; Kim, H.K. Cybersecurity for autonomous vehicles: Review of attacks and defense. Comput. Secur. 2021, 103, 102150.
- Yu, T.; Hu, J.; Yang, J. Intrusion detection in intelligent connected vehicles based on weighted self-information. Electronics 2023, 12, 2510.
- Tencent Keen Security Lab. Lexus Vehicle Safety Research Review Report. 2020. Available online: https://keenlab.tencent.com/zh/2020/03/30/Tencent-Keen-Security-Lab-Experimental-Security-Assessment-on-Lexus-Cars/ (accessed on 10 December 2024).
- El-Rewini, Z.; Sadatsharan, K.; Selvaraj, D.F.; Plathottam, S.J.; Ranganathan, P. Cybersecurity challenges in vehicular communications. Veh. Commun. 2020, 23, 100214.
- Kamal, M.; Srivastava, G.; Tariq, M. Blockchain-based lightweight and secured v2v communication in the internet of vehicles. IEEE Trans. Intell. Transp. Syst. 2020, 22, 3997–4004.
- Manso, P.; Moura, J.; Serrão, C. SDN-based intrusion detection system for early detection and mitigation of DDoS attacks. Information 2019, 10, 106.
- Mashudi, N.A.; Ab Aziz, N.T.; Rahman, W.F.W.A.; Ahmad, N.; Noor, N.M. Intrusion detection using machine learning for security and privacy in humanitarian aid system. In Proceedings of the 2024 IEEE 12th Region 10 Humanitarian Technology Conference (R10-HTC); IEEE: New York, NY, USA, 2024; pp. 1–6.
- Kalkan, S.C.; Sahingoz, O.K. In-vehicle intrusion detection system on controller area network with machine learning models. In Proceedings of the 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT); IEEE: New York, NY, USA, 2020; pp. 1–6.
- Khraisat, A.; Gondal, I.; Vamplew, P.; Kamruzzaman, J.; Alazab, A. Hybrid intrusion detection system based on the stacking ensemble of c5 decision tree classifier and one class support vector machine. Electronics 2020, 9, 173.
- Yang, L.; Shami, A. A transfer learning and optimized CNN based intrusion detection system for Internet of Vehicles. In Proceedings of the ICC 2022-IEEE International Conference on Communications; IEEE: New York, NY, USA, 2022; pp. 2774–2779.
- Xing, L.; Wang, K.; Wu, H.; Ma, H.; Zhang, X. FL-MAAE: An intrusion detection method for the Internet of Vehicles based on federated learning and memory-augmented autoencoder. Electronics 2023, 12, 2284.
- Yin, C.; Zhu, Y.; Fei, J.; He, X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access 2017, 5, 21954–21961.
- Ashraf, J.; Bakhshi, A.D.; Moustafa, N.; Khurshid, H.; Javed, A.; Beheshti, A. Novel deep learning-enabled LSTM autoencoder architecture for discovering anomalous events from intelligent transportation systems. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4507–4518.
- Liu, Y.; Xiao, M.; Zhou, Y.; Zhang, D.; Zhang, J.; Gacanin, H.; Pan, J. An access control mechanism based on risk prediction for the IoV. In Proceedings of the 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring); IEEE: New York, NY, USA, 2020; pp. 1–5.
- Mahendran, R.K.; Rajendran, S.; Pandian, P.; Rathore, R.S.; Benedetto, F.; Jhaveri, R.H. A novel constructive unceasement conditional random field and dynamic Bayesian network model for attack prediction on Internet of Vehicle. IEEE Access 2024, 12, 24644–24658.
- Cho, E.; Chang, T.-W.; Hwang, G. Data preprocessing combination to improve the performance of quality classification in the manufacturing process. Electronics 2022, 11, 477.
- Sharafaldin, I.; Lashkari, A.H.; Ghorbani, A.A. Toward generating a new intrusion detection dataset and intrusion traffic characterization. ICISSP 2018, 1, 108–116.
- Al-Saud, M.; Eltamaly, A.M.; Mohamed, M.A.; Kavousi-Fard, A. An intelligent data-driven model to secure intravehicle communications based on machine learning. IEEE Trans. Ind. Electron. 2019, 67, 5112–5119.
- Wazirali, R. An improved intrusion detection system based on KNN hyperparameter tuning and cross-validation. Arab. J. Sci. Eng. 2020, 45, 10859–10873.
- Zhang, C.; Wang, W.; Liu, L.; Ren, J.; Wang, L. Three-branch random forest intrusion detection model. Mathematics 2022, 10, 4460.
- Yang, L.; Shami, A.; Stevens, G.; De Rusett, S. LCCDE: A decision-based ensemble framework for intrusion detection in the internet of vehicles. In Proceedings of the GLOBECOM 2022-2022 IEEE Global Communications Conference; IEEE: New York, NY, USA, 2022; pp. 3545–3550.
- Almehdhar, M.; Albaseer, A.; Khan, M.A.; Abdallah, M.; Menouar, H.; Al-Kuwari, S.; Al-Fuqaha, A. Deep learning in the fast lane: A survey on advanced intrusion detection systems for intelligent vehicle networks. IEEE Open J. Veh. Technol. 2024, 5, 869–906.
- Lauriola, I.; Lavelli, A.; Aiolli, F. An introduction to deep learning in natural language processing: Models, techniques, and tools. Neurocomputing 2022, 470, 443–456.
- Srivastava, S.; Divekar, A.V.; Anilkumar, C.; Naik, I.; Kulkarni, V.; Pattabiraman, V. Comparative analysis of deep learning image detection algorithms. J. Big Data 2021, 8, 66.
- Sri vidhya, G.; Nagarajan, R. A novel bidirectional LSTM model for network intrusion detection in SDN-IoT network. Computing 2024, 106, 2613–2642.
- Imrana, Y.; Xiang, Y.; Ali, L.; Abdul-Rauf, Z. A bidirectional LSTM deep learning approach for intrusion detection. Expert Syst. Appl. 2021, 185, 115524.
- Halbouni, A.; Gunawan, T.S.; Habaebi, M.H.; Halbouni, M.; Kartiwi, M.; Ahmad, R. CNN-LSTM: Hybrid deep neural network for network intrusion detection system. IEEE Access 2022, 10, 99837–99849.
- Sun, H.; Chen, M.; Weng, J.; Liu, Z.; Geng, G. Anomaly detection for in-vehicle network using CNN-LSTM with attention mechanism. IEEE Trans. Veh. Technol. 2021, 70, 10880–10893.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Google: Mountain View, CA, USA, 2017; pp. 5998–6008.
- Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A survey on text classification: From traditional to deep learning. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 1–41.
- Tavallaee, M.; Bagheri, E.; Lu, W.; Ghorbani, A.A. A detailed analysis of the KDD CUP 99 data set. In Proceedings of the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications; IEEE: New York, NY, USA, 2009; pp. 1–6.
- KDD Cup. KDD Cup 1999. 1999. Available online: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (accessed on 6 January 2025).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

