HCA-IDS: A Semantics-Aware Heterogeneous Cross-Attention Network for Robust Intrusion Detection in CAVs

Qiyi He; Yifan Zhang; Jieying Liu; Wen Zhou; Tingting Zhang; Minlong Hu; Ao Xu; Qiao Lin

doi:10.3390/electronics15040784

Abstract

Connected and Autonomous Vehicles (CAVs) are exposed to increasingly sophisticated cyber threats hidden within high-dimensional, heterogeneous network traffic. A critical bottleneck in existing Intrusion Detection Systems (IDS) is the feature heterogeneity gap: discrete protocol signatures (e.g., flags, services) and continuous traffic statistics (e.g., flow duration, packet rates) reside in disjoint latent spaces. Traditional deep learning approaches typically rely on naive feature concatenation, which fails to capture the intricate, non-linear semantic dependencies between these modalities, leading to suboptimal performance on long-tail, minority attack classes. This paper proposes HCA-IDS, a novel framework centered on Semantics-Aware Cross-Modal Alignment. Unlike heavyweight models, HCA-IDS adopts a streamlined Multi-Layer Perceptron (MLP) backbone optimized for edge deployment. We introduce a dedicated Multi-Head Cross-Attention mechanism that explicitly utilizes static “Pattern” features to dynamically query and re-weight relevant dynamic “State” behaviors. This architecture forces the model to learn a unified semantic manifold where protocol anomalies are automatically aligned with their corresponding statistical footprints. Empirical assessments on the NSL-KDD and CIC-IDS2018 datasets, validated through rigorous 5-Fold Cross-Validation, substantiate the robustness of this approach. The model achieves a Macro-F1 score of over 94% on 7 consolidated attack categories, exhibiting exceptional sensitivity to minority attacks (e.g., Web Attacks and Infiltration). Crucially, HCA-IDS is ultra-lightweight, with a model size of approximately 1.00 MB and an inference latency of 0.0037 ms per sample. These results confirm that explicit semantic alignment combined with a lightweight architecture is key to robust, real-time intrusion detection in resource-constrained CAVs.

Keywords:

connected autonomous vehicles; deep learning; intrusion detection system; cross-attention mechanism; heterogeneous feature fusion

1. Introduction

Connected and Autonomous Vehicles (CAVs) have emerged as a paradigm shift in intelligent transportation systems, facilitating ubiquitous Vehicle-to-Everything (V2X) communication through cutting-edge technologies like 5G and next-generation Wi-Fi. In recent years, advances in onboard sensing and V2V communication have driven CAV research beyond single-vehicle control towards increasingly complex cooperative maneuvers in dynamic traffic scenarios, such as lane changing, merging, collision avoidance, and platoon evolution based on multi-agent hierarchical architectures [1]. Furthermore, the rapid development of CAVs has empowered fine-grained urban traffic management, exemplified by recent advances in privacy-preserving cycle-based arrival profile estimation for cross-company collaboration [2]. These V2X services facilitate the sharing of attribute information and dynamic status (e.g., location, speed, driving intentions), laying the foundation for high-level information interaction and autonomous driving [3]. However, this expanded connectivity surface also exposes CAVs to severe security threats; potential cyber intrusions can lead to uncontrollable and dangerous behaviors [4]. For instance, a report by Tencent Keen Security Lab demonstrated that attackers could exploit vulnerabilities in Bluetooth and diagnostic functions to remotely control a Lexus vehicle, performing unintended physical operations [5]. This evidence underscores that illegal access to vehicle control systems poses a life-threatening risk to drivers and passengers. Therefore, the implementation of advanced Intrusion Detection Systems (IDS) has emerged as an imperative countermeasure against such cyber threats. By facilitating real-time analysis of communication dynamics, IDSs enable the rapid detection of adversarial incursions, effectively safeguarding the operational integrity and reliability of CAV networks [6,7].

Network Intrusion Detection Systems (NIDS) are conventionally classified into two primary paradigms: signature-based and anomaly-based. Specifically, Signature-based IDS (SIDS) operates by scrutinizing real-time traffic flows and cross-referencing patterns against a repository of known threat signatures. While effective for known threats, the efficacy of SIDS is contingent upon the comprehensiveness of signature databases, rendering them ineffective against zero-day attacks or polymorphic threats. It is thus essential to ensure the periodic maintenance and updating of the system [8]. Conversely, Anomaly-based Intrusion Detection Systems (AIDS) identify intrusions by modeling normal network behavior, thereby providing protection against unknown attacks [9]. In the design of AIDS, machine learning (ML) techniques such as Decision Tree (DT) [10] and Support Vector Machine (SVM) [11] have been extensively employed. However, these traditional ML methods often rely on manual feature engineering, which suffers from prolonged processing times and limited scalability when identifying complex attack patterns in high-dimensional data.

Driven by its transformative success in Natural Language Processing (NLP) and Computer Vision (CV), Deep Learning (DL) has emerged as a dominant paradigm for Intrusion Detection Systems (IDS) in Connected and Autonomous Vehicles (CAVs). The increasing prevalence of DL stems from its superior efficacy in automated feature extraction and its robustness in handling high-dimensional, heterogeneous data. Recent literature highlights a diverse array of architectures employed to secure vehicular networks, including Convolutional Neural Networks (CNN) [12], Autoencoders (AE) [13], Recurrent Neural Networks (RNN) [14], Long Short-Term Memory (LSTM) [15], Generative Adversarial Networks (GAN) [16], and Deep Belief Networks (DBN) [17].

Despite these advancements, a critical research gap remains: the majority of current efforts concentrate on stacking complex model architectures while treating network traffic as a homogeneous data stream. In reality, CAV network traffic is inherently heterogeneous, consisting of discrete Pattern data and continuous State data. Existing methods typically use simple concatenation to fuse these features, ignoring the intrinsic semantic misalignment between them. For example, a specific protocol flag (Pattern) usually dictates a specific statistical threshold (State). Ignoring this correlation prevents the model from fully exploring the potential information in the data, limiting its robustness and generalization ability [18]. Crucially, this limitation leads to poor performance on minority attack classes in imbalanced datasets. Furthermore, complex architectures often incur high computational overhead and latency. Since CAVs operate in resource-constrained edge environments requiring millisecond-level responses, simply increasing model parameters to improve accuracy is impractical. Therefore, it is necessary to reconstruct the IDS architecture from a “data-first” perspective, balancing high-precision semantic alignment with ultra-lightweight deployment capabilities.

To address the critical feature heterogeneity gap in existing IDS, this work proposes HCA-IDS, a novel framework centered on Semantics-Aware Cross-Modal Alignment. Existing methods often ignore that discrete protocol signatures and continuous traffic statistics reside in disjoint latent spaces, relying on naive concatenation that fails to capture intricate semantic dependencies. In contrast, our core innovation lies in explicitly decoupling these features and introducing a Multi-Head Cross-Attention mechanism. This architecture forces the model to learn a unified semantic manifold where static “Pattern” features can dynamically query and re-weight relevant dynamic “State” behaviors, achieving precise alignment that previous methods failed to capture. We evaluate our model on NSL-KDD and the complex, highly imbalanced CIC-IDS2018 dataset [19]. The results affirm that grounding IDS architecture in the intrinsic structure of traffic data yields a more accurate and robust solution.

The main contributions of this paper are as follows:

(1): We propose HCA-IDS, a lightweight framework optimized for the resource-constrained edge environments of CAVs. Unlike traditional approaches that rely on heavy convolutional layers, we utilize a streamlined MLP backbone to process heterogeneous data streams (Patterns and States). This design maximizes computational efficiency while preserving the distinct statistical properties of different feature modalities.
(2): We introduce a novel Semantics-Aware Cross-Modal Alignment mechanism to bridge the feature heterogeneity gap. By leveraging Multi-Head Cross-Attention, the model explicitly utilizes static protocol patterns to dynamically query relevant traffic statistics. This allows for the learning of a unified semantic manifold, significantly enhancing the detection of long-tail minority attacks without increasing model depth.
(3): We conduct comprehensive experiments on NSL-KDD and CIC-IDS2018 datasets using rigorous 5-Fold Cross-Validation. HCA-IDS demonstrates exceptional robustness, achieving a Macro-F1 score of over 94% on 7 consolidated attack categories. Crucially, with a model size of 1.00 MB and ultra-low inference latency (0.0037 ms), our solution proves its viability for real-time deployment, striking an optimal balance between high-precision semantic alignment and efficiency.

The remainder of the paper is structured as follows: Section 2 introduces related research work; Section 3 describes the dataset analysis and preprocessing methods; Section 4 details the structure of the proposed HCA-IDS model. Section 5 presents the experimental setup, evaluation criteria, and a comprehensive discussion of the results, specifically analyzing the performance on minority classes. Finally, Section 6 summarizes the work of this paper.

3. Dataset and Data Preprocessing

To ensure the proposed model is robust across both established benchmarks and realistic modern traffic scenarios, we employ two distinct datasets: the classic NSL-KDD and the state-of-the-art CIC-IDS2018.

3.1. NSL-KDD Dataset

The NSL-KDD dataset [33] serves as a statistically refined iteration of the legacy KDD Cup 99 archive [34]. By eliminating the redundant and duplicate records that skewed the original distribution, NSL-KDD offers a more rigorous benchmark for evaluating IDS performance. The dataset is structured into a training set (KDDTrain+) containing 125,973 samples and a testing set (KDDTest+) with 22,544 samples. Each instance is defined by a 43-dimensional vector, comprising 41 traffic attributes, a class label, and a difficulty score. As detailed in Table 1, the attack scenarios fall into four primary families.

Table 1. Composition of the NSL-KDD dataset.

As summarized in Table 1, the dataset encompasses four distinct attack families alongside normal traffic. These categories include Denial of Service (DoS), which aims to deplete target resources to disrupt legitimate access; Probe, involving surveillance activities for network reconnaissance; Remote to Local (R2L), representing unauthorized remote access attempts such as password guessing; and User to Root (U2R), where local users exploit vulnerabilities to escalate privileges. The statistical distribution reveals a significant class imbalance: while Normal traffic (53.46%) and DoS attacks (36.46%) dominate the training set, the U2R and R2L categories represent a minority, constituting only 0.04% and 0.79% of the training records, respectively. This scarcity poses a substantial challenge for detection algorithms.

3.2. CIC-IDS2018 Dataset

While NSL-KDD serves as a standard baseline, it does not fully reflect modern encrypted traffic and complex attack patterns. To validate the technical quality and generalizability of our model, we incorporate the CSE-CIC-IDS2018 dataset [19]. Generated on a realistic AWS network topology, this dataset captures diverse modern attacks including Brute Force, Botnet, DoS, DDoS, Web Attacks, and Infiltration.

A critical challenge in CIC-IDS2018 is the extreme class imbalance. As shown in our experimental setup later, attacks like Heartbleed and Web Attacks constitute less than 0.01% of the total traffic. This reflects real-world CAV scenarios where malicious signals are rare but fatal. We utilize the raw CSV files generated by CICFlowMeter, which provide roughly 80 statistical features per flow (e.g., Flow Duration, Flow IAT, Packet Length Std). To ensure statistical reliability during evaluation, we aggregated the sub-attacks into major distinct categories (detailed in the experimental setup), allowing for a robust assessment of the model’s ability to detect both voluminous and minority threats.

3.3. Heterogeneous Data Preprocessing

The quality of data representation fundamentally dictates model performance. As illustrated in Table 2, network traffic features are inherently heterogeneous, consisting of:

Table 2. Distribution of feature variables for the NSL-KDD dataset.

(1): Nominal/Binary Variables: Discrete attributes representing protocols, service types, or status flags. These encode the structural “Pattern” of the traffic.
(2): Interval/Ratio Variables: Continuous statistical metrics such as duration, byte counts, and error rates. These encode the dynamic “State” of the traffic.

Directly feeding these mixed types into a single neural network layer often leads to suboptimal convergence, as high-variance continuous values can dominate the gradients over sparse binary features. To address this, we implement a Semantics-Aware Preprocessing Pipeline (as shown in Figure 1) to decouple and normalize these features:

Figure 1. The proposed Heterogeneous Data Preprocessing Pipeline. Raw traffic is decoupled into Pattern (Discrete) and State (Continuous) streams, processed via One-Hot Encoding and Normalization respectively, to prepare for the dual-encoder architecture.

Feature Decoupling and Encoding

First, irrelevant features are removed to prevent the model from learning distinct identifiers. The remaining features are processed based on their semantic type:

Pattern Data Processing: Nominal variables are mapped to high-dimensional sparse vectors using One-Hot Encoding. Binary variables are retained as is. These form the Pattern Input ($X_{p}$).

State Data Processing: Continuous variables (e.g., Serror_rate) exhibit large variations in scale. We apply Min-Max Normalization to map these values into the range $[0, 1]$. This prevents numerical instability and ensures that the State Input ($X_{s}$) is comparable across different dimensions:

$$x_{i}' = \frac{x_{i} - \min(x)}{\max(x) - \min(x)} \tag{1}$$

This explicit separation allows our subsequent Heterogeneous Encoders to extract features from $X_{p}$ and $X_{s}$ independently before semantic alignment, thereby maximizing the information gain from both modalities.
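The decoupling step can be illustrated with a minimal, standard-library-only sketch. The feature names below are taken from NSL-KDD, but the choice of which fields to include, the helper names `fit` and `transform`, and the decision to learn the min/max bounds from training records only are assumptions made for illustration; the paper does not specify them.

```python
# Minimal sketch of the Pattern/State decoupling; the field selection is illustrative.
NOMINAL = ["protocol_type", "flag"]       # one-hot encoded
BINARY = ["logged_in"]                    # retained as is
CONTINUOUS = ["duration", "serror_rate"]  # min-max normalised, as in Eq. (1)


def fit(records):
    """Learn one-hot vocabularies and per-feature (min, max) from training records."""
    vocab = {f: sorted({r[f] for r in records}) for f in NOMINAL}
    bounds = {f: (min(r[f] for r in records), max(r[f] for r in records))
              for f in CONTINUOUS}
    return vocab, bounds


def transform(record, vocab, bounds):
    """Return (x_p, x_s): the Pattern and State vectors for one record."""
    x_p = []
    for f in NOMINAL:
        x_p.extend(1.0 if record[f] == v else 0.0 for v in vocab[f])
    x_p.extend(float(record[f]) for f in BINARY)

    x_s = []
    for f in CONTINUOUS:
        lo, hi = bounds[f]
        # Guard against constant features, where max(x) == min(x).
        x_s.append((record[f] - lo) / (hi - lo) if hi > lo else 0.0)
    return x_p, x_s


train = [
    {"protocol_type": "tcp", "flag": "SF", "logged_in": 1, "duration": 0, "serror_rate": 0.0},
    {"protocol_type": "udp", "flag": "SF", "logged_in": 0, "duration": 10, "serror_rate": 0.5},
    {"protocol_type": "tcp", "flag": "S0", "logged_in": 0, "duration": 40, "serror_rate": 1.0},
]
vocab, bounds = fit(train)
x_p, x_s = transform(train[1], vocab, bounds)
print(x_p)  # Pattern vector: one-hot(protocol_type) + one-hot(flag) + binary flags
print(x_s)  # State vector: values scaled into [0, 1]
```

Keeping the two vectors separate, rather than concatenating them at this stage, is what lets each encoder see inputs with a single, consistent statistical profile.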

4. HCA-IDS Architecture

A visual representation of the proposed HCA-IDS architecture is provided in Figure 2. Following the heterogeneous data preprocessing, the input traffic flow is decomposed into two distinct streams: Pattern Data ($X_{p}$), comprising binary variables that encode discrete communication semantics, and State Data ($X_{s}$), consisting of normalized continuous variables reflecting behavioral dynamics.

Figure 2. The overall architecture of HCA-IDS (Refined with MLP Backbones).

To address the limitations of static concatenation and high computational costs in CAVs, we design a lightweight dual-branch architecture. This consists of dedicated heterogeneous MLP-based encoders to extract modality-specific features efficiently, followed by a Semantics-Aware Cross-Attention Fusion Module. This module enables the model to dynamically utilize protocol information (Pattern) to query and highlight anomalous statistical behaviors (State). Finally, a compact classification head predicts the specific attack category.

4.1. Heterogeneous Feature Encoders

We design two parallel encoders. Unlike previous works that employ heavy convolutional layers, we adopt streamlined Multi-Layer Perceptrons (MLP) for both branches to minimize inference latency on edge devices while maintaining feature expressiveness.

4.1.1. Pattern Encoder (MLP-Based)

The pattern input $X_{p} \in \mathbb{R}^{N_{p}}$ (where $N_{p} = 114$) is sparse and categorical. To capture the non-linear co-occurrence between protocols and service flags, we employ a fully connected (FC) network. As shown in Table 3, the encoder consists of three layers with decreasing dimensions (114 → 48 → 16 → 8). This “bottleneck” design forces the network to learn a compact semantic embedding $H_{p}$, filtering out noise from the sparse one-hot vectors.
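To give a sense of how small this branch is, the following sketch counts the trainable parameters of the 114 → 48 → 16 → 8 stack. It assumes plain dense layers with bias terms and no normalisation parameters, and 32-bit weights for the size estimate; Table 3 may specify a different configuration, so the figures are indicative only.

```python
def dense_params(sizes, bias=True):
    """Count weights (and optionally biases) in a stack of fully connected layers."""
    return sum(n_in * n_out + (n_out if bias else 0)
               for n_in, n_out in zip(sizes, sizes[1:]))


pattern_sizes = [114, 48, 16, 8]              # layer widths from the text
pattern_params = dense_params(pattern_sizes)
pattern_kib = pattern_params * 4 / 1024       # assumption: 4 bytes per float32 weight

print(pattern_params)   # 6440
print(pattern_kib)      # 25.15625
```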

Table 3. Configuration of Heterogeneous Encoders (Lightweight MLP Design).

4.1.2. State Encoder (MLP-Based)

The state input $X_{s} \in \mathbb{R}^{N_{s}}$ represents continuous statistical metrics (e.g., flow duration, packet inter-arrival times) reflecting the global behavioral profile of the traffic. To efficiently capture the complex, non-linear correlations among these continuous variables, we employ a high-efficiency MLP Encoder.

This encoder consists of three stacked dense layers, each equipped with Batch Normalization (BN) and ReLU activation to ensure rapid convergence and stable gradient propagation. As detailed in Table 3, the network progressively projects the high-dimensional statistical features into a lower-dimensional latent space ($N_{s} \to 64 \to 32 \to 16$). This hierarchical compression extracts the most discriminative behavioral features while discarding redundant noise, yielding a robust state embedding vector $H_{s}$ optimized for the subsequent cross-modal alignment.

4.2. Semantics-Aware Cross-Attention Fusion

This module is the core innovation of our framework. Traditional methods simply concatenate the pattern embedding $H_{p}$ and the state embedding $H_{s}$, which assumes a fixed, linear relationship between protocol types and traffic statistics. However, the relevance of a statistical feature often depends on the protocol context. For instance, a high Source Byte Count might be normal for an FTP transfer (Pattern context) but highly anomalous for a DNS request.

To model this dynamic dependency, we adopt a Multi-Head Cross-Attention mechanism. Unlike standard self-attention that operates on temporal sequences, our mechanism is designed to align heterogeneous feature vectors. We designate the Pattern embedding $H_{p}$ as the Query (Q), representing the “semantic context,” and the State embedding $H_{s}$ as both Key (K) and Value (V), representing the “behavioral evidence.”

To enable fine-grained alignment, we project the input vectors into h distinct semantic subspaces (heads). Mathematically, for the i-th head:

$$Q_{i} = H_{p} W_{i}^{Q}, \quad K_{i} = H_{s} W_{i}^{K}, \quad V_{i} = H_{s} W_{i}^{V} \tag{2}$$

where $W_{i}^{Q}$, $W_{i}^{K}$, and $W_{i}^{V}$ are learnable projection matrices. This projection effectively decomposes the global state vector into multiple latent behavioral aspects.

The attention weights are calculated to measure the semantic compatibility between the protocol pattern and each behavioral aspect:

$$\mathrm{Attention}(Q_{i}, K_{i}, V_{i}) = \sigma\left(\frac{Q_{i} K_{i}^{\top}}{\sqrt{d_{k}}}\right) \cdot V_{i} \tag{3}$$

Here, $d_{k}$ is the dimension of each head’s key vector, and we utilize a Sigmoid-like activation function $\sigma(\cdot)$ (or Softmax across heads) to determine the relevance score. This creates a “gating” effect, where the protocol context dynamically re-weights specific dimensions of the state features, amplifying relevant statistics while suppressing noise.

Finally, the outputs from all heads are concatenated and passed through a linear projection. To prevent feature degradation and ensure stable gradient flow, we employ a residual connection followed by Layer Normalization:

$$Z_{fused} = \mathrm{LayerNorm}(H_{p} + \mathrm{MultiHead}(Q, K, V)) \tag{4}$$

This fused vector $Z_{fused}$ contains the pattern information enriched by contextually aligned state details, providing a robust representation for the final classification.
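One possible reading of Equations (2) to (4) is sketched below in plain Python. Several choices here are assumptions rather than details from the paper: each embedding is treated as a single token, so the query-key product is one scalar per head; the gate is a logistic sigmoid; there are 2 heads with a key dimension of 4; an output projection (`w_o`) maps the concatenated heads back to the 8 dimensions of $H_{p}$ so that the residual sum is well defined; LayerNorm has no learnable scale or shift; and all weights are random rather than trained.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def vec_mat(v, m):
    """Row vector (length r) times matrix (r x c) -> row vector (length c)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


D_P, D_S, N_HEADS, D_K = 8, 16, 2, 4   # D_P and D_S follow Section 4.1; the rest are assumed

params = {
    "heads": [(rand_matrix(D_P, D_K), rand_matrix(D_S, D_K), rand_matrix(D_S, D_K))
              for _ in range(N_HEADS)],
    "w_o": rand_matrix(N_HEADS * D_K, D_P),
}


def fuse(h_p, h_s, params):
    concat = []
    for w_q, w_k, w_v in params["heads"]:
        # Eq. (2): Pattern embedding gives the query, State embedding gives key and value.
        q, k, v = vec_mat(h_p, w_q), vec_mat(h_s, w_k), vec_mat(h_s, w_v)
        # Eq. (3): scaled compatibility score turned into a gate on the value vector.
        gate = sigmoid(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(k)))
        concat.extend(gate * x for x in v)
    attended = vec_mat(concat, params["w_o"])   # linear projection of concatenated heads
    # Eq. (4): residual connection followed by layer normalisation.
    return layer_norm([a + b for a, b in zip(h_p, attended)])


h_p = [random.random() for _ in range(D_P)]   # stand-in for the Pattern embedding
h_s = [random.random() for _ in range(D_S)]   # stand-in for the State embedding
z_fused = fuse(h_p, h_s, params)
print(len(z_fused))
```

Under the single-token assumption each head contributes one scalar gate, so the re-weighting acts per head rather than per individual state dimension; a per-dimension gate would require a different tensor layout than the one assumed here.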

4.3. Classification Head

The aligned feature vector $Z_{fused}$ is fed into a classification head consisting of three fully connected layers (e.g., 64 → 32 → $N_{classes}$). To mitigate overfitting, especially given the high complexity of the fusion module, we apply Dropout with a rate of 0.5 before the final layer. The output layer utilizes a Softmax function to produce the probability distribution over the attack classes.

$$\hat{y} = \mathrm{Softmax}(W_{out} \cdot \mathrm{ReLU}(W_{1} \cdot Z_{fused} + b_{1}) + b_{out}) \tag{5}$$

This architecture ensures that the final decision is based on semantically aligned heterogeneous features, significantly enhancing the detection capability for minority classes with subtle anomalies.
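As a concrete reading of Equation (5), the sketch below evaluates a head with one hidden layer followed by a Softmax output at inference time, where Dropout is inactive. The input width of 8, the hidden width of 32, the 7 output classes (matching the consolidated CIC-IDS2018 categories), and the random weights are assumptions for illustration.

```python
import math
import random

random.seed(1)


def linear(x, w, b):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def init(n_out, n_in):
    w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


D_FUSED, D_HIDDEN, N_CLASSES = 8, 32, 7   # assumed widths
w1, b1 = init(D_HIDDEN, D_FUSED)
w_out, b_out = init(N_CLASSES, D_HIDDEN)

z_fused = [random.gauss(0.0, 1.0) for _ in range(D_FUSED)]   # stand-in fused vector
y_hat = softmax(linear(relu(linear(z_fused, w1, b1)), w_out, b_out))
predicted = max(range(N_CLASSES), key=lambda c: y_hat[c])
print(predicted, round(sum(y_hat), 6))
```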

5. Experiments

5.1. Experimental Settings

5.1.1. Data Partitioning and Balancing

To rigorously evaluate HCA-IDS, we adopted distinct partitioning strategies for the two datasets to align with their respective benchmarks.

NSL-KDD Setup: We followed the standard benchmark protocol, utilizing the predefined KDDTrain+ set for training and the KDDTest+ set for performance evaluation. This ensures our results are directly comparable with existing literature.

CIC-IDS2018 Setup: For the CIC-IDS2018 dataset, we consolidated the attack types into 7 major categories to ensure statistical significance. Specifically, we excluded ultra-minority classes (e.g., Heartbleed, <0.001%) where sample sizes were insufficient for reliable variance analysis. We implemented a Stratified 5-Fold Cross-Validation. The dataset was shuffled and split into 5 folds; in each iteration, 4 folds were used for training and 1 for testing.

Handling Imbalance: As detailed in Table 4, the raw CIC-IDS2018 distribution is highly skewed. To prevent the model from biasing towards majority classes, we applied adaptive oversampling (SMOTE) exclusively to the training folds. The test folds remained strictly comprised of original, raw samples to reflect real-world detection difficulty. Table 4 lists the exact sample sizes used in our 5-fold experiments.

Table 4. Detailed Sample Distribution for CIC-IDS2018 (Average per Fold in 5-Fold CV). Note that oversampling is applied only to the training set to prevent data leakage.
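The leakage-free protocol (balance the training folds, leave the test fold untouched) can be sketched as follows. Note that this sketch swaps SMOTE for plain random duplication of minority samples to stay dependency-free; SMOTE would instead synthesise new points by interpolating between neighbours. The function names and the toy label distribution are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


def oversample(train_idx, labels, seed=0):
    """Randomly duplicate minority samples until every class matches the largest one.
    Stand-in for SMOTE, used here only to illustrate where balancing is applied."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choice(idxs) for _ in range(target - len(idxs)))
    return balanced


labels = ["Benign"] * 90 + ["WebAttack"] * 10   # toy, highly skewed labels
folds = stratified_folds(labels, k=5)

splits = []
for i, test_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    train_bal = oversample(train_idx, labels)   # balancing touches training folds only
    splits.append((train_bal, test_idx))        # test fold keeps the raw distribution

print(Counter(labels[i] for i in splits[0][0]))
print(Counter(labels[i] for i in splits[0][1]))
```

Because oversampling happens after the split, no duplicated (or, with SMOTE, interpolated) sample can share information with the held-out fold.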

5.1.2. Implementation Details

The proposed HCA-IDS is implemented using PyTorch 1.10 on a workstation equipped with an NVIDIA RTX 3090 GPU.

Parameter Settings: The MLP encoders and attention modules are optimized using the Adam optimizer with a learning rate of $1 \times 10^{-3}$ and a batch size of 256. To ensure lightweight deployment suitable for CAVs, the hidden dimensions of the MLP encoders were restricted (as detailed in Section 4), resulting in a total model size of approximately 1.00 MB (see Table 7). The training process runs for 50 epochs with an early stopping mechanism to prevent overfitting.
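The early stopping criterion is not specified in the text. A common form, shown here as a sketch, halts training once the validation loss has failed to improve for a fixed number of epochs. The `patience` value, the `EarlyStopping` class itself, and the validation losses below are hypothetical; only the 50-epoch cap comes from the paper.

```python
class EarlyStopping:
    """Signal a stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


MAX_EPOCHS = 50   # epoch budget stated in the text
# Hypothetical validation losses standing in for a real training loop.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63] + [0.65] * 43

stopper = EarlyStopping(patience=3)
stopped_at = MAX_EPOCHS
for epoch, loss in enumerate(val_losses[:MAX_EPOCHS], start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```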

5.2. Evaluation Metrics

To strictly evaluate the performance on minority classes, we employ standard metrics including Accuracy, Precision, Recall, and F1-score. Furthermore, we report the Macro-Average F1, which treats all classes equally regardless of their sample size, providing a fairer assessment of the model’s robustness against long-tail attacks than Weighted-Average metrics.
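The difference between the two averages is easiest to see on a toy example. The sketch below computes per-class F1 from scratch and then averages it both ways. The labels are invented for illustration: 95 benign flows and 5 attack flows, of which only one attack is detected.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} computed from true/false positives and false negatives."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


y_true = ["benign"] * 95 + ["attack"] * 5
y_pred = ["benign"] * 95 + ["attack"] + ["benign"] * 4

macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
print(round(macro, 4), round(weighted, 4))  # 0.6564 0.9471
```

The weighted average stays near 0.95 even though four of five attacks are missed, whereas the macro average drops to about 0.66, which is why the macro figure is the more informative one for long-tail classes.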

5.3. Experimental Results and Analysis

5.3.1. Performance Evaluation on NSL-KDD Benchmark

To establish a comparative baseline, we first evaluated HCA-IDS on the NSL-KDD dataset using Stratified 5-Fold Cross-Validation. Table 5 details the precision, recall, and F1-score for each category, reported as the mean value with standard deviation across folds.

Table 5. Classification Performance on NSL-KDD (Mean ± Std Dev over 5 Folds).

As indicated in Table 5, the model achieves robust performance on the majority classes (Normal and DoS), with F1-scores exceeding 95%. For the R2L category, which involves stealthy unauthorized access attempts, the model maintains a high F1-score of 87.96%. It is noted that the U2R category exhibits a relatively lower performance (F1 ≈ 54.8%). This is a known characteristic of the NSL-KDD dataset, where the extreme scarcity of U2R samples in the training set limits the model’s ability to generalize boundaries. However, compared to standard baselines that often yield near-zero detection for U2R without synthetic augmentation, our model demonstrates a capability to capture minority patterns effectively.

5.3.2. Performance Evaluation on CIC-IDS2018

To validate the model’s efficacy in modern CAV environments characterized by high-dimensional and encrypted traffic, we conducted experiments on the CIC-IDS2018 dataset. Table 6 presents the quantitative results across 7 consolidated categories.

Table 6. Classification Performance on CIC-IDS2018 (Mean ± Std Dev over 5 Folds).

Analysis of Long-Tail Distributions: A critical objective of this study was to address the detection of minority attacks, which is often compromised in varying traffic distributions. As detailed in Table 6:

Statistical Stability: The low standard deviations across all metrics (e.g., ±0.01% for Benign, ±0.30% for Infiltration) confirm that the 5-fold cross-validation yields consistent results, mitigating concerns regarding data leakage or random split bias.
Minority Class Detection: Despite Web Attacks constituting only 0.15% of the dataset, the model achieves a mean Recall of 90.65% and an F1-score of 82.00%. Similarly, Infiltration attacks are detected with an F1-score of 89.02%.

These results suggest that the proposed Semantics-Aware Cross-Attention mechanism successfully aligns the discrete protocol patterns with continuous flow statistics, thereby amplifying the feature representation of rare attack classes against the background of dominant benign traffic.

5.4. Computational Complexity and Efficiency Analysis

For Intrusion Detection Systems (IDS) deployed in Connected and Autonomous Vehicles (CAVs), the model must strike a balance between detection accuracy and computational overhead. The strictly limited resources of On-Board Units (OBUs) and Electronic Control Units (ECUs) require the algorithm to be lightweight and capable of real-time processing.

To quantitatively assess the deployment feasibility of HCA-IDS, we analyzed its computational complexity and inference speed on an NVIDIA RTX 3090 GPU. The detailed computational profile is presented in Table 7.

Table 7. Computational Profile of HCA-IDS.

Analysis of Real-Time Feasibility:

Storage Efficiency: As shown in Table 7, the total parameter count of HCA-IDS is 262,343, resulting in a physical storage footprint of approximately 1.00 MB. This compact size is significantly below the storage limits of typical automotive-grade microcontrollers and edge gateways, allowing for seamless integration without requiring hardware upgrades.
Inference Speed: The average inference time is 0.0037 ms, corresponding to a throughput of over 269,000 Flows Per Second (FPS). In real-world vehicular networks, even high-load CAN FD or Automotive Ethernet traffic rarely exceeds such packet rates. This microsecond-level latency ensures that HCA-IDS can analyze traffic and flag anomalies continuously with negligible delay, leaving ample safety margins for the vehicle control systems to react.

These results confirm that the proposed architecture achieves the design goal of being lightweight and fast, making it highly suitable for real-time deployment in resource-constrained CAV environments.
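The storage and throughput figures above follow from simple arithmetic, reproduced in the sketch below. It assumes 32-bit floating-point weights (4 bytes per parameter) and that throughput is the reciprocal of the reported mean per-sample time; the text does not state whether that time was measured on single flows or averaged over batches.

```python
PARAMS = 262_343        # parameter count reported in Table 7
BYTES_PER_PARAM = 4     # assumption: float32 weights
LATENCY_MS = 0.0037     # reported mean inference time per sample

size_mib = PARAMS * BYTES_PER_PARAM / 2**20   # bytes -> MiB
throughput = 1000.0 / LATENCY_MS              # samples per second

print(round(size_mib, 4))   # 1.0008
print(round(throughput))    # 270270
```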

5.5. Comparison with State-of-the-Art Approaches

To assess the relative standing of HCA-IDS within the current literature, we compared its performance against representative Deep Learning-based IDS models. The baselines include CNN-LSTM [29], a standard hybrid spatial-temporal model; CLAM [30], which incorporates attention mechanisms; and LCCDE [23], a high-performance ensemble framework based on tree boosting.

5.5.1. Binary Classification Performance

Table 8 presents the comparison of F1-Scores on both the NSL-KDD and CIC-IDS2018 benchmarks.

Table 8. Comparison of Binary Classification Performance (F1-Score).

While all models exhibit high proficiency on the NSL-KDD dataset, significant divergence is observed on the more challenging CIC-IDS2018 dataset. The F1-score of the CNN-LSTM baseline drops to approximately 94.87%, indicating a struggle to generalize to modern encrypted traffic patterns. In contrast, HCA-IDS maintains a superior F1-score of 99.80%. This result suggests that the proposed heterogeneous feature decoupling provides a more robust representation for binary anomaly detection than traditional homogeneous architectures.

5.5.2. Multi-Class Performance Analysis

The capability to detect specific attack categories, particularly those with low prevalence, is a more rigorous metric for IDS evaluation. Table 9 details the Macro-Average F1-Score and the detection performance on key minority classes in the CIC-IDS2018 dataset.

Table 9. Comparison of Multi-Class Performance (Macro-F1) on CIC-IDS2018.

As shown in Table 9, HCA-IDS achieves the highest Macro-F1 score of 94.06%.

Addressing the Long-Tail Problem: Conventional Deep Learning models often exhibit a bias towards majority classes. For instance, CNN-LSTM achieves only 75.4% F1-score on the Botnet category. HCA-IDS significantly improves this to 99.4%, demonstrating that the semantic alignment mechanism effectively preserves the gradient signals of minority classes during training.
Efficiency vs. Accuracy Trade-off: While the ensemble-based LCCDE achieves a slightly higher score on Web Attacks (86.2% vs. 82.0%), it typically requires significantly higher computational resources (storage and memory for multiple tree structures). HCA-IDS offers a competitive detection rate with a lightweight MLP-based architecture (1.00 MB), presenting a more favorable trade-off for resource-constrained CAV deployment.

6. Conclusions

The integration of Connected and Autonomous Vehicles (CAVs) into modern intelligent transportation systems has exacerbated cybersecurity vulnerabilities, necessitating Intrusion Detection Systems (IDS) that can simultaneously satisfy high detection accuracy and strict resource constraints. In this study, we addressed the limitations of existing deep learning-based IDSs, which often suffer from feature misalignment due to homogeneous data processing and high computational overhead. We proposed HCA-IDS, a novel Semantics-Aware Heterogeneous Cross-Attention framework that fundamentally shifts the detection paradigm from an architecture-centric to a data-centric approach. By explicitly decoupling network traffic into discrete protocol patterns and continuous behavioral states, and subsequently aligning them via a lightweight Cross-Attention mechanism, our model effectively resolves the semantic heterogeneity gap. Furthermore, the replacement of heavy convolutional layers with streamlined MLP-based encoders significantly reduces the model’s complexity, ensuring its viability for deployment on resource-constrained vehicular edge computing units.

To strictly validate the robustness and generalization capability of the proposed framework, we conducted extensive experiments using Stratified 5-Fold Cross-Validation on the benchmark NSL-KDD and the high-dimensional CIC-IDS2018 datasets. The experimental results demonstrate that HCA-IDS achieves state-of-the-art performance, recording a Macro-F1 score of 94.06% on the CIC-IDS2018 dataset. Crucially, the model exhibits exceptional resilience against the long-tail distribution problem inherent in real-world traffic, significantly outperforming traditional hybrid baselines in detecting minority threats such as Web Attacks and Botnets. The analysis confirms that the proposed semantic alignment mechanism successfully preserves and amplifies the gradient signals of rare attack classes, preventing them from being overwhelmed by dominant benign traffic.

In terms of deployment feasibility, HCA-IDS demonstrates a superior trade-off between accuracy and efficiency. With a compact model size of approximately 1.00 MB and an average inference latency of just 0.0037 ms per sample, the system supports a throughput exceeding 269,000 flows per second, satisfying the real-time processing requirements of automotive electronic control units. These findings suggest that the proposed lightweight, semantics-aware architecture offers a robust and practical solution for securing next-generation vehicular networks. Future work will focus on evaluating the model’s resilience against adversarial examples to further enhance its security in hostile environments, as well as exploring its implementation on automotive-grade hardware for hardware-in-the-loop validation.
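The relationship between the reported per-sample latency and throughput is simple arithmetic, sketched below. The helper name `flows_per_second` is hypothetical, and the calculation assumes that the per-sample latency can be inverted directly (that is, samples are processed back to back with no additional overhead), which may not hold exactly for batched inference.

```python
def flows_per_second(latency_ms: float) -> float:
    """Convert an average per-sample latency in milliseconds to flows per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms  # 1000 ms in one second

reported_latency_ms = 0.0037  # value stated in the text
throughput = flows_per_second(reported_latency_ms)
print(f"{throughput:,.0f} flows/s")
```

With the rounded latency of 0.0037 ms this gives roughly 270,000 flows per second, which is consistent with the stated figure of more than 269,000; the small gap presumably comes from rounding the latency to two significant digits.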

Author Contributions

Q.H., Y.Z., J.L. and Q.L. participated in the discussion of research methodology, designed and performed the experiments, and wrote the original draft. W.Z., M.H., T.Z. and A.X. verified the experimental results, refined the experimental procedures, and reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the National Natural Science Foundation of China (No. 42201464), by the Hubei Provincial Natural Science Foundation (No. JCZRQN202500217), by the Hubei Provincial Key Laboratory of Green Intelligent Computing Power Network, which provided experimental facilities and technical assistance, and by the Youth Talent Support Program of the China Association for Science and Technology.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found at: NSL-KDD, https://www.unb.ca/cic/datasets/nsl.html (accessed on 15 January 2025), and CSE-CIC-IDS2018, https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 20 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
