Adaptive Semi-Supervised Algorithm for Intrusion Detection and Unknown Attack Identification

Li, Meng; Luo, Lei; Xiao, Kun; Wang, Geng; Wang, Yintao

doi:10.3390/app15041709


by Meng Li, Lei Luo *, Kun Xiao, Geng Wang and Yintao Wang

School of Information and Software Engineering, Shahe Campus, University of Electronic Science and Technology of China, Chengdu 610054, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(4), 1709; https://doi.org/10.3390/app15041709

Submission received: 26 December 2024 / Revised: 6 February 2025 / Accepted: 6 February 2025 / Published: 7 February 2025

(This article belongs to the Special Issue Network Intrusion Detection and Attack Identification)


Abstract

Intrusion detection systems face significant challenges, including the inability to detect unknown threats and the imbalance between normal and anomalous traffic. To address these limitations, we propose a semi-supervised intrusion detection algorithm based on a generative adversarial network (GAN) with a Transformer backbone for network security in IoT devices. Because the diversity of network behavior leaves anomalous traffic heavily underrepresented, and because supervised algorithms struggle to detect unknown intrusions, we use only normal traffic as training data. By integrating the self-attention mechanism of Transformers, we leverage their ability to capture long-range dependencies in sequential data, enhancing the core capability of the GAN. The experimental results show that our algorithm achieves an F1-score of 95.2% and a false omission rate (FOR) of 10.7% on the CIC-IDS2017 dataset. On the Kitsune dataset, it attains an F1-score of 83.2% and a FOR of 15.8%. When deployed on actual vehicle devices, the algorithm maintained strong performance with a FOR of 13%, further supporting its practical applicability.

Keywords: network intrusion detection; imbalanced data; generative adversarial network; IoT network security

1. Introduction

With the rapid development of Internet of Things (IoT) technology, the number of IoT devices has grown exponentially, becoming deeply integrated into our daily lives. However, the widespread adoption of these devices has also introduced significant security challenges. According to “The 2024 IoT Security Landscape Report” [1], there are approximately 3.8 million smart homes globally, containing around 50 million IoT devices, which collectively generate over 9.1 billion cybersecurity incidents. These incidents have exposed numerous vulnerabilities and attack scenarios. As the number of smart devices continues to rise, cybersecurity concerns are becoming increasingly complex and critical. Therefore, effectively detecting and defending against potential network threats, particularly in the face of constantly evolving attack techniques and an expanding attack surface, is crucial for ensuring IoT security.

Intrusion detection systems (IDS), which are core tools in network security defense, play a critical role in identifying and responding to various network attacks. Traditional IDS methods are generally categorized into signature-based detection and anomaly-based detection. Signature-based IDS rely on predefined attack signatures and rule databases to detect intrusions by matching known patterns [2]. While effective against known threats, this method is limited in its ability to detect novel or variant attacks [3]. Additionally, maintaining and updating the rule sets can be costly and time-consuming. In contrast, anomaly-based IDS establish a baseline of normal network behavior and identify deviations from this baseline as anomalies [4]. However, this approach tends to generate a high false-positive rate in the dynamic and complex IoT environment, and its detection performance significantly deteriorates in the presence of imbalanced data.

To address these challenges, this paper proposes a semi-supervised intrusion detection algorithm based on a Transformer and a generative adversarial network: TransGAN-IDS. The algorithm is trained using only normal traffic data and is specifically designed to tackle issues such as data imbalance and the detection of unknown attacks. By incorporating the Transformer’s self-attention mechanism, the model captures long-range dependencies within traffic sequences, enhancing its ability to effectively identify potential anomalous behaviors [5]. Furthermore, the generator and discriminator in the GAN structure engage in a competitive process, further improving the model’s ability to detect subtle anomalies and increasing the precision of traffic generation and discrimination.

The main contributions of this paper include the following:

(1): We address the challenges that traditional intrusion detection methods face in IoT environments, such as data imbalance and difficulty detecting unknown attacks. Because the model is trained exclusively on normal traffic, it does not depend on labeled attack samples, which improves its robustness in detecting new and previously unseen attacks.
(2): We validate the effectiveness of the proposed algorithm through experiments on the CIC-IDS2017 and Kitsune datasets. The results show that the algorithm performs well on both datasets when handling complex network traffic and compares favorably with existing methods, particularly in recall.

2. Related Work

Today, a significant body of research has applied machine learning and deep learning techniques to address problems in network security. In [6], an intrusion detection algorithm based on convolutional neural networks (CNN) was proposed using a deep Maxout network for detection. The weights and training parameters were optimized using the Remora algorithm, ultimately achieving high accuracy. In [7], a method combining various types of recurrent neural networks (RNN) for intrusion detection was introduced. The model’s performance was evaluated on the NSL-KDD [8] and UNSW-NB15 [9] datasets, demonstrating its superior performance. However, these methods heavily depend on large amounts of data, which are often difficult to obtain in real-world applications due to challenges in collecting anomalous traffic data, leading to extreme data imbalance. As a result, such methods remain largely confined to academic research.

Generative adversarial networks (GANs) represent a crucial branch of deep learning, widely used for anomaly detection in computer vision. Some studies have utilized GANs as data processing tools to generate synthetic data that alleviate data imbalance. For example, Ref. [10] used a GAN to generate virtual data resembling existing data to balance the data distribution. However, the virtual data generated by GAN may not fully capture the complexity of real-world data, particularly when dealing with complex minority classes, which can negatively impact model performance. Additionally, GANs are prone to “mode collapse” during training, where the generator produces similar samples, failing to capture the diversity of the data [11]. Over-reliance on synthetic data can lead to overfitting, reducing the model’s generalization ability when applied to real-world data.

Some studies have also leveraged GAN for image defect detection. In [12], an unsupervised image anomaly detection method based on GAN was proposed, combining selective skip connections and hybrid attention mechanisms, with the Earth mover’s (EM) distance optimizing the loss function [13]. This approach effectively improved image reconstruction quality, training stability, and enhanced detection capabilities in high-speed railway contact networks.

Given the constant evolution of hacking techniques, current intrusion detection methods, which rely on known attack data, are insufficient for detecting unknown and novel cyberattacks. As a result, numerous studies have emerged employing unsupervised or semi-supervised approaches to tackle network security challenges [14,15,16]. For instance, ref. [17] introduced the GANomaly model, which is based on deep neural networks for semi-supervised learning. The model designs an adaptive algorithm to dynamically train subsets of datasets, and was validated on four adaptive datasets, achieving promising results on the NSL-KDD and UNSW-NB15 cybersecurity datasets. Similarly, ref. [18] proposed an unsupervised intrusion detection method for vehicular communication networks, combining the feature extraction capabilities of autoencoders with precise clustering via fuzzy C-means (FCM) [19,20]. This approach does not require labeled datasets for training and demonstrates broad generalization ability and robustness.

While these studies show promising results in detecting unknown and novel attacks, they still have some limitations. They struggle to capture long-range dependencies in network traffic, which makes it difficult to fully detect detailed characteristics in complex network attacks. Furthermore, when handling high-dimensional and complex IoT network traffic, these methods may fall short in achieving optimal feature representation. Although progress has been made in improving generalization and robustness, these methods continue to face challenges when confronted with highly complex and diverse attack scenarios.

3. Methodology

3.1. Dataset and Preprocessing

To validate the effectiveness of the proposed semi-supervised intrusion detection algorithm, two datasets were selected for experimental evaluation: CIC-IDS2017 and Kitsune. CIC-IDS2017 is a publicly available dataset widely used in network intrusion detection research, covering various types of normal traffic and malicious attack traffic, including DDoS, web attacks, brute force attacks, and more [21]. The dataset provides traffic records with over 70 features, offering a wide range of attack types and traffic behaviors. On the other hand, the Kitsune dataset offers 115 features and includes 9 distinct attack categories, such as OS scan, ARP MitM, and fuzzing. This provides a newer set of traffic patterns and behaviors for evaluation.

Figure 1 below illustrates the distribution of different traffic types and the data distribution after filtering and combining selected categories from the CIC-IDS2017 dataset.

During the data preprocessing stage, the data were first cleaned by removing redundant and missing values, followed by the elimination of infinite values. These anomalies are typically caused by errors during dataset collection and can hinder the model’s ability to converge effectively, weakening its generalization capacity and ultimately affecting its performance.

Since the features in the dataset have varying scales and ranges, features with larger numerical ranges tend to dominate the learning process, overshadowing those with smaller values. When using gradient descent optimization, the lack of feature normalization can result in an unusually steep cost function surface, particularly when there are significant differences in gradient magnitudes across dimensions. This can slow down the convergence process or make it difficult to find the global optimum, preventing the model from converging properly.

To address this, we applied min–max normalization to standardize the data features within the range of [0,1], improving the effectiveness of model training. The normalization formula is as follows:

x^{'} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

(1)

Here $x$ represents the original feature value of the traffic data in the dataset, $x^{'}$ denotes the normalized feature value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature.
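The cleaning and normalization steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and mapping a constant column to 0 is an assumption made here to avoid division by zero.

```python
import math

def clean_records(records):
    """Drop rows with missing (NaN) or infinite values and exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in records:
        if any(math.isnan(v) or math.isinf(v) for v in row):
            continue
        key = tuple(row)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(list(row))
    return cleaned

def min_max_normalize(records):
    """Scale every feature column to [0, 1] following Equation (1)."""
    if not records:
        return []
    columns = list(zip(*records))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    normalized = []
    for row in records:
        new_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # Assumption: a constant column is mapped to 0 instead of dividing by zero.
            new_row.append((value - lo) / span if span > 0 else 0.0)
        normalized.append(new_row)
    return normalized

raw = [
    [10.0, 200.0, 5.0],
    [20.0, float("inf"), 5.0],  # dropped: infinite value
    [10.0, 200.0, 5.0],         # dropped: duplicate of the first row
    [30.0, 400.0, 5.0],
    [float("nan"), 1.0, 5.0],   # dropped: missing value
]
cleaned = clean_records(raw)
scaled = min_max_normalize(cleaned)
print(scaled)
```

In practice the minimum and maximum would be computed on the training data and reused at test time, which is a design choice the text does not spell out.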

In this study, to fully leverage the self-attention mechanism of the Transformer for capturing long-range dependencies [22], we applied a specific transformation process to the preprocessed one-dimensional (1D) network traffic features. Each traffic record in the original dataset consists of a 78-dimensional 1D feature vector. While these features can be directly fed into traditional classification algorithms, models like Transformers, which excel at processing sequential and matrix data, would be limited in their ability to capture long-range dependencies if the 1D features were used directly. Therefore, we designed a feature transformation scheme to convert these 1D features into two-dimensional (2D) feature matrices, better suited for Transformer input formats and capable of revealing the intrinsic relationships within the network traffic features.

Specifically, for each traffic record’s 78-dimensional 1D feature vector, we transformed it into a 20 × 20 2D feature matrix. Since the original feature length is less than 400, we replicated the 1D data five times to obtain a 390-length vector and then applied zero-padding to fill the remaining matrix positions. The transformation process is illustrated in Figure 2. Similarly, in the Kitsune dataset, a 1 × 115 one-dimensional feature vector is replicated three times to form a 1 × 345 one-dimensional feature vector, which is then converted into a 20 × 20 feature matrix with zero padding.
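As a rough sketch of this replicate-and-pad transformation, the snippet below tiles a feature vector, appends zeros, and reshapes the result into a 20 × 20 matrix. The function name and the row-major fill order are assumptions, since Figure 2 is not reproduced here; the input vectors are synthetic placeholders.

```python
def to_feature_matrix(features, repeats, side=20):
    """Tile a 1D feature vector, zero-pad it, and reshape it to side x side."""
    tiled = list(features) * repeats
    total = side * side
    if len(tiled) > total:
        raise ValueError("tiled vector does not fit in the target matrix")
    padded = tiled + [0.0] * (total - len(tiled))
    # Assumption: the matrix is filled row by row.
    return [padded[r * side:(r + 1) * side] for r in range(side)]

# CIC-IDS2017: 78 features x 5 copies = 390 values, plus 10 zeros.
cic_vector = [i / 77 for i in range(78)]
cic_matrix = to_feature_matrix(cic_vector, repeats=5)

# Kitsune: 115 features x 3 copies = 345 values, plus 55 zeros.
kitsune_vector = [i / 114 for i in range(115)]
kitsune_matrix = to_feature_matrix(kitsune_vector, repeats=3)

print(len(cic_matrix), len(cic_matrix[0]))
```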

This feature transformation method, tailored to the characteristics of the Transformer, helps the algorithm model better capture features and also facilitates a more intuitive understanding of changes in data characteristics.

3.2. Algorithm Design

3.2.1. Network Model Structure

We designed a robust GAN model suitable for network traffic data detection scenarios. The model comprises a generator, a discriminator, and a noise encoder, all working together to perform anomaly detection on traffic data.

Generator: The structure of the generator is shown in Figure 3. It utilizes multiple layers of Transformer encoders, with each layer employing multi-head attention mechanisms to capture long-range dependencies within the network traffic data.

The original network traffic data are first mapped to a high-dimensional feature space through an input embedding layer. Position encoding is used to incorporate positional information, aiding the model in understanding the order and temporal dependencies of the data. The multi-head attention mechanism extracts important temporal information from the network traffic and handles various complex traffic patterns. A feed-forward network further performs linear transformations and nonlinear activations on the traffic features, enhancing the expressiveness of the generated synthetic traffic data. The output of the generator is synthetic data that closely resembles real network traffic. The training objective of the generator is to produce highly realistic network traffic to deceive the discriminator.

Discriminator: The structure of the discriminator is shown in Figure 4. The discriminator is responsible for distinguishing whether the input network traffic is real or generated by the generator. Its structure is similar to that of the generator and also utilizes multiple layers of Transformers.

Similarly, the input is processed through an embedding layer and positional encoding, maintaining consistency with the generator structure. The discriminator employs a six-layer Transformer encoder structure, using LayerNorm and a feed-forward network to extract features from the input traffic [23]. Finally, the discriminator’s output is passed through a Softmax activation function to generate classification probabilities for distinguishing between real and synthetic traffic.

Noise encoder: The structure of the noise encoder is shown in Figure 5. The noise encoder is responsible for generating the initial noise data input to the generator.

The structure of the noise encoder is a simple fully connected neural network that processes the input noise using a LeakyReLU activation function and maps it to a feature space suitable for the generator. This design enhances the diversity of the generator’s inputs, increasing the model’s generalization capability. The Tanh output layer is used to map the generated noise to a fixed range, providing input to the generator.

3.2.2. Overall Model Architecture

The overall system architecture is shown in Figure 6, where the generator (G), discriminator (D), and noise encoder (E) work together in a closed-loop GAN framework.

The original input data are first processed through an encoder for feature extraction, which transforms the feature matrix into a latent space vector, denoted as z. This latent vector is then fed into the generator, which uses it to produce a synthetic feature matrix. The goal of the generation process is to make the synthetic feature matrix as similar as possible to the real feature matrix.

The algorithm architecture includes two discriminators: one that assesses the authenticity of the generated synthetic feature matrix and another that evaluates the differences between the generated and the original feature matrices in the feature space. This dual discriminator mechanism enables the algorithm to learn not only the overall visual similarity of the synthetic feature matrix but also to more accurately align it at the feature level.

By jointly optimizing both pixel-level and feature-level losses and incorporating the dual discriminator mechanism, the system ensures that the generator produces high-quality features that are aligned with the real feature matrix, both visually and in terms of feature representation. The final evaluation of generation performance is carried out through the combined scores of ${Loss}_{Img}$ and ${Loss}_{Feat}$, guiding the continuous optimization process. The following outlines the complete implementation logic for the testing phase of the TransGAN-IDS model (Algorithm 1):

Algorithm 1 Testing Process for TransGAN-IDS

1: Input: X: a single sample from the test set
2: Output: Score: final test score
3: Params: BS: batch size; $L_{mat}$: loss between feature matrices; $L_{feat}$: loss between features; Test_Datasets: dataset collection; $G_{\theta}$: generator; $E_{\phi}$: encoder; $D_{w}$: discriminator; K: kappa weight adjustment parameter; MSE: mean squared error loss function
4: for BS in Test_Datasets do
5:     Step 1: Sample real data X from Test_Datasets
6:     Step 2: $Z \leftarrow E_{\phi}(X)$
7:     Step 3: $X^{'} \leftarrow G_{\theta}(Z)$
8:     Step 4: $Z^{'} \leftarrow E_{\phi}(X^{'})$
9:     Step 5: $F \leftarrow D_{w}(X)$
10:             $F^{'} \leftarrow D_{w}(X^{'})$
11:     Step 6: $L_{mat} \leftarrow MSE(X, X^{'})$
12:             $L_{feat} \leftarrow MSE(F, F^{'})$
13:     Step 7: Score $\leftarrow L_{mat} + K \cdot L_{feat}$
14: end for
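The scoring logic of Algorithm 1 can be illustrated with a short Python sketch. The encoder, generator, and discriminator below are toy stand-ins invented for this example (the real components are Transformer networks), and the weight 0.9 is the value the paper reports for kappa. How the score is thresholded into "normal" or "abnormal" is not specified in the text and is left out.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def anomaly_score(x, encoder, generator, discriminator_features, kappa=0.9):
    """Score = L_mat + kappa * L_feat, following Steps 2 to 7 of Algorithm 1."""
    z = encoder(x)                          # Step 2: latent code
    x_rec = generator(z)                    # Step 3: reconstruction
    f_real = discriminator_features(x)      # Step 5: features of the input
    f_rec = discriminator_features(x_rec)   #         features of the reconstruction
    l_mat = mse(x, x_rec)                   # Step 6: matrix-level loss
    l_feat = mse(f_real, f_rec)             #         feature-level loss
    return l_mat + kappa * l_feat           # Step 7: final score

# Hypothetical stand-ins that only "know" flat traffic around its mean value.
def toy_encoder(x):
    return [sum(x) / len(x)]

def toy_generator(z):
    return [z[0]] * 4

def toy_features(x):
    return [max(x), min(x)]

normal = [0.5, 0.5, 0.5, 0.5]
attack = [0.0, 1.0, 0.0, 1.0]
normal_score = anomaly_score(normal, toy_encoder, toy_generator, toy_features)
attack_score = anomaly_score(attack, toy_encoder, toy_generator, toy_features)
print(normal_score, attack_score)
```

Because the stand-ins reconstruct the flat sample perfectly and the oscillating one poorly, the second score is larger, which is the behavior the detection score relies on.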

3.2.3. Loss Function

We introduce an improved loss calculation method that enhances the training stability of the model through a gradient penalty mechanism. Traditional GANs may encounter issues such as vanishing or exploding gradients during training, which can hinder model convergence or result in suboptimal generation performance. To mitigate this issue, we generate new samples by interpolating between real and synthetic samples, and then calculate the gradient of the discriminator with respect to these interpolated samples, thereby introducing a gradient penalty term.

GAN Loss: Let $x_{real}$ be the real samples and $x_{fake}$ be the generated samples. First, generate a random weight $\alpha$ that matches the dimensions of the samples. This weight is used to perform linear interpolation between the real and generated samples, resulting in an interpolated sample $\hat{x}$:

\hat{x} = \alpha \cdot x_{real} + (1 - \alpha) \cdot x_{fake}

(2)

Next, the interpolated sample $\hat{x}$ is input into the discriminator D, yielding the discriminator output $D(\hat{x})$. The gradient of the discriminator output with respect to the interpolated sample, $\nabla_{\hat{x}} D(\hat{x})$, is then computed, and the $L_{2}$ norm of this gradient is calculated.

The design goal of the gradient penalty term (GP) is to ensure that the norm of the discriminator’s gradient is close to 1, thereby ensuring the smoothness of the gradients [24].
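The text does not write the penalty out explicitly. Assuming it follows the standard WGAN-GP formulation, with $\hat{x}_{i}$ the interpolated sample for the i-th pair and N the batch size, it would take the form:

```latex
gp = \frac{1}{N} \sum_{i = 1}^{N} \left( \left\lVert \nabla_{\hat{x}_{i}} D (\hat{x}_{i}) \right\rVert_{2} - 1 \right)^{2}
```

Each term is zero exactly when the gradient norm at the interpolated point equals 1, so minimizing it pushes the discriminator toward the smoothness property described above.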

The final discriminator loss $L_{D}$ is given by the following:

L_{D} = - \frac{1}{N} \sum_{i = 1}^{N} D (x_{real, i}) + \frac{1}{N} \sum_{i = 1}^{N} D (x_{fake, i}) + \lambda \cdot gp

(3)

where $gp$ is the gradient penalty term and $\lambda$ is a hyperparameter used to adjust its weight; in this study, $\lambda$ is set to 10.

The loss for the generator, $L_{G}$, is computed as follows:

L_{G} = - \frac{1}{N} \sum_{i = 1}^{N} D (G (z_{i}))

(4)

where $z_{i}$ represents the random noise input to the generator.
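To make Equations (2) to (4) concrete, the sketch below evaluates both losses for a toy linear critic D(x) = w · x. Everything here is an illustrative assumption: the real discriminator is a Transformer whose gradient would be obtained by automatic differentiation, the penalty is taken to be the squared deviation of the gradient norm from 1 (the usual WGAN-GP form), and a single scalar interpolation weight per sample is used for simplicity.

```python
import math
import random

def critic(w, x):
    """Toy linear critic D(x) = w . x, standing in for the real discriminator."""
    return sum(wi * xi for wi, xi in zip(w, x))

def interpolate(x_real, x_fake, alpha):
    """Equation (2): a point between a real and a generated sample."""
    return [alpha * r + (1 - alpha) * f for r, f in zip(x_real, x_fake)]

def critic_grad_norm(w, x_hat):
    """L2 norm of dD/dx_hat; for a linear critic the gradient is w for any x_hat."""
    return math.sqrt(sum(wi * wi for wi in w))

def discriminator_loss(w, real_batch, fake_batch, lam=10.0, rng=random):
    """Equation (3) with an assumed penalty of (gradient norm - 1) squared."""
    n = len(real_batch)
    penalties = []
    for x_real, x_fake in zip(real_batch, fake_batch):
        x_hat = interpolate(x_real, x_fake, rng.random())
        penalties.append((critic_grad_norm(w, x_hat) - 1.0) ** 2)
    gp = sum(penalties) / n
    real_term = sum(critic(w, x) for x in real_batch) / n
    fake_term = sum(critic(w, x) for x in fake_batch) / n
    return -real_term + fake_term + lam * gp

def generator_loss(w, fake_batch):
    """Equation (4): the generator tries to raise the critic's score on fakes."""
    return -sum(critic(w, x) for x in fake_batch) / len(fake_batch)

real = [[1.0, 1.0], [2.0, 0.0]]
fake = [[0.0, 0.0], [1.0, 0.0]]
w_unit = [0.6, 0.8]   # gradient norm 1: the penalty vanishes
w_steep = [3.0, 4.0]  # gradient norm 5: the penalty dominates
d_loss_unit = discriminator_loss(w_unit, real, fake)
d_loss_steep = discriminator_loss(w_steep, real, fake)
g_loss = generator_loss(w_unit, fake)
print(d_loss_unit, d_loss_steep, g_loss)
```

With the steeper critic the penalty term outweighs the gain in separating real from fake samples, which illustrates how the term discourages large gradients.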

Encoder Loss: Using pre-trained generator and discriminator models, train the encoder model with preprocessed two-dimensional feature matrix data from normal traffic conditions.

Feed the real feature matrix into the encoder network to obtain the generated noise z. Then, input the noise z into the trained generator model to produce the generated “fake” feature matrix fake_matrix.

Finally, input both the generated feature matrix and the real data into the trained discriminator to obtain fake_feature and real_feature, respectively. The encoder loss is calculated as follows:

L_{mat} = MSE (G (E (x_{real})), x_{real})

(5)

L_{feat} = MSE (D (G (E (x_{real}))), D (x_{real}))

(6)

L_{E} = L_{mat} + \kappa \cdot L_{feat}

(7)

where $L_{mat}$ is the loss between the real data and the generated data, $L_{feat}$ represents the loss between the discriminator’s evaluation results of the real data and the generated data, and $L_{E}$ is the total loss of the encoder, which also serves as the final detection score of the entire algorithm model. $\kappa$ (written K in Algorithm 1) is a hyperparameter controlling the importance of $L_{feat}$; in this study, it is set to 0.9.

4. Experiments

In this section, we test the proposed algorithm model and evaluate its performance and feasibility. First, we present the experimental environment, including the configuration parameters and dataset description. Next, we describe the data preprocessing process and evaluation metrics. Finally, we validate the feasibility of the proposed method through comparative analysis of different experimental results.

4.1. Experimental Environment

The algorithm model we designed was trained and tested under the following configuration (Table 1):

4.2. Dataset and Evaluation Metrics

Dataset: We used the CIC-IDS2017 and Kitsune datasets for model training and evaluation. The following table presents the network traffic types and data volume after preprocessing (Table 2):

Our model is trained using only benign (normal) samples, so there is no need to split the attack traffic data into training and validation sets. For the Benign class, we randomly select two million samples as the training set. During validation, for each attack type we randomly select, from the remaining Benign data, a number of Benign samples equal to the number of samples of that attack type. For example, in the CIC-IDS2017 dataset, since the BruteForce attack has 13,832 samples, we randomly select 13,832 samples from the remaining 271,285 Benign samples to form the validation set along with the BruteForce data. This process is repeated for each attack type.
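The balanced validation sampling described above can be sketched as follows. The function name, the fixed seed, and the label convention (0 for benign, 1 for attack) are assumptions for illustration, and the toy pool sizes stand in for the real dataset counts.

```python
import random

def build_validation_set(benign_pool, attack_samples, seed=0):
    """Pair all samples of one attack type with equally many held-out benign samples."""
    rng = random.Random(seed)
    # Sampling without replacement from the benign data not used for training.
    benign_subset = rng.sample(benign_pool, len(attack_samples))
    samples = benign_subset + list(attack_samples)
    labels = [0] * len(benign_subset) + [1] * len(attack_samples)
    return samples, labels

# Toy stand-ins for the held-out Benign pool and one attack category.
benign_pool = [("benign", i) for i in range(1000)]
brute_force = [("bruteforce", i) for i in range(50)]
samples, labels = build_validation_set(benign_pool, brute_force)
print(len(samples), labels.count(0), labels.count(1))
```

Calling the function once per attack type yields one balanced validation set per category, mirroring the per-attack experiments.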

Evaluation Metrics: We used the following evaluation metrics to assess the performance of the designed model (Figure 7):

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(8)

Precision = \frac{TP}{TP + FP}

(9)

Recall = \frac{TP}{TP + FN}

(10)

F1_{score} = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}

(11)

FOR = \frac{FN}{FN + TN}

(12)

TP (true positive) represents the number of attacks correctly detected by the model, while TN (true negative) refers to the number of normal traffic instances correctly identified by the model. FP (false positive) is the number of normal traffic instances incorrectly classified as attacks, and FN (false negative) is the number of attacks that the model failed to identify [25].
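Equations (8) to (12) translate directly into code. In the sketch below, attacks are the positive class (label 1), matching the definitions above; returning 0.0 for an empty denominator is an assumption made for this example, and the confusion counts are invented numbers, not results from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with attack = 1 (positive) and normal = 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(numerator, denominator):
    """Assumption: report 0.0 when a metric's denominator is empty."""
    return numerator / denominator if denominator else 0.0

def evaluate(tp, tn, fp, fn):
    """Compute the metrics of Equations (8) to (12)."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "for": safe_div(fn, fn + tn),  # share of "normal" verdicts that hide an attack
    }

# Invented example: 100 attacks and 100 benign flows.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 90 + [0] * 10 + [1] * 20 + [0] * 80
counts = confusion_counts(y_true, y_pred)
results = evaluate(*counts)
print(counts, results)
```

The FOR is computed over everything the model labels as normal, so a low value means that traffic passed as benign rarely conceals an attack.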

4.3. Model Evaluation

Our algorithm model is constructed based on normal traffic data and cannot specifically identify individual attack types during actual testing. Therefore, the experiment primarily uses a binary classification approach of “Normal” versus “Abnormal”. To further validate the model’s effectiveness, we conducted independent binary classification experiments for each of the five attack types shown in Figure 8 (DoS, PortScan, DDoS, BruteForce, WebAttack) against normal traffic (Benign). Additionally, we combined these five attack types into a single “Anomalous” category and conducted an overall binary classification experiment against normal traffic. This experimental design allows for a comprehensive evaluation of the model’s performance in detecting normal traffic versus various types of anomalous traffic.

Our TransGAN-IDS model demonstrated strong detection capabilities across different attack types, particularly excelling in FOR. It effectively identifies significant anomalies in traffic, such as DoS and DDoS attacks, showcasing its advantage in handling large-scale traffic anomalies. However, for attacks with relatively subtle features, such as BruteForce and WebAttack, while recall rates remain high, improvements in precision are still needed (Figure 9).

Specifically, while the model achieved good accuracy and recall across most attack categories in the Kitsune dataset, attacks like ARP MitM exhibited relatively higher FOR values, indicating a more noticeable issue with false negatives for these types. This points to the need for further refinement, especially in reducing the FOR and improving the precision for these more challenging attack scenarios. Moving forward, enhancing detection capability in these areas will be a primary focus, as we aim to achieve a more balanced performance across all metrics.

In Figure 10, we validate the effectiveness of our proposed algorithm by comparing it with several other intrusion detection algorithms related to the CIC-IDS2017 dataset, including DI-NIDS [26], M-CNN [27], Method [28], SGAN-IDS [29], and Method [30].

In the experimental comparison, our proposed TransGAN-IDS algorithm demonstrated significant advantages across multiple metrics, particularly in recall rate. With a recall rate of up to 0.995, TransGAN-IDS clearly outperforms other methods such as DI-NIDS, M-CNN, and SGAN-IDS, highlighting its superior accuracy in detecting actual intrusion behaviors, especially under conditions of data imbalance. The high recall rate indicates that TransGAN-IDS effectively reduces false negatives, which is crucial for intrusion detection systems, ensuring that most attacks are promptly identified.

Although some methods, such as SGAN-IDS, show slightly better performance in precision and F1-score, TransGAN-IDS maintains an impressive precision of 0.912 and an F1-score of 0.952, while still demonstrating exceptional recall capabilities. This balance underscores the robustness of our semi-supervised approach, which leverages both the Transformer and GAN architectures to handle complex and dynamic intrusion patterns, making it particularly well-suited for the evolving threat landscape in IoT environments.

4.4. Intrusion Detection System Design

To evaluate the effectiveness of the proposed algorithm in a realistic setting, we utilized an in-vehicle infotainment system manufactured by an automaker to collect traffic data. A diagram of the car’s central control system is shown in Figure 11. A Linux-based server was used for data processing and was wirelessly connected to the infotainment system for receiving and analyzing the data. Additionally, a host running KaliOS was set up to simulate a hacker launching a network attack on the infotainment system.

The in-vehicle infotainment system deploys a packet capture service based on Libpcap, actively collecting data flow information from various components connected to the system. This includes data from the in-vehicle communication system, entertainment system, and in-car control units. The captured data are then processed to extract relevant features, generating input data for the algorithm, which are uploaded to the cloud for analysis and identification by the computing server. The overall simulation framework is shown in Figure 12.

The generator, discriminator, and encoder models trained on the CIC-IDS2017 normal traffic data are integrated into the semi-supervised detection module of the framework. After processing the traffic data extracted by Libpcap from the in-vehicle infotainment system using the preprocessing methods described earlier, the data are fed into the semi-supervised detection module. The overall packet capture analysis is shown in Figure 13.

In the initial test experiments, we observed anomalies in the relevant indicators of various types of attacks, which fell far short of the expected results. Further verification revealed differences between the real-world vehicular network environment and the open-source datasets. These differences include aspects such as communication protocols, hardware devices, and data distribution biases. This suggests that open-source datasets may not always be effective for practical applications.

To more accurately assess the effectiveness of our proposed algorithm in real-world scenarios, we replaced the CIC-IDS2017 dataset with traffic data collected from simulations based on typical user behaviors in an in-vehicle infotainment system. The training dataset was generated from traffic data corresponding to real-world activities such as map navigation, online videos, and music streaming. The distribution of the collected real traffic data for training is shown in Table 3.

Online detection does not lend itself to quantitative analysis of the traffic data, which could lead to experimental errors. For instance, when an attack is launched in real time while the infotainment system is using navigation software, the data would contain both normal and abnormal traffic, making it difficult to perform accurate statistics. Therefore, we collected traffic data for testing in the same manner, excluding normal traffic data. We also used a KaliOS machine to launch common network attacks, such as DoS and DDoS, against the infotainment system, and collected these abnormal data into CSV files for later offline testing. The distribution of the collected real-world test data is shown in Table 4.

Based on the test results, we observe that on the CIC-IDS2017 dataset, the model achieved a precision of 0.678, recall of 0.744, F1-score of 0.704, and a FOR of 0.282. In contrast, when tested on real-world data, the model improved markedly, with a precision of 0.808, recall of 0.884, F1-score of 0.838, and a reduced FOR of 0.130. These results demonstrate that the TransGAN-IDS model performs better on real-world data than on the CIC-IDS2017 dataset, highlighting its effectiveness and lower FOR in practical applications. This further supports the model’s potential value in real-world scenarios (Figure 14).

5. Conclusions

In this paper, we propose a semi-supervised intrusion detection algorithm, TransGAN-IDS, which combines Transformer and GAN technologies. This algorithm aims to enhance intrusion detection capabilities in IoT environments by leveraging these advanced techniques to improve model accuracy and robustness when handling complex and high-dimensional data. TransGAN-IDS integrates the powerful feature extraction capabilities of Transformers with the generative abilities of GANs, thereby establishing a robust detection system. We employ various data preprocessing techniques, such as data cleaning and normalization, and transform the data into matrix features to optimize the model’s training process and detection performance.

Experimental results show that TransGAN-IDS excels across various performance metrics, particularly demonstrating outstanding capability in detecting rare or novel attacks, thus highlighting its potential value in IoT environments. Although TransGAN-IDS has achieved significant results, there is still room for further improvement. Future research will focus on several key areas: first, exploring more advanced Transformer architectures and GAN variants to enhance model performance and adaptability; second, investigating methods to strengthen model robustness through adversarial training, improving its defensive capabilities; and third, examining ways to design a lightweight version of TransGAN-IDS for deployment on devices with lower computational power.

In addition to improving the detection capabilities of TransGAN-IDS, future work will explore integrating complementary security measures, such as fuzz testing, to enhance overall IoT device security. Techniques like Snipuzz [32], which uses message snippet inference to fuzz IoT firmware, and mGPTFuzz [31], which leverages large language models to fuzz Matter devices, could be combined with our intrusion detection system to identify vulnerabilities beyond network-based attacks. By incorporating such methods, we aim to build a more comprehensive defense strategy that addresses both known and unknown threats and adapts to evolving cyber threats.

Author Contributions

Research on semi-supervised algorithm techniques, G.W.; research on Transformer algorithms, K.X. and M.L.; key algorithm analysis and design, M.L.; test verification, L.L. and K.X.; visualization, Y.W.; writing—review and editing, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CIC-IDS2017 and Kitsune datasets used in this study are publicly available. The CIC-IDS2017 dataset can be accessed at https://www.unb.ca/cic/datasets/ids-2017.html (accessed on 24 June 2024), and the Kitsune dataset is available at https://archive.ics.uci.edu/dataset/516/kitsune+network+attack+dataset (accessed on 12 January 2025). The in-vehicle dataset collected in collaboration with a car company is not publicly available because of the commercial partnership.

Acknowledgments

We would like to express our heartfelt thanks to anonymous reviewers and editors for their constructive comments on the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Bitdefender. The 2024 IoT Security Landscape Report. Available online: https://www.bitdefender.com (accessed on 26 August 2024).
2. Díaz-Verdejo, J.; Muñoz-Calle, J.; Estepa Alonso, A.; Estepa Alonso, R.; Madinabeitia, G. On the detection capabilities of signature-based intrusion detection systems in the context of web attacks. Appl. Sci. 2022, 12, 852.
3. Ahmad, R.; Alsmadi, I.; Alhamdani, W.; Tawalbeh, L.A. Zero-day attack detection: A systematic literature review. Artif. Intell. Rev. 2023, 56, 10733–10811.
4. Maseer, Z.K.; Yusof, R.; Bahaman, N.; Mostafa, S.A.; Foozy, C.F.M. Benchmarking of machine learning for anomaly based intrusion detection systems in the CICIDS2017 dataset. IEEE Access 2021, 9, 22351–22370.
5. Hao, Y.; Dong, L.; Wei, F.; Xu, K.; Zhang, S. Self-attention attribution: Interpreting information interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 12963–12971.
6. Pingale, S.V.; Sutar, S.R. Remora based Deep Maxout Network model for network intrusion detection using Convolutional Neural Network features. Comput. Electr. Eng. 2023, 110, 108831.
7. Kasongo, S.M. A deep learning technique for intrusion detection system using a Recurrent Neural Networks based framework. Comput. Commun. 2023, 199, 113–125.
8. Eshak Magdy, M.; Matter, A.M.; Hussin, S.; Hassan, D.; Elsaid, S. A Comparative study of intrusion detection systems applied to NSL-KDD Dataset. Egypt. Int. J. Eng. Sci. Technol. 2023, 43, 88–98.
9. Türk, F. Analysis of intrusion detection systems in UNSW-NB15 and NSL-KDD datasets with machine learning algorithms. Bitlis Eren Üniversitesi Fen Bilim. Derg. 2023, 12, 465–477.
10. Lee, J.H.; Park, K.H. GAN-based imbalanced data intrusion detection system. Pers. Ubiquitous Comput. 2021, 25, 121–128.
11. Thanh-Tung, H.; Tran, T. Catastrophic forgetting and mode collapse in GANs. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–10.
12. Wang, S.; Zou, Q.; Gao, B. SCA-GANomaly: An unsupervised anomaly detection model of high-speed railway catenary components. Multimed. Tools Appl. 2024, 83, 88919–88947.
13. Zhang, C.; Cai, Y.; Lin, G.; Shen, C. DeepEMD: Few-shot image classification with differentiable earth mover’s distance and structured classifiers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12203–12213.
14. Singh, A.; Jang-Jaccard, J. Autoencoder-based unsupervised intrusion detection using multi-scale convolutional recurrent networks. arXiv 2022, arXiv:2204.03779.
15. Narasimhan, H.; Ravi, V.; Mohammad, N. Unsupervised deep learning approach for in-vehicle intrusion detection system. IEEE Consum. Electron. Mag. 2021, 12, 103–108.
16. Mvula, P.K.; Branco, P.; Jourdan, G.V.; Viktor, H.L. A Survey on the Applications of Semi-supervised Learning to Cyber-security. ACM Comput. Surv. 2024, 56, 1–41.
17. Han, Y.; Chang, H. XA-GANomaly: An explainable adaptive semi-supervised learning method for intrusion detection using GANomaly. Comput. Mater. Contin. 2023, 76, 221–237.
18. Kabilan, N.; Ravi, V.; Sowmya, V. Unsupervised intrusion detection system for in-vehicle communication networks. J. Saf. Sci. Resil. 2024, 5, 119–129.
19. Krasnov, D.; Davis, D.; Malott, K.; Chen, Y. Fuzzy c-means clustering: A review of applications in breast cancer detection. Entropy 2023, 25, 1021.
20. Hashemi, S.E.; Gholian-Jouybari, F.; Hajiaghaei-Keshteli, M. A fuzzy C-means algorithm for optimizing data clustering. Expert Syst. Appl. 2023, 227, 120377.
21. Rosay, A.; Cheval, E.; Carlier, F.; Leroux, P. Network intrusion detection: A comprehensive analysis of CIC-IDS2017. In Proceedings of the 8th International Conference on Information Systems Security and Privacy, Online, 9–11 February 2022; SCITEPRESS-Science and Technology Publications: Setúbal, Portugal, 2022; pp. 25–36.
22. Jin, X.; Zhou, J.; Rao, Y.; Zhang, X.; Zhang, W.; Ba, W.; Zhou, X.; Zhang, T. An innovative approach for integrating two-dimensional conversion of Vis-NIR spectra with the Swin Transformer model to leverage deep learning for predicting soil properties. Geoderma 2023, 436, 116555.
23. ValizadehAslani, T.; Liang, H. LayerNorm: A key component in parameter-efficient fine-tuning. arXiv 2024, arXiv:2403.20284.
24. Ray, D.; Murgoitio-Esandi, J.; Dasgupta, A.; Oberai, A.A. Solution of physics-based inverse problems using conditional generative adversarial networks with full gradient penalty. Comput. Methods Appl. Mech. Eng. 2023, 417, 116338.
25. Valero-Carreras, D.; Alcaraz, J.; Landete, M. Comparing two SVM models through different metrics based on the confusion matrix. Comput. Oper. Res. 2023, 152, 106131.
26. Layeghy, S.; Baktashmotlagh, M.; Portmann, M. DI-NIDS: Domain invariant network intrusion detection system. Knowl.-Based Syst. 2023, 273, 110626.
27. Yin, X.; Chen, L. Network Intrusion Detection Method Based on Multi-Scale CNN in Internet of Things. Mob. Inf. Syst. 2022, 2022, 8124831.
28. Aktar, S. Network Intrusion Detection Using a Deep Learning Approach. Master’s Thesis, The University of New Orleans, New Orleans, LA, USA, 2022.
29. Aldhaheri, S.; Alhuzali, A. SGAN-IDS: Self-attention-based generative adversarial network against intrusion detection systems. Sensors 2023, 23, 7796.
30. Kaliyaperumal, P.; Periyasamy, S.; Thirumalaisamy, M.; Balamurugan, B. A Novel Hybrid Unsupervised Learning Approach for Enhanced Cybersecurity in the IoT. Future Internet 2024, 16, 253.
31. Ma, X.; Luo, L.; Zeng, Q. From One Thousand Pages of Specification to Unveiling Hidden Bugs: Large Language Model Assisted Fuzzing of Matter IoT Devices. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 4783–4800.
32. Feng, X.; Sun, R.; Zhu, X.; Xue, M.; Wen, S.; Liu, D.; Nepal, S.; Xiang, Y. Snipuzz: Black-box fuzzing of IoT firmware via message snippet inference. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual Event, 15–19 December 2021; pp. 337–350.

Figure 1. DoS GoldenEye, DoS Hulk, DoS Slowhttptest, and DoS Slowloris in (left) were merged into a single category labeled “DoS”. Similarly, FTP-Patator and SSH-Patator were combined into the “Brute Force” category, and web attack—brute force, web attack—XSS, and web attack—SQL Injection were grouped under “Web Attack”. This resulted in six traffic types, as shown in (right).
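The label merging described in the Figure 1 caption can be sketched as a simple lookup table. This is a minimal illustration, not the authors' preprocessing code: the raw label spellings follow the caption (with plain hyphens) and may differ from the strings in the actual CIC-IDS2017 CSV files, and `MERGED_LABEL` and `merge_label` are hypothetical names. Labels not named in the caption are passed through unchanged.

```python
from collections import Counter

# Raw label -> merged category, following the Figure 1 caption.
# Assumption: spellings are illustrative; check them against the CSV files.
MERGED_LABEL = {
    "DoS GoldenEye": "DoS",
    "DoS Hulk": "DoS",
    "DoS Slowhttptest": "DoS",
    "DoS Slowloris": "DoS",
    "FTP-Patator": "Brute Force",
    "SSH-Patator": "Brute Force",
    "Web Attack - Brute Force": "Web Attack",
    "Web Attack - XSS": "Web Attack",
    "Web Attack - SQL Injection": "Web Attack",
}


def merge_label(raw: str) -> str:
    """Map a raw traffic label to its merged category.
    Labels that are not in the table are returned unchanged."""
    cleaned = raw.strip()
    return MERGED_LABEL.get(cleaned, cleaned)


# Hypothetical sample of raw labels.
labels = ["DoS Hulk", "SSH-Patator", "BENIGN", "Web Attack - XSS", " DoS Slowloris"]
merged = [merge_label(label) for label in labels]
counts = Counter(merged)
print(merged)
print(counts)
```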

Figure 2. The 1 × 78 one-dimensional feature vector is replicated five times and concatenated to form a 1 × 390 one-dimensional feature vector. This is then transformed into a 20 × 20 feature matrix, with any remaining positions filled with zeros.
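As a concrete illustration of the transformation in the Figure 2 caption, the sketch below tiles a 78-element feature vector five times (390 values), pads it with zeros to 400 values, and reshapes it into a 20 × 20 matrix. It assumes row-major filling, which the caption does not specify, and `to_feature_matrix` is a hypothetical helper name.

```python
def to_feature_matrix(features, repeats=5, side=20):
    """Replicate a 1-D feature vector, zero-pad it, and reshape it
    into a side x side matrix (row-major order is an assumption)."""
    tiled = list(features) * repeats          # 78 * 5 = 390 values
    capacity = side * side                    # 20 * 20 = 400 cells
    if len(tiled) > capacity:
        raise ValueError("replicated vector does not fit in the matrix")
    padded = tiled + [0.0] * (capacity - len(tiled))  # fill the rest with zeros
    return [padded[row * side:(row + 1) * side] for row in range(side)]


# Hypothetical sample with 78 features valued 1.0 .. 78.0.
sample = [float(i) for i in range(1, 79)]
matrix = to_feature_matrix(sample)
print(len(matrix), len(matrix[0]))  # matrix dimensions
print(matrix[19])                   # last row: ten feature values, then ten zeros
```

With the default parameters, only the last ten cells of the final row are padding, so almost the whole matrix carries feature information.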

Figure 3. Generator network structure.

Figure 4. Discriminator network structure.

Figure 5. Noise encoder network structure.

Figure 6. Overall architecture of TransGAN-IDS.

Figure 7. Confusion matrix.

Figure 8. Test results of the algorithm model for DoS, PortScan, DDoS, BruteForce, WebAttack, and overall testing results.

Figure 9. The test results of the algorithm model for 9 attack types, including Fuzzing, Mirai, OS Scan, SSDP Flood, and ARP MitM, as well as the overall test results.

Figure 10. Comparison of performance between our algorithm and other related algorithms on the CIC-IDS2017 dataset [26,27,28,29,30].

Figure 11. Car center console.

Figure 12. The edge-end intrusion detection component is deployed on the in-vehicle infotainment system, where it integrates a supervised learning-based intrusion detection algorithm for preliminary real-time detection at the inference layer. On the cloud side, unknown intrusion detection is deployed on the Linux server, using our TransGAN-IDS to perform intrusion detection and predict unknown network attacks based on the traffic data periodically uploaded from the edge-side infotainment system. The results are visualized and analyzed via a web interface.

Figure 13. The system first discovers and activates network devices, setting up filters to capture network packets and process them individually. During packet parsing, the system extracts key information such as protocol type, source/target ports, and TCP/UDP flags, and calculates payload size. The parsed data then undergo feature engineering before being uploaded to the database. Finally, the TransGAN-IDS model performs intrusion detection, allowing for individual and combined detection outputs.
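The packet parsing step in the Figure 13 caption can be sketched with Python's standard `struct` module. This is a minimal illustration under stated assumptions, not the deployed implementation: it covers only header parsing of an already captured packet, assumes the bytes start at the IPv4 header (no link-layer header), and `parse_ipv4_packet` is a hypothetical name.

```python
import struct

PROTOCOLS = {6: "TCP", 17: "UDP"}


def parse_ipv4_packet(packet: bytes) -> dict:
    """Extract protocol, ports, TCP flags and payload size
    from a raw IPv4 packet (no Ethernet header)."""
    if len(packet) < 20:
        raise ValueError("truncated IPv4 header")
    if packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    ip_header_len = (packet[0] & 0x0F) * 4           # IHL is in 32-bit words
    (total_len,) = struct.unpack_from("!H", packet, 2)
    proto = packet[9]
    info = {
        "protocol": PROTOCOLS.get(proto, str(proto)),
        "src_port": None,
        "dst_port": None,
        "tcp_flags": None,
        "payload_size": total_len - ip_header_len,
    }
    if proto == 6:  # TCP
        src, dst = struct.unpack_from("!HH", packet, ip_header_len)
        offset_byte, flags = struct.unpack_from("!BB", packet, ip_header_len + 12)
        tcp_header_len = (offset_byte >> 4) * 4      # data offset in 32-bit words
        info.update(src_port=src, dst_port=dst, tcp_flags=flags,
                    payload_size=total_len - ip_header_len - tcp_header_len)
    elif proto == 17:  # UDP
        src, dst, udp_len = struct.unpack_from("!HHH", packet, ip_header_len)
        info.update(src_port=src, dst_port=dst, payload_size=udp_len - 8)
    return info


# Hand-built example packet: 20-byte IPv4 header, 20-byte TCP header, 5-byte payload.
payload = b"hello"
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0,
                        64, 6, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
tcp_header = struct.pack("!HHIIBBHHH", 40000, 443, 0, 0, 5 << 4, 0x18, 1024, 0, 0)
info = parse_ipv4_packet(ip_header + tcp_header + payload)
print(info)
```

The resulting dictionary corresponds to the fields named in the caption; the feature engineering and database upload steps are outside the scope of this sketch.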

Figure 14. Test results using different datasets in real-world in-vehicle scenarios.

Table 1. Hardware and software configuration for training and testing the algorithm model.

CPU	GPU	OS	CUDA	PyTorch
Intel i5-12400	NVIDIA A4500	Ubuntu 22.04	11.8	2.0.0

Table 2. Description and data volume of network traffic types after preprocessing for CIC-IDS2017 and Kitsune datasets.

Dataset	Attack Type	Data Volume
CIC-IDS2017	Benign	2,271,285
	DoS	251,701
	PortScan	158,804
	DDoS	128,019
	BruteForce	13,832
	WebAttack	2180
Kitsune	OS Scan	1,697,851
	Fuzzing	2,244,139
	Video Injection	2,472,401
	ARP MitM	2,504,267
	Active Wiretap	4,554,925
	SSDP Flood	4,077,266
	SYN DoS	2,771,276
	SSL Renegotiation	6,084,492
	Mirai	764,137

Table 3. Training Set: network traffic data captured using Libpcap from the normal use of in-vehicle infotainment applications, exported to a CSV dataset for subsequent training.

Scenario	Quantity	Proportion
Map Navigation	129,832	23%
Online Music	108,867	19.4%
Online Video	139,195	24.8%
Instant Messaging	87,892	15.6%
Web Browsing	96,561	17.2%

Table 4. Normal traffic data collected from the use of in-vehicle infotainment applications and abnormal traffic data collected by launching attacks from Kali Linux. The infotainment applications were stopped while the abnormal data were collected.

Scenario	Quantity	Proportion
Map Navigation	5047	10.8%
Online Music	6363	13.7%
Online Video	5489	11.8%
Instant Messaging	4927	10.6%
Web Browsing	3876	8.3%
DoS	7893	17%
DDoS	7037	15.1%
Web Attack	5889	12.7%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, M.; Luo, L.; Xiao, K.; Wang, G.; Wang, Y. Adaptive Semi-Supervised Algorithm for Intrusion Detection and Unknown Attack Identification. Appl. Sci. 2025, 15, 1709. https://doi.org/10.3390/app15041709

