Class-Specific GAN Augmentation for Imbalanced Intrusion Detection: A Comparative Study Using the UWF-ZeekData22 Dataset

Debelie, Asfaw; Bagui, Sikha S.; Mink, Dustin; Bagui, Subhash C.

doi:10.3390/fi18040200

¹ Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA
² Department of Cybersecurity, The University of West Florida, Pensacola, FL 32514, USA
³ Department of Mathematics and Statistics, The University of West Florida, Pensacola, FL 32514, USA
* Author to whom correspondence should be addressed.

Future Internet 2026, 18(4), 200; https://doi.org/10.3390/fi18040200

Submission received: 6 March 2026 / Revised: 1 April 2026 / Accepted: 8 April 2026 / Published: 10 April 2026

(This article belongs to the Special Issue Intrusion Detection and Resiliency in Cyber-Physical Systems and Networks—2nd Edition)

Abstract

Extreme class imbalance is a persistent obstacle for machine learning-driven intrusion detection, as rare but high-impact cyberattacks occur far less frequently than benign traffic in training data. In many real-world cybersecurity datasets, this imbalance becomes extreme, with certain attack types containing a handful of samples, effectively placing the problem in a few-shot learning regime. This paper presents a controlled benchmarking study of Generative Adversarial Network (GAN) objectives for synthesizing minority-class cyberattack data. Using the UWF-ZeekData22 network traffic dataset, each MITRE ATT&CK tactic is framed as a separate binary detection task, and tactic-specific GANs are trained solely on minority samples to generate synthetic attack records. Four widely used GAN variants—Vanilla GAN, Conditional GAN (cGAN), Wasserstein GAN (WGAN), and Wasserstein GAN with Gradient Penalty (WGAN-GP)—are compared under unified training steps and fixed augmentation conditions. The utility of generated data is assessed by evaluating downstream detection performance using five traditional classifiers: Logistic Regression, Support Vector Machine, k-Nearest Neighbors, Decision Tree, and Random Forest. The results indicate that GAN augmentation generally strengthens minority-class detection across tactics and models, reducing false negatives and improving recall consistency, while not systematically harming majority-class performance. However, the effectiveness of each GAN objective varies significantly with data sparsity. Specifically, simpler adversarial objectives often outperform more complex architectures by preserving discriminative feature structure, while heavily regularized models may overly smooth minority-class distributions and reduce separability. Wasserstein-based objectives provide improved training stability, but additional regularization does not consistently translate to better detection performance. 
Overall, the results demonstrate that in extreme-imbalance settings, GAN effectiveness is governed more by data sparsity and structure preservation than by architectural complexity. These findings establish class-specific generative augmentation as a practical strategy for intrusion detection and provide empirical guidance for selecting appropriate GAN objectives for tabular cybersecurity data under highly imbalanced conditions.

Keywords: cybersecurity; intrusion detection; class imbalance; generative adversarial networks (GAN); class-specific data augmentation; minority-class learning; Vanilla GAN; Conditional GAN (cGAN); Wasserstein GAN (WGAN); MITRE ATT&CK; network traffic analysis

1. Introduction

Generative Adversarial Networks (GANs) have emerged as a promising approach for synthesizing minority-class samples in highly imbalanced cybersecurity datasets [1,2,3,4,5]. By generating realistic representations of rare attack behaviors, GANs address one of the most persistent challenges in intrusion detection: the scarcity of labeled data for infrequent yet high-impact cyber threats [6,7,8]. In real-world network traffic, benign activities overwhelmingly dominate, while malicious behaviors occur more sparsely. This imbalance causes machine learning-based intrusion detection systems (IDSs) to favor majority-class predictions, often leading to elevated false-negative rates for the attacks that are most critical to detect.

Within the broader machine learning literature, numerous GAN architectures have been proposed to address limitations of the original formulation, including training instability and mode collapse [9,10,11]. Conditional GANs (cGANs) incorporate label information to guide the generation process toward class-specific distributions [12], while Wasserstein GANs (WGANs) replace the original divergence-based objective with a smoother distance metric to improve gradient stability and convergence behavior [13]. Wasserstein GANs with Gradient Penalty (WGAN-GP) further enforce Lipschitz continuity through regularization, resulting in more stable adversarial training across a wide range of applications [14].

Cybersecurity datasets introduce additional challenges that complicate the application of generative models. Datasets such as UWF-ZeekData22 [15,16] consist of high-dimensional tabular data with mixed categorical and continuous features, rather than the spatial structure exploited by convolutional GAN architectures [17]. Moreover, minority attack samples vary substantially across MITRE ATT&CK tactics and often exhibit subtle, tactic-specific feature patterns [18]. Under extreme data sparsity, these characteristics raise important questions about whether simple adversarial objectives or more heavily regularized GAN variants are better suited to preserving discriminative structure while maintaining training stability. Consequently, it remains unclear which GAN architecture is most appropriate when evaluated under comparable experimental conditions for generating class-specific minority data in highly imbalanced intrusion detection settings. Despite the growing body of work on GAN-based augmentation for intrusion detection, existing studies typically evaluate a single generative objective or a limited set of experimental configurations, making it difficult to directly compare architectural trade-offs under consistent preprocessing, training, and evaluation conditions.

Beyond dataset-specific challenges, real-world deployment of GAN-based intrusion detection systems introduces additional complexities related to cross-domain generalization and operational variability. In practice, network environments differ significantly in traffic patterns, feature distributions, and attack characteristics across organizations, time periods, and infrastructure types [3,4,6]. Models trained on a single dataset may therefore encounter distribution shift and domain mismatch when applied to unseen environments, potentially reducing the effectiveness of both the synthetic data and the downstream classifiers [6]. Furthermore, real-world systems must balance detection performance with computational constraints, scalability, and adaptability to evolving threat landscapes [3,11]. These challenges highlight the need for robust generative approaches that not only perform well under controlled experimental conditions but also generalize effectively across diverse and dynamic cybersecurity environments.

This paper presents a systematic comparison of four GAN architectures—Vanilla GAN, Conditional GAN (cGAN), Wasserstein GAN (WGAN), and Wasserstein GAN with Gradient Penalty (WGAN-GP)—applied to class-specific minority samples from UWF-ZeekData22 [16]. Each MITRE ATT&CK tactic is treated as an independent binary classification task, and separate GANs are trained exclusively on minority-class samples to generate synthetic cyberattack data. The evaluated architectures are compared across multiple dimensions, including training stability, convergence behavior, computational cost, diversity of generated samples, and the effect of synthetic augmentation on downstream classifier performance.

The contributions of this paper are summarized as follows:

A systematic evaluation of four GAN architectures for minority-class generation in tabular cybersecurity data.
A stability and convergence analysis examining how adversarial objectives and regularization strategies influence GAN training under extreme class imbalance.
An empirical assessment of synthetic data diversity and coverage, evaluating how well each architecture represents minority-class feature distributions.
A downstream performance comparison demonstrating how synthetic data from each GAN architecture affects classifier accuracy, recall, and F1-scores across multiple MITRE ATT&CK tactics.

The remainder of this paper is organized as follows: Section 2 reviews related work on GAN variants, tabular GANs, and GAN-based cybersecurity applications. Section 3 describes the dataset and preprocessing pipeline. Section 4 details the GAN architectures evaluated. Section 5 outlines the experimental design. Section 6 presents comparative results across GAN variants. Section 7 discusses the implications of the findings, Section 8 concludes, and Section 9 provides directions for future research.

2. Related Work

Generative Adversarial Networks (GANs) have been extensively studied in the machine learning community, with a large body of work focused on improving training stability, sample diversity, and convergence behavior [1,9,10]. While GAN variants have achieved remarkable success in image and signal synthesis, their application to tabular cybersecurity data remains comparatively unexplored [19,20]. This section reviews prior work related to this study, focusing on (1) architectural variants of GANs, (2) GANs for tabular data generation, and (3) GAN-based approaches in cybersecurity. This section concludes by situating the present work within the existing literature.

2.1. Architectural Variants of GANs

Since the original GAN framework introduced by Goodfellow et al. [1], numerous architectural and objective-function modifications have been proposed to address well-known challenges, such as training instability, vanishing gradients, and mode collapse [9,11]. Salimans et al. [9] introduced several training heuristics to improve convergence and sample quality, highlighting GANs’ sensitivity to optimization dynamics and hyperparameter selection.

Conditional Generative Adversarial Networks (cGANs), proposed by Mirza and Osindero [12], extend the original GAN framework by incorporating auxiliary information, such as class labels, into both the generator and the discriminator. This conditioning mechanism enables more structured generation and has been shown to improve sample fidelity when reliable labels are available [12,19]. However, conditional models may perform poorly when labels are sparse, noisy, or highly imbalanced, a common characteristic of real-world cybersecurity datasets [17].

Wasserstein GANs (WGANs), introduced by Arjovsky et al. [13], replace the Jensen–Shannon divergence used in the original GAN formulation with the Wasserstein-1 distance, providing smoother gradients and improved training stability. While WGANs reduce mode collapse and improve convergence, they rely on enforcing a Lipschitz constraint via weight clipping, which can introduce optimization challenges. To address this limitation, Gulrajani et al. [14] proposed the Wasserstein GAN with Gradient Penalty (WGAN-GP), which replaces weight clipping with a gradient-based regularization term and has since become a widely adopted approach for stabilizing GAN training.

Recent work has further explored architectural enhancements and hybrid objectives to improve both stability and sample fidelity. For instance, Yang et al. [20] introduced a Conditional Aggregation Encoder–Decoder Structure (CE-GAN), which combines an encoder–decoder structure with a Conditional GAN framework to balance realism and diversity in generated samples. Their approach demonstrated improved performance on intrusion detection tasks across the NSL-KDD and UNSW-NB15 datasets. Similarly, Tian et al. [21] proposed a hybrid generative framework, VAE-WACGAN, which integrates Variational Autoencoders, Auxiliary Classifiers, and Wasserstein objectives with a gradient penalty to improve training stability and sample quality. These approaches reflect a broader trend toward combining multiple generative paradigms to address the inherent instability of adversarial training.

2.2. GANs for Tabular Data Generation

Unlike images and audio signals, tabular datasets often contain heterogeneous feature types, including continuous, categorical, and binary attributes [17,22]. This structural heterogeneity complicates the direct application of GANs, which were originally designed for continuous-valued data distributions [17,22]. As a result, several studies have proposed specialized GAN architectures and training strategies to better model tabular data.

Xu et al. [17] introduced Conditional Tabular GAN (CTGAN), which employs conditional sampling and tailored loss functions to better capture imbalanced categorical distributions in tabular data. Recent work has extended these ideas to cybersecurity and IoT environments. For example, CTGAN-based intrusion detection frameworks have been proposed by Menssouri et al. [23] to generate synthetic minority-class samples in highly imbalanced IoT network traffic, often combined with hybrid resampling strategies to further address class imbalance. Their approach involves generating synthetic rare attack samples using CTGAN, followed by applying SMOTEENN, a hybrid method that combines the Synthetic Minority Oversampling Technique (SMOTE) with Edited Nearest Neighbors (ENN) to remove noise from the data. These methods highlight the importance of domain-aware generative modeling for structured cybersecurity data.

However, many tabular GAN frameworks rely on architecture-specific adaptation, auxiliary objectives, or feature-type-dependent preprocessing pipelines, which complicate direct comparisons across different GAN objectives [22]. Moreover, most tabular GAN studies evaluate synthetic data quality using distributional similarity metrics or performance on generic classification tasks [24]. Only a limited number of works systematically benchmark multiple GAN architectures under consistent preprocessing and training conditions, particularly in highly imbalanced settings [25]. This gap motivates the need for controlled architecture-level comparisons using realistic, domain-specific datasets.

Furthermore, advanced tabular generative models such as CTGAN and the Tabular Variational Autoencoder (TVAE) [17] typically require sufficient data density to effectively learn conditional relationships among features. In highly imbalanced datasets such as UWF-ZeekData22, several minority classes contain extremely limited samples, in some cases fewer than ten instances. Under such conditions, learning stable conditional distributions becomes challenging, and model training may be unstable or prone to overfitting [1,13,25]. This motivates the use of simpler, class-specific generative models that can better adapt to extreme data sparsity.

2.3. GANs in Cybersecurity Data Modeling

GAN-based approaches have increasingly been explored in cybersecurity applications, particularly for addressing class imbalance in intrusion detection systems [5,26]. Several studies have demonstrated that GAN-generated synthetic traffic can improve recall for rare attack behaviors by augmenting minority-class samples. For example, IGAN-based frameworks, proposed by Rao et al. [27], have shown that synthetic augmentation can significantly improve minority-class detection while maintaining high overall classification accuracy. IGAN (Imbalanced Generative Adversarial Network) is designed to address class imbalance by generating additional minority-class samples to enrich sparse attack categories.

While IGAN primarily focuses on class imbalance, other approaches have explored the challenge of data scarcity and adversarial evasion in cybersecurity. Randhawa et al. [28] proposed the Evasion Generative Adversarial Network (EVAGAN), a GAN-based framework specifically designed for low-data regimes, where limited training samples hinder effective model learning. EVAGAN focuses on generating evasion samples to augment scarce anomaly classes and improve the detection performance of machine learning classifiers. In addition, its discriminator functions as an evasion-aware classifier, eliminating the need for separate adversarial training. Experimental results demonstrate improved detection performance, training stability, and efficiency compared to baseline models such as ACGAN. This approach is particularly relevant to intrusion detection, where attack types often have extremely limited labeled samples.

Cybersecurity datasets introduce additional challenges beyond those encountered in generic tabular modeling. Network traffic data are typically high-dimensional, noisy, and highly imbalanced, and attack behaviors vary significantly across MITRE ATT&CK tactics [19]. Furthermore, labeled attack samples are often extremely scarce, making it difficult to train conditional or supervised generative models reliably [29]. As a result, conclusions drawn from image-based GAN studies do not necessarily transfer to intrusion detection contexts.

Most existing cybersecurity GAN studies evaluate performance improvements using a limited set of classifiers, a single augmentation ratio, or a single generative objective, thereby complicating direct comparisons of architectural trade-offs and stability–utility relationships across methods [4,5]. While prior work has demonstrated the potential of GAN-based augmentation for intrusion detection, systematic benchmarking of multiple GAN objectives under consistent preprocessing, training, and evaluation conditions remains limited, particularly for highly imbalanced, tactic-specific attack data.

In addition to architectural comparisons, recent work has emphasized the importance of the augmentation ratio, the adversarial objective, and the training dynamics in GAN-based data augmentation for cybersecurity datasets [30]. A controlled ablation study on the UWF-ZeekData22 dataset examined multiple augmentation levels and training configurations across several GAN variants, including Vanilla GAN, cGAN, WGAN, and WGAN-GP. The results demonstrated consistent structural patterns: moderate augmentation levels yield the most stable improvements in minority-class recall, while excessive augmentation can introduce distributional overlap and degrade classifier performance, particularly for linear and margin-based models. Furthermore, Wasserstein-based objectives exhibited improved stability under aggressive augmentation, whereas Conditional GANs were less reliable in extremely sparse regimes. These findings highlight the importance of carefully balancing augmentation strategies and training objectives, motivating the systematic evaluation conducted in this study.

Table 1 summarizes existing GAN-based approaches and highlights the limitations addressed in this study.

2.4. Positioning of This Study

This study addresses the above limitations by providing a systematic benchmarking analysis of four widely used GAN architectures—Vanilla GAN, Conditional GAN (cGAN), Wasserstein GAN (WGAN), and Wasserstein GAN with Gradient Penalty (WGAN-GP)—applied to minority-class cyberattack data from the UWF-ZeekData22 dataset.

In contrast to much of the existing literature, which typically focuses on a single architecture, attack type, or evaluation configuration, this work evaluates each GAN variant under a unified experimental framework that includes:

A consistent preprocessing and training pipeline;
Class-specific GAN training exclusively on minority-only samples;
Controlled augmentation ratios and training durations;
A diverse set of downstream machine learning classifiers.

By isolating architectural effects and evaluating performance through downstream intrusion detection tasks, this study provides empirical insight into how different GAN objectives influence minority-class separability, training stability, and classifier behavior under extreme class imbalance.

3. Dataset and Preprocessing

3.1. UWF-ZeekData22 Dataset Overview

This study uses the UWF-ZeekData22 dataset [15,16], a large-scale collection of network telemetry captured using the Zeek network security monitor [31]. The dataset contains structured flow-level records describing connection metadata, including transport-layer protocols, connection states, byte and packet counts, flow durations, and additional traffic attributes. Malicious samples are annotated according to MITRE ATT&CK tactics [18], enabling class-specific analysis of attack behavior.

Malicious events are extremely rare relative to benign traffic, and each ATT&CK tactic appears as a highly imbalanced minority class. This characteristic makes UWF-ZeekData22 particularly well-suited to evaluating generative models that synthesize minority-class cyberattack behaviors under extreme class imbalance.

3.2. Distribution of MITRE ATT&CK Tactics

The distribution of malicious traffic in the UWF-ZeekData22 dataset [15,16] is highly uneven across MITRE ATT&CK tactics, as illustrated in Figure 1a. Reconnaissance activity [32] overwhelmingly dominates the dataset, accounting for approximately 9.2 million records and reflecting the prevalence of large-scale scanning and information gathering in operational networks. Other categories collectively represent less than 0.1%, highlighting the extreme class imbalance addressed in this study. Discovery activity [33], the second most frequent tactic, is substantially rarer, with slightly more than 2000 instances, indicating limited representation of internal system and environment enumeration behaviors.

As shown in Figure 1b (which presents the malicious activity without the Reconnaissance and Discovery tactics), more advanced attack behaviors are observed only sporadically. Credential Access [34] attacks, which target authentication materials such as passwords or tokens, account for only 31 samples. Privilege Escalation [35] activities, which enable adversaries to obtain elevated permissions, occur only 13 times, while Exfiltration events [36], associated with the unauthorized transfer of data outside the victim environment, occur only 7 times.

Post-compromise behaviors exhibit the most severe scarcity. Lateral Movement [37], defined as attempts to expand control across multiple systems, appears in only four records. Resource Development [38], which covers preparatory actions such as infrastructure and account creation, is documented in three samples. The most extreme imbalance is observed for Initial Access [39], Persistence [40], and Defense Evasion [41], each of which appears only once in the entire dataset. Together, these distributions highlight the extreme imbalance present in the UWF-ZeekData22 dataset [15,16] and motivate the use of generative augmentation to systematically assess the relative effectiveness of different GAN architectures.

3.3. Class-Specific Minority Datasets for GAN Training

To ensure a controlled and fair comparison across GAN architectures, this study adopts a class-specific training strategy for each ATT&CK tactic. For each experiment:

All available samples belonging to a given minority tactic are extracted.
Each GAN variant (Vanilla, cGAN, WGAN, and WGAN-GP) is trained exclusively on this minority subset.
No benign or majority-class samples are included during GAN training.

This design is intended to isolate each architecture’s ability to model the intrinsic structure of a specific attack behavior without interference from dominant majority-class patterns that could otherwise bias the generation process.

3.4. Data Selection and Subsampling

The full UWF-ZeekData22 [15,16] dataset contains more than 18 million network flow records and exhibits extreme class imbalance across MITRE ATT&CK tactics. The benign (“none”) and high-frequency Reconnaissance classes together account for the vast majority of samples, each exceeding nine million instances in the original dataset. In contrast, several advanced attack types contain only a handful of labeled samples.

To accommodate the computational demands of training multiple GAN variants and evaluating stratified classifiers, a reduced yet representative subset of the dataset is used. The final working dataset contains 402,147 samples. All available samples from rare and underrepresented ATT&CK tactics are retained in full. To control dataset scale while preserving realistic class imbalance, the benign (“none”) and Reconnaissance classes are downsampled to 200,000 instances each using reproducible group-aware random sampling across source files to preserve distributional diversity [42].
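The exact sampling procedure is not spelled out here, so the following is only a minimal sketch of one way a reproducible, group-aware downsample could work. It assumes each record carries a hypothetical `source_file` field, allocates the target proportionally across files, and uses a fixed seed; the function name, field name, and seed value are illustrative and not taken from the study.

```python
import random
from collections import defaultdict


def group_aware_downsample(records, target, group_key="source_file", seed=42):
    """Draw about `target` records, spreading the draw across groups.

    Assumption: each group contributes in proportion to its size, so the
    result can differ from `target` by a few records because of rounding.
    """
    if len(records) <= target:
        return list(records)

    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    rng = random.Random(seed)  # fixed seed makes the draw reproducible
    sampled = []
    for name in sorted(groups):  # sorted order keeps runs deterministic
        members = groups[name]
        quota = round(target * len(members) / len(records))
        sampled.extend(rng.sample(members, min(quota, len(members))))
    return sampled


# Toy data: three hypothetical source files of different sizes.
records = [
    {"source_file": name, "id": i}
    for name, size in [("a", 600), ("b", 300), ("c", 100)]
    for i in range(size)
]
subset = group_aware_downsample(records, target=100)
print(len(subset))
```

Sampling within each file rather than over the pooled records is what keeps every source represented in the reduced class.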

This strategy preserves the relative rarity of advanced attack behaviors while preventing high-frequency classes from dominating classifier learning or obscuring minority-class performance. Because the objective of GAN-based augmentation is to learn minority-class feature distributions rather than exploit the absolute abundance of benign traffic, the resulting dataset remains well-suited to generative modeling and downstream evaluation.
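To make the remaining imbalance concrete, the short calculation below divides the 200,000 downsampled benign records by the per-tactic counts reported in Section 3.2. Discovery is left out because only an approximate count is given for it; the variable names are illustrative.

```python
BENIGN_IN_WORKING_SET = 200_000  # downsampled benign ("none") class

# Per-tactic counts as reported in Section 3.2.
tactic_counts = {
    "Credential Access": 31,
    "Privilege Escalation": 13,
    "Exfiltration": 7,
    "Lateral Movement": 4,
    "Resource Development": 3,
    "Initial Access": 1,
    "Persistence": 1,
    "Defense Evasion": 1,
}

# Number of benign records per attack record in each binary task.
benign_per_attack = {
    tactic: BENIGN_IN_WORKING_SET / count
    for tactic, count in tactic_counts.items()
}

for tactic, ratio in benign_per_attack.items():
    print(f"{tactic:22s} 1 : {ratio:,.0f}")
```

Even after downsampling, each of these binary tasks pairs thousands to hundreds of thousands of benign records with every attack record.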

3.5. Preprocessing Pipeline

A structured preprocessing pipeline was applied to ensure compatibility with both GAN training and downstream classification, while avoiding noise, bias, and information leakage.

First, features that primarily serve as identifiers or contextual metadata—such as unique connection identifiers, timestamps, and source or destination IP addresses—were removed. These attributes do not encode intrinsic behavioral characteristics and may introduce spurious correlations or reduce model generalizability [43].

Boolean-valued attributes were converted to integer representations to ensure consistent numeric preprocessing. Missing values were handled using feature-specific strategies: numerical attributes related to traffic volume and duration were imputed with median values to reduce sensitivity to outliers [44], while categorical features with missing entries were assigned a dedicated placeholder category to preserve dataset completeness.

Categorical features such as protocol, service type, and connection state were converted to numeric form using label-based encoding. To ensure a fair and consistent comparison between real and synthetic data, the same fitted encoders were applied consistently across training, validation, and generated samples. Encoder objects were preserved to ensure reproducibility and to enable decoding during evaluation.
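A minimal sketch of label-based encoding with a preserved encoder is given below. The `LabelCodec` class is a hypothetical stand-in written for illustration (the study's actual encoder implementation is not stated); it shows how one fitted mapping can be reused for real and generated samples and inverted for decoding.

```python
class LabelCodec:
    """Map category strings to integer codes and back (illustrative helper)."""

    def fit(self, values):
        # Sorted order makes the mapping deterministic across runs.
        self.classes_ = sorted(set(values))
        self._index = {value: code for code, value in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Raises KeyError for categories not seen during fit.
        return [self._index[value] for value in values]

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]


# Fit once, then reuse the same object everywhere.
proto_codec = LabelCodec().fit(["tcp", "udp", "icmp", "tcp"])
encoded = proto_codec.transform(["udp", "tcp"])
decoded = proto_codec.inverse_transform(encoded)
print(encoded, decoded)
```

Keeping the fitted object around, rather than refitting on each subset, is what guarantees that a given code means the same category in every split.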

All preprocessing operations (encoding, imputation, and normalization) were performed strictly within the training portion of each cross-validation fold. Parameters learned from the training data were then applied unchanged to the corresponding validation fold, ensuring that no information leakage occurred [45]. Class labels were left unchanged throughout preprocessing to preserve ground-truth annotations.
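The fit-on-training, apply-to-validation pattern can be sketched with median imputation as the example step. The function names and the `duration` column are hypothetical, and rows are represented as plain dictionaries purely for illustration.

```python
from statistics import median


def fit_medians(train_rows, numeric_cols):
    """Learn one fill value per numeric column from the training fold only."""
    return {
        col: median(row[col] for row in train_rows if row[col] is not None)
        for col in numeric_cols
    }


def apply_medians(rows, medians):
    """Fill missing values using parameters learned elsewhere (no refitting)."""
    return [
        {**row, **{c: (m if row[c] is None else row[c]) for c, m in medians.items()}}
        for row in rows
    ]


train = [{"duration": 1.0}, {"duration": None}, {"duration": 3.0}, {"duration": 100.0}]
validation = [{"duration": None}, {"duration": 5.0}]

medians = fit_medians(train, ["duration"])       # uses training rows only
train_filled = apply_medians(train, medians)
validation_filled = apply_medians(validation, medians)
print(medians, validation_filled)
```

The validation rows never influence the fill value, which is the property the leakage argument relies on.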

3.6. Feature Scaling

All features were scaled to [−1, 1] using min–max normalization. This scaling strategy is commonly used in GAN training to promote numerical stability; maintain consistent gradient behavior, particularly for Wasserstein-based objectives; and ensure fair architectural comparison [1]. Scaling parameters were computed exclusively from real minority-class samples for each ATT&CK tactic.
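A dependency-free sketch of min–max scaling to [−1, 1] follows, using x' = 2(x − min)/(max − min) − 1. One detail the text does not address is what happens when a feature is constant, which is unavoidable for tactics with a single record; mapping such features to 0 is an assumption made here for illustration only.

```python
def fit_minmax(rows):
    """Return (min, max) per feature, computed from the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(rows, params):
    """Map each feature to [-1, 1]; constant features map to 0 (assumption)."""
    return [
        [0.0 if hi == lo else 2.0 * (x - lo) / (hi - lo) - 1.0
         for x, (lo, hi) in zip(row, params)]
        for row in rows
    ]


def unscale(rows, params):
    """Invert `scale`, e.g. to bring generated samples back to feature units."""
    return [
        [lo if hi == lo else (x + 1.0) / 2.0 * (hi - lo) + lo
         for x, (lo, hi) in zip(row, params)]
        for row in rows
    ]


minority = [[0.0, 5.0], [10.0, 5.0], [5.0, 5.0]]  # toy minority-class rows
params = fit_minmax(minority)
scaled = scale(minority, params)
print(scaled)
```

The inverse transform matters in practice because a generator trained on scaled data emits values in the scaled range.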

3.7. Training and Evaluation Protocol

For each ATT&CK tactic, GAN training and evaluation follow a consistent protocol:

Each GAN variant is trained once per ATT&CK tactic using the full set of available minority-class samples, and the trained generator is reused across all downstream evaluations.
Synthetic samples are generated after GAN training is complete and appended to the real dataset according to a fixed augmentation ratio.
Real-versus-synthetic analyses are performed directly on the generated samples to support qualitative assessment of synthetic data behavior.
A downstream classifier is evaluated using stratified 5-fold cross-validation on the augmented dataset (real + synthetic). GANs are not retrained within individual folds.

This design decouples GAN training from classifier optimization while ensuring consistent preprocessing, augmentation, and validation procedures across all evaluated architectures [45].
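The protocol above can be sketched as follows, under stated assumptions. `generate` is a hypothetical callable standing in for an already-trained generator, the number of synthetic rows is passed in directly because the augmentation ratio is not given in this section, and the fold assignment is a simple round-robin approximation of stratified splitting (in practice a library routine such as scikit-learn's `StratifiedKFold` would be the usual choice).

```python
import random
from collections import defaultdict


def augment(rows, labels, generate, n_synthetic, attack_label=1):
    """Append generated attack rows to the real data (generator already trained)."""
    synthetic = generate(n_synthetic)
    return rows + synthetic, labels + [attack_label] * len(synthetic)


def stratified_folds(labels, k=5, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Toy example: 50 benign rows, 4 real attack rows, 6 synthetic attack rows.
real_rows = [[0.0]] * 50 + [[1.0]] * 4
real_labels = [0] * 50 + [1] * 4
rows, labels = augment(real_rows, real_labels, lambda n: [[0.9]] * n, n_synthetic=6)

for train_idx, val_idx in train_validation_splits(labels):
    print(len(train_idx), len(val_idx), sum(labels[i] for i in val_idx))
```

Because the generator is trained once outside this loop, the same synthetic rows are shared by all folds, mirroring the statement that GANs are not retrained within individual folds.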

3.8. Summary

These preprocessing and data selection pipelines provide a consistent, architecture-agnostic representation of minority-class cyberattack data across all ATT&CK tactics. By minimizing variation in data preparation and enforcing strict separation between training and evaluation, this section ensures that performance differences observed in subsequent experiments are attributable to GAN architecture and training behavior rather than to preprocessing artifacts.

4. GAN Architectures

This section describes the four Generative Adversarial Network (GAN) architectures evaluated in this study: Vanilla GAN, Conditional GAN (cGAN), Wasserstein GAN (WGAN), and Wasserstein GAN with Gradient Penalty (WGAN-GP). Each architecture introduces modifications to the adversarial objective, training dynamics, or conditioning mechanisms that may affect its ability to synthesize realistic and diverse tabular cybersecurity data.

All architectures share a common base network structure of fully connected multilayer perceptrons. Differences among models arise exclusively from their loss functions, conditioning strategies, and training constraints, ensuring a fair and controlled architectural comparison.

4.1. Vanilla GAN

The Vanilla GAN, introduced by Goodfellow et al. [1], serves as the foundational adversarial generative framework. It consists of two competing networks:

Generator (G): Produces synthetic samples from a latent noise distribution.
Discriminator (D): Distinguishes real samples from generated samples.

The adversarial objective is defined as

\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]

(1)
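As a numerical illustration of Equation (1), the sketch below estimates V(D, G) from batches of discriminator outputs. The function name and the toy probabilities are illustrative; the inputs are assumed to be probabilities strictly between 0 and 1.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for real samples; d_fake: D(G(z)) for generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
confused = gan_value([0.5, 0.5], [0.5, 0.5])

# A discriminator that separates the two well attains a higher value.
confident = gan_value([0.9, 0.95], [0.1, 0.05])

print(confused, confident)
```

The first case evaluates to −log 4, the value of the objective when the generated distribution matches the data distribution and D outputs 0.5 everywhere. The discriminator pushes V upward and the generator pushes it back down.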

4.1.1. Characteristics

Simple and computationally efficient.
Highly sensitive to hyperparameter selection.
Prone to training instability and mode collapse.
Implicitly minimizes the Jensen–Shannon divergence between real and generated distributions [9].

4.1.2. Relevance to Cybersecurity Data

Vanilla GANs often struggle when applied to high-dimensional, mixed-type tabular datasets, particularly under extreme class imbalance [17]. Nevertheless, they provide a useful baseline for assessing how architectural enhancements improve generative performance in cybersecurity applications.

4.2. Conditional GAN (cGAN)

Conditional GANs extend the Vanilla GAN framework by incorporating auxiliary information, such as class labels, into both the generator and the discriminator [12]. Conditioning enables the generator to produce samples aligned with a specific class and gives greater control over the generation process.

The adversarial objective is defined as

\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z \mid y)))]

(2)

4.2.1. Characteristics

The generator produces class-specific outputs.
Discriminators learn conditional decision boundaries.
Conditioning improves structure and sample realism.
Typically more stable than Vanilla GANs when reliable labels are available.

Figure 2 illustrates the Conditional GAN architecture used in this study. The conditional input, c (written as y in Equation (2)), represents the attack class label, while z denotes a latent noise vector. The conditional generator produces synthetic minority-class samples, G(z, c), which are evaluated by the conditional discriminator against real minority-class samples from the UWF-ZeekData22 dataset.
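One common way to feed the condition to a fully connected network is to concatenate a one-hot encoding of the label with the noise vector. Whether this study's cGAN uses exactly this scheme is not stated, so the sketch below is an assumption, and the latent size and class count are illustrative.

```python
import random


def one_hot(label_index, n_classes):
    """One-hot encode an integer class label."""
    return [1.0 if i == label_index else 0.0 for i in range(n_classes)]


def conditional_input(vector, label_index, n_classes):
    """Concatenate a vector (noise z, or a sample x) with the encoded condition c."""
    return list(vector) + one_hot(label_index, n_classes)


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(4)]  # latent dimension 4 is illustrative

generator_in = conditional_input(z, label_index=1, n_classes=3)
print(len(generator_in), generator_in[4:])
```

The same helper can build the discriminator input by passing a real or generated sample in place of z, so both networks see the label.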

4.2.2. Relevance to Cybersecurity Data

Conditioning is particularly useful for attack types exhibiting internal structural variation, such as Credential Access. However, cGAN performance depends heavily on the quality and availability of labels. In cybersecurity datasets, labels are often sparse or noisy, which may limit the effectiveness of conditional generation [17].

4.3. Wasserstein GAN (WGAN)

Wasserstein GANs replace the original Jensen–Shannon divergence with the Wasserstein-1 (Earth Mover’s) distance to improve gradient stability [13]. Unlike Vanilla GANs, WGANs use a critic rather than a probabilistic discriminator and omit the sigmoid activation in the final layer. This formulation provides smoother gradients during training.

The adversarial objective is defined as

\min_{G} \max_{D} V(D, G) = E_{x \sim p_{\mathrm{data}}(x)} [D(x)] - E_{z \sim p_{z}(z)} [D(G(z))]

(3)

where the critic, D, is constrained to be 1-Lipschitz, which is enforced in practice through weight clipping.
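
The two ingredients of Equation (3), the critic objective and the weight-clipping step, can be illustrated in a few lines. This is a minimal sketch with hypothetical critic scores and weights; a real implementation would clip the parameters of a neural network after every critic update. The default threshold of 0.01 is the value listed in Section 5.3.2.

```python
def critic_objective(critic_real, critic_fake):
    """Estimate E[D(x)] - E[D(G(z))], the quantity the critic maximizes."""
    mean_real = sum(critic_real) / len(critic_real)
    mean_fake = sum(critic_fake) / len(critic_fake)
    return mean_real - mean_fake


def clip_weights(weights, c=0.01):
    """Clamp every weight to [-c, c] as a crude Lipschitz constraint."""
    return [max(-c, min(c, w)) for w in weights]


# Hypothetical unbounded critic scores (not probabilities).
print(critic_objective([1.0, 3.0], [0.0, 1.0]))  # 1.5

# Hypothetical critic weights before and after clipping.
weights = [0.5, -0.003, -0.2, 0.01]
clipped = clip_weights(weights)
print(clipped)  # [0.01, -0.003, -0.01, 0.01]
```

Clipping leaves small weights untouched and saturates large ones, which is why a poorly chosen threshold can limit the critic's capacity.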

4.3.1. Characteristics

More stable training than Vanilla GAN.
Reduced risk of mode collapse.
Sensitive to weight-clipping thresholds.
Uses an unbounded critic rather than probabilities.

4.3.2. Relevance to Cybersecurity Data

WGAN’s smoother gradient flow can be advantageous for modeling complex tabular distributions. However, improper weight clipping may distort feature representations or limit the critic’s expressive capacity in high-dimensional cybersecurity data [13].

4.4. Wasserstein GAN with Gradient Penalty (WGAN-GP)

WGAN-GP improves on WGAN by replacing weight clipping with a gradient penalty that enforces the Lipschitz constraint [14]. The penalty grows as the norm of the critic's gradient deviates from unity, which yields more stable and reliable adversarial training.

WGAN-GP also removes batch normalization from the critic to avoid sample dependencies that could violate the Lipschitz constraint [14].
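
For reference, the gradient-penalized objective is commonly written as follows. This is the standard form from the WGAN-GP literature [14], expressed in the notation of Equation (3); it is a sketch of the usual formulation rather than a statement of the exact loss used in this study. The interpolated sample \hat{x} and the mixing weight \epsilon are symbols introduced here for illustration, and \lambda is the penalty coefficient.

```latex
\min_{G} \max_{D} \; E_{x \sim p_{\mathrm{data}}(x)} [D(x)] - E_{z \sim p_{z}(z)} [D(G(z))] - \lambda \, E_{\hat{x} \sim p_{\hat{x}}} \left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_{2} - 1 \right)^{2} \right],
\qquad \hat{x} = \epsilon x + (1 - \epsilon) G(z), \quad \epsilon \sim U[0, 1]
```

Because the critic maximizes this expression, the penalty term enters with a negative sign: any deviation of the gradient norm from 1 lowers the critic's objective.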

4.4.1. Characteristics

Most stable among classical GAN variants.
Strong resistance to mode collapse.
More robust to hyperparameter selection than Vanilla GAN and WGAN.

4.4.2. Relevance to Cybersecurity Data

Cybersecurity datasets exhibit complex joint feature distributions and extreme class imbalance. WGAN-GP’s improved stability enables more reliable learning of nuanced minority-class structures, making it well-suited to synthesizing cyberattack behaviors in tabular data [14].

5. Experimental Design

This section describes the experimental setup used to benchmark the four GAN architectures—Vanilla GAN, cGAN, WGAN, and WGAN-GP—on minority-class cyberattack data from the UWF-ZeekData22 [15,16] dataset. The design emphasizes fairness, reproducibility, and controlled comparison, enabling the observed performance differences to be attributed to architectural characteristics rather than confounding factors.

5.1. ATT&CK Tactics Selected for Evaluation

Experiments were conducted on a representative subset of MITRE ATT&CK tactics that exhibit distinct behavioral patterns and extreme class imbalance. These include Discovery, Credential Access, Privilege Escalation, Exfiltration, Lateral Movement, Resource Development, Defense Evasion, Initial Access, and Persistence. These tactics were selected because they are operationally high-impact in intrusion detection contexts and span a wide range of distributional properties [3]. Each GAN architecture was trained independently on the minority-class samples corresponding to each tactic.

5.2. Consistent Preprocessing Pipeline

To isolate architectural effects, all GANs and classifiers used an identical preprocessing pipeline:

Label encoding of categorical features.
Min–max scaling to the range [−1, 1].
No dimensionality reduction.
No feature engineering.

Real and synthetic samples were evaluated in the same vector space, ensuring an unbiased comparison of architectural features.
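
The two transformation steps in this pipeline can be sketched without any external library. The example below is a simplified illustration, not the study's code: the column values are hypothetical, categories are coded in sorted order, and the handling of constant columns is an assumption. In practice, scaling parameters would be fitted on training data and reused for synthetic and held-out samples.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_scale(column, low=-1.0, high=1.0):
    """Linearly rescale a numeric column to the range [low, high]."""
    c_min, c_max = min(column), max(column)
    if c_max == c_min:
        # Assumption: a constant column is mapped to the lower bound.
        return [low for _ in column]
    span = c_max - c_min
    return [low + (high - low) * (x - c_min) / span for x in column]


# Hypothetical categorical feature.
protocols = ["tcp", "udp", "tcp", "icmp"]
codes, mapping = label_encode(protocols)
scaled = min_max_scale(codes)
print(mapping)  # {'icmp': 0, 'tcp': 1, 'udp': 2}
print(scaled)   # [0.0, 1.0, 0.0, -1.0]
```

Scaling to [−1, 1] matches the range of a Tanh output layer, so generated values and real values share the same bounds.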

5.3. Standardized Hyperparameter Settings

To maintain fairness across architectures, hyperparameters were standardized whenever possible. All GAN variants used identical network depths, widths, batch sizes, latent noise dimensionalities, learning rates, and numbers of training epochs. Optimizers were the one exception: each was selected in accordance with standard practice for its adversarial objective to ensure stable training while still enabling a controlled architectural comparison.

Because the original WGAN formulation discourages momentum-based optimizers such as Adam, RMSProp was used for WGAN (with weight clipping), whereas Adam was used for the remaining GAN variants (Vanilla GAN, cGAN, and WGAN-GP).

5.3.1. Common Settings

The following hyperparameters were shared across all GAN architectures unless otherwise specified:

Batch size: 64.
Learning rate: 0.0002.
Noise vector dimension: 32.
Fully connected multilayer perceptron (MLP) architectures.
LeakyReLU activation in hidden layers.
Tanh activation at the generator output.

5.3.2. Architecture-Specific Settings

WGAN: RMSProp optimizer with weight clipping threshold = 0.01.
WGAN-GP: Adam optimizer with gradient penalty coefficient, λ = 10.
cGAN: Class labels concatenated to generator and discriminator inputs and intermediate layers.

All architectures used identical network capacities to ensure that observed performance differences reflect architectural design rather than model capacity.
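
The shared and architecture-specific settings listed above can be collected in a single configuration object, which makes it easy to verify that only the intended fields differ between variants. The class and field names below are hypothetical; the values are those reported in Sections 5.3.1 and 5.3.2. Network depth and width are omitted because they are not stated in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GanConfig:
    """Training settings for one GAN variant (field names are illustrative)."""
    name: str
    optimizer: str
    batch_size: int = 64
    learning_rate: float = 0.0002
    latent_dim: int = 32
    hidden_activation: str = "LeakyReLU"
    output_activation: str = "Tanh"
    clip_value: Optional[float] = None  # WGAN weight-clipping threshold
    gp_lambda: Optional[float] = None   # WGAN-GP gradient penalty coefficient
    conditional: bool = False           # cGAN label conditioning


CONFIGS = {
    "vanilla": GanConfig("Vanilla GAN", optimizer="Adam"),
    "cgan": GanConfig("cGAN", optimizer="Adam", conditional=True),
    "wgan": GanConfig("WGAN", optimizer="RMSProp", clip_value=0.01),
    "wgan_gp": GanConfig("WGAN-GP", optimizer="Adam", gp_lambda=10.0),
}

# The shared settings are identical for every variant.
shared = {(c.batch_size, c.learning_rate, c.latent_dim) for c in CONFIGS.values()}
print(shared)  # {(64, 0.0002, 32)}
```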

5.3.3. Detailed GAN Training Configuration

To ensure reproducibility and clarity of implementation, the detailed training configuration for all GAN variants is summarized in Table 2. The table provides a complete specification of the architectural, optimization, and training settings used for each model.

This unified configuration ensures that observed performance differences are primarily attributable to the adversarial objective and stabilization strategy, rather than differences in architectural capacity and training conditions.

5.4. Augmentation Ratio

To assess the effect of synthetic data augmentation on minority-class learning, a single augmentation ratio was used in this study:

15% synthetic samples (light augmentation).

Synthetic samples were generated in proportion to the size of the post-downsampling majority-class subset used in the final training dataset.

Synthetic Sample Generation

Synthetic minority samples generated for each ATT&CK tactic are defined as follows:

N_{\mathrm{syn}} = r \cdot |D_{\mathrm{maj}}|

(4)

where r = 0.15 and |D_{\mathrm{maj}}| denotes the size of the post-downsampling majority-class subset used in the final dataset. This augmentation ratio was selected to increase minority-class representation to a level sufficient for stable adversarial training and effective downstream classifier learning, while avoiding excessive oversampling that could distort data distribution under extreme class imbalance [46]. Preliminary experiments with higher augmentation ratios were excluded from this study to maintain a controlled comparison and prevent excessive synthetic dominance.
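
A worked instance of Equation (4) is given below. The majority-class size of 10,000 is a hypothetical figure chosen for illustration, and rounding to the nearest integer is an assumption, since the rounding rule is not stated here.

```python
def synthetic_sample_count(n_majority, ratio=0.15):
    """Number of synthetic minority samples, N_syn = r * |D_maj|.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(ratio * n_majority)


# Hypothetical post-downsampling majority subset of 10,000 records.
n_syn = synthetic_sample_count(10_000)
print(n_syn)  # 1500
```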

5.5. Computing Environment

All experimental evaluations were conducted in a cloud-based computing environment configured to support GPU-accelerated training and extended execution.

5.5.1. System Configuration

Experiments were conducted on Google Colab Pro, which provides access to GPU-accelerated cloud resources suitable for training deep generative models. The execution environment was configured with the following specifications:

Runtime environment: Python 3.x with GPU acceleration enabled.
GPU accelerator: NVIDIA Tesla T4 with approximately 15 GB of video memory.
System memory: 12.7 GB RAM.
Disk storage: Approximately 235 GB total capacity, with peak usage of about 39 GB during experimentation.

This configuration enabled stable and efficient training of multiple GAN architectures under GPU-accelerated settings. All GAN models were explicitly trained using the available GPU, while CPU resources (typically Intel Xeon-class processors provided by the Colab environment) were used for data preprocessing and auxiliary tasks.

While Google Colab provides a flexible and accessible platform, it may introduce minor variability in hardware allocation across sessions due to shared cloud infrastructure. To mitigate this, all experiments were conducted under comparable resource conditions using the same GPU class (Tesla T4). The reported training time reflects consistent GPU-accelerated execution, although slight variations in runtime may occur due to system-level resource scheduling. Random seeds were fixed across all experiments to ensure reproducibility.

5.5.2. Software Environment

The experimental pipeline was implemented using a standardized software stack commonly used in machine learning research:

Python: Version 3.12.
PyTorch: Version 2.9.0, used for implementing and training GANs.
Scikit-learn: Version 1.6.1, employed for classifier training and performance evaluation.
Pandas: Version 2.2.2, used for data loading and preprocessing.
NumPy: Version 2.0.2, supporting numerical computations.
Matplotlib (3.10.0) and Seaborn (0.13.2): Utilized for visualization of experimental results.
t-SNE (1.6.1): Applied for low-dimensional visualization of real and synthetic samples.

5.5.3. Implementation Details

To reduce computational overhead, GPU acceleration was used whenever a CUDA-capable device was available. In these cases, supported classifiers, including linear models, Support Vector Machines, k-Nearest Neighbors, and Random Forest classifiers, were executed using the GPU-accelerated implementation provided by the RAPIDS cuML framework. When GPU resources were not available, equivalent CPU-based implementations from scikit-learn were used instead. Decision Tree classifiers were consistently implemented on a CPU because no fully compatible GPU implementation was available.
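
The fallback logic described above can be sketched as a small backend selector. This is an assumption-laden illustration: the classifier keys and the function name are hypothetical, and checking whether the cuml package is importable is used here only as a stand-in for detecting a usable CUDA device.

```python
import importlib.util

# Classifiers described above as having a GPU-accelerated implementation.
GPU_SUPPORTED = {"logistic_regression", "svm", "knn", "random_forest"}


def pick_backend(classifier_name, cuml_available=None):
    """Return "cuml" when a GPU implementation can be used, else "sklearn"."""
    if cuml_available is None:
        # Assumption: an importable cuml package implies a usable GPU stack.
        cuml_available = importlib.util.find_spec("cuml") is not None
    if classifier_name in GPU_SUPPORTED and cuml_available:
        return "cuml"
    return "sklearn"


# Decision Tree always runs on the CPU, regardless of GPU availability.
print(pick_backend("decision_tree", cuml_available=True))  # sklearn
print(pick_backend("svm", cuml_available=True))            # cuml
print(pick_backend("svm", cuml_available=False))           # sklearn
```

Keeping the selection in one function makes it straightforward to confirm that only the execution backend changes, not the model configuration.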

Importantly, GPU acceleration was used solely to improve computational efficiency. Model architecture, optimization objectives, hyperparameter configurations, and evaluation protocols were kept identical across GPU and CPU executions. As a result, experimental outcomes remained directly comparable across hardware platforms.

5.6. Classifier Training Setup

For each minority attack class and each GAN variant, five widely used classical machine learning classifiers were trained and evaluated. These models were selected to represent diverse learning paradigms, including linear, distance-based, and tree-based approaches:

Logistic Regression;
Support Vector Machine (SVM);
k-Nearest Neighbors (KNN);
Decision Tree;
Random Forest.

5.6.1. Evaluation Metrics

To evaluate classifier behavior under severe class imbalance, multiple complementary performance metrics were used. All metrics are derived from four fundamental classification outcomes [47]:

True positives (TPs)—minority-class instances correctly identified as positive.
True negatives (TNs)—majority-class instances correctly classified as negative.
False positives (FPs)—majority-class instances incorrectly labeled as positive.
False negatives (FNs)—minority-class instances incorrectly classified as negative.

These quantities underlie all reported evaluation metrics.

Confusion Matrix Analysis

The confusion matrix provides a detailed breakdown of classification outcomes by summarizing the counts of TPs, TNs, FPs, and FNs [47]. It enables direct inspection of error patterns and class-specific behavior. In intrusion detection, minimizing false negatives is especially important, as such errors correspond to undetected attacks. Accordingly, the objective of GAN-based augmentation in this study is to increase true-positive detections while reducing false negatives without introducing an excessive number of false positives.

Classification outcomes are summarized using the standard confusion matrix representation:

\begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}

(5)

Derived Performance Metrics

Precision, recall, and F1-scores are computed from the confusion matrix [45]. Precision is the proportion of predicted positive instances that are truly positive, while recall is the proportion of actual positive instances correctly identified. The F1-score is the harmonic mean of precision and recall. In multi-class or highly imbalanced settings, the macro-averaged F1-score is used to give each class equal weight [45,48].
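
These definitions translate directly into code. The sketch below computes precision, recall, F1, and the two-class macro-averaged F1 from the four confusion-matrix counts. The counts are hypothetical and are not taken from any table in this study; the function names are likewise illustrative.

```python
def precision_recall_f1(tp, fn, fp, tn):
    """Precision, recall, and F1 for the class treated as positive."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(tp, fn, fp, tn):
    """Unweighted mean of the F1-scores of the two classes."""
    _, _, f1_pos = precision_recall_f1(tp, fn, fp, tn)
    # Treat the majority class as positive by swapping the roles of the counts.
    _, _, f1_neg = precision_recall_f1(tp=tn, fn=fp, fp=fn, tn=tp)
    return (f1_pos + f1_neg) / 2


# Hypothetical counts for one classifier on one tactic.
tp, fn, fp, tn = 90, 10, 30, 870
p, r, f1 = precision_recall_f1(tp, fn, fp, tn)
print(p, r, round(f1, 4))  # 0.75 0.9 0.8182
```

Because macro averaging weights both classes equally, a drop in minority-class F1 is not masked by the much larger majority class.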

5.7. Reproducibility and Consistency

To ensure reproducibility, random seeds were fixed across PyTorch, NumPy, and scikit-learn. Identical cross-validation folds were reused across all experiments. GANs were trained once per tactic using the full set of minority-class samples, and the trained generators were reused across all downstream evaluations. Synthetic samples were generated prior to cross-validation and included in both training and validation folds. This design avoids retraining generative models within each fold while ensuring consistent preprocessing, augmentation, and evaluation across all experiments.
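
A typical seed-fixing helper is sketched below. It is an illustration under stated assumptions: the seed value 42 and the function name are hypothetical, and the NumPy and PyTorch calls are skipped when those packages are not installed. scikit-learn has no global seed of its own; its estimators and splitters draw on NumPy's global state unless a random_state argument is passed explicitly.

```python
import random


def set_global_seeds(seed=42):
    """Seed Python, NumPy, and PyTorch generators where available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding reproduces the same pseudo-random sequence.
set_global_seeds(42)
first = [random.random() for _ in range(3)]
set_global_seeds(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```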

6. Results

This section presents a systematic comparison of the four GAN architectures—Vanilla GAN, cGAN, WGAN, and WGAN-GP—across nine ATT&CK tactics, using the experimental framework described in Section 5. The results are evaluated using downstream classifier performance [4,24,45] and confusion-matrix-based metrics [8,42,48] to assess how different adversarial objectives affect minority-class detection under extreme class imbalance [9,11,13,14].

6.1. Downstream Classifier Performance

To evaluate how different adversarial objectives affect classifier performance, each GAN variant (Vanilla GAN, cGAN, WGAN, and WGAN-GP) was trained separately for each minority ATT&CK tactic and used to augment the training set. Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9 and Table 10 summarize the resulting confusion matrices for all five classifiers. Each entry presents the aggregated confusion matrix values (TPs, FNs, FPs, and TNs) across all cross-validation folds, using the same preprocessing, data splits, and classifier hyperparameters as described in Section 5.

6.1.1. Discovery

The results for the Discovery tactic (Table 3) reveal clear differences in classifier behavior across adversarial training objectives. Because each GAN variant is trained independently on Discovery samples prior to classifier evaluation and all downstream models are evaluated using identical preprocessing and stratified cross-validation splits, observed performance differences can be attributed to how each adversarial objective shapes the synthetic minority-class distribution.

For Logistic Regression, all GAN variants substantially improve minority-class recognition; however, notable differences emerge across objectives. The Vanilla GAN and cGAN achieve the most balanced performance, exhibiting low counts of false negatives and false positives. Specifically, the Vanilla GAN produces 79 false negatives and 90 false positives, whereas the cGAN achieves the lowest false-negative count (48) and maintains a low false-positive count (78). This suggests that label conditioning enables more accurate reconstruction of minority-class decision boundaries, improving recall without sacrificing precision. In contrast, WGAN introduces a higher number of false positives (138) despite comparable recall, suggesting that Wasserstein-based training may produce broader or more diffuse feature distributions that reduce linear separability. WGAN-GP increases both error types relative to the Vanilla GAN and cGAN (115 false negatives and 100 false positives), suggesting that gradient-penalized regularization may oversmooth the minority distribution under extreme sparsity, leading to reduced boundary sharpness.

For SVM, the Vanilla GAN, cGAN, and WGAN yield near-identical performance, with minimal false positives and comparable false-negative counts. Once the minority class is sufficiently densified through augmentation, the SVM’s margin-based decision function becomes largely insensitive to the specific adversarial objective. This indicates that, beyond a certain density threshold, improvements in synthetic sample quality have a diminishing impact on margin-based classifiers. In contrast, WGAN-GP exhibits substantially poorer performance, producing 1026 false positives and 6864 false negatives, indicating severe degradation in margin separability. This behavior suggests that gradient-penalized training distorts class boundaries when the underlying minority distribution is highly sparse.

A similar pattern was observed for KNN. The Vanilla GAN, cGAN, and WGAN produce nearly identical results, with false-negative counts between 640 and 647 and 197 false positives in each case. This reflects KNN’s reliance on local neighborhood structure, where sufficiently clustered synthetic samples produce stable classification behavior regardless of the generative objective. However, WGAN-GP introduces additional errors, yielding 279 false positives and 1749 false negatives, suggesting reduced neighborhood purity. This suggests that overly smooth synthetic distributions can blur local structure, negatively impacting distance-based classifiers.

For the Decision Tree and Random Forest classifiers, no misclassifications were observed across all GAN variants. This indicates that once the Discovery class is sufficiently represented, tree-based models effectively isolate the minority class through hierarchical partitioning. Their invariance across GAN architectures suggests that these models rely primarily on feature-level separability rather than on fine-grained distributional differences in the synthetic data.

Overall, the Discovery results reveal a consistent pattern: the choice of adversarial objective primarily affects linear and distance-based classifiers, while tree-based models remain robust once sufficient minority representation is achieved. Among the evaluated approaches, cGAN provides the most balanced improvement in recall and precision, suggesting that conditional generation is particularly effective in preserving class-specific structure under moderate sparsity. In contrast, WGAN-GP introduces significant degradation for non-tree classifiers, suggesting that strong regularization can be detrimental when training data is extremely limited. These findings highlight a key trade-off between distribution smoothness and the preservation of discriminative structure, which is critical for effective minority-class augmentation in intrusion detection systems.

6.1.2. Credential Access

The results for Credential Access (Table 4) demonstrate pronounced performance differences across GAN variants, particularly for classifiers sensitive to severe class imbalance. Because each GAN architecture is trained independently on Credential Access samples prior to classifier evaluation and all downstream models are evaluated under identical preprocessing and stratified cross-validation conditions, observed differences reflect how each adversarial objective shapes the synthetic minority-class distribution.

For Logistic Regression, conditioning provides a substantial advantage. The Vanilla GAN produces numerous errors, with 3374 false negatives and 2545 false positives, indicating poor alignment between the synthetic samples and a linear decision boundary. In contrast, the cGAN sharply reduces both error types, yielding only 90 false negatives and 158 false positives. This indicates that label conditioning enables the generator to better capture class-specific feature distributions, thereby significantly improving linear separability under extreme imbalance. WGAN exhibits performance nearly identical to that of cGAN, suggesting that once minority structure is sufficiently captured, the benefit of the Wasserstein objective is marginal for linear classifiers. By comparison, WGAN-GP performs substantially worse, producing the highest false-positive count (3155) and an elevated number of false negatives (4376), indicating that gradient-penalized regularization may overly smooth the synthetic distribution, increasing overlap between minority samples and benign traffic and degrading the clarity of the decision boundary.

For SVM and KNN, both cGAN and WGAN achieve almost error-free classification, with only a small number of misclassifications (≤5 false positives and ≤42 false negatives). This suggests that once the minority class is sufficiently densified, both margin-based (SVM) and distance-based (KNN) classifiers reach a performance plateau, beyond which further improvements in generative fidelity have limited impact. In contrast, WGAN-GP again degrades performance, producing 609 FPs for SVM and 152 FPs for KNN, along with increased false negatives, indicating reduced class separability and distortion of local neighborhood structure under gradient-penalized training.

For Decision Tree and Random Forest classifiers, no misclassifications were observed across all GAN variants. Once Credential Access samples are augmented to sufficient density, tree-based models robustly isolate the minority class through hierarchical splitting and remain largely invariant to differences in adversarial objectives. This robustness suggests that tree-based models rely primarily on feature-level partitioning rather than precise distributional fidelity, making them less sensitive to variations in synthetic data generation. As in the Discovery case, this behavior reflects effective class separability rather than data leakage, because synthetic samples are generated exclusively from the training data and the trained generators are reused across fixed cross-validation splits.

Overall, the Credential Access results indicate that conditional generation yields the most consistent improvements for linear classifiers, highlighting the importance of incorporating class-specific information when modeling highly sparse attack behaviors. While WGAN performs comparably under this setting, WGAN-GP introduces systematic performance degradation across multiple models, reinforcing the observation that strong regularization can be detrimental when training data is extremely limited. Tree-based ensemble methods remain largely unaffected by the choice of GAN once class imbalance is mitigated, demonstrating that sufficient minority representation is often more critical than the specific generative objective for these models.

6.1.3. Privilege Escalation

The results for Privilege Escalation (Table 5) indicate that the choice of adversarial objective has a measurable impact on linear and distance-based classifiers, while tree-based models remain largely unaffected once sufficient minority-class augmentation is achieved. As in previous experiments, all GAN variants were trained independently on Privilege Escalation samples prior to classifier evaluation, and all downstream models were evaluated under identical preprocessing and cross-validation conditions.

For Logistic Regression, the Vanilla GAN yields the most favorable balance between recall and precision, with only 27 false negatives and 15 false positives. In comparison, both cGAN and WGAN maintain a higher false-negative count (130) and introduce additional false positives (71), indicating a modest reduction in precision relative to the Vanilla GAN. This suggests that, for this tactic, simpler adversarial objectives may better preserve the original minority-class structure, whereas conditioning or Wasserstein-based training may introduce slight distributional shifts that reduce linear separability. WGAN-GP performs substantially worse, producing 5762 false negatives and 3533 false positives, indicating that gradient-penalized regularization significantly increases class overlap and degrades boundary definition under sparse conditions.

A similar pattern was observed for SVM. The Vanilla GAN achieves the lowest error counts, with 136 false negatives and 15 false positives. In contrast, both cGAN and WGAN exhibit increased error counts, producing 1561 false negatives and 566 false positives. This behavior suggests that margin-based classifiers are sensitive to subtle distortions in the synthetic feature distribution, where even small shifts introduced by conditioning or Wasserstein objectives can adversely affect margin placement. WGAN-GP again underperforms, generating a sharp rise in misclassifications (8666 false negatives and 927 false positives), indicating severe degradation in margin separability due to overly smoothed or distorted synthetic samples.

For KNN, the Vanilla GAN yields the lowest error rates, with only 5 false negatives and 13 false positives, suggesting that it preserves local neighborhood structure well. Both cGAN and WGAN introduce a modest increase in error, while WGAN-GP substantially degrades performance, yielding 1528 false negatives and 467 false positives. This pattern indicates that local neighborhood integrity is best preserved under simpler generative objectives, while more complex regularization can blur local structure and reduce neighborhood purity.

For Decision Tree and Random Forest, classification performance is effectively perfect across all GAN variants. Decision Tree has at most a single false positive for WGAN and none for the other variants, while Random Forest achieves zero false positives and zero false negatives across all cases. This confirms that tree-based models are robust to variations in synthetic data distribution, as long as key discriminative feature thresholds are preserved. As in prior tactics, this behavior reflects effective class separability rather than data leakage, as synthetic samples are generated exclusively from training data and evaluated using fixed cross-validation splits.

Overall, for Privilege Escalation, the Vanilla GAN provides the most consistent augmentation for linear and distance-based classifiers under the evaluated configuration. While cGAN and WGAN preserve reasonable recall, they introduce additional classification errors relative to the Vanilla GAN, and WGAN-GP consistently underperforms. These results highlight that increased model complexity does not necessarily translate to improved performance under extreme data sparsity and may instead introduce instability or unnecessary smoothing. Tree-based classifiers remain robust across all GANs once class imbalance is mitigated, reinforcing the observation that sufficient minority representation is more critical than the specific adversarial objective for these models.

6.1.4. Exfiltration

The results for Exfiltration (Table 6) demonstrate that classifier performance is strongly influenced by the adversarial objective, particularly for linear and distance-based models under extreme class imbalance. As in previous experiments, all GAN variants were trained independently on Exfiltration samples prior to classifier evaluation and evaluated under identical preprocessing and stratified cross-validation conditions.

For Logistic Regression, the Vanilla GAN yields a reasonable balance between recall and precision, producing 115 false negatives and 36 false positives. Introducing conditional generation via cGAN reduces false positives to 16 but increases false negatives to 215, indicating a trade-off in which precision improves at the expense of recall. This suggests that conditioning introduces a more conservative decision boundary that reduces false alarms but may fail to capture the full diversity of minority-class patterns. The WGAN further reduces false positives to 6, but at a substantial cost to recall, yielding 942 false negatives. This behavior suggests that Wasserstein-based training may prioritize boundary sharpness over coverage of the minority class, leading to the underrepresentation of rare patterns. In contrast, WGAN-GP achieves the lowest false-negative count (7) but introduces 126 false positives, indicating a strong recall bias accompanied by increased overlap with benign traffic due to oversmoothing of the synthetic distribution.

For SVM, the Vanilla GAN exhibits limited recall, producing 1274 false negatives and 520 false positives. cGAN substantially improves recall relative to the Vanilla GAN, reducing false negatives to 604, whereas WGAN yields 6101 false negatives, indicating weaker recall than cGAN. However, cGAN retains a higher false-positive count (126) than WGAN, which yields a much cleaner decision boundary with only 15 false positives. This highlights a clear trade-off between margin sensitivity and boundary precision, with cGAN improving minority-class detection while WGAN favors stricter separation at the cost of missed detections. In contrast, WGAN-GP performs poorly, producing 18,511 false negatives and 456 false positives, indicating severe distortion of margin structure and instability under gradient-penalized training.

A similar pattern was observed for KNN, which is sensitive to local neighborhood density. The Vanilla GAN performs well, with 169 false negatives and 17 false positives, while cGAN improves both recall and precision, reducing false negatives to 70 and false positives to 11. This suggests that conditional generation effectively enhances local cluster coherence for this tactic. In contrast, WGAN increases false negatives to 550 and false positives to 186, and WGAN-GP further degrades performance, yielding 710 false negatives and 489 false positives. These results indicate that overly smooth or diffuse synthetic distributions disrupt local neighborhood structure, reducing classification reliability for distance-based methods.

For the Decision Tree and Random Forest classifiers, no misclassifications were observed for the Vanilla GAN, cGAN, and WGAN, and only a single false positive was observed for WGAN-GP in the Decision Tree model. Random Forest achieved zero false positives and zero false negatives across all GAN variants. This confirms that tree-based models are robust to variations in synthetic data distribution once sufficient minority representation is achieved, relying primarily on feature thresholding rather than distributional precision. As in prior tactics, tree-based classifiers remain largely invariant to the choice of adversarial objective.

Overall, the Exfiltration results indicate that no single GAN objective dominates across all classifiers. Based on the counts reported above, cGAN improves recall for margin- and distance-based models, while WGAN produces the cleanest decision boundary for SVM at the cost of recall. WGAN-GP favors recall at the expense of precision for Logistic Regression and degrades performance for SVM and KNN under the evaluated configuration. These findings highlight a fundamental trade-off between minority-class coverage and boundary sharpness, where different GAN objectives emphasize different aspects of the data distribution. Tree-based classifiers remain robust across GAN choices, underscoring that adversarial objective selection is most critical for linear and neighborhood-based models in highly imbalanced cybersecurity detection tasks.

6.1.5. Lateral Movement

The results for Lateral Movement (Table 7) show pronounced differences across adversarial objectives, particularly for linear and distance-based classifiers, reflecting the attack’s extreme sparsity in the original dataset. As in prior experiments, all GAN variants were trained independently on Lateral Movement samples prior to classifier evaluation and evaluated under identical preprocessing and stratified cross-validation conditions.

For Logistic Regression, the Vanilla GAN yields relatively weak performance, producing 353 false negatives and 649 false positives, indicating poor linear separability despite augmentation. This suggests that the Vanilla GAN is unable to adequately capture the underlying structure of extremely sparse minority samples. In contrast, both cGAN and WGAN substantially improve performance, reducing false negatives to six and false positives to two. These results indicate that incorporating either conditioning or the Wasserstein objective significantly enhances the alignment of synthetic samples with the true minority-class distribution under extreme sparsity. In comparison, WGAN-GP degrades performance, yielding 303 false negatives and 1059 false positives, indicating that gradient-penalized regularization may introduce excessive smoothing, leading to increased class overlap and reduced separability.

A similar pattern was observed for SVM. The Vanilla GAN exhibits limited recall, producing 1600 false negatives and 121 false positives. Both cGAN and WGAN achieve very low error rates, reducing false negatives to six and false positives to two, indicating strong margin separability once the minority class is sufficiently densified. This demonstrates that for extremely rare attack types, improving minority density substantially impacts margin-based classifiers, provided that synthetic samples preserve class structure. In contrast, WGAN-GP performs substantially worse, yielding 3679 false negatives and 1755 false positives, indicating severe distortion of the margin due to overly smoothed or poorly structured synthetic data.

For KNN, the Vanilla GAN produces moderate errors, with 46 false negatives and 47 false positives. Both cGAN and WGAN achieve almost error-free classification, with only two false negatives and four false positives, indicating well-preserved local neighborhood structure. This suggests that both conditioning and Wasserstein objectives effectively reconstruct local feature relationships when sufficient structure is learned. WGAN-GP again underperforms, producing 25 false negatives and 45 false positives, consistent with degraded local neighborhood coherence due to overly diffuse synthetic samples.

For the Decision Tree classifier, no misclassifications were observed across all GAN variants. Random Forest similarly achieves very low error rates, with only two false negatives for the Vanilla GAN, cGAN, and WGAN and no misclassifications under WGAN-GP. This further confirms that tree-based models are resilient to variation in synthetic data distributions, as long as key feature thresholds defining the minority class are preserved. As in prior tactics, tree-based models remain largely invariant to the choice of adversarial objective once sufficient minority-class augmentation is achieved.

Overall, the Lateral Movement results indicate that cGAN and WGAN are particularly effective at stabilizing classification performance for extremely rare attack types, especially for linear and distance-based classifiers. In contrast, the Vanilla GAN provides insufficient augmentation in this setting, and WGAN-GP consistently underperforms under the evaluated configuration. These findings highlight that, under extreme sparsity, the ability to accurately reconstruct minority-class structure is more critical than enforcing strong regularization or distributional smoothness. Tree-based classifiers remain robust regardless of the choice of GAN once class imbalance is mitigated.
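The false-negative and false-positive counts discussed above map onto recall and precision once the number of true positives is known. The following is a minimal sketch of that conversion for the Logistic Regression counts reported for Lateral Movement. The value `N_POSITIVE` is a hypothetical placeholder for the number of attack records in the evaluation folds, which is not stated in this section, so the resulting percentages are illustrative only.

```python
def recall_precision(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (recall, precision) for the positive (attack) class."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# FN/FP counts reported in the text for Logistic Regression (Lateral Movement).
reported = {
    "Vanilla GAN": {"fn": 353, "fp": 649},
    "cGAN": {"fn": 6, "fp": 2},
    "WGAN": {"fn": 6, "fp": 2},
    "WGAN-GP": {"fn": 303, "fp": 1059},
}

# HYPOTHETICAL: total attack records evaluated; not taken from the paper.
N_POSITIVE = 10_000

metrics = {}
for name, counts in reported.items():
    tp = N_POSITIVE - counts["fn"]  # positives that were not missed
    metrics[name] = recall_precision(tp, counts["fp"], counts["fn"])

for name, (recall, precision) in metrics.items():
    print(f"{name:12s} recall={recall:.4f} precision={precision:.4f}")
```

Under any fixed positive count, the ordering of the variants by recall follows the false-negative counts directly, which is why the discussion above can compare objectives using the raw counts alone.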

6.1.6. Resource Development

The results for Resource Development (Table 8) show that classifier performance is strongly influenced by the choice of adversarial objective, particularly for linear and margin-based classifiers, reflecting the extreme sparsity of this tactic in the original dataset. As in prior experiments, all GAN variants were trained independently on Resource Development samples before classifier evaluation and were evaluated under identical preprocessing and stratified cross-validation conditions.

For Logistic Regression, the Vanilla GAN yields moderate performance, producing 28 false negatives and 89 false positives, indicating limited linear separability even after augmentation. This suggests that the Vanilla GAN struggles to adequately reconstruct the minority-class feature distribution under extreme sparsity. In contrast, both cGAN and WGAN substantially improve performance, reducing false negatives to one and false positives to zero. These results indicate that conditioning and Wasserstein-based objectives enable more accurate modeling of sparse class-specific structure, leading to near-perfect linear separability. In comparison, WGAN-GP performs substantially worse, yielding 4489 false negatives and 1572 false positives, indicating that gradient-penalized regularization introduces excessive smoothing and increases overlap between minority and majority classes.

A similar pattern was observed for SVM. The Vanilla GAN exhibits limited recall, with 505 false negatives and 37 false positives. Both cGAN and WGAN markedly improve performance, reducing false negatives to 14 and false positives to 6, yielding very low error rates. This demonstrates that margin-based classifiers benefit significantly from improved minority density when synthetic samples preserve structural consistency. In contrast, WGAN-GP again underperforms, generating 8572 false negatives and 293 false positives, indicating severe degradation of margin separability due to distorted synthetic distributions.

For KNN, the Vanilla GAN already performs well, with 20 false negatives and 6 false positives. Both cGAN and WGAN further reduce classification error, producing only one false negative and two false positives, indicating improved neighborhood purity. This suggests that both conditioning and Wasserstein objectives enhance local cluster coherence for this tactic. WGAN-GP introduces noticeably higher errors, with 49 false negatives and 82 false positives, indicating distortion of local density relationships due to overly smooth or diffuse synthetic samples.

For Decision Tree and Random Forest classifiers, no misclassifications were observed across all GAN variants. Once the Resource Development class is sufficiently augmented, tree-based models robustly isolate the minority class through hierarchical splitting and remain largely invariant to the choice of adversarial objective. This confirms that tree-based models depend primarily on feature-threshold separability rather than on precise distributional modeling. As in previous tactics, this behavior reflects effective class separability rather than data leakage.

Overall, the Resource Development results indicate that cGAN and WGAN provide the most consistent improvements for extremely rare attack types, particularly for linear, margin-based, and distance-based classifiers. The Vanilla GAN shows limited improvement under severe sparsity, whereas WGAN-GP consistently underperforms across the evaluated configurations. These findings reinforce that accurate reconstruction of minority-class structure is more critical than increased model complexity or regularization strength in highly sparse cybersecurity settings. Tree-based classifiers remain robust regardless of the choice of GAN once class imbalance is mitigated.

6.1.7. Defense Evasion

The results for Defense Evasion (Table 9) show consistent, strong classification performance across all GAN variants. Given the extreme sparsity of this tactic in the original dataset, synthetic augmentation appears sufficient to produce a learnable minority-class representation regardless of the specific adversarial objective employed. As in prior experiments, all GAN variants were trained independently on Defense Evasion samples and evaluated under identical preprocessing and cross-validation conditions.

For Logistic Regression, all GAN variants achieve very low error rates. The Vanilla GAN, cGAN, and WGAN each produce only one false negative and zero false positives, indicating strong linear separability following augmentation. WGAN-GP introduces a slight degradation, with two false negatives and zero false positives, but overall performance remains high. This suggests that, for this tactic, even simple generative models are sufficient to reconstruct the minority-class structure once minimal representation is achieved, and that linear classifiers benefit primarily from class densification rather than from subtle differences in adversarial objectives.

For SVM, classification is effectively error-free across all GAN variants. The Vanilla GAN and WGAN-GP achieve zero false negatives and zero false positives, while cGAN and WGAN introduce only two false positives each, with no false negatives. This indicates that margin-based classifiers reach a saturation point once a minimal level of minority density is achieved; beyond that point, they become largely insensitive to the choice of GAN architecture, and improvements in generative modeling provide negligible benefit.

A similar pattern was observed for KNN. The Vanilla GAN and WGAN-GP produce three false positives and no false negatives, while cGAN introduces eight false positives with no false negatives. WGAN yields the highest error among the variants for KNN, producing 80 false positives and 1 false negative, indicating some degradation in local neighborhood structure. This suggests that certain adversarial objectives may still introduce minor distortions in local density, although these effects are too small to materially change classification outcomes, and overall performance remains strong across all GAN variants.

For Decision Tree classifiers, no misclassifications were observed for any GAN variant. Random Forest also achieves near-error-free performance: both the Vanilla GAN and WGAN introduce a single false negative, while cGAN and WGAN-GP achieve zero misclassifications. As in previous tactics, tree-based classifiers remain largely invariant to the adversarial objective once the minority-class augmentation is achieved. This further reinforces that tree-based models rely on clear feature thresholds rather than precise distributional fidelity.

Overall, the Defense Evasion results indicate that all GAN variants are effective at stabilizing classifier performance for this extremely rare attack type once basic class balance is achieved. Differences among adversarial objectives are marginal and classifier-dependent, with tree-based models exhibiting consistent robustness across all augmentation strategies. These findings suggest that, for certain attack types, the primary challenge lies in achieving sufficient representation rather than optimizing the generative objective, highlighting diminishing returns in model complexity once separability is established.

6.1.8. Initial Access

The results for Initial Access (Table 10) show consistently strong classification performance across all GAN variants. Even with minimal synthetic augmentation, all classifiers achieve low error rates, indicating that this attack type is readily learned once class imbalance is mitigated. As in prior experiments, all GAN variants were trained independently on Initial Access samples and evaluated under identical preprocessing and cross-validation conditions.

For Logistic Regression, all GAN variants perform exceptionally well. The Vanilla GAN and WGAN-GP each produce one false negative and zero false positives, while cGAN and WGAN achieve error-free classification with zero false negatives and zero false positives. These results indicate strong linear separability of the augmented minority class across all adversarial objectives. This suggests that the underlying feature distribution of this tactic is inherently well-structured, requiring only minimal augmentation to become linearly separable.

For SVM, performance is error-free across all GAN variants, with zero false negatives and zero false positives. Once synthetic augmentation is applied, margin-based classification appears insensitive to the choice of adversarial objective for this tactic. This indicates that the classifier reaches a saturation point, beyond which further improvements in synthetic data quality do not translate into measurable performance gains.

A similar trend was observed for KNN. The Vanilla GAN and WGAN-GP produce only three false positives and no false negatives, while cGAN and WGAN introduce slightly more false positives (seven) but still no false negatives. These differences are minor, and the overall performance remains high across all variants. This suggests that the local neighborhood structures are sufficiently well-defined after augmentation, making KNN robust to minor variations in synthetic sample generation.

For Decision Tree classifiers, no misclassifications were observed across all GAN variants. Random Forest also achieves near-error-free performance, with each GAN variant producing a single false negative and zero false positives. As with prior tactics, tree-based ensemble models remain robust once sufficient augmentation of the minority class is achieved. This further confirms that tree-based methods rely primarily on feature-level separability rather than precise distributional fidelity.

Overall, the Initial Access results indicate that GAN-based augmentation is sufficient to stabilize classifier performance for this tactic regardless of the adversarial objective used. Differences among Vanilla GAN, cGAN, WGAN, and WGAN-GP are minimal, and classifier performance remains consistently high across linear, margin-based, distance-based, and tree-based models. These findings highlight diminishing returns in model complexity: once basic class balance is achieved, more advanced generative objectives provide little additional benefit.

6.1.9. Persistence

The results for Persistence (Table 11) show consistently strong classification performance across all GAN variants. Once GAN-based augmentation is applied, all classifiers achieve very low error rates, indicating that this attack type is readily learned regardless of the specific adversarial objective used. As in prior experiments, all GAN variants were trained independently on Persistence samples and evaluated under identical preprocessing and cross-validation conditions.

For Logistic Regression, all GAN variants perform exceptionally well. The Vanilla GAN, cGAN, and WGAN each produce only two false negatives and zero false positives, while WGAN-GP achieves error-free classification with zero false negatives and zero false positives. These results indicate the strong linear separability of the augmented minority class across all adversarial objectives. This suggests that the Persistence tactic exhibits a well-defined feature structure that becomes readily separable with minimal augmentation.

For SVM, classification performance is effectively error-free across all GAN variants. The Vanilla GAN and WGAN-GP produce no false negatives or false positives, while cGAN and WGAN introduce only two false positives each, with no false negatives. These differences are minimal and do not materially affect the overall performance. This indicates that margin-based classifiers quickly reach a performance ceiling once sufficient minority representation is achieved.

A similar pattern was observed for KNN. The Vanilla GAN and WGAN-GP produce three false positives and no false negatives, whereas cGAN and WGAN introduce slightly more false positives (seven) while still producing no false negatives. Despite these minor differences, the overall classification performance remains high across all variants. This suggests that local neighborhood structures are well preserved across all GAN objectives once basic class density is established.

For the Decision Tree classifier, no misclassifications were observed across any GAN variant. Random Forest also achieves near-error-free performance, with each GAN variant producing only one false negative and no false positives. As with prior tactics, tree-based ensemble models remain robust once sufficient augmentation of the minority class is achieved. This reinforces the idea that tree-based models rely primarily on clear feature thresholds rather than on subtle distributional differences in synthetic data.

Overall, the Persistence results indicate that GAN-based augmentation is sufficient to stabilize classifier performance for this tactic, regardless of the adversarial objective used. Differences among Vanilla GAN, cGAN, WGAN, and WGAN-GP are negligible, and classifier behavior remains consistent across linear, margin-based, distance-based, and tree-based models. These findings further support the observation that, for certain attack types, achieving sufficient minority representation is more important than optimizing the generative model’s complexity, leading to diminishing returns from more advanced GAN variants.

6.2. Training Time Results

Baseline Training

Baseline training times were recorded on Google Colab Pro using the computational configurations described in Section 5.5, with GPU acceleration enabled. All GAN variants were trained with identical baseline hyperparameters [10,11], including batch size, noise dimensionality, optimizer settings, and number of training epochs. Training time corresponds to the wall-clock execution time for a single run, measured directly from notebook execution, as reported in Table 12.

The observed training time shows moderate variation across adversarial objectives. The Vanilla GAN required 19 min of training and served as a reference baseline. The cGAN completed training in 17 min, making it the fastest variant under the evaluated configuration. The WGAN required 20 min of training, comparable to the Vanilla GAN. In contrast, WGAN-GP was the most computationally expensive variant, requiring 29 min of training.
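To make the relative cost explicit, the reported training times can be normalized against the Vanilla GAN reference. The short sketch below uses only the four values stated above. Because each value reflects a single wall-clock run, the ratios should be read as indicative rather than as precise overhead estimates.

```python
# Reported single-run training times in minutes (Table 12, as quoted in the text).
minutes = {"Vanilla GAN": 19, "cGAN": 17, "WGAN": 20, "WGAN-GP": 29}

baseline = minutes["Vanilla GAN"]

# Training time relative to the Vanilla GAN reference (1.00 = same cost).
relative = {name: round(m / baseline, 2) for name, m in minutes.items()}

fastest = min(minutes, key=minutes.get)
slowest = max(minutes, key=minutes.get)

for name, ratio in relative.items():
    print(f"{name:12s} {minutes[name]:3d} min  x{ratio:.2f}")
```

On these figures, WGAN-GP takes roughly one and a half times as long as the Vanilla GAN, while cGAN is about 10% faster.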

Beyond raw training time, these results highlight an important trade-off between computational cost and model stability/performance. While WGAN-GP incurs the highest training time, its gradient penalty mechanism is known to improve training stability and reduce mode collapse [13,14], which can be beneficial in highly imbalanced settings. Conversely, simpler architectures such as Vanilla GAN and cGAN offer lower computational overhead, making them more suitable for resource-constrained environments, although they may exhibit less stable training behavior [9,10]. This trade-off is particularly relevant in real-world intrusion detection systems, where computational efficiency, scalability, and detection performance must be carefully balanced.

From a scalability perspective, training time is expected to increase approximately linearly with dataset size, as each epoch processes more samples [10]. However, larger datasets may also improve the stability of adversarial training by providing richer data distributions [1,10]. In such cases, more complex objectives (e.g., WGAN-GP) may benefit from increased data availability, whereas simpler models may scale more effectively in terms of computational cost.

Additionally, the class-specific training strategy adopted in this study enables partial scalability by decomposing the problem into independent training tasks for each minority class. This allows for potential parallelization across multiple models, mitigating the computational burden associated with larger datasets or increased augmentation levels.
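The decomposition into independent per-tactic training tasks can be sketched with Python's standard `concurrent.futures` module. This is an illustrative outline only: `train_tactic_gan` is a hypothetical placeholder rather than the training routine used in this study, and a thread pool is used purely to keep the example self-contained. In practice, GPU-bound training jobs would more likely be dispatched to separate processes or devices.

```python
from concurrent.futures import ThreadPoolExecutor

# Tactics discussed in this section; each one gets its own GAN.
TACTICS = [
    "Lateral Movement",
    "Resource Development",
    "Defense Evasion",
    "Initial Access",
    "Persistence",
]


def train_tactic_gan(tactic: str) -> dict:
    """HYPOTHETICAL placeholder for a tactic-specific GAN training job.

    A real implementation would load the minority samples for `tactic`,
    train the generator and discriminator, and return the trained model.
    """
    return {"tactic": tactic, "status": "trained"}


def train_all(tactics, max_workers: int = 4) -> dict:
    """Run the independent training jobs concurrently and collect results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(train_tactic_gan, tactics))
    return {r["tactic"]: r for r in results}


results = train_all(TACTICS)
print(sorted(results))
```

Because the tasks share no state, the total wall-clock time under this scheme is bounded by the slowest individual job rather than by the sum of all jobs, subject to available hardware.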

Overall, these results indicate that adversarial objectives introduce measurable differences in computational cost under otherwise identical training conditions. Among the evaluated variants, WGAN-GP exhibits the highest training time but offers potential stability benefits, while cGAN achieves the lowest training time, with Vanilla GAN and WGAN providing a balance between efficiency and robustness.

6.3. Summary of Results

Section 6 presented a comprehensive empirical evaluation of four GAN architectures—Vanilla GAN, Conditional GAN (cGAN), Wasserstein GAN (WGAN), and Wasserstein GAN with Gradient Penalty (WGAN-GP)—for generating minority-class cyberattack data for the UWF-ZeekData22 dataset. Performance was assessed across nine MITRE ATT&CK tactics using five classical machine learning classifiers, with consistent preprocessing, training, and evaluation protocols and a fixed augmentation ratio of 0.15. Notably, several attack types in the dataset contain fewer than 10 samples, and in some cases only a single instance, placing the problem in an extreme-sparsity or few-shot learning regime.

Across most tactics and classifiers, Vanilla GAN provided a strong and reliable baseline. Despite its simpler adversarial objective, Vanilla GAN frequently achieved a favorable balance between recall and false-positive rate for linear and distance-based classifiers. This behavior is particularly evident for moderately rare tactics, such as Discovery, Privilege Escalation, and Persistence, where Vanilla GAN often matched or exceeded the performance of more complex architectures under the evaluated configuration. This indicates that under limited data, simpler models that preserve the observed feature structure can be more effective than complex models that attempt to learn the full data distribution.

Conditional GANs (cGANs) exhibited highly variable behavior across tactics. In some cases, conditional generation reduced false negatives for extremely sparse attack types; however, performance was inconsistent across classifiers and tactics, with noticeable degradation in certain linear and distance-based models. This variability suggests that conditioning mechanisms require sufficient data density to be effective and may introduce instability when applied to highly sparse or poorly defined class distributions. These patterns indicate that cGAN performance was sensitive to the structure of the minority-class data and the conditioning mechanism, limiting its reliability as a uniform augmentation strategy in this setting.

Wasserstein GANs (WGANs) demonstrated comparatively consistent performance across attack types and classifiers. In many cases, WGANs achieved low false-negative rates for linear and margin-based classifiers while maintaining controlled false-positive levels. This consistency reflects the stabilizing effect of the Wasserstein objective, which yields smoother gradients and improves training robustness under limited data. This suggests that the Wasserstein objective produced synthetic samples that preserved class structure more reliably than conditional objectives under the evaluated augmentation setting.

WGAN-GP exhibited mixed and often degraded performance across several tactics, particularly for linear and distance-based classifiers. While WGAN-GP occasionally produced smoother synthetic distributions, this smoothing often increased overlap between minority and majority classes, reducing discriminative separability. This indicates that strong regularization may be detrimental in extreme-sparsity settings, where preserving sharp feature distinctions is more important than enforcing smooth distributions.

Classifier sensitivity varied substantially across model families. Tree-based classifiers (Decision Tree and Random Forest) were largely insensitive to the choice of GAN architecture once sufficient minority-class augmentation was applied and frequently achieved near-error-free performance across tactics. This suggests that these models rely primarily on feature-level partitioning and thresholding rather than precise distributional fidelity. In contrast, Logistic Regression, SVM, and KNN were more sensitive to differences in synthetic data distributions, making them effective indicators of the influence of adversarial objectives on minority-class separability.

Overall, the results indicate that the selection of GAN architecture has a meaningful impact on downstream classifier behavior in imbalanced intrusion detection tasks, particularly for linear and distance-based models. While simple adversarial objectives often prove effective, architectures that emphasize stable distributional alignment exhibit more consistent behavior across diverse attack types. A key finding across all experiments is that, in extreme-sparsity settings, the effectiveness of the GAN-based augmentation depends more on preserving discriminative structure and appropriate augmentation calibration than on architectural complexity. These findings motivate a deeper discussion of stability, classifier sensitivity, and practical trade-offs, which is addressed in the following section.

7. Discussion

The study conducted a systematic evaluation of four GAN architectures—Vanilla GAN, cGAN, WGAN, and WGAN-GP—for minority-class data augmentation in highly imbalanced intrusion detection tasks using the UWF-ZeekData22 dataset [15,16]. Analysis of downstream classifier performance across nine MITRE ATT&CK tactics reveals several consistent patterns in generative stability, classifier sensitivity, and the practical trade-offs between architectural complexity and empirical utility. Importantly, these findings must be interpreted in the context of extreme class imbalance, where several attack types contain fewer than ten samples and, in some cases, only a single instance. Under such conditions, GAN-based augmentation operates in a few-shot regime, fundamentally shifting the generative model’s role from distribution learning to structure preservation and controlled sample expansion.

7.1. Effectiveness of Simple Adversarial Objectives

One of the most notable findings is the strong and often competitive performance of the Vanilla GAN across multiple attack types. Despite lacking explicit conditioning or Wasserstein-based regularization, the Vanilla GAN frequently achieved a favorable balance between recall and false-positive rate for linear and distance-based classifiers, particularly for moderately rare tactics such as Discovery, Privilege Escalation, and Persistence.

This behavior suggests that, in tabular cybersecurity data, the simplicity of the original adversarial objective can serve as an implicit regularizer when the minority distribution is relatively compact. Rather than attempting to learn the full data distribution, Vanilla GAN effectively preserves the limited observed structure, which is sufficient to improve downstream separability in low-sample regimes.

Prior work has shown that excessive regularization or architectural constraints can unintentionally suppress informative modes in low-sample regimes, leading to degraded class separability rather than improved fidelity [9,11,24]. The results in this study empirically confirm this phenomenon, demonstrating that increased model complexity does not necessarily translate to improved performance under extreme sparsity.

These results challenge the assumption that more sophisticated GAN variants universally outperform the original formulation and highlight the importance of empirical validation when applying generative models to structured, highly imbalanced domains.

7.2. Conditional GANs and Sensitivity to Extreme Sparsity

Conditional GANs exhibited highly variable performance across tactics. While conditioning occasionally reduced false negatives for extremely sparse attack types, cGANs also demonstrated instability and inconsistent behavior across classifiers. In several scenarios, cGAN-based augmentation led to substantial recall degradation or increased false positives, particularly for linear and neighborhood-based models.

This sensitivity is consistent with prior observations that conditional generation relies heavily on reliable label–feature relationships [12,17]. In cybersecurity datasets, labels are often sparse, noisy, or weakly informative, especially for advanced post-compromise behaviors. When only a handful of samples are available (e.g., one to ten instances), conditioning provides insufficient signal for learning meaningful class-conditioned distributions, leading to unstable or biased generation.

Under such conditions, conditioning can inadvertently constrain the generator to a limited subset of the minority-class distribution, leading to mode concentration rather than improved diversity [10,24].

These findings suggest that while conditional generation can be effective in well-labeled, sufficiently dense settings, it is not a reliable general-purpose augmentation strategy in extreme-sparsity regimes without additional data or regularization mechanisms.

7.3. Wasserstein Objectives and Distributional Stability

Wasserstein GANs demonstrated greater consistency across attack types and classifiers compared to conditional models. The Wasserstein-1 distance provides smoother gradients and a more meaningful measure of distributional divergence, which likely contributed to the observed stability in downstream classification performance [13,14].
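For reference, the Wasserstein-1 distance is usually written in its Kantorovich-Rubinstein dual form, which is the quantity the WGAN critic approximates. The notation below ($P_r$ for the real minority distribution, $P_g$ for the generator distribution, $f$ for the critic) is introduced here for illustration and is not taken from this section.

```latex
W_1(P_r, P_g) \;=\; \sup_{\lVert f \rVert_{L} \le 1}
  \; \mathbb{E}_{x \sim P_r}\!\left[ f(x) \right]
  \;-\; \mathbb{E}_{\tilde{x} \sim P_g}\!\left[ f(\tilde{x}) \right]
```

The supremum is taken over all 1-Lipschitz functions $f$, which is why WGAN training requires a mechanism for constraining the critic.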

In many cases, WGAN-based augmentation achieved low false-negative rates while maintaining controlled false-positive levels, particularly for margin-based classifiers. These results are consistent with prior work showing that Wasserstein objectives improve convergence behavior and mitigate mode collapse in high-dimensional spaces [13,14].

In the context of extreme sparsity, the primary advantage of WGAN appears to be improved training stability rather than increased expressive power. This enables more reliable reconstruction of minority-class structure without introducing excessive variance or instability.

Importantly, the benefits of WGANs in this study were realized without introducing the additional complexity of gradient penalties, suggesting that enforcing a weaker form of Lipschitz continuity via weight clipping may be sufficient for structured tabular data when augmentation ratios are modest.
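Weight clipping itself is a simple operation: after each critic update, every parameter is clamped to a fixed interval. The sketch below illustrates the idea on a plain list of numbers. The threshold `c = 0.01` is the default from the original WGAN formulation and is an assumption here, since the clipping value used in this study is not stated in this section.

```python
def clip_weights(weights, c=0.01):
    """Clamp every critic parameter into the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]


# Toy critic parameters after a gradient step (illustrative values only).
critic_weights = [0.5, -0.003, 0.02, -1.2]

clipped = clip_weights(critic_weights)
print(clipped)  # [0.01, -0.003, 0.01, -0.01]
```

Parameters already inside the interval are left unchanged, so the constraint only acts on weights that have grown beyond the threshold.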

7.4. Role of Gradient Penalty in Extremely Sparse Regimes

WGAN-GP exhibited the most stable training dynamics across tactics but did not consistently yield superior downstream classification performance. In several moderately sparse scenarios, gradient-penalized training produced smoother synthetic distributions that increased overlap with benign traffic, thereby degrading the performance of linear and distance-based classifiers. This indicates that while gradient penalties improve optimization stability, they may overly smooth the minority-class distribution, reducing the sharpness of discriminative features that are critical for classification.
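The smoothing effect described above can be related to the form of the gradient-penalized critic loss. In its standard formulation, with notation introduced here for illustration ($D$ for the critic, $\lambda$ for the penalty coefficient, whose value is not stated in this section), the loss is:

```latex
L \;=\; \mathbb{E}_{\tilde{x} \sim P_g}\!\left[ D(\tilde{x}) \right]
  \;-\; \mathbb{E}_{x \sim P_r}\!\left[ D(x) \right]
  \;+\; \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\!\left[
      \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The penalty is evaluated on points interpolated between real and generated samples. With only a few real minority samples, these interpolation paths cover a small and possibly unrepresentative region, which is one plausible reason the regularizer behaves differently under extreme sparsity.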

However, for ultra-sparse attack types such as Defense Evasion, Initial Access, and Persistence, WGAN-GP did not substantially harm classifier performance and occasionally matched alternative objectives. In these cases, the problem becomes dominated by representation rather than distributional fidelity, and any stable augmentation strategy is sufficient once minimal class density is achieved.

These findings align with prior work indicating that gradient penalties improve adversarial stability but may trade off sample sharpness for smoothness [14,24]. The results further suggest that the effectiveness of gradient-based regularization is highly dependent on data sparsity and should be carefully calibrated rather than uniformly applied.

7.5. Classifier Sensitivity as a Diagnostic Tool

A consistent pattern across all experiments is the differential sensitivity of classifier families to GAN-generated data. Tree-based classifiers (Decision Tree and Random Forest) were largely invariant to the GAN architecture once sufficient minority-class augmentation was achieved and frequently produced near-error-free performance across tactics. This robustness arises because tree-based models rely on feature thresholding rather than global distributional structure, making them less sensitive to variations in synthetic data quality.

In contrast, Logistic Regression, SVM, and KNN were substantially more sensitive to the choice of adversarial objectives. Linear and margin-based models amplified differences in global distributional alignment, while distance-based classifiers exposed local neighborhood distortions introduced by certain generative objectives.

These observations suggest that classifier sensitivity itself can serve as a diagnostic lens for evaluating the quality of synthetic data. Rather than relying solely on distributional similarity metrics, downstream classifiers—particularly simple linear and neighborhood-based models—provide actionable insight into how well synthetic samples preserve discriminative structure [8,45].
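One simple way to operationalize this diagnostic view for neighborhood-based models is to check how often a synthetic attack sample's nearest real neighbor is itself an attack record. The following sketch is an illustration under toy assumptions, not the evaluation procedure used in this study. The function names and the two-dimensional data are hypothetical.

```python
import math


def nearest_label(point, reference):
    """Label of the closest reference record (Euclidean distance)."""
    return min(reference, key=lambda item: math.dist(point, item[0]))[1]


def neighborhood_purity(synthetic, real):
    """Fraction of synthetic attack samples whose nearest real neighbor
    is an attack record (label 1)."""
    hits = sum(1 for s in synthetic if nearest_label(s, real) == 1)
    return hits / len(synthetic)


# Toy real data: (features, label), with 0 = benign and 1 = attack.
real = [
    ((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
    ((1.0, 1.0), 1), ((0.9, 1.1), 1),
]

# Synthetic samples that stay close to the attack cluster.
tight = [(0.95, 1.0), (1.05, 0.9), (0.9, 0.95)]
# Synthetic samples that drift toward the benign region.
diffuse = [(0.95, 1.0), (0.4, 0.4), (0.2, 0.3)]

purity_tight = neighborhood_purity(tight, real)
purity_diffuse = neighborhood_purity(diffuse, real)
print(purity_tight, purity_diffuse)
```

A low purity value signals that synthetic samples are landing in benign neighborhoods, the kind of local distortion that KNN exposes as additional false positives.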

7.6. Practical Implications for Intrusion Detection

From a practical standpoint, the results indicate that GAN-based augmentation is an effective strategy for mitigating extreme class imbalance in intrusion detection systems, but the choice of architecture should be guided by data characteristics rather than model complexity alone.

For moderately rare attack types, simpler adversarial objectives may suffice and even outperform more heavily regularized models. For extremely sparse attack classes, an architecture emphasizing training stability may be preferable, even at the cost of smoother synthetic distributions.

A key practical insight is that the effectiveness of augmentation depends more on achieving sufficient minority-class representation and preserving discriminative structure than on selecting increasingly complex generative models.

Importantly, tree-based classifiers remain robust across augmentation strategies, making them well-suited to operational deployment when combined with generative balancing techniques.

Additionally, the presence of a performance saturation effect across several tactics suggests that, beyond a certain point, increasing augmentation complexity yields diminishing returns, reinforcing the importance of balanced and calibrated augmentation strategies.

7.7. Limitations and Future Directions

This study has several limitations that suggest directions for future work. First, the experiments were conducted using a single augmentation ratio and a single large-scale dataset. While this controlled setting enabled fair architectural comparison, further studies should explore adaptive augmentation strategies and cross-dataset generalization.

Second, this study employed a fixed augmentation ratio (r = 0.15) across all experiments. While this choice ensured consistency and controlled comparison across classifiers and attack types, it did not capture the full impact of varying augmentation intensity on model performance. Prior ablation studies on the UWF-ZeekData22 dataset [49] have shown that augmentation intensity plays a critical role: moderate levels yield stable improvements, whereas excessive augmentation may introduce distributional overlap and degrade classifier performance. Future work will include a systematic evaluation of augmentation ratios beyond the fixed setting used in this study, including both lower and higher levels (e.g., 5% and 30%), to better understand the trade-offs among minority-class representation, generalization, and potential overreliance on synthetic data.
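To illustrate what a fixed ratio implies, the sketch below computes a synthetic-sample budget under the assumption that r is defined relative to the majority-class size, so that the number of generated records is r times the number of benign records. This definition is an assumption made for illustration. It is consistent with the minority totals of roughly 30,000 against 200,000 benign records in the reported confusion matrices, but the function name `synthetic_budget` and the rounding rule are hypothetical.

```python
def synthetic_budget(n_majority: int, ratio: float) -> int:
    """Number of synthetic minority records for a given augmentation ratio.

    Assumption: the ratio is taken relative to the majority-class size.
    """
    if n_majority < 0 or ratio < 0:
        raise ValueError("n_majority and ratio must be non-negative")
    return round(n_majority * ratio)


n_benign = 200_000
# The fixed setting (0.15) plus the lower and higher levels named as future work.
for r in (0.05, 0.15, 0.30):
    print(f"r = {r:.2f} -> {synthetic_budget(n_benign, r):,} synthetic records")
```

Sweeping `ratio` in this way is one simple starting point for the systematic evaluation of augmentation intensity described above.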

Third, this study employed a standard GAN architecture rather than more advanced tabular generative models, such as the Conditional Tabular GAN (CTGAN) or the Tabular Variational Autoencoder (TVAE). While these models have demonstrated strong performance in structured data synthesis, they typically require sufficient data density to effectively learn conditional relationships among features. In the UWF-ZeekData22 dataset [15,16], several minority classes contain extremely limited samples, in some cases fewer than ten instances, which limits the feasibility of training such models reliably. Future work will include a systematic comparison of Vanilla GAN, CTGAN, TVAE, and other tabular generative models to evaluate their effectiveness under extreme class imbalance.

Fourth, GAN performance was evaluated indirectly through downstream classifiers. Incorporating explicit distributional fidelity metrics and privacy-aware evaluations could further strengthen the assessment of synthetic data quality. Finally, extending this analysis to hybrid generative models or domain-aware conditioning mechanisms may improve robustness for extremely sparse cyberattack behaviors.

8. Conclusions

This study presented a systematic benchmarking analysis of four Generative Adversarial Network (GAN) architectures—Vanilla GAN, Conditional GAN (cGAN), Wasserstein GAN (WGAN), and Wasserstein GAN with Gradient Penalty (WGAN-GP)—for minority-class data augmentation in highly imbalanced cybersecurity intrusion detection tasks. Using the UWF-ZeekData22 dataset, each MITRE ATT&CK tactic was formulated as an independent binary classification problem, enabling fine-grained evaluation of how adversarial objectives affect the utility of synthetic data across diverse attack behaviors. Notably, several attack types in the dataset contain fewer than ten samples, placing the problem in an extremely sparse or few-shot learning regime.

The experimental results demonstrate that GAN-based augmentation can substantially stabilize downstream classifier performance for rare cyberattack types when applied under controlled preprocessing and training conditions. Consistent with baseline comparisons reported in our prior study [30], all evaluated GAN variants enhance minority-class learnability across nine ATT&CK tactics and five classical machine learning classifiers, particularly for linear and distance-based models that are sensitive to class imbalance. However, performance gains are strongly dependent on data sparsity and the preservation of discriminative feature structure rather than purely on model complexity.

Contrary to the assumption that increased architectural complexity necessarily yields superior performance, the Vanilla GAN often provided a strong and reliable baseline. For moderately rare attack types, the original adversarial objective frequently achieved a favorable balance between recall and false-positive rate, matching or exceeding more complex alternatives. This indicates that in low-sample regimes, simpler models that preserve the observed feature structure can outperform more sophisticated architectures that attempt to learn the full data distribution. Conditional GANs exhibited variable behavior across tactics, offering benefits in some ultra-sparse settings but demonstrating instability and sensitivity to data characteristics in others, limiting their reliability as a general-purpose augmentation strategy.

Wasserstein-based objectives showed greater consistency across attack types. WGANs frequently preserved minority-class structure while maintaining controlled error rates across classifiers, reflecting improved training stability under limited data conditions. In contrast, WGAN-GP, while providing stable optimization, often produced overly smooth synthetic distributions that increased class overlap and degraded discriminative performance for linear and distance-based classifiers. This highlights a trade-off between generative stability and discriminative fidelity, particularly in extreme-sparsity settings.

The classifier family played a critical role in evaluating the quality of synthetic data. Tree-based models were largely invariant to the GAN architecture once sufficient minority-class augmentation was achieved, whereas linear margin-based and distance-based classifiers were more sensitive to differences in adversarial objectives. These sensitivities provided valuable insights into how well synthetic samples preserved class structure and separability, underscoring the importance of downstream evaluation for assessing the effectiveness of generative augmentation.

A key observation across all experiments is the presence of a performance saturation effect, where once minimal minority-class representation is achieved, further increases in generative complexity yield diminishing returns in classifier performance.

Overall, the findings indicate that class-specific GAN-based augmentation is a practical and effective approach for mitigating extreme class imbalance in intrusion detection systems. More importantly, this study demonstrates that the effectiveness of GANs in cybersecurity applications is governed primarily by data sparsity and the preservation of discriminative structure, rather than by architectural sophistication alone. Rather than favoring a single generative architecture, the results emphasize that GAN selection should be guided by data sparsity, classifier sensitivity, and computational considerations. This study provides empirical guidance for selecting and deploying GAN-based augmentation strategies in the cybersecurity context and establishes a reproducible framework for evaluating generative models on structured, highly imbalanced network traffic data.

9. Future Work

Several directions may be explored to extend the findings in this study. First, future work could incorporate complementary measures of synthetic data quality, such as distributional similarity metrics and drift-based fidelity analyses, to supplement the downstream classifier evaluation used in this work. Such metrics may provide additional insight into how well generative models preserve minority-class structure beyond what classification performance alone can provide.

Second, evaluating the proposed framework across different augmentation configurations and training schedules may help clarify how sensitive generative augmentation is to the volume of synthetic data and training duration in highly imbalanced cybersecurity settings. This could support more adaptive augmentation strategies tailored to specific attack types and sparsity regimes.

Third, this study focused on a standard GAN architecture to enable controlled comparison under extreme data sparsity. Future work will extend this analysis to advanced tabular generative models, such as the Conditional Tabular GAN (CTGAN) and Tabular Variational Autoencoder (TVAE), to evaluate their effectiveness under low-sample conditions. Such comparisons will help determine whether architecture-specific mechanisms for modeling heterogeneous tabular features offer benefits when sufficient data density is unavailable.

Fourth, GAN-augmented data should be evaluated with deep learning-based intrusion detection models. While this study focuses on classical machine learning classifiers to provide an interpretable and controlled assessment of synthetic data quality, deep neural networks offer additional representation-learning capabilities that may interact differently with GAN-generated samples. Evaluating architectures such as feedforward neural networks, recurrent models, or transformer-based approaches would provide further insight into how generative augmentation influences the learning of feature representations under extreme class imbalance. Such analysis would complement the current findings by clarifying the role of GAN-based augmentation in modern deep learning intrusion detection pipelines.

Fifth, extending the evaluation to additional cybersecurity datasets with different traffic characteristics and labeling schemes would be valuable for assessing generalizability and identifying dataset-specific behaviors not captured in UWF-ZeekData22. In particular, datasets with varying levels of class sparsity would enable a systematic investigation of how generative model performance scales across different imbalance regimes.

Finally, incorporating model interpretability techniques may help analyze how classifiers respond to real versus synthetic samples, supporting greater transparency and trust in GAN-augmented intrusion detection systems. Understanding how synthetic samples influence decision boundaries, particularly for linear and distance-based classifiers, remains an important direction for improving the reliability of generative augmentation in operational settings.

Author Contributions

Conceptualization, A.D., S.S.B., D.M. and S.C.B.; methodology, A.D. and S.S.B.; software, A.D.; validation, S.S.B., D.M. and S.C.B.; formal analysis, A.D. and S.S.B.; investigation, A.D.; resources, S.S.B., D.M. and S.C.B.; data curation, A.D. and D.M.; writing—original draft preparation, A.D.; writing—review and editing, S.S.B., D.M. and S.C.B.; visualization, A.D.; supervision, S.S.B., D.M. and S.C.B.; project administration, S.S.B., D.M. and S.C.B.; funding acquisition, S.S.B., D.M. and S.C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets are available at https://datasets.uwf.edu/ (accessed on 8 August 2025).

Acknowledgments

This research was partially supported by the Askew Institute at the University of West Florida.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IDS	Intrusion Detection System
ATT&CK	Adversarial Tactics, Techniques, and Common Knowledge
GAN	Generative Adversarial Network
SMOTE	Synthetic Minority Oversampling Technique
D	Discriminator
G	Generator
ReLU	Rectified Linear Unit
LR	Logistic Regression
SVM	Support Vector Machine
KNN	k-Nearest Neighbors
RF	Random Forest
t-SNE	t-Distributed Stochastic Neighbor Embedding
TN	True Negative
FN	False Negative
FP	False Positive
TP	True Positive
cGAN	Conditional GAN
WGAN	Wasserstein GAN
WGAN-GP	Wasserstein GAN with Gradient Penalty
XAI	Explainable AI

References

1. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661.
2. Wang, A.X.; Chukova, S.S.; Simpson, C.R.; Nguyen, B.P. Challenges and Opportunities of Generative Models on Tabular Data. Appl. Soft Comput. 2024, 166, 112223.
3. Buczak, A.L.; Guven, E. A Survey of Data Mining and Machine Learning Methods for Cyber Security Intrusion Detection. IEEE Commun. Surv. Tutor. 2016, 18, 1153–1176.
4. Dunmore, A.; Jang-Jaccard, J.; Sabrina, F.; Kwak, J. A Comprehensive Survey of Generative Adversarial Networks (GANs) in Cybersecurity Intrusion Detection. IEEE Access 2023, 11, 76071–76094.
5. Saeed, U.; Jan, S.U.; Ahmad, J.; Shah, S.A.; Alshehri, M.S.; Ghadi, Y.Y.; Pitropakis, N.; Buchanan, W.J. Generative Adversarial Networks-Enabled Anomaly Detection Systems: A Survey. Expert Syst. Appl. 2025, 296, 128978.
6. Sommer, R.; Paxson, V. Outside the Closed World: On Using Machine Learning for Network Intrusion Detection. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, USA, 16–19 May 2010; pp. 305–316.
7. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Learning from Imbalanced Data Sets; Springer: Cham, Switzerland, 2018.
8. Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432.
9. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498.
10. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative Adversarial Networks: An Overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
11. Mescheder, L.; Geiger, A.; Nowozin, S. Which Training Methods for GANs Do Actually Converge? arXiv 2018, arXiv:1801.04406.
12. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784.
13. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875.
14. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A. Improved Training of Wasserstein GANs. arXiv 2017, arXiv:1704.00028.
15. Bagui, S.S.; Mink, D.; Bagui, S.C.; Ghosh, T.; Plenkers, R.; McElroy, T.; Dulaney, S.; Shabanali, S. Introducing UWF-ZeekData22: A Comprehensive Network Traffic Dataset Based on the MITRE ATT&CK Framework. Data 2023, 8, 18.
16. UWF Dataset Repository. Available online: https://datasets.uwf.edu (accessed on 8 August 2025).
17. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular Data Using Conditional GAN. arXiv 2019, arXiv:1907.00503.
18. Strom, B.E.; Applebaum, A.; Miller, D.P.; Nickels, K.C.; Pennington, A.G.; Thomas, C.B. MITRE ATT&CK®: Design and Philosophy. MITRE Corporation, 2020. Available online: https://attack.mitre.org/docs/ATTACK_Design_and_Philosophy_March_2020.pdf (accessed on 8 August 2025).
19. Odena, A.; Olah, C.; Shlens, J. Conditional Image Synthesis with Auxiliary Classifier GANs. In Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017.
20. Yang, Y.; Liu, X.; Wang, D.; Sui, Q.; Yang, C.; Li, H.; Li, Y.; Luan, T. A CE-GAN Based Approach to Address Data Imbalance in Network Intrusion Detection Systems. Sci. Rep. 2025, 15, 7916.
21. Tian, W.; Shen, Y.; Guo, N.; Yuan, J.; Yang, Y. VAE-WACGAN: An Improved Data Augmentation Method Based on VAEGAN for Intrusion Detection. Sensors 2024, 24, 6035.
22. Yoon, J.; Jordon, J.; van der Schaar, M. PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. Available online: https://openreview.net/forum?id=S1zk9iRqF7 (accessed on 2 April 2026).
23. Menssouri, S.; Amhoud, E.M. A Conditional Tabular GAN-Enhanced Intrusion Detection System for Rare Attacks in IoT Networks. In Proceedings of the IEEE International Conference on Communications Workshops (ICC Workshops), Montreal, QC, Canada, 8–12 June 2025; pp. 1918–1923.
24. Herurkar, D.; Ali, A.; Dengel, A. Evaluating Generative Models for Tabular Data: Novel Metrics and Benchmarking. arXiv 2025, arXiv:2504.20900.
25. Borisov, V.; Leemann, T.; Seßler, K.; Haug, J.; Pawelczyk, M.; Kasneci, G. Deep Neural Networks and Tabular Data: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 7499–7519.
26. Shone, N.; Ngoc, T.N.; Phai, V.D.; Shi, Q. A Deep Learning Approach to Network Intrusion Detection. IEEE Trans. Emerg. Top. Comput. Intell. 2018, 2, 41–50.
27. Rao, Y.N.; Suresh Babu, K. An Imbalanced Generative Adversarial Network-Based Approach for Network Intrusion Detection in an Imbalanced Dataset. Sensors 2023, 23, 550.
28. Randhawa, R.H.; Aslam, N.; Alauthman, M.; Rafiq, H. Evasion Generative Adversarial Network for Low Data Regimes. IEEE Trans. Artif. Intell. 2023, 4, 1076–1088.
29. Kim, J.; Kim, J.; Thi Thu, H.L.; Kim, H. Long Short Term Memory Recurrent Neural Network Classifier for Intrusion Detection. In Proceedings of the International Conference on Platform Technology and Service (PlatCon), Jeju, Republic of Korea, 15–17 February 2016; pp. 1–5.
30. Debelie, A.; Bagui, S.S.; Mink, D.; Bagui, S.C. Class-Specific GAN-Based Minority Data Augmentation for Cyberattack Detection Using the UWF-ZeekData22 Dataset. Technologies 2026, 14, 117.
31. Paxson, V. Bro: A System for Detecting Network Intruders in Real-Time. Comput. Netw. 1999, 31, 2435–2463.
32. MITRE ATT&CK. Reconnaissance (Tactic TA0043 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0043/ (accessed on 8 August 2025).
33. MITRE ATT&CK. Discovery (Tactic TA0007 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0007/ (accessed on 8 August 2025).
34. MITRE ATT&CK. Credential Access (Tactic TA0006 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0006/ (accessed on 8 August 2025).
35. MITRE ATT&CK. Privilege Escalation (Tactic TA0004 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0004/ (accessed on 8 August 2025).
36. MITRE ATT&CK. Exfiltration (Tactic TA0010 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0010/ (accessed on 8 August 2025).
37. MITRE ATT&CK. Lateral Movement (Tactic TA0008 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0008/ (accessed on 8 August 2025).
38. MITRE ATT&CK. Resource Development (Tactic TA0042 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0042/ (accessed on 8 August 2025).
39. MITRE ATT&CK. Initial Access (Tactic TA0001 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0001/ (accessed on 8 August 2025).
40. MITRE ATT&CK. Persistence (Tactic TA0003 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0003/ (accessed on 8 August 2025).
41. MITRE ATT&CK. Defense Evasion (Tactic TA0005 - Enterprise). Available online: https://attack.mitre.org/tactics/TA0005/ (accessed on 8 August 2025).
42. Japkowicz, N.; Stephen, S. The Class Imbalance Problem: A Systematic Study. Intell. Data Anal. 2002, 6, 429–449.
43. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009.
44. Bagui, S.; Mink, D.; Bagui, S.; Ghosh, T.; McElroy, T.; Paredes, E.; Khasnavis, N.; Plenkers, R. Detecting Reconnaissance and Discovery Tactics from the MITRE ATT&CK Framework in Zeek Conn Logs Using Spark’s Machine Learning in the Big Data Framework. Sensors 2022, 22, 7999.
45. Raschka, S. Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning. arXiv 2018, arXiv:1811.12808.
46. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
47. Han, J.; Kamber, M.; Pei, J. Data Mining Concepts and Techniques, 3rd ed.; Elsevier Inc.: Waltham, MA, USA, 2012.
48. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
49. Debelie, A.; Bagui, S.S.; Bagui, S.C.; Mink, D. A Systematic Ablation Study of GAN-Based Minority Augmentation for Intrusion Detection on UWF-ZeekData22. Electronics 2026, 15, 1291.

Figure 1. (a) Distribution of malicious traffic across MITRE ATT&CK tactics in the UWF-ZeekData22 dataset. (b) Distribution of rare MITRE ATT&CK tactics in UWF-ZeekData22, excluding Reconnaissance and Discovery.

Figure 2. Structure of the Conditional Generative Adversarial Network (cGAN) used for minority-class cyberattack data generation. The conditional input (c) represents the attack class label, and (z) denotes a latent noise vector. The generator (G) produces synthetic minority samples (G(z,c)), which are evaluated by the discriminator (D) against real samples from the UWF-ZeekData22 dataset. Arrows indicate the flow of data through the model.

Table 1. Summary of GAN-based approaches in tabular data and cybersecurity.

Study	GAN Variant	Domain	Key Objectives	Dataset(s)	Contribution
Goodfellow et al. [1]	GAN	General	Generative modeling	-	Introduced adversarial training
Mirza & Osindero [12]	cGAN	General	Conditional generation	-	Label-guided generation
Arjovsky et al. [13]	WGAN	General	Stable training	-	Wasserstein loss improves convergence
Gulrajani et al. [14]	WGAN-GP	General	Stability	-	Gradient penalty stabilizes training
Xu et al. [17]	CTGAN	Tabular	Tabular data synthesis	Multiple	Handles categorical imbalance
Yang et al. [20]	CE-GAN	Cybersecurity	Stability + diversity	NSL-KDD, UNSW-NB15	Encoder–decoder + Conditional GAN
Tian et al. [21]	VAE-WACGAN	Cybersecurity	Stability + realism	CIC-IDS2017, UNSW-NB15	Hybrid VAE + GAN improves quality
Menssouri & Amhoud [23]	CTGAN + SMOTEENN	IoT Security	Class imbalance	IoT datasets	Hybrid resampling improves minority detection
Rao & Suresh Babu [27]	IGAN	Cybersecurity	Class imbalance	IDS datasets	Improves minority detection
Randhawa et al. [28]	EVAGAN	Cybersecurity	Low data + evasion	ISCX, CIC datasets	Evasion-aware discriminator, no retraining needed

Table 2. Summary of GAN training parameters across all models.

Parameter	Vanilla GAN	cGAN	WGAN	WGAN-GP
Epochs	2000	2000	2000	2000
Batch size	64	64	64	64
Noise dimension	32	32	32	32
Noise distribution	Uniform	Uniform	Uniform	Uniform
Conditioning	None	One-hot labels	None	None
Hidden layers (G/D)	2/2	2/2	2/2	2/2
Hidden units	128	128	128	128
Activation (hidden)	LeakyReLU (0.2)	LeakyReLU (0.2)	LeakyReLU (0.2)	LeakyReLU (0.2)
Generator normalization	BatchNorm	BatchNorm	BatchNorm	BatchNorm
Generator output	Tanh	Tanh	Tanh	Tanh
Discriminator/critic output	Sigmoid	Logit	Scalar	Scalar
Loss function	BCE	BCEWithLogits	Wasserstein	Wasserstein + GP
Optimizer	Adam	Adam	RMSprop	Adam
Learning rate	0.0002	0.0002	5 × 10⁻⁵	0.0002
Adam betas	(0.9, 0.999)	(0.5, 0.999)	-	(0.5, 0.9)
Critic steps	1	1	5	5
Regularization	None	Dropout (D)	Weight clipping (0.01)	Gradient penalty (λ = 10)
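
For reproducibility, the settings in Table 2 can be written as a shared base configuration plus per-variant overrides. The sketch below is one possible encoding in plain Python. The key names and the `build_config` helper are illustrative and are not taken from the authors' code; only the values come from Table 2.

```python
# Settings that Table 2 lists as identical for all four variants, plus
# defaults that individual variants override below.
BASE = {
    "epochs": 2000,
    "batch_size": 64,
    "noise_dim": 32,
    "noise_distribution": "uniform",
    "hidden_layers": 2,          # per network (generator and discriminator)
    "hidden_units": 128,
    "leaky_relu_slope": 0.2,
    "generator_norm": "batchnorm",
    "generator_output": "tanh",
    "conditioning": None,
    "optimizer": "adam",
    "lr": 2e-4,
    "critic_steps": 1,
    "regularization": None,
}

# Settings that differ between variants in Table 2.
OVERRIDES = {
    "vanilla_gan": {"d_output": "sigmoid", "loss": "bce", "betas": (0.9, 0.999)},
    "cgan": {"conditioning": "one_hot_labels", "d_output": "logit",
             "loss": "bce_with_logits", "betas": (0.5, 0.999),
             "regularization": "dropout_in_discriminator"},
    "wgan": {"d_output": "scalar", "loss": "wasserstein", "optimizer": "rmsprop",
             "lr": 5e-5, "betas": None, "critic_steps": 5,
             "regularization": "weight_clipping", "clip_value": 0.01},
    "wgan_gp": {"d_output": "scalar", "loss": "wasserstein_gp",
                "betas": (0.5, 0.9), "critic_steps": 5,
                "regularization": "gradient_penalty", "gp_lambda": 10},
}


def build_config(variant: str) -> dict:
    """Merge the shared settings with the overrides for one GAN variant."""
    if variant not in OVERRIDES:
        raise KeyError(f"unknown variant: {variant}")
    return {**BASE, **OVERRIDES[variant]}


for name in OVERRIDES:
    cfg = build_config(name)
    print(name, cfg["optimizer"], cfg["lr"], cfg["critic_steps"])
```

Keeping the shared values in one place makes it easier to confirm that the variants differ only in the objective-specific settings.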

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Discovery	GAN	$[\begin{matrix} 32,007 & 79 \\ 90 & 199,910 \end{matrix}]$	$[\begin{matrix} 29,894 & 2192 \\ 4 & 199,996 \end{matrix}]$	$[\begin{matrix} 31,439 & 647 \\ 197 & 199,803 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 32,038 & 48 \\ 78 & 199,922 \end{matrix}]$	$[\begin{matrix} 29,972 & 2114 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 31,445 & 641 \\ 197 & 199,803 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 32,002 & 84 \\ 138 & 199,862 \end{matrix}]$	$[\begin{matrix} 30,004 & 2082 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 31,446 & 640 \\ 197 & 199,803 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 31,971 & 115 \\ 100 & 199,900 \end{matrix}]$	$[\begin{matrix} 25,222 & 6864 \\ 1026 & 198,974 \end{matrix}]$	$[\begin{matrix} 31,337 & 749 \\ 279 & 199,721 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 32,086 & 0 \\ 0 & 200,000 \end{matrix}]$
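
The matrices above can be reduced to scalar metrics for comparison. The following sketch assumes each matrix is laid out as [[TP, FN], [FP, TN]] with the attack class in the first row. That layout is inferred from the row totals (32,086 attack and 200,000 benign records) and should be checked against the table definition. `metrics_from_matrix` is an illustrative helper.

```python
def metrics_from_matrix(matrix):
    """Recall, false-positive rate, and precision from [[TP, FN], [FP, TN]]."""
    (tp, fn), (fp, tn) = matrix
    return {
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Discovery tactic, SVM: Vanilla GAN versus WGAN-GP augmentation.
svm_gan = metrics_from_matrix([[29894, 2192], [4, 199996]])
svm_wgan_gp = metrics_from_matrix([[25222, 6864], [1026, 198974]])

print({k: round(v, 4) for k, v in svm_gan.items()})
print({k: round(v, 4) for k, v in svm_wgan_gp.items()})
```

Under this layout, the WGAN-GP row for SVM has both lower recall and a higher false-positive rate than the Vanilla GAN row.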

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Credential Access	GAN	$[\begin{matrix} 26,657 & 3374 \\ 2545 & 197,455 \end{matrix}]$	$[\begin{matrix} 26,968 & 3063 \\ 1923 & 198,077 \end{matrix}]$	$[\begin{matrix} 29,418 & 613 \\ 308 & 199,693 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 29,941 & 90 \\ 158 & 199,842 \end{matrix}]$	$[\begin{matrix} 29,989 & 42 \\ 5 & 199,995 \end{matrix}]$	$[\begin{matrix} 30,007 & 24 \\ 4 & 199,996 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 29,941 & 90 \\ 158 & 199,842 \end{matrix}]$	$[\begin{matrix} 29,989 & 42 \\ 5 & 199,995 \end{matrix}]$	$[\begin{matrix} 30,007 & 24 \\ 4 & 199,996 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 25,655 & 4376 \\ 3155 & 196,845 \end{matrix}]$	$[\begin{matrix} 26,612 & 3419 \\ 609 & 199,391 \end{matrix}]$	$[\begin{matrix} 29,284 & 747 \\ 152 & 199,848 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,031 & 0 \\ 0 & 200,000 \end{matrix}]$

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Privilege Escalation	GAN	$[\begin{matrix} 29,986 & 27 \\ 15 & 199,985 \end{matrix}]$	$[\begin{matrix} 29,877 & 136 \\ 15 & 199,985 \end{matrix}]$	$[\begin{matrix} 30,008 & 5 \\ 13 & 199,987 \end{matrix}]$	$[\begin{matrix} 30,013 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,013 & 0 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 29,883 & 130 \\ 71 & 199,929 \end{matrix}]$	$[\begin{matrix} 28,452 & 1561 \\ 566 & 199,434 \end{matrix}]$	$[\begin{matrix} 29,909 & 104 \\ 14 & 199,986 \end{matrix}]$	$[\begin{matrix} 30,013 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,013 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 29,883 & 130 \\ 71 & 199,929 \end{matrix}]$	$[\begin{matrix} 28,452 & 1561 \\ 566 & 199,434 \end{matrix}]$	$[\begin{matrix} 29,909 & 104 \\ 14 & 199,986 \end{matrix}]$	$[\begin{matrix} 30,012 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,013 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 24,251 & 5762 \\ 3533 & 196,467 \end{matrix}]$	$[\begin{matrix} 21,347 & 8666 \\ 927 & 199,073 \end{matrix}]$	$[\begin{matrix} 28,485 & 1528 \\ 467 & 199,533 \end{matrix}]$	$[\begin{matrix} 30,012 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,013 & 0 \\ 0 & 200,000 \end{matrix}]$

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Exfiltration	GAN	$[\begin{matrix} 29,892 & 115 \\ 36 & 199,964 \end{matrix}]$	$[\begin{matrix} 28,733 & 1274 \\ 520 & 199,480 \end{matrix}]$	$[\begin{matrix} 29,838 & 169 \\ 17 & 199,983 \end{matrix}]$	$[\begin{matrix} 30,007 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,007 & 0 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 29,792 & 215 \\ 16 & 199,984 \end{matrix}]$	$[\begin{matrix} 29,403 & 604 \\ 126 & 199,874 \end{matrix}]$	$[\begin{matrix} 29,937 & 70 \\ 11 & 199,989 \end{matrix}]$	$[\begin{matrix} 30,007 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,007 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 29,065 & 942 \\ 6 & 199,994 \end{matrix}]$	$[\begin{matrix} 23,906 & 6101 \\ 15 & 199,985 \end{matrix}]$	$[\begin{matrix} 29,457 & 550 \\ 186 & 199,814 \end{matrix}]$	$[\begin{matrix} 30,007 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,007 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 30,000 & 7 \\ 126 & 199,874 \end{matrix}]$	$[\begin{matrix} 11,496 & 18,511 \\ 456 & 199,544 \end{matrix}]$	$[\begin{matrix} 29,297 & 710 \\ 489 & 199,511 \end{matrix}]$	$[\begin{matrix} 30,006 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,007 & 0 \\ 0 & 200,000 \end{matrix}]$

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Lateral Movement	GAN	$[\begin{matrix} 29,651 & 353 \\ 649 & 199,351 \end{matrix}]$	$[\begin{matrix} 28,404 & 1600 \\ 121 & 199,879 \end{matrix}]$	$[\begin{matrix} 29,958 & 46 \\ 47 & 199,953 \end{matrix}]$	$[\begin{matrix} 30,004 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,002 & 2 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 29,998 & 6 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 29,998 & 6 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,002 & 2 \\ 4 & 199,996 \end{matrix}]$	$[\begin{matrix} 30,004 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,002 & 2 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 29,998 & 6 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 29,998 & 6 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,002 & 2 \\ 4 & 199,996 \end{matrix}]$	$[\begin{matrix} 30,004 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,002 & 2 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 29,701 & 303 \\ 1059 & 198,941 \end{matrix}]$	$[\begin{matrix} 26,325 & 3679 \\ 1755 & 198,245 \end{matrix}]$	$[\begin{matrix} 29,979 & 25 \\ 45 & 199,955 \end{matrix}]$	$[\begin{matrix} 30,004 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,004 & 0 \\ 0 & 200,000 \end{matrix}]$

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Resource Development	GAN	$[\begin{matrix} 29,975 & 28 \\ 89 & 199,911 \end{matrix}]$	$[\begin{matrix} 29,498 & 505 \\ 37 & 199,963 \end{matrix}]$	$[\begin{matrix} 29,983 & 20 \\ 6 & 199,994 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 30,002 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 29,989 & 14 \\ 6 & 199,994 \end{matrix}]$	$[\begin{matrix} 30,002 & 1 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 30,002 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 29,989 & 14 \\ 6 & 199,994 \end{matrix}]$	$[\begin{matrix} 30,002 & 1 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 25,514 & 4489 \\ 1572 & 195,428 \end{matrix}]$	$[\begin{matrix} 21,431 & 8572 \\ 293 & 199,707 \end{matrix}]$	$[\begin{matrix} 29,954 & 49 \\ 82 & 199,918 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,003 & 0 \\ 0 & 200,000 \end{matrix}]$

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Defense Evasion	GAN	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 3 & 199,997 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 8 & 199,992 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 80 & 199,992 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 29,999 & 2 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 3 & 199,997 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Initial Access	GAN	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 3 & 199,997 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 7 & 199,993 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 7 & 199,993 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 3 & 199,997 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$

		Logistic Regression	SVM	KNN	Decision Tree	Random Forest
Persistence	GAN	$[\begin{matrix} 29,999 & 2 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 3 & 199,997 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	CGAN	$[\begin{matrix} 29,999 & 2 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 7 & 199,993 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	WGAN	$[\begin{matrix} 29,999 & 2 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 2 & 199,998 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 7 & 199,993 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
	WGAN-GP	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 3 & 199,997 \end{matrix}]$	$[\begin{matrix} 30,001 & 0 \\ 0 & 200,000 \end{matrix}]$	$[\begin{matrix} 30,000 & 1 \\ 0 & 200,000 \end{matrix}]$
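The entries above can be turned into the derived metrics used in the evaluation with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function name `binary_metrics` is hypothetical, and it assumes each matrix lists true classes in rows (augmented attack class first, benign second) and predicted classes in columns. That layout is suggested by the rows summing to roughly 30,000 and 200,000 for every classifier, but it is not stated in this part of the text.

```python
def binary_metrics(cm):
    """Derive metrics from a 2x2 confusion matrix.

    ASSUMPTION: cm = [[TP, FN], [FP, TN]], i.e. rows are true classes
    (attack first, benign second) and columns are predicted classes.
    """
    (tp, fn), (fp, tn) = cm
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"recall": recall, "precision": precision,
            "f1": f1, "accuracy": accuracy}


# Resource Development, WGAN-GP, SVM entry from the table above.
metrics = binary_metrics([[21431, 8572], [293, 199707]])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under the assumed layout, this entry shows the pattern in which precision on the attack class stays high while recall drops, because most errors are attack records predicted as benign.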

Table 12. Baseline training time per variant.

Variant	Time (min)
GAN	19
cGAN	17
WGAN	20
WGAN-GP	29


