1. Introduction
The Internet of Things (IoT) has fundamentally transformed modern computing infrastructure, enabling unprecedented connectivity across billions of devices spanning industrial control systems, healthcare monitoring equipment, smart home appliances, and critical infrastructure components. This massive expansion of networked devices creates proportionally expanded attack surfaces that adversaries exploit through increasingly sophisticated intrusion techniques. The challenge of detecting such intrusions is compounded by the inherent resource constraints of IoT devices, which limit the computational complexity of deployable security solutions. As organizations increasingly rely on IoT ecosystems for mission-critical operations, the development of effective and efficient intrusion detection mechanisms has become a paramount concern for cybersecurity practitioners and researchers alike.
Intrusion detection systems (IDSs) designed for IoT environments must simultaneously address multiple competing constraints that create a complex optimization landscape. Detection accuracy remains paramount as missed attacks can compromise critical systems, expose sensitive data, and cause significant financial and reputational damage to organizations. Computational efficiency determines whether detection algorithms can be executed within the limited processing budgets of edge devices, which often operate with constrained CPU capabilities, limited memory, and minimal storage capacity. Energy consumption affects battery life and operational costs for distributed sensor networks, where devices may operate for extended periods without access to continuous power sources. Latency requirements constrain the algorithmic complexity permissible for real-time threat response, as security systems must identify and respond to attacks before significant damage occurs. The interplay between these constraints creates a multidimensional optimization problem that cannot be solved by focusing on any single metric in isolation.
Class imbalance represents a particularly challenging aspect of IoT intrusion detection that has received substantial attention in the machine learning literature. Normal network traffic vastly outnumbers attack traffic in realistic operational scenarios, with imbalance ratios frequently exceeding 1000:1 in production environments and sometimes reaching ratios of 8000:1 or higher in severely imbalanced datasets. Machine learning classifiers trained on such imbalanced data exhibit strong bias toward the majority class, achieving deceptively high overall accuracy while failing to detect the minority attack instances that represent actual security threats. This phenomenon is particularly problematic in security contexts, where the cost of missing a genuine attack far exceeds the cost of false positives. Traditional approaches to addressing class imbalance, such as random oversampling or undersampling, often prove insufficient for complex network traffic distributions and can introduce artifacts that compromise classifier generalization.
Generative Adversarial Networks (GANs) offer a principled approach to addressing class imbalance through synthetic data augmentation, demonstrating remarkable success across diverse application domains. By learning the underlying distribution of minority class samples through adversarial training of generator and discriminator networks, GANs can generate realistic synthetic examples that expand the training set and improve classifier sensitivity to underrepresented attack types. Unlike simple oversampling techniques that merely duplicate existing samples, GANs create novel instances that capture the statistical properties of the original data while introducing meaningful variation. However, the GAN landscape encompasses numerous architectural variants with different training objectives, network structures, and computational requirements, making architecture selection a non-trivial decision for practitioners. The choice of GAN architecture affects not only the quality of generated samples but also the computational resources required for training and inference, which is particularly relevant for resource-constrained IoT deployments.
This paper presents a comprehensive benchmark evaluation of five GAN architectures for energy-aware IoT intrusion detection, addressing a critical gap in the existing literature. Our work makes five primary contributions to the field:
We provide the first systematic comparison of GAN architectural variants specifically designed for IoT intrusion detection, evaluating Standard GAN, Progressive GAN (PGAN), Conditional GAN (cGAN), Graph-based GAN (GraphGAN), and Wasserstein GAN with Gradient Penalty (WGAN-GP) under consistent experimental conditions.
We propose an optimized WGAN-GP architecture incorporating diversity loss, feature matching, and noise injection that achieves state-of-the-art performance with 99.99% classification accuracy, matching traditional classifiers while dramatically improving minority class detection.
We develop an energy-aware evaluation framework with novel metrics, including Accuracy-per-Joule (APJ) and F1-per-Joule (F1PJ), that enable principled architecture selection for energy-constrained deployments.
We demonstrate that generation quality and classification performance are complementary rather than competing objectives when GANs are properly optimized.
We release all code and experimental artifacts to support reproducibility and enable future research in this important area.
The remainder of this paper is organized as follows.
Section 2 reviews related work on machine learning approaches for intrusion detection, including classical methods, deep learning architectures, reinforcement learning, and GAN-based data augmentation techniques.
Section 3 presents the system description and problem formulation, including the end-to-end pipeline architecture, GAN-based synthetic data generation, power monitoring, and mathematical formulation of energy-aware metrics.
Section 4 details the methodology encompassing dataset description, proposed GAN architectures, feature selection and preprocessing, hyperparameter optimization, and the GAN-based intrusion detection algorithm.
Section 5 presents extensive experimental results covering training dynamics, generation quality analysis, classification performance, and computational efficiency evaluation.
Section 6 discusses the relationship between generation quality and classification performance, compares classical and GAN-augmented approaches, and provides architecture selection guidelines.
Finally, Section 7 concludes this paper and outlines future research directions.
5. Experimental Results
This section presents the experimental results of our benchmark evaluation. We analyze training dynamics and convergence behavior across all GAN architectures, synthetic data generation quality, classification and detection performance on the BoT-IoT dataset, and computational efficiency, including our novel energy-aware metrics. We also compare GAN-augmented approaches with traditional machine learning baselines and show that our optimized WGAN-GP architecture performs best across multiple evaluation dimensions.
5.1. Evaluation Metrics
Table 5 describes the comprehensive evaluation metrics employed in our benchmark, including energy-aware metrics.
5.2. Experimental Setup
All experiments were conducted on a workstation equipped with an NVIDIA RTX 3090 GPU (24 GB VRAM), an AMD Ryzen 9 5900X CPU (12 cores), and 64 GB RAM. The software environment included Python 3.9, PyTorch 1.12, and CUDA 11.6. Each experiment was repeated five times with different random seeds, and we report the mean and standard deviation of all metrics. All GAN architectures were trained and evaluated in this same hardware and software environment so that comparisons are fair and reproducible.
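The reporting convention above (mean and standard deviation over repeated seeded runs) can be sketched as follows. This is a minimal illustration only: the per-seed scores are hypothetical placeholders, not measured results, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def summarize_runs(values):
    """Return (mean, sample standard deviation) of one metric across repeated runs."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical macro-F1 scores from five seeds (illustrative values only).
macro_f1_by_seed = [0.98, 0.99, 0.99, 1.00, 0.99]

mean_f1, std_f1 = summarize_runs(macro_f1_by_seed)
print(f"macro-F1: {mean_f1:.3f} +/- {std_f1:.3f}")
```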
5.3. Assumptions
Our investigation considered the following assumptions, which establish the operational context and constraints for GAN-based intrusion detection in IoT environments:
Network traffic data is collected at a central monitoring point with sufficient visibility into IoT device communications, enabling comprehensive flow-level feature extraction.
The IoT network topology remains stable during data collection periods, with devices operating at predefined network locations. These network configurations are assumed to be accurately known, which enables proper traffic correlation.
The communication network infrastructure is operational and capable of transmitting network traffic data from collection points to the processing system.
Attack patterns present in the BoT-IoT dataset are representative of real-world IoT intrusion scenarios, including DDoS, DoS, reconnaissance, and data theft attacks.
The severe class imbalance (>8000:1 ratio) in the dataset accurately reflects realistic operational scenarios where normal traffic vastly outnumbers attack traffic.
Computational resources for GAN training and inference are available at the network edge or centralized processing infrastructure, with power consumption being a critical deployment constraint.
Environmental conditions and network interference are assumed to be within acceptable limits for consistent data collection and model performance evaluation.
The GAN-generated synthetic samples, when combined with real minority class samples, create a training distribution that improves classifier decision boundaries without introducing artifacts that compromise generalization.
5.4. Training Dynamics and Generation Quality Analysis
Understanding the training dynamics of generative adversarial networks is crucial for diagnosing potential issues such as mode collapse, vanishing gradients, and training instability. These phenomena directly impact the quality of generated synthetic samples and, consequently, the effectiveness of downstream classification models.
Figure 4 presents the training and validation loss curves for all five GAN architectures across training epochs, highlighting the key convergence patterns: optimized WGAN-GP remained stable through 300 epochs (with Wasserstein distance stabilizing around −1.22), Standard GAN showed instability at around epochs 85–90, and PGAN/cGAN exhibited smoother monotonic convergence.
The Standard GAN exhibited stable training through approximately epoch 75, after which generator loss increased sharply while validation loss diverged, indicating the onset of training instability and potential mode collapse. This late-stage instability is characteristic of standard GAN training and stems from the fundamental challenge of balancing generator and discriminator learning rates. When the discriminator becomes too powerful relative to the generator, gradients become uninformative, causing the generator to receive poor learning signals. Conversely, when the generator temporarily outpaces the discriminator, it may exploit discriminator weaknesses rather than learning meaningful data representations. This oscillatory behavior motivates the development of stabilization techniques that form the foundation of our optimized WGAN-GP architecture.
Progressive GAN and Conditional GAN exhibited smooth convergence throughout training, with monotonically decreasing loss curves suggesting stable learning without significant mode collapse. The stability of Progressive GAN stems from its incremental capacity growth strategy, which allows the network to first learn coarse data patterns before progressively refining to capture finer details. This curriculum-like learning approach reduces the complexity of the optimization landscape at each training stage. Conditional GAN benefits from the additional structural information provided by class labels, which constrains the generator’s output space and provides more informative gradients throughout training. However, neither architecture achieved the combination of stability and generation quality demonstrated by our optimized WGAN-GP.
GraphGAN showed high initial loss values that decreased steadily, reflecting the additional complexity of learning topology-aware representations. The graph attention mechanism must simultaneously learn both node-level features and relational patterns between network flows, creating a more challenging optimization problem. While this architecture eventually converges, the extended initial learning phase and higher computational overhead limit its practical applicability for resource-constrained IoT deployments.
WGAN-GP displayed fundamentally different training dynamics due to the Wasserstein distance objective, with stable convergence and decreasing diversity and feature matching losses. Unlike the Jensen–Shannon divergence used in standard GANs, which saturates and yields vanishing gradients when the real and generated distributions have non-overlapping supports, the Wasserstein distance provided meaningful gradients throughout training regardless of distribution overlap. This theoretical advantage translated to practical benefits: the critic network learned a smooth function that provided consistent learning signals to the generator, enabling stable optimization even for complex, high-dimensional data distributions.
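The contrast between the two objectives can be illustrated with a toy case that is not taken from our experiments: two point masses, one at 0 and one at a hypothetical offset theta. In this minimal sketch, the Jensen–Shannon divergence stays at log 2 for every nonzero offset and so carries no information about how far apart the distributions are, whereas the one-dimensional Wasserstein-1 distance grows with the offset. The helper names are illustrative.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions given as {value: probability}."""
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = 0.5 * (px + qx)  # mixture distribution
        if px > 0:
            total += 0.5 * px * math.log(px / mx)
        if qx > 0:
            total += 0.5 * qx * math.log(qx / mx)
    return total

def wasserstein_1d(p, q):
    """Wasserstein-1 distance in one dimension: integral of |CDF_p - CDF_q|."""
    points = sorted(set(p) | set(q))
    cdf_gap, distance = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap += p.get(left, 0.0) - q.get(left, 0.0)
        distance += abs(cdf_gap) * (right - left)
    return distance

real = {0.0: 1.0}
for theta in (0.5, 1.0, 5.0):  # hypothetical offsets of the generated point mass
    fake = {theta: 1.0}
    print(theta, js_divergence(real, fake), wasserstein_1d(real, fake))
```

Because the Jensen–Shannon value does not change as theta moves, its gradient with respect to theta is zero in this toy setting, while the Wasserstein distance has a constant, nonzero slope.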
Figure 5 presents detailed training curves for the optimized WGAN-GP (best test accuracy), decomposing the objective into Wasserstein distance, diversity loss (decreasing from 1.0 to 0.07), and feature matching loss (decreasing from 0.5 to 0.03), which together indicate stable convergence without oscillation.
The Wasserstein distance stabilization at around epoch 200 indicates that the generator and critic reached a training equilibrium where neither network could substantially improve without corresponding adaptation from the other. This equilibrium state, characterized by a stable Wasserstein distance of approximately −1.22, represents the point at which the generator learned to produce samples that the critic could not reliably distinguish from real data. The diversity loss decreased from 1.0 to 0.07, indicating that the generator learned to utilize the full latent space rather than collapsing to a small number of modes. This 93% reduction in diversity loss directly correlates with the high sample diversity (0.98) observed in our generation quality metrics. Similarly, the feature matching loss decreased from 0.5 to 0.03, demonstrating that the intermediate feature statistics of generated samples closely matched those of real data. All components showed stable convergence without oscillation, confirming that our multi-objective optimization framework successfully balances competing training signals.
The quality of synthetic data fundamentally determines the effectiveness of GAN-based augmentation for classification tasks. Poor-quality samples that deviate significantly from the true data distribution can introduce noise that degrades classifier performance, while samples that lack diversity may provide limited information gain beyond simple oversampling.
Table 6 presents comprehensive generation quality metrics across all architectures, quantifying both fidelity (measured by MSE) and diversity. Note that in the table, the bold text indicates the results of our approach.
Our optimized WGAN-GP achieved dramatically superior generation quality, with an MSE of 0.01, representing a 94% improvement compared to Standard GAN (MSE 0.17) while simultaneously achieving the highest sample diversity score (0.98). This remarkable improvement stems from the synergistic interaction of three diversity-promoting mechanisms incorporated into our architecture. The diversity loss explicitly penalizes the generator for producing similar outputs from different latent codes, encouraging exploration of the full output space. Feature matching aligns the statistical moments of generated and real sample features at intermediate network layers, ensuring that generated samples capture not only the surface-level characteristics but also the deeper structural patterns of the real data distribution. Noise injection layers introduce controlled stochasticity that prevents the generator from learning deterministic mappings, further promoting output diversity.
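To make the first two mechanisms concrete, the following minimal sketch shows one plausible formulation of each. The exact loss definitions are not given in this section, so both functions are assumptions for illustration: the diversity term is written as the ratio of latent distance to output distance (large when different latent codes produce nearly identical outputs), and the feature matching term compares only first moments (per-dimension means) of intermediate features. All names and values are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diversity_loss(z1, z2, g1, g2, eps=1e-8):
    """Assumed form: large when distinct latent codes z1, z2 map to similar outputs g1, g2."""
    return l2_distance(z1, z2) / (l2_distance(g1, g2) + eps)

def feature_matching_loss(real_features, fake_features):
    """Assumed form: squared distance between per-dimension feature means."""
    def column_means(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    mu_real = column_means(real_features)
    mu_fake = column_means(fake_features)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))

z1, z2 = [0.0, 0.0], [1.0, 1.0]  # two different latent codes
collapsed = diversity_loss(z1, z2, [0.5, 0.5], [0.5, 0.5001])  # near-identical outputs
diverse = diversity_loss(z1, z2, [0.1, 0.2], [0.9, 0.7])       # clearly different outputs
print(collapsed > diverse)
```

Under this formulation, minimizing the diversity term pushes the generator away from collapsed outputs, and minimizing the feature matching term pulls the mean feature activations of generated samples toward those of real samples.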
The low standard deviation of MSE (0.01) for WGAN-GP indicates consistent generation quality across samples, whereas other architectures exhibited higher variance (0.15–0.21), suggesting inconsistent quality that includes both good and poor samples. This consistency is crucial for reliable classifier training, as high-variance augmentation can introduce unpredictable noise into the learning process.
Figure 6 compares real feature distributions to those produced by each GAN, visually confirming Table 6: optimized WGAN-GP achieved near-perfect alignment (MSE 0.01) and high diversity (0.98), while Standard GAN exhibited distribution drift (notably in tail regions), consistent with its higher MSE (0.17).
The feature distribution analysis reveals critical differences in how each architecture captured the underlying data distribution. Standard GAN showed noticeable distribution drift, particularly in the distribution tails, where rare attack patterns reside. This tail drift is especially problematic for intrusion detection, as attack samples often occupy these low-density regions of the feature space. The inability of Standard GAN to accurately model tail behavior explains its lower minority class accuracy despite reasonable overall classification performance. Progressive GAN and Conditional GAN exhibited similar patterns of central distribution matching with degraded tail accuracy, reflecting their shared limitation in capturing the full distributional complexity.
In contrast, optimized WGAN-GP achieved near-perfect distribution matching across the entire feature range, including the critical tail regions. The Wasserstein distance objective, which measures the minimum cost of transforming one distribution into another, naturally encourages accurate modeling of the full distribution rather than focusing solely on high-density regions. The diversity loss further ensures that generated samples span the full distribution rather than clustering around modes. This comprehensive distribution coverage directly enables the superior minority class detection achieved by WGAN-GP-augmented classifiers.
A central question in GAN-based data augmentation is whether generation quality and downstream task performance represent competing or complementary objectives. Some prior work has suggested trade-offs between these goals, with high-fidelity generation potentially limiting diversity and vice versa.
Figure 7 presents a scatter plot that definitively addresses this question for our experimental setting.
The scatter plot demonstrates that with proper optimization incorporating diversity loss, feature matching, and noise injection, WGAN-GP achieved both the best generation quality (MSE 0.01, corresponding to 1/MSE = 100 on the x-axis) and the best classification accuracy (99.99%). This result challenges the assumption that practitioners must choose between generation fidelity and task performance, and instead indicates that these objectives are complementary when GANs are properly configured.
The key insight underlying this complementarity is that both objectives fundamentally require accurate modeling of the true data distribution. High-fidelity generation requires the generator to learn the statistical properties of real samples, while effective augmentation for classification requires generated samples to provide meaningful information about the decision boundaries. When diversity-promoting mechanisms ensure that the generator explores the full distribution rather than collapsing to modes, both objectives are simultaneously satisfied: the generator produces diverse, high-fidelity samples that improve classifier sensitivity across the full feature space.
The clustering of other architectures (Standard GAN, PGAN, cGAN, GraphGAN) in the lower-left region of the plot, with both lower generation quality and lower classification accuracy, suggests that their limitations stem from a common cause: failure to adequately capture the full data distribution. This observation motivates the design principle that GAN architectures for classification-oriented augmentation should prioritize diversity alongside fidelity, rather than optimizing for either objective in isolation.
5.5. Classification and Detection Performance Analysis
The ultimate measure of GAN-based augmentation effectiveness is the downstream classification performance on intrusion detection tasks.
Table 7 presents comprehensive classification performance metrics for models trained with augmentation from each GAN architecture, including our novel Accuracy-per-Joule (APJ) metric for energy-aware evaluation. In the table, our approach is indicated in bold text.
Our optimized WGAN-GP achieved state-of-the-art performance, with 99.99% classification accuracy and 0.99 macro-F1 score, significantly outperforming all alternatives, including Standard GAN (99.15%, 0.51 M-F1). The disparity between overall accuracy and macro-F1 for non-WGAN-GP methods reveals a critical limitation: these architectures achieve high accuracy primarily through correct classification of the overwhelming majority class (normal traffic) while struggling with minority class (attack) detection. The macro-F1 scores of 0.49–0.51 for these methods indicate near-random performance on the attack class, rendering them unsuitable for security applications despite their superficially impressive accuracy figures.
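The gap between overall accuracy and macro-F1 under extreme imbalance can be reproduced with a short calculation. The sketch below uses hypothetical confusion-matrix counts, not the counts behind Table 7: a test split with 45 attacks and 360,000 normal flows (an assumed 8000:1 ratio), first for a classifier that labels everything as normal, then for one that catches most attacks but raises many false alarms.

```python
def f1(tp, fp, fn):
    """F1 score for one class from its true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def accuracy_and_macro_f1(tn, fp, fn, tp):
    """Binary confusion matrix with attack as the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    f1_attack = f1(tp, fp, fn)
    f1_normal = f1(tn, fn, fp)  # error roles swap when normal is treated as the positive class
    return accuracy, (f1_attack + f1_normal) / 2

# Hypothetical case 1: every flow is predicted as normal.
acc_all_normal, mf1_all_normal = accuracy_and_macro_f1(tn=360_000, fp=0, fn=45, tp=0)

# Hypothetical case 2: 43 of 45 attacks caught, but 3,060 normal flows flagged.
acc_noisy, mf1_noisy = accuracy_and_macro_f1(tn=356_940, fp=3_060, fn=2, tp=43)

print(f"{acc_all_normal:.4f} {mf1_all_normal:.4f}")
print(f"{acc_noisy:.4f} {mf1_noisy:.4f}")
```

In both hypothetical cases accuracy stays above 99% while macro-F1 sits near 0.5, because the attack-class F1 is close to zero and macro averaging weights both classes equally.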
Critically, minority class accuracy improved from 95.56% (Standard GAN) to 100.00% (WGAN-GP), demonstrating that diversity-promoting mechanisms enable superior detection of rare attack instances. This 4.44 percentage point improvement in minority class accuracy corresponds to a substantial reduction in missed attacks. In a deployment scenario processing millions of network flows, this improvement could translate to thousands of additional detected attacks that would otherwise evade detection.
The conditional GAN (cGAN) achieved the lowest minority class accuracy (88.89%), despite its ability to generate class-specific samples. This counterintuitive result stems from the conditioning mechanism’s tendency to reinforce existing class boundaries rather than exploring the boundary regions where classifier improvement is most needed. When the generator is explicitly conditioned on class labels, it learns to produce samples that are maximally consistent with class centroids, potentially missing the subtle variations that occur near decision boundaries.
Our optimized WGAN-GP achieved an APJ of 2.63, representing a 2.66× improvement over Standard GAN (0.99). This dramatic efficiency advantage stemmed from the combination of highest accuracy (99.99%) with lowest inference time (42.18 s) and lowest power consumption. The APJ metric captures the essential trade-off for IoT deployments: maximizing detection capability while minimizing energy expenditure. For battery-powered edge devices, this 2.66× improvement directly translates to extended operational lifetime or the ability to process more network traffic within fixed energy budgets.
The ROC-AUC of 0.99 further confirmed superior ranking capability across all decision thresholds. ROC-AUC measures the probability that a randomly chosen positive sample (attack) will rank higher than a randomly chosen negative sample (normal), providing a threshold-independent assessment of classifier quality. The near-perfect AUC achieved by WGAN-GP indicates robust detection capability regardless of the operating point selected for deployment. The PR-AUC of 0.98 is particularly significant given the extreme class imbalance, as precision–recall curves are more informative than ROC curves under such conditions.
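The ranking interpretation of ROC-AUC can be computed directly by comparing every attack score with every normal score. The following minimal sketch uses hypothetical classifier scores and counts ties as half a win, which is the usual convention; it is quadratic in the number of samples and intended only as an illustration of the definition.

```python
def pairwise_auc(attack_scores, normal_scores):
    """Fraction of (attack, normal) pairs in which the attack scores higher; ties count half."""
    wins = 0.0
    for a in attack_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(attack_scores) * len(normal_scores))

# Hypothetical scores: higher means "more likely an attack".
attack = [0.9, 0.8, 0.4]
normal = [0.7, 0.3, 0.2, 0.1]
print(pairwise_auc(attack, normal))
```

Here 11 of the 12 attack/normal pairs are ranked correctly; the single error is the attack scored 0.4 falling below the normal flow scored 0.7.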
Figure 8 presents ROC curves comparing detection performance across GAN-augmented classifiers; the corresponding AUC ordering is WGAN-GP (0.99) > PGAN (0.97) > GraphGAN (0.96) > Standard GAN (0.95) > cGAN (0.94), and all curves lie well above the random baseline.
The ROC curves reveal that optimized WGAN-GP achieved the highest AUC of 0.99, indicating superior ranking capability across different decision thresholds. The curve’s proximity to the upper-left corner demonstrates that WGAN-GP-augmented classifiers can achieve very high true positive rates with minimal false positives, a critical requirement for production security systems. At a false positive rate of 0.01 (1%), WGAN-GP achieved a true positive rate exceeding 0.99 (99%), meaning that only about 1% of legitimate traffic triggered false alarms, while more than 99% of attacks were correctly detected.
The ordering of AUC scores (WGAN-GP: 0.99, PGAN: 0.97, GraphGAN: 0.96, Standard GAN: 0.95, cGAN: 0.94) provides insight into the relative strengths of each architecture. PGAN’s strong score (0.97) suggests that its progressive training strategy provides benefits for classifier-oriented augmentation, even though its generation quality metrics are less impressive than those of WGAN-GP. GraphGAN’s competitive AUC (0.96) indicates that topology-aware representations capture relevant patterns for attack detection, despite its higher computational overhead. Standard GAN’s moderate performance (0.95) reflects the baseline capability of GAN augmentation without specialized optimization. Conditional GAN’s lowest AUC (0.94) confirms that class conditioning, while intuitive, does not guarantee improved classification performance.
Figure 9 presents confusion matrices for the three highest-performing architectures, clarifying the error profiles behind aggregate metrics: (a) optimized WGAN-GP achieved 99.99% accuracy with 45/45 attacks detected (100.00%, 95% CI: 92.13–100.00%) and minimal false positives; (b) Standard GAN detected 43/45 attacks (95% CI: 85.02–98.71%); and (c) PGAN detected 44/45 attacks but produced more false positives.
The confusion matrix analysis provides granular insight into classification behavior beyond aggregate metrics. Panel (a) shows that optimized WGAN-GP achieved perfect minority detection, with 45 of 45 attack instances correctly identified and minimal false positives among normal traffic. This perfect attack detection on the test set, combined with 99.99% overall accuracy, demonstrates that the classifier learned robust representations that generalize well to unseen data.
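With only 45 attack instances in the test split, the confidence interval matters as much as the point estimate. The sketch below computes a Wilson score interval for a binomial proportion. Which interval method produced the reported bounds is not stated here, so using Wilson is an assumption; it does reproduce the 92.13% lower bound for 45 of 45 detections, but other reported intervals may differ slightly from what this function returns.

```python
import math

def wilson_interval(successes, trials, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.959964 gives about 95% coverage)."""
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

low, high = wilson_interval(45, 45)
print(f"{100 * low:.2f}% to {100 * high:.2f}%")
```

The lower bound shows that even a perfect 45/45 result is statistically compatible with a true detection rate of roughly 92%, which is why the interval is reported alongside the 100.00% figure.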
Panel (b) reveals that Standard GAN, despite achieving 99.15% overall accuracy, missed 2 of 45 attack instances (43 correctly detected). While the absolute number of missed attacks appears small, the 4.4% miss rate becomes significant when scaled to production environments processing millions of flows. In such settings, a 4.4% miss rate could result in thousands of undetected intrusions over extended operational periods.
Panel (c) shows that PGAN achieved 97.07% overall accuracy, with 44 of 45 attacks detected, but incurred more false positives among normal traffic. This pattern suggests that PGAN-augmented classifiers adopt a more aggressive detection stance that improves attack sensitivity at the cost of increased false alarms. Depending on operational requirements and the relative costs of missed detections versus false positives, this trade-off may or may not be acceptable.
Figure 10 focuses specifically on minority class (attack) detection performance with APJ comparison, showing that optimized WGAN-GP attained the highest attack TPR on this split (100.00%, 95% CI: 92.13–100.00%) while also achieving the largest APJ among GAN approaches (2.63).
The minority class detection analysis underscores the fundamental advantage of optimized WGAN-GP for security applications. With 100.00% minority accuracy, WGAN-GP correctly identified all attack instances while maintaining the highest APJ (2.63). This combination of detection effectiveness and energy efficiency makes WGAN-GP the clear choice for security-critical applications where both metrics are essential.
The comparison between minority class accuracy and APJ reveals an important insight: there is no inherent trade-off between security effectiveness and energy efficiency when using properly optimized GAN architectures. Traditional approaches often sacrifice detection capability for computational efficiency or vice versa, but our results demonstrate that both objectives can be achieved simultaneously. This finding has significant implications for IoT security deployment, where resource constraints have historically limited the sophistication of deployable detection algorithms.
5.6. Computational and Energy Efficiency Analysis
Energy consumption and computational efficiency are critical considerations for IoT intrusion detection systems, where edge devices operate under severe resource constraints and cumulative energy costs affect both operational feasibility and environmental sustainability.
Table 8 presents comprehensive computational efficiency metrics, revealing substantial differences across architectures (our approach is shown in bold).
Optimized WGAN-GP requires only 724,512 s for training compared to approximately 1.86–1.88 million seconds for other architectures, representing a 2.57× speedup. This substantial reduction in training time has important practical implications beyond mere convenience. Faster training enables more frequent model updates in response to evolving attack patterns, supports rapid experimentation during system development, and reduces the energy consumed during the training phase. For organizations maintaining fleets of intrusion detection models across multiple deployment sites, the cumulative time savings can be substantial.
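The training-time figures are quoted in seconds here and in hours elsewhere in this section, so a quick unit check is useful. The following sketch converts the reported 724,512 s to hours and computes the speedup against both ends of the reported 1.86–1.88 million second range; the helper names are illustrative.

```python
SECONDS_PER_HOUR = 3600

def seconds_to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / SECONDS_PER_HOUR

def speedup(baseline_time, optimized_time):
    """How many times faster the optimized run is than the baseline."""
    return baseline_time / optimized_time

wgan_gp_seconds = 724_512
wgan_gp_hours = seconds_to_hours(wgan_gp_seconds)

# Lower and upper ends of the reported range for the other architectures.
speedup_low = speedup(1.86e6, wgan_gp_seconds)
speedup_high = speedup(1.88e6, wgan_gp_seconds)

print(f"{wgan_gp_hours:.2f} h, speedup {speedup_low:.2f}x to {speedup_high:.2f}x")
```

The conversion gives about 201 h, and the 2.57× figure corresponds to the lower end of the reported range for the other architectures.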
Inference time similarly shows WGAN-GP at 42.18 s versus 92–94 s for alternatives, achieving a 2.22× speedup. Inference efficiency directly impacts real-time detection capability: faster inference enables processing of higher traffic volumes within latency constraints or, alternatively, allows deployment on less powerful hardware while maintaining throughput requirements. The inference speedup also reduces the energy consumed per classification decision, extending battery life for edge deployments.
The slight increase in parameters for WGAN-GP (due to residual connections and noise injection layers) does not reduce efficiency, because the additional per-step cost is outweighed by more efficient training dynamics. The Wasserstein distance objective provides informative gradients that enable faster convergence, while the diversity-promoting mechanisms prevent mode collapse that can cause training to stall. The net effect is that WGAN-GP achieves better performance in substantially less time, despite the additional architectural complexity.
The Energy-per-Sample (EPS) metric quantifies the energy cost of processing individual network flows. WGAN-GP achieved 0.32 mJ per sample compared to 0.85–0.86 mJ for other architectures, representing a 62% reduction in per-sample energy consumption. For high-volume deployments processing millions of flows per day, this efficiency gain translates to significant energy savings with corresponding cost and environmental benefits.
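To put the per-sample figures in deployment terms, the sketch below scales EPS to a daily traffic volume and recomputes the relative reduction. The volume of one million flows per day is a hypothetical workload chosen for illustration, and the comparison uses the 0.85 mJ end of the reported range.

```python
def energy_joules(samples, millijoules_per_sample):
    """Total energy in joules for a number of samples at a given per-sample cost in mJ."""
    return samples * millijoules_per_sample / 1000.0

def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline

flows_per_day = 1_000_000  # hypothetical deployment volume

wgan_gp_daily = energy_joules(flows_per_day, 0.32)
other_daily = energy_joules(flows_per_day, 0.85)
reduction = relative_reduction(0.85, 0.32)

print(f"{wgan_gp_daily:.0f} J vs {other_daily:.0f} J per day, {100 * reduction:.0f}% lower")
```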
Figure 11 presents computational time comparisons, showing that WGAN-GP required 201 training hours versus 516–523 h for alternatives (2.57× speedup) and 42.18 s inference versus 92–94 s (2.22× speedup), which directly supports lower energy consumption.
The training time analysis reveals that WGAN-GP required only 201 h compared to 516–523 h for other architectures. This 2.57× speedup stems from multiple factors. First, the Wasserstein distance objective provides stable, informative gradients that enable consistent progress throughout training, without the oscillations and restarts common in standard GAN training. Second, the diversity-promoting mechanisms prevent mode collapse early in training, avoiding the wasted computation of exploring collapsed solutions. Third, the more efficient critic training (5 updates per generator update) concentrates computational effort on the discriminative task that ultimately drives generation quality.
The inference time comparison shows WGAN-GP requiring 42.18 s versus 92–94 s for alternatives, achieving a 2.22× speedup. This inference efficiency advantage is particularly important for deployment scenarios where real-time or near-real-time detection is required. The faster inference enables WGAN-GP-augmented classifiers to meet tighter latency requirements or process higher traffic volumes within fixed time budgets.
Figure 12 summarizes the power/energy comparison: relative to Standard GAN, WGAN-GP uses 38% of training energy and 44% of inference energy, while also achieving the highest APJ (2.63 vs. ∼1.0 for other GAN variants).
WGAN-GP achieved a 62% reduction in training energy consumption and 56% reduction in inference energy compared to the Standard GAN baseline. These efficiency gains stem from multiple contributing factors. The simplified critic architecture (compared to the discriminator in standard GANs) reduces computational overhead per forward pass. The more efficient training dynamics enable convergence in fewer epochs, reducing the total number of forward and backward passes required. The absence of batch normalization in the critic (required for valid gradient penalty computation) eliminates the computational overhead of computing and applying batch statistics.
Panel (b) demonstrates the APJ advantage quantitatively: WGAN-GP achieved 2.63 compared to approximately 1.0 for other architectures. This 2.66× improvement in APJ represents the compound effect of higher accuracy (numerator improvement) and lower power consumption (denominator improvement). The multiplicative nature of APJ means that simultaneous improvements in both factors yield amplified benefits.
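The compound effect can be made explicit with a short sketch. It assumes that APJ is the ratio of accuracy (in percent) to energy, with energy in whatever unit the reported values use; that definition and the helper names are assumptions for illustration. The Standard GAN accuracy of 99.15% is the value quoted in Section 6.4.

```python
# Reported values: (accuracy in percent, APJ). Units of APJ are not assumed here.
ACC_WGAN_GP, APJ_WGAN_GP = 99.99, 2.63
ACC_STD_GAN, APJ_STD_GAN = 99.15, 0.99


def implied_energy(accuracy_pct: float, apj: float) -> float:
    """Energy implied by APJ = accuracy / energy (assumed definition)."""
    return accuracy_pct / apj


# Numerator effect: how much of the APJ gain comes from higher accuracy.
accuracy_gain = ACC_WGAN_GP / ACC_STD_GAN

# Denominator effect: WGAN-GP energy as a fraction of Standard GAN energy.
energy_ratio = implied_energy(ACC_WGAN_GP, APJ_WGAN_GP) / implied_energy(
    ACC_STD_GAN, APJ_STD_GAN
)

# The two effects multiply to give the overall APJ improvement.
apj_gain = accuracy_gain / energy_ratio

if __name__ == "__main__":
    print(f"accuracy gain: {accuracy_gain:.4f}x")
    print(f"energy ratio:  {energy_ratio:.3f}")
    print(f"APJ gain:      {apj_gain:.2f}x")
```

Under this assumed definition the implied energy ratio is close to 0.38, in line with the training-energy fraction reported for Figure 12, which indicates that almost all of the APJ gain comes from the denominator.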
Figure 13 provides detailed power consumption analysis: WGAN-GP maintained lower normalized power draw (0.35–0.50) than other architectures (0.80–1.05), and the energy breakdown was GPU-dominated (42–50%), with CPU contributing 30–38% and memory approximately 20%.
The power profile analysis reveals that WGAN-GP maintained consistently lower power draw (0.35–0.50 normalized) compared to other architectures (0.80–1.05 normalized) throughout the training process. This consistent efficiency advantage stems from the simpler critic computation (no sigmoid output, no batch normalization) and the more efficient gradient computation enabled by the Wasserstein objective. The relative stability of WGAN-GP’s power profile also indicates more predictable resource utilization, which is valuable for capacity planning and thermal management in deployment environments.
The energy breakdown by hardware component reveals GPU-dominated consumption (42–50%) across all architectures, with CPU contributing 30–38% and memory 20%. This distribution is consistent with the compute-intensive nature of neural network training and inference, where matrix operations dominate the computational workload. The similar breakdown across architectures suggests that efficiency improvements stem primarily from reduced total computation rather than shifts in the computation type.
Figure 14 presents energy-normalized metrics combining performance and energy, showing that WGAN-GP yields the highest APJ (2.63) and highest F1PJ (2.61), substantially exceeding the ∼0.5 range of alternative GAN architectures.
The Accuracy-per-Joule (APJ) analysis shows WGAN-GP achieving the highest value (2.63) due to the multiplicative benefit of superior classification accuracy combined with lower power consumption. This metric is particularly relevant for IoT deployments where both detection capability and energy budget are constrained. In principle, a 2.66× improvement in APJ corresponds to either roughly 2.66× longer battery life at equivalent detection capability or 2.66× more network flows processed within a fixed energy budget.
The F1-per-Joule (F1PJ) metric provides an alternative perspective that emphasizes balanced performance across classes rather than raw accuracy. WGAN-GP achieved 2.61 compared to approximately 0.5 for other architectures, a 5.22× improvement. The larger improvement in F1PJ compared to APJ reflects WGAN-GP’s superior performance on minority class detection, which contributes more substantially to F1 score than to overall accuracy in imbalanced datasets. For security applications where minority class performance is paramount, F1PJ provides a more appropriate efficiency measure than APJ.
5.7. Comprehensive Performance Summary
To provide a holistic view of architecture performance across the multiple dimensions relevant to IoT intrusion detection, we present multi-metric visualizations that enable direct comparison of strengths and trade-offs.
Figure 15 compares architectures across seven metrics (accuracy, macro-F1, minority accuracy, generation quality, APJ, inference speed, and training speed), showing that optimized WGAN-GP dominated across all axes in our benchmark.
The radar chart visualization demonstrates that optimized WGAN-GP dominated across all seven evaluated metrics: accuracy, macro-F1, minority class accuracy, generation quality, APJ, inference speed, and training speed. This comprehensive dominance is remarkable because it contradicts the common assumption that optimization along one dimension necessarily compromises performance along others. Traditional engineering wisdom suggests that practitioners must navigate trade-offs between competing objectives, selecting architectures that balance requirements according to deployment priorities. Our results demonstrate that with proper optimization, these apparent trade-offs can be resolved: WGAN-GP achieves simultaneous excellence across all metrics.
The other architectures exhibited characteristic profiles that reflected their design priorities and limitations. Standard GAN showed moderate performance across most metrics but lagged in macro-F1 and minority accuracy, reflecting its susceptibility to mode collapse, which limits minority class representation in generated samples. Progressive GAN achieved good minority accuracy but sacrificed training speed due to its phased training procedure. Conditional GAN underperformed on minority accuracy despite its class-conditioning capability, as discussed previously. GraphGAN achieved balanced but unexceptional performance across metrics, with the overhead of graph attention mechanisms not translating to corresponding benefits.
Figure 16 summarizes normalized performance across metrics and architectures, with optimized WGAN-GP achieving best or near-best normalized scores across accuracy, macro-F1, minority accuracy, generation quality, APJ, and efficiency metrics in our evaluation.
The performance heatmap provides a dense summary of normalized performance across all metrics and architectures, with higher values (darker colors) indicating better performance. Optimized WGAN-GP achieved best or near-best performance across all metrics: accuracy (1.00), macro-F1 (1.00), minority accuracy (1.00), generation quality (1.00), APJ (1.00), and efficiency metrics (1.00). The consistent high performance across diverse metrics, spanning classification effectiveness, generation quality, and computational efficiency, demonstrates that proper optimization eliminates traditional trade-offs between competing objectives.
The heatmap also reveals patterns in the relative strengths and weaknesses of alternative architectures. Standard GAN achieved moderate scores across most metrics but particularly struggled with macro-F1 (0.51), reflecting its poor minority class performance. Progressive GAN showed strength in minority accuracy (0.85) but weakness in efficiency metrics due to its phased training overhead. Conditional GAN exhibited the lowest minority accuracy (0.26 normalized), confirming that class conditioning does not guarantee improved minority class handling. GraphGAN achieved balanced but moderate performance, with its additional complexity not translating to commensurate benefits.
Our comprehensive experimental evaluation yielded several key findings with significant implications for research and practice in GAN-based IoT intrusion detection:
1. Generation Quality and Classification Performance are Complementary: With proper optimization incorporating diversity loss, feature matching, and noise injection, WGAN-GP achieved both best generation quality (MSE 0.01) and best classification accuracy (99.99%), demonstrating that these objectives reinforce rather than compete with each other.
2. Minority Class Detection is Dramatically Improved: WGAN-GP achieved 100.00% minority class accuracy (95% CI: 92.13–100.00%) compared to 77.78–95.56% for alternatives, representing a 4.44–22.22 percentage point improvement that translates to substantially reduced missed attacks in production deployments.
3. Energy Efficiency Advantages are Substantial: WGAN-GP achieved APJ of 2.63, representing a 2.66× improvement over Standard GAN, with 62% lower energy consumption per sample. These efficiency gains enable deployment on resource-constrained IoT devices.
4. Computational Speedups are Significant: WGAN-GP achieved 2.57× training speedup and 2.22× inference speedup compared to alternatives, enabling faster model development cycles and higher-throughput deployment.
5. Critical Hyperparameters Enable Success: The key optimizations that enable WGAN-GP’s superior performance are (not 1), diversity loss (), feature matching loss (), and noise injection layers. These components work synergistically to prevent mode collapse while promoting high-fidelity generation.
These findings suggest that optimized WGAN-GP is a strong candidate for GAN-based IoT intrusion detection on datasets with similar characteristics to BoT-IoT, achieving simultaneous excellence across accuracy, efficiency, and generation quality metrics while solving the critical minority class detection limitation that makes alternative approaches unsuitable for security-critical deployments.
5.8. Baseline Comparison
Table 9 compares GAN-augmented approaches with traditional machine learning baselines augmented with SMOTE, including comprehensive APJ metrics for all methods (our method is indicated in bold).
The comparison reveals a critical insight: while traditional ML baselines augmented with SMOTE achieved 99.99% overall accuracy, their minority class detection was substantially lower (77.78–80.00%, 95% CI: 63.7–89.1%) compared to WGAN-GP (100.00%, 95% CI: 92.13–100.00%). This represents a 20.00–22.22 percentage point improvement in attack detection capability. For security applications where missing attacks carries severe consequences, this difference is critical.
Additionally, while Logistic Regression + SMOTE achieved the highest APJ (58.82) due to minimal inference time, its lower minority class detection makes it less suitable for security-critical deployments. WGAN-GP provides a strong balance of efficiency (APJ 2.63, best among GAN methods) and security effectiveness (100.00% attack detection on this test split).
Figure 17 compares Classical + SMOTE baselines to GAN augmentation, highlighting that overall accuracy is comparable (99.99% for LR + SMOTE and WGAN-GP), but minority detection differs substantially (77.78–80.00% for Classical + SMOTE vs. 100.00% for WGAN-GP on this split), alongside APJ comparisons.
5.9. Comprehensive Cross-Dataset Analysis and Evaluation of All Approaches
The cross-dataset evaluation spanning 250 attack instances across five benchmarks provides statistically robust evidence for our principal findings. We present an extensive analysis of the pooled results, examining performance patterns across all evaluated methods.
Overall Accuracy Analysis: Examining the pooled mean accuracy across all five datasets reveals a clear stratification among approaches. Our optimized WGAN-GP achieved the highest overall accuracy (99.95%), followed closely by Classical + SMOTE methods (Logistic Regression: 99.94%, CNN1D-TCN: 99.93%). Standard GAN achieved 98.93%, while PGAN (96.79%), GraphGAN (96.44%), and cGAN (95.43%) showed progressively lower overall accuracy. The marginal accuracy advantage of WGAN-GP over Classical + SMOTE (a difference of 0.01–0.02 percentage points) was not statistically significant; however, this near-identical overall accuracy masks critical differences in minority class handling that determine practical security utility.
Critical Minority Class Detection Analysis: The pooled minority class accuracy reveals the fundamental limitation of high-accuracy classical approaches and the substantial advantage of our optimized WGAN-GP. Across 250 test attack instances, the following observations could be made:
WGAN-GP (Proposed): Detected 246/250 attacks (98.40%, 95% Wilson CI: 95.9–99.4%), missing only 4 attack instances across all five datasets
PGAN: Detected 238/250 attacks (95.20%, 95% Wilson CI: 91.8–97.3%), missing 12 attacks
Standard GAN: Detected 232/250 attacks (92.80%, 95% Wilson CI: 88.9–95.5%), missing 18 attacks
GraphGAN: Detected 231/250 attacks (92.40%, 95% Wilson CI: 88.5–95.2%), missing 19 attacks
cGAN: Detected 216/250 attacks (86.40%, 95% Wilson CI: 81.6–90.2%), missing 34 attacks
Logistic Regression + SMOTE: Detected 192/250 attacks (76.80%, 95% Wilson CI: 71.1–81.7%), missing 58 attacks
CNN1D-TCN + SMOTE: Detected 184/250 attacks (73.60%, 95% Wilson CI: 67.7–78.8%), missing 66 attacks
The 21.60 percentage point improvement in minority class detection from WGAN-GP (98.40%) over Logistic Regression + SMOTE (76.80%) is highly statistically significant (McNemar’s test , ). Critically, the 95% Wilson confidence intervals for WGAN-GP [95.9%, 99.4%] and Logistic Regression + SMOTE [71.1%, 81.7%] do not overlap, establishing that this difference is robust and not attributable to sampling variation. In practical terms, WGAN-GP missed only 4 attacks compared to 58 for the best Classical + SMOTE method, a 14.5× reduction in missed attacks that directly translates to improved security posture.
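Because the test statistic depends only on the discordant pairs, the reported miss counts already constrain it. The sketch below assumes both methods were scored on the same 250 pooled attack instances (a requirement of McNemar's test) and uses the exact binomial form of the test; the discordant counts themselves are not reported in this section, so the code enumerates every split consistent with 4 and 58 misses. Function and variable names are illustrative.

```python
from math import comb

MISSED_WGAN_GP = 4    # reported pooled misses for WGAN-GP
MISSED_LR_SMOTE = 58  # reported pooled misses for Logistic Regression + SMOTE


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def feasible_discordant_counts():
    """Yield (b, c) pairs consistent with the reported miss counts.

    b: attacks WGAN-GP detected but LR + SMOTE missed.
    c: attacks LR + SMOTE detected but WGAN-GP missed (at most 4).
    Attacks missed by both methods account for the remainder.
    """
    for c in range(MISSED_WGAN_GP + 1):
        yield MISSED_LR_SMOTE - MISSED_WGAN_GP + c, c


if __name__ == "__main__":
    for b, c in feasible_discordant_counts():
        print(f"b={b:2d}, c={c}: p = {mcnemar_exact_p(b, c):.2e}")
```

Every feasible split yields a very small p-value, so the significance conclusion does not depend on how the misses overlap between the two methods.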
Macro-F1 Score Analysis: The pooled macro-F1 scores further highlight the class-balanced performance characteristics of each approach. WGAN-GP achieved the highest macro-F1 (0.98), indicating balanced performance across both normal and attack classes. Classical + SMOTE methods achieved moderate macro-F1 scores (Logistic Regression: 0.92, CNN1D-TCN: 0.91), reflecting their bias toward the majority class. Standard GAN (0.50), PGAN (0.49), cGAN (0.48), and GraphGAN (0.49) all exhibited macro-F1 scores near 0.50, indicating near-random performance on the minority class despite reasonable overall accuracy. This macro-F1 analysis confirms that only WGAN-GP successfully addressed the class imbalance challenge inherent in IoT intrusion detection.
Energy Efficiency Analysis (APJ): The Accuracy-per-Joule metric reveals important efficiency trade-offs across approaches. Logistic Regression + SMOTE achieved the highest APJ (53.79) due to minimal inference computational requirements (mean inference time: 0.16 s). However, this efficiency came at the cost of inadequate security effectiveness (76.80% attack detection). Among methods achieving acceptable security levels (>95% minority accuracy), WGAN-GP provided the best efficiency, with APJ of 2.49, representing a 2.71× improvement over Standard GAN (0.92) and 2.74× improvement over PGAN (0.91).
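The selection rule implied here (first require a minimum attack detection rate, then maximize APJ among the remaining methods) can be sketched as below. Only methods whose pooled APJ is quoted in this section are included; the `Method` type and the `select` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Method(NamedTuple):
    name: str
    minority_acc: float  # pooled attack detection rate, percent
    apj: float           # pooled Accuracy-per-Joule as reported


# Pooled values quoted in this section.
POOLED = [
    Method("Logistic Regression + SMOTE", 76.80, 53.79),
    Method("WGAN-GP", 98.40, 2.49),
    Method("PGAN", 95.20, 0.91),
    Method("Standard GAN", 92.80, 0.92),
]


def select(methods: Sequence[Method], min_minority_acc: float = 95.0) -> Optional[Method]:
    """Return the highest-APJ method whose detection rate exceeds the threshold."""
    eligible = [m for m in methods if m.minority_acc > min_minority_acc]
    return max(eligible, key=lambda m: m.apj, default=None)


if __name__ == "__main__":
    print(select(POOLED))                        # security-constrained choice
    print(select(POOLED, min_minority_acc=0.0))  # efficiency-only choice
```

Dropping the detection threshold flips the choice to Logistic Regression + SMOTE, which is the trade-off the paragraph above describes.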
Dataset-Specific Performance Patterns: Examining individual dataset results reveals consistent performance ordering across all benchmarks. WGAN-GP achieved the highest minority class accuracy on every dataset, ranging from 97.67% (CIC-IDS2017) to 100.00% (BoT-IoT). The slight performance variation across datasets (standard deviation: 0.94%) indicates robust generalization rather than overfitting to specific attack characteristics. Classical + SMOTE methods exhibited a comparable spread in minority class accuracy (standard deviation: 0.91% for Logistic Regression), so their shortfall reflects a consistently lower detection rate rather than greater sensitivity to dataset-specific class distributions.
Statistical Confidence Analysis: With 250 pooled test attack instances, the 95% Wilson confidence interval for WGAN-GP’s minority class accuracy was [95.9%, 99.4%], which did not overlap with any Classical + SMOTE method’s confidence interval (highest: [71.1%, 81.7%] for Logistic Regression). This non-overlapping confidence establishes that WGAN-GP’s superiority in minority class detection is statistically robust. While individual datasets had limited test instances (43–57 attacks each), the cross-dataset pooling strategy provided the aggregate sample size necessary for reliable statistical inference.
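For readers who wish to recompute the intervals, a minimal sketch of the Wilson score interval follows, using only the Python standard library. The function names are illustrative, and small differences from the reported bounds in the last digit may arise from rounding.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials and trials > 0")
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    wgan_gp = wilson_interval(246, 250)   # pooled WGAN-GP detections
    lr_smote = wilson_interval(192, 250)  # pooled Logistic Regression + SMOTE
    print(wgan_gp, lr_smote, intervals_overlap(wgan_gp, lr_smote))
    print(wilson_interval(45, 45))        # single-split BoT-IoT result
```

The same function applied to 45 out of 45 detections shows how much wider the interval becomes on a single dataset, which is the motivation for pooling.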
Table 10 presents the cross-dataset validation results across all five benchmarks, demonstrating consistent performance patterns (our approach is indicated in bold).
6. Discussion
This section interprets the experimental findings and their implications for practical IoT intrusion detection deployments, examining the complementary relationship between generation quality and classification performance, comparing Classical + SMOTE methods against GAN-augmented approaches through the lens of energy efficiency, and providing architecture selection guidelines for practitioners. We analyze the significance of the Accuracy-per-Joule metric for resource-constrained environments and discuss the critical importance of minority class detection in security-critical applications.
6.1. Generation Quality Versus Classification Performance
Our experimental results demonstrate that with proper optimization, generation quality and classification performance are complementary rather than competing objectives. The optimized WGAN-GP achieves both the best generation quality (MSE 0.01) and the best classification accuracy (99.99%), resolving the apparent trade-off observed in prior work. This is achieved through diversity-promoting mechanisms including diversity loss, feature matching, and noise injection that ensure that generated samples cover the full data distribution while maintaining high fidelity.
The key insight is that standard WGAN-GP without these mechanisms produces high-fidelity but low-diversity samples that cluster around the distribution mean, providing limited value for classifier training. Our diversity loss explicitly encourages the generator to utilize the full latent space, producing samples that span decision boundary regions where classifier improvement is most needed. The feature matching loss ensures alignment of intermediate statistics, while noise injection prevents mode collapse during training.
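To make the two auxiliary terms concrete, the following is a minimal sketch in plain Python. The exact loss definitions are not given in this section, so the code substitutes two common formulations as assumptions: a mode-seeking style diversity term (the ratio of distances between generated samples to distances between their latent codes) and the usual feature matching term (squared distance between mean critic features of real and generated batches). All function names are illustrative.

```python
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def l1_distance(a: Vector, b: Vector) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))


def diversity_score(latents: Sequence[Vector], samples: Sequence[Vector], eps: float = 1e-8) -> float:
    """Mean ratio of sample distance to latent distance over all pairs.

    A collapsed generator maps different latents to the same sample, giving 0.
    """
    if len(latents) != len(samples) or len(latents) < 2:
        raise ValueError("need at least two latent/sample pairs of equal count")
    ratios = [
        l1_distance(samples[i], samples[j]) / (l1_distance(latents[i], latents[j]) + eps)
        for i, j in combinations(range(len(latents)), 2)
    ]
    return sum(ratios) / len(ratios)


def diversity_loss(latents: Sequence[Vector], samples: Sequence[Vector]) -> float:
    """Negated score, so that minimizing the loss increases diversity."""
    return -diversity_score(latents, samples)


def feature_matching_loss(real_features: Sequence[Vector], fake_features: Sequence[Vector]) -> float:
    """Squared L2 distance between the batch means of intermediate features."""
    dim = len(real_features[0])
    real_mean = [sum(row[k] for row in real_features) / len(real_features) for k in range(dim)]
    fake_mean = [sum(row[k] for row in fake_features) / len(fake_features) for k in range(dim)]
    return sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean))
```

In a training loop these terms would be added to the generator objective with their respective weights; the weights used in this work are not restated here.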
6.2. Classical + SMOTE Methods vs. GAN-Augmented Approaches: The APJ Perspective
Our comprehensive evaluation provides a clear picture of the performance of Classical + SMOTE methods compared with GAN-augmented approaches, as measured by APJ metrics. Classical methods with SMOTE (Logistic Regression, CNN1D-TCN) achieved 99.99% overall accuracy with very different APJ profiles: Logistic Regression achieved extremely high APJ (58.82) due to minimal inference time (0.16 s), while CNN1D-TCN showed very low APJ (0.08) due to longer inference time (117.50 s).
However, the critical limitation of Classical + SMOTE methods lies in minority class detection: both achieved only a 77.78–80.00% attack detection rate, compared to 100.00% for WGAN-GP. This 20.00–22.22 percentage point gap represents the difference between missing 20% of attacks versus missing none on this test split. For security applications where each missed attack can result in significant damage, this gap is unacceptable regardless of APJ efficiency.
WGAN-GP achieved APJ of 2.63, which, while lower than Logistic Regression’s 58.82, represents the best balance of efficiency and security effectiveness. The 2.66× improvement over Standard GAN demonstrates that proper GAN optimization can achieve competitive efficiency while solving the minority class detection problem that Classical + SMOTE methods cannot address.
6.3. Energy-Aware Evaluation: The APJ Advantage
Our Accuracy-per-Joule (APJ) metric provides crucial insights for deployment decisions. Optimized WGAN-GP achieved APJ of 2.63, representing the following gains:
2.66× improvement over Standard GAN (0.99)
2.71× improvement over PGAN (0.97)
2.74× improvement over cGAN (0.96)
This dramatic APJ advantage stems from WGAN-GP’s combination of highest accuracy (99.99%) with lowest inference time (42.18 s) and lowest power consumption. For energy-constrained IoT deployments, this makes WGAN-GP the preferred choice among the evaluated GAN methods.
6.4. Architecture Selection Guidelines
Based on our comprehensive evaluation with APJ metrics, we provide the following architecture selection recommendations:
Recommended Default (All Scenarios): Optimized WGAN-GP represents the clear choice for virtually all deployment scenarios requiring GAN-based augmentation, achieving best-in-class performance across accuracy (99.99%), macro-F1 (0.99), minority class detection (100.00%), generation quality (MSE 0.01), and APJ (2.63).
Legacy/Simple Deployments: For scenarios requiring simpler implementation without the optimized components, Standard GAN provides a reasonable alternative at 99.15% accuracy and 0.99 APJ.
Comparison with Traditional ML: While Logistic Regression + SMOTE achieves very high APJ (58.82) due to minimal inference time, its minority class detection (80.00%) is inadequate for security applications. WGAN-GP provides the best balance of efficiency and security effectiveness.
Targeted Class Augmentation: Conditional GAN remains valuable for scenarios requiring targeted augmentation of specific attack categories, where the ability to control which classes receive synthetic samples is essential.
Graph-Structured Data: Graph-based GAN is recommended when network topology information is critical and temporal/spatial relationships between flows must be captured.
6.5. Multi-Dataset Generalization Analysis
The cross-dataset evaluation presented in Section 5.9 provides compelling evidence for the generalizability of our findings beyond the primary BoT-IoT benchmark. By validating our complete pipeline across five diverse intrusion detection datasets (BoT-IoT, CICIoT2023, ToN-IoT, UNSW-NB15, and CIC-IDS2017), we establish that the performance advantages of optimized WGAN-GP are not artifacts of dataset-specific characteristics but reflect fundamental improvements in GAN-based augmentation methodology.
Table 11 summarizes the pooled mean performance across all five datasets, aggregating results from 250 test attack instances to provide statistically robust conclusions with substantially narrower confidence intervals than any single-dataset evaluation (our approach is shown in bold).
Several key observations emerge from the multi-dataset analysis that reinforce and extend our primary findings. The relative ranking of methods remains stable across all five datasets, with WGAN-GP achieving the highest minority class accuracy on every individual dataset (ranging from 97.67% to 100.00%), followed consistently by PGAN, then Standard GAN and GraphGAN, with cGAN and Classical + SMOTE methods trailing. This consistency (standard deviation of minority accuracy across datasets: 0.94% for WGAN-GP) indicates that performance differences reflect intrinsic methodological advantages rather than dataset-specific overfitting. With 250 pooled test attacks, the 95% Wilson confidence interval for WGAN-GP’s minority class accuracy [95.9%, 99.4%] did not overlap with that of any Classical + SMOTE method (highest: [71.1%, 81.7%] for Logistic Regression), and McNemar’s test confirmed statistical significance (, ) for the 21.60 percentage point improvement. In practical terms, WGAN-GP missed only 4 attacks across all datasets compared to 58 for Logistic Regression + SMOTE, representing a 14.5× reduction in missed intrusions.
The pooled macro-F1 scores further differentiate WGAN-GP (0.98) from all alternatives: Standard GAN, PGAN, cGAN, and GraphGAN all achieved macro-F1 scores near 0.50 (range: 0.48–0.50), indicating near-random minority class performance despite reasonable overall accuracy, while Classical + SMOTE methods achieved intermediate macro-F1 (0.91–0.92), reflecting their majority-class bias. Regarding energy efficiency trade-offs, among methods achieving acceptable security levels (>95% minority accuracy), WGAN-GP provided the best APJ (2.49), representing 2.71× improvement over Standard GAN and 2.74× over PGAN; while Logistic Regression + SMOTE achieved substantially higher APJ (53.79) due to minimal inference requirements, its inadequate minority class detection (76.80%) disqualifies it for security-critical applications. Notably, the five evaluated datasets span diverse IoT attack scenarios: BoT-IoT (4 attack types, botnet-focused), CICIoT2023 (33 attack types, comprehensive IoT threats), ToN-IoT (9 types, telemetry and network), UNSW-NB15 (9 types, general network intrusions), and CIC-IDS2017 (8 types, enterprise network attacks). WGAN-GP’s consistent superiority across this heterogeneous collection suggests broad applicability to real-world IoT security deployments regardless of the specific threat landscape. These multi-dataset findings strengthen confidence in our principal conclusions and support the deployment of optimized WGAN-GP as a general-purpose solution for GAN-based intrusion detection augmentation across diverse IoT security contexts.
7. Conclusions and Future Directions
This paper presents the first comprehensive benchmark evaluation of five GAN architectures for energy-aware IoT intrusion detection, introducing novel Accuracy-per-Joule (APJ) metrics that enable principled architecture selection. Our experimental evaluation on a stratified subset of the BoT-IoT dataset yields several key findings with significant implications for research and practice.
Our optimized WGAN-GP achieves state-of-the-art performance, with 99.99% classification accuracy, 0.99 macro-F1 score, and 100.00% minority class accuracy (95% CI: 92.13–100.00%), matching Classical + SMOTE methods in overall accuracy while dramatically improving attack detection by 20.00–22.22 percentage points. While Classical + SMOTE methods (Logistic Regression, CNN1D-TCN) achieve 99.99% accuracy with high APJ (up to 58.82), their minority class detection (77.78–80.00%) is inadequate for security applications. WGAN-GP solves this limitation while maintaining competitive APJ (2.63).
Cross-dataset validation across five diverse intrusion detection benchmarks (BoT-IoT, CICIoT2023, ToN-IoT, UNSW-NB15, CIC-IDS2017) with 250 pooled test attack instances confirmed the generalizability of these findings. WGAN-GP achieved pooled minority class accuracy of 98.40% (246/250 attacks detected, 95% CI: 95.9–99.4%), compared to 76.80% for the best Classical + SMOTE method (192/250). This 21.60 percentage point improvement was statistically significant (McNemar’s test ) and represented a 14.5× reduction in missed attacks. The consistent performance ordering across all five datasets (WGAN-GP achieving the highest minority accuracy on each) demonstrates that our optimizations provide fundamental methodological improvements rather than dataset-specific advantages.
Among GAN methods, WGAN-GP achieved APJ of 2.63, representing a 2.66× improvement over Standard GAN (0.99), with 62% lower energy consumption per sample. WGAN-GP simultaneously achieved best generation quality (MSE 0.01) and best classification performance, demonstrating that these objectives are complementary when properly optimized with diversity-promoting mechanisms. The critical optimizations that enabled these results were (not 1), diversity loss (), feature matching loss (), and noise injection layers. Our energy-normalized metrics (Accuracy-per-Joule and F1-per-Joule) provided a principled framework for architecture selection, with optimized WGAN-GP dominating across all efficiency metrics among GAN methods while solving the minority class detection limitation of Classical + SMOTE approaches.
Our results suggest that optimized WGAN-GP is a strong candidate for GAN-based IoT intrusion detection on datasets with similar characteristics to BoT-IoT, achieving simultaneous excellence across accuracy, efficiency, and generation quality metrics while solving the critical minority class detection limitation that makes Classical + SMOTE methods unsuitable for security-critical deployments.
7.1. Threats to Validity
Our evaluation has several limitations that readers should consider when interpreting results:
Internal Validity: The small minority class test set (45 attack instances) limits statistical power. While we report 95% Wilson confidence intervals, the observed 100% minority accuracy [92.1–100.0%] overlapped with Standard GAN’s 95.56% [85.0–98.7%] at the confidence interval level. Results may be sensitive to the specific train/test split despite stratification.
External Validity: The primary evaluation used a single dataset (BoT-IoT). Although the cross-dataset validation in Section 5.9 covers four further benchmarks, including UNSW-NB15 [66] and CIC-IDS2017, generalization to other IoT environments, attack types, or network configurations still requires validation on additional datasets. The 50% subset sampling, while preserving class ratios, may not have captured all patterns present in the full dataset.
Construct Validity: Power monitoring used software-based estimation (calibrated against nvidia-smi/RAPL with uncertainty) rather than hardware power meters, which may have introduced systematic measurement bias. The APJ metric assumed that inference energy dominated deployment costs, which may not hold for scenarios with frequent retraining.
Conclusion Validity: The binary classification focus may not have captured nuances relevant to multi-class attack categorization with fine-grained attack type labels. Claims about “best” performance should be interpreted within the scope of the evaluated architectures and dataset.
7.2. Future Directions
Future research directions include (1) validation of optimized WGAN-GP on additional IoT security datasets, (2) development of adaptive diversity weights based on class imbalance ratios, (3) federated WGAN-GP training for distributed IoT deployments, (4) hardware-accelerated implementations for edge devices, and (5) real-time augmentation strategies with continuous model updates.