1. Introduction
The Internet of Things (IoT) is now an indispensable component of contemporary civilization, underpinning a multitude of applications at both individual and enterprise levels. However, despite the widespread adoption, the IoT network remains highly susceptible to cyberattacks due to inherent device constraints such as limited computational resources, weak security configurations, and heterogeneous deployment environments. In particular, IoT botnet-driven attacks, most notably DDoS attacks, continue to present major security risks to enterprises and institutions across diverse industry verticals. To mitigate these challenges, a broad spectrum of artificial intelligence (AI)-based attack detection strategies have been introduced and extensively evaluated to strengthen the resilience of IoT ecosystems. For example, prior studies have explored advanced AI-driven techniques based on federated learning [1,2,3], graph neural networks [4,5,6], transformer-based architectures [7,8,9], autoencoder and variational autoencoder models [10,11,12], latent space arithmetic and alignment-based methods [13,14,15], and cost-sensitive learning strategies designed to address class imbalance in IoT security datasets [16,17,18,19].
However, despite their demonstrated potential for improving computational efficiency in domains such as NLP and LLMs [20,21,22], Mixture of Experts (MoE) [23,24] methods remain relatively unexplored in the context of IoT botnet detection. In the MoE design, computation is dynamically routed such that only a small number of expert networks are utilized for each data instance. To ensure feasibility on low-resource IoT edge devices, we adopt the mixture of tiny experts (MoTE) [25], a variant that leverages lightweight expert models for per-instance classification. This mechanism substantially reduces inference latency, energy consumption, and memory overhead, while preserving strong representational capacity through expert specialization.
This work introduces the VAE-encoder-MoTE framework in which a VAE encoder first transforms high-dimensional IoT botnet traffic features into a compact latent embedding space. The MoTE model is then trained on these low-dimensional embeddings, enabling efficient learning while retaining the critical structural information necessary for accurate attack detection. This integration of latent representation learning with conditional expert activation achieves a strong trade-off across detection accuracy and computational cost, positioning the method as well-suited for deployment on IoT systems with limited resources.
To further improve the discriminative quality of the learned representations, supervised contrastive learning [26,27] is incorporated during model training. This training approach explicitly promotes instances from the same attack category to form tight groupings in the latent representation, while ensuring that instances from different categories are separated by larger distances. As a result, the learned feature space exhibits improved intra-class compactness and inter-class separability, which enhances generalization capability, robustness to class imbalance, and resilience to evolving IoT traffic patterns. In addition to the proposed VAE-encoder-MoTE architecture, two alternative models, namely VAE-encoder-MLP and VAE-encoder-GAT, are also trained and evaluated. A comparative analysis of the three architectures is conducted using standard performance metrics.
The main contributions of this study are:
A hybrid framework that combines VAE-based latent representation learning with an MoTE architecture for IoT botnet detection is introduced. The approach compresses high-dimensional IoT traffic data into compact embeddings and applies conditional expert routing to enable efficient and scalable detection.
The study incorporates supervised contrastive learning to enhance the discriminative power of the latent feature space. By encouraging embeddings of samples from the same attack class to cluster together while separating different classes, the approach improves intra-class compactness and inter-class separability.
A systematic empirical evaluation is conducted comparing the proposed VAE-encoder-MoTE model against the VAE-encoder-MLP and VAE-encoder-GAT models. The comparison assesses their effectiveness under both binary and multiclass IoT botnet detection scenarios. Performance is analyzed using multiple metrics providing a detailed understanding of the strengths and limitations of each architectural design.
Through extensive empirical validation on benchmark IoT botnet datasets, the study provides insights into how architectural choices and dataset characteristics influence detection performance in IoT intrusion detection systems.
The remainder of this paper is organized as follows.
Section 2 reviews the relevant literature on supervised contrastive learning, mixture-of-experts architecture, and GNNs for attack detection.
Section 3 presents the proposed framework, including the VAE-based representation learning approach, the MoTE architecture, and the experimental setup.
Section 4 reports the experimental results obtained under both binary and multiclass classification settings.
Section 5 provides an in-depth analysis and interpretation of the findings.
Section 6 summarizes the key contributions and outlines prospects for future work.
3. Methodology
This study proposes the VAE-encoder-MoTE model and systematically evaluates its detection performance against two other advanced hybrid deep learning architectures—VAE-encoder-MLP and VAE-encoder-GAT—for IoT botnet detection under both binary and multiclass classification settings. To boost the discriminative quality of the learned representations, supervised contrastive learning is employed to promote intra-class compactness and inter-class separability, thereby enhancing detection performance.
To tackle the complexity of high-dimensional IoT traffic data, a VAE with a latent space of dimension k (set to k = 8 [4,9,12] in this study) is initially trained on the original feature space. After training, only the encoder part of the VAE is preserved. For each model architecture, this pretrained encoder maps the inputs into compact, lower-dimensional vectors. The resulting reduced-dimension training data is subsequently used to train an MLP, a GAT, and an MoTE model under a supervised contrastive learning framework.
3.1. Variational Autoencoder for Dimensionality Reduction
Dimensionality reduction has become a key strategy for building efficient detection systems. Common approaches include feature selection methods (filter, wrapper), PCA, and deep learning techniques such as Auto-encoders (AEs) and Variational Auto-encoders (VAEs). Filter methods apply statistical measures to score and prioritize features, while wrapper methods iteratively evaluate model performance with different feature subsets. PCA reduces dimensionality by projecting data onto directions of maximum variance. However, these techniques have limitations: filter methods may produce suboptimal subsets, wrapper methods are computationally expensive, and PCA ignores class labels and struggles with nonlinear data. AEs offer an alternative by learning compact latent representations, but their lack of explicit regularization can lead to poorly structured latent spaces, reducing generalization performance and interpretability. Therefore, the VAE was selected for this task.
The VAE, originally proposed in [51], is a generative model that introduces a structured and regularized latent space, distinguishing it from traditional autoencoders. Given a dataset $X = \{x_i\}_{i=1}^{N}$, the generative process assumes a latent variable $z$ drawn from a prior $p(z)$, followed by sampling from the conditional distribution $p_\theta(x \mid z)$. Because direct optimization of the marginal likelihood $p_\theta(x)$ is computationally intractable, the true posterior is approximated using a variational distribution $q_\phi(z \mid x)$, parameterized by an encoder network. The model is then trained by maximizing the evidence lower bound (ELBO), as expressed in Equation (1).

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \tag{1}$$
Here, the Kullback–Leibler (KL) divergence term acts as a regularizer, encouraging the learned latent distribution to stay close to the prior. As a result, the VAE simultaneously learns a compact latent representation and a probabilistic generative model, supporting both accurate reconstruction and data generation under uncertainty.
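To make the role of the KL regularizer concrete, the following minimal sketch evaluates it in closed form and draws a latent sample with the reparameterization trick. It assumes a diagonal Gaussian posterior and a standard normal prior, which is the common choice but is not stated explicitly in the text; the function names and the example values are illustrative only.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


# Assumed encoder outputs for one sample in an 8-dimensional latent space (k = 8).
mu = [0.5, -0.2, 0.0, 0.1, 0.0, 0.3, -0.4, 0.0]
log_var = [0.0] * 8  # unit variance in every dimension

kl = kl_to_standard_normal(mu, log_var)
z = reparameterize(mu, log_var, random.Random(0))
```

With unit variance the KL term reduces to half the squared norm of the mean, so it vanishes only when the approximate posterior coincides with the prior.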
In this study, both the encoder and decoder networks comprised four fully connected hidden layers. The encoder employed a decreasing architecture with 128, 64, 32, and 16 neurons to transform the input into the latent representation, whereas the decoder adopted a symmetric increasing structure with 16, 32, 64, and 128 neurons to reconstruct the input from the latent representation. The LeakyReLU activation function was applied across all hidden layers.
3.2. Contrastive Learning
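To promote tighter grouping of instances belonging to the same class and clearer separation across different classes in the learned embedding space, supervised contrastive learning is adopted. Given a batch $B$ of feature embeddings $\{z_i\}_{i=1}^{N}$, where $z_i \in \mathbb{R}^{d}$ ($i = 1, \dots, N$), the embeddings are first $\ell_2$-normalized using Equation (2):

$$\hat{z}_i = \frac{z_i}{\lVert z_i \rVert_2} \tag{2}$$

The pairwise cosine similarity between samples $i$ and $j$ is computed using Equation (3):

$$s_{ij} = \frac{\hat{z}_i^{\top} \hat{z}_j}{\tau} \tag{3}$$

where $\tau$ is a temperature scaling parameter.

The supervised contrastive loss for a sample $i$ is obtained using Equation (4):

$$\mathcal{L}_i = -\frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(s_{ip})}{\sum_{a \neq i} \exp(s_{ia})} \tag{4}$$

where $P(i)$ denotes the set of indices corresponding to samples that share the same class label as sample $i$.
$s_{ip}$: similarity between sample $i$ and positive sample $p$.
$a$: any sample in the batch other than $i$ (both positives and negatives).

The final supervised contrastive loss is obtained by averaging over the batch using Equation (5):

$$\mathcal{L}_{\mathrm{SupCon}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i \tag{5}$$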
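The loss described in this subsection can be sketched in plain Python as follows. This is an illustrative implementation of the standard supervised contrastive formulation, not the authors' code: the temperature of 0.1 is an assumed placeholder, the anchor itself is excluded from the denominator, and anchors with no same-class partner in the batch are skipped, which is one common convention.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, tau=0.1):
    """Mean supervised contrastive loss over anchors that have at least one positive."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    losses = []
    for i in range(n):
        # Temperature-scaled cosine similarity to every other sample in the batch.
        sims = {a: sum(p * q for p, q in zip(z[i], z[a])) / tau for a in range(n) if a != i}
        positives = [p for p in sims if labels[p] == labels[i]]
        if not positives:
            continue  # no same-class partner, so this anchor contributes nothing
        m = max(sims.values())  # subtract the maximum for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        losses.append(-sum(sims[p] - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two tight groups in a 2-D embedding space.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supcon_loss(embeddings, labels=[0, 0, 1, 1])  # labels match the groups
loss_mixed = supcon_loss(embeddings, labels=[0, 1, 0, 1])    # labels cut across groups
```

The loss is small when samples of the same class already sit close together and large when same-class samples are far apart, which is the pressure that shapes the latent space during training.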
3.3. Tiny Experts Architecture
Each tiny expert takes the form of a lightweight MLP featuring two dense layers (24 and 16 units) with ReLU activations. This compact architectural design reduces computational overhead while preserving sufficient representational capacity for effective feature transformation. The transformation performed by an expert is given by Equation (6):

$$h(x) = \sigma\big(W_2\, \sigma(W_1 x + b_1) + b_2\big) \tag{6}$$

where $W_1$, $W_2$ are weight matrices, $b_1$ and $b_2$ are bias terms, and $\sigma(\cdot)$ denotes the ReLU activation function.
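A forward pass through one such expert can be sketched as below. The input width of 8 is an assumption (it matches the latent dimension k = 8 but the width of the embedding actually fed to the experts is not stated), and the uniform weight initialization is purely illustrative.

```python
import random


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]


def dense(x, weights, bias):
    """y_j = sum_i x_i * W[i][j] + b_j, with the weight matrix stored one row per input."""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j] for j in range(len(bias))]


def init_dense(n_in, n_out, rng):
    """Small random weights and zero biases (initialization scheme is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def tiny_expert(x, params):
    """Two dense layers, each followed by ReLU."""
    (w1, b1), (w2, b2) = params
    return relu(dense(relu(dense(x, w1, b1)), w2, b2))


rng = random.Random(42)
params = (init_dense(8, 24, rng), init_dense(24, 16, rng))  # 8 -> 24 -> 16 units
out = tiny_expert([0.1 * i for i in range(8)], params)
```

Under these assumptions an expert holds 8·24 + 24 + 24·16 + 16 = 616 trainable parameters, which illustrates why the experts are described as tiny.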
3.4. Mixture of Tiny Experts (MoTE)
The MoTE model introduces conditional computation through a trainable routing mechanism between a set of tiny experts. Given an input $x$, the router produces a probability distribution over $E$ experts:

$$g(x) = \mathrm{softmax}(W_r x + b_r) \tag{7}$$

where $g(x) \in \mathbb{R}^{E}$, and $E$ is the number of experts.

For computational efficiency, activation is limited to the $k$ experts with the highest routing probabilities for any given input. The aggregated representation results from weighting and summing the outputs of the selected experts as in Equation (8):

$$z = \sum_{e \in \mathcal{T}_k(x)} g_e(x)\, h_e(x) \tag{8}$$

where $h_e(x)$ is the output of expert $e$ and $\mathcal{T}_k(x)$ denotes the set of top-$k$ selected experts.
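The routing step can be illustrated with a short sketch in which only the selected experts are evaluated. The router logits are passed in directly and the experts are toy stand-ins, so everything here apart from the top-k selection and the weighted sum is hypothetical; in particular, the routing weights are not renormalized over the selected experts, which is a design choice the text does not specify.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mote_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k experts with the highest routing probability."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    outputs = {e: experts[e](x) for e in top}  # unselected experts are never called
    dim = len(next(iter(outputs.values())))
    z = [sum(probs[e] * outputs[e][d] for e in top) for d in range(dim)]
    return z, sorted(top)


calls = []  # records which experts were actually evaluated


def make_expert(e):
    """Toy expert that scales its input by its own index."""
    def expert(x):
        calls.append(e)
        return [float(e) * v for v in x]
    return expert


experts = [make_expert(e) for e in range(6)]  # six experts, as in the experimental setup
router_logits = [0.0, 2.0, 0.0, 1.0, 0.0, 0.0]
z, selected = mote_forward([1.0, 2.0], router_logits, experts, k=2)
```

Because the four unselected experts are never invoked, the per-sample cost depends on k and not on the total number of experts.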
3.5. Classification Head and Joint Optimization
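The aggregated embedding $z$ is passed to a linear classifier to obtain class predictions:

$$\hat{y} = \mathrm{softmax}(W_c z + b_c)$$

The supervised classification objective is defined using the cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$

where $y_c$ is the one-hot ground-truth indicator for class $c$ and $C$ is the number of classes.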
To jointly optimize representation learning and classification performance, the total training objective was formulated as follows:
Figure 1 provides an illustration of the complete architectural design of the VAE-encoder-MoTE model. A supervised contrastive learning layer is incorporated to enhance class separability within the latent space by drawing samples of the same category nearer to each other, while simultaneously pushing samples from different categories farther away from each other.
3.6. The MLP Architecture
The MLP architecture comprises four hidden dense layers (128 → 64 → 32 → 16 units) with ReLU activation. The output layer uses the softmax activation function to produce normalized class probability distributions. To protect the model from overfitting, regularization was introduced via a dropout rate of 0.1 after the third and fourth layers. The output from the VAE-encoder component was passed through the contrastive learning layer and then fed into the MLP model, as shown in Figure 2.
3.7. The GAT Architecture
GAT is a GNN model that incorporates the self-attention mechanism to learn the relative contribution of neighboring nodes. Because GNNs operate on graph-structured inputs, the low-dimensional embeddings of the IoT traffic instances were first transformed into a graph structure. To construct the graph and determine node neighborhoods, the approach proposed in [4] was adopted. Specifically, the k-NN algorithm with
, and
distance metric was used to establish connections between nodes. The graph was constructed on the embeddings produced by the contrastive learning layer, which was placed between the VAE encoder and the k-NN algorithm, as shown in Figure 3.
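As an illustration of how a k-NN graph can be built from embeddings, the sketch below links every node to its nearest neighbours. Both the neighbourhood size (k = 2) and the Euclidean metric are placeholders chosen for this example, since the values used in the study are not recoverable from the text, and the brute-force search shown here would need to be replaced by an approximate or tree-based method at dataset scale.

```python
import math


def knn_edges(points, k):
    """Directed edges (i, j) linking each node i to its k nearest neighbours."""
    edges = []
    for i, p in enumerate(points):
        # Distance to every other node; self-loops are excluded.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        edges.extend((i, j) for _, j in dists[:k])
    return edges


# Toy embeddings forming two well-separated groups.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
edges = knn_edges(points, k=2)
```

In this toy case no edge crosses between the two groups, so message passing in a GAT would aggregate information only from similar traffic instances. With overlapping classes the same construction can connect instances of different classes.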
3.8. Datasets
The three model architectures were trained and evaluated using two well-studied benchmark datasets: the N-BaIoT dataset [10] and the CICIoT2022 dataset [52].
3.8.1. N-BaIoT Dataset
To overcome the limited availability of real-world IoT botnet traffic, the N-BaIoT dataset was introduced by [10] as a comprehensive benchmark derived from operational IoT devices. The dataset contains 115 features, computed as 23 statistical descriptors across five temporal windows, extracted from NetFlow traffic captured in a controlled yet realistic testbed environment comprising nine devices. The devices were configured to run two prominent IoT malware families, Mirai and BashLite. Following preprocessing to remove duplicate records, the dataset was reduced from 6,331,884 to 2,482,470 instances and subsequently employed to train and evaluate each of the three model architectures under binary classification and fine-grained ten-class classification settings, with percentage class distributions of {“Normal”: 21.52%, “mirai_udp”: 23.30%, “mirai_syn”: 13.29%, “mirai_ack”: 11.74%, “mirai_scan”: 10.74%, “gafgyt_udp”: 4.51%, “gafgyt_combo”: 2.61%, “gafgyt_junk”: 1.31%, “gafgyt_scan”: 1.30%, “mirai_udpplain”: 9.66%}.
3.8.2. CICIoT2022 Dataset
In contrast to N-BaIoT, which focuses on a limited number of devices, the CICIoT2022 dataset [52] provides a substantially broader representation of IoT ecosystems by incorporating traffic from 60 heterogeneous devices. This increased device and protocol diversity facilitates a more comprehensive characterization of IoT traffic behavior under benign and malicious conditions. Feature extraction was performed on the released .pcap files using the revised CICFlowMeter (https://github.com/GintsEngelen/CICFlowMeter, accessed on 23 August 2025). The resulting dataset was composed of more than 3.21 million data samples and 84 training features. The dataset was annotated using directory-based labels, resulting in five traffic classes. The sample distribution across the five classes is {“Normal”: 80.870%, “HTTP flood”: 17.130%, “TCP flood”: 1.418%, “Brute force”: 0.379%, “UDP flood”: 0.203%}.
3.9. Data Preprocessing
Data preprocessing consisted of refining the datasets through the removal of outliers, duplicate records, invalid samples, and missing values. In addition, non-informative attributes were excluded, including “Flow ID”, which served as a unique ID for each NetFlow record; “Src IP” and “Dst IP”, which associated traffic with specific source and destination addresses; and “Timestamp”, which linked NetFlows to temporal information. Subsequently, the training variables were scaled to take on values in the range of 0 to 1.
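Two of these steps, duplicate removal and scaling to the range 0 to 1, can be sketched as follows. The helper names and the toy rows are hypothetical, and mapping constant columns to 0.0 is an assumption made here to avoid division by zero.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def min_max_scale(rows):
    """Scale every column to [0, 1]; constant columns are mapped to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - lo for c, lo in zip(cols, lows)]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


# Toy feature rows; the third row duplicates the first and the last column is constant.
raw = [[10.0, 200.0, 1.0], [20.0, 400.0, 1.0], [10.0, 200.0, 1.0], [30.0, 300.0, 1.0]]
scaled = min_max_scale(drop_duplicates(raw))
```

When the data are later split into training and test sets, the column minima and ranges are usually computed on the training portion only and then reused for the test portion, so that no test-set statistics leak into training.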
3.10. Experimental Setup and Model Training
The training process spanned 30 epochs, utilizing the Adam optimizer with a learning rate set to . Via empirical analysis the final value of temperature parameter in the supervised contrastive loss was . The architecture employs six experts with top-2 expert selection per input sample. The dataset was divided into train (80%) and test (20%) sets while a batch size of 128 was used during model training for all model architectures.
To mitigate the risk of performance overestimation associated with a single train–test split, all experiments were conducted over 10 independent runs for each model and classification task. In each run, the dataset was randomly shuffled prior to splitting, with a different random seed applied to ensure variability across splits. The set of seeds used was 10, 42, 116, 17, 37, 1412, 73, 100, 98, 3200. For every experimental configuration, the average and variance of each performance metric were computed and reported.
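The repeated-split protocol can be outlined as below, using the seed list given above. The `evaluate` function is a hypothetical placeholder for training and testing a model, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population variance was computed.

```python
import random
import statistics

SEEDS = [10, 42, 116, 17, 37, 1412, 73, 100, 98, 3200]  # seeds listed in the text


def split_indices(n, seed, train_frac=0.8):
    """Shuffle the indices 0..n-1 with the given seed and split them 80/20."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]


def summarize(scores):
    """Mean and sample standard deviation across runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def evaluate(train_idx, test_idx):
    """Hypothetical placeholder: a real run would train a model and return a test metric."""
    return len(test_idx) / (len(train_idx) + len(test_idx))


scores = [evaluate(*split_indices(1000, seed)) for seed in SEEDS]
mean_score, std_score = summarize(scores)
```

Fixing the seed per run keeps every split reproducible while still varying the partition from run to run.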
The training procedure was conducted in two stages. First, the VAE encoder and the contrastive learning layer were trained to learn a structured latent representation. Subsequently, each classifier was trained independently using the learned embeddings. During the classifier training phase, the parameters of the pretrained VAE encoder and the contrastive learning layer were frozen, ensuring that only the classifier parameters were updated.
3.11. Performance Evaluation Metrics
Performance was assessed using standard classification measures: accuracy, precision, recall, and F1-score. Bar plots were used to visualize how each model performs relative to the others. For multiclass classification, macro averages are reported for precision, recall, and F1-score. These metrics provided a comprehensive assessment of overall performance as well as class-wise behavior, which is critical for security-sensitive applications such as IoT botnet detection.
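Macro averaging computes each metric per class and then takes the unweighted mean, so rare attack classes count as much as the dominant ones. A minimal sketch with hypothetical labels follows; classes with no predicted or no true samples are scored 0.0 here, which is one common convention.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1-score."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy three-class example with an under-represented class 2.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
scores = macro_scores(y_true, y_pred)
```

In this example accuracy is 0.75 while macro F1 is lower, because the missed sample of the smallest class weighs on its per-class score.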
4. Results
This section presents the results from a comprehensive evaluation of the proposed approach across binary and multiclass classification tasks on the CICIoT2022 and N-BaIoT datasets. The performance of the three models (VAE-Encoder-MoTE, VAE-Encoder-MLP, and VAE-Encoder-GAT) is reported before and after the application of contrastive learning. All results represent the mean performance obtained across the 10 independent runs conducted using varying data partitions and initialization seed values, and the outcomes are reported as mean ± standard deviation. The highest values of accuracy, precision, recall, and F1-score obtained for each experimental setting are presented in boldface in the corresponding tables.
4.1. Binary Classification on the N-BaIoT Dataset
Table 1 summarizes the empirical findings for all models on the N-BaIoT dataset for the binary classification task, evaluated before and after the application of contrastive learning. The comparative outcomes are presented in Figure 4a and Figure 4b, showing performance prior to and after the incorporation of contrastive learning, respectively.
The results indicate that the application of contrastive learning consistently improves performance across all models, with the VAE-Encoder-MoTE model achieving the best overall performance.
4.2. Binary Classification on the CICIoT2022 Dataset
Table 2 summarizes the results on the CICIoT2022 dataset. The corresponding performance comparisons are illustrated in Figure 5a and Figure 5b, which depict the results prior to and after the incorporation of contrastive learning, respectively.
4.3. Multiclass Classification on the N-BaIoT Dataset
Table 3 reports the multiclass classification results on the N-BaIoT dataset. The corresponding performance comparisons are illustrated in Figure 6a and Figure 6b, which depict the results prior to and after the incorporation of contrastive learning, respectively.
The gains from contrastive learning are also evident in the multiclass setting, with improved classification consistency across all models.
4.4. Multiclass Classification on the CICIoT2022 Dataset
Table 4 summarizes the multiclass performance outcomes on the CICIoT2022 dataset. The corresponding performance comparisons are illustrated in Figure 7a and Figure 7b, which depict the results prior to and after the incorporation of contrastive learning, respectively.
Consistent with previous observations, contrastive learning enhances model performance in the multiclass scenario, demonstrating its robustness across datasets and classification settings.
5. Discussion
The findings highlight the strong performance achieved through the integration of VAE-based latent representation learning with supervised contrastive learning for IoT botnet detection across both datasets and classification settings.
5.1. Impact of Contrastive Learning
A consistent improvement is observed after incorporating supervised contrastive learning across all models and tasks. In the binary setting (Table 1 and Table 2; Figure 4 and Figure 5), performance gains are modest due to already-saturated accuracy levels. However, reduced standard deviations indicate improved stability. In contrast, the multiclass setting (Table 3 and Table 4; Figure 6 and Figure 7) shows more substantial improvements, confirming that contrastive learning enhances class separability in complex classification scenarios.
5.2. Model Comparison
The VAE-encoder-MoTE model demonstrates competitive performance across all tasks, achieving the highest or near-highest accuracy in binary classification (Table 1 and Table 2). Its low variance across runs indicates stable learning. In the multiclass setting, MoTE benefits significantly from contrastive learning, suggesting that expert specialization is more effective when the latent space is well-separated.
The VAE-encoder-MLP model achieves the best overall performance in multiclass classification, particularly on the CICIoT2022 dataset (Table 4). This suggests that well-structured latent representations can enable strong performance even with relatively simple classifiers.
The VAE-encoder-GAT model consistently underperforms in multiclass scenarios (Table 3 and Table 4). This may be attributed to sensitivity to graph construction and the fixed k-NN strategy used, which may not adequately model intricate relationships within the dataset. Performance degradation is more evident on the CICIoT2022 dataset, indicating limited robustness under heterogeneous and imbalanced conditions.
5.3. Binary vs. Multiclass Performance
All models achieve near-perfect results in binary classification, indicating that detecting attack traffic from normal samples is relatively straightforward in the learned latent space. However, multiclass classification remains more challenging, with noticeable performance gaps between models. This highlights the importance of discriminative representation learning, particularly for fine-grained attack categorization.
5.4. Dataset Effects
Performance is consistently higher on the N-BaIoT dataset than on CICIoT2022. This suggests that N-BaIoT exhibits more separable class structures, whereas CICIoT2022 introduces greater heterogeneity and class imbalance, leading to reduced recall and F1-scores (Table 4). The MoTE model shows improved recall under these conditions, indicating its potential for handling diverse traffic patterns.
An investigation of which N-BaIoT instances were misclassified in the multiclass setting revealed that 684 instances of the “gafgyt_combo” attack category were wrongly classified as “gafgyt_junk” by the VAE-Encoder-MoTE model. Similar patterns were recorded for the other two models. Following this, t-SNE was used to visualize the spatial distribution of three categories: correctly classified “gafgyt_combo” instances, correctly classified “gafgyt_junk” instances, and “gafgyt_combo” instances misclassified as “gafgyt_junk”. The test dataset was passed through the VAE-Encoder-MoTE model pipeline, and the representations immediately preceding the output layer were extracted and mapped into a two-dimensional space using t-SNE. The t-SNE output is presented in Figure 8. A similar pattern was observed when investigating the misclassified instances on the CICIoT2022 dataset.
The t-SNE visualization reveals that the misclassified “gafgyt_combo” instances tend to concentrate in intermediate or overlapping regions between the primary clusters. These observations suggest two plausible causes of misclassification: (1) feature ambiguity, where the misclassified samples exhibit attack patterns, traffic statistics, or payload characteristics more closely aligned with “gafgyt_junk” than with “gafgyt_combo”, thereby necessitating additional feature engineering or the adoption of a more sophisticated detection model to effectively discriminate between the two classes; (2) traffic labeling errors [53], where some traffic instances are incorrectly assigned to the wrong categories. This finding underscores the importance of evaluating deep learning models on multiple datasets to ensure more reliable and generalizable outcomes.
5.5. Practical Implications
From a deployment perspective, the findings reveal a compromise between model accuracy and resource utilization. Although the MLP model achieves the highest accuracy, the MoTE architecture offers an effective equilibrium between detection capability and computational overhead due to its conditional computation mechanism. This makes it a strong candidate for implementation in IoT settings with limited computational resources.
Overall, the findings confirm that combining VAE-based dimensionality reduction with supervised contrastive learning provides a robust framework for IoT botnet detection, with model selection depending on task complexity and deployment constraints.
5.6. Computational Cost Analysis
VAE encoder and contrastive layer time complexity: Both the VAE and the contrastive layer architectures in this study used dense layers as the core computational component. The time complexity of a dense layer with inputs and neurons is . For a VAE encoder with a dense layers and constant sampling , the time complexity is . If the contrastive layer deploys b dense layers, the total time complexity of the VAE encoder and contrastive learning layer is , where .
MLP and MoTE time complexity: Like the VAE and the contrastive learning layer, the MLP and MoTE in this work consisted of dense layers, with the MoTE using fewer layers (only two layers in this case) and far fewer computational units per layer to make it more lightweight. In addition, the MoTE has a routing operation of constant time complexity . Therefore, the time complexity of MLP and MoTE can be expressed as and which is simply ; where d and k with () are the number of layers in MLP and MoTE, respectively, while E is the number of tiny experts in MoTE. Therefore, the overall time complexities of the VAE-Encoder-MLP and VAE-Encoder-MoTE can be expressed as and where and , respectively.
GAT time complexity: This incorporates a time complexity
for constructing graph with
V vertices (each of latent dimension
) and
E edges and the time complexity
for a GAT with
n GATConv layers, attention heads
H, and output feature size
K per head.
Table 5 summarizes the computational complexity associated with using each of the detection frameworks.
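To give a feel for the relative cost of the two dense classifiers, the sketch below counts multiply-accumulate operations (MACs) per input sample from the layer widths reported in Section 3. It is a back-of-the-envelope estimate under stated assumptions: the classifier input is taken to be 8-dimensional (the latent size), the router is assumed to be a single linear layer, and biases, activations, and the final classification layer are ignored.

```python
def dense_macs(layer_sizes):
    """Multiply-accumulate count for a stack of dense layers, for one input sample."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


LATENT_DIM = 8   # assumed classifier input width (k = 8 in the text)
NUM_EXPERTS = 6  # six tiny experts
TOP_K = 2        # top-2 expert selection

mlp_macs = dense_macs([LATENT_DIM, 128, 64, 32, 16])  # four hidden layers of the MLP
expert_macs = dense_macs([LATENT_DIM, 24, 16])        # one tiny expert
router_macs = LATENT_DIM * NUM_EXPERTS                # assumed single linear router
mote_macs = router_macs + TOP_K * expert_macs         # only the selected experts run
```

Under these assumptions the MLP needs 11,776 MACs per sample against 1,200 for the MoTE, roughly a tenfold difference, which is consistent with the lightweight design argued for in this section.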
5.7. Comparison with Benchmark Studies
Table 6 presents a comparative analysis of the proposed VAE-Encoder-MoTE method against approaches reported in prior studies. To ensure a fair and consistent comparison, only methods evaluated on the same datasets (CICIoT2022 and N-BaIoT) and measured using identical metrics were considered. For each study, the highest reported score on each evaluation metric was compared with the highest score recorded by the VAE-Encoder-MoTE model. As indicated by the boldface values in Table 6, the proposed model achieved competitive performance across both datasets and outperformed state-of-the-art methods.
6. Conclusions and Future Work
This study presented a comparative evaluation of three hybrid deep learning architectures for IoT botnet detection: VAE-encoder-MLP, VAE-encoder-GAT, and VAE-encoder-MoTE. The proposed framework utilized a variational autoencoder to compress high-dimensional IoT traffic features into a compact latent representation, which was subsequently used to train downstream classifiers under a supervised contrastive learning objective. This training strategy encouraged intra-class compactness and inter-class separability in the embedding space, thereby improving the discriminative quality of the learned representations. In addition, the Mixture of Tiny Experts (MoTE) architecture was investigated as a lightweight conditional computation strategy designed for resource-constrained IoT environments.
Experimental results on the CICIoT2022 and N-BaIoT datasets demonstrated that the proposed architectures achieve near-perfect performance in binary attack detection, with accuracy exceeding across both datasets. In the more challenging multiclass scenario, the VAE-encoder-MLP model achieved the best overall performance, particularly on the N-BaIoT dataset, indicating that well-structured representations can enable highly accurate intrusion detection even with relatively simple classifiers. The MoTE architecture achieved competitive results and demonstrated improved recall on the CICIoT2022 dataset, highlighting the benefits of expert specialization and conditional routing. In contrast, the GAT-based model showed lower performance in multiclass settings, suggesting sensitivity to the graph construction method and dataset imbalance.
Future work will focus on improving graph construction strategies for GNN-based models, enhancing expert routing mechanisms in MoTE architectures, compression via quantization, and incorporating temporal modeling techniques to better capture evolving IoT attack patterns.