Hybrid Deep Architectures in Contrastive Latent Space: Performance Analysis of VAE-MLP, VAE-MoTE, and VAE-GAT for IoT Botnet Detection

Wasswa, Hassan; Lynar, Timothy

doi:10.3390/iot7020041


by Hassan Wasswa * and Timothy Lynar

School of Systems and Computing, University of New South Wales, Canberra, ACT 2600, Australia

* Author to whom correspondence should be addressed.

IoT 2026, 7(2), 41; https://doi.org/10.3390/iot7020041

Submission received: 14 March 2026 / Revised: 21 April 2026 / Accepted: 9 May 2026 / Published: 12 May 2026

(This article belongs to the Special Issue Cybersecurity in the Age of the Internet of Things)


Abstract

The rapid proliferation of Internet of Things (IoT) devices has significantly expanded the attack surface of modern networks, leading to a surge in IoT-based botnet attacks. Detecting such attacks remains challenging due to the high dimensionality and heterogeneity of IoT network traffic. This study proposes and evaluates three hybrid deep learning architectures for IoT botnet detection that combine representation learning with supervised classification: VAE-encoder-MLP, VAE-encoder-GAT, and VAE-encoder-MoTE. A Variational Autoencoder is initially trained to learn a compact latent representation of the high-dimensional traffic features. Subsequently, the pretrained VAE-encoder component is employed to project the data into a lower-dimensional embedding space. These embeddings are then used to train three different downstream classifiers: a multilayer perceptron (MLP), a graph attention network (GAT), and a mixture of tiny experts (MoTE) model. To further enhance representation discriminability, supervised contrastive learning is incorporated to encourage intra-class compactness and inter-class separability. The proposed architectures are evaluated on two widely studied benchmark datasets, CICIoT2022 and N-BaIoT, under both binary and multiclass classification settings. Experimental results demonstrate that all three models achieve near-perfect performance in binary attack detection, with accuracy exceeding 99.8%. In the more challenging multiclass scenario, the VAE-encoder-MLP model achieves the best overall performance, reaching accuracies of 98.55% on CICIoT2022 and 99.75% on N-BaIoT. These findings provide insights into the design of efficient and scalable deep learning architectures for IoT intrusion detection.

Keywords: IoT botnet detection; graph neural networks; graph attention network; mixture of experts; mixture of tiny experts; contrastive learning; variational autoencoder; latent representation learning

1. Introduction

The Internet of Things (IoT) is now an indispensable component of contemporary civilization, underpinning a multitude of applications at both individual and enterprise levels. However, despite the widespread adoption, the IoT network remains highly susceptible to cyberattacks due to inherent device constraints such as limited computational resources, weak security configurations, and heterogeneous deployment environments. In particular, IoT botnet-driven attacks, most notably DDoS attacks, continue to present major security risks to enterprises and institutions across diverse industry verticals. To mitigate these challenges, a broad spectrum of artificial intelligence (AI)-based attack detection strategies have been introduced and extensively evaluated to strengthen the resilience of IoT ecosystems. For example, prior studies have explored advanced AI-driven techniques based on federated learning [1,2,3], graph neural networks [4,5,6], transformer-based architectures [7,8,9], autoencoder and variational autoencoder models [10,11,12], latent space arithmetic and alignment-based methods [13,14,15], and cost-sensitive learning strategies designed to address class imbalance in IoT security datasets [16,17,18,19].

However, despite their demonstrated potential for improving computational efficiency in domains such as NLP and LLMs [20,21,22], Mixture of Experts (MoE) [23,24] methods remain relatively unexplored in the context of IoT botnet detection. In the MoE design, computation is dynamically routed such that only a small number of expert networks are utilized for each data instance. To ensure feasibility on low-resource IoT edge devices, we adopt the mixture of tiny experts (MoTE) [25], a variant that leverages lightweight expert models for per-instance classification. This mechanism substantially reduces inference latency, energy consumption, and memory overhead, while preserving strong representational capacity through expert specialization.

This work introduces the VAE-encoder-MoTE framework in which a VAE encoder first transforms high-dimensional IoT botnet traffic features into a compact latent embedding space. The MoTE model is then trained on these low-dimensional embeddings, enabling efficient learning while retaining the critical structural information necessary for accurate attack detection. This integration of latent representation learning with conditional expert activation achieves a strong trade-off across detection accuracy and computational cost, positioning the method as well-suited for deployment on IoT systems with limited resources.

To further improve the discriminative quality of the learned representations, supervised contrastive learning [26,27] is incorporated during model training. This training approach explicitly promotes instances from the same attack category to form tight groupings in the latent representation, while ensuring that instances from different categories are separated by larger distances. As a result, the learned feature space exhibits improved intra-class compactness and inter-class separability, which enhances generalization capability, robustness to class imbalance, and resilience to evolving IoT traffic patterns. In addition to the proposed VAE-encoder-MoTE architecture, two alternative models, namely VAE-encoder-MLP and VAE-encoder-GAT, are also trained and evaluated. A comparative analysis of the three architectures is conducted using standard performance metrics.

The main contributions of this study are:

(1) A hybrid framework that combines VAE-based latent representation learning with an MoTE architecture for IoT botnet detection is introduced. The approach compresses high-dimensional IoT traffic data into compact embeddings and applies conditional expert routing to enable efficient and scalable detection.
(2) The study incorporates supervised contrastive learning to enhance the discriminative power of the latent feature space. By encouraging embeddings of samples from the same attack class to cluster together while separating different classes, the approach improves intra-class compactness and inter-class separability.
(3) A systematic empirical evaluation is conducted comparing the proposed VAE-encoder-MoTE model against the VAE-encoder-MLP and VAE-encoder-GAT models. The comparison assesses their effectiveness under both binary and multiclass IoT botnet detection scenarios. Performance is analyzed using multiple metrics, providing a detailed understanding of the strengths and limitations of each architectural design.
(4) Through extensive empirical validation on benchmark IoT botnet datasets, the study provides insights into how architectural choices and dataset characteristics influence detection performance in IoT intrusion detection systems.

The remainder of this paper is organized as follows. Section 2 reviews the relevant literature on supervised contrastive learning, mixture-of-experts architecture, and GNNs for attack detection. Section 3 presents the proposed framework, including the VAE-based representation learning approach, the MoTE architecture, and the experimental setup. Section 4 reports the experimental results obtained under both binary and multiclass classification settings. Section 5 provides an in-depth analysis and interpretation of the findings. Section 6 summarizes the key contributions and outlines prospects for future work.

2. Related Work

This section surveys existing literature relevant to three key areas: (1) supervised contrastive learning, (2) mixture of experts and tiny expert architectures, and (3) graph neural networks for intrusion detection.

2.1. Supervised Contrastive Learning

Contrastive learning is a representation learning framework where models are optimized to reduce the separation between related samples (positive pairs) while maximizing the separation between unrelated samples (negative pairs) in an embedding space. Supervised contrastive learning broadens this framework by leveraging label information to establish richer semantic relationships between samples, unlike purely unsupervised approaches that rely only on data augmentation without considering class labels [26,28]. By encouraging examples from the same category to be drawn closer together in the learned representation space, supervised contrastive objectives often produce features that are more discriminative and transferable for classification tasks [26,29]. In comparison with traditional supervised objectives such as cross-entropy, supervised contrastive learning has yielded better resilience to data augmentation, label noise, and dataset biases [26,30]. Furthermore, contrastive objectives tend to yield smoother and better-structured representation spaces, which are advantageous for transfer learning and downstream applications [28,31].

The study in [27] investigated the geometric structure of representations learned using supervised contrastive loss in comparison with those obtained through conventional cross-entropy training. The authors analyze how both loss functions encourage the formation of compact class clusters within the embedding space and theoretically demonstrate that optimal representations under both objectives converge to vertices forming a regular structure positioned over a high-dimensional sphere. Under this configuration, samples from the same class become tightly grouped while samples from different classes remain maximally separated. The study also provided empirical evidence linking this geometric structure to improved generalization performance.

The work presented in [32] proposes a weakly supervised contrastive learning framework designed to bridge instance discrimination with supervised information. Rather than treating each sample as an independent class, the proposed method introduces two projection heads: one dedicated to conventional instance discrimination and another that utilizes graph-based similarity relationships to assign weak labels across samples. Samples sharing these weak labels are considered positive pairs under a supervised contrastive objective, thereby encouraging the learning of semantically meaningful representations. To further enrich positive sample diversity, the framework incorporates a k-nearest neighbor multi-crop strategy. Experiments conducted on standard computer vision benchmarks demonstrated improvements in representation quality and competitive semi-supervised classification performance when only limited labeled data are available.

With respect to cybersecurity applications—particularly intrusion detection—the study in [33] introduced FeCo (Federated Contrastive Learning), a framework that leverages contrastive objectives to align feature embeddings of network traffic collected from distributed IoT devices. The approach coordinates multiple local models so that they learn shared representations capable of distinguishing normal network behavior from malicious activity by leveraging both feature and label similarities during contrastive training. Through the federated contrastive loss, the encoder learns an embedding space where semantically related network traffic patterns are positioned closer together, thereby improving the effectiveness of intrusion detection across heterogeneous deployment environments.

The research in [34] further extends contrastive learning principles to sequence modeling and transformer-based architectures. In this approach, sequences of network events are encoded using transformer models, and contrastive learning is employed to differentiate benign from malicious patterns. Positive pairs are constructed from similar event sequences, whereas negative pairs represent dissimilar sequences. This supervised contrastive framework improves the discriminative capacity and robustness of intrusion detection systems, particularly in settings where annotated data is limited, highlighting the adaptability of the contrastive approach beyond traditional computer vision domains.

2.2. Mixture of Experts and Tiny Expert Architectures

The MoE framework represents a form of conditional computation in which a neural network is partitioned into multiple expert sub-networks that are orchestrated by a routing or gating mechanism. For any sample, a subset of experts is activated, enabling the model to increase its representational capacity while avoiding the full computational overhead associated with dense model inference. This strategy has been widely adopted to address scalability and efficiency challenges in deep learning, particularly in applications requiring large model capacity or dealing with heterogeneous input distributions [35,36].

Recent developments have introduced the concept of tiny experts, where each expert module is deliberately designed to be extremely lightweight. These experts often consist of minimal parameter structures such as shallow multilayer perceptrons or even simple single-layer transformations. The work in [25] formalizes this concept through the Mixture of a Million Experts framework, demonstrating that a very large collection of small experts can be efficiently accessed using parameter-efficient retrieval strategies while activating only a limited subset for each input. Related research has further explored fine-grained expert decomposition and sparse activation mechanisms to minimize redundancy and enhance parameter efficiency [37,38,39].

Tiny expert architectures provide several advantages compared with conventional MoE models, particularly in environments with constrained computational resources such as IoT edge devices. Traditional MoE designs typically employ moderately large expert networks, which can result in increased memory consumption and higher inference latency, making them unsuitable for resource-limited edge hardware. In contrast, architectures composed of numerous tiny experts enable highly selective activation, lower computational cost per expert, and greater flexibility in adapting to dynamic traffic patterns. These characteristics make MoTE particularly suitable for lightweight intrusion and botnet detection systems [40,41,42,43,44]. Such properties are critical in IoT security applications, where real-time threat detection, low power consumption, and resilience against evolving attack strategies are essential.

2.3. Graph Neural Networks for Intrusion Detection

Recent research has focused on combining graph-based neural methods with transformer-style attention mechanisms to strengthen intrusion detection. Such combined frameworks leverage the distinct yet complementary advantages of both approaches: graph neural models are well-suited for learning intricate relationships between interconnected nodes, whereas transformer architectures are highly effective at capturing long-range dependencies within data. Bringing these techniques together has been found to boost both the accuracy and resilience of detection systems. For example, the study in [45] introduced a graph attention network-driven intrusion detection framework specifically designed for diverse IoT settings, achieving encouraging results on the NSL-KDD dataset. Similarly, [46] developed a hybrid architecture that integrates a GNN with a transformer to jointly capture structural and contextual dependencies, while [47] combined graph convolution with attention mechanisms to detect IoT botnets through device interaction modeling. In the context of electric vehicle charging networks, [48] demonstrated that integrating transformer components with GNNs improves the modeling of complex feature relationships for cyber-attack detection.

Several studies have concentrated on developing sophisticated GNN architectures tailored for IoT security. For instance, the EGAT-LSTM model proposed in [49] integrates an improved graph attention mechanism alongside an LSTM component, enabling it to learn both structural dependencies and evolving traffic dynamics over time. This integration significantly improves malicious traffic classification compared with approaches that rely solely on individual flow features. Similarly, [50] proposed AJSAGE, an attention-enhanced GraphSAGE model designed to improve anomalous node detection in graph-structured network attack datasets, particularly for complex and evolving threats.

Dimensionality reduction techniques have also been examined as complementary strategies for enhancing GNN-based detection performance. The study in [12] compared AE-encoder, VAE-encoder, and PCA methods for reducing the high-dimensional CICIoT2022 dataset to a low-dimensional space before graph construction. Among these approaches, the VAE-encoder produced the best results. In addition, for kNN-based graph generation, the configuration with n_neighbors = 3 and the Euclidean distance metric achieved the strongest performance, emphasizing the importance of both high-quality latent representations and appropriate graph construction parameters.

Despite these advances, comparative evaluations suggest that GNN-based models still require further refinement to achieve competitive performance. In [4], the authors compared VAE-GCN and VAE-GAT against VAE-MLP and ViT-MLP models utilizing the 8-dimensional latent vectors derived from the N-BaIoT dataset. Although VAE-GAT demonstrated better performance than VAE-GCN, both GNN models generally underperformed in comparison with the alternative learning frameworks. These findings highlight the need for improved architectural designs and more effective integration strategies when applying GNNs to intrusion detection systems.

3. Methodology

This study proposes the VAE-encoder-MoTE model and systematically evaluates its detection performance against two other advanced hybrid deep learning architectures—VAE-encoder-MLP and VAE-encoder-GAT—for IoT botnet detection under both binary and multiclass classification settings. To boost the discriminative quality of the learned representations, supervised contrastive learning is employed to promote intra-class compactness and inter-class separability, thereby enhancing detection performance.

To tackle the complexity of high-dimensional IoT traffic data, a VAE with a latent space of dimension k (set to k = 8 [4,9,12] in this study) is initially trained on the original feature space. After training, only the encoder part of the VAE is preserved. For each model architecture, this pretrained encoder maps the inputs into compact, lower-dimensional vectors. The resulting reduced-dimension training data is subsequently used to train an MLP, a GAT, and an MoTE model under a supervised contrastive learning framework.

3.1. Variational Autoencoder for Dimensionality Reduction

Dimensionality reduction has become a key strategy for building efficient detection systems. Common approaches include feature selection methods (filter, wrapper), PCA, and deep learning techniques such as Auto-encoders (AEs) and Variational Auto-encoders (VAEs). Filter methods apply statistical measures to score and prioritize features, while wrapper methods iteratively evaluate model performance with different feature subsets. PCA reduces dimensionality by projecting data onto directions of maximum variance. However, these techniques have limitations: filter methods may produce suboptimal subsets, wrapper methods are computationally expensive, and PCA ignores class labels and struggles with nonlinear data. AEs offer an alternative by learning compact latent representations, but their lack of explicit regularization can lead to poorly structured latent spaces, reducing generalization performance and interpretability. Therefore, the VAE was selected for this task.

The VAE, originally proposed in [51], is a generative model that introduces a structured and regularized latent space, distinguishing it from traditional autoencoders. Given a dataset $X$, the generative process assumes a variable $z$ drawn from a prior $p_{\theta}(z)$, followed by sampling from the conditional distribution $p_{\theta}(x \mid z)$. Because direct optimization of $\log p_{\theta}(x)$ is computationally intractable, the true posterior is approximated using a variational distribution $q_{\phi}(z \mid x)$, parameterized by an encoder network. The model is then trained by maximizing the evidence lower bound (ELBO), as expressed in Equation (1).

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right] - D_{KL}\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\right) \tag{1}$$

Here, the Kullback–Leibler (KL) divergence term $D_{KL}\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\right)$ acts as a regularizer, encouraging the learned latent distribution to stay close to the prior. As a result, the VAE simultaneously learns a compact latent representation and a probabilistic generative model, supporting both accurate reconstruction and data generation under uncertainty.
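As an illustration of Equation (1), the following minimal sketch evaluates the negative ELBO for one batch. It rests on assumptions that the text does not state: a diagonal Gaussian posterior parameterized by a mean and a log-variance, a standard normal prior (which gives the KL term a closed form), and a squared-error reconstruction term standing in for the log-likelihood. The function names `gaussian_kl` and `negative_elbo` are hypothetical.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def negative_elbo(x, x_recon, mu, log_var):
    """Loss to minimise: reconstruction error plus the KL regulariser of Eq. (1).

    Assumption: squared error replaces -log p(x | z), i.e. a Gaussian likelihood
    up to constants.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)
    return float(np.mean(recon + gaussian_kl(mu, log_var)))

# Posterior equal to the prior: the KL term vanishes.
mu = np.zeros((4, 8))
log_var = np.zeros((4, 8))
kl_at_prior = gaussian_kl(mu, log_var)

# Shifting every one of the 8 latent means by 1 costs 0.5 per dimension.
kl_shifted = gaussian_kl(mu + 1.0, log_var)

# Perfect reconstruction with the posterior at the prior gives zero loss.
x = np.ones((4, 115))
loss_zero = negative_elbo(x, x, mu, log_var)
```

Minimizing this quantity is equivalent to maximizing the ELBO under the stated assumptions; the relative weighting of the two terms used by the authors is not reported.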

In this study, both the encoder and decoder networks comprised four fully connected hidden layers. The encoder employed a decreasing architecture with 128, 64, 32, and 16 neurons to transform the input into the latent representation, whereas the decoder adopted a symmetric increasing structure with 16, 32, 64, and 128 neurons to reconstruct the input from the latent representation. The LeakyReLU activation function was applied across all hidden layers.
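The encoder layout described above can be sketched as a plain forward pass. This is only a shape-level illustration with untrained random weights. The LeakyReLU slope of 0.01, the separate mean and log-variance output heads, and the input width of 115 (the N-BaIoT feature count) are assumptions, and `init_encoder` and `encode` are hypothetical names.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is an assumed slope; the paper does not report it
    return np.where(x > 0, x, alpha * x)

def init_encoder(input_dim, hidden=(128, 64, 32, 16), latent_dim=8, seed=0):
    """Random (untrained) weights for the 128-64-32-16 encoder and two output heads."""
    rng = np.random.default_rng(seed)
    sizes = (input_dim, *hidden)
    layers = [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]
    heads = {name: (rng.normal(0, 0.1, (hidden[-1], latent_dim)), np.zeros(latent_dim))
             for name in ("mu", "log_var")}
    return layers, heads

def encode(x, layers, heads):
    """Map inputs to the mean and log-variance of the approximate posterior."""
    h = x
    for W, b in layers:
        h = leaky_relu(h @ W + b)
    mu = h @ heads["mu"][0] + heads["mu"][1]
    log_var = h @ heads["log_var"][0] + heads["log_var"][1]
    return mu, log_var

layers, heads = init_encoder(input_dim=115)
x = np.random.default_rng(1).random((5, 115))   # five dummy traffic records
mu, log_var = encode(x, layers, heads)
layer_shapes = [W.shape for W, _ in layers]
```

With a latent dimension of 8, each record is reduced from 115 values to an 8-dimensional vector. Whether the downstream classifiers consume the posterior mean or a sampled latent vector is not specified in the text.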

3.2. Contrastive Learning

To promote tighter grouping of instances belonging to the same class and clearer separation across different classes in the learned embedding space, supervised contrastive learning is adopted. Given a batch of $B$ feature embeddings $\{z_i\}_{i=1}^{B}$, where $z_i \in \mathbb{R}^{k}$ and $k$ is the latent space dimension, the embeddings are first $\ell_2$-normalized using Equation (2):

$$\tilde{z}_i = \frac{z_i}{\| z_i \|_2} \tag{2}$$

The temperature-scaled cosine similarity between samples $\tilde{z}_i$ and $\tilde{z}_j$ is computed using Equation (3):

$$\mathrm{sim}(\tilde{z}_i, \tilde{z}_j) = \frac{\tilde{z}_i^{\top} \tilde{z}_j}{\tau} \tag{3}$$

where $\tau$ is a temperature scaling parameter.

The supervised contrastive loss for a sample $i$ is obtained using Equation (4):

$$\mathcal{L}_{SupCon}^{(i)} = - \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(i, p))}{\sum_{a \neq i} \exp(\mathrm{sim}(i, a))} \tag{4}$$

where $P(i)$ denotes the set of indices corresponding to samples that share the same class label as sample $i$; $\mathrm{sim}(i, p)$ is shorthand for $\mathrm{sim}(\tilde{z}_i, \tilde{z}_p)$, the similarity between sample $i$ and positive sample $p$; and $a$ ranges over every other sample in the batch (both positives and negatives).

The final supervised contrastive loss is obtained by averaging over the batch using Equation (5):

$$\mathcal{L}_{SupCon} = \frac{1}{B} \sum_{i=1}^{B} \mathcal{L}_{SupCon}^{(i)} \tag{5}$$
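Equations (2) to (5) can be traced in a few lines of NumPy. The sketch below is an illustration under two assumptions not stated in the text: the positive set of a sample excludes the sample itself, and a sample with no positive in the batch contributes zero to the batch average. The function name `supcon_loss` is hypothetical, and the batch is assumed to contain at least two samples with non-zero embeddings.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss following Eqs. (2)-(5)."""
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)

    z_norm = z / np.linalg.norm(z, axis=1, keepdims=True)   # Eq. (2)
    sim = z_norm @ z_norm.T / tau                            # Eq. (3)

    not_self = ~np.eye(n, dtype=bool)
    # Stable log of the denominator in Eq. (4): sum over all a != i.
    masked = np.where(not_self, sim, -np.inf)
    m = masked.max(axis=1, keepdims=True)
    log_denom = m + np.log(np.exp(masked - m).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # Assumption: P(i) holds same-label samples other than i itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0

    per_sample = np.zeros(n)   # samples without positives contribute 0 (assumption)
    pos_sum = np.where(positives, log_prob, 0.0).sum(axis=1)
    per_sample[has_pos] = -pos_sum[has_pos] / n_pos[has_pos]   # Eq. (4)
    return float(per_sample.mean())                            # Eq. (5)

# Two tight, well-separated groups of embeddings.
z = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_matching = supcon_loss(z, [0, 0, 1, 1])     # labels agree with the geometry
loss_mismatched = supcon_loss(z, [0, 1, 0, 1])   # labels cut across the groups
```

When the labels agree with the geometric grouping the loss is close to zero, and when they cut across it the loss is large, which is the behavior that drives same-class embeddings together during training.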

3.3. Tiny Experts Architecture

Each tiny expert takes the form of a lightweight MLP featuring two dense layers (24 and 16 units) with ReLU activations. This compact architectural design reduces computational overhead while preserving sufficient representational capacity for effective feature transformation. The transformation performed by an expert is given by Equation (6):

$$\mathrm{Expert}(x) = \sigma\left(W_2 \, \sigma(W_1 x + b_1) + b_2\right) \tag{6}$$

where $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are bias terms, and $\sigma(\cdot)$ denotes the ReLU activation function.
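To make the "tiny" qualifier concrete, the short calculation below counts the trainable parameters of one expert of the form in Equation (6). It assumes that each expert receives the 8-dimensional latent embedding as input, which is not stated explicitly. The counts of six experts and two active experts per sample are the settings reported in Section 3.10, and the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def tiny_expert_params(input_dim=8, hidden=(24, 16)):
    """Parameter count of a two-layer expert (24 and 16 units)."""
    sizes = (input_dim, *hidden)
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

expert_params = tiny_expert_params()   # (8*24 + 24) + (24*16 + 16)
all_experts = 6 * expert_params        # six experts held in memory
active_per_sample = 2 * expert_params  # only the top-2 experts are evaluated
```

Under these assumptions a single expert has 616 parameters, so the six experts together hold 3696 parameters while only 1232 take part in the expert computation for any one sample.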

3.4. Mixture of Tiny Experts (MoTE)

The MoTE model introduces conditional computation through a trainable routing mechanism between a set of tiny experts. Given an input $x$, the router produces a probability distribution over $E$ experts, as given by Equation (7):

$$g(x) = \mathrm{softmax}(W_r x + b_r) \tag{7}$$

where $g(x) \in \mathbb{R}^{E}$ and $E$ is the number of experts.

For computational efficiency, activation is limited to the $k$ experts with the highest routing probabilities for any given input (here $k$ is the number of active experts, not the latent space dimension used earlier). The aggregated representation results from weighting and summing the outputs of the selected experts, as in Equation (8):

$$z = \sum_{i \in K(x)} g_i(x) \cdot \mathrm{Expert}_i(x) \tag{8}$$

where $K(x)$ denotes the set of top-$k$ selected experts.
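The routing and aggregation in Equations (6) to (8) can be sketched as follows. The weights are random and untrained, the 8-dimensional input is assumed to be the latent embedding, and the names `init_mote`, `expert_forward`, and `mote_forward` are hypothetical. Following Equation (8) as written, the gate values of the selected experts are used directly, without renormalizing them over the top-k set.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relu(x):
    return np.maximum(x, 0.0)

def init_mote(input_dim=8, n_experts=6, hidden=(24, 16), seed=0):
    """Random (untrained) parameters for the experts and the router."""
    rng = np.random.default_rng(seed)
    experts = [
        {"W1": rng.normal(0, 0.3, (input_dim, hidden[0])), "b1": np.zeros(hidden[0]),
         "W2": rng.normal(0, 0.3, (hidden[0], hidden[1])), "b2": np.zeros(hidden[1])}
        for _ in range(n_experts)
    ]
    router = {"W": rng.normal(0, 0.3, (input_dim, n_experts)), "b": np.zeros(n_experts)}
    return experts, router

def expert_forward(x, p):
    return relu(relu(x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"])   # Eq. (6)

def mote_forward(x, experts, router, top_k=2):
    gates = softmax(x @ router["W"] + router["b"])      # Eq. (7)
    top = np.argsort(gates, axis=1)[:, -top_k:]         # indices in K(x)
    out = np.zeros((x.shape[0], experts[0]["b2"].shape[0]))
    for e, params in enumerate(experts):
        rows = np.where((top == e).any(axis=1))[0]      # samples routed to expert e
        if rows.size:
            # Only the routed samples are passed through this expert.
            out[rows] += gates[rows, e:e + 1] * expert_forward(x[rows], params)   # Eq. (8)
    return out, gates, top

experts, router = init_mote()
x = np.random.default_rng(1).random((5, 8))
z_agg, gates, top = mote_forward(x, experts, router)
```

Each expert is evaluated only on the samples routed to it, which is where the computational saving of conditional computation comes from.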

3.5. Classification Head and Joint Optimization

The aggregated embedding $z$ is passed to a linear classifier to obtain class predictions:

$$\hat{y} = \mathrm{softmax}(W_c z + b_c) \tag{9}$$

The supervised classification objective is defined using the cross-entropy loss:

$$\mathcal{L}_{CE} = - \sum_{c=1}^{C} y_c \log(\hat{y}_c) \tag{10}$$

To jointly optimize representation learning and classification performance, the total training objective was formulated as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{SupCon} + \mathcal{L}_{CE} \tag{11}$$
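A small numerical sketch of Equations (9) to (11) is given below. The classifier weights are set to zero purely so that the prediction is uniform and the result can be checked by hand, the cross-entropy is averaged over the batch (an assumption, since Equation (10) is written for a single sample), and the contrastive loss value of 0.5 is a placeholder rather than a reported number. All function names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def classify(z, W_c, b_c):
    """Linear classification head, Eq. (9)."""
    return softmax(z @ W_c + b_c)

def cross_entropy(probs, y_onehot, eps=1e-12):
    """Eq. (10), averaged over the batch (assumption)."""
    return float(np.mean(-np.sum(y_onehot * np.log(probs + eps), axis=1)))

def total_loss(l_supcon, l_ce):
    """Unweighted sum of the two objectives, Eq. (11)."""
    return l_supcon + l_ce

n_classes = 5
z = np.random.default_rng(0).random((4, 16))          # aggregated embeddings
W_c, b_c = np.zeros((16, n_classes)), np.zeros(n_classes)
y = np.eye(n_classes)[[0, 1, 2, 3]]                   # one-hot labels

probs = classify(z, W_c, b_c)                         # uniform: 1/5 per class
l_ce = cross_entropy(probs, y)                        # equals log(5) for uniform output
l_total = total_loss(0.5, l_ce)                       # 0.5 is a placeholder SupCon value
```

A uniform prediction over five classes yields a cross-entropy of log 5, which serves as a reference point for an uninformed classifier in the five-class CICIoT2022 setting.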

Figure 1 provides an illustration of the complete architectural design of the VAE-encoder-MoTE model. A supervised contrastive learning layer is incorporated to enhance class separability within the latent space by drawing samples of the same category nearer to each other, while simultaneously pushing samples from different categories farther away from each other.

3.6. The MLP Architecture

The MLP architecture comprises four hidden dense layers (128 → 64 → 32 → 16 units) with ReLU activation. The output layer uses the softmax activation function to produce normalized class probability distributions. To protect the model from overfitting, regularization was introduced via a dropout rate of 0.1 after the third and fourth layers. The output from the VAE-encoder component was passed through the contrastive learning layer and then fed into the MLP model, as shown in Figure 2.

3.7. The GAT Architecture

GAT is a GNN model that incorporates the self-attention mechanism to learn the relative contribution of neighboring nodes. Because GNNs operate on graph-structured inputs, the low-dimensional embeddings of the IoT traffic instances were first transformed into a graph structure. To construct the graph and determine node neighborhoods, the approach proposed in [4] was adopted. Specifically, the k-NN algorithm with n_neighbors = 3 and the Euclidean distance metric was used to establish connections between nodes. The graph was constructed on the embeddings produced by the contrastive learning layer, which was placed between the VAE-encoder and the k-NN algorithm, as shown in Figure 3.
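The graph construction step can be illustrated with a brute-force k-NN search over the embeddings. This is a sketch rather than the authors' implementation: it builds a dense pairwise distance matrix, which is only practical for small batches, and it emits one directed edge from each node to each of its three nearest neighbors. Whether the edges were symmetrized is not stated. The function name `knn_edge_index` and the 2 x (number of edges) output layout are assumptions.

```python
import numpy as np

def knn_edge_index(embeddings, n_neighbors=3):
    """Connect every node to its n_neighbors nearest nodes (Euclidean distance)."""
    z = np.asarray(embeddings, dtype=float)
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # dense N x N distance matrix
    np.fill_diagonal(dist, np.inf)               # a node is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :n_neighbors]
    src = np.repeat(np.arange(z.shape[0]), n_neighbors)
    dst = nbrs.reshape(-1)
    return np.stack([src, dst])                  # shape (2, N * n_neighbors)

# Two well-separated groups of four points each.
group_a = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
group_b = group_a + 10.0
edge_index = knn_edge_index(np.vstack([group_a, group_b]))
same_group = (edge_index[0] < 4) == (edge_index[1] < 4)
```

In this toy example every edge stays inside its own group, which mirrors the intent of building the graph on contrastively trained embeddings: when same-class samples are close together, their neighborhoods are dominated by samples of the same class.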

3.8. Datasets

The three model architectures were trained and evaluated using two well-studied benchmark datasets—the N-BaIoT dataset [10] and the CICIoT2022 dataset [52].

3.8.1. N-BaIoT Dataset

To overcome the limited availability of real-world IoT botnet traffic, the N-BaIoT dataset was introduced by [10] as a comprehensive benchmark derived from operational IoT devices. The dataset contains 115 features, computed as 23 statistical descriptors across five temporal windows, extracted from NetFlow traffic captured in a controlled yet realistic testbed environment comprising nine devices. The devices were configured to run two prominent IoT malware families, Mirai and BashLite. Following preprocessing to remove duplicate records, the dataset was reduced from 6,331,884 to 2,482,470 instances and subsequently employed to train and evaluate each of the three model architectures under binary and fine-grained ten-class classification settings, with percentage class distributions of {“Normal”: 21.52%, “mirai_udp”: 23.30%, “mirai_syn”: 13.29%, “mirai_ack”: 11.74%, “mirai_scan”: 10.74%, “gafgyt_udp”: 4.51%, “gafgyt_combo”: 2.61%, “gafgyt_junk”: 1.31%, “gafgyt_scan”: 1.30%, “mirai_udpplain”: 9.66%}.

3.8.2. CICIoT2022 Dataset

In contrast to N-BaIoT, which focuses on a limited number of devices, the CICIoT2022 dataset [52] provides a substantially broader representation of IoT ecosystems by incorporating traffic from 60 heterogeneous devices. This increased device and protocol diversity facilitates a more comprehensive characterization of IoT traffic behavior under benign and malicious conditions. Feature extraction was performed on the released .pcap files using the revised CICFlowMeter (https://github.com/GintsEngelen/CICFlowMeter, accessed on 23 August 2025). The resulting dataset comprised more than 3.21 million data samples and 84 training features. The dataset was annotated using directory-based labels, resulting in five traffic classes. The sample distribution across the five classes is {“Normal”: 80.870%, “HTTP flood”: 17.130%, “TCP flood”: 1.418%, “Brute force”: 0.379%, “UDP flood”: 0.203%}.

3.9. Data Preprocessing

Data preprocessing consisted of refining the datasets through the removal of outliers, duplicate records, invalid samples, and records with missing values. In addition, non-informative attributes were excluded, including “Flow ID”, which served as a unique ID for each NetFlow record; “Src IP” and “Dst IP”, which associated traffic with specific source and destination addresses; and “Timestamp”, which linked NetFlows to temporal information. Subsequently, the training variables were scaled to take on values in the range of 0 to 1.
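Part of this cleaning and scaling pipeline can be sketched on a numeric feature matrix, under assumptions about details the text leaves open. Invalid samples are taken to be rows containing NaN or infinite values, duplicates are exact row matches, and scaling is column-wise min-max. Outlier removal is omitted because the criterion is not reported, and dropping the identifier columns is assumed to have happened before the matrix was formed. The function names are hypothetical.

```python
import numpy as np

def clean_rows(X):
    """Drop rows with NaN or infinite entries, then drop exact duplicate rows."""
    X = np.asarray(X, dtype=float)
    X = X[np.isfinite(X).all(axis=1)]
    return np.unique(X, axis=0)          # note: rows come back in sorted order

def min_max_scale(X, eps=1e-12):
    """Scale each column to the range [0, 1]; constant columns map to 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi - lo > eps, hi - lo, 1.0)
    return (X - lo) / span

raw = np.array([
    [1.0, 10.0],
    [1.0, 10.0],      # duplicate record
    [3.0, np.nan],    # missing value
    [5.0, 30.0],
    [2.0, 20.0],
])
cleaned = clean_rows(raw)
scaled = min_max_scale(cleaned)
```

In a train and test setting, the column minima and maxima would normally be computed on the training portion and reused for the test portion; the text does not say which convention was followed.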

3.10. Experimental Setup and Model Training

The training process spanned 30 epochs, utilizing the Adam optimizer with a learning rate of $10^{-3}$. The temperature parameter $\tau$ in the supervised contrastive loss was set to 0.1, a value selected through empirical analysis. The MoTE architecture employs six experts with top-2 expert selection per input sample. Each dataset was divided into training (80%) and test (20%) sets, and a batch size of 128 was used during training for all model architectures.

To mitigate the risk of performance overestimation associated with a single train–test split, all experiments were conducted over 10 independent runs for each model and classification task. In each run, the dataset was randomly shuffled prior to splitting, with a different random seed applied to ensure variability across splits. The set of seeds used was 10, 42, 116, 17, 37, 1412, 73, 100, 98, 3200. For every experimental configuration, the average and variance of each performance metric were computed and reported.
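The repeated-split protocol can be sketched as a loop over the listed seeds. The seeds and the 80/20 split come from the text, while the rest is illustrative. `evaluate` is a hypothetical callback standing in for training and scoring one model, no stratification is applied because none is mentioned, and the population variance is used because the estimator is not specified.

```python
import numpy as np

SEEDS = [10, 42, 116, 17, 37, 1412, 73, 100, 98, 3200]

def split_indices(n_samples, seed, test_fraction=0.2):
    """Shuffle with the given seed and return (train_idx, test_idx)."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

def run_all(n_samples, evaluate):
    """Run one experiment per seed and return the mean and variance of the scores."""
    scores = []
    for seed in SEEDS:
        train_idx, test_idx = split_indices(n_samples, seed)
        scores.append(evaluate(train_idx, test_idx, seed))
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.var()   # population variance (assumption)

# Dummy evaluation that just reports the test fraction, to exercise the loop.
def dummy_evaluate(train_idx, test_idx, seed):
    return len(test_idx) / (len(train_idx) + len(test_idx))

mean_score, var_score = run_all(100, dummy_evaluate)
train_idx, test_idx = split_indices(100, SEEDS[0])
```

Reusing a fixed seed list makes each split reproducible while still varying the partition across runs.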

The training procedure was conducted in two stages. First, the VAE encoder and the contrastive learning layer were trained to learn a structured latent representation. Subsequently, each classifier was trained independently using the learned embeddings. During the classifier training phase, the parameters of the pretrained VAE encoder and the contrastive learning layer were frozen, ensuring that only the classifier parameters were updated.

3.11. Performance Evaluation Metrics

The performance was analyzed using standard multiclass classification measures: accuracy, precision, recall, and F1-score. Bar plots were used to show visually how each model performs in comparison with the others. For multiclass classification, macro averages are reported for precision, recall, and F1-score. These metrics provide a comprehensive assessment of overall performance as well as class-wise behavior, which is critical for security-sensitive applications such as IoT botnet detection.
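The effect of macro averaging on an imbalanced problem can be seen in a small worked example. The sketch computes the four metrics from a confusion matrix, taking macro F1 as the unweighted mean of the per-class F1 scores and assigning zero to a class whose precision or recall is undefined; both conventions are assumptions. The function names are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def macro_metrics(cm):
    tp = np.diag(cm).astype(float)
    pred_total = cm.sum(axis=0)
    true_total = cm.sum(axis=1)
    precision = np.divide(tp, pred_total, out=np.zeros_like(tp), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),   # macro average: every class counts equally
        "recall": recall.mean(),
        "f1": f1.mean(),
    }

# A classifier that labels everything as the majority class (8 of 10 samples).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
metrics = macro_metrics(confusion_matrix(y_true, y_pred, n_classes=2))
```

Here accuracy is 0.8 even though the minority class is never detected, whereas macro recall drops to 0.5 and macro F1 to about 0.44. This is why macro averages are informative for datasets such as CICIoT2022, where the Normal class accounts for about 81% of the samples.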

4. Results

This section presents the results from a comprehensive evaluation of the proposed approach across binary and multiclass classification tasks on the CICIoT2022 and N-BaIoT datasets. The performance of the three models (VAE-encoder-MoTE, VAE-encoder-MLP, and VAE-encoder-GAT) is reported before and after the application of contrastive learning. All results represent the mean performance obtained across the independent runs, each conducted with a different data partition and initialization seed, and are reported as mean ± standard deviation. The highest values of accuracy, precision, recall, and F1-score obtained for each experimental setting are presented in boldface in the corresponding tables.

4.1. Binary Classification on the N-BaIoT Dataset

Table 1 summarizes the empirical findings for all models on the N-BaIoT dataset for the binary classification task, evaluated before and after the application of contrastive learning. The comparative outcomes are presented in Figure 4a and Figure 4b, showing performance prior to and after the incorporation of contrastive learning, respectively.

The results indicate that the application of contrastive learning consistently improves performance across all models, with the VAE-Encoder-MoTE model achieving the best overall performance.

4.2. Binary Classification on the CICIoT2022 Dataset

Table 2 summarizes the results on the CICIoT2022 dataset. The corresponding performance comparisons are illustrated in Figure 5a and Figure 5b, which depict the results prior to and after the incorporation of contrastive learning, respectively.

4.3. Multiclass Classification on the N-BaIoT Dataset

Table 3 reports the multiclass classification results on the N-BaIoT dataset. The corresponding performance comparisons are illustrated in Figure 6a and Figure 6b, which depict the results prior to and after the incorporation of contrastive learning, respectively.

The gains from contrastive learning are also evident in the multiclass setting, with improved classification consistency across all models.

4.4. Multiclass Classification on the CICIoT2022 Dataset

Table 4 summarizes the multiclass performance outcomes on the CICIoT2022 dataset. The corresponding performance comparisons are illustrated in Figure 7a and Figure 7b, which depict the results prior to and after the incorporation of contrastive learning, respectively.

Consistent with previous observations, contrastive learning enhances model performance in the multiclass scenario, demonstrating its robustness across datasets and classification settings.

5. Discussion

The findings highlight the strong performance achieved through the integration of VAE-based latent representation learning with supervised contrastive learning for IoT botnet detection across both datasets and classification settings.

5.1. Impact of Contrastive Learning

A consistent improvement is observed after incorporating supervised contrastive learning across all models and tasks. In the binary setting (Table 1 and Table 2; Figure 4 and Figure 5), performance gains are modest due to already-saturated accuracy levels. However, reduced standard deviations indicate improved stability. In contrast, the multiclass setting (Table 3 and Table 4; Figure 6 and Figure 7) shows more substantial improvements, confirming that contrastive learning enhances class separability in complex classification scenarios.

5.2. Model Comparison

The VAE-encoder-MoTE model demonstrates a competitive performance across all tasks, achieving the highest or near-highest accuracy in binary classification (Table 1 and Table 2). Its low variance across runs indicates stable learning. In the multiclass setting, MoTE benefits significantly from contrastive learning, suggesting that expert specialization is more effective when the latent space is well-separated.

The VAE-encoder-MLP model achieves the best overall performance in multiclass classification, particularly on the CICIoT2022 dataset (Table 4). This suggests that well-structured latent representations can enable a strong performance even with relatively simple classifiers.

The VAE-encoder-GAT model consistently underperforms in multiclass scenarios (Table 3 and Table 4). This may be attributed to sensitivity to graph construction and the fixed k-NN strategy used, which may not adequately model intricate relationships within the dataset. Performance degradation is more evident on the CICIoT2022 dataset, indicating limited robustness under heterogeneous and imbalanced conditions.
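To make the graph-construction step concrete, the following is a brute-force sketch of a fixed k-NN graph built over latent embeddings. It is not the authors' implementation: the Euclidean metric, the value of `k`, and the function name `knn_edges` are assumptions made for illustration.

```python
import math


def knn_edges(embeddings, k):
    """Return directed edges (i, j) linking each point to its k nearest neighbors.

    Distances are Euclidean (an assumption); self-loops are excluded.
    """
    edges = []
    for i, point in enumerate(embeddings):
        distances = [
            (math.dist(point, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        ]
        distances.sort()
        edges.extend((i, j) for _, j in distances[:k])
    return edges
```

Because `k` is the same for every node, a sample in a sparse region is linked to neighbors that may be far away and belong to another class, which is one way a fixed k-NN strategy could fail to capture the relationships in the data.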

5.3. Binary vs. Multiclass Performance

All models achieve near-perfect results in binary classification, indicating that distinguishing attack traffic from normal traffic is relatively straightforward in the learned latent space. However, multiclass classification remains more challenging, with noticeable performance gaps between models. This highlights the importance of discriminative representation learning, particularly for fine-grained attack categorization.

5.4. Dataset Effects

Performance is consistently higher on the N-BaIoT dataset than on CICIoT2022. This suggests that N-BaIoT exhibits more separable class structures, whereas CICIoT2022 introduces greater heterogeneity and class imbalance, leading to reduced recall and F1-scores (Table 4). The MoTE model shows improved recall under these conditions, indicating its potential for handling diverse traffic patterns.

An investigation of the N-BaIoT instances that were misclassified in the multiclass setting revealed that 684 instances of the “gafgyt_combo” attack category were wrongly classified as “gafgyt_junk” by the VAE-Encoder-MoTE model. Similar patterns were recorded with the other two models. t-SNE was then used to visualize the spatial distribution of three groups: correctly classified “gafgyt_combo” instances, correctly classified “gafgyt_junk” instances, and “gafgyt_combo” instances misclassified as “gafgyt_junk”. The test dataset was passed through the VAE-Encoder-MoTE pipeline, and the representations immediately preceding the output layer were extracted and mapped into a two-dimensional space using t-SNE. The t-SNE output is presented in Figure 8. A similar pattern was observed when investigating the misclassified instances on the CICIoT2022 dataset.
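A misclassification audit of this kind can start from a simple count of (true, predicted) label pairs. The helper below is a hypothetical sketch, and the labels in the usage example are toy data rather than results from the study.

```python
from collections import Counter


def top_confusions(y_true, y_pred, n=3):
    """Return the n most frequent (true, predicted) pairs among misclassified samples."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(n)


# Toy labels for illustration only.
y_true = ["gafgyt_combo", "gafgyt_combo", "gafgyt_combo", "gafgyt_junk", "gafgyt_junk"]
y_pred = ["gafgyt_junk", "gafgyt_junk", "gafgyt_combo", "gafgyt_junk", "gafgyt_combo"]
print(top_confusions(y_true, y_pred, n=1))
```

The indices of the dominant pair can then be used to select the samples whose penultimate-layer representations are passed to t-SNE.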

The t-SNE visualization reveals that the misclassified “gafgyt_combo” instances tend to concentrate in intermediate or overlapping regions between the primary clusters. These observations suggest two plausible causes of misclassification: (1) feature ambiguity, where the misclassified samples exhibit attack patterns, traffic statistics, or payload characteristics more closely aligned with “gafgyt_junk” than with “gafgyt_combo”, so that additional feature engineering or a more sophisticated detection model would be needed to discriminate between the two classes; and (2) traffic labeling errors [53], where some traffic instances are assigned to the wrong category. This finding underscores the importance of evaluating deep learning models on multiple datasets to ensure more reliable and generalizable outcomes.

5.5. Practical Implications

From a deployment perspective, the findings reveal a trade-off between model accuracy and resource utilization. Although the MLP model achieves the highest accuracy, the MoTE architecture offers an effective balance between detection capability and computational overhead due to its conditional computation mechanism. This makes it a strong candidate for deployment in IoT settings with limited computational resources.

Overall, the findings confirm that combining VAE-based dimensionality reduction with supervised contrastive learning provides a robust framework for IoT botnet detection, with model selection depending on task complexity and deployment constraints.

5.6. Computational Cost Analysis

VAE encoder and contrastive layer time complexity: Both the VAE and the contrastive layer in this study use dense layers as the core computational component. The time complexity of a dense layer with $x_{in}$ inputs and $y_{out}$ neurons is $O(x_{in} \cdot y_{out})$. For a VAE encoder with $a$ dense layers and constant-time sampling, $O(1)$, the time complexity is $O(a \cdot x_{in} \cdot y_{out})$. If the contrastive layer deploys $b$ dense layers, the total time complexity of the VAE encoder and contrastive learning layer is $O(c \cdot x_{in} \cdot y_{out})$, where $c = a + b$.

MLP and MoTE time complexity: Like the VAE and contrastive learning layer, the MLP and MoTE in this work consist of dense layers, with the MoTE using fewer layers (only two in this case) and far fewer computational units per layer to keep it lightweight. In addition, the MoTE has a routing operation of constant time complexity, $O(1)$. Therefore, the time complexities of the MLP and MoTE can be expressed as $O(d \cdot x_{in} \cdot y_{out})$ and $O(E \cdot k \cdot x_{in} \cdot y_{out}) + O(1)$, which simplifies to $O(E \cdot k \cdot x_{in} \cdot y_{out})$, where $d$ and $k$ (with $k \ll d$) are the numbers of layers in the MLP and MoTE, respectively, and $E$ is the number of tiny experts in the MoTE. The overall time complexities of the VAE-Encoder-MLP and VAE-Encoder-MoTE can therefore be expressed as $O(P \cdot x_{in} \cdot y_{out})$ and $O(Q \cdot x_{in} \cdot y_{out})$, where $P = c + d$ and $Q = c + E \cdot k$, respectively.
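The dense-layer cost model above can be turned into a rough operation count. The sketch below sums the input-by-output products of consecutive dense layers; all layer widths and the number of experts in the usage example are hypothetical values, not the configuration used in the study. As in the expression above, the MoTE count includes all E experts.

```python
def dense_stack_cost(layer_sizes):
    """Sum of in*out products over consecutive dense layers (multiply-accumulates)."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def mlp_pipeline_cost(encoder_sizes, mlp_sizes):
    """Encoder and contrastive layers followed by a d-layer MLP."""
    return dense_stack_cost(encoder_sizes) + dense_stack_cost(mlp_sizes)


def mote_pipeline_cost(encoder_sizes, expert_sizes, n_experts):
    """Encoder and contrastive layers followed by E tiny experts of k layers each."""
    return dense_stack_cost(encoder_sizes) + n_experts * dense_stack_cost(expert_sizes)


# Hypothetical widths: a 100-feature input compressed to an 8-dimensional latent space.
encoder = [100, 32, 8]
print(mlp_pipeline_cost(encoder, [8, 64, 32, 4]))
print(mote_pipeline_cost(encoder, [8, 4, 4], n_experts=4))
```

With widths like these, the shared encoder dominates both totals, and the classifier-side difference comes from the experts being both shallower and narrower than the MLP.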

GAT time complexity: This comprises the cost of graph construction, $O(V l_{z}^{2} + E l_{z})$, for a graph with $V$ vertices (each of latent dimension $l_{z}$) and $E$ edges (here $E$ denotes the number of edges, not the number of experts), and the cost $O(n (V l_{z} K + H E K))$ of a GAT with $n$ GATConv layers, $H$ attention heads, and output feature size $K$ per head. Table 5 summarizes the computational complexity of each detection framework.
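For a sense of scale, the two GAT-specific terms can be evaluated numerically. The function below is an illustrative sketch that follows the expressions as written in the paragraph above; the function name and the graph sizes in the usage example are hypothetical.

```python
def gat_pipeline_terms(V, E, l_z, n_layers, H, K):
    """Return (graph construction term, GAT layer term) as operation counts.

    V: vertices, E: edges, l_z: latent dimension, n_layers: GATConv layers,
    H: attention heads, K: output feature size per head.
    """
    graph_construction = V * l_z ** 2 + E * l_z
    gat_layers = n_layers * (V * l_z * K + H * E * K)
    return graph_construction, gat_layers


# Hypothetical graph: 10 nodes, 30 edges, 4-dimensional embeddings.
print(gat_pipeline_terms(V=10, E=30, l_z=4, n_layers=2, H=2, K=8))
```

Unlike the dense-only pipelines, both terms grow with the number of samples placed in the graph, since `V` and `E` scale with the data rather than with the model size.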

5.7. Comparison with Benchmark Studies

Table 6 presents a comparative analysis of the proposed VAE-Encoder-MoTE method against approaches reported in prior studies. To ensure a fair and consistent comparison, only methods evaluated on the same datasets (CICIoT2022 and N-BaIoT) and assessed with identical metrics were considered. For each study, the highest reported score for each evaluation metric was compared with the highest score achieved by the VAE-Encoder-MoTE model. As indicated by the boldface values in Table 6, the proposed model achieved competitive performance across both datasets and outperformed state-of-the-art methods.

6. Conclusions and Future Work

This study presented a comparative evaluation of three hybrid deep learning architectures for IoT botnet detection: VAE-encoder-MLP, VAE-encoder-GAT, and VAE-encoder-MoTE. The proposed framework utilized a variational autoencoder to compress high-dimensional IoT traffic features into a compact latent representation, which was subsequently used to train downstream classifiers under a supervised contrastive learning objective. This training strategy encouraged intra-class compactness and inter-class separability in the embedding space, thereby improving the discriminative quality of the learned representations. In addition, the Mixture of Tiny Experts (MoTE) architecture was investigated as a lightweight conditional computation strategy designed for resource-constrained IoT environments.

Experimental results on the CICIoT2022 and N-BaIoT datasets demonstrated that the proposed architectures achieve near-perfect performance in binary attack detection, with accuracy exceeding 99.80% across both datasets. In the more challenging multiclass scenario, the VAE-encoder-MLP model achieved the best overall performance, particularly on the N-BaIoT dataset, indicating that well-structured representations can enable highly accurate intrusion detection even with relatively simple classifiers. The MoTE architecture achieved competitive results and demonstrated improved recall on the CICIoT2022 dataset, highlighting the benefits of expert specialization and conditional routing. In contrast, the GAT-based model showed lower performance in multiclass settings, suggesting sensitivity to the graph construction method and dataset imbalance.

Future work will focus on improving graph construction strategies for GNN-based models, enhancing expert routing mechanisms in MoTE architectures, compression via quantization, and incorporating temporal modeling techniques to better capture evolving IoT attack patterns.

Author Contributions

H.W.: conceptualization, methodology, software, data curation, validation, writing (original draft preparation), formal analysis, and writing (review and editing). T.L.: writing (review and editing) and supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The study used publicly available IoT attack traffic datasets—CICIoT2022 and N-BaIoT datasets. The authors confirm that the processed data will be available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

CNN	Convolutional Neural Network
DL	Deep Learning
FL	Federated Learning
GAT	Graph Attention Network
GNN	Graph Neural Network
IoT	Internet of Things
k-NN	k-Nearest Neighbors
MLP	Multilayer Perceptron
MoE	Mixture of Experts
MoTE	Mixture of Tiny Experts
PCA	Principal Component Analysis
ReLU	Rectified Linear Unit
SupCon	Supervised Contrastive Learning
t-SNE	t-distributed Stochastic Neighbor Embedding
VAE	Variational Autoencoder
W/o CL	Without Contrastive Learning

References

1. de Caldas Filho, F.L.; Soares, S.C.M.; Oroski, E.; de Oliveira Albuquerque, R.; Da Mata, R.Z.A.; De Mendonça, F.L.L.; de Sousa Júnior, R.T. Botnet detection and mitigation model for IoT networks using federated learning. Sensors 2023, 23, 6305.
2. Danquah, L.K.G.; Appiah, S.Y.; Mantey, V.A.; Danlard, I.; Akowuah, E.K. Computationally efficient deep federated learning with optimized feature selection for iot botnet attack detection. Intell. Syst. Appl. 2025, 25, 200462.
3. Myakala, P.K.; Kamatala, S.; Bura, C. Privacy-Preserving federated learning for IoT botnet detection: A federated averaging approach. ICCK Trans. Mach. Intell. 2025, 1, 6–16.
4. Wasswa, H.; Abbass, H.; Lynar, T. Are GNNs Worth the Effort for IoT Botnet Detection? A Comparative Study of VAE-GNN vs. ViT-MLP and VAE-MLP Approaches. arXiv 2025, arXiv:2505.17363.
5. Altaf, T.; Wang, X.; Ni, W.; Yu, G.; Liu, R.P.; Braun, R. GNN-based network traffic analysis for the detection of sequential attacks in IoT. Electronics 2024, 13, 2274.
6. Zhang, B.; Li, J.; Ward, L.; Zhang, Y.; Chen, C.; Zhang, J. Deep graph embedding for IoT botnet traffic detection. Secur. Commun. Netw. 2023, 2023, 9796912.
7. AboulEla, S.; Kashef, R. Enhancing iot intrusion detection with transformer-based network traffic classification. In Proceedings of the 2025 IEEE International systems Conference (SysCon), Montreal, QC, Canada, 7–10 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–8.
8. Pavithran, D.; Keloth, K.N.E.; Thankamani, R.N. IOT botnet detection using transformer model and federated learning. In Proceedings of the AIP Conference Proceedings; AIP Publishing LLC: Melville, NY, USA, 2025; Volume 3237, p. 060041.
9. Wasswa, H.; Nanyonga, A.; Lynar, T. Impact of latent space dimension on IoT botnet detection performance: VAE-encoder versus ViT-encoder. In Proceedings of the 2024 3rd International Conference for Innovation in Technology (INOCON), Bangalore, India, 1–3 March 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
10. Meidan, Y.; Bohadana, M.; Mathov, Y.; Mirsky, Y.; Shabtai, A.; Breitenbacher, D.; Elovici, Y. N-BaIoT—Network-based detection of iot botnet attacks using deep autoencoders. IEEE Pervasive Comput. 2018, 17, 12–22.
11. Stiawan, D.; Bimantara, A.; Idris, M.Y.; Budiarto, R. IoT botnet attack detection using deep autoencoder and artificial neural networks. KSII Trans. Internet Inf. Syst. 2023, 17, 1310.
12. Wasswa, H.; Abbass, H.; Lynar, T. Graph attention neural network for botnet detection: Evaluating autoencoder, vae and pca-based dimension reduction. arXiv 2025, arXiv:2505.17357.
13. Snoussi, R.; Youssef, H. VAE-based latent representations learning for botnet detection in IoT networks. J. Netw. Syst. Manag. 2023, 31, 4.
14. Wasswa, H.; Abbass, H.A.; Lynar, T. Latent space alignment for robust detection of IoT botnet attacks in non-stationary environments. Knowl.-Based Syst. 2025, 330, 114749.
15. Vu, L.; Cao, V.L.; Nguyen, Q.U.; Nguyen, D.N.; Hoang, D.T.; Dutkiewicz, E. Learning latent representation for IoT anomaly detection. IEEE Trans. Cybern. 2020, 52, 3769–3782.
16. Kozik, R.; Pawlicki, M.; Choraś, M. Cost-Sensitive Distributed Machine Learning for NetFlow-Based Botnet Activity Detection. Secur. Commun. Netw. 2018, 2018, 8753870.
17. Telikani, A.; Rudbardeh, N.E.; Soleymanpour, S.; Shahbahrami, A.; Shen, J.; Gaydadjiev, G.; Hassanpour, R. A cost-sensitive machine learning model with multitask learning for intrusion detection in IoT. IEEE Trans. Ind. Inform. 2023, 20, 3880–3890.
18. Wasswa, H.; Lynar, T.; Abbass, H. Enhancing IoT-botnet detection using variational auto-encoder and cost-sensitive learning: A deep learning approach for imbalanced datasets. In Proceedings of the 2023 IEEE Region 10 Symposium (TENSYMP), Canberra, Australia, 6–8 September 2023; IEEE: New York, NY, USA, 2023; pp. 1–6.
19. Telikani, A.; Gandomi, A.H. Cost-sensitive stacked auto-encoders for intrusion detection in the Internet of Things. Internet Things 2021, 14, 100122.
20. Li, J.; Wang, X.; Zhu, S.; Kuo, C.W.; Xu, L.; Chen, F.; Jain, J.; Shi, H.; Wen, L. Cumo: Scaling multimodal llm with co-upcycled mixture-of-experts. Adv. Neural Inf. Process. Syst. 2024, 37, 131224–131246.
21. Sukhbaatar, S.; Golovneva, O.; Sharma, V.; Xu, H.; Lin, X.V.; Rozière, B.; Kahn, J.; Li, D.; Yih, W.t.; Weston, J.; et al. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm. arXiv 2024, arXiv:2403.07816.
22. Team, L.; Zeng, B.; Huang, C.; Zhang, C.; Tian, C.; Chen, C.; Jin, D.; Yu, F.; Zhu, F.; Yuan, F.; et al. Every flop counts: Scaling a 300b mixture-of-experts ling llm without premium gpus. arXiv 2025, arXiv:2503.05139.
23. Zhou, Y.; Lei, T.; Liu, H.; Du, N.; Huang, Y.; Zhao, V.; Dai, A.M.; Le, Q.V.; Laudon, J. Mixture-of-experts with expert choice routing. Adv. Neural Inf. Process. Syst. 2022, 35, 7103–7114.
24. Yuksel, S.E.; Wilson, J.N.; Gader, P.D. Twenty years of mixture of experts. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1177–1193.
25. He, X.O. Mixture of a million experts. arXiv 2024, arXiv:2407.04153.
26. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), virtually, 6–12 December 2020; NeurIPS: Sydney, Australia, 2020; Volume 33, pp. 18661–18673.
27. Graf, F.; Hofer, C.; Niethammer, M.; Kwitt, R. Dissecting supervised contrastive learning. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 3821–3830.
28. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020; JMLR, Inc.: New York, NY, USA, 2020; pp. 1597–1607.
29. Gunel, B.; Du, J.; Conneau, A.; Stoyanov, V. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021.
30. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding Deep Learning Requires Rethinking Generalization. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017; dblp: Trier, Germany, 2017.
31. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 9729–9738.
32. Zheng, M.; Wang, F.; You, S.; Qian, C.; Zhang, C.; Wang, X.; Xu, C. Weakly supervised contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 10042–10051.
33. Wang, N.; Shi, S.; Chen, Y.; Lou, W.; Hou, Y.T. FeCo: Boosting intrusion detection capability in IoT networks via contrastive learning. In Proceedings of the IEEE Transactions on Dependable and Secure Computing, Montreal, BC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2025.
34. Koukoulis, I.; Syrigos, I.; Korakis, T. Self-supervised transformer-based contrastive learning for intrusion detection systems. arXiv 2025, arXiv:2505.08816.
35. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv 2017, arXiv:1701.06538.
36. Anantharamaiah, K.B.; TP, D. Malware Detection Using Mixture of Experts Neural Network in the IOT Platform. SSRN 3757817. 2020. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3757817 (accessed on 16 January 2026).
37. Zadouri, T.; Üstün, A.; Ahmadian, A.; Ermiş, B.; Locatelli, A.; Hooker, S. Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. arXiv 2023, arXiv:2309.05444.
38. Chitty-Venkata, K.T.; Madireddy, S.; Emani, M.; Vishwanath, V. LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference. arXiv 2025, arXiv:2509.02753.
39. Tan, Y.; Li, Q.; Yang, M.; Hu, Y.; Zhang, L.; Zhang, X. MalMoE: Mixture-of-Experts Enhanced Encrypted Malicious Traffic Detection Under Graph Drift. arXiv 2026, arXiv:2602.10157.
40. Duan, J.; Li, W.; Bai, Q.; Nguyen, M.; Wang, X.; Jiang, J. LLM-BotGuard: A novel framework for detecting LLM-driven bots with mixture of experts and graph neural networks. IEEE Trans. Comput. Soc. Syst. 2025, 12, 3488–3500.
41. Wang, F.; Yang, S.; Li, Q.; Wang, C. An internet of things malware classification method based on mixture of experts neural network. Trans. Emerg. Telecommun. Technol. 2021, 32, e3920.
42. Wang, Y.; Ma, W.; Xu, H.; Liu, Y.; Yin, P. A lightweight multi-view learning approach for phishing attack detection using transformer with mixture of experts. Appl. Sci. 2023, 13, 7429.
43. Chen, F.; Li, P.; Pan, S.; Zhong, L.; Deng, J. Giant could be tiny: Efficient inference of giant models on resource-constrained UAVs. IEEE Internet Things J. 2024, 11, 21170–21179.
44. Yang, J.; Zhang, K.; Zheng, R.; Li, C.; Zheng, J. IoT Network Security Threat Detection Algorithm Integrating Symmetric Routing and a Sparse Mixture-of-Experts Model. Symmetry 2025, 18, 63.
45. Ahanger, A.S.; Khan, S.M.; Masoodi, F.; Salau, A.O. Advanced intrusion detection in internet of things using graph attention networks. Sci. Rep. 2025, 15, 9831.
46. Zhang, H.; Cao, T. A Hybrid Approach to Network Intrusion Detection Based On Graph Neural Networks and Transformer Architectures. In Proceedings of the 2024 14th International Conference on Information Science and Technology (ICIST), Chengdu, China, 6–9 December 2024; IEEE: New York, NY, USA, 2024; pp. 574–582.
47. Mohan, H.G.; Jalesh, K.; Nandish, M. GrMA-CNN: Integrating Spatial-Spectral Layers with Modified Attention for Botnet Detection Using Graph Convolution for Securing Networks. Int. J. Intell. Eng. Syst. 2025, 18, 1009.
48. Li, Y.; Chen, G.; Dong, Z. Multi-view graph contrastive representative learning for intrusion detection in EV charging station. Appl. Energy 2025, 385, 125439.
49. Zhang, L.; Tan, L.; Shi, H.; Sun, H.; Zhang, W. Malicious traffic classification for IoT based on graph attention network and long short-term memory network. In Proceedings of the 2023 24th Asia-Pacific Network Operations and Management Symposium (APNOMS), Sejong, Republic of Korea, 6–8 September 2023; IEEE: New York, NY, USA, 2023; pp. 54–59.
50. Xu, L.; Zhao, Z.; Zhao, D.; Li, X.; Lu, X.; Yan, D. AJSAGE: A intrusion detection scheme based on Jump-Knowledge Connection To GraphSAGE. Comput. Secur. 2025, 150, 104263.
51. Kingma, D.P.; Welling, M. An introduction to variational autoencoders. Found. Trends® Mach. Learn. 2019, 12, 307–392.
52. Dadkhah, S.; Mahdikhani, H.; Danso, P.K.; Zohourian, A.; Truong, K.A.; Ghorbani, A.A. Towards the development of a realistic multidimensional IoT profiling dataset. In Proceedings of the 2022 19th Annual International Conference on Privacy, Security & Trust (PST), Fredericton, NB, Canada, 22–24 August 2022; IEEE: New York, NY, USA, 2022; pp. 1–11.
53. Liu, L.; Engelen, G.; Lynar, T.; Essam, D.; Joosen, W. Error Prevalence in NIDS datasets: A Case Study on CIC-IDS-2017 and CSE-CIC-IDS-2018. In Proceedings of the 2022 IEEE Conference on Communications and Network Security (CNS), Austin, TX, USA, 3–5 October 2022; IEEE: New York, NY, USA, 2022; pp. 254–262.
54. Gor, K.D.; Doshi, M.H.; Mehta, Y.D.; Mehta, K.N.; Verma, S. IoT-Based Smart Meter Attack Detection Using N-BaIoT Dataset. In Proceedings of the 2025 IEEE International Conference on Smart Power, Energy, Renewables, and Transportation (SPERT), Surat, India, 22–24 December 2025; IEEE: New York, NY, USA, 2025; pp. 1–6.
55. Okur, C.; Orman, A.; Dener, M. DDOS intrusion detection with machine learning models: N-BaIoT data set. In Proceedings of the International Conference on Artificial Intelligence and Applied Mathematics in Engineering; Springer: Berlin/Heidelberg, Germany, 2022; pp. 607–619.
56. Serhane, A.; Ibrahimi, K.; Hamzaoui, E.M.; Jouhari, M.; Ben-Othman, J. IoT Intrusion Detection Using Machine Learning Classifiers and PCA Dimensionality Reduction for N-BaIoT Dataset. In Proceedings of the ICC 2025-IEEE International Conference on Communications, Montreal, QC, Canada, 8–12 June 2025; IEEE: New York, NY, USA, 2025; pp. 5809–5814.
57. Pynadath, M.A.; Pavithra, K.; Lobo, S.E.; Murthy, S.S.; Bharathi, R. Anomaly Detection and Multi-Output Classification of IoT Attacks. In Proceedings of the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26–28 April 2023; IEEE: New York, NY, USA, 2023; pp. 1750–1757.
58. Guimarães, L.C.; Couto, R.S. A performance evaluation of neural networks for botnet detection in the internet of things. J. Netw. Syst. Manag. 2024, 32, 98.
59. Do, P.H.; Le, T.D.; Vishnevsky, V.; Berezkin, A.; Kirichek, R. A horizontal federated learning approach to iot malware traffic detection: An empirical evaluation with n-baiot dataset. In Proceedings of the 2024 26th International Conference on Advanced Communications Technology (ICACT), Gangwon-Do, Republic of Korea, 3–7 February 2024; IEEE: New York, NY, USA, 2024; pp. 1494–1506.
60. Umair, M.; Tan, W.H.; Foo, Y.L. Efficient malware classification with spiking neural networks: A case study on n-baiot dataset. In Proceedings of the 2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN), Paris, France, 4–7 July 2023; IEEE: New York, NY, USA, 2023; pp. 231–236.
61. Zhang, X.; Zhang, T.; Liu, Y.; Cheng, W. Knowledge Distillation Based Lightweight Classification for Encrypted Traffic with Long Tailed Distribution. In Proceedings of the 2025 IEEE 25th International Conference on Communication Technology (ICCT), Shenyang, China, 17–19 October 2025; IEEE: New York, NY, USA, 2025; pp. 1856–1861.
62. Liu, W.; Cui, W.; Wang, B.; Pan, H.; She, W.; Tian, Z. Decentralized traffic detection utilizing blockchain-federated learning with quality-driven aggregation. Comput. Netw. 2025, 262, 111179.
63. Mittal, A.; Nedunoori, V. Advanced Security for IoT Networks: A Novel Approach to Intrusion Detection and Feature Selection. In Proceedings of the 2024 International Conference on Distributed Systems, Computer Networks and Cybersecurity (ICDSCNC), Bengaluru, India, 20–21 September 2024; IEEE: New York, NY, USA, 2024; pp. 1–5.

Figure 1. Proposed VAE-encoder-MoTE model with supervised contrastive learning on the low-dimensional latent space embeddings.

Figure 2. Proposed VAE-encoder-MLP model with supervised contrastive learning on the low-dimensional latent space embeddings (top-k experts indicated with dotted arrows).

Figure 3. Proposed VAE-encoder-GAT model with supervised contrastive learning on the low-dimensional latent space embeddings.

Figure 4. Binary classification performance comparison of the VAE-encoder-MLP, VAE-encoder-GAT, and VAE-encoder-MoTE architectures on the N-BaIoT dataset: (a) without contrastive learning and (b) with contrastive learning.

Figure 5. Binary classification performance comparison of the VAE-encoder-MLP, VAE-encoder-GAT, and VAE-encoder-MoTE architectures on the CICIoT2022 dataset: (a) without contrastive learning and (b) with contrastive learning.

Figure 6. Multiclass classification performance comparison of the VAE-encoder-MLP, VAE-encoder-GAT, and VAE-encoder-MoTE architectures on the N-BaIoT dataset: (a) without contrastive learning and (b) with contrastive learning.

Figure 7. Multiclass classification performance comparison of the VAE-encoder-MLP, VAE-encoder-GAT, and VAE-encoder-MoTE architectures on the CICIoT2022 dataset: (a) without contrastive learning and (b) with contrastive learning.

Figure 8. t-SNE visualization of misclassified “gafgyt_combo” traffic instances.

Table 1. Binary classification performance on the N-BaIoT dataset before and after application of contrastive learning.

	Accuracy (%)		Precision (%)		Recall (%)		F1-Score (%)
Model	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL
VAE-MLP	$99.91 \pm 0.07$	$99.94 \pm 0.02$	$99.90 \pm 0.04$	$99.92 \pm 0.02$	$99.90 \pm 0.04$	$99.92 \pm 0.03$	$99.88 \pm 0.01$	$99.91 \pm 0.01$
VAE-GAT	$99.50 \pm 0.12$	$99.80 \pm 0.08$	$99.17 \pm 0.11$	$99.82 \pm 0.09$	$99.01 \pm 0.10$	$99.84 \pm 0.08$	$99.08 \pm 0.06$	$99.75 \pm 0.05$
VAE-MoTE	$99.92 \pm 0.01$	$99.97 \pm 0.01$	$99.94 \pm 0.02$	$99.96 \pm 0.03$	$99.91 \pm 0.04$	$99.95 \pm 0.01$	$99.92 \pm 0.03$	$99.95 \pm 0.01$

Table 2. Binary classification performance on the CICIoT2022 dataset before and after application of contrastive learning.

	Accuracy (%)		Precision (%)		Recall (%)		F1-Score (%)
Model	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL
VAE-MLP	$99.84 \pm 0.12$	$99.86 \pm 0.06$	$99.50 \pm 0.11$	$99.79 \pm 0.07$	$99.63 \pm 0.13$	$99.81 \pm 0.06$	$99.57 \pm 0.11$	$99.79 \pm 0.06$
VAE-GAT	$99.20 \pm 0.10$	$99.24 \pm 0.04$	$99.00 \pm 0.11$	$99.72 \pm 0.03$	$99.10 \pm 0.10$	$99.36 \pm 0.04$	$99.05 \pm 0.09$	$99.29 \pm 0.04$
VAE-MoTE	$99.81 \pm 0.11$	$99.93 \pm 0.02$	$99.77 \pm 0.06$	$99.87 \pm 0.06$	$99.69 \pm 0.07$	$99.82 \pm 0.04$	$99.73 \pm 0.06$	$99.80 \pm 0.04$

Table 3. Multiclass classification performance on the N-BaIoT dataset before and after application of contrastive learning.

	Accuracy (%)		Precision (%)		Recall (%)		F1-Score (%)
Model	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL
VAE-MLP	$99.51 \pm 0.07$	$99.84 \pm 0.04$	$99.53 \pm 0.22$	$99.87 \pm 0.06$	$99.42 \pm 0.12$	$99.85 \pm 0.03$	$99.42 \pm 0.08$	$99.86 \pm 0.08$
VAE-GAT	$95.12 \pm 0.13$	$96.87 \pm 0.08$	$93.11 \pm 0.12$	$96.13 \pm 0.09$	$94.82 \pm 0.10$	$96.79 \pm 0.08$	$94.92 \pm 0.13$	$96.75 \pm 0.05$
VAE-MoTE	$99.38 \pm 0.01$	$99.87 \pm 0.02$	$99.36 \pm 0.04$	$99.88 \pm 0.03$	$98.59 \pm 0.04$	$99.86 \pm 0.01$	$98.36 \pm 0.03$	$99.85 \pm 0.01$

Table 4. Multiclass classification performance on the CICIoT2022 dataset before and after application of contrastive learning.

	Accuracy (%)		Precision (%)		Recall (%)		F1-Score (%)
Model	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL	W/o CL	With CL
VAE-MLP	$98.21 \pm 0.25$	$99.03 \pm 0.12$	$98.78 \pm 0.27$	$99.35 \pm 0.13$	$93.53 \pm 0.26$	$96.40 \pm 0.12$	$95.22 \pm 0.24$	$96.35 \pm 0.12$
VAE-GAT	$97.93 \pm 0.12$	$98.20 \pm 0.10$	$97.25 \pm 0.13$	$98.18 \pm 0.11$	$83.47 \pm 0.17$	$90.15 \pm 0.10$	$88.91 \pm 0.21$	$92.53 \pm 0.10$
VAE-MoTE	$95.30 \pm 0.23$	$98.90 \pm 0.11$	$95.45 \pm 0.24$	$98.70 \pm 0.12$	$95.10 \pm 0.23$	$97.83 \pm 0.11$	$95.05 \pm 0.22$	$97.75 \pm 0.11$
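
As a reading aid for Table 4, the following minimal sketch computes the absolute change in each mean metric, in percentage points, when contrastive learning (CL) is added. The means are copied from the table. The dictionary layout and the function name `cl_gain` are illustrative assumptions, and the reported standard deviations are ignored.

```python
# Mean scores (%) copied from Table 4 (multiclass, CICIoT2022).
# Each tuple is (without CL, with CL); standard deviations are omitted.
TABLE4 = {
    "VAE-MLP": {"accuracy": (98.21, 99.03), "precision": (98.78, 99.35),
                "recall": (93.53, 96.40), "f1": (95.22, 96.35)},
    "VAE-GAT": {"accuracy": (97.93, 98.20), "precision": (97.25, 98.18),
                "recall": (83.47, 90.15), "f1": (88.91, 92.53)},
    "VAE-MoTE": {"accuracy": (95.30, 98.90), "precision": (95.45, 98.70),
                 "recall": (95.10, 97.83), "f1": (95.05, 97.75)},
}


def cl_gain(table):
    """Return {model: {metric: gain}}, where gain = (with CL) - (without CL)
    in percentage points, rounded to two decimals."""
    return {
        model: {
            metric: round(with_cl - without_cl, 2)
            for metric, (without_cl, with_cl) in metrics.items()
        }
        for model, metrics in table.items()
    }


GAINS = cl_gain(TABLE4)

if __name__ == "__main__":
    for model, gains in GAINS.items():
        print(model, gains)
```

On these means, the largest single change is the VAE-GAT recall (+6.68 points), and every metric for every model increases with CL.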

Table 5. Computational cost for each of the three models.

Framework	Time Complexity
VAE-Encoder-MLP	$O(P \cdot x_{in} \cdot y_{out})$
VAE-Encoder-MoTE	$O(Q \cdot x_{in} \cdot y_{out})$
VAE-Encoder-GAT	$O(c \cdot x_{in} \cdot y_{out}) + O(V l_{z}^{2} + E V l_{z}) + O(n (V l_{z} K + H E K))$
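
To make the structure of the Table 5 expressions concrete, the sketch below evaluates the three terms listed for VAE-Encoder-GAT as plain arithmetic. This is only an illustration: the symbols are treated as positive numbers without restating their definitions, every value in `PLACEHOLDER` is an arbitrary assumption rather than a setting from the experiments, and the constants hidden by the big-O notation are ignored, so the results indicate relative growth, not measured cost.

```python
def leading_term(scale, x_in, y_out):
    """Term of the form scale * x_in * y_out, which appears in every row of Table 5
    (scale stands for P, Q, or c depending on the row)."""
    return scale * x_in * y_out


def gat_terms(c, x_in, y_out, V, E, l_z, K, H, n):
    """Evaluate the three terms listed for VAE-Encoder-GAT in Table 5."""
    first = leading_term(c, x_in, y_out)
    second = V * l_z**2 + E * V * l_z
    third = n * (V * l_z * K + H * E * K)
    return first, second, third


# Arbitrary placeholder values (assumptions chosen for illustration only).
PLACEHOLDER = dict(c=3, x_in=100, y_out=16, V=100, E=500, l_z=16, K=8, H=4, n=2)
FIRST, SECOND, THIRD = gat_terms(**PLACEHOLDER)

if __name__ == "__main__":
    print("first term :", FIRST)
    print("second term:", SECOND)
    print("third term :", THIRD)
    print("sum        :", FIRST + SECOND + THIRD)
```

Which term dominates depends entirely on the values substituted. The point of the sketch is only that the GAT row carries two terms beyond the `scale * x_in * y_out` form shared with the other two rows.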

Table 6. Performance evaluation contrasting the proposed VAE-Encoder-MoTE framework with existing advanced techniques on the CICIoT2022 and N-BaIoT datasets.

	N-BaIoT Dataset
Study	Methods	Classification Type	Acc (%)	Prec	Recall	F1-score
[54]	Random Forest, MLP	4-class	99.98	0.999	0.999	0.999
[55]	23 models (Best: Random Forest)	3-class	99.92	-	-	-
[56]	6 models with PCA (Best: Extra Trees)	multiclass	99.94	0.999	0.999	0.999
[57]	Autoencoder, MLP	3-class	99.98	0.990	0.990	0.990
[57]	Autoencoder, MLP	multiclass	88.89	0.900	0.860	0.850
[58]	LSTM	multiclass	86.38	0.829	0.795	-
[59]	3 DL models with FL (Best: CNN)	multiclass	90.90	0.940	0.900	0.870
[60]	Spiking neural networks	multiclass	71.00	0.692	0.634	0.652
VAE-MoTE	VAE encoder, supervised contrastive learning, MoTE	binary	99.98	0.999	0.999	0.999
VAE-MoTE	VAE encoder, supervised contrastive learning, MoTE	multiclass	99.85	0.999	0.998	0.999
	CICIoT2022 Dataset
Study	Methods	Classification Type	Acc (%)	Prec	Recall	F1-score
[61]	Hierarchical Matrix mapping, Transformer-ResNet	multiclass	95.81	-	-	0.957
[62]	FL with QDVTA and ResNet18	multiclass	95.46	0.965	0.955	0.955
[63]	Golden sine with crystal structure algorithm, GRU	binary	99.97	0.999	0.999	0.999
VAE-MoTE	VAE encoder, supervised contrastive learning, MoTE	binary	99.97	0.999	0.999	0.999
VAE-MoTE	VAE encoder, supervised contrastive learning, MoTE	multiclass	98.90	0.989	0.988	0.985

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wasswa, H.; Lynar, T. Hybrid Deep Architectures in Contrastive Latent Space: Performance Analysis of VAE-MLP, VAE-MoTE, and VAE-GAT for IoT Botnet Detection. IoT 2026, 7, 41. https://doi.org/10.3390/iot7020041

