Article

IoT Network Security Threat Detection Algorithm Integrating Symmetric Routing and a Sparse Mixture-of-Experts Model

1 School of Ocean Information Engineering, Jimei University, Xiamen 361000, China
2 Zhangzhou Power Supply Company, State Grid Fujian Electric Power Co., Ltd., No. 13 Shengli East Road, Xiangcheng District, Zhangzhou 363000, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Symmetry 2026, 18(1), 63; https://doi.org/10.3390/sym18010063
Submission received: 27 November 2025 / Revised: 24 December 2025 / Accepted: 27 December 2025 / Published: 30 December 2025
(This article belongs to the Section Engineering and Materials)

Abstract

With the rapid deployment of the Internet of Things (IoT) in critical domains such as power and industrial systems, the number of IoT devices has surged, accompanied by increasingly severe network security risks. IoT networks face diverse threats, including distributed denial-of-service attacks, advanced persistent threats, and data theft or tampering, while traditional detection and defense mechanisms, which lack deep feature analysis, struggle with complex and unknown attacks and thus degrade the detection of security threat events. To this end, this paper proposes an IoT network security threat detection algorithm that integrates symmetric linear routing with a sparse mixture-of-experts model. The algorithm consists of a ConvNeXt feature extractor and a sparse BiLSTM expert layer, with symmetric linear routing embedded in the gating module. ConvNeXt provides refined global and local representations, the Top-K gated BiLSTM experts model sequence-level dependencies among ordered features, and symmetric linear routing suppresses routing bias, enabling efficient and robust detection of IoT security threats. Experimental results on the CIC-IDS2018, TON-IoT, and BoT-IoT datasets indicate that the proposed algorithm achieves accuracies of 94.08%, 99.99 ± 0.01%, and 99.78%, respectively. Comparative experiments show that the proposed algorithm outperforms baseline and state-of-the-art models, while the ablation and Top-K studies confirm the effectiveness of each module for IoT intrusion detection.

1. Introduction

The rapid development of the IoT has profoundly transformed numerous industries and is now widely applied in scenarios such as power IoT, industrial IoT, intelligent transportation, and smart manufacturing. With the widespread adoption of edge computing, the integration of 5G networks, and the accelerated advancement of artificial intelligence, the growth momentum of IoT has become increasingly pronounced, and the number of connected devices is expected to reach 25.4 billion by 2030 [1,2]. IoT significantly improves the efficiency and reliability of the system through perception interworking and collaborative decision-making [3]. However, its high degree of connectivity and digitalization also makes IoT devices more vulnerable to attacks, exposing them to diverse risks such as distributed denial-of-service attacks, advanced persistent threats, and data theft or tampering, which may lead to service disruption, economic losses, and even damage to critical infrastructure. It is thus evident that, while IoT drives industrial intelligence and digital transformation, it also faces increasingly severe network security challenges. Consequently, designing efficient and adaptive intrusion detection systems (IDSs) tailored to IoT scenarios has become a key approach to ensuring the stable operation of IoT networks and the security of their data.
Classical IDS methods can be categorized into two types according to their identification techniques: misuse-based IDSs and anomaly-based IDSs [4,5]. Misuse-based IDSs detect known attacks with high accuracy through steps such as signature database construction, real-time monitoring, and alert generation. However, they rely on static rules and frequently updated feature libraries, which results in low computational efficiency on resource-constrained devices and prevents them from processing massive heterogeneous data streams in real time. Moreover, they are incapable of identifying unknown attacks, such as zero-day attacks and advanced persistent threats [6,7], and inherently lack adaptive learning capability. Anomaly-based IDS methods learn normal behavior patterns and identify potential anomalies or intrusions that deviate from them. In the complex and evolving threat landscape of IoT environments, anomaly-based approaches are better suited to security requirements, as they can uncover previously unknown threats. However, they typically suffer from high false alarm rates and may misclassify atypical but benign behaviors as anomalies. Classical anomaly-based IDS methods mainly include statistical-based and knowledge-based approaches [8], which detect intrusions by extracting statistical features of network traffic and relying on knowledge bases constructed from normal traffic. However, they require continuous model updates and incur high maintenance costs.
With the introduction of machine learning (ML) and deep learning (DL) techniques [9,10], IDSs have gained automatic feature extraction and hierarchical representation capabilities, which significantly improve the accuracy and generalization performance of anomaly-based IDS methods. Classical ML-based methods typically require extensive manual intervention, including feature engineering and model selection. Although they perform well on structured data, their adaptability and accuracy are limited when confronted with complex attack patterns and highly dynamic network environments. In recent years, DL has emerged as a key means of overcoming the limitations of traditional approaches. By enabling automatic feature extraction, DL substantially reduces reliance on manual feature engineering and thus alleviates the constraints of handcrafted features [11]. For example, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep neural networks (DNNs) can accurately identify malicious patterns in large-scale traffic data and have achieved significant research progress in intrusion detection [12,13,14]. However, these models still suffer from insufficient accuracy when recognizing multiple attack types, face challenges in effective model selection and adaptation, and exhibit limited scalability in their overall architectures.
The emergence of sparse mixture-of-experts (Sparse MoEs) architectures has offered a promising solution to these challenges [15]. By leveraging ensembles of specialized experts, they enhance decision-making capability and have demonstrated favorable performance in IoT scenarios, particularly in handling imbalanced datasets and diverse types of threats. However, such models still face two key challenges: high sensitivity of the routing gate to perturbations and severe imbalance in expert load. To this end, this paper proposes a novel detection model that integrates symmetric linear routing with a Sparse MoEs architecture. Specifically, a ConvNeXt network is employed to automatically extract multi-scale, hierarchical spatial features from network traffic data [16], thereby improving the efficiency and accuracy of recognizing complex attack patterns. On this basis, the Sparse MoEs architecture aggregates multiple parallel BiLSTM experts to capture sequence-level dependencies among ordered features [17]. By incorporating symmetric routing refinements together with a load-balancing auxiliary loss, the proposed model effectively alleviates the strong perturbation of routing gates and the imbalance of expert loads in conventional Sparse MoEs, enabling efficient multi-expert collaboration and significantly enhancing classification performance.
Our main contributions are summarized as follows:
(I)
In terms of base model design, ConvNeXt is effectively combined with a Sparse MoEs framework to form an integrated model that unifies feature extraction and expert decision-making. ConvNeXt is employed to automatically extract multi-scale, hierarchical spatial features from network traffic data, thereby improving the efficiency and accuracy of recognizing complex attack patterns. On this basis, the Sparse MoEs architecture aggregates multiple parallel BiLSTM experts to capture sequence-level dependencies among ordered features.
(II)
For IoT network intrusion detection scenarios, we introduce a symmetric linear routing mechanism into the gating network, which effectively mitigates the strong perturbation sensitivity of routing gates and the expert load imbalance commonly observed in conventional MoE models. This design enables efficient collaborative inference among multiple experts and leads to improved classification accuracy in the final decision.
(III)
To address the asymmetric class distribution in IoT intrusion datasets, we integrate random undersampling with SMOTE–Tomek oversampling. This strategy enhances the representation of minority classes while preserving high recognition rates for majority classes, thereby improving class balance, boosting overall accuracy, and alleviating bias induced by data imbalance.

2. Related Work

2.1. IoT Network Intrusion Detection Technology

As a critical component of IoT network security systems, intrusion detection technology faces the fundamental challenge and core objective of effectively identifying various complex and latent attack behaviors. This section briefly reviews the evolution of IDS techniques, tracing the progression from traditional ML to modern DL methods, and then focuses on recent advances in Sparse MoEs models and the emerging need for routing symmetry and directional invariance in intrusion detection tasks.
In the early stage, IDS methods primarily relied on traditional ML methods. The main idea was to extract static features from network traffic data and system logs, and then train classifiers to distinguish normal behavior from malicious activities. Commonly used ML algorithms include several classical methods such as Decision Trees (DTs), Support Vector Machines (SVMs), K-Nearest Neighbors (KNNs), and random forests (RFs) [18,19,20,21]. In related research, Tahri et al. [22] investigated performance improvements for network intrusion detection systems by comparing several ML algorithms on the UNSW-NB15 dataset, including SVM, Naïve Bayes, DT, RF, Classification and Regression Trees, and an embedded SVM–Naïve Bayes classifier. Using accuracy, precision, sensitivity, specificity, false alarm rate, and detection rate as the main evaluation metrics, and incorporating data pre-processing, dimensionality reduction, and feature selection into the assessment, their results showed that the RF algorithm achieved sufficiently good classification performance across different attack types and yielded the highest overall accuracy. In the study by Kilincer et al. [23], the authors focused on analyzing the performance of SVM, KNN, and DT algorithms on the CIC-IDS2018, UNSW-NB15, ISCX-2012, NSL-KDD, and CIDDS-001 datasets. The data were pre-processed using Min–Max normalization, and the class imbalance problem was mitigated by merging similar attack categories. The results demonstrated that the DT model achieved superior performance across multiple datasets. Although the above ML methods perform well in handling structured, low-dimensional data and known attacks, they struggle to capture the temporal and sequential dependencies, as well as directional behaviors, inherent in network traffic sequences, and their reliance on complex feature engineering and their limited generalization capability become increasingly evident in the face of dynamic IoT environments and massive high-dimensional data.
DL, with its powerful capabilities for automatic feature learning and pattern recognition, has become a mainstream approach for intrusion detection tasks. Representative convolutional neural networks, such as ResNet [24] and ConvNeXt, can automatically extract local and hierarchical spatiotemporal features from raw network traffic data or their image representations, effectively capturing fine-grained characteristics of attack behaviors and alleviating the dependence on manual feature engineering in traditional methods. To address data imbalance in intrusion detection datasets, traditional techniques, such as ADASYN, SMOTE, cluster-based undersampling, and random undersampling [25,26,27,28], remain the predominant strategies. For example, in our previous work, Zhang et al. [29] proposed a lightweight IDS model, SE-DWNet, which combines a residual backbone with SE attention and 1D depthwise separable convolutions to enhance representation efficiency. The study further used SMOTE–Tomek and Focal Loss to mitigate class imbalance and improve minority-class detection. Dangol et al. [30] employed a CNN-BLSTM architecture for intrusion detection on the NSL-KDD and UNSW-NB15 datasets, integrating ADASYN with random undersampling to mitigate class imbalance. This combination significantly improved F1-scores on minority classes such as R2L and U2R, demonstrating better balance between precision and recall.
However, recent studies have demonstrated that the performance of intrusion detection systems critically depends on the sequential order of features extracted from network traffic. The arrangement sequence of features significantly influences the learned feature representations and the stability of classification outcomes. Zhang et al. [31] and Li et al. [32] demonstrated that feature-rearranged or direction-inverted flows lead to inconsistent detection due to non-invariant feature coupling within high-dimensional embeddings. These studies indicate that current DL–based intrusion detection methods remain vulnerable to variations in feature order, while lacking mechanisms to maintain representational symmetry.

2.2. Sparse Mixture-of-Experts

With the rise of Sparse MoEs, network intrusion detection has gained a new line of solutions. Sparse MoEs models consist of multiple expert submodels and a routing gate, where the gate selects only a subset of experts to participate in the computation for each input, while the remaining experts remain inactive. Compared with single DL models and conventional DL ensembles [33,34,35], Sparse MoEs dynamically select expert submodels through a routing mechanism, thereby maintaining detection accuracy while significantly reducing computational cost and enhancing model scalability and generalization capability. In related work, Ilias et al. [36] were among the first to apply Sparse MoEs architectures to 5G intrusion detection, combining CNN-derived representations with sparse expert routing to achieve high accuracy and low inference cost on large-scale network datasets. Shanka et al. [37] designed a hybrid expert ensemble to address the class imbalance challenge in IDSs, showing that expert-specific specialization enhances detection robustness across minority attack categories. Wang et al. [38] proposed MoE-TransDLD, a Transformer-based MoE model for intrusion detection in power systems. It enhances detection accuracy and generalization to unknown attacks via dynamic feature routing, performing well in complex cyber-physical environments. Rahim et al. [39] further extended this paradigm by integrating physical-layer context with Sparse MoEs for hybrid signature-based and anomaly-based detection, demonstrating improved detection of both known and unseen attacks. Traditional Sparse MoEs routing design primarily encompasses three paradigms: Token-Level Routing Strategy, Modality-Level Routing Strategy, and Task-Level Routing Strategy [40]. However, it typically relies on single-perspective and unidirectional representations for decision-making, lacking structural constraints for order invariance or symmetric consistency. In recent Sparse MoEs research, the design of routing gates has evolved into advanced strategies incorporating symmetry or invertibility. Zhao et al. [41] introduced Symmetric Gating Regularization to align gating distributions between original and flipped samples. Gao et al. [42] proposed the Multi-Scale Mixture-of-Experts (MSMoEs) framework, which utilizes symmetric gating and bidirectional cross-attention to achieve consistent expert routing in temporal modeling, thereby enhancing robustness in dynamic sequence processing. These studies indicate that routing mechanisms with symmetry or feature-flipping awareness help improve expert utilization balance and enhance robustness against input perturbations. However, such symmetry-aware gating has not yet been sufficiently explored in IoT intrusion detection tasks.
Building on the above studies, we next present an IoT network security threat detection algorithm that integrates symmetric linear routing with a sparse mixture-of-experts model. It is clarified that the intended deployment scenario of the proposed model is workstation-level edge servers connected to IoT gateways. Experimental validation and analysis of the results are then conducted to verify the algorithm’s effectiveness. Our routing design is fundamentally built upon a noisy Top-K affine linear routing strategy, which remains the core selection mechanism. In contrast to the regularization-based approach of [41], our method does not rely on regularizing logits. Instead, we directly construct a forward and backward routing pair from the ordered IoT feature stream and generate the final routing vector by averaging them. This symmetric mechanism operates directly on the Top-K probability assignment process rather than through parameter regularization, thereby offering a novel routing enhancement solution for IoT intrusion detection tasks.

3. Proposed Model Framework

To address the complex and diverse attack environment in IoT networks, this paper proposes an IoT network security threat detection algorithm that integrates symmetric linear routing with a sparse mixture-of-experts model, as illustrated in Figure 1. As shown in the figure, the proposed algorithm mainly consists of three components: a data pre-processing module, a ConvNeXt-1D feature extraction module, and a Sparse MoEs classification and decision module. First, the data pre-processing module performs data cleaning, feature selection, and data balancing on the input data. Second, the ConvNeXt-1D network consists of two stages with a total of six stacked ConvNeXt blocks. Each block employs a one-dimensional depthwise separable convolution with a kernel size of 7, combined with LayerNorm and GELU activation to form the feature extraction backbone [43,44], fully exploiting both long- and short-range dependencies and capturing local patterns at multiple scales in the input traffic and thereby significantly enhancing the feature representation capacity. Finally, the Sparse MoEs classification and decision module adopts BiLSTM as the sub-expert model and introduces a symmetric linear gating network as the routing mechanism. After receiving the high-dimensional features output by ConvNeXt, the routing network computes the importance weights of each sub-expert from two symmetric perspectives, forward and reverse, averages them to obtain the final expert weights, and then sparsely activates the most appropriate sub-experts according to a Top-K strategy. In this way, the module effectively captures the sequential dependencies of network traffic and enhances its ability to discriminate complex traffic patterns, ultimately producing multi-class threat detection results. The core routing module is detailed below.

3.1. ConvNeXt Feature Extraction Module

For the feature extraction component of the module, considering the computational and storage constraints of practical IoT devices, we adopt a lightweight design based on a ConvNeXt-Tiny model constructed by stacking multiple ConvNeXt blocks, and further shrink both the network depth and channel dimension. The designed ConvNeXt-1D module stacks two stages, with the input channel dimensions (Dims) set to (96, 192) and the number of ConvNeXt blocks in each stage (Depths) set to (3, 3). Between the two stages, a downsampling convolution module with a kernel size of 2 and a stride of 2 is employed to further reduce the spatial resolution while doubling the number of channels. Since the model is designed for tabular sequential inputs of network traffic, the original two-dimensional (2D) convolutional structure of ConvNeXt is adapted to a 1D form: the convolution kernel is changed from K = 7 × 7 to a one-dimensional kernel of length K = 7, which reduces the number of parameters and multiply–accumulate operations. Meanwhile, the 2D strided convolutions used for downsampling are replaced with 1D strided convolutions to achieve hierarchical downsampling along the sequence-length dimension.
The ConvNeXt model, owing to its use of large convolution kernels, is well-suited to capturing slowly varying trends and cross-feature dependencies along the ordered feature sequence. In combination with LayerNorm and the GELU activation function, it stabilizes gradients, accelerates convergence, and maintains a smooth asymptotic response in the negative input region, thereby alleviating overfitting and improving optimization robustness [45,46]. Meanwhile, converting the convolution kernels to a 1D form preserves the representation capability while effectively constraining model size and inference cost, providing a deployable feature extraction solution for resource-constrained IoT scenarios. The detailed architecture of the ConvNeXt-1D module is shown in Figure 2. For the ConvNeXt Block-1D, the structure is implemented as follows: it first applies a 1D depthwise convolution with a kernel size of K = 7 to capture longer local dependencies along the ordered feature sequence; then, LayerNorm is applied along the channel dimension to accelerate convergence and reduce overfitting; next, two pointwise convolutions with K = 1 are used to expand and then restore the channel dimension, with a GELU activation inserted in between to enhance nonlinearity; finally, a residual connection adds the original input to the transformed features to produce the block output. The Downsample-1D module is used for downsampling between adjacent stages. It first applies LayerNorm to the input, and then performs a one-dimensional convolution with a kernel size of 2 and a stride of 2 to halve the sequence length and switch the channel dimension before outputting the result.
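To make the block layout concrete, the following is a minimal PyTorch sketch of the ConvNeXt Block-1D described above. The class and attribute names, the use of nn.Linear for the K = 1 pointwise convolutions, and the 1e-6 initialization of the scaling factor γ are implementation assumptions rather than details specified in this paper.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock1D(nn.Module):
    """Depthwise 1D conv (k=7) -> LayerNorm -> pointwise expand (4x) -> GELU
    -> pointwise project -> learnable scale -> residual add."""

    def __init__(self, dim: int, kernel_size: int = 7):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)            # normalizes the channel dim
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # k=1 conv expressed as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(1e-6 * torch.ones(dim))  # learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x                 # x: (B, C, L)
        x = self.dwconv(x)
        x = x.transpose(1, 2)        # (B, L, C) so LayerNorm sees channels last
        x = self.norm(x)
        x = self.pwconv2(self.act(self.pwconv1(x)))
        x = self.gamma * x
        x = x.transpose(1, 2)        # back to (B, C, L)
        return residual + x
```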

3.2. Sparse Mixture-of-Experts Classification Module

In the classification decision output layer of the model, a Sparse MoEs module is employed. Compared with dense expert architectures, sparse experts activate only a small subset of experts for each sample, offering advantages such as higher computational efficiency, more flexible parameter utilization, and improved generalization. The Sparse MoEs module adopted in this paper consists of a routing mechanism, multiple expert submodels, and an auxiliary loss term, and its overall architecture is illustrated in Figure 3. When a sample is fed into the module, the routing gate first generates selection weights $W_n$ over all expert models based on the input features. A Top-K strategy is then applied, where only the K experts with the largest weights are retained, and the weights of all remaining experts are set to zero, thereby achieving sparse selection of sub-expert models. To avoid unbalanced expert assignment, a load-balancing loss is introduced to stabilize routing and training, preventing a few experts from being overused while others remain idle. The selected experts each produce class scores, which are finally aggregated by a weighted summation according to their routing weights to obtain the output for the given sample.

3.2.1. Routing Gate Network

In a Sparse MoEs module, the expert routing gate is a critical component: it maps the input sample features to selection weights and, via a Top-K strategy, activates only the K experts with the highest weights, thereby achieving sparse computation while preserving model capacity. In this paper, we adopt a noisy Top-K affine linear routing strategy [15], as defined in Equations (1)–(3). Its advantage lies in the fact that, compared with nonlinear gating functions, the affine linear gate maintains an approximately proportional relationship between feature responses in the forward and reversed representations, thereby avoiding the amplification of small directional perturbations that typically arise from exponential normalization in softmax gating. In addition, the injected Gaussian noise acts as a form of stochastic regularization on the gating logits, encouraging the model to explore diverse expert configurations and effectively preventing expert collapse during training.
First, a biased linear function $g(x)$ is applied to the input representation $x$ to produce unnormalized logits. Gaussian noise $g_{\text{noise}}$ is then injected into the logits, which are subsequently normalized by a softmax function to obtain the selection probability vector over experts. Finally, a Top-K operation is applied to retain only the K experts with the highest probabilities while zeroing out the others, yielding a sparse probability distribution $G(x)$.
$G(x) = \text{Top-K}\big(\text{Softmax}(g(x) + g_{\text{noise}}),\, k\big)$ (1)
$g(x) = W \cdot x + b$ (2)
$\text{Top-K}(v, k)_i = \begin{cases} v_i, & \text{if } v_i \text{ is in the top-}k \\ 0, & \text{otherwise} \end{cases}$ (3)
To enhance sequence-direction invariance and enforce symmetry constraints, we constructed a symmetric routing mechanism based on the Top-K affine linear routing strategy. This mechanism jointly considers both forward and reversed feature sequences, encouraging the gate to learn direction-invariant routing distributions and achieve balanced expert utilization, as defined in Equation (4). The proposed gate generates two probability distributions, $G(x)_{\text{fwd}}$ and $G(x)_{\text{rev}}$, from forward and reverse feature-sequence perspectives, respectively. The reverse perspective is enabled only during training and is implemented by reversing the ordered feature sequence along the sequence dimension, thereby constructing a symmetric representation that is independent of a single directional feature ordering. Finally, the two routing probabilities are computed separately and averaged to form the symmetric routing distribution $G(x)_{\text{avg}}$.
$G(x)_{\text{avg}} = \frac{1}{2}\left(G(x)_{\text{fwd}} + G(x)_{\text{rev}}\right)$ (4)
This design exploits feature-sequence reversal symmetry to construct forward and reverse routing paths and averages their outputs, which helps suppress routing bias and improves the balance and stability of expert utilization. Based on the assumption that semantic information should remain invariant to the direction or ordering of the feature sequence, the gate is encouraged to produce consistent distributions in both forward and reverse feature representations, thereby forcing the decision basis to focus on robust semantic features rather than incidental ordering factors and effectively enhancing generalization. Moreover, the reverse routing computation is only enabled during training, while a single forward gate is used at inference time, resulting in virtually no additional inference overhead and yielding more robust expert selection and higher overall accuracy without increasing deployment cost.
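The following is a minimal PyTorch sketch of the symmetric noisy Top-K gate of Equations (1)–(4), assuming that the reverse-view gating feature is produced by a second backbone pass over the flipped sequence (as described in Section 3.3); the class and argument names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, Tuple

class SymmetricNoisyTopKGate(nn.Module):
    """Affine gate + Gaussian noise + softmax, averaged over the forward and
    reverse views (Eq. (4)), followed by Top-K sparsification (Eq. (3))."""

    def __init__(self, in_dim: int, num_experts: int, k: int,
                 noise_std: float = 0.05):
        super().__init__()
        self.g = nn.Linear(in_dim, num_experts)  # biased linear gate g(x), Eq. (2)
        self.k = k
        self.noise_std = noise_std

    def _probs(self, feat: torch.Tensor) -> torch.Tensor:
        logits = self.g(feat)
        if self.training:                        # noise injected only in training
            logits = logits + self.noise_std * torch.randn_like(logits)
        return F.softmax(logits, dim=-1)

    def forward(self, feat_fwd: torch.Tensor,
                feat_rev: Optional[torch.Tensor] = None
                ) -> Tuple[torch.Tensor, torch.Tensor]:
        probs = self._probs(feat_fwd)            # G(x)_fwd
        if self.training and feat_rev is not None:
            probs = 0.5 * (probs + self._probs(feat_rev))  # Eq. (4)
        vals, idx = probs.topk(self.k, dim=-1)   # keep the K largest weights
        sparse = torch.zeros_like(probs).scatter_(-1, idx, vals)
        return sparse, idx
```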

3.2.2. Expert Model

In the proposed Sparse MoEs module, the submodels consist of multiple single-layer BiLSTM experts, as shown in Figure 4. The structure is implemented as two parallel LSTM branches. For a given input sequence, the BiLSTM runs both LSTMs at each feature step, processing the data in the forward and reverse directions, and then concatenates the outputs of the two branches to produce the final representation. In this way, it can simultaneously capture information in both directions of the sequence and achieve a more comprehensive characterization of feature-sequence dependencies. By combining ConvNeXt’s large-kernel convolutions with the bidirectional sequence-dependency modeling capability of BiLSTM, the proposed model attains both strong representational capacity and a large receptive field while effectively capturing sequence order and long-term dependencies. As a result, it exhibits superior performance in modeling slowly varying trends, cross-feature correlations, and boundary patterns.
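A minimal sketch of one such expert is given below; pooling the BiLSTM output by taking its final step before the 2H-dimensional class head is an assumption, as the paper does not specify the pooling scheme.

```python
import torch
import torch.nn as nn

class BiLSTMExpert(nn.Module):
    """Single-layer BiLSTM expert producing class logits from Feat-Seq."""

    def __init__(self, in_dim: int, hidden: int, num_classes: int):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)  # concat of directions

    def forward(self, feat_seq: torch.Tensor) -> torch.Tensor:
        out, _ = self.bilstm(feat_seq)   # (B, S, C) -> (B, S, 2H)
        return self.head(out[:, -1, :])  # class logits from the final step
```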

3.2.3. Auxiliary Loss

To prevent the routing gate from assigning the vast majority of input samples to only a few sub-experts and to enable more efficient utilization of the sparse capacity, two auxiliary balancing terms, namely an importance loss and a load balance loss, are introduced in addition to the primary task loss during training. This strategy has been shown to significantly improve training stability and expert utilization [15]. The specific formulations are given in Equations (5)–(11), where $L_{\text{importance}}$ denotes the importance balancing loss, $L_{\text{load}}$ denotes the load balancing loss, and $L_{\text{aux}}$ denotes the combined auxiliary loss function.
$L_{\text{importance}} = w_{\text{importance}} \cdot \operatorname{CV}\left(\operatorname{Importance}(X)\right)^2$ (5)
$\operatorname{Importance}(X) = \sum_{x \in X} G(x)$ (6)
$\operatorname{CV}(\cdot) = \dfrac{\operatorname{std}(\cdot)}{\operatorname{mean}(\cdot)}$ (7)
$L_{\text{load}} = w_{\text{load}} \cdot \operatorname{CV}\left(\operatorname{Load}(X)\right)^2$ (8)
$\operatorname{Load}(X)_i = \sum_{x \in X} P(x, i)$ (9)
$P(x, i) = \Phi\left(\dfrac{(x \cdot W_g)_i - \operatorname{kth\_excluding}(H(x), k, i)}{\operatorname{Softplus}\left((x \cdot W_{\text{noise}})_i\right)}\right)$ (10)
$L_{\text{aux}} = L_{\text{importance}} + L_{\text{load}}$ (11)
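A sketch of the auxiliary terms is shown below. Note that the hard assignment count used for the load term here is a simplifying, non-differentiable stand-in for the smooth estimator $P(x, i)$ of Equation (10), which is what makes the load loss trainable in the original formulation [15].

```python
import torch

def cv_squared(x: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """Squared coefficient of variation, (std/mean)^2, cf. Eq. (7)."""
    if x.numel() <= 1:
        return torch.zeros((), device=x.device)
    return x.float().var() / (x.float().mean() ** 2 + eps)

def aux_balance_loss(gate_probs: torch.Tensor,
                     w_importance: float = 1.0,
                     w_load: float = 1.0) -> torch.Tensor:
    """Importance + load balancing terms (Eqs. (5)-(9), (11)).
    gate_probs: (B, M) sparse routing weights from the gate."""
    importance = gate_probs.sum(dim=0)             # Eq. (6): sum of gate values
    load = (gate_probs > 0).float().sum(dim=0)     # hard-count proxy for Eq. (9)
    return w_importance * cv_squared(importance) + w_load * cv_squared(load)
```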

3.3. Algorithmic Implementation Workflow

In the algorithm implementation, the pre-processed input is represented as a 2D feature tensor (B, L), where B denotes the batch size, and L the feature dimension. First, (B, L) is reshaped to (B, 1, L) to match the input interface of ConvNeXt-1D, and the backbone network outputs a global vector, Gate-Feat, and a sequence feature representation, Feat-Seq. Subsequently, the gating network generates expert selection probabilities based on Gate-Feat; during training, symmetric forward–reverse routing is introduced, and a Top-K sparse selection is applied. The activated BiLSTM sub-experts then model Feat-Seq, and their outputs are fused by a weighted aggregation according to the routing probabilities to produce the final prediction. Finally, the overall loss is defined as the sum of the cross-entropy primary loss and a weighted auxiliary balancing loss. The detailed procedure is summarized as follows.
Step 1: Input and stem convolution. The input (B, L) is reshaped to (B, 1, L) and then passed through a stem convolution with a kernel size of 4 and a stride of 4, which maps it to (B, 96, L/4) to match the input of the feature extraction module.
Step 2: Two-stage ConvNeXt-1D feature extraction. Two ConvNeXt-1D stages are constructed using Depths = (3, 3) and Dims = (96, 192). In each stage, three ConvNeXt blocks are connected in sequence, while a Downsample-1D module with a kernel size of 2 and a stride of 2 is applied between stages to halve the sequence length and increase the number of channels. Channel-wise normalization is performed at the end.
Step 3: Dual representation outputs. The feature extractor produces two outputs. The first is Gate-Feat (B, 192), obtained by global average pooling, which is used for gating. The second is Feat-Seq (B, L/8, 192), obtained after transposing along the feature sequence and channel dimensions. This is used as the input for BiLSTM sub-experts for sequence-level dependency modeling along the feature sequence axis.
Step 4: Gating and symmetric routing. Gate-Feat is first passed through a biased linear layer and normalized by a softmax function to obtain expert probabilities. During training, probabilities are recomputed on the feature-reversed input (B, 1, $L_{\text{flip}}$) and averaged with the forward ones to mitigate directional bias.
Step 5: Top-K sparse selection and expert fusion. A Top-K operation is applied to the averaged routing distribution, activating only the K experts with the highest probabilities and setting the remaining weights to zero. Each selected BiLSTM expert performs bidirectional sequence-level dependency modeling along the feature sequence axis on the Feat-Seq input and outputs class logits, which are then combined by a weighted summation according to the routing weights to produce the final prediction for the input data.
Step 6: Training objective. Cross-entropy is used as the primary loss to which the auxiliary losses with predefined coefficients are added, balancing classification accuracy against balanced utilization of experts.
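The following sketch ties Steps 1–5 together, reusing the module names from the earlier sketches; `backbone`, `gate`, and `experts` are assumed attributes, and experts are evaluated densely here for readability, whereas a real Top-K implementation would skip the zero-weight experts.

```python
import torch

def moe_forward(model, x: torch.Tensor) -> torch.Tensor:
    """Sketch of Steps 1-5 of the workflow above."""
    x = x.unsqueeze(1)                                 # Step 1: (B, L) -> (B, 1, L)
    gate_feat, feat_seq = model.backbone(x)            # Steps 2-3: dual outputs
    gate_feat_rev = None
    if model.training:                                 # Step 4: symmetric routing
        gate_feat_rev, _ = model.backbone(torch.flip(x, dims=[-1]))
    weights, _ = model.gate(gate_feat, gate_feat_rev)  # sparse (B, M) weights
    # Step 5: weighted fusion of per-expert class logits, (B, M, C) -> (B, C)
    expert_logits = torch.stack([e(feat_seq) for e in model.experts], dim=1)
    return (weights.unsqueeze(-1) * expert_logits).sum(dim=1)
```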

Calculation of Model FLOPs and Parameter Quantities

This section analyzes the computational complexity of the proposed model in terms of FLOPs and parameter count. The input is a tensor x of shape (B, 1, L), where B is the batch size and L is the feature dimension. For unified derivations, we first provide the complexity formulas of Conv1d and LayerNorm: the parameters and FLOPs of Conv1d are given in Equations (12) and (13), where $C_{\text{in}}$, $C_{\text{out}}$, $k$, and $g$ denote the input channels, output channels, kernel size, and number of groups, respectively; the parameters of LayerNorm are given in Equation (14), where $c$ is the normalized channel dimension.
$P_{\text{conv}} = C_{\text{out}} \left(\frac{C_{\text{in}}}{g}\right) k + C_{\text{out}}$ (12)
$F_{\text{conv}} \approx 2 L_{\text{out}} C_{\text{out}} \left(\frac{C_{\text{in}}}{g}\right) k$ (13)
$P_{\text{LN}}(c) = 2c$ (14)
(I)
Complexity analysis of the ConvNeXt-1D feature extraction module:
The ConvNeXt-1D feature extraction network begins with a stem layer for initial representation learning, implemented using a 1D convolution with a kernel size of $k_0 = 4$. This layer takes a single input channel and outputs $d_1$ channels. With a stride of $s_0 = 4$, the input sequence length is downsampled to $L_{\text{out}} = L/4$ at this stage. Accordingly, the parameter count and computational cost (FLOPs) of the stem module are given in Equations (15) and (16), respectively.
$P_{\text{stem}} = 5 d_1$ (15)
$F_{\text{stem}} \approx 2 L_{\text{out}} d_1 \cdot 1 \cdot k_0 = 2 L d_1$ (16)
In the ConvNeXt Block-1D, let the input channel dimension be $d$. This block comprises a 1D depthwise convolution with kernel size $k$ (with groups $= d$), a LayerNorm (LN) layer, two pointwise convolutions with channel mappings $d \to 4d$ and $4d \to d$, and a learnable scaling factor $\gamma$. The corresponding parameter count and FLOPs are given in Equations (17) and (18), respectively.
$P_{\text{block}}(d) = 8 d^2 + (k + 9) d$ (17)
$F_{\text{block}}(d, L) \approx 2 L \left(k d + 8 d^2\right)$ (18)
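These bookkeeping formulas can be checked mechanically. The helper below implements Equations (12) and (13) and verifies that summing the depthwise convolution, LayerNorm, two pointwise convolutions, and the scaling factor γ reproduces Equation (17); the helper names are ours.

```python
def conv1d_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Eq. (12): weight tensor plus one bias per output channel."""
    return c_out * (c_in // groups) * k + c_out

def conv1d_flops(l_out: int, c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Eq. (13): ~2 FLOPs (multiply + add) per multiply-accumulate."""
    return 2 * l_out * c_out * (c_in // groups) * k

# Sanity check of Eq. (17) for d = 96, k = 7.
d, k = 96, 7
p_block = (conv1d_params(d, d, k, groups=d)  # depthwise: dk + d
           + 2 * d                           # LayerNorm, Eq. (14)
           + conv1d_params(d, 4 * d, 1)      # pointwise d -> 4d
           + conv1d_params(4 * d, d, 1)      # pointwise 4d -> d
           + d)                              # learnable scale gamma
assert p_block == 8 * d**2 + (k + 9) * d     # matches Eq. (17)
```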
The ConvNeXt-1D network consists of multiple stages. For the $s$-th stage, let the channel width be $d_s$, the number of ConvNeXt Block-1D units be $n_s$, and the input sequence length be $L_s$. Then, the parameter count and computational cost (FLOPs) of this stage are given in Equations (19) and (20), respectively.
$P_{\text{stage}}(s) = n_s P_{\text{block}}(d_s)$ (19)
$F_{\text{stage}}(s) \approx n_s F_{\text{block}}(d_s, L_s)$ (20)
A Downsample-1D module is inserted between adjacent stages to perform channel transition and sequence downsampling. Let its input and output channels be $d_s$ and $d_{s+1}$, respectively, with a kernel size of 2, and the output sequence length satisfies $L_{s+1} = L_s / 2$. Then, the parameter count and computational cost (FLOPs) of the Downsample-1D module are given by Equations (21) and (22), respectively.
$P_{\text{down}}(s) = 2 d_s + 2 d_s d_{s+1} + d_{s+1}$ (21)
$F_{\text{down}}(s) \approx 4 L_{s+1} d_s d_{s+1}$ (22)
Therefore, assuming that the ConvNeXt-1D feature extraction network contains $S$ stages, its total number of parameters is given by Equation (23), where the final output normalization LayerNorm contributes $2 d_S$ learnable parameters. The corresponding total computational cost (FLOPs) is given by Equation (24). Note that the FLOPs of the output normalization are typically negligible compared with those of the convolutional backbone and the expert modules and are thus omitted in standard complexity accounting.
$P_{\text{backbone}} = P_{\text{stem}} + \sum_{s=1}^{S} P_{\text{stage}}(s) + \sum_{s=1}^{S-1} P_{\text{down}}(s) + 2 d_S$ (23)
$F_{\text{backbone}} \approx F_{\text{stem}} + \sum_{s=1}^{S} F_{\text{stage}}(s) + \sum_{s=1}^{S-1} F_{\text{down}}(s)$ (24)
(II)
Complexity analysis of the sparse mixture-of-experts module:
First, for the gating linear layer (Gate), let the input feature dimension be $d_S$ and the number of experts be $M$. Then, the parameter count and computational cost (FLOPs) of the gate are given by Equations (25) and (26), respectively.
$P_{\text{gate}} = d_S M + M$ (25)
$F_{\text{gate}} \approx 2 d_S M$ (26)
For a single BiLSTM expert, let the input feature dimension be $d_S$, the sequence length be $S_L$, the hidden size be $H$, and the number of layers be 1. Then, the parameter count and computational cost (FLOPs) of this expert are given by Equations (27) and (28), respectively.
$P_{\text{expert}} = 8 H (d_S + H + 2)$ (27)
$F_{\text{expert}} \approx S_L \cdot 16 H (d_S + H)$ (28)
The classification head is implemented as a linear layer with input dimension $2H$, which maps the features to $C_{\text{cls}}$ classes. Its parameter count and computational cost (FLOPs) are given by Equations (29) and (30), respectively.
$P_{\text{head}} = 2 H C_{\text{cls}} + C_{\text{cls}}$ (29)
$F_{\text{head}} \approx 2 (2H) C_{\text{cls}} = 4 H C_{\text{cls}}$ (30)
Although the model contains M experts, under the Top-K sparse routing mechanism, only K experts are activated for each sample during the forward pass. Therefore, the parameter count and computational cost (FLOPs) of the Sparse MoEs module are given by Equations (31) and (32), respectively.
$P_{\text{MoE}} = M \cdot P_{\text{expert}}$ (31)
$F_{\text{MoE}} \approx K \left(F_{\text{expert}} + F_{\text{head}}\right)$ (32)
Therefore, the total parameter count and FLOPs of the model during inference are given by Equations (33) and (34), respectively. It can be observed that the Sparse MoEs module increases conditional capacity by introducing more experts, while the inference-time computational cost is mainly governed by the Top-K value $K$. During training, the proposed symmetric gating requires an additional forward pass through the backbone and averages the forward and reverse gating probabilities; during inference, only the forward gate is used, and thus the inference FLOPs do not increase.
$P_{\text{total}} = P_{\text{backbone}} + P_{\text{gate}} + P_{\text{MoE}} + P_{\text{head}}$ (33)
$F_{\text{total}} \approx F_{\text{backbone}} + F_{\text{gate}} + F_{\text{MoE}}$ (34)
Overall, the computational cost of the proposed model is mainly dominated by two parts: (i) the pointwise and downsampling convolutions in the ConvNeXt-1D backbone and (ii) the BiLSTM expert computations in the Sparse MoEs module over the output sequence length $L_S$. In contrast, operations such as global average pooling, LayerNorm, GELU, softmax, and Top-K selection are typically non-dominant and are thus treated approximately in the FLOPs accounting (reported using “≈”). Notably, Sparse MoEs increases the parameter count with the number of experts $M$ to enhance capacity, while the inference-time computation is primarily governed by the Top-K value $K$, enabling performance gains under a bounded inference budget. Moreover, the proposed symmetric linear gating mainly increases training-time cost but introduces negligible overhead during inference.

4. Experiments

4.1. Experimental Setup

(I)
Experimental hardware environment: The experiments were conducted on a workstation running Windows 11, equipped with an Intel® Core™ i9-9900K processor (3.60 GHz), 32 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti GPU.
(II)
Experimental software environment: The proposed sparse-expert-based intrusion detection method for IoT networks was implemented using the PyTorch (version 2.5.1) framework. Data processing, numerical computation, and visualization were mainly carried out with libraries such as Pandas (version 2.2.3), NumPy (version 1.26.4), and Matplotlib (version 3.10.0).

4.2. Datasets and Pre-Processing

This study employs three datasets that are widely used in the IoT network security domain: CIC-IDS2018, TON-IoT, and BoT-IoT [47,48,49]. These datasets cover a wide range of network attack samples and thus provide evaluation scenarios with diverse feature spaces and attack patterns. Conducting comparative experiments on these datasets helps verify the generalization capability and stability of the proposed model across different application scenarios and under imbalanced data distributions.
During data pre-processing, the Comma-Separated Values (CSV) feature files of the three datasets were subjected to data cleaning, feature selection, and data balancing. First, during data cleaning, empty rows were removed, and fields unrelated to intrusion recognition, such as sequence numbers and binary flags, were discarded. To mitigate the impact of skewness and extreme values, continuous features were clipped above the 95th percentile. Numerical features with a large number of distinct values and highly skewed distributions were transformed using a logarithmic function to make them closer to a symmetric distribution. In addition, constant columns with all zeros or all ones were removed; hexadecimal fields were uniformly converted to decimals, and negative values and NaNs were standardized to 0 to ensure consistency in data types and value ranges. For non-numeric categorical features with a small number of unique values, one-hot encoding was applied, generating a binary indicator column for each category (1 indicating that the sample belongs to that category, and 0 otherwise). The original single-feature column was thus transformed into multiple feature dimensions, with the number of dimensions equal to the number of unique values in that column. For non-numeric features with many unique values, numerical encoding was used, mapping each distinct category to a fixed numeric value while ensuring that the same category was always mapped to the same value. This encoding strategy preserves the informative content of the features while improving the efficiency of data processing. After pre-processing, the number of features in the CIC-IDS2018, TON-IoT, and BoT-IoT datasets changed from 80, 126, and 46 to 71, 105, and 67, respectively.
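A condensed sketch of this cleaning pipeline is given below (hexadecimal conversion omitted); which columns are treated as skewed, low-cardinality, or high-cardinality is dataset-specific and is assumed to be supplied by the caller.

```python
import numpy as np
import pandas as pd

def clean_features(df: pd.DataFrame, skewed_cols: list,
                   low_card_cols: list, high_card_cols: list) -> pd.DataFrame:
    """Sketch of the cleaning steps described above for one CSV feature file."""
    df = df.dropna(how="all").copy()
    num_cols = df.select_dtypes(include=[np.number]).columns
    # Clip continuous features at the 95th percentile to curb extreme values
    df[num_cols] = df[num_cols].clip(upper=df[num_cols].quantile(0.95), axis=1)
    # Log-transform heavy-tailed columns toward a more symmetric distribution
    for c in skewed_cols:
        df[c] = np.log1p(df[c].clip(lower=0))
    # Standardize negatives and NaNs to 0, then drop constant columns
    df[num_cols] = df[num_cols].clip(lower=0).fillna(0)
    df = df.loc[:, df.nunique() > 1]
    # One-hot encode low-cardinality categoricals; integer-encode the rest
    df = pd.get_dummies(df, columns=[c for c in low_card_cols if c in df.columns])
    for c in high_card_cols:
        if c in df.columns:
            df[c] = df[c].astype("category").cat.codes
    return df
```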
The feature selection process effectively removes redundant and noisy features, significantly reducing computational and storage costs while preserving discriminative power, thereby improving the efficiency of model training and inference. Before feature selection, the data were normalized by mapping all numerical values to the (0, 1) range, thereby eliminating inconsistencies caused by differences in scale and units across features and improving the stability and efficiency of model training. After the above pre-processing, a random forest–based feature selection algorithm [50] was applied to compute an importance score for each feature, and these were then ranked in descending order; the top 60, 60, and 40 features were selected for CIC-IDS2018, TON-IoT, and BoT-IoT, respectively. We strictly preserved the original feature order from the CSV file, aiming to maintain the inherent correlations and structural relationships among features. After feature selection via a random forest scoring method, the top-ranked features were retained while keeping their original order unchanged. This ensured that the relative order among the most informative features was maintained, forming a consistent feature sequence for each record and avoiding additional randomness that could be introduced by shuffling the feature order.
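The selection step can be sketched as follows; the random-forest hyperparameters shown are assumptions, and sorting the selected indices is what restores the original CSV feature order.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(X_train, y_train, feature_names: list, top_k: int) -> list:
    """Rank features by random-forest importance, keep the top_k, and
    restore their original order from the CSV file."""
    rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    top_idx = np.argsort(rf.feature_importances_)[::-1][:top_k]
    # Sorting the selected indices preserves the original feature ordering
    return [feature_names[i] for i in sorted(top_idx)]
```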
Finally, each dataset was split into training and test sets in a 7:3 ratio, and the pronounced class imbalance among different attack types was further addressed. To mitigate class bias during training and enhance data symmetry, we applied a combination of random undersampling and SMOTE-Tomek [51,52] to the training set. The former moderately reduces majority-class samples, alleviating their dominance near the decision boundary; the latter first uses SMOTE to synthesize minority-class samples in feature space to improve their representativeness and coverage; it then employs Tomek links to remove overlapping and noisy instances between classes, thereby refining inter-class boundaries while performing oversampling. After this processing, the class proportions in the training set shift from being highly imbalanced to becoming approximately symmetric within both the majority and minority classes. This allows the model to train under a more uniform class prior, thereby alleviating decision boundary shifts caused by class imbalance and enhancing the robustness of generalization.
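A sketch of this dual-sampling step using imbalanced-learn is shown below; it is applied to the training set only, and the `majority_cap` knob that bounds the undersampled majority classes is an assumed parameterization.

```python
from collections import Counter
from imblearn.combine import SMOTETomek
from imblearn.under_sampling import RandomUnderSampler

def dual_sample(X_train, y_train, majority_cap: int):
    """Random undersampling of majority classes, then SMOTE oversampling
    of minorities followed by Tomek-link boundary cleaning."""
    counts = Counter(y_train)
    # Step 1: moderately reduce classes above the cap
    strategy = {c: min(n, majority_cap) for c, n in counts.items()}
    rus = RandomUnderSampler(sampling_strategy=strategy, random_state=42)
    X_u, y_u = rus.fit_resample(X_train, y_train)
    # Step 2: synthesize minority samples and remove overlapping instances
    X_b, y_b = SMOTETomek(random_state=42).fit_resample(X_u, y_u)
    return X_b, y_b
```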

4.3. Training, Evaluation, and Results Analysis

4.3.1. Training Loss Function

In the training process, the loss function is defined as a combination of the primary cross-entropy loss and the expert balancing auxiliary loss, thereby ensuring both accurate multi-class classification and balanced expert utilization with improved training stability, as shown in Equation (35). Here, $aux_{\text{coef}}$ denotes the weighting coefficient of the auxiliary loss, which controls the strength of its influence during training so as to reach a balance between stabilizing optimization and preserving performance; in our experiments, it was set to 0.001. All models considered in this study, including the proposed ConvNeXt–MoEs model and all baseline models, were optimized using the AdamW [53] optimizer with a fixed learning rate of 0.001 and a batch size of 128. AdamW provides adaptive parameter-wise update magnitudes based on first- and second-order gradient statistics, which improves convergence stability and reduces sensitivity to learning rate selection across different network architectures. To ensure fair comparison and reproducibility, the proposed model and all baseline models were trained under the same optimization settings and for 30 epochs.
$L_{\text{train}} = L_{\text{crossentropy}} + aux_{\text{coef}} \cdot L_{\text{aux}}$ (35)
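In code, the combined objective is a one-liner; the function below assumes the gate returns the scalar auxiliary loss of Equation (11).

```python
import torch.nn.functional as F

def training_loss(logits, targets, l_aux, aux_coef: float = 0.001):
    """Combined objective of Eq. (35): cross-entropy plus the weighted
    auxiliary balancing loss returned by the routing gate."""
    return F.cross_entropy(logits, targets) + aux_coef * l_aux
```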

4.3.2. Evaluation Metrics

In IoT network threat detection, performance evaluation is crucial. To comprehensively assess the model’s behavior across different application scenarios, this paper adopts multiple evaluation metrics for combined analysis, including accuracy, precision, recall, and F1-score, as defined in Equations (36)–(39).
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$ (36)
$\text{Precision} = \dfrac{TP}{TP + FP}$ (37)
$\text{Recall} = \dfrac{TP}{TP + FN}$ (38)
$\text{F1-Score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ (39)
In these equations, TP denotes true positives, the number of attack samples correctly identified as attacks; TN denotes true negatives, the number of normal samples correctly identified as normal; FP denotes false positives, the number of normal samples incorrectly predicted as attacks; and FN denotes false negatives, the number of attack samples incorrectly predicted as normal. Since the task in this paper is a multi-class threat detection problem, the overall model performance is evaluated using weighted average metrics.
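These weighted-average metrics can be computed directly with scikit-learn, as sketched below.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def weighted_metrics(y_true, y_pred) -> dict:
    """Weighted-average variants of Eqs. (36)-(39) for multi-class evaluation."""
    kw = dict(average="weighted", zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, **kw),
        "recall": recall_score(y_true, y_pred, **kw),
        "f1": f1_score(y_true, y_pred, **kw),
    }
```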

4.3.3. Comparative Analysis Experiments on the CIC-IDS2018 Dataset

Given that the original CIC-IDS2018 dataset provided by the official source contains 15 attack categories with a total of 16,233,002 samples, its scale is excessively large and highly skewed toward benign traffic, which makes it unfavorable for rapid experimentation and obscures the learning of minority attack behaviors. Therefore, rather than strictly reproducing real-world benign and attack ratios, this paper aims to construct an evaluation-oriented experimental subset that jointly preserves attack diversity, inter-class contrast, and computational feasibility.
To this end, a stratified random sampling strategy was first applied. Specifically, 0.5% of the majority Benign class was retained, 5% of the other non-minority attack classes were sampled, and all samples of minority attack classes were fully preserved. This design not only significantly reduces the dataset scale while maintaining representative attack patterns and meaningful class discrimination, but also ensures that the proportion of the mainstream Benign class remains the highest, aligning with the realistic network environment where benign traffic predominates. The class composition and sample counts of the resulting sampled dataset are reported in Table 1.
After this initial subsampling, the CIC-IDS2018 training set still exhibits noticeable class imbalance. Therefore, a dual-sampling strategy combining random undersampling and SMOTE–Tomek oversampling was further applied exclusively to the training set. Importantly, the test set was not subjected to any additional resampling or synthetic data generation beyond the initial stratified sampling, ensuring that the evaluation reflects the model’s generalization ability under a fixed and consistent class distribution. On the one hand, the Benign class was reduced to a proportion comparable to other high-frequency attack classes, alleviating its dominance near the decision boundary. On the other hand, SMOTE-Tomek was used to synthesize minority-class samples, such as SQL Injection and Brute Force-XSS, in the feature space and to remove overlapping instances between classes, thereby increasing the proportion and separability of minority classes. As a result, the distribution of the majority and minority classes becomes more symmetric and balanced. Figure 5 compares the class proportions of the training set before and after this balancing procedure.
The proposed algorithm was trained on the CIC-IDS2018 training set, and confusion matrices before and after the dual-sampling procedure were used for evaluation and comparative analysis, as shown in Figure 6. As shown in the figure, the proposed sampling scheme significantly improves the separability of minority classes by enhancing data symmetry. In particular, the classification accuracies of the minority attack classes Brute Force-XSS and SQL Injection increase markedly after balancing, while the change for Brute Force-Web, which is also a minority class, is relatively minor. At the same time, the recognition of DoS attacks-SlowHTTPTest and Infiltration is also improved. In contrast, although the recognition rates of majority classes such as Benign, DDoS attacks-LOIC-HTTP, DoS attacks-Slowloris, FTP-BruteForce, and SSH-Bruteforce decrease slightly, they still remain at a high and acceptable level. Overall, the dual-sampling strategy enables roughly half of the minority classes to achieve higher recognition accuracy, while the slight performance degradation of majority classes has only a limited impact on the overall results; the comprehensive evaluation shows that the overall accuracy improves by approximately 0.2%.
After pre-processing the CIC-IDS2018 dataset, the proposed ConvNeXt–MoEs detection algorithm was employed for model training and prediction. We configured the Sparse MoEs module with 12 sub-experts, and we set K = 6 in the Top-K routing strategy. During training, Gaussian noise with an intensity of 0.05 was injected into the routing gate, which was disabled during inference. The optimizer learning rate was set to 0.001, and the input batch size was 128. To highlight the performance advantages of the proposed method, two baseline models were constructed: a two-layer CNN with a kernel size of 3 and a single-layer BiLSTM, both trained under the same data pre-processing and training settings for fair comparison. In addition, we conducted a horizontal comparison with advanced methods that use the same dataset [54,55,56]. The overall results, summarized in Table 2, show that the proposed model achieves superior accuracy.

4.3.4. Comparative Analysis Experiments on the TON-IoT Dataset

To verify the generalizability of the proposed ConvNeXt–MoEs detection algorithm in IoT intrusion detection scenarios, we conducted further experiments on the representative IoT security benchmark dataset TON-IoT. The class imbalance analysis and t-SNE visualization conducted in the final experimental analysis indicate that the TON-IoT dataset exhibits relatively low class imbalance and good class separability; therefore, the additional gains brought by the dual-sampling strategy are limited. To avoid unnecessary pre-processing overhead, this strategy was not adopted in the formal experiments on TON-IoT. All other procedures for data cleaning and feature selection follow Section 4.2. In terms of model configuration, the Sparse MoEs module was configured with 12 experts and K = 6; Gaussian noise with an intensity of 0.05 was injected into the routing gate during training. The AdamW optimizer learning rate was set to 0.001, the input batch size to 128, and the number of training epochs to 30.
In the TON-IoT dataset, we selected the Windows 10 subset containing a total of 20,671 samples, with the attack types and their corresponding counts summarized in Table 3. To ensure a fair and leakage-free evaluation, the dataset was split into training and test sets using stratified random sampling with a ratio of 7:3 for each class, preserving the original class proportions in both subsets, and the resulting confusion matrix is shown in Figure 7. Given the limited sample size of the TON-IoT dataset, relying on a single fixed split may result in high evaluation variance and unstable results. To obtain a more reliable estimation and avoid potential bias caused by a particular data partition, a 5-fold stratified cross-validation was performed. The averaged results across all folds remained highly consistent, with an extremely low standard deviation of 0.01%, confirming the robustness and strong generalization capability of the proposed model.
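The 5-fold protocol can be sketched as follows; `fit_and_score` is an assumed callable that trains a fresh model on one fold and returns its test accuracy, and X, y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_cv(fit_and_score, X, y, n_splits: int = 5, seed: int = 42):
    """Stratified K-fold evaluation; returns mean and std of fold accuracies."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [fit_and_score(X[tr], y[tr], X[te], y[te])
              for tr, te in skf.split(X, y)]
    return float(np.mean(scores)), float(np.std(scores))
```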
The experimental results demonstrate that the proposed ConvNeXt–MoEs model achieves an excellent accuracy of 99.99 ± 0.01% on this dataset. A comparison with the baseline models and several state-of-the-art methods [57,58,59], as reported in Table 4, shows that the proposed method attains the best performance across all four evaluation metrics.

4.3.5. Comparative Analysis Experiments on the BoT-IoT Dataset

To further verify the generalization capability of the model, we employed the BoT-IoT dataset, which is more representative of real-world IoT attack distributions, to evaluate its cross-scenario transferability and its detection performance on typical IoT attacks such as DDoS and DoS. In the BoT-IoT experiments, we used the officially provided 5% subset. Given that the original dataset is likewise very large, with a total of 3,668,522 samples, we performed stratified sampling over the different attack types to construct more balanced training and test sets. In addition, in the original CSV files, the attack types are organized using two-level labels, namely category and subcategory. In our experiments, to avoid data sparsity and severe class imbalance caused by overly fine-grained labels, the subcategory level was removed, thereby improving training stability and the comparability of evaluation results. The specific attack categories and corresponding sample counts are listed in Table 5. To reduce computational overhead while preserving representative attack patterns, stratified sampling with class-dependent ratios was applied: 0.1% of samples were retained for the high-frequency DDoS and DoS classes, and 10% for the Reconnaissance class, with all samples preserved for the extremely small Normal and Theft classes. After sampling, the resulting dataset contained a total of 45,432 samples. Using stratified splitting with a 7:3 ratio, the training and test sets consist of 31,802 and 13,630 samples, respectively. All experiments were conducted with a fixed random seed of 42 to ensure strict reproducibility. The AdamW optimizer uses a learning rate of 0.001, a batch size of 128, and 30 training epochs. The Sparse MoEs module was configured with 12 experts and K = 6, and Gaussian noise with a strength of 0.05 was injected into the routing gate during training.
When investigating the impact of the stratified sampling strategy on the experimental results, we performed stratified sampling using random seeds of 0, 42, 84, and 128 under the same experimental setup. The results, presented in Table 6 with accuracy as the evaluation metric, show that varying the random seed leads only to minor fluctuations in accuracy (within 1%), confirming that the current stratified sampling method exhibits good robustness. A more systematic sensitivity analysis will be conducted in future work. With regard to the dual-sampling strategy, we first performed class balancing on the training set, and the class distribution before and after rebalancing is illustrated in Figure 8. Subsequently, we conducted a comparative analysis using confusion matrices with and without the dual-sampling procedure, as shown in Figure 9. Experimental results demonstrate that although the overall performance improvement is moderate, the classification accuracy for the “Reconnaissance” and “Normal” categories shows noticeable enhancement. In addition, the strategy increases the training accuracy by approximately 0.5% and the testing accuracy by about 0.07%. These findings indicate that the dual-sampling strategy, without altering the model architecture, effectively alleviates the bias caused by class imbalance and directly contributes to the improvement of classification performance. Further comparative results are reported in Table 7. The proposed method achieves an accuracy of 99.78%, representing an improvement of up to 3.16% over the CNN and BiLSTM baseline models. In addition, it consistently outperforms several state-of-the-art methods that use the same dataset [60,61,62], demonstrating a stable overall advantage.

4.3.6. Ablation Study

To evaluate the effectiveness of the proposed ConvNeXt–MoEs fusion detection algorithm, we conducted ablation experiments on the CIC-IDS2018 dataset to quantify the contributions of the ConvNeXt backbone, the symmetric routing gate, and the Sparse MoEs module to overall performance. The results in Table 8 show the individual contribution of each module and the effect of their combinations. We first evaluated the performance of using only ConvNeXt and of using only the Sparse MoEs with symmetric routing. Even with the auxiliary loss and dual-sampling strategy retained, removing ConvNeXt degrades all four metrics markedly, demonstrating its crucial role in global feature representation. Similarly, removing the Sparse MoEs module causes a significant drop in accuracy, recall, and F1-score, confirming the module's irreplaceable value in capturing bidirectional dependencies within the feature sequence. To isolate the effect of the symmetric routing gate, we replaced it with a single noisy Top-K affine (linear) routing strategy; all four metrics consistently decline, which confirms that, in intrusion detection scenarios, the proposed symmetric routing gate effectively mitigates routing bias, improves expert utilization, and thereby yields stable performance gains. Finally, when all three modules are used jointly, all four evaluation metrics reach their best values, clearly demonstrating a significant performance advantage on the CIC-IDS2018 dataset.
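To make the routing variants concrete, the sketch below implements a symmetric noisy Top-K gate that averages the routing probabilities computed on the ordered feature sequence and on its reversal, following the description in this paper; the layer shapes, the flattening choice, and the noise placement are our assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SymmetricNoisyTopKGate(nn.Module):
    """Sketch of a symmetric noisy Top-K routing gate (assumed shapes).

    Routing logits are computed with the same affine map on the flattened
    feature sequence in forward order and in reversed order; the two softmax
    distributions are averaged, making routing invariant to sequence
    direction, and only the Top-K experts receive nonzero weight.
    """

    def __init__(self, seq_len: int, d_model: int,
                 n_experts: int = 12, k: int = 6, noise_std: float = 0.05):
        super().__init__()
        self.w_gate = nn.Linear(seq_len * d_model, n_experts)
        self.k = k
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) ordered feature sequence
        logits_fwd = self.w_gate(x.flatten(1))
        logits_rev = self.w_gate(x.flip(dims=[1]).flatten(1))
        if self.training:  # Gaussian routing noise, strength 0.05 as in the text
            logits_fwd = logits_fwd + self.noise_std * torch.randn_like(logits_fwd)
            logits_rev = logits_rev + self.noise_std * torch.randn_like(logits_rev)
        # Symmetric averaging of forward and reverse routing probabilities.
        probs = 0.5 * (F.softmax(logits_fwd, dim=-1) + F.softmax(logits_rev, dim=-1))
        topk_p, topk_i = probs.topk(self.k, dim=-1)
        gates = torch.zeros_like(probs).scatter_(-1, topk_i, topk_p)
        return gates / gates.sum(dim=-1, keepdim=True)  # sparse, renormalized
```

The single-direction baseline used in the ablation corresponds to dropping the reversed branch and routing on logits_fwd alone.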
We conducted further ablation experiments on the auxiliary loss function and on the dual-sampling strategy that combines random undersampling with SMOTE-Tomek (RUS + SMOTE-Tomek). Removing the auxiliary loss causes a pronounced degradation in overall performance, with accuracy dropping from 94.08% to 93.17%, indicating that it helps stabilize expert routing and facilitates optimization during training. Disabling the dual-sampling strategy likewise degrades performance, with accuracy decreasing from 94.08% to 93.88%, suggesting that class balancing at the data level remains necessary for the CIC-IDS2018 dataset.
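This section does not restate the auxiliary loss formula; one standard choice for sparsely gated MoE layers, sketched here under that assumption, is the squared coefficient-of-variation penalty on per-expert importance [15], which is minimized when routing mass is spread evenly across experts.

```python
import torch

def importance_aux_loss(gates: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """Load-balancing penalty CV^2 = Var(importance) / Mean(importance)^2.

    gates: (batch, n_experts) sparse routing weights from the gating module.
    """
    importance = gates.sum(dim=0)  # total routing weight received by each expert
    return importance.var() / (importance.mean() ** 2 + eps)

# Typical usage: total = cross_entropy + aux_weight * importance_aux_loss(gates),
# where aux_weight is a small tuned coefficient.
```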
Since the number of sub-experts and the Top-K routing strategy jointly determine the effectiveness of sparse activation in the Sparse MoEs model, we conducted experiments under three configurations, with the number of sub-experts M set to 4, 8, and 12. We systematically analyzed the impact of different values of K on model accuracy and plotted the results as a line chart (Figure 10), in which the x-axis label "M-K" denotes the total number of experts M together with the number of selected experts K, and the y-axis gives the accuracy achieved by each configuration. This design spans lightweight to medium model budgets and makes it easy to observe the trade-off between expert parallelism and selection sparsity in terms of performance and computational overhead.
Since an excessively large K weakens model sparsity and dilutes the specialization of individual experts [63], we constrain K ≤ M/2 in the validation experiments. The results show that, for a fixed number of sub-experts, increasing K generally raises accuracy, most markedly with 12 experts: accuracy improves from 93.66% for (12-2) to 94.08% for (12-6), indicating that involving more experts in collaboration enhances prediction performance. In addition, when the number of selected experts is fixed, enlarging the expert pool yields a steady improvement in overall accuracy, rising from 93.47% for (4-2) to 93.62% for (8-2), and further to 93.67% for (12-2). This suggests that appropriately enlarging the expert pool benefits overall performance. We therefore adopt the (12-6) configuration as the final setting, which preserves sparse activation and expert specialization while achieving the highest overall metrics.
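The sweep itself reduces to a small grid search; in the sketch below, train_and_evaluate is a hypothetical stand-in for the full training pipeline that returns test accuracy for a given (M, K) configuration.

```python
# Grid over the configurations reported in Figure 10, respecting K <= M/2.
# train_and_evaluate(...) is assumed to exist (see lead-in above).
results = {}
for m in (4, 8, 12):          # total number of sub-experts M
    for k in (2, 4, 6):       # number of routed experts K
        if k > m // 2:
            continue          # preserve sparsity and expert specialization
        results[(m, k)] = train_and_evaluate(n_experts=m, top_k=k)

best_m, best_k = max(results, key=results.get)
print(f"best configuration: ({best_m}-{best_k}), "
      f"accuracy = {results[(best_m, best_k)]:.2%}")
```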

4.3.7. Analysis of Results

The experimental results in this section show that, on the CIC-IDS2018 and BoT-IoT datasets, the dual-sampling strategy combining random undersampling with SMOTE-Tomek oversampling makes the distribution of majority and minority classes more symmetric and balanced. The resulting confusion matrices confirm the effectiveness of this strategy and the necessity of achieving data symmetry. Specifically, after balancing, the classification accuracies of the minority classes Brute Force-XSS and SQL Injection improve markedly, the recognition rates of DoS attacks-SlowHTTPTest and Infiltration also rise, and the overall accuracy increases slightly, by about 0.1%, compared with the unbalanced setting. On the BoT-IoT dataset, recognition of the minority classes Normal and Reconnaissance likewise improves, with overall accuracy increasing by approximately 0.07% compared with the unbalanced setting.
However, the performance gain brought by the proposed dual-sampling strategy is comparatively limited on the TON-IoT dataset. To investigate the likely reason, we quantified the degree of class imbalance using the Gini impurity of the class distribution [64], defined in Equation (40), where $p_i$ denotes the proportion of samples belonging to the i-th class and $K$ is the total number of classes. We use this value as an operational index of distributional imbalance, with larger values taken to indicate a more imbalanced class distribution. As reported in Table 9, TON-IoT yields a Gini value of 0.694, lower than that of CIC-IDS2018 (0.823) and close to that of BoT-IoT (0.648). This suggests that TON-IoT exhibits a relatively more balanced class distribution, and therefore the additional benefit from resampling is less pronounced.
$$G = 1 - \sum_{i=1}^{K} p_i^{2} \tag{40}$$
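As a sanity check, Equation (40) can be evaluated directly from the class counts; using the BoT-IoT counts in Table 5 reproduces the value of 0.648 reported in Table 9.

```python
import numpy as np

def gini_impurity(counts) -> float:
    """Equation (40): G = 1 - sum_i p_i^2, with p_i the class proportions."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(1.0 - np.sum(p ** 2))

# BoT-IoT class counts from Table 5 (DDoS, DoS, Normal, Reconnaissance, Theft)
print(round(gini_impurity([19266, 16502, 477, 9108, 79]), 3))  # -> 0.648
```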
In addition, t-SNE [65] was employed to visualize the feature spaces of the TON-IoT and BoT-IoT datasets and examine class separability. The main settings were as follows: perplexity = 30, learning rate = 200, 2000 iterations, Euclidean distance, PCA initialization, and a fixed random seed of 42. As shown in Figure 11, the DoS, DDoS, and Reconnaissance classes in BoT-IoT overlap substantially, whereas most classes in TON-IoT form distinct, well-separated clusters. This observation suggests that higher feature separability allows the model to achieve strong classification performance even without resampling. Overall, the proposed sampling strategy is most beneficial for datasets with severe class imbalance and limited separability (e.g., CIC-IDS2018 and BoT-IoT), while its gains are comparatively limited on datasets with more balanced distributions and clearer inter-class differences (e.g., TON-IoT).
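With scikit-learn, the stated settings map directly onto the TSNE estimator; in the sketch below, X and y stand for a dataset's preprocessed feature matrix and labels, and in recent scikit-learn releases the n_iter argument is named max_iter.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Settings stated above: perplexity 30, learning rate 200, 2000 iterations,
# Euclidean distance, PCA initialization, fixed random seed 42.
embedding = TSNE(n_components=2, perplexity=30, learning_rate=200,
                 n_iter=2000, metric="euclidean", init="pca",
                 random_state=42).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, s=2, cmap="tab10")
plt.title("t-SNE visualization of the feature space")
plt.show()
```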
On the CIC-IDS2018, TON-IoT, and BoT-IoT datasets, the proposed ConvNeXt–MoEs detection algorithm was compared against CNN and BiLSTM baselines as well as state-of-the-art methods evaluated on the same datasets, and comprehensive multi-class classification results were reported to demonstrate its overall performance advantage. Ablation studies then systematically evaluated the individual contributions and joint effects of each component module: removing either the ConvNeXt backbone or the Sparse MoEs module degrades performance, whereas introducing the symmetric routing gate significantly improves overall recognition accuracy, indicating that the three components act synergistically to support the final performance. The ablation results further show that the auxiliary loss, acting as a routing regularization term, promotes more balanced expert activation, thereby stabilizing training and improving performance, while the RUS + SMOTE-Tomek strategy mitigates training bias caused by class imbalance and yields additional gains. Finally, we investigated the impact of the number of sub-experts and the Top-K setting, analyzing how different Top-K strategies affect performance under a limited expert budget and recommending a configuration that balances accuracy and sparsity. Together, these results verify the effectiveness and robustness of the proposed ConvNeXt–MoEs-based IoT network security threat detection algorithm across multiple datasets.

5. Conclusions and Future Work

This paper proposes an IoT network security threat detection algorithm that fuses ConvNeXt and Sparse MoEs and is intended for deployment on workstation-level edge servers connected to IoT gateways. The main innovation lies in combining the strong representation capability of ConvNeXt with the sparse expert decision mechanism of Sparse MoEs, and in incorporating a symmetric gating design that enhances sequence-direction invariance, thereby achieving more efficient and robust threat detection in IoT scenarios. At the model level, the design couples feature extraction with sparse expert decision-making: first, the original ConvNeXt model, designed for image tasks, was converted into a one-dimensional architecture to accommodate the sequential nature of network traffic; second, a bidirectional symmetric mechanism was introduced into the linear routing gate of the BiLSTM-based Sparse MoEs. On the one hand, BiLSTM captures bidirectional dependencies along the feature sequence and sharpens boundary discrimination, improving the completeness of feature representations; on the other hand, the symmetric routing gate averages forward and reverse routing probabilities, enabling more robust expert selection and higher overall accuracy. In the data pre-processing stage, we performed data cleaning and feature selection, and then applied random undersampling with SMOTE-Tomek to mitigate class imbalance and improve training separability, with larger benefits on more imbalanced and less separable datasets. Finally, the ablation experiments and Top-K sensitivity studies confirmed that the symmetric routing modification provides a distinct, independent performance gain, and that the introduction of Sparse MoEs is instrumental to the overall effectiveness of the detection algorithm. Experimental results demonstrate that the proposed method exhibits good applicability and stability across multiple datasets.
To quantify the inference efficiency of the proposed model, we measured its latency on the CIC-IDS2018 test set using our local workstation environment (the configuration is detailed in Section 4.1). Specifically, we randomly selected 1000 test samples and ran the model with gradient computation disabled, obtaining an average inference latency of 0.2829 ms per sample. The method incurs higher theoretical computational overhead than the baseline models; however, the results across all three datasets indicate that this additional cost yields a significant performance improvement. In future work, we will advance this research in the following directions: (I) further reduce computational overhead and improve inference speed through structured pruning, quantization, and operator-level optimizations to meet stricter resource constraints at the edge; (II) since different architectures can be sensitive to hyperparameters, follow a unified tuning protocol and optimize each model accordingly, with particular focus on MoE-related settings (the number of experts, Top-K selection, and the auxiliary loss weight) to further improve robustness and generalization; and (III) account for the dynamics and noise of real network environments by deploying and testing the model in real-time scenarios, evaluating its throughput, scalability, and robustness under large-scale, dynamic traffic conditions, so as to validate its adaptability and stability for practical IoT network security protection.
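A sketch of this timing protocol, assuming model and a list samples of 1000 preprocessed test tensors already reside on the target device; for GPU inference, torch.cuda.synchronize() calls would additionally be needed around the timed region.

```python
import time
import torch

model.eval()
latencies = []
with torch.no_grad():               # gradient computation disabled, as in the text
    for x in samples:               # 1000 randomly selected test samples
        start = time.perf_counter()
        _ = model(x.unsqueeze(0))   # single-sample inference (batch size 1)
        latencies.append(time.perf_counter() - start)

avg_ms = 1000.0 * sum(latencies) / len(latencies)
print(f"average latency: {avg_ms:.4f} ms/sample")
```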

Author Contributions

Conceptualization, K.Z. and J.Z.; methodology, R.Z.; software, C.L.; validation, J.Y.; formal analysis, K.Z.; investigation, J.Y.; resources, C.L.; data curation, J.Z.; writing—original draft preparation, J.Y.; writing—review and editing, J.Z.; visualization, J.Y.; supervision, R.Z.; project administration, J.Y.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Research on Key Technologies for Intelligent Diagnosis of Power Information Network Security (No. 521350240008).

Data Availability Statement

The data presented in this study are openly available in the following repositories: CIC-IDS2018 at https://www.unb.ca/cic/datasets/ids-2018.html (accessed on 26 December 2025), TON-IoT at https://ieee-dataport.org/documents/toniot-datasets (accessed on 26 December 2025), and BoT-IoT at https://ieee-dataport.org/documents/bot-iot-dataset (accessed on 26 December 2025).

Conflicts of Interest

Authors Kunshan Zhang and Renguang Zheng are employed by the Zhangzhou Power Supply Company, State Grid Fujian Electric Power Co., Ltd. The remaining authors declare that this research was conducted without any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Danladi, M.S.; Baykara, M. Low Power Wide Area Network Technologies: Open Problems, Challenges, and Potential Applications. Rev. Comput. Eng. Stud. 2022, 9, 2. [Google Scholar] [CrossRef]
  2. Zhou, Y.; Chen, X. Edge Intelligence: Edge Computing for 5G and the Internet of Things. Future Internet 2025, 17, 101. [Google Scholar] [CrossRef]
  3. Adewale, T.; Paul, J. AI, 5G, and IoT: How These Technologies Are Creating the Perfect Storm for Smart Systems. Available online: https://www.researchgate.net/publication/385855348_AI_5G_and_IoT_How_These_Technologies_Are_Creating_the_Perfect_Storm_for_Smart_Systems (accessed on 1 November 2024).
  4. Le Jeune, L.; Goedeme, T.; Mentens, N. Machine Learning for Misuse-Based Network Intrusion Detection: Overview, Unified Evaluation and Feature Choice Comparison Framework. IEEE Access 2021, 9, 63995–64015. [Google Scholar] [CrossRef]
  5. Jyothsna, V.; Prasad, R.; Prasad, K.M. A Review of Anomaly Based Intrusion Detection Systems. Int. J. Comput. Appl. 2011, 28, 26–35. [Google Scholar] [CrossRef]
  6. Ahmad, R.; Alsmadi, I.; Alhamdani, W.; Tawalbeh, L.A. Zero-Day Attack Detection: A Systematic Literature Review. Artif. Intell. Rev. 2023, 56, 10733–10811. [Google Scholar] [CrossRef]
  7. Buchta, R.; Gkoktsis, G.; Heine, F.; Kleiner, C. Advanced Persistent Threat Attack Detection Systems: A Review of Approaches, Challenges, and Trends. Digit. Threat. Res. Pract. 2024, 5, 1–37. [Google Scholar] [CrossRef]
  8. Debar, H.; Dacier, M.; Wespi, A. Towards a Taxonomy of Intrusion-Detection Systems. Comput. Netw. 1999, 31, 805–822. [Google Scholar] [CrossRef]
  9. Jordan, M.I.; Mitchell, T.M. Machine Learning: Trends, Perspectives, and Prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef]
  10. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  11. Saleem, T.J.; Chishti, M.A. Deep Learning for Internet of Things Data Analytics. Procedia Comput. Sci. 2019, 163, 381–390. [Google Scholar] [CrossRef]
  12. Idrissi, I.; Boukabous, M.; Azizi, M.; Moussaoui, O.; El Fadili, H. Toward a Deep Learning-Based Intrusion Detection System for IoT against Botnet Attacks. IAES Int. J. Artif. Intell. 2021, 10, 110. [Google Scholar] [CrossRef]
  13. Ghurab, M.; Gaphari, G.; Alshami, F.; Alshamy, R.; Othman, S. A Detailed Analysis of Benchmark Datasets for Network Intrusion Detection System. Asian J. Res. Comput. Sci. 2021, 7, 14–33. [Google Scholar] [CrossRef]
  14. Hussain, N.; Rani, P. Comparative Studied Based on Attack Resilient and Efficient Protocol with Intrusion Detection System Based on Deep Neural Network for Vehicular System Security. In Distributed Artificial Intelligence; Taylor & Francis: Abingdon, UK, 2020; pp. 217–236. [Google Scholar]
  15. Shazeer, N.; Mirhoseini, A.; Maziarz, K.; Davis, A.; Le, Q.; Hinton, G.; Dean, J. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. arXiv 2017, arXiv:1701.06538. [Google Scholar]
  16. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  17. Siami-Namini, S.; Tavakoli, N.; Namin, A.S. The Performance of LSTM and BiLSTM in Forecasting Time Series. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019; IEEE: New York, NY, USA, 2019; pp. 3285–3292. [Google Scholar]
  18. Rokach, L.; Maimon, O. Decision Trees. In Data Mining and Knowledge Discovery Handbook; Springer: Boston, MA, USA, 2005; pp. 165–192. [Google Scholar]
  19. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support Vector Machines. IEEE Intell. Syst. Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  20. Peterson, L.E. K-Nearest Neighbor. Scholarpedia 2009, 4, 1883. [Google Scholar] [CrossRef]
  21. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  22. Tahri, R.; Jarrar, A.; Lasbahani, A.; Balouki, Y. A Comparative Study of Machine Learning Algorithms on the UNSW-NB15 Dataset. ITM Web Conf. 2022, 48, 03002. [Google Scholar] [CrossRef]
  23. Kilincer, I.F.; Ertam, F.; Sengur, A. Machine Learning Methods for Cyber Security Intrusion Detection: Datasets and Comparative Study. Comput. Netw. 2021, 188, 107840. [Google Scholar] [CrossRef]
  24. Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for TensorFlow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 63–72. [Google Scholar]
  25. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; IEEE: New York, NY, USA, 2008; pp. 1322–1328. [Google Scholar]
  26. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  27. Yen, S.-J.; Lee, Y.-S. Cluster-Based Under-Sampling Approaches for Imbalanced Data Distributions. Expert Syst. Appl. 2009, 36, 5718–5727. [Google Scholar] [CrossRef]
  28. Hasanin, T.; Khoshgoftaar, T. The Effects of Random Undersampling with Simulated Class Imbalance for Big Data. In Proceedings of the 2018 IEEE International Conference on Information Reuse and Integration (IRI), Salt Lake City, UT, USA, 6–9 July 2018; IEEE: New York, NY, USA, 2018; pp. 70–79. [Google Scholar]
  29. Zhang, K.; Zheng, R.; Li, C.; Zhang, S.; Wu, X.; Sun, S.; Yang, J.; Zheng, J. SE-DWNet: An Advanced ResNet-Based Model for Intrusion Detection with Symmetric Data Distribution. Symmetry 2025, 17, 526. [Google Scholar] [CrossRef]
  30. Dangol, N.; Eaman, A.; Shakshuki, E.; Hassan, E. Impact of Resampling Techniques in Deep Learning Based Intrusion Detection: A Comparative Study on NSL-KDD and UNSW-NB15. Procedia Comput. Sci. 2025, 272, 84–91. [Google Scholar] [CrossRef]
  31. Zhang, J.; Ling, Y.; Fu, X.; Yang, X.; Xiong, G.; Zhang, R. Model of the Intrusion Detection System Based on the Integration of Spatial–Temporal Features. Comput. Secur. 2020, 97, 101946. [Google Scholar] [CrossRef]
  32. Li, Y.; Zhang, S.; Yang, H. Feature-Space Transformations for Robust Network Intrusion Detection. Expert Syst. Appl. 2023, 224, 119927. [Google Scholar]
  33. Sayem, I.M.; Sayed, M.I.; Saha, S.; Haque, A. ENIDS: A Deep Learning-Based Ensemble Framework for Network Intrusion Detection Systems. IEEE Trans. Netw. Serv. Manag. 2024, 21, 5809–5825. [Google Scholar] [CrossRef]
  34. Wang, Z.; Liu, Y.; He, D.; Chan, S. Intrusion Detection Methods Based on Integrated Deep Learning Model. Comput. Secur. 2021, 103, 102177. [Google Scholar] [CrossRef]
  35. Ncir, C.E.B.; HajKacem, M.A.B.; Alattas, M. Enhancing Intrusion Detection Performance Using Explainable Ensemble Deep Learning. PeerJ Comput. Sci. 2024, 10, e2289. [Google Scholar] [CrossRef]
  36. Ilias, L.; Doukas, G.; Lamprou, V.; Ntanos, C.; Askounis, D. Convolutional Neural Networks and Mixture of Experts for Intrusion Detection in 5G Networks and Beyond. arXiv 2024, arXiv:2412.03483. [Google Scholar] [CrossRef]
  37. Shanka, S.; Singh, D.; Badoni, A.; Shukla, M.K. Towards Robust IDS in Network Security: Handling Class Imbalance with Deep Hybrid Architectures. IEEE Netw. Lett. 2025, 7, 120–124. [Google Scholar] [CrossRef]
  38. Wang, L.; Sikdar, B.; Zhang, K.; Wang, Y. MoE-TransDLD: A Transformer-Driven Mixture of Experts for Cyber-Attack Detection in Power Systems. In Proceedings of the 2025 IEEE 19th International Conference on Control & Automation (ICCA), Tallinn, Estonia, 30 June–3 July 2025; IEEE: New York, NY, USA, 2025; pp. 511–516. [Google Scholar]
  39. Rahim, K.; Nasir, Z.U.I.; Ikram, N.; Qureshi, H.K. Integrating Contextual Intelligence with Mixture of Experts for Signature and Anomaly-Based Intrusion Detection in CPS Security. Neural Comput. Appl. 2025, 37, 5991–6007. [Google Scholar] [CrossRef]
  40. Mu, S.; Lin, S. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications. arXiv 2025, arXiv:2503.07137. [Google Scholar]
  41. Zhao, G.; Zhao, Y.; Yin, X.; Lin, L.; Zhu, J. Beyond Spurious Cues: Adaptive Multi-Modal Fusion via Mixture-of-Experts for Robust Sarcasm Detection. Mathematics 2025, 13, 3250. [Google Scholar] [CrossRef]
  42. Gao, Y.; Zhao, B.; Peng, H.; Bao, H.; Zhao, J.; Cui, Z. Bidirectional Temporal-Aware Modeling with Multi-Scale Mixture-of-Experts for Multivariate Time Series Forecasting. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Seoul, Republic of Korea, 10–14 November 2025; pp. 696–706. [Google Scholar]
  43. Xu, J.; Sun, X.; Zhang, Z.; Zhao, G.; Lin, J. Understanding and Improving Layer Normalization. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  44. Hendrycks, D. Gaussian Error Linear Units (GELUs). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  45. Zhao, Y.; Ge, L.; Cui, G.; Fang, T. Improved ConvNeXt Facial Expression Recognition Embedded with Attention Mechanism. In Proceedings of the International Conference on Applied Intelligence, Nanning, China, 8–12 December 2023; Springer Nature: Singapore, 2023; pp. 89–100. [Google Scholar]
  46. Wang, Z. Combining UPerNet and ConvNeXt for Contrails Identification to Reduce Global Warming. arXiv 2023, arXiv:2310.04808. [Google Scholar]
  47. A Realistic Cyber Defense Dataset (CSE-CIC-IDS2018). Canadian Institute for Cybersecurity. Available online: https://registry.opendata.aws/cse-cic-ids2018 (accessed on 17 February 2025).
  48. Moustafa, N. A New Distributed Architecture for Evaluating AI-Based Security Systems at the Edge: Network TON_IoT Datasets. Sustain. Cities Soc. 2021, 72, 102994. [Google Scholar] [CrossRef]
  49. Koroniotis, N.; Moustafa, N.; Sitnikova, E.; Turnbull, B. Towards the Development of Realistic Botnet Dataset in the Internet of Things for Network Forensic Analytics: BoT-IoT Dataset. Future Gener. Comput. Syst. 2019, 100, 779–796. [Google Scholar] [CrossRef]
  50. Hasan, M.A.M.; Nasser, M.; Ahmad, S.; Molla, K.I. Feature Selection for Intrusion Detection Using Random Forest. J. Inf. Secur. 2016, 7, 129–140. [Google Scholar] [CrossRef]
  51. Hancock, J.T., III; Khoshgoftaar, T.M. Exploring Maximum Tree Depth and Random Undersampling in Ensemble Trees to Optimize the Classification of Imbalanced Big Data. SN Comput. Sci. 2023, 4, 462. [Google Scholar] [CrossRef]
  52. Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek Link and SMOTE Approaches for Machine Fault Classification with an Imbalanced Dataset. Sensors 2022, 22, 3246. [Google Scholar] [CrossRef]
  53. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  54. Kim, J.; Kim, J.; Kim, H.; Shim, M.; Choi, E. CNN-Based Network Intrusion Detection against Denial-of-Service Attacks. Electronics 2020, 9, 916. [Google Scholar] [CrossRef]
  55. Jumabek, A.; Yang, S.S.; Noh, Y.T. CatBoost-Based Network Intrusion Detection on Imbalanced CIC-IDS-2018 Dataset. Korean Soc. Commun. Commun. J. 2021, 46, 2191–2197. [Google Scholar] [CrossRef]
  56. Umman Varghese, M.; Taghiyarrenani, Z. Intrusion Detection in Heterogeneous Networks with Domain-Adaptive Multi-Modal Learning. arXiv 2025, arXiv:2508.03517. [Google Scholar]
  57. Cherfi, S.; Boulaiche, A.; Lemouari, A. Enhancing IoT Security: A Deep Learning Approach with Autoencoder-DNN Intrusion Detection Model. In Proceedings of the 2024 6th International Conference on Pattern Analysis and Intelligent Systems (PAIS), El Oued, Algeria, 24–25 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–7. [Google Scholar]
  58. Chishti, F.; Rathee, G. ToN-IOT Set: Classification and Prediction for DDoS Attacks Using AdaBoost and RUSBoost. In Proceedings of the 2023 3rd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Greater Noida, India, 12–13 May 2023; IEEE: New York, NY, USA, 2023; pp. 2842–2847. [Google Scholar]
  59. Tareq, I.; Elbagoury, B.M.; El-Regaily, S.; El-Horbaty, E.S. Analysis of TON-IoT, UNW-NB15, and Edge-IIoT Datasets Using DL in Cybersecurity for IoT. Appl. Sci. 2022, 12, 9572. [Google Scholar] [CrossRef]
  60. Esmaeilyfard, R.; Shoaei, Z.; Javidan, R. A Lightweight and Efficient Model for Botnet Detection in IoT Using Stacked Ensemble Learning. Soft Comput. 2025, 29, 89–101. [Google Scholar] [CrossRef]
  61. Hussan, M.I.T.; Reddy, G.V.; Anitha, P.T.; Kanagaraj, A.; Naresh, P. DDoS Attack Detection in IoT Environment Using Optimized Elman Recurrent Neural Networks Based on Chaotic Bacterial Colony Optimization. Clust. Comput. 2024, 27, 4469–4490. [Google Scholar] [CrossRef]
  62. Syed, N.F.; Ge, M.; Baig, Z. Fog-Cloud Based Intrusion Detection System Using Recurrent Neural Networks and Feature Selection for IoT Networks. Comput. Netw. 2023, 225, 109662. [Google Scholar] [CrossRef]
  63. Zhou, Y.; Lei, T.; Liu, H.; Du, N.; Huang, Y.; Zhao, V.; Dai, A.M.; Le, Q.V.; Laudon, J. Mixture-of-Experts with Expert Choice Routing. Adv. Neural Inf. Process. Syst. 2022, 35, 7103–7114. [Google Scholar]
  64. Saabni, R.; Asi, A.; El-Sana, J. Text Line Extraction for Historical Document Images. Pattern Recognit. Lett. 2014, 35, 23–33. [Google Scholar] [CrossRef]
  65. van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Framework of the IoT network security threat detection algorithm.
Figure 2. ConvNeXt-1D feature extraction module.
Figure 3. Sparse MoEs classification module.
Figure 4. BiLSTM expert submodels in the Sparse MoEs module.
Figure 5. Attack distribution in the CIC-IDS2018 training set. (a) Raw data distribution in the training set. (b) Attack distribution in the training set after data balancing.
Figure 6. (a) Confusion matrix of the CIC-IDS2018 training set without dual-sampling balancing. (b) Confusion matrix after applying the dual-sampling balancing strategy.
Figure 7. Confusion matrix results based on the TON-IoT dataset.
Figure 8. Attack distribution in the BoT-IoT training set. (a) Raw data distribution in the training set. (b) Attack distribution in the training set after data balancing.
Figure 9. (a) Confusion matrix of the BoT-IoT training set without dual-sampling balancing. (b) Confusion matrix after applying the dual-sampling balancing strategy.
Figure 10. Results of Top-K strategies with different numbers of sub-experts.
Figure 11. (a) t-SNE visualization of the BoT-IoT feature space. (b) t-SNE visualization of the TON-IoT feature space.
Table 1. Composition of the sampled CIC-IDS2018 dataset.
Index | Attack Class | Count
0 | Benign | 67,424
1 | Bot | 14,310
2 | Brute Force—Web | 611
3 | Brute Force—XSS | 230
4 | DDoS attack—HOIC | 34,301
5 | DDoS attack—LOIC-UDP | 1,730
6 | DDoS attacks—LOIC-HTTP | 28,810
7 | DoS attacks—GoldenEye | 2,075
8 | DoS attacks—Hulk | 23,096
9 | DoS attacks—SlowHTTPTest | 6,994
10 | DoS attacks—Slowloris | 550
11 | FTP—BruteForce | 9,668
12 | Infiltration | 8,097
13 | SQL Injection | 87
14 | SSH—Bruteforce | 9,379
Table 2. Comparative performance of methods on the CIC-IDS2018 dataset.
Method | Accuracy | Precision | Recall | F1-Score
CNN | 92.87% | 93.83% | 92.87% | 92.02%
BiLSTM | 92.09% | 91.54% | 92.09% | 90.94%
CNN-Image [54] | 91.50% | - | - | -
CatBoost-Based [55] | 91.95% | - | - | -
MM-DNN [56] | 93.40% | - | - | -
ConvNeXt–MoEs | 94.08% | 93.68% | 94.08% | 93.22%
Table 3. Composition of the TON-IoT dataset.
Index | Attack Class | Count
0 | DoS | 4,539
1 | DDoS | 505
2 | Injection | 606
3 | XSS | 13
4 | Normal | 9,740
5 | Password | 3,594
6 | Scanning | 434
7 | MITM | 1,240
Table 4. Comparative performance of methods on the TON-IoT dataset.
Method | Accuracy | Precision | Recall | F1-Score
CNN | 99.96% | 99.96% | 99.96% | 99.96%
BiLSTM | 99.90% | 99.84% | 99.90% | 99.87%
SA+DNN [57] | 99.86% | 99.86% | 99.86% | 99.86%
AdaBoost [58] | 99.70% | - | - | -
Inception Time [59] | 98.30% | 98.30% | 98.30% | 98.30%
ConvNeXt–MoEs | 99.99 ± 0.01% | 99.99 ± 0.01% | 99.99 ± 0.01% | 99.99 ± 0.01%
Table 5. Composition of the BoT-IoT dataset.
Index | Attack Class | Count
0 | DDoS | 19,266
1 | DoS | 16,502
2 | Normal | 477
3 | Reconnaissance | 9,108
4 | Theft | 79
Table 6. Random seed sensitivity analysis of stratified sampling on the BoT-IoT dataset.
Dataset | Random Seed | Accuracy | Fluctuation Range
BoT-IoT | 0 | 99.66% | -0.12%
BoT-IoT | 42 | 99.78% | Baseline
BoT-IoT | 84 | 99.79% | +0.01%
BoT-IoT | 128 | 99.42% | -0.36%
Table 7. Comparative performance of the methods on the BoT-IoT dataset.
Method | Accuracy | Precision | Recall | F1-Score
CNN | 97.70% | 97.81% | 97.70% | 97.69%
BiLSTM | 96.55% | 96.76% | 96.55% | 96.54%
Stacked Ensemble Learning [60] | 99.30% | 99.20% | 99.00% | 99.10%
CBCO-ERNN [61] | 99.02% | 99.75% | 98.59% | 98.35%
RNN [62] | 99.55% | 99.99% | 99.02% | 99.49%
ConvNeXt–MoEs | 99.78% | 99.78% | 99.78% | 99.78%
Table 8. Ablation study on the CIC-IDS2018 dataset.
ConvNeXt | Symmetric Routing Gate | Sparse MoEs | Auxiliary Loss | RUS + SMOTE-Tomek | Accuracy | Precision | Recall | F1-Score
× | ✓ | ✓ | ✓ | ✓ | 92.94% | 92.90% | 92.94% | 92.26%
✓ | × | × | ✓ | ✓ | 93.33% | 94.06% | 93.33% | 92.57%
✓ | × | ✓ | ✓ | ✓ | 93.69% | 93.24% | 93.69% | 93.11%
✓ | ✓ | ✓ | × | ✓ | 93.17% | 93.25% | 93.17% | 92.50%
✓ | ✓ | ✓ | ✓ | × | 93.88% | 93.54% | 93.88% | 92.91%
✓ | ✓ | ✓ | ✓ | ✓ | 94.08% | 93.68% | 94.08% | 93.22%
✓ indicates that the corresponding module is enabled; × indicates that it is disabled.
Table 9. Comparison of Gini impurity values across different intrusion detection datasets.
Dataset | Classes | Gini
CIC-IDS2018 | 15 | 0.823
TON-IoT | 8 | 0.694
BoT-IoT | 5 | 0.648