An Intrusion Detection Method for the Internet of Things Based on Spatiotemporal Fusion

He, Junzhong; An, Xiaorui

doi:10.3390/math14030504

by Junzhong He ¹,* and Xiaorui An ²

¹ E-Commerce College, Longnan Normal University, Longnan 742500, China

² Information Center, Longnan Normal University, Longnan 742500, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(3), 504; https://doi.org/10.3390/math14030504

Submission received: 28 November 2025 / Revised: 6 January 2026 / Accepted: 23 January 2026 / Published: 30 January 2026

(This article belongs to the Special Issue New Advances in Network Security, Data Privacy, and Applied Cryptography)

Abstract

In the information age, Internet of Things (IoT) devices are increasingly susceptible to intrusion because of today’s complex network attack methods. Accurately detecting evolving network attacks in complex and ever-changing IoT environments has therefore become a key research goal in the intrusion detection field. Because IoT data has both spatial and temporal characteristics, this paper proposes a Spatiotemporal Feature Weighted Fusion Approach Combining Gating Attention Transformation (STWGA). STWGA consists of three parts: the spatial feature learning module, the gated attention transformer, and the temporal feature learning module. It integrates an improved convolutional neural network (CNN), batch normalization, and a Bidirectional Long Short-Term Memory network (Bi-LSTM) to learn the deep spatial and temporal features of the data and thereby extract global deep spatiotemporal features. The gated attention transformer adds a gating control mechanism to the self-attention module to improve detection accuracy. The experimental results show that STWGA has better spatiotemporal feature extraction ability and can effectively improve intrusion detection performance.

Keywords:

intrusion detection; deep learning; Internet of Things; spatiotemporal feature extraction; gated attention transformer

MSC:

68

1. Introduction

In recent years, computer technology has developed rapidly. Artificial intelligence, big data, and related technologies have moved from theory into indispensable applications in production and daily life. The IoT has expanded its coverage alongside these technologies and is now widely used in industrial production, healthcare, smart homes, smart grids, and public safety facilities. It has become a supporting force for the digital construction of related industries and an important component of the new generation of information technology [1]. However, while IoT brings convenience, it also introduces many security issues. Because of the flexible network topology of IoT and the large number of connected devices, such as heterogeneous multiprocessors [2,3], the hardware and software security of these devices is weak. The security standards of IoT terminal devices lag behind, making IoT highly vulnerable to malicious intrusion. Recently, many illegal organizations have exploited vulnerabilities in IoT devices, leading to frequent security incidents [4,5]. In short, the cybersecurity situation is not optimistic.

At present, various technologies are used to secure IoT. The main defense methods include identity verification, firewalls, encryption, authentication, and other passive defense technologies. These technologies can only react to external attacks and are generally powerless against most of them, so they cannot guarantee the security of IoT [6,7]. Unlike these passive technologies, intrusion detection systems are proactive security protection technologies that collect information from system and network traffic. They process network traffic characteristics to identify and mitigate network attacks [8]. At the same time, they issue alarms and support defense strategies that reduce the potential losses caused by network attacks and keep the system running normally.

The main function of an intrusion detection system is to monitor the traffic and nodes in the network and determine whether unknown behavior occurs in IoT devices, so that it can issue intrusion alerts to users. IoT intrusion detection is applied in various application scenarios, including smart home security, industrial control system protection, smart city infrastructure monitoring, medical equipment security, and intelligent transportation system protection. In these scenarios, IoT devices are numerous, widely distributed, and highly interconnected. Once compromised, these devices may not only cause personal privacy leakage and property damage but also affect public safety and production efficiency. Therefore, an accurate IoT intrusion detection system is crucial for ensuring the stable operation and data security of IoT systems [9].

The increase in IoT devices has brought about an increasingly complex network environment. The amount of network data is constantly increasing and becoming more complex. At the same time, new attack methods and techniques keep emerging, leading to security issues [10]. In the face of massive data, frequent attacks, and heterogeneous network intrusion detection scenarios in IoT, it is necessary to mine deep intrusion traffic characteristics automatically. Traditional methods, however, depend heavily on the database constructed during intrusion detection. When the database features are not rich enough or previously unseen abnormal data appears, such data is likely to go undetected. Therefore, traditional methods no longer have advantages in intrusion detection. In addition, both existing datasets and intrusion traffic collected in real IoT environments suffer from data imbalance. Intrusion traffic makes up a relatively small proportion of all traffic, which seriously affects an algorithm’s ability to learn intrusion traffic characteristics. At present, deep learning has achieved great success by stacking and combining multiple neural network layers, and introducing deep learning into intrusion detection research has become a trend [11]. The relationship between IoT, intrusion detection, and deep learning is becoming increasingly close, and together they are becoming an inseparable part of production and daily life.

With the maturity of artificial intelligence technology, applying deep learning to intrusion detection of network traffic is the current development trend [12]. Compared with other traditional methods, machine learning can discover potential patterns from data and adapt to dynamic changes in the network environment. Thereby, it achieves better results and deals with complex attack problems that traditional methods cannot solve [13]. Compared to earlier intrusion detection methods, deep learning models do not need to audit traffic data byte by byte and do not need to consider whether the content of traffic packets has been encrypted. They can extract characteristics of the data itself, increasing the ability to analyze traffic data.

The scale of network information continues to expand, and new network attack methods targeting networks, IoT devices, and systems emerge one after another [14]. This poses a significant challenge for intrusion detection technology. Whether they target traditional networks or IoT devices and systems, and despite changes in the external environment, most new network attacks are variants of old ones. Therefore, by extracting deep high-dimensional features and exploiting its strong self-learning ability, deep learning can identify subtle differences between types of network traffic and thereby detect newer attack behaviors [15].

With the surge of IoT data and the emergence of unknown attacks, traditional intrusion detection algorithms cannot fully explore the characteristics of IoT data, resulting in low accuracy. Deep learning algorithms have advantages in feature mining and IoT data processing, which makes them a natural fit for intrusion detection research. Therefore, this paper conducts in-depth research on intrusion detection methods in IoT, which is of practical significance. The main contributions are as follows:

(1): In data preprocessing, this paper introduces batch normalization to deal with the insufficient and uneven data distribution caused by small differences between IoT traffic categories. In addition, STWGA is proposed to address the problem of insufficient feature extraction from IoT data.
(2): In STWGA, an improved CNN is used to extract deep spatial feature information from the data. It is combined with a gated position-sensitive transformer self-attention mechanism, which effectively models contextual semantic information and global spatial relationships. In addition, to comprehensively capture the temporal features of the data and the dependencies between features, Bi-LSTM is used to extract global deep spatiotemporal features. Finally, softmax is used for classification.
(3): Experiments verify that STWGA has better spatiotemporal feature extraction ability and can effectively improve the intrusion detection performance of IoT.

Section 2 presents related literature on IoT intrusion detection. Section 3 introduces the deep learning intrusion detection approach based on STWGA. Section 4 designs comparative experiments for analysis. Finally, Section 5 is the conclusion.

2. Literature Review

The rapid development of IoT has brought unprecedented opportunities, but it has also been accompanied by severe security challenges. Deep learning is adept at handling large datasets, extracting deep features from raw data, and representing data in a better form. Representative models include CNNs, Recurrent Neural Networks (RNNs), and their variants such as Long Short-Term Memory networks (LSTM), which show outstanding performance in feature extraction, recognition, and anomaly detection.

Dhelim et al. [16] proposed a trust management system (TMS) for large-scale IoT systems, called Trust2Vec. With the popularity of deep learning and the maturing of its theory and practice, researchers began to combine deep learning with intrusion detection, achieving good results. Dina et al. [17] designed an efficient and accurate IoT intrusion detection system that used the focal loss function. The focal loss function dynamically adjusted the weights of different samples, making the model pay more attention to difficult samples. Zheng et al. [18] proposed an intrusion detection method based on stacking. This method integrated multiple algorithms, such as Support Vector Machine (SVM), Backpropagation Neural Network (BPNN), K-Means Clustering (K-Means), and Extreme Gradient Boosting (XGBoost), into a set of base classifiers. It significantly improved overall detection performance, far exceeding the capability of a single classifier. Khan et al. [19] introduced a two-stage intrusion detection system called TSDL. In the first stage, the system used a deep stacked autoencoder to reduce data redundancy and feature dimensionality. In the second stage, the system used a softmax classifier for prediction and achieved significant accuracy improvements on both the KDD Cup 99 and UNSW-NB15 datasets. Li et al. [20] used the GFR feature reduction method to select the 19 most relevant features for intrusion detection on the KDD Cup 99 intrusion detection dataset. They improved the Support Vector Machine using a clustering algorithm and an ant colony algorithm, ultimately achieving an accuracy of 98.6%. Yang et al. [21] used a combination of Bi-LSTM and RNN in anomaly detection and multi-class attack recognition tasks. This method achieved an average accuracy of 93.00%, demonstrating higher performance than traditional LSTM models. However, the generalization ability of the system was somewhat limited. Yin et al. [22] developed a new intrusion detection model based on an RNN. Compared with traditional deep neural networks (DNNs), the unique feature of RNNs was that they allow information to circulate between network layers. Although this algorithm adopted the basic RNN structure, it did not explore improvements in network complexity or further optimization strategies. Kunang et al. [23] used Autoencoder (AE) and DNN models to adjust various deep structure models. They employed optimization strategies to determine numerical hyperparameter values and optimal classification hyperparameter settings, enhancing performance and yielding favorable outcomes. J. Kim et al. [24] devised a convolutional neural network-based intrusion detection system aimed at recognizing specific subcategories of DoS attacks in network traffic. The proposed model was capable of efficiently identifying DoS attacks with similar characteristics. Given that CNNs typically process images, they created two kinds of intrusion-related images from traffic data. Two tasks, binary classification and multi-classification, were performed on each scenario, and the scenario with the highest performance was proposed. Y. Imrana et al. [25] proposed for the first time the application of Bi-LSTM in intrusion detection. It reduced the false-alarm rate of intrusion detection systems and improved their detection ability against different types of attacks. It could learn the semantics of normal data or attack types from labeled datasets and use the acquired knowledge to accurately classify unknown datasets. Tian et al. [26] proposed an intrusion detection method based on an improved Deep Belief Network (DBN). Their method significantly improved accuracy and the false positive rate (FPR). However, most of the structural parameters of the DBN could only be determined through experiments. In addition, the uncertainty in selecting DBN parameter values had a certain impact on detection accuracy.

In the traffic data detection scenario of the IoT, deep learning showed excellent performance. Derhab et al. [27] creatively incorporated temporal convolution into convolutional neural networks, devising an intrusion detection approach for IoT data grounded in temporal convolutional neural networks. This approach enhanced the detection accuracy effectively without expending excessive training time. Ma et al. [28] developed a specialized feature extraction algorithm tailored for connected vehicles. They then utilized a lightweight Gated Recurrent Unit (GRU) to build an intrusion detection system that can be installed and operated within vehicles. El Sayed et al. [29] put forward an improved deep neural network (DNN) algorithm, which picked the optimal training subset, thereby improving the capacity to detect abnormal traffic within the dataset. Van et al. [30] utilized an enhanced model to extract TCP/IP traffic features and conduct learning.

Combining various basic methods can enhance traditional single-network architectures and yield superior feature extraction outcomes. Georgiades et al. [31] presented a cross-layer intrusion detection framework that used interpretable artificial intelligence methods to enhance the transparency of attack detection. Yao et al. [32] used the XGBoost algorithm, CNN, and Transformer to establish feature associations. Cao et al. [33] designed a combined network model integrating a temporal convolutional network (TCN) and GRUs. Lee et al. [34] proposed a model integrating Autoencoder (AE) and Generative Adversarial Networks (GANs) for intrusion detection systems. In their study of anomaly detection within smart home networks, Elsayed et al. [35] combined Bi-LSTM and CNN to construct a hybrid model. This aimed to accurately detect abnormal situations and keep smart home networks operating stably. Shone et al. [36] proposed a classification model that outperformed other models in experiments and indicated the potential for enhancing autoencoders. The summary of related work is shown in Table 1.

3. IoT Intrusion Detection Approach

In IoT, with the increasing number and complexity of devices, security threats are becoming increasingly severe. Intrusion detection has become one of the key technologies to ensure system security. The STWGA-based IoT intrusion detection method proposed in this article aims to achieve efficient identification of potential threats in complex IoT environments. This method not only integrates CNN and a gated attention transformer mechanism but also combines Bi-LSTM, achieving significant performance improvements in spatial feature extraction, contextual semantic modeling, and temporal feature capture.

3.1. IoT Data Preprocessing

For the sample data in the original dataset of IoT, there are inconsistencies in the data types and dimensions of various network traffic. Some dimension data values are too large, while others are extremely small. This can cause the model to be affected by the data during training and not achieve ideal indicators. In order to extract features that can represent the information of the original IoT data to the maximum extent, it is important to preprocess the data in the original dataset. The specific operations are as follows.

1.: One-hot encoding

In raw IoT data, ID features representing serial numbers, proto features representing transport-layer protocols, and service features representing application-layer services are not continuous variables but discrete numerical or string variables. We process these features with one-hot encoding. Most machine learning models operate in Euclidean space, and one-hot encoding maps the values of a discrete feature into that space so that each value corresponds to a specific point. If a feature has n possible values, one-hot encoding represents them as n binary vectors of dimension n, namely [1, 0, …, 0], [0, 1, …, 0], …, [0, 0, …, 1]. Exactly one entry of each vector is 1. This step therefore expands the feature dimension. One-hot encoding is defined in Equation (1).

x_{\text{one-hot}}(v_{k}) = [\delta_{1k}, \delta_{2k}, \dots, \delta_{nk}]^{T}

(1)

Here, \delta_{ik} is the Kronecker delta function, which satisfies Equation (2).

\delta_{ik} = \begin{cases} 1, & i = k \\ 0, & i \neq k \end{cases}

(2)
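As a minimal illustration of Equations (1) and (2), the following Python sketch builds a one-hot vector from a list of categories. The function name `one_hot` and the three-protocol vocabulary are assumptions made for this example and are not taken from the dataset.

```python
def one_hot(value, categories):
    """Return the one-hot vector of `value`, following Equation (1)."""
    k = categories.index(value)  # raises ValueError for an unseen category
    # Kronecker delta of Equation (2): entry i is 1 only when i == k
    return [1 if i == k else 0 for i in range(len(categories))]


# Hypothetical protocol vocabulary, for illustration only
protocols = ["tcp", "udp", "icmp"]
encoded = one_hot("udp", protocols)
print(encoded)  # [0, 1, 0]
```

Each category gets its own axis, so a feature with n values always produces a vector of length n with a single non-zero entry.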

2.: Embedding encoding

In deep learning scenarios, there are numerous discrete features. When one-hot encoding is applied widely, the feature vectors of the processed traffic data become extremely sparse. However, deep learning models are better suited to dense feature vectors than to sparse ones. Therefore, an Embedding layer is used. The proto feature in the dataset represents the transmission protocol and has 131 categories. With one-hot encoding, this feature becomes a 131-dimensional vector in which only one entry is 1, a highly sparse structure. To optimize this structure, we embed the one-hot encoding results generated in (1) to compress the vectors while representing the information effectively.
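The compression performed by an embedding layer can be sketched as a table lookup, which is equivalent to multiplying the one-hot vector by a weight matrix. In the sketch below, the 131 categories come from the text, while the embedding dimension of 8, the random initialization, and all function names are assumptions for illustration. In a real model the table entries would be learned during training.

```python
import random


def make_embedding(num_categories, dim, seed=0):
    """Create a randomly initialized embedding table (one row per category)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(num_categories)]


def embed(index, table):
    """Look up the dense vector of a category index."""
    return list(table[index])


def embed_via_one_hot(index, table):
    """Same result, computed as one-hot vector times the table."""
    n, dim = len(table), len(table[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(n)]
    return [sum(one_hot[i] * table[i][d] for i in range(n))
            for d in range(dim)]


table = make_embedding(num_categories=131, dim=8)  # dim=8 is an assumption
dense = embed(5, table)
print(len(dense))  # 8
```

The lookup avoids ever materializing the 131-dimensional sparse vector, which is why embedding layers are usually implemented this way.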

3.: Normalization

After digitization, the feature attributes differ greatly in scale because their numerical ranges differ significantly. Normalization eliminates these scale differences between features and reduces the impact of large-magnitude values. This paper adopts min-max normalization [37], which uses the minimum and maximum of each feature to rescale its values linearly into [0, 1], as given in Equation (3). A feature whose values are all equal is mapped to 0.

f(x) = \begin{cases} \frac{x - x_{\min}}{x_{\max} - x_{\min}}, & x_{\max} \neq x_{\min} \\ 0, & x_{\max} = x_{\min} \end{cases},

(3)
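A minimal Python sketch of min-max scaling into [0, 1], with the constant-feature case guarded so that no division by zero occurs, could look as follows. The function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Scale a feature column linearly into [0, 1]."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Constant feature: map every value to 0 instead of dividing by zero
        return [0.0 for _ in values]
    span = x_max - x_min
    return [(x - x_min) / span for x in values]


# Hypothetical flow durations, for illustration only
durations = [2.0, 4.0, 10.0]
print(min_max_normalize(durations))  # [0.0, 0.25, 1.0]
```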

To address the core pain points of IoT traffic data, namely abundant discrete features, sparse dimensions, heterogeneous scales, and uneven distribution, the proposed preprocessing pipeline offers clear advantages over traditional single-step preprocessing. Traditional preprocessing often applies one-hot encoding to all discrete features, which easily leads to high-dimensional sparsity and increases the computational overhead of the model. Normalization alone removes scale differences but cannot solve the unstable training caused by data distribution shift. This paper preserves the semantic integrity of low-cardinality discrete features by using one-hot encoding. For high-cardinality discrete features, embedding encoding maps sparse high-dimensional vectors to dense low-dimensional vectors, which effectively reduces feature dimensionality and computational complexity. At the same time, batch normalization is integrated into the pipeline to accelerate model convergence, alleviate vanishing gradients, and eliminate the interference of scale differences with model training.

3.2. A Deep Learning Intrusion Detection Approach Based on STWGA

3.2.1. STWGA Architecture Overview

IoT data contains not only spatial information, such as flow features and content features, but also time series information, so consecutive features exhibit strong temporal dependence. A convolutional neural network alone can only extract the current spatial features, which is not ideal for temporal features that depend on earlier and later context over a certain duration. Moreover, many features learned by a convolutional neural network are forgotten during further learning. Therefore, this paper uses a CNN to extract the spatial features of network data and a Bi-LSTM network to extract deeper temporal features.

In response to the dual nature of IoT traffic data, which combines spatial structural features and temporal dependency features, CNN extracts traffic spatial features, and Bi-LSTM captures traffic temporal features as inputs. A dynamic correlation mapping of spatiotemporal features is constructed through a gated attention transformer. This is the process of assigning adaptive weights that are positively correlated with the importance of spatial and temporal features, then achieving deep fusion of the two types of features through feature concatenation and dimension transformation, ultimately generating a global spatiotemporal feature representation that can comprehensively characterize the traffic attack patterns of IoT. The core of spatiotemporal weighted fusion lies in breaking through the limitation of fixed feature weights in traditional spatiotemporal feature fusion methods, dynamically adjusting weights through a gated attention mechanism, which can more accurately match intrusion detection tasks.

In addition, this paper deeply integrates federated learning with edge computing to achieve an efficient detection paradigm that does not migrate data and cooperatively optimizes models. This method constructs a three-level hierarchical architecture consisting of local training for edge terminals, aggregation and updating of edge gateways, and global optimization in the cloud. It shifts the training task of intrusion detection models to edge nodes, avoiding privacy leakage and communication overhead caused by cross-device transmission of raw traffic data. This naturally adapts to the distributed deployment characteristics of the Internet of Things. At the level of algorithm design, STWGA extracts spatial features from sequence data through convolution and sets up two convolutional layers to learn and extract feature subsets. Then, its data is input into the gated attention transformer. Finally, the data is input into the temporal feature learning module to extract temporal features. Effectively combining modules to extract spatiotemporal features avoids the performance impact of extracting only a single feature, making the model more rational. The STWGA detection framework is composed of three components: the spatial feature learning part, the gated attention transformer, and the temporal feature learning module, as shown in Figure 1.
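The three-level hierarchy described above can be illustrated with a small aggregation sketch. The text does not name the aggregation rule, so the sample-weighted averaging used here (in the style of federated averaging) is an assumption, as are the function names and the representation of a model as a flat list of weights.

```python
def weighted_average(models, sample_counts):
    """Average weight vectors, weighting each by its number of samples."""
    total = sum(sample_counts)
    dim = len(models[0])
    return [sum(m[d] * n for m, n in zip(models, sample_counts)) / total
            for d in range(dim)]


def hierarchical_aggregate(gateways):
    """Aggregate terminal models at each gateway, then in the cloud.

    `gateways` is a list with one entry per edge gateway; each entry is a
    list of (weights, n_samples) pairs reported by its edge terminals.
    """
    gateway_models, gateway_counts = [], []
    for terminals in gateways:
        models = [weights for weights, _ in terminals]
        counts = [n for _, n in terminals]
        gateway_models.append(weighted_average(models, counts))
        gateway_counts.append(sum(counts))
    # Cloud level: combine the gateway models, again weighted by data size
    return weighted_average(gateway_models, gateway_counts)


# Two gateways; only model weights and sample counts leave the terminals
gateways = [[([1.0], 1), ([3.0], 1)], [([5.0], 2)]]
print(hierarchical_aggregate(gateways))  # [3.5]
```

Under this assumed rule, raw traffic never leaves the edge terminals, and the two-stage average gives the same result as a single sample-weighted average over all terminals.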

From the perspective of data transmission in IoT, data is composed of messages, datagrams, packets, frames, and so on. From top to bottom, it presents a clear hierarchical structure, that is, spatial structural features. From the perspective of traffic collection, traffic is strictly aggregated over time. It is a data sequence formed in chronological order, with distinct temporal characteristics. In terms of spatial structural characteristics, network traffic correlates strongly with natural language, and one-dimensional CNNs have made good progress in natural language processing. STWGA therefore uses a CNN to learn spatial structural characteristics, with two convolutional layers for learning and extracting feature subsets. Two max-pooling layers then retain the salient features of the traffic and produce a new feature map. Batch normalization (BN) [38] is introduced layer by layer to keep the data distribution consistent across layers, which addresses the problems of long training time and vanishing gradients. Based on the characteristics of one-dimensional network traffic data, the improved convolutional neural network is combined with the gated attention transformer. The resulting intrusion detection approach for IoT traffic data extracts and learns the global spatiotemporal characteristics of the data more deeply.

To address the information loss resulting from feature reconstruction in the pooling layer, an attention mechanism is adopted to enhance the salient features obtained. Assigning different weights to each feature in the new feature map and minimizing the loss function accelerates the learning and recognition of traffic features. This helps address the low accuracy caused by uneven data distribution. Because data flow in IoT is dynamic, the distribution of data may change over time, so the temporal characteristics of traffic must also be learned. LSTM effectively mitigates the exploding and vanishing gradient problems by using cell state information together with input, forget, and output gates. Bi-LSTM combines a forward LSTM and a backward LSTM, so it can fully learn the changes in temporal characteristics caused by sudden traffic changes before and after a network attack occurs. Therefore, STWGA employs Bi-LSTM to acquire the temporal features of traffic data.

3.2.2. STWGA Intrusion Detection Process

The construction of STWGA mainly involves the following steps.

Step 1: Spatial feature learning takes the traffic feature set D = \{Z_{i}\}_{i=1}^{N} in IoT and feeds it into the traffic spatial feature learning module to learn spatial features. The convolutional network uses a 3 × 1 shared convolution kernel to extract the characteristics of each local region. The details are given in Equations (4) and (5), where W is the weight matrix of the convolution kernel and B_{i} is the bias term. The activation function f is the rectified linear unit (ReLU) in Equation (5), which can accelerate model convergence and help alleviate overfitting.

Y_{i}^{ck_1} = f(W \cdot Z_{i} + B_{i}),

(4)

f(x) = \max(0, x),

(5)

The convolution output is then passed through a max-pooling layer, as given in Equation (6).

Y_{i}^{cp_1} = \max_{i \in N} Y_{i}^{ck_1},

(6)

To obtain a more comprehensive representation of the data features, a second convolutional layer and pooling layer are added, so that STWGA contains two convolutional layers and two max-pooling layers. Specifically, the feature vector Y_{i}^{cp_1} produced by the first CNN layer is processed using Equations (7) to (9) to obtain the feature vector Y_{i}^{cp_2} of the two-layer CNN.

Y_{i}^{ck_2} = g(W \times Y_{i}^{cp_1} + B_{i}^{'}),

(7)

g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},

(8)

Y_{i}^{cp_2} = \max_{i \in R} Y_{i}^{ck_2},

(9)
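The two convolution and pooling stages of Equations (4) to (9) can be sketched for a single one-dimensional feature vector as follows. The kernel of length 3 follows the text, whereas the pooling window of 2, the absence of padding, the single channel, and all names and numeric values are assumptions made to keep the example short.

```python
import math


def conv1d(x, kernel, bias):
    """Slide a shared kernel over x (no padding, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def relu(v):
    """Equation (5): element-wise max(0, x)."""
    return [max(0.0, a) for a in v]


def tanh(v):
    """Equation (8): element-wise hyperbolic tangent."""
    return [math.tanh(a) for a in v]


def max_pool1d(x, size=2):
    """Keep the maximum of each non-overlapping window."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def spatial_features(z, w1, b1, w2, b2):
    y_ck1 = relu(conv1d(z, w1, b1))      # Equation (4)
    y_cp1 = max_pool1d(y_ck1)            # Equation (6)
    y_ck2 = tanh(conv1d(y_cp1, w2, b2))  # Equation (7)
    return max_pool1d(y_ck2)             # Equation (9)


# Hypothetical 14-dimensional traffic feature vector and kernels
z = [0.1 * i for i in range(14)]
y_cp2 = spatial_features(z, [0.2, 0.5, 0.3], 0.0, [0.4, -0.1, 0.2], 0.1)
print(len(y_cp2))  # 2
```

With these assumed sizes, the 14 inputs shrink to 12 after the first convolution, 6 after pooling, 4 after the second convolution, and 2 after the final pooling.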

Given the large volume of traffic data and the heavy processing workload, there is a high demand for lightweight encryption and high-precision identification of malicious traffic categories. To avoid overfitting, batch normalization is introduced in the spatial feature learning module. Batch normalization converts non-uniformly distributed data into a standardized distribution, so that the data falls in the sensitive region of the activation function. It thereby accelerates model training and avoids vanishing gradients, improving the network’s perception of data features and its accuracy while reducing the risk of overfitting. As a supplement to conventional convolutional neural network models, batch normalization can address problems such as slow convergence and gradient saturation caused by internal covariate shift.
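For reference, the training-time computation of batch normalization on one feature across a mini-batch can be sketched as below. The scale `gamma`, shift `beta`, and stabilizer `eps` follow the usual formulation of batch normalization; their default values and the sample batch are assumptions for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Hypothetical activations of one feature for a mini-batch of four samples
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
print([round(v, 3) for v in normalized])
```

After this step the feature has approximately zero mean and unit variance within the batch, which keeps activations away from the saturated regions of functions such as tanh.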

Step 2: The gated attention transformer takes the output Y_{i}^{cp_2} of Step 1 as its input, enhances the features selected in Step 1, and produces the output Y_{i}^{a}.

Step 3: Temporal feature learning feeds the output Y_{i}^{a} of Step 2 into the temporal feature learning module, where it is processed by the Bi-LSTM in both directions, and the two results are combined. The principle is described by Equations (10) to (16), where the forward input is denoted \overrightarrow{Y_{i}^{a}}, its reversed copy is denoted \overleftarrow{Y_{i}^{a}}, and \sigma is the sigmoid function used by the gates.

Y_{i} = W_{r} \cdot \overrightarrow{Y_{i}^{bl}} + W_{l} \cdot \overleftarrow{Y_{i}^{bl}} + b_{i},

(10)

\overrightarrow{Y_{i}^{bl}} = \overrightarrow{O_{t}} \cdot \tanh(\overrightarrow{C_{t}}), \quad \overleftarrow{Y_{i}^{bl}} = \overleftarrow{O_{t}} \cdot \tanh(\overleftarrow{C_{t}}),

(11)

\overrightarrow{O_{t}} = \sigma(w_{o} \cdot [h_{t-1}, \overrightarrow{Y_{i}^{a}}] + b_{o}), \quad \overleftarrow{O_{t}} = \sigma(w_{o} \cdot [h_{t-1}, \overleftarrow{Y_{i}^{a}}] + b_{o}),

(12)

\overrightarrow{C_{t}} = \overrightarrow{F_{t}} \cdot \overrightarrow{C_{t-1}} + \overrightarrow{I_{t}} \cdot \tanh(w_{c} \cdot [h_{t-1}, \overrightarrow{Y_{i}^{a}}] + b_{c}),

(13)

\overleftarrow{C_{t}} = \overleftarrow{F_{t}} \cdot \overleftarrow{C_{t-1}} + \overleftarrow{I_{t}} \cdot \tanh(w_{c} \cdot [h_{t-1}, \overleftarrow{Y_{i}^{a}}] + b_{c}),

(14)

\overrightarrow{F_{t}} = \sigma(w_{f} \cdot [h_{t-1}, \overrightarrow{Y_{i}^{a}}] + b_{f}), \quad \overrightarrow{I_{t}} = \sigma(w_{i} \cdot [h_{t-1}, \overrightarrow{Y_{i}^{a}}] + b_{i}),

(15)

\overleftarrow{F_{t}} = \sigma(w_{f} \cdot [h_{t-1}, \overleftarrow{Y_{i}^{a}}] + b_{f}), \quad \overleftarrow{I_{t}} = \sigma(w_{i} \cdot [h_{t-1}, \overleftarrow{Y_{i}^{a}}] + b_{i}),

(16)

In Equation (17), f represents the nonlinear activation function, and the initial hidden state is h_{0} = 0.

h_{t} = f(h_{t-1}, Y_{i}^{a}) = \max(h_{t-1}, Y_{i}^{a})

(17)
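The bidirectional recurrence of Step 3 can be sketched with scalar states as follows. This is a simplified illustration under several assumptions: the hidden and cell states are scalars, the hidden state fed back is the cell output of Equation (11) (the standard LSTM choice, rather than the max form of Equation (17)), and the parameter dictionary, its values, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_pass(seq, p):
    """Run a scalar LSTM over seq and return the output at every step."""
    h, c = 0.0, 0.0  # initial hidden and cell state
    outputs = []
    for x in seq:
        # Each weight pair multiplies the concatenation [h_{t-1}, x_t]
        f = sigmoid(p["wf"][0] * h + p["wf"][1] * x + p["bf"])  # forget gate
        i = sigmoid(p["wi"][0] * h + p["wi"][1] * x + p["bi"])  # input gate
        o = sigmoid(p["wo"][0] * h + p["wo"][1] * x + p["bo"])  # output gate
        c = f * c + i * math.tanh(p["wc"][0] * h + p["wc"][1] * x + p["bc"])
        h = o * math.tanh(c)  # cell output, as in Equation (11)
        outputs.append(h)
    return outputs


def bilstm(seq, fwd, bwd, w_r=1.0, w_l=1.0, b=0.0):
    """Combine forward and backward passes as in Equation (10)."""
    forward = lstm_pass(seq, fwd)
    backward = lstm_pass(seq[::-1], bwd)[::-1]  # re-align to original order
    return [w_r * hf + w_l * hb + b for hf, hb in zip(forward, backward)]


# Hypothetical parameters shared by both directions, for illustration only
params = {"wf": (0.5, 0.5), "wi": (0.5, 0.5), "wo": (0.5, 0.5),
          "wc": (0.5, 0.5), "bf": 0.0, "bi": 0.0, "bo": 0.0, "bc": 0.0}
features = bilstm([0.2, 0.7, 0.2], params, params)
print(len(features))  # 3
```

Because the backward pass reads the sequence from the end, each output position combines context from both earlier and later time steps.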

Step 4: Intrusion traffic classification passes the output of Step 3 through a fully connected layer and normalizes it with softmax, which maps each class score to the (0, 1) interval so that the scores can be read as class probabilities. The softmax probability is computed as in Equation (18), where C is the number of traffic classes and Y_{c} is the score of class c.

y = \mathrm{softmax}(Y_{i}) = \frac{e^{Y_{i}}}{\sum_{c=1}^{C} e^{Y_{c}}},

(18)
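A small Python sketch of the softmax step is given below. Subtracting the maximum score before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result. The three class scores are hypothetical.

```python
import math


def softmax(scores):
    """Map class scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three traffic classes
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))  # index of the most likely class
print(predicted)  # 0
```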

3.2.3. Gated Attention Transformer

In the gated attention transformer, we add a position bias. To make the attention affinity sensitive to the position of traffic data, we introduce a position bias term for the traffic data, which is often referred to as relative positional encoding. In general, these encodings of traffic position are learned through training. The output $Y_{i}^{cp_2}$ of the traffic spatial feature learning module is used as the input of this module to enhance the features selected by the spatial feature learning module. The output is calculated by Equation (19).

Y_{i}^{a} = \sum_{h = 1}^{H} \sum_{w = 1}^{W} \mathrm{softmax}(q_{ij}^{T} k_{hw}) v_{hw},

(19)

where the query $q = W_{Q} Y_{i}^{cp_2}$, the key $k = W_{k} Y_{i}^{cp_2}$, and the value $v = W_{v} Y_{i}^{cp_2}$ are calculated from the input $Y_{i}^{cp_2}$ [39]. However, the position of traffic data cannot always be encoded accurately. When the learned relative positions are attached to the query, key, and value tensors, an encoding that is not accurate enough may reduce performance. In view of this, we design an optimized attention block that controls the influence of the position bias, as shown in Equation (20).

H_{i}^{a} = \sum_{w = 1}^{W} \mathrm{softmax}(q_{ij}^{T} k_{iw} + G_{Q} q_{ij}^{T} r_{iw}^{q} + G_{K} k_{iw}^{T} r_{iw}^{k}) (G_{v1} v_{iw} + G_{v2} r_{iw}^{v}),

(20)

where $r^{q}, r^{k}, r^{v} \in \mathbb{R}^{W \times W}$ are the relative positional encodings used in the axial attention, and $G_{Q}, G_{K}, G_{v1}, G_{v2} \in \mathbb{R}$ are learnable gating parameters. Generally speaking, a relative positional encoding that is learned accurately tends to receive a greater gate weight than one that is learned inaccurately.
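The effect of the gates in Equation (20) can be illustrated with a small sketch along a single row of width W. It is not the authors' implementation. It assumes one scalar feature per position, indexes the relative-position terms by query position `j` and key position `w`, and uses hypothetical toy values throughout. With `g_q = g_k = g_v2 = 0` and `g_v1 = 1`, the positional terms drop out and the computation reduces to plain attention along the row.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_axial_attention(q, k, v, r_q, r_k, r_v, g_q, g_k, g_v1, g_v2):
    """Eq. (20) along one row; q, k, v have length W, r_* are W x W lists."""
    width = len(q)
    out = []
    for j in range(width):
        # Content affinity plus gated positional affinities for query position j.
        scores = [
            q[j] * k[w] + g_q * q[j] * r_q[j][w] + g_k * k[w] * r_k[j][w]
            for w in range(width)
        ]
        attn = softmax(scores)
        # Gated mix of value content and positional value term.
        out.append(sum(a * (g_v1 * v[w] + g_v2 * r_v[j][w])
                       for w, a in enumerate(attn)))
    return out

# Toy inputs (hypothetical values, W = 3).
q = [0.2, -0.4, 0.9]
k = [0.5, 0.1, -0.3]
v = [1.0, 2.0, 3.0]
r = [[0.0, 0.1, 0.2], [-0.1, 0.0, 0.1], [-0.2, -0.1, 0.0]]

plain = gated_axial_attention(q, k, v, r, r, r, 0.0, 0.0, 1.0, 0.0)  # gates closed
gated = gated_axial_attention(q, k, v, r, r, r, 1.0, 1.0, 1.0, 1.0)  # gates open
print(plain, gated)
```

With the gates closed, changing the positional encodings leaves the output unchanged, which is the sense in which the gates control how much an unreliable encoding can influence the result.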

Specifically, the standard attention mechanism generates attention weights from the similarity between queries and keys and uses them to aggregate the values, which captures global features. However, this process does not constrain the impact of position encoding. When the spatiotemporal correlation of the input data is strong, position information and feature semantics are prone to interfere with each other. In contrast, the gated attention transformer introduces learnable gating parameters into the self-attention computation. By dynamically adjusting the impact of relative position encoding on the query, key, and value tensors, it models feature affinity more accurately. This mechanism can also assign differentiated weights to the spatial features extracted by the convolutional layer according to the spatiotemporal distribution of IoT traffic data, which suppresses the interference of redundant features and reduces the computational complexity of the model.

4. Experiment Analysis

4.1. Performance Testing Indicators

The experiment uses accuracy, precision, recall, and F1-score as key indicators to evaluate the performance of models. The calculation is as follows.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP},

(21)

\mathrm{Precision} = \frac{TP}{TP + FP},

(22)

\mathrm{Recall} = \frac{TP}{TP + FN},

(23)

\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(24)
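Equations (21)–(24) can be computed directly from confusion-matrix counts, where TP, TN, FP, and FN are read in the usual way as true positives, true negatives, false positives, and false negatives. The sketch below uses hypothetical counts, and returning 0.0 for an empty denominator is a convention assumed here rather than one stated in the paper.

```python
def ratio(numerator, denominator):
    """Safe division; an empty denominator yields 0.0 (assumed convention)."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score as in Eqs. (21)-(24)."""
    accuracy = ratio(tp + tn, tp + tn + fn + fp)
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a binary normal-versus-attack split.
metrics = classification_metrics(tp=90, tn=80, fp=10, fn=20)
print(metrics)
```

For multi-class traffic the same function can be applied per class in a one-versus-rest fashion and the per-class values averaged; whether the paper averages in this way is not stated.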

Intrusion detection detects computer network attacks by analyzing current packets. The publicly available datasets used are Bot-IoT [40], USTC-TFC2016 [41], UNSW-NB15 [42], and NSL-KDD [43].

Bot-IoT contains multiple types of attacks, such as service scanning, keylogging, and data exfiltration. It simulates a real network environment containing normal and botnet traffic. USTC-TFC2016 is a network traffic dataset that includes both benign and malicious traffic. Its benign portion is extracted from real network traffic and covers common network applications and protocols, while its malicious portion covers common types of network attacks, such as botnets and malware. With this dataset, researchers can test and evaluate the performance of network traffic recognition algorithms and models. UNSW-NB15 was collected in a cybersecurity research group's laboratory, contains a variety of attack types, is well suited to research, and includes 6 CSV files. NSL-KDD consists of 41 features and one class label (normal or malicious traffic), with 9 features being discrete and the rest continuous. The malicious traffic in the dataset falls into four categories, totaling 22 attack types in the training set.

To comprehensively evaluate the performance of STWGA, we select accuracy, precision, recall, F1-score, and Mean Average Precision (mAP) as the core evaluation indicators. The F1-score combines precision and recall, as in Equation (24), and mAP is the mean of the average precision over all categories, with a value range of 0–1. The larger the value, the more accurate the detection.

4.2. Evaluation and Analysis of Experimental Results

The experimental environment is a PC with an Intel Core i5-12500H CPU, an NVIDIA RTX 3090 GPU, and 32 GB RAM, running the Windows 11 operating system, with a memory of 16.00 GB. The implementation requires Python 3.0 or above. We evaluate our approach on the traffic datasets described above and select multiple baselines for comparative experiments, including CNN-SoftMax [44], CNN-LSTM [45], AlexNet [46], S-NDAE [47], and CWGAN-CSSAE [48]. The overall experimental results are displayed in Table 2.

Table 2 shows that STWGA performs the best on every metric across all four datasets. For example, relative to CNN-SoftMax on Bot-IoT, STWGA improves precision by 19.53% (from 80.23% to 95.9%) and recall by 20.91% (from 78.32% to 94.70%). This is mainly because STWGA extracts the most appropriate and effective features for learning, which greatly improves its intrusion detection performance. Compared with the classic CNN-SoftMax, STWGA achieves clear improvements in precision, recall, and F1-score. This is because STWGA also uses statistical and spatiotemporal features of traffic as the basis for intrusion detection, thereby avoiding the loss of some statistical features when cropping data. On UNSW-NB15, STWGA achieves the highest accuracy of 97.93% with an F1-score of 97.06%, outperforming CWGAN-CSSAE and S-NDAE. This supports the view that fusing spatial and temporal feature learning enhances the effectiveness of intrusion detection and recognition. STWGA therefore performs well compared with the other detection methods. Figures 2–5 show the comparison results.

In Figure 2, STWGA ranks first with the highest accuracy of 96.81%, demonstrating its good robustness. Among the other deep models, CWGAN-CSSAE has the highest accuracy at 93.64%, followed by S-NDAE at 92.37%. The F1-score of STWGA is also the highest at 96.79%, which indicates that, through feature fusion, the spatial feature learning, gated attention transformer, and temporal feature learning modules jointly learn highly robust representations of intrusion detection data. This demonstrates the advantages of integrated spatiotemporal modeling.

The results on USTC-TFC2016 are shown in Figure 3. Compared with CWGAN-CSSAE, STWGA achieves relative improvements of 4.98%, 1.2%, 0.98%, and 2.2% in accuracy, precision, recall, and F1-score, respectively. STWGA also significantly improves the indicators for small-sample intrusion traffic. Compared with CNN-SoftMax, STWGA improves precision, recall, and F1-score by 18.88%, 16.48%, and 18.53% in relative terms, respectively. Therefore, STWGA performs better, is more sensitive to abnormal traffic, and classifies attack data more accurately.
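The percentages quoted for USTC-TFC2016 match relative gains, (STWGA − baseline) / baseline, computed from the Table 2 rows. The short check below copies those rows; the variable and function names are illustrative only.

```python
# (accuracy, recall, precision, F1-score) in %, copied from the
# USTC-TFC2016 rows of Table 2.
TABLE2_USTC = {
    "CNN-SoftMax": (82.42, 83.2, 81.58, 81.38),
    "CWGAN-CSSAE": (92.44, 95.97, 95.84, 94.39),
    "STWGA": (97.04, 96.91, 96.98, 96.46),
}
METRICS = ("accuracy", "recall", "precision", "f1")

def relative_gain(model, baseline, table=TABLE2_USTC):
    """Relative improvement of `model` over `baseline`, in percent."""
    return {
        name: round(100.0 * (m - b) / b, 2)
        for name, m, b in zip(METRICS, table[model], table[baseline])
    }

vs_cwgan = relative_gain("STWGA", "CWGAN-CSSAE")
vs_cnn = relative_gain("STWGA", "CNN-SoftMax")
print(vs_cwgan)
print(vs_cnn)
```

Read this way, the gains are ratios of percentages rather than differences in percentage points; for instance, the accuracy gap to CWGAN-CSSAE is 4.60 points, which corresponds to a relative gain of about 4.98%.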

Figure 4 shows that on UNSW-NB15 STWGA reaches an accuracy of 97.93% and an F1-score of 97.06%, a relatively ideal overall classification result. This is because the improved CNN structure extracts the underlying spatial features of the data while preserving the previously learned features, and Bi-LSTM then extracts temporal features from the deep spatial features. As a result, the accuracy, precision, recall, and F1-score are better than those of models such as CNN-SoftMax, AlexNet, and S-NDAE, which indicates that STWGA is better suited to feature extraction from intrusion data than the other models.

Figure 5 shows that STWGA performs significantly better than the other deep learning models on NSL-KDD, which indicates better intrusion detection performance. On NSL-KDD, CNN-SoftMax, CNN-LSTM, AlexNet, S-NDAE, CWGAN-CSSAE, and STWGA reach stability in 245 s, 237 s, 202 s, 194 s, 174 s, and 154 s, respectively. STWGA thus stabilizes 20 s earlier than CWGAN-CSSAE, which is about 11.5% less time (equivalently, CWGAN-CSSAE needs about 13% more time than STWGA).

In addition, we conduct a detection delay analysis in the same software and hardware environment. Specifically, we select traffic sample sets of different sizes from Bot-IoT, USTC-TFC2016, UNSW-NB15, and NSL-KDD, including small batches of 10–1000 samples and large batches of 10,000–100,000 samples. The experimental results show that although STWGA integrates the gated attention transformer and Bi-LSTM, its optimized network structure effectively controls inference overhead. Compared to complex models such as CWGAN-CSSAE, STWGA reduces the average single-detection delay by 12.5% and increases batch-detection throughput by 9.8%, while maintaining higher detection accuracy. This indicates that the model balances detection performance and real-time performance and is suitable for deployment on resource-limited IoT terminal devices. To further demonstrate the effectiveness of STWGA, Figure 6 shows the mAP results on four datasets: Bot-IoT, NF-ToN-IoT [49], CIC-DDoS2019 [50], and NSL-KDD. The legend in Figure 6d identifies each model.

Figure 6 shows that, as the volume of network traffic data increases, STWGA achieves the highest mAP, which supports its effectiveness. It combines the advantages of spatial and temporal feature extraction, which improves detection performance. On Bot-IoT, STWGA outperforms CNN-SoftMax, CNN-LSTM, AlexNet, S-NDAE, and CWGAN-CSSAE by 30.14%, 26.67%, 14.47%, 15.85%, and 2.15%, respectively. On NF-ToN-IoT, STWGA outperforms CNN-SoftMax, CNN-LSTM, AlexNet, S-NDAE, and CWGAN-CSSAE by 20%, 12.94%, 7.87%, 7.88%, and 3.23%, respectively. On CIC-DDoS2019, STWGA outperforms CNN-SoftMax, CNN-LSTM, AlexNet, S-NDAE, and CWGAN-CSSAE by 18.29%, 10.98%, 7.78%, 5.43%, and 2.11%, respectively. On NSL-KDD, STWGA outperforms CNN-SoftMax, CNN-LSTM, AlexNet, S-NDAE, and CWGAN-CSSAE by 14.29%, 12.94%, 6.67%, 5.5%, and 4.35%, respectively. These results support the effectiveness of STWGA in intrusion detection and classification tasks. Overall, the experiments indicate that STWGA classifies intrusions more accurately and more efficiently.

We further incorporate statistical significance testing into the experimental analysis. Specifically, paired t-tests are conducted on the core evaluation metrics (accuracy, precision, recall, and F1-score) of STWGA and each comparison model on multiple datasets, with the significance level set to α = 0.05. The tests yield p-values below 0.05 for the performance advantages of STWGA over the existing models. This indicates that the performance improvement is unlikely to be due to random factors and is instead driven by the model's spatiotemporal feature fusion and gated attention mechanism design.
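The paper does not state which observations are paired in the t-tests. As a hedged illustration only, the sketch below pairs the Table 2 accuracies of STWGA and CWGAN-CSSAE across the four datasets (an assumption) and computes the paired t statistic with the standard library. The constant 3.182 is the two-sided critical value of Student's t distribution for α = 0.05 with 3 degrees of freedom.

```python
import math
import statistics

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1

# Accuracy (%) from Table 2, in the order Bot-IoT, USTC-TFC2016,
# UNSW-NB15, NSL-KDD. Pairing across datasets is an assumption.
stwga = [96.81, 97.04, 97.93, 96.98]
cwgan_cssae = [93.64, 92.44, 96.18, 94.18]

t_stat, dof = paired_t_statistic(stwga, cwgan_cssae)
T_CRITICAL_DF3 = 3.182  # two-sided, alpha = 0.05, 3 degrees of freedom
significant = abs(t_stat) > T_CRITICAL_DF3
print(round(t_stat, 3), dof, significant)
```

With only four pairs the test has little power, so a fuller analysis would normally pair repeated runs or cross-validation folds on each dataset.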

4.3. Analysis of Ablation Experiment

To evaluate the contribution of each module, we conduct ablation studies on NF-ToN-IoT and CIC-DDoS2019. Table 3 summarizes the results. When only the gated attention transformer is retained, the F1-score is 0.819 and 0.827 on the two datasets, respectively. Adding the improved CNN raises it to 0.841 and 0.870, and adding Bi-LSTM instead raises it to 0.862 and 0.893. When the three modules are fully integrated, the F1-score reaches 0.965 and 0.971, which is more than 16% higher than that of a single module. These experiments show that the high performance of the model derives from the synergy of the three modules in achieving feature-weighted fusion.

5. Conclusions

The number of devices, the volume of data, and the number of connections in the IoT have increased sharply. Due to the emergence of new network attack methods, the resource limitations of the IoT, and inconsistent security standards, devices are extremely vulnerable to network attacks. In severe cases, attacks can even cause huge property losses and threaten personal safety. Therefore, effectively detecting new intrusion behaviors is particularly important for IoT security. To address insufficient feature extraction in existing models, this paper proposes STWGA, which is based on CNN, LSTM, batch normalization, and a self-attention mechanism. STWGA consists of three parts, namely spatial feature learning, the gated attention transformer, and the temporal feature learning module, and it can utilize global contextual information. Comparative experiments are designed, and the advantages and disadvantages of each model in intrusion detection are analyzed. The experimental results show that STWGA has better spatiotemporal feature extraction ability and can effectively improve intrusion detection performance for anomalies. However, because of the gap between the transmission environment of the real IoT and open-source datasets, future work needs more intrusion data from real IoT environments for model testing, in order to apply STWGA to other types of attacks and improve its security in real-time intrusion detection systems.

Author Contributions

Conceptualization: J.H. and X.A.; methodology: J.H. and X.A.; writing—original draft preparation: J.H. and X.A.; writing—review and editing: J.H. and X.A.; visualization: J.H.; supervision: J.H.; project administration: J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2025 annual project of the “14th five-year plan” of Education Science in Gansu Province, “Research on the Localization Practice Path of human–computer collaborative teaching for AIGC-enabled E-commerce major” (No. GS[2025]GHB1343).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Nassereddine, M.; Khang, A. Applications of Internet of Things (IoT) in smart cities. In Advanced IoT Technologies and Applications in the Industry 4.0 Digital Economy; CRC Press: Boca Raton, FL, USA, 2024; pp. 109–136.
2. Lim, D.-W.; Kang, J.-M. Joint Antenna Selection and Transmit Power Optimization for UAV-Assisted Relaying in Cognitive Radio IoT Networks. Mathematics 2025, 13, 3540.
3. Chen, J.; Han, P.; Zhang, Y.; You, T.; Zheng, P. Scheduling energy consumption-constrained workflows in heterogeneous multi-processor embedded systems. J. Syst. Archit. 2023, 142, 102938.
4. Uzunkol, O.; Kiraz, M.S. Still wrong use of pairings in cryptography. Appl. Math. Comput. 2018, 333, 467–479.
5. Xiao, N.; Wang, Z.; Sun, X.; Miao, J. A novel blockchain-based digital forensics framework for preserving evidence and enabling investigation in industrial Internet of Things. Alex. Eng. J. 2024, 86, 631–643.
6. Song, L.; Lan, H.; Du, J.; Wang, K.; Kang, W. Application of intelligent Internet of Things technology in the security monitoring system of power Internet of Things network. Discov. Internet Things 2025, 5, 44.
7. Bai, Z.; Miao, H.; Miao, J.; Xiao, N.; Sun, X. Artificial Intelligence-driven Cybersecurity Applications and Challenges. Innov. Appl. AI 2025, 2, 26–33.
8. Gao, X.-S.; Liu, S.; Yu, L. Achieving optimal adversarial accuracy for adversarial deep learning using Stackelberg games. Acta Math. Sci. 2022, 42, 2399–2418.
9. Li, J.; Rosier, C. Parameters Identification in a Saltwater Intrusion Problem. Acta Math. Sci. 2020, 40, 1563–1584.
10. Miao, J.; Wang, Z.; Ning, X.; Shankar, A.; Maple, C.; Rodrigues, J.J. Practical and Secure Authentication Protocol for Vehicle to Grid in Intelligent Transportation Systems. IEEE Trans. Intell. Transp. Syst. 2025, 26, 22447–22456.
11. Qu, G.; Jin, H.; Zhang, L.; Ge, M.; Wu, X.; Li, H.; Xu, J. Enhance Graph-Based Intrusion Detection in Optical Networks via Pseudo-Metapaths. Mathematics 2025, 13, 3432.
12. V.S., D.P.; Sethuraman, S.C.; Khan, M.K. Blockchain-based deep learning models for intrusion detection in Industrial Control Systems: Frameworks and open issues. J. Netw. Comput. Appl. 2025, 243, 104286.
13. Li, H.; Cao, H.; Feng, Y.; Li, X.; Pei, J. Optimization of graph clustering inspired by dynamic belief systems. IEEE Trans. Knowl. Data Eng. 2023, 36, 6773–6785.
14. Zhu, P.; Cheng, L.; Gao, C.; Wang, Z.; Li, X. Locating multi-sources in social networks with a low infection rate. IEEE Trans. Netw. Sci. Eng. 2022, 9, 1853–1865.
15. Zhao, D.; Xiang, W.; Song, W.; Xu, L.; Chen, C.; Wang, Z. Reset observer-based containment protocol via event-triggered strategy for multi-agent networks against aperiodic DoS attacks. Appl. Math. Comput. 2025, 500, 129415.
16. Dhelim, S.; Aung, N.; Kechadi, M.T.; Ning, H.; Chen, L. Trust2Vec: Large-scale IoT trust management system based on signed network embeddings. IEEE Internet Things J. 2022, 10, 553–562.
17. Dina, A.S.; Siddique, A.B.; Manivannan, D. A deep learning approach for intrusion detection in Internet of Things using focal loss function. Internet Things 2023, 22, 100699.
18. Zheng, X.; Wang, Y.; Jia, L.; Xiong, D.; Qiang, J. Network intrusion detection model based on Chi-square test and stacking approach. In Proceedings of the 2020 7th International Conference on Information Science and Control Engineering (ICISCE), Changsha, China, 18–20 December 2020; IEEE: New York, NY, USA, 2020; pp. 894–899.
19. Khan, F.A.; Gumaei, A.; Derhab, A.; Hussain, A. A novel two-stage deep learning model for efficient network intrusion detection. IEEE Access 2019, 7, 30373–30385.
20. Li, Y.; Xia, J.; Zhang, S.; Yan, J.; Ai, X.; Dai, K. An efficient intrusion detection system based on support vector machines and gradually feature removal method. Expert Syst. Appl. 2012, 39, 424–430.
21. Yang, S.U. Research on network behavior anomaly analysis based on bidirectional LSTM. In Proceedings of the 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chengdu, China, 15–17 March 2019; IEEE: New York, NY, USA, 2019; pp. 798–802.
22. Yin, C.; Zhu, Y.; Fei, J.; He, X. A deep learning approach for intrusion detection using recurrent neural networks. IEEE Access 2017, 5, 21954–21961.
23. Kunang, Y.N.; Nurmaini, S.; Stiawan, D.; Suprapto, B.Y. Attack classification of an intrusion detection system using deep learning and hyperparameter optimization. J. Inf. Secur. Appl. 2021, 58, 102804.
24. Kim, J.; Kim, J.; Kim, H.; Shim, M.; Choi, E. CNN-based network intrusion detection against denial-of-service attacks. Electronics 2020, 9, 916.
25. Imrana, Y.; Xiang, Y.; Ali, L.; Abdul-Rauf, Z. A bidirectional LSTM deep learning approach for intrusion detection. Expert Syst. Appl. 2021, 185, 115524.
26. Tian, Q.; Han, D.; Li, K.C.; Liu, X.; Duan, L.; Castiglione, A. An intrusion detection approach based on improved deep belief network. Appl. Intell. 2020, 50, 3162–3178.
27. Derhab, A.; Aldweesh, A.; Emam, A.Z.; Khan, F.A. Intrusion detection system for internet of things based on temporal convolution neural network and efficient feature engineering. Wirel. Commun. Mob. Comput. 2020, 2020, 6689134.
28. Ma, H.; Cao, J.; Mi, B.; Huang, D.; Liu, Y.; Li, S. A GRU-based lightweight system for CAN intrusion detection in real time. Secur. Commun. Netw. 2022, 2022, 5827056.
29. El-Sayed, O.A.; Fawzy, S.K.; Tolba, S.H.; Salem, R.S.; Hassan, Y.S.; Ahmed, A.M.; Khattab, A. Deep Learning Framework for Accurate Network Intrusion Detection in ITSs. In Proceedings of the 2021 International Conference on Microelectronics (ICM), Cairo, Egypt, 19–22 December 2021; IEEE: New York, NY, USA, 2021; pp. 212–215.
30. Van Huong, P.; Hung, D.V. Intrusion detection in IoT systems based on deep learning using convolutional neural network. In Proceedings of the 2019 6th NAFOSTED Conference on Information and Computer Science (NICS), Hanoi, Vietnam, 12–13 December 2019; IEEE: New York, NY, USA, 2019; pp. 448–453.
31. Georgiades, M.; Hussain, F. An explainable ai approach for interpretable cross-layer intrusion detection in internet of medical things. Electronics 2025, 14, 3218.
32. Yao, R.; Wang, N.; Chen, P.; Ma, D.; Sheng, X. A CNN-transformer hybrid approach for an intrusion detection system in advanced metering infrastructure. Multimed. Tools Appl. 2023, 82, 19463–19486.
33. Cao, B.; Li, C.; Sun, J.; Song, Y. IoT intrusion detection technology based on Deep learning. In Proceedings of the 2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China, 20–22 May 2022; IEEE: New York, NY, USA, 2022; pp. 284–289.
34. Lee, J.H.; Park, K.H. AE-CGAN model based high performance network intrusion detection system. Appl. Sci. 2019, 9, 4221.
35. Elsayed, N.; Zaghloul, Z.S.; Azumah, S.W.; Li, C. Intrusion detection system in smart home network using bidirectional lstm and convolutional neural networks hybrid model. In Proceedings of the 2021 IEEE International Midwest Symposium on Circuits and Systems (MWSCAS), East Lansing, MI, USA, 8–11 August 2021; IEEE: New York, NY, USA, 2021; pp. 55–58.
36. Shone, N.; Ngoc, T.N.; Phai, V.D.; Shi, Q. A deep learning approach to network intrusion detection. IEEE Trans. Emerg. Top. Comput. Intell. 2018, 2, 41–50.
37. Faught, J.N.; Kempton, M.; Knudson, A. A Nordhaus-Gaddum type problem for the normalized Laplacian spectrum and graph Cheeger constant. Appl. Math. Comput. 2024, 480, 128920.
38. Segu, M.; Tonioni, A.; Tombari, F. Batch normalization embeddings for deep domain generalization. Pattern Recognit. 2023, 135, 109115.
39. Zhu, P.; Wang, B.; Tang, K.; Zhang, H.; Cui, X.; Wang, Z. A knowledge-guided graph attention network for emotion-cause pair extraction. Knowl.-Based Syst. 2024, 286, 111342.
40. Peterson, J.M.; Leevy, J.L.; Khoshgoftaar, T.M. A review and analysis of the bot-iot dataset. In Proceedings of the 2021 IEEE International Conference on Service-Oriented System Engineering (SOSE), Oxford, UK, 23–26 August 2021; IEEE: New York, NY, USA, 2021; pp. 20–27.
41. Wang, B.; Su, Y.; Zhang, M.; Nie, J. A deep hierarchical network for packet-level malicious traffic detection. IEEE Access 2020, 8, 201728–201740.
42. Meftah, S.; Rachidi, T.; Assem, N. Network based intrusion detection using the UNSW-NB15 dataset. Int. J. Comput. Digit. Syst. 2019, 8, 478–487.
43. Ravipati, R.D.; Abualkibash, M. Intrusion detection system classification using different machine learning algorithms on KDD-99 and NSL-KDD datasets—A review paper. Int. J. Comput. Sci. Inf. Technol. (IJCSIT) 2019, 11, 65–80.
44. Mehta, S.; Kukreja, V. Enhancing Soybean Disease Diagnosis Using CNN-Based Feature Extraction and Random Forest Classification. In Proceedings of the 2025 International Conference on Computing Technologies (ICOCT), Bengaluru, India, 13–14 June 2025; IEEE: New York, NY, USA, 2025.
45. Andresini, G.; Appice, A.; Caforio, F.P.; Malerba, D.; Vessio, G. ROULETTE: A neural attention multi-output model for explainable network intrusion detection. Expert Syst. Appl. 2022, 201, 117144.
46. Chen, H.C.; Widodo, A.M.; Wisnujati, A.; Rahaman, M.; Lin, J.C.W.; Chen, L.; Weng, C.E. AlexNet convolutional neural network for disease detection and classification of tomato leaf. Electronics 2022, 11, 951.
47. Devi, A.; Prabhakaran, M.K.; Kumar, J.P.; Kumar, J.N. Autoencoder-Based Anomaly Detection for Cyber Threat Monitoring. In Utilizing AI in Network and Mobile Security for Threat Detection and Prevention; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 59–80.
48. Ding, P.; Chen, H.; Yang, X.; Yang, Q. Intrusion Detection Method for Power Internet of Things Based on Data Hybrid Detection Model Considering the Spatio-Temporal Characteristics. Int. J. High Speed Electron. Syst. 2025, 2540437.
49. Arroyo, Á.; Granados, D.; De Miguel, F.; Velasco, N.; Herrero, Á. Dimensionality Reduction and Outlier Analysis for the NF-ToN-IoT Cybersecurity Dataset. In Proceedings of the International Work-Conference on Artificial Neural Networks, A Coruna, Spain, 16–18 June 2025; Springer Nature: Cham, Switzerland, 2025; pp. 392–401.
50. Hasan, M.D.A.; Eaman, A.; Hassan, E. Efficient DDoS Detection with Minimal Features: High Accuracy Using CIC-DDoS2019. Procedia Comput. Sci. 2025, 265, 124–131.

Figure 1. STWGA architecture diagram.

Figure 2. Experimental comparison based on Bot-IoT.

Figure 3. Experimental comparison based on USTC-TFC2016.

Figure 4. Experimental comparison based on UNSW-NB15.

Figure 5. Experimental comparison based on NSL-KDD.

Figure 6. mAP comparison.

Table 1. Related work summary table.

Source of Literature	Core Method
Dhelim et al. [16]	Trust2Vec.
Dina et al. [17]	An efficient and accurate IoT intrusion detection system.
Zheng et al. [18]	An intrusion detection method that combines stacking technology.
Khan et al. [19]	A two-stage intrusion detection system called TSDL.
Li et al. [20]	They used the feature reduction GFR method to select the 19 most relevant features for intrusion detection on the KDD Cup 99 intrusion detection dataset.
Yang et al. [21]	A combination of Bi-LSTM and RNN in anomaly detection and multi-class attack recognition tasks.
Yin et al. [22]	A new intrusion detection model by utilizing an RNN.
Kunang et al. [23]	Autoencoder (AE) and DNN models to adjust various deep structure models.
J. Kim et al. [24]	A convolutional neural network-based intrusion detection system.
Y. Imrana et al. [25]	The application of Bi-LSTM in intrusion detection.
Tian et al. [26]	An intrusion detection method based on an improved Deep Belief Network (DBN).
Derhab et al. [27]	An intrusion detection approach for IoT data grounded in temporal convolutional neural networks.
Ma et al. [28]	A specialized feature extraction algorithm tailored for connected vehicles.
El Sayed et al. [29]	An improved deep neural network (DNN) algorithm.
Van et al. [30]	An enhanced model to extract TCP/IP traffic features.
Georgiades et al. [31]	An interpretable artificial intelligence approach enhances the transparency of attack detection.
Yao et al. [32]	The XGBoost algorithm, CNN, and Transformer to establish feature associations.
Cao et al. [33]	A combined network model integrating the time domain convolutional neural network TCN and GRUs.
Lee et al. [34]	A model integrating Autoencoder (AE) and Generative Adversarial Networks (GANs) for intrusion detection systems.
Elsayed et al. [35]	A hybrid model composed of Bi-LSTM and CNN.
Shone et al. [36]	A classification model.

Table 2. Indicator results on datasets.

Dataset	Model	Accuracy (%)	Recall (%)	Precision (%)	F1-Score (%)
Bot-IoT	CNN-SoftMax [44]	77.44	78.32	80.23	82.6
	CNN-LSTM [45]	87.39	81.67	84.21	82.36
	AlexNet [46]	89.01	87.02	88.79	89.95
	S-NDAE [47]	92.37	91.42	91.4	90.31
	CWGAN-CSSAE [48]	93.64	92.38	92.49	92.34
	STWGA	96.81	94.70	95.9	96.79
USTC-TFC2016	CNN-SoftMax [44]	82.42	83.2	81.58	81.38
	CNN-LSTM [45]	80.87	80.69	84.29	81.92
	AlexNet [46]	89.81	89.17	89.12	89.06
	S-NDAE [47]	95.26	95.72	94.52	93.92
	CWGAN-CSSAE [48]	92.44	95.97	95.84	94.39
	STWGA	97.04	96.91	96.98	96.46
UNSW-NB15	CNN-SoftMax [44]	80.24	78.62	78.81	80.86
	CNN-LSTM [45]	78.28	78.77	80.89	82.7
	AlexNet [46]	91.98	93.79	89.94	89.28
	S-NDAE [47]	94.05	93.12	92.3	95.54
	CWGAN-CSSAE [48]	96.18	96.08	95.93	94.52
	STWGA	97.93	98.11	97.24	97.06
NSL-KDD	CNN-SoftMax [44]	82.75	77.21	78.64	81.49
	CNN-LSTM [45]	80.02	78.92	78.83	81.62
	AlexNet [46]	90.92	92.61	87.95	90.22
	S-NDAE [47]	95.57	94.52	95.07	95.52
	CWGAN-CSSAE [48]	94.18	95.93	95.28	94.56
	STWGA	96.98	97.33	97.71	97.46

Table 3. Ablation experiment comparison table.

Dataset	Improved CNN	Gated Attention Transformer	Bi-LSTM	F1-Score
NF-ToN-IoT	×	√	×	0.819
	√	√	×	0.841
	×	√	√	0.862
	√	√	√	0.965
CIC-DDoS2019	×	√	×	0.827
	√	√	×	0.870
	×	√	√	0.893
	√	√	√	0.971
