Few-Shot Learning for Malicious Traffic Detection with Sample Relevance Guided Attention

Wu, Xuan; Wang, Peng; Song, Yafei; Wang, Xiaodan; Chai, Jinjin

doi:10.3390/electronics14234717

Open Access Article

by Xuan Wu, Peng Wang, Yafei Song, Xiaodan Wang^* and Jinjin Chai

College of Air and Missile Defense, Air Force Engineering University, Xi’an 710051, China

^* Author to whom correspondence should be addressed.

Electronics 2025, 14(23), 4717; https://doi.org/10.3390/electronics14234717

Submission received: 14 November 2025 / Revised: 24 November 2025 / Accepted: 27 November 2025 / Published: 29 November 2025

(This article belongs to the Section Computer Science & Engineering)


Abstract

Malicious traffic detection in IoT environments faces dual challenges: limited labeled data and heterogeneous, complex traffic patterns. To address these limitations, we propose a malicious traffic detection framework, GADF-SRGA, which integrates Gramian Angular Difference Field (GADF) imaging with meta-learning. The framework first encodes raw IoT traffic into images via GADF, preserving the spatiotemporal characteristics of malicious traffic. It then employs meta-learning on these encoded images to enable feature-space learning under scarce data. In the inner loop, Sample-Relation Guided Attention (SRGA) leverages class-label-guided supervision graphs to learn sample similarity, improving intra-class compactness and inter-class separability in the feature space. Comprehensive evaluations on the public IoT intrusion datasets Malicious_TLS and ToN_IoT show that the framework outperforms baseline methods and remains robust, particularly under class-imbalanced conditions.

Keywords: few-shot classification; meta-learning; Sample-Relation Guided Attention; traffic detection

1. Introduction

Malicious Traffic Detection (MTD) is a critical technology for security defense in the Internet of Things (IoT), identifying potential malicious activities by monitoring abnormal behavior in network traffic in real-time. In response to the unique characteristics of heterogeneous devices, limited resources, and diverse communication protocols in the IoT, malicious traffic detection requires dynamic traffic analysis and lightweight detection algorithms, striking a balance between accuracy, real-time performance, and low computational overhead. Traditional traffic detection methods face critical challenges, including the variability and concealment of malicious traffic in IoT environments.

Machine learning methods can automatically learn feature representations, reducing reliance on manual feature engineering, and are suitable for large-scale traffic data analysis and complex pattern extraction. Le et al. [1] proposed an XGBoost-based IoT intrusion detection method to mitigate multiclass sample imbalance in the Industrial Internet of Things (IIoT), enhancing attack detection performance in such scenarios. Thakkar et al. [2] proposed a feature selection technique based on statistical importance fusion, which integrates standard deviation, mean, and median statistics to select relevant features and reduce redundant features. Thakkar et al. [3] integrated Autoencoder (AE) and Principal Component Analysis (PCA) techniques for feature dimensionality reduction, using AE to capture nonlinear relationships and PCA to capture linear relationships. Traditional machine learning methods have limitations in extracting deep features from sequence data, resulting in insufficient generalization performance.

In contrast, detection systems based on deep learning [4] demonstrate better accuracy and adaptability in detecting malicious IoT traffic through hierarchical feature learning. Djenouri et al. [5] proposed the D2E-ADN framework, which combines data decomposition, deep learning, and evolutionary computation. The authors employ five clustering algorithms to decompose the data, a new RNN model, and two evolutionary algorithms to optimize the hyperparameters. The framework outperforms the baseline algorithm in terms of runtime and accuracy. Yang et al. [6] proposed a Data Purification Algorithm (DPA) to reduce data redundancy, enhanced the CNN structure based on separable convolution, and designed a lightweight detection algorithm, LSCNN. Wang et al. [7] proposed an automated lightweight spatiotemporal decoupling Transformer framework called AutoLDT, which improves model performance while reducing model complexity. Deep learning-based methods can better extract deep features. However, most of these methods require a large amount of training data, which is difficult to obtain in practical network scenarios. In practice, most traffic is benign and malicious traffic accounts for only a small portion, yet it can cause severe losses. In IoT traffic detection tasks, there is therefore still room for improvement in detecting attack classes with only a few samples.

Few-shot learning has made significant progress in fields such as image classification [8] and temporal classification [9], but its application to malicious traffic detection is still in the exploratory stage. Few-shot learning is valuable for detecting network traffic when abnormal data is scarce. Early work on few-shot learning for malicious traffic detection mainly focused on introducing its basic concepts into this field. Olasehinde et al. [10] were the first to apply meta-learning to intrusion detection, proposing a novel IDS based on three meta-learning algorithms. Subsequently, Lu et al. [11] developed an IoT intrusion detection model by combining Model-Agnostic Meta-Learning (MAML) and Convolutional Neural Networks (CNN). They also constructed the FSIDS-IoT dataset (integrating five public datasets with multiple attack scenarios) to support few-shot learning. Leveraging MAML, the model quickly adapts to new attacks and achieves high few-shot detection accuracy; however, it lacks sample diversity and practical deployment testing. To further improve the performance of few-shot learning in malicious traffic detection, Wu et al. [12] proposed MASiNet, a multistage attention Siamese network. This model utilizes a carefully designed attention mechanism to extract network traffic features more effectively, while introducing a contrastive loss function to enhance the model’s ability to compare sample pairs. Experiments on NSL_KDD and UNSW_NB15 show that it outperforms existing methods. However, most existing methods do not fully consider the relationships between samples, so the learned feature representations are less robust and discriminative, which limits few-shot classification performance.

Therefore, inspired by previous research, we propose GADF-SRGA, a novel IoT malicious traffic detection framework. This framework combines temporal imaging techniques, few-shot meta-learning methods, and inter-sample class similarity relationships. First, the Gramian Angular Difference Field (GADF) imaging method encodes IoT network traffic data into two-dimensional images, preserving the temporal dependencies and pattern features of the traffic sequences. Convolutional networks are then used to extract features and generate a pre-trained model. Next, the GADF-SRGA model enhances meta-learning by incorporating an explicitly guided attention mechanism to capture inter-sample relationships. To verify the effectiveness of the proposed method, we evaluated it on the publicly available real-world IoT intrusion detection datasets ToN_IoT and Malicious_TLS. The contributions of this paper are as follows:

We propose a detection framework based on GADF-SRGA, which solves the problem of insufficient malicious traffic recognition performance in public IoT traffic detection datasets Malicious_TLS and ToN_IoT.
We utilize GADF image encoding to transform network traffic data into 2D images, thereby enhancing the representation of temporal and semantic features.
We introduce few-shot learning methods to solve the problems of insufficient learnable samples and low recognition accuracy of newly emerging attack types in the Internet of Things environment.
We propose a Sample Relevance Guided Attention module, which addresses the insufficient feature discriminability of existing few-shot classification methods by considering the association relationships between images. This module significantly improves the model’s intra-class compactness and inter-class separability in the feature space.

2. Fundamental Technologies

This section introduces the core principles of the three key technologies used in this work: the attention mechanism, model-agnostic meta-learning, and prototypical networks.

2.1. Attention Mechanism

The attention mechanism [13] is a computational paradigm in deep learning that simulates how human attention is allocated. Its core is to improve the model’s ability to capture key information by dynamically adjusting its degree of attention to different input parts. The entire attention mechanism process can be divided into three parts: correlation measurement, weight normalization, and weighted feature output. In the correlation measurement stage, suppose the input features generate a query matrix $Q \in R^{d_{q} \times L}$, a key matrix $K \in R^{d_{k} \times L}$, and a value matrix $V \in R^{d_{v} \times L}$ through linear transformation, where L is the length of the feature sequence and $d_{q} = d_{k}$ must be satisfied. The correlation between $Q$ and $K$ is quantified by the dot product operation, and the formula is as follows:

S = \frac{Q^{T} K}{\sqrt{d_{k}}},

(1)

where $\sqrt{d_{k}}$ is a scaling factor, used to avoid numerical overflow caused by the dot product of high-dimensional vectors and to ensure the stability of gradient backpropagation. In the weight normalization stage, the Softmax function converts each row of the correlation score $S$ into a probability distribution with a sum of 1, and the attention weight matrix $A \in R^{L \times L}$ is obtained to realize the probabilistic highlighting of key features:

A_{i j} = Softmax (S)_{i j} = \frac{\exp (S_{i j})}{\sum_{m = 1}^{L} \exp (S_{i m})},

(2)

where the larger the weight $A_{i j}$, the higher the contribution of the j-th feature to the i-th query target. In the weighted feature output stage, through the weighted summation of the attention weight matrix $A$ and the value matrix $V$, key information is aggregated, and the attention-enhanced feature is output:

Attention (Q, K, V) = A V^{T} .

(3)

In summary, the core of the attention mechanism is to adaptively assign weights to input features, so that the model can specifically capture the differentiated features of samples.
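The three stages in Equations (1)–(3) can be traced with a minimal pure-Python sketch. It follows the column-wise layout used above (one column per sequence position); the function names `softmax` and `attention` and the toy inputs are illustrative assumptions, not part of the original method.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Q, K: d_k x L and V: d_v x L, with one column per sequence position."""
    d_k, L = len(Q), len(Q[0])
    scale = math.sqrt(d_k)
    # Eq. (1): S[i][j] is the scaled dot product of query column i and key column j.
    S = [[sum(Q[r][i] * K[r][j] for r in range(d_k)) / scale for j in range(L)]
         for i in range(L)]
    # Eq. (2): each row of A is a probability distribution over the L positions.
    A = [softmax(row) for row in S]
    # Eq. (3): A V^T has shape L x d_v.
    out = [[sum(A[i][j] * V[c][j] for j in range(L)) for c in range(len(V))]
           for i in range(L)]
    return A, out


# Toy input (assumed values): d_k = 2, L = 2, d_v = 1.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0]]
A, out = attention(Q, K, V)
```

Each row of `A` sums to 1, so every output row is a convex combination of the value columns.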

2.2. Model-Agnostic Meta-Learning

Model-Agnostic Meta-Learning (MAML) [14] is a general framework in meta-learning suitable for few-shot scenarios. Its core goal is to optimize a set of initialized model parameters with strong generalization ability, enabling the model to quickly adapt to new tasks with only a few gradient updates. MAML is mainly divided into the meta-training phase and the meta-testing phase. The meta-training phase extracts general initialized parameters from many known malicious traffic tasks through task sampling and continuous iteration of inner and outer loop optimization to quickly adapt to new tasks in the meta-testing phase.

In the meta-training phase, N-way K-shot tasks are first generated according to the task distribution $p (T)$. Each task $T_{i}$ consists of a support set $D_{Support, i}$ and a query set $D_{Query, i}$, where N is the number of categories and K is the number of samples per category. N known categories are selected to maintain the consistency of the task distribution. For each category, K labeled samples are drawn to form $D_{Support, i}$, and then $n_{q}$ samples are drawn to form $D_{Query, i}$; the two sets cover the same categories but share no samples. Then, for each generated task $T_{i}$, the inner loop starts with the initial meta-parameter $θ$, calculates the loss $L_{T_{i}} (f_{θ})$ on the support set $D_{Support, i}$, and updates the parameters in the direction opposite to the loss gradient, thereby performing task-level rapid fine-tuning. The task-adapted parameter $θ^{'}$ is obtained from the meta-parameter $θ$ as

θ^{'} = θ - α \nabla_{θ} L_{T_{i}} (f_{θ}),

(4)

where $α$ is the inner loop learning rate.

After completing the inner loop fine-tuning of each task $T_{i}$, outer loop parameter optimization is performed. The initial meta-parameter $θ$ is updated by aggregating the losses of the adapted parameters on the query sets $D_{Query, i}$, and the meta-loss is

L_{meta} = \sum_{T_{i}} L_{T_{i}} (f_{θ^{'}}) .

(5)

Subsequently, the initial parameter $θ$ is updated by gradient descent on the meta-loss as follows:

θ \leftarrow θ - β \nabla_{θ} L_{meta},

(6)

where $β$ is the outer loop learning rate.

In the meta-testing phase, the model’s ability to quickly adapt to new samples is verified in three steps: constructing new tasks, parameter fine-tuning, and performance evaluation. First, an N-way K-shot task $T_{test}$ containing new-type samples is selected from the test set $D_{test}$. The support set $D_{Support, test}$ consists of K labeled samples of each new-type malicious traffic category, and the query set $D_{Query, test}$ consists of unlabeled samples of the same types, simulating the detection scenario when new-type malicious traffic is encountered. For $T_{test}$, starting from the optimal initial parameter $θ^{*}$ obtained from meta-training, gradient fine-tuning is performed through the inner loop to generate a parameter $θ_{test}$ adapted to the new-type malicious traffic. The fine-tuning formula is

θ_{test} = θ^{*} - α \nabla_{θ^{*}} L_{T_{test}} (f_{θ^{*}})

(7)

In the meta-testing phase, the model does not require retraining, and adaptation can be completed with only a small number of gradient updates.
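To make the interplay of Equations (4)–(7) concrete, the following toy sketch runs MAML on a scalar parameter with the quadratic task loss 0.5 (θ − target)², an assumption chosen because the gradient through the inner step can then be written by hand (the chain-rule factor is 1 − α). The task targets, learning rates, and function names are all illustrative; a real detector would use a neural network and automatic differentiation.

```python
def loss(theta, target):
    """Assumed toy task loss: 0.5 * (theta - target)^2."""
    return 0.5 * (theta - target) ** 2


def grad(theta, target):
    """Gradient of the toy loss with respect to theta."""
    return theta - target


def inner_update(theta, support_target, alpha):
    """Eq. (4) / Eq. (7): one gradient step on the support-set loss."""
    return theta - alpha * grad(theta, support_target)


def meta_step(theta, tasks, alpha, beta):
    """Eq. (5)-(6): sum the query losses of the adapted parameters and descend."""
    meta_grad = 0.0
    for support_target, query_target in tasks:
        theta_i = inner_update(theta, support_target, alpha)
        # Chain rule through the inner step: d(theta_i)/d(theta) = 1 - alpha.
        meta_grad += grad(theta_i, query_target) * (1.0 - alpha)
    return theta - beta * meta_grad


# Two hypothetical tasks, each given as (support target, query target).
tasks = [(1.0, 1.0), (3.0, 3.0)]
theta = 0.0
for _ in range(200):
    theta = meta_step(theta, tasks, alpha=0.5, beta=0.5)

# Meta-testing: adapt to an unseen task with a single inner-loop step.
theta_test = inner_update(theta, 2.5, alpha=0.5)
```

In this toy setting the meta-parameter settles between the two task optima, which is the initialization from which one inner step reaches either task most cheaply.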

2.3. Prototypical Networks

Prototypical Networks [15] are a core framework of the metric learning paradigm, suitable for fast adaptation under limited data conditions. A prototype is a typical feature representative of each category, computed as the mean of the feature vectors of that category’s samples in the support set. Prototypical Networks realize the rapid recognition of new categories through category prototype representation learning and feature distance metrics, transforming the classification problem into a distance calculation problem in four steps: feature embedding, prototype construction, distance metric, and classification decision.

In the feature embedding stage, the original data is mapped to a high-dimensional feature space through a differentiable model to achieve effective feature extraction. For any $x \in D$, its feature vector is $e = f_{ϕ} (x) \in R^{d}$, where d is the embedding dimension.

In the prototype calculation stage, for the k-th category in the N-way K-shot task $T_{i}$, where $k = 1, 2, \dots, N$, its support set is $D_{Support, i}^{k} = \{x_{i 1}^{k}, x_{i 2}^{k}, \dots, x_{i K}^{k}\}$, and the corresponding feature vectors are $\{e_{i 1}^{k}, e_{i 2}^{k}, \dots, e_{i K}^{k}\}$. The prototype vector of this category is calculated as $c_{k} = \frac{1}{K} \sum_{x \in D_{Support, i}^{k}} f_{ϕ} (x)$. As shown in Figure 1, the prototypes of each category are represented as blue squares.

In the distance metric stage, Prototypical Networks realize classification by measuring the distance between the query sample and the prototypes of each category. The smaller the distance, the higher the feature similarity between the sample and the category. The squared Euclidean distance is used to measure the distance between the feature $e_{q} = f_{ϕ} (x_{q})$ of the query sample and each prototype $c_{k}$, as shown below:

d (e_{q}, c_{k}) = \| e_{q} - c_{k} \|_{2}^{2}

(8)

In the classification decision stage, the distance is converted into a classification probability, and the Softmax function is used to normalize the negative distance. The probability that the query sample $x_{q}$ belongs to the k-th category is

p (y = k | x_{q}) = \frac{\exp (- λ d (e_{q}, c_{k}))}{\sum_{m = 1}^{N} \exp (- λ d (e_{q}, c_{m}))}

(9)

where $λ$ is a distance scaling factor, used to adjust the weight of the distance’s influence on the probability. The model is trained with cross-entropy loss as the optimization objective, and the classification error of the query set is minimized as follows:

L (ϕ) = - \frac{1}{| D_{Query, i} |} \sum_{x_{q} \in D_{Query, i}} \log p (y_{true} | x_{q})

(10)
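The prototype, distance, and decision stages of Equations (8)–(10) can be sketched in a few lines of pure Python. The sketch assumes the embeddings $f_{ϕ} (x)$ have already been computed; the class names and the two-dimensional toy embeddings are hypothetical.

```python
import math


def prototypes(support):
    """support: dict mapping class -> list of embedding vectors; returns class means."""
    return {k: [sum(col) / len(vecs) for col in zip(*vecs)]
            for k, vecs in support.items()}


def sq_dist(a, b):
    """Eq. (8): squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_probs(e_q, protos, lam=1.0):
    """Eq. (9): softmax over negative scaled distances to every prototype."""
    logits = {k: -lam * sq_dist(e_q, c) for k, c in protos.items()}
    peak = max(logits.values())
    exps = {k: math.exp(v - peak) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


def query_loss(queries, protos, lam=1.0):
    """Eq. (10): mean negative log-likelihood over (embedding, true class) pairs."""
    return -sum(math.log(class_probs(e, protos, lam)[y])
                for e, y in queries) / len(queries)


# Hypothetical 2-way 2-shot support set with 2-D embeddings.
support = {
    "benign": [[0.0, 0.0], [2.0, 0.0]],
    "attack": [[10.0, 10.0], [12.0, 10.0]],
}
protos = prototypes(support)
probs = class_probs([1.0, 1.0], protos)
```

A query that is equidistant from both prototypes receives probability 0.5 for each class, which makes the role of the distance in Equation (9) easy to check by hand.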

3. Proposed Approach

The overall framework of GADF-SRGA is shown in Figure 2. It includes three parts: the traffic data visualization module, the Sample Relevance Guided Attention (SRGA) module, and the Nearest Prototype Classifier (NPC) module. The traffic data visualization module converts traffic data into image data to extract global spatial features and the correlations between features, thereby enhancing spatial representation in malicious traffic detection; a ResNet-based backbone is then used to obtain pre-trained models. Next, within the meta-learning-based few-shot learning framework, the sample-relevance-guided attention mechanism uses the support set labels to explicitly guide the model to strengthen same-class relationships and enlarge the distance between different classes, thereby improving inter-class discrimination. Finally, because the categories vary from task to task, a nearest neighbor prototype classifier is used to classify the query set samples.

3.1. GADF Encoder

Raw traffic data exists only in a one-dimensional form composed of discrete time points, which makes it difficult to reflect temporal correlation characteristics—such as those of burst traffic. This limitation imposes constraints on traditional time-series analysis methods when capturing complex features. Specifically, traditional traffic processing methods face two key challenges: on one hand, they struggle to model the spatiotemporal correlation of multi-dimensional traffic features and fail to effectively capture dynamic collaborative patterns between different traffic dimensions; on the other hand, their ability to represent features of bursty and non-linear malicious traffic patterns is relatively limited, which often results in the loss of detailed information about key patterns during the feature extraction process. Furthermore, in few-shot scenarios, one-dimensional time-series data has an inherent attribute of sparse features, which cannot adequately support the model in quickly learning generalizable feature representations. This inadequacy further impacts the detection performance for new and unknown malicious traffic. To address this issue, we adopt the Gramian Angular Difference Field (GADF) [16] to perform spatiotemporal feature reconstruction and modal conversion on network traffic data, thereby facilitating the extraction of fine-grained features.

The GADF is an image encoding technology that maps one-dimensional time-series data to two-dimensional images via polar coordinate transformation and Gram matrix construction, while preserving the temporal dependence and spatial correlation of the data. Compared with the Gramian Angular Summation Field (GASF) and the Markov Transition Field (MTF), GADF more effectively captures the dynamic patterns of time series and converts them into image-based visual representations, facilitating subsequent feature extraction and classification. The implementation process is as follows: first, the one-dimensional time-series data is normalized to the interval [−1, 1]; after normalization, each time-series point is mapped to polar coordinates, with its value determining the angle and its time index determining the radius. Subsequently, a two-dimensional image is generated through the matrix operation of the GADF. During this process, two key conversions take place: first, the order of the time-series data is converted into the spatial positional relationship of the image; second, the magnitude and variation trend of the data values are converted into the texture and shape features of the image. Ultimately, the dynamic patterns of the time series are encoded into the visual structure of the image. For a detailed implementation of the GADF-based encoding workflow, refer to Algorithm 1.

Algorithm 1 GADF encoding for converting traffic data to images

Input: Traffic data $T = \{T_{1}, T_{2}, \dots, T_{N}\}$, where N denotes the batch size; the k-th time series $T_{k} = \{t_{k 1}, t_{k 2}, \dots, t_{k L}\}$ with $L \geq ω$; the normalization range $[- 1, 1]$; and the sliding window size $ω$ used to truncate the time-series data.

Output: The set of GADF image matrices $G = \{G_{1}, G_{2}, \dots, G_{N}\}$ for the traffic data $T$, where each $G_{k} \in R^{ω \times ω}$ corresponds one-to-one to the time series $T_{k}$.

1. For each $T_{k} \in T$, $k = 1, \dots, N$:

$G_{k} \leftarrow 0_{ω \times ω}$ // Initialize the GADF matrix $G_{k}$ of the k-th sample

// Data normalization

2. For $i = 1$ to $ω$:

3. ${\tilde{t}}_{k i} = \frac{2 (t_{k i} - \min (T_{k}))}{\max (T_{k}) - \min (T_{k})} - 1 \in [- 1, 1]$ // Normalize within a single traffic flow $T_{k}$

4. ${\tilde{t}}_{k i} \leftarrow CLIP ({\tilde{t}}_{k i}, - 1, 1)$, where ${\tilde{T}}_{k} = [{\tilde{t}}_{k 1}, {\tilde{t}}_{k 2}, \dots, {\tilde{t}}_{k ω}]$ // Clip individual elements

5. End for

// Polar coordinate transformation

6. For $i = 1$ to $ω$:

7. $ϕ_{k i} = \arccos ({\tilde{t}}_{k i}) \in [0, π]$ // Angle corresponding to ${\tilde{t}}_{k i}$

8. $γ_{k i} = \frac{i}{ω} \in [0, 1]$ // Radius

9. End for

// Gram matrix calculation

10. For $i = 1$ to $ω$:

11. For $j = 1$ to $ω$:

12. $G_{k, i j} = \sin (ϕ_{k i} - ϕ_{k j}) = \sqrt{1 - {\tilde{t}}_{k i}^{2}} \cdot {\tilde{t}}_{k j} - {\tilde{t}}_{k i} \cdot \sqrt{1 - {\tilde{t}}_{k j}^{2}}$

13. End for

14. End for

15. $G_{k} \leftarrow 255 \cdot \frac{G_{k} + 1}{2}$ // Image scaling for $G_{k}$

16. Return $G = \{G_{1}, G_{2}, \dots, G_{N}\}$
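Algorithm 1 can be sketched for a single traffic flow in pure Python as follows. The function name `gadf_encode` is hypothetical, and the handling of a constant series (where max equals min) is an assumption, since Algorithm 1 does not cover that case. The radius in step 8 is omitted because it does not enter the matrix in step 12.

```python
import math


def gadf_encode(series, window):
    """Encode one 1-D series as a window x window GADF image scaled to [0, 255]."""
    if len(series) < window:
        raise ValueError("series must contain at least `window` points")
    lo, hi = min(series), max(series)
    span = hi - lo
    # Steps 2-5: min-max normalize to [-1, 1] over the flow, keep the first
    # `window` points, and clip against rounding error.
    # Assumption: a constant series (span == 0) is mapped to all zeros.
    scaled = []
    for t in series[:window]:
        value = 2.0 * (t - lo) / span - 1.0 if span > 0 else 0.0
        scaled.append(max(-1.0, min(1.0, value)))
    # Steps 6-9: the angle of each point in polar coordinates.
    phi = [math.acos(v) for v in scaled]
    # Steps 10-14: G[i][j] = sin(phi_i - phi_j).
    gram = [[math.sin(phi[i] - phi[j]) for j in range(window)]
            for i in range(window)]
    # Step 15: rescale from [-1, 1] to [0, 255].
    return [[255.0 * (g + 1.0) / 2.0 for g in row] for row in gram]


# Toy flow (assumed values) encoded as a 3 x 3 image.
image = gadf_encode([0.0, 1.0, 2.0], window=3)
```

Because sin is odd, the unscaled matrix is antisymmetric, so the diagonal of the scaled image is always 127.5 and mirrored pixels sum to 255.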

Based on Algorithm 1, the traffic image dataset G can be obtained. The image dataset is partitioned into mini-batches with a batch size of M, resulting in input images $\{g_{i}\}_{i = 1}^{M}$ for each batch. Subsequently, a backbone feature extractor extracts a d-dimensional feature vector $b_{i} \in R^{1 \times d}$ from each image in the batch. Finally, the feature vectors $b_{i}$ are concatenated along the sample dimension, yielding the basic feature $G_{B} \in R^{N \times d}$.

The advantages of GADF image encoding lie mainly in three aspects: first, it converts the temporal dependence and numerical relationships of the one-dimensional time series into the spatial structure of two-dimensional images, facilitating the capture of multiscale local and global patterns; second, two-dimensional images can directly reuse mature image processing technologies, enabling efficient extraction of high-level visual features; third, the image-based dense feature representation can provide richer pattern information for meta-learning frameworks in few-shot scenarios, thereby accelerating the model’s adaptation to new types of malicious traffic.

3.2. Class-Sparse Attention Module

Malicious traffic classification frequently confronts attacks from novel traffic families, such as zero-day attacks and ransomware variants, for which labeled samples are extremely scarce, making it a typical few-shot classification scenario. Traditional methods rely solely on implicit feature similarity, which is prone to missed detections because single-sample features are one-sided; meanwhile, the full-attention mechanism tends to introduce noise. To address these issues, this paper proposes a class-sparse attention mechanism, which filters out noise via top-K sparsification and employs a guided map for explicit supervision to align with true categories. It aggregates the common features of samples within the same class to compress intra-class differences and enlarge inter-class distances. Ultimately, it re-weights the features via the inter-sample class similarity graph $SR$ to generate the class-sparse attention feature $G_{S}$ with explicit category association information. The details are presented as follows.

The base feature undergoes three linear transformations $Q, K, V = {Linear}_{q, k, v} (G_{B})$ to obtain the query, key, and value matrices $Q, K, V \in R^{N \times d}$. Only the top K high-similarity associations are retained to construct the inter-sample class similarity graph $SR$. The raw similarity score between samples is given by $S_{raw} = \frac{Q K^{T}}{\sqrt{h}}$, where h is the scaling factor in the attention mechanism. The top-K sparsification strategy is defined as follows: for each sample i, the K associations with the highest similarity scores in row i of $S_{raw}$ are retained, and the rest are set to 0, yielding the sparsified similarity score $S_{sparse, i} = TopK (S_{raw, i}, k = K)$.

Finally, the sigmoid function is used for activation to generate the final inter-sample class similarity graph as follows:

SR = sigmoid (S_{sparse}) = \frac{1}{1 + e^{- S_{sparse}}}, \quad s_{i j} \in (0, 1)

(11)

Unlike traditional few-shot learning methods that only indirectly optimize features via classification loss and tend to learn spurious similarities due to the lack of explicit constraints on class associations, this paper adopts a guided map (GM) to supervise SR explicitly. It constrains SR to align with true class relationships through a loss function. For a batch of input samples containing N samples and L classes, the one-hot encoding matrix of the labels is denoted as $onehot \in R^{N \times L}$. The guided map GM is obtained by multiplying the one-hot encoding matrix by its transpose: $GM = onehot \times onehot^{T}$, $GM \in R^{N \times N}$. Its elements are defined as follows:

g_{i j} = \begin{cases} 1, & \text{the two samples belong to the same class} \\ 0, & \text{the two samples belong to different classes} \end{cases}

(12)

GM provides label-level information to supervise the inter-sample class similarity graph SR, enabling the network to capture the true relationships between samples. Then, the value matrix V is re-weighted by SR to obtain the class-sparse attention feature $G_{S} \in R^{N \times d}$ for each sample:

G_{S} = softmax (SR) \times V,

(13)

where × denotes matrix multiplication and the softmax function ensures that the weighted aggregated features are consistent in scale with the original features. This class-sparse attention mechanism can thus obtain the task-relevant similarity graph SR of similar samples and the class-sparse attention feature $G_{S} \in R^{N \times d}$ for malicious traffic classification.
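The path from the raw scores to $G_{S}$ (top-K sparsification, Equation (11), and Equation (13)) can be sketched as below. The sketch takes Q, K, and V as given instead of learning the three linear layers, applies the softmax row by row, and uses h = 4 in the toy call; all three choices, like the function names, are assumptions made for illustration.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def top_k_sparsify(scores, k):
    """Keep the k largest entries of every row and set the rest to 0."""
    sparse = []
    for row in scores:
        keep = set(sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k])
        sparse.append([s if j in keep else 0.0 for j, s in enumerate(row)])
    return sparse


def class_sparse_attention(Q, K, V, k, h):
    """Q, K, V: N x d lists of rows. Returns the graph SR (N x N) and G_S (N x d)."""
    n, d = len(Q), len(V[0])
    s_raw = [[sum(a * b for a, b in zip(Q[i], K[j])) / math.sqrt(h)
              for j in range(n)] for i in range(n)]
    s_sparse = top_k_sparsify(s_raw, k)
    # Eq. (11): entries zeroed by the sparsification map to sigmoid(0) = 0.5.
    SR = [[sigmoid(s) for s in row] for row in s_sparse]
    # Eq. (13): row-wise softmax (assumed axis) followed by re-weighting of V.
    weights = [softmax(row) for row in SR]
    G_S = [[sum(weights[i][j] * V[j][c] for j in range(n)) for c in range(d)]
           for i in range(n)]
    return SR, G_S


# Three toy samples: the first two are identical, the third is orthogonal to them.
X = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
SR, G_S = class_sparse_attention(X, X, X, k=2, h=4)
```

In the toy call, the first two samples score each other highest, so their rows of `SR` put more weight on each other than on the third sample.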

Compared with the attention mechanism in Transformers [13], the SRGA module exhibits differences in attention modeling granularity. Standard self-attention calculates attention weights at the token level, focusing on temporal dependencies between individual tokens in the input sequence. In contrast, the SRGA module models attention directly at the sample level, prioritizing correlations among entire malicious traffic samples rather than local token features. This makes SRGA more suitable for few-shot traffic detection tasks that require mining global sample relationships. Cross-attention relies on bidirectional mapping between the query set and the key-value set, yet lacks explicit category supervision and is susceptible to noise interference from heterogeneous traffic features. The SRGA module introduces a class-label-guided supervision graph that imposes sparse constraints on attention weights, retaining only the top-K high-similarity sample associations to suppress invalid cross-class attention calculations. This approach effectively reduces computational redundancy while enhancing intra-class compactness. Standard self-attention employs a unified scaled dot-product computation for all attention weights, leading to non-targeted weight allocation. In contrast, the SRGA module employs a dynamic weight allocation mechanism based on sample class similarity, strengthening association weights within the same malicious traffic family while weakening those of heterogeneous samples. This design enhances the model’s ability to distinguish between similar attack types in few-shot scenarios.

For each pair of corresponding elements in the inter-sample class similarity graph $SR$ and the guided map $GM$, the relationship loss $L_{u}$ is computed using the following formula:

L_{u} = - \frac{1}{N^{2}} \sum_{i = 1}^{N} \sum_{j = 1}^{N} (g_{i j} \log s_{i j} + (1 - g_{i j}) \log (1 - s_{i j})),

(14)

where $s_{i j} \in SR$ denotes the similarity score between sample i and sample j, and $g_{i j} \in GM$ represents the true class association label for the two samples. By constructing this relationship loss $L_{u}$, the learning process of the inter-sample class similarity graph $SR$ is explicitly guided to align its elements with those of $GM$, thereby achieving more compact intra-class clustering of malicious traffic samples in the feature space from a global perspective and further enhancing the model’s generalization ability for few-shot novel malicious traffic.
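The guided map of Equation (12) and the relationship loss of Equation (14) can be checked with a short pure-Python sketch. The function names are hypothetical, and the clipping constant `eps` is an added assumption that keeps the logarithm finite; it is not part of Equation (14).

```python
import math


def guided_map(labels, num_classes):
    """GM = onehot x onehot^T, so GM[i][j] is 1 exactly when labels i and j match."""
    onehot = [[1.0 if c == y else 0.0 for c in range(num_classes)] for y in labels]
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in onehot]
            for row_i in onehot]


def relation_loss(SR, GM, eps=1e-12):
    """Eq. (14): mean binary cross-entropy between SR and GM over all N^2 pairs."""
    n = len(SR)
    total = 0.0
    for i in range(n):
        for j in range(n):
            s = min(max(SR[i][j], eps), 1.0 - eps)  # assumed clipping for stability
            g = GM[i][j]
            total += g * math.log(s) + (1.0 - g) * math.log(1.0 - s)
    return -total / (n * n)


# Toy batch: samples 0 and 1 share a class, sample 2 belongs to another class.
GM = guided_map([0, 0, 1], num_classes=2)
uninformative = [[0.5] * 3 for _ in range(3)]
aligned = [[0.9 if g == 1.0 else 0.1 for g in row] for row in GM]
loss_uninformative = relation_loss(uninformative, GM)
loss_aligned = relation_loss(aligned, GM)
```

A similarity graph that agrees with the labels yields a lower loss than an uninformative graph of constant 0.5, which is the direction in which the loss pushes SR.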

3.3. Fusion Features

The base features extracted by the backbone network contain fine-grained information about samples, but they tend to lack class association, resulting in insufficient discriminability in few-shot scenarios. By contrast, the class-sparse attention features are generated under the supervision of the guided map; they can capture the correlation information of samples within the same class, yet may lose the fine-grained details of individual samples. To dynamically and adaptively fuse the base features extracted by the backbone network with the class-sparse attention features based on class information, this paper proposes an adaptive feature fusion strategy.

Given that base features and class-sparse attention features exhibit distinct focuses, each feature is assigned its own learnable weighting parameters. Accordingly, we define a weight parameter $α \in R^{1 \times d}$ and a bias parameter $β \in R^{1 \times d}$ for each of the base features and class-sparse attention features of input samples. For each feature, the linear transformation function $δ (\cdot)$ parameterized by $α$ and $β$ is expressed as follows:

δ (g) = \frac{g - E [g]}{\sqrt{Var [g] + ε}} \times α + β, \quad g \in R^{1 \times d},

(15)

where $E [\cdot]$ represents the mean operation, $Var [\cdot]$ the variance operation, and $ε$ is a small positive constant that prevents division by zero. The learnable parameters of the base features and the class-sparse attention features are independent. By transforming the base features and the class-sparse attention features with $δ (\cdot)$ and then concatenating them along the feature dimension, the fusion feature $G_{F}$ is obtained, as shown below:

G_{F} = δ (G_{B}) \oplus δ (G_{S}), \quad G_{F} \in R^{N \times 2 d} .

(16)

where ⊕ represents the concatenation operation.
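A minimal sketch of Equations (15) and (16) follows. It assumes that the mean and variance are taken over the d entries of each sample’s feature vector (as in layer normalization) and that the population variance is used; the text does not state either choice. The function names and the value of `eps` are illustrative.

```python
import math


def delta(g, alpha, beta, eps=1e-5):
    """Eq. (15): normalize one feature vector, then scale by alpha and shift by beta."""
    mean = sum(g) / len(g)
    var = sum((x - mean) ** 2 for x in g) / len(g)  # assumed population variance
    return [(x - mean) / math.sqrt(var + eps) * a + b
            for x, a, b in zip(g, alpha, beta)]


def fuse(G_B, G_S, params_b, params_s):
    """Eq. (16): concatenate the two transformed N x d features into N x 2d."""
    alpha_b, beta_b = params_b
    alpha_s, beta_s = params_s
    return [delta(b, alpha_b, beta_b) + delta(s, alpha_s, beta_s)
            for b, s in zip(G_B, G_S)]


# One toy sample with d = 3; each branch has its own (alpha, beta) pair.
ones, zeros = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
fused = fuse([[1.0, 2.0, 3.0]], [[4.0, 4.0, 4.0]], (ones, zeros), (ones, zeros))
```

Keeping separate parameter pairs for the two branches lets training rescale the base and attention features independently before they are concatenated.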

3.4. Nearest Neighbor Prototype Classifier

The standard Few-Shot Classification (FSC) problem provides a training set $D_{Train}$ and a test set $D_{Test}$. The FSC model is trained on a randomly sampled sequence of tasks from $D_{Train}$ and then evaluated on a randomly sampled sequence of tasks from $D_{Test}$. Each task consists of two disjoint sets arranged according to the N-way K-shot setting: the support set $D_{Support} = \{(g_{i}^{sp}, l_{i}^{sp})\}_{i = 1}^{N \times K}$, which holds K labeled samples for each of the N classes, and the query set $D_{Query} = \{(g_{i}^{qy}, l_{i}^{qy})\}_{i = 1}^{n_{qy}}$, which shares a common label space with the support set. During the training phase, the FSC model learns from the support set of each task drawn from $D_{Train}$ and is then evaluated on the corresponding query set. This training strategy exposes the model to the few-sample setting it faces at test time, thereby enhancing its robustness.

Meta-learning approaches have recently shown superior performance in few-shot tasks. The original prototypical network computes the prototype of each class k as the mean feature of its support samples:

p_{k} = \frac{1}{| D_{Support}^{k} |} \sum_{(g_{i}^{sp}, l_{i}^{sp}) \in D_{Support}^{k}} f_{θ} (g_{i}^{sp})

, where

D_{Support}^{k}

represents the support samples of class k, and

f_{θ} (g_{i}^{sp})

denotes the feature extracted from a support sample by the network with parameters θ. Subsequently, the probability that a query sample belongs to each class is computed with the Softmax function over its distances to the prototypes:

p_{θ} (l = k | g^{q}) = \frac{\exp (- ∥ f_{θ} (g^{q}) - p_{k} ∥_{2}^{2})}{\sum_{k' = 1}^{N} \exp (- ∥ f_{θ} (g^{q}) - p_{k'} ∥_{2}^{2})}

(17)

where

p_{k}

is the class prototype. The distance between the sample feature

f_{θ} (g^{q})

and the class prototype

p_{k}

indicates the proximity of the sample to the respective class. The function

\exp (- d_{c}) \in (0, 1]

is used to convert the Euclidean distance

d_{c}

, which represents the distance between the query feature and a class prototype, into a bounded similarity score. The larger the distance, the lower the probability assigned to that class. N represents the number of classes in a task. The learning process minimizes the negative log-probability of the true class,

L = - \log p_{θ} (l = k | g_{i}^{qy})

. In this paper, we employ the cross-entropy loss to minimize the prediction error of the query samples for each task, as follows:

L_{s} = - \frac{1}{n_{q}} \sum_{q = 1}^{n_{q}} \sum_{k = 1}^{N} x_{q, k} \log p_{θ} (l = k | g_{q}^{qy})

(18)

where

n_{q}

is the number of samples in the query set of a task, and N is the number of classes in the task.

x_{q, k}

is the ground-truth indicator, which is 1 if the query sample q truly belongs to class k, and 0 otherwise.

p_{θ} (l = k | g_{q}^{qy})

is the sample classification probability obtained from Equation (17). Finally, the total loss combines the sample-based classification loss above with the loss

L_{u}

defined in (3), and is given by:

L_{p} = λ_{s} L_{s} + λ_{u} L_{u}

(19)

where

λ_{s}

and

λ_{u}

are hyperparameters that control the impact of each loss term on the overall loss.
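The classifier described in this subsection can be sketched in a few lines of plain Python. The sketch assumes that features have already been extracted (the lists of numbers stand in for f_θ(g)), treats L_u as an externally supplied number because its definition lies outside this subsection, and uses illustrative function names, toy data, and default weights that do not come from the paper.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def prototypes(support_feats, support_labels, n_way):
    """Class prototype p_k: mean of the support features of class k."""
    protos = []
    for k in range(n_way):
        members = [f for f, l in zip(support_feats, support_labels) if l == k]
        dim = len(members[0])
        protos.append([sum(m[j] for m in members) / len(members) for j in range(dim)])
    return protos

def class_probs(query_feat, protos):
    """Eq. (17): softmax over negative squared distances to the prototypes."""
    logits = [-sq_dist(query_feat, p) for p in protos]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def query_loss(query_feats, query_labels, protos):
    """Eq. (18): mean cross-entropy over the query set with one-hot labels."""
    losses = [-math.log(class_probs(f, protos)[l])
              for f, l in zip(query_feats, query_labels)]
    return sum(losses) / len(losses)

def total_loss(L_s, L_u, lam_s=1.0, lam_u=1.0):
    """Eq. (19): weighted sum of the two loss terms (default weights are placeholders)."""
    return lam_s * L_s + lam_u * L_u

# Toy 2-way 2-shot task with 2-dimensional features (made-up values).
support_feats = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]
support_labels = [0, 0, 1, 1]
protos = prototypes(support_feats, support_labels, n_way=2)

probs = class_probs([0.1, 0.1], protos)  # query close to the class-0 prototype
L_s = query_loss([[0.1, 0.1], [0.9, 1.0]], [0, 1], protos)
print(probs, L_s, total_loss(L_s, L_u=0.0))
```

Because the query at (0.1, 0.1) lies much closer to the class-0 prototype than to the class-1 prototype, it receives the higher class-0 probability, which illustrates that a larger distance yields a lower class probability.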

4. Experimental Design and Results Analysis

This section describes the dataset, relevant experimental details, and experimental setup.

4.1. Datasets Description

To verify the effectiveness and detection performance of GADF-SRGA in real-world IoT environments, we conduct IoT intrusion detection using the private dataset Malicious_TLS and the public dataset ToN-IoT. The Malicious_TLS dataset is a malicious traffic dataset captured from real edge network devices, where all traffic is encrypted with TLS. This dataset contains 22 different types of traffic. The ToN-IoT dataset contains traffic captured from real IoT environments, including 10 different types of traffic. The training set and test set are partitioned in a ratio of 7:3. Figure 3 illustrates the class distribution and categorization of the ToN-IoT dataset, where Class 0 represents benign samples and the remaining classes correspond to different malicious traffic families.

In the ToN-IoT dataset, benign samples account for approximately 68.1%. Among the malicious traffic families, except for Class 4 which represents 0.2%, each of the remaining classes accounts for about 4.3%, indicating a significant difference in the number of samples across malicious traffic families.

Figure 4 presents the class distribution of the Malicious_TLS dataset, where Class 0 denotes benign samples and the rest correspond to different malicious traffic families.

In the Malicious_TLS dataset, benign samples account for 34.9%. There are significant differences in the occurrence counts and proportions among different malicious traffic families; for instance, Class 20 occurs 4798 times, while Class 16 only appears 1107 times, indicating a significant class imbalance.

4.2. Experimental Settings

Experimental environment: All experiments were conducted on the same device, equipped with an NVIDIA A100 80GB GPU. The experimental setup is configured with Python 3.12, PyTorch 2.5, and CUDA 12.1 to ensure an efficient, reproducible experimental environment.

To verify the effectiveness of the proposed method, this experiment utilizes four commonly used evaluation metrics in malicious traffic detection and classification: accuracy (Acc), precision (Pr), recall (Rc), and F1-score (F1). The specific calculation methods are shown in Equations (20)–(23).

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}

(20)

Precision = \frac{TP}{TP + FP}

(21)

Recall = \frac{TP}{TP + FN}

(22)

F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}

(23)
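As a worked example of Equations (20)–(23), the following Python sketch computes the four metrics from confusion counts, where TP, FP, TN, and FN are taken in their usual sense of true positives, false positives, true negatives, and false negatives. The counts are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention. The text does not state how per-class values are averaged in the multi-class tables, so the sketch covers only a single positive class.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from confusion counts.

    Returning 0.0 for an empty denominator is an assumed convention.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts for one malicious class treated as the positive class.
acc, pr, rc, f1 = detection_metrics(tp=90, fp=10, tn=85, fn=15)
print(f"{acc:.4f} {pr:.4f} {rc:.4f} {f1:.4f}")  # 0.8750 0.9000 0.8571 0.8780
```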

Model parameter settings: The configuration of important hyperparameters of the model is shown in Table 1.

As shown in Table 1, in the hyperparameter configuration, the learning rate is set to 0.001, which balances the convergence rate and stability of parameter updates in the meta-learning framework; the number of epochs is set to 200 to ensure the model entirely fits the meta-training data; the number of meta-learning task-based training episodes (Num_episode) is set to 600, and multi-task sampling in each meta-training phase enhances the model’s generalization ability in few-shot scenarios; the multi-source guided classification loss weights

λ, μ

are set to 1.0 and 1.5 respectively, balancing the contributions of different classification losses; the image size

S_{img}

is 28, adopting lightweight encoding for malicious traffic features; both the embedding dimension

d_{m}

and the fusion dimension

d_{o}

are 64, ensuring the efficiency of feature embedding and feature interaction.

According to the characteristics of meta-learning, the dataset needs to be divided into two parts, the meta-training set and the meta-validation set, each containing a support set and a query set. Taking the Malicious_TLS dataset as an example, the initial dataset has 23 categories, including 22 attack categories and 1 benign category. First, the 23 categories were shuffled; 18 categories were randomly selected as the meta-training set, and the remaining 5 categories were used as the meta-validation set. For each category in a task, 15 samples are randomly selected as the query set and K ∈ {1, 5, 10} samples as the support set. The tasks required for meta-learning are randomly composed from these categories, with each task containing a small number of training and testing samples.
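The class split and task construction described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the seeds, and the placeholder data (40 dummy samples per class) are assumptions, while the 18/5 class split, the 5-way setting, and the 15 query samples per class follow the text.

```python
import random

def split_classes(class_ids, n_train, seed=0):
    """Shuffle the class ids and split them into meta-training / meta-validation classes."""
    rng = random.Random(seed)
    shuffled = list(class_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def sample_episode(data_by_class, classes, n_way=5, k_shot=1, n_query=15, rng=None):
    """Draw one N-way K-shot task with disjoint support and query sets."""
    rng = rng or random.Random()
    episode_classes = rng.sample(classes, n_way)
    support, query = [], []
    for label, cls in enumerate(episode_classes):
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query

# Placeholder data: 23 classes with 40 dummy samples each.
data_by_class = {c: [f"sample_{c}_{i}" for i in range(40)] for c in range(23)}
train_classes, val_classes = split_classes(range(23), n_train=18)
support, query = sample_episode(data_by_class, val_classes,
                                n_way=5, k_shot=5, rng=random.Random(1))
print(len(train_classes), len(val_classes), len(support), len(query))  # 18 5 25 75
```

Labels are re-indexed from 0 to N − 1 inside each task, so the classifier only ever sees task-local class indices.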

4.3. Comparative Experiments with Meta-Learning Approaches

To evaluate the traffic classification performance of the proposed method under few-shot conditions, we established three scenarios: 5-way 10-shot, 5-way 5-shot, and 5-way 1-shot. We employed four metrics, namely Accuracy (Acc), Precision (Pr), Recall (Rc), and F1-score (F1), for assessment and mitigated randomness through cross-task validation. Firstly, comparative experiments were conducted on the Malicious_TLS dataset, contrasting the proposed method with four classic meta-learning models in the few-shot domain, including MAML, Reptile, MatchingNet, and ProtoNet. The comparison results are summarized in Table 2.

In Table 2, bold values indicate the best result for each metric, and underlined values indicate the second-best. Across the few-shot scenarios of the Malicious_TLS dataset, the proposed method achieves the highest accuracy: 97.17% in the 5-way 1-shot scenario, 97.34% in the 5-way 5-shot scenario, and 97.72% in the 5-way 10-shot scenario. In the 5-way 1-shot scenario, the method leads in accuracy and precision, reaching 97.17% and 97.14%, respectively, while the second-best methods, MatchingNet and ProtoNet, both attain an accuracy of 96.69%; its recall (96.09%) and F1-score (96.61%), however, are slightly below those of ProtoNet (96.69% and 96.73%). In the 5-way 5-shot scenario, the proposed method achieves the highest performance across all four metrics, with values of 97.34%, 97.56%, 97.34%, and 97.45%; the second-best method, ProtoNet, scores 97.28%, 97.36%, 97.28%, and 97.31%. In the 5-way 10-shot scenario, the proposed method maintains its lead with an accuracy of 97.72%, which is 0.21 percentage points higher than the second-best method, ProtoNet, at 97.51%. In summary, the proposed method achieves the highest accuracy and precision in all three few-shot scenarios on Malicious_TLS traffic and leads on all four metrics in the 5-shot and 10-shot scenarios.

Next, for the ToN_IoT dataset, we conducted comparative experiments between the proposed method and four classic few-shot models: MAML, Reptile, MatchingNet, and ProtoNet. The comparison results are shown in Table 3.

In Table 3, bold values indicate the best result for each metric, and underlined values indicate the second-best. In every few-shot scenario of the ToN_IoT dataset, from 5-way 1-shot to 5-way 10-shot, the proposed method achieves the best results. Its accuracy reaches 92.37% in the 5-way 1-shot scenario, 96.56% in the 5-way 5-shot scenario, and 97.81% in the 5-way 10-shot scenario. In the 5-way 1-shot scenario, the proposed method attains accuracy and precision values of 92.37% and 92.84%, respectively, whereas the second-best method, ProtoNet, has an accuracy of 84.59% and a precision of 86.09%. In the 5-way 5-shot scenario, the proposed method is best across all four evaluation metrics, achieving an accuracy of 96.56%, which is 3.84 percentage points higher than the next best method, MatchingNet, at 92.72%. In the 5-way 10-shot scenario, the proposed method maintains its lead with an accuracy of 97.81%, surpassing the next best method, ProtoNet (93.36%), by 4.45 percentage points.

The MAML method struggles to extract spatiotemporal patterns of malicious traffic because it lacks traffic-specific feature enhancement mechanisms. Meanwhile, Reptile is constrained by its simplistic gradient update strategy, which results in inadequate generalization and robustness when dealing with class imbalance. Additionally, MatchingNet employs a cosine similarity-based matching mechanism, making it challenging to differentiate between semantically similar variants of malicious traffic. ProtoNet’s approach to modeling class prototypes through mean calculations is susceptible to interference from outlier samples and dominant classes, resulting in significant bias in prototype representation under imbalanced conditions. In contrast, the GADF-SRGA proposed in this paper effectively preserves the spatiotemporal correlation features of malicious traffic. Its SRGA module explicitly models inter-sample class correlations, significantly improving classification robustness. Furthermore, it incorporates a meta-learning framework that enables rapid few-shot adaptation, maintaining high accuracy even in extreme 1-shot scenarios. In summary, the proposed method achieves the highest accuracy among all compared models in every few-shot scenario on both the Malicious_TLS and ToN_IoT datasets, and it leads on all four metrics except recall and F1-score in the Malicious_TLS 5-way 1-shot scenario.

4.4. Comparison Experiment

To verify the performance of the proposed time series classification method under few-shot conditions, three scenarios were defined: 5-way 10-shot, 5-way 5-shot, and 5-way 1-shot. Classification accuracy is used as the evaluation metric, and the influence of randomness is reduced through cross-task validation. The few-shot IoT malicious traffic detection model based on GADF-SRGA proposed in this paper can dynamically adjust the network and adapt to new tasks. In Table 4, we review the results of other researchers in terms of the dataset used, the number of categories, classification settings, feature types, and the amount of data used during the training process.

GADF-SRGA achieves 97.34% accuracy under the 5-way 5-shot configuration and reaches 97.72% under the 5-way 10-shot configuration on the Malicious_TLS dataset. These results support its dual capability: effective multi-category classification and rapid feature learning from minimal samples, which is an advantage for detecting the novel attacks prevalent in IoT environments. Notably, GADF-SRGA attains nearly 98% accuracy with only 9183 samples (6433 + 2750, 5-way 10-shot), whereas the Aggregation model uses approximately 30 million samples to reach 98.58%, although the two results are obtained on different datasets and under different classification settings.

4.5. Ablation Experiments

To verify the effectiveness of each module of the proposed image-based few-shot malicious traffic detection model, we conducted ablation experiments on the Malicious_TLS dataset covering the traffic data visualization module (comparison of different visualization methods) and the Sample Relevance Guided Attention (SRGA) module. The results of the ablation experiment are shown in Table 5.

In Table 5, bold font denotes the best result, and underlined font denotes the second-best result. Experiment 1 uses GADF encoding without the SRGA module, while Experiments 2–5 combine the SRGA module with different traffic visualization methods (GASF, GADF, MTF, and RP). Comparing Experiments 1 and 3 shows that adding the SRGA module to GADF-encoded images improves performance significantly: accuracy rises by 7.02 percentage points in the 5-way 1-shot setting (from 90.15% to 97.17%) and by 5.36 percentage points in the 5-way 5-shot setting (from 91.98% to 97.34%). Among the visualization methods, GADF achieves the highest accuracy in both settings, 97.17% and 97.34%, respectively. Specifically, in the 5-way 1-shot setting, its accuracy is 0.21, 0.89, and 1.47 percentage points higher than those of the GASF, MTF, and RP methods; in the 5-way 5-shot setting, it is 1.91, 0.78, and 5.36 percentage points higher. The choice of traffic visualization method therefore has a measurable impact on model performance. The results are visualized in Figure 5.

Experimental results demonstrate that the combination of the GADF method adopted in this study and the SRGA module achieves the highest accuracy and exhibits the best performance.
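To make the difference between two of the compared encodings concrete, the sketch below computes the Gramian angular summation and difference fields in their commonly used textbook form: the series is rescaled to [−1, 1], mapped to angles φ = arccos(x), and then GASF takes cos(φ_i + φ_j) while GADF takes sin(φ_i − φ_j). This is an assumed standard formulation, not a reproduction of the preprocessing used in the experiments; the function names and the toy series are illustrative.

```python
import math

def rescale(series):
    """Min-max rescale a 1-D series to [-1, 1] (a constant series maps to zeros)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [2 * (x - lo) / (hi - lo) - 1 for x in series]

def gramian_fields(series):
    """Return (GASF, GADF) as n x n nested lists for a series of length n."""
    x = rescale(series)
    # Clamp before acos to guard against tiny floating-point overshoot.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in x]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf

# Toy series of 4 values; a series of length 28 would give a 28 x 28 image.
series = [0.0, 1.0, 3.0, 2.0]
gasf, gadf = gramian_fields(series)
print(len(gadf), len(gadf[0]))  # 4 4
```

In this formulation the GADF matrix is antisymmetric with a zero diagonal, so it encodes the direction of change between time steps, whereas the GASF matrix is symmetric.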

5. Conclusions

This paper proposes a few-shot malicious traffic detection method based on a sample similarity-guided attention mechanism. This method integrates a sample relevance-guided attention module into the meta-learning framework to explore the application potential of few-shot learning in this task. Experiments verify that the method can address label scarcity in IoT network anomaly detection with a small amount of labeled data, and it exhibits excellent feature extraction performance on the Malicious_TLS and ToN-IoT datasets, with detection performance significantly outperforming existing methods.

Although GADF-SRGA achieves good performance in few-shot malicious traffic detection, it still has two limitations: first, for malicious traffic types not observed during training, the model can only identify them by relying on the general spatiotemporal features extracted by GADF image encoding and the sample relationship generalization ability of the SRGA module. In few-shot scenarios, it is unable to learn exclusive feature patterns for new attacks, which tends to weaken the discriminative ability due to feature distribution shift; second, the computational cost of the method and its deployment feasibility on resource-constrained IoT devices remain unclear, and the real-time deployment adaptability on edge IoT devices is yet to be verified.

To address the above limitations, future research will focus on pruning and lightweight network design for multimodal models, aiming to improve training efficiency and to ease deployment on resource-constrained devices.

Author Contributions

X.W. (Xuan Wu) authored the Abstract and Introduction, and contributed to the theory section, model performance analysis section, and outlook section. P.W. provided technical guidance for code implementation and the theory section. Y.S., X.W. (Xiaodan Wang) and J.C. obtained the essential funding for experiments, data collection, and framework development. All authors participated in revising the manuscript, ensuring its academic accuracy and coherence. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [grant numbers 61876189, 61703426, 61273275] and the Shaanxi Provincial Natural Science Foundation [grant number 2024JC-YBQN-0677].

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets related to this study can be obtained from the following website: https://research.unsw.edu.au/projects/toniot-datasets (accessed on 1 January 2025). Other data information can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Le, T.T.H.; Oktian, Y.E.; Kim, H. XGBoost for Imbalanced Multiclass Classification-Based Industrial Internet of Things Intrusion Detection Systems. Sustainability 2022, 14, 8707.
Thakkar, A.; Lohiya, R. Fusion of statistical importance for feature selection in deep neural network-based intrusion detection system. Inf. Fusion 2023, 90, 353–363.
Thakkar, A.; Kikani, N.; Geddam, R. Fusion of linear and non-linear dimensionality reduction techniques for feature reduction in LSTM-based intrusion detection system. Appl. Soft Comput. 2024, 154, 111378.
Song, Y.F.; Zhang, D.D.; Wang, J.; Wang, Y.; Wang, Y.; Ding, P. Application of Deep Learning in Malware Detection: A Review. Big Data 2025, 12, 99.
Djenouri, Y.; Djenouri, D.; Belhadi, A.; Srivastava, G.; Lin, J.C.-W. Emergent deep learning for anomaly detection in internet of everything. IEEE Internet Things J. 2023, 10, 3206–3214.
Yang, T.; Chen, J.; Deng, H.; He, B. A lightweight intrusion detection algorithm for IoT based on data purification and a separable convolution improved CNN. Knowl.-Based Syst. 2024, 304, 112473.
Wang, P.; Wang, K.; Song, Y.F.; Wang, X. AutoLDT: A lightweight spatio-temporal decoupling transformer framework with AutoML method for time series classification. Sci. Rep. 2024, 14, 29801.
Chen, X.; Wu, W.; Ma, L.; You, X.; Gao, C.; Sang, N.; Shao, Y. Exploring sample relationship for few-shot classification. Pattern Recognit. 2025, 159, 111089.
Park, S.-H.; Syazwany, N.S.; Lee, S.-C. Meta-feature fusion for few-shot time series classification. IEEE Access 2023, 11, 41400–41414.
Olasehinde, O.O.; Johnson, O.V.; Olayemi, O.C. Evaluation of selected meta learning algorithms for the prediction improvement of network intrusion detection system. In Proceedings of the 2020 International Conference in Mathematics, Computer Engineering and Computer Science (ICMCECS), Ayobo, Nigeria, 18–21 March 2020; pp. 1–7.
Lu, C.; Wang, X.; Yang, A.; Liu, Y.; Dong, Z. A few-shot-based model-agnostic meta-learning for intrusion detection in security of Internet of Things. IEEE Internet Things J. 2023, 10, 21309–21321.
Wu, Y.; Lin, G.; Liu, L.; Hong, Z.; Wang, Y.; Yang, X. MASiNet: Network intrusion detection for IoT security based on meta-learning framework. IEEE Internet Things J. 2024, 11, 25136–25146.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
Obamuyide, A.; Vlachos, A. Model-Agnostic Meta-Learning for Relation Classification with Limited Supervision. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; pp. 5873–5879.
Snell, J.; Swersky, K.; Zemel, R. Prototypical Networks for Few-Shot Learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 4080–4090.
Wang, P.; Song, Y.F.; Wang, X.; Guo, X.; Xiang, Q. ImagTIDS: An Internet of Things Intrusion Detection Framework Utilizing GADF Imaging Encoding and Improved Transformer. Complex Intell. Syst. 2025, 11, 93.
Wu, Z.; Zhang, H.; Wang, P.; Sun, Z. RTIDS: A robust transformer-based approach for intrusion detection systems. IEEE Access 2022, 10, 64375–64387.
Abdelmoumin, G.; Rawat, D.B.; Rahman, A. On the performance of machine learning models for anomaly-based intelligent intrusion detection systems for the Internet of Things. IEEE Internet Things J. 2022, 9, 4280–4290.
Abdel-Basset, M.; Moustafa, N.; Hawash, H.; Razzak, I.; Sallam, K.M.; Elkomy, O.M. Federated intrusion detection in blockchain-based smart transportation systems. IEEE Trans. Intell. Transp. Syst. 2022, 23, 2523–2537.
Zeeshan, M.; Riaz, Q.; Bilal, M.A.; Shahzad, M.K.; Jabeen, H.; Haider, S.A. Protocol-based deep intrusion detection for DoS and DDoS attacks using UNSW-NB15 and Bot-IoT data-sets. IEEE Access 2022, 10, 2269–2283.

Figure 1. Prototypical Networks. The red/blue/purple/green dots represent different sample categories, and the boxes indicate the prototype regions of each category.

Figure 2. Model framework.

Figure 3. ToN-IoT Dataset Class Distribution.

Figure 4. Malicious_TLS Dataset Class Distribution.

Figure 5. Experimental results of Malicious_TLS dataset ablation (%).

Table 1. Model hyperparameter combinations.

Hyperparameters	Values
Learning rate	0.001
Number of iterations	200
Num_episode	600
Guided classification loss weight	1.0, 1.5
Image size	28
Embedding dimension, integrating dimension	64, 64

Table 2. Performance Comparison on the Malicious_TLS Dataset (%).

Model	5way_1shot				5way_5shot				5way_10shot
Model	Acc	Pr	Rc	F1	Acc	Pr	Rc	F1	Acc	Pr	Rc	F1
MAML	88.94	87.42	88.94	88.05	87.70	85.35	87.70	86.35	90.81	88.75	90.81	89.65
Reptile	76.63	76.28	76.63	76.51	79.40	77.74	79.40	78.56	79.67	77.86	79.67	78.76
MatchingNet	96.69	96.74	96.69	96.72	97.28	97.19	97.28	97.22	97.45	97.40	97.45	97.41
ProtoNet	96.69	96.76	96.69	96.73	97.28	97.36	97.28	97.31	97.51	97.52	97.21	97.36
Ours	97.17	97.14	96.09	96.61	97.34	97.56	97.34	97.45	97.72	97.89	97.72	97.79

Table 3. Performance Comparison on the ToN_IoT Dataset (%).

Model	5way_1shot				5way_5shot				5way_10shot
Model	Acc	Pr	Rc	F1	Acc	Pr	Rc	F1	Acc	Pr	Rc	F1
MAML	84.45	85.76	84.45	85.00	90.12	91.28	90.12	90.67	91.45	92.58	91.45	92.00
Reptile	56.60	55.35	56.60	55.86	67.17	65.86	67.17	66.53	69.40	69.21	69.40	69.30
MatchingNet	76.71	74.90	76.71	75.62	92.72	92.45	92.72	92.54	92.95	93.63	92.95	93.26
ProtoNet	84.59	86.09	84.59	85.27	92.20	93.41	92.20	92.79	93.36	94.41	93.36	93.88
Ours	92.37	92.84	92.37	92.57	96.56	96.60	96.56	96.56	97.81	97.95	97.81	97.88

Table 4. Results of Comparison Experiments (%).

Model	Dataset	Number of Categories	Feature Type	Number of Samples	Classification Setting	Acc%
Aggregation model [17]	CIC-DDoS2019	12	Statistic + Multiple DL models	30,480,823	2-way	98.58
FED-IDS [18]	ToN-IoT	9	Statistic + CNN	521,050	2-way	94.85
PB-DID [19]	UNSW-NB15 and BoT-IoT	3	Statistic + DNN + LSTM	6,654,000	3-way	96.3
MTH-IDS [20]	NSL-KDD	5	Statistic + DNN	184,603	2-way	96.2
	CIC-IDS2017	15		823,951	2-way	96.5
MAML + CNN [11]	FSIDS-IoT	78	CNN	1260 + 1	5-way 1-shot	73.81
				1260 + 5	5-way 5-shot	89.64
				1260 + 10	5-way 10-shot	92.19
Ours	Malicious_TLS	22	CNN	6433 + 2750	5-way 1-shot	97.17
					5-way 5-shot	97.34
					5-way 10-shot	97.72
Ours	ToN-IoT	9	CNN	32,267 + 13,828	5-way 1-shot	92.37
					5-way 5-shot	96.56
					5-way 10-shot	97.81

Table 5. Results of Ablation Experiments (%).

No.	Different Methods				SRGA	Accuracy
No.	GASF	GADF	MTF	RP	SRGA	5way_1shot	5way_5shot
1		✓				90.15	91.98
2	✓				✓	96.96	95.43
3		✓			✓	97.17	97.34
4			✓		✓	96.28	96.56
5				✓	✓	95.7	91.98

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wu, X.; Wang, P.; Song, Y.; Wang, X.; Chai, J. Few-Shot Learning for Malicious Traffic Detection with Sample Relevance Guided Attention. Electronics 2025, 14, 4717. https://doi.org/10.3390/electronics14234717
