1. Introduction
Target detection models based on deep learning require a large number of labeled samples for training. When samples are scarce or their annotations are difficult to obtain, mainstream target detection algorithms struggle to achieve satisfactory results. Therefore, many scholars have explored few-shot target detection to solve this problem. Few-shot target detection combines traditional target detection with few-shot learning and aims to learn a detection model with good generalization performance from a small number of labeled samples [
1]. At present, the main methods of few-shot target detection can be roughly divided into methods based on transfer learning and methods based on meta-learning.
The core idea of the few-shot target detection method based on transfer learning is as follows: first, pre-train the source domain model on a large-scale base class-labeled dataset; then, fine-tune the model parameters based on a small number of target domain training samples [
2]. Based on transfer learning, Chen et al. [
3] combined the advantages of the single-stage target detection model SSD [
4] (single shot multibox detector) and the two-stage target detection model Faster RCNN [
5] and proposed the low-shot transfer detector (LSTD). LSTD introduces two regularization mechanisms, background suppression and knowledge transfer, so that the model can focus on foreground targets during transfer learning, reduce the impact of semantic confusion on accuracy, and better exploit source-domain knowledge when fine-tuning on a small number of target-domain images. Wang et al. [
6] proposed a two-stage fine-tuning approach (TFA), which freezes all layers before the detection head and fine-tunes only the last layers on the novel classes. This simple training strategy brings significant accuracy improvements. Ke et al. [
7] proposed a generalized feature extraction framework to address the problem that the knowledge learned during base training tends to be biased towards the characteristics of the base-class data, which weakens the learning ability when fine-tuning on novel classes and, given the scarcity of samples, leads to further overfitting. This framework mitigates the impact of variations in target shape and size on overall detection performance and improves the generalization of the base-trained model. In addition, a feature-level data augmentation method based on self-distillation was proposed to further enhance the generalization performance of the model. Experimental results show that the algorithm achieves good results on both the COCO and PASCAL VOC datasets. In order to transfer the general knowledge learned from data-rich base classes to novel classes, Yang et al. [
8] proposed a weight transfer strategy to enable the model to transfer features more effectively, together with an attention-based feature enhancement mechanism to learn more robust target feature representations. In addition, an angle-guided additive margin classifier was introduced to enhance instance-level inter-class separation and intra-class compactness, improving the classification and discrimination ability of the model. Experimental results show that the algorithm outperforms current advanced algorithms on the PASCAL VOC and COCO datasets. Although transfer-learning-based few-shot target detection is simple to train, when samples are extremely scarce it is difficult to accurately characterize the feature distribution of an entire category, so the model suffers from severe overfitting and poor generalization. To overcome this overfitting problem and further improve generalization in few-shot target detection, a meta-learning strategy can be used.
The core idea of the few-shot target detection method based on meta-learning is to transfer prior knowledge from the base classes with rich annotations to the novel classes with scarce data by simulating a series of similar few-shot tasks, so as to cope with the problem of insufficient sample quantity [
9,
10,
11]. Specifically, meta-learning usually divides the training dataset into multiple subtasks, each consisting of a support set and a query set: the support set is used for model training on the task, and the query set is used for evaluation. By iterating over many such tasks, the model learns how to learn effectively from a small number of samples. At present, many scholars have combined meta-learning with different types of target detection models. Kang et al. [
12] combined meta-learning with the single-stage target detection YOLO v2 [
13] and proposed the few-shot object detection via feature reweighting (FSRW) algorithm. In this algorithm, the feature learning mechanism learns generalizable meta-features, the feature reweighting mechanism learns global features for each target category in the support set, and the prediction mechanism predicts the category and bounding box for the query image. This algorithm performs prediction over the entire feature map of the query image. Considering that a query image may contain multiple targets, meta-learning over the entire image is not the best solution. To this end, Yan et al. [
14] combined meta-learning with the two-stage target detection model to design Meta-RCNN. They introduced a predictor-head remodeling network (PRN) to infer class attention vectors for all class targets in the support set images and used them as meta-knowledge to perform channel-wise fusion with the region-of-interest features extracted from the query image by the RPN (region proposal network), finally obtaining the corresponding detection map. Similarly, Du Yunyan et al. [
15] proposed a few-shot target detection algorithm based on Faster RCNN. They reduced the number of irrelevant candidate boxes by improving the RPN module, and then proposed a global–local relationship detector module. By associating the features of a small number of labeled samples and the samples to be detected, they obtained candidate regions that were more relevant to the target category, thereby improving the detection accuracy of novel classes of targets. Chen et al. [
16] addressed the problem that valuable correlation features among different categories are insufficiently exploited, which hinders the generalization of knowledge from base classes to novel classes for target detection. They proposed few-shot target detection via correlation-RPN and transformer encoder–decoder (CRTED), a novel training network that learns object-relevant features of inter-class correlation and intra-class compactness while suppressing target-agnostic background features with limited annotated samples. Li et al. [
17] introduced a simple yet effective proposal distribution calibration (PDC) approach to neatly enhance the localization and classification abilities of the RoI head by recycling its localization ability endowed in base training and enriching high-quality positive samples for semantic fine-tuning.
There are still two potential problems with meta-learning-based methods that hinder the full utilization of base-class knowledge. First, the region-based detection framework relies on region proposals to generate the final prediction, so the detection results are sensitive to low-quality region proposals, and in the few-shot setting it is not easy to generate high-quality region proposals for the limited novel classes. Second, most meta-learning-based strategies use “feature reweighting” or its variants to aggregate query features and support features and can only process one support class (i.e., the target class to be detected) at a time. In this case, the important inter-class correlations between different support classes are largely ignored.
To address the above limitations, Zhang et al. [
18] abandoned region proposals, made full use of the complementary relationship between the classification and regression tasks, combined the recently popular transformer model with meta-learning, and constructed the Meta-DETR framework. The transformer can model long-range dependencies and effectively utilize the contextual information between features. Meta-DETR combines meta-learning with deformable DETR [
19] to perform pure image-level prediction. This framework skips region proposal generation, avoids the problem of low-quality novel-class region proposals, and performs detection directly at the image level. In addition, Meta-DETR introduces an inter-class correlation meta-learning strategy that attends to multiple support classes at once, making full use of inter-class correlation and reducing misclassification between similar classes. Although Meta-DETR solves the above problems, its deformable attention mechanism samples a fixed number of value points for each query, which greatly limits the extraction of information related to the target features.
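For reference, the sketch below gives a simplified, single-head, single-scale version of deformable attention to make the fixed sampling budget explicit: every query predicts exactly `num_points` offsets and attention weights, regardless of how much target-relevant content lies around its reference point. This is an illustrative reimplementation under those simplifying assumptions, not the multi-scale module of deformable DETR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-head, single-scale sketch of deformable attention with a fixed
    number of sampling points per query (illustrative, not the original module)."""
    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points                          # K is fixed for every query
        self.offset_proj = nn.Linear(dim, num_points * 2)     # predicted sampling offsets
        self.weight_proj = nn.Linear(dim, num_points)         # attention weight per sampled point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value, spatial_shape):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; value: (B, H*W, C)
        B, Nq, C = query.shape
        H, W = spatial_shape
        v = self.value_proj(value).transpose(1, 2).reshape(B, C, H, W)
        offsets = self.offset_proj(query).reshape(B, Nq, self.num_points, 2)
        weights = self.weight_proj(query).softmax(dim=-1)                   # (B, Nq, K)
        # Sampling locations: reference point plus offsets, mapped to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(v, loc, align_corners=False)                # (B, C, Nq, K)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1).transpose(1, 2)  # weighted sum over K points
        return self.out_proj(out)
```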
Therefore, this paper proposes a Meta-DETR few-shot target detection algorithm based on adaptive sampling deformable attention. The main work of this paper is as follows:
- (1)
An adaptive sampling deformable attention (ASDA) module is proposed. This module measures the correlation between feature points in deformable attention by computing their cosine similarity and preliminarily screens the feature points against a cosine similarity threshold. The maximum inter-class variance (Otsu) criterion is then used to determine the final number of sampling points for each target feature point, thereby avoiding over- or under-sampling of feature points and achieving accurate sampling (see the code sketch following this list).
- (2)
Combining the ASDA module with the meta-learning-based Meta-DETR framework, a new few-shot target detection algorithm is proposed. This algorithm uses the ASDA module to construct the encoder and decoder and performs feature enhancement on the output of the correlational aggregation module (CAM) in Meta-DETR, ultimately achieving target detection under few-shot conditions.
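The following is a minimal sketch of the adaptive sampling-count decision described in contribution (1), assuming illustrative tensor shapes and names (`query_feats`, `value_feats`, `sim_thresh`). It shows only the cosine-similarity screening and the maximum inter-class variance (Otsu) step, not the full attention module integrated into the encoder and decoder.

```python
import torch
import torch.nn.functional as F

def otsu_threshold(scores: torch.Tensor, bins: int = 32) -> torch.Tensor:
    """Threshold a 1-D score vector by maximizing the inter-class variance (Otsu)."""
    lo, hi = float(scores.min()), float(scores.max())
    if hi - lo < 1e-6:                                   # degenerate case: all scores (nearly) equal
        return scores.min()
    hist = torch.histc(scores, bins=bins, min=lo, max=hi)
    prob = hist / hist.sum()
    centers = torch.linspace(lo, hi, bins, device=scores.device)
    w0 = torch.cumsum(prob, dim=0)                       # weight of the "low" class
    w1 = 1.0 - w0                                        # weight of the "high" class
    cum_mean = torch.cumsum(prob * centers, dim=0)
    mu0 = cum_mean / (w0 + 1e-6)
    mu1 = (cum_mean[-1] - cum_mean) / (w1 + 1e-6)
    inter_var = w0 * w1 * (mu0 - mu1) ** 2               # inter-class variance per candidate split
    return centers[torch.argmax(inter_var)]

def adaptive_sample_counts(query_feats, value_feats, sim_thresh=0.5):
    """Decide how many value points each query should sample.
    query_feats: (Nq, C), value_feats: (Nv, C); shapes and names are illustrative."""
    sim = F.cosine_similarity(query_feats.unsqueeze(1), value_feats.unsqueeze(0), dim=-1)  # (Nq, Nv)
    counts = []
    for row in sim:
        candidates = row[row > sim_thresh]               # preliminary screening by cosine similarity
        if candidates.numel() < 2:
            counts.append(max(int(candidates.numel()), 1))
            continue
        t = otsu_threshold(candidates)                   # refine the cut with maximum inter-class variance
        counts.append(int((candidates >= t).sum()))
    return counts
```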
In the first section, this paper introduces the current research status of few-shot target detection, including few-shot target detection algorithms based on transfer learning and meta-learning, with a focus on algorithms that combine meta-learning with different types of target detection models. The second section introduces the principle of the baseline model used in this paper and the proposed algorithm. In the third section, experiments are conducted on the PASCAL VOC dataset and the self-made infrared aircraft dataset to verify the effectiveness of the proposed algorithm. In the fourth section, the proposed algorithm is summarized and prospects for future work are given.
3. Experiment and Analysis
The hardware environment in this experiment is as follows: the CPU is an Intel(R) Core(TM) i7-14700KF (Intel Corporation, Santa Clara, CA, USA) with 32 GB of RAM, and the GPU is an NVIDIA GeForce RTX 4090D (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of video memory. Software environment: the operating system is Ubuntu 22.04, the deep learning framework is PyTorch 2.1, the programming language is Python 3.8, and GPU acceleration uses CUDA 12.4. Experiment details: the initial learning rate is 2 × 10⁻⁴, the optimizer is AdamW with a weight decay of 1 × 10⁻⁴, the batch size is 4, and the similarity threshold is set to an empirical value determined experimentally. In the base training stage, the model is trained for 50 epochs, with the learning rate decayed by a factor of 0.1 at the 45th epoch. In the fine-tuning stage, the same settings are used and the model is trained until convergence.
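As a reference for the hyperparameters listed above, the sketch below shows how the base-training stage could be configured in PyTorch. The names `model` and `train_loader` are assumed placeholders, and the loop omits the meta-learning episode construction and loss details of the actual implementation.

```python
import torch

def train_base_stage(model, train_loader, epochs=50, decay_epoch=45):
    """Base-training loop with the settings listed above (illustrative sketch).
    `model` is assumed to return a scalar loss when called with (images, targets)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)
    # Decay the learning rate by a factor of 0.1 at the 45th epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[decay_epoch], gamma=0.1)
    for _ in range(epochs):
        for images, targets in train_loader:             # batch size 4 in the experiments
            loss = model(images, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```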
This paper uses the PASCAL VOC dataset and the self-made infrared aircraft dataset for experiments. The PASCAL VOC dataset uses trainval07+12 as training samples and tests on test07. The PASCAL VOC dataset contains a total of 20 categories. This paper’s experiment adopts three different division methods, which are consistent with the division method in Meta-DETR. Each method selects five categories as novel classes, and the other categories are regarded as base classes. The first division method uses birds, buses, cows, motorcycles, and sofas as novel classes. The second division method uses airplanes, bottles, cows, horses, and sofas as novel classes. The third division method uses boats, cats, motorcycles, sheep, and sofas as novel classes.
The self-made infrared aircraft dataset contains a total of 5582 images, all of which were collected from real scenes. The dataset is divided into a training set and a validation set in a ratio of 8:2 using the “cross” method, which are used for model training and validation, respectively. There are six categories of manually labeled targets, namely, back-attitude-fuselage (BAF), back-attitude-tailflame (BAT), lateral-attitude-fuselage (LAF), lateral-attitude-tailflame (LAT), backward-attitude-fuselage (BWF), and backward-attitude-tailflame (BWT); the type distribution is shown in
Table 1. In order to meet the training requirements of meta-learning, one of the six target categories is selected as the novel class, and the remaining five categories are used as base classes. Experiments on this infrared dataset use three different splitting methods, namely Class Split 1, Class Split 2, and Class Split 3. Class Split 1 uses BAT as the novel class and BAF, LAF, LAT, BWT, and BWF as base classes; Class Split 2 uses BAF as the novel class and BAT, LAF, LAT, BWT, and BWF as base classes; Class Split 3 uses LAF as the novel class and BAT, BAF, LAT, BWT, and BWF as base classes. For few-shot target detection, each novel class has
k target instances, and
k is 1, 2, 3, 5, or 10.
3.1. Compared with Advanced Algorithms
In order to evaluate the effectiveness of the proposed algorithm, several representative few-shot target detection algorithms are selected for comparison on the PASCAL VOC dataset and the self-made infrared aircraft dataset, and the results are averaged over multiple training runs in the few-shot fine-tuning stage. The novel class detection results are shown in
Table 2 and
Table 3. The blue font in the table indicates the best result in a column. As can be seen from
Table 2, compared with Meta-DETR, the proposed algorithm improves the detection accuracy of novel classes by 0.9%, 0.7%, 1.4%, and 2.1%, respectively, for shots 1, 2, 3, and 10 in partition 1 on the PASCAL VOC dataset, 3.5%, 0.1%, 5.5%, and 5.7%, respectively, for shots 2, 3, 5, and 10 in partition 2, and 1.9%, 1.0%, 2.1%, and 0.1%, respectively, for shots 2, 3, 5, and 10 in partition 3. In addition, compared with MPF-Net, CRK-Net, and FSCE, the proposed algorithm achieves superior performance under most shot settings across all three partitions. Compared with CRK-Net, the proposed algorithm achieves accuracy improvements of 0.4%, 5.1%, 8.5%, 1.3%, and 0.2% for shots 1, 2, 3, 5, and 10 in partition 1, 3.0%, 0.5%, 4.2%, and 6.0% for shots 2, 3, 5, and 10 in partition 2, and 4.4%, 9.1%, 7.8%, and 3.3% for shots 2, 3, 5, and 10 in partition 3. Compared with MPF-Net, the proposed algorithm achieves improvements of 3.3% for shot 3 in partition 1, 0.4%, 4.9%, and 4.6% for shots 2, 5, and 10 in partition 2, and 2.1%, 5.5%, 4.9%, and 3.6% for shots 2, 3, 5, and 10 in partition 3.
From
Table 3, compared with Meta-DETR, the proposed algorithm improves the detection accuracy of novel classes by 0.6%, 1.9%, 2.7%, 0.8%, and 0.4%, respectively, for shots 1, 2, 3, 5, and 10 in partition 1, 2.9%, 9.6%, 11.2%, 5.4%, and 10.5%, respectively, for shots 1, 2, 3, 5, and 10 in partition 2, and 0.5% and 2.7%, respectively, for shots 3 and 10 in partition 3. In addition, compared with CMESOPA, CME, Meta R-CNN, and other existing methods, the proposed algorithm achieves the best performance under most shot settings across all three partitions. Compared with CMESOPA, the proposed algorithm improves the detection accuracy of novel classes by 1.9%, 7.0%, 4.9%, 0.7%, and 9.4% for shots 1, 2, 3, 5, and 10 in partition 1, 5.8%, 11.5%, 11.0%, 10.7%, and 15.0% for shots 1, 2, 3, 5, and 10 in partition 2, and 1.3%, 1.1%, and 5.8% for shots 2, 3, and 10 in partition 3.
Table 4 presents the detection results of the proposed algorithm compared with Meta-DETR, FSCE, MPSR, and other methods on the base classes of Class Split 1 in the PASCAL VOC dataset. It can be seen that, compared with Meta-DETR, the proposed algorithm not only achieves superior detection accuracy for novel classes under limited training samples but also improves the detection performance on most base classes. Specifically, the detection accuracy of base classes is improved by 0.3%, 0.6%, and 0.5% under 1, 3, and 10 shots, respectively. Compared with FSCE, MPSR, TFA, and other methods, the proposed algorithm also achieves competitive performance on base class detection.
3.2. Comparative Experiments on Similarity Measurement Methods
In order to verify the superiority of using cosine similarity to measure the similarity between feature vectors, the proposed algorithm is evaluated using various similarity metrics on the 10-shot setting of Class Split 1 in the PASCAL VOC dataset. As shown in
Table 5, compared with Euclidean distance, Pearson distance, Manhattan distance, and Chebyshev distance, the detection accuracy for novel classes is improved by 0.8%, 0.5%, 2.2%, and 1.5%, respectively, when cosine similarity is used. These results demonstrate that cosine similarity can more effectively capture feature similarity and accurately determine the number of samples for deformable attention, thereby enhancing detection performance under few-shot conditions.
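For clarity, the metrics compared in Table 5 can be written as follows for a pair of feature vectors. The helper below is an illustrative sketch only: distances are negated so that larger values always mean "more similar", and the Pearson entry uses the correlation rather than the distance form.

```python
import torch
import torch.nn.functional as F

def similarity_metrics(x: torch.Tensor, y: torch.Tensor) -> dict:
    """Similarity between two 1-D feature vectors under the metrics of Table 5."""
    return {
        "cosine":    F.cosine_similarity(x, y, dim=0).item(),
        "euclidean": -torch.dist(x, y, p=2).item(),        # L2 distance, negated
        "manhattan": -torch.dist(x, y, p=1).item(),        # L1 distance, negated
        "chebyshev": -(x - y).abs().max().item(),          # L-infinity distance, negated
        # Pearson correlation = cosine similarity of the mean-centered vectors.
        "pearson":   F.cosine_similarity(x - x.mean(), y - y.mean(), dim=0).item(),
    }

# Example: cosine similarity is invariant to the overall feature magnitude,
# which is one reason it suits comparing deep feature vectors.
# similarity_metrics(torch.randn(256), torch.randn(256))
```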
3.3. Comparison Experiment of Model Parameter Quantity and Inference Speed
To further validate the feasibility of the proposed algorithm, we conducted comparative experiments on model parameter quantity and inference speed between Meta-DETR and the proposed algorithm. The experimental results are shown in
Table 6. As can be seen from the data in
Table 6, the proposed algorithm has slightly more parameters than Meta-DETR. This slight increase is likely due to the additional parameters used when calculating the maximum inter-class variance. In terms of inference speed, the proposed algorithm exhibits a certain advantage, because adaptive sampling deformable attention uses fewer sampling points when computing attention over background regions, thereby improving computational speed. Overall, the proposed algorithm improves inference speed with only a slight increase in the number of parameters, demonstrating that it optimizes operational efficiency while maintaining model performance and further validating its usability and superiority in practical applications.
3.4. Loss Curve
In order to further demonstrate the advantages of the proposed algorithm, the loss curve comparison diagrams of Meta-DETR and the proposed algorithm are plotted on the PASCAL VOC dataset and the self-made infrared aircraft dataset, as shown in
Figure 4 and
Figure 5.
Figure 4 is a comparison of loss curves on the PASCAL VOC dataset, and
Figure 5 is a comparison of loss curves on the self-made infrared aircraft dataset. In
Figure 4 and
Figure 5, (a) compares the base training loss curves, and (b) compares the fine-tuning loss curves. The blue line is the loss curve of the Meta-DETR algorithm, and the red line is the loss curve of the proposed algorithm. The left side shows the loss curve over the complete training run, and the right side shows a partial enlargement of the loss curve over selected epochs. It can be seen from the figures that, in both the base training stage and the fine-tuning stage, the loss of the proposed algorithm in the later epochs is lower than that of Meta-DETR, which shows that the proposed algorithm has better generalization performance.
3.5. Visual Analysis
3.5.1. The Visual Results of Deformable Attention
In order to more clearly demonstrate which regions the network model attends to, this paper uses Eigen-CAM [31] to visualize the features learned by the model and to observe how the attention regions of the encoder output layer change when the detection algorithm uses fixed sampling versus adaptive sampling. In the visualization results, different colors represent the degree of attention the algorithm pays to different areas of the image: red indicates the areas receiving the most attention, followed by yellow, green, and blue in decreasing order, as shown in
Figure 6. In each sub-figure, the left column in the figure shows the heat map visualization results of Meta-DETR, and the right column shows the heat map visualization results of the algorithm in this paper. Observing
Figure 6a, we can see that Meta-DETR pays more attention to almost the entire image without focusing on any particular area of interest, while the proposed algorithm focuses on the car itself and pays less attention to the background. Observing
Figure 6b, we can see that Meta-DETR pays less attention to the potted plant area, while the proposed algorithm pays more attention to the potted plant itself. Observing
Figure 6c, we can see that Meta-DETR pays more attention to the background area, while the proposed algorithm only pays attention to the sheep itself and ignores the background area. Observing
Figure 6d, we can see that Meta-DETR pays too much attention to the background part, while the proposed algorithm focuses on the person. Observing
Figure 6e, we can see that Meta-DETR focuses on the large background area, while the proposed algorithm only pays attention to the chair object. Observing
Figure 6f, we can see that Meta-DETR also pays attention to the dog itself, but also pays equal attention to the background, while the proposed algorithm only pays attention to the dog itself and ignores the background area. Observing
Figure 6g, we can see that Meta-DETR pays attention to almost the entire image but pays less attention to a key area of the foreground train, while the proposed algorithm pays more attention to the train target. Observing
Figure 6h, we can see that Meta-DETR pays uniform attention to the entire image, while the proposed algorithm only pays attention to the TV. Observing
Figure 6i, we can see that Meta-DETR pays uniform attention to the entire image, while the proposed algorithm pays attention not only to the sofa but also to the potted plant in the upper right corner. This shows that the proposed algorithm can still focus on the targets themselves under conditions of multiple targets and complex backgrounds.
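As background for the visualization above, Eigen-CAM projects the activations of a chosen layer onto their first principal component to obtain a class-agnostic heat map. The sketch below is a minimal, self-contained reimplementation for a single feature map; it is not the exact visualization code used in these experiments.

```python
import torch

def eigen_cam(feature_map: torch.Tensor) -> torch.Tensor:
    """Eigen-CAM heat map for a single feature map of shape (C, H, W):
    project the activations onto their first principal component."""
    c, h, w = feature_map.shape
    acts = feature_map.reshape(c, h * w).t()              # (H*W, C) activation matrix
    acts = acts - acts.mean(dim=0, keepdim=True)          # center before the decomposition
    _, _, vh = torch.linalg.svd(acts, full_matrices=False)
    cam = acts @ vh[0]                                    # projection onto the first right singular vector
    if cam.sum() < 0:                                     # SVD sign is arbitrary; keep activations positive
        cam = -cam
    cam = torch.relu(cam).reshape(h, w)
    return cam / (cam.max() + 1e-6)                       # normalize to [0, 1] for display
```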
3.5.2. Test Visualization Results
To further verify the detection performance of the proposed algorithm, Meta-DETR and the proposed algorithm are used to make predictions on the PASCAL VOC dataset and the self-made infrared aircraft dataset. Some representative images from different categories of the PASCAL VOC dataset are selected for visual analysis, as shown in
Figure 7. In each sub-figure, the left column in the figure shows the detection results of Meta-DETR, and the right column shows the detection results of the proposed algorithm. In
Figure 7a, for occluded targets, Meta-DETR misses the occluded car on the far right, while our algorithm detects it. In
Figure 7b, for small targets, Meta-DETR detects the two cars in the lower left corner as one car, and the detection position is inaccurate, while our algorithm detects the two cars in the lower left corner separately. In
Figure 7c, for overlapping targets, Meta-DETR misses the person on the motorcycle and the person occluded by the sand, while our algorithm not only detects the overlapping person and the motorcycle but also detects the person buried in the sand. In
Figure 7d, for blurred targets, Meta-DETR misses the bicycle on the track, while the proposed algorithm detects it; in
Figure 7e, for large targets, although both Meta-DETR and the proposed algorithm detect the target, the detection confidence of the proposed algorithm is higher. From the comparison, it can be seen that Meta-DETR has difficulty in detecting occluded targets (
Figure 7a), small targets (
Figure 7b), overlapping targets (
Figure 7c), blurred targets (
Figure 7d), and large targets (
Figure 7e), while the performance of the proposed algorithm is relatively robust.
For the self-made infrared aircraft dataset, some representative images of different postures and different numbers of aircraft are selected for visualization analysis, as shown in
Figure 8. In each sub-figure, the left side of the figure shows the detection results of Meta-DETR, and the right side shows the detection results of the algorithm in this paper. For the single-aircraft LAF + LAT (
Figure 8a), Meta-DETR failed to identify both the LAF and LAT, while the proposed algorithm accurately detected both with high confidence. For the multi-aircraft LAF + LAT (
Figure 8b), Meta-DETR missed the LAF of the two leftmost aircraft, while the proposed algorithm detected all targets in the image without any omissions. For the single-aircraft BAF + BAT (
Figure 8c), Meta-DETR only detected the BAF and missed the BAT, while the proposed algorithm not only detected the BAF but also the BAT. For the multi-aircraft BAF + BAT (
Figure 8d), compared with Meta-DETR, the proposed algorithm additionally detected the BAT of the two aircraft at the bottom of the image. For the multi-aircraft BWT (
Figure 8e), Meta-DETR missed the second BWT on the right, while the proposed algorithm detected all targets without any omissions. From the comparison, it can be seen that Meta-DETR has difficulty in detecting aircraft in different postures and numbers, while the proposed algorithm is relatively more robust.