1. Introduction
Target detection models based on deep learning require a large number of labeled samples for training. When samples are scarce or their annotations are difficult to obtain, mainstream target detection algorithms struggle to achieve satisfactory results. Therefore, many scholars have explored few-shot target detection to solve this problem. Few-shot target detection combines traditional target detection with few-shot learning and aims to learn a detection model with good generalization performance from a small number of labeled samples [
1]. At present, the main methods of few-shot target detection can be roughly divided into methods based on transfer learning and methods based on meta-learning.
The core idea of the few-shot target detection method based on transfer learning is as follows: first, pre-train the source domain model on a large-scale base class-labeled dataset; then, fine-tune the model parameters based on a small number of target domain training samples [
2]. Based on transfer learning, Chen et al. [
3] combined the advantages of the single-stage target detection model SSD [
4] (single shot multibox detector) and the two-stage target detection model Faster RCNN [
5] and proposed the low-shot transfer detector (LSTD). LSTD introduces two regularization mechanisms, background suppression and knowledge transfer, so that the model can focus on foreground targets during transfer learning, reduce the impact of semantic confusion on accuracy, and better exploit source-domain knowledge when fine-tuning on a small number of target-domain images. Wang et al. [
6] proposed a two-stage fine-tuning approach (TFA), which freezes all layers before the detection head and fine-tunes only the last layers on the novel classes. This simple training strategy brings significant accuracy improvements. Ke et al. [
7] proposed a generalized feature extraction framework to address the problem that the knowledge learned during base training tends to be biased towards the characteristics of the base-class data, which weakens the learning ability when fine-tuning on novel classes and, given the scarcity of samples, leads to further overfitting. This framework mitigates the impact of variations in target shape and size on overall detection performance and improves the generalization of the base-trained model. In addition, a feature-level data augmentation method based on self-distillation was proposed to further enhance the generalization performance of the model. Experimental results show that the algorithm achieves good results on both the COCO and PASCAL VOC datasets. In order to transfer the general knowledge learned from data-rich base classes to novel classes, Yang et al. [
8] proposed a weight transfer strategy to enable the model to transfer features more effectively, together with an attention-based feature enhancement mechanism to learn more robust target feature representations. In addition, an angle-guided additive margin classifier was introduced to enhance instance-level inter-class separation and intra-class compactness, improving the classification and discrimination ability of the model. Experimental results show that the algorithm outperforms current advanced algorithms on the PASCAL VOC and COCO datasets. Although transfer-learning-based few-shot target detection is simple to train, when samples are extremely scarce it is difficult to accurately characterize the feature distribution of an entire category, so the model suffers from severe overfitting and poor generalization. To overcome this overfitting problem and further improve generalization in few-shot target detection, a meta-learning strategy can be used.
The core idea of the few-shot target detection method based on meta-learning is to transfer prior knowledge from the base classes with rich annotations to the novel classes with scarce data by simulating a series of similar few-shot tasks, so as to cope with the problem of insufficient sample quantity [
9,
10,
11]. Specifically, meta-learning usually divides the training dataset into multiple subtasks, each consisting of a support set and a query set: the support set is used for model training on the task, and the query set is used for evaluation. By iterating over many such tasks, the model learns how to learn effectively from a small number of samples. At present, many scholars have combined meta-learning with different types of target detection models. Kang et al. [
12] combined meta-learning with the single-stage target detection YOLO v2 [
13] and proposed the few-shot object detection via feature reweighting (FSRW) algorithm. In this algorithm, the feature learning mechanism learns generalizable meta-features, the feature reweighting mechanism learns global features for each target category in the support set, and the prediction mechanism predicts the category and bounding box for the query image. This algorithm performs prediction over the entire feature map of the query image. Considering that a query image may contain multiple targets, meta-learning over the entire image is not the best solution. To this end, Yan et al. [
14] combined meta-learning with the two-stage target detection model to design Meta-RCNN. They introduced a predictor-head remodeling network (PRN) to infer class attention vectors for all class targets in the support set images and used them as meta-knowledge to perform channel-wise fusion with the region-of-interest features extracted from the query image by the RPN (region proposal network), finally obtaining the corresponding detection map. Similarly, Du Yunyan et al. [
15] proposed a few-shot target detection algorithm based on Faster RCNN. They reduced the number of irrelevant candidate boxes by improving the RPN module, and then proposed a global–local relationship detector module. By associating the features of a small number of labeled samples and the samples to be detected, they obtained candidate regions that were more relevant to the target category, thereby improving the detection accuracy of novel classes of targets. Chen et al. [
16] addressed the problem that valuable correlation features among different categories are insufficiently exploited, which hinders the generalization of knowledge from base classes to novel classes for target detection. They proposed few-shot target detection via correlation-RPN and transformer encoder–decoder (CRTED), a novel training network that learns object-relevant features of inter-class correlation and intra-class compactness while suppressing target-agnostic background features with limited annotated samples. Li et al. [
17] introduced a simple yet effective proposal distribution calibration (PDC) approach to neatly enhance the localization and classification abilities of the RoI head by recycling its localization ability endowed in base training and enriching high-quality positive samples for semantic fine-tuning.
There are still two potential problems with meta-learning-based methods that hinder the full utilization of base-class knowledge. First, the region-based detection framework relies on region proposals to generate the final prediction, so the detection results are sensitive to low-quality region proposals, and in the few-shot setting it is not easy to generate high-quality region proposals for the limited novel classes. Second, most meta-learning-based strategies use “feature reweighting” or its variants to aggregate query features and support features and can only process one support class (i.e., the target class to be detected) at a time. In this case, the important inter-class correlations between different support classes are largely ignored.
To address the above limitations, Zhang et al. [
18] abandoned region proposals, made full use of the complementary relationship between the classification and regression tasks, combined the recently popular transformer model with meta-learning, and constructed the Meta-DETR framework. The transformer can model long-range dependencies and effectively utilize the contextual information between features. Meta-DETR combines meta-learning with deformable DETR [
19] to perform pure image-level prediction. This framework skips region proposal generation, avoids the problem of low-quality novel-class region proposals, and performs detection directly at the image level. In addition, Meta-DETR introduces an inter-class correlation meta-learning strategy that attends to multiple support classes at once, making full use of inter-class correlation and reducing misclassification between similar classes. Although Meta-DETR solves the above problems, its deformable attention mechanism samples a fixed number of value points for each query, which greatly limits the extraction of information related to the target features.
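For reference, the sketch below gives a simplified, single-head, single-scale version of deformable attention to make the fixed sampling budget explicit: every query predicts exactly `num_points` offsets and attention weights, regardless of how much target-relevant content lies around its reference point. This is an illustrative reimplementation under those simplifying assumptions, not the multi-scale module of deformable DETR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-head, single-scale sketch of deformable attention with a fixed
    number of sampling points per query (illustrative, not the original module)."""
    def __init__(self, dim, num_points=4):
        super().__init__()
        self.num_points = num_points                          # K is fixed for every query
        self.offset_proj = nn.Linear(dim, num_points * 2)     # predicted sampling offsets
        self.weight_proj = nn.Linear(dim, num_points)         # attention weight per sampled point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, query, ref_points, value, spatial_shape):
        # query: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; value: (B, H*W, C)
        B, Nq, C = query.shape
        H, W = spatial_shape
        v = self.value_proj(value).transpose(1, 2).reshape(B, C, H, W)
        offsets = self.offset_proj(query).reshape(B, Nq, self.num_points, 2)
        weights = self.weight_proj(query).softmax(dim=-1)                   # (B, Nq, K)
        # Sampling locations: reference point plus offsets, mapped to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(v, loc, align_corners=False)                # (B, C, Nq, K)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1).transpose(1, 2)  # weighted sum over K points
        return self.out_proj(out)
```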
Therefore, this paper proposes a Meta-DETR few-shot target detection algorithm based on adaptive sampling deformable attention. The main work of this paper is as follows:
- (1)
An adaptive sampling deformable attention (ASDA) module is proposed. This module measures the correlation between feature points in deformable attention by computing their cosine similarity and preliminarily screens the feature points against a cosine similarity threshold. The maximum inter-class variance (Otsu) criterion is then used to determine the final number of sampling points for each target feature point, thereby avoiding over- or under-sampling of feature points and achieving accurate sampling (see the code sketch following this list).
- (2)
Combining the ASDA module with the meta-learning-based Meta-DETR framework, a new few-shot target detection algorithm is proposed. This algorithm uses the ASDA module to construct the encoder and decoder and performs feature enhancement on the output of the correlational aggregation module (CAM) in Meta-DETR, ultimately achieving target detection under few-shot conditions.
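The following is a minimal sketch of the adaptive sampling-count decision described in contribution (1), assuming illustrative tensor shapes and names (`query_feats`, `value_feats`, `sim_thresh`). It shows only the cosine-similarity screening and the maximum inter-class variance (Otsu) step, not the full attention module integrated into the encoder and decoder.

```python
import torch
import torch.nn.functional as F

def otsu_threshold(scores: torch.Tensor, bins: int = 32) -> torch.Tensor:
    """Threshold a 1-D score vector by maximizing the inter-class variance (Otsu)."""
    lo, hi = float(scores.min()), float(scores.max())
    if hi - lo < 1e-6:                                   # degenerate case: all scores (nearly) equal
        return scores.min()
    hist = torch.histc(scores, bins=bins, min=lo, max=hi)
    prob = hist / hist.sum()
    centers = torch.linspace(lo, hi, bins, device=scores.device)
    w0 = torch.cumsum(prob, dim=0)                       # weight of the "low" class
    w1 = 1.0 - w0                                        # weight of the "high" class
    cum_mean = torch.cumsum(prob * centers, dim=0)
    mu0 = cum_mean / (w0 + 1e-6)
    mu1 = (cum_mean[-1] - cum_mean) / (w1 + 1e-6)
    inter_var = w0 * w1 * (mu0 - mu1) ** 2               # inter-class variance per candidate split
    return centers[torch.argmax(inter_var)]

def adaptive_sample_counts(query_feats, value_feats, sim_thresh=0.5):
    """Decide how many value points each query should sample.
    query_feats: (Nq, C), value_feats: (Nv, C); shapes and names are illustrative."""
    sim = F.cosine_similarity(query_feats.unsqueeze(1), value_feats.unsqueeze(0), dim=-1)  # (Nq, Nv)
    counts = []
    for row in sim:
        candidates = row[row > sim_thresh]               # preliminary screening by cosine similarity
        if candidates.numel() < 2:
            counts.append(max(int(candidates.numel()), 1))
            continue
        t = otsu_threshold(candidates)                   # refine the cut with maximum inter-class variance
        counts.append(int((candidates >= t).sum()))
    return counts
```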
In the first section, this paper introduces the current research status of few-shot target detection, including few-shot target detection algorithms based on transfer learning and meta-learning, with a focus on algorithms that combine meta-learning with different types of target detection models. The second section introduces the principle of the baseline model used in this paper and the proposed algorithm. In the third section, experiments are conducted on the PASCAL VOC dataset and the self-made infrared aircraft dataset to verify the effectiveness of the proposed algorithm. In the fourth section, the proposed algorithm is summarized and prospects for future work are given.
3. Experiment and Analysis
The hardware environment in this experiment is as follows: the CPU is an Intel(R) Core(TM) i7-14700KF (Intel Corporation, Santa Clara, CA, USA) with 32 GB of RAM, and the GPU is an NVIDIA GeForce RTX 4090D (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB of video memory. Software environment: the operating system is Ubuntu 22.04, the deep learning framework is PyTorch 2.1, the programming language is Python 3.8, and GPU acceleration uses CUDA 12.4. Experiment details: the initial learning rate is 2 × 10⁻⁴, the optimizer is AdamW with a weight decay of 1 × 10⁻⁴, the batch size is 4, and the similarity threshold is set to an empirical value determined experimentally. In the base training stage, the model is trained for 50 epochs, with the learning rate decayed by a factor of 0.1 at the 45th epoch. In the fine-tuning stage, the same settings are used and the model is trained until convergence.
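As a reference for the hyperparameters listed above, the sketch below shows how the base-training stage could be configured in PyTorch. The names `model` and `train_loader` are assumed placeholders, and the loop omits the meta-learning episode construction and loss details of the actual implementation.

```python
import torch

def train_base_stage(model, train_loader, epochs=50, decay_epoch=45):
    """Base-training loop with the settings listed above (illustrative sketch).
    `model` is assumed to return a scalar loss when called with (images, targets)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)
    # Decay the learning rate by a factor of 0.1 at the 45th epoch.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[decay_epoch], gamma=0.1)
    for _ in range(epochs):
        for images, targets in train_loader:             # batch size 4 in the experiments
            loss = model(images, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```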
This paper uses the PASCAL VOC dataset and the self-made infrared aircraft dataset for experiments. The PASCAL VOC dataset uses trainval07+12 as training samples and tests on test07. The PASCAL VOC dataset contains a total of 20 categories. This paper’s experiment adopts three different division methods, which are consistent with the division method in Meta-DETR. Each method selects five categories as novel classes, and the other categories are regarded as base classes. The first division method uses birds, buses, cows, motorcycles, and sofas as novel classes. The second division method uses airplanes, bottles, cows, horses, and sofas as novel classes. The third division method uses boats, cats, motorcycles, sheep, and sofas as novel classes.
The self-made infrared aircraft dataset contains a total of 5582 images, all of which were collected from real scenes. The dataset is divided into a training set and a validation set in a ratio of 8:2 using the “cross” method, which are used for model training and validation, respectively. There are six categories of manually labeled targets, namely, back-attitude-fuselage (BAF), back-attitude-tailflame (BAT), lateral-attitude-fuselage (LAF), lateral-attitude-tailflame (LAT), backward-attitude-fuselage (BWF), and backward-attitude-tailflame (BWT); the type distribution is shown in
Table 1. In order to meet the training requirements of meta-learning, one of the six target categories is selected as the novel class, and the remaining five categories are used as base classes. Experiments on this infrared dataset use three different splitting methods, namely Class Split 1, Class Split 2, and Class Split 3. Class Split 1 uses BAT as the novel class and BAF, LAF, LAT, BWT, and BWF as base classes; Class Split 2 uses BAF as the novel class and BAT, LAF, LAT, BWT, and BWF as base classes; Class Split 3 uses LAF as the novel class and BAT, BAF, LAT, BWT, and BWF as base classes. For few-shot target detection, each novel class has
k target instances, and
k is 1, 2, 3, 5, or 10.
3.1. Compared with Advanced Algorithms
In order to evaluate the effectiveness of the proposed algorithm, several representative few-shot target detection algorithms are selected for comparison on the PASCAL VOC dataset and the self-made infrared aircraft dataset, and the results are averaged over multiple training runs in the few-shot fine-tuning stage. The novel class detection results are shown in
Table 2 and
Table 3. The blue font in the table indicates the best result in a column. As can be seen from
Table 2, compared with Meta-DETR, the proposed algorithm improves the detection accuracy of novel classes by 0.9%, 0.7%, 1.4%, and 2.1%, respectively, for shots 1, 2, 3, and 10 in partition 1 on the PASCAL VOC dataset, 3.5%, 0.1%, 5.5%, and 5.7%, respectively, for shots 2, 3, 5, and 10 in partition 2, and 1.9%, 1.0%, 2.1%, and 0.1%, respectively, for shots 2, 3, 5, and 10 in partition 3. In addition, compared with MPF-Net, CRK-Net, and FSCE, the proposed algorithm achieves superior performance under most shot settings across all three partitions. Compared with CRK-Net, the proposed algorithm achieves accuracy improvements of 0.4%, 5.1%, 8.5%, 1.3%, and 0.2% for shots 1, 2, 3, 5, and 10 in partition 1, 3.0%, 0.5%, 4.2%, and 6.0% for shots 2, 3, 5, and 10 in partition 2, and 4.4%, 9.1%, 7.8%, and 3.3% for shots 2, 3, 5, and 10 in partition 3. Compared with MPF-Net, the proposed algorithm achieves improvements of 3.3% for shot 3 in partition 1, 0.4%, 4.9%, and 4.6% for shots 2, 5, and 10 in partition 2, and 2.1%, 5.5%, 4.9%, and 3.6% for shots 2, 3, 5, and 10 in partition 3.
From
Table 3, compared with Meta-DETR, the proposed algorithm improves the detection accuracy of novel classes by 0.6%, 1.9%, 2.7%, 0.8%, and 0.4%, respectively, for shots 1, 2, 3, 5, and 10 in partition 1, 2.9%, 9.6%, 11.2%, 5.4%, and 10.5%, respectively, for shots 1, 2, 3, 5, and 10 in partition 2, and 0.5% and 2.7%, respectively, for shots 3 and 10 in partition 3. In addition, compared with CMESOPA, CME, Meta R-CNN, and other existing methods, the proposed algorithm achieves the best performance under most shot settings across all three partitions. Compared with CMESOPA, the proposed algorithm improves the detection accuracy of novel classes by 1.9%, 7.0%, 4.9%, 0.7%, and 9.4% for shots 1, 2, 3, 5, and 10 in partition 1, 5.8%, 11.5%, 11.0%, 10.7%, and 15.0% for shots 1, 2, 3, 5, and 10 in partition 2, and 1.3%, 1.1%, and 5.8% for shots 2, 3, and 10 in partition 3.
Table 4 presents the detection results of the proposed algorithm compared with Meta-DETR, FSCE, MPSR, and other methods on the base classes of Class Split 1 in the PASCAL VOC dataset. It can be seen that, compared with Meta-DETR, the proposed algorithm not only achieves superior detection accuracy for novel classes under limited training samples but also improves the detection performance on most base classes. Specifically, the detection accuracy of base classes is improved by 0.3%, 0.6%, and 0.5% under 1, 3, and 10 shots, respectively. Compared with FSCE, MPSR, TFA, and other methods, the proposed algorithm also achieves competitive performance on base class detection.
3.2. Comparative Experiments on Similarity Measurement Methods
In order to verify the superiority of using cosine similarity to measure the similarity between feature vectors, the proposed algorithm is evaluated using various similarity metrics on the 10-shot setting of Class Split 1 in the PASCAL VOC dataset. As shown in
Table 5, compared with Euclidean distance, Pearson distance, Manhattan distance, and Chebyshev distance, the detection accuracy for novel classes is improved by 0.8%, 0.5%, 2.2%, and 1.5%, respectively, when cosine similarity is used. These results demonstrate that cosine similarity can more effectively capture feature similarity and accurately determine the number of samples for deformable attention, thereby enhancing detection performance under few-shot conditions.
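For clarity, the metrics compared in Table 5 can be written as follows for a pair of feature vectors. The helper below is an illustrative sketch only: distances are negated so that larger values always mean "more similar", and the Pearson entry uses the correlation rather than the distance form.

```python
import torch
import torch.nn.functional as F

def similarity_metrics(x: torch.Tensor, y: torch.Tensor) -> dict:
    """Similarity between two 1-D feature vectors under the metrics of Table 5."""
    return {
        "cosine":    F.cosine_similarity(x, y, dim=0).item(),
        "euclidean": -torch.dist(x, y, p=2).item(),        # L2 distance, negated
        "manhattan": -torch.dist(x, y, p=1).item(),        # L1 distance, negated
        "chebyshev": -(x - y).abs().max().item(),          # L-infinity distance, negated
        # Pearson correlation = cosine similarity of the mean-centered vectors.
        "pearson":   F.cosine_similarity(x - x.mean(), y - y.mean(), dim=0).item(),
    }

# Example: cosine similarity is invariant to the overall feature magnitude,
# which is one reason it suits comparing deep feature vectors.
# similarity_metrics(torch.randn(256), torch.randn(256))
```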
3.3. Comparison Experiment of Model Parameter Quantity and Inference Speed
To further validate the feasibility of the proposed algorithm, we conducted comparative experiments on model parameter quantity and inference speed between Meta-DETR and the proposed algorithm. The experimental results are shown in
Table 6. As can be seen from the data in
Table 6, the proposed algorithm has slightly more parameters than Meta-DETR. This slight increase is likely due to the additional parameters used when calculating the maximum inter-class variance. In terms of inference speed, the proposed algorithm exhibits a certain advantage, because adaptive sampling deformable attention uses fewer sampling points when computing attention over background regions, thereby improving computational speed. Overall, the proposed algorithm improves inference speed with only a slight increase in the number of parameters, demonstrating that it optimizes operational efficiency while maintaining model performance and further validating its usability and superiority in practical applications.
3.4. Loss Curve
In order to further demonstrate the advantages of the proposed algorithm, the loss curve comparison diagrams of Meta-DETR and the proposed algorithm are plotted on the PASCAL VOC dataset and the self-made infrared aircraft dataset, as shown in
Figure 4 and
Figure 5.
Figure 4 is a comparison of loss curves on the PASCAL VOC dataset, and
Figure 5 is a comparison of loss curves on the self-made infrared aircraft dataset. In
Figure 4 and
Figure 5, (a) compares the base training loss curves, and (b) compares the fine-tuning loss curves. The blue line is the loss curve of the Meta-DETR algorithm, and the red line is the loss curve of the proposed algorithm. The left side shows the loss curve over the complete training run, and the right side shows a partial enlargement of the loss curve over selected epochs. It can be seen from the figures that, in both the base training stage and the fine-tuning stage, the loss of the proposed algorithm in the later epochs is lower than that of Meta-DETR, which shows that the proposed algorithm has better generalization performance.
3.5. Visual Analysis
3.5.1. The Visual Results of Deformable Attention
In order to more clearly demonstrate which regions the network model attends to, this paper uses Eigen-CAM [31] to visualize the features learned by the model and to observe how the attention regions of the encoder output layer change when the detection algorithm uses fixed sampling versus adaptive sampling. In the visualization results, different colors represent the degree of attention the algorithm pays to different areas of the image: red indicates the areas receiving the most attention, followed by yellow, green, and blue in decreasing order, as shown in
Figure 6. In each sub-figure, the left column in the figure shows the heat map visualization results of Meta-DETR, and the right column shows the heat map visualization results of the algorithm in this paper. Observing
Figure 6a, we can see that Meta-DETR pays more attention to almost the entire image without focusing on any particular area of interest, while the proposed algorithm focuses on the car itself and pays less attention to the background. Observing
Figure 6b, we can see that Meta-DETR pays less attention to the potted plant area, while the proposed algorithm pays more attention to the potted plant itself. Observing
Figure 6c, we can see that Meta-DETR pays more attention to the background area, while the proposed algorithm only pays attention to the sheep itself and ignores the background area. Observing
Figure 6d, we can see that Meta-DETR pays too much attention to the background part, while the proposed algorithm focuses on the person. Observing
Figure 6e, we can see that Meta-DETR focuses on the large background area, while the proposed algorithm only pays attention to the chair object. Observing
Figure 6f, we can see that Meta-DETR also pays attention to the dog itself, but also pays equal attention to the background, while the proposed algorithm only pays attention to the dog itself and ignores the background area. Observing
Figure 6g, we can see that Meta-DETR pays attention to almost the entire image but pays less attention to a key area of the foreground train, while the proposed algorithm pays more attention to the train target. Observing
Figure 6h, we can see that Meta-DETR pays uniform attention to the entire image, while the proposed algorithm only pays attention to the TV. Observing
Figure 6i, we can see that Meta-DETR pays uniform attention to the entire image, while the proposed algorithm pays attention not only to the sofa but also to the potted plant in the upper right corner. This shows that the proposed algorithm can still focus on the targets themselves under conditions of multiple targets and complex backgrounds.
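As background for the visualization above, Eigen-CAM projects the activations of a chosen layer onto their first principal component to obtain a class-agnostic heat map. The sketch below is a minimal, self-contained reimplementation for a single feature map; it is not the exact visualization code used in these experiments.

```python
import torch

def eigen_cam(feature_map: torch.Tensor) -> torch.Tensor:
    """Eigen-CAM heat map for a single feature map of shape (C, H, W):
    project the activations onto their first principal component."""
    c, h, w = feature_map.shape
    acts = feature_map.reshape(c, h * w).t()              # (H*W, C) activation matrix
    acts = acts - acts.mean(dim=0, keepdim=True)          # center before the decomposition
    _, _, vh = torch.linalg.svd(acts, full_matrices=False)
    cam = acts @ vh[0]                                    # projection onto the first right singular vector
    if cam.sum() < 0:                                     # SVD sign is arbitrary; keep activations positive
        cam = -cam
    cam = torch.relu(cam).reshape(h, w)
    return cam / (cam.max() + 1e-6)                       # normalize to [0, 1] for display
```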
3.5.2. Test Visualization Results
To further verify the detection performance of the proposed algorithm, Meta-DETR and the proposed algorithm are used to make predictions on the PASCAL VOC dataset and the self-made infrared aircraft dataset. Some representative images from different categories of the PASCAL VOC dataset are selected for visual analysis, as shown in
Figure 7. In each sub-figure, the left column in the figure shows the detection results of Meta-DETR, and the right column shows the detection results of the proposed algorithm. In
Figure 7a, for occluded targets, Meta-DETR misses the occluded car on the far right, while our algorithm detects it. In
Figure 7b, for small targets, Meta-DETR detects the two cars in the lower left corner as one car, and the detection position is inaccurate, while our algorithm detects the two cars in the lower left corner separately. In
Figure 7c, for overlapping targets, Meta-DETR misses the person on the motorcycle and the person occluded by the sand, while our algorithm not only detects the overlapping person and the motorcycle but also detects the person buried in the sand. In
Figure 7d, for blurred targets, Meta-DETR misses the bicycle on the track, while the proposed algorithm detects it; in
Figure 7e, for large targets, although both Meta-DETR and the proposed algorithm detect the target, the detection confidence of the proposed algorithm is higher. From the comparison, it can be seen that Meta-DETR has difficulty in detecting occluded targets (
Figure 7a), small targets (
Figure 7b), overlapping targets (
Figure 7c), blurred targets (
Figure 7d), and large targets (
Figure 7e), while the performance of the proposed algorithm is relatively robust.
For the self-made infrared aircraft dataset, some representative images of different postures and different numbers of aircraft are selected for visualization analysis, as shown in
Figure 8. In each sub-figure, the left side of the figure shows the detection results of Meta-DETR, and the right side shows the detection results of the algorithm in this paper. For the single-aircraft LAF + LAT (
Figure 8a), Meta-DETR failed to identify both the LAF and LAT, while the proposed algorithm accurately detected both with high confidence. For the multi-aircraft LAF + LAT (
Figure 8b), Meta-DETR missed the LAF of the two leftmost aircraft, while the proposed algorithm detected all targets in the image without any omissions. For the single-aircraft BAF + BAT (
Figure 8c), Meta-DETR only detected the BAF and missed the BAT, while the proposed algorithm not only detected the BAF but also the BAT. For the multi-aircraft BAF + BAT (
Figure 8d), compared with Meta-DETR, the proposed algorithm additionally detected the BAT of the two aircraft at the bottom of the image. For the multi-aircraft BWT (
Figure 8e), Meta-DETR missed the second BWT on the right, while the proposed algorithm detected all targets without any omissions. From the comparison, it can be seen that Meta-DETR has difficulty in detecting aircraft in different postures and numbers, while the proposed algorithm is relatively more robust.