4.1. Experimental Setup
Datasets. We conducted extensive experiments on the self-constructed UAV Swarm Dataset (USD) and the publicly available Drone Vehicle dataset. The USD contains 7000 infrared images with a resolution of 720 × 540, capturing low-altitude UAV swarms against urban backgrounds. The data were acquired using mid-wave infrared cameras mounted on a ground-based platform, simulating real-world drone swarm operations. Data collection covered varied conditions, including time of day (day and night) and weather (clear and cloudy). To ensure diverse behavioral representation, the UAV swarms were programmed to execute multiple formations (such as linear, circular, and random patterns) along with maneuvers including hovering, cruising, and crossing. The original video clips were converted into image sequences at 25 frames per second, with frames sampled at three-frame intervals for annotation in COCO format. The resulting infrared images encompass a wide range of scenarios, such as swarm cruising, cross-flights, and dynamic formation changes, and each image contains an average of 20–30 UAV targets. The images in the USD suitably reflect the three challenges associated with infrared UAV swarm detection mentioned in the first section; an example image and the related statistical analysis are shown in Figure 5a and Table 1. The distribution of target areas and the statistical analysis of targets categorized by size are illustrated in Figure 5b,c, respectively. The distribution of UAV targets in the USD is shown in Figure 5d, where the heat map indicates the number of UAV sightings in the same area and each blue point indicates a UAV target.
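As a simple illustration of the frame-sampling step described above, the following sketch extracts every third frame of a clip with OpenCV; the paths, file naming, and output format are hypothetical, and the COCO-format annotation is assumed to be performed afterwards.

```python
import os
import cv2  # OpenCV

def extract_frames(video_path: str, out_dir: str, step: int = 3) -> int:
    """Save every `step`-th frame of a 25 fps clip as a PNG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the clip
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```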
In order to make the comparison more robust, we also compared the GM-DETR with other detectors on Drone Vehicle, a large-scale, drone-based, RGB–infrared vehicle detection dataset containing 28,439 RGB–infrared image pairs. Its detection challenges lie not only in accurately locating small-scale targets from an aerial perspective but also in overcoming the fine-grained feature deficiency caused by shooting height, complex backgrounds, and cross-modal differences, so as to accurately distinguish the vehicle types. Here, we used only the infrared images to evaluate the existing methods.
Evaluation Metrics. This study used the standard evaluation protocol, the Average Precision (AP) and Average Recall (AR) metrics, as described in [4]. These metrics reflect the area under the precision–recall curve and can be formulated as
$$\mathrm{AP} = \int_{0}^{1} p(r)\,\mathrm{d}r, \qquad p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN},$$
where p denotes the detection precision, r represents the recall, TP is the number of true positives, FP denotes false positives, and FN represents false negatives.
The criterion for determining a true positive is that the C-IoU between the predicted bounding box and the ground truth should be greater than the threshold of 0.5. The AP averaged over all classes is reported as the mAP, while the overall mAP is additionally averaged over IoU thresholds in the range 0.5:0.95. In addition, the computational cost and real-time performance are evaluated based on the number of giga floating-point operations (GFLOPs) and the number of frames processed per second (FPS).
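For concreteness, a minimal NumPy sketch of the AP computation is given below; it assumes that detections have already been matched to the ground truth at a single IoU threshold (the `is_tp` flags and `num_gt` count are hypothetical inputs), whereas the full COCO protocol additionally averages over classes and IoU thresholds.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class at one IoU threshold."""
    order = np.argsort(-np.asarray(scores, dtype=float))     # sort by confidence, descending
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-9)   # p = TP / (TP + FP)
    recall = cum_tp / max(num_gt, 1)                         # r = TP / (TP + FN)
    # accumulate p(r) * delta_r along the recall axis
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Example: 4 detections for a class with 3 ground-truth boxes
print(average_precision(scores=[0.9, 0.8, 0.7, 0.6], is_tp=[1, 0, 1, 1], num_gt=3))
```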
As shown in Figure 5b,c, all the target areas in the USD are smaller than the COCO definition of small targets (32 × 32 pixels), which causes evaluation metrics such as AP_M and AP_L to be reported as −1 in the final evaluation. Such results provide no meaningful information for the task of infrared detection of small UAVs and allow for no meaningful comparison among different models. We therefore adopted 81 and 256 pixels as area thresholds to redefine the pixel area ranges for large, medium, and small objects, mapping the original ranges of (96 × 96, +∞), (32 × 32, 96 × 96], and (0, 32 × 32] to (16 × 16, +∞), (9 × 9, 16 × 16], and (0, 9 × 9], respectively. This classification aligns with the definition of small infrared targets established by the Society of Photo-Optical Instrumentation Engineers [53,54].
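If the evaluation is run with pycocotools, one way to apply this remapping is to override the area ranges of COCOeval before evaluating; the threshold values below follow the text, while the ground-truth and detection file paths are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/usd_test.json")            # placeholder ground-truth file
coco_dt = coco_gt.loadRes("results/detections.json")   # placeholder detection results

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
# Redefine the small/medium/large area ranges: 9 x 9 = 81 and 16 x 16 = 256 pixels
coco_eval.params.areaRng = [
    [0 ** 2, 1e5 ** 2],   # all
    [0 ** 2, 9 ** 2],     # small:  (0, 9 x 9]
    [9 ** 2, 16 ** 2],    # medium: (9 x 9, 16 x 16]
    [16 ** 2, 1e5 ** 2],  # large:  (16 x 16, +inf)
]
coco_eval.params.areaRngLbl = ["all", "small", "medium", "large"]

coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
```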
Implementation details. In terms of the hyperparameter settings and training details, we configured the number of attention heads and sampling points for the deformable attention, and the number of fusion blocks was set to 6. As for object queries, DETR-class methods conduct query selection layer by layer on multi-scale features, and the number of object queries was set to 300 to avoid deviations in detection performance caused by inconsistent numbers of object queries. The effect of the number of object queries on the detection performance of the GM-DETR is discussed in detail in the ablation study. The memory slots are designed to update and store prior knowledge for each object category; therefore, the number of memory slots was set to the total number of object categories plus one to include the background class. For instance, on the Drone Vehicle dataset it was set to 6, while for the USD, which has a single foreground category, it took the value 2. We trained the GM-DETR using the AdamW optimizer [55] with weight decay on NVIDIA RTX 4090 GPUs (24 GB), which are developed by NVIDIA, a US-based technology corporation. The initial learning rate was set separately for the backbone and the remaining parts and was decreased by a factor of 0.1 at later stages. The batch size per GPU was set to 4 and 10 for the USD and Drone Vehicle, respectively, and the number of training epochs was set to 120 and 12, respectively. All methods adjusted the input size according to the image to ensure the validity of the experimental results.
4.2. Comparison with State-of-the-Art Methods
To better validate the effectiveness of the GM-DETR, we compared its performance with that of models that have achieved cutting-edge performance in recent years, under the same experimental settings, namely, Faster R-CNN [
29], RetinaNet [
56], YOLOv5 [
57], YOLOv6 [
58], YOLOv7 [
26], YOLOv8 [
27], YOLOv10 [
28], Deformable DETR [
37], Conditional DETR [
38], Anchor DETR [
59], DAB-DETR [
39], Dn-DETR [
45], Dino [
40], Dino-eva-01 [
60], Focus-DETR [
32], Co-DETR [
49], and Salience DETR [
44]. The evaluation results on the USD and Drone Vehicle are shown in
Table 2 and
Table 3. The GM-DETR achieves good performance on both datasets.
Comparison on USD. As shown in Table 2, apart from three metrics, the GM-DETR outperforms the reference methods on all the others. It is worth noting that the GM-DETR is the only method to score above 90 on AP_50, and it surpasses the second-best method, the Focus-DETR, by a large margin of 1.9%. In addition, we focus more on the performance in detecting small targets, i.e., AP_S, as it reflects the response speed and detection distance of a model. The GM-DETR improves AP_S by 1.1% compared to the second-best method, the Salience DETR, which demonstrates the effectiveness of the GM-DETR in the infrared detection of small UAV targets. All methods were trained for 120 epochs, and all employ ResNet50 as the backbone, except for Dino-eva-01, which uses EVA [60] as the backbone.
Comparison on Drone Vehicle. Table 3 shows the comparison results on Drone Vehicle. Drone Vehicle is also a highly challenging detection dataset because of the similar appearance of air–ground targets, so detectors are required to capture sufficient appearance features and context information to distinguish between different vehicle types. The GM-DETR achieves the best performance in terms of the AP and AR metrics. To evaluate the proposed methods against the baseline Salience DETR, an ablation study is conducted on the USD by removing specific components and reporting the AP, FPS, and related AP metrics.
As shown in Table 4, the ablation study demonstrates that the FCAF, SS, and GRE modules contribute significantly to the detection performance. In contrast, the Memory Updater and Memory-Augmented Decoder improve one metric at the cost of slight declines in the others. All in all, the GM-DETR achieves the best results (73.7 on AP and 90.6 on AP_50), confirming the effectiveness of each proposed component. The convergence graphs of the ablation study are shown in Figure 6.
4.3. Ablation Study
Effect of Fine-Grained Context-Aware Fusion. To validate the fusion effect of the Fine-Grained Context-Aware Fusion module, we performed a controlled experiment with the Salience DETR and the GM-DETR, as shown in Table 5. Compared to the full GM-DETR, the GM-DETR with FPN declines on all metrics except one. In the other control group, the Salience DETR equipped with the Fine-Grained Context-Aware Fusion module gains a +0.6 improvement on AP and a +0.1 improvement on a further metric, which shows that the Fine-Grained Context-Aware Fusion module can improve the fusion of multi-scale features and the detection performance. In Figure 7, we show the convergence of the detection accuracy and the loss terms during training.
The loss curve, denoting the total training loss, exhibits a consistent decline from approximately 15 to 8 over the course of training, demonstrating stable optimization and effective learning. The loss box curve, representing the regression loss for bounding-box localization, decreases significantly from 0.032 to around 0.012, indicating the model's enhanced capability to accurately localize small infrared UAV targets. Similarly, the loss class curve, corresponding to the classification error, declines from 0.46 to 0.34, reflecting improved discrimination between targets and background under challenging infrared conditions. The synchronized convergence of decreasing losses and increasing mAP attests to the robustness and stability of the training process and validates the effectiveness of the GM-DETR for the infrared detection of small UAVs.
Ablation study of Memory Updater. To ascertain the enhancement brought by the Memory Updater module and the effect of different compression values on detection, we conducted experiments on the Drone Vehicle dataset. The results are displayed in Table 6. It is evident that increasing the compression value improves detection accuracy, albeit only up to a certain extent. Specifically, expanding the prior experience space is demonstrated to enhance the detection precision for vehicles that are hard to recognize, such as vans and freight cars. The no-memory group confirms the effectiveness of the proposed Memory Updater module. The 64 (Mix) group configures the Memory Updater with a single memory block that is compressed to 64 in the MASA and iteratively updated for every target and image; the result demonstrates the effectiveness of setting a separate memory update mechanism for each class.
Collectively, the results presented in Table 6 indicate that for infrared images characterized by the absence of discriminative texture and color features, the Memory Updater module enhances the performance of the GM-DETR on the classification task for similar targets.
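Purely as a conceptual sketch (not the paper's exact Memory Updater mechanism), the snippet below maintains one memory slot per category plus a background slot and refreshes each slot with an exponential moving average of the decoder features assigned to that class, which mirrors the per-class update idea discussed here.

```python
import torch

class PerClassMemory(torch.nn.Module):
    """Conceptual per-class memory bank: one slot per category plus a background slot.

    This is NOT the paper's Memory Updater; it only illustrates storing and refreshing
    per-class prior features, here with a simple exponential-moving-average rule.
    """

    def __init__(self, num_classes: int, dim: int, momentum: float = 0.99):
        super().__init__()
        self.register_buffer("slots", torch.zeros(num_classes + 1, dim))  # +1 for background
        self.momentum = momentum

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor) -> None:
        """feats: (N, dim) decoder embeddings; labels: (N,) class indices."""
        for c in labels.unique():
            class_mean = feats[labels == c].mean(dim=0)
            self.slots[c] = self.momentum * self.slots[c] + (1.0 - self.momentum) * class_mean

    def read(self, labels: torch.Tensor) -> torch.Tensor:
        """Return the stored prior feature for each requested class."""
        return self.slots[labels]
```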
Ablation study of object queries. The query sparsification of DETR-class methods aims to reduce training costs by decreasing the number of queries while maintaining comparable detection performance. To investigate the impact of varying the number of object queries on detection performance, we conducted extensive experiments on Drone Vehicle, and the results are presented in Table 7. The results show that, within a certain range, the detection performance of the GM-DETR is positively correlated with the number of queries. However, as the number of queries continues to grow, decoding quality declines once too many low-quality queries are introduced. Considering both training costs and the validity of comparisons, we standardized the number of queries to 300 across all our experiments.
Ablation study of the boundary points ratio. To evaluate the influence of the boundary points ratio, a hyperparameter, on the model performance, we designed a detailed ablation experiment. This parameter plays a critical role in the model’s capacity to perceive and leverage edge features of faint and small targets. We performed extensive comparative experiments on the USD across different values of this ratio, with results illustrated in
Figure 8. As shown in the figure, detection accuracy begins to decrease gradually once the number of boundary points surpasses that of center points. Furthermore, as the proportion of boundary points continues to rise, the rate of performance degradation becomes increasingly pronounced.
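Purely to illustrate the role of this hyperparameter (and not the BBS implementation itself), the hypothetical helper below allocates a fixed budget of sampling points between a box's boundary and its center according to the ratio, in normalized box coordinates.

```python
import torch

def split_query_points(num_points: int, boundary_ratio: float) -> torch.Tensor:
    """Hypothetical helper: allocate sampling points between boundary and center regions.

    A fraction `boundary_ratio` of points is placed on the edges of the unit box and the
    remainder is clustered around its center; coordinates are normalized to [0, 1].
    """
    n_boundary = int(round(num_points * boundary_ratio))
    n_center = num_points - n_boundary

    # Boundary points: uniform positions along one of the four box edges
    t = torch.rand(n_boundary)
    side = torch.randint(0, 4, (n_boundary,))
    x = torch.where(side < 2, t, torch.where(side == 2, torch.zeros_like(t), torch.ones_like(t)))
    y = torch.where(side == 0, torch.zeros_like(t), torch.where(side == 1, torch.ones_like(t), t))
    boundary_pts = torch.stack([x, y], dim=-1)

    # Center points: a tight Gaussian cluster around the box center
    center_pts = (0.5 + 0.1 * torch.randn(n_center, 2)).clamp(0.0, 1.0)
    return torch.cat([boundary_pts, center_pts], dim=0)

# Example: 16 points with an equal boundary/center split
points = split_query_points(16, boundary_ratio=0.5)
```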
4.4. Visualization
Visualization results for USD. We selected six representative images from the test set to visualize the detection performance, covering formation changes, dense distribution, mutual occlusion, multi-cluster flights, motion blur, and different formation shapes. In Figure 9, we visualize the attention results of the GM-DETR and find that it can effectively locate the foreground area and distinguish targets from the background. Although the GM-DETR is also affected by watermarks and smoke, it is able to eliminate their interference during the query refinement and decoding stages, as the final detection results in Figure 10 show. In Figure 11, we visualize the confidence values of the queries selected by the BBS module. Since only queries with high confidence were selected, there are red dots but no blue dots in the subfigures. The visualization results demonstrate that the Boundary–Center Balance Sampling module enhances the relevance of the selected queries to the targets by guiding query selection, which makes the GM-DETR focus on the boundary and center regions.
In Figure 10, YOLOv8 suffers from missed detections and poor localization accuracy, and Faster R-CNN suffers from severe missed detections and false detections. The Focus-DETR produces false detections in the first image and missed detections in the second and last images. The Salience DETR is the best among the comparison methods and only misses targets in the motion-blur image. In contrast, the GM-DETR performs well in all cases, especially in the mutual occlusion, dense distribution, and motion-blur images.
Visualization results for Drone Vehicle. Vehicle targets in the Drone Vehicle dataset have a relatively large target scale and a rich diversity of appearance features, which makes it easier to understand how different detection models respond to features. Based on these characteristics, we selected several images to conduct a visual analysis of the baseline and the GM-DETR, and the corresponding attention heat maps are shown in Figure 12. Comparative analysis reveals that the GM-DETR exhibits a more focused attention distribution on distinguishable parts, such as the front or rear of the vehicle, depending on the vehicle type; this demonstrates the effectiveness of our CAS and BBS modules. In contrast, the baseline model concentrates its attention response on the target center area.
The differences in the visual perception mechanisms of the GM-DETR and the baseline model are effectively verified by this visual comparison. As shown in Figure 13, the confidence values of the baseline model show a pronounced central aggregation: its attention is limited to the central region of the annotation box, and it has obvious shortcomings in capturing the boundary and global contextual information of a target. In contrast, the GM-DETR leverages its CAS and BBS modules to dynamically shift focus across target types while balancing attention between central target regions and contextual boundaries. The visualization results of different methods on Drone Vehicle can be found in Figure 14.