4.4.2. Qualitative Analysis
To evaluate the performance of the different detection methods for infrared small target detection, the visualization results use color-coded bounding boxes. Because the red ground-truth boxes are small in every scene and their confidence scores would be hard to read, green boxes are used to mark correct detections and orange boxes to mark false or abnormal detections. The number of green boxes therefore reflects how many targets were detected correctly, while the number of orange boxes reflects how many false detections occurred (missed targets produce no detection box at all). The number inside a green box is the detection confidence, the number inside an orange box is the false-detection confidence, and red boxes enclose the ground-truth target areas.
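For illustration, this box-color convention can be rendered with a small OpenCV helper along the lines of the sketch below; the function name, box format, and BGR color values are assumptions for the sketch, not the exact plotting code used to produce the figures.

    import cv2

    # Hypothetical helper: draw one detection box with its confidence score,
    # colored green for a correct detection or orange for a false/abnormal one.
    def draw_detection(image, box, confidence, correct=True):
        x1, y1, x2, y2 = box                                 # pixel coordinates (x1, y1, x2, y2)
        color = (0, 255, 0) if correct else (0, 165, 255)    # BGR: green for correct, orange for false
        cv2.rectangle(image, (x1, y1), (x2, y2), color, 1)
        cv2.putText(image, f"{confidence:.2f}", (x1, max(y1 - 3, 10)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.35, color, 1)
        return image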
The visualization results are shown in Figure 9, Figure 10, and Figure 11. The left side of each figure displays the ground-truth target boxes in eight typical scenes: dark appearance, blurred background, complex architectural occlusion, cloudy scene, multi-target, bright clutter, high-contrast boundary, and high noise.
Figure 9 shows the ground-truth bounding boxes of the targets in the eight scenarios on the left and the detection results of the YOLO-series methods and REDETR-RISTD on the right.
In the dark-appearance scenario, the target brightness is extremely low. The YOLO-series methods failed to detect the target and produced no detection boxes. In contrast, REDETR-RISTD successfully detected the target with a confidence of 58%. This success is due to its reparameterized multi-scale feature extraction (RMSFE) module, which effectively extracts target features under low signal-to-noise ratio and low-contrast conditions.
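To make the reparameterization idea concrete, the sketch below shows a generic RepVGG-style fusion in which a training-time 3x3 + 1x1 branch pair is folded into a single 3x3 convolution for inference; the channel assumptions, bias handling, and function name are illustrative only and do not reproduce the actual RMSFE module.

    import torch
    import torch.nn as nn

    # Generic structural-reparameterization sketch: with matching stride and
    # standard padding, y = conv3x3(x) + conv1x1(x) equals fused(x) after folding.
    def fuse_branches(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
        fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels,
                          kernel_size=3, padding=1, bias=True)
        with torch.no_grad():
            # Pad the 1x1 kernel to 3x3 so both branches share one kernel tensor.
            padded_1x1 = nn.functional.pad(conv1x1.weight, [1, 1, 1, 1])
            fused.weight.copy_(conv3x3.weight + padded_1x1)
            fused.bias.copy_(conv3x3.bias + conv1x1.bias)  # assumes both branches use bias terms
        return fused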
In the blurred-background scenario, all methods detected the single target, but with varying confidence: YOLOv7 achieved 89%, while REDETR-RISTD maintained around 80%. This indicates that REDETR-RISTD is highly stable in blurred backgrounds, with its attention-guided intra-scale contextual feature interaction (AICFI) module effectively suppressing background interference.
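A minimal sketch of how such attention-guided intra-scale interaction can be realized is given below: the feature map of one pyramid level is flattened into tokens and refined with multi-head self-attention so that target and background context interact. The dimensions and layer choices are assumptions for illustration, not the exact AICFI design.

    import torch.nn as nn

    class IntraScaleAttention(nn.Module):
        def __init__(self, channels=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
            self.norm = nn.LayerNorm(channels)

        def forward(self, feat):
            b, c, h, w = feat.shape
            tokens = feat.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
            refined, _ = self.attn(tokens, tokens, tokens)  # intra-scale context mixing
            tokens = self.norm(tokens + refined)            # residual connection + normalization
            return tokens.transpose(1, 2).reshape(b, c, h, w)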
In the complex architectural occlusion scenario, REDETR-RISTD achieved a detection confidence of 88%, significantly higher than the YOLO-series methods, which ranged from 70% to 82%. This improvement is mainly attributed to the AICFI module, which sharpens the feature distinction between the target and the occluded regions, reducing the impact of building occlusion and enabling accurate detection of partially occluded targets.
In the cloudy scenario, REDETR-RISTD reached a detection confidence of 85%, which is clearly higher than YOLOv9m’s 76%. Its advantage comes from the RMSFE module’s multi-scale feature fusion capability. Under cloud interference, this module enhances the expression of target edge features and improves the detection of small targets.
In the multi-target scenario, the YOLO series methods only detected the largest target. They failed to recognize the other two targets. In contrast, REDETR-RISTD successfully detected all three targets with detection confidences of 0.73, 0.27, and 0.32. This demonstrates that the multi-scale pyramid feature fusion architecture, through a bidirectional fusion mechanism, effectively integrates target features at different scales to achieve comprehensive detection.
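As a simplified illustration of such bidirectional multi-scale fusion, the sketch below performs one PAN-style top-down/bottom-up pass over a three-level feature pyramid (p3 finest, p5 coarsest); the channel counts, level names, and upsampling/pooling choices are assumptions for illustration, not the actual REDETR-RISTD fusion architecture.

    import torch.nn as nn
    import torch.nn.functional as F

    class BidirectionalFusion(nn.Module):
        def __init__(self, channels=256):
            super().__init__()
            self.smooth = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))

        def forward(self, p3, p4, p5):
            # Top-down path: propagate coarse semantic context to finer levels.
            p4 = p4 + F.interpolate(p5, size=p4.shape[-2:], mode="nearest")
            p3 = p3 + F.interpolate(p4, size=p3.shape[-2:], mode="nearest")
            # Bottom-up path: feed fine spatial detail back to coarser levels.
            p4 = p4 + F.max_pool2d(p3, kernel_size=2)
            p5 = p5 + F.max_pool2d(p4, kernel_size=2)
            return [conv(p) for conv, p in zip(self.smooth, (p3, p4, p5))]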
In the bright-clutter scenario, REDETR-RISTD reached a detection confidence of 86%. This shows strong anti-interference capability. It maintains high detection accuracy in complex backgrounds and effectively avoids false detections.
In the high-contrast-boundary scenario, YOLOv6s failed to detect the target, and other YOLO methods showed low detection confidence. In contrast, REDETR-RISTD successfully detected the target with a confidence of 72%. This indicates that it has strong target localization ability in high-contrast environments and can overcome the influence of high-contrast noise.
Finally, in the high-noise scenario, the YOLO series methods only detected the largest target. REDETR-RISTD, however, successfully detected three targets with detection confidences of 0.68, 0.69, and 0.71. This further proves its strong robustness in noisy environments and its ability to maintain high detection stability under complex interference conditions.
Figure 10 shows the ground-truth bounding boxes of targets in eight scenarios on the left and the detection results of TOOD, Sparse R-CNN, Mask R-CNN, and REDETR-RISTD on the right.
In the dark-appearance scenario, TOOD failed to detect the target. Sparse R-CNN detected the target with a confidence of 0.799 and produced one false positive with a confidence of 0.511. Mask R-CNN did not detect the target and generated two false positives with confidences of 0.949 and 0.579. In contrast, REDETR-RISTD successfully detected the target with a confidence of 58%. This result demonstrates that REDETR-RISTD, with its RMSFE module, can better capture target features in low signal-to-noise ratio, low-contrast environments.
In the blurred-background scenario, all four methods successfully detected the single target, with detection confidences of 41.1%, 86.3%, 99.7%, and 80%, respectively. REDETR-RISTD uses an attention-guided intra-scale contextual feature interaction module to effectively reduce background interference and ensure stable detection performance.
In the complex architectural occlusion scenario, TOOD detected the target with a confidence of 40% and produced three false positives with confidences of 0.367, 0.303, and 0.368. Sparse R-CNN, Mask R-CNN, and REDETR-RISTD detected the target with confidences of 87.5%, 63.7%, and 88%, respectively. The multi-scale feature fusion module of REDETR-RISTD is particularly effective in this scenario. It reduces the negative impact of building occlusion on detection results.
In the cloudy scenario, all methods successfully detected the target with detection confidences of 42.7%, 88.8%, 98.4%, and 85%, respectively. Mask R-CNN produced two false positives with confidences of 0.925 and 0.326. REDETR-RISTD further enhances the expression of target edge features through its multi-scale fusion mechanism, thus improving detection accuracy.
In the multi-target scenario, TOOD detected only the largest target with a confidence of 0.646 and missed two targets. Sparse R-CNN detected three targets with confidences of 0.925, 0.385, and 0.456. Mask R-CNN detected one target with a confidence of 0.962 and missed two targets. In contrast, REDETR-RISTD successfully detected three targets with confidences of 0.73, 0.27, and 0.32. Its multi-scale pyramid feature fusion and bidirectional interaction mechanism effectively integrate target features at different scales, achieving comprehensive target detection.
In the bright-clutter scenario, the detection confidences were 33.7%, 82.2%, 99.9%, and 86% for TOOD, Sparse R-CNN, Mask R-CNN, and REDETR-RISTD, respectively. These results indicate that REDETR-RISTD maintains high detection accuracy under complex background conditions.
Finally, in the high-noise scenario, TOOD detected one target with a confidence of 0.327 and missed two targets. Sparse R-CNN detected one target with a confidence of 0.899 and missed two targets. Mask R-CNN failed to detect any targets. In contrast, REDETR-RISTD successfully detected three targets with confidences of 0.68, 0.69, and 0.71. Its multi-scale fusion and self-attention mechanism effectively reduce the impact of noise on feature extraction, thereby maintaining high detection performance even in high-noise conditions.
Figure 11 shows the visualization results of infrared small target detection in eight typical scenarios. The leftmost part displays the ground-truth bounding boxes of the targets, while the detection outputs of DINO, RT-DETR-ResNet18, RT-DETR-HGNet-L, and REDETR-RISTD are shown from left to right.
In the dark-appearance scenario, DINO correctly detected the target with a confidence of 0.721 but also produced a false positive with a confidence of 0.827, indicating that it is susceptible to background noise under low-light conditions. RT-DETR-ResNet18 failed to detect the target, while RT-DETR-HGNet-L detected it with a confidence of only 0.33, showing clearly inadequate performance. In contrast, REDETR-RISTD successfully detected the target with a confidence of 58%, which can be attributed to its RMSFE module being highly sensitive to weak target features in low signal-to-noise ratio environments.
In the blurred-background scenario, all methods successfully detected the single target, with detection confidences of 88.9% for DINO and 80% for RT-DETR-ResNet18, RT-DETR-HGNet-L, and REDETR-RISTD. This result suggests that when the target is well distinguished from the background, all models can perform well, and the attention-guided contextual feature interaction module of REDETR-RISTD plays a crucial role in maintaining stable detection performance.
In the complex architectural occlusion scenario, DINO detected the target with a confidence of 90.1%, while RT-DETR-ResNet18 and RT-DETR-HGNet-L achieved confidences of 85.1% and 80.7%, respectively. REDETR-RISTD reached 88%. The higher confidence of DINO indicates that it can extract stronger features when dealing with partial occlusion. However, REDETR-RISTD effectively mitigated the interference caused by occlusion through multi-scale feature fusion, ensuring accurate detection.
In the cloudy scenario, DINO detected the target with a confidence of 89.6%, but also produced a false positive with a confidence of 0.745. RT-DETR-ResNet18 and REDETR-RISTD achieved confidences of 83% and 85%, respectively, while RT-DETR-HGNet-L only reached 77%. Although DINO had a high detection confidence, its false positives show that it tends to capture erroneous features under cloud interference. REDETR-RISTD, on the other hand, uses its reparameterized structure to suppress interference information, resulting in more robust detection.
In the multi-target scenario, DINO detected all three targets with confidences of 0.783, 0.673, and 0.656, demonstrating an advantage when there are more targets. RT-DETR-ResNet18 detected only one target and missed two, while RT-DETR-HGNet-L detected two targets but produced one false positive and one missed target. In contrast, REDETR-RISTD successfully detected all three targets with confidences of 0.73, 0.27, and 0.32. This result confirms the effectiveness of REDETR-RISTD’s multi-scale pyramid feature fusion and bidirectional interaction mechanism in integrating features from different scales to handle multi-target detection tasks comprehensively.
In the bright-clutter scenario, all methods exhibited high detection confidences, with DINO reaching 92.6%, RT-DETR-ResNet18 at 87%, RT-DETR-HGNet-L at 85%, and REDETR-RISTD at 86%. This shows that all methods can recognize the target in a bright and cluttered background, with REDETR-RISTD maintaining a relatively stable performance.
For the high-contrast-boundary scenario, DINO achieved a detection confidence of 89%, showing strong target localization capability. In comparison, RT-DETR-ResNet18, RT-DETR-HGNet-L, and REDETR-RISTD achieved confidences of 74%, 69%, and 72%, respectively. Although DINO performed relatively well under these conditions, REDETR-RISTD effectively reduced the negative impact of high-contrast noise through its adaptive noise suppression mechanism, ensuring stable detection results.
In the high-noise scenario, DINO detected two targets with confidences of 0.875 and 0.871 but missed one target. RT-DETR-ResNet18 and RT-DETR-HGNet-L also each detected two targets and missed one, with confidences of 0.74 and 0.73, and 0.67 and 0.61, respectively. In contrast, REDETR-RISTD detected all three targets with confidences of 0.68, 0.69, and 0.71, demonstrating stronger noise suppression ability. This indicates that in noisy environments the feature extraction of the compared methods is prone to interference, whereas REDETR-RISTD's multi-scale feature fusion and self-attention mechanism effectively alleviate this issue and ensure comprehensive target detection.