1. Introduction
In recent years, Unmanned Aerial Vehicles (UAVs) have emerged as the primary carrier within the low-altitude economy [1]. Depending on the mission, a platform can be fitted with different functional payloads, such as a spraying system for agriculture [2], a data acquisition module for surveying and mapping [3], a sensor unit for environmental monitoring [4], and an image recognition module for security and public safety [5]. Nevertheless, UAV-based image processing still struggles with real-time detection of small objects. Small objects are characterized by drastic scale variance, indistinct feature representations, and heavy background clutter [6], and existing detection algorithms often perform poorly on them, with high missed-detection and false-alarm rates. Furthermore, what counts as a ‘small object’ in UAV imagery is highly relative, because the Ground Sample Distance (GSD) varies continuously with flight altitude and camera specifications [7]. An object that appears medium-sized at a low altitude may shrink to a few pixels as the GSD increases [8]. While standard evaluation protocols, such as metrics for objects ≤ 32 × 32 pixels, provide quantitative benchmarks, real-world UAV detectors must structurally adapt to these physical scale uncertainties.
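The dependence of apparent object size on GSD can be illustrated with a short worked example. The sketch below uses the standard nadir-view relation GSD = (altitude × sensor width) / (focal length × image width); the camera parameters, the 640-pixel image width, and the 4.5 m car length are hypothetical values chosen for illustration and do not come from this study.

```python
def ground_sample_distance(altitude_m, sensor_width_mm, focal_length_mm, image_width_px):
    """Ground distance covered by one pixel (metres per pixel), nadir view."""
    return (altitude_m * sensor_width_mm) / (focal_length_mm * image_width_px)


def object_extent_px(object_size_m, gsd_m_per_px):
    """Approximate side length in pixels of an object with the given ground size."""
    return object_size_m / gsd_m_per_px


# Hypothetical camera and image width (assumptions, not taken from the paper).
SENSOR_WIDTH_MM = 13.2
FOCAL_LENGTH_MM = 8.8
IMAGE_WIDTH_PX = 640
CAR_LENGTH_M = 4.5  # assumed typical car length

for altitude_m in (30, 60, 120):
    gsd = ground_sample_distance(altitude_m, SENSOR_WIDTH_MM, FOCAL_LENGTH_MM, IMAGE_WIDTH_PX)
    car_px = object_extent_px(CAR_LENGTH_M, gsd)
    print(f"{altitude_m:>4} m: GSD = {gsd * 100:.1f} cm/px, car length ~ {car_px:.1f} px")
```

Under these assumptions, each doubling of the altitude doubles the GSD and halves the pixel extent of the same car, so one physical object moves across the fixed 32 × 32 pixel boundary purely because of flight altitude.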
Currently, most mainstream object detection methods, such as the YOLO series [9] and Faster R-CNN [10], are primarily optimized for medium and large-scale objects [11]. Although lightweight end-to-end detectors like RT-DETR [12] demonstrate strong performance in general detection tasks by eliminating post-processing operations such as non-maximum suppression (NMS), their direct application to UAV imagery still exposes several significant structural limitations throughout the detection pipeline, from feature extraction to final prediction.
First, feature submergence tends to occur in the shallow backbone. The standard downsampling strategy (Conv+Norm), employed by backbones such as ResNet18, fails to effectively preserve the fine-grained representations of small objects in the early layers. The irreversible information loss introduced during this process consequently hinders deep feature maps from retaining the critical spatial details necessary for small-object detection [13].
Second, deep features are often overshadowed by background interference. In the absence of direct attention guidance for small objects, genuine targets can easily become obscured within complex backgrounds and cluttered textures. Although the Transformer in RT-DETR is effective in modeling the global context, it still struggles to disentangle the weak signals of extremely small targets from dominant background tokens in its deeper layers. Existing studies have demonstrated that multi-scale feature fusion and adaptive attention mechanisms are essential for overcoming this bottleneck [14].
Finally, the detection head experiences a resolution mismatch. The standard RT-DETR head relies heavily on highly compressed deep semantic features, which lack adequate spatial granularity. Consequently, when confronted with densely distributed small objects, the model frequently demonstrates poor localization accuracy and a high rate of missed detections.
To address these limitations, we propose MSA-DETR, an enhanced RT-DETR framework designed for small-object detection from a UAV perspective. The novelty of this study lies not in defining new object categories but in an architectural redesign engineered to overcome the visual degradation, such as feature submergence and background drowning, that common objects experience in UAV imagery. The architecture incorporates three components at the feature extraction, attention modeling, and detection stages, enhancing multi-scale feature representation, spatial discrimination, and localization accuracy for small targets.
The main contributions of this work are summarized as follows:
We propose a PercepConv module that integrates parallel multi-scale and dilated convolutions with a channel-adaptive mechanism, thereby effectively enlarging the receptive field and enhancing small-object feature representations at a controllable computational cost.
We design a SODAttention module with a dual-branch architecture that jointly models local spatial detail and global context. This design suppresses complex background noise and improves the discriminative power for small targets.
We present a dedicated small-object detection layer that integrates shallow high-resolution features into the detection head. Preserving the subtle texture information that is often discarded in previous studies mitigates the insufficient perception of fine targets in deep semantic features.
2. Related Work
2.1. Object Detection in UAV Scenarios and Its Limitations
Convolutional Neural Networks (CNNs) serve as the foundation for contemporary object detection. Two-stage detectors, including Faster R-CNN [10], Cascade R-CNN [15], and Mask R-CNN [16], typically achieve high localization accuracy. However, their intricate architectures and substantial computational overhead render them less suitable for UAV platforms that require stringent real-time performance. In contrast, one-stage detectors, exemplified by the YOLO series, directly perform category and localization predictions, significantly enhancing inference efficiency and making them broadly applicable to real-time vision tasks [17].
However, object detection in UAV scenarios continues to encounter significant challenges. Factors such as the top-down viewpoint, flight altitude, and variations in imaging scale often result in UAV images displaying small object sizes, dense distributions, weak textures, and complex backgrounds [18]. These characteristics impose heightened demands on the fine-grained representation and multi-scale modeling capabilities of detection models. While conventional CNN-based methods are effective in extracting local patterns, their limited receptive fields constrain their ability to model long-range dependencies [19]. Furthermore, successive downsampling operations within the backbone rapidly diminish the already restricted spatial details of small objects, leading to insufficient representations of small objects in deep features [20]. As a result, existing methods are still susceptible to missed detections and localization errors in complex UAV scenarios.
Therefore, effectively preserving shallow detailed information while maintaining real-time performance and enhancing the representation capability of small objects in complex backgrounds has emerged as a critical challenge in the field of UAV object detection research.
2.2. RT-DETR-Based Improvement Methods
The introduction of Transformers has opened a new avenue for object detection. DETR-based methods utilize self-attention to model global dependencies while reducing reliance on handcrafted priors and complex post-processing, thereby promoting the development of end-to-end detection frameworks [21]. To further enhance convergence speed and computational efficiency, variants such as Deformable DETR [22] and RT-DETR [12] have been proposed. These methods have demonstrated significant potential for real-time detection and have gradually become important research benchmarks in UAV scenarios.
Existing studies on the improvement of RT-DETR have primarily concentrated on enhancing feature fusion, optimizing attention mechanisms, and refining the decoder. For example, BAP-DETR [23] improves interaction efficiency among multi-level features. ED-DETR [24] enhances multi-scale representation through collaborative modeling of convolution and Transformer architectures, as well as an edge-guided branch. Freq-DETR [25] strengthens the representation of weak targets from a frequency-domain perspective, while EAV-DETR [26] enhances the model’s adaptability to viewpoint variations in UAV imagery. Collectively, these methods have significantly broadened the applicability of RT-DETR in complex scenarios.
Nevertheless, existing improvements to RT-DETR have predominantly concentrated on the middle and back ends, while the preservation of small-object information in the front-end backbone remains relatively underexplored. In UAV scenarios, once the shallow details of extremely small objects are irreversibly lost during early downsampling, even sophisticated subsequent fusion or attention mechanisms struggle to recover their original representations [27]. This gap in front-end information preservation has become a critical factor limiting the further enhancement of end-to-end detectors for small-object detection in UAV applications.
2.3. Architectural Innovations and Comparative Analysis of MSA-DETR
To address the aforementioned limitations, this paper proposes MSA-DETR, which performs collaborative optimization across three key aspects: shallow feature preservation, spatial attention modeling, and detection-level design. This approach establishes a comprehensive improvement framework for small-object detection in UAV applications. In comparison to existing RT-DETR variants, the proposed method exhibits distinct differences in the following three areas.
First, in terms of feature preservation, existing methods generally assume that the backbone has already learned relatively complete representations of small objects, thus primarily emphasizing enhancements in the middle and later stages. However, for tiny objects in UAV imagery, critical issues often arise in the shallow stages. To address this, this study introduces the PercepConv module at the front of the backbone. By combining multi-scale parallel convolutions with dilated convolutions, the module expands the effective receptive field and enhances target responses before features become severely compressed, thereby improving the preservation of shallow information. Unlike existing methods that focus on late-stage compensation, the proposed approach emphasizes mitigating detail loss at the source.
Second, in the realm of spatial attention modeling, complex backgrounds often occupy the majority of regions in UAV images, which can significantly interfere with the detection of weak targets. Although conventional global modeling enhances contextual perception, it may inadvertently amplify background noise. Previous studies have demonstrated that saliency priors can improve target responses while mitigating background interference in complex scenes, thus providing effective cues for the perception of weak small objects [28]. Motivated by this, the proposed SODAttention module integrates a local detail branch and a global context branch to adaptively emphasize critical regions while suppressing irrelevant backgrounds, thereby enhancing the model’s discriminative capability in complex scenarios.
Finally, in terms of detection-level design, most existing RT-DETR variants primarily rely on high-level semantic features for prediction. However, the low spatial resolution of these features limits their capacity to accurately characterize the geometric structures and texture details of extremely small objects [29]. To mitigate this issue, a high-resolution small-object detection layer has been incorporated into the detection head, enabling shallow fine-grained information to contribute directly to the final prediction process. Consequently, the proposed method enhances the detection of densely distributed and extremely small objects while minimizing the risk of missed detections.
In summary, although existing studies have made significant progress in feature fusion and decoder optimization, a comprehensive solution to the ongoing challenges of UAV small-object detection—such as shallow detail loss, severe background interference, and insufficient prediction-level resolution—remains elusive. The proposed MSA-DETR addresses these challenges through a collaborative design that enhances front-end features, employs spatial discriminative modeling, and extends detection levels. This approach offers targeted improvements to the small-object detection capabilities of end-to-end detectors in UAV scenarios.
3. Methodology
3.1. Overall Architecture of MSA-DETR
As illustrated in Figure 1, the proposed MSA-DETR continues to adhere to the end-to-end Backbone–Neck–Head detection paradigm. However, to address the core challenges associated with UAV small-object detection, specifically, small object scales, weak visual features, and complex backgrounds, we introduce three key enhancements to the overall architecture. These enhancements focus on small-object feature extraction, multi-scale spatial attention modeling, and detection-layer design. Through these three structural optimizations, the model enhances small-object detection performance while preserving favorable computational complexity.
First, an enhanced feature extraction strategy for small objects is introduced at the backbone stage. In conventional RT-DETR, consecutive downsampling operations and standard convolutional structures tend to cause a rapid loss of shallow fine-grained information [30]. To mitigate this issue, we incorporate an improved convolutional module, termed SmallModule, into the early layers of the backbone. By jointly modeling features with varying receptive fields through parallel multi-scale convolutions and dilated convolutions, we expand the effective receptive field without significantly increasing computational cost. Additionally, a lightweight channel-attention mechanism is integrated to adaptively enhance small-object regions, thereby improving the representation capacity of shallow features for small-scale targets and providing richer detailed information for subsequent multi-scale fusion.
Second, during mid-to-high-level feature modeling, we introduce a multi-scale spatial attention module (SODBlock) designed to enhance the discriminability of small objects in complex backgrounds. This module effectively models spatial dependencies through parallel global-local branches: it captures long-range contextual information to mitigate the limited semantics associated with small objects, while simultaneously suppressing the interference of background noise on target responses. The integration of SODBlock into the deeper layers of the backbone allows for the retention of high-level semantic features, which not only preserves strong abstraction capabilities but also enhances sensitivity to small objects and improves spatial consistency.
Finally, the detection hierarchy in the head is deliberately extended by incorporating an additional small-object detection layer to enhance multi-scale predictions. Given that deep feature maps typically exhibit low spatial resolution, they often fail to adequately preserve the spatial details of extremely small targets. To address this limitation, we introduce shallower high-resolution features into the detection head, building upon the original multi-scale structure. These features are then fused with deep semantic features through the neck module to improve detection accuracy. This design effectively reduces the information loss arising from deep semantic compression concerning small objects, significantly reducing missed detections and allowing more stable performance, particularly in UAV contexts involving dense targets and drastic scale variations.
In summary, MSA-DETR achieves synergistic optimization through three structural improvements. The PercepConv module is added to the backbone to improve the multi-scale feature representation of small objects and prevent the loss of shallow information. In addition, the SODAttention module is employed for spatial attention modeling, integrating local details with global context to suppress interference from complex backgrounds. Furthermore, an additional small-object detection layer is introduced to leverage high-resolution features for enhanced prediction accuracy. Collectively, these designs enhance the performance of UAV small-object detection and its real-time applicability across the entire process, from feature extraction and modeling to detection output.
3.2. PercepConv Module
In UAV small-object detection, certain targets may appear extremely small in images and exhibit significant scale variations, making them challenging to recognize. Furthermore, the ConvNormLayer in the RT-DETR (Res18) backbone is not conducive to detecting small objects [31]. To enhance detection performance for UAV small targets, we propose the PercepConv module, which comprises two primary components: one branch functions as a multi-scale feature processing module, while the other serves as a small-object adaptive detection module. The architecture of the PercepConv module is depicted in Figure 2.
The multi-scale branch utilizes three parallel convolutional paths with kernel sizes of 1 × 1, 3 × 3, and 5 × 5. This design enables the model to concurrently capture UAV targets at various scales and to extract local, mid-range, and broader contextual features. Furthermore, a dilated-convolution branch is introduced, where a dilation rate of 3 effectively enlarges the receptive field with only an acceptable increase in computational complexity.
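The trade-off between receptive field and cost in the branches above can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the dilated branch uses a 3 × 3 kernel and that every branch maps 64 input channels to 64 output channels, neither of which is stated in the text.

```python
def effective_kernel(kernel_size, dilation=1):
    """Spatial extent covered by a (possibly dilated) square convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weights(kernel_size, c_in, c_out):
    """Number of weights in a dense 2D convolution (bias ignored)."""
    return kernel_size * kernel_size * c_in * c_out


CHANNELS = 64  # assumed channel width, for illustration only

# (kernel size, dilation) per branch; the dilated 3 x 3 kernel is an assumption.
branches = {
    "1x1": (1, 1),
    "3x3": (3, 1),
    "5x5": (5, 1),
    "3x3, dilation 3": (3, 3),
}

for name, (k, d) in branches.items():
    extent = effective_kernel(k, d)
    weights = conv_weights(k, CHANNELS, CHANNELS)
    print(f"{name:>16}: covers {extent}x{extent} pixels with {weights} weights")

# A dense kernel with the same 7 x 7 coverage would need far more weights.
print("dense 7x7 weights:", conv_weights(7, CHANNELS, CHANNELS))
```

Under these assumptions, the dilated branch covers a 7 × 7 neighbourhood with the weight count of a 3 × 3 kernel, which is consistent with the claim that the receptive field grows at only a modest cost.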
The other branch is the small-object adaptive detection module, a lightweight channel-attention mechanism that adaptively assigns channel weights to emphasize small-object regions and suppress the background. Consequently, the model maintains its small-object detection performance even in complex scenes. Moreover, additional branches are included to capture the texture and shape information of small targets.
3.3. SODAttention Module
To address the small target size, weak target representation, and strong background interference in UAV small-object detection, we design the SODAttention module. Figure 3 displays the architecture of the module.
The module has a dual-branch parallel structure: after the input features are grouped, they are fed into the GLM (Global–Local Mixing) branch and the ContextBlock branch, respectively. The GLM branch captures spatial dependencies along the horizontal and vertical directions using adaptive pooling along the X and Y axes, and utilizes a sigmoid gating mechanism to perform cross-dimensional feature recalibration. A combination of GroupNorm and MaxPooling–Softmax is subsequently applied to enhance global context modeling, so that long-range spatial information can be effectively aggregated to complement the limited local features of small targets.
Conversely, the ContextBlock branch uses an attention-based spatial pooling strategy to enhance global contextual features. By means of a channel-wise additive fusion mechanism, it calibrates the features to enhance the responses to small objects and suppress interference from the complex background. After the two branches are added together and projected by convolution, the fused representation retains local details and incorporates global semantics, thus enabling accurate localization and robust recognition of small targets from the UAV viewpoint.
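The directional pooling and sigmoid gating described for the GLM branch can be illustrated on a single-channel toy feature map. The sketch below is a deliberately simplified stand-in, not the module itself: it omits grouping, the convolutions, GroupNorm, the MaxPooling–Softmax step, and the ContextBlock branch, and the function name `directional_gate` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def directional_gate(feature):
    """Re-weight a 2D map with gates derived from row-wise and column-wise average pooling."""
    height, width = len(feature), len(feature[0])
    # Pool along the X axis: one descriptor per row.
    row_means = [sum(row) / width for row in feature]
    # Pool along the Y axis: one descriptor per column.
    col_means = [sum(feature[i][j] for i in range(height)) / height for j in range(width)]
    # Sigmoid gates from both directions are multiplied onto each position.
    return [
        [feature[i][j] * sigmoid(row_means[i]) * sigmoid(col_means[j]) for j in range(width)]
        for i in range(height)
    ]


# Toy map: uniform background of 1.0 with one stronger response at row 1, column 2.
toy_map = [[1.0] * 4 for _ in range(3)]
toy_map[1][2] = 5.0

gated = directional_gate(toy_map)
contrast_before = toy_map[1][2] / toy_map[0][0]
contrast_after = gated[1][2] / gated[0][0]
print(f"target/background contrast: {contrast_before:.2f} -> {contrast_after:.2f}")
```

In this toy case the row and column containing the strong response receive larger gates than the background positions, so the target-to-background contrast increases. This is the qualitative behaviour the text attributes to cross-dimensional recalibration.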
3.4. P2 Small-Object Detection Layer
To address the “feature drowning” of small targets, we add a small-object detection layer that offsets the high-stride downsampling in the deep layers of RT-DETR (Res18). The layer integrates high-resolution feature maps from the shallower stages of the backbone, thereby preserving fine-grained texture and geometric localization cues that are absent from deep features. Through a cross-scale feature fusion pathway, the model fuses shallow spatial details with deeper semantic features, compensating for the inability of single-scale features to perceive extremely small objects. This design is intended to enhance the effective receptive field of the detection head, which in turn lowers the miss rate in dense and occluded scenes and improves the detection of extreme-scale targets from UAV viewpoints.
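A short calculation shows why a higher-resolution level helps. The sketch below assumes a 640 × 640 input and the conventional pyramid strides of 4, 8, 16, and 32 for levels P2 to P5; both are assumptions for illustration, since the text does not list the input size or strides.

```python
INPUT_SIZE = 640  # assumed square input resolution
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}  # conventional strides, assumed


def feature_map_size(input_size, stride):
    """Side length of the feature map at a given stride."""
    return input_size // stride


def cells_covered(object_side_px, stride):
    """Number of feature-map cells spanned by one side of an object."""
    return object_side_px / stride


OBJECT_SIDE_PX = 16  # example small object, 16 x 16 pixels in the input image

for level, stride in STRIDES.items():
    size = feature_map_size(INPUT_SIZE, stride)
    span = cells_covered(OBJECT_SIDE_PX, stride)
    print(f"{level}: {size}x{size} map, 16 px object spans {span:g} cell(s) per side")
```

With these numbers, a 16-pixel object spans four cells per side at P2 but less than one cell at P5, where its response is merged with the surrounding background.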
4. Experiments
4.1. Experimental Datasets
In this study, we mainly use the VisDrone2019 dataset [25], developed by the Machine Learning and Data Mining Laboratory, Tianjin University, China. The dataset was collected with cameras mounted on various unmanned aerial vehicles (UAVs) under diverse real-world scenarios. It encompasses ten object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. It comprises a total of 10,209 images, with 6471 designated for training, 548 for validation, and 3190 for testing. Figure 4 presents the category distribution histogram alongside the object-size scatter plot. The scatter distribution indicates that the majority of instances within VisDrone2019 are clustered within a relatively small-scale range, highlighting a significant prevalence of small objects. These dataset characteristics reinforce the need for modeling and optimization tailored to small-scale targets and support the design of the proposed small-object detection architecture.
The HIT-UAV dataset was utilized for the generalization experiment. This dataset comprises infrared thermal imaging specifically designed for UAV high-altitude perspectives, encompassing a variety of scenes, including schools, roads, playgrounds, and parking lots. For the purpose of object detection, the dataset includes five annotated categories: person, car, bicycle, motorcycle, and other vehicles. In total, the HIT-UAV dataset contains 2898 thermal infrared images, which are partitioned into 2029 training images, 290 validation images, and 579 test images.
Representative examples from the dataset are illustrated in Figure 5.
4.2. Experimental Setup
All architectural modifications and training experiments were conducted on a high-performance computing server, the specifications of which are detailed in Table 1. The hyperparameters used for training are summarized in Table 2.
4.3. Evaluation Metrics
To quantitatively evaluate the effectiveness of the improved model for small-object detection, this study employs precision (P), recall (R), and mean average precision (mAP) as the evaluation metrics.
Precision (P) is a metric used to assess the reliability of a model’s predictions. It is defined as the ratio of true positives to the total number of samples predicted as positives, and its mathematical formulation is given by:
$$P = \frac{TP}{TP + FP}$$
Here, $TP$ denotes the number of correctly detected objects, whereas $FP$ represents the number of false detections. Accordingly, a higher precision generally indicates a lower risk of false positives.
Recall (R) is a metric used to evaluate a model’s ability to retrieve true objects. It is defined as the ratio of correctly detected instances to the total number of ground-truth positive samples. The mathematical formulation of recall is as follows:
$$R = \frac{TP}{TP + FN}$$
Here, $TP$ denotes the number of correctly detected objects, whereas $FN$ represents the number of missed detections. Accordingly, a higher recall indicates a lower risk of false negatives.
Average Precision (AP) quantifies the detection performance for a single class by calculating the area under the precision–recall curve. Mean Average Precision (mAP) is defined as the average of AP values across all classes, with its mathematical formulation provided as follows:
$$\mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C} AP_c$$
Here, $C$ denotes the total number of object categories within the detection framework, while $AP_c$ represents the average precision for class $c$. To facilitate a more comprehensive evaluation, two mAP metrics are employed: mAP@50 refers to the mean average precision calculated at a fixed IoU threshold of 0.5, whereas mAP@50–95 is obtained by averaging AP over multiple IoU thresholds, ranging from 0.5 to 0.95 with a step size of 0.05.
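The metrics defined above can be sketched in a few lines. The example below is a minimal illustration, not the evaluation code used in this study: it takes detections that have already been matched to ground truth at a given IoU threshold, and it integrates the precision–recall curve with all-point interpolation, whereas the COCO toolkit samples the curve at 101 recall levels. The toy detections are invented.

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(scored_hits, num_gt):
    """Area under the precision-recall curve for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, already matched
    to ground truth at a fixed IoU threshold. num_gt: number of ground-truth boxes.
    """
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_hit in ranked:
        if is_hit:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    area, previous_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


def mean_average_precision(ap_per_class):
    """mAP = (1 / C) * sum of AP_c over all C classes."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# IoU thresholds averaged for mAP@50-95: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Invented toy detections for one class with two ground-truth boxes.
toy_hits = [(0.9, True), (0.8, False), (0.7, True)]
toy_ap = average_precision(toy_hits, num_gt=2)
print(f"toy AP = {toy_ap:.4f}")
```

For mAP@50–95, the same per-class AP would be recomputed with matches derived at each threshold in `IOU_THRESHOLDS` and then averaged.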
To comprehensively evaluate the capability of the proposed model for tiny object detection in complex UAV scenarios, this study adheres to the COCO evaluation protocol and reports the average precision (AP) for small objects. This metric is precisely defined as the detection accuracy for objects with an area not exceeding 32 × 32 pixels. While the physical representation of these small objects in real-world UAV flights varies continuously with the Ground Sample Distance (GSD), we adhere strictly to the static COCO protocol for APs to ensure a standardized, objective, and fair quantitative comparison with other baseline models. Furthermore, the computational complexity of the model, including floating-point operations (GFLOPs) and the number of parameters (Params), is reported to assess its computational efficiency and practical applicability.
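The size criterion can be made concrete with a small helper. The sketch below follows the wording above (area not exceeding 32 × 32 pixels counts as small) and uses the usual COCO upper bound of 96 × 96 pixels for medium objects; the function name `coco_size_bucket` is hypothetical and not part of any toolkit.

```python
SMALL_MAX_AREA = 32 * 32    # upper area bound for "small" objects, in pixels
MEDIUM_MAX_AREA = 96 * 96   # upper area bound for "medium" objects, in pixels


def coco_size_bucket(width_px, height_px):
    """Assign a box to a COCO-style size bucket based on its pixel area."""
    area = width_px * height_px
    if area <= SMALL_MAX_AREA:
        return "small"
    if area <= MEDIUM_MAX_AREA:
        return "medium"
    return "large"


for box in [(16, 16), (32, 32), (8, 100), (40, 40), (100, 100)]:
    print(box, "->", coco_size_bucket(*box))
```

Because the criterion is based on area rather than side length, an elongated 8 × 100 box still counts as small, which matters for thin targets such as pedestrians and bicycles seen from above.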
4.4. Comparative Experiments
4.4.1. Comparison with the Baseline RT-DETR (Res18) Model
To verify the effectiveness of the proposed improvements, we conduct a comparative analysis between the baseline RT-DETR (Res18) and MSA-DETR on the VisDrone2019 validation set. According to the results shown in Table 3, MSA-DETR consistently outperforms the baseline. The larger gains in small-object categories, e.g., Car, Pedestrian, and Bicycle, reflect better fine-grained feature extraction and small-object perception. Meanwhile, large-object categories (Bus) also improve steadily, indicating that the improvement in small-object detection is not achieved at the expense of large-object accuracy. Overall, the improved model increases mAP@50 from 47.5% to 52.2% in UAV scenarios.
Figure 6 depicts the experimental results of RT-DETR (Res18) and MSA-DETR. The comparison of the curves indicates that the modified model offers better overall performance for UAV object detection. MSA-DETR reaches a higher peak on the F1-confidence curve and has a more extended stable region, indicating that it maintains a favorable precision–recall balance over a broad range of confidence settings. Furthermore, the area enclosed by its precision–recall curve is larger; in particular, People and Bicycle show greater improvement. The averaged curve also indicates that the mAP@50 of MSA-DETR is substantially higher than that of the baseline model, suggesting that the proposed strategies effectively enhance detection performance for small objects and complex scenes.
4.4.2. Comparison with Other Mainstream Object Detection Models
To comprehensively validate the detection performance and practical utility of the proposed approach, this section conducts a comparative analysis between the improved model and several state-of-the-art object detection methods. Under consistent experimental settings applied to the same dataset, we compare the proposed method with representative mainstream object detectors, including two-stage detectors (Faster R-CNN, Cascade R-CNN, and Mask R-CNN), one-stage YOLO-family models, and DETR-based approaches (DETR, RT-DETR (Res18), RT-DETR (Res34), and UAV-DETR). The detailed comparison results are summarized in Table 4.
Comparative results on the VisDrone2019 dataset indicate that different detection paradigms exhibit significant performance disparities in UAV scenarios. Both two-stage methods and the one-stage YOLO series demonstrate notable limitations in localization and generalization when addressing small-scale and densely distributed objects. In contrast, the proposed MSA-DETR achieves superior overall detection performance. Compared to the baseline RT-DETR, MSA-DETR enhances mAP@50 and mAP@50–95 to 52.2% and 33.2%, respectively, while also increasing the small-object metric APs to 20.3%. This improvement verifies its effectiveness in tiny-object perception and its robustness against complex background interference.
The comparative results indicate that, despite a slight reduction in the parameter size of MSA-DETR to 18.86 MB, its computational complexity increases significantly, from 57 GFLOPs for the baseline model to 79.4 GFLOPs. This increase is an inherent trade-off in addressing small-object perception and arises primarily from the heavier use of shallow high-resolution features. Specifically, the newly introduced small-object detection layer directly integrates shallow feature maps with larger spatial resolutions into the final prediction stage, while the PercepConv module operates on high-resolution inputs in the early backbone. Because the cost of convolution and attention operations scales with the resolution of the feature maps, processing these low-level features inevitably results in a substantial increase in floating-point operations. Considering the improvements across all accuracy metrics, this strategy is reasonable and practical for UAV vision tasks, where the tolerance for missed detections is extremely limited.
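The link between feature-map resolution and cost can be made explicit with a back-of-the-envelope calculation. The sketch below counts one multiply and one add per weight per output position for a dense convolution; the 3 × 3 kernel, 64 channels, and the 160 × 160 and 80 × 80 map sizes are assumptions for illustration. Only the 57 and 79.4 GFLOPs totals come from the comparison above.

```python
def conv_flops(kernel_size, c_in, c_out, out_height, out_width):
    """Approximate FLOPs of a dense 2D convolution (2 operations per multiply-accumulate)."""
    return 2 * kernel_size * kernel_size * c_in * c_out * out_height * out_width


# Same hypothetical 3x3, 64-to-64 channel layer applied at two resolutions.
flops_stride4 = conv_flops(3, 64, 64, 160, 160)  # stride-4 map of a 640x640 input
flops_stride8 = conv_flops(3, 64, 64, 80, 80)    # stride-8 map of a 640x640 input
print(f"stride 4 vs stride 8 cost ratio: {flops_stride4 / flops_stride8:.1f}x")

# Relative increase in total model cost reported in the comparison.
BASELINE_GFLOPS = 57.0
MSA_DETR_GFLOPS = 79.4
relative_increase = (MSA_DETR_GFLOPS - BASELINE_GFLOPS) / BASELINE_GFLOPS
print(f"relative GFLOPs increase: {relative_increase * 100:.1f}%")
```

Halving the stride quadruples the number of spatial positions, so an identical layer costs four times as much on the shallower map. The reported totals correspond to a relative increase of roughly 39%.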
4.5. Ablation Studies
To verify the effectiveness and complementarity of the improved modules in MSA-DETR, ablation experiments were conducted on the VisDrone2019 dataset. Using RT-DETR (Res18) as the baseline, three core structural improvements were progressively introduced for systematic evaluation. Specifically, Module A refers to the PercepConv module, which enhances multi-scale feature representation and mitigates shallow information loss. Module B denotes the SODAttention module, which reduces complex background interference through joint local-global feature modeling. Module C pertains to the newly introduced small-object detection layer, which integrates high-resolution features to address the perceptual deficiencies of deep networks.
A controlled-variable strategy was employed to evaluate the independent contribution of each module and the feature compatibility of various pairwise combinations. Ultimately, the comprehensive MSA-DETR model, which integrates all three enhancements, was assessed. The detailed network configurations and quantitative evaluation results for all ablation settings are summarized in Table 5.
The experimental results demonstrate that each individual module improves detection performance. Notably, the small-object detection layer, which incorporates shallow high-resolution features, yields the most substantial improvement, raising mAP@50 and APs to 51.2% and 19.7%, respectively. Combining modules further illustrates the complementarity of the proposed architecture. Specifically, the integration of PercepConv with the small-object detection layer couples multi-scale feature extraction with high-resolution prediction, increasing APs to 23.0%. In contrast, the integration of SODAttention and the small-object detection layer achieves the highest precision of 64.8%, owing to its capabilities in feature disentanglement and background suppression. Ultimately, the full model that integrates all three enhancements achieves a mAP@50 of 52.2% and an APs of 20.3%. Although its APs is lower than that of the PercepConv plus detection-layer variant (20.3% versus 23.0%), which is optimized more aggressively for small objects, SODAttention is crucial for spatial regularization. By filtering out complex background noise, it enables the model to maintain a robust balance among precision, recall, and cross-domain generalization ability.
4.6. Generalization Experiment
To further validate the generalization capability and robustness of the proposed MSA-DETR across various imaging modalities and application scenarios, we conducted cross-domain generalization experiments using the HIT-UAV infrared thermal imaging dataset. The HIT-UAV dataset is specifically designed for high-altitude perspectives captured by UAVs. Unlike the visible-light images in the VisDrone2019 dataset, infrared thermal images are characterized by a lack of rich color information and fine surface textures, with targets typically manifesting as discrete bright thermal spots. Additionally, these images face challenges such as extremely small object scales and complex background interference, particularly under high-altitude viewpoints.
As shown in Table 6, the proposed MSA-DETR generalizes well to the HIT-UAV dataset. In infrared thermal imaging scenarios, which lack fine textures and rich color information, the enhanced model achieves 75.6% mAP@50 and 49.8% mAP@50–95, surpassing the baseline RT-DETR (Res18) by 1.2 and 1.8 percentage points, respectively.
Precision (P) improves markedly, rising by 5.0 percentage points from 83.1% for the baseline to 88.1%. Recall (R) remains nearly stable at 70.9%, a decrease of only 0.5 percentage points. These findings suggest that the SODAttention module mitigates complex background noise in infrared images, reducing false detections while largely preserving detection capability. The result is a more favorable balance between precision and recall.
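One way to make the precision-recall balance concrete is the F1 score, the harmonic mean of the two. The paper does not report F1, so the sketch below is only a derived illustration. It assumes the 0.5 recall decrease is measured in percentage points, which puts the baseline recall at 71.4%; the function name `f1_score` and the variable names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the text for HIT-UAV (percent).
msa_p, msa_r = 88.1, 70.9
base_p = 83.1
# Assumption: the 0.5 recall drop is in percentage points, so baseline R = 71.4.
base_r = msa_r + 0.5

base_f1 = f1_score(base_p, base_r)
msa_f1 = f1_score(msa_p, msa_r)
print(f"baseline F1: {base_f1:.1f}")
print(f"MSA-DETR F1: {msa_f1:.1f}")
```

Under these assumptions the derived F1 rises from roughly 76.8 to 78.6, which is consistent with the claim that the precision gain outweighs the small loss in recall.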
For extremely small objects, MSA-DETR achieves an APs of 40.1%, an improvement of 1.1 percentage points over the baseline model. This suggests that the multi-scale parallel convolution design of PercepConv helps compensate for the weak feature representation of infrared small objects. Overall, the quantitative results indicate that the proposed MSA-DETR architecture is robust and adapts well across imaging modalities and challenging background conditions.
4.7. Visualization Results
To visually evaluate the detection effectiveness of MSA-DETR across various imaging modalities and complex scenarios, this section presents a comparative visualization analysis of the detection results on the VisDrone2019 visible-light dataset and the HIT-UAV infrared thermal imaging dataset. Additionally, Grad-CAM heatmaps are utilized to further investigate the feature response mechanism of the model.
Figure 7 compares the typical detection results of the baseline RT-DETR (Res18) and MSA-DETR across two different datasets. The first two rows present the visible-light scene results from the VisDrone2019 dataset. In complex environments, such as urban street scenes, commercial pedestrian areas, and regions with densely distributed targets, the baseline model is significantly affected by background interference and mutual occlusion. This leads to disordered overlapping bounding boxes in crowded areas and missed detections of small objects. In contrast, MSA-DETR demonstrates superior detection completeness and robustness in these scenarios. Benefiting from the introduction of a multi-scale adaptive attention mechanism, the improved model can more accurately distinguish highly dense crowd targets while maintaining higher confidence and clearer bounding box contours for small-scale and partially occluded objects.
The last two rows of Figure 7 illustrate the cross-domain generalization performance on the HIT-UAV infrared thermal imaging dataset. Infrared images lack rich color information and fine texture details, and targets often appear merely as discrete bright thermal spots. These images are further challenged by extremely small object scales and complex background thermal noise, particularly under high-altitude viewpoints. Under these demanding conditions, the baseline model misses many targets. In contrast, MSA-DETR localizes and delineates targets accurately. The SODAttention module mitigates complex background noise in infrared images, thereby reducing false detections. Furthermore, the multi-scale parallel convolution design of PercepConv addresses the weak feature representation of small infrared objects, allowing the model to maintain stable perception in cross-domain scenarios.
To examine the performance improvements from a feature perspective, Figure 8 compares the Grad-CAM heatmaps of the two models across different imaging modalities. The first two rows depict visible-light scenes from VisDrone, while the last row shows infrared scenes from HIT-UAV. The baseline model exhibits dispersed activations and indistinct boundaries, which makes it susceptible to localization drift on dense or small objects, and it tends to assign elevated responses to background areas such as roads and buildings. In contrast, MSA-DETR focuses its features far more tightly. We attribute this improvement to the aggregation of multi-scale information through PercepConv and the global-context modeling enabled by SODAttention, which together yield more compact high-response regions. Under both visible-light and infrared modalities, the high-response regions of MSA-DETR cover core target regions, such as vehicles and pedestrians, while suppressing environmental noise. These visual results support the conclusion that the proposed model alleviates the loss of fine-grained information in deep networks and improves the feature discriminability and localization accuracy of small objects in complex scenarios.
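For readers unfamiliar with Grad-CAM, the heatmap computation can be sketched as follows. This is a minimal, dependency-free illustration of the standard Grad-CAM formula (channel weights from spatially averaged gradients, a weighted sum of activation maps, then ReLU). It is not the authors' visualization code: the paper does not state which layer or target score was used, so the function name `grad_cam` and the toy inputs are assumptions.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from one layer's activations and gradients.

    Both arguments are nested lists shaped [channels][height][width];
    gradients holds d(score)/d(activation) for the chosen target score.
    """
    channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of the gradients over spatial positions.
    weights = [
        sum(sum(row) for row in gradients[k]) / (height * width)
        for k in range(channels)
    ]

    # Weighted sum of activation maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy input (hypothetical): two channels on a 2 x 2 feature map.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In the toy input the second channel receives a negative weight, so positions where it dominates are zeroed by the ReLU. A compact heatmap therefore means that few positions carry positive evidence for the target score.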
5. Conclusions
This study proposes MSA-DETR, an end-to-end detection model enhanced by multi-scale spatial attention for tiny object detection in UAV imagery. The proposed architecture is specifically redesigned to address the challenges associated with UAV-based small-object detection. In the backbone, PercepConv effectively enhances multi-scale feature modeling. SODAttention improves the disentanglement of spatial features and suppresses background interference. Moreover, the newly introduced small-object detection layer fully exploits shallow high-resolution fine-grained features. On the VisDrone2019 visible-light dataset, MSA-DETR achieves 52.2% mAP@50 and 33.2% mAP@50–95, while reaching 20.3% on the key small-object metric APs.
Cross-domain generalization experiments further support the robustness of the proposed architecture. On the HIT-UAV infrared thermal imaging dataset, which lacks fine texture information, MSA-DETR maintains stable perception, achieving 75.6% mAP@50 and 40.1% APs for infrared small objects. Together with the visualization analysis of feature heatmaps, the results suggest that the proposed model can provide accurate localization even under densely distributed targets and significant background thermal noise. Overall, MSA-DETR mitigates the bottleneck of end-to-end frameworks in extracting extreme-scale features while maintaining a reasonable computational cost. The proposed method offers an effective solution for UAV tiny-object detection across imaging modalities and a promising basis for the practical deployment of high-precision visual perception systems in the low-altitude economy.