Article

An Efficient Aerial Image Detection with Variable Receptive Fields

by Wenbin Liu 1, Liangren Shi 1 and Guocheng An 2,*
1 School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai 201100, China
2 Artificial Intelligence Research Institute of Shanghai Huaxun Network System Co., LTD., Chengdu 610074, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2672; https://doi.org/10.3390/rs17152672
Submission received: 16 June 2025 / Revised: 17 July 2025 / Accepted: 29 July 2025 / Published: 2 August 2025
(This article belongs to the Special Issue Deep Learning Innovations in Remote Sensing)

Abstract

This article presents VRF-DETR, a lightweight real-time object detection framework for aerial remote sensing images, aimed at addressing the insufficient receptive fields for easily confused categories caused by differences in height and perspective. Based on the RT-DETR architecture, our approach introduces three key innovations: the multi-scale receptive field adaptive fusion (MSRF2) module replaces the Transformer encoder with parallel dilated convolutions and spatial-channel attention to dynamically adjust receptive fields for confusing objects; the gated multi-scale context (GMSC) block reconstructs the backbone with attention-gated convolution (AGConv), reducing parameters while enhancing multi-scale feature extraction; and the context-guided fusion (CGF) module optimizes feature fusion via context-guided weighting to resolve multi-scale semantic conflicts. Evaluations were conducted on both the VisDrone2019 and UAVDT datasets, where VRF-DETR achieved an mAP50 of 52.1% and an mAP50-95 of 32.2% on the VisDrone2019 validation set, surpassing RT-DETR by 4.9% and 3.5%, respectively, while reducing parameters by 32% and FLOPs by 22%. It maintains real-time performance (62.1 FPS) and generalizes effectively, outperforming state-of-the-art methods in the accuracy-efficiency trade-off for aerial object detection.

1. Introduction

Remote sensing object detection is a critical branch in the field of computer vision, which has been widely applied to urban planning, environmental monitoring, agricultural management, disaster response, and other fields [1]. Such images are typically acquired by aerial platforms such as satellites or drones, providing a unique perspective of the surface of the Earth from above. Real-time detection is required in some scenarios. However, real-time object detection in aerial remote sensing images faces severe challenges. These include dense small targets, drastic target scale changes caused by altitude variations, complex backgrounds, and occlusions [2]. Another challenge is the contradiction between pursuing accuracy and real-time performance.
However, general real-time object detection frameworks have significant limitations in the above complex aerial remote sensing scenarios. Current general real-time object detection frameworks mainly include the single-stage YOLO series [3,4,5,6] and the end-to-end Transformer-based DETR series [7,8,9]. YOLO series detectors directly generate dense predictions for target categories and locations on feature maps through fully convolutional networks and anchor box mechanisms [10]. Although this improves speed, the scale sensitivity of predefined anchor boxes makes it difficult to effectively cover multi-scale targets in remote sensing image scenarios with drastic target scale changes [11]. DETR series models directly output results through encoder–decoder architectures. They eliminate the need for anchor boxes and enhance global perception capabilities. This is beneficial for handling complex background interference. However, the high computational complexity of the Transformer sharply conflicts with the high-resolution characteristics of remote sensing images, leading to slow convergence on small targets [12]. The recently proposed real-time Transformer-based detector RT-DETR [13] accelerates model convergence and reduces model parameters by using a single feature map combined with a feature fusion network, achieving a certain balance between speed and accuracy. However, the global attention mechanism of the Transformer has insufficient feature extraction capabilities for tiny targets and is prone to interference in remote sensing scenarios with complex backgrounds [14].
To compensate for the shortcomings of general real-time detection frameworks, researchers have conducted extensive studies, mainly focusing on enhancing the detection capability of small targets and addressing complex backgrounds and occlusions. Existing studies mainly achieve the above goals by constructing cross-level feature fusion architectures and implementing multi-scale feature enhancement strategies. Cross-level feature fusion can significantly enhance the representation capability of the model for small targets by improving feature complementarity between levels. A typical solution is to build a cross-layer feature interaction network, which combines multi-level features to enhance small target representation in various environments [15]. For example, CSFF [16], Attention-guided BiFPN [17], and AGMF-Net [18] have constructed multi-level feature fusion mechanisms. They use cross-layer feature interactions, dynamic sparse attention fusion guidance, and global multi-scale semantic association, respectively. These mechanisms significantly improve the semantic representation capability of small targets in complex backgrounds. Multi-scale feature enhancement can further optimize the extracted feature semantic information. A representative strategy in this direction is to use attention mechanisms to help the model focus on important regions of the image and reduce the impact of complex backgrounds. For example, CSAM [19], RDIAN [20], and AGCB [21] have significantly improved the detection accuracy of small targets in complex backgrounds. They have also effectively suppressed background noise and enhanced target semantic information. They achieve this through channel and spatial attention, multi-directional guidance, and local–global context association, respectively. In addition, using large kernel convolutions [22,23,24] to expand the receptive field and obtain sufficient context is also a commonly used feature enhancement method. Since large kernel convolutions greatly increase the number of parameters, researchers have proposed using parallel dilated convolutions [25,26] and convolutional decomposition mechanisms [27] instead of large kernel convolutions to reduce parameters and expand spatial context coverage.
We found that aerial remote sensing images, especially those captured by drones, exhibit significant perspective and height differences due to the non-fixed camera angle and height, resulting in the coexistence of intra-class variation and inter-class similarity. Specifically, some objects require receptive fields of different sizes from objects of the same category. As shown in Figure 1, motors with and without sunshade have very different appearances, while motors with sunshade and awning-tricycle have certain similarities, making them difficult to distinguish from some specific perspectives. As shown in Figure 2, the impact of different altitudes is also huge: pedestrians (standing) and people (non-standing) that are easy to distinguish in low-altitude images are difficult to distinguish when shooting at high altitudes due to the decrease in resolution. To address these unique challenges, an organic fusion mechanism is required that combines multi-scale spatial perception and adaptive attention allocation to dynamically adjust the receptive fields of different objects in response to varying conditions in remote sensing images, such as viewing angles, altitudes, and occlusion levels. However, existing studies still fall short in addressing these issues effectively. For instance, cross-level feature fusion methods like CSFF and AGMF-Net enhance small target representation through complex feature interaction. Still, their intricate architectures lead to a significant increase in parameters, which hinders real-time performance critical for aerial platforms. Meanwhile, multi-scale feature enhancement strategies, such as CSAM and RDIAN, rely on static attention mechanisms. These mechanisms fix the receptive field size, making the strategies fail to adapt to perspective changes. For example, they cannot expand the receptive field when distinguishing high-altitude pedestrians from non-pedestrians (as shown in Figure 2). Nor can they shrink it for clear low-altitude targets. Even methods aiming to expand receptive fields, such as those using large kernel convolutions or parallel dilated convolutions, have limitations: large kernels drastically increase parameters, while existing parallel dilated convolution designs lack adaptive spatial selection, making it hard to capture context dependencies between targets and their surroundings (e.g., distinguishing motor with sunshade from awning-tricycle in occluded perspectives requires focusing on both local details and global context). Thus, there is an urgent need for a lightweight network that can dynamically adjust receptive fields, balancing parameter efficiency to maintain real-time performance and adaptively tune the attention range based on target characteristics and environmental conditions. Current static or parameter-heavy methods do not provide these capabilities.
To address the above issues in remote sensing images, we propose a lightweight network VRF-DETR that dynamically adjusts the receptive field of detected objects based on background information. Its key innovations target the two critical limitations of existing methods. To ensure real-time performance while balancing parameter efficiency, VRF-DETR builds on the RT-DETR framework and retains its end-to-end detection via the Hungarian algorithm to avoid complex post-processing. Specifically, to tackle the inability of static mechanisms to adaptively tune receptive fields, we design the multi-scale receptive field adaptive fusion (MSRF2) module: by integrating spatial attention, channel attention, and Multi-Head Self-Attention through parallel convolutions with varying dilation rates, it enables dynamic adjustment of receptive field sizes—expanding to capture global context for high-altitude ambiguous targets and shrinking to focus on local details for clear low-altitude objects. For lightweight design to maintain real-time capability, we further propose the attention-gated convolution (AGConv), which leverages the Gated Linear Unit (GLU) idea [28] with attention mechanisms for branch feature extraction, drastically reducing parameters while preserving high-quality multi-scale features. Combining AGConv and MSRF2, we develop the gated multi-scale context (GMSC) block, which replaces the bottleneck of C2f in the YOLOv8 backbone to form the RS-Backbone—reducing parameters by 32% compared to RT-DETR’s original ResNet-18 and enhancing computational efficiency. Finally, to resolve redundant and conflicting information from direct feature fusion, the context-guided fusion (CGF) module replaces simple concatenation/addition in the neck, improving cross-scale feature complementarity via adaptive weighting. Our contributions are summarized as follows:
  • Aiming at the above challenges in aerial remote sensing images, we design a lightweight real-time object detection framework VRF-DETR with variable receptive fields based on the RT-DETR framework. Compared with some real-time detection benchmark models and state-of-the-art (SOTA) methods, VRF-DETR can provide adaptive receptive fields and achieve better performance while ensuring real-time performance.
  • To solve the confusion caused by the difference in height and viewing angle, we propose the MSRF2 module. The module integrates parallel dilated convolution and attention mechanism, which can adaptively adjust the receptive field according to the context information so that the model can dynamically capture multi-scale features and improve the detection accuracy of small targets and occluded targets.
  • We design the AGConv module and GMSC block to reconstruct the backbone network, where AGConv uses gated attention to optimize feature extraction, and GMSC combines MSRF2 and AGConv to reduce parameters by 46% while enhancing multi-scale feature representation.
  • The CGF module is proposed to optimize the neck network, leveraging dynamic weight allocation to alleviate semantic conflicts in multi-scale feature fusion, thereby improving the complementarity of features across different scales with minimal computational overhead.

2. Related Work

2.1. Receptive Field Enhancement in Remote Sensing

Acquiring a sufficient receptive field can significantly improve the accuracy and robustness of target detection tasks, especially for small or complex targets in remote sensing images. Thus, researchers have developed numerous methods to obtain an adequate receptive field. Current approaches for enhancing the receptive field can be categorized into direct spatial domain expansion and implicit dynamic enhancement based on the former.
Direct spatial domain expansion achieves receptive field enhancement by directly expanding the physical coverage of convolution operations, essentially influencing the receptive field boundaries in the spatial domain through the mathematically defined size of convolution kernels. The most intuitive way is to enlarge the convolution kernel: RepLKNet [22] significantly enhances detection performance by introducing a 31 × 31 convolution kernel; SLaK [23] and PeLK [24] further expand the convolution kernel to 51 × 51 and 101 × 101. However, directly using large convolution kernels can obtain a large receptive field but leads to an excessively large number of parameters, making it difficult to ensure real-time detection performance. With the proposal of depthwise separable convolution [29,30], more models have begun to use DSC to acquire multi-scale features while reducing the number of parameters. The MobileNet [30] reduces the parameter count and improves computational efficiency by splitting ordinary 3 × 3 convolutions, but its receptive field is too small to accurately identify small targets. ConvNeXt [31] further expands the receptive field by increasing the size of depthwise separable convolution kernels to 7 × 7, but it has insufficient multi-scale feature extraction capability. Xception [29] adopts a parallel architecture to perform spatial relationship mapping for each channel, obtain multiple output channels, and concatenate them, integrating receptive fields of different sizes obtained from each channel while significantly reducing the number of parameters. In addition to depthwise separable convolution, using dilated convolution is also a common efficient method to expand the receptive field. The ASPP of the Deeplab [32] replaces ordinary convolutions with dilated convolutions and combines a parallel architecture of convolutions to obtain a larger multi-scale receptive field, effectively capturing the multi-scale contextual information of objects. Subsequent works attempt to combine depthwise separable convolution and dilated convolution in a parallel architecture and achieve good results [33], and the receptive field expansion strategy of MSRF2 adopts this architecture.
Implicit dynamic enhancement is usually combined with direct spatial domain expansion methods, indirectly expanding the effective receptive field through feature fusion, feature focusing, or local–global relationship modeling. Its core lies in using network structure design or dynamic weight allocation to enable local operations to have global or cross-regional information integration capabilities. The most common approach is the combination of ASPP and attention mechanisms [34,35,36]. RAANet [34] obtains more semantic information at multiple scales by embedding residual structures and optional attention modules in ASPP. RCCT-ASPPNet [35] more efficiently extracts spatial and channel information by combining ASPP with the Convolutional Block Attention Module (CBAM). CMTF-Net [36] combines depthwise separable convolution with a multi-head self-attention mechanism on the basis of ASPP, effectively extracting rich multi-scale global contextual information. VoVNetV2 [37] expands the receptive field by concatenating multi-branch convolutions using the idea of RFBNet [38] and further enhances features with the proposed eSE attention mechanism. ARFE-Net [39] replaces concatenation with an attention-based selection module on the basis of VoVNetV2, which can adaptively adjust the receptive field size according to the multi-scale information of input data. In aerial scenarios, implicit dynamic enhancement methods address dense targets and scale variations through refined implicit receptive field expansion. ClusDet [40] handles dense overlaps by refining inter-object feature associations via clustering, implicitly expanding receptive fields. DMNet [41] and CDMNet [42] enhance crowded small targets through density map-guided dynamic weighting of contextual features. UFPMP-Det [43] adapts to aerial scale variations by integrating local–global features via optimized fusion. AD-Det [44] balances tail classes in uneven distributions through focused dynamic enhancement of critical regions. YOLC [45] strengthens tiny dense targets by aggregating clustered features, implicitly integrating cross-region information to expand effective receptive fields. However, its selective convolution module mainly relies on channel weight allocation and lacks the ability to dynamically focus on fine-grained features in the spatial dimension. Although these methods have been effective in acquiring multi-scale features of remote sensing images, they either have excessively large parameter counts, making real-time detection impossible; or have static fixed-size receptive fields, resulting in insufficient robustness; or are implemented through static branches or simple attention weighting, leading to low interaction efficiency between local details and global semantics. They do not consider the unique challenges of complex perspective and altitude differences in aerial remote sensing images. Our method aims to address the contextual differentiation requirements of different targets in aerial remote sensing images through variable receptive field selection, spatial-channel attention collaboration, and dynamic branch selection mechanisms, improving detection performance in complex environments.

2.2. Lightweight Backbone Networks

In existing target detection network architectures, the backbone generally accounts for the largest proportion of parameters. Excessive parameters not only increase the storage requirements, computational cost, and training time of the model but also affect its real-time performance to a certain extent. Therefore, lightweight backbone networks optimized for both size and speed have emerged that maintain accuracy as much as possible. SqueezeNet [46] uses Fire modules for parameter compression, and SqueezeNext [47] improves it by adding split convolutions; the SqueezeNet series reduces the number of parameters through overall structural optimization of the network architecture. ShuffleNetV1 [48] proposed the channel shuffle operation, allowing the network to freely use group convolutions for acceleration, while ShuffleNetV2 [49] overhauls most of the designs of V1, proposes the channel split operation, and achieves good results by accelerating the network and reusing features. In the MobileNet family, MobileNetV1 [30] uses depthwise separable convolutions to build lightweight networks, MobileNetV2 [50] proposes an innovative inverted residual with a linear bottleneck unit, which increases the number of layers but improves the overall accuracy and speed, and MobileNetV3 [51] combines AutoML technology with manual fine-tuning for even lighter network construction. The subsequent EfficientNet series [52,53] and CSPDarkNet [54] have also effectively achieved network lightweighting. For remote sensing, lightweight backbones prioritize efficiency for real-time deployment, with recent works optimizing architectures for aerial scenes. HIC-YOLOv5 [55] enhances small target detection via a streamlined CSPDarkNet-derived backbone, maintaining lightweight properties through fused path optimization. FBRT-YOLO [56] achieves fast aerial detection with efficient backbones (e.g., depthwise convolutions) and optimized propagation, reducing overhead to 119.2 G FLOPs. RemDet [57] balances speed and accuracy for UAVs via streamlined backbones with adaptive feature selection, excelling in parameter efficiency. CSPDarkNet has been used as the feature extraction network for the YOLO series, fully demonstrating its superiority in terms of size and speed. Given that YOLOv8 is widely used in remote sensing target detection tasks [17], this paper improves the backbone network of YOLOv8 to reduce the number of parameters and enhance its detection accuracy, replacing the original ResNet-18 with the improved backbone to achieve a lightweight model.

2.3. Multi-Scale Feature Fusion

Multi-scale feature fusion is a crucial step in remote sensing target detection. This is because the low-level features extracted by the feature extraction network have higher resolution and contain more positional and detailed information, but having passed through fewer convolutions, they carry weaker semantics and more noise; high-level features have stronger semantic information but low resolution and poor detail perception. Therefore, a multi-scale feature fusion network that can effectively integrate the useful information of both and discard invalid information can significantly improve the detection of various targets, especially small ones. Inspired by image pyramids such as the Gaussian pyramid, the FPN network [58] uses bottom-up and top-down paths to obtain strong semantic features, addressing the deficiency of target detection in handling multi-scale variations. PANet [59] builds on FPN by adding bottom-to-top connections, constructing a bottom-up path to better propagate precise low-level localization information upward and enhance the localization capability of the overall feature pyramid. Subsequent bidirectional fusion models such as ASFF [60] and BiFPN [61] have also been widely used in target detection tasks. RT-DETR [13] proposes CCFF based on the PA-FPN structure, optimizing multi-scale feature fusion to address the computational bottleneck and redundancy of traditional methods for real-time detection. In remote sensing, multi-scale fusion methods refine strategies beyond simple concatenation to tackle aerial scale variations. MSFE-YOLO [62] improves drone small-target detection via hierarchical adaptive weighting of cross-level features in YOLOv8. YOLO-DCTI [63] fuses multi-scale details for remote sensing via transformer-based dynamic attention, mitigating redundant feature mismatch. Drone-YOLO [64] handles altitude-induced scale shifts with scale-aware gating in pyramids, avoiding direct fusion incompatibilities. QueryDet [65] accelerates high-resolution detection via cascaded sparse queries, refining alignment through adaptive relevance filtering. SDPDet [66] enhances drone detection by decoupling scales and aligning features with target sizes via context-aware weighting. However, most of the above methods fuse features of different scales by direct concatenation or addition. Because the semantic levels of features at different scales differ significantly, such direct fusion can easily lead to feature-space incompatibility, introduce a large amount of redundant and conflicting information, and reduce multi-scale representation ability [67,68]. To solve this problem, we propose a context-guided fusion (CGF) module, which introduces a dimensionality reduction attention mechanism to adaptively suppress cross-scale redundant channels and uses a bidirectional guidance mechanism between levels: high-level semantic weights enhance the localization accuracy of low-level features, while low-level spatial weights supplement detail for high-level features, with little to no additional computational overhead.

3. Methods

This section introduces the proposed VRF-DETR model in detail based on the RT-DETR framework. Section 3.1 illustrates the key innovations and necessity of each component with the overall architecture diagram of VRF-DETR. Section 3.2 analyzes the workflow and limitations of the AIFI module in RT-DETR and describes the design and working principle of the proposed MSRF2. Section 3.3 introduces the original ResNet-18 backbone of RT-DETR, explains the necessity of replacing it with a lighter backbone, and describes how we improved the YOLOv8 backbone using AGConv and MSRF2. In Section 3.4, we analyze the impact of feature concatenation in the CCFF module of RT-DETR and detail the principle of the proposed CGF module and how it bridges the semantic gap between features of different scales during multi-scale feature fusion.

3.1. Overall Architecture

The architecture of the VRF-DETR model is shown in Figure 3. Compared with RT-DETR, VRF-DETR improves the feature extraction network, Transformer encoder, and feature fusion network to address the detection challenges of aerial remote sensing images. We aim to better focus on intra-class differences and inter-class similarities. We also want to obtain dynamic receptive fields for different altitudes and viewing angles. To achieve these goals, we propose a lightweight multi-scale receptive field adaptive fusion (MSRF2) module. It replaces the Transformer encoder in RT-DETR. This addresses the limitation of the fixed-scale attention of RT-DETR with almost no increase in parameters, providing dynamic multi-scale context for the encoder. To further reduce the number of parameters and computational complexity, we introduce the YOLOv8 backbone and propose the gated multi-scale context (GMSC) block to replace the original Bottleneck module, forming the RS-Backbone suitable for aerial remote sensing images. The GMSC block uses the proposed MSRF2 and attention-gated convolution (AGConv) to obtain high-quality multi-scale features while effectively reducing the number of parameters. Finally, to alleviate the large amount of redundant and conflicting information caused by concatenating features of different scales in the RT-DETR feature fusion module, we propose a context-guided fusion (CGF) module and use it to replace the feature concatenation in the original CCFF neck, forming ContextFPN, which effectively alleviates multi-scale semantic conflicts, enhances the complementarity of multi-scale features, and improves detection accuracy while ensuring real-time performance.

3.2. Encoder for Dynamic Multi-Scale Context Fusion

3.2.1. RT-DETR Encoder

The encoders of the DETR series [7,8,9] generally adopt the architecture of the Transformer encoder [69]. Deformable DETR [8] and DINO [9] use multi-scale feature maps as inputs to the Transformer encoder, which indeed improves model performance [70]. However, since high-level features are derived from low-level features, flattening and concatenating multi-scale features for interactive operations introduces computational redundancy [13]. The RT-DETR model replaces the multi-scale Transformer encoder in DINO with a single-scale Transformer encoder (i.e., the Attention-based Intra-scale Feature Interaction module, AIFI). It inputs only the last-layer feature map P5 of the feature extraction network into AIFI. Then, it uses the feature map S5—after sufficient intra-scale interaction in AIFI—as the new P5. This new P5 is input into the PANet-style feature fusion module CCFF for inter-scale fusion with the P3 and P4 layers. This approach decouples intra-scale interaction and inter-scale fusion, greatly improving the detection speed of DETR and making RT-DETR the first real-time end-to-end object detector.
As shown in Figure 4, the AIFI module of RT-DETR is a standard Transformer encoder, including multi-head self-attention, feed-forward network (FFN), residual connection, and layer normalization (LayerNorm). The difference between AIFI and the standard Transformer encoder is the use of statically generated 2D sine-cosine position encoding, which integrates horizontal and vertical position information and is more suitable for the spatial structure of the image. The input top-layer feature map is first flattened into a sequence, then added to the 2D sine-cosine position encoding, and input into the multi-head self-attention module. The self-attention performs sufficient interaction within the feature map, and the specific interaction form of each head is as follows:
$$\mathrm{Attention}(Q, K, V)_{i,j} = \sum_{k=1}^{n} \frac{\exp\left( \frac{Q_{i,:} \cdot K_{k,:}^{T}}{\sqrt{d_k}} \right)}{\sum_{m=1}^{n} \exp\left( \frac{Q_{i,:} \cdot K_{m,:}^{T}}{\sqrt{d_k}} \right)} \cdot V_{k,j},$$
where Q, K, and V are the Query, Key, and Value matrices, respectively, whose values are the feature map of the P5 layer after flattening and adding position encoding.
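To make this workflow concrete, the following is a minimal PyTorch sketch of the single-scale interaction step: the P5 map is flattened, the 2D sine-cosine positional encoding is added to the queries and keys, and standard multi-head self-attention implements the formula above. The class name, the head count, and the omission of the FFN and LayerNorm sub-layers are simplifications for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AIFISketch(nn.Module):
    """Minimal sketch of the single-scale intra-scale interaction in AIFI."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, p5, pos_embed):
        # p5: (B, C, H, W) top-level feature map; pos_embed: (1, H*W, C) sine-cosine encoding
        b, c, h, w = p5.shape
        tokens = p5.flatten(2).transpose(1, 2)          # flatten to a (B, H*W, C) sequence
        q = k = tokens + pos_embed                      # positions added to queries and keys
        s5, _ = self.attn(q, k, tokens)                 # softmax(QK^T / sqrt(d_k)) V
        return s5.transpose(1, 2).reshape(b, c, h, w)   # reshape S5 back to a feature map
```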

3.2.2. MSRF2 Module

We adopt the RT-DETR strategy of decoupling intra-scale interaction and inter-scale fusion, which improves real-time detection performance, and follow this processing flow. However, the standard Transformer encoder, and the AIFI module built on it, perform poorly on small objects in aerial remote sensing images. This is because the background of aerial remote sensing images is complex, and the objects to be recognized are mostly small targets. Self-attention treats all tokens equally by default [71], while small targets occupy few tokens, so their weights are severely diluted. Therefore, during self-attention calculation, the weights are easily overwhelmed by a large number of background tokens, resulting in ineffective activation of small target features [72]. Meanwhile, although the interaction weight of each token in the self-attention mechanism covers the entire feature map, the effective receptive field of each token is in practice determined by its semantic importance. For example, the pedestrians that are hard to distinguish because of high shooting altitude (mentioned in Section 1) are far fewer than the easily distinguishable pedestrians at medium altitude, so these confusing pedestrians, which actually require a larger receptive field, produce only weak feature responses. Their gradient magnitudes in backpropagation are lower than those of the background or of similar but non-confusing pedestrians, so they cannot obtain sufficient semantic importance [73] and thus cannot adaptively obtain the required receptive field size.
To address these limitations and enhance the multi-scale representation capability in real-time detection architectures, we propose the multi-scale receptive field adaptive fusion (MSRF2) module as a lightweight alternative to the AIFI module. The structure of MSRF2 is shown in Figure 5, inspired by CMTF-Net [36], with the difference that MSRF2 adds a space-oriented adaptive receptive field selection mechanism. To address the problem that the weights of small targets are easily submerged during multi-scale feature extraction, we used a dual-branch structure where multi-head self-attention and channel attention are parallel. For the channel attention branch, after the feature map is input, it is first compressed to the spatial dimension of C × 1 × 1 by average pooling, and then the channel attention weight is generated by 1 × 1 convolution, ReLU, and Sigmoid, which is element-wise multiplied with the original feature map:
$$\mathrm{Attention}_{CA}(x) = x \otimes \sigma\left( W_2 \cdot \mathrm{ReLU6}\left( W_1 \cdot \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{:,i,j} \right) \right),$$
where $x \in \mathbb{R}^{C \times H \times W}$ is the input feature map, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ are the weights of the two convolutional layers, $\mathrm{ReLU6}(x) = \min(\max(0, x), 6)$, $\sigma$ is the Sigmoid function, and $\otimes$ denotes the channel-wise multiplication operation. This process dynamically weights the channel features, enhances the weights of channels with significant small target feature responses, avoids background channels from dominating, and improves the semantic saliency of small target tokens.
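A minimal PyTorch sketch of this channel attention branch is shown below; the reduction ratio r and the use of 1 × 1 Conv2d layers for W1 and W2 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttentionBranch(nn.Module):
    """Sketch of the channel attention above: spatial squeeze by average pooling,
    a reduce/expand 1x1 conv pair with ReLU6 and Sigmoid, then channel reweighting."""

    def __init__(self, channels, reduction=16):   # reduction ratio r is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, 1, 1) spatial squeeze
            nn.Conv2d(channels, channels // reduction, 1),  # W1
            nn.ReLU6(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # W2
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)   # channel-wise multiplication with the attention weights
```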
To solve the problem of insufficient receptive field adaptability, we improve the multi-head self-attention mechanism. The input of Q is still directly from the P5 layer feature map, while the inputs of K and V need to go through the Adaptive MS selection module. Retaining the direct input of Q can preserve the global position information of the target in the feature map, providing spatial anchors for confusing targets, while Adaptive MS selection allows K and V to contain local details of confusing targets and background context. In this way, when calculating weights, self-attention not only relies on the category semantic intensity of distinguishable targets but also can allocate reasonable receptive fields for confusing targets based on multi-scale context information, realizing dual decoupling of semantics and space, avoiding weights from being submerged by a single strong semantic target, and achieving adaptive receptive field allocation. The Adaptive MS selection integrates the ideas of depthwise separable convolution and dilated convolution. First, dimensionality reduction is performed using 1 × 1 Conv:
$$F_0 = \mathrm{Conv}_{1 \times 1}(X).$$
Then, parallel DW dilated convolutions with different dilation rates are used in combination with corresponding PW convolutions to restore channels and generate feature maps F i with different receptive fields:
$$F_i = \mathrm{PWConv}_i\left( \mathrm{DWConv}_i(F_0, d_i) \right),$$
where $i \in \{1, 2, 3\}$ indexes the branches, $d_i$ is the dilation rate of the $i$-th branch (e.g., $d_1 = 3$, $d_2 = 5$, $d_3 = 7$), $\mathrm{DWConv}_i$ denotes the depthwise dilated convolution of the $i$-th branch, and $\mathrm{PWConv}_i$ denotes the corresponding pointwise convolution.
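The following PyTorch sketch illustrates the two equations above: a 1 × 1 reduction followed by parallel depthwise dilated convolutions, each restored to the original channel count by a pointwise convolution. The reduced channel count is an assumption; the dilation rates follow the example values given above.

```python
import torch
import torch.nn as nn

class AdaptiveMSBranches(nn.Module):
    """Sketch of the Adaptive MS selection branches:
    F0 = Conv1x1(X); F_i = PWConv_i(DWConv_i(F0, d_i))."""

    def __init__(self, channels, reduced=None, dilations=(3, 5, 7)):
        super().__init__()
        reduced = reduced or channels // 2                  # reduction factor is an assumption
        self.reduce = nn.Conv2d(channels, reduced, 1)       # F0 = Conv1x1(X)
        self.branches = nn.ModuleList(
            nn.Sequential(
                # depthwise 3x3 conv with dilation d keeps resolution when padding = d
                nn.Conv2d(reduced, reduced, 3, padding=d, dilation=d, groups=reduced),
                nn.Conv2d(reduced, channels, 1),            # pointwise conv restores channels
            )
            for d in dilations
        )

    def forward(self, x):
        f0 = self.reduce(x)
        return [branch(f0) for branch in self.branches]     # [F1, F2, F3]
```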
After obtaining the feature maps of different branches, three operations are adopted: feature concatenation, spatial selection, and channel decoupling fusion. The first step is to concatenate the feature maps F i of different scales along the channel dimension to form a multi-modal tensor. The second step is to generate a three-channel spatial selection mask M through the spatial attention mechanism. The third step is to use this three-channel spatial selection mask to decouple and differentially weight the scale features F i channel by channel and then realize adaptive enhancement through element-wise multiplication with the shortcut connection of the original input map X. The three-step operation process can be described as follows:
$$Y = \left( \sum_{i=1}^{N} \mathrm{SA}\left( [F_1; \ldots; F_N] \right)_i \odot F_i \right) \odot X,$$
where $\odot$ denotes the Hadamard product, and $\mathrm{SA}$ denotes the spatial attention mechanism similar to that in [74]. SA is defined as follows:
$$\mathrm{SA} = \sigma\left( \mathrm{Conv}\left( [P_{Avg}(F_{cat}); P_{Max}(F_{cat})] \right) \right),$$
where $\sigma$ denotes the Sigmoid activation function, and $F_{cat}$ denotes the concatenation $[F_1; \ldots; F_N]$.
Here, $P_{Avg}$ and $P_{Max}$ refer to channel-wise average pooling and max pooling (i.e., aggregating feature values across all channels for each spatial location), rather than spatial downsampling operations. This ensures that the spatial resolution of the feature map remains unchanged, thus preserving fine-grained spatial details of small targets. The generated spatial attention mask $M$ dynamically weights regions containing small targets and, combined with residual connections, further enhances the feature responses of small targets to avoid suppression in subsequent processing.
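A sketch of this three-step selection is given below; the 7 × 7 convolution kernel for the spatial attention is an assumption borrowed from common CBAM-style implementations.

```python
import torch
import torch.nn as nn

class SpatialScaleSelection(nn.Module):
    """Sketch of the fusion above: concatenate branch features, build per-branch spatial
    masks from channel-wise avg/max pooling, weight each branch, and modulate the input."""

    def __init__(self, num_branches=3, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, num_branches, kernel_size, padding=kernel_size // 2)

    def forward(self, branch_feats, x):
        # branch_feats: list of (B, C, H, W) maps F_i; x: original input map X
        f_cat = torch.cat(branch_feats, dim=1)
        avg = f_cat.mean(dim=1, keepdim=True)               # channel-wise average pooling P_Avg
        mx, _ = f_cat.max(dim=1, keepdim=True)              # channel-wise max pooling P_Max
        masks = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))   # (B, N, H, W) mask M
        fused = sum(masks[:, i:i + 1] * f for i, f in enumerate(branch_feats))
        return fused * x                                     # element-wise product with shortcut X
```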
A key innovation of MSRF2 lies in its differentiated QKV design, which decouples spatial positioning and contextual understanding. Unlike standard multi-head self-attention where Q, K, and V are derived from the same input, MSRF2 retains Q from the original feature map to preserve precise spatial anchors—critical for distinguishing spatially overlapping targets. In contrast, K and V are derived from the adaptively selected multi-scale features, enabling them to encode rich contextual cues that vary with target scale and scene complexity. This design ensures that attention weights are computed not only based on semantic similarity but also on contextually appropriate receptive fields, avoiding dominance by strong semantic targets (e.g., large buildings) and enhancing the model’s ability to focus on confusing or small targets.
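The differentiated QKV interaction can be sketched as follows, assuming the P5 map and the output of Adaptive MS selection share the same shape; positional encoding and multi-layer details are omitted for brevity.

```python
import torch
import torch.nn as nn

class DifferentiatedQKVAttention(nn.Module):
    """Sketch of MSRF2's attention: Q from the original P5 tokens (spatial anchors),
    K and V from the adaptively selected multi-scale features (contextual cues)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, p5, ms_selected):
        b, c, h, w = p5.shape
        q = p5.flatten(2).transpose(1, 2)              # queries keep the original positions
        kv = ms_selected.flatten(2).transpose(1, 2)    # keys/values carry multi-scale context
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(b, c, h, w)
```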
Compared to existing methods, MSRF2 offers distinct advantages. First, compared to RT-DETR’s AIFI module, which relies on global self-attention with static receptive fields, MSRF2’s dual-branch attention and differentiated QKV reduce the dilution of small-target weights, achieving a 2.1% higher mAP50 and 1.2% higher mAP50-95 on VisDrone2019 while maintaining comparable complexity. Second, unlike methods using large-kernel convolutions [22] that increase parameters, MSRF2 leverages depthwise separable dilated convolutions and attention gating to balance efficiency and performance, making it suitable for real-time aerial scenarios. Third, models with static multi-branch structures or channel-only attention (e.g., VoVNetV2 and ARFE-Net) suffer from fixed receptive fields and limited spatial adjustment, failing to adapt to scale fluctuations in aerial scenes. MSRF2 addresses these via dynamic branch weighting, dual-branch attention, and differentiated QKV interaction.
In summary, MSRF2 enhances detection robustness by dynamically tuning receptive fields to target scale and scene context, amplifying small-target features via dual-branch attention, and preserving spatial precision through differentiated QKV interaction—collectively addressing the unique challenges of aerial remote sensing image detection.

3.3. Lighter and Stronger RS-Backbone

3.3.1. From ResNet to CSPDarknet

The core idea of DETR lies in the end-to-end detection paradigm, so many DETR-series models [7,8,13] directly use the classic ResNet for feature extraction. However, although ResNet extracts features well, its design prioritizes accuracy over speed, leading to an overly large parameter count and only moderate real-time performance. The backbone of the YOLO series has fewer parameters and strong real-time performance, and many studies have proven its application potential in remote sensing [67,68]. Therefore, we introduce the YOLOv8 backbone. The YOLOv8 backbone adopts a structure similar to CSPDarknet [54]; its biggest difference from CSPDarknet is the use of the C2f module instead of the C3 module [75]. The comparison between the two modules is shown in Figure 6. Compared with the C3 module, the tensor entering the Bottleneck calculation sequence of C2f has only 0.5 times the input channels of C2f, so the computation is significantly reduced. On the other hand, C2f draws on the multi-branch parallel idea of ELAN in YOLOv7, introducing more skip connections (cross-layer residuals) to form a single-branch, multi-residual link. Compared with the dual-branch design of C3 (one part directly connected, the other passing through the bottleneck layers), the multi-branch parallel processing of C2f makes gradient propagation smoother in the network, effectively alleviating gradient vanishing and improving convergence speed and stability. At the same time, since C2f can integrate feature information from more scales in each layer, its gradient flow contains richer semantic and spatial details and adapts better to complex scenes.
Figure 7a shows the structure of the Bottleneck unit used in C2f. The Bottleneck adopts a channel dimensionality reduction strategy: a first 3 × 3 convolution extracts spatial features while reducing the channel dimension, and a second 3 × 3 convolution extracts spatial features while restoring it. Compared with directly using two full-width 3 × 3 convolution layers, the parameters and computation are significantly reduced.
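For reference, a simplified PyTorch sketch of the C2f/Bottleneck pattern described above is given below; normalization and activation layers are omitted, and the 0.5 channel-reduction ratio in the Bottleneck follows the description in the text.

```python
import torch
import torch.nn as nn

class BottleneckSketch(nn.Module):
    """Simplified Bottleneck: 3x3 conv reducing channels, 3x3 conv restoring them, residual add."""

    def __init__(self, c):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c // 2, 3, padding=1)
        self.cv2 = nn.Conv2d(c // 2, c, 3, padding=1)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))


class C2fSketch(nn.Module):
    """Simplified C2f: split the 1x1-projected features, run one half through a chain of
    Bottlenecks, and concatenate every intermediate output for dense gradient flow."""

    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2                               # Bottleneck branch uses half the channels
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)
        self.blocks = nn.ModuleList(BottleneckSketch(self.c) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)  # fuse all intermediate outputs

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.blocks:
            y.append(m(y[-1]))
        return self.cv2(torch.cat(y, dim=1))
```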

3.3.2. GMSC Block

However, the Bottleneck module still has two problems in aerial remote sensing target detection applications. The first is that the multi-scale features obtained by the Bottleneck are limited, and the detection effect is general. The second is that the Bottleneck uses traditional convolution, which still has a certain calculation burden. To solve these two problems, we propose the gated multi-scale context (GMSC) block to replace the Bottleneck. This block first improves accuracy and then reduces the number of parameters. The improved backbone is called RS-Backbone. Specifically, we replace the first 3 × 3 convolution with the MSRF2 module to obtain sufficient multi-scale features; then, we propose an attention-gated convolution (AGConv) module to replace the second 3 × 3 convolution, which greatly reduces the number of parameters. The structure of the GMSC block is shown in Figure 7b.
The model first obtains multi-scale features through the MSRF2 module and then performs a residual connection with the original input features after batch normalization and dropout. The AGConv module is then used to reduce the number of parameters.
As shown in Figure 8, inspired by TransNeXt [76], the AGConv module adopts a variant structure of the Gated Linear Unit (GLU) [28] and additionally introduces an attention-gating mechanism to enhance feature expression capabilities. The module first uses a 1 × 1 convolution to map the input feature from $\mathbb{R}^{C \times H \times W}$ to a higher-dimensional hidden space and splits the result into two parts, $X$ and $V$, through channel separation. The $X$ part undergoes spatial feature extraction through depthwise separable convolution, which greatly reduces the number of parameters while maintaining the receptive field. The other part, $V$, serves as a gating signal after passing through the channel attention mechanism and modulates the features extracted by the depthwise separable convolution through element-wise multiplication. The attention-gating mechanism can adaptively filter and select valuable feature information, improving the adaptability of the model to complex scenes. The mathematical form of the entire mechanism is as follows:
$$\hat{X} = \left( \mathrm{DWConv}(X) \cdot \sigma\left( 1.702 \cdot \mathrm{DWConv}(X) \right) \right) \odot \mathrm{CA}(V),$$
where $\odot$ denotes the Hadamard product, $\sigma$ denotes the Sigmoid activation function, and $\mathrm{CA}$ denotes channel attention similar to Equation (2). The processed features are then restored through a second 1 × 1 convolution and dropout regularization. Finally, a residual connection adds the original input to the transformed output.
In the design of AGConv, the combination of depthwise separable convolution and the Gated Linear Unit mechanism not only reduces the computational complexity but also enhances the ability of the model to extract spatial features, which is particularly suitable for processing high-resolution aerial remote sensing images. The introduction of the attention-gating mechanism further improves the flexibility of feature expression, enabling the model to dynamically adjust feature responses according to the input content. Through this structural design, AGConv maintains feature expression capabilities comparable to traditional convolution while reducing the number of parameters, effectively alleviating the calculation burden of the Bottleneck module. Based on experimental analysis, we set the hidden dimension to $\frac{4}{3}C_{\mathrm{base}}$ to balance performance and computational overhead. Ignoring bias terms, the parameter counts of the AGConv module and of the second 3 × 3 convolution in the original Bottleneck are as follows:
$$\frac{1}{2} \times C \times C \times 3 \times 3 = 4.5C^2, \qquad \left( \frac{4}{3} + \frac{2}{3} \right) \times C \times C + \frac{2}{3} \times \left( \frac{1}{4} + \frac{1}{4} \right) \times C \times C + \frac{2}{3} \times C \times 3 \times 3 \approx 2.4C^2 .$$
It can be seen that replacing the second 3 × 3 convolution in the original Bottleneck with the AGConv module can reduce the number of parameters in this part by approximately 46%. Since the Bottleneck module and its corresponding C2F module are repeatedly used in the entire backbone network, replacing the Bottleneck with the GMSC block can greatly reduce the number of parameters and calculation amount of the backbone network and improve real-time calculation performance. In the aerial remote sensing image target detection task, the application of the GMSC block not only significantly reduces the number of model parameters but also improves the detection ability of the model for different sizes and types of targets through multi-scale feature fusion and attention-gating mechanism, especially showing obvious advantages in processing complex backgrounds and small target detection tasks.
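A hedged PyTorch sketch of AGConv is given below. The expansion ratio follows the 4/3 C setting above, but the equal channel split, the channel-attention reduction ratio, and the final projection are assumptions for illustration rather than the exact layer configuration.

```python
import torch
import torch.nn as nn

class AGConvSketch(nn.Module):
    """Sketch of AGConv: 1x1 expansion, channel split into X and V, depthwise conv with a
    GELU-style gate (x * sigmoid(1.702 * x)) on X, channel attention on V as the gating
    signal, element-wise modulation, then a 1x1 projection and residual connection."""

    def __init__(self, channels, expansion=4 / 3, reduction=4):
        super().__init__()
        hidden = int(channels * expansion)
        self.expand = nn.Conv2d(channels, 2 * hidden, 1)            # produces the X and V parts
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.ca = nn.Sequential(                                    # channel attention on V
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hidden, hidden // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden // reduction, hidden, 1), nn.Sigmoid(),
        )
        self.project = nn.Conv2d(hidden, channels, 1)               # restore channel count

    def forward(self, inp):
        x, v = self.expand(inp).chunk(2, dim=1)
        g = self.dwconv(x)
        gated = g * torch.sigmoid(1.702 * g)                        # approximate GELU gate
        out = gated * (v * self.ca(v))                              # attention-gated modulation
        return inp + self.project(out)                              # residual connection
```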
A key innovation of the GMSC block lies in its synergistic integration of multi-scale feature enhancement and parameter-efficient extraction, which addresses the limitations of traditional Bottleneck in aerial scenarios. Unlike the original Bottleneck that relies on fixed 3 × 3 convolutions with limited scale adaptability and heavy computation, the GMSC block replaces the first convolution with the MSRF2 module to dynamically capture multi-scale contextual cues—critical for distinguishing targets with varying sizes and perspectives in aerial images. Meanwhile, the AGConv module replaces the second convolution, leveraging depthwise separable convolutions and a gated attention mechanism to filter redundant features and reduce parameters. This dual-module design ensures that the GMSC block not only enriches multi-scale feature representation but also achieves parameter compression, balancing accuracy and efficiency for real-time aerial detection.
Compared to existing lightweight backbones, the GMSC block offers distinct advantages. First, compared to MobileNet series [30,50] that use depthwise separable convolutions but lack dynamic multi-scale adjustment, the MSRF2 module of the GMSC block adaptively tunes receptive fields, improving small-target detection. Second, unlike ShuffleNetV2 [49], which relies on channel shuffle for acceleration but weakens feature relevance, AGConv’s gated attention enhances feature relevance while reducing parameters by 46%. Third, remote-sensing-oriented backbones like HIC-YOLOv5 [55] optimize paths but lack explicit multi-scale fusion, and FBRT-YOLO [56] prioritizes speed over feature richness.
In summary, the GMSC block enhances aerial detection performance through the synergistic effect of MSRF2 and AGConv: MSRF2 enriches multi-scale features to adapt to perspective/altitude variations, while AGConv reduces parameters via gated attention and depthwise convolutions. This design collectively addresses the Bottleneck’s limitations, achieving a superior accuracy–efficiency balance for real-time aerial remote sensing tasks.

3.4. Context-Guided Feature Fusion Network

3.4.1. RT-DETR Feature Fusion Network

As shown in Figure 9, the CNN-based cross-scale feature fusion (CCFF) module in RT-DETR is the core component of its efficient hybrid encoder [13], which adopts a top-down combined with bottom-up architecture similar to PANet. However, compared with PANet, RT-DETR’s CCFF has two key improvements: first, the 3 × 3 convolution used to refine the fusion features is replaced by RepC3; second, the “addition” operation in the upsampling stage is replaced by the “concatenation” operation.
Figure 10 shows the architecture of RepC3, which is built on the structural reparameterization technology of RepVGG [77]. During training, RepC3 uses a two-branch structure: the first branch reduces the number of channels to the hidden channels $C_h$ (where $C_h = e \times C_{out}$, with $e = 0.5$ in practical applications) through a 1 × 1 convolution, followed by a sequence of $n$ RepConv layers (three in training) to enhance feature extraction; the second branch directly performs a 1 × 1 convolution to adjust the channel dimension to $C_h$ without additional RepConv layers, serving as a residual path. The outputs of the two branches are aggregated by element-wise addition, and finally, if the number of hidden channels $C_h$ is not equal to $C_{out}$, a 1 × 1 convolution is used to restore the channel number to $C_{out}$; otherwise, an identity mapping is applied. During inference, RepC3 merges all branches into a 3 × 3 convolution through structural reparameterization [77], forming a simple topology similar to VGG's 3 × 3 convolution stack [78] (branchless). However, unlike VGG (whose training and inference structures are identical [78]), RepC3 leverages multi-branch training to enhance feature learning, outperforming VGG's static 3 × 3 convolutions. Each RepConv layer within RepC3 consists, during training, of a 3 × 3 convolution, a 1 × 1 convolution, and an identity branch, which are fused into a single 3 × 3 convolution during inference via reparameterization, further simplifying the inference structure.
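As an illustration of the reparameterization step, the sketch below folds a parallel 1 × 1 branch and an identity branch into a single 3 × 3 convolution; BatchNorm folding and bias handling are omitted for brevity, so this is a simplified view under those assumptions rather than the full RepVGG procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_repconv(conv3x3: nn.Conv2d, conv1x1: nn.Conv2d) -> nn.Conv2d:
    """Fold a 3x3 conv, a parallel 1x1 conv, and an identity branch into one 3x3 conv."""
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, 3, padding=1, bias=False)
    w = conv3x3.weight.data.clone()
    # pad the 1x1 kernel to 3x3 so it contributes only to the centre tap, then add it
    w += F.pad(conv1x1.weight.data, [1, 1, 1, 1])
    # the identity branch exists only when input and output channels match:
    # it adds 1 to the centre tap of each channel's own filter
    if conv3x3.in_channels == conv3x3.out_channels:
        for c in range(conv3x3.out_channels):
            w[c, c, 1, 1] += 1.0
    fused.weight.data = w
    return fused
```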
The shift from “addition” to “concatenation” in CCFF is another important modification. Replacing addition with concatenation makes the fused information more complete and richer: the addition operation compresses the feature dimension and causes some information loss, whereas the semantics and spatial details of feature maps at different levels vary greatly, and concatenation directly joins the feature maps, retaining both the shallow spatial details and the deep semantic information.

3.4.2. CGF Module

Although CCFF replaces the feature addition of the original PANet with feature concatenation in the upsampling path, which retains information from different levels, the concatenation usually follows a strategy of direct fusion after interpolation. The significant semantic-level differences between features at different scales make this strategy suffer from feature-space conflict: direct fusion introduces a large amount of redundant information such as background noise, especially in remote sensing scenarios, where the features of small targets are easily overwhelmed by noise [67,68]. Therefore, we propose a context-guided fusion (CGF) module to replace the feature concatenation in CCFF, forming a new neck, ContextFPN. The CGF module reconstructs the multi-scale fusion process through a semantics-aware dynamic weighting mechanism and a bidirectional feature guidance strategy, which effectively solves the problem of information redundancy and conflict. As shown in Figure 11, the CGF module realizes effective guidance and adaptive adjustment of contextual information during multi-scale feature fusion through a dimensionality reduction attention mechanism and a weighted feature recombination operation.
The workflow of the CGF module is as follows: First, for input features from different levels, if the numbers of channels differ, a 1 × 1 convolution adjusts them to the same dimension to ensure feature compatibility. Subsequently, the adjusted features are concatenated with the original features, and the concatenated features are screened at the channel level through the dimensionality reduction attention mechanism (dimensionality reduction ratio r = 16): channel weights are generated through a reduce-then-expand mapping after extracting context information via global average pooling, and the redundant channels corresponding to background noise are adaptively suppressed. Then, the attention-weighted features are divided into high-level semantic weights and low-level spatial weights along the channel dimension (split ratio of 2:1), a design that aligns with the two-input structure of multi-scale fusion. This division enables bidirectional guidance wherein the low-level features enhance target semantic consistency with the help of high-level semantic weights (e.g., making edge features focus more on the semantic region of a “vehicle”), and the high-level features supplement details such as contours through the low-level spatial weights (e.g., letting the “vehicle” semantic features recover edge details). Finally, the integrity of the original features is retained through residual connection and a second concatenation, realizing complementary fusion of cross-scale features. Dimensionality Reduction Attention denotes a channel attention mechanism similar to that in [79], defined as follows:
$$F_{\mathrm{Attn}}(x) = x \otimes \sigma\left( W_2 \cdot \max\left( 0, W_1 \cdot \mathrm{AvgPool}(x) \right) \right),$$
where $\otimes$ denotes element-wise multiplication, $\mathrm{AvgPool}$ denotes global average pooling, $\sigma$ denotes the Sigmoid activation function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ is the dimensionality reduction weight matrix, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ is the dimensionality increase weight matrix.
This design brings significant advantages. Through the Dimensionality Reduction Attention mechanism, the CGF module can adaptively focus on the importance of different channels, effectively suppressing noise and redundant information and enhancing feature expression capabilities. The weighted feature recombination operation realizes bidirectional guidance of context information, enabling low-level features to better utilize high-level semantic information, while high-level features can obtain more details and position information, thereby improving the quality of overall features. This fusion method introduces almost no additional computational overhead, ensuring the efficiency of the model.
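The following sketch puts the CGF workflow together for two equal-resolution inputs. The channel alignment, the r = 16 reduction attention, the 2:1 split into semantic and spatial guidance weights, and the residual re-concatenation follow the description above, while the specific guidance operations (1 × 1 projections and sigmoid gating) are assumptions for illustration, not the authors' exact layer configuration.

```python
import torch
import torch.nn as nn

class CGFSketch(nn.Module):
    """Illustrative context-guided fusion of a low-level and a high-level feature map."""

    def __init__(self, c_low, c_high, reduction=16):
        super().__init__()
        self.align = nn.Conv2d(c_high, c_low, 1)                 # match channel dimensions
        cat_c = 2 * c_low
        self.attn = nn.Sequential(                               # dimensionality reduction attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(cat_c, cat_c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(cat_c // reduction, cat_c, 1), nn.Sigmoid(),
        )
        self.sem_c = 2 * cat_c // 3                              # 2:1 channel split
        self.to_low = nn.Conv2d(self.sem_c, c_low, 1)            # semantic weights guide low level
        self.to_high = nn.Conv2d(cat_c - self.sem_c, c_low, 1)   # spatial weights guide high level
        self.fuse = nn.Conv2d(2 * cat_c, c_low, 1)               # residual re-concatenation

    def forward(self, low, high):
        high = self.align(high)
        cat = torch.cat([low, high], dim=1)
        weighted = cat * self.attn(cat)                          # suppress redundant channels
        sem, spa = torch.split(weighted, [self.sem_c, cat.size(1) - self.sem_c], dim=1)
        low_guided = low * torch.sigmoid(self.to_low(sem))       # semantics sharpen localization
        high_guided = high * torch.sigmoid(self.to_high(spa))    # spatial detail complements semantics
        return self.fuse(torch.cat([cat, low_guided, high_guided], dim=1))
```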

4. Experiments

4.1. Datasets

To verify the effectiveness of the variable receptive field provided by VRF-DETR in solving the problems of intra-class variation and inter-class similarity in aerial remote sensing images, we mainly conducted experiments on the VisDrone2019 [80] dataset and used the UAVDT [81] dataset to assist in verifying its generalization. VisDrone2019 contains images captured by drones. The challenges include uneven lighting, variable heights and perspectives, and easily confused categories. It covers 10 types of objects including pedestrians, motors, tricycles, and cars. The VisDrone2019 dataset has a total of 6471 training images, 548 validation images, and 1610 test images, with the maximum pixels of the images being 2000 × 1500. During the training process, the official data division was adopted, and the common practices of the dataset were followed [45,62]. The validation set was used for model comparison, and an additional comparison on the test set was carried out. UAVDT is also a dataset specifically constructed for the object detection task from the perspective of drones. It contains dynamic images in various complex scenes such as cities, villages, and transportation hubs. It contains a total of 40,409 images of three categories: cars, trucks, and buses, with 23,829 for training and 16,580 for testing. The resolution of all images in the UAVDT dataset remains unchanged, all being 1080 × 540 pixels.
The information statistics of the VisDrone2019 dataset are shown in Figure 12. From the upper left to the lower right, the panels are a bar chart of the number of labels per category; a visualization of the shapes and sizes of the anchor boxes; a scatter plot of the distribution of anchor-box center points; and a scatter plot of the proportion of the anchor boxes relative to the whole image. In the lower-left and lower-right plots, the more points there are, the darker the color.
The VisDrone2019 dataset contains many images with different angles, heights, lighting conditions, and degrees of motion blur in the same scene. These images greatly increase the intra-class variation and the difficulty of detection. As shown in Figure 13, the four images in the same square scene have different shooting conditions and detection challenges, which increases the difficulty of detecting pedestrians and people.

4.2. Experimental Configuration and Evaluation Metrics

4.2.1. Experimental Configuration

The experimental platform of this article consists of a Windows 11 system with Python 3.8 and PyTorch 1.13.1 framework. Hardware specifications include the following:
  • CPU: 13th Gen Intel® Core™ i7-13700K @ 3.40 GHz (Intel, Santa Clara, CA, USA).
  • GPU: NVIDIA GeForce RTX 4090 with 24 GB GDDR6X memory (NVIDIA, Santa Clara, CA, USA).
Detailed hyperparameter settings for the benchmark models that we experimentally validated (including YOLO series and RT-DETR series) are presented in Table 1, with distinct configurations specified for each series. To ensure the reproducibility and fairness of comparisons, all these models were trained under the consistent initialization strategies and data augmentation protocols detailed in this table. For the SOTA methods in the comparison experiments section that are cited from original publications, their parameter settings strictly follow the configurations reported in their respective works.

4.2.2. Evaluation Metrics

To better evaluate the effectiveness of the proposed model, this paper selects the mean average precision metrics mAP50 and mAP50-95 as the core indicators of detection performance; the number of model parameters and the number of giga floating-point operations (GFLOPs) as the core indicators of model efficiency; and FPS as the indicator of real-time performance. The mean average precision mAP depends on the precision P, recall R, and average precision AP, which are calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} P(r)\,\mathrm{d}r,$$
among them, TP represents the number of samples correctly predicted as positive examples, FP is the number of negative samples incorrectly predicted as positive examples, FN is the number of positive samples incorrectly predicted as negative examples, and P(r) is the precision corresponding to the recall rate r.
Additionally, to evaluate detection performance across different target scales, average precision is further categorized into AP$_s$ (Average Precision for Small targets), AP$_m$ (Average Precision for Medium targets), and AP$_l$ (Average Precision for Large targets) based on object size, as defined in the MS COCO dataset [82]. Specifically, AP$_s$ refers to targets with a pixel area in the range [0², 32²]; AP$_m$ corresponds to targets with a pixel area in [32², 96²]; AP$_l$ applies to targets with a pixel area larger than 96².
mAP50 is the multi-class average precision at an IoU threshold of 0.5. mAP50-95 averages the precision over 10 IoU thresholds from 0.5 to 0.95 (with a step size of 0.05), which more comprehensively reflects localization accuracy. The two indicators are calculated by the following formulas, respectively:
$$\mathrm{mAP}_{50} = \frac{1}{C}\sum_{c=1}^{C} AP_{c}^{IoU=0.5}, \qquad \mathrm{mAP}_{50\text{-}95} = \frac{1}{10C}\sum_{c=1}^{C}\sum_{i=0.5}^{0.95} AP_{c}^{IoU=i},$$
among them, C is the total number of target categories, and $AP_{c}^{IoU=0.5}$ represents the average precision of the c-th category when the IoU threshold is 0.5.
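The snippet below illustrates these metric definitions numerically; the AP table is filled with made-up values, and the size-bucket helper simply applies the COCO area thresholds quoted above, purely for illustration.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """AP as the integral of P(r) over recall, approximated on a sampled PR curve."""
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

def map_at(ap_table: np.ndarray, iou_thresholds: np.ndarray, at: float = 0.5) -> float:
    """mAP50: mean over classes of the AP column at IoU = 0.5 (rows = classes)."""
    idx = int(np.argmin(np.abs(iou_thresholds - at)))
    return float(ap_table[:, idx].mean())

def map_50_95(ap_table: np.ndarray) -> float:
    """mAP50-95: mean over classes and the 10 IoU thresholds 0.5:0.05:0.95."""
    return float(ap_table.mean())

def size_bucket(area: float) -> str:
    """COCO-style size categories used for APs / APm / APl."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

ious = np.arange(0.5, 1.0, 0.05)             # 0.50, 0.55, ..., 0.95 (10 thresholds)
ap_table = np.random.rand(10, len(ious))     # dummy AP values: 10 classes x 10 thresholds
print(map_at(ap_table, ious), map_50_95(ap_table), size_bucket(20 * 20))
```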

4.3. Experimental Evaluation of Encoder

4.3.1. The Optimal Configuration of the MSRF2 Module

To determine the optimal configuration of MSRF2, we conducted ablation experiments on the VisDrone2019 validation set, focusing primarily on the impact of the parallel dilated convolution rates on model performance. The baseline model is the complete VRF-DETR, which is also used as the baseline in the subsequent module ablation experiments. The experimental environment and hyperparameter settings are consistent with those in Section 4.2.1, with evaluation metrics measured at a resolution of 640 × 640. The symbol ↑ indicates that higher is better; the symbol ↓ indicates that lower is better. Subsequent module ablation experiments and comparative tests adhere to the same settings. The evaluation metrics for the dilation-rate study are the mean average precision for small targets (mAPs), mAP50, and mAP50-95. Since the dilation rate does not alter the number of model parameters or the computational load, these two metrics are not compared here.
Following the design principle of HDC [83], we compared three arithmetically increasing dilation-rate combinations whose rates share no common factor (Table 2). The second group, with dilation rates of [3, 5, 7], achieves the best performance, with an mAPs of 21.6%. We attribute this to its more moderate starting value and span, which effectively capture small targets while avoiding excessive background noise. In contrast, the first group (dilation rates of [1, 3, 5]) lacks sufficient receptive-field expansion for distant small targets at high altitudes, resulting in an mAP50 that is 1.8% lower than that of the second group and a slight 0.2% reduction in mAPs. The third group (dilation rates of [5, 7, 9]) introduces redundant context, lowering the mAPs to 19.9%.
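As a quick sanity check on this reasoning, the snippet below computes the span covered by a single 3 × 3 dilated branch (a 3 × 3 kernel with dilation d spans 2d + 1 pixels). It only illustrates why [3, 5, 7] sits between the narrower [1, 3, 5] and the broader [5, 7, 9]; it is not part of the model code.

```python
def branch_span(kernel: int = 3, dilation: int = 1) -> int:
    """Pixels covered by one dilated convolution branch: d * (k - 1) + 1."""
    return dilation * (kernel - 1) + 1

for rates in ([1, 3, 5], [3, 5, 7], [5, 7, 9]):
    print(rates, "->", [branch_span(dilation=d) for d in rates])
# [1, 3, 5] -> [3, 7, 11]
# [3, 5, 7] -> [7, 11, 15]
# [5, 7, 9] -> [11, 15, 19]
```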

4.3.2. Comparison Between MSRF2 and Other Encoders

To further verify the superiority of the MSRF2 module, we conducted a comparative experiment on the VisDrone2019 validation set among the baseline (RT-DETR-r18), RT-DETR-r18 models in which only the AIFI module is replaced by other advanced modules, and the RT-DETR-r18 model in which AIFI is replaced by the MSRF2 module under its optimal configuration. The evaluation metrics include mAP50, mAP50-95, the number of model parameters, and the number of giga floating-point operations (GFLOPs). The subsequent module comparison experiments use the same evaluation metrics. The results are shown in Table 3.
The experimental results indicate that MSRF2 outperforms other advanced modules on both important metrics of accuracy, mAP50 and mAP50-95, while maintaining similar parameters and computational complexity. We analyze that this is due to the effective enhancement of feature weights for small objects by the parallel channel attention mechanism of MSRF2, which suppresses interference from complex backgrounds. Furthermore, its adaptive multi-scale spatial selection mechanism dynamically allocates the optimal receptive field for different objects, strengthening the feature extraction capabilities for multi-scale targets. This enables the model to achieve optimal mAP50 and mAP50-95 on datasets primarily focused on small objects. The comparative analysis of these results firmly demonstrates the effectiveness of MSRF2 in the task of object detection in aerial remote sensing images.

4.4. Experimental Evaluation of Backbone

4.4.1. The Optimal Configuration of the GMSC Block

We conducted a series of ablation experiments on the GMSC block using the VisDrone2019 validation set to identify its optimal configuration. The experiments were divided into four groups, all of which modified the backbone to the YOLOv8 architecture. Specifically, Group A utilized the YOLOv8-S backbone with the original Bottleneck; Group B replaced the first 3 × 3 convolution in the original Bottleneck with an MSRF2 module along with a residual connection; Group C substituted the second 3 × 3 convolution in the Bottleneck with an AGConv module and a residual connection; and Group D was consistent with the GMSC block presented in this paper, replacing the two 3 × 3 convolutions with the MSRF2 module and AGConv module, respectively, with each module followed by its own residual connection. The results are shown in Table 4.
The ablation experiments verify that replacing the two 3 × 3 convolutions in the YOLOv8 Bottleneck with the MSRF2 and AGConv modules (Group D configuration) achieves a synergistic enhancement of detection accuracy and model efficiency, with the MSRF2 module contributing more to accuracy than AGConv. Notably, GMSC not only improves the mAP metrics through the complementary feature processing of the two modules but also achieves superlinear compression of the parameter count. We attribute this superlinear compression to the architectural design of global channel reassignment: the parameter reduction is not solely due to the direct module replacement, but also to the full-link cascading optimization of parameters induced by intermediate-layer channel compression.
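For readers who want a structural picture of the Group D configuration, the sketch below wires a simplified multi-scale unit and a gated convolution into a bottleneck with per-module residual connections. MSRF2Like and AGConvLike are deliberately reduced stand-ins (depthwise dilated branches with a 1 × 1 fusion, and a depthwise content path gated by a sigmoid branch); they approximate the described structure, not the paper's exact modules.

```python
import torch
import torch.nn as nn

class MSRF2Like(nn.Module):
    """Simplified multi-scale unit: depthwise dilated branches + 1x1 fusion."""
    def __init__(self, c: int, dilations=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d, groups=c) for d in dilations)
        self.fuse = nn.Conv2d(c * len(dilations), c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class AGConvLike(nn.Module):
    """Simplified attention-gated convolution: depthwise content path x sigmoid gate."""
    def __init__(self, c: int):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.gate = nn.Sequential(nn.Conv2d(c, c, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw(x) * self.gate(x)

class GMSCBlockSketch(nn.Module):
    """Bottleneck variant with per-module residual connections (Group D layout)."""
    def __init__(self, c: int):
        super().__init__()
        self.msrf2 = MSRF2Like(c)
        self.agconv = AGConvLike(c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.msrf2(x)     # residual around the multi-scale unit
        x = x + self.agconv(x)    # residual around the gated convolution
        return x

if __name__ == "__main__":
    print(GMSCBlockSketch(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```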

4.4.2. Comparison Between GMSC-Based RS-Backbone and Other Backbones

Having confirmed the replacement of the encoder with MSRF2, we further conducted comparative experiments on the VisDrone2019 validation set to verify the effectiveness of the GMSC block; subsequent module comparison experiments are also based on this configuration. Specifically, we compared the baseline (RT-DETR-r18-MSRF2), RT-DETR models with the backbone replaced by other advanced backbones, and the RT-DETR model with the backbone replaced by RS-Backbone. The results are shown in Table 5.
From the experimental results in Table 5, RS-Backbone shows significant advantages on the VisDrone2019 validation set. In terms of detection accuracy, it reaches 51.4% mAP50 and 31.8% mAP50-95, which are 2.1% and 1.9% higher than the baseline, respectively, and clearly exceed comparison models such as SwinTransformer-T and EfficientVit-M0, verifying the effectiveness of the GMSC block in multi-scale feature fusion. Within the GMSC block, the MSRF2 module replaces the first-layer convolution and significantly enhances multi-scale feature extraction, especially for small targets and complex backgrounds, which is the key to the accuracy gain. Meanwhile, the gated attention mechanism introduced by the AGConv module, together with MSRF2, greatly reduces the parameter burden, so that RS-Backbone is only slightly heavier than the two extremely lightweight models EfficientVit-M0 and MobileNetv4-S while substantially improving accuracy, achieving the best balance between accuracy and efficiency.

4.5. Experimental Evaluation of Neck

4.5.1. The Optimal Configuration of the CGF Module

In order to confirm the advantages of the CGF fusion method compared to Concat and Add, and the role of each component inside the CGF module, we conducted a comparative test on the VisDrone2019 validation set. Specifically, the A group model uses Concat from the original neck CCFF. The B group model replaces all Concat with Add. The C group model uses a CGF module retaining only the dimensionality reduction attention mechanism. The D group model uses a CGF module retaining only the bidirectional feature guidance strategy. Finally, the E group model uses the full CGF module. The results are shown in Table 6.
The CGF module boosts mAP50 to 52.1% and mAP50-95 to 32.2%, outperforming Concat by 0.7% and 0.4%, respectively, as it enhances cross-scale feature complementarity via semantic-aware dynamic weighting. In contrast, Add causes a 3% drop in mAP50 because it ignores feature importance. CGF adds only 0.2 M parameters with unchanged FLOPs, whereas Add sacrifices accuracy for its lightweight design and Concat, lacking feature screening, also underperforms CGF; this confirms that dynamic weighted fusion is essential for dense small-target scenes. Further analysis of the module components reveals that both the C group (retaining only the dimensionality reduction attention mechanism) and the D group (retaining only the bidirectional feature guidance strategy) perform worse than the complete CGF module. This indicates that the attention mechanism's filtering of feature importance and the bidirectional guidance strategy's facilitation of cross-scale information interaction act synergistically; together, they constitute the core mechanism by which CGF enhances detection performance.

4.5.2. Comparison Between CGF-Based ContextFPN and Other FPNs

In order to confirm the advantages of the neck using the CGF fusion method compared with other advanced necks, we conducted a comparative test on the VisDrone2019 validation set based on the modified encoder and backbone, and the results are shown in Table 7.
From Table 7, ContextFPN with CGF outperforms the baseline by 0.7% in mAP50 and 0.4% in mAP50-95. It also surpasses BiFPN, MAFPN, and HSFPN, demonstrating that the dynamic weighted fusion and bidirectional guidance of CGF enhance cross-scale feature interaction. Although HSFPN has fewer parameters and lower FLOPs, its accuracy lags behind. ContextFPN balances performance and computation, verifying the superiority of CGF in dense small-target scenes.

4.6. Ablation Experiment

In addition to the ablation and comparative experiments within the aforementioned modules, to further validate the roles and importance of the various components of the VRF-DETR model, we conducted an overall ablation study on the VisDrone2019 validation set by progressively adding the MSRF2, GMSC, and CGF modules to the baseline model RT-DETR and comparing the different combinations. This ablation experiment also adhered to the experimental settings outlined in Section 4.2.1, with each experiment strictly following the same parameters; the results are presented in Table 8.
The results of the ablation experiments validate the effectiveness of our modules. The complete VRF-DETR achieved an improvement of 4.9% in mAP50 and 3.5% in mAP50-95 compared to the baseline RT-DETR while reducing the number of parameters by 31.7% and the computational load by 21.8%.
Notably, MSRF2 maintains comparable complexity to AIFI (the original encoder in RT-DETR): replacing AIFI with MSRF2 alone slightly reduces parameters from 20.2 M to 20.0 M and marginally increases FLOPs from 57.0 G to 57.5 G. This efficiency stems from targeted optimizations, such as eliminating redundant position embeddings, simplifying the feedforward network, and restricting attention interactions to a compact 16 × 16 spatial range, which reduce computational redundancy while preserving critical feature interactions. The accuracy gains, in turn, stem from MSRF2's dynamic receptive field adjustment through parallel dilated convolutions with varying rates (3, 5, and 7) and spatial-channel attention mechanisms. By adaptively capturing multi-scale contextual information, MSRF2 enhances discrimination between visually similar categories and improves detection of small targets under high-altitude conditions. The module's ability to tailor receptive fields to object characteristics reduces misclassifications arising from perspective and altitude variations.
Analyzing the GMSC block, which combines MSRF2 with AGConv, we observe notable performance gains. When added to the baseline without CGF, the GMSC block elevates mAP50 to 50.4% (a 3.2% improvement) and mAP50-95 to 31.1% (a 2.4% increase). This improvement is attributed to AGConv’s parameter-efficient gated attention, which reduces redundancy by 46% while enhancing feature relevance. The integration of MSRF2’s multi-scale extraction with AGConv’s lightweight design results in a 24.8% reduction in parameters (from 20.2 M to 15.2 M) and a 14.2% decrease in FLOPs (from 57.0 G to 48.9 G), showcasing a synergistic balance between accuracy and computational efficiency.
The contribution of the CGF module is highlighted by comparing configurations with and without it while retaining all other components. The addition of CGF improves mAP50 from 51.4% to 52.1% and mAP50-95 from 31.8% to 32.2%. This enhancement arises from CGF’s role in mitigating semantic conflicts during multi-scale fusion through dimensionality reduction attention and bidirectional guidance. By enabling low-level spatial features to incorporate high-level semantic cues (e.g., refining vehicle edge detection) and high-level features to leverage low-level spatial details, CGF significantly improves detection accuracy for small targets in cluttered environments, such as densely parked vehicles in aerial imagery.
We also observed that MSRF2 and GMSC contribute comparably to precision, while GMSC contributes far more to the reduction in model complexity than MSRF2. We attribute this to GMSC being repeated four times in the C2F structures throughout the backbone, producing the module-level synergy analyzed in Section 3.4.1 and thus a superlinear compression of the parameter count. Notably, within the GMSC block, MSRF2 contributes more to mAP50 and mAP50-95, while AGConv contributes more to reducing the number of parameters and floating-point operations. Finally, embedding the CGF module in the neck improves the detection accuracy of small targets in complex backgrounds without increasing the computational load, fully demonstrating the importance of efficiently utilizing effective contextual semantics during feature fusion.
The results of the ablation experiments illustrate that the three modules must cooperate with each other to achieve optimal performance, effectively confirming the validity of variable receptive fields, attention-gating mechanisms, and context-guided feature fusion for aerial remote sensing target recognition.

4.7. Comparison Experiment

In order to better evaluate the performance of VRF-DETR among models of similar and different complexity levels, as well as between specialized and general models, we conducted experiments on the validation set (Table 9) and test set (Table 10) of the VisDrone2019 dataset, as well as the test set of UAVDT (Table 11). We compared VRF-DETR with several benchmark models and state-of-the-art models in terms of average precision, the number of parameters, and floating-point operations. Among the compared detectors, single-stage models generate predictions with an anchor-box mechanism and rely on post-processing, while end-to-end models achieve direct output without anchor boxes through set prediction and bipartite matching loss; the two differ fundamentally in architectural design. Notably, in the VisDrone2019-related experiments, we followed conventional practice in the field [45,62] and used the validation set for method comparison. We also conducted further performance verification against many real-time detection benchmarks and state-of-the-art models on the test set, which was not involved in model optimization. For clarity, the "Imgsize" column in each table indicates the input size setting: all models in Table 10 (VisDrone2019 test set) are trained and evaluated with a uniform 640 × 640 input, while in Table 9 (VisDrone2019 validation set) and Table 11 (UAVDT test set), models that also appear in Table 10 follow the same 640 × 640 setting, and the other domain-specific SOTA methods retain their originally reported input sizes.
As shown in Table 9, VRF-DETR achieves the best mAP50-95 on the VisDrone2019 validation set, and its mAP50 is close to that of the two-stage state-of-the-art model SDP, which uses a higher-resolution input, while its parameter count is only 13.9% of the latter's. In terms of model complexity, VRF-DETR has only 13.8 M parameters and 44.6 G FLOPs, significantly lower than most comparison models, achieving a balance between accuracy and efficiency. Compared with two-stage, one-stage, and end-to-end methods, it shows stronger generalization ability and practicality in UAV scene detection.
As shown in Table 10, VRF-DETR continues its strong performance from the validation set on the VisDrone2019 test set, leading all comparison models with an mAPs of 15.5%, an mAP50 of 39.9%, and an mAP50-95 of 23.3%. Its parameter count of 13.8 M and computational cost of 44.6 G FLOPs remain low, achieving a 4.1% increase in mAP50-95 over the one-stage method YOLOv12-M while reducing parameters by 31.7%. Compared to the end-to-end model D-Fine-M, it shows a 0.3% increase in mAP50 and a 20.9% decrease in FLOPs. Although its parameter count is larger than that of FBRT-YOLO-M, its computational efficiency and accuracy metrics are superior. On the test set, which was not used for model optimization, VRF-DETR achieves a further improvement in detection accuracy at lower computational complexity, highlighting its generalization ability and engineering practicality for complex targets in drone scenarios.
Furthermore, we conducted a systematic performance evaluation of six models, including VRF-DETR, on the test set and visualized the results in a radar chart (Figure 14). For each evaluation metric, we normalized the values using the best-performing model as the benchmark (100%). In particular, since GFLOPs and Params are negatively correlated with model performance, their reciprocals were used for normalization to keep all axes consistent. The visualization indicates that VRF-DETR achieves state-of-the-art detection accuracy for small and medium targets (with AP$_s$, AP$_m$, and AP$_l$ defined in Section 4.2.2) at the lowest computational cost, while its large-target performance is only slightly lower than that of D-Fine. The results demonstrate that VRF-DETR achieves the optimal balance between performance and model size.
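The radar-chart normalization can be reproduced with a few lines. In the sketch below, normalize and the example values are placeholders (not measurements from the paper), and cost metrics are inverted before scaling so that the best model on every axis maps to 100%.

```python
from typing import Dict

def normalize(values: Dict[str, float], lower_is_better: bool = False) -> Dict[str, float]:
    """Scale each model's metric so that the best model maps to 100% on this axis."""
    scores = {name: (1.0 / v if lower_is_better else v) for name, v in values.items()}
    best = max(scores.values())
    return {name: 100.0 * s / best for name, s in scores.items()}

params_m = {"model_a": 13.8, "model_b": 20.2, "model_c": 28.5}   # placeholder parameter counts (M)
print(normalize(params_m, lower_is_better=True))                 # the smallest model scores 100%
```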
Next, the results on the UAVDT test set are presented. Table 11 includes supplementary comparison models selected for complexity (parameters and FLOPs) similar to VRF-DETR to ensure fair comparisons of speed and accuracy; models with 12–20 M parameters and 40–60 G FLOPs form the main comparison group. As shown in Table 11, VRF-DETR achieves 40.3% mAP50 and 25.9% mAP50-95 with 640 × 640 inputs on the UAVDT test set, which is 1.6% and 1.3% higher than the two-stage state-of-the-art model UFPMP-Det, respectively, with 13.8 M parameters and 44.6 G FLOPs, only 39.1% and 66.2% of those of RemDet-L. Compared with one-stage methods, its mAP50 exceeds FBRT-YOLO-L by 9.2%, with 0.8 M fewer parameters and only 37.4% of the computation of FBRT-YOLO-L. With an input size smaller than that of most comparison models, VRF-DETR achieves this detection accuracy at lower computational complexity, reaching state-of-the-art performance on UAVDT, which fully demonstrates its performance and generalization in aerial remote sensing object recognition.
Finally, for the real-time performance evaluation, the inference efficiency of a model is influenced by multiple factors, including the hardware device, the operational environment, the batch size, and the degree of code optimization [103]. To ensure fair and reproducible comparisons, we evaluated the baseline models RT-DETR-r18 and RT-DETR-r50 alongside the proposed VRF-DETR on the same local hardware platform (see Section 4.2.1 for the configuration) and the same datasets (the VisDrone2019 and UAVDT test sets), with an image size of 640 × 640 and increasing batch sizes. The experiments used the domain-standard metrics of inference time and frames per second (FPS), which intuitively reflect the real-time responsiveness of a model in continuous inference scenarios; the results are presented in Table 12.
As can be seen from Table 12, VRF-DETR exhibits excellent real-time performance on the VisDrone2019 and UAVDT test sets. Simulating resource-constrained conditions with a batch size set to 1, its FPS on the two datasets is 62.1 and 60.2, respectively. Although this is lower than RT-DETR-R18, it outperforms RT-DETR-R50 and surpasses the threshold of 60 FPS, which is critical for human perception and mainstream video frame rates, thereby meeting the real-time requirements of most scenarios. As the batch size increases to 8 and 16, the inference efficiency of the model significantly improves, with a maximum FPS reaching 188.7. This is achieved while maintaining a low parameter count of 13.8 M and computation load of 44.6 G FLOPs, thus enabling a co-optimization of detection accuracy and inference speed. This validates its engineering applicability for future real-time detection tasks in drone deployments.
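For reference, a minimal timing protocol matching the description above (fixed 640 × 640 input, increasing batch sizes, GPU synchronization before and after timing) might look as follows. measure_fps is a generic sketch for any torch.nn.Module and uses random inputs, so it reports throughput only, not detection results.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, batch_size: int, iters: int = 100, size: int = 640) -> float:
    """Images per second for a fixed input size, with warm-up and CUDA synchronization."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, size, size, device=device)
    for _ in range(10):                       # warm-up to exclude start-up costs
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed       # FPS = processed images / elapsed seconds

# usage (illustrative): fps = measure_fps(detector, batch_size=1)
```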

4.8. Visualization Experiment

In order to more intuitively assess the detection capabilities of VRF-DETR and the baseline model RT-DETR, we selected six images of varying difficulty, different angles, heights, lighting conditions, and occlusion scenarios from the validation set of VisDrone2019. We conducted inference using both models on these images. To provide a clearer and more intuitive comparison of the detection results, we emphasized misdetected objects with purple circles, while missed detections are highlighted with red circles. The comparison of detection results between RT-DETR-r18 and VRF-DETR is illustrated in Figure 15. The numerical indices from the original annotations correspond to the following categories: 0 for pedestrian, 1 for people, 2 for bicycle, 3 for car, 4 for van, 5 for truck, 6 for tricycle, 7 for awning-tricycle, 8 for bus, and 9 for motor. The same applies to the following figure.
The results show that, compared with VRF-DETR, RT-DETR exhibits varying degrees of false positives or missed detections in all six images and shows poor robustness to changes in perspective and altitude. In the first two images, taken from a low-altitude side view, RT-DETR missed the crowd on the river embankment and two pedestrians under the highway, respectively. In the third and fourth images, taken from a medium-low-altitude 45-degree view, RT-DETR missed the cars in the distance of the lane and pedestrians on the roadside, and it also mistakenly detected the clothes hanger placed diagonally next to the motor in the fourth image as a bicycle. The last two images present high-altitude top-view scenes featuring dense ground objects and complex backgrounds. In such scenarios, RT-DETR missed a large number of targets in the fifth image, including a row of parked motors and cars, two pedestrians on the road, and two motorcyclists, and it missed a great many ground pedestrians in the sixth image. In contrast, VRF-DETR completed the detection task well. These results fully demonstrate the detection capability and robustness of VRF-DETR under various aerial remote sensing imaging conditions.
We have also selected an additional complex intersection scene of medium height to test the detection capabilities of VRF-DETR. The detection results are shown in Figure 16. It is not difficult to observe that, compared to RT-DETR, VRF-DETR demonstrates superior detection performance for the portion of the overall image enclosed in the purple box at the upper right corner, successfully identifying many targets that were missed by RT-DETR. In the remaining area, we have highlighted the misdetected targets with magenta circles and the missed targets with red circles. RT-DETR exhibited many more missed targets in this area, while VRF-DETR mistakenly detected a traffic light on the right side of the image as a person riding a motor. This comparative result demonstrates that VRF-DETR can effectively address the poor detection performance of RT-DETR for small targets in complex scenes with dense occlusions.
To specifically evaluate the performance of VRF-DETR and RT-DETR on tiny objects (8 × 8 to 16 × 16 pixels) and very tiny objects (smaller than 8 × 8 pixels), we selected the image 9999979_00000_d_0000034.jpg, which contains the smallest objects in the VisDrone2019 validation and test sets, and chose its 24 smallest objects (17 very tiny objects and 7 tiny objects, with pixel areas ranging from 3 to 91 and an average pixel area of 51.125). We ran inference with both models, limiting each to output only its 30 smallest detection boxes, and visualized the results, highlighting missed detections (red circles) and false positives (purple circles). The comparison results are shown in Figure 17.
As can be seen from Figure 17, RT-DETR-R18 performs poorly on these tiny targets: it misses the tiny target cluster in the upper-left corner and the person at the bookstall at the bottom of the image, and it produces serious false positives, misdetecting a large number of vehicles in the middle of the lane and two vehicles in the gas station as tiny targets. This stems from insufficient preservation of local context information and high-resolution features, which often fails to capture the fine-grained characteristics of such targets. In contrast, VRF-DETR only misses the tiny target cluster in the upper-left corner while maintaining a very low false positive rate, which shows that VRF-DETR effectively preserves fine-grained information and captures the local context of tiny targets. The MSRF2 module avoids over-smoothing of tiny features, while the GMSC block ensures that high-resolution details are preserved, enabling robust detection even for point-like objects. These results are consistent with our theoretical improvements and verify the advantages of VRF-DETR in the difficult tiny-target scenarios commonly seen in aerial remote sensing.
In order to further validate the effectiveness of the MSRF2 module and to observe the receptive field, we selected an image that is severely occluded due to perspective and input it into both the RT-DETR model and the RT-DETR model with the original encoder replaced solely by MSRF2, generating heat maps. The original image and the comparative results are presented in Figure 18.
Due to the high altitude and severe occlusion, the motors and awning-tricycles in this image are easily confused, so the coverage and intensity distribution of the heat map over such easily confused targets reflect, to some extent, the receptive field of the model for these targets. It is not difficult to observe from the two heat maps that RT-DETR attends to three fewer areas (marked by red circles) than VRF-DETR with only MSRF2. At the same time, RT-DETR focuses more on the pedestrians on the street in the center of the image, while VRF-DETR with only MSRF2 pays more attention to the concentrated parking area of the easily confused motors. It is worth noting that for objects that are not easily confused, the heat map of VRF-DETR is more concentrated, while for easily confused objects (motors with sunshades), it is more divergent. These results show that the MSRF2 module enables the model to draw on a wider range of environmental context when detecting easily confused objects and to focus more on the object itself when detecting objects that are not easily confused. This capability enables VRF-DETR to complete detection tasks better and more efficiently under the varied conditions and complex scenes of aerial remote sensing images.

5. Discussion

The previous section has analyzed the experimental results, demonstrating that the proposed model achieves state-of-the-art performance while maintaining a lightweight architecture. This section addresses several critical questions to further elucidate the model’s contributions and remaining limitations: Why is VRF-DETR not constrained by camera angle and height, and why does it outperform models of similar scale in handling perspective and altitude variations? How does it implement variable receptive field adaptation? In what respect does it still exhibit performance deficiencies? And finally, how can future research further develop this framework? These issues are examined in turn.
From a mechanistic standpoint, VRF-DETR directly targets the challenges posed by intra-class variation and inter-class similarity, which are exacerbated by changes in perspective and height in aerial images. This is achieved through a series of architectural innovations that enable adaptive receptive field modulation and enhanced feature discrimination.
Central to this capability is the multi-scale receptive field adaptive fusion (MSRF2) module, which enables dynamic adjustment of receptive field sizes. This function is essential for recognizing objects of the same category that appear under vastly different perspectives or spatial resolutions. The MSRF2 module employs a differentiated query-key-value (QKV) design, where the query vectors retain global spatial anchors to preserve positional awareness, while the key and value vectors are derived from adaptively selected multi-scale features. These features are extracted via parallel dilated convolutions with dilation rates of 3, 5, and 7 and are further refined using spatial-channel attention. This structure allows the receptive field to expand in order to capture broader contextual cues—critical for distinguishing high-altitude targets that lack fine detail—and to contract for more localized attention on well-resolved, low-altitude objects. Consequently, the model is better able to handle intra-class variation induced by differences in altitude and viewing angle.
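A conceptual sketch of this differentiated Q/K/V idea is given below: queries are projected from the original, position-preserving feature map, while keys and values come from a gated combination of parallel dilated branches (rates 3, 5, and 7). The class names, the scalar per-branch gate, and the use of a standard multi-head attention layer are simplifying assumptions for illustration and do not reproduce the MSRF2 implementation.

```python
import torch
import torch.nn as nn

class MultiScaleKV(nn.Module):
    """Keys/values source: parallel dilated branches mixed by a learned per-branch gate."""
    def __init__(self, c: int, dilations=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in dilations)
        self.gate = nn.Sequential(                      # scalar gate per branch (a stand-in
            nn.AdaptiveAvgPool2d(1),                    # for spatial-channel selection)
            nn.Conv2d(c, len(dilations), 1),
            nn.Softmax(dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B, S, C, H, W)
        w = self.gate(x).unsqueeze(2)                              # (B, S, 1, 1, 1)
        return (feats * w).sum(dim=1)                              # weighted mix of scales

class MSRF2LikeAttention(nn.Module):
    """Queries from the original grid; keys/values from the multi-scale mixture."""
    def __init__(self, c: int, heads: int = 8):
        super().__init__()
        self.kv_source = MultiScaleKV(c)
        self.q_proj = nn.Conv2d(c, c, 1)
        self.kv_proj = nn.Conv2d(c, 2 * c, 1)
        self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.q_proj(x).flatten(2).transpose(1, 2)              # (B, HW, C), keeps grid order
        k, v = self.kv_proj(self.kv_source(x)).chunk(2, dim=1)
        k = k.flatten(2).transpose(1, 2)
        v = v.flatten(2).transpose(1, 2)
        out, _ = self.attn(q, k, v)
        return out.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    print(MSRF2LikeAttention(256)(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```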
Complementing MSRF2 is the gated multi-scale context (GMSC) block, which integrates MSRF2 with attention-gated convolution (AGConv). AGConv incorporates gated linear units and depthwise separable convolutions, enabling the selective filtration of redundant or noisy features. This contributes to a 46% reduction in parameter count while simultaneously improving the quality of multi-scale feature representation. The GMSC block enhances the model’s ability to distinguish visually similar categories, which often exhibit high inter-class similarity in remote sensing scenarios.
To further mitigate cross-scale semantic inconsistencies, the context-guided fusion (CGF) module is introduced. CGF performs dimensionality-reduction attention and employs bidirectional guidance mechanisms to align semantic information across scales. This fosters stronger feature complementarity and reinforces detection robustness, particularly in cluttered or heterogeneous backgrounds commonly found in aerial imagery.
These components collectively endow VRF-DETR with the flexibility to adapt to variations in perspective and altitude, thereby addressing some of the most persistent challenges in aerial remote sensing object detection. However, despite these innovations, the model still presents certain limitations that merit further investigation.
First, the parallel dilated convolutional structure in MSRF2, while effective in receptive field modulation, introduces additional computational overhead when processing ultra-high-resolution inputs (e.g., 4K resolution). Second, the use of larger dilation rates can introduce excessive contextual information when detecting very tiny objects (e.g., targets smaller than 8 × 8 pixels), leading to occasional misclassifications due to the inclusion of irrelevant background features. Third, while this study focuses on receptive field adaptation, the potential synergy between VRF-DETR and remote-sensing-specific efficient architectures, such as RingMoE [104] and RingMo-Aerial [105], has not yet been explored. These models employ mixture-of-experts (MoE) designs that dynamically route inputs to specialized sub-networks, potentially allowing computational resources to be more effectively allocated based on object scale and scene complexity.
These observations point toward several promising directions for future research. First, optimizing MSRF2 with dynamic resolution adaptation mechanisms could alleviate the computational burden in ultra-high-resolution scenarios, maintaining real-time inference performance without compromising detection accuracy. Second, incorporating scale-aware dilation rate selection into MSRF2 could improve the model’s precision when handling extremely small objects by better balancing contextual breadth with noise suppression. Third, integrating VRF-DETR with MoE-based architectures could enhance both efficiency and scalability, enabling more precise expert routing for complex, multi-scale aerial scenes. Finally, expanding the model to handle multi-modal remote sensing data—including synthetic aperture radar (SAR) and thermal imagery—offers a pathway to improved robustness under adverse environmental conditions. In this context, the CGF module could be leveraged to fuse spectral and spatial features effectively, facilitating reliable detection performance across varying modalities and weather conditions.

6. Conclusions

In this paper, we propose VRF-DETR, a lightweight real-time object detection framework tailored to address the challenges of varying viewing angles and altitudes in aerial remote sensing images. Built on RT-DETR, the framework achieves dynamic receptive field adaptation and efficient feature extraction through three key innovations: the multi-scale receptive field adaptive fusion (MSRF2) module, the gated multi-scale context (GMSC) block, and the context-guided fusion (CGF) module. These components work synergistically to enhance the model’s ability to handle intra-class variation and inter-class similarity, which are exacerbated by perspective and altitude changes in aerial scenes.
The experimental results demonstrate that VRF-DETR achieves a significant balance between detection accuracy and efficiency. Compared to the baseline RT-DETR, it improves mAP50 by 4.9% and mAP50-95 by 3.5% on the VisDrone2019 dataset while reducing parameters by 31.7% and computational load by 21.8%. On the UAVDT dataset, it maintains state-of-the-art performance, further verifying its strong generalization in aerial remote sensing tasks. In terms of real-time capability, VRF-DETR achieves an FPS exceeding 60 under resource-constrained conditions, meeting the requirements of most drone-based real-time detection scenarios. The visualization results confirm its robustness in handling small targets, complex backgrounds, and varying perspectives.
Despite these advancements, VRF-DETR has limitations that warrant further improvement. It still faces challenges in detecting extremely tiny targets (pixel area <8 × 8) and incurs higher computational overhead when processing ultra-high-resolution inputs. Additionally, its potential integration with remote-sensing-specific efficient architectures (e.g., mixture-of-experts designs) remains unexplored.
Future work will focus on three directions: optimizing the MSRF2 module to enhance tiny target detection and reduce ultra-high-resolution computational costs; extending the framework to multi-modal remote sensing data (e.g., SAR and thermal imagery) to improve robustness under adverse conditions; and exploring synergies with MoE-based architectures to further balance efficiency and scalability. These efforts aim to strengthen VRF-DETR’s adaptability in complex aerial scenarios and expand its practical applications.

Author Contributions

Conceptualization, W.L.; methodology, W.L. and L.S.; software, W.L. and G.A.; validation, W.L., L.S. and G.A.; writing—original draft preparation, W.L.; writing—review and editing, W.L., L.S. and G.A.; visualization, W.L.; supervision, L.S. and G.A.; funding acquisition, G.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Project Number: 2023YFC3006700).

Data Availability Statement

The datasets used in this study are public datasets and can be downloaded from the official website of the corresponding datasets. The original contributions made in this study are included in the article and any further questions about the data can be obtained by contacting the corresponding authors.

Acknowledgments

We sincerely thank the editor and anonymous reviewers for their valuable feedback and constructive suggestions, which have significantly enhanced the quality of this manuscript.

Conflicts of Interest

Author Guocheng An was employed by the company Artificial Intelligence Research Institute of Shanghai Huaxun Network System Co., LTD. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Khelifi, L.; Mignotte, M. Deep learning for change detection in remote sensing images: Comprehensive review and meta-analysis. IEEE Access 2020, 8, 126385–126400. [Google Scholar] [CrossRef]
  2. Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote sensing object detection meets deep learning: A metareview of challenges and advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44. [Google Scholar] [CrossRef]
  3. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  4. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  5. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  6. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  7. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the 2020 European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  8. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  9. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  10. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  11. Cheng, R. A survey: Comparison between Convolutional Neural Network and YOLO in image identification. In Proceedings of the 2020 Journal of Physics: Conference Series, Virtual, 3–5 September 2020; p. 012139. [Google Scholar]
  12. Li, Y.; Miao, N.; Ma, L.; Shuang, F.; Huang, X. Transformer for object detection: Review and benchmark. Eng. Appl. Artif. Intell. 2023, 126, 107021. [Google Scholar] [CrossRef]
  13. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  14. Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: Efficient End-to-End Object Detection for Unmanned Aerial Vehicle Imagery. arXiv 2025, arXiv:2501.01855. [Google Scholar]
  15. Min, L.; Dou, F.; Zhang, Y.; Shao, D.; Li, L.; Wang, B. CM-YOLO: Context Modulated Representation Learning for Ship Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4202414. [Google Scholar] [CrossRef]
  16. Cheng, G.; Si, Y.; Hong, H.; Yao, X.; Guo, L. Cross-scale feature fusion for object detection in optical remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 431–435. [Google Scholar] [CrossRef]
  17. Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small object detection algorithm based on improved YOLOv8 for remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1734–1747. [Google Scholar] [CrossRef]
  18. Gao, T.; Li, Z.; Wen, Y.; Chen, T.; Niu, Q.; Liu, Z. Attention-free global multiscale fusion network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5603214. [Google Scholar] [CrossRef]
  19. Li, B.; Xiao, C.; Wang, L.; Wang, Y.; Lin, Z.; Li, M.; An, W.; Guo, Y. Dense nested attention network for infrared small target detection. IEEE Trans. Image Process. 2022, 32, 1745–1758. [Google Scholar] [CrossRef]
  20. Sun, H.; Bai, J.; Yang, F.; Bai, X. Receptive-field and direction induced attention network for infrared dim small target detection with a large-scale dataset IRDST. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5000513. [Google Scholar] [CrossRef]
  21. Zhang, T.; Li, L.; Cao, S.; Pu, T.; Peng, Z. Attention-guided pyramid context networks for detecting infrared small target under complex background. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 4250–4261. [Google Scholar] [CrossRef]
  22. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31 × 31: Revisiting large kernel design in cnns. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11963–11975. [Google Scholar]
  23. Liu, S.; Chen, T.; Chen, X.; Chen, X.; Xiao, Q.; Wu, B.; Kärkkäinen, T.; Pechenizkiy, M.; Mocanu, D.; Wang, Z. More convnets in the 2020s: Scaling up kernels beyond 51 × 51 using sparsity. arXiv 2022, arXiv:2207.03620. [Google Scholar]
  24. Chen, H.; Chu, X.; Ren, Y.; Zhao, X.; Huang, K. PeLk: Parameter-efficient large kernel convnets with peripheral convolution. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 5557–5567. [Google Scholar]
  25. Gao, R. Rethinking dilated convolution for real-time semantic segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 4675–4684. [Google Scholar]
  26. Munir, M.; Rahman, M.M.; Marculescu, R. RapidNet: Multi-Level Dilated Convolution Based Mobile Backbone. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2025; pp. 8302–8312. [Google Scholar]
  27. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16794–16805. [Google Scholar]
  28. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language modeling with gated convolutional networks. In Proceedings of the 2017 International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 933–941. [Google Scholar]
  29. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  31. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 11976–11986. [Google Scholar]
  32. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  33. Hu, Y.; Tian, S.; Ge, J. Hybrid convolutional network combining multiscale 3D depthwise separable convolution and CBAM residual dilated convolution for hyperspectral image classification. Remote Sens. 2023, 15, 4796. [Google Scholar] [CrossRef]
  34. Liu, R.; Tao, F.; Liu, X.; Na, J.; Leng, H.; Wu, J.; Zhou, T. RAANet: A residual ASPP with attention framework for semantic segmentation of high-resolution remote sensing images. Remote Sens. 2022, 14, 3109. [Google Scholar] [CrossRef]
  35. Li, Y.; Cheng, Z.; Wang, C.; Zhao, J.; Huang, L. RCCT-ASPPNet: Dual-encoder remote image segmentation based on transformer and ASPP. Remote Sens. 2023, 15, 379. [Google Scholar] [CrossRef]
  36. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote-sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612. [Google Scholar] [CrossRef]
  37. Lee, Y.; Park, J. Centermask: Real-time anchor-free instance segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 13906–13915. [Google Scholar]
  38. Deng, L.; Yang, M.; Li, T.; He, Y.; Wang, C. RFBNet: Deep multimodal networks with residual fusion blocks for RGB-D semantic segmentation. arXiv 2019, arXiv:1907.00135. [Google Scholar]
  39. Wang, J.; Li, X.; Zhou, L.; Chen, J.; He, Z.; Guo, L.; Liu, J. Adaptive receptive field enhancement network based on attention mechanism for detecting the small target in the aerial image. IEEE Trans. Geosci. Remote Sens. 2023, 62, 1–18. [Google Scholar] [CrossRef]
  40. Yang, F.; Fan, H.; Chu, P.; Blasch, E.; Ling, H. Clustered object detection in aerial images. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8311–8320. [Google Scholar]
  41. Li, C.; Yang, T.; Zhu, S.; Chen, C.; Guan, S. Density map guided object detection in aerial images. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 190–191. [Google Scholar]
  42. Duan, C.; Wei, Z.; Zhang, C.; Qu, S.; Wang, H. Coarse-grained density map guided object detection in aerial images. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2789–2798. [Google Scholar]
  43. Huang, Y.; Chen, J.; Huang, D. UFPMP-Det: Toward accurate and efficient object detection on drone imagery. In Proceedings of the 2022 AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 1026–1033. [Google Scholar]
  44. Li, Z.; Lian, S.; Pan, D.; Wang, Y.; Liu, W. AD-Det: Boosting Object Detection in UAV Images with Focused Small Objects and Balanced Tail Classes. Remote Sens. 2025, 17, 1556. [Google Scholar] [CrossRef]
  45. Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. YOLC: You only look clusters for tiny object detection in aerial images. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13863–13875. [Google Scholar] [CrossRef]
  46. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  47. Gholami, A.; Kwon, K.; Wu, B.; Tai, Z.; Yue, X.; Jin, P.; Zhao, S.; Keutzer, K. Squeezenext: Hardware-aware neural network design. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1638–1647. [Google Scholar]
  48. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  49. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNetv2: Practical guidelines for efficient cnn architecture design. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  50. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetv2: Inverted residuals and linear bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  51. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  52. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 2019 International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  53. Tan, M.; Le, Q. EfficientNetv2: Smaller models and faster training. In Proceedings of the 2021 International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  54. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Virtual, 14–19 June 2020; pp. 390–391. [Google Scholar]
  55. Tang, S.; Zhang, S.; Fang, Y. HIC-YOLOv5: Improved YOLOv5 for small object detection. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 6614–6619. [Google Scholar]
  56. Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the 2025 AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 25–27 February 2025; pp. 8673–8681. [Google Scholar]
  57. Li, C.; Zhao, R.; Wang, Z.; Xu, H.; Zhu, X. RemDet: Rethinking Efficient Model Design for UAV Object Detection. In Proceedings of the 2025 AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 25–27 February 2025; pp. 4643–4651. [Google Scholar]
  58. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  59. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  60. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  61. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  62. Qi, S.; Song, X.; Shang, T.; Hu, X.; Han, K. MSFE-YOLO: An improved yolov8 network for object detection on drone view. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 6013605. [Google Scholar] [CrossRef]
  63. Min, L.; Fan, Z.; Lv, Q.; Reda, M.; Shen, L.; Wang, B. YOLO-DCTI: Small object detection in remote sensing base on contextual transformer enhancement. Remote Sens. 2023, 15, 3970. [Google Scholar] [CrossRef]
  64. Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
  65. Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 13668–13677. [Google Scholar]
  66. Yin, N.; Liu, C.; Tian, R.; Qian, X. SDPDet: Learning scale-separated dynamic proposals for end-to-end drone-view detection. IEEE Trans. Multimed. 2024, 26, 7812–7822. [Google Scholar] [CrossRef]
  67. Wang, M.; Yang, W.; Wang, L.; Chen, D.; Wei, F.; KeZiErBieKe, H.; Liao, Y. FE-YOLOv5: Feature enhancement network based on YOLOv5 for small object detection. J. Vis. Commun. Image Represent. 2023, 90, 103752. [Google Scholar] [CrossRef]
  68. Xiao, Y.; Xu, T.; Yu, X.; Fang, Y.; Li, J. A Lightweight Fusion Strategy with Enhanced Inter-layer Feature Correlation for Small Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4708011. [Google Scholar] [CrossRef]
  69. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  70. Du, S.; Liang, X.; Wu, K.; Tian, Y.; Liu, Y.; Jian, L. Multiple scales fusion and query matching stabilization for detection with transformer. Eng. Appl. Artif. Intell. 2025, 144, 110047. [Google Scholar] [CrossRef]
  71. Dong, Y.; Cordonnier, J.B.; Loukas, A. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In Proceedings of the 2021 International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 2793–2803. [Google Scholar]
  72. Liu, S.; Cao, J.; Yang, R.; Wen, Z. Key phrase aware transformer for abstractive summarization. Inf. Process. Manag. 2022, 59, 102913. [Google Scholar] [CrossRef]
  73. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2016, 29, 4905–4913. [Google Scholar]
  74. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  75. Park, H.; Yoo, Y.; Seo, G.; Han, D.; Yun, S.; Kwak, N. C3: Concentrated-comprehensive convolution and its application to semantic segmentation. arXiv 2018, arXiv:1812.04920. [Google Scholar]
  76. Shi, D. TransNext: Robust foveal visual perception for vision transformers. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 17773–17783. [Google Scholar]
  77. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making vgg-style convnets great again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 13733–13742. [Google Scholar]
  78. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  79. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  80. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
  81. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  82. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  83. Wang, P.; Chen, P.; Yuan, Y.; Liu, D.; Huang, Z.; Hou, X.; Cottrell, G. Understanding convolution for semantic segmentation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1451–1460. [Google Scholar]
  84. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. Efficientvit: Memory efficient vision transformer with cascaded group attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14420–14430. [Google Scholar]
  85. Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4794–4803. [Google Scholar]
  86. Pan, Z.; Cai, J.; Zhuang, B. Fast vision transformers with hilo attention. Adv. Neural Inf. Process. Syst. 2022, 35, 14541–14554. [Google Scholar]
  87. Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 17425–17436. [Google Scholar]
  88. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  89. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal models for the mobile ecosystem. In Proceedings of the 2024 European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–3 October 2024; pp. 78–96. [Google Scholar]
  90. Yang, Z.; Guan, Q.; Zhao, K.; Yang, J.; Xu, X.; Long, H.; Tang, Y. Multi-branch Auxiliary Fusion YOLO with Re-parameterization Heterogeneous Convolutional for Accurate Object Detection. In Proceedings of the 2024 Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Urumqi, China, 18–20 October 2024; pp. 492–505. [Google Scholar]
  91. Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef]
  92. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  93. Wang, J.; Yu, J.; He, Z. ARFP: A novel adaptive recursive feature pyramid for object detection in aerial images. Appl. Intell. 2022, 52, 12844–12859. [Google Scholar] [CrossRef]
  94. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  95. Ma, Y.; Chai, L.; Jin, L. Scale decoupled pyramid for object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4704314. [Google Scholar] [CrossRef]
  96. Zeng, W.; Wu, P.; Wang, J.; Hu, G.; Zhao, J. C4D-YOLOv8: Improved YOLOv8 for object detection on drone-captured images. Signal Image Video Process. 2025, 19, 862. [Google Scholar] [CrossRef]
  97. Luo, W.; Yuan, S. Enhanced YOLOv8 for small-object detection in multiscale UAV imagery: Innovations in detection accuracy and efficiency. Digit. Signal Process. 2025, 158, 104964. [Google Scholar] [CrossRef]
  98. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
  99. Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression Task in DETRs as Fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842. [Google Scholar]
  100. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  101. Deng, S.; Li, S.; Xie, K.; Song, W.; Liao, X.; Hao, A.; Qin, H. A global-local self-adaptive network for drone-view object detection. IEEE Trans. Image Process. 2020, 30, 1556–1569. [Google Scholar] [CrossRef]
  102. Liao, H.; Tang, Y.; Liu, Y.; Luo, X. ViDroneNet: An efficient detector specialized for Target Detection in Aerial Images. Digit. Signal Process. 2025, 164, 105270. [Google Scholar] [CrossRef]
  103. Alqahtani, D.K.; Cheema, M.A.; Toosi, A.N. Benchmarking deep learning models for object detection on edge computing devices. In Proceedings of the 2024 International Conference on Service-Oriented Computing, Berlin, Germany, 4–8 November 2024; pp. 142–150. [Google Scholar]
  104. Bi, H.; Feng, Y.; Tong, B.; Wang, M.; Yu, H.; Mao, Y.; Chang, H.; Diao, W.; Wang, P.; Yu, Y.; et al. RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation. arXiv 2025, arXiv:2504.03166. [Google Scholar]
  105. Diao, W.; Yu, H.; Kang, K.; Ling, T.; Liu, D.; Feng, Y.; Bi, H.; Ren, L.; Li, X.; Mao, Y.; et al. RingMo-Aerial: An Aerial Remote Sensing Foundation Model with Affine Transformation Contrastive Learning. arXiv 2024, arXiv:2409.13366. [Google Scholar]
Figure 1. Successfully distinguishing a motor with a sunshade from an awning-tricycle in occluded perspectives requires broader contextual information than under normal viewing angles.
Figure 2. At high altitudes, successfully distinguishing between the easily confused pedestrian and people categories requires a wider receptive field than at low altitudes.
Figure 3. Overall architecture of VRF-DETR. RepC3 is the same as RepC3 in RT-DETR. AMSS and CA stand for Adaptive Multi-Scale Selection and Channel Attention, respectively.
Figure 4. The architecture of AIFI.
Figure 5. The architecture of MSRF2 module.
Figure 6. Structural comparison between C3 and C2F. (a) C3 in CSPDarknet; (b) C2F in YOLOv8 backbone.
Figure 7. Difference between C2F bottleneck and GMSC. (a) The bottleneck in C2F; (b) GMSC block.
Figure 8. The architecture of AGConv module.
Figure 9. The architecture of CCFF in RT-DETR.
Figure 10. The architecture of RepC3 in CCFF.
Figure 11. The architecture of CGF module.
Figure 12. Information statistics of the VisDrone2019 dataset.
Figure 13. Visualization of images under different conditions in the VisDrone2019 dataset.
Figure 14. Radar comparison chart of different models on the VisDrone2019 test set.
Figure 15. Comparison of detection results for scenes at different heights and perspectives, with height and perspective axes on the sides: the lower the image in the figure, the higher the altitude and the steeper the downward viewing angle. Red circles indicate targets missed by RT-DETR, and purple circles indicate false detections by RT-DETR.
Figure 16. Comparison of detection results in the intersection scenario.
Figure 17. Comparison of small and very small object detection results. Red circles: missed detections; purple circles: false positives.
Figure 18. Comparison of visual heat maps in scenarios with confusing objects.
Table 1. Hyperparameter configuration.
| Hyperparameter | YOLO Series | RT-DETR Series |
| --- | --- | --- |
| Input size | 640 × 640 | 640 × 640 |
| Batch size | 8 | 8 |
| Training epochs | 300 | 300 |
| Optimizer | SGD | AdamW |
| Initial learning rate | 0.01 | 0.0001 |
| Learning rate factor | 0.01 | 0.01 |
| Momentum | 0.937 | 0.9 |
| Warmup steps | 2000 | 2000 |
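The settings in Table 1 translate directly into optimizer definitions. The following is a minimal PyTorch sketch, not the authors' training script: the linear warmup shape and the scheduler wiring are our assumptions (the table specifies only the number of warmup steps), and the dataset, loss, and final decay to lr0 × 0.01 are omitted.

```python
import torch

def build_optimizer(model, family: str):
    """Optimizer settings from Table 1 (sketch only)."""
    if family == "yolo":
        # YOLO series: SGD with initial LR 0.01 and momentum 0.937
        return torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
    # RT-DETR series (VRF-DETR included): AdamW with initial LR 1e-4
    return torch.optim.AdamW(model.parameters(), lr=1e-4)

def warmup_scale(step: int, warmup_steps: int = 2000) -> float:
    """Assumed linear warmup over the first 2000 steps, then a constant factor of 1.0."""
    return min(1.0, (step + 1) / warmup_steps)

# Usage sketch:
# optimizer = build_optimizer(model, "rtdetr")
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_scale)
# scheduler.step() is then called once per training iteration.
```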
Table 2. Optimal parameter ablation results of the MSRF2 module.
| Dilated Rates | mAPs (%↑) | mAP50 (%↑) | mAP50-95 (%↑) |
| --- | --- | --- | --- |
| (1, 3, 5) | 21.4 | 50.3 | 31.1 |
| (3, 5, 7) | 21.6 | 52.1 | 32.2 |
| (5, 7, 9) | 19.9 | 50.6 | 31.2 |
The best result is highlighted in red bold.
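Table 2 fixes the dilation rates of the parallel dilated-convolution branches in MSRF2 to (3, 5, 7). The snippet below is an illustrative sketch of such parallel branches only; it is not the authors' MSRF2 implementation, since the adaptive multi-scale selection and spatial-channel attention used for fusion are replaced here by a plain summation.

```python
import torch
import torch.nn as nn

class ParallelDilatedBranches(nn.Module):
    """Three parallel 3 x 3 convolutions with the dilation rates chosen in Table 2.
    Padding equals the dilation rate, so every branch keeps the spatial size;
    a 3 x 3 kernel with dilation r covers a (2r + 1) x (2r + 1) window, i.e.
    7 x 7, 11 x 11 and 15 x 15 for rates (3, 5, 7)."""

    def __init__(self, channels: int, rates=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Plain sum instead of the attention-guided fusion used in MSRF2.
        return sum(branch(x) for branch in self.branches)

feat = torch.randn(1, 256, 20, 20)            # e.g. a 20 x 20 high-level feature map
out = ParallelDilatedBranches(256)(feat)
assert out.shape == feat.shape
```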
Table 3. Comparative experiments of the MSRF2 module.
| Module | mAP50 (%↑) | mAP50-95 (%↑) | Params (M↓) | FLOPs (G↓) |
| --- | --- | --- | --- | --- |
| Baseline (AIFI) | 47.2 | 28.7 | 20.2 | 57.0 |
| CGA [84] | 44.7 | 26.7 | 19.8 | 57.4 |
| DAT [85] | 46.5 | 28.1 | 20.0 | 57.5 |
| HiLo [86] | 47.5 | 29.1 | 19.9 | 57.4 |
| EAA [87] | 48.2 | 29.4 | 20.0 | 57.5 |
| MSRF2 (Ours) | 49.3 | 29.9 | 20.0 | 57.5 |
The best result is highlighted in red bold.
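The Params (M) and FLOPs (G) columns in Tables 3–11 can be reproduced with standard profiling tools. The sketch below shows one possible way to obtain comparable numbers; the use of the third-party thop package is our assumption, as the paper does not state which profiler was used, and thop counts multiply-accumulate operations, which detection papers commonly report as FLOPs.

```python
import torch
from thop import profile  # third-party profiler; an assumption, not stated in the paper

def complexity(model: torch.nn.Module, img_size: int = 640):
    """Return (Params in M, MACs in G) for a 640 x 640 input."""
    dummy = torch.randn(1, 3, img_size, img_size)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)
    params = sum(p.numel() for p in model.parameters())
    return params / 1e6, macs / 1e9
```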
Table 4. Optimal parameter ablation results of the GMSC block.
| Model Combination | mAP50 (%↑) | mAP50-95 (%↑) | Params (M↓) | FLOPs (G↓) |
| --- | --- | --- | --- | --- |
| A | 49.3 | 30.2 | 15.7 | 53.1 |
| B | 51.0 | 31.5 | 15.3 | 49.0 |
| C | 50.1 | 30.7 | 15.3 | 48.7 |
| D | 52.1 | 32.2 | 13.8 | 44.6 |
The best result is highlighted in red bold.
Table 5. Comparative experiments of the GMSC block.
| Module | mAP50 (%↑) | mAP50-95 (%↑) | Params (M↓) | FLOPs (G↓) |
| --- | --- | --- | --- | --- |
| Baseline (ResNet-18) | 49.3 | 29.9 | 20.0 | 57.5 |
| SwinTransformer-T [88] | 39.7 | 22.7 | 36.5 | 97.5 |
| EfficientVit-M0 [84] | 43.1 | 24.1 | 10.7 | 28.0 |
| LSKNet-T [27] | 44.7 | 26.8 | 12.6 | 37.7 |
| Mobilenetv4-S [89] | 46.5 | 27.9 | 11.4 | 48.8 |
| YOLOv8-S-Backbone | 48.9 | 29.8 | 15.5 | 53.1 |
| RS-Backbone (Ours) | 51.4 | 31.8 | 13.6 | 44.6 |
The best result is highlighted in red bold.
Table 6. Optimal parameter ablation results of the CGF module.
| Fusion Mode | mAP50 (%↑) | mAP50-95 (%↑) | Params (M↓) | FLOPs (G↓) |
| --- | --- | --- | --- | --- |
| A | 51.4 | 31.8 | 13.6 | 44.6 |
| B | 48.4 | 29.4 | 13.4 | 43.3 |
| C | 51.2 | 31.5 | 13.8 | 44.6 |
| D | 50.3 | 31.0 | 13.8 | 44.6 |
| E | 52.1 | 32.2 | 13.8 | 44.6 |
The best result is highlighted in red bold.
Table 7. Comparative experiments of the CGF module.
| Module | mAP50 (%↑) | mAP50-95 (%↑) | Params (M↓) | FLOPs (G↓) |
| --- | --- | --- | --- | --- |
| Baseline (CCFF) | 51.4 | 31.8 | 13.6 | 44.6 |
| BiFPN [61] | 46.6 | 27.9 | 14.2 | 53.8 |
| MAFPN [90] | 46.9 | 27.8 | 17.2 | 44.9 |
| HSFPN [91] | 50.0 | 30.6 | 11.9 | 44.0 |
| ContextFPN (Ours) | 52.1 | 32.2 | 13.8 | 44.6 |
The best result is highlighted in red bold.
Table 8. Ablation experiments based on the RT-DETR baseline on the VisDrone2019 validation dataset.
| Modules enabled (MSRF2 / GMSC: MSRF2, AGConv / CGF; each "-" marks a disabled module) | mAP50 (%↑) | mAP50-95 (%↑) | Params (M↓) | FLOPs (G↓) |
| --- | --- | --- | --- | --- |
| - - - - | 47.2 | 28.7 | 20.2 | 57.0 |
| - - - | 49.3 | 29.9 | 20.0 | 57.5 |
| - - - | 47.4 | 28.7 | 15.1 | 48.8 |
| - - - | 47.0 | 28.6 | 15.1 | 48.6 |
| - - | 49.5 | 30.2 | 13.6 | 44.4 |
| - - - | 47.9 | 29.6 | 20.1 | 57.3 |
| - - | 50.4 | 31.1 | 15.2 | 48.9 |
| - - | 49.6 | 30.3 | 15.2 | 48.7 |
| - | 51.4 | 31.8 | 13.6 | 44.6 |
| - - | 49.5 | 30.5 | 20.1 | 57.5 |
| - - | 47.7 | 29.1 | 15.2 | 48.6 |
| - | 50.0 | 30.7 | 13.7 | 44.4 |
|  | 52.1 | 32.2 | 13.8 | 44.6 |
The best result is highlighted in red bold.
Table 9. Comparison of VRF-DETR with benchmark and SOTA models on the VisDrone2019 val dataset. For each metric, the best result is highlighted in red bold, and the second-best result is highlighted in blue bold.
| Model | Publication | Imgsize | mAP50 (%)↑ | mAP50-95 (%)↑ | Params (M)↓ | FLOPs (G)↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Two-stage methods | | | | | | |
| Faster R-CNN [92] | NeurIPS 2015 | 640 × 640 | 31.0 | 18.2 | 41.2 | 127.0 |
| ARFP [93] | Appl Intell | 1333 × 800 | 33.9 | 20.4 | 42.7 | 193.8 |
| EMA [94] | ICASSP 2023 | 640 × 640 | 49.7 | 30.4 | 91.2 | 246.7 |
| SDP [95] | TGRS | 800 × 800 | 52.5 | 30.2 | 96.7 | - |
| One-stage methods | | | | | | |
| YOLOv8-M | - | 640 × 640 | 35.8 | 20.6 | 25.9 | 78.9 |
| YOLOv10-M [4] | NeurIPS 2024 | 640 × 640 | 44.1 | 26.4 | 15.4 | 59.1 |
| YOLOv11-M [5] | arXiv 2024 | 640 × 640 | 44.6 | 26.6 | 20.1 | 68.0 |
| YOLOv12-M [6] | arXiv 2025 | 640 × 640 | 43.1 | 26.3 | 20.2 | 67.5 |
| HIC-YOLOv5 [55] | ICRA 2024 | 640 × 640 | 44.3 | 26.0 | 10.5 | 31.2 |
| MSFE-YOLO-L [62] | GRSL 2024 | 640 × 640 | 46.8 | 29.0 | 41.6 | 160.2 |
| YOLO-DCTI [63] | Remote Sensing | 1024 × 1024 | 49.8 | 27.4 | 37.7 | 280.0 |
| Drone-YOLO-L [64] | Drones | 640 × 640 | 51.3 | 31.1 | 76.2 | - |
| C4D-YOLOv8 [96] | SIVP | 640 × 640 | 45.8 | 27.9 | 16.5 | 66.3 |
| DyHead-SODL [97] | Digit Signal Process | 640 × 640 | 50.1 | 30.4 | 25.1 | 44.7 |
| End-to-end methods | | | | | | |
| DETR [7] | ECCV 2020 | 1333 × 750 | 40.1 | 24.1 | 40.0 | 187.1 |
| Deformable DETR [8] | ICLR 2020 | 1333 × 800 | 43.1 | 27.1 | 40.2 | 172.5 |
| RT-DETR-R18 [13] | CVPR 2024 | 640 × 640 | 47.2 | 28.7 | 20.2 | 57.0 |
| RT-DETR-R50 [13] | CVPR 2024 | 640 × 640 | 50.7 | 30.9 | 41.8 | 133.2 |
| UAV-DETR-R50 [14] | arXiv 2025 | 640 × 640 | 51.1 | 31.5 | 42.0 | 170 |
| Ours | | | | | | |
| VRF-DETR | - | 640 × 640 | 52.1 | 32.2 | 13.8 | 44.6 |
For clarity, the “-” marker indicates a missing value in the original publication.
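The mAP50 and mAP50-95 values in Tables 9–11 (and mAPs in Table 10) follow the COCO-style evaluation protocol [82]. The snippet below is an illustrative pycocotools sketch rather than the authors' evaluation pipeline; the annotation and result file names are placeholders, and we assume that mAPs corresponds to COCO's small-object AP.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("visdrone2019_val_coco.json")        # ground truth in COCO format (placeholder path)
coco_dt = coco_gt.loadRes("vrf_detr_results.json")  # detections: image_id, category_id, bbox, score

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

map50_95 = evaluator.stats[0]    # AP averaged over IoU 0.50:0.95 -> mAP50-95
map50 = evaluator.stats[1]       # AP at IoU 0.50                 -> mAP50
map_small = evaluator.stats[3]   # AP on small objects            -> mAPs (assumed)
```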
Table 10. Comparison of VRF-DETR with real-time detection benchmarks and SOTA models on the VisDrone2019 test dataset.
| Model | mAPs (%)↑ | mAP50 (%)↑ | mAP50-95 (%)↑ | Params (M)↓ | FLOPs (G)↓ |
| --- | --- | --- | --- | --- | --- |
| One-stage methods | | | | | |
| YOLOv8-M | 9.0 | 33.2 | 19.0 | 25.9 | 78.9 |
| YOLOv10-M [4] | 9.7 | 34.5 | 19.5 | 15.4 | 59.1 |
| YOLOv11-M [5] | 9.8 | 35.0 | 20.3 | 20.1 | 68.0 |
| YOLOv12-M [6] | 9.4 | 33.6 | 19.2 | 20.2 | 67.5 |
| YOLOX-M [3] | 9.9 | 34.2 | 21.1 | 25.3 | 73.8 |
| FBRT-YOLO-M [56] | 9.4 | 34.4 | 19.6 | 7.36 | 58.7 |
| End-to-end methods | | | | | |
| RT-DETR-R18 [13] | 13.9 | 33.3 | 18.5 | 20.2 | 57.0 |
| RT-DETRv2-R18 [98] | 12.7 | 39.1 | 22.2 | 20.0 | 60.0 |
| D-Fine-M [99] | 13.0 | 39.6 | 22.1 | 19.2 | 56.4 |
| Ours | | | | | |
| VRF-DETR | 15.5 | 39.9 | 23.3 | 13.8 | 44.6 |
The best result is highlighted in red bold, and the second-best result is highlighted in blue bold.
Table 11. Comparison of VRF-DETR with benchmark and SOTA models on the UAVDT test dataset. For each metric, the best result is highlighted in red bold, and the second-best result is highlighted in blue bold.
| Model | Publication | Imgsize | mAP50 (%)↑ | mAP50-95 (%)↑ | Params (M)↓ | FLOPs (G)↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Two-stage methods | | | | | | |
| Faster R-CNN [92] | NeurIPS 2015 | 1024 × 540 | 30.7 | 16.9 | 41.2 | 171.3 |
| Cascade R-CNN [100] | CVPR 2018 | 1024 × 540 | 30.5 | 17.1 | 69.3 | 193.1 |
| ClusDet [40] | ICCV 2019 | 1000 × 600 | 26.5 | 13.7 | 181.0 | 1647.3 |
| DMNet [41] | CVPRW 2020 | 1000 × 600 | 24.6 | 14.7 | 228.4 | 4492.2 |
| GLSAN [101] | TIP 2020 | 1000 × 600 | 28.1 | 17.0 | 590.9 | 2186.8 |
| CDMNet [42] | ICCV 2021 | 1000 × 600 | 29.1 | 16.8 | - | - |
| UFPMP-Det [43] | AAAI 2022 | 1000 × 600 | 38.7 | 24.6 | 32.6 | 132.0 |
| AD-Det [44] | Remote Sensing | 1024 × 540 | 34.2 | 20.1 | 64.1 | 1072.1 |
| RemDet-L [57] | AAAI 2025 | 640 × 640 | 34.5 | 20.6 | 35.3 | 67.4 |
| One-stage methods | | | | | | |
| YOLOv5-M | - | 640 × 640 | 29.8 | 17.2 | 25.1 | 64.2 |
| YOLOv8-M | - | 640 × 640 | 31.0 | 18.8 | 25.9 | 78.9 |
| QueryDet [65] | CVPR 2022 | 640 × 640 | 27.2 | 14.3 | 32.3 | 125.4 |
| SDPDet [66] | TMM | 1024 × 540 | 32.0 | 20.0 | - | 139.7 |
| YOLC [45] | TITS | 1024 × 640 | 30.9 | 19.3 | 38.9 | 151 |
| FBRT-YOLO-L [56] | AAAI 2025 | - | 31.1 | 18.4 | 14.6 | 119.2 |
| ViDroneNet [102] | Digit Signal Process | 1080 × 540 | 39.2 | 23.6 | 41.98 | 100.21 |
| End-to-end methods | | | | | | |
| DETR [7] | ECCV 2020 | 1333 × 750 | 36.5 | 22.1 | 40.0 | 187.1 |
| Deformable DETR [8] | ICLR 2020 | 1333 × 800 | 28.4 | 17.1 | 40.2 | 172.5 |
| RT-DETR-R18 [13] | CVPR 2024 | 640 × 640 | 38.1 | 23.3 | 20.2 | 57.0 |
| Ours | | | | | | |
| VRF-DETR | - | 640 × 640 | 40.3 | 25.9 | 13.8 | 44.6 |
For clarity, the “-” marker indicates a missing value in the original publication.
Table 12. Comparison of the real-time performance of VRF-DETR and RT-DETR on the VisDrone2019 and UAVDT test sets at a 640 × 640 input resolution.
| Batch Size | Model | VisDrone2019 Inference Time (ms)↓ | VisDrone2019 FPS↑ | UAVDT Inference Time (ms)↓ | UAVDT FPS↑ |
| --- | --- | --- | --- | --- | --- |
| 1 | RT-DETR-R18 [13] | 12.6 | 79.4 | 13.1 | 76.3 |
| 1 | RT-DETR-R50 [13] | 18.6 | 53.8 | 18.2 | 54.9 |
| 1 | VRF-DETR (Ours) | 16.1 | 62.1 | 16.6 | 60.2 |
| 8 | RT-DETR-R18 [13] | 3.3 | 303.0 | 3.5 | 285.7 |
| 8 | RT-DETR-R50 [13] | 7.1 | 140.8 | 7.1 | 140.8 |
| 8 | VRF-DETR (Ours) | 5.7 | 175.4 | 5.4 | 185.2 |
| 16 | RT-DETR-R18 [13] | 3.3 | 303.0 | 2.8 | 357.1 |
| 16 | RT-DETR-R50 [13] | 6.5 | 153.8 | 6.2 | 161.3 |
| 16 | VRF-DETR (Ours) | 6.3 | 158.7 | 5.3 | 188.7 |
The best result is highlighted in red bold, and the second-best result is highlighted in blue bold.
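The FPS values in Table 12 are the reciprocal of the per-batch inference time (for example, 16.1 ms corresponds to 1000/16.1 ≈ 62.1 FPS at batch size 1). A minimal PyTorch timing sketch following this convention is given below; the warm-up length, iteration count, and random inputs are our assumptions rather than the authors' measurement protocol.

```python
import time
import torch

@torch.no_grad()
def benchmark(model: torch.nn.Module, batch_size: int = 1, img_size: int = 640,
              warmup: int = 50, iters: int = 200):
    """Return (per-batch latency in ms, FPS = 1000 / latency), as reported in Table 12."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device=device)

    for _ in range(warmup):              # warm-up excludes one-off allocation costs
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()

    ms = 1000.0 * (time.perf_counter() - start) / iters
    return ms, 1000.0 / ms               # e.g. 16.1 ms -> 62.1 FPS
```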