MPI-DETR: Multi-Grain Prompt and Intensity-Guided Transformer for Small-Object Detection in UAV Imagery

Zhang, Jie; Xie, Boxiang; Lin, Lingfeng; Yang, Liejun; Zhang, Xian; Meng, Yuke; Xie, Xiaojuan; Zhang, Yao; Zhang, Wei

doi:10.3390/rs18111763


by Jie Zhang ¹,², Boxiang Xie ³, Lingfeng Lin ⁴, Liejun Yang ¹, Xian Zhang ⁵, Yuke Meng ⁵, Xiaojuan Xie ⁶, Yao Zhang ⁵,* and Wei Zhang ⁷,⁸

¹ College of Information Engineering, Ningde Normal University, Ningde 352100, China

² School of Civil Engineering and Transportation, Northeast Forestry University, Harbin 150040, China

³ Aulin College, Northeast Forestry University, Harbin 150042, China

⁴ College of Life Science, Northeast Forestry University, Harbin 150042, China

⁵ The State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430072, China

⁶ College of Civil Engineering, Fuzhou University, Fuzhou 350116, China

⁷ The School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 10044 Stockholm, Sweden

⁸ Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1763; https://doi.org/10.3390/rs18111763

Submission received: 7 March 2026 / Revised: 17 May 2026 / Accepted: 23 May 2026 / Published: 1 June 2026


Highlights

What are the main findings?

We propose MPI-DETR, a novel detection framework utilizing a Dual-Stream Ranked Self-Attention (DRSA) module that maps spatial features into ordered intensity sequences.
The framework achieves state-of-the-art AP50 scores of 43.8%, 87.5%, and 92.5% on the highly challenging AI-TOD, DIOR, and NWPU VHR-10 datasets, respectively.

What are the implications of the main findings?

MPI-DETR effectively overcomes severe background clutter and semantic misalignment challenges in UAV imagery through bounded noise filtering and prompt-driven feature fusion.
It provides a highly competitive and computationally efficient visual perception solution, ideal for resource-constrained UAV platforms and edge deployment scenarios.

Abstract

Small object detection in unmanned aerial vehicle (UAV) remote sensing images remains challenging due to large-scale variations, dense object distributions, and complex background interference. Although Transformer-based detectors have improved global context modeling for remote sensing object detection, many existing designs still rely on spatial geometric relationships and conventional cross-level fusion, which may limit the aggregation of spatially scattered small-object features and introduce background interference. To address these issues, this paper proposes a Multi-granularity Prompt and Intensity Guidance Detection Transformer (MPI-DETR), an efficient end-to-end Transformer-based detector for small-object detection in UAV remote sensing images. MPI-DETR consists of three key components: a Dual-Stream Ranked Self-Attention (DRSA) module for intensity-ordered global feature aggregation, a Bilateral Tanh Gating and Cosine Attention Feature Alignment Module (BTC-FAM) for noise-resistant cross-level alignment, and a Prompt-driven Multi-granularity Fusion (PMGF) module for enhancing weak small-object details. Experiments on AI-TOD, DIOR, and NWPU VHR-10 demonstrate that MPI-DETR achieves $AP_{50}$ scores of 43.8%, 87.5%, and 92.5%, respectively. Compared with the RT-DETR-R18 baseline, MPI-DETR improves $AP_{50}$ by 3.0, 1.0, and 3.9 percentage points on the three datasets, respectively, and increases $AP_{S}$ by 2.9, 2.9, and 11.6 percentage points. It also surpasses the strongest compared models by 1.7, 1.0, and 1.3 percentage points in $AP_{50}$ on AI-TOD, DIOR, and NWPU VHR-10, respectively. These results indicate that MPI-DETR provides a robust and efficient solution for small-object perception in complex UAV remote sensing applications, especially in scenes with dense targets, background interference, and limited computational resources.

Keywords:

object detection; small-target detection; remote sensing images; detection transformer

1. Introduction

With the rapid development of unmanned aerial vehicle (UAV) remote sensing technology, its applications have become increasingly widespread in areas such as urban traffic monitoring [1,2], disaster emergency response [3,4], precision agriculture management [5,6], and military reconnaissance [7,8]. As one of the core tasks in UAV vision systems, object detection aims to accurately localize and identify objects of interest from high-resolution aerial imagery. However, unlike natural scene images, low-altitude UAV remote sensing images present numerous challenges, including drastic object scale variations caused by the top-down viewing perspective, extremely high object density, and interference from complex background textures [8]. These factors make accurate recognition of small objects within highly variable and cluttered backgrounds an extremely challenging problem.

In recent years, detectors based on convolutional neural networks (CNNs) have dominated industrial applications owing to their fast inference. By constructing deep hierarchical feature pyramids and employing multi-scale fusion strategies, these algorithms have achieved competitive accuracy on general object detection benchmarks in natural scenes. However, due to the inherent limitation of the local receptive field in convolution operations, such models often struggle to capture global contextual information, resulting in relatively high false alarm rates in complex scenarios. To address this issue, Transformer-based detectors introduce global self-attention mechanisms, significantly enhancing the model’s capability to model long-range dependencies and providing a new solution for low-altitude UAV image detection. Nevertheless, the conventional self-attention mechanism they employ incurs high computational complexity and relies heavily on geometric neighborhood relationships [9], implicitly assuming that adjacent pixels are more strongly correlated. In vast aerial scenes, however, objects of the same category are often spatially distant yet exhibit high consistency in spectral intensity or texture responses [10]. Spatial position-based scanning struggles to capture such associations between physical attributes, characterized by “spectral similarity despite spatial separation,” which leads to inefficient feature aggregation. Furthermore, existing multi-scale feature fusion networks usually adopt simple linear addition or concatenation [11]. When deep semantic features are transmitted to shallow layers, such a “passive” fusion strategy fails to distinguish weak target signals from high-frequency background noise. As a result, the robustness of deep semantic features is often diluted by shallow background clutter, causing the edge information of small objects to gradually vanish during the fusion process.

To address the above challenges, this paper proposes a detection framework, termed MPI-DETR, based on multi-granularity prompting and intensity guidance. Unlike other general-purpose models, MPI-DETR abandons the spatial scanning strategy that relies on geometric neighborhoods and instead constructs an intensity ranking mechanism based on magnitude responses. At the same time, it replaces linear addition, which lacks semantic interaction, with a semantics-driven prompt retrieval strategy, thereby improving detection accuracy for low-altitude UAV images. Specifically, MPI-DETR consists of three core components. First, a Dual-Stream Ranked Self-Attention (DRSA) module is designed, which maps features into an ordered intensity sequence domain through an intensity reordering strategy. This not only reduces the computational complexity from quadratic to sub-quadratic in the number of spatial tokens, but also enables the model to transcend spatial distance and directly cluster and enhance objects according to their intrinsic physical attributes, such as intensity responses. Second, to address noise interference in the feature flow, we propose a Bilateral Tanh Gating and Cosine Attention Feature Alignment Module (BTC-FAM), which employs a Tanh gating mechanism to construct bounded residuals and dynamically filters high-amplitude background impulse noise during feature transmission. In addition, to resolve the challenge of cross-level feature alignment, we construct a Prompt-Driven Multi-Grain Fusion Module (PMGF). This module treats deep encoded features as “global semantic prompts” and actively retrieves and “activates” small-object textures overlooked by the backbone network in a multi-granularity space, thereby achieving precise complementarity between deep semantics and shallow details.

The main contributions of this paper are summarized as follows:

We propose MPI-DETR, an end-to-end detection framework specifically designed for small-object detection in UAV remote sensing images. By incorporating intensity guidance and prompt learning mechanisms, the framework effectively alleviates the challenges posed by background clutter interference and the spatially scattered distribution of targets.
We design the DRSA module, which overcomes the limitations of conventional spatial attention by introducing an intensity reordering strategy and a dual-stream interaction mechanism, thereby enabling efficient global aggregation of features from spatially dispersed targets.
We propose the BTC-FAM and PMGF modules, which reconstruct the pathways of feature encoding and fusion from the perspectives of noise suppression and cross-level feature alignment, respectively, enhancing the model’s ability to perceive weak target signals.

2. Related Works

2.1. Object Detection in UAV Imagery

Object detection in UAV imagery is closely related to weak target perception in remote sensing scenarios, where targets often exhibit small spatial extent, ambiguous appearance, and strong background interference. In this context, several infrared and hyperspectral target perception studies have provided useful insights into robust appearance representation and background suppression. Zhao et al. [12] improved appearance modeling for infrared target tracking, showing the importance of discriminative target representation under low-contrast conditions. Recent hyperspectral video tracking methods further explored spectral-spatial angle mapping and state-aware template updating [13], spectral difference matching reduction and deep spectral target perception [14], as well as band correlation grouping and spatial-spectral information interaction [15]. In addition, local sub-block contrast and spatial-spectral gradient feature fusion have also been investigated for hyperspectral anomaly detection [16], indicating that spatial-spectral feature enhancement is valuable for identifying weak targets in complex remote sensing scenes. These studies suggest that fine-grained feature representation, background suppression, and cross-region feature interaction are essential for detecting small objects in UAV images.

Although traditional convolutional neural network (CNN)-based detectors have achieved considerable success in general scenarios, they are constrained by local receptive fields, and increasing network depth in UAV image processing can easily lead to the loss of fine-grained details of small objects. In contrast, DETR (Detection Transformer) introduces set prediction and global modeling mechanisms, capturing long-range dependencies. However, despite its strong detection capability, the DETR architecture often suffers from substantial computational overhead and parameter redundancy. To address these issues, current research has mainly focused on core computation lightweighting and multi-scale feature enhancement. Focus-DETR [17] removes redundant background computation and improves detection efficiency by combining a cross-scale foreground selector with a dual-attention mechanism. UAV-DETR [18], from another perspective, reduces model parameters by adopting an inverted residual structure and cascaded linear attention, while extending the receptive field through cross-channel dynamic sampling to preserve small-object details. E²-Former [19] introduces an edge-enhanced Transformer framework for UAV-based small-object detection, emphasizing edge-aware representation, dynamic multi-feature fusion, and fine-grained cross-scale integration under complex low-altitude scenes. To address the challenge of small objects under high-altitude UAV viewpoints, FBRT-YOLO [20] employs a feature complementary mapping module to deeply fuse shallow high-precision spatial location information with deep semantic information, thereby alleviating the loss of small-object features during deep-network downsampling. 
In addition, to cope with the challenge of dramatic object scale variation in aerial scenes, SED-DETR [21] proposes scale-enhanced deformable attention, which explicitly incorporates scale and shape priors through dynamic dilated sampling, improving the detection accuracy of dense and multi-scale objects in UAV images.

Although the above methods have made important progress in lightweight detection, edge-aware representation, scale-enhanced attention, and multi-scale feature enhancement, two issues remain insufficiently explored for small-object detection in complex UAV imagery. On the one hand, many existing attention or sampling strategies still rely on spatial geometric relationships when modeling feature interactions. This design is effective for neighboring or locally continuous targets, but it may be less efficient in aggregating spatially scattered small objects that share similar visual responses across large background regions. On the other hand, feature fusion modules such as complementary mapping, cross-level aggregation, or dynamic multi-feature fusion can enhance small-object details, but the direct interaction between shallow high-resolution features and deep semantic features may also introduce background interference when target responses are weak. In such cases, the key challenge is not only to fuse multi-level features, but also to selectively enhance target-relevant details while suppressing irrelevant high-frequency background textures. Therefore, this paper explores a detection architecture that combines response-aware global feature reorganization with noise-resistant cross-level feature alignment, aiming to improve small-object perception in UAV scenes with dense distributions and complex background clutter. Compared with closely related DETR-based UAV detectors, Focus-DETR mainly reduces redundant background computation through foreground selection and dual attention, while UAV-DETR improves lightweight modeling by combining inverted residual structures, cascaded linear attention, and dynamic sampling. These methods still organize feature interaction mainly from spatial foreground selection or spatial sampling perspectives. 
In contrast, DRSA first reorders tokens according to feature response intensity and then performs intra-bucket and inter-bucket interactions in the ranked sequence domain, which provides a different way to associate spatially separated small-object responses.

2.2. Global Context Modeling and Attention Mechanisms

Since the Vision Transformer (ViT) architecture was introduced into visual tasks, the ability of global context modeling has become a key factor in overcoming the bottleneck of object detection in complex scenes. Unlike convolutional neural networks that rely on local receptive fields, the self-attention mechanism can explicitly capture long-range dependencies between any two positions in an image. In recent years, numerous studies have focused on addressing the substantial computational cost of standard self-attention. Zhou et al. [22] proposed an adaptive sparse Transformer, which reduces computational overhead by filtering noisy interactions from irrelevant regions through a dual-branch sparse attention mechanism. Song et al. [23] explored an architecture that efficiently approximates global sparse attention through low-rank approximation, providing a solution to this problem. Chen et al. [24] further investigated an efficient global modeling scheme for Transformer in full-scene remote sensing images, demonstrating the necessity of global perception when handling ultra-high-resolution imagery. In addition, Pu et al. [25], from another perspective, effectively reduced spatial redundancy in multi-step feature interaction by introducing dynamic attention relays.

In the field of remote sensing and UAV image processing, extremely high resolution and complex background textures make efficient context modeling necessary. To address background interference in UAV images, Wang et al. [26] designed a multi-scale network based on dynamic sparse attention, which achieves efficient global feature aggregation while preserving high-resolution local details. To adapt to the arbitrary orientations and irregular distributions of objects in remote sensing images, Lin et al. [27] introduced a variant of deformable attention, enabling the network to adaptively focus on sparse foreground target regions. Beyond conventional RGB-based detection, hyperspectral target perception studies also demonstrate the importance of modeling non-local spatial-spectral relationships. For example, SiamSTU [13] enhances spectral-spatial angle representation for hyperspectral video tracking, while SiamBSI [15] models band correlation and spatial-spectral information interaction to strengthen target discrimination. These studies further indicate that weak target perception benefits from feature interaction beyond local spatial neighborhoods. However, when existing attention mechanisms attempt to match these discrete weak signals across extensive background regions, they not only suffer from low search efficiency but are also highly susceptible to interference from high-frequency background clutter encountered along the way. This misalignment between “spatial physical distance” and “semantic similarity” leads to difficulties in feature aggregation for existing global context modeling methods when handling discrete small objects in UAV imagery.

2.3. Feature Fusion and Pyramid Networks

Multi-scale feature fusion is a core paradigm for addressing the problem of object scale variation in object detection. Starting from the Feature Pyramid Network (FPN) [28] and its variants, cross-level information flow has alleviated the gap between deep semantics and shallow resolution. In recent years, research has focused on breaking static topologies and exploring dynamic feature interaction mechanisms. Cheng et al. [29] proposed a re-parameterized vision-language pyramid network, achieving multi-scale cross-modal feature guidance and fusion. Li et al. [30], in their investigation of the evolution of visual Transformer architectures, pointed out that dynamic and context-aware multi-scale feature aggregation has become crucial for improving the performance of dense prediction tasks.

In remote sensing and UAV vision tasks, given the small-object scale and complex background, the design of feature fusion pathways is critical. To address the problem that small objects are sparsely distributed in high-resolution images and often overwhelmed by the background, Liu et al. [31] proposed an efficient small-object detection framework that performs object searching and partitioning at the feature hierarchy, thereby avoiding ineffective fusion computation in redundant background regions. To cope with semantic confusion caused by multi-scale feature fusion under a large field of view, Zhang et al. [32] designed full-scale feature aggregation and grouped feature reconstruction modules, which actively reduce feature confusion and filter redundant information during cross-level interaction. Recent hyperspectral anomaly detection methods also provide relevant references for noise-resistant feature fusion. Zhao et al. [16] fused local sub-block contrast with spatial-spectral gradient features to strengthen weak anomaly responses. Huo et al. [33] proposed a multi-scale memory network with separation training to improve the discriminability of anomalous and background features. Furthermore, Huo et al. [34] designed a dual-stream background modeling network with anomaly suppression, where local and global branches are combined with attention mechanisms to enhance background modeling and suppress anomaly reconstruction. These works suggest that effective feature fusion should not simply combine multi-level features, but should selectively enhance weak target cues while suppressing background interference.

However, when facing UAV detection tasks involving small and dense objects, existing feature fusion paradigms still suffer from two limitations. On the one hand, most existing networks rely on element-wise addition or channel concatenation. In UAV images, shallow features contain high-resolution contours of small objects but are also filled with high-frequency background noise. Simple linear superposition causes shallow high-frequency noise to contaminate deep semantics, resulting in weak target signals being submerged in background clutter. On the other hand, deep high-level semantic features are often directly broadcast to shallow layers, without actively selecting or activating fine-grained textures in shallow features according to high-level semantics, which leads to “semantic misalignment” during cross-scale alignment.

3. Methods

3.1. Overall Architecture of MPI-DETR

The overall architecture of MPI-DETR is illustrated in Figure 1. Unlike other DETR-based object detection models, the proposed architecture is specifically optimized for UAV remote sensing images, where small-object features are prone to loss and background noise interference is severe. The improved architecture mainly consists of three core components: a backbone network, an Intensity-Prompt Hybrid Encoder, and a decoder. First, the input image is processed by the backbone to extract multi-scale features, denoted as $\{P_{3}, P_{4}, P_{5}\}$. To alleviate the computational redundancy of high-level features in long-range dependency modeling and enhance the perception of spatially scattered targets, we employ the DRSA module to refine deep features. Unlike attention mechanisms based on spatial grids, DRSA performs efficient feature interaction in the sequence domain through global sorting of feature intensities, thereby preserving critical semantic responses while reducing computational complexity. Second, along the multi-scale feature fusion path of the Intensity-Prompt Hybrid Encoder, we embed the BTC-FAM module. This module applies a bilateral gating mechanism together with Tanh-bounded activation to progressively filter deep features layer by layer. Such a design aims to suppress high-frequency noise impulses introduced by complex backgrounds during feature transmission, thereby ensuring the robustness of semantic features. Furthermore, to address the feature misalignment between deep semantics and shallow details, we place the PMGF module between the encoder and the decoder, where it receives shallow features from the backbone and deep outputs from the encoder in parallel. This module treats deep encoded features as “semantic prompts” and guides the selective enhancement of shallow features in a multi-granularity space by computing cosine similarity. Finally, the enhanced features are fed into a Transformer decoder with a fixed set of learnable object queries. The decoder performs query-feature interaction through multi-head attention layers and then uses prediction heads to generate the final object categories and bounding box coordinates. Through the collaborative operation of all modules, the overall architecture improves the detection performance of small and dense objects in UAV scenes while maintaining real-time capability.

3.2. Dual-Stream Ranked Self-Attention

DRSA is designed to efficiently encode global dependencies among dense small objects without relying on local geometric neighborhoods. In low-altitude UAV images, objects of the same category may be spatially scattered but share similar intensity or texture responses, while conventional self-attention is computationally expensive and often inefficient in associating such non-adjacent weak target patterns. To address these issues, we propose the DRSA module and apply it to deep features for efficient global feature encoding. As shown in Figure 2, given an input feature tensor $X \in \mathbb{R}^{B \times C \times H \times W}$, DRSA discards the conventional spatial scanning paradigm based on geometric neighborhoods and adopts an “intensity reordering” strategy. This strategy maps features from the discrete two-dimensional spatial domain into an ordered one-dimensional intensity sequence domain. Specifically, DRSA first projects the input into a query $Q$, key $K$, and value $V$ through linear transformations. Unlike conventional mechanisms, DRSA further introduces a numerical sorting index $I$. To measure the response intensity of features, we first compute the $L_{2}$ norm of the value tensor $V$ along the channel dimension, denoted as $\lVert V \rVert_{2}$. Then, all feature pixels are globally sorted according to the magnitude of this norm to generate the index $I$, and a spatial re-indexing operation (Gather) is used to synchronously rearrange $Q$, $K$, and $V$. This operation thereby “clusters” features that are originally spatially dispersed but exhibit similar response intensities:

$$I = \operatorname{argsort}(\lVert V \rVert_{2}) \tag{1}$$

$$[Q_{rk}, K_{rk}, V_{rk}] = \operatorname{Gather}([Q, K, V], I) \tag{2}$$

where $Q_{rk}$, $K_{rk}$, and $V_{rk}$ denote the reordered feature sequences. On this basis, to capture intensity dependencies at different scales, we construct a dual-stream interaction mechanism. Specifically, we reshape the ordered sequence of length $N$ into a matrix structure of $G \times M$, where $M$ denotes the preset number of elements within each group and $G = \lceil N / M \rceil$ denotes the resulting number of groups. Two complementary views are then constructed:

Intra-bucket dense stream ( $Φ_{intra}$ ): Through consecutive grouping, attention is focused on adjacent elements in the sequence. Since the sequence has already been sorted, this stream concentrates on modeling the continuity of local fine-grained intensity variations.
Inter-bucket sparse stream ( $Φ_{inter}$ ): Through strided sampling, long-range connections are established across different intensity levels. This stream is intended to capture global statistical distribution relationships across hierarchical levels.
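
As a minimal sketch of Eqs. (1) and (2) and of the two bucket views, the following pure-Python fragment ranks tokens by the $L_{2}$ norm of their value vectors and builds the index sets used by each stream. The ascending sort order, the stride $G$ used for the inter-bucket stream, the toy token values, and all function names are illustrative assumptions; they are not taken from the authors' implementation.

```python
import math


def l2_norm(vec):
    """L2 norm of one token's channel vector."""
    return math.sqrt(sum(x * x for x in vec))


def rank_and_gather(q, k, v):
    """Eq. (1)-(2): argsort tokens by ||v||_2, then gather q, k, v in that order."""
    # Assumption: ascending order; the text does not state the sort direction.
    order = sorted(range(len(v)), key=lambda i: l2_norm(v[i]))

    def gather(seq):
        return [seq[i] for i in order]

    return order, gather(q), gather(k), gather(v)


def intra_buckets(n, m):
    """Consecutive grouping: each bucket holds up to m neighbouring ranked positions."""
    return [list(range(start, min(start + m, n))) for start in range(0, n, m)]


def inter_buckets(n, m):
    """Strided sampling (assumed stride G = ceil(n / m)) across intensity levels."""
    g = math.ceil(n / m)
    return [list(range(start, n, g)) for start in range(g)]


# Eight toy tokens with two channels each; their norms are 5, 1, 10, 3, 2, 13, 4, 17.
tokens = [[3, 4], [0, 1], [6, 8], [0, 3], [0, 2], [5, 12], [0, 4], [8, 15]]
order, q_rk, k_rk, v_rk = rank_and_gather(tokens, tokens, tokens)
print(order)                          # [1, 4, 3, 6, 0, 2, 5, 7]
print(intra_buckets(len(tokens), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(inter_buckets(len(tokens), 4))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

The bucket entries are positions in the ranked sequence, not spatial indices. The intra-bucket view therefore pairs tokens of similar intensity, whereas each inter-bucket group spans the full intensity range.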

Both streams compute attention weights using smoothed exponential normalization. Formally, for the $m$-th stream ($m \in \{intra, inter\}$), we first compute its attention affinity matrix $A^{(m)}$, and then obtain the intermediate output feature $H^{(m)}$ by weighted aggregation of the reordered value vectors:

$$A^{(m)} = \frac{\exp(\tau \cdot Q_{rk} (K_{rk})^{\top})}{\sum \exp(\tau \cdot Q_{rk} (K_{rk})^{\top}) + 1} \tag{3}$$

$$H^{(m)} = A^{(m)} V_{rk} \tag{4}$$

where $\tau$ denotes a learnable temperature coefficient, and the constant term 1 in the denominator is introduced to ensure numerical stability and suppress the gradient response of non-salient features. In this manner, DRSA obtains $H^{(intra)}$, which represents local intensity consistency, and $H^{(inter)}$, which represents the global intensity distribution. Subsequently, DRSA fuses the dual-stream features through a mutual gating mechanism, in which the two branches calibrate each other’s response intensities via element-wise multiplication. Finally, to restore the spatial semantics of the features, the inverse sorting index $I^{-1}$ is used to accurately remap the processed sequence back to the original spatial coordinates, yielding $Y_{out}$:

$$Y_{out} = \operatorname{Scatter}(H^{(intra)} \odot H^{(inter)}, I^{-1}) \tag{5}$$

where $\odot$ denotes element-wise multiplication, and $\operatorname{Scatter}(\cdot)$ denotes the inverse mapping operation. It is worth noting that although DRSA introduces a global sorting operation, it remains hardware-friendly in terms of both computational overhead and gradient propagation.

In terms of time complexity, the computational complexity of standard multi-head self-attention is $O(N^{2} C)$. Here, $N = H \times W$ denotes the number of spatial tokens in a feature map with resolution $H \times W$, and $C$ denotes the channel dimension. For simplicity, the batch size and the number of attention heads are omitted in the asymptotic notation because they introduce only constant scaling factors. In DRSA, the response score used for sorting is computed from the $L_{2}$ norm of $V$ at each spatial position, which requires $O(N C)$ operations. The global sorting operation is then performed once over the $N$ response scores of each feature map to obtain the ranking index $I$, leading to a theoretical complexity of $O(N \log N)$. After sorting, the ordered sequence is divided into $G$ groups with a preset group size $M$, where $G = \lceil N / M \rceil$. This grouping step only partitions the already sorted sequence and does not introduce an additional sorting operation. For the intra-bucket and inter-bucket streams, attention is computed within the grouped tokens rather than over the full $N \times N$ spatial token pairs. Therefore, the grouped interaction cost is approximately $O(G M^{2} C) = O(N M C)$. Since $M$ is a fixed and small grouping hyperparameter in our implementation, this term can be written as $O(N C)$ with respect to $N$. Therefore, the overall time complexity of DRSA is $O(N \log N + N C)$. For multi-level feature maps, the complexity can be written as $\sum_{l} O(N_{l} \log N_{l} + N_{l} C_{l})$, where $N_{l} = H_{l} \times W_{l}$ and $C_{l}$ denote the number of spatial tokens and channels at the $l$-th feature level, respectively. This complexity is not strictly linear because of the sorting term $O(N \log N)$. However, compared with the quadratic token-to-token interaction in standard self-attention, DRSA removes the costly $N \times N$ attention matrix and achieves a strictly sub-quadratic complexity with respect to the number of spatial tokens. In practical high-resolution UAV feature maps, the sorting and grouped interaction operations introduce substantially lower computational overhead than full self-attention, while preserving global feature reorganization.

Meanwhile, from the perspective of model optimization, although the sorting operation used to obtain the index matrix $I = \operatorname{argsort}(\lVert V \rVert_{2})$ is itself non-differentiable, this property does not hinder the end-to-end training of the network. In the forward propagation of DRSA, the index $I$ and the inverse index $I^{-1}$ participate in the computation only as static “routing masks” and are detached from the computational graph. During backpropagation, the error gradients are propagated back to the original query, key, and value feature maps ($Q, K, V$) according to the recorded spatial re-indexing (Gather) and restoration (Scatter) paths. This hard-routing mechanism allows effective gradient propagation through the reordered feature tensors and avoids the memory burden caused by constructing a massive $N \times N$ attention matrix in conventional self-attention. Through this closed-loop process of “sorting–interaction–restoration,” DRSA reduces the quadratic dependency on the number of spatial tokens to a sub-quadratic form of $O(N \log N + N C)$, while enabling the model to aggregate features according to response-intensity ordering rather than only local spatial neighborhoods. As a result, even in complex UAV scenes with severe background clutter and weak object responses, the model can more effectively capture discriminative small-object features across spatially scattered regions.
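
To make the “sorting–interaction–restoration” loop and the cost argument concrete, the following pure-Python sketch chains Eqs. (1) to (5) on a toy token list and counts query-key pairs for full versus bucketed attention. It rests on several assumptions not stated in the text: the sum in Eq. (3) is taken over the keys of each bucket, the sort is ascending, the inter-bucket stride equals $G$, the linear projections are omitted, and the example sizes ($N = 6400$, $M = 64$) are chosen only for illustration. All function names are hypothetical.

```python
import math


def smoothed_attention(q, k, v, tau=1.0):
    """Eq. (3)-(4) within one bucket: weights exp(s_ij) / (sum_j exp(s_ij) + 1)."""
    out = []
    for qi in q:
        scores = [tau * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        # Shift by c for numerical safety; the "+1" term then becomes exp(-c).
        c = max(max(scores), 0.0)
        exps = [math.exp(s - c) for s in scores]
        denom = sum(exps) + math.exp(-c)
        weights = [e / denom for e in exps]
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def run_stream(q, k, v, buckets, tau):
    """Apply bucket-local attention and write results back to ranked positions."""
    out = [None] * len(q)
    for idx in buckets:
        h = smoothed_attention([q[i] for i in idx], [k[i] for i in idx],
                               [v[i] for i in idx], tau)
        for i, hi in zip(idx, h):
            out[i] = hi
    return out


def scatter_back(ranked, order):
    """Eq. (5): the token at ranked position r returns to spatial index order[r]."""
    restored = [None] * len(order)
    for r, i in enumerate(order):
        restored[i] = ranked[r]
    return restored


def drsa_forward(q, k, v, m, tau=1.0):
    """Sort, interact in two streams, gate by element-wise product, restore."""
    n = len(v)
    order = sorted(range(n), key=lambda i: math.sqrt(sum(x * x for x in v[i])))
    qr, kr, vr = ([seq[i] for i in order] for seq in (q, k, v))
    g = math.ceil(n / m)
    intra = [list(range(s, min(s + m, n))) for s in range(0, n, m)]
    inter = [list(range(s, n, g)) for s in range(g)]
    h_intra = run_stream(qr, kr, vr, intra, tau)
    h_inter = run_stream(qr, kr, vr, inter, tau)
    gated = [[a * b for a, b in zip(x, y)] for x, y in zip(h_intra, h_inter)]
    return scatter_back(gated, order)


def pair_counts(n, m):
    """Query-key pairs for full attention vs. both bucketed streams (C omitted)."""
    g = math.ceil(n / m)
    intra = sum(len(range(s, min(s + m, n))) ** 2 for s in range(0, n, m))
    inter = sum(len(range(s, n, g)) ** 2 for s in range(g))
    return n * n, intra + inter


tokens = [[3, 4], [0, 1], [6, 8], [0, 3], [0, 2], [5, 12], [0, 4], [8, 15]]
y = drsa_forward(tokens, tokens, tokens, m=4, tau=0.01)
print(len(y), len(y[0]))      # 8 2: the output keeps the spatial layout
print(pair_counts(6400, 64))  # (40960000, 819200)
```

Under these assumptions, an 80 × 80 feature map with $M = 64$ needs 819,200 query-key pairs across both streams instead of 40,960,000 for full attention, which matches the $O(N M C)$ versus $O(N^{2} C)$ comparison above. Because the weights in each row sum to less than one, a query whose scores are all strongly negative contributes a near-zero output.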

3.3. Bilateral Tanh Gating and Cosine Attention Feature Alignment Module

In object detection for UAV remote sensing images, as network depth increases, the features of small objects often face the dual challenges of high-frequency information loss and background noise interference. In the baseline detector, the activation maps in complex urban scenes tend to respond not only to target regions but also to building edges, road textures, and tree shadows. This observation indicates that cross-level fusion may transmit large-amplitude background responses together with weak target details. Therefore, the goal of BTC-FAM is not to fundamentally eliminate all background noise, but to mitigate such interference by bounding residual perturbations and aligning cross-level semantic responses. Specifically, BTC-FAM consists of a Bilateral Tanh Gating block (BTG Block) for local feature denoising and a cosine cross-attention mechanism for global semantic alignment. As shown in Figure 3, the preceding BTG Block adopts a bilateral nonlinear modulation mechanism to process the input feature

X_{in} \in R^{C \times H \times W}

. Specifically, the BTG Block first decouples the input feature into two orthogonal feature streams,

X^{(1)}

and

X^{(2)}

, through linear mapping. To introduce nonlinear constraints into the deep feature space, we define a bounded residual modulation mechanism. For the k-th feature stream (

k \in {1, 2}

), the feature update process can be formulated as:

H^{(k)} = X^{(k)} + tanh (D_{k} (X^{(k)}; Θ_{k}))

(6)

where

D_{k} (\cdot)

denotes a depthwise convolution transformation with spatial inductive bias, and

Θ_{k}

represents learnable parameters.

In implementation, the two depthwise convolution branches adopt the same initialization strategy to avoid introducing branch preference at the beginning of training. Specifically, the convolutional weights in

D_{k} (\cdot)

and the projection layer

W_{out}

are initialized using the standard Kaiming initialization, while the bias terms are initialized to zero. No branch-specific manual initialization or freezing operation is adopted. The

tanh (\cdot)

function serves as a saturating nonlinearity. Unlike the unbounded ReLU, its saturation constrains the residual perturbation of each feature to the interval

(- 1, 1)

, thereby flexibly suppressing large-amplitude noise fluctuations in the background while enhancing high-frequency details. During training, the two gating branches are optimized jointly with the whole detection network through end-to-end backpropagation. The gradient of the nonlinear residual branch is modulated by the derivative of the Tanh function, i.e.,

1 - {tanh}^{2} (\cdot)

, which naturally bounds the gradient response of large-amplitude activations. Meanwhile, the identity residual connection in Equation (6) provides a direct gradient path for

X^{(k)}

, preventing the Tanh gating branch from blocking feature propagation. Therefore, the BTG Block does not require an additional optimization strategy; its parameters are updated by the same loss function and optimizer as the rest of the detector.
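The bounded residual update of Equation (6) and its gradient behavior can be checked numerically with a small sketch. The function `scale_transform` is a hypothetical stand-in for the depthwise convolution D_k, chosen only to produce large pre-activations; it is not the layer used in the paper.

```python
import math


def scale_transform(x, gain=50.0):
    """Hypothetical stand-in for D_k: amplifies every element by `gain`."""
    return [gain * v for v in x]


def bounded_residual(x, transform):
    """Equation (6), element-wise: H = X + tanh(D(X))."""
    return [xi + math.tanh(di) for xi, di in zip(x, transform(x))]


def tanh_derivative(z):
    """Derivative of tanh, 1 - tanh^2(z), which scales the branch gradient."""
    return 1.0 - math.tanh(z) ** 2


x = [0.01, -3.0, 120.0]
h = bounded_residual(x, scale_transform)
perturbation = [hi - xi for hi, xi in zip(h, x)]
```

However large the pre-activation becomes, each entry of `perturbation` stays within one unit of zero, and `tanh_derivative` shrinks toward zero for large inputs, while the identity term still passes the gradient of `x` through unchanged.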

Subsequently, by exploiting the distributional differences between the two streams, we generate the enhanced feature

X_{enh}

through adaptive mutual gating:

X_{enh} = W_{out} (H^{(1)} ⊙ H^{(2)})

(7)

where ⊙ denotes element-wise multiplication, and

W_{out}

represents a linear projection used for feature aggregation. For the mutual gating operation, gradients are propagated to the two branches symmetrically through the element-wise multiplication. Specifically, the gradient received by one branch is weighted by the activation response of the other branch, which encourages the two streams to calibrate each other during optimization. This multiplicative interaction mechanism enables the two feature streams to dynamically weight and filter each other’s spatial responses. Based on the denoised feature

X_{enh}

obtained from the BTG Block, to address the semantic misalignment caused by variations in feature magnitude, we model the feature alignment process on a unit hypersphere and establish cosine correlations with the reference feature

Y \in R^{C \times H \times W}

. Formally, this process can be described as cosine reweighting in a normalized feature space. The generation of the attention matrix

A

and the feature aggregation are expressed jointly as follows, where the softmax term is the attention matrix:

Z = Softmax (τ \cdot \frac{P_{q} {(X_{enh})}^{⊤} P_{k} (Y)}{∥ P_{q} (X_{enh}) ∥_{2} {∥ P_{k} (Y) ∥}_{2}}) \cdot P_{v} (Y)

(8)

where

P_{q, k, v}

denote the embedding projections for the query, key, and value, respectively, which also flatten the spatial dimensions

H \times W

into a sequence of length N;

{∥ \cdot ∥}_{2}

denotes the

L_{2}

norm; and

τ

is a temperature coefficient used to regulate the entropy of the distribution. In this way, BTC-FAM is able to achieve feature reconstruction for small objects within a unified framework, enhancing the robustness of the model in complex UAV scenarios.
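A minimal sketch of the cosine attention in Equation (8) is given below. It assumes identity projections in place of P_q, P_k, and P_v, interprets the norms as per-token L2 norms, and uses an arbitrary temperature value; all three choices are assumptions made for illustration only.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Project a token onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_attention(queries, keys, values, tau=10.0):
    """Z = Softmax(tau * cos(q, k)) V with identity projections (assumption)."""
    q_unit = [l2_normalize(q) for q in queries]
    k_unit = [l2_normalize(k) for k in keys]
    channels = len(values[0])
    output = []
    for q in q_unit:
        logits = [tau * sum(a * b for a, b in zip(q, k)) for k in k_unit]
        weights = softmax(logits)
        output.append([sum(w * v[c] for w, v in zip(weights, values))
                       for c in range(channels)])
    return output


x_enh = [[1.0, 0.0], [0.0, 2.0]]            # N = 2 query tokens, C = 2
y_ref = [[3.0, 0.0], [0.0, 0.5], [1.0, 1.0]]  # reference tokens (keys and values)
z = cosine_attention(x_enh, y_ref, y_ref)
z_scaled = cosine_attention([[100.0 * v for v in q] for q in x_enh], y_ref, y_ref)
```

Because both sides are normalized, rescaling the query magnitudes leaves the result unchanged up to rounding, which is the magnitude-invariance property that motivates aligning features on the unit hypersphere.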

3.4. Prompt-Driven Multi-Grain Fusion Module

During cross-level feature fusion in remote sensing images, small-object features are often weakened by background texture interference, and the semantic gap between different feature levels is difficult to eliminate through simple linear concatenation. To address this issue, we propose the PMGF module, which performs prompt-driven multi-granularity fusion by combining a main feature alignment path with parallel local–global attention branches.

As shown in Figure 4, PMGF receives two input features with the same spatial resolution but different channel dimensions, denoted as

X_{1} \in R^{B \times C_{1} \times H \times W}

and

X_{2} \in R^{B \times C_{2} \times H \times W}

. The two features are first projected into a unified channel dimension by two independent convolutional projections:

{\tilde{X}}_{1} = ϕ_{1} (X_{1}), {\tilde{X}}_{2} = ϕ_{2} (X_{2}),

(9)

where

ϕ_{1} (\cdot)

and

ϕ_{2} (\cdot)

are

1 \times 1

convolutional projections used for channel alignment. After projection,

{\tilde{X}}_{1}

and

{\tilde{X}}_{2}

have the same size of

R^{B \times C \times H \times W}

. The main fusion path is then formulated as

F_{main} = ϕ_{m} ({\tilde{X}}_{1} + {\tilde{X}}_{2}),

(10)

where

ϕ_{m} (\cdot)

denotes the convolutional transformation after feature summation.

To enhance target-relevant details at different spatial granularities, PMGF introduces parallel Local–Global Attention branches with patch sizes

p = 2

and

p = 4

. For each aligned feature

{\tilde{X}}_{j}

(j \in {1, 2})

, the local–global attention branch at granularity p is denoted as

F_{j}^{p} = L_{p} ({\tilde{X}}_{j}), p \in {2, 4},

(11)

where

L_{p} (\cdot)

represents the Local–Global Attention operator with patch size p. Specifically, this operator partitions the input feature into local patches of size

p \times p

, models the interaction between local patch responses and global contextual information, and then restores the enhanced responses to the original spatial resolution. Thus, the

p = 2

branch focuses more on fine-grained local details, while the

p = 4

branch captures relatively larger contextual structures.

The outputs of the main path and the multi-granularity attention branches are then aggregated as

F_{fuse} = F_{main} + F_{1}^{2} + F_{1}^{4} + F_{2}^{2} + F_{2}^{4} .

(12)

Finally, the fused feature is further refined by a lightweight convolutional block with structural re-parameterization:

Z_{out} = ϕ_{o} (RepConv (ϕ_{f} (F_{fuse}))),

(13)

where

ϕ_{f} (\cdot)

and

ϕ_{o} (\cdot)

denote the convolutional layers before and after the re-parameterized convolution, respectively.

RepConv (\cdot)

represents the structural re-parameterization block, which uses a multi-branch structure during training and can be equivalently converted into a single convolutional branch during inference. Through this design, PMGF selectively strengthens small-object details from different feature levels and granularities while maintaining efficient inference.
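The equivalence behind structural re-parameterization can be sketched on a single channel. The paper does not list the branches inside RepConv, so the example assumes a RepVGG-style pair of a 3 × 3 branch and a 1 × 1 branch without normalization layers; the function names are hypothetical.

```python
def conv3x3_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out


def training_forward(image, kernel3, weight1):
    """Multi-branch form: 3x3 branch plus 1x1 branch (a per-pixel scale)."""
    y3 = conv3x3_same(image, kernel3)
    return [[y3[i][j] + weight1 * image[i][j] for j in range(len(image[0]))]
            for i in range(len(image))]


def merge_branches(kernel3, weight1):
    """Fold the 1x1 weight into the center tap of the 3x3 kernel."""
    merged = [row[:] for row in kernel3]
    merged[1][1] += weight1
    return merged


image = [[1.0, 2.0, 0.5], [0.0, -1.0, 3.0], [2.0, 1.5, -0.5]]
kernel3 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
weight1 = 0.7

multi_branch = training_forward(image, kernel3, weight1)
single_branch = conv3x3_same(image, merge_branches(kernel3, weight1))
```

The two outputs agree up to floating-point rounding, which is why the extra branch adds no cost at inference once the kernels are merged.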

4. Experiments

4.1. Datasets

To comprehensively evaluate the performance of MPI-DETR in UAV and remote sensing scenarios, we selected three representative datasets: AI-TOD [26], DIOR [35], and NWPU VHR-10 [36]. The selection of these datasets is motivated by their complementary characteristics. AI-TOD contains a large number of extremely small objects with dense distributions, making it suitable for evaluating the small-object localization capability of the proposed method. DIOR covers diverse object categories, complex backgrounds, and significant scale variations, which can be used to assess the multi-class generalization ability and robustness of MPI-DETR in large-scale remote sensing scenes. NWPU VHR-10 consists of high-resolution aerial images with dense object distributions and complex spatial layouts, providing an additional benchmark for evaluating the adaptability of the model under high-resolution remote sensing conditions. Therefore, these three datasets jointly cover key challenges in UAV and remote sensing object detection, including extreme object scale, dense spatial distribution, background interference, and multi-class scene generalization. The details of the three datasets are described as follows.

AI-TOD: This is an extremely challenging dataset specifically designed for small-object detection. It contains 28,036 images and 700,621 instances across 8 categories. Unlike conventional datasets, the average object size in AI-TOD is extremely small, at only 12.8 pixels, and the proportion of small objects (smaller than 16 pixels) is exceptionally high. We follow the official split, using 14,536 images for training, 4272 for validation, and 9228 for testing.
DIOR: This is currently one of the largest and most category-rich remote sensing object detection datasets. It contains 23,463 images and 192,472 instances, covering 20 common object categories, such as airplanes, ships, and storage tanks. This dataset exhibits dramatic object scale variations together with complex background textures, including urban areas, ports, and wild fields. It can effectively evaluate the robustness and generalization performance of MPI-DETR when facing high-frequency background noise interference and cross-scale feature alignment. We divide the dataset into training, validation, and test sets with a ratio of 7:1:2.
NWPU VHR-10: This is a classic high-resolution geospatial object detection dataset. It contains 800 ultra-high-resolution images covering 10 categories. Although the dataset is relatively small in scale, it includes rich high-resolution texture details and dense spatial distributions. The training, validation, and test sets contain 550, 100, and 150 images, respectively. We follow this split for model training and evaluation.

Table 1 summarizes the statistical details of the above three datasets.

4.2. Experimental Setup

All experiments in this study were conducted on a Linux workstation running Ubuntu 20.04, equipped with four NVIDIA RTX A10 GPUs for accelerated computation. In terms of software, the model was implemented based on the PyTorch 2.1.0 deep learning framework, with Python 3.10 and CUDA 12.2 as the runtime environment. To ensure fairness in comparative experiments, ImageNet-pretrained ResNet-18 [37] was adopted as the backbone network for feature extraction. During training, the AdamW optimizer [38] was employed for parameter updates, with both the initial learning rate and weight decay set to

1 \times 10^{- 4}

, and the momentum parameter set to 0.937. The model was trained for a total of 100 epochs with a batch size of 4. In addition, during data preprocessing, to enhance the model’s robustness to drastic scale variations under UAV viewpoints, a data augmentation strategy was adopted, including random photometric distortion to simulate different illumination conditions, as well as multi-scale geometric jittering and random cropping to improve adaptability to objects at different resolutions. These settings effectively ensure a fair experimental protocol for validating the effectiveness of the proposed MPI-DETR architecture. For reproducibility, the main structural hyperparameters of the proposed modules are as follows. In DRSA, the spatial token number is

N = H \times W

, the ranked sequence is divided into groups of a preset size M, and the group number is

G = ⌈ N / M ⌉

. The query, key, and value tensors keep the same channel dimension as the input feature level. In BTC-FAM, the input channels are split into two branches with equal channel dimensions, and the two branches are initialized and updated as described in Section 3.3. In PMGF, the two input features are first aligned to the same channel dimension by

1 \times 1

convolution, and the Local–Global Attention branches use patch sizes

p = 2

and

p = 4

.
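The relation between the token number and the group number in DRSA can be made concrete with a short calculation. The feature-map sizes and the group size M below are hypothetical values chosen for illustration; the paper does not state them in this section.

```python
import math


def drsa_group_count(height, width, group_size):
    """Return (N, G) with N = H * W and G = ceil(N / M)."""
    n_tokens = height * width
    return n_tokens, math.ceil(n_tokens / group_size)


# Hypothetical settings: an 80 x 80 and a 20 x 20 feature level, M = 64.
n_large, g_large = drsa_group_count(80, 80, 64)   # N = 6400, G = 100
n_small, g_small = drsa_group_count(20, 20, 64)   # N = 400,  G = 7
last_group_size = n_small - (g_small - 1) * 64    # 16 tokens in the final group
```

When N is not a multiple of M, the last group is shorter than the others, so an implementation needs either padding or a variable-length final group.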

4.3. Evaluation Indicators

To rigorously evaluate detection performance and ensure fair comparison with state-of-the-art methods, we follow the standard Microsoft COCO evaluation protocol [39]. The mean Average Precision (mAP) is adopted as the primary evaluation metric in this paper, which is computed by averaging AP values over IoU thresholds ranging from 0.50 to 0.95 with a step size of 0.05. Specifically,

A P_{50}

and

A P_{75}

denote the Average Precision at IoU thresholds of 0.50 and 0.75, respectively, while

A P_{50 : 95}

denotes the mean AP averaged over IoU thresholds from 0.50 to 0.95. Meanwhile, considering the extreme scale variation in UAV images, we particularly emphasize scale-specific metrics, including

A P_{S}

(area

< 32^{2}

pixels),

A P_{M}

(

32^{2} \leq

area

< 96^{2}

pixels), and

A P_{L}

(area

\geq 96^{2}

pixels). These three metrics represent the Average Precision for small, medium, and large objects, respectively, and are used to evaluate the model’s detection capability across different object scales. Furthermore, to assess the feasibility of deployment on resource-constrained UAV platforms, we also report the number of parameters (Params), floating-point operations (FLOPs), and inference speed measured in frames per second (FPS). Together, these metrics provide a comprehensive evaluation of detection accuracy, scale adaptability, and computational efficiency for UAV image detection tasks.
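The scale buckets and IoU thresholds of the COCO protocol described above can be written down directly. This is a small sketch with hypothetical helper names rather than the official evaluation code; the last line treats the 12.8-pixel mean object size reported for AI-TOD as a side length, which is an assumption.

```python
def coco_size_bucket(area):
    """Assign an object to the AP_S / AP_M / AP_L bucket by pixel area."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Ten IoU thresholds from 0.50 to 0.95 with a step of 0.05.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def mean_over_thresholds(ap_per_threshold):
    """AP_{50:95}: average of the AP values computed at each IoU threshold."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


typical_ai_tod_bucket = coco_size_bucket(12.8 * 12.8)
```

Under that assumption, a typical AI-TOD object covers about 164 square pixels, far below the 1024-pixel boundary of the small bucket.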

4.4. Comparative Experiment

To comprehensively evaluate the detection performance of MPI-DETR, this section compares MPI-DETR with various mainstream methods currently in use. These methods include CNN-based real-time detection models such as the YOLO series and its variant FBRT-YOLO, as well as Transformer-based end-to-end real-time detection models such as RT-DETR [40]. It is worth noting that, since MPI-DETR adopts ResNet-18 as its backbone, only the lightest versions of the compared models are selected in the experimental comparison of this paper. In addition, because the AI-TOD dataset mainly focuses on small objects and lacks large-object instances, the

A P_{L}

metric is not included in the evaluation on this dataset. Moreover, all compared models are trained according to their officially recommended training protocols rather than manually adjusted settings. The training epochs reported in Table 2, Table 3 and Table 4 follow the original training recipes of the corresponding models, because different detector families adopt different optimization strategies and convergence schedules. For example, CNN-based YOLO detectors generally use longer training schedules with their official learning-rate settings, whereas Transformer-based detectors usually follow different optimization recipes designed for query-based feature interaction and encoder–decoder architectures. Therefore, forcing all methods to use an identical number of epochs may lead to under-training for some models or unnecessary over-training for others, which would not fairly reflect their representative performance. To ensure a fair and transparent comparison, all methods are evaluated under the same dataset splits, input resolution, evaluation metrics, and hardware environment, while their official training schedules are retained. Table 2 presents the test results of MPI-DETR and other mainstream methods on the AI-TOD dataset. MPI-DETR obtains the highest values among the compared methods on this dataset, with an

A P_{50}

of 43.8% and an

A P_{S}

of 37.4%. Compared with the RT-DETR-R18 baseline, MPI-DETR improves

A P_{50}

and

A P_{S}

by 3.0 and 2.9 percentage points, respectively. Although its parameter count and GFLOPs are not the smallest compared with extremely lightweight CNN models, MPI-DETR uses fewer parameters and FLOPs than RT-DETR-based Transformer variants, indicating a favorable trade-off between detection accuracy and computational cost among Transformer-based models. The detection results of MPI-DETR and other models on the AI-TOD dataset are illustrated in Figure 5. Meanwhile, Table 3 and Table 4 present the quantitative comparison results of MPI-DETR and other detection models on the DIOR and NWPU VHR-10 datasets, respectively. On the DIOR dataset, MPI-DETR achieves 87.5% and 30.5% in

A P_{50}

and

A P_{S}

, respectively, corresponding to improvements of 1.0–3.8 and 1.2–4.3 percentage points over other Transformer-based models. Figure 6 shows some detection results of MPI-DETR on the DIOR dataset. On the NWPU VHR-10 dataset, MPI-DETR achieves the highest

A P_{50}

and

A P_{S}

among the compared methods, with gains of 1.3–8.5 percentage points in

A P_{50}

and 1.0–18.3 percentage points in

A P_{S}

. Figure 7 presents some qualitative results of MPI-DETR on the NWPU VHR-10 dataset. The three datasets emphasize different verification aspects. AI-TOD mainly evaluates extremely small and dense object localization, DIOR emphasizes multi-class remote sensing generalization under complex backgrounds and scale changes, and NWPU VHR-10 further examines adaptability on high-resolution aerial images with limited training samples. MPI-DETR obtains consistent

A P_{S}

gains on all three datasets, with improvements of 2.9, 2.9, and 11.6 percentage points over RT-DETR-R18 on AI-TOD, DIOR, and NWPU VHR-10, respectively. Since

A P_{S}

directly reflects small-object detection quality, these results further support the main objective of this work beyond the overall

A P_{50}

metric.

Taken together, the above experimental results show that MPI-DETR achieves competitive detection accuracy while maintaining a relatively moderate computational cost among Transformer-based detectors. Compared with RT-DETR-R18, MPI-DETR reduces the number of parameters from 20.1 M to 16.8 M and the FLOPs from 58.6 G to 48.3 G, while improving

A P_{50}

on AI-TOD, DIOR, and NWPU VHR-10 by 3.0, 1.0, and 3.9 percentage points, respectively. These results suggest that the collaborative design of DRSA, BTC-FAM, and PMGF contributes to small-object feature enhancement and cross-level feature alignment in UAV remote sensing images. Therefore, MPI-DETR provides a practical trade-off between accuracy and efficiency for UAV-based small-object detection, especially under complex backgrounds and resource-constrained deployment scenarios.

4.5. Ablation Experiment

To validate the effectiveness of each innovative module in MPI-DETR, we conducted progressive ablation experiments on the AI-TOD dataset. RT-DETR with a ResNet-18 backbone was selected as the baseline model, and all experimental settings for the ablation study were kept the same as those used in the comparative experiments. During the ablation process, the backbone, input resolution, optimizer, training schedule, data split, and evaluation protocol are kept unchanged. The only controlled variable is the insertion or replacement of the corresponding module, so the performance variation can be attributed to the investigated component under the same experimental protocol. The quantitative results are presented in Table 5.

DRSA: As shown in Row 2 of Table 5, when the spatial attention module in the baseline model is replaced with the proposed DRSA, the $A P_{S}$ of the model improves from 34.5% to 35.8%. In addition, since DRSA replaces the computationally intensive $O (N^{2})$ spatial attention with a parameter-free intensity sorting mechanism, the number of parameters and the FLOPs are both reduced. These results suggest that, in UAV-view imagery, cross-region feature aggregation based on response intensity can be both more accurate and more lightweight than conventional spatial attention.
BTC-FAM: On top of DRSA, we further embed BTC-FAM into the feature pyramid pathway. This configuration improves both $A P_{50}$ and $A P_{S}$ by 0.8 percentage points. The gain is consistent with our assumption that the high-frequency background noise prevalent in the shallow features of UAV images interferes with the representation of weak targets. By employing a bilateral tanh gating mechanism, BTC-FAM suppresses part of this large-amplitude noise before cross-level feature propagation, thereby providing cleaner features for fusion. Meanwhile, because it replaces part of the heavy spatial convolutions, the computational cost of the model is further reduced.
PMGF: Finally, we integrate PMGF into the network to form the complete MPI-DETR (Row 4 of Table 5). Compared with the previous version, $A P_{S}$ improves by another 0.8 percentage points, reaching the best result of 37.4%, while the model size is reduced to 16.8 M parameters. This result suggests that PMGF, rather than passively concatenating features, uses deep global prompts to retrieve and activate the fine-grained textures of small objects that are submerged in the shallow layers, thereby mitigating the “semantic misalignment” that arises during cross-level fusion.

From a reverse-ablation perspective, comparing the complete MPI-DETR with the adjacent incomplete variants also indicates the independent contribution of each component. Removing PMGF from the complete model reduces

A P_{S}

from 37.4% to 36.6% and

A P_{50}

from 43.8% to 42.9%. Removing BTC-FAM from the DRSA+BTC-FAM variant to the DRSA-only variant reduces

A P_{S}

from 36.6% to 35.8%. Removing DRSA and returning to the original baseline further reduces

A P_{S}

from 35.8% to 34.5%. This reverse-view analysis is based on the same controlled ablation table and provides a clearer view of the relative contribution of the three modules.
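The stepwise contributions quoted in the ablation can be recomputed from the reported AP_S values. The snippet below uses only the four figures stated in the text; the variable and function names are illustrative.

```python
# AP_S (%) reported for the progressive ablation on AI-TOD.
AP_S_BY_VARIANT = [
    ("baseline (RT-DETR-R18)", 34.5),
    ("+ DRSA", 35.8),
    ("+ BTC-FAM", 36.6),
    ("+ PMGF (full MPI-DETR)", 37.4),
]


def stepwise_gains(rows):
    """Gain of each variant over the previous one, in percentage points."""
    return [(name, round(value - rows[i][1], 1))
            for i, (name, value) in enumerate(rows[1:])]


gains = stepwise_gains(AP_S_BY_VARIANT)
total_gain = round(AP_S_BY_VARIANT[-1][1] - AP_S_BY_VARIANT[0][1], 1)
```

The three increments are 1.3, 0.8, and 0.8 percentage points, which add up to the overall 2.9-point AP_S gain over the baseline.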

4.6. Necessity Analysis of the Dual-Flow Mechanism Within DRSA

To further investigate the respective roles of the intra-bucket dense stream and the inter-bucket sparse stream in DRSA, we conducted an internal decoupled ablation study of this module on the AI-TOD dataset (as shown in Table 6). When only the intra-bucket dense stream is used, the model tends to capture the continuity of locally similar intensities, which is beneficial for modeling local contours of small objects but provides limited global span. When only the inter-bucket sparse stream is used, the model can establish response associations across broader spatial ranges, but may lose some local detail continuity. The full dual-stream design obtains the best

A P_{S}

in this comparison. Considering that the difference between the inter-only variant and the full dual-stream model is 0.5 percentage points in

A P_{S}

, we interpret this result as supportive evidence rather than statistical proof of complementarity. Repeated trials and formal significance tests will be further explored in future work.

4.7. Visualization of Discriminative Regions

To examine the internal feature representations of MPI-DETR on complex UAV images more intuitively, we employed Grad-CAM [50] to visualize heatmaps of the deep feature maps that feed the detection head. As shown in Figure 8, we selected typical aerial scenes with strong background interference and compared the activation regions of the baseline model and MPI-DETR. In urban scenes containing large buildings and tree-lined roads, the attention of the baseline model is distracted: its high-response regions fall on the complex textures of the high-rise buildings on the left, resulting in “semantic misalignment.” In contrast, MPI-DETR suppresses the background responses of the buildings and concentrates its attention on the small vehicles on the road. This observation is consistent with the role of BTC-FAM in attenuating high-frequency background clutter through its bounded residual mechanism. Meanwhile, in densely built residential scenes, the small targets closely resemble the background rooftops in visual appearance, so the activation regions of the baseline model are scattered and diffuse, and the true target signals are largely submerged by the surrounding environment. By contrast, the heatmap of MPI-DETR shows a concentrated spatial pattern that highlights the dense target group in the lower-left corner. These visualizations suggest that DRSA, which is not limited to spatial neighborhoods, and the detail retrieval capability of the PMGF module help the model enhance the feature representation of small targets across large interference regions, supporting the robustness of the MPI-DETR architecture in complex scenes.

4.8. Analysis of Feature Representations in Dual-Stream Ranked Attention

To provide an illustrative analysis of the feature-alignment behavior, we selected a representative image containing multiple spatially scattered objects of the same category from the AI-TOD test set, extracted the attention weights between one target pixel in the deep network (the

Query

pixel) and all other pixels in the image (the

Key

pixels), and computed their Euclidean spatial distances. Table 7 summarizes the distribution of average attention weights across different spatial distance intervals for this case. Meanwhile, Figure 9 visualizes the attention-distance relationship. In the baseline model, the weights of standard multi-head self-attention (MHSA) show higher responses within local neighborhoods and decrease as the physical spatial distance increases. In contrast, in MPI-DETR equipped with DRSA, the attention weights are more evenly distributed across different distance intervals when the query and key positions share similar response intensity. This case study illustrates the intended behavior of intensity-based reordering, namely associating spatially separated small-object responses beyond local neighborhoods. Since this analysis is based on a representative sample rather than large-scale statistical testing, it is used as qualitative and illustrative evidence rather than confirmatory proof of a universal attention pattern.
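The attention-distance statistic described above can be sketched as follows: for one query pixel, the attention weights of all key pixels are averaged inside Euclidean distance intervals. The interval edges, coordinates, and weights in the example are hypothetical, since the bins of Table 7 are not reproduced here.

```python
import math


def attention_by_distance(query_xy, key_xy, weights, bin_edges):
    """Mean attention weight per distance interval [edge_i, edge_{i+1})."""
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (x, y), weight in zip(key_xy, weights):
        distance = math.hypot(x - query_xy[0], y - query_xy[1])
        for b in range(n_bins):
            if bin_edges[b] <= distance < bin_edges[b + 1]:
                sums[b] += weight
                counts[b] += 1
                break
    # None marks an interval that contains no key pixel.
    return [s / c if c else None for s, c in zip(sums, counts)]


query = (0, 0)
keys = [(1, 0), (0, 2), (3, 4), (6, 8)]   # distances 1, 2, 5, 10
weights = [0.4, 0.2, 0.3, 0.1]
profile = attention_by_distance(query, keys, weights, bin_edges=[0, 3, 6, 11])
```

A profile that decays with distance corresponds to the locality reported for standard MHSA, whereas a flatter profile corresponds to the behavior reported for DRSA.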

To further provide intuitive visual evidence, we directly extracted the intermediate feature maps produced by the internal attention modules of the network, namely the standard MHSA in the baseline and the proposed DRSA, and mapped them into two-dimensional heatmaps for comparison. The visualization results in the

2 \times 5

layout of Figure 10 present two challenging UAV scenes. From the standard MHSA feature maps in the second column, the activated regions show scattered background responses and diffused feature activations. In contrast, the DRSA feature maps in the fourth column present more concentrated responses around annotated small-object regions and weaker responses in many background areas. These visual observations are consistent with the design motivation of DRSA, namely to reorganize features by intensity response and improve the aggregation of spatially separated small-object cues. Overall, the attention-distance example in Table 7 and the feature visualizations in Figure 10 provide intuitive evidence for the behavior of DRSA, while broader statistical validation across more scenes will be considered in future work.

4.9. Robustness and Discussion

In real UAV application scenarios, captured images are often affected by sensor thermal noise or transmission degradation. To verify the structural stability of MPI-DETR under adverse conditions, we conducted a controlled noise interference experiment. Without retraining the model, Gaussian noise with progressively increasing variance

σ^{2} \in {0, 0.01, 0.05, 0.1, 0.2}

was injected into the test images, and the performance degradation trajectory of the model was recorded. As shown by the interference curves in Figure 11, the detection accuracy of both the baseline model RT-DETR and MPI-DETR decreases as the noise intensity increases. However, MPI-DETR exhibits a more gradual performance degradation rate. Specifically, under severe noise interference (

σ^{2} = 0.10

), the

A P_{S}

of the baseline model drops from 34.5% to 14.1%, whereas MPI-DETR decreases from 37.4% to 24.2%. This result indicates that the proposed architecture mitigates, rather than completely eliminates, the influence of noise interference. The bounded property of bilateral tanh gating in BTC-FAM can suppress part of the large-amplitude feature perturbations during cross-level feature fusion, thereby reducing the contamination of weak small-object semantics by high-frequency background responses.
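The noise protocol and the reported degradation can be illustrated with a short sketch. It assumes pixel intensities normalized to [0, 1] and clipping after noise injection; neither detail is stated in the text, and the function names are hypothetical. The relative drops are computed from the AP_S values reported above.

```python
import random


def add_gaussian_noise(pixels, variance, seed=0):
    """Add zero-mean Gaussian noise of the given variance, then clip to [0, 1]."""
    rng = random.Random(seed)
    sigma = variance ** 0.5
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def relative_drop(clean_score, noisy_score):
    """Relative degradation in percent of the clean score."""
    return (clean_score - noisy_score) / clean_score * 100.0


noisy = add_gaussian_noise([0.2, 0.5, 0.9], variance=0.10)

# AP_S (%) at variance 0 and 0.10, as reported in the text.
baseline_drop = relative_drop(34.5, 14.1)   # about 59.1 % of the clean AP_S
mpi_detr_drop = relative_drop(37.4, 24.2)   # about 35.3 % of the clean AP_S
```

In relative terms, the baseline loses roughly 59% of its clean AP_S at this noise level, whereas MPI-DETR loses roughly 35%.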

4.10. Computational Complexity and Efficiency Analysis

To further assess the deployment feasibility of MPI-DETR, we measured the peak training video memory (VRAM) consumption and the inference speed in frames per second (FPS) of both the baseline model and MPI-DETR on a single NVIDIA RTX A10 GPU, as reported in Table 8.

The experimental results show that, by completely removing the massive

N \times N

spatial similarity matrix in standard MHSA, the DRSA module enables MPI-DETR to reduce the peak training memory consumption from 7.8 GB to 5.4 GB. This low memory dependence substantially lowers the hardware requirements for model training. More importantly, in terms of inference speed, although the model includes a global sorting operation, the computational complexity along the sequence-length dimension is reduced to

O (N log N)

, so the slight latency introduced by sorting is offset by the substantial reduction in attention interaction overhead. The measured inference speed of MPI-DETR reaches 132 FPS, a 14.8% improvement over the baseline model. These results indicate that the proposed intensity-ranking and bounded-gating mechanisms improve accuracy while also reducing memory consumption and runtime on the tested GPU, which supports the suitability of MPI-DETR for real-time UAV video stream detection.
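The efficiency figures above can be cross-checked with simple arithmetic. The baseline FPS computed below is implied by the reported 14.8% improvement rather than read from Table 8, and the token count in the last line is a hypothetical example.

```python
import math


def relative_change(old, new):
    """Signed change from `old` to `new`, in percent of `old`."""
    return (new - old) / old * 100.0


def implied_baseline(value, improvement_percent):
    """Baseline value implied by a reported relative improvement."""
    return value / (1.0 + improvement_percent / 100.0)


def pairwise_vs_sort(n_tokens):
    """Growth terms N^2 (full attention) and N log2 N (sorting)."""
    return n_tokens * n_tokens, n_tokens * math.log2(n_tokens)


memory_change = relative_change(7.8, 5.4)     # about -30.8 %
baseline_fps = implied_baseline(132.0, 14.8)  # about 115 FPS
pairs, sort_ops = pairwise_vs_sort(1024)      # hypothetical N = 1024
```

For N = 1024, the pairwise term is about one hundred times larger than the sorting term, and the gap widens as the feature map grows.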

5. Conclusions

To address the core challenges in UAV remote sensing images, including the loss of small-object features, their spatially discrete distribution, and their vulnerability to complex background clutter, we propose MPI-DETR, a lightweight end-to-end detection framework designed for UAV visual perception. By jointly integrating DRSA, BTC-FAM, and the prompt-driven PMGF, the framework relaxes the spatial-neighborhood limitation of conventional attention mechanisms and mitigates high-frequency noise contamination and semantic misalignment in cross-level fusion. Experiments on three benchmark datasets, namely AI-TOD, DIOR, and NWPU VHR-10, show that MPI-DETR improves detection accuracy over the RT-DETR-R18 baseline on all three datasets, especially for small objects, while reducing the number of parameters from 20.1 M to 16.8 M and the FLOPs from 58.6 G to 48.3 G. In addition, the feature visualizations and the Gaussian noise interference experiments indicate improved robustness to complex backgrounds and adverse imaging conditions, which makes MPI-DETR a competitive and efficient option for resource-constrained UAV platforms and edge deployment scenarios in the Internet of Things (IoT). In future work, we will further investigate the deployment efficiency of MPI-DETR on resource-constrained UAV platforms and improve its adaptability to more diverse real-world imaging conditions. In addition, extending the proposed framework to video-based UAV perception and multi-modal remote sensing data will also be explored to further enhance its practical applicability.

Author Contributions

Conceptualization, J.Z. and Y.Z.; methodology, J.Z. and Y.Z.; software, J.Z. and B.X.; validation, L.L., L.Y. and X.X.; formal analysis, X.Z. and Y.M.; investigation, B.X. and L.L.; resources, W.Z. and Y.Z.; data curation, L.Y. and X.X.; writing—original draft preparation, J.Z.; writing—review and editing, Y.Z. and W.Z.; visualization, X.Z. and Y.M.; supervision, Y.Z. and W.Z.; project administration, W.Z.; funding acquisition, J.Z. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Special Fund for Major Research Projects of Ningde Normal University (No. 2025ZX037) and the Research Projects of Ningde Normal University (No. 2024ZX49).

Data Availability Statement

The data used in this study are publicly available datasets, namely AI-TOD (https://drive.google.com/drive/folders/1uNY_rcOO5LrWibXRY6l2dvqSbK6xikJp (accessed on 18 December 2025)), DIOR (https://drive.google.com/drive/folders/1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC (accessed on 17 December 2025)), and NWPU VHR-10 (https://github.com/Gaoshuaikun/NWPU-VHR-10 (accessed on 15 December 2025)).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, N.; Ye, M.; Zhou, L.; Tang, S.; Gan, Y.; Liang, Z.; Zhu, X. Self-prompting analogical reasoning for UAV object detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 18412–18420.
Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 370–386.
Jiao, Z.; Wang, M.; Qiao, S.; Zhang, Y.; Huang, Z. Transformer-based object detection in low-altitude maritime UAV remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4210413.
Jankovic, B.; Jangirova, S.; Ullah, W.; Khan, L.U.; Guizani, M. UAV-assisted real-time disaster detection using optimized transformer model. In Proceedings of the IEEE Symposium on Computers and Communications; IEEE: Piscataway, NJ, USA, 2025; pp. 1–7.
Kelly, M.; Feirer, S.; Hogan, S.; Lyons, A.; Lin, F.; Jacygrad, E. Mapping orchard trees from UAV imagery through one growing season: A comparison between OBIA-based and three CNN-based object detection methods. Drones 2025, 9, 593.
Das, A.; Yang, Y.; Subburaj, V.H. YOLOv7 for weed detection in cotton fields using UAV imagery. AgriEngineering 2025, 7, 313.
Luo, M.; Zhao, R.; Zhang, S.; Chen, L.; Shao, F.; Meng, X. IM-CMDet: An intra-modal enhancement and cross-modal fusion network for small object detection in UAV aerial RGBT imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5008316.
Qin, H.; Xu, T.; Li, T.; Chen, Z.; Feng, T.; Li, J. MUST: The first dataset and unified framework for multispectral UAV single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 16882–16891.
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154.
Nian, Z.; Yang, W.; Chen, H. AEFFNet: Attention enhanced feature fusion network for small object detection in UAV imagery. IEEE Access 2025, 13, 26494–26505.
Zhao, D.; Gu, L.; Qian, K.; Zhou, H.; Yang, T.; Cheng, K. Target tracking from infrared imagery via an improved appearance model. Infrared Phys. Technol. 2020, 104, 103116.
Zhao, D.; Zhang, H.; Arun, P.V.; Jiao, C.; Zhou, H.; Xiang, P.; Cheng, K. SiamSTU: Hyperspectral video tracker based on spectral spatial angle mapping enhancement and state aware template update. Infrared Phys. Technol. 2025, 150, 105919.
Zhao, D.; Hu, B.; Jiang, W.; Zhong, W.; Arun, P.V.; Cheng, K.; Zhao, Z.; Zhou, H. Hyperspectral video tracker based on spectral difference matching reduction and deep spectral target perception features. Opt. Lasers Eng. 2025, 194, 109124.
Zhao, D.; Zhong, W.; Ge, M.; Jiang, W.; Zhu, X.; Arun, P.V.; Zhou, H. SiamBSI: Hyperspectral video tracker based on band correlation grouping and spatial-spectral information interaction. Infrared Phys. Technol. 2025, 151, 106063.
Zhao, D.; Xu, X.; You, M.; Arun, P.V.; Zhao, Z.; Ren, J.; Wu, L.; Zhou, H. Local sub-block contrast and spatial-spectral gradient features fusion for hyperspectral anomaly detection. Remote Sens. 2025, 17, 695.
Zheng, D.; Dong, W.; Hu, H.; Chen, X.; Wang, Y. Less is more: Focus attention for efficient DETR. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6674–6683.
Liao, N.; Zhang, Y.; Yu, Z.; Huang, J.; Zhu, M.; Peng, B. UAV-DETR: Few-parameter DETR for small object detection in high-altitude UAV images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 19, 2575–2587.
Zhang, J.; Zhang, Y.; Easa, S.M.; Xie, B.; Lin, L.; Zhou, X.; Zeng, N.; Zhang, W.; Song, M. E2-Former: An edge-enhanced transformer for UAV-based small object detection. IEEE Internet Things J. 2026; in press.
Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and better for real-time aerial image detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8673–8681.
Yin, H.; Zhu, Z.; Wang, H. SED-DETR: A scale-enhanced deformable detection transformer for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5624412.
Zhou, S.; Chen, D.; Pan, J.; Shi, J.; Yang, J. Adapt or perish: Adaptive sparse transformer with attentive feature refinement for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 2952–2963.
Song, L.; Chen, Y.; Yang, S.; Ding, X.; Ge, Y.; Chen, Y.C.; Shan, Y. Low-rank approximation for sparse attention in multi-modal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 13763–13773.
Chen, W.; Bruzzone, L.; Dang, B.; Gao, Y.; Deng, Y.; Yu, J.G.; Yuan, L.; Li, Y. REST: Holistic learning for end-to-end semantic segmentation of whole-scene remote sensing imagery. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 48, 693–710.
Pu, Y.; Xia, Z.; Guo, J.; Han, D.; Li, Q.; Li, D.; Yuan, Y.; Li, J.; Han, Y.; Song, S.; et al. Efficient diffusion transformer with step-wise dynamic attention mediators. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 424–441.
Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the International Conference on Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 3791–3798.
Lin, H.; Liu, J.; Li, X.; Wei, L.; Liu, Y.; Han, B.; Wu, Z. DCEA: DETR with concentrated deformable attention for end-to-end ship detection in SAR images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 17292–17307.
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16901–16911.
Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-based visual segmentation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163.
Liu, K.; Fu, Z.; Jin, S.; Chen, Z.; Zhou, F.; Jiang, R.; Chen, Y.; Ye, J. ESOD: Efficient small object detection on high-resolution images. IEEE Trans. Image Process. 2024, 34, 183–195.
Zhang, Y.; Wu, C.; Zhang, T.; Zheng, Y. Full-scale feature aggregation and grouping feature reconstruction-based UAV image target detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–11.
Huo, Y.; Dong, Y.; Wang, C.; Zhang, M.; Wang, H. Multi-scale memory network with separation training for hyperspectral anomaly detection. Inf. Process. Manag. 2026, 63, 104494.
Huo, Y.; Wang, S.; Wang, C.; Zhang, M.; Wang, H. Dual-stream background modeling network with anomaly suppression for hyperspectral anomaly detection. Int. J. Appl. Earth Obs. Geoinf. 2026, 148, 105233.
Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://docs.ultralytics.com/models/yolov8 (accessed on 15 May 2026).
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
Jocher, G.; Qiu, J. Ultralytics YOLO11. 2024. Available online: https://docs.ultralytics.com/models/yolo11 (accessed on 15 May 2026).
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140.
Ma, S.; Zhang, Y.; Peng, L.; Sun, C.; Ding, L.; Zhu, Y. OWRT-DETR: A novel real-time transformer network for small object detection in open water search and rescue from UAV aerial imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4205313.
Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task in DETRs as fine-grained distribution refinement. arXiv 2024, arXiv:2410.13842.
Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Zhou, X.; Shen, X. DEIM: DETR with improved matching for fast convergence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 15162–15171.
Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. arXiv 2016, arXiv:1610.02391.

Figure 1. Overall architecture of the proposed MPI-DETR. After the input image is processed by the backbone network to extract multi-scale features, the deep features are fed into the DRSA module for intensity-based global interaction. The Intensity-Prompt Hybrid Encoder is embedded with BTC-FAM to dynamically filter high-frequency impulse noise, while the PMGF module actively retrieves shallow features through deep semantic prompting. The resulting features are finally delivered to the Transformer decoder for prediction.

Figure 2. The internal structure and workflow of the DRSA module. Given the input feature $X$, DRSA first performs spatial sorting according to the feature response intensity and generates the sorting index $I$, which rearranges spatially scattered features into ordered intensity sequences. The sorted features are then projected and refined by $1 \times 1$ convolution and $3 \times 3$ depth-wise convolution to obtain $Q_{1}$, $K_{1}$, $Q_{2}$, $K_{2}$, and $V$. Based on the generated index, the Feature Gather operation aligns the query, key, and value features and feeds them into two complementary branches: the intra-bucket stream captures local dependencies among adjacent intensity-ranked features, while the inter-bucket stream models long-range correlations across different intensity groups. Finally, the outputs of the two streams are fused and restored to the original spatial order to generate the enhanced feature representation.
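
The workflow described in the Figure 2 caption (sort by intensity, gather, two attention streams, restore order) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and rests on several assumptions: the response intensity is taken as the mean absolute activation of a token, the learned $1 \times 1$ and depth-wise convolution projections are omitted (tokens serve as queries, keys, and values), the inter-bucket stream is approximated by attention over bucket means, and the two streams are fused by simple averaging. The names `attend` and `ranked_dual_stream` are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Plain scaled dot-product attention over small Python lists."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) * scale for k in keys])
        out.append([sum(wi * v[c] for wi, v in zip(w, values))
                    for c in range(len(values[0]))])
    return out

def ranked_dual_stream(tokens, bucket_size):
    n = len(tokens)
    # Assumption: "response intensity" = mean absolute activation of a token.
    intensity = [sum(abs(x) for x in t) / len(t) for t in tokens]
    # Sorting index: token positions ordered from strongest to weakest response.
    index = sorted(range(n), key=lambda i: intensity[i], reverse=True)
    ranked = [tokens[i] for i in index]  # gather tokens in ranked order
    buckets = [ranked[s:s + bucket_size] for s in range(0, n, bucket_size)]

    # Intra-bucket stream: attention among tokens of similar intensity.
    intra = [row for b in buckets for row in attend(b, b, b)]

    # Inter-bucket stream (assumption): each token attends to the bucket means.
    means = [[sum(t[c] for t in b) / len(b) for c in range(len(b[0]))]
             for b in buckets]
    inter = attend(ranked, means, means)

    # Fusion by averaging (assumption), then restore the original order:
    # the token at ranked position r came from original position index[r].
    fused = [[(a + b) / 2 for a, b in zip(r1, r2)]
             for r1, r2 in zip(intra, inter)]
    restored = [None] * n
    for r, i in enumerate(index):
        restored[i] = fused[r]
    return index, restored

tokens = [[0.1, 0.0], [2.0, -2.0], [0.5, 0.5], [1.0, 1.0]]
index, enhanced = ranked_dual_stream(tokens, bucket_size=2)
print(index)  # [1, 3, 2, 0]
```

In this toy input the two strongest tokens (positions 1 and 3) share the first bucket regardless of where they sit spatially, which is the property the ranked ordering is meant to provide.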

Figure 3. Overall architecture of BTC-FAM.

Figure 4. Overall architecture of the PMGF module.

Figure 5. Visual comparison of detection results on the AI-TOD dataset between MPI-DETR and other models. MPI-DETR produces the best detection results among the compared models, demonstrating its effectiveness in low-altitude drone image detection tasks.

Figure 6. Selected detection results of MPI-DETR on the DIOR dataset.

Figure 7. Selected detection results of MPI-DETR on the NWPU VHR-10 dataset.

Figure 8. Grad-CAM visualizations of the detections produced by the baseline RT-DETR and the proposed MPI-DETR. Brighter colors indicate regions that receive more attention from the model.
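
For readers unfamiliar with how such heat maps are obtained, the following minimal sketch shows the standard Grad-CAM computation [50]: each channel is weighted by the spatial average of its gradient, the weighted activation maps are summed, and a ReLU is applied. The toy activation and gradient values are invented for illustration, and the final division by the peak value is a common display convention assumed here, not a detail taken from this paper.

```python
def grad_cam(activations, gradients):
    """activations, gradients: lists of K channel maps, each H x W (nested lists)."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = global average of the gradient map of that channel.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum of the activation maps, followed by ReLU.
    cam = [[max(0.0, sum(w * a[y][x] for w, a in zip(weights, activations)))
            for x in range(width)] for y in range(height)]
    # Scale to [0, 1] for display (assumed convention).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Toy example: two channels on a 2 x 2 feature map (hypothetical values).
acts = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)
```

Here the second channel has a negative weight, so locations where it is active are suppressed, and the top-left location is zeroed by the ReLU.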

Figure 9. Scatter plot of attention weights as a function of physical spatial distance between pixels.

Figure 10. Fine-grained visual comparison of intermediate feature maps between standard MHSA (baseline) and the proposed DRSA (ours). Warmer colors (green/yellow) indicate stronger feature activations. Pink frames denote ground-truth boxes. Compared with standard MHSA, which displays severe scattered noise and diffuse background activations, the proposed DRSA markedly purifies the features, suppressing clutter and highlighting discrete small objects as sharp, localized activation points that align closely with the target locations.

Figure 11. Performance degradation of ${AP}_{S}$ (%) under varying Gaussian noise variances on the AI-TOD dataset.
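
A noise sweep of the kind summarized in Figure 11 could be prepared along the following lines. This is a minimal sketch under stated assumptions: pixel intensities are normalized to [0, 1], the noise is zero-mean and added independently per pixel, and the variance values are placeholders rather than the settings used in the paper. The detector evaluation that would follow each corruption step is omitted, and `add_gaussian_noise` is a hypothetical helper name.

```python
import random

def add_gaussian_noise(image, variance, rng):
    """Return a noisy copy of a 2-D list of intensities in [0, 1]."""
    sigma = variance ** 0.5
    noisy = []
    for row in image:
        # Zero-mean Gaussian noise per pixel, clipped back to the valid range.
        noisy.append([min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row])
    return noisy

image = [[0.2, 0.8], [0.5, 0.5]]   # toy 2 x 2 image
variances = [0.0, 0.01, 0.05]      # hypothetical sweep values
rng = random.Random(0)             # fixed seed for repeatability
sweep = {v: add_gaussian_noise(image, v, rng) for v in variances}
```

Each corrupted copy in `sweep` would then be passed to the detector, and ${AP}_{S}$ recorded per variance level.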

Table 1. Statistical details of the three datasets.

Dataset	Images	Instances	Classes	Resolution	Main Challenge
AI-TOD	28,036	700,621	8	$800 \times 800$	Extremely Small Objects (≈12.8 px)
DIOR	23,463	192,472	20	$800 \times 800$	Large Scale & Multi-class
NWPU VHR-10	800	3651	10	Variable sizes	High-Res & Dense Dist.

Table 2. Detection results of MPI-DETR and other mainstream models on the AI-TOD dataset. Bold text represents the best performance of each indicator.

Model	Epochs	Param (M)	FLOPs (G)	${AP}_{S}$	${AP}_{M}$	${AP}_{50}$	${AP}_{75}$	${AP}_{50 : 95}$
CNN-Based
YOLO8n [41]	200	3.1	8.9	0.321	0.152	0.386	0.185	0.170
YOLO10n [42]	200	2.8	8.7	0.302	0.148	0.365	0.178	0.164
YOLO11n [43]	200	2.6	6.6	0.325	0.155	0.389	0.188	0.169
YOLO12n [44]	200	2.6	6.6	0.331	0.158	0.393	0.192	0.173
YOLO13n [45]	200	2.4	6.4	0.335	0.160	0.396	0.196	0.175
FBRT-YOLO-N [20]	300	0.9	6.9	0.342	0.165	0.402	0.201	0.179
Transformer-Based
RT-DETR-R18 [40]	120	20.1	58.6	0.345	0.171	0.408	0.205	0.182
RT-DETRv2-R18 [46]	120	20.1	58.6	0.351	0.175	0.415	0.211	0.187
L-OWRT-DETR [47]	100	19.1	54.2	0.348	0.172	0.410	0.207	0.184
D-FINE-N [48]	160	3.7	7.3	0.326	0.162	0.395	0.193	0.173
DEIM [49]	160	3.7	7.3	0.355	0.178	0.421	0.216	0.190
MPI-DETR (Ours)	120	16.8	48.3	0.374	0.185	0.438	0.231	0.202

Table 3. Detection results of MPI-DETR and other models on the DIOR dataset. Bold text represents the best performance of each indicator.

Model	Epochs	Param (M)	FLOPs (G)	${AP}_{S}$	${AP}_{M}$	${AP}_{L}$	${AP}_{50}$	${AP}_{75}$	${AP}_{50 : 95}$
CNN-Based
YOLO8n [41]	200	3.1	8.9	0.225	0.462	0.751	0.824	0.658	0.602
YOLO10n [42]	200	2.8	8.7	0.231	0.456	0.760	0.829	0.668	0.615
YOLO11n [43]	200	2.6	6.6	0.221	0.465	0.771	0.836	0.678	0.619
YOLO12n [44]	200	2.6	6.6	0.227	0.473	0.787	0.841	0.688	0.629
YOLO13n [45]	200	2.4	6.4	0.223	0.474	0.782	0.845	0.686	0.632
FBRT-YOLO-N [20]	300	0.9	6.9	0.208	0.433	0.723	0.792	0.628	0.572
Transformer-Based
RT-DETR-R18 [40]	120	20.1	58.6	0.276	0.517	0.803	0.865	0.705	0.651
RT-DETRv2-R18 [46]	120	20.1	58.6	0.277	0.520	0.791	0.861	0.704	0.647
L-OWRT-DETR [47]	100	19.1	54.2	0.293	0.505	0.792	0.854	0.698	0.643
D-FINE-N [48]	160	3.7	7.3	0.262	0.485	0.768	0.839	0.672	0.635
DEIM [49]	160	3.7	7.3	0.264	0.471	0.763	0.837	0.662	0.623
MPI-DETR (Ours)	120	16.8	48.3	0.305	0.531	0.815	0.875	0.718	0.662

Table 4. Detection results of MPI-DETR and other models on the NWPU VHR-10 dataset. Bold text represents the best performance of each indicator.

Model	Epochs	Param (M)	FLOPs (G)	${AP}_{S}$	${AP}_{M}$	${AP}_{L}$	${AP}_{50}$	${AP}_{75}$	${AP}_{50 : 95}$
CNN-Based
YOLO8n [41]	200	3.1	8.9	0.319	0.526	0.619	0.904	0.623	0.552
YOLO10n [42]	200	2.8	8.7	0.149	0.477	0.582	0.840	0.543	0.512
YOLO11n [43]	200	2.6	6.6	0.322	0.513	0.624	0.896	0.628	0.556
YOLO12n [44]	200	2.6	6.6	0.295	0.484	0.595	0.859	0.591	0.524
YOLO13n [45]	200	2.4	6.4	0.301	0.483	0.594	0.852	0.587	0.529
FBRT-YOLO-N [20]	300	0.9	6.9	0.313	0.472	0.615	0.866	0.555	0.521
Transformer-Based
RT-DETR-R18 [40]	120	20.1	58.6	0.216	0.529	0.662	0.886	0.630	0.570
RT-DETRv2-R18 [46]	120	20.1	58.6	0.234	0.537	0.619	0.894	0.649	0.573
L-OWRT-DETR [47]	100	19.1	54.2	0.267	0.550	0.646	0.906	0.637	0.577
D-FINE-N [48]	160	3.7	7.3	0.283	0.547	0.632	0.904	0.633	0.570
DEIM [49]	160	3.7	7.3	0.275	0.556	0.647	0.912	0.644	0.580
MPI-DETR (Ours)	120	16.8	48.3	0.332	0.576	0.685	0.925	0.662	0.598

Table 5. Results of progressive ablation experiments for the core components of MPI-DETR on the AI-TOD dataset. All experiments use ResNet-18 as the backbone network for fair comparison.

Model	DRSA	BTC-FAM	PMGF	Params (M)	FLOPs (G)	${AP}_{S}$	${AP}_{M}$	${AP}_{50}$
Baseline	×	×	×	20.1	58.6	0.345	0.171	0.408
+DRSA	✓	×	×	18.5	53.0	0.358	0.176	0.421
+BTC-FAM	✓	✓	×	17.9	50.5	0.366	0.180	0.429
MPI-DETR	✓	✓	✓	16.8	48.3	0.374	0.185	0.438
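
To make the contribution of each component in Table 5 explicit, the stage-to-stage gains can be computed as below. This is an illustrative calculation on the tabulated values only; the stage label `+PMGF` is shorthand introduced here for the full MPI-DETR row.

```python
# Rows of Table 5 in order: Baseline, +DRSA, +BTC-FAM, MPI-DETR (adds PMGF).
stages = ["Baseline", "+DRSA", "+BTC-FAM", "+PMGF"]
ap_s = [0.345, 0.358, 0.366, 0.374]
ap_50 = [0.408, 0.421, 0.429, 0.438]

def stepwise_gain(values):
    """Gain of each stage over the previous one, in percentage points."""
    return [round((b - a) * 100, 1) for a, b in zip(values, values[1:])]

gain_ap_s = dict(zip(stages[1:], stepwise_gain(ap_s)))
gain_ap_50 = dict(zip(stages[1:], stepwise_gain(ap_50)))
print(gain_ap_s)   # {'+DRSA': 1.3, '+BTC-FAM': 0.8, '+PMGF': 0.8}
print(gain_ap_50)  # {'+DRSA': 1.3, '+BTC-FAM': 0.8, '+PMGF': 0.9}
```

On these numbers DRSA accounts for the largest single step, while BTC-FAM and PMGF each add a further 0.8 to 0.9 points.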

Table 6. Internal ablation study of the Dual-Stream Ranked Self-Attention (DRSA) on the AI-TOD dataset.

Model Variant	Intra-Bucket Stream	Inter-Bucket Stream	${AP}_{S}$ (%)	${AP}_{M}$ (%)	${AP}_{50}$ (%)
Baseline (No DRSA)	×	×	34.5	17.1	40.8
DRSA (Intra-only)	✓	×	35.1	17.3	41.3
DRSA (Inter-only)	×	✓	35.3	17.4	41.6
DRSA (Full Dual-Stream)	✓	✓	35.8	17.6	42.1

Table 7. Statistics of average attention weights assigned to Key pixels at different spatial distances from the Query pixel.

Model	Avg. Weights (Dist. 0–20 px)	Avg. Weights (Dist. 50–100 px)	Avg. Weights (Dist. 150–300 px)	Avg. Weights (Dist. > 400 px)
Baseline (Standard Multi-Head Self-Attention)	0.85	0.22	0.08	0.02
DRSA (Intensity-Ranked)	0.65	0.61	0.58	0.55
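
One simple way to summarize the distance decay in Table 7 is the ratio between the average weight in the farthest bin and that in the nearest bin. The ratio is an illustrative statistic chosen here, not a metric defined in the paper.

```python
# Average attention weights copied from Table 7, ordered by distance bin.
bins = ["0-20 px", "50-100 px", "150-300 px", ">400 px"]
weights = {
    "baseline_mhsa": [0.85, 0.22, 0.08, 0.02],
    "drsa": [0.65, 0.61, 0.58, 0.55],
}

def far_to_near_ratio(w):
    """Weight in the farthest bin divided by weight in the nearest bin."""
    return round(w[-1] / w[0], 3)

ratios = {name: far_to_near_ratio(w) for name, w in weights.items()}
print(ratios)  # {'baseline_mhsa': 0.024, 'drsa': 0.846}
```

By this measure the baseline retains about 2% of its near-range weight beyond 400 px, whereas DRSA retains about 85%.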

Table 8. Empirical efficiency comparison of training VRAM and inference speed. Bold text represents the best performance of each indicator.

Model	Params (M)	FLOPs (G)	Training VRAM (GB)	Inference Speed (FPS)
Baseline	20.1	58.6	7.8	115
MPI-DETR (Ours)	16.8	48.3	5.4	132
Relative Change	$-16.4\%$	$-17.6\%$	$-30.8\%$	$+14.8\%$
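
The relative changes in the last row of Table 8 follow directly from the two rows above it, as this short check illustrates. The dictionary keys are names chosen here for convenience.

```python
# Values copied from Table 8 (baseline vs. MPI-DETR).
baseline = {"params_m": 20.1, "flops_g": 58.6, "vram_gb": 7.8, "fps": 115}
mpi_detr = {"params_m": 16.8, "flops_g": 48.3, "vram_gb": 5.4, "fps": 132}

def relative_change(old, new):
    """Signed relative change in percent, rounded to one decimal place."""
    return round((new - old) / old * 100.0, 1)

changes = {k: relative_change(baseline[k], mpi_detr[k]) for k in baseline}
print(changes)
# {'params_m': -16.4, 'flops_g': -17.6, 'vram_gb': -30.8, 'fps': 14.8}
```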

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, J.; Xie, B.; Lin, L.; Yang, L.; Zhang, X.; Meng, Y.; Xie, X.; Zhang, Y.; Zhang, W. MPI-DETR: Multi-Grain Prompt and Intensity-Guided Transformer for Small-Object Detection in UAV Imagery. Remote Sens. 2026, 18, 1763. https://doi.org/10.3390/rs18111763
