EAGLE-DET: Edge-Aware Global–Local Enhancement for Small Object Detection in UAV Aerial Imagery

Tao, Yimeng; Ding, Yan; Mo, Bo; Zhang, Bozhi; Zhao, Chunbo; Li, Dawei

doi:10.3390/s26113554


by Yimeng Tao ¹, Yan Ding ¹,*, Bo Mo ¹, Bozhi Zhang ², Chunbo Zhao ¹ and Dawei Li ³

¹ School of Aerospace Engineering, Beijing Institute of Technology, Beijing 100081, China

² Beijing Special Machinery Research Institute, Beijing 100089, China

³ Southwest Institute of Technical Physics, Chengdu 610041, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(11), 3554; https://doi.org/10.3390/s26113554

Submission received: 4 April 2026 / Revised: 24 May 2026 / Accepted: 30 May 2026 / Published: 3 June 2026

(This article belongs to the Section Navigation and Positioning)


Abstract

Small object detection in UAV aerial imagery poses significant challenges due to sparse pixel representation and ambiguous object boundaries. Through systematic analysis, we identify three critical degradation stages during forward propagation in deep detection networks: edge attenuation during feature extraction, semantic conflict during feature fusion, and detail loss during feature reconstruction. Existing methods address these stages in isolation or implicitly, lacking collaborative and stage-aware repair strategies. To address this issue, we propose EAGLE-DET, a novel detection framework based on sparse multi-scale attention and refined transformation. Specifically, the framework comprises three core modules: (1) the Cross-stage Multi-resolution Edge Enhancement Network (CMENet), which preserves small object edge representations via adaptive high-low frequency decomposition; (2) the Attention-guided Multi-scale Feature Fusion Network (AMFFN), which resolves cross-scale semantic conflicts through pyramidal sparse attention and multi-scale spatial decoupling; (3) the Enhanced Upsampling with Channel Bridging and Spatial Coordination module (EUCBSC), which recovers spatial detail fidelity via bidirectional channel shift mixing. Extensive experiments on three benchmark datasets—VisDrone-2019, UAVDT, and DOTA1.0—demonstrate the effectiveness of EAGLE-DET, which achieves improvements of 4.5% AP50 and 2.9% AP50:95 on VisDrone-2019 over the baseline, while maintaining inference at 71.7 FPS, achieving an optimal accuracy–efficiency trade-off.

Keywords: UAV object detection; small object detection; edge enhancement; multi-scale feature fusion; sparse attention mechanism

1. Introduction

Unmanned aerial vehicle (UAV)-based object detection has emerged as a crucial technology with diverse applications, including traffic monitoring, disaster rescue, and urban surveillance [1]. Compared with ground-view imagery, UAV aerial images offer wide fields of view and flexible acquisition, but they impose severe detection challenges: objects captured from high altitudes typically occupy only tens of pixels, possess blurred boundaries, and appear densely distributed against complex backgrounds. According to the MS COCO benchmark [2], small objects are formally defined as objects with areas smaller than 32 × 32 pixels. As shown in Figure 1, small objects dominate in three representative UAV aerial datasets: VisDrone-2019 [3], UAVDT [4], and DOTA1.0 [5], making small object detection a central bottleneck for UAV vision systems.

Deep convolutional neural networks have achieved remarkable success in general object detection tasks, but face fundamental challenges of insufficient feature representation in aerial small object detection scenarios. Researchers have proposed improvement schemes from different perspectives. Swin Transformer [6] enhances global feature modeling capability through hierarchical window attention mechanisms, but has high computational complexity and lacks explicit modeling of edge information. ConvNeXt [7] and RepViT [8] achieve good balance between efficiency and accuracy, but still show inadequacy in small object edge feature extraction. BiFPN [9] achieves more flexible multi-scale fusion through bidirectional feature pyramids and learnable weights, while AFPN [10] introduces progressive feature fusion strategies, but these methods essentially still rely on linear fusion paradigms, lacking deep modeling of cross-scale feature semantic alignment. CARAFE [11] proposes content-aware feature reorganization strategies, and DySample [12] designs dynamic sampling mechanisms, but these methods mainly focus on feature recovery in the spatial dimension, neglecting the important role of inter-channel information interaction in small object feature enhancement. More critically, existing research lacks systematic understanding and collaborative solutions for the degradation problem of small objects in network forward propagation.

This paper identifies three critical degradation stages. First, edge attenuation during the feature extraction stage [13]. Successive downsampling in backbone networks progressively attenuates the high-frequency edge information that small objects critically depend on. Traditional residual architectures lack stage-aware protection mechanisms for such edge features, causing them to weaken or vanish in deeper layers. Second, semantic conflict during the feature fusion stage [9,14]. When fusing features across large-scale spans, simple linear concatenation or element-wise addition cannot resolve the substantial gap in semantic abstraction levels between shallow and deep feature maps, generating information redundancy that masks critical small object features. Third, detail loss during the feature reconstruction stage [11,12]. Conventional upsampling operators such as bilinear and nearest-neighbor interpolation blend target edge pixels with background during resolution recovery, introducing aliasing artifacts that dilute the spatial structural integrity of small objects. Crucially, these three stages are sequential and mutually compounding: unaddressed degradation at each stage amplifies errors in subsequent stages.

We propose EAGLE-DET, which addresses the three-stage degradation through cascaded feature repair. For edge attenuation, the CMENet backbone employs the Cross-Stage Pyramidal Multi-Resolution Edge Enhancement (CSPMEE) module to construct multi-scale feature pyramids, with the EdgeEnhancer sub-module providing adaptive edge enhancement. For semantic conflict, the Attention-guided Multi-scale Feature Fusion Network (AMFFN) employs the Hierarchical Sparse Attention Transformer (HSAT) to establish long-range cross-scale semantic associations via pyramidal sparse attention, combined with Spatial Decomposition Enhanced Convolution (SDEC) for multi-scale spatial decoupling of shallow features, thereby suppressing background noise. For detail loss, the Enhanced Upsampling with Channel Bridging and Spatial Coordination module (EUCBSC) integrates depth-wise separable convolution, channel shuffle, and SDEC to preserve edge sharpness and structural integrity during resolution recovery. The three modules form a cascaded enhancement chain from edge-aware extraction to semantic-aligned fusion to high-fidelity reconstruction. The main contributions of this paper are summarized as follows:

First, a theoretical analysis framework for three-stage degradation of small object features is proposed, providing a new theoretical perspective for the design of aerial small object detection algorithms;
Second, the CMENet, including cross-stage pyramidal structure and multi-resolution edge enhancement modules, is proposed to handle the edge feature attenuation problem;
Third, the AMFFN is proposed for the adaptive semantic alignment of cross-scale features through a pyramidal sparse attention mechanism and a multi-scale spatial decoupling strategy;
Fourth, the EUCBSC, which enhances feature reconstruction quality through a bidirectional channel-shift mixing mechanism, is proposed to alleviate the upsampling detail loss problem.

The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 details the proposed EAGLE-DET algorithm; Section 4 presents experimental results and analysis. Section 5 discusses the model proposed in this paper and Section 6 concludes the paper.

2. Related Work

2.1. UAV Small Object Detection

Object detection has evolved from two-stage detectors to one-stage detectors and then to end-to-end Transformer detectors. Based on mature object detection frameworks, researchers have proposed numerous specialized algorithms for UAV small object detection. Within the UAV domain, TPH-YOLOv5 [15] incorporates self-attention into YOLOv5 to handle dense scenes; QueryDet [16] employs cascaded sparse queries for computational efficiency; DMF-YOLO [17] designs dynamic multi-scale fusion for small object enhancement. RT-DETR [18], which serves as our baseline, achieves real-time end-to-end detection by eliminating NMS and using IoU-aware query selection. Subsequent works further adapt RT-DETR for UAV scenarios: Sparse-DETR [19] introduces sparse attention mechanisms and progressive query selection strategies. Drone-DETR [20] introduces lightweight backbones and shallow feature enhancement, while UAV-DETR [14] integrates multi-scale fusion with spatial alignment. However, a careful examination reveals that each method addresses only a subset of the degradation stages identified in this work. Specifically, Drone-DETR [20] improves the backbone and shallow feature enhancement, primarily targeting edge attenuation during extraction, but does not address semantic conflict in fusion or detail loss in reconstruction. UAV-DETR [14] combines multi-scale fusion with spatial alignment, partially alleviating semantic conflict, yet treats backbone extraction and upsampling as independent components without considering their compounding effects. DMF-YOLO [17] designs dynamic multi-scale fusion for small object enhancement, focusing exclusively on the fusion stage while lacking edge-aware extraction and high-fidelity reconstruction. In contrast, EAGLE-DET provides a unified three-stage framework where each module is explicitly designed to repair a specific degradation stage, and the modules are cascaded to prevent error propagation across stages.

2.2. Feature Extraction Networks

ResNet [21] established the residual connection paradigm for deep feature extraction, but its fixed architecture lacks adaptive mechanisms for small object edge preservation. FasterNet [22] proposes partial convolution (PConv) for computational efficiency. Gold-YOLO [23] achieves multi-scale aggregation through Gather-and-Distribute. In network architecture design, InternImage [24] constructs powerful feature extraction networks using deformable large kernel convolutions, while UniRepLKNet [25] proposes a unified architectural design paradigm for large kernel convolutions, achieving excellent performance in multiple vision tasks. Vision Transformer [26] introduces self-attention mechanisms into vision tasks, demonstrating powerful global modeling capability, but has high computational complexity. ConvNeXt [7] revisits the design space of convolutional networks, bringing pure convolutional architectures to performance comparable to Transformers through modernization. MambaVision [27] combines Mamba with Transformer architectures, enhancing global feature modeling capability while maintaining efficient computation. Edge information is crucial for small object detection, as discriminative features of small objects greatly depend on their contour shapes. PiDiNet [28] efficiently extracts edge features through pixel-difference convolution. DiffusionEdge [29] explores the application potential of diffusion models in edge detection. However, these methods are typically used as independent modules, lacking deep integration with the feature extraction process of backbone networks. Existing backbone networks lack explicit modeling of edge high-frequency information, with continuous downsampling causing layer-by-layer attenuation of small object edge features.

2.3. Multi-Scale Feature Fusion and Reconstruction

Multi-scale feature fusion aims to comprehensively utilize semantic information and spatial details from features at different levels. FPN [30] pioneered the top-down feature pyramid for multi-level fusion, subsequently extended by PANet [31] with bottom-up path aggregation, BiFPN [9] with bidirectional pyramids and learnable weights, and AFPN [10] with progressive non-adjacent layer interaction. Attention-based fusion methods include DANet [32] with parallel position-channel attention, BiFormer [33] with bi-level routing attention, and SCSA [34] which explores spatial-channel synergy. In upsampling, CARAFE [11] generates content-aware kernels, SAPA [35] employs similarity-aware point affinity, and DySample [12] learns dynamic sampling patterns. However, these methods mainly focus on single-scale feature enhancement, showing inadequacy in modeling semantic alignment among cross-scale features. Linear fusion paradigms struggle to effectively model semantic correspondence relationships among cross-scale features, easily producing semantic conflicts when fusing features with large-scale spans.

In feature reconstruction, upsampling operations determine the quality of deep semantic features recovering to high resolution. Bilinear interpolation, as the most commonly used upsampling method, achieves resolution improvement through weighted averaging, but introduces edge blur and aliasing artifacts. Deconvolution [36] achieves upsampling through learnable parameters, but easily produces checkerboard effects. CARAFE [11] proposes content-aware feature reorganization strategies, dynamically generating upsampling kernels based on input content. SAPA [35] achieves high-quality feature upsampling through similarity-aware point affinity modeling. These methods improve upsampling quality to some extent, but mainly focus on feature recovery in the spatial dimension, neglecting the potential role of inter-channel information interaction in small object feature enhancement. Our proposed approach differs from these existing methods in two key aspects. First, unlike BiFPN [9] and AFPN [10] that rely on linear fusion paradigms, our AMFFN integrates pyramidal sparse attention with spatial decoupling convolution, enabling adaptive semantic alignment across scales through learnable cross-scale correspondences rather than fixed weighted summation. Second, unlike CARAFE [11] and DySample [12], which operate solely in the spatial dimension, our EUCBSC combines spatial resolution recovery with inter-channel information mixing through bidirectional channel shift, simultaneously enhancing both spatial fidelity and channel-wise feature diversity during reconstruction.

3. Method

We propose EAGLE-DET, an end-to-end detection framework that instantiates a dedicated repair strategy at each of the three identified degradation stages. The overall architecture is shown in Figure 2. Specifically, the Cross-stage Multi-resolution Edge Enhancement Network (CMENet) provides multi-scale edge-enhanced feature maps to the Attention-guided Multi-scale Feature Fusion Network (AMFFN), which performs semantically aligned cross-scale fusion; the fused features are then passed to the Enhanced Upsampling with Channel Bridging and Spatial Coordination module (EUCBSC) for high-fidelity upsampling before entering the detection head. The three modules form a complete feature enhancement chain from edge-aware extraction to semantic-aligned fusion to high-fidelity reconstruction, addressing the degradation of small object features at each stage of network forward propagation. For clarity, a comprehensive summary of all mathematical symbols and notations used in this section is provided in Appendix A.

3.1. Cross-Stage Pyramidal Multi-Resolution Edge Enhancement Backbone Network Design

The CMENet backbone, built on Cross-Stage Pyramidal Multi-Resolution Edge Enhancement (CSPMEE) modules, targets edge attenuation during feature extraction. The structure of CSPMEE is shown in Figure 3. It combines the efficiency of cross-stage partial connections with multi-scale edge enhancement, improving small object edge perception while maintaining a lightweight design.

The CSPMEE module adopts a branch-fusion feature processing paradigm, achieving balanced distribution of computational load by decomposing input feature maps along the channel dimension into multiple subspaces. The module first uses a $1 \times 1$ convolution to map input features to an intermediate representation with twice the number of channels, then divides it uniformly along the channel dimension into two branches, where the bypass branch maintains the structural integrity of the original features and the enhancement branch performs deep-feature transformation through cascaded Multi-Resolution Edge Amplification Module (MREAM) units. The mathematical expression of the entire module can be formalized as

F_{CSPMEE} (X) = ψ_{1 \times 1} (C [X_{bypass}, ⋂_{i = 1}^{n} M_{MREAM}^{(i)} (X_{branch})])

(1)

where $X_{bypass}$ and $X_{branch}$ represent the feature tensors of the bypass and enhancement branches, respectively, $M_{MREAM}^{(i)}$ represents the i-th MREAM unit, $C [\cdot]$ denotes the feature concatenation operation, and $ψ_{1 \times 1}$ is the final $1 \times 1$ convolution mapping function. In our implementation, the number of cascaded MREAM units n is set to 1 for all CSPMEE stages. After the initial $1 \times 1$ convolution, the tensor is split equally along the channel axis into $X_{bypass}$ and $X_{branch}$, each with half the channels. After the MREAM unit, the two branches are then concatenated and projected back to the original channel count via the final $1 \times 1$ convolution.

The MREAM module constructs a hierarchical multi-scale feature pyramid, capturing feature representations under different spatial receptive fields through parallel adaptive pooling branches. The module uses four pooling scales, $\{3 \times 3, 6 \times 6, 9 \times 9, 12 \times 12\}$. Each branch independently performs feature transformation followed by edge information enhancement through the EdgeEnhancer, and the branch outputs are finally fused with a local convolution branch to form a comprehensive feature representation. The mathematical description of multi-scale feature extraction is

F_{multi}^{(k)} (X) = I (E (D_{3 \times 3} (ϕ_{1 \times 1} (P_{k \times k}^{adaptive} (X)))), H \times W)

(2)

where $P_{k \times k}^{adaptive}$ represents adaptive average pooling of scale $k \times k$, $ϕ_{1 \times 1}$ is a $1 \times 1$ convolution for channel compression, $D_{3 \times 3}$ denotes a $3 \times 3$ depth-wise separable convolution, $I (\cdot, H \times W)$ is the bilinear interpolation upsampling operation, and E represents the EdgeEnhancer enhancement function. The output of the entire module is obtained through multi-branch feature fusion:

F_{MREAM} (X) = ψ_{1 \times 1} (C [ϕ_{3 \times 3} (X), ⋃_{k \in {3, 6, 9, 12}} F_{multi}^{(k)} (X)])

(3)
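The pooling step of Equation (2) can be illustrated with a minimal, standard-library sketch. It assumes a single-channel map stored as a list of rows and the common floor/ceil bin rule for adaptive pooling; the function names `adaptive_avg_pool2d` and `pyramid` are hypothetical, and the convolution, EdgeEnhancer, and interpolation steps of the equation are omitted.

```python
def adaptive_avg_pool2d(x, k):
    """Average-pool a 2-D map (list of rows) down to k x k bins.

    Assumed bin rule: bin i covers indices floor(i*H/k) .. ceil((i+1)*H/k) - 1,
    and likewise along the width.
    """
    h, w = len(x), len(x[0])
    out = []
    for i in range(k):
        r0 = (i * h) // k
        r1 = ((i + 1) * h + k - 1) // k  # integer ceil
        row = []
        for j in range(k):
            c0 = (j * w) // k
            c1 = ((j + 1) * w + k - 1) // k
            cells = [x[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def pyramid(x, scales=(3, 6, 9, 12)):
    """Pooled maps at the four scales named in the text, keyed by scale."""
    return {k: adaptive_avg_pool2d(x, k) for k in scales}


if __name__ == "__main__":
    demo = [[r * 12 + c for c in range(12)] for r in range(12)]
    for k, pooled in pyramid(demo).items():
        print(k, len(pooled), len(pooled[0]))
```

Each pooled map summarizes the input at a different granularity; in the module described above, every such map would then be transformed, edge-enhanced, and interpolated back to $H \times W$ before concatenation.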

The EdgeEnhancer sub-module performs edge enhancement based on high-frequency and low-frequency decomposition: it extracts edge features as the difference between the original signal and its smoothed version, and applies a gating mechanism for adaptive enhancement. Its core computation process is expressed as

E (X) = X + σ (φ (H_{high} (X)))

(4)

H_{high} (X) = X - P_{3 \times 3}^{avg} (X)

(5)

where $P_{3 \times 3}^{avg}$ represents a $3 \times 3$ average pooling operation for extracting low-frequency smooth components, $H_{high}$ is the high-pass filtering result, $φ$ represents the edge feature transformation convolution, and $σ$ is the Sigmoid gating function. From a frequency-domain perspective, the EdgeEnhancer implements an explicit high-pass filtering operation: the average pooling $P_{3 \times 3}^{avg}$ extracts the low-frequency smooth component of the input signal, and the subtraction operation in Equation (5) isolates the high-frequency residual containing edge and texture information. The subsequent Sigmoid gating mechanism in Equation (4) allows the network to adaptively control the enhancement intensity at each spatial location, selectively amplifying high-frequency edge responses in regions where small objects are present while suppressing noise-dominant high-frequency components in background regions.
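Equations (4) and (5) can be traced on a toy single-channel map with the sketch below. Several simplifications are assumptions rather than details from the paper: the learned convolution $φ$ is replaced by a scalar `weight` and `bias`, the pooling uses stride 1, and border cells average only their in-bounds neighbours. All function names are hypothetical.

```python
import math


def avg_pool3x3(x):
    """3x3 average pooling, stride 1; borders average in-bounds cells only (assumed)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [x[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = sum(vals) / len(vals)
    return out


def high_pass(x):
    """Equation (5): original signal minus its low-frequency (smoothed) component."""
    low = avg_pool3x3(x)
    return [[v - l for v, l in zip(rx, rl)] for rx, rl in zip(x, low)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def edge_enhance(x, weight=1.0, bias=0.0):
    """Equation (4) with phi reduced to a scalar affine map (illustrative stand-in)."""
    hp = high_pass(x)
    return [[v + sigmoid(weight * e + bias) for v, e in zip(rx, re)]
            for rx, re in zip(x, hp)]


if __name__ == "__main__":
    step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    for row in high_pass(step):
        print([round(v, 3) for v in row])
```

On the step-edge input, the high-pass residual is zero in the flat regions and non-zero only in the two columns adjacent to the edge, which is the signal the gate then acts on.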

This design enables fine-grained edge contours of small objects to be consistently preserved across successive downsampling stages, which directly contributes to improved localization accuracy and category discrimination for small objects in complex aerial scenes.

3.2. Attention-Guided Multi-Scale Feature Fusion Network

The Attention-guided Multi-scale Feature Fusion Network (AMFFN), whose structure is shown in Figure 4, targets semantic conflict during feature fusion. Through pyramidal sparse attention and multi-scale spatial decoupling, AMFFN adaptively learns cross-scale semantic correspondences while reducing computational complexity and maintaining sensitivity to small object features.

The AMFFN feature fusion network adopts a hierarchical multi-scale feature interaction architecture, achieving efficient cross-scale feature fusion through cascaded combination of Spatial Decomposition Enhanced Convolution (SDEC) and a Hierarchical Sparse Attention Transformer (HSAT). The network first uses the SDEC module to perform multi-scale spatial decomposition on shallow features, extracting local feature representations at different granularities, then establishes long-range semantic associations between deep features and shallow features through the HSAT module, and finally achieves effective fusion of multi-level features through adaptive feature aggregation. The mathematical expression of the entire AMFFN network can be formalized as

F_{AMFFN} (X) = A_{adaptive} (T_{HSAT} (S_{SDEC} (X_{shallow}), X_{deep}))

(6)

where $X_{shallow}$ and $X_{deep}$ represent the shallow and deep feature maps, respectively, $S_{SDEC}$ represents the spatial decoupling convolution transformation, $T_{HSAT}$ represents the HSAT processing function, and $A_{adaptive}$ is the adaptive feature aggregation operation.

HSAT achieves deep interactive fusion of cross-scale features through a Sparse Attention Fusion Block (SAFB). The HSAT module first performs adaptive adjustment of channel dimensions for input features and upper layer features, then progressively enhances feature representation capability through concatenated attention blocks, and finally generates fused features through feature concatenation and convolution transformation. The overall computation process of HSAT can be expressed as

F_{HSAT} (X, X_{up}) = ϕ_{1 \times 1} (C [X_{0}, ⋂_{i = 1}^{n} B_{SAFB}^{(i)} (X_{i - 1}, X_{up}^{'})])

(7)

where $X_{0} = ψ_{1} (X)$ is the initial feature transformation, $X_{up}^{'} = ψ_{up} (X_{up})$ is the upper-layer feature transformation, $B_{SAFB}^{(i)}$ represents the i-th SAFB, and $ϕ_{1 \times 1}$ is the final channel fusion convolution.

SAFB achieves deep transformation of features through residual connected attention mechanisms and multi-layer perceptrons, with a computation process including attention enhancement and nonlinear transformation stages. The mathematical description of this module is

B_{SAFB} (X_{0}, X_{up}^{'}) = X_{0} + M_{MLP} (X_{0} + A_{PSAttn} (X_{0}, X_{up}^{'}))

(8)

where $A_{PSAttn}$ represents the pyramidal sparse attention function, and $M_{MLP}$ is the multi-layer perceptron transformation. The pyramidal sparse attention mechanism achieves efficient cross-scale feature alignment through two-level attention computation, consisting of a coarse-grained query-key-value attention branch and a sparse fine-grained attention branch:

A_{coarse} = Softmax (\frac{Q \cdot K^{T}}{\sqrt{d_{k}}}) \cdot V_{up}

(9)

A_{fine} = Softmax (\frac{Q \cdot K_{topk}^{T}}{\sqrt{d_{k}}}) \cdot V_{topk} if topk > 0

(10)

where $Q = ϕ_{q} (X)$ is the query matrix, $K = ϕ_{k} (X_{up})$ and $V_{up} = ϕ_{v} (X_{up})$ are the key and value matrices of the upper-layer features, respectively, and $K_{topk}$ and $V_{topk}$ are the fine-grained key and value matrices obtained by sparse selection. The coarse and fine attention outputs are fused through a gating mechanism. In our implementation, the multi-head attention operation uses four heads. For training, the top-k selection is disabled so that only coarse attention is computed, reducing training cost; during inference, k is set to four to enable fine-grained attention refinement. When fine attention is active, the position mapping expands each selected coarse position to its $2 \times 2$ neighborhood, yielding $4 \times k = 16$ fine-grained key-value pairs per query. This decomposition thus achieves effective cross-scale semantic alignment at substantially lower cost than full attention, with the coarse stage providing global correspondences and the fine stage refining only the most informative local regions.
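The position mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: coarse positions are assumed to be ranked by their coarse attention score for a given query, the fine grid is assumed to have exactly twice the coarse resolution, and positions are flattened in row-major order. The function names are hypothetical.

```python
def topk_indices(scores, k):
    """Indices of the k largest scores; ties are broken by the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def expand_to_fine(coarse_idx, coarse_w):
    """Map a flat coarse index to the flat indices of its 2x2 block on a 2x finer grid."""
    r, c = divmod(coarse_idx, coarse_w)
    fine_w = 2 * coarse_w
    return [(2 * r + dr) * fine_w + (2 * c + dc)
            for dr in (0, 1) for dc in (0, 1)]


def fine_key_positions(coarse_scores, coarse_w, k=4):
    """Fine-grid key/value positions attended by one query: 4 per selected coarse cell."""
    selected = topk_indices(coarse_scores, k)
    return [p for idx in selected for p in expand_to_fine(idx, coarse_w)]


if __name__ == "__main__":
    scores = [float(i) for i in range(16)]  # one query's scores over a 4x4 coarse grid
    positions = fine_key_positions(scores, coarse_w=4, k=4)
    print(len(positions), positions)
```

With k = 4 the query attends to 16 fine positions regardless of the feature map size, which is where the cost saving over full attention comes from.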

The SDEC module achieves enhanced processing of shallow features through multi-scale spatial decoupling and channel attention mechanisms. The module designs two space-depth transformation branches with different strides, respectively capturing local feature patterns at $2 \times$ and $4 \times$ downsampling, adaptively adjusting feature channel weights through squeeze-excitation mechanisms, and finally generating enhanced feature representations through feature fusion. The complete computation process of SDEC is

F_{SDEC} (X) = ϕ_{3 \times 3} (S E (C [S_{SPD}^{(2)} (X), I (S_{SPD}^{(4)} (X))]))

(11)

where $S_{SPD}^{(s)}$ represents the space-depth convolution transformation with stride s, $I$ is bilinear interpolation upsampling, $S E$ is squeeze-excitation attention, and $ϕ_{3 \times 3}$ is the final $3 \times 3$ convolution. The space-depth convolution transformation achieves spatial dimensionality reduction and channel expansion through a slice recombination operation:

S_{SPD}^{(s)} (X) = ϕ_{conv} (C [X [\dots, i :: s, j :: s] \mid i, j \in \{0, 1, \dots, s - 1\}])

(12)
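The slice recombination inside Equation (12) can be sketched with plain nested lists. The sketch assumes a tensor stored as a list of C channels, each an H x W list of rows, and it omits the trailing convolution $ϕ_{conv}$; the name `space_to_depth` is hypothetical.

```python
def space_to_depth(x, s):
    """Rearrange C x H x W into (C*s*s) x (H/s) x (W/s) by strided slicing.

    For every offset pair (i, j) with 0 <= i, j < s, the slice X[..., i::s, j::s]
    is taken and all slices are concatenated along the channel axis.
    """
    h, w = len(x[0]), len(x[0][0])
    if h % s or w % s:
        raise ValueError("H and W must be divisible by the stride s")
    out = []
    for i in range(s):
        for j in range(s):
            for ch in x:
                out.append([row[j::s] for row in ch[i::s]])
    return out


if __name__ == "__main__":
    x = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
    for ch in space_to_depth(x, 2):
        print(ch)
```

No value is discarded: the spatial resolution drops by a factor of s in each direction while the channel count grows by s², which is why this form of downsampling is attractive for small objects.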

The squeeze-excitation layer learns interdependencies among channels through global average pooling and two-layer fully connected networks, with mathematical representation:

S E (X) = X ⊙ σ (W_{2} \cdot δ (W_{1} \cdot GAP (X)))

(13)

where GAP represents global average pooling, $W_{1}$ and $W_{2}$ are the weight matrices for dimensionality reduction and expansion, respectively, $δ$ is the ReLU activation function, $σ$ is the Sigmoid function, and ⊙ denotes element-wise multiplication.
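Equation (13) can be followed numerically with the sketch below. It assumes a C x H x W tensor stored as nested lists, bias-free fully connected layers given as plain weight matrices (`w1` of shape r x C and `w2` of shape C x r), and a hypothetical function name `squeeze_excite`; the reduction ratio and weights in the demo are arbitrary.

```python
import math


def squeeze_excite(x, w1, w2):
    """SE(X) = X * sigmoid(W2 . relu(W1 . GAP(X))), applied per channel."""
    # Squeeze: global average pooling gives one scalar per channel.
    gap = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: reduce to r hidden units with ReLU, then expand back to C gates.
    hidden = [max(0.0, sum(w * g for w, g in zip(row, gap))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]


if __name__ == "__main__":
    x = [[[1.0, 3.0]], [[2.0, 2.0]]]   # two channels, each 1 x 2
    w1 = [[1.0, -1.0]]                  # r = 1
    w2 = [[1.0], [1.0]]
    print(squeeze_excite(x, w1, w2))
```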

AMFFN achieves adaptive semantic alignment of cross-scale features through the collaborative action of HSAT’s pyramidal sparse attention and SDEC’s spatial decoupling, improving the discriminability and robustness of small object features.

3.3. Enhanced Upsampling with Channel-Spatial Coordination Mechanism

The Enhanced Upsampling with Channel Bridging and Spatial Coordination module (EUCBSC) targets detail loss during feature reconstruction. The structure of EUCBSC is shown in Figure 5. By integrating depth-wise separable convolution, channel shuffle, and bidirectional channel shift mixing, EUCBSC preserves spatial detail during resolution expansion and enriches feature representation through inter-channel spatial recombination.

The EUCBSC module adopts a cascaded feature enhancement architecture, decomposing the upsampling process into three stages: size expansion, channel recombination, and feature optimization. The module first enlarges the input feature map to target size through bilinear interpolation, then uses depth-wise separable convolution for spatial detail recovery, followed by a channel shuffle mechanism to rearrange feature channels to promote information exchange, and finally performs nonlinear transformation of channel dimensions through point-wise convolution. The mathematical expression of the entire EUCBSC module can be formalized as

F_{EUCBSC} (X) = ϕ_{1 \times 1} (S_{BSCR} (S_{shuffle} (D_{DW} (U_{bilinear} (X, s = 2)))))

(14)

where $U_{bilinear} (\cdot, s = 2)$ represents a bilinear upsampling operation with scale factor 2, $D_{DW}$ represents a depth-wise separable convolution transformation, $S_{shuffle}$ is a channel shuffle function, $S_{BSCR}$ represents a Bidirectional Spatial Channel Reorganization (BSCR) operation, and $ϕ_{1 \times 1}$ is a point-wise convolution mapping function.
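The channel shuffle step $S_{shuffle}$ in Equation (14) is commonly implemented as a reshape-transpose-flatten over the channel axis. The sketch below shows that permutation on a list of channels; the number of groups is not stated in the text, so `groups` is an assumed parameter and `channel_shuffle` is a hypothetical name.

```python
def channel_shuffle(channels, groups):
    """Interleave channels across groups: view as (groups, n/groups), transpose, flatten."""
    n = len(channels)
    if n % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


if __name__ == "__main__":
    # Channels are represented by their indices to make the permutation visible.
    print(channel_shuffle(list(range(6)), groups=2))
```

After the shuffle, adjacent channels come from different groups, so a subsequent grouped or split operation sees information from every original group.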

The BSCR sub-module designs a novel bidirectional channel-shift mixing strategy, enhancing feature spatial diversity and inter-channel information interaction by applying opposite-direction spatial shift operations on different channel groups. The module first uniformly splits the input feature tensor along the channel dimension into four sub-tensors, then applies cyclic shift transformations in horizontal and vertical directions to each sub-tensor respectively. Specifically, the channel splitting operation can be expressed as

{X_{1}, X_{2}, X_{3}, X_{4}} = Split (X, \dim = 1, chunks = 4)

(15)

Subsequently, bidirectional shift transformations of sub-tensors are defined as

X_{1}^{'} = Roll (X_{1}, shift = s, \dim = H)

(16)

X_{2}^{'} = Roll (X_{2}, shift = - s, \dim = H)

(17)

X_{3}^{'} = Roll (X_{3}, shift = s, \dim = W)

(18)

X_{4}^{'} = Roll (X_{4}, shift = - s, \dim = W)

(19)

where $Roll (\cdot, shift, \dim)$ represents a cyclic shift operation along a specified dimension, s is the shift step, and H and W represent the height and width dimensions of the feature map, respectively. The shifted sub-tensors are then concatenated along the channel dimension, so that each spatial position aggregates information from neighboring positions across different channel subspaces, effectively expanding the receptive field:

S_{BSCR} (X) = Concat ([X_{1}^{'}, X_{2}^{'}, X_{3}^{'}, X_{4}^{'}], \dim = 1)

(20)
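Equations (15) to (20) can be reproduced on nested lists as in the sketch below. It assumes a C x H x W tensor stored as a list of channels with C divisible by four, and a cyclic shift in which a positive step moves values toward higher indices; the names `roll` and `bscr` are hypothetical and the default shift step of 1 is an assumption.

```python
def roll(seq, shift):
    """Cyclic shift of a list: a positive shift moves elements toward higher indices."""
    k = shift % len(seq)
    return seq[-k:] + seq[:-k]


def bscr(x, s=1):
    """Split channels into four groups and shift them +s/-s along H and +s/-s along W."""
    c = len(x)
    if c % 4:
        raise ValueError("channel count must be divisible by 4")
    q = c // 4
    g1, g2, g3, g4 = (x[i * q:(i + 1) * q] for i in range(4))
    out = []
    out += [roll(ch, s) for ch in g1]                     # Eq. (16): +s along H
    out += [roll(ch, -s) for ch in g2]                    # Eq. (17): -s along H
    out += [[roll(row, s) for row in ch] for ch in g3]    # Eq. (18): +s along W
    out += [[roll(row, -s) for row in ch] for ch in g4]   # Eq. (19): -s along W
    return out                                            # Eq. (20): channel concat


if __name__ == "__main__":
    base = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for ch in bscr([base, base, base, base], s=1):
        print(ch)
```

After the shifts, the four channel groups at any fixed position hold values that originally sat at its upper, lower, left, and right neighbours, so a following point-wise convolution mixes spatial context without any additional spatial kernel.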

Through depth-wise separable convolution and the BSCR bidirectional channel shift mixing strategy, EUCBSC enhances spatial detail fidelity and inter-channel information flow during upsampling.

4. Experiments

4.1. Datasets

To comprehensively evaluate the performance of the proposed algorithm, this study selected three representative UAV object-detection datasets for experimental validation: VisDrone-2019 [3], UAVDT [4], and DOTA1.0 [5].

The VisDrone-2019 Dataset is a UAV vision benchmark dataset constructed by the AISKYEYE team of Tianjin University, specifically designed for object detection and tracking tasks in UAV scenarios. The dataset covers 10 object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. It spans urban, rural, highway, and construction-site scenes, as well as complex environments with various disturbances, different weather conditions, and different object scales, providing a suitable testing platform for evaluating algorithm robustness. We divided it into a training set containing 6471 images, a validation set containing 548 images, and a test set containing 1610 images for experiments.

The UAVDT Dataset is a large-scale challenging benchmark dataset specifically designed for UAV object detection and tracking tasks. Images cover various complex urban scenarios, including squares, main roads, toll stations, highways, intersections, and T-junctions, providing important references for evaluating algorithm performance in practical applications. It contains three vehicle object categories: car, truck, and bus. Data are divided into 20,368 training samples, 8147 validation samples, and 12,220 test samples in total.

The DOTA-v1.0 Dataset is one of the most influential large-scale benchmark datasets in the field of aerial image object detection. The dataset covers 15 object categories: plane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer field, and swimming pool. Dataset images come from Google Earth, GF-2 satellite, JL-1 satellite, and CycloMedia B.V. aerial imagery, with diverse geographical locations, sensor types, and shooting platforms, providing a comprehensive testing environment for algorithm generalization performance evaluation. The dataset is divided into 15,749 training images and 5297 validation images.

4.2. Implementation Details

To support reproducibility and comparability, the experiments use a common deep learning software and hardware stack. As shown in Table 1, the operating system is Ubuntu 22.04 LTS. Experiments are implemented in Python 3.10 with PyTorch 2.0.1 and CUDA 11.7, and run on a workstation equipped with an NVIDIA GeForce RTX 4090 GPU (24 GB memory), an Intel Core i7-13700KF CPU, and 64 GB of DDR4 system memory.

Training uses the AdamW optimizer with an initial learning rate of $1 \times 10^{-4}$, momentum of 0.9, and weight decay of $1 \times 10^{-4}$. The learning rate follows a cosine annealing schedule over 300 training epochs, preceded by a warmup of 2000 iterations. The batch size is 8, and all input images are resized to $640 \times 640$ pixels to ensure a fair comparison among configurations. For data augmentation, we apply random flipping with a probability of 0.5, random scaling between 0.5 and 1.5, and Mosaic augmentation with a probability of 0.5; Mosaic is disabled during the last 30 training epochs.
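The schedule described above can be sketched in a few lines. This is a minimal illustration, not the authors' training script: the linear shape of the warmup, annealing down to zero, per-iteration stepping, and rounding the iteration count up are all assumptions, and the function names are hypothetical.

```python
import math

BASE_LR = 1e-4          # initial learning rate (from the text)
WARMUP_ITERS = 2000     # warmup length in iterations (from the text)
EPOCHS = 300            # total training epochs (from the text)
BATCH_SIZE = 8          # batch size (from the text)
TRAIN_IMAGES = 6471     # VisDrone-2019 training images (Section 4.1)
MOSAIC_OFF_LAST = 30    # Mosaic is disabled for the last 30 epochs (from the text)

# Assumption: the final partial batch is kept, so the count rounds up.
ITERS_PER_EPOCH = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
TOTAL_ITERS = ITERS_PER_EPOCH * EPOCHS


def learning_rate(it: int) -> float:
    """Linear warmup (assumed) followed by cosine annealing to zero (assumed)."""
    if it < WARMUP_ITERS:
        return BASE_LR * (it + 1) / WARMUP_ITERS
    progress = (it - WARMUP_ITERS) / max(1, TOTAL_ITERS - WARMUP_ITERS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))


def mosaic_enabled(epoch: int) -> bool:
    """Epochs are 0-indexed; Mosaic is switched off for the final 30 epochs."""
    return epoch < EPOCHS - MOSAIC_OFF_LAST
```

Under these assumptions one epoch on VisDrone-2019 takes 809 iterations, so the 2000-iteration warmup ends during the third epoch.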

4.3. Evaluation Metrics

To comprehensively evaluate the proposed algorithm, this study adopts the COCO evaluation metrics widely used in object detection. The metrics are: AP50:95, the average precision averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, which serves as the primary overall metric; AP50, the average precision at an IoU threshold of 0.5; APs, the average precision for small objects with an area below $32 \times 32$ pixels, a key metric that directly reflects effectiveness on small objects in UAV images; APm, the average precision for medium-sized objects between $32 \times 32$ and $96 \times 96$ pixels; and APl, the average precision for large objects exceeding $96 \times 96$ pixels. Computational efficiency is quantified through GFLOPs and Params (expressed in millions), which measure inference complexity and model storage requirements, respectively.
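The size buckets and IoU thresholds behind these metrics can be illustrated with a short sketch. The helper names are hypothetical, and assigning boxes that sit exactly on a bucket boundary to "medium" is an assumption made here for simplicity; the reference COCO tooling should be consulted for the exact boundary convention.

```python
def iou_thresholds() -> list[float]:
    """IoU thresholds 0.50, 0.55, ..., 0.95 over which AP50:95 is averaged."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def size_bucket(width: float, height: float) -> str:
    """Assign a box to the small / medium / large bucket by its pixel area."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area <= 96 * 96:   # boundary handling is an assumption of this sketch
        return "medium"
    return "large"


def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts toward AP50 when its IoU with a ground-truth box is at least 0.5, and toward the stricter thresholds only as the overlap improves, which is why AP50:95 is always the harder metric.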

4.4. Ablation Studies

4.4.1. Effect of Multi-Scale Pooling Kernel Configuration

To verify the effectiveness of the multi-scale pooling kernel configuration in the proposed MREAM module, we conducted systematic ablation experiments on different pooling kernel combination strategies, focusing on exploring the impact of single-scale, dual-scale, triple-scale, and different four-scale configurations on aerial small object detection performance. Experimental results are shown in Table 2.

The results in Table 2 support the proposed progressive multi-scale design. The single-kernel configurations are clearly limited, with configuration A achieving only 45.8% AP50. The dual-scale and triple-scale configurations improve gradually across metrics but remain limited in capturing multi-scale feature information. The five-scale configuration drops to 46.2% AP50, indicating that excessive scale branches may introduce redundant information and computational burden. Our four-scale progressive configuration provides continuous coverage from local details to global context and adapts better to the drastic object scale changes in UAV images, achieving 47.2% AP50, an APs of 20.1%, and an APm of 38.0%. This indicates that the MREAM module, through its progressive multi-scale pooling kernel combination, effectively captures multi-level feature representations of objects at different scales.
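To make the idea of multi-scale pooling branches concrete, the following sketch applies adaptive average pooling at several output sizes to a single-channel feature map stored as nested lists. The scale set in the usage note is hypothetical (the configurations actually evaluated are those listed in Table 2), and the function names are illustrative rather than taken from the authors' code.

```python
def adaptive_avg_pool2d(x, k):
    """Average-pool a 2D map (list of rows) to a k x k grid of bin means."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(k):
        r0 = (i * h) // k                 # floor of the bin start
        r1 = -((-(i + 1) * h) // k)       # ceiling of the bin end
        row = []
        for j in range(k):
            c0 = (j * w) // k
            c1 = -((-(j + 1) * w) // k)
            cells = [x[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def multi_scale_pool(x, scales):
    """Return one pooled map per scale, from global context to local detail."""
    return {k: adaptive_avg_pool2d(x, k) for k in scales}
```

Calling `multi_scale_pool(feature, (1, 2, 4))` on a 4 x 4 map yields a global average, a 2 x 2 summary, and the map itself, which mirrors the progression from global context to local detail described above.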

4.4.2. Effect of Edge Enhancement Mechanism

To verify the effectiveness of the edge enhancement strategy in the proposed EdgeEnhancer sub-module, we conducted comparative experiments on different edge feature extraction methods. Experimental results are shown in Table 3.

The experimental results in Table 3 show that adopting edge enhancement strategies can effectively improve object detection performance. Traditional edge detection operators demonstrate certain feature enhancement capability, with the Sobel operator improving APs by 0.4% compared to the no-edge-enhancement configuration, and the Laplacian operator improving APs by 0.7%. In contrast, our proposed average pooling-based high-frequency and low-frequency decomposition edge enhancement method achieves optimal performance across all evaluation metrics, improving AP50 by 1.3% compared to baseline, APs by 1.2%, APm by 1.1%, and APl by 1.2%.
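The high-frequency and low-frequency decomposition can be sketched as follows: a 3 x 3 average pooling gives the low-frequency component, and subtracting it from the input leaves a high-pass residual that is large near edges. This is a simplified single-channel illustration in pure Python; the replicate padding and the omission of the learned transformation that follows in the EdgeEnhancer are assumptions of this sketch.

```python
def avg_pool3x3(x):
    """3x3 average pooling, stride 1, with replicate padding (an assumption)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), h - 1)   # clamp to the border
                    cc = min(max(c + dc, 0), w - 1)
                    total += x[rr][cc]
            out[r][c] = total / 9.0
    return out


def high_frequency(x):
    """High-pass residual: the input minus its local average."""
    low = avg_pool3x3(x)
    return [[x[r][c] - low[r][c] for c in range(len(x[0]))]
            for r in range(len(x))]
```

On a flat region the residual is zero, while across a step edge it takes opposite signs on the two sides, which is the behavior an edge-enhancement branch relies on.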

4.4.3. Validation of AMFFN Feature Fusion Network Effectiveness

To verify the effectiveness of each sub-module in the proposed AMFFN feature fusion network, we designed module combination ablation experiments. Experimental results are shown in Table 4.

The results in Table 4 show that both the SDEC and HSAT modules contribute substantially to detection performance. Using the SDEC module alone improves AP50 by 1.3% over the baseline, and using the HSAT module alone improves AP50 by 0.9%. When both modules work together, AMFFN achieves 46.8% AP50, a 2.5% improvement over the baseline, confirming the overall effectiveness of the proposed AMFFN feature fusion network.

4.4.4. Overall Ablation Experimental Results Analysis

To comprehensively verify the effectiveness of each core module in EAGLE-DET and their synergistic effects, we conducted systematic overall ablation experiments. As shown in Table 5 and Figure 6, we systematically evaluated eight configuration schemes. To verify the reliability of our results, we conducted five independent training runs of the full EAGLE-DET model with different random seeds and we report the 95% confidence interval for the detection accuracy metrics.
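For readers who want to reproduce this kind of interval, a minimal sketch follows. It assumes a two-sided Student-t interval on the sample mean of five runs; the text does not state which interval construction was used, so this is one common choice rather than the authors' exact procedure.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 runs).
T_CRIT_DF4 = 2.776


def mean_ci95(values):
    """Return (mean, half_width) of a t-based 95% confidence interval."""
    if len(values) != 5:
        raise ValueError("this sketch hard-codes the t value for exactly five runs")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # standard error
    return mean, T_CRIT_DF4 * sem
```

The interval is then reported as the mean plus or minus the half-width; with only five runs the t critical value is noticeably larger than the normal value of 1.96, so the interval is wider than a normal approximation would suggest.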

The experimental results show that each of the three core modules independently improves detection performance. CMENet alone achieves the largest single-module improvement, with AP50 increasing by 2.9% and APs by 1.5%. Using the AMFFN feature fusion network alone improves AP50 and AP50:95 by 2.5% and 1.4%, respectively. Integrating the EUCBSC module alone improves AP50 by 1.8% and AP50:95 by 0.4%.

Every dual-module combination outperforms each of its individual components, indicating complementarity rather than redundancy. Notably, CMENet + AMFFN performs best, improving AP50 by 3.9% and AP50:95 by 2.5% compared to the baseline, which exceeds the gain of either module alone. CMENet + EUCBSC improves AP50 by 3.6% while maintaining a lightweight design. By contrast, AMFFN + EUCBSC yields 47.6% AP50, a comparatively modest gain: without CMENet’s edge-enriched representations, the upstream feature quality limits the effectiveness of both downstream modules.

More importantly, the complete EAGLE-DET architecture achieves the best performance across all evaluation metrics, with AP50, AP50:95, and APs reaching 49.5%, 29.8%, and 21.8%, improvements of 5.2%, 3.3%, and 3.2% over the baseline, respectively. This result indicates that the three modules CMENet, AMFFN, and EUCBSC reinforce one another. Through the cascaded collaboration of edge feature enhancement, sparse attention fusion, and spatial-channel coordination, they improve cross-scale detection performance while keeping the model lightweight, with a clear gain in detection capability for small objects in aerial scenarios.
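The AP50 gains quoted in this subsection can be cross-checked with a few lines of arithmetic. The baseline value is not stated directly in the text; it is derived here from the CMENet result in Section 4.5.1 (47.2% AP50 with a 2.9-point gain), which is an assumption of this sketch, and the helper names are hypothetical.

```python
# AP50 (%) on the VisDrone-2019 validation set, as quoted in the text.
BASELINE_AP50 = round(47.2 - 2.9, 1)   # derived, not stated directly

GAIN_AP50 = {
    ("CMENet",): 2.9,
    ("AMFFN",): 2.5,
    ("EUCBSC",): 1.8,
    ("CMENet", "AMFFN"): 3.9,
    ("CMENet", "EUCBSC"): 3.6,
    ("AMFFN", "EUCBSC"): round(47.6 - BASELINE_AP50, 1),
    ("CMENet", "AMFFN", "EUCBSC"): 5.2,
}


def beats_each_member(combo) -> bool:
    """True when a combination's gain exceeds every single-module gain in it."""
    return all(GAIN_AP50[combo] > GAIN_AP50[(m,)] for m in combo)


def sum_of_single_gains(combo) -> float:
    """Sum of the stand-alone gains of the modules in a combination."""
    return round(sum(GAIN_AP50[(m,)] for m in combo), 1)
```

With these numbers the derived baseline is 44.3% AP50, every combination beats each of its members, and the combined gains can be set against the sums of the stand-alone gains (for example 5.2 for the full model versus 7.2 for the three single-module gains added together).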

4.5. Comparison Experiments

4.5.1. Comparison of Backbone Network Architectures

To verify the effectiveness of our proposed CMENet backbone network, we conducted backbone network architecture comparison experiments on the VisDrone-2019 validation set. The experimental results are shown in Table 6 and Figure 7.

Our proposed CMENet achieves optimal performance across all key evaluation metrics, with AP50 and AP50:95 reaching 47.2% and 27.8%, improving 2.9% and 1.3% respectively compared to baseline ResNet18, while CMENet’s parameter count is only 14.4 M, 60.7% lower than Swin-T.

4.5.2. Comparison of Neck Network Architectures

To verify the effectiveness of our proposed AMFFN feature fusion network, we conducted neck network architecture comparison experiments on the VisDrone-2019 validation set. Experimental results are shown in Table 7.

Notably, our proposed AMFFN achieves the highest AP50 of 46.8% and an AP50:95 of 27.9% among all compared neck architectures. It also achieves the highest APs of 19.5% and APm of 38.5%. BiFPN and HyperACE perform better than AMFFN on large objects owing to their bidirectional feature pyramid and hypergraph enhancement strategies, respectively. Nevertheless, for the core challenge of small and medium object detection in UAV imagery, AMFFN provides the most balanced and effective feature fusion.

4.5.3. Upsampling Comparison Experiments

To verify the effectiveness of our proposed EUCBSC upsampling module, we conducted upsampling method comparison experiments on the VisDrone-2019 validation set. The experimental results are shown in Table 8.

Among all compared upsampling methods, our EUCBSC achieves the highest AP50 of 46.1%, AP50:95 of 26.9%, APs of 19.7%, and APm of 37.6%. However, CARAFE achieves a higher APl of 42.2% compared to our 40.9%, indicating that content-aware feature reorganization may preserve more structural information for large objects.

4.5.4. Comparison with State-of-the-Art Methods

To comprehensively evaluate the performance of our proposed EAGLE-DET algorithm, we selected various representative detection methods for comparison experiments, including two-stage detectors, one-stage CNN-based detectors, and Transformer-based methods. For the baseline, our method, and the strongest competing methods, we report the average accuracy accompanied by a 95% confidence interval. The experimental results are shown in Table 9 and Figure 8.

Compared to the baseline, EAGLE-DET achieves improvements of 4.5% in AP50 and 2.9% in AP50:95, with inference speed reaching 71.7 FPS. Two-stage detectors generally have large computational overhead and slow inference speed: Faster RCNN and Cascade-RCNN achieve 32.9% and 32.6% AP50, respectively, but their GFLOPs reach 208.9 and 236.6. Among CNN-based detectors, YOLOv10m and YOLOv11m achieve higher inference speeds of 105.8 and 117.1 FPS, respectively, but their detection accuracy is notably lower, with an AP50:95 of 19.5% for both models compared to our 23.0%. For Transformer-based detectors, DAB-DETR achieves a competitive APl of 48.3%, surpassing our 42.6%, primarily because its dynamic anchor box mechanism favors larger objects, whereas EAGLE-DET is specifically designed for small object scenarios. RT-DETR-R50 achieves a comparable AP50 of 39.1% versus our 39.7% and a slightly lower AP50:95 of 22.5% versus our 23.0%, but at the cost of about 2.3 times as many parameters (42.0 M versus our 18.4 M) and approximately 2.0 times the GFLOPs (129.6 versus our 65.7). Among UAV-oriented methods, UAV-DETR-R18 and VRF-DETR both yield lower AP50, AP50:95, and APs than EAGLE-DET. These comparisons indicate that EAGLE-DET outperforms all the above methods in AP50, AP50:95, APs, and APm, achieving the best balance between accuracy and efficiency.
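The cost comparison with RT-DETR-R50 can be recomputed directly from the figures quoted in this paragraph. The dictionary layout and the `ratio` helper below are hypothetical conveniences for illustration; only the numbers come from the text.

```python
# Figures quoted in the text for the VisDrone-2019 test set comparison.
MODELS = {
    "EAGLE-DET": {"params_m": 18.4, "gflops": 65.7, "ap50": 39.7, "ap50_95": 23.0},
    "RT-DETR-R50": {"params_m": 42.0, "gflops": 129.6, "ap50": 39.1, "ap50_95": 22.5},
}


def ratio(metric: str, model: str, reference: str = "EAGLE-DET") -> float:
    """How many times larger `model` is than `reference` on a given metric."""
    return MODELS[model][metric] / MODELS[reference][metric]


def accuracy_gap(metric: str, model: str, reference: str = "EAGLE-DET") -> float:
    """Absolute difference (reference minus model) in percentage points."""
    return round(MODELS[reference][metric] - MODELS[model][metric], 1)
```

Rounded to one decimal place, the parameter ratio is 2.3 and the GFLOPs ratio is 2.0, while the accuracy gaps in favor of EAGLE-DET are 0.6 points in AP50 and 0.5 points in AP50:95.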

To demonstrate EAGLE-DET’s detection performance more intuitively, Figure 9 shows a visual comparison on the VisDrone-2019 test set.

Figure 10 shows feature activation heatmaps under different scenarios, demonstrating EAGLE-DET’s feature-learning capability. It can be observed that baseline heat response generally shows scattered and blurred characteristics, with activation regions deviating from object positions and extremely weak responses under low-light conditions. In contrast, EAGLE-DET’s heat response focuses more precisely on object regions, with clear activation boundaries highly matching actual object positions. A closer examination of the progressive heatmaps reveals interpretable feature response changes corresponding to each module’s function. After incorporating CMENet, the activations transition from spatially diffuse patterns to contour-concentrated responses, consistent with the module’s high-frequency edge enhancement mechanism that explicitly preserves high-pass filtered features. The subsequent addition of AMFFN further refines the activations by producing semantically coherent responses across different object scales, reflecting the cross-scale sparse attention’s ability to establish semantic correspondence between shallow spatial details and deep semantic representations. Finally, the complete EAGLE-DET with EUCBSC achieves the sharpest activation boundaries with minimal background leakage, demonstrating the bidirectional channel-shift mixing’s effectiveness in preserving spatial structural integrity during feature reconstruction.

4.6. Generalization Experiments

We conducted generalization experiments on two representative datasets, UAVDT and DOTA1.0. The results in Table 10 show that EAGLE-DET improves performance on both datasets. On the UAVDT dataset, compared to the baseline RT-DETR-R18, EAGLE-DET improves AP50:95 by 2.2%, APs by 3.2%, and APm by 1.9%. On the more challenging DOTA1.0 remote sensing dataset, EAGLE-DET reaches an AP50:95 of 50.6%, a 1.5% improvement over the baseline.

We also present fine-grained category performance comparison on the DOTA dataset in Table 11. EAGLE-DET achieves significant improvements in multiple categories, with especially notable improvements in small-scale object categories: small_vehicle’s AP50 improves from 67.8% to 69.2%, large_vehicle’s AP50 improves from 85.3% to 87.3%, and helicopter’s AP50 improves from 53.8% to 69.9%.

Figure 11 and Figure 12 show detection result visualizations on UAVDT and DOTA1.0 datasets respectively.

5. Discussion

EAGLE-DET addresses UAV small object detection through three modules: CMENet targets edge attenuation via high-frequency decomposition, AMFFN resolves semantic conflict via cross-scale sparse attention, and EUCBSC recovers detail loss via bidirectional channel-spatial reorganization. This design ensures each module produces higher-quality inputs for subsequent stages, forming a coherent repair chain.

It is worth noting that EAGLE-DET achieves these improvements while maintaining a favorable complexity–performance trade-off. Compared to the RT-DETR-R18 baseline with 19.9 M parameters and 58.0 GFLOPs, EAGLE-DET reduces parameters to 18.4 M (7.5% reduction) and maintains real-time inference at 71.7 FPS, demonstrating that the three-module design introduces targeted complexity rather than indiscriminate architectural expansion. The CMENet backbone alone achieves 16.6% GFLOPs reduction and 27.6% parameter reduction while improving AP50 by 2.9%, indicating that our edge-aware design replaces rather than supplements the original feature extraction overhead. Furthermore, in the context of UAV small object detection where targets occupy as few as tens of pixels, improvements of 3.2% in APs and 5.2% in AP50 on the VisDrone-2019 validation set represent meaningful advances, as each percentage point requires recovering discriminative features from extremely sparse pixel representations.
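As a quick check of the parameter reductions quoted above, the relative changes can be recomputed from the counts given in the text. This assumes that the 14.4 M figure reported for the CMENet backbone in Section 4.5.1 is the one compared against the 19.9 M baseline:

$$\frac{19.9 - 18.4}{19.9} \approx 7.5\%, \qquad \frac{19.9 - 14.4}{19.9} \approx 27.6\%$$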

While our approach demonstrates significant improvements, several important limitations warrant discussion. First, the AMFFN and EUCBSC modules, while effective for cross-scale semantic alignment and high-fidelity feature reconstruction, introduce non-negligible computational overheads. The pyramidal sparse attention computation within the HSAT module of AMFFN involves coarse-grained and fine-grained two-level attention operations across multi-scale features, and the EUCBSC module’s cascaded depth-wise separable convolution, channel shuffle, and BSCR operations further increase computational costs during the upsampling stage. As shown in the ablation experiments (Table 5), the combination of AMFFN and EUCBSC raises GFLOPs from 58.0 to 62.4, and the complete EAGLE-DET configuration reaches 65.7 GFLOPs. Current GPU implementations are not fully optimized for our specific sparse attention and bidirectional channel shift operation patterns, suggesting the potential for more efficient attention mechanisms or lightweight upsampling alternatives in future work.

Second, the fine-grained category analysis on the DOTA1.0 dataset (Table 11) reveals uneven performance gains across different object categories. While EAGLE-DET achieves notable improvements for categories such as helicopter (+16.1% AP50) and swimming pool (+2.9% AP50), it shows degradation for soccer field (−4.1% AP50), roundabout (−2.7% AP50), and basketball court (−14.2% AP50). These categories typically feature large, relatively uniform regions where our edge enhancement strategy may introduce unnecessary high-frequency noise, suggesting that the edge-centric design philosophy, while beneficial for small, compact objects, may not universally benefit all object types. This observation indicates the need for category-adaptive feature processing strategies.

Third, our architecture exhibits diminishing returns at very high resolutions. While the current $640 \times 640$ input resolution achieves a good accuracy–efficiency balance, scaling to higher resolutions for high-altitude drone footage would cause memory requirements to grow quadratically in the HSAT module due to the attention computation, creating deployment challenges on memory-constrained edge devices. The SDEC module’s space-depth transformation with the stride-2 and stride-4 branches further amplifies channel dimensions at higher resolutions, potentially exceeding the memory capacity of typical UAV-mounted computing platforms.
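The scaling argument can be made concrete with a small calculation. The sketch below counts the entries of a dense token-by-token attention matrix, which is only an upper bound because HSAT uses sparse attention, and shows how a standard space-to-depth rearrangement multiplies the channel count. The feature stride of 16 and the 64-channel example are hypothetical values chosen for illustration.

```python
def token_count(side: int, stride: int) -> int:
    """Number of spatial tokens for a square input at a given feature stride."""
    return (side // stride) ** 2


def dense_attention_entries(side: int, stride: int) -> int:
    """Entries in a dense attention matrix (upper bound for sparse attention)."""
    n = token_count(side, stride)
    return n * n


def space_to_depth_shape(channels: int, height: int, width: int, s: int):
    """Output shape of a space-to-depth rearrangement with block size s."""
    return channels * s * s, height // s, width // s
```

Doubling the input side from 640 to 1280 quadruples the token count and therefore multiplies the dense attention entries by 16, while the stride-2 and stride-4 space-to-depth branches multiply the channel dimension by 4 and 16, respectively.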

Future work will focus on addressing these limitations through more efficient sparse attention and upsampling designs, category-adaptive feature modulation, scalable high-resolution processing, and further optimization for edge deployment, particularly targeting dedicated neural processing hardware for UAV platforms.

6. Conclusions

This paper proposes EAGLE-DET, an improved detection method targeting the core bottleneck of feature degradation in small object detection for UAV aerial imagery. By analyzing how small object features degrade during forward propagation in deep networks, we identify three key degradation stages (edge attenuation, semantic conflict, and detail loss) and design targeted cascaded feature repair strategies. The main technical contributions are as follows: (1) we propose the CMENet backbone network, which reduces computational complexity while improving small object detection accuracy through a cross-stage pyramidal structure and multi-resolution edge enhancement modules; (2) we design the AMFFN feature fusion network, which achieves adaptive semantic alignment of cross-scale features through a pyramidal sparse attention mechanism and a multi-scale spatial decoupling strategy, effectively suppressing semantic conflict; (3) we construct the EUCBSC module, which enhances feature reconstruction quality through a bidirectional channel-shift mixing mechanism, effectively alleviating detail loss during upsampling. Comprehensive experimental validation on three representative datasets, VisDrone-2019, UAVDT, and DOTA1.0, demonstrates the effectiveness of the proposed method. EAGLE-DET provides a new perspective and practical technical solutions for the design of aerial small object detection algorithms.

Author Contributions

Conceptualization, Y.T.; methodology, Y.T. and Y.D.; software, B.Z.; validation, B.Z. and D.L.; formal analysis, C.Z.; investigation, B.Z.; resources, B.M.; data curation, B.Z. and D.L.; writing—original draft preparation, Y.T.; writing—review and editing, Y.D.; visualization, C.Z.; supervision, Y.D.; project administration, B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned Aerial Vehicle
CMENet	Cross-stage Multi-resolution Edge Enhancement Network
AMFFN	Attention-guided Multi-scale Feature Fusion Network
EUCBSC	Enhanced Upsampling with Channel Bridging and Spatial Coordination
CSPMEE	Cross-Stage Pyramidal Multi-Resolution Edge Enhancement
MREAM	Multi-Resolution Edge Amplification Module
HSAT	Hierarchical Sparse Attention Transformer
SDEC	Spatial Decomposition Enhanced Convolution
SAFB	Sparse Attention Fusion Block
BSCR	Bidirectional Spatial Channel Reorganization
FPN	Feature Pyramid Network
NMS	Non-Maximum Suppression
AP	Average Precision
FPS	Frames Per Second

Appendix A. Notation Summary

Table A1. Summary of mathematical symbols and notations.

Symbol	Module	Description
X	General	Input feature tensor
$ψ_{1 \times 1}$ , $ϕ_{1 \times 1}$	General	$1 \times 1$ convolution mapping functions
$ϕ_{3 \times 3}$	General	$3 \times 3$ convolution mapping function
$C [\cdot]$	General	Feature concatenation operation
$I (\cdot, H \times W)$	General	Bilinear interpolation upsampling operation
$σ$	General	Sigmoid activation function
$δ$	General	ReLU activation function
⊙	General	Element-wise multiplication
$X_{bypass}$ , $X_{branch}$	CMENet	Bypass and enhancement branch feature tensors
$M_{MREAM}^{(i)}$	CMENet	The i-th MREAM unit
$P_{k \times k}^{adaptive}$	CMENet	Adaptive average pooling of scale $k \times k$
$D_{3 \times 3}$	CMENet	$3 \times 3$ depth-wise separable convolution
$E (\cdot)$	CMENet	EdgeEnhance enhancement function
$H_{high}$	CMENet	High-pass filtering result
$P_{3 \times 3}^{avg}$	CMENet	$3 \times 3$ average pooling for low-frequency extraction
$φ$	CMENet	Edge feature transformation convolution
$X_{shallow}$ , $X_{deep}$	AMFFN	Shallow and deep feature maps
$S_{SDEC}$	AMFFN	Spatial decoupling convolution transformation
$T_{HSAT}$	AMFFN	HSAT processing function
$A_{adaptive}$	AMFFN	Adaptive feature aggregation operation
$B_{SAFB}^{(i)}$	AMFFN	The i-th Sparse Attention Fusion Block
$A_{PSAttn}$	AMFFN	Pyramidal sparse attention function
Q, K, V	AMFFN	Query, key, and value matrices
$S_{SPD}^{(s)}$	AMFFN	Space-depth convolution with stride s
$S E (\cdot)$	AMFFN	Squeeze-excitation attention
GAP	AMFFN	Global average pooling
$W_{1}$ , $W_{2}$	AMFFN	SE weight matrices for reduction and expansion
$U_{bilinear}$	EUCBSC	Bilinear upsampling operation
$D_{DW}$	EUCBSC	Depth-wise separable convolution transformation
$S_{shuffle}$	EUCBSC	Channel shuffle function
$S_{BSCR}$	EUCBSC	Bidirectional Spatial Channel Reorganization
$Roll (\cdot)$	EUCBSC	Cyclic shift operation along specified dimension
s	EUCBSC	Shift step in BSCR

References

Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019.
Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386.
Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983.
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting mobile CNN from ViT perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 15909–15920.
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, HI, USA, 1–4 October 2023; pp. 2184–2189.
Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016.
Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6027–6037.
Liu, W.; Shi, L.; An, G. An efficient aerial image detection with variable receptive fields. Remote Sens. 2025, 17, 2672.
Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.-N. UAV-DETR: Efficient end-to-end object detection for unmanned aerial vehicle imagery. arXiv 2025, arXiv:2501.01855.
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2778–2788.
Yang, C.; Huang, Z.; Wang, N. QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13668–13677.
Yan, X.; Sun, S.; Zhu, H.; Hu, Q.; Ying, W.; Li, Y. DMF-YOLO: Dynamic multi-scale feature fusion network-driven small target detection in UAV aerial images. Remote Sens. 2025, 17, 2385.
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse DETR: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330.
Kong, Y.; Shang, X.; Jia, S. Drone-DETR: Efficient small object detection for remote sensing image using enhanced RT-DETR model. Sensors 2024, 24, 5496.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031.
Wang, C.; He, Y.; Wang, Y.; Zhang, K.; Tang, C.; Lü, J. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. arXiv 2023, arXiv:2309.11331.
Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H. InternImage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419.
Ding, X.; Zhang, X.; Han, Y.; Ding, G. Scaling up your kernels to 31 × 31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 11963–11975.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
Hatamizadeh, A.; Kautz, J. MambaVision: A hybrid Mamba-Transformer vision backbone. arXiv 2025, arXiv:2407.08083.
Su, Z.; Liu, W.; Yu, Z.; Hu, D.; Liao, Q.; Tian, Q.; Pietikäinen, M.; Liu, L. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 5117–5127.
Ye, W.; Li, Z.; Zhong, Y. DiffusionEdge: Diffusion probabilistic model for crisp edge detection. arXiv 2024, arXiv:2401.02032.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Z.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154.
Zhu, L.; Wang, B.; Dai, Z.; Yuan, T.; Ye, Y. BiFormer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333.
Si, K.; Li, Y.; Liu, W.; Zhang, H.; Yan, J. SCSA: Exploring the synergistic effects between spatial and channel attention. arXiv 2025, arXiv:2407.05128. [Google Scholar] [CrossRef]
Lu, H.; Liu, W.; Ye, Z.; Fu, H.; Liu, Y.; Cao, Z. SAPA: Similarity-aware point affiliation for feature upsampling. Adv. Neural Inf. Process. Syst. 2022, 35, 20889–20901. [Google Scholar]
Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Yu, W.; Wang, X. MambaOut: Do we really need Mamba for vision? In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2025; pp. 74–90.
Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput. Biol. Med. 2024, 170, 107917.
Li, Z.; Xu, H.; Zhang, C.; Guo, F. Rethinking feature pyramid networks for object detection. arXiv 2024, arXiv:2409.13917.
Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Ding, G.; Du, S.; Wu, Z.; Gao, Y. YOLOv13: Real-time object detection with hypergraph-enhanced adaptive visual perception. arXiv 2025, arXiv:2506.17733.
Xu, W.; Zheng, S.; Wang, C.; Zhang, Z.; Ren, C.; Xu, R.; Xu, S. SAMamba: Adaptive state space modeling with hierarchical vision for infrared small target detection. Inf. Fusion 2025, 124, 103338.
Rahman, M.M.; Munir, M.; Marculescu, R. EMCAD: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 11769–11779.
Huang, X.; Liu, S.; Zhang, K.; Tai, Y.; Yang, J.; Zeng, H.; Zhang, L. Reverse convolution and its applications to image restoration. In Proceedings of the International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2025; pp. 1–10.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162.
Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012.
Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
Jocher, G. YOLOv5 by Ultralytics. Available online: https://github.com/ultralytics/yolov5 (accessed on 19 January 2025).
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 19 January 2025).
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO11. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 19 January 2025).
Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and better for real-time aerial image detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8673–8681.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3651–3660.
Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329.

Figure 1. Example images and object size distribution of three representative UAV aerial datasets. (a) The VisDrone-2019 dataset. (b) The UAVDT dataset. (c) The DOTA1.0 dataset.

Figure 2. Overall architecture of EAGLE-DET. The framework consists of three core modules: a CMENet backbone network for edge-aware feature extraction, AMFFN for attention-guided multi-scale feature fusion, and EUCBSC for enhanced upsampling with channel-spatial coordination. The modules work collaboratively to address the three-stage degradation problem of small object features.

Figure 3. Architecture of the CSPMEE module in the CMENet backbone network. The CSPMEE module adopts a branch-fusion feature-processing paradigm, consisting of bypass and enhancement branches. The enhancement branch contains MREAM units that construct multi-scale feature pyramids through parallel adaptive pooling branches with different scales. Each branch incorporates EdgeEnhance for adaptive edge enhancement based on high-frequency and low-frequency decomposition.
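As a rough illustration of the high-frequency and low-frequency decomposition that the Figure 3 caption attributes to EdgeEnhance, the sketch below splits a single-channel feature map into a low-pass component and a residual, then adds the residual back with a gain. The box-blur filter, the fixed gain `alpha`, and the function names `box_blur` and `edge_enhance` are assumptions made for illustration only; the module in the paper is described as adaptive, which this fixed-gain version does not reproduce.

```python
def box_blur(img, k=3):
    """Low-pass filter: k x k mean with edge-replicated borders (assumed filter)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index to the border
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index to the border
                    total += img[yy][xx]
            out[y][x] = total / (k * k)
    return out


def edge_enhance(img, alpha=1.0, k=3):
    """Return img + alpha * (img - low), where low is the low-frequency part.

    alpha is a hypothetical fixed gain standing in for a learned weight.
    """
    low = box_blur(img, k)
    h, w = len(img), len(img[0])
    return [
        [img[y][x] + alpha * (img[y][x] - low[y][x]) for x in range(w)]
        for y in range(h)
    ]


if __name__ == "__main__":
    # A vertical step edge: the response is amplified on both sides of the edge,
    # while flat regions pass through unchanged.
    step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    for row in edge_enhance(step):
        print([round(v, 3) for v in row])
```

Flat regions have a zero high-frequency residual and are left untouched, so only pixels near intensity transitions change, which is the property that matters for small objects with weak boundaries.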

Figure 4. Architecture of the AMFFN feature fusion network. The network adopts a hierarchical multi-scale feature interaction architecture through a cascaded combination of SDEC and HSAT. (a) The HSAT module. (b) Detailed structure of the PSAttn mechanism. (c) The SDEC module.

Figure 5. Architecture of EUCBSC. (a) The EUCBSC module. (b) The BSCR sub-module.

Figure 6. Module ablation experimental results on the VisDrone-2019 validation set. In the configuration labels, C, A, and E represent CMENet, AMFFN, and EUCBSC, respectively. (a) Detection performance comparison across different module configurations. (b) Computational efficiency comparison.

Figure 7. Comparison of backbone and neck network architectures. (a) Backbone architecture comparison. (b) Neck architecture comparison.

Figure 8. Comparison with state-of-the-art methods on the VisDrone-2019 test set. The scatter plot shows the trade-off between detection performance and model parameters. Circle size represents FPS (frames per second).

Figure 9. Visualization of detection results on the VisDrone-2019 test set. Comparison between Ground Truth, Baseline (RT-DETR-R18), FBRT-YOLO-m, DAB-DETR, and EAGLE-DET across various challenging scenarios: (a) Enhanced small object identification capability in urban intersection scenes. (b) Accurate detection and category discrimination for ultra-small objects. (c) Multi-scale object detection in complex scenes with both nearby large vehicles and distant small motorcycles. (d) Fine-texture perception for ultra-small pixel objects, correcting baseline misdetections. (e) Dense object scene processing with precise boundary distinction. (f) Robust detection under extreme lighting conditions (night low-light). (g) Accurate identification of occluded objects while suppressing false detections. Red boxes indicate missed detections or misclassifications by the baseline or other methods that are correctly handled by EAGLE-DET.

Figure 10. Feature activation heatmaps under different scenarios comparing Ground Truth, Baseline, CMENet + AMFFN, and EAGLE-DET. The visualization covers various challenging aerial scenarios, including dense parking lots, oblique viewing angles, multi-scale intersections, and night low-light environments.

Figure 11. Visualization of detection results on the UAVDT dataset. Comparison between Ground Truth, Baseline (RT-DETR), and EAGLE-DET across various traffic scenarios, including highways, intersections, and urban roads. EAGLE-DET demonstrates superior detection capability for small and distant vehicles under different viewing angles and lighting conditions. Red boxes highlight improved detections compared to the baseline.

Figure 12. Visualization of detection results on the DOTA1.0 dataset. The comparison shows EAGLE-DET’s performance on various object categories in remote sensing imagery, including storage tanks, ships, harbors, and aircraft. Red boxes highlight improved detections compared to the baseline.

Table 1. Experimental system environment.

Item	Model/Parameters
Operating System	Ubuntu 22.04 LTS
Programming Language	Python 3.10
CPU	Intel Core(TM) i7-13700KF (Intel Corporation, Santa Clara, CA, USA)
Graphics Card	NVIDIA GeForce RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA)
GPU Memory	24 GB

Table 2. Multi-scale pooling kernel configuration comparison experiments.

Config	Type	Pooling Kernel Config	AP50 (%)	APs (%)	APm (%)
A	Single-scale	$\{3 \times 3\}$	45.8	18.1	35.2
B	Dual-scale	$\{3 \times 3, 6 \times 6\}$	46.0	19.2	36.4
C	Triple-scale	$\{3 \times 3, 6 \times 6, 9 \times 9\}$	46.0	19.8	36.1
D	Four-scale	$\{3 \times 3, 3 \times 3, 3 \times 3, 3 \times 3\}$	46.2	18.5	35.8
E	Four-scale	$\{6 \times 6, 6 \times 6, 6 \times 6, 6 \times 6\}$	46.3	19.1	36.2
F	Four-scale	$\{9 \times 9, 9 \times 9, 9 \times 9, 9 \times 9\}$	46.3	19.3	36.6
G	Four-scale	$\{12 \times 12, 12 \times 12, 12 \times 12, 12 \times 12\}$	47.6	18.9	36.0
H	Four-scale Progressive (Ours)	$\{3 \times 3, 6 \times 6, 9 \times 9, 12 \times 12\}$	47.2	20.1	38.0
I	Five-scale	$\{3 \times 3, 6 \times 6, 9 \times 9, 12 \times 12, 15 \times 15\}$	46.2	19.2	36.8
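
To make the configurations in Table 2 concrete, the following sketch builds a four-branch pooling pyramid for the progressive setting H. It assumes that each listed size (3, 6, 9, 12) is the output grid of an adaptive average pooling branch, which is one plausible reading of "Pooling Kernel Config" together with the adaptive pooling branches mentioned for Figure 3; the paper may define the sizes differently. The function names `adaptive_avg_pool2d` and `pooling_pyramid` are hypothetical, and plain Python lists stand in for feature tensors.

```python
def adaptive_avg_pool2d(img, out_size):
    """Average-pool a 2D list to out_size x out_size.

    Bin i covers rows floor(i*h/out) up to ceil((i+1)*h/out), the usual
    adaptive-pooling rule, so bins may overlap when h is not divisible.
    """
    h, w = len(img), len(img[0])
    out = []
    for i in range(out_size):
        y0 = (i * h) // out_size
        y1 = ((i + 1) * h + out_size - 1) // out_size  # integer ceiling
        row = []
        for j in range(out_size):
            x0 = (j * w) // out_size
            x1 = ((j + 1) * w + out_size - 1) // out_size
            vals = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def pooling_pyramid(img, sizes=(3, 6, 9, 12)):
    """One pooled map per branch; the default sizes follow configuration H."""
    return [adaptive_avg_pool2d(img, s) for s in sizes]


if __name__ == "__main__":
    # Toy 12 x 12 feature map whose value encodes its position.
    feat = [[float(y * 12 + x) for x in range(12)] for y in range(12)]
    for size, level in zip((3, 6, 9, 12), pooling_pyramid(feat)):
        print(size, len(level), len(level[0]))
```

Under this reading, the coarse 3 × 3 branch summarizes broad context while the 12 × 12 branch keeps finer spatial detail, and the uniform configurations D to G correspond to passing the same size four times.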