ECP-YOLO: Integrating Edge-Aware Attention and Contextual Refinement for UAV Object Detection

Wang, Qi; Cang, Mingming; Chen, Yongji

doi:10.3390/electronics15102067

by Qi Wang ¹, Mingming Cang ¹,* and Yongji Chen ²,³

¹ School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
² School of Data Science and Big Data Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
³ School of Applied Mathematics, University of Reading, Reading RG6 6DX, UK
* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2067; https://doi.org/10.3390/electronics15102067

Submission received: 6 April 2026 / Revised: 29 April 2026 / Accepted: 8 May 2026 / Published: 12 May 2026

(This article belongs to the Section Computer Science & Engineering)

Abstract

Object detection in UAV imagery is hindered by micro-scale targets, dense distributions, and cluttered backgrounds, where existing detectors fail to simultaneously achieve high accuracy and real-time throughput. We propose ECP-YOLO, a lightweight framework built on YOLOv12s, incorporating four modules: Pinwheel Convolution (PConv) for direction-selective geometric modeling, a Context Refiner Block (CRB) for spatially gated background suppression, an Edge-Aware Attention Fusion Module (EAFM) for structural boundary preservation, and a Progressive Inter-Scale Feature Fusion (PISF) strategy for cascaded cross-scale detail propagation, alongside a high-resolution P2 detection head. On VisDrone2019, ECP-YOLO achieves 38.1% mAP@0.5 and 22.1% mAP@0.5:0.95, surpassing YOLOv12s by 6.3 and 3.5 percentage points, respectively, while running at 79 FPS. On UAVDT, Precision improves from 27.0% to 34.1% and mAP@0.5 from 28.7% to 30.4%, indicating cross-dataset transferability. These results show that ECP-YOLO achieves a competitive accuracy–efficiency trade-off for real-time UAV detection in complex environments.

Keywords: UAV object detection; edge-aware attention; multi-scale feature fusion; micro-scale targets

1. Introduction

UAV-based object detection remains constrained by micro-scale targets, dense distributions, and cluttered backgrounds under volatile illumination. While YOLO-series detectors offer competitive inference speed, their efficacy on small objects is often compromised by insufficient high-resolution feature utilization. Conversely, Transformer-based [1] architectures enhance global dependencies but incur prohibitive computational overhead in dense scenarios. We identify three unresolved bottlenecks: the dissipation of fine-grained spatial details, inadequate background interference suppression, and inconsistent cross-scale feature integration.

To address these challenges, we propose ECP-YOLO, an efficient framework built upon YOLOv12s [2]. Rather than proposing isolated novel operators, this work addresses three interdependent bottlenecks in UAV small-object detection—structural boundary dissipation, inadequate background suppression, and cross-scale feature inconsistency—through a systematically co-designed framework. The core technical contributions are:

(1): A direction-selective sparse sampling mechanism (PConv [3]) that preserves target silhouettes against linear background clutter while reducing parameter redundancy.
(2): An Edge-Aware Attention Fusion Module (EAFM) that integrates deterministic Sobel operators with learnable multi-scale attention to explicitly reinforce structural boundary cues.
(3): A spatially gated context refinement block (CRB) that suppresses background noise through lightweight global context aggregation.
(4): A Progressive Inter-scale Feature Fusion strategy (PISF) that cascades shallow spatial details into deep semantic layers, enforcing cross-scale consistency.
(5): A high-resolution P2 detection head for micro-scale target localization.

Together, these components form a mutually reinforcing pipeline: PConv preserves directional structures that the EAFM subsequently enhances; the EAFM’s sharpened boundary cues guide the CRB’s spatial gating toward foreground regions; and PISF propagates the refined features across scales without semantic dilution. This integrated design achieves gains that no single component alone can deliver.

2. Related Work

The YOLO series remains the dominant framework for real-time object detection, owing to its favorable accuracy–efficiency trade-off. YOLOv10 [4] eliminates NMS through a dual-label assignment strategy, while YOLOv11 [5] further improves backbone efficiency via enhanced feature aggregation. More recently, YOLOv12 introduced attention-centric designs that strengthen global dependency modeling within a one-stage pipeline. Nevertheless, these advances primarily target generic benchmarks and have not adequately addressed the specific demands of UAV imagery, where micro-scale targets suffer from severe resolution loss, dense spatial packing, and cluttered backgrounds.

UAV-based object detection poses distinct challenges beyond those of generic benchmarks: targets are densely packed, frequently occluded, and embedded in structurally complex aerial backgrounds. Transformer-based methods such as RT-DETR [6] and its variants improve dense-scene modeling through NMS-free set prediction, but their quadratic attention complexity limits deployment on resource-constrained UAV platforms. RT-DETRv2 [7] further refines this framework by introducing scale-aware sampling and improved training strategies, yet the fundamental complexity barrier persists. State-space models, including Mamba [8], offer linear-complexity sequence modeling as an alternative, though their advantage on spatially structured aerial data remains limited. Recent UAV-specific YOLO variants have demonstrated consistent gains on VisDrone2019 [9]. PARE-YOLO [10] integrates multi-scale attention into the neck to enhance feature discrimination. MFA-YOLO [11] combines local feature mapping with progressive atrous pyramid fusion. BPD-YOLO [12] reconstructs FPNs with asymptotic shallow-deep feature integration. LUD-YOLO [13] addresses feature propagation degradation through a progressive pyramid network with dynamic sparse attention, enabling lightweight deployment on UAV edge devices. SMA-YOLO [14] introduces bidirectional multi-branch auxiliary FPNs [15] to integrate semantic and spatial information for improved small-object sensitivity. While effective, these methods perform cross-scale fusion primarily through parallel pathways, leaving the semantic-detail imbalance between shallow and deep features largely unresolved. Attention mechanisms such as BiFormer [16] and deformable convolutions further improve contextual modeling, yet background suppression under dense distributions remains a persistent weakness. 
In the tracking domain, methods such as Learning Adaptive Spatial–Temporal Context-Aware Correlation Filters [17] exploit temporal context to improve UAV tracking robustness. While targeting a different task, their spatial–temporal context modeling conceptually aligns with the cross-scale context propagation in our PISF and the global context aggregation in the CRB, highlighting the broader importance of context-aware design for UAV vision tasks.

Multi-scale feature fusion is central to small-object detection in aerial imagery. FPNs and PANet [18] establish foundational top-down and bidirectional fusion pathways, but both suffer from semantic dilution when shallow spatial details propagate through successive abstraction layers. BiFPN [19] further advances this line by introducing learnable cross-scale feature weighting to enable dynamic importance allocation across resolution levels. More recent works adopt bidirectional cross-scale paths or progressive integration strategies to mitigate this issue, yet feature inconsistency across scales—where high-level semantics dominate and suppress fine-grained spatial cues—remains an open challenge. Gold-YOLO [20] proposes a gather-and-distribute mechanism that globally fuses multi-level features, achieving better cross-scale information flow than conventional FPN variants. In parallel, standard convolutions are inherently limited in capturing directional and structural properties, constraining their ability to represent the irregular silhouettes and boundary textures of small UAV targets. Direction-aware convolutions and edge-aware attention mechanisms, such as CBAM [21], have been proposed to address this gap, yet their integration into lightweight one-stage detectors for UAV imagery remains underexplored.

In summary, despite substantial progress, three gaps persist in UAV small-object detection: insufficient structural boundary preservation, inadequate background suppression under dense distributions, and unresolved cross-scale feature inconsistency. ECP-YOLO is designed to target these three gaps through the integration of edge-aware attention, spatially gated context refinement, and progressive inter-scale feature fusion.

3. Methods

Section 3 is organized as follows: Section 3.1 presents the overall architecture of ECP-YOLO, followed by detailed descriptions of the EAFM (Section 3.2), the CRB (Section 3.3), PConv (Section 3.4), PISF (Section 3.5), and the Sobel operator-based edge enhancement (Section 3.6).

3.1. Structure of ECP-YOLO

The proposed ECP-YOLO adopts a backbone–neck–head architecture specifically re-engineered to enhance geometric fidelity and multi-scale feature consistency in UAV imagery (Figure 1). Within the backbone, PConv leverages multidirectional sparse sampling to decouple target silhouettes from linear background clutter. In the neck, the CRB suppresses background noise via spatial gating, while the EAFM re-introduces high-frequency boundary cues to mitigate semantic dilution during feature aggregation. Furthermore, the PISF strategy progressively cascades shallow spatial details across the feature hierarchy ($P_{2} \to P_{3} \to P_{4} \to P_{5}$), ensuring that fine-grained information is preserved in deeper semantic representations. Finally, a high-resolution P2 detection head, operating on a 4× down-sampling grid, is incorporated to enhance localization precision for micro-scale targets in dense aerial scenarios.

Compared with the standard YOLOv12s baseline (Figure 2), our architecture replaces the conventional backbone block with PConv and introduces the CRB and EAFM into the neck, while the PISF strategy restructures the feature propagation pathway. These modifications collectively re-engineer the feature extraction and fusion pipeline to address the specific challenges of small, densely packed, and heavily occluded targets in UAV imagery.

3.2. EAFM

The EAFM is designed to enhance model robustness for small-object detection in complex scenes through three-path parallel processing and the dynamic fusion of multi-source heterogeneous features. Its complete topology is illustrated in Figure 3. The module takes the output feature $x_{in}$ from the preceding layer as input and distributes it to three parallel branches. The upper residual branch performs linear mapping via point-wise convolution (PWConv) to generate the skip-connection feature $x_{skip}$. The middle edge-aware branch employs an Edge Extract (Sobel) module to reinforce the structural contour information of targets, producing the edge-enhanced feature $x_{edge}$. The lower Multi-scale Feature Enhancement (MFE) branch aggregates multi-scale contextual information and enriches semantic representations, yielding the feature $x_{mfe}$.

The outputs of all three branches are subsequently fed into the Boundary-Aware Gating (BAG) module. Within BAG, the three branch features are first concatenated along the channel dimension. A feed-forward network followed by a Sigmoid activation then independently generates an adaptive fusion weight $\alpha_{i} \in [0, 1]$ for each branch. The output of BAG is defined as the weighted sum of the three branch features. The resulting weighted feature is finally integrated by a Conv-BN-ReLU (CBR) block, which consists of a convolutional layer, batch normalization, and ReLU activation, to produce the final output of the EAFM. By explicitly recovering high-frequency boundary signals, the EAFM complements PConv: PConv’s direction-selective filtering suppresses linear background textures that would otherwise generate false edge responses, while the EAFM’s Sobel-based edge extraction operates on these structurally clean features to produce sharper, more reliable boundary cues.
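The gated fusion described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the paper does not specify the feed-forward network, so a single hypothetical linear layer (`w`, `b`) applied to globally averaged features stands in for it, and each gate is taken to be one scalar per branch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_aware_gate(x_skip, x_edge, x_mfe, w, b):
    """Fuse three (C, H, W) branch features with independent sigmoid gates.

    w (3, 3C) and b (3,) are a hypothetical one-layer stand-in for the
    feed-forward network; its real structure is not given in the text.
    """
    branches = np.stack([x_skip, x_edge, x_mfe])             # (3, C, H, W)
    cat = np.concatenate([x_skip, x_edge, x_mfe], axis=0)    # (3C, H, W) channel concat
    pooled = cat.mean(axis=(1, 2))                           # (3C,) global average (assumption)
    alpha = sigmoid(w @ pooled + b)                          # (3,) one gate per branch, each in [0, 1]
    fused = (alpha[:, None, None, None] * branches).sum(axis=0)  # weighted sum, (C, H, W)
    return fused, alpha

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
x_skip, x_edge, x_mfe = (rng.standard_normal((C, H, W)) for _ in range(3))
w = rng.standard_normal((3, 3 * C)) * 0.1
b = np.zeros(3)
fused, alpha = boundary_aware_gate(x_skip, x_edge, x_mfe, w, b)
```

Because each gate passes through its own Sigmoid rather than a shared Softmax, the three weights need not sum to one, which matches the "independently generates" wording in the text.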

The internal processing pipeline of the MFE branch is depicted in Figure 4 and can be decomposed into four sequential stages. Stage 1: Edge-Attention Collaborative Enhancement. The input feature is processed along two parallel paths: one extracts structural edges using fixed Sobel operators, and the other performs adaptive enhancement via learnable convolutions. The outputs of both paths, combined with the backbone feature, are fed into two Local–Global Attention (LGA) [22] modules with patch sizes of $P = 2$ and $P = 4$ to capture spatial dependencies at different scales. Stage 2: Deep Semantic Encoding. The feature passes through three consecutive CBR blocks to perform nonlinear transformation and semantic abstraction. Stage 3: Multi-scale Context Aggregation. An SPP-Lite module employs parallel dilated convolutions with dilation rates $d \in \{1, 2, 3\}$ to capture multi-scale contextual information over large receptive fields. Stage 4: Feature Refinement. The aggregated feature is element-wise added to the enhanced feature from Stage 1 via a residual connection. The result is then passed through a Convolutional Block Attention Module (CBAM), which sequentially applies channel attention and spatial attention to suppress irrelevant noise and recalibrate feature responses.

The LGA module serves as the core attention mechanism of the EAFM, and its working principle is illustrated in Figure 5. The module first partitions the input feature map into a set of local patches via an Unfold operation. In the global context summarization stage, the mean is computed along the channel dimension for each patch, aggregating it into a summary vector that encodes the global information of the corresponding local region. The resulting sequence of summary vectors is then processed by a lightweight feed-forward network and normalized via Softmax to produce a scalar attention weight for each patch, reflecting its relative importance. These weights subsequently guide two complementary forms of adaptive selection: spatially, the most discriminative locations are retained based on their saliency scores (Token Selection); channel-wise, the most informative feature dimensions are preserved according to their activation responses (Channel Selection). This mechanism enables the LGA module to dynamically focus on task-relevant regions without incurring substantial computational overhead, thereby providing efficient contextual support for the precise detection of small objects in dense scenes.

3.3. CRB Module

The Context Refiner Block (CRB), illustrated in Figure 6, operates on an input feature map $x \in \mathbb{R}^{N \times C \times H \times W}$. Three parallel $1 \times 1$ convolutions first project $x$ into three distinct representations: a spatial gating branch produces $a \in \mathbb{R}^{N \times 1 \times H \times W}$, a key branch produces $k \in \mathbb{R}^{N \times 1 \times H \times W}$, and a value branch produces $v \in \mathbb{R}^{N \times C' \times H \times W}$, where $C' = C / r$ denotes the reduced channel dimension controlled by the reduction ratio $r$. The gating map is obtained by applying a Sigmoid activation to $a$:

$$\alpha = \sigma(a) \in [0, 1]^{N \times 1 \times H \times W} \tag{1}$$

The key $k$ is reshaped to $\mathbb{R}^{N \times 1 \times HW \times 1}$ and normalized along the spatial dimension via Softmax to yield the attention weight distribution $\hat{k}$. The value $v$ is reshaped to $\mathbb{R}^{N \times 1 \times C' \times HW}$. Their matrix multiplication aggregates spatially weighted value features into a compact global context vector:

$$y = v \cdot \hat{k} \in \mathbb{R}^{N \times C' \times 1 \times 1} \tag{2}$$

The compressed context $y$ is then projected back to the original channel dimension $C$ via a $1 \times 1$ convolution $m(\cdot)$ and spatially modulated by the gating factor $\alpha$:

$$\tilde{y} = m(y) \odot \alpha \tag{3}$$

This gating mechanism allows the module to selectively emphasize contextually informative spatial locations while suppressing irrelevant background regions. The final output is defined as:

$$x' = x + \gamma \cdot \tilde{y} \tag{4}$$

where $\gamma$ is a learnable scalar coefficient initialized to zero, which ensures the module behaves as an identity mapping at the start of training and progressively incorporates global contextual information as training proceeds.
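Equations (1) to (4) can be traced end to end with a small NumPy sketch. It is a minimal illustration rather than the authors' implementation: the $1 \times 1$ convolutions are written as bias-free channel-mixing matrices (`w_a`, `w_k`, `w_v`, `w_m` are hypothetical names), and the weights are random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def context_refiner(x, w_a, w_k, w_v, w_m, gamma):
    """x: (N, C, H, W); w_a, w_k: (1, C); w_v: (C', C); w_m: (C, C'); gamma: scalar."""
    n, c, h, w = x.shape
    flat = x.reshape(n, c, h * w)                    # (N, C, HW)
    a = np.einsum("oc,ncs->nos", w_a, flat)          # gating logits, (N, 1, HW)
    k = np.einsum("oc,ncs->nos", w_k, flat)          # key logits, (N, 1, HW)
    v = np.einsum("oc,ncs->nos", w_v, flat)          # values, (N, C', HW)
    alpha = sigmoid(a).reshape(n, 1, h, w)           # Eq. (1)
    k_hat = softmax(k, axis=2)                       # attention over the HW positions
    y = np.einsum("ncs,nos->nc", v, k_hat)           # Eq. (2): global context, (N, C')
    m_y = np.einsum("oc,nc->no", w_m, y)             # project back to C channels, (N, C)
    y_tilde = m_y[:, :, None, None] * alpha          # Eq. (3): broadcast to (N, C, H, W)
    return x + gamma * y_tilde                       # Eq. (4)

rng = np.random.default_rng(0)
N, C, H, W, r = 2, 8, 5, 5, 4
x = rng.standard_normal((N, C, H, W))
w_a = rng.standard_normal((1, C))
w_k = rng.standard_normal((1, C))
w_v = rng.standard_normal((C // r, C))
w_m = rng.standard_normal((C, C // r))
out_identity = context_refiner(x, w_a, w_k, w_v, w_m, gamma=0.0)
out_refined = context_refiner(x, w_a, w_k, w_v, w_m, gamma=1.0)
```

With `gamma=0.0` the block returns its input unchanged, which is the identity behaviour at initialization that the text describes.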

The CRB complements the EAFM and PISF through orthogonal contextual modulation. While the EAFM enhances local boundary details and PISF propagates spatial cues across scales, the CRB provides global spatial context that guides both processes: it suppresses background regions that the EAFM’s edge extraction might falsely amplify, and its gating mechanism ensures that PISF’s cross-scale propagation focuses on foreground-relevant regions.

3.4. Pinwheel Convolution

Pinwheel Convolution (PConv) achieves sparse receptive field modeling through asymmetric padding and direction-selective convolution, as illustrated in Figure 7. Given an input feature map $x \in \mathbb{R}^{N \times C \times H \times W}$, four directional zero-paddings are applied, followed by $1 \times k$ and $k \times 1$ convolutions to extract horizontal and vertical features. This yields four sub-responses $\{y_{w0}, y_{w1}, y_{h0}, y_{h1}\}$, which are concatenated along the channel dimension and fused by convolution to form the final representation:

$$y = \mathrm{Conv}(\mathrm{Cat}[y_{w0}, y_{w1}, y_{h0}, y_{h1}]) \tag{5}$$

Compared with conventional $k \times k$ convolution, PConv significantly reduces redundant computation while maintaining directional sensitivity, essentially employing a sparse “pinwheel-shaped” sampling strategy to strengthen edge and structure modeling. Unlike standard kernels, PConv encodes directional priors via sparse filtering, which effectively suppresses background noise while preserving the structural integrity of micro-scale targets. This mechanism contributes to a consistent performance gain in multi-scale environments.

The parameter reduction achieved by PConv stems from its factorized asymmetric kernel design. For a standard convolution mapping $C_{in}$ input channels to $C_{out}$ output channels with kernel size $k \times k$, the total number of learnable parameters (excluding bias) is:

$$P_{Conv} = C_{in} \times C_{out} \times k^{2} \tag{6}$$

PConv replaces the square kernel with two shared asymmetric kernels, a horizontal kernel $W_{h} \in \mathbb{R}^{C_{in} \times \frac{C_{out}}{4} \times 1 \times k}$ and a vertical kernel $W_{v} \in \mathbb{R}^{C_{in} \times \frac{C_{out}}{4} \times k \times 1}$, each applied twice with distinct zero-padding offsets to produce four feature maps of $\frac{C_{out}}{4}$ channels each. The four branches are concatenated and fused by a $2 \times 2$ pointwise-like convolution $W_{cat}$ with parameters $P_{cat} = 4 C_{out}^{2}$. The parameter budget of the asymmetric kernels is therefore:

$$P_{PConv} = \underbrace{C_{in} \times \frac{C_{out}}{4} \times k}_{W_{h}} + \underbrace{C_{in} \times \frac{C_{out}}{4} \times k}_{W_{v}} = \frac{1}{2} C_{in} \times C_{out} \times k \tag{7}$$

The weight-sharing between branch pairs (pad[0]/pad[1] share $W_{h}$; pad[2]/pad[3] share $W_{v}$) is the key mechanism: four branches are computed but only two distinct kernel sets are stored. Comparing Equations (6) and (7), the parameter ratio is:

$$\rho = \frac{P_{PConv}}{P_{Conv}} = \frac{\frac{1}{2} C_{in} C_{out} k}{C_{in} C_{out} k^{2}} = \frac{1}{2k} \tag{8}$$

Equation (8) shows that the parameter ratio is inversely proportional to $k$, so the relative saving grows with the kernel size. For $k = 3$, the asymmetric kernels use only ≈16.7% of the parameters of an equivalent standard convolution. The $2 \times 2$ fusion layer adds a fixed overhead of $4 C_{out}^{2}$, which does not depend on $k$ and becomes comparatively small when $C_{in}$ is large relative to $C_{out}$. In our configuration, this factorized design reduces the total parameters from 9.90 M to 9.48 M, as reported in Table 2. Beyond parameter efficiency, PConv structurally complements the CRB: PConv’s anisotropic receptive fields encode directional priors that distinguish true target silhouettes from background clutter, providing the CRB’s global context aggregation with spatially cleaner features from which to compute attention weights.
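The parameter counts in Equations (6) to (8) are easy to check numerically. The sketch below uses illustrative channel sizes (`c_in = c_out = 64`) that are not taken from the paper, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Eq. (6): parameters of a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def pconv_kernel_params(c_in, c_out, k):
    """Eq. (7): one shared 1 x k and one shared k x 1 kernel, each with c_out / 4 outputs."""
    return 2 * c_in * (c_out // 4) * k

def pconv_fusion_params(c_out):
    """2 x 2 fusion convolution over the concatenated branches (no bias)."""
    return 4 * c_out * c_out

c_in, c_out, k = 64, 64, 3                      # illustrative sizes, not from the paper
standard = conv_params(c_in, c_out, k)          # 36864
kernels = pconv_kernel_params(c_in, c_out, k)   # 6144
fusion = pconv_fusion_params(c_out)             # 16384
ratio_kernels = kernels / standard              # equals 1 / (2k) = 1/6
ratio_total = (kernels + fusion) / standard     # kernels plus fusion layer
```

With these sizes the asymmetric kernels alone reach the $1/(2k)$ ratio of Equation (8), while the fusion layer accounts for most of the remaining PConv parameters, so the overall ratio depends on how $C_{in}$ and $C_{out}$ compare.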

3.5. Progressive Inter-Scale Feature Fusion Strategy

To address the limitations of conventional multi-scale fusion in UAV aerial detection, we propose a Progressive Inter-Scale Feature Fusion (PISF) mechanism, as illustrated in Figure 8. Unlike standard feature pyramid networks (FPNs) and their variants (e.g., PANs), which fuse features through parallel or bidirectional pathways and assign each prediction head an independent, non-overlapping scale (predicting S, M, and L objects separately), PISF adopts a fundamentally different strategy: cascaded cross-scale concatenation with cumulative information propagation.

PISF proceeds as follows. $P_{2}$ is first concatenated with $P_{3}$ and passed to the $P_{3}$ prediction head, which now captures both fine-grained spatial detail (from $P_{2}$) and mid-level semantics (from $P_{3}$), enabling it to predict both S and M objects. The enriched $P_{3}$ is then fused with $P_{4}$ and forwarded to the $P_{4}$ head, which accumulates texture, boundary, and semantic information across three scales, supporting S + M + L prediction. This layer-by-layer propagation ensures that shallow features carrying edge, texture, and high-resolution spatial cues are not discarded but continuously injected into successively deeper representations.
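The cascade can be illustrated at the level of tensor shapes. The following NumPy sketch makes several assumptions that the text does not confirm: the shallower map is brought to the deeper resolution by 2×2 average pooling (a stand-in for whatever strided operation is actually used), channels simply accumulate by concatenation with no reducing convolution, and the cascade is run over four levels following the $P_{2} \to P_{5}$ hierarchy of Section 3.1.

```python
import numpy as np

def downsample2x(x):
    """2x2 average pooling on a (C, H, W) map; a stand-in for a strided convolution."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def pisf_cascade(features):
    """features: (C_i, H_i, W_i) maps ordered shallow to deep (P2, P3, ...).

    Returns the input of each head: the shallowest map unchanged, then every
    deeper map concatenated with everything accumulated so far.
    """
    fused = [features[0]]
    carry = features[0]
    for deeper in features[1:]:
        carry = np.concatenate([downsample2x(carry), deeper], axis=0)
        fused.append(carry)
    return fused

# 640 x 640 input: strides 4, 8, 16, 32; the channel counts are illustrative only
p2, p3, p4, p5 = (np.ones((c, s, s)) for c, s in [(32, 160), (64, 80), (128, 40), (256, 20)])
outs = pisf_cascade([p2, p3, p4, p5])
shapes = [o.shape for o in outs]
```

The growing channel counts in `shapes` make the cumulative nature of the design visible: each deeper head receives every shallower level in addition to its own features.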

This cumulative fusion design confers several concrete advantages over parallel fusion strategies. First, it directly alleviates information attenuation: in a standard FPN (as shown in Figure 9), fine-grained details from shallow layers rarely survive the depth of feature abstraction. In PISF, these details are explicitly re-introduced at every stage. Second, the progressive concatenation avoids the feature redundancy and semantic conflicts that arise when independent scale-specific branches are merged in a single step. Third, each prediction head in PISF operates on a richer, multi-origin feature representation, improving both sensitivity to small objects and preservation of boundary integrity.

PISF complements the CRB by ensuring that the globally contextualized features produced by the CRB are propagated across multiple semantic depths. Without this progressive injection, the CRB’s gated refinement would remain confined to a single scale, potentially over-suppressing small-target cues at deeper layers. The cascaded concatenation ensures that shallow spatial details survive deep semantic encoding, directly addressing the cross-scale inconsistency identified in Section 1.

3.6. Sobel Operator-Based Edge Enhancement

The Sobel operator is a classical edge detection method whose fundamental principle is to highlight regions with significant gray-level variations by locally computing the image intensity gradients, thereby enhancing edge information. Let the input image be $I(x, y)$. The Sobel operator applies two orthogonal convolution kernels to compute the gradients along the horizontal and vertical directions, respectively. Specifically, the horizontal and vertical kernels are defined as:

$$G_{x} = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad G_{y} = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} \tag{9}$$

The image gradients in the horizontal and vertical directions are computed as:

$$S_{x}(x, y) = (G_{x} * I)(x, y), \quad S_{y}(x, y) = (G_{y} * I)(x, y) \tag{10}$$

where $*$ denotes the convolution operation. The gradient magnitude of each pixel is given by:

$$S(x, y) = \sqrt{S_{x}(x, y)^{2} + S_{y}(x, y)^{2}} \tag{11}$$

and the gradient orientation is expressed as:

$$\theta(x, y) = \arctan\left(\frac{S_{y}(x, y)}{S_{x}(x, y)}\right) \tag{12}$$

Through these computations, the Sobel operator extracts both the edge strength and orientation of the image. Compared with simple difference operators, Sobel incorporates smoothing weights in gradient computation, which effectively suppresses noise while preserving edge sharpness. The visualization of this process is illustrated in Figure 10, where the first row shows the original feature maps and the second row presents the edge maps extracted by the Sobel operator. Crucially, the third row demonstrates the edge-enhanced feature maps, where the Sobel-derived gradients are integrated back into the original features. As shown, the third row effectively amplifies the structural contrast of micro-scale targets, providing clearer geometric cues for the subsequent detection heads.
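As a concrete check of the gradient, magnitude, and orientation computations, the sketch below applies two 3×3 Sobel kernels to a tiny synthetic image. A few choices here are assumptions made for illustration: the kernels are named by the direction of intensity change they respond to, the filter is implemented as cross-correlation (the "convolution" used by deep-learning frameworks, which differs from true convolution only in sign for these kernels), and `arctan2` replaces the plain arctangent so that a zero horizontal gradient does not cause a division by zero.

```python
import numpy as np

# Kernel that responds to intensity changes along the horizontal direction
K_HORIZONTAL = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
# Kernel that responds to intensity changes along the vertical direction
K_VERTICAL = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)

def filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation of a 2-D image with a kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return out

def sobel(img):
    s_x = filter3x3(img, K_HORIZONTAL)
    s_y = filter3x3(img, K_VERTICAL)
    magnitude = np.sqrt(s_x ** 2 + s_y ** 2)    # gradient magnitude
    orientation = np.arctan2(s_y, s_x)          # gradient orientation in radians
    return magnitude, orientation

img = np.zeros((5, 6))
img[:, 3:] = 1.0          # vertical step edge: dark left half, bright right half
mag, theta = sobel(img)
```

For this step edge, only the two output columns whose windows straddle the boundary have a nonzero magnitude, and the orientation there is zero because the intensity changes purely in the horizontal direction.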

4. Experiments

We evaluate ECP-YOLO on VisDrone2019 and UAVDT [23], with ablation studies, scenario analysis, and cross-domain experiments.

4.1. Datasets

The VisDrone2019 dataset, jointly constructed by Tianjin University, JD Digits, and the State University of New York at Albany, is a major benchmark in UAV vision research. It covers both urban and rural environments across 14 cities in China, comprising 288 videos (261,908 frames) and 10,209 static images, with over 2.6 million annotated instances spanning 10 categories, including pedestrians, vehicles, and bicycles. As shown in Figure 11, the dataset exhibits pronounced class imbalance: pedestrian and people instances dominate with approximately 130,000 and 80,000 samples, respectively, while rare categories such as awning-tricycle, bus, and motor contribute negligibly to the overall distribution. Target centers are broadly scattered across the image plane, and bounding boxes are heavily concentrated in the small-scale range, reflecting the characteristic challenges of UAV-captured imagery: small objects, dense scenes, and complex illumination. The dataset supports both detection and tracking tasks and provides additional annotations for occlusion and truncation attributes.

The UAVDT dataset, jointly released by the University of Texas at San Antonio, the University of Chinese Academy of Sciences, and Harbin Institute of Technology, is dedicated to UAV-based vehicle detection and tracking. It contains approximately 80,000 frames extracted from 10 h of video at a resolution of 1080 × 540 at 30 fps, covering 16 traffic scenario types, including arterial roads, toll stations, and highways. As illustrated in Figure 12, the dataset is heavily dominated by the car category, with nearly 400,000 annotated instances, while trucks and buses appear far less frequently, resulting in a severe long-tail distribution. Spatially, targets are concentrated along vertical mid-image strips, consistent with road-following flight paths, and bounding boxes cluster tightly in the small width-height range, underscoring the prevalence of small and densely packed vehicle targets in aerial traffic monitoring.

4.2. Experimental Environment and Parameter Setup

The maximum number of training epochs is set to 300, with an input resolution of 640 × 640 pixels. Data augmentation strategies include Mosaic augmentation [24] (enabled for the first 10 epochs and disabled for the last 10 epochs), RandAugment [25], and HSV color-space adjustment (h = 0.015, s = 0.7, v = 0.4), along with horizontal flipping at a probability of 50%. The optimizer is configured in auto-selection mode, with an initial learning rate of 0.01, a decay factor of 0.01, momentum of 0.937, weight decay of 0.0005, a batch size of 16, and mixed-precision training enabled.

For loss weighting, the bounding box loss is set to 7.5, the classification loss to 0.5, and the distribution focal loss to 1.5. Post-processing employs non-maximum suppression (NMS) with an IoU threshold of 0.7. Additionally, deterministic training is enabled to ensure reproducibility, multi-scale training is disabled, and only the final model is saved upon training completion. The experimental environment is summarized in Table 1.
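For reference, the settings listed above can be collected into a single configuration mapping. The key names below follow Ultralytics-style argument naming purely as an assumption, since the text does not name the training framework; the values are the ones stated in this section.

```python
# Hyperparameters as stated in Section 4.2; key names are an assumed naming scheme.
train_cfg = {
    "epochs": 300,
    "imgsz": 640,                  # input resolution 640 x 640
    "batch": 16,
    "optimizer": "auto",           # auto-selection mode
    "lr0": 0.01,                   # initial learning rate
    "lrf": 0.01,                   # decay factor as stated in the text
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "amp": True,                   # mixed-precision training
    "hsv_h": 0.015,
    "hsv_s": 0.7,
    "hsv_v": 0.4,
    "fliplr": 0.5,                 # horizontal flip probability
    "auto_augment": "randaugment",
    "close_mosaic": 10,            # Mosaic disabled for the last 10 epochs
    "box": 7.5,                    # bounding box loss weight
    "cls": 0.5,                    # classification loss weight
    "dfl": 1.5,                    # distribution focal loss weight
    "iou": 0.7,                    # NMS IoU threshold
    "deterministic": True,
    "multi_scale": False,
}
```

Keeping the values in one mapping makes it straightforward to log them alongside each ablation run.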

4.3. Evaluation Metrics

We evaluate detection performance using standard metrics: IoU, Precision, Recall, and mean Average Precision (mAP).

IoU measures the spatial overlap between the predicted bounding box and the ground truth, defined as the ratio of their intersection area to their union area:

$$\mathrm{IoU} = \frac{|B_{p} \cap B_{gt}|}{|B_{p} \cup B_{gt}|} \tag{13}$$
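A minimal sketch of this overlap ratio for axis-aligned boxes is given below. The `(x1, y1, x2, y2)` corner format and the function name are assumptions for illustration.

```python
def iou(box_p, box_gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero when the boxes are disjoint
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

# Two 10 x 10 boxes offset by 5 pixels: intersection 50, union 150
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

Here `overlap` is one third, so this prediction would not count as a match at an IoU threshold of 0.5.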

A threshold (e.g., 0.5) is typically set to determine whether a detection is successful: if IoU ≥ 0.5, it is regarded as a true positive (TP); otherwise, it is considered a false positive (FP). Precision measures the proportion of correctly predicted positives among all predicted positives, reflecting the reliability of the detection results.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{14}$$

Recall measures the proportion of actual positive samples that are successfully detected, reflecting the model’s coverage capability.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{15}$$

where FN (false negative) represents ground-truth objects that are missed by detection. Average Precision (AP), defined for a single category, is obtained by calculating the area under the Precision–Recall (PR) curve across different confidence thresholds. Specifically, predicted boxes are first ranked in descending order of confidence, cumulative TP/FP counts are computed to generate the PR curve, and the area under the curve is then derived via integration. The mean Average Precision (mAP) serves as the core comprehensive metric in object detection, calculated as the mean of AP across all categories:

$$\mathrm{mAP} = \frac{1}{N} \sum_{i = 1}^{N} AP_{i} \tag{16}$$

where N denotes the total number of categories. In practice, the mAP is often refined under different IoU thresholds: mAP@0.5 represents the mAP computed at an IoU threshold of 0.5 (PASCAL VOC standard), whereas mAP@0.5:0.95 denotes the average mAP computed over IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05 (COCO standard [26]).
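The ranking and accumulation steps behind AP and mAP can be sketched as follows. This is a simplified illustration: it integrates the raw precision-recall curve without the precision interpolation that the VOC and COCO toolkits apply, and the function names and sample detections are hypothetical.

```python
def average_precision(scored_hits, num_gt):
    """scored_hits: list of (confidence, is_true_positive); num_gt: ground-truth count.

    Accumulates precision times the recall increment at every detection,
    i.e. the area under the raw (non-interpolated) precision-recall curve.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

def mean_ap(ap_per_class):
    """Mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# IoU thresholds used for mAP@0.5:0.95 (0.50, 0.55, ..., 0.95)
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, True)]   # toy detections for one class
ap = average_precision(dets, num_gt=4)
```

For mAP@0.5:0.95, the true-positive flags would be recomputed at each value in `iou_thresholds` and the resulting mAP values averaged.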

4.4. Ablation Experiments

4.4.1. Ablation Experiments of Different Improved Modules

To validate the effectiveness of each proposed component, we conduct both single-module and stepwise ablation experiments on the VisDrone2019 dataset, with results summarized in Table 2.

The baseline YOLOv12s achieves 31.8% mAP@0.5 and 18.6% mAP@0.5:0.95. Introducing the P2 Detection Head (A) yields a substantial gain, raising mAP@0.5 to 34.0% (+2.2%) and mAP@0.5:0.95 to 19.8% (+1.2%), confirming that high-resolution shallow features are critical for small-object detection in UAV imagery.

We then evaluate each module independently by adding it to A. The EAFM (B) improves mAP@0.5 to 35.9% (+1.9 over A), with concurrent gains in both Precision and Recall, indicating that edge-aware attention strengthens feature discrimination in cluttered scenes. PConv (C) achieves 34.7% (+0.7 over A) while reducing parameters from 9.58 M to 9.16 M, demonstrating that direction-sensitive sparse sampling enhances structural modeling with improved efficiency. The CRB (D) reaches 34.4% (+0.4 over A) with negligible GFLOP increase, confirming that spatially gated context aggregation suppresses background noise at minimal cost. PISF (E) attains 34.6% (+0.6 over A), validating that progressive cross-scale detail propagation alone contributes to better small-target localization. The sum of these individual gains (+1.9 + 0.7 + 0.4 + 0.6 = +3.6) plus the P2 head contribution (+2.2) totals +5.8%, approaching the full model’s +6.3% gain, indicating that the four modules address largely orthogonal degradation modes with limited functional overlap.
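The additivity argument above is simple arithmetic on the reported mAP@0.5 values, and it can be reproduced directly. The variable names are illustrative; the numbers are those quoted in this subsection.

```python
baseline, with_p2, full = 31.8, 34.0, 38.1      # mAP@0.5 (%) reported in the text
single_module = {"EAFM": 35.9, "PConv": 34.7, "CRB": 34.4, "PISF": 34.6}

# Gain of the P2 head over the baseline, and of each module over configuration A
p2_gain = round(with_p2 - baseline, 1)
module_gains = {name: round(score - with_p2, 1) for name, score in single_module.items()}

# Sum of the individual contributions versus the gain of the full model
summed_gain = round(p2_gain + sum(module_gains.values()), 1)
full_gain = round(full - baseline, 1)
gap = round(full_gain - summed_gain, 1)
```

The remaining `gap` of 0.5 points is the part of the full-model gain that the individually measured contributions do not account for.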

Building upon B, the stepwise integration further validates the cumulative benefit. Adding PConv (F) pushes mAP@0.5 to 36.8% and mAP@0.5:0.95 to 21.7%, with parameters decreasing from 9.90 M to 9.48 M. Incorporating the CRB (G) achieves 37.3% mAP@0.5 and 21.9% mAP@0.5:0.95, with Recall rising to 38.3%; the negligible GFLOP increase (49.4 → 49.5) confirms the CRB’s lightweight design. Finally, integrating PISF (H) yields the best overall performance, with mAP@0.5 and mAP@0.5:0.95 reaching 38.1% and 22.1%, corresponding to absolute gains of +6.3% and +3.5% over the baseline. Precision marginally decreases from 48.8% to 48.6%, attributable to additional cross-scale pathways slightly broadening activated response regions. This is offset by a significant Recall improvement (+1.2%), confirming that progressive fusion effectively alleviates detail loss during deep semantic encoding and enhances sensitivity to small, densely distributed targets.

Table 2 also provides per-component latency profiling. The EAFM and CRB introduce only modest overhead (reductions of 5 FPS, from 131 to 126, and 9 FPS, from 121 to 112, respectively), confirming their lightweight designs. The primary latency bottlenecks are the P2 detection head and PISF: the P2 head increases pixel processing density for micro-scale targets, while PISF requires additional cross-scale feature propagation. Although the full configuration reduces FPS from 180 to 79, the substantial detection accuracy gains justify this trade-off, as 79 FPS remains well above the standard 30 FPS requirement for practical UAV deployment.

Taken together, the ablation results demonstrate that each module contributes both independently and complementarily, with the full architecture achieving a consistent and substantial improvement over YOLOv12s across all accuracy metrics (mAP@0.5, mAP@0.5:0.95, Precision, and Recall).
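To make the latency discussion concrete, the following Python sketch converts the FPS column of Table 2 into per-frame latency along the stepwise path (baseline, A, B, F, G, H) and attributes the added time to each component. It assumes that FPS is simply the reciprocal of per-frame latency (single-image inference), which the paper does not state explicitly; the variable names are illustrative.

```python
# (configuration, component added at this step, FPS) from Table 2.
STEPWISE_FPS = [
    ("YOLOv12s", "baseline", 180),
    ("A", "P2 head", 131),
    ("B", "EAFM", 126),
    ("F", "PConv", 121),
    ("G", "CRB", 112),
    ("H", "PISF", 79),
]


def latency_ms(fps):
    """Per-frame latency in milliseconds, assuming latency = 1 / FPS."""
    return 1000.0 / fps


# Extra time per frame introduced at each step of the stepwise ablation.
added_latency = {}
for (_, _, prev_fps), (_, component, fps) in zip(STEPWISE_FPS, STEPWISE_FPS[1:]):
    added_latency[component] = latency_ms(fps) - latency_ms(prev_fps)

for component, delta in added_latency.items():
    print(f"{component:8s} +{delta:.2f} ms per frame")

REAL_TIME_BUDGET_MS = latency_ms(30)  # 30 FPS requirement cited in the text
print(f"full model: {latency_ms(79):.2f} ms, budget: {REAL_TIME_BUDGET_MS:.2f} ms")
```

Under this assumption the P2 head and PISF account for the two largest increments, which matches the bottlenecks named above.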

4.4.2. Sub-Component Ablation Within the EAFM

To isolate the contribution of each sub-component inside the EAFM, we conduct a fine-grained ablation by removing the Sobel branch, LGA, CBAM, and SPP-Lite one at a time from the full EAFM. All experiments are built upon Experiment A (34.0% mAP@0.5). As shown in Table 3, LGA contributes the largest marginal gain (+1.1%), confirming its central role in feature discrimination. SPP-Lite and the Sobel branch follow with +0.8% and +0.7%, respectively, while the CBAM adds +0.5%.

Taken together, the combined +1.9% gain from the full EAFM is less than the sum of the individual contributions (+1.1 + 0.8 + 0.7 + 0.5 = +3.1%), indicating partial functional overlap among the four components. This redundancy is practically desirable: when specific cues are degraded under challenging conditions, the remaining pathways provide compensatory feature discrimination, stabilizing overall detection performance.
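The comparison between the combined EAFM gain and the individual contributions is plain arithmetic on Table 3. The Python sketch below reproduces it; the names are illustrative, and the snippet only restates the reported numbers without adding any new measurement.

```python
# mAP@0.5 (%) from Table 3 (all configurations built on Experiment A).
FULL_EAFM = 35.9
WITHOUT_EAFM = 34.0
WITHOUT_COMPONENT = {
    "Sobel": 35.2,
    "LGA": 34.8,
    "CBAM": 35.4,
    "SPP-Lite": 35.1,
}

# Marginal contribution of a component: drop in mAP@0.5 when it is removed.
marginal = {name: FULL_EAFM - score for name, score in WITHOUT_COMPONENT.items()}

sum_of_marginals = sum(marginal.values())
combined_gain = FULL_EAFM - WITHOUT_EAFM

for name, gain in sorted(marginal.items(), key=lambda item: -item[1]):
    print(f"{name:9s} {gain:+.1f}")
print(f"sum of marginal contributions: {sum_of_marginals:+.1f}")
print(f"combined EAFM gain:            {combined_gain:+.1f}")
```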

4.4.3. Experimental Results

The experimental results on VisDrone2019 demonstrate that the proposed method consistently outperforms YOLOv12s across all object categories. As shown in Table 4, the overall mAP@0.5 improved from 31.8% to 38.1%, representing a gain of 6.3 percentage points. Notably, detection accuracy for the pedestrian and people categories improved most significantly, by 9.4% and 10.1%, respectively, highlighting the enhanced representation capability for dense small objects. Substantial gains were also observed for the motor category (+8.9%). For the bicycle, tricycle, and awning-tricycle categories, improvements ranged from 3.3% to 5.8%. Even in large-object categories such as car, truck, and bus, performance improved by 3.2–6.5%. Overall, the results confirm that the proposed method achieves superior detection performance across diverse object classes.

4.5. Model Comparison

The comparative experimental results on the VisDrone2019 dataset are presented in Table 5. Among lightweight one-stage detectors evaluated under a 640 × 640 single-scale protocol, ECP-YOLO achieves 38.1% mAP@0.5 and 22.1% mAP@0.5:0.95, surpassing YOLOv8s by 6.1 and 4.0 percentage points, YOLOv11s by 5.8 and 3.9 percentage points, and the baseline YOLOv12s by 6.3 and 3.5 percentage points, respectively. Compared with UAV-specific variants, ECP-YOLO outperforms MFA-YOLO, BPD-YOLO, and MF-YOLO [27] in both mAP@0.5 and mAP@0.5:0.95 while delivering the highest Precision (48.6%) and Recall (39.5%) among all compared methods for which these metrics are reported. The high Recall indicates strong coverage of densely packed micro-scale targets, and the high Precision confirms effective false positive suppression in cluttered backgrounds.

Although YOLO-MARS [30] reports a higher mAP@0.5 (40.9%), it lacks Precision, Recall, and latency measurements, which are essential for evaluating real-time deployment viability. Similarly, RT-DETR-R18 achieves competitive mAP@0.5:0.95 (20.7%) but contains 20.1 M parameters and operates at inference speeds far below the real-time threshold required for UAV deployment. In contrast, ECP-YOLO maintains 79 FPS with 11.8 M parameters, achieving the most favorable accuracy–efficiency trade-off among the methods with complete metrics. Notably, ECP-YOLO achieves this with only 55.7 GFLOPs, substantially fewer than YOLOv8m (79.3) and competitive with RT-DETR-R18 (57). This efficient design, combined with the model’s strong performance on small and densely distributed objects, underscores its suitability for real-time UAV object detection in complex environments.
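One hedged way to read the accuracy–efficiency comparison is a Pareto check over the rows of Table 5 that report every column. The Python sketch below treats mAP@0.5 as a quantity to maximize and parameters and GFLOPs as quantities to minimize; this three-criterion choice and the helper names (`dominates`, `pareto_front`) are assumptions made for illustration and do not come from the paper, which also weighs Precision, Recall, and FPS.

```python
from typing import NamedTuple


class Detector(NamedTuple):
    name: str
    map50: float     # mAP@0.5 (%), higher is better
    params_m: float  # parameters (M), lower is better
    gflops: float    # GFLOPs, lower is better


# Rows of Table 5 with all metrics reported.
DETECTORS = [
    Detector("SSD", 24.1, 13.3, 22.8),
    Detector("YOLOv8s", 32.0, 11.2, 28.7),
    Detector("YOLOv8m", 35.1, 25.9, 79.3),
    Detector("YOLOv10n", 33.9, 2.69, 8.2),
    Detector("YOLOv10s", 31.6, 8.0, 24.8),
    Detector("YOLOv10m", 33.9, 16.0, 64.0),
    Detector("YOLOv11s", 32.3, 9.4, 23.5),
    Detector("RT-DETR-R18", 36.2, 20.1, 57.0),
    Detector("Ours", 38.1, 11.8, 55.7),
]


def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    no_worse = (a.map50 >= b.map50 and a.params_m <= b.params_m
                and a.gflops <= b.gflops)
    strictly_better = (a.map50 > b.map50 or a.params_m < b.params_m
                       or a.gflops < b.gflops)
    return no_worse and strictly_better


def pareto_front(models):
    """Models that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


front = [m.name for m in pareto_front(DETECTORS)]
print(front)
```

Under these three criteria ECP-YOLO is not dominated by any listed model and dominates both YOLOv8m and RT-DETR-R18, while the much smaller YOLOv10n remains non-dominated at lower accuracy.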

4.6. Performance Analysis of ECP-YOLO Across Different Scenarios

To further assess model robustness in UAV scenarios, we conduct fine-grained comparisons on the VisDrone2019 dataset across five representative conditions: daytime, nighttime, dense distribution, motion blur, and occlusion. The results are summarized in Table 6.

In daytime conditions, the proposed model improves mAP@0.5 from 35.8% to 42.0%, with notable per-class gains in pedestrian (+11.4%) and people (+14.4%), attributable to the EAFM’s multi-scale attention and PISF’s cross-level detail supplementation under strong illumination.

In nighttime conditions, where low contrast and noise severely degrade detection, mAP@0.5 increases from 31.4% to 37.6%, with pedestrian AP nearly doubling (14.0% → 27.7%) and motor AP rising from 12.0% to 24.8%. PConv’s directional filtering and the EAFM’s edge-aware branch contribute most to contour recovery under low-light conditions.

In occlusion scenarios, the most substantial gains are observed: mAP@0.5 increases from 40.3% to 48.0%, Recall from 41.8% to 49.0%, and pedestrian AP from 43.6% to 56.5%. PConv’s sparse receptive field modeling and the CRB’s gated modulation effectively aggregate local semantics under partial visibility.

The limited improvement under motion blur stems from the fundamental dependence of the EAFM and PConv on structural edge cues. Under severe blur, gradient magnitudes are substantially attenuated, rendering the Sobel-based edge extraction and direction-selective sparse sampling ineffective. The slight Recall decline (16.2% → 15.4%) can be attributed to PISF: when shallow features are degraded by blur, cascading them into deeper layers may introduce noise rather than useful spatial detail, suppressing otherwise correct low-confidence predictions. Mitigating this limitation requires mechanisms that do not rely solely on explicit gradient computation, such as learning-based deblurring modules or temporal feature aggregation across video frames. We identify this as a key direction for future work and discuss it in the Conclusion.
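The attenuation of gradient magnitudes under blur can be illustrated with a toy example. The Python sketch below applies the standard 3 × 3 Sobel kernels to a synthetic vertical step edge before and after a horizontal box blur that stands in for motion blur. It is only an illustration of the effect described above, not the Sobel branch of the EAFM; the image size, blur length, and helper names are assumptions.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def correlate_valid(img, kernel):
    """Valid-mode 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * img[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def peak_sobel_magnitude(img):
    """Largest Sobel gradient magnitude found in the image."""
    gx = correlate_valid(img, SOBEL_X)
    gy = correlate_valid(img, SOBEL_Y)
    return max(
        math.hypot(x, y)
        for row_x, row_y in zip(gx, gy)
        for x, y in zip(row_x, row_y)
    )


def horizontal_box_blur(img, length):
    """Average over `length` horizontal neighbours (crude motion-blur stand-in)."""
    return correlate_valid(img, [[1.0 / length] * length])


# 16 x 32 image with a sharp vertical edge between columns 15 and 16.
step_edge = [[0.0] * 16 + [1.0] * 16 for _ in range(16)]

sharp_peak = peak_sobel_magnitude(step_edge)
blurred_peak = peak_sobel_magnitude(horizontal_box_blur(step_edge, 9))
print(f"sharp edge:   {sharp_peak:.3f}")
print(f"blurred edge: {blurred_peak:.3f}")
```

A 9-pixel blur spreads the unit step over a ramp with slope 1/9, so the peak response falls from 4 to 8/9 in this toy setting.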

4.7. Generalization Experiment

Table 7 presents the comparison results on the UAVDT dataset. The proposed model improves Precision from 27.0% to 34.1% and mAP@0.5 from 28.7% to 30.4% over the YOLOv12s baseline. Recall decreases from 38.5% to 34.3%, which is consistent with the model’s tendency toward higher-confidence predictions for small targets in complex scenes: a trade-off that reduces redundant detections at the cost of some Recall.
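A compact way to weigh the Precision gain against the Recall loss is the F1 score, the harmonic mean of the two. The paper does not report F1, so the Python sketch below is only an illustrative calculation derived from the Precision and Recall values in Table 7; the function name is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of Precision and Recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and Recall (%) on UAVDT, copied from Table 7.
baseline_f1 = f1_score(27.0, 38.5)
ours_f1 = f1_score(34.1, 34.3)

print(f"baseline F1: {baseline_f1:.1f}")
print(f"ours F1:     {ours_f1:.1f}")
```

On these reported values the derived F1 rises from about 31.7 to about 34.2, so the Precision gain outweighs the Recall loss under this particular summary.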

At the category level, car AP marginally drops from 67.2% to 67.1%, while truck AP improves from 2.87% to 4.21% and bus AP from 16.0% to 20.1%. Given the severe class imbalance in UAVDT—where car instances dominate—this redistribution of discriminative capacity toward long-tail categories is expected and reflects the model’s improved ability to handle multi-class detection under imbalanced distributions.

5. Results and Visual Analysis

5.1. Result Visualization

Visual detection comparisons between ECP-YOLO and the YOLOv12s baseline on the VisDrone2019 dataset are presented in Figure 13, where the left and right columns correspond to our model and the baseline, respectively.

In the low-light blurred scenario (a), the CRB’s global context aggregation and the EAFM’s edge-aware enhancement jointly suppress noise interference, recovering target boundaries that the baseline fails to resolve. In the complex background scenario (b), the combined effect of the EAFM’s multi-scale aggregation, the CRB’s spatial modulation, and PISF’s progressive feature injection substantially improves foreground–background discrimination, reducing both missed detections and false positives among densely packed objects. In the dark, dense scenario (c), PConv’s directional filtering and PISF’s shallow-detail supplementation mitigate target–background confusion that causes widespread missed detections in the baseline. Under overexposure (d), the CRB’s contextual refinement and PISF’s cross-level feature propagation stabilize detection where the baseline suffers from saturated feature responses. Across all four conditions, ECP-YOLO produces visibly fewer missed detections and false positives, supporting its robustness under diverse and adverse imaging conditions.

5.2. Synergistic Analysis of Module Interactions

The four modules in ECP-YOLO are co-designed to address interdependent UAV-specific bottlenecks rather than functioning independently. Under dense distributions, PConv suppresses background textures that would otherwise generate false edge responses; the EAFM subsequently extracts cleaner boundary cues, and the CRB’s spatial gating produces sharper foreground activations with fewer false positives. Under occlusion, the Sobel branch recovers partial contours, PISF cascades these cues into deeper layers, and the CRB’s global context compensates for missing local information.

The sub-component ablation (Section 4.4.2) reinforces this interpretation: the four EAFM components exhibit partial functional overlap, with combined gain (+1.9%) smaller than the sum of individual contributions. This redundancy is intentional—when one pathway is degraded (e.g., gradients under motion blur), the remaining pathways provide compensatory discrimination, stabilizing performance across diverse conditions.

At the macro level, the single-module experiments (B–E in Table 2) further confirm this design principle: the sum of individual gains (+1.9 from the EAFM, +0.7 from PConv, +0.4 from the CRB, +0.6 from PISF) plus the P2 head gain (+2.2) totals +5.8%, approaching the full model’s +6.3%. The residual +0.5% reflects weak positive interactions: PConv’s directionally filtered features slightly enhance the EAFM’s edge extraction, and the CRB’s global context modestly improves PISF’s cross-scale propagation. This near-additive structure demonstrates that the four main modules target largely orthogonal degradation modes.

5.3. Heatmap Analysis

As shown in Figure 14, the baseline heatmaps exhibit diffuse activations distributed across road surfaces and background structures, indicating insufficient target–background discrimination. In contrast, the proposed model generates compact, high-intensity responses centered on individual vehicle instances with well-defined contour boundaries while effectively suppressing spurious activations in non-target regions. This qualitative evidence corroborates the quantitative results, demonstrating that the proposed modules substantially enhance feature selectivity and spatial localization precision under dense traffic conditions.

5.4. Grayscale Analysis

As illustrated in Figure 15, the baseline produces diffuse activations spanning both target and background regions, resulting in indistinct small-target boundaries and elevated sensitivity to environmental clutter, with missed detections observable in densely occluded areas highlighted by the red bounding box. The proposed model, by contrast, yields sharply localized and high-intensity responses concentrated on individual target instances, with background interference substantially attenuated and more complete detection coverage across congested regions. At the P2 detection head, activations are spatially selective and precisely centered on small targets, demonstrating effective feature discrimination at the shallow encoding stage. At the P3 detection head, responses exhibit greater spatial uniformity and semantic consistency, reflecting the role of progressive inter-scale fusion in consolidating global contextual representations. This complementary activation pattern across detection scales confirms that the proposed modules collectively strengthen both fine-grained spatial localization and high-level semantic discriminability under complex UAV imaging conditions.

5.5. Cross-Domain Experiment

As illustrated in Figure 16, the proposed model trained solely on VisDrone2019 achieves competitive or superior detection coverage across daytime, nighttime, dense traffic, and foggy conditions compared to the domain-matched UAVDT-trained baseline. The generalization advantage is primarily attributed to the greater scene diversity and denser small-object annotations in VisDrone2019, which promote learning of transferable representations capturing vehicle contours, textures, and multi-scale structural cues. The UAVDT-trained model, despite being in-domain, exhibits a tendency to overfit dataset-specific patterns such as motion blur and extreme illumination, resulting in degraded cross-domain robustness. These findings indicate that training on a more diverse and densely annotated source dataset yields stronger domain-generalizable feature representations, supporting the practical deployment of the proposed model across heterogeneous UAV operational environments.

6. Conclusions

In this study, we addressed the persistent challenges of micro-scale target leakage and background interference in UAV-based object detection by proposing ECP-YOLO. Rather than relying on conventional architectural scaling, our approach emphasizes directional geometric modeling and progressive cross-scale feature refinement.

Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that ECP-YOLO achieves a favorable balance between detection accuracy and computational efficiency. On VisDrone2019, ECP-YOLO attains 38.1% mAP@0.5 and 22.1% mAP@0.5:0.95, surpassing the YOLOv12s baseline by 6.3 and 3.5 percentage points, respectively, while maintaining 79 FPS with only 11.8 M parameters. The integration of edge-aware boundary preservation and spatial gating mechanisms effectively suppresses complex background interference: under nighttime conditions, mAP@0.5 improves from 31.4% to 37.6%, and under occlusion, mAP@0.5 rises from 40.3% to 48.0%. The progressive cross-scale propagation strategy alleviates information loss in deep feature hierarchies, contributing to a Recall improvement from 33.4% to 39.5% and enabling more reliable localization of extremely small objects. On UAVDT, Precision improves from 27.0% to 34.1% and mAP@0.5 from 28.7% to 30.4%, confirming cross-dataset transferability.

Overall, ECP-YOLO provides a lightweight yet effective solution for UAV-based detection tasks and is well-suited for deployment on resource-constrained edge devices. Future work will proceed along four directions. First, we will incorporate temporal feature aggregation across consecutive frames to recover degraded structural cues, directly targeting the motion blur limitation identified in our scenario analysis. Second, we will extend the framework to multi-spectral aerial imagery to improve robustness under low illumination and adverse weather. Third, we will explore learning-based deblurring modules as an alternative to the current edge operators, enabling end-to-end compensation for image degradation without explicit gradient computation. Fourth, we will benchmark ECP-YOLO on representative embedded platforms (e.g., NVIDIA Jetson series) using TensorRT optimization and INT8 quantization to validate its real-time performance in practical edge deployment scenarios.

Author Contributions

Conceptualization, Q.W. and M.C.; methodology, M.C.; software, M.C.; validation, M.C., Q.W. and Y.C.; formal analysis, M.C.; investigation, M.C.; resources, Q.W.; data curation, M.C.; writing—original draft preparation, M.C.; writing—review and editing, Q.W. and Y.C.; visualization, M.C.; supervision, Q.W.; project administration, Q.W.; funding acquisition, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in public repositories, including the VisDrone [https://github.com/VisDrone/VisDrone-Dataset (accessed on 7 May 2026)] and UAVDT [https://opendatalab.com/OpenDataLab/UAVDT (accessed on 7 May 2026)] datasets. The specific data processing scripts and model configurations generated during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
2. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
3. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-Shaped Convolution and Scale-Based Dynamic Loss for Infrared Small Target Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; AAAI: Washington, DC, USA, 2025; Volume 39, pp. 9202–9210.
4. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 10–15 December 2024; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2024; Volume 37, pp. 107984–108011.
5. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
6. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974.
7. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer. arXiv 2024, arXiv:2407.17140.
8. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752.
9. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q.; Zheng, J.; Peng, T.; Wang, X.; Zhang, Y.; et al. VisDrone-SOT2019: The Vision Meets Drone Single Object Tracking Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–8.
10. Zhang, H.; Xiao, P.; Yao, F.; Zhang, Q.; Gong, Y. Fusion of Multi-Scale Attention for Aerial Images Small-Target Detection Model Based on PARE-YOLO. Sci. Rep. 2025, 15, 4753.
11. Li, S.; Chen, C. MFA-YOLO: A Multi-Feature Aggregation Approach for Small-Object Detection Method in Drone Imagery. Sci. Rep. 2026, 16, 2484.
12. Chao, M.; Peng, C.; Yun, L.; Zhang, C.; Wang, H.; Chen, Z. A Lightweight Small Object Detection Model for UAV Images Based on Deep Semantic Integration. Sci. Rep. 2025, 15, 31888.
13. Fan, Q.; Li, Y.; Deveci, M.; Zhong, K.; Kadry, S. LUD-YOLO: A Novel Lightweight Object Detection Network for Unmanned Aerial Vehicle. Inf. Sci. 2025, 686, 121366.
14. Zhou, S.; Zhou, H.; Qian, L. A Multi-Scale Small Object Detection Algorithm SMA-YOLO for UAV Remote Sensing Images. Sci. Rep. 2025, 15, 9255.
15. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125.
16. Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision Transformer with Bi-Level Routing Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 10323–10333.
17. Yuan, D.; Chang, X.; Li, Z.; He, Z. Learning Adaptive Spatial-Temporal Context-Aware Correlation Filters for UAV Tracking. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 70.
18. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 8759–8768.
19. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10781–10790.
20. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Han, K.; Wang, Y. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Neural Information Processing Systems Foundation, Inc. (NeurIPS): San Diego, CA, USA, 2023; Volume 36, pp. 51094–51112.
21. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–19.
22. Shao, Y. Local-Global Attention: An Adaptive Mechanism for Multi-Scale Feature Integration. arXiv 2024, arXiv:2411.09604.
23. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 370–386.
24. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
25. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 702–703.
26. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
27. Li, W.; Li, A.; Kong, X.; Zhang, Y.; Li, Z. MF-YOLO: Multimodal Fusion for Remote Sensing Object Detection Based on YOLOv5s. In Proceedings of the 2024 27th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Tianjin, China, 8–10 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 897–903.
28. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.E.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
29. Mittal, P.; Sharma, A.; Singh, R.; Dhull, V. Dilated Convolution Based RCNN Using Feature Fusion for Low-Altitude Aerial Objects. Expert Syst. Appl. 2022, 199, 117106.
30. Zhang, G.; Peng, Y.; Li, J. YOLO-MARS: An Enhanced YOLOv8n for Small Object Detection in UAV Aerial Imagery. Sensors 2025, 25, 2534.

Figure 1. Structural diagram of ECP-YOLO.

Figure 2. Structural diagram of YOLOv12.

Figure 3. Structural diagram of the EAFM.

Figure 4. Structural diagram of the MFE module.

Figure 5. Structural diagram of LGA.

Figure 6. Structural diagram of the CRB.

Figure 7. Structural diagram of Pinwheel Convolution.

Figure 8. Progressive inter-scale feature fusion.

Figure 9. Feature pyramid network.

Figure 10. Edge enhancement of dense objects via the Sobel operator for improved structural boundary definition.

Figure 11. Statistical analysis of the VisDrone2019 dataset: class distribution, spatial layout, and object scale characteristics.

Figure 12. Statistical analysis of the UAVDT dataset: class distribution, spatial layout, and object scale characteristics.

Figure 13. Performance comparison between the baseline YOLOv12s and the proposed ECP-YOLO across various challenging UAV scenarios.

Figure 14. Heatmap visualization comparison. Top left: original image; top right: detection results of the proposed ECP-YOLO; bottom left: heatmap of the proposed ECP-YOLO; bottom right: heatmap of the baseline YOLOv12s.

Figure 15. Grayscale activation maps comparison between baseline YOLOv12s and proposed ECP-YOLO. (a) Original image; (b) Detection results; (c) P2 detection head activations; (d) P3 detection head activations. Row 1: proposed model; Row 2: baseline.

Figure 16. Cross-dataset inference results across daytime, nighttime, dense, and foggy conditions. Left to right: original image, VisDrone-trained model, UAVDT-trained model.

Table 1. Experimental environment.

Parameter	Setup
OS	Windows 11
CPU	AMD Ryzen 9 7945HX
GPU	RTX 3090 (24 GB)
Memory	DDR5 (32 GB)
Python	3.9.21
CUDA	11.8
PyTorch	2.3.1

Table 2. Results of ablation experiments (%).

Methods	P2H	EAFM	PConv	CRB	PISF	mAP@0.5	mAP@0.5:0.95	P	R	Params (M)	GFLOPs	FPS
YOLOv12s	—	—	—	—	—	31.8	18.6	45.1	33.4	9.23	21.2	180
A	✔	—	—	—	—	34.0	19.8	45.7	35.4	9.58	28.8	131
B	✔	✔	—	—	—	35.9	21.0	47.4	36.8	9.90	45.6	126
C	✔	—	✔	—	—	34.7	20.5	46.0	35.8	9.16	32.6	128
D	✔	—	—	✔	—	34.4	20.0	45.9	35.7	9.58	28.9	124
E	✔	—	—	—	✔	34.6	20.1	45.5	36.3	10.2	35.1	112
F	✔	✔	✔	—	—	36.8	21.7	48.6	37.7	9.48	49.4	121
G	✔	✔	✔	✔	—	37.3	21.9	48.8	38.3	9.48	49.5	112
H	✔	✔	✔	✔	✔	38.1	22.1	48.6	39.5	11.8	55.7	79

Table 3. Sub-component ablation within the EAFM on the VisDrone2019 dataset (%).

Configuration	Sobel	LGA	CBAM	SPP-Lite	mAP@0.5
Full EAFM	✔	✔	✔	✔	35.9
w/o Sobel	✗	✔	✔	✔	35.2
w/o LGA	✔	✗	✔	✔	34.8
w/o CBAM	✔	✔	✗	✔	35.4
w/o SPP-Lite	✔	✔	✔	✗	35.1
w/o EAFM	✗	✗	✗	✗	34.0

Table 4. Experimental results by category on the VisDrone2019 dataset (%).

Models	mAP@0.5	Pedestrian	People	Bicycle	Car	Van	Truck	Tricycle	Awn-Tri	Bus	Motor
YOLOv12s	31.8	27.4	13.9	9.22	71.8	37.3	39	16.7	18.5	56.2	28.5
Ours	38.1	36.8	24	14.5	78.3	43.3	42.2	22.5	21.8	59.9	37.4
Improvement	6.3	9.4	10.1	5.28	6.5	6.0	3.2	5.8	3.3	3.7	8.9

Table 5. Comparison with different models on the VisDrone2019 dataset (%).

Models	mAP@0.5	mAP@0.5:0.95	P	R	Params (M)	GFLOPs
SSD [28]	24.1	10.7	21.3	35.4	13.3	22.8
YOLOv8s	32.0	18.1	43.5	34.3	11.2	28.7
YOLOv8m	35.1	20.2	47.5	36.9	25.9	79.3
YOLOv10n	33.9	19.3	45.0	34.3	2.69	8.2
YOLOv10s	31.6	18.0	43.5	33.6	8.0	24.8
YOLOv10m	33.9	19.5	46.5	36.0	16.0	64
YOLOv11s	32.3	18.2	44.4	34.5	9.4	23.5
RT-DETR-R18	36.2	20.7	41.4	32.4	20.1	57
DCRFF [29]	35.0	23.4	—	—	—	—
MF-YOLO	34.8	21.1	—	—	9.0	—
RetinaNet	28.7	13.1	—	—	19.8	—
MFA-YOLO	36.0	20.7	—	—	7.5	—
BPD-YOLO	35.5	20.8	—	—	3.9	18.2
YOLO-MARS	40.9	23.4	—	—	—	—
Ours	38.1	22.1	48.6	39.5	11.8	55.7

Table 6. Comparison across different scenarios on the VisDrone2019 dataset (%).

Scenarios	Model	P	R	mAP@0.5	Pedestrian	People	Bicycle	Car	Van	Truck	Tricycle	Awn-Tri	Bus	Motor
Daylight	YOLOv12	47.7	35.5	35.8	39.3	23.3	7.11	77.1	46.9	51.1	16.2	18.6	46.4	31.7
Daylight	Ours	49.7	41.4	42	50.7	37.7	11.7	82.6	53.7	54	21.7	20.9	47	39.8
Night	YOLOv12	37.4	36	31.4	14	7.37	1.7	57.7	27.3	36.5	14.2	99.5	44.1	12
Night	Ours	38.8	41	37.6	27.7	11.9	1.9	68.7	37.7	39.7	18.1	99.5	50.3	24.8
Dense	YOLOv12	45.8	35.3	34.3	33.5	17.6	9.98	76.8	44	28.9	17.1	20.3	63.4	31
Dense	Ours	50.5	41.8	41.3	43.5	30.9	14.5	83.6	51.8	32.8	24.3	23.6	66.7	41.5
Blur	YOLOv12	33.1	16.2	15.8	10.3	9.08	1.68	51.1	21.6	16	4.59	7.02	25.1	11.2
Blur	Ours	38.1	15.4	16.8	12.2	12.7	2.23	54.1	24.8	16.3	4.36	7.67	22.3	11.7
Occlusion	YOLOv12	49	41.8	40.3	43.6	31.6	13.1	82	42.2	54.4	25.2	21.8	61.5	27.4
Occlusion	Ours	51.9	49	48	56.5	44.7	19.4	87.3	48.1	59.8	34.6	22.7	69.2	37.3

Table 7. Comparison with the baseline model on the UAVDT dataset (%).

Models	P	R	mAP@0.5	Car	Truck	Bus	Params (M)	GFLOPs
YOLOv12s	27.0	38.5	28.7	67.2	2.87	16.0	9.23	21.2
Ours	34.1	34.3	30.4	67.1	4.21	20.1	11.8	55.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
