CSSA-YOLO: A Clutter-Suppressed and Scale-Aware Framework for Robust Object Detection in UAV Imagery

Yang, Xiao; Wang, Yongjia; Wang, Yong; Li, Wangyuan; Liu, Beiyuan; Liu, Ganchao

doi:10.3390/rs18101533


by Xiao Yang ¹,², Yongjia Wang ¹, Yong Wang ¹, Wangyuan Li ¹, Beiyuan Liu ¹ and Ganchao Liu ³,*

¹ School of Cybersecurity, Northwestern Polytechnical University, Xi’an 710072, China
² Yangtze River Delta Research Institute of NPU, Taicang 215400, China
³ School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1533; https://doi.org/10.3390/rs18101533

Submission received: 3 April 2026 / Revised: 4 May 2026 / Accepted: 9 May 2026 / Published: 12 May 2026

(This article belongs to the Special Issue Object Detection in Remote Sensing Imagery)


Highlights

What are the main findings?

We propose the Semantic Bottleneck Module (SBM) to filter severe background clutter by establishing a lightweight low-rank semantic bottleneck.
Based on mathematical analysis, we formulate the Scale-Aware Complete-IoU (SA-CIoU) loss to mitigate the gradient attenuation bottleneck for small objects.

What are the implications of the main findings?

Synergizing these two innovative mechanisms, we construct the CSSA-YOLO framework, which successfully decouples target priors from background clutter and enhances the precise localization of small objects.
CSSA-YOLO establishes a detection paradigm with strong robustness and generalizability, demonstrating superior performance in extensive experiments on the VisDrone2019 dataset and additional benchmarks.

Abstract

The widespread deployment of unmanned aerial vehicles (UAVs) in remote sensing has highlighted the necessity for robust object detection methods in UAV imagery. However, high-altitude UAV imagery suffers from severe background clutter that obscures target discriminability and extreme scale variations that degrade fine-grained features. To address these challenges, we propose CSSA-YOLO, a clutter-suppressed and scale-aware detection framework built upon YOLOv9. Specifically, we project dense spatial features into a low-rank token space via a Semantic Bottleneck Module (SBM). This projection acts as an information bottleneck, suppressing the background clutter while robustly retaining critical target semantic and structural priors. Furthermore, we develop a Scale-Aware Complete-IoU (SA-CIoU) loss to tackle gradient attenuation for small objects. By analytically integrating a scale-aware modulation factor with a dynamic alignment mechanism into localization optimization, SA-CIoU shifts the optimization priority to the precise localization of small and hard-to-detect instances. Extensive experiments on the VisDrone2019 benchmark demonstrate the superiority of our approach, with CSSA-YOLO achieving an mAP@0.5 of 46.0% and an mAP@0.5:0.95 of 28.4%, yielding an absolute 1.4% improvement over the YOLOv9 baseline. Furthermore, when integrated with a P2-enhanced YOLOv9 architecture, our method achieves a remarkable mAP@0.5 of 49.5%. Notably, evaluations across diverse scenarios, including the infrared (IR) thermal HIT-UAV benchmark and PCB defect detection datasets, further demonstrate the generalizability and robustness of our framework.

Keywords:

UAV object detection; small object detection; clutter suppression; scale-aware modulation; low-rank representation; YOLOv9

1. Introduction

In recent years, unmanned aerial vehicles (UAVs) have been extensively deployed in diverse fields, ranging from urban surveillance and traffic monitoring to precision agriculture and disaster rescue [1,2]. As the core perception task for UAVs, aerial object detection has attracted considerable research attention. However, compared to ground-level observations, UAV-based remote sensing imagery poses significant challenges for visual object detection. As illustrated in Figure 1, owing to high flight altitude, the visual features of aerial targets undergo severe degradation. Specifically, long-distance overhead imaging causes instances, such as pedestrians, to exhibit small pixel footprints and drastic scale variations [3]. Meanwhile, the wide top-down perspective submerges these targets within cluttered backgrounds [4].

Early work on aerial object detection was predominantly built on traditional computer vision techniques, combining handcrafted feature extractors such as Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) with sliding-window detection paradigms [5,6]. While these conventional approaches established foundational baselines for aerial detection, they suffered from limited representational capacity. Confronted with the intrinsic challenges in aerial imaging scenarios, including drastic illumination variations and unconstrained imaging viewpoints, such handcrafted approaches struggle to maintain stable detection performance and generalizability. To overcome these limitations, deep learning approaches such as Convolutional Neural Networks (CNNs) have emerged as the mainstream paradigm for aerial object detection [7,8]. Despite achieving notable performance, these approaches remain constrained by two primary bottlenecks:

Severe Background Clutter. UAV remote sensing images typically encompass wide-area geographical contexts with diverse background clutter. Within such clutter, specific distractors, including terrain textures and urban infrastructure, exhibit morphological characteristics highly similar to those of targets, severely interfering with the feature extraction process.
Extreme Scale Variations and Tiny Objects. Due to the high flight altitudes of UAVs, a large proportion of targets, such as pedestrians and vehicles, occupy a negligible fraction of the image area. Furthermore, substantial inter-class and intra-class scale variations exist. These factors result in the loss of fine-grained spatial features and gradient attenuation during network backpropagation.

To mitigate background clutter interference in UAV imagery, prior works predominantly rely on attention mechanisms and context modeling. One main research line focuses on spatial localization and feature decoupling to suppress structural background clutter. Dong et al. [9] introduced a background separation strategy coupled with the Effective Localization Attention, while Wang et al. [10] designed a Long-Focus Attention Module to enhance discriminative target regions. Another research line focuses on filtering out redundant contextual information via cross-level and multi-dimensional feature interactions. To this end, Wang et al. [11] applied a sparse cross self-attention mechanism within a Shallow-Deep Information Sparse Aggregation Module to filter out redundant background information. Similarly, Chen et al. [12] proposed an Attention-Enhanced Feature Module that integrates channel and semantic priors to strengthen target representations. Beyond attention mechanisms, advanced representation learning paradigms from other remote sensing domains, such as Perception–Retrieval–Localization pipelines [13] and Tri-State Prototype self-distillation [14], have also provided valuable insights for complex background modeling.

For the detection of tiny and multi-scale objects, existing methods broadly fall into two paradigms: multi-scale feature fusion and optimization metric redesign. The former is predominantly employed to preserve the fine-grained spatial details of small targets. For instance, Qu et al. [15] leveraged a Multi-Cross-Scale Feature Pyramid Network and a Parameter-Free Simple Slicing Convolution Module to retain high-frequency spatial cues. Hou et al. [16] designed a Multi-Branch Feature Fusion Model to reduce the feature information loss of small targets. Meanwhile, the latter paradigm optimizes the detection framework via loss function improvements and label assignment strategy refinement. Recognizing that standard center sampling strategies are biased toward large objects, Xu et al. [17] proposed a Receptive Field Distance metric paired with a Hierarchical Label Assignment module to achieve balanced learning for tiny objects. In a similar vein, Zhao et al. [18] utilized the Normalized Wasserstein Distance (NWD) [19] loss to optimize the bounding box regression, thereby alleviating the geometric instability of IoU metrics for small targets. Furthermore, to explicitly address inherent target sparsity and dense distributions, progressive approaching strategies such as Glance-Focus-Gaze [20] and COPO [21], as well as Triple-Level Sparsity Awareness [22], have been successfully explored in related aerial perception tasks.

However, despite these remarkable advances, current methodologies remain fundamentally bottlenecked by two inherent limitations. First, existing clutter suppression strategies often rely on unconstrained global interactions. This inadvertently captures spurious correlations between targets and distractors, leaving target representations entangled with background clutter across high-dimensional feature spaces. Second, current structural and metric-level optimizations for small objects are fundamentally constrained by a scale-invariant regression paradigm. Consequently, during backpropagation, the weak gradient signals of tiny targets are frequently overshadowed by those of larger instances, inevitably driving the network into severe gradient starvation.

To address these aforementioned limitations, we propose CSSA-YOLO, a clutter-suppressed and scale-aware framework for robust UAV object detection built upon the YOLOv9 [23] architecture. Specifically, to tackle the interference of background clutter, we design the Semantic Bottleneck Module (SBM). By mapping high-resolution spatial features into a low-rank token space, SBM acts as a lightweight filtering bottleneck that suppresses background clutter, while concurrently preserving and propagating the critical structural and semantic priors of aerial targets throughout the network. Furthermore, to mitigate the gradient attenuation of small aerial objects, we propose the Scale-Aware Complete-IoU (SA-CIoU) loss. This mechanism integrates a scale-aware modulation factor based on absolute pixel area to up-weight the gradients for tiny objects, coupled with a dynamic alignment mechanism that adaptively calibrates the regression penalties, prioritizing the precise localization of small and hard-to-detect instances. Extensive experiments on the VisDrone2019 [24] dataset demonstrate that CSSA-YOLO outperforms existing representative methods. To further validate the generalization and robustness of our approach, we also extend the evaluation to infrared (IR) thermal UAV-based detection and industrial printed circuit board (PCB) defect detection, which share similar challenges such as extremely small object sizes and severe background clutter, thereby rigorously testing the adaptability of CSSA-YOLO.

In summary, our main contributions are as follows:

We propose the Semantic Bottleneck Module (SBM) to tackle severe background clutter interference in UAV imagery. By establishing a low-rank projection that preserves structural and semantic priors, SBM effectively filters out the clutter and highlights target morphologies, thereby suppressing feature aliasing and enhancing target discriminability within the feature space.
We formulate the Scale-Aware Complete-IoU (SA-CIoU) loss to address scale variations and enhance tiny object localization for aerial targets. By mathematically coupling a scale-aware modulation factor with a dynamic alignment mechanism, SA-CIoU redirects the optimization guidance to prioritize tiny targets, alleviating the regression bias dominated by large instances.
By combining these innovative mechanisms, we construct CSSA-YOLO, a robust UAV imagery object detector. Extensive experiments on the VisDrone2019 [24] dataset and other benchmarks with similar visual characteristics demonstrate that our framework outperforms current mainstream detectors, providing an effective and robust solution for UAV perception.

The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 details the proposed methodology, including the SBM architecture and the SA-CIoU optimization mechanism. Section 4 presents the experimental results and discussions. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Lightweight Attention Modules

As analyzed in the introduction, the integration of attention mechanisms has become a mainstream approach in UAV object detection to suppress background clutter. However, standard self-attention mechanisms often introduce heavy computational overhead due to their quadratic computational complexity, making them unsuitable for UAV platforms. Therefore, to avoid imposing an excessive burden on the network, we primarily focus on lightweight attention modules. Early pioneering works predominantly focused on recalibrating feature maps via global context aggregation with channel or spatial descriptors. Convolutional Block Attention Module (CBAM) [25] sequentially infers attention maps along the channel and spatial dimensions. Specifically, it employs both max and average pooling to aggregate spatial and channel information into feature representations. These are then processed by a multi-layer perceptron (MLP) for channel modeling and a convolutional layer for spatial modeling, respectively, to yield multiplicative attention masks. To further reduce complexity, Efficient Channel Attention (ECA) [26] captures local cross-channel interactions by performing global average pooling followed by a 1D convolution, where the kernel size is adaptively determined by the channel dimension. While the computational overhead of such attention modules is negligible, they rely on aggressive pooling-based compression operations. This coarse-grained feature aggregation inevitably erodes the fine-grained spatial structure of targets, leading to a loss of crucial structural priors. To bypass the limitations of explicit pooling branches, the Simple, Parameter-Free Attention Module (SimAM) [27] attempts to estimate 3D attention weights by defining and minimizing a specific energy function. This function evaluates the linear separability between a target neuron and other neurons within the same channel, thereby effectively identifying its importance based on both spatial and channel cues simultaneously. However, since this approach is grounded in heuristic energy-based optimization, it lacks the high-level semantic mapping capability required to model complex dependencies in cluttered aerial environments.

More recently, research has shifted towards linearizing or sparsifying self-attention to capture long-range dependencies with manageable complexity. Agent Attention [28] introduces a set of agent tokens generated through spatial pooling of the feature maps. Within its two-stage attention pipeline, these tokens first act as global collectors to aggregate contextual information from the keys and values. Subsequently, they function as feature distributors to broadcast the global information back to the queries, effectively reducing the computational complexity to a linear scale. Binary Attention [29] explores the limit of low-bit quantization by binarizing queries and keys. By retaining only the signs of these tensors, standard floating-point dot products are replaced with high-speed bit-wise operations, thereby significantly reducing the computational overhead. To compensate for the inherent information loss during the process, Binary Attention incorporates learnable biases and employs self-distillation techniques to align the binary similarity with its full-precision counterpart. While these approaches enhance efficiency, they suffer from unconstrained token interactions, which lead to feature aliasing in UAV imagery. Consequently, target signals are susceptible to being confused with structurally similar background clutter.

2.2. Bounding Box Regression Loss Functions

Accurate target localization is critical for UAV object detection, with its performance relying fundamentally on the design of bounding box regression loss functions. IoU is the standard metric for evaluating the spatial overlap between a predicted bounding box and its corresponding ground-truth box, defined as the ratio of the area of their intersection to the area of their union. However, when formulated as a loss function, IoU suffers from a severe gradient vanishing limitation if there is no overlap between the bounding boxes. To address this non-overlapping limitation, Zheng et al. [30] proposed Distance-IoU (DIoU), which incorporates the normalized distance between the central points of the predicted and ground-truth boxes, ensuring continuous gradient propagation and effectively accelerating convergence. They further introduced CIoU [30] to refine the geometric alignment by incorporating aspect ratio consistency. While these loss functions perform well in general scenarios, they are highly sensitive to minor positional deviations when applied to the tiny targets typically found in cluttered UAV imagery, often leading to suboptimal regression.

To overcome the limitations of standard overlap loss functions and handle the extreme scale variations inherent to UAV scenarios, recent research has explored more adaptive regression paradigms. On top of the CIoU loss, Inner-CIoU [31] is proposed to accelerate convergence and improve localization for high-IoU samples via a dynamic scale ratio factor. Specifically, this method generates paired auxiliary inner boxes by scaling down the original predicted and ground-truth boxes. It then replaces the standard IoU term in the CIoU with the Inner-IoU calculated from these inner boxes to formulate the loss function. Meanwhile, Wise-IoU v3 (WIoUv3) [32] introduces a dynamic focusing mechanism that evaluates the outlier degree of predicted bounding boxes based on their real-time regression status. By strategically allocating gradient gains, WIoUv3 mitigates the severe disruption caused by hard-to-detect training samples, thereby ensuring a more robust optimization process. While these enhancement strategies alleviate the challenges of precise localization for small targets, they remain fundamentally constrained by the scale-invariant optimization paradigm. In contrast, NWD models bounding boxes as 2D Gaussian distributions. By measuring the distribution similarity via the Wasserstein distance, NWD smoothly reflects the spatial correlation between boxes even when they are extremely small. However, when the target bounding box contains severe background clutter in UAV imagery, the 2D Gaussian modeling of NWD inevitably incorporates this interference, skewing the distribution similarity metric. Consequently, the regression gradient weights for these affected targets are severely attenuated.

3. Methods

3.1. Overview of CSSA-YOLO

The overall architecture of the proposed CSSA-YOLO framework is illustrated in Figure 2. It is constructed upon the YOLOv9 baseline, integrating two targeted innovations to tackle the challenges of UAV imagery.

Firstly, YOLOv9 serves as an advanced architecture that addresses the information bottleneck problem in deep object detection networks. To achieve this, YOLOv9 introduces two innovative components. The Generalized Efficient Layer Aggregation Network (GELAN) optimizes parameter utilization through an efficient network topology, ensuring robust feature extraction without incurring excessive computational overhead. In addition, Programmable Gradient Information (PGI) refines the training process by incorporating auxiliary reversible branches. This mechanism ensures that rich target information is preserved throughout the forward pass, thereby generating reliable gradient signals during backpropagation. Owing to its robust gradient preservation capability, YOLOv9 is highly suited to the intrinsic detection challenges of UAV aerial imagery. During the optimization process, YOLOv9 adopts a decoupled detection head, where the classification branch is supervised by the Binary Cross-Entropy (BCE) loss, while the flexible regression branch is jointly supervised by the Distribution Focal Loss (DFL) and our SA-CIoU loss.

Secondly, to address the severe background clutter inherent in UAV images, we design the SBM. Unlike traditional unconstrained global attention mechanisms that may forge erroneous correlations with structural distractors, SBM maps the spatial features into a low-rank token space. Serving as a lightweight filtering bottleneck, it suppresses background clutter while preserving the critical semantic and structural priors of aerial targets, thereby enhancing target discriminability within the feature space.

Thirdly, to mitigate the gradient attenuation and localization bottlenecks of small aerial objects, we propose the SA-CIoU loss. By dynamically adjusting the optimization focus during training, SA-CIoU prevents the weak gradient signals of small targets from being overwhelmed by larger instances. Specifically, it mathematically couples a scale-aware modulation factor based on absolute pixel area with a dynamic alignment mechanism, effectively prioritizing the precise regression of small and hard-to-detect objects.

3.2. Semantic Bottleneck Module

Standard local convolutions are inherently limited in capturing long-range dependencies and suffer from local feature redundancy, rendering them suboptimal for distinguishing aerial targets from complex backgrounds. Although vision transformers (ViTs) [33] and standard self-attention mechanisms establish global receptive fields to mitigate these constraints, their unconstrained token-to-token interactions often lead to global information redundancy and amplify background interference in complex expansive UAV imagery. In such high-altitude scenarios, unrestricted spatial similarities increase the model’s susceptibility to spurious correlations between actual targets and structurally similar distractors. To mitigate these spurious attention interactions and bias the attention mechanism towards discriminative target semantic features, we propose the SBM, the architecture of which is illustrated in Figure 3.

Let $X \in \mathbb{R}^{H \times W \times D}$ denote the original spatial feature map fed into the SBM, where $H$, $W$, and $D$ represent the spatial height, width, and embedding dimension, respectively. We first apply Batch Normalization (BN) to $X$, yielding the normalized representation $\hat{X} \in \mathbb{R}^{H \times W \times D}$. Let $N = H \times W$ represent the spatial sequence length. To capture the localized geometric variations of aerial targets, we construct a multi-scale feature representation $F \in \mathbb{R}^{H \times W \times D}$. We apply a dynamic weighted fusion across $K$ parallel convolutional branches. To encompass diverse receptive fields without incurring excessive computational overhead, we set $K = 3$ and instantiate these branches using a $1 \times 1$ convolution, a $3 \times 3$ Depthwise Convolution, and a $5 \times 5$ Depthwise Convolution. The multi-scale fusion is formulated as:

$$F = \sum_{i=1}^{K} \sigma(\omega_{i}) \, \Phi_{i}(\hat{X}) \tag{1}$$

where $\Phi_{i}(\cdot)$ denotes the $i$-th convolutional mapping, $\omega_{i}$ represents learnable scale parameters, and $\sigma(\cdot)$ is the Sigmoid activation function.

To perform global contextual optimization while preventing unconstrained clutter interactions, we introduce a decoupled two-stage attention mechanism mediated by a low-rank semantic bottleneck. Specifically, we introduce a set of learnable mediator tokens $M \in \mathbb{R}^{m \times D}$, where $m$ is the number of mediator tokens ($m \ll N$). During the multi-head attention computation, $M$ is evenly partitioned across $N_{h}$ parallel heads, yielding $M_{h} \in \mathbb{R}^{m \times d_{k}}$ for each head, where $d_{k} = D / N_{h}$. Concurrently, for each head, the features $\hat{X}$ are linearly projected to derive the spatial queries $Q_{X}$, keys $K_{X}$, and values $V_{X} \in \mathbb{R}^{N \times d_{k}}$. Furthermore, to alleviate the lack of spatial inductive bias intrinsic to standard self-attention, we dynamically generate a Contextual Position Encoding (CoPE) by applying a Depthwise Convolutional mapping $\Psi(\cdot)$ to $F$. This CoPE is then injected into $Q_{X}$ and $K_{X}$:

$$\tilde{Q}_{X} = Q_{X} + \Psi(F) \tag{2}$$

$$\tilde{K}_{X} = K_{X} + \Psi(F) \tag{3}$$

In the first attention stage, the mediator tokens act as bottleneck queries to aggregate global semantic information from $\tilde{K}_{X}$ and $V_{X}$, effectively compressing the dense spatial representation into a low-rank semantic prior $V_{sem} \in \mathbb{R}^{m \times d_{k}}$:

$$V_{sem} = \mathrm{Softmax}\left(\frac{M_{h} \tilde{K}_{X}^{\top}}{\sqrt{d_{k}}}\right) V_{X} \tag{4}$$

In the second stage, $\tilde{Q}_{X}$ retrieves information from this low-rank global representation to broadcast the purified contextual features back to the original spatial resolution, yielding the aggregated representation $Z_{h} \in \mathbb{R}^{N \times d_{k}}$:

$$Z_{h} = \mathrm{Softmax}\left(\frac{\tilde{Q}_{X} M_{h}^{\top}}{\sqrt{d_{k}}}\right) V_{sem} \tag{5}$$

The outputs from all $N_{h}$ parallel heads are subsequently concatenated along the channel dimension to form the global spatial representation $Z \in \mathbb{R}^{N \times D}$. In wide-range UAV imagery, the spatial background contains complex and unordered clutter, rendering its feature representation highly redundant. In contrast, most aerial targets occupy only a small number of pixels and are highly structured, allowing their core semantics to be mapped into a low-rank latent subspace. Due to the limited number of mediator tokens, the process decouples the spatial associations between actual targets and morphologically similar distractors. This encourages the network to suppress irrelevant clutter and focus on distilling the essential semantic features of aerial targets. While Agent Attention constructs a similar bottleneck structure via spatial pooling, it remains inherently unconstrained. When agent tokens serve as mediators to broadcast information globally, they inevitably propagate background cluttered features across the entire image. In contrast, SBM employs data-agnostic and learnable mediator tokens. Acting as optimized semantic probes, these tokens automatically assign negligible weights to background regions, actively and selectively seeking discriminative target features. This shift achieves robust decoupling of aerial targets from complex background clutter.
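The two attention stages of Eqs. (4) and (5) can be sketched for a single head in plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the helper names (`matmul`, `softmax_rows`, `mediator_attention`) are hypothetical, the mediator tokens are random rather than learned, the queries and keys are assumed to already contain the CoPE term, and the sizes are far smaller than those used in the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scale(a, s):
    return [[value * s for value in row] for row in a]


def softmax_rows(a):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def mediator_attention(q, k, v, mediators):
    """Single-head two-stage attention following Eqs. (4) and (5).

    q, k, v: N x d_k lists (q and k are assumed to already include CoPE).
    mediators: m x d_k list standing in for the mediator tokens M_h.
    """
    inv = 1.0 / math.sqrt(len(q[0]))
    # Stage 1 (Eq. 4): mediators query the keys, compressing N tokens into m.
    v_sem = matmul(softmax_rows(scale(matmul(mediators, transpose(k)), inv)), v)
    # Stage 2 (Eq. 5): spatial queries read back from the m mediator slots.
    z = matmul(softmax_rows(scale(matmul(q, transpose(mediators)), inv)), v_sem)
    return v_sem, z


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, m, d_k = 16, 4, 8  # toy sizes chosen for illustration only
q, k, v = random_matrix(N, d_k), random_matrix(N, d_k), random_matrix(N, d_k)
med = random_matrix(m, d_k)
v_sem, z = mediator_attention(q, k, v, med)
print(len(v_sem), len(v_sem[0]))  # shape of V_sem: m x d_k
print(len(z), len(z[0]))          # shape of Z_h: N x d_k
```

Because both stages are row-stochastic mixtures, every output token in this sketch is a convex combination of the `m` compressed slots, so no direct token-to-token score matrix of size N x N is ever formed.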

Finally, while the bottleneck extracts high-level semantic priors, precise localization of tiny targets requires fine-grained structural cues. Therefore, we introduce a nonlinear spatial gating mechanism to leverage the local structural priors embedded in $F$ to further recalibrate the globally aggregated features $Z$. Specifically, we employ a point-wise linear projection followed by the Sigmoid-Weighted Linear Unit (SiLU) to generate a dynamic saliency map, which acts as a spatial filter to selectively activate target-related regions. The final purified output $Y$ is formulated as a gated residual connection:

$$Y = X + \beta \cdot \left(\mathrm{SiLU}(F W_{g}) \odot Z_{proj}\right) \tag{6}$$

where $W_{g} \in \mathbb{R}^{D \times D}$ denotes the learnable weight matrix for the spatial gate, $Z_{proj} \in \mathbb{R}^{H \times W \times D}$ is the linearly projected output of $Z$, $\odot$ indicates the element-wise Hadamard product, and $\beta$ is a learnable scaling parameter.

The overall computational complexity of the SBM encompasses linear projections, multi-scale feature extraction, and the two-stage attention. The depthwise convolutions in the multi-scale fusion and CoPE require $O(N D k^{2})$, where $k$ is the kernel size. The linear projections for QKV generation, the $1 \times 1$ convolution, the spatial gating projection using $W_{g}$, and the final output projection consume $O(N D^{2})$. The two-stage attention mechanism requires $O(N m d_{k})$ per head per stage, totaling $O(N m D)$ for all heads. Consequently, the total computational complexity of SBM is $O(N D^{2} + N m D + N D k^{2})$. Since $k^{2} \ll D$, the overall complexity simplifies to $O(N D^{2} + N m D)$, which is lower than the $O(N^{2} D)$ complexity of standard self-attention.

In summary, the SBM forces the network to project spatial features into a low-rank space, effectively filtering out complex background clutter and distilling robust global structural and semantic priors. To analyze the optimal integration of the SBM within the YOLOv9 architecture, we construct three distinct variants by deploying the module individually at the P3, P4, and P5 layers of the Feature Pyramid Network (FPN) [34], as illustrated in Figure 4. Empirical results indicate that it is optimal to anchor the SBM at the high-resolution P3 layer, which is typically characterized by the most severe background clutter. For a UAV input image of $640 \times 640$ pixels, the spatial dimensions of the P3 feature map are $80 \times 80$ (with a downsampling stride of 8), yielding $N = 6400$ dense spatial tokens. By configuring the SBM with $m = 128$ mediator tokens, the module establishes an extreme spatial compression ratio of 50:1, drastically compressing the dense spatial representation to merely $2\%$ of its original sequence volume.
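The token counts quoted above, and a rough sense of what the bottleneck saves, can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope estimate rather than a measurement: the function names are hypothetical, the channel width `d=256` is an assumed value that the text does not state, and only the score and value products are counted (the shared linear projections are left out).

```python
def sbm_token_stats(image_size=640, stride=8, m=128):
    """Return P3 side length, token count N, compression ratio N/m, and m/N."""
    side = image_size // stride
    n = side * side
    return side, n, n / m, m / n


def attention_mults(n, d, m):
    """Rough multiply counts for the attention products only (projections excluded)."""
    full = 2 * n * n * d      # Q K^T and (scores x V) in standard self-attention
    mediated = 4 * n * m * d  # two stages, each with one score and one value product
    return full, mediated


side, n, ratio, fraction = sbm_token_stats()
full, mediated = attention_mults(n, d=256, m=128)  # d=256 is an assumed width
print(side, n, ratio, fraction)
print(full / mediated)  # equals N / (2m), independent of d
```

Under this counting the saving depends only on N and m, so the assumed channel width does not affect the ratio.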

3.3. Scale-Aware Complete-IoU Loss

IoU is a fundamental metric for evaluating bounding box localization performance in object detection, defined as the ratio of the intersection area to the union area between the predicted and ground-truth boxes. Let $b$ and $b_{gt}$ denote the predicted and ground-truth bounding boxes, respectively. The standard IoU is mathematically formulated as follows:

$$\mathrm{IoU} = \frac{|b \cap b_{gt}|}{|b \cup b_{gt}|} \tag{7}$$

where $|\cdot|$ denotes the spatial area of the corresponding region.

While IoU effectively quantifies spatial overlap, it suffers from severe gradient vanishing in non-overlapping scenarios. To alleviate this, CIoU incorporates central spatial distance and aspect ratio consistency into the optimization objective. As illustrated in Figure 5, let $d$ represent the Euclidean distance between the center coordinates of $b$ and $b_{gt}$, and $c$ denote the diagonal length of the smallest enclosing bounding box. Formally, the CIoU loss is given by:

$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{d^{2}}{c^{2}} + \alpha v \tag{8}$$

where $v$ measures the consistency of the aspect ratio, with $(w, h)$ and $(w_{gt}, h_{gt})$ denoting the width and height of the predicted and ground-truth boxes, respectively:

$$v = \frac{4}{\pi^{2}} \left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w}{h}\right)^{2} \tag{9}$$

and $\alpha$ functions as a trade-off parameter, defined as:

$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{10}$$
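Eqs. (7) to (10) translate directly into a short scalar function. The following is a minimal sketch for two axis-aligned boxes, assuming an `(x1, y1, x2, y2)` corner format and a small `eps` guard against division by zero; both are illustrative choices rather than details taken from the paper, and the function name `ciou_loss` is hypothetical.

```python
import math


def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    wg, hg = gx2 - gx1, gy2 - gy1

    # Eq. (7): intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Squared center distance d^2 and squared enclosing-box diagonal c^2.
    d2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2

    # Eqs. (9) and (10): aspect-ratio term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    # Eq. (8): overlap, center-distance, and aspect-ratio penalties combined.
    return 1 - iou + d2 / (c2 + eps) + alpha * v


# The same 2-pixel horizontal offset applied to a 4x4 box and a 40x40 box.
small = ciou_loss((0, 0, 4, 4), (2, 0, 6, 4))
large = ciou_loss((0, 0, 40, 40), (2, 0, 42, 40))
print(small, large)
```

In this toy comparison the identical pixel offset costs the small box far more than the large one, which is the sensitivity to minor positional deviations that motivates treating tiny targets separately.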

Although CIoU loss successfully encodes geometric structural constraints, it inherently treats all objects uniformly. In the context of object detection in UAV imagery, this property leads to gradient attenuation for tiny targets. To mathematically demonstrate this optimization bottleneck, we provide the following analysis:

Assumption 1. Under standard bounding box regression losses such as $L_{CIoU}$, the gradient optimization capacity for small objects may be bounded by their relative spatial area. In datasets with extreme scale variations, this spatial imbalance tends to force the parameter updates of small objects to be heavily dominated by large objects, resulting in a gradient attenuation bottleneck.

Remark 1

(Theoretical Justification). Let

A_{S}

and

A_{L}

denote the areas of the small and large objects in UAV imagery, respectively (

A_{S} ≪ A_{L}

). Under standard loss formulations such as

L_{CIoU}

, the bounding box regression gradients are inherently scale-invariant at the individual instance level (i.e.,

| \nabla_{\hat{t}} L_{S} | = | \nabla_{\hat{t}} L_{L} |

, see Appendix A). However, the number of spatial feature points available for optimizing these targets is proportional to their spatial area

A

. Consequently, the structural capacity ratio for gradient backpropagation is drastically skewed:

η = \frac{N_{S}}{N_{L}} \propto \frac{A_{S}}{A_{L}} ≪ 1

(11)

For example, in the scale-normalized $640 \times 640$ VisDrone2019 setting, many tiny objects occupy only a few tens of pixels, whereas large objects may span several thousand pixels. This severe optimization sample imbalance implies that the accumulated parameter updates for small objects are largely eclipsed by the overwhelming number of sample points from large objects. In this paper, we define this macroscopic diminution of the accumulated optimization signal, driven by sample deficiency rather than individual gradient magnitude degradation, as effective gradient attenuation. To counteract this, a re-weighting strategy should be employed to restore the optimization balance.

To counteract this imbalance, we propose a scale-aware modulation factor $ω_{s}$ to effectively re-balance the gradient contributions. Specifically, $ω_{s}$ dynamically modulates the loss value via a reciprocal logarithmic mapping of the target’s spatial area:

ω_{s} = \mathrm{clamp}\left(\frac{λ}{\ln(1 + |b_{gt}|)}, ω_{\min}, ω_{\max}\right)

(12)

where λ is a hyperparameter governing the scaling intensity, which is set to 5.0 based on the average object area to maintain the overall magnitude balance of the loss function. To prevent a negligible minority of extreme targets or corrupted samples from compromising numerical stability and triggering gradient explosion, $ω_{s}$ is clamped to the range $[0.5, 2.0]$, i.e., $ω_{\min} = 0.5$ and $ω_{\max} = 2.0$. Notably, we employ a reciprocal logarithmic function rather than an inverse proportional mapping. As illustrated in Figure 6, the reciprocal logarithmic mapping decays far more gradually than the inverse proportional mapping, whose weight drops sharply as the area grows. Moreover, large objects exhibit spatial feature redundancy, whereby their effective semantic information grows logarithmically with spatial area.
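The modulation factor of Equation (12) can be sketched in a few lines, using the stated settings λ = 5.0 and the clamping range [0.5, 2.0]. The function names, the handling of degenerate (non-positive) areas, and the reference area used for the inverse proportional comparison are assumptions made for illustration only.

```python
import math

def scale_weight(area, lam=5.0, w_min=0.5, w_max=2.0):
    """Reciprocal logarithmic weight of Eq. (12) for a ground-truth area in pixels."""
    if area <= 0:
        # Assumption: degenerate boxes simply receive the upper clamp value.
        return w_max
    raw = lam / math.log1p(area)  # lam / ln(1 + area)
    return min(max(raw, w_min), w_max)

def inverse_weight(area, ref_area=123.0):
    """Hypothetical inverse proportional mapping, shown only for comparison."""
    return ref_area / area

for area in (4, 16, 123, 1000, 10000, 30000):
    print(area, round(scale_weight(area), 3), round(inverse_weight(area), 4))
```

With these settings the upper clamp is active for areas up to about 11 pixels (where ln(1 + A) ≤ 2.5) and the lower clamp for areas above about 22,000 pixels (where ln(1 + A) ≥ 10), so most objects fall in the unclamped logarithmic regime.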

Assumption 2. For the internal homogeneous regions of targets in UAV imagery, the pixel distribution can be modeled as a wide-sense stationary (WSS) random field. Given that the spatial correlation length τ is comparable to the target’s characteristic dimension L, the effective semantic information, quantified by its joint entropy, scales logarithmically with its spatial area A, i.e., $H(A) \propto \ln(A)$.

Remark 2 (Theoretical Justification). For an image region with N pixels (where $N \propto A$), we treat each pixel as a discrete random variable $X_{i}$. The joint entropy $H(X_{1}, X_{2}, \dots, X_{N})$, which characterizes the total effective semantic information of the region, can be expanded as:

H (X_{1}, X_{2}, \dots, X_{N}) = H (X_{1}) + \sum_{i = 2}^{N} H (X_{i} | X_{1}, X_{2}, \dots, X_{i - 1})

(13)

where $H(X_{i} | X_{1}, X_{2}, \dots, X_{i-1})$ denotes the conditional entropy. In a hypothetical scenario of independent and identically distributed (i.i.d.) pixels, the conditional entropy equals the marginal entropy (i.e., $H(X_{i} | \dots) = H(X_{i})$). The joint entropy thus simplifies to $N \cdot H(X_{i})$, implying that the intrinsic semantic information grows strictly linearly with the spatial area (i.e., $H(A) \propto A$).

However, objects in UAV remote sensing imagery exhibit strong spatial auto-correlation, presenting WSS properties. Since the correlation length τ is comparable to the object’s characteristic dimension L, the internal feature field is heavily redundant. According to foundational studies in natural image statistics, particularly the seminal works by Field [35] and Ruderman [36], such correlated WSS fields inherently exhibit scale invariance, where their spatial power spectra closely follow a $1 / f^{α}$ distribution. In the information-theoretic domain, this long-range spatial dependency and immense internal redundancy imply that the conditional entropy of such fields decays asymptotically as a power-law function of the spatial neighborhood size. Consequently, the marginal information gain provided by a newly processed i-th pixel experiences a long-tail decay. To mathematically capture this behavior in a tractable closed form, we approximate the conditional entropy using a first-order power-law instance, namely the harmonic decay profile:

H (X_{i} | X_{1}, X_{2}, \dots, X_{i - 1}) = O (\frac{1}{i})

(14)

Substituting this harmonic decay into the joint entropy expansion yields:

H (X_{1}, X_{2}, \dots, X_{N}) \propto \sum_{i = 1}^{N} \frac{1}{i}

(15)

According to the approximation formula for the harmonic series, as N increases, this discrete sum can be approximated by:

\sum_{i = 1}^{N} \frac{1}{i} \approx ln (N) + γ

(16)

where γ ≈ 0.5772 is the Euler–Mascheroni constant. Since the number of pixels N is strictly proportional to the spatial area A, it follows that:

H (A) \propto ln (A) + γ \approx ln (A)

(17)
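The harmonic-series approximation in Equation (16) can be checked numerically. The short sketch below compares the partial sum with ln(N) + γ for a few values of N; the function name `harmonic` and the chosen values of N are illustrative assumptions, and the check concerns only the series approximation, not the entropy model itself.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def harmonic(n):
    """Partial sum of the harmonic series, sum_{i=1}^{n} 1/i."""
    return sum(1.0 / i for i in range(1, n + 1))

for n in (10, 100, 10_000):
    exact = harmonic(n)
    approx = math.log(n) + EULER_GAMMA
    # The gap shrinks roughly like 1 / (2n) as n grows.
    print(n, exact, approx, exact - approx)
```

Increasing N by a factor of ten adds only about ln(10) ≈ 2.3 to the sum, which is the sub-linear growth that the reciprocal logarithmic weighting is meant to mirror.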

This theoretical model suggests that the internal semantic information of large UAV targets exhibits a logarithmic growth bottleneck. Therefore, our reciprocal logarithmic modulation $ω_{s}$ is theoretically motivated, as it aligns the gradient optimization weights with the derived logarithmic growth of the target’s information gain.

While $ω_{s}$ effectively compensates for static spatial imbalance, it only imposes a fixed gradient multiplier based on the absolute area of the ground truth. This may amplify gradients for well-localized tiny objects in UAV imagery in the late training stages, while providing insufficient penalties for hard samples with low IoU. To achieve a robust optimization paradigm, we further introduce a dynamic alignment mechanism to construct the final SA-CIoU loss. This mechanism adaptively modulates the regression gradients based on the real-time IoU, forcing the network to prioritize poorly localized objects. The dynamic alignment mechanism, formulated as ${(1 - IoU)}^{2}$, is additively integrated into the bounding box regression objective. To analyze its optimization dynamics, let $x = IoU \in [0, 1)$ (where $x = 1$ is practically unattainable). The ratio of the loss with the dynamic alignment term to the standard IoU loss is given by:

f (x) = \frac{(1 - x) + {(1 - x)}^{2}}{1 - x} = 2 - x

(18)

Since $f(x) > 1$ for $x \in [0, 1)$, the introduced dynamic alignment mechanism increases the relative loss penalty with respect to the standard IoU loss, thereby placing greater optimization emphasis on poorly localized predictions. More importantly, the first-order derivative $\frac{d}{dx} f(x) = -1$ guarantees a strictly monotonic and linear decay of the penalty. As the predicted bounding box converges toward the ground truth (i.e., $x \to 1$), the additional gradient boost vanishes smoothly, thereby ensuring stability near the optimum.

By synergizing the geometric priors of CIoU with the scale-balancing capability of the scale-aware modulation factor and the IoU-aware optimization of the dynamic alignment mechanism, the comprehensive SA-CIoU loss is formalized as:

L_{SA - CIoU} = ω_{s} \cdot (1 - IoU + \frac{d^{2}}{c^{2}} + α v + {(1 - IoU)}^{2})

(19)

This unified formulation delivers stable optimization efficacy for targets across varying scales in UAV-based object detection tasks.
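To make the composition in Equation (19) concrete, the following sketch assembles the SA-CIoU value from precomputed geometric terms. It assumes that the IoU, the squared center distance `d2`, the squared enclosing diagonal `c2`, the aspect-ratio term `v`, and a positive ground-truth area `gt_area` are already available; the function name and argument names are hypothetical and do not reflect the authors' code.

```python
import math

def sa_ciou(iou, d2, c2, v, gt_area, lam=5.0, w_min=0.5, w_max=2.0):
    """SA-CIoU of Eq. (19): omega_s * (CIoU loss + (1 - IoU)^2)."""
    # Trade-off weight alpha of Eq. (10); zero when the aspect ratios already match.
    alpha = v / ((1.0 - iou) + v) if v > 0 else 0.0
    ciou = 1.0 - iou + d2 / c2 + alpha * v
    # Dynamic alignment term: largest for poorly localized predictions.
    dynamic = (1.0 - iou) ** 2
    # Scale-aware modulation factor of Eq. (12).
    omega = min(max(lam / math.log1p(gt_area), w_min), w_max)
    return omega * (ciou + dynamic)

# Same geometric error, different object sizes: the small object is weighted more.
small = sa_ciou(iou=0.5, d2=1.0, c2=100.0, v=0.0, gt_area=20)
large = sa_ciou(iou=0.5, d2=1.0, c2=100.0, v=0.0, gt_area=5000)
print(small, large)
```

For identical IoU and geometry the two calls differ only through ω_s, so the ratio of the two losses equals the ratio of the two modulation factors.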

4. Experiments and Results

4.1. Experimental Setup

4.1.1. Details

All experiments are implemented using PyTorch 1.13.0 with CUDA 11.7 on an Ubuntu 20.04 system, powered by an NVIDIA GeForce RTX 4090 GPU (24 GB VRAM). We adopt the medium-scale YOLOv9m as our baseline model, which is trained for 350 epochs with a batch size of 16 and an input resolution of $640 \times 640$. Other key hyperparameters are summarized in Table 1, with any unlisted settings strictly following the official default configurations.

4.1.2. Datasets

To verify the effectiveness of CSSA-YOLO, we conducted extensive experiments on the VisDrone2019 dataset, which is recognized as one of the most authoritative benchmarks for UAV-captured aerial object detection. This dataset is captured across diverse urban scenarios under varying weather and lighting conditions, and consists of 6471 training images, 548 validation images, and 1610 test images. The dataset provides high-quality manually annotated bounding boxes for 10 object categories, namely pedestrian, person (labeled "people" in the dataset), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. To provide deeper insights into the data characteristics, we visualize the class distributions and bounding box size proportions within the training set, as illustrated in Figure 7.

To further validate the generalizability and robustness of CSSA-YOLO, we extend our evaluation to the HIT-UAV dataset [37] and two representative industrial PCB defect datasets, namely DeepPCB [38] and HRIPCB [39]. These benchmarks share core challenges with conventional RGB-based UAV object detection scenarios, including severe background clutter, extreme scale variations, and tiny objects. The HIT-UAV dataset is dedicated to high-altitude IR thermal UAV imaging detection scenarios. It consists of 2029 training and 290 validation images. DeepPCB contains 1500 image pairs with a resolution of $640 \times 640$ pixels, covering six common defect types. HRIPCB consists of 693 images allocated for detection, featuring varying high resolutions such as $3034 \times 1586$ and $3056 \times 2464$ pixels, and covering six categories. In the absence of official partitions for both PCB datasets, the training and validation sets were split at a ratio of 8:2.

4.1.3. Metrics

To comprehensively evaluate the performance of our proposed model, we employ precision (P), recall (R), F1-score ($F_{1}$), mAP@0.5, and mAP@0.5:0.95 as the primary evaluation metrics. P measures the proportion of correctly predicted positive observations among all predicted positive observations, while R quantifies the proportion of actual positive samples that are correctly identified. They are formulated as follows:

P = \frac{TP}{TP + FP}

(20)

R = \frac{TP}{TP + FN}

(21)

where TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively. To balance the trade-off between P and R, $F_{1}$ computes their harmonic mean:

F_{1} = 2 \cdot \frac{P R}{P + R}

(22)

Average precision (AP) is defined as the area under the P-R curve for a specific class. Mean average precision (mAP) denotes the average value of AP calculated across all classes at a certain IoU threshold. Mathematically, they are formulated as:

{AP}_{i} = \int_{0}^{1} P_{i} (R) d R

(23)

mAP = \frac{1}{N_{cls}} \sum_{i = 1}^{N_{cls}} {AP}_{i}

(24)

where $P_{i}(R)$ represents the precision value at recall R for the i-th class, and $N_{cls}$ is the total number of object categories. Specifically, mAP@0.5 corresponds to the mAP calculated at a single IoU threshold of 0.5. In contrast, mAP@0.5:0.95 provides a more stringent assessment by computing the average mAP across 10 distinct IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, which comprehensively reflects the localization accuracy of the bounding boxes.
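The metrics in Equations (20)–(24) can be sketched as follows. The integration scheme for AP is not specified in the text, so this example assumes the common all-point interpolation (a monotone precision envelope followed by a rectangle sum); the function names and the toy inputs are illustrative and may differ from the evaluation code actually used.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (20)-(22) from raw counts; empty denominators yield 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def average_precision(recalls, precisions):
    """Area under the P-R curve (Eq. 23); recalls must be sorted ascending."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))

def mean_ap(ap_per_class):
    """Eq. (24): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# The ten IoU thresholds averaged by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

print(precision_recall_f1(tp=8, fp=2, fn=2))
print(average_precision([0.5, 1.0], [1.0, 0.5]))
print(IOU_THRESHOLDS)
```

To obtain mAP@0.5:0.95 under this sketch, `mean_ap` would be evaluated once per threshold in `IOU_THRESHOLDS` and the ten results averaged.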

4.2. Comparative Experiments

We conduct extensive comparisons with other advanced detectors to verify the effectiveness of CSSA-YOLO. Quantitative results on the VisDrone2019 dataset are summarized in Table 2. It is worth noting that for YOLOv8m, YOLOv9m, YOLOv10m, YOLO11m, and CSSA-YOLO, we uniformly utilized the official COCO pre-trained weights to initialize the models, and maintained identical training configurations and epochs across these comparative experiments.

CSSA-YOLO outperforms the compared representative models across these evaluation metrics. Our framework achieves an mAP@0.5 of 46.0% and an mAP@0.5:0.95 of 28.4%. Compared with the YOLO11m model, CSSA-YOLO exhibits a significant improvement of 3.1% in mAP@0.5 and 2.2% in mAP@0.5:0.95. Furthermore, our approach surpasses other specialized UAV detectors, such as SR-YOLO and DetectoRS with RFLA, by 1.8% and 1.0% in terms of mAP@0.5:0.95, respectively. These results validate the superior capability of CSSA-YOLO in handling the extreme scale variations and complex background clutter inherent in UAV imagery, particularly in high-precision object localization. Table 3 summarizes the category-specific P, R, and mAP@0.5 performance of CSSA-YOLO, with the corresponding P-R curves at an IoU threshold of 0.5 illustrated in Figure 8.

4.3. Ablation Studies and Analysis

4.3.1. Effectiveness of SBM and SA-CIoU

To further evaluate the effectiveness of the proposed SBM and SA-CIoU, we conducted ablation studies on the VisDrone2019 dataset, with the quantitative results summarized in Table 4. As shown in the table, both proposed components enhance the baseline performance. Specifically, the integration of the SBM increases P from 56.1% to 57.7% and boosts mAP@0.5:0.95 by 0.8%. This result indicates that the low-rank representation effectively suppresses background clutter interference, thereby reducing false positives in complex UAV aerial scenarios. Similarly, incorporating the SA-CIoU loss improves R by 0.6% and mAP@0.5:0.95 by 0.9%, demonstrating its capability to optimize bounding box regression for tiny and hard-to-detect targets.

Notably, the combination of both modules produces a synergistic effect in R. This synergy can be attributed to a complementary mechanism: the SBM first provides a clean and clutter-suppressed feature space with a high signal-to-noise ratio, which in turn enables the SA-CIoU loss to focus entirely on the precise geometric localization of tiny and challenging objects. Consequently, this joint optimization effectively tackles the core challenges in UAV imagery.

4.3.2. Detailed Analysis of the SBM

To empirically investigate the optimal architectural integration of the SBM, we performed a systematic ablation study on its insertion across different feature pyramid levels (P3, P4, and P5), with quantitative results summarized in Table 5.

As observed, anchoring the SBM at the high-resolution P3 layer yields the optimal performance. At this scale, the extreme compression ratio ($N / m = 50$) aggressively decouples target priors from severe background clutter. In contrast, deploying the SBM at the deep P5 layer is less effective. This is primarily attributed to the low spatial resolution and feature aliasing, which cause a partial loss of fine-grained spatial structural priors. Consequently, the CoPE and spatial gating mechanisms within the SBM lose their critical spatial anchors. However, it is noteworthy that while integrating the SBM at the P5 layer leads to a degradation in P, it still yields consistent gains in mAP@0.5 and mAP@0.5:0.95. This divergence suggests that although feature aliasing introduces additional false positives, the extensive global receptive field effectively reinforces the categorical semantics of targets and elevates the confidence scores of true positives.
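The compression ratio quoted above can be reproduced with simple arithmetic. The sketch below assumes the usual YOLO strides of 8, 16, and 32 for P3, P4, and P5, an input resolution of 640 × 640, and m = 128 latent tokens as selected in this section; the function name is hypothetical.

```python
def token_compression_ratio(input_size, stride, m):
    """Return (N, N / m), where N is the number of spatial tokens at a pyramid level."""
    side = input_size // stride
    n = side * side
    return n, n / m

# Assumed strides for the P3, P4, and P5 feature maps.
for level, stride in (("P3", 8), ("P4", 16), ("P5", 32)):
    n, ratio = token_compression_ratio(640, stride, 128)
    print(level, n, ratio)
```

Under these assumptions P3 has 80 × 80 = 6400 tokens, giving N / m = 50, whereas P5 has only 400 tokens, so the same bottleneck compresses far less at the deeper level.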

Furthermore, the selection of $m = 128$ aims to strike an optimal balance between background clutter suppression and target semantic preservation. To validate this design choice, we conducted an ablation study evaluating $m \in \{64, 128, 256, 512\}$, with the quantitative results summarized in Table 6. As shown in this table, P exhibits a steady increase from 55.7% to 56.8% as m decreases. This trend demonstrates that a smaller m enforces a stricter low-rank bottleneck, which effectively suppresses background clutter. Conversely, R improves as m increases from 64 to 256, demonstrating that an appropriately larger capacity helps preserve key semantic features. Notably, when m is set to 512, R paradoxically declines. This occurs because an overly wide bottleneck diminishes the low-rank filtering property, causing background clutter to contaminate the target features. Consequently, the optimal performance is achieved at both $m = 128$ and $m = 256$. Given that a larger m introduces higher computational overhead, $m = 128$ emerges as the most suitable choice.

We further comprehensively evaluate our proposed SBM against a spectrum of efficient and lightweight attention mechanisms, as detailed in Table 7. All evaluated modules are exclusively integrated into the high-resolution P3 layer of YOLOv9m equipped with SA-CIoU. Giga Floating-Point Operations (GFLOPs) quantify the computational complexity, serving as a key indicator of inference latency, while parameters govern the memory footprint required by the model’s learnable weights. Note that the GFLOPs denote the additional computational overhead relative to this baseline, while the parameters refer to the absolute count of the modules themselves. Since SA-CIoU is a loss function utilized exclusively during the training phase, it does not introduce any computational overhead during inference. Therefore, the absolute GFLOPs of this baseline are entirely identical to those of the standard YOLOv9m, which is 76.5.

As shown in the table, while classic lightweight attention modules such as CBAM and SimAM introduce negligible computational overhead, their performance improvements are limited. These non-self-attention mechanisms either compromise the fine-grained structural priors of small targets or lack robust semantic modeling capabilities due to their reliance on heuristic energy functions. In contrast, our SBM projects N spatial tokens into m latent semantic anchors, successfully preserving a low-dimensional representation that encapsulates core semantic information. Furthermore, although recent advanced attention mechanisms like Agent Attention and Binary Attention introduce relatively large computational overhead relative to this baseline (+3.1 and +8.9 GFLOPs, respectively), they yield only marginal performance gains. This fundamentally stems from unconstrained token interactions, which cause target signals to alias with structurally similar clutter. In contrast, by imposing a data-agnostic low-rank semantic bottleneck, the SBM explicitly suppresses such background clutter interference. In addition, to visually demonstrate the effectiveness of the SBM module, we visualized the feature difference heatmaps between the output and input of the SBM, as shown in Figure 9. For each row, the left side shows the annotated original image as a reference, and the right side highlights the specific active regions modified by the module.

4.3.3. Detailed Analysis of the SA-CIoU

We also benchmark SA-CIoU against mainstream advanced bounding box regression loss functions. The comparative study results are summarized in Table 8.

These results demonstrate that SA-CIoU achieves the highest scores on all metrics except P. The gain in R supports our analysis concerning the gradient attenuation bottleneck in standard loss formulations. By introducing the scale-aware modulation factor, SA-CIoU effectively re-balances gradient contributions, enabling the network to retrieve tiny targets that are typically omitted by scale-invariant optimization paradigms. Furthermore, the superior performance in mAP@0.5 and mAP@0.5:0.95 indicates that the dynamic alignment mechanism establishes a robust optimization paradigm by adaptively imposing stricter penalties on poorly localized predictions. Unlike WIoUv3, which suppresses low-IoU samples as outliers, our dynamic alignment mechanism prioritizes these hard-to-detect targets. In UAV imagery, even tiny positional deviations of small targets result in a severe collapse of IoU. Rather than being treated as outliers, such targets require more attention to ensure robust convergence. Although SA-CIoU exhibits a slight degradation of 0.3% in P compared to the CIoU baseline, this is an expected consequence: forcing the network to actively mine ambiguous feature representations of tiny and difficult targets inevitably increases false positives within cluttered background regions. In contrast, Wasserstein distance-based methods such as NWD model bounding boxes as Gaussian distributions. While achieving extremely high P, they struggle to detect targets submerged in background clutter in UAV imagery, which limits their R to a mere 42.4%.

Beyond the comparative study, a sensitivity analysis of the hyperparameter λ was conducted. When images are resized to a resolution of $640 \times 640$, the median object area within the training set is approximately 123 pixels; the median is used as the reference because it splits the samples into two equal halves. According to our formulation, setting $λ \approx 4.8$ yields a scale-aware modulation factor of approximately 1.0 for this area, thereby maintaining the loss magnitude at a level comparable to the original baseline. We therefore expect the optimal value to lie between 4.0 and 6.0. As shown in Table 9, we tested λ values of 4.0, 5.0, and 6.0. The results reveal that while $λ = 4.0$ provides an advantage in P, setting $λ = 5.0$ achieves the most balanced overall performance. More importantly, the performance exhibits only slight fluctuations across these different settings, demonstrating that SA-CIoU is robust to variations in λ within a certain range.
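The calibration argument above follows directly from Equation (12): the unclamped factor equals 1.0 when λ = ln(1 + A). The sketch below evaluates this for the reported median area of about 123 pixels and lists the resulting weight at that area for the three tested λ values; the function names are hypothetical.

```python
import math

def unit_weight_lambda(area):
    """Value of lambda for which lambda / ln(1 + area) equals 1."""
    return math.log1p(area)

def weight_at(area, lam):
    """Unclamped modulation factor of Eq. (12) at a given area."""
    return lam / math.log1p(area)

median_area = 123  # approximate median object area reported in the text
print(unit_weight_lambda(median_area))
for lam in (4.0, 5.0, 6.0):
    print(lam, weight_at(median_area, lam))
```

Since ln(124) is roughly 4.82, the tested values 4.0, 5.0, and 6.0 bracket the unit-weight setting, and the corresponding weights at the median area stay between about 0.83 and 1.24.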

4.3.4. Extended Qualitative and Ablation Results

To qualitatively evaluate the practical detection capability of our proposed method, we present a visual comparison between YOLOv9m and CSSA-YOLO, as illustrated in Figure 10. The baseline model frequently struggles with small objects and background clutter, leading to false positives and missed detections, while CSSA-YOLO exhibits robust detection capabilities. Furthermore, when evaluated on an RTX 4090 GPU under the single-frame inference setting (batch size = 1), CSSA-YOLO requires 13.5 ms/image, while the baseline model YOLOv9m takes 12.1 ms/image. Although there is an increase of 1.4 ms in inference time, this slight drop in speed is an acceptable trade-off.

To further demonstrate that the efficacy of our proposed modules is intrinsic to their design, we construct an enhanced baseline by replacing the standard P5 detection head in YOLOv9m with a P2 head specifically tailored for small objects. In this configuration, to avoid introducing excessive computational complexity, the SBM is strategically inserted at the P4 layer. Building upon this foundation, we compare the mAP@0.5 performance of the P2-equipped CSSA-YOLO against this upgraded baseline across all categories of the VisDrone2019 dataset. As detailed in Table 10, despite the enhanced baseline already being strengthened by the P2 detection head, our method achieves an overall mAP@0.5 improvement of 1.3%. Specifically, the most pronounced performance gains are observed in the detection of targets that are tiny or susceptible to background clutter. The improvements achieved in these challenging UAV imagery categories validate that our proposed SBM and SA-CIoU effectively suppress background clutter interference and alleviate the gradient attenuation bottleneck for small objects. Furthermore, the stable performance gains maintained across larger-scale categories demonstrate that CSSA-YOLO does not compromise the detection capability for relatively large targets, thereby establishing a robust detection paradigm for UAV aerial scenes.

4.4. Generalization Validation

To further verify the generalization ability of CSSA-YOLO, we conducted additional validation on the HIT-UAV [37], DeepPCB [38], and HRIPCB [39] datasets, and compared its performance with other advanced YOLO models, as shown in Table 11.

CSSA-YOLO achieves the best overall performance, most notably in the challenging mAP@0.5:0.95 metric, yielding 54.2%, 84.0%, and 53.2% on HIT-UAV, DeepPCB and HRIPCB, respectively. Furthermore, our model comprehensively outperforms YOLOv9m, surpassing it by 1.5%, 2.8%, and 1.9% in mAP@0.5:0.95 across the three datasets, respectively. Since these datasets exhibit distinct feature distributions compared to RGB-based UAV imagery, the results demonstrate the excellent generalization ability of our model. This confirms that the architectural optimizations in CSSA-YOLO are not merely tailored to aerial scenes, but effectively generalize to other complex visual tasks characterized by background clutter and scale variations.

5. Conclusions

To mitigate severe background clutter and extreme scale variations inherent in UAV imagery, we introduce a clutter-suppressed and scale-aware detection framework named CSSA-YOLO for robust object detection. Specifically, we develop the SBM to suppress background clutter by establishing a lightweight low-rank semantic bottleneck that preserves both structural and semantic priors. This mechanism filters out background clutter while accentuating target morphologies, thereby highlighting feature discriminability within the complex feature space of UAV imagery. Additionally, we introduce the SA-CIoU loss to enhance the detection of small and hard targets. By incorporating a scale-aware modulation factor and a dynamic alignment mechanism into the standard CIoU loss, SA-CIoU overcomes the gradient attenuation bottleneck in conventional loss formulations, thereby optimizing the localization of these objects. Extensive experiments on the VisDrone2019 dataset demonstrate that CSSA-YOLO achieves superior performance, yielding mAP@0.5 and mAP@0.5:0.95 scores of 46.0% and 28.4%, respectively. Furthermore, evaluations on additional benchmarks substantiate the robustness and generalization capability of CSSA-YOLO across diverse scenarios.

Author Contributions

Conceptualization, X.Y. and Y.W. (Yongjia Wang); methodology, X.Y. and Y.W. (Yongjia Wang); software, Y.W. (Yong Wang) and B.L.; validation, Y.W. (Yongjia Wang) and X.Y.; formal analysis, G.L. and Y.W. (Yongjia Wang); investigation, X.Y. and Y.W. (Yong Wang); resources, X.Y. and G.L.; data curation, X.Y. and G.L.; writing—original draft preparation, X.Y. and Y.W. (Yongjia Wang); writing—review and editing, Y.W. (Yongjia Wang) and W.L.; visualization, B.L. and W.L.; supervision, X.Y. and G.L.; project administration, X.Y. and W.L.; funding acquisition, X.Y. and G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Natural Science Foundation of China (52205533, 62273282), the Suzhou Leading Talent Program for Innovation and Entrepreneurship (ZXL2025299), the Basic Research Programs of Taicang (TC2024JC23), and the Open Projects funded by Hubei Engineering Research Center for Intelligent Detection and Identification of Complex Parts (IDICP-KF-2025-11).

Data Availability Statement

The original data presented in the study are openly available in GitHub at https://github.com/Yongjiawang-NPU/CSSA-YOLO (accessed on 8 May 2026).

Acknowledgments

We thank the editors and reviewers for their hard work and valuable advice.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Scale-Invariance of CIoU Gradients

Lemma A1 (Scale-Invariance of CIoU Gradients). Given a predicted bounding box b and a ground-truth box $b_{gt}$, let all spatial coordinates of the two boxes (including their minimum enclosing rectangle) be uniformly scaled by a positive factor k such that $b^{k} = k \cdot b$ and $b_{gt}^{k} = k \cdot b_{gt}$. The gradient of the CIoU loss with respect to the neural network’s scale-normalized bounding box regression outputs $\hat{t}$ remains strictly invariant to the scale factor k.

Proof. Since the CIoU loss relies on relative geometric properties (i.e., area ratios, normalized distances, and aspect ratios), it is inherently scale-invariant:

L_{CIoU} (b^{k}, b_{g t}^{k}) = L_{CIoU} (b, b_{g t})

(A1)

However, the partial derivative of $L_{CIoU}$ with respect to the absolute coordinate is scale-variant. Taking the width w as an example, scaling introduces a factor of $1 / k$ by definition:

\frac{\partial L_{CIoU}}{\partial (k w)} = \frac{1}{k} \frac{\partial L_{CIoU}}{\partial w}

(A2)

Modern object detectors universally parameterize bounding box regression through a generalized decoding paradigm. Let the network’s scale-normalized regression output for the width be ${\hat{t}}_{w}$. The absolute dimension is decoded as $w = w_{ref} \cdot f({\hat{t}}_{w})$, where $w_{ref}$ is a base reference width and $f(\cdot)$ is a scale-invariant transformation function such as an exponential or Sigmoid mapping. Applying the chain rule to backpropagate the gradient to the network’s regression output ${\hat{t}}_{w}$ yields:

\frac{\partial L_{CIoU}}{\partial {\hat{t}}_{w}} = \frac{\partial L_{CIoU}}{\partial w} \cdot \frac{\partial w}{\partial {\hat{t}}_{w}} = \frac{\partial L_{CIoU}}{\partial w} \cdot (w_{ref} \cdot f^{'} ({\hat{t}}_{w}))

(A3)

Under the uniform scaling transformation by k, the base reference width $w_{ref}$ synchronously scales with the image space, as it is inherently tied to the absolute spatial geometry. Moreover, since the output ${\hat{t}}_{w}$ remains unchanged for the identically proportioned target (i.e., ${\hat{t}}_{w}^{k} = {\hat{t}}_{w}$), substituting Equations (A2) and (A3) into the chain rule yields:

\frac{\partial L_{CIoU}}{\partial {\hat{t}}_{w}^{k}} = (\frac{1}{k} \frac{\partial L_{CIoU}}{\partial w}) \cdot ((k \cdot w_{ref}) \cdot f^{'} ({\hat{t}}_{w}^{k})) = \frac{\partial L_{CIoU}}{\partial w} \cdot w_{ref} \cdot f^{'} ({\hat{t}}_{w}) = \frac{\partial L_{CIoU}}{\partial {\hat{t}}_{w}}

(A4)

The absolute scale factor k is perfectly cancelled out during backpropagation. By symmetry, the identical derivation holds for the height h. Furthermore, the translation of center coordinates $(x, y)$ is commonly parameterized by absolute spatial scales, ensuring that the scaling factor k is equally cancelled out for the center coordinate regressions ${\hat{t}}_{x}$ and ${\hat{t}}_{y}$. Therefore, under ideal mathematical conditions, assuming a small object with area $A_{S}$ and a large object with area $A_{L}$ exhibit identical IoU, their backpropagated gradient magnitudes are strictly equal:

| \nabla_{\hat{t}} L_{S} | = | \nabla_{\hat{t}} L_{L} |

(A5)

In real-world scenarios, discrete quantization errors may make the gradient magnitudes only approximately equal. Nevertheless, this demonstrates that the theoretical gradient of the CIoU loss with respect to the scale-normalized outputs $\hat{t}$ remains strictly invariant to the scale factor k. □
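Lemma A1 can also be checked numerically. The sketch below assumes one particular decoding, namely exponential width and height mappings and center offsets scaled by the reference size, and estimates the gradient with central finite differences; the decoding form, the function names, and the sample values are illustrative assumptions rather than the parameterization of any specific detector.

```python
import math

def ciou(box, gt):
    """CIoU loss (Eqs. 8-10) for boxes given as (cx, cy, w, h)."""
    (cx, cy, w, h), (gx, gy, gw, gh) = box, gt
    iw = max(0.0, min(cx + w / 2, gx + gw / 2) - max(cx - w / 2, gx - gw / 2))
    ih = max(0.0, min(cy + h / 2, gy + gh / 2) - max(cy - h / 2, gy - gh / 2))
    inter = iw * ih
    iou = inter / (w * h + gw * gh - inter)
    cw = max(cx + w / 2, gx + gw / 2) - min(cx - w / 2, gx - gw / 2)
    ch = max(cy + h / 2, gy + gh / 2) - min(cy - h / 2, gy - gh / 2)
    d2 = (cx - gx) ** 2 + (cy - gy) ** 2
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + d2 / (cw ** 2 + ch ** 2) + alpha * v

def decode(t, ref):
    """Assumed decoding: offsets scaled by the reference size, exponential w and h."""
    (tx, ty, tw, th), (rx, ry, rw, rh) = t, ref
    return (rx + rw * tx, ry + rh * ty, rw * math.exp(tw), rh * math.exp(th))

def grad_wrt_t(t, ref, gt, step=1e-6):
    """Central finite-difference gradient of the loss w.r.t. the normalized outputs."""
    grads = []
    for i in range(4):
        plus, minus = list(t), list(t)
        plus[i] += step
        minus[i] -= step
        grads.append((ciou(decode(plus, ref), gt) - ciou(decode(minus, ref), gt)) / (2 * step))
    return grads

def scale(box, k):
    return tuple(k * c for c in box)

t = (0.1, -0.05, 0.2, -0.1)
ref, gt = (50.0, 50.0, 20.0, 10.0), (52.0, 49.0, 22.0, 12.0)
g_small = grad_wrt_t(t, ref, gt)
g_large = grad_wrt_t(t, scale(ref, 8.0), scale(gt, 8.0))
print(g_small)
print(g_large)
```

Under this assumed decoding the two gradient vectors agree up to finite-difference error, which matches the per-instance invariance stated in Equation (A5).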

References

1. Jack, P.; Biggs, T.; Sousa, D.; Coulter, L.; Hutmacher, S.; McMillan, H. Multi-resolution UAV remote sensing for anthropogenic debris detection in complex river environments. Remote Sens. 2025, 17, 2172.
2. Zhang, W.; Li, X.; Wang, L.; Zhang, D.; Lu, P.; Wang, L.; Cheng, C. A lightweight method for road defect detection in UAV remote sensing images with complex backgrounds and cross-scale fusion. Remote Sens. 2025, 17, 2248.
3. Yang, Y.; Guo, F.; Niu, P. UAVDet: A CNN-Mamba hybrid network for efficient small object detection in UAV imagery. Comput. Vis. Image Underst. 2026, 264, 104637.
4. Zhai, Y.; Zhang, Z.; Xie, S.; Tong, C.; Luo, X.; Li, X.; Wang, L.; Zhao, Y. A real-time improved YOLOv10 model for small and multi-scale ground target detection in UAV lidar range images of complex scenes. Electronics 2026, 15, 211.
5. Moranduzzo, T.; Melgani, F. Automatic car counting method for unmanned aerial vehicle images. IEEE Trans. Geosci. Remote Sens. 2013, 52, 1635–1647.
6. Liu, K.; Mattyus, G. Fast multiclass vehicle detection on aerial images. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1938–1942.
7. Luo, X.; Zhang, F.; Firkat, E.; Hamdulla, A.; Zhu, B.; Dawut, A. UAV vision-based object detection network with lightweight and multi-scale fusion. Neurocomputing 2025, 666, 132211.
8. Dong, Q.; Han, T.; Wu, G.; Sun, L.; Lu, Y. Robust object detection for UAVs in foggy environments with spatial-edge fusion and dynamic task alignment. Remote Sens. 2026, 18, 169.
9. Dong, Y.; Yang, H.; Liu, S.; Gao, G.; Li, C. Optical remote sensing object detection based on background separation and small object compensation strategy. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2024, 19, 3341–3351.
10. Wang, D.; Gao, Z.; Fang, J.; Li, Y.; Xu, Z. Eagle-YOLOv8: UAV object detection inspired by the eagle-eye vision system. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 9432–9447.
11. Wang, J.; Ma, M.; Huang, P.; Mei, S.; Zhang, L.; Wang, H. Remote sensing small object detection based on multicontextual information aggregation. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 8248–8260.
12. Chen, L.; Liu, C.; Li, W.; Xu, Q.; Deng, H. DTSSNet: Dynamic training sample selection network for UAV object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
13. Zhang, T.; Gao, G.; Ke, X.; Zhang, X. Swarm learning: Perception-retrieval-localization for ship detection from synthetic aperture radar remote sensing imagery. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2026, 19, 12384–12395.
14. Deng, R.; Zhang, T.; Xu, X.; Zhang, X.; Gao, G. Tri-state prototype self-distillation for SAR ocean imagery panoptic segmentation. IEEE Geosci. Remote Sens. Lett. 2026.
15. Qu, S.; Dang, C.; Chen, W.; Liu, Y. SMA-YOLO: An improved YOLOv8 algorithm based on parameter-free attention mechanism and multi-scale feature fusion for small object detection in UAV images. Remote Sens. 2025, 17, 2421.
16. Hou, T.; Leng, C.; Wang, J.; Pei, Z.; Peng, J.; Cheng, I.; Basu, A. MFEL-YOLO for small object detection in UAV aerial images. Expert Syst. Appl. 2025, 291, 128459.
17. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 526–543.
18. Zhao, S.; Chen, H.; Zhang, D.; Tao, Y.; Feng, X.; Zhang, D. SR-YOLO: Spatial-to-depth enhanced multi-scale attention network for small target detection in UAV aerial imagery. Remote Sens. 2025, 17, 2441.
19. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93.
20. Zhang, T.; Gao, G.; Zhang, X. Glance-Focus-Gaze: A novel eagle-eye vision-inspired panorama-population-individual progressive screening paradigm to capture ships in SAR images. ISPRS J. Photogramm. Remote Sens. 2026, 235, 241–260.
21. Zhang, T.; Zhang, X.; Gao, G. Divergence to concentration and population to individual: A progressive approaching ship detection paradigm for synthetic aperture radar remote sensing imagery. IEEE Trans. Aerosp. Electron. Syst. 2025, 62, 1325–1338.
22. Zhang, T.; Gao, G.; Zhang, X. Triple-level sparsity awareness for marine ship surveillance using satellite synthetic aperture radar. IEEE Trans. Autom. Sci. Eng. 2026, 23, 5155–5166.
23. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–21.
24. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops; IEEE: New York, NY, USA, 2019; pp. 213–226.
25. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2018; pp. 3–19.
26. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 11534–11542.
27. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 11863–11874.
28. Han, D.; Ye, T.; Han, Y.; Xia, Z.; Pan, S.; Wan, P.; Song, S.; Huang, G. Agent attention: On the integration of softmax and linear attention. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 124–140.
29. Xiao, C.; Zhang, Z.; Zhang, L. BinaryAttention: One-bit QK-attention for vision and diffusion transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2026.
30. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2020; Volume 34, pp. 12993–13000.
31. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877.
32. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051.
33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual Conference, 3–7 May 2021.
34. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 2117–2125.
35. Field, D.J. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 1987, 4, 2379–2394.
36. Ruderman, D.L. The statistics of natural images. Netw. Comput. Neural Syst. 1994, 5, 517.
37. Suo, J.; Wang, T.; Zhang, X.; Chen, H.; Zhou, W.; Shi, W. HIT-UAV: A high-altitude infrared thermal dataset for unmanned aerial vehicle-based object detection. Sci. Data 2023, 10, 227.
38. Tang, S.; He, F.; Huang, X.; Yang, J. Online PCB defect detector on a new PCB defect dataset. arXiv 2019, arXiv:1902.06197.
39. Huang, W.; Wei, P.; Zhang, M.; Liu, H. HRIPCB: A challenging dataset for PCB defects detection and classification. J. Eng. 2020, 2020, 303–309.
40. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. In Proceedings of the Annual Conference on Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 107984–108011.
41. Shi, H.; Wang, N.; Xu, X.; Qian, Y.; Zeng, L.; Zhu, Y. HeMoDU: High-efficiency multi-object detection algorithm for unmanned aerial vehicles on urban roads. Sensors 2024, 24, 4045.
42. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 16965–16974.
43. Zhao, X.; Wang, J.; Li, L.; Shao, X.; Zhang, K. A unified solution for replacing position embedding in Vision Transformer for object detection. Eng. Appl. Artif. Intell. 2025, 152, 110679.
44. Li, H.; Xiao, L.; Cao, L.; Yao, S.; Wang, M.; Li, Y. DFFormer: UAV object detection via feature scaling and interaction. IEEE Trans. Geosci. Remote Sens. 2026, 64, 1–16.
45. Tian, S.; Zhang, B.; Cao, L.; Kang, L.; Tian, J.; Xing, X.; Shen, B.; Fan, C.; Du, K.; Fu, C.; et al. MFDAFF-net: Multiscale frequency-aware and dual attention-guided feature fusion network for UAV imagery object detection. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2025, 18, 10640–10656.

Figure 1. Visual illustrations of the core challenges in UAV-based object detection. In expansive aerial scenes, targets are frequently submerged in severe background clutter, while simultaneously occupying tiny pixel footprints accompanied by extreme scale variations.

Figure 2. Overall architecture of the proposed CSSA-YOLO. Input UAV images are first processed by the backbone network to extract multi-scale features. These features are then fed into the neck network, where the SBM is embedded to preserve pure target semantics during the feature fusion process. The features are subsequently passed to the decoupled head for prediction. During training, the SA-CIoU loss dynamically calibrates the bounding box regression, alleviating the gradient starvation problem for small aerial targets.

Figure 3. Architecture of the SBM. Input features first undergo dynamic multi-scale contextual encoding to model local patterns. Subsequently, learnable mediator tokens act as a semantic bottleneck to refine features via a two-stage self-attention mechanism. Finally, a nonlinear gating mechanism recalibrates features to suppress background noise, yielding the purified residual output.

Figure 4. Integration of the SBM at the P3, P4, and P5 layers of the YOLOv9 neck. Given a $640 \times 640$ input image and 128 mediator tokens, these variants achieve spatial compression ratios of 50:1, 12.5:1, and 3.125:1, respectively.

Figure 5. The diagram of CIoU loss. The blue and red rectangles represent the predicted box ($b$) and the ground-truth box ($b_{gt}$), respectively. The parameter $d$ denotes the Euclidean distance between their center points, while $c$ is the diagonal length of the minimum enclosing bounding box. The variables $(w, h)$ and $(w_{gt}, h_{gt})$ denote their corresponding spatial dimensions.

Figure 6. Comparison between inverse proportional mapping and reciprocal logarithmic mapping. The proposed reciprocal logarithmic approach maintains a progressive decay profile, whereas the inverse proportional baseline suffers from a catastrophic collapse as the target scale increases.

Figure 7. Statistical analysis of the training split for the VisDrone2019 dataset. (left) Class distributions; (right) size proportions. The size proportions clearly demonstrate the extreme scale variations inherent in UAV imagery.

Figure 8. Category-specific $P$-$R$ curves of CSSA-YOLO evaluated at an IoU threshold of 0.5 on the VisDrone2019 dataset.

Figure 9. Feature-level visualization of the SBM. For each of the rows (a,b): (left) original input image with ground-truth bounding boxes (in green); (right) feature difference map obtained by subtracting the feature maps before and after the SBM, highlighting the specific regions enhanced or suppressed by the SBM.

Figure 10. Visual comparison of YOLOv9m and CSSA-YOLO on the VisDrone2019 dataset. For each of the three rows (a–c): (left) YOLOv9m detection results; (center) CSSA-YOLO detection results; (right) zoomed-in view of the red dashed rectangular area from the CSSA-YOLO results, highlighting regions with significant performance differences.

Table 1. Training hyperparameters.

Hyperparameters	Value
Optimizer	MuSGD
Learning rate	0.01
Momentum	0.9
Attention heads	8
Mediator tokens	128
$λ$	5.0
($ω_{\min}$, $ω_{\max}$)	(0.5, 2.0)

Table 2. Quantitative comparison results on the VisDrone2019 dataset (%).

Models	P	R	$F_{1}$	mAP@0.5	mAP@0.5:0.95
YOLOv8m	53.9	42.9	47.8	44.0	26.2
YOLOv9m [23]	56.1	43.7	49.1	44.6	27.0
YOLOv10m [40]	54.1	41.8	47.2	42.9	26.0
YOLO11m	56.1	41.4	47.6	42.9	26.2
Eagle-YOLOv8m [10]	55.6	43.5	48.8	43.3	25.9
MCIA-YOLOm [11]	52.8	41.1	46.2	42.7	25.9
SMA-YOLO [15]	53.0	40.9	45.0	42.3	25.3
SR-YOLO [18]	55.2	43.3	48.5	44.6	26.6
HeMoDU [41]	54.9	42.4	47.8	44.2	27.1
DTSSNet [12]	−	−	−	41.1	25.5
MFEL-YOLO [16]	−	−	−	44.7	27.3
DetectoRS with RFLA [17]	−	−	−	45.3	27.4
RT-DETR [42]	−	−	−	43.8	26.2
HV-SwinViT [43]	−	−	−	43.6	26.3
DFFormer [44]	−	−	−	44.6	26.5
MFDAFF-Net [45]	−	−	−	43.9	26.9
CSSA-YOLO	56.4	44.8	49.9	46.0	28.4

Table 3. Category-specific performance of CSSA-YOLO on the VisDrone2019 dataset (%).

Metrics	ped.	peo.	bic.	car	van	tru.	tri.	awn.	bus	mot.	all
P	65.4	60.8	33.2	78.1	57.7	58.3	46.1	31.3	75.0	57.7	56.4
R	46.9	37.1	24.1	81.8	45.3	39.5	36.7	24.2	57.8	55.0	44.8
mAP@0.5	53.1	40.9	19.9	83.8	47.9	42.3	34.4	19.8	63.6	54.3	46.0

Table 4. Quantitative ablation results on the VisDrone2019 dataset (%).

SBM	SA-CIoU	P	R	$F_{1}$	mAP@0.5	mAP@0.5:0.95
×	×	56.1	43.7	49.1	44.6	27.0
✓	×	57.7	43.4	49.4	45.3	27.8
×	✓	55.8	44.3	49.4	45.4	27.9
✓	✓	56.4	44.8	49.9	46.0	28.4

Table 5. Comparison of different SBM insertion positions on the VisDrone2019 dataset (%).

Position	P	R	$F_{1}$	mAP@0.5	mAP@0.5:0.95	Ratio ( $N / m$ )
w/o SBM	55.8	44.3	49.4	45.4	27.9	−
P5	55.2	44.5	49.3	45.7	28.1	3.125
P4	56.6	44.3	49.7	45.8	28.4	12.5
P3	56.4	44.8	49.9	46.0	28.4	50

Table 6. Performance with different numbers of mediator tokens (m) (%).

m	P	R	mAP@0.5	mAP@0.5:0.95	Ratio ( $N / m$ )
64	56.8	42.0	44.0	27.2	100
128	56.4	44.8	46.0	28.4	50
256	56.1	45.0	45.9	28.4	25
512	55.7	44.4	45.5	28.1	12.5
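
The Ratio ($N / m$) columns in Tables 5 and 6 can be reproduced with simple arithmetic. The sketch below assumes the standard YOLO strides of 8, 16, and 32 for the P3, P4, and P5 levels (the strides are not stated in this section), takes $N$ as the number of spatial positions in the feature map, and takes $m$ as the number of mediator tokens. The function name `compression_ratio` is a hypothetical helper used only for illustration.

```python
def compression_ratio(image_size, stride, num_tokens):
    """Return N / m, where N = (image_size / stride)^2 spatial positions."""
    side = image_size // stride
    return (side * side) / num_tokens

# Assumed strides for the three pyramid levels (standard in YOLO-style detectors).
strides = {"P3": 8, "P4": 16, "P5": 32}

# Table 5 setting: 640 x 640 input, 128 mediator tokens, varying insertion level.
ratios = {level: compression_ratio(640, s, 128) for level, s in strides.items()}

# Table 6 setting: varying token count; the listed ratios match the P3 resolution.
token_sweep = {m: compression_ratio(640, 8, m) for m in (64, 128, 256, 512)}

print(ratios)
print(token_sweep)
```

That the Table 6 ratios equal $6400 / m$ is consistent with the token sweep having been run with the SBM at P3, although this is an inference from the numbers rather than something stated in the table.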

Table 7. Quantitative comparison results against other lightweight attention mechanisms (%).

Model	P	R	mAP@0.5	mAP@0.5:0.95	Added GFLOPs	Added parameters
YOLOv9m + SA-CIoU	55.8	44.3	45.4	27.9	−	−
+CBAM [25]	56.4	43.9	45.0	27.5	+0.1	57,938
+ECA [26]	55.1	43.9	45.4	28.0	+0.1	3
+SimAM [27]	55.4	43.7	44.9	27.3	+0.0	0
+Agent Attention [28]	55.6	44.4	45.5	28.0	+3.1	272,176
+Binary Attention [29]	56.0	43.8	45.7	28.1	+8.9	693,600
+SBM	56.4	44.8	46.0	28.4	+4.7	390,963

Table 8. Quantitative comparison with other representative regression loss functions (%).

BBox Loss	P	R	$F_{1}$	mAP@0.5	mAP@0.5:0.95
CIoU [30]	56.1	43.7	49.1	44.6	27.0
Inner-CIoU [31]	56.2	43.7	49.2	44.7	27.3
WIoUv3 [32]	56.5	43.6	49.2	45.1	27.4
NWD [19]	57.8	42.4	48.9	44.7	27.2
SA-CIoU	55.8	44.3	49.4	45.4	27.9

Table 9. Sensitivity analysis of the hyperparameter $λ$ (%).

$λ$	P	R	$F_{1}$	mAP@0.5	mAP@0.5:0.95
4.0	56.7	44.6	49.9	45.9	28.3
5.0	56.4	44.8	49.9	46.0	28.4
6.0	56.0	44.7	49.7	45.8	28.3

Table 10. Comparison of CSSA-YOLO∗ and YOLOv9m∗ on the VisDrone2019 dataset (%).

Methods	ped.	peo.	bic.	car	van	tru.	tri.	awn.	bus	mot.	All
YOLOv9m∗	57.1	45.4	21.6	85.8	51.1	43.7	35.5	20.6	65.8	56.6	48.2
CSSA-YOLO∗	59.4	47.8	22.7	86.6	51.7	44.4	37.4	21.2	65.9	58.0	49.5
$Δ$	2.3 ↑	2.4 ↑	1.1 ↑	0.8 ↑	0.6 ↑	0.7 ↑	1.9 ↑	0.6 ↑	0.1 ↑	1.4 ↑	1.3 ↑

∗ indicates models equipped with the P2 detection head, and ↑ indicates a performance improvement.

Table 11. Comparative performance on other benchmark datasets (%).

Method	HIT-UAV mAP@0.5	HIT-UAV mAP@0.5:0.95	DeepPCB mAP@0.5	DeepPCB mAP@0.5:0.95	HRIPCB mAP@0.5	HRIPCB mAP@0.5:0.95
YOLOv8m	77.3	53.0	97.3	79.8	93.6	51.1
YOLOv9m [23]	77.6	52.7	97.5	81.2	94.6	51.3
YOLOv10m [40]	75.5	50.9	98.0	77.8	94.8	52.4
YOLO11m	77.0	51.4	96.8	79.4	94.7	50.7
CSSA-YOLO	78.6	54.2	98.3	84.0	95.4	53.2

Share and Cite

MDPI and ACS Style

Yang, X.; Wang, Y.; Wang, Y.; Li, W.; Liu, B.; Liu, G. CSSA-YOLO: A Clutter-Suppressed and Scale-Aware Framework for Robust Object Detection in UAV Imagery. Remote Sens. 2026, 18, 1533. https://doi.org/10.3390/rs18101533

