3.1. Overview of CSSA-YOLO
The overall architecture of the proposed CSSA-YOLO framework is illustrated in Figure 2. It is constructed upon the YOLOv9 baseline, integrating two targeted innovations to tackle the challenges of UAV imagery.
Firstly, YOLOv9 serves as an advanced architecture that addresses the information bottleneck problem in deep object detection networks. To achieve this, YOLOv9 introduces two innovative components. The Generalized Efficient Layer Aggregation Network (GELAN) optimizes parameter utilization through an efficient network topology, ensuring robust feature extraction without incurring excessive computational overhead. In addition, Programmable Gradient Information (PGI) refines the training process by incorporating auxiliary reversible branches. This mechanism ensures that rich target information is preserved throughout the forward pass, thereby generating reliable gradient signals during backpropagation. Owing to its robust gradient preservation capability, YOLOv9 is well suited to the intrinsic detection challenges of UAV aerial imagery. During optimization, CSSA-YOLO retains the decoupled detection head of YOLOv9: the classification branch is supervised by the Binary Cross-Entropy (BCE) loss, while the regression branch is jointly supervised by the Distribution Focal Loss (DFL) and our Scale-Aware Complete-IoU (SA-CIoU) loss.
Secondly, to address the severe background clutter inherent in UAV images, we design the Semantic Bottleneck Module (SBM). Unlike traditional unconstrained global attention mechanisms, which may form spurious correlations with structural distractors, SBM maps the spatial features into a low-rank token space. Serving as a lightweight filtering bottleneck, it suppresses background clutter while preserving the critical semantic and structural priors of aerial targets, thereby enhancing target discriminability within the feature space.
Thirdly, to mitigate the gradient attenuation and localization bottlenecks of small aerial objects, we propose the SA-CIoU loss. By dynamically adjusting the optimization focus during training, SA-CIoU prevents the weak gradient signals of small targets from being overwhelmed by larger instances. Specifically, it mathematically couples a scale-aware modulation factor based on absolute pixel area with a dynamic alignment mechanism, effectively prioritizing the precise regression of small and hard-to-detect objects.
3.2. Semantic Bottleneck Module
Standard local convolutions are inherently limited in capturing long-range dependencies and suffer from local feature redundancy, rendering them suboptimal for distinguishing aerial targets from complex backgrounds. Although vision transformers (ViTs) [33] and standard self-attention mechanisms establish global receptive fields to mitigate these constraints, their unconstrained token-to-token interactions often lead to global information redundancy and amplify background interference in complex, expansive UAV imagery. In such high-altitude scenarios, unrestricted spatial similarities increase the model’s susceptibility to spurious correlations between actual targets and structurally similar distractors. To mitigate these spurious attention interactions and bias the attention mechanism towards discriminative target semantic features, we propose the SBM, the architecture of which is illustrated in Figure 3.
Let $\mathbf{X} \in \mathbb{R}^{H \times W \times D}$ denote the original spatial feature map fed into the SBM, where $H$, $W$, and $D$ represent the spatial height, width, and embedding dimension, respectively. We first apply Batch Normalization (BN) to $\mathbf{X}$, yielding the normalized representation $\hat{\mathbf{X}}$. Let $N = H \times W$ represent the spatial sequence length. To capture the localized geometric variations of aerial targets, we construct a multi-scale feature representation $\mathbf{X}_{ms}$. We apply a dynamic weighted fusion across $K$ parallel convolutional branches. To encompass diverse receptive fields without incurring excessive computational overhead, we set $K = 3$ and instantiate these branches using a convolution and two Depthwise Convolutions with different receptive fields. The multi-scale fusion is formulated as:
$$\mathbf{X}_{ms} = \sum_{i=1}^{K} \sigma(\alpha_i)\, f_i(\hat{\mathbf{X}}),$$
where $f_i(\cdot)$ denotes the $i$-th convolutional mapping, $\alpha_i$ represents learnable scale parameters, and $\sigma(\cdot)$ is the Sigmoid activation function.
To perform global contextual optimization while preventing unconstrained clutter interactions, we introduce a decoupled two-stage attention mechanism mediated by a low-rank semantic bottleneck. Specifically, we introduce a set of learnable mediator tokens $\mathbf{M} \in \mathbb{R}^{m \times D}$, where $m$ is the number of mediator tokens ($m \ll N$). During the multi-head attention computation, $\mathbf{M}$ is evenly partitioned across $h$ parallel heads, yielding $\mathbf{M}_j \in \mathbb{R}^{m \times d}$ for each head, where $d = D/h$. Concurrently, for each head, the features are linearly projected to derive the spatial queries $\mathbf{Q}$, keys $\mathbf{K}$, and values $\mathbf{V}$. Furthermore, to alleviate the lack of spatial inductive bias intrinsic to standard self-attention, we dynamically generate a Contextual Position Encoding (CoPE) by applying a Depthwise Convolutional mapping
on the
. This CoPE is then injected into the
and
:
In the first attention stage, the mediator tokens act as bottleneck queries to aggregate global semantic information from
and
, effectively compressing the dense spatial representation into a low-rank semantic prior
:
In the second stage,
retrieves information from this low-rank global representation to broadcast the purified contextual features back to the original spatial resolution, yielding the aggregated representation
:
The outputs from all parallel heads are subsequently concatenated along the channel dimension to form the global spatial representation. In wide-range UAV imagery, the spatial background contains complex and unordered clutter, rendering its feature representation highly redundant. In contrast, most aerial targets occupy only a small number of pixels and are highly structured, allowing their core semantics to be mapped into a low-rank latent subspace. Because the number of mediator tokens is limited, the process decouples the spatial associations between actual targets and morphologically similar distractors. This encourages the network to suppress irrelevant clutter and focus on distilling the essential semantic features of aerial targets. While Agent Attention constructs a similar bottleneck structure via spatial pooling, it remains inherently unconstrained. When agent tokens serve as mediators to broadcast information globally, they inevitably propagate background clutter across the entire image. In contrast, SBM employs data-agnostic and learnable mediator tokens. Acting as optimized semantic probes, these tokens automatically assign negligible weights to background regions, actively and selectively seeking discriminative target features. This design achieves robust decoupling of aerial targets from complex background clutter.
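The two-stage attention described above can be sketched for a single head as follows. This is a minimal illustration under stated assumptions, not the module's actual implementation: it assumes standard scaled dot-product softmax attention with a $1/\sqrt{d}$ factor, omits CoPE, the multi-head split, and the learned projections, and uses random toy inputs. All function names (`mediator_attention`, `softmax_rows`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in a:
        mx = max(row)
        exps = [math.exp(val - mx) for val in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def scaled(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[val * s for val in row] for row in a]

def mediator_attention(q, k, v, med):
    """Single-head two-stage attention through m mediator tokens.

    q, k, v: N x d spatial queries, keys, values; med: m x d mediator tokens.
    Scaled dot-product softmax attention is an assumption made for illustration.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    # Stage 1: mediators act as queries over the spatial keys/values (m x d prior).
    attn1 = softmax_rows(scaled(matmul(med, transpose(k)), scale))
    prior = matmul(attn1, v)
    # Stage 2: spatial queries read from the m-token prior (N x d output).
    attn2 = softmax_rows(scaled(matmul(q, transpose(med)), scale))
    return matmul(attn2, prior)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
N, m, d = 16, 4, 8  # toy sizes chosen for illustration only
q, k, v = rand_matrix(N, d), rand_matrix(N, d), rand_matrix(N, d)
med = rand_matrix(m, d)
out = mediator_attention(q, k, v, med)
print(len(out), len(out[0]))  # 16 8
```

Note that no $N \times N$ attention map is ever formed: the largest intermediate matrices are $m \times N$ and $N \times m$, which is where the bottleneck's savings come from.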
Finally, while the bottleneck extracts high-level semantic priors, precise localization of tiny targets requires fine-grained structural cues. Therefore, we introduce a nonlinear spatial gating mechanism to leverage the local structural priors embedded in
to further recalibrate the globally aggregated features
. Specifically, we employ a point-wise linear projection followed by the Sigmoid-Weighted Linear Unit (SiLU) to generate a dynamic saliency map, which acts as a spatial filter to selectively activate target-related regions. The final purified output
is formulated as a gated residual connection:
where
denotes the learnable weight matrix for the spatial gate,
is the linearly projected output of
, ⊙ indicates the element-wise Hadamard product, and
is a learnable scaling parameter.
The overall computational complexity of the SBM encompasses linear projections, multi-scale feature extraction, and the two-stage attention. The depthwise convolutions in the multi-scale fusion and CoPE require $O(k^2 N D)$, where $k$ is the kernel size. The linear projections for QKV generation, the non-depthwise convolution branch, the spatial gating projection, and the final output projection consume $O(N D^2)$. The two-stage attention mechanism requires $O(N m d)$ per head per stage, totaling $O(N m D)$ for all heads. Consequently, the total computational complexity of SBM is $O(k^2 N D + N D^2 + N m D)$. Since $m \ll N$, the overall complexity is linear in the sequence length $N$, which is lower than the quadratic $O(N^2 D)$ complexity of standard self-attention.
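To make the attention term of this comparison concrete, the following rough sketch counts multiply-accumulate operations for the attention maps only (projections and convolutions are excluded). The sizes `n = 6400`, `m = 128`, and `d_model = 256` are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def self_attention_macs(n, d_model):
    """Full self-attention: N x N score matrix plus weighted sum, 2 * N^2 * D."""
    return 2 * n * n * d_model

def mediator_attention_macs(n, m, d_model):
    """Two stages, each with an N x m score matrix and a weighted sum: 4 * N * m * D."""
    return 4 * n * m * d_model

n, m, d_model = 6400, 128, 256  # hypothetical token count, mediators, channels
full = self_attention_macs(n, d_model)
mediated = mediator_attention_macs(n, m, d_model)
print(full // mediated)  # 25
```

The ratio reduces to $N / (2m)$, so the saving grows with the token count for a fixed number of mediator tokens.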
In summary, the SBM forces the network to project spatial features into a low-rank space, effectively filtering out complex background clutter and distilling robust global structural and semantic priors. To analyze the optimal integration of the SBM within the YOLOv9 architecture, we construct three distinct variants by deploying the module individually at the P3, P4, and P5 layers of the Feature Pyramid Network (FPN) [34], as illustrated in Figure 4. Empirical results indicate that it is optimal to anchor the SBM at the high-resolution P3 layer, which is typically characterized by the most severe background clutter. For a UAV input image of
pixels, the spatial dimensions of the P3 feature map are
(with a downsampling stride of 8), yielding
dense spatial tokens. By configuring the SBM with
mediator tokens, the module establishes an extreme spatial compression ratio of 50:1, compressing the dense spatial representation to merely 2% of its original sequence volume.
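The arithmetic behind this compression ratio can be checked with a short sketch. The stride of 8 and the 50:1 ratio come from the text above; the 640 × 640 input size is an assumption made purely for illustration, and the helper names are hypothetical.

```python
def p3_token_count(height, width, stride=8):
    """Number of spatial tokens on a feature map downsampled by `stride`."""
    return (height // stride) * (width // stride)

def mediator_count(num_tokens, ratio=50):
    """Mediator tokens implied by a tokens-to-mediators compression ratio."""
    return num_tokens // ratio

tokens = p3_token_count(640, 640)   # 640 x 640 is an assumed input size
mediators = mediator_count(tokens)  # 50:1 ratio as stated in the text
print(tokens, mediators, mediators / tokens)  # 6400 128 0.02
```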
3.3. Scale-Aware Complete-IoU Loss
Intersection over Union (IoU) is a fundamental metric for evaluating bounding box localization performance in object detection, defined as the ratio of the intersection area to the union area between the predicted and ground-truth boxes. Let $b$ and $b^{gt}$ denote the predicted and ground-truth bounding boxes, respectively. The standard IoU is mathematically formulated as follows:
$$\mathrm{IoU} = \frac{|b \cap b^{gt}|}{|b \cup b^{gt}|},$$
where $|\cdot|$ denotes the spatial area of the corresponding region.
While IoU effectively quantifies spatial overlap, it suffers from severe gradient vanishing in non-overlapping scenarios. To alleviate this, CIoU incorporates central spatial distance and aspect ratio consistency into the optimization objective. As illustrated in Figure 5, let $d$ represent the Euclidean distance between the center coordinates of $b$ and $b^{gt}$, and $c$ denote the diagonal length of the smallest enclosing bounding box. Formally, the CIoU loss is given by:
$$\mathcal{L}_{CIoU} = 1 - \mathrm{IoU} + \frac{d^2}{c^2} + \alpha v,$$
where $v$ measures the consistency of the aspect ratio, with $(w, h)$ and $(w^{gt}, h^{gt})$ denoting the width and height of the predicted and ground-truth boxes, respectively:
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2,$$
and $\alpha$ functions as a trade-off parameter, defined as:
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}.$$
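As a concrete illustration of the CIoU terms described above, the following minimal sketch evaluates the loss for a pair of axis-aligned boxes. It assumes the standard CIoU formulation (the $1 - \mathrm{IoU}$ term, the normalized squared center distance $d^2/c^2$, and the aspect-ratio penalty $\alpha v$). The `(x1, y1, x2, y2)` box format, the `eps` stabilizer, and the function name `ciou_loss` are illustrative choices rather than details from the text.

```python
import math

def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance d^2 between the two box centers.
    d2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal c^2 of the smallest enclosing box.
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2

    # Aspect-ratio consistency v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + d2 / (c2 + eps) + alpha * v

same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
apart = ciou_loss((0, 0, 10, 10), (20, 20, 30, 30))
print(same, apart)
```

For the disjoint pair, the IoU term alone would be flat at 1, whereas the center-distance term still varies with how far apart the boxes are, which is what keeps the gradient informative in non-overlapping cases.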
Although CIoU loss successfully encodes geometric structural constraints, it inherently treats all objects uniformly. In the context of object detection in UAV imagery, this property leads to gradient attenuation for tiny targets. To mathematically demonstrate this optimization bottleneck, we provide the following analysis:
Assumption 1. Under standard bounding box regression losses such as $\mathcal{L}_{CIoU}$, the gradient optimization capacity for small objects may be bounded by their relative spatial area. In datasets with extreme scale variations, this spatial imbalance tends to force the parameter updates of small objects to be heavily dominated by large objects, resulting in a gradient attenuation bottleneck.
Remark 1 (Theoretical Justification). Let $A_s$ and $A_l$ denote the areas of the small and large objects in UAV imagery, respectively ($A_s \ll A_l$). Under standard loss formulations such as $\mathcal{L}_{CIoU}$, the bounding box regression gradients are inherently scale-invariant at the individual instance level (see Appendix A). However, the number of spatial feature points available for optimizing these targets is proportional to their spatial area $A$. Consequently, the structural capacity ratio for gradient backpropagation, which scales with $A_s / A_l \ll 1$, is drastically skewed. For example, in the scale-normalized VisDrone2019 setting, many tiny objects occupy only a few tens of pixels, whereas large objects may span several thousand pixels. This severe optimization sample imbalance implies that the accumulated parameter updates for small objects are largely eclipsed by the overwhelming number of sample points from large objects. In this paper, we define this macroscopic diminution of the accumulated optimization signal, driven by sample deficiency rather than individual gradient magnitude degradation, as effective gradient attenuation. To counteract this, a re-weighting strategy should be employed to restore the optimization balance.
To counteract this imbalance, we propose a scale-aware modulation factor
to effectively re-balance the gradient contributions. Specifically,
dynamically modulates the loss value via a reciprocal logarithmic mapping of the target’s spatial area:
where
is a hyperparameter governing the scaling intensity, which is set to
based on the average object area to maintain the overall magnitude balance of the loss function. To prevent a negligible minority of extreme targets or corrupted samples from compromising numerical stability and triggering gradient explosion,
is clamped to the range
. Notably, we employ a reciprocal logarithmic function rather than an inverse proportional mapping. As illustrated in
Figure 6, the reciprocal logarithmic mapping provides a more graceful weight decay compared to the catastrophic drop inherent in the inverse proportional mapping. Moreover, large objects exhibit spatial feature redundancy, whereby their effective semantic information grows logarithmically with spatial area.
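The contrast between the two mappings can be illustrated numerically. The sketch below is built on assumptions: the weight form `lam / ln(e + area)`, the offset `e` inside the logarithm, the clamp bounds, and the default `lam = 1.0` are all hypothetical stand-ins chosen for illustration, not the paper's actual definition or hyperparameter values.

```python
import math

def recip_log_weight(area, lam=1.0, lo=0.05, hi=5.0):
    """Hypothetical reciprocal-log weight lam / ln(e + area), clamped to [lo, hi]."""
    raw = lam / math.log(math.e + area)
    return min(max(raw, lo), hi)

def inverse_weight(area, lam=1.0):
    """Inverse proportional alternative lam / area, shown for comparison only."""
    return lam / area

small, large = 16.0, 4096.0  # object areas in pixels (illustrative values)
log_drop = recip_log_weight(small) / recip_log_weight(large)
inv_drop = inverse_weight(small) / inverse_weight(large)
print(f"reciprocal-log weight shrinks by {log_drop:.2f}x, inverse by {inv_drop:.0f}x")
```

Under these assumed forms, a 256-fold increase in area reduces the reciprocal-log weight by less than a factor of three, whereas the inverse proportional weight falls by the full factor of 256, so large objects would contribute almost nothing to the loss.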
Assumption 2. For the internal homogeneous regions of targets in UAV imagery, the pixel distribution can be modeled as a wide-sense stationary (WSS) random field. Given that the spatial correlation length τ is comparable to the target’s characteristic dimension L, the effective semantic information, quantified by its joint entropy $\mathcal{H}$, scales logarithmically with its spatial area $A$, i.e., $\mathcal{H} \propto \ln A$.
Remark 2 (Theoretical Justification). For an image region with $N$ pixels (where $N \propto A$), we treat each pixel as a discrete random variable $X_i$. The joint entropy $\mathcal{H}(X_1, \dots, X_N)$, which characterizes the total effective semantic information of the region, can be expanded as:
$$\mathcal{H}(X_1, \dots, X_N) = \sum_{i=1}^{N} \mathcal{H}(X_i \mid X_1, \dots, X_{i-1}),$$
where $\mathcal{H}(X_i \mid X_1, \dots, X_{i-1})$ denotes the conditional entropy. In a hypothetical scenario of independent and identically distributed (i.i.d.) pixels, the conditional entropy equals the marginal entropy (i.e., $\mathcal{H}(X_i \mid X_1, \dots, X_{i-1}) = \mathcal{H}(X_i)$). The joint entropy thus simplifies to $N\,\mathcal{H}(X_1)$, implying that the intrinsic semantic information grows strictly linearly with the spatial area (i.e., $\mathcal{H} \propto A$).

However, objects in UAV remote sensing imagery exhibit strong spatial auto-correlation, presenting WSS properties. Since the correlation length τ is comparable to the object’s characteristic dimension L, the internal feature field is heavily redundant. According to foundational studies in natural image statistics, particularly the seminal works by Field [35] and Ruderman [36], such correlated WSS fields inherently exhibit scale invariance, where their spatial power spectra closely follow an approximately $1/f^{2}$ distribution. In the information-theoretic domain, this long-range spatial dependency and immense internal redundancy imply that the conditional entropy of such fields decays asymptotically as a power-law function of the spatial neighborhood size. Consequently, the marginal information gain provided by a newly processed $i$-th pixel experiences a long-tail decay. To mathematically capture this behavior in a tractable closed form, we approximate the conditional entropy using a first-order power-law instance, namely the harmonic decay profile:
$$\mathcal{H}(X_i \mid X_1, \dots, X_{i-1}) \approx \frac{\mathcal{H}(X_1)}{i}.$$
Substituting this harmonic decay into the joint entropy expansion yields:
$$\mathcal{H}(X_1, \dots, X_N) \approx \mathcal{H}(X_1) \sum_{i=1}^{N} \frac{1}{i}.$$
According to the approximation formula for the harmonic series, as $N$ increases, this discrete sum can be approximated by:
$$\sum_{i=1}^{N} \frac{1}{i} \approx \ln N + \gamma,$$
where $\gamma$ is the Euler-Mascheroni constant.

Since the number of pixels $N$ is strictly proportional to the spatial area $A$, it mathematically follows that:
$$\mathcal{H}(X_1, \dots, X_N) \propto \ln A.$$
This theoretical model suggests that the internal semantic information of large UAV targets exhibits a logarithmic growth bottleneck. Therefore, our reciprocal logarithmic modulation is theoretically motivated, as it aligns the gradient optimization weights with the derived logarithmic growth of the target’s information gain.
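The harmonic-series approximation used in this argument is easy to verify numerically. The sketch below compares the partial sum of $1/i$ with $\ln N + \gamma$ and shows that quadrupling the pixel count adds roughly $\ln 4$ to the sum, regardless of the starting size. The function name `harmonic` and the chosen values of $N$ are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def harmonic(n):
    """Partial sum of the harmonic series, 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

for n in (10, 100, 10_000):
    approx = math.log(n) + EULER_GAMMA
    print(n, round(harmonic(n), 5), round(approx, 5))

# Quadrupling the pixel count adds about ln(4) to the sum.
gain = harmonic(4000) - harmonic(1000)
print(round(gain, 4), round(math.log(4), 4))
```

The gap between the exact sum and the approximation shrinks roughly as $1/(2N)$, so the logarithmic form is already accurate for regions of a few hundred pixels.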
While
effectively compensates for static spatial imbalance, it only imposes a fixed gradient multiplier based on the absolute area of ground truth. This may amplify gradients for well-localized tiny objects in UAV imagery in the late training stages, while providing insufficient penalties for hard samples with low IoU. To achieve a robust optimization paradigm, we further introduce a dynamic alignment mechanism to construct the final SA-CIoU loss. This mechanism adaptively modulates the regression gradients based on the real-time IoU, forcing the network to prioritize poorly localized objects. The dynamic alignment mechanism, formulated as
, is additively integrated into the bounding box regression objective. To analyze its optimization dynamics, let
(where
is practically unattainable). The relative loss weight of the loss with the dynamic alignment mechanism with respect to the standard IoU loss is given by:
Since for , the introduced dynamic alignment mechanism increases the relative loss penalty with respect to the standard IoU loss, thereby placing greater optimization emphasis on poorly localized predictions. More importantly, the first-order derivative guarantees a strictly monotonic and linear decay of the penalty. As the predicted bounding box converges toward the ground truth (i.e., ), the additional gradient boost vanishes smoothly, thereby ensuring stability near the optimal local minima.
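The behaviour described above can be illustrated with a small numerical sketch. It assumes, for illustration only, that the alignment term takes the quadratic form $(1 - \mathrm{IoU})^2$ added to the base IoU loss $1 - \mathrm{IoU}$. This is one form consistent with a relative weight that exceeds 1 and decays linearly as IoU grows, and it is not necessarily the exact term used in SA-CIoU. The function name `relative_weight` is hypothetical.

```python
def relative_weight(iou):
    """Ratio of (base loss + assumed alignment term) to the base IoU loss.

    Assumes base = 1 - IoU and alignment = (1 - IoU)^2, so the ratio is 2 - IoU.
    Valid for IoU in [0, 1); IoU = 1 would divide by zero.
    """
    base = 1.0 - iou
    aligned = base + base ** 2
    return aligned / base

for iou in (0.0, 0.25, 0.5, 0.75, 0.99):
    print(iou, round(relative_weight(iou), 4))
```

Under this assumption, a prediction with no overlap receives twice the penalty of the plain IoU loss, while a nearly perfect prediction receives almost no extra weight, matching the smooth vanishing of the boost near convergence.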
By synergizing the geometric priors of CIoU with the scale-balancing capability of the scale-aware modulation factor and the IoU-aware optimization of the dynamic alignment mechanism, the comprehensive SA-CIoU loss is formalized as:
This unified formulation delivers stable optimization efficacy for targets across varying scales in UAV-based object detection tasks.