Article

CASA-RCNN: A Context-Enhanced and Scale-Adaptive Two-Stage Detector for Dense UAV Aerial Scenes

1
School of Information Engineering, Wuhan University of Technology, Wuhan 430070, China
2
School of Physics and Mechanics, Wuhan University of Technology, Wuhan 430070, China
3
School of Management, Wuhan University of Technology, Wuhan 430070, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Drones 2026, 10(2), 133; https://doi.org/10.3390/drones10020133
Submission received: 7 January 2026 / Revised: 10 February 2026 / Accepted: 11 February 2026 / Published: 14 February 2026

Highlights

What are the main findings?
  • We propose CASA-RCNN, a context-enhanced, scale-adaptive two-stage detector with hierarchical feature enhancement and quality-scale collaborative optimization for dense low-altitude UAV aerial scenes.
  • On the VisDrone2021 validation set, CASA-RCNN achieves 22.9% mAP, outperforming Faster R-CNN by 9.0 points and improving small-object performance (mAPs 12.5% vs. 6.9%).
What is the implication of the main finding?
  • Enhanced context modeling and quality-scale collaboration improve robustness and detection reliability under the dense layouts, occlusion, and background clutter common in UAV imagery.
  • Stronger small-object localization supports more dependable UAV applications, such as traffic monitoring, crowd analysis, and urban surveillance.

Abstract

Unmanned aerial vehicle (UAV) imagery poses persistent challenges for object detection, including dense small objects, large-scale variation, cluttered backgrounds, and stringent localization requirements, where conventional two-stage detectors often fall short in fine-grained small-object representation, efficient global context modeling, and classification–localization consistency. We specifically target low-altitude UAV-captured imagery with highly flexible viewpoints (near-nadir to oblique) and frequent platform-induced motion blur, which makes dense small-object localization substantially more challenging than in conventional remote-sensing imagery. To address these issues, we propose CASA-RCNN, a context-adaptive and scale-aware two-stage detection framework tailored to UAV scenarios. CASA-RCNN introduces a shallow-level enhancement module, ConvSwinMerge, which strengthens position-sensitive cues and suppresses background interference by combining coordinate attention with channel excitation, thereby improving discriminative high-resolution features for small objects. For deeper semantic features, we incorporate an adaptive sequence modeling module based on MambaBlock to capture long-range dependencies and support context reasoning in crowded or occluded scenes with practical computational overhead on a desktop GPU. In addition, we adopt Varifocal Loss for quality-aware classification to better align confidence scores with localization quality, and we design a ScaleAdaptiveLoss to dynamically reweight regression objectives across object scales, compensating for the reduced gradient contribution of small targets during training. Experiments on the VisDrone2021 validation benchmark show that CASA-RCNN achieves 22.9% mAP, improving over Faster R-CNN by 9.0 points; it also reaches 36.6% mAP50 and 25.7% mAP75.
Notably, performance on small objects improves to 12.5% mAPs (from 6.9%), and ablation studies confirm the effectiveness and complementarity of the proposed components.

1. Introduction

In recent years, UAV platforms have been widely adopted in a variety of applications—including urban traffic monitoring, public-security inspection, emergency response, and post-disaster assessment—due to their high mobility, flexible viewing angles, and low deployment cost [1,2,3]. Meanwhile, the rapid growth of the low-altitude economy (LAE) is accelerating UAV-enabled services (e.g., logistics, transportation, and public safety), where deploying large AI models is increasingly viewed as a key enabler for perception, reasoning, and decision-making. Recent studies have discussed system-level opportunities and challenges for LAE-oriented large-model deployment [4], as well as LLM-guided multimodal perception that improves cross-modal alignment for UAV object detection [5]. These trends further highlight the need for accurate yet deployable UAV detectors under tight onboard computation and latency constraints. Compared with conventional ground-level viewpoints, aerial imagery offers broader coverage and a more pronounced top-down perspective, but it simultaneously introduces more severe scale variation and more complex imaging conditions [6]. UAV aerial images exhibit several distinctive characteristics relative to typical ground-scene images [7]. The wide field of view induced by flight altitude causes most objects in the scene to appear at small scales, occupying only a tiny fraction of pixels in the full image [8]. Moreover, object distributions in aerial scenes are often highly dense and span a wide range of scales, from tiny pedestrians and vehicles to large buildings and land parcels, resulting in substantial size disparities [9]. In addition, complex backgrounds—such as cloud occlusion, vegetation coverage, and terrain undulations—further interfere with reliable feature extraction and accurate recognition. 
These factors collectively constitute the core difficulties of small-object detection in UAV aerial scenarios, leading to notable performance degradation in both detection accuracy and model robustness for many conventional object detectors [10]. To address these challenges, the community has established multiple public benchmarks for UAV visual understanding. For example, the VisDrone benchmark systematically covers tasks such as object detection and multi-object tracking, and provides large-scale aerial image and video data collected across different cities and diverse scenes. The UAVDT dataset focuses on evaluating vehicle detection and tracking in complex urban traffic environments, offering an important testbed for validating algorithmic robustness and facilitating deployment in practical engineering applications [11]. In this paper, "UAV imagery" refers to RGB images captured by low-altitude unmanned aerial platforms (typically tens to a few hundred meters) with highly flexible camera viewpoints (near-nadir to oblique). It is important to distinguish low-altitude UAV detection from satellite or high-altitude aerial remote-sensing detection: whereas satellite imagery often exhibits relatively stable viewing geometry and few platform-induced motion artifacts, UAV imagery more frequently suffers from viewpoint jitter and rapid perspective changes, motion blur caused by platform dynamics, drastic scale and appearance variation within the same object category, and a much higher proportion of densely packed small instances within a single frame. These factors amplify typical failure modes such as missed detections in crowded regions, score mis-ranking under occlusion, and localization drift for small objects under cluttered backgrounds.
CASA-RCNN is therefore explicitly designed for low-altitude UAV scenarios, where robust small-object recall and stable localization are critical under dense layouts and complex imaging degradations; accordingly, we focus on UAV-captured aerial scenes and design our framework for them.
Despite substantial progress in generic object detection on natural images, aerial object detection remains constrained by a set of inherent and structural challenges [12,13,14,15]. First, targets in aerial images are typically small and follow a long-tailed scale distribution; for instance, pedestrians and vehicles often occupy only a few pixels, which severely limits the availability of fine details and texture cues [16]. Second, scale differences within a single aerial image can be extremely large, while dense layouts, mutual occlusion, and overlap are common, making traditional proposal ranking and non-maximum suppression more prone to failure [17]. Third, complex background textures (e.g., road markings, building contours, and tree shadows), together with imaging degradations (e.g., motion blur and illumination changes), exacerbate the “weak target–strong background” discrimination dilemma and further increase detection difficulty [18]. Prior studies on small-object detection in remote sensing and aerial imagery have repeatedly highlighted that limited spatial evidence, severe occlusion and background interference, and imbalanced class distributions can significantly amplify the complexity of the detection task [19]. Meanwhile, widely used aerial benchmarks such as DOTA also underscore a pervasive property of high-resolution remote-sensing scenarios: targets exhibit high diversity in both scale and shape [20,21]. Two-stage detectors represented by Faster R-CNN establish the foundation of modern two-stage detection by coupling a region proposal network with a detection head that shares features, enabling end-to-end proposal generation as well as classification and bounding-box regression [22]. Owing to their strong localization accuracy, two-stage paradigms remain a key technical route for precise detection in aerial imagery [23]. 
To alleviate the insufficient representation of multi-scale objects—especially small objects—feature pyramid networks leverage pyramid features from different backbone stages and construct top-down pathways with lateral connections, improving cross-scale detection capability with a controllable computational cost [24]. However, in dense small-object UAV scenarios, the representational tension in conventional feature extraction becomes more pronounced: although shallow features preserve fine spatial details, they are easily overwhelmed by complex background noise [25]; deep features, while more discriminative semantically, are less sensitive to the precise boundaries and positional cues of small objects, and often lack an efficient mechanism for global contextual reasoning in crowded scenes [26].
To enhance feature representation, attention mechanisms and Transformer architectures have been widely integrated into visual backbones and detection frameworks [27]. Squeeze-and-Excitation (SE) modules improve representational power via channel-wise recalibration, while Convolutional Block Attention Modules (CBAM) further combine channel and spatial attention to enable finer-grained feature selection. Nevertheless, such modules are often limited in their ability to explicitly inject positional information [28]. Coordinate Attention is designed to encode positional cues into the channel-attention process by performing directional encoding along the horizontal and vertical axes, enhancing spatial sensitivity while remaining lightweight, and thus providing a more direct modeling pathway for the spatial discrimination of small objects [29]. In parallel, Swin Transformer introduces hierarchical self-attention with shifted windows, balancing computational efficiency and strong modeling capacity on high-resolution visual tasks, and offering a general-purpose backbone paradigm for effective interaction between local cues and global dependencies [30]. Despite these advances, directly transferring these ideas to two-stage detection for aerial imagery still encounters two key issues: (i) how to suppress background interference while accurately preserving target positional cues on shallow high-resolution feature maps; and (ii) how to introduce global dependencies more efficiently on deep semantic features to improve discriminative robustness under complex conditions such as crowding, severe occlusion, and background similarity.
Motivated by the above analysis, we propose a two-stage detection framework, termed CASA-RCNN, for small-object detection in UAV aerial imagery. The framework is systematically designed around three core directions:
  • ConvSwinMerge (Section 3.2): We propose a new shallow context enhancement block with a serial residual attention–convolution–excitation formulation to improve the position-sensitive separability of small objects under cluttered backgrounds while preserving fine boundaries.
  • MambaBlock/MambaT (Section 3.3): We introduce a new 2D selective spatial aggregation operator for efficient global context modeling in dense/occluded UAV scenes, implemented via content-driven gating/aggregation without constructing an $N \times N$ attention map and fused with a fidelity branch.
  • Quality–Scale collaborative optimization (Section 3.4): We design a new joint objective that aligns confidence with localization quality and rebalances regression supervision across scales to mitigate the under-contribution of small-object gradients.
We conduct comprehensive evaluations of the proposed method on the VisDrone aerial detection benchmarks, and perform systematic ablation studies to analyze the contribution of each module across different object scales and varying levels of scene-crowding.
An overview of the proposed CASA-RCNN and the quality–scale collaborative loss is illustrated in Figure 1. Importantly, our architecture choices are guided by UAV-specific challenges (motion blur, rapid perspective changes, and high object density), rather than being directly optimized for satellite-style remote-sensing imagery.
By explicitly targeting the dominant failure modes in low-altitude UAV imagery—missed detections and localization drift of densely packed small objects under occlusion and clutter—CASA-RCNN is expected to improve the reliability of UAV perception in real operational scenes. This makes the proposed framework particularly relevant to UAV applications such as urban traffic monitoring, public-safety inspection, emergency response, and post-disaster assessment, where stable small-object recall and trustworthy ranking are critical for downstream analytics and decision-making.

2. Related Work

2.1. UAV Aerial Object Detection Datasets and Detection Architectures

Progress in UAV-based aerial object detection has been substantially driven by the continued release of public datasets and standardized evaluation benchmarks [31]. For completeness, we briefly mention some generic aerial/remote-sensing detection benchmarks when discussing multi-scale dense-scene recognition. However, the main scope of this study is low-altitude UAV-captured imagery and its characteristic challenges (dense small objects, viewpoint changes, and frequent occlusions), and our evaluation is conducted following UAV-oriented benchmarks and protocols. Representative efforts such as the VisDrone series have systematically established detection and tracking benchmarks tailored to UAV platforms [32]. These benchmarks cover aerial images and videos collected across multiple cities and diverse scenes, and they explicitly emphasize recognition challenges caused by occlusion, drastic scale variation, and dynamic motion, thereby providing a unified testbed for assessing the generalization ability of algorithms in realistic and complex aerial environments [33]. Similarly, the UAVDT dataset curates and annotates large-scale frame-level samples from long, continuous video sequences, and further characterizes multiple difficulty factors at the attribute level, including weather conditions, viewing angles, and degrees of occlusion [34]. It also highlights that the performance of existing detection and tracking methods often degrades significantly under stringent conditions such as high target density, extremely small object sizes, and camera ego-motion [35]. In the broader area of aerial and remote-sensing image analysis, the DOTA dataset is distinguished by high-resolution large-format imagery, abundant multi-scale objects, and a large number of instances with arbitrary orientations [36]. By adopting more flexible oriented annotations, DOTA has effectively promoted research on rotated object detection and dense-scene understanding.
In practice, these UAV-specific factors not only increase the difficulty of detecting small objects, but also make proposal ranking and localization more fragile in crowded scenes. This motivates our design to (i) strengthen shallow position-sensitive cues for small objects and suppress background interference (ConvSwinMerge), and (ii) model long-range contextual dependencies efficiently under high density and occlusion (MambaBlock), rather than directly adopting satellite-oriented detection assumptions or purely increasing model capacity.
To address challenges such as limited pixel evidence for small objects and long-tailed scale distributions, researchers have proposed a variety of targeted solutions from both training strategies and inference pipelines. For example, SNIP introduces a scale-normalized training strategy based on image pyramids, where backpropagation is selectively performed on different image-scale branches according to the actual size of each object instance, thereby mitigating the interference of extreme scale variation on feature learning. The subsequent SNIPER work further improves the multi-scale training process from the perspectives of sample sampling and training efficiency. On the other hand, when dealing with high-resolution large-format aerial imagery, small objects may lose critical information due to network downsampling. Consequently, slicing or tiling an image into patches for inference has become common practice in both engineering applications and academic research. The SAHI framework provides a generic pipeline for sliced inference and fine-tuning, and demonstrates consistent performance gains for a range of mainstream detectors on multiple aerial datasets such as VisDrone and xView.
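To make the tiling idea concrete, the sketch below generates overlapping tile windows in the spirit of sliced inference; the tile size, overlap ratio, and helper name are illustrative choices, not SAHI's actual API.

```python
# Sketch of SAHI-style tile generation for sliced inference.
# Tile size and overlap ratio are illustrative, not framework defaults.

def make_tiles(img_w, img_h, tile=640, overlap=0.2):
    """Return (x0, y0, x1, y1) tile windows covering the image."""
    stride = int(tile * (1 - overlap))
    xs = list(range(0, max(img_w - tile, 0) + 1, stride))
    ys = list(range(0, max(img_h - tile, 0) + 1, stride))
    # Make sure the right/bottom borders are covered.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y, min(x + tile, img_w), min(y + tile, img_h))
            for y in ys for x in xs]

tiles = make_tiles(1920, 1080)
```

Each tile is then run through the detector independently, and per-tile detections are mapped back to full-image coordinates before a final NMS pass.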
From a practical UAV perspective, these improvements translate into more reliable perception in operationally challenging conditions. Higher detection accuracy—especially under dense layouts and scale variability—reduces missed detections of small but critical targets (e.g., pedestrians, bicycles, and vehicles), which directly benefits applications such as urban traffic monitoring, crowd safety supervision, and emergency situational awareness. Moreover, improved localization quality provides tighter box alignment that is favorable for downstream tracking and counting pipelines, while quality-aware scoring makes it easier to choose conservative operating points to suppress false alarms when reliability is prioritized.

2.2. Related Work on Feature Enhancement

In feature enhancement, attention mechanisms and their integration with Transformer-based visual backbones have become a major avenue for improving model performance [37]. By dynamically reweighting the importance of different regions or channels in a feature map, attention enhances feature selection while suppressing noise. For instance, SENet strengthens the network response to informative channels via explicit channel recalibration, and CBAM further cascades channel attention and spatial attention to achieve finer-grained feature reweighting [38]. However, for small-object detection, positional cues and local structural characteristics are critical. Channel weights produced purely from global pooling are often insufficient to preserve subtle spatial location information. To address this limitation, Coordinate Attention performs positional encoding separately along the horizontal and vertical directions and embeds spatial coordinate information into channel-attention computation. This design maintains model efficiency while producing position-sensitive attention maps, offering a more direct and effective enhancement for spatial discrimination in small-object and dense-scene settings [39].
Meanwhile, the widespread adoption of Transformer architectures in computer vision has substantially improved global context modeling [40]. DETR formulates object detection as a set prediction problem and leverages an encoder–decoder Transformer to perform end-to-end global relational reasoning, serving as a representative work of this paradigm. Swin Transformer introduces shifted windows to construct hierarchical self-attention with controlled computational complexity, enabling it to function as a general-purpose visual backbone and achieve competitive performance on dense prediction tasks such as object detection in high-resolution imagery [41].
Beyond attention-based approaches, selective state-space models provide an alternative technical route for long-range dependency modeling. Mamba proposes a selective state-space model whose state transition parameters depend on the input, enabling content-adaptive information selection and propagation while preserving linear-time computational efficiency and strong long-sequence modeling capability [42]. This line of research brings new potential to contextual reasoning in visual detection. In complex scenes such as aerial imagery, long-range background consistency across distant regions and spatial distribution patterns among targets (e.g., continuous traffic flow or dense crowds) can facilitate discrimination and accurate localization. An efficient global modeling mechanism is therefore promising for enhancing a model’s understanding of large-scale scene semantics under a controllable computational budget, leading to improved overall detection performance [43].
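As a toy illustration of the selective state-space idea, i.e., input-dependent transition and gating combined with a linear-time scan, consider the following NumPy sketch; the sigmoid gates and random projections are our own simplifications, not Mamba's actual discretized SSM.

```python
import numpy as np

# Toy sketch of the selective state-space idea behind Mamba: the
# transition and input gates depend on the input itself, and the scan
# runs in linear time over the sequence length. Real Mamba uses learned
# projections and a discretized SSM; this shows only the recurrence.

def selective_scan(x, decay_proj, in_proj):
    """x: (T, D) sequence. Returns states h_t = a_t * h_{t-1} + b_t * x_t."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        a = 1 / (1 + np.exp(-(xt @ decay_proj)))   # input-dependent decay
        b = 1 / (1 + np.exp(-(xt @ in_proj)))      # input-dependent gate
        h = a * h + b * xt
        out[t] = h
    return out

rng = np.random.default_rng(1)
y = selective_scan(rng.standard_normal((16, 4)),
                   rng.standard_normal((4, 4)),
                   rng.standard_normal((4, 4)))
```

Because the gates depend on the current input, the scan can selectively retain or forget past context, which is the property that makes this family attractive for long-range reasoning at linear cost.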
Despite these advances, applying attention/Transformer-style designs to two-stage UAV detection remains nontrivial: shallow enhancement must be position-sensitive and clutter-resistant without over-smoothing local structures, while deep global modeling should scale efficiently to high-resolution feature maps. These observations motivate our ConvSwinMerge, which couples coordinate attention with local convolutional refinement, and our MambaBlock, which introduces state-space-inspired global aggregation with controllable computational complexity.
However, quality-aware classification objectives alone do not resolve the scale-dependent regression imbalance in UAV imagery: small objects typically yield weaker localization gradients and suffer higher uncertainty, leading to suboptimal box refinement even when ranking is improved. Therefore, we further introduce a scale-adaptive regression reweighting term and combine it with quality-aware supervision to form a quality–scale collaborative loss tailored for dense small-object UAV detection.
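To make the intended rebalancing concrete, the following sketch upweights the regression loss of small boxes; the inverse log-area form and the reference area are illustrative assumptions of ours, since the actual ScaleAdaptiveLoss is defined later in Section 3.4.

```python
import numpy as np

# Illustrative sketch of scale-adaptive regression reweighting.
# The inverse log-area form and 32x32 reference are assumptions made
# here only to show the "upweight small boxes" idea, not the paper's
# actual ScaleAdaptiveLoss formulation.

def scale_weight(w, h, ref_area=32 * 32):
    """Weight grows as the box area shrinks relative to a reference area."""
    area = np.maximum(w * h, 1.0)
    return np.clip(np.log(ref_area) / np.log(area + 1.0), 0.5, 2.0)

# A 10x10 pedestrian box receives a larger regression weight than a
# 200x100 vehicle box, compensating for its weaker gradient contribution.
w_small = scale_weight(np.array([10.0]), np.array([10.0]))
w_large = scale_weight(np.array([200.0]), np.array([100.0]))
```

The weight would multiply the per-box regression loss, so that dense small instances contribute gradients comparable to those of large, easy-to-fit boxes.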

2.3. Related Work on Loss Functions for Detection

In terms of training objectives and proposal ranking, the ranking quality of the large number of candidate boxes in dense object detection directly affects the behavior of non-maximum suppression (NMS) and, consequently, the final detection accuracy. Focal Loss mitigates the extreme foreground–background imbalance by down-weighting easy samples and forcing the model to focus on hard-to-classify examples, and has therefore become a widely used classification loss in dense detection [44]. More recently, researchers have pointed out that the misalignment between classification confidence and localization quality can severely undermine ranking reliability. Generalized Focal Loss improves the prediction representation from a unified perspective of quality estimation, classification, and localization, thereby enhancing the consistency between training and inference and better modeling localization distributions. VarifocalNet further proposes IoU-aware classification scores and designs Varifocal Loss to jointly model object presence confidence and localization quality, improving the reliability of candidate ranking [45].
For dense small-object detection in UAV aerial imagery, such quality-aware designs are particularly important. Small objects are more prone to cases where classification scores are high while localization errors remain large, which makes downstream box selection more sensitive and unstable. Consequently, incorporating localization-quality signals into classification supervision and strengthening the consistency between classification and localization constitute an effective direction for improving detection stability in dense scenes [46].
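For reference, Varifocal Loss can be written element-wise as follows; the sketch uses the commonly cited defaults (alpha = 0.75, gamma = 2.0), which may differ from the settings used in this paper.

```python
import numpy as np

# Minimal NumPy sketch of Varifocal Loss (VarifocalNet), written out
# element-wise for clarity. alpha/gamma follow the commonly used
# defaults, which may not match this paper's configuration.

def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """p: predicted score in (0,1); q: IoU-aware target (0 for negatives)."""
    p = np.clip(p, eps, 1.0 - eps)
    pos = q > 0
    # Positives: BCE weighted by the target quality q itself, so
    # well-localized boxes dominate the classification supervision.
    loss_pos = -q * (q * np.log(p) + (1 - q) * np.log(1 - p))
    # Negatives: focal-style down-weighting of easy background samples.
    loss_neg = -alpha * p ** gamma * np.log(1 - p)
    return np.where(pos, loss_pos, loss_neg)
```

A positive whose score matches its IoU target (e.g., p = q = 0.9) incurs a much smaller loss than one with a confident score but poor localization, which is exactly the ranking behavior desired before NMS.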

3. Method

3.1. Overall Framework and Problem Definition

Given a UAV-captured aerial image $I \in \mathbb{R}^{3 \times H_0 \times W_0}$, the corresponding set of ground-truth annotations is denoted as $G = \{(b_i, c_i)\}_{i=1}^{N}$, where $b_i = (x_i, y_i, w_i, h_i)$ represents an axis-aligned bounding box and $c_i$ is the category label. To address the characteristic challenges in UAV-based aerial imagery, including dense small objects, large scale variation, and complex background textures, this paper proposes a two-stage detection framework, CASA-RCNN (Context-Adaptive Scale-Aware RCNN), which introduces targeted enhancements on top of the standard two-stage paradigm while preserving its inference pipeline. Specifically, the image $I$ is first processed by a shared feature extraction network (Backbone) to obtain a hierarchy of features $\{C_l\}$, which are then fused by a feature pyramid network (FPN) to produce multi-scale features $\{P_l\}$. Subsequently, a region proposal network (RPN) generates a set of candidate regions $R$ on $\{P_l\}$, and high-quality proposals are selected according to objectness scores and coarse regression outputs. Finally, RoI features are extracted via RoIAlign on the corresponding scale-specific feature maps and fed into the detection head to yield classification and bounding-box regression results. Unlike conventional two-stage frameworks, CASA-RCNN incorporates a hierarchical enhancement strategy at the feature modeling level: ConvSwinMerge is inserted into shallow, high-resolution features to strengthen location-sensitive fine-grained representations for small objects and suppress background interference, whereas MambaBlock is introduced into mid-to-deep semantic features for efficient global context dependency modeling, thereby improving robust discrimination under crowded and occluded conditions.

3.2. Shallow Context Enhancement Module: ConvSwinMerge

In UAV aerial imagery, objects of interest often occupy only a few pixels and exhibit high-density distributions. Meanwhile, the contrast between object appearance and background textures is frequently limited, making shallow high-resolution features indispensable for small-object detection. On the one hand, shallow features preserve richer edges, corners, and texture details, which benefits precise localization and scale-sensitive geometric representation. On the other hand, shallow features typically contain stronger high-frequency responses and thus may be strongly activated by background structures such as road markings, building outlines, canopy textures, and shadow boundaries. This, in turn, degrades foreground–background separability, reduces proposal quality, and further amplifies the risks of false positives and localization drift during subsequent RoI refinement. Therefore, relying solely on conventional convolutional backbones or simple top-down fusion is often insufficient to simultaneously meet the dual requirements of fine-grained representation for small objects and background suppression.
To this end, we introduce a context enhancement module, ConvSwinMerge, into the shallow branch. As shown in Figure 2, its design goal is to compensate for the key deficiencies in shallow features under a controllable computational budget: (i) enhancing the spatial discriminability of target regions via explicit position-sensitive modeling, thereby reducing the spurious responses induced by background textures; (ii) preserving boundaries and subtle shape variations through local structural refinement, avoiding the excessive smoothing of details caused by attention-based reweighting; and (iii) suppressing redundant background activations while emphasizing small-object-related discriminative responses through channel-selective enhancement. Overall, ConvSwinMerge adopts a serial residual strategy to integrate these capabilities in an “attention–convolution–excitation” manner, enabling shallow representations to be both localization-sensitive and interference-robust.
Although the name “ConvSwinMerge” is Swin-inspired, the module does not employ shifted-window multi-head self-attention. Here “Swin-like” refers to the design intent of enhancing shallow high-resolution features via efficient local-context aggregation and feature merging, while keeping the two-stage detector lightweight. We therefore adopt Coordinate Attention as a position-sensitive yet efficient alternative, together with a 3 × 3 convolution for local refinement and SaE for channel excitation (Equation (1)). This clarification aligns the naming with the actual architecture.
Specifically, given a shallow input feature map $X \in \mathbb{R}^{C \times H \times W}$, the forward propagation of ConvSwinMerge is defined as
$$X_1 = X + \mathrm{CoordAtt}(X), \qquad X_2 = X_1 + \mathrm{Conv}_{3\times 3}(X_1), \qquad Y = \mathrm{SaE}(X_2), \tag{1}$$
where $\mathrm{CoordAtt}(\cdot)$ denotes the coordinate attention operator, which injects positional information into channel re-calibration by encoding features along the horizontal and vertical directions. As a result, the attention weights exhibit both channel selectivity and spatial localizability, better matching the strong dependence of small aerial targets on precise location cues. Building upon this, $\mathrm{Conv}_{3\times 3}(\cdot)$ performs local neighborhood aggregation and detail compensation, strengthening the local consistency of edges and textures while avoiding the weakening of local structures that may arise when relying solely on attention-based reweighting. Finally, $\mathrm{SaE}(\cdot)$ further re-calibrates $X_2$ via grouped/multi-branch channel excitation, allowing different channel subspaces to learn finer-grained response combinations, and thereby enabling more effective discrimination between small objects and visually similar textured regions under complex backgrounds. The Coordinate Attention structure is shown in Figure 3.
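The data flow of Equation (1) can be sketched as follows; learned projections inside CoordAtt and SaE are replaced by parameter-free surrogates (directional pooling with a sigmoid, a fixed box filter, global-pooling excitation), so the snippet illustrates only the serial residual structure, not the trained module.

```python
import numpy as np

# Structural sketch of the ConvSwinMerge forward pass (Equation (1)).
# Learned projections are replaced by identity-like surrogates and the
# 3x3 conv by a fixed box filter, so this only illustrates the serial
# residual attention -> convolution -> excitation data flow.

def coord_att(x):                      # x: (C, H, W)
    ah = 1 / (1 + np.exp(-x.mean(axis=2, keepdims=True)))  # pool over W
    aw = 1 / (1 + np.exp(-x.mean(axis=1, keepdims=True)))  # pool over H
    return x * ah * aw                 # position-sensitive reweighting

def conv3x3(x):                        # fixed 3x3 box filter, zero padding
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = sum(p[:, i:i + x.shape[1], j:j + x.shape[2]]
              for i in range(3) for j in range(3))
    return out / 9.0

def sae(x):                            # channel excitation from global pooling
    s = 1 / (1 + np.exp(-x.mean(axis=(1, 2), keepdims=True)))
    return x * s

def conv_swin_merge(x):
    x1 = x + coord_att(x)              # X1 = X + CoordAtt(X)
    x2 = x1 + conv3x3(x1)              # X2 = X1 + Conv3x3(X1)
    return sae(x2)                     # Y  = SaE(X2)

y = conv_swin_merge(np.zeros((8, 16, 16)))
```

The residual form keeps the original shallow response intact, so the attention and excitation stages can only modulate, never erase, fine boundary detail.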
The enhanced shallow representation produced by ConvSwinMerge directly affects two key stages of the two-stage framework. First, at the RPN stage, more discriminative shallow features improve recall for small-object proposals while reducing spurious proposals triggered by background textures. Second, during RoI refinement, clearer boundaries and more stable foreground responses facilitate the regression branch in learning more accurate geometric offsets, and mitigate the localization instability caused by mutual interference among adjacent objects in crowded scenes.
In summary, ConvSwinMerge strengthens shallow context for aerial small-object detection and improves feature-level separability and robustness without altering the two-stage inference paradigm, thereby providing a more reliable low-level representation foundation for subsequent deep context modeling and quality–scale collaborative optimization. We quantify the contribution of each component in the subsequent experiments, confirming the necessity of the attention–convolution–excitation pipeline.

3.3. Deep Context Modeling Module: MambaBlock

In UAV aerial imagery, targets often exhibit characteristics such as high-density clustering, frequent occlusion and overlap, and similar inter-class appearances. Relying solely on convolutional aggregation with local receptive fields can easily lead to semantic ambiguity in complex scenes. On the one hand, boundary cues between neighboring targets are further weakened during deep-layer downsampling, making it difficult for the detection head to distinguish closely adjacent instances. On the other hand, when background structures (e.g., road lane markings and facade textures) share similar local patterns with the targets, the absence of global constraints renders discrimination more unstable. Although deep convolutional features provide stronger semantic abstraction, their information exchange remains predominantly local, which limits the explicit modeling of long-range dependencies and cross-region contextual relations. To address this, we propose a deep context modeling module, MambaBlock, which enhances global dependency modeling under a controllable computational budget, thereby improving robust detection performance in dense and occluded conditions. The MambaBlock structure is shown in Figure 4.
MambaBlock instantiates the selective aggregation principle on 2D detection feature maps. Table 1 summarizes the conceptual correspondence between the selective SSM view and our implementation. Unlike a direct SSM formulation, MambaBlock operates on 2D spatial features and employs softmax-normalized, content-dependent weighting without explicit SSM discretization, which simplifies integration into standard detection backbones.
As illustrated in Figure 5, we serialize the $H \times W$ feature grid into a 1D token sequence using row-major order with the index mapping $n = i \cdot W + j$ (with $N = H \cdot W$) before applying the MambaT operator.
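This row-major serialization is a plain tensor reshape; the following toy example (arbitrary shapes, illustrative only) shows that token $n = i \cdot W + j$ recovers the feature at location $(i, j)$:

```python
import torch

# Toy shapes for illustration: B=1, C=2, H=3, W=4.
B, C, H, W = 1, 2, 3, 4
X = torch.arange(B * C * H * W, dtype=torch.float32).reshape(B, C, H, W)

# Row-major serialization: token index n = i * W + j for location (i, j).
tokens = X.reshape(B, C, H * W)  # (B, C, N) with N = H * W

i, j = 2, 1
n = i * W + j
assert torch.equal(tokens[:, :, n], X[:, :, i, j])
```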
Although MambaBlock is motivated by the selective mechanism in Mamba-style state-space models, our MambaT operator does not implement a standard 1D SSM recurrence and does not maintain any recursive hidden states across tokens. Instead, MambaT performs a content-dependent selective gating on 2D spatial features by predicting spatial weights A (Equations (5)–(7)) and applying them to the value projection V via element-wise weighting and reshaping (Equation (8)), yielding a spatially adaptive aggregation with linear-time complexity in the number of spatial tokens N = H · W . Therefore, MambaT should be interpreted as an SSM-inspired selective spatial aggregation module rather than a discretized SSM dynamics model.
Given deep input features $X \in \mathbb{R}^{C \times H \times W}$, MambaBlock adopts a dual-branch fusion design of an enhancement branch and a fidelity branch, aiming to balance contextual modeling gains with the preservation of the original semantic information. Specifically, the enhancement branch introduces an adaptive sequence modeling operator $M(\cdot)$ inspired by selective state-space modeling, enabling cross-region information propagation and global dependency aggregation. The fidelity branch $I(\cdot)$ maintains an identity mapping (or a lightweight linear transformation), providing a stable semantic basis and mitigating potential over-smoothing caused by overly strong global aggregation. The outputs of the two branches are concatenated along the channel dimension and remapped via a $1 \times 1$ fusion convolution to obtain the final output:
\[ X_m = M(X), \qquad X_i = I(X), \qquad Y = \phi\big(\mathrm{Concat}(X_m, X_i)\big), \]
where $\phi(\cdot)$ denotes the $1 \times 1$ fusion convolution. This structure allows context modeling to act as an incremental supplement to the original features, thereby improving discriminability more robustly in complex aerial scenes.
To facilitate efficient global dependency modeling on 2D feature maps, we treat the deep features $X$ as a sequence unfolded along spatial locations, and introduce a lightweight spatially adaptive aggregation unit (denoted as MambaT) to perform content-aware selection and propagation of information. We flatten the spatial grid in row-major order. Given a feature map $V \in \mathbb{R}^{B \times C \times H \times W}$, we reshape it into $V_f \in \mathbb{R}^{B \times C \times N}$ with $N = H \cdot W$, where the index $n = i \cdot W + j$ corresponds to spatial location $(i, j)$. The attention map $A \in \mathbb{R}^{B \times C \times H \times W}$ is reshaped to $A_f \in \mathbb{R}^{B \times C \times N}$ and softmax-normalized along the spatial dimension $N$ for each channel. The context term is computed by the element-wise weighting $A_f \odot V_f$ followed by reshaping back to $\mathbb{R}^{B \times C \times H \times W}$ (Equation (8)). For $K_1$ in Equation (3), GroupConv is implemented as a depthwise $3 \times 3$ convolution (groups $= C$) with stride 1 and padding 1, consistent with the linear-time complexity analysis. The core idea is to jointly construct a stable local semantic pedestal and an input-driven adaptive aggregation term: the former ensures stable semantic representation, while the latter selectively aggregates effective context over a larger range according to the current scene. Specifically, a coarse-grained local Key is first extracted via grouped convolution:
\[ K_1 = \mathrm{GroupConv}(X), \tag{3} \]
and a 1 × 1 convolution is used to produce the value to be aggregated:
\[ V = \mathrm{Conv}_{1 \times 1}(X). \tag{4} \]
Next, $K_1$ is concatenated with the original feature $X$ along the channel dimension, and a lightweight mapping $\psi(\cdot)$ is applied to generate spatial weights followed by normalization:
\[ A = \mathrm{Softmax}\big(\psi([K_1; X])\big), \tag{5} \]
where $A$ encodes selective attention over different spatial locations.
Given an input feature map $X \in \mathbb{R}^{B \times 2C \times H \times W}$, we implement $\psi(\cdot)$ as a bottleneck module. Specifically, $\psi(\cdot)$ first applies a $1 \times 1$ convolution to reduce the channels from $2C$ to $C/2$ (reduction factor $= 4$), followed by BN and ReLU, and then another $1 \times 1$ convolution to expand the channels to $k^2 C$ (with $k = 3$):
\[ \psi(X) = \mathrm{Conv}_{1 \times 1}^{\,k^2 C}\Big( \sigma\big( \mathrm{BN}\big( \mathrm{Conv}_{1 \times 1}^{\,C/2}(X) \big) \big) \Big), \tag{6} \]
where $\sigma(\cdot)$ denotes ReLU. The output is then reshaped into $(B, C, k^2, H, W)$ and averaged over the $k^2$ dimension to obtain a single attention map per channel:
\[ A = \mathrm{Mean}_{k^2}\big( \mathrm{reshape}(\psi(X)) \big) = \frac{1}{k^2} \sum_{i=1}^{k^2} \psi(X)_{:,:,i,:,:}, \qquad A \in \mathbb{R}^{B \times C \times H \times W}. \tag{7} \]
This corresponds to att.reshape(B, C, k2, H, W) followed by att.mean(dim=2) in our implementation.
Based on these weights, V is adaptively aggregated to obtain a fine-grained context term:
\[ K_2 = \mathrm{Reshape}\big( A \odot \mathrm{Flatten}(V) \big), \tag{8} \]
and the enhanced output is finally produced in a residual form:
\[ M(X) = K_1 + K_2. \tag{9} \]
In this construction, $K_1$ ensures stable local semantic extraction, whereas $K_2$, controlled by the content-driven weights $A$, enables cross-region information integration, allowing the model to leverage broader context for auxiliary discrimination under crowded, occluded, and texture-similar background conditions.
Let $N = H \times W$ be the number of spatial tokens and $C$ the channel dimension. MambaT avoids constructing any pairwise token interaction matrix (e.g., an $N \times N$ attention map). The grouped convolution in Equation (3) performs local aggregation with computational cost $O(NCk^2)$, where $k$ is the kernel size. The projection module $\psi(\cdot)$ is implemented as a lightweight bottleneck with a fixed reduction ratio $r$, and together with the $1 \times 1$ projection for $V$, contributes $O(NC^2/r)$. The remaining steps in Equations (6)–(9), including softmax normalization, the element-wise weighting $A \odot \mathrm{Flatten}(V)$, reshape, and mean pooling, are token-wise operations and cost $O(NC)$. Since $k$ and $r$ are fixed constants, the overall computation and memory scale linearly with the spatial resolution $N$ (often summarized as $O(HWC)$ under fixed module hyperparameters), in contrast to standard self-attention, which requires $O(N^2 C)$ time and $O(N^2)$ memory.
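Putting Equations (3)–(9) together, MambaT and the dual-branch MambaBlock fusion can be sketched in PyTorch as follows (a minimal sketch of the described computation; module names, the identity choice for the fidelity branch, and initialization details are illustrative rather than the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaT(nn.Module):
    """SSM-inspired selective spatial aggregation (sketch of Eqs. (3)-(9))."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k2 = k * k
        # Eq. (3): coarse local Key via a depthwise 3x3 convolution (groups = C).
        self.group_conv = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Eq. (4): 1x1 projection producing the value V.
        self.value = nn.Conv2d(channels, channels, 1)
        # Eqs. (6)-(7): bottleneck psi on [K1; X] (2C -> C/2 -> k^2 * C channels).
        self.psi = nn.Sequential(
            nn.Conv2d(2 * channels, channels // 2, 1),
            nn.BatchNorm2d(channels // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 2, self.k2 * channels, 1),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        k1 = self.group_conv(x)                                  # Eq. (3)
        v = self.value(x)                                        # Eq. (4)
        att = self.psi(torch.cat([k1, x], dim=1))                # psi([K1; X])
        att = att.reshape(b, c, self.k2, h, w).mean(dim=2)       # Eq. (7)
        att = F.softmax(att.reshape(b, c, h * w), dim=-1)        # Eq. (5), over N = H*W
        k2 = (att * v.reshape(b, c, h * w)).reshape(b, c, h, w)  # Eq. (8)
        return k1 + k2                                           # Eq. (9)

class MambaBlock(nn.Module):
    """Dual-branch fusion: enhancement branch M (MambaT) + fidelity branch I."""
    def __init__(self, channels):
        super().__init__()
        self.mamba_t = MambaT(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # 1x1 fusion conv phi

    def forward(self, x):
        # Fidelity branch kept as identity here; concat then remap (Eq. (2)-style fusion).
        return self.fuse(torch.cat([self.mamba_t(x), x], dim=1))
```

Note that no $N \times N$ interaction matrix is ever materialized: the only per-token tensors have shape $(B, C, N)$, which is the source of the linear complexity discussed above.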

3.4. Quality–Scale Collaborative Optimization

To improve proposal ranking reliability and small-object localization accuracy in dense UAV aerial scenes, we extend the training objective of the two-stage detection head from the conventional “category classification + bounding-box regression” formulation to a collaborative strategy that aligns classification scores with localization quality and adaptively reweights regression learning across different scales. Specifically, in the classification branch of the RoI head, we introduce the quality-aware Varifocal Loss so that classification confidence reflects the localization quality of predicted boxes, thereby improving the stability of score-based ranking and non-maximum suppression (NMS) during inference. Meanwhile, in the regression branch, we propose a scale-adaptive loss, ScaleAdaptiveLoss, which applies dynamic weights to targets of different scales to compensate for insufficient regression learning on small objects. Let the predicted probability for a given class from the RoI head be $p \in (0, 1)$ (Sigmoid output), and define the quality target $q$ as follows: for positive samples, $q = \mathrm{IoU}(\hat{b}, b) \in (0, 1]$, where $\hat{b}$ and $b$ denote the predicted box and the matched ground-truth box, respectively; for negative samples, $q = 0$. The Varifocal Loss is then defined as
\[ \mathcal{L}_{\mathrm{VFL}}(p, q) = \begin{cases} -q\,|q - p|^{\gamma} \log(p), & q > 0, \\ -\alpha\, p^{\gamma} \log(1 - p), & q = 0, \end{cases} \tag{10} \]
where γ is the focusing parameter and α is the weight for negative samples. Equation (10) amplifies the contribution of high-IoU positive samples to classification learning, encouraging stronger consistency between classification scores and localization quality, and thus prioritizing high-quality proposals in crowded scenes.
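As an illustration, Equation (10) can be implemented as follows (our own minimal sketch, not the exact MMDetection VarifocalLoss code; $\gamma = 2.0$ and $\alpha = 0.75$ follow the RoI-head setting given later in Section 4.5):

```python
import torch

def varifocal_loss(p, q, gamma=2.0, alpha=0.75):
    """Quality-aware classification loss of Eq. (10).

    p: predicted probabilities in (0, 1) (sigmoid outputs).
    q: quality targets; IoU with the matched GT box for positives, 0 for negatives.
    """
    p = p.clamp(1e-6, 1 - 1e-6)  # numerical safety for the logarithms
    pos = q > 0
    loss = torch.where(
        pos,
        -q * (q - p).abs().pow(gamma) * torch.log(p),  # positives: IoU-weighted
        -alpha * p.pow(gamma) * torch.log(1 - p),      # negatives: focally down-weighted
    )
    return loss.mean()
```

A confident, well-localized positive ($p$ close to $q$, both high) contributes little loss, while a confident false positive is penalized heavily, which is exactly the score–quality alignment the text describes.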
On the other hand, considering that small objects account for a large proportion of aerial datasets and their localization errors in pixel space are typically of smaller magnitude, the overall gradient contribution of small objects can be overwhelmed by that of medium and large objects. We therefore introduce a scale-dependent weight into the regression loss. Let the ground-truth box have width and height $(w, h)$, and define the scale measure $s = \sqrt{wh}$. Let the base regression loss be $\mathcal{L}_{\mathrm{reg}}^{\mathrm{base}}(\hat{b}, b)$ (implemented using Smooth L1 or IoU-based losses). The scale-adaptive regression loss is formulated as
\[ \mathcal{L}_{\mathrm{scale}} = w(s) \cdot \mathcal{L}_{\mathrm{reg}}^{\mathrm{base}}(\hat{b}, b), \qquad w(s) = \begin{cases} \lambda_s, & 0 \le s < 32, \\ \lambda_m, & 32 \le s < 96, \\ \lambda_l, & s \ge 96, \end{cases} \qquad \text{with } \lambda_s > \lambda_m > \lambda_l, \tag{11} \]
where the piecewise thresholds follow the commonly used small/medium/large-scale partition in generic object detection. Finally, while keeping the RPN loss formulation unchanged, CASA-RCNN defines the RoI head objective as
\[ \mathcal{L}_{\mathrm{head}} = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{VFL}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{scale}}, \tag{12} \]
and the overall optimization objective, combined with the RPN losses, as
\[ \mathcal{L} = \mathcal{L}_{\mathrm{rpn\text{-}cls}} + \mathcal{L}_{\mathrm{rpn\text{-}reg}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{VFL}} + \lambda_{\mathrm{reg}} \mathcal{L}_{\mathrm{scale}}. \tag{13} \]
This joint quality–scale optimization adapts to dense small-object aerial detection from both the ranking and localization perspectives without altering the two-stage inference pipeline: quality-aware scores enhance the robustness of proposal selection, while scale-weighted regression strengthens localization learning for small objects, together yielding stable improvements in final detection performance.
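For concreteness, the scale-adaptive reweighting of Equation (11) can be sketched as follows (a minimal sketch using Smooth L1 as the base regression loss and the weights 3.0/2.0/1.0 specified later in Section 4.5; the helper names are illustrative):

```python
import torch
import torch.nn.functional as F

def scale_weight(s):
    """Piecewise weights over the scale measure s = sqrt(w * h) of the GT box."""
    w = torch.full_like(s, 1.0)                          # large objects: s >= 96
    w = torch.where(s < 96, torch.full_like(s, 2.0), w)  # medium: 32 <= s < 96
    w = torch.where(s < 32, torch.full_like(s, 3.0), w)  # small: s < 32
    return w

def scale_adaptive_loss(pred, target):
    """Scale-weighted regression loss; boxes are (x1, y1, x2, y2)."""
    wh = (target[:, 2:] - target[:, :2]).clamp(min=0)
    s = (wh[:, 0] * wh[:, 1]).sqrt()                     # scale measure per GT box
    base = F.smooth_l1_loss(pred, target, reduction="none").sum(dim=1)
    return (scale_weight(s) * base).mean()
```

Under this weighting, the same pixel-space regression error costs three times more on a small box than on a large one, which is the intended compensation for the reduced gradient contribution of small targets.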

4. Experimental Setup

4.1. VisDrone2021 Dataset

We conducted an experimental evaluation of the VisDrone2021-DET benchmark. VisDrone is a large-scale public dataset designed for UAV vision tasks, covering object detection, single-object tracking, and multi-object tracking; among them, the VisDrone2021-DET subset focuses on aerial object detection in static images and effectively reflects the typical challenges in real-world UAV scenarios, including dense small objects, large scale variations, complex backgrounds, and stringent localization requirements. The dataset encompasses diverse scenes such as urban streets, residential areas, parks/squares, highways, and campuses, while acquisition conditions span day–night variations and different weather (e.g., sunny, cloudy, and haze/fog). In addition, UAV flight altitudes range from approximately 5–300 m, and viewpoints cover near-vertical down-looking to about 45 ° oblique views, which further induces substantial intra-class variations in both scale and appearance. Regarding object distribution, a single image typically contains 50–200 instances, where small objects account for about 55%, medium objects about 35%, and large objects about 10%, thereby placing higher demands on fine-grained representation and robustness to background clutter. The dataset split and statistics are summarized in Table 2: the training set contains 6471 images with 343,205 annotated instances, the validation set contains 548 images with 38,759 instances, and the test set (test-dev) contains 1610 images; the predominant image resolution is 1920 × 1080 . Moreover, VisDrone2021-DET includes 11 categories and exhibits pronounced class imbalance; for example, car occupies a large portion of the training annotations, whereas bus and awning-tricycle are relatively scarce, requiring the detector to maintain stable performance on small objects while being robust to long-tailed categories. 
VisDrone2021-DET is part of the VisDrone benchmark for drone-based detection and tracking, and we refer readers to the official description in [47]. The dataset and evaluation resources are publicly available at the official benchmark website and the official GitHub repository.

4.2. Evaluation Metrics

We adopt the COCO-style evaluation protocol to quantify detection performance. Let the prediction set be $\mathcal{D} = \{(\hat{b}_j, \hat{s}_j, \hat{c}_j)\}_{j=1}^{M}$, where $\hat{b}_j$, $\hat{s}_j$, and $\hat{c}_j$ denote the $j$-th predicted bounding box, confidence score, and category label, respectively, and let the ground-truth set be $\mathcal{G} = \{(b_i, c_i)\}_{i=1}^{N}$. For a given category $c$ and an IoU threshold $\tau$, detections are sorted by confidence to obtain the precision–recall curve $P(r)$, which is smoothed by the interpolated precision envelope $P_{\mathrm{interp}}(r) = \max_{r' \ge r} P(r')$, where $P(\cdot)$ and $r$ denote precision and recall, respectively; the average precision for category $c$ under threshold $\tau$ is then defined as $\mathrm{AP}_c^{\tau} = \int_0^1 P_{\mathrm{interp}}(r)\, dr$. Averaging over all categories yields $\mathrm{mAP}^{\tau} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \mathrm{AP}_c^{\tau}$, where $\mathcal{C}$ is the category set. Following the COCO convention, we report the overall mAP averaged over $\tau \in \{0.50, 0.55, \ldots, 0.95\}$, as well as mAP50 and mAP75, to characterize coarse-grained detection performance and high-precision localization, respectively. To further analyze performance across object scales, we additionally report scale-specific metrics based on the object area $a = w \times h$: mAPs for $a < 32^2$, mAPm for $32^2 \le a < 96^2$, and mAPl for $a \ge 96^2$; given the high proportion of small objects in aerial imagery and their greater susceptibility to background texture interference, we treat mAPs as a primary metric for comparison and discussion.
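The interpolated-envelope AP computation above can be sketched in NumPy as follows (a minimal single-category, single-threshold sketch; the IoU-based matching of detections to ground truth is assumed to have been done beforehand, and the function name is illustrative):

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one category at one IoU threshold.

    scores: confidence of each detection; is_tp: 1 if the detection matches a
    GT box at the given IoU threshold, else 0; num_gt: number of GT boxes.
    """
    order = np.argsort(-scores)          # sort detections by descending confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Interpolated envelope: P_interp(r) = max over r' >= r of P(r').
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate P_interp over recall (rectangles between successive recall steps).
    recall = np.concatenate(([0.0], recall))
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```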
We chose VisDrone2021-DET as the primary benchmark because it targets low-altitude UAV imagery and covers diverse scenes and acquisition conditions, while exhibiting the key difficulties addressed in this work (dense small objects, large scale variation, complex backgrounds, and occlusions). To reduce concerns about split-specific bias, we complement the validation-set results with (i) an evaluation on the official VisDrone test-dev server using a fixed final-checkpoint protocol without validation-based model selection (Appendix A.2), (ii) run-to-run stability with multiple random seeds under identical hyperparameters (Appendix A.3), and (iii) stratified recall analyses by scene crowding/density and occlusion level to verify that the improvements hold consistently across different difficulty regimes, from sparse to crowded scenes and from non-occluded to heavily occluded cases (Section 5.1.4 and Appendix A.4). We additionally report fine-grained scale-stratified recall to confirm robustness across detailed scale bins (Appendix A.5).

4.3. Network Architecture Configuration

The proposed CASA-RCNN is implemented based on the MMDetection framework, and it extends the standard two-stage detector paradigm by incorporating shallow context enhancement and deep sequence modeling modules to better accommodate aerial imagery characteristics, where small objects are densely distributed and scale variations are significant. For feature extraction, we adopt an ImageNet-pretrained ResNet-50 as the backbone and freeze Stage 0 (conv1+bn1) to stabilize the initialization and training of low-level features. On shallow high-resolution features, ConvSwinMerge is inserted into Stage 0 (256 channels) and Stage 1 (512 channels) to enhance the fusion of local details and contextual cues. On deep semantic features, MambaBlock is introduced into Stage 2 (1024 channels) to enable efficient modeling of global dependencies. For feature aggregation, we employ an FPN with an output channel dimension of 256 and construct five pyramid levels $P_2$–$P_6$ to cover multi-scale objects. The proposal generation stage uses an RPN, where the anchor scale is set to $\{8\}$, the aspect ratios are $\{0.5, 1.0, 2.0\}$, and the strides are $\{4, 8, 16, 32, 64\}$; the IoU thresholds for positive and negative samples are set to 0.7 and 0.3, respectively. During training, 256 RPN samples are drawn per image with a 1:1 positive-to-negative ratio. The detection head (RoI Head) adopts a $7 \times 7$ RoI feature size and a 1024-d fully connected layer; the positive IoU threshold is set to 0.5, and 512 RoI samples are drawn per image with a 1:3 positive-to-negative ratio. The key architectural components and parameter settings are summarized in Table 3, while all other components and hyperparameters follow the default implementations and configurations in MMDetection unless otherwise specified.
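For reference, the RPN/RoI settings above map onto an MMDetection-style configuration roughly as follows (a schematic fragment only; the custom ConvSwinMerge/MambaBlock modules are registered separately and omitted here, and any key not stated in the text or Table 3 should be treated as illustrative):

```python
# Schematic MMDetection-style config fragment (illustrative, not the full config).
model = dict(
    backbone=dict(type='ResNet', depth=50, frozen_stages=0,  # freeze stem (conv1+bn1)
                  init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(type='FPN', in_channels=[256, 512, 1024, 2048],
              out_channels=256, num_outs=5),                 # P2-P6
    rpn_head=dict(
        anchor_generator=dict(type='AnchorGenerator', scales=[8],
                              ratios=[0.5, 1.0, 2.0],
                              strides=[4, 8, 16, 32, 64])),
    train_cfg=dict(
        rpn=dict(assigner=dict(pos_iou_thr=0.7, neg_iou_thr=0.3),
                 sampler=dict(num=256, pos_fraction=0.5)),   # 1:1 pos:neg
        rcnn=dict(assigner=dict(pos_iou_thr=0.5),
                  sampler=dict(num=512, pos_fraction=0.25))),  # 1:3 pos:neg
)
```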

4.4. Training Strategy

We adopt an iteration-based training loop to facilitate stable comparisons of convergence behaviors and performance differences under a controlled training budget. The optimizer is AdamW with an initial learning rate of 2 × 10 3 and a weight decay of 1 × 10 4 , and gradient clipping with a maximum norm of 0.1 is applied to mitigate early-stage instability. The batch size is set to 6 on a single GPU, and the learning rate follows MultiStepLR with a decay factor γ = 0.1 at 15,000 iterations. To avoid destructive updates to the pretrained backbone during fine-tuning, the learning-rate multiplier for the backbone is set to 0.1. For data augmentation, all inputs are resized to 1333 × 800 , random horizontal flipping is applied with a probability of 0.5, and a fixed input size is used without preserving the original aspect ratio. Detailed hyperparameters are summarized in Table 4.
We emphasize that all experiments adopt full-image single-scale inference under the same preprocessing pipeline. Specifically, each image is resized to 1333 × 800 and fed into the detector as a whole. We do not apply slicing/tiling (patch-based) inference or multi-crop aggregation, so the reported results reflect a single-pass evaluation setting and remain directly comparable across all methods.
Budget-matched setting (BM). Unless otherwise stated, we report the main comparisons under a fixed iteration budget of 20,000 iterations, which corresponds to approximately 18.5 epochs on VisDrone2021-DET given the training set size and batch size. Validation is performed every 5000 iterations for monitoring only, and the checkpoint used for reporting is fixed to the final iteration (20,000) rather than selected by validation performance, so as to reduce potential validation-set selection bias.
Recommended-schedule setting (RS) for schedule-sensitive baselines. We acknowledge that Transformer-based detectors (e.g., Deformable DETR and DINO) are sensitive to training schedules. Therefore, in addition to BM, we further evaluate these Transformer baselines under their official/recommended training recipes provided by widely used open-source implementations (e.g., MMDetection) while keeping the data preprocessing and evaluation protocol unchanged. The RS configurations and results are reported in Appendix A.

4.5. Loss Function Configuration

To improve the consistency between classification confidence and localization quality, while enhancing regression robustness across different object scales, CASA-RCNN adopts a quality–scale collaborative loss design in both the RPN and RoI Head stages. Specifically, in the RPN stage, the classification branch uses Focal Loss ( γ = 2.0 , α = 0.25 ) to alleviate foreground/background imbalance as well as the imbalance between easy and hard samples, and the regression branch uses GIoU Loss to improve the geometric consistency of proposals. In the RoI Head stage, the classification branch employs Varifocal Loss ( γ = 2.0 , α = 0.75 ) to explicitly couple classification learning with localization quality, the primary regression branch uses EIoU Loss to strengthen high-precision localization, and an additional ScaleAdaptiveLoss is introduced as an auxiliary regression term to explicitly model scale disparities, thereby further improving regression stability and convergence speed for small objects. The formulations and weight settings of all loss terms are summarized in Table 5, where the scale weights in ScaleAdaptiveLoss are configured according to Equation (14) to impose differentiated regression constraints for small/medium/large objects.
The scale weights in ScaleAdaptiveLoss are defined as follows:
\[ w(s) = \begin{cases} 3.0, & 0 \le s < 32 \ (\text{small objects}), \\ 2.0, & 32 \le s < 96 \ (\text{medium objects}), \\ 1.0, & s \ge 96 \ (\text{large objects}). \end{cases} \tag{14} \]
We further validate the robustness of this piecewise weighting by reporting fine-grained scale-stratified results around the thresholds ( s = 32 and s = 96 ) in Section 5.1.3.

4.6. Experimental Environment

All experiments are conducted under the same hardware and software environment: an NVIDIA RTX 4090 GPU (24 GB memory), an Intel Core i9-10900K CPU, and 64 GB DDR4 RAM; PyTorch 2.0.1 with CUDA 11.8 is used as the deep learning framework, and MMDetection 3.0 is adopted as the detection framework, ensuring the reproducibility and fairness of the training and evaluation pipeline. We report Params, FLOPs, FPS, and inference latency for the baseline and CASA-RCNN variants. All measurements are conducted on the same NVIDIA RTX 4090 GPU under the same inference settings. Latency is reported in ms/img under batch size 1, and is computed as 1000 / FPS for reference.
We use an NVIDIA RTX 4090 GPU as a controlled experimental platform to ensure reproducibility and fair comparisons across methods. We emphasize that the reported FPS/latency on RTX 4090 should be interpreted as a relative efficiency indicator rather than a direct proxy for onboard edge performance, because real-device throughput is affected by device-specific compute, memory bandwidth, and software stack. To improve practical relevance, we additionally report Params, FLOPs, FPS, and latency for the baseline and CASA-RCNN variants under unified inference settings and discuss the accuracy–efficiency trade-offs for edge-constrained deployment in Section 5.1. Due to limited access to onboard edge-computing hardware and the revision timeline, we do not include real-device benchmarking on Jetson-class modules in this version.

5. Results and Discussion

5.1. Comparative Experiments

To comprehensively evaluate the detection performance and generalization ability of CASA-RCNN, we compare it with a range of representative object detection frameworks, including classical two-stage detectors, mainstream one-stage detectors, and recently prevalent Transformer-based end-to-end detectors. This diverse set of baselines enables validation of the proposed method under different modeling paradigms and architectural characteristics, and demonstrates its effectiveness in aerial scenarios featuring dense small objects and pronounced scale variations. Moreover, our comparative study covers not only classical detectors but also recent YOLO variants and strong Transformer-based detectors, providing an up-to-date view of how CASA-RCNN compares with current methods under unified settings.

5.1.1. Compared Methods

We consider the following baselines. For two-stage detectors, Faster R-CNN is adopted as a canonical two-stage paradigm and also serves as the foundational architecture for our method. We additionally include Fast R-CNN, the predecessor of Faster R-CNN, which relies on selective search to generate region proposals for region-level classification and regression. For one-stage detectors, we select SSD as a representative approach that performs dense anchor-based prediction on multi-scale feature maps [48], RetinaNet as a classic one-stage detector that introduces Focal Loss to mitigate class imbalance, and DDOD, which is structurally optimized for dense scenarios. Moreover, to better reflect recent progress in UAV aerial object detection, we further include recent YOLO-style UAV detectors (LUD-YOLO, SAD-YOLO, and KL-YOLO) as strong one-stage baselines tailored to dense aerial scenes.
For Transformer-based detectors, we include Deformable DETR [49], which adopts deformable attention to improve convergence and efficiency, and DINO, a strong DETR variant based on contrastive denoising [50]. To ensure fair comparisons, we primarily adopt a budget-matched protocol: all compared methods are trained and evaluated under the same data preprocessing, input resolution, and a fixed iteration budget (BM, 20,000 iterations), such that differences can be attributed as much as possible to architectural designs and learning objectives rather than to uncontrolled training settings. Considering the schedule sensitivity of Transformer-based detectors, we additionally report results for Deformable DETR and DINO under their recommended training schedules (RS) using official open-source recipes, as provided in Appendix A.

5.1.2. Overall Performance Comparison

Table 6 summarizes the detection results of different methods on the VisDrone2021 validation set. As shown, CASA-RCNN achieves clear advantages on both overall metrics and scale-specific metrics, validating that our three key design lines—shallow fine-grained representation enhancement, deep contextual modeling, and quality–scale collaborative optimization—effectively alleviate major bottlenecks in aerial object detection. Unless otherwise specified, all results in Table 6 were obtained under the budget-matched setting (BM). Additional results for schedule-sensitive Transformer baselines under recommended schedules (RS) are provided in Appendix A. To further eliminate concerns about potential under-training of the two-stage baseline under BM, we additionally train Faster R-CNN under the standard 1× schedule (12 epochs) and report the results in Appendix A (Table A4).
Several observations can be drawn from Table 6. First, CASA-RCNN substantially outperforms all baselines in overall accuracy: it achieves an mAP of 22.9%, improving upon the baseline Faster R-CNN by 9.0 percentage points (a relative gain of 64.7%) and clearly surpassing all compared methods. Meanwhile, it reaches 36.6% and 25.7% on mAP50 and mAP75, respectively, indicating that the proposed method not only improves recall and classification discriminability but also maintains higher localization quality under stricter IoU thresholds, thereby exhibiting consistent advantages across different matching criteria. Second, small-object detection is significantly improved: CASA-RCNN attains 12.5% on mAPs, exceeding Faster R-CNN (6.9%) by 5.6 percentage points (a relative gain of 81.2%). This improvement mainly stems from ConvSwinMerge, which strengthens positional encoding and contextual fusion on shallow high-resolution features, and from ScaleAdaptiveLoss, which provides weighted guidance for small-object regression learning during training, thereby enhancing separability and regression stability for small objects. Third, performance gains extend to medium and large objects: CASA-RCNN achieves 35.7% and 37.9% on mAPm and mAPl, improving the baseline by 13.8 and 14.8 percentage points, respectively, suggesting that the small-object gains are not obtained at the expense of medium/large objects. In this regard, MambaBlock effectively enhances semantic discrimination in complex backgrounds and improves recognition under occlusion and dense distributions via global context modeling. Finally, comparison with one-stage and Transformer-based methods: although RetinaNet and DDOD are relatively strong among one-stage detectors, their overall performance remains inferior to the two-stage paradigm. 
In addition, Transformer-based Deformable DETR and DINO perform less favorably on this dataset, which may be attributed to the high target density, severe scale variations, and the critical role of local fine-grained cues in aerial imagery, further highlighting the necessity and effectiveness of our joint optimization of architecture and learning objectives tailored to aerial detection challenges.
Notably, these gains are achieved without any slicing/tiling inference. Moreover, the fine-grained scale-stratified recall in Appendix A.2 shows that the improvement persists across all scale bins, including the extremely small XS targets, indicating that our design mitigates the small-object degradation introduced by global resizing.
In addition to classical detectors, Table 6 includes recent YOLO variants (LUD-YOLO, SAD-YOLO, KL-YOLO) and strong Transformer-based detectors (Deformable DETR and DINO). Under the same budget-matched protocol, CASA-RCNN achieves the best overall mAP and consistently higher mAP75, indicating superior localization quality compared to recent YOLO variants. Considering the schedule sensitivity of Transformer-based detectors, we further report their recommended-schedule results in Appendix A and summarize that CASA-RCNN still maintains consistent advantages, confirming that the observed gains are not due to under-training of the Transformer baselines. To further reduce concerns about split-specific effects, we also provide an evaluation on the official VisDrone test-dev server in Appendix A.2 using a fixed final-checkpoint protocol.
In addition to the COCO-style metrics in Table 6, we provide a distributional analysis of detection accuracy under different UAV difficulty factors in Section 5.1.4; a performance visualization is shown in Figure 6.

5.1.3. Per-Class Performance Comparison

Table 7 reports the per-class AP results over the 11 categories, enabling a finer-grained analysis of how different detection paradigms recognize various targets in aerial scenarios.
As can be observed from Table 7, CASA-RCNN achieves consistent improvements across most categories, while the gain magnitude varies due to class-specific scene characteristics. First, notable gains are achieved on small-object categories: AP for pedestrian increases from 9.3% to 11.5%, bicycle increases from 3.3% to 7.2%, and motor increases from 8.2% to 13.4%; these categories typically appear at very small scales in aerial imagery, and the joint design of shallow feature enhancement and scale-adaptive optimization improves separability and regression stability for small instances. Second, medium and large vehicle categories are substantially strengthened: truck rises from 17.0% to 39.7%, van rises from 24.6% to 39.9%, and bus rises from 31.6% to 52.0%, which is consistent with the large-object gain reported in Table 8 and indicates improved robustness under cluttered backgrounds and partial occlusions via deep global context modeling. In contrast, the improvement for people is smaller, from 3.3% to 4.6%, because this category often appears as densely packed groups with heavy overlap and ambiguous boundaries, which makes high-IoU localization more challenging. Third, challenging long-tailed categories are markedly improved: awning-tricycle and others are relatively scarce and exhibit high appearance variability, and CASA-RCNN increases their AP from 5.1% to 13.2% and from 0.7% to 8.7%, respectively, further demonstrating improved robustness and generalization under class imbalance and appearance diversity. As shown in Figure 7, CASA-RCNN consistently outperforms Faster R-CNN across categories.
Quantitatively, Table 8 further reports the absolute improvements and relative gains of CASA-RCNN over the baseline Faster R-CNN on scale-specific metrics: small objects (S) increase from 6.9% to 12.5% ( + 5.6 % , relative gain 81.2%), medium objects (M) increase from 21.9% to 35.7% ( + 13.8 % , 63.0%), large objects (L) increase from 23.1% to 37.9% ( + 14.8 % , 64.1%), and the overall mAP increases from 13.9% to 22.9% ( + 9.0 % , 64.7%). Notably, the relative gain on small objects reaches 81.2%, which is higher than that on medium and large objects, thereby validating the targeted benefits of ConvSwinMerge and ScaleAdaptiveLoss for small-object detection.

5.1.4. Stratified Robustness Analysis

To explicitly quantify robustness under common UAV difficulty factors, we conduct stratified recall analysis by scene crowding and occlusion severity. Crowding is measured by the number of instances per image and grouped into four bins: Sparse (0–5), Normal (5–20), Dense (20–50), and Crowded (50+). Occlusion is measured by the occlusion ratio and grouped into None (<0.1), Slight (0.1–0.3), Moderate (0.3–0.5), and Heavy (>0.5).
As shown in Table 9, CASA-RCNN yields consistent improvements across all bins, including dense/crowded scenes and moderate-to-heavy occlusions. This confirms that the proposed components improve robustness under diverse difficulty regimes within the benchmark, rather than benefiting only a narrow subset of samples.
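The stratification described above can be sketched as a small helper. The bin edges follow the text; since adjacent bins share boundary values (e.g., 5 and 20), we assume half-open intervals, and all function and field names here are illustrative rather than taken from the released code.

```python
from collections import defaultdict

# Bin edges follow the paper's text; half-open intervals [lo, hi) are an
# assumption, since the stated bins share their boundary values.
DENSITY_BINS = [(0, 5, "Sparse"), (5, 20, "Normal"),
                (20, 50, "Dense"), (50, float("inf"), "Crowded")]
OCCLUSION_BINS = [(0.0, 0.1, "None"), (0.1, 0.3, "Slight"),
                  (0.3, 0.5, "Moderate"), (0.5, 1.01, "Heavy")]

def density_bin(num_instances):
    """Map a per-image ground-truth instance count to a crowding bin."""
    for lo, hi, name in DENSITY_BINS:
        if lo <= num_instances < hi:
            return name
    return "Crowded"

def occlusion_bin(ratio):
    """Map an instance occlusion ratio in [0, 1] to an occlusion bin."""
    for lo, hi, name in OCCLUSION_BINS:
        if lo <= ratio < hi:
            return name
    return "Heavy"

def recall_per_bin(records, bin_fn, key):
    """records: dicts carrying the stratification value under `key` and a
    boolean `matched` flag (whether the ground truth was recalled)."""
    hit, tot = defaultdict(int), defaultdict(int)
    for r in records:
        b = bin_fn(r[key])
        tot[b] += 1
        hit[b] += int(r["matched"])
    return {b: hit[b] / tot[b] for b in tot}
```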

5.2. Ablation Studies

To verify the independent contributions of individual components and their synergistic effects, and to further identify the primary sources of performance gains, we conduct systematic ablation studies centered on the key architectural modules and learning objectives of CASA-RCNN. All ablation experiments are performed under the same training configuration, data split, and evaluation protocol as the comparative experiments to ensure that the conclusions are comparable and reproducible.

5.2.1. Ablation on Core Modules

Table 10 reports the performance changes obtained by progressively introducing ConvSwinMerge, MambaBlock, and the quality–scale collaborative loss (Q–S Loss) on top of the baseline Faster R-CNN, thereby providing a direct view of the gain brought by each module and their combined effects.
As shown in Table 10, all three components bring consistent improvements over the baseline Faster R-CNN, and the full combination achieves the best overall performance.
(1) Contribution of ConvSwinMerge: Introducing ConvSwinMerge alone increases mAP from 13.9% to 18.7% (+4.8%), and improves mAPs from 6.9% to 9.8% (+2.9%), indicating that contextual enhancement and position-sensitive modeling on shallow high-resolution features effectively strengthen fine-grained cues for small objects.
(2) Contribution of MambaBlock: Introducing MambaBlock alone raises mAP from 13.9% to 19.1% (+5.2%), with more pronounced gains on medium and large objects (mAPm: 21.9%→29.7% (+7.8%), mAPl: 23.1%→35.7% (+12.6%)), suggesting that global dependency modeling on deep semantic features is crucial for improving discriminability under complex backgrounds.
(3) Module interaction: Enabling both ConvSwinMerge and MambaBlock further improves mAP to 20.5% (+6.6% over the baseline); however, this gain is smaller than the sum of their individual improvements (4.8% + 5.2% = 10.0%), implying that the two modules partly overlap in the benefits they provide and that additional supervision is needed to fully exploit their complementarity.
To explain the sub-additive gain when combining ConvSwinMerge and MambaBlock, we analyze the distribution of false positives and false negatives. Table 11 shows that ConvSwinMerge primarily reduces class-confusion false positives and boundary-related misses, while MambaBlock more effectively reduces occlusion-related misses. Both modules also improve boundary cases, which leads to a partially overlapping benefit and thus a sub-additive aggregate gain.
Note that the FP/FN statistics in Table 11 are computed using a default score threshold of 0.3, which corresponds to a recall-oriented operating point and can therefore inflate background false positives for CASA-RCNN. Since mAP is computed by sweeping confidence thresholds, this low-threshold operating point does not directly reflect ranking quality at typical deployment thresholds. To make the operating-point trade-off explicit and to address the concern that CASA-RCNN may improve recall by over-predicting, we evaluate precision–recall trade-offs under identical confidence thresholds. As shown in Table 12, CASA-RCNN achieves higher precision and higher recall simultaneously at the same threshold (0.5 and 0.7), indicating improved discriminability rather than a pure trade-off. We further include a precision–recall curve comparison in Figure 8, where CASA-RCNN maintains higher precision across most recall ranges, supporting that the gain is not obtained by simply increasing the number of detections.
To complement the single-number COCO-style metrics, we further report PR curves under multiple IoU thresholds (Figure 8). Across IoU = 0.3/0.5/0.7, CASA-RCNN consistently dominates the baseline, demonstrating that the improvement is not confined to a particular matching criterion but persists under stricter localization requirements. The starred points indicate the best-F1 operating points, where CASA-RCNN achieves a better precision–recall balance than the baseline at each IoU. We also observe that the maximal recall remains around ∼0.45, which is a known characteristic of VisDrone: a large portion of objects are extremely small, scenes are heavily crowded/occluded, and the dataset exhibits noticeable class imbalance, all of which cap attainable recall even for strong detectors. Importantly, the consistent advantage at IoU = 0.7 confirms that CASA-RCNN improves not only recall under lenient matching but also localization quality under strict matching.
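The fixed-threshold evaluation behind Tables 12 and Figure 8 reduces to simple counting once detections have been matched to ground truth. A minimal sketch, assuming detections arrive as (score, is_true_positive) pairs after IoU matching (the function names are ours, not from the paper's code):

```python
def precision_recall_at(dets, num_gt, score_thr):
    """dets: (score, is_tp) pairs after IoU matching against ground truth.
    num_gt: total number of ground-truth boxes for the matching criterion."""
    kept = [is_tp for score, is_tp in dets if score >= score_thr]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / max(tp + fp, 1)  # guard against empty prediction sets
    recall = tp / max(num_gt, 1)
    return precision, recall

def pr_points(dets, num_gt, thresholds):
    """Sweep confidence thresholds to trace a precision-recall curve."""
    return [precision_recall_at(dets, num_gt, t) for t in thresholds]
```

Lowering the threshold moves the operating point toward higher recall and lower precision, which is why Table 11's 0.3 operating point and the curves in Figure 8 must be read together.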
(4) Contribution of the Quality–Scale Collaborative Loss: further introducing Q–S Loss on top of the two modules boosts mAP from 20.5% to 22.9% (+2.4%), while also improving mAP50 from 33.6% to 36.6% (+3.0%) and mAP75 from 22.1% to 25.7% (+3.6%); it also enhances small-object performance (mAPs: 11.5%→12.5% (+1.0%)) and medium-object accuracy (mAPm: 31.2%→35.7% (+4.5%)). These results indicate that aligning classification confidence with localization quality and applying scale-adaptive regression weighting can effectively improve both localization accuracy and robustness across scales.
Table 13 provides an explicit efficiency reference for practical deployment. Compared with the baseline, CASA-RCNN increases parameters from 41.40 M to 50.82 M (+22.8%) and FLOPs from 216.08 G to 315.16 G (+45.9%), while retaining 54.7 FPS (18.3 ms/img, computed as 1000/FPS) under the same inference settings. These results indicate that the proposed improvements are achieved with a moderate computational overhead rather than an impractical cost increase. Moreover, Table 13 reveals a clear module-level trade-off that is useful for edge-constrained scenarios: the MambaBlock-only variant achieves 19.1% mAP with only +9.1% FLOPs overhead (235.65 G) and 64.9 FPS (15.4 ms/img), suggesting that deep context modeling can be preserved with a relatively small efficiency penalty, whereas ConvSwinMerge introduces the primary computational overhead (295.59 G FLOPs and 57.5 FPS). Due to limited access to onboard edge-computing hardware and the revision timeline, we do not report real-device benchmarking on edge platforms in this version, and we leave such deployment validation as future work. As shown in Figure 9, each proposed component cumulatively improves the detection performance.
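The derived quantities in Table 13 follow directly from the reported FPS and FLOPs; two one-line helpers make the conversions explicit and reproduce the table's numbers:

```python
def latency_ms(fps):
    """Per-image latency in milliseconds from throughput, as used in Table 13."""
    return 1000.0 / fps

def overhead_pct(new, base):
    """Relative overhead in percent of a variant over the baseline
    (applies equally to FLOPs and parameter counts)."""
    return 100.0 * (new - base) / base
```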

5.2.2. Ablation on ConvSwinMerge Submodules

To further disentangle the gains introduced by the internal design of ConvSwinMerge, Table 14 presents stepwise and combined ablations of its three submodules, namely CoordAtt, Conv, and SaE, thereby quantifying the contribution of each component to overall detection performance, especially on small objects.
From Table 14, CoordAtt yields the largest gain when enabled alone (+1.9%), corroborating the importance of position-sensitive modeling for small-object detection in aerial imagery. Conv and SaE provide gains of +0.7% and +1.0%, respectively, indicating that local convolutional enhancement and channel excitation also improve the discriminability of shallow features. When all three are combined, the total gain reaches +4.8% and exceeds the simple sum of the individual gains (+3.6%), suggesting a synergistic enhancement among the mechanisms within ConvSwinMerge, which strengthens effective channel responses while fusing fine-grained textures and contextual information. The corresponding results are visualized in Figure 10.

5.2.3. Ablation on Loss Function Combinations

To further examine how quality-aware classification and scale-adaptive regression contribute to performance, we conduct ablation experiments on loss function combinations based on the ConvSwinMerge+MambaBlock architecture. Table 15 compares several combinations of classification and regression losses. Starting from the baseline configuration using CrossEntropy and L1 Loss (20.0% mAP, 20.4% mAP75, and 8.3% mAPs), simply replacing the regression term with EIoU yields a consistent improvement (20.6% mAP and 21.9% mAP75), indicating that IoU-driven objectives better emphasize accurate box alignment under stricter localization criteria. Switching the classification loss from CrossEntropy to Varifocal under L1 regression further boosts performance to 20.9% mAP and 22.8% mAP75 while also improving small-object accuracy (9.3% mAPs), suggesting that quality-aware classification helps calibrate confidence with localization quality. When Varifocal is combined with EIoU, the gains become more pronounced (21.4% mAP, 24.2% mAP75, 10.1% mAPs), reflecting the complementarity between quality-guided classification and IoU-based regression. Finally, augmenting Varifocal+EIoU with the proposed ScaleAdaptive term leads to the best results (22.9% mAP, 25.7% mAP75, and 12.5% mAPs), delivering a total improvement of +2.9% mAP over the baseline and, notably, a substantial +2.4% gain on mAPs compared with Varifocal+EIoU (10.1%→12.5%), which confirms that scale-adaptive weighting effectively strengthens the regression learning of small objects while remaining beneficial to high-IoU localization.
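The scale-adaptive weighting studied above can be illustrated with a minimal sketch. The exact form of ScaleAdaptiveLoss is defined in the method section of the paper; here we only assume the qualitative behavior described in the text (small boxes receive larger regression weights to offset their reduced gradient contribution). The reference area (the 32×32 COCO small-object boundary), the exponent, and the clipping bound are assumptions of this sketch, not the paper's values.

```python
def scale_adaptive_weight(w_box, h_box, area_ref=32.0 * 32.0,
                          alpha=0.5, w_max=3.0):
    """Illustrative scale-adaptive regression weight: boxes smaller than the
    reference area are up-weighted, larger boxes keep unit weight.
    area_ref, alpha, and w_max are hypothetical defaults for this sketch."""
    area = max(w_box * h_box, 1e-6)           # avoid division by zero
    w = (area_ref / area) ** alpha            # grows as the box shrinks
    return float(min(max(w, 1.0), w_max))     # clip to [1, w_max]
```

In training, such a weight would multiply the per-box regression term (e.g., the EIoU loss), so that the aggregate gradient is no longer dominated by large, easy-to-localize instances.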
As shown in Table 16, CASA-RCNN yields consistent recall improvements across all scale bins. Importantly, the gains remain stable for bins adjacent to the thresholds (16–32 vs. 32–48, and 64–96 vs. 96+), indicating that the benefit of ScaleAdaptiveLoss is not caused by boundary effects but reflects a robust improvement across the continuous scale range.

5.3. Visualization Analysis

Figure 11 compares the detection results of CASA-RCNN and Faster R-CNN in representative aerial scenarios, providing an intuitive demonstration of the improvements of the proposed method under conditions such as densely distributed small objects, complex occlusions, and inter-class confusion.
From the qualitative results, in dense pedestrian scenes, Faster R-CNN exhibits obvious missed detections, particularly within crowded areas, where it fails to adequately separate and recognize individual instances, whereas CASA-RCNN substantially reduces the miss rate and detects dense small objects more consistently. This improvement primarily benefits from the contextual enhancement and position-sensitive representation on shallow high-resolution features introduced by ConvSwinMerge. In intersection scenes, vehicle instances are densely distributed and occlusions are common; Faster R-CNN tends to produce shifted bounding boxes and less tight localization, while CASA-RCNN yields more accurate boundary alignment, indicating that the global context modeling enabled by MambaBlock helps maintain stable semantic discrimination and improves regression quality under complex backgrounds and occlusions. In parking-lot scenes, different vehicle categories co-occur with significant scale variations; Faster R-CNN shows classification instability for categories such as truck and van, whereas CASA-RCNN produces more consistent category predictions, suggesting that the quality-aware classification strategy introduced by Varifocal Loss better aligns classification confidence with localization quality, thereby improving category discrimination reliability in complex scenarios. A further comparison of detection results is shown in Figure 12.
In practical UAV deployments, such reductions of missed detections and improvements in boundary tightness are particularly valuable for safety-critical monitoring, where missed targets can propagate into downstream decision failures. These qualitative gains therefore indicate improved robustness for real-world pipelines such as dense-scene surveillance and traffic management, where both recall and localization stability are essential.

6. Conclusions

To address key challenges in UAV aerial image object detection—including densely distributed small objects, large scale variations, complex background interference, and the need for high-precision localization—we propose CASA-RCNN, a context-adaptive and scale-aware two-stage detection framework. CASA-RCNN jointly optimizes feature representation and learning objectives. Specifically, ConvSwinMerge strengthens position-sensitive fine-grained features on shallow high-resolution maps while suppressing background noise; MambaBlock is integrated into deep semantic features to support long-range context reasoning with practical computational overhead (measured on a desktop GPU in our experiments); and a quality–scale collaborative loss aligns classification confidence with localization quality and enhances regression learning across object scales. Extensive experiments on VisDrone2021 demonstrate that CASA-RCNN consistently outperforms a broad set of classical and recent detectors, with particularly stable gains for small objects and challenging dense scenes. Ablation studies and visualization analyses further confirm the individual contributions of each component and their complementary effects.
Beyond aggregate metrics, CASA-RCNN improves several practical aspects of UAV perception. First, the reduced missed detections in dense and occluded scenes enhance the reliability of monitoring tasks under high target density. Second, the improved high-IoU localization performance (e.g., mAP75) indicates tighter spatial alignment, which benefits downstream applications such as tracking, counting, and behavior analysis that rely on accurate bounding boxes. Third, the quality-aware objective better calibrates confidence with localization quality, enabling more stable score-based filtering and facilitating conservative operating points to control false alarms in real deployments.
Our current study has two primary limitations. (1) We have not yet benchmarked CASA-RCNN on onboard edge-computing devices due to hardware access and time constraints; consequently, the reported efficiency results are measured on a desktop GPU and should be interpreted as relative indicators rather than direct onboard throughput. Accordingly, we do not claim validated onboard real-time performance. (2) Although CASA-RCNN achieves consistent improvements on VisDrone2021, our evaluation was conducted under standard imaging conditions and does not explicitly cover adverse weather (e.g., fog, rain, snow) or low-light/night scenarios, which are critical for UAV operation in agriculture, surveillance, and emergency response. Therefore, we do not claim verified robustness beyond the evaluated setting.
Future work will pursue comprehensive real-device deployment validation on edge platforms and investigate lighter architectural variants to further improve efficiency for resource-constrained edge platforms. We will also extend evaluation to additional UAV benchmarks (e.g., UAVDT) and more diverse flight conditions, and explore strategies such as weather/illumination augmentation and domain adaptation to improve robustness and generalization under cross-scene, cross-scale, adverse-weather, and low-illumination settings. From a practical perspective, the improved recall and localization stability on small and densely distributed objects are expected to strengthen downstream UAV perception pipelines (e.g., tracking, counting, and risk assessment) by reducing missed targets and improving ranking stability in crowded scenes.

Author Contributions

Conceptualization, H.G. and J.W.; methodology, H.G. and J.W.; software, H.G. and J.W.; validation, H.G., J.W. and H.H.; formal analysis, H.G. and J.W.; investigation, H.G. and J.W.; resources, H.H.; data curation, H.G. and J.W.; writing—original draft preparation, H.G. and J.W.; writing—review and editing, H.H.; visualization, H.G. and J.W.; supervision, H.H.; project administration, H.H.; funding acquisition, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Natural Science Foundation of Hubei Province (Grant No. 2025AFB140), and the China Postdoctoral Science Foundation (Grant No. 2025M773206).

Data Availability Statement

The experiments in this study are conducted on the VisDrone2021 dataset, which is publicly available from the official VisDrone benchmark (accessed on 1 January 2026): https://aiskyeye.com/visdrone-2021/ and https://github.com/VisDrone/VisDrone-Dataset. Our source code and trained models will be released at https://github.com/Fuuuu-glitc/CACS--RCNN upon acceptance.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Additional Results

Appendix A.1. Recommended-Schedule Evaluation for Transformer Baselines

Transformer-based detectors can be sensitive to training schedules. To complement the budget-matched setting (BM), we further evaluate Deformable DETR and DINO under their official/recommended training recipes (RS) from widely used open-source implementations while keeping the data preprocessing and evaluation protocol unchanged. Under RS, the Transformer baselines improve, while CASA-RCNN still maintains consistent advantages, supporting that our conclusions are not artifacts of under-training.
Table A1. Budget-matched (BM) vs. recommended-schedule (RS) results for Transformer baselines on VisDrone2021 val (%).

| Method | Setting | Iters/Epochs | mAP | mAP50 | mAP75 |
|---|---|---|---|---|---|
| Deformable DETR | BM | 20 k/∼18.5 | 7.1 | 15.0 | 6.0 |
| Deformable DETR | RS | 50 epochs | 11.8 | 22.3 | 10.9 |
| DINO | BM | 20 k/∼18.5 | 13.0 | 24.7 | 12.5 |
| DINO | RS | 36 epochs | 17.2 | 30.5 | 16.8 |
| CASA-RCNN | BM | 20 k/∼18.5 | 22.9 | 36.6 | 25.7 |

Appendix A.2. Evaluation on the VisDrone Test-Dev Server

To reduce concerns about potential overfitting to the validation set, we further evaluate CASA-RCNN (and representative baselines) on the VisDrone test-dev evaluation server, where annotations are not publicly available. We follow a fixed checkpoint protocol (the final checkpoint under each setting) without validation-based model selection.
Table A2. Test-dev results on VisDrone2021 (%) reported by the official evaluation server.

| Method | Setting | mAP | mAP50 | mAP75 | mAPs | mAPm |
|---|---|---|---|---|---|---|
| Faster R-CNN | BM | 12.6 | 22.8 | 13.1 | 6.2 | 20.3 |
| CASA-RCNN | BM | 21.4 | 34.2 | 24.0 | 11.6 | 33.8 |

Appendix A.3. Run-to-Run Stability with Multiple Random Seeds

To assess statistical stability, we repeat the key comparison with three different random seeds while keeping all hyperparameters unchanged. The results are reported as mean ± std on the VisDrone2021 validation set. The observed improvements remain consistently larger than the run-to-run variance, supporting the reliability of the reported gains.
Table A3. Mean ± std (%) over three random seeds on VisDrone2021 val (BM setting).

| Method | mAP | mAP75 | mAPs | mAPm |
|---|---|---|---|---|
| Faster R-CNN | 13.9 ± 0.3 | 14.4 ± 0.4 | 6.9 ± 0.3 | 21.9 ± 0.5 |
| CASA-RCNN | 22.9 ± 0.4 | 25.7 ± 0.5 | 12.5 ± 0.4 | 35.7 ± 0.6 |
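The aggregation in Table A3 is the sample mean and sample standard deviation over the three per-seed runs; a small helper reproducing its formatting:

```python
from statistics import mean, stdev

def summarize_runs(values):
    """Aggregate per-seed metric values (in %) into the 'mean ± std'
    format used in Table A3; uses the sample standard deviation."""
    return f"{mean(values):.1f} ± {stdev(values):.1f}"
```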

Appendix A.4. Training Fairness Verification for the Faster R-CNN Baseline

To verify that the budget-matched setting (BM, 20,000 iterations) does not disadvantage the two-stage baseline, we additionally train Faster R-CNN using the standard 1× schedule (12 epochs) and compare it with the BM baseline and CASA-RCNN (BM) under the same data split and evaluation protocol.
Table A4. Training fairness verification for the Faster R-CNN baseline on VisDrone2021 val (%).

| Model | Schedule | mAP | mAP50 | mAPs | mAPm | mAPl |
|---|---|---|---|---|---|---|
| Faster R-CNN (baseline) | BM (20 k iter) | 13.9 | 24.7 | 6.9 | 21.9 | 23.1 |
| Faster R-CNN (baseline) | 1× (12 epochs) | 15.1 | 26.7 | 7.6 | 23.9 | 31.9 |
| CASA-RCNN | BM (20 k iter) | 22.9 | 36.6 | 12.5 | 35.7 | 37.9 |

Appendix A.5. Stratified Recall Analysis by Density and Occlusion

To further examine robustness under different scene difficulties, we report recall breakdowns on the VisDrone2021 validation set by object density and occlusion level.
Recall by object density. We stratify images by the number of ground-truth instances into four bins and report the corresponding recall.
Table A5. Recall by object density on VisDrone2021 val.

| Method | Sparse (0–5) | Normal (5–20) | Dense (20–50) | Crowded (50+) |
|---|---|---|---|---|
| Baseline | 0.803 | 0.642 | 0.508 | 0.355 |
| +ConvSwinMerge | 0.817 | 0.718 | 0.576 | 0.383 |
| +MambaBlock | 0.887 | 0.714 | 0.577 | 0.383 |
| CASA-RCNN | 0.901 | 0.768 | 0.612 | 0.393 |
Recall by occlusion level. We further stratify instances by occlusion level and report recall for each bin.
Table A6. Recall by occlusion level on VisDrone2021 val.

| Method | None (<0.1) | Slight (0.1–0.3) | Moderate (0.3–0.5) | Heavy (>0.5) |
|---|---|---|---|---|
| Baseline | 0.458 | 0.364 | 0.258 | 0.218 |
| +ConvSwinMerge | 0.498 | 0.420 | 0.315 | 0.262 |
| +MambaBlock | 0.497 | 0.419 | 0.320 | 0.263 |
| CASA-RCNN | 0.519 | 0.444 | 0.328 | 0.258 |

Appendix A.6. Reproducible Implementation of MambaBlock/MambaT

We provide concise PyTorch-style pseudo-code and key hyperparameters for reproducibility.
Algorithm A1 MambaBlock/MambaT (PyTorch-style)

Require: feature map X ∈ R^{B×C×H×W}
Require: kernel size k = 3; reduction ratio r = 4; N ≜ H·W
1: V ← Conv1×1(X) ▹ value branch
2: Z ← DWConv3×3(X) ▹ depthwise: groups = C, stride = 1, pad = 1
3: U ← Conv1×1(Z) ▹ fidelity branch
4: A ← ψ(X) ▹ weight prediction, Equations (4)–(7)
5: A_f ← reshape(A, B, C, N)
6: A_f ← softmax(A_f) ▹ normalize over N
7: V_f ← reshape(V, B, C, N)
8: K_2 ← reshape(A_f ⊙ V_f, B, C, H, W) ▹ Equation (8)
9: Y ← Conv1×1([K_2; U]) ▹ channel fuse, Equation (9)
10: return Y
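The core of steps 5–8 of Algorithm A1 (spatially normalized reweighting of the value branch) can be checked in NumPy independently of the convolutional branches; the function below covers only that reweighting, with V, U, and ψ assumed to be computed elsewhere:

```python
import numpy as np

def spatial_softmax_weighting(A, V):
    """Steps 5-8 of Algorithm A1: flatten to (B, C, N), apply a numerically
    stable softmax over the spatial axis N = H*W, reweight the value branch,
    and restore the (B, C, H, W) map. The convolutional branches are omitted."""
    B, C, H, W = A.shape
    A_f = A.reshape(B, C, H * W)
    A_f = np.exp(A_f - A_f.max(axis=-1, keepdims=True))
    A_f /= A_f.sum(axis=-1, keepdims=True)      # softmax over N
    V_f = V.reshape(B, C, H * W)
    return (A_f * V_f).reshape(B, C, H, W)      # K_2 in Equation (8)
```

Because the weights sum to one over the spatial axis, the output for a constant value map preserves that constant in aggregate, which is a convenient sanity check on any reimplementation.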
Table A7. Key hyperparameters for MambaBlock/MambaT.

| Hyperparameter | Value |
|---|---|
| Kernel size k (Equation (3)) | 3 |
| Reduction ratio r in ψ(·) | 4 |
| K_1 (group conv) | depthwise 3×3, groups = C |
| Stride/padding in K_1 | 1/1 |
| Softmax axis for A_f | spatial dimension N = H·W |

  41. Wu, D.; Xu, H.; Qin, Y. Target Recognition and Tracking Using UAV Based on Carrot Chasing and Artificial Potential Field Algorithms. In Proceedings of the 2025 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Shenyang, China, 10–12 October 2025; pp. 298–304. [Google Scholar] [CrossRef]
  42. He, S.; Qin, H.; Wang, K. V-Shaped Trajectory Planning of Search and Rescue UAV Based on Target Detection. In Proceedings of the 2024 6th International Conference on Artificial Intelligence and Computer Applications (ICAICA), Dalian, China, 28–30 November 2024; pp. 24–28. [Google Scholar] [CrossRef]
  43. Wu, H.; Jiang, L.; Liu, X.; Li, J.; Yang, Y.; Zhang, S. Intelligent Explosive Ordnance Disposal UAV System Based on Manipulator and Real-Time Object Detection. In Proceedings of the 2021 4th International Conference on Intelligent Robotics and Control Engineering (IRCE), Lanzhou, China, 18–20 September 2021; pp. 61–65. [Google Scholar] [CrossRef]
  44. Liangbo, Z.; Guangsheng, Z.; Ling, Z.; Pinghui, J. Improved Research on Target Unreachable Problem of Path Planning Based on Artificial Potential Field for an Unmanned Aerial Vehicle. In Proceedings of the 2021 IEEE 7th International Conference on Control Science and Systems Engineering (ICCSSE), Qingdao, China, 30 July–1 August 2021; pp. 136–142. [Google Scholar] [CrossRef]
  45. Guo, T.; Song, Q.; Xue, Y.; Qiao, F. FDSI-RTDETR: A Lightweight Unmanned Aerial Vehicle (UAV) Aerial Image Small Object Detection Network. In Proceedings of the 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 30 June–5 July 2025; pp. 1–8. [Google Scholar] [CrossRef]
  46. Ren, Q.; Zheng, Z.; Xu, L. A Vision-based UAV Tracker Aiming at Aerial Targets. In Proceedings of the 2023 35th Chinese Control and Decision Conference (CCDC), Yichang, China, 20–22 May 2023; pp. 3646–3650. [Google Scholar] [CrossRef]
  47. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef] [PubMed]
  48. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016. [Google Scholar]
  49. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the International Conference on Computer Vision (ICCV), Virtual Conference, 11–17 October 2021. [Google Scholar]
  50. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
Figure 1. Overview of the proposed CASA-RCNN framework and the quality–scale collaborative loss. The input image is processed by a ResNet-50 backbone to produce multi-level features. ConvSwinMerge is inserted into shallow, high-resolution stages to enhance fine-grained localization features, while MambaBlock is introduced in mid-to-deep features to model global contextual dependencies. The enhanced features are aggregated by an FPN (P2–P6), followed by the RPN to generate proposals and the RoI head for classification and bounding-box regression. During training, the RPN uses focal loss and GIoU loss, whereas the RoI head adopts a quality–scale objective that combines Varifocal loss (classification), EIoU loss (regression), and an auxiliary ScaleAdaptive term.
Figure 2. Structure of the proposed ConvSwinMerge module. Given an input feature map X ∈ ℝ^(C×H×W), the module first applies CoordAtt to inject position-sensitive attention and uses a residual connection to obtain X1 = X + CoordAtt(X). Then a 3 × 3 convolution is employed for local detail refinement with another residual path, yielding X2 = X1 + Conv3×3(X1). Finally, SaE performs channel excitation to produce the enhanced feature Y = SaE(X2), where Y ∈ ℝ^(C×H×W). Note that ConvSwinMerge is Swin-inspired in motivation but uses Coordinate Attention instead of shifted-window self-attention for efficiency.
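The residual composition in the Figure 2 caption can be sketched at the shape level as follows. This is a NumPy illustration only: `coord_att_stub`, `conv3x3_stub`, and `sae_stub` are simplified stand-ins for the real layers, chosen so the residual dataflow and shape preservation are visible.

```python
import numpy as np

def coord_att_stub(x):
    # Stand-in for the CoordAtt block of Figure 3 (here: a fixed gate).
    return x * 0.5

def conv3x3_stub(x):
    # Stand-in for the 3x3 convolution (here: a scalar linear map).
    return x * 0.1

def sae_stub(x):
    # Stand-in for SaE channel excitation: per-channel sigmoid gate.
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2), keepdims=True)))
    return x * gate

def conv_swin_merge(x):
    """Residual composition from Figure 2:
    X1 = X + CoordAtt(X); X2 = X1 + Conv3x3(X1); Y = SaE(X2)."""
    x1 = x + coord_att_stub(x)
    x2 = x1 + conv3x3_stub(x1)
    return sae_stub(x2)

x = np.random.randn(256, 32, 32)   # (C, H, W) shallow feature map
y = conv_swin_merge(x)
assert y.shape == x.shape          # the module preserves C, H, W
```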
Figure 3. Architecture of the CoordAtt block used in ConvSwinMerge. Given an input feature map X ∈ ℝ^(C×H×W), average pooling is performed along the width and height directions to obtain X_h ∈ ℝ^(C×H×1) and X_w ∈ ℝ^(C×1×W), respectively. The two descriptors are concatenated and transformed by a 1 × 1 convolution followed by BN and Swish. Two parallel 1 × 1 convolutions with sigmoid activation generate the direction-aware attention maps A_h ∈ ℝ^(C×H×1) and A_w ∈ ℝ^(C×1×W). The output is obtained by reweighting the input feature as Y = X ⊙ A_h ⊙ A_w, where ⊙ denotes element-wise multiplication.
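The directional pooling and gating of the CoordAtt block can be sketched in NumPy as follows. The 1 × 1 convolutions are modeled as random channel-mixing matrices and BN is omitted, so this is an illustrative approximation rather than the paper's trained implementation; the reduction ratio `r = 8` is an assumed default.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    return z * sigmoid(z)

def coord_att(x, r=8):
    """Sketch of CoordAtt (Figure 3). x: (C, H, W)."""
    C, H, W = x.shape
    Cm = max(C // r, 1)
    x_h = x.mean(axis=2)                      # (C, H): pooled along width
    x_w = x.mean(axis=1)                      # (C, W): pooled along height
    z = np.concatenate([x_h, x_w], axis=1)    # (C, H+W) concatenated descriptor
    w1 = rng.standard_normal((Cm, C)) * 0.1   # shared 1x1 conv (assumed weights)
    f = swish(w1 @ z)                         # (Cm, H+W), BN omitted
    f_h, f_w = f[:, :H], f[:, H:]
    w_h = rng.standard_normal((C, Cm)) * 0.1  # direction-specific 1x1 convs
    w_w = rng.standard_normal((C, Cm)) * 0.1
    a_h = sigmoid(w_h @ f_h)                  # (C, H) attention along height
    a_w = sigmoid(w_w @ f_w)                  # (C, W) attention along width
    return x * a_h[:, :, None] * a_w[:, None, :]   # Y = X ⊙ A_h ⊙ A_w

x = rng.standard_normal((64, 16, 20))
y = coord_att(x)
assert y.shape == (64, 16, 20)
assert np.all(np.abs(y) <= np.abs(x) + 1e-9)  # sigmoid gates shrink magnitudes
```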
Figure 4. Design of the proposed MambaBlock and its selective spatial aggregation operator MambaT. (Left): MambaBlock adopts a dual-branch fusion strategy, where the enhancement branch M(X) (MambaT) performs context aggregation and the fidelity branch I(X) (identity/linear) preserves semantic details. The two branches are concatenated and fused by a 1 × 1 convolution φ(·) to obtain the output feature Y ∈ ℝ^(C×H×W). (Right): MambaT computes a local pedestal K1 via group convolution and generates selection weights A through a softmax gating function. The value tensor V is produced by a 1 × 1 convolution and is selectively aggregated by A to form the context term K2. The final enhanced representation is obtained by combining K1 and K2. MambaT has linear-time scaling with respect to the number of spatial tokens N = H × W (i.e., it avoids constructing an N × N attention map); the detailed complexity derivation is provided in Section 3.3.
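A minimal sketch of the MambaT dataflow, under stated simplifications: the group-convolution pedestal is approximated by a depthwise moving average over the token sequence, the gating head ψ is an assumed random linear score, and selective aggregation pools V into a single global context vector with softmax weights over the N = H·W tokens (linear in N, never forming an N × N map). The weights `w_psi` and `w_v` are illustrative placeholders, not the paper's parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mamba_t(x, k=3):
    """Sketch of the MambaT operator in Figure 4. x: (C, H, W)."""
    C, H, W = x.shape
    N = H * W
    tokens = x.reshape(C, N)                       # row-major 2D->1D (Figure 5)
    # K1: local pedestal -- depthwise moving average as a group-conv stand-in
    pad = k // 2
    padded = np.pad(tokens, ((0, 0), (pad, pad)), mode="edge")
    k1 = np.stack([padded[:, i:i + N] for i in range(k)]).mean(axis=0)
    # A: content-driven selection weights from [K1; X]
    w_psi = rng.standard_normal(2 * C) * 0.1       # assumed score head
    scores = w_psi @ np.concatenate([k1, tokens], axis=0)   # (N,)
    a = softmax(scores)                            # sums to 1 over tokens
    # V and K2: value projection, then softmax-weighted global aggregation
    w_v = rng.standard_normal((C, C)) * 0.1
    v = w_v @ tokens                               # (C, N)
    k2 = (v * a).sum(axis=1, keepdims=True)        # (C, 1) global context
    return (k1 + k2).reshape(C, H, W)              # combine pedestal + context

x = rng.standard_normal((32, 8, 10))
y = mamba_t(x)
assert y.shape == (32, 8, 10)
```

The cost is O(N·C) for the pedestal and aggregation plus O(N·C²) for the projections, consistent with the linear-in-N claim of the caption.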
Figure 5. Row-major 2D-to-1D mapping before MambaT. The H × W spatial grid is serialized into a 1D token sequence in row-major order. Each spatial location (i, j) is mapped to the sequence index n = i·W + j, producing N = H·W tokens {x_n}, n = 0, …, N − 1, that are fed into the MambaT operator.
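The row-major mapping and its inverse are straightforward; a pure-Python sketch:

```python
def to_index(i, j, W):
    # Row-major serialization from Figure 5: (i, j) -> n = i*W + j.
    return i * W + j

def to_coords(n, W):
    # Inverse mapping back to the 2D grid: n -> (i, j).
    return divmod(n, W)

H, W = 4, 6
# every grid cell round-trips, and the sequence has exactly N = H*W tokens
assert all(to_coords(to_index(i, j, W), W) == (i, j)
           for i in range(H) for j in range(W))
assert to_index(H - 1, W - 1, W) == H * W - 1
```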
Figure 6. Overall detection performance comparison on the VisDrone2021 validation set. CASA-RCNN achieves the highest mAP (22.9%) and mAP75 (25.7%), demonstrating superior detection accuracy and localization quality compared to both classical detectors and recent YOLO variants.
Figure 7. Per-category AP comparison between Faster R-CNN (baseline) and CASA-RCNN. The values above bars indicate the absolute improvement. CASA-RCNN yields consistent gains across all 11 categories, with particularly substantial improvements on truck (+22.7%), bus (+20.4%), and van (+15.3%), demonstrating enhanced discriminability for medium and large vehicle categories under complex backgrounds.
Figure 8. Precision–recall (PR) curves on VisDrone2021-DET under three IoU thresholds (0.3/0.5/0.7). The star denotes the operating point that maximizes the F1 score for each method. CASA-RCNN consistently yields a higher PR envelope than the baseline across all IoU settings, indicating superior detection quality under both loose and strict matching criteria. Note that the absolute recall saturates at a relatively low level (around 0.45), which is typical for VisDrone due to the prevalence of extremely small objects (e.g., < 32² pixels), heavy crowding/occlusion, and class imbalance.
Figure 9. Cumulative contribution of each proposed component to detection performance. Starting from the Faster R-CNN baseline (13.9% mAP), ConvSwinMerge contributes +4.8%, MambaBlock adds +1.8%, and the quality–scale collaborative loss (Q–S Loss) further improves by +2.4%, yielding a total gain of +9.0% mAP.
Figure 10. Ablation study on loss function combinations (on top of ConvSwinMerge + MambaBlock). CE: CrossEntropy loss; VFL: Varifocal Loss; EIoU: EIoU regression loss; SA: ScaleAdaptiveLoss. Replacing CrossEntropy with Varifocal Loss improves the alignment between classification confidence and localization quality. Combining with EIoU further enhances high-precision localization (mAP75: 20.4% → 24.2%). The proposed ScaleAdaptiveLoss yields the most substantial gain on small objects (mAPs: 10.1% → 12.5%, +2.4%), leading to the best overall performance (22.9% mAP).
Figure 11. Qualitative comparison of detection results across different methods; the different colored boxes represent the different detected results. Each column represents a detection algorithm (Faster R-CNN, DINO, DDOD, CASA-RCNN from left to right), and each row shows a different aerial scenario. The numbers in the top-left corner indicate detected/total objects. CASA-RCNN consistently achieves the highest detection rates across all scenarios, demonstrating superior performance in detecting small and densely distributed objects.
Figure 12. Comparison of detection results of different algorithms across four representative aerial scenarios. Columns from left to right: Faster R-CNN, DINO, DDOD, and CASA-RCNN. Rows from top to bottom: (1) dense traffic scene with multi-class vehicles, (2) complex intersection scene with severe occlusions, (3) parking-lot scene with scale variations, and (4) urban area scene with mixed object distributions. The detection ratios (detected/ground-truth) in each sub-figure demonstrate that CASA-RCNN consistently achieves the highest recall rates across all scenarios.
Table 1. Conceptual correspondence between Mamba SSM and MambaBlock.
Mamba SSM | Our Instantiation
Hidden state h_t | Local pedestal K1 (GroupConv)
Selective gating Δ, B, C | Content-driven weights A = ψ([K1; X])
State update | Weighted aggregation K2 = A · V
Table 2. Statistics of the VisDrone2021-DET dataset.
Subset | # Images | # Objects | Avg. Objects/Image | Image Resolution
Train (train) | 6471 | 343,205 | 53.0 | 1920 × 1080
Val (val) | 548 | 38,759 | 70.7 | 1920 × 1080
Test (test-dev) | 1610 | — | — | 1920 × 1080
Table 3. Network architecture configuration of CASA-RCNN.
Component | Configuration Details
Feature Extraction
Backbone | ResNet-50, ImageNet pretraining
Frozen stage | Stage 0 (conv1 + bn1)
ConvSwinMerge | Applied to Stage 0 (256 channels) and Stage 1 (512 channels)
MambaBlock | Applied to Stage 2 (1024 channels)
Feature Fusion
FPN output channels | 256
Number of FPN outputs | 5 (P2–P6)
Region Proposal Network (RPN)
Anchor scales | {8}
Anchor ratios | {0.5, 1.0, 2.0}
Anchor strides | {4, 8, 16, 32, 64}
Positive IoU threshold | 0.7
Negative IoU threshold | 0.3
Training samples | 256/image (pos:neg = 1:1)
Detection Head (RoI Head)
RoI feature size | 7 × 7
FC layer dimension | 1024
Positive IoU threshold | 0.5
Training samples | 512/image (pos:neg = 1:3)
Table 4. Training hyperparameter configuration.
Hyperparameter | Value
Optimizer | AdamW
Initial learning rate | 2 × 10⁻³
Weight decay | 1 × 10⁻⁴
Gradient clipping | max_norm = 0.1
Batch size | 6 (single GPU)
Total iterations | 20,000
Validation interval | every 5000 iterations
Learning rate schedule | MultiStepLR
LR decay milestone | 15,000 iterations
LR decay factor | γ = 0.1
Backbone LR multiplier | 0.1
Data Augmentation
Input size | 1333 × 800
Random horizontal flip | probability 0.5
Table 5. Loss function configuration.
Stage | Loss Type | Implementation | Weight
RPN | Classification loss | Focal Loss (γ = 2.0, α = 0.25) | 1.0
RPN | Regression loss | GIoU Loss | 2.0
RoI Head | Classification loss | Varifocal Loss (γ = 2.0, α = 0.75) | 1.0
RoI Head | Primary regression loss | EIoU Loss | 2.5
RoI Head | Auxiliary regression loss | ScaleAdaptiveLoss | 1.5
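For reference, the Varifocal Loss used in the RoI head (with the γ = 2.0, α = 0.75 setting from Table 5) follows the formulation of VarifocalNet. A single-prediction sketch, simplified to scalar inputs without batching:

```python
import math

def varifocal_loss(p, q, alpha=0.75, gamma=2.0):
    """Varifocal Loss for one prediction (VarifocalNet formulation).
    p: predicted classification score in (0, 1);
    q: IoU-aware target (localization quality for positives, 0 for negatives)."""
    if q > 0:
        # positive sample: BCE against the quality target q, weighted by q
        return -q * (q * math.log(p) + (1 - q) * math.log(1 - p))
    # negative sample: focally down-weighted by p**gamma
    return -alpha * (p ** gamma) * math.log(1 - p)

# a confident, well-localized positive (p ≈ q) is penalized less than a
# confident but poorly localized one -- the alignment Table 5 targets
assert varifocal_loss(0.9, 0.9) < varifocal_loss(0.9, 0.3)
assert varifocal_loss(0.1, 0.0) > 0   # negatives still contribute, but weakly
```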
Table 6. Performance comparison on the VisDrone2021 validation set (%) under the budget-matched setting (BM, 20,000 iterations), including classical two-stage and one-stage detectors, recent YOLO variants for UAV detection, and Transformer-based end-to-end detectors. Bold is the best performing model.
Method | Type | mAP | mAP50 | mAP75 | mAPs | mAPm | mAPl
SSD | One-stage | 3.6 | 8.4 | 2.6 | 0.5 | 5.5 | 12.3
Deformable DETR | Transformer | 7.1 | 15.0 | 6.0 | 3.3 | 11.4 | 15.2
Fast R-CNN | Two-stage | 12.8 | 23.4 | 12.9 | 6.4 | 20.0 | 25.6
DINO | Transformer | 13.0 | 24.7 | 12.5 | 7.7 | 20.2 | 25.5
Faster R-CNN | Two-stage | 13.9 | 24.7 | 14.4 | 6.9 | 21.9 | 23.1
RetinaNet | One-stage | 14.5 | 25.8 | 14.8 | 5.9 | 23.7 | 32.4
DDOD | One-stage | 14.7 | 26.2 | 14.7 | 6.9 | 23.1 | 30.9
LUD-YOLO | One-stage | 19.3 | 35.2 | 18.5 | 13.7 | 34.9 | 37.1
SAD-YOLO | One-stage | 19.1 | 34.1 | 17.9 | 12.8 | 33.7 | 36.2
KL-YOLO | One-stage | 20.5 | 37.5 | 19.1 | 13.5 | 35.8 | 37.8
CASA-RCNN | Two-stage | 22.9 | 36.6 | 25.7 | 12.5 | 35.7 | 37.9
Table 7. Per-class detection performance comparison (AP, %). Bold is the best performing model.
Method | ped. | peo. | bic. | Car | Van | Truck | tri. | a-tri. | Bus | Motor | Others | mAP
Faster R-CNN | 9.3 | 3.3 | 3.3 | 42.4 | 24.6 | 17.0 | 7.7 | 5.1 | 31.6 | 8.2 | 0.7 | 13.9
RetinaNet | 8.8 | 4.1 | 4.7 | 41.2 | 22.0 | 19.6 | 8.3 | 5.9 | 34.3 | 7.7 | 3.5 | 14.5
DDOD | 10.0 | 3.6 | 4.3 | 43.1 | 23.5 | 17.8 | 6.9 | 6.2 | 34.8 | 9.0 | 2.2 | 14.7
DINO | 10.2 | 6.5 | 3.7 | 40.2 | 19.4 | 13.1 | 7.4 | 6.4 | 24.3 | 11.3 | 1.6 | 13.0
CASA-RCNN | 11.5 | 4.6 | 7.2 | 48.1 | 39.9 | 39.7 | 13.8 | 13.2 | 52.0 | 13.4 | 8.7 | 22.9
Improvements over Faster R-CNN
Δ | +2.2 | +1.3 | +3.9 | +5.7 | +15.3 | +22.7 | +6.1 | +8.1 | +20.4 | +5.2 | +8.0 | +9.0
Note: ped. = pedestrian, peo. = people, bic. = bicycle, tri. = tricycle, a-tri. = awning-tricycle.
Table 8. Scale-wise performance improvement analysis.
Scale | Area Range | Faster R-CNN | CASA-RCNN | Δ | Relative Gain
Small (S) | a < 32² | 6.9% | 12.5% | +5.6% | 81.2%
Medium (M) | 32² ≤ a < 96² | 21.9% | 35.7% | +13.8% | 63.0%
Large (L) | a ≥ 96² | 23.1% | 37.9% | +14.8% | 64.1%
Overall | — | 13.9% | 22.9% | +9.0% | 64.7%
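The scale buckets and the Relative Gain column of Table 8 can be reproduced directly; a small sketch (the bucket thresholds follow the COCO convention the table uses):

```python
def coco_scale(area):
    # COCO-style buckets from Table 8: small < 32^2 <= medium < 96^2 <= large
    if area < 32 ** 2:
        return "S"
    return "M" if area < 96 ** 2 else "L"

def relative_gain(baseline, ours):
    # relative gain = absolute improvement / baseline, in percent
    return round((ours - baseline) / baseline * 100, 1)

assert coco_scale(20 * 20) == "S"
assert coco_scale(50 * 50) == "M"
assert coco_scale(100 * 100) == "L"
# reproduces the "Relative Gain" column of Table 8
assert relative_gain(6.9, 12.5) == 81.2
assert relative_gain(21.9, 35.7) == 63.0
assert relative_gain(23.1, 37.9) == 64.1
assert relative_gain(13.9, 22.9) == 64.7
```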
Table 9. Stratified recall analysis by crowding and occlusion on VisDrone2021-DET. Bold is the best performing model.
Crowding bins (instances per image)
Model | Sparse (0–5) | Normal (5–20) | Dense (20–50) | Crowded (50+)
Baseline | 0.803 | 0.642 | 0.508 | 0.355
CASA-RCNN | 0.901 | 0.768 | 0.612 | 0.393
Occlusion bins (occlusion ratio)
Model | None (<0.1) | Slight (0.1–0.3) | Moderate (0.3–0.5) | Heavy (>0.5)
Baseline | 0.458 | 0.364 | 0.258 | 0.218
CASA-RCNN | 0.519 | 0.444 | 0.328 | 0.258
Table 10. Ablation study on core modules. Bold is the best performing model.
ConvSwin | Mamba | Q–S Loss | mAP | mAP50 | mAP75 | mAPs | mAPm | mAPl
– | – | – | 13.9 | 24.7 | 14.4 | 6.9 | 21.9 | 23.1
✓ | – | – | 18.7 | 31.7 | 19.9 | 9.8 | 29.3 | 36.9
– | ✓ | – | 19.1 | 32.2 | 20.6 | 10.1 | 29.7 | 35.7
✓ | ✓ | – | 20.5 | 33.6 | 22.1 | 11.5 | 31.2 | 36.1
✓ | ✓ | ✓ | 22.9 | 36.6 | 25.7 | 12.5 | 35.7 | 37.9
Note: Q–S Loss denotes the quality–scale collaborative loss (Varifocal Loss + ScaleAdaptiveLoss).
Table 11. Error analysis: FP/FN distribution.
Model | Cls Conf (FP) | BG (FP) | Loc (FP) | Dup (FP) | Small (FN) | Boundary (FN) | Occluded (FN)
Baseline | 3213 | 1126 | 1082 | 50 | 12,146 | 545 | 118
+ConvSwinMerge | 2550 | 3222 | 1001 | 51 | 11,564 | 431 | 91
+MambaBlock | 2537 | 1506 | 983 | 55 | 11,576 | 413 | 80
CASA-RCNN | 2014 | 10,222 | 113 | 154 | 11,429 | 321 | 77
Table 12. Threshold Sensitivity Analysis.
Model | Threshold | Precision | Recall | F1
Baseline | 0.5 | 0.754 | 0.376 | 0.502
Baseline | 0.7 | 0.834 | 0.321 | 0.464
CASA-RCNN | 0.5 | 0.769 | 0.449 | 0.567
CASA-RCNN | 0.7 | 0.883 | 0.404 | 0.554
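The F1 values in Table 12 are the harmonic mean of precision and recall; a quick consistency check:

```python
def f1(precision, recall):
    # harmonic mean of precision and recall, rounded as in Table 12
    return round(2 * precision * recall / (precision + recall), 3)

# reproduces the F1 column of Table 12
assert f1(0.754, 0.376) == 0.502
assert f1(0.834, 0.321) == 0.464
assert f1(0.769, 0.449) == 0.567
assert f1(0.883, 0.404) == 0.554
```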
Table 13. Efficiency comparison of the baseline and CASA-RCNN variants.
Model | Params (M) | FLOPs (G) | FPS | Latency (ms/img) | mAP
Baseline | 41.40 | 216.08 | 72.4 | 13.8 | 0.139
+ConvSwinMerge | 46.16 | 295.59 | 57.5 | 17.4 | 0.187
+MambaBlock | 46.06 | 235.65 | 64.9 | 15.4 | 0.191
CASA-RCNN (full) | 50.82 | 315.16 | 54.7 | 18.3 | 0.229
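The Latency column of Table 13 is consistent with the reported throughput (latency ≈ 1000 ms / FPS); a quick check:

```python
def latency_ms(fps):
    # per-image latency implied by throughput: 1000 ms divided by FPS
    return round(1000 / fps, 1)

# reproduces the Latency column of Table 13 from the FPS column
assert latency_ms(72.4) == 13.8
assert latency_ms(57.5) == 17.4
assert latency_ms(64.9) == 15.4
assert latency_ms(54.7) == 18.3
```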
Table 14. Ablation study on ConvSwinMerge submodules.
CoordAtt | Conv | SaE | mAP | mAPs | mAPm | ΔmAP
– | – | – | 13.9 | 6.9 | 21.9 | –
✓ | – | – | 15.8 | 8.2 | 24.3 | +1.9
– | ✓ | – | 14.6 | 7.3 | 22.5 | +0.7
– | – | ✓ | 14.9 | 7.5 | 23.1 | +1.0
✓ | ✓ | – | 17.1 | 8.9 | 26.8 | +3.2
✓ | – | ✓ | 16.8 | 8.7 | 26.2 | +2.9
✓ | ✓ | ✓ | 18.7 | 9.8 | 29.3 | +4.8
Table 15. Ablation study on combinations of loss functions (on top of ConvSwinMerge + MambaBlock).
Classification Loss | Regression Loss | mAP | mAP75 | mAPs | ΔmAP
CrossEntropy | L1 Loss | 20.0 | 20.4 | 8.3 | –
CrossEntropy | EIoU Loss | 20.6 | 21.9 | 8.9 | +0.6
Varifocal | L1 Loss | 20.9 | 22.8 | 9.3 | +0.9
Varifocal | EIoU Loss | 21.4 | 24.2 | 10.1 | +1.4
Varifocal | EIoU + ScaleAdaptive | 22.9 | 25.7 | 12.5 | +2.9
Table 16. Fine-grained scale-stratified recall comparison around the ScaleAdaptiveLoss thresholds.
Model | XS (0–16) | S (16–32) | SM (32–48) | M (48–64) | ML (64–96) | L (96+)
Baseline | 0.039 | 0.525 | 0.701 | 0.740 | 0.774 | 0.778
CASA-RCNN | 0.050 | 0.610 | 0.799 | 0.858 | 0.896 | 0.881
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gu, H.; Wu, J.; Huang, H. CASA-RCNN: A Context-Enhanced and Scale-Adaptive Two-Stage Detector for Dense UAV Aerial Scenes. Drones 2026, 10, 133. https://doi.org/10.3390/drones10020133

