1. Introduction
The rapid development of the low-altitude economy and intelligent emergency response systems is driving UAV vision from offline analysis toward online decision-making, where aerial pedestrian detection is increasingly regarded as a critical enabling capability for urban security, disaster rescue, and infrastructure inspection [1,2,3]. Compared with ground-view settings, pedestrians in aerial images are typically characterized by extremely small scales and sparse distributions, and their detection is continually degraded by complex background textures, nadir-view geometric distortions, and motion blur [4]. Under nighttime operation, backlighting, smoke occlusion, or thermally cluttered environments, the instability of visible cues further constrains the performance ceiling of single-modality detectors; consequently, visible–thermal (RGBT) fusion has been widely adopted to improve robustness for UAV-based tiny pedestrian detection [5,6]. Nevertheless, existing RGBT fusion approaches remain challenged by the intertwined effects of cross-modal geometric misalignment, time-varying modality quality, and weak tiny-object representations, where even slight spatial offsets can obscure discriminative cues [7]. Meanwhile, modality reliability is often spatially and temporally non-uniform, such that stable cross-modal correspondences are difficult to establish and region-wise adaptive, preferential fusion is difficult to achieve, ultimately leading to systematic coexistence of missed detections and false alarms [8,9]. Accordingly, RGBT tiny pedestrian detection on UAV platforms is still an active and challenging research problem.
In practical UAV deployment, onboard perception is constrained by stringent platform-level hardware limitations, including payload capacity, power budget, memory footprint, computational throughput, and thermal dissipation [10]. These limitations hinder the deployment of heavyweight detectors and computationally intensive global multimodal fusion modules, especially when real-time inference is required for online decision-making [11]. RGBT perception further increases system complexity, as paired visible–thermal sensors require accurate calibration, temporal synchronization, and concurrent multimodal data acquisition and transmission [12]. Consequently, a UAV-oriented RGBT detector is expected to remain robust to tiny-object scales, cross-modal misregistration, and spatially varying modality reliability, while maintaining a favorable trade-off among detection accuracy, parameter size, computational cost, and inference speed [13]. These platform-level hardware constraints further emphasize the need for efficient and reliability-guided RGBT fusion strategies tailored to UAV-based tiny pedestrian detection.
Aerial object detection remains intrinsically challenging due to nadir-dominant viewing geometry, arbitrary orientations, and large-scale variation. Ding et al. [14] summarized key factors, including scale changes, oriented instances, and cluttered-scene interference on the DOTA benchmark, providing a unified context for method evaluation; however, the analysis is primarily oriented toward generic aerial objects and does not explicitly address the more severe weak object representation and cross-modal coupling disturbances encountered in UAV-based tiny pedestrian detection. Oriented R-CNN [15] improves efficiency by generating rotated proposals in a streamlined manner. Li et al. [16] strengthen rotation and shape modeling via the point-set geometric representation of Oriented RepPoints, and Yu et al. [17] further reduce annotation cost by exploring point-level supervision for oriented detection. Nevertheless, these methods are largely centered on geometric formulation and annotation paradigms, while explicit suppression of low-frequency background dominance and redundant contextual interference in tiny pedestrian feature learning has been insufficiently investigated. For high-resolution small-object detection, QueryDet [18] accelerates inference and enhances accuracy through cascaded sparse queries, yet the improvement is mainly achieved by alleviating computational overhead and query design, and the progressive submergence of tiny-object cues by low-frequency background components and local noise across multi-level feature refinement is not fundamentally avoided. In addition, Miri Rekavandi et al. [19] and Hua et al. [20] review advances in multi-scale modeling, contextual reasoning, and attention mechanisms from transformer-based and aerial small-object perspectives and consistently identify the long-standing bottlenecks of sparse discriminative cues and strong background interference. Overall, prevailing approaches still predominantly rely on spatial-domain stacking for enhancement, whereas controllable decoupling between background redundancy and fine-grained object details remains limited.
To improve robustness under low illumination, backlighting, and occlusion, RGBT pedestrian detection has been widely investigated with an emphasis on fusion, alignment, and false-positive suppression. Song et al. [21] and Lu et al. [22] systematically review RGBT task taxonomies and fusion techniques, respectively, providing a high-level backdrop for multimodal perception. From the detection perspective, however, cross-modal misregistration, time-varying modality quality, and task-driven adaptive fusion remain open challenges. To alleviate false alarms induced by fusion noise, TFDet [23] suppresses task-irrelevant noise propagation via object-aware fusion, but the imposed constraints are largely coarse-grained. When pedestrians are extremely small and highly sensitive to pixel-level shifts, critical regions may still suffer from mismatched fusion. MS-DETR [24] introduces a DETR-style decoder with loosely coupled sampling to improve tolerance to mild misalignment and further mitigates modality imbalance through modality-balancing optimization. Nevertheless, for tiny and sparse objects, cross-modal correspondences are more easily perturbed by background textures, and the joint modeling of local fine-grained correspondence and weak object representation remains insufficient. For reliability modeling, CMPD [25] leverages Dempster–Shafer evidence theory to assign confidence for fusion guidance, and Li et al. [26] further enhance stability by combining cross-modal homogeneity reinforcement with confidence-aware fusion. However, reliability is often estimated globally or at a coarse spatial resolution, making it difficult to yield region-level preference cues that remain sensitive to tiny objects in UAV scenarios where smoke occlusion, thermal distractors, and misalignment can co-occur. To address misregistration, Zhang et al. [27] explicitly learn spatial and modality alignment; DeformCAT [28] employs deformable cross-attention for weakly aligned RGBT pedestrians; DAMSDet [29] handles time-varying complementarity and misalignment via dynamic query selection and adaptive fusion; and Hou et al. [30] formulate fusion logic from similarity and complementarity. Overall, alignment and reliability selection are still frequently treated in isolation: alignment modules are seldom constrained by region-level quality priors, while reliability weighting cannot ensure the fine-grained spatial correspondence that is most critical for tiny objects. Meanwhile, DLA-Deformable DETR [31] improves sparse sampling and alignment via deformable attention; Swin Transformer [32] is introduced as a hierarchical vision Transformer backbone; DAB-DETR [33], DN-DETR [34], and DINO [35] advance DETR variants through dynamic anchor queries, denoising training, and improved denoising anchors; ConvNeXt [36] and ConvNeXt V2 [37] exemplify modern CNN-based representation backbones; and InternImage [38] pushes real-time end-to-end detection toward an NMS-free paradigm. These developments primarily strengthen generic representation and inference schemes but are not explicitly tailored to UAV RGBT tiny pedestrian detection, where fine details are submerged by low-frequency background components, and robust fusion requires fine-grained correspondence together with region-level reliability prioritization under weak misalignment. In the frequency domain, FcaNet [39] and AIS-FCANet [40] show that frequency cues can be incorporated into channel-attention modeling to enhance structural representation, supporting the utility of frequency information for structure encoding; however, they are not task-specifically designed for the coupled detection challenges posed by multimodal misalignment and time-varying reliability.
Recent RGBT salient object detection methods have provided relevant insights into modality-aware interaction and adaptive fusion. For instance, Zhang et al. proposed an asymmetric light-aware progressive decoding network for RGBT salient object detection, where asymmetric cross-modal interaction, light-aware feature selection, and progressive decoding are employed to suppress modality interference and refine salient regions [41]. Luo et al. developed a Transformer-based cross-modality interaction guidance network (CIGNet) to guide complementary RGBT feature interaction [42], while Zhao et al. introduced a wavelet-driven multi-band feature fusion strategy to integrate low- and high-frequency cross-modal cues for robust saliency prediction [43]. However, these methods are mainly designed for dense pixel-level saliency prediction, whereas UAV-based RGBT tiny pedestrian detection requires box-level localization of sparse and extremely small targets under scale degradation, cross-modal misregistration, and spatially varying modality reliability.
The above analysis indicates that current UAV-based RGBT fusion methods for tiny pedestrian detection face two key bottlenecks:
Issue 1: Tiny-object cues are readily overwhelmed by low-frequency background and contextual redundancy, yielding insufficient discriminative features. Consequently, during iterative learning, detection heads are biased toward background structures or local noise rather than tiny pedestrians.
Issue 2: Tiny objects are highly sensitive to cross-modal misregistration, but fusion is often performed without fine-grained reliability guidance. On UAV platforms, parallax, calibration errors, and temporal asynchrony can induce spatial misalignment between visible and thermal modalities. For extremely small objects, even minor shifts can severely impair cross-modal complementarity. When fusion is controlled only by coarse-grained or global weights, mismatched fusion at critical regions is likely, degrading sensitivity to tiny pedestrians.
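The sensitivity described in Issue 2 can be made concrete with a small arithmetic sketch. It computes the intersection-over-union (IoU) between a square box and a copy of itself shifted horizontally, which stands in for a visible–thermal offset. The box sizes (8 and 64 pixels), the 2-pixel shift, and the function name `iou_after_shift` are illustrative assumptions, not values taken from this work.

```python
def iou_after_shift(side: float, shift: float) -> float:
    """IoU between a square box of the given side length and the same box
    shifted horizontally by `shift` pixels (illustrative helper)."""
    overlap_w = max(0.0, side - abs(shift))  # remaining horizontal overlap
    inter = overlap_w * side                 # intersection area
    union = 2 * side * side - inter          # union of two equal boxes
    return inter / union


# Assumed sizes: an 8-pixel "tiny" box and a 64-pixel reference box.
for side in (8, 64):
    print(side, round(iou_after_shift(side, 2), 3))
# prints:
# 8 0.6
# 64 0.939
```

Under these assumed sizes, the same 2-pixel offset removes 40% of the overlap for the tiny box but only about 6% for the larger one, which is why coarse or global fusion weights are a poor fit for tiny pedestrians.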
To tackle these issues, we propose QA2FDet, a quality-aware adaptive alignment and fusion network for UAV RGBT tiny pedestrian detection. Given paired visible and thermal images, a dual-stream backbone extracts initial multi-scale modality-specific features, which are then processed by the spectrum–spatial decoupled enhancement (SDE) module. In SDE, the block-wise discrete cosine transform (Block-DCT) spectral decoupling explicitly separates low-frequency background redundancy and extracts high signal-to-noise-ratio detail maps. Unlike generic frequency-attention designs, these details are selectively injected into shallow features under deep semantic gating, thereby enhancing detection-relevant tiny pedestrian cues while avoiding indiscriminate amplification of high-frequency noise. To mitigate the high sensitivity of tiny objects to cross-modal misalignment, the cross-modal correspondence mining (CCM) module performs thermal-guided asymmetric local cross-attention, where thermal tokens serve as spatial anchors and visible candidates are searched within neighboring windows to establish fine-grained local correspondences. Spatially varying modality reliability is further estimated by the quality prior estimator (QPE), which derives fine-grained quality-prior maps from modality-specific classification and regression responses. These maps provide region-level reliability cues, supervised by detection quality, for the prior-informed fusion that follows local correspondence mining. Finally, the prior-informed gated fusion (PGF) module jointly models quality priors and modality-difference cues to generate bidirectional adaptive gates, enabling degraded or redundant modality responses to be suppressed while complementary object-sensitive cues are enhanced. The fused features are then forwarded to the detection head for final prediction.
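As a rough illustration of the Block-DCT decoupling idea, the following NumPy sketch splits a single-channel map into non-overlapping blocks, zeroes the lowest-frequency DCT coefficients in each block, and inverts the transform to obtain a detail map. It is a minimal sketch under stated assumptions, not the SDE module itself: the block size of 8, the `u + v < cutoff` rule for what counts as low frequency, and the names `dct_matrix` and `block_dct_detail` are all hypothetical, and the deep semantic gating step is omitted.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row has its own normalization
    return c


def block_dct_detail(feat: np.ndarray, block: int = 8, cutoff: int = 2) -> np.ndarray:
    """Remove low-frequency content per block and return the detail map.

    `feat` is a 2-D map whose sides are multiples of `block`. Coefficients
    with u + v < cutoff are treated as background (an assumed rule).
    """
    h, w = feat.shape
    assert h % block == 0 and w % block == 0
    c = dct_matrix(block)
    u = np.arange(block)
    keep = (u[:, None] + u[None, :]) >= cutoff  # high-frequency mask
    # (h, w) -> (h/b, w/b, b, b): one b x b block per leading index pair
    blocks = feat.reshape(h // block, block, w // block, block).transpose(0, 2, 1, 3)
    coeffs = c @ blocks @ c.T         # forward 2-D DCT of every block
    detail = c.T @ (coeffs * keep) @ c  # inverse DCT of the kept coefficients
    return detail.transpose(0, 2, 1, 3).reshape(h, w)


# Usage: a flat background with one bright pixel standing in for a tiny target.
feat = np.full((16, 16), 5.0)
feat[5, 9] += 1.0
detail = block_dct_detail(feat)
print(np.unravel_index(np.argmax(detail), detail.shape))
```

In this toy input the constant background is carried entirely by the DC coefficient, so it vanishes from the detail map while the isolated bright pixel remains the strongest response.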
Overall, QA2FDet establishes a quality-aware feature refinement mechanism for UAV RGBT tiny pedestrian detection, in which modality-specific responses are progressively purified, locally calibrated, and reliability-adapted toward detection-sensitive representations. This integrated design enhances the detector’s ability to preserve tiny pedestrian cues under background clutter, weak cross-modal misregistration, and spatially varying modality reliability.
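The "locally calibrated" step can be pictured with a simplified, parameter-free version of thermal-guided local cross-attention. In the sketch below, each thermal feature vector acts as a query, and the visible features inside a small neighboring window act as keys and values. This is only an assumed illustration: the function name `local_cross_attention`, the window radius, zero padding at the borders, and the absence of learned projections or multiple heads are simplifications and are not taken from the CCM module.

```python
import numpy as np


def local_cross_attention(thermal: np.ndarray, visible: np.ndarray, radius: int = 1):
    """Thermal-anchored attention over a (2r+1) x (2r+1) visible window.

    thermal, visible: arrays of shape (H, W, C).
    Returns the re-aligned visible features (H, W, C) and the attention
    weights (H, W, K) with K = (2r+1)^2 window positions.
    """
    h, w, c = thermal.shape
    size = 2 * radius + 1
    # Zero-pad so every location has a full window (assumed border handling).
    pad = np.pad(visible, ((radius, radius), (radius, radius), (0, 0)))
    # Candidate k = dy * size + dx holds visible[y + dy - r, x + dx - r].
    cands = np.stack(
        [pad[dy:dy + h, dx:dx + w] for dy in range(size) for dx in range(size)],
        axis=2,
    )  # shape (H, W, K, C)
    scores = np.einsum("hwc,hwkc->hwk", thermal, cands) / np.sqrt(c)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the window
    aligned = np.einsum("hwk,hwkc->hwc", weights, cands)
    return aligned, weights


# Usage: the visible response sits one pixel to the right of the thermal one.
thermal = np.zeros((6, 6, 4))
visible = np.zeros((6, 6, 4))
thermal[3, 3] = [4.0, 0.0, 0.0, 0.0]
visible[3, 4] = [4.0, 0.0, 0.0, 0.0]
aligned, weights = local_cross_attention(thermal, visible, radius=1)
print(weights[3, 3].argmax())  # index 5 corresponds to offset (dy=0, dx=+1)
```

Restricting the search to a small window keeps the cost linear in the number of locations, which is one plausible reason to prefer local over global cross-attention on resource-limited platforms.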
The main contributions can be summarized as follows:
A spectrum–spatial decoupled feature enhancement strategy is developed to suppress dominant low-frequency background responses and selectively introduce high signal-to-noise detail cues into shallow representations, improving the discriminability of tiny pedestrian features.
Thermal responses are exploited as spatial anchors to mine local visible candidates within neighboring regions, enabling fine-grained cross-modal correspondence modeling under slight visible–thermal spatial offsets.
A prior-informed bidirectional gated fusion strategy is proposed to jointly exploit region-level reliability and modality-discrepancy cues, adaptively suppressing degraded or redundant responses while enhancing complementary object-sensitive information.
Extensive experiments on three UAV-oriented RGBT benchmarks demonstrate the superior detection accuracy, cross-modal resilience, and deployment-oriented computational efficiency of the proposed method in challenging aerial scenarios.
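To give a feel for the gated fusion idea in the third contribution, the following sketch combines two feature maps using per-location quality priors and a modality-difference cue. It is a hand-set stand-in for what would be learned layers, so every detail is an assumption: the name `gated_fusion`, the `sharpness` constant, the mean absolute difference as the discrepancy cue, and the way the cue scales the gate logits. In this simplified form the two gates sum to one, which a learned bidirectional design need not satisfy.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(f_vis, f_thr, q_vis, q_thr, sharpness: float = 4.0):
    """Fuse visible and thermal features with quality-driven gates.

    f_vis, f_thr: features of shape (H, W, C).
    q_vis, q_thr: quality-prior maps in [0, 1] of shape (H, W).
    """
    # Modality-difference cue: mean absolute channel difference per location.
    diff = np.abs(f_vis - f_thr).mean(axis=-1)
    # Each gate sees its own quality margin, amplified where modalities disagree.
    g_vis = sigmoid(sharpness * (q_vis - q_thr) * (1.0 + diff))
    g_thr = sigmoid(sharpness * (q_thr - q_vis) * (1.0 + diff))
    fused = g_vis[..., None] * f_vis + g_thr[..., None] * f_thr
    return fused, g_vis, g_thr


# Usage: visible is reliable at (0, 0), thermal at (1, 0), both equal elsewhere.
f_vis = np.ones((2, 2, 3))
f_thr = np.zeros((2, 2, 3))
q_vis = np.array([[0.9, 0.5], [0.1, 0.5]])
q_thr = np.array([[0.1, 0.5], [0.9, 0.5]])
fused, g_vis, g_thr = gated_fusion(f_vis, f_thr, q_vis, q_thr)
print(np.round(g_vis, 3))
```

Where the priors are equal the gates fall back to an even blend, and where one modality is clearly more reliable its response dominates the fused feature.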