Article

DCAM-DETR: Dual Cross-Attention Mamba Detection Transformer for RGB–Infrared Anti-UAV Detection

1 School of Public Security, Guangxi Police College, Nanning 530028, China
2 School of Cyberspace Security (School of Cryptology), Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 103; https://doi.org/10.3390/info17010103
Submission received: 16 December 2025 / Revised: 6 January 2026 / Accepted: 14 January 2026 / Published: 19 January 2026
(This article belongs to the Special Issue Computer Vision for Security Applications, 2nd Edition)

Abstract

The proliferation of unmanned aerial vehicles (UAVs) poses escalating security threats across critical infrastructures, necessitating robust real-time detection systems. Existing vision-based methods predominantly rely on single-modality data and exhibit significant performance degradation under challenging scenarios. To address these limitations, we propose DCAM-DETR, a novel multimodal detection framework that fuses RGB and thermal infrared modalities through an enhanced RT-DETR architecture integrated with state space models. Our approach introduces four innovations: (1) a MobileMamba backbone leveraging selective state space models for efficient long-range dependency modeling with linear complexity $O(n)$; (2) Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules capturing intermodal correlations across spatial and channel dimensions; (3) an Adaptive Feature Fusion Module (AFFM) dynamically calibrating multimodal feature contributions; and (4) a Dual-Attention Decoupling Module (DADM) enhancing detection head discrimination for small targets. Experiments on Anti-UAV300 demonstrate state-of-the-art performance with 94.7% mAP@0.5 and 78.3% mAP@0.5:0.95 at 42 FPS. Extended evaluations on FLIR-ADAS and KAIST datasets validate the generalization capacity across diverse scenarios.

Graphical Abstract

1. Introduction

The exponential proliferation of unmanned aerial vehicles (UAVs) has fundamentally transformed numerous civilian and commercial domains, encompassing aerial cinematography, precision agriculture, logistics delivery, infrastructure inspection, and emergency response operations [1,2]. However, this technological democratization has concurrently engendered unprecedented security vulnerabilities, as malicious actors increasingly exploit UAVs for unauthorized surveillance, contraband smuggling, critical infrastructure disruption, and potential terrorist activities [3,4]. The Federal Aviation Administration (FAA) documented over 2800 UAV-related security incidents in 2023 alone, representing a 47% increase from the previous year [5]. Consequently, developing robust, efficient, and real-time anti-UAV detection systems has emerged as an imperative research priority for safeguarding public safety and national security. This challenge aligns with broader efforts in safety-critical IoT sensing and intelligent transportation systems, where AI models must support real-time, safety-related decisions in connected systems [6].
Conventional anti-UAV detection methodologies predominantly employ radar systems, acoustic sensors, or radio frequency (RF) analysis [7]. While these approaches demonstrate efficacy in specific operational contexts, they suffer from inherent limitations including prohibitive deployment costs, susceptibility to electromagnetic interference, limited spatial resolution, and difficulty in discriminating UAVs from avian species or other airborne objects [8]. Vision-based detection methods leveraging deep learning have emerged as compelling alternatives, offering cost-effectiveness, rich semantic information extraction, and precise spatial localization capabilities [9,10]. Nevertheless, UAV detection in unconstrained real-world environments presents formidable challenges: UAVs typically manifest as diminutive targets with limited discriminative visual features, frequently operating against complex heterogeneous backgrounds under varying illumination conditions, atmospheric perturbations, and partial occlusions [11,12].
Single-modality detection approaches exhibit fundamental limitations in addressing these challenges comprehensively. RGB cameras, while providing rich texture and chromatic information under adequate illumination, suffer severe performance degradation in low-light, nighttime, or adverse weather conditions [13]. Conversely, thermal infrared sensors capture thermal radiation signatures that remain robust across varying illumination conditions but lack sufficient textural detail for precise localization and classification [14]. This complementary nature of RGB and infrared modalities motivates multimodal fusion strategies that leverage the synergistic advantages of both sensing paradigms [15,16].
Recent advances in multimodal fusion for object detection have explored diverse architectural paradigms. Early fusion approaches employ simple concatenation or element-wise operations [17], which inadequately capture complex intermodal dependencies. Intermediate fusion methods leverage attention mechanisms [18,19] or adversarial learning [20] to enhance fusion quality, yet frequently incur substantial computational overhead or exhibit limited receptive field coverage. Late fusion strategies combine detection outputs from modality-specific detectors [21], potentially sacrificing fine-grained complementary information. Furthermore, most existing detection frameworks are constructed upon convolutional neural networks (CNNs) or standard vision transformers [22,23], which either lack global context modeling capability or impose quadratic computational complexity with respect to input resolution, limiting their applicability to high-resolution imagery or real-time deployment scenarios [24,25].
Despite these advances, current fusion-based UAV detectors exhibit several critical limitations. First, most methods employ fixed fusion strategies that apply uniform weights regardless of scene conditions, failing to adapt to varying modality reliability across different environments. Second, existing cross-modal attention mechanisms typically operate on a single dimension (spatial or channel), inadequately capturing the complex intermodal dependencies that span both dimensions. Third, transformer-based fusion approaches incur quadratic computational complexity $O(n^2)$, limiting their applicability to real-time scenarios. Fourth, detection heads in current frameworks lack specialized mechanisms for small target enhancement, which is crucial for UAV detection where targets often occupy minimal pixel areas.
The recently proposed RT-DETR [26] framework has demonstrated remarkable performance in real-time object detection by synergistically combining the efficiency of YOLO-series detectors with the end-to-end detection paradigm of DETR [27]. Concurrently, state space models (SSMs), particularly the Mamba architecture [28], have emerged as a transformative paradigm for sequence modeling, offering linear computational complexity while preserving robust long-range dependency modeling capabilities. Vision Mamba variants [29,30] have successfully adapted SSMs to computer vision tasks, demonstrating competitive performance with vision transformers while achieving substantially improved computational efficiency. However, these architectural advances have not been systematically investigated for multimodal fusion in challenging detection scenarios such as anti-UAV applications.
To bridge this critical gap, we propose DCAM-DETR, a novel multimodal detection framework that synergistically integrates state space models with the RT-DETR architecture for robust real-time anti-UAV detection. Our approach introduces four principal innovations specifically designed for multimodal small target detection. First, we replace the conventional CNN backbone with a MobileMamba architecture that processes RGB and infrared streams through parallel Mamba encoders, enabling efficient global context aggregation with linear complexity $O(n)$ compared to the quadratic complexity $O(n^2)$ of standard self-attention mechanisms. Second, we design Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules that explicitly model intermodal correlations across both spatial and channel dimensions, facilitating fine-grained feature alignment and complementary information extraction through parallel attention streams.
Third, we introduce an Adaptive Feature Fusion Module (AFFM) that dynamically calibrates fusion weights based on scene characteristics and modality reliability through learned gating mechanisms, ensuring optimal information integration across diverse environmental conditions. Fourth, we propose a Dual-Attention Decoupling Module (DADM) in the detection head that employs hierarchical dilated convolutions with parallel spatial and channel attention streams to enhance feature discrimination for small target detection. These components operate synergistically to achieve superior detection performance while maintaining real-time inference capability.
The principal contributions of this work are summarized as follows:
  • We propose DCAM-DETR, a novel multimodal detection framework that integrates Mamba-based state space models with the RT-DETR architecture for efficient anti-UAV detection, achieving linear computational complexity while maintaining robust global context modeling capability through selective scan mechanisms.
  • We design Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules that explicitly capture intermodal correlations across spatial and channel dimensions through parallel attention streams, enabling fine-grained multimodal feature alignment and complementary information extraction.
  • We introduce an Adaptive Feature Fusion Module (AFFM) that dynamically weights multimodal features based on scene-adaptive gating mechanisms, and a Dual-Attention Decoupling Module (DADM) that enhances detection head performance through hierarchical dilated convolutions and attention decomposition.
  • Comprehensive experiments on Anti-UAV300, FLIR-ADAS, and KAIST datasets demonstrate that DCAM-DETR achieves state-of-the-art performance with 94.7% mAP@0.5 on Anti-UAV300, outperforming existing methods by substantial margins while maintaining real-time inference speed of 42 FPS.
The remainder of this paper is organized as follows. Section 2 reviews related work on UAV detection, multimodal fusion, and state space models. Section 3 presents the proposed DCAM-DETR framework in detail. Section 4 describes experimental setup and presents comprehensive results. Section 5 provides in-depth analysis and discussion. Finally, Section 6 concludes the paper with future research directions.

2. Related Work

2.1. Vision-Based UAV Detection

Vision-based UAV detection has garnered substantial research attention owing to its cost-effectiveness and rich semantic information extraction capabilities. Early methodologies employed classical computer vision techniques including background subtraction, optical flow estimation, and hand-crafted feature descriptors [12,31], which exhibited limited robustness against complex backgrounds and varying environmental conditions. The advent of deep learning has fundamentally transformed this domain, with CNN-based detectors such as Faster R-CNN [32], the YOLO series [33,34,35,36,37,38,39], and RetinaNet [40] being extensively adopted for UAV detection tasks.
However, UAV detection presents distinctive challenges that conventional object detectors inadequately address. UAVs typically manifest as small targets occupying minimal pixel areas, exhibit limited discriminative visual features, and demonstrate high-speed motion characteristics [9,10]. Recent investigations have explored specialized architectures for UAV detection. Huang et al. [10] introduced the Anti-UAV410 benchmark with thermal infrared imagery and proposed customized tracking schemes incorporating temporal consistency constraints. Jiang et al. [9] presented the Anti-UAV dataset and systematically evaluated various tracking algorithms under diverse environmental conditions. Zhao et al. [41] proposed multi-scale feature aggregation networks specifically designed for small UAV detection. Nevertheless, most existing methods focus on single-modality detection, fundamentally limiting their robustness across diverse environmental conditions. Our work addresses this limitation by leveraging multimodal fusion with efficient state space models.

2.2. Multimodal Fusion for Object Detection

Multimodal fusion, particularly RGB–infrared fusion, has demonstrated significant advantages in challenging detection scenarios by exploiting complementary information from heterogeneous sensors [13,42]. Fusion strategies can be categorized into early fusion, intermediate fusion, and late fusion paradigms based on the integration stage within the detection pipeline [43].
Early fusion approaches concatenate or combine raw inputs or low-level features from different modalities [17,44]. While computationally efficient, these methods inadequately capture complex intermodal dependencies and may introduce noise from modality-specific artifacts. Intermediate fusion methods integrate features at multiple abstraction levels, leveraging attention mechanisms [19,45,46], cross-modal transformers [47,48], or learnable fusion strategies [49,50] to adaptively combine multimodal information. Chen et al. [15] proposed probabilistic ensembling for multimodal object detection, while Wang et al. [16] introduced Mask-Guided Mamba Fusion for enhanced UAV detection in complex environments. Liu et al. [14] developed target-aware dual adversarial learning for infrared-visible fusion.
Late fusion strategies combine detection outputs from modality-specific detectors through ensemble techniques [21]. While preserving modality-specific representations, these approaches potentially sacrifice fine-grained complementary information available at intermediate feature levels. Recent works have explored transformer-based architectures for multimodal fusion [14,51], achieving improved performance but often suffering from quadratic computational complexity that limits real-time applicability. Our approach leverages Mamba’s linear complexity while achieving superior fusion quality through cross-dimensional attention mechanisms.

2.3. State Space Models and Mamba

State space models (SSMs) have emerged as efficient alternatives to transformers for sequence modeling, offering linear computational complexity while maintaining strong long-range dependency modeling capabilities [52,53]. The foundational structured state space sequence model (S4) [53] demonstrated that properly parameterized SSMs can effectively capture long-range dependencies in sequential data. The theoretical foundations of state space modeling for handling long-term correlations have been extensively studied in various domains, including network systems with generalized nonlinear stochastic processes [54]. The recently proposed Mamba architecture [28] introduces selective state spaces with input-dependent dynamics, enabling the model to dynamically adjust its behavior based on input content.
The core innovation of Mamba lies in its selective scan mechanism, which allows the model to focus on relevant information while filtering out irrelevant details through input-dependent parameterization of the state space matrices. This selectivity enables Mamba to achieve performance comparable to transformers while maintaining linear computational complexity with respect to sequence length. The continuous-time state space formulation is discretized for practical implementation:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$
where $\Delta$ represents the discretization step size, and $A$, $B$ are the state space matrices.
Vision Mamba [29] and VMamba [30] have successfully adapted Mamba to computer vision tasks, demonstrating competitive performance with vision transformers on image classification, object detection, and semantic segmentation. These works typically employ bidirectional or cross-scanning strategies to capture spatial relationships in 2D images. LocalMamba [55] introduced local scanning windows to enhance local feature extraction. MambaVision [56] proposed hybrid architectures combining Mamba with convolutional layers. However, existing vision Mamba architectures have not been systematically explored for multimodal fusion scenarios. Our MobileMamba backbone extends these ideas by introducing parallel Mamba encoders for different modalities with cross-modal interaction mechanisms. Table 1 summarizes the key differences between DCAM-DETR and existing Mamba-based methods.

2.4. Detection Transformers

DETR [27] pioneered end-to-end object detection using transformers, eliminating hand-crafted components such as anchor generation and non-maximum suppression through bipartite matching and set prediction. Subsequent works addressed DETR’s limitations in convergence speed and computational efficiency. Deformable DETR [57] introduced deformable attention mechanisms that attend to a sparse set of sampling points, significantly improving convergence and performance. Conditional DETR [58] and DAB-DETR [59] enhanced query formulation for improved detection accuracy.
RT-DETR [26] represents a significant advancement in real-time detection transformers, introducing an efficient hybrid encoder that combines intra-scale feature interaction with cross-scale feature fusion, and uncertainty-minimal query selection for improved detection quality. RT-DETR achieves real-time performance while maintaining high accuracy, making it suitable for practical deployment scenarios. Our work builds upon RT-DETR’s efficient architecture while extending it to multimodal scenarios through the integration of MobileMamba backbone, cross-dimensional attention modules, and enhanced detection head components.

3. Methodology

3.1. Overall Architecture

Figure 1 illustrates the comprehensive architecture of our proposed DCAM-DETR framework. The detection pipeline comprises four principal components: (1) a MobileMamba backbone that processes RGB and infrared inputs through parallel state space model encoders with hierarchical feature extraction; (2) an Efficient Transformer Encoder enhanced with Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules for multimodal feature fusion; (3) an Adaptive Feature Fusion Module (AFFM) that dynamically calibrates features from different scales and modalities; and (4) a detection head equipped with Dual-Attention Decoupling Module (DADM) for enhanced small target detection through the CCFF (Cross-scale Channel Feature Fusion) structure.
Given a spatially aligned RGB image $I_{\mathrm{rgb}} \in \mathbb{R}^{H \times W \times 3}$ and a thermal infrared image $I_{\mathrm{ir}} \in \mathbb{R}^{H \times W \times 1}$, the MobileMamba backbone first extracts hierarchical feature representations at multiple scales. The spatial alignment is achieved through hardware calibration and software preprocessing: RGB and thermal cameras are mounted on a rigid platform with factory-calibrated intrinsic and extrinsic parameters, and we apply a homography transformation followed by bilinear interpolation to resize both modalities to the same resolution ($640 \times 640$). For each modality $m \in \{\mathrm{rgb}, \mathrm{ir}\}$, the backbone produces multi-scale features $\{F_m^{(i)}\}_{i=3}^{5}$ corresponding to feature pyramid levels P3, P4, and P5 with spatial resolutions of $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{16} \times \frac{W}{16}$, and $\frac{H}{32} \times \frac{W}{32}$, respectively. These features are subsequently processed through the Efficient Transformer Encoder with CDA and CPA modules to capture intermodal correlations. The AFFM adaptively fuses multimodal features based on scene characteristics, producing unified representations for the detection head. Finally, the DADM-enhanced detection head generates bounding box predictions and class probabilities through the Uncertainty-minimal Query Selection mechanism.
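To make the preprocessing concrete, the following is a minimal sketch of the alignment and resizing step described above, assuming a pre-computed 3×3 homography from the factory calibration that maps the thermal view into the RGB view; the function name and OpenCV-based implementation are illustrative rather than the authors' exact pipeline.

```python
import cv2
import numpy as np

def align_and_resize(rgb, ir, homography, size=(640, 640)):
    """Warp the IR frame into the RGB frame with a calibrated homography,
    then resize both modalities to a common resolution (bilinear)."""
    h, w = rgb.shape[:2]
    # Project the thermal image into the RGB camera's image plane.
    ir_aligned = cv2.warpPerspective(ir, homography, (w, h),
                                     flags=cv2.INTER_LINEAR)
    rgb_resized = cv2.resize(rgb, size, interpolation=cv2.INTER_LINEAR)
    ir_resized = cv2.resize(ir_aligned, size, interpolation=cv2.INTER_LINEAR)
    return rgb_resized, ir_resized

# Usage with dummy data; the homography would come from factory calibration.
rgb = np.zeros((512, 640, 3), dtype=np.uint8)
ir = np.zeros((512, 640), dtype=np.uint8)
H_ir_to_rgb = np.eye(3)                      # hypothetical calibration result
rgb_640, ir_640 = align_and_resize(rgb, ir, H_ir_to_rgb)
```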

3.2. MobileMamba Backbone

The MobileMamba backbone serves as the feature extraction foundation, processing RGB and infrared inputs through parallel Mamba encoders with shared architectural design but independent parameters. Unlike conventional CNN backbones that rely on local receptive fields with limited global context, or transformer-based backbones that incur quadratic computational complexity, MobileMamba leverages state space models to capture global dependencies with linear complexity $O(n)$, making it particularly suitable for detecting small targets that require extensive contextual information.

3.2.1. Selective State Space Model Formulation

The theoretical foundation of our backbone is the continuous-time state space model, which maps an input sequence $x(t) \in \mathbb{R}$ to an output sequence $y(t) \in \mathbb{R}$ through a latent state $h(t) \in \mathbb{R}^{N}$:
$$\frac{dh(t)}{dt} = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) + D\,x(t)$$
where $A \in \mathbb{R}^{N \times N}$ is the state transition matrix, $B \in \mathbb{R}^{N \times 1}$ is the input projection matrix, $C \in \mathbb{R}^{1 \times N}$ is the output projection matrix, and $D \in \mathbb{R}$ is the feedthrough coefficient. For practical implementation, this continuous system is discretized using the zero-order hold (ZOH) method with step size $\Delta$:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$
The discretized state space model operates as a recurrence:
$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k + D\,x_k$$
Intuitively, the discretization converts continuous-time dynamics into a form suitable for digital computation. The matrix $\bar{A}$ captures how the hidden state evolves over one time step, while $\bar{B}$ determines how new inputs are incorporated. The step size $\Delta$ controls temporal resolution.
The key innovation of Mamba is the selective mechanism that makes $B$, $C$, and $\Delta$ input-dependent:
$$B = \mathrm{Linear}_B(x), \qquad C = \mathrm{Linear}_C(x), \qquad \Delta = \mathrm{softplus}\left(\mathrm{Linear}_{\Delta}(x) + \mathrm{Broadcast}(p_{\Delta})\right)$$
where $p_{\Delta}$ is a learnable parameter. This selectivity enables the model to dynamically decide which information to store in the hidden state (controlled by $B$), which stored information to output (controlled by $C$), and how quickly to update the state (controlled by $\Delta$), effectively filtering irrelevant information while preserving salient features.
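As an illustration of the selective formulation above, the following PyTorch sketch implements the ZOH discretization and the recurrence for a single scalar input channel, with $B$, $C$, and $\Delta$ predicted per token. It is a didactic reference under these simplifying assumptions, not the hardware-efficient parallel scan used in practice (real Mamba implementations use a diagonal $A$ and a fused selective-scan kernel).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Minimal selective state space sketch (single channel, unbatched)."""
    def __init__(self, state_dim=16):
        super().__init__()
        self.N = state_dim
        # Continuous-time state matrix A (kept input-independent, as in Mamba).
        self.A = nn.Parameter(-torch.eye(state_dim))
        self.to_B = nn.Linear(1, state_dim)      # input-dependent B(x_k)
        self.to_C = nn.Linear(1, state_dim)      # input-dependent C(x_k)
        self.to_dt = nn.Linear(1, 1)             # input-dependent step size Δ
        self.D = nn.Parameter(torch.zeros(1))    # feedthrough coefficient

    def forward(self, x):                        # x: (L,) scalar sequence
        h = x.new_zeros(self.N)                  # latent state h_0
        ys = []
        for x_k in x:
            x_k = x_k.view(1)
            dt = F.softplus(self.to_dt(x_k))     # Δ = softplus(Linear(x))
            B = self.to_B(x_k).view(self.N, 1)
            C = self.to_C(x_k).view(1, self.N)
            dA = dt * self.A
            A_bar = torch.matrix_exp(dA)         # A_bar = exp(ΔA)
            # B_bar = (ΔA)^{-1} (exp(ΔA) - I) ΔB
            B_bar = torch.linalg.inv(dA) @ (A_bar - torch.eye(self.N)) @ (dt * B)
            h = (A_bar @ h.unsqueeze(-1) + B_bar * x_k).squeeze(-1)
            ys.append((C @ h + self.D * x_k).squeeze())
        return torch.stack(ys)

y = SelectiveSSM()(torch.randn(8))               # toy 8-token sequence
```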

3.2.2. SS2D Block for 2D Visual Processing

To adapt the 1D selective state space model for 2D visual data, we employ the SS2D (Selective Scan 2D) block illustrated in Figure 2. The SS2D block extends the selective scan mechanism to 2D spatial data through a cross-scan strategy that processes the feature map along four directions: left-to-right, right-to-left, top-to-bottom, and bottom-to-top.
Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the scan expanding operation flattens the 2D spatial dimensions into four 1D sequences corresponding to different scanning directions:
$$X^{(d)} = \mathrm{ScanExpand}^{(d)}(X), \qquad d \in \{\mathrm{lr}, \mathrm{rl}, \mathrm{tb}, \mathrm{bt}\}$$
where $X^{(d)} \in \mathbb{R}^{(H \cdot W) \times C}$ represents the flattened sequence for direction $d$. Each sequence is independently processed through the S6 Block:
$$Y^{(d)} = \mathrm{S6Block}\left(X^{(d)}\right)$$
The outputs from all four directions are merged through a learnable fusion mechanism:
$$Y = \mathrm{ScanMerge}\left(Y^{(\mathrm{lr})}, Y^{(\mathrm{rl})}, Y^{(\mathrm{tb})}, Y^{(\mathrm{bt})}\right)$$
The cross-scan strategy addresses a fundamental challenge: 1D state space models process sequences sequentially, but images have 2D spatial structure. By scanning in four directions, each pixel aggregates information from all spatial directions, enabling comprehensive spatial context modeling.
This bidirectional cross-scanning strategy ensures comprehensive spatial context aggregation while maintaining linear computational complexity $O(HWC)$, compared to the quadratic complexity $O(H^2 W^2 C)$ of standard self-attention mechanisms.
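The cross-scan bookkeeping can be summarized by the sketch below, which expands an (H, W, C) map into the four directional sequences and folds the processed sequences back onto the grid. Summation stands in for the learnable ScanMerge fusion, and the S6 processing of each sequence is elided; the helper names are illustrative.

```python
import torch

def scan_expand(x):
    """Flatten an (H, W, C) feature map into four 1-D sequences, one per
    scanning direction: left-right, right-left, top-bottom, bottom-top."""
    H, W, C = x.shape
    lr = x.reshape(H * W, C)                          # row-major scan
    rl = torch.flip(lr, dims=[0])                     # reversed row-major
    tb = x.permute(1, 0, 2).reshape(H * W, C)         # column-major scan
    bt = torch.flip(tb, dims=[0])                     # reversed column-major
    return {"lr": lr, "rl": rl, "tb": tb, "bt": bt}

def scan_merge(seqs, H, W):
    """Map each processed sequence back to (H, W, C) and sum the four views
    (a simple stand-in for the learnable fusion in the SS2D block)."""
    C = seqs["lr"].shape[-1]
    lr = seqs["lr"].reshape(H, W, C)
    rl = torch.flip(seqs["rl"], dims=[0]).reshape(H, W, C)
    tb = seqs["tb"].reshape(W, H, C).permute(1, 0, 2)
    bt = torch.flip(seqs["bt"], dims=[0]).reshape(W, H, C).permute(1, 0, 2)
    return lr + rl + tb + bt

x = torch.randn(4, 5, 8)                              # toy (H, W, C) map
seqs = scan_expand(x)
# ... each directional sequence would be processed by an S6 block here ...
y = scan_merge(seqs, 4, 5)
```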

3.2.3. S6 Block Architecture

The S6 (Selective State Space Sequence) Block, illustrated in Figure 3, constitutes the core computational unit of our MobileMamba backbone. The S6 Block incorporates layer normalization, linear projections, depthwise convolution, SiLU activation, the SS2D selective scan operation, and residual connections for stable gradient flow.
The complete S6 Block operation is formulated as:
$$Z_1 = \mathrm{LN}(X), \qquad Z_2 = \mathrm{SS2DBlock}(Z_1) + X, \qquad Z_3 = \mathrm{LN}(Z_2), \qquad Y = \mathrm{FFN}(Z_3) + Z_2$$
The internal SS2D Block processing follows:
$$U = \mathrm{Linear}(Z_1), \qquad V = \mathrm{SiLU}\left(\mathrm{DWConv}_{3\times3}(\mathrm{Linear}(Z_1))\right), \qquad W = \mathrm{SS2D}(V), \qquad O = \mathrm{Linear}\left(\mathrm{LN}(W) \odot U\right)$$
where ⊙ denotes element-wise multiplication, and the depthwise convolution introduces local inductive bias complementing the global context modeling of the selective scan operation.
The MobileMamba backbone employs a hierarchical architecture with four stages, progressively downsampling spatial resolution while increasing channel dimensions. For RGB and infrared modalities, we utilize parallel Mamba encoders with shared architecture but independent parameters, enabling modality-specific feature learning while maintaining architectural consistency.
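A compact PyTorch sketch of the S6 block formulation above is given below. The selective scan itself is left as a placeholder (`ss2d`), and the projection dimensions and expansion ratio are illustrative rather than the exact MobileMamba configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class S6Block(nn.Module):
    """Sketch of the S6 block: pre-norm SS2D mixing with a gated branch,
    followed by a pre-norm feed-forward network, both with residuals."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm_inner = nn.LayerNorm(dim)
        self.in_proj_u = nn.Linear(dim, dim)      # gating path U
        self.in_proj_v = nn.Linear(dim, dim)      # scan path V
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.out_proj = nn.Linear(dim, dim)
        self.ffn = nn.Sequential(nn.Linear(dim, expansion * dim), nn.GELU(),
                                 nn.Linear(expansion * dim, dim))

    def ss2d(self, v):                            # placeholder selective scan
        return v

    def forward(self, x):                         # x: (B, H, W, C)
        z1 = self.norm1(x)
        u = self.in_proj_u(z1)
        v = self.in_proj_v(z1).permute(0, 3, 1, 2)        # (B, C, H, W)
        v = F.silu(self.dwconv(v)).permute(0, 2, 3, 1)    # local bias + SiLU
        w = self.ss2d(v)                                   # global mixing
        z2 = self.out_proj(self.norm_inner(w) * u) + x     # gated output + residual
        return self.ffn(self.norm2(z2)) + z2               # FFN residual

y = S6Block(dim=32)(torch.randn(1, 16, 16, 32))
```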

3.3. Cross-Dimensional Attention Modules

Effective multimodal fusion requires capturing complex intermodal dependencies that span both spatial and channel dimensions. Traditional fusion approaches, such as simple concatenation or element-wise operations, fail to model the intricate correlations between heterogeneous modalities. To address this limitation, we propose two complementary attention modules: Cross-Dimensional Attention (CDA) for spatial correlation modeling and Cross-Path Attention (CPA) for channel-wise dependency capture. These modules are integrated into the Efficient Transformer Encoder, operating in parallel to provide comprehensive multimodal feature alignment.

3.3.1. Cross-Dimensional Attention (CDA)

The CDA module is designed to capture spatial correlations between RGB and infrared features by learning position-dependent attention weights that identify corresponding regions across modalities. The key insight is that spatially aligned multimodal images share semantic correspondences at specific locations, and explicitly modeling these spatial relationships enhances feature alignment quality.
Given RGB features $F_{\mathrm{rgb}} \in \mathbb{R}^{H \times W \times C}$, CDA employs a multi-branch architecture combining spatial self-attention, frequency-domain analysis, and local feature extraction, as illustrated in the left panel of Figure 4. The spatial self-attention branch computes position-wise correlations through the scaled dot-product attention mechanism:
$$A_{\mathrm{spatial}} = \mathrm{Softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d_k}}\right) V_s, \qquad \text{where } Q_s, K_s, V_s = \phi_{\mathrm{split}}\left(W_{\mathrm{exp}} F_{\mathrm{rgb}}\right)$$
where $W_{\mathrm{exp}} \in \mathbb{R}^{3C \times C}$ is the channel expansion projection, $\phi_{\mathrm{split}}(\cdot)$ partitions the expanded features into query, key, and value components, and $d_k = C$ is the scaling factor for numerical stability.
To capture complementary frequency-domain information that may be obscured in the spatial domain, we incorporate a frequency projection branch based on the discrete Fourier transform:
$$F_{\mathrm{freq}} = \mathcal{F}^{-1}\!\left(W_{\mathrm{freq}} \odot \mathcal{F}(F_{\mathrm{rgb}})\right)$$
where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote the 2D FFT and inverse FFT operations, and $W_{\mathrm{freq}}$ represents learnable frequency-domain filters applied element-wise to the spectrum. This branch captures periodic patterns and global structural information that complement the local spatial attention.
The final CDA output integrates spatial attention, frequency features, and local context through adaptive gating:
$$F_{\mathrm{CDA}} = W_o\!\left[(F_{\mathrm{sp}} + F_{\mathrm{freq}}) \odot \sigma(W_s F_{\mathrm{rgb}})\right] + V_{\mathrm{local}} \odot \sigma\!\left(W_c (F_{\mathrm{sp}} \oplus F_{\mathrm{freq}})\right)$$
where $F_{\mathrm{sp}}$ is the spatially projected attention output, $V_{\mathrm{local}} = \mathrm{DWConv}_{3\times3}(W_1 F_{\mathrm{rgb}})$ captures local context through depthwise separable convolution, $\sigma(\cdot)$ denotes the sigmoid activation, $\odot$ denotes element-wise multiplication, and $\oplus$ represents global average pooling followed by channel projection.
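The following PyTorch sketch approximates the CDA computation under the stated assumptions (element-wise frequency filtering and sigmoid gating of the spatial and frequency branches). Module and parameter names mirror the symbols above, but the exact projection shapes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossDimensionalAttention(nn.Module):
    """Sketch of CDA: spatial self-attention, a learnable frequency filter
    applied in the Fourier domain, and a depthwise local branch, combined
    through sigmoid gates."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.expand = nn.Linear(dim, 3 * dim)              # W_exp -> Q, K, V
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_filter = nn.Parameter(torch.ones(dim))   # W_freq (per channel)
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.gate_s = nn.Linear(dim, dim)                  # W_s
        self.gate_c = nn.Linear(dim, dim)                  # W_c
        self.out = nn.Linear(dim, dim)                     # W_o

    def forward(self, f_rgb):                              # (B, H, W, C)
        B, H, W, C = f_rgb.shape
        tokens = f_rgb.reshape(B, H * W, C)
        q, k, v = self.expand(tokens).chunk(3, dim=-1)
        f_sp, _ = self.attn(q, k, v)                       # spatial attention branch
        # Frequency branch: filter the 2-D spectrum, then inverse FFT.
        spec = torch.fft.rfft2(f_rgb.permute(0, 3, 1, 2))  # (B, C, H, W//2+1)
        spec = spec * self.freq_filter.view(1, C, 1, 1)
        f_freq = torch.fft.irfft2(spec, s=(H, W)).permute(0, 2, 3, 1)
        f_freq = f_freq.reshape(B, H * W, C)
        # Local branch via depthwise convolution.
        v_local = self.local(f_rgb.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        v_local = v_local.reshape(B, H * W, C)
        gate1 = torch.sigmoid(self.gate_s(tokens))
        gate2 = torch.sigmoid(self.gate_c(f_sp + f_freq))
        out = self.out((f_sp + f_freq) * gate1) + v_local * gate2
        return out.reshape(B, H, W, C)

y = CrossDimensionalAttention(dim=32)(torch.randn(1, 20, 20, 32))
```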

3.3.2. Cross-Path Attention (CPA)

While CDA focuses on spatial correlations, the CPA module captures channel-wise dependencies that encode semantic relationships between feature channels across modalities. Different channels in deep features correspond to different semantic concepts, and modeling their intermodal correlations enables more effective information exchange.
Operating on infrared features $F_{\mathrm{ir}} \in \mathbb{R}^{H \times W \times C}$, CPA computes channel attention by treating each channel as a token and applying self-attention along the channel dimension:
$$A_{\mathrm{channel}} = \mathrm{Softmax}\!\left(\frac{Q_c^{\top} K_c}{\sqrt{HW}}\right) V_c, \qquad \text{where } Q_c, K_c, V_c \in \mathbb{R}^{HW \times C}$$
where the query, key, and value matrices are obtained by reshaping the linearly projected infrared features from $\mathbb{R}^{H \times W \times 3C}$ to $\mathbb{R}^{HW \times C}$ for each component. The scaling factor $\sqrt{HW}$ normalizes the attention scores based on the spatial dimension.
The CPA output combines channel-attended features with spatially-gated local information:
$$F_{\mathrm{CPA}} = W_o\!\left[F_{\mathrm{cp}} \odot \sigma(W_c F_{\mathrm{ir}})\right] + V_{\mathrm{ir}} \odot \sigma\!\left(W_{\mathrm{sp}} F_{\mathrm{cp}}\right)$$
where $F_{\mathrm{cp}}$ is the channel-projected attention output, $V_{\mathrm{ir}} = \mathrm{DWConv}_{3\times3}(W_1 F_{\mathrm{ir}})$ provides local infrared context, and $W_{\mathrm{sp}} \in \mathbb{R}^{1 \times C}$ generates spatial attention weights through channel-wise projection.
The dual attention design ensures comprehensive multimodal feature alignment: CDA identifies where to attend by modeling spatial correspondences, while CPA determines what to attend by capturing semantic channel correlations. Together, they provide complementary perspectives for effective RGB–infrared fusion.
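A minimal sketch of the channel-token attention in CPA is shown below, assuming the $\sqrt{HW}$ normalization and sigmoid gating described above; the projections and their shapes are illustrative.

```python
import torch
import torch.nn as nn

class CrossPathAttention(nn.Module):
    """Sketch of CPA: self-attention along the channel dimension of the
    infrared features (each channel is a token), combined with a depthwise
    local branch through sigmoid gates."""
    def __init__(self, dim):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.gate_c = nn.Linear(dim, dim)          # W_c
        self.gate_sp = nn.Linear(dim, dim)         # W_sp
        self.out = nn.Linear(dim, dim)             # W_o

    def forward(self, f_ir):                       # (B, H, W, C)
        B, H, W, C = f_ir.shape
        tokens = f_ir.reshape(B, H * W, C)
        q, k, v = self.to_qkv(tokens).chunk(3, dim=-1)          # each (B, HW, C)
        # Channel-to-channel affinity (C x C), normalized by sqrt(HW).
        affinity = torch.softmax(q.transpose(1, 2) @ k / (H * W) ** 0.5, dim=-1)
        f_cp = (affinity @ v.transpose(1, 2)).transpose(1, 2)   # (B, HW, C)
        v_ir = self.local(f_ir.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        v_ir = v_ir.reshape(B, H * W, C)
        out = self.out(f_cp * torch.sigmoid(self.gate_c(tokens))) \
              + v_ir * torch.sigmoid(self.gate_sp(f_cp))
        return out.reshape(B, H, W, C)

y = CrossPathAttention(dim=32)(torch.randn(1, 20, 20, 32))
```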

3.3.3. Why Cross-Dimensional Attention Helps for Small UAVs

Small UAVs present unique detection challenges due to their limited pixel coverage and lack of discriminative local features. Our cross-dimensional attention design specifically addresses these challenges through three mechanisms. First, CDA’s spatial attention aggregates contextual information from surrounding regions, providing discriminative cues from the broader scene context when local features are insufficient. Second, CPA’s channel attention identifies which feature channels from each modality are most informative for small targets—for small UAVs, thermal signatures in IR often provide stronger signals than RGB texture, and CPA learns to emphasize these channels adaptively. Third, the parallel attention streams capture correlations at different granularities, enabling the model to leverage both fine-grained local features and coarse global patterns essential for small target detection.

3.4. Adaptive Feature Fusion Module (AFFM)

A fundamental challenge in multimodal fusion is the varying reliability of different modalities across environmental conditions. In low-light scenarios, infrared features provide more discriminative information, while in well-illuminated scenes with rich texture, RGB features may be more informative. Fixed fusion strategies that apply uniform weights fail to adapt to these variations, leading to suboptimal performance. To address this challenge, we propose the Adaptive Feature Fusion Module (AFFM), illustrated in Figure 5, that dynamically calibrates multimodal feature contributions through learned scene-dependent gating mechanisms.
Given RGB features $F_g \in \mathbb{R}^{H \times W \times C}$ and infrared features $F_l \in \mathbb{R}^{H \times W \times C}$ at a specific pyramid level, AFFM computes content-aware gating weights that reflect the local reliability of each modality. The gating mechanism is formulated as a learnable soft attention over modality contributions:
$$W_m = \sigma\!\left(\mathrm{BN}\!\left(W_g^{m} * F_m\right)\right), \qquad m \in \{g, l\}$$
where $W_g^{m} \in \mathbb{R}^{C \times C \times 1 \times 1}$ denotes the $1 \times 1$ convolutional kernel for modality $m$, $*$ represents convolution, and $\sigma(\cdot)$ is the sigmoid activation that constrains weights to $[0, 1]$.
The fusion process decomposes multimodal information into complementary and distinctive components. The complementary fusion captures shared information through weighted combination:
$$F_{\mathrm{comp}} = \left[\, W_g \odot F_g + (\mathbf{1} - W_l) \odot F_l \;;\; W_l \odot F_l + (\mathbf{1} - W_g) \odot F_g \,\right]$$
where $\mathbf{1}$ denotes an all-ones tensor of appropriate dimensions, $\odot$ is element-wise multiplication, and $[\,\cdot\,;\,\cdot\,]$ denotes channel-wise concatenation. This formulation ensures that when one modality is unreliable (low gate value), the fusion relies more heavily on the other modality.
The distinctive fusion captures modality-specific information through weighted subtraction, which highlights unique features present in only one modality:
$$F_{\mathrm{dist}} = \left[\, W_g \odot F_g - W_l \odot F_l \;;\; W_l \odot F_l - W_g \odot F_g \,\right]$$
The final fused representation aggregates both components through channel concatenation and convolutional refinement:
$$Z = \phi_{\mathrm{refine}}\!\left(\left[F_{\mathrm{comp}} \,;\, F_{\mathrm{dist}}\right]\right) = \mathrm{ReLU}\!\left(\mathrm{BN}\!\left(W_r * \left[F_{\mathrm{comp}} \,;\, F_{\mathrm{dist}}\right]\right)\right)$$
where $W_r \in \mathbb{R}^{C \times 4C \times 3 \times 3}$ is the refinement convolution kernel that reduces the concatenated $4C$ channels back to $C$ while capturing local spatial context. This design enables AFFM to adaptively emphasize reliable modality information while preserving distinctive features from both modalities.
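The gating and two-component fusion can be sketched as follows, assuming the complementary and distinctive terms are concatenated to $4C$ channels before the $3 \times 3$ refinement, as implied by the dimensions of $W_r$; the layer choices are illustrative.

```python
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    """Sketch of AFFM: per-modality sigmoid gates from 1x1 conv + BN,
    complementary (weighted-sum) and distinctive (weighted-difference)
    components, concatenated and refined by a 3x3 conv back to C channels."""
    def __init__(self, dim):
        super().__init__()
        self.gate_rgb = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim))
        self.gate_ir = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.BatchNorm2d(dim))
        self.refine = nn.Sequential(nn.Conv2d(4 * dim, dim, 3, padding=1),
                                    nn.BatchNorm2d(dim), nn.ReLU(inplace=True))

    def forward(self, f_g, f_l):                  # both (B, C, H, W)
        w_g = torch.sigmoid(self.gate_rgb(f_g))   # RGB reliability gate
        w_l = torch.sigmoid(self.gate_ir(f_l))    # IR reliability gate
        # Complementary fusion: lean on the other modality when one gate is low.
        comp = torch.cat([w_g * f_g + (1 - w_l) * f_l,
                          w_l * f_l + (1 - w_g) * f_g], dim=1)
        # Distinctive fusion: highlight modality-specific residuals.
        dist = torch.cat([w_g * f_g - w_l * f_l,
                          w_l * f_l - w_g * f_g], dim=1)
        return self.refine(torch.cat([comp, dist], dim=1))

fused = AdaptiveFeatureFusion(dim=64)(torch.randn(1, 64, 40, 40),
                                      torch.randn(1, 64, 40, 40))
```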

3.5. Dual-Attention Decoupling Module (DADM)

Small target detection presents unique challenges that standard detection heads inadequately address. UAVs often occupy minimal pixel areas and lack sufficient discriminative features when considered in isolation. However, they become more distinguishable when contextual information from surrounding regions is incorporated. Furthermore, the detection head must simultaneously handle classification and localization tasks, which benefit from different feature characteristics. To address these challenges, we propose the Dual-Attention Decoupling Module (DADM), shown in Figure 6, that enhances detection capability through hierarchical multi-scale context aggregation and decoupled spatial-channel attention processing.
Given input features $F \in \mathbb{R}^{H \times W \times C}$, DADM first constructs a multi-scale feature pyramid through cascaded dilated convolutions with progressively increasing receptive fields. The dilated convolution operation with dilation rate $r$ is defined as:
$$(W *_r F)(p) = \sum_{k} W(k) \cdot F(p + r \cdot k)$$
where p denotes the spatial position, k indexes the kernel elements, and r controls the spacing between kernel elements. This formulation enables exponentially expanding receptive fields without increasing parameters.
The hierarchical feature extraction applies cascaded dilated convolutions with rates $r = [1, 2, 4, 4]$:
$$F_i = \begin{cases} W_1 *_1 F, & i = 1 \\ W_i *_{r_i} F_{i-1}, & i \in \{2, 3, 4\} \end{cases}$$
where $W_i \in \mathbb{R}^{C \times C \times 3 \times 3}$ are learnable convolution kernels. The multi-scale features are aggregated through element-wise summation to preserve information from all receptive field scales:
$$F_{\mathrm{ms}} = \sum_{i=1}^{4} F_i = F_1 + F_2 + F_3 + F_4$$
The Dual-Attention Module (DAM) computes spatial attention through a factorized query–key interaction that reduces computational complexity from $O(H^2 W^2)$ to $O(HW)$. The attention mechanism is formulated as:
$$A_{\mathrm{DAM}} = \phi_H(W_q * F_{\mathrm{ms}}) \otimes \phi_W(W_k * F_{\mathrm{ms}})$$
where $W_q, W_k \in \mathbb{R}^{1 \times C \times 1 \times 1}$ are pointwise convolution kernels, $\phi_H : \mathbb{R}^{H \times W} \to \mathbb{R}^{H \times 1}$ and $\phi_W : \mathbb{R}^{H \times W} \to \mathbb{R}^{1 \times W}$ are learnable projection functions implemented as fully connected layers after spatial reshaping, and $\otimes$ denotes the outer product.
The final DADM output combines attention-weighted features with the original multi-scale representation through a residual connection:
$$F_{\mathrm{DADM}} = W_3 * \left(\sigma(W_a * A_{\mathrm{DAM}}) \odot F_{\mathrm{ms}}\right) + W_p * F_{\mathrm{ms}}$$
where $W_3 \in \mathbb{R}^{C \times C \times 3 \times 3}$ and $W_p \in \mathbb{R}^{C \times C \times 1 \times 1}$ are convolution kernels for the attention and residual branches, respectively, and $W_a$ projects the attention map to match the feature channel dimension. This design enables DADM to focus on discriminative spatial regions while preserving rich multi-scale contextual information essential for small target detection.
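The sketch below illustrates the DADM structure under the assumption that the factorized $\phi_H$/$\phi_W$ projections are realized by pooling the spatial dimensions of $1 \times 1$-projected maps; this is one plausible reading of the factorization, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DualAttentionDecoupling(nn.Module):
    """Sketch of DADM: cascaded dilated 3x3 convolutions (rates 1, 2, 4, 4)
    summed into a multi-scale map, a factorized HxW spatial attention built
    from pooled row/column descriptors, and a residual 1x1 branch."""
    def __init__(self, dim, rates=(1, 2, 4, 4)):
        super().__init__()
        self.dilated = nn.ModuleList(
            [nn.Conv2d(dim, dim, 3, padding=r, dilation=r) for r in rates])
        self.q = nn.Conv2d(dim, 1, 1)             # W_q
        self.k = nn.Conv2d(dim, 1, 1)             # W_k
        self.attn_proj = nn.Conv2d(1, dim, 1)     # W_a
        self.out = nn.Conv2d(dim, dim, 3, padding=1)   # W_3
        self.res = nn.Conv2d(dim, dim, 1)              # W_p

    def forward(self, x):                         # (B, C, H, W)
        feats, f = [], x
        for conv in self.dilated:                 # cascaded dilation
            f = conv(f)
            feats.append(f)
        f_ms = torch.stack(feats).sum(dim=0)      # multi-scale aggregation
        # Factorized spatial attention: (H x 1) outer (1 x W) -> (H x W).
        col = self.q(f_ms).mean(dim=3, keepdim=True)    # (B, 1, H, 1)
        row = self.k(f_ms).mean(dim=2, keepdim=True)    # (B, 1, 1, W)
        a = torch.sigmoid(self.attn_proj(col * row))    # (B, C, H, W)
        return self.out(a * f_ms) + self.res(f_ms)      # attention + residual

y = DualAttentionDecoupling(dim=64)(torch.randn(1, 64, 40, 40))
```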

3.6. Loss Function

Following the RT-DETR framework, we employ a composite loss function combining classification loss, bounding box regression loss, and Generalized IoU (GIoU) loss:
$$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{bbox}} \mathcal{L}_{\mathrm{bbox}} + \lambda_{\mathrm{giou}} \mathcal{L}_{\mathrm{giou}}$$
The classification loss employs focal loss [40] to address class imbalance:
$$\mathcal{L}_{\mathrm{cls}} = -\alpha_t \left(1 - p_t\right)^{\gamma} \log\left(p_t\right)$$
where $p_t$ is the predicted probability for the ground-truth class, $\alpha_t$ is the balancing factor, and $\gamma$ is the focusing parameter.
The bounding box regression loss uses L1 loss:
$$\mathcal{L}_{\mathrm{bbox}} = \sum_{i=1}^{4} \left| b_i - \hat{b}_i \right|$$
where $b_i$ and $\hat{b}_i$ are the ground-truth and predicted bounding box coordinates, respectively.
The GIoU loss [60] provides scale-invariant localization supervision:
$$\mathcal{L}_{\mathrm{giou}} = 1 - \mathrm{GIoU}(b, \hat{b}) = 1 - \mathrm{IoU}(b, \hat{b}) + \frac{\left| C \setminus (b \cup \hat{b}) \right|}{|C|}$$
where $C$ is the smallest enclosing box containing both $b$ and $\hat{b}$.
The loss weights are set to $\lambda_{\mathrm{cls}} = 2.0$, $\lambda_{\mathrm{bbox}} = 5.0$, and $\lambda_{\mathrm{giou}} = 2.0$ following standard practice.
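For reference, a minimal sketch of the composite loss on matched query–target pairs is shown below; bipartite matching is assumed to have been performed already. The focal loss is written in its standard binary (sigmoid) form, and the helper names are illustrative.

```python
import torch
import torch.nn.functional as F

def giou_loss(pred, gt):
    """GIoU loss for (N, 4) boxes in (x1, y1, x2, y2) format."""
    x1 = torch.max(pred[:, 0], gt[:, 0]); y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2]); y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / union.clamp(min=1e-6)
    # Smallest enclosing box C.
    cx1 = torch.min(pred[:, 0], gt[:, 0]); cy1 = torch.min(pred[:, 1], gt[:, 1])
    cx2 = torch.max(pred[:, 2], gt[:, 2]); cy2 = torch.max(pred[:, 3], gt[:, 3])
    area_c = ((cx2 - cx1) * (cy2 - cy1)).clamp(min=1e-6)
    giou = iou - (area_c - union) / area_c
    return (1.0 - giou).mean()

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary (sigmoid) focal loss on matched query logits."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def total_loss(logits, cls_targets, pred_boxes, gt_boxes,
               w_cls=2.0, w_bbox=5.0, w_giou=2.0):
    """Weighted sum of classification, L1 box, and GIoU terms."""
    return (w_cls * focal_loss(logits, cls_targets)
            + w_bbox * F.l1_loss(pred_boxes, gt_boxes)
            + w_giou * giou_loss(pred_boxes, gt_boxes))
```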

4. Experiments

4.1. Datasets

We evaluate DCAM-DETR on three multimodal datasets spanning different application domains to comprehensively assess detection performance and generalization capability.
Anti-UAV300 Dataset. Our primary evaluation is conducted on the Anti-UAV300 dataset [9], a challenging multimodal benchmark specifically designed for UAV detection. The dataset comprises 300 video sequences with spatially aligned RGB and thermal infrared imagery, captured under diverse environmental conditions including varying illumination (daytime, dusk, nighttime), weather changes (clear, cloudy, foggy), and complex backgrounds (urban, rural, sky). The dataset contains approximately 45,000 frame pairs with bounding box annotations for UAVs of various sizes (small, medium, large) and types (quadcopter, fixed-wing, helicopter). We follow the standard train/validation/test split of 200/50/50 sequences, ensuring no overlap in capture sessions between splits.
FLIR-ADAS Dataset. To evaluate generalization capability across application domains, we additionally test on the FLIR-ADAS dataset [61], which contains 10,228 thermal images and corresponding RGB images captured from automotive scenarios. The dataset includes annotations for vehicles, pedestrians, bicycles, and other road users. While not specifically designed for UAV detection, this dataset provides diverse multimodal data for validating our fusion approach in different environmental contexts.
KAIST Multispectral Dataset. We further validate on the KAIST Multispectral Pedestrian Detection Benchmark [62], containing 95,328 RGB-thermal image pairs captured in various traffic scenarios across daytime and nighttime conditions. This dataset tests our method’s robustness across different application domains and illumination conditions.

4.2. Implementation Details

Our model is implemented in PyTorch 2.0 and trained on 4 NVIDIA A100 GPUs with 80 GB memory each. We employ the AdamW optimizer [63] with an initial learning rate of $1 \times 10^{-4}$, weight decay of $1 \times 10^{-4}$, and an effective batch size of 16 (4 per GPU). The learning rate follows a cosine annealing schedule [64] over 100 epochs with 5 epochs of linear warmup. Input images are resized to $640 \times 640$ resolution with standard data augmentation including random horizontal flipping (probability 0.5), color jittering (brightness, contrast, saturation, hue), mosaic augmentation [34], and mixup [65] with probability 0.15.
The MobileMamba backbone is initialized with ImageNet-1K pre-trained weights when available. The Efficient Transformer Encoder uses 6 encoder layers with 8 attention heads and hidden dimension 256. The detection head employs 6 decoder layers with 300 object queries. For the DADM module, dilated convolution rates are set to $r \in \{1, 2, 4, 4\}$ for the four hierarchical levels. All experiments use mixed-precision training (FP16) for computational efficiency.
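A minimal sketch of the optimization schedule described above (AdamW, 5-epoch linear warmup, cosine annealing over 100 epochs) is given below; the step-wise LambdaLR realization is one common way to implement it, not necessarily the authors' exact code.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, steps_per_epoch, epochs=100,
                                  warmup_epochs=5, base_lr=1e-4, wd=1e-4):
    """AdamW with 5-epoch linear warmup followed by cosine annealing."""
    opt = AdamW(model.parameters(), lr=base_lr, weight_decay=wd)
    total = epochs * steps_per_epoch
    warmup = warmup_epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup:                                    # linear warmup
            return (step + 1) / warmup
        progress = (step - warmup) / max(1, total - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine decay

    return opt, LambdaLR(opt, lr_lambda)

# Usage with a toy model; mixed precision would additionally wrap the training
# step with torch.cuda.amp.autocast() and a GradScaler.
model = torch.nn.Linear(10, 2)
optimizer, scheduler = build_optimizer_and_scheduler(model, steps_per_epoch=100)
```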

4.3. Evaluation Metrics

We adopt standard object detection metrics following the COCO evaluation protocol for comprehensive performance assessment. The primary metrics include Mean Average Precision (mAP) at different Intersection over Union (IoU) thresholds, inference speed, and model complexity measures.
Intersection over Union (IoU). IoU quantifies the overlap between the predicted bounding box b ^ and the ground-truth bounding box b, serving as the fundamental criterion for determining detection correctness:
$$\mathrm{IoU}(b, \hat{b}) = \frac{|b \cap \hat{b}|}{|b \cup \hat{b}|} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$
A detection is considered a true positive if $\mathrm{IoU}(b, \hat{b}) \geq \tau$, where $\tau$ is the IoU threshold.
Precision and Recall. For a given IoU threshold and confidence threshold, precision measures the proportion of correct detections among all detections, while recall measures the proportion of ground-truth objects that are successfully detected:
$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.
Average Precision (AP). AP summarizes the precision–recall curve by computing the area under the curve. We employ the 101-point interpolation method:
$$\mathrm{AP} = \frac{1}{101} \sum_{r \in \{0,\, 0.01,\, \ldots,\, 1\}} p_{\mathrm{interp}}(r), \qquad \text{where } p_{\mathrm{interp}}(r) = \max_{\tilde{r} \geq r} p(\tilde{r})$$
where $p(r)$ is the precision at recall level $r$, and $p_{\mathrm{interp}}(r)$ is the interpolated precision.
Mean Average Precision (mAP). mAP averages AP across all object classes. For single-class UAV detection, mAP equals AP. We report two standard metrics:
  • mAP@0.5: AP computed at IoU threshold $\tau = 0.5$, which is lenient and focuses on detection capability.
  • mAP@0.5:0.95: AP averaged across IoU thresholds from 0.5 to 0.95 with step 0.05, which requires precise localization (a minimal computation sketch follows this list):
    $$\text{mAP@0.5:0.95} = \frac{1}{10} \sum_{\tau \in \{0.5,\, 0.55,\, \ldots,\, 0.95\}} \mathrm{AP}_{\tau}$$
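The 101-point interpolation and the mAP@0.5:0.95 average can be computed as in the sketch below; in practice the official evaluation would use pycocotools, and the toy precision–recall arrays here are purely illustrative.

```python
import numpy as np

def average_precision(recall, precision):
    """101-point interpolated AP (COCO-style): for each recall level r,
    p_interp(r) is the maximum precision at recall >= r."""
    recall_levels = np.linspace(0.0, 1.0, 101)
    ap = 0.0
    for r in recall_levels:
        mask = recall >= r
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 101.0
    return ap

def map_50_95(ap_per_threshold):
    """mAP@0.5:0.95 = mean of AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    assert len(ap_per_threshold) == 10
    return float(np.mean(ap_per_threshold))

# Toy example: a monotone precision-recall curve for one IoU threshold.
rec = np.array([0.1, 0.4, 0.7, 0.9])
prec = np.array([1.0, 0.9, 0.8, 0.6])
print(average_precision(rec, prec))
```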
Inference Speed (FPS). Frames per second measures real-time detection capability, computed as the inverse of average inference time per image. All FPS measurements are conducted on a single NVIDIA RTX 3090 GPU with batch size 1 and input resolution 640 × 640 .
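A typical way to measure FPS consistent with this protocol is sketched below (warmup iterations plus CUDA synchronization around the timed loop); for a two-stream RGB–infrared detector, `inputs` would hold both $640 \times 640$ tensors. The helper is illustrative, not the authors' benchmarking script.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, inputs, iterations=200, warmup=20):
    """FPS from averaged per-image latency, with warmup iterations and
    torch.cuda.synchronize() so asynchronous GPU kernels are fully timed."""
    model.eval()
    for _ in range(warmup):
        model(*inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        model(*inputs)
    torch.cuda.synchronize()
    return iterations / (time.perf_counter() - start)

# Example call for a hypothetical two-stream detector on GPU:
# fps = measure_fps(detector, (rgb_tensor.cuda(), ir_tensor.cuda()))
```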
Model Complexity. We report model parameters in millions (M) and floating-point operations (FLOPs) in giga-operations (G) to assess computational requirements for deployment.

4.4. Comparison with State-of-the-Art Methods

We conduct comprehensive comparisons with state-of-the-art object detection methods spanning both single-modality approaches (RGB-only and infrared-only) and multimodal fusion methods. For single-modality baselines, we include representative detectors from the YOLO series (YOLOv5 through YOLOv10) and the detection transformer family (RT-DETR). For multimodal methods, we compare against established fusion approaches including DenseFuse, U2Fusion, TarDAL, M3FD, CFT, and TransFuse. All baseline methods are retrained on the Anti-UAV300 dataset using their official implementations with optimized hyperparameters to ensure fair comparison. Table 2 presents the quantitative comparison results.
As demonstrated in Table 2, DCAM-DETR achieves superior performance across all evaluation metrics. Compared to the best single-modality method (RT-DETR on RGB), our approach improves mAP@0.5 by 9.0 percentage points (85.7% → 94.7%) and mAP@0.5:0.95 by 16.2 percentage points (62.1% → 78.3%), demonstrating the substantial advantage of multimodal fusion for UAV detection. Among multimodal methods, DCAM-DETR outperforms the previous best method (TransFuse) by 2.6 percentage points in mAP@0.5 and 7.1 percentage points in mAP@0.5:0.95, while achieving significantly higher inference speed (42 FPS vs. 25 FPS) and fewer parameters (47.6 M vs. 63.4 M).
The performance improvements are particularly notable for the more stringent mAP@0.5:0.95 metric, which requires precise localization across multiple IoU thresholds. This indicates that DCAM-DETR not only detects UAVs more reliably but also localizes them more accurately, which is critical for downstream applications such as tracking and interception.
To provide a comprehensive multi-dimensional comparison, Figure 7 presents a radar chart visualizing the performance of different methods across six key dimensions: mAP@0.5, mAP@0.5:0.95, inference speed, parameter efficiency, small target detection accuracy, and nighttime scene performance. DCAM-DETR demonstrates superior performance across all dimensions, particularly excelling in the challenging small target and nighttime scenarios where multimodal fusion provides the greatest advantage.

4.5. Qualitative Results

Beyond quantitative metrics, we provide qualitative analysis to demonstrate DCAM-DETR’s detection capability under diverse real-world conditions. Figure 8 presents representative detection results on challenging scenarios from the Anti-UAV300 dataset, encompassing low-light and nighttime conditions where RGB imagery provides limited discriminative information, complex urban backgrounds with cluttered visual patterns, small target scales where UAVs occupy minimal pixel areas, and varying weather conditions including haze and overcast skies.
The qualitative results demonstrate that DCAM-DETR consistently produces accurate detections across diverse challenging conditions. In nighttime scenarios (rows 1–2), where RGB imagery provides limited discriminative information, the model effectively leverages thermal infrared features to maintain detection accuracy. In daytime scenarios with complex urban backgrounds (rows 3–4), the model successfully distinguishes small UAV targets from cluttered backgrounds by exploiting complementary RGB texture information. The consistent detection performance across varying target scales validates the effectiveness of our multi-scale feature fusion and DADM-enhanced detection head.

4.6. Performance Under Different Environmental Conditions

To further analyze the robustness of DCAM-DETR across varying environmental conditions, we evaluate detection performance on subsets of the Anti-UAV300 test set categorized by illumination and weather conditions. Figure 9 presents the comparative analysis across five environmental scenarios: daytime clear, daytime cloudy, dusk, nighttime clear, and nighttime foggy conditions, along with the overall performance.
The results reveal that DCAM-DETR maintains consistently high performance across all environmental conditions, with mAP@0.5 ranging from 91.2% (nighttime foggy) to 96.2% (daytime clear). Notably, the performance gap between DCAM-DETR and RGB-only methods widens substantially in challenging scenarios: the improvement over RT-DETR (RGB) increases from 5.0 percentage points in daytime clear conditions to 32.8 percentage points in nighttime foggy conditions. This demonstrates that multimodal fusion provides the greatest advantage precisely when single-modality approaches struggle most, validating the practical value of our approach for real-world anti-UAV deployment.

4.7. Ablation Studies

To systematically validate the contribution of each proposed component and design choice, we conduct comprehensive ablation studies examining the MobileMamba backbone, cross-dimensional attention modules (CDA and CPA), Adaptive Feature Fusion Module (AFFM), and Dual-Attention Decoupling Module (DADM). All ablation experiments are performed on the Anti-UAV300 dataset with consistent training configurations, including identical hyperparameters, data augmentation strategies, and random seeds to ensure reproducible and fair comparisons.

4.7.1. Component-Wise Ablation

To quantify the individual contribution of each proposed component, we conduct progressive ablation experiments starting from a baseline configuration and incrementally adding each module. The baseline model employs the standard RT-DETR architecture with simple channel-wise concatenation for multimodal fusion. We then sequentially integrate the MobileMamba backbone, CDA and CPA attention modules, AFFM, and DADM to observe the performance gains from each component. Table 3 presents the progressive ablation results on the Anti-UAV300 dataset.
The baseline model concatenates RGB and infrared features along the channel dimension and processes them through the standard RT-DETR architecture, achieving 87.2% mAP@0.5. Replacing the CNN backbone with MobileMamba yields a substantial improvement of 3.1 percentage points in mAP@0.5 and 5.2 percentage points in mAP@0.5:0.95, validating the effectiveness of state space models for global context modeling in multimodal detection. The linear complexity of Mamba enables efficient processing while capturing long-range dependencies crucial for small target detection.
Adding CDA and CPA modules further improves performance by 2.2 percentage points in mAP@0.5 and 4.4 percentage points in mAP@0.5:0.95, demonstrating the importance of explicit cross-modal attention mechanisms for capturing intermodal correlations. The AFFM contributes an additional 1.3 percentage points and 3.3 percentage points improvement, validating the benefit of adaptive fusion that dynamically weights modality contributions based on scene characteristics. Finally, incorporating DADM in the detection head yields the full model performance of 94.7% mAP@0.5 and 78.3% mAP@0.5:0.95, with the hierarchical dilated convolutions and attention decomposition enhancing small target discrimination.

4.7.2. Modality Ablation

To investigate the contribution of each modality and justify the necessity of multimodal fusion over single-modality approaches with increased capacity, we conduct modality ablation experiments. Table 4 compares: (1) single-modality models using only RGB or IR input, (2) single-modality models with doubled backbone capacity to match the parameter count of our multimodal model, and (3) our full multimodal DCAM-DETR.
The results demonstrate that simply increasing single-modality model capacity cannot compensate for the lack of complementary information. RGB-only with doubled capacity (87.3% mAP@0.5) still falls 7.4 percentage points behind our multimodal approach, while using nearly twice the parameters (85.2 M vs. 47.6 M). This validates that RGB and IR modalities provide fundamentally complementary information—RGB captures texture and color details effective in well-lit conditions, while IR captures thermal signatures robust to illumination variations. The performance gap is particularly pronounced in challenging scenarios: in nighttime foggy conditions, RGB-only (2× capacity) achieves only 62.1% mAP@0.5 compared to DCAM-DETR’s 91.2%, demonstrating that multimodal fusion is essential rather than optional for robust anti-UAV detection.

4.7.3. Backbone Architecture Comparison

To validate the effectiveness of our MobileMamba backbone for multimodal feature extraction, we compare it against representative CNN-based backbones (ResNet-50, ResNet-101), transformer-based backbones (Swin-T, Swin-S), and the vanilla vision Mamba architecture (VMamba-T). All backbone variants are integrated into our detection framework with identical fusion modules and detection heads to ensure fair comparison. Table 5 presents the quantitative results on the Anti-UAV300 dataset.
MobileMamba outperforms both CNN-based (ResNet) and transformer-based (Swin) backbones while maintaining competitive inference speed. Compared to VMamba-T, our MobileMamba achieves 2.4 percentage points higher mAP@0.5 and 4.5 percentage points higher mAP@0.5:0.95, demonstrating the effectiveness of our architectural modifications for multimodal fusion.

4.7.4. Fusion Strategy Comparison

To validate the effectiveness of our proposed multimodal fusion approach, we conduct comprehensive comparisons with representative fusion strategies spanning early, late, and intermediate fusion paradigms. Early fusion methods directly concatenate RGB and infrared features at the input or shallow feature levels, while late fusion combines predictions from modality-specific detectors. Intermediate fusion approaches, including attention-based methods, integrate features at multiple abstraction levels. Table 6 presents the quantitative comparison results on the Anti-UAV300 dataset.
Our proposed fusion strategy combining CDA, CPA, and AFFM significantly outperforms all alternative approaches across both evaluation metrics. Notably, CDA + CPA outperforms vanilla cross-attention by 1.7 percentage points in mAP@0.5, demonstrating the benefit of explicit spatial and channel correlation modeling. AFFM provides an additional 4.4 percentage points improvement over simple weighted averaging, validating the importance of adaptive scene-dependent fusion.

4.7.5. DADM Component Analysis

To isolate DADM’s contribution from the dilated convolution alone, we conduct additional ablation experiments in Table 7.
The results show that dilated convolutions alone contribute 0.7 percentage points improvement, while the complete DADM with dual-attention provides an additional 1.5 percentage points, demonstrating that the attention mechanism is essential for small target discrimination beyond multi-scale context aggregation.

4.7.6. Performance by Target Size and Scene Condition

To validate our claim that DCAM-DETR excels at small target detection, we report performance breakdown by UAV size and scene condition in Table 8.
The results clearly demonstrate that DCAM-DETR provides the most significant improvements for small targets (+18.2%) and challenging nighttime/foggy conditions (+32.8%), validating the effectiveness of multimodal fusion for these challenging scenarios.

4.7.7. Per-Class AP Analysis

To demonstrate robustness across UAV types, we report per-class AP in Table 9.
DCAM-DETR demonstrates consistent performance across all UAV types, with slightly lower performance on Micro UAVs due to their extremely small size, which aligns with our analysis of small target detection challenges.

4.8. Attention Visualization

To provide interpretable insights into how DCAM-DETR leverages multimodal information for detection, we visualize the attention maps learned by our cross-dimensional attention modules. Understanding the attention patterns helps validate that the model correctly identifies discriminative regions and exploits complementary information from both modalities. Figure 10 presents the attention heatmaps overlaid on input images from both RGB and infrared modalities, revealing the spatial regions the model focuses on during the detection process.
The visualization demonstrates that our CDA and CPA modules successfully learn to focus on UAV regions while suppressing background clutter. Notably, the attention patterns exhibit complementary characteristics between RGB and infrared modalities: RGB attention emphasizes texture and shape features visible under adequate illumination, while infrared attention focuses on thermal signatures that remain robust across varying lighting conditions. This complementary behavior validates our design of separate cross-dimensional attention mechanisms for capturing modality-specific characteristics while enabling effective information fusion.

4.9. Cross-Dataset Evaluation

A critical consideration for practical deployment is whether the learned multimodal fusion strategy generalizes beyond the training domain. To rigorously evaluate this generalization capability, we test our model trained exclusively on Anti-UAV300 on two additional multimodal datasets (FLIR-ADAS and KAIST) without any fine-tuning or domain adaptation. This zero-shot transfer setting provides a stringent test of whether our architectural innovations capture general principles for RGB–infrared fusion rather than dataset-specific patterns. Table 10 presents the cross-dataset evaluation results.
DCAM-DETR achieves the best performance on both cross-domain datasets, demonstrating strong generalization capability. The consistent improvements over baseline methods (3.1 percentage points average improvement over TransFuse) validate that our multimodal fusion strategy and architectural innovations capture general principles for effective RGB–infrared fusion rather than overfitting to the Anti-UAV300 dataset characteristics.

4.10. Computational Efficiency Analysis

For practical deployment in real-time anti-UAV systems, computational efficiency is as critical as detection accuracy. We conduct comprehensive efficiency analysis comparing DCAM-DETR with representative single-modality and multimodal detection methods, measuring inference speed (FPS), model parameters, floating-point operations (FLOPs), and GPU memory consumption. All measurements are performed on a single NVIDIA RTX 3090 GPU with batch size 1 and input resolution 640 × 640 . Table 11 presents the detailed computational efficiency comparison.
DCAM-DETR achieves the best accuracy-efficiency trade-off among all compared methods. Despite processing two input modalities, our model maintains competitive FLOPs (142.3 G) and memory consumption (5.8 GB) compared to single-modality methods, while achieving substantially higher accuracy. The linear complexity of the MobileMamba backbone enables efficient processing of high-resolution multimodal inputs without the quadratic scaling of transformer-based approaches.

4.11. Edge Device Deployment

For practical anti-UAV deployment, we evaluate DCAM-DETR on edge devices. Table 12 presents results on NVIDIA Jetson AGX Orin with different optimization strategies.
Our INT8 quantized model maintains 93.2% mAP@0.5 while achieving 31 FPS on edge devices, demonstrating practical deployment feasibility for real-time anti-UAV applications.

4.12. Failure Case Analysis

To provide a comprehensive evaluation, we analyze failure cases and identify the scenarios in which DCAM-DETR struggles. Table 13 summarizes the failure categories and their impact on performance.
The analysis reveals that failures primarily occur in extreme conditions where both modalities provide limited discriminative information. AFFM failure analysis shows incorrect weighting primarily occurs when both modalities are similarly degraded (e.g., foggy night scenes). CDA failures typically result from severe viewpoint differences causing spatial misalignment, while CPA failures occur when channel semantics are ambiguous across modalities.

5. Discussion

The experimental results reveal several important insights regarding multimodal fusion for anti-UAV detection. The substantial performance improvement over single-modality methods (9.0 percentage points in mAP@0.5) underscores the fundamental importance of leveraging complementary RGB and infrared information. The attention visualization in Figure 10 confirms that DCAM-DETR learns modality-specific attention patterns: RGB attention emphasizes texture and shape features, while infrared attention focuses on thermal signatures robust to illumination variations. The strong cross-dataset generalization across FLIR-ADAS and KAIST datasets suggests that our architectural innovations capture general principles for RGB–infrared fusion rather than dataset-specific patterns.
The ablation studies provide quantitative evidence for each component’s contribution. The MobileMamba backbone’s 3.1 percentage point improvement validates the importance of linear-complexity global context modeling for small target detection. The 2.2 percentage point gain from CDA and CPA demonstrates that explicit cross-modal attention across both spatial and channel dimensions significantly outperforms simple fusion strategies. AFFM’s adaptive gating mechanism and DADM’s multi-scale contextual aggregation further enhance performance by addressing modality reliability variations and small target discrimination challenges, respectively.
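To make the adaptive gating idea concrete, the following sketch reconstructs an AFFM-style fusion block from the description accompanying Figure 5 (parallel 1 × 1 conv–BN–ReLU branches, sigmoid gates, element-wise fusion, and a 3 × 3 refinement). It is an illustrative approximation, not the released implementation, and the exact fusion arithmetic is assumed.

```python
import torch
import torch.nn as nn

class AFFMSketch(nn.Module):
    """Illustrative reconstruction of an adaptive feature fusion block
    (not the official DCAM-DETR code)."""

    def __init__(self, channels):
        super().__init__()
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.ir_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # Sigmoid gates estimate per-location modality reliability.
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.ir_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.refine = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, f_rgb, f_ir):
        feat_g = self.rgb_branch(f_rgb)
        feat_l = self.ir_branch(f_ir)
        w_g, w_l = self.rgb_gate(feat_g), self.ir_gate(feat_l)
        # Gated fusion: each modality is re-weighted, and the element-wise
        # residual highlights cues that the other modality lacks.
        fused_g = w_g * feat_g + (1.0 - w_g) * (feat_l - feat_g)
        fused_l = w_l * feat_l + (1.0 - w_l) * (feat_g - feat_l)
        return self.refine(torch.cat([fused_g, fused_l], dim=1))
```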

Limitations and Societal Impact

Despite these strengths, several limitations warrant discussion:
Technical Limitations: (1) DCAM-DETR requires spatially aligned RGB–infrared image pairs, and misalignment beyond 5 pixels causes performance degradation (mAP drops by 3.2% at 10-pixel misalignment). (2) The sequential nature of state space models limits parallelization compared to fully convolutional architectures. (3) Performance degrades significantly for extremely small targets (less than 10 pixels) where both modalities provide limited discriminative information, achieving only 58.4% mAP@0.5. (4) Computational requirements may limit deployment on highly resource-constrained devices without quantization. (5) Training requires paired RGB-IR data with annotations, which is expensive to collect compared to single-modality datasets.
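The alignment sensitivity quantified in point (1) can be probed by synthetically translating the infrared input before inference. The helper below generates such shifted pairs; the evaluation call in the usage comment is a placeholder.

```python
import torch

def shift_infrared(ir, dx, dy):
    """Translate an infrared tensor (B, C, H, W) by (dx, dy) pixels with zero
    padding, simulating RGB-infrared registration error."""
    shifted = torch.zeros_like(ir)
    h, w = ir.shape[-2:]
    if dx >= 0:
        dst_x, src_x = slice(dx, w), slice(0, w - dx)
    else:
        dst_x, src_x = slice(0, w + dx), slice(-dx, w)
    if dy >= 0:
        dst_y, src_y = slice(dy, h), slice(0, h - dy)
    else:
        dst_y, src_y = slice(0, h + dy), slice(-dy, h)
    shifted[..., dst_y, dst_x] = ir[..., src_y, src_x]
    return shifted

# Example probe at increasing offsets (evaluate_pair is a placeholder):
#   for off in (0, 5, 10):
#       print(off, evaluate_pair(model, rgb, shift_infrared(ir, off, 0)))
```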
Societal Impact: Our anti-UAV detection system has both positive and negative societal implications. Positive aspects include enhanced security for critical infrastructure, airports, prisons, and public events, as well as improved safety in emergency response scenarios where unauthorized UAVs pose risks. Negative aspects include potential misuse for surveillance purposes, privacy concerns when deployed in public spaces, and dual-use technology considerations where the same detection capabilities could be used to evade legitimate UAV operations. We advocate for responsible deployment with appropriate regulatory oversight, transparency about system capabilities and limitations, and adherence to ethical guidelines for surveillance technologies.
Future work could explore explicit alignment modules to handle misaligned inputs, video-based temporal extensions for improved tracking, lightweight variants optimized for edge deployment, and self-supervised learning approaches to reduce annotation requirements.

6. Conclusions

In this paper, we proposed DCAM-DETR, a novel multimodal detection framework that synergistically integrates Mamba-based state space models with the RT-DETR architecture for robust real-time anti-UAV detection. Our approach introduces four principal innovations: a MobileMamba backbone for efficient global context modeling with linear computational complexity, Cross-Dimensional Attention (CDA) and Cross-Path Attention (CPA) modules for fine-grained multimodal feature alignment, an Adaptive Feature Fusion Module (AFFM) for scene-dependent fusion, and a Dual-Attention Decoupling Module (DADM) for enhanced small target detection.
Comprehensive experiments on Anti-UAV300, FLIR-ADAS, and KAIST datasets demonstrate that DCAM-DETR achieves state-of-the-art performance with 94.7% mAP@0.5 on Anti-UAV300, outperforming existing methods by substantial margins while maintaining real-time inference speed of 42 FPS. The strong cross-dataset generalization validates the effectiveness of our multimodal fusion strategy across diverse application domains. Ablation studies confirm the contribution of each proposed component, with the MobileMamba backbone, cross-dimensional attention modules, AFFM, and DADM collectively contributing to the superior performance.
The success of DCAM-DETR demonstrates the potential of state space models for multimodal visual perception tasks, offering an efficient alternative to transformer-based approaches while maintaining strong representational capability. Future work will explore temporal extensions for video-based detection, lightweight variants for edge deployment, and self-supervised learning approaches to reduce annotation requirements.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, Y.L.; validation, Y.L.; formal analysis, Y.L.; investigation, Y.L.; resources, Z.Q.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Z.Q.; visualization, Y.L.; supervision, Z.Q.; project administration, Z.Q.; funding acquisition, Z.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The Anti-UAV300 dataset is publicly available at https://github.com/ZhaoJ9014/Anti-UAV (accessed on 15 October 2025). The FLIR-ADAS and KAIST datasets are available from their respective sources. The source code and pre-trained models are available upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned aerial vehicles (UAVs): A survey on civil applications and key research challenges. IEEE Access 2019, 7, 48572–48634. [Google Scholar] [CrossRef]
  2. Mohsan, S.A.H.; Khan, M.A.; Noor, F.; Ullah, I.; Alsharif, M.H. Towards the unmanned aerial vehicles (UAVs): A comprehensive review. Drones 2022, 6, 147. [Google Scholar] [CrossRef]
  3. Güvenç, İ.; Koohifar, F.; Singh, S.; Sichitiu, M.L.; Matolak, D. Detection, tracking, and interdiction for amateur drones. IEEE Commun. Mag. 2018, 56, 75–81. [Google Scholar] [CrossRef]
  4. Shi, X.; Yang, C.; Xie, W.; Liang, C.; Shi, Z.; Chen, J. Anti-drone system with multiple surveillance technologies: Architecture, implementation, and challenges. IEEE Commun. Mag. 2018, 56, 68–74. [Google Scholar] [CrossRef]
  5. Federal Aviation Administration. FAA Unmanned Aircraft Systems (UAS) Traffic Management. 2024. Available online: https://www.faa.gov/uas (accessed on 15 October 2025).
  6. Wu, X.; Dong, J.; Bao, W.; Zou, B.; Wang, L.; Wang, H. Augmented intelligence of things for emergency vehicle secure trajectory prediction and task offloading. IEEE Internet Things J. 2024, 11, 36030–36043. [Google Scholar] [CrossRef]
  7. Ezuma, M.; Erden, F.; Anjinappa, C.K.; Ozdemir, O.; Güvenç, İ. Radar cross section based statistical recognition of UAVs at microwave frequencies. IEEE Trans. Aerosp. Electron. Syst. 2020, 58, 27–46. [Google Scholar] [CrossRef]
  8. Coluccia, A.; Fascista, A.; Schumann, A.; Sommer, L.; Ghenescu, M.; Piatrik, T.; De Cubber, G.; Nalamati, M.; Kapoor, A.; Saqib, M.; et al. Drone-vs-bird detection challenge at IEEE AVSS2019. arXiv 2019, arXiv:1910.07360. [Google Scholar]
  9. Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Zhao, J.; Guo, Z.; Han, Z.; et al. Anti-UAV: A large-scale benchmark for vision-based UAV tracking. IEEE Trans. Multimed. 2022, 25, 486–500. [Google Scholar] [CrossRef]
  10. Huang, B.; Chen, J.; Xu, T.; Wang, Y.; Jiang, S.; Wang, Y.; Wang, L.; Li, J. Anti-UAV410: A thermal infrared benchmark and customized scheme for tracking drones in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2852–2865. [Google Scholar] [CrossRef]
  11. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar] [CrossRef]
  12. Rozantsev, A.; Lepetit, V.; Fua, P. Detecting flying objects using a single moving camera. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 879–892. [Google Scholar] [CrossRef]
  13. Li, J.; Fan, C.; Ou, C.; Zhang, H. Infrared and Visible Image Fusion Techniques for UAVs: A Comprehensive Review. Drones 2025, 9, 811. [Google Scholar] [CrossRef]
  14. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar] [CrossRef]
  15. Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal object detection via probabilistic ensembling. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2022. [Google Scholar] [CrossRef]
  16. Wang, S.; Wang, C.; Shi, C.; Liu, Y.; Lu, M. Mask-guided mamba fusion for drone-based visible-infrared vehicle detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005712. [Google Scholar] [CrossRef]
  17. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2019, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
  18. Zhang, J.; Liu, H.; Yang, K.; Hu, X.; Liu, R.; Stiefelhagen, R. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Trans. Intell. Transp. Syst. 2023, 24, 14679–14694. [Google Scholar] [CrossRef]
  19. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar] [CrossRef]
  20. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  21. El Ahmar, W.; Massoud, Y.; Kolhatkar, D.; AlGhamdi, H.; Alja’Afreh, M.; Hammoud, R.; Laganiere, R. Enhanced thermal-RGB fusion for robust object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 365–374. [Google Scholar]
  22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  23. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  24. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar] [CrossRef]
  25. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar] [CrossRef]
  26. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
  27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar] [CrossRef]
  28. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  29. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  30. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  31. Wang, G.; Song, M.; Hwang, J.N. Recent Advances in Embedding Methods for Multi-Object Tracking: A Survey. arXiv 2022, arXiv:2205.10766. [Google Scholar] [CrossRef]
  32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar] [CrossRef]
  33. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  34. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  35. Jocher, G.; Chaurasia, A.; Stoken, A.; Borber, J.; NanoCode012; Kwon, Y.; Michael, K.; TaoXie; Fang, J.; imyhxy; et al. YOLOv5 by Ultralytics. GitHub Repos. 2022. Available online: https://github.com/ultralytics/yolov5 (accessed on 15 October 2025).
  36. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  37. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  38. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  39. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  40. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  41. Zhou, L.; Liu, Z.; Zhao, H.; Hou, Y.E.; Liu, Y.; Zuo, X.; Dang, L. A multi-scale object detector based on coordinate and global information aggregation for UAV aerial images. Remote Sens. 2023, 15, 3468. [Google Scholar] [CrossRef]
  42. Zhang, X.; Ye, P.; Xiao, G. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
  43. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
  44. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef] [PubMed]
  45. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
  46. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  47. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransFuse: Fusing transformers and CNNs for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 18–22 September 2022; pp. 14–24. [Google Scholar] [CrossRef]
  48. Yang, P.; Gao, J.; Chen, W. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15612–15631. [Google Scholar] [CrossRef]
  49. Dong, A.; Wang, L.; Liu, J.; Xu, J.; Zhao, G.; Zhai, Y.; Lv, G.; Cheng, J. Co-enhancement of multi-modality image fusion and object detection via feature adaptation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12624–12637. [Google Scholar] [CrossRef]
  50. Peng, S.; Zhu, X.; Cao, X.; Deng, C. FusionMamba: Efficient Remote Sensing Image Fusion with State Space Model. arXiv 2024, arXiv:2404.07932. [Google Scholar] [CrossRef]
  51. Cai, Q.; Pan, Y.; Yao, T.; Ngo, C.W.; Mei, T. Objectfusion: Multi-modal 3d object detection with object-centric fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 18067–18076. [Google Scholar]
  52. Wang, X.; Wang, S.; Ding, Y.; Li, Y.; Wu, W.; Rong, Y.; Kong, W.; Huang, J.; Li, S.; Yang, H.; et al. State space model for new-generation network alternative to transformers: A survey. arXiv 2024, arXiv:2404.09516. [Google Scholar] [CrossRef]
  53. Gupta, A.; Gu, A.; Berant, J. Diagonal state spaces are as effective as structured state spaces. Adv. Neural Inf. Process. Syst. 2022, 35, 22982–22994. [Google Scholar]
  54. Zhao, H.; Yan, L.; Hou, Z.; Lin, J.; Zhao, Y.; Ji, Z.; Wang, Y. Error Analysis Strategy for Long-term Correlated Network Systems: Generalized Nonlinear Stochastic Processes and Dual-Layer Filtering Architecture. IEEE Internet Things J. 2025, 12, 33731–33745. [Google Scholar] [CrossRef]
  55. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual state space model with windowed selective scan. arXiv 2024, arXiv:2403.09338. [Google Scholar] [CrossRef]
  56. Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. In Proceedings of the Computer Vision and Pattern Recognition Conference, Vancouver, BC, Canada, 17–24 June 2025; pp. 25261–25270. [Google Scholar]
  57. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar] [CrossRef]
  58. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3651–3660. [Google Scholar] [CrossRef]
  59. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329. [Google Scholar] [CrossRef]
  60. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar] [CrossRef]
  61. FLIR Systems. FLIR Thermal Dataset for Algorithm Training. 2019. Available online: https://www.flir.com/oem/adas/adas-dataset-form/ (accessed on 15 October 2025).
  62. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar] [CrossRef]
  63. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar] [CrossRef]
  64. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar] [CrossRef]
  65. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar] [CrossRef]
  66. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. GitHub Repos. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 October 2025).
  67. Feng, Y.; Luo, E.; Lu, H.; Zhai, S. Cross-modality feature fusion for night pedestrian detection. Front. Phys. 2024, 12, 1356248. [Google Scholar] [CrossRef]
  68. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of DCAM-DETR. The framework processes aligned RGB and infrared image pairs through the MobileMamba backbone, which extracts hierarchical features at multiple scales (P3, P4, P5). These features are subsequently processed through the Efficient Transformer Encoder with CDA and CPA modules for cross-modal interaction. The AFFM adaptively fuses multimodal features, which are then refined through DADM-enhanced CCFF structures. Finally, the Uncertainty-minimal Query Selection mechanism and Decoder with detection Head generate the final predictions. Note: The Chinese characters visible in some images are metadata overlays from the camera equipment (date, time, and camera settings), which are inherent to the original captured images and represent authentic real-world data conditions.
Figure 2. Architecture of the SS2D Block. The input feature map undergoes scan expanding to generate four directional sequences, which are processed through the S6 Block containing the selective state space model. The outputs are then merged through scan merging to reconstruct the 2D spatial structure while capturing global dependencies.
Figure 3. Architecture of the S6 Block. The block consists of layer normalization (LN), SS2D Block for selective state space processing, followed by another LN and feed-forward network (FFN). Residual connections ensure stable gradient propagation. The SS2D Block internally comprises linear projection, depthwise convolution (DWConv), SiLU activation, SS2D selective scan, LN, and linear output projection. Note: The Chinese characters visible in some images are metadata overlays from the camera equipment (date, time, and camera settings), which are inherent to the original captured images and represent authentic real-world data conditions.
Figure 4. Architecture of Cross-Dimensional Attention (CDA, left panel) and Cross-Path Attention (CPA, right panel) modules. CDA captures spatial correlations through spatial self-attention, frequency projection, and channel projection with sigmoid gating. CPA models channel-wise dependencies through channel self-attention and spatial projection mechanisms. Both modules employ parallel attention streams with residual connections.
Figure 5. Architecture of the Adaptive Feature Fusion Module (AFFM). The module processes RGB features F g and infrared features F l through parallel branches with 1 × 1 convolution, batch normalization, and ReLU activation. Sigmoid-activated gating weights enable adaptive fusion through element-wise multiplication and subtraction operations. The fused features are concatenated and refined through 3 × 3 convolution with BN and ReLU to produce the output Z.
Figure 6. Architecture of the Dual-Attention Decoupling Module (DADM). The left panel shows the overall structure with hierarchical dilated convolutions (D.C.) at rates r { 1 , 2 , 4 , 4 } , parallel attention streams, and the Dual-Attention Module (DAM). The right panel details the DAM architecture, which employs pointwise convolutions (PW Conv), reshape operations, and fully connected (FC) layers to compute spatial attention through query–key multiplication.
Figure 7. Multi-dimensional performance comparison using radar chart. The chart compares DCAM-DETR with representative methods across six evaluation dimensions: detection accuracy (mAP@0.5 and mAP@0.5:0.95), computational efficiency (Speed and Params Efficiency), and challenging scenario performance (Small Target and Night Scene). DCAM-DETR achieves the largest coverage area, indicating superior overall performance across all dimensions.
Figure 8. Qualitative detection results on Anti-UAV300 dataset. The figure shows detection results across diverse challenging scenarios including nighttime conditions (rows 1–2), daytime with complex urban backgrounds (rows 3–4), and varying target scales. Each image displays the detected UAV with bounding box and confidence score. DCAM-DETR consistently produces accurate detections across diverse conditions, demonstrating robust multimodal fusion capability. Note: The Chinese characters visible in some images are metadata overlays from the camera equipment (date, time, and camera settings), which are inherent to the original captured images and represent authentic real-world data conditions.
Figure 9. Detection performance comparison under various environmental conditions. The bar chart shows mAP@0.5 for different methods across five environmental scenarios plus overall performance, while the purple dashed line indicates the improvement of DCAM-DETR over RGB-only RT-DETR. DCAM-DETR maintains consistently high performance across all conditions, with the most significant improvements observed in challenging nighttime and foggy scenarios where RGB imagery alone provides limited discriminative information.
Figure 10. Attention heatmap visualization. The figure shows RGB images (rows 1, 3), infrared images (rows 2, 4), and corresponding attention heatmaps. The heatmaps demonstrate that DCAM-DETR effectively focuses on UAV regions while suppressing background distractions, with complementary attention patterns between RGB and infrared modalities. Note: The Chinese characters visible in some images are metadata overlays from the camera equipment (date, time, and camera settings), which are inherent to the original captured images and represent authentic real-world data conditions.
Table 1. Comparison of DCAM-DETR with existing Mamba-based methods.
| Method | Application | Modality | Fusion Strategy | Key Innovation |
|---|---|---|---|---|
| Mamba-UNet | Medical Seg. | Single | N/A | U-Net with SSM |
| VMamba | Classification | Single | N/A | Cross-scan 2D |
| Vision Mamba | General Vision | Single | N/A | Bidirectional SSM |
| FusionMamba | Image Fusion | Multi | Early | Dual-stream SSM |
| DCAM-DETR | Anti-UAV Det. | RGB + IR | Adaptive | CDA/CPA + AFFM |
Table 2. Performance comparison on Anti-UAV300 dataset. Best results are in bold, second best are underlined. indicates methods retrained on Anti-UAV300 with official implementations.
| Method | Modality | mAP@0.5 | mAP@0.5:0.95 | FPS | Params (M) |
|---|---|---|---|---|---|
| Single-Modality Methods (RGB) | | | | | |
| YOLOv5-L [35] | RGB | 78.3 | 54.2 | 68 | 46.5 |
| YOLOv7 [37] | RGB | 81.5 | 57.8 | 52 | 37.2 |
| YOLOv8-L [66] | RGB | 82.8 | 58.9 | 58 | 43.7 |
| YOLOv9-C [38] | RGB | 83.2 | 59.4 | 48 | 51.8 |
| YOLOv10-L [39] | RGB | 84.1 | 60.7 | 55 | 44.3 |
| RT-DETR-L [26] | RGB | 85.7 | 62.1 | 45 | 42.3 |
| Single-Modality Methods (Infrared) | | | | | |
| YOLOv5-L [35] | IR | 74.6 | 51.3 | 68 | 46.5 |
| YOLOv7 [37] | IR | 77.8 | 53.9 | 52 | 37.2 |
| YOLOv8-L [66] | IR | 79.4 | 55.6 | 58 | 43.7 |
| RT-DETR-L [26] | IR | 82.4 | 58.7 | 45 | 42.3 |
| Multimodal Fusion Methods | | | | | |
| DenseFuse [17] | RGB + IR | 84.9 | 60.3 | 35 | 52.1 |
| U2Fusion [44] | RGB + IR | 86.2 | 61.8 | 32 | 48.7 |
| TarDAL [14] | RGB + IR | 88.5 | 64.2 | 28 | 55.3 |
| M3FD [14] | RGB + IR | 90.3 | 67.5 | 38 | 49.8 |
| CFT [67] | RGB + IR | 91.2 | 69.8 | 31 | 58.2 |
| TransFuse [47] | RGB + IR | 92.1 | 71.2 | 25 | 63.4 |
| Recent State-of-the-Art (CVPR/ICCV/ECCV 2023–2024) | | | | | |
| MBNet (ICCV’23) | RGB + IR | 90.5 | 68.9 | 32 | 51.2 |
| BAANet (ICCV’23) | RGB + IR | 91.2 | 69.8 | 30 | 54.6 |
| CAT-Det (CVPR’24) | RGB + IR | 91.8 | 70.4 | 28 | 56.8 |
| SuperYOLO (ECCV’24) | RGB + IR | 92.3 | 72.1 | 35 | 48.9 |
| CIAN (CVPR’24) | RGB + IR | 92.8 | 73.5 | 26 | 61.3 |
| DCAM-DETR (Ours) | RGB + IR | 94.7 | 78.3 | 42 | 47.6 |
Table 3. Component-wise ablation study on Anti-UAV300 dataset. Each row progressively adds one component to the baseline.
| Configuration | mAP@0.5 | mAP@0.5:0.95 | FPS | Params (M) |
|---|---|---|---|---|
| Baseline (RT-DETR + Concat) | 87.2 | 63.5 | 44 | 45.1 |
| + MobileMamba Backbone | 90.3 | 68.7 | 43 | 46.8 |
| + CDA & CPA | 92.5 | 73.1 | 42 | 47.2 |
| + AFFM | 93.8 | 76.4 | 42 | 47.5 |
| + DADM (Full Model) | 94.7 | 78.3 | 42 | 47.6 |
Table 4. Modality ablation study on Anti-UAV300 dataset.
| Configuration | mAP@0.5 | mAP@0.5:0.95 | FPS | Params (M) |
|---|---|---|---|---|
| RGB-only (Standard) | 85.7 | 62.1 | 45 | 42.3 |
| RGB-only (2× Capacity) | 87.3 | 64.1 | 38 | 85.2 |
| IR-only (Standard) | 82.4 | 58.7 | 45 | 42.3 |
| IR-only (2× Capacity) | 84.6 | 60.8 | 38 | 85.2 |
| DCAM-DETR (RGB + IR) | 94.7 | 78.3 | 42 | 47.6 |
Table 5. Comparison of different backbone architectures on Anti-UAV300 dataset.
| Backbone | mAP@0.5 | mAP@0.5:0.95 | FPS | Params (M) |
|---|---|---|---|---|
| ResNet-50 [68] | 88.4 | 64.8 | 48 | 44.2 |
| ResNet-101 [68] | 89.1 | 66.2 | 42 | 63.1 |
| Swin-T [24] | 90.8 | 69.4 | 35 | 48.6 |
| Swin-S [24] | 91.5 | 71.2 | 28 | 69.3 |
| VMamba-T [30] | 92.3 | 73.8 | 40 | 46.2 |
| MobileMamba (Ours) | 94.7 | 78.3 | 42 | 47.6 |
Table 6. Comparison of different fusion strategies on Anti-UAV300 dataset.
| Fusion Strategy | mAP@0.5 | mAP@0.5:0.95 | Params (M) |
|---|---|---|---|
| Early Fusion (Concat) | 87.2 | 63.5 | 45.1 |
| Late Fusion (Ensemble) | 88.6 | 65.4 | 90.2 |
| Attention Fusion [45] | 90.4 | 68.9 | 46.3 |
| Vanilla Cross-Attention | 90.8 | 70.2 | 48.5 |
| Cross-Attention [47] | 91.8 | 71.5 | 52.8 |
| Simple Weighted Average | 89.4 | 67.3 | 45.8 |
| CDA + CPA + AFFM (Ours) | 94.7 | 78.3 | 47.6 |
Table 7. DADM component analysis on Anti-UAV300 dataset.
| Detection Head Configuration | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|
| Standard Head | 92.5 | 73.1 |
| + Dilated Convolutions only | 93.2 | 74.8 |
| + Dual-Attention Module (DAM) | 94.1 | 76.9 |
| + Full DADM (Ours) | 94.7 | 78.3 |
Table 8. Performance breakdown by UAV size and scene condition on Anti-UAV300.
| Category | DCAM-DETR | RT-DETR (RGB) | Improvement |
|---|---|---|---|
| By Target Size (mAP@0.5) | | | |
| Small (<32 × 32 pixels) | 89.4 | 71.2 | +18.2 |
| Medium (32–96 pixels) | 95.8 | 86.4 | +9.4 |
| Large (>96 × 96 pixels) | 97.3 | 93.1 | +4.2 |
| By Scene Condition (mAP@0.5) | | | |
| Daytime Clear | 96.2 | 91.2 | +5.0 |
| Daytime Cloudy | 95.4 | 88.7 | +6.7 |
| Dusk/Dawn | 94.1 | 82.3 | +11.8 |
| Nighttime Clear | 93.8 | 68.5 | +25.3 |
| Nighttime Foggy | 91.2 | 58.4 | +32.8 |
Table 9. Per-class AP analysis on Anti-UAV300 dataset.
| UAV Type | Count | AP@0.5 | AP@0.5:0.95 |
|---|---|---|---|
| Quadcopter | 156 | 95.2 | 79.1 |
| Fixed-wing | 78 | 93.8 | 76.8 |
| Helicopter | 42 | 94.1 | 77.5 |
| Micro UAV | 24 | 91.3 | 72.4 |
| Overall | 300 | 94.7 | 78.3 |
Table 10. Cross-dataset evaluation results (models trained on Anti-UAV300, tested on other datasets without fine-tuning).
| Method | FLIR-ADAS mAP@0.5 | KAIST mAP@0.5 | Average |
|---|---|---|---|
| RT-DETR [26] | 72.3 | 68.7 | 70.5 |
| M3FD [14] | 75.8 | 71.2 | 73.5 |
| CFT [67] | 77.2 | 73.8 | 75.5 |
| TransFuse [47] | 78.4 | 74.6 | 76.5 |
| DCAM-DETR (Ours) | 81.2 | 77.9 | 79.6 |
Table 11. Computational efficiency comparison on Anti-UAV300 dataset.
| Method | mAP@0.5 | FPS | Params (M) | FLOPs (G) | Memory (GB) |
|---|---|---|---|---|---|
| YOLOv8-L [66] | 82.8 | 58 | 43.7 | 165.2 | 4.2 |
| RT-DETR-L [26] | 85.7 | 45 | 42.3 | 136.8 | 5.1 |
| TransFuse [47] | 92.1 | 25 | 63.4 | 248.6 | 8.7 |
| DCAM-DETR (Ours) | 94.7 | 42 | 47.6 | 142.3 | 5.8 |
Table 12. Edge device deployment on NVIDIA Jetson AGX Orin.
| Configuration | mAP@0.5 | FPS | Power (W) |
|---|---|---|---|
| DCAM-DETR (FP32) | 94.7 | 16 | 55 |
| DCAM-DETR (FP16) | 94.5 | 24 | 48 |
| DCAM-DETR (INT8) | 93.2 | 31 | 44 |
| DCAM-DETR (Pruned 50%) | 92.1 | 28 | 46 |
| YOLOv8-L (INT8) | 81.2 | 52 | 38 |
| RT-DETR-L (INT8) | 84.1 | 35 | 42 |
Table 13. Failure case analysis on Anti-UAV300 dataset.
| Failure Category | mAP@0.5 | Failure Rate |
|---|---|---|
| Extreme Occlusion (>70%) | 67.3 | 18.2% |
| Severe Motion Blur | 72.1 | 12.4% |
| Thermal Camouflage | 69.8 | 8.7% |
| Very Small Targets (<10 pixels) | 58.4 | 15.3% |
| CDA Spatial Misalignment | – | 12.8% |
| CPA Channel Confusion | – | 15.4% |
| AFFM Wrong Modality Weight | – | 10.9% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Back to TopTop