Article

Fine-Grained Multispectral Fusion for Oriented Object Detection in Remote Sensing

Xin Lan, Shaolin Zhang, Yuhao Bai and Xiaolin Qin *
1 Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu 610213, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(22), 3769; https://doi.org/10.3390/rs17223769
Submission received: 26 September 2025 / Revised: 27 October 2025 / Accepted: 7 November 2025 / Published: 20 November 2025

Highlights

What are the main findings?
  • A novel visible–infrared fusion framework for oriented object detection, FGMF, is proposed, achieving mAP50 of 80.2% and 66.3% on the DroneVehicle and VEDAI datasets, respectively, with only 87.2M parameters.
  • A dual-enhancement and fusion module (DEFM) is proposed for fine-grained multispectral feature calibration and fusion, and an orientation aggregation module (OAM) is designed to capture directional context.
What are the implications of the main findings?
  • The study provides an effective solution to the critical challenges of modality misalignment and limited orientation sensitivity, significantly advancing the robustness of object detection in complex scenarios, such as low illumination, arbitrary orientations and dense arrangement.
  • The DEFM and OAM modules represent significant advancements in multispectral fusion and orientation modeling, offering transferable architectural designs that can benefit numerous vision tasks beyond remote sensing applications.

Abstract

Infrared–visible-oriented object detection aims to combine the strengths of both infrared and visible images, overcoming the limitations of a single imaging modality to achieve more robust detection with oriented bounding boxes under diverse environmental conditions. However, current methods often suffer from two issues: (1) modality misalignment caused by hardware and annotation errors, leading to inaccurate feature fusion that degrades downstream task performance; and (2) insufficient directional priors in square convolutional kernels, impeding robust detection of objects with diverse directions, especially in densely packed scenes. To tackle these challenges, in this paper, we propose a novel method, Fine-Grained Multispectral Fusion (FGMF), for oriented object detection in paired aerial images. Specifically, we design a dual-enhancement and fusion module (DEFM) to obtain calibrated and complementary features through weighted addition and subtraction-based attention mechanisms. Furthermore, we propose an orientation aggregation module (OAM) that employs large rotated strip convolutions to capture directional context and long-range dependencies. Extensive experiments on the DroneVehicle and VEDAI datasets demonstrate the effectiveness of our proposed method, yielding mAP50 scores of 80.2% and 66.3%, respectively. These results highlight the robustness of FGMF for oriented object detection in complex remote sensing scenarios.

1. Introduction

Object detection plays a crucial role in the field of remote sensing with extensive applications, such as disaster monitoring and urban planning. Due to the bird’s-eye perspective, objects in aerial images usually appear in various scales and orientations. Consequently, numerous oriented object detectors [1,2,3,4] have been proposed and achieved promising performance on challenging benchmark datasets [5,6]. However, these detectors are designed for visible images, which typically provide detailed object information, such as texture and color. These images often suffer from unfavorable conditions, such as low illumination, fog, and cloud cover, which can obscure objects and significantly degrade the detection performance. In contrast to visible imaging, infrared cameras can capture objects under adverse conditions, because of their ability to detect emitted thermal radiation. The complementary characteristics of visible (RGB) and infrared (IR) images enhance their utility, leading to their widespread application in tasks such as pedestrian detection [7,8,9] and crowd counting [10,11,12].
Although the practical significance of the Infrared–Visible Oriented Object Detection (IVOOD) task is evident, one important challenge is the modality misalignment problem. Existing methods [13,14,15,16,17,18,19] generally assume perfect geometric alignment between visible–infrared image pairs and proceed with direct multimodal fusion techniques. In practice, as shown in Figure 1, the paired images are only weakly aligned due to hardware limitations such as radiation distortion and clock skew, as well as inevitable annotation inaccuracies [20]. These issues result not only in positional shifts but also in rotational discrepancies between modalities. Moreover, in extreme cases, objects may even disappear completely in one modality. These geometric and semantic misalignments lead to inaccurate feature fusion, thereby degrading detection performance and highlighting the need for more robust cross-modal alignment techniques.
Another challenge in oriented object detection is limited orientation sensitivity. Objects in remote sensing images often exhibit diverse and arbitrary orientations, necessitating models that can accurately perceive and adapt to such variations. Previous approaches [1,3,4,21] often employ deformable convolutions or rotated RoI mechanisms to capture oriented semantic information, but these entail high computational complexity and memory consumption. More recently, methods [22,23,24] have utilized large square convolution kernels to obtain long-range contextual information with improved efficiency, as illustrated in Figure 2. However, these strategies fail to effectively incorporate directional priors into feature representation. Insufficient directional information impairs the modeling of target orientation and spatial relations, ultimately leading to suboptimal recognition of oriented targets.
Recently, several approaches have been introduced to tackle these challenges in RGB-IR oriented object detection tasks. For instance, TSFADet [20] predicted the deviation in position, size, and angle in the RoI head to mitigate the misalignment issue. C2Former [25] designed an inter-modality cross-attention module based on the Transformer architecture for modality calibration, but it requires substantial computational overhead. More recently, DMM [26] provided an efficient solution by employing a disparity-guided selective scan module for global context modeling without quadratic cost and a multi-scale target-aware attention module to capture features at different scales. Despite its promising performance, DMM lacks specialized components for capturing fine-grained directional information and a selective enhancement mechanism during the fusion process, limiting its overall robustness and precision.
In this paper, we propose a novel method for oriented object detection in paired aerial images, Fine-Grained Multispectral Fusion (FGMF), built on the one-stage detector S2ANet. It contains a Dual-Enhancement and Fusion Module (DEFM) and an orientation aggregation module (OAM) to cope with the two issues above. Specifically, to obtain a robust and highly discriminative fused feature representation, DEFM applies a weighted addition enhancement and a subtraction-based attention enhancement. These processes strengthen the discriminative power within each individual modality, after which the enhanced features are fed into the DMM [26] for final fusion. In addition, OAM employs several rotated strip convolutions to extract both orientation and long-range contextual information from the RGB features, aiming to achieve a more robust and comprehensive understanding of the spatial and directional relationships in complex scenes. Our contributions can be summarized as follows:
  • We propose a novel one-stage-based infrared–visible-oriented object detection method, named FGMF, which dynamically emphasizes object orientations and adaptively fuses complementary and similar information.
  • To tackle the modality misalignment problem, we propose a Dual-Enhancement and Fusion Module with two single-modality enhancement processes to capture similar and distinct features, followed by a fusion step to achieve feature calibration.
  • To handle the lack of directional priors in large square convolutional kernels, we propose an orientation aggregation module that employs a series of rotated strip convolutions to encode orientation-aware features.
The remaining parts of this paper are organized as follows. Section 2 introduces related work. Section 3 elaborates on the framework of the proposed method. Section 4 details the experiments and performance evaluations that were conducted on the proposed method. Finally, Section 5 summarizes the paper.

2. Related Work

2.1. Oriented Object Detection

Oriented object detection refers to recognizing and localizing objects of interest, such as ships and vehicles, using rotated bounding boxes, and it is widely used in remote sensing imagery and scene text detection. Unlike horizontal detection, rotated object detection requires not only accurate localization but also precise estimation of rotation angles, which places higher demands on feature representation and spatial modeling. To achieve better rotated feature representations, a variety of methods have been developed based on horizontal detectors [27,28]. Many two-stage approaches [1,3,29] employ the rotated region of interest (RoI) extraction mechanism to align features with oriented proposals, while some one-stage methods, such as R3Det [2] and S2A-Net [4], introduce feature alignment mechanisms between rotated anchors and horizontal receptive fields to mitigate misalignment. Recently, adaptive convolution-based methods have emerged to enhance orientation sensitivity. For example, ARC [30] and GRA [31] give the network greater flexibility to capture the orientation information of variously oriented objects by adaptively rotating the convolution kernels. Alternatively, large-kernel designs such as LSKNet [23] and PKINet [24] employ large convolutional kernels to capture broader contextual information, which aids in distinguishing closely packed and rotated objects. In addition to architectural innovations, several loss functions have been proposed to improve the regression accuracy of rotated bounding boxes. CSL [32] and DCL [33] regard angle prediction as an angular classification task to avoid the boundary discontinuity issue, while GWD [34], KLD [35], and KFIoU [36] adopt distribution-based measures that convert rotated bounding boxes into Gaussian distributions to better match positions and orientations, which enhances rotation sensitivity and stabilizes training, especially for high-precision detection tasks.
Considering the varying orientations of targets in aerial images and the limited directional sensitivity of large square convolutions, we propose an orientation aggregation module, which encodes multi-directional objects using four oriented strip convolutions to capture fine-grained orientation representations.

2.2. Infrared–Visible Image Fusion

Infrared and visible image fusion has attracted considerable attention due to its applications in tasks such as object detection [37,38,39,40] and tracking [41,42]. Fusion methods that integrate thermal signatures from infrared images and texture details from visible images can improve perceptual robustness under challenging conditions. Based on the stage at which fusion occurs, fusion methods can be broadly categorized into three levels: pixel-level, decision-level, and feature-level [43]. Pixel-level fusion methods directly combine raw inputs from different modalities, enabling early information interaction before feature extraction. For example, SuperYOLO [44] fuses the modalities at the pixel level, leading to a compact input representation with a substantial reduction in computational cost. Decision-level fusion techniques, in contrast, combine the detection results from separate modality-specific branches at the final stage. Despite their ability to leverage modality-specific features for improved performance, these strategies are often computationally expensive due to repeated calculations across multispectral branches. Feature-level fusion, as a balanced compromise, is commonly adopted in remote sensing. Methods [20,25,40] in this field typically use multiple branches to extract modality-specific features, which are then integrated through attention mechanisms or channel concatenation. This approach preserves structural and semantic details from both modalities, thereby improving fusion quality and further enhancing the performance of oriented object detection under varied and challenging conditions.
Following the feature-level fusion strategy, we propose a dual-enhancement and fusion module to enhance modality-specific representations and achieve more informative feature integration. The module highlights salient and complementary information from each modality through additive and subtractive feature enhancement, ultimately generating consolidated and discriminative fused features.

2.3. Long-Sequence Modeling

Transformer architectures are widely used in long-sequence modeling tasks due to their excellent global modeling capabilities [45], though they often require substantial computing resources. In recent years, the Mamba architecture has emerged as an efficient alternative for long-sequence modeling. With their linear complexity, efficient state update mechanism, and streaming processing capabilities, Mamba-based models are particularly well suited to processing large-scale and high-resolution data. This has led to their successful application across numerous visual tasks [46,47,48,49], establishing Mamba as a powerful tool for capturing long-range dependencies. With its theoretical framework and practical applications continuously evolving, Mamba has also attracted growing interest in the domain of remote sensing vision [50,51,52]. For example, Pan-Mamba [50] introduces a channel-exchange mechanism along with cross-modal Mamba modules to achieve efficient fusion and interaction of multimodal information. RSMamba [51] proposes a dynamic multi-path activation mechanism that enhances Mamba's capacity to model non-causal data structures, thereby improving the understanding and representation of complex semantics in remote sensing imagery. Samba [52], on the other hand, builds upon the state space model (SSM) to develop a semantic segmentation framework tailored for high-resolution remote sensing imagery, enabling efficient extraction of multi-level semantic information. ChangeMamba [53] introduces three spatio-temporal relationship modeling mechanisms, incorporated into the Mamba framework to facilitate comprehensive interaction among multi-temporal features. These efforts demonstrate the capacity of Mamba-based models to effectively tackle the distinctive challenges of geospatial vision, such as high-resolution image processing, multimodal data fusion, and complex spatial arrangements.
Inspired by Disparity-guided Multispectral Mamba [26], we propose the DEFM module, which effectively fuses visible and infrared images by leveraging the complementary and discriminative information between the two modalities, thereby improving the performance in oriented object detection.

3. Methodology

3.1. Preliminaries

State Space Models. The classical state space model (SSM) is a linear time-invariant (LTI) system that describes a system's internal state as a set of linear equations. Specifically, given an input $x(t) \in \mathbb{R}$, it produces an output $y(t) \in \mathbb{R}^{M}$ through a hidden state $h(t) \in \mathbb{R}^{N}$. This procedure can be expressed as linear ordinary differential equations (ODEs) as follows:
$$h'(t) = A h(t) + B x(t), \quad (1)$$
$$y(t) = C h(t) + D x(t), \quad (2)$$
where $A \in \mathbb{R}^{N \times N}$ represents the state transition matrix, $B \in \mathbb{R}^{N \times M}$ and $C \in \mathbb{R}^{M \times N}$ are the projection parameters, and $D \in \mathbb{R}^{1}$ serves as a skip connection and is often omitted. To enhance global modeling capabilities, an input-dependent selective scanning mechanism is adopted that leverages nonlinearity and discretization by introducing a time-scale parameter $\Delta$ and the zero-order hold (ZOH) [54], converting $A$ and $B$ into their discrete forms $\bar{A}$ and $\bar{B}$:
$$\bar{A} = \exp(\Delta A), \quad (3)$$
$$\bar{B} = (\Delta A)^{-1} \big(\exp(\Delta A) - I\big)\, \Delta B. \quad (4)$$
After discretization, Equations (1) and (2) can be rewritten as follows:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \quad (5)$$
$$y_t = \bar{C} h_t, \quad (6)$$
where $\bar{C} = C$. Finally, the output is obtained through a global convolution:
$$\bar{K} = \left( C\bar{B},\ C\bar{A}\bar{B},\ \ldots,\ C\bar{A}^{L-1}\bar{B} \right), \quad (7)$$
$$y = x * \bar{K}, \quad (8)$$
where $L$ is the length of the input sequence, and $\bar{K} \in \mathbb{R}^{L}$ represents the structured convolutional kernel.
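To make the zero-order-hold discretization and the discrete recurrence above concrete, the following NumPy/SciPy sketch implements Equations (3)–(6) for a toy system; the matrices, the time scale Δ, and the input sequence are illustrative assumptions rather than values used in the paper.

```python
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, delta):
    """Zero-order hold: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Discrete recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                                  # x is a 1D scalar input sequence
        h = A_bar @ h + (B_bar * x_t).ravel()
        ys.append(C @ h)
    return np.array(ys)

# Toy example with randomly chosen (illustrative) parameters.
rng = np.random.default_rng(0)
N = 4
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # roughly stable state matrix
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)
y = ssm_scan(A_bar, B_bar, C, x=np.sin(np.linspace(0, 6, 64)))
print(y.shape)  # (64, 1)
```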
Disparity-guided Selective Scan Module. Traditional SSMs are effective for modeling long-range dependencies in 1D sequences, but they suffer from input-invariant parameters, limiting their capacity for context-aware modeling. To overcome this limitation, the Selective State Space Model (S6) [54], commonly known as Mamba, derives the projection parameters $B$ and $C$ as well as the time-scale parameter $\Delta$ from the input $x_t$ to endow the model with context-aware capabilities. This input-dependent selective scanning mechanism marks a substantial improvement over LTI systems, offering informative representations for complex sequential data. However, such 1D sequential models cannot be applied directly to visual data, since it is necessary to capture 2D spatial information in images. To bridge the gap between 2D visual data and 1D sequences, VMamba [46] developed the 2D Selective Scan (SS2D) mechanism, which scans image patches along four distinct directions to effectively establish a global receptive field while maintaining computational efficiency.
Inspired by this paradigm, the disparity-guided selective scan module (DSSM) was proposed for multispectral object detection. Specifically, given the three inputs $F_{rgb}, F_{ir}, F_{d} \in \mathbb{R}^{N \times C \times H \times W}$, they are first flattened, and $F_{d}$ is concatenated with $F_{rgb}$ and $F_{ir}$ to obtain two sequences $F_{rgb}^{d}, F_{ir}^{d} \in \mathbb{R}^{N \times 2HW \times C}$. To better capture bidirectional spatial context, reverse scans are performed to produce the complementary sequences $\tilde{F}_{rgb}^{d}$ and $\tilde{F}_{ir}^{d}$. The four sequences are then processed through Equations (3)–(6), yielding four corresponding output sequences $F_{rgb}^{d\prime}, F_{ir}^{d\prime}, \tilde{F}_{rgb}^{d\prime}, \tilde{F}_{ir}^{d\prime} \in \mathbb{R}^{N \times 2HW \times C}$. The outputs $Y_{rgb}^{d}$ and $Y_{ir}^{d}$ are obtained by combining the forward outputs with their reversed counterparts:
$$Y_{rgb}^{d} = F_{rgb}^{d\prime} + \mathrm{Reverse}\big(\tilde{F}_{rgb}^{d\prime}\big),$$
$$Y_{ir}^{d} = F_{ir}^{d\prime} + \mathrm{Reverse}\big(\tilde{F}_{ir}^{d\prime}\big),$$
where $\mathrm{Reverse}(\cdot)$ reverses the input along the second dimension. The two sequences are then split along the same dimension, with the first half (modality-specific features) retained and the second half (differential features) discarded. The resulting sequences $Y_{rgb}^{d}$ and $Y_{ir}^{d}$ are reshaped to produce the final fused representations.
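A minimal PyTorch sketch of this disparity-guided scan is given below. The `selective_scan` argument stands in for the Mamba S6 kernel (not reproduced here), and the concatenation order and split are our reading of the description above, so treat the details as assumptions.

```python
import torch

def dssm_fuse(f_rgb, f_ir, f_d, selective_scan):
    """Sketch of the disparity-guided selective scan.
    f_rgb, f_ir, f_d: (N, C, H, W) tensors; selective_scan: callable on (N, L, C) sequences."""
    N, C, H, W = f_rgb.shape
    flat = lambda x: x.flatten(2).transpose(1, 2)             # (N, HW, C)
    s_rgb_d = torch.cat([flat(f_rgb), flat(f_d)], dim=1)      # (N, 2HW, C)
    s_ir_d = torch.cat([flat(f_ir), flat(f_d)], dim=1)

    fused = []
    for seq in (s_rgb_d, s_ir_d):
        fwd = selective_scan(seq)                                      # forward scan
        rev = selective_scan(seq.flip(dims=[1])).flip(dims=[1])        # reverse scan, restored order
        y = (fwd + rev)[:, :H * W, :]                                  # keep modality-specific half
        fused.append(y.transpose(1, 2).reshape(N, C, H, W))
    return fused  # [Y_rgb^d, Y_ir^d] reshaped back to feature maps

# Smoke test with an identity stand-in for the scan kernel.
f = [torch.randn(2, 8, 16, 16) for _ in range(3)]
y_rgb, y_ir = dssm_fuse(*f, selective_scan=lambda s: s)
```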

3.2. Overview

As shown in Figure 3, our proposed FGMF comprises four main components: a dual-stream backbone network, an orientation aggregation module (OAM), a dual-enhancement and fusion module (DEFM) and a detection head. Given a pair of infrared and visible images, we initially perform feature extraction independently through two modality-specific backbones. Subsequently, the visible features are input into an orientation aggregation module, which provides essential orientation information for objects of varying directions. Next, the two modality-specific features flow into a dual-enhancement and fusion module, which refines them with semantic interaction between infrared and visible feature maps. Finally, all bounding boxes and classification scores are obtained through the detection head. In the following, we will elaborate on the proposed DEFM and OAM in detail.
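For readers who prefer code, the data flow described above can be summarized by the minimal PyTorch sketch below; every sub-module is a placeholder for the corresponding component (backbones, OAM, DEFM, detection head), not the authors' implementation, and the per-scale wiring is an assumption drawn from Figure 3.

```python
import torch
import torch.nn as nn

class FGMFSketch(nn.Module):
    """High-level wiring of FGMF (sketch only)."""
    def __init__(self, backbone_rgb, backbone_ir, oam, defm, head):
        super().__init__()
        self.backbone_rgb, self.backbone_ir = backbone_rgb, backbone_ir
        self.oam, self.defm, self.head = oam, defm, head

    def forward(self, img_rgb, img_ir):
        feats_rgb = self.backbone_rgb(img_rgb)                 # multi-scale visible features
        feats_ir = self.backbone_ir(img_ir)                    # multi-scale infrared features
        feats_rgb = [self.oam(f) for f in feats_rgb]           # inject orientation context
        fused = [self.defm(fr, fi)                             # calibrate and fuse per scale
                 for fr, fi in zip(feats_rgb, feats_ir)]
        return self.head(fused)                                # oriented boxes and class scores

# Smoke test with identity placeholders (purely illustrative).
ident_backbone = lambda x: [x]
model = FGMFSketch(ident_backbone, ident_backbone, nn.Identity(),
                   lambda a, b: a + b, lambda xs: xs)
out = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
```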

3.3. Dual-Enhancement and Fusion Module

Feature fusion plays a critical role in multimodal models. In contrast to Transformer-based approaches [20,55,56], which often suffer from quadratic computational complexity with respect to the sequence length, our proposed DEFM employs two enhancement processes and a Mamba-based fusion to achieve better feature representations while avoiding quadratic computational overhead. As presented in Figure 4, the first enhancement process adopts a weighted addition strategy: we select the maximum value across the channel dimension of the initial fused feature map $X_{init\_fuse}$ and then generate the modality-specific significance weights:
$$X_{init\_fuse} = X_{rgb} + X_{ir}, \quad u_1 = \frac{X_{rgb}}{\mathrm{Max}(X_{init\_fuse})}, \quad u_2 = \frac{X_{ir}}{\mathrm{Max}(X_{init\_fuse})},$$
where $\mathrm{Max}(\cdot)$ denotes the operation of selecting the channel-wise maximum values. When $u_1 \to 1$, the visible feature value at a given position is close to the maximum fused value, indicating that the visible modality is important at this position; similarly, $u_2 \to 1$ means that the infrared modality contributes more and receives a higher weight. A new fused feature can then be defined as follows:
$$W_{sum} = u_1 \cdot X_{rgb} + u_2 \cdot X_{ir}.$$
The complementary enhancement of the visible and infrared features is performed as follows:
$$X_{rgb}^{1} = W_{sum} \cdot X_{rgb}, \quad X_{ir}^{1} = W_{sum} \cdot X_{ir}.$$
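The first enhancement stage can be sketched in a few lines of PyTorch; the division by the channel-wise maximum reflects our reading of the significance weights above, and the small epsilon is an added numerical safeguard, not part of the paper.

```python
import torch

def addition_enhancement(x_rgb, x_ir, eps=1e-6):
    """First (addition-guided) enhancement stage, sketched for (N, C, H, W) inputs."""
    x_init = x_rgb + x_ir                                   # initial fused feature map
    x_max = x_init.max(dim=1, keepdim=True).values          # maximum across the channel dimension
    u1 = x_rgb / (x_max + eps)                              # visible significance weight
    u2 = x_ir / (x_max + eps)                               # infrared significance weight
    w_sum = u1 * x_rgb + u2 * x_ir                          # re-weighted fused feature
    return w_sum * x_rgb, w_sum * x_ir                      # X_rgb^1, X_ir^1
```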
Element-wise summation fusion highlights complementary feature similarity, but it suppresses the differential features between modalities. Therefore, at the second enhancement stage, the difference between the visible and infrared feature maps is computed and passed through a convolutional layer followed by a sigmoid activation function to derive the weighting factor $W_{sub}$:
$$W_{sub} = \sigma\big(\mathrm{Conv}(X_{rgb} - X_{ir})\big),$$
where $\sigma$ denotes the sigmoid function. This weight is used to modulate the visible and infrared features, generating the second enhanced visible feature $X_{rgb}^{2}$ and infrared feature $X_{ir}^{2}$, as well as the enhanced subtraction feature map $X_{sub}^{2}$:
$$X_{k}^{2} = X_{k} + W_{sub} \cdot X_{k},$$
where $k \in \{rgb, ir, sub\}$ indexes the visible, infrared, and differential features, respectively.
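The second enhancement stage admits an equally small sketch; the 3 × 3 kernel of the gating convolution is an assumption, since the paper only specifies a convolutional layer followed by a sigmoid.

```python
import torch
import torch.nn as nn

class SubtractionEnhancement(nn.Module):
    """Second (subtraction-guided) enhancement stage, sketched."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # kernel size assumed

    def forward(self, x_rgb, x_ir):
        x_sub = x_rgb - x_ir                                 # differential feature map
        w_sub = torch.sigmoid(self.conv(x_sub))              # weighting factor W_sub
        enhance = lambda x: x + w_sub * x                    # X_k^2 = X_k + W_sub * X_k
        return enhance(x_rgb), enhance(x_ir), enhance(x_sub)

# usage: SubtractionEnhancement(256)(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
```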
After enhancement, the normalized visible and infrared features, denoted as $X_{rgb}^{norm}$ and $X_{ir}^{norm}$, along with the differential feature $X_{sub}^{2}$, undergo a high-dimensional projection to enrich the modality-specific representations. Depthwise separable convolutions $\mathrm{DWConv}(\cdot)$ are then applied to the three features to enable effective interaction among channels:
$$X_{k}^{proj} = \mathrm{Project}\big(X_{rgb}^{norm}; X_{ir}^{norm}; X_{sub}^{2}\big),$$
$$X_{k}^{dwc} = \mathrm{SiLU}\big(\mathrm{DWConv}(X_{k}^{proj})\big),$$
$$X^{dssm} = \mathrm{DSSM}\big(X_{rgb}^{dwc}; X_{ir}^{dwc}; X_{sub}^{dwc}\big),$$
where the normalized features are obtained via Layer Normalization $\mathrm{LN}(\cdot)$, $\mathrm{Project}(\cdot)$ denotes a linear projection operation, and $\mathrm{SiLU}$ is the SiLU activation function. As in the last equation, the three projected features are fed into the Disparity-guided Selective Scan Module (DSSM) [26], which obtains complementary information from the other modality.
Despite its strength in modeling long-range dependencies, DSSM struggles to effectively represent correlations across channels. Therefore, a simple channel attention operation is used to capture channel-domain features, based on global max pooling $\mathrm{GMP}(\cdot)$ and global average pooling $\mathrm{GAP}(\cdot)$. Specifically, for a given input feature $Y$, we perform the following operations:
$$Y_{avg} = \mathrm{GAP}\big(\mathrm{SiLU}(Y)\big), \quad Y_{max} = \mathrm{GMP}\big(\mathrm{SiLU}(Y)\big),$$
$$Y_{cr} = \mathrm{ReLU}\big(\mathrm{Conv}_{1 \times 1}(Y_{avg})\big) + \mathrm{ReLU}\big(\mathrm{Conv}_{1 \times 1}(Y_{max})\big).$$
The channel attention output is then computed as follows:
$$Y_{ca} = \sigma(Y_{cr}) \cdot Y + Y.$$
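A CBAM-style sketch of this channel attention is shown below; whether the two 1 × 1 convolutions share weights is not specified in the paper, so the separate branches here are an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention applied after DSSM (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.fc_avg = nn.Conv2d(channels, channels, kernel_size=1)
        self.fc_max = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, y):
        y_act = F.silu(y)
        y_avg = F.adaptive_avg_pool2d(y_act, 1)              # GAP -> (N, C, 1, 1)
        y_max = F.adaptive_max_pool2d(y_act, 1)              # GMP -> (N, C, 1, 1)
        y_cr = F.relu(self.fc_avg(y_avg)) + F.relu(self.fc_max(y_max))
        return torch.sigmoid(y_cr) * y + y                   # Y_ca = sigma(Y_cr) * Y + Y
```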

3.4. Orientation Aggregation Module

The greater the use of orientation-aware information for object discrimination, the better the performance in oriented object detection tends to be. However, as shown in Figure 5, previous structures [22,23,24] that employ large convolutions for global modeling often fail to adequately capture precise object orientations. To enhance rotation and texture details for the subsequent infrared and visible image fusion, as depicted in Figure 6, we propose the orientation aggregation module, which models the orientation and long-range contextual relationships. Concretely, the input RGB feature is first fed into a convolution with a 3 × 3 kernel size, followed by four directional strip convolutions. Horizontal and vertical strip convolutions capture axis-aligned patterns, while two diagonal convolutions operate on feature maps rotated by 45° and −45°:
$$\bar{X} = \mathrm{Conv}_{3 \times 3}(X_{in}),$$
$$X_{agg} = \sum_{\theta \in \{0°,\, 90°,\, 45°,\, -45°\}} R_{-\theta}\Big(\mathrm{Conv}_{k_\theta}\big(R_{\theta}(\bar{X})\big)\Big).$$
Here, we adopt a strip convolution size $k_\theta$ of $1 \times 9$, and $R_{\theta}$ denotes the rotation transformation of the feature map by $\theta$, with $R_{-\theta}$ rotating it back. These multi-orientation features are aggregated via summation and then transformed through a pointwise convolution and a GELU activation to obtain directional and long-range contextual representations:
$$X_{act} = \mathrm{GELU}\big(\mathrm{Conv}_{1 \times 1}(X_{agg})\big).$$
To focus on the spatial regions most relevant to target detection, we employ a spatial selection mechanism based on average and maximum pooling (denoted as $P_{avg}$ and $P_{max}$):
$$X_{avg} = P_{avg}(X_{act}), \quad X_{max} = P_{max}(X_{act}).$$
To facilitate effective information interaction, we concatenate the spatially pooled features and employ a convolutional layer $F^{2 \to N}(\cdot)$ to transform the aggregated features into $N$ channels:
$$\hat{X} = F^{2 \to N}\big(\mathrm{Concat}[X_{avg}, X_{max}]\big).$$
The final output of the OAM is then obtained by performing an element-wise product between the input feature $X_{in}$ and the spatial attention map $\hat{X}$:
$$X_{out} = X_{in} \cdot \hat{X}.$$
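The rotate, convolve, and rotate-back pattern and the spatial selection step can be sketched as follows; the rotation uses torchvision's feature-map rotation, and the 7 × 7 kernel of the final 2-to-N convolution is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

class OrientationAggregation(nn.Module):
    """Sketch of the OAM: 3x3 pre-conv, four 1x9 strip convolutions on rotated
    feature maps (0, 45, 90, -45 degrees), then a spatial attention gate."""
    def __init__(self, channels):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, 3, padding=1)
        self.strips = nn.ModuleList(
            [nn.Conv2d(channels, channels, (1, 9), padding=(0, 4)) for _ in range(4)])
        self.angles = [0.0, 45.0, 90.0, -45.0]
        self.point = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(2, channels, 7, padding=3)     # F^{2->N}; kernel size assumed

    def forward(self, x_in):
        x = self.pre(x_in)
        x_agg = 0
        for conv, a in zip(self.strips, self.angles):
            r = TF.rotate(x, a)                              # rotate feature map by theta
            x_agg = x_agg + TF.rotate(conv(r), -a)           # strip conv, rotate back, accumulate
        x_act = F.gelu(self.point(x_agg))
        pooled = torch.cat([x_act.mean(dim=1, keepdim=True),
                            x_act.amax(dim=1, keepdim=True)], dim=1)
        return x_in * self.fuse(pooled)                      # X_out = X_in * X_hat

# usage: OrientationAggregation(16)(torch.randn(1, 16, 32, 32))
```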

3.5. Loss Function

As previously described, our proposed FGMF framework is a dual-stream feature extraction architecture. In this work, we adopt the one-stage method S2ANet [4] as the basic detector, following the same experimental configurations as current state-of-the-art (SOTA) approaches. The final loss function in FGMF consists of two components: one applied to the RGB branch and the other to the fusion branch. The overall loss function is defined as follows:
$$\mathcal{L}_{total} = \mathcal{L}_{oam}^{cls} + \mathcal{L}_{oam}^{reg} + \mathcal{L}_{defm}^{cls} + \mathcal{L}_{defm}^{reg},$$
where $\mathcal{L}_{oam}^{cls}$ and $\mathcal{L}_{oam}^{reg}$ denote the classification and regression losses from the RGB branch, while $\mathcal{L}_{defm}^{cls}$ and $\mathcal{L}_{defm}^{reg}$ represent those from the fusion branch. All classification losses are implemented using the cross-entropy loss [57], and the regression losses employ the smooth L1 loss [58].

4. Experiments and Analysis

4.1. Datasets

DroneVehicle [59] is the most extensively used drone-based infrared–visible dataset for arbitrary-oriented object detection, comprising 28,439 image pairs with 953,087 instances. It includes five categories: car, bus, truck, van, and freight car. The dataset covers various scenes, such as residential areas, urban roads, parking lots, and other scenarios from day to night. Officially, the dataset is split into 17,990, 1469, and 8980 image pairs for training, validation, and testing, respectively. In the experiments, the 100-pixel-wide white borders are removed from all images.
VEDAI [60] is a multispectral vehicle detection dataset, comprising 1246 pairs of infrared and visible images, each with a resolution of 512 × 512 . We select nine vehicle categories for detection: car, truck, tractor, camper, van, pick-up, boat, plane, and others. In the experiments, in line with [26], a ten-fold cross-validation protocol is employed to train and test the model.

4.2. Implementation Details

As shown in Table 1, we utilize the MMRotate [61] and MMDetection [62] platforms to implement the proposed method on a server with a single GeForce RTX 4090 GPU. S2ANet is selected as the basic oriented object detector with the pretrained backbone network VMamba [46]. The AdamW optimizer is adopted with an initial learning rate of 0.0001 and a weight decay of 0.05. The model was trained for 12 epochs, with the learning rate divided by 10 at epochs 8 and 11. For the DroneVehicle dataset, IR modality labels are used as ground truth, in line with previous studies [20,25,59]. For the VEDAI dataset, the model is trained on all nine categories, including those with fewer than 50 instances, to evaluate performance under imbalanced data conditions.
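For reference, the stated optimizer and schedule correspond roughly to the following MMRotate/MMDetection (2.x-style) config fragment; the exact keys vary across framework versions, so this is a sketch rather than the authors' configuration file.

```python
# Optimizer and learning-rate schedule fragment (MMDetection 2.x-style config dicts).
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=0.05)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[8, 11])          # lr divided by 10 at epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
```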

4.3. Evaluation Metrics

To evaluate oriented object detection performance, mean average precision (mAP) and frames per second (FPS) are selected as evaluation metrics. FPS quantifies the inference speed, while mAP measures detection accuracy across categories and is obtained by averaging the per-category average precision (AP) scores:
$$mAP = \frac{1}{k} \sum_{i=1}^{k} AP_i,$$
where $k$ is the number of object classes and $AP_i$ represents the average precision of the $i$-th category. The AP value is defined as the area under the precision–recall (P–R) curve, providing a robust measure of detector performance:
$$AP = \int_{0}^{1} P(R)\, dR,$$
where precision ($P$) and recall ($R$) measure the proportion of correctly identified samples among all positive predictions and among all actual positive samples, respectively:
$$P = \frac{TP}{TP + FP},$$
$$R = \frac{TP}{TP + FN}.$$
True positives ($TP$), false positives ($FP$), and false negatives ($FN$) refer to the numbers of correctly identified targets, misclassified non-targets (such as background or other categories), and undetected true targets, respectively. In this work, mAP with an IoU threshold of 0.5 (mAP50) is employed to evaluate the performance.
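The AP integral above is usually computed with all-point interpolation of the precision–recall curve; a small NumPy sketch with toy (assumed) precision–recall points is given below.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the P-R curve, using all-point interpolation."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]           # monotonically decreasing precision envelope
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# Toy usage with assumed recall/precision points for a single class.
ap = average_precision(np.array([0.2, 0.5, 0.9]), np.array([1.0, 0.8, 0.6]))
print(round(ap, 3))  # 0.68
```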

4.4. Ablation Studies

Effect of rotated strip convolution. As revealed in Table 2, conventional square convolution with size of 3 × 3 achieves only 78.5% mAP50, primarily due to its limited receptive field failing to capture long-range contextual dependencies and directional patterns critical for oriented object detection. Replacing the square kernel with strip convolutions at 0° (Exp II) or 90° (Exp III) alone performs comparably to the square convolution. Notably, significant gains are observed when combining complementary orientations. Employing 1 × 9 kernels at both 0° and 90° (Exp IV) improves accuracy to 79.3%, demonstrating that orthogonal directional contexts provide mutually reinforcing information. Extending this design to a complete set of four orientations: 0°, 45°, 90°, and −45° (Exp VI) further boosts performance to 79.6% mAP50, a total gain of 1.1% over the baseline. This confirms that multi-directional coverage is essential to capture the full spectrum of object rotations. However, simply increasing the kernel size is not always beneficial. Expanding from 1 × 9 to 1 × 11 (Exp VII) causes a slight performance drop to 79.4%, likely due to increased sensitivity to background clutter and irrelevant context as the receptive field grows beyond the scale of target structures.
Effect of each module. As depicted in Table 3, the baseline model, which incorporates neither rotated features nor modality-specific enhancement, achieves only 79.3% mAP50. Integrating DSSM with our proposed orientation aggregation module (OAM) yields a 0.3% improvement in mAP50. The introduction of the first enhancement stage further increases mAP50 by 0.2% by leveraging cross-modality complementary information, thereby enriching unimodal feature representations. Subsequent integration of the second enhancement stage explicitly amplifies discriminative differential features between the visible and infrared modalities. The full configuration achieves the best performance at 80.2%, representing a total gain of 0.9% over the baseline.

4.5. Comparison with State-of-the-Art Methods

Results on DroneVehicle. To demonstrate the effectiveness of our proposed method, we compare it against four single-modality methods (Faster R-CNN [27], RetinaNet [28], RoI Transformer [1], and S2ANet [4]) and eight multimodal object detection approaches, including CIAN [63], AR-CNN [64], UA-CMDet [59], TSFADet [20], CALNet [65], C2Former [25], E2E-MFD [66], and DMM [26]. It can be seen from Table 4 that most multimodal methods show superior performance over single-modal detectors, confirming the critical advantage of multimodal fusion in aerial oriented object detection. Overall, the proposed FGMF achieves SOTA performance among all competitors at 80.2%, surpassing the previous approaches C2Former, E2E-MFD, and DMM by 6.0%, 2.8%, and 0.9%, respectively, which demonstrates its superiority in the IVOOD task. In the Car category, both existing approaches (CALNet, C2Former, E2E-MFD, and DMM) and our method exceed 90%. Moreover, our method is the first to surpass 90% in the Bus category, achieving 90.3%. Furthermore, significant improvements are seen in challenging cases, where our proposed method attains notable oriented object detection accuracy of 74.7% in the Freight Car category and 67.1% mAP in the Van category, representing substantial gains over previous methods in these underrepresented categories. In addition, our method attains the best performance while requiring fewer parameters (87.2 M) than all other IVOOD approaches.
Results on VEDAI. Current multimodal fusion detection studies on the VEDAI dataset with oriented bounding boxes remain limited. As reported in Table 5, multimodal detection methods significantly outperform unimodal approaches. Under single-modality evaluation, S2ANet [4] achieves only 44.5% mAP50 on RGB and 40.0% on IR, which highlights the challenges of modality-specific detection under poor conditions. In contrast, multimodal methods achieve substantial gains with the same basic detector, enabling a fair comparison. Among these methods, our proposed FGMF achieves the best performance with 66.3% mAP50, which is significantly higher than C2Former + S2ANet (55.6%) [25] and DMM + S2ANet (65.7%) [26]. In particular, our method achieves 87.4% mAP in the Plane category and 47.1% mAP in the Others category, exceeding the second-best method by significant margins of 9.9% and 3.6%, respectively. These results demonstrate superior capability in identifying planes and generalizing well to varying targets. Furthermore, FGMF maintains competitive accuracy across common categories such as Car, Pick-up, and Tractor. These results demonstrate that FGMF effectively leverages cross-modal information while preserving modality-specific characteristics, leading to superior detection accuracy in complex scenarios.
Visual Comparisons. For the DroneVehicle dataset, we provide visual detection results of the compared methods in Figure 7. DMM [26] exhibits significant limitations in detection accuracy, including misclassifications, missed detections, and inaccurate localization. These errors stem from its limited ability to discern subtle intermodal discrepancies and its lack of orientation modeling. In contrast, our FGMF delivers more accurate localization and identification through discriminative feature extraction, demonstrating robustness in low-visibility and densely distributed scenarios. For the VEDAI dataset, as presented in Figure 8, we visualize the results of the model in different scenarios. The objects appear in complex situations including blur, dense arrangement, and angle change. These results show that FGMF is robust to the above challenging situations, which verifies its effectiveness.
Speed versus Accuracy. Under a batch size of 1 and an input resolution of 512 × 512 , our proposed FGMF achieves superior detection accuracy of 80.2% mAP with a competitive inference speed of 12.5 FPS, as depicted in Figure 9, demonstrating an excellent trade-off between performance and efficiency.

5. Discussion

Despite the compelling results achieved by our FGMF framework, there are still several limitations that point to valuable future research directions. First, the feature extraction pathways for the two modalities are not entirely symmetric. This architectural asymmetry can introduce a representation-level discrepancy between visible and infrared features, potentially impairing the calibration capability of the dual-enhancement and fusion module. Second, the orientation aggregation module relies on convolution kernels with a set of predefined rotation angles. Although effective in capturing dominant directional patterns, this design inherently lacks the flexibility to adapt to arbitrary orientations. As a result, the model may struggle to accurately represent objects whose orientations fall between the predefined angles, particularly in complex scenes with highly varied object layouts.

6. Conclusions

In this paper, we introduce the Fine-Grained Multispectral Fusion (FGMF) framework, specifically designed for the infrared–visible-oriented object detection task in remote sensing. Our research tackles two key problems in this domain: (1) modality misalignment and partial or complete modality disappearance under varying environmental conditions, which poses challenges for effective multimodal fusion, and (2) large orientation disparities among targets, which make it difficult for models to capture rotated representations. The proposed FGMF copes with these issues through its novel design, which mainly consists of two modules: a dual-enhancement and fusion module that integrates complementary and correlated cross-modal features to address the modality misalignment issue, and an orientation aggregation module that employs rotated strip convolutions to model directional context and long-range dependencies. Extensive experiments on the DroneVehicle and VEDAI datasets demonstrate that FGMF delivers significant performance gains as a robust and versatile framework for infrared–visible-oriented object detection.

Author Contributions

Conceptualization, X.L. and S.Z.; methodology, X.L.; validation, X.L., S.Z., and Y.B.; writing—original draft preparation, X.L. and S.Z.; writing—review and editing, X.L.; visualization, X.L. and Y.B.; supervision, X.Q.; funding acquisition, X.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Sichuan Science and Technology Program (2024NSFJQ0035), and in part by the Talents at the Sichuan provincial Party Committee Organization Department.

Data Availability Statement

The VisDrone and VEDAI datasets are publicly available at https://github.com/VisDrone/VisDrone-Dataset (accessed on 6 January 2025) and https://downloads.greyc.fr/vedai/ (accessed on 21 June 2025), respectively. The code will be available at https://github.com/StarBlue98/FGMF.

Acknowledgments

We sincerely appreciate the constructive comments and suggestions of the anonymous reviewers, which have greatly helped to improve this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  2. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  3. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
  4. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
  5. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  6. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods, Porto, Portugal, 24–26 February 2017; Volume 2, pp. 324–331. [Google Scholar]
  7. Deng, Q.; Tian, W.; Huang, Y.; Xiong, L.; Bi, X. Pedestrian detection by fusion of RGB and infrared images in low-light environment. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa, 1–4 November 2021; pp. 1–8. [Google Scholar]
  8. Abdulfattah, M.H.; Sheikh, U.U.; Masud, M.I.; Othman, M.A.; Khamis, N.; Aman, M.; Arfeen, Z.A. Assessing the Detection Capabilities of RGB and Infrared Models for Robust Occluded and Unoccluded Pedestrian Detection. IEEE Access 2025, 13, 91834–91845. [Google Scholar] [CrossRef]
  9. Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving rgb-infrared pedestrian detection by reducing cross-modality redundancy. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 526–530. [Google Scholar]
  10. Peng, T.; Li, Q.; Zhu, P. Rgb-t crowd counting from drone: A benchmark and mmccn network. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  11. Gu, S.; Lian, Z. A unified RGB-T crowd counting learning framework. Image Vis. Comput. 2023, 131, 104631. [Google Scholar] [CrossRef]
  12. Mu, B.; Shao, F.; Xie, Z.; Xu, L.; Jiang, Q. RGBT-Booster: Detail-Boosted Fusion Network for RGB-Thermal Crowd Counting with Local Contrastive Learning. IEEE Internet Things J. 2025, 12, 18331–18349. [Google Scholar] [CrossRef]
  13. Guan, D.; Cao, Y.; Yang, J.; Cao, Y.; Yang, M.Y. Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fusion 2019, 50, 148–157. [Google Scholar] [CrossRef]
  14. Li, C.; Liang, X.; Lu, Y.; Zhao, N.; Tang, J. RGB-T object tracking: Benchmark and baseline. Pattern Recognit. 2019, 96, 106977. [Google Scholar] [CrossRef]
  15. Liu, L.; Chen, J.; Wu, H.; Li, G.; Li, C.; Lin, L. Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 4823–4833. [Google Scholar]
  16. Xu, D.; Ouyang, W.; Ricci, E.; Wang, X.; Sebe, N. Learning cross-modal deep representations for robust pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5363–5371. [Google Scholar]
  17. Zhang, Q.; Huang, N.; Yao, L.; Zhang, D.; Shan, C.; Han, J. RGB-T salient object detection via fusing multi-level CNN features. IEEE Trans. Image Process. 2019, 29, 3321–3335. [Google Scholar] [CrossRef]
  18. Zhang, Q.; Zhao, S.; Luo, Y.; Zhang, D.; Huang, N.; Han, J. ABMDRNet: Adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 2633–2642. [Google Scholar]
  19. Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 787–803. [Google Scholar]
  20. Yuan, M.; Wang, Y.; Wei, X. Translation, scale and rotation: Cross-modal alignment meets RGB-infrared vehicle detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 509–525. [Google Scholar]
  21. Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11207–11216. [Google Scholar]
  22. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  23. Li, Y.; Li, X.; Dai, Y.; Hou, Q.; Liu, L.; Liu, Y.; Cheng, M.M.; Yang, J. Lsknet: A foundation lightweight backbone for remote sensing. Int. J. Comput. Vis. 2024, 133, 1410–1431. [Google Scholar] [CrossRef]
  24. Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–21 June 2024; pp. 27706–27716. [Google Scholar]
  25. Yuan, M.; Wei, X. C2former: Calibrated and complementary transformer for rgb-infrared object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403712. [Google Scholar] [CrossRef]
  26. Zhou, M.; Li, T.; Qiao, C.; Xie, D.; Wang, G.; Ruan, N.; Mei, L.; Yang, Y.; Shen, H.T. DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404913. [Google Scholar] [CrossRef]
  27. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 91–99. [Google Scholar] [CrossRef]
  28. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  29. Cheng, Y.; Xu, C.; Kong, Y.; Wang, X. Short-Side Excursion for Oriented Object Detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6515205. [Google Scholar] [CrossRef]
  30. Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive rotated convolution for rotated object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6589–6600. [Google Scholar]
  31. Wang, J.; Pu, Y.; Han, Y.; Guo, J.; Wang, Y.; Li, X.; Huang, G. Gra: Detecting oriented objects through group-wise rotating and attention. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 298–315. [Google Scholar]
  32. Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 677–694. [Google Scholar]
  33. Yang, X.; Hou, L.; Zhou, Y.; Wang, W.; Yan, J. Dense label encoding for boundary discontinuity free rotation detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 15819–15829. [Google Scholar]
  34. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 11830–11841. [Google Scholar]
  35. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. Adv. Neural Inf. Process. Syst. 2021, 34, 18381–18394. [Google Scholar]
  36. Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU Loss for Rotated Object Detection. In Proceedings of the The Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  37. Jiang, C.; Ren, H.; Yang, H.; Huo, H.; Zhu, P.; Yao, Z.; Li, J.; Sun, M.; Yang, S. M2FNet: Multi-modal fusion network for object detection from visible and thermal infrared images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103918. [Google Scholar] [CrossRef]
  38. Yuan, M.; Shi, X.; Wang, N.; Wang, Y.; Wei, X. Improving RGB-infrared object detection with cascade alignment-guided transformer. Inf. Fusion 2024, 105, 102246. [Google Scholar] [CrossRef]
  39. Hu, Y.; Chen, X.; Wang, S.; Liu, L.; Shi, H.; Fan, L.; Tian, J.; Liang, J. Deformable Cross-Attention Transformer for Weakly Aligned RGB–T Pedestrian Detection. IEEE Trans. Multimed. 2025, 27, 4400–4411. [Google Scholar] [CrossRef]
  40. Liu, Y.; Guo, W.; Yao, C.; Zhang, L. Dual-Perspective Alignment Learning for Multimodal Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404015. [Google Scholar] [CrossRef]
  41. Cao, B.; Guo, J.; Zhu, P.; Hu, Q. Bi-directional adapter for multimodal tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 927–935. [Google Scholar]
  42. Zhang, P.; Zhao, J.; Bo, C.; Wang, D.; Lu, H.; Yang, X. Jointly modeling motion and appearance cues for robust RGB-T tracking. IEEE Trans. Image Process. 2021, 30, 3335–3347. [Google Scholar] [CrossRef]
  43. Gómez-Chova, L.; Tuia, D.; Moser, G.; Camps-Valls, G. Multimodal classification of remote sensing images: A review and future directions. Proc. IEEE 2015, 103, 1560–1584. [Google Scholar] [CrossRef]
  44. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
  45. Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. arXiv 2024, arXiv:2407.08083. [Google Scholar]
  46. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  47. Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv 2021, arXiv:2111.00396. [Google Scholar]
  48. Mehta, H.; Gupta, A.; Cutkosky, A.; Neyshabur, B. Long range language modeling via gated state spaces. arXiv 2022, arXiv:2206.13947. [Google Scholar] [CrossRef]
  49. Malik, H.S.; Shamshad, F.; Naseer, M.; Nandakumar, K.; Khan, F.S.; Khan, S. Towards Evaluating the Robustness of Visual State Space Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 3544–3553. [Google Scholar]
  50. He, X.; Cao, K.; Zhang, J.; Yan, K.; Wang, Y.; Li, R.; Xie, C.; Hong, D.; Zhou, M. Pan-mamba: Effective pan-sharpening with state space model. Inf. Fusion 2025, 115, 102779. [Google Scholar] [CrossRef]
  51. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. Rsmamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar] [CrossRef]
  52. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. Heliyon 2024, 10, e38495. [Google Scholar] [CrossRef]
  53. Chen, H.; Song, J.; Han, C.; Xia, J.; Yokoya, N. ChangeMamba: Remote sensing change detection with spatiotemporal state space model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4409720. [Google Scholar] [CrossRef]
  54. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  55. Wu, X.; Cao, Z.H.; Huang, T.Z.; Deng, L.J.; Chanussot, J.; Vivone, G. Fully-Connected Transformer for Multi-Source Image Fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2071–2088. [Google Scholar] [CrossRef] [PubMed]
  56. Vs, V.; Jose Valanarasu, J.M.; Oza, P.; Patel, V.M. Image Fusion Transformer. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 3566–3570. [Google Scholar] [CrossRef]
  57. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  58. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  59. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  60. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  61. Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. Mmrotate: A rotated object detection benchmark using pytorch. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 7331–7334. [Google Scholar]
  62. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar] [CrossRef]
  63. Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29. [Google Scholar] [CrossRef]
  64. Zhang, L.; Liu, Z.; Zhu, X.; Song, Z.; Yang, X.; Lei, Z.; Qiao, H. Weakly aligned feature fusion for multimodal object detection. IEEE Trans. Neural Netw. Learn. Syst. 2021, 36, 4145–4159. [Google Scholar] [CrossRef] [PubMed]
  65. He, X.; Tang, C.; Zou, X.; Zhang, W. Multispectral object detection via cross-modal conflict-aware learning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1465–1474. [Google Scholar]
  66. Zhang, J.; Cao, M.; Xie, W.; Lei, J.; Li, D.; Huang, W.; Li, Y.; Yang, X. E2e-mfd: Towards end-to-end synchronous multimodal fusion detection. Adv. Neural Inf. Process. Syst. 2024, 37, 52296–52322. [Google Scholar]
Figure 1. Illustration of the modality misalignment problem. The yellow and red boxes represent annotations of the same targets in IR and RGB images, respectively.
Figure 2. Successfully modeling remote sensing objects with diverse orientations requires high directional perceptual capability from convolution kernels. Square convolutions may easily lead to inaccurate direction results.
Figure 3. (a) Illustration of the proposed FGMF. The input dual-modal images are initially processed by the VSS blocks, each followed by a downsampling layer to reduce the feature map size. Features of varying scales generated by the VSS blocks in the top stream and bottom stream are then directed into the orientation aggregation module (OAM) and the dual-enhancement and fusion module (DEFM), respectively. (b) At the bottom, the structure of VSS block within the backbone is shown, derived from the v9 architecture of VMamba.
Figure 4. Illustration of the Dual-Enhancement and Fusion Module. It mainly consists of three parts, including the first feature enhancement guided by the addition operation to emphasize the similar representations, the second enhancement guided by the subtraction operation to highlight the modality-specific features, and the disparity-guided selective scan module.
Figure 5. Architectural comparisons between our FGMF and other methods, including SKNet [22], LSKNet [23] and PKINet [24]. SSC, LSC, and LRSC denote small square convolution, large square convolution, and large rotated strip convolution, respectively.
Figure 6. Illustration of the orientation aggregation module.
Figure 7. Visual examples of detection results on the DroneVehicle dataset for our proposed method and DMM [26], respectively.
Figure 8. Detection results of our FGMF in different scenarios on the VEDAI dataset.
Figure 9. Speed versus accuracy on the DroneVehicle test set.
Table 1. Server and environment parameters.
Configuration Items | Parameters
CPU | Intel(R) Xeon(R) Platinum 8336C
GPU | GeForce RTX 4090 (24 GB)
Memory | 100 GB
Operating System | Ubuntu 22.04
Deep Learning Framework | MMRotate and MMDetection (based on PyTorch 2.5.1)
Table 2. Ablation study on the effectiveness of rotated strip convolution with different sizes and angles in orientation aggregation module.
Exp | 0° | 45° | 90° | −45° | mAP50
I | (3, 3) | – | – | – | 78.5
II | (1, 9) | – | – | – | 78.3
III | – | – | (1, 9) | – | 78.4
IV | (1, 9) | – | (1, 9) | – | 79.3
V | – | (1, 9) | – | (1, 9) | 78.1
VI | (1, 9) | (1, 9) | (1, 9) | (1, 9) | 79.6
VII | (1, 11) | (1, 11) | (1, 11) | (1, 11) | 79.4
Table 3. Ablation study for each proposed module in FGMF. Enh. denotes enhancement stage.
Exp | OAM | DEFM: DSSM | DEFM: 1st Enh. | DEFM: 2nd Enh. | mAP50
I | – | ✓ | – | – | 79.3
II | ✓ | ✓ | – | – | 79.6
III | ✓ | ✓ | ✓ | – | 79.8
IV | ✓ | ✓ | ✓ | ✓ | 80.2
Table 4. Comprehensive comparative experiments on the DroneVehicle dataset. The bold and underlined fonts indicate the top two performances, respectively.
Modality | Method | Venue | Year | Basic Detector | Car | Truck | Freight Car | Bus | Van | mAP50 (%) | Params (M)
RGB | Faster R-CNN [27] | NeurIPS | 2015 | – | 79.0 | 49.0 | 37.2 | 77.0 | 37.0 | 55.9 | 41.1
RGB | RetinaNet [28] | ICCV | 2017 | – | 78.5 | 34.4 | 24.1 | 69.8 | 28.8 | 47.1 | 36.4
RGB | RoI Trans [1] | CVPR | 2019 | – | 61.6 | 55.1 | 42.3 | 85.5 | 44.8 | 61.6 | 55.1
RGB | S2ANet [4] | TGRS | 2021 | – | 80.0 | 54.2 | 42.2 | 84.9 | 43.8 | 61.0 | 38.6
IR | Faster R-CNN [27] | NeurIPS | 2015 | – | 89.4 | 53.5 | 48.3 | 87.0 | 42.6 | 64.2 | 41.1
IR | RetinaNet [28] | ICCV | 2017 | – | 88.8 | 35.4 | 39.5 | 76.5 | 32.1 | 54.5 | 36.4
IR | RoI Trans [1] | CVPR | 2019 | – | 89.6 | 51.0 | 53.4 | 88.9 | 44.5 | 65.5 | 55.1
IR | S2ANet [4] | TGRS | 2021 | – | 89.9 | 54.5 | 55.8 | 88.9 | 48.4 | 67.5 | 38.6
RGB+IR | CIAN [63] | INFORM FUSION | 2019 | – | 89.98 | 62.47 | 60.22 | 88.90 | 49.59 | 70.23 | –
RGB+IR | AR-CNN [64] | TNNLS | 2021 | Faster R-CNN | 90.1 | 64.8 | 62.1 | 89.4 | 51.5 | 71.6 | –
RGB+IR | UA-CMDet [59] | TCSVT | 2022 | RoI Trans | 87.5 | 60.7 | 46.8 | 87.1 | 38.0 | 64.0 | –
RGB+IR | TSFADet [20] | ECCV | 2022 | – | 89.9 | 67.9 | 63.7 | 89.8 | 54.0 | 73.1 | 104.7
RGB+IR | CALNet [65] | ACM MM | 2023 | – | 90.3 | 76.2 | 63.0 | 89.1 | 58.5 | 75.4 | –
RGB+IR | C2Former [25] | TGRS | 2024 | S2ANet | 90.2 | 68.3 | 64.4 | 89.8 | 58.5 | 74.2 | 100.8
RGB+IR | E2E-MFD [66] | NeurIPS | 2024 | – | 90.3 | 79.3 | 64.6 | 89.8 | 63.1 | 77.4 | –
RGB+IR | DMM [26] | TGRS | 2025 | S2ANet | 90.5 | 77.7 | 73.2 | 90.0 | 65.1 | 79.3 | 88.0
RGB+IR | FGMF (Ours) | – | 2025 | S2ANet | 90.5 | 78.5 | 74.7 | 90.3 | 67.1 | 80.2 | 87.2
Table 5. Comprehensive comparative experiments on the VEDAI dataset. The bold and underlined fonts indicate the top two performances, respectively.
Modality | Method | Car | Truck | Tractor | Camping Car | Van | Pick-Up | Boat | Plane | Others | mAP50
RGB | RetinaNet [28] | 48.9 | 16.8 | 15.9 | 21.4 | 5.9 | 37.5 | 4.4 | 21.2 | 14.1 | 20.7
RGB | S2ANet [4] | 74.5 | 47.3 | 55.6 | 61.7 | 32.5 | 65.1 | 16.7 | 7.1 | 39.8 | 44.5
RGB | Faster R-CNN [27] | 71.4 | 54.2 | 61.0 | 70.5 | 59.5 | 67.6 | 52.3 | 77.1 | 40.1 | 61.5
RGB | RoI Trans [1] | 77.3 | 56.1 | 64.7 | 73.6 | 60.2 | 71.5 | 56.7 | 85.7 | 42.8 | 65.4
IR | RetinaNet [28] | 44.2 | 15.3 | 9.4 | 17.1 | 7.2 | 32.1 | 4.0 | 33.4 | 5.7 | 18.7
IR | S2ANet [4] | 73.0 | 39.2 | 41.9 | 59.2 | 32.3 | 65.6 | 13.9 | 12.0 | 23.1 | 40.0
IR | Faster R-CNN [27] | 71.6 | 49.1 | 49.2 | 68.1 | 57.0 | 66.5 | 35.6 | 71.6 | 29.5 | 55.4
IR | RoI Trans [1] | 76.1 | 51.7 | 51.9 | 71.2 | 64.3 | 70.7 | 46.9 | 83.3 | 28.3 | 60.5
RGB+IR | C2Former + S2ANet [25] | 76.7 | 52.0 | 59.8 | 63.2 | 48.0 | 68.7 | 43.3 | 47.0 | 41.9 | 55.6
RGB+IR | DMM + S2ANet [26] | 77.9 | 59.3 | 68.1 | 70.8 | 57.4 | 75.8 | 61.2 | 77.5 | 43.5 | 65.7
RGB+IR | FGMF + S2ANet (Ours) | 78.2 | 57.6 | 66.8 | 69.7 | 57.9 | 74.1 | 57.6 | 87.4 | 47.1 | 66.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
