Article

Infrared-Visible Image Fusion Meets Object Detection: Towards Unified Optimization for Multimodal Perception

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory of Target Cognition and Application Technology (TCAT), Beijing 100190, China
4 School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China
5 School of Automation, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(21), 3637; https://doi.org/10.3390/rs17213637
Submission received: 10 October 2025 / Revised: 2 November 2025 / Accepted: 3 November 2025 / Published: 4 November 2025

Highlights

What are the main findings?
  • Our proposed UniFusOD method integrates infrared-visible image fusion and object detection into a unified, end-to-end framework, achieving superior performance across multiple tasks.
  • The introduction of the Fine-Grained Region Attention (FRA) module and UnityGrad optimization significantly enhances the model’s ability to handle multi-scale features and resolves gradient conflicts, improving both fusion and detection outcomes.
What are the implications of the main findings?
  • The unified optimization approach not only improves image fusion quality but also enhances downstream task performance, particularly in detecting rotated and small objects.
  • This approach demonstrates significant robustness across various datasets, offering a promising solution for multimodal perception tasks in remote sensing and autonomous driving.

Abstract

Infrared-visible image fusion and object detection are crucial components in remote sensing applications, each offering unique advantages. Recent research has increasingly sought to combine these tasks to enhance object detection performance. However, the integration of these tasks presents several challenges, primarily due to two overlooked issues: (i) existing infrared-visible image fusion methods often fail to adequately focus on fine-grained or dense information, and (ii) while joint optimization methods can improve fusion quality and downstream task performance, their multi-stage training processes often reduce efficiency and limit the network’s global optimization capability. To address these challenges, we propose the UniFusOD method, an efficient end-to-end framework that simultaneously optimizes both infrared-visible image fusion and object detection tasks. The method integrates Fine-Grained Region Attention (FRA) for region-specific attention operations at different granularities, enhancing the model’s ability to capture complex information. Furthermore, UnityGrad is introduced to balance the gradient conflicts between fusion and detection tasks, stabilizing the optimization process. Extensive experiments demonstrate the superiority and robustness of our approach. Not only does UniFusOD achieve excellent results in image fusion, but it also provides significant improvements in object detection performance. The method exhibits remarkable robustness across various tasks, achieving a 0.8 and 1.9 mAP50 improvement over state-of-the-art methods on the DroneVehicle dataset for rotated object detection and the M3FD dataset for horizontal object detection, respectively.

1. Introduction

The rapid development of remote sensing satellite platforms has made the acquisition of vast amounts of data possible, significantly driving advancements in deep learning technologies [1]. However, the inherent limitations of individual sensor modalities present substantial challenges for achieving comprehensive visual perception. Visible images, with their high spatial resolution and rich color information, are adept at capturing texture and chromatic details. Nevertheless, they heavily depend on high-quality illumination conditions, and their performance deteriorates significantly in low-light environments, resulting in the loss of important information [2,3,4,5]. In contrast, infrared images, owing to their thermal imaging mechanism, are not dependent on ambient light, providing robust edge information even in dim lighting. They also offer certain advantages in terms of penetration and camouflage resistance, making them particularly useful for highlighting target contours. However, they fall short in representing fine textures [6,7,8]. Thus, relying solely on a single modality for object detection in remote sensing often leads to perceptual blind spots, limiting model performance and scene generalization.
To address these complementary deficiencies, Infrared-Visible Image Fusion has emerged as a critical research domain, with substantial applications in autonomous driving and remote sensing systems [9,10,11,12]. On the one hand, the task of infrared-visible image fusion aims to integrate complementary information from both types of images, creating a richer and more informative fused image. This enhanced image improves scene clarity and provides the necessary details relevant to the specific application scenario. Existing methods in infrared-visible fusion typically focus on feature-level fusion and alignment using deep learning techniques. These methods can be broadly classified into three categories: autoencoder (AE)-based approaches [4,13,14,15] enhance feature representation through reconstruction; generative adversarial network (GAN)-based approaches [8,16,17] constrain the fusion image distribution to align with original inputs, thus avoiding direct fusion weight learning; and unified models [18,19] employ cross-learning to address the lack of ground truth and training samples. However, these methods generally improve fusion performance from the perspectives of fusion weights or image distribution, with an emphasis on semantic information. They often overlook the fine-grained fusion of image details, which are crucial for downstream tasks, especially those requiring dense information. On the other hand, while fused images can provide high-quality inputs for higher-level perceptual tasks such as object detection and tracking [6,20,21], traditional infrared-visible fusion methods primarily focus on improving visual information quality without fully addressing the specific needs of downstream tasks. As a result, although fused images may exhibit high visual quality, they do not necessarily lead to significant improvements in perceptual accuracy or overall task performance in real-world applications [22].
To overcome these limitations, recent studies have explored the joint optimization of image fusion and high-level perception tasks such as object detection [20,21,22,23]. This approach aims to optimize both pixel-level and feature-level processes simultaneously, ensuring that enhancements in image fusion also improve downstream tasks like object detection. One of the primary advantages of joint optimization is its ability to leverage semantic information from object detection to guide the fusion process, making the fused image more effective for detection. Additionally, this optimization enables the fusion task itself to be more beneficial in enhancing object detection performance. However, despite its potential, several challenges remain, as illustrated in Figure 1. These challenges include the following: (1) Inefficient stepwise optimization: Most existing methods adopt a cascaded design, where image fusion and object detection networks are optimized in separate stages, as shown in Figure 1a–c. While this approach may offer improvements in both fusion and detection individually, it introduces inefficiencies due to the lack of integrated learning, making the process computationally expensive and complex. This stepwise optimization also poses challenges for real-time processing, as it does not leverage the potential for joint optimization that could reduce computational overhead. (2) Lack of focus on fine-grained or dense information: Existing feature fusion methods often fail to emphasize fine-grained or dense information, resulting in fused features that may not perform well in location-sensitive tasks such as detection and segmentation. This oversight limits the effectiveness of the fused features in tasks where precise spatial information is crucial. (3) Limited ability to find global optimal solutions: Multi-stage optimization methods often become trapped in local optima due to their stepwise nature. Furthermore, these methods typically connect tasks via the loss function, without structural interactions between them, limiting the optimization process’s ability to address the needs of both tasks simultaneously.
Overall, significant challenges remain in achieving synergistic optimization between image fusion and downstream tasks, particularly regarding efficient end-to-end optimization, for which effective solutions are still lacking. In this paper, we propose UniFusOD, a novel end-to-end framework that unifies image fusion and object detection (OD) tasks into a single, integrated optimization process, as illustrated in Figure 1d. By jointly optimizing these tasks, UniFusOD ensures that the fused image is not only visually enhanced but also optimized for downstream tasks like object detection, leveraging the complementary strengths of both modalities. This approach enhances feature fusion across various levels, improving visual perception capabilities. To enable the model to focus on important details at fine-grained levels, we introduce the Fine-Grained Region Attention (FRA) module. Inspired by the biological visual system, the FRA module allows the model to selectively attend to and distinguish key regions, thereby improving feature representation at both spatial and semantic levels. Additionally, we introduce UnityGrad, a novel optimization algorithm based on the Nash bargaining principle. UnityGrad resolves gradient conflicts between fusion and detection tasks, aligning their optimization directions and scales. This approach stabilizes the optimization process and enhances the efficiency and effectiveness of multimodal image fusion for object detection.
The main contributions of this paper are as follows:
(1) We present UniFusOD, an end-to-end multimodal image fusion detection framework that synchronously optimizes both image fusion and downstream tasks. This approach overcomes the inefficiencies and local optima issues associated with multi-stage optimization methods.
(2) The Fine-Grained Region Attention (FRA) module is designed to enhance the model’s ability to focus on and capture region-specific information at various levels of granularity. Inspired by biological visual systems, FRA improves feature representation by selectively attending to crucial regions, enabling the model to better capture and represent task-relevant information in complex multimodal images.
(3) We propose UnityGrad, inspired by the Nash bargaining principle, to resolve gradient conflicts between fusion and detection tasks. This novel approach harmonizes the optimization goals of both tasks, leading to a more balanced and efficient optimization process, ultimately stabilizing and improving model performance.
(4) Through extensive experiments on image fusion and object detection tasks, we demonstrate the effectiveness and robustness of our approach, achieving superior performance over traditional methods in both tasks.

2. Related Work

2.1. Multimodal Image Fusion and Object Detection

Deep learning has significantly advanced both low- and high-level visual tasks in remote sensing, particularly in image fusion and object detection, demonstrating great potential [6,24,25,26]. Early multimodal image fusion studies [7,27,28] mainly optimized fusion outcomes by adjusting network structures or loss functions, achieving good visual effects. From the perspective of network architecture design, many works adopted encoder-decoder frameworks to extract hierarchical features from source images—for instance, integrating residual blocks or dense connections to enhance feature propagation and avoid gradient vanishing, which is particularly effective for fusing heterogeneous modalities like visible and infrared images [29,30]. In terms of feature fusion strategies, researchers have explored multi-scale fusion mechanisms and attention-driven weight allocation to emphasize complementary information between modalities [31,32]. These feature-based fusion approaches allow for a more nuanced integration of information from different sources, potentially improving the overall quality of fused images. However, a significant limitation of feature-based fusion methods lies in their potential disconnection from the end-task performance, such as object detection. While these methods can enhance visual quality and detail in fused images, they may not always be optimized for downstream tasks, which require more task-specific feature integration [33]. For loss function optimization, pixel-level reconstruction losses and structural similarity (SSIM) loss were widely employed to constrain the fused image to be consistent with source images in pixel intensity and structural distribution [34,35]. However, these methods often overlook a crucial point: the primary goal of fusion is not just to improve visual quality but to enhance the performance of downstream tasks, such as object detection. Although high-quality fused images are visually impressive, they may not always meet the specific needs of practical applications [33].
Recent research has increasingly recognized that multimodal image fusion should not be an isolated task but closely integrated with downstream tasks like object detection, tracking, and segmentation. This has led to the development of joint optimization frameworks that combine image fusion with object detection. In these frameworks, fusion is not only aimed at generating visually pleasing images but also at improving downstream task performance. For example, Yuan et al. [26] pioneered cross-modal alignment to address airborne visible and infrared misalignment for rotated detection, establishing the foundational need to resolve modality discrepancies. Building on this, Liu et al. [22] introduced joint learning of fusion and detection with a novel loss function, directly incorporating detection-derived semantic and location information into fusion to simultaneously enhance both tasks. This approach improves fusion quality and detection performance by incorporating semantic and location information from the detection task into the fusion process. Finally, Liu et al. [36] generalized this interaction paradigm through a multi-interaction architecture, formalizing mutual task promotion beyond single-directional guidance to achieve bidirectional, task-aligned feature learning that collectively elevates fusion and detection performance.
Despite these advances, challenges remain. Object detection focuses on semantic understanding, while fusion and segmentation tasks emphasize pixel-level relationships, making the optimization of image fusion and object detection complex. A critical challenge is finding a balance that allows both tasks to mutually enhance each other [10,36,37]. Many methods still rely on cascade architectures, where separate modules are trained and inferred independently, resulting in high computational cost and inefficiency. Furthermore, efficiently integrating information from different modalities while removing redundant features remains a persistent challenge in multimodal fusion.
In conclusion, the integration of image fusion with object detection is a promising research area. By designing effective network architectures and loss functions, image fusion and object detection can mutually promote each other, enhancing the overall performance of multimodal image processing tasks. However, overcoming the optimization challenges requires exploring more efficient and flexible model architectures, particularly end-to-end optimization frameworks for joint inference of image fusion and object detection tasks.

2.2. Multitask Learning

Multitask Learning (MTL) is a technique that improves learning efficiency by simultaneously addressing multiple tasks and sharing information between them [38,39]. This information sharing is typically achieved through a shared hidden representation [40,41,42]. However, the optimization process in multitask learning presents several challenges, such as gradient conflicts between tasks [43,44] and plateau effects in the loss function [45], which complicate the optimization process.
To overcome these challenges, various architectures and methods have been proposed [46,47,48]. Some approaches focus on optimizing the training process by adjusting the gradients of tasks through weighting. For example, some studies weight the loss functions based on task uncertainty [49], gradient norms [50], stochastic weights [51], or gradient similarity [52,53]. However, these methods are predominantly heuristic and may lead to performance instability in practical applications [54]. Additionally, other methods employ techniques such as Neural Architecture Search (NAS) [55,56] or routing networks [57] to automatically discover shared patterns and determine network architectures. While effective in some cases, these approaches come with significant computational overhead.
Recently, there has been growing interest in multi-objective optimization based on the Multi-Gradient Descent Algorithm (MGDA) [58]. Under certain conditions, MGDA guarantees convergence to a Pareto stable point, making it a promising optimization strategy. Hotegni et al. [59] framed the multi-objective optimization problem as a multitask learning problem and introduced a task-weighting approach based on the Frank-Wolfe algorithm [60]. Liu et al. [54] proposed a method that maximizes the worst-case improvement by searching for the optimal update direction within the neighborhood of the average gradient. Liu [51] further developed a method to find a fair gradient direction by ensuring equal cosine similarity of gradients across all tasks. While this approach satisfies all the requirements of Nash axioms, it does not guarantee a Pareto optimal solution. Therefore, mitigating gradient conflicts in multitask learning remains a critical challenge.

3. Methodology

In this section, we introduce UniFusOD, a unified end-to-end framework that simultaneously addresses infrared-visible image fusion and object detection. Specifically, in Section 3.1, we formalize the joint fusion and detection task as an end-to-end optimization problem, aiming to simultaneously improve both visual quality and detection performance. Then, in Section 3.2, we present the overall framework, which integrates a shared backbone, a Fine-Grained Region Attention (FRA) module, and task-specific heads for fusion and detection. The entire model is trained end-to-end using the UnityGrad method, which harmonizes gradients from both tasks to enable stable and balanced multi-task optimization. In Section 3.3, we introduce the FRA mechanism designed to enhance region-level feature representation by focusing on important areas across multiple scales. Next, Section 3.4 details the task heads and their corresponding loss functions used to guide the model toward generating semantically meaningful fused images and accurate object detection. Finally, in Section 3.5, we propose UnityGrad, a gradient harmonization strategy that mitigates optimization conflicts between tasks, enabling more stable and effective end-to-end learning.

3.1. Problem Formulation

Assuming the visible image is denoted as $x \in \mathbb{R}^{H \times W \times 3}$ and the infrared image as $y \in \mathbb{R}^{H \times W \times 1}$, the optimization problem can be formulated as follows:
$$\min_{\theta_d, \theta_f} \mathcal{L}_d\big(t, \Phi(u; \theta_d)\big), \quad \text{s.t.}\; u = \Psi(x, y; \theta_f)$$
Here, $u$ represents the fused image, and $\Psi(\cdot)$ and $\Phi(\cdot)$ are the fusion network and detection network controlled by parameters $\theta_f$ and $\theta_d$, respectively.
To avoid optimization difficulties, most methods train the fusion network $\Psi(\cdot)$ and the detection network $\Phi(\cdot)$ separately at different stages. However, this approach makes it challenging to find the global optimal solution. To enable end-to-end optimization, we reformulate the problem as follows:
$$\theta^*, \theta_d^*, \theta_f^* = \arg\min_{\theta, \theta_d, \theta_f}\; \lambda\, \mathcal{L}_d\big(t, \Phi(x, y; \theta, \theta_d)\big) + (1 - \lambda)\, \mathcal{L}_f\big(\Psi(x, y; \theta, \theta_f)\big) + R(\theta, \theta_d, \theta_f)$$
Here, $\theta$ represents the parameters shared by the detection network and the fusion network, such as the backbone parameters, while $\theta_d$ and $\theta_f$ are the task-specific parameters of the detection and fusion networks, respectively. $\lambda$ is a balancing coefficient. To ensure the stability of the optimization process, we jointly optimize the losses $\mathcal{L}_d$ and $\mathcal{L}_f$, and $R(\cdot)$ is a regularization term applied to the parameters. The regularization constraints are implemented using the UnityGrad method, which adjusts the gradients during optimization.
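To make the reformulated objective concrete, the sketch below shows how the joint loss could be assembled in PyTorch-style code. The module and criterion names (backbone, fusion_head, det_head, fusion_criterion, det_criterion) are hypothetical placeholders rather than the authors' implementation, and the regularization is applied separately at the gradient level by UnityGrad (Section 3.5).

```python
def joint_loss(x_vis, y_ir, targets, backbone, fusion_head, det_head,
               fusion_criterion, det_criterion, lam=0.4):
    """Return lambda * L_d + (1 - lambda) * L_f over features from the shared backbone."""
    feats = backbone(x_vis, y_ir)                 # shared parameters theta
    fused = fusion_head(feats)                    # u = Psi(x, y; theta, theta_f)
    preds = det_head(feats)                       # detection outputs Phi(x, y; theta, theta_d)
    l_d = det_criterion(preds, targets)           # detection loss L_d
    l_f = fusion_criterion(fused, x_vis, y_ir)    # fusion loss L_f
    return lam * l_d + (1.0 - lam) * l_f
```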

3.2. Overall Architecture

The overall framework, as shown in Figure 2, consists of three components: the Backbone, the Fine-grained Region Attention (FRA), and task-specific heads. Moreover, to mitigate gradient conflicts between different tasks, the UnityGrad method is used during parameter updates to compute more stable gradients, thus stabilizing the multi-task optimization process.
For the visible image $x \in \mathbb{R}^{H \times W \times 3}$ and the infrared image $y \in \mathbb{R}^{H \times W \times 1}$, the backbone network $f(\cdot)$ is first employed to extract features $f(x)$ and $f(y)$ for each modality. To save memory and computational resources, the backbone shares parameters across the two modalities. The features extracted from each block of the backbone are then summed along the channel dimension, producing the mixed-modal features $z_1, z_2, \ldots, z_L$ from the $L$ blocks.
To enhance the model’s ability to perceive features from different regions, we propose a fine-grained region attention mechanism, which progressively extracts region and object information across different scales, improving the feature representation capacity. Finally, a lightweight task head is used to generate the fused image, ensuring it exhibits both high visual quality and strong semantic information. At the same time, a detection head is employed for object detection, ensuring that the learned representations effectively balance visual quality and detection task accuracy, making the model suitable for various perception tasks in multimodal scenarios.
Finally, the proposed UnityGrad method is used to modulate the gradients propagated from different tasks, solving for new update gradients to ensure stable optimization across tasks.

3.3. Fine-Grained Region Attention

In multimodal perception tasks, models must process information across multiple scales and feature representations. This requires feature extractors to adaptively capture region-specific information at various levels of granularity. However, traditional convolutional neural networks (CNNs) typically use fixed-size convolution kernels, limiting their ability to handle regions with fine-grained precision. In biological visual systems, the ability of neurons to selectively focus on regions at different scales is a key feature of visual perception. Based on this principle, we propose a fine-grained region attention mechanism (FRA), which improves the model’s ability to focus on and distinguish important regions by enabling more effective feature representation across multiple spatial and semantic levels. By integrating region-level attention mechanisms, FRA improves the model’s capacity to capture and represent crucial region-specific information in complex images. The structure of the FRA module is illustrated in Figure 3, providing a detailed view of how attention is applied across different regions.
The input to the FRA module consists of the multi-scale feature maps $z_1, z_2, \ldots, z_L$ extracted by the backbone. Each $z_l$ represents mixed features from the visible and infrared images at a different scale. Here, $L$ denotes the number of feature maps, and $z_l \in \mathbb{R}^{H_l \times W_l \times C_l}$ is the feature map at layer $l$, where $H_l$ and $W_l$ are its height and width and $C_l$ is its number of channels.
These multi-scale feature maps are extracted from different layers of the backbone, each representing information at a different granularity; together they carry multi-scale information from distinct regions of the image. To improve the network’s ability to capture region-specific features, we apply region-level operations to these feature maps. Specifically, we perform $K$ convolution operations on each feature map $z_l$ using different dilation factors and kernel sizes to generate initial region-specific attention maps, which are then aggregated to obtain the final attention maps. The region-level attention maps $A_l^k$ are computed as
$$A_l^k = \varphi_{k, d_k}(z_l) \in \mathbb{R}^{M \times H_l \times W_l}, \quad k = 1, 2, \ldots, K$$
where $\varphi_{k, d_k}(\cdot)$ represents the $k$-th convolution operation with dilation factor $d_k$. $A_l^k$ denotes the region-level attention maps generated by the $k$-th convolution operation, consisting of $M$ attention masks, each with spatial dimensions $H_l \times W_l$ but focusing on different regions of the image.
Considering the role of global information, we integrate global features into the region-specific attention maps to refine them. We apply global pooling to the input feature map $z_l$ to extract its global features, compressing each channel into a scalar to form a global feature vector $s_l \in \mathbb{R}^{1 \times C_l}$, which represents the global context of the image:
$$s_l = \frac{1}{H_l \times W_l} \sum_{i=1}^{H_l} \sum_{j=1}^{W_l} z_l(i, j)$$
This global feature vector is then passed through a feed-forward network (FFN), which generates a weight matrix $W_l \in \mathbb{R}^{K \times M}$. Each row $w_k$ of $W_l$ represents the weights assigned by the $k$-th convolution operation to the regions. To normalize the weights, we apply softmax along the $M$-dimension, yielding $\alpha_k$, which represents the contribution of the $k$-th operation to the $M$ regions:
$$\alpha_k = \mathrm{softmax}(w_k) \in \mathbb{R}^{M}$$
The softmax operation ensures that the attention weights are normalized, allowing the coefficients to reflect the relative importance of each region.
Using the weighting coefficients $\alpha_k$ predicted from the global information, we compute a weighted sum of the initial attention maps to obtain the combined attention map $A_l$:
$$A_l = \sum_{k=1}^{K} \alpha_k \cdot A_l^k \in \mathbb{R}^{M \times H_l \times W_l}$$
The combined maps effectively capture region-specific features by aggregating the attention from different dilation factors and kernel sizes, which focus on various spatial scales and receptive fields.
Next, we apply the Sigmoid activation function to $A_l$ to normalize it:
$$A_l = \sigma(A_l) \in \mathbb{R}^{M \times H_l \times W_l}$$
where $\sigma(\cdot)$ represents the Sigmoid activation function, which normalizes the attention weights for each region, ensuring a balanced distribution across all regions.
The attention map $A_l$ contains $M$ region masks, where each mask $A_l^m \in \mathbb{R}^{H_l \times W_l}$ represents the attention distribution for a specific region of the image. Thus, the region attention map $A_l$ can be represented as
$$A_l = \{A_l^1, A_l^2, \ldots, A_l^M\}$$
Using the region attention maps $A_l$, we apply pixel-wise weighting to the original feature map $z_l$, enhancing the features of the important regions. In this process, each region mask $A_l^m$ is multiplied pixel-wise with the corresponding region of the input feature map $z_l$, resulting in the weighted region feature map $v_l$:
$$v_l(i, j) = \sum_{m=1}^{M} A_l^m(i, j) \odot z_l(i, j)$$
where $\odot$ represents pixel-wise multiplication and $(i, j)$ denotes the spatial location. Through this operation, each region mask $A_l^m$ weights the corresponding region in the feature map $z_l$: regions with higher attention weights are amplified, while those with lower weights are suppressed. This enhances the important regions and reduces the impact of irrelevant or less important ones, refining the overall feature representation.
Applying region attention in this way to the multi-scale feature maps $z_1, z_2, \ldots, z_L$ extracted by the backbone produces the region-enhanced features $v_1, v_2, \ldots, v_L$, which are then used for the final detection and fusion tasks.
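As an illustration, a minimal PyTorch sketch of the FRA computation for a single scale $z_l$ is given below. The kernel sizes, dilation factors, FFN depth, and the default of $M = 8$ masks are assumptions chosen for clarity, not the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRABlock(nn.Module):
    """Region attention for one backbone scale z_l (channels-first tensors)."""

    def __init__(self, channels, num_masks=8, kernels=(3, 5, 7), dilations=(1, 1, 1)):
        super().__init__()
        self.num_masks = num_masks
        self.num_branches = len(kernels)
        # K parallel convolutions phi_k with different kernel sizes / dilations,
        # each predicting M initial region attention masks
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, num_masks, k, padding=d * (k // 2), dilation=d)
            for k, d in zip(kernels, dilations)
        ])
        # FFN mapping the global feature vector s_l to a K x M weight matrix W_l
        self.ffn = nn.Sequential(
            nn.Linear(channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, self.num_branches * num_masks),
        )

    def forward(self, z):                                   # z: (B, C, H, W)
        attn = [branch(z) for branch in self.branches]      # K maps of shape (B, M, H, W)
        s = z.mean(dim=(2, 3))                              # global pooling -> (B, C)
        w = self.ffn(s).view(-1, self.num_branches, self.num_masks)
        alpha = F.softmax(w, dim=2)                         # normalize over the M regions
        # Weighted sum of the K initial attention maps
        combined = sum(alpha[:, k].unsqueeze(-1).unsqueeze(-1) * attn[k]
                       for k in range(self.num_branches))
        masks = torch.sigmoid(combined)                     # M region masks A_l^m
        # Pixel-wise weighting of z_l by the M masks, summed over regions
        v = (masks.unsqueeze(2) * z.unsqueeze(1)).sum(dim=1)
        return v                                            # region-enhanced feature v_l
```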

3.4. Detection and Fusion Heads

In multimodal perception tasks, besides the feature extraction module, the task heads play a crucial role in both object detection and image fusion. We design two distinct task heads for object detection and image fusion, and optimize them using appropriate loss functions.
As shown in Figure 4, the image fusion task head restores the multi-scale region-level feature maps $v_1, v_2, \ldots, v_L$ from the FRA module to the same spatial dimensions as the original input image and then reconstructs the fused image. First, each $v_l$ is upsampled, typically with bilinear interpolation, to match the input resolution $(H, W)$. After upsampling, the feature maps are spatially aligned with the input image, preserving spatial consistency during fusion. We then sum all the upsampled feature maps to fuse information from different scales:
$$F_{\text{fuse}} = \sum_{l=1}^{L} \mathrm{Upsample}(v_l)$$
To reduce computational complexity and prepare the feature maps for the next step, we apply a $1 \times 1$ convolution layer to decrease the number of channels in the fused feature map. This compresses the channel dimension, making the map suitable for reconstruction. After channel reduction, we process the resulting feature map $F_{\text{reduce}}$ with five consecutive $3 \times 3$ convolution layers, each followed by a ReLU activation, to progressively reconstruct the fused image. Each convolution helps recover image details through its learned kernels.
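The sketch below outlines one plausible realization of this reconstruction path. It assumes that all region-enhanced maps $v_l$ share the same channel count so they can be summed after upsampling, and the intermediate width of 64 channels, the single-channel output, and the placement of ReLU after all but the last convolution are illustrative choices rather than the authors' exact design.

```python
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    def __init__(self, in_channels, mid_channels=64, out_channels=1):
        super().__init__()
        # 1x1 convolution for channel reduction
        self.reduce = nn.Conv2d(in_channels, mid_channels, kernel_size=1)
        # Five consecutive 3x3 convolutions; ReLU after all but the final layer
        layers = []
        for i in range(5):
            out_ch = out_channels if i == 4 else mid_channels
            layers.append(nn.Conv2d(mid_channels, out_ch, kernel_size=3, padding=1))
            if i < 4:
                layers.append(nn.ReLU(inplace=True))
        self.reconstruct = nn.Sequential(*layers)

    def forward(self, feats, out_size):
        # Upsample every region-enhanced map v_l to the input resolution (H, W) and sum
        fused = sum(F.interpolate(v, size=out_size, mode="bilinear",
                                  align_corners=False) for v in feats)
        return self.reconstruct(self.reduce(fused))
```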
To optimize fusion quality, we use the Structural Similarity Index (SSIM) and a Laplacian second-order gradient loss. SSIM is effective at evaluating structural and perceptual image quality and is computed as
$$\mathcal{L}_{\text{SSIM}} = \frac{1 - \mathrm{SSIM}(u, x)}{2} + \frac{1 - \mathrm{SSIM}(u, y)}{2}$$
where $u$ is the fused image, and $x$ and $y$ are the source images.
The Laplacian operator $\nabla^2$ captures second-order texture details, enhancing edges and fine features such as high-frequency textures. The gradient loss is defined as the difference between the Laplacian responses of the fused image and the source images:
$$\mathcal{L}_{\text{grad}} = \sum_{k=3,5,7} \left\| \nabla_k^2 u - \max\!\big(\nabla_k^2 x, \nabla_k^2 y\big) \right\|$$
where $\nabla_k^2$ is the Laplacian operator computed with Gaussian kernel size $k$, $u$ is the fused image, and $x$ and $y$ are the source images.
The final image fusion loss function $\mathcal{L}_f$ is
$$\mathcal{L}_f = \lambda_1 \mathcal{L}_{\text{SSIM}} + \lambda_2 \mathcal{L}_{\text{grad}}$$
where $\mathcal{L}_{\text{SSIM}}$ is the SSIM-based structural loss, $\mathcal{L}_{\text{grad}}$ is the Laplacian gradient loss, and $\lambda_1$ and $\lambda_2$ are balancing coefficients controlling the importance of each term.
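A compact sketch of this loss is shown below, assuming single-channel (grayscale) inputs and an external ssim callable that returns a mean SSIM score per image pair. The Gaussian smoothing followed by a fixed 3x3 Laplacian kernel is one plausible reading of the multi-kernel operator $\nabla_k^2$, not necessarily the exact formulation used in training.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Laplacian kernel used to approximate the second-order operator
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def gaussian_kernel(k, sigma=1.0):
    ax = torch.arange(k, dtype=torch.float32) - (k - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    g2d = torch.outer(g, g)
    return (g2d / g2d.sum()).view(1, 1, k, k)

def laplacian_of_gaussian(img, k):
    """Smooth with a k x k Gaussian, then apply the Laplacian (single-channel input)."""
    img = F.conv2d(img, gaussian_kernel(k).to(img), padding=k // 2)
    return F.conv2d(img, LAPLACIAN.to(img), padding=1)

def fusion_loss(u, x, y, ssim, lam1=1.0, lam2=1.0):
    """u: fused image, x: visible intensity, y: infrared, all shaped (B, 1, H, W)."""
    l_ssim = (1 - ssim(u, x)) / 2 + (1 - ssim(u, y)) / 2
    l_grad = 0.0
    for k in (3, 5, 7):
        target = torch.maximum(laplacian_of_gaussian(x, k),
                               laplacian_of_gaussian(y, k))
        l_grad = l_grad + (laplacian_of_gaussian(u, k) - target).abs().mean()
    return lam1 * l_ssim + lam2 * l_grad
```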
The object detection task head consists of regression and classification branches, aiming to detect target objects through accurate regression and classification. We use classic loss functions such as SmoothL1 Loss and Focal Loss, which have demonstrated strong performance in detection tasks:
$$\mathcal{L}_d = \lambda_3 \mathcal{L}_{\text{smoothL1}} + \lambda_4 \mathcal{L}_{\text{focal}}$$
where $\lambda_3$ and $\lambda_4$ are balancing coefficients that adjust the relative importance of the regression and focal losses.
Therefore, the final total loss function is
$$\mathcal{L} = \lambda \mathcal{L}_d + (1 - \lambda) \mathcal{L}_f$$
where $\lambda$ is a balancing coefficient that controls the trade-off between the object detection loss $\mathcal{L}_d$ and the image fusion loss $\mathcal{L}_f$, allowing the model to learn both tasks simultaneously.

3.5. UnityGrad

In image fusion and object detection tasks, optimization objectives often conflict in both gradient direction and magnitude: image fusion aims to generate high-quality fused images, while object detection focuses on improving detection accuracy. When these tasks share parameters $\theta$, conflicting gradients can lead to suboptimal updates and degraded overall performance. To address this challenge effectively, we propose UnityGrad, a principled approach that unifies conflicting gradients through cooperative bargaining.
Let $K$ denote the number of tasks and let $g_i \in \mathbb{R}^n$ be the gradient of the $i$-th task’s loss with respect to the shared parameters $\theta$. While we primarily focus on image fusion ($i = 1$) and object detection ($i = 2$) in this paper, the UnityGrad formulation generalizes to any number of tasks.
Given the current parameters $\theta$, we search for an update vector $\Delta\theta$ within the ball of radius $\epsilon$ centered at zero, denoted $B_\epsilon$. The key insight of UnityGrad is to formulate this as a cooperative bargaining problem in which the agreement set is $B_\epsilon$ and the disagreement point is $0$, representing no parameter update [61]. For each task, we define a utility function $u_i(\Delta\theta) = g_i^\top \Delta\theta$, representing how beneficial the update direction is for that task [62].
Our main assumption is that when θ is not at a Pareto stationary point, the task gradients are linearly independent. This ensures that the disagreement point (no update) is dominated by some point in B ϵ that benefits all tasks.
The core of UnityGrad is the following optimization objective:
$$\Delta\theta^* = \arg\max_{\Delta\theta \in B_\epsilon} \sum_{i=1}^{K} \log\big(\Delta\theta^\top g_i\big)$$
This logarithmic objective is derived from the Nash Bargaining Solution in cooperative game theory, which maximizes the product of utility gains. Taking the logarithm transforms this product into a sum while preserving the solution’s properties. The logarithmic formulation is particularly important as it ensures scale-invariance across tasks with different gradient magnitudes, preventing any single task from dominating the optimization process. We can characterize the solution to this logarithmic optimization problem as follows:
Claim 1.
Let $G$ be the $n \times K$ matrix whose columns are the gradients $g_i$. The solution to our optimization problem is (up to scaling) $\Delta\theta^* = \sum_i \alpha_i g_i$, where $\alpha \in \mathbb{R}_+^K$ is the solution to $G^\top G \alpha = 1/\alpha$, with $1/\alpha$ denoting the element-wise reciprocal.
Proof. 
To derive this result, we analyze the gradient of our objective function, which takes the form $\sum_{i=1}^{K} \frac{g_i}{\Delta\theta^\top g_i}$. We observe that for any vector $\Delta\theta$ satisfying $\Delta\theta^\top g_i > 0$ for all $i$, the utility functions increase monotonically with $\|\Delta\theta\|$. This, combined with the Pareto optimality characteristic inherent in bargaining solutions [61], necessitates that the optimal point lie on the boundary of $B_\epsilon$. Consequently, at the optimal solution, the gradient $\sum_{i=1}^{K} \frac{g_i}{\Delta\theta^\top g_i}$ must align with the radial direction. Mathematically, this means $\sum_{i=1}^{K} \frac{g_i}{\Delta\theta^\top g_i} = \lambda \Delta\theta$ for some scalar $\lambda$. Given the independence of the gradients, we can express $\Delta\theta$ as a linear combination $\Delta\theta = \sum_i \alpha_i g_i$ where each $\alpha_i > 0$. This yields the condition $\frac{1}{\Delta\theta^\top g_i} = \lambda \alpha_i$, which can be rearranged as $\Delta\theta^\top g_i = \frac{1}{\lambda \alpha_i}$. Since we require $\Delta\theta^\top g_i > 0$ for descent directions, it follows that $\lambda > 0$. For simplicity, we set $\lambda = 1$ to determine the direction of $\Delta\theta$ (noting that its magnitude might exceed $\epsilon$). The bargaining problem thus reduces to finding coefficients $\alpha \in \mathbb{R}^K$ with positive components such that $\Delta\theta^\top g_i = \sum_j \alpha_j\, g_j^\top g_i = \frac{1}{\alpha_i}$ for all $i$. This can be elegantly expressed in matrix form as $G^\top G \alpha = \alpha^{-1}$, where $\alpha^{-1}$ denotes the element-wise reciprocal vector. □
To solve the equation $G^\top G \alpha = 1/\alpha$ efficiently, we employ an iterative approach. We initialize $\alpha$ with uniform weights $\alpha^{(0)} = (1/K, \ldots, 1/K)$ and use the fixed-point iteration
$$\alpha^{(t+1)} = \sqrt{\frac{1}{G^\top G\, \alpha^{(t)}}}$$
where the square root and division operations are applied element-wise. This iteration continues until $\|\alpha^{(t+1)} - \alpha^{(t)}\| < \epsilon$ for a small threshold $\epsilon$, typically requiring only a few iterations to achieve good convergence.
Through this iterative optimization, UnityGrad converges to a Pareto Stationary Point, where the gradients of all tasks are balanced relative to each other. This ensures that no task’s loss can be further reduced without increasing another task’s loss, achieving true unity in the optimization process. The complete UnityGrad algorithm is summarized in Algorithm 1.
Algorithm 1 UnityGrad
Input: Initial shared parameters $\theta^{(0)}$; differentiable losses $\{\ell_i\}_{i=1}^{K}$; learning rate $\eta$; total steps $T$
Output: $\theta^{(T)}$
    for $t = 1$ to $T$ do
           Compute task gradients
           for $i = 1$ to $K$ do
                  $g_i^{(t)} \leftarrow \nabla_{\theta^{(t-1)}} \ell_i$
           end for
            $G^{(t)} \leftarrow [g_1^{(t)}, g_2^{(t)}, \ldots, g_K^{(t)}]$
           Solve for $\alpha^{(t)}$: find $\alpha^{(t)} \in \mathbb{R}_{>0}^{K}$ such that $(G^{(t)})^\top G^{(t)} \alpha^{(t)} = 1/\alpha^{(t)}$
           Update shared parameters
            $\theta^{(t)} \leftarrow \theta^{(t-1)} - \eta\, G^{(t)} \alpha^{(t)}$
    end for
    return $\theta^{(T)}$
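A numerical sketch of the alpha solve and the resulting shared-parameter update is given below in PyTorch. The iteration cap, convergence tolerance, and clamping of the Gram product (to guard against non-positive entries when gradients conflict strongly) are illustrative safeguards not specified in Algorithm 1.

```python
import torch

def solve_alpha(G, num_iters=20, tol=1e-6):
    """Fixed-point iteration for G^T G alpha = 1 / alpha (element-wise)."""
    K = G.shape[1]
    alpha = torch.full((K,), 1.0 / K, device=G.device)
    gram = G.t() @ G                                   # K x K Gram matrix of task gradients
    for _ in range(num_iters):
        # Clamp guards against non-positive entries under strong gradient conflict
        new_alpha = torch.sqrt(1.0 / (gram @ alpha).clamp_min(1e-12))
        if torch.norm(new_alpha - alpha) < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha

def unitygrad_step(shared_params, task_losses, lr):
    """One UnityGrad update of the shared parameters theta."""
    flat_grads = []
    for loss in task_losses:                           # e.g. [fusion_loss, detection_loss]
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        flat_grads.append(torch.cat([g.reshape(-1) for g in grads]))
    G = torch.stack(flat_grads, dim=1)                 # n x K matrix of task gradients
    alpha = solve_alpha(G)
    update = G @ alpha                                 # Delta theta = sum_i alpha_i g_i
    offset = 0
    with torch.no_grad():
        for p in shared_params:
            n = p.numel()
            p -= lr * update[offset:offset + n].view_as(p)
            offset += n
```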

4. Experiment

4.1. Datasets and Evaluation Criteria

4.1.1. Introduction to Experimental Datasets

We conducted validation on four publicly available infrared and visible image datasets, which are as follows: M3FD [22], DroneVehicle [63], RoadScene [18], and TNO [64]. Specifically, all datasets consist of co-registered infrared and visible image pairs, which are acquired concurrently with aligned sensors to ensure spatial and temporal consistency between modalities. The selection of these datasets allows us to comprehensively evaluate both image fusion and object detection performance under diverse conditions, reflecting the robustness of our approach across various challenges. The M3FD dataset was used to evaluate both detection and image fusion performance [22], while RoadScene and TNO datasets were used for evaluating image fusion performance [65]. The DroneVehicle dataset was used to assess performance in detecting rotated objects [42].
The M3FD dataset includes 4200 pairs of high-resolution aligned infrared and visible light images. The dataset covers a variety of scenes and is categorized into four different types: daytime, overcast, nighttime, and challenging conditions. Additionally, the M3FD dataset annotates a total of 33,603 objects across six categories: people, cars, buses, motorcycles, trucks, and lights. This makes it suitable for evaluating both object detection and fusion tasks.
The DroneVehicle dataset consists of a total of 56,878 images collected by drones, with half of the images being RGB and the other half infrared. The dataset provides detailed annotations for five categories: cars, trucks, buses, vans, and cargo trucks, using rotated bounding boxes for annotation. This increases the evaluation standard for the model’s detection capabilities and is well-suited for evaluating multimodal object detection performance.
The RoadScene dataset, created in 2020, is based on road scenes and includes paired infrared and visible light images. It contains 221 pairs of aligned images, covering a rich set of road scenes with bicycles, cars, pedestrians, and traffic lights. These image pairs were extracted from FLIR video footage and have been denoised and rigorously aligned. With a large number of high-resolution images, the RoadScene dataset is suitable for evaluating image fusion tasks.
The TNO dataset is a commonly used dataset in the field of infrared and visible image fusion. It includes a large collection of multispectral images from various military-related scenes, such as enhanced visual images, near-infrared images, long-wave infrared images, and thermal radiation images, collected by the Netherlands Organization for Applied Scientific Research. Unlike the MSRS and RoadScene datasets, the visible light images in the TNO dataset are single-channel images. It contains night-time images of multi-band military scenes, with a total of 60 pairs of infrared and visible light images.

4.1.2. Evaluation Criteria

To comprehensively evaluate the performance of the model, we used five evaluation metrics for the image fusion task: Entropy (EN), Structural Similarity Index (SSIM), Mutual Information (MI), Visual Information Fidelity (VIF), and Standard Deviation (SD). The object detection task was assessed using mean Average Precision (mAP).
EN measures the information richness of the fused image. A higher entropy value indicates that the fused image contains more information. The entropy is calculated as
$$\mathrm{EN} = -\sum_{n=1}^{N} p_n \log_2 p_n$$
where $N$ is the number of gray levels in the fused image, and $p_n$ is the proportion of pixels at gray level $n$ in the fused image.
SSIM is a metric used to assess image quality, particularly to measure the structural similarity between the fused image and the reference image. The SSIM is calculated as
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where $\mu_x$ and $\mu_y$ are the mean values of images $x$ and $y$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, $\sigma_{xy}$ is their covariance, and $C_1$ and $C_2$ are constants that avoid division by zero.
SSIM values range from −1 to 1, with 1 indicating perfect similarity. The closer the value is to 1, the more similar the structural information between the images.
MI quantifies how much information is retained in the fused image from the source images. In information theory, MI is used to measure the dependence between two random variables. The MI is calculated as
$$\mathrm{MI}(A, B) = H(A) + H(B) - H(A, B)$$
where $H(A)$ and $H(B)$ are the entropies of images $A$ and $B$, and $H(A, B)$ is their joint entropy.
VIF evaluates the image quality by quantifying the consistency between the image content and human visual perception. Unlike traditional pixel-based metrics (such as MSE or PSNR), VIF considers the perceptual quality by accounting for how the human eye is more sensitive to certain frequencies. The VIF calculation involves several steps: first, the image is decomposed using filters (such as Gaussian filters) to generate multi-scale representations. Then, information is calculated for each scale, with higher weights given to low-frequency components due to human sensitivity. Finally, the VIF is computed by combining the information from all scales:
$$\mathrm{VIF} = \sum_{i=1}^{M} \frac{I_i}{I_i + \sigma_i^2}$$
where $I_i$ represents the information content at the $i$-th scale, $\sigma_i^2$ is the noise variance at that scale, and $M$ is the number of scales. A higher VIF indicates better image quality.
SD measures the degree of variation in the pixel values of the image. A higher SD indicates that the image has more distinct details and richer textures. The SD is calculated as
$$\mathrm{SD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$
where $x_i$ is the $i$-th pixel value of the image, $\mu$ is the mean of the pixel values, and $N$ is the total number of pixels.
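For reference, the sketch below computes EN, MI, and SD with numpy for 8-bit grayscale arrays, following the definitions above; the choice of 256 histogram bins matches the number of gray levels and is an implementation convention rather than a prescription from the text.

```python
import numpy as np

def entropy(img):
    """EN of an 8-bit grayscale image (256 gray levels)."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a, b):
    """MI(A, B) = H(A) + H(B) - H(A, B) from a joint 256 x 256 histogram."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=256,
                                 range=[[0, 256], [0, 256]])
    p_joint = joint / joint.sum()
    p_a, p_b = p_joint.sum(axis=1), p_joint.sum(axis=0)
    h_a = -np.sum(p_a[p_a > 0] * np.log2(p_a[p_a > 0]))
    h_b = -np.sum(p_b[p_b > 0] * np.log2(p_b[p_b > 0]))
    h_ab = -np.sum(p_joint[p_joint > 0] * np.log2(p_joint[p_joint > 0]))
    return float(h_a + h_b - h_ab)

def standard_deviation(img):
    """SD as the root mean squared deviation from the mean intensity."""
    return float(np.sqrt(np.mean((img - img.mean()) ** 2)))
```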
For evaluating object detection performance, we use Precision, Recall, mAP0.5, and mAP0.5:0.95 as the evaluation metrics. These metrics are derived based on the counts of true positives (TP), false positives (FP), and false negatives (FN), as well as the Intersection over Union (IoU) between predicted and ground-truth bounding boxes.
Precision is the ratio of correctly predicted positive samples to all detected samples, calculated as
$$\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
Recall is the ratio of correctly predicted positive samples to the total number of actual positive samples, calculated as
$$\mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
Average Precision (AP) is the area under the precision-recall curve, calculated as
$$\mathrm{AP} = \int_{0}^{1} \mathrm{precision}(\mathrm{recall}) \, d(\mathrm{recall})$$
Mean Average Precision (mAP) is the average of the AP values for all classes, calculated as
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$
where $\mathrm{AP}_i$ is the AP value for class $i$, and $N$ is the number of classes in the dataset. mAP0.5 denotes the average precision at an Intersection over Union threshold of 0.5, and mAP0.5:0.95 denotes the mean average precision over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
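As a worked illustration of the AP integral, the numpy sketch below computes all-point-interpolated AP for one class at a fixed IoU threshold from per-detection confidence scores and TP/FP flags, and then averages per-class APs into mAP. This is a generic implementation rather than the exact evaluation code used in the experiments.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """All-point-interpolated AP for one class at a fixed IoU threshold."""
    scores = np.asarray(scores, dtype=float)
    flags = np.asarray(is_tp, dtype=float)
    order = np.argsort(-scores)                   # sort detections by confidence
    tp = np.cumsum(flags[order])
    fp = np.cumsum(1.0 - flags[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-12)
    # Make the precision envelope monotonically decreasing, then integrate over recall
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_ap):
    """mAP as the mean of per-class AP values."""
    return float(np.mean(per_class_ap))
```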

4.2. Implementation Details

All experiments were conducted on an NVIDIA RTX 4090 GPU, using the MMDetection framework for multimodal image fusion and detection. During training, the Adam optimizer was applied with an initial learning rate of $1 \times 10^{-4}$, which decayed every 10 epochs. We employed the UniFusOD structure shown in Figure 2 for end-to-end training, using both the fusion and detection losses. The hyperparameters in the detection and fusion heads were set to $\lambda_1 = \lambda_2 = \lambda_3 = \lambda_4 = 1$, with $\lambda = 0.4$ and $1 - \lambda = 0.6$ to balance the fusion and detection tasks. For the M3FD dataset, a 2:8 training-to-testing split was used to ensure stability during training and reliability in evaluation. All images were resized and underwent essential data augmentation, including random cropping and flipping, to enhance the model’s generalization ability. During inference, performance was evaluated separately for fusion and detection, with adjustments made based on the characteristics of each dataset. This allows for a more targeted assessment of the model’s strengths in different aspects of multimodal processing.
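The stated optimizer and schedule can be sketched roughly as follows. Here model, train_loader, and compute_joint_loss are placeholders, the total epoch count and the decay factor gamma = 0.1 are assumptions not given in the text, and in the full method the plain backward/step pair would be replaced by the UnityGrad update described in Section 3.5.

```python
import torch

def train(model, train_loader, compute_joint_loss, num_epochs=30, device="cuda"):
    """Illustrative training loop matching the stated optimizer and schedule."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Decay the learning rate every 10 epochs; gamma is an assumed value
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
    for epoch in range(num_epochs):
        for batch in train_loader:
            loss = compute_joint_loss(model, batch)   # lambda = 0.4 / 0.6 weighting
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```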
The experimental results demonstrate that the proposed method achieves strong performance across multiple tasks and datasets, highlighting the potential of this algorithm in the fields of multimodal image fusion and object detection.

5. Results

5.1. Results on Infrared-Visible Image Fusion

To validate the competitiveness of our algorithm, we compared it with ten other methods: DIDFuse [7], FusionGAN [5], SDNet [2], U2Fusion [18], TarDAL [22], RFN-Nest [13], DenseFuse [4], CDDFuse [65], AMDANet [66] and MMIF-INet [67]. We evaluated performance using five metrics: Entropy (EN), Structural Similarity Index (SSIM), Mutual Information (MI), Visual Information Fidelity (VIF), and Standard Deviation (SD), with detailed descriptions provided in Section 4.1.2. The experimental results show that, even with a simple and direct feature extraction approach, our method outperforms the others on several metrics.
On the TNO dataset, as shown in Table 1, our method achieves higher EN and MI values, 7.44 and 2.00, respectively, significantly surpassing the alternatives. This indicates that our method preserves more image details and information. Our method also outperforms others in SSIM with a score of 1.07, indicating that the fused images retain the best structural resemblance to the originals. While the SD of our method is slightly lower than that of CDDFuse, it maintains higher stability, balancing brightness variations and avoiding overprocessing.
On the RoadScene dataset, the results in Table 2 indicate that, although SSIM is slightly lower than that of some methods, our method still excels in EN and SD, especially with an SD of 59.48, reflecting superior brightness and contrast retention. Additionally, our MI value of 1.96 is comparable to other methods, demonstrating strong information preservation.
For the M3FD dataset, as shown in Table 3, our method achieves the highest MI and EN values, 1.40 and 6.60, respectively, showing a clear advantage in information and detail retention. Although VIF and SSIM are slightly lower than some methods, the overall fusion quality remains high, particularly in terms of detail and information fidelity.
To visually validate the fusion performance, Figure 5 presents qualitative fusion results. The first row shows M3FD samples, and the second row shows DroneVehicle samples, with each group comprising visible, infrared, and fused images. The blue rectangles in the fused images highlight regions where key thermal features from the infrared modality are retained. Meanwhile, the red rectangles emphasize enhanced texture and color semantics originating from the visible spectrum. These results demonstrate that our fusion algorithm effectively integrates complementary information from both modalities, maintaining critical details while enhancing overall scene visibility.

5.2. Results on Infrared-Visible Object Detection

In addition to evaluating image fusion quality, we also assessed the performance of the detector using the DroneVehicle and M3FD datasets. To ensure a fair comparison, we used Oriented RCNN [68] as the baseline model for rotated object detection and YOLOv5 [69] for horizontal object detection on the M3FD dataset, consistent with prior methods. The backbone network was kept the same as the fusion network. The object detection results demonstrate that our method outperforms others across several categories, validating its effectiveness and stability.
On the DroneVehicle dataset, as shown in Table 4, the fused detector integrating visible and infrared images with the Oriented RCNN architecture achieved notable performance gains across multiple object categories. In terms of overall performance, our method attained an optimal mAP of 79.5, 0.8 higher than M2FP, the second-ranked method. For individual categories, our method delivered the highest detection accuracy for Car, Truck, and Van. Specifically, the Car detection accuracy reached 96.4, 0.7 higher than M2FP; the Truck accuracy hit 81.3, a substantial 3.0 higher than C2Former, which ranked second in this category; and the Van accuracy reached 65.6, 0.7 higher than AFFCM. These results highlight our clear advantage in object detection, confirming that fusing visible and infrared modalities significantly improves performance, particularly for small objects in complex environments.
On the M3FD dataset, as seen in Table 5, our method paired with YOLOv5 demonstrated superior performance, achieving 59.4 mAP and 86.9 mAP50, both 1.9 higher than Fusion-Mamba, the second-best method. In category-specific detection, our method outperformed all counterparts: the Car detection accuracy reached 95.0, 0.2 higher than TarDAL, which previously held the top spot; the Bus accuracy hit 94.1, 0.9 higher than SuperFusion; the Motorcycle accuracy reached 77.8, 0.4 higher than SuperFusion; and the Lamp accuracy attained 88.6, 0.8 higher than DetFusion. However, Truck detection was lower at 82.4, compared to 87.1 and 85.8 for Fusion-Mamba and SuperFusion, respectively. This indicates that our method struggles with larger objects, likely due to the challenges of capturing complex spatial features or less distinct boundaries, which may not be fully addressed by the current fusion-detection framework. Overall, our method demonstrates strong performance, especially for smaller objects and in challenging environments, though further refinement is needed for detecting larger objects such as Trucks.
To further validate the detection robustness, Figure 6 provides qualitative results on both datasets. It clearly illustrates UniFusOD’s capability in accurately localizing rotated and small targets, even under challenging multimodal conditions. The comparison with ground truth highlights its strong spatial precision and reliable semantic understanding.
In summary, UniFusOD not only excels in image fusion but also delivers state-of-the-art performance in object detection across varying modalities, categories, and visual complexities. Its robustness against object rotation, scale variation, and modality noise makes it a compelling solution for multimodal perception tasks.

5.3. Ablation Study Results

To verify the effectiveness of the proposed Fine-Grained Region Attention (FRA) mechanism and UnityGrad optimization strategy, we conducted systematic ablation experiments on the M3FD dataset. Performance was evaluated using both image fusion metrics and object detection metrics. Results are presented in Table 6, where a check mark indicates that the corresponding module is enabled, and bold values denote the best performance.
As shown in Table 6, the Baseline achieves initial performance with EN of 5.21, SSIM of 0.76, mAP50 of 83.3, and mAP50:95 of 57.1. The introduction of the FRA module significantly enhances these metrics. By focusing on fine-grained regional variations during the fusion process, FRA boosts EN to 6.44, improves the VIF to 0.64, and increases the detection metrics by 2.7 for mAP50 and 1.6 for mAP50:95. This demonstrates FRA’s ability to capture multi-scale regional information, which significantly enhances both visual quality and semantic representation.
Further integration of UnityGrad—designed to reduce conflicts during multi-task optimization—leads to even more substantial gains across the board. Specifically, UnityGrad improves fusion metrics, with SD increasing by 4.97 compared to FRA alone. The SSIM increases to 0.85, while EN sees a modest improvement of 0.16. Detection performance is also enhanced, with mAP50 increasing by 0.9 to 86.9 and mAP50:95 rising by 0.7 to 59.4. These results confirm that UnityGrad optimizes the gradient propagation across tasks, allowing the FRA module to fully exploit its potential in synergizing image fusion and object detection tasks.

6. Discussion

6.1. Study of Fine-Grained Region Attention

The FRA module uses multi-scale feature maps and region-level attention to adaptively focus on important regions, enhancing the model’s ability to capture key features in complex images. To validate its effectiveness, we conducted ablation experiments on the M3FD dataset. Specifically, we investigated the effects of (1) varying the number and configuration of the convolutional operators $\varphi_k$ and (2) changing the number of region attention maps M.

6.1.1. Effect of Different $\varphi_k$ Designs

Each $\varphi_k$ denotes a convolution operation with a specific kernel size and dilation factor, followed by batch normalization and ReLU activation. By combining multiple such operators, the module generates diverse region attention maps, enabling it to model spatial contexts of different granularities. To study the impact of varying $\varphi_k$, we conducted experiments using combinations of convolutional kernels with sizes 3 × 3, 5 × 5, 7 × 7, and 11 × 11. The number of operators in the FRA module, denoted as K, directly determines the diversity of region-wise attention maps. Table 7 presents the results for both image fusion and object detection tasks.
The results show that using a single small kernel such as 3 × 3 limits the model’s ability to capture broader contextual information, as it tends to focus solely on fine-grained local details. Specifically, the fusion metric EN changes only marginally from 6.44 to 6.42, and detection accuracy measured by mAP50 shows only a minor rise from 86.0 to 86.2. When a second kernel of size 5 × 5 is added, the model benefits from a broader receptive field, resulting in a noticeable enhancement in performance: for instance, the spatial detail metric SD improves from 37.86 to 40.25, and mAP50:95 increases from 59.0 to 59.1.
The best performance is observed when three kernels are used—specifically 3 × 3, 5 × 5, and 7 × 7. Under this configuration, all key metrics reach their peak values. The EN reaches 6.60, SD increases to 42.52, mutual information MI rises to 1.40, and VIF stands at 0.68. In terms of detection, mAP50 reaches 86.9, while mAP50:95 improves to 59.4. However, introducing a fourth kernel with a size of 11 × 11 slightly degrades performance. For example, SD drops to 39.88 and mAP50:95 decreases to 58.9, which may be attributed to the over-smoothing of fine details and increased computational overhead, leading to redundancy in feature representation.
To gain deeper insight into how different $\varphi_k$ configurations affect spatial attention, we visualize the corresponding region attention maps in Figure 7. The attention map generated by the first operator, which uses a 3 × 3 kernel, predominantly focuses on localized, fine-grained regions. As larger kernels are progressively introduced, the receptive field expands, allowing the attention to gradually shift toward broader, more semantically meaningful areas across the object. In the final aggregation stage, the attention maps clearly highlight the full extent of the target object, demonstrating the effectiveness of the FRA module in capturing both local details and global structural information.
These results collectively confirm that incorporating multiple convolutional operators with varied receptive fields significantly enhances the model’s ability to capture both fine-grained details and broader semantic context. A carefully selected combination of small to medium-sized kernels enables the FRA module to generate diverse region attention, which leads to more informative and discriminative feature representations.

6.1.2. Effect of the Number of Region Attention Maps

The parameter M represents the number of region attention weights, which controls the number of regions the model focuses on during region modeling. As shown in Table 8, increasing M leads to a significant improvement in the model’s performance for both image fusion and object detection tasks.
When M = 2 , the model performs at a basic level, with an EN of 6.30, indicating that it cannot effectively focus on high-information regions during image fusion, which in turn limits detection performance (mAP50 of 85.8 and mAP50:95 of 58.5). As M increases to 4, performance improves slightly, with EN reaching 6.35, mAP50 at 86.0, and mAP50:95 at 58.7, suggesting some improvement in region modeling. Increasing M further to 6 yields a more substantial gain, with EN at 6.38, MI at 1.35, SSIM at 0.75, mAP50 at 86.5, and mAP50:95 at 59.0, indicating an enhanced ability to capture and represent important regional features. The best performance is obtained at M = 8 (EN of 6.60, mAP50 of 86.9, and mAP50:95 of 59.4): the larger number of region attention maps strengthens the model’s capacity to represent region-specific features in complex images, benefiting both detection accuracy and fusion quality. However, when M is increased to 10, performance declines slightly, with EN dropping to 6.40, mAP50 to 86.0, and mAP50:95 to 58.8, suggesting that the benefits of region modeling have saturated while the computational cost continues to grow.
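As a rough illustration of why the cost keeps growing with M, the short script below counts the convolution and batch-normalization parameters added by the attention branches under the same illustrative layer sizes assumed in the sketch of Section 6.1.1 (256 input channels; 3 × 3, 5 × 5, and 7 × 7 branches). These are back-of-the-envelope estimates, not the exact complexity of FRA; they simply show that the branch parameters scale linearly with M.

def region_attention_params(channels, num_maps, kernels=(3, 5, 7)):
    # Conv weights (no bias) plus BN scale/shift for each branch, plus the
    # 1x1 aggregation convolution; matches the illustrative sketch above.
    total = 0
    for k in kernels:
        total += channels * num_maps * k * k  # branch conv weights
        total += 2 * num_maps                 # BN gamma + beta
    total += len(kernels) * num_maps + 1      # 1x1 aggregation conv + bias
    return total


for m in (2, 4, 6, 8, 10):
    print(f"M = {m:2d}: {region_attention_params(256, m):,} extra parameters")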

6.2. Study of UnityGrad Algorithm

To evaluate the robustness of the proposed UnityGrad method, we conduct comparative experiments with three representative MTL approaches, GradNorm [50], PCGrad [82], and CAGrad [54], using multiple evaluation metrics. As shown in Table 9, UnityGrad outperforms all competing methods, with clear gains in fusion fidelity (SSIM of 0.85), detection accuracy (mAP50:95 of 59.4), and information preservation (EN of 6.60 and VIF of 0.68). These results confirm that UnityGrad effectively mitigates task interference in multi-task learning, ensuring both superior and stable performance across these dimensions. Compared with existing MTL methods, UnityGrad exhibits enhanced robustness, which we attribute to its gradient coordination mechanism across tasks.
In addition to the quantitative results in Table 6 and Table 9, we analyzed the gradient behavior during training to better understand how UnityGrad mitigates conflicts between tasks. As illustrated in Figure 8, the blue curves represent gradients of the detection loss with respect to the shared parameters, while the red curves correspond to gradients of the fusion loss. Without UnityGrad, the detection task dominates the shared gradient space, and the fusion branch, whose gradients are comparatively small in magnitude, is suppressed. With UnityGrad, the gradient contributions from the two tasks are better balanced, leading to improved gradient alignment and a more stable optimization process. Overall, UnityGrad enhances both low-level image fidelity and high-level semantic accuracy, delivering stable joint optimization and superior end-to-end performance for integrated vision tasks.
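For readers who want to reproduce this kind of diagnostic, the sketch below shows how per-task gradients on the shared parameters can be flattened, compared via cosine similarity and magnitude ratio (the quantities underlying Figure 8), and, when they conflict, reconciled with a PCGrad-style projection [82]. It illustrates the general conflict-resolution idea only; the exact UnityGrad update rule is not reproduced here.

import torch


def flat_grad(loss, shared_params):
    # Gradient of a single task loss w.r.t. the shared parameters, flattened
    # into one vector (zeros for parameters the loss does not touch).
    grads = torch.autograd.grad(loss, shared_params, retain_graph=True,
                                allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, shared_params)])


def conflict_stats(g_det, g_fus):
    # Cosine similarity (< 0 means conflicting directions) and magnitude ratio
    # (<< 1 means the fusion gradient is dominated by the detection gradient).
    cos = torch.nn.functional.cosine_similarity(g_det, g_fus, dim=0)
    ratio = g_fus.norm() / (g_det.norm() + 1e-12)
    return cos.item(), ratio.item()


def project_if_conflicting(g_a, g_b):
    # PCGrad-style surgery: if g_a conflicts with g_b, drop the component of
    # g_a that points against g_b before the two gradients are combined.
    dot = torch.dot(g_a, g_b)
    if dot < 0:
        g_a = g_a - dot / (g_b.norm() ** 2 + 1e-12) * g_b
    return g_a

In UniFusOD, the fusion and detection gradients obtained in this way are instead coordinated by UnityGrad before the shared backbone is updated.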

7. Conclusions

In this paper, we proposed UniFusOD, a unified framework designed to jointly optimize infrared-visible image fusion and object detection. The network performs end-to-end optimization of the low-level and high-level tasks. The Fine-Grained Region Attention (FRA) module enhances the model’s ability to recognize complex region-specific information by applying attention operations at multiple granularities. To address gradient conflicts between the fusion and detection tasks, we introduced the UnityGrad method, which balances the gradients of the two tasks and thereby stabilizes and improves optimization. Experimental results show that UniFusOD significantly enhances both image fusion and object detection across multiple datasets, demonstrating its potential for real-world applications such as autonomous driving and remote sensing. In future work, we plan to extend the framework to additional modalities, such as LiDAR and SAR, to further improve its performance in complex environments.

Author Contributions

Conceptualization, X.X., B.N., Z.W. and W.G.; Methodology, X.X.; Software, X.X.; Validation, Z.P.; Formal analysis, J.Q.; Resources, G.Z., B.N. and L.H.; Data curation, X.X. and W.L.; Writing—original draft, X.X.; Writing—review & editing, B.N., Z.P. and L.H.; Visualization, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
  2. Zhang, H.; Ma, J. SDNet: A versatile squeeze-and-decomposition network for real-time image fusion. Int. J. Comput. Vis. 2021, 129, 2761–2785. [Google Scholar] [CrossRef]
  3. Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral deep neural networks for pedestrian detection. arXiv 2016, arXiv:1611.02644. [Google Scholar] [CrossRef]
  4. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
  5. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  6. Zhang, X.; Demiris, Y. Visible and infrared image fusion using deep learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10535–10554. [Google Scholar] [CrossRef] [PubMed]
  7. Zhao, Z.; Xu, S.; Zhang, C.; Liu, J.; Li, P.; Zhang, J. DIDFuse: Deep image decomposition for infrared and visible image fusion. arXiv 2020, arXiv:2003.09210. [Google Scholar]
  8. Ma, J.; Liang, P.; Yu, W.; Chen, C.; Guo, X.; Wu, J.; Jiang, J. Infrared and visible image fusion via detail preserving adversarial learning. Inf. Fusion 2020, 54, 85–98. [Google Scholar] [CrossRef]
  9. Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 787–803. [Google Scholar]
  10. Yuan, M.; Cui, B.; Zhao, T.; Wei, X. UniRGB-IR: A Unified Framework for Visible-Infrared Downstream Tasks via Adapter Tuning. arXiv 2024, arXiv:2404.17360. [Google Scholar]
  11. Kim, D.; Ruy, W. CNN-based fire detection method on autonomous ships using composite channels composed of RGB and IR data. Int. J. Nav. Archit. Ocean. Eng. 2022, 14, 100489. [Google Scholar] [CrossRef]
  12. Zhao, G.; Hu, Z.; Feng, S.; Wang, Z.; Wu, H. GLFuse: A Global and Local Four-Branch Feature Extraction Network for Infrared and Visible Image Fusion. Remote Sens. 2024, 16, 3246. [Google Scholar] [CrossRef]
  13. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar] [CrossRef]
  14. Wang, Z.; Ng, M.K.; Michalski, J.; Zhuang, L. A self-supervised deep denoiser for hyperspectral and multispectral image fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5520414. [Google Scholar] [CrossRef]
  15. Zhao, Z.; Xu, S.; Zhang, J.; Liang, C.; Zhang, C.; Liu, J. Efficient and model-based infrared and visible image fusion via algorithm unrolling. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1186–1196. [Google Scholar] [CrossRef]
  16. Hou, J.; Zhang, D.; Wu, W.; Ma, J.; Zhou, H. A generative adversarial network for infrared and visible image fusion based on semantic segmentation. Entropy 2021, 23, 376. [Google Scholar] [CrossRef] [PubMed]
  17. Ma, J.; Zhang, H.; Shao, Z.; Liang, P.; Xu, H. GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion. IEEE Trans. Instrum. Meas. 2020, 70, 5005014. [Google Scholar] [CrossRef]
  18. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
  19. Xu, H.; Wang, X.; Ma, J. DRF: Disentangled representation for visible and infrared image fusion. IEEE Trans. Instrum. Meas. 2021, 70, 5006713. [Google Scholar] [CrossRef]
  20. Zhang, S.; Zhang, X.; Ren, W.; Shen, L.; Wan, S.; Zhang, J.; Jiang, Y.M. Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection. arXiv 2025, arXiv:2501.15099. [Google Scholar] [CrossRef]
  21. Li, S.; Han, M.; Qin, Y.; Li, Q. Self-attention progressive network for infrared and visible image fusion. Remote Sens. 2024, 16, 3370. [Google Scholar] [CrossRef]
  22. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  23. Zhao, W.; Xie, S.; Zhao, F.; He, Y.; Lu, H. Metafusion: Infrared and visible image fusion via meta-feature embedding from object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13955–13965. [Google Scholar]
  24. Bae, S.; Shin, H.; Kim, H.; Park, M.; Choi, M.Y.; Oh, H. Deep learning-based human detection using rgb and ir images from drones. Int. J. Aeronaut. Space Sci. 2024, 25, 164–175. [Google Scholar] [CrossRef]
  25. Lee, Y.; Kim, S.; Lim, H.; Lee, H.K.; Choo, H.G.; Seo, J.; Yoon, K. Performance analysis of object detection neural network according to compression ratio of RGB and IR images. J. Broadcast Eng. 2021, 26, 155–166. [Google Scholar]
  26. Yuan, M.; Wang, Y.; Wei, X. Translation, scale and rotation: Cross-modal alignment meets RGB-infrared vehicle detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 509–525. [Google Scholar]
  27. Liu, J.; Fan, X.; Jiang, J.; Liu, R.; Luo, Z. Learning a deep multi-scale feature ensemble and an edge-attention guidance for image fusion. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 105–119. [Google Scholar] [CrossRef]
  28. Liu, R.; Liu, Z.; Liu, J.; Fan, X. Searching a hierarchically aggregated fusion architecture for fast multi-modality image fusion. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 1600–1608. [Google Scholar]
  29. Wang, J.; Lan, C.; Gao, Z. Deep Residual Fusion Network for Single Image Super-Resolution. J. Phys. Conf. Ser. 2020, 1693, 012164. [Google Scholar] [CrossRef]
  30. Peng, J.; Zhang, W.; Hou, Y.; Yu, H.; Zhu, Z.l. ECAFusion: Infrared and visible image fusion via edge-preserving and cross-modal attention mechanism. Infrared Phys. Technol. 2025, 151, 106085. [Google Scholar] [CrossRef]
  31. Zhang, C.; He, D. A Deep Multiscale Fusion Method via Low-Rank Sparse Decomposition for Object Saliency Detection Based on Urban Data in Optical Remote Sensing Images. Wirel. Commun. Mob. Comput. 2020, 2020, 7917021. [Google Scholar] [CrossRef]
  32. Zhang, P.; Jiang, Q.; Cai, L.; Wang, R.; Wang, P.; Jin, X. Attention-based F-UNet for Remote Sensing Image Fusion. In Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Haikou, China, 20–22 December 2021; pp. 81–88. [Google Scholar]
  33. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42. [Google Scholar] [CrossRef]
  34. Huo, Z.; Qiao, L. Research on Monocular Depth Estimation Algorithm Based on Structured Loss. J. Univ. Electron. Sci. Technol. China 2021, 50, 728–733. [Google Scholar]
  35. Jiang, L.; Fan, H.; Li, J. A multi-focus image fusion method based on attention mechanism and supervised learning. Appl. Intell. 2022, 52, 339–357. [Google Scholar] [CrossRef]
  36. Liu, J.; Liu, Z.; Wu, G.; Ma, L.; Liu, R.; Zhong, W.; Luo, Z.; Fan, X. Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8115–8124. [Google Scholar]
  37. Senushkin, D.; Patakin, N.; Kuznetsov, A.; Konushin, A. Independent component alignment for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 20083–20093. [Google Scholar]
  38. Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
  39. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 33, 2739–2756. [Google Scholar] [CrossRef]
  40. Menon, R.; Dengler, N.; Pan, S.; Chenchani, G.K.; Bennewitz, M. EvidMTL: Evidential Multi-Task Learning for Uncertainty-Aware Semantic Surface Mapping from Monocular RGB Images. arXiv 2025, arXiv:2503.04441. [Google Scholar]
  41. Wu, Y.; Wang, Y.; Yang, H.; Zhang, P.; Wu, Y.; Wang, B. A Mutual Information Constrained Multi-Task Learning Method for Very High-Resolution Building Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 9230–9243. [Google Scholar] [CrossRef]
  42. Zhang, Z.; Luo, P.; Loy, C.C.; Tang, X. Facial landmark detection by deep multi-task learning. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 94–108. [Google Scholar]
  43. Wang, Z.; Tsvetkov, Y.; Firat, O.; Cao, Y. Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. arXiv 2020, arXiv:2010.05874. [Google Scholar] [CrossRef]
  44. Yang, R.; Xu, H.; Wu, Y.; Wang, X. Multi-task reinforcement learning with soft modularization. Adv. Neural Inf. Process. Syst. 2020, 33, 4767–4777. [Google Scholar]
  45. Maninis, K.K.; Radosavovic, I.; Kokkinos, I. Attentive single-tasking of multiple tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1851–1860. [Google Scholar]
  46. Crawshaw, M. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv 2020, arXiv:2009.09796. [Google Scholar] [CrossRef]
  47. Bairaktari, K.; Blanc, G.; Tan, L.Y.; Ullman, J.; Zakynthinou, L. Multitask Learning via Shared Features: Algorithms and Hardness. In Proceedings of the Thirty Sixth Conference on Learning Theory, Bangalore, India, 12–15 July 2023; pp. 747–772. [Google Scholar]
  48. Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 2017, 5, 30–43. [Google Scholar] [CrossRef]
  49. Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491. [Google Scholar]
  50. Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 794–803. [Google Scholar]
  51. Liu, L.; Li, Y.; Kuang, Z.; Xue, J.; Chen, Y.; Yang, W.; Liao, Q.; Zhang, W. Towards impartial multi-task learning. In Proceedings of the ICLR, Vienna, Austria, 4 May 2021. [Google Scholar]
  52. Du, Y.; Czarnecki, W.M.; Jayakumar, S.M.; Farajtabar, M.; Pascanu, R.; Lakshminarayanan, B. Adapting auxiliary losses using gradient similarity. arXiv 2018, arXiv:1812.02224. [Google Scholar]
  53. Panageas, I.; Piliouras, G.; Wang, X. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. arXiv 2019, arXiv:1906.07772. [Google Scholar]
  54. Liu, B.; Liu, X.; Jin, X.; Stone, P.; Liu, Q. Conflict-averse gradient descent for multi-task learning. Adv. Neural Inf. Process. Syst. 2021, 34, 18878–18890. [Google Scholar]
  55. Bragman, F.J.; Tanno, R.; Ourselin, S.; Alexander, D.C.; Cardoso, J. Stochastic filter groups for multi-task cnns: Learning specialist and generalist convolution kernels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1385–1394. [Google Scholar]
  56. Ahn, C.; Kim, E.; Oh, S. Deep elastic networks with model selection for multi-task learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6529–6538. [Google Scholar]
  57. Rosenbaum, C.; Klinger, T.; Riemer, M. Routing networks: Adaptive selection of non-linear functions for multi-task learning. arXiv 2017, arXiv:1711.01239. [Google Scholar]
  58. Yu, J.; Dai, Y.; Liu, X.; Huang, J.; Shen, Y.; Zhang, K.; Zhou, R.; Adhikarla, E.; Ye, W.; Liu, Y.; et al. Unleashing the power of multi-task learning: A comprehensive survey spanning traditional, deep, and pretrained foundation model eras. arXiv 2024, arXiv:2404.18961. [Google Scholar] [CrossRef]
  59. Hotegni, S.S.; Berkemeier, M.; Peitz, S. Multi-objective optimization for sparse deep multi-task learning. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–9. [Google Scholar]
  60. Garber, D.; Kretzu, B. Projection-free online convex optimization with time-varying constraints. arXiv 2024, arXiv:2402.08799. [Google Scholar]
  61. Nash, J. Two-person cooperative games. Econom. J. Econom. Soc. 1953, 21, 128–140. [Google Scholar] [CrossRef]
  62. Navon, A.; Shamsian, A.; Achituve, I.; Maron, H.; Kawaguchi, K.; Chechik, G.; Fetaya, E. Multi-task learning as a bargaining game. arXiv 2022, arXiv:2202.01017. [Google Scholar] [CrossRef]
  63. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
  64. Toet, A. The TNO multiband image data collection. Data Brief 2017, 15, 249–251. [Google Scholar] [CrossRef]
  65. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  66. Zhong, H.; Tang, F.; Chen, Z.; Chang, H.J.; Gao, Y. AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–25 October 2025; pp. 10645–10655. [Google Scholar]
  67. He, D.; Li, W.; Wang, G.; Huang, Y.; Liu, S. MMIF-INet: Multimodal medical image fusion by invertible network. Inf. Fusion 2025, 114, 102666. [Google Scholar] [CrossRef]
  68. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
  69. Ultralytics. Ultralytics/yolov5: v3.1—Bug Fixes and Performance Improvements. Zenodo 2020. [CrossRef]
  70. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  71. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  72. Shan, L.; Wang, W. Mbnet: A multi-resolution branch network for semantic segmentation of ultra-high resolution images. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore, 22–27 May 2022; pp. 2589–2593. [Google Scholar]
  73. Yuan, M.; Wei, X. C2former: Calibrated and complementary transformer for rgb-infrared object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403712. [Google Scholar] [CrossRef]
  74. Wu, Y.; Guan, X.; Zhao, B.; Ni, L.; Huang, M. Vehicle detection based on adaptive multimodal feature fusion and cross-modal vehicle index using RGB-T images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 8166–8177. [Google Scholar] [CrossRef]
  75. Ouyang, J.; Wang, Q.; Liu, J.; Qu, X.; Song, J.; Shen, T. Multi-modal and cross-scale feature fusion network for vehicle detection with transformers. In Proceedings of the 2023 International Conference on Machine Vision, Image Processing and Imaging Technology (MVIPIT), Hangzhou, China, 26–28 July 2023; pp. 175–180. [Google Scholar]
  76. Ouyang, J.; Jin, P.; Wang, Q. Multimodal feature-guided pre-training for RGB-T perception. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 16041–16050. [Google Scholar] [CrossRef]
  77. Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Detfusion: A detection-driven infrared and visible image fusion network. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 4003–4011. [Google Scholar]
  78. Li, J.; Chen, J.; Liu, J.; Ma, H. Learning a graph neural network with cross modality interaction for image fusion. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4471–4479. [Google Scholar]
  79. Tang, L.; Deng, Y.; Ma, Y.; Huang, J.; Ma, J. SuperFusion: A versatile image registration and fusion network with semantic awareness. IEEE/CAA J. Autom. Sin. 2022, 9, 2121–2137. [Google Scholar] [CrossRef]
  80. Li, K.; Wang, D.; Hu, Z.; Li, S.; Ni, W.; Zhao, L.; Wang, Q. Fd2-net: Frequency-driven feature decomposition network for infrared-visible object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 4797–4805. [Google Scholar]
  81. Dong, W.; Zhu, H.; Lin, S.; Luo, X.; Shen, Y.; Liu, X.; Zhang, J.; Guo, G.; Zhang, B. Fusion-mamba for cross-modality object detection. arXiv 2024, arXiv:2404.09146. [Google Scholar] [CrossRef]
  82. Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient Surgery for Multi-Task Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 5824–5836. [Google Scholar]
Figure 1. Comparison of (d) UniFusOD with existing image fusion and detection paradigms: (a) Decoupled stages, (b) Coupled two-stage, (c) Multi-stage, and (d) End-to-end. The figure shows the evolution from separated optimization (a,b) to multi-stage (c) and our unified end-to-end framework (d), addressing key challenges through joint fusion-detection optimization.
Figure 2. An overview of the proposed UniFusOD framework. The Backbone extracts and fuses multi-level features from infrared and visible images. The Fine-grained Region Attention (FRA) module learns and distinguishes the importance of different feature regions, guiding the model to focus on key features. The model, together with the task-specific heads, is end-to-end synchronized and optimized using the UnityGrad method.
Figure 3. Detailed view of the fine-grained region attention. This is a schematic diagram showing an example where k = 2 in φ k , d k ( . ) .
Figure 4. The structure diagram of Detection and Fusion Heads.
Figure 5. Qualitative fusion results. The first row shows examples from the M3FD dataset (Visible, Infrared, and Fused images), and the second row shows samples from the DroneVehicle dataset. Red boxes highlight infrared-dominant regions preserved in the fusion result, while blue boxes indicate enhanced semantic and detail representations from the visible spectrum.
Figure 6. Qualitative detection results. The first two columns are samples from the DroneVehicle dataset, and the last four columns are from the M3FD dataset. The first row shows ground truth annotations, and the second row shows predictions from UniFusOD. The results highlight the model’s robustness in detecting rotated and scale-varying targets under complex conditions.
Figure 7. Visualization of region attention maps generated by different φ k ( . ) in the FRA module. The aggregated attention highlights the complete object region.
Figure 8. Gradient conflict visualization in joint fusion-detection optimization: (a) Without UnityGrad, showing the gradients of both fusion (red) and detection (blue) tasks across iterations, with visible misalignment; (b) With UnityGrad, demonstrating improved gradient alignment between the fusion (red) and detection (blue) tasks, resulting in better convergence. The x-axis represents the iterations, while the y-axis shows the gradient values.
Table 1. Fusion Results on the TNO Dataset. Best Results are shown in red and the second results are shown in blue.
Model       EN     SD      MI     VIF    SSIM
DIDFuse     6.97   45.12   1.70   0.60   0.81
U2Fusion    6.83   34.55   1.37   0.58   0.99
SDNet       6.64   32.66   1.52   0.56   1.00
RFN-Nest    6.83   34.50   1.20   0.51   0.92
TarDAL      6.84   45.63   1.86   0.53   0.88
DenseFuse   6.95   38.41   1.78   0.60   0.96
MMIF-INet   6.88   39.27   1.69   0.56   0.83
FusionGAN   7.10   44.85   1.78   0.57   0.88
AMDANet     7.37   39.52   1.82   0.70   0.95
CDDFuse     7.12   46.00   2.19   0.77   1.03
UniFusOD    7.44   41.28   2.00   0.79   1.07
Table 2. Fusion Results on the Roadscene Dataset with Best Results shown in red and second results shown in blue.
Model       EN     SD      MI     VIF    SSIM
DIDFuse     7.43   51.58   2.11   0.58   0.86
U2Fusion    7.09   38.12   1.87   0.60   0.97
SDNet       7.14   40.20   2.21   0.60   0.99
RFN-Nest    7.21   41.25   1.68   0.54   0.90
TarDAL      7.17   47.44   2.14   0.54   0.88
DenseFuse   7.23   44.44   2.25   0.63   0.89
MMIF-INet   7.24   49.75   2.05   0.61   0.78
FusionGAN   7.36   52.54   2.18   0.59   0.88
AMDANet     7.43   53.77   1.92   0.73   0.81
CDDFuse     7.44   54.67   2.30   0.69   0.98
UniFusOD    7.47   59.48   1.96   0.84   0.90
Table 3. Fusion Results on the M3FD Dataset with Best Results in red and second results shown in blue.
Model       EN     SD      MI     SSIM   VIF
DIDFuse     5.97   41.78   1.37   0.81   0.54
U2Fusion    5.62   36.51   1.20   0.99   0.50
SDNet       6.21   34.22   1.24   1.00   0.61
RFN-Nest    6.01   37.59   1.01   0.92   0.51
TarDAL      5.84   40.18   1.37   0.88   0.59
DenseFuse   6.44   36.46   1.23   0.96   0.57
MMIF-INet   5.74   40.67   1.18   0.96   0.55
FusionGAN   6.30   39.83   1.16   0.88   0.53
AMDANet     6.51   38.27   1.31   0.97   0.72
CDDFuse     5.77   39.74   1.33   0.91   0.69
UniFusOD    6.60   42.52   1.40   0.85   0.68
Table 4. Object detection results on the DroneVehicle dataset. The table shows the performance of various methods using different modalities: visible images, infrared (IR) images, and their fusion (visible + IR). The best results in each category are highlighted in red and the second results are highlighted in blue.
Methods                    Modality       Car    Truck   Freight-Car   Bus    Van    mAP
Faster R-CNN [70]          Visible        79.0   49.0    37.2          77.0   37.0   55.9
RoITransformer [71]        Visible        61.6   55.1    42.3          85.5   44.8   61.6
YOLOv5s [69]               Visible        78.6   55.3    43.8          87.1   46.0   62.1
Faster R-CNN               IR             89.4   53.5    48.3          87.0   42.6   64.2
RoITransformer             IR             90.1   60.4    58.9          89.7   52.2   70.3
YOLOv5s                    IR             90.0   59.5    60.8          89.5   53.8   70.7
Halfway Fusion [3]         Visible + IR   90.1   62.3    58.5          89.1   49.8   70.0
UA-CMDet [63]              Visible + IR   88.6   73.1    57.0          88.5   54.1   70.0
MBNet [72]                 Visible + IR   90.1   64.4    62.4          88.8   53.6   71.9
TSFADet [26]               Visible + IR   89.9   67.9    63.7          89.8   54.0   73.1
C2Former [73]              Visible + IR   90.2   78.3    64.4          89.8   58.5   74.2
AFFCM [74]                 Visible + IR   90.2   73.4    64.9          89.9   64.9   76.6
MC-DETR [75]               Visible + IR   94.8   76.7    60.4          91.1   61.4   76.9
M2FP [76]                  Visible + IR   95.7   76.2    64.7          92.1   64.7   78.7
UniFusOD (Oriented RCNN)   Visible + IR   96.4   81.3    63.5          90.8   65.6   79.5
Table 5. Object detection results on the M3FD dataset. The best results in each category are highlighted in red and the second results are highlighted in blue.
Methods             Detector   mAP50   mAP    People   Bus    Car    Motorcycle   Lamp   Truck
DIDFuse [7]         YOLOv5     78.9    52.6   79.6     79.6   92.5   68.7         84.7   68.7
SDNet [2]           YOLOv5     79.0    52.9   79.4     81.4   92.3   67.4         84.1   69.3
RFNet [13]          YOLOv5     79.4    53.2   79.4     78.2   91.1   72.8         85.0   69.0
TarDAL [22]         YOLOv5     80.5    54.1   81.5     81.3   94.8   69.3         87.1   68.7
DetFusion [77]      YOLOv5     80.8    53.8   80.8     83.0   92.5   69.4         87.8   71.4
CDDFuse [65]        YOLOv5     81.1    54.3   81.6     82.6   92.5   71.6         86.9   71.5
IGNet [78]          YOLOv5     81.5    54.5   81.6     82.4   92.8   73.0         86.9   72.1
SuperFusion [79]    YOLOv7     83.5    56.0   83.7     93.2   91.0   77.4         70.0   85.8
Fd2-Net [80]        YOLOv5     83.5    55.7   82.7     82.7   93.6   78.1         87.8   73.7
Fusion-Mamba [81]   YOLOv5     85.0    57.5   80.3     92.8   91.9   73.0         84.8   87.1
UniFusOD            YOLOv5     86.9    59.4   83.4     94.1   95.0   77.8         88.6   82.4
Table 6. Ablation study results on the M3FD dataset, with the best results shown in bold. “✓” denotes that the module is enabled.
Baseline   FRA   UnityGrad   EN     SD      MI     VIF    SSIM   mAP50   mAP50:95
✓                            5.21   37.68   1.21   0.52   0.76   83.3    57.1
✓          ✓                 6.44   37.55   1.15   0.64   0.68   86.0    58.7
✓          ✓     ✓           6.60   42.52   1.40   0.68   0.85   86.9    59.4
Table 7. Ablation study of the number of convolutional operators φ k (i.e., K). The table shows the performance of the model when using different numbers of kernel types (e.g., 3 × 3, 5 × 5, 7 × 7, 11 × 11). The best results are shown in bold.
Number of φ k (K)                    EN     SD      MI     VIF    mAP50   mAP50:95
0 (no region attention)              6.44   37.55   1.15   0.64   86.0    58.7
1 (3 × 3)                            6.42   37.86   1.38   0.62   86.2    59.0
2 (3 × 3, 5 × 5)                     6.41   40.25   1.38   0.68   86.7    59.1
3 (3 × 3, 5 × 5, 7 × 7)              6.60   42.52   1.40   0.68   86.9    59.4
4 (3 × 3, 5 × 5, 7 × 7, 11 × 11)     6.52   39.88   1.37   0.66   86.2    58.9
Table 8. Ablation study of the number of region attention maps M. The table shows the impact of different values of M on the model’s performance, with the best performing metrics highlighted in bold.
M    EN     SD      MI     VIF    SSIM   mAP50   mAP50:95
2    6.30   37.00   1.10   0.62   0.68   85.8    58.5
4    6.35   37.30   1.30   0.61   0.72   86.0    58.7
6    6.38   39.00   1.35   0.65   0.75   86.5    59.0
8    6.60   42.52   1.40   0.68   0.85   86.9    59.4
10   6.40   39.50   1.35   0.64   0.73   86.0    58.8
Table 9. Performance Comparison of Different Multi-task Learning Methods on M3FD for Image Fusion and Object Detection.
Method      EN     SD      MI     VIF    SSIM   mAP50   mAP50:95
GradNorm    6.21   38.60   1.29   0.57   0.70   86.1    58.7
PCGrad      6.51   39.92   1.35   0.63   0.79   86.1    59.0
CAGrad      6.47   40.77   1.36   0.66   0.84   86.3    58.8
UnityGrad   6.60   42.52   1.40   0.68   0.85   86.9    59.4