Article

SEND: Semantic-Aware Deep Unfolded Network with Diffusion Prior for Multi-Modal Image Fusion and Object Detection

1 College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
2 School of Naval Architecture and Civil Engineering, Jiangsu University of Science and Technology, Zhangjiagang 212003, China
3 College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(16), 2584; https://doi.org/10.3390/math13162584
Submission received: 11 July 2025 / Revised: 2 August 2025 / Accepted: 7 August 2025 / Published: 12 August 2025

Abstract

Multi-modality image fusion (MIF) aims to integrate complementary information from diverse imaging modalities into a single comprehensive representation and serves as an essential processing step for downstream high-level computer vision tasks. Existing deep unfolding-based methods demonstrate promising results; however, they often rely on deterministic priors with limited generalization ability and are usually decoupled from the training of object detection. In this paper, we propose the Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND), a novel framework designed for transparent and effective multi-modality fusion and object detection. SEND consists of a Denoising Prior Guided Fusion Module and a Fusion Object Detection Module. Instead of a traditional deterministic prior, the Denoising Prior Guided Fusion Module combines a diffusion prior with deep unfolding, leading to improved multi-modal fusion performance and generalization ability. It is built on a model-based optimization formulation for multi-modal image fusion, which is unfolded into two cascaded blocks: a Diffusion Denoising Fusion Block that generates informative diffusion priors and a Data Consistency Enhancement Block that explicitly aggregates complementary features from both the diffusion priors and the input modalities. Additionally, SEND couples the Fusion Object Detection Module with the Denoising Prior Guided Fusion Module and optimizes them for the object detection task using a carefully designed two-stage training strategy. Experiments demonstrate that the proposed SEND method outperforms state-of-the-art methods, achieving superior fusion quality with improved efficiency and interpretability.

1. Introduction

Multi-modality images refer to sets of images that capture the same objects using different imaging sensors; they arise in many real-world applications, including object detection [1,2,3], medical analysis [4,5], remote sensing [6,7,8,9], and autonomous vehicles [10]. Because each sensor relies on distinct physical principles, each image modality provides unique and complementary information about the target scene. Therefore, effectively combining the information from different modalities is highly beneficial for analyzing an object of interest in many high-level computer vision tasks.
Multi-modality image fusion (MIF) [11,12,13] aims to integrate the complementary information from heterogeneous image modalities into a single and unified representation. The essence of MIF lies in synergistically identifying and combining the unique information from different image modalities to ensure discriminative feature extraction and information completeness. For instance, infrared and visible image fusion (IVF) aims to optimally integrate the thermal radiation patterns from infrared images with photometric texture details from visible images. It is essential to effectively fuse the structural information from an infrared image and the textural detail information from a visible image.
The existing MIF methods can be broadly categorized into three main groups: model-based approaches [14,15,16,17], learning-based approaches [13,18,19], and deep unfolding-based methods [20,21,22,23]. The model-based methods [14,15,16,17] explicitly exploit prior knowledge of the image modalities, such as sparsity and low rank, within a mathematical optimization model to combine complementary information from different modalities. These methods are typically highly interpretable and do not rely on large training datasets, making them suitable for scenarios with limited data or strict computational constraints. On the contrary, the learning-based methods [13,18,19] mainly focus on constructing effective image fusion deep neural networks and leverage training datasets to optimize the model parameters. The deep unfolding-based methods [20,21,22,23] aim to combine the merits of both approaches by converting the iterative algorithm that solves the model-based optimization problem into a deep network, and they typically enjoy interpretable architectures with effective fusion performance. However, existing deep unfolding-based image fusion methods mainly utilize a deterministic prior, which limits their generalization ability to out-of-domain data distributions.
In this paper, we propose the Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND) for image fusion and object detection. The proposed SEND method leverages a generative diffusion model [24,25] to strengthen the generalization ability of deep unfolding networks for joint image fusion and object detection. It consists of a Denoising Prior Guided Fusion Module (DPFM) and a Fusion Object Detection Module (FODM). Specifically, we first formulate the multi-modality image fusion task as an optimization problem and then transform it into an iterative algorithm in which the prior sub-problem can be treated as a denoising problem. We then draw connections between this denoising problem and the denoising diffusion model and construct the solution to the prior sub-problem as a diffusion step. The experimental results show that the proposed SEND method outperforms the comparison methods with respect to the metrics used to evaluate both image fusion and object detection.
The contribution of this work can be summarized as follows:
  • We propose a novel Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND) method that synergistically combines a generative diffusion model with a deep unfolding network for joint multi-modality image fusion and object detection.
  • The proposed SEND consists of a Denoising Prior Guided Fusion Module (DPFM) and a Fusion Object Detection Module (FODM). We propose a two-stage training strategy to effectively train these two main modules.
  • The experimental results show that the proposed SEND method outperforms the comparison methods in both image fusion and object detection, and that the two tasks are mutually beneficial.
The rest of the paper is organized as follows: Section 2 briefly reviews the related works on multi-modal image fusion. Section 3 introduces the proposed Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND) method. Section 4 presents the experimental settings, comparison results, and ablation studies, and finally, Section 5 draws our conclusions.

2. Related Works

Multi-modality image fusion (MIF) aims to integrate complementary information from different imaging modalities. The MIF methods can be generally categorized into three groups: model-based methods, learning-based methods, and deep unfolding-based methods.
Model-based methods. Model-based multi-modality image fusion methods [17,26,27] typically formulate the image fusion task as a mathematical model and employ iterative optimization with handcrafted fusion rules to solve it. For instance, Ma et al. [28] proposed combining a visual saliency map with weighted least-squares optimization to enhance the contrast of the fused images. Li et al. [15] proposed a low-rank representation framework for latent and salient feature extraction, enabling flexible processing of multi-scale image data. Li et al. [29] proposed a norm optimization strategy that integrates the Split Bregman method with structural similarity (SSIM) to improve detail retention. Furthermore, Ma et al. [16] proposed a fusion algorithm based on gradient transfer to preserve source information. While these methods ensure mathematical interpretability, they are limited by their reliance on handcrafted rules, which restricts generalization across diverse fusion scenarios, and by their sensitivity to parameter tuning during optimization.
Learning-based methods. Recent learning-based multi-modality image fusion methods mainly exploit the strong learning capability of deep neural networks to achieve effective image fusion. Xu et al. [30] proposed employing an end-to-end deep network to automatically assess the importance of source images and adaptively determine the degree of information retention. Wang et al. [31] introduced feature space alignment techniques to address the artifacts that arise from misaligned image pairs in infrared and visible image fusion. Recently, deep generative models have also been applied to multi-modality image fusion. The existing methods [13,19,32,33] mainly use diffusion models to provide powerful priors for generating fused images. Zhao et al. [19] introduced diffusion models to infrared and visible image fusion through an unfolded expectation-maximization optimization framework, but the diffusion models there learn only from the source modalities and can therefore provide ill-suited priors. Yi et al. [13] incorporated fusion knowledge priors into a trainable end-to-end diffusion model paradigm. Yang et al. [33] combined a pixel-space auto-encoder with a diffusion model to handle the priors in latent space. However, these heuristically designed architectures still exhibit the same opaque workflow as previously observed [34], which significantly hinders the diffusion models from providing accurate diffusion priors and results in suboptimal fusion results.
Deep unfolding-based methods. Deep unfolding methods [23,35,36,37,38,39] construct a highly effective deep network by unrolling a model-inspired iterative algorithm into a multi-layer deep network in which each layer corresponds to one iteration step. Deep unfolding methods thereby combine the advantages of model-based and CNN-based methods by integrating a mathematical fusion model with a CNN. Li et al. [23] developed a model formulation with low-rank and sparse representation to facilitate an efficient optimization process via a CNN. Zhao et al. [38] transformed a convolutional sparse coding model and an iterative shrinkage-thresholding algorithm into hidden layer units of a neural network for interpretable fusion. He et al. [39] proposed a deep network architecture with an unfolded heterogeneous image fusion model to enhance fusion performance in degradation scenes. However, existing methods learn a deterministic prior from the data, which makes it difficult to accurately fit the distribution of the modalities and leads to inadequate feature extraction.

3. Proposed Method

In this section, we first provide an overview of the proposed Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND) method, then present the design details of the two key modules, namely the Denoising Prior Guided Fusion Module (DPFM) and the Fusion Object Detection Module (FODM), and finally introduce the loss functions and training strategy.

3.1. Overview

An overview of the proposed SEND method is illustrated in Figure 1. The proposed SEND method consists of two main modules: a Denoising Prior Guided Fusion Module (DPFM) and a Fusion Object Detection Module (FODM). The DPFM performs effective multi-modal image fusion by leveraging the iterative diffusion denoising prior, while the FODM performs multi-modal object detection. Specifically, the DPFM is designed through a deep unfolding strategy. It contains a Diffusion Denoising Fusion Block (DDFB) and a Data Consistency Enhancement Block (DCEB), which impose a diffusion prior and consistent data regularization, respectively. During training, the FODM takes the intermediate fused estimate $\hat{\mathbf{X}}_{0,t}$ produced by the DPFM at the sampled step $t$ as input and evaluates its loss functions. During inference, the DPFM performs image fusion in an iterative manner, and the FODM performs object detection on the final fused image.

3.2. Denoising Prior Guided Fusion Module

The Denoising Prior Guided Fusion Module (DPFM) aims to perform effective multi-modal image fusion by combining the merits of both the model-based and the learning-based approaches. We first mathematically formulate the multi-modal image fusion problem, then derive an iterative algorithm, and finally design a diffusion prior-guided deep network module.

3.2.1. Model Formulation

Multi-modality image fusion aims to fuse the complementary information from different imaging modalities into a single representation $\mathbf{X}$. Let us assume that the image modality $\mathbf{U} \in \mathbb{R}^{C \times W \times H}$ contains clear structural information of the target, while the image modality $\mathbf{V} \in \mathbb{R}^{C \times W \times H}$ contains rich textural information of the scene. Therefore, the optimization problem for multi-modal image fusion can be formulated as
$$\min_{\mathbf{X}} \; \frac{1}{2} \|\mathbf{X} - \mathbf{U}\|_2^2 + \lambda \|\mathbf{K}(\mathbf{X} - \mathbf{V})\|_1, \qquad (1)$$
where $\mathbf{K} \in \mathbb{R}^{n \times k \times k}$ represents $n$ sparse filters of spatial size $k \times k$ that extract textural information, and $\|\cdot\|_1$ denotes the $\ell_1$ norm, which serves as a sparsity-promoting prior term.
The above optimization problem seeks a fused image $\mathbf{X}$ that retains the structural information of $\mathbf{U}$, by minimizing the $\ell_2$ distance between $\mathbf{X}$ and $\mathbf{U}$, while preserving the discriminative textural information of $\mathbf{V}$, by minimizing the sparsity-promoting prior term on the features extracted with the filters $\mathbf{K}$.
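For illustration, the objective in Equation (1) can be evaluated with a few lines of PyTorch, assuming the filters $\mathbf{K}$ are realized as a bank of convolution kernels; all tensor shapes and the random filter initialization below are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fusion_objective(X, U, V, K_weight, lam=0.1):
    """Evaluate Eq. (1): 0.5*||X - U||_2^2 + lam*||K(X - V)||_1.

    X, U, V: tensors of shape (B, C, H, W); K_weight: (n, C, k, k) analysis
    filters applied via convolution (an assumption of this sketch).
    """
    data_term = 0.5 * (X - U).pow(2).sum()
    # K(X - V): convolve the residual with the n filters to extract texture.
    texture = F.conv2d(X - V, K_weight, padding=K_weight.shape[-1] // 2)
    prior_term = lam * texture.abs().sum()
    return data_term + prior_term

# Illustrative usage with random tensors.
B, C, H, W, n, k = 1, 1, 64, 64, 8, 3
U, V = torch.rand(B, C, H, W), torch.rand(B, C, H, W)
X = U.clone().requires_grad_(True)        # initialize the fused image from U
K_weight = torch.randn(n, C, k, k) * 0.1  # stand-in for the learned filters K
loss = fusion_objective(X, U, V, K_weight)
loss.backward()                           # gradients w.r.t. the fused image X
```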

3.2.2. Optimization Algorithm

Solving Equation (1) directly is infeasible; instead, we rewrite it in a consensus form by introducing an auxiliary variable $\mathbf{Y} = \mathbf{K}(\mathbf{X} - \mathbf{V})$:
$$\min_{\mathbf{X}, \mathbf{Y}} \; \frac{1}{2} \|\mathbf{X} - \mathbf{U}\|_2^2 + \lambda \|\mathbf{Y}\|_1, \quad \text{s.t.} \;\; \mathbf{Y} = \mathbf{K}(\mathbf{X} - \mathbf{V}). \qquad (2)$$
Therefore, the optimization problem expressed in Equation (2) can be iteratively solved through two simpler sub-problems. At the $t$-th iteration, the optimization algorithm sequentially performs
$$\mathbf{Y}^{t} = \arg\min_{\mathbf{Y}} \; \frac{1}{2(\lambda/\tau)^{2}} \|\mathbf{Y} - \mathbf{K}(\mathbf{X}^{t} - \mathbf{V})\|_2^2 + \|\mathbf{Y}\|_1, \qquad \mathbf{X}^{t-1} = \arg\min_{\mathbf{X}} \; \|\mathbf{X} - \mathbf{U}\|_2^2 + \tau \|\mathbf{Y}^{t} - \mathbf{K}(\mathbf{X} - \mathbf{V})\|_2^2, \qquad (3)$$
where τ is a penalty parameter.
Denoising Prior Sub-problem: The first sub-problem, related to the auxiliary variable $\mathbf{Y}$, can be regarded as a Basis Pursuit De-Noising (BPDN) problem [40]. In traditional deep unfolding methods [35,41,42,43], this denoising sub-problem is usually solved through the proximal operator corresponding to the prior term, for example, a soft-thresholding function for the $\ell_1$ norm term. The proximal operator is then typically approximated using a deterministic deep network, which learns a fixed mapping from the input state of the signal to the desired signal.
Here, we draw connections between this denoising sub-problem and the denoising diffusion model [44]. Diffusion models are a class of generative models that create high-quality and diverse new data (such as images, audio, or text) by learning to reverse a gradual noising process; they have driven breakthroughs in image generation, audio synthesis, and beyond. The denoising diffusion framework enables sampling a noisy signal at an arbitrary time step $t$ from the clean sample in closed form, with a scaling factor $\sqrt{\bar{\alpha}_t}$ applied to the clean signal and a variance of $1-\bar{\alpha}_t$ for the added noise, which is drawn from a standard Gaussian distribution. Therefore, the diffusion prior module learns the reverse mapping from the noisy version of the signal to the clean signal, conditioned on the sampling step.
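For concreteness, a minimal sketch of this closed-form noising step and the corresponding clean-signal estimate is given below; `alpha_bar_t` and the shapes used are generic placeholders rather than the paper's specific schedule or network.

```python
import torch

def forward_noising(Y0, alpha_bar_t):
    """Closed-form forward step: Y_t = sqrt(a_bar)*Y_0 + sqrt(1 - a_bar)*eps."""
    eps = torch.randn_like(Y0)                       # standard Gaussian noise
    Yt = alpha_bar_t.sqrt() * Y0 + (1.0 - alpha_bar_t).sqrt() * eps
    return Yt, eps

def clean_estimate(Yt, eps_hat, alpha_bar_t):
    """Invert the forward step given a predicted noise eps_hat (Tweedie-style)."""
    return (Yt - (1.0 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()

# Illustrative usage with a random "clean" feature map and a perfect noise predictor.
Y0 = torch.rand(1, 8, 64, 64)
a_bar = torch.tensor(0.7)
Yt, eps = forward_noising(Y0, a_bar)
Y0_rec = clean_estimate(Yt, eps, a_bar)              # recovers Y0 up to float error
assert torch.allclose(Y0_rec, Y0, atol=1e-5)
```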
Then, this denoising sub-problem can be solved as
$$\mathbf{Y}^{t} = \frac{1}{\sqrt{\bar{\alpha}_t}} \Big( \mathbf{K}(\mathbf{X}^{t} - \mathbf{V}) - \sqrt{1-\bar{\alpha}_t}\, \mathrm{Prox}_{\theta}\big(\mathbf{K}(\mathbf{X}^{t} - \mathbf{V})\big) \Big), \qquad (4)$$
where the noise level $\bar{\alpha}_t$ is determined by the ratio $\lambda/\tau$, and $\mathrm{Prox}_{\theta}(\cdot)$ is a proximal denoising operator that predicts the added noise, obtained through Tweedie's formula [45].
Data consistency sub-problem: The sub-problem corresponding to the variable $\mathbf{X}$ has two $\ell_2$ norm terms and admits a closed-form solution:
$$\mathbf{X}^{t-1} = \mathbf{D}\big(\mathbf{U} + \tau \mathbf{K}^{T}(\mathbf{Y}^{t} + \mathbf{K}\mathbf{V})\big), \qquad (5)$$
where $\mathbf{D} = (\mathbf{I} + \tau \mathbf{K}^{T}\mathbf{K})^{-1}$ and $\mathbf{I}$ is the identity matrix.
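The closed-form update in Equation (5) can be checked numerically; the sketch below solves the same normal equations with a few conjugate-gradient steps, treating $\mathbf{K}$ as a convolution and $\mathbf{K}^T$ as its transpose. The paper instead realizes $\mathbf{D}$, $\mathbf{K}$, and $\mathbf{K}^T$ through learned, hypernetwork-generated operators, so this is only an assumption-laden stand-in.

```python
import torch
import torch.nn.functional as F

def K_op(x, K_w):
    """Apply the analysis filters K as a convolution."""
    return F.conv2d(x, K_w, padding=K_w.shape[-1] // 2)

def KT_op(y, K_w):
    """Apply the adjoint K^T as a transposed convolution."""
    return F.conv_transpose2d(y, K_w, padding=K_w.shape[-1] // 2)

def data_consistency_update(Y, U, V, K_w, tau=1.0, cg_iters=10):
    """Solve (I + tau*K^T K) X = U + tau*K^T(Y + K V) with conjugate gradient."""
    A = lambda x: x + tau * KT_op(K_op(x, K_w), K_w)   # (I + tau*K^T K) x
    b = U + tau * KT_op(Y + K_op(V, K_w), K_w)
    x = U.clone()                                       # warm start from U
    r = b - A(x); p = r.clone(); rs = (r * r).sum()
    for _ in range(cg_iters):
        Ap = A(p)
        step = rs / (p * Ap).sum()
        x = x + step * p
        r = r - step * Ap
        rs_new = (r * r).sum()
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```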

3.2.3. Module Design

We have connected the iterative algorithm defined in Equation (3), the denoising diffusion process, and the data consistency enhancement process. Here, we provide the design details of the proposed Denoising Prior Guided Fusion Module. Based on the solutions to the two sub-problems in Equations (4) and (5), the DPFM is designed to have a Diffusion Denoising Fusion Block (DDFB) and a Data Consistency Enhancement Block (DCEB). The process at the t-th time step can be expressed as a DDFB, a DCEB, and a DDIM step [46]:
$$\mathbf{F}_{0,t} = \mathrm{DDFB}_{t}(\mathbf{F}_{t}, \mathbf{V}), \qquad \mathbf{X}_{0,t} = \mathrm{DCEB}_{t}(\mathbf{F}_{0,t}, \mathbf{U}, \mathbf{V}), \qquad \mathbf{F}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \mathbf{K}(\mathbf{X}_{0,t} - \mathbf{V}) + \sqrt{1-\bar{\alpha}_{t-1}}\, \hat{\boldsymbol{\epsilon}}_{t}, \qquad (6)$$
where $\mathbf{F}_{t} = \mathbf{K}(\mathbf{X}_{t} - \mathbf{V})$, and $\mathrm{DDFB}_{t}(\cdot)$ and $\mathrm{DCEB}_{t}(\cdot)$ denote the Diffusion Denoising Fusion Block and the Data Consistency Enhancement Block at the $t$-th step, respectively. Moreover, $\hat{\boldsymbol{\epsilon}}_{t}$ denotes the predicted noise generated by $\mathrm{Prox}_{\theta}(\cdot)$.
Based on Equation (6), the overall structure of the DPFM is illustrated in Figure 1. $\mathrm{DDFB}_{t}(\cdot)$ is the noise-prediction network conditioned on the step number $t$, while $\mathrm{DCEB}_{t}(\cdot)$ implements the closed-form solution in Equation (5), with $\mathbf{D}$, $\mathbf{K}$, and $\mathbf{K}^{T}$ conditionally generated by a hypernetwork. The DPFM starts from a random noise image $\mathbf{X}_{T}$ and iteratively applies the DDFB, the DCEB, and the DDIM step from $t = T$ to $t = 0$ to gradually remove the noise and generate the fused image.
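To make the iteration in Equation (6) concrete, the following sketch strings the two blocks together in a DDIM-style loop; `ddfb`, `dceb`, and `K_op` are hypothetical callables standing in for the learned blocks, and `alpha_bars` is assumed to be a 1-D tensor of cumulative noise-schedule products.

```python
import torch

def dpfm_sampling(U, V, K_op, ddfb, dceb, alpha_bars, T):
    """Iterate DDFB -> DCEB -> DDIM step from t=T down to t=1 (sketch of Eq. (6)).

    ddfb(F_t, V, t) is assumed to return (F_0_t, eps_hat); dceb(F_0_t, U, V, t)
    returns X_0_t; both are placeholders, not the released implementation.
    """
    X_T = torch.randn_like(U)                 # start from a random noise image
    F_t = K_op(X_T - V)
    X_0 = None
    for t in range(T, 0, -1):
        F_0_t, eps_hat = ddfb(F_t, V, t)      # diffusion denoising fusion block
        X_0 = dceb(F_0_t, U, V, t)            # data consistency enhancement block
        a_prev = alpha_bars[t - 1]            # cumulative product at step t-1
        # Deterministic DDIM step back to the (t-1)-th noise level.
        F_t = a_prev.sqrt() * K_op(X_0 - V) + (1 - a_prev).sqrt() * eps_hat
    return X_0                                # final fused image
```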

3.3. Fusion Object Detection Module

The objective of multi-modality image fusion is to serve downstream computer vision tasks, for example, object detection. Up to now, we have constructed a Denoising Prior Guided Fusion Module, which iteratively performs diffusion denoising prior fusion and enforces data consistency on the fused image.
In this section, we propose a Fusion Object Detection Module (FODM), which is optimized together with the DPFM so that image fusion and object detection are connected. The FODM takes the fused image as input for object detection. The backbone network used for object detection is based on YOLOv5 [47]. The FODM applies the object detection network to a fused image and produces the bounding boxes of objects. The corresponding loss functions of the DPFM and FODM evaluate the performance of the two tasks, and backpropagation is then applied to update their parameters. During inference, the FODM only takes the final fused image $\mathbf{X}_{0}$ as input and predicts the bounding boxes of the targets.
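As an illustration of how a fused image can be passed to a YOLOv5-style detector, the snippet below loads a generic pretrained YOLOv5 model through torch.hub; the FODM in this paper is trained jointly with the DPFM rather than used off-the-shelf, and the image path is a placeholder.

```python
import torch

# Illustrative only: a generic pretrained YOLOv5 model via torch.hub
# (requires internet access and the ultralytics/yolov5 dependencies).
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

fused = "fused_example.png"        # hypothetical path to a fused image X_0
results = detector(fused)          # forward pass produces bounding boxes
results.print()                    # summary of detected classes and confidences
boxes = results.xyxy[0]            # (N, 6): x1, y1, x2, y2, confidence, class
```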

3.4. Loss Functions and Training Strategy

In this section, we introduce the loss functions for the Denoising Prior Guided Fusion Module (DPFM) and the Fusion Object Detection Module (FODM), as well as the training strategy.

3.4.1. Loss Functions for DPFM

The training loss for the DPFM contains three terms: a reconstruction loss $\mathcal{L}_{r}$, a pixel-wise constraint $\mathcal{L}_{X_0}$ with a fusion knowledge prior, and a noise constraint $\mathcal{L}_{\epsilon}$ for the denoising process, expressed as
$$\mathcal{L}_{\mathrm{DPFM}} = \mathcal{L}_{r} + \gamma_{\epsilon} \mathcal{L}_{\epsilon} + \gamma_{x} \mathcal{L}_{X_0}, \qquad (7)$$
where $\mathcal{L}_{\epsilon} = \|\boldsymbol{\epsilon} - \hat{\boldsymbol{\epsilon}}_{t}\|_1$, $\mathcal{L}_{X_0} = \|\hat{\mathbf{X}}_{0,t} - \mathbf{X}^{ts}\|_1$ with $\mathbf{X}^{ts}$ the fusion-knowledge-prior target obtained by targeted search [13], and $\gamma_{\epsilon}$ and $\gamma_{x}$ denote the regularization parameters. The reconstruction loss is a weighted sum of an intensity loss $\mathcal{L}_{i}$, a gradient loss $\mathcal{L}_{g}$, an SSIM loss $\mathcal{L}_{s}$, and a feature loss $\mathcal{L}_{f}$:
$$\mathcal{L}_{r} = \gamma_{i} \mathcal{L}_{i} + \gamma_{g} \mathcal{L}_{g} + \gamma_{s} \mathcal{L}_{s} + \gamma_{f} \mathcal{L}_{f}, \qquad (8)$$
where $\mathcal{L}_{i} = \|\hat{\mathbf{X}}_{0,t} - \max(\mathbf{U}, \mathbf{V})\|_2^2$, $\mathcal{L}_{g} = \big\| |\nabla \hat{\mathbf{X}}_{0,t}| - \max(|\nabla \mathbf{U}|, |\nabla \mathbf{V}|) \big\|_1$, $\mathcal{L}_{s} = \|1 - \mathrm{SSIM}(\hat{\mathbf{X}}_{0,t}, \mathbf{U})\|_1 + \|1 - \mathrm{SSIM}(\hat{\mathbf{X}}_{0,t}, \mathbf{V})\|_1$, and $\mathcal{L}_{f} = \|\hat{\mathbf{F}}_{0,t} - \mathbf{K}(\mathbf{X}^{ts} - \mathbf{V})\|_1$. Here, $\nabla$ denotes the gradient operator, $\mathrm{SSIM}(\cdot)$ represents the structural similarity operation, and $\gamma_{i}$, $\gamma_{g}$, $\gamma_{s}$, and $\gamma_{f}$ are penalty parameters.
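A minimal PyTorch sketch of the intensity and gradient terms is given below for single-channel inputs; the Sobel-based gradient, and the omission of the SSIM and feature terms, are simplifying assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def gradient_magnitude(img):
    """Approximate |grad(img)| with Sobel filters for (B, 1, H, W) tensors."""
    gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    gy = gx.transpose(-1, -2)
    dx = F.conv2d(img, gx, padding=1)
    dy = F.conv2d(img, gy, padding=1)
    return dx.abs() + dy.abs()

def reconstruction_terms(X0_hat, U, V):
    """Intensity (L_i) and gradient (L_g) terms; SSIM and feature losses omitted."""
    loss_i = F.mse_loss(X0_hat, torch.max(U, V))                       # L_i
    loss_g = F.l1_loss(gradient_magnitude(X0_hat),
                       torch.max(gradient_magnitude(U),
                                 gradient_magnitude(V)))               # L_g
    return loss_i, loss_g
```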

3.4.2. Loss Functions for FODM

We follow the training loss of YOLOv5 [47] for the FODM. It contains a bounding box loss $\mathcal{L}_{\mathrm{box}}$, a confidence loss $\mathcal{L}_{\mathrm{conf}}$, and a classification loss $\mathcal{L}_{\mathrm{cls}}$:
$$\mathcal{L}_{\mathrm{FODM}} = \lambda_{b} \mathcal{L}_{\mathrm{box}} + \lambda_{\mathrm{conf}} \mathcal{L}_{\mathrm{conf}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{cls}}, \qquad (9)$$
where $\lambda_{b}$, $\lambda_{\mathrm{conf}}$, and $\lambda_{\mathrm{cls}}$ are the regularization parameters.
For brevity, we omit the detailed definitions of these loss terms; they can be found in [47].

3.4.3. Training Strategy

We propose a two-stage training strategy for effectively training the proposed SEND network. In the first stage, the Denoising Prior Guided Fusion Module is warmed up with only $\mathcal{L}_{\mathrm{DPFM}}$ so that the DPFM can produce multi-modality image fusion results of sufficient quality to support the training of the subsequent Fusion Object Detection Module. In the second stage, both loss functions $\mathcal{L}_{\mathrm{DPFM}}$ and $\mathcal{L}_{\mathrm{FODM}}$ are adopted for training the whole SEND network. Algorithms 1 and 2 summarize the training and sampling procedures of the proposed method, respectively.
Algorithm 1 Training strategy of SEND.
Input: Denoising network $\mathrm{Prox}_{\theta}$; filters $\{\mathbf{D}, \mathbf{K}, \mathbf{K}^{T}\}$; Fusion Object Detection network $\mathrm{FOD}_{\theta}$; source images $\{\mathbf{U}, \mathbf{V}\}$; learnable hyperparameter $\tau$; noise schedule $\bar{\alpha}_{t}$.
1: Initialize $\mathbf{X}^{ts}$ by targeted search [13];
2: repeat
3:   $t \sim \mathrm{Uniform}\{1, \dots, T\}$;
4:   $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$;
5:   $\mathbf{Y}_{0} = \mathbf{K}(\mathbf{X}^{ts} - \mathbf{V})$;
6:   $\mathbf{Y}_{t} = \sqrt{\bar{\alpha}_{t}}\, \mathbf{Y}_{0} + \sqrt{1-\bar{\alpha}_{t}}\, \boldsymbol{\epsilon}$;
7:   $\hat{\boldsymbol{\epsilon}}_{t} \leftarrow \mathrm{Prox}_{\theta}(\mathbf{Y}_{t}, \mathbf{K}\mathbf{V}, t)$;
8:   $\hat{\mathbf{Y}}_{0,t} = \frac{1}{\sqrt{\bar{\alpha}_{t}}} \big( \mathbf{Y}_{t} - \sqrt{1-\bar{\alpha}_{t}}\, \hat{\boldsymbol{\epsilon}}_{t} \big)$;
9:   $\hat{\mathbf{X}}_{0,t} = \mathbf{D}\big( \mathbf{U} + \tau \mathbf{K}^{T}(\hat{\mathbf{Y}}_{0,t} + \mathbf{K}\mathbf{V}) \big)$;
10:  if current iteration < warm-up steps then
11:    $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{DPFM}}$;
12:  else
13:    $\mathrm{Result}_{\mathrm{Det}} = \mathrm{FOD}_{\theta}(\hat{\mathbf{X}}_{0,t})$;
14:    $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{DPFM}} + \mathcal{L}_{\mathrm{FODM}}$;
15:  end if
16:  Perform a gradient descent step on $\nabla_{\theta} \mathcal{L}$;
17: until converged
Algorithm 2 Sampling strategy of SEND.
Input: Denoising network $\mathrm{Prox}_{\theta}$; filters $\{\mathbf{D}, \mathbf{K}, \mathbf{K}^{T}\}$; Fusion Object Detection network $\mathrm{FOD}_{\theta}$; source images $\{\mathbf{U}, \mathbf{V}\}$; learnable hyperparameter $\tau$; noise schedule $\{\bar{\alpha}_{t}, \sigma_{\eta_{t}}\}$.
1: Initialize $\mathbf{X}_{T} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\mathbf{Y}_{T} = \mathbf{K}(\mathbf{X}_{T} - \mathbf{V})$;
2: for $t = T, T-1, \dots, 1$ do
3:   $\hat{\boldsymbol{\epsilon}}_{t} \leftarrow \mathrm{Prox}_{\theta}(\mathbf{Y}_{t}, \mathbf{K}\mathbf{V}, t)$;
4:   $\hat{\mathbf{Y}}_{0,t} = \frac{1}{\sqrt{\bar{\alpha}_{t}}} \big( \mathbf{Y}_{t} - \sqrt{1-\bar{\alpha}_{t}}\, \hat{\boldsymbol{\epsilon}}_{t} \big)$;
5:   $\hat{\mathbf{X}}_{0,t} = \mathbf{D}\big( \mathbf{U} + \tau \mathbf{K}^{T}(\hat{\mathbf{Y}}_{0,t} + \mathbf{K}\mathbf{V}) \big)$;
6:   $\tilde{\mathbf{Y}}_{0,t} = \mathbf{K}(\hat{\mathbf{X}}_{0,t} - \mathbf{V})$;
7:   $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$;
8:   $\mathbf{Y}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \tilde{\mathbf{Y}}_{0,t} + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_{\eta_{t}}^{2}}\, \hat{\boldsymbol{\epsilon}}_{t} + \sigma_{\eta_{t}} \boldsymbol{\epsilon}$;
9: end for
10: $\mathrm{Result}_{\mathrm{Det}} = \mathrm{FOD}_{\theta}(\mathbf{X}_{0})$;
Return: fused image $\mathbf{X}_{0}$ and detection result $\mathrm{Result}_{\mathrm{Det}}$.
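The two-stage schedule of Algorithm 1 can be summarized in a short training-loop skeleton; the `model(U, V)` and `model.detect(...)` interfaces below are hypothetical conveniences wrapping the DPFM and FODM losses, not the authors' code.

```python
import torch

def train_send(model, loader, optimizer, total_iters, warmup_iters):
    """Two-stage schedule: DPFM warm-up first, then joint DPFM + FODM training.

    `model(U, V)` is assumed to return (X0_hat, dpfm_loss); `model.detect`
    runs the FODM and returns its detection loss (placeholder interfaces).
    """
    it = 0
    while it < total_iters:
        for U, V, targets in loader:
            X0_hat, loss = model(U, V)                 # L_DPFM (Stage 1)
            if it >= warmup_iters:                     # Stage 2: add detection loss
                loss = loss + model.detect(X0_hat, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= total_iters:
                break
```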

4. Experimental Results

In this section, we first present the experimental settings in terms of datasets, evaluation metrics, comparison methods, and implementation details. We then show the evaluation results of all comparison methods for both multi-modal image fusion and object detection.

4.1. Settings

Datasets. Three datasets were used in the experiments: MSRS [48], LLVIP [49], and M3FD [50]. The M3FD dataset [50] contains 4200 pairs of calibrated infrared and visible images covering a variety of scenes and pixel variations, with particular emphasis on the wide range of the two modalities, and it is designed to support object detection tasks by fusing infrared and visible images to improve detection accuracy and visual quality.
For the multi-modal image fusion task, the proposed method is trained on the MSRS training dataset with 1083 image pairs. The test sets contain 361 image pairs from the MSRS test dataset and 3463 image pairs from LLVIP.
For the multi-modal object detection task, the LLVIP and M3FD datasets are split into training, validation, and test sets with an 8:1:1 ratio to compare the performance of the different methods.
Evaluation metrics. In the following, we introduce the evaluation metrics for the multi-modal image fusion task and the multi-modal object detection task.
Following the settings of prior research, infrared and visible image fusion performance was evaluated using five widely adopted metrics: Entropy (EN), which quantifies the preservation of information content in the fused output; Average Gradient (AG), which assesses image clarity and edge sharpness by measuring spatial texture variation; Standard Deviation (SD), which reflects the richness of details by evaluating pixel intensity dispersion; Sum of the Correlations of Differences (SCD), which measures the consistency between the source images and the fused result; and Visual Information Fidelity (VIF), which measures perceptual quality by comparing visual information retention against a reference.
Multi-modal object detection performance is evaluated using Precision, Recall, mAP50, and mAP50-95. Precision reflects the accuracy of the model’s predictions, Recall represents the model’s ability to find all positive instances, mAP50 is used to evaluate the mean average precision when the IoU threshold is 0.5, and mAP50-95 is the average mAP across IoU thresholds ranging from 0.5 to 0.95.
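As a reference point, minimal NumPy sketches of three of the fusion metrics above are shown below; exact formulations of EN, AG, and SD vary slightly across papers, so these are common variants rather than the precise implementations used here.

```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy of the grey-level histogram (8-bit image assumed)."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def average_gradient(img):
    """AG: mean magnitude of horizontal/vertical intensity differences."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))

def standard_deviation(img):
    """SD: dispersion of pixel intensities around the mean."""
    return float(np.std(img.astype(np.float64)))
```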
Comparison methods. The comparison methods include FusionGAN [51], U2Fusion [30], UMFusion [31], EMFusion [52], CDDFuse [18], LRRNet [23], and Diff-IF [13]. Since these works do not jointly optimize an object detector, the object detection network is trained on their image fusion results afterwards.
Implementation details. All experiments were performed on a computer with a single NVIDIA GeForce RTX 4090 GPU. During training, we adopted the Adam optimizer with a learning rate of $1 \times 10^{-4}$, a training patch size of $128 \times 128$, and a batch size of 8. The number of training iterations was 80K. We set the hyperparameters as $\gamma_{\epsilon} = 8$, $\gamma_{x} = 1$, $\gamma_{i} = 1$, $\gamma_{g} = 2$, $\gamma_{s} = 1$, and $\gamma_{f} = 1$. During inference, we set the total number of DDIM sampling steps to 3. The training time of the proposed model was around 7.5 h, and the memory requirement is around 16.9 GB.
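The reported optimizer and data-loading configuration could be wired up as follows; `model` and `train_dataset` are placeholders, and the 128 × 128 patch cropping is assumed to live inside the dataset transform.

```python
import torch

def build_training_setup(model, train_dataset):
    """Reproduce the stated configuration: Adam, lr 1e-4, batch size 8.

    `model` and `train_dataset` are hypothetical objects; this is not the
    authors' released training script.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loader = torch.utils.data.DataLoader(train_dataset, batch_size=8,
                                         shuffle=True, num_workers=4)
    total_iters = 80_000          # number of training iterations reported
    return optimizer, loader, total_iters
```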

4.2. Evaluation Results

In this section, we evaluate the multi-modal image fusion and object detection performance of the proposed SEND method and other multi-modality image fusion methods both numerically and visually.
Quantitative comparisons. We first illustrate the numerical comparison results for both multi-modal image fusion and object detection. Table 1 shows the quantitative results of different methods on infrared and visible image fusion tasks evaluated on the MSRS and LLVIP datasets.
From Table 1, we can see that the proposed SEND method achieves the best multi-modal image fusion results in terms of EN, SD, and SCD on the MSRS dataset, and the best results in terms of EN, AG, and SD on the LLVIP dataset. This suggests that the proposed SEND method can effectively fuse complementary information from different image modalities. Moreover, the quantitative results of different methods for the person detection task evaluated on the LLVIP dataset are illustrated in Table 2, and the results for the multi-object detection task evaluated on the M3FD dataset, covering person, car, bus, motorcycle, lamp, and truck, are illustrated in Table 3. We can also see from Table 2 that the proposed SEND method achieves the best object detection results in terms of Precision, Recall, and mAP50, and the second-best result in terms of mAP50-95.
Compared to object detection performance without image fusion, the proposed SEND method achieves improvements of around 0.15, 0.07, 0.08, and 0.07 in Precision, Recall, mAP50, and mAP50-95, respectively. Moreover, the proposed method exhibits highly competitive performance on multi-object detection, with three best and one second-best results, as shown in Table 3. These results show that the proposed SEND method can effectively fuse essential information from the two image modalities, leading to superior object detection results.
Qualitative comparisons. In Figure 2, we visualize the input visible and infrared images, as well as the image fusion results of the different comparison methods, including FusionGAN [51], U2Fusion [30], EMFusion [52], CDDFuse [18], LRRNet [23], Diff-IF [13], and the proposed SEND method. We can see that the proposed SEND method achieves the best image fusion quality: the details of the target from the visible modality are effectively fused with the structure of the target from the infrared modality. In particular, because the Denoising Prior Guided Fusion Module and the Fusion Object Detection Module are trained jointly, the multi-modality image fusion results benefit from the training of the object detection module, and the details and structure of the targets are more notable. In Figure 3, we further visualize the object detection results of the different methods. We can see that the proposed SEND method successfully and accurately detected all objects, while the results of the other methods contain missing or false detections.
Model efficiency analysis. Table 4 compares the number of model parameters, FLOPs, and FPS of the different methods. The diffusion-based methods, including Diff-IF and SEND, have certain drawbacks in parameter count and computational cost compared to the other approaches, which often stem from their relatively complex denoising network architectures and multi-step iterative mechanisms. However, the proposed SEND method improves in this regard, reducing the number of model parameters and lowering the computational cost by up to 20.403 GFLOPs. Meanwhile, SEND achieves superior detection performance with nearly comparable real-time efficiency.
Failure cases analysis. In the failure cases shown in Figure 4, we can observe that the probability of false detection is relatively high, with car windows or tires being easily mistakenly identified as pedestrians. Meanwhile, the issue of duplicate detection also arises in Figure 4a,b. Even though our proposed SEND has achieved the best performance among comparative methods, we still find that false detection and duplicate detection remain significant issues, restricting further performance improvement. As for the reasons, it may be attributed to the ambiguity of some target features in complex scenarios. For instance, the texture or contour of car windows and tires can sometimes be mistakenly matched with the partial features of pedestrians, leading to false detections. Additionally, the instability of feature extraction in edge cases, such as when targets are partially occluded, might cause duplicate detections, as the model fails to effectively distinguish between overlapping or similar feature regions.
For future improvements, we will conduct targeted optimizations from multiple dimensions to reduce the current issues of false detection and duplicate detection. On the one hand, we plan to augment the training data, expand the training dataset to include more complex scenarios, and add samples where easily confused targets coexist. On the other hand, we will also improve the training strategies by adjusting the confidence calibration strategy for detection boxes to reduce duplicate annotations caused by feature similarity. Furthermore, we intend to introduce a more refined feature discrimination mechanism, emphasize the further alignment between fused features and downstream recognition features, design an inter-modality key target emphasis mechanism, and strengthen the ability to distinguish features of recognized objects.

5. Conclusions

In this paper, we propose a novel Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND) method for both multi-modal image fusion and object detection. It consists of two main modules: a Denoising Prior Guided Fusion Module (DPFM) and a Fusion Object Detection Module (FODM). The DPFM is designed based on the deep unfolding principle and a generative diffusion model; it iteratively performs diffusion denoising and imposes data consistency, leading to highly effective performance. The FODM operates on the fused image and performs object detection for the multi-modal data. We propose a two-stage training strategy for training the proposed SEND method. The experimental results show that the proposed SEND method outperforms the comparison methods on both image fusion and object detection.
In the future, it would be interesting to investigate reducing the number of sampling steps in the denoising prior-guided fusion module to further improve the inference efficiency of the model. It would also be interesting to perform a joint optimization of the whole model for multiple tasks, including detection and segmentation.

Author Contributions

Conceptualization, R.Z., M.-Y.X., and J.-J.H.; methodology, R.Z., M.-Y.X., and J.-J.H.; software, R.Z., M.-Y.X., and J.-J.H.; validation, R.Z., M.-Y.X., and J.-J.H.; formal analysis, R.Z., M.-Y.X., and J.-J.H.; investigation, R.Z. and J.-J.H.; resources, R.Z. and J.-J.H.; data curation, M.-Y.X. and J.-J.H.; writing—original draft preparation, R.Z. and J.-J.H.; writing—review and editing, R.Z. and J.-J.H.; visualization, R.Z., M.-Y.X., and J.-J.H.; supervision, J.-J.H.; project administration, J.-J.H.; funding acquisition, J.-J.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by NUDT Innovation Science Foundation grant number 23-ZZCXKXKY-07 and NUDT Research Project grant number ZK22-56.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

Rong Zhang acknowledges the National Key Laboratory of Information System Engineering, National University of Defense Technology, for funding her doctoral studies. Maoyi Xiong would like to acknowledge the NUDT Innovation Science Foundation grant number 23-ZZCXKXKY-07 and NUDT Research Project grant number ZK22-56.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wan, B.; Zhou, X.; Sun, Y.; Wang, T.; Lv, C.; Wang, S.; Yin, H.; Yan, C. MFFNet: Multi-modal feature fusion network for VDT salient object detection. IEEE Trans. Multimed. 2023, 26, 2069–2081. [Google Scholar] [CrossRef]
  2. Gao, W.; Liao, G.; Ma, S.; Li, G.; Liang, Y.; Lin, W. Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 2091–2106. [Google Scholar] [CrossRef]
  3. Liu, T.; Luo, W.; Ma, L.; Huang, J.J.; Stathaki, T.; Dai, T. Coupled network for robust pedestrian detection with gated multi-layer feature extraction and deformable occlusion handling. IEEE Trans. Image Process. 2020, 30, 754–766. [Google Scholar] [CrossRef]
  4. James, A.P.; Dasarathy, B.V. Medical image fusion: A survey of the state of the art. Inf. Fusion 2014, 19, 4–19. [Google Scholar] [CrossRef]
  5. Liu, T.; Meng, Q.; Huang, J.J.; Vlontzos, A.; Rueckert, D.; Kainz, B. Video summarization through reinforcement learning with a 3D spatio-temporal u-net. IEEE Trans. Image Process. 2022, 31, 1573–1586. [Google Scholar] [CrossRef]
  6. Ghassemian, H. A review of remote sensing image fusion methods. Inf. Fusion 2016, 32, 75–89. [Google Scholar] [CrossRef]
  7. Pradhan, P.K.; Das, A.; Kumar, A.; Baruah, U.; Sen, B.; Ghosal, P. SwinSight: A hierarchical vision transformer using shifted windows to leverage aerial image classification. Multimed. Tools Appl. 2024, 83, 86457–86478. [Google Scholar] [CrossRef]
  8. Pradhan, P.K.; Purkayastha, K.; Sharma, A.L.; Baruah, U.; Sen, B.; Ghosal, P. Graphically Residual Attentive Network for tackling aerial image occlusion. Comput. Electr. Eng. 2025, 125, 110429. [Google Scholar] [CrossRef]
  9. Huang, J.J.; Wang, Z.; Liu, T.; Luo, W.; Chen, Z.; Zhao, W.; Wang, M. DeMPAA: Deployable multi-mini-patch adversarial attack for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–13. [Google Scholar] [CrossRef]
  10. Peng, Y.; Qin, Y.; Tang, X.; Zhang, Z.; Deng, L. Survey on image and point-cloud fusion-based object detection in autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22772–22789. [Google Scholar] [CrossRef]
  11. Han, J.; Pauwels, E.J.; De Zeeuw, P. Fast saliency-aware multi-modality image fusion. Neurocomputing 2013, 111, 70–80. [Google Scholar] [CrossRef]
  12. Zhang, Q.; Liu, Y.; Blum, R.S.; Han, J.; Tao, D. Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: A review. Inf. Fusion 2018, 40, 57–75. [Google Scholar] [CrossRef]
  13. Yi, X.; Tang, L.; Zhang, H.; Xu, H.; Ma, J. Diff-IF: Multi-modality image fusion via diffusion model with fusion knowledge prior. Inf. Fusion 2024, 110, 102450. [Google Scholar] [CrossRef]
  14. Ma, J.; Zhou, Y. Infrared and visible image fusion via gradientlet filter. Comput. Vis. Image Underst. 2020, 197, 103016. [Google Scholar] [CrossRef]
  15. Li, H.; Wu, X.J.; Kittler, J. MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Trans. Image Process. 2020, 29, 4733–4746. [Google Scholar] [CrossRef]
  16. Ma, J.; Chen, C.; Li, C.; Huang, J. Infrared and visible image fusion via gradient transfer and total variation minimization. Inf. Fusion 2016, 31, 100–109. [Google Scholar] [CrossRef]
  17. Li, S.; Kang, X.; Hu, J. Image fusion with guided filtering. IEEE Trans. Image Process. 2013, 22, 2864–2875. [Google Scholar] [CrossRef]
  18. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  19. Zhao, Z.; Bai, H.; Zhu, Y.; Zhang, J.; Xu, S.; Zhang, Y.; Zhang, K.; Meng, D.; Timofte, R.; Van Gool, L. DDFM: Denoising diffusion model for multi-modality image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 8082–8093. [Google Scholar]
  20. Deng, X.; Dragotti, P.L. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3333–3348. [Google Scholar] [CrossRef]
  21. Zhao, Z.; Xu, S.; Zhang, J.; Liang, C.; Zhang, C.; Liu, J. Efficient and model-based infrared and visible image fusion via algorithm unrolling. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1186–1196. [Google Scholar] [CrossRef]
  22. Deng, X.; Xu, J.; Gao, F.; Sun, X.; Xu, M. DeepM2CDL: Deep Multi-Scale Multi-Modal Convolutional Dictionary Learning Network. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 2770–2787. [Google Scholar] [CrossRef]
  23. Li, H.; Xu, T.; Wu, X.J.; Lu, J.; Kittler, J. LRRNet: A novel representation learning guided fusion network for infrared and visible images. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11040–11052. [Google Scholar] [CrossRef]
  24. Cao, H.; Tan, C.; Gao, Z.; Xu, Y.; Chen, G.; Heng, P.A.; Li, S.Z. A survey on generative diffusion models. IEEE Trans. Knowl. Data Eng. 2024, 36, 2814–2830. [Google Scholar] [CrossRef]
  25. Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef] [PubMed]
  26. Li, Z.; Zheng, J.; Zhu, Z.; Yao, W.; Wu, S. Weighted guided image filtering. IEEE Trans. Image Process. 2014, 24, 120–129. [Google Scholar] [CrossRef]
  27. Zhang, X.; Ma, Y.; Fan, F.; Zhang, Y.; Huang, J. Infrared and visible image fusion via saliency analysis and local edge-preserving multi-scale decomposition. JOSA A 2017, 34, 1400–1410. [Google Scholar] [CrossRef] [PubMed]
  28. Ma, J.; Zhou, Z.; Wang, B.; Zong, H. Infrared and visible image fusion based on visual saliency map and weighted least square optimization. Infrared Phys. Technol. 2017, 82, 8–17. [Google Scholar] [CrossRef]
  29. Li, G.; Lin, Y.; Qu, X. An infrared and visible image fusion method based on multi-scale transformation and norm optimization. Inf. Fusion 2021, 71, 109–129. [Google Scholar] [CrossRef]
  30. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
  31. Wang, D.; Liu, J.; Fan, X.; Liu, R. Unsupervised Misaligned Infrared and Visible Image Fusion via Cross-Modality Image Generation and Registration. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 3508–3515. [Google Scholar] [CrossRef]
  32. Yue, J.; Fang, L.; Xia, S.; Deng, Y.; Ma, J. Dif-fusion: Towards high color fidelity in infrared and visible image fusion with diffusion models. IEEE Trans. Image Process. 2023, 32, 5705–5720. [Google Scholar] [CrossRef]
  33. Yang, B.; Jiang, Z.; Pan, D.; Yu, H.; Gui, G.; Gui, W. LFDT-Fusion: A latent feature-guided diffusion Transformer model for general image fusion. Inf. Fusion 2025, 113, 102639. [Google Scholar] [CrossRef]
  34. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR) 2018, 51, 1–42. [Google Scholar] [CrossRef]
  35. Gregor, K.; LeCun, Y. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 399–406. [Google Scholar]
  36. Sun, J.; Li, H.; Xu, Z. Deep ADMM-Net for compressive sensing MRI. Adv. Neural Inf. Process. Syst. 2016, 29, 10–18. Available online: https://proceedings.neurips.cc/paper_files/paper/2016/file/1679091c5a880faf6fb5e6087eb1b2dc-Paper.pdf (accessed on 6 August 2025).
  37. Sreter, H.; Giryes, R. Learned convolutional sparse coding. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2191–2195. [Google Scholar]
  38. Zhao, Z.; Zhang, J.; Bai, H.; Wang, Y.; Cui, Y.; Deng, L.; Sun, K.; Zhang, C.; Liu, J.; Xu, S. Deep convolutional sparse coding networks for interpretable image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2369–2377. [Google Scholar]
  39. He, C.; Li, K.; Xu, G.; Zhang, Y.; Hu, R.; Guo, Z.; Li, X. Degradation-resistant unfolding network for heterogeneous image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 12611–12621. [Google Scholar]
  40. Donoho, D.L. De-noising by soft-thresholding. IEEE Trans. Inf. Theory 1995, 41, 613–627. [Google Scholar] [CrossRef]
  41. Huang, J.J.; Liu, T.; Xia, J.; Wang, M.; Dragotti, P.L. DURRNet: Deep unfolded single image reflection removal network with joint prior. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 5235–5239. [Google Scholar]
  42. Pu, W.; Huang, J.J.; Sober, B.; Daly, N.; Higgitt, C.; Daubechies, I.; Dragotti, P.L.; Rodrigues, M.R. Mixed X-ray image separation for artworks with concealed designs. IEEE Trans. Image Process. 2022, 31, 4458–4473. [Google Scholar] [CrossRef]
  43. Huang, J.J.; Liu, T.; Chen, Z.; Liu, X.; Wang, M.; Dragotti, P.L. A Lightweight Deep Exclusion Unfolding Network for Single Image Reflection Removal. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 4957–4973. [Google Scholar] [CrossRef]
  44. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  45. Efron, B. Tweedie’s formula and selection bias. J. Am. Stat. Assoc. 2011, 106, 1602–1614. [Google Scholar] [CrossRef] [PubMed]
  46. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  47. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L.; et al. Ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements; Zenodo: Geneva, Switzerland, 2020. [Google Scholar]
  48. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion 2022, 83, 79–92. [Google Scholar] [CrossRef]
  49. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 3496–3504. [Google Scholar]
  50. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  51. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  52. Xu, H.; Ma, J. EMFusion: An unsupervised enhanced medical image fusion network. Inf. Fusion 2021, 76, 177–186. [Google Scholar] [CrossRef]
Figure 1. An overview of the proposed Semantic-Aware Deep Unfolded Network with Diffusion Prior (SEND) method. It consists of a Denoising Prior Guided Fusion Module (DPFM) and a Fusion Object Detection Module (FODM) for image fusion and object detection, respectively.
Figure 2. Qualitative fusion performance comparison of “220160.jpg” from the LLVIP dataset. (a) Infrared. (b) Visible. (c) FusionGAN [51]. (d) U2Fusion [30]. (e) EMFusion [52]. (f) CDDFuse [18]. (g) LRRNet [23]. (h) Diff-IF [13]. (i) SEND (Ours).
Figure 3. Qualitative detection performance comparison of “210396.jpg” from the LLVIP dataset. (The ground truth and the detection results are marked with red and green boxes, respectively). (a) Infrared. (b) Visible. (c) FusionGAN [51]. (d) U2Fusion [30]. (e) EMFusion [52]. (f) CDDFuse [18]. (g) LRRNet [23]. (h) Diff-IF [13]. (i) SEND (Ours).
Figure 4. Failure detection results from the LLVIP dataset (the ground truth and the detection results are marked with red and green boxes, respectively). (a) 190203.jpg. (b) 190263.jpg. (c) 260052.jpg. (d) 260224.jpg.
Table 1. Quantitative results of different methods on the infrared and visible image fusion task evaluated on the MSRS and LLVIP datasets (the first-, second-, and third-best results are in bold, underline, and italics, respectively).
Methods          | MSRS (361 Pairs)                 | LLVIP (3463 Pairs)
                 | EN    AG    SD     SCD   VIF     | EN    AG    SD     SCD   VIF
FusionGAN [51]   | 5.44  1.45  17.08  0.98  0.44    | 6.55  2.93  28.01  0.74  0.37
U2Fusion [30]    | 4.95  2.10  18.89  1.01  0.48    | 6.84  4.99  38.55  1.35  0.55
EMFusion [52]    | 6.33  2.92  35.49  1.29  0.84    | 6.68  4.41  34.79  1.19  0.63
CDDFuse [18]     | 6.70  3.75  43.40  1.62  1.00    | 7.33  5.81  48.98  1.58  0.77
LRRNet [23]      | 6.19  2.65  31.79  0.79  0.54    | 6.39  3.81  28.52  0.84  0.45
Diff-IF [13]     | 6.67  3.71  42.63  1.62  1.04    | 7.40  5.85  49.99  1.48  0.84
SEND (Ours)      | 7.60  3.70  43.45  1.63  0.98    | 7.57  6.64  54.39  1.50  0.73
Table 2. Quantitative results of the different methods for the object detection task evaluated on the LLVIP test dataset (the first-, second-, and third-best results are in bold, underline, and italics, respectively).
LLVIP test dataset (train:validation:test = 2769:347:347)
Methods          | Precision | Recall | mAP50   | mAP50-95
Infrared         | 0.962     | 0.917  | 0.97015 | 0.64658
Visible          | 0.940     | 0.936  | 0.97080 | 0.64932
FusionGAN [51]   | 0.954     | 0.961  | 0.98445 | 0.73746
U2Fusion [30]    | 0.974     | 0.948  | 0.98713 | 0.73373
EMFusion [52]    | 0.969     | 0.956  | 0.98557 | 0.72883
CDDFuse [18]     | 0.972     | 0.957  | 0.98611 | 0.74150
LRRNet [23]      | 0.964     | 0.955  | 0.98186 | 0.71775
Diff-IF [13]     | 0.968     | 0.957  | 0.98602 | 0.75221
SEND (Ours)      | 0.977     | 0.962  | 0.98847 | 0.74538
Table 3. Quantitative results of the different methods for the multi-object detection task evaluated on the M3FD test dataset (the first-, second-, and third-best results are in bold, underline, and italics, respectively).
M3FD test dataset (train:validation:test = 3360:420:420)
Methods          | Precision | Recall | mAP50   | mAP50-95
Infrared         | 0.785     | 0.604  | 0.67158 | 0.43095
Visible          | 0.766     | 0.635  | 0.69091 | 0.43573
FusionGAN [51]   | 0.788     | 0.616  | 0.67873 | 0.43817
U2Fusion [30]    | 0.838     | 0.600  | 0.69014 | 0.44195
EMFusion [52]    | 0.810     | 0.601  | 0.68356 | 0.43657
CDDFuse [18]     | 0.781     | 0.605  | 0.67692 | 0.43675
LRRNet [23]      | 0.810     | 0.601  | 0.68356 | 0.43657
Diff-IF [13]     | 0.777     | 0.610  | 0.67828 | 0.43410
SEND (Ours)      | 0.835     | 0.707  | 0.77278 | 0.50065
Table 4. Comparison of model parameters, FLOPs, and FPS for each comparative fusion method with the YOLO detection model and our proposed SEND (the first-, second-, and third-best results are in bold, underline, and italics, respectively).
Test infrared and visible image size: (480, 640)
Methods          | Params. (M) | FLOPs (G) | FPS
FusionGAN [51]   | 3.521       | 7.634     | 1.215
U2Fusion [30]    | 3.251       | 410.004   | 1.730
EMFusion [52]    | 2.741       | 96.174    | 5.028
CDDFuse [18]     | 3.781       | 9.834     | 4.686
LRRNet [23]      | 2.641       | 5.834     | 6.996
Diff-IF [13]     | 26.304      | 48.364    | 6.047
SEND (Ours)      | 25.177      | 27.961    | 5.327