Article

Visual Attention Fusion Network (VAFNet): Bridging Bottom-Up and Top-Down Features in Infrared and Visible Image Fusion

1 School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China
2 College of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(7), 1104; https://doi.org/10.3390/sym17071104
Submission received: 25 May 2025 / Revised: 28 June 2025 / Accepted: 7 July 2025 / Published: 9 July 2025
(This article belongs to the Special Issue Symmetry in Next-Generation Intelligent Information Technologies)

Abstract

Infrared and visible image fusion aims to integrate useful information from the source images to obtain a fused image that not only has excellent visual perception but also promotes the performance of subsequent object detection tasks. However, due to the asymmetry between image fusion and object detection tasks, obtaining superior visual effects while facilitating object detection remains challenging in real-world applications. To address this issue, we propose a novel visual attention fusion network for infrared and visible image fusion (VAFNet), which bridges bottom-up and top-down features to achieve high-quality visual perception while improving the performance of object detection tasks. The core idea is that bottom-up visual attention is utilized to extract multi-layer bottom-up features to ensure superior visual perception, while top-down visual attention determines object attention signals related to the detection task. A bidirectional attention integration mechanism is then designed to naturally integrate the two forms of attention into the fused image. Experiments on public and collected datasets demonstrate that VAFNet not only outperforms seven state-of-the-art (SOTA) fusion methods in qualitative and quantitative evaluation but also has advantages in facilitating object detection tasks.

1. Introduction

Due to theoretical and technical limitations, single-modality images are unable to effectively and comprehensively describe imaging scenes [1,2]. Therefore, multi-modal image fusion technology has gradually become a research hotspot in the field of information fusion [3]. Infrared images can highlight useful information by capturing thermal radiation, whereas visible images can clearly outline useful features by capturing the light reflected from object surfaces. Infrared and visible image fusion technology can exploit the advantages of both to extract symmetrical useful information and obtain a fused image that stably describes the imaging scene, and it has become a widely used pre-processing module for high-level visual tasks such as object detection [4,5,6,7,8].
In recent years, scholars have proposed many infrared and visible image fusion methods, including traditional methods and deep learning-based methods. Traditional image fusion methods use hand-crafted features to obtain the fused image, such as multi-scale transformations, sparse representation, subspace and hybrid models. Deep learning-based image fusion methods, in contrast, can automatically extract useful features through network frameworks [9,10]. However, in order to achieve superior visual effects, these fusion methods mainly focus on the pixel relationships between image pairs, while object features are ignored. Unfortunately, for object detection tasks, the key to accurately locating regions of interest lies in obtaining rich object features. Therefore, due to the semantic asymmetry between image fusion and object detection, both traditional and deep learning-based fusion methods are unable to improve the performance of subsequent object detection tasks. Some image fusion methods use object detection networks as constraints to train the fusion network, forcing the fused image to contain object features [11,12]; unfortunately, the effectiveness of such cascaded optimization is limited [6]. Moreover, in pursuit of excellent visual perception, the computational complexity of existing fusion methods keeps increasing, which hinders their deployment as pre-processing modules for object detection because they cannot generate fused images in real time.
The human visual system has powerful feature extraction capabilities with extremely high computational efficiency [13,14,15,16]. We therefore turn to the visual system for inspiration. Numerous neuroscience studies have shown that the human visual system relies on both bottom-up and top-down visual attention. In the bottom-up process, novel or salient visual stimuli are attended automatically [17]. In contrast, the top-down process focuses attention on task-relevant stimuli [18]. Accordingly, bottom-up visual attention can ensure that the fused image preserves rich visual features, while top-down visual attention can be used to extract object features that are useful for the object detection task. Therefore, this paper studies a joint bottom-up and top-down visual attention fusion network to obtain a fused image with excellent visual perception and rich object features.
However, some works relied solely on either top-down or bottom-up visual attention to obtain fusion results [19,20,21], which neglects the cooperation between the two forms of attention and leads to limited performance. In order to integrate top-down and bottom-up visual attention, Fang et al. [22] used dense connections to combine top-down signals with the top layer of the bottom-up signals, and Lin et al. [23] directly added bottom-up and top-down attention signals to segment objects. However, although top-down signals preserve object features, their saliency is reduced as network depth increases and receptive fields remain narrow. Therefore, directly combining bottom-up and top-down signals relies heavily on the learning ability of the fusion network, which is typically proportional to the size of the training dataset. Unfortunately, obtaining a large amount of training data is very difficult, especially in the field of infrared and visible image fusion. In addition, these methods cannot learn the relevant relationships between top-down and bottom-up signals, which may limit their natural combination due to semantic differences.
To overcome the above problems, we propose a novel visual attention fusion network for infrared and visible image fusion, which achieves excellent visual perception while improving the performance of object detection tasks. Firstly, since bottom-up visual attention is automatically triggered by external novel or salient stimuli, we design a visual attention extraction module (VAEM) to obtain the multi-layer bottom-up features of infrared and visible images in the bottom-up process. These multi-layer bottom-up features integrate both intensity and orientation features, effectively enhancing the visual effect of the fused images. Secondly, in the top-down process, a cross-modal attention module (CMAM) is developed to extract significant features from the multi-layer bottom-up features, and the multi-layer significant features are then merged to obtain object attention signals through a cross-scale attention module (CSAM). Finally, in order to fully leverage the advantages of bottom-up and top-down visual attention, a bidirectional attention integration mechanism is proposed to naturally integrate the multi-layer bottom-up features and object attention signals; it includes an object attention enhancement module (OAEM) and an attention guidance module (AGM) to overcome the loss of object feature saliency and the semantic differences between bottom-up and top-down signals, respectively. In particular, the top-down process utilizes the multi-layer bottom-up features obtained from the bottom-up process to improve object feature extraction, while the bottom-up process further integrates the object attention signals obtained from the top-down process to obtain fine-grained visual features. Thus, in VAFNet, top-down and bottom-up visual attention work in a collaborative manner. The specific contributions of this study are as follows:
  • We propose a real-time visual attention image fusion network, which achieves superior performance in both visual perception and downstream object detection.
  • We design a bidirectional attention integration mechanism to integrate multi-layer bottom-up features and object attention signals, which can effectively enhance object attention signals and extract relevant features of two forms of attention.
  • A series of experiments on public and collected datasets show that the proposed method not only has better qualitative and quantitative evaluation results than seven SOTA methods but also can serve as a pre-processing module for the object detection task.
The remainder of this paper is organized as follows. In Section 2, we introduce the related works of existing image fusion methods and visual attention mechanisms. Section 3 describes our VAFNet in detail. Section 4 analyzes the experimental results of the proposed method, followed by Section 5, which presents the conclusions of this work.

2. Related Work

In this section, existing fusion methods and visual attention mechanisms are briefly described.

2.1. Image Fusion Methods

2.1.1. Traditional Image Fusion Methods

In recent years, due to strong interpretability and mature theoretical foundations, many traditional infrared and visible image fusion methods have been proposed. According to the adopted theory, these methods can be roughly classified into four categories: multi-scale transformation, sparse representation, subspace and hybrid models. Multi-scale transformation utilizes decomposition tools to extract high-frequency and low-frequency information from the source images and then reconstructs the fused image by inverse transformation; although this approach can preserve rich texture in the fused image, it struggles to extract prominent objects [24,25,26]. Sparse representation can use sparse bases from an overcomplete dictionary to extract prominent objects of the source image, but its ability to retain details is insufficient [27]. The subspace method maps the source image to a low-dimensional space to obtain key information; it therefore has good computational efficiency but is prone to losing background information [28]. Each of these theories has its own advantages, so hybrid models aim to improve the quality of the fused images by combining them [29]. However, complex hybrid models may not meet the requirements of real-time image fusion. In addition, traditional image fusion methods focus on improving the visual effect of fused images and fail to integrate object semantic information, which limits the contribution of fusion results to object detection tasks.

2.1.2. Deep Learning-Based Image Fusion Methods

In the past few years, since deep learning offers powerful feature extraction ability and stability, it has been successfully applied in the image fusion community. Yang et al. [30] proposed a texture conditional generative adversarial network to keep the gradient changes of the visible image in the fusion results. Rao et al. [31] combined a convolutional neural network and a transformer to extract local and global information to improve the resolution of fused images. Su et al. [32] designed an auto-encoder to obtain fused images with better visual effects. However, it has been proven that solely considering the visual effects of fused images can hardly improve the performance of object detection tasks [33,34]. In order to preserve object features in the fused image, Liu et al. [12] constructed a cascaded training framework for object detection and image fusion networks. Similarly, Tang et al. [11] used a high-level task to constrain the image fusion network to obtain a fused image with salient objects. Unfortunately, using the fusion output directly as a link between the image fusion module and the object detection network fails to effectively guide the fusion process, as it overlooks the pixel-level semantic cues provided by the detection network.

2.2. Visual Attention for Image Fusion

The human visual system exhibits remarkable capability in extracting critical features from complex scenes with high computational efficiency, which benefits many visual tasks, such as visual question answering and image classification [35,36,37]. It has been proven that visual attention can be divided into bottom-up and top-down processes [38]. On the one hand, bottom-up visual attention is driven by external visual stimuli, such as orientation and intensity. Wang et al. [39] utilized bottom-up visual attention to obtain visual information during feature extraction. Liu et al. [20] proposed an improved ITTI visual attention model to extract intensity and directional features of source images. These methods demonstrate that bottom-up attention can obtain fused images with excellent visual perception. On the other hand, top-down visual attention can modulate the visual system’s responsiveness to different stimuli based on prior knowledge and experience of perceptual objects. This mechanism enables selective enhancement of object-relevant information while suppressing task-irrelevant distractors in complex scenes [40]. Fang et al. [41] utilized top-down attention to detect salient objects. Wang et al. [42] enhanced object semantics by top-down attention. These works demonstrate that top-down visual attention can effectively capture an object’s semantic information.
Therefore, by integrating bottom-up and top-down visual attention, a fused image with excellent visual quality and rich object semantics can be obtained. However, early attempts at integrating bottom-up and top-down visual attention merely concatenated features along the channel dimension or performed direct element-wise addition [22,23], which placed excessive reliance on the fusion network’s learning capacity. Crucially, due to the inherent asymmetry between bottom-up and top-down features, it is essential to explicitly model their cross-modal correlations to achieve plausible integration.

3. Method

3.1. Overview

Figure 1 shows the network architecture of VAFNet, which has three main components: (a) the bottom-up process, (b) the top-down process and (c) the bidirectional attention integration mechanism. Firstly, in the bottom-up process, VAEM is utilized to extract the multi-layer bottom-up features of the infrared and visible images, respectively (see Section 3.2). Secondly, in the top-down process, the multi-layer bottom-up features of the infrared and visible images are fed into CMAM to obtain multi-scale significant features, and then CSAM is utilized to integrate the multi-scale significant features to obtain the object attention signals (see Section 3.3). Finally, in order to reconstruct the fused image, we use the bidirectional attention integration mechanism to combine the multi-layer bottom-up features and object attention signals (see Section 3.4).

3.2. Bottom-Up Process

The bottom-up process is shown in Figure 1a. Specifically, bottom-up attention is driven by external stimuli, which exhibit varying saliency levels across different spatial locations. Therefore, the VAEM is employed to extract multi-layer bottom-up features, ensuring that salient external stimuli are accurately preserved in the fused image. Moreover, since infrared and visible images possess distinct characteristics, the process of extracting multi-layer bottom-up features can be expressed as,
$$\phi_i^{vis}, \phi_i^{ir} = \begin{cases} VAEM\left(I^{vis}, I^{ir}\right), & \text{if } i = 1 \\ VAEM\left(\phi_{i-1}^{vis}, \phi_{i-1}^{ir}\right), & \text{if } i = 2, 3, 4 \end{cases} \quad (1)$$
where $\phi_i^{vis} \in \mathbb{R}^{H \times W \times C}$ and $\phi_i^{ir} \in \mathbb{R}^{H \times W \times C}$ represent the $i$-th layer bottom-up features of the visible image $I^{vis} \in \mathbb{R}^{H \times W \times 3}$ and the infrared image $I^{ir} \in \mathbb{R}^{H \times W \times 3}$, respectively. $VAEM(\cdot)$ represents the bottom-up feature extraction operation, and its framework is illustrated in Figure 2. Firstly, we employ a convolutional layer with $3 \times 3$ kernels and a stride of 1 to extract shallow features $\gamma_i \in \mathbb{R}^{H \times W \times C}$ from the input. After the convolutional layer, Sigmoid is used as the activation function. Then, as higher intensity information is more likely to attract attention, we utilize the Gaussian kernel function to extract intensity features, which can be expressed as,
$$\Gamma_i = Gaussian(w, \sigma) * \gamma_i \quad (2)$$
where $Gaussian(\cdot)$ represents the Gaussian kernel, $w$ is the Gaussian filtering window, and $\sigma$ is the Gaussian kernel parameter. $\Gamma_i \in \mathbb{R}^{H \times W \times C}$ denotes the intensity features, and $*$ denotes the convolution operation.
Moreover, regions with richer orientations in the source image tend to attract more attention, so the Gabor filter [43] is utilized to extract orientation features, which can be represented as,
$$O_i = \sum_{\theta \in \left\{0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}\right\}} Gabor(\theta) * \gamma_i \quad (3)$$
where $Gabor(\cdot)$ is the Gabor filter, $\theta$ is the filtering direction with $\theta \in \left\{0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}\right\}$, and $O_i \in \mathbb{R}^{H \times W \times C}$ denotes the orientation features.
Furthermore, based on the extracted intensity and orientation features, we propose a linear normalization operator for visual salient feature extraction. Specifically, this operator first sums the intensity and orientation features and then applies normalization to generate a visual saliency map, in which higher values indicate regions with greater attentional priority. Subsequently, the saliency map is multiplied by $\gamma_i$ to obtain the visual salient features. This process can be described as,
$$\psi_i = Nor\left(\Gamma_i \oplus O_i\right) \otimes \gamma_i \quad (4)$$
where $\psi_i \in \mathbb{R}^{H \times W \times C}$ is the visual salient feature, $Nor(\cdot)$ is the normalization operation to the range of $[0, 1]$, $\oplus$ represents element-wise addition, and $\otimes$ is the Hadamard product, i.e., an element-wise multiplication between two features of identical dimensions [44].
Finally, to prevent the degradation of visual features as network depth increases, residual connections are employed to integrate features ψ i and γ i , which can be expressed as,
$$\phi_i = \psi_i \oplus \gamma_i \quad (5)$$
where ϕ i are the i-th layer bottom-up features.
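To make the VAEM pipeline of Equations (1)–(5) concrete, the following PyTorch sketch applies the shallow convolution, fixed Gaussian and Gabor filtering, linear normalization and residual connection described above. The kernel size, Gaussian σ and Gabor parameters are illustrative assumptions, since their exact values are not reported here.

```python
# A minimal sketch of VAEM (Eqs. 1-5); filter parameters are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_kernel(size=5, sigma=1.0):
    """2-D Gaussian kernel of shape (size, size)."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()


def gabor_kernel(theta, size=5, sigma=2.0, lambd=4.0, gamma=0.5):
    """Real part of a 2-D Gabor kernel oriented at `theta` radians."""
    half = (size - 1) / 2
    ys, xs = torch.meshgrid(
        torch.arange(size, dtype=torch.float32) - half,
        torch.arange(size, dtype=torch.float32) - half,
        indexing="ij",
    )
    x_t = xs * math.cos(theta) + ys * math.sin(theta)
    y_t = -xs * math.sin(theta) + ys * math.cos(theta)
    return torch.exp(-(x_t ** 2 + gamma ** 2 * y_t ** 2) / (2 * sigma ** 2)) * torch.cos(2 * math.pi * x_t / lambd)


class VAEM(nn.Module):
    """Visual attention extraction module: intensity + orientation saliency."""

    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, out_ch, 3, padding=1)          # produces gamma_i
        gauss = gaussian_kernel(k).expand(out_ch, 1, k, k).clone()
        gabors = torch.stack([gabor_kernel(math.radians(a), k)
                              for a in (0, 45, 90, 135)])              # (4, k, k)
        self.register_buffer("gauss", gauss)
        self.register_buffer("gabors", gabors.unsqueeze(1))            # (4, 1, k, k)
        self.out_ch, self.k = out_ch, k

    def forward(self, x):
        gamma = torch.sigmoid(self.shallow(x))                          # shallow features
        # Intensity features: depthwise Gaussian filtering (Eq. 2).
        intensity = F.conv2d(gamma, self.gauss, padding=self.k // 2, groups=self.out_ch)
        # Orientation features: sum of depthwise Gabor responses (Eq. 3).
        orient = 0
        for g in self.gabors:                                           # g: (1, k, k)
            w = g.expand(self.out_ch, 1, self.k, self.k)
            orient = orient + F.conv2d(gamma, w, padding=self.k // 2, groups=self.out_ch)
        # Linear normalization to [0, 1], then modulate gamma (Eq. 4).
        s = intensity + orient
        s = (s - s.amin(dim=(2, 3), keepdim=True)) / (
            s.amax(dim=(2, 3), keepdim=True) - s.amin(dim=(2, 3), keepdim=True) + 1e-8)
        psi = s * gamma
        return psi + gamma                                              # residual connection (Eq. 5)
```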

3.3. Top-Down Process

The top-down process includes two components, CMAM and CSAM, as shown in Figure 3a,b, respectively. It takes the multi-layer bottom-up features of the infrared and visible images as input and computes significant features via CMAM. Then, CSAM is used to merge the multi-layer significant features to obtain the object attention signals.

3.3.1. Cross Modal Attention Module

Since the multi-layer bottom-up features of the infrared and visible images can provide important clues for focusing on object-relevant stimuli, $\phi_i^{ir}$ and $\phi_i^{vis}$ are concatenated to form the cross-modal features $\kappa_i \in \mathbb{R}^{H \times W \times 2C}$. Next, we calculate the maximum values $p_i^{max} \in \mathbb{R}^{1 \times 1 \times 2C}$ and average values $p_i^{avg} \in \mathbb{R}^{1 \times 1 \times 2C}$ of $\kappa_i$ along the channel dimension [45]; the difference between $p_i^{max}$ and $p_i^{avg}$ is then used to obtain the channel attention features, which can be expressed as,
$$Z_i = \left[\frac{1}{1 + e^{-\left(p_i^{max} - p_i^{avg}\right)}},\ \kappa_i\right]_C \quad (6)$$
where $Z_i \in \mathbb{R}^{H \times W \times 2C}$ denotes the channel attention features, and $[\cdot, \cdot]_C$ is the channel element-wise multiplication.
In addition, we obtain the spatial maximum value $q_i^{max} \in \mathbb{R}^{H \times W \times 1}$ and the spatial average value $q_i^{avg} \in \mathbb{R}^{H \times W \times 1}$ of $Z_i$ through max pooling and average pooling [46], and the significant features in each layer are then obtained by Equation (7),
$$\nu_i = \left[\frac{1}{1 + e^{-\left(q_i^{max} - q_i^{avg}\right)}},\ Z_i\right]_{H \times W} \quad (7)$$
where $\nu_i \in \mathbb{R}^{H \times W \times 2C}$ denotes the significant features, and $[\cdot, \cdot]_{H \times W}$ is the spatial element-wise multiplication.
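The CMAM block is parameter-free in this reading: it only uses channel and spatial statistics of the concatenated bottom-up features. A minimal sketch of Equations (6)–(7), under that assumption, is given below.

```python
# A minimal sketch of CMAM (Eqs. 6-7) built from pooling statistics only.
import torch


def cmam(phi_ir: torch.Tensor, phi_vis: torch.Tensor) -> torch.Tensor:
    """phi_ir, phi_vis: (B, C, H, W) bottom-up features of layer i."""
    kappa = torch.cat([phi_ir, phi_vis], dim=1)                 # cross-modal features, (B, 2C, H, W)

    # Channel attention (Eq. 6): sigmoid of the gap between each channel's
    # spatial maximum and spatial average.
    p_max = kappa.amax(dim=(2, 3), keepdim=True)                # (B, 2C, 1, 1)
    p_avg = kappa.mean(dim=(2, 3), keepdim=True)                # (B, 2C, 1, 1)
    z = torch.sigmoid(p_max - p_avg) * kappa                    # channel-wise modulation

    # Spatial attention (Eq. 7): sigmoid of the gap between channel-wise
    # max pooling and average pooling at every location.
    q_max = z.amax(dim=1, keepdim=True)                         # (B, 1, H, W)
    q_avg = z.mean(dim=1, keepdim=True)                         # (B, 1, H, W)
    nu = torch.sigmoid(q_max - q_avg) * z                       # significant features, (B, 2C, H, W)
    return nu
```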

3.3.2. Cross-Scale Attention Module

In order to efficiently combine the significant features, $\nu_i \in \mathbb{R}^{H \times W \times 2C}$ is fed into a convolutional layer with $1 \times 1$ kernels and a stride of 1 to adjust the number of channels, which can be expressed as,
$$\zeta_i = conv_{1 \times 1}\left(\nu_i\right) \quad (8)$$
where $\zeta_i \in \mathbb{R}^{H \times W \times C}$ is the output of the convolutional layer, and $conv_{n \times n}(\cdot)$ is the standard convolution operation with the kernel set to $n \times n$ and the stride set to 1. After the convolutional layer, the leaky rectified linear unit (LReLU) is used as the activation function.
After resizing all layer-wise significant features, Hadamard product operations are applied to obtain the object attention signals, which can be expressed as,
$$\zeta = \bigotimes_{i=1}^{4} \zeta_i \quad (9)$$
where $\zeta \in \mathbb{R}^{H \times W \times C}$ denotes the object attention signals, and $\bigotimes_{i=1}^{4}(\cdot)$ denotes the Hadamard product of the multi-scale attention features.
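A compact sketch of CSAM under Equations (8)–(9) follows: a 1×1 convolution with LReLU per layer, then an element-wise (Hadamard) product across the four layers. Resizing every layer to the first layer's resolution is an assumption for the case where layer sizes differ.

```python
# A minimal sketch of CSAM (Eqs. 8-9); the bilinear resizing is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSAM(nn.Module):
    def __init__(self, in_ch, out_ch, num_layers=4):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.LeakyReLU(0.2))
            for _ in range(num_layers)
        )

    def forward(self, nus):
        """nus: list of 4 significant-feature maps, each (B, 2C, H_i, W_i)."""
        target = nus[0].shape[-2:]
        zeta = None
        for conv, nu in zip(self.reduce, nus):
            z = conv(F.interpolate(nu, size=target, mode="bilinear", align_corners=False))
            zeta = z if zeta is None else zeta * z               # Hadamard product (Eq. 9)
        return zeta                                              # object attention signal, (B, C, H, W)
```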

3.4. Bidirectional Attention Integration Mechanism

The overall structure of the bidirectional attention integration mechanism is shown in Figure 1c. The saliency of object features in the fused image is crucial for improving the performance of object detection tasks. However, as the network gradually deepens and the receptive fields remain narrow, the saliency of the object attention signals is easily reduced. Therefore, OAEM is designed to enhance $\zeta \in \mathbb{R}^{H \times W \times C}$. In addition, the proposed AGM guides the multi-layer bottom-up features with the enhanced object attention signals to obtain global attention signals. Finally, the global attention signals are used to reconstruct the fused image, which can be expressed as,
$$F_i = \begin{cases} conv_{3 \times 3}\left(\varphi_i\right), & \text{if } i = 4 \\ conv_{3 \times 3}\left(F_{i+1} \,©\, \varphi_i\right), & \text{if } i = 3, 2, 1 \end{cases} \quad (10)$$
where © is the channel concatenation operation. After the $F_4$, $F_3$ and $F_2$ convolutional layers, LReLU is used as the activation function, and $F_1$ is followed by the Tanh activation function. $\varphi_i \in \mathbb{R}^{H \times W \times 2C}$ denotes the global attention signals. In the end, we obtain $F_1 \in \mathbb{R}^{H \times W \times 3}$ as the fused image.
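The reconstruction chain of Equation (10) can be sketched as a small coarse-to-fine decoder; the intermediate channel width `ch` below stands in for 2C and is an assumption.

```python
# A sketch of the fused-image reconstruction (Eq. 10).
import torch
import torch.nn as nn


class Reconstructor(nn.Module):
    def __init__(self, ch=64):                                   # ch plays the role of 2C (assumed)
        super().__init__()
        self.conv4 = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2))
        self.conv3 = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.LeakyReLU(0.2))
        self.conv2 = nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.LeakyReLU(0.2))
        self.conv1 = nn.Sequential(nn.Conv2d(2 * ch, 3, 3, padding=1), nn.Tanh())

    def forward(self, phi1, phi2, phi3, phi4):
        f4 = self.conv4(phi4)                                     # F_4
        f3 = self.conv3(torch.cat([f4, phi3], dim=1))             # F_3 = conv(F_4 (c) phi_3)
        f2 = self.conv2(torch.cat([f3, phi2], dim=1))             # F_2
        f1 = self.conv1(torch.cat([f2, phi1], dim=1))             # fused image in [-1, 1]
        return f1
```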

3.4.1. Object Attention Enhancement Module

OAEM enhances the saliency of object features from two aspects. On the one hand, we adopt a multi-branch structure to increase the richness of object features, which reduces the saliency attenuation caused by network depth. On the other hand, atrous convolution is applied to expand the receptive field and obtain richer object features. The structure of OAEM is shown in Figure 3c. OAEM includes two branches with atrous convolution and one residual branch; the two branches perform $1 \times 3$ and $3 \times 1$ convolutions to adjust the number of channels and then use atrous convolutions to extract object features of different shapes, which can be expressed as,
$$\Upsilon' = f_{3 \times 3}\left(Conv_{1 \times 3}\left(Conv_{3 \times 1}\left(\zeta\right)\right)\right) \quad (11)$$
$$\Upsilon'' = f_{3 \times 3}\left(Conv_{3 \times 1}\left(Conv_{1 \times 3}\left(\zeta\right)\right)\right) \quad (12)$$
where $f_{3 \times 3}(\cdot)$ represents atrous convolution with a dilation rate of 5 and the kernel set to $3 \times 3$. $\Upsilon' \in \mathbb{R}^{H \times W \times C/2}$ and $\Upsilon'' \in \mathbb{R}^{H \times W \times C/2}$ are the outputs of the two branches after atrous convolution.
Then, the residual branch is concatenated with the two atrous-convolution branches to enhance the object features, which can be expressed as,
$$\Upsilon = \Upsilon' \,©\, \Upsilon'' \,©\, \zeta \quad (13)$$
where $\Upsilon \in \mathbb{R}^{H \times W \times 2C}$ is the enhanced object attention signal.
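The two asymmetric-convolution branches and the residual concatenation of Equations (11)–(13) can be sketched as follows; the placement of activations is an assumption, since only the convolutions are specified above.

```python
# A sketch of OAEM (Eqs. 11-13): asymmetric convs + dilated 3x3 convs (rate 5).
import torch
import torch.nn as nn


class OAEM(nn.Module):
    def __init__(self, ch):                                       # ch = C of the object attention signal
        super().__init__()
        half = ch // 2
        self.branch_a = nn.Sequential(                            # Eq. (11): Conv3x1 -> Conv1x3 -> atrous
            nn.Conv2d(ch, half, kernel_size=(3, 1), padding=(1, 0)),
            nn.Conv2d(half, half, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(half, half, kernel_size=3, padding=5, dilation=5),
        )
        self.branch_b = nn.Sequential(                            # Eq. (12): Conv1x3 -> Conv3x1 -> atrous
            nn.Conv2d(ch, half, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(half, half, kernel_size=(3, 1), padding=(1, 0)),
            nn.Conv2d(half, half, kernel_size=3, padding=5, dilation=5),
        )

    def forward(self, zeta):
        a = self.branch_a(zeta)                                    # (B, C/2, H, W)
        b = self.branch_b(zeta)                                    # (B, C/2, H, W)
        return torch.cat([a, b, zeta], dim=1)                      # enhanced signal, (B, 2C, H, W), Eq. (13)
```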

3.4.2. Attention Guidance Module

In order to efficiently obtain global attention signals, the key issue is obtaining the correlation between the multi-layer bottom-up features and the object attention signals to guide their natural combination. Inspired by the impressive performance of transformer networks in capturing contextual relationships [47], we design AGM to obtain global attention signals, as shown in Figure 3d. Firstly, the multi-layer bottom-up features of the infrared and visible images are concatenated to obtain the cross-modal features $\kappa_i$. Then, we take $\kappa_i$ as the query (Q) and $\Upsilon$ as the key (K) to extract the correlation weight, which is expressed as,
$$Att_i = Softmax\left(\frac{Q \cdot K_i}{\sqrt{d_k}}\right) \quad (14)$$
where $Att_i \in \mathbb{R}^{H \times W \times 2C}$ is the correlation weight and $d_k$ denotes the dimension of the key.
Then, in order to reduce the semantic difference between the object attention signals and the multi-layer bottom-up features, we take the enhanced object attention signal as the value (V) and multiply it by $Att_i$ to obtain the modulated object attention signals. Finally, the global attention signals are extracted by adding the modulated object attention signals and the multi-layer bottom-up features. This process can be expressed as,
$$\varphi_i = Att_i \cdot V \oplus \kappa_i \quad (15)$$
where $\varphi_i \in \mathbb{R}^{H \times W \times 2C}$ denotes the global attention signals.
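Because the stated shapes keep $Att_i$ the same size as the features ($H \times W \times 2C$), one plausible reading of Equations (14)–(15) is an element-wise correlation with a channel-wise softmax, as in the sketch below. This is an interpretation for illustration, not the authors' exact formulation, and any projection layers inside AGM are omitted.

```python
# A hedged sketch of AGM (Eqs. 14-15) using an element-wise correlation.
import math
import torch


def agm(kappa: torch.Tensor, upsilon: torch.Tensor) -> torch.Tensor:
    """kappa: cross-modal features (B, 2C, H, W); upsilon: enhanced object signal (B, 2C, H, W)."""
    d_k = kappa.shape[1]                                              # channel (key) dimension
    att = torch.softmax((kappa * upsilon) / math.sqrt(d_k), dim=1)    # correlation weight, Eq. (14)
    return att * upsilon + kappa                                      # global attention signal, Eq. (15)
```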

3.5. Loss Function

In order to guide the bottom-up process in VAFNet to extract strength features, we present a strength loss $l_{str}$ to measure the pixel-level differences between the fused image and the source images, which can be expressed as,
$$l_{str} = \frac{1}{H \times W \times C}\left(\left\| F - I^{vis} \right\|_2 + \left\| F - I^{ir} \right\|_2\right) \quad (16)$$
where $\left\| \cdot \right\|_2$ is the L2 norm, $F \in \mathbb{R}^{H \times W \times 3}$ is the fused image, $I^{vis} \in \mathbb{R}^{H \times W \times 3}$ and $I^{ir} \in \mathbb{R}^{H \times W \times 3}$ are the visible and infrared images, respectively, and $H$, $W$ and $C$ are the height, width and number of channels.
Then, we apply a structure loss l s s i m to force the bottom-up process for obtaining orientation features, which can be expressed as,
$$l_{ssim} = SSIM\left(F, I^{vis}\right) + SSIM\left(F, I^{ir}\right) \quad (17)$$
where S S I M · is a structural similarity measurement.
In addition, in order to encourage the top-down process to extract object features, the proposed object loss is defined as,
$$l_{object} = \frac{1}{H \times W \times C}\left(\left\| F - \Psi^{ir} \right\|_1 + \left\| F - \Psi^{vis} \right\|_1\right) \quad (18)$$
where $\left\| \cdot \right\|_1$ is the L1 norm, $l_{object}$ is the object loss, and $\Psi^{ir}$ and $\Psi^{vis}$ are the object regions extracted from the infrared and visible images using an object saliency detection algorithm [48].
Finally, we combine the strength, structure, and object loss to train VAFNet, which can be expressed as,
$$l_{total} = l_{str} + \omega_1 l_{ssim} + \omega_2 l_{object} \quad (19)$$
where $l_{total}$ is the final loss, and $\omega_1$ and $\omega_2$ are balance coefficients.
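A compact sketch of the training objective in Equations (16)–(19) is given below. The SSIM term is written in the conventional (1 − SSIM) loss form, which is an assumption about the exact sign convention; images are assumed normalized to [0, 1], the masked regions psi_ir / psi_vis come from the saliency detector mentioned above, and pytorch_msssim is an assumed external dependency.

```python
# A hedged sketch of the VAFNet loss (Eqs. 16-19).
import torch
from pytorch_msssim import ssim                                   # assumed dependency


def vafnet_loss(fused, vis, ir, psi_vis, psi_ir, w1=5.0, w2=1.0):
    n = fused[0].numel()                                          # H * W * C of one image
    # Strength loss (Eq. 16): L2 distance to both source images.
    l_str = ((fused - vis).pow(2).sum(dim=(1, 2, 3)).sqrt() +
             (fused - ir).pow(2).sum(dim=(1, 2, 3)).sqrt()).mean() / n
    # Structure loss (Eq. 17): structural similarity, conventional loss form (assumed).
    l_ssim = (1 - ssim(fused, vis, data_range=1.0)) + (1 - ssim(fused, ir, data_range=1.0))
    # Object loss (Eq. 18): L1 distance to the saliency-masked object regions.
    l_obj = ((fused - psi_ir).abs().sum(dim=(1, 2, 3)) +
             (fused - psi_vis).abs().sum(dim=(1, 2, 3))).mean() / n
    return l_str + w1 * l_ssim + w2 * l_obj                       # total loss (Eq. 19)
```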

4. Experiment

In this section, we perform a series of experiments on the public and collected datasets to comprehensively validate the performance of our fusion method. Firstly, we describe the experimental settings and implementation details. Secondly, the comparative, generalization and surveillance application experiments are discussed and analyzed. Thirdly, we evaluate infrared-visible object detection. Finally, the computational efficiency of the proposed method is analyzed.

4.1. Experimental Settings

To demonstrate the advantages of our VAFNet, the comparative and generalization experiments were conducted on the MSRS [49] and TNO public datasets, respectively. Moreover, we also collected 23 pairs of aligned infrared and visible images during daytime and nighttime for surveillance application experiments. The imaging equipment and acquisition site are shown in Figure 4. We compare VAFNet with seven SOTA infrared and visible image fusion methods, including MDA [50], SFINet [51], SpTFuse [52], U2Fusion [53], LRRNet [54], RES2Fusion [55] and SeAFusion [11]. The implementation code of the seven SOTA methods is publicly available, and the parameter selection is based on default values.
In addition, we selected four metrics for quantitative evaluation, including entropy (EN) [56], spatial frequency (SF) [57], standard deviation (SD) [58] and visual information fidelity (VIF) [59]. EN can measure the information richness of the fused image, with a larger EN indicating a richer amount of information. SF describes the clarity of fused images, with larger SF representing clearer fused images. SD indicates the degree of pixel distribution in fused images, with a larger SD indicating higher contrast. VIF can evaluate the visual fidelity of the fusion results; a higher VIF value means the fused image is closer to human visual perception.
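For reference, three of these metrics (EN, SF, SD) follow standard definitions and can be computed as in the NumPy sketch below; VIF requires a full reference implementation and is omitted. Normalization conventions may differ slightly from the toolboxes cited in the paper.

```python
# Reference formulas for EN, SF and SD on an 8-bit grayscale fused image.
import numpy as np


def entropy(img: np.ndarray) -> float:
    """Shannon entropy (EN): information richness of the fused image."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def spatial_frequency(img: np.ndarray) -> float:
    """Spatial frequency (SF): RMS of horizontal and vertical first differences."""
    img = img.astype(np.float64)
    rf = np.diff(img, axis=1) ** 2
    cf = np.diff(img, axis=0) ** 2
    return float(np.sqrt(rf.mean() + cf.mean()))


def standard_deviation(img: np.ndarray) -> float:
    """Standard deviation (SD): contrast proxy of the fused image."""
    return float(img.astype(np.float64).std())
```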

4.2. Implementation Details

We utilized 1106 pairs of registered infrared and visible images from the MSRS and collected datasets to train VAFNet, including 562 pairs of daytime scene images and 544 pairs of nighttime scene images. The parameters of VAFNet were updated by the Adam optimizer with a learning rate of $1 \times 10^{-4}$. For the hyperparameters of Equation (19), we set $\omega_1 = 5$ and $\omega_2 = 1$. Our VAFNet was implemented on the Microsoft Windows 10 platform with an NVIDIA GeForce RTX 4070 SUPER graphics processing unit (GPU), 32 GB of random access memory (RAM) and an Intel i5-13400F central processing unit (CPU).

4.3. Comparative Experiment

In order to illustrate the advantages of the proposed fusion method, we compared the VAFNet with LRRNet, MDA, RES2Fusion, SeAFusion, SFINet, SpTFuse and U2Fusion on the MSRS dataset.

4.3.1. Qualitative Evaluation

Since image fusion methods need to be applied around the clock, we chose daytime and nighttime scenes for qualitative evaluation. The qualitative evaluation results of VAFNet and seven SOTA fusion methods are shown in Figure 5. In daytime scenes, infrared images have prominent objects, while visible images have rich details. Therefore, the fused image with excellent visual effects should have prominent infrared objects and rich textures. As shown in the first and second rows of Figure 5, due to insufficient feature extraction ability, LRRNet and MDA cannot preserve the rich details of visible images. In contrast, SpTFuse and U2Fusion cannot effectively preserve infrared objects. Although SFINet, RES2Fusion and SeAFusion can extract the infrared objects and visible texture to a certain extent, they cannot effectively overcome the interference of uneven lighting. To clearly demonstrate the ability to preserve textures, we enlarged a small texture region with a red box in the bottom right corner. In addition, a prominent object is displayed within the yellow box to illustrate the effectiveness of extracting interesting targets. We can see that due to the ability to accurately extract visual saliency features from infrared and visible images, only our method has significant objects and rich texture information.
In nighttime scenes, although infrared images are not easily affected by illumination, visible images always contain many distorted regions due to insufficient lighting. Therefore, excellent fusion methods not only preserve infrared objects but also need to extract features from distorted regions to comprehensively express the imaging scene. As shown in the third and fourth rows of Figure 5, each fusion method can combine the complementary information of infrared and visible images to some extent. However, from the enlarged texture region (red box) and target region (yellow box), it can be seen that, apart from the fusion results of SeAFusion and VAFNet, there is some distortion in the fused images of the other six fusion methods. Compared to SeAFusion, VAFNet retains richer texture details thanks to the stability of the visual attention mechanism. In summary, the proposed fusion method has better qualitative evaluation performance than the other seven fusion methods.

4.3.2. Quantitative Evaluation

Table 1 and Figure 6 show the quantitative evaluation results of the EN, SF, SD and VIF metrics for 56 pairs of images from the MSRS dataset. For EN, VAFNet achieved the maximum value, indicating that our proposed method can generate fused images with rich complementary information. For SF and SD, our method has clear advantages, which indicates that VAFNet produces abundant textures and significant objects. Finally, the evaluation results of VIF indicate that the fused images generated by VAFNet conform to human visual perception. In summary, compared with the other seven fusion methods, the proposed fusion method has better quantitative evaluation results.

4.4. Generalization Experiment

To evaluate the reliability of our fusion method, we perform generalization experiments on the TNO dataset. Our VAFNet was trained on the MSRS and collection dataset, and tested on the TNO dataset.

4.4.1. Qualitative Evaluation

Figure 7 shows the qualitative evaluation results of VAFNet and the other seven comparative methods. It can be seen that the fusion results of MDA, SFINet, SpTFuse, U2Fusion, LRRNet and RES2Fusion not only weaken the contrast of infrared objects but also contain blurry textures. To visually demonstrate the fusion performance of the proposed method, we enlarged a small texture region in the red box and displayed an object region in the yellow box. We can observe that only the fusion results of SeAFusion and VAFNet retain the significance of infrared targets. However, compared to SeAFusion, the proposed method produces clearer textures.

4.4.2. Quantitative Evaluation

We also selected nine pairs of infrared and visible images from the TNO dataset for quantitative evaluation, as shown in Table 2 and Figure 8. The results demonstrate that VAFNet outperforms the other methods in EN, SF and SD, indicating its superior capabilities in preserving information richness, maintaining optimal clarity, and retaining object details. However, due to the considerable difficulty in collecting large-scale training datasets, the proposed method ranks second on the VIF metric.

4.5. Evaluation on Surveillance Application

To evaluate the performance of the proposed method in practical surveillance tasks, we collected 23 pairs of infrared and visible images for application evaluation.

4.5.1. Qualitative Evaluation

Figure 9 shows the qualitative evaluation results of our VAFNet and seven SOTA fusion methods. It can be seen that VAFNet not only has better visual effects but also preserves prominent objects (as shown in the yellow box) in daytime scenes. Moreover, for nighttime scenes, the proposed method has a lighting enhancement effect because VAFNet can accurately extract complementary information from dark regions. In summary, the application experiment shows that VAFNet can provide comprehensive imaging scene information for surveillance tasks.

4.5.2. Quantitative Evaluation

Table 3 and Figure 10 show the quantitative evaluation results of our VAFNet and seven SOTA fusion methods on the collected dataset. It can be seen that the proposed method ranks first in EN and SD. The best EN demonstrates that VAFNet can effectively integrate complementary information, and the best SD indicates that our fusion method can effectively extract salient objects. In addition, SF has the second-best evaluation result, indicating that the fusion results of VAFNet have good clarity. Finally, although our VIF value is not optimal, it still indicates that the fusion results of the proposed method are consistent with human visual perception.

4.6. Evaluation on Infrared-Visible Object Detection

Infrared and visible image fusion technology has been widely used as a pre-processing module for object detection tasks. Therefore, we use the popular YOLOv5 to evaluate the performance of each fusion method on object detection. Specifically, we selected 80 pairs of infrared and visible images from the MSRS dataset as the test set and labeled them with two key categories: people and cars. Figure 11 provides some visual examples to illustrate the advantages of our VAFNet in improving the performance of object detection tasks. In the first row of Figure 11, YOLOv5 is unable to detect the car in the infrared image due to the lack of texture; conversely, due to lighting conditions, YOLOv5 cannot detect the pedestrian in the visible image. Since VAFNet and the other seven methods can combine the complementary information of infrared and visible images to a certain extent, their fusion results enable YOLOv5 to detect pedestrians and cars simultaneously. However, because the proposed method maintains the best object intensity and the clearest texture information, VAFNet achieves the highest detection confidence. Similar results appear in the second row of Figure 11, which proves that VAFNet can effectively improve the performance of object detection.
In addition, precision (P), recall (R) and mean average precision (mAP) are used to measure detection performance. AP@50, AP@70 and AP@90 are the AP values at Intersection over Union (IoU) thresholds of 0.5, 0.7 and 0.9, and mAP@[0.5:0.95] represents the average precision over IoU thresholds from 0.5 to 0.95. The detection performance of YOLOv5 is shown in Table 4. We can see that VAFNet exhibits high precision and recall, and overall achieves better accuracy under different IoU thresholds. In summary, the proposed method can effectively improve the performance of object detection tasks.
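As a pointer for reproducing these metrics (a sketch, not the authors' evaluation code), detector outputs on the fused images can be scored with torchmetrics' MeanAveragePrecision, which averages AP over IoU thresholds 0.5:0.95 and also reports per-threshold values. The box coordinates below are dummy values for illustration only.

```python
# Scoring detections on fused images with torchmetrics (illustrative values).
import torch
from torchmetrics.detection import MeanAveragePrecision

metric = MeanAveragePrecision(iou_type="bbox")

# Predictions from the detector on one fused image (xyxy boxes, confidence, class id).
preds = [{
    "boxes": torch.tensor([[50.0, 30.0, 120.0, 200.0]]),
    "scores": torch.tensor([0.92]),
    "labels": torch.tensor([0]),        # e.g., 0 = person, 1 = car (example mapping)
}]
# Ground-truth annotations for the same image.
targets = [{
    "boxes": torch.tensor([[48.0, 28.0, 118.0, 205.0]]),
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
print(metric.compute())                  # includes 'map', 'map_50', 'map_75', ...
```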

4.7. Efficiency Comparison

In practical applications, the image fusion methods must have excellent computational efficiency to achieve real-time processing. Table 5 shows the average running time of VAFNet and the other seven fusion methods in MSRS, TNO, and the collection dataset. All fusion methods run on a GPU. It can be seen that VAFNet not only has the best running efficiency, but also meets the real-time requirement of processing 30 frames per second (FPS) for object detection tasks [60]. This is because visual attention technology has the ability to process large amounts of information with extremely low energy consumption. In conclusion, VAFNet can serve as a reliable preprocessing module for object detection tasks.

5. Conclusions

To fill the gap between image fusion and object detection tasks, we proposed a novel visual attention fusion network for infrared and visible image fusion. Inspired by the human visual system, our network performs bottom-up and top-down inference to extract visual and object features, respectively. Most importantly, the bottom-up and top-down processes work in a collaborative manner. In addition, we present a bidirectional attention integration mechanism to naturally combine the two forms of attention. Comparative and generalization experiments show that the proposed fusion method outperforms seven SOTA methods in both qualitative and quantitative evaluations. Moreover, the evaluation results on surveillance and object detection tasks reveal that our fusion network can not only comprehensively express imaging scenes but also effectively improve the performance of object detection tasks. In terms of computational efficiency, the average running time of VAFNet also meets real-time processing requirements. In the future, we will expand the training dataset to enhance VAFNet’s robustness in practical engineering applications.

Author Contributions

Conceptualization, methodology, software and funding acquisition, Y.L.; visualization and software, Y.W.; Dataset development, supervision and editing, network model and parameter optimization, Z.J. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported in part by the Natural Science Foundation of Henan Province in China (242300421409, 252300420983), and in part by the Key Scientific Research Project Plan of Colleges and Universities in Henan Province of China (24A520038, 25A520020).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kaur, H.; Koundal, D.; Kadyan, V. Image fusion techniques: A survey. Arch. Comput. Methods Eng. 2021, 8, 4425–4447. [Google Scholar]
  2. Parida, P.; Panda, M.K.; Rout, D.K.; Panda, S.K. Infrared and visible image fusion using quantum computing induced edge preserving filter. Image Vis. Comput. 2025, 153, 105344. [Google Scholar]
  3. Singh, S.; Singh, H.; Bueno, G.; Deniz, O.; Singh, S.; Monga, H.; Hrisheekesha, P.N.; Pedraza, A. A review of image fusion: Methods, applications and performance metrics. Digit. Signal Process. 2023, 137, 104020. [Google Scholar]
  4. Zhang, Y.; Liao, X. Asymmetric Training and Symmetric Fusion for Image Denoising in Edge Computing. Symmetry 2025, 17, 424. [Google Scholar] [CrossRef]
  5. Liu, J.; Wu, G.; Liu, Z.; Wang, D.; Jiang, Z.; Ma, L. Infrared and Visible Image Fusion: From Data Compatibility to Task Adaption. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2349–2369. [Google Scholar]
  6. Zhao, W.; Xie, S.; Zhao, F.; He, Y.; Lu, H. Metafusion: Infrared and visible image fusion via meta-feature embedding from object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 13955–13965. [Google Scholar]
  7. Hou, H.; Wang, Y.; Qin, Q.; Tan, Y.; Liu, T. Multi-Scale feature fusion based on difference enhancement for remote sensing image change detection. Symmetry 2025, 17, 590. [Google Scholar] [CrossRef]
  8. Yi, X.; Gu, S.; Wu, X.; Jing, D. AFEDet: A Symmetry-Aware Deep Learning Model for Multi-Scale Object Detection in Aerial Images. Symmetry 2025, 17, 488. [Google Scholar]
  9. Zhang, X.; Demiris, Y. Visible and infrared image fusion using deep learning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10535–10554. [Google Scholar]
  10. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion. 2021, 76, 323–336. [Google Scholar]
  11. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion. 2022, 82, 28–42. [Google Scholar]
  12. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  13. Guo, M.H.; Lu, C.Z.; Liu, Z.N.; Cheng, M.M.; Hu, S.M. Visual attention network. Comput. Vis. Media 2023, 9, 733–752. [Google Scholar]
  14. Hassanin, M.; Anwar, S.; Radwan, I.; Khan, F.S.; Mian, A. Visual attention methods in deep learning: An in-depth survey. Inf. Fusion. 2024, 108, 102417. [Google Scholar]
  15. Liu, Y.; Guo, Y.; Yin, J.; Song, X.; Liu, W.; Nie, L. Answer questions with right image regions: A visual attention regularization approach. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2022, 18, 1–18. [Google Scholar]
  16. Shen, F.; Liu, H.; Xu, C.; Ouyang, L.; Zhang, J.; Chen, Y.; He, Y. VGGNet and Attention Mechanism-Based Image Quality Assessment Algorithm in Symmetry Edge Intelligence Systems. Symmetry 2025, 17, 331. [Google Scholar] [CrossRef]
  17. Lu, S.; Liu, M.; Yin, L.; Yin, Z.; Liu, X.; Zheng, W. The multi-modal fusion in visual question answering: A review of attention mechanisms. PeerJ Comput. Sci. 2023, 9, e1400. [Google Scholar]
  18. Guo, W.; Zhang, Y.; Yang, J.; Yuan, X. Re-attention for visual question answering. IEEE Trans. Image Process. 2021, 30, 6730–6743. [Google Scholar] [PubMed]
  19. Li, J.; Huo, H.; Li, C.; Wang, R.; Feng, Q. AttentionFGAN: Infrared and visible image fusion using attention-based generative adversarial networks. IEEE Trans. Multimed. 2020, 23, 1383–1396. [Google Scholar]
  20. Liu, Y.; Dong, L.; Chen, Y.; Xu, W. An efficient method for infrared and visual images fusion based on visual attention technique. Remote Sens. 2020, 12, 781. [Google Scholar]
  21. Jin, X.; Jiang, Q.; Yao, S.; Zhou, D.; Nie, R.; Hai, J.; He, K. A survey of infrared and visual image fusion methods. Infr. Phys. Technol. 2017, 85, 478–501. [Google Scholar]
  22. Fang, C.; Tian, H.; Zhang, D.; Zhang, Q.; Han, J.; Han, J. Densely nested top-down flows for salient object detection. Sci. China Inf. Sci. 2022, 65, 182103. [Google Scholar]
  23. Lin, D.; Shen, D.; Shen, S.; Ji, Y.; Lischinski, D.; Cohen-Or, D.; Huang, H. Zigzagnet: Fusing top-down and bottom-up context for object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  24. Liu, S.; Yan, A.; Huang, S. Seismic data denoising based on DC-PCNN image fusion in NSCT domain. IEEE Geosci. Remote Sens. Lett. 2024, 21, 7502205. [Google Scholar]
  25. Singh, S.; Singh, H.; Gehlot, A.; Jaskirat, K. IR and visible image fusion using DWT and bilateral filter. Microsyst. Technol. 2023, 29, 457–467. [Google Scholar]
  26. Zhang, H.; Ma, X.; Tian, Y. An image fusion method based on curvelet transform and guided filter enhancement. Math. Probl. Eng. 2020, 2020, 9821715. [Google Scholar]
  27. Li, G.; Lin, Y.; Qu, X. An infrared and visible image fusion method based on multi-scale transformation and norm optimization. Inf. Fusion. 2021, 71, 109–129. [Google Scholar]
  28. Fu, Z.; Wang, X.; Xu, J.; Ning, Z.; Zhao, Y. Infrared and visible images fusion based on RPCA and NSCT. Infr. Phys. Technol. 2016, 77, 114–123. [Google Scholar]
  29. Zhang, S.; Li, X.; Zhang, X.; Zhang, S. Infrared and visible image fusion based on saliency detection and two-scale transform decomposition. Infr. Phys. Technol. 2021, 114, 103626. [Google Scholar]
  30. Yang, Y.; Liu, J.; Huang, S.; Wan, W. Infrared and visible image fusion via texture conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4771–4783. [Google Scholar]
  31. Rao, D.; Xu, T.; Wu, X.J. TGFuse: An infrared and visible image fusion approach based on transformer and generative adversarial network. IEEE Trans. Image Process. 2023; Early Access. [Google Scholar]
  32. Su, W.; Huang, Y.; Li, Q.; Zuo, F.; Liu, L. Infrared and visible image fusion based on adversarial feature extraction and stable image reconstruction. IEEE Trans. Instrum. Meas. 2022, 71, 1–14. [Google Scholar]
  33. Li, S.; Araujo, I.B.; Ren, W.; Wang, Z.; Tokuda, E.K.; Junior, R.H.; Cesar-Junior, R.; Zhang, J.; Guo, X. Single image deraining: A comprehensive benchmark analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3838–3847. [Google Scholar]
  34. Pei, Y.; Huang, Y.; Zou, Q.; Lu, Y.; Wang, S. Does haze removal help cnn-based image classification? In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 682–697. [Google Scholar]
  35. Tzinis, E.; Wang, Z.; Smaragdis, P. Sudo rm-rf: Efficient networks for universal audio source separation. In Proceedings of the 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), Espoo, Finland, 21–24 September 2020; IEEE: New York, NY, USA, 2020. [Google Scholar]
  36. Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086. [Google Scholar]
  37. Guo, S.; Wang, Y.; Tan, Y.; Liu, T.; Qin, Q. Efficient Coastal Mangrove Species Recognition Using Multi-Scale Features Enhanced by Multi-Head Attention. Symmetry 2025, 17, 461. [Google Scholar]
  38. Guo, M.; Xu, T.; Liu, J.; Liu, Z.; Jiang, P.; Mu, T. Neuronal mechanisms of visual attention. Annu. Rev. Vis. Sci. 2015, 1, 373–391. [Google Scholar]
  39. Wang, J.; Xi, X.; Li, D.; Li, F. FusionGRAM: An infrared and visible image fusion framework based on gradient residual and attention mechanism. IEEE Trans. Instrum. Meas. 2023, 72, 1–12. [Google Scholar]
  40. Pang, B.; Li, Y.; Li, J.; Li, M.; Cao, H.; Lu, C. Tdaf: Top-down attention framework for vision tasks. Proc. AAAI Conf. Artif. Intell. 2021, 35, 2384–2392. [Google Scholar]
  41. Fang, Y.; Lin, W.; Lau, C.T.; Lee, B.S. A visual attention model combining top-down and bottom-up mechanisms for salient object detection. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; IEEE: New York, NY, USA, 2011; pp. 1293–1296. [Google Scholar]
  42. Wang, W.; Shen, J.; Cheng, M.M. An iterative and cooperative top-down and bottom-up inference network for salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5968–5977. [Google Scholar]
  43. Mathôt, S.; Theeuwes, J. Visual attention and stability. Philos. Trans. R. Soc. Biol. Sci. 2021, 366, 516–527. [Google Scholar]
  44. Chrysos, G.G.; Wu, Y.; Pascanu, R.; Torr, P.; Cevher, V. Hadamard product in deep learning: Introduction, Advances and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 6531–6549. [Google Scholar] [PubMed]
  45. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  46. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  47. Subakan, C.; Ravanelli, M.; Cornell, S.; Bronzi, M.; Zhong, J. Attention is all you need in speech separation. In Proceedings of the IEEE International Conference on Acoustics. Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  48. Liu, Y.; Dong, L.; Xu, W. Infrared and visible image fusion via salient object extraction and low-light region enhancement. Infr. Phys. Technol. 2022, 124, 104223. [Google Scholar]
  49. Tang, L.; Yuan, J.; Zhang, H.; Jiang, X.; Ma, J. PIAFusion: A progressive infrared and visible image fusion network based on illumination aware. Inf. Fusion. 2022, 83, 79–92. [Google Scholar]
  50. Yang, G.; Li, J.; Lei, H.; Gao, X. A multi-scale information integration framework for infrared and visible image fusion. Neurocomputing 2024, 600, 128116. [Google Scholar]
  51. Song, W.; Li, Q.; Gao, M.; Chehri, A.; Jeon, G. SFINet: A semantic feature interactive learning network for full-time infrared and visible image fusion. Expert Syst. Appl. 2025, 261, 125472. [Google Scholar]
  52. Guo, L.; Luo, X.; Liu, Y.; Zhang, Z.; Wu, X. SAM-guided multi-level collaborative Transformer for infrared and visible image fusion. Pattern Recognit. 2025, 162, 111391. [Google Scholar]
  53. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar]
  54. Li, H.; Xu, T.; Wu, X.; Lu, J.; Kittler, J. Lrrnet: A novel representation learning guided fusion network for infrared and visible images. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11040–11052. [Google Scholar] [PubMed]
  55. Wang, Z.; Wu, Y.; Wang, J.; Xu, J.; Shao, W. Res2Fusion: Infrared and visible image fusion based on dense Res2net and double nonlocal attention models. IEEE Trans. Instrum. Meas. 2022, 71, 1–12. [Google Scholar]
  56. Wang, Z.; Yang, F.; Sun, J.; Xu, J.; Yang, F.; Yan, X. AITFuse: Infrared and visible image fusion via adaptive interactive transformer learning. Knowl.-Based Syst. 2024, 299, 111949. [Google Scholar]
  57. Cui, G.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition. Opt. Commun. 2015, 341, 199–209. [Google Scholar]
  58. Hossny, M.; Nahavandi, S.; Creighton, D. Comments on ‘Information measure for performance of image fusion’. Electron. Lett. 2008, 44, 1028–1066. [Google Scholar]
  59. Chen, J.; Yang, L.; Liu, W.; Tian, X.; Ma, J. LENFusion: A Joint Low-Light Enhancement and Fusion Network for Nighttime Infrared and Visible Image Fusion. IEEE Trans. Instrum. Meas. 2024, 73, 5018715. [Google Scholar]
  60. Katkade, K.S.N.; Bagal, V.C.; Manza, R.R.; Yannawar, P.L. Advances in real-time object detection and information retrieval: A review. Artif. Intell. Appl. 2023, 1, 123–128. [Google Scholar]
Figure 1. The network architecture of VAFNet.
Figure 2. The structure of VAEM.
Figure 3. The structure of CMAM, CSAM, OAEM, AGM.
Figure 4. Imaging equipment and acquisition site for capturing infrared and visible images.
Figure 5. Qualitative comparison of the VAFNet with seven SOTA fusion methods on the MSRS dataset. For clear comparison, we highlight an object region (i.e., yellow box) and magnify a small texture region (i.e., red box).
Figure 6. Quantitative comparison of EN, SF, SD, and VIF on 56 pairs of infrared and visible images from the MSRS dataset. The point (x, y) on a curve indicates that the metric values of 100x percent of the image pairs do not exceed y.
Figure 7. Qualitative comparison of the VAFNet with seven SOTA fusion methods on the TNO dataset. For clear comparison, we highlight an object region (i.e., yellow box) and magnify a small texture region (i.e., red box).
Figure 8. Quantitative comparison of EN, SF, SD, and VIF on 9 pairs of infrared and visible images from the TNO dataset. The point (x, y) on a curve indicates that the metric values of 100x percent of the image pairs do not exceed y.
Figure 9. Qualitative comparison of the proposed method with seven SOTA fusion methods on the collection dataset. For clear comparison, we highlight an object region (i.e., yellow box) and magnify a small texture region (i.e., red box).
Figure 10. Quantitative comparison of EN, SF, SD, and VIF on 23 pairs of infrared and visible images from the collection dataset. The point (x, y) on a curve indicates that the metric values of 100x percent of the image pairs do not exceed y.
Figure 11. The object detection results of infrared, visible, VAFNet, and seven SOTA methods from the MSRS dataset.
Table 1. Mean and standard deviation of quantitative evaluation results on 56 image pairs from the MSRS dataset. Blue and green represent the best and runner-up values, respectively. ↑ indicates that the larger the value, the better the performance.

| Method | EN↑ | SF↑ | SD↑ | VIF↑ |
| --- | --- | --- | --- | --- |
| LRRNet | 5.399 ± 0.663 | 4.674 ± 0.961 | 22.626 ± 8.047 | 0.513 ± 0.120 |
| MDA | 5.329 ± 0.471 | 4.523 ± 0.735 | 17.831 ± 4.531 | 0.614 ± 0.061 |
| RES2Fusion | 4.896 ± 1.056 | 5.028 ± 0.972 | 32.774 ± 9.056 | 0.827 ± 0.097 |
| SeAFusion | 5.965 ± 0.544 | 5.637 ± 0.860 | 33.169 ± 8.225 | 1.001 ± 0.077 |
| SFINet | 6.074 ± 0.513 | 5.272 ± 0.824 | 28.012 ± 7.268 | 0.785 ± 0.111 |
| SpTFuse | 5.981 ± 0.547 | 5.256 ± 0.806 | 32.126 ± 8.299 | 0.988 ± 0.073 |
| U2Fusion | 4.244 ± 0.706 | 4.096 ± 0.919 | 18.176 ± 5.180 | 0.477 ± 0.063 |
| VAFNet | 6.452 ± 0.359 | 6.851 ± 0.874 | 36.451 ± 4.421 | 1.060 ± 0.210 |
Table 2. Mean and standard deviation of quantitative evaluation results on 9 image pairs from the TNO dataset. Blue and green represent the best and runner-up values, respectively. ↑ indicates that the larger the value, the better the performance.

| Method | EN↑ | SF↑ | SD↑ | VIF↑ |
| --- | --- | --- | --- | --- |
| LRRNet | 7.177 ± 0.399 | 8.095 ± 1.487 | 44.302 ± 12.874 | 0.970 ± 0.208 |
| MDA | 6.589 ± 0.306 | 6.320 ± 0.846 | 28.665 ± 7.255 | 0.812 ± 0.153 |
| RES2Fusion | 6.188 ± 0.537 | 5.200 ± 1.461 | 23.206 ± 9.078 | 0.741 ± 0.108 |
| SeAFusion | 7.067 ± 0.234 | 8.946 ± 1.642 | 40.161 ± 10.384 | 1.061 ± 0.303 |
| SFINet | 6.940 ± 0.325 | 8.148 ± 1.220 | 38.649 ± 9.845 | 0.857 ± 0.182 |
| SpTFuse | 6.810 ± 0.388 | 5.756 ± 1.409 | 32.796 ± 11.088 | 0.915 ± 0.101 |
| U2Fusion | 6.465 ± 0.550 | 7.798 ± 1.666 | 25.813 ± 9.795 | 0.765 ± 0.147 |
| VAFNet | 7.416 ± 0.185 | 9.124 ± 1.850 | 48.841 ± 3.766 | 1.042 ± 0.302 |
Table 3. Mean and standard deviation of quantitative evaluation results on 23 image pairs from the collected dataset. Blue and green represent the best and runner-up values, respectively. ↑ indicates that the larger the value, the better the performance.

| Method | EN↑ | SF↑ | SD↑ | VIF↑ |
| --- | --- | --- | --- | --- |
| LRRNet | 7.069 ± 0.542 | 9.076 ± 1.533 | 45.272 ± 9.645 | 0.817 ± 0.131 |
| MDA | 6.791 ± 0.498 | 7.901 ± 1.315 | 33.212 ± 8.180 | 0.727 ± 0.105 |
| RES2Fusion | 7.113 ± 0.507 | 8.835 ± 1.738 | 44.663 ± 11.189 | 0.116 ± 0.192 |
| SeAFusion | 7.294 ± 0.405 | 9.520 ± 1.450 | 48.948 ± 9.929 | 0.950 ± 0.129 |
| SFINet | 6.955 ± 0.453 | 8.121 ± 1.584 | 39.058 ± 9.006 | 0.362 ± 0.387 |
| SpTFuse | 7.335 ± 0.473 | 8.451 ± 1.355 | 49.471 ± 11.994 | 0.876 ± 0.099 |
| U2Fusion | 6.676 ± 0.664 | 9.200 ± 1.854 | 33.300 ± 9.809 | 0.092 ± 0.112 |
| VAFNet | 7.337 ± 0.275 | 9.232 ± 1.235 | 49.773 ± 5.285 | 0.799 ± 0.195 |
Table 4. Object detection performance (mAP) of visible, infrared and fused images on the MSRS dataset. Blue represents the best result, and green represents the second-best result.

| Method | P | R | AP@50 (Person) | AP@50 (Car) | AP@50 (All) | AP@70 (Person) | AP@70 (Car) | AP@70 (All) | AP@90 (Person) | AP@90 (Car) | AP@90 (All) | mAP@[0.5:0.95] (Person) | mAP@[0.5:0.95] (Car) | mAP@[0.5:0.95] (All) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Infrared | 0.9203 | 0.7337 | 0.9493 | 0.6835 | 0.8164 | 0.9061 | 0.6294 | 0.7677 | 0.2375 | 0.1938 | 0.2157 | 0.7208 | 0.4959 | 0.6084 |
| Visible | 0.7806 | 0.6606 | 0.4375 | 0.9255 | 0.6815 | 0.3197 | 0.9199 | 0.6198 | 0.0032 | 0.3993 | 0.2012 | 0.2412 | 0.7514 | 0.4963 |
| MDA | 0.9485 | 0.7856 | 0.8165 | 0.9557 | 0.8861 | 0.7640 | 0.9538 | 0.8589 | 0.0999 | 0.4473 | 0.2736 | 0.5714 | 0.7938 | 0.6826 |
| SFINet | 0.9488 | 0.8756 | 0.9366 | 0.9830 | 0.9598 | 0.8908 | 0.9830 | 0.9369 | 0.1397 | 0.5667 | 0.3532 | 0.6771 | 0.8278 | 0.7524 |
| SpTFuse | 0.9156 | 0.8873 | 0.9281 | 0.9752 | 0.9516 | 0.8891 | 0.9665 | 0.9278 | 0.1757 | 0.4843 | 0.3300 | 0.6868 | 0.8174 | 0.7521 |
| U2Fusion | 0.9467 | 0.8324 | 0.8699 | 0.9604 | 0.9152 | 0.8499 | 0.9513 | 0.9006 | 0.1160 | 0.6232 | 0.3696 | 0.6348 | 0.8206 | 0.7277 |
| LRRNet | 0.9612 | 0.7414 | 0.7350 | 0.9440 | 0.8395 | 0.6598 | 0.9440 | 0.8019 | 0.0266 | 0.6059 | 0.3162 | 0.4748 | 0.7944 | 0.6346 |
| RES2Fusion | 0.9235 | 0.7973 | 0.9039 | 0.8864 | 0.8951 | 0.8269 | 0.8719 | 0.8494 | 0.0663 | 0.3922 | 0.2292 | 0.6170 | 0.7210 | 0.6690 |
| SeAFusion | 0.9108 | 0.9038 | 0.9372 | 0.9742 | 0.9557 | 0.8707 | 0.9604 | 0.9156 | 0.1684 | 0.5440 | 0.3562 | 0.6726 | 0.8171 | 0.7448 |
| VAFNet | 0.9639 | 0.9476 | 0.9912 | 0.9911 | 0.9911 | 0.9896 | 0.9912 | 0.9904 | 0.3041 | 0.7751 | 0.5396 | 0.8014 | 0.8838 | 0.8426 |
Table 5. Mean and standard deviation of running time on the MSRS, TNO and collection datasets (unit: second). Blue represents the best values, and green represents the runner-up values.

| Method | MSRS | TNO | Collection |
| --- | --- | --- | --- |
| MDA | 0.2527 ± 0.0065 | 0.9739 ± 0.0168 | 1.1855 ± 0.0113 |
| SFINet | 1.6688 ± 0.3591 | 1.3579 ± 0.4125 | 8.5686 ± 1.7780 |
| SpTFuse | 0.6259 ± 0.0137 | 1.9306 ± 0.0390 | 2.4026 ± 0.0434 |
| U2Fusion | 0.6514 ± 0.0193 | 2.0656 ± 0.0349 | 2.4314 ± 0.0444 |
| LRRNet | 0.0967 ± 0.3240 | 0.4529 ± 0.7805 | 0.3330 ± 0.5065 |
| RES2Fusion | 1.0121 ± 0.0154 | 8.4371 ± 0.0797 | 13.2681 ± 0.1103 |
| SeAFusion | 0.0488 ± 0.3246 | 0.2966 ± 0.7789 | 0.1344 ± 0.5090 |
| VAFNet | 0.0054 ± 0.0007 | 0.0050 ± 0.0003 | 0.0053 ± 0.0005 |
