1. Introduction
During nighttime driving, complex environmental conditions such as fog, rain, and blind spots hinder the recognition of critical scene features. These challenges primarily arise from insufficient illumination. To overcome this limitation, complementary spectral bands must be exploited to capture additional scene information. By operating in a specific band of the electromagnetic spectrum, infrared imaging detects the thermal radiation emitted by objects in the environment. It offers strong penetration capability and high sensitivity to thermal sources, so infrared images can effectively capture precise feature information. The fusion of visible and infrared images therefore significantly enhances perceptual capability under such challenging conditions. Research indicates that this fusion technology holds substantial theoretical significance and practical potential in applications such as nighttime autonomous driving [1] and military reconnaissance [2]. In particular, enhanced low-illumination and infrared image fusion techniques have significant applications in remote sensing, including nighttime urban monitoring, disaster response, and resource surveying.
In terms of application domain, image fusion primarily comprises three categories: remote sensing image fusion [3,4], visible–infrared image fusion [5,6,7,8], and multi-focus image fusion [9,10]. Methodologically, it may be broadly categorized into traditional multi-scale transform methods and deep learning-based algorithms [11,12,13,14,15]. Among these, deep learning fusion methods constitute a current research focus due to their powerful feature extraction capabilities. These approaches use deep neural networks to extract hierarchical image features and then reconstruct fused images using specific integration strategies.
Under nighttime vision conditions, visible images often fail to capture background details due to insufficient illumination. Traditional visible and infrared image fusion methods, such as the Discrete Wavelet Transform [11] and the Nonsubsampled Contourlet Transform [12], typically only superimpose thermal source information. However, they have limited capacity to represent latent features within nighttime blind spots, including low-illumination regions and areas saturated by intense lighting. As a result, the completeness of fused image information is restricted. The fundamental reason lies in the inability of traditional methods to fully exploit the differences in deep image features, making them ineffective at simultaneously mitigating the degradation caused by low illumination and the distortion induced by overexposure in visible images. This limitation, in turn, constrains the reliability of visual perception in nighttime driving scenarios. Although deep neural networks offer significant advantages in feature extraction and nonlinear modeling, providing new opportunities to overcome the performance bottlenecks of traditional fusion approaches, the inherent characteristics of visible images, such as low illumination and high dynamic range, still lead to substantial performance variation across different deep learning-based fusion algorithms [16].
Overall, the primary challenges confronting infrared–visible image fusion in nocturnal environments can be summarized as follows:
Image Information Deficiency: Imaging detectors inherently acquire incomplete information in nighttime scenes. Background and target features are constrained by grayscale dynamic range limitations and noise interference, causing feature loss that complicates fusion processes and leads to suboptimal fused image quality.
Insufficient Target Feature Saliency: When targets reside in visual blind zones within natural scenes, low-illumination conditions obscure their features in visible images. Meanwhile, infrared imagery provides only thermal signatures, substantially impairing subsequent target detection performance.
Local Contrast Saturation: In dark night environments, light illumination and reflections frequently induce localized oversaturation in detector-captured images. This phenomenon severely compromises image quality and substantially amplifies the difficulty of image enhancement.
Current mainstream visible–infrared image fusion algorithms employ diverse integration strategies to acquire enriched detailed information. However, due to extensive information-deficient regions (such as blind zones) in nocturnal visible images, fused results remain limited in achieving comprehensive scene representation. Jia et al. [
17] proposed fusing enhanced visible images with infrared counterparts to improve fusion image contrast. This approach, however, introduces two significant limitations. First, visible images are highly susceptible to oversaturation; second, the capacity for critical feature representation remains insufficient. To address these challenges, this work proposes a deep adaptive enhancement fusion algorithm for visible and infrared images, specifically optimized for night vision environments.
This study addresses critical challenges in infrared and visible image fusion for night vision systems, including texture loss, target degradation, and artifacts. First, to overcome adverse imaging conditions such as low illumination, rain, and fog, we develop a local color-adaptive enhancement model based on color mapping and adaptive enhancement principles. This model suppresses oversaturation in night vision imagery while effectively enhancing degraded content.
Next, a pretrained ResNet152 network extracts deep features at five hierarchical layers from both the enhanced visible image and the original infrared image. For each feature map, we execute max pooling and average pooling operations in parallel. Max pooling captures salient feature differences, while average pooling preserves global background information. Concatenating both pooling outputs generates comprehensive feature representations that retain gradient and edge information from both source images.
Finally, we apply Linear Discriminant Analysis to project feature data into a more discriminative space. Simultaneously, quadtree decomposition extracts salient infrared features to construct adaptive weighting factors for reconstructing the final fused image. Validation combines objective metrics with subjective visual assessments. Experiments demonstrate the proposed framework’s superior fusion performance over current mainstream algorithms on public benchmark datasets.
2. Related Work
Recent advances in deep learning, particularly within domains such as pattern recognition, have significantly advanced the field of image fusion, leading to the development of numerous improved fusion network models. However, low-illumination visible images inherently suffer from insufficient brightness and local saturation artifacts. When relying solely on neural networks, the fusion of visible and infrared images often produces suboptimal results. Accordingly, this section provides an overview of the current research status of low-illumination visible image enhancement algorithms, followed by a summary of the latest progress in visible and infrared image fusion algorithms.
2.1. Visible Image Enhancement Algorithms
Both non-uniformly illuminated images and globally low-illumination images require enhancement to restore visual quality. Specifically, non-uniform illumination is characterized by a spatially uneven light distribution, whereas low-illumination conditions manifest as globally insufficient brightness across the entire scene. The objective for enhancing both types is to restore images to a perceptually favorable state with clearly discernible details. In non-uniformly illuminated images, indiscriminate global brightness enhancement often causes severe oversaturation in relatively normal or moderately bright regions. Similarly, in low-illumination images, substantial luminance amplification used to reveal obscured details in dark areas can cause oversaturation in inherently bright regions such as license plates, headlights, or reflective surfaces. This resultant oversaturation precipitates irreversible information loss and noise amplification, thereby compromising visual quality.
For non-uniformly illuminated image enhancement, Pu et al. [
18] proposed treating images as products of luminance mapping and contrast measure transfer functions through a perceptually inspired method. This approach frequently causes over-enhancement. Li et al. [
19] applied a multi-scale top-hat transformation to estimate background illumination and correct light inhomogeneity. Their method suffers from unevenness and threshold drift. Pu et al. [
20] introduced a contrast/residual decomposition framework that separates images into contrast and residual components, with the contrast component containing the scene details. However, this technique is computationally intensive. Lin et al. [
21] presented two CNN-based systems using convolutional layers with exponential and logarithmic activation functions for unsupervised enhancement. These models suffer from high complexity and slow convergence.
For low-illumination image enhancement, the Retinex theory models this process as a color mapping transformation [
22]. Images processed by this method exhibit a visually smoother appearance while significantly enhancing the texture details and features present in the original image. Zhang et al. [
23] proposed a single convolutional layer model (SCLM). The algorithm introduces a local adaptation module that learns a set of shared parameters to accomplish local illumination correction and address the issue of varied exposure levels in different image regions. Although this method reduces oversaturation, it produces only marginal improvements in image enhancement.
Additionally, Luo et al. [
24] proposed a pseudo-supervised image enhancement method. Their approach first employs quadratic curves to generate pseudo-clear images. These pseudo-paired images are then fed into two parallel isomorphic branches, ultimately producing enhanced results through knowledge learning. However, the improvement in image details was marginal. Fan et al. [
25] introduced an end-to-end illumination image enhancement model called the multi-scale low-illumination image enhancement network with illumination constraints to achieve enhanced generalization capability and stable performance. Despite these advantages, the model was largely ineffective in mitigating overexposure. Jiang et al. [
26] developed an efficient unsupervised generative adversarial network. This algorithm demonstrates broad applicability for enhancing real-world images across diverse domains, though it exhibits a suboptimal denoising performance. Consequently, adaptive enhancement methods capable of processing bright regions and low-contrast regions in low-illumination images without inducing oversaturation remain an unmet research need.
2.2. Infrared and Visible Image Fusion Algorithms
Research on image fusion algorithms for night vision systems continues to advance, yet significant challenges persist. Gao et al. [
27] proposed a new method for infrared and visible image fusion based on a densely connected disentanglement representation generative adversarial network. However, their fused images exhibited the incomplete enhancement of salient features from visible images, the inadequate enhancement of target information from infrared images, and weak visual perception. Zhao et al. [
28] introduced a model-based convolutional neural network model to preserve both the thermal radiation information of infrared images and the texture details of visible images. However, these features remained inadequately enhanced in the fusion results. In parallel, Wang et al. [
29] proposed a cross-scale iterative attention adversarial fusion network, incorporating a cross-modal attention integration module to merge content from different modal images. However, their approach amplified noise together with details, resulting in over-enhancement and distortion. Ma et al. [
30] applied a state-of-the-art fusion algorithm to combine nighttime visible and infrared images. While the algorithm effectively enhanced target information in the fused image, the overall scene information remained less prominent. Park et al. [
15] developed a fusion algorithm based on cross-modal transformers, which captures global interactions by faithfully extracting complementary information from source images. However, the enhancement of night vision images was not significant. Finally, Tang et al. [
8] proposed a darkness-free infrared and visible image fusion method that effectively illuminates dark areas and facilitates complementary information integration. Yet, the fused image exhibited slight color distortion.
Beyond these fundamental approaches, Transformer- and GAN-based frameworks have gained prominence. Vibashan et al. [
31] introduced a Transformer-based image fusion method that models long-range dependencies between image patches through a self-attention mechanism, with its core mechanism residing in learning cross-modal global feature interactions and representations. This method, however, lacks model-specific preprocessing capabilities and fails to effectively extract salient structures and fundamental texture information from feature maps. Wang et al. [
32] developed a GAN-based fusion approach that primarily leverages the adversarial learning framework between a generator and a discriminator, aiming to generate fused results preserving salient features from source images. This methodology is susceptible to challenges during adversarial training, including mode collapse, training instability, and convergence difficulties.
When applied to downstream visual tasks, fusion methods face additional limitations. Wang et al. [
33] present a multi-scale gated fusion network to enhance change detection accuracy, utilizing EfficientNetB4 for bitemporal feature extraction. However, this approach demonstrates a limited generalization capability and unstable performance. Separately, Ma et al. [
34] propose a novel cross-domain image fusion framework based on the Swin Transformer for long-range learning. Nevertheless, their method lacks specialized optimization for nighttime infrared and visible image fusion in our target domain.
3. The Proposed Fusion Framework
As shown in
Figure 1, the proposed visible and infrared image fusion pipeline operates in two stages: enhancing nighttime visible images, then fusing them with infrared data to deliver high-quality fused results.
Figure 2 details the proposed night vision image fusion framework. The system takes visible and infrared images as inputs. For the visible image, we apply color mapping and adaptive enhancement to suppress oversaturation. A pretrained ResNet152 model then extracts hierarchical feature maps separately from both the enhanced visible image and the original infrared input. At each feature level, we execute max pooling and average pooling operations in parallel: max pooling captures salient local feature variations, while average pooling preserves global contextual information. We subsequently project the fused pooling features into a more discriminative subspace using Linear Discriminant Analysis (LDA). Concurrently, we implement quadtree decomposition to derive salient infrared features, constructing adaptive weighting factors from these features for high-quality fused image reconstruction.
This section details the proposed fusion network.
Section 3.1 presents the enhancement algorithm.
Section 3.2 introduces the infrared feature extraction method.
Section 3.3 describes the deep learning-based fusion strategy.
3.1. Image Enhancement Algorithm
In night vision systems, visible images often suffer from low illumination and noise. Preprocessing is essential for noise removal, a crucial step for enhancing image visibility. Image filtering aims to reduce noise, often at the cost of slight blurring. Night vision visible images typically exhibit fuzzy edges and dark backgrounds. The neighborhood average method [
35] leverages spatial proximity and pixel similarity to suppress edge blurring. For improved edge preservation, the bilateral filter [
36] offers a nonlinear approach that incorporates both spatial information and gray level similarity.
Bilateral filtering combines a spatial Gaussian kernel with a range kernel based on gray-level similarity. The spatial kernel ensures that only pixels within the neighborhood of the center pixel contribute, weighted by their spatial distance, while the range kernel ensures that only pixels whose gray values are similar to that of the center pixel contribute to the smoothing. The method therefore smooths images while preserving edges [37]. This edge-preserving smoothing provides a more accurate and smoother illumination estimate for subsequent adaptive brightness adjustment, thereby effectively preventing halo artifacts and oversaturation during tone mapping.
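As an illustration of this preprocessing step, the following minimal sketch applies OpenCV's bilateral filter to a grayscale night-vision frame. The file path, kernel diameter, and the two sigma values are illustrative choices, not parameters reported in this paper.

```python
import cv2

# Load a night-vision visible frame as grayscale (path is a placeholder).
img = cv2.imread("night_visible.png", cv2.IMREAD_GRAYSCALE)

# Bilateral filtering: d is the neighborhood diameter, sigmaColor weights
# gray-level similarity (range kernel), sigmaSpace weights spatial distance.
# Pixels that are spatially close AND similar in intensity dominate the
# output, so noise is smoothed while edges are preserved.
denoised = cv2.bilateralFilter(img, d=9, sigmaColor=30, sigmaSpace=7)

cv2.imwrite("night_visible_denoised.png", denoised)
```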
Figure 3 shows the image enhancement framework. This process first applies tone mapping, followed by adaptive local thresholding.
The enhancement process is described as follows. Given an input image $p$ and a guidance image $I$ (typically $I = p$), we obtain the filtered output $q$. The approach relies on a core assumption of guided filtering: within a local window $\omega_k$, the output is a linear function of the guidance image. This is illustrated as follows:

$$q_i = a_k I_i + b_k, \quad \forall i \in \omega_k \tag{1}$$

where $a_k$ represents the linear coefficient, $b_k$ denotes the bias term, $k$ represents the window index, $\omega_k$ indicates a local window, $I_i$ represents the guidance image's pixel value at location $i$, and $q_i$ indicates the filtered output.

The coefficient $a_k$ is derived as follows:

$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i\, p_i - \mu_k \bar{p}_k}{\sigma_k^2 + \varepsilon} \tag{2}$$

where $\sigma_k^2$ represents the variance of the guidance image $I$ within window $\omega_k$, $\mu_k$ and $\bar{p}_k$ denote the means of $I$ and $p$ within $\omega_k$, $|\omega|$ is the number of pixels in the window, and $\varepsilon$ denotes the smoothing regularization parameter.

The bias term $b_k$ is as follows:

$$b_k = \bar{p}_k - a_k \mu_k \tag{3}$$

where $\bar{p}_k$ represents the mean of the original image $p$ within window $\omega_k$.
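A minimal NumPy sketch of the local linear model in Equations (1)–(3) is given below. It uses a uniform box filter for the window statistics and assumes self-guidance ($I = p$) by default; the window radius and regularization value are illustrative, not the settings used in this paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(p, I=None, radius=8, eps=1e-3):
    """Edge-preserving smoothing via the local linear model q_i = a_k I_i + b_k."""
    if I is None:              # self-guidance: the input acts as its own guide
        I = p
    p = p.astype(np.float64)
    I = I.astype(np.float64)
    size = 2 * radius + 1      # box-filter window omega_k

    mean_I = uniform_filter(I, size)      # mu_k: mean of guidance in the window
    mean_p = uniform_filter(p, size)      # mean of the input in the window
    corr_Ip = uniform_filter(I * p, size)
    corr_II = uniform_filter(I * I, size)

    var_I = corr_II - mean_I * mean_I     # sigma_k^2: variance of the guidance
    cov_Ip = corr_Ip - mean_I * mean_p

    a = cov_Ip / (var_I + eps)            # Equation (2)
    b = mean_p - a * mean_I               # Equation (3)

    # Average the per-window coefficients covering each pixel, then apply the
    # linear model of Equation (1) to obtain the filtered output q.
    mean_a = uniform_filter(a, size)
    mean_b = uniform_filter(b, size)
    return mean_a * I + mean_b
```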
When processing nighttime visible images in foggy backlit scenes, the initial enhancement remains inadequate. To address localized oversaturation, we integrate color mapping and adaptive saturation suppression.
Because the visual system responds most strongly to green, followed by red and then blue, weighting the channels differently yields a more reasonable grayscale image; the weights are derived from experiment and theory.

We compute the luminance $L(x, y)$ as follows:

$$L(x, y) = w_R R(x, y) + w_G G(x, y) + w_B B(x, y), \quad w_G > w_R > w_B \tag{4}$$

where $R$, $G$, and $B$ represent the red, green, and blue components of the image, respectively, and commonly used weights are $w_R = 0.299$, $w_G = 0.587$, and $w_B = 0.114$.

The logarithmic mean brightness is defined by Equation (5) as follows:

$$\bar{L}_{\log} = \operatorname{mean}\big( \ln\left( L(x, y) + \delta \right) \big) \tag{5}$$

where $\bar{L}_{\log}$ represents the average logarithm of the input luminance and $\delta$ denotes a small constant, introduced primarily to avoid numerical overflow when taking the logarithm of pure black pixels; $\operatorname{mean}(\cdot)$ is the arithmetic mean operator and $\ln(\cdot)$ is the natural logarithm.

The fully adapted output is then obtained by globally compressing the luminance according to Equation (6), where $L(x, y)$ represents the luminance at pixel $(x, y)$ and $L_{\max}$ denotes its global maximum.

Since enhanced images exhibit elevated luminance in saturated regions, adaptive processing is essential to prevent oversaturation artifacts. Accordingly, we employ adaptive saturation suppression (Equation (7)), where $L_{\min}$ and $L_{\max}$ denote the minimum and maximum luminance values of the input image. The output of Equation (7) constitutes the final enhanced result.
The process optimally elevates global brightness before fusion with the infrared image.
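Since the exact forms of Equations (6) and (7) are not reproduced here, the sketch below uses a generic logarithmic tone curve normalized by the global log-mean and maximum, followed by a simple range renormalization, as hedged stand-ins for the adaptation and saturation-suppression steps. The function name and all constants are illustrative assumptions.

```python
import numpy as np

def enhance_luminance(L, delta=1e-6):
    """Hedged sketch of the luminance adaptation stage.

    L is a float array of per-pixel luminance (Equation (4)).  The global
    log-mean (Equation (5)) and maximum drive a generic logarithmic tone
    curve; the paper's exact Equations (6)-(7) may differ from these stand-ins.
    """
    log_mean = np.mean(np.log(L + delta))        # Equation (5)
    key = np.exp(log_mean)                       # geometric-mean luminance
    L_max = L.max()

    # Stand-in for Equation (6): compress luminance on a log scale relative to
    # the global maximum, lifting dark regions more than bright ones.
    L_adapted = np.log1p(L / key) / (np.log1p(L_max / key) + delta)

    # Stand-in for Equation (7): adaptive saturation suppression implemented as
    # a renormalization to the input's original luminance range, which keeps
    # already-bright regions from being pushed past saturation.
    L_min_in, L_max_in = L.min(), L.max()
    L_out = L_min_in + (L_max_in - L_min_in) * (
        (L_adapted - L_adapted.min()) / (L_adapted.max() - L_adapted.min() + delta)
    )
    return L_out
```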
3.2. Infrared Feature Extraction
The accurate reconstruction of the background is crucial for extracting target features from infrared images. This process requires maximized sampling of control points from background regions while excluding interfering points from target areas. Existing pixel-wise search strategies exhibit significant limitations in computational efficiency and noise sensitivity. Since the quadtree decomposition method simultaneously excludes target points and efficiently samples background control points, it significantly enhances both accuracy and speed in infrared target segmentation [
38]. Therefore, this study adopts quadtree decomposition for infrared image processing through the following steps—as shown in
Figure 4.
The infrared image serves as the root node of a quadtree structure. When the intensity range (defined as the difference between the maximum and minimum intensity values) of any node exceeds a preset splitting threshold, the node undergoes recursive subdivision into four child nodes. This decomposition terminates once a node reaches the minimum block size or can no longer be subdivided. The splitting threshold governs the splitting decisions, while the minimum block size constrains the spatial resolution. Smaller threshold values are typically employed to suppress noise perturbations, yielding smooth image blocks.
Subsequently, we uniformly sample 16 control points per terminal block. The algorithm assigns each control point’s intensity based on the local minimum within its spatial neighborhood. Following Equations (8)–(10), we reconstruct the Bézier surface for each block by interpolating the three-dimensional coordinates of the control points, where the third coordinate denotes intensity. The complete procedure comprises three stages, and a simplified code sketch follows the list.
- 1. Per-Block Background Reconstruction via Bézier Surfaces:
$$b_k(x, y) = \mathbf{B}(u)\, \mathbf{P}_k\, \mathbf{B}(v)^{\mathsf{T}} \tag{8}$$
where $(x, y)$ represents image pixel coordinates, $\mathbf{P}_k$ denotes the control point matrix of the k-th quadtree block, $u$ and $v$ indicate the normalized local coordinates, $\mathbf{B}(u)$ denotes the cubic Bézier basis function, $\mathbf{B}(v)^{\mathsf{T}}$ indicates the transposed basis function, and $b_k(x, y)$ represents the block background grayscale value.
- 2. Block combination and smoothing:
$$I_B = G_{\sigma} * \Big( \bigcup_{k} b_k(x, y) \Big) \tag{9}$$
where $k$ represents the quadtree block index, $\bigcup$ denotes the block stitching operator, $G_{\sigma}$ indicates the Gaussian smoothing operator, $\sigma$ represents the standard deviation, and $I_B$ represents the image background information.
- 3. Salient feature extraction:
$$S_{ir} = I_{ir} - I_B \tag{10}$$
where $I_{ir}$ represents the original infrared image and $S_{ir}$ denotes the infrared salient feature information.
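The quadtree splitting and background-subtraction logic can be sketched as follows. For brevity, each terminal block's background is approximated by its local minimum rather than a full 4 × 4 Bézier surface fit, and the threshold, minimum block size, and smoothing sigma are illustrative values rather than the settings used in this paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def quadtree_background(img, thresh=20, min_size=8):
    """Estimate the infrared background by recursive quadtree decomposition.

    Blocks whose intensity range exceeds `thresh` are split into four children
    until `min_size` is reached.  Each terminal block is filled with its local
    minimum as a coarse stand-in for the Bezier surface of Equation (8); the
    blocks are then stitched and Gaussian-smoothed (Equation (9)).
    """
    bg = np.zeros_like(img, dtype=np.float64)

    def split(r0, r1, c0, c1):
        block = img[r0:r1, c0:c1]
        rng = float(block.max()) - float(block.min())
        if rng > thresh and (r1 - r0) > min_size and (c1 - c0) > min_size:
            rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
            split(r0, rm, c0, cm); split(r0, rm, cm, c1)
            split(rm, r1, c0, cm); split(rm, r1, cm, c1)
        else:
            bg[r0:r1, c0:c1] = block.min()   # background control value

    split(0, img.shape[0], 0, img.shape[1])
    return gaussian_filter(bg, sigma=3.0)    # block stitching + smoothing

# Salient infrared features (Equation (10)): original minus reconstructed background.
# ir is a float infrared image: S_ir = np.clip(ir - quadtree_background(ir), 0, None)
```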
3.3. Deep Learning Fusion Algorithm
At CVPR 2016, He et al. [
39] proposed a network architecture utilizing shortcut connections and residual representations to address the degradation problem. Compared with prior networks, this architecture is easier to optimize and achieves a higher accuracy with increased depth.
Figure 5 illustrates a residual block.
In Figure 5, $x$ denotes the input to the network block, $F(x)$ represents a network operation, and relu indicates the rectified linear unit activation function. The output of the residual block equals $F(x) + x$. This structure enables the construction of multi-level feature representations. Consequently, image reconstruction tasks using residual blocks demonstrate enhanced performance [40]. We incorporate this architecture into our fusion algorithm.
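For reference, a basic residual block of the kind shown in Figure 5 can be written in PyTorch as follows. This is a generic sketch of the $F(x) + x$ structure, not the exact bottleneck configuration used inside ResNet152.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes relu(F(x) + x), the shortcut structure of Figure 5."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(                     # F(x): two conv layers
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)             # shortcut connection
```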
First, ResNet extracts hierarchical features from infrared and visible images through its convolutional architecture. The network generates five multi-scale feature maps at distinct processing stages (conv1x, res2cx, res3dx, res4fx, res5cx), with each layer capturing modality-specific characteristics. Second, feature maps undergo max pooling and average pooling operations. While max pooling preserves texture details and average pooling maintains background structures, an over-reliance on averaging can lead to global blurring. We therefore fuse both operations via weighted averaging to retain discriminative details and structural contours. Finally, to eliminate high-dimensional redundancies in the multi-layer features, we apply Linear Discriminant Analysis (LDA) for joint decorrelation and dimensionality reduction.
ResNet generates five hierarchical feature maps from input images. Each feature map encodes distinct characteristics of infrared and enhanced visible modalities.
Figure 2 illustrates this feature extraction process, where weight 1 and weight 2 each operate across the five feature layers.
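The five-stage feature extraction and the parallel max/average pooling can be sketched with torchvision's pretrained ResNet152 as shown below, where the stage boundaries correspond to conv1, layer1, ..., layer4. The pooling kernel size, the dummy inputs, and the channel-wise concatenation of the two pooling outputs are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet152, ResNet152_Weights

backbone = resnet152(weights=ResNet152_Weights.IMAGENET1K_V1).eval()

def five_stage_features(x):
    """Return the five hierarchical feature maps (conv1x, res2cx, ..., res5cx)."""
    feats = []
    x = backbone.relu(backbone.bn1(backbone.conv1(x))); feats.append(x)  # stage 1
    x = backbone.maxpool(x)
    x = backbone.layer1(x); feats.append(x)                              # stage 2
    x = backbone.layer2(x); feats.append(x)                              # stage 3
    x = backbone.layer3(x); feats.append(x)                              # stage 4
    x = backbone.layer4(x); feats.append(x)                              # stage 5
    return feats

def dual_pool(feat, k=2):
    """Parallel max and average pooling, concatenated along the channel axis."""
    return torch.cat([F.max_pool2d(feat, k), F.avg_pool2d(feat, k)], dim=1)

with torch.no_grad():
    vis = torch.rand(1, 3, 224, 224)   # enhanced visible image (dummy input)
    ir = torch.rand(1, 3, 224, 224)    # infrared image replicated to three channels
    vis_feats = [dual_pool(f) for f in five_stage_features(vis)]
    ir_feats = [dual_pool(f) for f in five_stage_features(ir)]
```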
3.3.1. Dimension Reduction and Decorrelation of Linear Discriminant Analysis (LDA)
Initially formulated for binary classification, Linear Discriminant Analysis (LDA) leverages discriminant functions to maximize class separability. Its computational process inherently reduces data dimensionality, enabling supervised linear dimensionality reduction. The algorithm maximizes inter-class separation while minimizing intra-class variance across the classes, producing compact class embeddings with maximal between-class distance in the projection subspace.
The process of LDA comprises the following five steps (a code sketch follows the list):
- 1. Data standardization: normalize each feature to zero mean and unit variance.
$$\tilde{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \quad j = 1, \ldots, d$$
where $\mu_j$ and $\sigma_j$ represent the mean and standard deviation of the j-th feature across all samples, $d$ is the original feature dimension, and $\mathbf{X} = [x_{ij}]$ is the high-dimensional data.
- 2. Compute the class and global mean vectors.
$$\boldsymbol{\mu}_c = \frac{1}{N_c} \sum_{\mathbf{x}_i \in D_c} \mathbf{x}_i, \qquad \boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$$
where $N$ represents the total number of samples, $N_c$ denotes the number of samples in class $c$, $D_c$ indicates the set of samples in class $c$, $\boldsymbol{\mu}_c$ is the mean vector of class $c$, and $\boldsymbol{\mu}$ is the global mean vector.
- 3. Compute the between-class scatter matrix $\mathbf{S}_B$ and within-class scatter matrix $\mathbf{S}_W$.
$$\mathbf{S}_W = \sum_{c=1}^{C} \sum_{\mathbf{x}_i \in D_c} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{\mathsf{T}}, \qquad \mathbf{S}_B = \sum_{c=1}^{C} N_c\, (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{\mathsf{T}}$$
where $C$ represents the number of classes, $\mathbf{S}_W$ denotes the within-class scatter matrix, and $\mathbf{S}_B$ indicates the between-class scatter matrix.
- 4. Calculate the eigenvectors corresponding to the leading eigenvalues of the matrix $\mathbf{S}_W^{-1}\mathbf{S}_B$, which maximize the discriminant criterion
$$J(\mathbf{W}) = \frac{\left| \mathbf{W}^{\mathsf{T}} \mathbf{S}_B \mathbf{W} \right|}{\left| \mathbf{W}^{\mathsf{T}} \mathbf{S}_W \mathbf{W} \right|}$$
where $\mathbf{W}$ represents the projection matrix and $J(\mathbf{W})$ denotes the discriminant ability index.
- 5. Construct the projection matrix $\mathbf{W}$ from these eigenvectors and transform the data.
$$\mathbf{Y} = \mathbf{X} \mathbf{W}$$
where $\mathbf{W}$ represents the projection matrix and $\mathbf{Y}$ denotes the projected data matrix.
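A compact NumPy sketch of the five LDA steps (standardization, mean vectors, scatter matrices, eigendecomposition, and projection) is given below. It assumes class labels for the pooled feature vectors are available; the function name and the pseudo-inverse used for numerical stability are implementation choices, not details specified in the paper.

```python
import numpy as np

def lda_project(X, y, n_components):
    """Project features X (N x d) with labels y onto the leading eigenvectors of Sw^-1 Sb."""
    # Step 1: standardize each feature to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

    classes = np.unique(y)
    mu = X.mean(axis=0)                       # global mean vector
    d = X.shape[1]
    Sw = np.zeros((d, d))                     # within-class scatter
    Sb = np.zeros((d, d))                     # between-class scatter

    # Steps 2-3: class means and scatter matrices.
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - mu).reshape(-1, 1)
        Sb += Xc.shape[0] * (diff @ diff.T)

    # Step 4: eigendecomposition of Sw^-1 Sb (pseudo-inverse for stability).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]

    # Step 5: projection matrix W and transformed data Y = X W.
    W = eigvecs[:, order[:n_components]].real
    return X @ W
```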
The image feature pooling and LDA conversion are shown in
Figure 6.
3.3.2. Image Reconstruction
To enrich the infrared thermal target information in the fused image, we incorporate significant features into infrared regions when selecting weight factors, thereby enhancing the night vision image.
Figure 7 displays the reconstructed fusion result.
After dimensionality reduction extracts the feature values from each image layer, we fuse the enhanced visible and infrared images using a weighted sum, with the weights defined by Equations (20) and (21), where $F_{ir}$ represents the infrared image feature, $F_{vi}$ denotes the visible image feature, $S_{ir}$ corresponds to the infrared image's salient feature, $\omega_{ir}$ designates the infrared feature's weight in the fused image, and $\omega_{vi}$ indicates the visible feature's weight in the fused image.
The weight factors and the source images are then fused and reconstructed as follows:
$$F = \omega_{ir}\, I_{ir} + \omega_{vi}\, I_{vi}^{en}$$
where $I_{ir}$ represents the infrared image, $I_{vi}^{en}$ denotes the enhanced visible image, and $F$ indicates the fused image.
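As a hedged sketch of the reconstruction step, the fragment below normalizes the infrared feature activity, augmented by the quadtree salient map, against the visible feature activity to obtain pixel-wise weights, then forms the weighted sum of the two source images. The weight definitions are illustrative stand-ins; the exact forms of Equations (20) and (21) may differ.

```python
import numpy as np

def reconstruct_fused(ir, vis_enhanced, feat_ir, feat_vis, salient_ir, eps=1e-6):
    """Weighted reconstruction F = w_ir * IR + w_vis * VIS_enhanced.

    feat_ir / feat_vis are per-pixel feature-activity maps (e.g., upsampled,
    LDA-projected responses) and salient_ir is the quadtree salient map.
    The weight formulas below are stand-ins for Equations (20) and (21).
    """
    a_ir = np.abs(feat_ir) + np.abs(salient_ir)   # infrared activity plus saliency
    a_vis = np.abs(feat_vis)                      # visible activity
    w_ir = a_ir / (a_ir + a_vis + eps)            # stand-in for Equation (20)
    w_vis = 1.0 - w_ir                            # stand-in for Equation (21)
    return w_ir * ir + w_vis * vis_enhanced
```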
The process carried out in this study is shown in Algorithm 1.
Algorithm 1. AEFusion: Adaptive Enhanced Fusion of Visible and Infrared Images for Night Vision.
4. Experiments and Analysis
In this section, we first describe the infrared and visible image datasets. Next, we detail the experimental setup. Finally, we evaluate our fusion method against existing techniques both subjectively and objectively, supplemented by ablation studies that quantify the contribution of each component of the proposed model.
4.1. Infrared and Visible Image Datasets
To address the problem of fusing infrared and visible images in night vision systems, we have established an infrared and visible dataset, which includes TNO [
41], VOT2020 [
42], and a multi-spectral dataset [
43]—as illustrated in
Table 1 and
Figure 8.
The TNO dataset features nighttime scenes containing objects such as houses, vehicles, and people. The VOT2020 dataset provides video sequences depicting vehicles, buildings, and trees. The multi-spectral dataset comprises focused imagery, including night vision scenes with buildings, vehicles, and human subjects. All datasets encompass diverse environmental conditions (such as indoor, outdoor, nighttime, partial exposure), enabling a rigorous investigation of image fusion algorithm generalization.
The rationale for selecting the TNO, VOT 2020, and multi-spectral datasets in this study rests upon three pivotal considerations:
Field Recognition: All three constitute benchmark datasets widely acknowledged in the image fusion domain.
Comprehensive Night Scenario Coverage: They encompass diverse indoor/outdoor night vision environments, addressing extreme illumination validation needs.
Task-Specific Utility:
TNO: Provides infrared–visible image pairs under extreme nocturnal conditions (smoke occlusion/low illumination), enabling the quantitative assessment of fusion algorithms in mutual information, target contour integrity, and low-illumination texture recovery.
VOT 2020: Constructs dynamic nighttime tracking scenarios, featuring glare interference, motion blur, and scale variations, specifically designed to test fusion algorithms’ low-illumination robustness and real-time performance.
Multi-spectral: Integrates spectral, spatial, and temporal dimensions through multi-band data (such as 390–950 nm) and dynamic disturbances (such as rapid motion, occlusion, and cluttered backgrounds), supporting generalization validation for cross-modal downstream tasks.
4.2. Experimental Settings
This study utilized source images from the TNO, VOT2020, and multi-spectral datasets. We propose a two-stage algorithmic architecture: stage 1 performs night vision visible image enhancement, and stage 2 fuses the enhanced visible images with infrared images. To validate the efficacy of this two-stage architecture, we designed comparative experiments for quantitative evaluation.
Figure 9 illustrates enhanced visible images from night vision scenarios.
Figure 10 presents the fusion results of these enhanced visible images with corresponding infrared images.
4.2.1. Experimental Comparison of Night Vision Visible Image Enhancement
We selected five state-of-the-art image enhancement algorithms for comparison with our proposed method: ZRDCE [44], RELLIE [45], DPDLLE [46], AGLLE [47], and KD [48].
We implemented ZRDCE and RELLIE in PyTorch 2.7.1 with Python 3.7.0 and ran DPDLLE, AGLLE, and KD on TensorFlow 2.0 using an NVIDIA GeForce GTX 1080 Ti GPU (16 GB RAM). We developed our algorithm on a Windows 10 workstation equipped with an Intel Core i7-10750H CPU (2.60 GHz) and 16 GB of RAM. We conducted all experiments in MATLAB R2022a.
We selected six representative images for comparative analysis, as shown in
Figure 9.
4.2.2. Image Fusion Experiment Comparison
For the comparison of image fusion algorithms, we selected fourteen classic methods, including DWT [11], NSCT [12], LatLRR [49], DIV [8], PIA [5], DF [50], Transformer [31], MGFF [51], RFN-Nest [52], ADDNIP [53], BF [54], SwinFusion [34], FreqGAN [32], and EMMA [55].
We implemented the following fusion algorithms for comparison: DWT, NSCT, LatLRR, MGFF, and BF in MATLAB 2022a; DIV, PIA, and DF in TensorFlow 2.0 on an NVIDIA TITAN RTX GPU; and Transformer, RFN-Nest, ADDNIP, SwinFusion, FreqGAN, and EMMA in PyTorch 2.7.1. Our algorithm was deployed on a workstation (Intel® Core™ i7-10750H @ 2.60 GHz, 16 GB RAM, Windows 10) using MATLAB 2022a. Feature extraction utilized ResNet152 weights that were pretrained on ImageNet. Training executed 110 epochs with a batch size of 64 and RMSprop optimizer (lr = 0.005). To ensure a fair comparison, we carefully tuned each method to its best performance using the configurations reported in its original publication.
The TNO dataset comprises 63 pairs of night vision images, allocated as follows: 43 pairs to the training set, 10 pairs to the validation set, and 10 pairs to the testing set. From the 14 night vision video sequences in the VOT2020 dataset, we extracted 167 image pairs, allocating 107 pairs to the training set, 30 pairs to the validation set, and 30 pairs to the testing set. The multi-spectral dataset consists of 137 pairs of night vision images, distributed as follows: 97 pairs to the training set, 20 pairs to the validation set, and 20 pairs to the testing set.
In this experiment, we implement the proposed algorithm to fuse enhanced visible and infrared images across all datasets. We conduct comprehensive evaluations against fourteen benchmark fusion algorithms.
Figure 10 visually compares six representative fusion results.
4.3. Subjective Evaluation
The primary purpose of subjective visual evaluation is to judge how clearly scene information is conveyed in the fused image. Existing subjective assessment methods primarily derive quality judgments from normalized observer ratings; their drawback is limited mathematical tractability, whereas their strength is faithfully reflecting perceived visual quality. In this study, we conducted fourteen comparative experiments, covering both classic traditional fusion methods and the latest fusion algorithms. We utilize the Mean Opinion Score (MOS) for the quantitative subjective evaluation of fusion results [56], as shown in Table 2.
This study utilizes three benchmark datasets (TNO, VOT2020, and multi-spectral), selecting five representative night vision infrared and visible image pairs per dataset for a total of 15 pairs. Each set comprises the original infrared image, the original visible image, fusion results from the 14 classical and deep learning-based algorithms, and the output generated by the proposed method. Twenty observers (aged 25–40, with normal or corrected-to-normal vision; 10 image processing experts and 10 non-experts) were recruited for subjective evaluation. Each observer independently rated the images under a controlled protocol featuring 20 s image sequence presentations followed by 5 s blank intervals to eliminate persistence-of-vision effects. For data processing, the extreme scores (highest and lowest) for each fused image were discarded, retaining the middle 80% of valid ratings, and global Mean Opinion Scores (MOSs) were then computed for all 15 fusion algorithms based on the evaluations across the 15 image sets.
Figure 9 presents the enhancement results of six night vision visible image sets processed by six algorithms, while
Figure 10 displays the fusion results of six infrared and visible image sets generated by fifteen algorithms.
Table 3 summarizes the Mean Opinion Scores (MOSs) evaluated by 20 observers across all 15 image sets for each algorithm. A comparative analysis reveals that the proposed algorithm achieves an optimal visual performance, demonstrating superior capabilities in target saliency enhancement, detail preservation, and brightness–contrast improvement. The DIV, Transformer, ADDNIP, RFN-Nest, BF, SwinFusion, and EMMA methods exhibit a relatively strong performance, each excelling in specific dimensions such as target saliency, detail retention, and visual naturalness. Conversely, the FreqGAN method demonstrates a suboptimal performance across these three metrics due to its inherent instability.
4.4. Objective Evaluation
4.4.1. Image Enhancement Evaluation
For the evaluation of night vision visible image enhancement results, no full-reference image is available for comparison because enhancement alters the color and structure of the image. Therefore, we evaluated the enhanced images using two no-reference metrics: the natural image quality evaluator (NIQE) and the lightness order error (LOE) [57,58]. It is noteworthy that these indicators reflect only certain aspects of image quality and may not be entirely consistent with human visual perception.
Table 4 presents a comparison of the average enhancement metric scores for each algorithm. The results demonstrate that the proposed algorithm achieved lower LOE and NIQE values compared with the other five methods. This indicates that the local adaptive method applied after tone mapping is highly effective for enhancing low-illumination images.
4.4.2. Image Fusion Evaluation
For infrared and visible images, the following problems occur during the fusion process:
The long-distance contour of the infrared image collected is unclear.
The visible image in the night vision system contains many blind areas. Its background information is unclear, resulting in difficulty in recognizing the target.
The fused image fails to highlight target information while also lacking comprehensive background details.
To address these challenges, we adopt six evaluation indicators (a computation sketch for two of them follows the list):
Information entropy (EN) quantifies information richness in inherently unclear infrared images [59].
Peak signal-to-noise ratio (PSNR) evaluates fusion quality to enhance low-contrast visible images in night vision systems [60].
Structural similarity (SSIM) assesses distortion probability in fusion-prone outputs [61].
Edge preservation index ($Q^{AB/F}$) measures edge integrity retention to counter frequent blurring in fused results [62].
Visual Information Fidelity (VIF) quantifies the fidelity level of multi-source visual information preservation in fused images, thereby reflecting robustness against detail loss [63].
Average Gradient (AG) assesses spatial clarity in fused outputs to counteract edge blurring artifacts [64].
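Two of these metrics can be computed directly from the fused image, as illustrated below for information entropy and average gradient under the assumption of an 8-bit grayscale input; the remaining metrics additionally require the source images and reference implementations.

```python
import numpy as np

def information_entropy(img):
    """EN: Shannon entropy of the 8-bit grayscale histogram."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def average_gradient(img):
    """AG: mean magnitude of the horizontal and vertical intensity gradients."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]   # horizontal differences
    gy = np.diff(img, axis=0)[:, :-1]   # vertical differences
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))
```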
In total, we compare fifteen fusion algorithms: the fourteen baseline methods and the proposed approach.
Figure 11 shows the comparison of six indicators for the fusion results of each algorithm.
Figure 11 presents comparative performance metrics of the fifteen fusion algorithms across six benchmark image sets. The proposed algorithm demonstrates consistent superiority on all three datasets. A quantitative evaluation using six key metrics (EN, SSIM, PSNR, $Q^{AB/F}$, VIF, and AG) confirms that our method outperforms the fourteen baseline algorithms across all performance dimensions. The subsequent analysis examines the distinctive characteristics of each fusion approach.
The DWT [11] approach employs multi-scale decomposition to extract multi-resolution image features, and its fusion results demonstrate strong performance on the EN and PSNR metrics. The NSCT [12] method utilizes its non-subsampled strategy to capture image edge and texture features effectively, and its fusion results exhibit significant advantages across multiple metrics, including EN, PSNR, VIF, and AG. The LatLRR [49] method extracts detailed image features, and its fused images demonstrate favorable values for the PSNR, SSIM, EN, and VIF metrics. The DIV [8] method decomposes images into contour and detail components, utilizing neural networks to extract multi-layer feature maps from the detail part; DIV's fused images show improved values for EN, PSNR, $Q^{AB/F}$, and AG. The PIA [5] method establishes an adversarial network between a discriminator and a generator, aiming to retain edge and gradient features in fused images; PIA's fused images indicate better values for SSIM, VIF, AG, and $Q^{AB/F}$.
The DF [50] method constructs a coding network to extract richer texture information from source images, with its fused images demonstrating superior EN, PSNR, SSIM, and AG metrics. The Patch Pyramid Transformer [31] extracts non-local features for accurate input reconstruction, yielding fused images with enhanced EN, SSIM, and $Q^{AB/F}$ values. The MGFF [51] approach employs guided image filtering to isolate significant regions in multi-view scenes, producing fused results exhibiting higher PSNR, SSIM, and AG. Further advancing the field, RFN-Nest [52] preserves feature enhancement via a two-stage training strategy, and its fused images achieve improved SSIM, VIF, AG, and $Q^{AB/F}$. The ADDNIP [53] framework decomposes images into tripartite frequency components to capture comprehensive features, leading to fused outputs with elevated EN, PSNR, SSIM, VIF, AG, and $Q^{AB/F}$. The BF [54] method utilizes hierarchical Bayesian inference to align fusion results with human visual perception, culminating in fused images displaying superior EN, SSIM, VIF, and AG.
The SwinFusion [34] method employs an attention-guided cross-domain interaction module to achieve domain-specific feature extraction and complementary cross-domain integration, yielding fused images with an exceptional performance for EN, PSNR, and $Q^{AB/F}$. FreqGAN [32] constructs a frequency-compensating generator to enhance contour and detail features, but its fused images suffer from suboptimal results across multiple metrics due to training instability. The EMMA [55] framework adopts an end-to-end self-supervised learning paradigm for direct fused image generation, demonstrating significant advantages in EN, PSNR, $Q^{AB/F}$, and VIF.
Compared with existing methods, the fourteen algorithms demonstrate only partial performance advantages on specific evaluation metrics. However, our proposed algorithm surpasses all comparative methods across all assessment metrics. Overall, the approach successfully extracts detailed features from source images, prominently highlights salient target regions, fully preserves fine structural details, and achieves natural contrast enhancement.
4.5. Ablation Study
To comprehensively and objectively evaluate the advantages of the proposed algorithm, we designed two sets of controlled experiments. Under the condition of maintaining the base model architecture, we sequentially compared the baseline ResNet152 model, the baseline ResNet152 model integrated with dual pooling and LDA modules, and the model proposed in this study to assess the evolution of performance. Concurrently, testing was conducted on three public datasets using identical objective evaluation metrics as referenced in the original literature.
Table 5 delineates the quantitative results of ablation models versus the complete model across all metrics, thereby precisely quantifying the contribution of each module to the performance enhancement of the final integrated model.
As shown in
Table 5, the ablation results demonstrate that when the adaptive visible image enhancement module is excluded, the ResNet152 baseline and its dual pooling+LDA variant achieve a key metric performance comparable to the fourteen baseline algorithms across three datasets. Critically, the ResNet152 model integrated with dual pooling and LDA significantly outperforms the standalone ResNet152 model, validating the efficacy of the dual pooling strategy (preserving salient features and gradient information) and LDA (optimizing feature discriminability). Furthermore, ResNet152's deep feature extraction captures richer information than shallow or handcrafted features, while the adaptive enhancement module substantially improves visual quality in low-illumination and oversaturated regions. The quantitative analysis confirms that the key components, namely adaptive visible image enhancement, ResNet152 feature extraction, the dual pooling strategy, and LDA optimization, collectively make indispensable contributions to the overall fusion improvement.
5. Discussion
Our proposed locally adaptive enhancement algorithm specifically targets low-illumination scenarios and localized luminance saturation. Under severely adverse weather conditions characterized by critically limited visibility (such as dense fog, heavy rain, or snow), visible imagery undergoes significant quality degradation, exhibiting substantial scattering noise and extremely low-contrast information. In such operational environments, our enhancement method may amplify inherent noise or fog-induced artifacts in the visible inputs. This could potentially introduce unintended pseudo-features or over-enhancement in the fused outputs, manifesting as excessive brightness in foggy regions or textural degradation. While infrared imaging can penetrate atmospheric obscurants to reveal thermal signatures, its scene details become progressively obscured in dense fog. Therefore, future research will focus on integrating weather-aware modules to develop jointly optimized fusion strategies for adverse weather conditions.
The proposed fusion pipeline, which involves deep neural network (ResNet152) multi-layer feature extraction, pooling operations, LDA-based dimensionality reduction, and subsequent reconstruction steps, exhibits a relatively high computational complexity. Processing a single standard-resolution image (640 × 512 pixels) on the current experimental platform takes approximately 50 ms per frame. This performance is inadequate for meeting the stringent requirements of real-time applications. Consequently, future optimization efforts will prioritize model compression, model lightweighting, and inference engine optimization strategies. The specific plans are as follows:
Model Compression: We will employ structured pruning to remove redundant channels or filters in the feature extraction backbone (ResNet152). This approach will significantly reduce the model’s parameter count and computational load while maintaining its representational capacity.
Lightweight Backbone Replacement: To achieve an optimal fusion performance, this study employed the parameter-heavy ResNet152 as its feature extractor. Subsequently, we will replace the backbone network with lightweight architectures specifically designed for mobile and embedded devices, such as GhostNet, EfficientNet-Lite, or SqueezeNet.
Inference Engine Optimization: During actual deployment, we will employ an efficient inference engine to accelerate the optimized model. These tools deliver operator fusion, inter-layer computation optimization, and hardware-specific optimizations, further enhancing inference speed.
These experiments primarily relied on existing benchmark datasets, whose limited scale may restrict a comprehensive evaluation of the model’s performance under extreme or highly diverse nighttime conditions. Although the model achieves strong results on available data, the further validation of its generalization ability using larger and more real-world datasets remains necessary. To thoroughly assess the model’s robustness and adaptability in practical applications, future work will include extensive testing on newly emerging large-scale urban nighttime driving datasets (such as KAIST) that contain richer scenarios.
6. Conclusions
In this study, we propose an adaptive enhanced fusion algorithm for visible and infrared images in night vision scenarios. Our central contribution is a coordinated processing framework that integrates adaptive enhancement with deep feature extraction. This framework effectively addresses the shortcomings of existing methods under extreme lighting conditions. First, we designed a local adaptive enhancement algorithm that addresses the simultaneous presence of low-illumination and overexposed areas in visible images. This algorithm employs a dual branch correction mechanism to effectively suppress halo artifacts while enhancing textural details, thereby providing better-balanced source images for subsequent fusion. When compared with existing end-to-end fusion methods such as those based on GANs or Transformers, this preprocessing step substantially improves performance in complex nighttime environments. Second, we fully utilize the hierarchical representation capability of ResNet152 to extract multi-scale features from both infrared and visible images. Our approach combines max and average pooling to preserve prominent thermal targets for infrared images while enhancing structural details from visible ones, achieving more complementary feature representations. Furthermore, we introduce a feature significance-based adaptive weighting mechanism for fusion. This method uses infrared characteristics to guide weight generation, ensuring that the fused image emphasizes thermal targets while preserving visible light details to the greatest extent possible. The experimental results demonstrate that the proposed method outperforms current state-of-the-art techniques in both subjective visual quality and objective evaluation metrics. It significantly improves target recognition rates and scene perception in nighttime environments, effectively reducing risks associated with blind zones.