Article

A Cascaded Enhancement-Fusion Network for Visible-Infrared Imaging in Darkness

1
College of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410073, China
2
Nanhu Laser Laboratory, National University of Defense Technology, Changsha 410073, China
3
Hunan Provincial Key Laboratory of High Energy Laser Technology, National University of Defense Technology, Changsha 410073, China
*
Authors to whom correspondence should be addressed.
Photonics 2025, 12(12), 1231; https://doi.org/10.3390/photonics12121231
Submission received: 4 November 2025 / Revised: 27 November 2025 / Accepted: 2 December 2025 / Published: 15 December 2025
(This article belongs to the Special Issue Technologies and Applications of Optical Imaging)

Abstract

This paper presents a cascaded imaging method that combines low-light enhancement and visible–long-wavelength infrared (VIS-LWIR) image fusion to mitigate image degradation in dark environments. The framework incorporates a Low-Light Enhancer Network (LLENet) for improving visible image illumination and a heterogeneous information fusion subnetwork (IXNet) for integrating features from enhanced VIS and LWIR images. Using a joint training strategy with a customized loss function, the approach effectively preserves salient targets and texture details. Experimental results on the LLVIP, M3FD, TNO, and MSRS datasets demonstrate that the method produces high-quality fused images with superior performance on quantitative metrics. It also exhibits excellent generalization ability, maintains a compact model size with low computational complexity, and significantly enhances performance in high-level visual tasks such as object detection, particularly in challenging low-light scenarios.

1. Introduction

Visible (VIS) and long-wavelength infrared (LWIR) devices are commonly used for perception tasks in applications such as autonomous driving, border law enforcement, security monitoring, and the detection and recognition of low-speed, small targets. VIS images have rich textures well suited to human visual perception, and LWIR images can capture target contours in low-light or extreme environments by sensing thermal radiation. Therefore, existing devices often use image fusion, which extracts useful information from VIS and LWIR images and merges it into a single fused image. This technology not only reduces redundancy between multimodal data but also generates high-quality images with high contrast and rich texture details. Furthermore, the complementarity between visible and infrared images can enhance visual perception and subsequent high-level visual tasks [1], such as object detection, pedestrian tracking, semantic segmentation, and surveillance. Researchers have proposed many VIS-LWIR image fusion methods, which can be broadly divided into two categories: conventional methods and learning-based methods. Conventional methods usually include three steps: (1) in the feature extraction stage, specific transformations are used to extract features from the source images; (2) in the feature fusion stage, fusion strategies are applied to merge the extracted features; (3) in the feature reconstruction stage, the fused image is reconstructed from the merged features using the corresponding inverse transform. According to their underlying mathematical principles, these fusion methods can be further divided into five categories: hybrid-based methods [2], sparse representation methods [3], saliency-based methods [4], subspace-based methods [5], and multi-scale transformation-based methods [6].
Although these conventional methods can generate satisfactory fused images in some cases, they have significant limitations: (1) they require the manual design of complex fusion rules; (2) the feature extraction applied to the source images also relies on manual design and usually ignores the characteristic differences between modalities. These drawbacks make conventional methods inflexible and difficult to apply in complex or adversarial scenarios.
In recent years, the rise of deep learning has provided new possibilities for VIS-LWIR image fusion and can, to some extent, overcome the limitations of conventional methods, thereby achieving better fusion performance [7]. Deep learning methods can be further divided into four categories according to their network architecture: CNN-based methods [8,9], AE-based methods [10,11], GAN-based methods [12], and Transformer-based methods [13,14,15]. CNN-based methods achieve feature extraction, feature fusion, and feature reconstruction through carefully designed network structures and loss functions. In contrast, AE-based methods use autoencoders for feature extraction and reconstruction, while the feature fusion stage is accomplished through specific fusion strategies. GAN-based methods introduce a generative adversarial mechanism on top of the CNN approach: the network consists of a generator and a discriminator, where the discriminator constrains the probability distribution of the generated fused image to be as close as possible to that of the source images, eliminating the need for supervision during learning. Transformer-based methods can theoretically achieve higher-quality fusion than convolution-based methods thanks to their global perception and stronger feature representation capability, but the relatively high computational complexity of self-attention limits the large-scale deployment of such methods in practical applications. Moreover, although existing fusion algorithms can generate visually pleasing images, they do not necessarily achieve optimal performance in subsequent object detection tasks.
To achieve better quality and visual perception in fused images, this paper proposes a cascaded imaging method combining low-light enhancement and VIS-LWIR fusion. To address the image degradation caused by weak texture details in VIS images under low-light conditions, a low-light enhancement module is proposed. Through a joint training strategy, prominent targets in infrared images and texture details in enhanced visible images are incorporated into the fusion process. Training and testing results on public datasets show that this method effectively improves imaging quality and generalization, and also excels in visual tasks such as object detection.

2. Imaging Methods

2.1. Framework of the Proposed Method

A Low-Light Enhancer Network (LLENet) based on Retinex theory is designed to enhance visible images in low-light environments. Then, a heterogeneous information fusion subnetwork (IXNet) is designed to integrate complementary information among VIS and LWIR images. Finally, the intrinsic relationship between low-light image enhancement and image fusion is fully considered, and the loss function is defined as a joint loss composed of illumination enhancement loss and fusion loss to optimize the network learning, thereby achieving effective coupling and reciprocity between the two subnetworks. The framework of the proposed imaging method is illustrated in Figure 1.
Given a pair of strictly registered LWIR images $I_{ir}$ and VIS images $I_{vi}$, the two branches are input into the network in parallel. For the VIS channel, the visible image $I_{vi}$ is fed into LLENet to generate an enhanced image $I_{en}$ with improved illumination in all three RGB channels. The enhanced image is then converted from RGB to YCrCb to obtain the three components $[I_{en}^{Y}, I_{en}^{Cr}, I_{en}^{Cb}]$. The Y-channel component of the enhanced image, $I_{en}^{Y}$, and the LWIR image, $I_{ir}$, are input into the IXNet module in single-channel form. Under the guidance of a specific loss function, image fusion is achieved through feature extraction, fusion, and reconstruction. More specifically, the feature extraction (EF) module extracts prominent target features from the infrared image $I_{ir}$ and texture features from the enhanced image $I_{en}^{Y}$. This process can be expressed as follows:
$(F_{vi}, F_{ir}) = \left( EF(I_{en}^{Y}),\; EF(I_{ir}) \right)$
Then, the fused image features are reconstructed by the image reconstruction module after fusion. The fusion process is represented as follows:
$F_f = \mathrm{concat}(F_{vi}, F_{ir})$
The Y-channel fused image $I_f^{Y}$ is restored from the fused feature $F_f$ by the image reconstruction module $R$, which is represented as
$I_f^{Y} = R(F_f)$
The fused image $I_f$ is finally obtained by converting back to the RGB color space:
$I_f = H\left( \mathrm{concat}(I_f^{Y},\; I_{en}^{Cr},\; I_{en}^{Cb}) \right)$
where $H(\cdot)$ represents the transfer matrix converting an image from the YCrCb color space to the RGB color space, and $I_{en}^{Cr}$ and $I_{en}^{Cb}$ represent the Cr and Cb channels of the enhanced VIS image, respectively.
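As a minimal sketch of the cascade described above, the following NumPy fragment illustrates how the color-space split lets the network fuse only the luminance channel while the enhanced chroma passes through unchanged. The callables `llenet`, `ef`, and `reconstruct` are hypothetical placeholders for LLENet, the EF module, and $R$; the JFIF YCbCr coefficients are one standard choice for the transfer matrix, not necessarily the authors' exact one.

```python
import numpy as np

def rgb_to_ycrcb(img):
    """RGB (H, W, 3) in [0, 1] -> (Y, Cr, Cb), JFIF coefficients (assumed)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    return y, cr, cb

def ycrcb_to_rgb(y, cr, cb):
    """Inverse transfer H(.): YCrCb -> RGB."""
    r = y + 1.402 * (cr - 0.5)
    g = y - 0.714136 * (cr - 0.5) - 0.344136 * (cb - 0.5)
    b = y + 1.772 * (cb - 0.5)
    return np.stack([r, g, b], axis=-1)

def cascaded_fusion(i_vi, i_ir, llenet, ef, reconstruct):
    """One forward pass of the cascade: enhance -> split channels -> fuse Y with IR."""
    i_en = llenet(i_vi)                          # low-light enhancement (RGB)
    y_en, cr_en, cb_en = rgb_to_ycrcb(i_en)      # keep chroma, fuse luminance only
    f_vi, f_ir = ef(y_en), ef(i_ir)              # feature extraction, both modalities
    f_f = np.concatenate([f_vi, f_ir], axis=-1)  # channel-wise concat fusion
    y_f = reconstruct(f_f)                       # reconstruct fused Y channel
    return ycrcb_to_rgb(y_f, cr_en, cb_en)       # map back to RGB
```

In the actual method the three callables are trained PyTorch subnetworks; identity-style stand-ins suffice to check the data flow and shapes.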

2.2. Loss Function

2.2.1. Loss from Low-Light Enhancement

To improve the generalization of LLENet, an unsupervised learning strategy is adopted to enhance the robustness of the network. The loss from low-light enhancement is defined as
$L_{enhance} = \beta_1 L_c + \beta_2 L_s$
where $L_c$ and $L_s$ represent the structural similarity loss and the smoothness loss, respectively, and $\beta_1$ and $\beta_2$ are two hyperparameters. The structural similarity loss $L_c$ evaluates the pixel-level consistency between the illumination components output by the illumination estimation module and the input image,
$L_c = \sum_{t=1}^{T} \left\| x^{t} - F^{t} \right\|_2$
where $T$ is the total number of illumination estimation modules during training, and $F^{t}$ and $x^{t}$ represent the input and output of the $t$-th illumination estimation module, respectively. The smoothness loss is commonly used in low-light image enhancement tasks [16,17] and maintains the monotonic relationship among adjacent pixels. The $L_s$ is expressed as
$L_s = \sum_{i=1}^{N} \sum_{j \in N(i)} w_{i,j} \left| x_i^{t} - x_j^{t} \right|$
in which
$w_{i,j} = \exp\left( -\frac{\sum_{c} \left( F_{i,c}^{t} - F_{j,c}^{t} \right)^2}{2\sigma^2} \right)$
where $N$ represents the total number of pixels in the image, $N(i)$ represents the neighbors within a 5 × 5 window centered on the $i$-th pixel, and $j$ indexes one of those neighbors. Here, $c$ indexes the color channels of the YUV color space, and $\sigma = 0.1$ is the standard deviation of the Gaussian kernel.
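A direct, unvectorized sketch of this smoothness term under the definitions above, with a single enhanced channel `x` and multi-channel guidance `f`, might look as follows. The function name and nested-loop form are illustrative only, not the paper's implementation (which would be a vectorized PyTorch version).

```python
import numpy as np

def smoothness_loss(x, f, sigma=0.1, radius=2):
    """Bilateral-weighted smoothness L_s.
    x: (H, W) enhanced illumination; f: (H, W, C) guidance channels (e.g. YUV).
    radius=2 gives the 5x5 window N(i) around each pixel."""
    h, w = x.shape
    loss = 0.0
    for i in range(h):
        for j in range(w):
            for di in range(-radius, radius + 1):
                for dj in range(-radius, radius + 1):
                    ni, nj = i + di, j + dj
                    # skip the center pixel and out-of-bounds neighbors
                    if (di == 0 and dj == 0) or not (0 <= ni < h and 0 <= nj < w):
                        continue
                    # Gaussian affinity: near-identical guidance pixels get weight ~1
                    d2 = np.sum((f[i, j] - f[ni, nj]) ** 2)
                    w_ij = np.exp(-d2 / (2.0 * sigma ** 2))
                    loss += w_ij * abs(x[i, j] - x[ni, nj])
    return loss
```

Because the weights decay with guidance-channel differences, the loss penalizes illumination variation only where the underlying image is locally uniform, which is what preserves the monotonic relationship between adjacent pixels.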

2.2.2. Fusion Loss

To promote the integration of more meaningful information from the source images in the fusion process, and consequently improve imaging quality and quantitative results, the fusion loss [10] is designed as the sum of the intensity loss $L_{int}$ and the gradient loss $L_{grad}$,
$L_{fusion} = L_{int} + \lambda L_{grad}$
Among them, $L_{int}$ constrains the overall pixel intensity of the fused image, while $L_{grad}$ forces the fused image to contain more texture details. Here, $\lambda$ is used to balance the two terms.
The intensity loss $L_{int}$ is the pixel-level difference in intensity between the fused image and the source images:
$L_{int} = \frac{1}{HW} \left\| I_f - \max\left( I_{en}^{Y},\; I_{ir} \right) \right\|_1$
where $H$ and $W$ are the height and width of the source image, $\left\| \cdot \right\|_1$ denotes the L1 norm, and $\max(\cdot)$ denotes the element-wise maximum selection strategy. This loss aggregates the pixel intensity distributions of the LWIR and VIS images via the element-wise maximum and constrains the fused image to follow the resulting distribution.
The fused image is expected to maintain the optimal pixel intensity distribution while preserving the rich texture details of the source images. However, the intensity loss $L_{int}$ only constrains the intensity distribution during model learning. Therefore, the gradient loss $L_{grad}$ is introduced to force the fused image to contain more texture information:
$L_{grad} = \frac{1}{HW} \left\| \left| \nabla I_f \right| - \max\left( \left| \nabla I_{en}^{Y} \right|,\; \left| \nabla I_{ir} \right| \right) \right\|_1$
where $\nabla$ denotes the Sobel gradient operator used to measure texture information.
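Assuming single-channel images in [0, 1] and the Sobel operator for $\nabla$, the two fusion-loss terms can be sketched as below. This is a NumPy illustration, not the authors' code; in particular, combining the two Sobel directions by summed absolute responses for $|\nabla I|$ is one common convention among several.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d(img, k):
    """'Same'-size 2-D correlation with a 3x3 kernel, zero padding."""
    h, w = img.shape
    p = np.pad(img, 1)
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + h, j:j + w]
    return out

def sobel_mag(img):
    """|∇I|: Sobel gradient magnitude (L1 combination of the two directions)."""
    return np.abs(conv2d(img, SOBEL_X)) + np.abs(conv2d(img, SOBEL_Y))

def fusion_loss(i_f, y_en, i_ir, lam=15.0):
    """L_fusion = L_int + λ·L_grad for single-channel images."""
    hw = i_f.size
    # intensity term: match the element-wise maximum of the two sources
    l_int = np.abs(i_f - np.maximum(y_en, i_ir)).sum() / hw
    # gradient term: match the stronger of the two source gradient maps
    l_grad = np.abs(sobel_mag(i_f)
                    - np.maximum(sobel_mag(y_en), sobel_mag(i_ir))).sum() / hw
    return l_int + lam * l_grad
```

With λ = 15 as reported in Section 3.1, the gradient term dominates unless the fused image already reproduces the sharper of the two source gradient fields.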

3. Results and Discussions

3.1. Experimental Setup

During the research process, the proposed method is trained on the LLVIP [18] dataset, which contains aligned infrared and visible images captured on roads at night in low-light environments. In addition, the MSRS [19] dataset, which contains both daytime and nighttime scenes, is used to assess the generalization ability of the method. The method is compared with six existing advanced VIS-LWIR fusion algorithms: two AE-based methods, DenseFuse [10] and RFN-Nest [20]; two GAN-based methods, DDcGAN [21] and GANMcC [12]; and two CNN-based methods, SDNet [22] and MFEIF [8]. All of these algorithms and datasets are publicly available, and the same parameter settings as in the original papers are used.
For quantitative evaluation, six metrics are employed: average gradient (AG) [23], spatial frequency (SF) [24], entropy (EN) [25], mutual information (MI) [26], $Q^{AB/F}$ [27], and visual information fidelity (VIF) [28]. AG reflects the richness of image texture information. SF measures the information contained in the fused image at the spatial scale. Both EN and MI evaluate fusion performance from the perspective of information theory: EN evaluates the amount of information contained in the fused image, while MI evaluates the amount of information transferred from the source images to the fused image. $Q^{AB/F}$ evaluates the amount of edge information transferred from the source images to the fused image. VIF measures information fidelity from the perspective of human visual perception. Larger values of AG, SF, EN, MI, $Q^{AB/F}$, and VIF indicate better fusion performance.
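Three of these reference-free metrics have simple closed forms. The NumPy sketches below use one common formulation of each; exact definitions vary slightly across papers [23,24,25], so these are illustrative rather than the evaluation code used in this study.

```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy of the grey-level histogram (img in [0, 1])."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]  # 0·log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def average_gradient(img):
    """AG: mean local gradient magnitude, a proxy for texture richness."""
    gx = np.diff(img, axis=1)[:-1, :]   # horizontal differences
    gy = np.diff(img, axis=0)[:, :-1]   # vertical differences
    return float(np.sqrt((gx ** 2 + gy ** 2) / 2.0).mean())

def spatial_frequency(img):
    """SF: sqrt(RF^2 + CF^2) from row/column first differences."""
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))
```

All three vanish for a constant image and grow with texture and contrast, matching their "larger is better" use in Tables 1–4.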
The method is implemented in the PyTorch 2.1 framework on an NVIDIA Tesla V100 GPU. With a batch size of 6, the Adam optimizer is used to train the network for 50 epochs. The learning rate is initialized to 0.001, weight decay is 0, epsilon is $1.0 \times 10^{-8}$, and the betas are (0.9, 0.999). In addition, the hyperparameters that control the trade-off of each sub-loss term are set experimentally as $\alpha_1 = 1$, $\alpha_2 = 0.1$, $\beta_1 = 2.8$, $\beta_2 = 1$, and $\lambda = 15$. All images are normalized to [0, 1] before being fed into the network.
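The optimizer settings above translate directly into PyTorch; in this sketch the `Conv2d` layer is only a stand-in for the joint LLENet + IXNet model, which is not public.

```python
import torch

# Placeholder model: in the paper this would be the joint LLENet + IXNet network.
model = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Adam configuration as reported above.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,               # initial learning rate
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0,
)
```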

3.2. Experimental Results

To visually demonstrate the fusion performance of the different algorithms on the LLVIP dataset, this section selects a pair of representative VIS and LWIR images; the outputs of the various fusion methods are shown in Figure 2. Because VIS images suffer from lighting degradation at night, a good fusion algorithm should not only extract meaningful information from the source images but also produce bright scenes with high contrast. Notably, almost all comparison methods introduce meaningless information during fusion, mainly in the form of texture regions corrupted by infrared content and weakened salient targets. As illustrated in the enlarged area of the image, the fusion results of DDcGAN and SDNet are relatively blurry, lacking much of the scene detail, and the text in the red enlarged box is not legible. Although DenseFuse, RFN-Nest, GANMcC, and MFEIF preserve more texture information, the entire scene remains dark, with low overall brightness. Moreover, their salient target intensity is weaker than in the LWIR image, with color distortion, poor contrast, and a relatively blurry background compared to the VIS image, which is not conducive to subsequent high-level visual tasks. The comparison algorithms thus fail to produce satisfactory results, while the proposed fusion method obtains bright scenes, salient targets, and rich texture details even in the dark.
The results of the six existing algorithms and the proposed method, evaluated by the six quantitative metrics, are shown in Table 1, where red marks the best results and blue the second-best. On the LLVIP dataset, the proposed method leads on all quantitative metrics. The best results on MI and VIF mean that our algorithm preserves the most information from the source images and achieves the best visual perception performance. The results on SF and $Q^{AB/F}$ indicate that the fused images of this method contain the most spatial frequency information and edge information, respectively. The largest EN value indicates that this method retains abundant information while reducing redundancy, so its outputs have high contrast and are less prone to color distortion. Ranking first on AG means that the generated images contain richer texture details than those of existing algorithms.
More experimental results on diverse, publicly available datasets are displayed in Figure 3 and Figure 4. VIS-LWIR fusion results of the existing methods and our method on the M3FD dataset [29] are shown in Figure 3 and Table 2. Except for MI, where it ranks second, slightly below the MFEIF method, the proposed method outperforms existing methods on the quantitative metrics VIF, SF, EN, $Q^{AB/F}$, and AG.
Figure 4 and Table 3 present qualitative and quantitative results of the existing methods and our method on the TNO dataset [30]. The largest values of MI, VIF, SF, $Q^{AB/F}$, and AG indicate the superiority of the present fusion method. Although its EN value is approximately 2% lower than that of the DDcGAN method, the image contrast is still satisfactory and exhibits less color distortion than most other methods.

3.3. Generalization Ability of the Method

An important criterion for evaluating deep learning methods is their generalization ability. This article uses the LLVIP dataset for model training and employs the MSRS dataset for generalization ability testing by randomly selecting 50 pairs of nighttime infrared and visible images for qualitative and quantitative experiments.
Figure 5 provides a qualitative comparison between the proposed method (Figure 5g) and prior methods (Figure 5a–f) on the MSRS dataset. The results illustrate that the present method not only provides brighter scenes with prominent targets but also recovers information submerged in darkness. As shown in the enlarged box in the figure, the method produces images with higher overall contrast, preserving the rich detail of the visible images while strengthening the salient targets of the infrared images. It is worth emphasizing that, compared with the other methods, it effectively maintains the color fidelity of the fused image without distortion; this difference is readily visible in the car taillights in the enlarged green box. In addition, the background of the fusion result resembles that of the visible ground truth (Figure 5h), and the pixel intensity of salient targets is consistent with that of the infrared ground truth (Figure 5i), making the image more in line with human visual perception.
The quantitative results of the different algorithms on the MSRS dataset are shown in Table 4. Our method still ranks first on the four metrics of VIF, SF, EN, and AG, indicating that the model fully preserves the texture information of the source images, contains rich spatial information, and produces fused images that conform to human visual perception. Although the proposed algorithm ranks second on MI, the gap to the best result is small, indicating that the fusion result preserves sufficient information from the source images. Likewise, although it ranks second on $Q^{AB/F}$, it is not significantly different from the best method, indicating that sufficient edge information is transferred from the source images. The quantitative results in Table 4 confirm the superior generalization ability of our method.

3.4. Ablation Study

An ablation study is a critical analysis in deep learning research that systematically removes or modifies individual components of a model to isolate and quantify their contribution to overall performance. Its primary value lies in moving beyond a single performance metric to provide interpretable insight into which elements are truly critical. The ablation study results are shown in Figure 6.

3.4.1. Smoothing Loss Analysis

The smoothness loss maintains the monotonic relationship between adjacent pixels in the enhanced image. This section designs an ablation experiment on the smoothness loss to verify its specific effect. More specifically, only the structural similarity loss is used as the loss function of the low-light enhancement network, without introducing the smoothness loss into the learning optimization. The structural similarity loss ensures pixel-level consistency between the enhanced image and the visible image. As shown in Figure 6a, without the guidance of the smoothness loss, the structural similarity loss carries a greater weight in the loss function, breaking the balance among the original loss terms. The fused image generated by the network then leans more heavily toward features from the visible image, which manifests as weakened prominent targets and reduced overall brightness.
Figure 6. Ablation study results of the present method. (a) This method without the smoothing loss $L_s$, (b) this method without the intensity loss $L_{int}$, (c) this method without the gradient loss $L_{grad}$, (d) this method. (e) VIS image and (f) LWIR image are displayed as the ground truth (GT) for comparison.

3.4.2. Intensity Loss Analysis

The expected information of fused images is clearly defined as salient targets in infrared images and background texture details in visible images. The intensity loss measures the difference between the fused image and the source image at the pixel level, guiding the network to generate a fused image with the maximum similarity on pixel intensity distribution of infrared and visible images. As shown in Figure 6b, when intensity loss is not applied, the network only fuses images with the goal of preserving texture details, resulting in a fusion image that is more biased towards visible images. This leads to poor preservation of contour information from thermal radiation in infrared images, which goes against the original intention of fusing images. Therefore, intensity loss is indispensable.

3.4.3. Gradient Loss Analysis

Gradient loss guides the network to generate fused images with the maximum texture detail information of infrared and visible light images, forcing salient targets in the image to have clearer contours. As shown in Figure 6c, after removing gradient loss, the fused image failed to retain sufficient texture detail information. The text on the stone tablet in the scene was very blurry compared to the VIS GT image, and the edges of the target are severely weakened. The entire scene is relatively smooth, with fewer gradient changes and less dynamic texture structure.

3.5. Model Complexity Analysis and High-Level Visual Task Evaluation

To comprehensively evaluate the performance of the different algorithms, Table 5 lists the model parameters and floating-point operations (FLOPs) of the six prior image fusion methods and the present method. SDNet has the smallest number of parameters and FLOPs because it uses only a fusion network with a simple structure. Our method consists of two parts, the Low-Light Enhancer Network (LLENet) and the heterogeneous information fusion network (IXNet), which effectively couple the tasks of image enhancement and image fusion. Benefiting from the weight-sharing architecture of LLENet, the enhancement network needs only one illumination estimation module during the testing phase, greatly reducing model complexity. The proposed algorithm therefore ranks third in computational complexity while achieving multi-task collaboration, slightly behind single-task networks such as DenseFuse and SDNet. However, considering the leading fusion and generalization performance shown in Figure 2 and Figure 5, the overall performance of this method is still superior to that of existing methods.
To further demonstrate the effectiveness of our method in high-level visual tasks, we applied the fusion results to nighttime pedestrian detection, a representative high-level visual task. The state-of-the-art YOLOv5 detection model is used on the fused images output by the prior methods (Figure 7a–f) and our method (Figure 7g), as well as on the GT images captured by the VIS (Figure 7h) and LWIR (Figure 7i) cameras. The experimental results are illustrated in Figure 7.
From the figure, it can be seen that, constrained by lighting degradation, detection on the visible image misses some pedestrians. Although the infrared image shows salient targets, it lacks sufficient texture detail, and its detection results also miss targets. The fused images generated by the six existing algorithms all suffer from the same lighting predicament. Although the complementary information in a fused image, such as high contrast, rich texture details, and prominent targets, benefits pedestrian detection, weak-light environments weaken the effective combination of complementary information between the source images, so the detector cannot find all pedestrians in those fusion results. In addition, the fused images of the comparison algorithms suffer from color distortion and poor contrast. The present method fully exploits the information hidden in darkness, improves the overall brightness of the image, integrates the complementary information of the infrared and visible images, and preserves texture details while improving salient target contrast, thus detecting all pedestrians.

4. Conclusions

This study has developed a cascaded imaging framework that effectively addresses the challenges of low-light image degradation through the synergistic integration of visible image enhancement and VIS-LWIR fusion. The proposed method incorporates a dedicated Low-Light Enhancer Network (LLENet) based on Retinex theory to improve illumination conditions and a heterogeneous fusion network (IXNet) to combine complementary information from enhanced visible and infrared images. Through a carefully designed joint loss function that couples enhancement and fusion objectives, our approach successfully generates fused images with superior brightness preservation, prominent thermal targets, and rich texture details. Experimental validation demonstrates that our method achieves state-of-the-art performance across multiple quantitative metrics including AG, SF, EN, MI, $Q^{AB/F}$, and VIF on benchmark datasets. The framework exhibits excellent generalization capability across different scenarios while maintaining computational efficiency with relatively low parameter requirements. More importantly, the practical utility of our approach is confirmed through enhanced performance in high-level visual tasks such as pedestrian detection, particularly under challenging low-light conditions. This study, designed for practical deployment in security engineering scenarios, faces limitations due to constraints on computational efficiency and real-time performance, leading to comparisons primarily with existing methods of lower complexity. Future work will expand evaluations to include more state-of-the-art approaches, such as transformer-based and large-scale model methods, for a broader benchmark. Additionally, more comprehensive and diverse evaluation metrics will be employed to ensure a thorough assessment of fusion quality.
Other limitations involve potential trade-offs in performance under dynamic conditions, while future directions will focus on adaptive optimization strategies for varying environments and further architectural refinements to enhance real-time applicability.

Author Contributions

Conceptualization, H.H., H.L. and K.H.; methodology, H.H., C.G. and H.L.; software, H.L., H.W. and Y.Y.; validation, H.L. and M.C.; formal analysis, H.H. and H.L.; investigation, H.L.; resources, H.L.; data curation, H.L. and H.W.; writing—original draft preparation, H.H. and H.L.; writing—review and editing, H.H. and H.L.; visualization, H.H. and H.L.; supervision, H.H. and H.L.; funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62375283, U24B20138, 62272421) and the Hunan Natural Science Fund for Distinguished Young Scholars (No. 2024JJ4044).

Data Availability Statement

Data are available upon reasonable request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. (a) The overall framework of the proposed imaging method, (b) the framework of the LLENet module, and (c) the framework of the IXNet module.
Figure 2. VIS-LWIR fusion results output from (a–f) existing models, and (g) our method on the LLVIP dataset. The raw data of the (h) VIS image and (i) LWIR image are displayed as the ground truth (GT) for comparison.
Figure 3. VIS-LWIR fusion results output from (a–f) existing models, and (g) our method on the M3FD dataset. The raw data of the (h) VIS image and (i) LWIR image are displayed as the ground truth (GT) for comparison.
Figure 4. VIS-LWIR fusion results output from (a–f) existing models, and (g) our method on the TNO dataset. The raw data of the (h) VIS image and (i) LWIR image are displayed as the ground truth (GT) for comparison.
Figure 5. VIS-LWIR fusion results output from (a–f) existing models, and (g) our method on the MSRS dataset. The raw data of the (h) VIS image and (i) LWIR image are displayed as the ground truth (GT) for comparison.
Figure 7. Nighttime pedestrian detection results on fused images output by (a–f) prior methods and (g) our method, and GT images captured by the (h) VIS and (i) LWIR cameras.
Table 1. Quantitative comparison of six existing algorithms and our method on the LLVIP dataset using six metrics: MI, VIF, SF, EN, Qabf, and AG. In the original article, the best result is marked in red and the second-best in blue.

| Metric | DenseFuse | DDcGAN | GANMcC | RFN-Nest | SDNet | MFEIF | Ours |
|--------|-----------|--------|--------|----------|-------|-------|------|
| MI | 2.558 ± 0.772 | 1.813 ± 0.663 | 2.321 ± 0.844 | 1.854 ± 1.172 | 1.981 ± 0.873 | 1.954 ± 1.443 | 3.812 ± 1.625 |
| VIF | 0.726 ± 0.114 | 0.624 ± 0.273 | 0.695 ± 0.287 | 0.479 ± 0.286 | 0.347 ± 0.202 | 0.425 ± 0.246 | 1.017 ± 0.184 |
| SF | 0.018 ± 0.006 | 0.032 ± 0.010 | 0.021 ± 0.009 | 0.018 ± 0.009 | 0.018 ± 0.007 | 0.023 ± 0.008 | 0.043 ± 0.008 |
| EN | 5.395 ± 0.976 | 6.498 ± 0.305 | 4.728 ± 1.731 | 3.464 ± 2.301 | 4.196 ± 0.851 | 2.997 ± 2.262 | 6.637 ± 0.933 |
| Qabf | 0.360 ± 0.042 | 0.264 ± 0.125 | 0.355 ± 0.134 | 0.276 ± 0.169 | 0.285 ± 0.152 | 0.298 ± 0.177 | 0.557 ± 0.158 |
| AG | 1.526 ± 0.618 | 2.716 ± 0.801 | 1.557 ± 0.895 | 1.019 ± 0.676 | 1.400 ± 0.560 | 1.235 ± 0.800 | 3.355 ± 1.179 |
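For readers who want to reproduce the reference-free metrics in these tables, the sketch below shows one common way to compute EN (Shannon entropy), SF (spatial frequency), and AG (average gradient) with NumPy. It assumes 8-bit grayscale inputs; the exact normalization used in the paper (e.g., the small SF values suggest intensities scaled to [0, 1] before differencing) may differ.

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of an 8-bit grayscale image, in bits."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return -np.sum(p * np.log2(p))

def spatial_frequency(img):
    """SF: RMS of horizontal and vertical first differences."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)

def average_gradient(img):
    """AG: mean gradient magnitude over interior pixels."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]  # horizontal differences
    gy = np.diff(img, axis=0)[:, :-1]  # vertical differences
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2))

# Illustrative input: a random texture-rich 8-bit image.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
print(entropy(img), spatial_frequency(img), average_gradient(img))
```

Higher values of all three indicate richer information and sharper detail in the fused image, which is why they favor fusion results that retain both VIS texture and LWIR contrast.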
Table 2. Quantitative comparison of six existing algorithms and our method on the M3FD dataset using six metrics: MI, VIF, SF, EN, Qabf, and AG. In the original article, the best result is marked in red and the second-best in blue.

| Metric | DenseFuse | DDcGAN | GANMcC | RFN-Nest | SDNet | MFEIF | Ours |
|--------|-----------|--------|--------|----------|-------|-------|------|
| MI | 3.092 ± 0.772 | 2.797 ± 0.651 | 2.946 ± 0.656 | 3.182 ± 1.012 | 3.171 ± 1.039 | 3.487 ± 0.983 | 3.246 ± 1.079 |
| VIF | 0.781 ± 0.203 | 0.759 ± 0.240 | 0.582 ± 0.177 | 0.974 ± 0.355 | 0.773 ± 0.148 | 0.904 ± 0.223 | 1.322 ± 0.291 |
| SF | 0.019 ± 0.009 | 0.037 ± 0.016 | 0.022 ± 0.011 | 0.018 ± 0.007 | 0.030 ± 0.015 | 0.022 ± 0.012 | 0.058 ± 0.016 |
| EN | 6.433 ± 0.513 | 6.767 ± 0.305 | 5.006 ± 1.083 | 6.777 ± 0.525 | 6.680 ± 0.300 | 6.746 ± 0.510 | 6.812 ± 0.527 |
| Qabf | 0.370 ± 0.040 | 0.454 ± 0.112 | 0.294 ± 0.092 | 0.347 ± 0.092 | 0.508 ± 0.085 | 0.428 ± 0.126 | 0.514 ± 0.093 |
| AG | 1.660 ± 0.748 | 2.942 ± 1.449 | 1.693 ± 0.869 | 1.676 ± 0.778 | 2.576 ± 1.232 | 1.934 ± 0.950 | 4.217 ± 1.652 |
Table 3. Quantitative comparison of six existing algorithms and our method on the TNO dataset using six metrics: MI, VIF, SF, EN, Qabf, and AG. In the original article, the best result is marked in red and the second-best in blue.

| Metric | DenseFuse | DDcGAN | GANMcC | RFN-Nest | SDNet | MFEIF | Ours |
|--------|-----------|--------|--------|----------|-------|-------|------|
| MI | 2.910 ± 0.573 | 2.462 ± 0.485 | 2.734 ± 0.519 | 2.730 ± 0.778 | 2.766 ± 0.747 | 2.966 ± 0.813 | 3.438 ± 0.901 |
| VIF | 0.762 ± 0.139 | 0.713 ± 0.184 | 0.620 ± 0.152 | 0.806 ± 0.254 | 0.628 ± 0.119 | 0.741 ± 0.169 | 1.218 ± 0.202 |
| SF | 0.019 ± 0.006 | 0.035 ± 0.011 | 0.032 ± 0.008 | 0.018 ± 0.006 | 0.026 ± 0.010 | 0.022 ± 0.008 | 0.053 ± 0.011 |
| EN | 6.080 ± 0.474 | 6.676 ± 0.226 | 4.911 ± 0.926 | 5.651 ± 0.856 | 5.835 ± 0.351 | 5.471 ± 0.840 | 6.553 ± 0.471 |
| Qabf | 0.367 ± 0.030 | 0.389 ± 0.085 | 0.315 ± 0.076 | 0.523 ± 0.084 | 0.432 ± 0.076 | 0.384 ± 0.103 | 0.529 ± 0.082 |
| AG | 1.614 ± 0.537 | 2.865 ± 0.994 | 1.647 ± 0.649 | 1.453 ± 0.563 | 2.176 ± 0.835 | 1.696 ± 0.683 | 3.924 ± 1.162 |
Table 4. Quantitative comparison of six existing algorithms and our method on the MSRS dataset using six metrics: MI, VIF, SF, EN, Qabf, and AG. In the original article, the best result is marked in red and the second-best in blue.

| Metric | DenseFuse | DDcGAN | GANMcC | RFN-Nest | SDNet | MFEIF | Ours |
|--------|-----------|--------|--------|----------|-------|-------|------|
| MI | 3.092 ± 0.772 | 2.797 ± 0.651 | 3.021 ± 1.068 | 3.182 ± 1.012 | 3.171 ± 1.039 | 3.487 ± 0.983 | 3.246 ± 1.079 |
| VIF | 0.781 ± 0.203 | 0.759 ± 0.240 | 0.798 ± 0.232 | 0.974 ± 0.355 | 0.773 ± 0.148 | 0.904 ± 0.223 | 1.017 ± 0.391 |
| SF | 0.019 ± 0.009 | 0.037 ± 0.016 | 0.022 ± 0.009 | 0.018 ± 0.007 | 0.030 ± 0.015 | 0.022 ± 0.012 | 0.058 ± 0.026 |
| EN | 6.433 ± 0.513 | 6.767 ± 0.305 | 6.674 ± 0.151 | 6.777 ± 0.525 | 6.680 ± 0.300 | 6.746 ± 0.510 | 6.812 ± 0.595 |
| Qabf | 0.370 ± 0.040 | 0.454 ± 0.112 | 0.336 ± 0.106 | 0.347 ± 0.092 | 0.508 ± 0.085 | 0.428 ± 0.126 | 0.479 ± 0.123 |
| AG | 1.660 ± 0.748 | 2.942 ± 1.449 | 1.874 ± 0.838 | 1.676 ± 0.778 | 2.576 ± 1.232 | 1.934 ± 0.950 | 4.522 ± 2.039 |
Table 5. Model complexity comparison between six existing methods and our proposed method.

| | DenseFuse | DDcGAN | GANMcC | RFN-Nest | SDNet | MFEIF | Ours |
|---|----------|--------|--------|----------|-------|-------|------|
| Parameters (M) | 0.074 | 10.93 | 1.864 | 10.93 | 0.067 | 0.373 | 0.167 |
| FLOPs (G) | 233.554 | 232.170 | 75.444 | 44.525 | 4.321 | 848.344 | 9.56 |
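As a rough sanity check on complexity numbers like those above, per-layer parameter and FLOP counts for plain convolutions can be computed analytically. The helper below is a generic sketch, not the paper's architecture: the channel widths and the 640×512 input resolution are illustrative assumptions, and the FLOP formula counts two operations per multiply–accumulate, ignoring activations and bias adds.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of one k×k convolution: (k*k*c_in + bias) * c_out."""
    return (k * k * c_in + int(bias)) * c_out

def conv2d_flops(c_in, c_out, k, h, w):
    """FLOPs of one k×k conv on an h×w output map (2 ops per MAC)."""
    return 2 * k * k * c_in * c_out * h * w

# Hypothetical 3-layer encoder on a 640×512 input, channels 1 -> 16 -> 16 -> 1,
# with 3×3 kernels and 'same' padding so the spatial size is preserved.
layers = [(1, 16), (16, 16), (16, 1)]
params = sum(conv2d_params(ci, co, 3) for ci, co in layers)
flops = sum(conv2d_flops(ci, co, 3, 512, 640) for ci, co in layers)
print(f"{params / 1e6:.4f} M params, {flops / 1e9:.2f} GFLOPs")
```

Counts like these explain why FLOPs dwarf parameter counts for fully convolutional fusion networks: every weight is reused at every spatial location, so even a sub-megabyte model can cost tens of GFLOPs at megapixel resolution.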
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, H.; Liu, H.; Wang, H.; Yang, Y.; Guo, C.; Chen, M.; Han, K. A Cascaded Enhancement-Fusion Network for Visible-Infrared Imaging in Darkness. Photonics 2025, 12, 1231. https://doi.org/10.3390/photonics12121231


