1. Introduction
In recent years, with advancements in sensor-imaging technologies, cameras have gained the ability to capture both infrared (IR) and visible images. IR images are effective at capturing target objects under low-illumination conditions by exploiting the difference in thermal radiation between the target and its surroundings; however, they often lack sufficient information on the edges and texture details of the target. Visible images, on the other hand, capture high-resolution texture details. Researchers have therefore become interested in optimally integrating the complementary information of IR and visible images through image fusion, in order to compensate for sensor limitations and enhance both human and machine perception [1,2,3,4,5]. The fusion of IR and visible images has a wide range of applications, including intelligent surveillance, target monitoring, and video analysis; it supports subsequent image-processing tasks and assists in more comprehensive and intuitive analysis and decision-making. IR and visible image fusion techniques have evolved into two main categories: traditional methods and deep learning-based methods.
Traditional image fusion methods can generally be divided into three stages: feature extraction, feature fusion, and image reconstruction [1,2,3]. Typically, the source images are transformed from the spatial domain into a domain that facilitates feature extraction, the extracted features are fused according to specific principles, and the fused features are finally transformed back to the spatial domain to obtain the fused image. For instance, Jin et al. [4] used the discrete stationary wavelet transform (DSWT) to decompose the significant features of the source images into sub-image sets of varying levels and spatial resolutions. The significant details are then separated by the discrete cosine transform (DCT) into sub-image sets according to the energy of the individual frequencies, and the local spatial frequency (LSF) is used to enhance the regional features of the DCT coefficients. To combine the significant features of the source images, a segmentation strategy based on an LSF threshold is employed to fuse the DCT coefficients. Finally, the fused image is reconstructed by applying the inverse DCT and inverse DSWT in sequence. More recently, Ren et al. [5] proposed a non-subsampled shearlet transform (NSST)-based multiscale entropy fusion method for IR and visible images. The approach first uses the NSST to decompose the source images into high- and low-frequency components; the high- and low-frequency data are then fused at different scales, with the weight values determined by multiscale entropy; finally, the fusion results at the different scales are combined via the inverse NSST to generate the fused image. These studies demonstrate that effectively extracting features from both IR and visible images and designing appropriate fusion rules are crucial when addressing the fusion problem with traditional methods: reasonable fusion rules are essential for maximizing the preservation of complementary content from the two images in the output. However, both tasks rely on manual experience, so the fusion results have significant limitations and cannot adapt to complex scene changes.
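To make the three-stage pipeline concrete, the following is a minimal Python sketch of a decompose-fuse-reconstruct scheme. It keeps a stationary wavelet decomposition, as in the DSWT stage of [4], but replaces the DCT/LSF coefficient selection with simple average and max-absolute rules; the function names and fusion rules are illustrative assumptions rather than a reproduction of the cited methods.

```python
# A minimal sketch of the traditional decompose-fuse-reconstruct pipeline.
# It keeps the stationary-wavelet decomposition of the DSWT stage in [4] but,
# for brevity, replaces the DCT/LSF coefficient selection with simple
# average / max-absolute rules; names and rules here are ours, not from [4].
import numpy as np
import pywt


def max_abs(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Keep, per pixel, the coefficient with the larger magnitude (the stronger edge/texture)."""
    return np.where(np.abs(x) >= np.abs(y), x, y)


def fuse_swt(ir: np.ndarray, vis: np.ndarray, wavelet: str = "db2", level: int = 2) -> np.ndarray:
    """Fuse two aligned single-channel float images; H and W must be divisible by 2**level."""
    coeffs_ir = pywt.swt2(ir, wavelet, level=level)
    coeffs_vis = pywt.swt2(vis, wavelet, level=level)

    fused = []
    for (a_ir, (h_ir, v_ir, d_ir)), (a_vis, (h_vis, v_vis, d_vis)) in zip(coeffs_ir, coeffs_vis):
        a_f = 0.5 * (a_ir + a_vis)  # low-frequency band: averaging preserves overall brightness
        fused.append((a_f, (max_abs(h_ir, h_vis), max_abs(v_ir, v_vis), max_abs(d_ir, d_vis))))

    return pywt.iswt2(fused, wavelet)  # back to the spatial domain
```

Changing only the rule applied to each coefficient band yields a different traditional method, which illustrates how strongly the final result depends on hand-crafted design choices.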
The advancement of deep learning techniques [6] over the last decade and their successful application to low-level vision processing have led researchers to develop various network models for IR and visible image fusion [7,8,9,10,11,12,13,14,15,16,17,18,19]. Based on the stage at which IR and visible image features are fused, these models can generally be classified into three types [20]: (1) Early fusion [7,8,9], which merges the original images at the low-level feature extraction stage. For instance, an end-to-end fusion model based on detail-preserving adversarial learning (DPAL) was proposed by Ma et al.
[7]. The middle section, consisting of Conv+BN+PReLU+Conv+BN (where BN denotes batch normalization), is repeated five times, and the back section contains a convolutional layer, a PReLU, and a final convolutional layer. DPAL uses an advanced adversarial learning technique that has proven highly effective across multiple fields; it can be applied to diverse types of image data and can rapidly train high-quality models. However, this kind of method tends to produce fused images of a generic nature, because the network architecture lacks appropriate mechanisms to facilitate image feature fusion and therefore fails to fully exploit the complementarity of IR and visible images. The inadequate fusion of image features often leads to excessive redundancy in the input feature information, complicating image reconstruction in the later stages. (2) Medium fusion [10,11,12,13]. Li et al.
[11] proposed a network model called dense fusion (DenseFuse) for IR and visible image fusion, which achieves fusion by cascading the feature maps of the input images. Compared with other image fusion methods, DenseFuse retains the detailed information of the source images while offering superior robustness and scalability; it reduces noise and produces visually appealing output images, but it has also been observed to lose image information and reduce spatial resolution. In this medium-fusion architecture, the source image features are first extracted and then fused, and the fused features are finally reconstructed into the fused image by a relatively complex feature reconstruction module. Although this framework avoids introducing the redundant information of earlier feature fusion architectures, its feature fusion module is relatively simple and may not fuse features effectively; the resulting inadequate feature fusion leads to inferior image reconstruction quality after fusion. Overall, this approach is not optimal and can be improved. (3) Late fusion [14,15,16,17,18,19]. In late fusion, the input images are first processed separately to extract relevant features, and these features are then fused to create the final image. Tang et al.
[14] proposed a novel light-awareness-based network designed to manage the luminance distribution so that salient targets are retained while the background texture is preserved. The approach fully exploits the complementary and common information under the constraints imposed by light awareness. The initial design uses a convolutional layer to minimize modal discrepancies between the IR and visible images, and four convolutional layers with shared weights are then applied to extract deeper IR and visible image features, with leaky rectified linear unit (LeakyReLU) activations in each layer of the feature extractor. The extracted features are concatenated and fed to the image reconstructor, which is composed of five convolutional layers and integrates common and complementary information to produce the fused image. All convolutional layers except the last share the same kernel size, while the last layer uses a different one; LeakyReLU activations are used in most layers, with the last layer adopting the Tanh activation. Because late fusion merges the features only in the later part of the network, the image reconstruction module has relatively weak processing ability, and distorted images can result if the fusion is not performed carefully. Thus, several limitations remain in the final fusion effect.
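The distinction between the three categories is mainly architectural: it lies in where the IR and visible streams are combined. The toy PyTorch modules below make this explicit; the layer counts, channel widths, and activations are illustrative choices and do not reproduce the networks of [7], [11], or [14].

```python
# Toy PyTorch modules that differ only in where the IR and visible streams are
# combined. Layer counts, channel widths, and activations are illustrative and
# do not reproduce the architectures of [7], [11], or [14].
import torch
import torch.nn as nn


def conv_block(c_in: int, c_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.LeakyReLU(0.2, inplace=True))


class EarlyFusion(nn.Module):
    """(1) Early fusion: concatenate the source images, then use one shared trunk."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(conv_block(2, 32), conv_block(32, 32), nn.Conv2d(32, 1, 1), nn.Tanh())

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([ir, vis], dim=1))


class MediumFusion(nn.Module):
    """(2) Medium fusion: shallow per-modality encoders, fuse, then a heavier reconstructor."""
    def __init__(self):
        super().__init__()
        self.enc_ir = conv_block(1, 16)
        self.enc_vis = conv_block(1, 16)
        self.dec = nn.Sequential(conv_block(32, 32), conv_block(32, 16), nn.Conv2d(16, 1, 1), nn.Tanh())

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        return self.dec(torch.cat([self.enc_ir(ir), self.enc_vis(vis)], dim=1))


class LateFusion(nn.Module):
    """(3) Late fusion: deep per-modality encoders, fused just before a light output head."""
    def __init__(self):
        super().__init__()
        self.enc_ir = nn.Sequential(conv_block(1, 16), conv_block(16, 16), conv_block(16, 16))
        self.enc_vis = nn.Sequential(conv_block(1, 16), conv_block(16, 16), conv_block(16, 16))
        self.head = nn.Sequential(nn.Conv2d(32, 1, 1), nn.Tanh())

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([self.enc_ir(ir), self.enc_vis(vis)], dim=1))
```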
Based on the aforementioned studies, IR and visible image fusion models can be readily developed using end-to-end training supported by a large amount of training data. Generally speaking, the complex image feature representations, extraction designs, and fusion rules that heavily rely on human expertise are replaced by numerous network parameters whose values are determined mainly by the training dataset, without the need for human intervention. Fusion models based on deep learning are therefore relatively easy to design and use. In these methods, the network architecture and the combination of loss functions play a vital role in performance: the architecture determines the potential ability to extract and fuse features from the IR and visible images, while the choice of loss functions determines whether the quality of the fused image can be optimized. However, in current methods there is still room for improvement in both the network architecture and the loss functions.
In this work, we present an adjacent feature shuffle combination network (AFSFusion) designed to maximize the preservation of complementary information from the IR and visible images and to obtain fused images that achieve the highest quality and resolution under various objective evaluation metrics. The primary contributions of our proposed network are as follows:
Unlike the resolution transformations of UNet [21], our AFSFusion network is specifically designed to optimize image feature processing for image fusion. We establish a hybrid fusion network that achieves fusion by scaling the number of feature channels up and down in an improved UNet-like architecture. The first half of the network uses the adjacent feature shuffle-fusion (AFSF) module, which fuses the features of adjacent convolutional layers several times using shuffle operations so that the input feature information interacts fully (an illustrative sketch of such a shuffle-fusion step follows the list of contributions). Consequently, information loss during feature extraction and fusion is significantly reduced, and the complementarity of the feature information is enhanced. The proposed architecture integrates complementary features effectively through the AFSF module, achieving a balanced, collaborative combination of feature extraction, fusion, and modulation that is absent from traditional methods and yields improved fusion performance. These changes allow the proposed network to achieve excellent performance in both subjective and objective evaluations.
Three loss functions, mean square error (MSE), structural similarity (SSIM), and total variation (TV), are used to construct the total loss function. To ensure that the total loss plays a significant role in network training, an adaptive weight adjustment (AWA) strategy is designed to set the weights of the different loss terms automatically, according to the feature responses extracted for the IR and visible images; these responses depend on the image content, and their magnitudes are extracted and computed using VGG16 [22] (an illustrative sketch of such an adaptively weighted loss follows the list of contributions). As a result, the fused images produced by the fusion network achieve a fusion result that aligns with the visual perceptual characteristics of the human eye.
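As referenced in the first contribution, the following is a minimal sketch of an adjacent-feature shuffle-fusion step. The actual AFSF module is specified in Section 3; here we only assume the basic mechanism described above: concatenate two adjacent feature maps, interleave their channels with a ShuffleNet-style channel shuffle, and mix them with a convolution. The class name and layer choices are hypothetical.

```python
# A minimal sketch of an adjacent-feature shuffle-fusion step: concatenate two
# adjacent feature maps, interleave their channels (ShuffleNet-style shuffle),
# and mix them with a convolution. The real AFSF module is defined in Section 3;
# this class is only an assumed illustration of the mechanism.
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across `groups` equally sized groups."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)


class ShuffleFuse(nn.Module):
    """Fuse two adjacent feature maps that share the same channel count."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_a, feat_b], dim=1)  # (N, 2C, H, W)
        x = channel_shuffle(x, groups=2)        # channels from the two inputs now alternate
        return self.mix(x)                      # back to (N, C, H, W)
```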
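Likewise, for the second contribution, the sketch below shows one plausible reading of the MSE + SSIM + TV loss with VGG16-driven adaptive weights. The exact AWA formulation is given in Section 3, so the softmax weighting, the chosen VGG16 layer, and the trade-off coefficients here are assumptions for illustration; the SSIM term assumes the third-party pytorch_msssim package.

```python
# One plausible reading of the MSE + SSIM + TV loss with adaptive weights derived
# from VGG16 feature-response magnitudes. The exact AWA rule is given in Section 3;
# the softmax weighting, VGG16 layer choice (up to relu3_3, without ImageNet
# normalization), and trade-off coefficients below are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchvision
from pytorch_msssim import ssim  # third-party SSIM implementation

vgg_features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)


def awa_weights(ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
    """Softmax over the mean VGG16 feature magnitude of each single-channel source image."""
    def response(x: torch.Tensor) -> torch.Tensor:
        return vgg_features(x.repeat(1, 3, 1, 1)).abs().mean()  # VGG16 expects 3 channels
    return torch.softmax(torch.stack([response(ir), response(vis)]), dim=0)


def tv_loss(x: torch.Tensor) -> torch.Tensor:
    """Total variation: mean absolute difference between neighboring pixels."""
    return (x[..., :, 1:] - x[..., :, :-1]).abs().mean() + (x[..., 1:, :] - x[..., :-1, :]).abs().mean()


def total_loss(fused, ir, vis, lam_ssim: float = 1.0, lam_tv: float = 0.1) -> torch.Tensor:
    """Weighted MSE + SSIM terms against each source, plus a TV smoothness term."""
    w_ir, w_vis = awa_weights(ir, vis)
    l_mse = w_ir * F.mse_loss(fused, ir) + w_vis * F.mse_loss(fused, vis)
    l_ssim = w_ir * (1 - ssim(fused, ir, data_range=1.0)) + w_vis * (1 - ssim(fused, vis, data_range=1.0))
    return l_mse + lam_ssim * l_ssim + lam_tv * tv_loss(fused)
```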
The remainder of this study is organized as follows: Section 2 offers a brief overview of previous research on the convolutional neural network backbone architecture, UNet. Section 3 provides a thorough explanation of the proposed AFSFusion, comprising the model architecture, backbone network, and loss functions. In Section 4, we present several experiments performed to demonstrate the superiority of the model and illustrate its excellent performance compared to other alternatives based on subjective and objective analyses. The main conclusions are outlined in Section 5.
5. Conclusions
In this study, we propose a novel adaptive fusion network model for IR and visible images based on an adjacent feature fusion approach, called AFSFusion, which employs a UNet-like network architecture. The model applies the designed AFSF module multiple times in the front section of the network to perform sequential feature extraction, fusion, and shuffling, allowing the input feature information to interact fully during feature extraction and enhancing the complementarity of the feature information. To create fused images that closely approximate human visual perception, the fusion network uses three distinct loss functions: MSE, SSIM, and TV. Additionally, an AWA strategy ensures that the corresponding pixels of the IR and visible images receive adaptive weighting values that optimize the fusion effect. By exploiting these improvements, the fusion network outputs high-quality images. Extensive experimental results demonstrate that the proposed AFSFusion network produces better fused images than other state-of-the-art fusion methods in terms of both subjective and objective metrics, confirming the effectiveness of the designed network architecture.
In future work, to enhance the overall versatility of the fusion network architecture, we plan to generalize the model to other fusion domains, including multi-exposure and multimodal medical imaging, visible light and millimeter-wave image fusion, and visible light and ultrasound image fusion. In addition, to improve network performance while reducing computational complexity, we will explore the Restormer structure [42], which introduces local receptive fields to limit the scope of the self-attention mechanism in each Transformer layer [43]. Specifically, this approach reduces computational cost while preserving spatial positional information, allowing the model to handle high-resolution images. It then downsamples the spatial dimensions of the input image to varying degrees, allowing the model to process multiscale information simultaneously. Finally, grouped linear attention mechanisms and short-range connections between different layers can be used to facilitate gradient propagation and further reduce the computational complexity of self-attention. With these changes, our AFSFusion should perform better on more metrics while reducing computational complexity.
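As a rough illustration of the local-attention idea mentioned above, the sketch below restricts self-attention to non-overlapping spatial windows so that its cost grows linearly with image size. This is a generic windowed-attention toy, not the Restormer block of [42], and it omits the downsampling stages, grouped linear attention, and short-range connections also discussed above.

```python
# A generic windowed (local) self-attention toy: attention is computed only inside
# non-overlapping s x s windows, so its cost grows linearly with image size. This
# is not the Restormer block of [42]; it only illustrates the local-receptive-field idea.
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    def __init__(self, dim: int, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # dim must be divisible by heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) with H and W divisible by the window size.
        n, c, h, w = x.shape
        s = self.window
        # Partition the feature map into (H/s)*(W/s) windows of s*s tokens each.
        tokens = (x.view(n, c, h // s, s, w // s, s)
                   .permute(0, 2, 4, 3, 5, 1)
                   .reshape(n * (h // s) * (w // s), s * s, c))
        out, _ = self.attn(tokens, tokens, tokens)  # attention restricted to each window
        # Reverse the window partition back to (N, C, H, W).
        return (out.reshape(n, h // s, w // s, s, s, c)
                   .permute(0, 5, 1, 3, 2, 4)
                   .reshape(n, c, h, w))
```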