Infrared and Visible Image Fusion for Highlighting Salient Targets in the Night Scene

The goal of infrared and visible image fusion in the night scene is to generate a fused image containing salient targets and rich textural details. However, existing image fusion methods fail to take the unevenness of nighttime luminance into account. To address this issue, an infrared and visible image fusion method for highlighting salient targets in the night scene is proposed. First, a global attention module is designed, which rescales the weights of different channels after capturing global contextual information. Second, the loss function is divided into the foreground loss and the background loss, forcing the fused image to retain rich texture details while highlighting the salient targets. Finally, a luminance estimation function is introduced to obtain the trade-off control parameters of the foreground loss function based on the nighttime luminance, which effectively highlights salient targets by retaining the foreground information from the source images. Experimental results and comparisons with other advanced methods adequately demonstrate the excellent fusion performance and generalization of the proposed method.


Introduction
Image fusion fuses multimodal images of the same scene to produce a fused image with richer information and clearer details [1,2]. Infrared and visible image fusion, as an indispensable branch of image fusion, has been applied in a wide range of fields [3][4][5]. The fused image can be utilized as a preprocessing unit for subsequent high-level visual tasks such as object detection, object tracking and semantic segmentation. As shown in Figure 1, in the night scene, the salient target of a human is mainly concentrated in the thermal radiation information of the infrared image. In addition, the uneven luminance of the night scene results in an uneven distribution of texture details in the infrared and visible images. Therefore, the fused image should effectively retain texture details while highlighting salient targets [6].
Constrained by the local receptive field of a single convolution layer, U2Fusion [7], IFCNN [8] and RXDNFuse [9] capture global contextual information by stacking plenty of convolution layers, resulting in a huge computational workload. GANMcC [10] uses the infrared and visible images as the inputs of the discriminator; it can retain the information in the source images but fails to highlight the salient targets. STDFusion [11] designs the loss function by considering only the infrared foreground and the visible background, possibly leading to the loss of some important information. In short, existing loss functions do not comprehensively supervise the training of the network.
To solve these problems, this paper proposes an infrared and visible image fusion method for highlighting salient targets in the night scene. The major contributions of this paper are as follows:
1. To reduce the computational burden caused by stacking convolution layers, a global attention module is added to the image fusion network. This module rescales the weights of different channels after capturing global contextual information. Assigning higher weights to the important channels makes the salient targets more prominent;
2. To improve the comprehensiveness of the loss function, the loss function of the proposed method is divided into the foreground loss and the background loss, forcing the fused image to contain more prominent salient targets and richer texture details;
3. To address the problem of uneven luminance in the night scene, a luminance estimation function is introduced into the training process. It outputs the trade-off control parameters of the foreground loss function based on the nighttime luminance, which helps to retain the foreground information from the source images more reasonably.
The remainder of this paper is arranged as follows. In Section 2, the reviews about infrared and visible image fusion methods and attention mechanism are presented. In Section 3, the loss function and structure of the proposed network are elaborated. The effectiveness and generalization of the proposed method are demonstrated in Section 4. Finally, the conclusion is expressed in Section 5.

Related Work
In this section, a literature review of infrared and visible image fusion methods and attention mechanisms is presented.

Infrared and Visible Image Fusion
Traditional image fusion methods usually use a fixed mathematical transformation to extract image features, possibly leading to a weak feature representation. Moreover, these methods do not take modal differences between multimodal images into account [12]. Deep learning-based image fusion methods, with a powerful representation capability, transform the fusion process into training different depths of neural networks.
Over the last few years, convolutional neural network-based image fusion methods have gained notable success. Xu et al. [13] adopted a dense connection structure to extract image features and added multiple jump connections between the dense blocks to increase the information flow. Tang et al. [14] proposed an image fusion network based on illumination perception, which could adaptively maintain the intensity distribution of salient targets according to light distribution.
Due to the capability of unsupervised distribution estimation, a GAN-based network is fit for image fusion tasks. Based on the former research [15], Ma et al. [16] proposed a GAN-based image fusion network with two discriminators. The infrared and visible images were adopted as the inputs of two discriminators, respectively, forcing the fused result to contain the information from both source images. Hou et al. [17] divided the input of generators into a gradient path and a contrast path and designed an adjusted image as the ground truth for the discriminator.
As the research progressed, a number of in-depth methods based on auto-encoders were proposed. Li et al. [18] adopted an encoder with a nest connection structure to fully exploit the deep features, and a hybrid attention module with channel and spatial attention was added to the fusion layer. Li et al. [19] proposed a special fusion layer with four modified residual fusion modules. The shallow and deep fusion modules were used to extract texture details and retain the significant structure, respectively. In particular, the coding and fusion networks were trained in stages.
Although excellent fusion results can be obtained using the above methods, they still fail to effectively highlight salient targets in the fused image. The loss function of the proposed method is divided into the foreground loss and the background loss, which helps to more effectively extract and retain texture details while highlighting the salient targets.

Attention Mechanism
The attention mechanism is of great significance in the human visual system. When faced with a complex scene, the human eye tends to focus more on the important targets and ignore the irrelevant information, which helps to analyze complex scenes quickly and efficiently [20]. The attention mechanism enables selective focus on important parts by reassigning weights to input sequences [21].
The attention mechanism has achieved great success in some fields, including target detection [22], image enhancement [23], semantic segmentation [24] and emotion recognition [25]. Recently, many scholars have applied the attention mechanism to image fusion tasks. For instance, Li et al. [26] introduced a multiscale channel attention network into the generator and discriminator, respectively, to help the network focus more on salient regions in the image. Li et al. [27] argued that both the channel and spatial dimensions contain rich attention information; therefore, the channel and spatial attention modules were combined for further feature extraction.
To break through the limitation of the local receptive field of the single convolution layer, Wang et al. [28] proposed a non-local attention module, which could model global contextual relationships to capture feature dependencies over long distance. However, multiple matrix multiplications were applied in this module, leading to inefficient learning of the network. In order to reduce the computational complexity, many scholars have proposed some variants based on the non-local attention module [29,30].
Despite the fact that the non-local attention module is effective in capturing global contextual information, it is unable to model the correlations between channels. Therefore, a global attention module was designed, which rescales the weights of different channels after modeling global contextual relationships.

The Proposed Method
In this section, the problem formulation and the core ideas are discussed. Then, the design of the loss function is presented. Finally, the structure of the image fusion network is described in detail.

Problem Formulation
A fused image can compensate for the lack of information from a single sensor [31]. The ideal result of the fused image is to retain rich texture details while highlighting the salient targets. As shown in Figure 2, in the training process, the source images were fed into the image fusion network which output the fused image. The luminance estimation function evaluated the luminance of all visible images in the training dataset and output the trade-off control parameters of the foreground loss function. Under the surveillance of the loss function, the fusion performance was improved.
In the testing process, the final fused image was obtained by feeding the source images into the image fusion network directly without any preprocessing. The design of the loss function and the image fusion network are described in detail as follows.
Figure 2. The entire procedure of the proposed method.

Loss Function
To obtain the fused result with salient targets and rich texture details, the loss function was divided into the foreground loss and the background loss. The foreground with salient targets and the background containing rich texture details were split from the images [17].
As shown in Figure 3, humans were defined as the salient targets in this paper. First of all, the semantic segmentation results of the infrared image were acquired using Mask R-CNN [32]. The mask for humans was kept and then binarized. As shown in Figure 3c,d, the foreground mask and the background mask were obtained, and the source images were split into the foreground and background using these masks [17]. It is worth mentioning that IR-Fore, VI-Fore, IR-Back and VI-Back represent the foregrounds and backgrounds of the infrared and visible images, respectively.
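As a minimal sketch of the split described above (the segmentation itself comes from Mask R-CNN and is not reproduced here), the foreground/background decomposition amounts to an element-wise multiplication with the binary mask:

```python
import numpy as np

def split_fore_back(image: np.ndarray, mask: np.ndarray):
    """Split an image into foreground and background with a binary mask.

    `mask` is 1 on salient (human) pixels and 0 elsewhere, as produced
    by binarizing the Mask R-CNN segmentation output.
    """
    fore = image * mask          # e.g., IR-Fore or VI-Fore
    back = image * (1 - mask)    # e.g., IR-Back or VI-Back
    return fore, back

# Tiny 2x2 example: one salient pixel at the top-left.
img = np.array([[0.8, 0.2], [0.4, 0.6]])
m = np.array([[1, 0], [0, 0]])
fore, back = split_fore_back(img, m)
# fore keeps only the masked pixel; back keeps the rest, and
# fore + back reconstructs the original image exactly.
```

Because the two masks are complementary, the foreground and background always sum back to the source image, so no pixel is lost by the split.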

In Figure 3, most of the salient target information is concentrated in the infrared foreground in the night scene. However, the visible foreground also contained some salient target information. The lower the nighttime luminance, the less salient target information was contained in the visible foreground. Therefore, a luminance estimation function was introduced to evaluate the nighttime luminance. The luminance estimation function is formulated in Equations (1) and (2).
where β1 and β2 represent the trade-off control parameters of the foreground loss function, N denotes the number of visible images in the training dataset and p_n denotes the normalized average grey level value of the nth input image. According to Equation (1), the lower the nighttime luminance, the smaller the value of β2.
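Since Equations (1) and (2) themselves are not reproduced in this excerpt, the following is only a hypothetical sketch of the behaviour they describe — p_n as the normalized average grey level of each visible image, and a trade-off parameter β2 that shrinks as the nighttime luminance drops; the exact formulas are assumptions:

```python
import numpy as np

def luminance_estimate(visible_images):
    """Hypothetical sketch of the luminance estimation function.

    Assumed behaviour (the paper's Equations (1)-(2) are not shown here):
    beta2 tracks the dataset-average normalized grey level, so darker
    training data yields a smaller beta2; beta1 is taken as its
    complement, another assumption.
    """
    # p_n: normalized average grey level of each 8-bit image, in [0, 1].
    p = [img.mean() / 255.0 for img in visible_images]
    beta2 = float(np.mean(p))
    beta1 = 1.0 - beta2
    return beta1, beta2

dark = [np.full((4, 4), 20.0)]     # a very dark night image
bright = [np.full((4, 4), 200.0)]  # a brighter scene
b1_dark, b2_dark = luminance_estimate(dark)
b1_bright, b2_bright = luminance_estimate(bright)
# The darker set produces the smaller beta2, matching the description.
```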
The foreground loss function contains the foreground pixel loss L_pixel-fore and the foreground gradient loss L_grad-fore, which are defined in Equations (3) and (4).
where I_m, I_f, I_ir and I_vi represent the foreground mask, the fused image, the infrared image and the visible image, respectively. H and W denote the height and width of the image, |·| indicates the mean absolute error, and ∇ and · denote the gradient operator and the element-wise multiplication operator, respectively. The fused results for different values of β1 and β2 are analyzed in Section 4.
The background luminance was uneven in the night scene, which led to an uneven distribution of texture details in the source images as well. When the luminance was high, most of the texture details were concentrated in the visible background; as the luminance decreased, most of the texture details were concentrated in the infrared background. Therefore, for the background loss function, the highest pixel and gradient intensity distributions of the source images were chosen. Equations (5) and (6) define the background pixel loss L_pixel-back and the background gradient loss L_grad-back.
where (1 − I_m) represents the background mask and max(·) denotes the element-wise maximum selection. The total loss function L is defined in Equation (7).
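The foreground and background losses can be sketched as follows. Since Equations (3)-(7) are not reproduced in this excerpt, the β1/β2 weighting of the two foreground terms, the forward-difference gradient operator and the unweighted sum in the total loss are all assumptions consistent with the surrounding description:

```python
import numpy as np

def grad(img):
    # Forward-difference gradient magnitude; one concrete choice for the
    # paper's ∇ operator (the exact operator is not specified here).
    gx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
    gy = np.abs(np.diff(img, axis=0, append=img[-1:, :]))
    return gx + gy

def mae(x):
    # |·|: mean absolute error over all H x W pixels.
    return np.abs(x).mean()

def total_loss(I_f, I_ir, I_vi, I_m, beta1, beta2):
    """Sketch of Equations (3)-(7) under the stated assumptions."""
    back = 1 - I_m
    # Foreground: pull the fused foreground toward both source
    # foregrounds, weighted by the luminance-dependent parameters.
    L_pix_fore = beta1 * mae(I_m * (I_f - I_ir)) + beta2 * mae(I_m * (I_f - I_vi))
    L_grad_fore = (beta1 * mae(I_m * (grad(I_f) - grad(I_ir)))
                   + beta2 * mae(I_m * (grad(I_f) - grad(I_vi))))
    # Background: keep the highest pixel / gradient intensity of the sources.
    L_pix_back = mae(back * (I_f - np.maximum(I_ir, I_vi)))
    L_grad_back = mae(back * (grad(I_f) - np.maximum(grad(I_ir), grad(I_vi))))
    return L_pix_fore + L_grad_fore + L_pix_back + L_grad_back

rng = np.random.default_rng(0)
I_ir = rng.random((4, 4))
I_vi = rng.random((4, 4))
I_m = np.ones((4, 4))  # all-foreground mask for a quick sanity check
# With beta1 = 1, beta2 = 0 and the fused image equal to the infrared
# image, every term vanishes, so the loss is zero.
loss_zero = total_loss(I_ir, I_ir, I_vi, I_m, 1.0, 0.0)
loss_general = total_loss(I_vi, I_ir, I_vi, I_m, 0.5, 0.5)
```

The sanity check at the end mirrors the intended supervision: when β2 → 0 (a very dark scene), the foreground loss is minimized by copying the infrared foreground.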

Network Structure
As shown in Figure 4, the image fusion network consisted of three parts: the feature extraction module, the global attention module and the image reconstruction module. First, the feature extraction module extracted the features of the source images. Then, the global attention module aggregated the global contextual information and assigned higher weights to the important channels. Finally, the image reconstruction module output the final result by reconstructing all the feature maps concatenated in the channel dimension.

Feature Extraction Module
Since different types of source images had modal differences, the input was split into two paths. The feature extraction module for each path was designed with the same structure and respective parameters.
The feature extraction module was composed of a shallow and deep feature extraction module. In particular, the shallow feature extraction module contained two convolution layers and an activation function layer. There were two convolution blocks in the deep feature extraction module which further extracted the image features adequately. Furthermore, each block contained a convolution layer and an activation function layer. A dense connection structure was applied between each block to ensure the full utilization of information. It is necessary to note that the padding of each convolution layer was set to SAME and the stride was set to one. The parameter settings of the feature extraction module are shown in Table 1.
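The dense connections and SAME-padded convolutions described above can be illustrated with a shape-level sketch; the channel counts are hypothetical (Table 1 is not reproduced here), and a 1 × 1 stub convolution stands in for the actual "Conv + activation" blocks so that, like the paper's SAME-padded, stride-one layers, the spatial size is preserved:

```python
import numpy as np

def conv_block(x, out_channels, rng):
    """Stub for one 'Conv + activation' block.

    A 1x1 convolution (channel mixing) is used here, so H x W is
    preserved, just as with the paper's SAME padding and stride of one.
    The weights are random stand-ins for learned parameters.
    """
    w = rng.standard_normal((out_channels, x.shape[0]))
    y = np.einsum('oc,chw->ohw', w, x)
    return np.maximum(y, 0.0)  # ReLU-style activation

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))   # shallow features: C=16, H=W=8 (hypothetical)
b1 = conv_block(x, 32, rng)           # first deep block
# Dense connection: the second block sees the concatenation of the
# input and every previous block's output.
b2 = conv_block(np.concatenate([x, b1], axis=0), 32, rng)
features = np.concatenate([x, b1, b2], axis=0)
# Spatial size is unchanged; channels grow from the dense
# concatenation: 16 + 32 + 32 = 80.
```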

Global Attention Module
Two global attention modules were introduced after the shallow and deep feature extraction modules. First, the global context was modeled to capture long-range feature dependencies. Later, the correlation between each channel was built through a fully connected layer. The structure of the global attention module is shown in Figure 5.

For the input feature Φ_I ∈ R^(C×H×W), Φ_I was first reshaped and transposed to Φ_X ∈ R^(C×HW). Then, Φ_I was reshaped to Φ_Y ∈ R^(HW×1×1) after a 1 × 1 convolution layer. A softmax operation was applied to calculate the attention map A ∈ R^(HW×1×1), which is defined by:

A_i = exp(Φ_Y,i) / Σ_j exp(Φ_Y,j), i ∈ [1, HW].

Subsequently, a matrix multiplication between Φ_X and A was performed, obtaining the global context feature Φ_G ∈ R^(C×1×1), which is defined by:

Φ_G = Φ_X × A,

where × denotes the matrix multiplication operation. After obtaining the global context dependencies, the channel relationship was modeled with a fully connected layer, and a Sigmoid operation was applied to obtain the global attention map G ∈ R^(C×1×1), which is defined by:

G = σ(f_c(Φ_G; W)),

where f_c(·) denotes the fully connected layer, W is the parameter of the fully connected layer and σ(·) denotes the Sigmoid operation. Finally, the output feature map Φ_O ∈ R^(C×H×W) was obtained by applying the corresponding pixel multiplication between G and Φ_I, broadcast over the spatial dimensions:

Φ_O = G ⊗ Φ_I,

where ⊗ denotes the corresponding pixel multiplication operation.
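The whole module can be sketched in a few lines of NumPy; the learned 1 × 1 convolution and fully connected layer are replaced by hypothetical weight matrices, so this is a shape-faithful illustration rather than the trained module:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def global_attention(phi_I, W_conv, W_fc):
    """NumPy sketch of the global attention module.

    W_conv (1 x C) plays the role of the 1x1 convolution producing
    Phi_Y, and W_fc (C x C) the fully connected layer; both are
    random stand-ins for learned parameters.
    """
    C, H, W = phi_I.shape
    phi_X = phi_I.reshape(C, H * W)            # Phi_X in R^(C x HW)
    phi_Y = (W_conv @ phi_X).reshape(H * W)    # 1x1 conv -> R^(HW)
    A = softmax(phi_Y)                         # spatial attention map
    phi_G = phi_X @ A                          # global context, R^C
    G = sigmoid(W_fc @ phi_G)                  # channel weights in (0, 1)
    return G[:, None, None] * phi_I            # rescale each channel

rng = np.random.default_rng(0)
phi_I = rng.standard_normal((4, 3, 3))
out = global_attention(phi_I,
                       rng.standard_normal((1, 4)),   # stub 1x1 conv
                       rng.standard_normal((4, 4)))   # stub fc layer
# The output keeps the input shape; every channel is uniformly scaled
# by its attention weight.
```

Note that the spatial attention is computed once from the global context and then applied per channel, which is what avoids the repeated matrix multiplications of the non-local module.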

Image Reconstruction Module
The image reconstruction module reconstructed the whole feature maps concatenated in the channel dimension and output the final result. There were four convolution blocks in this module. In particular, each block contained a convolution layer and an activation function layer. Moreover, the padding of each convolution layer was set to SAME and the stride was set to one. The image fusion network did not change the size of the input. The parameter settings of the image reconstruction module are shown in Table 2.

Experiments
The effectiveness and generalization of the proposed method are demonstrated in this section. First of all, the experimental details are presented. Then, the experimental results of two datasets are discussed in detail. Finally, ablation experiments concerning the global attention module, the feature extraction module and the trade-off control parameter are presented.

Training Dataset
The MSRS dataset [14] was chosen to train the network model. This dataset is a mixture of daytime and nighttime road-related scenes, containing various elements such as pedestrians, vehicles and buildings. One hundred and eighty-eight pairs of nighttime images were selected from the dataset. The foreground mask and the background mask were obtained through the modified semantic segmentation network Mask R-CNN [32]. Each selected image was cropped into image patches with a size of 120 × 120 by setting the stride to 64, resulting in 13,160 patch pairs for the network training. In the testing process, 26 and 16 pairs of images were chosen from the MSRS dataset and the M3FD dataset, respectively, for comparison experiments. Note that during the testing process, the input did not require any preprocessing.
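The number of patches produced by such a sliding crop follows from the window/stride arithmetic; the dataset's native resolution and exact cropping policy (e.g., how border crops are handled) are not stated here, so the figures below are purely illustrative:

```python
def patch_count(h, w, patch=120, stride=64):
    # Patches from an h x w image with a sliding window and no padding
    # (whether border crops are padded is an open assumption).
    return ((h - patch) // stride + 1) * ((w - patch) // stride + 1)

# Illustrative example: a 480 x 640 image with 120 x 120 patches and a
# stride of 64 yields a 6 x 9 grid of patches.
per_image = patch_count(480, 640)  # 54 patches
```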

Results of the MSRS Dataset
For assessing the fusion performance of the proposed method in a comprehensive manner, the MSRS dataset was chosen for comparison with six state-of-the-art methods.

Qualitative Results
For visually comparing the fusion performance of different methods, four pairs of images from the MSRS dataset were chosen for qualitative comparison. Notably, in Figures 6-9, red and green boxes were used to mark the salient target and texture details, respectively. All methods achieved excellent performance in retaining both salient targets and texture details. Specifically, as shown in Figure 6, DenseFuse [34], U2Fusion [7], GANMcC [10] and STDFusion [11] were able to highlight the salient targets well, but the texture details of the building's windows in the background were not clear. GTF [33] and RFN-Nest [19] retained the thermal radiation information well, but the edges of the salient targets were blurred.
For the other three scenes, the proposed method effectively highlighted the salient targets. In addition, it also had some advantages in retaining texture details in the background. To be specific, our fused result had the sharpest detail of the tree branches in Figure 7. As shown in Figure 8, the outline of the car wheels was clear. Moreover, the tree trunk could be clearly distinguished from its surroundings in Figure 9.
The above results indicated that the proposed method had better capabilities in highlighting salient targets and preserving rich texture details.

Quantitative Results
For further evaluating the effectiveness of different methods, 26 pairs of images from the MSRS dataset were chosen for quantitative comparison. As shown in Figure 10 and Table 3, the proposed method had a remarkable advantage in five metrics: EN, MI, SD, SF and Qab/f. The value of the VIF for the proposed method was second to that of RFN-Nest by a narrow margin. Specifically, the best results for EN and MI showed that the fused results contained a wealth of information transferred from the source images. The maximum values of the SD and SF suggested that our fused image had a relatively high contrast ratio. Even though the value of the VIF was not the best, the fused result still effectively highlighted the salient targets and preserved rich texture details.


Results of the M3FD Dataset
For further evaluating the generalization of the proposed method, the M3FD dataset was used to test the network model trained on the MSRS dataset.

Qualitative Results
A pair of images was chosen from the M 3 FD dataset for qualitative comparison. As shown in Figure 11, all methods retained the infrared thermal radiation information as well as the texture details. Specifically, the fused images of the proposed method were rich in thermal radiation information. Both near and distant targets could be easily distinguished from the background. Compared with other methods, the textural details of the proposed method were much clearer and more suitable for visual observation, for instance the detail of the building in Figure 11.

The excellent results demonstrate the generalization of the proposed method. To sum up, the fused results were effective in highlighting salient targets when applied in the night scene, and the texture details were clearly visible.

Quantitative Results
Sixteen pairs of images from the M3FD dataset were chosen for quantitative comparison. As shown in Figure 12 and Table 4, the proposed method obtained the maximum mean values for four evaluation metrics: EN, MI, VIF and Qab/f. This indicated that our fused images were rich in information and more suitable for observation with the human visual system. Although the proposed method did not achieve the best results for the SD and SF, the fused images still had a sufficiently high contrast. The above experimental results adequately prove the excellent fusion performance and outstanding generalization of the proposed method.

Global Attention Module
The perceptual field of a single convolution layer is limited, so contextual information can otherwise only be captured by stacking plenty of convolution layers. To avoid the huge computational effort this entails, the global attention module was designed. This module modeled the global contextual relationships and then rescaled the weights of different channels to make salient targets more prominent. For confirming the usefulness of the global attention module, three ablation experiments were designed: removing the first global attention module, removing the second, and removing both.
One hundred pairs of images from the M3FD dataset were chosen for qualitative and quantitative analyses. In Figure 13, the fused result with a single global attention module showed relatively blurred salient targets, and the fused result without any global attention module performed the worst in highlighting salient targets. In Table 5, the maximum values for five evaluation metrics were achieved when both global attention modules were retained. Therefore, the global attention module assigned higher weights to the significant channels, effectively highlighting the salient targets. To further verify its advancement, the global attention module was replaced with SENet [41] and ECANet [42]. As shown in Figure 13, all of these attention modules effectively highlighted the salient targets.
Thus, it was difficult to judge the superiority of these modules by human visual inspection alone, so 100 pairs of images from the M3FD dataset were utilized for further quantitative comparison. In Table 6, all evaluation metrics of the proposed module achieved the maximum values, confirming its advancement. The global attention module effectively avoided the high computational complexity caused by stacking convolution layers. To further verify the effect of the number of convolution blocks in the deep feature extraction module, two additional ablation experiments were designed, adopting three or four convolution blocks in the deep feature extraction module. The output channels of the third and fourth convolution layers were set to 96 and 128, respectively, and the other parameters were kept as before. One hundred pairs of images from the M3FD dataset were chosen for quantitative analysis. As shown in Table 7, the proposed method achieved the maximum values for four evaluation metrics: EN, MI, SD and VIF. Note that the evaluation metrics did not rise with the increase in convolution blocks, suggesting that the larger network was possibly overfitting. Therefore, only two convolution blocks were adopted in the deep feature extraction module.
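The channel-rescaling idea behind such attention modules can be illustrated with a squeeze-and-excitation-style sketch: global average pooling summarizes each channel's global context, a small bottleneck produces per-channel gates in (0, 1), and the feature map is rescaled channel-wise. The weight shapes and bottleneck ratio below are illustrative assumptions, not the architecture proposed in this paper:

```python
import numpy as np

def channel_attention(feat: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """SE-style channel rescaling on a (C, H, W) feature map.

    Global average pooling captures per-channel global context, a
    two-layer bottleneck (ReLU then sigmoid) produces per-channel
    weights in (0, 1), and the input is rescaled by those weights.
    """
    squeeze = feat.mean(axis=(1, 2))                  # (C,) global context
    hidden = np.maximum(w1 @ squeeze, 0.0)            # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))      # sigmoid gates, (C,)
    return feat * gates[:, None, None]                # channel-wise rescaling

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2                 # r: hypothetical reduction ratio
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))   # squeeze:  C -> C/r
w2 = rng.standard_normal((C, C // r))   # excite:   C/r -> C
out = channel_attention(feat, w1, w2)
print(out.shape)   # → (8, 4, 4)
```

Because the gating is computed from a single global pooling step rather than from many stacked convolutions, the cost of capturing global context stays small, which matches the motivation stated above.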

Trade-Off Control Parameters
To retain the salient targets and rich texture details at the same time, the loss function was divided into the foreground loss and the background loss. The trade-off control parameters of the foreground loss were α1 and α2, where α1 + α2 = 10. Fixing the other parameters, we set α1 = 1, 2, . . . , 9. One hundred pairs of images from the M3FD dataset were selected for analysis. The variation trends in the evaluation metrics under different values of α1 can be observed in Figure 14. In Table 8, the best results were obtained for all evaluation metrics except Qab/f when α1 was set to 4. By evaluating the luminance of all visible images in the training dataset, the luminance estimation function output the trade-off control parameters β1 and β2, where β1 + β2 = 1. Fixing the other parameters in the network, we set β1 = 0.1, 0.2, . . . , 0.9. One hundred pairs of images from the M3FD dataset were selected for qualitative and quantitative analyses. In Figure 15, the variation trends in the evaluation metrics under the different values of β1 are shown. As shown in Table 9, our proposed method acquired the best results for four evaluation metrics: EN, MI, SD and VIF. To further verify the validity of the luminance estimation function, the experiment was continued with a reduced step length: we set β1 = 0.61, 0.62, . . . , 0.69 and kept the other parameters in the network unchanged.
One hundred pairs of images from the M3FD dataset were chosen for quantitative analysis using the above 10 sets of ablation experiments. In Figure 16, the variation trends in the evaluation metrics under the different values of β1 are shown. As shown in Table 10, our proposed method acquired the best results for four evaluation metrics: EN, SD, SF and VIF. In summary, the luminance estimation function could effectively evaluate the luminance of the night scene and output a reasonable value of β1. As shown in Figure 17, the salient targets were very prominent and the details of the building were very distinct when α1 and β1 were set to 4 and 0.66, respectively. All of the above experiments adequately demonstrate the rationality of the parameter settings in this paper.
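The role of the luminance estimation function can be sketched as follows: darker night scenes should shift the foreground trade-off toward the infrared image (larger β1), while well-lit scenes should rely more on the visible foreground. The linear mapping and bounds below are purely hypothetical placeholders; the paper's actual estimation function is not reproduced here:

```python
import numpy as np

def estimate_beta(visible: np.ndarray, lo: float = 0.1, hi: float = 0.9):
    """Hypothetical luminance estimation for the foreground trade-off.

    Maps the mean luminance of an 8-bit visible image to beta1, with
    beta1 + beta2 = 1. Darker scenes yield a larger beta1, weighting
    the infrared foreground more heavily. Illustrative only.
    """
    lum = visible.mean() / 255.0                      # mean luminance in [0, 1]
    beta1 = float(np.clip(hi - (hi - lo) * lum, lo, hi))
    return beta1, 1.0 - beta1

dark = np.full((32, 32), 30, dtype=np.uint8)      # dim night scene
bright = np.full((32, 32), 220, dtype=np.uint8)   # well-lit scene
print(estimate_beta(dark))     # darker scene -> larger beta1
print(estimate_beta(bright))   # brighter scene -> smaller beta1
```

Whatever its exact form, the key property the ablations above verify is this monotone dependence of β1 on scene luminance, with the optimum near β1 = 0.66 on the M3FD night scenes.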

Conclusions
This paper proposed an infrared and visible image fusion method for highlighting salient targets in the night scene. First, a global attention module was designed, which rescaled the weights of different channels after capturing the global contextual information. Second, the loss function was divided into the foreground loss and the background loss, forcing the fused image to retain rich texture details while highlighting the salient targets. Moreover, a luminance estimation function was introduced to obtain the trade-off control parameters of the foreground loss function based on the nighttime luminance. It effectively highlighted salient targets by retaining foreground information from the source images. In summary, the effectiveness and generalization of the proposed method were demonstrated through comparisons with other advanced methods.

Funding:
This work was supported in part by the Jilin Provincial Development and Reform Commission's special project for innovation ability construction (infrared image super-resolution and detail enhancement system research and development).

Conflicts of Interest:
The authors declare no conflict of interest.