Infrared and Visible Image Fusion Based on Visual Saliency Map and Image Contrast Enhancement

The purpose of infrared and visible image fusion is to generate images with prominent targets and rich information which provides the basis for target detection and recognition. Among the existing image fusion methods, the traditional method is easy to produce artifacts, and the information of the visible target and texture details are not fully preserved, especially for the image fusion under dark scenes and smoke conditions. Therefore, an infrared and visible image fusion method is proposed based on visual saliency image and image contrast enhancement processing. Aiming at the problem that low image contrast brings difficulty to fusion, an improved gamma correction and local mean method is used to enhance the input image contrast. To suppress artifacts that are prone to occur in the process of image fusion, a differential rolling guidance filter (DRGF) method is adopted to decompose the input image into the basic layer and the detail layer. Compared with the traditional multi-scale decomposition method, this method can retain specific edge information and reduce the occurrence of artifacts. In order to solve the problem that the salient object of the fused image is not prominent and the texture detail information is not fully preserved, the salient map extraction method is used to extract the infrared image salient map to guide the fusion image target weight, and on the other hand, it is used to control the fusion weight of the basic layer to improve the shortcomings of the traditional ‘average’ fusion method to weaken the contrast information. In addition, a method based on pixel intensity and gradient is proposed to fuse the detail layer and retain the edge and detail information to the greatest extent. Experimental results show that the proposed method is superior to other fusion algorithms in both subjective and objective aspects.


Introduction
The main purpose of image fusion is to generate a single image containing complementary information from multiple images of the same scene. In the field of multi-sensor image fusion, infrared and visible image fusion is an important technology. Since infrared sensors can capture thermal information about targets in a scene, some objects can be highlighted in weak light, occlusion and bad weather conditions. The visible image contains the texture details of the scene, and the images presented are more consistent with human visual perception in terms of intensity and contrast. These two types of images have complementary effects and their fusion can obtain more accurate and richer scene information. At present, this technology plays an important role in security monitoring, target detection, target recognition and so on [1][2][3][4][5].
Ma et al. investigated the field of infrared and visible image fusion [6]. The fusion algorithm can be divided into seven categories according to the theoretical way: multi-scale transform, sparse representation, neural network, subspace, dominance-based, hybrid model and other methods. Among them, the multi-scale transform algorithm is most widely used typically adopting Laplace pyramid transform [7] (LP), wavelet transform [8] (Haar), non-subsampled shearlet transform [9] (NSST) or non-subsampled contourlet transform [10] (NSCT). These algorithms are based on the Laplace pyramid method, which smoothes the image to a certain extent and loses some structure and detail information. At the same time, it may also produce artifacts that affect the quality of the fused image.
These problems have also attracted the attention of many researchers. In order to retain more detail information and structure information, some researchers proposed a method to optimize the objective function by imposing constraint conditions to achieve fusion. Huang et al. [11] reserved detail information based on gradient constraints and explicit constraints, highlighting the explicit goal. Li et al. [12] used contrast fidelity and sparse constraints for image fusion. There are also improved fusion methods based on multi-scale decomposition Chen et al. [13] proposed a target enhanced multi-scale transformation (MST) decomposition model by using linear programming to decompose the source image and fuse them on different scales, spatial resolution and decomposition layers. This method introduces a linear transformation function to control the weight of infrared images and fully retains the information of infrared prominent targets. Li et al. [14] proposed a multiscale fusion method of potential low-rank representation decomposition to extract global and local structural information from the source image, and fully tap the information of the source image. These methods focus on preserving image information and ignore the problem of artifacts caused by the loss of image information in multi-scale decomposition. Ding et al. [15] proposed a fusion method based on non-subsampled shearlet transform and sparse structural features. The sparsity based on structural information is used to guide the fusion of high-frequency coefficients, and the principal component analysis is used to fuse low-frequency information. This method has a good effect on extracting prominent target information. However, due to the abandonment of some low-frequency information in the principal component analysis, some information will be lost or artifacts will occur.
In recent years, the edge-preserving filter has been successfully applied to image fusion. Zhou et al. [16] proposed a multi-scale decomposition method based on Gaussian and bilateral filtering mixing, but this result will also smooth the contrast information. Ma et al. [17] proposed a multi-scale decomposition method based on the combination of Gaussian and rolling guided filter (RGF), which has a certain effect on artifact elimination. In recent years, the method based on explicitness has gradually emerged, and more people are keen to use the method of combining the multi-scale method with explicitness to fuse images. Zuo et al. [18] proposed an infrared and visible image fusion method that introduced region segmentation into the dual-tree complex wavelet transform (DTCWT) region. According to explicitness, the region of interest and the background region were separated, and the results were mapped to the fusion results.
Although information retention and artifacts elimination in image fusion are individually studied in the above methods, they have not been considered comprehensively at the same time. Based on the advantages and disadvantages of the above methods, this paper proposes an infrared and visible image fusion method based on a visual saliency map and image contrast enhancement on the basis of two typical scenes that affect the fusion effect, namely, dark scenes and smoke background. The advantages of the algorithm in this paper are as follows. In view of the difficulty caused by the low image contrast to the fusion, we propose an improved method of gamma correction and local mean to enhance the contrast of the input image. In view of the problem that artifacts are prone to occur in the process of image fusion, a DRGF method is adopted to decompose the input image into the basic layer and the detail layer. Compared with the traditional multi-scale decomposition method, this method can retain specific edge information and reduce the occurrence of artifacts. In order to solve the problem that the visible target of the fused image is not prominent and the texture detail information is not fully preserved, a method is proposed to extract the visible image. On the one hand, it is used to extract the visible image from the infrared image to guide the fusion image target weight. On the other hand, it is used to control the fusion weight of the basic layer to improve the shortcomings of the traditional 'average' fusion method to weaken the contrast information. In addition, a method based on pixel intensity and gradient is proposed to fuse the detail layer and retain the edge and detail information to the greatest extent.

Methods
The structure of the fusion algorithm is shown in Figure 1. Firstly, the contrast of the input image is enhanced by an improved gamma correction and local mean. Then, the infrared image is extracted to guide the target weight of the fusion image. Then, the input image is decomposed into the basic layer and the detail layer by using the DRGF multi-scale decomposition method. For the base layer, the fusion weight is controlled by using the dominant value. For the detail layer, the fusion method based on pixel intensity and gradient is used.

Image Contrast Enhancement
Low contrast images in a dark background will have a certain impact on the fusion quality [19]. Considering this aspect, this paper adds the image contrast enhancement algorithm to the fusion framework. The purpose of image contrast enhancement is to improve the visual effect of the image and enhance the contrast and clarity, which can be divided into spatial domain enhancement and frequency domain enhancement. The spatial domain enhancement algorithms include histogram equalization (HE), gamma transform, and logarithmic image processing model (LIP). The frequency domain enhancement methods include high-pass filtering, low-pass filtering and homomorphic filtering. Gamma transform [20] is mainly used for image correction, and the images with too high and too low gray levels are corrected accordingly to achieve the contrast enhancement effect. The simple form of this method is expressed as Equation (1).
P represents the image after contrast enhancement, L represents the image to be enhanced and here the de-noising image, and γ represents the correction degree. The smaller the value is, the brighter the image is. The gamma correction curve is shown in Figure 2. The value of γ is bounded by 1. For γ > 1, the high light part will be suppressed the low light part will be expanded, and the image will become dark as a whole. For γ < 1, the high light part is extended, the low light part is suppressed, and the whole image becomes bright. The results of gamma correction make the image brighter or darker as a whole. Our goal is to obtain an image with enhanced target and clear contrast. Therefore, this paper proposes an improved method of weighted local mean to improve the image contrast. The specific methods are as follows: where r is the size of the local window; H(i, j) is the local mean; α and β are the coefficients that control the tensile degree. The algorithm proposed in this paper is compared with histogram equalization, gamma correction, LIP and high-frequency enhanced filtering (HEF) algorithms. The contrast effect is shown in the Figure 3 below. The contrast areas are marked with rectangular boxes: From the comparison effect of local regions, the method in this paper has a good effect. HE direct use of image histogram to adjust the contrast is simple and effective, but this method in the image useful data contrast is very close when the effect is better, and the experimental image useful data contrast is not all close, so the effect is general; the LIP method can expand the low gray value part of the image, display more details of the low gray value part, compress the high gray value part, reduce the details of the high gray value part, so as to achieve the purpose of emphasizing the low gray part of the image, and enhance the sky part from the effect, which leads to the contrast between the sky and the tree. HEF contrast enhancement is obvious, but the red frame-out area weakens the highlight effect of the light and leads to excessive unnaturalness at the junction of the tree and the sky. Gamma correction corrects the image from the whole, and the overall effect is good, but the correction of the local area needs to be improved. The local effect of the improved method in this paper is obvious.
In addition, since the source image inevitably contains some effects such as noise, we added the method of robust principal component analysis (RPCA) to remove the noise of small structures [21]. Robust principal component analysis can effectively find out the 'main' elements and structures in the data, which is different from the principal component analysis. RPCA believes that the data matrix can be decomposed into two parts, the lowrank part and the sparse part, and the noise is sparse. Therefore, the de-noising effect can be achieved by retaining the low-rank part. RPCA principle formula is as Equation (5) : where I represents the source image, and L and S represent the low-rank and sparse parts after RPCA decomposition.

Visual Saliency Map Extraction
Through the observation of a large number of infrared images, it is found that due to the unique imaging characteristics of infrared images, the appearance of infrared images on the target is particularly prominent under the conditions of weak light and smoke occlusion. Especially under the conditions of smoke, the target in the visible light image may be hidden. Therefore, it is particularly important to guide the fusion image through the infrared target saliency map.
Zhai et al. proposed a saliency detection method [22], in which the saliency value of each pixel is the sum of the Euclidean distance between the brightness value of the pixel and the brightness value of other pixels. Assuming that I p represents the brightness value of pixel p in image I, the formula is as Equation (6).
where N represents the total number of pixels in image I. The algorithm is simple and effective. In this paper, the algorithm is introduced to extract the image saliency value, which is used to guide the weight value of the fused image saliency target and the weight coefficient of the fusion infrared image and visible image in the basic layer. Before extracting the saliency map, we add the filtering operation to avoid the influence of noise points. The formula is as follows: where P represents the enhanced infrared image, I SG the image after Gaussian filtering, j the pixel intensity, M j the number of pixel intensity values, C the gray level being set to 256 in this paper and I j the brightness value j.

Multi-Scale Decomposition Based on Improved Differential Rolling Guided Filter (DRGF)
The Gaussian filter is widely used in the field of multi-scale image processing. A Gaussian filter will filter out the noise and small structure in the source image, at the same time blurring the whole image and producing artifacts. The edge-preserving filter can retain the image boundary content and reduce artifacts. The most widely used edge-preserving filters are weighted least squares filter, bilateral filter and guided filter. The guided filter is to filter the initial image through a guided image so that the output image is generally similar to the initial image, but the texture part is similar to the guided image. The guided filter can not only realize the advantages of bilateral filtering to maintain the edge but also overcome the shortcomings of bilateral filtering in gradient deformation near the edge. Zhang et al. [23] improved the guided filtering and proposed an RGF technology, which has the characteristics of scale perception and edge preservation. This method includes small structure removal and edge recovery. RGF is composed of a Gaussian filter and guided filter, and the specific implementation is as follows: where, Gaussian(I, σ s ) represents Gaussian filtering, GuideFilter(J t , I, σ S , σ 2 r ) represents guided filtering, I represents the image to be filtered, σ s is used as a proportional parameter to control the Gaussian filtering removal scale. Represents the boot image, controls the distance weight, set to 0.05. RGF is iteratively implemented by Equations (9) and (10), which can be simply expressed as: where T is the number of iterations. The multi-scale decomposition formula of improved DRGF is: where P is the enhanced source image, M i is the filtering image of the i layer, D i is the detail layer of the i layer, N is the decomposition layer, M i as the basic layer, σ i s and σ r is the filtering parameter, this paper sets σ i s = 2σ i−1 s to control the image scale; T is the number of iterations. The multiscale decomposition structure is shown in Figure 4.

Fusion of Base Layers
The image base layer after multi-scale decomposition contains the basic information of the image. The traditional 'average' fusion rule is not sensitive to the contrast of the image, which will weaken the contrast of the image and weaken the prominent information. In Section 2.2, this paper proposes a method for extracting infrared dominant images. Equation (8) is used to calculate the dominant values of two images in the base layer, and then Equations (15) and (16) is used for image fusion in the base layer. The formula is as follows: where B R and B V represent the infrared image base layer and the visible image base layer, ω B the fusion weight of the infrared image base layer, S R and S V the obvious indigenous value of the infrared image base layer and the visible image base layer, BF the image after the fusion of the base layer. The "weighted average" method, the information entropy method and the regional average energy method are compared and the results are shown in Figure 5.

Fusion of Detail Layers
The multi-scale decomposition of the image detail layer contains a variety of scale details of the original image, and this subtle detail information is useful. Due to the different imaging principles of the two source images, the information contained in the detail layer is complementary. Therefore, the traditional fusion strategy of 'absolute value maximization' does not take into account the different characteristics of infrared and visible images. The fusion results may change the original image information and introduce irrelevant details and noise. The 'additive fusion' strategy will weaken the contrast of detail features. In order to avoid this phenomenon, we propose a method based on the combination of pixel intensity and gradient to fuse the detail layer. The pixel intensity can represent the energy information of the highlight detail part, and the gradient represents the highlight degree of the image. Combined with the advantages of the two methods, the detail highlights of the two source images can be fully retained. The specific formulas are as follows: In the expression, D V and D R denote the pixel values of the visible light image detail layer and infrared image detail layer, G V and G R denote the gradient values of visible light image detail layer and infrared image detail layer, ω D denotes the fusion weight of visible light image detail layer, DF denotes the fusion image detail layer.
In order to prove that our method can preserve detail information more effectively, it is compared with 'additive fusion', 'absolute value is larger' and the method in Reference [17] (VSM-WLS), and the results are shown in the Figure 6.
By means of the above methods, we obtain the fused basic layer, detail layer and infrared saliency map, and then overlay these three parts. Among them, the infrared saliency map is used to guide the infrared target of the fused image, so we give it a weight to avoid losing the original information of the image.
Finally, the basic layer, the detail layer and the infrared image are weighted and added by Formula (19).
where F is the final fusion image, and ω S is the fusion weight of the infrared dominant image.

Results and Analysis
The infrared and visible images in the experiment are four groups of images accurately registered from the TNO dataset. The image scene environment covers night, dark environment and dense fog environment. In order to verify the effectiveness of the algorithm in all aspects, this paper evaluates the fusion effect through subjective evaluation and objective evaluation. The comparison algorithms include curvelet transform (CVT) [24], DTCWT, Haar , LP, MSVD, NSCT, NSST, potential low-rank representation decomposition (LatLRR) [14], and VSM-WLS. The experimental platform is MATLAB 2018b (Core i7, clocked at 3.30 GHz, memory of 16 GB).

Parameter Setting
In this method, there are three adjustable parameters, namely α and β in Formula (3) and ω S in Formula (19). The corresponding fine-tuning can be made according to different application requirements. Here's how we determine the parameters: In Formula (3), α is the overall correction parameter, and its value affects the overall brightness and darkness of the image. If the value is too small, the whole image is bright and the contrast is reduced. On the contrary, if the value is too large, the image is too dark and some details are lost. Based on experience, the α value is 1.9.
β in Formula (3) and ω S in Formula (19) are both weight parameters. We introduce a nonlinear function to control the value parameter. The nonlinear function is shown in Formula (20).
where ξ is the independent variable of the function, and the range of ξ and S λ (ξ) is [0, 1]. The nonlinear transformation function under different λ values is shown in Figure 7. It can be seen that when λ gradually increases, the curve becomes steeper and steeper, and the nonlinear transformation is also gradually enhanced. Therefore, we can control the guiding weight of the infrared image by adjusting λ. Therefore, the two weight parameters can be expressed as: Among them, S R represents the normalized infrared image, H represents the mean of local gray level of normalized image, λ and λ β are used to adjust the weight according to different applications. If λ is too large, it will lead to too much infrared information and lose the original image information. On the contrary, too little infrared information cannot play a guiding role, so does λ β . The λ and λ β values in this article are set to 8 and 9, respectively.

Subjective Evaluation
The first set of data is the image of the intersection in the dark environment. The size of the source image is 575 × 475. The infrared image highlights the target information of pedestrians, street lights, cars and so on. The visible image can only highlight the information of billboards, street lights, signal lights and so on. The comparison results of different fusion algorithms are shown in Figure 8. The red box is selected where part of the target is amplified. The comparison results show that Haar, MSVD, NSST and LatLRR algorithms are too smooth, resulting in low image contrast after fusion. CVT, LP, NSCT algorithms appear artifacts around the pedestrian contour; VSM-WLS algorithm has relatively good effect, but for the prominent effect of the target it is inferior to the proposed algorithm, which can be observed from the amplified pedestrian effect.
The second set of data is the image in the dark scene, and the size of the source image is 360 × 270. The infrared image highlights the smoke, people and doors on the roof, and the visible image highlights the texture details of the trees, door contours and ground. The comparison results of different fusion algorithms are shown in Figure 9. Haar, MSVD, NSST, VSM-WLS algorithms have low image contrast after fusion, and CVT, LP, NSCT algorithms have artifacts on the edge of image characters after fusion, and LatLRR algorithm is not of rich information in the texture of trees and ground bricks. The proposed algorithm leads to good effect in artifacts, contrast, target, texture and so on. The third group of data is the image of the ship on the sea in the dark scene, and the size of the source image is 505 × 510. The comparison results of different fusion algorithms are shown in Figure 10. From the overall contrast, Haar, MSVD, NSST and LatLRR have low contrast and unclear image. From the perspective of highlighting the target, the proposed algorithm is the best.
The fourth group of data is an image in the dense fog scenario, and the size of the source image is 620 × 450. In the infrared source image, the target information can be clearly displayed without the influence of dense fog. In the visible light image, the texture details of the house and land are clear, but the target is hidden. The comparison results of different fusion algorithms are shown in Figure 11. In the dense fog scenario, the highlight of the target is particularly important. The selected part of the red box in the figure is the target hidden by the dense fog after amplification. It can be seen from the comparison figure that the algorithm in this paper has the most obvious highlight effect on the target. In terms of texture detail preservation, VSM-WLS algorithm is slightly better than the algorithm in this paper for details preservation of roof. However, in terms of ground texture, the algorithm in this paper retains more details than VSM-WLS algorithm. The details of roof are selected by blue box, and the ground details are selected by green box.

Objective Evaluation
At present, many researchers have conducted research on the image fusion evaluation index [25,26], which is mainly divided into four types, namely based on information theory, image features, image structure similarity and human perception measurement. This paper selects several evaluation indexes to verify the objective effect of the algorithm. The selected indicators are: information entropy (EN), structural similarity (SSIM), average gradient (AG) and standard deviation (SD).
The main function of EN is to measure the amount of information in the image. The greater the value of information entropy is, the greater the amount of information contained is, and the more retained details of the image are. The main function of SSIM is to measure the structural similarity between the fused image and the two source images. The larger the value, the closer the representation and the source image are, the more details are retained. The main function of AG is to reflect the clarity of the fused image, and also to reflect the small detail contrast and texture transformation characteristics in the image. The larger the value is, the clearer the image is; the larger the SD value, the higher the image quality and the clearer the image.
The four groups of fusion images based on different algorithms are quantitative evaluated and indexed as shown in Table 1. The best indicator in the table is marked by bold fonts, and the second value of the effect is marked by underline. The Table 1 (S1-S4) represent four test images. It can be seen from Table 1 that in terms of objective evaluation indexes, the proposed method has certain advantages in EN, SSIM, AG and SD. On the whole, most of the indexes of this algorithm are higher than those of other comparison algorithms. Although it does not rank first for EN in the second group, the SSIM and AG in the third group and the SSIM in the fourth group, it ranks second and leading. The algorithm in this paper takes into account the enhancement of image contrast. In terms of indicators, the AG and SD are basically in the leading position, especially the standard deviation index of the third group of images, which is 51.37% higher than that of the second group. By comprehensive comparison, this method performs well in objective evaluation.
Finally, the running time of different methods on four experimental images is provided in Table 2. Our method is at the intermediate level. CVT, LP and MSVD algorithms are the most basic algorithms, and the running time is relatively short. The running time of the proposed method is faster than that of NSST and LatLRR, and slightly lower than that of NSCT, but the fusion effect is better than NSCT. The algorithm in this paper is similar to the VSM-WLS algorithm, and both of them use the method of saliency map and multi-scale decomposition. However, the algorithm in this paper has been improved in the fusion algorithm of each layer of multi-scale decomposition, and many factors are considered. Therefore, the algorithm complexity is high, but the comprehensive effect is better than that of VSM-WLS. The algorithm in this paper needs to be improved in terms of time complexity, which is also the direction for further efforts.

Conclusions
This paper presents an infrared and visible image fusion method based on visual aboriginal image and image contrast enhancement. The contrast enhancement algorithm is used to improve the image contrast and the clarity of the target. The infrared salientindigenous image is used to guide the prominence of the salient-indigenous target in the fusion image. The differential rolling guidance filtering method is used to decompose the image into different layers, so as to facilitate the targeted fusion. In terms of the fusion strategy of the basic layer and the detail layer, the fusion weights are controlled by the salient-indigenous value and the fusion method based on the combination of pixel intensity and gradient is also proposed, which is targeted for fusion. The experimental results show that the retention of the target is very prominent, and the detail information retention is better than other algorithms. Since the algorithm in this paper is comprehensive and complex, the next step is to improve the speed.