Infrared and Visible Image Fusion Based on Co-Occurrence Analysis Shearlet Transform

Abstract: This study, based on the co-occurrence analysis shearlet transform (CAST), effectively combines latent low-rank representation (LatLRR) and the regularization of zero-crossing counting in differences to fuse heterogeneous images. First, the source images are decomposed by the CAST method into base-layer and detail-layer sub-images. Secondly, for the base-layer components with larger-scale intensity variation, LatLRR, a valid method for extracting salient information from image sources, is applied to generate a saliency map that implements the weighted fusion of the base-layer images adaptively. Meanwhile, the number of zero crossings in differences, a classic quantity in optimization, is designed as the regularization term to construct the fusion of the detail-layer images. By this method, the gradient information concealed in the source images can be extracted as fully as possible, so the fused image possesses richer edge information. Compared with other state-of-the-art algorithms on publicly available datasets, the quantitative and qualitative analyses of the experimental results demonstrate that the proposed method outperforms them in enhancing contrast and achieving a faithful fusion result.


Introduction
Image fusion, as a main part of image enhancement technology, aims to assimilate the abundant and valid detail information of heterogeneous source images to construct a fused image with rich and interesting information [1][2][3][4]. Among multi-sensor image fusion tasks, research on the fusion of infrared and visible images is the most common [5]. A fused image that is robust and rich in scene information is significant for many applications such as surveillance, remote sensing, human perception and computer vision tasks [6]. Although visible images possess high-resolution texture details, under poor conditions such as low illumination, smoke and occlusion, the quality of the visible images is hardly satisfying, and some important target information is lost [7]. Infrared sensors, however, generate an image by capturing the heat radiation from objects, so the salient information of a target in a complicated scene can be actively obtained. Therefore, infrared images are usually selected to compensate for the visible image under the poor conditions described above. By effectively combining the complementary information from these two kinds of images of the same scene, the limitations of human visual characteristics can be mitigated, and the usable range of the human visual band can be extended greatly [2,8].
Multi-scale transform (MST), as the most general method, is adopted in image fusion applications by numerous researchers with visual fidelity, lower computational complexity and higher efficiency [3,9]. Generally speaking, the fusion method based on the MST

Related Work
In this section, for a comprehensive review of the algorithms most relevant to this study, we focus on the co-occurrence filter, directional localization, latent low-rank representation, and counting the zero crossings in differences.

Co-Occurrence Filter
The co-occurrence filter (COF) [27] assigns weight values in accordance with the co-occurrence information, so that the weight of frequently occurring texture information is reduced and smoothed, while the weight of infrequently occurring edge information is increased. In this way, the idea of edge detection is applied directly within the filtering process, combining edge detection and edge preservation. Figure 1 shows a comparison of an image before and after co-occurrence filtering. Based on the bilateral filter, the COF introduces the normalized co-occurrence matrix in place of the range Gaussian [4,16]. The definition of the COF is as follows: where J_out and I_in denote the output and input pixel values, respectively, and a, b are pixel indexes; G_σs(a, b) · M(a, b) is the weighting term, measuring the contribution of pixel b to output pixel a; G_σs(a, b) is the Gaussian filter; and M(a, b) is a 256 × 256 matrix for general gray-scale images, computed from the following formula: where C(p, q) is called the co-occurrence matrix and is used to collect the co-occurrence information of the original images; h(p) and h(q) denote the histograms of pixel values, corresponding to the frequency statistics of pixels a and b [28].
The co-occurrence matrix can be obtained by the following formulas: where σ is the parameter of the COF, which controls the filtering of edge texture in the image and is usually set to 15; the symbol [·] represents a truth-value relationship: if the Boolean expression in the bracket is true, its value is 1, otherwise 0; and d(a, b) is the Euclidean distance from pixel a to pixel b in the image plane. Thus, the COF performs excellently in smoothing edges within a texture area while preserving the boundaries across texture areas. Gaussian white noise and checkerboard textures are common in the figure, so the co-occurrence matrix gives them higher weight. In contrast, because the borders between dark and shallow areas appear less frequently, they are given relatively low weight. The COF can remove noise and smooth texture areas while clearly retaining the boundaries between different texture areas. Since this edge-preserving filter does not blur edges when decomposing the image, and introduces no ringing effects or artifacts, it offers good spatial consistency and edge retention [21].
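The weighting scheme above can be sketched in a few lines of numpy. This is a minimal illustration under simplifying assumptions: the co-occurrence statistics are collected from horizontal and vertical neighbor pairs only, and the function and parameter names are ours, not from the paper's code.

```python
import numpy as np

def cooccurrence_filter(img, sigma_s=3.0, radius=5, levels=256):
    """Minimal sketch of a co-occurrence filter (COF) for a 2-D uint8 image.

    The co-occurrence matrix C is collected from neighboring pixel pairs,
    normalized by the histograms h(p), h(q) to form M, and M replaces the
    range Gaussian of a bilateral filter.
    """
    h, w = img.shape
    # --- collect co-occurrence statistics from horizontal/vertical neighbors ---
    C = np.zeros((levels, levels), dtype=np.float64)
    pairs = [(img[:, :-1], img[:, 1:]), (img[:-1, :], img[1:, :])]
    for a, b in pairs:
        np.add.at(C, (a.ravel(), b.ravel()), 1.0)
        np.add.at(C, (b.ravel(), a.ravel()), 1.0)  # keep the counts symmetric
    hist = np.bincount(img.ravel(), minlength=levels).astype(np.float64)
    # normalized co-occurrence matrix M(p, q) = C(p, q) / (h(p) h(q))
    denom = np.outer(hist, hist)
    M = np.divide(C, denom, out=np.zeros_like(C), where=denom > 0)

    # --- spatial Gaussian window G_sigma_s ---
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    G = np.exp(-(xx**2 + yy**2) / (2 * sigma_s**2))

    out = np.zeros((h, w), dtype=np.float64)
    pad = np.pad(img, radius, mode="reflect")
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # spatial weight times co-occurrence weight M(I(a), I(b))
            wgt = G * M[img[i, j], patch]
            s = wgt.sum()
            out[i, j] = (wgt * patch).sum() / s if s > 0 else img[i, j]
    return out
```

Because M is large only for frequently co-occurring gray levels, pixels on the rare side of a boundary contribute almost nothing to the weighted average, which is exactly the edge-preserving behavior described above.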

Directional Localization
In view of the advantages of the local shearlet filter, such as removing block effects, attenuating the Gibbs phenomenon, and improving convolution efficiency in the time domain, it is applied to the high-frequency images to realize directional localization [29]. The decomposition framework is shown in Figure 2. The detail-layer image decomposition proceeds as follows: (1) coordinate mapping from pseudo-polar coordinates to Cartesian coordinates; (2) construction of the small-size shearlet filter based on the "Meyer" equation; (3) convolution of the k band-pass detail-layer images with the "Meyer" equation [26].

Latent Low Rank Representation
Visual saliency detection applies intelligent processing algorithms to simulate the bionic mechanism of human vision and to analyze the salient targets and areas in a scene. The saliency map consists of weight information about the image gray values [30]: the higher the gray value, the greater its saliency, and the larger the weight allocated during image fusion. In view of the above, more and more scholars apply saliency detection to image fusion. For example, latent low-rank representation (LatLRR) is commonly used to extract salient features from the original image data [22]. The LatLRR problem can be solved by minimizing the following optimization function: where X denotes the data matrix of the original image, R is a low-rank component matrix, L denotes the projection matrix of the saliency component, and E represents a sparse noise matrix [5]; λ > 0 is a regularization parameter, usually chosen by cross-validation; ‖·‖_* is the nuclear norm and ‖·‖_1 denotes the ℓ1-norm. The projection matrix L can be obtained by the LatLRR method, and the saliency parts of the original image can be calculated from it. The result of the LatLRR method is shown in Figure 3. First, select an n × n window. Then, in the horizontal and vertical directions, move the window by S pixels at each stride. Via this sliding window, the original image is partitioned into many image patches. Finally, a new matrix is acquired in which all the image patches are reshuffled and every column corresponds to an image patch [5]. In Equation (6), the saliency part I_d is solved via the projection matrix L, the preprocessing matrix P(·), and the source data I.
The result of the decomposition is defined as V_d, and the function P(·) comprises the window sliding, image partitioning, and reshuffling techniques. R(·) denotes the reconstruction of the saliency image from the detail part. When recovering V_d, overlapped pixels are processed by an averaging strategy, in which the pixel average is calculated by counting the number of overlapped pixels in the recovered image.
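The windowing P(·) and reconstruction R(·) described above can be sketched as follows; the patch size n and stride S used here are illustrative choices.

```python
import numpy as np

def extract_patches(img, n=8, stride=4):
    """P(.): slide an n x n window with stride S over the image, reshaping
    each patch into one column of the data matrix fed to LatLRR."""
    H, W = img.shape
    cols, coords = [], []
    for i in range(0, H - n + 1, stride):
        for j in range(0, W - n + 1, stride):
            cols.append(img[i:i + n, j:j + n].reshape(-1))
            coords.append((i, j))
    return np.stack(cols, axis=1), coords

def reconstruct(cols, coords, shape, n=8):
    """R(.): rebuild the saliency image V_d, averaging overlapped pixels by
    counting how many patches cover each location."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for k, (i, j) in enumerate(coords):
        acc[i:i + n, j:j + n] += cols[:, k].reshape(n, n)
        cnt[i:i + n, j:j + n] += 1
    return acc / np.maximum(cnt, 1)
```

Applying `reconstruct` directly to the output of `extract_patches` recovers the input wherever the sliding windows cover it, which confirms that the averaging strategy is consistent.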

Counting the Zero Crossings in Difference
The number of zero crossings is a gradient-aware statistic: constraining it keeps the processed signal close to the original while leaving fewer intervals on which the signal is monotonic (increasing or decreasing) or convex or concave. In this paper, the optimization problem involving the number of zero crossings is solved by evaluating its proximity operator [25].

Proximity Operator of the Number of Zero Crossings
Zero crossing means that the signal passes through the zero point, and the frequency of signal fluctuation can be measured by this statistic. The number of zero crossings z(·) is defined as follows. Suppose (g_1, g_2, . . . , g_M) is a partition of the vector g ∈ R^N into sub-vectors satisfying the following conditions:
(1) Each segment contains at least one non-zero element.
(2) Non-zero elements in the same segment have the same sign.
(3) Non-zero elements in adjacent segments have opposite signs.
Provided the k-th segment is denoted g_k, the partition (g_1, g_2, . . . , g_M) is called the Minimum Same Sign Partition [31] (MSSP for short). Therefore, the number of zero crossings z(g) of the vector g can be defined as: Next, we consider the problem of minimizing the proximity operator of z(g), introducing an auxiliary variable u: Denote the loss function to be minimized as l(u), namely: To minimize l(u), every element of the optimal solution takes one of only two values, û_i = 0 or û_i = g_i. Note that the symbol 0 here represents a zero vector [32].
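Assuming the omitted formula defines z(g) as the number of sign changes between adjacent MSSP segments (i.e., z(g) = M − 1), the count can be computed directly:

```python
import numpy as np

def zero_crossings(g):
    """Count z(g): with (g_1, ..., g_M) the Minimum Same Sign Partition,
    z(g) = M - 1, i.e. the number of sign changes among non-zero entries
    (zero entries are absorbed into a neighboring segment)."""
    s = np.sign(g)
    s = s[s != 0]                      # drop zeros; they carry no sign
    if s.size == 0:
        return 0
    return int(np.count_nonzero(np.diff(s)))
```

For example, [1, 2, −1, 0, −3, 4] partitions into three same-sign segments, giving two zero crossings.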
Take l_j(·) to represent the loss function of the partial data g_{1:j}, and let ζ be an optimal solution vector of problem (9). Rewritten, Formula (9) becomes: Consider the constrained minimum loss function of the last segment. The optimal solution vector ζ̂ (ζ ∈ D_j, D_j = {ζ ∈ R^{i_{j+1}−1} : ζ_k = 0 or g_k, k = 1, 2, . . . , j}) of Formula (10) must coincide with a solution vector at which l_{j−1}(·) or l_{j−2}(·) attains its minimum. In addition, it is obvious that ζ̂_{j−1} and ζ̂_{j−2} cannot both be 0 at the same time, so there are only two possible situations: (1) ζ̂_{j−1} ≠ 0; (2) ζ̂_{j−1} = 0 and ζ̂_{j−2} ≠ 0. Writing e_j = ‖g_j‖^2 (1 ≤ j ≤ M), then for j ≥ 3 the problem can be written as: Algorithm 1 summarizes the process of using dynamic programming to solve Equation (8), filling the table from the bottom up.

Algorithm 1. Evaluating the proximity operator of z(·)
Input: Vector g ∈ R^N, smoothing parameter λ, weight parameter β.
Output: The result u, namely the proximity operator of z(·).
1 Find an MSSP (g_1, g_2, . . . , g_M) from vector g.
2 Initialize u ← g.
3 Calculate e; then, for each segment, solve for l_j^nz in Equation (11) to compute the minimum loss.
7 End for
8 Update the parameter j ← M + 1.
9 While j ≥ 2 do
10 End while
13 End if

Image Smoothing with Zero-Crossing Count Regularization
In order to smooth the detail-layer image, the horizontal and vertical differences of the image S are processed with regularization: where ‖·‖_2 denotes the ℓ2-norm; Y is the input image; n_1 and n_2 are the numbers of rows and columns of the image Y; ∂ = ∆ or ∆² (first-order or second-order difference) [33,34]; ∂_x and ∂_y represent the horizontal and vertical difference operators, respectively; and λ (λ > 0) is an important value that balances the data term and the regularization term. Problem (13) can be solved using the alternating direction method of multipliers (ADMM) [35]. First, the auxiliary variables V and H are introduced in place of ∂_x S and ∂_y S, respectively, and Formula (13) becomes: Its augmented Lagrangian is then: where Ṽ and H̃ are the dual variables associated with V and H, whose values are updated by Equation (19). The ADMM framework consists of the following iterative formulas: In Equation (16), each column of the variable V is decoupled from the others, which allows us to solve each column separately. Equation (16) can be rewritten as: where β is the weight parameter, which gradually increases after each iteration. Equations (20) and (21) can be solved efficiently by dynamic programming [36][37][38].
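As a small illustration of the objective in problem (13), the following sketch evaluates the data term plus a zero-crossing count of the first-order row and column differences; the helper names are ours.

```python
import numpy as np

def _z(v):
    # number of zero crossings: sign changes among the non-zero entries
    s = np.sign(v)
    s = s[s != 0]
    return int(np.count_nonzero(np.diff(s))) if s.size else 0

def dx(S):
    # horizontal first-order difference (one column shorter)
    return S[:, 1:] - S[:, :-1]

def dy(S):
    # vertical first-order difference (one row shorter)
    return S[1:, :] - S[:-1, :]

def smoothing_objective(S, Y, lam):
    """Objective of problem (13), sketched with first-order differences:
    ||S - Y||_2^2 + lam * (zero crossings of every row of dx(S)
                           plus every column of dy(S))."""
    data = float(np.sum((S - Y) ** 2))
    reg = sum(_z(r) for r in dx(S)) + sum(_z(c) for c in dy(S).T)
    return data + lam * reg
```

A constant shift of a flat image incurs only the data term, while an oscillating row pays the zero-crossing penalty, which is the trade-off λ controls.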
Algorithm 2 summarizes the zero-crossing smoothing algorithm, in which the updates of the auxiliary variables V and H rely on Algorithm 1.

The Proposed Method
The overall fusion framework of the proposed algorithm is shown in Figure 4 and mainly includes three parts: image decomposition, sub-image fusion, and final image reconstruction.

Image Decomposition by CAST
Co-occurrence analysis shearlet transform (CAST), a novel hybrid multi-scale transformation tool, combines the advantages of the co-occurrence filter (COF) and the shearlet transform. The filtering process is as follows.

The Multi-Scale Decomposition Steps of COF
I_VI and I_IR denote the original visible and infrared images, B_VI and B_IR are the corresponding base-layer images, and the COF is applied to smooth the source images at each scale: The detail-layer images are obtained by calculating the difference between the original image and the base image, and are denoted D_VI and D_IR, respectively.
Then the K-scale decomposition is in the form,
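The K-scale decomposition can be sketched as below; a simple box filter stands in for the COF, so this shows only the structure of the scheme (B_k = filter(B_{k−1}), D_k = B_{k−1} − B_k), not the CAST decomposition itself.

```python
import numpy as np

def box_blur(x, r=2):
    # simple edge-padded box filter, standing in for the co-occurrence filter
    p = np.pad(x, r, mode="edge")
    out = np.zeros(x.shape, dtype=np.float64)
    for di in range(2 * r + 1):
        for dj in range(2 * r + 1):
            out += p[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out / (2 * r + 1) ** 2

def multiscale_decompose(img, K=3):
    """K-scale decomposition: at each scale, the base layer is re-smoothed
    and the detail layer is the difference to the previous base.
    Returns the final base layer and the K detail layers; by construction
    base + sum(details) reconstructs the input exactly (telescoping sum)."""
    base = img.astype(np.float64)
    details = []
    for _ in range(K):
        smoothed = box_blur(base)
        details.append(base - smoothed)   # detail layer D_k
        base = smoothed                   # base layer for the next scale
    return base, details
```

The exact-reconstruction property is what lets the final image be rebuilt by simply summing the fused base and detail layers.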

Multi-Directional Decomposition by Using Discrete Tight Support Shearlet Transform
In fact, the traditional shearlet transform is adequate for this step. On this basis, a parabolic scaling function is adopted to keep the size of the shear filter within a reasonable range, and, according to the support range of the shearlet function, the relationship between L and l can be given qualitatively. Thus, an adaptive multi-directional shearlet filter is constructed as follows: where M and N denote the size of the input image, l is the multi-directional decomposition scale parameter, and L is the size of the shear filter. The adaptive shear filter used in this paper performs a multi-directional shearlet transformation on the detail layer at each scale, so as to effectively obtain the optimal multi-directional detail components. The decomposition details are shown in Figure 5.

The Brightness Correction of the Base-Layer Image
A gray-scale image better matches human visual requirements if its gray values range between 0 and 1 and the image mean is 0.5. In practice, not every image is optimal in this sense. Therefore, an omega correction is introduced to revise the brightness of the base layer [39].
The omega correction is defined as follows: where I_B represents the base component of the input image and I_BE denotes the base component corrected by the parameter ω, which controls the degree of stretching of the image. Obviously, when ω = 1, the corrected image is identical to the input image; when ω < 1, the corrected image I_BE is brighter, and its overall brightness increases as ω decreases; conversely, when ω > 1, the corrected image becomes darker than the input image.
In addition, the parameter ω can be derived from the following formula: where α is the correction rate and µ(x, y) denotes the average gray value of the base-layer image within the window. Figure 5 indicates that the lower the value of µ, the stronger the pixel enhancement. If µ(x, y) is less than 0.5, the windowed region appears dark and its values are raised by the brightness correction function, and vice versa.
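Since the correction formulas are not reproduced here, the following sketch assumes the common power-law form I_BE = I_B^ω, which matches the stated behavior (ω = 1 is the identity, ω < 1 brightens, ω > 1 darkens):

```python
import numpy as np

def omega_correction(base, omega):
    """Power-law omega correction of a base layer normalized to [0, 1].

    Assumed form: I_BE = I_B ** omega. For values in (0, 1), raising to a
    power below 1 increases them (brighter) and above 1 decreases them
    (darker), as described in the text.
    """
    return np.clip(base, 0.0, 1.0) ** omega
```

In the full method, ω itself would be computed per window from α and µ(x, y), so dark regions receive a smaller ω and hence stronger enhancement.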

Fusion Rule of Base-Layer Image
The base components of the original images contain the fundamental structures, the redundant information and the light intensity; they are the approximate parts carrying the main energy. The quality of the final fused image therefore depends on the base-layer fusion rule. In order to improve the brightness of the visible image, the omega correction function and the saliency-weighted map are utilized to fuse the base-layer components.
To prevent incompatible spectral characteristics between the heterogeneous images, LatLRR is utilized to generate the weighted map that guides the fusion of the base-layer images adaptively. The specific fusion strategy for the base components is as follows. Step 1: The saliency features of the visible and infrared images are calculated by the LatLRR algorithm, and the corresponding weighted maps S_b^IR and S_b^VI are constructed. The normalized weighting coefficient matrices µ_b^IR and µ_b^VI are then obtained from the saliency map values.
Step 2: The parameters µ_b^IR and µ_b^VI are applied to implement the weighted fusion of the base-layer images adaptively. The specific formulas are as follows: where µ_b^VI and µ_b^IR are the weights of the base layers, I_b^VI and I_b^IR are the base-layer images of the visible and infrared images, respectively, and I_b^f(x, y) denotes the fused base-layer coefficients.
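Steps 1 and 2 can be sketched as a pixel-wise weighted blend; normalizing the two saliency maps so the weights sum to one at each pixel is our assumption of the usual scheme.

```python
import numpy as np

def fuse_base_layers(b_vi, b_ir, s_vi, s_ir, eps=1e-12):
    """Saliency-weighted base-layer fusion: normalize the two saliency maps
    into weights mu_vi, mu_ir and blend pixel-wise:
        I_bf = mu_vi * B_vi + mu_ir * B_ir.
    `eps` guards against division by zero where both saliencies vanish."""
    total = s_vi + s_ir + eps
    mu_vi = s_vi / total
    mu_ir = s_ir / total
    return mu_vi * b_vi + mu_ir * b_ir
```

Where the infrared saliency dominates, the fused base layer follows the infrared intensity, which is how the salient thermal targets are transferred into the result.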
Considering the spectral differences between the two original images, these can be compensated by the weighted map, while at the same time the contrast of the visible images can be improved. The weighted map is essentially a spatial weighting distribution of the gray-scale values, and this fusion strategy can adaptively transfer the saliency components of the infrared images to the visible images while preserving as many textural components as possible. Finally, the fusion effect of the base-layer images can be greatly improved by an appropriate combination of the salient components of the two original images [24].

Fusion Rule of Detail-Layer Image
In contrast with the base-layer images, the detail layers preserve more structural information, such as edge and texture components, so the fusion strategy for the detail-layer components also affects the final visual effect of the fused image. In order to avoid excessively penalizing adjacent pixels with large intensity differences, the number of zero crossings in differences is selected as the regularization term to penalize the number of convex or concave segments of the sub-images.
The fusion strategy based on mixed zero-crossing regularization is put forward, and can be described as follows: where D^VI_{σ,τ}, D^IR_{σ,τ} and D_{σ,τ} denote the detail-layer sub-coefficients of the visible image, infrared image, and fused image, respectively; σ is the decomposition level, τ represents the decomposition direction in each layer, ϕ and η are the regularization coefficients, ‖·‖_2 represents the ℓ2-norm, z(·) indicates the zero-crossing count, and ∂ = ∆² is the second-order difference operator. Furthermore, the gradient parameter λ weighs the importance of the two types of detail-layer components in the spatial distribution. The expression for λ is: When |D^VI_{σ,τ}(x, y)| ≥ |D^IR_{σ,τ}(x, y)|, namely λ = 1, the detail-layer coefficients of the visible image dominate the main features of the fused image, and the zero-crossing count of the detail-layer coefficients of the infrared image is selected as the regularization term for supplement. Conversely, λ = 0 indicates that the detail-layer sub-coefficients of the infrared image contain more information.
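The per-pixel switch λ described above can be computed as a binary map:

```python
import numpy as np

def lambda_map(d_vi, d_ir):
    """Per-pixel gradient switch for the detail-layer fusion rule:
    lambda = 1 where |D_vi| >= |D_ir| (the visible detail dominates and the
    infrared detail enters only through the zero-crossing regularizer),
    otherwise lambda = 0."""
    return (np.abs(d_vi) >= np.abs(d_ir)).astype(np.float64)
```

This map decides, pixel by pixel, which source's detail coefficients act as the data term and which act as the regularizing supplement in the ADMM solve that follows.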
Next, the ADMM algorithm is used to solve problem (36), as follows. Step 1: Let the parameters H and V replace the ∂_x and ∂_y operators, respectively, so that Equation (36) becomes: Step 2: When λ = 1, the augmented Lagrangian function is: Step 3: The ADMM framework consists of the following iterative formulas: where ω is the weight parameter, which gradually increases after each iteration until ω satisfies the termination criterion. Equations (40) to (43) can be solved efficiently by dynamic programming. When λ = 0, repeat Step 2 and Step 3. By repeating the above steps, the fused images of the detail layers can be obtained. The final fusion effect is tested in the experiments of Section 4.

Experimental Results and Analysis
To verify the fusion effect of the proposed method, common infrared and visible image fusion sets are used as experimental data. Moreover, some classic and state-of-the-art algorithms are compared with our algorithm from qualitative and quantitative perspectives, respectively.

Experimental Settings
Seven pairs of infrared and visible images (namely Road, Camp, Car, Marne, Umbrella, Kaptein and Octec) from TNO [40] are selected for testing, and are exhibited in Figure 6.
Among them, "Road" is a set of images taken under low-illumination conditions. The "Camp" pair contains rich background information in the visible image and clearer hot targets in the infrared image. Both the visible and infrared images in "Car", "Marne", "Umbrella" and "Kaptein" contain significant and abundant information. In "Octec", the infrared image contains the region of interest, but the visible image is obscured by smoke. The image sizes are 256 × 256, 270 × 360, 490 × 656, 450 × 620, 450 × 620, 450 × 620 and 640 × 480, respectively. These varied samples can fully confirm the effectiveness of the novel algorithm.
In this paper, the compared algorithms are tested based on the public Matlab codes and the parameters in the code take the default value. In addition, all experiments are run on the computer with 3.6 GHz Intel Core CPU and 32GB memory.

Subjective Evaluation
The subjective evaluation of infrared and visible image fusion depends on the visual effect of the fused images. As shown in Figures 7a–h to 13a–h, each method has its advantages in preserving detail components, but our method balances retaining significant details against maintaining the overall intensity distribution as much as possible. Through experiments and analysis, the novel algorithm enhances the contrast ratio of the visible image and enriches the image detail information; moreover, noise is also well limited.
The subjective evaluation of infrared and visible image fusion depends on the visual effect of the fused images. As shown in Figures 7a-h to 13a-h, each method has its advantages in preserving detail components, but our method best balances retaining significant details against maintaining the overall intensity distribution. The experiments and analysis show that the proposed algorithm enhances the contrast of the visible image, enriches the detail information, and also limits noise well. The fusion results of the first group, on the "Marne" image pairs, are shown in Figure 7. Most of the fused images contain the rich textures of the original visible image and the hot targets of the infrared image. Furthermore, the cloud in the sky should be retained as clearly as possible in the final fused image, so that the fused image has higher contrast and looks more natural. The results of the CVT, CWT and ADF algorithms have lower contrast, so their visual quality is somewhat poor. The WLS algorithm can improve the brightness distribution of the visible image, but the outline of the cloud in the red box does not look natural enough. The MSVD algorithm cannot fuse much edge and detail information into each layer of the image. In addition, the result of the GTF algorithm is dominated by information from the infrared image, and the pattern on the car is the clearest. Although the CVT, WLS, LP, GTF and the proposed algorithms can all preserve the bright target regions, the proposed algorithm transfers more of the cloud textures and tree edges of the visible image into the fused image.

Figure 8 exhibits the fusion results of the "Umbrella" image pair for the various fusion algorithms. Although all of the fusion methods compared in this paper achieve the aim of image fusion, their fusion effects still differ considerably. The results of the CVT and CWT methods focus on preserving the infrared regions of interest, but some components of the visible image are missing. The backgrounds of the WLS and GTF results are too bright, which leads to a poor visual effect. The result of the LP method has a better visual effect than Figure 8a, and the contrast of its background is acceptable, but it still needs improvement. The ADF and MSVD methods cannot fully extract the diverse feature information from the input images, so their results lose significant information and fine details, and their contrast is also low. By contrast, the result of the proposed method suits the human visual system by enhancing the contrast of the interesting regions of the input images. In conclusion, the fused "Umbrella" image produced by the proposed method contains more complementary information from the input images.

Next, the image pairs "Kaptein" and "Car" are chosen as further test sets to confirm the effect of the proposed method. The fusion results of the existing algorithms and ours are displayed in Figures 9 and 10, respectively. There are clearly more noise and image artifacts in the fused images produced by the WLS and MSVD methods. In addition, the ADF and GTF methods cannot preserve many detail components of the visible image. Since background information is receiving more and more attention, LP, CVT and CWT were proposed to keep more visual information, and the images fused by these methods contain more significant features and scene information. However, with so much complementary information, those images look more similar to the infrared images, especially in the background. Compared with these methods, the proposed algorithm tones down the saliency features of the infrared image, and the fused image is easier to accept. Furthermore, in Figure 9, the texture information of the 'bushes' (red box) and the 'ground' (green box) is preserved as much as possible by our method, so the fused image is clearer and the detail information is more abundant.

In Figure 10, the CVT, CWT, LP and proposed algorithms retain more of the salient features of the visible images, whereas the results of the remaining methods are dominated by background information from the infrared image. In that case it is very difficult to extract the salient components from the input image, so the background and the context around the salient regions become obscure or even invisible. Compared with the other fusion algorithms in this paper, our method fuses more detail components and keeps the higher contrast of the infrared components. The 'tree' in the red box of our fused image is clearer, and the detail information is richer than in the others. Therefore, our method maintains an optimal balance between the visual context information and the infrared saliency features.
In fact, our proposed method focuses on maintaining as many visible texture details as possible while highlighting the thermal target. We therefore expect the experimental results of our method to be more in line with human vision, with large infrared targets preserved, such as the umbrella in Figure 8 and the humans in Figures 9 and 10. Moreover, the texture of the forest in Figure 8, the floor in Figure 9 and the trees in Figure 10 is clearer than in the other methods because of the omega correction in Section 3.2, which can be applied to greyscale images to change their dynamic range and local contrast. Of course, it is possible that these parameter settings are biased too far towards enhancing the contrast of the visible information, so the results are a little darker. The adaptive adjustment of these parameters will be part of our future work.
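The exact form of the omega correction is given in Section 3.2 and is not reproduced here. As a loose illustration of this kind of dynamic-range and local-contrast adjustment, a generic power-law remapping of a greyscale image can be sketched as follows; the function name and exponent value are our own choices, not the paper's:

```python
import numpy as np

def power_law_remap(img, omega=0.5):
    """Hypothetical stand-in for the paper's omega correction (Section 3.2).

    Stretches a greyscale image to [0, 1] and applies a power-law tone
    curve; exponents below 1 brighten dark regions and raise local
    contrast there, at the cost of compressing highlights.
    """
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    norm = (img - lo) / (hi - lo + 1e-12)  # use the full dynamic range
    return norm ** omega                   # power-law remapping

# A dark, low-contrast ramp is brightened while pixel order is preserved.
ramp = np.linspace(0.0, 0.25, 5)
out = power_law_remap(ramp, omega=0.5)
```

Such a remapping is monotone, so image structure is preserved while dark detail is lifted, which is consistent with the darker-but-sharper visible textures discussed above.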
Moreover, it is necessary to verify that this method retains complementary information. The fusion results of the different algorithms on the "Camp" image are displayed in Figure 11. The hot targets in the green box are clear enough for all methods; however, the contour of the figure in the green box obtained by WLS and by our method is more distinct, and its brightness is greatly improved. In addition, the image details of the bushes in the red boxes reveal two advantages of the proposed algorithm. First, it improves the image contrast by enhancing the brightness of the bushes. Second, it transfers more texture information of the bushes to the fused result, so that the fused image looks more similar to the visible image while still reserving much of the infrared information.
Remote Sens. 2022, 14, 283 17 of 22

In addition, the low-contrast image pair "Octec" is selected to verify the fusion effect of our method. In Figure 12, a cloud of smoke lies in the center of the visible image, behind which the interesting target of the infrared image is concealed, so the fusion methods ought to merge the hot target and the houses sheltered by the smoke into the result. The CWT, WLS, LP and proposed algorithms all meet this requirement. The ADF and GTF methods can enhance the brightness of the fused images, but the hot target in the green box is lost. Although the CWT algorithm improves the visual brightness of the fused result, the contrast of the fused image is very poor, and the result of the MSVD algorithm lacks sufficient details of the trees and houses. With the CWT and GTF algorithms, the detail information of the visible image is not transferred into the fused image, so the image looks a little blurry. Finally, the result obtained by the proposed method has clearer detail information on the roof in the red box, and its contrast is more suitable for human vision.

Finally, the image pair "Road", taken at low illumination, is chosen for the last experiment, and the fusion results of the different algorithms are shown in Figure 13. With the CWT, WLS and ADF algorithms, the interesting parts such as the vehicles, the person and the lights are not well highlighted. At the same time, the results of the GTF and MSVD algorithms do not conform to human visual observation because of their blurred details. The image contrast of the CVT and LP methods is better than that of the other compared algorithms. However, the proposed method enhances the tiny features properly: for example, the person in the green box is clearer than in the others, especially the outline between the two legs, and the details of the car in the red box are also well preserved. Therefore, the method proposed in this paper can meet the needs of night observation.
Although WLS and GTF preserve more thermal characteristics than our method in Figures 8-10, they retain fewer visual detail components; for example, the texture features of the bushes in Figures 8 and 9 are richer in our results. This also shows that our algorithm can integrate more of the visually significant features of the visible image into the fused image. Some small infrared details are lost in our results, such as the brightness on the human in Figure 10, but this does not affect the recognition of infrared targets. Our next focus will be to find a better balance while retaining more of both the infrared and the visible information.
In conclusion, our method can not only improve the contrast of the fused images but also fuse more infrared brightness features in each experiment. Although the compared methods can also highlight the interesting parts of the input images, they cannot fuse as many details of the input images as this method. In a word, the proposed algorithm achieves a better balance between highlighting infrared targets and reserving detail information, and its results are easier for human eyes to accept.

Objective Evaluation
Five fusion quality metrics are selected to evaluate our fusion algorithm objectively: average gradient (AVG) [46], mutual information (MI) [47], edge strength (Qab/f) [30], spatial frequency (SF) [48] and standard deviation (SD) [49]. The evaluation results of the eight methods on all five metrics are shown in Table 1, in which the values marked in bold are the best. From the seven experiments we can see that the AVG value of the proposed algorithm is higher than that of the other algorithms except in the fourth experiment, which indicates that our algorithm keeps the gradient texture information of the original images in the fused image, and that our fused images have the most abundant detail information. The amount of edge information is measured by Qab/f, which represents how well edges are transferred from the input images to the fused image. Except in the second to fourth groups of experiments, the Qab/f value of our algorithm remains the highest, which again shows that our method is effective at reconstructing and restoring gradient information. As for SF and SD, the SF value indicates the global activity of the image in the spatial domain, and the SD value reflects the grey-value distribution of the pixels, another manifestation of sharpness, computed from the deviation about the mean of the image. Both the SF and SD values of the other methods are lower than those of the proposed algorithm, so our result has higher overall activity in the spatial domain and its contrast is well promoted. MI characterizes the degree of correlation between the fused image and the original images; except in the fourth experiment, the other methods perform better than ours on this metric, which indicates that the proposed algorithm still needs to be improved in this respect.
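Under their common formulations (the exact normalizations in the cited references may differ slightly), three of these metrics, SD, SF and AVG, can be sketched as follows:

```python
import numpy as np

def std_dev(img):
    """Standard deviation (SD): spread of grey levels; a contrast proxy."""
    return float(np.std(img))

def spatial_frequency(img):
    """Spatial frequency (SF): overall activity in the spatial domain,
    combining row-wise and column-wise first differences."""
    img = np.asarray(img, dtype=np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row (horizontal) activity
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column (vertical) activity
    return float(np.sqrt(rf ** 2 + cf ** 2))

def average_gradient(img):
    """Average gradient (AVG): mean local gradient magnitude (sharpness)."""
    img = np.asarray(img, dtype=np.float64)
    dx = np.diff(img, axis=1)[:-1, :]  # horizontal differences
    dy = np.diff(img, axis=0)[:, :-1]  # vertical differences (same shape as dx)
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```

A flat image scores zero on all three, while a high-contrast textured image scores high on each, which matches their use here as contrast, activity and sharpness measures.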

Conclusions
This paper proposed a novel fusion model for infrared and visible images based on the co-occurrence analysis shearlet transform. First, the CAST is used as the multi-scale transform tool to decompose the source images. Next, for the base-layer images that represent the energy distribution, we adopt LatLRR to generate saliency maps, and a weighted model guided by the saliency map is put forward as the fusion rule. For the detail-layer images that reflect the texture detail information, an optimization model based on zero-crossing-counting regularization is adopted as the fusion rule. Relevant experiments confirm the performance of our method: the fused images obtained by our approach have higher contrast and richer texture detail information, and outperform the others in terms of visual evaluation.
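As a structural illustration only, the pipeline summarized above can be sketched with simple stand-ins: a box blur replaces the CAST decomposition, a normalized infrared base intensity replaces the LatLRR saliency map, and a per-pixel max-magnitude rule replaces the zero-crossing-counting optimization. Only the decompose / weighted-base-fusion / detail-fusion / reconstruct structure reflects the paper; every component below is a stand-in.

```python
import numpy as np

def box_blur(img, k=5):
    """Separable box blur: a crude stand-in for the CAST base-layer filtering."""
    pad = k // 2
    padded = np.pad(np.asarray(img, dtype=np.float64), pad, mode="edge")
    kernel = np.ones(k) / k
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, "valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, "valid"), 0, rows)

def fuse(ir, vis, k=5):
    """Structural sketch only: decompose, fuse base layers by a saliency
    weight, fuse detail layers by magnitude, and reconstruct by summation."""
    ir = np.asarray(ir, dtype=np.float64)
    vis = np.asarray(vis, dtype=np.float64)
    base_ir, base_vis = box_blur(ir, k), box_blur(vis, k)  # base layers
    det_ir, det_vis = ir - base_ir, vis - base_vis         # detail layers

    # Stand-in saliency map (the paper uses LatLRR): brighter IR base = salient.
    sal = (base_ir - base_ir.min()) / (base_ir.max() - base_ir.min() + 1e-12)
    base_fused = sal * base_ir + (1.0 - sal) * base_vis    # weighted base fusion

    # Stand-in detail rule (the paper solves a zero-crossing-count regularized
    # optimization): keep the larger-magnitude detail coefficient per pixel.
    det_fused = np.where(np.abs(det_ir) >= np.abs(det_vis), det_ir, det_vis)
    return base_fused + det_fused                          # reconstruction
```

One sanity property of this decompose-and-recombine structure is that fusing an image with itself returns the image unchanged, since the base and detail layers sum back to the input.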