Infrared and Visible Image Fusion Using Truncated Huber Penalty Function Smoothing and Visual Saliency Based Threshold Optimization

Abstract: An efficient method for infrared and visible image fusion is presented using truncated Huber penalty function smoothing and visual saliency based threshold optimization. The method merges complementary information from multimodality source images into a more informative composite image in a two-scale domain, in which significant objects/regions are highlighted and rich feature information is preserved. Firstly, source images are decomposed into two-scale image representations, namely the approximate and residual layers, using truncated Huber penalty function smoothing. Benefiting from its edge- and structure-preserving characteristics, the significant objects and regions in the source images are effectively extracted without halo artifacts around the edges. Secondly, a visual saliency based threshold optimization fusion rule is designed to fuse the approximate layers, aiming to highlight the salient targets in infrared images and retain the high-intensity regions in visible images. A sparse representation based fusion rule is adopted to fuse the residual layers with the goal of acquiring rich detail texture information. Finally, combining the fused approximate and residual layers reconstructs the fused image with more natural visual effects. Extensive experimental results demonstrate that the proposed method achieves comparable or superior performance compared with several state-of-the-art fusion methods in both visual results and objective assessments.


Introduction
Infrared (IR) and visible image fusion is a current research hotspot in image processing because of its numerous applications in computer vision tasks [1], such as military reconnaissance, biological recognition, target detection, and tracking. Infrared images can expose the thermal radiation differences of different objects, which helps identify targets against poorly lit backgrounds. IR imaging is robust to disturbances from adverse conditions such as low lighting, snow, rain, and fog, and therefore works well day and night and in all weather. However, IR images usually have low-definition backgrounds and poor texture details. In contrast, visible imaging captures the reflected light of an object and provides considerably higher resolution and texture detail; however, it is often affected by bad weather [2,3]. To obtain enough information for an accurate understanding of a scene, most users have to analyze the multimodality images of a scene one by one. However, analyzing each individual member of the multimodality images of a scene usually requires numerous resources, such as more personnel, time, and money. Therefore, it is desirable to integrate multimodality images into a single image with the goal of obtaining a complementary and informative result [4,5]. The IR

• An effective two-scale decomposition algorithm using truncated Huber penalty function (THPF) smoothing is proposed to decompose the source images into approximate and residual layers. The proposed THPF-based decomposition algorithm efficiently extracts feature information (e.g., edges, contours) and keeps the edges and structures of the fusion results free of halo artifacts.
• A visual saliency based threshold optimization (VSTO) fusion rule is proposed to merge the approximate layers. The VSTO fusion rule suppresses contrast loss and highlights the significant targets in the IR images and the high-intensity regions in the visible images. The fused images are more natural and consistent with human visual perception, which facilitates both human scene understanding and computer post-processing.
• Unlike most fusion methods that use sparse representation (SR) to decompose an image or to merge the low frequency sub-band images, we utilize an SR based fusion rule to merge the residual layers to obtain rich feature information (e.g., detail, edges, and contrast). The subjective and objective evaluations demonstrate that considerable feature information is integrated from the IR and visible images into the fused image.
The rest of this paper is organized as follows. Section 2 reviews related work in the IR and visible fusion field. Section 3 details the proposed method for IR and visible image fusion. Section 4 describes the experimental setting, including the image dataset, other representative fusion methods for comparison, the objective assessment metrics, and the parameter selection of the proposed THPF-VSTO-SR fusion algorithm. In Section 5, experimental results and the corresponding discussion are comprehensively presented. Finally, Section 6 gives the conclusion and a discussion of future work.
MSD-based methods further comprise two types: multiscale transform (MST)-based methods and edge-preserving filter (EPF)-based methods. In the 1980s, Burt et al. [16] proposed the pyramid transform (PT)-based MST fusion approach for the first time. Since then, the wavelet transform (WT) [17,18], curvelet transform (CVT) [19], contourlet transform (CT) [20], nonsubsampled contourlet transform (NSCT) [21], combination of wavelet and NSCT (CWTNSCT) [22], and nonsubsampled shearlet transform (NSST) [23] have been widely used in image fusion. The MST-based methods have achieved good performance in many applications. However, visual artifacts (e.g., halos, pseudo-Gibbs phenomena) are common adverse defects of MST-based approaches [4,6]. In addition, the simple 'average' fusion rule applied to the low frequency sub-band images often causes contrast distortions [8]. Recently, the edge-preserving filter (EPF) has been introduced into the image fusion field [2,24]. Li et al. [24] used the guided filter for image fusion, aiming at consistency with human visual perception. Kumar [25] employed the cross bilateral filter to merge multimodality images. Zhou et al. [26] combined the bilateral and Gaussian filters to merge IR and visible images. Going further, Ma et al. [27] utilized the rolling guidance filter to decompose the source images into base and detail layers. These representative EPF-based approaches have been validated with good fusion performance owing to their edge-preserving property.
In recent years, owing to the excellent signal representation ability of sparse representation (SR) [10], Liu et al. [28,29] adopted sparse representation and convolutional sparse representation (CSR) to capture the intrinsic characteristic information of source images, respectively. Meanwhile, in [30], joint sparse representation and saliency detection (JSRSD) were used to highlight the significant objects in IR images and regions in visible images. Generally, an SR-based image fusion method consists of the following steps [2,10]. (i) Input images are decomposed into overlapping patches. (ii) Each patch is vectorized as a pixel vector. (iii) Vectors are encoded as sparse representation coefficients using the trained over-complete dictionary. (iv) Sparse representation coefficients are combined via the given fusion rules. (v) The fusion result is reconstructed using the trained over-complete dictionary again. Although SR-based methods have achieved good results in many cases, many existing methods suffer from several open challenges. In [28], SR in the MST-SR based fusion method is used to merge the low-pass MST bands. However, the MST-SR-based method suffers from information loss and block artifacts [27]. Brightness loss also occurs in the CSR-based method [29], and the fusion results of JSRSD have block artifacts [30]. These issues lead to unnatural fused images.
The idea behind visual saliency (VS) based methods is that important objects/regions (e.g., outlines, edges, brightness) more easily capture people's attention. Bavirisetti [8] used mean and median filters to implement visual saliency detection. Liu et al. [30] and Ma et al. [31] achieved good fusion performance with saliency analysis. Visual saliency detection is usually used to design fusion rules.
In the last three years, state-of-the-art deep learning (DL) has been introduced into the image fusion area because of its powerful image processing ability [9,11]. Generally, DL-based methods work well thanks to their excellent ability to extract feature information from source images. Therefore, many neural network models have been widely studied in image fusion. In [32], Li et al. used ResNet and zero-phase component analysis, achieving good fusion performance. More generally, many convolutional neural network (CNN) based methods have been proposed for IR and visible image fusion [33][34][35][36]. Unlike CNN and ResNet models, Ma et al. [37] adopted DDcGAN (Dual-discriminator Conditional Generative Adversarial Network) to attain fusion outputs with enhanced targets, which facilitates human scene understanding.

Image Decomposition
Many challenges in computer vision and graphics tasks come down to image smoothing, but the desired smoothing properties can vary considerably across tasks. It is desirable for the smoothing operation to smooth out small details while preserving edges and structures. In [38], a generalized framework for image smoothing using the truncated Huber penalty function (THPF) is presented. Its simultaneous edge- and structure-preserving smoothing achieves better performance than previous methods in many challenging situations. In this work, the truncated Huber penalty function (THPF) is introduced to build a decomposition model for the first time.

Truncated Huber Penalty Function
The truncated Huber penalty function is defined as follows.
where a and b are constants. h(·) is the Huber penalty function [39] expressed as Equation (2).
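Concretely, in the notation of [38,39], the two penalties can be written as follows; this is a reconstruction from the surrounding description (the truncation constant is taken as h(b), up to the exact convention of [38]):

```latex
h_T(x) =
\begin{cases}
  h(x), & |x| < b \\
  h(b), & |x| \ge b
\end{cases}
\qquad
h(x) =
\begin{cases}
  \dfrac{x^2}{2a}, & |x| \le a \\[4pt]
  |x| - \dfrac{a}{2}, & |x| > a
\end{cases}
```

and the smoothing objective of Equation (3), minimized by the output u, is then of the form

```latex
E(u) = \sum_i \left( \sum_{j \in N_d(i)} \omega^{s}_{i,j}\, h_T\!\left(u_j - f_j\right)
     + \lambda \sum_{j \in N_s(i)} \omega^{s}_{i,j}\, \omega^{g}_{i,j}\, h_T\!\left(u_i - u_j\right) \right)
```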
where f denotes an input image, and g is a guide image. The filtered output image u is the solution providing the minimum value of the above objective function, Equation (3). h_T(·) is the same as in Equation (1); N_d(i) is the (2r_d + 1) × (2r_d + 1) square patch centered at pixel i (here, r_d = 1); N_s(i) is the (2r_s + 1) × (2r_s + 1) square patch centered at i, excluding pixel i itself (here, r_s = 1); λ is a parameter controlling the overall filtering strength.
The first term of Equation (3) is the data term, and the second term is the smoothness term of this smoothing model. (a_d, b_d) and (a_s, b_s) denote the parameters a and b of h_T(·) in the data term and the smoothness term, respectively. Typically, a_d = a_s = a = 10^-3 and b_d = b_s = b = 0.15, as suggested by [38]. ω^s_{i,j} is the Gaussian spatial kernel, where σ = r_d in the first term of Equation (3) and σ = r_s in the second term. ω^g_{i,j} is the guidance weight defined as follows.
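In symbols, one concrete form consistent with this description (assumed here; p_i denotes the spatial coordinates of pixel i) is:

```latex
\omega^{s}_{i,j} = \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\sigma^2}\right),
\qquad
\omega^{g}_{i,j} = \frac{1}{\lvert g_i - g_j \rvert^{\beta} + \delta}
```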
where β determines the edge sensitivity with respect to the guide image g (here, g = f), and δ is a small constant (avoiding a zero denominator) that is generally set as δ = 10^-3. The numerical solution of the image smoothing model in Equation (3) is provided in the original paper [38].
The THPF-based smoothing model can achieve simultaneous edge-preserving and structure-preserving smoothing that is rarely obtained by previous methods, which enables our fusion method to achieve better performance in source image decomposition. In the following Section 3.1.3, we present the detailed image decomposition based on THPF smoothing.
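As a concrete illustration, the two penalties can be sketched in NumPy under the piecewise form assumed above (defaults a = 10^-3 and b = 0.15 follow the text; the exact truncation convention is an assumption):

```python
import numpy as np

def huber(x, a=1e-3):
    """Huber penalty h(x): quadratic for |x| <= a, linear beyond."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x <= a, x * x / (2 * a), x - a / 2)

def truncated_huber(x, a=1e-3, b=0.15):
    """Truncated Huber penalty h_T(x): Huber inside |x| < b, constant beyond,
    so that large differences (edges) incur no extra cost and are preserved."""
    x = np.abs(np.asarray(x, dtype=float))
    return np.where(x < b, huber(x, a), huber(b, a))
```

The constant tail for |x| ≥ b is what distinguishes h_T from the plain Huber penalty: it stops penalizing large intensity jumps, which is the mechanism behind the structure-preserving behavior described above.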

THPF Smoothing-Based Two-Scale Image Decomposition
Owing to the key edge- and structure-preserving characteristics of THPF smoothing, the filtered image is similar to the original image in overall appearance, which is conducive to maintaining the feature information of the original images in the fusion process. Assuming that I 1 (i, j) and I 2 (i, j) represent the registered infrared (IR) and visible images, respectively, decomposition using the THPF smoothing algorithm includes the following two steps. Mathematically, the procedure is depicted as follows.
Firstly, infrared image I 1 (i, j) and visible image I 2 (i, j) are filtered with THPF smoothing.
where the filtered images denote the approximate layers of the IR and visible images, respectively, and S_THPF(·) indicates the filtering operation using THPF smoothing. Parameter λ controls the overall smoothing strength, and β determines the edge sensitivity. The parameter settings of λ and β will be detailed in Section 4.4. The approximate layers of the IR and visible images are shown in Figure 2(a2,b2), respectively.
Secondly, the residual layer images are acquired by subtracting the approximate layers from the corresponding source images, as follows.
The residual layers of IR and visible images are shown in Figure 2(a3,b3), respectively.
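The two steps above can be sketched as follows. A Gaussian filter is used here purely as a stand-in for THPF smoothing (for which no off-the-shelf implementation is assumed); the decomposition identity, approximate plus residual equals source, holds for any smoother plugged in:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def two_scale_decompose(img, smooth=None):
    """Two-scale decomposition: the approximate layer is the smoothed image,
    and the residual layer is the source minus the approximate layer."""
    img = np.asarray(img, dtype=float)
    if smooth is None:
        # Stand-in smoother; replace with a THPF smoothing implementation.
        smooth = lambda x: gaussian_filter(x, sigma=2.0)
    approx = smooth(img)           # approximate layer (Step 1)
    return approx, img - approx    # residual layer (Step 2)
```

By construction the fused image can later be rebuilt exactly as the sum of the two fused layers, which is the reconstruction step used at the end of the pipeline.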

Approximate Layer Fusion Using Visual Saliency-Based Threshold Optimization
The approximate layer images contain low frequency energy information, which exhibits the overall appearance. The frequently used and simple average fusion scheme hardly highlights the pivotal objects in infrared images and the high-intensity regions in visible images, and therefore often suffers from contrast distortions. Meanwhile, fusion rules that excessively pursue salient objects either lead to over-enhancement of objects or bring about blurred details. These phenomena will be shown intuitively in the qualitative analysis (Section 5.1). To address these issues, it is desirable to design an efficient and appropriate algorithm for merging the low frequency sub-band images. Composite images with good visual effects should be consistent with human visual perception and capture human attention. In view of this, a new fusion scheme using visual saliency based threshold optimization (VSTO) is proposed to merge the approximate layer images. The motivation is that human attention is usually grabbed by objects/pixels that are more significant than their neighbors.
Morphological operations using the maximum filter are performed on the approximate layer images for object/region enhancement, where MaxFilter(·) denotes maximum filtering over a (2r + 1) × (2r + 1) window (typically r = 1, i.e., a 3 × 3 window). S 1 and S 2 are the filtered images serving as the visual saliency maps (VSM) of the IR and visible approximate layers, respectively. Figure 3 illustrates the visual saliency maps of the approximate layers for the IR and visible images, as shown in Figure 3(a2,b2), respectively. Obviously, the objects/regions in Figure 3(a2,b2) are easier to identify than those in Figure 3(a1,b1) (see the comparisons in the red and green boxes). Then, the decision maps of the approximate layers are calculated.
where DM(i, j) is the decision map, computed over a window of size (2s + 1) × (2s + 1) (here, s = 1). A threshold Th, Th ∈ {1, ..., 9}, is key to determining the criterion for merging more information from the IR or the visible approximate layer. Figure 3(a3,b3) present the decision maps of the IR and visible approximate layer images, respectively. Then, the fused approximate layer image F A can be acquired as follows.
From Equations (12) and (13), we can see that as Th gradually increases, the fused output merges more salient information from the visible image. In the extreme case of Th = 9, the fused approximate layer image F A is exactly the visible approximate layer. Therefore, it is worth optimizing Th with the purpose of achieving the best fusion performance. Theoretically, Th should lie in the intermediate range from 4 to 6, thereby taking the salient objects/regions of both the IR and visible images into account. The influence of Th on fusion performance will be given in Section 4.4 (iii). The proposed fusion scheme adopting visual saliency-based threshold optimization is referred to simply as VSTO.
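The VSTO rule can be sketched as follows. The concrete counting form of the decision map is an assumption reconstructed from the description above (a window-wise count of pixels whose IR saliency exceeds the visible saliency), not the paper's verbatim Equation (12):

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import convolve2d

def vsto_fuse(a1, a2, r=1, s=1, th=6):
    """Fuse two approximate layers (a1: IR, a2: visible) with a VSTO-style rule.
    DM(i, j) counts, in each (2s+1) x (2s+1) window, how many pixels have
    larger IR saliency than visible saliency; the IR layer is kept where
    DM >= th, otherwise the visible layer is kept."""
    s1 = maximum_filter(a1, size=2 * r + 1)  # saliency map of the IR layer
    s2 = maximum_filter(a2, size=2 * r + 1)  # saliency map of the visible layer
    ir_wins = (s1 > s2).astype(float)
    dm = convolve2d(ir_wins, np.ones((2 * s + 1, 2 * s + 1)),
                    mode="same", boundary="symm")
    return np.where(dm >= th, a1, a2)
```

With this form, raising `th` makes the IR-selection condition harder to satisfy, so more of the visible layer survives, matching the trend described in the text.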

Residual Layer Fusion Using Sparse Representation
Sparse representation (SR) and its variants have been popular machine learning methods in recent years [10,40]. The core idea of sparse representation is that a signal can be represented as a linear combination of an over-complete dictionary and a sparse matrix [41]. The benefit of SR is that it is more effective at finding the implicit structures and patterns in images. In many existing fusion methods, sparse representation is used either for representing/decomposing images or for fusing the low frequency sub-band images. Unlike these two uses, SR is used here to design the fusion rule for the residual layer images (high frequency sub-band images). The motivation is that the residual layers, rather than the low frequency sub-band images, contain considerable structural and textural details. Thus, using SR to merge the residual layer images is worthy of exploration.

Sparse Representation
For a signal y ∈ R^M, it can be expressed as y ≈ Dα, where D ∈ R^{M×N} (M < N) denotes the over-complete dictionary and α ∈ R^N is the sparse coefficient vector. Mathematically, α can be obtained by solving the following optimization problem.
where ||·||_0 denotes the l_0 norm, and ||α||_0 is the number of nonzero entries in the sparse coefficient vector α. Orthogonal matching pursuit (OMP) [42], a common greedy algorithm, is used to solve the sparse representation model with the l_0 norm (Equation (14)). The K-SVD algorithm [43] is employed to train the sample signals to obtain the over-complete dictionary D. Finally, the sparse solution α is obtained.
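A minimal OMP sketch over a column-normalized dictionary D (K-SVD training is omitted; this greedy loop only illustrates the sparse coding step):

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: select at most k atoms of D to
    approximate y, refitting a least squares solution on the support each step."""
    support = []
    residual = np.asarray(y, dtype=float).copy()
    alpha = np.zeros(D.shape[1])
    for _ in range(k):
        # Atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx not in support:
            support.append(idx)
        # Least squares fit on the selected atoms, then update the residual.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    alpha[support] = coef
    return alpha
```

For a signal that is exactly a k-sparse combination of orthonormal atoms, this loop recovers the coefficients exactly; on real patches it yields the approximate sparse codes used in the fusion steps below.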
In what follows, merging residual layers based on SR is described in detail.

Residual Layer Fusion
In contrast to the approximate layer images, the residual layer images mainly contain detail texture information. Fusing the residual layer images consists of the following steps.
Step 1: The residual layer images are divided into overlapping patches using the sliding window technique. Step 2: Each patch is rearranged into the corresponding column vector V^i_1 (or V^i_2). Step 3: In terms of the pretrained dictionary D, the optimal sparse vector α^i_1 (or α^i_2) is solved using the OMP algorithm.
Step 4: The maximum l_2 norm is selected as the evaluation criterion to obtain the fused sparse vector α^i_rf for the residual layer images, where ||·||_2 denotes the l_2 norm. Then, the fused column vector V^i_rf of the residual layer images is calculated via the dictionary D as V^i_rf = Dα^i_rf. Step 5: The reverse operation is performed on each vector V^i_rf to form the composite image, thereby achieving the fused residual layer image I R.
The process of residual layer fusion using sparse representation is schematically illustrated in Figure 4.
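Steps 4 and 5 reduce to a per-patch selection followed by reconstruction through the dictionary. A sketch, assuming the sparse codes of all patches from the two residual layers are stacked as columns of matrices A1 and A2:

```python
import numpy as np

def fuse_residual_codes(A1, A2, D):
    """Max-l2 selection rule: for each column (one patch's sparse code),
    keep the code with the larger l2 norm, then reconstruct the fused
    residual vectors through the dictionary (V = D @ alpha)."""
    pick_first = np.linalg.norm(A1, axis=0) >= np.linalg.norm(A2, axis=0)
    fused = np.where(pick_first, A1, A2)  # broadcasts the choice over rows
    return D @ fused
```

The returned columns are the fused vectors V^i_rf; reshaping each column back into its patch position (the "reverse operation" of Step 5, with averaging on overlaps) yields the fused residual layer.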

Reconstruction
Finally, the fused result I F is reconstructed through the fused approximate layer image I A (Equation (13)) and the fused residual layer image I R (Equation (19)).
The pseudo code of the proposed algorithm is given by Algorithm 1.

Algorithm 1 Pseudo code of proposed fusion method.
Input: I 1 (i, j): the infrared image; I 2 (i, j): the visible image
Output: the fused image I F
1: Apply THPF smoothing [38] on the source images I 1 (i, j) and I 2 (i, j) to get the approximate layers, respectively
2: Compute the residual layers of the infrared and visible images
3: Apply the maximum filter on the approximate layers to get the visual saliency maps S 1 (i, j) and S 2 (i, j), respectively
4: Apply threshold Th to get the decision map DM(i, j) of the approximate layers (see Equation (13))
5: Compute the fused approximate layer image
6: Apply the sliding window technique to divide the residual layer images into overlapping patches
7: Rearrange each patch into the corresponding column vector V^i_1 (or V^i_2)
8: Apply the pretrained dictionary D [43] and the OMP algorithm [42] to solve the optimal sparse vector α^i
9: Apply the maximum l_2 norm to get the fused sparse vector α^i_rf for the residual layers
10: Compute the fused column vector V^i_rf of the residual layers using the dictionary D
11: Apply the reverse operation on V^i_rf to get the fused residual layer image I R
12: Reconstruct the fused result I F = I A + I R

Experimental Setting
This section presents the experimental image dataset (Section 4.1), nine representative fusion approaches for comparison (Section 4.2), the objective assessment metrics for evaluating fusion performance (Section 4.3), and the parameter setting of the proposed THPF-VSTO-SR fusion method (Section 4.4).

Image Data Set and Setting
Twenty preregistered IR and visible image pairs are selected to establish the experimental data set. All source image pairs are collected from the websites https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029 (accessed on 18 January 2021) and https://sites.google.com/view/durgaprasadbavirisetti/datasets (accessed on 25 June 2019), and are often adopted in fusion studies. These image pairs are marked as 'S1-S20', respectively, as shown in Figure 5. Throughout this paper, it is assumed that all source image pairs are perfectly registered in advance. All fusion results of S1-S20 will be presented one by one in Section 5.1.
Among these methods, DWT is a multiscale transform based method. CSR and JSRSD are sparse domain based methods. TSVSM is an edge-preserving decomposition method with visual saliency detection. VSMWLS is a hybrid fusion method based on edge-preserving decomposition, a visual saliency map, and weighted least squares optimization. TE employs a target-enhanced multiscale transform (MST) decomposition algorithm. ResNet, IFCNN, and DDcGAN are recent schemes using various state-of-the-art deep learning models. JSRSD, TSVSM, VSMWLS, and TE are also hybrid fusion methods combining the strengths of multiscale decomposition (or sparse representation) and special fusion rules (i.e., saliency detection, visual saliency maps, weighted least squares optimization, and target enhancement). Generally, the first approach is an often-used and representative multiscale transform method, while the latter eight schemes are excellent methods proposed in recent years.
For the sake of fairness, all the experiments in this work are run on a desktop computer with 16 GB memory, a 2.6 GHz Intel Xeon CPU, and an NVIDIA GTX 1060 GPU. DWT, CSR, TSVSM, JSRSD, VSMWLS, ResNet, TE, and our method are conducted in the MATLAB 2018a environment. DDcGAN is implemented with TensorFlow (CPU), and IFCNN is carried out with PyTorch (GPU). The experimental parameters of DWT are set according to [28]. The experimental parameters of CSR [29], TSVSM [8], JSRSD [30], VSMWLS [27], ResNet [32], IFCNN [36], TE [44], and DDcGAN [37] are the same as in the original papers, respectively. All nine representative methods are run using their openly available codes. λ = 0.01, β = 0.5, and Th = 6 are set for the proposed fusion method; this parameter setting is explained in Section 4.4.

Objective Assessment Metrics
Six commonly used objective evaluation metrics are adopted to assess the performance of the various fusion methods. They are SD (standard deviation) [6], MI (mutual information) [45], FMI (feature mutual information) [46], Q ab f (edge based similarity measure) [47], NCIE (nonlinear correlation information entropy) [48,49], and NCC (nonlinear correlation coefficient) [2,50]. SD is the most commonly used index reflecting the brightness differences and contrast of the fused image. MI quantifies the amount of information transferred from the source images to the fused image. FMI is based on MI and feature information (such as edges, details, and contrast), thereby reflecting the amount of feature information transferred from the source images to the fused image. Q ab f is a very important and frequently used metric measuring the edge information transferred from the source images to the fused image. NCIE measures the general relationship between the fused image and the source images with a number from the closed interval [0, 1]. The bigger the NCIE, the stronger the relationship between the fused image and the source images, which indicates good fusion performance. NCC measures how much information is extracted from the two source images. The larger the NCC, the more similar the fused image is to the source images and the better the fusion performance.
For all above metrics, large values indicate that fusion methods achieve good performances.
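For illustration, SD and a histogram-based MI estimate can be computed as follows; this is a generic sketch, and the cited implementations [6,45] may differ in binning and normalization:

```python
import numpy as np

def sd(img):
    """Standard deviation of an image (the SD contrast metric)."""
    return float(np.std(img))

def mutual_info(a, b, bins=32):
    """Histogram estimate of the mutual information MI(a, b) in bits."""
    h, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = h / h.sum()                        # joint distribution
    px, py = p.sum(axis=1), p.sum(axis=0)  # marginals
    nz = p > 0                             # skip empty bins (log of 0)
    return float(np.sum(p[nz] * np.log2(p[nz] / np.outer(px, py)[nz])))
```

For a fused image F and sources A, B, the MI metric of [45] is typically reported as MI(A, F) + MI(B, F), so that a larger value means more source information survives in F.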

Parameter Analysis and Setting
For our THPF-VSTO-SR fusion method, THPF smoothing is used as the two-scale decomposition tool to extract the approximate and residual layer images. The edge- and structure-preserving characteristics of the THPF smoothing based decomposition are influenced by λ and β in Equation (6). If λ and β vary, the performance of the fusion results also changes, so it is necessary to analyze the performance of our method with respect to λ and β. Besides, the selection of the threshold Th in the VSTO (visual saliency-based threshold optimization) fusion rule is key to the fusion performance. This analysis is carried out with the help of the objective evaluation metrics computed on a testing dataset consisting of 8 source image pairs ('Bench' (S1), 'Camp' (S2), 'Road' (S3), 'Tank' (S6), 'Square' (S7), 'Bunker' (S8), 'Kaptein1123' (S11), and 'Pedestrian' (S13)).
(i) Fusion performance influence of λ: λ is a parameter controlling the whole filtering strength of an image. When we test the influence of the parameter λ on the objective evaluation metrics, β and Th are set to 0.5 and 4, respectively.
As shown in Figure 6, the choice of λ influences the average values of the evaluation metrics SD, MI, FMI, Q ab f, NCIE, and NCC on the testing dataset; λ is varied in the range [0.001, 1]. The metric SD trends upward as λ increases, whereas MI, FMI, Q ab f, NCIE, and NCC generally trend downward, all achieving their peak values at λ = 0.01. Hence, λ = 0.01 is a good compromise.
(ii) Fusion performance influence of β: β determines the edge-preserving behavior of the filtered image. Similarly, when we test the influence of β on the objective evaluation metrics, λ and Th are set to 0.01 and 4, respectively. Figure 7 presents the influence of β. Obviously, SD trends upward as β increases and achieves its highest value at β = 1. However, the other metrics MI, FMI, Q ab f, NCIE, and NCC trend downward as β increases and achieve their best performance at β = 0.5. From Figure 7b-d, it is apparent that the performance of MI, FMI, and Q ab f degrades more quickly when β exceeds 0.5. In light of the comprehensive objective evaluation above, β = 0.5 is set.
(iii) Fusion performance influence of Th: On the basis of λ = 0.01 and β = 0.5, Figure 8 gives the fusion performance influence of Th in the range [1, 8]. From Figure 8a,b,d, the three metrics SD, MI, and Q ab f trend upward as Th increases and all achieve their highest values at Th = 8. Moreover, Th has little effect on NCIE and NCC, whose values remain relatively stable. In contrast, FMI generally trends downward as Th increases, obtaining its highest value at Th = 5. On the one hand, Th = 6 or 7 is the best compromise between Th = 5 and Th = 8 across all metrics. On the other hand, Th ∈ {4, 5, 6} has the benefit of taking the salient objects/regions of both the IR and visible images into account. Therefore, Th = 6, lying in both ranges, is selected as the optimal value according to these two concerns.

Herein, λ = 0.01, β = 0.5, and Th = 6 are set throughout this paper.

Results and Analysis
In what follows, extensive experimental results and the corresponding analysis are presented via subjective evaluations and objective assessments in Sections 5.1 and 5.2, respectively.

Qualitative Analysis via Subjective Evaluations

Figure 9 shows the fusion results of the first three source image pairs 'Bench' (S1), 'Camp' (S2), and 'Road' (S3) in the image dataset. For 'Bench' in the first two rows (Figure 9(a1-a12)), the IR image (Figure 9(a1)) clearly shows the person but lacks texture information of the background. In contrast, the visible image (Figure 9(a2)) provides detailed textures of the background. The outputs of DWT, CSR, TSVSM, ResNet, and IFCNN (see the yellow boxes in Figure 9(a3-a5,a8,a9)) suffer from severe contrast distortions, and therefore do not easily highlight the targets from the background. VSMWLS, TE, and DDcGAN achieve a good visual effect on the targeted person, whereas VSMWLS lacks detailed textures (see the yellow boxes in Figure 9(a7)), and there is brightness loss on the ground in TE and DDcGAN (see the yellow rectangles in Figure 9(a10,a11)). DDcGAN highlights the person well, but the outline shape is indistinct (see the blue box in Figure 9(a11)) due to over-enhancement. In contrast, JSRSD and our method effectively merge both the target information from the IR image without contrast distortions and the background information from the visible image without brightness loss.
For 'Garden' (S5, Figure 10(b1-b12)), the targeted persons in Figure 10(b3,b4,b8) are dim (see the green boxes), whereas the person and background suffer from over-enhancement in Figure 10(b11). In terms of detailed textures, all fusion methods perform well except VSMWLS (see the leaf in the red box of Figure 10(b7)). To provide clear outputs of the test image, we display the fusion results of 'Tank' (S6) at a larger size. For 'Tank', Figure 11(a1) clearly provides the thermal radiation information of targets (e.g., the tank), but there is no sign of detail information of the trees and the grass. In contrast to the IR image, the visible image (Figure 11(a2)) is capable of providing rich details and textures of the trees and grass, but lacks abundant detail information of the tank. For the rear of the tank (see the yellow rectangle), DWT, CSR, TSVSM, VSMWLS, ResNet, and IFCNN suffer from contrast distortions, while JSRSD, TE, DDcGAN, and our method are free of this defect. Furthermore, the wheels of the tank (see the close-up views in the red boxes) have black stains in all methods except DDcGAN and our method. For the grass (see the green rectangles), the fusion results of JSRSD and our method look clearer and more natural than those of the remaining methods. Therefore, our THPF-VSTO-SR method effectively integrates the significant and complementary information of the IR and visible images into a pleasing composite image.

Quantitative Analysis via Objective Assessments
…in Q_abf, SF, TMQI, and SSIM, respectively. When C_bf is 0.5, the ranking of Ours(0.5) for each metric is shown in the last row of Table 2, marked with (*), where * is a number denoting the ranking. Compared with the ten other methods, Ours(0.5) has advantages in SCD and TMQI, while the remaining metrics are at an intermediate level. That is why we set C_bf as a flexible parameter to improve the performance of our method.

Figure 13 presents quantitative comparisons of the metrics for the representative methods (DWT [17], CSR [29], TSVSM [8], JSRSD [30], VSMWLS [27], ResNet [32], IFCNN [36], TE [44], and DDcGAN). For all methods, the average values of SD, MI, FMI, Q_abf, NCIE, and NCC on the 20 image pairs are given in the legend of Figure 13 and listed in Table 1 for ease of observation and comparison. For each metric, the highest value, denoting the best performance, is highlighted in red, the second best value in green, and the third best in blue. As shown in Table 1, DDcGAN achieves the highest SD score, but outline distortions of the targets arise from over-enhancement in its fused images (e.g., Figure 9(a11,b11)). The targets/regions in the fusion results of JSRSD and our method are natural and pleasing in most cases, and the difference in SD between JSRSD and our method is very small. JSRSD, IFCNN, ResNet, CSR, TE, and IFCNN are the runners-up on SD, MI, FMI, Q_abf, NCIE, and NCC, respectively. In addition, our method, JSRSD, DDcGAN, TE, JSRSD (in a tie with IFCNN), and JSRSD take third place on SD, MI, FMI, Q_abf, NCIE, and NCC, respectively. Apart from the champion THPF-VSTO-SR, the methods CSR, ResNet, IFCNN, TE, JSRSD, and DDcGAN also earn fairly good outcomes in the objective assessments.
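To make the metric comparison above concrete, the two most intuitive measures, SD (standard deviation, reflecting contrast) and MI (mutual information transferred from the sources to the fused image), can be computed from grey-level histograms. The following is a minimal sketch, not the paper's evaluation code; the function names and the 256-bin histogram estimator are illustrative assumptions.

```python
import numpy as np

def sd_metric(fused):
    """SD metric: standard deviation of the fused image (higher = more contrast)."""
    return float(np.std(fused.astype(np.float64)))

def mi_metric(src, fused, bins=256):
    """Mutual information between one source image and the fused image,
    estimated from their joint grey-level histogram."""
    joint, _, _ = np.histogram2d(src.ravel(), fused.ravel(), bins=bins)
    p = joint / joint.sum()                      # joint probability table
    px = p.sum(axis=1, keepdims=True)            # marginal of the source
    py = p.sum(axis=0, keepdims=True)            # marginal of the fused image
    nz = p > 0                                   # avoid log(0)
    return float(np.sum(p[nz] * np.log2(p[nz] / (px * py)[nz])))

def fusion_mi(ir, vis, fused):
    """Total information conveyed from both source images to the fused image."""
    return mi_metric(ir, fused) + mi_metric(vis, fused)
```

A fused image that preserves more of both sources yields a larger `fusion_mi`, which is why MI rewards methods that transfer structure from the IR and visible inputs rather than inventing new intensities.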
Through the qualitative and quantitative analyses, the extensive experimental results suggest that our THPF-VSTO-SR scheme achieves better performance than the other representative methods. Table 2 provides the average running time of the various methods tested on 'Camp' (S2) with size 270 × 360. As shown in Table 2, the computational efficiency of our method is not as high as that of DWT, TSVSM, VSMWLS, ResNet, IFCNN, and TE, because sparse coding consumes a large amount of time on huge-scale data. IFCNN wins first place with the help of a GPU, whereas THPF-VSTO-SR has a shorter running time than CSR and JSRSD. In general, sparse representation based methods run longer than most multiscale decomposition and spatial transform based fusion methods. More importantly, the presented method achieves comparable or superior fusion performance in comparison with several state-of-the-art methods. In addition, the random access memory requirements of the ten fusion methods are 4258 M, 4414 M, 4275 M, 4470 M, 4451 M, 4401 M, 5164 M, 4259 M, 5249 M, and 4175 M, respectively.
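Average running times such as those in Table 2 are typically obtained by repeating the fusion call and averaging the wall-clock time. A minimal timing harness in this spirit is sketched below; the function name, warm-up policy, and repeat count are illustrative assumptions, not the paper's benchmark protocol.

```python
import time

def average_runtime(fn, *args, repeats=10):
    """Average wall-clock time of fn(*args) over several runs."""
    fn(*args)  # warm-up run, excluded from timing (caches, lazy initialization)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats
```

Using `time.perf_counter` rather than `time.time` avoids clock-adjustment artifacts; GPU-backed methods such as IFCNN additionally require device synchronization before reading the timer.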

Conclusions and Future Work
This paper proposes a novel image fusion method for infrared and visible images using truncated Huber penalty function (THPF) smoothing based image decomposition, a visual saliency based threshold optimization (VSTO) fusion strategy, and a sparse representation (SR) fusion strategy. In the presented THPF-VSTO-SR method, source images are decomposed into approximate layer images and residual layer images. For the approximate layer components, a visual saliency based threshold optimization fusion rule is proposed to highlight the significant targets/regions in both the infrared and visible images. For the residual layer components, a sparse representation based fusion scheme is implemented to capture the intrinsic structure and texture information. Extensive experimental results demonstrate that the proposed THPF-VSTO-SR method achieves comparable or superior performance relative to several state-of-the-art fusion methods in both subjective evaluations and objective assessments.
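The decompose–fuse–reconstruct pipeline summarized above can be sketched structurally as follows. This is an illustration only: a Gaussian filter stands in for THPF smoothing, a per-pixel maximum stands in for the VSTO rule, and absolute-max selection stands in for the SR rule; all three stand-ins are simplifying assumptions, not the paper's actual operators.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def decompose(img, sigma=3.0):
    """Two-scale decomposition: approximate layer + residual layer.
    (Gaussian smoothing is a stand-in for THPF smoothing.)"""
    approx = gaussian_filter(img.astype(np.float64), sigma)
    return approx, img - approx

def fuse_approximate(a_ir, a_vis):
    """Stand-in for the VSTO rule: keep the brighter (more salient) pixel."""
    return np.maximum(a_ir, a_vis)

def fuse_residual(r_ir, r_vis):
    """Stand-in for the SR rule: keep the larger-magnitude detail coefficient."""
    return np.where(np.abs(r_ir) >= np.abs(r_vis), r_ir, r_vis)

def fuse(ir, vis):
    """Decompose both sources, fuse each layer, then reconstruct by summation."""
    a_ir, r_ir = decompose(ir)
    a_vis, r_vis = decompose(vis)
    return fuse_approximate(a_ir, a_vis) + fuse_residual(r_ir, r_vis)
```

Because reconstruction is a simple sum of the fused approximate and residual layers, any pair of layer-wise rules can be slotted in; the paper's contribution lies precisely in the choice of THPF smoothing for the decomposition and the VSTO/SR rules for the two layers.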
However, the proposed THPF-VSTO-SR method has a few limitations: (i) its computational efficiency is lower than that of most traditional multiscale decomposition and spatial transform based fusion methods (e.g., DWT, TSVSM, VSMWLS, and TE). The reason is twofold: our method involves considerable matrix computation in sparse dictionary coding, whereas the traditional methods use simple and fast fusion rules. (ii) The advantages of THPF-VSTO-SR are demonstrated only on infrared and visible images; other modalities, such as remote sensing images, multifocus images, and medical images, are not considered. Accordingly, several directions are worth investigating in future work. For the first limitation, we can employ multithreaded computing and GPUs to accelerate the huge-scale data computations in SR-based fusion, and explore fast and efficient fusion rules to reduce the computational time. For the second, we will investigate general frameworks for simultaneously merging different multimodality images.