A Multi-Focus Image Fusion Method Based on Convolutional Neural Network and Sparse Representation

Multi-focus image fusion is an important branch of image processing, and many methods have been developed to address it from different perspectives. Among them, fusion methods based on sparse representation (SR) and on convolutional neural networks (CNNs) have been widely used. The SR-based model fuses source image patches, so it is essentially a local method with a nonlinear fusion rule. The CNN-based method, by contrast, learns a decision map that directly guides the mapping between the source images; it is a global method with a linear fusion rule. Combining the advantages of these two approaches, a novel fusion method that applies a CNN to assist SR is proposed in order to obtain a fused image with more precise and abundant information. In the proposed method, source image patches are fused based on SR and new weights obtained from the CNN. Experimental results demonstrate that the proposed method clearly outperforms the SR and CNN methods in terms of both visual perception and objective evaluation metrics, is competitive with other state-of-the-art methods, and greatly reduces the computational complexity.


Introduction
In the field of image processing, multi-focus image fusion is a significant branch [1][2][3]. It is the process of combining two or more images of the same scene, taken with different focal points, into a single all-in-focus composite image that serves both human and machine perception [4,5]. Multi-focus image fusion is useful in a wide variety of applications such as remote sensing and computer vision [6].
In the past decade, sparse representation (SR)-based methods have been extensively applied to multi-focus image fusion [7]. SR has proven to be an extraordinarily powerful signal modeling method with a good reputation in both theoretical research and practical application [8]. Yang and Li first applied SR to image fusion [9], and a large number of SR-based fusion methods followed [10,11]. Liu and Wang proposed an adaptive sparse representation (ASR) model for simultaneous image fusion and denoising [12]. In the ASR model, a set of compact sub-dictionaries is learned from a large number of image patches that are pre-classified into several categories according to their gradient information; for a given set of source image patches, one of the sub-dictionaries is selected adaptively. In [13], a convolutional sparse representation (CSR)-based image fusion framework was presented, in which each source image is decomposed into a base layer and a detail layer. SR-based methods are, by nature, local methods with a nonlinear fusion rule that is used to merge the source image patches.
In contrast to the relatively complex SR-based fusion methods, Liu et al. [14] proposed a multi-focus image fusion method based on a convolutional neural network (CNN). In this method, a decision map that provides an accurate measurement of the activity level is obtained from the CNN model. A pixel-by-pixel weighted-average strategy is then employed to obtain the fused image. Compared with SR methods, the CNN method is a global one with a linear fusion rule.
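The pixel-by-pixel weighted average at the end of this pipeline amounts to the following minimal sketch (an illustration in NumPy; the function name is ours, not from [14]):

```python
import numpy as np

def weighted_average_fusion(I1, I2, decision):
    """Pixel-wise weighted average: the decision map d in [0, 1] gives the
    weight of I1 at each pixel, and 1 - d gives the weight of I2."""
    return decision * I1 + (1 - decision) * I2
```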
Uniting the merits of the two aforementioned methods, a novel fusion method is presented that yields fused images with more abundant information. In this method, source image patches are fused according to new weights obtained from the CNN and SR.
The rest of this paper is organized as follows. In Section 2, some related work is discussed. In Section 3, the basic idea of the proposed fusion method is presented in detail.
Experimental results and discussions are demonstrated in Section 4. Finally, Section 5 concludes the paper.

Sparse Representation
The general process of most SR-based methods is divided into three steps. First, the input images are divided into a collection of patches and the sparse codings of all patches are acquired [15]. Next, the fused sparse vectors/codings are determined by a nonlinear fusion rule based on the norms of the sparse vectors [16]. The final result is obtained by reconstruction.
In the sparse coding step, SR represents the image patches over a pre-trained dictionary, ultimately yielding a more concise representation [17][18][19][20]. Given a patch s ∈ R^n and a trained dictionary D = [d_1, d_2, ..., d_K] ∈ R^(n×K) (n < K) with atoms d_k, the SR of s is a sparse vector x = [x_1, x_2, ..., x_K] that not only satisfies s = Dx or s ≈ Dx, but is also sparse. This problem can be formulated as

x̂ = arg min_x ‖x‖_0 subject to ‖s − Dx‖_2 ≤ ε,

where ‖·‖_0 denotes the semi-norm that counts the number of nonzero entries in x and ε is the error tolerance. This l_0-minimization is a well-known NP-hard problem [21]. Approximation techniques include greedy algorithms such as matching pursuit (MP) and orthogonal matching pursuit (OMP), which are extensively applied to such problems [22,23]. The dictionary D is trained via the K-singular value decomposition (K-SVD) algorithm, shown in Algorithm 1.
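As an illustration, a minimal NumPy sketch of OMP (hypothetical, not the exact implementation used in the paper) greedily selects the atom most correlated with the residual and refits the coefficients by least squares:

```python
import numpy as np

def omp(D, s, max_atoms, tol=1e-6):
    """Orthogonal Matching Pursuit: greedily approximate
    min ||x||_0  s.t.  ||s - Dx||_2 <= eps.
    D: (n, K) dictionary with unit-norm columns; s: (n,) signal."""
    residual = s.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(max_atoms):
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k not in support:
            support.append(k)
        # least-squares refit on the selected atoms (the "orthogonal" step)
        coef, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        residual = s - D[:, support] @ coef
        if np.linalg.norm(residual) <= tol:
            break
    x[support] = coef
    return x
```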
Some issues require further discussion. Sparse coding of every patch greatly increases the computational complexity, and it is doubtful to what extent the magnitude of the norm is consistent with the activity level of the corresponding patch. We therefore ask whether there is a better criterion for fusing the SR coefficients.

CNN-Based Image Fusion Method
In [14], a representative CNN method for multi-focus image fusion is presented; Figure 1b shows the CNN model used for fusion. Each branch of the network has three convolutional layers and a max-pooling layer, which together act as feature extraction. The extracted features are fully connected to a 256-dimensional vector, and the output of Figure 1b is a 2-dimensional vector holding the two scores of the input image patches P_1, P_2; this vector produces a probability distribution over two classes, so the fully connected layers can be regarded as classification. The softmax loss function is then applied to obtain the value of the score map. Note that in the fusion process, after the two fully connected layers are converted into convolutional layers, the network can process source images of any size as a whole, without dividing them into small patches [14]. The output of the CNN model is the score map, which represents the sharpness level of the pixels. Further details of the CNN model can be found in [14].

Algorithm 1 Dictionary Learning (K-SVD)
Input: training samples Y ∈ R^(m×n), number of atoms K, maximum iterations J, tolerance ε_0.
Output: dictionary D, sparse matrix X.
1: Initialize: randomly take K column vectors from the original samples Y ∈ R^(m×n), or take the first K column vectors d_1, d_2, ..., d_K of its left singular matrix, as the atoms of the initial dictionary D_0 ∈ R^(m×K); set j = 0.
2: Sparse coding: using the dictionary D_j, obtain the sparse matrix X_j ∈ R^(K×n).
3: Dictionary update: update D_j column by column, for each column d_k ∈ {d_1, d_2, ..., d_K}:
• When updating d_k, calculate the error matrix E_k = Y − Σ_{i≠k} d_i x^i_T, where x^i_T denotes the i-th row of the sparse matrix.
• Take the set of indices ω_k where the k-th row vector x^k_T of the sparse matrix is nonzero, select the corresponding columns of E_k, and obtain the restricted error matrix E'_k.
• Perform the singular value decomposition E'_k = UΣV^T; take the first column of U to update the k-th column of the dictionary, that is, d_k = U(·, 1). Let x^k_T = Σ(1, 1)V(·, 1)^T, and write it back to the corresponding entries of the original x^k_T.
4: Set j = j + 1.
5: Repeat the sparse coding and dictionary update steps
6: until the specified number of iterations J is reached, or the error converges below ε_0.

The specific steps of the CNN-based method are described below. The two source images are first fed to the CNN model to obtain a score map that encodes their focus information. Each pixel in the score map is computed from the focus characteristics of a pair of corresponding patches from the two source images. A focus map with the same size as the source images is then obtained from the score map by averaging the overlapped patches. Afterwards, the focus map is converted into a binary map by thresholding at 0.5. The binary map is refined with small-region removal and guided image filtering to create the final decision map. Finally, the fused image is obtained through a pixel-wise weighted-average algorithm.
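The dictionary-update sweep of Algorithm 1 might be sketched in NumPy as follows (an illustrative simplification; the sparse coding step and the stopping test are omitted, and variable names are our assumptions):

```python
import numpy as np

def ksvd_dictionary_update(Y, D, X):
    """One dictionary-update sweep of K-SVD (Algorithm 1, step 3).
    Y: (m, n) training samples; D: (m, K) dictionary; X: (K, n) sparse codes.
    Each atom d_k, together with its row of codes, is refit by a rank-1 SVD
    of the residual restricted to the samples that actually use the atom."""
    D, X = D.copy(), X.copy()
    K = D.shape[1]
    for k in range(K):
        omega = np.nonzero(X[k, :])[0]        # samples using atom k
        if omega.size == 0:
            continue
        # error without atom k's contribution, on the supported samples
        E_k = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
        U, S, Vt = np.linalg.svd(E_k, full_matrices=False)
        D[:, k] = U[:, 0]                     # d_k <- first left singular vector
        X[k, omega] = S[0] * Vt[0, :]         # matching row of codes
    return D, X
```

Each rank-1 update is optimal for its atom with the others fixed, so the overall reconstruction error ‖Y − DX‖_F is non-increasing across the sweep.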
A few issues still require further discussion. In the focus map, the in-focus and out-of-focus regions of the source images are separated. For the junction area between in-focus and out-of-focus regions, the image patches are not well explained by the black-box CNN model, and blockiness and artifacts easily arise. The CNN method uses the focus map to learn the decision map, and the final fusion rule is linear. We therefore ask whether there is a better way to utilize the map.

Complementarity of the Two Methods
Based on the complementarity of SR and CNN, a novel multi-focus fusion method is proposed. First, the weight map is derived from the focus map produced by the CNN model. The source image patches obtained through the sliding-window technique have strong correlation and spatial consistency, and spatially adjacent patches have similar focus conditions. If a patch is entirely in-focus or out-of-focus, it can be taken directly from the source images without any computation. At the junction of in-focus and out-of-focus areas, the new SR is employed. In the new SR, a weighted norm measures the activity level of the source image patches, and the fused coefficients are chosen according to the magnitude of the weighted norm. Reconstruction then yields the fused image patches. Finally, the fused image is obtained through the pixel-wise weighted-average algorithm. In summary, the proposed multi-focus fusion method gives each patch a suitable fusion rule.
The highlights of the mixed method based on SR and CNN include: (1) the classification of image patches based on the CNN model reduces the computational complexity of SR [24][25][26]; (2) the pixel values of the decision map obtained from the CNN model are imposed on the norms of the sparse vectors, which measures the activity level of the source image patches more accurately and takes full advantage of the strong spatial correlation between patches; (3) SR can handle the in-focus/out-of-focus junction areas that the black-box CNN cannot handle properly, making the patches in the junction area interpretable; and (4) SR performs a nonlinear fusion of the patches at the junction of the in-focus and out-of-focus areas.

Proposed Fusion Algorithm
The proposed method based on CNN and SR includes three principal parts: (1) CNN-based weight map generation; (2) fusion of image patches based on the new SR; and (3) fast image fusion based on patches. The following subsections describe these steps at length. The algorithm flow is shown in Figure 2.

CNN-Based Weight Map Generation
We suppose that I_1, I_2 are the two source images of size X × Y, with I_1 taken as the reference. The two images are fed to a pre-trained CNN to acquire the score map, whose size is ⌈X/16⌉ × ⌈Y/16⌉ (⌈·⌉ denotes the ceiling operation). Every value of the score map, which represents the focus level of a 16 × 16 patch of I_1, lies between 0 and 1; the closer the value is to 1, the more focused the corresponding patch of I_1 is. Each pixel of the score map is then extended to a 16 × 16 matrix with identical elements, and a focus map of size X × Y is obtained through a pixel-wise overlap-averaging tactic. Initial segmentation and small-region removal are then performed on the focus map to obtain the decision map. Next, a sliding window is run over the decision map, with a patch size of 8 × 8 and a step size of 1; each patch is averaged to obtain the pixel value at the corresponding position of the weight map E, i.e., the weight of that patch. The size of E is (X − 8 + 1) × (Y − 8 + 1). The flow chart for generating the weight map E is shown in Figure 2a.
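The weight map construction described above can be sketched roughly as follows (a NumPy illustration under assumed shapes; the small-region removal and guided-filtering refinements of the decision map are omitted):

```python
import numpy as np

def weight_map_from_score_map(score_map, X, Y, patch=16, win=8):
    """Sketch of Section 3.1: each score-map value is expanded to a
    patch x patch constant block to build an X-by-Y focus map, the focus map
    is binarized at 0.5 into a decision map, and a win x win sliding average
    gives the weight map E of size (X - win + 1) x (Y - win + 1)."""
    # expand each score to a 16x16 constant block, then crop to X x Y
    focus = np.kron(score_map, np.ones((patch, patch)))[:X, :Y]
    decision = (focus > 0.5).astype(float)     # initial segmentation
    # 8x8 sliding average via a 2-D cumulative sum (integral image)
    ii = np.zeros((X + 1, Y + 1))
    ii[1:, 1:] = decision.cumsum(0).cumsum(1)
    E = (ii[win:, win:] - ii[:-win, win:]
         - ii[win:, :-win] + ii[:-win, :-win]) / win**2
    return E
```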

Fusion of Image Patches Based on the New SR
Given the image patches P_q, q = 1, 2, represented as vectors V_q, the vectors are first normalized to obtain v_q. The normalized vectors v_q are then represented over the dictionary as

v_q ≈ Dα_q,

where D is the dictionary pre-trained via the K-SVD algorithm, as shown in Figure 1a, and the sparse vectors α_q of P_q are obtained with the OMP algorithm. The fusion coefficients are determined by the weighted l_1-norms

M_1 = ω‖α_1‖_1, M_2 = (1 − ω)‖α_2‖_1,

where ω is the weight of P_1 obtained from E. The weighted l_1-norm M_q reflects the actual activity level of the image patch, which avoids wrongly selecting a patch whose unweighted norm is small. The fused sparse vector α_F is the α_q with the larger M_q, and the fused result V_F is reconstructed from α_F. V_F is reshaped into the 8 × 8 patch P_F, which is the fused image patch. Finally, each pixel value of the fused image I_F is obtained by averaging over the overlapping patches.
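The weighted-l_1-norm selection described above might look like the following sketch (an illustration; the mean-removal normalization and the per-patch means are our assumptions about details not fully specified here):

```python
import numpy as np

def fuse_patches_weighted_l1(alpha1, alpha2, w, D, mean1=0.0, mean2=0.0):
    """Sketch of the new SR fusion rule (Section 3.2): the activity of each
    patch is its weighted l1-norm, M1 = w*||a1||_1 and M2 = (1-w)*||a2||_1,
    and the sparse code with the larger activity wins."""
    M1 = w * np.abs(alpha1).sum()
    M2 = (1 - w) * np.abs(alpha2).sum()
    if M1 >= M2:
        alpha_f, mean_f = alpha1, mean1
    else:
        alpha_f, mean_f = alpha2, mean2
    # reconstruct the fused patch vector and restore the removed mean
    return D @ alpha_f + mean_f
```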

Fast Image Fusion Based on Patches
Using the sliding-window technique, I_1 and I_2 are divided into n × n patches I^t_1, I^t_2, t = 1, ..., T. The number of patches from each image is T = (X − n + 1)(Y − n + 1). In fact, the procedure proposed in Section 3.2 is not needed for every patch. First, the weight map E is expressed in vector form E_t, which is used to identify the patches that do not need sparse coding.
When E_t = 1, i.e., the image patch of I_1 is in-focus (for example, the patches at the positions of the red diamonds in Figure 3a), the fusion result is I^t_F = I^t_1. If E_t = 0, i.e., the image patch of I_1 is out-of-focus (the green squares in Figure 3a), the fusion result I^t_F is I^t_2. In the case of 0 < E_t < 1, the image patch lies somewhere between the in-focus and out-of-focus regions (the blue blocks in Figure 3a). Only in this case is the new SR fusion method adopted, with ω = E_t, to obtain the fused patch.
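The three-case classification can be sketched as a simple loop (illustrative only; `sr_fuse` stands in for the new SR fusion of Section 3.2):

```python
def fuse_all_patches(P1, P2, E, sr_fuse):
    """Sketch of the fast patch classification (Section 3.3): patch t is
    copied from I1 when E_t == 1, from I2 when E_t == 0, and only the
    boundary patches (0 < E_t < 1) go through the expensive SR fusion."""
    fused = []
    for p1, p2, e in zip(P1, P2, E):
        if e == 1.0:
            fused.append(p1)                 # in-focus in I1: copy directly
        elif e == 0.0:
            fused.append(p2)                 # out-of-focus in I1: copy from I2
        else:
            fused.append(sr_fuse(p1, p2, e)) # boundary: new SR fusion
    return fused
```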
This classification greatly reduces the computational complexity.

Experiments
This section successively presents the experimental settings, including the source images to be processed, the image fusion quality metrics, the parameter settings, a computational complexity analysis and the compared methods, followed by the image fusion results, which are analyzed both visually and quantitatively.

Source Images
In order to illustrate the experimental results, different types of source images are applied. There are 12 pairs of source images, including five pairs of multi-focus grayscale images in Figure 4 and seven pairs of multi-focus color images in Figure 5. These images are obtained from the Lytro Multi-Focus Dataset that contains 20 pairs of color multi-focus images and four series of multi-focus images with three sources, and the Multi-Focus-Image-Fusion-Dataset that includes 150 different images used in multi-focus-image-fusion algorithms [27,28].

Evaluation Metrics
To verify the performance of image fusion methods, subjective and objective evaluation metrics are usually applied. Subjective evaluation means that people judge the relative merits of the methods through the visual effects of the fusion results; it is affected by uncertain factors such as the observer's own condition, professional knowledge, viewing angle, application occasion and environment [29]. Subjective evaluation is thus less reliable and objective, and objective evaluation is required to assist it. The objective evaluation method conducts a quantitative analysis of the fused images through mathematical models, which overcomes the limitations of subjective evaluation and yields stable and reliable results [30]. Generally speaking, it is difficult to evaluate the merits and flaws of a fusion method with only one evaluation index; therefore, researchers generally adopt a comprehensive evaluation with multiple indexes.
In this paper, five metrics are employed to evaluate the fusion quality; the larger their values, the higher the fusion performance. The five metrics are introduced as follows:
1. Mutual information (MI) mainly reflects how much information the fused image contains from the source images [31]; the greater the mutual information, the more information of the source images the fused image contains, and the better the fusion effect. It is defined as

MI(I_q, I_F) = Σ_{x,y} h_{I_q,I_F}(x, y) log_2 ( h_{I_q,I_F}(x, y) / ( h_{I_q}(x) h_{I_F}(y) ) ), MI = MI(I_1, I_F) + MI(I_2, I_F),

where h_{I_q}(x) and h_{I_F}(y) are the edge (marginal) histograms of I_q and I_F, respectively, and h_{I_q,I_F}(x, y) is the normalized joint histogram of I_F and the source image I_q.
2. The Chen-Blum metric Q_CB is a human-perception-inspired fusion metric, calculated as follows. First, the masked contrast map C'_{I_q}(x, y) for the input image I_q(x, y) is computed from Peli's contrast C with real scalar parameters k, l, m, n (more details on the parameter settings can be found in [32]). The information preservation value Q_{I_q,I_F}(x, y) and the saliency map µ_{I_q}(x, y) are then calculated from the masked contrast maps, with µ_{I_q}(x, y) = C'_{I_q}(x, y)^2 / ( C'_{I_1}(x, y)^2 + C'_{I_2}(x, y)^2 ). The global quality map is

Q_GQM(x, y) = µ_{I_1}(x, y) Q_{I_1,I_F}(x, y) + µ_{I_2}(x, y) Q_{I_2,I_F}(x, y),

and Q_CB is the average of Q_GQM.

3. The gradient-based fusion metric Q_G is a popular metric that computes the amount of gradient information of the source images injected into the fused image [33]. It is calculated as

Q_G = Σ_{x,y} ( Q_{I_1,I_F}(x, y) τ_{I_1}(x, y) + Q_{I_2,I_F}(x, y) τ_{I_2}(x, y) ) / Σ_{x,y} ( τ_{I_1}(x, y) + τ_{I_2}(x, y) ),

where Q_{I_q,I_F}(x, y) = Q^{I_q,I_F}_e(x, y) Q^{I_q,I_F}_o(x, y), with Q^{I_q,I_F}_e(x, y) and Q^{I_q,I_F}_o(x, y) the edge strength and orientation preservation values, respectively. The weight factor τ_{I_q}(x, y) reflects the significance of Q_{I_q,I_F}(x, y).

4. The phase-congruency-based fusion metric Q_P measures how well image-salient features of the source images, such as edges and corners, are preserved in the fused image [34]. It is defined as

Q_P = (P_r)^θ (P_H)^υ (P_h)^σ,

where r, H and h refer to the phase congruency and the maximum and minimum moments, respectively, and the exponential parameters θ, υ, σ are all set to 1. More details about Q_P can be found in [34].

5. Q_Y, proposed by Yang et al., is a structural-similarity-based fusion assessment metric [35]. Its definition combines a local weight µ(ω) with the structural similarity of the images; the details of µ(ω) and SSIM(I_1, I_2) can be found in [35,36].
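As an illustration of the first metric above, mutual information between two images can be estimated from a normalized joint histogram roughly as follows (a sketch; the bin count is an assumption, not a value from the paper):

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """Sketch of the MI fusion metric: MI(A, B) is computed from the
    normalized joint histogram p_ab and the marginals p_a, p_b; the
    reported fusion score is MI(I1, IF) + MI(I2, IF)."""
    h, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = h / h.sum()                       # normalized joint histogram
    p_a = p_ab.sum(axis=1, keepdims=True)    # marginal of a
    p_b = p_ab.sum(axis=0, keepdims=True)    # marginal of b
    nz = p_ab > 0                            # avoid log(0)
    return float((p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz])).sum())
```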

Parameters Setting
In this section, the training parameters are set. For SR-based image processing applications, the image patch size is 8 × 8 and the step length of the sliding-window technique is 1 pixel, which has been proven to be an appropriate setting [37]. The dictionary is obtained with the K-SVD method from 68,000 image patches randomly selected from natural images. Following [38], NSCT is selected for the MST- and MST-SR-based multi-focus image fusion methods. The compared methods were implemented from their publicly available code, with parameters set according to the original reports. All experiments were performed in MATLAB R2017a on an Intel(R) Xeon(R) Silver 4110 CPU.

The Compared Methods
The effectiveness of the proposed algorithm was evaluated against state-of-the-art methods. The first was the NSCT-based method, which uses a weighted average for the low-pass sub-bands and 'max-absolute' for the high-pass sub-bands. The second compared algorithm was based on SR [9]. The third was ASR [12]. The fourth was NSCT-SR-1 [38]: each pre-registered source image was decomposed by a 1-level NSCT decomposition into low-pass and high-pass coefficients; the low-pass coefficients were merged with an SR-based fusion method, while the high-pass coefficients were fused using the absolute values of the coefficients as the activity level measure. The fifth approach was CSR [13]. The sixth compared algorithm was based on CNN [14].

Computational Complexity Analysis
In order to verify that the proposed algorithm reduces the computational complexity, Figure 3b is given. In Figure 3b, the ordinate is the number of positions, and the abscissas 1-5 indicate the five pairs of grayscale source images. The red and blue rectangles indicate the numbers of patch positions fused by SR in the traditional SR method and in the algorithm proposed in this paper, respectively. The histogram shows that the red rectangles are much higher than the blue ones. Therefore, the proposed fusion algorithm greatly reduces the number of patches that need to be fused by SR, thereby enormously reducing the computational complexity.
Referring to Table 1, the running time of CNN-SR is less than that of SR and CNN. The bold font in Table 1 indicates the cases where CNN-SR shortens the running time by more than one minute. In summary, the method proposed in this paper improves the computational efficiency.

Validity of the Proposed Fusion Method
In this section, the compared methods and the proposed method are applied to the commonly used multi-focus grayscale images of Figure 4; the fused results are shown in Figures 6-10. An example of 'flowerpot' fusion is shown in Figure 6, with a magnification of the left-hand clock at the lower left corner of each image. From the magnified details, it can be seen that the fusion results obtained by MST, SR, ASR and MST-SR are uneven to varying degrees. For the remaining three fusion results, the human visual system struggles to tell the difference; hence, objective evaluation is needed.
As shown in Table 2, CNN-SR leads in four of the five indicators: MI, Q_G, Q_Y and Q_P. It follows that our proposed method extracts the most information from the source images. Although the Q_CB of our method is slightly smaller, our fusion method best preserves the structure and detail information of the source images and improves the clarity of the fused image.

For the second image pair, blurring and artifacts can be observed in Figure 7c,d,f, whereas the fusion result of the proposed algorithm, shown in Figure 7i, best retains and restores the information at the bottom left. Table 3 exhibits the objective evaluation of Figure 7; the objective results confirm that our approach is the best among the seven methods, showing that CNN-SR extracts the edge and structure information of the source images well.

As can be seen from the magnified details of Figure 8, edge artifacts exist in Figure 8c-f, and those fusion results have low contrast, losing some useful details. Figure 8g, derived by the CSR-based approach, contains artificial edges. Figure 8h,i effectively preserve the detail of the source images without producing visual artifacts or brightness distortion; by comparison, our method achieves a better image appearance. The fusion performance measured by the objective metrics is shown in Table 4. The proposed fusion method is superior to the other methods in terms of MI, Q_CB, Q_G and Q_Y, which shows that the fused image obtained by CNN-SR not only contains more detailed information but is also more suitable for human visual perception. While its Q_P is inferior to that of the CNN-based approach, our approach obtains comparable performance. Therefore, the proposed fusion method is superior to the SR-based method.

The fused results of 'newspaper' are shown in Figure 9, with the fusion details at the lower left corner of all the images.
By comparing the image details fused by the different methods, it can be seen that Figure 9c-g are relatively fuzzy, with poor contrast and brightness, whereas the fused images in Figure 9h,i perform better in information recovery and contrast. The indicators of the proposed and compared methods are shown in Table 5, which shows that our fusion method performs best on MI, Q_CB, Q_G and Q_Y. It can be inferred that the proposed method performs better in visual fidelity, image clarity and structural information. For Q_P, the image fused by the CSR algorithm shows the best result; however, it extracts less information than the image fused by our proposed method. Therefore, the fusion method proposed in this paper is superior to the other compared methods.

The image pair 'temple' and its fusion results are shown in Figure 10, with details at the lower left corner. In the details of Figure 10, the fused images of MST, SR, ASR, MST-SR and CSR show artifacts to different degrees, while the fused images of CNN and of our proposed method show better performance in terms of detail information. The objective evaluation indexes are listed in Table 6, where our method clearly obtains all the largest quality indicators.

A conclusion can be drawn from these experiments. Through visual comparison and objective evaluation, the proposed method shows competitive fusion performance compared with the previous methods. These experimental results show that the proposed method fully extracts the information of the multi-focus source images. After CNN-SR fusion, the fused images have clear edges and no artificial artifacts, preserve the detail information well, have high contrast, and exhibit no uneven fusion.
Both the subjective and objective evaluation of CNN-SR are better than that of other algorithms.

Fusion of Multi-Focus Color Images
The proposed method can be extended to multi-focus color image fusion. In order to prove the effectiveness of CNN-SR on color images, the color source images shown in Figure 5 are adopted. Figure 11 shows the fusion results of the different methods, and Table 7 provides the average scores over the seven pairs of input images under the different fusion methods. The visual fusion results and the quantitative estimates in Table 7 show that the CNN-SR method gains the best fusion results.

Table 7. Quantitative assessments of Figure 11; the values for the seven pairs of input images in Figure 11 are averaged.

Conclusions
We proposed a multi-focus image fusion method based on CNN and SR. In this method, the weight map is acquired from the CNN model, where each pixel of the weight map represents the focus level of a source image patch. If the pixel value of the weight map is 1 or 0, the image patch is in-focus or out-of-focus, respectively, and can be taken directly from the source images. When the pixel value is between 0 and 1, the image patch lies between clear and blurred, and the new SR method is adopted. In the new SR method, the image patches are represented over the dictionary to obtain sparse vectors, and the weight of each patch is multiplied by the l_1-norm of its sparse vector to obtain its actual activity level. The fused sparse vectors are selected by the maximum weighted l_1-norm. The fused image is obtained by aggregating all the reconstructed patches with the pixel-wise overlap-averaging tactic. The classified treatment of the image patches gives the proposed fusion method high computational efficiency while retaining as much information of the source images as possible in the fused image. The qualitative and quantitative comparisons show that the proposed method achieves better fusion performance in both visual and objective evaluation.

Conflicts of Interest:
The authors declare no conflict of interest.