Sparse Representation-Based Multi-Focus Image Fusion Method via Local Energy in Shearlet Domain

Multi-focus image fusion plays an important role in computer vision applications. Because the fusion process can introduce blurring and information loss, our goal is to obtain a high-definition, information-rich fused image. In this paper, a novel multi-focus image fusion method via local energy and sparse representation in the shearlet domain is proposed. The source images are decomposed into low- and high-frequency sub-bands by the shearlet transform. The low-frequency sub-bands are fused by sparse representation, and the high-frequency sub-bands are fused by local energy. The inverse shearlet transform is used to reconstruct the fused image. The Lytro dataset with 20 pairs of images is used to verify the proposed method, and 8 state-of-the-art fusion methods and 8 metrics are used for comparison. According to the experimental results, our method achieves good performance for multi-focus image fusion.


Introduction
Due to the limited depth of field of the optical lens, the imaging device sometimes cannot achieve clear focus imaging of all objects or areas in the same scene, resulting in defocus and blurring of the scene content outside the depth of field [1][2][3][4][5]. In order to solve the above problems, multi-focus image fusion technology provides an effective way to synthesize the complementary information contained in multiple partially focused images in the same scene, and then generate an all-in-focus fusion image, which is more suitable for human observation or computer processing, and has wide application value in digital photography, microscopic imaging, holographic imaging, integrated imaging, and other fields [6][7][8][9][10][11][12][13][14][15]. Now, many multi-focus image fusion methods have been proposed. Especially, the methods based on multi-scale transform, sparse representation, edge-preserving filtering, and deep learning have achieved remarkable results in image fusion [16]. The curvelet [17], surfacelet [18], contourlet [19,20], and shearlet transforms [21][22][23] are widely used in multiscale transform fields. Vishwakarma et al. [24] introduced the multi-focus image fusion algorithm via curvelet transform and the Karhunen-Loève Transform (KLT), and this method can achieve fused images with less noise and improve the information interpretation capability of the fused images. Yang et al. [25] proposed the multi-focus image fusion method using a pulse-coupled neural network (PCNN) and sum-modified-Laplacian algorithms in the fast discrete curvelet transform domain. Zhang et al. [26] proposed a multi-focus image fusion technique using a compound pulse-coupled neural network in a surfacelet domain, with a local sum-modified-Laplacian algorithm used as the external stimulus of the compound PCNN, and the results show that this method can achieve a good performance for multi-focus image fusion. Li et al. 
[27] introduced multi-focus image fusion utilizing dynamic threshold neural P systems and a surfacelet transform, where the sum-modified-Laplacian algorithm and spatial frequency are regarded as the external inputs, and this method can generate good performance. Local extreme map guided image filtering [52] has likewise been applied to image fusion tasks such as medical images, multi-focus images, infrared and visual images, and multi-exposure images, with good results.
Deep learning-based image fusion methods have been widely used in image processing. Zhang et al. [53] proposed an image fusion method using a convolutional neural network (IFCNN), and this method has good performance for multi-focus, infrared-visual, multi-modal medical, and multi-exposure image fusion. Zhang et al. [54] introduced a fast unified image fusion network based on the proportional maintenance of gradient and intensity, and this method can generate good fusion results. Xu et al. [55] proposed the unified and unsupervised end-to-end image fusion network (U2Fusion), and this algorithm achieves better fusion effects than state-of-the-art fusion methods. Dong et al. [56] proposed a multi-branch multi-scale deep learning image fusion algorithm based on DenseNet, and this method achieves excellent results and keeps more feature information of the source images in the fused image.
In order to generate a high-quality multi-focus fusion image, a novel image fusion framework based on sparse representation and local energy is proposed. The source images are separated into low- and high-frequency sub-bands by the shearlet transform; then, the sparse representation model is used for fusing the low-frequency sub-bands, and the local energy-based fusion rule is used for fusing the high-frequency sub-bands. The inverse shearlet transform is applied to reconstruct the fused image. Experimental results show that the proposed multi-focus image fusion method can retain more source image information.

Shearlet Transform
In dimension $n = 2$, the shearlet transform (ST) of a signal $f$ is defined as follows [21]:

$$\mathrm{SH}_\psi f(a, s, t) = \langle f, \psi_{a,s,t} \rangle,$$

where $\mathrm{SH}_\psi(\cdot)$ denotes the shearlet transform and $\langle \cdot, \cdot \rangle$ denotes the inner product. The ST projects $f$ onto the functions $\psi_{a,s,t}$ at scale $a$, orientation $s$, and location $t$.

The element $\psi_{a,s,t}$ is called a shearlet, and it is generated by:

$$\psi_{a,s,t}(x) = a^{-3/4}\, \psi\!\left(M_{a,s}^{-1}(x - t)\right), \quad a \in \mathbb{R}^+,\; s \in \mathbb{R},\; t \in \mathbb{R}^2,$$

where $\mathbb{R}^+$, $\mathbb{R}$, and $\mathbb{R}^2$ denote the positive real numbers, the real numbers, and 2-dimensional real vectors, respectively. The matrix $M_{a,s} = S_s A_a$ consists of two matrices, the shear matrix $S_s$ and the anisotropic dilation matrix $A_a$:

$$S_s = \begin{pmatrix} 1 & s \\ 0 & 1 \end{pmatrix}, \qquad A_a = \begin{pmatrix} a & 0 \\ 0 & \sqrt{a} \end{pmatrix}.$$

The inverse shearlet transform is computed by:

$$f = \int_{\mathbb{R}^2} \int_{\mathbb{R}} \int_{0}^{\infty} \langle f, \psi_{a,s,t} \rangle\, \psi_{a,s,t}\, \frac{da}{a^3}\, ds\, dt.$$

Sparse Representation
Sparse representation can effectively extract the essential characteristics of signals: a signal can be represented by a linear combination of a few non-zero atoms from a dictionary [57]. We define the signal $x \in \mathbb{R}^n$ and the over-complete dictionary $D \in \mathbb{R}^{n \times m}$ ($n < m$). The purpose of sparse representation is to estimate the sparse vector $\alpha \in \mathbb{R}^m$ with the fewest non-zero entries such that $x \approx D\alpha$. Suppose that $M$ training patches of size $\sqrt{n} \times \sqrt{n}$ are rearranged into column vectors in $\mathbb{R}^n$, so that the training database $\{y_i\}_{i=1}^{M}$ is constructed with each $y_i \in \mathbb{R}^n$. The dictionary learning model can be depicted as follows:

$$\min_{D,\{\alpha_i\}} \sum_{i=1}^{M} \|\alpha_i\|_0 \quad \text{s.t.} \quad \|y_i - D\alpha_i\|_2 \le \varepsilon, \quad i = 1, \dots, M,$$

where $\varepsilon > 0$ is an error tolerance, $\{\alpha_i\}_{i=1}^{M}$ are the unknown sparse vectors corresponding to $\{y_i\}_{i=1}^{M}$, and $D \in \mathbb{R}^{n \times m}$ is the unknown dictionary to be learned. Effective models, such as MOD and K-SVD, have been introduced to solve this problem. More details can be found in reference [57].
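The alternating structure of such dictionary-learning models can be illustrated with a simplified MOD-style iteration. This is only a sketch, not the paper's training code: it uses 1-sparse coding in place of a full pursuit step, and the function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_sparse_code(D, Y):
    """Code each column of Y with its single best-matching atom
    (a 1-sparse simplification of the pursuit-based coding step).
    Atoms are assumed to have unit norm."""
    corr = D.T @ Y
    idx = np.argmax(np.abs(corr), axis=0)          # best atom per signal
    A = np.zeros((D.shape[1], Y.shape[1]))
    cols = np.arange(Y.shape[1])
    A[idx, cols] = corr[idx, cols]                 # projection coefficient
    return A

def mod_update(Y, A):
    """MOD dictionary update: least-squares solve D = Y A^+, then
    renormalize the atoms to unit norm."""
    D = Y @ np.linalg.pinv(A)
    norms = np.linalg.norm(D, axis=0)
    norms[norms == 0] = 1.0                        # leave unused atoms alone
    return D / norms

# Toy run: alternating coding and dictionary update reduces the residual.
Y = rng.standard_normal((16, 200))                 # 200 training vectors in R^16
D = rng.standard_normal((16, 32))                  # over-complete: 32 atoms
D /= np.linalg.norm(D, axis=0)
errs = []
for _ in range(5):
    A = one_sparse_code(D, Y)
    errs.append(np.linalg.norm(Y - D @ A))
    D = mod_update(Y, A)
```

Each MOD update minimizes the residual for the current codes, and re-coding with the renormalized atoms cannot increase it, so the recorded errors are non-increasing.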

Proposed Fusion Method
The proposed image fusion algorithm mainly contains four phases: shearlet transform decomposition, low-frequency fusion, high-frequency fusion, and shearlet transform reconstruction. The schematic diagram of the proposed approach is described in Figure 1.
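The four phases can be sketched end to end. A real implementation would use a shearlet transform library for the decomposition and reconstruction steps; the sketch below substitutes a simple additive box-blur split so the flow is runnable, and `blur`, `decompose`, and `fuse` are illustrative names, not from the paper.

```python
import numpy as np

def blur(img, size=3):
    """Box filter via reflect padding (stand-in for a proper low-pass)."""
    pad = size // 2
    p = np.pad(img.astype(float), pad, mode="reflect")
    out = np.zeros(img.shape, dtype=float)
    for di in range(size):
        for dj in range(size):
            out += p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out / size ** 2

def decompose(img):
    """Additive two-band split standing in for the shearlet sub-bands."""
    low = blur(img)
    return low, img - low

def fuse(IA, IB, fuse_low, fuse_high):
    """Phase 1: decompose; phases 2-3: fuse each band; phase 4: reconstruct.
    Because the split is additive, summing the bands inverts it."""
    LA, HA = decompose(IA)
    LB, HB = decompose(IB)
    return fuse_low(LA, LB) + fuse_high(HA, HB)
```

With identical inputs, any reasonable band-wise rules must return the input image itself, which makes a convenient sanity check for the additive reconstruction.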


Shearlet Transform Decomposition
The shearlet transform decomposition is performed on the two source images $\{I_A, I_B\}$ to obtain the low-frequency components $\{L_A, L_B\}$ and the high-frequency components $\{H_A, H_B\}$.

Low-Frequency Fusion
In the low-frequency component, the main energy of the image is concentrated, and the subject of the image lies in the low-frequency component. In this section, $L_A$ and $L_B$ are merged with the sparse representation fusion method. The sliding-window method is utilized to divide $L_A$ and $L_B$ into image patches of size $\sqrt{n} \times \sqrt{n}$, from upper left to lower right, with a step length of $s$ pixels. Assume that there are $T$ patches, depicted as $\{p_A^i\}_{i=1}^{T}$ and $\{p_B^i\}_{i=1}^{T}$, in $L_A$ and $L_B$, respectively. For each position $i$, rearrange $p_A^i$ and $p_B^i$ into column vectors $v_A^i$ and $v_B^i$, and then normalize each vector's mean value to zero to obtain $V_A^i$ and $V_B^i$ by the following equations [57]:

$$V_A^i = v_A^i - \bar{v}_A^i \cdot \mathbf{1}, \qquad V_B^i = v_B^i - \bar{v}_B^i \cdot \mathbf{1},$$

where $\mathbf{1}$ denotes an all-one $n \times 1$ vector, and $\bar{v}_A^i$ and $\bar{v}_B^i$ are the mean values of all the elements of $v_A^i$ and $v_B^i$, respectively. The sparse coefficient vectors $\alpha_A^i$ and $\alpha_B^i$ are computed using the orthogonal matching pursuit (OMP) technique with the following formulas:

$$\alpha_A^i = \arg\min_{\alpha} \|\alpha\|_0 \;\; \text{s.t.} \;\; \|V_A^i - D\alpha\|_2 < \varepsilon, \qquad \alpha_B^i = \arg\min_{\alpha} \|\alpha\|_0 \;\; \text{s.t.} \;\; \|V_B^i - D\alpha\|_2 < \varepsilon,$$

where $D$ denotes the learned dictionary trained by the K-singular value decomposition (K-SVD) method. Then, $\alpha_A^i$ and $\alpha_B^i$ are merged with the "max-L1" rule to obtain the fused sparse vector:

$$\alpha_F^i = \begin{cases} \alpha_A^i, & \|\alpha_A^i\|_1 > \|\alpha_B^i\|_1 \\ \alpha_B^i, & \text{otherwise.} \end{cases}$$

The fused result of $V_A^i$ and $V_B^i$ is computed by:

$$V_F^i = D\alpha_F^i + \bar{v}_F^i \cdot \mathbf{1},$$

where the merged mean value $\bar{v}_F^i$ is calculated by:

$$\bar{v}_F^i = \begin{cases} \bar{v}_A^i, & \|\alpha_A^i\|_1 > \|\alpha_B^i\|_1 \\ \bar{v}_B^i, & \text{otherwise.} \end{cases}$$

The above process is iterated over all source image patches to obtain all the fused vectors $\{V_F^i\}_{i=1}^{T}$. Let $L_F$ denote the low-pass fused result. Each $V_F^i$ is reshaped into a patch $p_F^i$ and plugged into its original position in $L_F$. As the patches overlap, each pixel's value in $L_F$ is averaged over its accumulation times.
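A minimal sketch of the per-patch low-frequency fusion step, assuming the dictionary $D$ is already given. A hand-rolled greedy OMP is used here for self-containment; `omp` and `fuse_patch` are illustrative names, not the paper's code.

```python
import numpy as np

def omp(D, x, eps=0.1):
    """Greedy orthogonal matching pursuit: sparse alpha with x ~= D @ alpha."""
    n, m = D.shape
    residual = x.astype(float).copy()
    support, coef = [], np.zeros(0)
    while np.linalg.norm(residual) > eps and len(support) < n:
        k = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        if k in support:                            # no new atom helps
            break
        support.append(k)
        # least-squares refit on the current support (the "orthogonal" step)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha = np.zeros(m)
    if support:
        alpha[support] = coef
    return alpha

def fuse_patch(D, vA, vB, eps=0.1):
    """Fuse two vectorized low-frequency patches with the max-L1 rule."""
    mA, mB = vA.mean(), vB.mean()            # remove and remember the means
    aA = omp(D, vA - mA, eps)
    aB = omp(D, vB - mB, eps)
    if np.abs(aA).sum() > np.abs(aB).sum():  # max-L1: keep the more active code
        aF, mF = aA, mA
    else:
        aF, mF = aB, mB
    return D @ aF + mF                       # reconstruct and restore the mean
```

For example, with the identity as a toy dictionary, a flat patch codes to an all-zero vector, so the max-L1 rule selects the patch with structure.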

High-Frequency Fusion
The high-frequency components contain a great deal of detailed information, and they are fused using the local coefficient energy, which is defined as follows [58]:

$$E(i,j) = \sum_{(m,n) \in \omega(i,j)} H(m,n)^2,$$

where $H(m,n)$ represents the high-frequency coefficient at pixel $(m,n)$, and $\omega(i,j)$ is a local window of size $M \times N$ centered at pixel $(i,j)$. Let $E_A(i,j)$ and $E_B(i,j)$ denote the local energies computed over the windows $\omega_A(i,j)$ and $\omega_B(i,j)$ centered at pixel $(i,j)$ in $H_A$ and $H_B$, respectively. The high-frequency fused result $H_F$ is obtained by:

$$H_F(i,j) = \begin{cases} H_A(i,j), & E_A(i,j) \ge E_B(i,j) \\ H_B(i,j), & \text{otherwise.} \end{cases}$$
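A minimal sketch of the local-energy rule, assuming a square window and a sum-of-squares energy; `local_energy` and `fuse_high` are illustrative names, not from the paper.

```python
import numpy as np

def local_energy(H, size=3):
    """Sum of squared coefficients over a size x size window at each pixel,
    computed by summing shifted views of the reflect-padded squared map."""
    pad = size // 2
    Hp = np.pad(H.astype(float) ** 2, pad, mode="reflect")
    E = np.zeros(H.shape, dtype=float)
    for di in range(size):
        for dj in range(size):
            E += Hp[di:di + H.shape[0], dj:dj + H.shape[1]]
    return E

def fuse_high(HA, HB, size=3):
    """Pixel-wise selection: keep the coefficient whose local energy is larger."""
    EA, EB = local_energy(HA, size), local_energy(HB, size)
    return np.where(EA >= EB, HA, HB)
```

A single strong coefficient wins inside its window, while flat regions fall back to the other sub-band, which is the selection behavior the rule above describes.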

Shearlet Transform Reconstruction
The inverse shearlet transform is performed on L F and H F to reconstruct the final fused image I F .

Experimental Results and Discussions
In this section, 20 pairs of multi-focus images from the Lytro dataset [59] (Figure 2) are selected for experiments with subjective and objective evaluation metrics to demonstrate the effectiveness of the proposed multi-focus image fusion algorithm. Comparison with recently published algorithms highlights the advantages of our image fusion algorithm. Eight state-of-the-art image fusion methods are selected for comparison: the nonsubsampled contourlet transform with a fuzzy-adaptive reduced pulse-coupled neural network (NSCT) [29], image fusion using the curvelet transform (CVT) [57], image fusion with a parameter-adaptive pulse-coupled neural network in the nonsubsampled shearlet transform domain (NSST) [36], the image fusion framework based on a convolutional neural network (IFCNN) [53], the fast unified image fusion network based on the proportional maintenance of gradient and intensity (PMGI) [54], the unified unsupervised image fusion network (U2Fusion) [55], local extreme map guided multi-modal image fusion (LEGFF) [52], and zero-shot multi-focus image fusion (ZMFF) [60]. A single image fusion evaluation index cannot fully reflect image quality, so multiple evaluation indexes are used together to analyze the data and image information more objectively. Eight metrics are used for the objective evaluation: the edge-based similarity measurement $Q_{AB/F}$ [61], the human-perception-inspired metric $Q_{CB}$ [62], the structural-similarity-based metric $Q_Y$ introduced by Yang et al. [62], the structural-similarity-based metric $Q_E$ [62], the gradient-based metric $Q_G$ [62], the nonlinear correlation information entropy $Q_{NCIE}$ [62], the mutual information $Q_{MI}$ [61], and the phase-congruency-based metric $Q_P$ [62]. Figures 3-7 show the corresponding fusion results, and Figure 8 and Tables 1-5 show the corresponding metrics data. In our method, the decomposition level of the shearlet transform is 4, and the direction numbers are [10, 10, 18, 18].
The dictionary size is set to 256, and the iteration number of K-SVD is fixed to 180. The patch size is 6 × 6, the step length is set to 1, and the error tolerance $\varepsilon$ is set to 0.1.

Figure 3 shows the fused images of the different methods on the first pair of images in Figure 2, and Table 1 shows the corresponding metrics data. The fused images generated by the NSCT, CVT, and NSST algorithms are blurred in some areas. The PMGI method generates a dark image that is distorted and blurred. The IFCNN, U2Fusion, LEGFF, and ZMFF methods generate higher brightness. Compared with the other fusion methods, our method has the best fusion result, and more complementary image information is retained. The enlarged areas in the images allow some details of the fused images to be observed. From Table 1, we can see that the metrics data $Q_{AB/F}$, $Q_Y$, $Q_E$, $Q_G$, $Q_{NCIE}$, $Q_{MI}$, and $Q_P$ generated by our method are the best, with the corresponding values 0.7446, 0.9708, 0.8868, 0.7273, 0.8243, 6.5008, and 0.7860, respectively. The ZMFF method generates the best value of $Q_{CB}$ with 0.7802, and our method, which achieves the value 0.7760, ranks second.

Table 2. Objective evaluation of methods in Figure 4.
Table 3. Objective evaluation of methods in Figure 5.
Table 4. Objective evaluation of methods in Figure 6.

Figure 4 shows the fused images of the different methods on the second pair of images in Figure 2, and Table 2 depicts the corresponding metrics data.
The fused images generated by the NSCT, CVT, IFCNN, LEGFF, and ZMFF algorithms produce a considerable fusion effect, and the images are similar. The NSST algorithm produces clearer close-range information, while distant information, such as the outline of the mountain, is relatively fuzzy. The PMGI algorithm produces a fuzzy fusion image that does not achieve the effect of information complementarity; its definition is obviously low, so it is difficult to observe details in the image. The U2Fusion method improves the brightness of some areas of the image, such as the man's face, but the man's head, mouth, and neck areas are obviously dark, so it is impossible to observe these parts. Compared with the other fusion algorithms, our algorithm obtains clear close and distant information, achieves the effect of information complementarity, and maintains the image details well, and the result is easy to observe. From Table 2, we can see that the metrics data $Q_{CB}$, $Q_Y$, and $Q_E$ computed by our method are the best, with the corresponding values 0.6924, 0.9593, and 0.8684, respectively.

Figure 5 shows the fused images of the different methods on the third pair of images in Figure 2, and Table 3 shows the corresponding metrics data. The fused images generated by the NSCT and NSST algorithms are blurred in the girl's face area. The CVT, IFCNN, LEGFF, and ZMFF methods generate all-focus images. The PMGI approach generates a distorted and blurred fusion image, making it impossible to obtain details from the images. Some areas in the fused image acquired by the U2Fusion method are very dark, such as the collars of the boy and girl, the boy's tongue and hair, and the leaves. Our algorithm obtains a full-focus image, and the details of the source images are preserved well.
From Table 3, we can see that the metrics data $Q_{AB/F}$, $Q_Y$, $Q_E$, $Q_G$, and $Q_P$ generated by our method are the best, with the corresponding values 0.7134, 0.9589, 0.8710, 0.7139, and 0.8194, respectively.

Figure 6 shows the fused images of the different methods on the fourth pair of images in Figure 2, and Table 4 shows the corresponding metrics data. The fused images generated by the NSCT, CVT, IFCNN, LEGFF, and ZMFF algorithms produce basically full-focus images. The NSST method blurs some content, such as the contour information of the woman in the distance. The PMGI method produces a completely blurred and dark result. The U2Fusion method makes some areas too bright and some areas too dark, and does not achieve moderate brightness. Our method produces a clear full-focus image, and the information complementation achieves an optimal effect. From Table 4, we can see that the metrics data $Q_{AB/F}$, $Q_{CB}$, $Q_Y$, $Q_E$, $Q_G$, and $Q_P$ generated by our method are the best, with the corresponding values 0.7148, 0.7301, 0.9584, 0.8691, 0.7162, and 0.8249, respectively.

Figure 7 shows the fused results of the different methods on the other images in Figure 2, allowing the fusion effects of the different algorithms to be compared. Figure 8 shows the line chart of the metrics data of the different methods on the images in Figure 2, where the fluctuation of the corresponding index values obtained by the different algorithms on the 20 groups of multi-focus images can be observed. The average metrics data of the different methods in Figure 8 are shown in Table 5; from this table, we can see that the metrics data $Q_{AB/F}$, $Q_{CB}$, $Q_Y$, $Q_E$, $Q_G$, and $Q_{NCIE}$ generated by the proposed method are the best. The values of $Q_{MI}$ and $Q_P$ generated by the IFCNN method are the best; however, the corresponding values obtained by our algorithm still rank second among all the algorithms.
Through qualitative and quantitative evaluation and analysis, our algorithm achieves the best multi-focus image fusion effect.

Conclusions
In order to generate a clear full-focus image, a novel multi-focus image fusion method based on sparse representation and local energy in the shearlet domain is introduced. The shearlet transform is utilized to decompose the source images into low- and high-frequency sub-bands; the sparse-representation-based fusion rule is used to fuse the low-frequency sub-bands, and the local-energy-based fusion rule is used to fuse the high-frequency sub-bands. Twenty groups of multi-focus images are tested, and the effectiveness of the proposed algorithm is verified through qualitative and quantitative evaluation and analysis. The average metrics data $Q_{AB/F}$, $Q_{CB}$, $Q_Y$, $Q_E$, $Q_G$, and $Q_{NCIE}$ computed by our method are the best, with the corresponding values 0.7343, 0.7436, 0.9538, 0.8808, 0.7317, and 0.8299, respectively; the values of $Q_{MI}$ and $Q_P$ are also highly competitive. In future work, we will extend this algorithm to multi-exposure image fusion and other multi-modal image fusion tasks.