Multi-Focus Image Fusion Based on Decision Map and Sparse Representation

Abstract: As the focal length of an optical lens in a conventional camera is limited, it is usually arduous to obtain an image in which every object is in focus. This problem can be solved by multi-focus image fusion. In this paper, we propose an entirely new multi-focus image fusion method based on decision map and sparse representation (DMSR). First, we obtained a decision map by analyzing low-scale images with sparse representation, measuring the effective clarity level, and using spatial frequency methods to process uncertain areas. Subsequently, the transitional area around the focus boundary was determined by the decision map, and we implemented the transitional area fusion based on sparse representation. The experimental results show that the proposed method is superior to the other five fusion methods, both in terms of visual effect and quantitative evaluation.


Introduction
Multi-focus image fusion is a method of combining multiple images with different focal points into a composite image in which all objects are completely focused. The composite image will be more suitable for visual perception, making it easier for humans to further complete image processing tasks. Multi-focus image fusion technology has been widely used in digital photography, computer vision, military reconnaissance, and other fields [1].
With the maturity and improvement of image fusion technology, miscellaneous image fusion methods have emerged in the past few years. As many new fusion algorithms have been proposed recently, we feel inclined to divide the current fusion methods into four categories: multiscale transform (MST) methods, spatial domain methods, sparse representation (SR) methods, and neural network methods. Among the existing transform domain image fusion methods, MST is widely used [2]. A variety of multiscale transforms have been proposed and applied to image fusion. These include the Laplacian pyramid (LP), discrete wavelet transform (DWT) [3,4], dual-tree complex wavelet transform (DTCWT) [5], and discrete cosine harmonic wavelet transform (DCHWT) [6]. The multiscale geometric analysis tools developed in recent years have higher directional sensitivity than wavelets, such as shearlet transform [7], curvelet transform (CVT) [8], nonsubsampled contourlet transform (NSCT) [9], and so on. All of these transform domain fusion methods share a similar "decomposition-fusion-reconstruction" framework. First, the source images are decomposed into a multiscale transform domain to obtain transform coefficients, and the transform coefficients are then fused based on a certain fusion rule. Finally, the fusion coefficients are inversely transformed to reconstruct the fused image. Neural network methods, in contrast, jointly generate the activity level measurement and the fusion rules, and overcome some difficulties faced by certain existing fusion methods.
Based on the analysis and research of existing multi-focus image fusion methods, we propose a new multi-focus image fusion method based on decision map and sparse representation (DMSR), which can not only satisfy the requirements of the visual effect and fusion performance but also make the algorithm robust and adaptive. In our framework, the advantages of fusion methods based on the decision map and sparse representation are combined. Considering that the human visual system does not require much detail in identifying the focused and defocused area of the source images, we generated a sparsity graph using low-scale images of the source images. In the existing multi-focus image fusion methods based on the decision map, each pixel is strictly defined as focused or defocused, which inevitably leads to erroneous judgment in the decision map. In particular, the pixels of the uncertain region are difficult to determine simply as focus or defocus. In order to avoid this defect, we analyzed the sparseness of the corresponding points in the sparsity graph and divided each pixel into three categories-focused, defocused, and uncertain-to generate the initial decision map. Then, the spatial frequency method was used to further divide each point in the uncertain region of the initial decision map into focused or defocused points, and the final decision map was determined. After obtaining the fused image based on the final decision map, the transitional area of the source images was detected according to the final decision map, and the area was processed by the multi-focus image fusion algorithm based on the sparse representation to obtain the transitional area fusion result. Finally, the fused image based on the final decision map and the transitional area fused image were averaged to obtain the final fused image. 
In order to verify the effectiveness of the proposed method, we performed a large number of experiments using two data sets based on the three target quality indicators. The experimental results show that our method is superior to the other five methods, both in terms of visual effect and quantitative evaluation.
The remainder of this paper is organized as follows. Section 2 describes the specifics of our proposed method. The experimental results, a comparison with the state-of-the-art methods and objective evaluations are demonstrated in Section 3. Finally, Section 4 is the conclusion of this paper.

Proposed Fusion Scheme
The newly proposed multi-focus image fusion framework is shown in Figure 1. The fusion method consists of two main steps: generating a decision map and performing fusion. In the first step, multi-focus feature analysis of the low-scale images of the two source images is performed to obtain the corresponding clarity score maps. Then, they are normalized to get the initial decision map, and the spatial frequency method is used to obtain the final decision map. Section 2.1 details the creation of the score maps, and the specific processes for further obtaining the initial decision map and the final decision map are described in Section 2.2. In the second step, the fused image based on the final decision map and the transitional area fused image are obtained, respectively, and the two images above are averaged to obtain the final fused image. Among these steps, the fusion process of the transitional area is based on sparse representation, which is elaborated in Section 2.3.

Clarity Score Map
Firstly, wavelet decomposition is performed on the two multi-focus source images by a wavelet basis, and four sub-band images are obtained, respectively: horizontal low-frequency and vertical low-frequency (LL), horizontal low-frequency and vertical high-frequency (LH), horizontal high-frequency and vertical low-frequency (HL), and horizontal high-frequency and vertical high-frequency (HH). Among them, the LL low-frequency sub-band images still maintain the overview and spatial characteristics of the source images and are suitable for the analysis and extraction of the subsequent source image focusing features, so they are selected as the low-scale images of the algorithm, as shown in Figure 2c,d. Next, the sparse representation of the low-scale images is carried out, and the corresponding sparsity graphs are generated. Finally, two corresponding clarity score maps are obtained by the image block-based clarity measurement method. The main steps of creating clarity score maps are described as follows:
• The low-scale versions of the source images I_LL^A, I_LL^B ∈ R^(H×W) are divided into √n × √n image patches using the smooth window technique from top left to bottom right, with a sliding step of one. All patches are reshaped into n-dimensional column vectors v_i. Given the global dictionary Φ ∈ R^(n×K) (n << K), each column vector can be sparsely represented over Φ.
• Let M_i^A and M_i^B denote the activity levels of each corresponding pair of patches (the sums of the absolute values of their sparse coefficients). If M_i^A ≥ M_i^B, each score value within the √n × √n patch centered at (x_i + √n, y_i + √n) in the clarity score map S_A is increased by one, and vice versa, as shown in Figure 3. In addition, the total number of comparisons between each corresponding pair of patches is recorded in a weight map W.
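The patch-voting procedure above can be sketched in Python. For brevity, this sketch replaces the sum of absolute sparse coefficients with a simple gradient-energy proxy (an assumption; the paper computes the activity level from sparse codes over the dictionary Φ), while the sliding-window voting and weight-map bookkeeping follow the steps described above. The function name is illustrative.

```python
import numpy as np

def clarity_score_maps(I_A, I_B, n=8, activity=None):
    """Patch-wise clarity voting between two low-scale images.

    `activity` stands in for the sum of the absolute sparse
    coefficients of a patch; a gradient-energy proxy is used here.
    Returns the score maps S_A, S_B and the weight map W."""
    if activity is None:
        def activity(p):
            # high-frequency energy as a clarity proxy
            return np.abs(np.diff(p, axis=0)).sum() + np.abs(np.diff(p, axis=1)).sum()
    H, W = I_A.shape
    S_A = np.zeros((H, W)); S_B = np.zeros((H, W)); Wmap = np.zeros((H, W))
    for y in range(H - n + 1):          # sliding step of one
        for x in range(W - n + 1):
            if activity(I_A[y:y+n, x:x+n]) >= activity(I_B[y:y+n, x:x+n]):
                S_A[y:y+n, x:x+n] += 1  # patch judged clearer in A
            else:
                S_B[y:y+n, x:x+n] += 1
            Wmap[y:y+n, x:x+n] += 1     # comparisons covering each pixel
    return S_A, S_B, Wmap
```

Note that every pixel's score is bounded by its weight-map entry, so the ratio S/W gives a normalized clarity score usable for the later thresholding step.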

Decision Map
The above clarity score maps are binarized by a given threshold K1 and denoted as S'_A and S'_B, as shown in Figure 4a,b (the focused pixels are marked in yellow, and the defocused pixels are marked in blue). It can be observed that there may be some misjudgment areas caused by misclassification in the focused area or the defocused area. Morphological techniques are used to filter out these misclassifications to obtain the standard normalized clarity score maps. The results, denoted as S_A and S_B, are shown in Figure 4c,d. Thus, we can determine the location of the uncertain area where the focused areas of Figure 4c,d overlap.
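A minimal sketch of this step follows. Two assumptions are made where the text leaves details unspecified: the score map is normalized by the weight map W before thresholding, and the morphological technique is a binary opening (erosion followed by dilation); the helper names are illustrative.

```python
import numpy as np

def _shift_or(M):
    """Dilation with a 4-connected structuring element (np.roll wraps at borders)."""
    out = M.copy()
    for ax in (0, 1):
        for sh in (1, -1):
            out |= np.roll(M, sh, axis=ax)
    return out

def _shift_and(M):
    """Erosion with a 4-connected structuring element."""
    out = M.copy()
    for ax in (0, 1):
        for sh in (1, -1):
            out &= np.roll(M, sh, axis=ax)
    return out

def binarize_and_clean(S, Wmap, k1=0.65, iters=1):
    """Threshold the normalized score map, then apply a morphological
    opening to remove small misclassified speckles."""
    B = (S / np.maximum(Wmap, 1)) >= k1
    for _ in range(iters):
        B = _shift_and(B)   # erosion
    for _ in range(iters):
        B = _shift_or(B)    # dilation
    return B
```

The opening removes isolated misjudged pixels while approximately preserving the large focused regions, matching the cleanup illustrated in Figure 4c,d.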

Finally, the initial decision map is obtained, as shown in Figure 4e, where the white pixels indicate the uncertain area. In order to make the size of the decision map consistent with the source images, an upsampling operation is also carried out on the initial decision map. The next target is to generate the final decision map. As mentioned above, there is still an uncertain area in the initial decision map D. To obtain the final decision map, further analysis and processing of the uncertain area is needed.
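The three-way labelling that yields the initial decision map can be sketched as follows. Treating pixels where the two binarized maps agree (both focused, or neither) as uncertain is an assumption based on the overlap remark above; the label constants are illustrative.

```python
import numpy as np

FOCUSED_A, FOCUSED_B, UNCERTAIN = 1, 0, 2

def initial_decision_map(BA, BB):
    """Three-way labelling from the binarized, cleaned score maps.

    Pixels focused in exactly one map are decided; pixels where the
    two maps agree (overlap or neither) are left uncertain for the
    later spatial frequency step."""
    D = np.full(BA.shape, UNCERTAIN, dtype=np.uint8)
    D[BA & ~BB] = FOCUSED_A
    D[~BA & BB] = FOCUSED_B
    return D
```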
We use the spatial frequency method to divide the pixels of the uncertain area in the initial decision map D into two categories, focused and defocused, to obtain the final decision map containing only the focused area and the defocused area. In the spatial frequency method, I is the input image, Ω is a 7 × 7 window centered on the point (x, y), and the horizontal and vertical first differences of the pixel values are accumulated over Ω. The larger the spatial frequency value, the higher the clarity of the point. Thus, points in the uncertain area of the initial decision map D can be classified according to the following decision rule: assuming that the spatial frequency values of a corresponding uncertain pixel in the two source images are SF_A(x, y) and SF_B(x, y), respectively, if SF_A(x, y) > SF_B(x, y), the pixel is determined to be a focused point in source image A, and vice versa. On this basis, the final decision map D can be obtained, as shown in Figure 4f.
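A sketch of the spatial frequency measure and the decision rule follows. It uses the textbook definition (root of the mean squared horizontal and vertical first differences over the window), which we assume matches the omitted equation; the function names are illustrative.

```python
import numpy as np

def spatial_frequency(I, x, y, r=3):
    """Spatial frequency over a (2r+1) x (2r+1) window centred on
    (x, y), clipped at the image borders."""
    win = I[max(y - r, 0):y + r + 1, max(x - r, 0):x + r + 1].astype(float)
    rf = np.diff(win, axis=1)  # horizontal first differences
    cf = np.diff(win, axis=0)  # vertical first differences
    return np.sqrt((rf ** 2).mean() + (cf ** 2).mean())

def classify_uncertain(I_A, I_B, x, y):
    """Decision rule: the uncertain pixel is focused in the source
    image with the larger spatial frequency."""
    return 'A' if spatial_frequency(I_A, x, y) > spatial_frequency(I_B, x, y) else 'B'
```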

Fusion
Based on the final decision map D, the fused image I_F can be simply obtained by combining the source images according to D. However, in this way, the pixels in the transitional area are effectively averaged, which can cause undesirable effects such as the edge-blocking effect and the artificial-edge effect. These effects are hard to suppress because pixel classification in the transitional area faces the following difficulties: the difference in the clarity of the pixels is small, the gray-level change is irregular, and traditional classification methods have difficulty with accurate division. For the transitional area, we therefore choose the fusion method based on sparse representation. The determination of the transitional area and the specific fusion algorithm are as follows:
• Centered on the boundary line of the final decision map D, an appropriate radius (3-5 pixels) is set, and the corresponding rectangular area is delineated as the transitional area R.
• The patches of the transitional area R are sparsely represented over the dictionary Φ; here, j is the column index of the sparse coefficient matrix, and τ is the index of the atom in the dictionary Φ.

• The fused vector V_F without the DC components is obtained by fusing the sparse coefficient vectors of the corresponding patches.
• The fused DC component obeys the following rule: when the DC components from the different source images are close to each other, they are averaged; otherwise, the minimal DC component is selected.
• Each column vector v_F^j in V_F is reshaped into a block of size √n × √n and then overlaid at its recorded position in Λ to reconstruct the transitional area fused image.
• Finally, the transitional area fused image based on the sparse representation and the fused image I_F based on the final decision map are averaged to generate the final fused image. As shown in Figure 5, compared with the fused image based on the final decision map alone, our final fused image is significantly clearer at the "brim edge" and "sweater texture".

Regarding the fifth step of the algorithm, most of the existing fusion methods calculate the fused DC components using a simple average. However, this easily produces fuzzy effects around some strong edges due to the great change in brightness. The main reason is that, when focus is lost, the energy of the region with high brightness diffuses into the region with low brightness. Therefore, we modify the fusion rule for the DC components: when the DC components from different source images are close to each other, we choose the average operation; otherwise, the minimal DC component is selected.
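The transitional-area steps above can be sketched as follows. Three details are stated assumptions where the text omits the exact equations: the band around the decision boundary is grown by simple dilation, the fused sparse vector is the one with the larger activity level (a common SR fusion rule), and the closeness of DC components is judged against a threshold tau.

```python
import numpy as np

def transitional_area(D, radius=3):
    """Band of `radius` pixels around the boundary of the binary
    decision map D (step one above). np.roll wraps at the image
    borders; a full implementation would pad instead."""
    boundary = np.zeros(D.shape, dtype=bool)
    for ax in (0, 1):
        for sh in (1, -1):
            boundary |= D != np.roll(D, sh, axis=ax)
    R = boundary.copy()
    for _ in range(radius):  # grow the boundary into a band
        grown = R.copy()
        for ax in (0, 1):
            for sh in (1, -1):
                grown |= np.roll(R, sh, axis=ax)
        R = grown
    return R

def fuse_sparse_vectors(alpha_A, alpha_B):
    """Keep the sparse code with the larger activity level (sum of
    absolute coefficients) -- an assumed rule, since the paper's
    equation is omitted in this extraction."""
    return alpha_A if np.abs(alpha_A).sum() >= np.abs(alpha_B).sum() else alpha_B

def fuse_dc(dc_A, dc_B, tau=10.0):
    """Modified DC rule from the text: average when the components
    are close (tau is an assumed parameter), otherwise keep the
    minimal one so that energy from bright regions does not diffuse
    into dark ones."""
    return (dc_A + dc_B) / 2.0 if abs(dc_A - dc_B) <= tau else min(dc_A, dc_B)
```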

Experiment and Analyses
This section verifies the effectiveness of the proposed method by experimenting with different types of source images. The fusion results of the proposed method are compared with several existing fusion algorithms, including DCHWT [6], SOMP [19], GF [15], IM [16], and CNN [30].

Source Images
The experiment was performed on two image datasets. The first one included eight pairs of popular multi-focus source images, as shown in Figure 6 [31]. The other one was composed of 20 pairs of color multi-focus images selected from the Lytro picture gallery, as shown in Figure 7 [32].



Parameter Setting
8 × 8 image patches were used in the computation of sparse coefficients for each pixel location. Besides that, the block size of the sliding window used for clarity level comparison in the clarity score map was also fixed to 8 × 8. The threshold K1 for binarizing the clarity score maps was set as K1 = 0.65. The overcomplete dictionary Φ used in the sparse representation had a size of 64 × 256, which was trained globally on a large set of natural images. The residue error of the SOMP algorithm was set as ε = 5. The DCHWT method was implemented based on multiscale transform toolboxes downloaded from MATLAB Central [33], and its level of wavelet decomposition was set to 4. The codes for the GF and IM methods can be found on Xu Dongkang's homepage [34], and the codes for the NSCT-PCNN are available on Qu Xiaobo's homepage [35]. The parameters of these methods were set to their recommended values.

Objective Evaluation Metrics
To evaluate the fusion quality of the different fusion methods, three fusion quality metrics were utilized in our experiment. For each metric, a larger value indicates better fusion quality.

1. Normalized mutual information, Q_MI [36]: Q_MI is used to overcome the deficit of MI [37] and is defined as

Q_MI = 2 [ MI(A, F) / (H(A) + H(F)) + MI(B, F) / (H(B) + H(F)) ],

where H(X) is the entropy of image X, and MI(X, Y) is the mutual information between images X and Y. Q_MI measures the amount of information in the fused image inherited from the source images.

2. Petrovic's metric, Q_AB/F [38]: Q_AB/F evaluates the fusion performance by measuring the amount of gradient information transferred from the source images into the fused image. It is calculated from Q_AF(i, j) = Q_AF_g(i, j) · Q_AF_o(i, j), where Q_AF_g(i, j) and Q_AF_o(i, j) are the gradient magnitude and orientation preservation values at pixel location (i, j), respectively. Q_BF is computed similarly to Q_AF. W_A(i, j) and W_B(i, j) are the weights of Q_AF(i, j) and Q_BF(i, j), respectively.
3. The quality index, visual information fidelity for fusion (VIFF) [39]: this is a multiresolution image fusion metric based on visual information fidelity. To calculate the VIFF, the images are divided into blocks in each sub-band, and the visual information in each block is measured using different models, including the Gaussian scale mixture (GSM) model, the HVS model, and the distortion model. The VIFF of each sub-band is then calculated, and an overall quality measure is determined by weighting.
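As a sketch, the first metric, Q_MI, can be estimated from histograms as follows. The 256-bin histogram estimators and the leading factor of 2 (from the common normalized-MI definition) are assumptions where the source equation is garbled; the function names are illustrative.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy (bits) of an integer-valued image."""
    p = np.bincount(img.ravel(), minlength=bins) / img.size
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def mutual_info(x, y, bins=256):
    """Mutual information (bits) from the joint histogram of x and y."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)   # marginal of x
    py = p.sum(axis=0, keepdims=True)   # marginal of y
    nz = p > 0
    return (p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum()

def q_mi(A, B, F):
    """Normalized mutual information between sources A, B and fused F."""
    return 2 * (mutual_info(A, F) / (entropy(A) + entropy(F))
                + mutual_info(B, F) / (entropy(B) + entropy(F)))
```

A useful sanity check: when the fused image equals both sources, each normalized term is 1/2, so Q_MI reaches 2.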

Evaluation on Popular Multi-Focus Images
In this section, we demonstrate the advantages of the proposed method (DMSR) on popular multi-focus images. As an example, the fused images of the "Lab" pair (640 × 480) obtained by the different fusion methods are presented in Figure 8c-h. The "Lab" source images are shown in Figure 8a,b. For better comparison, we also present the normalized difference images between the correctly focused source image and the fusion results in Figure 9. It can be observed that the fused images obtained by the DCHWT and SOMP methods showed serious artifacts and visible fake edges around the "man". The GF method had ringing artifacts and blurring effects near the "men". The IM method suffered from blurring effects near the "men's hair". The CNN method could achieve better fusion quality, but some small defects could still be found with careful observation, such as imperceptible artificial flaws on the "table" (see the lower middle in Figure 9e). Comparatively, the DMSR produced the best fused image.

Another example, the fusion results of the "Flowerpot" image pair (944 × 736), is shown in Figure 10c-h. The normalized difference images between the correctly focused source image and the fusion results are shown in Figure 11. Similar to the previous example, the DCHWT and SOMP methods produced serious artifacts around the "horologe". The fused image obtained by the GF method suffered from a ringing effect, and the edges of the "horologe" were blurred. The results of the IM method also showed similar artifacts near the "horologe". Although the CNN method performed well overall, it exposed obvious artifacts on the "ground" and the "wall" of the fused image. Comparatively, the DMSR method exhibited the best visual quality.
To evaluate fusion performance more objectively, each pair of popular multi-focus images was fused by the six fusion methods. The values of the metrics Q_MI, Q_AB/F, and VIFF were calculated and are recorded in Table 1, with the best results indicated in bold. It can be seen that the DMSR method outperformed all the other methods and won in almost all the quality metrics.

Evaluation on Lytro Image Dataset
The Lytro image dataset was composed of 20 color multi-focus image pairs of the same size (520 × 520). For visual evaluation, the fused results of the "Lytro17" image pair obtained by the different fusion methods are demonstrated in Figure 12. In order to observe the fusion effect in the transitional area more intuitively, some details of the puppy have been cropped and enlarged. The DCHWT method still exhibited undesirable ringing artifacts around the head, as shown in Figure 12c. The same phenomenon can also be seen in Figure 12d,e,g. As shown in the close-up views of Figure 12f, the IM method suffered from severe blurring effects and false edges. Comparatively, the DMSR method produced an ideal fused image without perceptible artifacts along the focus boundary.

Further, the quantitative assessments of the six methods are shown in Figure 13. The charts show that the proposed method outperformed the others and obtained the best quality metrics.

Evaluation on Three Multi-Focus Images
Our method is also suitable for more than two multi-focus images. The three source images of "Toy" (512 × 512) are shown in Figure 14a-c, and close-up views are shown at the bottom for better observation. Figure 14d,e show that the fused images obtained by the DCHWT and SOMP methods suffered from serious blurring effects at the "ball" in the right corner. The GF fusion method produced jagged edges around the "puppet", as shown in Figure 14f. The IM fusion method exhibited slight blurry artifacts in the upper-right corner of the "ball", as shown in Figure 14g. Compared with the other methods, the CNN and DMSR performed well. As shown in Figure 14h,i, all focused areas from the source images were merged into the fusion image with imperceptible artifacts. The values of Q_MI, Q_AB/F, and VIFF for the various fusion methods are presented in Table 2, with the best results indicated in bold.


Conclusions
In this paper, we propose a new multi-focus image fusion method based on decision map and sparse representation. By generating the initial decision map through focus feature analysis of low-scale images, not only can the performance be guaranteed, but the computational complexity can also be effectively reduced. Given how difficult classification decisions are in the transitional area, we used a fusion algorithm based on sparse representation to fuse this area directly, effectively reducing the error caused by incorrect judgment while ensuring the quality of fusion. In addition, the fusion method is generalized to be capable of fusing more than two images. Experimental results show that the fusion method proposed in this paper has better fusion quality than the other methods, both in terms of visual perception and objective measurement. In the future, we plan to evaluate whether the method proposed here can be applied to multi-focus image fusion in dynamic scenes.
Author Contributions: B.L. and H.C. conceived and designed the algorithm; B.L. and H.C. performed the experiments; W.M. analyzed the data and contributed reagents/materials/analysis tools; B.L. and H.C. wrote the paper; W.M. provided technical support and revised the paper.