Multi-Focus Image Fusion Method for Vision Sensor Systems via Dictionary Learning with Guided Filter

Vision sensor systems (VSS) are widely deployed in surveillance, traffic and industrial contexts. A large number of images can be obtained via VSS. Because of the limitations of vision sensors, it is difficult to obtain an all-focused image. This causes difficulties in analyzing and understanding the image. In this paper, a novel multi-focus image fusion method (SRGF) is proposed. The proposed method uses sparse coding to classify the focused regions and defocused regions to obtain the focus feature maps. Then, a guided filter (GF) is used to calculate the score maps. An initial decision map is obtained by comparing the score maps. After that, consistency verification is performed, and the initial decision map is further refined by the guided filter to obtain the final decision map. Experiments show that the proposed method obtains satisfactory fusion results and is competitive with existing state-of-the-art fusion methods.


Introduction
A large number of images can be obtained via vision sensor systems (VSS). These images are employed in many applications, such as surveillance, traffic and industrial monitoring, as shown in Figure 1. For example, they can be used to build an urban surveillance system, as in [1], or to monitor objects and behavior, as in [2]. Images with sufficient information are required to achieve these goals. However, since the depth of field (DOF) of vision sensors is limited, it is hard to obtain an all-focused image, which provides more information than a single multi-focus image. This causes difficulties for VSS in analyzing and understanding the image. In addition, it also causes redundancy in storage. To address these problems, multi-focus image fusion technology fuses the complementary information from two or more defocused images into a single all-focused image. Compared with each defocused image, the fused image with extended DOF provides more information and thus supports a better interpretation of the scene.
Popular multi-focus image fusion methods fall into two major branches [3]: spatial domain methods and transform domain methods. Spatial domain methods directly fuse source images via specific fusion rules. The most primitive way is to calculate the mean of the source images pixel by pixel. To avoid treating all pixels the same way, Tian et al. [4] used a normalized weighted aggregation approach. Li et al. [5] decomposed the source image into a detail layer and a base layer, then fused them by using a guided filter. However, pixel-based fusion methods are often sensitive to noise and misregistration. To further enhance the fusion performance, some block- and region-based methods have been proposed. For instance, Li et al. [6] chose the image blocks based on spatial frequency. Miao et al. [7] measured the activity of blocks based on image gradients. Song et al. [8] fused source images adaptively by using the weighted least squares filter. Jian et al. [9] decomposed images into multiple scales and fused them through a rolling guidance filter. Zuo et al. [10] fused images based on region segmentation. Besides spatial frequency and image gradients, the energy of Laplacian is also an important sharpness measure. Although these methods reduce the influence of noise and misregistration, they often suffer from block artifacts and decreased contrast.
Unlike the former, the main idea of transform domain methods is to fuse multi-focus images in the transform domain. These methods include the Laplacian pyramid (LP) [11], the ratio of the low-pass pyramid (RP) [12], the gradient pyramid (GP) [13], discrete wavelet transform (DWT) [14], dual-tree complex wavelet transform (DTCWT) [15] and discrete cosine harmonic wavelet transform (DCHWT) [16]. More recently, multi-scale geometric analysis tools have also been employed. For instance, Tessens et al. [17] used curvelet transform (CVT) to decompose multi-focus images. Zhang et al. [18] used nonsubsampled contourlet transform (NSCT) to decompose multi-focus images. Huang et al. [19] fused source images in the non-subsampled shearlet transform domain. Wu et al. [20] used the hidden Markov model to fuse multi-focus images. Besides the transform domain methods listed above, some newer transform domain methods such as independent component analysis (ICA) [21] and sparse representation (SR) [22,23] are also used to fuse multi-focus images. To avoid block effects and undesirable artifacts, those methods often employ the sliding window technique to obtain image patches. For instance, SR-based image fusion methods divide source images into patches via a sliding window of fixed size, transform the image patches to sparse coefficients, and then apply the L1-norm to the sparse coefficients to measure the activity level.
Although some of the multi-focus fusion methods perform well, there are still some drawbacks that remain to be settled. Some spatial domain methods are sensitive to noise and misregistration, and block effects may appear in the fused images. Besides, some methods also result in increased artifacts near the boundary, decreased contrast and reduced sharpness. For transform domain methods, the fusion rules are based on the relevant coefficients; thus, a small change in the coefficients can cause a huge change in pixel values, which leads to undesirable artifacts.
Sparse representation [22] has drawn much attention in recent years for its outstanding ability in computer vision and machine learning tasks, such as image denoising [24], object tracking [25,26], face recognition [27] and image super-resolution [28][29][30]. Similarly, sparse representation has achieved great success in the field of multi-focus image fusion [31][32][33][34][35]. Yang et al. [31] brought SR to multi-focus image fusion. Based on this work, Liu et al. [32] fused multi-focus images based on SR with adaptive sparse domain selection. In their method, different categories of images were utilized to learn multiple sub-dictionaries. However, this often leads to overfitting of the sub-dictionaries and causes obvious artifacts. To address this problem, Liu et al. [33] decomposed source images into multiple scales and fused them by using SR. To further improve the resolution of the fused image, Yin et al. [34] combined image fusion and image super-resolution based on SR. Besides, Mansour et al. [35] proposed a novel multi-focus image fusion method based on SR with a guided filter, and the Markov random field was also utilized to refine the decision map in their method. These methods can achieve good performance. However, there are still some drawbacks that remain to be settled: 1. Some SR-based methods [31][32][33][34][35] obtain the fused image by fusing the corresponding sparse coefficients directly, while a small change in the coefficients may cause a huge variation in pixel values. This leads to undesirable effects in the fused image. 2. For some ambiguous areas in the multi-focus image, the sparse coefficients cannot determine whether they are focused or not. This often causes spatial inconsistency problems. For example, the initial map obtained by Mansour's method [35] suffered from spatial inconsistency, and the subsequent process to refine the decision map requires much computational cost. 3. The boundary between the focused area and the unfocused area is smooth, while the final decision map obtained by Mansour's method [35] was sharp on the boundary. This may lead to halo effects on the boundary between the focused area and the unfocused area.
To solve these problems, we propose a novel multi-focus image fusion method (SRGF) by using sparse coding and the guided filter [36]. The proposed method uses sparse coefficients to classify the focused regions and defocused regions to obtain the focus feature maps, as shown in Figure 2b. Then, the guided filter is used to calculate the score maps, as shown in Figure 2c. An initial decision map, as shown in Figure 2d, is obtained by comparing the score maps. After that, consistency verification is performed, and the initial decision map is further refined by the guided filter to obtain the final decision map, as shown in Figure 2e. Compared with traditional SR-based methods, there are three major contributions: 1. We use sparse coefficients to classify the focused regions and the unfocused regions to build an initial decision map, as shown in Figure 2d, rather than directly fusing the sparse coefficients. The initial decision map is optimized in the later steps. In this way, we avoid the artifacts caused by improper selection of the sparse coefficients. 2. To address the spatial inconsistency problem, we use the guided filter to smooth the focus feature maps, as shown in Figure 2b, fully considering the connection with the adjacent pixels. In this way, we effectively preserve the structure of images and avoid the spatial inconsistency problem. 3. To generate a decision map that takes the boundary information into account, a guided filter is used to refine the initial decision map. By doing so, the boundary of the final decision map, as shown in Figure 2e, is smoothed and has a slow transition. Thus, the halo artifact of the fused image is efficiently reduced. To validate the proposed method, we conduct a series of experiments. The experiments demonstrate that the proposed method obtains satisfactory fusion results. Moreover, it is competitive with the existing state-of-the-art fusion methods.
The remainder of the paper is organized as follows. In Section 2, the SR theory and the guided filter are briefly reviewed. Section 3 describes the proposed multi-focus image fusion method in detail. Section 4 analyzes the experimental results. Finally, Section 5 concludes the paper.

Related Work
Basic theories of sparse coding and the guided filter are reviewed briefly in this section.

Sparse Coding
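Sparse signal coding [22] has drawn much attention in recent years for its outstanding ability in computer vision tasks and signal processing. This is mainly because a signal can be decomposed into a dictionary and the corresponding sparse coefficients. In other words, given a set of N input signals $Y = \{y_1, \cdots, y_N\} \in \mathbb{R}^{d \times N}$, each signal $y_i$ can be represented as:
$$\min_{D,\,x_i} \sum_{i=1}^{N} \| y_i - D x_i \|_2^2 \quad \text{s.t.} \quad \| x_i \|_0 \le T, \ \forall i \qquad (1)$$
where $y_i \in \mathbb{R}^{d}$, $D \in \mathbb{R}^{d \times m}$ is an over-complete dictionary with m atoms, $x_i \in \mathbb{R}^{m}$ is the sparse coefficient vector of $y_i$, and T is a threshold on the number of non-zero elements in each sparse coefficient vector. The basic concept is shown in Figure 3.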
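To make the sparsity constraint concrete, the following is a minimal NumPy sketch of greedy sparse coding in the style of orthogonal matching pursuit for a fixed dictionary. It is an illustration under stated assumptions rather than the implementation used in this paper: the function name `omp`, the random dictionary and the synthetic signal are all hypothetical.

```python
import numpy as np

def omp(D, y, T):
    """Greedy sparse coding: approximate y with at most T atoms of D.

    D is a (d, m) dictionary with unit-norm columns, y is a length-d signal.
    Returns a length-m coefficient vector with at most T non-zero entries.
    """
    m = D.shape[1]
    x = np.zeros(m)
    selected = np.zeros(m, dtype=bool)
    support = []
    coeffs = np.zeros(0)
    residual = y.astype(float).copy()
    for _ in range(T):
        # Pick the unused atom most correlated with the current residual.
        correlations = np.abs(D.T @ residual)
        correlations[selected] = 0.0
        k = int(np.argmax(correlations))
        if correlations[k] < 1e-12:
            break  # residual is already (numerically) explained
        selected[k] = True
        support.append(k)
        # Least-squares refit of y on all atoms selected so far.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    if support:
        x[support] = coeffs
    return x

# Hypothetical demo: a 64 x 512 random dictionary and a 3-sparse signal.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 512))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(512)
x_true[[3, 100, 400]] = [1.5, -2.0, 0.7]
y = D @ x_true
x_hat = omp(D, y, T=5)
print("non-zeros:", np.count_nonzero(x_hat))
print("residual norm:", np.linalg.norm(y - D @ x_hat))
```

The loop stops either after T atoms or once the residual is numerically zero, so the returned vector always satisfies the constraint $\|x\|_0 \le T$.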

Guided Filter
GF [36] is an edge-preserving smoothing filter. It can avoid ringing artifacts since strong edges would not be blurred during the operation. In this paper, GF is used to smooth the focus feature maps and refine the decision map.
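Given an input image P and a guidance image I, in a local window $\omega_k$ centered at pixel k, we assume that the output image O is linearly related to I:
$$O_i = a_k I_i + b_k, \quad \forall i \in \omega_k \qquad (2)$$
where $\omega_k$ is a square window of size (2r + 1) × (2r + 1). To estimate the linear coefficients $a_k$ and $b_k$, the goal is to minimize the squared difference between O and P:
$$E(a_k, b_k) = \sum_{i \in \omega_k} \left( (a_k I_i + b_k - P_i)^2 + \epsilon a_k^2 \right) \qquad (3)$$
where $\epsilon$ is a regularization parameter that is set manually. The following linear regression is used to calculate $a_k$ and $b_k$:
$$a_k = \frac{\frac{1}{|\omega|} \sum_{i \in \omega_k} I_i P_i - \mu_k \bar{P}_k}{\sigma_k^2 + \epsilon}, \qquad b_k = \bar{P}_k - a_k \mu_k \qquad (4)$$
where $|\omega|$ is the number of pixels in $\omega_k$, $\mu_k$ and $\sigma_k^2$ are the mean and variance of I in $\omega_k$, respectively, and $\bar{P}_k$ is the mean of P in $\omega_k$. The output image O is then obtained according to Equation (2). The guided filter used for smoothing is shown in Figure 4.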
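The local linear model and its regression can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation of [36]: window means are computed with an integral image, borders are handled by edge replication (an assumption made here for brevity), and the per-window coefficients are averaged over all windows covering a pixel before the linear model is applied. The names `box_mean` and `guided_filter` are hypothetical.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, replicating edges at the border."""
    size = 2 * r + 1
    padded = np.pad(img, r, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = img.shape
    total = (integral[size:size + h, size:size + w]
             - integral[:h, size:size + w]
             - integral[size:size + h, :w]
             + integral[:h, :w])
    return total / (size * size)

def guided_filter(I, P, r, eps):
    """Filter input P under guidance I with window radius r and regularizer eps."""
    I = np.asarray(I, dtype=float)
    P = np.asarray(P, dtype=float)
    mean_I = box_mean(I, r)
    mean_P = box_mean(P, r)
    var_I = box_mean(I * I, r) - mean_I ** 2
    cov_IP = box_mean(I * P, r) - mean_I * mean_P
    a = cov_IP / (var_I + eps)   # slope of the local linear model
    b = mean_P - a * mean_I      # offset of the local linear model
    # Average the coefficients of all windows that cover each pixel.
    return box_mean(a, r) * I + box_mean(b, r)

# Hypothetical demo: self-guided smoothing of a noisy step edge.
rng = np.random.default_rng(0)
step = np.zeros((32, 32))
step[:, 16:] = 1.0
noisy = step + 0.05 * rng.standard_normal(step.shape)
smoothed = guided_filter(noisy, noisy, r=8, eps=0.16)
print(smoothed.shape)
```

With a large `eps` relative to the local variance, the slope `a` shrinks toward zero and the filter behaves like a box average; with a small `eps`, strong edges in the guidance image are preserved.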

Proposed Multi-Focus Image Fusion Method
In the proposed method, an over-complete dictionary is trained, and the corresponding sparse coefficients are calculated. The coefficients are used to measure the activity level, and the focus feature maps are obtained according to the activity level. The guided filter is applied to the focus feature maps to generate the score maps. An initial decision map is obtained by comparing the score maps. Then, the guided filter is used to refine the initial decision map.
As shown in Figure 2, the proposed method can be divided into three parts: 1. Learning the dictionary. 2. Calculating the sparse coefficients and obtaining the initial decision map. 3. Refining the initial decision map. The following subsections introduce these steps in detail.

Learning Dictionary
Considering the differences between the focused regions and defocused regions, we want to learn a dictionary that performs well on both types. We blur the natural images several times using a Gaussian filter, since the blurred images have a visual effect similar to that of defocused image patches; besides, we can control the blur level according to the actual needs. This process is shown in Figure 5.
Next, many image patches of a fixed size are randomly sampled from the natural images and the corresponding blurred images. This aims to extend the patch diversity [37] and thus obtain a better sparse dictionary than traditional SR methods. These patches are then used for learning the dictionary D, which is calculated by solving Equation (1) via the K-SVD [22] algorithm. Figure 6 shows the general process.
To train the dictionary D, the related parameters are set as follows. The standard deviation and size of the Gaussian filter are set to 3 and 5 × 5, respectively, and the blur iteration number is set to five. The dictionary size is set to 64 × 512, the patch size is 8 × 8, and the threshold T on the number of non-zero elements is set to five. We randomly selected 10,000 patches from the source images to train the dictionary.
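The preparation of the training set can be sketched as follows. This is a hedged illustration under assumptions: each image is blurred repeatedly and every blur level is kept as a sampling source (one possible reading of the "blur iteration number"), borders are handled by edge replication, and the function names are hypothetical. The resulting 64 × n matrix would then be passed to a K-SVD implementation, which is not shown.

```python
import numpy as np

def gaussian_kernel_1d(size=5, sigma=3.0):
    """Normalized 1-D Gaussian kernel (size is assumed to be odd)."""
    ax = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, size=5, sigma=3.0):
    """Separable Gaussian blur with edge replication at the border."""
    k = gaussian_kernel_1d(size, sigma)
    padded = np.pad(np.asarray(img, dtype=float), size // 2, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def sample_training_patches(images, patch=8, n_patches=10000, n_blur=5, seed=0):
    """Randomly sample vectorized patches from sharp and blurred images."""
    rng = np.random.default_rng(seed)
    pool = []
    for img in images:
        current = np.asarray(img, dtype=float)
        pool.append(current)
        for _ in range(n_blur):
            current = gaussian_blur(current)  # progressively stronger blur
            pool.append(current)
    cols = np.empty((patch * patch, n_patches))
    for j in range(n_patches):
        src = pool[rng.integers(len(pool))]
        r = rng.integers(src.shape[0] - patch + 1)
        c = rng.integers(src.shape[1] - patch + 1)
        cols[:, j] = src[r:r + patch, c:c + patch].ravel()
    return cols

# Hypothetical demo with one random "image" and a small number of patches.
demo_image = np.random.default_rng(1).random((64, 64))
Y_train = sample_training_patches([demo_image], n_patches=200)
print(Y_train.shape)
```

Sampling from both sharp and blurred versions gives the dictionary atoms for focused as well as defocused structures, which is the motivation stated above.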

Sparse Coding and Obtaining Initial Decision Map
After the dictionary D is learned, it is used to calculate the sparse coefficients of the N input multi-focus images. In the sparse coding phase, we adopt a sliding window with the same size as the patch size used in the training phase (i.e., 8 × 8) and sample patches from the source images pixel by pixel. The sampled patches are expanded into column vectors $X_i = \{x_{i1}, x_{i2}, \cdots, x_{i(n-1)}, x_{in}\}$, and the sparse coefficients are calculated by solving Equation (5) via the OMP [38] algorithm.
In Equation (5), σ is a constant (set to 15 in this experiment), $Y_i$ (0 < i ≤ n) are the input images, and $X_i = \{x_{i1}, x_{i2}, \cdots, x_{i(n-1)}, x_{in}\}$, where n denotes the number of patches. The output coefficients reflect whether the input image patches are focused or not. An activity level measure function is defined in Equation (6). Given the input multi-focus images $I_1$ in Figure 7a and $I_2$ in Figure 7b, the related activity level vectors $f_1 = (f_{11}, f_{12}, \cdots, f_{1(n-1)}, f_{1n})$ and $f_2 = (f_{21}, f_{22}, \cdots, f_{2(n-1)}, f_{2n})$ can be calculated via Equation (6). The focus feature maps $E_i$, $i \in \{1, 2\}$, are obtained by reshaping the related activity level vectors $f_i$, $i \in \{1, 2\}$. The focus feature maps are shown in Figure 7c,d. Since the difference between focused regions and defocused regions in $E_i$ is not obvious, GF is adopted to smooth the focus feature maps and obtain the score maps. Here, GF(·) represents the guided filter operator, the guidance images of the guided filter are the focus feature maps themselves, and the parameters are set as $r_1 = 8$ and $\epsilon_1 = 0.16$. The score maps are shown in Figure 7e,f. After obtaining the score maps, the initial decision map is calculated by comparing them.
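The path from sliding-window patches to an initial decision map can be sketched as below. Two points are assumptions made only for illustration: the activity level of a patch is taken to be the L1 norm of its sparse coefficient vector, which is one common choice and may differ from the definition in Equation (6), and ties between the two maps are resolved in favor of the first image. The guided-filter smoothing step is omitted here, so the comparison is applied directly to the feature maps; all function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(img, patch=8):
    """Return all stride-1 patches as columns, plus the patch-grid shape."""
    windows = sliding_window_view(img, (patch, patch))  # (H-p+1, W-p+1, p, p)
    grid_shape = windows.shape[:2]
    cols = windows.reshape(-1, patch * patch).T          # (p*p, n)
    return cols, grid_shape

def focus_feature_map(coeffs, grid_shape):
    """Reshape per-patch activity (assumed: L1 norm) onto the patch grid."""
    activity = np.abs(coeffs).sum(axis=0)
    return activity.reshape(grid_shape)

def initial_decision_map(score_1, score_2):
    """1 where the first source looks more focused, 0 elsewhere."""
    return (score_1 >= score_2).astype(float)

# Hypothetical demo: random arrays stand in for the OMP coefficients.
rng = np.random.default_rng(0)
img = rng.random((12, 10))
patches, grid = extract_patches(img)
coeffs_1 = rng.standard_normal((512, patches.shape[1]))
coeffs_2 = 0.5 * coeffs_1  # second source has uniformly lower activity
E1 = focus_feature_map(coeffs_1, grid)
E2 = focus_feature_map(coeffs_2, grid)
Q_init = initial_decision_map(E1, E2)
print(grid, Q_init.mean())
```

Because the patches are sampled pixel by pixel, the feature map is smaller than the source image by the patch size minus one in each dimension, which is why a later resizing step is needed.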

Refining the Decision Map
The initial decision map $Q_i$ obtained by comparing the score maps may contain some non-smooth edges and some small holes, as shown in Figure 7g. This is because some regions have a similar visual appearance in both input images, and the sparse coefficients cannot determine whether they are focused or not. To remove those small holes, a small-region removal strategy is adopted in our proposed method. The decision map after applying this strategy is shown in Figure 7h; many small holes have clearly been removed. Then, the decision map is up-sampled to the same size as the input images. In addition, the boundary between the focused area and the unfocused area is smooth, while the decision map Q is sharp on the boundary. To address this problem, the guided filter is adopted to optimize the decision map Q. Specifically, we first fuse the multi-focus images using the decision map Q; the resulting fused image then serves as the guidance image of the guided filter. Here, GF(·) represents the guided filter operator, and its two parameters r and $\epsilon$ are set to 8 and 0.1, respectively. The filtered result of the decision map is shown in Figure 7i.
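A small-region removal step can be sketched with a plain flood fill. The text does not specify the connectivity or the area threshold, so both are assumptions here: regions are 4-connected, and `min_area` is a hypothetical parameter. Any connected region of either label that is smaller than `min_area` is flipped to the opposite label.

```python
import numpy as np
from collections import deque

def remove_small_regions(mask, min_area):
    """Flip 4-connected regions (of either label) smaller than min_area."""
    src = np.asarray(mask).astype(bool)
    out = src.copy()
    h, w = src.shape
    visited = np.zeros((h, w), dtype=bool)
    for sr in range(h):
        for sc in range(w):
            if visited[sr, sc]:
                continue
            label = src[sr, sc]
            region = [(sr, sc)]
            visited[sr, sc] = True
            queue = deque(region)
            # Breadth-first flood fill over pixels sharing the same label.
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not visited[nr, nc] and src[nr, nc] == label):
                        visited[nr, nc] = True
                        region.append((nr, nc))
                        queue.append((nr, nc))
            if len(region) < min_area:
                for r, c in region:
                    out[r, c] = not label
    return out

# Hypothetical demo: a clean left/right split with one hole and one island.
clean = np.zeros((10, 10), dtype=bool)
clean[:, :5] = True
noisy_map = clean.copy()
noisy_map[4, 2] = False   # small hole inside the focused half
noisy_map[6, 8] = True    # small island inside the defocused half
cleaned = remove_small_regions(noisy_map, min_area=4)
print(np.array_equal(cleaned, clean))
```

Region sizes are measured on the original map, so flipping one small region never changes how its neighbors are classified.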

Fusion
Finally, the fused image F can be obtained by:
$$F(x, y) = Q(x, y) I_1(x, y) + (1 - Q(x, y)) I_2(x, y) \qquad (12)$$
Figure 7j shows the fused image of the given source images.
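The fusion rule is a per-pixel convex combination of the two sources, which can be written directly in NumPy. The sketch below assumes a decision map with values in [0, 1] and source images of identical size; the function name `fuse` and the broadcasting over color channels are illustrative choices.

```python
import numpy as np

def fuse(Q, I1, I2):
    """Per-pixel blend: Q weights the first source, (1 - Q) the second."""
    Q = np.asarray(Q, dtype=float)
    I1 = np.asarray(I1, dtype=float)
    I2 = np.asarray(I2, dtype=float)
    if I1.ndim == 3:
        Q = Q[..., None]  # broadcast the map over the color channels
    return Q * I1 + (1.0 - Q) * I2

# Hypothetical demo: left half taken from I1, right half from I2.
I1 = np.full((4, 6, 3), 200.0)
I2 = np.full((4, 6, 3), 50.0)
Q = np.zeros((4, 6))
Q[:, :3] = 1.0
F = fuse(Q, I1, I2)
print(F[0, 0, 0], F[0, 5, 0])
```

Where the refined decision map takes intermediate values near the focus boundary, the two sources are mixed smoothly, which is what suppresses halo artifacts.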

Experiments
To verify the proposed method, we performed experiments on twenty groups of color multi-focus images selected from the image dataset "Lytro" [35]. The size of all test images is 520 × 520. Some of the test images are shown in Figure 8.
To evaluate the proposed method objectively, four representative evaluation metrics are adopted as follows:
• Mutual information MI [40] measures how much information from the source images the fused image contains. A higher MI value indicates that the fused image contains more information from the source images.
• Edge retention $Q^{AB/F}$ [41] calculates how much edge information is transferred from the input images to the fused image. A higher $Q^{AB/F}$ value indicates that the fused image contains more edge information from the source images. The ideal value is 1.
• Feature mutual information FMI [42] is a non-reference objective image fusion metric that calculates the amount of feature information, such as gradients and edges, present in the fused image. A higher FMI value indicates that the fused image contains more feature information from the source images. The ideal value is 1.
• The standard deviation SD is used to measure the contrast in the fused image. A higher SD value indicates that the contrast of the fused image is higher.
To evaluate the fusion performance, the color images are converted to grayscale images. For all these quality evaluation metrics, a larger value denotes better performance. Moreover, the largest values are shown in bold.
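Two of the four metrics are simple enough to sketch here. The code below computes SD as the standard deviation of gray levels and MI from a joint gray-level histogram. It assumes 8-bit grayscale inputs and takes the fusion MI as the sum of the mutual information between each source and the fused image, which is a common convention and may differ in detail from the definition in [40]. $Q^{AB/F}$ and FMI are not covered.

```python
import numpy as np

def standard_deviation(img):
    """Contrast measure: standard deviation of the gray levels."""
    return float(np.std(np.asarray(img, dtype=float)))

def mutual_information(a, b, bins=256):
    """Mutual information (in bits) between two 8-bit grayscale images."""
    hist, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins,
                                range=[[0, 256], [0, 256]])
    p_ab = hist / hist.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    nz = p_ab > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[nz] * np.log2(p_ab[nz] / (p_a @ p_b)[nz])))

def fusion_mi(src_a, src_b, fused):
    """Assumed convention: MI(A, F) + MI(B, F)."""
    return mutual_information(src_a, fused) + mutual_information(src_b, fused)

# Hypothetical demo on synthetic data.
ramp = np.arange(256).repeat(4).reshape(32, 32)   # every gray level 4 times
flat = np.zeros((32, 32))
print(standard_deviation(flat))
print(mutual_information(ramp, flat))
print(fusion_mi(ramp, ramp, ramp))
```

An image shares no information with a constant image, while the mutual information of an image with itself equals its entropy, which gives two easy sanity checks for an implementation.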

Fusion of Multi-Focus "Face" Images
Experiments are performed on the "face" images. As Figure 9a,b shows, Source Image 1 is focused on the left part, whereas Source Image 2 is focused on the right part. The man's face and glasses separate the focused region and defocused region. The decision map and the refined decision map are shown in Figure 9c,d; the decision map separates the boundary of the focused region and the defocused region precisely. The fused result of the proposed method is shown in Figure 9l. Figure 9e-k shows the fused results of the DTCWT-, CVT-, NSCT-, GFF-, SR-, NSCT-SR- and CSR-based methods, respectively. As Figure 9 shows, the fused results make full use of the two source images. Compared with the DTCWT, CVT and NSCT methods, the proposed method produces a fused image with smoother edges. The quantitative assessments are shown in Table 1, where bold denotes the largest value. The glasses in Figure 9f,k are not clear enough. This is mainly because the CVT and CSR methods lose some edge information of the source images, which also leads to a low $Q^{AB/F}$ score. Besides, the fused results obtained by the DTCWT method and NSCT method suffer from a slight color distortion. The MI and FMI scores for these two fusion results are relatively low, because much spatial information is lost during the image decomposition process. The other methods, namely the GFF-, SR- and NSCT-SR-based methods, perform well on visual inspection. Taken together, Figure 9 and Table 1 demonstrate the superiority of the proposed method.

Fusion of Multi-Focus "Golf" Images
In this part, experiments are performed on the "golf" images, as shown in Figure 10a,b. Source Image 1 is focused on the man and the golf club, while Source Image 2 is focused on the background. The two regions are separated by the decision map shown in Figure 10c,d. The fusion result obtained by the proposed method is shown in Figure 10l. Figure 10e-k shows the fused results of the DTCWT-, CVT-, NSCT-, GFF-, SR-, NSCT-SR- and CSR-based methods, respectively. The quantitative assessments are shown in Table 2. It can be seen that the ringing effect around the edges in the results of the DTCWT-based and CVT-based methods is obvious. Besides, the contrast of the fused image is reduced at the edge of the hat. This is because the image decomposition level is inappropriate and the fused coefficients of DTCWT and CVT cannot represent the edge information. The $Q^{AB/F}$ and SD scores for their fused images are quite low. Besides, the results of the SR-based method and CSR-based method contain some artifacts: some artificial edges are introduced in the T-shirt and the background. The GFF and NSCT methods yield some artifacts in the man's hair. The result of our method has the best visual quality. In short, the proposed method outperforms all comparative methods in both visual quality and evaluation indicators.

Fusion of Multi-Focus "Puppy" Images
Experiments are performed on the "puppy" images, as shown in Figure 11a,b. Source Image 1 is focused on the puppy and the foreground; Source Image 2 is focused on the background. The decision map and the refined decision map are shown in Figure 11c,d. The border between the focused region and the defocused region is clearly separated by the decision map. The fusion result of the proposed method is shown in Figure 11l. Figure 11e-k shows the fused results of the DTCWT-, CVT-, NSCT-, GFF-, SR-, NSCT-SR- and CSR-based methods, respectively. The quantitative assessment for this experiment is shown in Table 3. Compared with the proposed method, the DTCWT-, CVT- and NSCT-based methods choose inappropriate regions, which leads to unclear edges. For these methods, the quantitative assessments in terms of $Q^{AB/F}$ and FMI are relatively low. The fused images of the SR-based method and NSCT-SR-based method look better in this respect, but there are still some small blocks in the fused images. This is mainly because traditional SR-based methods use the sparse coefficients directly to fuse the multi-focus images, which often leads to block effects. The GFF-based method performs well, but the contrast of its fused image is decreased due to the unsuitable proportion of the "detail layer" and "base layer". The fusion result of the proposed method retains abundant information and handles the boundary well. Figure 11 and Table 3 demonstrate that the proposed method outperforms all comparative methods in this experiment.
Figure 11. Fusion of "puppy" images.

Statistical Analysis of Fusion Results
Experiments were also performed on the other images in the "Lytro" dataset. Some fusion results are shown in Figure 12. The proposed method can produce a precise decision map, which separates the focused region from the unfocused region accurately. Besides, the refined decision map obtained by the guided filter is robust to edges, which effectively avoids artifacts on the edges. To further demonstrate the effectiveness of our method, a one-way ANOVA test was performed to statistically compare the quantitative assessment distributions of all images in the "Lytro" dataset. The p-value threshold was set to 0.05. Table 4 shows the results of the ANOVA test. Smaller values mean more significant differences. The p-values smaller than the threshold are shown in bold. It can be seen that the p-values for MI and FMI are smaller than the pre-defined threshold. This means that there are significant overall differences in MI and FMI. To find out where these differences occurred, post hoc tests were performed on MI and FMI. The p-value threshold was again set to 0.05, and the post hoc test results are shown in Table 5. All values less than the threshold are shown in bold. It can be seen that there are significant differences between our method and the other methods in terms of MI and FMI. Moreover, the boxplots of the statistical results are shown in Figure 13. In terms of MI and FMI, the results obtained by our method have larger values and more concentrated distributions. In terms of $Q^{AB/F}$ and SD, our method has a slight advantage: it achieves slightly larger values, and the distribution is similar to those of the other methods. According to the statistical results and the boxplots, it can be concluded that the proposed method obtains significantly better results than the other methods for MI and FMI and slightly better results for $Q^{AB/F}$ and SD. In other words, the proposed method outperforms most of the existing fusion methods.
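For readers unfamiliar with the test, the F statistic behind a one-way ANOVA can be computed in a few lines. The sketch below is illustrative only: the scores are hypothetical numbers, not values from Tables 4 or 5, and the p-value step (which needs the F distribution) is left out.

```python
import numpy as np

def one_way_anova_f(groups):
    """Return the F statistic and the two degrees of freedom."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the samples around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return float(f_stat), df_between, df_within

# Hypothetical per-image metric scores for three fusion methods.
scores = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_stat, df_b, df_w = one_way_anova_f(scores)
print(f_stat, df_b, df_w)
```

A large F value means the between-method variation dominates the within-method variation. In practice the p-value can be obtained from a statistics library, for example `scipy.stats.f_oneway`, which returns both the statistic and the p-value.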

Comparison of Computational Cost
To evaluate the computational cost of these methods, we measure the running time of each method. Table 6 shows the average running time over all the test images in the "Lytro" dataset. It can be seen that the SR-based methods (namely SR, NSCT-SR, CSR and SRGF) require more running time than the other methods. This is because calculating the sparse coefficients is computationally expensive. However, as mentioned above, the proposed method achieves promising results. Besides, by using parallel computing with two threads and four threads, the running time is effectively reduced. This demonstrates that there is much room for improvement. On the one hand, we think it is tolerable to sacrifice a little time for a promising improvement. On the other hand, with the development of parallel computing and the wide use of the graphical processing unit (GPU), the time cost can be expected to decrease. In our future work, we will further accelerate our method by using a GPU, which has many more cores than a CPU, to train the dictionary and to calculate the sparse coefficients.

Conclusions
In this paper, a novel multi-focus image fusion method is proposed. The proposed method utilizes sparse coefficients to produce focus feature maps, and the guided filter is used to generate an initial decision map and to refine it. The decision map obtained by our method separates focused regions from defocused regions precisely. Compared to traditional SR-based methods, the proposed method avoids the block effect and produces an edge-preserving fusion result. The experiments demonstrate that the proposed method outperforms other popular approaches and is competitive with state-of-the-art image fusion methods.