A Novel Multi-Focus Image Fusion Method Based on Stochastic Coordinate Coding and Local Density Peaks Clustering



Introduction
High-quality images are widely used in many areas of a highly developed society. Following the development of cloud computing, more and more images are processed in the cloud [1,2]. High-quality images can increase the accuracy and efficiency of image processing. Due to the limited depth of field of most optical lenses, only objects within a certain distance of the camera can be captured in sharp focus, while other objects are out of focus and blurred. It usually takes multiple images of the same scene to enhance the robustness of image processing. However, viewing and analyzing a series of images separately is neither convenient nor efficient [3]. Multi-focus image fusion is an effective way to resolve this problem: it combines complementary information from multiple images into a single fused image, which is useful for human or machine perception [4,5].
During the past few years, many image fusion algorithms have been developed to integrate multi-focus images. In general, multi-focus fusion algorithms can be classified into two groups: spatial-domain fusion and transform-domain fusion [3,6-11]. Spatial-domain methods need only the spatial information of the images to carry out fusion, without any transformation. Their main principle is to select the pixels or regions with higher clarity, according to an image clarity measurement, to construct the fused image. Energy of Laplacian [8,12] and spatial frequency [3,6,11] are two typical focus measures used to decide the clarity of pixels or regions. The main limitations of spatial-domain fusion methods are the misalignment of the decision map along the boundaries of focused objects, and incorrect decisions in locating sub-regions of focused or out-of-focus regions. To reduce these limitations, some spatial-domain techniques fuse source images with a weighted average of pixel values instead of a binary decision [7]. Depending on how the weights are constructed, however, these methods may lead to blurred edges, decreased contrast, and reduced sharpness [6].
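To make the spatial frequency focus measure concrete, the following is a minimal sketch of its standard definition (RMS of horizontal and vertical first differences); the function name and block interface are illustrative, not the paper's code.

```python
import numpy as np

def spatial_frequency(block):
    """Spatial frequency of an image block: a common focus measure.

    Higher values indicate sharper (in-focus) content. This is a
    generic sketch of the standard definition, not the paper's code.
    """
    b = block.astype(np.float64)
    # Row frequency: RMS of horizontal first differences.
    rf = np.sqrt(np.mean(np.diff(b, axis=1) ** 2))
    # Column frequency: RMS of vertical first differences.
    cf = np.sqrt(np.mean(np.diff(b, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)
```

A spatial-domain method would compute this per block of each source image and keep the block with the larger value.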
In contrast, transform-domain fusion methods first convert the source images into a transform domain to obtain the corresponding transform coefficients. The transformed coefficients are then merged according to a pre-defined fusion rule. Finally, the fused image is constructed by applying the inverse transform to the fused coefficients. The most commonly used transform-domain fusion methods are based on multi-scale transforms (MST), including the discrete wavelet transform, gradient pyramid, dual-tree complex wavelet transform, and so on. Recently, some novel transform-domain analysis methods have been proposed, such as the curvelet transform [13] and the nonsubsampled contourlet transform [9]. Although multi-scale transform coefficients can reasonably represent the important features of an image, each transform has its own merits and limitations depending on the content of the input images. Thus, it is difficult to select an optimal transform basis without a priori knowledge [14,15].
In recent years, sparse-representation based methods, a subset of transform-domain fusion methods, have been applied to image fusion. Different from other MST methods, sparse-representation based methods usually use learned bases, which adapt to the input images without a priori knowledge. Due to this adaptive learning, sparse representation is an effective way to describe and reconstruct images and signals. It is widely applied to image denoising [16], image deblurring [17], image inpainting [18], super-resolution [19], and image fusion [20]. Yang and Li [21] first applied sparse representation theory to the image fusion field and proposed a multi-focus image fusion method with an MST dictionary. Li and Zhang applied morphologically filtered sparse features to a matrix decomposition method to improve the accuracy of sparse representation in multi-focus image fusion [20]. Wang and Liu proposed an approximate K-SVD based sparse representation method for multi-focus fusion and exposure fusion, reducing the computational cost of sparse-representation based image fusion [22]. Nejati and Samavi proposed K-SVD dictionary-learning based sparse representation for the decision-map construction of multi-focus fusion [6]. However, the aforementioned sparse-representation based methods suffer from the high computational cost of dictionary learning, as in K-SVD and online dictionary learning. In recent years, many researchers have been devoted to speeding up dictionary learning for image fusion. Zhang and Fu [23] proposed a joint sparse-representation-based image fusion method with lower complexity than K-SVD, but it still required a substantial amount of computation. Kim and Han [14] proposed a joint-clustering-based dictionary construction method for image fusion, which used K-means clustering to group the image patches before dictionary learning. K-means requires the number of cluster centers to be specified before clustering; in most cases, however, this number is difficult to estimate accurately in advance.
This paper proposes a novel stochastic coordinate coding (SCC)-based image fusion framework integrated with local density peaks clustering. The proposed multi-focus image fusion framework consists of three steps. First, a local density peaks clustering method is applied to cluster image patches. The local density peaks based algorithm increases the accuracy of clustering and does not need any preset value for the input image data. Second, an SCC-based dictionary construction approach is proposed. The constructed dictionary not only obtains accurate descriptions of the input images, but also dramatically decreases the cost of dictionary learning. Finally, the trained dictionary is used for the sparse representation of image patches, and the max-L1 rule is applied in the image fusion process. The key contributions of this paper are as follows:
1. An integrated sparse representation framework for multi-focus image fusion is proposed that combines local density peaks based image-patch clustering and stochastic coordinate coding.
2. An SCC-based dictionary construction method is proposed and applied to the sparse representation process, which obtains a more accurate dictionary and decreases the computational cost of dictionary learning.
The rest of this paper is structured as follows: Section 2 presents and specifies the proposed framework; Section 3 simulates the proposed solutions and analyzes experiment results; and Section 4 concludes this paper.

Introduction of Framework
The proposed framework for image fusion, shown in Figure 1, has three main steps. In the first step, all image patches are clustered into different groups. Then a sub-dictionary is learned for each image-patch group using the SCC algorithm [24], and these sub-dictionaries are combined into an integrated dictionary. Finally, the learned dictionary is used for image fusion. The details of each algorithm and method are explained in the following sections.

Local Density Peaks Clustering
An image usually consists of different types of image patches. It is efficient to describe the underlying structure of each type of patch with a specific sub-dictionary. This paper uses the local density peaks clustering method to classify image patches into groups by structural similarity [25,26]. Compared with other existing clustering methods, local density peaks clustering has two advantages. First, the method is insensitive to the starting point (or initialization). Second, it does not need to know the number of clusters before clustering. Moreover, the basis of local density peaks clustering can be easily expressed by the Euclidean distance between two different patches.
In Figure 2a, the local density ρ_i of each image patch i is calculated by Equation (1). A distance δ_i of each image patch is then measured to find the cluster centers, as shown in Equation (2): δ_i is the minimum distance between image patch i and any other patch j with higher density. For the patch with the highest density, δ_i = max(d_ij). A local density map can be constructed with ρ_i on the x-axis and the normalized δ_i on the y-axis (0 ≤ δ_i ≤ 1), as shown in Figure 2b. Patches whose δ_i value is anomalously large are recognized as cluster centers; they are boxed by dotted squares in Figure 2b. Once the cluster centers are identified, each remaining image patch is assigned to the nearest identified center.
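The ρ_i/δ_i computation and the center assignment above can be sketched as follows. This is an illustrative implementation under two assumptions: d_c is set to the median pairwise distance (as in the text), and centers are picked automatically as the patches with the largest ρ_i·δ_i product, whereas the paper selects anomalously large δ_i from the density map by inspection.

```python
import numpy as np

def density_peaks(patches, n_centers=2):
    """Sketch of local density peaks clustering on flattened patches.

    patches: (n, d) array, one flattened image patch per row.
    Returns a label (the index of its cluster center) for each patch.
    """
    n = patches.shape[0]
    # Pairwise Euclidean distances d_ij.
    d = np.linalg.norm(patches[:, None, :] - patches[None, :, :], axis=2)
    # Cutoff distance d_c: median of all pairwise distances (per the paper).
    d_c = np.median(d[np.triu_indices(n, k=1)])
    # Local density rho_i: number of patches closer than d_c (excluding self).
    rho = np.sum(d < d_c, axis=1) - 1
    # delta_i: minimum distance to any patch with strictly higher density;
    # for the highest-density patch, delta_i = max(d_ij).
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    # Assumption: take the n_centers patches with the largest rho*delta.
    centers = np.argsort(rho * delta)[-n_centers:]
    # Assign every remaining patch to its nearest identified center.
    labels = centers[np.argmin(d[:, centers], axis=1)]
    return labels, centers
```

On well-separated patch groups this recovers the grouping without iterative re-initialization, which is the insensitivity to starting points noted above.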

Dictionary Construction
In the clustering step, image patches with similar structure are classified into a few groups. To construct a more discriminative and compact dictionary, the SCC online dictionary learning algorithm [24] is used to learn a sub-dictionary for each cluster. The learned sub-dictionaries are then combined into a new dictionary for sparse image representation and restoration. The dictionary construction process is illustrated as the dictionary learning step in Figure 1.

Sub-Dictionary Learning Approach
The SCC online dictionary learning algorithm [24], shown in Algorithm 1, extracts eigenvalues from each cluster and builds the corresponding sub-dictionary. The dictionary and sparse codes are initialized as D_1^1 and z_i^0 = 0, i = 1, 2, ..., n, respectively, with learning rate η_1^1 = 1. The general expression of the sparse code is z_i^k, i = 1, 2, ..., n, k = 1, 2, ..., m; the number of epochs and the index of data points are represented by the superscript k and subscript i respectively. The algorithm acquires an image patch x_i, starting from k = 1 and i = 1. The sparse code z_i^k is updated by a few steps of coordinate descent (CD), where the j-th coordinate is updated with the soft-threshold shrinkage function h_λ [27,28] and a descent parameter b_j that can be calculated by Equation (5). One updating cycle is equivalent to one step of coordinate descent. The dictionary D is then updated by stochastic gradient descent (SGD), where P denotes the projection operator onto the feasible set B_m of D. The learning rate is an approximation of the inverse of the Hessian matrix, and the gradient with respect to D_i^k is obtained accordingly. The process then repeats for the next patch. The calculation stops when k > m, where m is a preset value, usually 10 ≤ m ≤ 15. SCC runs only a few steps of CD to update the sparse codes, and SGD is used to update the dictionary.
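A single SCC iteration for one patch can be sketched as below. The step size eta, the sparsity weight lam, and the number of CD passes are placeholder values, not the paper's schedule, and the feasible set is taken to be unit-norm atoms; the function is an illustration of the CD-then-SGD structure, not the authors' implementation.

```python
import numpy as np

def soft_threshold(v, lam):
    """h_lambda: soft-threshold shrinkage operator."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def scc_step(D, x, z, lam=0.1, eta=0.1, cd_steps=3):
    """One SCC iteration for a single patch x (illustrative sketch).

    A few coordinate-descent passes update the sparse code z, then one
    stochastic-gradient step updates D, whose columns are projected
    back onto the unit ball (standing in for the feasible set B_m).
    """
    # Coordinate descent on z: cycle through the coordinates a few times.
    for _ in range(cd_steps):
        for j in range(D.shape[1]):
            # Residual with atom j's current contribution removed.
            r = x - D @ z + D[:, j] * z[j]
            z[j] = soft_threshold(D[:, j] @ r, lam)  # b_j = 1 for unit atoms
    # SGD on D: the gradient of 0.5*||x - Dz||^2 w.r.t. D is -(x - Dz) z^T.
    D = D + eta * np.outer(x - D @ z, z)
    # Projection P: rescale any column whose norm exceeds 1.
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)
    return D / norms, z
```

Streaming patches through this update is what keeps SCC cheap: each patch costs a few CD passes plus one rank-one dictionary update, instead of a full batch factorization as in K-SVD.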
All sub-dictionaries D_1, D_2, ..., D_n are learned using SCC. These sub-dictionaries describe the underlying structure of each image-patch cluster.

Sub-Dictionary Combination
As a sub-dictionary for each cluster is learned, all sub-dictionaries are combined into a new dictionary Φ.

Fusion Scheme
The fusion scheme is shown in Figure 1, and the image fusion algorithm is given in Algorithm 2. The learned dictionary is used to estimate the coefficient vectors: for each image patch p_i, a coefficient vector z_i is estimated by the SOMP algorithm using the learned dictionary. The max-L1 rule [21] is applied for coefficient fusion, as shown in Equation (17), where z_i is the fused coefficient vector, ||·||_1 is the l_1 norm, and * is an element-wise multiplication operation.

Algorithm 1 takes as input an image patch x_i, a sparse code z_i^k, a learning rate η_1^1, and a running time m. It updates the j-th coordinate z_{i,j}^{k-1} of z_i^{k-1} according to Equations (4) and (5), and then updates the dictionary.
The fused coefficient vectors are restored to an image. The restoring process is based on Equation (10), where {z_1, ..., z_m} corresponds to the image patches of the fused image and D is the learned dictionary.
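The max-L1 fusion and restoration steps can be sketched together as follows. This is a minimal sketch assuming coefficient matrices with one column per patch; names and the interface are illustrative, and the sparse codes are assumed to come from SOMP with the learned dictionary as described above.

```python
import numpy as np

def fuse_and_restore(D, za, zb):
    """Max-L1 coefficient fusion and patch restoration (sketch).

    za, zb: (k, n) sparse coefficient matrices of the two source
    images, one column per patch. For each patch the coefficient
    vector with the larger l1 norm is kept, and the fused patches
    are restored as D @ z (Equation (10)).
    """
    # Column-wise: keep the coefficient vector with the larger l1 norm.
    keep_a = np.abs(za).sum(axis=0) >= np.abs(zb).sum(axis=0)
    z_fused = np.where(keep_a[None, :], za, zb)
    # Restore the fused image patches from the fused sparse codes.
    return D @ z_fused, z_fused
```

The l1 norm of a patch's sparse code acts as its activity level, so the rule keeps, patch by patch, whichever source image is more in focus there.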

Experiments and Analyses
The proposed multi-focus image fusion method is applied to standard multi-focus images from a public website [29]. All standard multi-focus images used in this paper are free to use for research purposes. The images from the image fusion library have a size of 256 × 256 or 320 × 240 pixels. The fused images are evaluated by comparing them with the fused images of other existing methods. In this paper, four pairs of images are used as a case study to simulate the proposed multi-focus image fusion method. To simulate a real-world environment, the four image pairs cover two kinds of scene. One is an outdoor scene, such as the hot-air balloon and leopard shown in Figure 3a,b respectively. The other is an indoor scene, such as the lab and bottle shown in Figure 3c,d respectively. These four pairs of original images are from the same sensor modality. Since each image focuses on a different object, there are two images for each scene, and the out-of-focus regions in the original images are blurred.

Edge Intensity
The quality of the fused image is measured by the local edge intensity L of image I [38]. A Gaussian kernel G is convolved with the image I to obtain a smoothed image, and the edge intensity image is obtained by subtracting the smoothed image from the original image. The spectrum of the edge intensities depends on the width of the Gaussian kernel G.
The fused image H is evaluated from the images L_j, j = 1, ..., n, using the weighted average of the local edge intensities.
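The smooth-and-subtract step can be sketched as below. The kernel width sigma and the mean-absolute-value summary are assumptions for illustration; the paper does not state the kernel size or the exact weighting it used.

```python
import numpy as np

def _gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalized to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def edge_intensity(image, sigma=1.0):
    """Edge intensity via Gaussian smoothing and subtraction (sketch).

    Smooth the image with a separable Gaussian kernel G, subtract
    the smoothed image from the original, and summarize the residual
    by its mean absolute value.
    """
    img = image.astype(np.float64)
    k = _gaussian_kernel(sigma)
    # Separable convolution: filter rows, then columns (same size).
    smoothed = np.apply_along_axis(
        lambda r: np.convolve(r, k, mode="same"), 1, img)
    smoothed = np.apply_along_axis(
        lambda c: np.convolve(c, k, mode="same"), 0, smoothed)
    return np.mean(np.abs(img - smoothed))
```

A sharper fused image leaves a larger high-frequency residual after smoothing, so a higher score indicates stronger edges.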

Mutual Information
The mutual information (MI) metric measures how much information the fused image carries from the source images, where L is the number of gray levels, h_{A,F}(i, j) is the joint gray-level histogram of images A and F, and h_A(i) and h_F(j) are the histograms of images A and F. For a fused image, the MI can be calculated by Equation (15),
where MI(A, F) represents the MI value between input image A and fused image F; MI(B, F) represents the MI value of input image B and fused image F.
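A generic implementation of MI from the joint histogram, and of the fusion metric combining MI(A, F) and MI(B, F), is sketched below. The 256-level binning and the sum MI(A, F) + MI(B, F) are assumptions based on the standard form of this metric.

```python
import numpy as np

def mutual_information(a, f, levels=256):
    """Mutual information between a source image A and fused image F.

    Computed from the joint gray-level histogram h_{A,F} and the
    marginal histograms h_A and h_F (standard definition, in bits).
    """
    joint, _, _ = np.histogram2d(a.ravel(), f.ravel(), bins=levels,
                                 range=[[0, levels], [0, levels]])
    p_af = joint / joint.sum()      # joint distribution
    p_a = p_af.sum(axis=1)          # marginal of A
    p_f = p_af.sum(axis=0)          # marginal of F
    nz = p_af > 0                   # avoid log(0)
    return np.sum(p_af[nz] * np.log2(p_af[nz] / np.outer(p_a, p_f)[nz]))

def fusion_mi(a, b, f):
    """MI-based fusion metric: MI(A, F) + MI(B, F)."""
    return mutual_information(a, f) + mutual_information(b, f)
```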

Q AB/F
The Q^{AB/F} metric is a gradient-based quality index that measures how well the edge information of the source images is transferred to the fused image [34]. It is calculated from Q^{AF} = Q_g^{AF} Q_0^{AF}, where Q_g^{AF} and Q_0^{AF} are the edge strength and orientation preservation values at location (i, j); Q^{BF} is computed similarly. w_A(i, j) and w_B(i, j) are the importance weights of Q^{AF} and Q^{BF} respectively.
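A simplified sketch of this metric is given below: edge strength and orientation come from Sobel gradients, preservation is approximated by a magnitude ratio and an angle difference, and the source gradient magnitudes serve as the weights w_A and w_B. The sigmoid-shaped preservation functions and fitted constants of the original Xydeas-Petrovic metric are omitted, so the values are only indicative.

```python
import numpy as np

def _sobel(img):
    """Sobel gradient magnitude g and orientation alpha (naive loop)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    pad = np.pad(img.astype(float), 1, mode="edge")
    h, w = img.shape
    sx, sy = np.zeros((h, w)), np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            win = pad[i:i + 3, j:j + 3]
            sx[i, j] = np.sum(win * kx)
            sy[i, j] = np.sum(win * ky)
    return np.hypot(sx, sy), np.arctan2(sy, sx)

def q_abf(a, b, f):
    """Simplified sketch of the Q^{AB/F} edge-preservation metric."""
    ga, aa = _sobel(a); gb, ab = _sobel(b); gf, af = _sobel(f)
    def preserve(gs, als):
        # Strength preservation: smaller over larger gradient magnitude.
        qg = np.minimum(gs, gf) / (np.maximum(gs, gf) + 1e-12)
        # Orientation preservation: 1 at equal angles, 0 beyond 90 degrees.
        qo = np.clip(1.0 - np.abs(als - af) / (np.pi / 2), 0.0, 1.0)
        return qg * qo
    qaf, qbf = preserve(ga, aa), preserve(gb, ab)
    wa, wb = ga, gb  # importance weights w_A, w_B
    return np.sum(qaf * wa + qbf * wb) / (np.sum(wa + wb) + 1e-12)
```

When the fused image reproduces the source gradients exactly, the score approaches 1; lost or distorted edges pull it toward 0.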

Visual Information Fidelity
Visual information fidelity (VIF) is a full-reference image quality metric. VIF quantifies the mutual information between the reference and test images based on natural scene statistics (NSS) theory and a human visual system (HVS) model. It can be expressed as the ratio between the distorted test image information and the reference image information; the calculation of VIF is shown in Equation (17).
An average of the VIF values between each input image and the fused image is used to evaluate the fused image. The evaluation function of VIF for image fusion is shown in Equation (18) [37].
where V IF(A, F) is the VIF value between input image A and fused image F; V IF(B, F) is the VIF value between input image B and fused image F.

Image Quality Comparison
To show the efficiency of the proposed method, the quality of the fused images is compared. Four pairs of multi-focus images, hot-air balloon, leopard, lab, and bottle, are employed for the comparison. The quality of the fused images is compared in terms of visual effect, the accuracy of focused-region detection, and objective evaluations. Difference images are used to show the differences between the fused images and the corresponding source images. The contrast and brightness of the difference images are increased for printing purposes; all difference images are adjusted with the same parameters.
In the first comparison experiment, the "hot-air balloon" images are a pair of multi-focus images, shown in Figure 4a,b. In Figure 4a, the biggest hot-air balloon on the bottom left is out of focus, while the rest of the hot-air balloons are in focus. In contrast, in Figure 4b, the biggest hot-air balloon is in focus, but the rest of the balloons are out of focus. LE, DWT, DT-CWT, CVT, NSCT, SR-DCT, SR-KSVD, and the proposed method are each employed to merge the two multi-focus images into a clear one. The corresponding fusion results are shown in Figure 4c-j. The difference images of each method are obtained by subtracting the source images in Figure 4a,b from the fused images; the corresponding results are shown in Figure 5a-h and Figure 5i-p respectively.
Figure 5a-h are the difference images between Figure 4a and Figure 4c-j, and Figure 5i-p are the difference images between Figure 4b and Figure 4c-j.
The SR-DCT result in Figure 4h contains a lot of noise, while the rest of the fused images in Figure 4 are similar. The difference images in Figure 5a-c show that LE, DWT, and DT-CWT do not capture all the edge information of the hot-air balloons on the left side.
Similarly, Figure 5i-k shows that the biggest hot-air balloon on the bottom left is not fully focused in the LE, DWT, and DT-CWT results. Due to this misjudgement of focused areas, the fused "hot-air balloon" images of LE, DWT, DT-CWT, and SR-DCT have shortcomings. Compared with the source images, the remaining methods perform well in identifying the focused area. To further compare the quality of the fused images, objective metrics are used.
Table 1 shows the objective evaluations. Compared with the rest of the image fusion methods, the proposed method SR-SCC achieves the largest values of MI and VIF. LE and DT-CWT achieve the largest values of EI and Q^{AB/F} respectively, but they make inaccurate decisions in detecting the focused region. According to the quality of the fused image, the accuracy of locating the focused region, and the objective evaluations, the proposed method has the best overall performance among all eight methods in the "hot-air balloon" scene. Similarly, the source images of the other three comparison experiments, "leopard", "lab", and "bottle", are shown in Figures 6, 7 and 8a,b respectively. Objective metrics of the multi-focus "leopard", "lab", and "bottle" fusion experiments are shown in Tables 2-4 respectively to evaluate the quality of the fused images.
• Multi-focus "leopard" fusion: The proposed method SR-SCC achieves the largest values of MI and VIF. LE obtains the largest value of the EI index, but it makes inaccurate decisions in detecting the focused region. SR-KSVD performs well in Q^{AB/F}, and the result of the proposed method is only 0.0002 smaller than SR-KSVD. According to the visual quality, the accuracy of the focused region, and the objective evaluations, the proposed method does a better job than the rest of the methods.
• Multi-focus "lab" fusion: The proposed method SR-SCC achieves the largest values of Q^{AB/F} and VIF. DWT obtains the largest value of the EI index, but it cannot distinguish the correct focused areas. SR-KSVD has the best performance in MI. The proposed method and SR-KSVD both perform well in the visual effect of the focused area, in distinguishing the focused area, and in the objective evaluations. Compared with SR-KSVD, however, the proposed method dramatically reduces the computational cost of dictionary construction, so it has the best overall performance among all comparison methods.
• Multi-focus "bottle" fusion: DWT obtains the largest value of EI, but it does not detect the focused area accurately. The proposed method SR-SCC achieves the largest values of the remaining objective evaluations, so it has the best overall performance in the "bottle" scene.

Dictionary Construction Time Comparison
As shown in the previous subsection, the fused images of different multi-focus fusion methods are compared by objective evaluations. Dictionary-learning based image fusion methods, including SR-KSVD and the proposed SCC method, achieve the best performance. However, the dictionary construction process usually takes a very long time, so the efficiency of dictionary construction is an important feature of an image fusion method. Both K-SVD [39] and the proposed SCC are sparse-representation based dictionary learning methods, so the dictionary construction times of K-SVD and the proposed SCC are compared. K-SVD is one of the most popular dictionary learning methods of recent years; it uses an iterative algorithm to reduce dictionary learning errors and can describe the underlying structure of an image very well. To verify the low computational cost of the proposed method, four pairs of images are used to test the computation time. The time consumption of K-SVD and SCC is shown in Figure 12 and Table 5. SCC achieves the lower computation time, marked in bold, in all four groups of experiments. The experimental results demonstrate that SCC requires much less computation time than K-SVD.

Conclusions
This paper proposed an integrated image fusion framework based on online dictionary learning. Compared with traditional image fusion methods, the integrated framework has two major improvements. First, it introduces a local density based clustering method into sparse representation, which clusters well without any a priori knowledge. Second, an online dictionary learning algorithm is used to extract discriminative features, which enhances the efficiency of image fusion. The proposed SR-SCC method was compared with seven existing algorithms (LE, DWT, DT-CWT, CVT, NSCT, SR-DCT, and SR-KSVD) using four source image pairs and objective metrics. Experimental results demonstrated that the proposed method is superior to the other methods in terms of both subjective and objective evaluation, meaning that its fused images have better quality. Compared with other sparse-representation based methods, the proposed method generates fused images with high efficiency.
Although the proposed solution performs well in image fusion, several optimizations are worth pursuing in follow-up research. Parallel processing and the use of multiple graphics processing units (GPUs) will be considered to improve the efficiency of the proposed solution. Denoising techniques will also be applied to enhance the quality of the fused image.
d_ij is the Euclidean distance between image patches i and j. d_c is a cutoff distance, usually set to the median value of d_ij. Basically, ρ_i equals the number of patches that are closer than d_c to patch i. The clustering algorithm is only sensitive to the relative magnitude of ρ_i across different image patches, and is robust with respect to the choice of d_c.

Figure 2. Local Density and Distance Map of Fusion Images, (a) shows the local density calculation of each image patch; (b) shows the constructed local density map.

Figure 3. Four Source Image Pairs of Different Scenes for Multi-focus Fusion Experiments, (a)-(d) are source image pairs of hot-air balloon, leopard, lab, bottle respectively.
In Equation (17), the numerator and denominator are the mutual information extracted from a particular subband of the reference and the test images respectively; C_N denotes N elements from a random field, and E_N and F_N are the visual signals at the output of the HVS model for the reference and the test images respectively.

Figure 5. Difference Images of "Hot-air Balloon" by Different Methods. (a-h) are the difference images between the source image in Figure 4a and the fused images of LE, DWT, DT-CWT, CVT, NSCT, SR-DCT, SR-KSVD, and the proposed SR-SCC in Figure 4c-j respectively; (i-p) are the difference images between the source image in Figure 4b and the corresponding fused images in Figure 4c-j respectively.
In each set of source images, the two images (a) and (b) focus on different items. The source images are fused by LE, DWT, DT-CWT, CVT, NSCT, SR-DCT, SR-KSVD, and the proposed method to obtain a fully focused image, and the corresponding fusion results are shown in Figures 6, 7 and 8c-j respectively. The differences between the fused images and their corresponding source images are shown in Figures 9, 10 and 11a-h,i-p respectively.

Figure 9. Difference Images of "Leopard" by Different Methods. (a-h) are the difference images between the source image in Figure 6a and the fused images of LE, DWT, DT-CWT, CVT, NSCT, SR-DCT, SR-KSVD, and the proposed SR-SCC in Figure 6c-j respectively; (i-p) are the difference images between the source image in Figure 6b and the corresponding fused images in Figure 6c-j respectively.

Figure 10. Difference Images of "Lab" by Different Methods. (a-h) are the difference images between the source image in Figure 7a and the fused images of LE, DWT, DT-CWT, CVT, NSCT, SR-DCT, SR-KSVD, and the proposed SR-SCC in Figure 7c-j respectively; (i-p) are the difference images between the source image in Figure 7b and the corresponding fused images in Figure 7c-j respectively.

Figure 11. Difference Images of "Bottle" by Different Methods. (a-h) are the difference images between the source image in Figure 8a and the fused images of LE, DWT, DT-CWT, CVT, NSCT, SR-DCT, SR-KSVD, and the proposed SR-SCC in Figure 8c-j respectively; (i-p) are the difference images between the source image in Figure 8b and the corresponding fused images in Figure 8c-j respectively.

Table 1. Objective Evaluations of Multi-focus "Hot-air Balloon" Fusion Experiments.

Table 3. Objective Evaluations of Multi-focus "Lab" Fusion Experiments.

Table 5. Time Consumption Comparison.