Conditional Random Field-Guided Multi-Focus Image Fusion

Multi-focus image fusion is of great importance in order to cope with the limited Depth-of-Field of optical lenses. Since input images contain noise, multi-focus image fusion methods that support denoising are important. Transform-domain methods have been applied to image fusion; however, they are likely to produce artifacts. In order to cope with these issues, we introduce the Conditional Random Field (CRF)-Guided fusion method. A novel Edge Aware Centering method is proposed and employed to extract the low and high frequencies of the input images. The Independent Component Analysis (ICA) transform is applied to the high-frequency components, and a CRF model is created from the low frequencies and the transform coefficients. The CRF model is solved efficiently with the α-expansion method. The estimated labels are used to guide the fusion of the low-frequency components and the transform coefficients. Inverse ICA is then applied to the fused transform coefficients. Finally, the fused image is the addition of the fused low frequency and the fused high frequency. CRF-Guided fusion does not introduce artifacts during fusion and supports image denoising during fusion by applying transform-domain coefficient shrinkage. Quantitative and qualitative evaluation demonstrates the superior performance of CRF-Guided fusion compared to state-of-the-art multi-focus image fusion methods.


Introduction
The limited Depth-of-Field of optical lenses allows only the parts of the scene that lie within a certain distance from the camera sensor to be captured in focus at a time, while the remaining parts of the scene stay out-of-focus or blurred. Multi-focus image fusion algorithms are thus of vital importance in order to cope with this limitation. Multi-focus image fusion methods merge multiple input images captured with different focus settings into a single image with extended Depth-of-Field. More precisely, the well-focused pixels of the input images are preserved in the fused image and the out-of-focus pixels of the input images are discarded. Consequently, the fused image should have extended Depth-of-Field, and thus more information than each one of the input images, without artifacts introduced during fusion.
The problem of multi-focus image fusion has been explored widely in the literature. Lately, a number of multi-focus image fusion methods have been proposed. Liu et al. [1] classified multi-focus image fusion methods into four categories: spatial-domain methods, transform-domain methods, combined methods, and deep learning methods. In spatial-domain methods, the fused image is estimated as the weighted average of the input images. Spatial-domain methods are further classified as block-based, region-based, and pixel-based. In block-based methods, the image is decomposed into blocks of fixed size, and the activity level is estimated individually for each of these blocks.
However, since blocks are likely to contain both well-focused and out-of-focus pixels, block-based methods are likely to produce blocking artifacts near the boundaries of well-focused and out-of-focus pixels. Thus, the fused image has lower quality near these boundaries. Region-based methods use a whole region of irregular shape in order to estimate the saliency of the included pixels. Although region-based methods provide higher flexibility than block-based methods, a region may also contain both well-focused and out-of-focus pixels simultaneously. As a result, region-based methods also produce artifacts and yield lower fused image quality near the boundaries of well-focused and out-of-focus pixels. In order to overcome these issues, pixel-based methods have lately gained more popularity. In these methods, activity level estimation is carried out at the pixel level. Pixel-based methods do not suffer from blocking artifacts and have better accuracy near the boundary of well-focused and out-of-focus pixels; however, they are likely to produce noisy weight maps, which also lead to fused images of lower quality. Popular spatial domain-based multi-focus image fusion methods include: Quadtree [2], Boundary Finding [3], Dense SIFT [4], guided filtering [5], PCNN [6], and Image Matting [7]. Singh et al. [8] used the Arithmetic Optimization Algorithm (AOA) in order to estimate the weight maps for image fusion, which were refined with weighted least-squares (WLS) optimization. The fused image is extracted through pixel-wise weighted average fusion. In [9], the fusion method FNMRA was presented, which used the modified naked mole-rat algorithm (mNMRA) in order to generate the weight maps, which were refined with weighted least-squares optimization. Pixel-wise single-scale composition was used in order to create the fused image.
In transform-domain methods, a forward transform is first applied to the input images. A fusion rule is then applied in order to combine the transform coefficients. Finally, an inverse transform is applied to the fused coefficients in order to return the fused image to the spatial domain. An advantage of dictionary-based transform-domain methods is the support of image denoising during fusion by applying shrinkage methods, such as [10], which can be used to remove the noisy transform-domain coefficients. An issue of transform-domain methods lies in the imperfect forward-backward transforms that result in visible artifacts due to the Gibbs phenomenon. Since both the selection of the transform domain and the manual design of the fusion rule highly impact the quality of the fused image, a number of transform domain-based multi-focus image fusion methods have been introduced. Typical transform domain-based multi-focus image fusion methods include: ICA [11], ASR [12], CSR [13], NSCT [14], NSCT-SR [15], MWGF [16], and DCHWT [17]. Qin et al. [18] proposed a new image fusion method combining the discrete wavelet transform (DWT) and sparse representation (SR). Jagtap et al. [19] introduced information preservation-based guided filtering in order to decompose the input images into base and detail images. Low-rank representation was used in order to estimate the focus map and perform fusion of the detail images. In [20], the authors used weight maps based on local contrast, and the fused image was estimated with multi-scale weighted average fusion based on pyramid decomposition.
The methods that lie in the combined methods category employ the merits of both spatial-domain and transform-domain methods, although each method uses different domains. Bouzos et al. [21] combined the advantages of the ICA domain and the spatial domain. Chai et al. [22] combined the advantages of multi-scale decomposition and the spatial domain. He et al. [23] combined the Mean-shift algorithm and the NSCT domain. An issue of the aforementioned methods is that they do not support image denoising during fusion. Singh et al. [24] proposed the Discrete Wavelet Transform-bilateral filter (DWTBF) method, which combined the Discrete Wavelet Transform (DWT) and the bilateral filter. In [25], the authors combined a multi-resolution pyramid and the bilateral filter in order to predict the fused image.
Lately, deep learning-based methods have gained more popularity. According to the study [26], deep learning-based methods are classified into decision map-based methods and end-to-end methods. In decision map-based methods, the deep learning networks predict a decision map with a classification-based architecture. Post-processing steps, including morphological operations, are usually employed to refine the decision map. The decision map is later used to guide the fusion of the input images by selecting the respective pixels from the input images. Typical decision map-based deep learning multi-focus image fusion methods include: CNNFusion [27], ECNN [28], and p-CNN [29]. On the other hand, end-to-end deep learning-based networks directly predict the fused image without the intermediate step of the decision map. Typical end-to-end deep learning networks for multi-focus image fusion include: IFCNN [30] and DenseFuse [31]. Ma et al. [32] introduced a multi-focus image fusion method based on an end-to-end multi-scale generative adversarial network (MsGAN). Wei et al. [33] combined the advantages of sparse representation and CNN networks in order to estimate the fusion weights for the multi-focus image fusion problem. Since the sensitivity of the aforementioned deep learning-based methods to noise was not studied, these methods are likely to be sensitive to noise. In addition, these deep learning-based multi-focus image fusion methods do not support image denoising during fusion.
In this manuscript, we introduce CRF-Guided fusion, a novel transform domain-based method that uses the Conditional Random Field model in order to guide the fusion of the transform-domain ICA method. Due to various sources, input images are likely to contain noise; thus, multi-focus methods that are robust to noise and support denoising during fusion are of great importance. Since CRF-Guided fusion is a dictionary-based method (ICA), the method is robust to Gaussian noise and supports image denoising during fusion by applying the coefficient shrinkage method [10]. A novel Edge Aware Centering (EAC) method is also introduced and used instead of the typical centering method, alleviating artifacts caused by the centering procedure. The combination of EAC and the proposed CRF-Guided fusion method produces fused images of high quality, without introducing artifacts, for both clean images and images that contain Gaussian noise, while also supporting denoising during fusion.
The main contributions of this manuscript and improvements over our previous method [21] are:

1. the development of the novel EAC method, which preserves the strong edges of the input images, instead of the typical centering method;
2. the design of a novel framework, based on a CRF model, that is suitable for transform-domain image fusion;
3. the design of a novel transform-domain fusion method that produces fused images of high visual quality, preserves, via CRF optimization, the boundary between well-focused and out-of-focus pixels, and does not introduce artifacts during fusion;
4. the introduction of a novel transform-domain fusion rule, based on the labels extracted from the CRF model, that produces fused images of higher quality without the transform-domain artifacts;
5. the robustness of the proposed method against Gaussian noise and the support of denoising during fusion, by applying the transform-domain coefficient shrinkage method [10].

Proposed Method Description
The proposed framework of CRF-Guided fusion is summarised in Figure 1. An outline of the method is now provided. Firstly, Edge Aware Centering is applied to the input images in order to extract the low- and high-frequency components. The forward ICA transform is then applied to the high frequencies of the input images. The low-frequency components and the ICA coefficients are then used to compute the unary (U) and smoothness (V) potentials and thus construct the CRF model. The CRF model is solved efficiently with the α-expansion method based on GraphCuts [34]. The predicted labels are then employed to fuse the low frequencies, leading to the fused low-frequency image. In addition, they are also used to guide the fusion of the transform-domain ICA coefficients. Lastly, the inverse ICA transform is applied to the fused transform coefficients in order to return the fused high-frequency component. Finally, the fused image F is estimated by the addition of the fused low-frequency and the fused high-frequency components. More details of the aforementioned steps of the proposed framework are included in the following subsections. Figure 2 includes two source input images for multi-focus image fusion that will be used throughout the steps of CRF-Guided fusion.

Edge Aware Centering
In this section, we introduce the Edge Aware Centering (EAC) method, which is used instead of the typical centering method in order to estimate the low frequency of the multi-focus input images. EAC consists of a spatially varying Gaussian filter that preserves the strong edges of the input images. More precisely, $w_{i,j}$ is the weight at spatial location $(i, j)$, and $\mu_{i,j}$ is the mean value of a 7 × 7 block centered at pixel $(i, j)$; the overline operator denotes averaging over all $m, n$ values. The filtered image $f$ at spatial location $(i, j)$ is then estimated as the weighted combination defined by these terms. EAC is applied to both input images in order to estimate the low frequency of each image. Figure 3 includes the low-frequency images, as computed by applying the proposed EAC to the input images of Figure 2. It is evident that EAC accurately preserves the strong edges of the input images. By subtracting the low-frequency images from the input images, we extract the high-frequency images, as demonstrated in Figure 4. The forward ICA transform is then applied to the high-frequency images in order to obtain the transform-domain coefficients. For more information on the estimation of the ICA transform and its application to images for fusion, please refer to [11].
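To make the decomposition concrete, the following is a minimal sketch of an EAC-style low/high-frequency split. The weight expression here is an illustrative assumption, not the paper's exact formula; only the overall pattern (spatially varying Gaussian smoothing that leaves strong edges in the low-frequency image) follows the description above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def edge_aware_centering(img, block=7, sigma=2.0):
    """Sketch of an EAC-style decomposition (assumed weight form).

    Where the deviation from the 7x7 block mean is large (a strong edge),
    the Gaussian low-pass is suppressed so the edge stays in the
    low-frequency image, as EAC intends.
    """
    img = img.astype(np.float64)
    mu = uniform_filter(img, size=block)          # 7x7 block mean (mu_{i,j})
    dev = np.abs(img - mu)                        # local edge strength
    w = np.exp(-dev / (dev.mean() + 1e-8))        # assumed weight w_{i,j} in (0, 1]
    low = w * gaussian_filter(img, sigma) + (1.0 - w) * img
    high = img - low                              # high frequency = residual
    return low, high
```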

Energy Minimization
In order to model the multi-focus image fusion problem and solve it efficiently, we construct an energy minimization equation. Since graph-cut solvers can reach a global or close-to-global optimum solution, we formulate the energy minimization problem of multi-focus image fusion as a graph-cut problem. More precisely, we introduce the Conditional Random Field (CRF) equation that describes our multi-focus image fusion problem, which is solved efficiently with the α-expansion inference method, reaching a global or close-to-global optimum solution. The solution of the proposed energy minimization leads to the optimum labels of the decision map that is used to guide the fusion of the low frequency and the transform coefficients.
In order to guide the fusion of the low frequency and the transform coefficients, we formulate the Conditional Random Field (CRF) energy as follows:
$$E(\ell) = \sum_{i} U(\ell_i) + \sum_{(m, n) \in C} V(\ell_m, \ell_n), \quad (3)$$
where $\ell$ are the estimated labels, $U$ is the unary potential function, $V$ is the pairwise potential function, $i$ indexes spatial locations, and $m, n$ are adjacent pixels in the clique set $C$, which is the N8-neighborhood. The energy minimization equation is optimized using the α-expansion method, based on GraphCuts [34].
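For illustration, a minimal sketch of evaluating this energy over the N8-neighborhood. A Potts-style pairwise term with a scalar weight is assumed here purely for readability; the actual V is defined in the Smoothness Term section.

```python
import numpy as np

def crf_energy(labels, unary, w):
    """Evaluate E(l) = sum_i U_i(l_i) + sum_{(m,n) in C} V(l_m, l_n),
    assuming V(l_m, l_n) = w * [l_m != l_n] (Potts, for illustration).

    labels : (H, W) integer label map
    unary  : (H, W, num_labels) unary costs U
    """
    H, W = labels.shape
    e = np.take_along_axis(unary, labels[..., None], axis=2).sum()
    # N8 cliques: each unordered neighbor pair counted exactly once
    for dy, dx in [(0, 1), (1, 0), (1, 1), (1, -1)]:
        a = labels[0:H - dy, max(0, -dx):W - max(0, dx)]
        b = labels[dy:H, max(0, dx):W - max(0, -dx)]
        e += w * np.count_nonzero(a != b)
    return float(e)
```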

Inference α-Expansion Method
In α-expansion, the optimization problem is divided into a sequence of binary-valued minimization problems. Given a current label configuration h and a fixed label α ∈ L, with L being the set of all label values, each pixel i makes a binary decision in the α-expansion move: either retain its old label or change it to label α. The expansion move starts with an initial set of labels h_0 and then, in some fixed order, computes the optimal α-expansion moves for the labels α. Only the moves that decrease the energy are accepted.
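A sketch of the outer α-expansion loop follows. Here `expand` is a placeholder for the binary graph-cut subproblem, which in practice is solved exactly with a min-cut solver such as GraphCuts [34]; it is not a real library API.

```python
def alpha_expansion(labels, label_set, expand, energy):
    """Outer loop of alpha-expansion (sketch).

    expand(labels, alpha) -- placeholder binary solver returning the best
    one-expansion move (each pixel keeps its label or switches to alpha).
    energy(labels)        -- CRF energy E(l) to be minimized.
    """
    improved = True
    while improved:
        improved = False
        for alpha in label_set:                    # some fixed label order
            proposal = expand(labels, alpha)
            if energy(proposal) < energy(labels):  # accept only energy-decreasing moves
                labels, improved = proposal, True
    return labels
```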

Unary Potential Estimation
Let us assume that $x_1, x_2$ are the input images, $P_L$ is the probability of the low frequency, $P_H$ the probability of the high frequency, $P$ the probability of the input images, and $U$ the unary potential function. Figure 5 depicts the estimation of the unary potential. More precisely, EAC is first applied to the images to extract the low and high frequencies. The second Laplacian is applied to both low-frequency components, and the probability of the low frequency $P_L$ is estimated by
$$P_L(\ell_n = i) = \frac{S_i(n)}{S_0(n) + S_1(n)}, \quad i \in \{0, 1\},$$
where $S_0, S_1$ are the second Laplacians of the low frequencies of the first and the second image, respectively. The probability of the high frequency $P_H$ is extracted from the ICA coefficients and is estimated as
$$P_H(\ell_n = i) = \frac{C_i(n)}{C_0(n) + C_1(n)}, \quad i \in \{0, 1\},$$
where $C_0$ is the L2-norm of the ICA coefficients of the first image and $C_1$ is the L2-norm of the ICA coefficients of the second image. In order to determine the probability that each one of the input images $i$ should contribute to the spatial location $n$ of the guidance map, we compute the combined probability of the high and low frequencies for each image; we call this the probability of the input image that corresponds to label $\ell_n$. Thus, the probability of each input image $P(\ell_n)$ is estimated as
$$P(\ell_n = i) = P_L(\ell_n = i) \, P_H(\ell_n = i).$$
Finally, the unary potential function $U$ is estimated as the negative log-likelihood of the predicted probabilities:
$$U(\ell_n) = -\log P(\ell_n).$$
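A sketch of the unary computation under the equations above. Two assumptions are ours: the "second Laplacian" is interpreted as the Laplacian applied twice, and C0, C1 are taken as per-pixel maps of ICA coefficient L2-norms.

```python
import numpy as np
from scipy.ndimage import laplace

def unary_potentials(low0, low1, C0, C1, eps=1e-8):
    """Sketch of U = -log P, combining low- and high-frequency activity.

    low0, low1 : low-frequency images from EAC
    C0, C1     : per-pixel L2-norms of ICA coefficients (assumed layout)
    """
    S0 = np.abs(laplace(laplace(low0)))      # second Laplacian, image 0 (assumption)
    S1 = np.abs(laplace(laplace(low1)))
    PL0 = (S0 + eps) / (S0 + S1 + 2 * eps)   # P_L for label 0; P_L(label 1) = 1 - PL0
    PH0 = (C0 + eps) / (C0 + C1 + 2 * eps)   # P_H for label 0
    P0, P1 = PL0 * PH0, (1 - PL0) * (1 - PH0)
    return -np.log(np.stack([P0, P1], axis=-1) + eps)  # U(l_n) = -log P(l_n)
```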

Smoothness Term
The smoothness potential function $V$ is estimated from the low-frequency images, where $p, q$ are adjacent pixels in the N8-neighborhood and $l_0, l_1$ are the first and second low-frequency images, respectively. Finally, the labels of the CRF model in (3) are estimated efficiently using the α-expansion method [34]. Figure 6 demonstrates the labels as estimated from the direct minimization of the unary term $U$ and the labels as estimated from the CRF minimization (3). The predicted labels are then used to fuse the low frequencies of the input images:
$$L_F(i) = \begin{cases} L_0(i), & \hat{\ell}_i = 0, \\ L_1(i), & \hat{\ell}_i = 1, \end{cases}$$
where $L_F$ is the fused low-frequency image, $i$ is the spatial location, and $L_0$, $L_1$ are the low frequencies of the first and second images, respectively.
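In code, this label-guided selection is a single masked choice, e.g.:

```python
import numpy as np

def fuse_low_frequency(labels, L0, L1):
    # select the low frequency indicated by the CRF label at each pixel
    return np.where(labels == 0, L0, L1)
```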

Transform-Domain CRF Fusion Rule
A sliding window of size 7 × 7 is applied to the decision map of the predicted labels. The transform coefficients that correspond to each 7 × 7 block are then fused according to the label of the central pixel of the block, by selecting the respective coefficients from the input image that corresponds to that label. Inverse ICA is then applied to the fused transform coefficients in order to return the fused high frequency. Finally, the fused image is estimated by the addition of the low- and high-frequency components. Figure 8 demonstrates the final fused image.
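A sketch of this rule follows, assuming the coefficients are stored as one vector per window together with the window's central-pixel position; this storage layout is our assumption about the implementation.

```python
import numpy as np

def fuse_coefficients(centers, labels, C0, C1):
    """Label-guided selection of ICA coefficients per 7x7 window.

    centers : (N, 2) central-pixel positions (y, x) of the windows
    labels  : (H, W) CRF label map
    C0, C1  : (N, K) ICA coefficient vectors of the two input images
    """
    pick = np.array([labels[y, x] for y, x in centers])  # label of each central pixel
    return np.where(pick[:, None] == 0, C0, C1)          # copy coefficients per label
```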

Fusion and Denoising
A major advantage of the proposed CRF-Guided fusion is its robustness against Gaussian noise and its support of denoising during fusion. In the case of Gaussian noise, the coefficient shrinkage method [10] is applied to the transform coefficients of both input images; here, C(k) denotes the k-th transform coefficient in the ICA domain and σ_n the standard deviation of the noise, which is estimated from areas of the image with low activity. Low-activity areas contain no strong edges and therefore may contain only noise, so they can be used to estimate the noise standard deviation σ_n. The denoised transform coefficients are then employed to estimate P_H for both input images. Guided fusion from the CRF labels is then performed on the denoised transform coefficients, and the inverse ICA transform is used to return the denoised high-frequency image. Lastly, the final denoised fused image is formed by the addition of the denoised high-frequency and the fused low-frequency images. Figure 9 includes the noisy input images with Gaussian noise N(0, σ²), σ = 5, and the denoised fused image F. The fused image F is successfully denoised during fusion, as demonstrated in Figure 9c.
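As an illustration, a generic hard-threshold shrinkage can stand in for the rule of [10], whose exact form is not reproduced here; the threshold factor t is an assumed parameter.

```python
import numpy as np

def shrink_coefficients(C, sigma_n, t=2.0):
    """Generic hard-threshold shrinkage (stand-in for the rule of [10]):
    coefficients whose magnitude is below t * sigma_n are treated as noise."""
    return np.where(np.abs(C) > t * sigma_n, C, 0.0)

def estimate_noise_std(img, low_activity_mask):
    # sigma_n from low-activity regions, which contain no strong edges
    return float(img[low_activity_mask].std())
```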

Quantitative Evaluation
In [43,44], Singh et al. reviewed multiple image fusion algorithms along with image fusion performance metrics. In order to assess the quality of the fused images of the compared multi-focus image fusion methods, eight metrics are used: Mutual Information (MI) [45], $Q^{AB/F}$ [46], $Q_G$ [47], $Q_Y$ [48], CB [49], SSIM [50], NIQE [51], and Entropy.

Mutual Information-MI
Mutual Information (MI) is an information theory-based metric and an objective measure of the mutual dependence of two random variables. For two discrete random variables U and V, MI is defined as
$$MI(U, V) = \sum_{u} \sum_{v} p(u, v) \log_2 \frac{p(u, v)}{p(u)\, p(v)}.$$
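A direct discrete estimate of this quantity from a joint histogram:

```python
import numpy as np

def mutual_information(u, v, bins=256):
    """Discrete MI estimate from the joint histogram of two images."""
    joint, _, _ = np.histogram2d(u.ravel(), v.ravel(), bins=bins)
    p_uv = joint / joint.sum()
    p_u, p_v = p_uv.sum(axis=1), p_uv.sum(axis=0)
    nz = p_uv > 0                                   # avoid log(0)
    return float((p_uv[nz] * np.log2(p_uv[nz] / np.outer(p_u, p_v)[nz])).sum())
```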

Yang's Metric Qy
Yang et al. [48] proposed the image structural similarity-based metric $Q_Y$ for input images A, B and fused image F, in which $s(A|w)$ is a local salience measure of image A within a window $w$. A higher value of $Q_Y$ indicates better fused image quality and higher structural similarity between the fused image and the input images.

Chen-Blum Metric-CB
The Chen-Blum Metric CB [49] is a human perception-inspired fusion metric that features the following five steps:

1. Contrast sensitivity filtering: the filtered image is $\tilde{I}_A(m, n) = I_A(m, n) S(r)$, where $S(r)$ is the CSF filter in polar form and $r = \sqrt{m^2 + n^2}$.

2. Local contrast computation: a local contrast map $C_A(m, n)$ is computed from the filtered image.

3. Contrast preservation calculation: for input image $I_A$, the masked contrast map is estimated as
$$C'_A(m, n) = \frac{t \, (C_A(m, n))^p}{h \, (C_A(m, n))^q + Z},$$
where $t, h, p, q, Z$ are real scalar parameters that determine the shape of the nonlinearity of the masking function [49].

4. Generation of the saliency map: the saliency map for image $I_A$ is
$$\lambda_A(m, n) = \frac{C'^2_A(m, n)}{C'^2_A(m, n) + C'^2_B(m, n)},$$
and the value of information preservation is
$$Q_{AF}(m, n) = \begin{cases} C'_F(m, n) / C'_A(m, n), & C'_A(m, n) \ge C'_F(m, n), \\ C'_A(m, n) / C'_F(m, n), & \text{otherwise}. \end{cases}$$

5. The global quality map is defined as
$$Q_{GQM}(m, n) = \lambda_A(m, n) Q_{AF}(m, n) + \lambda_B(m, n) Q_{BF}(m, n),$$
and the value of the metric CB is the average of the global quality map: $CB = \overline{Q_{GQM}(m, n)}$.

Gradient-Based Metrics Q_G and Q^AB/F
Xydeas et al. [47] proposed a metric to measure the amount of edge information transferred from the source images to the fused image; $Q_G$ is a gradient-based metric. First, a Sobel operator is applied to input image A in order to extract the edge strength $g_A(i, j)$ and orientation $\alpha_A(i, j)$:
$$g_A(i, j) = \sqrt{s^A_x(i, j)^2 + s^A_y(i, j)^2}, \qquad \alpha_A(i, j) = \arctan\frac{s^A_y(i, j)}{s^A_x(i, j)},$$
where $s^A_x$, $s^A_y$ are the outputs of convolving image A with the horizontal and vertical Sobel templates, respectively. The relative strength between input image A and fused image F is
$$G^{AF}(i, j) = \begin{cases} g_F(i, j) / g_A(i, j), & g_A(i, j) > g_F(i, j), \\ g_A(i, j) / g_F(i, j), & \text{otherwise}, \end{cases}$$
and the relative orientation value between A and F is
$$\Delta^{AF}(i, j) = 1 - \frac{|\alpha_A(i, j) - \alpha_F(i, j)|}{\pi / 2}.$$
The edge strength preservation value is estimated as
$$Q^{AF}_g(i, j) = \frac{\Gamma_g}{1 + e^{k_g (G^{AF}(i, j) - \sigma_g)}},$$
and the orientation preservation value as
$$Q^{AF}_\alpha(i, j) = \frac{\Gamma_\alpha}{1 + e^{k_\alpha (\Delta^{AF}(i, j) - \sigma_\alpha)}}.$$
The constants $\Gamma_g, k_g, \sigma_g$ and $\Gamma_\alpha, k_\alpha, \sigma_\alpha$ define the shape of the sigmoid functions used for the edge strength and orientation preservation values [47]. The edge similarity at position $(i, j)$ between input image A and fused image F is then $Q^{AF}(i, j) = Q^{AF}_g(i, j) \, Q^{AF}_\alpha(i, j)$, where $Q^{AF}_g$ is the edge strength similarity and $Q^{AF}_\alpha$ the orientation similarity.
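The Sobel-based strength and orientation maps can be computed as follows (a sketch; scipy's Sobel templates are assumed to match the metric's):

```python
import numpy as np
from scipy.ndimage import sobel

def edge_strength_orientation(img):
    """Sobel edge strength g and orientation alpha used by Q_G / Q^AB/F."""
    sx = sobel(img.astype(np.float64), axis=1)  # horizontal template s_x
    sy = sobel(img.astype(np.float64), axis=0)  # vertical template s_y
    g = np.hypot(sx, sy)                        # g = sqrt(s_x^2 + s_y^2)
    alpha = np.arctan2(sy, sx)                  # arctan(s_y / s_x), safe at s_x = 0
    return g, alpha
```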

Structural Similarity Index-SSIM [50]
The structural similarity index SSIM for two images A, B is defined as
$$SSIM(A, B) = \frac{(2 \mu_A \mu_B + C_1)(2 \sigma_{AB} + C_2)}{(\mu_A^2 + \mu_B^2 + C_1)(\sigma_A^2 + \sigma_B^2 + C_2)},$$
where $\mu_A, \mu_B$ are the mean intensity values of images A, B, $\sigma_A, \sigma_B$ are the standard deviations of images A, B, $\sigma_{AB}$ is the covariance of A and B, and $C_1, C_2$ are constants. Due to the lack of a ground-truth image, the SSIM for input images A, B and fused image F in the experiments is defined as
$$SSIM(A, B, F) = \frac{SSIM(A, F) + SSIM(B, F)}{2},$$
where A and B are the two input images and F is the fused image.
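A sketch of this fused-image SSIM using scikit-image's SSIM implementation:

```python
from skimage.metrics import structural_similarity as ssim

def fused_ssim(A, B, F, data_range=255):
    """Average SSIM of the fused image against each input, following the
    no-reference convention described above."""
    return 0.5 * (ssim(A, F, data_range=data_range) +
                  ssim(B, F, data_range=data_range))
```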

NIQE [51]
NIQE is a blind image quality metric based on a Multivariate Gaussian (MVG) model. The quality of the fused image is defined as the distance between the quality-aware natural scene statistics (NSS) model and the MVG fit extracted from features of the distorted image:
$$D(\nu_1, \nu_2, \Sigma_1, \Sigma_2) = \sqrt{(\nu_1 - \nu_2)^T \left( \frac{\Sigma_1 + \Sigma_2}{2} \right)^{-1} (\nu_1 - \nu_2)},$$
where $\nu_1, \nu_2$ and $\Sigma_1, \Sigma_2$ are the mean vectors and covariance matrices of the natural multivariate Gaussian model [51] and of the multivariate Gaussian model fit to the fused image.

Entropy
The entropy of an image I is defined as
$$E = -\sum_{j=0}^{L-1} p(s_j) \log_2 p(s_j),$$
where $L$ is the number of gray levels and $p(s_j)$ is the probability of occurrence of gray level $s_j$ in image I. Table 1 includes the objective evaluation of the compared methods for the Lytro dataset [35].
For the Lytro dataset [35], the proposed CRF-Guided fusion method has the highest value for the metrics MI, $Q_G$, $Q^{AB/F}$, $Q_Y$, and CB, the lowest value for the NIQE metric, and the second highest score for SSIM and Entropy. These results indicate that the quality of the proposed fused image is better than that of the compared state-of-the-art methods. Since CRF-Guided has the highest Mutual Information [45], the proposed method best preserves the information of the input images. In addition, CRF-Guided has the highest $Q_G$ [47] and $Q^{AB/F}$ [46] values, which indicates that the proposed method best preserves the edge information transferred from the input images to the fused image. In order to assess the structural similarity of the fused images, Yang's metric $Q_Y$ [48] and the structural similarity index measure SSIM [50] are employed. The proposed method has the highest $Q_Y$ value and the second highest SSIM, which indicates high fused image quality regarding structural similarity; DenseFuse [31] has the highest SSIM value for the Lytro dataset. The proposed CRF-Guided method has the highest value on the human perception-inspired fusion metric CB [49], which implies that the results produced by the method are perceptually the most pleasing to the human eye. According to the blind image quality metric NIQE [51], CRF-Guided has the lowest value and thus the best fused image quality. Lastly, for the blind image quality metric Entropy, GBM [36] has the highest score and CRF-Guided the second highest. Overall, for the Lytro dataset [35] of perfectly registered color input images, the proposed CRF-Guided method outperforms the compared state-of-the-art image fusion methods in most metrics.

Table 1. Objective evaluation for the Lytro dataset [35]. Lower values for NIQE indicate better fused image quality, while for the rest of the metrics higher values indicate better fused image quality.

Table 2 includes the quantitative evaluation of the compared methods for the grayscale dataset [3]. The CRF-Guided fusion method outperforms the compared state-of-the-art methods in terms of the metrics MI [45], $Q_G$ [47], $Q^{AB/F}$ [46], $Q_Y$ [48], CB [49], and SSIM [50], and has the second lowest score for the NIQE [51] metric and the second highest Entropy value. More precisely, since CRF-Guided has the highest Mutual Information [45], it preserves the original information better than the other methods. The highest values of CRF-Guided in $Q_G$ [47] and $Q^{AB/F}$ [46] indicate that the proposed method preserves the edges of the input images better than the state-of-the-art methods. Moreover, the structural information of the original images is best preserved by the CRF-Guided method, since both $Q_Y$ [48] and SSIM [50] have their highest values for the proposed method. According to the human perception-inspired fusion metric CB [49], CRF-Guided has the best fused image quality. For the NIQE [51] metric, DCHWT [17] has the lowest score and the proposed method has the second lowest value. GBM [36] has the highest Entropy value for the grayscale dataset. Overall, the proposed method achieves the highest fused image quality compared to the state-of-the-art methods for the grayscale dataset [3].

Table 2. Objective evaluation for the grayscale dataset [3]. Lower values for NIQE indicate better fused image quality, while for the rest of the metrics higher values indicate better fused image quality.

In summary, according to the eight metrics used for quantitative evaluation, the proposed CRF-Guided method has the best performance compared to 13 state-of-the-art image fusion methods on both public datasets: the Lytro dataset [35] and the grayscale dataset [3].

Complexity
We analyzed the computational complexity of the proposed and compared image fusion methods. The average execution times of the compared methods on the Lytro dataset are included in Table 3. The reported times were measured on an Intel® Core™ i9 2.9 GHz processor with 16 GB RAM and a 64-bit operating system. IFCNN [30] and DenseFuse [31] were executed on an NVIDIA GeForce RTX 2080 with Max-Q Design.

Conclusions
A novel transform-domain multi-focus image fusion method is introduced in this paper. The proposed CRF-Guided fusion takes advantage of CRF minimization: the estimated labels guide the fusion of both the low frequency and the ICA transform coefficients, and thus the high frequency. CRF-Guided fusion supports image denoising during fusion by applying coefficient shrinkage. Quantitative and qualitative evaluation demonstrates that CRF-Guided fusion outperforms state-of-the-art multi-focus image fusion methods. Limitations of the proposed CRF-Guided fusion method include the selection of the transform domain and the hand-crafted design of the unary and smoothness potential functions for the energy minimization problem. Future work includes the application of CRF-Guided fusion to different transform domains and learning the unary and smoothness potential functions with deep learning networks.