Multiscale Image Matting Based Multi-Focus Image Fusion Technique

Multi-focus image fusion is an essential method for obtaining an all-in-focus image from multiple source images. The fused image eliminates the out-of-focus regions, so the resultant image contains only sharp, focused regions. A novel multiscale image fusion system based on contrast enhancement, spatial gradient information, and multiscale image matting is proposed to extract the focused region information from multiple source images. In the proposed approach, the multi-focus source images are first refined by an image enhancement algorithm so that the intensity distribution is improved for better visualization. An edge detection method based on a spatial gradient is then employed to obtain the edge information from the contrast-stretched images. This improved edge information is further utilized by a multiscale window technique to produce local and global activity maps. Furthermore, a trimap and decision maps are obtained from the information provided by these near and far focus activity maps. Finally, the fused image is obtained by using the enhanced decision maps and a fusion rule. The proposed multiscale image matting (MSIM) makes full use of the spatial consistency and the correlation among source images and, therefore, obtains superior performance at object boundaries compared to region-based methods. The performance of the proposed method is compared with some of the latest techniques through qualitative and quantitative evaluation.


Introduction
During image acquisition, one of the most important objectives is obtaining a focused region of interest. However, because of the limited depth of field, only the focused region contains sharp edges, whereas the other regions are blurred. Recently, multi-focus image fusion (combining images with different focused objects) has received considerable attention among researchers. The fused image offers higher quality and contains more detailed information [1,2]. Several methods have been developed to fuse multiple images, and they are broadly grouped into transform domain and spatial domain methods [3,4].
Transform domain methods fuse the corresponding transform coefficients and employ the inverse transformation to construct the fused image. Spatial domain methods are further classified into pixel-based [5,6] and region-based methods [7,8]. The spatial domain methods form the fused image by choosing the pixels, regions, or blocks that are in focus. In dynamic scenes, transform domain methods merge the coefficients without considering the spatial properties, resulting in artifacts in the fused image. Furthermore, pixel-based and region-based methods are unable to produce the best fusion results for images with complicated texture patterns [1].
Zhang et al. [9] used morphological operations to extract focus regions. However, this technique suffers from block artifacts. De et al. [10] utilized morphological processes to detect the focused region and suggested a technique for calculating an optimized block size. The fused result still suffers from blocking effects. Later on, Bai et al. [11] presented a novel quadtree decomposition and a weighted focus based image fusion technique. However, this technique also provides inaccurate segmentation and low visual quality because of the smooth regions. Yin et al. [12] proposed a method based on a joint dictionary and singular value decomposition (SVD). Still, this method is computationally inefficient because the sub-dictionaries are trained individually and the SVD must be computed.
Li et al. [13] explored guided filtering (GFF) and spatial information to improve the fusion results by mitigating the block effects. Zhang et al. [14] proposed a multi-focus image fusion scheme based upon a visual saliency method. Recently, image matting has been used for effectively differentiating the focused and out-of-focus regions. These methods can be broadly categorized as supervised and unsupervised matting techniques. Supervised methods require user-specified foreground and background regions, known as a trimap. Therefore, such techniques require human experts, are time consuming, and produce inconsistent results for images with highly textured backgrounds. Unsupervised methods, by contrast, are preferable because user interaction is not required for achieving a good matting result. Chen et al. [15] used a parametric edge based method. However, these methods do not consider the artifacts among the smooth regions, and much depends on the performance of hand-crafted features, which require considerable expert knowledge. Li et al. [16] proposed multifocus matting (MFM) based image fusion by combining the focus region and its neighboring pixels. This method marginally improves the fusion results and also overcomes some shortcomings of spatial domain methods.
Xiao et al. [17] used depth information to segment an image into focus and blur regions. Zhang et al. [18] made use of the log spectrum, the Fourier transform, and Bayesian techniques. In [19], a definite focus region is detected by using novel multiscale gradient information. Liu et al. [20] proposed a scale-invariant transform to detect focus regions. However, this technique fails to offer sharp edges of the focus regions. Furthermore, in [21], the focus information was extracted by using texture features. Baohua et al. [22] performed the near and far focus region detection by using sparse representation and guided filter techniques. In [23], a structure tensor was used for the detection of high and low frequency components. However, this technique fails to provide a visible difference between focus and defocus regions in many cases. Yu et al. [24] presented a convolutional neural network (CNN) based multi-focus image fusion technique. However, in this method, the precision of recognizing the focus block is very low.
In this paper, a novel multi-focus image fusion method is presented using contrast stretching and spatial gradients to enhance the edges from the source images. A multiscale sliding window method is used for detecting the local and global intensity variations to generate initial activity maps. These multiple activity maps are further processed to generate a trimap. An enhanced image matting technique is used for generating the decision maps. Finally, the fused image is obtained after processing the source images, enhanced decision maps, and employing the fusion rule.

Proposed Fusion Technique
The schematic diagram of the proposed algorithm is shown in Figure 1. It can be observed that in the first step, a contrast enhancement scheme is applied on the source images. In the second step, the outcome of the intensity transformed image is processed through an edge detection method. In a multi-focus image fusion scheme, the selection of near focus and far focus region plays a vital role. The region that is in focus during image acquisition tends to have sharp edges as compared to the out-of-focus region. Therefore, these sharp edges can be detected easily by applying an appropriate edge detection method.
Edge detection schemes rely heavily on the intensity distribution of an image. A poor intensity distribution can lead to an oversaturated, undersaturated, dark, or bright image. In any of these cases, the edge detection algorithm cannot perform well. In order to improve the intensity distribution of an image, an intensity transformation can be performed. In Figure 2, the improvement in edge information is shown by comparing the images before and after applying the contrast enhancement scheme. In the next step, a sliding window technique with two different scales is applied on both edge-detected images to generate activity maps. In this step, both local and global intensity variations are analyzed. The fine details are more prominent under a small sliding window scale. These maps are further fused together and processed to generate a trimap. Next, the trimap undergoes an image matting transformation to produce refined decision maps, which are used to produce the final fused image. The proposed fusion scheme, along with the equation references, is also elaborated in Algorithm 1.

Require:
Step 1. Apply contrast enhancement on $I_i$ using Equation (1).

Contrast Enhancement
Histogram equalization is an effective method for enhancing a low contrast image. Non-parametric modified histogram equalization (NMHE) [25] is applied to enhance the contrast while preserving the mean brightness of the source image $I_i$, as in Equation (1). Enhancing the contrast improves and concentrates the pixel details. Figure 2 shows the enhancement in edge information. Figure 2a,b shows the far and near focus source images, respectively, and their gradients are shown in Figure 2c,d. The contrast-enhanced near and far focus images are displayed in Figure 2e,f, and their respective edge maps are shown in Figure 2g,h. From the images, it can be clearly seen that the gradients of the source images are greatly improved after the enhancement algorithm.
Figure 2. (a,b) Source images, (c,d) gradients of (a,b) achieved by SSGSM [26], (e,f) contrast enhancement using non-parametric modified histogram equalization (NMHE) [25], (g,h) gradients of (e,f) achieved by SSGSM [26].
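To make the role of this step concrete, the following minimal Python sketch applies plain global histogram equalization to an 8-bit grayscale image. It is a stand-in chosen for brevity and is not an implementation of NMHE [25]; the function name `equalize_histogram` and the use of NumPy are assumptions introduced here for illustration only.

```python
import numpy as np

def equalize_histogram(image: np.ndarray, levels: int = 256) -> np.ndarray:
    """Global histogram equalization of an 8-bit grayscale image.

    ASSUMPTION: this is ordinary histogram equalization, used only as a
    simple stand-in for the NMHE method cited in the text.
    """
    # Histogram and cumulative distribution of the intensity levels.
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = np.cumsum(hist).astype(np.float64)

    # Smallest non-zero CDF value, so the darkest used level maps to 0.
    cdf_min = cdf[np.nonzero(cdf)[0][0]]
    denom = cdf[-1] - cdf_min
    if denom == 0:  # constant image: nothing to stretch
        return image.copy()

    # Look-up table mapping each old level to its equalized level.
    lut = np.round((cdf - cdf_min) / denom * (levels - 1))
    lut = np.clip(lut, 0, levels - 1).astype(np.uint8)
    return lut[image]
```

For a low contrast input whose values occupy only a narrow band, the output spreads them over the full range, which is what makes the subsequent gradient computation more responsive.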

Edge Detection
Edge detection on the contrast-enhanced images is performed by the spatial stimuli gradient sketch model (SSGSM) [26], which principally focuses on focal intensity points and edges in an image. The unknown region in the coarse decision maps is later calculated from the information concentrated in both activity level maps. The weight of the local stimuli is determined by detecting the local variation in the perceived brightness at the respective positions. The perceived brightness $P_i$ of an image is given in Equation (2), where $\acute{I}_i$ represents the source images and $\vartheta$ denotes the scaling factor. Gradients capture the sharp intensity variations in the image. Mathematically, the weight is computed as the total difference of the perceived brightness in the x and y directions. The intensity variations of $P_i$ along the x and y axes are represented by $x_i$ and $y_i$, respectively. These variations are calculated by using their respective gradients $B_{x_i}$ and $B_{y_i}$, given in Equations (3) and (4). The weight of the local stimuli $Z_i$ is expressed by Equation (5).
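The idea of weighting each pixel by the local variation of a perceived brightness can be sketched as follows. The sketch does not reproduce Equations (2)-(5): the logarithmic brightness mapping, the central-difference gradients, and the Euclidean combination of the two directions are all assumptions made for illustration, as are the names `stimuli_weight` and `theta`.

```python
import numpy as np

def stimuli_weight(image: np.ndarray, theta: float = 1.0) -> np.ndarray:
    """Gradient-magnitude activity map of a perceived-brightness image.

    ASSUMPTIONS (not taken from the paper's equations):
      * perceived brightness is modelled as theta * log(1 + I);
      * the x and y variations are combined as a Euclidean magnitude.
    """
    intensity = image.astype(np.float64)
    perceived = theta * np.log1p(intensity)

    # np.gradient returns the derivative along axis 0 (rows, y) first,
    # then along axis 1 (columns, x).
    grad_y, grad_x = np.gradient(perceived)
    return np.hypot(grad_x, grad_y)
```

On a vertical step edge the resulting map is zero in the flat regions and positive only in the columns adjacent to the step, which is the behaviour the later focus maps rely on.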

Focus Maps
A multiscale sliding window technique is applied to acquire diverse focus maps from the activity maps $Z_i$. Two sliding windows are selected for the generation of focus maps. First, a 9 × 9 window is initialized by setting k = 9, l = 9, and σ = 1 in Equation (6). The activity maps are divided into blocks of 9 × 9 pixels by using spatial domain filters, as in Equations (6) and (7). The activity of each block is stored in the form of map scores. Furthermore, the sums of intensity levels in each near focus block ($G_{1,\sigma=1}(s,t)$) and far focus block ($G_{2,\sigma=1}(s,t)$) are calculated and compared with one another to update the score maps ($\zeta_{i,\sigma=1}$), as given in Equations (8) and (9).
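The window-sum comparison described above can be sketched as follows for a single scale. This is a minimal illustration under stated assumptions: the window is an unweighted box centred on each pixel with zero padding, the score maps are binary, and ties score zero in both maps. The function names `window_sum` and `score_maps` are hypothetical, and the filters of Equations (6) and (7) are not reproduced.

```python
import numpy as np

def window_sum(activity: np.ndarray, size: int) -> np.ndarray:
    """Sum of `activity` over a size x size window centred on each pixel.

    ASSUMPTION: unweighted box window, odd `size`, zero padding at borders.
    """
    pad = size // 2
    padded = np.pad(activity.astype(np.float64), pad, mode="constant")
    # Integral image with a leading row and column of zeros.
    integral = np.pad(padded, ((1, 0), (1, 0)), mode="constant")
    integral = integral.cumsum(axis=0).cumsum(axis=1)
    h, w = activity.shape
    return (integral[size:size + h, size:size + w]
            - integral[:h, size:size + w]
            - integral[size:size + h, :w]
            + integral[:h, :w])

def score_maps(z_near: np.ndarray, z_far: np.ndarray, size: int = 9):
    """Binary near/far score maps from two activity maps at one scale."""
    g_near = window_sum(z_near, size)
    g_far = window_sum(z_far, size)
    near_score = (g_near > g_far).astype(np.uint8)
    far_score = (g_far > g_near).astype(np.uint8)  # ties score 0 in both maps
    return near_score, far_score
```

Calling `score_maps` once per window size yields one pair of maps per scale, which is the input the multiscale combination step expects.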
These multiple sliding windows result in multiple near and far focus maps. This multiscale sliding window technique reduces the blocking artifacts in the coarse decision maps. Each map offers different characteristic information, which plays a key role in improving the focus maps and the fused image. The multiscale windows extract information from the original images at different scales: a small window focuses on local intensity variations, whereas a large window extracts global variations in an image. It is noted that this approach has demonstrated better visual quality than the existing methods. The information from the multiscale near-focus maps ($\zeta_{1,\sigma=1}$ and $\zeta_{1,\sigma=3}$) and far-focus maps ($\zeta_{2,\sigma=1}$ and $\zeta_{2,\sigma=3}$) is combined to form a single near-focus map ($D_1$) and a single far-focus map ($D_2$), respectively, carrying the attributes of both scales, as in Equation (10).
After obtaining the focus maps, the next step is to generate a trimap that segments the given images into three different regions, i.e., definite focused, definite defocused, and unknown. Pixels from the focused region have a greater focus value than pixels in the defocused region [27]. The trimap T of $A_1$ is computed from $D_1$ and $D_2$ as in Equation (11).
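A rough sketch of how two scales might be merged and turned into a trimap is given below. Equations (10) and (11) are not reproduced; the agreement rule (a pixel is kept only when both scales agree) and the trimap coding (1 for definite focus, 0 for definite defocus, 0.5 for unknown) are assumptions for illustration, and `combine_scales` and `build_trimap` are hypothetical names.

```python
import numpy as np

def combine_scales(score_small: np.ndarray, score_large: np.ndarray) -> np.ndarray:
    """Merge binary score maps from two window scales.

    ASSUMPTION: a pixel is kept only where both scales agree (logical AND).
    """
    return (score_small.astype(bool) & score_large.astype(bool)).astype(np.uint8)

def build_trimap(d_near: np.ndarray, d_far: np.ndarray) -> np.ndarray:
    """Three-level trimap from combined near-focus and far-focus maps.

    ASSUMPTION: 1.0 = definite focus, 0.0 = definite defocus, 0.5 = unknown.
    """
    trimap = np.full(d_near.shape, 0.5)
    trimap[(d_near == 1) & (d_far == 0)] = 1.0
    trimap[(d_far == 1) & (d_near == 0)] = 0.0
    return trimap
```

Under this rule, pixels on which the two scales disagree fall into the unknown band, which is then left for the matting step to resolve.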
In image matting, a given image I is considered a composite of a foreground $I_{Fore}$ and a background $I_{back}$. Each pixel is assumed to be a linear combination of $I_{Fore}$ and $I_{back}$. Let α denote the pixel foreground opacities; then an image I can be represented as
$$I = \alpha I_{Fore} + (1 - \alpha) I_{back}.$$
In [28], the quadratic cost function for α is derived as
$$J(\alpha) = \alpha^{T} L \alpha,$$
where L is the matting Laplacian matrix of dimension N × N. L is a symmetric positive semidefinite matrix and is defined in [28] as L = H - W, where H is a diagonal matrix and W is a symmetric matrix. The entry $W_M(i,j)$ is given as
$$W_M(i,j) = \sum_{k \mid (i,j) \in w_k} \frac{1}{|w_k|}\left(1 + (\chi_i - \varphi_k)^{T}\left(\nu_k + \frac{\epsilon}{|w_k|}\Gamma\right)^{-1}(\chi_j - \varphi_k)\right),$$
where $|w_k|$ denotes the number of pixels in the window, $\varphi_k$ and $\nu_k$ represent the mean and variance of the intensities in the window $w_k$, respectively, χ represents the pixel color, ε is a regularization parameter, and Γ is an identity matrix. Finally, the alpha matte α obtained from the source images and the trimap corresponds to the focused region of $I_i$, and the fused image is constructed as in Equation (15).
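Once an alpha matte is available, the compositing model above suggests a simple way to blend the two source images. The sketch below assumes that the fusion rule is the per-pixel convex combination of the near-focus and far-focus images weighted by α; this is a common choice but is an assumption here, since Equation (15) is not reproduced. The name `fuse_with_matte` is hypothetical, and the matte itself is taken as given rather than solved from the matting Laplacian.

```python
import numpy as np

def fuse_with_matte(near_img: np.ndarray, far_img: np.ndarray,
                    alpha: np.ndarray) -> np.ndarray:
    """Blend two registered source images with an alpha matte.

    ASSUMPTION: fused = alpha * near + (1 - alpha) * far, with alpha in
    [0, 1] marking the region that is in focus in `near_img`.
    """
    a = np.clip(alpha.astype(np.float64), 0.0, 1.0)
    if near_img.ndim == 3:  # colour image: broadcast alpha over channels
        a = a[..., None]
    return a * near_img.astype(np.float64) + (1.0 - a) * far_img.astype(np.float64)
```

Where α is exactly 1 or 0 the fused pixel is copied from one source; fractional values occur only in the transition band, which is where a soft matte is expected to help most.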

Results and Discussion
To show the superiority of the proposed MSIM, a comparison was performed with discrete wavelet transform (DWT) [29], guided filtering based fusion (GFF) [13], discrete cosine transform (DCT) [30], dense SIFT (DSIFT) [20], multi-scale morphological focus-measure (MSMFM) [9], and convolutional neural network (CNN) [24] on a multi-focus image dataset [31]. The proposed method was evaluated by performing both subjective and objective assessments. These algorithms were tested on an Acer laptop with an Intel(R) Core(TM) i7 2.6 GHz processor and 12 GB RAM under a Matlab R2018b environment. All the algorithms were executed by using the original codes made available by the authors.

Comparison of Image Matting Result
Generally, a supervised (user-specified) trimap produces better matting results than an unsupervised one. Hence, in practice, user-specified trimaps are often necessary to achieve high quality matting results; however, creating a user-supervised trimap takes time and skill, and such trimaps are not available for all kinds of images. In this paper, two image matting techniques have been proposed, i.e., focus maps matting and feature based matting. The results of the proposed method are compared with feature based matting and the closed form matting [28]. It is clearly observed that the proposed matting produces better results compared to the existing technique (Figure 3).

Comparison of Image Fusion with Other Methods
The proposed technique is tested on gray scale, color, and dynamic images. Figure 4 shows the results of the proposed MSIM for "Lab" image. The source near and far focus inputs are presented in Figure 4a,b, respectively. The fused results produced by other methods and the proposed technique are given in Figure 4c-i. To further investigate the effectiveness, the difference of the near-focus image with the fused images is shown in Figure 4j-p. The close up views enclosed by red and yellow boxes are also shown at the bottom of their respective difference image. It is noted that the DWT, DCT, and DSIFT methods produce poor edge information and contain artifacts (as shown in the close-ups). Furthermore, GFF, MSMFM, and CNN methods also provide limited information of the focused regions as compared to the proposed MSIM technique. Similarly, Figure 5 illustrates the results produced by several existing and proposed algorithms for "Globe" images. To further analyze the results, close-up views of important regions are placed at the bottom of each difference image. In this image, the boundary region of the hand is difficult to detect since it lies on the focus transition point. The results of fusion by other techniques in Figure 5j-o show the distorted regions and lack of sharpness in the highlighted region. However, the proposed MSIM method has successfully fused the complementary information from both the images, as shown in Figure 5p. It is very important to evaluate the results of different algorithms on the color dataset shown in Figures 6a,b and 7a,b. The outcomes of the existing techniques and the proposed method on "Flower" and "Boy" are shown in Figures 6c-i and 7c-i, respectively. The difference between the fused and out of focus source images is illustrated in Figures 6j-p and 7j-p, respectively. 
It is noted that in both the flower and boy images, the existing techniques are unable to mitigate the artifacts and blur in the focus transition area (as noted in the close-ups of the difference images). The proposed MSIM is able to preserve contrast and details by using the edge features and the multiscale image matting technique.
Another challenge for multi-focus fusion is the performance in dynamic scenes. Such scenes arise from either the movement of the camera or the motion of the object, so it is important to verify the effectiveness of MSIM against the existing methods on such scenes. Figure 8a,b shows the near and far focus "Girl" images, respectively. The results of MSIM and the existing techniques are shown in Figure 8c. It is observed from these visualizations that the existing methods produce artifacts, erosion, and halo effects and are unable to produce sharp boundaries between the near and far focus regions. Note that the MSIM technique not only accurately identifies the near and far focus regions but also fuses the complementary information in an effective manner.

Objective Evaluation Metrics
Considering both the visual quality and the quantitative assessment of the different methods, MSIM produces a visually pleasing, high quality fusion result in almost all cases and outperforms the existing fusion methods for multi-focus images. Five commonly used metrics are evaluated to verify the effectiveness of the proposed MSIM method: Mutual Information (MI) [32], Spatial Structural Similarity (SSS) $Q^{AB/F}$ [33], Feature Mutual Information (FMI) [34], Entropy (EN) [35], and Visual Information Fidelity (VIF) [36]. Table 1 shows that the proposed MSIM gives better objective assessment results than the existing methods. Although the results of the existing techniques are comparable in some cases (Flower and Boy), the metric values obtained using the proposed MSIM generally outperform those of the existing techniques.
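Two of the listed metrics, EN and MI, have compact histogram-based definitions that can be sketched directly. The code below uses the standard Shannon entropy and mutual information in bits for 8-bit grayscale images, and assumes that the fusion MI score is the sum of the mutual information between each source image and the fused image. It is an illustrative sketch and not the evaluation code used for Table 1; all function names are hypothetical.

```python
import numpy as np

def entropy(image: np.ndarray, levels: int = 256) -> float:
    """Shannon entropy (bits) of an integer-valued grayscale image."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # skip empty bins so log2 is defined
    return float(-np.sum(p * np.log2(p)))

def mutual_information(a: np.ndarray, b: np.ndarray, levels: int = 256) -> float:
    """Mutual information (bits) between two same-sized grayscale images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=levels,
                                 range=[[0, levels], [0, levels]])
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)  # marginal of a, shape (levels, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)  # marginal of b, shape (1, levels)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / (p_a @ p_b)[mask])))

def fusion_mi(src_a: np.ndarray, src_b: np.ndarray, fused: np.ndarray) -> float:
    """ASSUMPTION: fusion MI is MI(src_a, fused) + MI(src_b, fused)."""
    return mutual_information(src_a, fused) + mutual_information(src_b, fused)
```

Higher values of both quantities indicate that the fused image carries more information, and more of the information present in the sources, respectively.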

Comparison of Computational Efficiency
In this section, the computational efficiency of the different fusion methods is compared. The execution time of these schemes for different images is shown in Table 1. The results show that the proposed MSIM, DSIFT, and GFF consume less time than the other algorithms, i.e., DCT, DWT, MSMFM, and CNN. The MSMFM algorithm uses a multi-scale morphological gradient based feature and therefore takes longer to process than DSIFT. GFF integrates the source images by using a global weight based scheme, yet it still takes less computation time and produces satisfactory results.
The proposed method utilizes contrast enhancement, SSGSM based edge extraction, and sliding window based local and global operations to create the activity maps and the trimap. The sliding window method, the generation and comparison of the activity maps, and the trimap generation are time consuming tasks. Although these steps increase the processing time of the proposed algorithm, it produces the best unsupervised image matting and image fusion results.

Conclusions
A multiscale image fusion technique is presented for the accurate construction of trimaps, decision maps, and fused images. First, the source images are pre-processed using the NMHE histogram equalization method, and their gradients are computed using SSGSM. A multiscale sliding window technique then calculates the focus maps from the resulting activity maps. Furthermore, the focus information is processed so that an accurate focused region is extracted. The proposed MSIM is robust to noise interference and is flexible enough to be combined with various fusion strategies. It provides better fusion performance, both visually and quantitatively, when compared with other state-of-the-art methods on multi-focus image datasets. In the future, the proposed scheme will be considered for other application areas of image processing.