A Novel Geometric Dictionary Construction Approach for Sparse Representation Based Image Fusion

Sparse-representation based approaches have been integrated into image fusion methods in the past few years and show great performance in image fusion. Training an informative and compact dictionary is a key step for a sparsity-based image fusion method. However, it is difficult to balance “informative” and “compact”. To obtain sufficient information for sparse representation in dictionary construction, this paper classifies image patches from source images into different groups based on morphological similarities. Stochastic coordinate coding (SCC) is used to extract corresponding image-patch information for dictionary construction. Using the constructed dictionary, image patches of source images are converted to sparse coefficients by the simultaneous orthogonal matching pursuit (SOMP) algorithm. Finally, the sparse coefficients are fused by the Max-L1 fusion rule and inverted to a fused image. Comparative experiments evaluate the fused image in terms of image features, information, structural similarity, and visual perception. The results confirm the feasibility and effectiveness of the proposed image fusion solution.


Introduction
High-quality images can help increase the accuracy and efficiency of image processing and related analysis. However, a single sensor cannot capture sufficient information in one scenario. To obtain more information, image fusion techniques are used to combine multiple images from the same scenario. Image fusion techniques are now widely used in different areas, such as computer vision, medical diagnosis and treatment, and remote sensing. Various image fusion algorithms have been proposed and used in diverse applications.
According to the fusion domain, these methods can be divided into two main categories: spatial-domain-based methods and transform-domain-based methods [1,2]. Spatial-domain-based methods directly choose clear pixels, blocks, or regions from source images to compose a fused image [3,4,5]. Some simple methods, such as averaging or max-pixel schemes, operate on single pixels to generate the fused image. However, these methods may reduce the contrast and edge intensity of the fused result. To improve the quality of the fused image, some advanced algorithms, such as block-based and region-based algorithms, were developed.
Li et al. proposed a scheme that divides images into blocks and first chooses the focused block by comparing spatial frequencies (SF); the fused result is then produced by consistency verification [6,7]. Although block-based methods improve the contrast and sharpness of the integrated image, they may cause a block effect in the integrated image [8,9].
Different from spatial-domain fusion methods, transform-domain methods first transform source images into a few corresponding coefficients and transform bases [10,11]. Then, the coefficients are fused and inverted to an integrated image. Multi-scale transform (MST) and wavelet based algorithms are conventional transform approaches applied to transform-domain-based image fusion [12,13,14], such as wavelet transform [15,16], shearlet [17,18], curvelet [19], dual tree complex wavelet transform [20,21], and nonsubsampled contourlet transform (NSCT) [22]. MST decomposition methods have attracted great attention in the image processing field and are widely used in image fusion. However, MST-based methods need prior knowledge to select an optimal transform basis [23]. As each MST method has its own limitations, it is difficult for a single MST method to fit all kinds of images [12].
Recently, sparse-representation based methods have shown great performance in image de-noising [24], image de-blurring [25], image target tracking [26,27], and image super-resolution [28,29]. Sparse-representation based methods decompose an image patch using a small number of bases or atoms of a fixed or trained dictionary.
In the image fusion field, a sparse-representation based method was first proposed by Yang and Li [30]. They applied the Discrete Cosine Transform (DCT) dictionary and orthogonal matching pursuit (OMP) method to sparse-representation based multi-focus image fusion. Liu et al. [31] presented a sparse-representation based method using an NSCT filter for image decomposition and the DCT dictionary for sparse coding of image patches. Yin et al. [32] used a dual-tree complex shearlet transform dictionary for image fusion, which enhanced the contrast of image details.
The methods mentioned above use a fixed dictionary for sparse representation. However, a fixed dictionary cannot adapt to the input images. As the dictionary is one of the most crucial parts of sparse representation, a dictionary trained on the input images describes the source images better. Selecting a good over-complete dictionary is the main issue of sparse-representation-based image reconstruction and fusion techniques. DCT or wavelet bases are often used to form an over-complete dictionary. Since such dictionaries, formed with transform bases, do not rely on input image data, they are fixed regardless of the type of sensors, context of images, or applications. While a fixed dictionary can easily be implemented, its performance is somewhat limited depending on the type of data and application. To make a dictionary adaptive to input image data, a dictionary learning method was developed by Aharon et al. [33]. Yin et al. [34] developed an image fusion method based on K-means generalized singular value decomposition (K-SVD) [33], which also explored the sparse parameter in image fusion. Wang et al. [35] proposed an approximate K-SVD-based sparse representation method for image fusion and exposure fusion to reduce computation costs in dictionary learning. To make the trained dictionary more informative, Kim et al. [36] proposed a compact dictionary learning method called joint clustering patches dictionary learning (JCPD). JCPD used image pixel clustering and principal component analysis (PCA) bases to train sub-dictionaries in dictionary construction. Fusion results showed that the detailed information from source images was well preserved. Zhu et al. [9,37,38] presented an image patch clustering method and applied it to the corresponding sub-dictionary training process. Their method improved the detailed features in medical image fusion. All of these sparse-representation based image fusion methods used only one dictionary for sparse coding of all image patches, which may cause redundancy in the constructed dictionary.
Geometric information, including the edges, contours, and textures of an image, is one of the most important types of image information and can remarkably influence the quality of image perception [39,40]. This information can be used in patch classification as a supervised dictionary prior to improve the performance of the trained dictionary [41,42]. In this paper, a geometric classification based dictionary learning method is proposed for sparse-representation based image fusion. Instead of grouping the pixels of images, the proposed geometric classification method groups image blocks directly by the geometric similarity of each image block. Since a sparse-representation based fusion method uses image blocks for sparse coding and coefficient fusion, extracting underlying geometric information from image-block groups is an efficient way to construct a dictionary. Moreover, the geometric classification can group image blocks based on edge and sharp line information for dictionary learning, which can improve the accuracy of sparse representation. This paper has two main contributions.
1. A geometric-information based classification method is proposed and applied to the sub-dictionary learning of image patches. The proposed classification method can accurately split source image patches into different groups for sub-dictionary learning based on the corresponding geometry features. Sub-dictionary bases extracted from each image-patch group contain the key geometry features of source images. These extracted sub-dictionary bases are trained to form informative and compact sub-dictionaries for image fusion.
2. A dictionary combination method is developed to construct an informative and compact sub-dictionary. Each image patch of a fused image is composed of corresponding source image patches using a constructed-sub-dictionary (CSD). According to the classification of geometry features, each source image patch is trained and categorized into a group of sub-dictionaries. Corresponding image patches, which appear at the same place in the two source images, belong to at most two sub-dictionary groups, so redundant geometric information of source image patches is eliminated.
The remaining sections of this paper are structured as follows: Section 2 proposes the geometric sub-dictionary learning method and integrated image fusion framework; Section 3 compares and analyzes experimentation results; and Section 4 concludes this paper.

Geometry-Based Image Fusion Framework
This section presents the proposed image fusion method. The proposed method consists of geometric similarity based sub-dictionary learning and sparse representation based image fusion processes. In the sub-dictionary learning step, images are first split into image patches. The image patches are clustered into a few groups based on geometric similarity. The SCC algorithm is used in sub-dictionary training. In the image fusion step, image patches of source images are sparse coded using an assembled dictionary. The assembled dictionary consists of the sub-dictionaries that correspond to the groups of the input image patches. When image patches are sparse coded, the coded coefficients are fused by using the Max-L1 fusion rule [30].

Dictionary Learning
The proposed dictionary learning method is shown in Figure 1. In the proposed method, source images are split into 8 × 8 image patches by sliding windows. These image patches are transformed to vectors of 1 × 64 in a linewise direction and normalized between 0 and 1. Then, these image patches are clustered into a few groups for sub-dictionary learning. These sub-dictionaries can preserve key information of each image patch group. There are six specific groups of sub-dictionaries in this paper.
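The patch preparation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_patches`, the default overlap of six pixels (taken from the experimental setup), and normalization by dividing 8-bit gray levels by 255 are assumptions.

```python
def extract_patches(image, patch=8, overlap=6, max_level=255.0):
    """Split a 2-D grayscale image (list of rows) into vectorized patches.

    Assumptions (not stated in the paper): pixels are 8-bit gray levels and
    normalization to [0, 1] is done by dividing by max_level.
    """
    step = patch - overlap  # sliding-window stride
    rows, cols = len(image), len(image[0])
    vectors = []
    for r in range(0, rows - patch + 1, step):
        for c in range(0, cols - patch + 1, step):
            # Linewise (row-by-row) vectorization of the patch: 64 values
            vec = [image[r + i][c + j] / max_level
                   for i in range(patch) for j in range(patch)]
            vectors.append(vec)
    return vectors


if __name__ == "__main__":
    demo = [[(r * 10 + c) for c in range(10)] for r in range(10)]
    patches = extract_patches(demo)
    print(len(patches), len(patches[0]))
```

With a stride of two pixels, a 10 × 10 image yields a 2 × 2 grid of overlapping 8 × 8 patches, each stored as a 64-element vector.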
The geometric image-patch clustering method classifies all image patches into three main groups: smooth, dominant orientation, and stochastic. In the proposed method, image patches are first classified into smooth and non-smooth patches using a variance-based method. For image patches $(p_1, p_2, \ldots, p_n)$, the corresponding variances are $(v_1, v_2, \ldots, v_n)$. If the variance $v_i$ of the $i$-th image patch satisfies $v_i < \delta$, where $\delta$ is a preset threshold, the patch $p_i$ is considered a smooth image patch.
In this way, image patches are classified into smooth and non-smooth patches. Non-smooth patches are further clustered into dominant orientation and stochastic patch groups by calculating the dominant orientations of the patches. The dominant orientation estimation method is based on the singular value decomposition (SVD). The gradient $g_i$ of an image pixel is calculated by Equation (1), and the gradient map $g_{p_i}$ of an image patch $p_i = [i_1, i_2, \ldots, i_n]$, which consists of $n$ pixels, is given by Equation (2). Performing an SVD on $g_{p_i}$ yields $g_{p_i} = U S V^T$. The first column $v_1$ of $V$ is the dominant orientation of the gradient field, and the second column $v_2$ is the subdominant orientation. The dominant measure $R$ [43], which quantifies how remarkably the singular values corresponding to $v_1$ and $v_2$ differ, is calculated using Equation (3), where $S_{1,1}$ and $S_{2,2}$ are the row 1-column 1 and row 2-column 2 entries of the singular value matrix $S$, respectively. If $R$ is smaller than a significance level threshold $R^*$, the image patch is considered a stochastic pattern.
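For reference, the quantities behind Equations (1)-(3) are commonly written as below. This is a sketch under standard definitions, not a verbatim copy of the paper's equations: the intensity function $k(x, y)$, the row-wise stacking of the per-pixel gradients into an $n \times 2$ matrix, and the normalized-difference form of the dominant measure are assumptions consistent with the surrounding text.

```latex
\begin{align}
g_i &= \nabla k(x_i, y_i)
     = \left[ \frac{\partial k(x_i, y_i)}{\partial x},\;
              \frac{\partial k(x_i, y_i)}{\partial y} \right], \\
g_{p_i} &= \begin{bmatrix} g_{i_1} \\ g_{i_2} \\ \vdots \\ g_{i_n} \end{bmatrix}
         \in \mathbb{R}^{n \times 2},
\qquad g_{p_i} = U S V^{T}, \\
R &= \frac{S_{1,1} - S_{2,2}}{S_{1,1} + S_{2,2}}.
\end{align}
```

Under this form, $R$ lies in $[0, 1]$: it approaches 1 when all gradients in the patch share one direction and approaches 0 when the two singular values are nearly equal, which matches the rule that a small $R$ indicates a stochastic pattern.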
To differentiate the geometric information of dominant orientation patches, these patches are further classified into groups according to their directions. The direction $d$ of a dominant orientation image patch is estimated from the gradient field $v_1$, as shown in Equation (4). In the proposed dictionary learning framework, dominant orientation image patches are classified into four groups: horizontal, right-direction, vertical, and left-direction, corresponding to 0, 45, 90, and 135 degrees, respectively. Each patch is assigned to the group whose direction is closest to $d$. For each group, a sub-dictionary is trained by the stochastic coordinate coding (SCC) algorithm shown in Algorithm 1, which is extremely fast. $H$ represents the Hessian matrix of the objective function. SCC uses this Hessian matrix to obtain the learning rate: following second-order stochastic gradient descent, the inverse of the Hessian matrix serves as the learning rate [44]. $z$ is obtained by using the simultaneous orthogonal matching pursuit (SOMP) algorithm shown in Algorithm 2 to sparse code image patches based on dictionary $D$. The trained sub-dictionaries for different geometric groups are shown at the bottom of Figure 1.
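The six-way patch classification described in this section (variance test, dominant measure, direction binning) can be sketched as follows. It is an illustration under stated assumptions rather than the paper's code: the thresholds `delta` and `r_star` are hypothetical values, gradients use simple unnormalized finite differences, the dominant measure is taken as $(S_{1,1} - S_{2,2}) / (S_{1,1} + S_{2,2})$, and the direction $d$ is taken as the angle of $v_1$. The singular values of the $n \times 2$ gradient map are obtained in closed form from the 2 × 2 matrix $G^T G$ instead of a general SVD routine.

```python
import math


def classify_patch(patch, delta=1e-3, r_star=0.3):
    """Return 'smooth', 'stochastic', 'deg0', 'deg45', 'deg90' or 'deg135'.

    patch is a square list of rows; delta and r_star are hypothetical
    thresholds chosen for illustration only.
    """
    n = len(patch)
    pixels = [v for row in patch for v in row]
    mean = sum(pixels) / len(pixels)
    variance = sum((v - mean) ** 2 for v in pixels) / len(pixels)
    if variance < delta:
        return "smooth"

    # Accumulate G^T G = [[sxx, sxy], [sxy, syy]] for the gradient map G.
    sxx = sxy = syy = 0.0
    for i in range(n):
        for j in range(n):
            # Unnormalized central differences, one-sided at the border
            gx = patch[i][min(j + 1, n - 1)] - patch[i][max(j - 1, 0)]
            gy = patch[min(i + 1, n - 1)][j] - patch[max(i - 1, 0)][j]
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy

    # Singular values of G are square roots of the eigenvalues of G^T G.
    half_trace = (sxx + syy) / 2.0
    disc = math.hypot((sxx - syy) / 2.0, sxy)
    s1 = math.sqrt(half_trace + disc)
    s2 = math.sqrt(max(half_trace - disc, 0.0))
    if s1 + s2 == 0.0 or (s1 - s2) / (s1 + s2) < r_star:
        return "stochastic"

    # Angle of the dominant right singular vector v1, folded into [0, 180).
    angle = math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy)) % 180.0
    nearest = min((0, 45, 90, 135),
                  key=lambda a: min(abs(angle - a), 180.0 - abs(angle - a)))
    return "deg%d" % nearest
```

Note that the angle computed here is that of the gradient field, as in the text; an edge runs perpendicular to its gradient, so a mapping to edge direction would shift each label by 90 degrees.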
In the SOMP algorithm, $K$ is the total number of image patches $x$ and sparse coefficients $z$, $x_k$ is the $k$-th image patch, and $z_k$ is the $k$-th sparse coefficient. This paper assumes that the source images are all noise free. Thus, a small global error is set, i.e., $\varepsilon = 0.01$.

Algorithm 1 SCC Algorithm.

Input: Image patches of the $w$-th cluster $P^w = (p^w_1, p^w_2, \ldots, p^w_n) \in \mathbb{R}^{64 \times n}$
Initialization: $H = 0$
• Update $z_i$ via one or a few steps of coordinate descent.
• Update the Hessian matrix and the learning rate.

Algorithm 2 SOMP Algorithm.

• Select the index $t_l$ that indicates the next best coefficient atom to simultaneously provide good reconstruction for all signals.
• Compute new coefficients (sparse representations), approximations, and residuals.
• If the residual exceeds $\varepsilon^2$, go back to the selection step.
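The SOMP loop outlined above (select one shared atom, refit the coefficients, update the residuals, test the error) can be sketched in plain Python as follows. This is a generic SOMP illustration under assumptions, not a reproduction of Algorithm 2: the names `somp`, `solve`, and `dot` are hypothetical, atoms are assumed to have unit norm, the selection score is the summed absolute correlation over all signals, and the stopping test compares the total squared residual with $\varepsilon^2$.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve the small linear system A z = b by Gaussian elimination."""
    n = len(b)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def somp(atoms, signals, eps=0.01, max_atoms=None):
    """Return (support, coeffs, residuals) for signals sharing one support.

    atoms: list of dictionary atoms (columns of D); signals: list of vectors.
    coeffs[k][m] is the weight of atom support[m] in signal k.
    """
    max_atoms = max_atoms or len(atoms)
    support = []
    coeffs = [[] for _ in signals]
    residuals = [list(s) for s in signals]
    while len(support) < max_atoms and sum(dot(r, r) for r in residuals) > eps ** 2:
        # Score each unused atom by its summed correlation with all residuals.
        scores = [-1.0 if t in support else sum(abs(dot(a, r)) for r in residuals)
                  for t, a in enumerate(atoms)]
        best = max(range(len(atoms)), key=scores.__getitem__)
        if scores[best] <= 1e-12:
            break  # no remaining atom explains the residuals
        support.append(best)
        chosen = [atoms[t] for t in support]
        gram = [[dot(a, b) for b in chosen] for a in chosen]
        for k, s in enumerate(signals):
            # Least-squares refit on the current support (normal equations)
            coeffs[k] = solve(gram, [dot(a, s) for a in chosen])
            approx = [sum(c * a[i] for c, a in zip(coeffs[k], chosen))
                      for i in range(len(s))]
            residuals[k] = [si - ai for si, ai in zip(s, approx)]
    return support, coeffs, residuals
```

Because all signals share the same support, the two aligned source patches are expressed with the same atoms, which is what makes a coefficient-wise comparison between them meaningful in the fusion step.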

Image Sparse Coding and Fusion
When all image-patch groups are trained, source images can be fused by using the trained sub-dictionaries. The proposed image sparse-representation and fusion method is shown in Figure 2. In the proposed solution, the fused image patch is obtained from the corresponding sub-dictionaries, which is an efficient way to decrease the size of the learned dictionaries. All image patches are clustered into six groups. Any two aligned image patches of the source images belong to at most two groups. Even if two aligned image patches belong to the same group, the two corresponding sub-dictionaries are different. Since each group has only one sub-dictionary, at most two sub-dictionaries are needed to represent all information of two image blocks. The proposed fusion method uses sliding windows with an overlap of six pixels when splitting the source images. Suppose two source images for fusion have already been split into image patches. According to the classification method described in the previous section, all of these image patches are clustered into a few groups. The pair of image patches from the same location of the source images is sparse coded by using a CSD. In accordance with the classified image patch groups, one CSD of two corresponding source image patches consists of at most two sub-dictionaries. The CSD construction algorithm shown in Algorithm 3 combines the corresponding sub-dictionaries of two source image patches to obtain dictionary $D$.
When all of the CSDs are constructed, any pair of source image patches can be sparse coded by using the corresponding CSD and the SOMP algorithm. Assume that there are $K$ registered source images, $I_1, \ldots, I_K$, each of size $M \times N$. The Max-L1 fusion rule takes the following steps.
• Step 1: Use the sliding window technique to divide each source image $I_j$, from top-left to bottom-right, into patches of size 8 × 8, i.e., the size of an atom in the dictionary. These image patches are vectorized in the linewise direction into one-dimensional pixel vectors.
• Step 2: The $i$-th image patch $x_{ji}$ of source image $I_j$ is sparse coded using the trained dictionary $D$.
• Step 3: When all of the image patches are sparse coded, the corresponding image patches of each image are fused using Equation (5), where $z_{ji}$ is the sparse coefficient corresponding to the $i$-th image patch $p_{ji}$ of the $j$-th image.
• Step 4: Fused coefficients are inversely transformed to fused image pixel vectors using Equation (6), and these vectors are transformed back to fused image patches. The fused image is then reconstructed from the fused image patches. The dictionary $D$ in Equation (6) is the same as dictionary $D$ in Algorithm 3.

Algorithm 3 CSD Construction Algorithm.
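Steps 3 and 4 of the Max-L1 rule can be sketched as follows. The sketch assumes the usual reading of the rule, in which the coefficient vector with the largest L1 norm among the aligned patches is kept and the fused pixel vector is the product of dictionary $D$ and the fused coefficients; the function names and the tie-breaking choice (first candidate wins) are hypothetical.

```python
def max_l1_fuse(coeff_vectors):
    """Keep the sparse coefficient vector with the largest L1 norm.

    coeff_vectors holds one coefficient vector per source image for the
    same patch location. Ties go to the first candidate (an assumption).
    """
    return max(coeff_vectors, key=lambda z: sum(abs(v) for v in z))


def reconstruct_patch(atoms, z, patch=8):
    """Compute D z and reshape the pixel vector back into a square patch.

    atoms is the list of dictionary atoms (columns of D), each of length
    patch * patch; z holds one coefficient per atom.
    """
    vec = [sum(z[t] * atoms[t][i] for t in range(len(z)))
           for i in range(patch * patch)]
    # Undo the linewise vectorization: consecutive runs of `patch` pixels
    return [vec[r * patch:(r + 1) * patch] for r in range(patch)]
```

In a full pipeline, the reconstructed patches would then be placed back at their sliding-window positions and overlapping pixels averaged, a step not shown here.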

Experiments and Analyses
To test the efficiency, the proposed image fusion approach is applied to multi-focus, medical, and visible-infrared images, respectively.
• Ten pairs of visible-infrared images are obtained from www.quxiaobo.org, consisting of four 320 × 240 and six 256 × 256 image pairs.

Figure 3a-f show the selected sample pairs of multi-focus, medical, and visible-infrared images, respectively. This paper assumes that the input image pairs are precisely co-aligned. All image pairs are from the standard library. They have the same size. The proposed solution can also be applied to multiple images. One experiment of each image type is chosen and presented in the following sections. To show the efficiency of the proposed method, the state-of-the-art dictionary learning based sparse-representation fusion approaches K-SVD and JCPD, which were proposed by Li et al. in 2012 [45] and Kim et al. in 2016 [36], respectively, are used for comparison. The experiments are evaluated by both subjective and objective assessments. Five popular image fusion quality metrics are used in this paper for the quantitative evaluation. The larger the metric value is, the better the performance is. The patch size of all sparse-representation-based methods, including the proposed method, is set to 8 × 8. To avoid blocking artifacts, all experiments use the sliding window scheme [36,45,46]. The overlapping region of the sliding window is set to six pixels in both the vertical and horizontal directions in all experiments. All experiments are performed using a 2.60 GHz single processor of an Intel(R) Core(TM) i7-4720HQ CPU laptop with 12.00 GB RAM. To compare fusion results fairly, all experiments in this paper are programmed in Matlab code in a Matlab 2014a environment.

Objective Evaluation Methods
Five mainstream objective evaluation metrics are implemented for the quantitative evaluation. These metrics include edge retention ($Q^{AB/F}$) [47], mutual information (MI) [48], visual information fidelity (VIF) [49], the fusion metric proposed by Yang ($Q_Y$) [50,51], and the Chen-Blum metric ($Q_{CB}$) [51,52]. These five metrics are classical choices in multi-focus, multi-modality medical, and infrared-visible image fusion. $Q^{AB/F}$ is an image feature-based metric. MI is an information theory-based metric. $Q_Y$ is an image structural similarity-based metric. $Q_{CB}$ and VIF are human perception inspired fusion metrics. According to objective assessment [51,53,54], these metrics can objectively evaluate the fused image in terms of image features, information, structural similarity, and visual perception. Thus, this paper chooses these metrics. For the fused image, larger values of $Q^{AB/F}$, MI, VIF, $Q_Y$, and $Q_{CB}$ indicate better fusion results.

Mutual Information
MI for images can be formalized as Equation (7), where $L$ is the number of gray levels, $h_{A,F}(i, j)$ is the joint gray-level histogram of images $A$ and $F$, and $h_A(i)$ and $h_F(j)$ are the marginal (edge) histograms of images $A$ and $F$, respectively [48]. The MI of the fused image is calculated by Equation (8), where $MI(A, F)$ represents the MI value of input image $A$ and fused image $F$, and $MI(B, F)$ represents the MI value of input image $B$ and fused image $F$.
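The MI computation can be sketched as follows. The sketch assumes the standard histogram-based definition, in which normalized joint and marginal histograms are compared with a base-2 logarithm, and takes the fused-image score as the sum $MI(A, F) + MI(B, F)$; the function names are hypothetical and images are passed as flat lists of integer gray levels.

```python
import math


def mutual_information(a, f, levels=256):
    """MI (in bits) between two equally sized images given as flat int lists."""
    n = len(a)
    joint = {}
    hist_a = [0] * levels
    hist_f = [0] * levels
    for x, y in zip(a, f):
        joint[(x, y)] = joint.get((x, y), 0) + 1  # joint gray-level histogram
        hist_a[x] += 1                            # marginal histogram of a
        hist_f[y] += 1                            # marginal histogram of f
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (hist_a[x] * hist_f[y]))
    return mi


def fusion_mi(a, b, f, levels=256):
    """Fusion score: MI between each source image and the fused image, summed."""
    return mutual_information(a, f, levels) + mutual_information(b, f, levels)
```

Only gray-level pairs that actually occur are visited, so empty histogram bins never reach the logarithm.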

Q AB/F
The $Q^{AB/F}$ metric is a gradient-based quality index that measures how well the edge information of the source images is transferred to the fused image. It is calculated by Equation (9), where $Q^{AF} = Q^{AF}_g Q^{AF}_0$, and $Q^{AF}_g$ and $Q^{AF}_0$ are the edge strength and orientation preservation values at location $(i, j)$. $Q^{BF}$ is computed similarly to $Q^{AF}$. $w^A(i, j)$ and $w^B(i, j)$ are the weights of $Q^{AF}$ and $Q^{BF}$, respectively.
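For reference, the weighted form of this metric is usually written as below. This follows the common formulation of the gradient-based edge-preservation index and is offered as a sketch consistent with the symbols defined above; the image size $M \times N$ used as the summation range is an assumption.

```latex
\begin{equation}
Q^{AB/F} =
\frac{\sum_{i=1}^{M} \sum_{j=1}^{N}
      \left( Q^{AF}(i, j)\, w^{A}(i, j) + Q^{BF}(i, j)\, w^{B}(i, j) \right)}
     {\sum_{i=1}^{M} \sum_{j=1}^{N}
      \left( w^{A}(i, j) + w^{B}(i, j) \right)}
\end{equation}
```

Because the weights appear in both numerator and denominator, the index is a weighted average of the per-pixel preservation values and stays within their range.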

Visual Information Fidelity
VIF is a full-reference image quality metric. VIF quantifies the mutual information between the reference and test images based on the Natural Scene Statistics (NSS) theory and the Human Visual System (HVS) model. It can be expressed as the ratio between the distorted test image information and the reference image information, and the calculation of VIF is shown in Equation (10), where $I(\vec{C}^N; \vec{E}^N)$ and $I(\vec{C}^N; \vec{F}^N)$ represent the mutual information extracted from a particular subband in the reference and the test images, respectively. Here, a subband refers to a frequency band to which the human eye is sensitive. Thus, this subband is used to evaluate the visual performance objectively [49,55]. $\vec{C}^N$ denotes $N$ elements from a random field, and $\vec{E}^N$ and $\vec{F}^N$ are the visual signals at the output of the HVS model from the reference and the test images, respectively.

To evaluate the VIF of a fused image, an average of the VIF values between each input image and the integrated image is used [49]. The evaluation function of VIF for image fusion is shown in Equation (11), where $VIF(A, F)$ is the VIF value between input image $A$ and fused image $F$, and $VIF(B, F)$ is the VIF value between input image $B$ and fused image $F$.
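Read literally, the averaging described above can be written as follows. The subscript notation $VIF_F$ for the fused-image score and the equal weighting of the two source images are assumptions.

```latex
\begin{equation}
VIF_{F} = \frac{VIF(A, F) + VIF(B, F)}{2}
\end{equation}
```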

Q Y
Yang et al. proposed a structural similarity-based method for fusion assessment [50]. Yang's method is shown in Equation (12), where $\lambda(\omega)$ is the local weight, and $SSIM(A, B)$ is the structural similarity index measure for images $A$ and $B$. The details of $\lambda(\omega)$ and $SSIM(A, B)$ can be found in [50,51].
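For reference, this metric is commonly stated in the piecewise form below, evaluated over a local window $\omega$. It is a sketch of the usual formulation in the fusion-assessment literature rather than a verbatim copy of Equation (12); the switching threshold of 0.75 and the local statistic $s(\cdot \mid \omega)$ (typically the local variance) are assumptions taken from that formulation.

```latex
\begin{equation}
Q_{Y}(\omega) =
\begin{cases}
\lambda(\omega)\, SSIM(A, F \mid \omega)
  + \bigl(1 - \lambda(\omega)\bigr)\, SSIM(B, F \mid \omega),
  & SSIM(A, B \mid \omega) \geq 0.75, \\[4pt]
\max\bigl\{ SSIM(A, F \mid \omega),\; SSIM(B, F \mid \omega) \bigr\},
  & SSIM(A, B \mid \omega) < 0.75,
\end{cases}
\qquad
\lambda(\omega) = \frac{s(A \mid \omega)}{s(A \mid \omega) + s(B \mid \omega)}
\end{equation}
```

The overall score is then the average of $Q_Y(\omega)$ over all windows.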

Q CB
The Chen-Blum metric is a human perception inspired fusion metric. It consists of five steps. The first step is filtering the image $I(i, j)$ in the frequency domain. $I(i, j)$ is transformed to the frequency domain to obtain $I(m, n)$, which is filtered by the contrast sensitivity function (CSF) [52,56] filter $S(r)$, where $r = \sqrt{m^2 + n^2}$. In this image fusion metric, $S(r)$ is in polar form. The filtered image is obtained by $\tilde{I}(m, n) = I(m, n) \times S(r)$.
In the second step, the local contrast is computed. The band-pass filters of a pyramid transform can be obtained as the difference of two neighboring low-pass filters. For the $Q_{CB}$ metric, Peli's contrast $C$ is used in this paper, as defined in Equation (13). A common choice for $\phi_k(i, j)$ is the Gaussian kernel shown in Equation (14), where $k$ and $k + 1$ stand for two neighboring low-pass filters and $\sigma_k = 2^k$. In the third step, the masked contrast map for input image $I_A(i, j)$ is calculated by Equation (15). Here, $t$, $h$, $p$, $q$ and $Z$ are real scalar parameters that determine the shape of the nonlinearity of the masking function [52].
In the fourth step, the saliency map of $I_A(i, j)$ is calculated by Equation (16), and the information preservation value is computed by Equation (17). In the fifth step, the global quality map is calculated by Equation (18). Then, the value of $Q_{CB}$ is obtained by averaging the global quality map, as in Equation (19).

Image Quality
To show the efficiency of the proposed method, a comparison of fused images is provided. The quality of the fused images is compared based on visual effects, the accuracy of focused region detection, and the objective evaluations.

Multi-Focus Comparison
Figure 4a,b are the source multi-focus images. To show the details of the fused image, two image blocks are highlighted and magnified, which are marked by red and blue frames, respectively. The image block in the red frame is out of focus in Figure 4a, and the image block in the blue frame is out of focus in Figure 4b. The corresponding image blocks in blue and red frames are totally focused in Figure 4a,b, respectively. Figure 4c-e show the fused images of K-SVD, JCPD, and the proposed method, respectively. The differences among the fused images of the three methods are difficult to distinguish visually. To evaluate the fusion performance objectively, $Q^{AB/F}$, MI, VIF, $Q_Y$, and $Q_{CB}$ are also used as image fusion quality measures. The fusion results of multi-focus images using the three methods are shown in Table 1. The best results of each evaluation metric are highlighted in bold-face in Table 1. According to Table 1, the proposed method has the best performance in all five evaluation metrics. In particular, for the objective evaluation metric $Q^{AB/F}$, the proposed method obtains higher results than the other two image fusion methods. Since $Q^{AB/F}$ is a gradient-based quality metric that measures how well the edge information of the source images is transferred to the fused image, this means that the proposed method produces a fused image that better preserves edge information.

Medical Comparison
The "brain" images are a pair of PET (Positron Emission Tomography) and MRI (Magnetic Resonance Imaging) images shown in Figure 5a,b, respectively. PET images show brain slices and provide a 3D image of functional processes in the human body. MRI images also show brain slices and contain clear information on soft tissues. K-SVD, JCPD and the proposed method are employed to merge the PET and MRI images into a clear image with both soft tissue and functional process information. The corresponding fusion results are shown in Figure 5c-e, respectively. Figure 5f-k show the enlarged details in the red and green frames of the fused images in Figure 5c-e, respectively. The three fused images of the different approaches have high quality in details, contrast, sharpness, and brightness. Table 2 shows the objective evaluations of the fusion results. Compared with K-SVD and JCPD, the proposed method gets the largest values in all five objective evaluations.

Visible-Infrared Comparison
The proposed solution is used to fuse two sample images of the same downtown street scene. One is a visible image and the other is an infrared image, shown in Figure 6a,b, respectively. In Figure 6a,b, the walking person is marked in the red frame, and the letters in the shade, which are dark, are marked in the blue frame. The fused images of K-SVD, JCPD and the proposed method shown in Figure 6c-e are compared. The enlarged details in the red and blue frames of the fused images in Figure 6c-e are shown in Figure 6f-k, respectively. The walking person and the letters in the shade are clear in all three fused images. The objective evaluations of each visible-infrared image fusion solution are shown in Table 3. Similarly, the proposed solution has the best performance in all five objective evaluations.

Conclusions
This paper proposes a novel sparse-representation based image fusion framework, which integrates geometric dictionary construction. A geometric image patch classification approach is presented to cluster image patches from different source images based on the similarity of image geometric structure. A few compact and informative sub-dictionaries are extracted from each image patch cluster by SCC. The extracted sub-dictionaries are combined into a dictionary for sparse representation. Then, image patches are sparsely coded into coefficients by the trained dictionary. To obtain better edge and corner details of fusion results, the proposed solution also chooses image block size adaptively and selects optimal coefficients during the image fusion process. The sparsely coded coefficients are fused by the Max-L1 rule and inverted to the fused image. The proposed method is compared with existing mainstream sparse-representation based methods in three scenarios: multi-focus, medical, and visible-infrared image fusion. The experimental results show that the proposed method has the best performance in all three image scenarios. This indicates that using the geometric information of the source images can not only reduce the size of the learned dictionary but also yield a high-quality fused image. Future work will explore more details of geometric information to enhance fusion performance. Denoising, inpainting, and other image processing techniques will be integrated into the current solution.

Figure 1. Sub-dictionaries training for different groups of image patches.

Figure 3. Selected sample pairs of multi-focus, medical, and visible-infrared images; (a,b) are sample multi-focus image pairs; (c,d) are sample medical image pairs; (e,f) are sample visible-infrared images.

Figure 4. Fusion results of multi-focus image of 'Love Card and Hong-Kong'; (a,b) are source images, (c-e) are fused images of K-means generalized singular value decomposition, joint clustering patches dictionary and the proposed method, (f-h) are difference images between (a) and fused images (c-e), (i-k) are difference images between (b) and fused images (c-e).

Figure 5. Fusion results of the medical image of the "Brain"; (a,b) are source images, (c-e) are fused images of K-SVD, JCPD and the proposed method, (f-k) are enlarged details in red and green frames of fused images (c-e).

Figure 6. Fusion results of visible-infrared images of "Downtown Street Scenes"; (a,b) are source images, (c-e) are fused images of K-SVD, JCPD and the proposed method, (f-k) are enlarged details in red and blue frames of fused images (c-e).

Table 1. Fusion performance comparison of multi-focus image pairs.

Table 2. Fusion performance comparison of medical image pairs.

Table 3. Fusion performance comparison of visible-infrared image pairs.