A New Dictionary Construction Based Multimodal Medical Image Fusion Framework

Training a good dictionary is the key to a successful sparse representation based image fusion method. In this paper, we propose a novel dictionary learning scheme for medical image fusion. First, we reinforce the weak information of the images by extracting their multi-layer details and adding them back to generate informative patches. Meanwhile, we introduce a simple and effective multi-scale sampling to implement a multi-scale representation of the patches while reducing the computational cost. Second, we design a neighborhood energy metric and a multi-scale spatial frequency metric for clustering the image patches with similar brightness and detail information into their respective patch groups. Then, we train the energy sub-dictionary and the detail sub-dictionary, respectively, by K-SVD. Finally, we combine the sub-dictionaries to construct a final, complete, compact and informative dictionary. As a main contribution, the proposed online dictionary learning can not only obtain an informative as well as compact dictionary, but can also address the defects of traditional dictionary learning algorithms, such as superfluous patches and low computational efficiency. The experimental results show that our algorithm is superior to some state-of-the-art dictionary learning based techniques in both subjective visual effects and objective evaluation criteria.


Introduction
In recent decades, various medical imaging techniques have emerged and numerous medical images have become available. Medical images of different modalities can each offer only a limited characterization of human organs due to their diverse imaging mechanisms [1]. Medical image fusion technology can integrate the complementary information of multiple single-modal medical images acquired by different imaging sensors and provide a more precise, reliable, and complete description of lesions [2]. This technology has found applications in real products such as medical assistance systems, as it can provide doctors with a comprehensive description of diseased tissues, helping them develop appropriate treatment plans and improve diagnostic accuracy.
In past decades, various research efforts have been made to develop effective medical image fusion techniques [3]. In general, according to the characteristics of medical images, multi-scale transform (MST) based image fusion methods have been the most popular trend due to their simplicity, feasibility, and effectiveness in implementation [4][5][6]. A number of typical MSTs have been developed, for example: the Stationary Wavelet Transform (SWT) [7], Dual-Tree Complex Wavelet Transform (DTCWT) [8], Curvelet Transform (CVT) [9], Contourlet Transform (CT) [10], and Nonsubsampled Contourlet Transform (NSCT) [11,12]. The MST-based methods first decompose the source image into a series of sub-images expressing the base and detail information at different levels and directions, and then fuse the low-pass and high-pass coefficients according to the designed fusion rules, respectively. Finally, the fusion result is obtained by performing the inverse MST. Although the MST-based methods can effectively transfer some important features from the source images into the fused image, they cannot adaptively express the content of the image, which heavily limits their fusion performance; meanwhile, each MST tool has its own merits and limitations depending on the context of the input images [3]. Furthermore, some useful information may inevitably be lost in the process of decomposition and reconstruction, producing suboptimal results. Thus, choosing an optimal MST is not an obvious and trivial task, as it relies on scene contexts and applications.
In order to address the deficiencies of MST based methods and obtain encouraging fusion performance, a plethora of carefully designed medical image fusion algorithms have been proposed, such as neural network (NN) based methods [13][14][15], principal component analysis (PCA) based techniques [16], and mathematical morphology (MM) based methods [17]. However, NN based methods rely heavily on a large number of manually set parameters, which is not conducive to an adaptive fusion process. PCA based techniques are prone to spectral distortion. Although MM based algorithms play an important role in image fusion, some of the details of the source images may be smoothed in the final fusion result, which affects the fusion performance and hinders their application.
Sparse representation (SR) theory can describe and reconstruct images in a sparse and efficient way by a linear combination of sparse coefficients and an overcomplete dictionary [18]. In recent years, SR theory has been rapidly developed and successfully applied in many image processing applications, such as image super-resolution reconstruction [19,20], image feature extraction [21], image denoising [22,23], pedestrian re-identification [24,25], image classification [26], and many other fields. At the same time, it has been successfully applied to many image fusion tasks and achieved satisfactory results [27][28][29]. In SR, the overcomplete dictionary is of great significance for medical image representation, and it plays a key role in the quality of the fused image. Yang and Li [30] took the first step in utilizing SR theory in the field of multi-focus image fusion, using DCT as a fixed dictionary. Subsequently, they proposed an SR method based on Simultaneous Orthogonal Matching Pursuit (SOMP) and successfully applied it to multimodal image fusion [31]. Jiang et al. [17] regarded the image as a superposition of two different components, cartoon and texture; they employed Curvelet and DCT bases as two different dictionaries to express the respective information and proposed an image fusion method based on SR and morphological component analysis. However, because the dictionaries produced by these methods are constructed from fixed basis functions with poor adaptability, they are not an effective way to describe the complex structural information of medical images [18].
Relative to analytical fixed dictionaries, dictionary learning based methods use a small number of atoms from a trained dictionary instead of a predefined one [27], which can produce state-of-the-art results in many image processing and recognition tasks [2]. Learning an informative and compact overcomplete dictionary from the training sample images can enhance the adaptability of the dictionary and accurately express the medical image information. In order to produce better fusion quality, many dictionary learning based fusion algorithms have recently been proposed [15]. Meanwhile, some SR based methods that train the overcomplete dictionary directly on the source images can generate promising fusion results [31]. However, many methods suffer from a general problem: taking all of the patches for dictionary learning unavoidably introduces a great deal of valueless and redundant information during dictionary training and decreases the fusion performance in medical image fusion. To address this shortcoming, Zhu et al. [32] developed a local density peak clustering algorithm to refine the process of patch selection, and then established a compact dictionary by the K-SVD algorithm. Kim et al. [33] employed a joint clustering technique to classify image patches according to their similar structures; the overcomplete dictionary can then be obtained by combining their principal components. In order to capture the intrinsic features of the image and preserve the hierarchical structure of stationary wavelets, Yin [34] designed a joint dictionary learning method based on all base subbands of the stationary wavelet transform. Liu et al. [35] proposed an adaptive SR method for simultaneously fusing and denoising images; in this method, they classified a large number of high-quality image patches into several categories according to gradient information, and then trained a sub-dictionary for each category. Qi et al. [36] proposed an entropy based image fusion method.
In this method, the source images were decomposed into low-frequency and high-frequency images, respectively; a weighted average scheme was then utilized to fuse the low-frequency part, and an entropy based dictionary learning technique was introduced to fuse the high-frequency part. However, the average scheme for the low-frequency component inevitably loses some energy of the input image, which decreases useful brightness information in the fused results. Although, compared with some traditional medical image fusion techniques, these methods [32,33] can represent medical image features effectively and completely, the fusion performance still has much room for improvement.
The purpose of medical image fusion is to retain as much useful information from the source images in the fused result as possible [37]. To address the shortcomings of low learning efficiency and weak dictionary expression ability in traditional dictionary learning based algorithms, this paper proposes a novel dictionary learning method based on brightness and detail clustering for medical image fusion. Our method consists of three steps: enhancement and multi-scale sampling of the images, patch feature classification and dictionary construction, and image fusion. Firstly, we enhance the details of the source images by a multi-layer filtering technology, which can significantly improve the information expression ability of the dictionary. In the meantime, we conduct a multi-scale sampling scheme on the enhanced images. This operation can significantly increase the richness of the patches while decreasing the computational load in the dictionary learning stage. Secondly, in order to obtain a fused result with abundant brightness and detail features, i.e., a successful expression of the underlying visual features of the medical image, we develop two feature clustering criteria to classify all patches into two categories, a brightness group and a details group, and construct the energy sub-dictionary and the detail sub-dictionary by the K-SVD algorithm. The final compact and informative dictionary is built directly from these sub-dictionaries. Finally, the sparse coefficients are calculated by the orthogonal matching pursuit (OMP) algorithm [38] and the fused image is reconstructed from them. The main contributions of this paper can be elaborated as follows:

1.
We conduct multi-level neighbor distance filtering to enhance the information and take multi-scale sampling to realize the multi-scale expression of images, which can make image patches more informative and flexible, while not increasing the computational complexity in the training stage.

2.
Based on the characteristics of the human visual system in processing medical images, we develop a novel neighborhood energy metric and a multi-scale spatial frequency metric to cluster brightness and detail patches, and then train the brightness sub-dictionary and the detail sub-dictionary, respectively.

3.
A feature-discriminative dictionary is constructed by combining the two sub-dictionaries. The final dictionary contains important brightness and detail information, which can effectively describe the useful features of medical images.
The rest of the paper is organized as follows: the proposed dictionary learning method for medical image fusion is described in detail in Section 2, which reviews the basic theory of SR and presents the dictionary learning and image fusion subsections. The experiments and results analysis are presented in Section 3. Conclusions and discussion are summarized in Section 4.

Proposed Framework
In this section, we present the proposed technique in detail. There are three sub-sections in this part, covering the theory of SR, dictionary learning, and image fusion. In the dictionary learning step, we propose two techniques to generate informative training images. Then, the image patches are clustered into brightness and detail groups based on the neighborhood energy and the multi-scale spatial frequency, respectively. The K-SVD algorithm is employed to construct a brightness-based sub-dictionary and a detail-based sub-dictionary, and the final dictionary consists of the two sub-dictionaries. In the image fusion step, an SR model is established and the sparse coefficients are calculated by OMP, from which the final fused image is generated.

Sparse Representation
For a signal y = (y_1, y_2, …, y_n)^T, y ∈ R^n, the basic assumption in SR theory is that y can be approximately represented as a linear combination of a set of base signals {d_i} (i = 1, …, m) from a redundant dictionary D ∈ R^{n×m} (n < m). The signal y can be expressed as

y = Dα = Σ_{i=1}^{m} α_i d_i, (1)

where α = (α_1, α_2, …, α_m)^T is the unknown sparse coefficient vector and d_i is an atom of D. Given a signal y and an overcomplete dictionary D, the process of finding the sparse coefficient α is called sparse representation. The goal of SR is to calculate the sparsest α, i.e., the one containing the fewest non-zero entries. However, this is an underdetermined problem due to the overcompleteness of D. Generally, the sparsest α can be obtained by solving the following sparse model:

α = argmin ||α||_0 s.t. ||y − Dα||_2 ≤ ε, (2)

where ||·||_0 denotes the l_0 norm that counts the number of non-zero entries and ε > 0 is an error tolerance. Equation (2) is a non-deterministic polynomial-hard (NP-hard) problem, which can be solved by the greedy approximation approach OMP [38].
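For illustration, Equation (2) can be solved greedily; the following minimal NumPy sketch (the helper name `omp` and its `eps`/`max_atoms` parameters are illustrative, not from the paper) selects atoms until the residual norm falls below the tolerance ε:

```python
import numpy as np

def omp(D, y, eps=1e-6, max_atoms=None):
    # Orthogonal matching pursuit: greedily pick the atom most correlated
    # with the residual, re-fit coefficients by least squares, repeat until
    # the residual norm drops below eps (the error tolerance in Eq. (2)).
    n, m = D.shape
    max_atoms = max_atoms or n
    residual, idx = y.astype(float).copy(), []
    coef = np.zeros(0)
    while len(idx) < max_atoms and np.linalg.norm(residual) > eps:
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    alpha = np.zeros(m)
    alpha[idx] = coef
    return alpha
```

For a signal that is an exact combination of a few atoms, this recovers a representation with correspondingly few non-zero entries.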

Proposed Dictionary Learning Approach
For SR based medical image fusion, the information expression ability of the dictionary has a direct and great impact on the fusion results. Numerous studies [32,33,35] indicate that a dictionary obtained by the traditional training methods cannot generate impressive fusion results because of its weak expression ability. In order to construct an informative as well as compact dictionary, we propose the new dictionary learning method illustrated in Figure 1. In the first stage, two effective techniques, detail enhancement and multi-scale sampling, are developed to improve the quality of the image patches. In the second stage, a neighborhood energy criterion is introduced to cluster the brightness patches and a multi-scale spatial frequency criterion is proposed to cluster the edge detail patches, generating two high-quality training sets. In the last stage, the brightness sub-dictionary and the detail sub-dictionary are constructed by training the two categories of patches using K-SVD, and the overcomplete dictionary is then obtained by combining the sub-dictionaries. These three stages are analyzed in detail in the following sub-sections.

Detail Enhancement
We employ online learning to obtain the overcomplete dictionary. The advantage is that it directly utilizes useful information from the source images, which enhances the expressive ability of the dictionary. However, the usual way of dividing the image into a series of overlapping patches operates directly on the pre-training images, which leads to an inconspicuous problem: when the details of certain regions in the source image are relatively weak, the important information in those regions cannot be well expressed during dictionary training. To address this problem, we develop the detail enhancement technique shown in Figure 2. The high-pass detail information at different levels is first extracted by simple multi-level filtering and then superimposed onto the source images, which can be mathematically expressed as:

X̃_l = X_l + Σ_p H_{X_l, p},

where X_l (l = 1, 2, …) denotes the l-th source image (if there are only two source images in the medical image fusion procedure, l = 1, 2), X̃_l is the detail-enhanced version of X_l, and H_{X_l, p} is the p-th level high-pass detail image of X_l, extracted by the Neighbor Distance (ND) filter. Some studies [39,40] demonstrated that ND can effectively extract and express the high-frequency information of an image, so we use ND to extract the details in this step; a more detailed introduction to ND can be found in [40].
In addition, it should be noted that we enhance the source images only when training the overcomplete dictionary; the source images themselves are not enhanced during the fusion process.
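As a rough illustration of the enhancement step, the following sketch substitutes a simple box-blur high-pass for the ND filter (which it does not implement); `box_blur`, `detail_enhance`, and the window sizes are hypothetical names and values, not the paper's:

```python
import numpy as np

def box_blur(img, k=3):
    # Simple k×k box filter with edge padding, used as a stand-in smoother.
    pad = k // 2
    p = np.pad(img.astype(float), pad, mode='edge')
    out = np.zeros(img.shape, dtype=float)
    for di in range(k):
        for dj in range(k):
            out += p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out / (k * k)

def detail_enhance(img, levels=(3, 5, 7)):
    # Extract multi-level high-pass details (image minus its blur at each
    # level) and superimpose them back onto the image.
    x = img.astype(float)
    return x + sum(x - box_blur(x, k) for k in levels)
```

A flat region is left unchanged (its high-pass details are zero), while contrast around edges is amplified, which is the intended effect of reinforcing weak details before patch extraction.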

Multi-Scale Sampling (MSS)
For dictionary learning, the traditional way of constructing a training set is to first divide the images into a series of overlapping patches of a fixed size and then pull each patch into a vector. However, a fixed patch size usually cannot comprehensively describe the image information, especially small target information in medical images. Generally, a large patch size improves robustness and can capture the shape and location of an object, but its description of fine details is insufficient, and vice versa. Therefore, choosing suitable patch sizes is significant for improving the informativeness of the training set. To integrate the advantages of patches of different sizes, we present a Multi-scale Sampling (MSS) scheme to obtain a stronger training set. MSS not only improves the multi-scale properties of the training set but also reduces the dictionary training time. MSS is defined as

X_{d,l}(i, j) = X̃_l(d·i, d·j),

where X_{d,l} is the sampled version of X̃_l with downsampling rate d (d = 1, 2, …); obviously, X̃_l = X_{1,l}. As can be seen from Figure 3, a series of images of various smaller sizes can be generated from the source.
To facilitate the explanation, the white circular area in each of the X_{d,l} images in Figure 3 is marked by a same-sized red rectangle. We can notice that the contour and details of the same region differ significantly across images of different sizes. By performing MSS, the useful information can be comprehensively utilized from coarse to fine across scales, and informative image patches can then be obtained by division. As can be seen from Figure 2, the process of detail enhancement significantly strengthens weak information such as boundaries, brightness, and texture in the source images. For the convenience of comparison, the same region of the source image and the enhanced image is selected by a red rectangle, enlarged, and placed at the lower right of each. The enhanced image highlights inconspicuous but significant information and expresses it in the patches, which is beneficial to constructing a strong training set.

More importantly, MSS can increase the computational efficiency due to the downsampling operation. For example, as illustrated in Figure 3, the size of the input image X̃_l is 256 × 256; when the downsampling rate d is set to 2, 3, 4 and 5, respectively, we can generate X_{2,l}, X_{3,l}, X_{4,l} and X_{5,l}, each of which is smaller than X̃_l. Using several images processed by MSS with different sampling rates instead of the source images to construct the training set does not decrease the dictionary quality, while it reduces the dictionary learning time. For a quantitative analysis, as shown in Table 1, we calculate the number of patches produced from the upper left to the lower right with a patch size of 8 × 8 and a step length of two pixels (an overlap of six pixels). As can be seen from Table 1, the total number of patches obtained from the four images {X_{d,l}} (d = 2, 3, 4, 5) is much smaller than that obtained from X̃_l, which means that MSS can significantly reduce the training data and thus the computational complexity of dictionary learning.

Sum of Neighborhood Energy (SNE)
Brightness information and detail information are two important manifestations of medical image features. Thus, transferring the brightness and detail features of the source images into the fused image is necessary to produce encouraging fusion results. Considering the energy correlation between pixels in medical images, the sum of neighborhood energy (SNE) is developed to evaluate the brightness of the central pixel, as follows:

SNE_{X_{d,l}}(i, j) = Σ_{(m,n)∈Ω} [X_{d,l}(i + m, j + n)]²,

where Ω is an M × N neighborhood centered at (i, j).
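The MSS operation and the patch-count saving discussed above can be sketched as follows (hypothetical helper names `mss` and `count_patches`, assuming NumPy):

```python
import numpy as np

def mss(image, rates=(2, 3, 4, 5)):
    # Multi-scale sampling: keep every d-th pixel along both axes.
    return {d: image[::d, ::d] for d in rates}

def count_patches(shape, patch=8, step=2):
    # Number of sliding-window positions for patch×patch windows
    # moved with the given step (overlap = patch - step).
    h, w = shape
    return ((h - patch) // step + 1) * ((w - patch) // step + 1)
```

For a 256 × 256 image, 8 × 8 patches with step two give 15,625 positions, while the four downsampled images at rates 2 through 5 together yield well under half that many, consistent with the reduction in training data reported in Table 1.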


Multi-Scale Spatial Frequency (MSF)
In medical images, the spatial frequency (SF) of a centered pixel reflects the difference between that pixel and its surrounding pixels. Thus, the SF of an image can be used to express the richness of the image's detail information. Considering the correlation between pixels, a neighborhood-based SF is usually employed, which is defined as:

SF_{X_{d,l}}(i, j) = sqrt(RF(i, j)² + CF(i, j)²), (7)

where SF_{X_{d,l}}(i, j) is the value of SF of X_{d,l} centered at (i, j), RF and CF are the row and column frequencies (the root-mean-square horizontal and vertical gray-level differences) computed over the neighborhood, and M × N is the size of the neighborhood. In practical applications, to reduce the computational complexity, the size of the neighborhood is usually set to M = N. However, the technique in Equation (7) is sensitive to the neighborhood size, especially for dark and low-contrast regions. To address this problem, we propose a novel detail detection technique called multi-scale spatial frequency (MSF). The MSF of an image is mathematically defined as

MSF_{X_{d,l}}(i, j) = SF_{r_1}(i, j) + ε_1 · SF_{r_2}(i, j) + ε_2 · SF_{r_3}(i, j), (8)

where r (r = r_1, r_2, r_3) is the scale factor, r × r denotes the size of the neighborhood, SF_r is the spatial frequency computed on the r × r neighborhood, and ε_1 and ε_2 are weighting parameters given by the user.
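The SNE and MSF metrics can be sketched as follows (a minimal NumPy illustration; the function names, window sizes, and MSF weights are hypothetical placeholders, not the paper's tuned values):

```python
import numpy as np

def sne(img, i, j, m=3):
    # Sum of neighborhood energy: squared intensities in an m×m window.
    h = m // 2
    win = img[i - h:i + h + 1, j - h:j + h + 1].astype(float)
    return float((win ** 2).sum())

def sf(win):
    # Spatial frequency of a window: sqrt(RF^2 + CF^2), where RF/CF are
    # RMS horizontal/vertical gray-level differences.
    w = win.astype(float)
    rf = np.sqrt(np.mean((w[:, 1:] - w[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((w[1:, :] - w[:-1, :]) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))

def msf(img, i, j, scales=(3, 5, 7), weights=(1.0, 0.5, 0.25)):
    # Multi-scale spatial frequency: weighted sum of SF over window sizes.
    total = 0.0
    for r, wgt in zip(scales, weights):
        h = r // 2
        total += wgt * sf(img[i - h:i + h + 1, j - h:j + h + 1])
    return total
```

A flat region has zero MSF regardless of the window size, while a textured region scores high, which is the behavior the clustering step relies on.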

Clustering and Dictionary Learning
We classify all patches into two groups by taking SNE and MSF as the feature metrics for clustering. The patch clustering and dictionary learning consist of the following five steps.

• Step 1: Patches Collection
Adopt the sliding-window technique to divide each X_{d,l} into a suite of overlapping image patches of size ρ × ρ from upper left to lower right, with a step length of δ pixels between two adjacent patches. The patch extraction result is:

T = ∪_{d,l} {X¹_{d,l}, X²_{d,l}, …, X^K_{d,l}},

where ∪ denotes merging all of the patches, X^k_{d,l} is the k-th patch of X_{d,l}, and K is the total number of image patches.
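Step 1 can be sketched as follows (hypothetical `extract_patches` helper, assuming NumPy, with ρ = 8 and δ = 2 as default values):

```python
import numpy as np

def extract_patches(img, rho=8, delta=2):
    # Slide a rho×rho window with step delta from the upper left to the
    # lower right, vectorizing each patch into a column of length rho*rho.
    h, w = img.shape
    return [img[i:i + rho, j:j + rho].reshape(-1)
            for i in range(0, h - rho + 1, delta)
            for j in range(0, w - rho + 1, delta)]
```

On a 16 × 16 image this yields 5 × 5 = 25 overlapping patches, each a 64-element vector.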
• Step 2: Features calculation
For each patch X^k_{d,l}, calculate its brightness activity E^k_{d,l} by the SNE metric and its detail activity C^k_{d,l} by the MSF metric, where C^k_{d,l} can be generated by Equation (8).

• Step 3: Training sets construction
Construct the brightness training set by selecting, for each patch position k, the patch whose SNE value is the largest:

T_e = ∪_k T^k, T^k = X^k_{d*,l*} with (d*, l*) = argmax_{d,l} E^k_{d,l}, (13)

where T_e is the brightness-dominant training set, which is composed of the image patches with the strongest brightness feature, and T^k denotes that the k-th patch with the largest SNE value among {E^k_{d,1}, E^k_{d,2}, …} is selected. Similarly, construct a detail training set by choosing the corresponding image patches containing the richest detail activity level by the following rule:

W_c = ∪_k W^k, W^k = X^k_{d*,l*} with (d*, l*) = argmax_{d,l} C^k_{d,l}, (15)

where W_c is the detail-dominant training set, which is composed of the image patches with the strongest detail feature, and W^k means that the k-th patch with the largest MSF value among {C^k_{d,1}, C^k_{d,2}, …} is selected.
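The max-metric selection of the two training sets can be sketched as follows (a simplified illustration with hypothetical names; the metric functions are passed in as callables, and the disjointness constraint between the two sets is omitted here):

```python
import numpy as np

def build_training_sets(patch_groups, sne_fn, msf_fn):
    # patch_groups: for each patch position, the list of candidate patches
    # (one per image/scale). Returns (brightness set, detail set) by
    # keeping, per position, the patch with the largest metric value.
    T_e, W_c = [], []
    for candidates in patch_groups:
        energies = [sne_fn(p) for p in candidates]
        details = [msf_fn(p) for p in candidates]
        T_e.append(candidates[int(np.argmax(energies))])
        W_c.append(candidates[int(np.argmax(details))])
    return T_e, W_c
```

With a squared-intensity energy metric and a mean-absolute-difference detail metric, a bright flat patch wins the brightness slot and a textured patch wins the detail slot.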

• Step 4: Sub-dictionaries learning
The brightness sub-dictionary D_e can be generated by solving the objective function in Equation (17) using K-SVD [41]:

min_{D_e, α_e} ||T_e − D_e α_e||²_F s.t. ∀k, ||α^k_e||_0 ≤ T_0, (17)

where D_e and α_e are the brightness sub-dictionary and its sparse coefficient, respectively, and T_0 is the sparsity constraint. To avoid the case that a same patch is used in the brightness set as well as the detail set, we add the constraint T_e ∩ W_c = ∅. The detail sub-dictionary D_c is obtained by performing K-SVD on the detail training set W_c as follows:

min_{D_c, α_c} ||W_c − D_c α_c||²_F s.t. ∀k, ||α^k_c||_0 ≤ T_0, (19)

where D_c and α_c are the detail sub-dictionary and its sparse coefficient, respectively.
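K-SVD itself [41] can be sketched compactly in NumPy (an unoptimized illustration that pairs a greedy OMP coder with the rank-one SVD atom update; all names and default values are illustrative):

```python
import numpy as np

def omp_code(D, y, sparsity):
    # Greedy sparse coding: pick up to `sparsity` atoms for signal y.
    residual, idx = y.astype(float).copy(), []
    coef = np.zeros(0)
    for _ in range(sparsity):
        if np.linalg.norm(residual) < 1e-10:
            break
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def ksvd(Y, n_atoms, sparsity, n_iter=10, seed=0):
    # Y: columns are training vectors. Alternate sparse coding with
    # per-atom SVD updates on the residual restricted to its users.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    X = np.zeros((n_atoms, Y.shape[1]))
    for _ in range(n_iter):
        X = np.column_stack([omp_code(D, y, sparsity) for y in Y.T])
        for j in range(n_atoms):
            users = np.nonzero(X[j])[0]
            if users.size == 0:
                continue
            # Residual excluding atom j, over the signals that use it.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]
            X[j, users] = s[0] * Vt[0]
    return D, X
```

Running more iterations does not increase the reconstruction error on exactly sparse synthetic data, which is the behavior the training stage relies on.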

• Step 5: Construct the final dictionary
The final informative, compact and discriminative overcomplete dictionary is obtained by the combination of the sub-dictionaries D_e and D_c:

D = [D_e, D_c]. (20)

The schematic diagram of the proposed dictionary learning is illustrated in Figure 1. Moreover, the formalized mechanism of the dictionary learning algorithm is described in Algorithm 1. For simplicity, we take two source images as an example; the method can easily be extended to the case of more than two source images.

Algorithm 1 The proposed dictionary learning algorithm
Inputs: Two groups of images {X_{d,l}} (l = 1, 2)
(1) Extract patches of {X_{d,1}} and {X_{d,2}} from upper left to lower right.
(2) Calculate the SNE and MSF values of each patch.
(3) Construct the two training sets T_e and W_c by Equations (13) and (15), respectively.
(4) Obtain the two sub-dictionaries D_e and D_c by solving Equations (17) and (19), respectively.
(5) Generate the final dictionary D by Equation (20).
Output: The overcomplete dictionary D

Image Fusion
The whole image fusion schematic diagram is shown in Figure 5. In the sparse coding process, each source image X_l is first divided into a series of image patches of size ρ × ρ from upper left to lower right with an overlapping step length of δ pixels, and the patches are then rearranged into column vectors X^k_l (k = 1, 2, …, K; l = 1, 2, …, n), where K and n are the numbers of patches and source images, respectively, and X^k_l denotes the column vector of the k-th patch of image X_l. The sparse coefficient vectors {α^k_1, α^k_2, …, α^k_n} of the source images are calculated by the OMP algorithm, which solves the underdetermined problem

α^k_l = argmin ||α||_0 s.t. ||X^k_l − Dα||_2 ≤ ε

for each of {X^k_1, X^k_2, …, X^k_n}. The "max-absolute choosing" rule is then employed to obtain the fused sparse vector α^k_F, i.e., the sparse vector with the largest activity level among {α^k_1, …, α^k_n} is selected. The column vector of the k-th patch of the fused image can be generated by combining D and α^k_F:

F^k = D α^k_F.

Reshape all of the column vectors F^k into image patches and plug each patch into its corresponding original position (averaging the overlapping pixels); the final fused image F is thus obtained.
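The per-patch sparse coding and fusion rule can be sketched as follows (hypothetical names; the activity level is taken here as the l_1-norm of the sparse vector, one common reading of the max-absolute rule, and the greedy coder is a minimal stand-in for OMP):

```python
import numpy as np

def omp(D, y, k):
    # Minimal greedy OMP: pick up to k atoms for signal y.
    residual, idx = y.astype(float).copy(), []
    coef = np.zeros(0)
    for _ in range(k):
        if np.linalg.norm(residual) < 1e-10:
            break
        idx.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        residual = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x

def fuse_patch(ya, yb, D, k=4):
    # Code both patches over the shared dictionary, keep the sparse
    # vector with the larger l1 activity, and reconstruct it.
    aa, ab = omp(D, ya, k), omp(D, yb, k)
    af = aa if np.abs(aa).sum() >= np.abs(ab).sum() else ab
    return D @ af
```

With an identity dictionary and two vectorized patches, the patch with the higher activity level survives into the fused result.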

Experiments and Analysis
In this section, we analyze and verify the superiority of the proposed medical image fusion scheme. We first illustrate the detailed experimental setup, including the test methods and parameter settings, in Section 3.1. Then, the test images used in the experiments are displayed and briefly introduced in Section 3.2. We evaluate the fusion results of the different methods based on their visual performance and quantitative assessment in Sections 3.3 and 3.4. The computational efficiency analysis is discussed in Section 3.5. Finally, we extend our method to other types of image fusion applications in Section 3.6.

Test methods and Parameters Setting
In order to verify the effectiveness of the proposed method, six representative medical image fusion algorithms are selected for comparison with the proposed algorithm. The first three compared methods are MST based: DTCWT, CVT and NSCT. The others are three state-of-the-art medical image fusion methods: the adaptive SR (Liu-ASR) proposed by Liu et al. [35], the Kim [33] method, which is based on clustering and principal component analysis, and the K-SVD based fusion technique developed by Zhu et al. [32]. The latter three are carefully designed algorithms that mainly focus on novel ways of training an informative overcomplete dictionary, and they represent the leading fusion performance in the medical image fusion field in recent years.
For the parameter settings, the first layer of DTCWT uses the 'LeGall 5-3' filter, and the other layers employ the 'Qshift-06' filter. The NSCT method employs the 'pyrexc' filter as the pyramid filter and the 'vk' filter as the directional filter, with a four-layer decomposition whose numbers of directions from coarse to fine are 4, 8, 8, and 16, respectively. These settings can generate the best fusion performance [3]. All the MST based methods fuse the high-frequency and low-frequency coefficients by the "max-absolute" and "weighted averaging" schemes, respectively. Furthermore, a 3×3 consistency verification check [42] is adopted when fusing the high-frequency coefficients. The default parameters given by the respective authors are used for the Liu-ASR, Zhu and Kim methods.
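As a sketch of the two MST fusion rules just mentioned, the snippet below assumes the common formulation: low-frequency bands are averaged, while high-frequency bands are fused by max-absolute selection whose binary decision map is then cleaned by a 3×3 majority vote, one usual form of the consistency verification of [42]; the exact filter used there may differ.

```python
import numpy as np

def fuse_low(c1, c2):
    """Low-frequency coefficients: simple weighted averaging (equal weights)."""
    return 0.5 * (c1 + c2)

def fuse_high(c1, c2):
    """High-frequency coefficients: max-absolute selection followed by a
    3x3 consistency verification (majority vote over the decision map)."""
    choose1 = np.abs(c1) >= np.abs(c2)
    h, w = choose1.shape
    padded = np.pad(choose1.astype(int), 1, mode="edge")
    # sum of the 9 shifted views = number of "pick c1" votes in each 3x3 window
    votes = sum(padded[di:di + h, dj:dj + w]
                for di in range(3) for dj in range(3))
    verified = votes >= 5                  # majority of the 9 neighbors
    return np.where(verified, c1, c2)
```

The consistency check prevents isolated coefficients from flipping between sources, which is what introduces the "salt-and-pepper" artifacts that plain max-absolute selection can produce.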
For the proposed method, during the dictionary training stage, the patch size is conventionally set to 8×8 with an overlap of 6 pixels. For the detail enhancement, the number of levels P is set to 4 in Equation (3). For the MSS, the downsampling rates are set to 3, 4, 5 and 6, respectively, in Equation (5). As a consequence, the proposed dictionary learning based method can generate high fusion performance while keeping a low computational cost. Meanwhile, due to the strong flexibility of the multi-scale property in MSS, the MSS scheme can also handle small source images well. The size of the neighborhood of the SNE in Equation (6) is set to 11 × 11. For MSF, the three scale factors r 1 , r 2 and r 3 are set to 7, 11 and 15, and the weight coefficients ε 1 and ε 2 are set to 0.33 and 0.67 in Equation (8), respectively. In the sparse representation stage, the reconstruction error is 0.1, and the number of iterations is 50. These parameter settings generate the best fusion results for the experimental images in this paper.
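Since Equations (6) and (8) are not reproduced in this section, the following is only a plausible reading of the two clustering metrics: the neighborhood energy as a windowed sum of squared intensities, and the multi-scale spatial frequency as classical spatial frequency evaluated in windows of the three scale factors and blended with the two weight coefficients. The window handling and the exact weighting scheme are assumptions.

```python
import numpy as np

def spatial_frequency(patch):
    """Classical spatial frequency: row and column gradient energy."""
    rf = np.sqrt(np.mean(np.diff(patch, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(patch, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

def neighborhood_energy(img, i, j, size=11):
    """Sum of squared intensities in a size x size window centred at
    (i, j) -- one plausible form of a neighborhood energy metric."""
    r = size // 2
    win = img[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]
    return float(np.sum(win ** 2))

def multiscale_sf(img, i, j, scales=(7, 11, 15), w=(0.33, 0.67)):
    """Hypothetical multi-scale spatial frequency: SF in windows of the
    three scale factors, blended with the two weight coefficients."""
    sf = []
    for s in scales:
        r = s // 2
        sf.append(spatial_frequency(
            img[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]))
    # weight the finest scale against the average of the coarser ones
    return w[0] * sf[0] + w[1] * 0.5 * (sf[1] + sf[2])
```

Under this reading, smooth bright regions score high on the energy metric and low on the frequency metric, and textured regions do the opposite, which is what allows the patches to be separated into brightness and detail groups.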

Test Images
For pixel level multi-modality medical image fusion, most test images can be obtained at http://www.imagefusion.org and http://www.med.harvard.edu/aanlib/home.html. To demonstrate the effectiveness of the proposed method, we utilize four pairs of real multimodal medical images, shown in Figure 6, to evaluate the algorithm in our experiments. The size of all of them is 256 × 256, and all of them are perfectly registered, which means that the objects in each set of images are geometrically aligned. These medical images include MRI, CT, MR-T1, MR-T2, SPECT and PET; the characteristics of each of them are summarized as follows.

1. CT has a shorter imaging time and a higher spatial resolution, but it provides soft tissue information with low contrast.

2. MRI can clearly display the soft tissue information of the human body, but it is hard to reflect the dynamic information of metabolic activity in the human body.

3. The MR-T1 image is sensitive for observing anatomy, while the MR-T2 image can detect tissue lesions.

4. SPECT can show the biological activities of cells and molecules, but it is difficult to distinguish human organ tissues due to the low image quality of SPECT.

5. PET can reflect the metabolic activity information of human tissues and organs at the molecular level, but the spatial resolution of PET is relatively low.
In our experiments, all of the tests are implemented in Matlab 2016a using a desktop with a Core (TM) i7-3770 CPU, 3.40 GHz, 8 CPUs and 12 GB RAM.

Fusion Results by Subjective Visual Effects Analysis
The results of the first set of experiments are shown in Figure 7, and the source MRI and CT images are listed in Figure 7a,b. The fused images obtained by the DTCWT, Curvelet, NSCT, Liu-ASR, Kim, Zhu, and the proposed method are displayed in Figure 7c-i, respectively. To facilitate subjective visual comparisons, local regions enclosed by red rectangular boxes in Figure 7 are enlarged and presented in the bottom right corners of their respective images. We can notice the diverse fusion performance of the different methods in retaining the brightness information and edge details of the source medical images. Although the MST methods express source information at different scales and directions, all of the results obtained by DTCWT, Curvelet and NSCT in Figure 7c-e still produce some distortion, which heavily degrades the quality of the medical images. Meanwhile, after careful observation, we can find that the fusion results generated by the Liu-ASR, Kim and Zhu methods preserve the details from MRI well. However, their brightness activity level is lower than that of the source CT image, which indicates a loss of useful information. It can be seen from Figure 7i that the fusion result obtained by the proposed method has the best performance in terms of retaining brightness and detail information, which indicates that its visual effect is the best.
The fusion results of the different methods on "MR-T1/MR-T2" (see Figure 8a,b) are shown in Figure 8c-i. For this experiment, the detail information (shape, edge, texture, etc.) in the red rectangle comes mainly from the MR-T1 image, while the energy information (brightness, contrast, etc.) comes mainly from the MR-T2 image. Among these fusion results, it can be seen that the details of the fusion results produced by the DTCWT, Curvelet and NSCT methods are severely damaged, especially by the Curvelet method.
Although the Liu-ASR, Kim and Zhu methods can relatively effectively protect the edge details of the source image, they do not protect the contrast of the image well. This is very disadvantageous for medical images with high-quality requirements, and is not conducive to subsequent medical image processing and recognition tasks.
By comparison, our fusion result (Figure 8i) can not only effectively protect the edge detail information of the source image, but also maintain the contrast of the source image, which is mainly due to the detail enhancement processing of the training set and the clustering techniques that classify brightness and detail groups before dictionary learning. At the same time, little artificial false information is introduced in our fusion result, which means that the visual effect of the proposed method outperforms the other methods in this experiment.
To further demonstrate that the proposed method is equally effective for other types of medical image fusion, we test the other two categories of medical image fusion, MRI/SPECT and MRI/PET, and the fusion results are shown in Figures 9 and 10. The details of the fused images are zoomed in and presented in the bottom left corners of their respective images.
With more careful observation, we can see that the spatial edge details and brightness information in Figures 9i and 10i are more accurate compared with Figures 9c-h and 10c-h generated by the six compared methods. This means the useful information from the source images has been successfully transferred into the fused images; that is, our fused results have the best visual features. This is mainly because the overcomplete dictionary constructed in this paper is composed of two parts, the brightness sub-dictionary and the edge detail sub-dictionary, which can fully express the significant features of the medical images. This will be very beneficial to the implementation of medical image fusion in practical medical assistance applications. In conclusion, the proposed algorithm has the best subjective visual effect.

Fusion Results by Objective Evaluation
To effectively evaluate the quality of fused images, we need to take a quantitative evaluation of the fused results; a metric that can assess the perceptual quality of fused results consistently with subjective visual evaluation is highly desired. For medical image fusion problems, since a reference image (ground truth) does not exist in practice, quantitative evaluation of fused image quality is not an effortless task. Over the last few decades, a number of fusion metrics have been proposed. Unfortunately, due to the various fusion scenarios, none of them is generally considered to always be more reasonable than the others. Therefore, it is often necessary to apply multiple indicators for a comprehensive evaluation.
In this part, we employ five popular and representative quantitative indices to objectively evaluate all fusion results. These metrics include mutual information (MI) [43], the edge information retention operator Q AB/F [44], nonlinear correlation information entropy (Q NCIE ) [45], the image fusion metric based on phase congruency (Q P ) [46], and Piella's metric Q S [47]. Among them, MI reflects the amount of mutual information between the source images and the fused image; Q AB/F effectively evaluates the amount of edge information transferred from the source images into the fused image; Q NCIE is an information-based metric that measures the correlation between the fused image and the source images; Q P indicates the degree to which significant saliency features are retained from the source images in the fused image; and Q S measures the contrast in the neighborhood of a pixel.
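As an illustration of the simplest of these indices, a minimal MI computation from grey-level histograms might look as follows. The bin count and the summed two-source form are common conventions, not details taken from [43].

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """MI (in bits) from the joint grey-level histogram of two images."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint distribution
    px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginals
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] *
                        np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))

def fusion_mi(src1, src2, fused):
    """One common MI fusion metric: MI of the fused image with each source."""
    return mutual_information(src1, fused) + mutual_information(src2, fused)
```

A fused image that copies more grey-level structure from the sources produces a more concentrated joint histogram and therefore a higher score, which is why larger values indicate better fusion.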
For all of them, a larger value indicates a better fusion result, and the metrics Q NCIE , Q P and Q S are computed with the evaluation toolbox implemented by Liu [48]. The quantitative evaluation results of all methods are presented in Tables 2 and 3. By comparison, we can see that Zhu's method marginally outperforms our method on the Q P value for the MRI/PET images, but our method obtains the best performance on the other metrics, and the subjective visual effect of our result (Figure 10i) is slightly superior to Zhu's (Figure 10h). As a whole, Tables 2 and 3 clearly indicate that our method outperforms the other methods on almost all metrics. Considering both the subjective visual comparison and the objective quantitative evaluation, one can finally conclude that our method generates visually pleasant fused images containing abundant detail and brightness information in most cases, and outperforms the competing methods on visual quality and objective evaluation.

Computational Efficiency Analysis
As previously mentioned, the proposed dictionary learning method can address the defects of traditional dictionary learning algorithms, such as superfluous patches and low computational efficiency. In this subsection, the computational complexity of the different fusion methods is analyzed, and the results are shown in Table 4. It is well known that SR based fusion methods are generally time consuming, especially when a dictionary is learned from the source images. As can be seen from Table 4, the first three MST based methods are less time-consuming than the four SR based methods.
Nonetheless, compared with the three typical dictionary learning methods, Liu-ASR, Kim and Zhu, the proposed method saves considerable time. This demonstrates that our method improves computational efficiency over these dictionary learning based fusion methods. Furthermore, the computational efficiency of our method still has much room for improvement; it can be further improved by fully optimizing the implementation and by utilizing multithreading and performance boosters such as Graphics Processing Unit (GPU) acceleration.

Extension to Other Type Image Fusion Issues
To exhibit the generalization ability of the proposed method, we extend its application to other types of image fusion, including multifocus image fusion, panchromatic-multispectral image fusion and infrared-visible image fusion. The three fusion examples are shown in Figure 11, where the source images are displayed in the first two columns, and the fused results are listed in the last column. We can notice that the important features of the source images, including detail, sharpness, edge and brightness, are well preserved in the fused images. Among them, the fusion performance of infrared-visible image fusion is relatively high. This demonstrates that the proposed model can transfer useful information from the source images into the fused result [49]. Meanwhile, few undesirable artifacts are introduced in these fusion processes, which indicates that our method has strong robustness in these applications. Furthermore, compared with MRI and PET imaging, some new functional medical imaging modalities, such as photoacoustic computed tomography (PACT) [50], resting-state functional connectivity (RSFC) and functional connectivity photoacoustic tomography (fcPAT) [51], can offer better spatial resolution with fast, noninvasive and non-ionizing imaging to describe brain physiology and pathology. Therefore, further study will be required to investigate them for image fusion in the near future.

Conclusions
SR based dictionary learning technology has been widely used in the medical image fusion field due to its superior performance. The core problem of this kind of algorithm is constructing an informative and compact overcomplete dictionary. Aiming at the problem that traditional dictionary learning methods lack sufficient ability to express source image information, this paper proposes a novel dictionary learning method based on brightness and detail clustering for medical image fusion. The proposed approach consists of three steps. Firstly, multi-layer ND filtering enhances the details of the pre-training images, so that the weak information is reinforced in the training set.
At the same time, we conduct the MSS on the images to realize the multi-scale representation of patches. Secondly, we propose SNE and MSF to classify the patches into brightness and detail groups, and then construct the brightness sub-dictionary and the detail sub-dictionary by K-SVD. The combination of the two sub-dictionaries generates the final informative and compact dictionary. Finally, an SR model is established to generate the fused results. Experimental comparison with traditional as well as state-of-the-art dictionary learning based medical fusion methods on four categories of medical image fusion shows that the proposed method has obvious superiority in both subjective visual effect and quantitative evaluation.