The Fusion of MRI and CT Medical Images Using Variational Mode Decomposition

: In medical image processing, magnetic resonance imaging (MRI) and computed tomography (CT) modalities are widely used to extract soft and hard tissue information, respectively. However, with the help of a single modality, it is very challenging to extract the required pathological features to identify suspicious tissue details. Several medical image fusion methods have attempted to combine complementary information from MRI and CT to address the issue mentioned earlier over the past few decades. However, existing methods have their advantages and drawbacks. In this work, we propose a new multimodal medical image fusion approach based on variational mode decomposition (VMD) and local energy maxima (LEM). With the help of VMD, we decompose source images into several intrinsic mode functions (IMFs) to effectively extract edge details by avoiding boundary distortions. LEM is employed to carefully combine the IMFs based on the local information, which plays a crucial role in the fused image quality by preserving the appropriate spatial information. The proposed method’s performance is evaluated using various subjective and objective measures. The experimental analysis shows that the proposed method gives promising results compared to other existing and well-received fusion methods.


Introduction
Medical image analysis plays a crucial role in clinical assessment. However, the success rate of the diagnosis depends upon the visual quality and the information present in medical images [1]. In real-world medical imaging, denoising [2,3] or texture information processing [4,5] is a necessary preprocessing step to improve the fused image's visual quality further.
Nowadays, several imaging modalities are available to capture specific medical information of a given organ [6][7][8]. X-ray, MRI, CT, positron emission tomography (PET), and single-photon emission computed tomography (SPECT) of a human brain displayed in Figure 1 are crucial medical imaging modalities among them. For example, the magnetic resonance imaging (MRI) modality captures the anatomical information of the soft tissue. In contrast, computed tomography (CT) significantly provides hard tissue information such as bones structures and tumors [8]. Moreover, for clinical needs, the information provided by a single modality may not be sufficient, especially during the diagnosis of diseases [9]. The image fusion mechanism can effectively address this problem, enhancing the information by combining the complementary details provided by two or more modalities into a single image. enhancing the information by combining the complementary details provided by two or more modalities into a single image. Image fusion can be categorized into spatial and transform domain techniques [10]. In spatial domain methods, the fusion takes place between the pixels of the source images directly. The maximum, minimum, average, weighted average and PCA are examples of the spatial domain fusion methods, which are easy to implement and computationally efficient. Direct pixel-based fusion methods use a weighted pixel of input images to form a fused image [11]. The activity level of the pixels determines these weights. In the literature, various machine learning methods such as neural networks and support vector machines (SVM) are also used to select the pixels with the highest activity [12,13]. In [14], an iterative block-level fusion method is proposed. First, the source images are decomposed into small square blocks, and PCA is computed on those blocks. Next, weights are found using the average of the PCA components. Finally, a maximum average mutual information fusion rule is employed for the final blending of input images. In [15], a pixel-level image fusion method is proposed using PCA. Here, the first PCA components from both the input images are multiplied individually, and those weighted images are added for fusion. However, these methods might exhibit spatial color, information loss, and brightness distortions [16,17]. Image fusion can be categorized into spatial and transform domain techniques [10]. In spatial domain methods, the fusion takes place between the pixels of the source images directly. The maximum, minimum, average, weighted average and PCA are examples of the spatial domain fusion methods, which are easy to implement and computationally efficient. Direct pixel-based fusion methods use a weighted pixel of input images to form a fused image [11]. The activity level of the pixels determines these weights. In the literature, various machine learning methods such as neural networks and support vector machines (SVM) are also used to select the pixels with the highest activity [12,13]. In [14], an iterative block-level fusion method is proposed. First, the source images are decomposed into small square blocks, and PCA is computed on those blocks. Next, weights are found using the average of the PCA components. Finally, a maximum average mutual information fusion rule is employed for the final blending of input images. In [15], a pixel-level image fusion method is proposed using PCA. Here, the first PCA components from both the input images are multiplied individually, and those weighted images are added for fusion. However, these methods might exhibit spatial color, information loss, and brightness distortions [16,17].
Image fusion methods based on the transform domain techniques are receiving much consideration [18]. Pyramid [19], wavelet [20], and multi-resolution singular value decomposition (MSVD) are examples of traditional methods [21] in this category. However, transform domain fusion methods have a few drawbacks [18]. For example, most pyramid methods suffer from blocking artifacts and a loss of source information, even producing artifacts around edges [22]. Wavelets suffer shift sensitivity, poor directionality, an absence of phase information, poor performance at edges and texture regions, and produce artifacts around edges because of the shift-variant nature [22]. Despite the reliable quantification results, MSVD fusion methods might result in poor visual quality [23].
To address the issues mentioned above, other transform domain fusion techniques such as À Trous wavelet transform (ATWT), curvelet transform (CVT), and ridgelet transform are suggested in [24]. These methods provide better results concerning the visual aspect, preserving spatial and spectral information. Nevertheless, these techniques suffer from artifacts around the edges in the fused image [25].
In [26], a new pixel-level image fusion approach using convolutional sparsity-based morphological component analysis (CS-MCA) is introduced. This method achieves sparse representation by combining MCA and sparse convolutional representation into the unified optimization method. This approach might suffer from a spatial consistency problem, resulting in the degradation of spatial details [27]. An NSST-based fusion scheme is proposed in [28].This approach used a blend of NSST with weighted local energy (WLE) and a weighted sum of eight-neighborhood-based modified Laplacian (WSEML) to integrate MRI and CT images. However, this method is a non-adaptive approach. A summary of different types of image fusion methods, their advantages and drawbacks are tabulated in Table 1. An adaptive transform-domain fusion technique might provide a better solution to the challenges mentioned above. In these fusion approaches, the basis function of the transform technique depends on the source image's characteristics. With the help of adaptive wavelets, the image's crucial features can be highlighted, which helps in the fusion process. Hence, adaptive wavelets turned out to be a preferable representation compared to standard wavelets. Similar works based on VMD decomposition-based techniques can be found in [35,36]. However, this paper proposes a new adaptive multimodal image fusion strategy based on the combination of variational mode decomposition (VMD) and local energy maxima (LEM) to address the challenges mentioned above. The highlights of the proposed method are as follows: 1. VMD is an adaptive decomposition scheme that decomposes the images as bandlimited sub-bands called intrinsic mode functions (IMFs) without introducing boundary distortions and mode-mixing problems. Indeed, the band-limited sub-bands characterize the edge and line features of source images. This decomposition technique can effectively extract the image features from the other transform methods such as wavelet transform (WT), bi-dimensional empirical mode decomposition (BEMD), and empirical wavelet transform (EWT); 2. The LEM fusion rule extracts the local information from decomposed modes corresponding to two source images pixel by pixel using a windowing operation (3 × 3) and then measures the maximum information value. Hence, using the LEM fusion rule, we can preserve the required complementary visual, edge, and texture information in the IMFs; 3. The proposed approach aims to preserve the information and details of both MRI and CT images into the fused image using VMD and LEM. From visual perception and objective assessment of the fusion results, it is evident that our new image fusion method accomplishes good performance over other existing fusion methods.
The remainder of the paper is arranged as follows: The proposed framework and its mathematical representation are presented in Section 2. The detailed analysis of the simulation results and necessary discussion is presented in Section 3. A final note on the proposed method and future directions is given in Section 4.

Proposed Methodology
Our proposed work aims to integrate the details of the soft tissue and dense bone structure provided by MRI and CT medical imaging technologies into a unique image. For this, we have proposed a multimodal medical image fusion based on a blend of VMD and LEM, as shown in Figure 2.
x FOR PEER REVIEW 5 of 19 can preserve the required complementary visual, edge, and texture information in the IMFs; 3. The proposed approach aims to preserve the information and details of both MRI and CT images into the fused image using VMD and LEM. From visual perception and objective assessment of the fusion results, it is evident that our new image fusion method accomplishes good performance over other existing fusion methods.
The remainder of the paper is arranged as follows: The proposed framework and its mathematical representation are presented in Section 2. The detailed analysis of the simulation results and necessary discussion is presented in Section 3. A final note on the proposed method and future directions is given in Section 4.

Proposed Methodology
Our proposed work aims to integrate the details of the soft tissue and dense bone structure provided by MRI and CT medical imaging technologies into a unique image. For this, we have proposed a multimodal medical image fusion based on a blend of VMD and LEM, as shown in Figure 2. The main steps involved in our fusion methodology are: A. VMD-based image decomposition; B. A fusion strategy depending on the LEM; C. Synthesizing the fused image.

A. VMD-Based Image Decomposition
The traditional decomposition approaches, such as wavelets [37,38], BEMD [39], and EWT [40], suffer from various problems such as boundary distortions and mode-mixing. With these issues, we may fail to achieve an appropriate fusion result. To address these problems, we employed VMD [41], a robust adaptive decomposition approach, highlighting meaningful details in the form of sub-images.
The VMD finds applications in image denoising [42] and texture decomposition [43]. VMD is a non-stationary and adaptive signal processing technique. Unlike EMD and its variants, VMD is not a recursive analysis approach, and it decomposes the signal/image into bandlimited sub-bands based on its frequency content. This work uses VMD to obtain distinct and significant IMFs from the source images (MRI and CT). The derived IMFs reduce mode-mixing and boundary distortions, which are the major concerns in the above mentioned transform domain methods. With this VMD decomposition, we can extract prominent edge information. Initially, we decomposed the input images into six The main steps involved in our fusion methodology are: A.
A fusion strategy depending on the LEM; C.
Synthesizing the fused image.
A. VMD-Based Image Decomposition The traditional decomposition approaches, such as wavelets [37,38], BEMD [39], and EWT [40], suffer from various problems such as boundary distortions and mode-mixing. With these issues, we may fail to achieve an appropriate fusion result. To address these problems, we employed VMD [41], a robust adaptive decomposition approach, highlighting meaningful details in the form of sub-images.
The VMD finds applications in image denoising [42] and texture decomposition [43]. VMD is a non-stationary and adaptive signal processing technique. Unlike EMD and its variants, VMD is not a recursive analysis approach, and it decomposes the signal/image into bandlimited sub-bands based on its frequency content. This work uses VMD to obtain distinct and significant IMFs from the source images (MRI and CT). The derived IMFs reduce mode-mixing and boundary distortions, which are the major concerns in the above mentioned transform domain methods. With this VMD decomposition, we can extract prominent edge information. Initially, we decomposed the input images into six IMFs, which are illustrated in Figure 3. From Figure 3, it can be observed that the first IMF ((b) and (i)) captures promi information from the source images, whereas the remaining IMFs encompass the line edge information. We can note from Figure 3 that as the mode number increases visual details are not significant.
Mathematical Details of VMD: The main goal of VMD is to subdivide an input signal () into a specific numb sub-bands (IMFs or Modes) ( l b ), and each sub-band is bandlimited to specific freq cies in the spectral domain (Fourier domain) by maintaining sparsity. Each of sub-bands is bandlimited to its center frequencies. VMD involves the following step get the bandlimited sub-bands [41]: 1. For each sub-band, its analytical counterpart needs to be computed using Hi transform to get the one-sided frequency spectrum; 2. An exponential is used to mix with each mode to shift its frequency spectru the baseband; 3. Finally, the bandwidth of the mode estimates using the squared L 2 -norm o gradient. The constrained variational problem can be represented as below.
where th l indicates the th l sub-band and its center frequency, respectively. (t) δ resents the Dirac distribution, * is the symbol of the convolution. The constrained problem in Equation (1) is solved using the quadratic penalty and Lagrangian multipliers λ to make it an unconstrained problem given in Equa From Figure 3, it can be observed that the first IMF ((b) and (i)) captures prominent information from the source images, whereas the remaining IMFs encompass the line and edge information. We can note from Figure 3 that as the mode number increases, the visual details are not significant.
Mathematical Details of VMD: The main goal of VMD is to subdivide an input signal () into a specific number of sub-bands (IMFs or Modes) (b l ), and each sub-band is bandlimited to specific frequencies in the spectral domain (Fourier domain) by maintaining sparsity. Each of the sub-bands is bandlimited to its center frequencies. VMD involves the following steps to get the bandlimited sub-bands [41]: 1. For each sub-band, its analytical counterpart needs to be computed using Hilbert transform to get the one-sided frequency spectrum; 2. An exponential is used to mix with each mode to shift its frequency spectrum to the baseband; 3. Finally, the bandwidth of the mode estimates using the squared L 2 -norm of the gradient. The constrained variational problem can be represented as below.
where l th indicates the l th sub-band and its center frequency, respectively. δ(t) represents the Dirac distribution, * is the symbol of the convolution. The constrained problem in Equation (1) is solved using the quadratic penalty term and Lagrangian multipliers λ to make it an unconstrained problem given in Equation (2).
where L represents augmented Lagrange matrix function, α is the penalty factor parameter, λ indicates the Lagrange multiplier, and x(t) is the input signal. Now the solution of Equation (1) can be computed as the saddle point of Equation (2) using the method called an alternating direction method of multipliers (ADMM).
Equation (3) can be further solved using an alternating direction method of multipliers (ADMM) [41]. Finally, the estimate of the l th sub-band is computed as [44]: Similarly, the center frequency is updated as: In this work, we used the two-dimensional (2D)-VMD [45] method to decompose the MRI and CT images. As stated above, 2D-VMD is a helpful method in extracting useful information such as edges and curves from the source images. Furthermore, VMD is a reliable method to deal with noisy images. Therefore, it can improve the quality of the fusion process even without employing additional preprocessing techniques.
B. Fusion Strategy Depending on LEM As discussed before, the VMD adaptively decomposes the input images into bandlimited sub-bands called IMFs. Indeed, these IMFs characterize the image features of source images. To highlight and extract relevant features in the fused image, we require appropriate fusion rules. As discussed in Section 1, many fusion rules [46], such as minima, maxima, averaging, and PCA, have been widely explored for this purpose over the past few years. Among them, minima and maxima cause brightness distortions, averaging rule blurs the fused image, and PCA degrades the spectral information [15]. Furthermore, the fusion rules mentioned above may produce low spatial resolution issues [47]. The LEM-based [47] fusion rule is adopted to tackle the issues discussed above in this work.
We have demonstrated the influence of these fusion rules visually in Figure 4 and quantitatively in Table 2. As shown in Figure 4, the VMD with LEM fusion rule achieves visually satisfying results compared to VMD with other fusion rules. Similarly, as shown in Table 2, the fusion metric values calculated over 10 data sets proved the efficacy of the chosen LEM fusion rule.     The technical details of the LEM fusion rule are discussed as follows. The principal idea behind using LEM is to extract and preserve vital information with the help of local information constraints from both the images pixel by pixel [47]. The entire process of LEM is described in Algorithm 1.

Algorithm 1
Let us consider the IMFs of the first image as IMFs i A , and the sec ond image as IMFs i B . The local information LE α (x, y) of I MFs i α (α = A, B) is evaluated using the following steps. Input : Decomposed modes of images IMFs i A , IMFs i B . Output : Enhanced decomposition modes F i I MFs A,B (x, y).

Step 1 : Calculate the local information LEM α (x, y) of individual modes IMFs
where, W k is given by: Step 2 : Choose the maximum value in the local information LEM α (x, y) Step 3: Calculate the binary decision weight maps Step 4 : Obtain the enhanced decomposition modes F i I MFs A,B (x, y) C. Synthesizing the Fused Image We linearly combine all the enhanced IMFs obtained from each LEM fusion rule to construct the fused image. The whole process of the proposed fusion framework is given in Algorithm 2.
(d) Fuse the decomposed modes F i I MFs A,B (x, y) using Equation (9).
Step 3: Reconstruct the fused image by summing all the fused sub-bands obtained from Step 2.

D. Image Fusion Evaluation Metrics
In this paper, we used a few state-of-the-art image fusion metrics to estimate the information contribution of each source image in the fusion process. They are edge intensity (EI) [48], mutual information (MI) [49], visual information fidelity (VIF) [50,51], edgebased similarity measure (Q AB/F P ) [52], structural similarity index measure (SSIM) [51,53], average gradient (AG) [54], root mean square error (RMSE) [15], peak signal-to-noise ratio (PSNR) [13,42]. EI represents the difference of luminance along the gradient direction in images. MI is used to measure the relative information between the source and the fused images. VIF estimates the visual information fidelity between the fused and source images depending on the Gaussian mixture model. The edge-based similarity (Q AB/F P ) measure will be useful to provide the edge details in the fused image. RMSE computes a difference measure between the reference image and fused image. In this work, the maximum value of RMSE of MRI-fused images and CT-fused images is considered. Similarly, PSNR is also computed. Except for RMSE, the higher values of all these metrics imply better fusion. In the case of the RMSE, the lowest value yields a better result.

Results and Discussion
This section presents the experimental setup, results and analysis of the proposed method. First, we explain the experimental setup and methods, followed by data analysis using both qualitative and quantitative methods. Finally, we compare the proposed method with the existing literature for a fair assessment.
The experiments are conducted on a PC with Intel(R) Core (TM) i5-5200U CPU@2.20GHz and RAM 8GB using MATLAB2018b. We have considered a whole-brain atlas website (http://www.med.harvard.edu/AANLIB/home.html, accessed on 1 September 2021) to conduct our experiments. For this purpose, 23 MRI-CT medical image data sets are taken from this database. All these data sets are registered with a resolution of 256 × 256. Image registration [55] is a necessary step prior to image fusion. It is defined as the process of mapping the input images with the help of a reference image. Such mapping aims to match the corresponding images based on specific features to assist in the image fusion process. The database contains various cross-sectional multimodal medical images, such as MRI (T1 and T2 weighted), CT, single-photon emission computed tomography (SPECT), and positron emission tomography (PET).
Furthermore, it has a wide range of brain images ranging from healthy to different brain diseases, including cerebrovascular, neoplastic, degenerative, and infectious diseases.
We have considered 23 pairs of MRI-CT from fatal stroke (cerebrovascular disease) to validate our proposed approach (Supplementary Materials). Interested readers can find more details of this database in [56].
The efficacy of any image fusion algorithm can be verified using subjective (qualitative) and objective (quantitative) analysis. In Section 3.1, we first verified the subjective performance of various fusion algorithms and then performed objective analysis using fusion metrics in Section 3.2.

Subjective Assessment
Visual results of various MRI and CT fusion methods are shown in Figures 5-7. A good MRI-and CT-fused image should contain both the soft tissue information and dense structure information of the MRI and CT images. We can draw the following observations by examining the visual quality of the four sets of MRI-CT fusion results using various methods.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 11 of 19 dense structure information of the MRI and CT images. We can draw the following observations by examining the visual quality of the four sets of MRI-CT fusion results using various methods.
1. Compared to all the other methods, our proposed algorithm provides a brighter outer region representing the CT image's dense structure; 2. From Figures 5-7, it can be seen that the fused images of methods (c)-(g) are yielding poor contrast; 3. Though the method (h) in all the Figures 5-7 provides better contrast details; still, it is suffering from artifacts, especially in the CT region.   From Figures 5-7, it can be noticed that the ASR method transfers both the CT and MRI information partially with low contrast. Next, coming to the CVT contains more MRI details than the CT. In the DTCWT method, we can find a few fusion artifacts in and 1. Compared to all the other methods, our proposed algorithm provides a brighter outer region representing the CT image's dense structure; 2. From Figures 5-7, it can be seen that the fused images of methods (c)-(g) are yielding poor contrast; 3. Though the method (h) in all the Figures 5-7 provides better contrast details; still, it is suffering from artifacts, especially in the CT region.
From Figures 5-7, it can be noticed that the ASR method transfers both the CT and MRI information partially with low contrast. Next, coming to the CVT contains more MRI details than the CT. In the DTCWT method, we can find a few fusion artifacts in and around the CT region. Similarly, we can observe information fusion loss in the MSVD method. Compared with the methods mentioned above, CSMCA gives better visual quality, but the overall contrast of the image is reduced. The fused images with the NSST method are visually degraded due to both the fusion loss and artifacts. Overall, our proposed method retains the necessary information from the MRI and CT with minimum fusion losses. The comparison results of the MRI-CT fusion using various methods, including the proposed method on the 23 pairs of fatal stroke images, are shown in Figures 8 and 9.

Objective Assessment
Here, we assess the fused image quality objectively using fusion metrics. Tables 3-5 demonstrate the objective assessment of the three fatal-stroke images proposed and other existing approaches (sets: 7, 11, and 15) subjectively analyzed earlier. In addition, we have presented the average objective metric scores of all the 23 sets (fatal-stroke)in Table 6. Fusion metrics except for RMSE with the first highest values are highlighted in bold font, and the second-highest values are underlined. The first-lowest value of the RMSE is indicated in bold, and the second-lowest value is underlined. A number within the bracket at the end of the quantitative metric scores represents the rank of the fusion algorithm. In these Tables, the ranking scheme is considered for better quantitative analysis of fusion algorithms.    Comprehensively, the proposed framework is the only approach that occupies the first two ranks for all eight metrics among all the seven methods. It indicates that our method has robust performance (i.e., stable and promising performance) than other existing techniques. Specifically, our approach always remains in the first position on VIFF and RMSE for all four data sets, as shown in Tables 3-5.
Average quantitative analysis of the proposed and other state-of-the-art methods calculated over 23 pairs of MRI-CT (fatal stroke) are presented in Table 6. The proposed method occupied the first position by overperforming other fusion algorithms when average values are considered in fusion metrics.
In general, the consistent performance of any image fusion algorithm in quantitative results is mainly due to the good visual quality of fused images, fusion gain, and less fusion loss and fusion artifacts. We have already seen from the visual result analysis that the proposed method can transfer the source image information into the fused image with less fusion loss and artifacts compared to the other fusion algorithms. It is also evident from the fusion metrics that our method is giving a stable performance.
Hence, we can conclude that the proposed method is promising, stable, and efficient from qualitative and quantitative comparative analysis.

Conclusions and Future Scope
We proposed a multi-modal medical image fusion framework with VMD and LEM to fuse MRI and CT medical images in this work. By using an adaptive decomposition technique VMD, significant IMFs are derived from the source images. This decomposition process can preserve some details of source images. However, these details are not sufficient to fulfill the clinical needs of radiologists. Hence, we used a LEM fusion rule to preserve complementary information from IMFs, an essential criterion during medical image diagnosis. All the experiments are evaluated on the Whole Brain Atlas benchmark data sets to analyze the efficacy of the proposed methodology. The experimental results reveal that the proposed framework attained better visual perception. Even objective assessment in terms of average EI (64.582), MI (3.830), VIFF (0.498), Q AB/F P (0.542), SSIM (0.6574), RMSE (0.020), AG (6.41), and PSNR (20.291) demonstrated quantitative fusion performance better than the existing multi-modal fusion approaches. In the future, we wish to conduct experiments with extensive data that contain images of MRI and CT with different disease information. Additionally, we consider extending this work to both 2D and 3D image clinical applications. Furthermore, we would like to verify the effectiveness of the proposed method for other image fusion applications such as digital photography, remote sensing, battlefield monitoring, and military.