Multimodal Image Fusion for X-ray Grating Interferometry

X-ray grating interferometry (XGI) can provide multiple image modalities. It does so by utilizing three different contrast mechanisms—attenuation, refraction (differential phase shift), and scattering (dark field)—in a single dataset. Combining all three imaging modalities could create new opportunities for the characterization of material structure features that conventional attenuation-based methods are unable to probe. In this study, we proposed an image fusion scheme based on the non-subsampled contourlet transform and spiking cortical model (NSCT-SCM) to combine the tri-contrast images retrieved from XGI. It incorporated three main steps: (i) image denoising based on Wiener filtering, (ii) the NSCT-SCM tri-contrast fusion algorithm, and (iii) image enhancement using contrast-limited adaptive histogram equalization, adaptive sharpening, and gamma correction. Tri-contrast images of frog toes were used to validate the proposed approach. Moreover, the proposed method was compared with three other image fusion methods by several figures of merit. The experimental evaluation highlighted the efficiency and robustness of the proposed scheme, which produced fused images with less noise, higher contrast, more information, and better details.


Introduction
X-ray imaging techniques, such as mammography [1] and computed tomography (CT) [2], have become indispensable diagnostic tools for investigating the inner structure of materials. They provide valuable information in many fields, from medical diagnosis to industrial inspection and security screening. Traditionally, the image contrast of these techniques depends on differences in X-ray attenuation. The attenuation coefficient (µ) positively correlates with the material mass density (ρ) and atomic number (Z) (µ ∝ ρZ^4), and negatively correlates with the X-ray energy (E) (µ ∝ 1/E^3) [3]. In principle, conventional X-ray attenuation-based imaging is ideal for materials with high absorption. However, when low-Z materials are investigated with high-energy X-rays, the attenuation contrast becomes extremely poor unless the dose deposition is increased significantly.
Recently, X-ray grating interferometry (XGI) has been introduced to mitigate the inherent limitations of imaging low-Z materials with conventional X-ray techniques. Because it is compatible with conventional low-coherence X-ray sources and detectors, XGI has become the most promising scheme for translating phase-sensitive imaging into practice [4]. Moreover, XGI is a multi-contrast imaging technique, able to provide three physically different signals with complementary image contrast: attenuation contrast (AC), differential phase contrast (DPC), and small-angle scattering, also known as dark-field contrast (DFC) [5]. The phase signal can reveal differences between materials with similar absorption properties because it is highly sensitive to electron density variations in the object. The scattering signal can access unresolved structural variations of the sample on the micrometer scale, beyond the system resolution. Many studies have demonstrated that both the differential phase and scattering modalities offer valuable information in addition to conventional attenuation contrast, in clinical applications such as mammography [6,7] and lung imaging [8,9], as well as in non-destructive testing [10] and material science in industrial settings [11]. The scattering signal, in particular, has piqued the attention of researchers because of its effectiveness in offering quantitative or otherwise inaccessible structural information in radiographic applications [12][13][14].
Adding two more informationally-complementary contrasts to the conventional attenuation contrast can enrich the information access channels. However, the three output images represent morphological features of an object with different physical properties, which can significantly enhance the complexity of interpretation and burden a physician. Image fusion could combine the tri-contrast modalities into a single integrated image, making analysis and diagnosis less cumbersome. The simultaneous acquisition of the tri-contrast images circumvents the preregistration process for image fusion because the retrieved AC, DPC, and DFC images are temporally and spatially registered. This could be particularly advantageous for reducing artifacts in the fusion procedure and conserving the reliability of the acquired information.
Tri-contrast image fusion methods have been developed over the past decade. Ewald Roessl et al. presented, in 2012, an image fusion algorithm to combine AC and DPC based on an assumed simple scaling law [15]; however, the DFC signal was not considered. Z. Wang et al. proposed a tri-contrast fusion method based on multiple resolutions in 2013 [16]. It successfully transferred details from the original images to the fusion results, but the study lacked objective measurements to evaluate the method's performance. Felix Scholkmann et al. proposed an image denoising, fusion, and enhancement scheme in 2014 [17]. It produced pleasing results in both dental and breast imaging applications because it introduced pre-denoising and post-enhancement; however, the fusion rule of their scheme was unable to process three input images simultaneously, making it unsuitable for trimodal application. Eduardo Coello et al. introduced a Fourier domain framework for XGI fusion in 2017 [18]. The fusion results contained abundant diagnostic features and details, attributed to the full utilization of complementary information from the three XGI channels by the Fourier transform; however, they did not compare it with other image fusion algorithms.
In this work, an XGI fusion scheme based on the non-subsampled contourlet transform (NSCT) and the spiking cortical model (SCM) was proposed to solve several drawbacks of the current tri-contrast image fusion methods mentioned above. This scheme was able to process tri-contrast images from the three channels of XGI simultaneously. It incorporated the pre-denoising of the XGI outputs, the fusion process (based on NSCT-SCM), and the post-enhancement of the fusion results. The proposed fusion algorithm was able to extract fine details and essential information from the tri-contrast images of XGI, presenting them in a final fused image with high contrast and low noise. The similarity between the fusion result and the AC, DPC, and DFC channels of XGI was modulated by several tunable parameters, facilitating the easy realization of prior knowledge and preferences for particular channels.
Moreover, the proposed fusion scheme was compared with three other XGI fusion methods, i.e., the work of Felix Scholkmann et al. [17], the conventional NSCT image fusion algorithm, and the conventional NSCT-pulse-coupled neural network (PCNN) image fusion algorithm. The comparison was carried out in both subjective and objective evaluations. Objective measures incorporated edge strength (ES), spatial frequency (SF), standard deviation (SD), entropy (H), feature mutual information (FMI), the feature similarity index measure (FSIM), fusion factor (FF), the structural similarity index measure (SSIM), and power spectral density (PSD). Experimental results demonstrated the robustness and effectiveness of the proposed multimodal image fusion scheme.
The rest of this study was organized as follows: the basic principles of XGI fusion, NSCT, and SCM were presented in Section 2; the proposed NSCT-SCM XGI fusion scheme was illustrated in Section 3; the introduction of objective evaluation criteria was presented in Section 4; the experimental analysis of the proposed method was presented, together with the comparison with the other three algorithms for XGI fusion, in Section 5; and conclusions were drawn in Section 6.
Contributions of this study: (1) drawbacks of image fusion methods in the XGI were analyzed; (2) an image fusion scheme based on NSCT-SCM for the XGI was proposed; (3) a tunable sub-band coefficient selection strategy was proposed to serve special requirements for the XGI fusion; (4) the proposed NSCT-SCM image fusion scheme was applied to XGI data of frog toes and compared with current fusion methods in the XGI fusion field, exhibiting state-of-the-art performance.

Image Fusion for X-ray Grating Interferometry
X-ray grating interferometry simultaneously retrieves three complementary signals: AC, DPC, and DFC channels. Among these signals, AC represents the attenuation of the X-ray intensity; therefore, it provides the same information as conventional X-ray imaging, presenting it in the form of an X-ray absorption coefficient. DPC, on the other hand, is presented in the form of a refraction index, which relates to the X-ray's local deflection. Finally, DFC is defined by the small-angle X-ray scattering at sub-pixel structures, presenting detailed information that would not be easily visible in the previous channels.
In XGI image fusion, high-frequency components of the DPC and DFC images are selected to provide more features and details. At the same time, low-frequency components of the AC image are preferred because of an intrinsic principle of conventional X-ray methods: making images easy for doctors or radiologists to read [18]. In addition, because the three images from XGI are retrieved simultaneously, from the same direction and by the same sensor, no additional image registration is needed.

Non-Subsampled Contourlet Transform
Minh N. Do and Martin Vetterli proposed the contourlet transform (CT) in 2005 [19]. The following analogy demonstrates the advantages of the CT: imagine two painters, one using a wavelet style and the other a contourlet style, both planning to paint a natural scene. Each painter increases the resolution of their painting from coarse to fine, step by step. When painting a smooth contour, as shown in Figure 1, the wavelet-style painter can only use square-shaped brush strokes along the contour [20], with different-sized strokes corresponding to the multiresolution structure of wavelets [21,22]. As the resolution grows finer, it becomes apparent that this painter needs a significant number of fine dots to describe the contour. The contourlet-style painter, in the same scenario, effectively and efficiently maintains the smoothness of the contour by using brushstrokes with different elongated shapes that follow the directions of the contour. This analogy gives a clear view of the advantage of the CT over the wavelet: the CT decomposes an image following its contours, and can therefore represent smooth contours with far fewer coefficients.
Derived from the CT, the NSCT is a multi-directional, multi-scale transform that can analyze detailed information in an image [23,24]. It uses the non-subsampled pyramid filter bank (NSPFB) and the non-subsampled directional filter bank (NSDFB), and thus achieves the shift-invariance property. First, the input image is decomposed by the NSPFB into two parts: a high-pass and a low-pass sub-band. Then, the high-pass sub-band is further decomposed into several directional sub-bands by the NSDFB, while the low-pass sub-band is fed back into the same decomposition as a new input. As shown in Figure 2, when the decomposition is done, one low-pass sub-band and several high-pass directional sub-bands are obtained from the original input image. Note that the size of each sub-band is the same as that of the original image because there is no sampling operation. Consequently, the NSCT has a redundancy given by ∑_j 2^(l_j), where 2^(l_j) is the number of directions at scale j.
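A full NSCT implementation is involved, but the size-preserving, shift-invariant behaviour of its non-subsampled pyramid stage can be sketched with an à trous-style decomposition. The directional filter bank is omitted here, and the `blur`/`nonsubsampled_pyramid` helpers are illustrative names of mine, not part of any NSCT library:

```python
import numpy as np

def blur(img, k):
    """Smooth with a (2k+1) x (2k+1) box filter, reflect-padded.
    Widening k at each level mimics the 'a trous' (non-subsampled) scheme."""
    p = np.pad(img, k, mode="reflect")
    out = np.zeros_like(img, dtype=float)
    n = (2 * k + 1) ** 2
    for di in range(-k, k + 1):
        for dj in range(-k, k + 1):
            out += p[k + di : k + di + img.shape[0],
                     k + dj : k + dj + img.shape[1]]
    return out / n

def nonsubsampled_pyramid(img, levels=3):
    """Return [high_1, ..., high_L, low]; every band keeps the input size,
    the property the NSCT shares (no down-sampling -> shift invariance)."""
    img = img.astype(float)
    bands, low = [], img
    for lev in range(levels):
        smoothed = blur(low, 2 ** lev)   # wider kernel at each scale
        bands.append(low - smoothed)     # high-pass (detail) band
        low = smoothed
    bands.append(low)
    return bands
```

Because the bands telescope, summing them reconstructs the input exactly, which is what makes a redundant, same-size decomposition convenient for fusion.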

Spiking Cortical Model
The spiking cortical model [25] is a modified model, based on Eckhorn's neural network, that uses physiology as inspiration [26]. It has fewer parameters and better accuracy than the original model. Its time matrix can be recognized as a subjective, human sense of stimulus intensity. As a result of these physiology-inspired neural networks' outstanding ability to extract dynamic information inside multi-dimensional signals, they have been widely used in numerous fields. Instances include feature extraction [27], pulse shape discrimination [28][29][30], image encryption [31], and image segmentation and fusion [32,33].
Considering a biological neuron in a resting state, the membrane potential of this neuron is directly charged by an external stimulus. Meanwhile, this membrane potential is modulated by the postsynaptic action potentials of its neighboring neurons. The membrane potential in the SCM behaves similarly: it is calculated by combining the external stimulus and the neighboring modulation. A neuron in the SCM fires and produces a spike when its membrane potential rises above its threshold. The threshold is dynamic, constantly changing under the influence of the membrane potential states. Based on the characteristics mentioned above, the mathematical formulae of the SCM [25] can be written as follows:

U_ij(n) = f U_ij(n − 1) + S_ij (1 + β (W ∗ Y(n − 1))_ij), (4)
Y_ij(n) = 1 if U_ij(n) > Θ_ij(n − 1), and 0 otherwise, (5)
Θ_ij(n) = g Θ_ij(n − 1) + h Y_ij(n), (6)

where each neuron is denoted by a coordinate (i, j); coordinate (k, l) represents one of the neighboring neurons of the central neuron located at (i, j); U_ij(n) is the membrane potential of the neuron located at (i, j) when the iteration count is n; S_ij is the external stimulus; Θ_ij is the dynamic threshold; Y_ij(n) is the output action potential (spike); the convolution of W and Y stands for the modulation of the center neuron at (i, j) by its neighboring neurons; W is the synaptic weight matrix; β is the linking strength coefficient; f denotes the attenuation constant of the membrane potential, which defines its gathering speed; g represents the threshold's attenuation constant, controlling the relative refractory period (i.e., the difficulty of reactivating neurons); and h sets the absolute refractory period, which prevents a neuron that has just fired from immediately being activated again.
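A minimal NumPy sketch of the SCM iteration follows, assuming the common form in which the membrane potential decays by f, is charged by the stimulus modulated through β and the weight matrix W, and the threshold decays by g with a jump of h after each spike; the accumulated firing counts stand in for the ignition matrix (the exact variant in [25] may differ in detail):

```python
import numpy as np

def _conv_same(x, k):
    """'Same'-size correlation with zero padding (k is symmetric here)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="constant")
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i : i + x.shape[0], j : j + x.shape[1]]
    return out

def scm_ignition(S, n_iter=40, f=0.8, g=0.7, h=20.0, beta=0.3):
    """Iterate the SCM on a stimulus S in [0, 1]; the accumulated firing
    counts serve as the ignition matrix in this sketch."""
    W = np.array([[0.1091, 0.1409, 0.1091],
                  [0.1409, 0.0,    0.1409],
                  [0.1091, 0.1409, 0.1091]])
    U = np.zeros_like(S, dtype=float)       # membrane potential
    theta = np.ones_like(S, dtype=float)    # dynamic threshold
    Y = np.zeros_like(S, dtype=float)       # spikes from the last iteration
    fires = np.zeros_like(S, dtype=float)
    for _ in range(n_iter):
        U = f * U + S * (1.0 + beta * _conv_same(Y, W))  # charge + linking
        Y = (U > theta).astype(float)                    # spike if over threshold
        theta = g * theta + h * Y                        # decay, jump after spike
        fires += Y
    return fires
```

Stronger stimuli overcome the decaying threshold sooner and re-fire more often, so the firing count increases with stimulus intensity, which is why it can serve as a salience map for coefficient selection.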

NSCT-SCM Fusion Scheme
The proposed image fusion scheme incorporated three steps: (i) denoising all three input images (AC, DPC, and DFC) using adaptive Wiener filtering, (ii) applying the NSCT-SCM-based image fusion algorithm to the denoised images, and (iii) enhancing the fused output image using contrast-limited adaptive histogram equalization (CLAHE), adaptive sharpening (AS), and gamma correction (GC). The principle of the NSCT-SCM XGI fusion scheme is illustrated in Figure 3.
Step I: Images are denoised using Wiener filtering.
Step II: Images are decomposed into coefficient matrices using the NSCT. The coefficient matrices are then processed by the SCM, outputting ignition matrices. Finally, band mixing is implemented (the three coefficient matrices are fused into one coefficient matrix based on a coefficient selection algorithm designed on the basis of the ignition matrices), and the fused image is obtained by reconstructing the fused coefficient matrix.
Step III: The fused image is enhanced to generate the final output image.

Step 1. Image Denoising Based on Wiener Filtering
To obtain better-quality raw images, the adaptive Wiener filter was applied to reduce noise while preserving high-frequency information and edge features. The size of each input image is denoted by M × N; the AC, DPC, and DFC images are represented by I_AC = {I_AC(i, j)}, I_DPC = {I_DPC(i, j)}, and I_DFC = {I_DFC(i, j)}, respectively, where i = 1, 2, ..., M and j = 1, 2, ..., N. The image I^D obtained after Wiener filtering is expressed as follows [34]:

I^D(i, j) = m + ((σ^2 − v^2)/σ^2) (I(i, j) − m),

where m stands for the local mean and σ^2 for the local variance, both computed over a window of size X × Y; X and Y are manual parameters that define the processing window size in the to-be-processed image I; and v^2 denotes the noise variance, estimated when unknown by µ^2, the average of all the local variances. After implementing adaptive Wiener filtering on the AC, DPC, and DFC images, the output images are denoted I^D_AC, I^D_DPC, and I^D_DFC.
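This is the same local-statistics estimator as MATLAB's `wiener2`; a NumPy-only sketch (the function name is mine) could look like:

```python
import numpy as np

def adaptive_wiener(img, X=5, Y=5):
    """wiener2-style adaptive filter: local mean/variance over an X-by-Y
    window, with the noise variance taken as the mean of the local variances."""
    ph, pw = X // 2, Y // 2
    p = np.pad(img.astype(float), ((ph, ph), (pw, pw)), mode="reflect")
    n = X * Y
    m = np.zeros_like(img, dtype=float)    # local first moment
    m2 = np.zeros_like(img, dtype=float)   # local second moment
    for i in range(X):
        for j in range(Y):
            w = p[i : i + img.shape[0], j : j + img.shape[1]]
            m += w
            m2 += w * w
    m /= n
    var = m2 / n - m * m
    v2 = var.mean()                        # estimated noise variance
    gain = np.clip(var - v2, 0, None) / np.where(var > 0, var, 1)
    return m + gain * (img - m)
```

In flat regions the local variance is close to the noise estimate, so the gain collapses and the filter averages aggressively; near edges the local variance is large, the gain approaches 1, and the pixel is left nearly untouched, which is how edges survive the denoising.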

Step 2. NSCT-SCM XGI Fusion Algorithm
In this step, the three images (I^D_AC, I^D_DPC, and I^D_DFC) were fused into one image, I^D_F.
1.
First, the NSCT was applied to I^D_AC, I^D_DPC, and I^D_DFC, obtaining the images' high-frequency coefficients (H^(D,n)_AC, H^(D,n)_DPC, and H^(D,n)_DFC) and low-frequency coefficients (L^D_AC, L^D_DPC, and L^D_DFC), where n indexes the high-frequency coefficients, because multiple high-frequency coefficients are decomposed from a single image. Note that the size of each coefficient obtained from the NSCT was the same as that of the input images, M × N in this case. Additionally, although only one low-frequency coefficient could be obtained from the NSCT process, multiple high-frequency coefficients could be gained from the NSCT of a single image, depending on the decomposition levels of the NSDFB and NSPFB.

2.
Second, the high-frequency and low-frequency coefficients were fed into the SCM, generating the firing state of each coefficient (T^(D,n)_AC, T^(D,n)_DPC, or T^(D,n)_DFC for the high-frequency coefficients and T^(D,L)_AC, T^(D,L)_DPC, or T^(D,L)_DFC for the low-frequency coefficients), i.e., the ignition matrices. Each ignition matrix has the same size as its input coefficient, M × N in this case.

3.
Two separate fusion rules were provided for the high-frequency and low-frequency coefficients, because details and features must be preserved in the high-frequency sub-bands while the low-frequency part of the final fused image should stay close to the AC image. It is easier for doctors or radiologists to analyze a fused tri-contrast image when its low-frequency sub-band is close to that of the AC channel; under this condition, the final fusion result generally resembles traditional absorption-based imaging while containing the complementary information of the DPC and DFC channels. For the low-frequency coefficients, the fused coefficient L^D_F was obtained as a weighted combination of the three channels, where a is a tunable parameter that determines the similarity between the fused image and the AC image; the larger the value of a, the closer the fused image will be to the AC image. For the high-frequency coefficients, there were a total of 7 possible values for H^(D,n)_F(i, j). The idea of the high-frequency fusion rule was to set a threshold T for the comparison of the ignition results T^(D,n)_AC, T^(D,n)_DPC, and T^(D,n)_DFC. This comparison measured whether the information of a pixel coming from a single channel was significant enough to replace the others, or whether a weighted average of the information of two or three channels was required. Specifically, when one channel was significantly larger than the others, we chose the coefficient from that channel as the value of H^(D,n)_F(i, j) directly. When two channels were significantly larger than the remaining one, we took their average as the value of H^(D,n)_F(i, j). When no channel was significantly larger than the others, we took the weighted average of all three channels, with weight factors b, c, and d, as the value of H^(D,n)_F(i, j). A detailed fusion scheme for the high-frequency coefficients is presented in the Supplemental Information, Section S1.

4.
Finally, the inverse NSCT was applied to the low-frequency coefficient L^D_F and the high-frequency coefficients H^(D,n)_F, obtaining the fused image I^D_F.
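The per-pixel selection among the seven cases of the high-frequency rule can be sketched as follows. This is my reading of the rule (the exact comparisons are in Supplemental Section S1), and the function and argument names are hypothetical:

```python
import numpy as np

def fuse_high(H_ac, H_dpc, H_dfc, T_ac, T_dpc, T_dfc,
              T_th=1.0, b=0.41, c=0.29, d=0.30):
    """Per-pixel selection among the 7 cases: pick one channel when its
    ignition value exceeds both others by T_th, average two channels when
    they jointly dominate the third, otherwise weight all three by b, c, d."""
    H = np.stack([H_ac, H_dpc, H_dfc])      # (3, M, N) coefficients
    T = np.stack([T_ac, T_dpc, T_dfc])      # (3, M, N) ignition values
    out = b * H[0] + c * H[1] + d * H[2]    # default: weighted average of all
    for k in range(3):
        i, j = (k + 1) % 3, (k + 2) % 3
        # case: channel k dominates both others -> take it directly
        solo = (T[k] >= T[i] + T_th) & (T[k] >= T[j] + T_th)
        out = np.where(solo, H[k], out)
        # case: channels i and j both dominate channel k -> average the pair
        pair = (T[i] >= T[k] + T_th) & (T[j] >= T[k] + T_th) & ~solo
        out = np.where(pair, 0.5 * (H[i] + H[j]), out)
    return out
```

The three "solo" cases, three "pair" cases, and the default weighted average account for the seven possible values mentioned above.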

Step 3. Image Enhancement Using CLAHE, AS, and GC
Contrast-limited adaptive histogram equalization (CLAHE), adaptive sharpening (AS), and gamma correction (GC) were introduced to improve image quality by Felix Scholkmann et al. [17]. This scheme was convenient to implement and facilitated the output of better-quality images. Although it could enhance image contrast and sharpness, it could not add further information to the fused image from the original AC, DPC, and DFC channels. Its application incorporated the following steps:
1.
The image I^D_F was first processed by CLAHE [35], which divided it into small tiles and modified the histograms of these tiles to enhance their contrast. A clipping limit was applied during this processing to prevent excessive noise amplification, and bilinear interpolation was implemented across the tiles to avoid image discontinuities. After this step, the processed image I^(En1)_F was obtained.
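Full CLAHE adds tiling and bilinear blending; the clip-and-redistribute core, which is what limits contrast amplification, can be sketched globally in NumPy (the tiling step is omitted and the function name is mine):

```python
import numpy as np

def clipped_hist_eq(img, n_bins=256, clip_limit=0.01):
    """Histogram equalization with a clip limit -- the core CLAHE idea
    without the tiling machinery. Counts above the limit are redistributed
    uniformly, which caps the slope of the mapping and hence the noise
    amplification in near-flat regions."""
    flat = img.ravel()
    hist, edges = np.histogram(flat, bins=n_bins, range=(0.0, 1.0))
    limit = max(1, int(clip_limit * flat.size))
    excess = np.clip(hist - limit, 0, None).sum()
    hist = np.minimum(hist, limit) + excess // n_bins
    cdf = np.cumsum(hist).astype(float)
    cdf /= cdf[-1]
    idx = np.clip(np.searchsorted(edges, flat, side="right") - 1, 0, n_bins - 1)
    return cdf[idx].reshape(img.shape)
```

In real CLAHE this mapping is built per tile and the per-tile results are blended bilinearly, which is what the [5, 5] tile grid in the parameter list controls.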

2.
Second, I^(En1)_F was sharpened by the AS method, in which a high-frequency detail component of the image, weighted by a factor C, is added back to the image. C is the weighting factor, adaptively determined by calculating the image entropy for many candidate values of C and finding the value C_max at which the maximum entropy is obtained. The final weight was calculated as C = C_max/α, where α is a constant that prevents the image from becoming over-sharpened, with a fixed value of 3, empirically given by Felix Scholkmann et al. in their work [17]. After this process, the image I^(En2)_F was obtained.
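Read as unsharp masking with an entropy-driven weight (my interpretation of the AS step; the `box_blur3` smoother and the candidate grid are assumptions), the procedure can be sketched as:

```python
import numpy as np

def box_blur3(img):
    """3x3 box smoothing with reflect padding."""
    p = np.pad(img, 1, mode="reflect")
    return sum(p[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def entropy(img, n_bins=64):
    """Shannon entropy of the grey-level histogram (image in [0, 1])."""
    hist, _ = np.histogram(img, bins=n_bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def adaptive_sharpen(img, alpha=3.0, candidates=np.linspace(0.1, 3.0, 30)):
    """Unsharp masking with the weight chosen by an entropy sweep:
    C_max maximizes the entropy of the sharpened image, and the final
    weight is C_max / alpha to avoid over-sharpening (alpha = 3 as in [17])."""
    detail = img - box_blur3(img)
    c_max = max(candidates,
                key=lambda c: entropy(np.clip(img + c * detail, 0, 1)))
    return np.clip(img + (c_max / alpha) * detail, 0, 1)
```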

3.
Finally, in the GC step, the image I^(En2)_F was enhanced by a sigmoid function with two manually tunable parameters, λ1 and λ2.
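The exact sigmoid is not spelled out above, so the following assumes a common logistic form in which λ2 sets the pivot grey level and λ1 the slope, rescaled so that [0, 1] maps onto [0, 1]:

```python
import numpy as np

def sigmoid_correction(img, lam1=4.8, lam2=0.49):
    """Assumed logistic tone curve: lam2 is the pivot intensity, lam1 the
    steepness. The output is rescaled so that 0 -> 0 and 1 -> 1 exactly."""
    s = 1.0 / (1.0 + np.exp(-lam1 * (img - lam2)))
    lo = 1.0 / (1.0 + np.exp(lam1 * lam2))           # value at img = 0
    hi = 1.0 / (1.0 + np.exp(-lam1 * (1.0 - lam2)))  # value at img = 1
    return (s - lo) / (hi - lo)
```

With the reported λ1 = 4.8 and λ2 = 0.49, this compresses the extremes and stretches mid-grey contrast around the pivot.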

Measures of the Fusion Performance
With regard to fusion performance evaluation, there are two kinds of evaluation strategies: subjective and objective evaluations. Subjective evaluation is difficult to reproduce and highly dependent on the evaluators' experience, making the evaluation results unstable and difficult to quantify. In this study, we chose the objective evaluation method as the primary method by which to compare the results of the proposed fusion scheme with the other fusion algorithms. Several performance measures were implemented for the fusion results in our experiment, as follows:

1.
Edge strength (ES) [36] stands for the relative amount of edge information transferred from the input images (I_AC, I_DPC, and I_DFC) into the fused result I_F, denoted as:

ES = ∑_(i,j) [w_AC(i, j) ES_AC,F(i, j) + w_DPC(i, j) ES_DPC,F(i, j) + w_DFC(i, j) ES_DFC,F(i, j)] / ∑_(i,j) [w_AC(i, j) + w_DPC(i, j) + w_DFC(i, j)],

where w_AC(i, j), w_DPC(i, j), and w_DFC(i, j) are the weights assigned to the edge preservation values ES_AC,F(i, j), ES_DPC,F(i, j), and ES_DFC,F(i, j) for I_AC, I_DPC, and I_DFC, respectively. The edge preservation values were calculated with a Sobel edge operator; detailed information can be found in [36]. The larger the value of ES, the better the image fusion performance.

2.
Spatial frequency (SF) measures the number of details presented in a stimulus per degree of visual angle, given as follows:

SF = sqrt(RF^2 + CF^2),
RF = sqrt( (1/(M N)) ∑_(i=1)^M ∑_(j=2)^N [Z(i, j) − Z(i, j − 1)]^2 ),
CF = sqrt( (1/(M N)) ∑_(i=2)^M ∑_(j=1)^N [Z(i, j) − Z(i − 1, j)]^2 ),

where RF and CF represent the row frequency and column frequency, respectively, and Z(i, j) denotes the gray-value intensity of the pixel located at (i, j) in the image. A higher SF value means an image contains more details and hence indicates a better fusion result.
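Spatial frequency reduces to first differences along rows and columns; a NumPy sketch (the function name is mine):

```python
import numpy as np

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2), with the row/column frequencies computed
    from squared first differences along rows and columns respectively."""
    z = img.astype(float)
    mn = z.size
    rf2 = np.sum(np.diff(z, axis=1) ** 2) / mn   # row frequency squared
    cf2 = np.sum(np.diff(z, axis=0) ** 2) / mn   # column frequency squared
    return float(np.sqrt(rf2 + cf2))
```

A flat image scores zero, while rapidly alternating detail scores high, which is why a well-fused image (carrying DPC/DFC detail) should score above any single blurred input.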

3.
Standard deviation (SD) is the square root of the variance and reflects the image contrast: the higher the contrast, the greater the value of SD. It was calculated as follows:

SD = sqrt( (1/(M N)) ∑_(i=1)^M ∑_(j=1)^N [Z(i, j) − µ]^2 ),

where µ stands for the mean intensity of the image.

4.
Entropy (H) [37] measures how much information is contained in an image, calculated as follows:

H = −∑_(l=0)^(L−1) p_l log2 p_l,

where L represents the number of gray levels of the image and p_l stands for the probability of the l-th gray level in the image. A larger H value signified a better image fusion performance.
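For images normalized to [0, 1], the entropy follows directly from the grey-level histogram; a NumPy sketch in which the bin count stands in for the number of grey levels L:

```python
import numpy as np

def image_entropy(img, levels=256):
    """H = -sum_l p_l log2(p_l) over the grey-level histogram,
    with empty levels skipped (0 log 0 := 0)."""
    hist, _ = np.histogram(img, bins=levels, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```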

5.
Feature mutual information (FMI) [38,39] refers to how much feature information is successfully transferred from the original images (I_AC, I_DPC, and I_DFC) to the fused image I_F, mathematically defined as follows:

FMI = FI(I_AC, I_F) + FI(I_DPC, I_F) + FI(I_DFC, I_F),

where FI(I_A, I_B) stands for the amount of feature information transferred from image I_A to image I_B, calculated as follows:

FI(I_A, I_B) = ∑_(I_A, I_B) p_(I_A, I_B)(i, j, k, l) log2 [ p_(I_A, I_B)(i, j, k, l) / (p_(I_A)(i, j) p_(I_B)(k, l)) ],

where p_(A,B) is the joint distribution function between image A and image B, and (i, j) and (k, l) denote the pixel coordinates in image A and image B, respectively. The more significant the value of FMI, the more successfully the fusion scheme fused the three images, preserving more feature information from each.
6.
The feature similarity index measure (FSIM) [40,41] relates to the similarity between two images based on low-level features—specifically, the phase congruency (PC) and the image gradient magnitude (GM). The FSIM of two images, I_A(i, j) and I_B(i, j), was calculated by:

FSIM = ∑_(i,j) S_AB(i, j) PC_m(i, j) / ∑_(i,j) PC_m(i, j), with PC_m(i, j) = max[PC_A(i, j), PC_B(i, j)],

where PC_A and PC_B are the PC values of I_A and I_B, respectively, and S_AB(i, j) refers to the local similarity, denoted as follows:

S_AB(i, j) = [S_PC;AB(i, j)]^α [S_GM;AB(i, j)]^β,

where S_PC;AB(i, j) and S_GM;AB(i, j) are similarity measurements for I_A(i, j) and I_B(i, j), based on PC and GM, respectively; α and β are two parameters; and T_1 and T_2 are two constants used inside these similarity measurements, all of which were defined in [36]. To measure the performance of the XGI fusion, the overall FSIM was computed between the fused image and each of the three input channels.
8.
The structural similarity index measure (SSIM) [42] measures how much structural information was transferred from one image into another, based on the human eye's sensitivity to structural information, given as follows:

SSIM(I_A, I_B) = (1/W) ∑_(j=1)^W SSIM(I_A^j, I_B^j),

where SSIM(I_A, I_B) represents the SSIM value of images I_A and I_B; W is the number of windows that come from the division of an image; and SSIM(I_A^j, I_B^j) denotes the structural similarity between images I_A and I_B in the j-th window, calculated by:

SSIM(I_A^j, I_B^j) = (2 µ_(I_A^j) µ_(I_B^j) + C_1)(2 σ_(I_A^j I_B^j) + C_2) / [(µ_(I_A^j)^2 + µ_(I_B^j)^2 + C_1)(σ_(I_A^j)^2 + σ_(I_B^j)^2 + C_2)],

where µ_(I_A^j), µ_(I_B^j), σ_(I_A^j)^2, and σ_(I_B^j)^2 are the local means and local variances of the j-th windows in images I_A and I_B, respectively; σ_(I_A^j I_B^j) is the cross-covariance for the j-th windows between I_A and I_B; and C_1 and C_2 are small constants that stabilize the division. An overall SSIM value for the XGI fusion was defined as follows:

SSIM_overall = [SSIM(I_AC, I_F) + SSIM(I_DPC, I_F) + SSIM(I_DFC, I_F)] / 3,

where I_AC, I_DPC, I_DFC, and I_F denote the three input images and the fused image, respectively. Note that larger SSIM values corresponded to better fusion performance.
9.
Power spectral density (PSD) [43,44] measures the power at each signal frequency. The estimate of the PSD P_j at frequency j was denoted as follows:

P_j = |C_j|^2 / n,

where C_j are the Fourier terms and n is the number of samples. The total area enclosed by the PSD curve and the coordinate axis reflects the information contained in an image. If the PSD curve of one image is higher than that of another within a frequency band, the former image contains more information in that frequency band. A generally higher PSD curve therefore indicated a better image fusion performance [42].
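A periodogram-style estimate matching P_j = |C_j|^2/n, plus the radially averaged 2-D version usually plotted as a curve for images (the function names are mine):

```python
import numpy as np

def psd_1d(signal):
    """Periodogram estimate: P_j = |C_j|^2 / n for the Fourier terms C_j."""
    c = np.fft.rfft(signal)
    return (np.abs(c) ** 2) / len(signal)

def radial_psd(img, n_bins=8):
    """Radially averaged 2-D PSD: average |F|^2 / N over annuli of
    spatial-frequency radius, yielding the PSD-versus-frequency curve."""
    f = np.fft.fftshift(np.fft.fft2(img))
    p = np.abs(f) ** 2 / img.size
    cy, cx = np.array(img.shape) // 2
    yy, xx = np.indices(img.shape)
    r = np.hypot(yy - cy, xx - cx)
    bins = np.linspace(0.0, r.max() + 1e-9, n_bins + 1)
    idx = np.digitize(r.ravel(), bins) - 1
    return np.array([p.ravel()[idx == k].mean() for k in range(n_bins)])
```

Comparing the radial curves of two fusion results band by band is the visual comparison described above: the curve that sits higher in a band carries more signal power there.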

Image Fusion Parameters and Results
The fusion parameters used in this work are given in the order of the fusion steps. For step 1, the size of the neighborhood samples for adaptive Wiener filtering was set at [5,5]. For step 2, the decomposition levels of the NSCT were [4,4,4,4]. With regard to the parameters of the SCM, defined in Equations (4)-(6), we empirically set f = 0.8, g = 0.7, h = 20, W = [0.1091, 0.1409, 0.1091; 0.1409, 0, 0.1409; 0.1091, 0.1409, 0.1091], and the total iterative count k = 200. The weight factor for the low-frequency band was a = 0.55; the weight factors for the high-frequency bands were b = 0.41, c = 0.29, and d = 0.30; and the threshold was Tth = 1. For step 3, the numbers of tiles, by row and column, used for CLAHE were [5,5]; the contrast enhancement limit parameters for CLAHE were [0, 1] and 0.00125; and the number of bins of the CLAHE histogram was 500. Finally, λ1 and λ2 for the contrast optimization were 4.8 and 0.49, respectively.
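Step 1 can be illustrated with a pixelwise adaptive Wiener filter over a [5,5] neighborhood, in the style of MATLAB's wiener2 (a Python sketch under that assumption; the noise-variance default, the average of the local variances, is wiener2's convention and is assumed here rather than stated in the text):

```python
import numpy as np

def adaptive_wiener(img, win=5, noise_var=None):
    """Pixelwise adaptive Wiener filter over a win x win neighborhood."""
    pad = win // 2
    padded = np.pad(np.asarray(img, dtype=float), pad, mode='reflect')
    h, w = img.shape
    local_mean = np.empty((h, w))
    local_var = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            block = padded[i:i + win, j:j + win]
            local_mean[i, j] = block.mean()
            local_var[i, j] = block.var()
    if noise_var is None:
        # wiener2-style estimate: average of all local variances
        noise_var = local_var.mean()
    # Shrink toward the local mean where the local variance is low
    gain = np.maximum(local_var - noise_var, 0.0) / np.maximum(
        np.maximum(local_var, noise_var), 1e-12)
    return local_mean + gain * (img - local_mean)
```

Smooth regions (local variance near the noise level) are replaced by the local mean, while high-variance detail is passed through largely unchanged.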
The data used for the fusion process came from the grating-based X-ray phase contrast imaging of frog toes [45]. These images (a total of four sets of images) were fused by our algorithm using the parameters above. These experiments were carried out on MATLAB, and half the results (of two sets of images) are shown in Figure 4. The remaining results of the other two sets of images are given in the Supplemental Information, Section S2. As shown in Figure 4, many features that only appeared in the DPC or DFC channels were successfully transported to the final fusion results. The soft tissue around the bone and the meshwork structure of the bone trabecula (which can only be observed in the DPC channel), as well as the high signal of the bone cortex (which is only visible in the DFC channel), were successfully transferred into the fusion results. These well-preserved features demonstrated the efficiency of the proposed fusion scheme.

Objective Evaluation and Discussion
In this section, we implemented the other three image fusion schemes on the same datasets as those in Section 5.1. These methods included the algorithm based on the shift-invariant discrete wavelet transform (SIDWT) [17], the traditional NSCT image fusion algorithm, and the conventional NSCT-PCNN image fusion algorithm [46]. Then, the performance of all four methods was evaluated by the measures mentioned in Section 4. Half of the results (of two sets of images) are displayed in Figure 5 and Tables 1 and 2, while the remaining results of the other two datasets are given in the Supplemental Information, Section S2.
With regard to the parameter settings of SIDWT, the size of the neighborhood samples used for adaptive Wiener filtering was [5,5]; the decomposition levels of the first and second fusion steps were 4 and 5, respectively; the numbers of tiles by row and column used for CLAHE were [5,5]; the contrast enhancement limit parameters for CLAHE were [0, 1] and 0.0017; the number of bins of the CLAHE histogram was 500; and λ1 and λ2, for the contrast optimization, were 3.9 and 0.59, respectively. The parameter settings of the NSCT used in the NSCT-PCNN method and the NSCT method were the same as those mentioned in Section 5.1. In addition, the parameters of the PCNN were empirically set as follows: αL = 0.06931, αθ = 0.2, VL = 1, Vθ = 20, θ = 0.2, N = 200, and linking weight W = [0.707, 1, 0.707; 1, 0, 1; 0.707, 1, 0.707] [46].
As shown in Figure 5, we marked areas with red squares, called regions of interest (ROIs), to reduce the impact of noise on the evaluation and to focus on the parts of the image in which we were most interested. We observed that the soft tissue around the bone was better presented by the NSCT-SCM method than by the others. Our proposed method also preserved the texture inside the bones and the details at the bone joint junctions. In contrast, the details and texture produced by the other methods were not satisfactory, with images that were blurrier and less sharp in comparison, indicating those methods' tendency to compromise on information preservation. The objective evaluation criteria were further applied to these fusion results, and the evaluation results of the ROIs are given in Tables 1 and 2. The best result for each measure is marked in bold.
As shown in Tables 1 and 2, the results of FMI, FF, SSIM, and FSIM of all methods were at the same level, with some slight fluctuations. This indicated that all methods were able to output fusion results sufficiently similar to the source images. However, regarding the outcomes of ES, H, SD, and SF, the proposed method generally outperformed the others, showing that NSCT-SCM was able to transfer more information and details from the source images to the fusion result than the other methods. Specifically, NSCT-SCM had the highest values of H, SD, and SF in Table 1 and of H and SD in Table 2. The NSCT method led to the best ES result in Table 1, while in Table 2 the NSCT-PCNN method outperformed the others with regard to ES, and the SIDWT showed the best SF value.
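For reference, the information-content measures discussed here can be sketched under their standard definitions (Shannon entropy H of the grey-level histogram, standard deviation SD, and spatial frequency SF from row and column first differences; the exact definitions used in Section 4 may differ slightly, and ES is omitted):

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy H of the grey-level histogram, in bits."""
    hist, _ = np.histogram(img, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def std_dev(img):
    """Standard deviation SD of the grey levels."""
    return float(np.asarray(img, dtype=float).std())

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2) from row/column first differences."""
    img = np.asarray(img, dtype=float)
    rf2 = np.mean(np.diff(img, axis=1) ** 2)  # squared row frequency
    cf2 = np.mean(np.diff(img, axis=0) ** 2)  # squared column frequency
    return float(np.sqrt(rf2 + cf2))
```

All three measures vanish for a perfectly flat image and grow with the amount of grey-level variation and fine detail, which is why higher H, SD, and SF readings indicate that more information was transferred into the fused image.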
In addition, we calculated the PSD of each fusion result and drew the PSD curves of the fused images, given in Figure 6. As shown in Figure 6, the PSD curve of our proposed scheme was generally higher than the others, meaning that the fusion results of NSCT-SCM contained more information and were of better quality. In addition, although the power spectral density of the SIDWT remained at the same level as that of the proposed method at high spatial frequencies, it was significantly outperformed by the NSCT-SCM at low spatial frequencies. This result was consistent with the evaluation results of the above eight measures and the subjective evaluation results, i.e., the fusion image of NSCT-SCM had higher contrast and finer details.

Conclusions
In the present work, an NSCT-SCM-based image fusion scheme was proposed for X-ray grating interferometry. It incorporated three major steps: denoising, the NSCT-SCM fusion algorithm, and enhancement. A new coefficient selection strategy was proposed for the fusion algorithm step, which selected the high-frequency and low-frequency coefficients in different ways. This strategy met a unique requirement of XGI: the low-frequency coefficients should derive primarily from the AC channel in order to achieve final fusion results similar to traditional CT, while the high-frequency coefficients should be selected in a way that preserves the details and features of the DPC and DFC channels.
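The selection strategy can be illustrated schematically. The rules below are simplified stand-ins, not the paper's SCM-firing-based rule: the low-frequency band is fused as a weighted average biased toward the AC channel (a = 0.55 as in Section 5.1; splitting the remainder equally is an assumption), and the high-frequency bands use absolute-maximum selection, a common generic rule:

```python
import numpy as np

def fuse_low(low_ac, low_dpc, low_dfc, a=0.55):
    """Low-frequency fusion biased toward the AC channel so that the
    result resembles a conventional attenuation (CT-like) image."""
    rest = (1.0 - a) / 2.0
    return a * low_ac + rest * low_dpc + rest * low_dfc

def fuse_high(high_ac, high_dpc, high_dfc):
    """High-frequency fusion by absolute-maximum selection: keep, per
    coefficient, the channel with the strongest detail response."""
    stack = np.stack([high_ac, high_dpc, high_dfc])
    idx = np.abs(stack).argmax(axis=0)
    return np.take_along_axis(stack, idx[None], axis=0)[0]
```

The max-abs rule guarantees that a detail visible in only one channel (e.g., trabecular texture in DPC or cortical bone signal in DFC) survives into the fused subband, which is the behavior the SCM-based rule is designed to achieve.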
Furthermore, the proposed method and three other image fusion methods were implemented on X-ray grating interferometry data of frog toes to demonstrate the feasibility and robustness of the NSCT-SCM image fusion scheme. The fusion results were evaluated using both subjective and objective measures. As observed and demonstrated, the proposed method was competitive with the other image fusion methods, both visually and quantitatively. The proposed image fusion scheme output images with high contrast and clear details, and demonstrated the potential for real-time application. In our future research, a feature-based fusion scheme will be studied to process images in a manner closer to the human visual system and to achieve better computational efficiency.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s23063115/s1.
Author Contributions: Study conception and design were performed by H.L., X.J., J.L., Y.S. and X.C. Material preparation, data collection and analysis were performed by H.L., M.L. and G.Z. The first draft of the manuscript was written by H.L., and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant number U19A2086.
Data Availability Statement: The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.