BPDGAN: A GAN-Based Unsupervised Back Project Dense Network for Multi-Modal Medical Image Fusion

Single-modality medical images often do not contain sufficient valid information to meet the requirements of clinical diagnosis, and diagnostic efficiency is limited when several images must be examined at the same time. Image fusion is a technique that combines functional modalities such as positron emission tomography (PET) and single-photon emission computed tomography (SPECT) with anatomical modalities such as computed tomography (CT) and magnetic resonance imaging (MRI) to supplement complementary information. Fusing two anatomical images (such as CT-MRI) is also often required in place of a single MRI, and the fused images can improve the efficiency and accuracy of clinical diagnosis. To this end, in order to achieve high-quality, high-resolution and detail-rich fusion without artificial priors, an unsupervised deep learning image fusion framework is proposed in this paper, named the back project dense generative adversarial network (BPDGAN) framework. In particular, we construct a novel network based on the back project dense block (BPDB) and the convolutional block attention module (CBAM). The BPDB effectively mitigates the impact of black backgrounds on image content, while the CBAM improves the performance of BPDGAN on texture and edge information. Finally, qualitative and quantitative experiments are conducted to demonstrate the superiority of BPDGAN. In terms of quantitative metrics, BPDGAN outperforms the state-of-the-art comparisons by approximately 19.58%, 14.84%, 10.40% and 86.78% on the AG, EI, Qabf and Qcv metrics, respectively.


Introduction
Clinical medical imaging often involves the acquisition of medical images that characterize different biological information across a variety of modalities. Magnetic resonance imaging (MRI) provides high-resolution information on tissue structure [1]; computed tomography (CT) provides higher resolution in assessing density [2]; single-photon emission computed tomography (SPECT) focuses on functional information about organs and diseased tissues [3]; and positron emission tomography (PET) reflects the presence or absence of physiological lesions at the molecular level [4,5]. The effective information contained in single-modality images cannot sufficiently meet the information demand of clinical diagnosis. Researchers have attempted to solve this problem through image fusion, and in recent years multimodal medical fusion imaging has attracted a great deal of interest in the clinical field [6,7].
Developing a means to improve the resolution of structural information, while also preserving functional information, is the key problem to be solved in multimodal medical image fusion tasks [8]. With the continuous development of deep learning in recent years, strategies based on convolutional neural networks have gradually highlighted their strengths in image fusion [9]. However, most existing fusion frameworks are still manually designed rather than end-to-end. The main contributions of this paper are as follows:
1. The BPDB module is proposed and utilized in conjunction with the CBAM module. These modules eliminate the obstacle of large black backgrounds in the fusion results and yield high-quality fused images.
2. An end-to-end multimodal medical image fusion model is put forward to implement the fusion of three kinds of medical images with MR images. No manual prior knowledge is required, no labeled data are needed, and the model is robust.
3. Our loss function, designed for medical image fusion, contains a content loss and a gradient loss. The gradient loss focuses on the high-frequency information of the image. An adversarial mechanism based on gradient information is used to make the fused images texturally clear and content-rich.

Related Work
With the development of signal processing technologies, more and more image fusion methods have emerged in the past ten years. These methods can be categorized into two types: conventional and deep learning-based methods. Conventional medical image fusion techniques can be divided into two subtypes: spatial domain and frequency domain methods [11]. Spatial domain fusion manipulates image pixels directly with simple rules (such as taking the maximum), but is less effective: in the spatial domain it is difficult to decouple high-frequency (detail) information from low-frequency (global) information, which leads to increasingly complex fusion rules. In contrast, transform domain fusion methods can fuse images with more high-frequency details. Multiscale transform (MST) is the standard method for frequency domain-based fusion [12]. The MST-based approach involves three steps: decomposition, fusion rule selection and reconstruction. First, the source image is transformed into the frequency domain, where it is decomposed into a high-frequency sub-band image (HSI) and a low-frequency sub-band image (LSI); the HSI contains mainly texture details, while the LSI contains the image feature distribution and background information. Subsequently, the LSI and HSI are fused using different fusion rules. Finally, the image is reconstructed by the inverse of the decomposition process and transformed back into the time domain. Representative examples include the shearlet transform [13], Laplace pyramid (LP) transform [14], discrete wavelet transform (DWT) [15], gradient pyramid transform [16], and dual-tree complex wavelet transform (DTCWT) [17]. However, all these fusion methods contain a down-sampling process, which harms image information and blunts texture edges.
To address this issue, schemes without a down-sampling process, including the non-subsampled contourlet transform (NSCT) [18] and the non-subsampled shearlet transform (NSST) [19], have been proposed. For subsequent feature fusion, conventional approaches are limited because they force the same transformation on medical images of different modalities without considering the feature differences among the modalities, resulting in poor feature representation. Moreover, the fusion rules are so complex that algorithmic efficiency is low.
In subsequent years, learning-based solutions were proposed to address these issues. Deep learning fusion strategies can be divided into three categories: autoencoders (AE), traditional convolutional neural networks (CNN) and generative adversarial networks (GAN). One of the best-known AE-based approaches is DenseFuse [20], which trains the encoder and decoder on MS-COCO. There is also a classic CNN-based approach, PMGI [21], which implements end-to-end feature extraction, image fusion and reconstruction. FusionGAN [22] pioneered the GAN-based approach to image fusion by building an adversarial mechanism between the fused image and the input image, while training the generator and discriminator to improve the texture detail of the fused image. However, all these existing methods have their own drawbacks. DenseFuse is a universal model trained on a generic dataset and is not suitable for medical image fusion tasks. PCNN [23] relies on a priori knowledge and is less robust. PMGI is an end-to-end model that does not require a priori knowledge, but its fused images are devoid of details and suffer serious information loss. Although FusionGAN provides better texture preservation, it uses just one input image in the adversarial mechanism, and the information of the other image is severely lost. Furthermore, these existing methods perform poorly on edge, texture and color (pseudo-color with gray) information.
To solve these problems, a back project dense generative adversarial network is proposed in this paper.


Proposed Method
We propose a back project dense generative adversarial network to achieve high-resolution multimodal image fusion; its pipeline architecture is shown in Figure 1. First, to preserve the functional information in the color images, the PET and SPECT images in RGB color space are converted into YCbCr color space to separate the luminance channel (Y) from the color channels (Cb, Cr) [24]. Secondly, the Y channel of the image is sent to BPDGAN for fusion with the MR image, and the fused Y channel is output. Finally, the fused Y channel is combined with the Cb and Cr channels and inverse-transformed back into the RGB color space. Single-channel CT images can be fed directly into BPDGAN for fusion with the MR image. In medical image fusion tasks, the image to be fused often has a large black background, which can cause a loss of overall brightness and blurred edges, reducing visual quality and texture detail. In order to reduce the negative impact of the black background on the fusion result, the back project dense block (BPDB) module is introduced in this paper: the feature map is fed into the back-projection module and passes through three sub-modules, namely the feature enhancement module, the residual calculation module and the feature reconstruction module.
The input features undergo a series of calculations to obtain the residuals, which are then added back during feature reconstruction. In this paper, we also design a generator loss function that matches the medical image fusion task to jointly prompt the generator, so as to ensure that it produces high-quality fused images. Meanwhile, our model establishes an adversarial mechanism between the generator and the discriminator to jointly constrain both. Via the adversarial mechanism, the gradient of the fused image gets closer and closer to the joint gradient of the input images until the discriminator is confused, and the texture detail is further enhanced.

Pre-Processing
The PET images involved in the fusion are pseudo-color images with functional information presented in RGB color space, while the MRI images are single-channel images with only structural information. In order to protect the functional information of the PET and SPECT images, we pre-process them by converting to YCbCr color space [25], separating the color information (Cb and Cr channels) from the luminance information (Y channel) and using only the luminance information (Y channel) for fusion.
During the fusion procedure, only the MRI image and the Y component of the PET image are processed, both of which are grayscale images, and the color information in the PET image will be perfectly preserved.
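The pre-processing pipeline above can be sketched in NumPy. The BT.601 conversion coefficients below are a common choice for YCbCr, and the `0.5 * (y + mri)` stand-in merely marks where the BPDGAN generator would run — both are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Convert an HxWx3 uint8 RGB image to float Y, Cb, Cr planes (BT.601)."""
    r, g, b = [img[..., i].astype(np.float64) for i in range(3)]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr

def ycbcr_to_rgb(y, cb, cr):
    """Inverse BT.601 transform; output clipped back to uint8."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

# Pipeline sketch: fuse only the luminance plane, keep the chroma planes intact.
pet_rgb = np.random.randint(0, 256, (8, 8, 3), dtype=np.uint8)  # stand-in PET
mri = np.random.randint(0, 256, (8, 8)).astype(np.float64)      # stand-in MRI
y, cb, cr = rgb_to_ycbcr(pet_rgb)
fused_y = 0.5 * (y + mri)          # placeholder for the BPDGAN generator
fused_rgb = ycbcr_to_rgb(fused_y, cb, cr)
```

Because Cb and Cr pass through untouched, the functional (color) information of the PET/SPECT image survives the fusion regardless of what the generator does to Y.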

Network Structures
To enhance the texture detail of the fused images and to preserve the structural information when the Y-channel of the PET image is integrated with the MRI image, the BPDGAN adversarial generation network architecture is proposed and described as follows.

Generative Adversarial Networks
Labeling medical images requires specialized expertise, and it is too expensive and time-consuming to label them by hand. Therefore, a generative adversarial network (GAN) [26] is employed in this paper.
The GAN architecture contains a generator (G) and a discriminator (D). During the training period, the parameters of G are updated not from data samples, but from the back-propagation of D. Thus, a GAN-based network architecture can be freed from the need for labeled data.
G is trained to generate realistic samples from random noise or latent variables, which can be formulated as x = G(z). For the fusion task, the generator has two inputs, and its training goal is to make the generated data as close as possible to the real data P_data(x, y). The optimization process for the generator can be executed by Equation (1), where Div(P_G(z_1, z_2), P_data(x, y)) represents the distance between the generated data distribution P_G(z_1, z_2) and the true data distribution P_data(x, y).
Since both the generating distribution and the true distribution are unknown, D is trained simultaneously to identify the truthfulness of the generated data, as shown in Equation (2).
The discriminator maximizes the function V(G, D) by assigning high scores to data sampled from the true distribution P_data(x, y) and low scores to data sampled from the generated distribution P_G(z_1, z_2); for more detail, please refer to Equation (3). During training, G and D play a minimax game, G minimizing and D maximizing V(G, D). At first, the distance between the generated data distribution and the real data distribution is large. However, as the number of iterations increases, the two distributions become closer and closer until the discriminator cannot distinguish the generated data from the real data, and a Nash equilibrium is reached between the generator and the discriminator. At this point, generator training is complete, and the generated data can be regarded as real data.
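For reference, the standard GAN objectives that Equations (1)–(3) instantiate can be written as follows. This is a reconstruction from the surrounding definitions (the two-input notation P_G(z_1, z_2) follows the text), not the paper's exact typesetting:

```latex
% Generator objective (cf. Equation (1)):
G^{*} = \arg\min_{G} \, \mathrm{Div}\!\left(P_{G}(z_1, z_2),\, P_{data}(x, y)\right)

% Discriminator objective (cf. Equation (2)):
D^{*} = \arg\max_{D} \, V(G, D)

% Standard minimax value function (cf. Equation (3)):
\min_{G}\max_{D} V(G, D) =
  \mathbb{E}_{t \sim P_{data}}\big[\log D(t)\big]
  + \mathbb{E}_{z_1, z_2}\big[\log\big(1 - D(G(z_1, z_2))\big)\big]
```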

Overall Network Architecture
The BPDGAN network structure is elaborately designed to fuse the PET's Y-channel with the grayscale image to obtain richer texture detail and higher image resolution. Our model improves medical image fusion in terms of both the network structure and the loss function. Large black areas of multimodal medical images (e.g., PET) can reduce the brightness of the fused image and blur the texture structure of image boundaries. To address this problem, we improve the network structure and propose a back-projection module (see Figure 1) that removes the obstacle posed by the black background through feature subtraction. For the loss function, the model dynamically adjusts the loss to optimize the prediction results and uses a gradient loss to enhance the detail of the fused image.
To meet the high accuracy needs of medical image diagnosis for lesion texture, we introduce an attention mechanism into our model. The attention module assigns different weights to each part of the input information, enabling the model to extract the more important features. The model focuses its limited computational resources on the important features to avoid information overload and optimize the fusion results.
To further enhance the detail texture of the fused image, we build an adversarial mechanism based on gradients. The real data are the joint gradient map of the two input images, computed with the Laplace operator and the maximum principle; the pseudo data are the gradient map of the fused image, computed with the Laplace operator alone. Within the discriminator, the real data and the pseudo data undergo continuous adversarial learning. Thus, the generator is forced to produce fused images with richer texture detail and more realistic gradients. The GAN can be optimized as shown in Equation (4).
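A minimal sketch of how the discriminator's "real data" could be formed under our reading of the text: a Laplace operator on each input, then an element-wise maximum. The 4-neighbour kernel and edge padding are assumptions, not the authors' implementation:

```python
import numpy as np

def laplacian(img):
    """4-neighbour Laplacian via padded shifts (kernel [[0,1,0],[1,-4,1],[0,1,0]])."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def joint_gradient(img_a, img_b):
    """'Real data' for the discriminator: element-wise maximum of the two
    absolute Laplacian maps (our reading of the maximum principle in the text)."""
    return np.maximum(np.abs(laplacian(img_a)), np.abs(laplacian(img_b)))
```

The "pseudo data" would then simply be `np.abs(laplacian(fused))` of the generator's output.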

Generator Architecture
The structure of the generator of our model is shown in Figure 1; the PET's Y-channel and the MRI image are fed into the network through two branches with the same structure. Taking the MRI branch as an example, the MRI image first enters two convolution-activation modules for feature extraction, where the convolution layer uses a 3 × 3 kernel and the activation function is ELU, yielding the extracted features F_0. The extracted high-dimensional features then enter the back-projection module (BPDB), details of which are described in Section 3.2.5; the features after the BPDB can be represented as F_1 = H_BPDB(F_0), where H_BPDB represents the back-projection function. The model then feeds the back-projected features into the attention module for weighting, which is shown in Figure 2 and discussed in detail in Section 3.2.6, yielding the weighted feature F_2 = H_CBAM(F_1).

H_CBAM is the operator function of the attention mechanism, which contains channel attention and spatial attention operations. Afterward, the two branches are merged into one by the concat operation, and the fusion feature can be expressed as F_fusion = H_concat(F_MRI, F_Y), where H_concat indicates the feature map stitching operation. Feature F_fusion is then back-projected and weighted again, and the output features can be expressed as F_out = H_Composite(F_fusion), where H_Composite denotes the composite function of the feature back-projection and the attention module. The final fused image is obtained after two convolution-activation modules, where the convolution layer uses 3 × 3 convolution and the activation function is ELU.

Discriminator Architecture
The discriminator of our model BPDGAN is also shown in Figure 1; its input is not the image itself, but the image gradient. The two images to be fused are processed with the Laplace operator to yield two gradient maps, which are then combined according to the principle of maximum to obtain the joint gradient map, i.e., the real data. The fused image is processed with the Laplace operator to generate the fused gradient map, i.e., the false data. The features extracted after convolution enter the probability module, which calculates the probability that the generated data are real.

Back Project Dense Block (BPDB)
The back-projection module is elaborately designed to leverage feature operations to reduce the negative effect of large black backgrounds on the fusion results. The structure is shown in Figure 2. The feature map is fed into the back-projection module and passes through three sub-modules, where each convolution-activation layer has a 3 × 3 convolution kernel and the activation function is ELU. First is the feature enhancement module, in which feature F_pre enters and is convolution-activated to encode the feature F_enc; F_enc then enters the difference calculation module, whose output is added to F_enc and sent to a convolution-activation module to decode F_add. The residual calculation module is similar to the feature enhancement module; however, instead of adding the encoded features and the difference-calculated features, their absolute difference is taken and fed into the decoder to obtain the residual feature F_res. In the final feature reconstruction module, F_res is encoded to obtain F_res_en, which is added to the enhanced features and decoded to output feature F.
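One possible reading of the three-stage flow above, with placeholder encoder/decoder stages standing in for the learned 3 × 3 conv + ELU layers. The wiring is our interpretation of the prose, not the authors' code:

```python
import numpy as np

# Placeholder "convolution-activation" stages; a real BPDB would use learned
# 3x3 conv + ELU layers. elu mirrors the activation named in the text.
def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def enc(x):  # stand-in encoder (conv-activation)
    return elu(x)

def dec(x):  # stand-in decoder (conv-activation)
    return elu(x)

def bpdb(f_pre):
    """Sketch of the three BPDB stages as we read them from the text:
    enhancement (encode, add difference output, decode), residual (encode,
    subtract, take absolute value, decode), reconstruction (encode residual,
    add enhanced features, decode)."""
    f_enc = enc(f_pre)                        # feature enhancement: encode
    f_add = dec(f_enc + f_enc)                # add difference output, decode
    f_res = dec(np.abs(enc(f_add) - f_enc))   # residual stage: subtract + abs
    f_out = dec(enc(f_res) + f_add)           # reconstruction: encode, add, decode
    return f_out
```

Because the residual stage subtracts features rather than adding them, near-zero (black-background) regions contribute little, which is how the module suppresses the background's influence.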

Convolutional Block Attention Module (CBAM)
In contrast to the traditional approach of computing 3D attention maps, CBAM divides attention into channel attention and spatial attention. The architecture is shown in Figure 2. Feature F is first fed into the channel attention module, where it is copied to two branches and subjected to maximum pooling and average pooling to yield F_max and F_avg. F_max and F_avg are fed into the fully connected layer to obtain the channel weights, which are then summed and activated (the activation function is ELU) to generate the channel attention. The channel-attention-weighted feature F_1 is fed into the spatial attention module and copied to two branches for maximum pooling and average pooling to obtain F_1_max and F_1_avg. F_1_max and F_1_avg are spliced to yield F_concat, and the spliced features are then fed into a 1 × 1 convolutional layer and an ELU activation layer to generate the spatial attention.
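A NumPy sketch of the two-stage attention flow. The sigmoid gating and the 0.5 mixing weights stand in for learned layers (standard CBAM uses sigmoid gating, while the text states ELU; we keep sigmoid here so the weights stay in [0, 1]):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w):
    """f: (C, H, W). Global max- and average-pooling per channel, a shared
    linear map w (stand-in for the fully connected layers), sum, then gate."""
    f_max = f.max(axis=(1, 2))                 # (C,)
    f_avg = f.mean(axis=(1, 2))                # (C,)
    weights = sigmoid(w @ f_max + w @ f_avg)   # (C,)
    return f * weights[:, None, None]

def spatial_attention(f):
    """Max- and average-pool across channels, mix (stand-in for the 1x1
    conv over the concatenated maps), then gate each spatial location."""
    f_max = f.max(axis=0)                      # (H, W)
    f_avg = f.mean(axis=0)                     # (H, W)
    attn = sigmoid(0.5 * f_max + 0.5 * f_avg)  # stand-in for learned conv weights
    return f * attn[None, :, :]

def cbam(f, w):
    """Channel attention followed by spatial attention, as in the text."""
    return spatial_attention(channel_attention(f, w))
```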

Loss Function
For the special requirements of texture details for multimodal medical image fusion tasks, we design a series of semi-supervised loss functions, which include generator loss and discriminator loss.

Generator Loss
The loss function of the generator is composed of an adversarial loss, a pixel-level Euclidean loss and a gradient loss, and can be calculated from Equation (10), where L_Gan denotes the adversarial loss of the generative adversarial network; L_pixel is the absolute pixel distance loss optimized using the screening map; and L_grad denotes the gradient loss based on the gradient map. λ_1 and λ_2 are the weights of the pixel distance loss and the gradient loss, respectively, which ensure that the three loss terms are of comparable importance.
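Read together with the weights defined above, Equation (10) presumably takes the standard weighted-sum form. The following is a reconstruction consistent with the text, not the paper's exact typesetting:

```latex
% Total generator loss (cf. Equation (10)):
L_{G} = L_{Gan} + \lambda_{1}\, L_{pixel} + \lambda_{2}\, L_{grad}
```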

Adversarial Loss
In order for the generator to generate images closer to the ideal fused image, a loss needs to be built between the generator and the discriminator. The traditional adversarial loss reduces the max-min problem to minimizing log(1 − D(G(I))). However, at the beginning of the training phase, log(1 − D(G(I))) may saturate, so the generator network is instead trained by maximizing D(G(I)). To provide a stronger gradient, a square operation is added to the maximization operation. The definition of L_Gan is as follows.
where N is the number of images in one batch during the training period; c is the label the discriminator uses to identify true and false images, and here c = 1; ∇ denotes the Laplace operator for the gradient map calculation; and M, I denote the input MRI image and the CT image or the Y-channel of the PET or SPECT image, respectively.

Pixel-Level Euclidean Loss
Once the pixels at the corresponding positions of the input image and the fused image are taken, their absolute distances can be calculated; the smaller the distance, the closer the intensity of the two images (see Equation (12)), where x, y denote the pixel position in the xth row and yth column; X, Y are the height and width of the image, respectively; and Map_1 and Map_2 represent the filtered maps generated by the judgment block based on the two input images.

Gradient Loss
The gradient of an image partially characterizes its texture details, all the more so for contrast-sharp MRI images; thus the fused image is required to have a gradient similar to that of the input images. Combined with the screening map, the gradient loss is computed as follows.

Discriminator Loss
Not only does the generator need a loss function to optimize the quality of the fused image, but the discriminator also needs a loss function to accurately identify true and false fused images. In this paper, a loss function based on the gradient map is designed for the discriminator, where the "false data" is the gradient map of the fused image, which can be calculated from Equation (14).
The "true data" required by the discriminator comes from the joint gradient map constructed by M and I, which can be calculated from Equation (15).
where abs denotes the absolute value function; and maximum represents the maximization function.
Based on the two gradient plots mentioned above, the loss can be computed from Equation (16).
where a is the label of the "false data", set to 0; and b is the label of the "true data", set to 1. This causes the discriminator to treat the joint gradient map of the input images as true data and the gradient map of the fused image as false data. This constraint guides the generator to adjust Grad_1 toward Grad_2, enhancing the texture of the fused image through the adversarial process.
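With a = 0 and b = 1 as defined above, a least-squares discriminator loss consistent with the text would read as follows (a reconstruction, using Grad_1 for the fused gradient map and Grad_2 for the joint gradient map):

```latex
% Discriminator loss (cf. Equation (16)):
L_{D} = \frac{1}{N}\sum_{n=1}^{N}
    \Big[ \big(D(\mathrm{Grad}_{2}^{\,n}) - b\big)^{2}
        + \big(D(\mathrm{Grad}_{1}^{\,n}) - a\big)^{2} \Big]
```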

Experimental Results and Analysis
To verify the superiority of our BPDGAN in multi-modal medical image fusion, a number of experiments were conducted on publicly available datasets.

Training Details
The PET and MRI images employed in this experiment were obtained from the publicly available datasets on the Harvard Medical School website. It should be noted that our model does not require labeled data for training; these datasets are used only to verify the validity of our model. The PET images are three-channel pseudo-color images of size 256 × 256, and the MRI images are single-channel grayscale images of the same size.
During adversarial training, the batch size is set to b and one iteration is divided into K steps. The total number of training epochs is M, and the discriminator is trained p times as often as the generator. After extensive experiments, the hyperparameters were set as follows: b = 32, p = 2, M = 300; the parameters were updated using the Adam optimizer.

Quantitative Evaluation Indicators
Four evaluation indicators are adopted in this paper: Q_abf, Q_cv, AG and EI. Q_abf focuses on local information and uses it to measure the ability of the fused image to preserve important information of the input images [27]. Q_abf can be used to gauge the quality of the fused image, as shown in Equation (17).
where W is used to divide the local area; λ(ω) represents the weight of the local area; A, B are the input images; and F is the fused image. The quality of the local area image can be expressed by Q_cv, which calculates the mean square error of the weighted difference image between the fused area image and the source area image; the quality of the fused image is then a weighted sum of the local area image quality measures [28]. The Q_cv equation is formulated as follows.
where D is the local similarity function. AG (average gradient) can reflect the image's ability to represent details and textures and is often used to quantify the sharpness of an image [29]. For an image with size H × W, its average gradient can be computed from Equation (19).
where (h, w) represents the image coordinates; and ∂g/∂h, ∂g/∂w indicate the gradient information of the image in the vertical and horizontal directions. The average gradient value reflects the amount of information contained in the image, and it can evaluate the fusion effect.
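AG as described above can be computed with forward differences. The 1/√2 normalization below is the common convention and may differ from the paper's exact Equation (19):

```python
import numpy as np

def average_gradient(img):
    """AG via forward differences on the (H-1)x(W-1) interior: the mean of
    sqrt((dh^2 + dw^2) / 2), a common formulation of the average gradient."""
    g = img.astype(np.float64)
    dh = g[1:, :-1] - g[:-1, :-1]   # vertical differences
    dw = g[:-1, 1:] - g[:-1, :-1]   # horizontal differences
    return np.sqrt((dh ** 2 + dw ** 2) / 2.0).mean()
```

A flat image scores 0, and sharper, more textured images score higher, matching AG's use as a sharpness proxy.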
EI (edge intensity) represents image quality and sharpness; the edge strength is positively correlated with the sharpness of the image [30]. For an image F of size H × W, the edge intensity can be calculated from Equation (20), where h_x and h_y are the Sobel operators in the x and y directions, respectively.
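Correspondingly, EI can be sketched as the mean Sobel gradient magnitude. Whether the paper averages or sums over pixels is not stated, so the mean here is an assumption:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def conv2_valid(img, k):
    """Plain 'valid' 2-D correlation with a 3x3 kernel."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * img[i:h - 2 + i, j:w - 2 + j]
    return out

def edge_intensity(img):
    """EI as the mean Sobel gradient magnitude (one common definition)."""
    g = img.astype(np.float64)
    sx = conv2_valid(g, SOBEL_X)
    sy = conv2_valid(g, SOBEL_Y)
    return np.sqrt(sx ** 2 + sy ** 2).mean()
```

Since the Sobel magnitude is sign-invariant, correlation versus convolution makes no difference to the EI value.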

Quantitative and Qualitative Comparison Results
To verify the effectiveness of our BPDGAN for multi-modal image fusion, six state-of-the-art methods, DDcGAN [31], DenseFuse, GCF [32], IFCNN [33], PMGI and U2Fusion [34], are compared with our approach in this paper.
In order to solve the brightness destruction and edge blurring caused by large black backgrounds, the back-projection module is introduced to mitigate the negative impact of invalid information on fusion through residual operations. Qualitative results show that BPDGAN has clear details, high image quality, significant edge contrast, and no loss of luminance information.
In the CT-MRI fusion task shown in Figure 3, the DDcGAN texture is missing, the gray value is too large, and the information is incompletely preserved. DenseFuse has blurred edges and little texture detail. MRI information is lost in the white areas of GCF. In IFCNN, some fields of the original CT image are poorly fused. A large amount of MRI information is lost in PMGI, and the visual effect is poor. Distinct from other approaches, a structural loss function and a gradient-based adversarial loss function are put forward to protect high-frequency information and texture gradient information, respectively, and to enhance the fused images through nonlinear loss constraints; BPDGAN preserves high-frequency information best in all three tasks. Besides these qualitative experiments, quantitative evaluation is also carried out, and the results for CT-MRI are shown in Table 1.
The results of the PET-MRI fusion task are shown in Figure 4. It can be clearly seen that the grayscale information of DDcGAN, DenseFuse, IFCNN, PMGI and U2Fusion is corrupted to different degrees, with poor visual effects and imperfect preservation of texture features. Although GCF protects the brightness of the fusion result, its fused image has blurred edges and poor quality compared with our BPDGAN. Taking the PET-MRI task as an example, BPDGAN has the least luminance loss and the best texture. The network also incorporates a CBAM module to increase the focus on lesion information and speed up training, which can localize lesions more accurately; previous methods sharpen edges by direct target enhancement and gradient operations. A quantitative comparison of PET-MRI results is shown in Table 2.
The results of the SPECT-MRI fusion task are shown in Figure 5. DDcGAN destroys the spectrum during fusion, and the edge gradient is not obvious. The fused images of DenseFuse have low brightness and poor visual effect. GCF has better color preservation, but there are noise blocks in the image and structural information is lost, which is unacceptable in medical images. IFCNN is blurred near the boundary line, losing details, and the focus is not prominent, factors which can easily lead to clinical misjudgment. PMGI suffers from serious background defocusing and loses a great deal of structural information; its functional information is well preserved, but the overall result is too blurred for practical use. A quantitative comparison of SPECT-MRI results is shown in Table 3.
It can be seen that, for the PET-MRI and SPECT-MRI fusion tasks, our BPDGAN outperformed the comparison methods in all metrics, and BPDGAN had the best combined metrics for the CT-MRI task. The EI index of BPDGAN outperformed the rest of the algorithms in all three modes; in the PET-MRI task, for example, it was 10.79% higher than the state-of-the-art comparison, which indicates a clear edge texture and fully validates the role of BPDB in eliminating the negative effects of a black background. The specific structural and adversarial loss functions gave BPDGAN a clearer gradient texture, reflected by the AG index, which was 9.6% higher than the state-of-the-art comparison in the CT-MRI fusion task. The Q_cv index is based on the regional mean square error of the human visual system (HVS); with the benefit of the CBAM module, BPDGAN can adaptively determine pixel weights to improve regional similarity and thus the visual quality of the fusion results. The Q_cv metric of BPDGAN was lower than the second-best method by 68.2%, which proves its higher regional similarity and stronger human-eye perception than the other methods. Our model adopted a pixel-scale control strategy, so the Euclidean distance between pixels was well controlled, and the pixel-level fusion index Q_abf of our model was the highest in the PET and SPECT tasks and second only to GCF in the CT task. In short, the visual information of our model was well preserved, and the pixel-level gap between the fused image and the source image was small.
In short, the visual information of our model was perfect, and the pixel-level gap between fused image and source image was small.
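For reference, the AG (average gradient) metric quoted above can be computed as follows. Exact definitions vary slightly across fusion papers; this sketch assumes the common forward-difference form and is not taken from the BPDGAN implementation:

```python
import numpy as np

def average_gradient(img: np.ndarray) -> float:
    """Average gradient (AG): mean magnitude of local intensity
    changes; higher values indicate sharper, more detailed images."""
    img = img.astype(np.float64)
    # Forward differences along columns and rows, cropped to a common shape.
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))
```

A uniform image yields AG = 0, while a linear intensity ramp with unit slope yields AG = 1/sqrt(2), matching the intuition that AG rewards visible gradients.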

Ablation Experiments
We study the effect of different combinations of modules and of loss functions on our network. The results for module combinations are shown in Table 4, and those for loss-function combinations in Table 5. In both sets of ablation experiments, all evaluation metrics showed substantial improvement, demonstrating the effectiveness of our design. In the architecture ablation, the evaluation metrics (taking AG as an example) were 6.38% and 2.55% higher with the addition of BPDB and CBAM, respectively, compared to the values registered with the BackBone only, and the AG was 9.06% higher with both BPDB and CBAM added. This proves that our design performs well in terms of edge, texture and fusion quality. In the loss-function ablation, the AG was 3.31% and 2.50% higher with both the pixel loss and the gradient loss than with the pixel loss alone and the gradient loss alone, respectively. These results demonstrate that L_pixel improves pixel-level performance and L_grad further improves fusion performance under our tactic.
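As an illustration of how a pixel term and a gradient term can be combined into one objective, the following numpy sketch mirrors the structure of the ablated losses. The weight `lam` and the max-gradient target are assumptions made for illustration only; the actual BPDGAN loss formulation is the one defined in the method section:

```python
import numpy as np

def gradient_map(img: np.ndarray) -> np.ndarray:
    # First-order forward differences, padded to keep the input shape.
    gx = np.diff(img, axis=1, append=img[:, -1:])
    gy = np.diff(img, axis=0, append=img[-1:, :])
    return np.sqrt(gx ** 2 + gy ** 2)

def fusion_loss(fused, src_a, src_b, lam=0.5):
    """Illustrative objective: pixel-level L2 distance to both sources
    plus an L1 penalty pulling the fused gradient map toward the
    stronger source gradient (texture term)."""
    l_pixel = np.mean((fused - src_a) ** 2) + np.mean((fused - src_b) ** 2)
    l_grad = np.mean(np.abs(gradient_map(fused)
                            - np.maximum(gradient_map(src_a),
                                         gradient_map(src_b))))
    return l_pixel + lam * l_grad
```

The max over source gradients encodes the intuition that the fused image should keep the sharpest edge available at each location, which is what the gradient-loss ablation row measures.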

Future Direction: BPDGAN vs. SwinFusion
Recently, Ma et al. [35] proposed SwinFusion, a novel transformer-based multimodal fusion pipeline that exploits long-range dependency modeling. To compare BPDGAN with SwinFusion, we conducted the comparisons shown in Figure 6 and Table 6. These results show that SwinFusion performs better than our BPDGAN, though SwinFusion has a large parameter count and is trained on a large dataset. On the model side, we find that transformer-based methods can better attend to global information; this is a direction for our further improvement. On the data side, existing work [36] shows that the quality of the dataset plays an important role in the whole issue, and the SwinFusion vs. BPDGAN comparison also supports this point.

Conclusions
In this paper, we presented BPDGAN for multimodal medical image fusion. Unlike previous direct fusion approaches, we first leveraged a YCbCr-based color space approach to fuse texture and structure information without loss of spectral information. Secondly, in the feature fusion stage, BPDB was proposed to reduce the negative effects that a black background brings to the fused image, such as overall brightness reduction and blurred edges. Meanwhile, CBAMs were incorporated to make our model focus on the location and structure of the lesion rather than on the large black background. Finally, the model was trained in an end-to-end manner and did not rely on artificial prior knowledge at all. We conducted both quantitative and qualitative experiments to demonstrate the superiority of BPDGAN against state-of-the-art methods. In the future, we will focus on the study of novel attention mechanisms of clinical importance and continue to further improve the performance of BPDGAN.
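The YCbCr-based strategy summarized above can be sketched as follows: only the luminance channel is fused with the MRI image, while the chroma channels of the functional image pass through unchanged, preserving spectral information. This is a minimal sketch using standard full-range BT.601 coefficients; `fuse_luma` is a hypothetical stand-in for the trained generator, and BPDGAN's exact conversion may differ:

```python
import numpy as np

# ITU-R BT.601 full-range RGB <-> YCbCr conversion on floats in [0, 1].
def rgb_to_ycbcr(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return np.stack([y, cb, cr], axis=-1)

def ycbcr_to_rgb(ycc):
    y, cb, cr = ycc[..., 0], ycc[..., 1] - 0.5, ycc[..., 2] - 0.5
    r = y + 1.402 * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772 * cb
    return np.stack([r, g, b], axis=-1)

def fuse_ycbcr(functional_rgb, mri_gray, fuse_luma):
    """Fuse only the luminance channel; Cb/Cr from the functional
    image (PET/SPECT) pass through, so color information is kept."""
    ycc = rgb_to_ycbcr(functional_rgb)
    ycc[..., 0] = fuse_luma(ycc[..., 0], mri_gray)  # generator goes here
    return ycbcr_to_rgb(ycc)
```

Because the chroma planes never enter the network, any luminance-only fusion rule plugged into `fuse_luma` leaves the functional image's color map intact.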
