2.1. The Overview of Our Proposed Method
Given 3D multimodal medical images as input, we propose a 3D multiscale wavelet convolutional neural network (3DWaFusion) for multimodal medical image fusion. As shown in
Figure 1, the proposed 3DWaFusion follows a sequential pipeline consisting of the following major stages: 3D DWT-based wavelet decomposition, GLFC-based global-local feature calibration, PGMF-based pyramid group-wise multiscale interaction, voxel-wise weighted fusion, and 3D IDWT reconstruction for fusion. The proposed 3DWaFusion framework is trained in an end-to-end manner. Specifically, the 3D DWT sequentially performs 1D wavelet decomposition along the width, height, and depth dimensions of 3D medical volumes, effectively decoupling global structural information and local detailed features while reducing information redundancy. The GLFC module first splits the wavelet-decomposed features via depthwise separable convolution (DSC), then employs dual parallel branches to calibrate feature distributions, enhancing feature discriminability and consistency. Furthermore, the PGMF module conducts frequency band grouping and multi-scale pyramid construction, followed by intra-group cross-scale interaction and inter-group cross-modal interaction, fully exploiting complementary information between different modalities and scales. Finally, the voxel-wise weighted averaging strategy fuses the interacted features, and 3D IDWT reconstructs the final fused volume, ensuring the preservation of anatomical structures and fine details.
First, 3D DWT sequentially decomposes each input volume along the three spatial dimensions to generate multi-frequency band features, which decouple global structural information from local details while reducing redundancy. Second, the GLFC module takes the decomposed features as input and performs dual-branch global-local feature calibration, outputting enhanced single-modal features for each modality. Third, the PGMF module conducts group-wise multi-scale extraction and cross-modal interaction on the calibrated features of the two modalities, yielding a compact and highly representative fused feature with integrated multi-scale contextual information. The fused feature is fed into a lightweight weight generation branch to produce a voxel-wise weight mask m that adaptively measures the contribution of each modality at every spatial position. The parameter m in Equation (1) is not a manually selected hyperparameter. Instead, it is an adaptive voxel-wise weight mask learned by the proposed network. The fused wavelet-domain feature is obtained through voxel-wise weighted averaging:

$$F_{fused} = m \odot F_1 + (1 - m) \odot F_2 \tag{1}$$

where $F_1$ and $F_2$ denote the 3D DWT features of the two modalities and ⊙ denotes element-wise multiplication. The final fused volume is reconstructed by 3D IDWT, which preserves zero-intensity background regions and suppresses spurious non-zero voxels.
2.2. Three-Dimensional Discrete Wavelet Transformation (3D DWT)
Similar to [28], we adopt the 3D discrete wavelet transformation (3D DWT) as the fundamental frequency-domain decomposition unit of the proposed 3D multiscale wavelet convolutional neural network (3DWaFusion). It is used to extract multi-scale spatial structures and reduce redundant information in 3D multimodal medical volumes, separating low-frequency global structures from high-frequency details without introducing excessive redundancy. The details can be found in
Figure 2. Given an input 3D medical image
of the
mth modality, where
L,
H,
W represent the depth, height and width of the volumetric data, respectively, the 3D DWT executes a 1D discrete wavelet transform sequentially along the width, height and depth dimensions to achieve hierarchical multi-frequency decomposition.
Specifically, 3D DWT decomposes the input volume into eight distinct frequency sub-bands, including one low-frequency sub-band (LLL) that encodes global anatomical structures and seven high-frequency sub-bands (LLH, LHL, LHH, HLL, HLH, HHL, and HHH) that capture fine textures, edges and local detail variations. After decomposition, the spatial resolution of each frequency sub-band is reduced to half of the input along every dimension, i.e., $\frac{L}{2} \times \frac{H}{2} \times \frac{W}{2}$, which reduces the computational burden while retaining complete contextual information. The multi-frequency sub-band set generated by 3D DWT for the
mth modality is formally defined as:
where
denotes the frequency sub-band feature of the corresponding component.
Mathematically, the 3D DWT first operates on the width dimension (
W) of
using 1D DWT, denoted as
. Here,
performs 1D wavelet decomposition along the specified dimension
. The output of this step is
formulated as:
where
contains the low-frequency and high-frequency coefficients after decomposition along the width dimension. We then concatenate the first two dimensions (i.e., the decomposed coefficient dimension and the original channel dimension) using the
operation, resulting in a feature tensor of size
.
Next, we apply 1D DWT on the height dimension (
H) of
to obtain the intermediate feature:
Subsequently, we flatten and concatenate the first two dimensions of
to unify the channel representation:
yielding a tensor with shape
. One-dimensional DWT is then performed on the depth dimension (
L) of
:
Subsequently, we adopt
to concatenate the first two dimensions and derive the final decomposed feature
:
After dimension fusion, the output feature tensor has a size of
, which corresponds to the eight distinct frequency sub-bands, including one low-frequency LLL component and seven high-frequency components.
By decoupling structural and detailed information in the frequency domain through sequential 1D DWT operations, the 3D DWT provides compact and discriminative multi-frequency feature inputs for the subsequent Global and Local Feature Calibration (GLFC) module. The calibrated features are then delivered to the Pyramid Group-wise Multiscale Feature Interaction (PGMF) module for multimodal and multi-scale feature fusion, laying a solid foundation for the overall multimodal medical image fusion framework of 3DWaFusion.
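The sequential decomposition described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the text does not name the wavelet basis, so the orthonormal Haar wavelet is assumed, and the function names `haar_dwt_1d` and `dwt3d` as well as the `(C, L, H, W)` tensor layout are hypothetical.

```python
import numpy as np

def haar_dwt_1d(x, axis):
    """One-level orthonormal Haar DWT along `axis`; returns (low, high)."""
    x = np.moveaxis(x, axis, -1)
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(low, -1, axis), np.moveaxis(high, -1, axis)

def dwt3d(volume):
    """Decompose a (C, L, H, W) volume into 8 sub-bands of shape (8*C, L/2, H/2, W/2)."""
    feats = volume
    # Spatial axes counted from the end: W = -1, H = -2, L (depth) = -3.
    for axis in (-1, -2, -3):
        low, high = haar_dwt_1d(feats, axis)
        # Fold the two coefficient sets into the channel axis, low band first.
        feats = np.concatenate([low, high], axis=0)
    return feats

print(dwt3d(np.zeros((1, 8, 8, 8))).shape)  # (8, 4, 4, 4)
```

With a single input channel, channel 0 of the output is the LLL band and the remaining seven channels are the high-frequency bands. The spatial dimensions are assumed to be even.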
2.3. Global and Local Feature Calibration (GLFC)
Following the 3D DWT decomposition, the multi-band frequency feature
is rich in anatomical structures and textural details but lacks adaptive calibration between global contextual dependencies and local fine-grained details. To address this issue, we propose the Global and Local Feature Calibration (GLFC) module shown in
Figure 3, which adopts dual parallel branches to perform feature calibration and enhancement for each modality, providing compact features for the subsequent PGMF module.
Formally, the input of GLFC is the 3D DWT output of the
ith modality:
(
), where
i denotes the modality index,
is the concatenated channel dimension of eight frequency bands, and
is the unified spatial resolution after wavelet decomposition. First, a depthwise separable convolution (DSC) is applied to split
into two parallel branch inputs:
where
is fed into the global feature calibration branch and
into the local feature calibration branch.
2.3.1. Global Feature Calibration Branch
This branch models long-range contextual dependencies via a 3D self-attention mechanism to capture global anatomical structures. First,
is processed by
and
convolutions to generate query (Q), key (K), and value (V) features:
The attention map A is computed by matrix multiplication and softmax normalization:

$$A = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)$$

where d is the dimension of Q and K. The attention-weighted feature is then obtained by multiplying A with V, followed by a residual connection with
and a
convolution to produce the global-calibrated feature
:
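To make the global branch concrete, the following NumPy sketch applies scaled dot-product self-attention over all voxels of a feature volume. It is an illustration under stated assumptions, not the authors' code: the query, key and value convolutions are modelled as plain channel-mixing matrices (`w_q`, `w_k`, `w_v`, all hypothetical), and the final convolution after the residual connection is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(feat, w_q, w_k, w_v):
    """feat: (C, L, H, W); w_q, w_k, w_v: (C, C) hypothetical pointwise projections."""
    c = feat.shape[0]
    tokens = feat.reshape(c, -1).T                   # (N, C), one token per voxel
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, N), each row sums to 1
    out = attn @ v + tokens                          # attention output plus residual
    return out.T.reshape(feat.shape), attn
```

The attention map has one row per voxel, so its memory cost grows quadratically with the number of voxels. Operating on the half-resolution wavelet features keeps that number eight times smaller than in the input volume.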
2.3.2. Local Feature Calibration Branch
This branch refines local spatial details (edges, textures) using channel-wise attention. First, 3D global average pooling (GAP) compresses
into a compact channel descriptor
:
Two fully connected (FC) layers with ReLU and Sigmoid activations generate channel-wise attention weights:
where
denotes ReLU activation and
denotes Sigmoid activation. The local-calibrated feature
is obtained by channel-wise multiplication between
and
:
where ⊙ represents channel-wise multiplication. The global-calibrated feature
and local-calibrated feature
are concatenated along the channel dimension:
where
denotes channel-wise concatenation. Subsequently, a spatial attention module is applied to refine
:
where AvgPool and MaxPool are average and max pooling operations, respectively. The final calibrated feature of the
ith modality is obtained by element-wise multiplication between
and
:
where
is the output of GLFC for the
ith modality. After processing both modalities, the calibrated features
and
are fed into the subsequent Pyramid Group-wise Multiscale Feature Interaction (PGMF) module for multimodal and multi-scale feature fusion.
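The local branch and the subsequent spatial attention can be sketched as follows. This is a simplified NumPy illustration, not the authors' implementation: the fully connected weights `w1` and `w2` are hypothetical, the pooling in the spatial attention is assumed to run along the channel axis, and the convolution that mixes the two pooled maps is replaced by two scalar weights (`w_avg`, `w_max`) for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """feat: (C, L, H, W); w1: (C, C_r), w2: (C_r, C) hypothetical FC weights."""
    z = feat.mean(axis=(1, 2, 3))                # 3D global average pooling -> (C,)
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)    # FC -> ReLU -> FC -> Sigmoid
    return feat * s[:, None, None, None]         # rescale every channel

def spatial_attention(feat, w_avg=1.0, w_max=1.0):
    """Gate each voxel using channel-wise average and max maps."""
    avg_map = feat.mean(axis=0)                  # (L, H, W)
    max_map = feat.max(axis=0)                   # (L, H, W)
    # Assumption: a two-weight mixing stands in for the convolution layer.
    gate = sigmoid(w_avg * avg_map + w_max * max_map)
    return feat * gate[None]                     # broadcast the gate over channels
```

In this sketch the GLFC output for one modality would be `spatial_attention(np.concatenate([global_feat, local_feat], axis=0))`, where `global_feat` and `local_feat` are the outputs of the two branches.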
2.4. Pyramid Group-Wise Multiscale Feature Interaction (PGMF) Module
The calibrated features of the two modalities output by the GLFC module contain rich anatomical structures and local details for each modality but lack effective multi-scale and cross-modal interaction. To address this, inspired by the existing pyramid mechanism, we propose the Pyramid Group-wise Multiscale Feature Interaction (PGMF) module, which leverages a group-wise multi-scale extraction strategy, intra-group cross-modal interaction, and pyramid progressive fusion to achieve efficient multimodal feature integration.
To obtain group-wise multi-scale feature layers for each single modality, the input calibrated features
and
are first fed into three parallel 3D convolutional branches with kernel sizes of $1 \times 1 \times 1$, $3 \times 3 \times 3$, and $7 \times 7 \times 7$, respectively. Each branch acts as an independent group to extract features with different receptive fields and resolutions, achieving group-wise multi-scale feature decoupling. The process is formulated as:
where Conv1, Conv3, and Conv7 denote 3D convolutions with kernel sizes of $1 \times 1 \times 1$, $3 \times 3 \times 3$, and $7 \times 7 \times 7$, respectively.
Next, we perform
intra-group cross-modal interaction to fuse features from different modalities within the same group, avoiding interference between different scale groups. Specifically, features from the same group of two modalities are concatenated along the channel dimension, and a
3D convolution is applied for channel compression and feature refinement:
where
denotes channel-wise concatenation.
Finally, a pyramid progressive fusion strategy is adopted to integrate multi-scale information. The deepest feature
is upsampled (UP) to restore the spatial resolution and added with
to obtain
. After refinement by a
convolution, the same process is applied to fuse with
, yielding the final multi-scale enriched feature
:
By integrating group-wise multi-scale extraction, intra-group cross-modal interaction, and pyramid progressive fusion, the proposed PGMF module effectively enhances both local details and global contextual representation, providing robust features for subsequent voxel-wise weighted averaging fusion and 3D IDWT reconstruction. To further improve readability, we provide a compact pseudocode summary of the GLFC and PGMF modules in Algorithm 1. This algorithm summarizes the main computational flow corresponding to Equations (8)–(28), allowing readers to understand the proposed modules from an implementation-oriented perspective.
Algorithm 1: Summary of GLFC and PGMF Procedures (pseudocode figure not reproduced)
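The intra-group interaction and pyramid progressive fusion of the PGMF module can be sketched in NumPy as below. Several details are assumptions made only for illustration: the three groups are taken to differ in spatial resolution by a factor of two, upsampling is nearest-neighbour, and both the channel compression and the refinement are modelled as pointwise convolutions. The names `conv1x1x1`, `upsample2` and `pgmf_fuse` are hypothetical.

```python
import numpy as np

def conv1x1x1(feat, weight):
    """Pointwise 3D convolution: mix channels with a (C_out, C_in) weight."""
    return np.einsum("oc,clhw->olhw", weight, feat)

def upsample2(feat):
    """Nearest-neighbour upsampling by 2 along the three spatial axes."""
    for axis in (1, 2, 3):
        feat = np.repeat(feat, 2, axis=axis)
    return feat

def pgmf_fuse(groups_a, groups_b, mix_weights, refine_weight):
    """groups_a, groups_b: three per-modality features ordered fine -> coarse."""
    # Intra-group cross-modal interaction: concatenate, then compress channels.
    fused = [conv1x1x1(np.concatenate([a, b], axis=0), w)
             for a, b, w in zip(groups_a, groups_b, mix_weights)]
    # Pyramid progressive fusion: start from the coarsest group and move up.
    mid = conv1x1x1(upsample2(fused[2]) + fused[1], refine_weight)
    return upsample2(mid) + fused[0]
```

Fusing within each group before any cross-scale step keeps features of different receptive fields from mixing until the pyramid stage, which is the separation the text describes.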
2.5. Voxel-Wise Weighted Averaging and 3D Inverse DWT (3D IDWT) Reconstruction Module
In conventional multimodal medical image fusion, non-zero voxels often appear in background regions of the fused image that have zero intensity in the source images. Such inaccurate fusion degrades the quality of the final result and may interfere with subsequent clinical diagnosis and analysis. To address this issue, we design a reconstruction framework based on a voxel-wise weighted averaging strategy and 3D inverse discrete wavelet transform (3D IDWT). The input of this module is the final multi-scale cross-modal fusion feature output by the PGMF module, which already encodes complementary anatomical structure information, fine-grained lesion details and multi-scale contextual dependencies from the two source modalities. Through the proposed framework, we map the high-dimensional fusion feature to adaptive voxel-wise fusion weights, and complete wavelet-domain fusion and image reconstruction while preserving the zero-intensity property of the background region.
The high-dimensional semantic feature cannot be directly used for image reconstruction, as it needs to be mapped to a voxel-wise weight space that matches the spatial size of the original wavelet-domain features. To this end, we design a lightweight weight generation branch based on depthwise separable convolution (DSC) with residual connection, which balances feature representation capability and computational efficiency for 3D volumetric medical data.
For a 3D input feature
, the 3D DSC operation consists of two sequential steps: 3D depthwise convolution and 3D pointwise convolution, formally defined as:
where
denotes 3D depthwise convolution with a
kernel, which applies an independent convolution kernel to each input channel to extract spatial features;
is 3D pointwise convolution that fuses cross-channel information. Compared with standard 3D convolution, DSC significantly reduces the number of parameters and computational complexity, making it suitable for efficient feature extraction on 3D volumetric medical data.
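The parameter saving of DSC can be checked with simple arithmetic. The sketch below counts weights (biases omitted) for a standard 3D convolution and for a depthwise plus pointwise pair. The layer sizes are hypothetical values chosen only for illustration, and a cubic kernel of size 3 is assumed.

```python
def conv3d_params(c_in, c_out, k):
    """Weights of a standard 3D convolution (bias omitted)."""
    return k ** 3 * c_in * c_out

def dsc3d_params(c_in, c_out, k):
    """Weights of a depthwise (k^3 per channel) plus pointwise (1x1x1) pair."""
    return k ** 3 * c_in + c_in * c_out

# Hypothetical layer sizes chosen only for illustration.
c_in, c_out, k = 64, 64, 3
standard = conv3d_params(c_in, c_out, k)    # 27 * 64 * 64 = 110592
separable = dsc3d_params(c_in, c_out, k)    # 27 * 64 + 64 * 64 = 5824
print(standard, separable, round(standard / separable, 1))  # 110592 5824 19.0
```

In general the ratio of separable to standard weights is $1/C_{out} + 1/k^3$, so the saving grows with both the kernel size and the number of output channels.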
First,
is fed into the first
DSC layer for initial deep feature extraction, which preserves the spatial resolution while extracting deep fusion features:
where
denotes the first DSC layer and
is the extracted feature with the same spatial size as
. Then,
is sent to the second
DSC layer for further feature refinement:
where
denotes the second DSC layer. To preserve the critical anatomical structure and detail information from the original PGMF output and mitigate the gradient vanishing problem in deep network propagation, we introduce a residual shortcut connection that directly adds the input feature
to the refined feature
:
where
is the residual-fused feature. This residual fusion mechanism ensures that the global structural prior and local lesion details from the source images are not lost during the convolution transformation, which is of vital importance for the diagnostic value of the final fused image.
Next, the residual-fused feature
is processed by a third
DSC layer, which compresses the channel dimension of the feature to one, mapping the high-dimensional fusion feature to a single-channel weight map with exactly the same spatial resolution as the wavelet-domain band features:
where
denotes the third DSC layer and
aligns perfectly with the spatial size of the 3D DWT outputs
and
. After that, a Sigmoid activation function is applied to normalize the weight map into the range of
, generating the final voxel-wise weight mask
m:
where
denotes the Sigmoid activation function. Each element
corresponds to a voxel at position
in the 3D volume, and its value represents the adaptive contribution weight of Modality 1 at that voxel, while
naturally corresponds to the contribution weight of Modality 2. This adaptive weight learning mechanism enables the network to automatically assign higher weights to the modality with richer information at each voxel. In the tissue region, higher weights are assigned to the modality with clearer anatomical structures. In the zero-intensity background region, the wavelet coefficients of both modalities are zero, so their weighted average remains zero for any value of m, which preserves the zero-value property of the background and avoids non-zero artifacts in the fused image.
With the learned voxel-wise weight mask
m, we perform weighted fusion in the wavelet domain, which is the core of our reconstruction framework. Unlike conventional spatial-domain fusion, which tends to cause structure blurring and detail loss, wavelet-domain fusion operates on the decoupled structural and detailed features and can therefore better preserve the complementary information from different modalities. Specifically, we apply the voxel-wise weight mask to the multi-frequency band features of the two modalities directly output by the 3D DWT module. For each voxel position
in the 3D volume, the fusion process is formulated as:
The matrix form of the voxel-wise fusion is written as:
where ⊙ denotes element-wise (voxel-wise) multiplication,
and
denote the complete 8-subband wavelet features of Modality 1 and Modality 2, respectively. This fusion strategy ensures that the low-frequency structural information and high-frequency detailed information from both modalities are fused in a decoupled and adaptive manner, avoiding the mutual interference between structural and detailed features during fusion.
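The voxel-wise weighted averaging can be illustrated with a short NumPy sketch. The array names, shapes and random values are hypothetical. The single-channel mask is broadcast over the eight sub-bands, and the example marks part of the volume as background to show that zero coefficients remain zero after fusion.

```python
import numpy as np

def fuse_wavelet(feat1, feat2, mask):
    """feat1, feat2: (8, L, H, W) sub-bands; mask: (1, L, H, W) with values in (0, 1)."""
    return mask * feat1 + (1.0 - mask) * feat2   # mask broadcasts over the 8 sub-bands

rng = np.random.default_rng(0)
f1 = rng.normal(size=(8, 4, 4, 4))
f2 = rng.normal(size=(8, 4, 4, 4))
f1[:, :, :, :2] = 0.0          # pretend the first two columns are background
f2[:, :, :, :2] = 0.0
m = 1.0 / (1.0 + np.exp(-rng.normal(size=(1, 4, 4, 4))))  # Sigmoid output
fused = fuse_wavelet(f1, f2, m)
print(np.abs(fused[..., :2]).max())   # 0.0: zero coefficients stay zero for any mask
```

Because the two weights sum to one at every voxel, each fused coefficient lies between the corresponding coefficients of the two modalities.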
Finally, 3D inverse discrete wavelet transform (3D IDWT/3D IWT) is applied to the fused wavelet feature
. As the exact inverse process of the 3D DWT decomposition in the front-end of the network, 3D IDWT sequentially performs 1D inverse DWT along the depth (
L), height (
H), and width (
W) dimensions, reconstructing the eight fused frequency sub-bands back to the original spatial resolution
of the input source images. The reconstruction process is formally defined as:
where
denotes the 1D inverse DWT operation along the specified dimension
, which is the exact inverse of the 1D DWT used in the 3D DWT module. Benefiting from the perfect reconstruction property of the wavelet transform, this step can restore the fused features to the image space without introducing additional reconstruction error, generating the final fused 3D medical image
that integrates the complementary anatomical structure information and fine-grained lesion detail information from both modalities.
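The perfect reconstruction property can be verified numerically. The sketch below is a minimal NumPy illustration that assumes the orthonormal Haar wavelet, since the wavelet basis is not named in the text. The function names are hypothetical. The forward transform runs along width, height and depth, and the inverse undoes these steps in reverse order.

```python
import numpy as np

def haar_1d(x, axis):
    """Forward Haar step along `axis`; low and high bands are stacked on axis 0."""
    x = np.moveaxis(x, axis, -1)
    low = (x[..., 0::2] + x[..., 1::2]) / np.sqrt(2.0)
    high = (x[..., 0::2] - x[..., 1::2]) / np.sqrt(2.0)
    return np.moveaxis(np.concatenate([low, high], axis=0), -1, axis)

def ihaar_1d(y, axis):
    """Inverse Haar step: split axis 0 into low/high and interleave along `axis`."""
    half = y.shape[0] // 2
    low = np.moveaxis(y[:half], axis, -1)
    high = np.moveaxis(y[half:], axis, -1)
    x = np.empty(low.shape[:-1] + (2 * low.shape[-1],))
    x[..., 0::2] = (low + high) / np.sqrt(2.0)
    x[..., 1::2] = (low - high) / np.sqrt(2.0)
    return np.moveaxis(x, -1, axis)

def dwt3d(vol):        # width -> height -> depth
    for axis in (-1, -2, -3):
        vol = haar_1d(vol, axis)
    return vol

def idwt3d(coeffs):    # depth -> height -> width
    for axis in (-3, -2, -1):
        coeffs = ihaar_1d(coeffs, axis)
    return coeffs

volume = np.random.default_rng(0).normal(size=(1, 8, 8, 8))
print(np.allclose(idwt3d(dwt3d(volume)), volume))  # True
```

Since both transforms are linear, fusing with a constant weight in the wavelet domain is equivalent to fusing in the image domain. A spatially varying mask is what makes the wavelet-domain result differ.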
2.6. Loss Function
The design of the loss function should fully consider the characteristics of different modality source images, to ensure that the critical informative content in the source images is well preserved in the fused result. In this work, we adopt a compound loss function consisting of two components: a 3D structural similarity loss
to preserve anatomical structural information, and an intensity loss
to retain intensity distribution and lesion information. The overall loss function is formulated as:
where
is a weight parameter that balances the proportion between the structural similarity loss and the intensity loss.
The structural similarity loss
is designed based on the Structural Similarity Index Measure (SSIM), which evaluates the similarity between the fused images and the source images in terms of luminance, contrast, and structure. It is formulated as:
where
and
are the source images of the two modalities, and
is the final fused image.
The intensity loss
is based on the mean squared error (MSE) between the fused image and the source images, which is critical for preserving lesion regions with extremely high or low intensity in medical images. It is defined as:
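The SSIM between two images x and y is computed as:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu_x$ and $\mu_y$ denote the mean intensities of images x and y, $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances, $\sigma_{xy}$ denotes the covariance between x and y, and $C_1$ and $C_2$ are small constants (0.0001) used to avoid numerical instability.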
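A compact NumPy sketch of the compound loss is given below. It is a simplification under several assumptions that are not confirmed by the text: SSIM is computed from whole-volume statistics instead of local windows, the structural term is taken as the sum of (1 − SSIM) over the two sources, the intensity term as the sum of the two MSE values, and the balancing weight (`lam`, hypothetical name) multiplies the intensity term.

```python
import numpy as np

def ssim_global(x, y, c1=1e-4, c2=1e-4):
    """SSIM computed from whole-volume statistics (no sliding window)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def fusion_loss(fused, src1, src2, lam=1.0):
    """Assumed form: summed (1 - SSIM) terms plus lam times the summed MSE terms."""
    l_ssim = (1 - ssim_global(fused, src1)) + (1 - ssim_global(fused, src2))
    l_int = np.mean((fused - src1) ** 2) + np.mean((fused - src2) ** 2)
    return l_ssim + lam * l_int
```

When the fused volume equals both sources the loss is zero, and it grows as the fused result departs from either source in structure or intensity.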