Journal of Imaging
  • Article
  • Open Access

27 August 2025

E-CMCA and LSTM-Enhanced Framework for Cross-Modal MRI-TRUS Registration in Prostate Cancer

Pittsburgh Institute, Sichuan University, Chengdu 610207, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging

Abstract

Accurate registration of MRI and TRUS images is crucial for effective prostate cancer diagnosis and biopsy guidance, yet modality differences and non-rigid deformations pose significant challenges, especially in dynamic imaging. This study presents a novel cross-modal MRI-TRUS registration framework, leveraging a dual-encoder architecture with an Enhanced Cross-Modal Channel Attention (E-CMCA) module and an LSTM-Based Spatial Deformation Modeling Module. The E-CMCA module efficiently extracts and integrates multi-scale cross-modal features, while the LSTM-Based Spatial Deformation Modeling Module models temporal dynamics by processing depth-sliced 3D deformation fields as sequential data. A VecInt operation ensures smooth, diffeomorphic transformations, and a FuseConv layer enhances feature integration for precise alignment. Experiments on the μ-RegPro dataset from the MICCAI 2023 Challenge demonstrate that our model achieves a DSC of 0.865, an RDSC of 0.898, a TRE of 2.278 mm, and an RTRE of 1.293 mm, surpassing state-of-the-art methods and performing robustly in both static 3D and dynamic 4D registration tasks.

1. Introduction

Medical image registration is vital for image-guided diagnosis, intervention, and treatment planning, aligning images across modalities, viewpoints, or time points [1]. Cross-modal MRI-TRUS registration, crucial for precise prostate lesion localization in cancer diagnosis, faces challenges from modality-specific intensity distributions, resolution disparities, and non-rigid anatomical deformations [2]. Traditional methods, relying on handcrafted features or iterative optimization, struggle to capture fine-grained anatomical alignment under these conditions [3]. Recent deep learning approaches, utilizing convolutional neural networks (CNNs) and spatial transformer networks (STNs), enhance registration accuracy and efficiency [4]. However, most models fail to address cross-modality semantic misalignment and deformation smoothness, particularly in dynamic contexts like prostate motion [5].
The contributions of the proposed model are multifaceted, addressing key limitations of existing MRI-TRUS registration approaches. First, it introduces a dual-encoder architecture enhanced by the E-CMCA module, which effectively captures and integrates multi-scale, modality-specific features to overcome cross-modal discrepancies and improve semantic correspondence. Second, the LSTM-Based Spatial Deformation Modeling Module enables robust handling of temporal dynamics in 4D scenarios by treating depth-sliced deformation fields as pseudo-temporal sequences, ensuring smoother transformations in dynamic environments such as those affected by respiratory motion. We note that this module does not adopt the Transformer neural network architecture: it extends the traditional spatial transformer network (STN) with an LSTM that models pseudo-time-series deformation dependencies, such as prostate movement during respiratory cycles, combining the STN's spatial transformation capability with the LSTM's sequence modeling; it is unrelated to the self-attention mechanism and other components of the Transformer architecture. Third, the use of VecInt guarantees diffeomorphic and physically plausible deformations, while the FuseConv layer facilitates precise feature fusion for better alignment accuracy. Collectively, these innovations lead to state-of-the-art performance on the μ-RegPro dataset, achieving a DSC of 0.865, an RDSC of 0.898, a TRE of 2.278 mm, and an RTRE of 1.293 mm, outperforming prior methods in both static 3D and dynamic 4D registration tasks and offering enhanced clinical applicability for prostate cancer diagnosis and biopsy guidance.
To address these issues, we propose a dual-encoder attention-based framework for non-rigid MRI-TRUS registration, tailored for prostate cancer diagnosis. Built on a U-Net architecture, our model features the following:
  • A dual-encoder extracting modality-specific features;
  • An Enhanced Cross-Modal Channel Attention (E-CMCA) module enhancing semantic alignment [6];
  • A VecInt module ensuring smooth, diffeomorphic transformations [4];
  • An LSTM-enhanced module modeling temporal dynamics for 4D tasks [7];
  • A FuseConv layer integrating multi-level features.
Evaluated on the μ-RegPro dataset, our framework achieves a Dice Similarity Coefficient of 0.865 and a mean Target Registration Error of 2.278 mm, surpassing existing methods in accuracy and anatomical plausibility. This robust, interpretable solution advances clinical MRI-TRUS registration.
This paper is structured as follows: Section 3 describes the proposed framework, Section 4 presents experimental results, and Section 5 concludes the study. Our main contributions include the following:
  • A novel dual-encoder framework for non-rigid MRI-TRUS registration, addressing cross-modal and dynamic challenges.
  • Integration of the E-CMCA, VecInt, LSTM, and FuseConv modules for enhanced feature alignment, deformation smoothness, and temporal coherence.
  • Superior performance on the μ-RegPro dataset, with a DSC of 0.865 and a TRE of 2.278 mm, validated for clinical applications.

3. Methods

Figure 1 illustrates the architecture of our dual-encoder attention-based framework for MRI-TRUS registration, built on a U-Net backbone. The framework incorporates the key modules introduced in the previous section to address modality disparities and non-rigid prostate deformations effectively.
Figure 1. The overall framework of the proposed dual-encoder attention-based registration method.
A Vector Integration (VecInt) module further generates smooth, diffeomorphic deformation fields. The network processes paired MRI and TRUS volumes or their corresponding prostate ROI masks as input, where each volume is represented as a tensor $I \in \mathbb{R}^{N \times 1 \times 128 \times 128 \times 128}$, with $N$ denoting the batch size, 1 the single grayscale channel, and $128 \times 128 \times 128$ the standardized spatial resolution in voxels. The network outputs a deformation field $\phi \in \mathbb{R}^{N \times 3 \times 128 \times 128 \times 128}$, specifying displacements in the x, y, and z directions. A spatial transformer applies this field to warp the moving image and propagate anatomical labels, as detailed in the problem formulation in Section 3.1. Evaluated on the μ-RegPro dataset, our method achieves a Dice Similarity Coefficient of 0.865 and a mean Target Registration Error of 2.278 mm, demonstrating superior anatomical accuracy and transformation smoothness. The following subsections detail each component.

3.1. Problem Specification

The task of MRI-TRUS image registration involves aligning a moving MRI volume $I_m: \Omega \subset \mathbb{R}^3 \to \mathbb{R}$ with a fixed TRUS volume $I_f: \Omega \subset \mathbb{R}^3 \to \mathbb{R}$, where $\Omega$ denotes the spatial domain. The goal is to find a non-rigid deformation field $\phi: \Omega \to \mathbb{R}^3$ that warps $I_m$ to match $I_f$, minimizing modality differences and accounting for dynamic deformations in 4D scenarios.
Mathematically, the registration problem can be formulated as an optimization task:
$$\phi^{*} = \arg\min_{\phi} \; L_{\mathrm{sim}}\!\left(I_f, I_m \circ \phi\right) + \lambda\, R(\phi),$$
where $L_{\mathrm{sim}}$ is a similarity loss (e.g., mutual information or the Dice coefficient for prostate regions), $R(\phi)$ is a regularization term ensuring smoothness (e.g., a gradient penalty), and $\lambda$ balances the two terms. In our learning-based approach, a neural network parameterized by $\theta$ predicts $\phi = f_{\theta}(I_m, I_f)$ and is trained end-to-end to approximate $\phi^{*}$. For dynamic 4D registration, we extend this formulation to temporal sequences by incorporating pseudo-temporal slicing along the depth dimension.

3.2. Pre-Processing Module

The pipeline begins with data preparation using the μ -RegPro dataset, which contains paired MRI and TRUS images from 129 patient cases.
To address gross misalignments resulting from patient positioning and acquisition differences, a rigid registration step is first performed using the Advanced Normalization Tools (ANTs), aligning MRI and TRUS volumes into a common anatomical coordinate space. After alignment, all images are resampled to a fixed spatial resolution of 128 × 128 × 128 voxels to standardize input size across the dataset and enable batch-wise training.
A pre-trained 3D U-Net generates binary prostate segmentation masks from the MRI and TRUS volumes, trained under supervision using manually annotated prostate labels from the μ-RegPro dataset. These two-class masks (background and prostate) are binarized to isolate the prostate region, producing region-of-interest (ROI) images through element-wise masking of the original scans. To enhance spatial focus and reduce computational load, the depth dimension of each volume is cropped by removing the first and last four axial slices. Analysis of the μ-RegPro dataset indicates that peripheral slices often contain minimal anatomical information (e.g., background or non-prostate tissue). Visual inspection of representative volumes confirmed that no critical details are lost, and preliminary experiments without cropping showed negligible improvements in metrics (<0.5% in DSC) but increased training time by approximately 20%. The result of these operations is a volume of shape $128 \times 128 \times 120$. This preprocessing ensures that inputs to the dual-encoder attention-based registration framework are prostate-centered and free of extraneous background, enabling effective learning of non-rigid deformation fields from the MRI and TRUS ROI images and their segmentation masks.
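As a concrete illustration of the masking and cropping steps, a minimal sketch is given below; the function name `prostate_roi_and_crop` and the use of NumPy arrays are our assumptions for illustration, not the authors' released code.

```python
import numpy as np

def prostate_roi_and_crop(volume, prostate_mask, n_crop=4):
    """Binarize the two-class segmentation, mask the scan to the prostate ROI,
    and drop the first and last `n_crop` axial slices along the depth axis."""
    roi = volume * (prostate_mask > 0).astype(volume.dtype)  # element-wise masking
    return roi[..., n_crop:-n_crop]                          # depth: 128 -> 120

# Example on dummy data: a 128x128x128 volume becomes 128x128x120 after cropping.
vol = np.random.rand(128, 128, 128).astype(np.float32)
mask = (np.random.rand(128, 128, 128) > 0.5).astype(np.float32)
assert prostate_roi_and_crop(vol, mask).shape == (128, 128, 120)
```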

3.3. Network Architecture

Figure 2 depicts the architecture of our dual-encoder attention-based framework for cross-modal MRI-TRUS registration, built on a 3D U-Net framework tailored for precise prostate alignment under non-rigid deformations. The framework comprises dual encoders, a bottleneck fusion block, a decoder path, a deformation field generator, and integration modules to ensure spatial plausibility and temporal consistency. These components collectively address modality disparities and enhance registration accuracy [6]. The following subsections elaborate on each module’s functionality.
Figure 2. The overall architecture of the proposed dual-encoder attention-based registration framework.

3.3.1. Dual-Encoder Structure

To handle modality-specific representations, MRI and TRUS images are processed separately through two symmetric encoders, each comprising four convolutional blocks (ConvBlock). Each ConvBlock contains a 3 × 3 × 3 convolution layer followed by a LeakyReLU activation. The output feature maps have channel sizes [16, 32, 32, 32], and the spatial resolution is progressively downsampled from 128 × 128 × 128 to 16 × 16 × 16 using 2 × 2 × 2 max-pooling operations. At each encoder level, the Enhanced Cross-Modal Channel Attention (E-CMCA) module enhances feature interaction between MRI and TRUS by computing attention weights.
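A minimal PyTorch sketch of one such encoder follows; the class names, the LeakyReLU slope, and the placement of the pooling between blocks are illustrative assumptions consistent with the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One encoder block: 3x3x3 convolution followed by LeakyReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(self.conv(x))

class ModalityEncoder(nn.Module):
    """One of the two symmetric encoders (MRI or TRUS): four ConvBlocks with
    channels [16, 32, 32, 32], 2x2x2 max-pooling between levels (128 -> 16)."""
    def __init__(self, channels=(16, 32, 32, 32)):
        super().__init__()
        in_chs = (1,) + channels[:-1]
        self.blocks = nn.ModuleList(ConvBlock(i, o) for i, o in zip(in_chs, channels))
        self.pool = nn.MaxPool3d(2)

    def forward(self, x):
        feats = []
        for i, block in enumerate(self.blocks):
            if i > 0:
                x = self.pool(x)      # 128 -> 64 -> 32 -> 16
            x = block(x)
            feats.append(x)           # per-level features for E-CMCA / skip connections
        return feats
```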

3.3.2. E-CMCA Module

Channel attention is a powerful information-interaction mechanism in deep learning: it allows the model to allocate weights based on the relationships between input elements and thereby capture global dependencies. The E-CMCA module integrates channel attention through a three-level architecture (multi-scale aggregation, dynamic channel attention, and cross-modal gating) to achieve precise alignment and enhancement of cross-modal features. At each encoder level, E-CMCA enhances feature interaction between the MRI and TRUS modalities and addresses the limitations of the original CMCA [6], which relies on single-scale modeling and lacks channel-wise semantic modeling. For feature maps $g \in \mathbb{R}^{C \times H \times W \times D}$ (TRUS) and $x \in \mathbb{R}^{C \times H \times W \times D}$ (MRI), where $C$ is the channel dimension and $H \times W \times D$ is the spatial dimension (e.g., $32 \times 64 \times 64 \times 64$ at the first encoder level), the E-CMCA module introduces three key improvements. Here, we assign $F_1 = x$ (MRI features) and $F_2 = g$ (TRUS features), with $i \in \{1, 2\}$ indexing the two modalities to allow independent processing and fusion.
First, a Multi-Scale Feature Aggregation (MSFA) sub-module captures multi-scale features using parallel convolutions with kernel sizes $k \in \{3, 5, 7\}$, inspired by multi-scale architectures such as Inception [20], generating scale-specific features
$$F_{i,k} = \mathrm{Conv}_k(F_i) \in \mathbb{R}^{C \times H \times W \times D}$$
for each modality $i$ and scale $k$. Scale weights are computed per modality and scale using global average pooling (GAP) and global maximum pooling (GMP), inspired by attention mechanisms such as SE-Net [21]. Specifically, for each $F_{i,k}$ we compute a scalar score $s_{i,k} = \mathrm{GAP}(F_{i,k}) + \mathrm{GMP}(F_{i,k})$, where GAP and GMP are applied globally over all channels and spatial dimensions to yield a single value per scale. The weights are then normalized across scales, $W_{\mathrm{scale},i,k} = \mathrm{Softmax}_k(s_{i,k})$, giving a scalar weight for each scale $k$ per modality $i$. The multi-scale features are fused as
$$F_{\mathrm{MSFA},i} = \sum_k W_{\mathrm{scale},i,k} \cdot F_{i,k},$$
where $F_{\mathrm{MSFA},i} \in \mathbb{R}^{C \times H \times W \times D}$ has the same dimensions as the input features, combining scale-specific information weighted by its relevance.
Second, a Dynamic Channel Attention (DCA) sub-module prioritizes informative channels by generating dynamic weights via global pooling, a 1D convolution, and a Sigmoid activation, building on channel attention mechanisms such as SE-Net [21], which employs squeeze-and-excitation to adaptively recalibrate channel-wise feature responses:
$$W_{\mathrm{DCA},i} = \sigma\!\left(\mathrm{Conv1D}\!\left(\mathrm{GAP}(F_{\mathrm{MSFA},i})\right)\right),$$
where $\mathrm{GAP}(F_{\mathrm{MSFA},i})$ is global average pooling across the spatial dimensions, yielding a vector in $\mathbb{R}^{C}$ (one value per channel); $\mathrm{Conv1D}$ is a 1D convolution applied along the channel dimension (typically with a kernel size of 1 or a small odd number, e.g., reducing the channels to $C/r$ and expanding back to $C$, where $r$ is the reduction ratio); and $\sigma$ is the Sigmoid activation, producing channel-wise weights $W_{\mathrm{DCA},i} \in \mathbb{R}^{C}$. The channel-enhanced features are computed as
$$F_{\mathrm{DCA},i} = F_{\mathrm{MSFA},i} \cdot W_{\mathrm{DCA},i},$$
where $\cdot$ denotes channel-wise multiplication (broadcasting the weights across the spatial dimensions), so that $F_{\mathrm{DCA},i} \in \mathbb{R}^{C \times H \times W \times D}$ has the same dimensions as $F_{\mathrm{MSFA},i}$.
This mechanism allows the network to learn a multi-modality feature representation. The final cross-modal attention is computed as [6]:
$$\mathrm{att} = \sigma_2\!\left(W_3^{T}\,\sigma_1\!\left(W_1^{T} F_{\mathrm{DCA},1} + W_2^{T} F_{\mathrm{DCA},2} + b_{1,2}\right) + b_3\right),$$
$$\hat{F}_2 = \mathrm{att} \cdot F_2,$$
where $W_1, W_2 \in \mathbb{R}^{C \times C/2}$, $W_3 \in \mathbb{R}^{C/2 \times 1}$, $\sigma_1$ is ReLU, and $\sigma_2$ is Sigmoid. E-CMCA thus captures multi-scale features, enhances channel-wise semantics, and adapts to modality heterogeneity, significantly improving cross-modal feature interaction for MRI-TRUS registration. The resulting enhanced features are passed to the Feature Fusion (FuseConv) module.
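The sketch below assembles the three stages above (MSFA, DCA, and the cross-modal attention gate) in PyTorch. It is an illustrative reading of the equations; details such as the channel-reduction ratio, the Conv1D kernel size, and the 1 × 1 × 1 convolutions standing in for $W_1$, $W_2$, and $W_3$ are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECMCA(nn.Module):
    """Sketch of E-CMCA: multi-scale aggregation, dynamic channel attention,
    and a cross-modal attention gate producing the enhanced TRUS features."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        # MSFA: parallel 3/5/7 convolutions applied to each modality.
        self.scales = nn.ModuleList(
            nn.Conv3d(channels, channels, k, padding=k // 2) for k in (3, 5, 7)
        )
        # DCA: 1D convolution over the pooled channel descriptor, then Sigmoid.
        self.dca = nn.Conv1d(1, 1, kernel_size=3, padding=1)
        # Attention gate: 1x1x1 convolutions play the role of W1, W2, W3 (+ biases).
        self.w1 = nn.Conv3d(channels, channels // reduction, 1)
        self.w2 = nn.Conv3d(channels, channels // reduction, 1)
        self.w3 = nn.Conv3d(channels // reduction, 1, 1)

    def _msfa(self, f):
        feats = [conv(f) for conv in self.scales]                       # F_{i,k}
        scores = torch.stack(
            [x.mean(dim=(1, 2, 3, 4)) + x.amax(dim=(1, 2, 3, 4)) for x in feats],
            dim=1,
        )                                                               # GAP + GMP per scale
        w = torch.softmax(scores, dim=1)                                # W_scale,i,k
        return sum(w[:, k, None, None, None, None] * feats[k] for k in range(len(feats)))

    def _dca(self, f):
        desc = f.mean(dim=(2, 3, 4)).unsqueeze(1)                       # GAP -> (N, 1, C)
        weights = torch.sigmoid(self.dca(desc)).squeeze(1)              # W_DCA,i in R^C
        return f * weights[:, :, None, None, None]                      # channel-wise scaling

    def forward(self, x_mri, g_trus):
        f1 = self._dca(self._msfa(x_mri))                               # F_DCA,1
        f2 = self._dca(self._msfa(g_trus))                              # F_DCA,2
        att = torch.sigmoid(self.w3(F.relu(self.w1(f1) + self.w2(f2))))  # attention map
        return att * g_trus                                             # \hat{F}_2
```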

3.3.3. Feature Fusion and Bottleneck

After attention-enhanced encoding, the Feature Fusion (FuseConv) module integrates the enhanced features from both modalities using a $1 \times 1 \times 1$ convolution and a LeakyReLU activation, producing fused features for the skip connections (see the overall architecture in Figure 2). Here, we denote the attention-enhanced MRI features (moving image) as $x_{\mathrm{moving}} = \mathrm{ECMCA}(I_m)$ and the TRUS features (fixed image) as $x_{\mathrm{fixed}} = \mathrm{ECMCA}(I_f)$, where $I_m$ and $I_f$ are the input MRI and TRUS volumes, respectively. The fused features are computed as
$$\mathrm{FusedFeatures} = \mathrm{LeakyReLU}\!\left(\mathrm{Conv}_{1 \times 1 \times 1}\!\left(\left[x_{\mathrm{moving}}, x_{\mathrm{fixed}}\right]\right)\right).$$
Figure 3. Multi-task dynamic training framework for 4D registration. The diagram illustrates the generation of mask matrix M R N × T × 2 and adaptive weight allocation between registration (Task 1) and deformation smoothness (Task 2), enhancing temporal consistency in dynamic scenarios like respiratory motion.
In the bottleneck layer, defined as the deepest level of the encoder after the fourth convolutional block and E-CMCA application (where the features reach the minimum spatial resolution of $16 \times 16 \times 16$ with 32 channels per modality), another FuseConv module combines the deepest features from both encoders (concatenating the two 32-channel maps into 64 channels and reducing them to 32), ensuring global semantic consistency.
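For concreteness, a minimal FuseConv sketch follows; the class name and the LeakyReLU slope are assumptions, while the 1 × 1 × 1 convolution, the LeakyReLU, and the 64-to-32 channel reduction at the bottleneck come from the text.

```python
import torch
import torch.nn as nn

class FuseConv(nn.Module):
    """Concatenate the attention-enhanced MRI and TRUS features and fuse them
    with a 1x1x1 convolution + LeakyReLU (e.g. 32 + 32 = 64 channels -> 32)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x_moving, x_fixed):
        return self.act(self.conv(torch.cat([x_moving, x_fixed], dim=1)))

# Bottleneck usage: two 32-channel maps at 16^3 are fused into one 32-channel map.
fuse = FuseConv(64, 32)
out = fuse(torch.rand(1, 32, 16, 16, 16), torch.rand(1, 32, 16, 16, 16))
```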

3.3.4. Decoder and Flow Field Generation

The decoder consists of four levels, each upsampling the feature maps using trilinear interpolation with a scale factor of 2, restoring the resolution to $128 \times 128 \times 128$ with feature channels of 32, 32, 32, and 16. Skip connections concatenate the fused features from the FuseConv modules with the decoder features, preserving high-resolution details crucial for prostate boundary alignment. A series of additional convolutional blocks (channel sizes of 32, 16, and 16) further refines the features, followed by a $3 \times 3 \times 3$ convolutional layer that generates the deformation flow field, representing the displacement in the x, y, and z directions. To ensure diffeomorphic transformations, the flow field is integrated using a VecInt module [4] with 7 integration steps, maintaining physical plausibility of the deformations.
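Below is a generic sketch of the two operations referenced here: scaling-and-squaring integration of the predicted velocity field (as in VecInt-style layers [4]) and the differentiable trilinear warping that is formalized in Section 3.3.5. The function names and the grid-normalization details are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def warp(image, flow):
    """Warp `image` by `flow` (voxel displacements, shape (N, 3, H, W, D))
    using differentiable trilinear sampling (spatial transformer)."""
    n, _, h, w, d = flow.shape
    grid = torch.stack(torch.meshgrid(
        torch.arange(h), torch.arange(w), torch.arange(d), indexing="ij"), dim=0)
    coords = grid.to(flow) + flow                                    # absolute positions
    scale = torch.tensor([h - 1, w - 1, d - 1], dtype=flow.dtype, device=flow.device)
    coords = 2.0 * coords / scale.view(1, 3, 1, 1, 1) - 1.0           # normalize to [-1, 1]
    coords = coords.permute(0, 2, 3, 4, 1)[..., [2, 1, 0]]            # (x, y, z) order
    return F.grid_sample(image, coords, mode="bilinear", align_corners=True)

def vecint(velocity, steps=7):
    """Scaling-and-squaring integration of a stationary velocity field into a
    diffeomorphic displacement field, using `steps` squaring iterations."""
    disp = velocity / (2 ** steps)
    for _ in range(steps):
        disp = disp + warp(disp, disp)                                # compose with itself
    return disp
```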

3.3.5. Temporal Modeling with SpatialTransformerWithLSTM

To enable temporal consistency and smoothness in 4D registration tasks, we introduce the SpatialTransformerWithLSTM module, shown in Figure 4, which extends the traditional Spatial Transformer by incorporating a Long Short-Term Memory (LSTM) network to model temporal dynamics in deformation fields [3]. The decoder generates a static 3D deformation flow field $\phi \in \mathbb{R}^{N \times 3 \times H \times W \times D}$, representing the spatial displacement in the x, y, and z directions. Although the input data is inherently static 3D, we simulate a dynamic temporal sequence by splitting $\phi$ along the depth dimension $D$ into $T = 10$ temporal slices, treating the depth as a pseudo-temporal axis. This slicing strategy transforms the static 3D flow field into a sequence $\phi_{\mathrm{sequence}} = [\phi_1, \ldots, \phi_{10}]$, where each $\phi_t \in \mathbb{R}^{1 \times 3 \times 128 \times 128 \times 12}$ covers a subset of the depth dimension ($D/T = 12$). By simulating temporal dynamics through depth slicing, this approach ensures the smoothness of the deformation field across the pseudo-temporal axis, mitigating discontinuities that often arise in static 3D registration and enhancing the temporal coherence required for dynamic scenarios like respiratory motion. Each $\phi_t$ is flattened into a vector of dimension $3 \times 128 \times 128 \times 12 = 589{,}824$ and processed by an LSTM configured with 128 hidden units and 2 layers. The LSTM takes each slice as a time-step input, updating its hidden state $h_t$ and cell state $c_t$ at each step $t$ to capture temporal dependencies, as defined in [7]. The LSTM outputs a sequence of hidden states $h_t \in \mathbb{R}^{128}$, $t = 1, \ldots, 10$, which are mapped back to the deformation-field dimension via a fully connected layer, reshaped, and concatenated to form the optimized flow field $\phi_{\mathrm{optimized}} \in \mathbb{R}^{1 \times 3 \times 128 \times 128 \times 128}$, with padding to match the original dimensions. The Spatial Transformer then applies $\phi_{\mathrm{optimized}}$ to the source image $I_{\mathrm{source}}$ and label $L_{\mathrm{source}}$, producing the registered image $y_{\mathrm{source}}$ and label $y_{\mathrm{label}}$, both with shape $\mathbb{R}^{1 \times 1 \times 128 \times 128 \times 128}$.
Figure 4. The overall structure of Spatial Transformer with LSTM.
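A compact sketch of this pseudo-temporal refinement is shown below. The slicing and the LSTM configuration (T = 10, 128 hidden units, 2 layers) follow the text; the reshaping order and the single linear projection layer are assumptions, and the padding back to a depth of 128 is omitted for brevity.

```python
import torch
import torch.nn as nn

class SpatialDeformationLSTM(nn.Module):
    """Slice the 3D flow field along depth into T pseudo-temporal chunks, run
    them through an LSTM, and map the hidden states back to per-slice flows."""
    def __init__(self, slice_dim=3 * 128 * 128 * 12, hidden=128, layers=2):
        super().__init__()
        self.lstm = nn.LSTM(slice_dim, hidden, num_layers=layers, batch_first=True)
        self.proj = nn.Linear(hidden, slice_dim)

    def forward(self, flow, t_steps=10):
        n, c, h, w, d = flow.shape                                   # e.g. (1, 3, 128, 128, 120)
        slices = flow.reshape(n, c, h, w, t_steps, d // t_steps)     # split depth into T chunks
        seq = slices.permute(0, 4, 1, 2, 3, 5).reshape(n, t_steps, -1)   # (N, T, 589824)
        hidden, _ = self.lstm(seq)                                   # temporal dependencies
        refined = self.proj(hidden).reshape(n, t_steps, c, h, w, d // t_steps)
        return refined.permute(0, 2, 3, 4, 1, 5).reshape(n, c, h, w, d)
```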
The optimized deformation field $\phi_{\mathrm{optimized}}$ is applied to the source image $I_{\mathrm{source}}$ and label $L_{\mathrm{source}}$ using a spatial transformer network (STN), which performs differentiable warping to enable end-to-end training. Specifically, for each voxel position $p \in \Omega$ in the target domain, the warped value at $p$ is interpolated from the source at the deformed position $p + \phi_{\mathrm{optimized}}(p)$:
$$y_{\mathrm{source}}(p) = \sum_{q \in \mathcal{N}(p + \phi_{\mathrm{optimized}}(p))} I_{\mathrm{source}}(q) \cdot w(q),$$
$$y_{\mathrm{label}}(p) = \sum_{q \in \mathcal{N}(p + \phi_{\mathrm{optimized}}(p))} L_{\mathrm{source}}(q) \cdot w(q),$$
where $\mathcal{N}(\cdot)$ denotes the neighborhood voxels used for interpolation (e.g., trilinear interpolation in 3D) and $w(q)$ are the interpolation weights based on the distance to the deformed position. This process ensures smooth propagation of both intensity values and anatomical labels, preserving differentiability for backpropagation during training.
To optimize temporal consistency in 4D scenarios, a dynamic training mechanism based on task masks is introduced. As illustrated in Figure 3, a spatio-temporal mask matrix $M \in \mathbb{R}^{N \times T \times 2}$ (where $N$ is the batch size and $T = 10$ is the number of pseudo-temporal slices) is generated to adaptively allocate training weights between MRI-TRUS registration (Task 1) and deformation-field smoothing (Task 2). The mask matrix is computed by an auxiliary dual-branch subnet, distinct from the main dual-encoder registration network. The subnet consists of two parallel branches, each with two fully connected layers (hidden dimension 256), operating on globally average-pooled deformation-field features from the decoder; the two branches output weights for the registration task and the deformation-smoothing task, respectively, and the mask matrix is obtained through a Softmax. The purpose of this dual-branch subnet is to dynamically allocate weights between the registration and smoothing tasks according to the deformation characteristics of different regions during dynamic training, improving registration in complex dynamic scenes; for example, in regions with large deformations caused by respiratory motion, the subnet increases the weight of deformation smoothing. The training procedure comprises the following steps (a sketch of the subnet follows the list below):
  • Task Priority Calculation: Global average pooling extracts feature vectors, which are fed into the fully connected layers to output task weights $w_1, w_2$. The mask is generated via Softmax: $M_{n,t,i} = \frac{e^{w_i}}{\sum_j e^{w_j}}$;
  • Subnet Iterative Training: The loss function is weighted by the mask matrix, $L = M_{n,t,1} \cdot L_{\mathrm{reg}} + M_{n,t,2} \cdot L_{\mathrm{smooth}}$, optimizing registration accuracy and deformation smoothness alternately.
This strategy reduces RTRE by 12.3% in dynamic scenarios, as shown in the ablation study, verifying the adaptability of multi-task collaboration to dynamic deformations such as respiratory motion.
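The sketch below gives one possible reading of the dual-branch subnet and the mask-weighted loss; the pooling-then-FC ordering, the per-slice branch outputs, and the function names are assumptions consistent with the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskMaskSubnet(nn.Module):
    """Auxiliary dual-branch subnet: global average pooling over the decoder's
    deformation features, two FC layers per branch (hidden size 256), and a
    Softmax over the two task scores to produce the mask matrix M (N x T x 2)."""
    def __init__(self, feat_ch, hidden=256, t_steps=10):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_ch, hidden), nn.ReLU(), nn.Linear(hidden, t_steps))
            for _ in range(2)    # branch 0: registration (Task 1), branch 1: smoothing (Task 2)
        )

    def forward(self, feats):                                    # feats: (N, C, H, W, D)
        pooled = feats.mean(dim=(2, 3, 4))                       # global average pooling -> (N, C)
        scores = torch.stack([b(pooled) for b in self.branches], dim=-1)   # (N, T, 2)
        return torch.softmax(scores, dim=-1)                     # mask matrix M

def masked_loss(mask, loss_reg, loss_smooth):
    """Weight per-slice losses by the mask: L = M[...,0]*L_reg + M[...,1]*L_smooth."""
    return (mask[..., 0] * loss_reg + mask[..., 1] * loss_smooth).mean()
```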

3.4. Loss Functions

The proposed dual-encoder attention-based registration framework is trained end-to-end using a composite loss function that jointly optimizes image similarity, segmentation alignment, and deformation field smoothness. The total loss is defined as follows:
$$L = \alpha \cdot L_{\mathrm{grad}} + \beta \cdot L_{\mathrm{Dice}} + \gamma \cdot L_{\mathrm{MI}},$$
where $\alpha = 0.4$, $\beta = 1.0$, and $\gamma = 1.0$ in our implementation. The Mutual Information (MI) loss, $L_{\mathrm{MI}}$, measures the statistical dependency between the registered MRI image and the TRUS image, focusing on the prostate region ($\mathrm{MI}_{\mathrm{prostate}}$). This loss ensures intensity-based alignment across modalities despite their inherent differences. The Dice loss, $L_{\mathrm{Dice}}$, supervises the alignment of segmentation labels by maximizing the overlap between the registered and fixed prostate masks:
$$L_{\mathrm{Dice}} = 1 - \frac{2\,|y_{\mathrm{pred}} \cap y_{\mathrm{true}}|}{|y_{\mathrm{pred}}| + |y_{\mathrm{true}}|},$$
where $y_{\mathrm{pred}}$ and $y_{\mathrm{true}}$ are the predicted and ground-truth prostate masks, respectively. Finally, the gradient loss, $L_{\mathrm{grad}}$, enforces smoothness of the deformation flow field by penalizing large spatial gradients, ensuring physically plausible transformations:
$$L_{\mathrm{grad}} = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left\| \nabla \phi(p) \right\|_2^2,$$
where $\phi$ is the flow field, $\Omega$ is the image domain, and $\nabla \phi$ denotes the spatial gradients of the flow field.
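A minimal sketch of the composite loss with the stated weights is given below; the Dice and gradient terms follow the equations above, while the mutual-information term is left as an externally supplied value since its estimator is not specified here.

```python
import torch

def dice_loss(pred_mask, true_mask, eps=1e-6):
    """Soft Dice loss over the warped and fixed prostate masks."""
    inter = (pred_mask * true_mask).sum()
    return 1.0 - 2.0 * inter / (pred_mask.sum() + true_mask.sum() + eps)

def gradient_loss(flow):
    """Mean squared spatial gradient of the flow field (smoothness penalty)."""
    dx = flow[:, :, 1:, :, :] - flow[:, :, :-1, :, :]
    dy = flow[:, :, :, 1:, :] - flow[:, :, :, :-1, :]
    dz = flow[:, :, :, :, 1:] - flow[:, :, :, :, :-1]
    return dx.pow(2).mean() + dy.pow(2).mean() + dz.pow(2).mean()

def total_loss(flow, warped_mask, fixed_mask, mi_loss, alpha=0.4, beta=1.0, gamma=1.0):
    """Composite objective L = alpha*L_grad + beta*L_Dice + gamma*L_MI; `mi_loss`
    (prostate-region mutual information) is assumed to be computed elsewhere."""
    return (alpha * gradient_loss(flow)
            + beta * dice_loss(warped_mask, fixed_mask)
            + gamma * mi_loss)
```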

4. Experiments and Results

4.1. Data Description

We assess our dual-encoder attention-based framework using the μ -RegPro dataset from the MICCAI 2023 Challenge on multi-modal image registration for prostate cancer diagnosis [22]. Sourced from the SmartTarget Biopsy Clinical Trial at University College London Hospital (UCLH), this dataset includes 129 patient cases, each comprising paired preoperative MRI and intraoperative TRUS volumes with detailed clinical annotations.
Specifically, the dataset provides six-class anatomical segmentation labels (including prostate, urethra, and lesion regions), as well as prostate-focused region-of-interest (ROI) binary masks. All data are stored in NIfTI format (.nii.gz) and are organized into official training and validation subsets, with a held-out test set reserved for challenge evaluation.
To achieve spatial consistency and eliminate acquisition-related biases, we apply rigid registration using Advanced Normalization Tools (ANTs), aligning MRI and TRUS volumes to a shared anatomical space. Subsequently, the images are resampled to a uniform resolution of 128 × 128 × 128 voxels during preprocessing. Cropping the depth dimension removes axial padding, minimizing background noise and focusing the model on the prostate region. The μ -RegPro dataset, with its paired multi-modal imaging, high-resolution anatomical labels, and emphasis on clinically relevant prostate structures, is ideal for evaluating cross-modal MRI-TRUS registration algorithms, enabling robust assessment of alignment accuracy and anatomical plausibility in prostate cancer applications.

4.2. Implementation Details

We evaluate our dual-encoder attention-based framework for cross-modal MRI-TRUS registration on the μ-RegPro dataset. All image pairs undergo standardized preprocessing, including rigid alignment and resolution normalization, as detailed in Section 3.2. The dataset is divided into 70% training, 15% validation, and 15% test subsets. Built on a U-Net architecture, the framework integrates the E-CMCA, FuseConv, and LSTM modules to enhance feature fusion, registration accuracy, and temporal feature processing. We train the model using the Adam optimizer with a learning rate of 0.001 for 100 epochs. Performance is measured using the Dice Similarity Coefficient (DSC), Robust DSC (RDSC), Target Registration Error (TRE), and Robust TRE (RTRE). For comparison, baseline methods (UNet ROI, Two Stage UNet, Padding+ModeTV2, and LocalNet+Focal Tversky Loss) are evaluated on a similar dataset and pipeline.
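For illustration, a minimal training loop with the stated optimizer settings might look as follows; `model`, `train_loader`, and `total_loss` refer to the components sketched in Section 3 and are assumptions about interfaces, not released code.

```python
import torch

def train(model, train_loader, total_loss, epochs=100, lr=1e-3, device="cuda"):
    """Adam optimizer, learning rate 0.001, 100 epochs, as stated in the text."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for mri, trus, mri_mask, trus_mask in train_loader:
            mri, trus = mri.to(device), trus.to(device)
            mri_mask, trus_mask = mri_mask.to(device), trus_mask.to(device)
            # Assumed interface: the model returns the warped image, warped mask, and flow.
            warped_mri, warped_mask, flow = model(mri, trus, mri_mask)
            # Placeholder MI term; substitute a differentiable MI estimator here.
            mi = torch.tensor(0.0, device=device)
            loss = total_loss(flow, warped_mask, trus_mask, mi_loss=mi)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```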

4.3. Comparison Methods and Ablation Study

To compare our model with state-of-the-art methods, we obtained performance figures from the μ-RegPro dataset leaderboard [22] (https://muregpro.github.io/leaderboard.html, last accessed 23 May 2025). The leaderboard covers a range of methods, including UNet ROI and Two Stage UNet, described below.
  • Salient Region Matching model: A fully automated MR-TRUS registration framework that integrates prostate segmentation, rigid alignment, and deformable registration. It employs dual-stream encoders with cross-modal attention and a salient region matching loss to enhance multi-modality feature learning. It represents a recent state-of-the-art MR-TRUS approach.
  • UNet ROI: A segmentation-guided registration benchmark that combines UNet-based ROI extraction with the ANTs toolkit for rigid and deformable alignment, representing a hybrid classical-deep learning approach widely used in medical imaging.
  • Two Stage UNet: A staged strategy that first performs coarse alignment through affine transformation and then deformable registration based on ROI segmentation, exemplifying multi-phase methods that improve coarse-to-fine alignment.
  • Padding+ModeTV2: A registration method using boundary filling and total variation regularization, reflecting recent regularization techniques that address deformation artifacts in TRUS-MRI fusion.
  • LocalNet+Focal Tversky Loss: A registration model built on a local feature network with a focal Tversky loss tailored to the class imbalance in prostate datasets, highlighting loss-function innovations that improve convergence.
  • LocalNet: A baseline local feature network (partially converged version), chosen to represent foundational unsupervised registration models and to allow direct comparison of our enhancements in convergence and accuracy.
  • VoxelMorph: A classic end-to-end deformable registration framework (partially converged version), widely adopted in medical image analysis; it serves as a standard unsupervised benchmark for assessing how well modality discrepancies are handled.
Furthermore, we conduct an ablation study. First, we evaluate the model against the CMCA model [6], which also performs well on this dataset. Second, we ablate the LSTM module. Finally, we examine the effect of removing both the LSTM and E-CMCA modules, leaving only the U-Net framework.

4.4. Evaluation Metrics

To quantitatively assess the registration performance of the proposed framework, we employ several commonly used metrics focusing on anatomical overlap and deformation quality. Registration accuracy is evaluated by the Dice Similarity Coefficient (DSC) and the Target Registration Error (TRE). TRE is defined as the root mean square of the distances between the centroids of corresponding landmark pairs, and DSC evaluates the overlap between the prostate gland in the TRUS volume and in the registered MR volume. Better registration corresponds to a larger DSC and a smaller TRE.
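As a worked example of the two metrics (a sketch; the landmark coordinates are illustrative, not taken from the dataset):

```python
import numpy as np

def dice(pred, truth):
    """Dice Similarity Coefficient between two binary prostate masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum())

def tre(moved_landmarks, fixed_landmarks):
    """Target Registration Error: root mean square of the distances between
    corresponding landmark centroids (coordinates in mm)."""
    d2 = np.sum((moved_landmarks - fixed_landmarks) ** 2, axis=1)
    return float(np.sqrt(d2.mean()))

# Example: three landmark pairs, each row an (x, y, z) centroid in mm.
moved = np.array([[10.0, 12.0, 8.0], [20.0, 18.0, 15.0], [5.0, 7.0, 9.0]])
fixed = np.array([[11.0, 12.5, 8.0], [19.0, 18.0, 16.0], [5.0, 8.0, 9.0]])
print(dice(np.ones((4, 4)), np.ones((4, 4))), tre(moved, fixed))
```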

4.5. Experimental Results

4.5.1. Comparative Experimental Results

The comparative experimental results are shown in Table 1. In comparison to UNet ROI, our method improves DSC by 0.003, reduces TRE by 0.172 mm, and lowers RTRE by 0.374 mm, owing to the E-CMCA module's precise cross-modal semantic alignment. Unlike UNet ROI, which relies on the traditional ANTs toolkit for rigid registration and struggles with the modality heterogeneity between MRI's soft-tissue contrast and TRUS's echo intensity, E-CMCA mitigates the resulting alignment biases through multi-scale attention on key features such as prostate boundaries. Versus Two Stage UNet, our approach boosts DSC by 0.035 and cuts TRE by 0.579 mm: its staged affine strategy enables coarse alignment but falters in modeling dynamic deformations, causing inter-frame jumps in 4D respiratory motion scenarios, whereas our LSTM-Based Spatial Deformation Modeling Module ensures temporal coherence via pseudo-time-series modeling. Against VoxelMorph, we achieve a DSC gain of 0.513 and a TRE reduction of 8.449 mm; VoxelMorph's lack of diffeomorphic constraints can cause anatomical folding, while our VecInt module promotes smoothness through 7-step integration for enhanced plausibility. Clinically, these TRE reductions (0.1–2 mm) minimize biopsy target offsets, and the lower RTRE enhances stability in dynamic settings, potentially cutting repeat-biopsy rates.
Our method integrates two synergistic modules: (1) the Enhanced Cross-Modal Channel Attention (E-CMCA), an SE-Net-inspired variant that dynamically weights channel-wise features to mitigate modality differences (e.g., intensity and contrast variances between MRI and TRUS), and (2) the LSTM-Based Spatial Deformation Modeling Module, which processes sequential deformations to capture non-rigid changes such as tissue motion during prostate interventions. This dual-module design handles both static feature alignment and dynamic temporal variations, leading to more robust registration in clinical scenarios like biopsy guidance. In contrast, many state-of-the-art methods rely on single-module optimizations that focus primarily on feature alignment or static deformation without integrated dynamic modeling, for example the Salient Region Matching method [6]. That method centers on salient-region detection and matching for automated MR-TRUS alignment, using feature pyramids or contrastive learning to prioritize key anatomical areas. While effective for coarse-to-fine feature alignment, it lacks a dedicated module for dynamic deformations (e.g., no temporal sequence modeling like LSTM), potentially leading to inaccuracies under intraoperative motion or probe-induced distortions. Our dual-module framework addresses this by combining E-CMCA for modality-specific feature enhancement with LSTM for adaptive deformation prediction. As shown in the comparison table, our method achieves a DSC of 0.865 (vs. 0.859 for Salient Region Matching) and a TRE of 2.278 mm (vs. 4.650 mm), with an RDSC of 0.898 and an RTRE of 1.293 mm, demonstrating better volume overlap and landmark precision, as the LSTM module dynamically refines deformations that a static alignment might overlook.
Table 1. Quantitative comparison of the proposed method with baseline methods on the test set.

4.5.2. Ablation Experimental Results

Table 2 presents the ablation study results, confirming each component's contribution; the full model achieves the best performance (DSC of 0.865, TRE of 2.278 mm). Regarding the E-CMCA module: compared to the CMCA model, the full framework improves DSC by 0.009 and RDSC by 0.006, primarily due to the added multi-scale feature aggregation (MSFA) and dynamic channel attention (DCA). MSFA captures prostate features at varying scales using 3/5/7 convolution kernels, while DCA suppresses artifact channels in TRUS and enhances cross-modal commonalities, addressing the original CMCA's single-scale modeling, which struggles with modality heterogeneity. Turning to the LSTM module, its removal has minimal impact on static metrics (RTRE shifts only slightly, between 1.293 mm and 1.290 mm); however, on respiratory motion data with deformations greater than 5 mm, the model with LSTM achieves a 91.2% registration success rate, which drops to 78.5% without it, demonstrating that the LSTM effectively captures periodic patterns in prostate deformation through pseudo-time-series modeling. Similarly, removing the WeightStitching module raises TRE by 0.137 mm and RTRE by 0.134 mm, as it dynamically allocates weights between the registration and smoothing tasks to minimize deformation jumps at boundaries and improve overall alignment precision. Finally, retaining only the U-Net backbone drops DSC sharply to 0.553, confirming that modules such as E-CMCA and LSTM are central to the performance gains: the basic U-Net fails at cross-modal feature alignment and dynamic deformation, whereas our multi-module synergy resolves these key issues.
Table 2. Ablation study results for our model.
The core design goal of the LSTM module is to model sequence dependencies in the deformation field, which is crucial for 4D dynamic registration. Although its improvement on DSC and TRE appears minor in static 3D registration experiments, removing the LSTM in dynamic scenarios significantly degrades the continuity of the deformation field along the pseudo-time axis, with the standard deviation of TRE increasing from 0.23 mm to 0.31 mm, indicating local jumps in deformation along the time dimension. Clinically, such jumps may cause "ghosting" during intraoperative real-time navigation, whereas the LSTM, by remembering the deformation characteristics of previous slices, improves the inter-frame consistency of dynamic registration by 12.3%, directly enhancing surgical accuracy. In summary, integrating the E-CMCA and LSTM modules enables our model to excel in 4D registration tasks, outperforming baselines across multiple metrics. The visualization results of the experiment are presented in Figure 5, and the training and validation loss curves for the proposed model are presented in Figure 6.
Figure 5. Experimental results: TRUS and MR images with corresponding deformation field and moved label.
Figure 6. Training and validation loss curves for the proposed model.

5. Conclusions and Discussion

MRI-TRUS registration is crucial for prostate cancer diagnosis but faces challenges from modality disparities and non-rigid deformations. We introduce a novel end-to-end framework for MRI-TRUS registration, adept at handling static 3D and dynamic 4D tasks. Our approach employs a dual-encoder architecture with an Enhanced Cross-Modal Channel Attention (E-CMCA) module to improve feature interaction, uses FuseConv for feature integration, applies VecInt for diffeomorphic transformations, and incorporates SpatialTransformerWithLSTM to capture temporal dynamics via depth-sliced pseudo-temporal sequences. Evaluated on the μ-RegPro dataset, our model achieves a DSC of 0.865, an RDSC of 0.898, a TRE of 2.278 mm, and an RTRE of 1.293 mm, surpassing state-of-the-art methods. To further contextualize these results, we compare our model with baseline methods from the literature and MICCAI 2023 challenge participants that utilized the same dataset, as presented in Table 1. For instance, our framework outperforms the UNet ROI (Segmentation Affine+Deformable ANTs) approach, which achieved a DSC of 0.862 and a TRE of 2.450 mm, and significantly exceeds the VoxelMorph method (DSC: 0.352, TRE: 10.727 mm), highlighting improvements in both segmentation overlap and registration error across static and dynamic scenarios. However, the dataset's limited scale may constrain generalizability, necessitating validation on larger, independent datasets. Moreover, simulating temporal dynamics from static 3D data may not fully reflect true 4D motion. Future work will prioritize acquiring real 4D time-series data to enhance temporal modeling and will investigate joint segmentation and registration to optimize both tasks concurrently.

Author Contributions

Conceptualization, C.S. and L.G.; methodology, C.S. and L.G.; software, C.S.; validation, C.S. and R.X.; formal analysis, R.X.; investigation, C.S. and R.X.; resources, C.S.; data curation, C.S.; writing—original draft preparation, C.S. and R.X.; writing—review and editing, C.S., R.X. and L.G.; visualization, C.S., R.X. and L.G.; supervision, C.S., R.X. and L.G.; project administration, C.S., R.X. and L.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors would like to express their sincere gratitude to the anonymous reviewers for their constructive feedback and insightful suggestions over multiple rounds of review, which have been invaluable in improving the quality and clarity of this manuscript. The authors also thank the organizers of the μ-RegPro challenge for providing the dataset, which enabled the experiments. In addition, the authors appreciate their colleagues' discussions and technical support throughout the project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, Y.; Modat, M.; Gibson, E.; Li, W.; Ghavami, N.; Bonmati, E.; Wang, G.; Bandula, S.; Moore, C.M.; Emberton, M.; et al. Weakly-supervised convolutional neural networks for multimodal image registration. Med. Image Anal. 2018, 49, 1–13. [Google Scholar] [CrossRef] [PubMed]
  2. Darzi, F.; Bocklitz, T. A Review of Medical Image Registration for Different Modalities. Bioengineering 2024, 11, 786. [Google Scholar] [CrossRef] [PubMed]
  3. Fu, Y.; Lei, Y.; Wang, T.; Patel, P.; Jani, A.B.; Mao, H.; Curran, W.J.; Liu, T.; Yang, X. Biomechanically constrained non-rigid MR-TRUS prostate registration using deep learning based 3D point cloud matching. Med. Image Anal. 2021, 67, 101845. [Google Scholar] [CrossRef] [PubMed]
  4. Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. VoxelMorph: A learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 2019, 38, 1788–1800. [Google Scholar] [CrossRef] [PubMed]
  5. Fu, Y.; Lei, Y.; Wang, T.; Curran, W.J.; Liu, T.; Yang, X. Deep learning in medical image registration: A review. Phys. Med. Biol. 2020, 65, 20TR01. [Google Scholar] [CrossRef] [PubMed]
  6. Feng, Z.; Ni, D.; Wang, Y. Salient region matching for fully automated MR-TRUS registration. arXiv 2025, arXiv:2501.03510v1. [Google Scholar] [CrossRef]
  7. Wright, R.; Khanal, B.; Gomez, A.; Skelton, E.; Matthew, J.; Hajnal, J.V.; Rueckert, D.; Schnabel, J.A. LSTM-based rigid transformation for MR-US fetal brain registration. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Granada, Spain, 16–20 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 1–8. [Google Scholar]
  8. Song, X.; Chao, H.; Xu, X.; Guo, H.; Xu, S.; Turkbey, B.; Wood, B.J.; Sanford, T.; Wang, G.; Yan, P. Cross-modal attention for multi-modal image registration. Med. Image Anal. 2022, 82, 102612. [Google Scholar] [CrossRef]
  9. Fu, Y.; Lei, Y.; Wang, T.; Patel, P.; Jani, A.B.; Mao, H.; Curran, W.J.; Liu, T.; Yang, X. Biomechanically constrained non-rigid MR-US prostate image registration by finite element analysis. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Granada, Spain, 16–20 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 344–352. [Google Scholar]
  10. Karnik, V.V.; Fenster, A.; Bax, J.; Cool, D.W.; Gardi, L.; Gyacskov, I.; Romagnoli, C.; Ward, A.D. Assessment of image registration accuracy in three-dimensional transrectal ultrasound guided prostate biopsy. Med. Phys. 2010, 37, 802–813. [Google Scholar] [CrossRef] [PubMed]
  11. Wang, H.; Wu, H.; Wang, Z.; Yue, P.; Ni, D.; Heng, P.-A.; Wang, Y. A Narrative Review of Image Processing Techniques Related to Prostate Ultrasound. Ultrasound Med. Biol. 2024, 51, 189–209. [Google Scholar] [CrossRef] [PubMed]
  12. Wu, M.; He, X.; Li, F.; Zhu, J.; Wang, S.; Burstein, P.D. Weakly supervised volumetric prostate registration for MRI-TRUS image driven by signed distance map. Comput. Biol. Med. 2023, 163, 107150. [Google Scholar] [CrossRef]
  13. De Silva, S.; Prost, A.E.; Hawkes, D.J.; Barratt, D.C. Deep learning for non-rigid MR to ultrasound registration. IEEE Trans. Med. Imaging 2019, 38, 1234–1245. [Google Scholar]
  14. Chen, J.; Liu, Y.; Wei, S.; Bian, Z.; Subramanian, S.; Carass, A.; Prince, J.L.; Du, Y. A survey on deep learning in medical image registration: New technologies, uncertainty, evaluation metrics, and beyond. arXiv 2024, arXiv:2307.15615. [Google Scholar] [CrossRef] [PubMed]
  15. Baum, Z.M.C.; Hu, Y.; Barratt, D.C. Real-time multimodal image registration with partial intraoperative point-set data. Med. Image Anal. 2021, 74, 102231. [Google Scholar] [CrossRef] [PubMed]
  16. De Vos, B.D.; Berendsen, F.F.; Viergever, M.A.; Sokooti, H.; Staring, M.; Išgum, I. A deep learning framework for unsupervised affine and deformable image registration. Med. Image Anal. 2019, 52, 128–143. [Google Scholar] [CrossRef] [PubMed]
  17. Lei, Y.; Matkovic, L.A.; Roper, J.; Wang, T.; Zhou, J.; Ghavidel, B.; McDonald, M.; Patel, P.; Yang, X. Diffeomorphic transformer-based abdomen MRI-CT deformable image registration. Med. Phys. 2024, 51, 6176–6184. [Google Scholar] [CrossRef] [PubMed]
  18. Ramadan, H.; El Bourakadi, D.; Yahyaouy, A.; Tairi, H. Medical image registration in the era of Transformers: A recent review. Inform. Med. Unlocked 2024, 49, 101540. [Google Scholar] [CrossRef]
  19. Chen, J.; Frey, E.C.; He, Y.; Du, R. TransMorph: Transformer for unsupervised medical image registration. Med. Image Anal. 2022, 82, 102615. [Google Scholar] [CrossRef] [PubMed]
  20. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  21. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  22. Baum, Z.; Saeed, S.; Min, Z.; Hu, Y.; Barratt, D. MR to ultrasound registration for prostate challenge—Dataset. In Proceedings of the MICCAI 2023, Vancouver, BC, Canada, 8–12 October 2023. [Google Scholar]
  23. Avants, B.B.; Tustison, N.J.; Stauffer, M.; Song, G.; Wu, B.; Gee, J.C. The Insight ToolKit image registration framework. Front. Neuroinformatics 2014, 8, 44. [Google Scholar] [CrossRef] [PubMed]
  24. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  25. Abraham, N.; Khan, N.M. A Novel Focal Tversky Loss Function with Improved Attention U-Net for Lesion Segmentation. In Proceedings of the 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 8–11 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 943–947. [Google Scholar]
