Article

Multi-Modal Image Registration Problem Integrating Multi-Scale Strategy and Deep Learning

School of Mathematics and Statistics, Wuhan University of Technology, Wuhan 430070, China
Mathematics 2026, 14(12), 2131; https://doi.org/10.3390/math14122131 (registering DOI)
Submission received: 9 April 2026 / Revised: 6 June 2026 / Accepted: 10 June 2026 / Published: 14 June 2026
(This article belongs to the Special Issue Mathematical Optimization Methods in Image Processing)

Abstract

Medical image registration integrates information from different types of medical images to support and improve clinical diagnosis. Existing image registration approaches fall into two main categories: model-driven methods and data-driven methods. Model-driven methods can achieve high registration accuracy but suffer from low computational efficiency and long processing time. In contrast, data-driven methods stand out for their high efficiency, which gives them great practical value. Building on this advantage, this paper proposes a simple unsupervised deep learning framework embedded with a multi-scale strategy. The overall network consists of two core modules: an Affine Transformation Network (AT-Net) and a multi-scale Deformable Transformation Network (DT-Net). The multi-scale design adopted in the DT-Net enables image registration at different feature scales, which improves the overall registration accuracy. In addition, a dual consistency constraint is introduced into the framework to further enhance the model robustness. The entire network realizes end-to-end medical image registration. We verified the performance of the proposed method on a public dataset, with mutual information (MI) adopted as the evaluation metric. The experimental results show that our registration algorithm outperforms several mainstream methods, including Symmetric Image Normalization (SyN), VoxelMorph (VM), the coarse-to-fine deformable transformation framework for unsupervised multi-contrast MR image registration with dual consistency constraint (C-F-I-R), TransMorph and DiffuseMorph. The comparative experiments demonstrate that combining the multi-scale strategy with deep learning techniques is an effective solution for medical image registration tasks.

1. Introduction

In medical image processing, image registration is a basic and important technology. Its main goal is to establish the correspondence between two images, so that the same target in different images is aligned at the same position. Based on the imaging methods used for the image pairs, image registration can be divided into unimodal registration and multi-modal registration. Unimodal registration aligns images acquired by the same imaging equipment, while multi-modal registration aligns images acquired by different types of imaging equipment. Both types of registration play an important role in clinical application, but multi-modal registration is more challenging due to the differences between imaging methods. With the continuous development of medical imaging technology, multi-contrast MR images have been widely used in clinical diagnosis. However, because different contrast MR images rely on different imaging principles, the images differ markedly from one another, which makes registration difficult. Therefore, this paper addresses the problem of image registration for different types of images, especially multi-contrast MR images. The purpose of this research is to improve the quality of fused images through effective registration, and thus to provide clinicians with more complete and accurate information for diagnosis. For example, Figure 1 shows multi-contrast MR brain images of three patients, which illustrate the different roles of the imaging methods.
Among them, Fluid Attenuated Inversion Recovery (FLAIR) can highlight lesions in brain tissue, which is helpful for the detection of brain lesions. T1-weighted imaging (T1) is good at showing anatomical structures clearly, providing a basis for understanding the normal structure of the brain. T1-contrast-enhanced imaging (T1ce) works well in identifying and locating abnormal tissues, which is of great significance for the diagnosis of tumors and other diseases. T2-weighted imaging (T2) is very useful for finding water accumulation, inflammation and some types of tumors, and can provide effective reference for the judgment of disease severity.
In the field of image registration, many researchers in academic circles have put forward a lot of methods. These methods are usually divided into two main types: model-driven methods and data-driven methods.
Model-driven methods in image registration usually use either grayscale data or geometric features of images. Grayscale-based registration checks the similarity between images by analyzing the statistical features of their grayscale values. Arad et al. [6] and Yang et al. [7] put forward methods to optimize the transformation field by maximizing the grayscale similarity function between images. On the other hand, Kybic [8], Rueckert et al. [9] and Sdika [10] proposed B-spline interpolation to get the best transformation fields. Geometric feature-based registration methods focus on matching feature descriptors extracted from images to help with registration. For example, Besl et al. [11] proposed the Iterative Closest Point (ICP) algorithm in 1992 for rigid registration. Although this algorithm works well, it may have problems when the initial positions of the images are very different. In 2003, Chui et al. [12] proposed the Thin Plate Spline Robust Point Matching (TPS-RPM) algorithm. This algorithm solves the shortcomings of the ICP algorithm by building a correspondence matrix for non-rigid registration, without relying on specific spatial mappings. While the TPS-RPM algorithm achieves high accuracy in image registration, it is not universal because it does not make full use of the shared registration patterns among images in the dataset. Therefore, when the fixed images change, it is necessary to adjust parameters and training loss functions for each pair of images. This increases the computational complexity, reduces the registration efficiency, and limits its practical use [13,14].
Using deep learning technology for data-driven image registration is a common way to study the relationship between two images. Deep learning methods are usually divided into supervised deep learning and unsupervised deep learning. Unsupervised deep learning in image registration does not need real transformation field data, so it is more practical to use. This method works on the basis of image similarity: it extracts features to analyze and reduce the differences between moving images and fixed images, so as to complete image registration. In 2015, Jaderberg et al. [15] proposed the Spatial Transformer Network (STN), which laid a foundation for the unsupervised deep learning image registration networks developed later. At the same time, Ronneberger et al. [16] proposed the U-Net model, which is made up of an encoder, a decoder and skip connections. The encoder uses stacked convolutional and pooling layers to extract image features step by step, while the decoder restores image information. This encoder-decoder structure helps to reduce network parameters and balance local features with global semantics, so that the model can achieve good results even when there is not much training data. The strong ability of U-Net to extract features has been proved effective in medical image segmentation and medical image registration [17]. In 2017, De Vos et al. [18] proposed the Deformable Image Registration Network (DIRNet). Based on unsupervised deep learning, DIRNet turns local transformation fields into global transformation fields through interpolation. However, because the input of this model is image blocks, it is more suitable for tasks with small transformations rather than large ones. To solve this problem, Shan et al. [19] proposed a fully convolutional network (FCN) in 2017. This network takes the whole image as input and produces the complete transformation field as output, which improves registration efficiency. In addition, Balakrishnan et al. [20] proposed VoxelMorph (VM) in the same year, using U-Net as the core to realize multi-contrast MR brain image registration. This model optimizes the network with a loss function that includes similarity metrics and transformation field smoothness constraints, and achieves good results. In 2022, Huang et al. [21] proposed the C-F-I-R registration framework with dual consistency constraint. This framework improves robustness, reduces computational complexity, and realizes end-to-end image registration. Although this method makes full use of the shared registration patterns among images in the dataset, there is still room to improve its accuracy. In the same year, Chen et al. [22] proposed TransMorph. Compared with the traditional U-Net structure, it can better capture the long-range dependencies in images and improve registration accuracy. In 2025, Zhu et al. [23] proposed a new multimodal similarity evaluator called SynMSE, which can keep tissue boundaries and maintain the integrity of anatomical topology.
This paper puts forward an unsupervised deep learning network that combines deep learning with a multi-scale strategy. The model is made up of two parts: an AT-Net and a DT-Net. The DT-Net includes a multi-scale strategy, carries out several registration update iterations, improves accuracy, and achieves end-to-end image registration by using a loss function focused on MI. It adds a dual consistency constraint and optimizes the registration result through multiple loss functions.
The main contributions of this paper are as follows:
  • The integration of the multi-scale strategy into the multi-modal registration network, which yields significant improvements in multi-modal registration accuracy.
  • A synergistic mechanism combining dual consistency constraints with the multi-scale strategy. Ablation studies demonstrate that the dual consistency constraints effectively suppress errors induced by the multi-scale strategy.
The rest of this paper is arranged as follows: Section 2 explains our methods, Section 3 shows the experimental results and related analysis, and Section 4 gives the conclusion.

2. Methods

The paper examines the process of two-dimensional multi-contrast MR brain image registration. Here, let $K$ represent the sample count in the multi-contrast datasets, and let $F = \{f_1, f_2, \ldots, f_K\}$ and $M = \{m_1, m_2, \ldots, m_K\}$ refer to the paired fixed image sets and moving image sets.

2.1. Image Registration Framework Based on Deep Learning

2.1.1. Affine Transformation Network—AT-Net

AT-Net is pre-trained in an unsupervised manner using the same MI-based objective, and its output serves as initialization for DT-Net.
The STN is a component of deep learning models designed to enhance the neural network’s ability to effectively handle geometric transformations. This module can autonomously determine the parameters required to transform input images to optimize the overall performance of the network.
The STN module is composed of three primary components: a Localized Coordinate Regression Network, a Spatial Grid Generator, and a Sampler. The Localized Coordinate Regression Network is responsible for learning the geometric transformation parameters based on the input image features. Typically, this network is a neural network that includes convolutional and fully connected layers. The Spatial Grid Generator utilizes the parameters obtained from the Localized Coordinate Regression Network to create a spatial transformation grid. This grid determines the placement of each pixel in the output images. The role of the Sampler is to map the input images onto the generated grid. Because the transformed coordinates are continuous, they generally do not coincide with integer pixel positions, and some may fall outside the image grid, so pixel values cannot be read off directly. Consequently, the values at these coordinates are computed from neighboring grid pixels, and any gaps are filled as needed. This mapping process commonly involves the use of interpolation algorithms.
This algorithm utilizes STN to execute an affine transformation on the moving images. Let $p(x_i, y_i)$ denote a pixel sampled from $m$, where $(x_i, y_i)$ represents the coordinates of the corresponding pixel. The affine transformation can thus be formulated as follows:
$$A_\theta(p) = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \times \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix}$$
where $\theta$ denotes the parameters that define the linear transformation. The methodology entails pre-training a shallow regression network to estimate these parameters. Utilizing the derived parameters, STN can autonomously execute the affine transformation, thereby facilitating the approximate alignment of moving images with their corresponding fixed images without requiring manual intervention. Within this algorithm's framework, this regression network is denoted as the AT-Net. Through the utilization of AT-Net, $M$ can be roughly aligned to $M_A = \{m_{A_1}, m_{A_2}, \ldots, m_{A_K}\}$, achieving the initial step of image registration. Given that affine transformations depend on global information, their effectiveness may be reduced when applied to regions with low signal intensity. Consequently, the outputs produced by the affine transformation network are regarded as preliminary registration results that require subsequent refinement.
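As a small illustration of the affine map $A_\theta$, the sketch below applies a 2×3 parameter matrix to pixel coordinates written in homogeneous form. It is only a minimal NumPy example: the function name `affine_transform_points` and the sample matrices are hypothetical and are not taken from the AT-Net implementation.

```python
import numpy as np

def affine_transform_points(theta, points):
    """Apply A_theta to (x, y) points.

    theta:  2x3 affine parameters [[t11, t12, t13], [t21, t22, t23]]
    points: (N, 2) array of (x, y) coordinates
    """
    theta = np.asarray(theta, dtype=float)
    points = np.asarray(points, dtype=float)
    # Append a column of ones so each row is [x, y, 1].
    homogeneous = np.hstack([points, np.ones((len(points), 1))])
    return homogeneous @ theta.T  # (N, 2) transformed coordinates

# Illustrative parameters (assumptions, not learned values).
identity = [[1, 0, 0], [0, 1, 0]]
shift = [[1, 0, 2], [0, 1, -1]]  # translate x by +2 and y by -1
pts = [[0, 0], [3, 4]]
moved = affine_transform_points(shift, pts)
```

In an STN, the same product would be evaluated for every pixel of the sampling grid, with $\theta$ supplied by the regression network.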

2.1.2. Deformable Transformation Network—DT-Net

DT-Net is a deep learning method for calculating non-rigid transformations between medical images. It aligns moving images to fixed images for registration by learning transformation fields. This network is good at capturing local changes and deformations in images, making it suitable for registering medical images with different shapes. In the network structure proposed in this paper, DT-Net processes images that have been preprocessed by AT-Net. DT-Net usually has two connected transformation networks: one handles global transformation, and the other handles local transformation. The global network learns the overall structure and main features of images to align the entire image sets. The local network, on the other hand, focuses on learning the local features and changes in images to adjust specific regions.
DT-Net is trained using medical images together with the coarse alignment results predicted by AT-Net. By reducing the difference between the predicted transformation fields and the real ones, DT-Net can build a robust registration model suitable for various image registration tasks. The core idea of this network is to design a differentiable operation for each pixel. It optimizes this operation via network training and finally achieves image registration.
Let us denote the obtained transformation field as $\sigma$, where each element within $\sigma$ corresponds to an offset distance. For each pixel $p$, the transformation to $p'$ can be expressed as follows:
$$p' = p + \sigma(p)$$
This paper utilizes VM as the DT-Net within the network framework. Following the pixel transformation, VM conducts an additional linear interpolation among adjacent pixels to prevent discontinuities in the transformed images:
$$m \circ \sigma(p) = \sum_{q \in Z(p')} m(q) \prod_{\dim \in \{x, y\}} \left(1 - |p'_{\dim} - q_{\dim}|\right)$$
where $\circ$ represents the transformation operator, encompassing both pixel shifting and interpolation processes, $q$ is a neighboring pixel, and $Z(p')$ refers to the region formed by the contiguous pixels around $p'$. This differentiable interpolation mechanism facilitates the generation of predicted outcomes that exhibit enhanced smoothness and greater realism.
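The interpolation formula can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the field is stored as an array of shape (2, H, W) holding row and column offsets, and coordinates falling outside the image are clamped to the border. Neither choice is specified in the text, and the name `warp_bilinear` is hypothetical.

```python
import numpy as np

def warp_bilinear(image, flow):
    """Sample image at p' = p + flow(p) with bilinear interpolation.

    image: (H, W) array
    flow:  (2, H, W) array, flow[0] = row offsets, flow[1] = column offsets
    Out-of-range coordinates are clamped to the border (an assumption).
    """
    h, w = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    py = np.clip(rows + flow[0], 0, h - 1)
    px = np.clip(cols + flow[1], 0, w - 1)
    # The four neighbours q of p' form the region Z(p').
    y0, x0 = np.floor(py).astype(int), np.floor(px).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = py - y0, px - x0
    # Each weight is the product over dimensions of (1 - |p'_dim - q_dim|).
    return ((1 - wy) * (1 - wx) * image[y0, x0]
            + (1 - wy) * wx * image[y0, x1]
            + wy * (1 - wx) * image[y1, x0]
            + wy * wx * image[y1, x1])

ramp = np.arange(16.0).reshape(4, 4)
half_pixel = np.zeros((2, 4, 4))
half_pixel[1] = 0.5  # shift sampling positions by half a pixel along columns
shifted = warp_bilinear(ramp, half_pixel)
```

Because every output value is a weighted sum of input pixels with weights that vary continuously in the offsets, gradients can flow back to the field, which is the property the text relies on.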
Some changes were made to the network architecture, including the adoption of a deeper convolution structure for feature extraction and the replacement of the ReLU function with the Leaky ReLU function.

2.1.3. Dual Consistency-Constrained Bi-Directional Image Transformation

This component-wise decomposition is directly inherited from the C-F-I-R framework [21]. Within this framework, such decomposition enables independent regularization across spatial dimensions, thereby significantly enhancing the numerical stability of the inversion process. As explicitly stated in the source paper, “the registration procedure should be symmetrical, which refers to the bi-directional transformations between the moving images and the fixed images.” This symmetry prior was first established in earlier work via Euler-Lagrange iterative optimization and has been widely validated in medical image registration. Since our method directly inherits this bidirectional transformation framework, the assumption that forward and inverse transformations share the same distribution is a natural consequence of this well-established symmetry principle, rather than an ad hoc design.
Intuitively, the registration process is anticipated to demonstrate symmetry, characterized by the bidirectional transformations linking the moving images and the fixed images. Nevertheless, achieving this symmetry in practical applications is not straightforward. We use an innovative bi-directional image transformation method. The σ represents the transformation field for the forward transformation from the moving image to the fixed image. However, simply applying the σ to the transformation field is not enough to derive the inverse transformation and reconstruct the images for registration. This is because of the loss of correspondence between the σ and the image pixels.
In this paper, the transformation field $\sigma$, consisting of horizontal and vertical components within a two-dimensional domain, is introduced. The field $\sigma$ is decomposed into two constituent offset fields, denoted as $\sigma_x$ and $\sigma_y$. These offset fields are subsequently warped by applying the original transformation field $\sigma$, resulting in deformed offset fields. By recombining these transformation fields, a novel transformation field is constructed. Finally, the inverse transformation field $\sigma^{-1}$ is obtained by multiplying the combined result by −1. This methodology effectively aligns the transformation field with the pixel coordinates of the registered image. The complete procedure is formalized as follows:
$$\sigma^{-1} = -\sum_{i \in \{x, y\}} (\sigma_i \circ \sigma)$$
Given the absence of a reference image to evaluate the accuracy of multi-contrast MR image registration prediction, simultaneous bi-directional registration from $M$ to $F$ and from $F$ to $M$ presents a challenge. This paper proposes a compromise that converts the multi-modal image registration task into a unimodal image registration task. Specifically, the algorithm utilizes the registered result $M_D = \{m_{D_1}, m_{D_2}, \ldots, m_{D_K}\}$ instead of the fixed images $F$ to calculate the inverse transformed images, $M_D^{-1} = M_D \circ \sigma^{-1}$, assuming that $M_D^{-1}$ and $M_A$ maintain the same distribution. Subsequently, the algorithm utilizes a consistency loss, such as Mean Squared Error (MSE) or Normalized Cross-Correlation (NCC), to constrain $M_D^{-1}$ to $M_A$ and ensure accurate registration.
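The inverse-field construction and the consistency check described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the helper names `warp` and `inverse_flow`, the (2, H, W) field layout, and border clamping are all assumptions. The example uses a constant translation, for which the approximation is exact away from the image border.

```python
import numpy as np

def warp(field, flow):
    """Bilinear sampling of `field` at p + flow(p); borders are clamped."""
    h, w = field.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    py = np.clip(rows + flow[0], 0, h - 1)
    px = np.clip(cols + flow[1], 0, w - 1)
    y0, x0 = np.floor(py).astype(int), np.floor(px).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = py - y0, px - x0
    return ((1 - wy) * (1 - wx) * field[y0, x0] + (1 - wy) * wx * field[y0, x1]
            + wy * (1 - wx) * field[y1, x0] + wy * wx * field[y1, x1])

def inverse_flow(flow):
    """Warp each offset component by the field itself, recombine, negate."""
    return -np.stack([warp(flow[0], flow), warp(flow[1], flow)])

rng = np.random.default_rng(0)
m_a = rng.random((8, 8))            # stands in for a coarsely aligned image M_A
sigma = np.zeros((2, 8, 8))
sigma[0], sigma[1] = 1.0, 2.0       # constant offset: 1 row, 2 columns
m_d = warp(m_a, sigma)              # forward result M_D
sigma_inv = inverse_flow(sigma)
m_d_inv = warp(m_d, sigma_inv)      # inverse-transformed image M_D^{-1}
# Consistency (MSE) term, evaluated away from the clamped border.
consistency = np.mean((m_d_inv[1:-1, 2:-2] - m_a[1:-1, 2:-2]) ** 2)
```

For non-constant fields the round trip is only approximate, which is why the consistency term is used as a soft constraint rather than an exact identity.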

2.2. Image Registration Problem Integrating Multi-Scale Strategy and Deep Learning

2.2.1. Multi-Scale Strategy

It is worth noting that our multi-scale strategy differs from standard image pyramids. It is an iterative cascaded framework where each stage recursively updates the transformation field to handle large deformations, rather than simply resizing the input image.
In the field of medical image registration, the multi-scale strategy [24] represents a specialized processing method that involves considering and utilizing image data at various scales throughout the registration procedure. This method is based on gradually refining registration from coarse-to-fine levels to enhance the accuracy and efficiency of the registration process. The multi-scale image registration procedure typically involves the following stages:
(1) Coarse Registration (Low Resolution): Low-resolution images are used for coarse registration, reducing computational load and speeding up processing.
(2) Incremental Registration (Increasing Resolution): After the coarse registration, the image resolution is progressively increased, and the registration process is repeated. With each increase in resolution, more detailed information becomes available for more accurate registration. This iterative process can be conducted multiple times, with each iteration refining the registration outcomes to enhance precision.
(3) Fine Registration (High Resolution): In the final phase, precise registration is conducted using full-resolution images. Given that the images are already roughly aligned at this point, the registration process can focus on making minor adjustments to achieve more precise registration.
Multi-scale strategies are very useful in medical image registration. They can improve registration accuracy and reduce errors caused by local optimal solutions. In addition, this strategy can speed up the algorithm by quickly narrowing the search range in coarse registration, which lowers the computation needed for fine registration. However, the complexity of transformation fields and other factors may still increase the time required for the whole process.
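For reference, the generic coarse-to-fine stages listed above operate on a resolution pyramid, which could be built as in the sketch below. This illustrates only the standard pyramid, not the cascaded variant adopted in this paper; the function names and the 2×2 averaging scheme are assumptions made for the example.

```python
import numpy as np

def downsample2(image):
    """Halve the resolution by averaging non-overlapping 2x2 blocks.

    Assumes the height and width are even.
    """
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(image, levels=3):
    """Return the images ordered from coarsest to full resolution."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample2(pyramid[-1]))
    return pyramid[::-1]

pyramid = build_pyramid(np.arange(64.0).reshape(8, 8), levels=3)
```

A coarse-to-fine method would register the first entry of `pyramid`, then use that result to initialize registration at each subsequent, finer level.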
Next, we integrate the multi-scale strategy with DT-Net.
Let $S = \{I_{m_i}\}_{i=0}^{N}$ denote a multi-contrast MR medical image, where each $I_{m_i}$ is defined on the domain $\Omega \subset \mathbb{R}^2$. Let $I_M$ and $I_F$ denote the moving image and the fixed image, respectively. We want to generate a flow prediction function, $G$, that takes $I_M$ and $I_F$ and predicts a transformation field σ: Ω → Ω that aligns the sequence $S$.
The method follows an n-cascade architecture which decomposes the registration into progressively small deformations that are recursively applied to the warped image.
Each cascade within the system functions as a foundational network that operates as a flow prediction function. This function $G$ receives a pair of images $(I_M, I_F)$ and predicts a transformation field $\sigma$ by registering the moving image $I_M$ to the fixed image $I_F$. The predicted transformation field $\sigma_k$ for the k-th cascade can be derived accordingly:
$$\sigma_k = g_k(I_M^{(k-1)}, I_F)$$
where $g_k$ denotes the flow prediction function corresponding to the k-th cascade. The warped image is generated through the application of the transformation field $\sigma_k$ to the moving image $I_M^{(k-1)}$:
$$I_M^{(k)} = \sigma_k \circ I_M^{(k-1)}$$
According to the recursive model, the ultimate flow prediction function $G$ is constructed through the composition of all generated transformation fields.
$$G(I_M, I_F) = \sigma_n \circ \sigma_{n-1} \circ \cdots \circ \sigma_1$$
Therefore, the final warped image $M_D$ is:
$$M_D = (\sigma_n \circ \sigma_{n-1} \circ \cdots \circ \sigma_1) \circ M_A$$
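The recursion can be illustrated with a deliberately simple toy in which every cascade predicts an integer translation, so composing the fields reduces to adding the shifts. The brute-force `predict_shift` below is a hypothetical stand-in for the learned $g_k$, and `np.roll` (periodic translation) stands in for the warping operator; neither reflects the actual DT-Net.

```python
import numpy as np

def predict_shift(moving, fixed, max_step):
    """Hypothetical g_k: best integer shift within +/- max_step by MSE."""
    best, best_err = (0, 0), np.inf
    for dy in range(-max_step, max_step + 1):
        for dx in range(-max_step, max_step + 1):
            candidate = np.roll(moving, (dy, dx), axis=(0, 1))
            err = np.mean((candidate - fixed) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best

def cascade_register(moving, fixed, max_steps=(4, 2, 1)):
    """Apply progressively smaller corrections to the already warped image."""
    warped, total = moving.copy(), [0, 0]
    for max_step in max_steps:
        dy, dx = predict_shift(warped, fixed, max_step)     # sigma_k
        warped = np.roll(warped, (dy, dx), axis=(0, 1))     # I_M^(k)
        total[0] += dy                                      # composition of
        total[1] += dx                                      # translations
    return warped, tuple(total)

yy, xx = np.mgrid[0:32, 0:32]
fixed = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 3.0 ** 2))
moving = np.roll(fixed, (-5, 3), axis=(0, 1))   # misaligned copy
warped, total = cascade_register(moving, fixed)
```

The first stage cannot reach the full displacement on its own because its search range is limited; the later stages recover the remainder, which mirrors the decomposition into progressively smaller deformations.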
Multi-scale strategies leverage image features across different scales. This approach performs registration at multiple levels, updates the deformation field gradually, and repeatedly aligns the moving image to the fixed image. It makes good use of features from different image resolutions and boosts registration accuracy. We carried out experiments to verify the effect of this strategy, and the results confirm that it improves registration performance. Figure 2 presents the structure of the multi-scale DT-Net registration framework.

2.2.2. Coarse-to-Fine Multi-Contrast MR Image Registration Framework with Dual Consistency Constraint

The method proposed in this paper differs from traditional two-step registration methods. Our framework first uses the AT-Net for preliminary image alignment, and then freezes the trained parameters of AT-Net. These fixed parameters are imported into the DT-Net. In this way, the two networks are integrated into a single framework to achieve end-to-end image registration. This design brings several clear advantages. First, the DT-Net takes coarsely aligned images generated by AT-Net as input, and these images are already well matched to the fixed images. Second, the fixed AT-Net parameters do not require repeated updates during model training, which simplifies the training process. Third, the proposed framework can output both affine transformation results and deformable transformation results as additional registration outputs.
The diagram in Figure 3 illustrates our multi-contrast MR image registration framework, which incorporates dual consistency constraints in a coarse-to-fine approach. This framework comprises three primary components: (1) The pre-trained AT-Net ($A_\theta$) for coarse affine registration. AT-Net takes a pair of multi-contrast MR images, $M$ and $F$, as input. The coarsely aligned images, denoted as $M_A$, serve as inputs for the subsequent DT-Net. It is crucial to highlight that, upon completion of pre-training, the parameters of AT-Net are fixed and not subject to further updates. (2) The DT-Net ($D_\theta$), which is responsible for producing the final predictions. DT-Net receives a concatenated input of $F$ and $M_A$ and produces a dense transformation field $\sigma$, which is used to generate the final prediction, $M_D$. (3) A dual consistency constraint, which involves an inverse transformation from $M_D$ to $M_D^{-1}$ to enhance registration performance. The inverse transformation field, denoted as $\sigma^{-1}$, is calculated and used to warp $M_D$ to obtain $M_D^{-1}$. The dual consistency constraint is achieved by enforcing a similarity measure between $M_D^{-1}$ and $M_A$. By implementing a bi-directional registration strategy, it is expected that unwanted interpolation during image registration will be reduced, leading to a more accurate registration outcome.

2.2.3. Loss Function

Several loss functions are employed to optimize the framework for multi-contrast magnetic resonance image registration. To simplify, an unspecified network denoted as $\xi_\theta(\cdot)$ is utilized, which can be either AT-Net or DT-Net.
One of the key loss functions commonly employed is MI, which quantifies the relationship in distribution between two stochastic variables [25]. In this context, we establish two marginal probability distributions, $p_F(f)$ and $p_M(m)$, alongside a joint probability distribution $p_{F,M}(f, m)$. MI quantifies the extent of dependence between $F$ and $M$ by evaluating the divergence between their joint distribution $p_{F,M}(f, m)$ and the product of their marginal distributions $p_F(f)\,p_M(m)$ through the application of the Kullback–Leibler divergence metric [26]. The MI can be expressed as:
$$MI(F, M) = \iint p_{F,M}(f, m) \log\left(\frac{p_{F,M}(f, m)}{p_F(f)\, p_M(m)}\right) dx\, dy$$
If F and M are assumed to be independent, the joint probability distribution p F , M ( f , m ) can be expressed as the product of the marginal probability distributions p F ( f ) p M ( m ) . In this instance, the parameter M I ( F , M ) will be equal to zero, indicating the absence of MI between the two variables. The maximization of MI serves as a broad and effective criterion because it does not require any assumptions about the nature of the dependence between variables and does not impose restrictions on the content of images from different modalities involved.
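A discrete, histogram-based estimate of this quantity can be sketched as follows. It is meant only to make the definition concrete: the bin count and the function name `mutual_information` are assumptions, and the integral is replaced by a sum over histogram bins.

```python
import numpy as np

def mutual_information(f, m, bins=16):
    """Histogram estimate of MI between two images of equal shape."""
    joint, _, _ = np.histogram2d(f.ravel(), m.ravel(), bins=bins)
    p_fm = joint / joint.sum()                 # joint distribution
    p_f = p_fm.sum(axis=1, keepdims=True)      # marginal of f, shape (bins, 1)
    p_m = p_fm.sum(axis=0, keepdims=True)      # marginal of m, shape (1, bins)
    nonzero = p_fm > 0                         # empty bins contribute nothing
    ratio = p_fm[nonzero] / (p_f @ p_m)[nonzero]
    return float(np.sum(p_fm[nonzero] * np.log(ratio)))

rng = np.random.default_rng(0)
image = rng.random((64, 64))
mi_same = mutual_information(image, image)
mi_inverted = mutual_information(image, 1.0 - image)   # inverted contrast
mi_noise = mutual_information(image, rng.random((64, 64)))
```

The contrast-inverted copy still scores far above the unrelated noise image, which reflects why MI suits multi-modal pairs: it rewards statistical dependence rather than equal intensities.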
MR images are typically presented in grayscale, with background regions exhibiting intensity values near zero. Consequently, it is expected that no signal should be present within the background areas of registered images. In light of this, we propose a background suppression loss function grounded in prior knowledge, which penalizes $MSE(f, m) = (f - m)^2$ at the pixels where $f$ corresponds to background.
By combining the MI loss function with a loss function that suppresses background based on prior knowledge, a loss function is derived:
$$JL(F, \xi_\theta(F, M), \alpha, \beta) = \sum_{f, m} \left( \alpha\, MI(f, \xi_\theta(f, m)) + \beta \sum_i \begin{cases} MSE(f_i, \xi_\theta(f, m)_i), & \text{if } f_i < \gamma \\ 0, & \text{otherwise} \end{cases} \right)$$
where $i \in N$ indexes the pixels within the images, while $\gamma$ signifies a threshold derived from the dataset used to distinguish background pixels from foreground. The parameters $\alpha$ and $\beta$ serve as adjustment factors to balance the contributions of the two loss components. Note: The coarse alignment used to initialize DT-Net is the output of AT-Net, predicted in an unsupervised manner; no ground-truth deformation is involved. This loss function is designed to not only enforce global image alignment through the maximization of MI but also to penalize erroneous predictions within designated regions. Consequently, it enhances the consistency of the predictions with the inherent characteristics of medical images.
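The background-suppression part of this loss can be sketched as a masked squared error. The example is a minimal NumPy illustration; the function name `background_suppression` and the sample values are assumptions, and the MI term is omitted.

```python
import numpy as np

def background_suppression(fixed, predicted, gamma):
    """Squared error summed over pixels where the fixed image is background.

    A pixel i counts as background when fixed[i] < gamma; all other pixels
    contribute zero, matching the piecewise definition in the loss.
    """
    background = fixed < gamma
    return float(np.sum((fixed[background] - predicted[background]) ** 2))

fixed = np.array([[0.00, 0.80],
                  [0.05, 0.90]])
predicted = np.array([[0.30, 0.10],
                      [0.05, 0.20]])
penalty = background_suppression(fixed, predicted, gamma=0.1)
```

Only the two left-hand pixels fall below the threshold, so the large errors in the foreground column do not affect this term.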
The second loss function employed aims to satisfy the dual consistency constraint. Instead of calculating the MI loss between M D 1 and M A , a basic MSE loss is calculated.
The final loss function is computed to impose constraints on the transformation field σ . Without such constraints, the transformation may exhibit irregular displacements; however, the previously described two loss functions may still yield low values due to the interpolation algorithm. To mitigate these issues, a spatial smoothness loss function is employed to regularize and refine the transformation field σ :
$$SL(\sigma) = \sum_{f, m} |\nabla \sigma(f, m)|^2$$
where $\nabla(\cdot)$ represents the calculation of gradients. By constraining the gradient of the transformation field, we ensure the smoothness of the transformation field, thereby preventing significant pixel displacements.
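A finite-difference version of the smoothness term might look like the sketch below. The (2, H, W) field layout and the use of forward differences are assumptions made for illustration, as is the function name.

```python
import numpy as np

def smoothness_loss(flow):
    """Sum of squared forward differences of a (2, H, W) transformation field."""
    d_rows = np.diff(flow, axis=1)   # differences between vertical neighbours
    d_cols = np.diff(flow, axis=2)   # differences between horizontal neighbours
    return float(np.sum(d_rows ** 2) + np.sum(d_cols ** 2))

constant = np.ones((2, 4, 4))                    # rigid shift: zero gradient
ramp = np.zeros((2, 4, 4))
ramp[1] = np.arange(4.0)                         # column offset grows by 1 per pixel
loss_constant = smoothness_loss(constant)
loss_ramp = smoothness_loss(ramp)
```

A constant field incurs no penalty, whereas any spatial variation in the offsets is penalized in proportion to its squared gradient.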
The comprehensive loss function employed to optimize the framework is defined as follows:
$$Loss_{total}(F, M) = \lambda_1 SL(\sigma) + JL(F, \xi_\theta(F, M), \lambda_2, \lambda_3) + \lambda_4 MSE(A_\theta(F, M), D_\theta^{-1}(F, M))$$
The equation contains four adjustment factors $\lambda_1, \lambda_2, \lambda_3, \lambda_4$. These hyperparameters can be adjusted to different values depending on the specific experimental conditions.
All experiments use the three-scale configuration (Group 3) as the default setting, based on the accuracy-efficiency trade-off analysis in Section 3.3.

2.3. Variational Formulation and Numerical Optimization

2.3.1. Unified Variational Model

Image registration is fundamentally an ill-posed inverse problem that seeks a spatial transformation aligning a moving image $M$ to a fixed image $F$. To ensure mathematical well-posedness, we formulate this task as the minimization of a unified energy functional over an admissible function space. Let $\Omega \subset \mathbb{R}^d$ (d = 2 or 3) denote the image domain. We seek the optimal transformation field $\sigma: \Omega \to \mathbb{R}^d$ such that the warped coarsely aligned image $\sigma(\cdot) \circ M_A$ best matches $F$.
The registration problem is defined as the following variational optimization:
$$u^* = \arg\min_{u \in \mathcal{A}} E(u) := -\lambda\, \overline{MI}(\sigma(\cdot) \circ M_A, F) + \mu R(u) \tag{13}$$
where
(1) $\overline{MI}(\cdot, \cdot)$ denotes the continuous and differentiable approximation of mutual information (MI), serving as the data fidelity term. The negative sign converts the similarity maximization into a minimization problem.
(2) $R(u)$ is the regularization term enforcing spatial smoothness and topological preservation.
(3) $\lambda > 0$ and $\mu > 0$ are weighting parameters balancing fidelity and regularity.
(4) $\mathcal{A} \subset H^1(\Omega; \mathbb{R}^d)$ is the admissible space of transformation fields, typically constrained to satisfy $\det(\nabla \sigma(\cdot)) > 0$ almost everywhere to guarantee diffeomorphic deformations.
We equip $H^1(\Omega; \mathbb{R}^d)$ with the standard norm:
$$\|u\|_{H^1}^2 = \|u\|_{L^2}^2 + \int_\Omega |\nabla u(x)|^2\, dx,$$
where $|\cdot|$ denotes the Frobenius norm.
Remark 1 (Differentiability of the Fidelity Term). Classical discrete MI based on histogram binning is non-differentiable with respect to $u$, precluding gradient-based variational analysis. In this work, $\overline{MI}(\cdot, \cdot)$ is constructed via Parzen window density estimation with Gaussian kernels [27]. This smoothing ensures that $\overline{MI}(\cdot, \cdot)$ is Fréchet differentiable with respect to $u$ in the $H^1$ topology, rendering the Euler-Lagrange equation of Equation (13) well-defined and enabling gradient descent optimization.
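To make the smoothing idea in Remark 1 concrete, the sketch below replaces hard histogram bins with Gaussian kernel weights, so that each intensity contributes smoothly to all bins. It is a simplified stand-in for a Parzen-window estimator, not the construction of [27]: the bin centers, the kernel width `sigma`, the per-sample weight normalization, and the function name are all assumptions, and intensities are assumed to lie in [0, 1].

```python
import numpy as np

def parzen_mi(f, m, bins=16, sigma=0.05):
    """Smoothed MI estimate using Gaussian soft assignment to bin centers."""
    centers = np.linspace(0.0, 1.0, bins)

    def soft_assign(values):
        # Gaussian weight of every sample with respect to every bin center.
        w = np.exp(-0.5 * ((values.reshape(-1, 1) - centers) / sigma) ** 2)
        return w / w.sum(axis=1, keepdims=True)   # weights of a sample sum to 1

    wf, wm = soft_assign(f), soft_assign(m)
    p_fm = wf.T @ wm / wf.shape[0]                # smooth joint distribution
    p_f = p_fm.sum(axis=1, keepdims=True)
    p_m = p_fm.sum(axis=0, keepdims=True)
    eps = 1e-12                                   # guards log(0)
    return float(np.sum(p_fm * np.log((p_fm + eps) / (p_f @ p_m + eps))))

rng = np.random.default_rng(0)
image = rng.random((64, 64))
noise = rng.random((64, 64))
mi_same = parzen_mi(image, image)
mi_noise = parzen_mi(image, noise)
mi_perturbed = parzen_mi(image, noise + 1e-6)     # tiny intensity change
```

Because the weights vary continuously with intensity, a tiny change in the input produces a tiny change in the estimate, which is the behavior a gradient-based optimizer needs and which hard binning does not provide.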
Remark 2 (Existence of Minimizers). The energy functional $E(u)$ admits at least one global minimizer in $\mathcal{A}$. This follows from the direct method in the calculus of variations [28], which requires the following three conditions.
(1)
Lower Boundedness: Mutual information satisfies $0 \le \overline{MI}(X;Y) \le \min(H(X), H(Y)) < \infty$. Thus, $-\lambda\, \overline{MI} \ge -\lambda \min(H(F), H(M_A)) > -\infty$. Combined with $R(u) \ge 0$, $E(u)$ is bounded from below. More precisely, under our construction (Parzen window with Gaussian kernel and bounded image intensities), the mutual information estimator is uniformly bounded above, i.e., $\overline{MI} \le C_{MI} < \infty$, so that $-\lambda\, \overline{MI} \ge -\lambda C_{MI}$.
(2)
Coercivity: The regularization term $R(u)$ is designed to be coercive in $H^1(\Omega)$, i.e., $R(u) \to \infty$ as $\|u\|_{H^1} \to \infty$, which dominates the bounded fidelity term.
Proof. 
The regularization term is defined as:
$$R(u) = SL(u) = \int_\Omega \|\nabla u(x)\|_F^2 \, dx,$$
where $\|\cdot\|_F$ is the Frobenius norm.
The optimization is constrained to the admissible set $\mathcal{A} = \{ u \in H^1(\Omega; \mathbb{R}^d) : u|_{\partial\Omega} = 0 \}$.
Under this constraint, Poincaré’s inequality holds: there exists a constant $C_p > 0$ such that
$$\|u\|_{L^2(\Omega)} \le C_p \|\nabla u\|_{L^2(\Omega)}, \quad \forall u \in \mathcal{A}.$$
Consequently,
$$\|u\|_{H^1(\Omega)}^2 = \|u\|_{L^2}^2 + \|\nabla u\|_{L^2}^2 \le (C_p^2 + 1) \|\nabla u\|_{L^2}^2 = (C_p^2 + 1) R(u).$$
Rearranging, we obtain the coercivity estimate:
$$R(u) \ge c \|u\|_{H^1(\Omega)}^2, \quad \text{where } c = \frac{1}{C_p^2 + 1} > 0.$$
Therefore,
$$\|u\|_{H^1} \to \infty \implies R(u) \to \infty.$$
Since the data fidelity term is bounded below by a constant $-C$ (e.g., $C = \lambda C_{MI}$), the total energy satisfies:
$$E(u) \ge \mu R(u) - C \to +\infty \quad \text{as } \|u\|_{H^1} \to \infty.$$
This establishes the coercivity of the full objective functional on $\mathcal{A}$. □
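(3)
Weak Lower Semi-continuity: Under standard intensity regularity assumptions, the smoothed MI estimator $\overline{MI}$ is weakly continuous on bounded subsets of $\mathcal{A}$ [27], and convex regularizers are weakly lower semi-continuous. Hence, $E(u)$ is weakly lower semi-continuous on $\mathcal{A}$.
These three conditions collectively guarantee the existence of a minimizer $u^* \in \mathcal{A}$ for the unified variational problem (13): by coercivity, any minimizing sequence is bounded in $H^1$ and therefore has a weakly convergent subsequence, whose limit remains in $\mathcal{A}$ (a closed subspace of $H^1$) and attains the infimum of $E$ by weak lower semi-continuity.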
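The coercivity estimate in item (2) of Remark 2 can be checked numerically in a simplified setting. The sketch below assumes a one-dimensional domain (0, 1) with zero boundary values, for which the Poincaré constant is $1/\pi$, and uses sine modes as test functions. The paper's setting is two- or three-dimensional, so this is only a toy verification of the inequality $R(u) \ge c \|u\|_{H^1}^2$; all names are hypothetical.

```python
import math

def integral_of_square(f, n=2000):
    """Midpoint-rule approximation of the integral of f(x)^2 over (0, 1)."""
    h = 1.0 / n
    return sum(f((i + 0.5) * h) ** 2 for i in range(n)) * h

C_P = 1.0 / math.pi                      # Poincare constant on (0, 1) with zero boundary values
C_COERCIVE = 1.0 / (C_P ** 2 + 1.0)      # the constant c = 1 / (C_p^2 + 1) from the proof

def norms(k, amplitude):
    """Return (R(u), ||u||_{H^1}^2) for u(x) = amplitude * sin(k * pi * x)."""
    u = lambda x: amplitude * math.sin(k * math.pi * x)
    du = lambda x: amplitude * k * math.pi * math.cos(k * math.pi * x)
    reg = integral_of_square(du)             # R(u) = ||u'||_{L^2}^2
    h1_sq = integral_of_square(u) + reg      # ||u||_{L^2}^2 + ||u'||_{L^2}^2
    return reg, h1_sq

# The inequality should hold for every mode and amplitude (with equality for k = 1).
results = {
    (k, a): norms(k, a)
    for k in (1, 2, 5)
    for a in (0.1, 1.0, 10.0)
}
holds = all(reg >= C_COERCIVE * h1_sq * (1.0 - 1e-9) for reg, h1_sq in results.values())
```

The small relative tolerance only absorbs floating-point rounding in the equality case $k = 1$.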

2.3.2. Multi-Scale Strategy as a Numerical Solver

The novelty of our multi-scale strategy lies not in proposing a new theoretical model, but rather in designing a hierarchical optimization scheme that is intrinsically consistent with the unified variational Formulation (13). Unlike conventional multi-scale registration methods that treat scale transitions as heuristic post-processing, our approach offers the following distinct features:
It maintains an identical variational structure across all scales, ensuring that each refinement step remains a well-defined subproblem within the original energy landscape;
By leveraging the update strategy detailed in Section 2.2.1, it effectively avoids geometric distortions that might otherwise be introduced by naive interpolation;
It serves as a structured globalization mechanism specifically tailored to the non-convexity of mutual information, making the optimization less likely to stall in poor local minima. This helps bridge the gap between theoretical existence (see Remark 2) and practical reliability.
Therefore, this multi-scale strategy is not merely a numerical trick decoupled from the theoretical framework, but rather an integral and innovative component of our computational framework, enabling reliable minimization of a theoretically sound yet highly non-convex objective.

2.3.3. Discrete Optimization and Convergence

At each resolution level, the discretized version is minimized using gradient descent with backtracking line search. Since $\overline{MI}$ is $C^1$ after Parzen smoothing and $R(u)$ is typically quadratic or convex, the discrete objective is continuously differentiable. Standard convergence results for non-convex smooth optimization [29] guarantee that the iterative sequence $u^{(t)}$ converges to a stationary point satisfying $\|\nabla E_k(u^{(t)})\| < \varepsilon$. We do not claim convergence to a global minimum due to the non-convex nature of MI; however, the combination of multi-scale continuation and gradient-based refinement empirically yields robust and anatomically plausible solutions, as validated in Section 3.
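As a minimal illustration of gradient descent with backtracking (Armijo) line search, the sketch below minimizes a two-variable smooth non-convex toy function. The toy objective, the starting point, and all parameter values (`step0`, `shrink`, `armijo`, `eps`) are assumptions chosen for illustration; they are not the registration energy or the settings used in the experiments.

```python
import math

def energy(u):
    """Toy smooth non-convex objective (NOT the registration energy)."""
    x, y = u
    return math.cos(3.0 * x) + 0.5 * x * x + 0.5 * (y - 1.0) ** 2

def gradient(u):
    x, y = u
    return [-3.0 * math.sin(3.0 * x) + x, y - 1.0]

def backtracking_descent(u, eps=1e-6, step0=1.0, shrink=0.5, armijo=1e-4, max_iter=10_000):
    """Iterate until the gradient norm falls below eps or max_iter is reached."""
    for _ in range(max_iter):
        g = gradient(u)
        g_sq = sum(gi * gi for gi in g)
        if math.sqrt(g_sq) < eps:
            break
        step = step0
        e0 = energy(u)
        # Shrink the step until the Armijo sufficient-decrease condition holds.
        while energy([ui - step * gi for ui, gi in zip(u, g)]) > e0 - armijo * step * g_sq:
            step *= shrink
        u = [ui - step * gi for ui, gi in zip(u, g)]
    return u

u_start = [0.3, -2.0]
u_final = backtracking_descent(u_start)
final_grad_norm = math.sqrt(sum(gi * gi for gi in gradient(u_final)))
```

The returned point is only guaranteed to be approximately stationary; which stationary point is reached depends on the starting point, which is the motivation for the multi-scale continuation described above.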

2.3.4. Mathematical Role of Each Loss Term and Regularization Component

We explicitly design a composite $Loss_{total}$ (Equation (12)) that combines data fidelity with multi-faceted regularization:
(1)
$SL(\sigma)$ imposes a smoothness constraint on the deformation field $\sigma$ to prevent unphysical deformations;
(2)
$JL(\cdot)$ is a multi-objective joint loss function, in which the parameters $\lambda_2$ and $\lambda_3$ control the relative importance of its internal sub-terms;
(3)
$MSE(A_\theta(F, M), D_\theta^{-1}(F, M))$ enforces inverse consistency by minimizing the mean squared error between the forward-warped image and the inverse-warped image, thereby promoting the topological validity and diffeomorphic property of the deformation field.

3. Experiments and Results

This section presents a comprehensive evaluation of the proposed methods through extensive experimental validation. The image registration experiments are primarily performed using T1 and T2 data.

3.1. Experimental Environment and Dataset

AT-Net Architecture: AT-Net serves as our regression network, comprising five downsampling blocks and two fully connected layers. Each downsampling block consists of two 3 × 3 convolutional layers followed by one 2 × 2 max-pooling layer. At the terminal end, the two fully connected layers output six transformation parameters for affine alignment. The total number of trainable parameters in AT-Net is approximately 588 k.
DT-Net Architecture: Built upon an improved U-Net encoder–decoder structure, DT-Net handles dense deformations. Its final layer utilizes two 3 × 3 convolutional layers with linear activation to generate the final deformation field. The total number of trainable parameters in DT-Net is approximately 1474 k.
Experimental Setup and Dataset: All experiments were conducted on a Linux system using TensorFlow 1.10.0 and Keras on a single NVIDIA RTX 2080 Ti GPU. We used the BraTS2020 dataset (Training + Validation) obtained from the Kaggle platform [1,2,3,4,5], which was preprocessed into 2D slices to obtain brain tumor images. For our multi-modal registration task, T2-weighted images served as moving images and T1-weighted images as fixed images; registering T2 to T1 yields more informative fused images. The dataset was partitioned into 320 training pairs, 32 validation pairs, and 32 test pairs. During training, data augmentation (random shifts, rotations, scaling, and horizontal flips) was applied. The model was optimized using the Adam optimizer (learning rate $1 \times 10^{-3}$, batch size 8) for 300 epochs.
To support reproducibility, we have released our complete source code, including network definitions and training configurations, at https://osf.io/tc6f5/ (accessed on 31 May 2026).
For detailed configurations, see Table 1.

3.2. Evaluation Metric

In this paper, mutual information (MI) computed with the natural logarithm (base e) is used as the evaluation metric. Higher MI values indicate a stronger statistical dependence between the two images. The exact definition of MI has been given earlier. Maximizing MI for image registration has clear advantages: it can reduce the influence of noise and does not require image segmentation. Moreover, it does not require prior assumptions about the intensity relationship between moving images and fixed images. Therefore, MI-based registration is widely used in multi-modal image registration, and it also performs effectively on damaged images.

3.3. Experiments

3.3.1. Effectiveness of the Multi-Scale Strategy

To assess the efficacy of the multi-scale strategy, we ran four groups of experiments on the dataset images, which are 224 × 224 pixels in size. The first group served as a control in which the multi-scale strategy was not used, so the network processed images only at 224 × 224. In the second group, a single update step was performed using images of sizes 112 × 112 and 224 × 224. The third group iteratively updated images of sizes 56 × 56, 112 × 112, and 224 × 224. The fourth group iteratively updated images of sizes 28 × 28, 56 × 56, 112 × 112, and 224 × 224. The results of these experiments are presented in Table 2.
From Table 2, it is evident that incorporating the multi-scale strategy improves the registration results compared to the original algorithm. The MI evaluation metric increases with the number of scales, indicating enhanced registration accuracy. However, the standard deviation gradually rises, and the registration time becomes longer, because each additional scale adds computation. By sacrificing some time, more accurate registration results can be achieved, and the efficiency remains acceptable. Overall, integrating the multi-scale strategy within this framework has proven effective. When applying this strategy in practical scenarios, the number of scales should be chosen to balance precision and efficiency. For this specific dataset, performing two updates (the three-scale configuration of Group 3) was determined to be the most suitable approach, and subsequent experiments follow this setting.
To quantify the computational cost, we conducted a marginal gain analysis using the data from Table 2. The results are summarized in Table 3.
As shown in Table 3, increasing scales from Group 2 to Group 3 yields the highest marginal efficiency (40.00 MI per second), with a substantial MI gain (+0.076) at minimal additional cost (+0.0019 s/slice). In contrast, adding a fourth scale (Group 3 → 4) provides only a marginal MI improvement (+0.011, +0.8%) while increasing GPU time by 43% (+0.0136 s/slice), causing the marginal efficiency to drop by a factor of roughly 50 (to 0.81). This clear diminishing return justifies our default choice of three scales as the optimal trade-off between accuracy, stability, and clinical feasibility.
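The marginal-efficiency figures can be reproduced directly from the MI gains and time increments quoted above. The sketch below uses only those four numbers; the dictionary layout and function name are illustrative.

```python
# MI gains and additional GPU time per slice, as quoted in the text for Table 3.
steps = {
    "Group 2 -> 3": {"mi_gain": 0.076, "extra_time_s": 0.0019},
    "Group 3 -> 4": {"mi_gain": 0.011, "extra_time_s": 0.0136},
}

def marginal_efficiency(mi_gain, extra_time_s):
    """MI gained per additional second of GPU time per slice."""
    return mi_gain / extra_time_s

efficiency = {name: marginal_efficiency(**values) for name, values in steps.items()}

# How many times less efficient the fourth scale is compared to the third.
drop_factor = efficiency["Group 2 -> 3"] / efficiency["Group 3 -> 4"]

for name, value in efficiency.items():
    print(f"{name}: {value:.2f} MI per second")
print(f"drop factor: {drop_factor:.1f}x")
```

The ratio of the two efficiencies comes out slightly below 50.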

3.3.2. The Influence of Different Learning Rates and Network Widths

Figure 4 and Figure 5 illustrate the impact of different learning rates and network widths on registration outcomes. The study compared MI evaluation metrics across networks of varying widths operating at two distinct learning rates. While a higher learning rate may accelerate convergence, the erratic behavior of the metric curves indicates training instability. Consequently, a learning rate of $1 \times 10^{-3}$ was selected. Analysis of registration outcomes across networks with three different widths revealed inferior performance for the network with a width of 4 compared to widths of 8 and 16, suggesting limitations in capturing intricate image features. Notably, the network with a width of 8 outperformed the one with a width of 16, demonstrating superior registration results for the dataset under consideration. Therefore, a network width of 8 was chosen.

3.3.3. Weight Results of Different Loss Functions

We examined how the weight coefficients of the loss function affect the results, as detailed in Table 4. With the other three coefficients held constant, increasing the λ2 coefficient raised the overall MI, together with its standard deviation. Observations from multiple training sessions indicated suboptimal network performance when this coefficient was lowered toward 20. With λ2 fixed at 50, adjusting the other three coefficients gave the best results at λ1 = 1, λ3 = 100, λ4 = 100. The weight of the MI term, which corresponds to the final evaluation metric, therefore has the most significant impact on the optimization, whereas the other three coefficients play primarily auxiliary and regulatory roles.

3.3.4. Parameter Sensitivity and Robustness Analysis

This section systematically evaluates the stability of the proposed framework by analyzing the impact of key hyperparameters—specifically, the number of multi-scale update levels (L), the mutual information weight (λ2), and auxiliary regularization coefficients (λ1, λ3, λ4)—on registration performance. All experiments were conducted on the BraTS2020 dataset, with mutual information (MI) as the primary metric (reported as mean ± standard deviation over 100 test cases), complemented by computational efficiency assessments.
Multi-scale strategy robustness. As shown in Table 2, single-scale registration (L = 1) achieves an MI of 1.200 ± 0.111. Two-stage updates (L = 2) improve this to 1.253 ± 0.113, while the three-stage scheme (L = 3; 56 → 112 → 224) attains 1.329 ± 0.122. Extending to four stages (L = 4) yields only a marginal gain (1.340 ± 0.136) but increases CPU time by 62% (from 0.3235 to 0.5234 s/slice) and elevates variance, indicating diminishing returns. Crucially, between L = 3 and L = 4 the MI differs by only 0.011 (1.340 − 1.329, about 0.8%), whereas the step from L = 2 to L = 3 still gains 0.076; the standard deviations remain within 0.11–0.14 for all L ∈ {2, 3, 4}. This indicates that L = 3 sits at the onset of a stability plateau rather than representing an isolated optimum.
Loss weight sensitivity profiles. Table 4 reveals three distinct regimes:
λ2 (MI weight) is the most influential: reducing it from the default 50 to 20 drops MI to 0.931 ± 0.061 (−29.9%, indicating convergence to suboptimal solutions). Within λ2 ∈ [30, 50], MI remains stable at 1.273–1.329 (fluctuation < 4.5%, std ≤ 0.122), with λ2 = 50 achieving peak performance.
λ4 (boundary constraint) exhibits high sensitivity: increasing it from 100 (default, MI = 1.329) to 500 reduces MI to 0.908 ± 0.057 (−31.7%), confirming that excessive boundary penalization distorts the deformation field.
λ1 (displacement smoothness) and λ3 (higher-order deformation) show moderate tolerance, e.g., varying λ1 from 1 to 10 reduces MI by only 5.8%.
Collectively, these define clear robustness margins: the default configuration (L = 3, λ1 = 1, λ2 = 50, λ3 = 100, λ4 = 100) lies in a high-performance region, safely distant from the failure boundaries (λ2 < 25 or λ4 > 150). Thus, exhaustive grid search is unnecessary; reliable registration is achieved with reasonable initialization within these intervals, enhancing deployability and enabling cross-center validation without site-specific tuning.

3.3.5. Time Consumption Analysis of Inverse Transformation

To evaluate the effectiveness of the inverse transformation employed, we analyze the time required by our approach in comparison to existing inverse techniques, VM-diff [30] and LT-Net [31].
The VM-diff method introduces an inverse transformation by incorporating both differential and integral layers, along with a spatial transformation layer. It derives the inverse transformation field by iteratively applying the negative velocity field, and therefore requires more computational iterations than the algorithm presented in this paper, which computes the inverse transformation field directly in a single step. LT-Net, in contrast, is a cycle-correspondence learning method for atlas-based segmentation, in which an additional network is designed to learn the inverse transformation field and perform the inverse transformation through transformation layers. The time efficiency of these methods is compared quantitatively. To ensure a fair evaluation, the same neural network (DT-Net) is used to implement all methods, which differ only in the inverse operation. The resulting time consumption is detailed in Table 5.

3.3.6. Comparison of Registration Effect with Other Models

The quantitative results in Table 6 show that our method achieves a favorable trade-off between registration accuracy and computational efficiency for multi-contrast MR image registration. In terms of alignment quality, our approach yields a mutual information (MI) score of 1.329 ± 0.122, outperforming SyN (+38.3%), VM (+14.3%), and C-F-I-R (+10.8%). Although the recent Transformer-based TransMorph (MI = 1.365) and diffusion-based DiffuseMorph (MI = 1.392) achieve marginally higher MI scores, this improvement comes at a substantial computational cost. Specifically, DiffuseMorph requires 0.537 s/slice on GPU, approximately 17× slower than our method (0.0316 s/slice), and 1.83 s/slice on CPU, making it impractical for time-sensitive clinical workflows such as intraoperative navigation or large-scale screening. By contrast, our multi-scale AT-Net/DT-Net framework maintains near-state-of-the-art accuracy while operating within a clinically acceptable latency threshold (<0.05 s/slice on GPU). Notably, even under CPU-only deployment, our method (0.32 s/slice) remains an order of magnitude faster than SyN (3.23 s/slice), underscoring its suitability for resource-constrained environments. Collectively, these findings indicate that the proposed architecture effectively balances model capacity with inference efficiency, making it well suited to real-world multi-modal neuroimaging applications where both precision and throughput are critical.
Figure 6 presents T1-weighted and T2-weighted images of three patients, illustrating the differences in tissue contrast between these modalities. In the T1-weighted images, cerebrospinal fluid appears as a low signal (dark areas), providing clear contrast between gray and white matter. In contrast, the T2-weighted images show cerebrospinal fluid as a high signal (bright areas), facilitating the identification of lesion areas such as edema and tumors. Anatomical variations are evident across the images of all three patients. For example, patient A’s T2 image reveals a distinct high-signal lesion; patient B exhibits ventricular enlargement; and patient C’s images display lower resolution and reduced contrast. These differences highlight the complexity of multimodal registration in clinical practice, which must simultaneously address signal disparities between modalities, individual anatomical variations, and inconsistent image quality.
Figure 7 compares the results of SyN, VM, C-F-I-R, TransMorph, DiffuseMorph, and our method in T2-to-T1 registration. From the image details, it is evident that SyN and VM exhibit noticeable registration deviations in regions such as the ventricles and cortical edges (e.g., the lesion area of patient A is misaligned, and the ventricle shape of patient B is distorted). Although C-F-I-R improves the alignment of some structures, blurriness still appears in the low-contrast region of patient C. While TransMorph and DiffuseMorph achieve superior global alignment and capture finer anatomical correspondences in high-contrast regions, they occasionally introduce over-smoothing artifacts at tissue boundaries (e.g., the cortical ribbon of patient B) and struggle to preserve subtle lesion textures in patient A, possibly because their emphasis on global context may overlook local intensity variations. In contrast, our method achieves more accurate structural matching across all three patients: the lesion of patient A aligns better with the anatomical boundaries in the T1 image, the ventricle shape of patient B is naturally restored without boundary blurring, and the overall structural alignment of patient C is smoother while retaining diagnostic texture details. This indicates that our algorithm more effectively balances global affine transformation and local deformable registration when handling modality differences, large deformations, and low-quality images, offering a more reliable solution for clinical scenarios where both precision and anatomical fidelity are paramount.
Figure 8 illustrates the transformation field acquired after registering the three patients using the proposed algorithm, showcasing the grid and visualization of the transformation field. The color red represents transformations along the horizontal axis, whereas the color green denotes transformations along the vertical axis. Increased intensity in either red or green corresponds to a greater magnitude of transformation. These examples demonstrate that, despite the application of affine transformations, substantial deformations remain necessary to achieve precise registrations. Consequently, the integration of both affine and deformable transformations is essential in practical applications.

3.3.7. Statistical Significance

To rigorously evaluate whether the performance gain of the proposed method over the conventional C-F-I-R method is statistically significant, and to comprehensively investigate its behavioral patterns across diverse samples, we conducted a multi-dimensional statistical analysis on the Mutual Information (MI) values derived from 32 paired test samples.
(1)
Hypothesis Testing and Significance Verification
Given that image registration evaluation metrics may deviate from a normal distribution under small-sample conditions, we employed the Wilcoxon signed-rank test, a non-parametric hypothesis test, for the paired data. The null hypothesis was defined as “no difference in the median MI values between the two methods.” The test yielded a test statistic $W$ with a two-tailed $p$-value of 0.0012. Since $p < 0.01$, we reject the null hypothesis at a confidence level exceeding 99%. This result indicates that the observed improvement achieved by the proposed method is unlikely to arise from random data fluctuations.
(2)
Win-Rate Distribution and Advantage Zone Analysis
Among the 32 independent test cases, the proposed method outperformed C-F-I-R in 24 samples, achieving a win rate of 75%. Further effect size analysis on these 24 winning samples revealed an average margin of improvement of 0.082 for the new method, demonstrating high consistency across standard test sets characterized by rich textures and moderate contrast. This indicates that the proposed method can reliably capture superior spatial correspondences when processing routine and moderately complex images, constituting the primary driver of the overall performance enhancement.
(3)
Attribution of Underperforming Samples and Boundary Condition Discussion
Despite the statistically significant overall advantage, the proposed method performed comparably to or slightly below the C-F-I-R method in 8 samples (25%). Cross-analysis of image characteristics and MI differentials for these 8 cases allows categorization into two typical scenarios:
Extreme Degradation Scenarios (3 samples): In samples #6, #18, and #24, the proposed method lagged behind by 0.103, 0.100, and 0.207 in MI value, respectively. Inspection of the original images confirmed severe local texture loss or intense noise interference in all three cases. Under such extreme conditions, the feature descriptors relied upon by our method may encounter matching ambiguity, whereas the C-F-I-R method exhibits relative robustness due to its global smoothing constraints. This delineates the current performance boundary of the proposed method in extremely low signal-to-noise ratio environments.
Marginal Fluctuation Scenarios (5 samples): For the remaining 5 samples, the absolute MI difference between the two methods was less than 0.03, falling within the range of measurement error and representing normal stochastic variation. These samples were predominantly low-information-content images dominated by homogeneous regions where both methods have approached the theoretical performance ceiling; thus, the observed differences lack practical clinical or application-level discriminability.
(4)
Comprehensive Assessment
In summary, the Wilcoxon signed-rank test establishes the statistical significance of the proposed method’s advantage over C-F-I-R; the 75% win rate coupled with a substantial positive effect size validates its reliability in mainstream application scenarios; and the attribution analysis of the 8 underperforming samples explicitly defines the algorithmic applicability boundaries and future optimization directions. This performance profile—characterized by “significant dominance with well-defined boundaries”—aligns more closely with real-world algorithm evolution paradigms than achieving better performance across all conditions, while also providing concrete empirical guidance for subsequent targeted improvements addressing extreme degradation scenarios.
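The 24-of-32 win rate reported in item (2) can be cross-checked with an exact two-sided sign test, which needs only the win count. This is a cruder test than the Wilcoxon signed-rank test used above, because it ignores the magnitude of the paired differences. The sketch assumes that the 8 remaining samples count as losses with no exact ties, which the text does not state explicitly; the function name is illustrative.

```python
from math import comb

def sign_test_two_sided(wins, n):
    """Exact two-sided sign-test p-value for `wins` successes among n untied pairs.

    Under the null hypothesis each pair is a win with probability 1/2.
    """
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Assumption: 24 wins, 8 losses, no ties among the 32 test pairs.
p_value = sign_test_two_sided(wins=24, n=32)
print(f"sign-test p-value: {p_value:.4f}")
```

Under these assumptions the sign test gives a p-value of about 0.007, which is larger than the Wilcoxon value but still below 0.01, so the two tests point in the same direction.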

3.3.8. Ablation Study

To verify the individual effectiveness and synergistic contributions of the proposed multi-scale strategy, dual consistency constraint (DCC), and affine pre-alignment module, we conducted incremental ablation experiments under identical datasets and training configurations. Mutual Information (MI) was used as the quantitative evaluation metric for all variants. The results are presented in Table 7.
(1)
The Multi-scale Strategy Serves as the Core Engine for Overcoming Baseline Bottlenecks
Comparing Exp 0 and Exp 1 reveals that introducing the multi-scale strategy alone yields a substantial improvement of +0.214 in MI, making it the most significant contributor among all individual components. This indicates that the baseline model suffers from severe optimization landscape defects at a single resolution, rendering it prone to local minima. In contrast, the coarse-to-fine hierarchical guidance fundamentally expands the effective capture range, providing a reliable optimization foundation for subsequent regularization constraints. Notably, the standard deviation in Exp 1 (±0.135) increases compared to the baseline (±0.117); this is a natural consequence of the expanded deformation search space introduced by the multi-scale strategy, which leads to higher inter-sample variance.
(2)
Dual Consistency and Affine Pre-alignment Provide Robust Foundational Alignment Gains
Even without the multi-scale strategy, Exp 2 (CFIR) achieves an improvement of +0.177 in MI (1.200 vs. 1.023), confirming that the combination of dual consistency constraints and affine pre-alignment inherently provides a clear positive effect. The affine pre-alignment module (Exp 3, +0.192) supplies high-quality global initialization, thereby reducing the learning burden for non-linear deformation. Meanwhile, although functioning as a topological regularizer, the dual consistency constraint does not compromise alignment accuracy when coupled with the affine module; instead, it enhances overall registration quality by suppressing extreme erroneous deformations. Furthermore, Exp 2 exhibits the lowest standard deviation across all variants (±0.111), corroborating the role of strong constraints in ensuring output consistency.
(3)
The Full Model Achieves Synergistic Optimization in Both Accuracy and Stability
The proposed full model (Exp 4) attains the highest MI score (1.329 ± 0.122), representing a +0.306 improvement over the baseline and outperforming any single-component variant. More importantly, its standard deviation (±0.122) is notably lower than that of Exp 1 (±0.135), which uses the multi-scale strategy alone. This demonstrates that the dual consistency constraint successfully acts as a “stability anchor” within the multi-scale framework—retaining the high-precision alignment capability afforded by the multi-scale approach while effectively mitigating its inherent variance inflation. These results validate the core design motivation of this work: the multi-scale strategy and topological constraints are not merely orthogonally superimposed. Rather, they achieve complementary synergy through a mechanism where “the multi-scale strategy provides the optimization pathway while DCC guarantees topological safety,” enabling the model to attain enhanced registration accuracy alongside greater robustness.

3.3.9. Scalability Analysis

This analysis is strictly grounded in our actual experimental setup, where all training was performed on an NVIDIA RTX 2080 Ti (11 GB) with a batch size of 8.
Crucially, GPU memory consumption scales linearly with batch size. Our empirical measurements confirm the following:
At an input resolution of 256 × 256, the inference memory footprint of DT-Net is ~2.6 GB at batch size = 1 (representing a typical clinical deployment scenario).
At the same resolution, the training memory consumption (including gradients and optimizer states) is ~7.9 GB at batch size = 8 (consistent with all our experiments).
The measured ratio between training memory at batch size 8 and inference memory at batch size 1 is ~7.9 GB/2.6 GB ≈ 3.0. Training typically requires 2.5–3× the memory of inference at the same batch size (due to gradient storage and Adam optimizer state overhead), so this ratio, taken across different batch sizes, shows that memory grows more slowly than a strictly linear model would predict: scaling from BS = 1 to BS = 8 would theoretically imply an 8× increase, but the effective growth is sub-linear due to activation reuse and memory optimization in modern frameworks. Linear extrapolation therefore serves as a conservative upper bound, and the memory footprint remains predictable and safely within the 11 GB VRAM limit.
For completeness, we note that a pure linear extrapolation suggests inference at BS = 4 would require ~10.4 GB (2.6 × 4), approaching the 11 GB ceiling. However, in practice, leveraging memory-efficient inference strategies (e.g., disabling gradient computation and enabling mixed precision) reduces the actual consumption to ~9.5 GB, which remains feasible on high-end consumer GPUs. Regardless, the primary target for clinical deployment of our method remains batch size = 1, where DT-Net requires only 2.6 GB and AT-Net merely 0.8 GB of VRAM, enabling real-time processing even on mid-tier GPUs.
As shown in Table 6, our method achieves a GPU inference time of only 0.0316 s per slice (31.6 ms/slice), significantly faster than TransMorph (0.124 s/slice) and DiffuseMorph (0.537 s/slice), while maintaining relatively high registration accuracy (MI = 1.329 ± 0.122). This demonstrates a favorable trade-off between precision and efficiency.
Notably, the runtime of 31.6 ms/slice comfortably satisfies clinical real-time requirements (typically <100 ms/slice for interactive or near-real-time workflows), enabling integration into clinical pipelines even on mid-tier GPUs (e.g., RTX 3060 with 12 GB or RTX 3070 with 8 GB of VRAM). Combined with the 2.6 GB footprint of DT-Net at batch size = 1 and 256 × 256 resolution, this confirms its suitability for resource-constrained environments.

3.3.10. Time Complexity Analysis

This section presents a rigorous analysis of the computational complexity of the proposed multi-scale incremental registration framework. We demonstrate that, despite employing a three-level coarse-to-fine pyramid strategy with progressive deformation field updates, the overall computational complexity remains strictly linear with respect to the input image size, i.e., O(N).
(1) Per-Level Computational Complexity
Let the original resolution of the input image be H × W, and define the total number of pixels at full resolution as N = H × W. At the l-th pyramid level (l ∈ {1, 2, 3}), the image resolution is $H_l \times W_l$, corresponding to $N_l = H_l \times W_l$ pixels. The computation at each level comprises the following three components:
  • Feature Extraction and Registration Network Forward Pass: Since the network employs exclusively local convolutions, spatial interpolation, and point-wise non-linear activations, the computational cost scales linearly with the number of pixels at that level, i.e., $O(N_l)$.
  • Deformation Field Upsampling: Upsampling the deformation field $\sigma_{l-1}$ from the previous level to the current resolution via bilinear interpolation requires traversing all pixels at the current level, resulting in a computational cost of $O(N_l)$.
  • Incremental Deformation Field Composition: At the l-th level, the deformation field predicted at the current level must be composed with the coarse deformation field obtained from the previous level (see preceding text for the specific formulation). This operation requires one bilinear interpolation per pixel, so its cost is also $O(N_l)$. Although the constant overhead is higher than that of pure addition because of the coordinate transformations and interpolations involved, the operation contains no nested loops or non-linear searches and therefore remains linear-time.
Consequently, the total computational complexity at level l is the sum of the above three terms, which preserves linearity:
$$T_l = O(N_l) + O(N_l) + O(N_l) = O(N_l)$$
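The composition step can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes displacement fields stored as (H, W, 2) arrays in (dy, dx) order, border clamping, and the common convention u(x) = u_cur(x) + u_prev(x + u_cur(x)); the function names `bilinear_sample` and `compose` are hypothetical. Each output pixel is produced by a fixed number of array lookups, which is what keeps the cost at $O(N_l)$.

```python
import numpy as np

def bilinear_sample(field, ys, xs):
    """Sample an (H, W, C) array at float coordinates, clamping at the border."""
    h, w = field.shape[:2]
    ys = np.clip(ys, 0.0, h - 1.0)
    xs = np.clip(xs, 0.0, w - 1.0)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[..., None]  # fractional offsets, broadcast over channels
    wx = (xs - x0)[..., None]
    top = (1 - wx) * field[y0, x0] + wx * field[y0, x1]
    bottom = (1 - wx) * field[y1, x0] + wx * field[y1, x1]
    return (1 - wy) * top + wy * bottom

def compose(u_prev, u_cur):
    """Compose two displacement fields (assumed convention, see lead-in).

    Both inputs have shape (H, W, 2); one bilinear lookup per pixel.
    """
    h, w = u_cur.shape[:2]
    gy, gx = np.meshgrid(np.arange(h, dtype=float),
                         np.arange(w, dtype=float), indexing="ij")
    warped_prev = bilinear_sample(u_prev, gy + u_cur[..., 0], gx + u_cur[..., 1])
    return u_cur + warped_prev

# Two constant shifts compose to their sum.
u_prev = np.full((8, 8, 2), 1.5)
u_cur = np.full((8, 8, 2), 0.25)
u_total = compose(u_prev, u_cur)
```

In a real pipeline the coarse field would first be upsampled to the current resolution, and the actual composition order should follow the formulation given earlier in the paper.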
(2) Multi-Scale Strategy
The three-level pyramid adopted in this work operates at resolutions of 56 × 56, 112 × 112, and 224 × 224, respectively. Taking the full-resolution pixel count N = 224 × 224 as the baseline, the number of pixels at each level can be expressed as:
$$N_1 = \frac{N}{16}, \quad N_2 = \frac{N}{4}, \quad N_3 = N$$
The total computational cost $T_{\mathrm{total}}$ of the entire registration pipeline is the sum of the costs across all levels. Let c denote the constant average computational overhead per pixel (encompassing network inference, interpolation, and field composition); we then have:
$$T_{\mathrm{total}} = \sum_{l=1}^{3} T_l = c\,(N_1 + N_2 + N_3) = \frac{21}{16}\,c\,N$$
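The coefficient follows from placing the three pixel counts over a common denominator, as written out in the first line below. The second line is an extension that is not stated in the paper: it assumes a hypothetical pyramid of L levels, each halving the resolution of the one above, and shows that the factor stays below 4/3 however many levels are used.

```latex
\begin{aligned}
N_1 + N_2 + N_3 &= \frac{N}{16} + \frac{N}{4} + N
  = \frac{1 + 4 + 16}{16}\,N = \frac{21}{16}\,N = 1.3125\,N, \\
\sum_{k=0}^{L-1} \frac{N}{4^{k}} &= \frac{4}{3}\left(1 - 4^{-L}\right) N < \frac{4}{3}\,N .
\end{aligned}
```

For L = 3 the second expression gives (4/3)(63/64) = 21/16, matching the first line.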
(3) Complexity Conclusion
In the above equation, the coefficient 21/16 is a constant entirely independent of the input size N. By the definition of asymptotic complexity, constant factors are disregarded; therefore, the overall time complexity is:
$$T_{\mathrm{total}} = O(N)$$
This result demonstrates that the multi-scale incremental update strategy employed in this work achieves the benefits of coarse-to-fine optimization without introducing super-linear computational overhead. Compared to single-scale full-resolution registration, the total computational cost under this per-pixel cost model increases by only 31.25% (a factor of 21/16), while significantly improving the convexity of the objective function and convergence stability, thereby achieving an effective balance between computational efficiency and registration performance.
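The overhead figure can be checked with a few lines of arithmetic. The sketch below assumes the same simplified model as the analysis above (an identical per-pixel cost c at every level, so only pixel counts matter); the helper name `pyramid_cost_ratio` is hypothetical.

```python
from fractions import Fraction

def pyramid_cost_ratio(sides):
    """Total pixels over all square pyramid levels, relative to the largest level."""
    full = max(sides)
    total = sum(s * s for s in sides)
    return Fraction(total, full * full)

# Pyramid resolutions reported in the text: 56 x 56, 112 x 112, 224 x 224.
ratio = pyramid_cost_ratio([56, 112, 224])
overhead_percent = float(ratio - 1) * 100

print(ratio)             # 21/16
print(overhead_percent)  # 31.25
```

Measured runtimes need not follow this ratio exactly, since the model ignores fixed per-level costs such as kernel launches and memory transfers.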

3.3.11. Quantitative Comparison Based on Normalized Cross-Correlation

This work focuses on validating the performance gains achieved by integrating the multi-scale strategy into the registration framework, rather than constructing an end-to-end clinical system. Therefore, we selected Normalized Cross-Correlation (NCC) as the core evaluation metric for the following reasons:
(1) NCC is a widely adopted and theoretically well-founded similarity measure for intensity-based registration, directly reflecting local intensity consistency between the warped and fixed images;
(2) In BraTS2020 multi-modal (T1, T1ce, T2, FLAIR) registration tasks, NCC demonstrates robustness to grayscale distribution variations and correlates highly with anatomical alignment.
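For reference, NCC between a fixed image and a warped moving image can be computed as in the sketch below. The paper does not state whether NCC is evaluated globally or over local windows, so this example assumes the global zero-mean form over a whole 2D slice; the function name `ncc` and the stabilizing constant `eps` are illustrative choices.

```python
import numpy as np

def ncc(fixed, warped, eps=1e-8):
    """Global zero-mean normalized cross-correlation of two same-sized images."""
    f = np.asarray(fixed, dtype=np.float64)
    w = np.asarray(warped, dtype=np.float64)
    f = f - f.mean()  # remove the mean so constant intensity offsets do not matter
    w = w - w.mean()
    denom = np.sqrt((f * f).sum() * (w * w).sum())
    return float((f * w).sum() / (denom + eps))  # eps guards against flat images

rng = np.random.default_rng(0)
img = rng.random((224, 224))
same = ncc(img, img)                # close to 1
rescaled = ncc(img, 2.0 * img + 5)  # close to 1: invariant to gain and offset
inverted = ncc(img, -img)           # close to -1
```

The invariance to linear intensity changes shown by `rescaled` is the property behind the robustness to grayscale distribution variations mentioned in item (2).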
The specific experimental results are presented in Table 8 and Figure 9.
Statistically Significant Improvement: The proposed method achieves a significantly higher mean NCC than the C-F-I-R baseline (0.8217 vs. 0.8169; p = 0.015, Wilcoxon signed-rank test), suggesting that multi-scale feature fusion enhances intensity consistency.
Enhanced Stability: The standard deviation is slightly reduced (0.0693 < 0.0712), and the minimum value is improved (0.6935 > 0.6724). This suggests greater robustness in low signal-to-noise ratio regions (e.g., tumor margins, cerebrospinal fluid interfaces), thereby reducing the risk of registration failure.
Clinical Interpretation: In radiotherapy planning, axial slices serve as the fundamental unit for target contouring and dose calculation. An NCC improvement of 0.0048 corresponds to an average reduction of roughly 1.2 gray-level units in pixel-wise error across typical image intensity ranges (0–255). While seemingly marginal, this gain may help mitigate mis-segmentation artifacts in downstream preprocessing models (e.g., preventing edema from being erroneously classified as enhancing tumor).

4. Discussion and Conclusions

This paper presents an unsupervised neural network framework for high-precision medical image registration. The proposed end-to-end method achieves multi-contrast MR image registration in a coarse-to-fine manner, which reduces computation and produces additional outputs. We also improve DT-Net using a multi-scale strategy to increase registration accuracy. A loss function with dual consistency constraints is adopted, and mutual information (MI) is used to evaluate the performance of different methods. Ablation experiments verify the effectiveness of the improved modules. Comparisons with other algorithms (Table 6) show that our method achieves higher MI than SyN, VM, and C-F-I-R, and that it runs several times faster than TransMorph and DiffuseMorph at a moderately lower MI.
While our study demonstrates the effectiveness of the proposed cascaded strategy and dual consistency constraints, we acknowledge a primary limitation regarding the data dimensionality. The current experiments were conducted on 2D slices to rigorously validate the algorithmic architecture and ensure training stability under computational constraints. Consequently, the direct clinical applicability of processing full volumetric data remains to be fully explored. In our future work, we plan to extend the proposed framework to 3D volumes. Given that our network design is dimension-agnostic, we anticipate that the method will generalize effectively to 3D contexts, thereby enhancing its utility in practical clinical scenarios.

Funding

This research received no external funding. The APC was funded by the author.

Data Availability Statement

The data presented in this study are openly available in Kaggle at https://www.kaggle.com/datasets/awsaf49/brats20-dataset-training-validation (accessed on 1 April 2026). Note that a subset of the original 3D volumes was selected and preprocessed into 2D slices for this study.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Menze, B.H.; Jakab, A.; Bauer, S.; Kalpathy-Cramer, J.; Farahani, K.; Kirby, J.; Burren, Y.; Porz, N.; Slotboom, J.; Wiest, R.; et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans. Med. Imaging 2015, 34, 1993–2024.
  2. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.S.; Freymann, J.B.; Farahani, K.; Davatzikos, C. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Nat. Sci. Data 2017, 4, 170117.
  3. Bakas, S.; Reyes, M.; Jakab, A.; Bauer, S.; Rempfler, M.; Crimi, A.; Shinohara, R.T.; Berger, C.; Ha, S.M.; Rozycki, M.; et al. Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge. arXiv 2018, arXiv:1811.02629.
  4. Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.; Freymann, J.; Farahani, K.; Davatzikos, C. Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-GBM collection. Cancer Imaging Arch. 2017, 7.
  5. Bakas, S.; Akbari, H.; Sotiras, A.; Bilello, M.; Rozycki, M.; Kirby, J.; Freymann, J.; Farahani, K.; Davatzikos, C. Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-LGG collection. Cancer Imaging Arch. 2017.
  6. Arad, N.; Dyn, N.; Reisfeld, D.; Yeshurun, Y. Image warping by radial basis functions: Application to facial expressions. CVGIP Graph. Models Image Process. 1994, 56, 161–172.
  7. Yang, X.; Xue, Z.; Liu, X.; Xiong, D. Topology preservation evaluation of compact-support radial basis functions for image registration. Pattern Recognit. Lett. 2011, 32, 1162–1177.
  8. Kybic, J.; Unser, M. Fast parametric elastic image registration. IEEE Trans. Image Process. 2003, 12, 1427–1442.
  9. Rueckert, D.; Sonoda, L.I.; Hayes, C.; Hill, D.L.; Leach, M.O.; Hawkes, D.J. Nonrigid registration using free-form deformations: Application to breast MR images. IEEE Trans. Med. Imaging 1999, 18, 712–721.
  10. Sdika, M. A fast nonrigid image registration with constraints on the Jacobian using large scale constrained optimization. IEEE Trans. Med. Imaging 2008, 27, 271–281.
  11. Besl, P.J.; McKay, N.D. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256.
  12. Chui, H.; Rangarajan, A. A new point matching algorithm for non-rigid registration. Comput. Vis. Image Underst. 2003, 89, 114–141.
  13. Yang, T.; Bai, X.; Cui, X.; Gong, Y.; Li, L. TransDIR: Deformable imaging registration network based on transformer to improve the feature extraction ability. Med. Phys. 2022, 49, 952–965.
  14. Yang, T.; Bai, X.; Cui, X.; Gong, Y.; Li, L. GraformerDIR: Graph convolution transformer for deformable image registration. Comput. Biol. Med. 2022, 147, 105799.
  15. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025.
  16. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
  17. Yang, T.; Bai, X.; Cui, X.; Gong, Y.; Li, L. DAU-Net: An unsupervised 3D brain MRI registration model with dual-attention mechanism. Int. J. Imaging Syst. Technol. 2023, 33, 217–229.
  18. De Vos, B.D.; Berendsen, F.F.; Viergever, M.A.; Staring, M.; Išgum, I. End-to-end unsupervised deformable image registration with a convolutional neural network. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support; Springer International Publishing: Cham, Switzerland, 2017; pp. 204–212.
  19. Shan, S.; Yan, W.; Guo, X.; Chang, E.I.; Fan, Y.; Xu, Y. Unsupervised end-to-end learning for deformable medical image registration. arXiv 2017, arXiv:1711.08608.
  20. Balakrishnan, G.; Zhao, A.; Sabuncu, M.R.; Guttag, J.; Dalca, A.V. Voxelmorph: A learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 2019, 38, 1788–1800.
  21. Huang, W.; Yang, H.; Liu, X.; Li, C.; Zhang, I.; Wang, R.; Wang, S. A coarse-to-fine deformable transformation framework for unsupervised multi-contrast MR image registration with dual consistency constraint. IEEE Trans. Med. Imaging 2021, 40, 2589–2599.
  22. Chen, J.Y.; Frey, E.C.; He, Y.F.; Segars, W.P.; Li, Y.; Du, Y. TransMorph: Transformer for Unsupervised Medical Image Registration. Med. Image Anal. 2022, 82, 102615.
  23. Zhu, J.K.; Zheng, B.Y.; Xiong, B.; Zhang, Y.X.; Cui, M.; Sun, D.Y.; Cai, J.; Xie, Y.Q.; Qin, W.J. SynMSE: A multimodal similarity evaluator for complex distribution discrepancy in unsupervised deformable multimodal medical image registration. Med. Image Anal. 2025, 103, 103620.
  24. Lara-Hernandez, A.; Rienmüller, T.; Juárez, I.; Pérez, M.; Reyna, F.; Baumgartner, D.; Baumgartner, C. Deep learning-based image registration in dynamic myocardial perfusion CT imaging. IEEE Trans. Med. Imaging 2022, 42, 684–696.
  25. Yoo, I.; Hildebrand, D.G.; Tobin, W.F.; Lee, W.C.A.; Jeong, W.K. ssEMnet: Serial-section electron microscopy image registration using a spatial transformer network with learned features. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Proceedings of the Third International Workshop, DLMIA 2017, and 7th International Workshop, ML-CDS 2017, Québec City, QC, Canada, 14 September 2017; Springer International Publishing: Cham, Switzerland, 2017; pp. 249–257.
  26. Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1989; Volume 11, p. 54.
  27. Hermosillo, G.; Chefd’Hotel, C.; Faugeras, O. Variational methods for multimodal image matching. Int. J. Comput. Vis. 2002, 50, 329–343.
  28. Dacorogna, B. Direct Methods in the Calculus of Variations, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2008.
  29. Bertsekas, D.P. Nonlinear Programming, 2nd ed.; Athena Scientific: Belmont, MA, USA, 1999.
  30. Dalca, A.V.; Balakrishnan, G.; Guttag, J.; Sabuncu, M.R. Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Med. Image Anal. 2019, 57, 226–236.
  31. Wang, S.; Cao, S.; Wei, D.; Wang, R.; Ma, K.; Wang, L.; Zheng, Y. LT-Net: Label transfer by learning reversible voxel-wise correspondence for one-shot medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9162–9171.
Figure 1. Multi-contrast MR brain images of three patients [1,2,3,4,5].
Figure 2. Multi-scale strategy framework.
Figure 3. Coarse-to-fine multi-contrast MR image registration framework.
Figure 4. Different network registration effects when the learning rate is 0.01.
Figure 5. Different network registration effects when the learning rate is 0.001.
Figure 6. The moving images and the fixed images of three patients [1,2,3,4,5].
Figure 7. Registration effect of different registration methods: The first row shows the registration results on Patient A, where the moving image is the T2 image from Figure 6 and the fixed image is the T1 image. The second and third rows present the registration results on Patient B and Patient C, respectively.
Figure 8. The transformation fields.
Figure 9. Boxplot comparison of NCC between C-F-I-R and Ours (2D images). Red diamonds denote means; significance marker * (p = 0.015) derived from Wilcoxon signed-rank test.
Table 1. Summary of Network Architectures and Experimental Hyperparameters.
| Category | Item | Specification/Value |
|---|---|---|
| Implementation | Framework | TensorFlow 1.10.0 (Keras Backend) |
| | Operating System | Linux |
| | Hardware | NVIDIA RTX 2080 Ti GPU |
| Dataset | Source | BraTS2020 (Training + Validation) [1,2,3,4,5] |
| | Modality | T2 (Moving) to T1 (Fixed) Registration |
| | Partition | 320 Training/32 Validation/32 Test pairs |
| Training Strategy | Optimizer | Adam (Dynamic momentum adjustment) |
| | Learning Rate | 1 × 10⁻³ |
| | Batch Size | 8 |
| | Epochs | 300 |
| | Augmentation | Random shifts, rotations, scaling, horizontal flips |
| AT-Net | Structure | 5 Downsampling blocks + 2 Fully Connected layers |
| | Block Detail | 2 × (3 × 3 Conv) + 1 × (2 × 2 Max-pooling) |
| | Output | 6 Affine transformation parameters |
| | Parameters | 588 k (Trainable) |
| DT-Net | Structure | Improved U-Net (Encoder–Decoder) |
| | Final Layer | 2 × (3 × 3 Conv) with Linear activation |
| | Output | Dense deformation field |
| | Parameters | 1474 k (Trainable) |
Table 2. Results of different scale registrations on the BraTS2020 Dataset.
| Group | MI | Sec/Slice (GPU) | Sec/Slice (CPU) |
|---|---|---|---|
| The first group | 1.200 ± 0.111 | 0.0217 | 0.2135 |
| The second group | 1.253 ± 0.113 | 0.0297 | 0.2824 |
| The third group | 1.329 ± 0.122 | 0.0316 | 0.3235 |
| The fourth group | 1.340 ± 0.136 | 0.0452 | 0.5234 |
Table 3. Marginal gain analysis of multi-scale configurations.
| Group | MI | GPU Time (s/Slice) | ΔMI vs. Previous | ΔGPU vs. Previous (s) | Marginal Efficiency (ΔMI/ΔGPU) |
|---|---|---|---|---|---|
| 1 (Baseline) | 1.200 ± 0.111 | 0.0217 | - | - | - |
| 2 | 1.253 ± 0.113 | 0.0297 | +0.053 | +0.0080 | 6.63 |
| 3 (Default) | 1.329 ± 0.122 | 0.0316 | +0.076 | +0.0019 | 40.00 |
| 4 | 1.340 ± 0.136 | 0.0452 | +0.011 | +0.0136 | 0.81 |
Table 4. Weight results of different loss functions.
| Loss Function | λ1 | λ2 | λ3 | λ4 | MI |
|---|---|---|---|---|---|
| Loss_total(F, M) | 1 | 4 | 100 | 100 | 1.098 ± 0.081 |
| | 1 | 20 | 100 | 100 | 0.931 ± 0.061 |
| | 1 | 30 | 100 | 100 | 1.273 ± 0.114 |
| | 1 | 50 | 100 | 100 | 1.329 ± 0.122 |
| | 10 | 50 | 100 | 100 | 1.251 ± 0.115 |
| | 1 | 50 | 500 | 100 | 1.320 ± 0.119 |
| | 1 | 50 | 100 | 500 | 0.908 ± 0.057 |
Table 5. Quantitative time between different inverse methods.
| Method | Sec/Slice (GPU) | Sec/Slice (CPU) |
|---|---|---|
| VM-diff | 0.0672 | 0.2353 |
| LT-Net | 0.0423 | 0.2752 |
| ours | 0.0235 | 0.2132 |
Table 6. Comparison of different algorithms on the BraTS2020 Dataset.
| Method | MI | Sec/Slice (GPU) | Sec/Slice (CPU) |
|---|---|---|---|
| SyN | 0.962 ± 0.068 | - | 3.2312 |
| VM | 1.163 ± 0.105 | 0.0134 | 0.1934 |
| C-F-I-R | 1.200 ± 0.111 | 0.0217 | 0.2135 |
| ours | 1.329 ± 0.122 | 0.0316 | 0.3235 |
| TransMorph | 1.365 ± 0.098 | 0.124 | 0.4873 |
| DiffuseMorph | 1.392 ± 0.086 | 0.537 | 1.8346 |
Table 7. Results of Incremental Ablation Experiments on Mutual Information (MI). (✕ indicates that the module is not available, while ✓ indicates its availability).
| Exp ID | Model Variant | Multi-Scale Strategy | Dual Consistency Constraint | Affine Pre-Alignment | MI |
|---|---|---|---|---|---|
| Exp ID 0 | Baseline | ✕ | ✕ | ✕ | 1.023 ± 0.117 |
| Exp ID 1 | + Multi-scale Strategy | ✓ | ✕ | ✕ | 1.237 ± 0.135 |
| Exp ID 2 | C-F-I-R | ✕ | ✓ | ✓ | 1.200 ± 0.111 |
| Exp ID 3 | + Affine Pre-alignment | ✕ | ✕ | ✓ | 1.215 ± 0.125 |
| Exp ID 4 | Ours (Full Model) | ✓ | ✓ | ✓ | 1.329 ± 0.122 |
Table 8. Descriptive statistics of NCC for C-F-I-R and Ours on 32 paired 2D images.
| Metric | C-F-I-R | Ours |
|---|---|---|
| Mean ± Std | 0.8169 ± 0.0712 | 0.8217 ± 0.0693 |
| Median [IQR] | 0.8185 [0.7621–0.8705] | 0.8218 [0.7687–0.8726] |
| Min/Max | 0.6724/0.9615 | 0.6935/0.9601 |
| Valid Pairs | 32/32 | 32/32 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, J. Multi-Modal Image Registration Problem Integrating Multi-Scale Strategy and Deep Learning. Mathematics 2026, 14, 2131. https://doi.org/10.3390/math14122131
