1. Introduction
Cardiovascular disease remains one of the leading causes of mortality worldwide. Early diagnosis and precise treatment are essential for improving patient survival and quality of life. Traditional diagnostic approaches largely rely on clinical symptoms, physical examination, and basic biochemical markers. However, these methods often lack the sensitivity and specificity required to detect early-stage pathological changes [1].
With the rapid advancement of imaging technologies, cardiac imaging has become a cornerstone in the diagnosis and assessment of cardiovascular diseases. It provides clinicians with detailed anatomical and functional information about the heart, facilitating comprehensive evaluation of hemodynamics, coronary artery stenosis, myocardial wall thickness and motion, as well as myocardial ischemia [2]. In particular, cardiac imaging plays a pivotal role in the diagnosis and management of conditions such as coronary artery disease, heart failure, valvular heart disease, and congenital heart anomalies, offering high-resolution visual data to support individualized clinical decision-making.
In recent years, the development of imaging techniques such as CT, MRI (Magnetic Resonance Imaging), and ultrasound has provided strong support for cardiac function assessment, cardiovascular interventional planning, and postoperative follow-up. Cardiac CT, with its sub-millimeter spatial resolution, offers a critical basis for the early diagnosis of atherosclerosis. In contrast, echocardiography, with its advantage of real-time dynamic imaging, has become the preferred modality for evaluating ventricular synchrony and hemodynamics [3]. To maximize the diagnostic value of these two modalities, it is essential to establish an accurate cross-modality image registration framework. By spatially aligning the anatomical precision of CT with the functional kinetic characteristics derived from ultrasound, this technique effectively overcomes the inherent limitations of single-modality imaging. Such multimodal registration enables fused imaging for preoperative 3D surgical planning and facilitates real-time intraoperative comparison between ultrasound and preoperative CT models. This significantly improves the precision of complex procedures such as mitral valve repair and left atrial appendage occlusion. With the continuous improvement of multimodal fusion systems, cardiac imaging is evolving from a diagnostic adjunct into a core foundation for individualized therapeutic guidance.
In summary, cardiac imaging plays an indispensable role in early screening, quantitative assessment, and therapeutic planning for cardiovascular diseases. It provides robust technological support for precision medicine and personalized treatment strategies. As imaging technologies continue to evolve, cardiac imaging is poised to play an increasingly prominent role in disease prevention, health management, and clinical decision-making.
This review aims to systematically summarize recent advances in image registration between CT and echocardiography, with a particular focus on the technical principles and clinical applications of both traditional registration frameworks and deep learning-based approaches. By comparatively analyzing the two paradigms in terms of feature extraction accuracy, registration efficiency, and clinical performance, this study seeks to elucidate current methodological bottlenecks and propose directions for future optimization.
1.1. Traditional Registration Framework
Image registration refers to the process of spatially aligning two or more medical images, with the goal of eliminating geometric discrepancies between them and ensuring consistent spatial positioning and structural information within a common coordinate system [4].
As Figure 1 shows, image registration seeks, through optimization algorithms, the spatial transformation that maps one image onto another so that the two achieve the greatest possible consistency in anatomical structure or functional information. Depending on the nature of the transformation applied, registration techniques can be broadly categorized into rigid, affine, and non-rigid methods [5]. Rigid registration accounts for rotation and translation, while affine registration further incorporates scaling and shearing transformations. Non-rigid registration, on the other hand, allows for more complex local deformations, making it particularly suitable for aligning anatomical structures that may change shape over time or vary across individuals.
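The difference in degrees of freedom between these transformation families can be made concrete with a small sketch in homogeneous coordinates (NumPy; the point coordinates and parameter values are illustrative):

```python
import numpy as np

def rigid_2d(theta, tx, ty):
    """Rigid transform in homogeneous coordinates: rotation + translation (3 DOF)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0,  0,  1]])

def affine_2d(theta, tx, ty, sx, sy, shear):
    """Affine transform: rigid part plus anisotropic scaling and shear (6 DOF)."""
    S = np.array([[sx, shear, 0],
                  [0,  sy,    0],
                  [0,  0,     1]])
    return rigid_2d(theta, tx, ty) @ S

p = np.array([10.0, 5.0, 1.0])                    # a landmark in homogeneous form
q = rigid_2d(np.pi / 2, 0.0, 0.0) @ p             # pure rotation: length preserved
r = affine_2d(0.0, 0.0, 0.0, 2.0, 1.0, 0.0) @ p   # scaling: length changes
assert np.isclose(np.linalg.norm(q[:2]), np.linalg.norm(p[:2]))
assert not np.isclose(np.linalg.norm(r[:2]), np.linalg.norm(p[:2]))
```

Non-rigid registration replaces the single global matrix with a dense deformation field, which is why it can accommodate local shape changes of the myocardium that no affine matrix can capture.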
Moreover, medical image registration techniques can be classified into feature-based and intensity-based approaches [6]. Feature-based methods rely on the extraction and alignment of salient anatomical landmarks or contours, whereas intensity-based methods directly utilize voxel intensity values and employ similarity metrics, such as mutual information or cross-correlation, to guide the alignment process.
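As a concrete illustration of an intensity-based similarity metric, the following NumPy sketch computes histogram-based mutual information between two equally sized images; the bin count and the test images are arbitrary choices made for this example:

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Histogram-based mutual information between two images of equal shape."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)       # marginal of img_b
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = 1.0 - a                 # deterministically (inversely) related: high MI
c = rng.random((64, 64))    # statistically independent: MI near zero
assert mutual_information(a, b) > mutual_information(a, c)
```

Because mutual information only requires a statistical dependence between intensities, not equal values, it tolerates the very different gray-level appearance of CT and ultrasound, which is why it is the standard choice for multimodal intensity-based registration.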
1.2. Deep Learning-Based Registration Framework
The deep learning-based registration of cardiac CT and ultrasound images typically involves three key stages: feature extraction, spatial transformation, and alignment optimization. V. Ajantha Devi investigated a CNN-based approach for cardiac CT–ultrasound multimodal image registration, targeting the challenge of achieving anatomically precise alignment to integrate complementary structural and functional information for cardiovascular diagnosis [7]. As illustrated in Figure 2, the CT–US registration pipeline is adapted from the original study, and its major components follow the authors’ design. A dual-branch CNN encoder is employed to extract high-resolution anatomical features from CT images, as well as dynamic functional cues from ultrasound. The workflow includes preprocessing of 3D CT and 2D/3D ultrasound inputs, dual-branch feature extraction, attention-based feature fusion, and spatial-transformer-network (STN)-driven registration. Optimization is guided by an NCC–MSE hybrid loss, ensuring alignment accuracy and producing the final deformed CT output.
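An NCC–MSE hybrid loss of the kind described above can be sketched as follows. This is one plausible NumPy formulation for illustration only; the weighting `alpha` and the global (rather than windowed) NCC are assumptions, not values taken from the cited study:

```python
import numpy as np

def ncc(fixed, moving, eps=1e-8):
    """Global normalized cross-correlation, in [-1, 1]."""
    f = fixed - fixed.mean()
    m = moving - moving.mean()
    return float((f * m).sum() / (np.sqrt((f ** 2).sum() * (m ** 2).sum()) + eps))

def hybrid_loss(fixed, warped, alpha=0.5):
    """Hybrid registration loss: alpha trades off (1 - NCC), which rewards
    structural correlation, against MSE, which penalizes intensity error.
    alpha = 0.5 is an illustrative choice, not taken from the cited study."""
    mse = float(((fixed - warped) ** 2).mean())
    return alpha * (1.0 - ncc(fixed, warped)) + (1.0 - alpha) * mse

rng = np.random.default_rng(1)
img = rng.random((32, 32))
assert hybrid_loss(img, img) < 1e-6              # perfect alignment: near-zero loss
assert hybrid_loss(img, img) < hybrid_loss(img, img[::-1])
```

Combining a correlation term with a squared-error term lets the optimizer reward structural agreement even when absolute intensities differ across modalities, while the MSE component keeps the deformed output close to the target in regions where intensities are comparable.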
Chen et al. proposed the use of a cross-modal attention mechanism to compute a similarity matrix between the two modalities in the feature space, thereby generating an initial deformation field. A 3D U-Net architecture is then employed to fuse multi-scale features and enhance local detail alignment [8]. In the preprocessing stage, CT images are typically thresholded to remove bony artifacts, and speckle noise in the ultrasound images is suppressed. An initial coarse alignment is often achieved through elastic registration, which helps reduce the optimization burden on subsequent deep learning models.
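These two preprocessing steps can be sketched in NumPy; the 300 HU bone cutoff and the 3x3 median window are common illustrative choices, not values prescribed by the cited work:

```python
import numpy as np

def clip_bone(ct_hu, bone_threshold=300.0):
    """Suppress bright bony structures in a CT slice (values in Hounsfield
    units). The 300 HU cutoff is a rough rule of thumb, not a universal one."""
    out = ct_hu.copy()
    out[out > bone_threshold] = bone_threshold
    return out

def median3x3(img):
    """3x3 median filter (edge-replicated padding) to damp ultrasound speckle."""
    padded = np.pad(img, 1, mode="edge")
    shifts = [padded[r:r + img.shape[0], c:c + img.shape[1]]
              for r in range(3) for c in range(3)]
    return np.median(np.stack(shifts), axis=0)

ct = np.array([[100.0, 1200.0], [-50.0, 400.0]])   # toy CT slice in HU
us = np.zeros((5, 5)); us[2, 2] = 1.0              # a lone speckle spike
assert clip_bone(ct).max() == 300.0                # bone voxels clipped
assert median3x3(us)[2, 2] == 0.0                  # spike removed by the median
```

In practice, speckle reduction is often done with more specialized filters (e.g., anisotropic diffusion), but the median filter conveys the idea: isolated bright outliers are replaced by the local neighborhood consensus.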
In terms of network architecture, most state-of-the-art methods adopt a cascaded framework. Lei et al. developed the Deformable Vision Transformer [9], which leverages self-attention mechanisms to capture temporal consistency across cardiac cycles. This model incorporates a cycle-consistency loss to constrain the deformation trajectories between systolic and diastolic phases, ensuring temporal coherence.
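The cycle-consistency idea can be illustrated with a deliberately simplified NumPy sketch: under a small-displacement (linearized) approximation, the forward and backward displacement fields of a systole–diastole cycle should cancel. A faithful implementation would resample the backward field at the forward-warped positions; that step is omitted here for brevity:

```python
import numpy as np

def cycle_consistency_loss(disp_fwd, disp_bwd):
    """Linearized cycle-consistency penalty: for small displacements,
    composing systole->diastole and diastole->systole fields should be
    (near) the identity, i.e. the two displacement fields cancel."""
    residual = disp_fwd + disp_bwd
    return float((residual ** 2).mean())

# Opposite fields cancel exactly; identical fields leave a residual.
grid = 0.1 * np.stack(np.meshgrid(np.arange(8), np.arange(8), indexing="ij"))
assert cycle_consistency_loss(grid, -grid) == 0.0
assert cycle_consistency_loss(grid, grid) > 0.0
```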
To address the substantial modality gap between CT and ultrasound, some studies have introduced adversarial training strategies. For example, Yao’s group designed a modality-invariant generator that utilizes a gradient reversal layer to suppress domain-specific features [10]. Their method achieved a Dice similarity coefficient of 0.89 ± 0.03 on the MICCAI public dataset, representing a 12% improvement over traditional approaches.
In the post-processing stage, biomechanical constraints are often incorporated. Finite element analysis is employed to validate the physiological plausibility of the deformation field, helping to prevent anatomical distortions caused by overfitting.
2. Method
2.1. Search Strategy
To conform to PRISMA guidelines, a structured and transparent search strategy was implemented. A comprehensive literature search was conducted across four major databases—PubMed, IEEE Xplore, Scopus, and Google Scholar. The search strategy combined controlled vocabulary and free-text terms related to cardiac imaging and multimodal registration. Keywords included: “cardiac CT,” “ultrasound,” “image registration,” “multimodal registration,” “deep learning,” “intensity-based registration,” and “feature-based registration.” Boolean operators (AND/OR) were used to systematically combine search terms and expand the coverage of relevant studies.
In the identification phase, all retrieved records were imported into a reference management system, and duplicates were removed. During the screening phase, titles and abstracts were evaluated against predefined inclusion and exclusion criteria. Eligible studies were those (1) published within the last twenty years, (2) peer-reviewed journal articles or international conference papers, and (3) directly addressing cardiac CT–ultrasound image registration. Studies focusing on non-cardiac or unrelated registration tasks, non-human data, or lacking methodological detail were excluded. Full-text assessment was subsequently performed to confirm eligibility. The final set of studies included in the review represents those meeting all criteria for relevance, methodological adequacy, and scientific quality.
2.2. Selection Criteria
We applied strict inclusion criteria to ensure that only studies directly relevant to the registration of cardiac CT and ultrasound images were considered. Specifically, the following categories of literature were excluded:
1. Studies focusing solely on single-modality image registration, such as those involving only CT or only ultrasound image alignment.
2. Studies in which the target anatomy was not the heart, including those addressing image registration of other anatomical regions such as the brain, liver, or other non-cardiac structures.
2.3. Screening Process
During the literature screening process, all retrieved records were first imported into a reference management tool to remove duplicate entries. The remaining studies underwent an initial screening based on titles and abstracts to identify potentially relevant works related to cardiac CT and ultrasound image registration. Subsequently, full-text articles were assessed for eligibility according to predefined inclusion and exclusion criteria, focusing on methodological rigor, relevance to multimodal image registration, and availability of quantitative results. Studies with insufficient methodological detail, non-cardiac focus, or irrelevant imaging modalities were excluded. Finally, 50 representative studies were included in the qualitative synthesis based on their methodological innovation, research quality, and potential clinical applicability.
4. Discussion
This review explores advanced techniques in cross-modal registration between cardiac CT and ultrasound images. Through a comparative analysis with traditional methods, we conclude that deep learning-based registration approaches demonstrate significant advantages in both accuracy and computational efficiency.
Supervised learning methods leverage annotated data (e.g., ground-truth deformation fields) to directly optimize registration networks, achieving high precision in aligning local anatomical features, such as coronary arteries and myocardial boundaries. However, the heavy reliance on large-scale, high-quality labeled datasets presents a major bottleneck, as the annotation process in medical imaging is both costly and dependent on expert knowledge, thus limiting clinical scalability.
In contrast, unsupervised methods optimize registration by maximizing image similarity metrics (e.g., mutual information, local cross-correlation), eliminating the need for manual annotations. These methods are particularly suitable for ultrasound images, which often suffer from high noise and low contrast. Nevertheless, such approaches are prone to local minima during optimization and may fail in scenarios involving significant cardiac motion or deformation.
Weakly supervised learning offers a compromise by introducing sparse annotations—such as anatomical landmarks or segmentation masks—to guide the registration process. This reduces annotation costs while enhancing anatomical plausibility. However, the performance of these methods is highly sensitive to the quality and spatial distribution of the weak labels. Furthermore, their adaptability across modalities remains an open research question.
Traditional medical image registration techniques include feature-based methods (e.g., SIFT, SURF), which establish spatial correspondences through manually engineered features such as anatomical landmarks or edge contours. These approaches depend heavily on the quality of the extracted features and can be negatively affected by ultrasound speckle noise or partial volume effects in CT. Intensity-based methods [42] (e.g., mutual information, the Demons algorithm) directly exploit grayscale information to compute similarity measures. While suitable for multimodal registration, these methods often suffer from low computational efficiency and struggle with large deformations due to their iterative optimization process.
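Feature-based registration ultimately reduces to estimating a transform from matched correspondences. As a self-contained illustration, the following NumPy sketch solves the closed-form least-squares (Kabsch) problem for a rigid 2D transform from paired landmarks; the landmark coordinates are synthetic:

```python
import numpy as np

def rigid_from_landmarks(src, dst):
    """Closed-form least-squares rigid fit (Kabsch): find R, t minimizing
    sum ||R @ src_i + t - dst_i||^2 over matched landmark pairs (N x 2)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 2x2 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against a reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
src = np.random.default_rng(2).random((6, 2)) * 50.0   # synthetic landmarks
dst = src @ R_true.T + np.array([4.0, -7.0])
R, t = rigid_from_landmarks(src, dst)
assert np.allclose(src @ R.T + t, dst)          # recovered transform refits data
```

The quality of such a fit is only as good as the landmark correspondences themselves, which is exactly why speckle noise and partial volume effects, by corrupting feature detection, degrade feature-based registration.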
Subsequently, we compare several representative studies that utilize the same dataset. Ameneh et al. [43], Zhang et al. [44], Hering et al. [45], Wang et al. [46], and Chang et al. [47] all employed the ACDC dataset for network training and evaluation in the context of cardiac image registration. Among them, Ameneh et al. adopted a supervised learning approach; Wang et al. and Chang et al. utilized unsupervised methods; while Zhang et al. and Hering et al. implemented weakly supervised frameworks. For reference, the performance of the traditional registration methods SyN and FFD, proposed by Avants et al. [48] and Modat et al. [49], respectively, is also reported on the ACDC dataset in terms of Dice similarity coefficient.
According to the results summarized in Table 6, deep learning-based registration methods outperform traditional techniques in accuracy, with the model by Ameneh et al. achieving the best overall performance. Deep learning approaches also exhibit significantly faster processing speeds, since GPU acceleration allows them to estimate transformations directly rather than iteratively, giving them a clear advantage in computational cost and efficiency over traditional registration techniques.
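The Dice similarity coefficient used throughout these comparisons measures the overlap between a warped segmentation and its reference; a minimal NumPy version with toy masks:

```python
import numpy as np

def dice(mask_a, mask_b, eps=1e-8):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|), in [0, 1]."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    return float(2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum() + eps))

a = np.zeros((10, 10), dtype=bool); a[2:8, 2:8] = True   # toy reference mask
b = np.zeros((10, 10), dtype=bool); b[3:9, 3:9] = True   # toy warped mask
assert dice(a, a) > 0.999          # identical masks: Dice close to 1
assert dice(a, ~a) == 0.0          # disjoint masks: Dice = 0
```

Because Dice summarizes overlap into a single number, it says nothing about where residual misalignment occurs, a limitation the discussion below returns to when contrasting similarity metrics with clinically meaningful evaluation.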
Despite the remarkable progress of deep learning in cardiac image registration, its translation into clinical practice remains limited. A major barrier lies in the intrinsic cross-modal discrepancies between cardiac CT and ultrasound: CT provides static, high-resolution anatomical reference, whereas ultrasound captures dynamic, operator-dependent, and often noisy sequences. These differences hinder robust feature correspondence under realistic clinical conditions. In catheterization laboratories and intraoperative settings, for instance, the constant cardiac motion, respiratory influence, and probe-induced deformation challenge the stability of learned registration models. Moreover, most existing approaches are validated on retrospective or single-center datasets that do not reflect the diversity of patient anatomies, imaging protocols, or vendor-specific system characteristics. Evaluation is also largely confined to theoretical similarity metrics—such as Dice coefficient or landmark error—that fail to reflect clinical relevance, particularly regarding navigation accuracy and motion tracking.
To advance clinical applicability, future research should emphasize frameworks that adapt to real-world variability and integrate seamlessly into clinical workflows. Semi-supervised and unsupervised learning strategies, coupled with generative models (e.g., diffusion-based data synthesis), could expand multimodal training data and enhance model generalization to unseen clinical conditions. Multi-task learning architectures that jointly perform registration, segmentation, and motion estimation would enable consistent feature representation for both anatomical alignment and functional assessment during interventional procedures. Embedding biomechanical or physiological priors—such as myocardial strain constraints—into deep architectures may improve prediction stability and interpretability for clinicians. Furthermore, clinically driven evaluation standards that quantify registration benefits in terms of navigation precision, interventional safety, and diagnostic outcomes are essential to replace purely mathematical metrics. Finally, the establishment of large-scale, multi-center datasets and standardized acquisition protocols will be critical to validate model robustness and promote the routine adoption of deep learning–based cardiac multimodal registration in clinical environments.