Review

Foundation Models for Volumetric Medical Imaging: Opportunities, Challenges, and Future Directions

1 Department of Computer Science, University of Calgary, Calgary, AB T2N 1N4, Canada
2 Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN 37215, USA
3 Department of Radiology, Mayo Clinic, Rochester, MN 55905, USA
4 MedAiConsult LLC, Cleveland, OH 44120, USA
5 Department of Radiology, Universidade Federal de São Paulo (UNIFESP), São Paulo, SP 04021-001, Brazil
6 Department of Computer Science, Vanderbilt University, Nashville, TN 37215, USA
* Author to whom correspondence should be addressed.
Electronics 2026, 15(6), 1245; https://doi.org/10.3390/electronics15061245
Submission received: 31 January 2026 / Revised: 2 March 2026 / Accepted: 12 March 2026 / Published: 17 March 2026

Abstract

Foundation models, large-scale pretrained models capable of generalizing across diverse tasks, have significantly advanced the field of medical image analysis. While most early applications focused on 2D modalities, the unique challenges and opportunities associated with volumetric medical imaging have recently attracted growing interest. This study provides a comprehensive overview of the current landscape of foundation models tailored for volumetric medical image analysis, with a focus on CT, MRI, and PET imaging. We examine key components of these models, including 3D architectures, training strategies, and supported modalities. In addition, we highlight their contributions to major clinical tasks such as classification and prediction, segmentation, image registration, quality enhancement, and visual question answering. Critical challenges of these models, including high computational cost, limited and less diverse 3D datasets, and domain adaptation, are discussed alongside promising solutions and future research directions. By synthesizing recent advances in volumetric foundation models and outlining key technical and clinical challenges, this review provides a thorough roadmap toward the development of scalable, generalizable, and clinically applicable AI systems for volumetric medical images.

1. Introduction

Medical imaging plays a central role in modern healthcare, supporting disease diagnosis, treatment planning, and prognosis across a wide spectrum of clinical applications. While two-dimensional (2D) imaging modalities, such as X-ray and microscopy images, remain widely used, many critical diagnostic tasks rely on volumetric, three-dimensional (3D) imaging techniques, including computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET). These modalities capture rich anatomical and functional information across multiple 2D slices, offering a 3D view of patient anatomy. However, analyzing such data remains highly challenging due to their large volume, heterogeneity, and complexity. Although traditional deep learning approaches, such as convolutional neural networks (CNNs) and the U-Net family, have demonstrated promising performance in 3D segmentation and classification tasks, they generally require large-scale, annotated datasets that are expensive and time-consuming to obtain. In addition, due to the inherent heterogeneity of these data, models trained on a single cohort often fail to generalize across institutions, scanners, or populations, limiting their clinical utility.
In recent years, Foundation Models (FMs) have become a transformative paradigm in artificial intelligence [1]. Initially popularized in natural language processing (NLP) with large language models (LLMs), the concept of training massive models on extensive, diverse data followed by task-specific adaptation has rapidly expanded into computer vision and multimodal learning. The main idea is that large-scale pretraining enables models to learn generalizable representations, facilitating effective transfer to downstream tasks with limited labeled data or enabling zero-shot generalization [2]. In medical imaging, FMs aim to overcome long-standing barriers such as data scarcity, high annotation costs, and poor cross-domain generalization. Segment Anything Model (SAM) [3], MedSAM [4], and emerging modality-specific encoders have highlighted the power of pretrained backbones in enabling rapid adaptation to new tasks and institutions without compromising accuracy [5,6].
Despite these advances, much of the current literature and practice in medical foundation models remains skewed toward 2D imaging and natural image pretraining. Many models leverage ImageNet [7] or similar datasets for the initialization of the model weights, with extensions to medical data primarily in two-dimensional form, such as chest X-rays, dermatology images, and pathology slides. However, the majority of radiology practice and some of the most diagnostically important modalities are inherently volumetric. CT scans, for example, contain hundreds of slices per patient study, capturing fine-grained structural details at submillimeter resolution. MRI provides complementary volumetric information across multiple contrasts, and PET integrates functional metabolic activity in 3D space. Although these volumetric datasets offer unique opportunities for precision medicine, they present distinct computational and methodological challenges that differ fundamentally from 2D images.
Developing FMs for volumetric medical image analysis requires addressing several key challenges. First, volumetric data are significantly larger in size, raising storage and computational bottlenecks during pretraining, fine-tuning, and inference. Training directly on full 3D volumes demands specialized architectures, such as 3D CNNs, Vision Transformers (ViTs) extended to 3D, or hybrid slice–volume approaches, and efficient sampling strategies. Second, annotation of 3D volumes is even more resource-intensive than for 2D images, often requiring voxel-level segmentation across hundreds of slices per scan. This underscores the appeal of Self-Supervised Learning (SSL) strategies, where models can extract meaningful representations from unlabeled volumetric data by engaging in proxy tasks, including masked volume modeling, contrastive representation learning, or cross-modal alignment. Third, domain heterogeneity across scanners, protocols, and patient populations is amplified in 3D imaging, necessitating robust strategies for domain adaptation and generalization. Finally, the interpretability and clinical reliability of 3D FMs remain unresolved questions, becoming even more pressing as these models scale to billions of parameters.
Review of Existing Surveys: Several valuable review articles have examined FMs in medical imaging, covering topics such as fairness benchmarking [8], multimodal and vision–language models [9], early taxonomies [10], SAM-based segmentation adaptations [11], and broad overviews spanning modalities and omics [12]. Table 1 summarizes the key contributions of these studies. However, most of these surveys focus on 2D imaging tasks and draw largely on literature published before 2023. In addition, many reviews are centered on specific architectures (SAM variants, for instance) or multimodal vision–language models, resulting in limited treatment of 3D volumetric imaging. This gap underscores the need for a comprehensive survey dedicated to synthesizing recent advances and challenges in volumetric medical imaging.
Our Contribution: In this work, we review recent (January 2023–August 2025) developments in FMs for volumetric medical image analysis, a rapidly evolving area that remains underexplored in current surveys. With new FMs being released almost monthly and the growing challenge of developing robust 3D radiology-oriented models, there is a clear need for a focused analysis. This review addresses this need by examining the state of the art in volumetric imaging, with particular attention to radiology applications, model design strategies, and the unique opportunities and challenges associated with building FMs for volumetric imaging.

2. Paper Selection and Review Process

This review focuses exclusively on foundation or generalist models, published between 1 January 2023 and 21 August 2025, that are explicitly referred to as FMs or generalist models in the paper title. These publications were identified through keyword searches including Modality-Foundation Model, Generalist Model, Foundation/Generalist Model Medical, and Foundation/Generalist Model Radiology. The search string we generally used was ((“foundation model” OR “generalist model”) AND (medical OR “magnetic resonance” OR “ct” OR “computed tomography” OR “mr” OR radiology OR “medicine” OR “pet” OR “pt”)). Applying this search string to paper titles, we identified 376 articles across Springer, IEEE Xplore, PubMed, and Google Scholar, and manually added a small number of relevant papers to this initial set. We then applied inclusion and exclusion criteria: we included only models developed for volumetric (3D) medical imaging modalities, specifically CT, MRI, and PET, and excluded studies limited to 2D modalities such as X-rays, ultrasound/echo, endoscopy, retinal imaging, pathology slides, and genomic data, as well as FMs from non-medical domains. In addition, we excluded studies that focus solely on SAM-based adaptations, in order to keep the review centered on models originally proposed as foundation/generalist architectures rather than downstream repositioning. Moreover, Lee et al. [11] already provide a dedicated survey of SAM-based foundation-model methods, which reduces the need to cover that line of work in depth here. The included studies may be single- or multimodal, pretrained in a supervised or self-supervised manner, and may operate at the 2D slice or full 3D level, as long as they are developed for volumetric imaging modalities (mostly CT, MR, and PET). We also considered preprint papers to capture the latest trends. After applying these criteria and removing duplicates based on titles, authors, and similarity between papers, we selected 60 papers for the final review. In summary, this review focuses on FMs that are explicitly developed for volumetric medical imaging modalities and introduce novel methodological contributions, rather than works that simply apply existing natural-domain FMs (such as SAM) to volumetric medical data. Each paper was reviewed with attention to the imaging modality, datasets used for pretraining and fine-tuning, strategies for both stages, robustness, comparative evaluation against supervised and related FMs in full-data and few-shot settings, and computational resource requirements. The paper selection process is depicted in Figure 1.

3. Foundation Models Development

The term Foundation Model (FM) was first introduced in 2021 by The Stanford Center for Research on Foundation Models (CRFM) [17]. FMs are large-scale AI systems, trained on vast and diverse unlabeled or semi-labeled datasets using self-supervised learning techniques. This extensive training enables FMs to generalize and adapt to a wide range of downstream tasks. Many FMs are multimodal, capable of processing and integrating multiple types of data, including text, images, and audio, and exhibit emergent abilities, such as few-shot learning, reasoning, or transfer to tasks not explicitly seen during training. These characteristics make FMs highly versatile and powerful in different domains.
Building on the growing adoption of FMs in medical imaging, we now present a detailed overview of an end-to-end pipeline for developing these models for volumetric tasks. Figure 2 illustrates the highlights of this pipeline. Additionally, to highlight current trends, Figure 3 shows the distribution of papers with respect to dataset types, architectural design, and pretraining strategies. The process begins with the preparation of large and diverse datasets, followed by the design of model architectures capable of handling the complexity of 3D scans. Various pretraining strategies, comprising supervised, weakly supervised, self-supervised, and multimodal approaches, have been used. These approaches enable the model to learn useful features that can be transferred to new tasks. Once pretraining is complete, the models are adapted to specific applications, such as segmentation or diagnosis. Finally, after their performance has been evaluated, the models are deployed in real clinical settings. In the following sections, we describe each of these steps in detail, highlighting the main challenges and current approaches in volumetric foundation model development.

3.1. Data Collection and Curation

A critical first step in developing FMs for volumetric medical imaging is the curation of large and diverse datasets. These datasets may originate from publicly available collections or from private institutional archives, which often require regulatory approval. In practice, many studies combine public datasets with private in-house data to achieve both scale and diversity. This combination helps ensure demographic variety and coverage of different acquisition protocols, which is essential for building robust FMs. However, private datasets, while often richer and more clinically representative, introduce challenges related to privacy, ethics, and data-sharing agreements, and can further limit the reproducibility of published work.
Beyond collection, all datasets must undergo a rigorous quality-control process. Key steps include de-identification of patient information, standardization of imaging formats (e.g., conversion from DICOM to NIfTI), normalization of voxel spacing and intensity ranges, and the removal of corrupt or incomplete volumes. For volumetric imaging, in particular, harmonization across scanners and acquisition protocols is essential to mitigate domain shifts that could hinder model generalization.
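To make these preprocessing steps concrete, the following is a minimal sketch of resampling a single CT volume to isotropic spacing and normalizing its intensities; it assumes nibabel and SciPy are available, and the function name, target spacing, and Hounsfield-unit window are illustrative choices rather than a prescribed pipeline.

```python
# Illustrative preprocessing of one NIfTI CT volume (already converted from DICOM):
# resample to isotropic voxel spacing and rescale intensities to [0, 1].
import nibabel as nib
import numpy as np
from scipy import ndimage

def preprocess_ct(path, target_spacing=(1.5, 1.5, 1.5), hu_window=(-1000, 1000)):
    img = nib.load(path)
    vol = img.get_fdata().astype(np.float32)
    spacing = img.header.get_zooms()[:3]          # original voxel spacing in mm

    # Resample to the target spacing with trilinear interpolation.
    zoom = [s / t for s, t in zip(spacing, target_spacing)]
    vol = ndimage.zoom(vol, zoom, order=1)

    # Clip to a Hounsfield-unit window and rescale to [0, 1].
    lo, hi = hu_window
    vol = np.clip(vol, lo, hi)
    return (vol - lo) / (hi - lo)
```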
The annotation of volumetric data represents another major bottleneck. Voxel-level labeling of 3D volumes is costly and time-intensive, typically requiring expert radiologists. To mitigate this burden, semi-automated strategies, such as weak supervision, presegmentation models, and human-in-the-loop correction, are often adopted. To maintain reliability, quality assurance measures such as inter-rater agreement analysis, consistency checks, and validation on reference subsets are essential. Although large-scale pretraining can proceed without annotations, labeled datasets remain necessary for fine-tuning and evaluation. Importantly, it is recommended to keep some datasets completely separate from the pretraining pool in order to rigorously assess the model’s robustness.
Finally, legal and licensing considerations must be addressed. Public datasets are typically governed by permissive licenses (e.g., CC-BY, CC0) or custom agreements that allow research use but restrict redistribution. Private or institutional data require compliance with regional regulations such as HIPAA, GDPR, or their national equivalents, along with clear documentation of consent scope and data-use agreements. Careful attention to these factors ensures that curated datasets are both ethically sound and reusable for downstream FM training and benchmarking.

3.2. Model Architecture Design

Designing architectures for volumetric FMs requires special attention to the higher dimensionality and heavy computational demands of volumetric data. Classic 3D CNNs, such as the widely used 3D U-Net, extend 2D convolutions to capture spatial context across slices and remain strong baselines for many segmentation tasks. More recently, 3D ViTs have been widely adopted; these models tokenize volumes into patches or cubes and excel at modeling long-range dependencies across entire scans. A growing area of interest is hybrid approaches, where efficient 2D slice-level encoders are combined with 3D aggregators to strike a balance between efficiency and contextual understanding. As FMs are often large, scalability is a key design factor, with architectures built to support billions of parameters and trained across distributed hardware. To keep training practical, methods like mixed-precision computation, sparse attention, and hierarchical feature representations are commonly used. At this stage, it is also important to be realistic about available computational resources. In many cases, full-scale 3D pretraining may not be feasible, and lighter hybrid designs become the practical choice. A practical starting point is to leverage architectures that have already demonstrated strong performance in supervised learning for the target task, as these proven baselines can be scaled and adapted for FM training.
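As a concrete illustration of how 3D ViTs tokenize volumes, the sketch below shows a patch-embedding layer in PyTorch; the class name, patch size, and embedding dimension are illustrative assumptions and are not taken from any specific model reviewed here.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Tokenize a 3D volume into non-overlapping cubes (illustrative sketch)."""
    def __init__(self, patch_size=16, in_channels=1, embed_dim=768):
        super().__init__()
        # A strided 3D convolution maps each patch_size^3 cube to one token.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                          # x: (B, C, D, H, W)
        tokens = self.proj(x)                      # (B, embed_dim, D/p, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_tokens, embed_dim)

# Example: a 96^3 CT patch yields (96/16)^3 = 216 tokens.
x = torch.randn(1, 1, 96, 96, 96)
print(PatchEmbed3D()(x).shape)  # torch.Size([1, 216, 768])
```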

3.3. Pretraining Strategies

The choice of a proper pretraining strategy is central to the development of FMs, as it determines how models learn transferable representations from large-scale volumetric datasets. In medical imaging, where annotated data are often scarce, researchers have explored a range of supervised, weakly supervised, and self-supervised approaches, explained in the following.

3.3.1. Supervised Pretraining

Supervised pretraining relies on large, annotated datasets in which models are trained using explicit task labels, such as voxel-level segmentations or organ-level classifications. While this approach has proven highly effective in natural image domains, its scalability to volumetric medical imaging is constrained by the substantial cost and time required to produce high-quality annotations. Nonetheless, supervised pretraining on carefully curated public datasets can provide strong baselines and valuable starting points for downstream fine-tuning.

3.3.2. Weakly Supervised Pretraining

Weak supervision leverages coarse or indirect labels—such as image-level tags, slice-level annotations, or automatically extracted metadata—to guide representation learning. Radiology reports, for instance, can be mined to generate approximate labels, while 2D slice annotations may be propagated across full volumes. More structured strategies include the use of supervoxels, where clusters of voxels are grouped into coherent regions that can be labeled at a higher level of abstraction, thereby reducing the need for voxel-wise supervision. Pseudo-labeling has also gained traction, with labels automatically generated from existing models (e.g., 2D networks, SAM/MedSAM [3,4], or prior segmentation tools) and subsequently refined or filtered using confidence thresholds. Although these labels are inherently noisy, weakly supervised approaches substantially reduce reliance on expert annotation and have proven effective in scaling pretraining across larger and more diverse datasets.

3.3.3. Self-Supervised Pretraining

Self-supervised learning (SSL) has rapidly become the dominant paradigm for FM development, particularly in volumetric medical imaging, as it enables the use of vast amounts of unlabeled data. In these methods, the learning process is guided by supervisory signals derived directly and computationally from the data itself, rather than from human-provided annotations. By designing pretext tasks that exploit intrinsic structure within the data, SSL allows models to acquire transferable representations without relying on costly annotations. A variety of SSL strategies has been proposed, which can be broadly grouped into the following families.
Contrastive Learning: In Contrastive Learning (CL), models are trained to map two differently augmented views of the same 3D volume, forming a positive pair, to nearby points in the representation space, capturing their shared semantic concepts. At the same time, the model learns to push apart representations of volumes originating from different patients, referred to as negative pairs. As an example, if a CT volume is rotated, cropped, or intensity-adjusted, the model should still produce feature representations similar to those of the original image. This teaches the network to focus on meaningful anatomical structures rather than superficial differences caused by imaging conditions or preprocessing. Popular methods such as MoCo [18], DenseCL [19], SlotCon [20], and SimCLR [21] follow this principle. By enforcing invariance to transformations in this way, CL enables models to learn representations that are both robust to noise and highly transferable across datasets and tasks. However, many contrastive methods are tailored to specific applications and often rely on very large batch sizes to be effective. As a result, selecting the right strategy requires careful consideration of the task, dataset scale, and available computational resources.
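For readers unfamiliar with the mechanics, the sketch below shows a simplified in-batch contrastive (InfoNCE-style) objective in the spirit of SimCLR, written in PyTorch; the temperature value and the simplification to one positive per row are illustrative, and published losses such as NT-Xent differ in detail.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """z1, z2: (B, D) embeddings of two augmented views of the same volumes.
    Matching rows are positives; other samples in the batch act as negatives
    (illustrative sketch)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    # Symmetrize over both views so each direction contributes equally.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```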
Self-distillation: Approaches such as DINO [22] and BYOL [23] train student and teacher networks to produce consistent representations without explicit negative pairs. In volumetric imaging, such strategies enable stable feature learning without requiring large batch sizes, which is advantageous when working with high-dimensional 3D inputs.
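A core mechanic shared by these methods is the momentum (exponential moving average) update of the teacher, sketched below in PyTorch; the momentum value and the stand-in encoder are illustrative assumptions.

```python
import copy
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    """EMA update of teacher weights, as used in BYOL/DINO-style
    self-distillation (illustrative sketch)."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)

# Typical setup: the teacher starts as a copy of the student and never receives
# gradients; only the student is trained by backpropagation.
student = torch.nn.Linear(128, 64)          # stand-in for a 3D encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)
update_teacher(student, teacher)
```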
Generative Pretraining: Generative pretraining methods aim to reconstruct or generate the input data, most commonly through masked image modeling (MIM) [24] or masked volume modeling (MVM) [25]. Herein, random patches or cubes are replaced with constant or random values, and the model is trained to reconstruct them. Examples include MAE [26] and SimMIM [24]. More advanced objectives extend beyond simple reconstruction by training models to synthesize high-resolution volumetric images using reconstruction or adversarial losses, encouraging the network to capture fine-grained structural details. Recently, denoising diffusion probabilistic models (DDPMs) [27] have also been explored as pretraining strategies, offering a powerful way to learn generative priors for volumetric medical imaging.
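The sketch below illustrates only the masking step of masked volume modeling on a 3D tensor; the cube size, mask ratio, and function name are illustrative assumptions and do not reproduce the exact procedure of any cited method.

```python
import torch

def random_cube_mask(volume, cube=16, mask_ratio=0.6):
    """Mask a random subset of non-overlapping cubes in a 3D volume
    (illustrative sketch). volume: (B, C, D, H, W), with D, H, W divisible by cube."""
    B, C, D, H, W = volume.shape
    gd, gh, gw = D // cube, H // cube, W // cube
    n_cubes = gd * gh * gw
    keep = torch.rand(B, n_cubes, device=volume.device) > mask_ratio  # True = visible
    mask = keep.view(B, 1, gd, gh, gw).float()
    # Upsample the cube-level mask back to voxel resolution.
    mask = (mask.repeat_interleave(cube, 2)
                .repeat_interleave(cube, 3)
                .repeat_interleave(cube, 4))
    return volume * mask, mask  # masked input and visibility mask

# The pretraining loss is then typically an L2 reconstruction error restricted
# to the masked voxels, e.g., ((pred - volume) ** 2 * (1 - mask)).mean().
```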
Multimodal Pretraining: By aligning volumetric imaging with complementary data sources such as radiology reports, pathology findings, or clinical records, multimodal approaches enable richer and semantically grounded representations. These approaches include CLIP-like [28] frameworks, BioViL [29], and MedCLIP [30]. Herein, separate encoders for volume and other data modalities are used to generate embeddings, which are aligned during pretraining. This is especially valuable in medical imaging, where image interpretation is closely tied to the clinical context.
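The sketch below shows a generic dual-encoder setup of this kind, where separate volume and text encoders project into a shared embedding space; the encoder stand-ins, dimensions, and class name are assumptions, and the alignment objective would typically be a symmetric contrastive loss analogous to the one sketched above for contrastive learning.

```python
import torch.nn as nn

class VolumeTextAligner(nn.Module):
    """CLIP-style dual encoder: a 3D volume encoder and a report encoder
    project into a shared space for contrastive alignment
    (illustrative sketch; the encoders are stand-ins)."""
    def __init__(self, vol_encoder, txt_encoder, vol_dim, txt_dim, shared_dim=512):
        super().__init__()
        self.vol_encoder = vol_encoder
        self.txt_encoder = txt_encoder
        self.vol_proj = nn.Linear(vol_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)

    def forward(self, volumes, report_tokens):
        v = self.vol_proj(self.vol_encoder(volumes))        # (B, shared_dim)
        t = self.txt_proj(self.txt_encoder(report_tokens))  # (B, shared_dim)
        return v, t  # paired embeddings, matched with a symmetric contrastive loss
```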

3.4. Task Adaptation and Fine-Tuning

Once pretrained, volumetric FMs must be adapted to specific downstream tasks, such as organ segmentation, tumor detection, image registration, resolution enhancement, or disease prognosis, because the pretraining objective typically differs from the intended task. In general, the following strategies are used.
End-to-end Fine-Tuning: In end-to-end fine-tuning, all parameters of the pretrained model are updated on the target dataset. A task-specific head, for instance, a segmentation decoder, classification layer, or detection module, is typically appended to the backbone. Although end-to-end fine-tuning allows the model to fully specialize to the new task, it is computationally expensive and may risk overfitting in the presence of limited data.
Partial Fine-Tuning: In partial fine-tuning, the majority of the pretrained backbone is frozen, and only the task-specific layers are trained. For instance, a linear classification head may be added for disease classification, or a lightweight decoder for volumetric segmentation. This reduces the training time and memory requirements while leveraging the general features learned during pretraining. A common variant is linear probing, where only the final linear layer is trained while the encoder remains fixed. Linear probing is commonly used to assess the performance of SSL-pretrained models by evaluating the quality of the learned representations.
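A minimal linear-probing setup might look like the sketch below, assuming a generic PyTorch encoder that outputs a fixed-length feature vector; the helper name and feature dimension are illustrative.

```python
import torch.nn as nn

def build_linear_probe(encoder, feat_dim, num_classes):
    """Linear probing: freeze the pretrained encoder and train only a
    linear classification head (illustrative sketch)."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    head = nn.Linear(feat_dim, num_classes)
    model = nn.Sequential(encoder, head)
    # Only the head's parameters should be passed to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return model, trainable
```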
Parameter-Efficient Adaptation: Parameter-efficient adaptation integrates lightweight modules into a pretrained model while keeping most of the backbone frozen, avoiding the need to retrain billions of parameters. Common approaches include small bottleneck layers, known as adapters, inserted between transformer blocks [31], low-rank weight updates (LoRA) [32], and prompt tuning with task-specific tokens or visual prompts [33]. These techniques substantially reduce memory and computational costs, making FMs more practical for clinical development and deployment under limited hardware resources.
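As an illustration of the low-rank idea, the sketch below wraps a frozen linear layer with trainable low-rank factors in the style of LoRA; the rank, scaling, and class name are illustrative choices rather than a reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: the pretrained weight is
    kept fixed and a small trainable update B @ A is added (illustrative sketch)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        # A is initialized with small random values, B with zeros, so the
        # adapted layer starts out identical to the pretrained one.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```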
Zero- and Few-Shot Transfer Learning: One of the most promising aspects of FMs is their ability to generalize to unseen tasks with little or no labeled data. By leveraging the strong representations learned during large-scale pretraining, volumetric FMs can be directly applied in a zero-shot setting or can be fine-tuned with only a handful of labeled cases in a few-shot setting. Text-driven segmentation using a CLIP-like model [28] is an example of this approach. This is especially valuable in medical imaging, where labeled volumetric datasets are scarce and costly to produce.

3.5. Evaluation and Deployment

The evaluation of volumetric FMs requires more advanced metrics than accuracy, extending to aspects such as cross-institutional generalization, fairness across diverse patient groups, and robustness in limited-label scenarios. Beyond the absolute performance, it is also recommended to benchmark these models against other FMs as well as supervised baselines under both full- and partial-dataset settings. Equally important is preparing them for deployment in clinical environments, which demands seamless integration with existing systems, strong privacy safeguards, and efficient optimizations to ensure feasibility on practical hardware. Over time, these models must also be continually updated with new data while avoiding catastrophic forgetting, ensuring they remain robust, adaptable, and clinically useful.

4. Recent FMs in Volumetric Medical Imaging

FMs in the volumetric medical domain have been developed to address a wide spectrum of applications, including segmentation, classification, disease prediction, phenotype prediction, image registration, denoising, super-resolution, image reconstruction, report generation, visual question answering, pathology detection, and survival prediction. Notably, several FMs demonstrate the capacity to handle multiple tasks simultaneously. An overview of the reviewed FMs, their tasks, and modalities is depicted in Figure 4. In addition, Figure 5 illustrates the geographical distribution of these publications across countries. To provide a structured overview of the varied roles of FMs in medical imaging and analysis, we have categorized them as follows:
  • Segmentation models, designed for pixel-wise annotation and localization of anatomical structures or lesions.
  • Classification and predictive models, focused on tasks such as disease detection, survival analysis, and pathology characterization.
  • Image registration, enhancement, and reconstruction models, which address image-to-image transformation tasks.
  • General-purpose models, which are multitask FMs capable of addressing diverse tasks within a single framework.

4.1. FMs for Segmentation

FMs for segmentation tasks are commonly developed through either supervised pretraining on large-scale labeled datasets or self-supervised learning on vast collections of unlabeled data. In practice, many approaches also adopt semi-supervised or hybrid strategies that combine both labeled and unlabeled sources to balance scalability with annotation efficiency. Table 2 summarizes the FMs designed for the segmentation task.
Following pretraining, FMs are typically adapted to specific segmentation tasks via fine-tuning, which often involves attaching a decoder or task-specific linear layers. Fine-tuning strategies vary in scope: (i) in some cases, end-to-end fine-tuning is applied, where all pretrained parameters are updated simultaneously; (ii) alternatively, selective fine-tuning is performed, where only a subset of layers is updated, while others remain frozen; (iii) increasingly, parameter-efficient adaptation methods, such as adapters or low-rank modules, are used to modify the pretrained backbone with minimal additional trainable parameters. These diverse strategies enable practitioners to balance computational cost, memory efficiency, and task-specific performance.
Several FMs are developed for segmenting a wide range of organs; TotalSegmentator [50], SegVol [38], and VISTA3D [43] are among these.
TotalSegmentator [50] was proposed in 2023 as an FM for segmenting 104 anatomical structures, including 27 organs, 59 bones, 10 muscles, and 8 vessels. This FM was trained on 1204 CT scans, from the University Hospital Basel, which were resampled to 1.5 mm isotropic resolution and annotated through iterative refinement. In parallel, an aging dataset of 4004 CT scans (ages 18–100, 63.5% male) was curated to investigate age-related changes in organ and tissue morphology. From a methodological standpoint, a supervised nnU-Net trained on the high-resolution dataset outperformed its low-resolution counterpart and achieved superior external validation on the BTCV dataset [51] compared to models trained solely on BTCV. The aging study demonstrated negative correlations between age and bone attenuation or muscle volume/attenuation, positive correlation with aortic volume, and age-related organ shrinkage. Despite these contributions, important limitations remain. The study did not benchmark TotalSegmentator against other FMs or explore zero-shot and few-shot transfer, leaving its generalizability underexamined.
TotalSegmentator MRI [41] extends the popular nnU-Net–based TotalSegmentator for MRI segmentation across 80 anatomic structures, trained on a diverse 1143-scan corpus (616 MRI, 527 CT) and evaluated on an internal MRI test set, two external MRI sets (AMOS, CHAOS), and a CT test set for cross-modality comparison. The model outperformed baseline methods. An ablation shows that mixing CT into the training set improves MRI performance, indicating that modality mixing can act as effective data augmentation. The work releases models, code, and annotations with a ready-to-use web tool, offering a practical, open benchmark for robust, sequence-agnostic MRI segmentation at scale.
SegVol [38], introduced in 2024, is a universal volumetric FM designed to segment over 200 anatomical structures from CT scans using flexible spatial and semantic prompts, namely, point, bounding box, and text prompts. To address the challenge of partially labeled datasets, the authors curated 90K CT volumes from 25 open-source datasets and generated refined pseudo-masks through automated segmentation and morphological filtering, followed by fine-tuning on 6K fully annotated scans. The model employs a 3D ViT encoder pretrained with the SimMIM [24] algorithm. This algorithm is a masked image modeling strategy where portions of the input volume are masked and the missing regions need to be reconstructed by the decoder, enabling the model to capture rich spatial representations in a self-supervised manner, combined with a frozen CLIP [28] text encoder and a prompt encoder. To improve computational efficiency on large 3D volumes, SegVol introduces a zoom-out–zoom-in strategy, where a coarse mask is first generated on a resized image (zoom-out) and then refined with sliding window inference on the original ROI (zoom-in). Compared to SAM-based baselines, SegVol demonstrated notable performance improvements and robust few-shot adaptability, highlighting its promise as a general-purpose segmentation model. However, its large parameter size (181M) raises concerns regarding efficiency and deployment in resource-limited clinical environments.
VISTA3D [43] is a unified FM for 3D medical image segmentation, introduced in 2025. This model combines automatic segmentation, interactive editing, and zero-shot inference in a single framework. Built on a SegResNet [52] backbone, VISTA3D was pretrained on 11,454 CT volumes using a semi-supervised strategy that leverages manual labels, pseudo-labels from TotalSegmentator [50], and SAM-derived supervoxels. The model incorporates both an automatic branch, which performs class-prompted binary segmentation, and an interactive branch, which refines masks through point-based prompts and correction workflows. To improve robustness for underrepresented categories, VISTA3D also integrates fine-tuning with oversampled rare classes. It supports segmentation of 127 anatomical classes and demonstrates strong generalization to unseen anatomies with point and text prompts. Compared to models such as nnU-Net and TotalSegmentator, VISTA3D achieved better performance, reduced inference time by nearly four times, and showed effective few-shot adaptability with as little as one labeled sample. However, its dependence on labeled data for semi-supervised pretraining remains a limitation.
VesselFM [36] is an FM specifically designed for universal 3D blood vessel segmentation, a clinically crucial but challenging task due to imaging variability and complex vessel structures. To ensure robustness, vesselFM is trained on a combination of real annotated vessel scans, domain-randomized synthetic data, and flow-matching generative synthetic data. This diverse strategy enables state-of-the-art zero-, one-, and few-shot generalization across unseen imaging domains. VesselFM is validated on clinically relevant modalities (MRA, CTA/CT, and X-ray) as well as preclinical modalities (vEM, µCTA, and two-photon microscopy), spanning multiple anatomical regions (brain, kidney, liver) and organisms (human, mouse, rat). It demonstrates strong adaptability and scalability, underscoring its potential to support diagnosis and treatment of vascular diseases such as stroke, aneurysms, coronary artery disease, and Alzheimer’s. Extending vesselFM to multiclass or instance segmentation tasks represents a promising future direction.
A specialized Lymph Node (LN) segmentation FM [39] was proposed in 2025, releasing annotated LN data from 3346 publicly available CT scans. It also introduced Dynamic Gradient Sparsification Training (DGST), a few-shot fine-tuning strategy that dynamically selects and updates the most critical parameters of the convolutional kernels in nnU-Net based on gradients at each iteration. An nnU-Net model was first pretrained in a supervised manner and then fine-tuned using DGST in few-shot scenarios. Extensive experiments demonstrated that DGST effectively balances model stability and flexibility, mitigating the risk of overfitting while preserving the generalizability of the model to new medical scenarios. The results show that the model fine-tuned with this novel strategy outperforms alternative approaches.
RoMedFormer [42] is a 3D transformer-based FM for genito-pelvic segmentation developed using a three-stage training recipe and an architecture tailored to small, complex pelvic structures and multimodal data. First, it performs self-supervised pretraining via masked image modeling on large unlabeled CT collections (e.g., FLARE22, HNSCC, RibFrac, ACRIN 6664, TCIA COVID) to learn broad anatomical representations; second, it supervises on multiorgan datasets (TotalSegmentator, AMOS22) to refine multiorgan, CT and MRI capability; third, it task-specifically fine-tunes on institutional data targeting female genito-pelvic organs. Architecturally, the model tokenizes volumes with small 3D patches (8 × 8 × 8) to preserve fine detail, replaces absolute position encodings with Rotary Positional Embeddings (RoPE) to capture relative spatial relationships in 3D, uses SwiGLU MLP blocks for stronger sequence modeling, and decodes with a lightweight convolutional decoder to keep computations low—an arrangement explicitly motivated for large-volume, patch-based 3D processing. Overall, the method transfers broad anatomical knowledge into a scarce, underserved female pelvic domain, yielding strong multimodal performance.
F3-Net [40], proposed in 2025, is an FM for brain abnormality segmentation pretrained on large-scale public MRI datasets in a supervised manner covering multiple pathologies (glioma, metastasis, stroke, WMH). The task is multipathology segmentation, with a pretraining strategy based on supervised nnU-Net training enhanced by a zero-image modality substitution that enables flexible handling of missing sequences (T1, T1-Gd, T2, FLAIR, DWI, ADC). F3-Net outperformed other supervised segmentation baselines. Limitations include reliance on zero-filled substitution rather than biologically realistic imputation, slightly lower performance on infarct lesions compared to disease-specific models, and lack of validation beyond MRI (e.g., CT or PET).
Mixture of Experts (MoE) [53,54] represents a modeling paradigm in which, rather than relying on a single large model to handle all tasks, multiple specialized “expert” submodels are trained, each excelling at different aspects of the problem. A gating or routing mechanism here acts like a decision-maker; it dynamically determines which experts to activate for a given input, thereby improving efficiency and scalability.
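A minimal dense-gating variant of this idea is sketched below in PyTorch; the expert design, number of experts, and class name are illustrative, and production MoE models typically use sparse top-k routing rather than weighting every expert.

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    """Minimal mixture-of-experts layer: a gating network produces weights over
    expert sub-networks and the output is their weighted sum (illustrative sketch)."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                                        # x: (B, dim) token features
        weights = torch.softmax(self.gate(x), dim=-1)            # (B, E) routing weights
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (B, E, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)
```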
Medical Multimodal Mixture of Expert (M4oE) [46] is a multimodal FM for CT, MRI, and CE-MRI segmentation that incorporates an MoE into a Swin-Unet [55] backbone. Conventional multilayer perceptrons (MLPs) are replaced with modality-specific expert networks, pretrained using masked autoencoders (MAEs) [26], where each expert reconstructs masked image regions to learn domain-specific features. A gating network dynamically assigns weights to experts based on the input modality, while a projection head aligns outputs with dataset-specific classes. Pretrained on 550 CT, 100 MRI, and 60 CE-MRI scans and fine-tuned on the same datasets, M4oE achieved 30% lower computation cost and faster fine-tuning, with strong improvement on ATLAS2023 [56] but decreased performance on the AMOS22 [57] dataset. The main limitations are the small size of the pretraining dataset, absence of external validation, dependence on a single architecture, and the lack of evaluation in few-shot settings or comparison with other FMs.
Mixture of Modality Experts (MoME) [45] is an FM for 3D brain lesion segmentation that addresses the challenges of multimodality MRI and diverse lesion types. The framework employs multiple nnU-Net-based expert networks, each specialized for a specific MRI modality, including T1-weighted, T2-weighted, T1 contrast-enhanced (T1ce), FLAIR, and diffusion weighted imaging (DWI). MoME integrates these experts through a hierarchical gating network that performs voxel-wise aggregation across multiple resolution levels. To prevent experts from losing modality-specific knowledge, a curriculum learning strategy is used, gradually shifting training from modality specialization to collaborative prediction. MoME was pretrained and fine-tuned on six public datasets (6585 3D MRIs) and three private in-house datasets, covering a broad spectrum of brain lesions and modalities. Trained with supervised loss (Dice and cross-entropy) as well as an additional collaboration loss, it demonstrated consistent improvements over strong baselines such as task-specific nnU-Net [58], Multi-Talent [59], Hermes [60], and SAM-Med3D [61]. Importantly, it showed robustness across nine datasets, highlighting its strong cross-domain generalization and potential as a unified solution for brain lesion segmentation, unlike task-specific models that require separate training for each modality. Despite its strengths, MoME has limitations. Its methodology requires separately pretraining five modality-specific experts, making it computationally intensive and difficult to scale when new modalities are added. Moreover, the reliance on supervised pretraining restricts applicability in low-annotation settings, and the framework has not been tested in few-shot or limited-data scenarios.
MIS-FM [49] is a 3D segmentation-focused FM pretrained on over 110K unlabeled CT scans using a novel self-supervised strategy, called Volume Fusion (VF), where patches from a foreground subvolume are fused into a background subvolume using discrete coefficients, and the model is trained to predict these coefficients at the voxel level. This formulation enables direct pretraining of a full encoder–decoder segmentation pipeline without requiring manual annotations. The proposed model, titled Parallel Convolution and Transformer Network (PCT-Net) was evaluated on head–neck, thoracic, and abdominal organ segmentation tasks, outperforming strong baselines such as nnU-Net [58], TransUNet [62], and UNETR++ [63]. Although MIS-FM demonstrates robust transferability in limited-data scenarios, it suffers from reliance on a single architecture and lacks comparison with other FMs.
BrainSegFounder [44] introduces a 3D FM for neuroimage segmentation, leveraging a large-scale self-supervised pipeline across multimodal MRI datasets. The model, based on SwinUNETR [64], undergoes a three-stage training strategy: (i) large-scale SSL pretraining on 82,800 healthy UK Biobank [65] MRIs using masked volume inpainting, 3D rotation, and contrastive coding; (ii) transfer learning on disease-focused datasets, Brain Tumor Segmentation (BraTS) challenge and Anatomical Tracings of Lesions After Stroke v2.0 (ATLAS v2.0), with multimodal adaptation; and (iii) full-model fine-tuning for downstream tumor and lesion segmentation tasks. BrainSegFounder demonstrates modest but consistent improvements over benchmarks like nnU-Net [58], SwinUNETR [64], and TransBTS, with gains particularly evident under few-shot settings. Despite the strengths of multimodal support for brain MRIs, improved data efficiency, and strong validation, the model suffers from computationally intensive pretraining (64 A100 GPUs) and lacks comparison against other FMs, leaving open questions about scalability and strength of the learned representation.
STU-Net [48] was proposed as a family of scalable U-Net models for medical image segmentation, ranging from 14M to 1.4B parameters, with STU-Net-H representing the largest segmentation FM to date. Built on nnU-Net, the architecture was systematically scaled in both depth and width, with joint scaling showing the greatest performance gains. These models were pretrained on the TotalSegmentator dataset of 1204 CT volumes annotated with 104 anatomical structures. The evaluations revealed progressively stronger segmentation performance as model size increased. Generalizability was assessed through zero-shot testing on 14 external datasets and fine-tuning on three benchmarks, where STU-Net achieved robust performance across modalities and tasks. Despite these strengths, the work remains limited by its reliance on CT-centric training data, lack of evaluation on rarer imaging modalities, and the high computational and memory costs of billion-parameter models, which may restrict accessibility.
Moving Object Segmentation in Medical Images (iMOS) [47] introduces the first FM designed for segmenting moving objects across multimodal medical image sequences, including CT, MRI, ultrasound, endoscopy, and electron microscopy. Built on the XMem [66] encoder–decoder framework with memory stores, iMOS employs a semi-supervised learning strategy that requires only a single annotated frame per sequence. To reduce computational cost, it adopts parameter-efficient fine-tuning, where encoders remain frozen. Trained on 877 public volumes, iMOS demonstrates notable improvements across modalities, with particularly strong gains in MRI.
Compared to static segmentation FMs, which operate on individual images and capture only spatial information, iMOS extends segmentation into the temporal domain by explicitly modeling motion. While static FMs typically require densely annotated datasets and focus on organ or tissue delineation, iMOS leverages sparse annotation, memory-based tracking, and spatio-temporal learning to capture dynamic processes such as cardiac motion or tumor changes. Despite these innovations, its limitations include reliance on the same datasets for pretraining and fine-tuning, lack of external validation, and absence of few-shot evaluation, leaving questions about its generalizability.
A self-supervised video FM was proposed in May 2025 for 3D CT segmentation [34]. Pretrained on large-scale natural videos and adapted to medical benchmarks, including ToothFairy2 [67] (mandible, teeth, maxillary bone, pharynx) and AMOS (abdominal organs), it leverages the analogy between video frames (H × W × T) and CT slices (H × W × D), where the third dimension encodes smoothly varying structural information. Video pretraining helps the model learn long-range dependencies and cross-frame consistency, translating into cross-slice spatial reasoning and improved anatomical coherence. It outperforms transformer baselines. However, performance degrades in the presence of metal artifacts. Despite this, video FMs capture a generic third-axis inductive bias, making them suitable for structured 3D tasks.
An FM for whole-heart segmentation, introduced in March 2025 [35], leverages self-supervised learning within a 3D student–teacher framework. The model was trained on a large unlabeled dataset comprising 49,048 CT and MRI images. An xLSTM-UNet architecture was proposed for segmentation tasks, integrating Vision-LSTM (xLSTM), an advanced extension of long short-term memory (LSTM) networks, into the U-Net framework. The proposed approach outperformed state-of-the-art methods in both accuracy and robustness. The adaptability of the model to other anatomical structures remains to be explored.
FL-Knowledge Distilled Transformer [37] introduces a transformer-based vision FM trained under a federated learning (FL) paradigm for large-scale cardiac CT analysis. The study utilizes 8104 cardiac CT scans from multiple university hospitals in Germany, enabling distributed model development without central data sharing. A two-stage semi-supervised training pipeline is employed: CNNs first generate pseudo-labels for unlabeled scans. These pseudo-labels are then distilled into a Swin-UNETR-like transformer collaboratively trained across institutions. By combining knowledge distillation with federated learning, the framework enhances data efficiency and privacy preservation, achieving robust performance in cardiac structure segmentation and anatomical landmark localization.

4.2. FMs for Classification and Predictive Tasks

As FMs are trained on large amounts of data, their encoders are capable of producing high-quality embeddings that capture rich and generalizable representations. For classification and predictive tasks, a fully connected layer is typically appended on top of the encoder and then fine-tuned. This fine-tuning is done either end-to-end (updating all model parameters) or selectively (updating only the newly added layer). This flexibility allows efficient adaptation of the pretrained backbone to a wide range of downstream tasks. Table 3 summarizes the FMs developed specifically for medical classification or prediction tasks.
MEDFORM [70] is a multimodal FM that integrates CT imaging with clinical numerical data for multicancer analysis, leveraging contrastive learning. The model employs unimodal encoders (ResNet for CT scans and TabNet for clinical features) pretrained using the SimCLR [21] method. These embeddings are then aligned via multimodal contrastive learning, and patient-level aggregation is performed using multiple instance learning (MIL), enabling training without slice-level annotations. MEDFORM outperforms unimodal and simple multimodal baselines in lung, breast, and colorectal cancer classification, and shows promise in few-shot learning scenarios. Nevertheless, its limitations include restricted dataset diversity, absence of external validation, modest improvements for certain cancers, and reliance on a single architecture without benchmarking against other FMs.
CRCFound [72] is an FM designed for colorectal cancer (CRC) diagnosis and prognosis. It pretrained a ViT backbone on 5137 unlabeled CRC CT scans using an MAE-based approach to learn universal feature representations. In addition to its pretraining on CT scans, CRCFound also integrates radiology reports to enhance its performance in clinical tasks. During the fine-tuning phase, the model leverages textual information from preoperative radiology reports, which provide detailed descriptions of tumor characteristics (such as size, location, and involvement with surrounding structures). This multimodal approach allows CRCFound to generalize effectively across various downstream tasks, including TNM staging, microsatellite instability (MSI) prediction, consensus molecular subtypes (CMS) classification, and prognosis prediction (overall survival and disease-free survival). CRCFound demonstrates exceptional performance and generalization and outperforms traditional supervised models trained from scratch even on unseen datasets. A key limitation is that it is not compared with other FMs or supervised baselines in full-data or limited-data scenarios.
DeepCNTD-Net [74] is a 3D FM used for neuro-trauma triage using non-contrast head CT scans. It was pretrained on a diverse dataset of 29,395 CT scans from nine medical centers across four countries, enabling the model to detect a wide range of neuro-trauma conditions. For pretraining, two task-specific networks were developed: one for hemorrhage subtype segmentation and another for brain anatomy parcellation. The hemorrhage network, based on a 3D Dense U-Net, classifies five hemorrhage subtypes and incorporates squeeze-and-excitation (SE) blocks to enhance feature recalibration. The brain parcellation network uses a U-Net architecture to segment 15 brain structures, employing a multihead architecture for distinct regions. During fine-tuning, the pretrained networks were integrated into DeepCNTD-Net and further refined using multimodal features. The model incorporated additional features from LLM-generated labels, which automatically annotated neuro-trauma findings from radiology reports. These features were fused with the task-specific networks’ outputs through linear layers for multilabel classification, improving the model’s diagnostic capabilities. The fine-tuning process allowed DeepCNTD-Net to learn from both the CT scans and the radiology report data, significantly enhancing its performance in detecting a broad range of neuro-trauma conditions.
FM-HCT [69] is another 3D FM designed for generalizable disease detection in head CT. The model (ViT-base backbone) was pretrained using SSL via the DINO framework on a massive private institutional dataset of 361,663 3D head CT scans. The model leverages SSL to learn robust, generalized features from unlabeled 3D volumes. FM-HCT was successfully fine-tuned for multiple downstream neurological disease detection tasks, showed strong robustness and generalization across internal and external datasets, and consistently outperformed specialized 3D CT FM benchmarks. It also showed high efficiency in few-shot fine-tuning settings, achieving performance comparable to training with fully annotated datasets.
Cardiac-CLIP [81] is a vision–language FM tailored for cardiac CT analysis. It employs a two-stage pretraining scheme: (1) masked autoencoding on 130,899 CT scans, followed by (2) contrastive learning with a textual encoder using 11,106 paired cardiac CT–report samples. The model was evaluated on cardiovascular abnormality classification, information retrieval, and clinical prediction tasks, including acute coronary syndrome (ACS), functional coronary stenosis (FCS), and coronary artery calcium (CAC) grading. Cardiac-CLIP achieved state-of-the-art performance and strong generalization in real-world clinical settings. Its limitations include computational demands, limited data diversity, and architectural rigidity.
Percival [80] is a vision–language FM which was introduced in July 2025. It employs a dual-encoder architecture comprising transformer-based image and BERT-style text encoders aligned via symmetric contrastive learning. It was trained on over 400K CT–report pairs from the Penn Medicine BioBank covering the thorax, abdomen, pelvis, head and neck, brain, and extremities. Percival exhibits strong generalization through multicontrast and multiview pretraining. Evaluated on 100K CT volumes, it achieves superior zero-shot retrieval, disease classification across 307 conditions, and prognostic risk stratification across 678 diseases compared with organ-specific models and Merlin [82]. The model further shows robust alignment with biological markers and external validation on CT-RATE, underscoring its potential for broad clinical deployment. Nonetheless, it remains computationally intensive, relies on a single architecture, and has yet to be evaluated in few-shot or label-scarce settings.
GLIP-T(C) [68] is a vision–language FM designed for continual adaptation across diverse medical imaging tasks. Extending the GLIP framework, it leverages grounded language–image pretraining on large-scale natural image–text pairs and employs aligned transformer-based encoders for vision and text. The study compares specialized, joint, and continual learning, showing that rehearsal-based continual learning mitigates catastrophic forgetting while maintaining cross-domain generalization. Evaluated across multiple anatomies and modalities, including colon, lung, thyroid, brain, skin, and cellular images, GLIP-T(C) demonstrates the feasibility of incrementally evolving vision–language models toward general-purpose medical foundation systems.
A lesion-focused FM for cancer imaging biomarkers was proposed in 2025 [77]. The model is pretrained using contrastive learning (SimCLR framework) on 11,467 radiographic lesions from the public DeepLesion [83] dataset. It employs a 3D ResNet-50 encoder and is trained in a task-agnostic manner by contrasting lesion-containing volumes with random non-lesion volumes. For downstream applications, the model can be adapted via linear classifiers on extracted features or through full fine-tuning. Evaluation spans three tasks: anatomical site classification (in-distribution), lung nodule malignancy prediction (out-of-distribution), and NSCLC prognosis (strong distribution shifts), demonstrating robust generalizability. Importantly, the feature-based implementation performs well even with limited data, remains robust to input perturbations and inter-reader variability, and generalizes well to unseen patient cohorts and institutions. Moreover, the learned features exhibit meaningful biological associations, indicating potential utility for biomarker discovery and broader clinical applications. Nonetheless, performance gains on small datasets were modest, and the real-world practicality of the model remains underexplored.
BrainIAC [76] is an FM developed using contrastive SSL pretraining to learn generalized representations from unlabeled brain MRI data. The model was trained and evaluated across four MRI sequences—T1-weighted, T2-weighted, T1CE, and FLAIR. While these sequences capture key structural and pathological information, incorporating additional modalities such as diffusion-weighted imaging (DWI), dynamic contrast-enhanced (DCE) MRI, and fat-suppressed sequences could further enhance the model’s applicability in diverse clinical scenarios. Moreover, as BrainIAC was trained on skull-stripped images, its generalizability to raw clinical scans may be limited.
MerMED-FM [79] is a unified multimodal, multidisease, and multiorgan vision FM trained using SSL. It was developed on 3.3M medical images from over ten organs—including the eye, lung, liver, kidney, prostate, skin, bladder, gallbladder, pancreas, colon, ovarian, uterine, bone, thyroid, and vessels—covering multiple modalities such as chest X-ray (CXR), CT, ultrasound (US), histopathology, color fundus photography (CFP), optical coherence tomography (OCT), and dermatology images. MerMED-FM demonstrated strong and consistent performance across 25 public datasets in diverse diagnostic tasks, including detection of ocular, pulmonary, oncologic, and dermatologic diseases. It outperformed leading FMs such as BiomedCLIP [84] and DINO [22]. These results underscore the model’s effectiveness in handling diverse medical imaging tasks. However, MerMED-FM currently operates on 2D slices only, lacking support for volumetric imaging and multimodal reasoning within individual patients.
Recent advances in self-supervised FMs have demonstrated the potential of ViT architectures for volumetric MRI analysis. A 2024 study [75] proposed a ViT encoder–decoder pretrained on 57,621 multicontrast whole-brain MRI scans (T1WI, T2WI, FLAIR, T1CE) using a hybrid approach that combines masked image modeling and contrastive learning. This strategy captures robust, task-agnostic representations suitable for downstream applications such as brain tumor detection, tumor-type classification, and molecular status prediction, consistently outperforming conventional CNN classifiers (e.g., DenseNet121 [85]) trained from scratch. The model also demonstrated scalability, interpretability via occlusion sensitivity mapping, and robustness to scanner variability and input perturbations. Limitations include evaluation on a narrow set of tasks, comparison only with CNN baselines, and lack of multi-institutional validation.
ViNet [73] was developed for detecting muscle-invasive bladder cancer (MIBC) in MRI images. It was pretrained on over 40K cross-modal imaging datasets from various anatomical regions such as the brain, heart, lungs, and abdomen, collected from both native and open-source databases. The pretraining employed a 3D image restoration method, where the model learns to restore images that have undergone a series of transformations, including non-linear, local-shuffling, outer-cutout, and inner-cutout transformations. This robust pretraining enabled ViNet to learn highly generalized and transferable features, effectively preparing the model for MIBC detection tasks. The SSL-derived FM used in ViNet’s architecture was specifically fine-tuned for transfer learning, where a modified 3D ResNet (based on ResNet3d-18) processes MRI images of the bladder. This enabled ViNet to leverage powerful feature extraction capabilities from the pretrained model while maintaining efficient training dynamics. Additionally, the model integrated weak experiential guidance using an attention mask generated by an nnU-Net [58] segmentation model, which focused on the MIBC-candidate regions. This multilayered training approach improved both model performance and interpretability, making ViNet effective for real-world clinical settings.
Building on similar principles, a ViT-based model [71] was pretrained and fine-tuned on a cohort of encephalitis patients for classification tasks. Using a 16-layer ViT autoencoder with 32 attention heads per layer, the model was trained using masked autoencoding combined with contrastive learning to extract high-quality volumetric embeddings. The pretrained encoder was adapted to downstream tasks via an MLP classifier, outperforming a DenseNet121 [85] baseline, particularly in sensitivity, while highlighting biologically meaningful regions through occlusion sensitivity mapping. The model maintained robustness across scanners and in limited-data scenarios, though evaluation was limited to a single binary task and lacked comparisons with other FMs.
Pretrained on 75,861 clinical head MRI scans, the self-supervised vision FM SwinClassifier [78] was developed to distinguish Parkinson’s disease from Parkinson-plus syndrome. It employs the Swin UNETR backbone, a hybrid of the Swin Transformer and UNETR architectures that combines Transformer-based feature extraction with a U-shaped encoder–decoder design. SwinClassifier outperformed a vanilla ViT autoencoder and CNN models (DenseNet121 and ResNet50) trained from scratch.
Overall, these studies show the utility of self-supervised pretraining for classification tasks in medical applications. However, broader downstream evaluations and multi-institutional validation are needed to prepare these FMs for real-world use cases.

4.3. FMs for Image Registration, Reconstruction, Super-Resolution, and Quality Assessment

FMs are also being developed for image registration, super-resolution, image reconstruction, and enhancement tasks. A summary of recent FMs designed for these tasks is provided in Table 4.

4.3.1. Image Registration

Medical image registration is the process of aligning two or more medical images into a common coordinate system, ensuring that corresponding anatomical structures are spatially matched. uniGradICON [90], multiGradICON [89], FoundationMorph [86], UniReg [87], and TotalRegistrator [88] are specifically designed for medical image registration tasks.
uniGradICON [90] unifies the speed and accuracy of learning-based registration with the general applicability of conventional approaches. It employs a two-step, multiresolution U-Net framework and is trained on a composite dataset spanning multiple anatomical regions (lung, knee, brain, abdomen) and imaging modalities (CT, MRI). A key innovation is the Gradient Inverse Consistency (GradICON) regularization, which replaces conventional smoothness or diffusion regularizers and allows training with consistent hyperparameters across different datasets. This enables the model to automatically learn transformations supported by the data. Evaluation on in-domain, out-of-domain, and zero-shot registration tasks demonstrates superior generalizability compared to other methods.
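To make the regularizer concrete, the sketch below penalizes the deviation of the finite-difference Jacobian of the composed forward-and-backward map from the identity, which is the core of the GradICON idea. It assumes dense coordinate maps in voxel units whose channel order matches the spatial axes; it is a simplified illustration, not the uniGradICON implementation.

```python
import torch

def gradicon_penalty(composed_map: torch.Tensor, spacing=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Penalize deviation of the Jacobian of (phi_AB o phi_BA) from the identity.

    composed_map: (B, 3, D, H, W) coordinate map in voxel units, where channel c
                  holds the coordinate along spatial axis c (D, H, W order).
    """
    penalty = composed_map.new_zeros(())
    for axis in range(3):
        dim = 2 + axis
        n = composed_map.shape[dim]
        # Forward finite differences along one axis give one Jacobian column.
        col = (composed_map.narrow(dim, 1, n - 1)
               - composed_map.narrow(dim, 0, n - 1)) / spacing[axis]
        identity_col = torch.zeros_like(col)
        identity_col[:, axis] = 1.0                        # matching column of the identity
        penalty = penalty + ((col - identity_col) ** 2).mean()
    return penalty
```

In training, the composed map would be obtained by resampling one transform with the other (e.g., via trilinear interpolation), and this penalty is added to the image-similarity term; the appeal of the formulation is that the same regularization weight can be reused across datasets.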
multiGradICON [89] extends uniGradICON to multimodal registration. It maintains the same architecture but introduces a squared local normalized cross-correlation loss and leverages image similarity loss randomization, in which random modalities are selected per patient for loss computation. Training on a diverse multimodal dataset across multiple anatomical regions improves multimodal registration performance while maintaining strong monomodal capability. In scenarios involving similar modalities, uniGradICON outperforms multiGradICON, whereas multiGradICON excels in inter-modality registration tasks, such as T1w-to-T2w MRI or MR-to-CT alignment.
FoundationMorph [86] is a 3D vision–language FM for unsupervised medical image registration. It features a language module using CLIP [28] for clinical text processing and a vision module with dual-path encoders. The 2D encoder, pretrained on large-scale medical images via inpainting and contrastive learning, integrates with a 3D encoder through multidimension unified attention. The model handles multiple registration tasks with a single architecture, trained on diverse datasets including MRI, CT, and PET images. It generates deformation fields for spatial transformation using normalized cross-correlation loss and gradient regularization. The proposed FM shows better performance in MRI and CT registration compared to task-specific models developed for registration tasks.
UniReg [87] introduces a unified framework for medical image registration using conditional deformation field estimation. The model combines a shared self-supervised anatomical embedding-based backbone with dynamic registration modules controlled by task-specific conditioning vectors encoding anatomical structure priors, registration type constraints, and instance-specific features. A lightweight controller generates dynamic convolutional kernels for task-adaptive deformation field generation. The unified training paradigm leverages task-specific hyperparameters and anatomical segmentation masks, enabling a single model to handle diverse registration scenarios across multiple body regions and registration types without requiring separate network training. The model achieves performance comparable to task-specific models and other registration FMs at a lower computational cost.
TotalRegistrator [88] introduces a field decomposition strategy for whole-body CT registration using multiple specialized U-Net blocks. The approach partitions deformation into affine- and region-specific components (bone, thorax, abdomen, whole-body), with each block trained independently on corresponding anatomical structures. Sequential integration of deformation fields enables comprehensive whole-body alignment while maintaining computational efficiency. The lightweight design requires only 11 GB of GPU memory through separate training of blocks and organ-wise gradient computation. Multi-organ segmentation masks guide the unsupervised learning process using mutual information, Dice, and bending-energy losses. It shows performance improvements over other foundation and task-specific models with lower GPU memory demand.
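Sequential integration of the region-specific fields amounts to composing displacement fields. The sketch below shows one common way to compose two dense displacement fields via trilinear resampling, assuming fields in normalized [−1, 1] coordinates with (x, y, z) channel ordering; it illustrates the generic operation rather than TotalRegistrator’s exact implementation.

```python
import torch
import torch.nn.functional as F

def warp(volume: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Warp a (B, C, D, H, W) tensor by a displacement field (B, 3, D, H, W)
    given in normalized [-1, 1] coordinates with channels ordered (x, y, z)."""
    b, _, d, h, w = disp.shape
    zz, yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, d), torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
        indexing="ij")
    identity = torch.stack((xx, yy, zz), dim=-1).expand(b, -1, -1, -1, -1).to(disp)
    grid = identity + disp.permute(0, 2, 3, 4, 1)          # (B, D, H, W, 3)
    return F.grid_sample(volume, grid, mode="bilinear", align_corners=True)

def compose_displacements(disp_first: torch.Tensor, disp_second: torch.Tensor) -> torch.Tensor:
    """Displacement equivalent to applying disp_first and then disp_second."""
    # Resample the first field at the locations targeted by the second, then add.
    return disp_second + warp(disp_first, disp_second)
```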
Together, these models exemplify how FMs can generalize across anatomical regions and imaging modalities, highlighting their potential to provide robust, adaptable solutions for diverse registration tasks in clinical and research workflows.

4.3.2. Image Reconstruction

The Pattern and Contrast-Prompt U-Net (PCP-UNet) [91] is an FM for cardiac MRI reconstruction. It was pretrained on the large-scale CMRxRecon dataset, which includes 150,480 2D+t cardiac MRI images across diverse contrasts (cine, aorta, tagging, mapping), acceleration rates (4–24×), and sampling patterns (uniform, Gaussian random, radial). Unlike conventional reconstruction models, PCP-UNet incorporates adaptive unrolling, where undersampled data undergoes a variable number of unrolled iterations based on acceleration rate, and channel-shifting, which augments the input with circular-shifted replicas to expand receptive fields. A contrast and pattern prompt mechanism within a U-Net backbone allows the model to adapt reconstruction across multiple acquisition protocols. Pretraining employed a supervised generative strategy with data consistency enforced via a conjugate gradient solver. For downstream evaluation, the model was fine-tuned on the same dataset and benchmarked against fixed U-Net, adaptive U-Net, and fixed PCP-UNet. Results demonstrated that Adaptive PCP-UNet consistently outperformed baselines, achieving the highest SSIM and lowest NRMSE across contrasts, acceleration rates, and sampling patterns. The model also showed robustness to domain variability, highlighting its potential for clinical generalization. Limitations include validation restricted to the CMRxRecon dataset and absence of cross-institutional testing or efficiency analyses such as GPU usage and few-shot performance.
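For intuition, the data-consistency step in an unrolled reconstruction network can be illustrated by a hard projection that keeps the acquired k-space samples and fills the remaining frequencies from the network estimate. The single-coil Cartesian sketch below is a simplified stand-in for the conjugate-gradient-based consistency used in PCP-UNet.

```python
import torch

def data_consistency(image_pred: torch.Tensor,
                     kspace_meas: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Hard data consistency for single-coil Cartesian MRI.

    image_pred : (B, H, W) complex-valued network estimate of the image.
    kspace_meas: (B, H, W) complex-valued acquired k-space (zero where unsampled).
    mask       : (B, H, W) binary sampling mask (1 = acquired).
    """
    k_pred = torch.fft.fft2(image_pred, norm="ortho")
    k_dc = mask * kspace_meas + (1 - mask) * k_pred    # trust measured frequencies
    return torch.fft.ifft2(k_dc, norm="ortho")

# In an unrolled network this projection (or a regularized CG solve) follows each
# denoising block, and the number of unrolled iterations can be set per
# acceleration rate, mirroring the adaptive unrolling described above.
```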

4.3.3. Image Super-Resolution

Image resolution plays a critical role in the performance of downstream tasks such as segmentation, classification, and registration. BME-X [92] was developed for brain MRI enhancement, aiming to improve these downstream applications. Built on a densely connected U-Net (DU-Net) [95] with skip connections, BME-X jointly trains a tissue classification network and a tissue-aware enhancement network using paired clean and synthetically degraded MRIs. The supervised loss combines cross-entropy for tissue classification with mean squared error (MSE) for image reconstruction, maintaining computational efficiency. Trained on 516 MRI sequences spanning fetal to early childhood development, BME-X demonstrates strong performance in motion correction, super-resolution, denoising, harmonization, and contrast enhancement, outperforming methods such as Pix2Pix [96], CycleGAN [97], and DU-Net [95]. While the model is robust across scanners and pathological cases, its limitations include reduced effectiveness for severe motion artifacts, reliance on T1/T2 contrasts, and focus on only three tissue types.
GraphMSR [93] is a graph FM for MRI super-resolution that integrates anatomical priors with multimodal semantic guidance. The framework models images as graphs, where nodes represent feature patches and edges capture local and global dependencies, and introduces innovations such as anatomically aware graph construction and multimodal semantic fusion using priors extracted from LLMs (e.g., LLaVA). Experiments are conducted on multiple datasets with T2-weighted brain and knee scans, evaluated under 2× and 4× upsampling. GraphMSR significantly outperforms CNN-, Transformer-, and diffusion-based methods while maintaining better structural preservation. The model shows strong potential for clinical applications by reducing MRI acquisition times and improving resolution under hardware or patient comfort constraints.

4.3.4. Image Quality Assessment

Image quality assessment (IQA) is the process of evaluating the quality of medical images to ensure they meet the necessary standards for accurate diagnosis and clinical decision making. It is crucial in medical imaging, as poor-quality images, such as those with artifacts, blurring, or low contrast, can lead to misdiagnosis and unnecessary repeat scans. MedIQA [94] is an FM developed to address the challenge of medical image quality assessment across diverse modalities like CT, MRI, and fundus images. It employs a two-stage training strategy: initially, the model is pretrained on 2.5K cases using automatically extracted labels from DICOM metadata such as radiation dose and magnetic field strength, which provides it with a physics-aware understanding of how acquisition settings impact image quality. In the fine-tuning phase, over 12K expert-annotated images are used to align the model’s learned features with radiologist judgments, allowing it to generalize better across diverse imaging scenarios. To efficiently handle 3D volumes, the model introduces a salient slice assessment module, selecting only seven key slices from 3D scans, which reduces computational burden while maintaining diagnostic relevance. Additionally, an automatic prompt mechanism is used to adjust the model’s behavior according to modality, region, dimension, and type, making it adaptable to various imaging conditions without retraining from scratch. These innovations address challenges such as limited annotations, high computational cost, and poor generalizability, positioning MedIQA as a scalable and robust solution for medical image quality assessment. The model shows strong performance across multiple datasets and imaging modalities, significantly outperforming traditional and deep learning-based quality assessment methods.

4.4. Multitask FMs

General-purpose FMs are designed to perform multiple tasks, such as segmentation, prediction, classification, and report generation, by first learning general representations through pretraining on a large dataset and then undergoing task-specific fine-tuning. A summary of recent FMs designed for multiple tasks is provided in Table 5.
Merlin [82] is a general-purpose FM that can perform zero-shot findings classification (31 classes, 5137 CTs), phenotype classification (692 phenotypes), cross-modal retrieval (CT–report pairs), 5-year disease prediction (6 chronic diseases, more than 7K CTs), radiology report generation, and 3D semantic segmentation (20 abdominal organs, VerSe and TotalSegmentator). To perform these tasks, Merlin was trained on paired data of CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens) in a weakly supervised manner. The architecture couples an I3D-ResNet152 image encoder with a Clinical Longformer text encoder, jointly optimized through a multitask pretraining strategy: (1) binary cross-entropy loss for multilabel prediction of ICD diagnostic codes, which forces the image encoder to capture disease-relevant representations; and (2) InfoNCE contrastive loss to align CT volumes with their paired free-text reports in a shared embedding space. This dual-supervision paradigm explicitly grounds CT features in both structured and unstructured semantics, improving generalizability compared to self-reconstruction objectives. It was evaluated on 752 tasks across six categories and compared with other related FMs, task-specific state-of-the-art models, and also in limited-data scenarios, demonstrating strong generalizability and superior performance.
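A minimal sketch of this dual-supervision objective is shown below, combining a multilabel binary cross-entropy term over diagnosis codes with a symmetric InfoNCE term that aligns CT and report embeddings; the temperature and loss weighting are illustrative assumptions rather than Merlin’s published configuration.

```python
import torch
import torch.nn.functional as F

def dual_supervision_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          code_logits: torch.Tensor,
                          code_targets: torch.Tensor,
                          temperature: float = 0.07,
                          alpha: float = 1.0) -> torch.Tensor:
    """Structured + unstructured supervision for paired CT volumes and reports.

    img_emb, txt_emb : (N, D) embeddings of paired CT volumes and reports.
    code_logits      : (N, C) predictions for C diagnosis codes.
    code_targets     : (N, C) multi-hot ground-truth codes.
    """
    # (1) Multilabel ICD-code prediction grounds the image encoder in structured labels.
    bce = F.binary_cross_entropy_with_logits(code_logits, code_targets)

    # (2) Symmetric InfoNCE: each CT should match its own report and vice versa.
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.shape[0], device=img.device)
    nce = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    return bce + alpha * nce
```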
Vision–language FMs have been developed to support a wide range of multimodal tasks in medical imaging. RadFM [106] is a radiology FM pretrained on MedMD, a large-scale multimodal dataset containing 16M 2D and 3D medical scans (X-rays, CT, MRI, PET) paired with text reports covering over 5000 diseases. It is subsequently fine-tuned on RadMD, a curated subset with 3M image–report pairs. Using a visually conditioned generative pretraining strategy, RadFM integrates textual and imaging data to support both discriminative and generative tasks. It is evaluated on RAD-Bench [116], a benchmark covering five tasks (modality recognition, disease diagnosis, visual question answering, report generation, and rationale-based diagnosis), where it achieves state-of-the-art performance and strong zero-shot generalization. Limitations include insufficient 3D image representation, as MedMD is dominated by 2D scans, and restricted sentence generation ability, limiting the clinical usefulness of longer, more detailed reports.
Based on the M3D-Data corpus, M3D-LaMed [114] presents a large-scale multimodal FM for 3D medical imaging. This architecture enables unified processing of diverse volumetric tasks, including image–text retrieval, report generation, visual question answering, anatomical localization, and promptable segmentation. By bridging 3D imaging and natural language understanding, M3D-LaMed provides strong cross-task generalization and scalability. Despite these strengths, its substantial computational requirements and the limited availability of high-quality 3D image–text data remain open challenges.
Similarly, CT-CLIP [112] and CT-CHAT [112] are FMs trained on CT-RATE, a dataset of 50,188 chest CT volumes paired with 25,692 reports, designed to bridge vision–language learning in 3D imaging. CT-CLIP aligns CT volumes with radiology reports using a CT-ViT [117] encoder and CXR-BERT [29] via contrastive learning, achieving superior zero-shot classification and retrieval performance over 3D CNN baselines with strong cross-domain generalization. CT-CHAT builds on this by integrating the pretrained CT encoder with an LLM and fine-tuning on 2.7M automatically generated CT–QA pairs, enabling effective visual question answering and report generation. Both models highlight the benefits of multimodal pretraining for robust, scalable, and clinically relevant FMs. Limitations include reliance on single-institution data, constrained prompt variability, lack of comparisons with other 3D FMs, and a narrow focus restricted to chest CT volumes.
PASTA [103], introduced in 2025, is a pan-tumor 3D-CT FM designed to mitigate data scarcity and enable unified cross-tumor analysis. It used a two-stage supervised pretraining strategy: segmentation followed by vision–language alignment. It achieved state-of-the-art performance across oncology tasks, strong few-shot learning capability, and efficient CT–MRI transfer. Its generative variant, PASTA-Gen, synthesizes 3D-CT scans with lesion masks and structured reports, though its artificial lesions may not fully reflect real-world complexity.
A few FMs have been developed exclusively for classification and segmentation tasks. The Visualization and Segmentation Masked Autoencoder (VIS-MAE) [113] addresses challenges such as limited labeled data and multimodal variability by pretraining a Swin Transformer [117] backbone with a masked autoencoder (MAE) objective. Pretraining was performed on 2.49 million multimodal 2D images from RadImageNet [118], spanning MR, CT, PET/CT, X-ray, and ultrasound. Images were randomly masked (up to 75% of patches) and reconstructed using a mean squared error (MSE) loss. Two strategies were explored: VIS-MAE-Modality, trained separately on each imaging modality, and VIS-MAE-Generic, trained jointly across modalities. Fine-tuned on eight segmentation and six classification tasks, VIS-MAE achieved performance comparable to established baselines, including nnU-Net, TransUNet [62], RadImageNet, ImageNet, and SimCLR [119], with modest gains (0.5–1.5%) in some tasks and slight improvements under limited-data settings. Notably, VIS-MAE-Modality consistently outperformed VIS-MAE-Generic, suggesting that cross-modality pretraining did not provide the expected benefits. The main limitations are the high computational cost of pretraining, the need for separate models for segmentation and classification, and only small improvements in few-shot settings, which underscore how difficult it is to build a truly general and multimodal foundation model.
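The random patch masking at the core of this MAE-style pretraining can be sketched as follows; the 75% ratio follows the description above, while the helper and the schematic training step are illustrative, with the Swin-based encoder and decoder left abstract.

```python
import torch

def random_patch_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly drop patch tokens, MAE-style.

    tokens: (B, N, D) patch embeddings.
    Returns (visible_tokens, keep_idx, mask), where mask is 1 for masked patches.
    """
    b, n, d = tokens.shape
    n_keep = int(n * (1.0 - mask_ratio))
    noise = torch.rand(b, n, device=tokens.device)                 # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]                    # random subset to keep
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, device=tokens.device)
    mask.scatter_(1, keep_idx, 0.0)                                # 0 = kept, 1 = masked
    return visible, keep_idx, mask

# Schematic training step:
# visible, keep_idx, mask = random_patch_mask(patch_embed(images), 0.75)
# recon = decoder(encoder(visible), keep_idx)                      # predicts all patches
# loss = (((recon - target_patches) ** 2).mean(-1) * mask).sum() / mask.sum()
```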
Radio DINO [99] is another FM tailored to radiomics and medical imaging applications. It pretrained ViTs using DINO and DINOv2 SSL methods on the RadImageNet dataset, which includes 1.35 million images from CT, MRI, ultrasound, and X-ray modalities covering eleven anatomical regions. This model captures semantic features without requiring manual feature engineering, which addresses the challenge of limited labeled data in medical imaging. The model achieves superior performance on various classification and segmentation tasks, surpassing supervised baselines.
M3FM [98] has been introduced as a medical multimodal–multitask FM for 3D CTs and other clinical data. Lung nodule detection and characterization, cardiovascular disease diagnosis and mortality risk prediction, lung cancer risk prediction, and other chest abnormality exams such as COVID-19 detection are the tasks covered by this model. Although M3FM covers a remarkable number of tasks, it is limited to the abnormalities within the lungs and heart. Additionally, M3FM faces challenges in dealing with imbalanced data distributions and small abnormality regions in the 3D input.
To support automated cardiac image analysis tasks, such as segmentation, landmark detection, diagnosis, and prognostication, the Cine Cardiac Magnetic Resonance Masked Autoencoder (CineMA) [101] was introduced in August 2025. CineMA was pretrained on 74,916 cine CMR studies comprising approximately 15M images from the UK Biobank using self-supervised masked autoencoders and evaluated on over 4.5K images from eight independent datasets. The model demonstrated superior performance to CNNs in disease detection and achieved competitive results in other tasks. Nonetheless, CineMA faces important limitations, including limited diversity in pretraining data, restricted modality coverage, and substantial computational demands.
Cross-Fraternal Twin Masked Autoencoder (FratMAE) [102], a PET/CT FM, has been proposed to integrate whole-body anatomy, metabolic activity, and text information. FratMAE employs separate ViT encoders for PET and CT scans, while its cross-attention decoders facilitate synergistic interactions between the modalities. The model also incorporates textual metadata to enhance PET representation learning. FratMAE was pretrained on the publicly available AutoPET III dataset [120], which contains 1292 PET/CT volumes from patients with melanoma, lymphoma, lung cancer, prostate cancer, or negative controls. Its pretraining performance was evaluated on lesion segmentation and Hodgkin lymphoma staging using the German Hodgkin Study Group (GHSG) dataset, consisting of 515 PET/CT volumes from patients with Hodgkin lymphoma. Experiments show that FratMAE surpasses all baseline models for lesion segmentation and Ann Arbor staging.
BiomedCLIP [84] is a large-scale multimodal biomedical foundation model designed to align biomedical images with text through self-supervised contrastive learning. The model is pretrained on PMC-15M, a newly curated public dataset containing 15 million image–text pairs from 4.4 million PubMed Central articles, making it one of the largest openly available biomedical vision–language resources. Architecturally, BiomedCLIP follows a CLIP-style framework, using a Vision Transformer (ViT-B/16) as the image encoder and PubMedBERT as the text encoder, optimized with an InfoNCE contrastive loss. The pretrained encoders are transferred to downstream tasks via linear probing or full fine-tuning. BiomedCLIP achieves state-of-the-art performance on a wide range of biomedical benchmarks, including RSNA, PCam, LC25000, TCGA-TIL, SLAKE, and VQA-RAD, outperforming prior models such as BioViL and PubMedCLIP in both zero-shot and few-shot settings. Notably, it demonstrates strong robustness to domain shifts across radiology and pathology, and performs well even with limited labeled data.
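As context for the zero-shot results, CLIP-style models classify an image by comparing its embedding against embeddings of textual class prompts. The sketch below shows this generic mechanism with the encoders left abstract, so as not to presume BiomedCLIP’s specific interface.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb: torch.Tensor,
                       text_embs: torch.Tensor,
                       temperature: float = 0.01) -> torch.Tensor:
    """CLIP-style zero-shot classification.

    image_emb: (B, D) image embeddings from the vision encoder.
    text_embs: (K, D) embeddings of K class prompts,
               e.g., "this is a chest X-ray showing pneumonia".
    Returns (B, K) class probabilities.
    """
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    return (img @ txt.t() / temperature).softmax(dim=-1)
```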
A Vector Quantized Autoencoder (VQ-AE) [110] was proposed as an FM for chest CT volumes, pretrained via SSL on 12.6M CT slices. The model employs masked region prediction with 5% random masking and a codebook-based vector representation to capture fundamental visual features. Pretraining uses a composite loss combining MSE, SSIM, quantization loss, and masked region reconstruction loss, enabling robust representation learning. Fine-tuning experiments showed modest performance gains with clear benefits in few-shot adaptation, where the model adapted quickly and achieved superior results under limited labeled data. However, the evaluation was limited to two tasks, improvements were relatively minor, and no direct comparisons were provided with other FMs.
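The codebook-based representation can be illustrated with a standard vector-quantization layer that uses a straight-through gradient estimator, as sketched below; this shows the generic VQ-AE mechanism, including the usual codebook and commitment terms, rather than the exact module or loss weights of [110].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, num_codes: int = 512, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (B, N, dim) continuous encoder outputs (e.g., flattened patch features).
        codes = self.codebook.weight                              # (num_codes, dim)
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ codes.t()
                + codes.pow(2).sum(-1))                           # squared distances
        idx = dist.argmin(dim=-1)                                 # nearest code per token
        z_q = self.codebook(idx)                                  # quantized vectors

        # Codebook loss pulls codes toward encoder outputs; the commitment term
        # keeps encoder outputs close to their assigned codes.
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: gradients flow from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx, vq_loss
```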
There are several modality-specific general-purpose FMs. A modality-specific vision FM for cardiac MRI (CMR) [111] was proposed to address the limitations of task-specific deep learning models, such as scarce annotations and poor cross-task generalization. The model adopts a ViT-S/8 backbone with self-distillation (DINO) pretraining on 36 million CMR images spanning multiple sequences (cine, T1, T2, LGE, perfusion, flow imaging) from two clinical centers and the UK Biobank. Fine-tuned on datasets such as ACDC, Kaggle, and EMIDEC, it was evaluated on diverse downstream tasks including cardiac structure segmentation (left ventricle, right ventricle, myocardium, right atrium, left atrium), disease and LGE detection, cine view classification, and landmark localization. Results showed consistent improvements over ResNet-50 baselines and natural-image pretrained ViTs, particularly in few-shot settings. Limitations include the absence of comparisons with other medical FMs, reliance on 2D, rather than 3D, volumetric pretraining, and high computational cost.
CT-FM [107] introduces a large-scale FM employing a 3D encoder architecture pretrained using a self-supervised, contrastive learning framework on 148K unlabeled 3D CT scans. During downstream adaptation, a SegResNet-style decoder is included. The model is evaluated on organ and lesion segmentation, CT triage (classification), image retrieval, and semantic understanding. The strengths of CT-FM lie in its scalability, domain-agnostic pretraining, and strong transferability across heterogeneous CT datasets. However, the model’s reliance on high computational resources and volumetric contrastive pretraining limits accessibility and efficiency. Additionally, its exclusive focus on CT data restricts generalization to other imaging modalities such as MRI.
Triad [104] is a vision FM designed for 3D MRI. It was trained on a large-scale dataset of 131,170 MRIs (breast, brain, and prostate regions) from 19,721 patients across 36 clinical cohorts. The dataset spans multiple imaging modalities, including T1w, T2w, FLAIR, fMRI, diffusion-weighted imaging (DWI), and dynamic contrast-enhanced MRI (DCE-MRI). Triad has been evaluated on organ and tumor segmentation, organ and cancer classification, and medical image registration. Pretrained exclusively on MRI, it outperforms baselines trained from scratch.
MRI-CORE [100], proposed in July 2025, is a self-supervised vision FM trained on 6.9M slices of 116,806 volumetric MRI images, covering 18 body locations. The dataset used in this project, “Duke-110K”, was collected from Duke University. To accelerate convergence and improve representation quality, MRI-CORE was initialized with pretrained SAM weights and then fine-tuned to a limited extent. The model is evaluated on few-shot segmentation and zero-shot classification and segmentation, outperforming SAM and MedSAM. Despite its strengths, MRI-CORE has three main limitations: it uses a ViT-Base model due to computational constraints, is trained on data from only one institution, and does not support multimodal data.
LCTfound [108] is a lung CT FM pretrained on LungCT-28M, a dataset of 105,184 CT scans (28M slices) across 14 lung diseases and normal cases. It employs a Denoising Diffusion Probabilistic Model (DDPM) with a 200M-parameter UNet–Transformer hybrid and cross-attention, learning 2D slice representations in a self-supervised manner. Fine-tuned with task-specific adapters, LCTfound supports eight downstream tasks, including mediastinal segmentation, whole-lung modeling, PAP diagnosis, NSCLC prognostication, therapy response prediction, virtual CTA imaging, sparse-view reconstruction, and low-dose CT enhancement. It outperformed MAE pretraining, MedSAM, InternImage, and RadImageNet, with additional gains in few-shot and external validation. Limitations include reliance on 2D pretraining and very high pretraining cost (864 GPU hours).
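The denoising-diffusion pretraining objective reduces to predicting the noise injected at a randomly sampled timestep. A minimal sketch with a linear noise schedule is given below; the schedule length and β range are generic DDPM defaults, not LCTfound’s actual settings.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)     # cumulative product of (1 - beta_t)

def ddpm_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """One DDPM training step: predict the noise added at a random timestep.

    model: callable (x_t, t) -> predicted noise of the same shape as x_t.
    x0   : (B, C, H, W) clean CT slices scaled to roughly [-1, 1].
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    a_bar = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    # Forward (noising) process: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise.
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(model(x_t, t), noise)
```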
E3D-GPT [109] is a 3D vision–language FM developed for volumetric medical imaging. It leverages self-supervised 3D MAE pretraining on large-scale unlabeled CT volumes to learn spatially rich representations, followed by fine-tuning with paired CT–report data. Through efficient 3D feature aggregation and alignment with a language model, E3D-GPT enables tasks such as report generation, visual question answering (VQA), and disease diagnosis, demonstrating strong spatial understanding and improved performance on volumetric medical reasoning. Despite its strengths, E3D-GPT struggles with hallucination in OOD tasks due to the use of an LLM, and scaling across architectures and modalities is recommended.
FMs for generalist medical AI (GMAI) [115] aim to unify different types of medical data within a single AI system. They combine 3D image analysis with texts and other medical information to perform tasks such as generating radiology reports, answering questions about 3D scans, and guiding procedures. By learning from large collections of unlabeled 3D images and linking them with medical text, this approach forms the basis for models like E3D-GPT, which aim to interpret complex volumetric medical data.
Lingshu [105] is a generalist multimodal FM for medical understanding and reasoning, built on the Qwen2.5-VL (7B/32B) backbones. It addresses challenges in medical AI, including limited annotations, domain mismatch, and modality silos. This is achieved through progressive multistage training consisting of shallow and deep alignment, instruction tuning, and reinforcement learning on 5M open-source and synthetic multimodal samples. Lingshu integrates diverse imaging modalities (X-ray, CT, MRI, ultrasound, fundus, dermoscopy, pathology) with medical text to support tasks such as visual question answering, diagnostic reasoning, and report generation. The model achieves state-of-the-art performance across benchmarks and demonstrates strong few-shot efficiency and generalization. Its limitations include reliance on a single backbone, lack of cross-hospital validation, focus on a subset of clinical tasks, and dependence on large-scale computation.

5. Challenges and Opportunities

FMs in medical imaging hold great promise, with applications ranging from segmentation and classification to reconstruction and outcome prediction. At the same time, their rapid development has exposed a number of challenges that still need to be addressed. Issues such as limited data access and diversity, difficulties in reproducibility, the absence of standardized evaluation protocols, high computational demands, regional biases in datasets, and inconsistent choices of pretraining strategies remain open. Tackling these challenges is crucial if FMs are to move beyond research settings and make a real impact in clinical practice. In the following subsections, we explore these challenges in more detail.

5.1. Dataset Scale and Diversity

The availability of large-scale and heterogeneous datasets is critical for effective training and the generalization of FMs. Despite the need, relatively few datasets have been specifically curated for training FMs in volumetric medical imaging.
In 2024, M3D-Data [114] was introduced as a large-scale 3D multimodal medical dataset containing 120K image–text pairs and 662K instruction–response pairs. Leveraging this resource, the authors further proposed M3D-LaMed, a multimodal large language model for 3D medical image analysis, and M3D-Bench, a benchmark for automatic evaluation. In the same year, two instruction-tuning datasets, based on BIMCV-R and CT-RATE, were curated with 354K 3D CT volumes [109]. GPT-4 was employed to generate corresponding instruction–response pairs for the 3D images.
By processing 4.4M scientific articles, 15M biomedical image–text pairs were constructed to form the PMC-15M dataset in 2025 [84]. This dataset is substantially larger than previously known biomedical datasets, including MIMIC-CXR [121] (377,110 pairs), CheXpert [122] (224,316 chest radiograph interpretations), ROCO [123] (87,952 pairs extracted from PubMed Central Open Access in January 2018), and ARCH [124] (15,164 pathology image representations from PubMed medical articles and pathology textbooks). Built on PMC-15M, BiomedCLIP [84] is a multimodal FM for biomedical vision–language tasks, supporting retrieval, classification, and visual question answering, while incorporating features for improved privacy preservation.
PASTA-Gen [103], introduced in 2025, is a synthetic data framework comprising 30K generated 3D CT scans with pixel-level lesion masks along with structured reports of tumors in ten organs. This dataset was considered as the basis of PASTA FM. XGeM [125], a modular FM for multimodal synthesis, addresses anonymization, class imbalance, and data scarcity, with future potential in scaling to 3D imaging (CT, MRI) and physiological signals (ECG, EEG). To alleviate MRI data scarcity, TriadMR-131K [104] introduced a large-scale dataset of 131K 3D MRI scans from 19K patients across 36 cohorts; the associated FM, Triad, outperformed CT-based models on MRI-related tasks.
The majority of datasets used for both pretraining and fine-tuning FMs—whether public or private—originate from institutions in developed countries. This concentration of resources raises important concerns about potential biases and limits the applicability of these models in developing and underdeveloped regions.
In particular, medical imaging protocols, scanner quality, and acquisition settings often differ substantially across regions. For example, hospitals in resource-limited settings may rely on older imaging devices with lower resolution or differing calibration standards compared to those in high-income countries. Additionally, variations in population health status, prevalence of comorbidities, and genetic or racial factors can influence imaging biomarkers and disease presentation. As a result, models trained exclusively on datasets from developed countries may not generalize effectively to patient populations in other regions. Moreover, as illustrated in Figure 5, most of the reviewed works were conducted in Europe, the USA, and China, underscoring the need for geographic diversification.
This disparity highlights a potential gap in the real-world usability of FMs. Without careful consideration of regional diversity in both imaging and patient characteristics, these models risk reinforcing global health inequities. To address this, future efforts should prioritize the inclusion of diverse datasets that represent underrepresented populations and imaging environments, alongside collaborations with medical institutions in low-resource settings. Such initiatives would not only improve fairness and generalizability but also ensure that FMs deliver meaningful clinical value across different healthcare contexts.
Together, these efforts underscore both the progress and persistent challenges in constructing large, diverse, and clinically relevant datasets, which remain the cornerstone for advancing FMs in biomedical and volumetric medical imaging. While some of the reviewed models were built on very large datasets, others, such as TotalSegmentator [50] and MoME [45], were trained on fewer than 10K volumes, which may limit their generalizability to broader populations. A key reason for this limitation is that many large-scale medical datasets are private, constrained by patient confidentiality and regulatory restrictions that prevent open sharing. A promising solution to this challenge is federated learning, which allows multiple institutions to collaboratively develop models without exchanging raw patient data.

5.2. Reproducibility

Reproducibility in medical FMs is closely linked to the accessibility of pretrained weights, implementation code, and datasets. While the majority of reviewed works make their code publicly available, and a few, such as TotalSegmentator [50] and VISTA3D [43], go further by offering interactive online demos, a substantial proportion of models remain difficult to reproduce. The heavy reliance on private datasets significantly hinders independent verification and restricts opportunities for fair benchmarking.
This dependence on non-public data introduces several challenges. First, it prevents researchers from replicating reported results under identical conditions, as the original training data are inaccessible. Second, it limits comparative studies, as model performance often becomes tied to institution-specific datasets rather than standardized benchmarks. Third, private datasets are frequently homogeneous, collected under specific imaging protocols or from particular populations, raising concerns about whether these models generalize to broader and more diverse patient groups. Taken together, these issues underscore the need for initiatives to develop large-scale public datasets, which would facilitate more reproducible and generalizable FMs in medical imaging.

5.3. Evaluation and Benchmarking

A critical observation from the reviewed literature is that many FMs rely on self-designed benchmarking strategies, which undermines the ability to make fair and consistent comparisons across studies (Table 6). In several cases, evaluations were carried out on a narrow set of datasets [110,112], limiting the generalizability of findings beyond the specific cohorts used. Some models were not evaluated on any external datasets [75,91], restricting assessment of their robustness to new imaging distributions, scanners, or patient populations.
Another limitation is the lack of direct comparison with existing FMs. For instance, BME-X [92] and TotalSegmentator [50] reported strong within-study results but did not benchmark against other state-of-the-art models. This omission makes it difficult to determine whether performance gains stem from true methodological advances or from differences in dataset composition and evaluation strategy. Without head-to-head benchmarking, claims of superiority remain context-dependent and risk overstating the efficacy of new methods.
Equally important is the consideration of few-shot and low-annotation scenarios. One of the most significant promises of FMs lies in their ability to achieve high performance with very few labeled samples. Despite this, many reviewed works did not examine their efficacy under few-shot or low-label conditions, overlooking a crucial aspect of FM utility. Given that medical datasets are often expensive and labor-intensive to annotate, failing to evaluate models in low-resource scenarios limits their practical relevance.
These observations underscore the urgent need for standardized evaluation protocols in medical FMs. Such benchmarks should include (i) validation on diverse external datasets, (ii) direct comparisons with existing FMs, and (iii) systematic assessment in few-shot or low-annotation regimes.

5.4. Extensive Resource Requirements

The development of medical FMs is often constrained by the extensive computational resources required for training. Many of the reviewed works report training with multi-GPU clusters, such as NVIDIA A100 or V100 systems, often involving 8–32 GPUs with hundreds of gigabytes of memory. Batch sizes frequently exceed 512 samples, and training is performed over millions of images or thousands of 3D volumes, demanding weeks of computation on high-performance infrastructure. To provide a clearer perspective on these computational demands, we summarize and compare the reported hardware configurations, GPU counts, memory requirements, batch sizes, and input resolutions of volumetric foundation models in Table 6.
While such large-scale training enables these models to achieve impressive results, it also creates a barrier for smaller research groups and institutions in low-resource settings. The high financial cost of GPUs, storage, and energy consumption makes it difficult to reproduce or extend these studies without having access to specialized funding or industrial collaborations. Moreover, this dependency on extensive hardware resources raises concerns about sustainability, as prolonged training runs consume substantial computational power and energy.
To mitigate these challenges, future work should investigate more resource-efficient strategies, such as low-cost model distillation strategies, parameter-efficient fine-tuning, and lightweight architectures tailored to clinical deployment. In addition, federated and collaborative training frameworks could help distribute the computational burden across institutions.

5.5. The Curse of Supervision

The reviewed FMs employ a diverse set of pretraining strategies, ranging from supervised learning to self-supervised and hybrid approaches. However, a substantial proportion of these works continue to rely heavily on supervised learning, which presents scalability challenges due to the costly and labor-intensive process of manual data annotation. This dependence limits the feasibility of extending such models to broader applications where annotated datasets are scarce. Moreover, most studies did not conduct systematic comparisons between different pretraining paradigms—such as supervised, self-supervised, and semi-supervised methods—making it difficult to evaluate whether the chosen approach was, indeed, optimal for the intended task. Although many works report performance improvements, the rationale behind selecting a particular pretraining strategy is often underexplained, leaving open questions regarding the generalizability and transferability of the reported gains.

6. Conclusions

This review highlights the importance of developing FMs for volumetric (3D) medical imaging, an area that has received comparatively less attention. It provides a comprehensive study and comparison of key design considerations for these models, elaborates on the unique challenges and opportunities in this field, and concludes with recommendations for addressing existing gaps. Moving forward, advancing robust, generalizable, and equitable FMs in 3D medical imaging has the potential to significantly enhance clinical decision making, facilitate cross-institutional collaboration, and accelerate the development of AI-driven medical technologies. By overcoming current challenges in data availability, reproducibility, and benchmarking, FMs can transform 3D medical imaging, enabling more accurate, equitable, and widely deployable AI-driven clinical tools.

Author Contributions

All authors contributed to the study conception and to reviewing the literature included in the manuscript. T.G. and F.S. drafted the manuscript, and all authors reviewed and provided feedback on subsequent revisions. T.G. coordinated the study. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript. Tapotosh Ghosh has received the Alberta Innovates Graduate Scholarship and the T. Chen Fong Doctoral Excellence Award.

Data Availability Statement

Not applicable.

Conflicts of Interest

Farhad Maleki and Eduardo Moreno Judice de Mattos Farina are on the SIIM AI Education Subcommittee. Yashbir Singh, Khaled Younis, Shiba Kuanar, Yankai Huo, and Shahriar Faghani are on the SIIM Tools/Research Subcommittee. Shahriar Faghani is an Associate Editor of Radiology: Artificial Intelligence. Eduardo Moreno Judice de Mattos Farina is a speaker for sharing progress in cancer care, Merck Sharp & Dohme, and a consultant for md.ai. Author Khaled Younis was employed by the company MedAiConsult LLC, Cleveland, Ohio, USA. Tapotosh Ghosh, Farnaz Sheikhi, and Junlin Guo do not have any competing interests. This paper is a result of a collaboration between the SIIM AI Education and the SIIM Tools/Research Subcommittees.

References

  1. He, Y.; Huang, F.; Jiang, X.; Nie, Y.; Wang, M.; Wang, J.; Chen, H. Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions. IEEE Rev. Biomed. Eng. 2025, 18, 172–191. [Google Scholar] [CrossRef]
  2. Hayat, M.; Dhaliwal, A.; Din, M.; Izhar, R.; Nadeem, M.; Ahmad, N. Cross-Attention Patch Fusion for Few-Shot Colorectal Tissue Generation. In Proceedings of the 2025 5th International Conference on Digital Futures and Transformative Technologies (ICoDT2), Islamabad, Pakistan, 17–18 December 2025; pp. 1–6. [Google Scholar] [CrossRef]
  3. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar] [PubMed]
  4. Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Shen, Z.; Jiao, R. Segment anything model for medical image segmentation: Current applications and future directions. Comput. Biol. Med. 2024, 171, 108238. [Google Scholar] [CrossRef] [PubMed]
  6. Yao, W.; Bai, J.; Liao, W.; Chen, Y.; Liu, M.; Xie, Y. From CNN to Transformer: A Review of Medical Image Segmentation Models. J. Imaging Inform. Med. 2024, 37, 1529–1547. [Google Scholar] [CrossRef] [PubMed]
  7. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  8. Jin, R.; Xu, Z.; Zhong, Y.; Yao, Q.; QI, D.; Zhou, S.K.; Li, X. Fairmedfm: Fairness benchmarking for medical imaging foundation models. Adv. Neural Inf. Process. Syst. 2024, 37, 111318–111357. [Google Scholar]
  9. Huang, S.C.; Jensen, M.; Yeung-Levy, S.; Lungren, M.P.; Poon, H.; Chaudhari, A.S. Multimodal Foundation Models for Medical Imaging-A Systematic Review and Implementation Guidelines. medRxiv 2024. [Google Scholar] [CrossRef]
  10. Azad, B.; Azad, R.; Eskandari, S.; Bozorgpour, A.; Kazerouni, A.; Rekik, I.; Merhof, D. Foundational models in medical imaging: A comprehensive survey and future vision. arXiv 2023, arXiv:2310.18689. [Google Scholar] [CrossRef]
  11. Lee, H.H.; Gu, Y.; Zhao, T.; Xu, Y.; Yang, J.; Usuyama, N.; Wong, C.; Wei, M.; Landman, B.A.; Huo, Y.; et al. Foundation models for biomedical image segmentation: A survey. arXiv 2024, arXiv:2401.07654. [Google Scholar] [CrossRef]
  12. Khan, W.; Leem, S.; See, K.B.; Wong, J.K.; Zhang, S.; Fang, R. A comprehensive survey of foundation models in medicine. IEEE Rev. Biomed. Eng. 2025, 19, 283–304. [Google Scholar] [CrossRef]
  13. van Veldhuizen, V.; Botha, V.; Lu, C.; Cesur, M.E.; Lipman, K.G.; de Jong, E.D.; Horlings, H.; Sanchez, C.I.; Snoek, C.G.M.; Wessels, L.; et al. Foundation Models in Medical Imaging—A Review and Outlook. arXiv 2025, arXiv:2506.09095. [Google Scholar]
  14. Noh, S.; Lee, B.D. A narrative review of foundation models for medical image segmentation: Zero-shot performance evaluation on diverse modalities. Quant. Imaging Med. Surg. 2025, 15, 5825–5858. [Google Scholar] [CrossRef]
  15. Zhang, S.; Metaxas, D. On the challenges and perspectives of foundation models for medical image analysis. Med. Image Anal. 2024, 91, 102996. [Google Scholar] [CrossRef]
  16. Rashed, E.A.; Bekhit, M. Foundation Models in Medical Image Analysis: Overview and Prospects. In Proceedings of the 2024 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), Sydney, NSW, Australia, 20–23 November 2024; pp. 6–9. [Google Scholar] [CrossRef]
  17. Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
  18. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  19. Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense Contrastive Learning for Self-Supervised Visual Pre-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3024–3033. [Google Scholar]
  20. Wen, X.; Zhao, B.; Zheng, A.; Zhang, X.; Qi, X. Self-supervised visual representation learning with semantic grouping. Adv. Neural Inf. Process. Syst. 2022, 35, 16423–16438. [Google Scholar]
  21. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  22. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660. [Google Scholar]
  23. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
  24. Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. SimMIM: A Simple Framework for Masked Image Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  25. Chen, Z.; Agarwal, D.; Aggarwal, K.; Safta, W.; Balan, M.M.; Brown, K. Masked Image Modeling Advances 3D Medical Image Analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 1970–1980. [Google Scholar]
  26. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv 2021, arXiv:2111.06377. [Google Scholar] [CrossRef]
  27. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  28. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  29. Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D.C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  30. Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. MedCLIP: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, 7–11 December 2022; Volume 2022, p. 3876. [Google Scholar]
  31. Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. Clip-adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. 2024, 132, 581–595. [Google Scholar] [CrossRef]
  32. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3. [Google Scholar]
  33. Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 709–727. [Google Scholar]
  34. Ma, Q.; Sun, G.; Tombak, G.I.; Jain, S.; Huber, N.B.; Gool, L.V.; Konukoglu, E. Video Foundation Model for Medical 3D Segmentation. In Proceedings of the Supervised and Semi-Supervised Multi-Structure Segmentation and Landmark Detection in Dental Data; Wang, Y., Qian, D., Wang, S., Ben-Hamadou, A., Pujades, S., Lumetti, L., Grana, C., Bolelli, F., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 72–88. [Google Scholar]
  35. Qayyum, A.; Mazher, M.; Ugurlu, D.; Lemus, J.A.S.; Rodero, C.; Niederer, S.A. Foundation Model for Whole-Heart Segmentation: Leveraging Student-Teacher Learning in Multi-Modal Medical Imaging. arXiv 2025, arXiv:2503.19005. [Google Scholar]
  36. Wittmann, B.; Wattenberg, Y.; Amiranashvili, T.; Shit, S.; Menze, B. vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation. arXiv 2025, arXiv:2411.17386. [Google Scholar]
  37. Tölle, M.; Garthe, P.; Scherer, C.; Seliger, J.M.; Leha, A.; Krüger, N.; Simm, S.; Martin, S.; Eble, S.; Kelm, H.; et al. Real world federated learning with a knowledge distilled transformer for cardiac CT imaging. Npj Digit. Med. 2025, 8, 88. [Google Scholar] [CrossRef]
  38. Du, Y.; Bai, F.; Huang, T.; Zhao, B. Segvol: Universal and interactive volumetric medical image segmentation. Adv. Neural Inf. Process. Syst. 2024, 37, 110746–110783. [Google Scholar]
  39. Luo, Z.; Gao, Z.; Liao, W.; Zhang, S.; Wang, G.; Luo, X. Dynamic Gradient Sparsification Training for Few-Shot Fine-tuning of CT Lymph Node Segmentation Foundation Model. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2025; pp. 165–174. [Google Scholar]
  40. Otaghsara, S.S.T.; Rahmanzadeh, R. F3-Net: Foundation Model for Full Abnormality Segmentation of Medical Images with Flexible Input Modality Requirement. arXiv 2025, arXiv:2507.08460. [Google Scholar] [CrossRef]
  41. Akinci D’Antonoli, T.; Berger, L.K.; Indrakanti, A.K.; Vishwanathan, N.; Weiss, J.; Jung, M.; Berkarda, Z.; Rau, A.; Reisert, M.; Küstner, T.; et al. Totalsegmentator MRI: Robust sequence-independent segmentation of multiple anatomic structures in MRI. Radiology 2025, 314, e241613. [Google Scholar] [CrossRef] [PubMed]
  42. Li, Y.; Hu, M.; Qiu, R.L.; Thor, M.; Williams, A.; Marshall, D.; Yang, X. RoMedFormer: A Rotary-Embedding Transformer Foundation Model for 3D Genito-Pelvic Structure Segmentation in MRI and CT. arXiv 2025, arXiv:2503.14304. [Google Scholar]
  43. He, Y.; Guo, P.; Tang, Y.; Myronenko, A.; Nath, V.; Xu, Z.; Yang, D.; Zhao, C.; Simon, B.; Belue, M.; et al. VISTA3D: A unified segmentation foundation model for 3D medical imaging. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 20863–20873. [Google Scholar]
  44. Cox, J.; Liu, P.; Stolte, S.E.; Yang, Y.; Liu, K.; See, K.B.; Ju, H.; Fang, R. BrainSegFounder: Towards 3D foundation models for neuroimage segmentation. Med. Image Anal. 2024, 97, 103301. [Google Scholar] [CrossRef]
  45. Zhang, X.; Ou, N.; Basaran, B.D.; Visentin, M.; Qiao, M.; Gu, R.; Ouyang, C.; Liu, Y.; Matthews, P.M.; Ye, C.; et al. A foundation model for brain lesion segmentation with mixture of modality experts. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2024; pp. 379–389. [Google Scholar]
  46. Jiang, Y.; Shen, Y. M4oe: A foundation model for medical multimodal image segmentation with mixture of experts. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2024; pp. 621–631. [Google Scholar]
  47. Yan, Z.; Han, T.; Huang, Y.; Liu, L.; Zhou, H.; Chen, J.; Shi, W.; Cao, Y.; Yang, X.; Ni, D. A Foundation Model for General Moving Object Segmentation in Medical Images. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–5. [Google Scholar] [CrossRef]
  48. Huang, Z.; Wang, H.; Deng, Z.; Ye, J.; Su, Y.; Sun, H.; He, J.; Gu, Y.; Gu, L.; Zhang, S.; et al. STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training. arXiv 2023, arXiv:2304.06716. [Google Scholar]
  49. Wang, G.; Wu, J.; Luo, X.; Liu, X.; Li, K.; Zhang, S. MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset. arXiv 2023, arXiv:2306.16925. [Google Scholar]
  50. Wasserthal, J.; Breit, H.C.; Meyer, M.T.; Pradella, M.; Hinck, D.; Sauter, A.W.; Heye, T.; Boll, D.T.; Cyriac, J.; Yang, S.; et al. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiol. Artif. Intell. 2023, 5, e230024. [Google Scholar] [CrossRef]
  51. Landman, B.; Xu, Z.; Igelsias, J.E.; Styner, M.; Langerak, T.; Klein, A. Segmentation Outside the Cranial Vault Challenge. In Proceedings of the MICCAI: Multi Atlas Labeling Beyond Cranial Vault-Workshop Challenge, Munich, Germany, 5–9 October 2015. [Google Scholar]
  52. Myronenko, A. 3D MRI Brain Tumor Segmentation Using Autoencoder Regularization. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Revised Selected Papers, Part II; Springer: Berlin/Heidelberg, Germany, 2018; pp. 311–320. [Google Scholar] [CrossRef]
  53. Mu, S.; Lin, S. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications. arXiv 2025, arXiv:2503.07137. [Google Scholar]
  54. Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A Survey on Mixture of Experts in Large Language Models. IEEE Trans. Knowl. Data Eng. 2025, 37, 3896–3915. [Google Scholar] [CrossRef]
  55. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]
  56. Quinton, F.; Popoff, R.; Presles, B.; Leclerc, S.; Meriaudeau, F.; Nodari, G.; Lopez, O.; Pellegrinelli, J.; Chevallier, O.; Ginhac, D.; et al. A Tumour and Liver Automatic Segmentation (ATLAS) Dataset on Contrast-Enhanced Magnetic Resonance Imaging for Hepatocellular Carcinoma. Data 2023, 8, 79. [Google Scholar] [CrossRef]
  57. Ji, Y.; Bai, H.; Yang, J.; Ge, C.; Zhu, Y.; Zhang, R.; Li, Z.; Zhang, L.; Ma, W.; Wan, X.; et al. AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation. arXiv 2022, arXiv:2206.08023. [Google Scholar]
  58. Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef]
  59. Ulrich, C.; Isensee, F.; Wald, T.; Zenk, M.; Baumgartner, M.; Maier-Hein, K.H. MultiTalent: A Multi-dataset Approach to Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2023; Springer: Cham, Switzerland, 2023; pp. 648–658. [Google Scholar]
  60. Gao, Y.; Li, Z.; Liu, D.; Zhou, M.; Zhang, S.; Metaxas, D.N. Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation. arXiv 2024, arXiv:2306.02416. [Google Scholar] [CrossRef]
  61. Wang, H.; Guo, S.; Ye, J.; Deng, Z.; Cheng, J.; Li, T.; Chen, J.; Su, Y.; Huang, Z.; Shen, Y.; et al. SAM-Med3D: Towards General-purpose Segmentation Models for Volumetric Medical Images. arXiv 2024, arXiv:2310.15161. [Google Scholar]
  62. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
  63. Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation. arXiv 2024, arXiv:2212.04497. [Google Scholar] [CrossRef]
  64. Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.; Xu, D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. arXiv 2022, arXiv:2201.01266. [Google Scholar] [CrossRef]
  65. Littlejohns, T.J.; Holliday, J.; Gibson, L.M.; Garratt, S.; Oesingmann, N.; Alfaro-Almagro, F.; Bell, J.D.; Boultwood, C.; Collins, R.; Conroy, M.C.; et al. The UK Biobank imaging enhancement of 100,000 participants: Rationale, data collection, management and future directions. Nat. Commun. 2020, 11, 2624. [Google Scholar] [CrossRef]
  66. Cheng, H.K.; Schwing, A.G. XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model. arXiv 2022, arXiv:2207.07115. [Google Scholar]
  67. Bolelli, F.; Marchesini, K.; Van Nistelrooij, N.; Lumetti, L.; Pipoli, V.; Ficarra, E.; Vinayahalingam, S.; Grana, C. Segmenting Maxillofacial Structures in CBCT Volumes. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 5238–5248. [Google Scholar] [CrossRef]
  68. Yi, H.; Qin, Z.; Lao, Q.; Xu, W.; Jiang, Z.; Wang, D.; Zhang, S.; Li, K. Towards General Purpose Medical AI: Continual Learning Medical Foundation Model. arXiv 2023, arXiv:2303.06580. [Google Scholar] [CrossRef]
  69. Zhu, W.; Huang, H.; Tang, H.; Musthyala, R.; Yu, B.; Chen, L.; Vega, E.; O’Donnell, T.; Dehkharghani, S.; Frontera, J.A.; et al. 3D foundation AI model for generalizable disease detection in head computed tomography. arXiv 2025, arXiv:2502.02779. [Google Scholar] [CrossRef]
  70. Jung, D.; Jang, J.; Jang, S.; Park, Y.R. MEDFORM: A Foundation Model for Contrastive Learning of CT Imaging and Clinical Numeric Data in Multi-Cancer Analysis. arXiv 2025, arXiv:2501.13277. [Google Scholar] [CrossRef]
  71. Gao, R.; Peng, A.; Duan, Y.; Chen, M.; Zheng, T.; Zhang, M.; Chen, L.; Sun, H. Associations of Postencephalitic Epilepsy Using Multi-Contrast Whole Brain MRI: A Large Self-Supervised Vision Foundation Model Strategy. J. Magn. Reson. Imaging 2025, 62, 494–505. [Google Scholar] [CrossRef] [PubMed]
  72. Yang, J.; Cai, D.; Liu, J.; Zhuang, Z.; Zhao, Y.; Wang, F.a.; Li, C.; Hu, C.; Gai, B.; Chen, Y.; et al. CRCFound: A Colorectal Cancer CT Image Foundation Model Based on Self-Supervised Learning. Adv. Sci. 2025, 12, e07339. [Google Scholar] [CrossRef]
  73. Gong, Y.; Zhang, X.; Xia, Y.F.; Cheng, Y.; Bao, J.; Zhang, N.; Zhi, R.; Sun, X.Y.; Wu, C.J.; Wu, F.Y.; et al. A foundation model with weak experiential guidance in detecting muscle invasive bladder cancer on MRI. Cancer Lett. 2025, 611, 217438. [Google Scholar] [CrossRef]
  74. Yoo, Y.; Georgescu, B.; Zhang, Y.; Grbic, S.; Liu, H.; Aldea, G.D.; Re, T.J.; Das, J.; Ullaskrishnan, P.; Eibenberger, E.; et al. A Non-contrast Head CT Foundation Model for Comprehensive Neuro-Trauma Triage. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2025; pp. 3–13. [Google Scholar]
  75. Chen, M.; Zhang, M.; Yin, L.; Ma, L.; Ding, R.; Zheng, T.; Yue, Q.; Lui, S.; Sun, H. Medical image foundation models in assisting diagnosis of brain tumors: A pilot study. Eur. Radiol. 2024, 34, 6667–6679. [Google Scholar] [CrossRef] [PubMed]
  76. Tak, D.; Garomsa, B.A.; Chaunzwa, T.L.; Zapaishchykova, A.; Climent Pardo, J.C.; Ye, Z.; Zielke, J.; Ravipati, Y.; Vajapeyam, S.; Mahootiha, M.; et al. A foundation model for generalized brain MRI analysis. medRxiv 2024. [Google Scholar] [CrossRef]
  77. Pai, S.; Bontempi, D.; Hadzic, I.; Prudente, V.; Sokač, M.; Chaunzwa, T.L.; Bernatz, S.; Hosny, A.; Mak, R.H.; Birkbak, N.J.; et al. Foundation model for cancer imaging biomarkers. Nat. Mach. Intell. 2024, 6, 354–367. [Google Scholar] [CrossRef] [PubMed]
  78. Suo, X.; Chen, M.; Chen, L.; Luo, C.; Kemp, G.J.; Lui, S.; Sun, H. Automatic identification of Parkinsonism using clinical multi-contrast brain MRI: A large self-supervised vision foundation model strategy. eBioMedicine 2025, 116, 105773. [Google Scholar] [CrossRef] [PubMed]
  79. Zhou, Y.; Quek, C.W.N.; Zhou, J.; Wang, Y.; Bai, Y.; Ke, Y.; Yao, J.; Gutierrez, L.; Teo, Z.L.; Ting, D.S.J.; et al. Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM). arXiv 2025, arXiv:2507.00185. [Google Scholar] [CrossRef]
  80. Beeche, C.; Kim, J.; Tavolinejad, H.; Zhao, B.; Sharma, R.; Duda, J.; Gee, J.; Dako, F.; Verma, A.; Morse, C.; et al. A Pan-Organ Vision-Language Model for Generalizable 3D CT Representations. medRxiv 2025. [Google Scholar] [CrossRef]
  81. Hu, Y.; Zheng, Y.; Miao, S.; Zhang, X.; Xia, J.; Qi, Y.; Zhang, Y.; He, Y.; Chen, Q.; Ye, J.; et al. Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images. arXiv 2025, arXiv:2507.22024. [Google Scholar]
  82. Blankemeier, L.; Cohen, J.P.; Kumar, A.; Van Veen, D.; Gardezi, S.J.S.; Paschali, M.; Chen, Z.; Delbrouck, J.B.; Reis, E.; Truyts, C.; et al. Merlin: A vision language foundation model for 3D computed tomography. Res. Sq. 2024, rs.3.rs-4546309. [Google Scholar] [CrossRef]
  83. Yan, K.; Wang, X.; Lu, L.; Summers, R.M. DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 2018, 5, 036501. [Google Scholar] [CrossRef]
  84. Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv 2025, arXiv:2303.00915. [Google Scholar]
  85. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  86. Pan, S.; Hu, M.; Safari, M.; Shah, K.; Zhao, F.; Wang, T.; Qiu, R.; Yang, X. FoundationMorph: A 3D vision-language foundation model for unsupervised medical image registration. In Proceedings of the Medical Imaging 2025: Imaging Informatics; SPIE: San Francisco, CA, USA, 2025; Volume 13411, pp. 301–309. [Google Scholar]
  87. Li, Z.; Zhang, J.; Ma, T.; Mok, T.C.; Zhou, Y.J.; Chen, Z.; Ye, X.; Lu, L.; Jin, D. UniReg: Foundation Model for Controllable Medical Image Registration. arXiv 2025, arXiv:2503.12868. [Google Scholar] [CrossRef]
  88. Pham, X.L.; Vuurberg, G.; Doppen, M.; Roosen, J.; Stille, T.; Ha, T.Q.; Quach, T.D.; Dang, Q.V.; Luu, M.H.; Smit, E.J.; et al. TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration. arXiv 2025, arXiv:2508.04450. [Google Scholar] [CrossRef]
  89. Demir, B.; Tian, L.; Greer, H.; Kwitt, R.; Vialard, F.X.; Estépar, R.S.J.; Bouix, S.; Rushmore, R.; Ebrahim, E.; Niethammer, M. Multigradicon: A foundation model for multimodal medical image registration. In Proceedings of the International Workshop on Biomedical Image Registration; Springer: Berlin/Heidelberg, Germany, 2024; pp. 3–18. [Google Scholar]
  90. Tian, L.; Greer, H.; Kwitt, R.; Vialard, F.X.; San José Estépar, R.; Bouix, S.; Rushmore, R.; Niethammer, M. unigradicon: A foundation model for medical image registration. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2024; pp. 749–760. [Google Scholar]
  91. Zhang, C.; Loecher, M.; Alkan, C.; Yurt, M.; Vasanawala, S.S.; Ennis, D.B. On the Foundation Model for Cardiac MRI Reconstruction. In Proceedings of the Statistical Atlases and Computational Models of the Heart. STACOM 2024. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15448, pp. 226–235. [Google Scholar] [CrossRef]
  92. Sun, Y.; Wang, L.; Li, G.; Lin, W.; Wang, L. A foundation model for enhancing magnetic resonance images and downstream segmentation, registration and diagnostic tasks. Nat. Biomed. Eng. 2025, 9, 521–538. [Google Scholar] [CrossRef]
  93. Qin, Z.; He, Z.; Zhang, Y.; Shen, Y.; Li, K. GraphMSR: A graph foundation model-based approach for MRI image super-resolution with multimodal semantic integration. Pattern Recognit. 2025, 171, 112178. [Google Scholar] [CrossRef]
  94. Xun, S.; Sun, Y.; Chen, J.; Yu, Z.; Tong, T.; Liu, X.; Wu, M.; Tan, T. MedIQA: A Scalable Foundation Model for Prompt-Driven Medical Image Quality Assessment. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2025; pp. 339–349. [Google Scholar]
  95. Wang, L.; Li, G.; Shi, F.; Cao, X.; Lian, C.; Nie, D.; Liu, M.; Zhang, H.; Li, G.; Wu, Z.; et al. Volume-based analysis of 6-month-old infant brain MRI for autism biomarker identification and early diagnosis. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Cham, Switzerland, 2018; Volume 11072, pp. 411–419. [Google Scholar] [CrossRef]
  96. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar] [CrossRef]
  97. Wang, G.; Shi, H.; Chen, Y.; Wu, B. Unsupervised image-to-image translation via long-short cycle-consistent adversarial networks. Appl. Intell. 2022, 53, 17243–17259. [Google Scholar] [CrossRef]
  98. Niu, C.; Lyu, Q.; Carothers, C.D.; Kaviani, P.; Tan, J.; Yan, P.; Kalra, M.K.; Whitlow, C.T.; Wang, G. Medical multimodal multitask foundation model for lung cancer screening. Nat. Commun. 2025, 16, 1523. [Google Scholar] [CrossRef]
  99. Zedda, L.; Loddo, A.; Di Ruberto, C. Radio DINO: A foundation model for advanced radiomics and AI-driven medical imaging analysis. Comput. Biol. Med. 2025, 195, 110583. [Google Scholar] [CrossRef] [PubMed]
  100. Dong, H.; Chen, Y.; Gu, H.; Konz, N.; Chen, Y.; Li, Q.; Mazurowski, M.A. MRI-CORE: A Foundation Model for Magnetic Resonance Imaging. arXiv 2025, arXiv:2506.12186. [Google Scholar] [CrossRef]
  101. Fu, Y.; Bai, W.; Yi, W.; Manisty, C.; Bhuva, A.N.; Treibel, T.A.; Moon, J.C.; Clarkson, M.J.; Davies, R.H.; Hu, Y. A versatile foundation model for cine cardiac magnetic resonance image analysis tasks. arXiv 2025, arXiv:2506.00679. [Google Scholar]
  102. Oh, Y.; Seifert, R.; Cao, Y.; Clement, C.; Ferdinandus, J.; Song, S.; Meng, R.; Zeng, F.; Guo, N.; Li, X.; et al. Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging. J. Nucl. Med. 2025, 66, 251598. [Google Scholar]
  103. Lei, W.; Chen, H.; Zhang, Z.; Luo, L.; Xiao, Q.; Gu, Y.; Gao, P.; Jiang, Y.; Wang, C.; Wu, G.; et al. A Data-Efficient Pan-Tumor Foundation Model for Oncology CT Interpretation. arXiv 2025, arXiv:2502.06171. [Google Scholar]
  104. Wang, S.; Safari, M.; Li, Q.; Chang, C.W.; Qiu, R.L.; Roper, J.; Yu, D.S.; Yang, X. Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging. arXiv 2025, arXiv:2502.14064. [Google Scholar] [CrossRef]
  105. Team, L.; Xu, W.; Chan, H.P.; Li, L.; Aljunied, M.; Yuan, R.; Wang, J.; Xiao, C.; Chen, G.; Liu, C.; et al. Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv 2025, arXiv:2506.07044. [Google Scholar] [CrossRef]
  106. Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. arXiv 2023, arXiv:2308.02463. [Google Scholar]
  107. Pai, S.; Hadzic, I.; Bontempi, D.; Bressem, K.; Kann, B.H.; Fedorov, A.; Mak, R.H.; Aerts, H.J.W.L. Vision Foundation Models for Computed Tomography. arXiv 2025, arXiv:2501.09001. [Google Scholar] [CrossRef]
  108. Gao, Z.; Zhang, G.; Liang, H.; Liu, J.; Ma, L.; Wang, T.; Guo, Y.; Chen, Y.; Yan, Z.; Chen, X.; et al. A Lung CT Foundation Model Facilitating Disease Diagnosis and Medical Imaging. medRxiv 2025. [Google Scholar] [CrossRef]
  109. Lai, H.; Jiang, Z.; Yao, Q.; Wang, R.; He, Z.; Tao, X.; Wei, W.; Lv, W.; Zhou, S.K. E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model. arXiv 2024, arXiv:2410.14200. [Google Scholar]
  110. Tariq, A.; Patel, B.N.; Banerjee, I. Design, training, and applications of foundation model for chest computed tomography volumes. In Proceedings of the Medical Imaging 2024: Image Processing; SPIE: San Francisco, CA, USA, 2024; Volume 12926, pp. 252–256. [Google Scholar]
  111. Jacob, A.J.; Borgohain, I.; Chitiboi, T.; Sharma, P.; Comaniciu, D.; Rueckert, D. Towards a vision foundation model for comprehensive assessment of Cardiac MRI. arXiv 2024, arXiv:2410.01665. [Google Scholar] [CrossRef]
  112. Hamamci, I.E.; Er, S.; Wang, C.; Almas, F.; Simsek, A.G.; Esirgun, S.N.; Doga, I.; Durugol, O.F.; Dai, W.; Xu, M.; et al. Developing generalist foundation models from a multimodal dataset for 3D computed tomography. arXiv 2024, arXiv:2403.17834. [Google Scholar]
  113. Liu, Z.; Tieu, A.; Patel, N.; Soultanidis, G.; Deyer, L.; Wang, Y.; Huver, S.; Zhou, A.; Mei, Y.; Fayad, Z.A.; et al. VIS-MAE: An Efficient Self-supervised Learning Approach on Medical Image Segmentation and Classification. In Proceedings of the Machine Learning in Medical Imaging; Xu, X., Cui, Z., Rekik, I., Ouyang, X., Sun, K., Eds.; Springer: Cham, Switzerland, 2025; pp. 95–107. [Google Scholar]
  114. Bai, F.; Du, Y.; Huang, T.; Meng, M.Q.H.; Zhao, B. M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. arXiv 2024, arXiv:2404.00578. [Google Scholar]
  115. Moor, M.; Banerjee, O.; Abad, Z.S.H.; Krumholz, H.M.; Leskovec, J.; Topol, E.J.; Rajpurkar, P. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265. [Google Scholar] [CrossRef] [PubMed]
  116. Kuo, T.L.; Liao, F.T.; Hsieh, M.W.; Chang, F.C.; Hsu, P.C.; Shiu, D.S. RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues. arXiv 2025, arXiv:2409.12558. [Google Scholar]
  117. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
  118. Mei, X.; Liu, Z.; Robson, P.M.; Marinelli, B.; Huang, M.; Doshi, A.; Jacobi, A.; Cao, C.; Link, K.E.; Yang, T.; et al. RadImageNet: An open radiologic deep learning research dataset for effective transfer learning. Radiol. Artif. Intell. 2022, 4, e210315. [Google Scholar] [CrossRef] [PubMed]
  119. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar] [CrossRef]
  120. Gatidis, S.; Hebb, T.; Frueh, M.; La Fougère, C.; Nikolaou, K.; Pfannenberg, C.; Schölkopf, B.; Kuestner, T.; Cyran, C.; Rubin, D. A whole-body FDG-PET/CT Dataset with manually annotated Tumor Lesions. Sci. Data 2022, 9, 601. [Google Scholar] [CrossRef] [PubMed]
  121. Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; ying Deng, C.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef] [PubMed]
  122. Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. arXiv 2019, arXiv:1901.07031. [Google Scholar] [CrossRef]
  123. Pelka, O.; Koitka, S.; Rückert, J.; Nensa, F.; Friedrich, C. Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In Proceedings of the CVII-STENT/LABELS@MICCAI, Granada, Spain, 16 September 2018. [Google Scholar]
  124. Gamper, J.; Rajpoot, N. Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles. arXiv 2021, arXiv:2103.05121. [Google Scholar] [CrossRef]
  125. Molino, D.; Feola, F.D.; Faiella, E.; Fazzini, D.; Santucci, D.; Shen, L.; Guarrasi, V.; Soda, P. XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation. arXiv 2025, arXiv:2501.04614. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Paper selection process. A total of 376 articles were initially retrieved from different repositories. After removing duplicates and applying strict inclusion criteria, 60 articles were selected for review.
Figure 2. The pipeline for FM development in volumetric medical imaging. To develop an FM, a large unlabeled dataset is first collected. Based on the target task, an architecture and a pretraining strategy are chosen. After pretraining, task-specific fine-tuning is conducted. Finally, the FM is evaluated against various benchmarks to ensure robustness, fairness, and efficiency, and it should also be evaluated in few-shot settings.
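Figure 2 summarizes the generic pretrain-then-fine-tune workflow. The following minimal PyTorch sketch illustrates the idea for a volumetric model: a small 3D encoder is first pretrained with a masked-reconstruction objective on unlabeled volumes and then reused, with a lightweight task head, for supervised fine-tuning. All module names, shapes, and hyperparameters are illustrative assumptions and do not correspond to any specific model reviewed here.

import torch
import torch.nn as nn

class Tiny3DEncoder(nn.Module):
    """Toy 3D CNN encoder standing in for a foundation-model backbone."""
    def __init__(self, channels=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, channels, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv3d(channels, channels * 2, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )
    def forward(self, x):
        return self.net(x)

def pretrain_step(encoder, decoder, volume, mask_ratio=0.6):
    """Masked-reconstruction pretraining: hide random voxels, reconstruct the full volume."""
    mask = (torch.rand_like(volume) > mask_ratio).float()
    features = encoder(volume * mask)
    recon = decoder(features)
    return nn.functional.mse_loss(recon, volume)

def finetune_step(encoder, head, volume, label):
    """Supervised fine-tuning: reuse the pretrained encoder with a small classification head."""
    features = encoder(volume).mean(dim=(2, 3, 4))  # global average pooling over D, H, W
    logits = head(features)
    return nn.functional.cross_entropy(logits, label)

encoder = Tiny3DEncoder()
decoder = nn.Sequential(  # upsamples features back to the input resolution
    nn.ConvTranspose3d(32, 16, kernel_size=2, stride=2), nn.GELU(),
    nn.ConvTranspose3d(16, 1, kernel_size=2, stride=2),
)
head = nn.Linear(32, 2)  # e.g., a binary disease-classification head

volume = torch.randn(2, 1, 32, 32, 32)   # stand-in for unlabeled CT/MRI patches
label = torch.randint(0, 2, (2,))         # stand-in labels for the downstream task

print(pretrain_step(encoder, decoder, volume).item())
print(finetune_step(encoder, head, volume, label).item())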
Figure 3. Distribution of the foundation models included in this review. Subfigures illustrate the frequency of (a) dataset types, (b) architectural designs, and (c) pretraining strategies.
Figure 4. Overview of the recent FMs in volumetric medical imaging, organized by task, modality, and pretraining strategy. Acronyms: Sup—supervised, Unsup—unsupervised, SSL—self-supervised learning, Weak Sup—weakly supervised learning.
Figure 5. Geographical distribution, by country, of the recent publications on medical imaging FMs covered in this review. The number of publications for each country is indicated on the map: Austria (2), Brazil (1), Canada (5), China (26), Denmark (1), France (3), Germany (9), Hong Kong S.A.R. (2), Italy (2), Netherlands (3), Romania (1), Singapore (1), South Korea (1), Switzerland (7), Tunisia (1), Turkey (1), United Kingdom (9), United States of America (29), and Vietnam (1). The color intensity represents the number of papers, with darker shades indicating higher publication counts.
Table 1. Summary of the existing surveys on FMs in medical imaging, highlighting their scope, focus areas, and limitations with respect to 3D/volumetric imaging.
Year | Contribution | Challenges | Ref
2025 | Focusing on large-scale architectures, self-supervised learning, and adaptation to downstream tasks. | Increased computational costs of volumetric data, limited coverage of 3D FMs (12 models), lack of diverse benchmarks, adapting to the medical domain, explainability, fairness, and robustness. | [13]
2025 | Focusing only on medical image segmentation, reviewing 63 studies in this domain, and evaluating 6 FMs in zero-shot settings. | Demanding hybrid 2D–3D model architectures and systematic benchmarking of cross-domain FMs, limited insight into 3D FMs (18 models). | [14]
2025 | Covering LLMs, vision models, omics, and graphs; mostly focusing on 2D coverage with minimal volumetric image–specific methods. | Cost, interpretability, and validation, limited coverage of 3D FMs (six models). | [12]
2024 | Defining the "spectrum" of medical FMs, categorizing them from general vision models to modality-specific, and further to organ- or task-specific models, mainly in 2D medical imaging. | Developing multimodality FMs, scales, application-driven solutions, and 3D medical imaging, limited coverage of 3D FMs (six models). | [15]
2024 | Considering multimodal clinical data such as images, text, sound, and signals. | 3D medical imaging, limited coverage of 3D FMs (two models), interpretability, explainability, and high computational cost of FMs, integrating multimodal data sources. | [16]
2024 | A focused review of SAM adaptations in biomedical segmentation, covering publications from April to September 2023, analyzing over papers and 33 datasets; mainly zero-shot 2D SAM variants. | Limited coverage of emerging 3D/volumetric models (21 3D FMs), generalization discrepancy, fine-tuning dilemma, and modality inconsistencies. | [11]
2024 | A systematic review of multimodal FMs, focusing primarily on vision–language models and 2D tasks, reviewing 97 papers using PRISMA guidelines. | Limited insight into volumetric models (18 3D FMs) or segmentation-specific FMs. | [9]
2024 | Introducing FairMedFM, the first fairness benchmark for FMs in medical imaging, integrating 17 datasets and analyzing 20 FMs, identifying persistent demographic disparities, focusing on bias/fairness. | Limited to medical image classification and segmentation, limited coverage of 3D FMs (seven models). | [8]
2023 | A thoroughly taxonomized, task/organ-specific analysis of research progress and limitations. | Emphasis on textually prompted models, largely focused on articles before 2023, limited coverage of 3D radiology and recent FM advances (five models). | [10]
Table 2. Summary of recent FMs for medical image segmentation. The abbreviations used in the table are as follows. |P.Data|: pretraining dataset size, vols: volumes, DAV: data availability, PL: public, PR: private, M: Mixed, Alg: algorithm, P.Alg.T: pretraining algorithm type, SSL: self-supervised learning, Sup: supervised learning, Sem Sup: semi-supervised learning.
Ref | Year | FM | Modalities | |P.Data| | DAV | Anatomy | Architecture | P.Alg.T | Alg
[34] | 2025 | NA | CT | 650K labeled videos | PL | Abdominal organs, mandible, teeth, maxillary bone, pharynx | ViT | SSL | MAE
[35] | 2025 | 3D-Heart-Seg | CT, MRI | 2.3K vols | PL | Heart | Vision-LSTM | SSL | Matching probability distribution
[36] | 2025 | vesselFM | MRA, CTA, X-ray, vEM, µCTA, two-photon microscopy | 625K | PL | Blood vessels | ViT | Sup | Swin-Transformer encoder, U-Net–style decoder
[37] | 2025 | NA | CT | 8K vols | PR | Coronary artery | Swin-UNETR | Sem Sup | FL, semi-supervised pseudo-labeling, distillation
[38] | 2025 | SegVol | CT | 90K vols | PL | Whole body | ViT, CLIP (text embedding) | SSL | SimMIM
[39] | 2025 | LN Segmentation | CT | 3.3K vols | PL | Lymph nodes | U-Net | Sup | nnU-Net
[40] | 2025 | F3-Net | MRI | 5.7K vols | PL | Brain | U-Net | Sup | nnU-Net
[41] | 2025 | TotalSegmentator MRI | MRI | 1.1K vols | PR | Whole body | U-Net | Sup | nnU-Net
[42] | 2025 | RoMedFormer | CT, MRI | NA | M | Genital and pelvic | Transformer (rotary positional embedding) | SSL | Masked image modeling
[43] | 2025 | VISTA3D | CT | 11K vols | M | Whole body | SegResNet | Sem Sup | Pseudo-labels, supervoxels
[44] | 2024 | BrainSegFounder | MRI T1w, T1-CE, T2w, T2-FLAIR | 88K vols | PR | Brain | Swin UNETR | SSL | Masked volume inpainting, rotation prediction, contrastive coding
[45] | 2024 | MoME | MRI T1w, T2w, T1-CE, FLAIR, DWI | 6.5K vols | PL | Brain | U-Net | Sup | nnU-Net
[46] | 2024 | M4oE | CT, MRI, CE-MRI | 700 vols | PL | Liver, kidney, pancreas, spleen, stomach, gallbladder | Swin Transformer, MLP expert network | SSL | MAE
[47] | 2024 | iMOS | CT, MRI, ultrasound, endoscopy, electron microscopy | 877 vols | PL | Whole body | XMem | Sup | Comparison with actual label
[48] | 2023 | STU-Net | CT, MRI, PET | 1.2K vols | PL | Whole body | U-Net | Sup | nnU-Net
[49] | 2023 | MIS-FM | CT | 113K vols | M | Head, neck, heart, aorta, trachea, esophagus, abdomen | PCT-Net (CNN+ViT) | SSL | Pseudo-segmentation task
[50] | 2023 | TotalSegmentator | CT | 1.2K vols | PL | Whole body | U-Net | Sup | nnU-Net
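Many of the supervised segmentation FMs in Table 2 build on the nnU-Net framework, whose default training objective combines a soft Dice term with cross-entropy. The snippet below is a minimal, hedged sketch of such a combined loss for multi-class 3D segmentation; the class count, equal weighting, and smoothing constant are illustrative assumptions rather than nnU-Net's exact configuration.

import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-5):
    """Soft Dice + cross-entropy for 3D segmentation.
    logits: (B, C, D, H, W) raw network outputs; target: (B, D, H, W) integer labels."""
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 4, 1, 2, 3).float()
    dims = (0, 2, 3, 4)                              # sum over batch and spatial axes
    intersection = (probs * one_hot).sum(dims)
    denominator = probs.sum(dims) + one_hot.sum(dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return (1 - dice.mean()) + F.cross_entropy(logits, target)

# Toy usage with a 4-class problem on a small 3D patch.
logits = torch.randn(2, 4, 32, 32, 32)
target = torch.randint(0, 4, (2, 32, 32, 32))
print(dice_ce_loss(logits, target).item())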
Table 3. Summary of recent classification FMs in medical imaging. The abbreviations used in the table are as follows. V: vision, VL: vision–language, |P.Data|: pretraining dataset size, vols: volumes, DAV: data availability, PL: public, PR: private, M: mixed, Alg: algorithm, P.Alg.T: pretraining algorithm type, SSL: self-supervised learning, Sup: supervised learning, Unsup: unsupervised learning, and C.L: contrastive learning.
Ref | Year | FM | T | Modalities | |P.Data| | DAV | Anatomy | Task | Architecture | P.Alg.T | Alg
[68] | 2023 | GLIP-T(C) | VL | Natural image, text | 2M images, 16K text descriptions | PL | Hippocampus, thyroid nodule, foot | Classification | GLIP, adapted to medical via text prompts | SSL | Joint and continual learning
[69] | 2025 | FM-HCT | V | CT | 361K vols | PR | Head | Classification | ViT | SSL | DINO, MAE
[70] | 2025 | MEDFORM | VL | CT, tabular data | 159K slices | PL | Lung, breast, colon | Classification | TANGLE (ResNet, TabNet multimodal encoder) | SSL | SimCLR, multimodal C.L
[71] | 2025 | NA | V | MRI (1.5T, 3T T1WI, T2WI, FLAIR, T1CE) | 57K vols | PR | Brain | Classification | ViT | SSL | MiM + C.L
[72] | 2025 | CRCFound | V | CT | 5K vols | PR | Colorectal | Classification | ViT | SSL | MAE
[73] | 2025 | ViNet | V | CT, MRI | >40K vols | M | Brain, heart, lung, abdomen | Classification | ResNet3D-18 | SSL | Image restoration
[74] | 2025 | DeepCNTD-Net | V | CT | 29K vols | PR | Head | Classification | 3D DenseNet, task-specific 3D U-Nets | Sup | Segmentation, classification with labels
[75] | 2024 | NA | V | MRI T1w, T1c, T2w, FLAIR | 57K vols | PR | Brain | Classification | ViT-16 | SSL | Reconstruction, C.L
[76] | 2024 | BrainIAC | V | MRI T1w, T2w, FLAIR, T1CE | 32K | PL | Brain | Classification, prediction | ResNet50 | SSL | SimCLR
[77] | 2024 | NA | V | CT (lesion) | 11K vols | PL | Lung nodules, cysts, breast lesions, kidney, bone, liver | Classification, prediction | 3D ResNet-50 | SSL | SimCLR
[78] | 2025 | SwinClassifier | V | MRI T1WI, T2WI, FLAIR | 75K vols | PR | Brain | Classification | Swin UNETR | SSL | Reconstruction, C.L
[79] | 2025 | MerMED-FM | V | CXR, CT, US, CFP, OCT, histopathology, dermoscopy | 3.3M | PR | Eye, lung, liver, kidney, prostate, skin, bladder | Classification | ViT | SSL | Multimodality agreement via teacher–student network
[80] | 2025 | Percival | VL | CT, report | 402K vols + reports | PR | Thorax, abdomen, pelvis, head, neck, brain, extremities | Classification | Dual Transformer encoders, BERT-style text encoders | SSL | C.L
[81] | 2025 | Cardiac-CLIP | VL | CT, report | 130K vols + reports | M | Heart | Classification | ViT-B/32, PubMedBERT | SSL | MAE, C.L
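Several of the vision–language classification FMs in Table 3 (e.g., Percival and Cardiac-CLIP) are pretrained with a CLIP-style contrastive objective that aligns a 3D image encoder with a text encoder over paired scans and reports. The sketch below shows only the symmetric InfoNCE loss on precomputed embeddings; the encoders, embedding dimension, and temperature are illustrative assumptions rather than any published model's configuration.

import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(image_emb.size(0))          # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)        # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)    # text-to-image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: embeddings that would normally come from a 3D scan encoder and a report encoder.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(clip_style_loss(image_emb, text_emb).item())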
Table 4. Summary of recent FMs for image registration, reconstruction, super-resolution, and quality assessment in medical imaging. The abbreviations used in the table are as follows. |P.Data|: pretraining dataset size, vols: volumes, DAV: data availability, PL: public, PR: private, M: mixed, Alg: algorithm, P.Alg.T: pretraining algorithm type, SSL: self-supervised learning, Sup: supervised learning, Unsup: unsupervised learning, and C.L: contrastive learning.
Ref | Year | FM | Modalities | |P.Data| | DAV | Anatomy | Task | Architecture | P.Alg.T | Alg
[86] | 2025 | FoundationMorph | MRI, CT, PET, clinical text | 23K slices | M | Brain, lung | Image registration | Transformer-based encoder–decoder | SSL | Inpainting/masked image modeling (MiM), C.L
[87] | 2025 | UniReg | CT | 9K vols | M | Whole body | Image registration | Convolution-based encoder–decoder | Sup | Self-supervised feature extraction, similarity, supervised regularization through segmentation masks
[88] | 2025 | TotalRegistrator | CT | 591 CT scan pairs | PR | Whole body | Image registration | U-Net | Unsup | Similarity, segmentation overlay, deformation field
[89] | 2024 | multiGradICON | CT, MRI, CBCT, MRI T1w, T1ce, T2w, FLAIR, DIXON, diffusion-derived measures | >1M 3D image pairs | PL | Lung, knee, brain, abdomen, pancreas | Image registration | Multiscale U-Net | Unsup | GradICON (similarity with the target image)
[90] | 2024 | uniGradICON | CT, MRI, CBCT | 3.78M 3D image pairs | PL | Lung, knee, brain, abdomen | Image registration | Multiscale U-Net | Unsup | GradICON (similarity with the target image)
[91] | 2025 | PCP-UNet | MRI | 150K 2D-t scans | PL | Cardiac | Image reconstruction | U-Net with pattern and contrast prompts, adaptive unrolling, channel-shifting | Sup | Generative (image reconstruction)
[92] | 2024 | BME-x | MRI T1w, T2w | 516 3D scans | PL | Brain | Super-resolution | Densely connected U-Net | Sup | Classification, similarity calculation on generated high-quality image
[93] | 2025 | GraphMSR | MRI | 460 subjects | PL | Brain, knee | Super-resolution | GNN with attention mechanism | Sup | Reconstruction from low- to high-resolution, structural similarity
[94] | 2025 | MedIQA | CT, MRI, fundoscopy | 2.5K 3D scans | M | Brain, breast, eye, knee, chest, abdominal | Image quality assessment | ViT-based (MANIQA) | Sup | MSE with target class
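Most registration FMs in Table 4 (e.g., uniGradICON and TotalRegistrator) are trained without ground-truth correspondences: a network predicts a dense displacement field, the moving image is warped with it, and the loss combines an image-similarity term with a regularizer on the field. The following is a minimal, hedged sketch of that unsupervised objective using a randomly generated displacement field; it is not the loss of any specific model, and GradICON in particular uses an inverse-consistency regularizer rather than the simple gradient penalty shown here.

import torch
import torch.nn.functional as F

def warp(moving, displacement):
    """Warp a 3D volume (B, 1, D, H, W) with a dense displacement field (B, 3, D, H, W)."""
    B, _, D, H, W = moving.shape
    # Identity sampling grid in normalized [-1, 1] coordinates, ordered (x, y, z) for grid_sample.
    zs, ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, D), torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
        indexing="ij",
    )
    identity = torch.stack((xs, ys, zs), dim=-1).unsqueeze(0).expand(B, -1, -1, -1, -1)
    grid = identity + displacement.permute(0, 2, 3, 4, 1)  # add predicted offsets
    return F.grid_sample(moving, grid, align_corners=True)

def registration_loss(fixed, moving, displacement, smooth_weight=0.1):
    """Unsupervised loss: image similarity after warping plus a displacement smoothness penalty."""
    warped = warp(moving, displacement)
    similarity = F.mse_loss(warped, fixed)
    # First-order finite differences of the displacement field along each spatial axis.
    smooth = sum((displacement.diff(dim=d) ** 2).mean() for d in (2, 3, 4))
    return similarity + smooth_weight * smooth

fixed = torch.randn(1, 1, 16, 16, 16)
moving = torch.randn(1, 1, 16, 16, 16)
displacement = 0.01 * torch.randn(1, 3, 16, 16, 16)  # would normally be predicted by a U-Net
print(registration_loss(fixed, moving, displacement).item())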
Table 5. Summary of recent multitask FMs in medical imaging. The abbreviations used in the table are as follows. T: type, V: vision, VL: vision–language, |P.Data|: pretraining dataset size, vols: volumes, DAV: data availability, PL: public, PR: private, M: mixed, S: synthetic, Alg: algorithm, P.Alg.T: pretraining algorithm type, SSL: self-supervised learning, Sup: supervised learning, Weak Sup: weakly supervised learning, C.L: contrastive learning.
Ref | Year | FM | T | Modalities | |P.Data| | DAV | Anatomy | Task | Architecture | P.Alg.T | Alg
[98] | 2025 | M3FM | VL | CT, EHR, tabular clinical data, text | 117K CT–clinical record pairs | PL | Lungs, heart, airways, chest cavity | Classification, segmentation | CTViT, clinical text transformer, fusion module | SSL | MAE
[99] | 2025 | Radio DINO | V | CT, MRI, ultrasound, X-ray | 1.35M 2D slices | PL | Chest, breast, abdomen | Classification, segmentation | 2D ViT | SSL | DINO/DINOv2
[100] | 2025 | MRI-CORE | V | MRI | 6M slices, 110K vols | PR | 18 body locations | Classification, segmentation | ViT | SSL | DINOv2
[101] | 2025 | CineMA | V | CMR | 15M | M | Cardiac | Classification, segmentation | Multiview conv-transformer MAE | SSL | MultiMAE
[102] | 2025 | FratMAE | V | PET, CT | 1.2K vols | PL | Whole-body | Classification, segmentation | ViT encoders, cross-attention decoders | SSL | MAE
[103] | 2025 | PASTA | VL | CT–report pairs | 30K CT–mask–text pairs | PL | Lung, liver, pancreas, gallbladder, bladder, bone, esophagus, stomach, kidney, colorectum | Classification, segmentation | 3D U-Net | Sup | Sup, synthetic mask
[104] | 2025 | Triad | V | MRI T1w, T2w, FLAIR, fMRI, DWI, DCE-MRI | 131K 3D vols | PL | Breast, brain, prostate | Classification, segmentation, registration | 3D U-Net, Swin Transformer | SSL | Reconstruction
[84] | 2025 | BiomedCLIP | VL | Image–text pairs | 15M (PMC-15M) | PR | General biomedical: lungs, lymph nodes, organs | Classification, retrieval, VQA | 2D ViT-B/16, PubMedBERT | SSL | C.L
[105] | 2025 | Lingshu | VL | X-ray, CT, MRI, ultrasound, fundus, dermoscopy, OCT, PET, endoscopy, digital photography, histopathology, microscopy, text | 3.75M, 1.3M synthetic samples | PL, S | Whole body | Image understanding and reasoning | Qwen2.5-VL-Instruct architecture, vision encoder, projection MLP module, LLM core | Sup | Multistage sup
[106] | 2025 | RadFM | VL | 2D/3D scans, text | 16M scan–text pairs | PL | Whole body | Classification, modality recognition, VQA, report generation | ViT, autoregressive text generator | Sup | Next-token prediction
[107] | 2025 | CT-FM | V | CT | 148K vols | PL | Whole body | Classification, segmentation, retrieval, and semantic understanding | 3D encoder, SegResNet decoder | SSL | C.L
[108] | 2025 | LCTfound | V | CT | 28M slices | PR | Lung, mediastinum, bronchi, arterial, venous networks | Classification, segmentation, prognosis, prediction, virtual imaging, reconstruction, enhancement | U-Net with transformer blocks | SSL | Denoising diffusion probabilistic models
[109] | 2024 | E3D-GPT | VL | 3D CT–report pairs | 354K pairs | M | Chest, brain, abdomen | Classification, report generation, and VQA | 3D ViT encoder, MAE decoder, LLM fusion via 3D conv aggregator | SSL | MAE, C.L
[110] | 2024 | NA | V | CT | 59K vols, 14.1M slices | PR | Chest, lung, heart, pulmonary arteries | Classification, segmentation | VQ-AE (U-Net) | SSL | Masked region reconstruction, structural similarity
[111] | 2024 | NA | V | MRI | 36M slices | PR | Cardiac | Classification, segmentation, landmark localization | ViT-S/8 | SSL | DINO
[112] | 2024 | CT-CLIP | VL | CT–reports | 50K vols, 25K reports | PL | Chest | Multiabnormality detection, case retrieval, VQA | CT-ViT, CXR-BERT | SSL | CLIP
[113] | 2024 | VIS-MAE | V | CT, MRI, PET, X-ray, ultrasound | 2.5M slices | PL | Abdomen, heart, prostate, brain, breast, thyroid, skin, chest, knee, pulmonary | Classification, segmentation | Swin Transformer | SSL | MAE
[82] | 2024 | Merlin | VL | CT, EHR, diagnosis code | 6M images from 15K paired CTs with 1.8M+ EHR diagnosis codes and radiology reports | PR | Abdomen | Classification, segmentation, cross-modal retrieval, report generation | Inflated ResNet152, Longformer | Weak Sup | Diagnostic code, C.L
[114] | 2024 | M3D-LaMed | VL | CT–report pairs | 120K 3D image–text pairs, 662K instruction–response pairs | PL | NA | Segmentation, image–text retrieval, report generation, VQA | 3D ViT, LLaMA-2-7B | SSL | CLIP
[115] | 2023 | GMAI | VL | Multimodal | NA | NA | NA | Multitask | NA | SSL | NA
Table 6. Computational requirements and evaluation protocols of the reviewed FMs. Acronyms: MD: multiple datasets; FS: few-shot; ZS: zero-shot; LD: low-data regime; UD: completely unseen dataset; Existing FMs: compared with already published FMs or not; N: no; Y: yes; NA: not available/applicable.
Computational Requirement: GPU Model, #GPU, GPU Memory, Batch Size, Input Image Shape | Evaluation Protocol: MD, FS/ZS/LD/UD, Existing FMs
Ref | FM | GPU Model | #GPU | GPU Memory (GB) | Batch Size (Total) | Input Image Shape | MD | FS/ZS/LD/UD | Existing FMs
Segmentation FMs
[34] | NA | NVIDIA RTX A6000 | 1 | 48 | 16 (pretraining), 4 (fine-tuning) | 16 × 224 × 224 | Y | N | Y
[35] | 3D-Heart-Seg | NA | NA | NA | NA | 128 × 128 × 128 | Y | Y | N
[36] | vesselFM | NVIDIA V100 | 1 | 32 | 8 | 128 × 128 × 128 | Y | Y | Y
[37] | NA | NA | NA | NA | NA | NA | Y | N | N
[38] | SegVol | NVIDIA A100-SXM4 | 8 | 8 × 40 | 32 | 32 × 256 × 256 | Y | Y | Y
[39] | LN Segmentation | NVIDIA V100 | 8 | NA | 2 | NA | Y | Y | N
[40] | F3-Net | NA | NA | NA | 2 | NA | Y | N | N
[41] | TotalSegmentator MRI | NVIDIA GeForce RTX 3090 | 1 | 24 | NA | NA | Y | N | N
[42] | RoMedFormer | NA | NA | NA | 2 | NA | N | N | N
[43] | VISTA3D | NVIDIA V100 | 64 | 64 × 32 | NA | 308 × 260 × 453 | Y | Y | Y
[44] | BrainSegFounder | NVIDIA DGX A100 | 64 | 64 × 320 | 128 | NA | Y | Y | N
[45] | MoME | NVIDIA A100 | NA | 40 | NA | 160 × 196 × 160 | Y | Y | Y
[46] | M4oE | RTX 4090 | 1 | 24 | 36 | NA | Y | N | Y
[47] | iMOS | NA | NA | NA | NA | NA | Y | Y | N
[48] | STU-Net | NVIDIA A100 | 1 | 80 | 2 | NA | Y | Y | N
[49] | MIS-FM | NVIDIA A100 | 2 | 2 × 80 | 2 | NA | Y | Y | N
[50] | TotalSegmentator | NVIDIA GeForce RTX 3090 | 1 | 24 | NA | 512 × 512 × 280, 512 × 512 × 458, 512 × 512 × 824 | N | N | N
Classification FMs
[68] | GLIP-T(C) | NA | NA | NA | 4 | NA | Y | Y | N
[69] | FM-HCT | NVIDIA A100 | 4 | 4 × 80 | 256 | 224 × 224 × 224 | Y | Y | N
[70] | MEDFORM | NA | NA | NA | NA | NA | Y | Y | Y
[71] | NA | NVIDIA A100 | 8 | NA | NA | 96 × 96 × 96 | N | N | N
[72] | CRCFound | NVIDIA A100 | 4 | 4 × 40 | 64 | 256 × 256 × 32 | Y | N | N
[73] | ViNet | NVIDIA V100 | 1 | NA | NA | NA | Y | N | N
[74] | DeepCNTD-Net | NA | NA | NA | NA | NA | Y | N | Y
[75] | NA | NVIDIA A100 | 8 | NA | 24 | NA | Y | N | N
[76] | BrainIAC | NVIDIA A6000 | 1 | 48 | 32 | NA | Y | Y | Y
[77] | NA | NVIDIA Quadro RTX 8000 | 2 | 2 × 48 | 64 | NA | Y | Y | N
[78] | SwinClassifier | NVIDIA A100 | 8 | NA | NA | 128 × 128 × 64 | N | N | N
[79] | MerMED-FM | NVIDIA H100 | 8 | 8 × 80 | 16 | NA | Y | Y | Y
[80] | Percival | NVIDIA A100 | 2 | NA | 48 | NA | Y | Y | Y
[81] | Cardiac-CLIP | NVIDIA A6000 | 1 | 48 | 64 | NA | Y | Y | Y
Image Registration, Reconstruction, Super-Resolution, and Quality Assessment FMs
[86] | FoundationMorph | NA | NA | NA | NA | 256 × 256 × 128
[87] | UniReg | NVIDIA Tesla V100 | NA | NA | 1 | NA | Y | Y | Y
[88] | TotalRegistrator | NVIDIA RTX 3080 Ti | 1 | 12 | 1 | 128 × 96 × 160 | Y | Y | Y
[89] | multiGradICON | NA | NA | NA | NA | 175 × 175 × 175 | Y | Y | Y
[90] | uniGradICON | NA | NA | NA | NA | 175 × 175 × 175 | Y | Y | Y
[91] | PCP-UNet | NA | NA | NA | NA | 0.8 × 0.8 × 0.8 mm³ | N | N | N
[92] | BME-x | NA | NA | NA | NA | NA | Y | Y | N
[93] | GraphMSR | NVIDIA RTX A100 | 1 | NA | NA | 256 × 256, 320 × 320 | Y | N | Y
[94] | MedIQA | NVIDIA RTX A6000 | 1 | 48 | 1 | 224 × 224 | Y | N | Y
Multitask FMs
[98] | M3FM | NVIDIA Tesla V100 | 192 | 192 × 32 | 192, in multitask training: 972 | 16 × 448 × 320, 128 × 448 × 320, 128 × 192 × 224, 128 × 320 × 448 | Y | Y | Y
[99] | Radio DINO | NVIDIA A100 | 2 | 2 × 80 | 128, 256, 512 | 224 × 224 | Y | N | N
[100] | MRI-CORE | NVIDIA A6000 | 4 | 4 × 48 | 512 | 1024 × 1024 | Y | Y | Y
[101] | CineMA | NVIDIA RTX A6000 | 8 | 8 × 48 | 128 | 256 × 256, 192 × 192 × 16 | Y | Y | Y
[102] | FratMAE | NVIDIA A100 | 8 | 8 × 40 | 24 | 160 × 160 × 192 | Y | Y | N
[103] | PASTA | NVIDIA A800 | 8 | 8 × 40 | 32 | 224 × 224 × 112 | Y | Y | Y
[104] | Triad | NVIDIA A100 | 2 | 2 × 80 | 8 | 190 × 192 × 224 | Y | N | Y
[84] | BiomedCLIP | NVIDIA A100 | 16 | NA | 4000 | 224 × 224, 384 × 384 | Y | Y | Y
[105] | Lingshu | NA | NA | NA | 1 | NA | Y | Y | Y
[106] | RadFM | NVIDIA A100 | 32 | 32 × 80 | 1 | 256 × 256 × [4–64] | Y | Y | Y
[107] | CT-FM | NVIDIA Quadro RTX 8000 | 4 | 4 × 48 | 64 | 128 × 128 × 48 | Y | Y | Y
[108] | LCTfound | NVIDIA V100 | 8 | NA | 36 | NA | Y | Y | Y
[109] | E3D-GPT | NVIDIA A800 | 8 | 8 × 40 | 32 | 224 × 224 × 112 | Y | N | Y
[110] | NA | NA | NA | NA | NA | NA | Y | Y | N
[111] | NA | NVIDIA Tesla H100 | 8 | 8 × 80 | 1024 | 224 × 224 | Y | Y | N
[112] | CT-CLIP | NVIDIA A100 | 4 | 4 × 80 | 1 | NA | Y | Y | Y
[113] | VIS-MAE | NVIDIA DGX A100 | 8 | NA | 640 | 224 × 224 | Y | Y | N
[82] | Merlin | NVIDIA RTX A6000 | 1 | 48 | 18 | 224 × 224 × 160 | Y | Y | Y
[114] | M3D-LaMed | NVIDIA A100 | 8 | 8 × 80 | 48 | 32 × 256 × 256 | Y | N | Y
[115] | GMAI | NA | NA | NA | NA | NA | NA | NA | NA
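A recurring pattern in Table 6 is that 3D inputs are kept far smaller than the native scan resolution (e.g., 96 × 96 × 96 or 128 × 128 × 48 patches rather than full 512 × 512 × 400 volumes). One reason is that the token count of a patch-based 3D transformer grows with the product of the three spatial dimensions, and self-attention cost grows quadratically in that count. The short, hedged calculation below illustrates the effect for a generic ViT-style tokenizer; the patch size and shapes are illustrative and not tied to any specific model in the table.

def vit3d_token_count(shape, patch=16):
    """Number of non-overlapping 3D patches (tokens) for a ViT-style tokenizer."""
    d, h, w = shape
    return (d // patch) * (h // patch) * (w // patch)

for shape in [(96, 96, 96), (128, 128, 48), (224, 224, 224), (512, 512, 400)]:
    n = vit3d_token_count(shape)
    # Self-attention builds an (n x n) attention matrix per head, so memory scales with n**2.
    print(f"{shape}: {n} tokens, attention matrix with {n**2:,} entries")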
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
