Foundation Models for Volumetric Medical Imaging: Opportunities, Challenges, and Future Directions
Abstract
1. Introduction
2. Paper Selection and Review Process
3. Foundation Models Development
3.1. Data Collection and Curation
3.2. Model Architecture Design
3.3. Pretraining Strategies
3.3.1. Supervised Pretraining
3.3.2. Weakly Supervised Pretraining
3.3.3. Self-Supervised Pretraining
3.4. Task Adaptation and Fine-Tuning
3.5. Evaluation and Deployment
4. Recent FMs in Volumetric Medical Imaging
- Segmentation models, designed for voxel-wise delineation and localization of anatomical structures or lesions.
- Classification and predictive models, focused on tasks such as disease detection, survival analysis, and pathology characterization.
- Image registration, enhancement, and reconstruction models, which address image-to-image transformation tasks.
- General-purpose models, which are multitask FMs capable of addressing diverse tasks within a single framework.
4.1. FMs for Segmentation
4.2. FMs for Classification and Predictive Tasks
4.3. FMs for Image Registration, Reconstruction, Super-Resolution, and Quality Assessment
4.3.1. Image Registration
4.3.2. Image Reconstruction
4.3.3. Image Super-Resolution
4.3.4. Image Quality Assessment
4.4. Multitask FMs
5. Challenges and Opportunities
5.1. Dataset Scale and Diversity
5.2. Reproducibility
5.3. Evaluation and Benchmarking
5.4. Extensive Resource Requirements
5.5. The Curse of Supervision
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- He, Y.; Huang, F.; Jiang, X.; Nie, Y.; Wang, M.; Wang, J.; Chen, H. Foundation Model for Advancing Healthcare: Challenges, Opportunities and Future Directions. IEEE Rev. Biomed. Eng. 2025, 18, 172–191. [Google Scholar] [CrossRef]
- Hayat, M.; Dhaliwal, A.; Din, M.; Izhar, R.; Nadeem, M.; Ahmad, N. Cross-Attention Patch Fusion for Few-Shot Colorectal Tissue Generation. In Proceedings of the 2025 5th International Conference on Digital Futures and Transformative Technologies (ICoDT2), Islamabad, Pakistan, 17–18 December 2025; pp. 1–6. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar] [PubMed]
- Ma, J.; He, Y.; Li, F.; Han, L.; You, C.; Wang, B. Segment anything in medical images. Nat. Commun. 2024, 15, 654. [Google Scholar] [CrossRef]
- Zhang, Y.; Shen, Z.; Jiao, R. Segment anything model for medical image segmentation: Current applications and future directions. Comput. Biol. Med. 2024, 171, 108238. [Google Scholar] [CrossRef] [PubMed]
- Yao, W.; Bai, J.; Liao, W.; Chen, Y.; Liu, M.; Xie, Y. From CNN to Transformer: A Review of Medical Image Segmentation Models. J. Imaging Inform. Med. 2024, 37, 1529–1547. [Google Scholar] [CrossRef] [PubMed]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
- Jin, R.; Xu, Z.; Zhong, Y.; Yao, Q.; Qi, D.; Zhou, S.K.; Li, X. FairMedFM: Fairness benchmarking for medical imaging foundation models. Adv. Neural Inf. Process. Syst. 2024, 37, 111318–111357. [Google Scholar]
- Huang, S.C.; Jensen, M.; Yeung-Levy, S.; Lungren, M.P.; Poon, H.; Chaudhari, A.S. Multimodal Foundation Models for Medical Imaging—A Systematic Review and Implementation Guidelines. medRxiv 2024. [Google Scholar] [CrossRef]
- Azad, B.; Azad, R.; Eskandari, S.; Bozorgpour, A.; Kazerouni, A.; Rekik, I.; Merhof, D. Foundational models in medical imaging: A comprehensive survey and future vision. arXiv 2023, arXiv:2310.18689. [Google Scholar] [CrossRef]
- Lee, H.H.; Gu, Y.; Zhao, T.; Xu, Y.; Yang, J.; Usuyama, N.; Wong, C.; Wei, M.; Landman, B.A.; Huo, Y.; et al. Foundation models for biomedical image segmentation: A survey. arXiv 2024, arXiv:2401.07654. [Google Scholar] [CrossRef]
- Khan, W.; Leem, S.; See, K.B.; Wong, J.K.; Zhang, S.; Fang, R. A comprehensive survey of foundation models in medicine. IEEE Rev. Biomed. Eng. 2025, 19, 283–304. [Google Scholar] [CrossRef]
- van Veldhuizen, V.; Botha, V.; Lu, C.; Cesur, M.E.; Lipman, K.G.; de Jong, E.D.; Horlings, H.; Sanchez, C.I.; Snoek, C.G.M.; Wessels, L.; et al. Foundation Models in Medical Imaging—A Review and Outlook. arXiv 2025, arXiv:2506.09095. [Google Scholar]
- Noh, S.; Lee, B.D. A narrative review of foundation models for medical image segmentation: Zero-shot performance evaluation on diverse modalities. Quant. Imaging Med. Surg. 2025, 15, 5825–5858. [Google Scholar] [CrossRef]
- Zhang, S.; Metaxas, D. On the challenges and perspectives of foundation models for medical image analysis. Med. Image Anal. 2024, 91, 102996. [Google Scholar] [CrossRef]
- Rashed, E.A.; Bekhit, M. Foundation Models in Medical Image Analysis: Overview and Prospects. In Proceedings of the 2024 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), Sydney, NSW, Australia, 20–23 November 2024; pp. 6–9. [Google Scholar] [CrossRef]
- Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Wang, X.; Zhang, R.; Shen, C.; Kong, T.; Li, L. Dense Contrastive Learning for Self-Supervised Visual Pre-Training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3024–3033. [Google Scholar]
- Wen, X.; Zhao, B.; Zheng, A.; Zhang, X.; Qi, X. Self-supervised visual representation learning with semantic grouping. Adv. Neural Inf. Process. Syst. 2022, 35, 16423–16438. [Google Scholar]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660. [Google Scholar]
- Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap Your Own Latent—A New Approach to Self-Supervised Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21271–21284. [Google Scholar]
- Xie, Z.; Zhang, Z.; Cao, Y.; Lin, Y.; Bao, J.; Yao, Z.; Dai, Q.; Hu, H. SimMIM: A Simple Framework for Masked Image Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Chen, Z.; Agarwal, D.; Aggarwal, K.; Safta, W.; Balan, M.M.; Brown, K. Masked Image Modeling Advances 3D Medical Image Analysis. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 1970–1980. [Google Scholar]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. arXiv 2021, arXiv:2111.06377. [Google Scholar] [CrossRef]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Boecking, B.; Usuyama, N.; Bannur, S.; Castro, D.C.; Schwaighofer, A.; Hyland, S.; Wetscherek, M.; Naumann, T.; Nori, A.; Alvarez-Valle, J.; et al. Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
- Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. MedCLIP: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Abu Dhabi, United Arab Emirates, 7–11 December 2022; Volume 2022, p. 3876. [Google Scholar]
- Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. Clip-adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. 2024, 132, 581–595. [Google Scholar] [CrossRef]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. ICLR 2022, 1, 3. [Google Scholar]
- Jia, M.; Tang, L.; Chen, B.C.; Cardie, C.; Belongie, S.; Hariharan, B.; Lim, S.N. Visual prompt tuning. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 709–727. [Google Scholar]
- Ma, Q.; Sun, G.; Tombak, G.I.; Jain, S.; Huber, N.B.; Gool, L.V.; Konukoglu, E. Video Foundation Model for Medical 3D Segmentation. In Proceedings of the Supervised and Semi-Supervised Multi-Structure Segmentation and Landmark Detection in Dental Data; Wang, Y., Qian, D., Wang, S., Ben-Hamadou, A., Pujades, S., Lumetti, L., Grana, C., Bolelli, F., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 72–88. [Google Scholar]
- Qayyum, A.; Mazher, M.; Ugurlu, D.; Lemus, J.A.S.; Rodero, C.; Niederer, S.A. Foundation Model for Whole-Heart Segmentation: Leveraging Student-Teacher Learning in Multi-Modal Medical Imaging. arXiv 2025, arXiv:2503.19005. [Google Scholar]
- Wittmann, B.; Wattenberg, Y.; Amiranashvili, T.; Shit, S.; Menze, B. vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation. arXiv 2025, arXiv:2411.17386. [Google Scholar]
- Tölle, M.; Garthe, P.; Scherer, C.; Seliger, J.M.; Leha, A.; Krüger, N.; Simm, S.; Martin, S.; Eble, S.; Kelm, H.; et al. Real world federated learning with a knowledge distilled transformer for cardiac CT imaging. Npj Digit. Med. 2025, 8, 88. [Google Scholar] [CrossRef]
- Du, Y.; Bai, F.; Huang, T.; Zhao, B. Segvol: Universal and interactive volumetric medical image segmentation. Adv. Neural Inf. Process. Syst. 2024, 37, 110746–110783. [Google Scholar]
- Luo, Z.; Gao, Z.; Liao, W.; Zhang, S.; Wang, G.; Luo, X. Dynamic Gradient Sparsification Training for Few-Shot Fine-tuning of CT Lymph Node Segmentation Foundation Model. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2025; pp. 165–174. [Google Scholar]
- Otaghsara, S.S.T.; Rahmanzadeh, R. F3-Net: Foundation Model for Full Abnormality Segmentation of Medical Images with Flexible Input Modality Requirement. arXiv 2025, arXiv:2507.08460. [Google Scholar] [CrossRef]
- Akinci D’Antonoli, T.; Berger, L.K.; Indrakanti, A.K.; Vishwanathan, N.; Weiss, J.; Jung, M.; Berkarda, Z.; Rau, A.; Reisert, M.; Küstner, T.; et al. TotalSegmentator MRI: Robust sequence-independent segmentation of multiple anatomic structures in MRI. Radiology 2025, 314, e241613. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Hu, M.; Qiu, R.L.; Thor, M.; Williams, A.; Marshall, D.; Yang, X. RoMedFormer: A Rotary-Embedding Transformer Foundation Model for 3D Genito-Pelvic Structure Segmentation in MRI and CT. arXiv 2025, arXiv:2503.14304. [Google Scholar]
- He, Y.; Guo, P.; Tang, Y.; Myronenko, A.; Nath, V.; Xu, Z.; Yang, D.; Zhao, C.; Simon, B.; Belue, M.; et al. VISTA3D: A unified segmentation foundation model for 3D medical imaging. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 20863–20873. [Google Scholar]
- Cox, J.; Liu, P.; Stolte, S.E.; Yang, Y.; Liu, K.; See, K.B.; Ju, H.; Fang, R. BrainSegFounder: Towards 3D foundation models for neuroimage segmentation. Med. Image Anal. 2024, 97, 103301. [Google Scholar] [CrossRef]
- Zhang, X.; Ou, N.; Basaran, B.D.; Visentin, M.; Qiao, M.; Gu, R.; Ouyang, C.; Liu, Y.; Matthews, P.M.; Ye, C.; et al. A foundation model for brain lesion segmentation with mixture of modality experts. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2024; pp. 379–389. [Google Scholar]
- Jiang, Y.; Shen, Y. M4oe: A foundation model for medical multimodal image segmentation with mixture of experts. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2024; pp. 621–631. [Google Scholar]
- Yan, Z.; Han, T.; Huang, Y.; Liu, L.; Zhou, H.; Chen, J.; Shi, W.; Cao, Y.; Yang, X.; Ni, D. A Foundation Model for General Moving Object Segmentation in Medical Images. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–5. [Google Scholar] [CrossRef]
- Huang, Z.; Wang, H.; Deng, Z.; Ye, J.; Su, Y.; Sun, H.; He, J.; Gu, Y.; Gu, L.; Zhang, S.; et al. STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training. arXiv 2023, arXiv:2304.06716. [Google Scholar]
- Wang, G.; Wu, J.; Luo, X.; Liu, X.; Li, K.; Zhang, S. MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset. arXiv 2023, arXiv:2306.16925. [Google Scholar]
- Wasserthal, J.; Breit, H.C.; Meyer, M.T.; Pradella, M.; Hinck, D.; Sauter, A.W.; Heye, T.; Boll, D.T.; Cyriac, J.; Yang, S.; et al. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiol. Artif. Intell. 2023, 5, e230024. [Google Scholar] [CrossRef]
- Landman, B.; Xu, Z.; Iglesias, J.E.; Styner, M.; Langerak, T.; Klein, A. Segmentation Outside the Cranial Vault Challenge. In Proceedings of the MICCAI: Multi Atlas Labeling Beyond Cranial Vault-Workshop Challenge, Munich, Germany, 5–9 October 2015. [Google Scholar]
- Myronenko, A. 3D MRI Brain Tumor Segmentation Using Autoencoder Regularization. In Proceedings of the Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 4th International Workshop, BrainLes 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 16 September 2018; Revised Selected Papers, Part II; Springer: Berlin/Heidelberg, Germany, 2018; pp. 311–320. [Google Scholar] [CrossRef]
- Mu, S.; Lin, S. A Comprehensive Survey of Mixture-of-Experts: Algorithms, Theory, and Applications. arXiv 2025, arXiv:2503.07137. [Google Scholar]
- Cai, W.; Jiang, J.; Wang, F.; Tang, J.; Kim, S.; Huang, J. A Survey on Mixture of Experts in Large Language Models. IEEE Trans. Knowl. Data Eng. 2025, 37, 3896–3915. [Google Scholar] [CrossRef]
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]
- Quinton, F.; Popoff, R.; Presles, B.; Leclerc, S.; Meriaudeau, F.; Nodari, G.; Lopez, O.; Pellegrinelli, J.; Chevallier, O.; Ginhac, D.; et al. A Tumour and Liver Automatic Segmentation (ATLAS) Dataset on Contrast-Enhanced Magnetic Resonance Imaging for Hepatocellular Carcinoma. Data 2023, 8, 79. [Google Scholar] [CrossRef]
- Ji, Y.; Bai, H.; Yang, J.; Ge, C.; Zhu, Y.; Zhang, R.; Li, Z.; Zhang, L.; Ma, W.; Wan, X.; et al. AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation. arXiv 2022, arXiv:2206.08023. [Google Scholar]
- Isensee, F.; Jaeger, P.F.; Kohl, S.A.A.; Petersen, J.; Maier-Hein, K.H. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 203–211. [Google Scholar] [CrossRef]
- Ulrich, C.; Isensee, F.; Wald, T.; Zenk, M.; Baumgartner, M.; Maier-Hein, K.H. MultiTalent: A Multi-dataset Approach to Medical Image Segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2023; Springer: Cham, Switzerland, 2023; pp. 648–658. [Google Scholar]
- Gao, Y.; Li, Z.; Liu, D.; Zhou, M.; Zhang, S.; Metaxas, D.N. Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation. arXiv 2024, arXiv:2306.02416. [Google Scholar] [CrossRef]
- Wang, H.; Guo, S.; Ye, J.; Deng, Z.; Cheng, J.; Li, T.; Chen, J.; Su, Y.; Huang, Z.; Shen, Y.; et al. SAM-Med3D: Towards General-purpose Segmentation Models for Volumetric Medical Images. arXiv 2024, arXiv:2310.15161. [Google Scholar]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
- Shaker, A.; Maaz, M.; Rasheed, H.; Khan, S.; Yang, M.H.; Khan, F.S. UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation. arXiv 2024, arXiv:2212.04497. [Google Scholar] [CrossRef]
- Hatamizadeh, A.; Nath, V.; Tang, Y.; Yang, D.; Roth, H.; Xu, D. Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images. arXiv 2022, arXiv:2201.01266. [Google Scholar] [CrossRef]
- Littlejohns, T.J.; Holliday, J.; Gibson, L.M.; Garratt, S.; Oesingmann, N.; Alfaro-Almagro, F.; Bell, J.D.; Boultwood, C.; Collins, R.; Conroy, M.C.; et al. The UK Biobank imaging enhancement of 100,000 participants: Rationale, data collection, management and future directions. Nat. Commun. 2020, 11, 2624. [Google Scholar] [CrossRef]
- Cheng, H.K.; Schwing, A.G. XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model. arXiv 2022, arXiv:2207.07115. [Google Scholar]
- Bolelli, F.; Marchesini, K.; Van Nistelrooij, N.; Lumetti, L.; Pipoli, V.; Ficarra, E.; Vinayahalingam, S.; Grana, C. Segmenting Maxillofacial Structures in CBCT Volumes. In Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 5238–5248. [Google Scholar] [CrossRef]
- Yi, H.; Qin, Z.; Lao, Q.; Xu, W.; Jiang, Z.; Wang, D.; Zhang, S.; Li, K. Towards General Purpose Medical AI: Continual Learning Medical Foundation Model. arXiv 2023, arXiv:2303.06580. [Google Scholar] [CrossRef]
- Zhu, W.; Huang, H.; Tang, H.; Musthyala, R.; Yu, B.; Chen, L.; Vega, E.; O’Donnell, T.; Dehkharghani, S.; Frontera, J.A.; et al. 3D foundation AI model for generalizable disease detection in head computed tomography. arXiv 2025, arXiv:2502.02779. [Google Scholar] [CrossRef]
- Jung, D.; Jang, J.; Jang, S.; Park, Y.R. MEDFORM: A Foundation Model for Contrastive Learning of CT Imaging and Clinical Numeric Data in Multi-Cancer Analysis. arXiv 2025, arXiv:2501.13277. [Google Scholar] [CrossRef]
- Gao, R.; Peng, A.; Duan, Y.; Chen, M.; Zheng, T.; Zhang, M.; Chen, L.; Sun, H. Associations of Postencephalitic Epilepsy Using Multi-Contrast Whole Brain MRI: A Large Self-Supervised Vision Foundation Model Strategy. J. Magn. Reson. Imaging 2025, 62, 494–505. [Google Scholar] [CrossRef] [PubMed]
- Yang, J.; Cai, D.; Liu, J.; Zhuang, Z.; Zhao, Y.; Wang, F.a.; Li, C.; Hu, C.; Gai, B.; Chen, Y.; et al. CRCFound: A Colorectal Cancer CT Image Foundation Model Based on Self-Supervised Learning. Adv. Sci. 2025, 12, e07339. [Google Scholar] [CrossRef]
- Gong, Y.; Zhang, X.; Xia, Y.F.; Cheng, Y.; Bao, J.; Zhang, N.; Zhi, R.; Sun, X.Y.; Wu, C.J.; Wu, F.Y.; et al. A foundation model with weak experiential guidance in detecting muscle invasive bladder cancer on MRI. Cancer Lett. 2025, 611, 217438. [Google Scholar] [CrossRef]
- Yoo, Y.; Georgescu, B.; Zhang, Y.; Grbic, S.; Liu, H.; Aldea, G.D.; Re, T.J.; Das, J.; Ullaskrishnan, P.; Eibenberger, E.; et al. A Non-contrast Head CT Foundation Model for Comprehensive Neuro-Trauma Triage. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2025; pp. 3–13. [Google Scholar]
- Chen, M.; Zhang, M.; Yin, L.; Ma, L.; Ding, R.; Zheng, T.; Yue, Q.; Lui, S.; Sun, H. Medical image foundation models in assisting diagnosis of brain tumors: A pilot study. Eur. Radiol. 2024, 34, 6667–6679. [Google Scholar] [CrossRef] [PubMed]
- Tak, D.; Garomsa, B.A.; Chaunzwa, T.L.; Zapaishchykova, A.; Climent Pardo, J.C.; Ye, Z.; Zielke, J.; Ravipati, Y.; Vajapeyam, S.; Mahootiha, M.; et al. A foundation model for generalized brain MRI analysis. medRxiv 2024. [Google Scholar] [CrossRef]
- Pai, S.; Bontempi, D.; Hadzic, I.; Prudente, V.; Sokač, M.; Chaunzwa, T.L.; Bernatz, S.; Hosny, A.; Mak, R.H.; Birkbak, N.J.; et al. Foundation model for cancer imaging biomarkers. Nat. Mach. Intell. 2024, 6, 354–367. [Google Scholar] [CrossRef] [PubMed]
- Suo, X.; Chen, M.; Chen, L.; Luo, C.; Kemp, G.J.; Lui, S.; Sun, H. Automatic identification of Parkinsonism using clinical multi-contrast brain MRI: A large self-supervised vision foundation model strategy. eBioMedicine 2025, 116, 105773. [Google Scholar] [CrossRef] [PubMed]
- Zhou, Y.; Quek, C.W.N.; Zhou, J.; Wang, Y.; Bai, Y.; Ke, Y.; Yao, J.; Gutierrez, L.; Teo, Z.L.; Ting, D.S.J.; et al. Multimodal, Multi-Disease Medical Imaging Foundation Model (MerMED-FM). arXiv 2025, arXiv:2507.00185. [Google Scholar] [CrossRef]
- Beeche, C.; Kim, J.; Tavolinejad, H.; Zhao, B.; Sharma, R.; Duda, J.; Gee, J.; Dako, F.; Verma, A.; Morse, C.; et al. A Pan-Organ Vision-Language Model for Generalizable 3D CT Representations. medRxiv 2025. [Google Scholar] [CrossRef]
- Hu, Y.; Zheng, Y.; Miao, S.; Zhang, X.; Xia, J.; Qi, Y.; Zhang, Y.; He, Y.; Chen, Q.; Ye, J.; et al. Cardiac-CLIP: A Vision-Language Foundation Model for 3D Cardiac CT Images. arXiv 2025, arXiv:2507.22024. [Google Scholar]
- Blankemeier, L.; Cohen, J.P.; Kumar, A.; Van Veen, D.; Gardezi, S.J.S.; Paschali, M.; Chen, Z.; Delbrouck, J.B.; Reis, E.; Truyts, C.; et al. Merlin: A vision language foundation model for 3D computed tomography. Res. Sq. 2024, rs.3.rs-4546309. [Google Scholar] [CrossRef]
- Yan, K.; Wang, X.; Lu, L.; Summers, R.M. DeepLesion: Automated mining of large-scale lesion annotations and universal lesion detection with deep learning. J. Med. Imaging 2018, 5, 036501. [Google Scholar] [CrossRef]
- Zhang, S.; Xu, Y.; Usuyama, N.; Xu, H.; Bagga, J.; Tinn, R.; Preston, S.; Rao, R.; Wei, M.; Valluri, N.; et al. BiomedCLIP: A multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv 2025, arXiv:2303.00915. [Google Scholar]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Pan, S.; Hu, M.; Safari, M.; Shah, K.; Zhao, F.; Wang, T.; Qiu, R.; Yang, X. FoundationMorph: A 3D vision-language foundation model for unsupervised medical image registration. In Proceedings of the Medical Imaging 2025: Imaging Informatics; SPIE: San Francisco, CA, USA, 2025; Volume 13411, pp. 301–309. [Google Scholar]
- Li, Z.; Zhang, J.; Ma, T.; Mok, T.C.; Zhou, Y.J.; Chen, Z.; Ye, X.; Lu, L.; Jin, D. UniReg: Foundation Model for Controllable Medical Image Registration. arXiv 2025, arXiv:2503.12868. [Google Scholar] [CrossRef]
- Pham, X.L.; Vuurberg, G.; Doppen, M.; Roosen, J.; Stille, T.; Ha, T.Q.; Quach, T.D.; Dang, Q.V.; Luu, M.H.; Smit, E.J.; et al. TotalRegistrator: Towards a Lightweight Foundation Model for CT Image Registration. arXiv 2025, arXiv:2508.04450. [Google Scholar] [CrossRef]
- Demir, B.; Tian, L.; Greer, H.; Kwitt, R.; Vialard, F.X.; Estépar, R.S.J.; Bouix, S.; Rushmore, R.; Ebrahim, E.; Niethammer, M. multiGradICON: A foundation model for multimodal medical image registration. In Proceedings of the International Workshop on Biomedical Image Registration; Springer: Berlin/Heidelberg, Germany, 2024; pp. 3–18. [Google Scholar]
- Tian, L.; Greer, H.; Kwitt, R.; Vialard, F.X.; San José Estépar, R.; Bouix, S.; Rushmore, R.; Niethammer, M. uniGradICON: A foundation model for medical image registration. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2024; pp. 749–760. [Google Scholar]
- Zhang, C.; Loecher, M.; Alkan, C.; Yurt, M.; Vasanawala, S.S.; Ennis, D.B. On the Foundation Model for Cardiac MRI Reconstruction. In Proceedings of the Statistical Atlases and Computational Models of the Heart. STACOM 2024. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15448, pp. 226–235. [Google Scholar] [CrossRef]
- Sun, Y.; Wang, L.; Li, G.; Lin, W.; Wang, L. A foundation model for enhancing magnetic resonance images and downstream segmentation, registration and diagnostic tasks. Nat. Biomed. Eng. 2025, 9, 521–538. [Google Scholar] [CrossRef]
- Qin, Z.; He, Z.; Zhang, Y.; Shen, Y.; Li, K. GraphMSR: A graph foundation model-based approach for MRI image super-resolution with multimodal semantic integration. Pattern Recognit. 2025, 171, 112178. [Google Scholar] [CrossRef]
- Xun, S.; Sun, Y.; Chen, J.; Yu, Z.; Tong, T.; Liu, X.; Wu, M.; Tan, T. MedIQA: A Scalable Foundation Model for Prompt-Driven Medical Image Quality Assessment. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2025; pp. 339–349. [Google Scholar]
- Wang, L.; Li, G.; Shi, F.; Cao, X.; Lian, C.; Nie, D.; Liu, M.; Zhang, H.; Li, G.; Wu, Z.; et al. Volume-based analysis of 6-month-old infant brain MRI for autism biomarker identification and early diagnosis. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI); Springer: Cham, Switzerland, 2018; Volume 11072, pp. 411–419. [Google Scholar] [CrossRef]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976. [Google Scholar] [CrossRef]
- Wang, G.; Shi, H.; Chen, Y.; Wu, B. Unsupervised image-to-image translation via long-short cycle-consistent adversarial networks. Appl. Intell. 2022, 53, 17243–17259. [Google Scholar] [CrossRef]
- Niu, C.; Lyu, Q.; Carothers, C.D.; Kaviani, P.; Tan, J.; Yan, P.; Kalra, M.K.; Whitlow, C.T.; Wang, G. Medical multimodal multitask foundation model for lung cancer screening. Nat. Commun. 2025, 16, 1523. [Google Scholar] [CrossRef]
- Zedda, L.; Loddo, A.; Di Ruberto, C. Radio DINO: A foundation model for advanced radiomics and AI-driven medical imaging analysis. Comput. Biol. Med. 2025, 195, 110583. [Google Scholar] [CrossRef] [PubMed]
- Dong, H.; Chen, Y.; Gu, H.; Konz, N.; Chen, Y.; Li, Q.; Mazurowski, M.A. MRI-CORE: A Foundation Model for Magnetic Resonance Imaging. arXiv 2025, arXiv:2506.12186. [Google Scholar] [CrossRef]
- Fu, Y.; Bai, W.; Yi, W.; Manisty, C.; Bhuva, A.N.; Treibel, T.A.; Moon, J.C.; Clarkson, M.J.; Davies, R.H.; Hu, Y. A versatile foundation model for cine cardiac magnetic resonance image analysis tasks. arXiv 2025, arXiv:2506.00679. [Google Scholar]
- Oh, Y.; Seifert, R.; Cao, Y.; Clement, C.; Ferdinandus, J.; Song, S.; Meng, R.; Zeng, F.; Guo, N.; Li, X.; et al. Developing a PET/CT Foundation Model for Cross-Modal Anatomical and Functional Imaging. J. Nucl. Med. 2025, 66, 251598. [Google Scholar]
- Lei, W.; Chen, H.; Zhang, Z.; Luo, L.; Xiao, Q.; Gu, Y.; Gao, P.; Jiang, Y.; Wang, C.; Wu, G.; et al. A Data-Efficient Pan-Tumor Foundation Model for Oncology CT Interpretation. arXiv 2025, arXiv:2502.06171. [Google Scholar]
- Wang, S.; Safari, M.; Li, Q.; Chang, C.W.; Qiu, R.L.; Roper, J.; Yu, D.S.; Yang, X. Triad: Vision Foundation Model for 3D Magnetic Resonance Imaging. arXiv 2025, arXiv:2502.14064. [Google Scholar] [CrossRef]
- Team, L.; Xu, W.; Chan, H.P.; Li, L.; Aljunied, M.; Yuan, R.; Wang, J.; Xiao, C.; Chen, G.; Liu, C.; et al. Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning. arXiv 2025, arXiv:2506.07044. [Google Scholar] [CrossRef]
- Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. arXiv 2023, arXiv:2308.02463. [Google Scholar]
- Pai, S.; Hadzic, I.; Bontempi, D.; Bressem, K.; Kann, B.H.; Fedorov, A.; Mak, R.H.; Aerts, H.J.W.L. Vision Foundation Models for Computed Tomography. arXiv 2025, arXiv:2501.09001. [Google Scholar] [CrossRef]
- Gao, Z.; Zhang, G.; Liang, H.; Liu, J.; Ma, L.; Wang, T.; Guo, Y.; Chen, Y.; Yan, Z.; Chen, X.; et al. A Lung CT Foundation Model Facilitating Disease Diagnosis and Medical Imaging. medRxiv 2025. [Google Scholar] [CrossRef]
- Lai, H.; Jiang, Z.; Yao, Q.; Wang, R.; He, Z.; Tao, X.; Wei, W.; Lv, W.; Zhou, S.K. E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model. arXiv 2024, arXiv:2410.14200. [Google Scholar]
- Tariq, A.; Patel, B.N.; Banerjee, I. Design, training, and applications of foundation model for chest computed tomography volumes. In Proceedings of the Medical Imaging 2024: Image Processing; SPIE: San Francisco, CA, USA, 2024; Volume 12926, pp. 252–256. [Google Scholar]
- Jacob, A.J.; Borgohain, I.; Chitiboi, T.; Sharma, P.; Comaniciu, D.; Rueckert, D. Towards a vision foundation model for comprehensive assessment of Cardiac MRI. arXiv 2024, arXiv:2410.01665. [Google Scholar] [CrossRef]
- Hamamci, I.E.; Er, S.; Wang, C.; Almas, F.; Simsek, A.G.; Esirgun, S.N.; Doga, I.; Durugol, O.F.; Dai, W.; Xu, M.; et al. Developing generalist foundation models from a multimodal dataset for 3D computed tomography. arXiv 2024, arXiv:2403.17834. [Google Scholar]
- Liu, Z.; Tieu, A.; Patel, N.; Soultanidis, G.; Deyer, L.; Wang, Y.; Huver, S.; Zhou, A.; Mei, Y.; Fayad, Z.A.; et al. VIS-MAE: An Efficient Self-supervised Learning Approach on Medical Image Segmentation and Classification. In Proceedings of the Machine Learning in Medical Imaging; Xu, X., Cui, Z., Rekik, I., Ouyang, X., Sun, K., Eds.; Springer: Cham, Switzerland, 2025; pp. 95–107. [Google Scholar]
- Bai, F.; Du, Y.; Huang, T.; Meng, M.Q.H.; Zhao, B. M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models. arXiv 2024, arXiv:2404.00578. [Google Scholar]
- Moor, M.; Banerjee, O.; Abad, Z.S.H.; Krumholz, H.M.; Leskovec, J.; Topol, E.J.; Rajpurkar, P. Foundation models for generalist medical artificial intelligence. Nature 2023, 616, 259–265. [Google Scholar] [CrossRef] [PubMed]
- Kuo, T.L.; Liao, F.T.; Hsieh, M.W.; Chang, F.C.; Hsu, P.C.; Shiu, D.S. RAD-Bench: Evaluating Large Language Models Capabilities in Retrieval Augmented Dialogues. arXiv 2025, arXiv:2409.12558. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
- Mei, X.; Liu, Z.; Robson, P.M.; Marinelli, B.; Huang, M.; Doshi, A.; Jacobi, A.; Cao, C.; Link, K.E.; Yang, T.; et al. RadImageNet: An open radiologic deep learning research dataset for effective transfer learning. Radiol. Artif. Intell. 2022, 4, e210315. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar] [CrossRef]
- Gatidis, S.; Hebb, T.; Frueh, M.; La Fougère, C.; Nikolaou, K.; Pfannenberg, C.; Schölkopf, B.; Kuestner, T.; Cyran, C.; Rubin, D. A whole-body FDG-PET/CT Dataset with manually annotated Tumor Lesions. Sci. Data 2022, 9, 601. [Google Scholar] [CrossRef] [PubMed]
- Johnson, A.E.W.; Pollard, T.J.; Berkowitz, S.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.Y.; Mark, R.G.; Horng, S. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci. Data 2019, 6, 317. [Google Scholar] [CrossRef] [PubMed]
- Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. arXiv 2019, arXiv:1901.07031. [Google Scholar] [CrossRef]
- Pelka, O.; Koitka, S.; Rückert, J.; Nensa, F.; Friedrich, C. Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In Proceedings of the CVII-STENT/LABELS@MICCAI, Granada, Spain, 16 September 2018. [Google Scholar]
- Gamper, J.; Rajpoot, N. Multiple Instance Captioning: Learning Representations from Histopathology Textbooks and Articles. arXiv 2021, arXiv:2103.05121. [Google Scholar] [CrossRef]
- Molino, D.; Feola, F.D.; Faiella, E.; Fazzini, D.; Santucci, D.; Shen, L.; Guarrasi, V.; Soda, P. XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation. arXiv 2025, arXiv:2501.04614. [Google Scholar] [CrossRef] [PubMed]





| Year | Contribution | Challenges | Ref |
|---|---|---|---|
| 2025 | Focusing on large-scale architectures, self-supervised learning, and adaptation to downstream tasks. | Increased computational cost of volumetric data, limited coverage of 3D FMs (12 models), lack of diverse benchmarks, adaptation to the medical domain, explainability, fairness, and robustness. | [13] |
| 2025 | Focusing only on medical image segmentation, reviewing 63 studies in this domain, and evaluating 6 FMs in zero-shot settings. | Need for hybrid 2D–3D model architectures and systematic benchmarking of cross-domain FMs; limited insight into 3D FMs (18 models). | [14] |
| 2025 | Covering LLMs, vision models, omics, and graphs, mostly focusing on 2D methods with minimal coverage of volumetric imaging. | Cost, interpretability, and validation; limited coverage of 3D FMs (six models). | [12] |
| 2024 | Defining the “spectrum” of medical FMs, categorizing them from general vision models to modality-specific, and further to organ- or task-specific models, mainly in 2D medical imaging. | Developing multimodal FMs, scaling, application-driven solutions, and 3D medical imaging; limited coverage of 3D FMs (six models). | [15] |
| 2024 | Considering multimodal clinical data such as images, text, sound, and signals. | 3D medical imaging, limited coverage of 3D FMs (two models); interpretability, explainability, and high computational cost of FMs; integrating multimodal data sources. | [16] |
| 2024 | A focused review of SAM adaptations in biomedical segmentation, covering publications from April to September 2023 and analyzing the covered papers alongside 33 datasets; mainly zero-shot 2D SAM variants. | Limited coverage of emerging 3D/volumetric models (21 3D FMs), generalization discrepancy, fine-tuning dilemma, and modality inconsistencies. | [11] |
| 2024 | A systematic review of multimodal FMs, focusing primarily on vision–language models and 2D tasks, reviewing 97 papers using PRISMA guidelines. | Limited insight into volumetric models (18 3D FMs) or segmentation-specific FMs. | [9] |
| 2024 | Introducing FairMedFM, the first fairness benchmark for FMs in medical imaging, integrating 17 datasets and analyzing 20 FMs, identifying persistent demographic disparities, focusing on bias/fairness. | Limited to medical image classification and segmentation; limited coverage of 3D FMs (seven models). | [8] |
| 2023 | A thorough, taxonomized, task/organ-specific analysis of research progress and limitations. | Emphasis on textually prompted models, largely focused on articles before 2023, limited coverage of 3D radiology and recent FM advances (five models). | [10] |
| Ref | Year | FM | Modalities | |P.Data| | DAV | Anatomy | Architecture | P.Alg.T | Alg |
|---|---|---|---|---|---|---|---|---|---|
| [34] | 2025 | NA | CT | 650K labeled videos | PL | Abdominal organs, mandible, teeth, maxillary bone, pharynx | ViT | SSL | MAE |
| [35] | 2025 | 3D-Heart-Seg | CT, MRI | 2.3K vols | PL | Heart | Vision-LSTM | SSL | Matching probability distribution |
| [36] | 2025 | vesselFM | MRA, CTA, X-ray, vEM, µCTA, two-photon microscopy | 625K | PL | Blood vessels | ViT | Sup | Swin-Transformer encoder, U-Net–style decoder |
| [37] | 2025 | NA | CT | 8K vols | PR | Coronary artery | Swin-UNETR | Sem Sup | FL, semi-supervised pseudo-labeling, distillation |
| [38] | 2025 | SegVol | CT | 90K vols | PL | Whole body | ViT, CLIP (text embedding) | SSL | SimMIM |
| [39] | 2025 | LN Segmentation | CT | 3.3K vols | PL | Lymph nodes | U-Net | Sup | nnU-Net |
| [40] | 2025 | F3-Net | MRI | 5.7K vols | PL | Brain | U-Net | Sup | nnU-Net |
| [41] | 2025 | TotalSegmentator MRI | MRI | 1.1K vols | PR | Whole body | U-Net | Sup | nnU-Net |
| [42] | 2025 | RoMedFormer | CT, MRI | NA | M | Genital and pelvic | Transformer (rotary positional embedding) | SSL | Masked image modeling |
| [43] | 2025 | VISTA3D | CT | 11K vols | M | Whole body | SegResNet | Sem Sup | Pseudo-labels, supervoxels |
| [44] | 2024 | BrainSegFounder | MRI T1w, T1-CE, T2w, T2-FLAIR | 88K vols | PR | Brain | SwinUnet-R | SSL | Masked volume inpainting, rotation prediction, contrastive coding |
| [45] | 2024 | MoME | MRI T1w, T2w, T1-CE, FLAIR, DWI | 6.5K vols | PL | Brain | U-Net | Sup | nnU-Net |
| [46] | 2024 | M4oE | CT, MRI, CE-MRI | 700 vols | PL | Liver, kidney, pancreas, spleen, stomach, gallbladder | Swin Transformer, MLP expert network | SSL | MAE |
| [47] | 2024 | iMOS | CT, MRI, ultrasound, endoscopy, electron microscopy | 877 vols | PL | Whole body | XMem | Sup | Comparison with actual label |
| [48] | 2023 | STU-Net | CT, MRI, PET | 1.2K vols | PL | Whole body | U-Net | Sup | nnU-Net |
| [49] | 2023 | MIS-FM | CT | 113K vols | M | Head, neck, heart, aorta, trachea, esophagus, abdomen | PCT-Net (CNN+ViT) | SSL | Pseudo-segmentation task |
| [50] | 2023 | TotalSegmentator | CT | 1.2K vols | PL | Whole body | U-Net | Sup | nnU-Net |
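
Several of the segmentation FMs above are pretrained with masked image modeling (e.g., MAE in [34,46], SimMIM in SegVol [38], and masked volume inpainting in BrainSegFounder [44]). As a rough illustration of that family of objectives, and not any model's actual implementation, the NumPy sketch below patchifies a 3D volume, masks a random subset of patches, and scores reconstruction error on the masked patches only; the mean-of-visible-patches "decoder" is a hypothetical stand-in for a learned encoder–decoder.

```python
import numpy as np

def mae_reconstruction_loss(volume, mask_ratio=0.75, patch=4, seed=0):
    """Toy masked-autoencoding objective on a 3D volume.

    Splits the volume into non-overlapping cubic patches, masks a random
    subset, "reconstructs" them with a placeholder predictor (the mean of
    the visible patches), and computes MSE on the masked patches only,
    which is the defining trait of MAE/SimMIM-style pretraining.
    """
    rng = np.random.default_rng(seed)
    d, h, w = volume.shape
    # Flatten the volume into (n_patches, patch**3) tokens.
    tokens = (volume
              .reshape(d // patch, patch, h // patch, patch, w // patch, patch)
              .transpose(0, 2, 4, 1, 3, 5)
              .reshape(-1, patch ** 3))
    n = tokens.shape[0]
    masked = rng.choice(n, size=int(mask_ratio * n), replace=False)
    visible = np.setdiff1d(np.arange(n), masked)
    # Placeholder "decoder": predict every masked patch as the mean visible patch.
    prediction = tokens[visible].mean(axis=0, keepdims=True)
    return float(np.mean((tokens[masked] - prediction) ** 2))

vol = np.random.default_rng(1).normal(size=(16, 16, 16))
loss = mae_reconstruction_loss(vol)
```

In the actual FMs the predictor is a ViT or Swin encoder with a lightweight decoder and the volumes are real CT/MRI scans; only the loss structure is shown here.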
| Ref | Year | FM | T | Modalities | |P.Data| | DAV | Anatomy | Task | Architecture | P.Alg.T | Alg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [68] | 2023 | GLIP-T(C) | VL | Natural image, text | 2M images, 16K text descriptions | PL | Hippocampus, thyroid nodule, foot | Classification | GLIP, adapted to medical via text prompts | SSL | Joint and continual learning |
| [69] | 2025 | FM-HCT | V | CT | 361K vols | PR | Head | Classification | ViT | SSL | DINO, MAE |
| [70] | 2025 | MEDFORM | VL | CT, tabular data | 159K slices | PL | Lung, breast, colon | Classification | TANGLE (ResNet, TabNet multimodal encoder) | SSL | SimCLR, multimodal C.L |
| [71] | 2025 | NA | V | MRI (1.5T, 3T T1WI, T2WI, FLAIR, T1CE) | 57K vols | PR | Brain | Classification | ViT | SSL | MiM + C.L |
| [72] | 2025 | CRCFound | V | CT | 5K vols | PR | Colorectal | Classification | ViT | SSL | MAE |
| [73] | 2025 | ViNet | V | CT, MRI | >40K vols | M | Brain, heart, lung, abdomen | Classification | ResNet3D-18 | SSL | Image restoration |
| [74] | 2025 | DeepCNTD-Net | V | CT | 29K vols | PR | Head | Classification | 3D DenseNet, task-specific 3D U-Nets | Sup | Segmentation, classification with labels |
| [75] | 2024 | NA | V | MRI T1w, T1c, T2w, FLAIR | 57K vols | PR | Brain | Classification | ViT-16 | SSL | Reconstruction, C.L |
| [76] | 2024 | BrainIAC | V | MRI T1w, T2w, FLAIR, T1CE | 32K | PL | Brain | Classification, prediction | ResNet50 | SSL | SimCLR |
| [77] | 2024 | NA | V | CT Lesion | 11K vols | PL | Lung nodules, cysts, breast lesions, kidney, bone, liver | Classification, prediction | 3D ResNet-50 | SSL | SimCLR |
| [78] | 2025 | SwinClassifier | V | MRI T1WI, T2WI, FLAIR | 75K vols | PR | Brain | Classification | Swin UNETR | SSL | Reconstruction, C.L |
| [79] | 2025 | MerMED-FM | V | CXR, CT, US, CFP, OCT, histopathology, dermoscopy | 3.3M | PR | Eye, lung, liver, kidney, prostate, skin, bladder | Classification | ViT | SSL | Multimodality agreement via teacher–student network |
| [80] | 2025 | Percival | VL | CT, Report | 402K vols + reports | PR | Thorax, abdomen, pelvis, head, neck, brain, extremities | Classification | Dual Transformer encoders, BERT-style text encoders | SSL | C.L |
| [81] | 2025 | Cardiac-CLIP | VL | CT, Report | 130K vols + reports | M | Heart | Classification | ViT-B/32, PubMedBERT | SSL | MAE, C.L |
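
SimCLR-style contrastive pretraining recurs across the classification FMs above (e.g., MEDFORM [70], BrainIAC [76], and the lesion model of [77]). The NumPy sketch below illustrates the NT-Xent objective on precomputed embeddings; in the actual models the embeddings come from a trainable 3D encoder applied to two augmented views of the same scan, so treating them as given arrays is a simplifying assumption.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent (SimCLR) loss: z1[i] and z2[i] embed two augmented views
    of scan i, and each view must identify its partner among all others."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n2 = len(z)
    # The positive for row i is the other view of the same scan.
    targets = (np.arange(n2) + n2 // 2) % n2
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(n2), targets]))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
aligned = nt_xent(z1, z1 + 0.01 * rng.normal(size=(8, 32)))  # near-identical views
mismatch = nt_xent(z1, rng.normal(size=(8, 32)))             # unrelated "views"
```

Well-aligned view pairs yield a far lower loss than unrelated ones, which is the signal that drives the encoder toward augmentation-invariant representations.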
| Ref | Year | FM | Modalities | |P.Data| | DAV | Anatomy | Task | Architecture | P.Alg.T | Alg |
|---|---|---|---|---|---|---|---|---|---|---|
| [86] | 2025 | FoundationMorph | MRI, CT, PET, clinical text | 23K slices | M | Brain, lung | Image registration | Transformer-based encoder–decoder | SSL | Inpainting/Masked Image Modeling (MiM), C.L |
| [87] | 2025 | UniReg | CT | 9K vols | M | Whole body | Image registration | Convolution-based encoder–decoder | Sup | Self-supervised feature extraction, similarity, supervised regularization through segmentation masks |
| [88] | 2025 | TotalRegistrator | CT | 591 CT scan pairs | PR | Whole body | Image registration | U-Net | Unsup | Similarity, segmentation overlay, deformation field |
| [89] | 2024 | multiGradICON | CT, MRI, CBCT, MRI T1w, T1ce, T2w, FLAIR, DIXON, diffusion-derived measures | >1M 3D image pairs | PL | Lung, knee, brain, abdomen, pancreas | Image registration | Multiscale U-Net | Unsup | GradICON (Similarity with the target image) |
| [90] | 2024 | uniGradICON | CT, MRI, CBCT | 3.78M 3D image pairs | PL | Lung, knee, brain, abdomen | Image registration | Multiscale U-Net | Unsup | GradICON (Similarity with the target image) |
| [91] | 2025 | PCP-UNet | MRI | 150K 2D-t scans | PL | Cardiac | Image reconstruction | U-Net with pattern and contrast prompts, adaptive unrolling, channel-shifting | Sup | Generative (Image reconstruction) |
| [92] | 2024 | BME-x | MRI T1w, T2w | 516 3D scans | PL | Brain | Super-resolution | Densely connected U-Net | Sup | Classification, similarity calculation on generated high-quality image |
| [93] | 2025 | GraphMSR | MRI | 460 subjects | PL | Brain, knee | Super-resolution | GNN with attention mechanism | Sup | Reconstruction from low- to high-resolution, structural similarity |
| [94] | 2025 | MedIQA | CT, MRI, Fundoscopy | 2.5K 3D scans | M | Brain, breast, eye, knee, chest, abdominal | Image quality assessment | ViT-based (MANIQA) | Sup | MSE with target class |
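
The registration FMs in the table are largely trained without ground-truth correspondences: a network predicts a deformation field, and the loss combines similarity between the warped moving image and the fixed image with a regularizer on the field (uniGradICON and multiGradICON [89,90] use a gradient inverse-consistency regularizer). The pure-NumPy sketch below shows this generic objective in 2D with integer displacements and a plain smoothness penalty standing in for GradICON's term; it is a deliberately simplified toy, not any paper's implementation.

```python
import numpy as np

def warp_nn(moving, disp):
    """Nearest-neighbor warp of a 2D image by an integer displacement field."""
    h, w = moving.shape
    ii, jj = np.mgrid[0:h, 0:w]
    ci = np.clip(ii + disp[0], 0, h - 1)
    cj = np.clip(jj + disp[1], 0, w - 1)
    return moving[ci, cj]

def registration_loss(fixed, moving, disp, smooth_weight=0.01):
    """Unsupervised registration objective: MSE similarity after warping
    plus a finite-difference smoothness penalty on the displacement field."""
    similarity = np.mean((warp_nn(moving, disp) - fixed) ** 2)
    smoothness = sum(np.mean(np.gradient(disp[a].astype(float), axis=ax) ** 2)
                     for a in range(2) for ax in range(2))
    return float(similarity + smooth_weight * smoothness)

rng = np.random.default_rng(0)
fixed = rng.normal(size=(32, 32))
moving = np.roll(fixed, 1, axis=0)  # the moving image is shifted by one row
undo = np.stack([np.ones((32, 32), int), np.zeros((32, 32), int)])
good = registration_loss(fixed, moving, undo)                       # undoes the shift
bad = registration_loss(fixed, moving, np.zeros((2, 32, 32), int))  # identity field
```

The displacement that undoes the shift scores a much lower loss than the identity field, which is exactly the gradient signal an unsupervised registration network optimizes.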
| Ref | Year | FM | T | Modalities | |P.Data| | DAV | Anatomy | Task | Architecture | P.Alg.T | Alg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [98] | 2025 | M3FM | VL | CT, EHR, tabular clinical data, text | 117K CT–clinical record pairs | PL | Lungs, heart, airways, chest cavity | Classification, segmentation | CTViT, clinical text transformer, fusion module | SSL | MAE |
| [99] | 2025 | Radio DINO | V | CT, MRI, ultrasound, X-ray | 1.35M 2D slices | PL | Chest, breast, abdomen | Classification, segmentation | 2D ViT | SSL | DINO/DINOv2 |
| [100] | 2025 | MRI-CORE | V | MRI | 6M slices, 110K vols | PR | 18 body locations | Classification, segmentation | ViT | SSL | DINOv2 |
| [101] | 2025 | CineMA | V | CMR | 15M | M | Cardiac | Classification, segmentation | Multiview conv-transformer MAE | SSL | MultiMAE |
| [102] | 2025 | FratMAE | V | PET, CT | 1.2K vols | PL | Whole-body | Classification, segmentation | ViT encoders, cross-attention decoders | SSL | MAE |
| [103] | 2025 | PASTA | VL | CT–report pairs | 30K CT–mask-text pairs | PL | Lung, liver, pancreas, gallbladder, bladder, bone, esophagus, stomach, kidney, colorectum | Classification, segmentation | 3D U-Net | Sup | Sup, synthetic mask |
| [104] | 2025 | Triad | V | MRI T1w, T2w, FLAIR, fMRI, DWI, DCE-MRI | 131K 3D vols | PL | Breast, brain, prostate | Classification, segmentation, registration | 3D U-Net, Swin Transformer | SSL | Reconstruction |
| [84] | 2025 | BiomedCLIP | VL | Image–text pairs | 15M (PMC-15M) | PR | General biomedical: lungs, lymph nodes, organs | Classification, retrieval, VQA | 2D ViT-B/16, PubMedBERT | SSL | C.L |
| [105] | 2025 | Lingshu | VL | X-ray, CT, MRI, ultrasound, fundus, dermoscopy, OCT, PET, endoscopy, digital photography, histopathology, microscopy, text | 3.75M, 1.3M synthetic samples | PL, S | Whole body | Image understanding and reasoning | Qwen2.5-VL-Instruct architecture, vision encoder, projection MLP module, LLM core | Sup | Multistage sup |
| [106] | 2025 | RadFM | VL | 2D/3D scans, text | 16M scan–text pairs | PL | Whole body | Classification, modality recognition, VQA, report generation | ViT, autoregressive text generator | Sup | Next-token prediction |
| [107] | 2025 | CT-FM | V | CT | 148K vols | PL | Whole body | Classification, segmentation, retrieval and semantic-understanding | 3D encoder, SegResNet decoder | SSL | C.L |
| [108] | 2025 | LCTfound | V | CT | 28M slices | PR | Lung, mediastinum, bronchi, arterial, venous networks | Classification, segmentation, prognosis, prediction, virtual imaging, reconstruction, enhancement | U-Net with transformer blocks | SSL | Denoising diffusion probabilistic models |
| [109] | 2024 | E3D-GPT | VL | 3D CT–report pairs | 354K pairs | M | Chest, brain, abdomen | Classification, report generation, and VQA | 3D ViT encoder, MAE decoder, LLM fusion via 3D conv aggregator | SSL | MAE, C.L |
| [110] | 2024 | NA | V | CT | 59K vols, 14.1M slices | PR | Chest, lung, heart, pulmonary arteries | Classification, segmentation | VQ-AE (U-Net) | SSL | Masked region reconstruction, structural similarity |
| [111] | 2024 | NA | V | MRI | 36M slices | PR | Cardiac | Classification, segmentation, landmark localization | ViT-S/8 | SSL | DINO |
| [112] | 2024 | CT-CLIP | VL | CT-reports | 50K vols, 25K reports | PL | Chest | Multiabnormality detection, case retrieval, VQA | CT-ViT, CXR-BERT | SSL | CLIP |
| [113] | 2024 | VIS-MAE | V | CT, MRI, PET, X-ray, ultrasound | 2.5M slices | PL | Abdomen, heart, prostate, brain, breast, thyroid, skin, chest, knee, pulmonary | Classification, segmentation | Swin Transformer | SSL | MAE |
| [82] | 2024 | Merlin | VL | CT, EHR, diagnosis code | 6M images from 15K paired CTs with 1.8M + EHR diagnosis codes and radiology reports | PR | Abdomen | Classification, segmentation, cross-modal retrieval, report generation | Inflated ResNet152, Longformer | Weak Sup | Diagnostic code, C.L |
| [114] | 2024 | M3D-LaMed | VL | CT–report pair | 120K 3D image–text pairs, 662K instruction–response pairs | PL | NA | Segmentation, image-text retrieval, report generation, VQA | 3D ViT, LLaMA-2-7B | SSL | CLIP |
| [115] | 2023 | GMAI | VL | Multimodal | NA | NA | NA | Multitask | NA | SSL | NA |
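
Several of the vision–language multitask FMs above (e.g., CT-CLIP [112], BiomedCLIP [84], M3D-LaMed [114]) rely on CLIP-style image–text contrastive pretraining over scan–report pairs. The NumPy sketch below shows the symmetric contrastive objective on precomputed embeddings, a simplifying assumption made for brevity; the real models jointly train a volumetric image encoder and a text encoder such as CXR-BERT or PubMedBERT.

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric CLIP-style loss: img_emb[i] and txt_emb[i] embed a matched
    scan-report pair; matched pairs are pulled together and mismatched pairs
    pushed apart in both image-to-text and text-to-image directions."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(logits))

    def xent(l):  # row-wise cross-entropy against the diagonal (matched pairs)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(logp[labels, labels])

    return float((xent(logits) + xent(logits.T)) / 2)

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
matched = clip_loss(img, img + 0.01 * rng.normal(size=(8, 16)))  # correct pairing
shuffled = clip_loss(img, np.roll(img, 1, axis=0))               # reports misaligned
```

Correctly paired embeddings give a much lower loss than misaligned ones; at scale this objective yields the zero-shot retrieval and classification abilities reported for these models.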
The table below summarizes, for each reviewed FM, its computational requirements (GPU model, number of GPUs, GPU memory, and batch size, together with the input image shape) and its evaluation protocol (MD, FS/ZS/LD/UD, and comparison with existing FMs).

| Ref | FM | GPU Model | #GPU | GPU Memory (GB) | Batch Size (Total) | Input Image Shape | MD | FS/ZS/LD/UD | Existing FMs |
|---|---|---|---|---|---|---|---|---|---|
| Segmentation FMs |||||||||
| [34] | NA | NVIDIA RTX A6000 | 1 | 48 | 16 (pretraining), 4 (fine-tuning) | 16 × 224 × 224 | Y | N | Y |
| [35] | 3D-Heart-Seg | NA | NA | NA | NA | 128 × 128 × 128 | Y | Y | N |
| [36] | vesselFM | NVIDIA V100 | 1 | 32 | 8 | 128 × 128 × 128 | Y | Y | Y |
| [37] | NA | NA | NA | NA | NA | NA | Y | N | N |
| [38] | SegVol | NVIDIA A100-SXM4 | 8 | 8 × 40 | 32 | 32 × 256 × 256 | Y | Y | Y |
| [39] | LN Segmentation | NVIDIA V100 | 8 | NA | 2 | NA | Y | Y | N |
| [40] | F3-Net | NA | NA | NA | 2 | NA | Y | N | N |
| [41] | TotalSegmentator MRI | NVIDIA GeForce RTX 3090 | 1 | 24 | NA | NA | Y | N | N |
| [42] | RoMedFormer | NA | NA | NA | 2 | NA | N | N | N |
| [43] | VISTA3D | NVIDIA V100 | 64 | 64 × 32 | NA | 308 × 260 × 453 | Y | Y | Y |
| [44] | BrainSegFounder | NVIDIA DGX A100 | 64 | 64 × 320 | 128 | NA | Y | Y | N |
| [45] | MoME | NVIDIA A100 | NA | 40 | NA | 160 × 196 × 160 | Y | Y | Y |
| [46] | M4oE | RTX 4090 | 1 | 24 | 36 | NA | Y | N | Y |
| [47] | iMOS | NA | NA | NA | NA | NA | Y | Y | N |
| [48] | STU-Net | NVIDIA A100 | 1 | 80 | 2 | NA | Y | Y | N |
| [49] | MIS-FM | NVIDIA A100 | 2 | 2 × 80 | 2 | NA | Y | Y | N |
| [50] | TotalSegmentator | NVIDIA GeForce RTX 3090 | 1 | 24 | NA | 512 × 512 × 280, 512 × 512 × 458, 512 × 512 × 824 | N | N | N |
| Classification FMs | |||||||||
| [68] | GLIP-T(C) | NA | NA | NA | 4 | NA | Y | Y | N |
| [69] | FM-HCT | NVIDIA A100 | 4 | 4 × 80 | 256 | 224 × 224 × 224 | Y | Y | N |
| [70] | MEDFORM | NA | NA | NA | NA | NA | Y | Y | Y |
| [71] | NA | NVIDIA A100 | 8 | NA | NA | 96 × 96 × 96 | N | N | N |
| [72] | CRCFound | NVIDIA A100 | 4 | 4 × 40 | 64 | 256 × 256 × 32 | Y | N | N |
| [73] | ViNet | NVIDIA V100 | 1 | NA | NA | NA | Y | N | N |
| [74] | DeepCNTD-Net | NA | NA | NA | NA | NA | Y | N | Y |
| [75] | NA | NVIDIA A100 | 8 | NA | 24 | NA | Y | N | N |
| [76] | BrainIAC | NVIDIA A6000 | 1 | 48 | 32 | NA | Y | Y | Y |
| [77] | NA | NVIDIA Quadro RTX 8000 | 2 | 2 × 48 | 64 | NA | Y | Y | N |
| [78] | SwinClassifier | NVIDIA A100 | 8 | NA | NA | 128 × 128 × 64 | N | N | N |
| [79] | MerMED-FM | NVIDIA H100 | 8 | 8 × 80 | 16 | NA | Y | Y | Y |
| [80] | Percival | NVIDIA A100 | 2 | NA | 48 | NA | Y | Y | Y |
| [81] | Cardiac-CLIP | NVIDIA A6000 | 1 | 48 | 64 | NA | Y | Y | Y |
| Image Registration, Reconstruction, and Super-resolution, Quality Assessment FMs | |||||||||
| [86] | FoundationMorph | NA | NA | NA | NA | 256 × 256 × 128 | NA | NA | NA |
| [87] | UniReg | NVIDIA Tesla V100 | NA | NA | 1 | NA | Y | Y | Y |
| [88] | TotalRegistrator | NVIDIA RTX 3080 Ti | 1 | 12 | 1 | 128 × 96 × 160 | Y | Y | Y |
| [89] | multiGradICON | NA | NA | NA | NA | 175 × 175 × 175 | Y | Y | Y |
| [90] | uniGradICON | NA | NA | NA | NA | 175 × 175 × 175 | Y | Y | Y |
| [91] | PCP-UNet | NA | NA | NA | NA | 0.8 × 0.8 × 0.8 mm³ | N | N | N |
| [92] | BME-x | NA | NA | NA | NA | NA | Y | Y | N |
| [93] | GraphMSR | NVIDIA RTX A100 | 1 | NA | NA | 256 × 256, 320 × 320 | Y | N | Y |
| [94] | MedIQA | NVIDIA RTX A6000 | 1 | 48 | 1 | 224 × 224 | Y | N | Y |
| Multitask FMs | |||||||||
| [98] | M3FM | NVIDIA Tesla V100 | 192 | 192 × 32 | 192, in multitask training: 972 | 16 × 448 × 320, 128 × 448 × 320, 128 × 192 × 224, 128 × 320 × 448 | Y | Y | Y |
| [99] | Radio DINO | NVIDIA A100 | 2 | 2 × 80 | 128, 256, 512 | 224 × 224 | Y | N | N |
| [100] | MRI-CORE | NVIDIA A6000 | 4 | 4 × 48 | 512 | 1024 × 1024 | Y | Y | Y |
| [101] | CineMA | NVIDIA RTX A6000 | 8 | 8 × 48 | 128 | 256 × 256, 192 × 192 × 16 | Y | Y | Y |
| [102] | FratMAE | NVIDIA A100 | 8 | 8 × 40 | 24 | 160 × 160 × 192 | Y | Y | N |
| [103] | PASTA | NVIDIA A800 | 8 | 8 × 40 | 32 | 224 × 224 × 112 | Y | Y | Y |
| [104] | Triad | NVIDIA A100 | 2 | 2 × 80 | 8 | 190 × 192 × 224 | Y | N | Y |
| [84] | BiomedCLIP | NVIDIA A100 | 16 | NA | 4000 | 224 × 224, 384 × 384 | Y | Y | Y |
| [105] | Lingshu | NA | NA | NA | 1 | NA | Y | Y | Y |
| [106] | RadFM | NVIDIA A100 | 32 | 32 × 80 | 1 | 256 × 256 × [4–64] | Y | Y | Y |
| [107] | CT-FM | NVIDIA Quadro RTX 8000 | 4 | 4 × 48 | 64 | 128 × 128 × 48 | Y | Y | Y |
| [108] | LCTfound | NVIDIA V100 | 8 | NA | 36 | NA | Y | Y | Y |
| [109] | E3D-GPT | NVIDIA A800 | 8 | 8 × 40 | 32 | 224 × 224 × 112 | Y | N | Y |
| [110] | NA | NA | NA | NA | NA | NA | Y | Y | N |
| [111] | NA | NVIDIA Tesla H100 | 8 | 8 × 80 | 1024 | 224 × 224 | Y | Y | N |
| [112] | CT-CLIP | NVIDIA A100 | 4 | 4 × 80 | 1 | NA | Y | Y | Y |
| [113] | VIS-MAE | NVIDIA DGX A100 | 8 | NA | 640 | 224 × 224 | Y | Y | N |
| [82] | Merlin | NVIDIA RTX A6000 | 1 | 48 | 18 | 224 × 224 × 160 | Y | Y | Y |
| [114] | M3D-LaMed | NVIDIA A100 | 8 | 8 × 80 | 48 | 32 × 256 × 256 | Y | N | Y |
| [115] | GMAI | NA | NA | NA | NA | NA | NA | NA | NA |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Ghosh, T.; Sheikhi, F.; Guo, J.; Singh, Y.; Younis, K.; Kuanar, S.; Faghani, S.; Farina, E.M.J.d.M.; Huo, Y.; Maleki, F. Foundation Models for Volumetric Medical Imaging: Opportunities, Challenges, and Future Directions. Electronics 2026, 15, 1245. https://doi.org/10.3390/electronics15061245

