Multimodal Artificial Intelligence in Medical Diagnostics
Abstract
1. Introduction
1.1. Motivation
1.2. Contributions
- Availability and characteristics of multimodal datasets used in recent studies.
- Preprocessing techniques to improve data quality and cross-modality alignment, including normalization, resampling, and feature selection.
- Multimodal fusion strategies, including early fusion, intermediate-level feature concatenation, and cross-modal attention mechanisms for representation alignment.
- Deep learning architectures, such as convolutional neural networks (CNNs), transformer-based models, and optimization-based classifiers (e.g., Kernel Extreme Learning Machine (KELM)).
2. Methodology
2.1. Selection Criteria
- Time frame: Articles published between late 2023 and 2025 were selected to reflect the most recent developments in MAI for medical diagnostics. This time frame was chosen to capture emerging work on transformer-based models, instruction-tuned LLMs, neural architecture search, and hybrid fusion frameworks.
- Scope: The review focuses on peer-reviewed studies that develop and evaluate machine learning or deep learning models using two or more data modalities for clinical diagnostic purposes. These modalities include imaging (e.g., MRI, CT, and fundus photography), structured EHR data (e.g., lab values and diagnoses), physiological signals (e.g., ECG and CTG), and free-text input (e.g., radiology reports and QA pairs). Papers addressing fusion techniques, preprocessing strategies, model architectures, and real-world evaluation are included.
- Database searches: Articles were identified through targeted keyword searches across Google Scholar, IEEE Xplore, and ScienceDirect. Search terms included combinations of the following: “multimodal artificial intelligence”, “multimodal machine learning”, “medical diagnostics”, “fusion techniques”, “deep learning in radiology”, “multimodal EHR”, “vision-language models in healthcare”, and “multimodal health data”.
2.2. Selection Steps
- Titles were initially screened for relevance to multimodal learning and diagnostic applications.
- Abstracts and full texts were reviewed to confirm the use of multiple data modalities and relevance to clinical diagnosis or prediction.
- Articles were excluded if they were outside the medical domain, focused solely on unimodal inputs, or were non-English publications.
- Duplicates were removed, and only the most representative or impactful papers were retained for detailed inclusion.
- Only studies directly cited and analyzed in the present review are included in the tables and synthesis.
3. Multimodal Datasets
Ref. | Dataset | Data Modalities | Data Sources | Dataset Size | Medical Diagnosis | Description |
---|---|---|---|---|---|---|
[20] | PAD-UFES-20 Dataset | Clinical images + patient metadata | Federal University of Espírito Santo | 2298 images | Skin Lesions (Various) | Smartphone-acquired lesion images and 21 clinical attributes |
[28] | Guangzhou NEC Dataset | Radiographs + Clinical | Guangzhou Medical Center | 2234 patients | NEC Diagnosis | Abdominal radiographs with 23 structured clinical parameters |
[30] | CTU-UHB Intrapartum CTG Dataset | FHR signals + expert features | University Hospital Brno | 552 samples | Fetal Acidosis | Annotated CTG recordings with clinical metadata |
[31] | Xinqiao Hospital BPPV Dataset | Eye videos + head vectors | Xinqiao Hospital, Army Medical University | 518 patients | BPPV | Eye-tracking video recordings categorized by semicircular canal type |
[33] | SLAKE-VQA Dataset | X-ray, CT, MRI + QA text | Multiple public sources | 642 images + 14,028 QA pairs | Medical VQA | Bilingual annotated radiology QA pairs with medical knowledge graph |
[29] | Multimodal Dataset for Lupus Erythematosus Subtypes | Clinical images + multi-IHC + metadata | 25 Hospitals in China | 446 cases | Lupus Erythematosus | Clinical skin photographs, IHC slides, and systemic involvement index |
[25,26] | ROCO and ROCOv2 | Radiology images + text (captions, MeSH terms) | OpenI and PMC-OAI | 81,000+ (ROCO), +11,000 (ROCOv2) | Radiology-based Disease Identification and Caption Alignment | Annotated radiology image-text pairs across multiple modalities for VQA and retrieval tasks |
[21] | MedICaT Dataset | Medical figures + captions + inline references | PubMed Central Open Access | 217,060 figures from 131,410 papers | Scientific Radiology Figure Interpretation and Retrieval | Annotated compound figures with captions and inline references for subfigure alignment |
[22] | FFA-IR Dataset | Fundus images + bilingual reports + lesion annotations | Clinical ophthalmology sources | 1330 samples | Retinal Disease Diagnosis | Multilingual diagnostic reports aligned with fundus images and lesion-level annotations |
[32] | ADNI Dataset | MRI, PET, CSF biomarkers, cognitive assessments | ADNI Consortium (USA, Canada) | >2500 participants | Alzheimer’s Disease | A comprehensive longitudinal study integrating multimodal neuroimaging and clinical assessments to monitor AD progression. |
[36] | PMC-VQA Dataset | Biomedical figures + VQA text | PubMed Central | 227,000+ QA pairs | Medical Visual Question Answering | Instruction-tuned large-scale VQA benchmark with domain metadata and UMLS/MeSH alignment |
[34] | NACC Dataset | MRI + Clinical + Cognitive + Genetic | 40+ US ADRCs | >19,000 patients | Dementia (AD, FTD, DLB, VaD) | Longitudinal multimodal dataset with FreeSurfer imaging features, neuropsychological assessments, and diagnostic labels |
[23] | MIMIC-III | EHR (structured, time-series) | MIT Lab for Computational Physiology | 53,423 patients | Critical illness, Diabetes, HF, COVID-19 | ICU clinical records with physiological signals, meds, and lab data used for semantic embedding and knowledge-enhanced prediction |
[37] | Pediatric Radiology Dataset | Radiographic images + diagnostic QA pairs | Pediatric Imaging textbook and digital library | 180 images | Pediatric diagnostic VQA | Pediatric chest, abdominal, and musculoskeletal images with MCQs used in multimodal LLM evaluation |
[38] | Taiwan Biobank Dataset (TWB) | Genomics + EHR data | Taiwan Biobank | 150,000+ adults | Population health and disease genetics in Taiwan | SNP arrays, physical exams, lifestyle data, family history, longitudinal follow-up |
[35] | UK Biobank Dataset (UKB) | Genomics + EHR data | UK Biobank | 500,000+ participants | Multimodal disease risk prediction | Genotype arrays, clinical records, health questionnaires, imaging, family history, medication data |
[24] | MIMIC-IV | Structured EHR, time-series vitals, clinical notes | Beth Israel Deaconess Medical Center | 383,220 admissions (78,275 ICU stays) | ICU risk prediction (e.g., mortality, sepsis) | Publicly available dataset featuring de-identified EHR, vital signs, and notes; spans 2008–2019 with longitudinal hospital data and updated coding standards (ICD-10, LOINC). |
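Most of the datasets above pair an imaging modality with structured clinical attributes and a diagnostic label. As a minimal sketch of how such a pairing can be exposed to a model, the snippet below wraps an image folder and a clinical-metadata table in a single PyTorch dataset; the file layout, column names, and image size are hypothetical placeholders rather than the schema of any dataset listed above.

```python
# Minimal sketch of an image + clinical-metadata dataset wrapper.
# All paths and column names are hypothetical placeholders, not the
# actual schema of PAD-UFES-20 or any other dataset cited above.
import pandas as pd
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class ImageTabularDataset(Dataset):
    def __init__(self, csv_path: str, image_dir: str):
        self.table = pd.read_csv(csv_path)          # one row per patient/lesion
        self.image_dir = image_dir
        self.to_tensor = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])
        # Assume every column except the image id and label is a numeric clinical attribute.
        self.feature_cols = [c for c in self.table.columns if c not in ("image_id", "label")]

    def __len__(self) -> int:
        return len(self.table)

    def __getitem__(self, idx: int):
        row = self.table.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['image_id']}.png").convert("RGB")
        clinical = torch.tensor(row[self.feature_cols].to_numpy(dtype="float32"))
        label = torch.tensor(int(row["label"]))
        return self.to_tensor(image), clinical, label
```

A loader built this way yields aligned (image, clinical vector, label) triples, which is the form most fusion models in Section 5 expect as input.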
4. Data Preprocessing Techniques
Ref. | Dataset | Technique (Summary) |
---|---|---|
[28] | Guangzhou NEC Dataset | Radiograph resizing and z-score normalization, clinical feature filtering and LightGBM-based imputation, radiomics extraction and mRMR selection, data augmentation. |
[50] | CTU-UHB Intrapartum CTG Dataset | FHR denoising with sparse dictionary learning, GAN-based data augmentation, signal truncation to 30 min, morphological feature extraction. |
[31] | Xinqiao Hospital BPPV Dataset | Video length normalization, uniform frame sampling, head vector transformation, self-encoder-based spatial embedding. |
[29] | Multimodal Dataset for Lupus Subtypes | Stain normalization, multi-IHC image channel registration, patch tiling, clinical metadata imputation and normalization. |
[67] | Zhu et al. Urology Dataset | ROI selection from WSIs, resolution standardization, expert verification, triple-sampling for output stability, prompt structuring for VQA. |
[34] | NACC Dataset | FreeSurfer segmentation, volumetric/surface normalization, inter-site harmonization, domain-based imputation, dimensionality reduction. |
[23,59,61] | MIMIC-III (EHR-KnowGen) | EHR normalization, semantic embedding using UMLS, EHR encoding with self-attention, contrastive sample generation using supervised contrastive loss, concept alignment via graph embeddings. |
[69] | Diagnostic VQA Benchmark | Prompt construction for GPT-4V, alignment of medical questions with corresponding images, and later stage analysis using named entity recognition and similarity metrics (RadGraph F1, ROUGE-L, cosine similarity). |
[45] | ADNI Dataset | MRI resizing and intensity normalization, feature selection on cognitive scores, SHAP-based feature ranking, Grad-CAM applied for CNN interpretability. |
[49] | Private Hip Fracture Dataset | Radiograph preprocessing with image resizing and augmentation; structured EHR cleaning, normalization, and clinical encoding for tabular integration. |
[68] | Custom Pediatric Appendicitis Dataset | Structured EHR cleaning and feature selection, ultrasound frame sampling, view classifier filtering, clinical-lab alignment. |
[56] | Internal multimodal dataset (CT + reports) | CT pre-processing, report tokenization, visual-text alignment via ResNet50 and RoBERTa encoders. |
[64] | UK Biobank | Genetic variants and clinical records were cleaned, encoded, and scaled; lifestyle and outcome features were extracted; and missing values were imputed using statistical methods. |
[66] | Private dataset + UK Biobank | Fundus images were color-normalized and resized. Vessel masks were extracted to capture retinal structure. Clinical EHR variables were one-hot encoded and aligned with image features before multimodal integration. |
[71] | Private multi-institutional dataset | De-identification, low-quality text filtering, standardization into 26 clinical categories, image normalization and resizing. |
[24] | MIMIC-IV | Time-series vitals were normalized and segmented; structured EHRs were encoded using temporal categorical embeddings. Clinical notes were tokenized and embedded via BioClinicalBERT, enabling shared encoder input across modalities. |
[65] | Private dataset | Temporal frame selection from hysteroscopic videos, image enhancement, manual scoring of injury risk, and structured EMR standardization. |
[72] | MIMIC-CXR | Preprocessing included filtering uncurated report-image pairs and constructing positive/negative samples for contrastive learning. Free-text reports were tokenized and projected into embeddings. Radiographs were encoded via a vision transformer. A curriculum-based sampling strategy enhanced training robustness. |
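Across these studies, a few steps recur: intensity normalization and resizing on the imaging side, and imputation plus scaling on the structured clinical side. The sketch below illustrates this common pattern; the specific imputer, scaler, and target size are illustrative choices, not the exact configurations reported in the cited works.

```python
# Sketch of a generic multimodal preprocessing step: z-score intensity
# normalization + resizing for radiographs, median imputation + scaling
# for clinical features. Parameter choices are illustrative only.
import numpy as np
from skimage.transform import resize
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def preprocess_radiograph(img: np.ndarray, size=(224, 224)) -> np.ndarray:
    """Resize and z-score normalize a single-channel radiograph."""
    img = resize(img.astype("float32"), size, anti_aliasing=True)
    return (img - img.mean()) / (img.std() + 1e-8)

def preprocess_clinical(train_X: np.ndarray, test_X: np.ndarray):
    """Impute missing clinical values and standardize, fitting on training data only."""
    imputer = SimpleImputer(strategy="median")
    scaler = StandardScaler()
    train_out = scaler.fit_transform(imputer.fit_transform(train_X))
    test_out = scaler.transform(imputer.transform(test_X))
    return train_out, test_out
```

Fitting the imputer and scaler on the training split only, as above, avoids leaking test statistics into the model, a point several of the reviewed pipelines handle through explicit cross-validation folds.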
5. Multimodal Fusion Techniques
5.1. Early Fusion
5.2. Late Fusion
5.3. Intermediate Fusion
5.4. Cross-Modal and Architecture Search Fusion
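To make the distinction between these fusion strategies concrete, the sketch below contrasts early, intermediate, and late fusion on a toy image-plus-tabular pair; the encoders, dimensions, and decision-averaging rule are placeholders rather than any specific architecture from the reviewed papers.

```python
# Toy contrast of early, intermediate, and late fusion for an image +
# tabular input pair. All layers are untrained placeholders used only to
# illustrate where in the pipeline the two modalities are combined.
import torch
import torch.nn as nn

IMG_PIXELS, TAB_DIM, N_CLASSES = 64 * 64, 16, 2

img_encoder = nn.Sequential(nn.Linear(IMG_PIXELS, 128), nn.ReLU())  # image branch
tab_encoder = nn.Sequential(nn.Linear(TAB_DIM, 32), nn.ReLU())      # tabular branch

early_head = nn.Linear(IMG_PIXELS + TAB_DIM, N_CLASSES)
mid_head = nn.Linear(128 + 32, N_CLASSES)
img_head, tab_head = nn.Linear(128, N_CLASSES), nn.Linear(32, N_CLASSES)

def early_fusion(image, tabular):
    # Raw inputs are concatenated before any modality-specific encoding.
    return early_head(torch.cat([image.flatten(1), tabular], dim=1))

def intermediate_fusion(image, tabular):
    # Each modality is encoded separately; the learned features are fused.
    return mid_head(torch.cat([img_encoder(image.flatten(1)), tab_encoder(tabular)], dim=1))

def late_fusion(image, tabular):
    # Each modality yields its own prediction; the decisions are averaged.
    p_img = img_head(img_encoder(image.flatten(1))).softmax(-1)
    p_tab = tab_head(tab_encoder(tabular)).softmax(-1)
    return (p_img + p_tab) / 2

image, tabular = torch.randn(4, 1, 64, 64), torch.randn(4, TAB_DIM)
print(intermediate_fusion(image, tabular).shape)  # torch.Size([4, 2])
```

The three variants differ only in where concatenation or averaging happens, which is why intermediate fusion is often preferred in the reviewed studies: it keeps modality-specific encoders while still letting the classifier learn cross-modal interactions.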
6. Multimodal Approaches and Model Architectures
6.1. Hybrid and Attention-Based Architectures
6.2. Transformer-Based Vision-Language Models
6.3. EHR-Centric and Optimization-Based Models
6.4. Tabular-Image Fusion Architectures
6.5. General-Purpose LLMs and Instruction-Tuned Models
6.6. Privacy and Security-Oriented Models
6.7. Comparative Evaluation
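Many of the architectures in this section, particularly the hybrid attention models and transformer-based vision-language models, rely on cross-modal attention, in which tokens from one modality attend to tokens from another. A minimal sketch of such a block is given below; the dimensions and layer choices are arbitrary and do not reproduce any specific published model.

```python
# Minimal cross-modal attention block in which image tokens query
# report/text tokens. Dimensions are arbitrary and weights are untrained.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_tokens, text_tokens):
        # Image tokens are the queries; text tokens supply keys and values.
        attended, _ = self.attn(image_tokens, text_tokens, text_tokens)
        x = self.norm1(image_tokens + attended)
        return self.norm2(x + self.ffn(x))

block = CrossModalAttention()
img = torch.randn(2, 49, 256)   # e.g., a 7x7 grid of image patch embeddings
txt = torch.randn(2, 20, 256)   # e.g., 20 token embeddings from a report
print(block(img, txt).shape)    # torch.Size([2, 49, 256])
```

Stacking such blocks (and, symmetrically, letting text tokens attend to image tokens) is the basic mechanism behind the vision-language alignment used by several models compared in Section 6.7.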
7. Discussion
8. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
EHR | Electronic Health Record |
AUROC | Area Under the Receiver Operating Characteristic Curve |
MRI | Magnetic Resonance Imaging |
AUC | Area Under the Curve |
QA | Question Answering |
CNN | Convolutional Neural Network |
VQA | Visual Question Answering |
UK | United Kingdom |
ADNI | Alzheimer’s Disease Neuroimaging Initiative |
NEC | Necrotizing Enterocolitis |
IHC | Immunohistochemistry |
AI | Artificial Intelligence |
MAI | Multimodal Artificial Intelligence |
CTG | Cardiotocography |
FHR | Fetal Heart Rate |
References
- Albahra, S.; Gorbett, T.; Robertson, S.; D’Aleo, G.; Kumar, S.V.S.; Ockunzzi, S.; Lallo, D.; Hu, B.; Rashidi, H.H. Artificial intelligence and machine learning overview in pathology & laboratory medicine: A general review of data preprocessing and basic supervised concepts. Semin. Diagn. Pathol. 2023, 40, 71–87. [Google Scholar] [CrossRef] [PubMed]
- Najjar, R. Redefining Radiology: A Review of Artificial Intelligence Integration in Medical Imaging. Diagnostics 2023, 13, 2760. [Google Scholar] [CrossRef] [PubMed]
- Pei, X.; Zuo, K.; Li, Y.; Pang, Z. A review of the application of multi-modal deep learning in medicine: Bibliometrics and future directions. Int. J. Comput. Intell. Syst. 2023, 16, 44. [Google Scholar] [CrossRef]
- Barua, A.; Ahmed, M.U.; Begum, S. A systematic literature review on multimodal machine learning: Applications, challenges, gaps and future directions. IEEE Access 2023, 11, 14804–14831. [Google Scholar] [CrossRef]
- Liang, P.P.; Zadeh, A.; Morency, L.P. Foundations and trends in multimodal machine learning: Principles, challenges, and open questions. ACM Comput. Surv. 2024, 56, 1–42. [Google Scholar] [CrossRef]
- Krones, F.; Marikkar, U.; Parsons, G.; Szmul, A.; Mahdi, A. Review of multimodal machine learning approaches in healthcare. arXiv 2024, arXiv:2402.02460. [Google Scholar] [CrossRef]
- Simon, B.; Ozyoruk, K.; Gelikman, D.; Harmon, S.; Türkbey, B. The future of multimodal artificial intelligence models for integrating imaging and clinical metadata: A narrative review. Diagn. Interv. Radiol. 2024. [Google Scholar] [CrossRef]
- Demirhan, H.; Zadrozny, W. Survey of Multimodal Medical Question Answering. BioMedInformatics 2024, 4, 50–74. [Google Scholar] [CrossRef]
- Adewumi, T.; Alkhaled, L.; Gurung, N.; van Boven, G.; Pagliai, I. Fairness and bias in multimodal ai: A survey. arXiv 2024, arXiv:2406.19097. [Google Scholar]
- Isavand, P.; Aghamiri, S.S.; Amin, R. Applications of Multimodal Artificial Intelligence in Non-Hodgkin Lymphoma B Cells. Biomedicines 2024, 12, 1753. [Google Scholar] [CrossRef]
- Laganà, F.; Bibbò, L.; Calcagno, S.; De Carlo, D.; Pullano, S.A.; Pratticò, D.; Angiulli, G. Smart Electronic Device-Based Monitoring of SAR and Temperature Variations in Indoor Human Tissue Interaction. Appl. Sci. 2025, 15, 2439. [Google Scholar] [CrossRef]
- Versaci, M.; Laganà, F.; Manin, L.; Angiulli, G. Soft computing and eddy currents to estimate and classify delaminations in biomedical device CFRP plates. J. Electr. Eng. 2025, 76, 72–79. [Google Scholar] [CrossRef]
- Menniti, M.; Laganà, F.; Oliva, G.; Bianco, M.; Fiorillo, A.S.; Pullano, S.A. Development of Non-Invasive Ventilator for Homecare and Patient Monitoring System. Electronics 2024, 13, 790. [Google Scholar] [CrossRef]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef]
- Huang, D.; Yan, C.; Li, Q.; Peng, X. From Large Language Models to Large Multimodal Models: A Literature Review. Appl. Sci. 2024, 14, 5068. [Google Scholar] [CrossRef]
- Jabeen, S.; Li, X.; Amin, M.S.; Bourahla, O.; Li, S.; Jabbar, A. A review on methods and applications in multimodal deep learning. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–41. [Google Scholar] [CrossRef]
- Li, Y.; Daho, M.E.H.; Conze, P.H.; Zeghlache, R.; Le Boité, H.; Tadayoni, R.; Cochener, B.; Lamard, M.; Quellec, G. A review of deep learning-based information fusion techniques for multimodal medical image classification. Comput. Biol. Med. 2024, 177, 108635. [Google Scholar] [CrossRef]
- Evans, R.S. Electronic health records: Then, now, and in the future. Yearb. Med. Inform. 2016, 25, S48–S61. [Google Scholar] [CrossRef]
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
- Pacheco, A.G.; Lima, G.R.; Salomao, A.S.; Krohling, B.; Biral, I.P.; de Angelo, G.G.; Alves, F.C., Jr.; Esgario, J.G.; Simora, A.C.; Castro, P.B.; et al. PAD-UFES-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief 2020, 32, 106221. [Google Scholar] [CrossRef]
- Subramanian, S.; Wang, L.L.; Mehta, S.; Bogin, B.; van Zuylen, M.; Parasa, S.; Singh, S.; Gardner, M.; Hajishirzi, H. Medicat: A Dataset of Medical Images, Captions, and Textual References; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2112–2120. [Google Scholar] [CrossRef]
- Li, M.; Cai, W.; Liu, R.; Weng, Y.; Zhao, X.; Wang, C.; Chen, X.; Liu, Z.; Pan, C.; Li, M.; et al. Ffa-ir: Towards an explainable and reliable medical report generation benchmark. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), Online, 6–14 December 2021. [Google Scholar]
- Johnson, A.E.; Pollard, T.J.; Shen, L.; Lehman, L.w.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R.G. MIMIC-III, a freely accessible critical care database. Sci. Data 2016, 3, 160035. [Google Scholar] [CrossRef] [PubMed]
- Johnson, A.E.; Bulgarelli, L.; Shen, L.; Gayles, A.; Shammout, A.; Horng, S.; Pollard, T.J.; Hao, S.; Moody, B.; Gow, B.; et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 2023, 10, 1. [Google Scholar] [CrossRef] [PubMed]
- Pelka, O.; Koitka, S.; Rückert, J.; Nensa, F.; Friedrich, C.M. Radiology Objects in COntext (ROCO): A Multimodal Image Dataset. In Proceedings of the Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis, Granada, Spain, 16 September 2018; pp. 180–189. [Google Scholar] [CrossRef]
- Rückert, J.; Bloch, L.; Brüngel, R.; Idrissi-Yaghir, A.; Schäfer, H.; Schmidt, C.S.; Koitka, S.; Pelka, O.; Ben Abacha, A.; García Seco de Herrera, A.; et al. ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset. Sci. Data 2024, 11, 688. [Google Scholar] [CrossRef] [PubMed]
- Pediatric Imaging: A Pediatric Radiology Textbook and Digital Library. (Source of Pediatric Radiology Dataset). Available online: https://pediatricimaging.org (accessed on 20 May 2025).
- Gao, W.; Pei, Y.; Liang, H.; Lv, J.; Chen, J.; Zhong, W. Multimodal AI System for the Rapid Diagnosis and Surgical Prediction of Necrotizing Enterocolitis. IEEE Access 2021, 9, 51050–51064. [Google Scholar] [CrossRef]
- Li, Q.; Yang, Z.; Chen, K.; Zhao, M.; Long, H.; Deng, Y.; Hu, H.; Jia, C.; Wu, M.; Zhao, Z.; et al. Human-multimodal deep learning collaboration in ‘precise’ diagnosis of lupus erythematosus subtypes and similar skin diseases. J. Eur. Acad. Dermatol. Venereol. 2024, 38, 2268–2279. [Google Scholar] [CrossRef]
- Chudáček, V.; Spilka, J.; Burša, M.; Janků, P.; Hruban, L.; Huptych, M.; Lhotská, L. Open access intrapartum CTG database. BMC Pregnancy Childbirth 2014, 14, 16. [Google Scholar] [CrossRef]
- Lu, H.; Mao, Y.; Li, J.; Zhu, L. Multimodal deep learning-based diagnostic model for BPPV. BMC Med. Inform. Decis. Mak. 2024, 24, 82. [Google Scholar] [CrossRef]
- Weiner, M.W.; Aisen, P.S.; Jack, C.R., Jr.; Jagust, W.J.; Trojanowski, J.Q.; Shaw, L.; Saykin, A.J.; Morris, J.C.; Cairns, N.; Beckett, L.A.; et al. The Alzheimer’s disease neuroimaging initiative: Progress report and future plans. Alzheimer’s Dement. J. Alzheimer’s Assoc. 2010, 6, 202–211. [Google Scholar] [CrossRef]
- Liu, B.; Zhan, L.M.; Xu, L.; Ma, L.; Yang, Y.; Wu, X.M. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In Proceedings of the 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), Nice, France, 13–16 April 2021; pp. 1650–1654. [Google Scholar] [CrossRef]
- Beekly, D.L.; Ramos, E.M.; Lee, W.W.; Deitrich, W.D.; Jacka, M.E.; Wu, J.; Hubbard, J.L.; Koepsell, T.D.; Morris, J.C.; Kukull, W.A.; et al. The National Alzheimer’s Coordinating Center (NACC) database: The uniform data set. Alzheimer Dis. Assoc. Disord. 2007, 21, 249–258. [Google Scholar] [CrossRef]
- UK Biobank. New Data & Enhancements to UK Biobank. 2024. Available online: https://www.ukbiobank.ac.uk/enable-your-research/about-our-data (accessed on 4 April 2025).
- Zhang, X.; Wu, C.; Zhao, Z.; Lin, W.; Zhang, Y.; Wang, Y.; Xie, W. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv 2023, arXiv:2305.10415. [Google Scholar]
- Reith, T.P.; D’Alessandro, D.M.; D’Alessandro, M.P. Capability of multimodal large language models to interpret pediatric radiological images. Pediatr. Radiol. 2024, 54, 1729–1737. [Google Scholar] [CrossRef] [PubMed]
- Feng, Y.C.A.; Chen, C.Y.; Chen, T.T.; Kuo, P.H.; Hsu, Y.H.; Yang, H.I.; Chen, W.J.; Su, M.W.; Chu, H.W.; Shen, C.Y.; et al. Taiwan Biobank: A rich biomedical research database of the Taiwanese population. Cell Genom. 2022, 2, 100197. [Google Scholar] [CrossRef] [PubMed]
- Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524. [Google Scholar] [CrossRef]
- Miotto, R.; Wang, F.; Wang, S.; Jiang, X.; Dudley, J.T. Deep learning for healthcare: Review, opportunities and challenges. Briefings Bioinform. 2018, 19, 1236–1246. [Google Scholar] [CrossRef]
- Li, Y.; Ammari, S.; Balleyguier, C.; Lassau, N.; Chouzenoux, E. Impact of preprocessing and harmonization methods on the removal of scanner effects in brain MRI radiomic features. Cancers 2021, 13, 3000. [Google Scholar] [CrossRef]
- Martin, S.A.; Zhao, A.; Qu, J.; Imms, P.E.; Irimia, A.; Barkhof, F.; Cole, J.H.; Initiative, A.D.N. Explainable artificial intelligence for neuroimaging-based dementia diagnosis and prognosis. medRxiv 2025. [Google Scholar] [CrossRef]
- Xue, C.; Kowshik, S.S.; Lteif, D.; Puducheri, S.; Jasodanand, V.H.; Zhou, O.T.; Walia, A.S.; Guney, O.B.; Zhang, J.D.; Pham, S.T.; et al. AI-based differential diagnosis of dementia etiologies on multimodal data. Nat. Med. 2024, 30, 2977–2989. [Google Scholar] [CrossRef]
- Sheng, J.; Zhang, Q.; Zhang, Q.; Wang, L.; Yang, Z.; Xin, Y.; Wang, B. A hybrid multimodal machine learning model for Detecting Alzheimer’s disease. Comput. Biol. Med. 2024, 170, 108035. [Google Scholar] [CrossRef]
- Jahan, S.; Abu Taher, K.; Kaiser, M.S.; Mahmud, M.; Rahman, M.S.; Hosen, A.S.; Ra, I.H. Explainable AI-based Alzheimer’s prediction and management using multimodal data. PloS ONE 2023, 18, e0294253. [Google Scholar] [CrossRef]
- Feng, Z.; Sivak, J.A.; Krishnamurthy, A.K. Multimodal fusion of echocardiography and electronic health records for the detection of cardiac amyloidosis. In Proceedings of the International Conference on Artificial Intelligence in Medicine, Portorož, Slovenia, 12–15 June 2023; Springer: Berlin/Heidelberg, Germany, 2024; pp. 227–237. [Google Scholar] [CrossRef]
- Fleurence, R.L.; Curtis, L.H.; Califf, R.M.; Platt, R.; Selby, J.V.; Brown, J.S. Launching PCORnet, a national patient-centered clinical research network. J. Am. Med. Inform. Assoc. 2014, 21, 578–582. [Google Scholar] [CrossRef]
- Bin, Y.; Yang, Y.; Shen, F.; Xu, X.; Shen, H.T. Bidirectional long-short term memory for video description. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 436–440. [Google Scholar] [CrossRef]
- Schilcher, J.; Nilsson, A.; Andlid, O.; Eklund, A. Fusion of electronic health records and radiographic images for a multimodal deep learning prediction model of atypical femur fractures. Comput. Biol. Med. 2024, 168, 107704. [Google Scholar] [CrossRef] [PubMed]
- Zhao, Z.; Zhu, J.; Jiao, P.; Wang, J.; Zhang, X.; Lu, X.; Zhang, Y. Hybrid-FHR: A multi-modal AI approach for automated fetal acidosis diagnosis. BMC Med. Inform. Decis. Mak. 2024, 24, 19. [Google Scholar] [CrossRef] [PubMed]
- Bowles, C.; Chen, L.; Guerrero, R.; Bentley, P.; Gunn, R.; Hammers, A.; Dickie, D.A.; Hernández, M.V.; Wardlaw, J.; Rueckert, D. Gan augmentation: Augmenting training data using generative adversarial networks. arXiv 2018, arXiv:1810.10863. [Google Scholar]
- Wang, Y.; Yin, C.; Zhang, P. Multimodal risk prediction with physiological signals, medical images and clinical notes. Heliyon 2024, 10, e26772. [Google Scholar] [CrossRef]
- Yang, L.; Xu, S.; Sellergren, A.; Kohlberger, T.; Zhou, Y.; Ktena, I.; Kiraly, A.; Ahmed, F.; Hormozdiari, F.; Jaroensri, T.; et al. Advancing Multimodal Medical Capabilities of Gemini. arXiv 2024, arXiv:2405.03162. [Google Scholar]
- Bodenreider, O. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 2004, 32, D267–D270. [Google Scholar] [CrossRef]
- Lipscomb, C.E. Medical subject headings (MeSH). Bull. Med. Libr. Assoc. 2000, 88, 265. [Google Scholar]
- Yao, Z.; Lin, F.; Chai, S.; He, W.; Dai, L.; Fei, X. Integrating medical imaging and clinical reports using multimodal deep learning for advanced disease analysis. In Proceedings of the 2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 29–31 August 2024; pp. 1217–1223. [Google Scholar] [CrossRef]
- Park, S.; Lee, E.S.; Shin, K.S.; Lee, J.E.; Ye, J.C. Self-supervised multi-modal training from uncurated images and reports enables monitoring AI in radiology. Med. Image Anal. 2024, 91, 103021. [Google Scholar] [CrossRef]
- Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 63–72. [Google Scholar] [CrossRef]
- Cai, Y.; Liu, X.; Fan, M.; Wang, H.; Liu, M.; Yu, Y.; Wu, Y. Contrastive Learning on Multimodal Analysis of Electronic Health Records. Sci. Rep. 2024, 14, 3438. [Google Scholar]
- Suresh, H.; Hunt, N.; Johnson, A.; Celi, L.A.; Szolovits, P.; Ghassemi, M. Clinical intervention prediction and understanding using deep networks. arXiv 2017, arXiv:1705.08498. [Google Scholar]
- Niu, S.; Ma, J.; Bai, L.; Wang, Z.; Guo, L.; Yang, X. EHR-KnowGen: Knowledge-enhanced multimodal learning for disease diagnosis generation. Inf. Fusion 2024, 102, 102069. [Google Scholar] [CrossRef]
- Bampa, M.; Miliou, I.; Jovanovic, B.; Papapetrou, P. M-ClustEHR: A multimodal clustering approach for electronic health records. Artif. Intell. Med. 2024, 154, 102905. [Google Scholar] [CrossRef] [PubMed]
- Chung, R.H.; Onthoni, D.; Lin, H.M.; Li, G.H.; Hsiao, Y.P.; Zhuang, Y.S.; Onthoni, A.; Lai, Y.H.; Chiou, H.Y. Multimodal Deep Learning for Classifying Diabetes: Analyzing Carotid Ultrasound Images from UK and Taiwan Biobanks and Their Cardiovascular Disease Associations. 2024; preprint. [Google Scholar]
- Zeng, L.; Ma, P.; Li, Z.; Liang, S.; Wu, C.; Hong, C.; Li, Y.; Cui, H.; Li, R.; Wang, J.; et al. Multimodal Machine Learning-Based Marker Enables Early Detection and Prognosis Prediction for Hyperuricemia. Adv. Sci. 2024, 11, 2404047. [Google Scholar] [CrossRef] [PubMed]
- Li, B.; Chen, H.; Lin, X.; Duan, H. Multimodal Learning system integrating electronic medical records and hysteroscopic images for reproductive outcome prediction and risk stratification of endometrial injury: A multicenter diagnostic study. Int. J. Surg. 2024, 110, 3237–3248. [Google Scholar] [CrossRef]
- Lee, Y.C.; Cha, J.; Shim, I.; Park, W.Y.; Kang, S.W.; Lim, D.H.; Won, H.H. Multimodal deep learning of fundus abnormalities and traditional risk factors for cardiovascular risk prediction. npj Digit. Med. 2023, 6, 14. [Google Scholar] [CrossRef]
- Zhu, L.; Lai, Y.; Ta, N.; Cheng, L.; Chen, R. Multimodal approach in the diagnosis of urologic malignancies: Critical assessment of ChatGPT-4V’s image-reading capabilities. JCO Clin. Cancer Inform. 2024, 8, e2300275. [Google Scholar] [CrossRef]
- Lin, A.C.; Liu, Z.; Lee, J.; Ranvier, G.F.; Taye, A.; Owen, R.; Matteson, D.S.; Lee, D. Generating a multimodal artificial intelligence model to differentiate benign and malignant follicular neoplasms of the thyroid: A proof-of-concept study. Surgery 2024, 175, 121–127. [Google Scholar] [CrossRef]
- Panagoulias, D.P.; Virvou, M.; Tsihrintzis, G.A. Evaluating LLM–Generated Multimodal Diagnosis from Medical Images and Symptom Analysis. arXiv 2024, arXiv:2402.01730. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Yildirim, N.; Richardson, H.; Wetscherek, M.T.; Bajwa, J.; Jacob, J.; Pinnock, M.A.; Harris, S.; Coelho De Castro, D.; Bannur, S.; Hyland, S.; et al. Multimodal healthcare AI: Identifying and designing clinically relevant vision-language applications for radiology. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–22. [Google Scholar] [CrossRef]
- Johnson, A.E.; Pollard, T.J.; Greenbaum, N.R.; Lungren, M.P.; Deng, C.y.; Peng, Y.; Lu, Z.; Mark, R.G.; Berkowitz, S.J.; Horng, S. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. PhysioNet 2024. [Google Scholar] [CrossRef]
- Tortora, M.; Cordelli, E.; Sicilia, R.; Nibid, L.; Ippolito, E.; Perrone, G.; Ramella, S.; Soda, P. RadioPathomics: Multimodal learning in non-small cell lung cancer for adaptive radiotherapy. IEEE Access 2023, 11, 47563–47578. [Google Scholar] [CrossRef]
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar] [CrossRef]
- Huang, S.C.; Shen, L.; Lungren, M.P.; Yeung, S. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 3922–3931. [Google Scholar] [CrossRef]
- Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C.P. Contrastive Learning of Medical Visual Representations from Paired Images and Text. Proc. Mach. Learn. Health Care 2022, 182, 1–24. [Google Scholar]
- Guarrasi, V.; Aksu, F.; Caruso, C.M.; Di Feola, F.; Rofena, A.; Ruffini, F.; Soda, P. A systematic review of intermediate fusion in multimodal deep learning for biomedical applications. Image Vis. Comput. 2025, 158, 105509. [Google Scholar] [CrossRef]
- Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-language models for vision tasks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644. [Google Scholar] [CrossRef] [PubMed]
- Gong, H.; Chen, G.; Liu, S.; Yu, Y.; Li, G. Cross-modal self-attention with multi-task pre-training for medical visual question answering. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 16–19 November 2021; pp. 456–460. [Google Scholar] [CrossRef]
- Rajabi, N.; Kosecka, J. Towards grounded visual spatial reasoning in multi-modal vision language models. arXiv 2023, arXiv:2308.09778. [Google Scholar]
- Wang, Y.; Chen, W.; Han, X.; Lin, X.; Zhao, H.; Liu, Y.; Zhai, B.; Yuan, J.; You, Q.; Yang, H. Exploring the reasoning abilities of multimodal large language models (mllms): A comprehensive survey on emerging trends in multimodal reasoning. arXiv 2024, arXiv:2401.06805. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, Zhuhai, China, 17–20 February 2023; pp. 19730–19742. [Google Scholar]
- Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478. [Google Scholar]
- Cui, S.; Wang, J.; Zhong, Y.; Liu, H.; Wang, T.; Ma, F. Automated fusion of multimodal electronic health records for better medical predictions. In Proceedings of the 2024 SIAM International Conference on Data Mining (SDM), Houston, TX, USA, 18–20 April 2024; pp. 361–369. [Google Scholar] [CrossRef]
- Ren, P.; Xiao, Y.; Chang, X.; Huang, P.Y.; Li, Z.; Chen, X.; Wang, X. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Comput. Surv. (CSUR) 2021, 54, 1–34. [Google Scholar] [CrossRef]
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41. [Google Scholar] [CrossRef]
- Irvin, J.; Rajpurkar, P.; Ko, M.; Yu, Y.; Ciurea-Ilcus, S.; Chute, C.; Marklund, H.; Haghgoo, B.; Ball, R.; Shpanskaya, K.; et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 590–597. [Google Scholar] [CrossRef]
- Ren, H.; Lu, W.; Xiao, Y.; Chang, X.; Wang, X.; Dong, Z.; Fang, D. Graph convolutional networks in language and vision: A survey. Knowl.-Based Syst. 2022, 251, 109250. [Google Scholar] [CrossRef]
- Zhu, L.; Mou, W.; Lai, Y.; Chen, J.; Lin, S.; Xu, L.; Lin, J.; Guo, Z.; Yang, T.; Lin, A.; et al. Step into the era of large multimodal models: A pilot study on ChatGPT-4V (ision)’s ability to interpret radiological images. Int. J. Surg. 2024, 110, 4096–4102. [Google Scholar] [CrossRef]
- Latif, G.; Alghazo, J.; Mohammad, N.; Abdelhamid, S.E.; Brahim, G.B.; Amjad, K. A Novel Fragmented Approach for Securing Medical Health Records in Multimodal Medical Images. Appl. Sci. 2024, 14, 6293. [Google Scholar] [CrossRef]
Ref. | Dataset | Model Type | Fusion Techniques | Evaluation Metrics |
---|---|---|---|---|
[50] | CTU-UHB Intrapartum CTG Dataset | Hybrid-FHR (SE-TCN + handcrafted features + CMFF) | Intermediate fusion using multi-head attention on expert and deep FHR features | Accuracy = 96.8%, Sensitivity = 96%, Specificity = 97.5%, F1-score = 96.7% |
[31] | Xinqiao Hospital BPPV Dataset | BKTDN (3D-CNN + TDN + Self-Encoder + MLP) | Cross-attention fusion (eye-movement + head vectors) | Accuracy = 81.7%, Precision = 82.1%, Sensitivity = 94.1%, Specificity = 96.5% |
[46] | CDW-H Dataset | Transformer | Intermediate fusion, Early + Late variants | AUROC = 94% |
[28] | Guangzhou NEC Dataset | SENet-154 + LightGBM | Decision-level late fusion of radiomics and clinical features | Diagnosis: AUC = 93.37%, Accuracy = 91.57%; Surgery: AUC = 94.13%, Accuracy = 88.61% |
[53] | Gemini | Vision Transformer + Transformer Text Encoder | Intermediate fusion via cross-attention; contrastive learning + instruction tuning | Zero-shot AUC = 86.7%, Top-1 retrieval = 71.6%, Top-5 = 89.3% |
[67] | Zhu et al. Urology Dataset | ChatGPT-4V | Prompt-based vision-language reasoning, late fusion via conversational interaction | RCC AUC = 87.1%, Sensitivity = 98%, Specificity = 70%, F1 = 86% |
[29] | Multimodal Dataset for Lupus Subtypes | ResNet-50 + EfficientNet-B1 + MLP | Decision-level fusion (multi-IHC tiles + clinical photos + metadata) | AUROC = 98.4%, Accuracy = 82.88% |
[44] | ADNI | ILHHO-KELM | Intermediate fusion (MRI, PET, CSF feature concatenation with ILHHO feature selection) | Accuracy = 99.2% |
[36] | PMC-VQA Dataset | BLIP-2, MiniGPT-4 | Instruction tuning, Template-based QA generation | Accuracy (BLIP-2) = 71.2% |
[42] | NACC Dataset | FCN-based dual-stream classifier | Intermediate fusion, cognitive + imaging feature concatenation, saliency-based interpretability | Accuracy (multi-stage): >85% |
[61] | MIMIC-III | EHR-KnowGen (EHR encoder + GCN + fusion) | Intermediate fusion via semantic EHR embeddings and GCN-based concept alignment | AUC (Diabetes: 81.5%, HF: 87.8%, COVID-19: 85.1%) |
[85] | MIMIC-III | AutoFM (NAS) | Intermediate fusion using architecture search across EHR feature groups | AUROC: 84.8% (HF), 85.7% (diabetes), 91.4% (mortality) |
[37] | Pediatric Radiology Dataset | BLIP-2, MiniGPT-v2 | Instruction-tuned encoder-decoder with vision-language fusion | Accuracy: 73.3% (BLIP-2), 56.7% (MiniGPT-v2) |
[59] | MIMIC-III Dataset | Transformer + Contrastive Learning | Intermediate fusion of structured features + supervised contrastive objective | Macro F1 = 55.6%, AUC = 80.1% |
[69] | Public MCQ Benchmarks | GPT-4V (black-box VLM) | Prompt-driven vision-language fusion, diagnostic report generation | RadGraph F1 = 77.3%, Cosine Similarity = 93.4% |
[45] | ADNI Dataset | CNN + Clinical Scoring Model | Intermediate fusion of MRI and cognitive-demographic vectors | Accuracy = 94.5% |
[49] | Hip Fracture Dataset | DenseNet + Tabular MLP | Intermediate fusion of DenseNet radiograph features and clinical variables | AUROC = 84% |
[68] | Pediatric Appendicitis Dataset | CNN + MLP (dual-branch) | Intermediate fusion; concatenation of ultrasound embeddings and EHR features | AUROC = 86% |
[56] | Internal dataset (CT + Reports) | ResNet50 + RoBERTa + Fusion Decoder | Cross-attention, Intermediate fusion, Knowledge-based fusion | Accuracy = 96.42%, Recall = 98.48%, F1 = 97%, IoU = 89% |
[63] | UKB + TWB | XGBoost + FFNN | Late fusion, feature concatenation of genetic + clinical data | AUROC = 81.8% (UKB), 82.1% (TWB) for diabetes risk |
[66] | Private dataset + UKB | ResNet-50 for fundus image encoding, MLP for structured clinical data | Feature-level fusion via concatenation followed by fully connected layers | AUROC = 87% (internal), AUROC = 85% (UK Biobank external validation) |
[71] | Private multi-institutional dataset | Vision-language (GLoRIA, ConVIRT variants) | Contrastive pretraining with image-report pairs | AUC (84%), Retrieval Top-5 (78%), Pointing Game Accuracy (74%) |
[64] | UK Biobank, Private dataset (Nanfang) | Ensemble model combining GBDT, LR, SVM, and neural networks | Intermediate fusion of genetic and clinical features | AUROC: 81.6% (internal), 79.2% (external) |
[52] | MIMIC-IV | Multitask Transformer Encoder | Intermediate fusion with shared temporal encoder across vitals, notes, and EHR | ICU mortality (AUROC 91.1%), Sepsis (AUROC 88.5%) |
[65] | Private dataset | 3D-CNN + FC network | Intermediate fusion of hysteroscopic video features and EMR embeddings | AUROC 85.4% for injury classification, AUROC 83.7% for outcome prediction |
[57] | MIMIC-CXR | Self-supervised transformer (ViT+text encoder) | Cross-modal contrastive alignment (image-text) | 78.1% AUROC (zero-shot classification), BLEU and CIDEr scores outperforming supervised baselines (report generation) |
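Most studies in the table report AUROC together with accuracy, sensitivity, specificity, and F1-score. The short sketch below shows how these metrics can be computed from predicted probabilities; the labels and scores are synthetic stand-ins rather than outputs of any reviewed model.

```python
# Sketch of the evaluation metrics most commonly reported above,
# computed on synthetic predictions rather than any real model output.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                                # ground-truth labels
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 200), 0, 1)    # scores loosely tracking the labels
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"AUROC       = {roc_auc_score(y_true, y_prob):.3f}")
print(f"Accuracy    = {(tp + tn) / len(y_true):.3f}")
print(f"Sensitivity = {tp / (tp + fn):.3f}")
print(f"Specificity = {tn / (tn + fp):.3f}")
print(f"F1-score    = {f1_score(y_true, y_pred):.3f}")
```

Because AUROC is computed from the continuous scores while the remaining metrics depend on the 0.5 decision threshold, the two families of metrics are not directly comparable across studies that choose different operating points.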