Systematic Review

Multimodal Models in Healthcare: Methods, Challenges, and Future Directions for Enhanced Clinical Decision Support

Department of Computer Science, New York Institute of Technology, New York, NY 10023, USA
* Authors to whom correspondence should be addressed.
Information 2025, 16(11), 971; https://doi.org/10.3390/info16110971
Submission received: 19 September 2025 / Revised: 30 October 2025 / Accepted: 5 November 2025 / Published: 10 November 2025
(This article belongs to the Special Issue Artificial Intelligence-Based Digital Health Emerging Technologies)

Abstract

Decision-making in modern healthcare increasingly relies on integrating a variety of data sources, including patient demographics, medical imaging, laboratory results, clinical narratives, and temporal data, which traditional computational methodologies struggle to integrate and model accurately. This paper evaluates the latest methodologies that integrate diverse data types, including medical images, clinical notes, temporal measurements, and structured tables, through techniques such as feature fusion, attention-based prioritization of salient information, and graph-based modeling. We also assess pre-training, fine-tuning, and evaluation procedures used in model development. By synthesizing findings from 50 of 91 peer-reviewed papers published between 2020 and 2024, we demonstrate that the integration of structured and unstructured data significantly improves performance in tasks like diagnosis, prognosis prediction, and personalized treatment. This review surveys major multimodal datasets and applications across several therapeutic domains while addressing critical issues such as data heterogeneity, scalability, interpretability, and ethical considerations. This paper highlights the transformative potential of multimodal models in improving clinical decision support, providing a framework for future research to advance precision medicine and enhance healthcare outcomes.

1. Introduction

The use of multimodal models has become a driving force in the field of healthcare, drawing on many different types of data to optimize clinical decision-making and patient care. Such models employ structured data, medical images, clinical notes, and other modalities to present a whole picture of a patient’s health at a given moment [1]. The emergence of artificial intelligence, including but not limited to large language models, has significantly enhanced the applicability of multimodal approaches in medicine [2]. Studies have shown improvements in the performance of multimodal models compared to unimodal ones for a variety of clinical tasks, such as mortality prediction, diagnosis, length-of-stay prediction, and readmission risk [3]. Overall, using several modalities is beneficial not only from a performance perspective but potentially also for understanding patient concerns and personalizing treatment plans.
Nevertheless, there are still several barriers to the development and adoption of multimodal models, including data selection and integration, a lack of clarity and explanation of results, and privacy concerns [4]. The potential performance gains nevertheless stimulate research and advancement in the healthcare domain despite these challenges [5]. Figure 1 demonstrates how a multimodal model works for disease prediction and forecasting clinical outcomes. The goal of this review is to outline the most relevant directions and intellectual contributions in multimodal healthcare modeling in recent years.
The critical need for a systematic synthesis of the swiftly evolving field of multimodal healthcare modeling (2020–2024) is the driving force behind this review. The operational sequence is illustrated in the framework in Figure 1 to facilitate comprehension of the scope of this review: to support Medical Prediction Tasks, a variety of Clinical Data Modalities are integrated, including Image, Text, Time-Series, and Tabular data. Machine Learning Models process these data to produce a Forecast of Clinical Outcome, which subsequently informs the Decision-Making Process. Our work is highly relevant due to the absence of standardized fusion techniques and the heterogeneity of current architectures. It provides an evidence-based map of the current state of the art, allowing researchers and clinicians to develop scalable, reliable, and transparent multimodal solutions.

2. Related Literature

In recent years, healthcare systems have increasingly leveraged diverse patient data and integrated them through multimodal models. The primary goal of these models is to facilitate accurate, comprehensive, and timely clinical diagnoses. We have witnessed significant developments in medical imaging, smart care systems, and sentiment analysis. In particular, large language models (LLMs) have influenced healthcare analytics and decision-making. The connection is rooted in their architectural synergy: modern multimodal models often utilize transformer-based LLMs as powerful encoders to derive rich, contextual representations from unstructured text data (such as clinical notes), which are then fused with structured data and images for prediction.
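As a concrete illustration of this pattern, the following minimal PyTorch sketch shows intermediate (feature-level) fusion: a small CNN encodes an image, a feed-forward network encodes tabular features, and a pre-computed note embedding (for example, from a clinical language model) is projected into the same latent space before a shared classification head. All module names, dimensions, and toy inputs are illustrative assumptions, not a specific model from the reviewed studies.

```python
# Illustrative sketch of intermediate (feature-level) fusion; all names and
# dimensions are assumptions, not a specific model from the reviewed studies.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Tiny CNN standing in for an imaging backbone (e.g., a ResNet)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, out_dim),
        )
    def forward(self, x):
        return self.features(x)

class IntermediateFusionModel(nn.Module):
    """Encodes each modality separately, then fuses in a shared latent space."""
    def __init__(self, text_dim=768, tab_dim=20, latent_dim=128, n_classes=2):
        super().__init__()
        self.image_enc = ImageEncoder(latent_dim)
        self.text_proj = nn.Linear(text_dim, latent_dim)   # projects note embeddings
        self.tab_enc = nn.Sequential(nn.Linear(tab_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))
        self.classifier = nn.Sequential(nn.Linear(3 * latent_dim, 64), nn.ReLU(),
                                        nn.Linear(64, n_classes))
    def forward(self, image, text_emb, tabular):
        z = torch.cat([self.image_enc(image),
                       self.text_proj(text_emb),
                       self.tab_enc(tabular)], dim=-1)
        return self.classifier(z)

# Toy forward pass with random tensors standing in for real patient data.
model = IntermediateFusionModel()
logits = model(torch.randn(4, 1, 64, 64),   # e.g., an imaging patch
               torch.randn(4, 768),          # note embedding from a language model
               torch.randn(4, 20))           # structured EHR features
print(logits.shape)  # torch.Size([4, 2])
```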
The field of multimodal medical AI has been the subject of several comprehensive surveys. Xu et al. [6] provided a general review on the synergy of multi-modal data and AI technologies in medical diagnosis, while Shetty et al. [7] offered an extensive review on multimodal data analysis, covering open issues and future directions.
Tian et al. [8] highlight the transformative influence of LLMs on medical imaging interpretation, particularly through their integration with high-performance transfer learning techniques. This integration enables synergistic data analysis and fosters seamless clinical interactions that are both efficient and cost-effective. In addition to highlighting the breadth of multimodal integration, Cai et al. [9] illustrate the potential of advanced multimodal smart care systems to improve patient classification, diagnosis and treatment. Their approach capitalizes on semantic perception, alignment, and entity association mining, thus uncovering rich multimodal associations with high semantic fidelity. Expanding on this concept, Shaik et al. [10] investigate the application of Data, Information, Knowledge, and Wisdom (DIKW) frameworks within smart healthcare through multimodal fusion techniques. The research adopts robust analytical strategies, including feature selection, rule-based systems, machine learning, deep learning, and natural language processing.
Despite the achievements of multimodal models, based on the aforementioned studies, challenges remain, including performance, clinical interpretability, and trust. For instance, some models operate as black boxes, offering high accuracy but limited insight into how decisions are made. This can complicate clinical validation and regulatory approval, and may hinder clinician trust. There is a growing interest in the creation of interactive visual analytics systems that can tackle tasks such as sentiment analysis in the field of healthcare and assist in the interpretation and explanation of multimodal models [11]. Such studies enhance model transparency, fostering user trust and facilitating broader clinical adoption.
This study aims to address the gaps identified in previous research by integrating the information collected on multimodal models within smart healthcare systems, medical image computing, data integration methodologies, and sentiment analysis [12]. While existing surveys provide comprehensive foundational insights, our work is specifically designed to address the gap left by these reviews: a comprehensive, systematic synthesis focused exclusively on the fusion architectures, metric usage, and application patterns of contemporary multimodal models (2020–2024) is lacking. Therefore, as we outline current research trends and future perspectives, we emphasize the transformative potential of multimodal modeling in healthcare. Our focus is to evaluate how these models can revolutionize clinical decision-making, patient welfare, and operational efficiency, substantially mitigating challenges related to data interpretation and integration.

3. Methodology

We conducted a rigorous and systematic literature search to identify and include high-quality, relevant studies for this review. The process involved well-defined search queries, explicit inclusion and exclusion criteria, and the use of reputable databases to ensure comprehensive and unbiased coverage. We restricted our inclusion period to studies published from 2020 onwards, as initial scoping searches conducted on all three databases showed that multimodal healthcare studies were rare before 2020. Furthermore, even though the search window extended from 2020 through the end of 2024, we found only six studies on cancer and seven on heart-, lung-, or kidney-related conditions across ACM, IEEE, and PubMed that used multiple modalities. The search strategy was carefully designed to address the following research questions (RQs):
  • RQ1: What are the current state-of-the-art multimodal models used in healthcare applications, and how are they designed?
  • RQ2: Which data modalities are most commonly integrated, and what fusion techniques are utilized?
  • RQ3: What datasets and evaluation metrics are used to assess the performance of these models?
  • RQ4: What are the key challenges and future research directions in developing multimodal models for healthcare?

3.1. Search Databases and Queries

We target three widely recognized databases: IEEE Xplore, the ACM Digital Library, and PubMed. All searches were conducted on 6 December 2024. The specific queries for each database are as follows:

3.1.1. IEEE Xplore

We construct a query to retrieve articles with titles containing the phrase “Multimodal model” and metadata related to “health”, “medical image”, or “diagnosis”. The query is:
(“Document Title”: Multimodal model) AND ((“All Metadata”: health) OR ((“All Metadata”: medical image) AND (“All Metadata”: diagnosis)))
The search is restricted to the period between January 2020 and December 2024, resulting in 57 initial records. After excluding one magazine article, 56 articles are included for initial screening.

3.1.2. ACM Digital Library

The query for the ACM Digital Library targeted publications with the title “multimodal model” and metadata containing “health”, “medical image”, or “diagnosis” within the same time frame:
(“Document Title”: Multimodal model) AND ((“All Metadata”: health) OR ((“All Metadata”: medical image) AND (“All Metadata”: diagnosis)))
The search initially produces 23 results, comprising short papers, research articles, and keynotes, which we then organized according to preliminary screening criteria.

3.1.3. PubMed

We perform a search query on PubMed to identify articles with the title “multimodal model” and metadata referencing “health”, “medical image”, or “diagnosis”:
(“multimodal model”[Title]) AND ((“health”) OR (“medical image”) OR (“diagnosis”))
The search is limited to the same time period, yielding 11 articles for initial screening.

3.2. Inclusion and Exclusion Criteria

We define the inclusion and exclusion criteria to ensure relevance and quality.
  • Inclusion Criteria: This review includes peer-reviewed articles, conference papers, and early access articles published between January 2020 and December 2024, focusing on multimodal models applied to healthcare, particularly in health monitoring, medical imaging, and diagnostics.
  • Exclusion Criteria: This review excludes non-peer-reviewed publications (e.g., keynote speeches, editorials, magazine articles) and studies unrelated to multimodal models or healthcare applications.

3.3. Search Outcomes

The systematic literature search identified a total of 91 records across three major databases. We adopted a title-only screening approach during the initial phase to maximize efficiency and maintain a high signal-to-noise ratio, as pilot searches indicated that the phrase “Multimodal model” appearing in the title was a strong predictor of relevance for the review’s core focus. Our approach covers multimodal models in healthcare, including methodologies, modalities, and applications. We identified 56 articles from IEEE Xplore, 18 from the ACM Digital Library, and 11 from PubMed. After applying the inclusion and exclusion criteria, we identified a total of 50 studies for in-depth analysis. The screening and eligibility assessment process was conducted by two independent researchers. Initial title and abstract screening was performed, resulting in 90 records after removing one duplicate. Following the removal of non-empirical (n = 6) and retracted studies (n = 1), 83 full-text articles were assessed. Finally, 33 studies were excluded because they did not directly address human health, resulting in 50 articles included in the final review (Figure 2). Any disagreements were resolved through consensus and consultation with a third researcher. Data extraction, focusing on model architecture, modalities, metrics, and application domain, was performed using a standardized form. Figure 2 illustrates the full identification, screening, eligibility assessment, and inclusion process for this study.

4. Data Modalities

This section categorizes and elaborates on the four primary data types (image, text, time-series, and tabular) that form the foundation of the multimodal healthcare models identified in our review. Understanding the nature and function of each modality is essential for appreciating the fusion strategies and clinical predictions discussed in subsequent sections.

4.1. Image Data

Key imaging modalities, including MRI, X-ray, CT, ultrasound, PET, and SPECT, are essential components of modern multimodal healthcare systems. MRI provides exceptional soft-tissue contrast, which is beneficial for the assessment of neurodegenerative diseases [13,14]. Additionally, it is capable of predicting Alzheimer’s disease through the use of deep learning [15]. X-rays, when combined with clinical data, enhance the detection of pulmonary diseases [16], whereas CT enables detailed anatomical analysis in oncology and cardiology [17,18].

4.1.1. Magnetic Resonance Imaging (MRI)

MRI is widely used in multimodal healthcare models due to its high soft-tissue contrast and ability to capture both anatomical and functional data. It is particularly useful in neurodegenerative diseases like Alzheimer’s, where it supports the quantification of structural and metabolic brain changes [13,14]. Recent deep learning-based approaches have improved its predictive capabilities for disease progression and treatment planning [15]. Combining T1-weighted and angiographic MRI data enables more accurate brain age prediction [19]. As data harmonization and integration improve, the clinical utility of multimodal MRI is expected to grow [20,21].

4.1.2. X-Ray Imaging

X-rays are commonly incorporated in multimodal models alongside structured and unstructured clinical data. These models enhance diagnostic accuracy in areas such as pulmonary disease detection, outperforming single-modality models [16]. The high-dimensional features extracted from X-rays contribute to improved predictions when combined with clinical notes and lab data [22,23]. Their integration enables more generalizable decision-support systems.

4.1.3. Computed Tomography (CT)

CT provides detailed anatomical information and is frequently fused with modalities like PET and EMR data to improve diagnostic performance. In oncology and cardiology, multimodal CT models have improved target localization and risk stratification [17,18]. Automated extraction of features from abdominal CT scans, when integrated with EMR data, has shown promise in predicting ischemic heart disease risk [24].

4.1.4. Ultrasound

Ultrasound is valued for its portability, real-time imaging, and non-invasive nature. It is increasingly used in multimodal models to assess clinical workflows, predict malignancies, and guide interventions [25,26]. Integration with other modalities like Doppler and contrast-enhanced ultrasound improves diagnostic confidence, particularly in oncology and trauma care [27].

4.1.5. PET (Positron Emission Tomography)

PET is instrumental in detecting metabolic activity and is often fused with MRI or CT to enhance morphological and functional analysis [28]. PET-MRI, in particular, provides high soft-tissue contrast and functional mapping without radiation exposure from CT. Its use in multimodal AI-driven models enhances prognosis and therapeutic planning.

4.1.6. SPECT (Single Photon Emission Computed Tomography)

SPECT, especially in combination with CT (SPECT/CT), enhances localization of uptake and diagnostic accuracy. Recent advancements in quantitative SPECT and AI applications have expanded its utility in disease prediction, denoising, and attenuation correction [29].

4.2. Text Data

Clinical notes are essential for the improvement of multimodal models, as they offer contextual information that extends beyond the confines of structured electronic health records (EHRs). Unstructured narratives substantially enhance the accuracy of predictions and the robustness of models when combined with EHR data [3,30]. The fine-grained extraction of insights from clinical text has been facilitated by advancements in natural language processing (NLP) and pre-trained language models [31]. On occasion, text-only models have outperformed their structured-data-only counterparts; however, the most effective approach is the fusion of structured and unstructured data. For example, the integration of EHR data with operative notes has enhanced the accuracy of predicting outcomes in glaucoma surgery, and similar improvements have been observed in cardiology and oncology. The integration of text data into multimodal frameworks represents a significant advancement in clinical AI, as it enhances personalized care and predictive performance [30].
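To illustrate how a pre-trained language model can turn a free-text note into a fixed-length representation suitable for fusion, the sketch below uses the Hugging Face transformers library with mean pooling over token embeddings. The checkpoint identifier is an example of a publicly available clinical BERT variant and the sample note is fabricated; neither is drawn from the reviewed studies.

```python
# Sketch: embedding a clinical note with a pre-trained transformer so it can be
# fused with structured data. The checkpoint name is an illustrative example of
# a public clinical BERT variant; the reviewed studies use various models.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

note = ("58 y/o male with dyspnea on exertion; chest X-ray shows bilateral "
        "infiltrates; troponin negative.")  # fabricated example note
inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state           # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)           # ignore padding
    note_embedding = (hidden * mask).sum(1) / mask.sum(1)   # mean pooling -> (1, 768)

# note_embedding can now be concatenated with imaging and tabular features.
print(note_embedding.shape)
```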

4.3. Time-Series Data

Time-series data capture temporal dynamics of patient health, playing a crucial role in predicting outcomes such as mortality, length of stay, and readmission in intensive care settings [32]. When combined with structured EHRs and clinical notes, time-series data strengthen multimodal models, enhancing both predictive accuracy and clinical relevance.
Advanced machine learning techniques, including LSTM networks and attention mechanisms, enable these models to detect subtle variations in physiological signals and long-term trends [33,34]. Integrating textual information from clinical notes into temporal models has also been shown to improve performance [35,36]. Additionally, the emergence of AutoML for time-series data facilitates rapid model development and testing, supporting applications across diverse temporal scales.
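As a concrete illustration of this temporal modeling, the following minimal PyTorch sketch combines an LSTM with a simple attention-pooling layer over a sequence of vital signs. The layer sizes, the number of input signals, and the two-class (mortality-style) head are assumptions made for illustration only.

```python
# Minimal sketch of an LSTM with attention pooling over ICU vital-sign
# sequences; layer sizes and the two-class prediction head are assumptions.
import torch
import torch.nn as nn

class AttentiveLSTM(nn.Module):
    def __init__(self, n_features=7, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)          # scores each time step
        self.head = nn.Linear(hidden, n_classes)  # e.g., in-hospital mortality

    def forward(self, x):                 # x: (batch, time, n_features)
        h, _ = self.lstm(x)               # (batch, time, hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        context = (weights * h).sum(dim=1)  # attention-weighted summary
        return self.head(context)

# Toy example: 48 hourly measurements of 7 vital signs for 8 patients.
model = AttentiveLSTM()
print(model(torch.randn(8, 48, 7)).shape)  # torch.Size([8, 2])
```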

4.4. Tabular Data

Tabular data, typically derived from structured EHRs, serve as a foundational element in multimodal clinical models. Their integration with free-text notes, imaging, and temporal data allows for a more holistic representation of patient health. Studies have shown that such combinations outperform single-source models, including those developed for COVID-19 outcome prediction [3]. A key challenge in using tabular data lies in their heterogeneity across datasets. The MediTab approach addresses this by enabling large language models to generalize across heterogeneous tabular inputs without retraining [37]. This flexibility is valuable in clinical trial predictions and patient outcome forecasting. Overall, integrating structured tabular data with other modalities enhances the reliability and precision of clinical decision support systems [38].
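To illustrate the general idea behind consolidation approaches such as MediTab [37] (though not the authors' actual implementation), the sketch below serializes heterogeneous tabular rows into plain text so that a single text encoder can process records from different schemas without retraining a schema-specific tabular branch. The column names and values are invented for illustration.

```python
# Illustrative sketch of serializing heterogeneous tabular EHR rows into text
# so a language model can consume them regardless of schema. This shows the
# general idea behind consolidation approaches such as MediTab; it is not the
# original implementation, and the example records are fabricated.
def serialize_record(record: dict) -> str:
    """Turn one tabular row (column -> value) into a flat textual description."""
    parts = [f"{column} is {value}" for column, value in record.items()
             if value is not None]
    return "; ".join(parts) + "."

# Two hospitals with different schemas map to the same textual format.
row_a = {"age": 67, "sex": "F", "creatinine_mg_dl": 1.4, "diabetes": "yes"}
row_b = {"Age (years)": 52, "Serum creatinine": 0.9, "Smoker": "no"}

print(serialize_record(row_a))
print(serialize_record(row_b))
# The resulting strings can be embedded with the same text encoder and fused
# with other modalities, avoiding per-dataset retraining of the tabular branch.
```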

5. Results

Our study suggests that a significant number of researchers rely on publicly available datasets, while some combine publicly available datasets with private datasets to enhance model or system performance. We addressed the research questions by taking all 50 included studies into account, and we discuss them below:
1.
RQ1: What are the current state-of-the-art multimodal models used in healthcare applications, and how are they designed?
Recent multimodal healthcare research shows that transformer-centric hybrid systems with intermediate fusion are the most common. Most models use intermediate fusion, where modality-specific encoders, such as CNNs for medical imaging and transformer-based language models (e.g., ClinicalBERT, BioBERT) for text, extract representations that are combined in a shared latent space. This design lets models maintain modality-specific characteristics while permitting the cross-modal interactions needed for clinical reasoning. Graph neural networks (GNNs) and Vision Transformers (ViTs) are increasingly used in image-heavy tasks that require relational or spatial reasoning; about six studies use GNN modules to represent complex clinical interactions. Only eight studies use early fusion and three use late fusion, usually for simpler classification problems or weakly connected modalities. In terms of learning paradigms, supervised learning dominates, with most studies relying on labeled datasets. Few studies use self-supervised or semi-supervised training, reflecting the difficulty of exploiting unlabeled multimodal clinical data. Most models are trained using modality-specific objectives rather than end-to-end joint optimization, highlighting the difficulty of balancing diverse learning signals across data types.
2.
RQ2: What modalities are commonly integrated in these models, and what fusion techniques are employed?
From our analysis, the following distribution of modalities across studies, as shown in Figure 3, represents the frequency of different modalities used in healthcare multimodal studies across various domains:
Image Data: This modality is the most frequently used across studies, with 15 occurrences in brain-related studies [39,40,41,42,43,44,45,46,47,48,49,50,51,52,53], followed by nine in general or special disease studies [54,55,56,57,58,59,60,61,62], six in heart-, lung-, or kidney-related studies [63,64,65,66,67,68], and six in cancer-related studies [69,70,71,72,73,74].
Text Data: Text data are used extensively in general or special disease studies [55,56,60,62,75,76,77,78,79,80,81] (11 occurrences). It is moderately used in heart/lung/kidney-related studies [63,65,68,82] (four occurrences) and minimally in brain-related [42] (one occurrence) and cancer-related studies [71,73,74] (three occurrences).
Tabular Data: Structured tabular data are prominent in general or special disease studies [54,55,56,61,75,76,77,78,80,81,83,84] (12 occurrences), with six occurrences in brain-related studies [42,43,44,45,49,50], three occurrences in cancer-related studies [69,71,74], and three occurrences in heart/lung/kidney-related studies [64,68,82].
Time-Series Data: This modality is primarily seen in general or special disease studies [60,77,78,79,83,85] (six occurrences), with lesser usage in brain-related [49,86] (two occurrences) and heart/lung/kidney-related studies [66,87] (two occurrences). Notably, it is absent in cancer-related studies.
These numbers indicate the dominant use of imaging and tabular data in healthcare studies, while text and time-series data are more specialized to certain domains. This distribution provides valuable insights for designing fusion techniques that integrate these modalities effectively.
Furthermore, Figure 4 illustrates dataset accessibility across the included studies, showing the number of studies in four disease categories (Cancer, Brain, Heart/Lung/Kidney, and General/Special) that used public, private, or mixed/unspecified datasets; it shows that most of the research has been conducted using public datasets. Similarly, Figure 5 illustrates the learning processes of the models across the studies, showing the frequency of supervised, unsupervised, and other/combined learning types across the same four disease categories; it indicates that most researchers currently focus on supervised learning in all categories.
3.
RQ3: What datasets and evaluation metrics are used to assess the performance of these models?
We identified nine distinct categories of evaluation metrics, including accuracy, AUC, and F1-score. Among them, accuracy (16 studies) was the most frequent [42,43,44,45,49,51,52,53,55,62,69,70,74,79,81,85], used in tasks such as breast cancer diagnosis, COVID-19 prediction, and Alzheimer’s classification. AUC (12 studies) was favored for its ability to measure model discrimination, especially in mortality and disease progression tasks [42,43,44,45,49,56,57,61,71,77,79,83]. F1-score (eight studies) was applied in scenarios requiring a balance between precision and recall, such as adverse drug reaction detection [42,45,50,55,56,61,62,75]. Sensitivity and specificity (six studies) supported diagnostic evaluations in imaging and disease detection [41,48,58,67,72,81].
Dice Similarity Coefficient (four studies) was used for image segmentation, notably in brain tumor tasks [41,47,48,49]. Precision and recall (four studies) appeared in tumor detection and radiological analysis [42,62,65,73]. ROUGE and BLEU (three studies) evaluated text generation quality in clinical reporting [60,63,73]. RMSE (two studies) supported regression-based predictions in neuromuscular analysis [40,78]. Mutual information (one study) was employed for feature alignment in brain imaging [39]. A minimal sketch showing how the most common of these metrics are computed is given after the research questions below.
4.
RQ4: What are the key challenges and future research directions in developing multimodal models for healthcare?
Analysis of Table 1, Table 2, Table 3 and Table 4 reveals several key challenges. Multimodal integration remains limited, with only a few studies, particularly in cancer research, successfully leveraging multiple data types. Small dataset sizes were a common limitation, especially in tasks like adverse drug reaction detection and ICU monitoring. High computational costs, reported in studies on brain tumor segmentation and Alzheimer’s diagnosis, restrict scalability. Performance limitations were noted in multiclass and complex clinical scenarios, while a lack of interpretability continues to hinder adoption in critical applications such as in-hospital mortality prediction and radiological assessments.
Future research should focus on optimizing fusion techniques, especially intermediate fusion, which was used in several studies but remains underexplored for broader clinical use. Enhancing dataset diversity and scale is essential for building robust, generalizable models. Improving computational efficiency is also critical for enabling real-time deployment in clinical settings. Explainability should be prioritized to ensure clinician trust and regulatory compliance. Additionally, advancing domain adaptation and multimodal pretraining techniques will be vital for improving model transferability across tasks and healthcare systems.
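As referenced under RQ3, the sketch below shows how the most frequently reported metrics (accuracy, AUC, F1-score, sensitivity, specificity, and the Dice similarity coefficient) can be computed. The toy labels, scores, and segmentation masks are purely illustrative and are not taken from any included study.

```python
# Sketch of the most frequently reported metrics in the reviewed studies,
# computed on toy predictions; all numbers below are illustrative only.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])  # model scores
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("AUC:        ", roc_auc_score(y_true, y_prob))
print("F1-score:   ", f1_score(y_true, y_pred))
print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))

def dice_coefficient(pred_mask: np.ndarray, true_mask: np.ndarray, eps=1e-7) -> float:
    """Dice similarity coefficient for binary segmentation masks."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return (2.0 * intersection + eps) / (pred_mask.sum() + true_mask.sum() + eps)

pred_mask = np.array([[1, 1, 0], [0, 1, 0]])
true_mask = np.array([[1, 0, 0], [0, 1, 1]])
print("Dice:       ", dice_coefficient(pred_mask, true_mask))
```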

Model Comparisons

To compare the effectiveness of the models included in this review, we conducted a qualitative assessment based on evaluation metrics, performance consistency, error rates, and clinical relevance. Below, we summarize the top-performing models across four major disease categories.
Cancer-related studies: As shown in Table 1, the breast cancer model using B-mode and elastography ultrasound stands out for its strong performance, achieving a Dice coefficient of 89.96%, sensitivity of 90.37%, and specificity of 97.87%. Its use of intermediate fusion and clinically interpretable modalities contributes to its robustness. Although limited by a small dataset (287 image pairs), the model’s diagnostic precision and practical relevance place it ahead of alternatives like the thyroid cancer and mammogram-EHR fusion models.
Brain-related studies: The Gabor filter brain tumor model outperforms others with an accuracy of 99.16% and a mean Intersection over Union (MIoU) of 0.804 (Table 2). It accurately segments and predicts tumors using multimodal characteristics. Models such as the Alzheimer’s disease classifier (92.3% accuracy) and a classical tumor segmentation model (DSC = 0.7425) are also effective, but the Gabor filter-based model yields the most complete results, despite domain-specific training.
General/Special disease studies: From Table 3, the breast cancer diagnosis model again emerges as the most balanced in this group, with an F1-score of 0.93 and accuracy of 94.5%. Its intermediate fusion strategy allows reliable integration of diverse data types. Although models targeting severe acute pancreatitis (AUC = 0.874–0.916) and in-hospital mortality prediction (AUC = 0.8921) show strong performance in specific clinical tasks, the breast cancer model maintains a stronger overall profile in both accuracy and generalizability.
Heart, lung, and kidney-related studies: As detailed in Table 4, the lung diseases model shows the most consistent and scalable performance in this category, achieving an accuracy of 96.67%, sensitivity of 96.88%, and TNR of 97.02%. Its application of Vision Transformers contributes to robust performance across various lung conditions. Although the thoracic diseases model (AUC = 0.985) and COVID-19 models (Accuracy: CT = 98.5%, X-ray = 98.6%) demonstrate strengths in specific modalities, the lung diseases model combines balanced metrics, multimodal integration, and clinical scalability, making it the most impactful in this group.

6. Conclusions

This review highlights the growing utility of multimodal data integration in clinical modeling, demonstrating that models combining structured data, imaging, temporal signals, and text achieve notably higher predictive accuracy and broader applicability across diverse clinical outcomes. Our analysis shows that automated model optimization frameworks and advanced fusion techniques can substantially improve the efficiency and scalability of these systems.

Strengths: The primary strength of this work is its rigorous, systematic methodology, which utilized three major databases and maintained a strict, contemporary scope (2020–2024). This ensured the synthesis focused on the most recent architectural and application advancements. Furthermore, the detailed categorization of findings by disease type, fusion technique, and specific metric usage provides a structured reference map for future research.

Weaknesses: The main limitation is the lack of public access to many private or mixed datasets, which severely restricts the reproducibility and generalizability of the models presented in the literature. Additionally, the heterogeneity of reported evaluation metrics across the included studies restricted the comparison of model performance to qualitative analysis rather than a fully quantitative meta-analysis. However, significant deployment and clinical challenges persist, including complex regulatory approval pathways, issues of algorithmic fairness across diverse patient demographics, and the difficulty in establishing standardized clinical validation protocols.

Despite these limitations, the evidence suggests that multimodal approaches are crucial for advancing precision medicine. Future work must extend beyond technical optimization to address clinical translation barriers. Specifically, research needs to focus on developing few-shot learning techniques to build robust models despite limited annotated clinical datasets and on standardizing metrics to enable rigorous cross-study comparisons.

The implications for practice in healthcare are profound; multimodal modeling will enable a shift toward personalized diagnosis, such as integrating genomics and radiology to guide oncology treatment selection. However, clinical deployment hinges on advancing model transparency and explainability to ensure regulatory compliance and secure the necessary trust of medical practitioners.

Funding

This research was primarily supported by the National Science Foundation Grants CNS-2120350 and III-2311598.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare that they have no conflicts of interest regarding the publication of this paper.

References

  1. Wang, Y.; Yin, C.; Zhang, P. Multimodal risk prediction with physiological signals, medical images and clinical notes. Heliyon 2024, 10, e26772. [Google Scholar] [CrossRef] [PubMed]
  2. Yuan, M.; Bao, P.; Yuan, J.; Shen, Y.; Chen, Z.; Xie, Y.; Zhao, J.; Li, Q.; Chen, Y.; Zhang, L.; et al. Large language models illuminate a progressive pathway to artificial intelligent healthcare assistant. Med. Plus 2024, 1, 100030. [Google Scholar] [CrossRef]
  3. Henriksson, A.; Pawar, Y.; Hedberg, P.; Nauclér, P. Multimodal fine-tuning of clinical language models for predicting COVID-19 outcomes. Artif. Intell. Med. 2023, 146, 102695. [Google Scholar] [CrossRef] [PubMed]
  4. Behrad, F.; Saniee Abadeh, M. An overview of deep learning methods for multimodal medical data mining. Expert Syst. Appl. 2022, 200, 117006. [Google Scholar] [CrossRef]
  5. Meskó, B. The impact of multimodal large language models on health care’s future. J. Med. Internet Res. 2023, 25, e52865. [Google Scholar] [CrossRef]
  6. Xu, X.; Li, J.; Zhu, Z.; Zhao, L.; Wang, H.; Song, C.; Chen, Y.; Zhao, Q.; Yang, J.; Pei, Y. A comprehensive review on synergy of multi-modal data and AI technologies in medical diagnosis. Bioengineering 2024, 11, 219. [Google Scholar] [CrossRef]
  7. Shetty, S.; Ananthanarayana, V.S.; Mahale, A. Comprehensive review of multimodal medical data analysis: Open issues and future research directions. Acta Inform. Pragensia 2022, 11, 423–457. [Google Scholar] [CrossRef]
  8. Tian, D.; Jiang, S.; Zhang, L.; Lu, X.; Xu, Y. The role of large language models in medical image processing: A narrative review. Quant. Imaging Med. Surg. 2024, 14, 1108–1121. [Google Scholar] [CrossRef]
  9. Cai, Q.; Wang, H.; Li, Z.; Liu, X. A survey on multimodal data-driven smart healthcare systems: Approaches and applications. IEEE Access 2019, 7, 133583–133599. [Google Scholar] [CrossRef]
  10. Shaik, T.; Tao, X.; Li, L.; Xie, H.; Velásquez, J.D. A survey of multimodal information fusion for smart healthcare: Mapping the journey from data to wisdom. Inf. Fusion 2024, 102, 102040. [Google Scholar] [CrossRef]
  11. Wang, X.; He, F.; Jin, S.; Yang, Y.; Wang, Y.; Qu, H. M2Lens: Visualizing and explaining multimodal models for sentiment analysis. IEEE Trans. Vis. Comput. Graph. 2022, 28, 802–812. [Google Scholar] [CrossRef]
  12. Haleem, M.S.; Ekuban, A.; Antonini, A.; Pagliara, S.; Pecchia, L.; Allocca, C. Deep-learning-driven techniques for real-time multimodal health and physical data synthesis. Electronics 2023, 12, 1989. [Google Scholar] [CrossRef]
  13. Liu, Z.; Ding, L.; He, B. Integration of EEG/MEG with MRI and fMRI. IEEE Eng. Med. Biol. Mag. 2006, 25, 46–53. [Google Scholar] [CrossRef]
  14. Lin, A.L.; Laird, A.R.; Fox, P.T.; Gao, J.H. Multimodal MRI neuroimaging biomarkers for cognitive normal adults, amnestic mild cognitive impairment, and Alzheimer’s disease. Neurol. Res. Int. 2012, 2012, 907409. [Google Scholar] [CrossRef] [PubMed]
  15. Zhou, Y.; Song, Z.; Han, X.; Li, H.; Tang, X. Prediction of Alzheimer’s disease progression based on magnetic resonance imaging. ACS Chem. Neurosci. 2021, 12, 4209–4223. [Google Scholar] [CrossRef] [PubMed]
  16. Esteva, A.; Chou, K.; Yeung, S.; Naik, N.; Madani, A.; Mottaghi, A.; Liu, Y.; Topol, E.; Dean, J.; Socher, R. Deep learning-enabled medical computer vision. NPJ Digit. Med. 2021, 4, 5. [Google Scholar] [CrossRef]
  17. Thion Ming, C.; Omar, Z.; Mahmood, N.H.A.; Kadiman, S. A literature survey of ultrasound- and computed-tomography-based cardiac image registration. J. Teknol. Sci. Eng. 2015, 74, 93–101. [Google Scholar] [CrossRef]
  18. Huang, S.C.; Pareek, A.; Zamanian, R.; Banerjee, I.; Lungren, M.P. Multimodal fusion with deep neural networks for leveraging CT imaging and electronic health record: A case-study in pulmonary embolism detection. Sci. Rep. 2020, 10, 22147. [Google Scholar] [CrossRef]
  19. Mouches, P.; Wilms, M.; Rajashekar, D.; Langner, S.; Forkert, N.D. Multimodal biological brain age prediction using magnetic resonance imaging and angiography with the identification of predictive regions. Hum. Brain Mapp. 2022, 43, 2554–2566. [Google Scholar] [CrossRef] [PubMed]
  20. Wei, L.; Osman, S.; Hatt, M.; El Naqa, I. Machine learning for radiomics-based multimodality and multiparametric modeling. Q. J. Nucl. Med. Mol. Imaging 2019, 63, 353–360. [Google Scholar] [CrossRef]
  21. Lv, J.; Roy, S.; Xie, M.; Yang, X.; Guo, B. Contrast agents of magnetic resonance imaging and future perspective. Nanomaterials 2023, 13, 2003. [Google Scholar] [CrossRef] [PubMed]
  22. Jing, B.; Xie, P.; Xing, E.P. On the Automatic Generation of Medical Imaging Reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1: Long Papers, pp. 2577–2586. [Google Scholar]
  23. Norgeot, B.; Glicksberg, B.S.; Trupin, L.; Lituiev, D.; Gianfrancesco, M.; Oskotsky, B.; Schmajuk, G.; Yazdany, J.; Butte, A.J. Assessment of a Deep Learning Model Based on Electronic Health Record Data to Forecast Clinical Outcomes in Patients with Rheumatoid Arthritis. JAMA Netw. Open 2019, 2, e190606. [Google Scholar] [CrossRef]
  24. Zambrano Chaves, J.M.; Wentland, A.L.; Desai, A.D.; Banerjee, I.; Kaur, G.; Correa, R.; Boutin, R.D.; Maron, D.J.; Rodriguez, F.; Sandhu, A.T.; et al. Opportunistic assessment of ischemic heart disease risk using abdominopelvic computed tomography and medical record data: A multimodal explainable artificial intelligence approach. Sci. Rep. 2023, 13, 21034. [Google Scholar] [CrossRef] [PubMed]
  25. Sharma, H.; Drukker, L.; Chatelain, P.; Droste, R.; Papageorghiou, A.T.; Noble, J.A. Knowledge representation and learning of operator clinical workflow from full-length routine fetal ultrasound scan videos. Med. Image Anal. 2021, 69, 101973. [Google Scholar] [CrossRef]
  26. Mansour, N.; Bas, M.; Stock, K.; Strassen, U.; Hofauer, B.; Knopf, A. Multimodal Ultrasonographic Pathway of Parotid Gland Lesions. Ultraschall Med.-Eur. J. Ultrasound 2015, 38, 166–173. [Google Scholar] [CrossRef]
  27. Sargsyan, A.E.; Hamilton, D.R.; Jones, J.A.; Melton, S.L.; Whitson, P.A.; Kirkpatrick, A.W.; Martin, D.S.; Dulchavsky, S.A. FAST at MACH 20: Clinical Ultrasound Aboard the International Space Station. J. Trauma Acute Care Surg. 2005, 58, 35–39. [Google Scholar] [CrossRef]
  28. Cherry, S.R.; Jones, T.; Karp, J.S.; Qi, J.; Moses, W.W.; Badawi, R.D. Total-Body PET: Maximizing Sensitivity to Create New Opportunities for Clinical Research and Patient Care. J. Nucl. Med. Off. Publ. Soc. Nucl. Med. 2018, 59, 3–12. [Google Scholar] [CrossRef]
  29. Barbas, A.S.; Li, Y.; Zair, M.; Van Den Boom, P.; Famure, O.; Dib, M.; Laurence, J.M.; Kim, S.J.; Ghanekar, A. CT volumetry is superior to nuclear renography for prediction of residual kidney function in living donors. Clin. Transplant. 2016, 30, 1028–1035. [Google Scholar] [CrossRef]
  30. Ye, J.; Hai, J.; Song, J.; Wang, Z. Multimodal Data Hybrid Fusion and Natural Language Processing for Clinical Prediction Models. medRxiv 2023. [Google Scholar] [CrossRef]
  31. Lin, W.C.; Chen, A.X.; Song, J.; Weiskopf, N.G.; Chiang, M.F.; Hribar, M.R. Prediction of multiclass surgical outcomes in glaucoma using multimodal deep learning based on free-text operative notes and structured EHR data. J. Am. Med. Inform. Assoc. 2024, 31, 456–464. [Google Scholar] [CrossRef] [PubMed]
  32. Deng, Y.; Liu, S.; Wang, Z.; Wang, Y.; Jiang, Y.; Liu, B. Explainable time-series deep learning models for the prediction of mortality, prolonged length of stay and 30-day readmission in intensive care patients. Front. Med. 2022, 9, 933037. [Google Scholar] [CrossRef]
  33. Bandara, K.; Bergmeir, C.; Hewamalage, H. LSTM-MSNet: Leveraging Forecasts on Sets of Related Time Series with Multiple Seasonal Patterns. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 1586–1599. [Google Scholar] [CrossRef] [PubMed]
  34. Fan, C.; Zhang, Y.; Pan, Y.; Li, X.; Zhang, C.; Yuan, R.; Wu, D.; Wang, W.; Pei, J.; Huang, H. Multi-Horizon Time Series Forecasting with Temporal Attention Learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2527–2535. [Google Scholar]
  35. Jia, F.; Wang, K.; Zheng, Y.; Cao, D.; Liu, Y. GPT4MTS: Prompt-based Large Language Model for Multimodal Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2024, 38, 23343–23351. [Google Scholar] [CrossRef]
  36. Yang, H.; Kuang, L.; Xia, F. Multimodal temporal-clinical note network for mortality prediction. J. Biomed. Semant. 2021, 12, 3. [Google Scholar] [CrossRef]
  37. Wang, Z.; Gao, C.; Xiao, C.; Sun, J. MediTab: Scaling Medical Tabular Data Predictors via Data Consolidation, Enrichment, and Refinement. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar]
  38. Soenksen, L.R.; Ma, Y.; Zeng, C.; Boussioux, L.; Villalobos Carballo, K.; Na, L.; Wiberg, H.M.; Li, M.L.; Fuentes, I.; Bertsimas, D. Integrated multimodal artificial intelligence framework for healthcare applications. NPJ Digit. Med. 2022, 5, 149. [Google Scholar] [CrossRef] [PubMed]
  39. Singh, S.; Anand, R.S. Multimodal medical image sensor fusion model using sparse K-SVD dictionary learning in nonsubsampled shearlet domain. IEEE Trans. Instrum. Meas. 2019, 69, 593–607. [Google Scholar] [CrossRef]
  40. Rawls, E.; Kummerfeld, E.; Zilverstand, A. An integrated multimodal model of alcohol use disorder generated by data-driven causal discovery analysis. Commun. Biol. 2021, 4, 435. [Google Scholar] [CrossRef]
  41. Dong, L.; Wang, Z.; Zhang, M.; Gao, B. A study on a 3D multimodal magnetic resonance brain tumor image segmentation model. In Proceedings of the 2021 IEEE 23rd International Conference on High Performance Computing & Communications/7th International Conference on Data Science & Systems/19th International Conference on Smart City/7th International Conference on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Haikou, China, 20–22 December 2021; pp. 2139–2143. [Google Scholar]
  42. Zhong, S.; Ren, J.X.; Yu, Z.P.; Peng, Y.D.; Yu, C.W.; Deng, D.; Xie, Y.; He, Z.Q.; Duan, H.; Wu, B.; et al. Predicting glioblastoma molecular subtypes and prognosis with a multimodal model integrating convolutional neural network, radiomics, and semantics. J. Neurosurg. 2022, 139, 305–314. [Google Scholar] [CrossRef]
  43. Chen, D.; Zhang, L.; Ma, C. A multimodal diagnosis predictive model of Alzheimer’s disease with few-shot learning. In Proceedings of the 2020 International Conference on Public Health and Data Science (ICPHDS), Guangzhou, China, 20–22 November 2020; pp. 273–277. [Google Scholar]
  44. Sekhar, U.S.; Vyas, N.; Dutt, V.; Kumar, A. Multimodal Neuroimaging Data in Early Detection of Alzheimer’s Disease: Exploring the Role of Ensemble Models and GAN Algorithm. In Proceedings of the 2023 International Conference on Circuit Power and Computing Technologies (ICCPCT), Kollam, India, 10–11 August 2023; pp. 1664–1669. [Google Scholar]
  45. Wu, S.; Chen, C.; Luo, H.; Xu, Y. GIRUS-net: A Multimodal Deep Learning Model Identifying Imaging and Genetic Biomarkers Linked to Alzheimer’s Disease Severity. In Proceedings of the 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; pp. 1–4. [Google Scholar]
  46. Falakshahi, H.; Rokham, H.; Miller, R.; Liu, J.; Calhoun, V.D. Network Differential in Gaussian Graphical Models from Multimodal Neuroimaging Data. In Proceedings of the 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; pp. 1–6. [Google Scholar]
  47. Chang, H.H. Multimodal Image Registration Using a Viscous Fluid Model with the Bhattacharyya Distance. In Proceedings of the 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Sydney, Australia, 24–27 July 2023; pp. 1–4. [Google Scholar]
  48. Zhou, Q.; Zou, H.; Luo, F.; Qiu, Y. RHViT: A Robust Hierarchical Transformer for 3D Multimodal Brain Tumor Segmentation Using Biased Masked Image Modeling Pre-training. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 1784–1791. [Google Scholar]
  49. Rondinella, A.; Chen, C.; Jiang, Q.; Gu, Y.; Xu, F. Enhancing Multiple Sclerosis Lesion Segmentation in Multimodal MRI Scans with Diffusion Models. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 3733–3740. [Google Scholar]
  50. Hatami, N.; Mechtouff, L.; Rousseau, D.; Cho, T.-H.; Eker, O.; Berthezene, Y.; Frindel, C. A Novel Autoencoders-LSTM Model for Stroke Outcome Prediction Using Multimodal MRI Data. In Proceedings of the 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), Cartagena, Colombia, 18–21 April 2023; pp. 1–5. [Google Scholar]
  51. Chen, Z.; Liu, Y.; Wang, J.; Ma, Y.; Liu, Y.; Ye, H.; Liu, B. Enhanced Multimodal Low-Rank Embedding-Based Feature Selection Model for Multimodal Alzheimer’s Disease Diagnosis. IEEE Trans. Med. Imaging 2024, 44, 815–827. [Google Scholar] [CrossRef]
  52. Bouzara, A.; Kermi, A. Automated Brain Tumor Segmentation in Multimodal MRI Scans Using a Multi-Encoder U-Net Model. In Proceedings of the 2024 8th International Conference on Image and Signal Processing and their Applications (ISPA), Biskra, Algeria, 21–22 April 2024; pp. 1–8. [Google Scholar]
  53. Ahmad, N.; Chen, Y.T. Enhanced Deep Learning Model Performance in 3D Multimodal Brain Tumor Segmentation with Gabor Filter. In Proceedings of the 2024 10th International Conference on Applied System Innovation (ICASI), Kyoto, Japan, 17–21 April 2024; pp. 406–408. [Google Scholar]
  54. Jana, M.; Das, A. Multimodal Medical Image Fusion Using Deep-Learning Models in Fuzzy Domain. In Proceedings of the 2023 IEEE 5th International Conference on Cybernetics, Cognition and Machine Learning Applications (ICCCMLA), Hamburg, Germany, 7–8 October 2023; pp. 514–519. [Google Scholar]
  55. Ouyang, Y.; Wu, Y.; Wang, S.; Qu, H. Leveraging Historical Medical Records as a Proxy via Multimodal Modeling and Visualization to Enrich Medical Diagnostic Learning. IEEE Trans. Vis. Comput. Graph. 2023, 30, 1238–1248. [Google Scholar] [CrossRef]
  56. Liu, X.; Qiu, H.; Li, M.; Yu, Z.; Yang, Y.; Yan, Y. Application of Multimodal Fusion Deep Learning Model in Disease Recognition. In Proceedings of the 2024 IEEE 2nd International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 29–31 August 2024; pp. 1246–1250. [Google Scholar]
  57. Junior, G.V.M.; Santos, R.L.d.S.; Vogado, L.H.; de Paiva, A.C.; dos Santos Neto, P.d.A. XRaySwinGen: Automatic Medical Reporting for X-ray Exams with Multimodal Model. Heliyon 2024, 10, e27516. [Google Scholar]
  58. Cui, K.; Changrong, S.; Maomin, Y.; Hui, Z.; Xiuxiang, L. Development of an Artificial Intelligence-Based Multimodal Model for Assisting in the Diagnosis of Necrotizing Enterocolitis in Newborns: A Retrospective Study. Front. Pediatr. 2024, 12, 1388320. [Google Scholar] [CrossRef]
  59. Yin, M.; Lin, J.; Wang, Y.; Liu, Y.; Zhang, R.; Duan, W.; Zhou, Z.; Zhu, S.; Gao, J.; Liu, L.; et al. Development and Validation of a Multimodal Model in Predicting Severe Acute Pancreatitis Based on Radiomics and Deep Learning. Int. J. Med. Inform. 2024, 184, 105341. [Google Scholar] [CrossRef] [PubMed]
  60. Qiu, X.; Wang, C.; Li, B.; Tong, H.; Tan, X.; Yang, L.; Tao, J.; Huang, J. An Audio-Semantic Multimodal Model for Automatic Obstructive Sleep Apnea-Hypopnea Syndrome Classification via Multi-Feature Analysis of Snoring Sounds. Front. Neurosci. 2024, 18, 1336307. [Google Scholar] [CrossRef] [PubMed]
  61. Chen, X.; Ke, J.; Zhang, Y.; Gou, J.; Shen, A.; Wan, S. Multimodal Distillation Pre-Training Model for Ultrasound Dynamic Images Annotation. IEEE J. Biomed. Health Inform. 2024, 29, 3124–3136. [Google Scholar] [CrossRef]
  62. Thota, P.; Veerla, J.P.; Guttikonda, P.S.; Nasr, M.S.; Nilizadeh, S.; Luber, J.M. Demonstration of an Adversarial Attack Against a Multimodal Vision Language Model for Pathology Imaging. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–5. [Google Scholar]
  63. Hassan, A.; Sirshar, M.; Akram, M.U.; Farooq, M.U. Analysis of Multimodal Representation Learning Across Medical Images and Reports Using Multiple Vision and Language Pre-Trained Models. In Proceedings of the 2022 19th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, Pakistan, 12–16 January 2022; pp. 337–342. [Google Scholar]
  64. Fahmy, G.A.; Abd-Elrahman, E.; Zorkany, M. COVID-19 Detection Using Multimodal and Multi-Model Ensemble-Based Deep Learning Technique. In Proceedings of the 2022 39th National Radio Science Conference (NRSC), Cairo, Egypt, 29 November–1 December 2022; Volume 1, pp. 241–253. [Google Scholar]
  65. Chen, W.; Li, Y.; Ou, B.; Tan, P. Collaborative Multimodal Diagnostic: Fusion of Pathological Labels and Vision-Language Model. In Proceedings of the 2023 2nd International Conference on Health Big Data and Intelligent Healthcare (ICHIH), Zhuhai, China, 27–29 October 2023; pp. 119–126. [Google Scholar]
  66. Silva, L.E.V.; Shi, L.; Gaudio, H.A.; Padmanabhan, V.; Morgan, R.W.; Slovis, J.M.; Forti, R.M.; Morton, S.; Lin, Y.; Laurent, G.H.; et al. Prediction of Return of Spontaneous Circulation in a Pediatric Swine Model of Cardiac Arrest Using Low-Resolution Multimodal Physiological Waveforms. IEEE J. Biomed. Health Inform. 2023, 27, 4719–4727. [Google Scholar] [CrossRef]
  67. Owais, M.; Zubair, M.; Seneviratne, L.D.; Werghi, N.; Hussain, I. Unified Synergistic Deep Learning Framework for Multimodal 2-D and 3-D Radiographic Data Analysis: Model Development and Validation. IEEE Access 2024, 12, 159688–159705. [Google Scholar] [CrossRef]
  68. Li, C.-Y.; Wu, J.-T.; Hsu, C.; Lin, M.-Y.; Kang, Y. Understanding eGFR Trajectories and Kidney Function Decline via Large Multimodal Models. In Proceedings of the 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 7–9 August 2024; pp. 667–673. [Google Scholar]
  69. Chen, L.; Zhang, T.; Li, Z.; Deng, J.; Li, S.; Hu, J. A Multimodal Deep Learning Model for Preoperative Risk Prediction of Follicular Thyroid Carcinoma. In Proceedings of the 2023 IEEE International Conference on E-health Networking, Application & Services (Healthcom), Chongqing, China, 15–17 December 2023; pp. 188–193. [Google Scholar]
  70. Souza, L.A.; Pacheco, A.G.C.; de Angelo, G.C.; de Lacerda, G.B.; da Silva, J.S.; Sampaio, F.N.; Velozo, B.M.; de Albuquerque, C.S.; Ribeiro, P.L.; Lins, I.D. LiwTERM: A Lightweight Transformer-Based Model for Dermatological Multimodal Lesion Detection. In Proceedings of the 2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Manaus, Brazil, 30 September–3 October 2024; pp. 1–6. [Google Scholar]
  71. Vo, H.Q.; Wang, L.; Wang, X.; Liu, R.; Sun, Y.; Li, X.; Chen, Z.; Li, Z. Frozen Large-Scale Pretrained Vision-Language Models Are the Effective Foundational Backbone for Multimodal Breast Cancer Prediction. IEEE J. Biomed. Health Inform. 2024, 29, 3234–3246. [Google Scholar] [CrossRef]
  72. Yin, H.; Tao, L.; Zuo, R.; Yin, P.; Li, Y.; Xu, Y.; Liu, M. A Multimodal Fusion Model for Breast Tumor Segmentation in Ultrasound Images. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; pp. 1–4. [Google Scholar]
  73. Wei, C.; Wang, H.; Shi, X.; Li, X.; Wang, K.; Zhou, G. A Finetuned Multimodal Deep Learning Model for X-ray Image Diagnosis. In Proceedings of the 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 10–12 May 2024; pp. 810–813. [Google Scholar]
  74. Cerdá-Alberich, L.; Veiga-Canuto, D.; Garcia, V.; Blasco, L.; Martinez-Sanchis, S. Harnessing Multimodal Clinical Predictive Models for Childhood Tumors. In Proceedings of the 2023 IEEE EMBS Special Topic Conference on Data Science and Engineering in Healthcare, Medicine and Biology, Julians, Malta, 6–9 December 2023; pp. 71–72. [Google Scholar]
  75. Sakhovskiy, A.; Tutubalina, E. Multimodal Model with Text and Drug Embeddings for Adverse Drug Reaction Classification. J. Biomed. Inform. 2022, 135, 104182. [Google Scholar] [CrossRef]
  76. Li, Y.; Mamouei, M.; Salimi-Khorshidi, G.; Rao, S.; Hassaine, A.; Canoy, D.; Lukasiewicz, T.; Rahimi, K. Hi-BEHRT: Hierarchical Transformer-Based Model for Accurate Prediction of Clinical Events Using Multimodal Longitudinal Electronic Health Records. IEEE J. Biomed. Health Inform. 2022, 27, 1106–1117. [Google Scholar] [CrossRef]
  77. Zhang, K.; Niu, K.; Zhou, Y.; Tai, W.; Lu, G. MedCT-BERT: Multimodal Mortality Prediction Using Medical ConvTransformer-BERT Model. In Proceedings of the 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), Atlanta, GA, USA, 6–8 November 2023; pp. 700–707. [Google Scholar]
  78. Duong, T.T.; Uher, D.; Su, H.; Küçükdeveci, A.A.; Almusalihi, R.; Hahne, F.; Schiessl, M.; Dürkop, K.; Bleser, G. Accurate COP Trajectory Estimation in Healthy and Pathological Gait Using Multimodal Instrumented Insoles and Deep Learning Models. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 4801–4811. [Google Scholar] [CrossRef] [PubMed]
  79. Bhattacharjee, A.; Anwar, S.; Whitinger, L.; Loghmani, M.T. Multimodal Sequence Classification of Force-Based Instrumented Hand Manipulation Motions Using LSTM-RNN Deep Learning Models. In Proceedings of the 2023 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI), Pittsburgh, PA, USA, 15–18 October 2023; pp. 1–6. [Google Scholar]
  80. Liu, S.; Wang, X.; Hou, Y.; Li, G.; Wang, H.; Xu, H.; Xiang, Y.; Tang, B. Multimodal Data Matters: Language Model Pre-Training over Structured and Unstructured Electronic Health Records. IEEE J. Biomed. Health Inform. 2022, 27, 504–514. [Google Scholar] [CrossRef] [PubMed]
  81. Winston, C.; Winston, C.N.; Winston, C.; Winston, C.; Winston, C. Multimodal Clinical Prediction with Unified Prompts and Pretrained Large-Language Models. In Proceedings of the 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), Orlando, FL, USA, 3–6 June 2024; pp. 679–683. [Google Scholar] [CrossRef]
  82. Pawar, Y.; Henriksson, A.; Hägglund, M. Leveraging Clinical BERT in Multimodal Mortality Prediction Models for COVID-19. In Proceedings of the 2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS), Shenzhen, China, 21–23 July 2022; pp. 199–204. [Google Scholar]
  83. Niu, S.; Yin, Q.; Wang, C.; Feng, J. Label Dependent Attention Model for Disease Risk Prediction Using Multimodal Electronic Health Records. In Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), Auckland, New Zealand, 7–10 December 2021; pp. 449–458. [Google Scholar]
  84. Barbosa, S.P.; Marques, L.; Caldas, A.; Pimentel, R.; Cruz, J.G. Predictors of the Health-Related Quality of Life (HRQOL) in SF-36 in Knee Osteoarthritis Patients: A Multimodal Model with Moderators and Mediators. Cureus 2022, 14, e27339. [Google Scholar] [CrossRef] [PubMed]
  85. Gupta, R.; Bhongade, A.; Gandhi, T.K. Multimodal Wearable Sensors-Based Stress and Affective States Prediction Model. In Proceedings of the 2023 9th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 17–18 March 2023; Volume 1, pp. 30–35. [Google Scholar]
  86. Dharia, S.Y.; Valderrama, C.E.; Camorlinga, S.G. Multimodal Deep Learning Model for Subject-Independent EEG-Based Emotion Recognition. In Proceedings of the 2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Regina, SK, Canada, 24–27 September 2023; pp. 105–110. [Google Scholar]
  87. Singh, A.; Holzer, N.; Götz, T.; Wittenberg, T.; Göb, S.; Sawant, S.S.; Salman, M.-M.; Pahl, J. Bio-Signal-Based Multimodal Fusion with Bilinear Model for Emotion Recognition. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 4834–4839. [Google Scholar]
  88. Yamamoto, M.S.; Sadatnejad, K.; Tanaka, T.; Islam, M.R.; Dehais, F.; Tanaka, Y.; Lotte, F. Modeling Complex EEG Data Distribution on the Riemannian Manifold Toward Outlier Detection and Multimodal Classification. IEEE Trans. Biomed. Eng. 2024, 71, 377–387. [Google Scholar] [CrossRef]
Figure 1. Multimodal model for disease prediction.
Figure 2. Records identification, screening, and inclusion process.
Figure 3. Modality distribution of the included studies.
Figure 4. Dataset accessibility across the included studies.
Figure 5. Learning process of the models across the studies.
Table 1. Comparison of cancer-related studies.

Disease | Fusion Technique | Key Metrics | Strengths | Weaknesses | Ref.
Thyroid Cancer | Intermediate | Accuracy: 93% (Multimodal) | High overall accuracy with robust multimodal metrics | Limited to 323 patient observations | [69]
Skin Cancer | Late | PAD-UFES-20: BACC = 0.63 ± 0.02, ISIC 2019: BACC = 0.52 ± 0.02 | Moderate accuracy across large datasets with diverse modalities | Relatively lower accuracy compared to other models | [70]
Breast Cancer (Mammograms & EHR) | Intermediate | CBIS-DDSM: AUC = 0.830, EMBED: AUC = 0.809 | Strong AUC values with reliable multimodal fusion | Accuracy is not the highest among all models | [71]
Breast Cancer (B-mode & Elastography Ultrasound) | Intermediate | Dice: 89.96%, Sensitivity: 90.37%, Specificity: 97.87% | High specificity and sensitivity; excellent for detection tasks | Limited observations (287 image pairs) | [72]
Cancer Diagnosis (X-ray & Reports) | Intermediate | ROUGE-1: 0.6544, ROUGE-2: 0.4188, ROUGE-L: 0.6228 | Good performance in text-image alignment tasks | Lower accuracy metrics compared to classification models | [73]
Childhood Cancer (Neuroblastoma) | Intermediate | Accuracy: +15%, Precision: +20% | Combines imaging, clinical, and molecular data | Limited to neuroblastoma cases | [74]
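The "Fusion Technique" column in Table 1 (and in Tables 2–4) distinguishes early, intermediate (feature-level), and late (decision-level) fusion. The following minimal sketch is illustrative only; it is not drawn from any of the cited studies and assumes precomputed per-modality feature vectors (e.g., an image embedding and a tabular clinical record) with hypothetical dimensions. It contrasts intermediate and late fusion in PyTorch:

```python
# Illustrative sketch only -- not reproduced from any study cited in Table 1.
# Assumes precomputed per-modality feature vectors; dimensions are hypothetical.
import torch
import torch.nn as nn

class IntermediateFusionNet(nn.Module):
    """Intermediate (feature-level) fusion: concatenate modality-specific
    embeddings before a shared classification head."""
    def __init__(self, img_dim=512, tab_dim=32, hidden=128, n_classes=2):
        super().__init__()
        self.img_encoder = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU())
        self.tab_encoder = nn.Sequential(nn.Linear(tab_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, img_feat, tab_feat):
        # Fuse learned representations, then predict once from the joint vector.
        z = torch.cat([self.img_encoder(img_feat), self.tab_encoder(tab_feat)], dim=-1)
        return self.classifier(z)

class LateFusionNet(nn.Module):
    """Late (decision-level) fusion: each modality produces its own prediction,
    and the class probabilities are averaged."""
    def __init__(self, img_dim=512, tab_dim=32, n_classes=2):
        super().__init__()
        self.img_head = nn.Linear(img_dim, n_classes)
        self.tab_head = nn.Linear(tab_dim, n_classes)

    def forward(self, img_feat, tab_feat):
        p_img = torch.softmax(self.img_head(img_feat), dim=-1)
        p_tab = torch.softmax(self.tab_head(tab_feat), dim=-1)
        return (p_img + p_tab) / 2  # decision-level averaging
```

Early fusion, by contrast, would concatenate the raw (or minimally processed) inputs before any modality-specific encoding; the choice among the three strategies is one of the main design axes separating the studies compared in these tables.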
Table 2. Comparison of brain-related studies.

Disease/Application | Fusion Technique | Key Metrics | Strengths | Weaknesses | Ref.
Neurological Disorders | Intermediate | Entropy, Mutual Information, etc. | Combines multiple imaging modalities for robust analysis. | Performance metrics not explicitly provided. | [39]
Alcohol Use Disorder | Early | RMSEA: 0.06, TLI: 0.91 | Significant network connectivity insights. | Limited to fMRI data; lacks multimodal fusion. | [40]
Brain Tumor (Gliomas) | Intermediate | Dice Similarity Coefficient (DSC) | High segmentation accuracy for multiple tumor types. | Focuses only on MRI imaging. | [41]
Glioblastoma | Early | Precision, Recall, Accuracy, F1-score, AUC | High accuracy for mutation and prognosis prediction. | Limited dataset size; requires validation on diverse populations. | [42]
Alzheimer’s Diagnosis (AD vs. NC) | Intermediate | Accuracy: 92.3%, AUC: 0.94 | Effective classification of Alzheimer’s vs. normal controls. | Does not address early diagnosis challenges. | [43]
Alzheimer’s Diagnosis (MCI) | Intermediate | Accuracy: 76.4%, AUC: 0.82 | Focus on mild cognitive impairment classification. | Lower accuracy compared to AD vs. NC. | [44]
Alzheimer’s Diagnosis (Ensemble Model) | Intermediate | F1: 82%, AUC-ROC: 0.93 | Ensemble model enhances classification accuracy. | Complex architecture may require extensive computational resources. | [45]
Schizophrenia | Intermediate | Graph Metrics | Graph modeling provides unique insights into connectivity patterns. | Detailed performance metrics are unavailable. | [46]
Brain Tumor Segmentation | Intermediate | Dice Score | High accuracy for tumor segmentation tasks. | Limited evaluation for different tumor types. | [47]
Multiple Sclerosis | Early | DSC: 0.7425, TPR: 0.7622, PPV: 0.7505 | Good segmentation performance with uncertainty handling. | Small dataset size limits generalizability. | [48]
Stroke Prediction | Intermediate | AUC: 0.71, ACC: 0.72, F1: 0.68 | Intermediate fusion improves prediction performance. | Limited by small cohort size. | [49]
Alzheimer’s Diagnosis (Severity) | Intermediate | 2-class F1: 0.86, 3-class F1: 0.50, 4-class F1: 0.40 | Focus on severity prediction with ordinal regression. | Performance decreases with increasing class complexity. | [50]
Emotion Recognition | Intermediate | SEED-IV: 67.33%, SEED-V: 72.36% | Fusion of EEG and eye-tracking data enhances emotion recognition. | Lower accuracy compared to other domains. | [51]
Brain Tumor (Gabor Filters) | Intermediate | Accuracy: 99.16%, mIoU: 0.804 | Gabor filters improve feature representation and accuracy. | May require domain-specific tuning. | [52]
General Brain Tasks | Intermediate | 4.27% improvement over baseline | Improved performance with Riemannian-based features. | Performance gains are relatively small. | [53]
Brain Imaging Registration | N/A | MI: 1.2527 ± 0.0510 | Numerical optimization ensures robust registration. | Not a learning-based approach; limited adaptability. | [86]
Alzheimer’s (NC vs. AD) | Intermediate | AUC: 0.962 | High classification accuracy for NC vs. AD. | Does not address multi-class scenarios. | [88]
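Several segmentation entries in Table 2 report the Dice similarity coefficient (DSC), while classification entries in Tables 1–4 rely on AUC and balanced accuracy (BACC). As a brief reference, the sketch below uses the standard definitions of these metrics with toy data; it is not code from any cited study, and the arrays shown are fabricated solely to illustrate the calculations.

```python
# Standard metric definitions on toy data; not taken from any cited study.
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

def dice_coefficient(pred_mask: np.ndarray, true_mask: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient (DSC) between two binary segmentation masks."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

# Toy 4x4 predicted mask versus ground truth: DSC = 2*4 / (4+5) ~ 0.889.
pred = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
true = np.array([[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
print(f"DSC:  {dice_coefficient(pred, true):.3f}")

# Classification metrics used elsewhere in the tables (AUC, balanced accuracy).
y_true  = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3])
print(f"AUC:  {roc_auc_score(y_true, y_score):.3f}")
print(f"BACC: {balanced_accuracy_score(y_true, y_score > 0.5):.3f}")
```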
Table 3. Comparison of general and specialized disease-related studies.

Disease/Application | Fusion Technique | Key Metrics | Strengths | Weaknesses | Ref.
Adverse Drug Reactions (ADR) | Intermediate | F1: 0.61 (English), 0.57 (Russian), 8% gain (French) | Uses social media data for real-time ADR detection. | Limited dataset size; language-dependent performance. | [75]
General Disease Risk | Early | ROC AUC: 0.9036, Precision: 0.7906 | Effective multimodal fusion with time-series data. | Complex model architecture; high computational requirements. | [83]
Knee Osteoarthritis (KOA) | Intermediate | R2 values | Integrates neurophysiological and demographic data. | Small sample size; limited generalizability. | [84]
Clinical Event Prediction | Early | N/A | Large cohort size; integrates primary and secondary care data. | Performance metrics not provided. | [76]
General Diagnostic Applications | Intermediate | QS = 0.999 | Performs well across multiple metrics; robust feature fusion. | Limited details on dataset diversity. | [54]
In-Hospital Mortality Prediction | Intermediate | AUC-ROC: 0.8921, AUC-PR: 0.5635 | Strong performance for ICU patient outcomes. | Limited to a specific patient group; recall is low. | [77]
Neuromuscular Conditions | Intermediate | RMSE: 0.51–0.59 cm (Healthy), 1.44–1.53 cm (Condition) | Accurate gait analysis using IMU data. | Small sample size; specific to neuromuscular disorders. | [78]
Soft Tissue Manipulation | Intermediate | Accuracy: 93.2%, AUC: 0.9876 | High classification accuracy for multimodal data. | Limited to specific motions and therapists. | [79]
Stress and Affective States | Late | Accuracy: 97.15%, F1-score: N/A | High accuracy with chest-worn sensors. | Limited by sensor dependency; small cohort size. | [85]
Clinical Prediction | Early | Weighted F1: +0.02 to +0.16 (vs. unimodal data) | Demonstrates benefits of multimodal fusion. | Limited improvement in some tasks. | [80]
Cervical Spine Disorder | Late | Accuracy: 92.1%, F1: 0.92 | Effective decision- and feature-level fusion. | Limited to cervical spine disorders. | [55]
Necrotizing Enterocolitis (NEC) | Early | Accuracy: 94.2%, Sensitivity: 93.65% | High sensitivity and specificity. | Limited to neonatal intensive care data. | [81]
Severe Acute Pancreatitis | Intermediate | AUC: 0.874–0.916, Accuracy: 85–94.6% | Integrates clinical and imaging data effectively. | Relatively small dataset size. | [56]
Sleep Apnea-Hypopnea Syndrome | Intermediate | Accuracy: 77.6%, AUC: 0.709 | Audio features combined with semantic representations. | Limited sample size; moderate accuracy. | [57]
General Ultrasound-Based Diseases | Intermediate | Accuracy: 93.08% | Accurate annotation for multiple diseases. | Limited focus on ultrasound imaging. | [58]
Pathological Tissue Classification | Intermediate | Success Rate: 100% | Effective adversarial attack detection. | Focuses solely on pathology data. | [59]
Radiological Imaging Diagnosis | Intermediate | BLEU-4: 0.661, ROUGE-L: 0.748 | Generates high-quality radiological reports. | Limited to radiology; text accuracy is dataset-dependent. | [60]
Sepsis Prediction | Early | ROC AUC: 0.87, F1: 0.81 | Strong predictive performance for early sepsis detection. | Highly specific to ICU patient data. | [61]
Breast Cancer Diagnosis | Intermediate | F1: 0.93, Accuracy: 94.5% | Robust performance across multiple modalities. | Focuses only on breast cancer imaging. | [62]
Table 4. Comparison of heart-, lung-, and kidney-related studies.

Disease/Application | Fusion Technique | Key Metrics | Strengths | Weaknesses | Ref.
Thoracic Diseases | Intermediate | AUC: 0.985 (VisualBERT), ROUGE: 0.65, BLEU: 0.35 | Effective for disease classification and report summarization. | Focuses only on chest X-rays; dataset-specific performance. | [63]
COVID-19 | Intermediate | Accuracy: CT Scan—98.5%, X-ray—98.6% | High accuracy using diverse datasets. | Limited to respiratory diseases; no fusion of text. | [64]
COVID-19 (Karolinska Hospital) | Early | AUROC: 0.87, AUPRC: 0.45 (Multimodal) | Demonstrates benefits of multimodal fine-tuning. | Private dataset limits generalizability. | [82]
Neurological Disorders (ECG & GSR) | Intermediate | Accuracy: 68.7%, F1: 67.38% (Arousal) | Accurate arousal recognition using ECG and GSR. | Small dataset with 27 subjects limits applicability. | [87]
Chest Radiograph Diagnosis | Intermediate | Precision: 0.67, Recall: 0.75, F1: 0.71 | Utilizes LangChain for integrating vision-language models. | Text generation metrics show moderate performance. | [65]
Cardiac Arrest Prediction | Intermediate | AUC: 0.93 (Combined Waveforms) | High accuracy for physiological waveform integration. | Dataset limited to swine models. | [66]
Lung Diseases | Intermediate | Accuracy: 96.67%, F1: 96.88%, TNR: 97.02% | Robust performance using Vision Transformers. | Focused solely on infectious/non-infectious lung diseases. | [67]
Chronic Kidney Disease (CKD) | Intermediate | MAE: 1.03 (Training), MAPE: 3.59% (Training) | Uses multimodal prompts with LMMs for effective predictions. | Limited testing data for validation. | [68]