Article

Evaluating ChatGPT for Disease Prediction: A Comparative Study on Heart Disease and Diabetes

Faculty of Computing and Information, Albaha University, Albaha 65731, Saudi Arabia
BioMedInformatics 2025, 5(3), 33; https://doi.org/10.3390/biomedinformatics5030033
Submission received: 18 March 2025 / Revised: 15 May 2025 / Accepted: 19 June 2025 / Published: 25 June 2025
(This article belongs to the Section Applied Biomedical Data Science)

Abstract

Background: Chronic diseases place a significant burden on healthcare systems due to the need for long-term treatment. Early diagnosis is critical for effective management and for minimizing risk. Traditional diagnostic approaches face various challenges regarding efficiency and cost. Digitized healthcare offers several opportunities, such as reducing human error, improving clinical outcomes, and enabling data traceability. Artificial Intelligence (AI) has emerged as a transformative tool in healthcare, and the evolution of Generative AI represents a new wave. Large Language Models (LLMs), such as ChatGPT, are promising tools for enhancing diagnostic processes, but their potential in this domain remains underexplored. Methods: This study presents the first systematic evaluation of ChatGPT’s performance in chronic disease prediction, specifically targeting heart disease and diabetes. It compares the effectiveness of zero-shot, few-shot, and Chain-of-Thought (CoT) reasoning strategies, combined with feature selection techniques and different prompt formulations, on disease prediction tasks. The two latest versions of GPT-4 (GPT-4o and GPT-4o-mini) are tested, and the results are evaluated against the best models from the literature. Results: The results indicate that GPT-4o significantly outperforms GPT-4o-mini in all scenarios in terms of accuracy, precision, and F1-score. Moreover, a 5-shot learning strategy demonstrates superior performance to zero-shot, few-shot (3-shot and 10-shot), and various CoT reasoning strategies. The 5-shot learning strategy with GPT-4o achieved an accuracy of 77.07% in diabetes prediction using the Pima Indian Diabetes Dataset, 75.85% using the Frankfurt Hospital Diabetes Dataset, and 83.65% in heart disease prediction. Refining prompt formulations resulted in notable further improvements, particularly for the heart dataset (a 5% performance increase using GPT-4o), emphasizing the importance of prompt engineering. Conclusions: Even though ChatGPT does not outperform traditional machine learning and deep learning models, the findings highlight its potential as a complementary tool in disease prediction. Additionally, this work provides value by setting a clear performance baseline for future work on these tasks.

1. Introduction

Recent Artificial Intelligence (AI) advancements have significantly transformed the healthcare sector. These advanced approaches provide novel solutions for various problems, including disease diagnosis [1,2], treatment planning [3], drug discovery [4,5], and patient management [6,7]. Digitized healthcare offers several opportunities, such as reducing human error, improving clinical outcomes, and enabling data traceability [2]. AI technologies, particularly machine learning (ML) and deep learning (DL), have demonstrated remarkable accuracy in predicting multiple diseases from complex medical data, such as cancer [8], liver disease [9], and kidney disease [10].
Chronic diseases, as defined by the National Cancer Institute (NCI) [11], are conditions that last for three months or more and require long-term medical attention to prevent them from worsening. These diseases usually develop slowly and cannot be cured, so they significantly affect patient health and daily living. Common examples include heart disease, diabetes, cancer, and kidney disease. Several factors influence these diseases, including genetic and lifestyle factors such as physical inactivity and unhealthy diets, as indicated by the American Heart Association [12]. According to the World Health Organization’s (WHO) reports [13], cardiovascular diseases are the leading cause of death globally, accounting for 17.9 million deaths annually. Diabetes is likewise among the most widespread diseases: the number of people living with diabetes has risen significantly, from 200 million in 1990 to 830 million in 2022 [14].
Furthermore, the daily increase in patients raises global challenges to healthcare systems and highlights the need for effective prevention, early detection, and novel management approaches. Utilizing cutting-edge technologies enables early diagnosis and thus supports efficient management of diseases. Subsequently, among all innovative healthcare applications, building prediction models for the early diagnosis of chronic disease has obtained considerable attention. Likewise, automating diagnostic tasks improves clinical decision-making, reduces human error, and thus reduces mortality rates and healthcare costs due to timely involvement.
Moreover, ML and DL have revolutionized healthcare by empowering the development of predictive models from complex medical data, for example for predicting brain tumors [15], sepsis infection [16], and pediatric length of stay in an Intensive Care Unit (ICU) [17]. Furthermore, several medical devices that utilize ML/DL models have received approval from health regulatory bodies such as the FDA. Beta Bionics [18] was designed to automate an insulin delivery system, and IDx-DR [19] was developed to detect diabetic retinopathy (DR) in patients with diabetes. Despite the success of ML and DL models, they face limitations that affect trust and adoption in healthcare. For instance, their black-box nature makes it hard to interpret how a model adjusts its internal parameters [20]. Furthermore, supervised learning requires extensive labeled datasets for training, which limits its applicability in clinical scenarios where annotated data is limited. These challenges emphasize the need for alternative approaches.
Furthermore, Large Language Models (LLMs) represent a significant evolution in AI due to their advanced ability to process and generate human-like text. ChatGPT, produced by OpenAI, is one of the most notable LLMs. It has expanded the scope of AI applications through advanced reasoning and interactive communication. Additionally, ChatGPT has significant potential to enhance various natural language processing (NLP) tasks. Its adaptability in information extraction, sentiment analysis, and text classification showcases its ability to understand diverse contexts and extract meaningful details [21]. Moreover, its advanced NLP capabilities position it as a promising tool for addressing medical data and supporting diagnostic decision-making [22]. In contrast to ML/DL models, ChatGPT is pre-trained on vast, diverse datasets, allowing it to understand concepts without needing labeled data for each task. Likewise, ChatGPT uses the transformer architecture with a self-attention mechanism, allowing it to process data in parallel rather than sequentially, as recurrent neural networks (RNNs) do. Moreover, transformers enhance scalability and efficiently handle extensive data and large parameter counts.
Moreover, most ML and DL models treat each input as independent and often do not maintain a memory of previous inputs. In contrast, ChatGPT supports conversation tracking and adapts responses based on previous dialogue. Additionally, traditional ML/DL models are limited by their data- and task-specific design, whereas ChatGPT is better at generalization and understanding due to its ability to interpret complicated patterns in data. ChatGPT also does not require intensive task-specific training, as in zero-shot learning, and it performs reasonably well when given a few examples, known as few-shot learning. However, using LLMs such as ChatGPT for disease prediction remains an emerging area of exploration.
In this paper, we investigate the potential of ChatGPT, particularly GPT-4, as a tool for disease prediction. We focus on two chronic diseases, heart disease and diabetes, leveraging the model’s ability to process prompts and provide output. We explore its performance across various learning strategies: zero-shot, few-shot (3-shot, 5-shot, 10-shot), and various CoT reasoning strategies.
This comparison provides insight into which approach yields the best accuracy for disease prediction tasks. In addition, this study examines the effect of prompt formulation and engineering on the prediction results. Additionally, we tested two feature selection methods, correlation-based and mutual information, to investigate how the number and types of features influence prediction performance. Subsequently, we compare the outcomes against established models from the literature.
The main contributions of this work include:
  • Utilize ChatGPT for chronic disease diagnosis, particularly heart disease and diabetes prediction.
  • Explore the impact of prompt engineering and feature selection methods on predictive performance.
  • Investigate the influence of different learning strategies on model outcomes, including zero-shot, few-shot, and chain-of-thought reasoning.
  • Evaluate ChatGPT’s results against models from the literature.
  • Suggest a workflow for ChatGPT as an assistant for ML/DL models to enhance clinical decision-making.
The remainder of this paper is organized as follows: Section 2 discusses related work. Section 3 details the methodology, including the dataset, pre-processing, feature selection, and experimental design. Section 4 presents the results of the experiments. Section 5 provides the discussion. Finally, Section 6 concludes the study, highlighting its implications for healthcare and future work.

2. Related Works

2.1. Diabetes Disease Prediction

Several datasets have been used in the literature, and we grouped the studies based on the dataset utilized. Some studies relied on local datasets for their experiments, often collected from a specific hospital in the authors’ region. Gollapalli et al. [23] categorized diabetes patients into Type 1 Diabetes (T1DM), Type 2 Diabetes (T2DM), and Pre-diabetes. They utilized four ML algorithms: support vector machine (SVM), random forest (RF), K-nearest neighbors (KNN), and decision tree (DT), along with Bagging and Stacking ensemble-based approaches. The dataset was collected from King Fahad University Hospital (KFUH) in Saudi Arabia, and the results indicated that the stacking approach surpassed the other methods. Other researchers collected data (3000 records) from patients at five different Saudi hospitals to classify them as diabetic or non-diabetic using logistic regression (LR), DT, RF, and SVM algorithms, and studied the effect of the diabetes-specific tests (FPG and HbA1c) [24]. They found that including HbA1c as a feature increases performance compared to FPG. Qteat and Awad [25] used a local dataset named “DataPal” to predict diabetes types. The SVM algorithm was applied after pre-processing the data and filling in the missing values using the KNN algorithm. The model achieved an accuracy of 98.73%.
Moreover, Fitriyani et al. [26] focused on detecting type 2 diabetes and hypertension. They developed a disease prediction model (DPM) based on an ensemble approach. An isolation forest (iForest)-based outlier detection method was used to remove outliers, whereas a synthetic minority oversampling technique was implemented to balance the data distribution. They tested four different datasets, and the accuracy varied (96.74%, 85.73%, 75.78%, and 100%). Ali et al. [27] tested different KNN types (Fine, Weighted, Medium, and Cubic) for diabetes classification using a generated dataset. The results indicated that Fine KNN gives higher accuracy than the other methods.
Furthermore, some of the studies primarily used surveys as a data source. Almutairi and Abbod [28] classified the diabetes prevalence rates using two types of KNN algorithms (fine KNN and weighted KNN) and four kernel functions of SVM (linear SVM, Gaussian SVM, quadratic SVM, and cubic SVM). The dataset consists of published national surveys in KSA. The results indicated that weighted KNN outperforms the other algorithms, with the highest average accuracy of 94.5%. Alsulami et al. [29] utilized a Deep Neural Network (DNN), an Autoencoder (AE), and a Convolutional Neural Network (CNN) to predict type 2 diabetes. They used the dataset collected by Syed and Khan in [30]. Further, they developed several experiments to test feature selection effects and compared the results with ML algorithms. The results demonstrate that AE outperformed other models with an accuracy of 81.12% for imbalanced data and 79.16% for balanced data.
The rest of the studies discussed in this subsection used the Pima Indian Diabetes Dataset (PIDD), a well-known benchmark that is frequently used for diabetes prediction tasks in machine learning research. Mahesh et al. [31] proposed a blended ensemble learning (EL) system based on Bayesian networks (BN) and radial basis functions (RBF) for predicting diabetes. The performances of five ML techniques, namely LR, DT, SVM, KNN, and RF, were compared with the proposed EL technique. Experiments revealed that the proposed method outperforms all five ML approaches, with an accuracy of 97.11%.
Moreover, Patil and Tamane [32] implemented eight machine learning models: LR, KNN, SVM, gradient boost, DT, MLP, RF, and Gaussian naïve Bayes. The results showed that the LR model achieved the highest accuracy, with 79.54%. Wang et al. [33] utilized an RF algorithm for Diabetes Mellitus classification on the PIDD dataset and addressed the imbalance problem by implementing a synthetic sampling method (ADASYN). Their approach, called DMP_MI, achieved an accuracy of 87.1%. Other researchers applied a clustering algorithm combined with a Sequential Minimal Optimization (SMO) classifier to the same dataset [34], while another study identified the naïve Bayes algorithm as optimal [35].
Furthermore, Joshi and Dhakal [36] focused on specific features in PIDD: glucose, pregnancy, body mass index (BMI), diabetes pedigree function, and age. Their model achieved a prediction accuracy of 78.26% using LR and DT algorithms. Sivaranjani et al. [37] implemented step-forward and backward feature selection on the PIDD dataset and achieved an accuracy of 83% using the RF algorithm. Similarly, other researchers achieved an accuracy of 82% based on an RF classifier on the same dataset [38].
Kumari et al. [39] combined RF, LR, and NB and tested the effect of implementing an ensemble soft voting classifier. They found that their proposed approach outperforms the base classifiers, including Adaptive Boosting (AdaBoost), GradientBoost, eXtreme Gradient Boosting (XGBoost), and CatBoost (CAT), with accuracy, precision, recall, and F-score of 79.04%, 73.48%, 71.45%, and 80.6%, respectively. Kalagotla et al. [40] implemented a correlation technique for feature selection from the Pima Indian dataset; the selected features were age, body mass index (BMI), and glucose. The findings indicated that a stacking technique (multi-layer perceptron, SVM, and LR) outperformed AdaBoost with 78.2% accuracy, 72.2% precision, 54.4% recall, and a 59.4% F-score. Bukhari et al. [41] trained a model using an artificial backpropagation scaled conjugate gradient neural network (ABP-SCGNN) algorithm to predict diabetes effectively. They tested different numbers of neurons in the hidden layer, ranging from 5 to 50, and the results showed that the ABP-SCGNN model with 20 neurons achieved an accuracy of 93%.
Alreshan et al. [42] achieved 98.81% accuracy using a stacked ANN (stack-ANN) on the PIDD dataset. They also tested the same algorithm on the diabetes dataset of the Frankfurt Hospital, Germany [43], and achieved 99.51% prediction accuracy. Other researchers [44] used the Frankfurt Hospital diabetes dataset and found that the RF method outperformed XGBoost and DT, yielding an accuracy of 98%.

2.2. Heart Disease Prediction

All the works discussed in this subsection utilized the UCI Heart Disease dataset, which has been widely used to evaluate machine learning models for predicting heart disease outcomes. Shrestha [45] applied multiple ML algorithms to the Cleveland UCI Heart Disease dataset, including LR, RF, Gradient Boosting, XGBoost, and LSTM networks. The results indicated that XGBoost achieved the highest performance, with an accuracy of 90%. Alfadli et al. [46] tested feature selection on the UCI Heart Disease dataset using the Importance Permutation approach. To address missing values, they replaced missing numerical values with the mean, while missing values in categorical features were replaced with the new label “Unknown”. They found that the Gaussian Process (GP) outperforms the other algorithms with an accuracy of 84.24%, a recall of 89.22%, and a precision of 83.49%. Other researchers also used the UCI Cleveland dataset: Anderie et al. [47] found that the SVM algorithm outperformed DT, NB, and KNN with an accuracy of 85%, and Bharti et al. [48] achieved 94.2% accuracy using a deep learning approach.
Furthermore, Asif et al. [49] implemented different hyperparameter optimization approaches, including grid search cross-validation (CV) and randomized search CV. The findings indicated that the extra tree classifier performed better than RF, XGBoost, and CatBoost, with an accuracy of 98.15%. Noroozi et al. [50] utilized various feature selection techniques, including filter, wrapper, and evolutionary approaches. They found that implementing filter feature selection using correlation-based feature selection, information gain, and symmetrical uncertainty improved the performance of SVM, achieving 85.5%. Chandrasekhar and Peddakrishna [51] explored various ML algorithms, namely RF, NB, KNN, LR, gradient boosting (GB), and AdaBoost (AB), on the Cleveland and IEEE Dataport heart disease datasets. The grid search CV approach was used for hyperparameter tuning, while standardization and normalization techniques were implemented for feature scaling. To improve performance, they combined all the models using a soft voting ensemble method, leading to an accuracy of 93.44% for the Cleveland dataset and 95% for the IEEE Dataport dataset. Korial et al. [52] applied the Chi-squared feature selection method and then tested multiple ML classifiers, including NB, RF, LR, and KNN. They found that their voting ensemble model achieved an accuracy of 92.11%.

2.3. Research Gap

Even with the capabilities of traditional models in automating diverse tasks and achieving high performance, there are several limitations. Their black-box design makes it hard to interpret how a model adjusts its internal parameters. Additionally, they interpret inputs independently and often do not remember previous inputs. Moreover, the need for labeled training datasets limits their applicability, particularly in complex domains such as clinical scenarios, where annotated data is restricted and requires assistance from specialists. Furthermore, such models are designed for specific tasks, and thus their abilities are limited. In contrast, LLMs such as ChatGPT address these issues through their ability to interpret complicated patterns, understand context, and track conversations based on previous dialogue. In addition, they are trained on massive data, which enhances generalization and eliminates the need for intensive task-specific training, as in zero-shot learning. Moreover, they perform reasonably well when given a few examples, as in few-shot learning.
This paper addresses a critical research gap by exploring the applicability of LLMs, particularly ChatGPT, to disease prediction, an area traditionally dominated by ML and DL models. While ChatGPT has shown promise in various NLP tasks, its capabilities in prediction tasks, particularly in the medical domain, remain underexplored. To the best of our knowledge, this work is the first to systematically evaluate ChatGPT’s capabilities in chronic disease prediction, with a particular focus on heart disease and diabetes.

3. Methodology

In this work, we follow a structured pipeline designed to evaluate the performance of ChatGPT in chronic disease prediction. The workflow is summarized in Figure 1. The methodology is divided into several phases: selecting the datasets, pre-processing the data, selecting significant features, choosing a learning strategy (zero-shot, few-shot, or Chain-of-Thought (CoT) reasoning), prompt engineering, and API configuration; finally, the results are evaluated and compared against the best models from the literature. All the experiments were performed using Python 3.12.4 with multiple supporting libraries for analysis and visualization, including Pandas [53], NumPy [54], Scikit-Learn [55], and Matplotlib [56].
We hypothesize that ChatGPT will demonstrate varying performance depending on the learning strategy. Firstly, the zero-shot approach depends entirely on the pre-trained knowledge of the model, so it is expected to provide the baseline performance. Secondly, we anticipate that few-shot learning will outperform zero-shot learning due to the included examples. Thirdly, incorporating domain-specific knowledge into the CoT method is expected to improve on the standard CoT strategy. However, because not all tasks benefit from multi-step reasoning, the superiority of CoT methods cannot be guaranteed. Moreover, we anticipate that prompt formulation will significantly impact accuracy. These hypotheses are tested through various experiments and datasets to evaluate the impact of different prompt strategies on model performance.

3.1. Dataset

3.1.1. Diabetes Dataset

The first dataset utilized in this study is the Pima Indian Diabetes Dataset (PIDD) [57] provided by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The dataset consisted of 9 features, as illustrated in Table 1.
The second dataset is the diabetes dataset collected from Frankfurt Hospital, Germany. It has 2000 records, with 684 diabetic patients and 1316 non-diabetic patients. The features are similar to those in the PIDD dataset. Table 2 presents a sample of the data.

3.1.2. Heart Disease Dataset

The dataset utilized in this study for heart disease is the Cleveland UCI Heart Disease dataset [58]. It was collected in 1988 in Cleveland and consists of 76 attributes, of which only 14 are used in practice. Table 3 explains the features of the dataset.

3.2. Data Pre-Processing

Data pre-processing ensures that the data is clean and standardized for subsequent analysis. In this work, we perform cleaning and data transformation. In cleaning, we handle missing values by removing any row that includes a null value. In data transformation, we convert the numeric codes of all categorical features into descriptive text (categorical) values.
The Pima Indian Diabetes Dataset was examined for pre-processing. Upon inspection, it was found to have no missing (null) values and no categorical features; thus, no modifications were applied to the data before proceeding to the next phase. For the UCI heart dataset, we removed six rows that included missing data, leaving 297 rows for analysis. The ‘num’ column contains the labels 0–4, representing severity levels ranging from 0 (healthy) to 4 (severe).
Moreover, Figure 2 illustrates the proportion of patients in each category based on the ‘num’ variable, highlighting the relative frequency of each group in the dataset after cleaning null values. The chart highlights a noticeable imbalance in the distribution of patients across the five categories. Category 0 holds the majority, with 160 patients who do not have heart disease. In contrast, only 13 patients have severe heart disease, representing the smallest group. The other categories are smaller but still show some variation. Leaving the dataset with imbalanced classes would bias the model toward predicting the majority class (no heart disease) and degrade performance; thus, we converted the labels into a binary classification to avoid skewed model performance. Furthermore, the ChatGPT-based models are evaluated against the best models from the literature, which all follow the same transformation; converting the problem into binary classification therefore ensures a fair comparison.
Subsequently, we replaced the abbreviated feature names with their full names to enhance the model’s understanding of the features. Likewise, to improve clarity, we performed data transformation and replaced the numeric codes in the categorical features with text, such as replacing 0 in sex with male and 1 with female.
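To make these steps concrete, the following is a minimal sketch of the pre-processing described above for the heart dataset; the file name and the subset of rename mappings are illustrative assumptions, not the exact script used in the paper:

```python
import pandas as pd

# Load the UCI Cleveland heart disease data (file name is hypothetical).
heart = pd.read_csv("heart_cleveland.csv")

# Cleaning: drop rows containing missing values (297 rows remain).
heart = heart.dropna()

# Convert the multi-level 'num' label (0 = healthy ... 4 = severe) into a
# binary target: 0 = no heart disease, 1 = heart disease present.
heart["num"] = (heart["num"] > 0).astype(int)

# Transformation: replace coded categorical values with descriptive text,
# shown here for 'sex' as described above; other categorical columns analogous.
heart["sex"] = heart["sex"].map({0: "male", 1: "female"})

# Replace abbreviated feature names with full names (illustrative subset).
heart = heart.rename(columns={"cp": "chest pain type",
                              "trestbps": "resting blood pressure"})
```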

3.3. Experiments

This paper implemented several approaches and experiments using the latest GPT versions, GPT-4o and GPT-4o-mini. GPT-4o is the high-intelligence model, while GPT-4o-mini is the most cost-efficient model, surpassing GPT-3.5 Turbo and other small models. Feature selection was evaluated to study its effect on performance. Different learning strategies were tested, including zero-shot, few-shot learning, and CoT reasoning, and prompt engineering was considered to enhance the prediction results; thus, multiple prompt formats were tested. Running multiple experiments allows us to discover the best approach for developing a disease prediction model. The following subsections provide a breakdown of the experiments and their results.

3.3.1. Feature Selection

Feature selection methods fall broadly into filter-based and wrapper-based methods. Wrapper methods evaluate subsets of features by iteratively training models; features are added to or removed from the subset based on the trained model’s output to obtain better results. The most popular wrapper methods are recursive feature elimination, forward feature selection, and backward feature elimination. Their main challenge is that they are usually computationally costly, especially when dealing with many features.
In contrast, filter methods assess the relevance of features using statistical measures, independently of model training. Filtering is generally used as a pre-processing step, since the best subset of features is selected before the training step, with features filtered based on their scores in the statistical test. In this work, two filter-based methods are tested: Mutual Information Feature Selection (MIFS) and Correlation-based Feature Selection (CBFS).
CBFS is faster and less computationally expensive than methods such as recursive feature elimination. It selects significant features based on their relevance to the target (the predicted attribute). If two features are strongly correlated, one can be predicted from the other, which means the model only needs one of them, as the second adds no additional information.
In CBFS, the correlation coefficients between the features are calculated to understand the relationships between them, and the values are displayed using a correlation matrix. There are three main correlation coefficient formulas: Pearson, Kendall, and Spearman. The Pearson correlation coefficient was picked for this work due to its popularity. Equation (1) shows how to compute the correlation between variables x and y:
$$r = \frac{\sum_{i}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i}\left(x_i - \bar{x}\right)^{2}\sum_{i}\left(y_i - \bar{y}\right)^{2}}} \qquad (1)$$
where $x_i$ and $y_i$ represent the $i$th values of the x and y variables, respectively, while $\bar{x}$ and $\bar{y}$ represent their means.
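As a concrete illustration, the pairwise Pearson coefficients of Equation (1) can be obtained directly with pandas; the file name and the selection cutoff below are assumptions for the sketch, since the paper drops features by inspecting the resulting values rather than applying a fixed threshold:

```python
import pandas as pd

df = pd.read_csv("pima_diabetes.csv")  # hypothetical file name

# Pairwise Pearson correlation matrix (Equation (1)) over all features.
corr_matrix = df.corr(method="pearson")

# CBFS: rank features by absolute correlation with the target column and
# keep the stronger ones (the 0.1 cutoff is illustrative).
target_corr = corr_matrix["Outcome"].drop("Outcome").abs()
selected = target_corr[target_corr >= 0.1].index.tolist()
print(target_corr.sort_values(ascending=False))
```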
Furthermore, MIFS is an entropy-based feature selection method that measures arbitrary dependencies between random variables. The mutual information (MI) between two variables is a non-negative value measuring the knowledge one supplies about the other. If the features (F) and target (T) are independent and contain no information about each other, their mutual information is zero [59]; conversely, higher MI values mean a higher dependency between the variables. The MI is calculated using Equation (2):
$$I(F;T) = H(F) - H(F \mid T) \qquad (2)$$
where I(F; T) is the mutual information for F and T. H(F) is the entropy for F, and H(F | T) is the conditional entropy for F given T.
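A corresponding MIFS sketch using scikit-learn’s estimator of Equation (2); the file name is again an assumption:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("pima_diabetes.csv")  # hypothetical file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

# Estimate I(F; T) of Equation (2) for each feature against the target.
mi_scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

# Discard features whose estimated mutual information is zero.
selected = mi_scores[mi_scores > 0].index.tolist()
print(mi_scores.sort_values(ascending=False))
```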
Pima Indian Diabetes Dataset (PIDD)
Developing effective prediction models, especially in medical domains, requires understanding feature correlations, which reveal how features interact and influence the outcome. The heatmap in Figure 3 shows the correlation between all features in the dataset. The correlation values range from −1 to 1, where 1 indicates a strong positive correlation, −1 indicates a strong negative correlation, and values near 0 mean no significant correlation between the features. A positive correlation means that as one feature increases, the other also increases, whereas a negative correlation means that as one feature increases, the other decreases.
The heatmap in Figure 3 shows that Glucose, BMI, Age, and Pregnancies are most strongly positively related to the diabetes outcome. The correlation between Glucose and Outcome is 0.47, demonstrating that higher glucose levels are a primary indicator of diabetes. Moreover, the correlation between BMI and Outcome is 0.29, implying that higher body fat increases the risk of diabetes. Additionally, Age and Pregnancies are moderately positively correlated with diabetes, with values of 0.24 and 0.22, respectively, reflecting the increased risk with increasing age and number of pregnancies. On the other hand, BloodPressure and SkinThickness show very weak correlations with the outcome, meaning they have a minor impact. Figure 4 illustrates the correlation between the diabetes features and the outcome. Both BloodPressure and SkinThickness will be dropped when using CBFS, since they have lower correlation values.
Moreover, Figure 5 shows the mutual information scores for feature selection. Features with higher scores, such as Glucose, BMI, Pregnancies, and Age, indicate strong relevance to the target variable (Outcome). However, BloodPressure shows a score of zero, so it will be discarded as irrelevant to the target.
Frankfurt Hospital Diabetes Dataset (FHDD)
Figure 6 illustrates the correlation between all features in the Frankfurt Hospital Diabetes Dataset. The results are closely aligned with those of the PIDD dataset. Likewise, as depicted in Figure 7, the results differ by no more than ±1.
Figure 8 depicts the mutual information scores. None of the features has a score of zero, indicating that all features contribute to the prediction, so none will be dropped from the analysis.
UCI Cleveland Heart Disease Dataset
The heatmap in Figure 9 shows the correlations between features in the UCI Cleveland dataset. Ca, Thal, and Oldpeak are most strongly related to the num (outcome) column, with values of 0.52, 0.51, and 0.50, respectively. Furthermore, Cp, Exang, and Slope have moderate positive relationships with the presence of heart disease, while Sex, Age, Trestbps, Chol, and Restecg show low positive correlations. The feature Chol shows a very weak correlation of 0.07, and Fbs has the lowest correlation of 0.05, indicating a very weak relationship with Num. On the other hand, Thalach has a negative correlation of −0.42 with Num, suggesting that higher heart rates during exercise tend to be associated with the absence of heart disease. Figure 10 illustrates the correlation between the features and Num.
Additionally, Figure 11 illustrates the mutual information scores. Age, Trestbps, and Fbs show values of zero, meaning they are irrelevant to the target (Num). Thus, they will not be passed to the prediction model when using MIFS.

3.3.2. Learning Strategy Selection

Figure 12 illustrates the different experimental scenarios based on the learning strategies employed in this study: zero-shot, few-shot learning, and CoT reasoning. Each scenario was tested to evaluate its impact on the performance of GPT-4o and GPT-4o-mini.
In zero-shot learning, the model performs a task without any provided examples or task-specific training; instead, it recognizes new concepts from just the provided description [60]. The model relies on pre-trained knowledge and general language understanding to make predictions and generate outputs. This strategy eliminates the need for task-specific labeled examples, which makes it worthwhile for quickly formulating prompts and testing model performance.
Moreover, few-shot learning is a sub-area of machine learning [61]. It involves feeding the model a few examples of the prediction task, which helps it generalize from those examples to new cases. The goal is to enable the model to generate accurate predictions by learning from a small number of examples. Since providing diverse examples is important to help the model learn from these patterns, we tested various numbers of examples (3-shot, 5-shot, and 10-shot). We excluded outliers during example selection and picked different patient characteristics to cover diverse scenarios. Further, to prevent bias, we tried to select a balanced number of examples for each outcome; for instance, for the 5-shot setting, we selected three cases representing diabetes/heart disease and two cases representing non-diabetes/no heart disease. Additionally, since consistency in prompt formatting is very important, we followed a structured format and ensured that all provided examples represented the patient data in the same structure, which improves the model’s understanding of the prompt, as shown in the sketch below.
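The sketch below illustrates how such a structured few-shot prompt could be assembled; the exact wording is an assumption, since the paper does not publish the verbatim prompt for this strategy:

```python
def format_patient(row, feature_names):
    """Render one patient record in a consistent 'feature: value' layout."""
    return ", ".join(f"{name}: {row[name]}" for name in feature_names)

def build_few_shot_prompt(examples, labels, query_row, feature_names):
    """Assemble a few-shot prompt: labeled examples first, then the query case."""
    lines = ["Here are examples of patient data and their outcomes (1 = disease, 0 = no disease):"]
    for row, label in zip(examples, labels):
        lines.append(f"{format_patient(row, feature_names)} -> Outcome: {label}")
    lines.append("Now predict the outcome (0 or 1) for this patient:")
    lines.append(format_patient(query_row, feature_names))
    return "\n".join(lines)

# Usage: a 2-shot prompt over two illustrative features.
prompt = build_few_shot_prompt(
    examples=[{"Glucose": 148, "BMI": 33.6}, {"Glucose": 85, "BMI": 26.6}],
    labels=[1, 0],
    query_row={"Glucose": 183, "BMI": 23.3},
    feature_names=["Glucose", "BMI"],
)
print(prompt)
```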
Furthermore, Chain-of-Thought (CoT) prompting encourages the model to follow step-by-step reasoning before concluding. In both zero-shot and few-shot CoT, the sentence “Let’s think step by step …” is included in the prompt to encourage the model to reason step by step before answering each question, as introduced in [62]. Subsequently, in the few-shot CoT experiment, the model is fed step-by-step reasoning examples instead of just a list of examples [63]. These examples show the expected input-output format and explain how to perform reasoning for similar cases.
Moreover, we suggest integrating domain-specific knowledge to help the model make more context-aware predictions using prior knowledge along with logical steps. We call it Knowledge-Enhanced CoT (KE-CoT) learning. Figure 13 shows the proposed structure for the prompt message. The structure for the zero-shot KE-CoT prompt message is illustrated in Figure 13a. The first part of the message defines the role of the model and provides context for its prediction task. The second part explicitly encourages the model to reason sequentially, instead of jumping directly to a prediction. The final section outlines specific steps and domain knowledge to guide the reasoning process. This section is mandatory in both zero-shot and few-shot KE-CoT. The goal is to improve the reasoning process by adding contextual guidance and domain-specific knowledge to the prompt.
For the few-shot KE-CoT, we tested two methods. The first method integrated reasoning steps with each provided example, and we call it few-shot KE-CoT with example reasoning (KE-CoT-ExR). Note that the merged reasoning steps employ the same domain knowledge presented in the final section of the prompt. Conversely, in the second method, we list the examples without any reasoning steps. The structure for few-shot KE-CoT-ExR is illustrated in Figure 13b. The first two sections are similar to what is explained in Figure 13a. Then, in the third section, examples are provided to illustrate the expected input–output format. Therefore, we constructed a prompt containing randomly selected patient examples and manually composed their corresponding reasoning, finalized with each example’s predicted outcomes (0 or 1). Hence, the model learns not only how to predict disease but also the reasoning progression behind those predictions, which improves the reliability and interpretability of the prediction.
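For orientation, a zero-shot KE-CoT prompt following the three-part structure of Figure 13a might look like the template below; the role statement, reasoning cue, and domain-knowledge bullets are illustrative paraphrases drawn from the feature analysis above, not the paper’s verbatim text:

```python
# Illustrative template for the three sections of Figure 13a.
KE_COT_TEMPLATE = """You are a medical expert who provides diabetes predictions based on patient data.

Let's think step by step before giving the final answer.

Use the following domain knowledge in your reasoning:
- Elevated glucose is a primary indicator of diabetes.
- Higher BMI, age, and number of pregnancies increase risk.

Patient data: {patient}
Provide your reasoning steps, then the final prediction (0 or 1)."""

prompt = KE_COT_TEMPLATE.format(
    patient="Glucose: 148, BMI: 33.6, Age: 50, Pregnancies: 6")
print(prompt)
```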

3.3.3. Prompt Engineering

Prompt engineering is a crucial aspect of creating prompts and can significantly improve the accuracy of the model’s predictions. Additionally, enriching the prompts with diverse examples of patient data along with their outcomes enhances the model’s learning from these patterns, which encouraged us to run experiments using few-shot prompting. Regarding the features, we tested both including all the features and concentrating only on the features most relevant to the prediction target, as indicated by the feature selection methods.
Prompts Formulation and Optimization
We wrote clear instructions to the model, explicitly specifying the expected output format and organizing the information to enable the model to parse the message more easily. Additionally, prompts were formulated based on the selected learning strategy and include concise instructions, such as step-by-step reasoning prompts for the CoT scenarios. Figure 14 shows an example of a zero-shot prompt for diabetes prediction.
Figure 15 shows the tested zero-shot prompts for heart disease prediction, with the differences highlighted in blue. Figure 15a presents the features and values as they appear in the dataset, while in Figure 15b we provide the full feature names. In Figure 15c, we transform the numerical data of all categorical features into text representing the actual values. Figure 15d combines the data transformation of Figure 15c with the column explanations of Figure 15b.
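A sketch of what a prompt in the style of variant (d) might look like; the wording, units, and value mappings are assumptions for illustration, not the paper’s exact prompt:

```python
# Prompt variant (d) in Figure 15: full feature names combined with
# text-valued categorical features. Wording and mappings are assumptions.
SEX_TEXT = {0: "male", 1: "female"}

def zero_shot_heart_prompt(row):
    return (
        "Based on the following patient data, predict whether the patient "
        "has heart disease. Answer only with 0 (no) or 1 (yes).\n"
        f"sex: {SEX_TEXT[row['sex']]}, "
        f"resting blood pressure: {row['trestbps']} mm Hg, "
        f"serum cholesterol: {row['chol']} mg/dl, "
        f"maximum heart rate achieved: {row['thalach']}"
    )

print(zero_shot_heart_prompt({"sex": 1, "trestbps": 145, "chol": 233, "thalach": 150}))
```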

3.3.4. API Configuration and Parameter Optimization

In this phase, the ChatGPT API is configured to interact with the model and obtain predictions. Multiple configurations were tested to evaluate performance under different scenarios. We use Python to send the constructed prompt to OpenAI’s API. Along with the prompt, we specify API parameters, including MODEL, MESSAGES, MAX_TOKENS, and TEMPERATURE. The MODEL parameter specifies which version of the language model to use, such as GPT-4o-mini or GPT-4o; these models differ in efficiency and performance. The MESSAGES parameter contains the conversation: a list of message objects, each with a role (‘system’, ‘user’, or ‘assistant’) and content (the text input).
Additionally, the message with the ‘user’ role represents the query and its inputs, such as the patient data. The message with the ‘system’ role defines the model’s primary role and guides its behavior. We used this message in the experiments:
“You are a medical expert who provides …. disease predictions based on patient data.”
Moreover, the TEMPERATURE parameter controls the randomness of the output; low temperature values (e.g., 0 to 0.3) yield more deterministic outputs, which suits binary classification tasks such as ours. We tested setting TEMPERATURE to 0.3, assuming it could help improve the results; however, increasing the temperature did not improve performance. Accordingly, we set the value to 0 in all our experiments. Finally, the MAX_TOKENS parameter was configured based on the chosen learning strategy. For the zero-shot and few-shot strategies, a single token was sufficient to capture the output, since the prediction was limited to a binary classification (0 or 1); hence, MAX_TOKENS was set to 1. However, in CoT, the output typically includes reasoning steps followed by the final prediction. Consequently, MAX_TOKENS was adjusted dynamically to fit the expected output size and ensure the model had sufficient token capacity to provide a detailed step-by-step clarification. This tuning process ensures that parameters are adjusted through experimentation to improve performance and outcomes.
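The configuration described above maps directly onto the OpenAI chat completions API; a minimal sketch, assuming the current openai Python SDK, an API key in the environment, and a placeholder prompt built as in Section 3.3.3:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder prompt; in the experiments this is built as in Section 3.3.3.
prompt = "Patient data: Glucose: 148, BMI: 33.6, Age: 50. Predict diabetes (0 or 1)."

response = client.chat.completions.create(
    model="gpt-4o",  # or "gpt-4o-mini"
    messages=[
        {"role": "system",
         "content": "You are a medical expert who provides diabetes predictions based on patient data."},
        {"role": "user", "content": prompt},
    ],
    max_tokens=1,    # a single token suffices for the 0/1 answer (zero-/few-shot)
    temperature=0,   # deterministic output for the binary classification task
)
prediction = response.choices[0].message.content.strip()
print(prediction)
```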

3.3.5. Evaluation

We evaluate the performance using accuracy as well as other standard evaluation metrics, including precision, recall, and F1-score, to assess how well the model handles all categories, particularly the minority ones. Accuracy is the most common performance measure: the ratio of correctly predicted instances to the total number of predictions. It is calculated as:
$$\text{accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
Precision (also known as Positive Predictive Value) is the ratio of correctly predicted positive instances to the total predicted positives. It is calculated as:
$$\text{Precision } (P) = \frac{TP}{TP + FP}$$
where TP represents the total number of True Positives (instances correctly classified as positive), and FP represents the total number of False Positives (instances wrongly predicted as the positive class).
Recall is also known as the True Positive Rate. It is calculated as:
$$\text{Recall } (R) = \frac{TP}{TP + FN}$$
where FN is the total number of False Negatives (instances wrongly predicted as the negative class).
The F1-score is the harmonic mean of precision and recall. It is calculated as:
$$F1\text{-score} = \frac{2PR}{P + R}$$
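All four metrics can be computed with scikit-learn once the model’s 0/1 answers are collected; the label arrays below are illustrative placeholders:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # actual outcomes (illustrative placeholders)
y_pred = [1, 0, 0, 1, 0, 1]  # parsed ChatGPT predictions (illustrative)

print(f"accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"precision: {precision_score(y_true, y_pred):.4f}")
print(f"recall:    {recall_score(y_true, y_pred):.4f}")
print(f"f1-score:  {f1_score(y_true, y_pred):.4f}")
```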

4. Results

After running the prediction experiments on each dataset, we compared the model’s predictions with the actual outcomes to assess performance, using the standard evaluation metrics: accuracy, precision, recall, and F1-score. In all result tables, bold values indicate the highest performance for each metric.

4.1. Results of Diabetes Prediction Task

4.1.1. Prediction Results for the Pima Indian Diabetes Dataset (PIDD)

Table 4 and Table 5 show the evaluation results for the different diabetes prediction experiments. Table 4 presents the evaluation results for zero-shot and few-shot learning with the two GPT-4 versions (GPT-4o, GPT-4o-mini). The results demonstrate that GPT-4o significantly outperforms GPT-4o-mini. Implementing correlation-based feature selection with GPT-4o slightly enhances accuracy (by 0.39%) for zero-shot and 5-shot learning. The highest accuracy achieved by ChatGPT was 77.08%, using the GPT-4o version and 5-shot learning with correlation-based feature selection.
Furthermore, Table 5 displays the evaluation results using CoT reasoning. As GPT-4o-mini did not provide promising results in the previous scenarios, we implemented CoT using only GPT-4o. In addition, we applied the CBFS approach, since it helped achieve the highest results with the previously implemented learning strategies. All the tested KE-CoT scenarios (zero-shot KE-CoT, 3-shot KE-CoT, 3-shot KE-CoT-ExR, and 5-shot KE-CoT) outperform the traditional zero-shot/3-shot CoT in terms of accuracy, precision, and F1-score. This finding suggests that integrating domain-specific information improves the model’s prediction ability. Moreover, the results show that the examples provided in 3-shot KE-CoT/KE-CoT-ExR and 5-shot KE-CoT slightly improve the reasoning process compared with zero-shot KE-CoT. Both 3-shot KE-CoT-ExR and 5-shot KE-CoT yield the highest accuracy of 73.7%. Subsequently, 3-shot KE-CoT-ExR obtains the highest precision (60.31%), whereas 5-shot KE-CoT achieves the highest F1-score (67.63%). However, the results are still lower than those of the other learning strategies in the previous experiments. Moreover, the suboptimal results yielded by 10-shot learning in the previous experiments, together with the high token cost of 10-shot CoT, discouraged further testing.

4.1.2. Prediction Results for the Frankfurt Hospital Diabetes Dataset (FHDD)

Table 6 shows the results for the FHDD. MIFS is not included in this experiment, since the mutual information scores indicated that all the features are important. The results obtained by including all the features are slightly higher than those obtained using CBFS. The best prediction accuracy was obtained by GPT-4o using 5-shot learning (75.85%).
Moreover, Table 7 illustrates the evaluation results using CoT reasoning, including the optimal configuration from the previous experiment (Table 6). As with the PIDD, the domain-specific knowledge message helps improve the CoT reasoning results. Nevertheless, we cannot conclude that any one of the KE-CoT approaches outperforms the others, since each achieves the top score on only one metric. The highest accuracy (72.35%) was yielded by both 3-shot KE-CoT and 3-shot KE-CoT-ExR, nearly matching that of zero-shot KE-CoT (72.3%). Additionally, the highest precision (57.34%) was obtained by zero-shot KE-CoT, while 5-shot KE-CoT achieved the highest F1-score (66.14%). However, the results are still lower than those of the other learning strategies in the previous experiments.

4.2. Results of Heart Disease Prediction Task

For heart disease prediction, we tested different prompts to evaluate the effect of prompt engineering on the results. The dataset originally encoded its categorical features as numerical values to suit analysis with traditional ML; we converted them back into text to provide clear input for the ChatGPT model. Furthermore, we spelled out the names of the columns (features) to ensure that the model could interpret the data effectively and generate accurate predictions, and we then tested the effect of this step on its performance in the disease prediction task.
Table 8 illustrates the results of the four prompts presented in Figure 15. All the experiments in the table were accomplished using zero-shot and without any feature selection. The goal is to study the effect of prompt formulation. The results demonstrate that prompt engineering can significantly enhance the prediction results. The fourth prompt achieved the highest result, where we clarified the feature and replaced numerical data with full text reflecting the value. This approach resulted in an accuracy of 80.47%, a precision of 76.51%, and an F1-score of 79.72%, demonstrating the effectiveness of clear and detailed prompt formulation in improving the model’s performance. Therefore, we used this format in further experiments for few-shot learning.
Table 9 illustrates the results using zero-shot and few-shot learning with and without feature selection methods. In all cases, the CBFS approach helps increase the accuracy; in addition, GPT-4o beats GPT-4o-mini in terms of accuracy, precision, and F1-score. The 5-shot learning using GPT-4o with the CBFS method achieved the highest results, with 85.52% accuracy, 87.30% precision, and 83.65% F1-score. Furthermore, Table 10 shows the results of the CoT reasoning approaches. The 3-shot KE-CoT outperforms the others in accuracy and precision, yielding an accuracy of 76.9% and a precision of 73.24%. However, unlike the prior CoT experiments in diabetes prediction, incorporating domain-specific knowledge failed to improve on the results of zero-shot CoT in terms of accuracy, precision, and F1-score. Moreover, the other KE approaches with few-shot learning (3-shot KE-CoT-ExR and 5-shot KE-CoT) fall behind 3-shot CoT. This raises important questions regarding when and how knowledge injection is beneficial. As in diabetes prediction, the CoT strategies failed to improve the prediction, with results far from what was accomplished by GPT-4o with few-shot learning, as shown in Table 9.

4.3. Model Validation

To investigate the potential of ChatGPT for diabetes prediction, we validated the model on a new dataset, the Iraqi Patient Dataset for Diabetes (IPDD) [64]. The dataset was obtained from the Medical City Hospital laboratory and the Specialized Center for Endocrinology and Diabetes at Al-Kindy Teaching Hospital in Iraq. The validation was conducted using the best-performing configuration (5-shot GPT-4o). On this new dataset, we obtained an accuracy of 94.08%, a precision of 99.50%, a recall of 93.84%, and an F1-score of 96.59%. This result highlights the strong potential of ChatGPT as a model for diabetes prediction.
However, validation of heart disease prediction on a new dataset was not feasible due to the lack of publicly available datasets, in addition to the challenges of collecting real-world clinical data. This limitation is discussed later (Section 5.1). We emphasize the need for further investigation of heart disease prediction using various datasets and prompt strategies to study the capabilities of ChatGPT in this complex prediction task.

5. Discussion

5.1. Comparative Analysis with Existing ML/DL Models

The experimental results identify 5-shot learning as optimal for ChatGPT among the tested options (3, 5, and 10 shots). Moreover, the findings highlight the importance of prompt formulation, since it led to significant improvement. In addition, GPT-4o beat GPT-4o-mini in all experiments, improving performance significantly. Additionally, correlation-based feature selection proved to have a positive impact in identifying the most predictive features for diabetes (PIDD) and heart disease (UCI) compared with the mutual information feature selection method. However, for the FHDD dataset, the mutual information method indicated that all features are important, and the results later demonstrated that the model achieved higher accuracy when including all features than when using the correlation-based method.
Subsequently, we conducted a sensitivity analysis comparing 6-shot learning against 5-shot learning using the best-performing configuration (GPT-4o). The difference is calculated as follows:
$$\Delta\text{Metric} = \text{Metric}_{6\text{-shot}} - \text{Metric}_{5\text{-shot}}$$
As shown in Table 11, the 6-shot approach did not outperform the 5-shot performance. The likely reason behind the success of 5-shot learning is its ability to provide sufficient examples to guide the model without introducing redundant or noisy information: too few examples may not give the model enough information to support accurate prediction, while too many could overwhelm it.
Furthermore, zero-shot learning and the various CoT reasoning methods performed worse than few-shot learning. Although CoT reasoning encourages the model to break down the problem step by step, it was notably less effective even than zero-shot learning. We expected that guiding the model through intermediate reasoning steps would improve the prediction; in contrast, the simplicity and directness of zero-shot learning yielded better results. Additionally, for diabetes, incorporating domain-specific knowledge helped improve prediction accuracy compared with zero-shot/few-shot CoT. Likewise, 3-shot KE-CoT yielded the highest accuracy and precision among the CoT strategies for heart disease prediction. However, the other KE strategies failed to outperform the standard zero-shot/few-shot CoT for heart disease prediction. The likely explanation is that the integrated knowledge was insufficient, especially given the complex combination of interacting features in heart disease. This underscores the need for prompt strategies designed around disease complexity. Involving specialists from the medical domain could provide valuable insights for improving the format of the Knowledge-Enhanced prompts, particularly for complex diseases. In addition, it should be emphasized that CoT does not always exceed few-shot/zero-shot performance: the success of CoT depends on the quality of the prompt and the type of task. A straightforward task may not benefit from multi-step reasoning; in fact, the reasoning may introduce noise rather than clarity. This could explain why CoT does not outperform few-shot/zero-shot learning in this work.
Moreover, it can be noticed that recall is high when the other metrics are low and, in some cases, reaches 100%, particularly for diabetes prediction. This indicates zero false negative cases, meaning that all actual diabetic cases were correctly identified. This result does not necessarily indicate an ability to distinguish between the two classes, particularly with imbalanced datasets such as ours: in the PIDD dataset, we had 268 positive cases out of 768, and in the FHDD, 684 positive cases out of 2000. Therefore, we cannot rely solely on recall with an imbalanced dataset. To mitigate this issue and establish the model’s actual performance, we adopted the following approaches. First, during few-shot learning, particularly 3- and 5-shot, where a perfectly balanced set of examples is impossible, we used a 3:2/2:1 minority-majority example ratio. Second, we used the PR AUC (Precision-Recall Area Under the Curve) metric, which is typically used for imbalanced dataset evaluation; it plots precision against recall at different thresholds, and a high PR AUC value demonstrates the model’s ability to identify minority positive cases (diabetic) while minimizing false positives. Figure 16 depicts the PR curve for diabetes. The PR AUC value is 0.7 for both the PIDD and the FHDD, indicating that the prediction model effectively identifies true positive cases (diabetic patients correctly identified as diabetic) while minimizing false positives (non-diabetic patients incorrectly identified as diabetic). Moreover, Figure 17 shows the PR curve for heart disease. A PR AUC of 0.87 is typically considered very good, indicating that the heart disease model captures most patients with heart disease while making few false positive predictions, owing to the good precision obtained. It is worth noting that the heart dataset contains 140 positive cases out of 297, so this dataset is relatively balanced. Finally, it should be emphasized that the prediction was converted to binary classification as explained in Section 3.2. Accordingly, the model may succeed in predicting patients with severe heart disease (class 4) while struggling with low-risk patients (class 1), mixing them with healthy patients (class 0).
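A sketch of the PR AUC computation with scikit-learn; since the model returns hard 0/1 answers, this assumes per-case positive-class scores are available, and the values below are illustrative:

```python
from sklearn.metrics import precision_recall_curve, auc

# Hard 0/1 answers give a degenerate PR curve, so this assumes per-case
# positive-class scores are available; the values below are illustrative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.8, 0.4, 0.1, 0.7, 0.3]

precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)  # area under the precision-recall curve
print(f"PR AUC = {pr_auc:.2f}")
```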
Moreover, we evaluate our results against the best models from the literature that utilize the same datasets; the selected models are listed in Table 12. Since the models were tested on different data sizes, a direct comparison would not be fair, particularly for ChatGPT, since we include the whole dataset in testing. Therefore, we used confidence intervals (CIs) to compare model performance. Instead of simply stating a model’s accuracy, the 95% CI gives a range of values within which the true accuracy is expected to lie. By inspecting the CIs, we can see whether the difference between models is statistically significant.
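As an example of the interval construction, a 95% CI for a reported accuracy can be computed with the normal approximation to the binomial proportion; the paper does not state which interval formula it used, so this is an assumption:

```python
import math

def accuracy_ci(accuracy, n, z=1.96):
    """95% CI for an accuracy via the normal approximation to the binomial
    proportion (the paper does not state which interval formula it used)."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

# Example: the 5-shot GPT-4o heart disease accuracy over 297 test cases.
low, high = accuracy_ci(0.8552, 297)
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```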
As can be noted from Figure 18 and Figure 19, our results using GPT-4o do not outperform the other approaches, particularly deep learning using stack-ANN, which achieved 98.81% on the PIDD and 99.51% on the FHDD. Additionally, the CIs in both charts indicate that the performance difference between ChatGPT and the others is statistically significant. Likewise, Figure 19 and Figure 20 demonstrate that the extra tree classifier [49] achieved the best results, with an accuracy of 98.15%, outperforming ChatGPT in heart disease prediction. However, looking at the CIs, we can see that the accuracy yielded by ChatGPT for heart disease prediction is close to the lower boundaries of [49,52].
Despite the finding that ChatGPT did not outperform the traditional models, recent advances in LLMs emphasize their ongoing promise in healthcare. The CPLLM model [65] and Med-BERT [66] have shown superiority over traditional models (logistic regression) in predicting future diagnoses. CoAD [67] emphasized the ability of LLMs to analyze health reports. Additionally, ComLLM [68] demonstrated the capability to predict potential relationships between diseases, where few-shot and CoT strategies led to superior results. These studies demonstrate the considerable potential of LLMs in the medical domain.
Moreover, ChatGPT is one of the leading LLMs due to its broad capabilities. The flexibility and ease of use of the provided API facilitate integration with hospital systems. Additionally, clinicians and specialists can interact with the model in natural language without expert programming skills. Moreover, employing zero-shot/few-shot learning along with well-designed prompts and/or examples can achieve the goal without intensive and costly model training. Therefore, it offers a powerful and cost-efficient tool for decision support in healthcare. Accordingly, this work establishes a valuable benchmark for its performance on disease prediction tasks and highlights the crucial role of prompt design.
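To illustrate this accessibility, the following minimal sketch issues a few-shot prediction request through the OpenAI Python client. The model name matches the one evaluated here, but the prompt wording and example records are illustrative rather than the study’s exact prompts.

```python
# Hedged sketch of a few-shot classification request via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative labeled examples and a query record (not the study's prompts).
few_shot_examples = (
    "Glucose=148, BMI=33.6, Age=50, Pregnancies=6 -> 1 (diabetic)\n"
    "Glucose=85, BMI=26.6, Age=31, Pregnancies=1 -> 0 (non-diabetic)\n"
)
patient = "Glucose=137, BMI=43.1, Age=33, Pregnancies=0"

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # deterministic output for a classification task
    messages=[
        {"role": "system",
         "content": "You are a medical assistant. Answer with 1 or 0 only."},
        {"role": "user",
         "content": f"Examples:\n{few_shot_examples}\nPatient: {patient}\nPrediction:"},
    ],
)
print(response.choices[0].message.content)  # expected: "1" or "0"
```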
The types of datasets used in this study should be considered when interpreting the results. These datasets were selected due to their reputation in the research community and their public availability. Including new, diverse, unstructured real-world datasets from various hospitals would be highly valuable for testing the generalizability and performance of the model. However, this was not feasible in this study due to the challenges and lengthy processes associated with collecting medical data, including regulations and the required access approvals. Therefore, we highlight this as a limitation, particularly for heart disease prediction, and consider it an important direction for future work.

5.2. Challenges in Using ChatGPT for Disease Prediction and Potential Enhancements

ChatGPT is a generative language model that processes input text using pattern recognition rather than explicit numerical computation. Additionally, ChatGPT’s predictions are probabilistic, meaning that its output can vary with small changes in the input message. Consequently, few-shot learning and the proposed task-specific prompting (Knowledge-Enhanced CoT reasoning) did not fully close this gap, given the complexity of the medical data. In addition, ChatGPT is trained on textual data; thus, it struggles with the numerical and structured data that are common in the clinical domain, for instance, recognizing that one attribute is more significant than others for prediction.
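One practical mitigation, reflected in the prompt reformulations of Figure 15b,c, is to serialize each structured record into descriptive text before querying the model. The helper below is a hypothetical sketch, not the study’s code; the category mappings follow Table 3, and the sample record is illustrative.

```python
# Hedged sketch: map coded heart-disease features to descriptive text
# before sending them to the model (cf. Figure 15b,c; codes per Table 3).
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}

def row_to_text(row: dict) -> str:
    """Convert one coded record into a descriptive sentence."""
    sex = "male" if row["sex"] == 1 else "female"
    return (f"A {row['age']}-year-old {sex} patient with "
            f"{CP_LABELS[row['cp']]} chest pain, resting blood pressure "
            f"{row['trestbps']} mm Hg, and serum cholesterol "
            f"{row['chol']} mg/dl.")

print(row_to_text({"age": 63, "sex": 1, "cp": 1, "trestbps": 145, "chol": 233}))
```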
Moreover, one reason that ML/DL models surpass ChatGPT is that they are trained on carefully labeled data and optimized for a specific disease prediction task. In contrast, ChatGPT generates responses based on what it has learned from massive training data, primarily extracted from web content such as websites, articles, and online books. Due to limitations in filtering techniques, the training data may contain undesirable or biased content that affects the model’s ability to provide accurate output.
In healthcare, biased models can lead to misdiagnoses and unfair treatment of patients. The training data may not reflect the health of everyone, particularly people of specific races or genders, or patients with low incomes, so the model may fail on these underrepresented groups. For instance, a model trained primarily on female patients will show poor results when diagnosing male patients; similarly, models trained on adult data may not work well for infant patients. Another ethical issue is that ChatGPT is still not FDA/CE-approved for direct diagnosis. Moreover, processing sensitive patient information raises significant privacy concerns.
Moreover, training on labeled medical data could reduce the dependency on prompt engineering and lead to more stable predictions; therefore, fine-tuning the ChatGPT model on domain-specific (medical) datasets could address this issue. Several approaches have been developed to handle bias, including simple resampling methods such as SMOTE, while other methods, such as human-in-the-loop (HITL) approaches [69], can be more powerful. In HITL, an expert (a specialist in the medical domain) is involved in performing different tasks: annotating high-quality training data, identifying and correcting biases, assisting the model’s fine-tuning process, and evaluating the model’s outputs. Also, enabling experts to customize the model offers greater control in adjusting the output, thus mitigating biases. Some of the concerns about the effort and time required to integrate humans can be addressed by combining HITL with automated machine learning methods. A minimal resampling example is sketched below.
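The sketch below is a hedged illustration of SMOTE resampling, assuming the imbalanced-learn package; the data are synthetically generated with roughly the PIDD class ratio rather than taken from the study.

```python
# Minimal SMOTE resampling sketch using imbalanced-learn on toy data.
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced training set (~65/35 split, as in PIDD).
X, y = make_classification(n_samples=768, weights=[0.65, 0.35], random_state=42)
print("before:", Counter(y))

# Oversample the minority class with synthetic interpolated examples.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes now balanced
```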
In addition, the ethical issue concerning the privacy of patient information during training can be addressed in several ways. One approach is generating synthetic data, which mimics real patient data without copying the actual records [70]. In some cases, partially synthetic data is more applicable than fully synthetic data: only the attributes considered at high risk of disclosure, such as patient-identifiable information, are replaced with fake values to avoid reidentification (see the sketch after this paragraph). Correspondingly, synthetic electronic health records can be generated using tools such as Synthea [71]. Such generated data not only address the privacy issue but also provide a large repository for fine-tuning. Therefore, further investigation is required to study the effect of fine-tuning on disease prediction alongside integrating these methods. Furthermore, a hybrid approach combining ChatGPT with traditional ML models, such as using ML-based feature selection, could help overcome specific limitations and merge the strengths of both.
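As an illustration of the partially synthetic approach, the sketch below replaces a hypothetical patient-identifiable column with fake values while leaving the clinical measurements untouched; the column names and records are illustrative, not drawn from any real dataset.

```python
# Hedged sketch of partially synthetic data: replace only the
# identifiable column, keep the clinical measurements.
import uuid
import pandas as pd

records = pd.DataFrame({
    "patient_name": ["A. Smith", "B. Jones"],  # identifiable -> replace
    "glucose": [148, 85],                      # clinical -> keep
    "bmi": [33.6, 26.6],
})

records["patient_name"] = [f"synthetic-{uuid.uuid4().hex[:8]}"
                           for _ in range(len(records))]
print(records)
```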

5.3. The Role of ChatGPT in Enhancing Clinical Decision-Making

ChatGPT can serve as a clinical assistant that enhances decision-making rather than replacing ML/DL models. Figure 21 illustrates the suggested workflow for ChatGPT as a complementary tool for clinicians. The process begins with patient data, which may be structured (e.g., laboratory results and vital signs) or unstructured (e.g., doctor’s notes and radiology reports). ChatGPT then acts as an assistant for pre-screening and data summarization to provide initial insights. Initially, it can extract relevant information from patient records and medical reports; previous studies demonstrate its ability to extract information from clinical notes [67,68] and radiology reports [72]. The next step filters the extracted information to eliminate what is unnecessary. The data from different sources, including radiology reports and laboratory results, can then be interpreted, grouped, and summarized. After that, an initial assessment of the patient’s risk can be made, minimizing clinician time and effort by providing a quick overview of potential concerns. The important information can then be converted into structured data to facilitate further analysis, and these data are analyzed by ML/DL models to make predictions. Afterwards, ChatGPT assists the clinician by generating an easy-to-understand report from the ML/DL model’s output; such reports can also improve patient engagement. Finally, the diagnosis and decision are made by the clinicians, who translate the outcomes into meaningful discussions with patients. A high-level sketch of this pipeline is given below.
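The following minimal sketch mirrors the workflow of Figure 21; every function is a hypothetical placeholder standing in for a ChatGPT call or a trained ML model, not the paper’s implementation.

```python
# High-level sketch of the suggested workflow (cf. Figure 21); all
# functions are illustrative stubs with toy return values.
def chatgpt_summarize(notes: str) -> dict:
    # Placeholder: in practice, a ChatGPT API call would extract and
    # structure key findings from the unstructured notes.
    return {"glucose": 148, "bmi": 33.6, "age": 50}

def ml_predict(features: dict) -> float:
    # Placeholder for a trained ML/DL classifier's risk score.
    return 0.78

def chatgpt_report(risk: float, features: dict) -> str:
    # Placeholder: ChatGPT would phrase the model output for clinicians.
    return f"Estimated diabetes risk {risk:.0%} given {features}."

def assist_clinician(notes: str) -> str:
    features = chatgpt_summarize(notes)     # 1. pre-screen and structure
    risk = ml_predict(features)             # 2. ML/DL prediction
    return chatgpt_report(risk, features)   # 3. clinician-readable report

print(assist_clinician("50 y/o patient, elevated fasting glucose ..."))
```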

6. Conclusions

This paper investigated the effectiveness of ChatGPT in chronic disease diagnosis, particularly heart disease and diabetes prediction from structured datasets. The datasets include medical test results such as glucose levels, BMI, blood pressure, and cholesterol levels. Various experimental setups using GPT-4o and GPT-4o-mini were designed to evaluate the impact of different learning strategies, including zero-shot, few-shot (3-shot, 5-shot, 10-shot), and CoT reasoning. Subsequently, we assessed the impact of incorporating domain-specific knowledge to improve zero-shot/few-shot CoT. Furthermore, we tested two feature selection methods, correlation-based and mutual information, and created various prompt formulations to examine their effect on prediction. The results were then compared with the best models from the literature employing traditional machine learning and deep learning. Additionally, we proposed a workflow in which ChatGPT serves as a complementary tool for clinicians to enhance decision-making and provide support alongside ML/DL models.
Moreover, the findings indicate that 5-shot learning with the GPT-4o model was the most effective configuration for diabetes and heart disease prediction, surpassing the other strategies (zero-shot, 3-shot, 10-shot, and CoT approaches). Prompt formulation also significantly improves performance, particularly for the heart dataset, where it contributed a 5% performance increase, emphasizing the importance of prompt engineering. Using Knowledge-Enhanced CoT improves prediction accuracy compared to zero-shot/few-shot CoT; however, it falls behind 5-shot learning. Multiple factors should be considered to understand the overall underperformance of CoT and the limited success of the KE-CoT approach in our experiments: the complexity of the task, the quality of the incorporated reasoning and knowledge, and the fact that not all tasks benefit from an explicit multi-step reasoning chain.
Furthermore, the findings show that ChatGPT does not outperform the best-performing models from the literature. However, the results demonstrate that it offers promise for disease prediction, yielding an accuracy of 85.52% for heart disease and, for diabetes, 77.08% using PIDD and 75.85% using FHDD, while raising the need for advanced techniques to address its limitations. It is worth noting that, despite not outperforming traditional methods, this work provides value by setting a clear performance baseline on these tasks for future work. In addition, the good validation results demonstrate the potential generalization of the model in diabetes prediction. However, the challenges we faced in validating heart disease predictions, owing to the limited availability of new public datasets, leave the model’s generalizability on this task uncertain.
Future work in this area could explore hybrid architectures that combine LLM reasoning with ML feature-based predictors to integrate their strengths, and expand testing over distinct clinical datasets for heart disease to investigate the model’s capabilities and ensure its generalizability. In addition, further multi-institutional datasets representing different diseases are required to evaluate the model’s capabilities. It will be important to evaluate OpenAI’s more recent o1 release, which offers advanced reasoning, together with deeper enhancements in prompt engineering, to improve accuracy. Additionally, involving specialists in the medical domain to assess the obtained results could provide valuable insights. Further research is also needed to investigate how ChatGPT performs in multi-class classification tasks and to address issues related to dataset imbalance, as well as the implementation of hybrid systems that integrate the strengths of ChatGPT with ML/DL models. Finally, the suggested AI-assisted workflow must be explored to test its feasibility for various diseases.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in the Kaggle repository at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (accessed on 31 May 2025), in the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/45/heart+disease (accessed on 31 May 2025), and in the IEEE DataPort at https://ieee-dataport.org/documents/type-2-diabetes-dataset (accessed on 31 May 2025).

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Fernández-Edreira, D.; Liñares-Blanco, J.; Fernandez-Lozano, C. Machine Learning analysis of the human infant gut microbiome identifies influential species in type 1 diabetes. Expert Syst. Appl. 2021, 185, 115648. [Google Scholar] [CrossRef]
  2. Kumar, Y.; Koul, A.; Singla, R.; Ijaz, M.F. Artificial intelligence in disease diagnosis: A systematic literature review, synthesizing framework and future research agenda. J. Ambient Intell. Humaniz. Comput. 2023, 14, 8459–8486. [Google Scholar] [CrossRef] [PubMed]
  3. Tom, J.; Zsoldos, M.; Thurzo, A. AI and Face-Driven Orthodontics: A Scoping Review of Digital Advances in Diagnosis and Treatment Planning. AI 2024, 5, 158–176. [Google Scholar] [CrossRef]
  4. Dara, S.; Dhamercherla, S.; Singh, S.; Ch, J.; Babu, M. Machine Learning in Drug Discovery: A Review; Springer: Dordrecht, The Netherlands, 2022; Volume 55. [Google Scholar]
  5. Blanco-González, A.; Cabezón, A.; Seco-González, A.; Conde-Torres, D.; Antelo-Riveiro, P.; Piñeiro, Á.; Garcia-Fandino, R. The Role of AI in Drug Discovery: Challenges, Opportunities, and Strategies. Pharmaceuticals 2023, 16, 891. [Google Scholar] [CrossRef]
  6. Garreffa, E.; Hamad, A.; Sullivan, C.C.O.; Hazim, A.Z.; York, J.; Puri, S.; Turnbull, A.; Robertson, J.F.; Goetz, M.P. Regional lymphadenopathy following COVID-19 vaccination: Literature review and considerations for patient management in breast cancer care. Eur. J. Cancer 2021, 159, 38–51. [Google Scholar] [CrossRef]
  7. Smith, T.W. Intimate Relationships and Coronary Heart Disease: Implications for Risk, Prevention, and Patient Management. Curr. Cardiol. Rep. 2022, 24, 761–774. [Google Scholar] [CrossRef]
  8. Kumar, Y.; Gupta, S.; Singla, R.; Chen, Y. A Systematic Review of Artificial Intelligence Techniques in Cancer Prediction and Diagnosis. Arch. Comput. Methods Eng. 2022, 29, 2043–2070. [Google Scholar] [CrossRef]
  9. Nam, D.; Chapiro, J.; Paradis, V.; Seraphin, T.P.; Kather, J.N. Artificial intelligence in liver diseases: Improving diagnostics, prognostics and response prediction. J. Hepatol. 2022, 4, 100443. [Google Scholar] [CrossRef]
  10. Sawhney, R.; Malik, A.; Sharma, S.; Narayan, V. A comparative assessment of artificial intelligence models used for early prediction and evaluation of chronic kidney disease. Decis. Anal. J. 2023, 6, 100169. [Google Scholar] [CrossRef]
  11. National Cancer Institute. Available online: https://www.cancer.gov/ (accessed on 2 December 2024).
  12. American Heart Association, Classes of Heart Failure, American Heart Association. Available online: https://www.heart.org/en/health-topics/heart-failure/what-is-heart-failure/classes-of-heart-failure (accessed on 1 December 2024).
  13. World Health Organization, Cardiovascular Diseases, WHO. Available online: https://www.who.int/health-topics/cardiovascular-diseases/#tab=tab_1 (accessed on 1 December 2024).
  14. World Health Organization, Diabetes, WHO. 2024. Available online: https://www.who.int/news-room/fact-sheets/detail/diabetes (accessed on 2 December 2024).
  15. Onakpojeruo, E.P.; Mustapha, M.T.; Ozsahin, D.U.; Ozsahin, I. A Comparative Analysis of the Novel Conditional Deep Convolutional Neural Network Model, Using Conditional Deep Convolutional Generative Adversarial Network-Generated Synthetic and Augmented Brain Tumor Datasets for Image Classification. Brain Sci. 2024, 14, 559. [Google Scholar] [CrossRef]
  16. Alanazi, A.; Aldakhil, L.; Aldhoayan, M.; Aldosari, B. Machine Learning for Early Prediction of Sepsis in Intensive Care Unit (ICU) Patients. Medicina 2023, 59, 1276. [Google Scholar] [CrossRef] [PubMed]
  17. Ganatra, H.A.; Latifi, S.Q.; Baloglu, O. Pediatric Intensive Care Unit Length of Stay Prediction by Machine Learning. Bioengineering 2024, 11, 962. [Google Scholar] [CrossRef] [PubMed]
  18. Bionic Pancreas Research Group. Multicenter, Randomized Trial of a Bionic Pancreas in Type 1 Diabetes. N. Engl. J. Med. 2022, 387, 1161–1172. [Google Scholar] [CrossRef] [PubMed]
  19. Abràmoff, M.D.; Lavin, P.T.; Birch, M.; Shah, N.; Folk, J.C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit. Med. 2018, 1, 39. [Google Scholar] [CrossRef]
  20. Ahsan, M.M.; Luna, S.A.; Siddique, Z. Machine-Learning-Based Disease Diagnosis: A Comprehensive Review. Healthcare 2022, 10, 541. [Google Scholar] [CrossRef]
  21. Alomari, E.A. Unlocking the Potential: A Comprehensive Systematic Review of ChatGPT in Natural Language Processing Tasks. CMES-Comput. Model. Eng. Sci. 2024, 141, 43–85. [Google Scholar] [CrossRef]
  22. Caruccio, L.; Cirillo, S.; Polese, G.; Solimando, G.; Sundaramurthy, S.; Tortora, G. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst. Appl. 2024, 235, 121186. [Google Scholar] [CrossRef]
  23. Gollapalli, M.; Alansari, A.; Alkhorasani, H.; Alsubaii, M.; Sakloua, R.; Alzahrani, R.; Al-hariri, M.; Alfares, M.; Alkhafaji, D.; Al, R.; et al. A novel stacking ensemble for detecting three types of diabetes mellitus using a Saudi Arabian dataset: Pre-diabetes, T1DM, and T2DM. Comput. Biol. Med. 2022, 147, 105757. [Google Scholar] [CrossRef]
  24. Ahmad, H.F.; Mukhtar, H.; Alaqail, H.; Seliaman, M.; Alhumam, A. Investigating Health-Related Features and Their Impact on the Prediction of Diabetes Using Machine Learning. Appl. Sci. 2021, 11, 1173. [Google Scholar] [CrossRef]
  25. Qteat, H.; Awad, M. Using Hybrid Model of Particle Swarm Optimization and Multi-Layer Perceptron Neural Networks for Classification of Diabetes. Int. J. Intell. Eng. Syst. 2021, 14, 10–22. [Google Scholar] [CrossRef]
  26. Fitriyani, N.L.; Syafrudin, M. Development of Disease Prediction Model Based on Ensemble Learning Approach for Diabetes and Hypertension. IEEE Access 2019, 7, 144777–144789. [Google Scholar] [CrossRef]
  27. Ali, A.; Alrubei, M.A.; Hassan, L.F.M.; Al-Ja’afari, M.A.; Abdulwahed, S.H. Diabetes Classification Based on KNN. IIUM Eng. J. 2020, 21, 175–181. [Google Scholar] [CrossRef]
  28. Almutairi, E.S.; Abbod, M.F. Machine Learning Methods for Diabetes Prevalence Classification in Saudi Arabia. Modelling 2023, 4, 37–55. [Google Scholar] [CrossRef]
  29. Alsulami, N.; Almasre, M.; Sarhan, S.; Alsaggaf, W. Deep Learning Models for Type 2 Diabetes Detection in Saudi Arabia. J. Pioneer. Med. Sci. 2024, 13, 60–73. [Google Scholar] [CrossRef]
  30. Khan, A.H.S.T. Machine learning-based application for predicting risk of type 2 diabetes mellitus (t2dm) in saudi arabia: A retrospective cross-sectional study. IEEE Access 2020, 8, 199539–199561. [Google Scholar]
  31. Mahesh, T.R.; Kumar, D.; Vinoth Kumar, V.; Asghar, J.; Mekcha Bazezew, B.; Natarajan, R.; Vivek, V. Blended Ensemble Learning Prediction Model for Strengthening Diagnosis and Treatment of Chronic Diabetes Disease. Comput. Intell. Neurosci. 2022, 2022, 4451792. [Google Scholar] [CrossRef]
  32. Patil, R.; Tamane, S. A Comparative Analysis on the Evaluation of Classification Algorithms in the Prediction of Diabetes. IJECE 2018, 8, 3966–3975. [Google Scholar] [CrossRef]
  33. Wang, Q.; Cao, W.; Guo, J.; Ren, J.; Cheng, Y.; Davis, D.N. DMP_MI: An Effective Diabetes Mellitus Classification Algorithm on Imbalanced Data with Missing Values. IEEE Access 2019, 7, 102232–102238. [Google Scholar] [CrossRef]
  34. Devi, R.D.H.; Bai, A.; Nagarajan, N.J.O.M. A novel hybrid approach for diagnosing diabetes mellitus using farthest first and support vector machine algorithms. Obes. Med. 2020, 17, 100152. [Google Scholar] [CrossRef]
  35. Kaur, P.; Kaur, R. Comparative Analysis of Classification Techniques for Diagnosis of Diabetes. In Advances in Bioinformatics, Multimedia, and Electronics Circuits and Signals; Springer: Singapore, 2020. [Google Scholar]
  36. Joshi, R.D.; Dhakal, C.K. Predicting Type 2 Diabetes Using Logistic Regression and Machine Learning Approaches. Int. J. Environ. Res. Public Health 2021, 18, 7346. [Google Scholar] [CrossRef]
  37. Sivaranjani, S.; Ananya, S.; Aravinth, J.; Karthika, R. Diabetes Prediction using Machine Learning Algorithms with Feature Selection and Dimensionality Reduction. In Proceedings of the 2021 7th international conference on advanced computing and communication systems (ICACCS), Coimbatore, India, 19–20 March 2021. [Google Scholar]
  38. Abdulhadi, N.; Al-Mousa, A. Diabetes Detection Using Machine Learning Classification Methods. In Proceedings of the 2021 International Conference on Information Technology, Amman, Jordan, 14–15 July 2021. [Google Scholar]
  39. Kumari, S.; Kumar, D.; Mittal, M. International Journal of Cognitive Computing in Engineering An ensemble approach for classification and prediction of diabetes mellitus using soft voting classifier. Int. J. Cogn. Comput. Eng. 2021, 2, 40–46. [Google Scholar] [CrossRef]
  40. Kalagotla, S.K.; Gangashetty, S.V.; Giridhar, K. A novel stacking technique for prediction of diabetes. Comput. Biol. Med. 2021, 135, 104554. [Google Scholar] [CrossRef] [PubMed]
  41. Bukhari, M.M.; Alkhamees, B.F.; Hussain, S.; Gumaei, A.; Assiri, A.; Ullah, S.S. An Improved Artificial Neural Network Model for Effective Diabetes Prediction. Complexity 2021, 2021, 5525271. [Google Scholar] [CrossRef]
  42. Al Reshan, M.S.; Amin, S.; Zeb, M.A.; Sulaiman, A.; Alshahrani, H.; Shaikh, A.; Elmagzoub, M.A. An Innovative Ensemble Deep Learning Clinical Decision Support System for Diabetes Prediction. IEEE Access 2024, 12, 106193–106210. [Google Scholar] [CrossRef]
  43. Type 2 Diabetes Dataset|IEEE DataPort. Available online: https://ieee-dataport.org/documents/type-2-diabetes-dataset (accessed on 14 February 2025).
  44. Dambra, V.; Roccotelli, M.; Fanti, M.P. Diabetic Disease Detection using Machine Learning Techniques. In Proceedings of the 2024 10th International Conference on Control, Decision and Information Technologies, Valetta, Malta, 1–4 July 2024; pp. 1436–1441. [Google Scholar] [CrossRef]
  45. Shrestha, D. Comparative Analysis of Machine Learning Algorithms for Heart Disease Prediction. Adv. Transdiscipl. Eng. 2022, 27, 64–69. [Google Scholar] [CrossRef]
  46. Alfadli, K.M.; Almagrabi, A.O. Feature-Limited Prediction on the UCI Heart Disease Dataset. Comput. Mater. Contin. 2023, 74, 5871–5883. [Google Scholar] [CrossRef]
  47. Anderies, A.; Tchin, J.A.R.W.; Putro, P.H.; Darmawan, Y.P.; Gunawan, A.A.S. Prediction of Heart Disease UCI Dataset Using Machine Learning Algorithms. Eng. Math. Comput. Sci. J. 2022, 4, 87–93. [Google Scholar] [CrossRef]
  48. Bharti, R.; Khamparia, A.; Shabaz, M.; Dhiman, G.; Pande, S.; Singh, P. Prediction of Heart Disease Using a Combination of Machine Learning and Deep Learning. Comput. Intell. Neurosci. 2021, 2021, 8387680. [Google Scholar] [CrossRef]
  49. Asif, D.; Bibi, M.; Arif, M.S. Enhancing Heart Disease Prediction through Ensemble Learning Techniques with Hyperparameter Optimization. Algorithms 2023, 16, 308. [Google Scholar] [CrossRef]
  50. Noroozi, Z.; Orooji, A.; Erfannia, L. Analyzing the Impact of Feature Selection Methods on Machine Learning Algorithms for Heart Disease Prediction; Nature Publishing Group: London, UK, 2023. [Google Scholar] [CrossRef]
  51. Chandrasekhar, N.; Peddakrishna, S. Enhancing Heart Disease Prediction Accuracy through Machine Learning Techniques and Optimization. Processes 2023, 11, 1210. [Google Scholar] [CrossRef]
  52. Korial, A.E.; Gorial, I.I.; Humaidi, A.J. An Improved Ensemble-Based Cardiovascular Disease Detection System with Chi-Square Feature Selection. Computers 2024, 13, 126. [Google Scholar] [CrossRef]
  53. “pandas 2.2.3,” NumFOCUS. Available online: https://pandas.pydata.org/ (accessed on 12 October 2024).
  54. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef] [PubMed]
  55. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Duchesnay, É. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar] [CrossRef]
  56. Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
  57. Pima Indians Diabetes Database. National Institute of Diabetes and Digestive and Kidney Diseases. 2024. Available online: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (accessed on 20 August 2024).
  58. UCI Heart Disease Dataset. Available online: https://archive.ics.uci.edu/dataset/45/heart+disease (accessed on 28 August 2024).
  59. Amiri, F.; Rezaei Yousefi, M.; Lucas, C.; Shakery, A.; Yazdani, N. Mutual information-based feature selection for intrusion detection systems. J. Netw. Comput. Appl. 2011, 34, 1184–1199. [Google Scholar] [CrossRef]
  60. Cai, Y.; Ding, Z.; Yang, B.; Peng, Z.; Wang, W. Zero-Shot Learning Through Cross-Modal Transfer. Phys. A Stat. Mech. Its Appl. 2015, 514, 729–740. [Google Scholar] [CrossRef]
  61. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a Few Examples: A Survey on Few-shot Learning. ACM Comput. Surv. 2020, 53, 1–34. [Google Scholar] [CrossRef]
  62. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  63. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.H.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  64. Alrashid, A. “Diabetes Dataset,” Mendeley Data, 2020. Available online: https://data.mendeley.com/datasets/wj9rwkp9c2/1 (accessed on 22 May 2025).
  65. Ben Shoham, O.; Rappoport, N. CPLLM: Clinical prediction with large language models. PLOS Digit. Health 2023, 4, e0000680. [Google Scholar]
  66. Rasmy, L. Med-BERT: Pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit. Med. 2021, 4, 86. [Google Scholar] [CrossRef] [PubMed]
  67. Wang, H.; Kwan, W.; Wong, K.; Zheng, Y. CoAD: Automatic Diagnosis through Symptom and Disease Collaborative Generation. arXiv 2023, arXiv:2307.08290. [Google Scholar]
  68. Lu, H. Can Large Language Models Enhance Predictions of Disease Progression? Investigating Through Disease Network Link Prediction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 17703–17715. [Google Scholar]
  69. Ferrara, E. Should ChatGPT be biased? Challenges and risks of bias in large language models. First Monday 2023, 28, 11. [Google Scholar] [CrossRef]
  70. Micheletti, N.; Marchesi, R.; Kuo, N.I.-H.; Barbieri, S.; Jurman, G.; Osmani, V. Generative AI Mitigates Representation Bias Using Synthetic Health Data. PLOS Comput. Biol. 2025, 21, e1013080. [Google Scholar]
  71. Walonoski, J.; Kramer, M.; Nichols, J.; Quina, A.; Moesel, C.; Hall, D.; Duffett, C.; Dube, K.; Gallagher, T.; Mclachlan, S. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Informatics Assoc. 2017, 25, 230–238. [Google Scholar] [CrossRef]
  72. Nishio, M.; Fujimoto, K.; Rinaldi, F.; Matsuo, H.; Rohanian, M.; Krauthammer, M.; Matsunaga, T.; Nooralahzadeh, F. Zero-shot classification of TNM staging for Japanese radiology report using ChatGPT at RR-TNM subtask of NTCIR-17 MedNLP-SC. In Proceedings of the 17th NTCIR Conference on Evaluation of Information Access Technologies, Tokyo, Japan, 12–15 December 2023. [Google Scholar]
Figure 1. Proposed workflow for utilizing ChatGPT in disease prediction.
Figure 2. Distribution of patients across categories in the UCI Cleveland dataset.
Figure 3. Correlation between features in the PIDD dataset.
Figure 4. Relationship between ‘Outcome’ and other features in the PIDD dataset.
Figure 5. Mutual information scores for PIDD disease.
Figure 6. Correlation between features in the FHDD.
Figure 7. Relationship between ‘Outcome’ and other features in the FHDD.
Figure 8. Mutual information scores for FHDD.
Figure 9. Correlation between features in the UCI Cleveland dataset.
Figure 10. Correlation-based feature selection: relationship between ‘Num’ and other features in UCI heart disease.
Figure 11. Mutual information scores for UCI heart disease.
Figure 12. Experimental design based on multiple learning strategies.
Figure 13. Proposed structure for Knowledge-Enhanced CoT prompt messages: (a) Example for zero-shot KE-CoT prompt for heart disease, (b) example for few-shot KE-CoT with example reasoning prompt for diabetes disease.
Figure 14. Zero-shot prompt for diabetes disease prediction.
Figure 15. Zero-shot prompts for heart disease prediction. (a) Uses original feature names and values as in the dataset; (b) Replaces feature names with their full descriptive labels; (c) Transforms numerical values into text representing the corresponding category; (d) Combines both column name explanation (b) and data transformation (c).
Figure 16. Precision–Recall Area Under the Curve for diabetes prediction using the best-performing approach (5-shot GPT-4o): (a) PIDD, (b) FHDD dataset.
Figure 17. Precision–Recall Area Under the Curve for heart disease prediction using the best-performing approach (5-shot GPT-4o with CBFS and optimized prompts).
Figure 18. Performance comparison of the proposed method (GPT-4o using 5-shot with CBFS) for diabetes prediction, alongside the best models from the literature (Alreshan et al. [42] and Bukhari et al. [41]), on the PIDD dataset. The black bars above each bar represent error bars, indicating the 95% confidence interval.
Figure 19. Performance comparison of the proposed method (GPT-4o using 5-shot) for diabetes prediction, alongside the best models from the literature (Alreshan et al. [42] and Dambra et al. [44]), on the FHDD dataset. The black bars above each bar represent error bars, indicating the 95% confidence interval.
Figure 20. Performance comparison of the proposed method (GPT-4o using 5-shot with CBFS and prompt engineering) for heart disease prediction, alongside the best models from the literature (Bharti et al. [48], Asif et al. [49] and Chandrasekhar and Peddakrishna [51]), on the Heart Cleveland UCI dataset. The black bars above each bar represent error bars, indicating the 95% confidence interval.
Figure 21. Suggested AI-assisted workflow.
Table 1. PIDD features description.

Feature Name | Description | Data Type
Pregnancies | Number of pregnancies | Integer
Glucose | Plasma glucose concentration | Integer
BloodPressure | Diastolic blood pressure | Integer
SkinThickness | Triceps skin fold thickness | Integer
Insulin | 2-H serum insulin | Integer
BMI | Body mass index | Float
DiabetesPedigreeFunction | A function that scores the likelihood of diabetes based on family history | Float
Age | Age of the patient | Integer
Outcome | Diabetes outcome (0 = No-T2DM, 1 = T2DM) | Integer
Table 2. Sample of entries from diabetes dataset, Frankfurt Hospital.

Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome
0 | 135 | 68 | 42 | 250 | 42.3 | 0.365 | 24 | 1
12 | 84 | 72 | 31 | 0 | 29.7 | 0.297 | 46 | 1
0 | 173 | 78 | 32 | 265 | 46.5 | 1.159 | 58 | 0
4 | 99 | 72 | 17 | 0 | 25.6 | 0.294 | 28 | 0
Table 3. Description of UCI heart dataset features.

Feature Name | Full Name | Description | Data Type
age | Age | The age of the patient in years. | Integer
sex | Sex | The sex of the patient, where 1 = male and 0 = female. | Categorical
cp | Chest Pain Type | The type of chest pain, where 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic. | Categorical
trestbps | Resting Blood Pressure | The resting blood pressure of the patient in mm Hg. | Integer
chol | Serum Cholesterol Level | The serum cholesterol level of the patient in mg/dl. | Integer
fbs | Fasting Blood Sugar | A binary feature (1 or 0), where 1 indicates the patient’s fasting blood sugar is greater than 120 mg/dl. | Categorical
restecg | Resting Electrocardiographic Result | The electrocardiographic result at rest, where 0 = normal, 1 = ST-T wave abnormality, 2 = left ventricular hypertrophy. | Categorical
thalach | Maximum Heart Rate Achieved | The maximum heart rate achieved by the patient during exercise. | Float
exang | Exercise-Induced Angina | A binary feature (1 or 0), where 1 indicates the patient experienced exercise-induced angina. | Categorical
oldpeak | ST Depression Induced by Exercise | The ST depression induced by exercise relative to rest. | Float
slope | Slope of the Peak Exercise ST Segment | The slope of the peak exercise ST segment, where 1 = upsloping, 2 = flat, 3 = downsloping. | Categorical
ca | Number of Major Vessels | The number of major vessels (0–3) colored by fluoroscopy. Higher values indicate more significant coronary artery disease. | Integer
thal | Thalassemia | The type of thalassemia, where 3 = normal, 6 = fixed defect, 7 = reversible defect. | Categorical
num | Diagnosis of Heart Disease | The outcome variable, where 0 = no heart disease, 1–4 = presence of heart disease. | Integer
Table 4. Evaluation results for diabetes disease prediction using zero-shot and few-shot learning with ChatGPT (GPT-4o, GPT-4o-mini), incorporating two feature selection methods: correlation-based selection (CBFS) and mutual information (MIFS). Each metric is reported as GPT-4o-mini / GPT-4o.

Learning Method | Feature Selection Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Zero-shot | None | 45.83 / 75.00 | 39.18 / 62.75 | 100.00 / 69.78 | 56.30 / 66.08
Zero-shot | MIFS | 39.97 / 75.52 | 36.76 / 62.99 | 100.00 / 72.39 | 53.76 / 67.36
Zero-shot | CBFS | 45.96 / 75.39 | 39.24 / 63.12 | 100.00 / 70.90 | 56.36 / 66.78
3-shot | None | 66.53 / 76.69 | 51.22 / 67.45 | 85.82 / 64.18 | 64.16 / 65.77
3-shot | MIFS | 66.92 / 76.30 | 51.55 / 67.48 | 86.94 / 61.94 | 64.72 / 61.94
3-shot | CBFS | 67.84 / 76.56 | 52.42 / 67.46 | 84.70 / 63.43 | 64.76 / 65.38
5-shot | None | 67.83 / 76.69 | 52.44 / 68.94 | 84.33 / 60.45 | 64.66 / 64.41
5-shot | MIFS | 69.01 / 76.43 | 53.61 / 69.51 | 83.21 / 57.84 | 65.20 / 63.14
5-shot | CBFS | 67.70 / 77.08 | 52.29 / 70.00 | 85.07 / 60.07 | 64.77 / 64.66
10-shot | None | 69.53 / 75.78 | 54.29 / 64.64 | 80.22 / 67.54 | 64.76 / 66.06
10-shot | MIFS | 69.14 / 74.87 | 53.90 / 62.46 | 79.85 / 70.15 | 64.36 / 66.08
10-shot | CBFS | 68.75 / 73.44 | 53.57 / 60.46 | 78.36 / 69.03 | 63.64 / 64.46
Table 5. Evaluation results for diabetes disease prediction using CoT learning with ChatGPT (GPT-4o) employing correlation-based feature selection.

Learning Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Zero-shot CoT | 70.96 | 55.67 | 82.46 | 66.47
KE Zero-shot CoT | 73.04 | 59.16 | 73.51 | 65.56
3-shot CoT | 72.53 | 58.82 | 70.90 | 64.30
3-shot KE-CoT-ExR | 73.69 | 60.31 | 72.01 | 65.65
3-shot KE-CoT | 73.40 | 59.40 | 74.25 | 66.00
5-shot KE-CoT | 73.70 | 59.27 | 78.73 | 67.63
Table 6. Evaluation results for diabetes disease prediction (from FHDD) using zero-shot and few-shot learning with ChatGPT (GPT-4o, GPT-4o-mini) incorporating correlation-based selection (CBFS). Each metric is reported as GPT-4o-mini / GPT-4o.

Learning Method | Feature Selection Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Zero-shot | None | 41.55 / 74.70 | 36.91 / 60.85 | 100.00 / 72.95 | 53.92 / 66.36
Zero-shot | CBFS | 44.75 / 75.20 | 38.11 / 61.90 | 98.68 / 71.49 | 54.99 / 66.35
3-shot | None | 61.55 / 74.85 | 46.90 / 60.58 | 94.15 / 75.73 | 62.62 / 67.32
3-shot | CBFS | 59.59 / 74.70 | 45.63 / 60.55 | 94.74 / 74.71 | 61.60 / 66.88
5-shot | None | 66.64 / 75.85 | 50.71 / 63.00 | 88.89 / 71.20 | 64.58 / 66.85
5-shot | CBFS | 66.14 / 75.65 | 50.29 / 62.33 | 87.28 / 72.81 | 63.82 / 67.16
10-shot | None | 66.70 / 75.14 | 50.77 / 61.67 | 86.55 / 72.22 | 64.00 / 66.53
10-shot | CBFS | 69.19 / 75.50 | 53.15 / 62.28 | 83.77 / 71.93 | 65.04 / 66.76
Table 7. Evaluation results for diabetes disease prediction (from FHDD) using CoT learning with ChatGPT (GPT-4o) employing correlation-based feature selection.

Learning Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Zero-shot CoT | 68.85 | 52.79 | 84.36 | 64.94
Zero-shot KE-CoT | 72.30 | 57.34 | 74.27 | 64.71
3-shot CoT | 68.75 | 52.59 | 87.43 | 65.68
3-shot KE-CoT-ExR | 72.35 | 57.05 | 77.49 | 65.72
3-shot KE-CoT | 72.35 | 56.96 | 78.36 | 65.97
5-shot KE-CoT | 72.10 | 56.54 | 79.68 | 66.14
Table 8. Evaluation results for heart disease prediction testing different prompt formats using ChatGPT with zero-shot strategy and without applying any feature selection method. Each metric is reported as GPT-4o-mini / GPT-4o.

Prompt | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
1 | 52.19 / 75.08 | 49.09 / 69.09 | 98.54 / 83.21 | 65.53 / 75.50
2 | 55.22 / 73.74 | 50.76 / 66.12 | 97.81 / 88.32 | 66.83 / 75.62
3 | 58.92 / 78.11 | 53.09 / 72.22 | 94.16 / 85.40 | 67.89 / 78.26
4 | 55.22 / 80.47 | 50.77 / 76.51 | 96.35 / 83.21 | 66.50 / 79.72
Table 9. Evaluation results for heart disease prediction using zero-shot and few-shot learning with ChatGPT, incorporating two feature selection methods: correlation-based selection (CBFS) and mutual information (MIFS). Each metric is reported as GPT-4o-mini / GPT-4o.

Learning Method | Feature Selection Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Zero-shot | None | 55.22 / 80.47 | 50.77 / 76.51 | 96.35 / 83.21 | 66.50 / 79.72
Zero-shot | MIFS | 64.65 / 81.14 | 57.69 / 77.18 | 87.59 / 83.94 | 69.57 / 80.42
Zero-shot | CBFS | 71.04 / 83.84 | 64.57 / 80.69 | 82.48 / 85.40 | 72.44 / 82.98
3-shot | None | 69.36 / 83.50 | 64.38 / 81.88 | 75.18 / 82.48 | 69.36 / 82.18
3-shot | MIFS | 73.74 / 83.84 | 66.86 / 82.96 | 85.40 / 81.75 | 75.00 / 82.35
3-shot | CBFS | 75.76 / 85.19 | 76.42 / 85.50 | 68.61 / 81.75 | 72.31 / 83.58
5-shot | None | 74.07 / 82.49 | 69.48 / 81.02 | 78.10 / 81.02 | 73.54 / 81.02
5-shot | MIFS | 74.41 / 82.83 | 69.43 / 86.44 | 69.43 / 74.45 | 74.15 / 80.00
5-shot | CBFS | 78.45 / 85.52 | 79.67 / 87.30 | 71.53 / 80.29 | 75.38 / 83.65
10-shot | None | 74.07 / 82.83 | 69.23 / 84.68 | 78.83 / 76.64 | 73.72 / 80.46
10-shot | MIFS | 72.73 / 82.15 | 66.28 / 85.59 | 83.21 / 73.72 | 73.79 / 79.22
10-shot | CBFS | 77.44 / 84.18 | 76.12 / 85.71 | 74.45 / 78.83 | 75.28 / 82.13
Table 10. Evaluation results of heart disease prediction using ChatGPT (GPT-4o) with CoT learning and correlation-based feature selection.

Learning Method | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Zero-shot CoT | 75.76 | 67.57 | 91.24 | 77.64
Zero-shot KE-CoT | 65.66 | 57.58 | 97.08 | 72.28
3-shot CoT | 76.77 | 70.73 | 84.67 | 77.08
3-shot KE-CoT-ExR | 72.05 | 64.21 | 89.05 | 74.62
3-shot KE-CoT | 76.90 | 73.24 | 75.91 | 74.55
5-shot KE-CoT | 72.05 | 63.64 | 91.97 | 75.22
Table 11. Sensitivity analysis to evaluate 6-shot against 5-shot for the optimal configurations. The change relative to 5-shot is shown in parentheses after each metric.

Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
Pima Indian Diabetes Dataset | 76.56 (−0.52) | 67.19 (−2.81) | 64.18 (+4.11) | 65.65 (−0.66)
Frankfurt Hospital Diabetes Dataset | 75.05 (−0.80) | 61.52 (−1.48) | 72.22 (+1.02) | 66.44 (−0.41)
Cleveland UCI Heart Disease Dataset | 85.19 (−0.33) | 86.61 (−0.69) | 80.29 (0) | 83.33 (−0.32)
Table 12. The selected ML/DL-based studies for performance comparison.

Disease | Dataset | Ref. | Approach
Diabetes Disease | Pima Indian Diabetes Dataset | Bukhari et al. [41] | Deep learning (ABP-SCGNN)
Diabetes Disease | Pima Indian Diabetes Dataset | Alreshan et al. [42] | Deep learning (stack-ANN)
Diabetes Disease | Frankfurt Hospital Diabetes Dataset | Alreshan et al. [42] | Deep learning (stack-ANN)
Diabetes Disease | Frankfurt Hospital Diabetes Dataset | Dambra et al. [44] | Random forest
Heart Disease | Cleveland UCI Heart Disease Dataset | Asif et al. [49] | Extra tree classifier
Heart Disease | Cleveland UCI Heart Disease Dataset | Bharti et al. [48] | Deep learning (ANN)
Heart Disease | Cleveland UCI Heart Disease Dataset | Chandrasekhar and Peddakrishna [51] | Soft voting ensemble method with feature scaling to address outliers
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.