Article

ChatCVD: A Retrieval-Augmented Chatbot for Personalized Cardiovascular Risk Assessment with a Comparison of Medical-Specific and General-Purpose LLMs

Wafa Lakhdhar, Maryam Arabi, Ahmed Ibrahim, Abdulrahman Arabi and Ahmed Serag
1 AI Innovation Lab, Weill Cornell Medicine (WCM-Q), Education City, Doha P.O. Box 24144, Qatar
2 Qatar Heart Hospital, Hamad Medical Corporation (HMC), Doha P.O. Box 24144, Qatar
* Author to whom correspondence should be addressed.
AI 2025, 6(8), 163; https://doi.org/10.3390/ai6080163
Submission received: 3 May 2025 / Revised: 6 June 2025 / Accepted: 16 June 2025 / Published: 22 July 2025

Abstract

Large language models (LLMs) are increasingly being applied to clinical tasks, but it remains unclear whether medical-specific models consistently outperform smaller, general-purpose ones. This study investigates that assumption in the context of cardiovascular disease (CVD) risk assessment. We fine-tuned eight LLMs—both general-purpose and medical-specific—using textualized data from the Behavioral Risk Factor Surveillance System (BRFSS) to classify individuals as “High Risk” or “Low Risk”. To provide actionable insights, we integrated a Retrieval-Augmented Generation (RAG) framework for personalized recommendation generation and deployed the system within an interactive chatbot interface. Notably, Gemma2, a compact 2B-parameter general-purpose model, achieved a high recall (0.907) and F1-score (0.770), performing on par with larger or medical-specialized models such as Med42 and BioBERT. These findings challenge the common assumption that larger or specialized models always yield superior results, and highlight the potential of lightweight, efficiently fine-tuned LLMs for clinical decision support—especially in resource-constrained settings. Overall, our results demonstrate that general-purpose models, when fine-tuned appropriately, can offer interpretable, high-performing, and accessible solutions for CVD risk assessment and personalized healthcare delivery.

Graphical Abstract

1. Introduction

Cardiovascular disease (CVD) remains a leading cause of morbidity and mortality worldwide; in 2021 alone, CVDs accounted for 20.5 million deaths, comprising approximately one-third of all global deaths [1,2]. Early and accurate CVD risk assessment is crucial for effective intervention and the prevention of adverse outcomes [3,4,5]. Consequently, systems that can assess risk and provide personalized recommendations are essential for empowering patients and improving health outcomes.
Artificial intelligence (AI), particularly generative AI and large language models (LLMs), is transforming healthcare by improving diagnosis, treatment, and patient care [6,7,8,9,10,11,12,13]. While medical-specific LLMs have garnered significant attention for such healthcare applications [14,15], it remains unclear whether they offer a substantial advantage over general-purpose models in specific tasks like CVD risk assessment. Understanding this distinction is critical, as it informs resource allocation and model selection for AI-driven healthcare solutions.
This study addresses these challenges by leveraging and comparing both medical-specific and general-purpose LLMs for CVD risk assessment. We fine-tuned eight LLMs on the Behavioral Risk Factor Surveillance System (BRFSS) dataset [16,17] to classify individuals as “High Risk” or “Low Risk”, with a particular focus on maximizing recall to minimize missed high-risk cases. Our approach involves transforming numerical health data into textual profiles for effective LLM interpretation and analysis. Furthermore, we implemented a Retrieval-Augmented Generation (RAG) framework [18] to generate personalized lifestyle and healthcare recommendations. These recommendations are integrated into a user-friendly chatbot application, enhancing the accessibility of reliable CVD information and promoting proactive cardiovascular health management.
The overall workflow of our CVD risk assessment system is illustrated in Figure 1. This research is among the first to directly compare medical and general LLMs for CVD risk assessment, and to generate personalized, accessible recommendations through a chatbot interface, thereby aiming to bridge the gap between research findings and practical application.
The main contributions of this work are as follows:
  • Comparative Evaluation of LLMs: We present a direct comparison of eight fine-tuned LLMs (medical-specific and general-purpose) for CVD risk classification, with an emphasis on recall for identifying high-risk individuals.
  • Textualization of Health Data: We propose a method to convert structured numerical health data into narrative-style textual profiles to enhance LLM interpretability.
  • RAG-Based Personalized Recommendations: We develop a RAG framework to generate actionable recommendations aligned with authoritative medical guidelines.
  • ChatCVD—A User-Centered Interface: We introduce ChatCVD, a chatbot that makes complex risk assessments and personalized health advice accessible to users in natural language.
  • From Research to Application: This study demonstrates a pathway to translate LLM-based AI advances into practical, accessible tools for real-world healthcare use.

2. Related Work

CVD risk assessment has increasingly benefited from advances in machine learning (ML), including recent developments in LLMs. Numerous studies have leveraged large-scale datasets to develop predictive models aimed at identifying high-risk individuals and informing preventive strategies. For example, Xian et al. [19] introduced a coronary heart disease risk prediction method using classical ML models and integrated GPT-3.5 to provide personalized health advice based on user inputs. Similarly, Akther et al. [20] evaluated eleven ML and deep learning models across four datasets, showing that dimensionality reduction techniques such as principal component analysis (PCA) improved predictive performance.
Beyond structured data modeling, text classification techniques have gained prominence in healthcare for extracting insights from unstructured clinical text. Goel [21] demonstrated the integration of vector databases with LLMs to improve factual consistency, while Gundabathula and Kolar [22] leveraged prompt-based strategies to detect errors in clinical notes. In risk stratification tasks, Acharya et al. [23] fine-tuned the LLaMA 2-7B model on structured electronic health records (EHRs), and McInerney et al. [24] proposed a neural additive model leveraging LLMs for individualized risk estimation.
While these studies demonstrate the promise of both ML and LLMs in healthcare, several limitations remain. Most ML-based approaches rely solely on numerical inputs without leveraging the natural language capabilities of modern LLMs. Meanwhile, existing LLM-based methods often focus on unstructured text or symptom classification but rarely address structured datasets like BRFSS for clinical prediction tasks. Furthermore, few studies compare medical-specific and general-purpose LLMs to evaluate their relative effectiveness in structured clinical scenarios. Another notable gap is the lack of systems that move beyond risk classification to generate personalized, evidence-based health guidance using generative capabilities such as RAG.
Our study aims to address these limitations by transforming structured health data into narrative-style inputs, fine-tuning and comparing eight LLMs (medical and general-purpose) for CVD risk classification, and integrating the best-performing model into a RAG-enabled chatbot that delivers personalized lifestyle recommendations aligned with clinical guidelines. Table 1 summarizes related work by outlining their methods, outcomes, and limitations, thereby motivating the contributions of our study.

3. Methods

3.1. Dataset

In this work, we use the BRFSS dataset [16,17], a large-scale health-related telephone survey conducted by the Centers for Disease Control and Prevention (CDC). It comprises 441,456 records and 330 features, including demographic details, health behaviors, chronic conditions, and medical history. The target variable is CVD risk, defined by the presence or absence of myocardial infarction (MI) or coronary heart disease (CHD).

3.2. Data Preprocessing

Data preprocessing is crucial for preparing the BRFSS dataset for analysis and model fine-tuning, ensuring its quality and suitability for LLMs. This process involves four key steps: (1) Feature Selection, (2) Data Cleaning, (3) Class Imbalance Handling, and (4) Data Textualization Descriptions.

3.2.1. Feature Selection

To ensure clinical relevance, we collaborated with two medical experts (A.A. and M.A.) to identify variables strongly associated with cardiovascular health. Key features, such as BMI, smoking history, and diabetes status, were selected in alignment with established clinical guidelines for CVD risk prediction and are detailed in Table 2.

3.2.2. Data Cleaning

To ensure data integrity, we applied a straightforward cleaning procedure by removing records with missing values and eliminating duplicates to preserve the uniqueness of each observation. This resulted in a final dataset of 343,610 records.
To mitigate the impact of potential outliers, we trimmed continuous variables by removing values below the 1st percentile and above the 99th percentile. While this approach slightly reduced the sample size, it helped minimize the influence of extreme values on the analysis.
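A minimal pandas sketch of this trimming step is shown below; the DataFrame and column names are illustrative placeholders, and the actual continuous variables follow the BRFSS codebook.

```python
import pandas as pd

def trim_outliers(df: pd.DataFrame, continuous_cols: list[str],
                  lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Keep only rows whose continuous values lie within the 1st-99th percentile range."""
    mask = pd.Series(True, index=df.index)
    for col in continuous_cols:
        lo, hi = df[col].quantile(lower), df[col].quantile(upper)
        mask &= df[col].between(lo, hi)
    return df[mask]

# Hypothetical usage: drop missing values and duplicates first, then trim the continuous columns.
# cleaned = trim_outliers(raw.dropna().drop_duplicates(), continuous_cols=[...])
```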

3.2.3. Class Imbalance Handling

The dataset exhibited substantial class imbalance, with 308,947 records labeled as Class 0 (“Low Risk”), indicating no history of MI or CHD, and only 34,663 as Class 1 (“High Risk”). To mitigate bias toward the majority class during model training and hyperparameter tuning, we applied random undersampling (RUS) [25,26]. This resampling was performed exclusively on the training and validation sets.
As shown in Table 3, RUS effectively balanced the training and validation sets by randomly subsampling the majority ‘Low Risk’ class. Importantly, the held-out test set remained unaltered, preserving its natural class distribution—approximately 10% ‘High Risk’ cases. By maintaining the original test distribution, we enable a more realistic and clinically relevant assessment of model performance. This approach also retained sufficient data volume for robust testing. To enhance transparency, we include visualizations of key demographic distributions (age, sex, income, and education) across the original dataset, the post-RUS training/validation sets, and the unmodified test set in Appendix A (Figure A1, Figure A2 and Figure A3).
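The split-and-resample step can be sketched as follows using imbalanced-learn's RandomUnderSampler; the specific library is an assumption (any equivalent undersampling routine would do), X and y stand for the cleaned feature matrix and binary CVD label, and the 80/10/10 proportions follow Section 3.3.2.

```python
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler

# 80/10/10 stratified split; the held-out test set keeps its natural ~10% "High Risk" rate.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

rus = RandomUnderSampler(random_state=42)                 # balances classes 1:1 by default
X_train_bal, y_train_bal = rus.fit_resample(X_train, y_train)
X_val_bal, y_val_bal = rus.fit_resample(X_val, y_val)
# X_test / y_test are left unchanged.
```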

3.2.4. Data Textualization Descriptions

Numerical health data was transformed into textual profiles to leverage the natural language processing capabilities of LLMs. This involved mapping coded values to meaningful, human-readable descriptions for each feature, enhancing both interpretability and model comprehension. For example, “High Blood Pressure: Yes” was rewritten as “This individual has a history of high blood pressure.”
Each textual profile was paired with a corresponding CVD risk label (“High Risk” or “Low Risk”). To prompt the LLMs to make a risk prediction, we appended a guiding question to each profile—e.g., “This individual is a female aged 60 to 64. Considering these factors, what is the risk level for CVD?” (see Figure 2).
To improve model generalization and robustness, we also generated paraphrased variants of each profile. This structured approach enabled the LLMs to learn from diverse and contextually rich representations of patient data.
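The textualization step can be illustrated with a short sketch; the value maps and field names below are hypothetical stand-ins for the full BRFSS mappings of the features in Table 2, and the guiding question mirrors the example above.

```python
# Hypothetical value maps; the real mappings cover all features listed in Table 2.
SEX_MAP = {1: "male", 2: "female"}
HISTORY = {1: "has", 2: "does not have"}

def textualize(record: dict) -> str:
    """Convert one structured BRFSS record into a narrative health profile plus a guiding question."""
    parts = [
        f"This individual is a {SEX_MAP[record['sex']]} aged {record['age_group']}.",
        f"This individual {HISTORY[record['high_bp']]} a history of high blood pressure.",
        f"This individual {HISTORY[record['diabetes']]} a history of diabetes.",
    ]
    parts.append("Considering these factors, what is the risk level for CVD?")
    return " ".join(parts)

profile = textualize({"sex": 2, "age_group": "60 to 64", "high_bp": 1, "diabetes": 2})
label = "High Risk"   # paired target label derived from the MI/CHD indicator
```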

3.3. Model Development

This section outlines the development of a system for CVD risk assessment using the BRFSS dataset. It includes a comparative evaluation of medical-specific and general-purpose LLMs, identifies the most effective models, and explores the use of RAG to produce personalized health and lifestyle recommendations.

3.3.1. Medical-Specific vs. General-Purpose LLMs for CVD Risk Prediction

To investigate whether medical-specific LLMs offer advantages over their general-purpose counterparts in CVD risk prediction, we evaluated eight models spanning both categories. These models were selected to represent a range of architectures and domain specializations, enabling a comprehensive comparison of their performance on this clinical task.
Medical-Specific Models
  • BioBert [27]: a pre-trained language model designed specifically for biomedical text, leveraging a large corpus of literature to excel in domain-specific tasks, with 110 million parameters.
  • Meditron [28]: a 7-billion-parameter open medical model, adapted from LLaMA 2 through continued pretraining on curated medical corpora such as clinical guidelines and biomedical literature.
  • Med42 [29]: With 8 billion parameters, Med42 functions as a medical question-answering system, providing accurate responses based on an extensive biomedical knowledge base.
  • MedAlpaca [30]: a medical chatbot trained on healthcare conversations and built on the LLaMA 2 base; features 7 billion parameters and emphasizes contextual understanding in dialogue.
General-Purpose Models
  • Gemma2 [31]: a versatile LLM with 2 billion parameters, capable of text generation, translation, and content creation.
  • LLaMA2 [32]: a model with 7 billion parameters, designed for a wide range of natural language processing tasks.
  • LLaMA3 [33]: the successor to LLaMA2, featuring 8 billion parameters and enhanced reasoning capabilities and improved performance across NLP benchmarks.
  • Mistral [34]: a model with 7 billion parameters; focuses on improved reasoning and code generation, demonstrating robust general-purpose capabilities.

3.3.2. Fine-Tuning Approach

The workflow for fine-tuning LLMs using textualized BRFSS data is illustrated in Figure 3. Structured BRFSS survey entries are converted into textual health profiles and divided into training, validation, and test datasets. The text is tokenized and used to fine-tune a pre-trained LLM. The fine-tuned model is then evaluated using standard metrics—accuracy, precision, recall, and F1-score—to predict CVD risk.
We applied low-rank adaptation (LoRA) for parameter-efficient fine-tuning across all models, except for BioBERT, which was fully fine-tuned due to its relatively small size. The dataset was split into training (80%), validation (10%), and test (10%) subsets. The validation set was used for hyperparameter tuning and early stopping (patience = 1).
Key configuration details are provided in Table 4. Medical-specific LLMs were fine-tuned using higher LoRA ranks (64) to better retain domain-specific knowledge, whereas general-purpose models prioritized efficiency with lower ranks (16–32). All models used the AdamW optimizer with a weight decay of 0.01 and employed 4-bit quantization where supported to reduce memory footprint.
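A minimal sketch of this parameter-efficient fine-tuning setup, using Hugging Face transformers/peft with 4-bit quantization, is shown below. It frames risk prediction as binary sequence classification; the checkpoint name and LoRA values are illustrative (see Table 4 for the configurations actually used), and argument names may differ slightly across library versions.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          BitsAndBytesConfig, EarlyStoppingCallback, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "google/gemma-2-2b"                          # illustrative checkpoint

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, quantization_config=bnb_config)

model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=32, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="SEQ_CLS")
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="cvd-lora", learning_rate=2e-5, per_device_train_batch_size=1,
    num_train_epochs=5, weight_decay=0.01,                # AdamW with weight decay 0.01
    eval_strategy="epoch", save_strategy="epoch",         # older versions: evaluation_strategy
    load_best_model_at_end=True, metric_for_best_model="eval_loss")

# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=1)])
# trainer.train()
```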

3.3.3. Feature Importance Analysis

To enhance the interpretability of our fine-tuned LLMs and better understand the relative contributions of different input features to their CVD risk predictions, we conducted a feature importance analysis using SHAP (SHapley Additive exPlanations) [35]. SHAP is a game theory-based framework that explains model outputs by assigning each feature an importance value corresponding to its contribution to a given prediction.
We performed SHAP analysis on fine-tuned general-purpose and medical-specific LLMs. The procedure involved the following steps (a minimal code sketch follows the list):
  • Feature Extraction: A subset of the test data was used, in which the textual health profiles (original inputs to the LLMs) were converted back into their structured, categorical feature representations (e.g., Age_Group, Gender, Smoking_History) using a rule-based feature extractor. This step enabled SHAP to operate on interpretable, human-understandable features.
  • Prediction Function for SHAP: A wrapper function was created that accepts structured features as input, reconstructs the full textual health profile in the format expected by the fine-tuned LLM, and outputs the CVD risk probability—specifically, the likelihood of being classified as “High Risk”.
  • SHAP Value Computation: Using the shap library, a SHAP explainer (e.g., Kernel-Explainer or another model-compatible variant) was initialized with a background dataset drawn from structured test instances. SHAP values were computed for each feature on a representative evaluation sample, with consideration for computational constraints (GPU/CPU).
  • Global Feature Importance: Mean absolute SHAP values were calculated across the evaluation set to derive a global measure of each feature’s importance in influencing model predictions.
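A condensed sketch of this procedure is shown below; `rebuild_profile` and `llm_predict_proba` are hypothetical helpers standing in for the feature-to-text reconstruction and the fine-tuned model's probability output, `X_struct` denotes the structured (integer-coded) test subset, and the feature list is an illustrative subset.

```python
import numpy as np
import pandas as pd
import shap

FEATURES = ["Age_Group", "Gender", "Smoking_History", "High_Blood_Pressure", "Diabetes"]  # illustrative

def predict_high_risk(X: np.ndarray) -> np.ndarray:
    """SHAP wrapper: integer-coded structured rows -> textual profiles -> P('High Risk')."""
    probs = []
    for row in X:
        profile = rebuild_profile(dict(zip(FEATURES, row)))   # hypothetical helper
        probs.append(llm_predict_proba(profile))              # hypothetical fine-tuned-LLM call
    return np.array(probs)

background = X_struct[:50]                                    # small background sample
explainer = shap.KernelExplainer(predict_high_risk, background)
shap_values = explainer.shap_values(X_struct[:200])           # representative evaluation sample

global_importance = pd.Series(np.abs(shap_values).mean(axis=0),
                              index=FEATURES).sort_values(ascending=False)
print(global_importance)
```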

3.4. Application Development

To facilitate personalized CVD risk assessment and deliver actionable recommendations, we developed an interactive chatbot application named ChatCVD, which integrates a fine-tuned CVD risk prediction model with a RAG framework (Figure 4). The user provides input through the chatbot interface, which is first processed by a fine-tuned LLM to estimate the individual’s CVD risk level. Based on this, a query is generated and used to retrieve relevant documents from a structured knowledge base via similarity search. The retrieved content is then passed to a generative LLM, which produces a personalized, evidence-based response that is returned to the user through the chatbot interface.
The chatbot is implemented using Streamlit [36], providing a user-friendly interface for accessing personalized lifestyle and health guidance. An example of the ChatCVD interface is shown in Appendix B.
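A minimal Streamlit skeleton for such an interface might look as follows; `predict_risk` and `rag_recommend` are hypothetical helpers wrapping the fine-tuned classifier and the RAG pipeline described next.

```python
import streamlit as st

st.title("ChatCVD")
profile = st.text_area("Describe your health profile")

if st.button("Assess my CVD risk") and profile:
    risk = predict_risk(profile)                  # hypothetical fine-tuned-LLM classifier
    advice = rag_recommend(profile, risk)         # hypothetical RAG recommendation step
    st.subheader(f"Predicted risk level: {risk}")
    st.markdown(advice)
```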
Our RAG approach draws upon a knowledge base constructed from authoritative cardiovascular health sources, including the Heart Foundation [37] and the CVD Risk Guideline [38]. This design allows for automatic content updates without requiring model retraining. The document retrieval pipeline includes the following stages (a code sketch follows the list):
  • Data Acquisition and Processing: HTML content from the specified URLs is automatically downloaded and parsed, and relevant text is extracted.
  • Chunking: The extracted text is segmented into coherent chunks of approximately 200–300 words to preserve context.
  • Embedding: Each chunk is converted into a vector representation using SentenceTransformers [39], creating a semantic index of the knowledge base.
  • Vector Storage: These embeddings are stored in a vector database to enable efficient similarity search during response generation.
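A sketch of this indexing pipeline is given below. The embedding model and the FAISS vector index are assumptions (the text specifies SentenceTransformers and a vector database without naming a particular backend), and chunking here is a simple word-count split.

```python
import faiss
import numpy as np
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer

def fetch_text(url: str) -> str:
    """Download a guideline page and strip the HTML markup."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return soup.get_text(separator=" ", strip=True)

def chunk(text: str, words_per_chunk: int = 250) -> list[str]:
    """Split extracted text into ~200-300-word chunks."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk]) for i in range(0, len(words), words_per_chunk)]

sources = [
    "https://www.heartfoundation.org.au/for-professionals/guideline-for-managing-cvd",
    "https://www.cvdcheck.org.au/using-the-calculator-to-assess-cvd-risk",
]
chunks = [c for url in sources for c in chunk(fetch_text(url))]

embedder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
embeddings = embedder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])            # cosine similarity via normalized inner product
index.add(np.asarray(embeddings, dtype=np.float32))
```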
The following is the structured prompt employed during query generation to retrieve relevant documents:
Given the following profile: ’{user_input}’, and considering this individual has a {risk_level} risk of CVD, first, identify the key risk factors mentioned.
Then, provide 3 UNIQUE and SPECIFIC ACTIONABLE recommendations to improve their cardiovascular health, drawing from the provided context.
Avoid repeating the same advice.
Ex: If overweight, suggest a weight loss goal.
  If high cholesterol, recommend foods to avoid.
Present recommendations in a numbered list.
where {user_input} is replaced with the user’s health profile and {risk_level} with the predicted risk.
When a user provides their health profile through the chatbot interface, the system operates as follows (see the sketch after this list):
  • Risk Prediction: The input text is passed to the fine-tuned CVD risk prediction model, which classifies the user as “Low Risk” or “High Risk.”
  • Query Generation: The structured query from the prompt template is constructed using the user’s information and risk level.
  • Knowledge Retrieval: This query is then used to perform a similarity search in the vector database, retrieving the most relevant chunks of information from the knowledge base.
  • Recommendation Synthesis: The retrieved information is synthesized using the LLM to generate personalized and actionable recommendations tailored to the user’s specific risk factors and profile. The LLM is prompted to avoid generic advice and instead provide concrete steps the user can take.
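Continuing the indexing sketch above, the end-to-end response flow could be expressed as follows; `predict_risk` and `generate_recommendations` are hypothetical wrappers around the fine-tuned classifier and the generative LLM.

```python
import numpy as np

def chatcvd_respond(user_input: str, k: int = 4) -> str:
    """Risk prediction -> query construction -> retrieval -> recommendation synthesis."""
    # 1. Classify the profile as "High Risk" or "Low Risk" with the fine-tuned model.
    risk_level = predict_risk(user_input)

    # 2. Fill the structured prompt template with the profile and predicted risk level.
    query = (f"Given the following profile: '{user_input}', and considering this individual "
             f"has a {risk_level} risk of CVD, first, identify the key risk factors mentioned. "
             "Then, provide 3 UNIQUE and SPECIFIC ACTIONABLE recommendations to improve their "
             "cardiovascular health, drawing from the provided context.")

    # 3. Retrieve the most similar chunks from the vector index built earlier.
    q_emb = embedder.encode([query], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q_emb, dtype=np.float32), k)
    context = "\n".join(chunks[i] for i in idx[0])

    # 4. Synthesize personalized, guideline-grounded recommendations.
    return generate_recommendations(prompt=query, context=context)
```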
An example of the chatbot’s output for a user with a specific health profile is shown in Figure 5.

3.5. Evaluation

3.5.1. Human Expert Assessment

To assess the clinical relevance and quality of the chatbot’s outputs, we conducted an evaluation with two medical experts (A.A. and M.A.). A diverse set of 20 unique patient profiles—each accompanied by chatbot-generated outputs (risk level, key risk factors, and actionable recommendations, as illustrated previously)—was independently reviewed by both experts, resulting in a total of 40 expert ratings. For each case, the experts evaluated the overall quality, clinical relevance, and actionability of the chatbot’s output using a 5-point Likert scale (1 = Poor, 5 = Excellent), and also provided qualitative feedback.

3.5.2. Statistical Analysis

The Shapiro–Wilk test was used to assess the normality of differences in correct classifications between model pairs. For normally distributed differences, paired t-tests were applied; otherwise, the Wilcoxon signed-rank test was used. A significance threshold of p < 0.05 was adopted, with p-values adjusted using the Benjamini–Hochberg procedure to control the false discovery rate (FDR).
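A compact sketch of this testing procedure, assuming per-sample correctness vectors (0/1) for each model on the shared test set, might look as follows; scipy and statsmodels are assumed for the tests and the Benjamini–Hochberg correction.

```python
import numpy as np
from scipy.stats import shapiro, ttest_rel, wilcoxon
from statsmodels.stats.multitest import multipletests

def paired_pvalue(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Paired comparison of two models' per-sample correctness on the same test set."""
    diff = correct_a.astype(float) - correct_b.astype(float)
    if shapiro(diff).pvalue > 0.05:                       # differences look approximately normal
        return ttest_rel(correct_a, correct_b).pvalue
    return wilcoxon(correct_a, correct_b, zero_method="pratt").pvalue

# p_values = [paired_pvalue(a, b) for a, b in model_pairs]                    # hypothetical model pairs
# reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")  # Benjamini-Hochberg FDR
```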

4. Results

4.1. Performance of Medical-Specific vs. General-Purpose LLMs in CVD Risk Assessment

This subsection presents the performance evaluation of fine-tuned medical-specific and general-purpose LLMs for CVD risk assessment. Table 5 summarizes the results for eight models across five key metrics: accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The AUC metric indicates each model’s ability to distinguish between high- and low-risk individuals, with higher values reflecting better overall discriminatory power.
All models demonstrated commendable performance. F1-scores generally exceeded 0.70, indicating a strong balance between precision and recall across most models. Among all models, Mistral, LLaMA3, and LLaMA2 achieved the highest AUC scores (0.84), reflecting robust overall discrimination. Med42, Gemma2, and BioBERT followed closely, each achieving an AUC of 0.82, while Meditron scored slightly lower (0.81). MedAlpaca recorded the lowest AUC (0.77) among the evaluated models. Importantly, all models achieved AUC values well above 0.5—the threshold for random classification—underscoring their ability to effectively differentiate between CVD risk levels.
In terms of recall, Med42 emerged as the top-performing medical-specific model (0.922), demonstrating strong sensitivity in identifying high-risk cases. BioBERT (0.908) and Gemma2 (0.907) also exhibited high recall values. Conversely, general-purpose models such as Mistral and LLaMA2, while having relatively lower recall (0.713 and 0.711, respectively), achieved the highest precision across all models (0.792 and 0.793). This highlights a performance trade-off: Some models prioritize recall (minimizing false negatives), while others emphasize precision (minimizing false positives), depending on their architecture and fine-tuning behavior.

4.2. Human Expert Assessment

The results of the human evaluation indicate strong performance. Of the 40 total expert ratings, 30 (75%) were scored as “Excellent” or “Good Quality” (4 or 5 out of 5). The average rating across all evaluations was 4.5 out of 5. Qualitative feedback consistently suggested that the chatbot’s responses were generally well-aligned with current medical guidelines and offered useful, personalized insights. While this evaluation using 20 distinct cases provides encouraging evidence of quality and clinical alignment, we acknowledge that ongoing assessment with a larger and more diverse set of cases and reviewers will be important for further validation and continuous system improvement.

4.3. Statistical Analysis

To assess the significance of performance differences among models, we conducted pairwise comparisons focusing on recall—the most clinically critical metric in this context. In CVD risk assessment, minimizing false negatives is essential, as failing to identify high-risk individuals can lead to serious adverse outcomes. Therefore, we prioritized models with recall scores above 0.90: Med42, Gemma2, and BioBERT. These models clearly outperformed others, which had recall scores closer to 0.70.
Among them, Med42 achieved the highest recall (0.922; Table 5), reflecting its strong sensitivity in identifying high-risk cases. BioBERT (0.908) and Gemma2 (0.907) also performed well. However, statistical comparisons revealed that the differences in recall between Med42 and the other two models were not significant (p = 0.251 and p = 0.393, respectively, after FDR correction).
Although Med42 had the highest recall, other models showed strength in different areas. For example, Meditron achieved higher precision (0.754) than Med42 (0.664) but with lower recall (0.715), illustrating the common trade-off between minimizing false negatives and minimizing false positives. These dynamics are further illustrated in the confusion matrices presented in Figure A5.

4.4. Feature Importance Analysis

To better understand which input features most significantly influenced the CVD risk predictions of our fine-tuned models, we conducted a feature importance analysis using SHAP values on the Gemma2 and Med42 models, as described in Section 3.3.3. The top 15 features, ranked by their mean absolute SHAP values, are visualized in Figure 6.
The analysis revealed that, for both models, features such as Age_Group and General_Health consistently emerged as highly influential in predicting CVD risk. This aligns with established clinical understanding, suggesting that the models learned to emphasize well-known cardiovascular risk factors from the textualized patient profiles. Notably, Gemma2 assigned higher importance to Gender, while Med42 emphasized High_Blood_Pressure and High_Cholesterol more prominently. Traditional risk indicators such as Smoking_History and Diabetes were identified as important by both models, though their relative rankings varied.

5. Discussion

We evaluated eight fine-tuned LLMs—both general-purpose and medical-specific—for CVD risk assessment using the BRFSS dataset. The results challenge the assumption that larger or domain-specialized models always outperform smaller, general-purpose ones. Notably, Gemma2, a compact general-purpose model with just 2 billion parameters, delivered competitive performance, achieving a high recall of 0.907 and a strong F1-score of 0.770. Its performance was not statistically different from that of high-recall medical-specific models like Med42. These findings underscore the efficiency and practicality of smaller models for clinical applications, particularly in resource-limited settings.
Interestingly, models based on earlier transformer architectures, such as BioBERT, performed on par with more modern LLMs when fine-tuned effectively. In contrast, MedAlpaca, a medically pretrained model derived from LLaMA2, underperformed relative to fine-tuned general-purpose models like LLaMA2 itself. This highlights that targeted fine-tuning may offer greater benefits for specific clinical tasks than broad medical pretraining alone.
On the other hand, general-purpose models like Mistral and LLaMA2 achieved the highest AUC scores (0.84), indicating strong overall discrimination between risk levels, but their lower recall underscores the trade-off between sensitivity and precision. These differences emphasize the need to match model selection with specific clinical priorities—favoring high recall in screening scenarios and higher precision in resource-limited settings where false positives may lead to unnecessary interventions.
To provide greater transparency into model decision-making, we conducted a SHAP-based feature importance analysis on Gemma2 and Med42. Both models consistently identified features such as Age_Group, General_Health, Smoking_History, and Diabetes as major contributors to CVD risk prediction, aligning with established clinical knowledge.
Differences in the feature rankings—such as Gemma2 placing more emphasis on Gender, while Med42 prioritized High_Blood_Pressure—reflect distinct ways in which model architectures process textual health profiles. These findings suggest that both general-purpose and medical-specific LLMs, once fine-tuned, are capable of discerning and prioritizing clinically relevant information embedded in natural language. This analysis enhances interpretability and offers deeper insight into how each model makes predictions.
Beyond model evaluation, we developed ChatCVD, an interactive chatbot that combines LLM-based risk prediction with a RAG framework to deliver personalized health recommendations. Expert feedback on the chatbot’s outputs indicated strong clinical relevance and practical value. Most expert ratings scored the system as “Excellent” or “Good Quality,” and qualitative comments confirmed its alignment with current medical guidelines. This integration of AI-driven tools with user-friendly interfaces exemplifies the translational potential of LLMs in supporting proactive, personalized healthcare.
The adaptability of large language models, including GPT-style architectures, also holds promise for scenarios beyond large-scale survey data. While our study utilized the comprehensive BRFSS dataset, the strong pre-training of these models on vast general corpora enables them to generalize effectively even when fine-tuned on more limited or specialized medical datasets. This capacity for few-shot or zero-shot adaptation, alongside techniques like prompt engineering and data augmentation, makes them attractive tools in medical contexts where high-quality labeled data may be scarce [40,41]. Our use of fine-tuning on a specific task can be seen as leveraging this pre-existing knowledge, which is a principle applicable even in data-constrained environments.
While these findings are promising, several limitations must be acknowledged. First, the BRFSS dataset used for model training dates back to 2015. Given the evolving nature of health behaviors, clinical guidelines, and population demographics, this may limit the models’ applicability to current contexts. However, the development framework itself is generalizable. Future work should focus on re-fine-tuning the models using more recent and diverse datasets to enhance both accuracy and generalizability.
Second, while ChatCVD is designed as an assistive tool, it is not a substitute for medical advice. Ensuring transparency, human oversight, and fairness remains essential. A key next step will be to audit the system for demographic biases by disaggregating model performance across variables such as gender, education level, and socioeconomic status. This will help ensure equitable utility and support responsible AI deployment in healthcare.
Lastly, although we adopted a binary classification scheme (“High Risk” vs. “Low Risk”) for simplicity and accessibility, clinical risk is often assessed on a continuous scale. Future studies could explore more granular models that provide probabilistic or multi-level risk stratification, potentially increasing clinical applicability.
Together, these insights demonstrate that carefully fine-tuned LLMs—whether general-purpose or domain-specific—can support effective, interpretable, and accessible CVD risk assessment. Our work reinforces the importance of model selection based on clinical priorities, emphasizes the value of transparency and personalization, and lays the foundation for future research integrating LLMs into real-world healthcare workflows.

6. Conclusions

This work highlights the potential of LLMs for CVD risk assessment. Smaller, general-purpose models like Gemma2 demonstrated high recall, offering efficient alternatives for resource-constrained settings while maintaining strong discriminatory power. These findings challenge the assumption that larger or specialized models always outperform smaller, general-purpose ones. Med42 achieved the highest recall, though it was not statistically significantly superior, positioning it as a promising option for critical diagnostic applications where minimizing false negatives is crucial. The integration of fine-tuned LLMs into a user-friendly chatbot application demonstrates their practical utility, offering personalized risk assessments and actionable insights for proactive CVD management. Further research with larger and more diverse datasets is necessary to validate these findings and examine the ethical implications of incorporating LLMs into clinical workflows.

Author Contributions

Conceptualization, W.L. and A.S.; methodology, W.L. and A.S.; software, W.L.; validation, A.S., M.A., A.A. and A.I.; formal analysis, W.L.; investigation, W.L.; resources, A.S.; data curation, W.L.; writing—original draft preparation, W.L.; writing—review and editing, W.L., M.A., A.I., A.A. and A.S.; visualization, W.L.; supervision, A.S.; project administration, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study utilized publicly available, anonymized data from the Behavioral Risk Factor Surveillance System (BRFSS). No direct human subject interaction occurred, and no identifiable private information was used.

Informed Consent Statement

The study used publicly available, de-identified data, for which informed consent was obtained by the original data collectors (CDC for BRFSS).

Data Availability Statement

The code and preprocessed data used in this study are available at: https://github.com/serag-ai/ChatCVD accessed on 13 July 2025. The original BRFSS data used in this study is from the 2015 dataset, which is publicly available at: https://www.cdc.gov/brfss/annual_data/annual_2015.html accessed on 15 September 2024.

Acknowledgments

The authors acknowledge the use of text generated by artificial intelligence (AI) during the preparation of this manuscript. Specifically, OpenAI’s ChatGPT and Google’s Gemini were utilized to assist in drafting certain sections. These AI-generated contributions were reviewed, edited, and validated by the authors to ensure accuracy, relevance, and compliance with the scientific and ethical standards of the target journal. The authors assume full responsibility for the final content of the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Demographic Variable Distributions

Figure A1. Distribution of demographic variables in the original BRFSS dataset before RUS.
Figure A2. Distribution of demographic variables in the training/validation set after RUS.
Figure A3. Distribution of demographic variables in the held-out test set (natural imbalance preserved).

Appendix B. ChatCVD Interface Example

Here, we present an example of the ChatCVD user interface, illustrating the user input, the predicted risk level, identified key risk factors, and personalized recommendations generated by the system.
Figure A4. Example of the ChatCVD interface showing user input (left panel, top), an example input profile (left panel, bottom), and the corresponding chatbot output including predicted risk level, key risk factors, personalized recommendations, and source information (right panel).

Appendix C. Confusion Matrix

Figure A5. Confusion matrix comparing the performance of medical and general-purpose LLMs for CVD risk assessment. Top row: the medical LLMs. Bottom row: the general-purpose LLMs.

References

  1. Rehman, S.; Rehman, E.; Ikram, M.; Jianglin, Z. Cardiovascular disease (CVD): Assessment, prediction and policy implications. BMC Public Health 2021, 21, 1299. [Google Scholar] [CrossRef]
  2. Boukhatem, C.; Youssef, H.Y.; Nassif, A.B. Heart disease prediction using machine learning. In Proceedings of the 2022 Advances in Science and Engineering Technology International Conferences (ASET), Dubai, United Arab Emirates, 21–24 February 2022; pp. 1–6. [Google Scholar]
  3. Ananthajothi, K.; David, J.; Kavin, A. Cardiovascular Disease Prediction using Patient History and Real Time Monitoring. In Proceedings of the 2024 2nd International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), Bengaluru, India, 4–6 January 2024; pp. 1226–1233. [Google Scholar]
  4. Zhao, D.; Liu, J.; Xie, W.; Qi, Y. Cardiovascular risk assessment: A global perspective. Nat. Rev. Cardiol. 2015, 12, 301–311. [Google Scholar] [CrossRef] [PubMed]
  5. Wong, N.D. Cardiovascular risk assessment: The foundation of preventive cardiology. Am. J. Prev. Cardiol. 2020, 1, 100008. [Google Scholar] [CrossRef]
  6. Li, Y.H.; Li, Y.L.; Wei, M.Y.; Li, G.Y. Innovation and challenges of artificial intelligence technology in personalized healthcare. Sci. Rep. 2024, 14, 18994. [Google Scholar] [CrossRef]
  7. Gunning, D.; Stefik, M.; Choi, J.; Miller, T.; Stumpf, S.; Yang, G.Z. XAI—Explainable artificial intelligence. Sci. Robot. 2019, 4, eaay7120. [Google Scholar] [CrossRef]
  8. Hussain, H.K.; Tariq, A.; Gill, A.Y.; Ahmad, A. Transforming Healthcare: The Rapid Rise of Artificial Intelligence Revolutionizing Healthcare Applications. BULLET J. Multidisiplin. Ilmu 2022, 1, 592216. [Google Scholar]
  9. Sushil, M.; Kennedy, V.E.; Mandair, D.; Miao, B.Y.; Zack, T.; Butte, A.J. CORAL: Expert-curated oncology reports to advance language model inference. NEJM AI 2024, 1, AIdbp2300110. [Google Scholar] [CrossRef]
  10. Hosseini, A.; Serag, A. Is synthetic data generation effective in maintaining clinical biomarkers? Investigating diffusion models across diverse imaging modalities. Front. Artif. Intell. 2025, 7, 1454441. [Google Scholar] [CrossRef]
  11. Hosseini, A.; Serag, A. Self-Supervised Learning Powered by Synthetic Data From Diffusion Models: Application to X-Ray Images. IEEE Access 2025, 13, 59074–59084. [Google Scholar] [CrossRef]
  12. Ben Rabah, C.; Petropoulos, I.N.; Malik, R.A.; Serag, A. Vision transformers for automated detection of diabetic peripheral neuropathy in corneal confocal microscopy images. Front. Imaging 2025, 4, 1542128. [Google Scholar] [CrossRef]
  13. Ben Rabah, C.; Sattar, A.; Ibrahim, A.; Serag, A. A Multimodal Deep Learning Model for the Classification of Breast Cancer Subtypes. Diagnostics 2025, 15, 995. [Google Scholar] [CrossRef] [PubMed]
  14. Helmy, H.; Rabah, C.B.; Ali, N.; Ibrahim, A.; Hoseiny, A.; Serag, A. Optimizing ICU Readmission Prediction: A Comparative Evaluation of AI Tools. In International Workshop on Applications of Medical AI; Springer: Berlin/Heidelberg, Germany, 2024; pp. 95–104. [Google Scholar]
  15. Ibrahim, A.; Hosseini, A.; Ibrahim, S.; Sattar, A.; Serag, A. D3: A Small Language Model for Drug-Drug Interaction prediction and comparison with Large Language Models. Mach. Learn. Appl. 2025, 20, 100658. [Google Scholar] [CrossRef]
  16. CDC. CDC—2015 BRFSS Survey Data and Documentation. Available online: https://www.cdc.gov/brfss/annual_data/annual_2015.html (accessed on 9 September 2024).
  17. Kee, D.; Wisnivesky, J.; Kale, M.S. Lung cancer screening uptake: Analysis of BRFSS 2018. J. Gen. Intern. Med. 2021, 36, 2897–2899. [Google Scholar] [CrossRef] [PubMed]
  18. Li, J.; Yuan, Y.; Zhang, Z. Enhancing llm factual accuracy with rag to counter hallucinations: A case study on domain-specific queries in private knowledge-bases. arXiv 2024, arXiv:2403.10446. [Google Scholar]
  19. Xian, L.; Xu, J. Smart Guardian: An AI-Based Coronary Heart Disease Prediction System, Focusing on Your Cardiac Health! SSRN Electron. J. 2024. Available online: https://ssrn.com/abstract=4713171 (accessed on 10 January 2025).
  20. Akther, K.; Kohinoor, M.S.R.; Priya, B.S.; Rahaman, M.J.; Rahman, M.M.; Shafiullah, M. Multi-Faceted Approach to Cardiovascular Risk Assessment by Utilizing Predictive Machine Learning and Clinical Data in a Unified Web Platform. IEEE Access 2024, 12, 120454–120473. [Google Scholar] [CrossRef]
  21. Goel, R. Using text embedding models and vector databases as text classifiers with the example of medical data. arXiv 2024, arXiv:cs.IR/2402.16886. [Google Scholar]
  22. Gundabathula, S.K.; Kolar, S.R. PromptMind Team at MEDIQA-CORR 2024: Improving Clinical Text Correction with Error Categorization and LLM Ensembles. arXiv 2024, arXiv:cs.CL/2405.08373. [Google Scholar]
  23. Acharya, A.; Shrestha, S.; Chen, A.; Conte, J.; Avramovic, S.; Sikdar, S.; Anastasopoulos, A.; Das, S. Clinical risk prediction using language models: Benefits and considerations. J. Am. Med. Inform. Assoc. 2024, 31, 1856–1864. [Google Scholar] [CrossRef]
  24. McInerney, D.J.; Dickinson, W.; Flynn, L.; Young, A.; Young, G.; van de Meent, J.W.; Wallace, B.C. Towards Reducing Diagnostic Errors with Interpretable Risk Prediction. arXiv 2024, arXiv:2402.10109. [Google Scholar]
  25. Liu, S.M.; Chen, J.H.; Liu, Z. An empirical study of dynamic selection and random under-sampling for the class imbalance problem. Expert Syst. Appl. 2023, 221, 119703. [Google Scholar] [CrossRef]
  26. Hall, T.; Beecham, S.; Bowes, D.; Gray, D.; Counsell, S. A systematic literature review on fault prediction performance in software engineering. IEEE Trans. Softw. Eng. 2011, 38, 1276–1304. [Google Scholar] [CrossRef]
  27. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2019, 36, 1234–1240. [Google Scholar] [CrossRef]
  28. Yaseen Jabarulla, M.; Oeltze-Jafra, S.; Beerbaum, P.; Uden, T. MedDoc-Bot: A Chat Tool for Comparative Analysis of Large Language Models in the Context of the Pediatric Hypertension Guideline. arXiv 2024, arXiv:2405.03359. [Google Scholar]
  29. Christophe, C.; Kanithi, P.K.; Raha, T.; Khan, S.; Pimentel, M.A. Med42-v2: A suite of clinical llms. arXiv 2024, arXiv:2408.06142. [Google Scholar]
  30. Han, T.; Adams, L.C.; Papaioannou, J.M.; Grundmann, P.; Oberhauser, T.; Löser, A.; Truhn, D.; Bressem, K.K. MedAlpaca—An open-source collection of medical conversational AI models and training data. arXiv 2023, arXiv:2304.08247. [Google Scholar]
  31. Team, G.; Riviere, M.; Pathak, S.; Sessa, P.G.; Hardin, C.; Bhupatiraju, S.; Hussenot, L.; Mesnard, T.; Shahriari, B.; Ramé, A.; et al. Gemma 2: Improving open language models at a practical size. arXiv 2024, arXiv:2408.00118. [Google Scholar]
  32. Masalkhi, M.; Ong, J.; Waisberg, E.; Zaman, N.; Sarker, P.; Lee, A.G.; Tavakkoli, A. A side-by-side evaluation of Llama 2 by meta with ChatGPT and its application in ophthalmology. Eye 2024, 38, 1789–1792. [Google Scholar] [CrossRef]
  33. Gupta, P.; Yau, L.Q.; Low, H.H.; Lee, I.; Lim, H.M.; Teoh, Y.X.; Koh, J.H.; Liew, D.W.; Bhardwaj, R.; Bhardwaj, R.; et al. WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models. arXiv 2024, arXiv:2408.03837. [Google Scholar]
  34. Jin, B.; Liu, G.; Han, C.; Jiang, M.; Ji, H.; Han, J. Large language models on graphs: A comprehensive survey. arXiv 2023, arXiv:2312.02783. [Google Scholar]
  35. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017); Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777. [Google Scholar]
  36. Streamlit, Inc. Available online: https://streamlit.io/ (accessed on 20 September 2024).
  37. National Heart Foundation of Australia. Heart Foundation. Available online: https://www.heartfoundation.org.au/for-professionals/guideline-for-managing-cvd (accessed on 20 September 2024).
  38. National Heart Foundation of Australia. CVD Risk Guideline. Available online: https://www.cvdcheck.org.au/using-the-calculator-to-assess-cvd-risk (accessed on 20 September 2024).
  39. Shi, L.; Kazda, M.; Sears, B.; Shropshire, N.; Puri, R. Ask-EDA: A Design Assistant Empowered by LLM, Hybrid RAG and Abbreviation De-hallucination. arXiv 2024, arXiv:2406.06575. [Google Scholar]
  40. Wang, J.; Shi, E.; Yu, S.; Wu, Z.; Ma, C.; Dai, H.; Yang, Q.; Kang, Y.; Wu, J.; Hu, H.; et al. Prompt Engineering for Healthcare: Methodologies and Applications. arXiv 2023, arXiv:2304.14670. [Google Scholar]
  41. Sufi, F.K. Addressing Data Scarcity in the Medical Domain: A GPT-Based Approach for Synthetic Data Generation and Feature Extraction. Information 2024, 15, 264. [Google Scholar] [CrossRef]
Figure 1. Overview of the CVD risk assessment pipeline. The system consists of three main stages: (1) Data Preprocessing, where the BRFSS dataset undergoes feature selection, cleaning, class balancing, and transformation into textual profiles; (2) Model Development, where medical-specific and general-purpose LLMs are fine-tuned and evaluated on the textualized data; and (3) Application Development, where the best-performing LLM is integrated into a RAG framework to generate personalized health recommendations, which are delivered through the ChatCVD chatbot.
Figure 2. Pipeline for converting structured numerical health data into textual prompts suitable for LLM input. The process involves three stages: (1) transforming numerical feature values into descriptive text, (2) combining features into full health profiles with paraphrased variants, and (3) appending a risk assessment query to each profile. This results in natural language inputs paired with CVD risk labels (“High Risk” or “Low Risk”) for training and evaluation.
Figure 3. Workflow for fine-tuning LLMs using textualized BRFSS data.
Figure 4. System architecture of ChatCVD illustrating the integration of a fine-tuned CVD risk prediction model with a RAG framework.
Figure 5. An example of a personalized response based on a user’s health profile.
Figure 6. Top 15 most important features for fine-tuned Gemma2 (a) and Med42 (b) models, based on mean absolute SHAP values.
Table 1. Summary of related works.
Study | Focus Area | Methodology | Unique Contributions | Limitations
Xian et al. [19] | CVD Risk Assessment | Ensemble ML Models | Integrates GPT-3.5 for personalized health advice | Relies on traditional ML for risk prediction; GPT is used solely for advice generation, not for direct classification from BRFSS data.
Akther et al. [20] | CVD Risk Prediction | ML and DL | Developed a web-based application for personalized risk assessment | Uses traditional ML/DL models; personalized advice may not be as current or context-specific as that provided by RAG-based approaches.
Goel et al. [21] | Medical Data Classification | LLMs with Vector Databases | Focuses on LLMs for classification rather than risk assessment | General medical data classification; not specific to CVD risk from textualized BRFSS or personalized recommendations via RAG.
Gundabathula et al. [22] | Error Detection in Clinical Notes | Prompt-based Learning | Focuses on error correction, not risk assessment | Addresses error correction in clinical notes; does not focus on predictive CVD risk assessment or personalized recommendations.
Acharya et al. [23] | Clinical Risk Prediction | Fine-tuned LLMs | Investigates LLMs for clinical risk prediction but not specifically on BRFSS | Uses EHR data; does not compare medical vs. general LLMs on BRFSS for CVD or integrate RAG for advice.
McInerney et al. [24] | Risk Prediction for Reducing Errors | Neural Additive Models | Investigates LLMs for clinical risk prediction but not specifically on BRFSS | General risk estimation; may not be fine-tuned on BRFSS for binary CVD risk or compare LLM types with RAG for CVD advice.
Table 2. Summary of selected features for CVD risk assessment.
Feature | Description
Age | Five-year age categories
Sex | Gender of the respondent
Smoking History | History of smoking 100 cigarettes
Physical Activity Level | Level of physical activity
Fruit Consumption | Frequency of fruit consumption
Vegetable Consumption | Frequency of vegetable consumption
BMI | Body Mass Index category
General Health | Self-reported general health status
High Cholesterol | History of high cholesterol
Kidney Disease | History of kidney disease
Diabetes | History of diabetes
Blood Pressure | History of high blood pressure
Alcohol Consumption | Frequency of alcohol consumption
Cancer History | History of any type of cancer
Education | Highest level of education attained
Income | Annual income level
Medical Cost Issues | Could not see a doctor due to financial issues
Note: The “Cancer History” feature is derived from combining the “had skin cancer” and “had any other types of cancer” columns.
Table 3. Class distribution before and after random undersampling (RUS).
Dataset Phase | Class 0 (Low Risk) | Class 1 (High Risk)
Original Dataset | 308,947 | 34,663
Pre-RUS Split:
    Training | 216,360 | 24,167
    Validation | 46,286 | 5255
    Test (Held-out) | 46,301 | 5241
Post-RUS:
    Training | 24,167 | 24,167
    Validation | 5255 | 5255
Table 4. Hyperparameters for different models.
Model | Learning Rate | Batch Size | Epochs | LoRA Rank | LoRA Alpha
Med LLMs | 2 × 10⁻⁵ | 1 | 5 | 64 | 32
Gemma-2b | 2 × 10⁻⁵ | 1 | 5 | 64 | 32
Mistral | 1 × 10⁻⁴ | 8 | 5 | 64 | 32
Llama3 | 1 × 10⁻⁴ | 8 | 5 | 64 | 32
BioBERT | 1 × 10⁻⁴ | 8 | 3 | FT | FT
Llama-2-7b | 2 × 10⁻⁵ | 8 | 5 | 2 | 16
Note: FT = full fine-tuning (no LoRA adapters).
Table 5. Model Performance for Medical and General Categories.
Category | Model | Accuracy | Precision | Recall | F1-Score | AUC
Medical | BioBERT | 0.732 | 0.672 | 0.908 | 0.772 | 0.82
Medical | Meditron | 0.741 | 0.754 | 0.715 | 0.734 | 0.81
Medical | Med42 | 0.728 | 0.664 | 0.922 | 0.772 | 0.82
Medical | MedAlpaca | 0.710 | 0.685 | 0.779 | 0.729 | 0.77
General | Mistral | 0.763 | 0.792 | 0.713 | 0.750 | 0.84
General | Gemma2 | 0.730 | 0.670 | 0.907 | 0.770 | 0.82
General | LLaMA2 | 0.763 | 0.793 | 0.711 | 0.750 | 0.84
General | LLaMA3 | 0.761 | 0.790 | 0.712 | 0.749 | 0.84
