Data-Driven Predictive Modelling of Lifestyle Risk Factors for Cardiovascular Health

Kissi, Solomon Agyiri; Talukder, Md Golam Muttaquee; Iqbal, Muhammad Zahid

doi:10.3390/electronics14142906


by Solomon Agyiri Kissi, Md Golam Muttaquee Talukder and Muhammad Zahid Iqbal *

School of Computing, Engineering & Digital Technologies, Teesside University, Middlesbrough TS1 3BX, UK

* Author to whom correspondence should be addressed.

Electronics 2025, 14(14), 2906; https://doi.org/10.3390/electronics14142906

Submission received: 20 June 2025 / Revised: 8 July 2025 / Accepted: 17 July 2025 / Published: 20 July 2025

(This article belongs to the Special Issue Smart Bioelectronics, Wearable Systems and E-Health)


Abstract

Cardiovascular disease (CVD) remains the foremost global cause of mortality, driven significantly by modifiable lifestyle factors. This study employs a data-driven approach to identify and evaluate these risk factors using advanced machine learning techniques. Analysing a large publicly available dataset of over 300,000 adult health records containing lifestyle behaviours, clinical risk factors, and self-reported health indicators, this research implemented traditional classifiers, ensemble methods, and deep learning architectures to examine the impact of behaviours such as smoking, diet, physical activity, and alcohol consumption on CVD risk. The Random Forest model demonstrated superior performance, achieving high accuracy, recall, and ROC-AUC scores. To demonstrate real-world utility, the model was deployed as an interactive Streamlit web application. This tool allows individuals to input lifestyle and health data to receive real-time CVD risk predictions, offering a novel, user-friendly prototype that bridges machine learning insights with personalised digital health engagement. This tool can facilitate personalised health monitoring and supports early detection by providing actionable insights. The findings underscore the efficacy of predictive modelling in informing targeted interventions and public health strategies. By bridging advanced analytics with practical applications, this research offers a scalable framework for reducing CVD burden, paving the way for precision medicine and improved population health outcomes through data-driven decision-making.

Keywords:

machine learning; diagnosis; cardiovascular diseases; machine learning in healthcare

1. Introduction

Cardiovascular diseases (CVDs) encompass a wide range of conditions that affect the heart and blood vessels, primarily due to the progressive build-up of plaque in arteries, a process known as atherosclerosis. These diseases often develop silently over many years, with symptoms becoming apparent only in advanced stages. As a result, individuals may remain unaware of their condition until they experience severe complications such as heart attacks, strokes, or heart failure [1]. CVDs are the leading cause of mortality worldwide, responsible for an estimated 17.9 million deaths annually. This broad category includes coronary heart disease, cerebrovascular disease, and rheumatic heart disease, among others. Alarmingly, over 80% of CVD-related deaths are attributed to heart attacks and strokes, with nearly one-third of these fatalities occurring prematurely in individuals under the age of 70 [2]. The burden of CVDs is not distributed evenly across populations. While high-income countries have seen declining mortality rates due to advances in healthcare and prevention, low- and middle-income nations continue to experience a rapid increase in cases. These disparities in CVD mortality rates highlight the growing health inequities between nations. In 2008, more than 17 million people died from cardiovascular conditions, with over 3 million deaths occurring in individuals under the age of 60, many of which could have been prevented through better healthcare access and lifestyle interventions. The proportion of premature CVD deaths ranges from as low as 4% in high-income countries to a staggering 42% in low-income regions, underscoring the urgent need for targeted public health initiatives [3].

While genetic predisposition plays a role in cardiovascular health, a significant proportion of CVD cases are driven by modifiable lifestyle factors. Behaviours such as tobacco use, poor diet, physical inactivity, excessive alcohol intake, and inadequate consumption of fruits and vegetables are well-established contributors to cardiovascular risk [4]. The World Health Organisation (WHO) estimates that at least three-quarters of premature CVD deaths could be prevented through behavioural changes, yet these risk factors remain persistently underestimated or overlooked in public discourse and policy enforcement [5]. These lifestyle-related factors are now becoming more common in all socioeconomic groups, which makes the burden of disease worse in both developed and developing nations [6].

Recent advancements in health data and computational tools have transformed cardiovascular risk assessment, prediction, and mitigation. Machine learning (ML) and data-driven methods provide a promising alternative to traditional public health models, which often use broad interventions. These techniques allow the analysis of large datasets, reveal hidden patterns, and predict CVD risk more accurately [7]. Predictive modelling helps tailor public health strategies to specific demographic or behavioural groups, improving intervention efficiency and impact [8].

This research presents a data-driven study of the most significant lifestyle risk factors contributing to cardiovascular disease (CVD) and explores how predictive modelling can enhance public health outcomes. Specifically, it aims to generate actionable insights that could support targeted public health interventions and empower individuals to make informed decisions about their lifestyles. It examines how everyday habits such as smoking and poor eating contribute to the growing problem of CVD and, using existing data, identifies key lifestyle-related risk factors. Machine learning techniques are applied to model the relationship between behaviour and CVD risk, uncovering predictive patterns. The study also evaluates the effectiveness of data-driven public health interventions aimed at reducing CVD risk associated with unhealthy behaviours.

This study’s main research questions are as follows:

RQ 1:

How do poor lifestyle choices, such as smoking and unhealthy dietary habits, increase the risk of cardiovascular disease?

RQ 2:

What data-driven insights can inform public health interventions to reduce the risk of cardiovascular disease related to unhealthy lifestyle choices and promote healthier living?

As a preview of our findings, the optimised Random Forest model achieved an accuracy of 99.92%, precision of 99.96%, recall of 99.99%, specificity of 99.96%, F1-score of 99.92%, and ROC-AUC of 1.00, indicating its strong potential for effective CVD risk prediction.

By deploying the model as an interactive web-based tool, this study contributes to the development of smart e-health systems that support proactive and personalised cardiovascular risk monitoring.

2. Related Work

Given the growing incidence of CVDs and their close correlation with modifiable risk factors, data-driven prevention strategies are essential. Smoking continues to be a major factor, causing about one out of four deaths from CVD [9]. Beyond the heart, it damages arteries and speeds up atherosclerosis, which lowers oxygen delivery and raises the risk of heart attack and stroke [10]. Coronary heart disease (CHD) risk is considerably increased even by second-hand smoke exposure [11]. Waterpipe smoking carries comparable hazards: a single session may last between 30 and 90 min of uninterrupted inhalation, during which substantial quantities of smoke, infused with toxic chemicals that exceed those found in a single cigarette, are filtered through water. Particularly in younger populations, this practice has a negative impact on respiratory, oral, and cardiovascular health [12]. Both the duration and intensity of smoking are associated with an increased risk of CVD, with heavy smokers being at higher risk [13]. Despite this wealth of evidence, many people still smoke, countering public health progress and highlighting the need for focused action. Cessation programmes that combine medication, counselling, and nicotine replacement have been found to lower the risk of CVD [14]. The benefits appear rapidly, with the risk of heart attack and stroke decreasing significantly within five years of stopping [15]. Diet also plays a pivotal role in CVD development. Saturated fats, trans fats, sodium, and added sugars promote obesity, hypertension, and dyslipidemia, which are key risk factors for CVD [16,17].

In contrast, there are notable cardiovascular advantages associated with healthier diets, especially the Mediterranean diet. Limiting red meat and dairy and regularly increasing the intake of fruits, vegetables, nuts, and olive oil reduces the risk of CVD [18]. According to seminal research such as PREDIMED (Prevención con Dieta Mediterránea, that is, Prevention with Mediterranean Diet), adding nuts or extra-virgin olive oil to this diet reduced major cardiovascular events in high-risk individuals, indicating its preventive function [19]. Regular physical activity significantly reduces cardiovascular disease risk by improving blood pressure, weight management, and cholesterol levels. It also helps prevent chronic conditions including hypertension, diabetes, and certain cancers such as breast and colon cancer [20].

Despite these benefits, sedentary lifestyles remain a global health crisis. In a study by [21], more than 25% of individuals globally were not active enough in 2016, with desk jobs and reliance on cars having a particularly negative impact on wealthy countries. A rising problem for health systems throughout the world, physical inactivity significantly reduces lifespan and raises risks for heart disease, type 2 diabetes, and several types of cancer [22]. A well-established risk factor for CVD, excessive alcohol use is associated with arrhythmias, stroke, cardiomyopathy, and hypertension [23]. The association between alcohol and cardiovascular disease is still complicated; excessive drinking significantly raises cardiovascular risks, even while moderate use may provide some protection [24].

In contrast to abstinence or excessive consumption, the J-shaped curve model indicates that moderate drinking may reduce the risk of CVD [25]. According to a meta-analysis [26], consuming excessive amounts increases the risk of ischaemic heart disease even in otherwise moderate drinkers, indicating that drinking habits are just as important as quantity. In addition to supporting efforts to reduce alcohol use in high-risk populations, public health campaigns should emphasise the dangers of both excessive and moderate alcohol use. By identifying at-risk individuals and facilitating individualised measures, predictive modelling and data-driven techniques have improved the prevention of cardiovascular disease and contributed to a reduction in the worldwide burden of CVD [27].

Machine learning has significantly advanced early CVD detection. Researchers at Leeds Teaching Hospitals NHS Trust developed an algorithm analysing demographic and clinical data to identify high-risk individuals with undiagnosed atrial fibrillation (AF)—a major stroke risk factor. Using anonymised GP records from millions of patients, the system enables earlier AF diagnosis and timely anticoagulant treatment, potentially preventing thousands of strokes annually [28]. Advancements in genetic screening and predictive modelling have enabled personalised public health interventions. In New Zealand, Rod Jackson’s PREDICT study, involving over 400,000 participants, integrates patient data with national records to deliver individualised CVD risk assessments [29]. Recent studies also support machine learning in cardiovascular care, including AIRE, a tool designed for early and explainable risk prediction [30].

3. Methodology and Analysis

The dataset used in this study, referred to as CVD_Cleaned, was sourced from Kaggle, an online platform for hosting open-source datasets, and contains 308,854 records with 19 variables. It represents health survey responses from a diverse adult population and includes self-reported information on cardiovascular-related behaviours, clinical conditions (e.g., diabetes, arthritis), and general health perceptions. Essential Python libraries such as NumPy, Pandas, and Seaborn were employed to load and explore the dataset. Preliminary data inspection and basic exploratory analysis were performed to assess the dataset’s structure, variable distribution, and completeness before modelling.

The dataset includes various features relevant to cardiovascular health, general health status, exercise habits, dietary patterns, existing medical conditions, and lifestyle factors like smoking history and alcohol consumption. The target variable, ‘Heart_Disease’, indicates the presence or absence of cardiovascular disease. Initial inspection revealed that the dataset was well-organised and required minimal cleaning. Nonetheless, essential pre-processing steps were performed to address any inconsistencies and to ensure quality and model compatibility. There were no missing values; had any been present, approaches such as deletion or imputation using the mode or median could have been considered to maintain data accuracy and consistency. A check for duplicate rows found 80 duplicate entries, which were removed, because duplicate rows can lead to biased analysis and misleading insights.
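The missing-value and duplicate checks described above can be sketched with pandas. This is a minimal illustration on a tiny hypothetical frame, not the study's actual code: only the column name `Heart_Disease` comes from the text, while `BMI`, `Smoking_History` and all values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the survey data; values are illustrative only.
df = pd.DataFrame(
    {
        "Heart_Disease": ["No", "Yes", "No", "No"],
        "BMI": [24.1, 31.7, 24.1, 27.3],
        "Smoking_History": ["No", "Yes", "No", "No"],
    }
)

# Count missing values per column (the study reports none in its data).
missing_per_column = df.isnull().sum()

# Count fully duplicated rows, then drop them, keeping the first occurrence.
n_duplicates = int(df.duplicated().sum())
df_clean = df.drop_duplicates().reset_index(drop=True)
```

Here the third row repeats the first, so one duplicate is counted and removed; on the real dataset the same two calls would be expected to report the 80 duplicates mentioned in the text.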

3.1. Exploratory Data Analysis (EDA)

By visualising key variables and examining their distributions, we gain a deeper understanding of how lifestyle factors relate to cardiovascular risk. This analytical approach not only informs model development but also lays the foundation for targeted public health interventions aimed at mitigating CVD risks.

3.1.1. Visualisation of the Target Variable

The distribution of the target variable, heart disease, was examined to understand the prevalence of cardiovascular disease within the dataset. As shown in Figure 1, most records fall into the “No” category, indicating that most individuals in the dataset do not have heart disease. The dataset also reveals a relatively balanced distribution of males and females, with a slight majority of females.

This near-equal representation is important for supporting gender-specific analyses in cardiovascular health studies. A balanced gender distribution minimises bias and allows for robust exploration of potential differences in lifestyle factors, comorbidities, and cardiovascular disease risks between males and females. A significant portion of the individuals have a history of smoking, although the majority report no smoking history. Smoking is a major modifiable risk factor for CVD, and its prevalence in the dataset highlights the need for targeted public health efforts to reduce smoking rates. Although they are a minority, over 50,000 individuals report experiencing depression, highlighting its relevance to overall well-being and its potential influence on CVD risk. Depression is linked to behaviours like inactivity, poor diet, and smoking, all of which are key CVD risk factors. Arthritis is a common condition that can significantly impact physical activity levels and overall quality of life, both of which are important factors in CVD risk. The presence of arthritis in a notable portion of the population suggests that it may influence lifestyle behaviours, such as reduced exercise or increased sedentary behaviour, which could contribute to CVD.

There is a varied distribution of self-reported general health statuses, with the majority of individuals rating their health as “Very Good” or “Good.” A smaller proportion report “Excellent” health, while fewer still describe their health as “Fair” or “Poor” as shown in Figure 2. This distribution highlights the importance of subjective health perceptions in understanding overall well-being and their potential link to CVD risk. Individuals with “Fair” or “Poor” health ratings may represent a high-risk group for CVD, as these ratings often correlate with underlying health issues or lifestyle factors.

The dataset indicates that the majority of individuals do not have diabetes, while a smaller proportion report having it (Figure 2). Some individuals fall into the categories of pre-diabetes or borderline diabetes, and a very small group report having diabetes only during pregnancy. This distribution highlights the importance of addressing diabetes and pre-diabetes as significant risk factors for CVD. The presence of pre-diabetes or borderline diabetes in a notable portion of the population suggests an opportunity for early intervention to prevent the progression to full diabetes and reduce associated CVD risks.

3.1.2. Visualisation of the Numerical Variables

The age distribution reveals a varied representation across age groups, with certain age brackets appearing more densely populated than others. The distribution peaks in the middle-aged categories (e.g., 40–44, 45–49, 50–54), suggesting that a significant portion of the dataset comprises individuals in these age ranges. Younger age groups (e.g., 18–24, 25–29) and older age groups (e.g., 75–79, 80+) are less densely represented, which may reflect the demographic composition of the population or sampling biases (Figure 3). Age is a critical factor in cardiovascular health, so understanding the age distribution in the dataset is essential for tailoring public health interventions to specific age groups.

BMI, defined as a person’s weight in kilograms divided by the square of their height in metres (kg/m²), is a widely used metric to categorise individuals into underweight, normal weight, overweight, or obese categories (Figure 3). It serves as a key indicator for assessing cardiovascular health, as it reflects the balance between weight and height. The BMI distribution in the dataset displays a right-skewed pattern, with most individuals concentrated in the lower to mid-range (approximately 20 to 35). A smaller proportion of the population falls into the higher BMI categories, as indicated by the extended tail of the distribution.

The dataset shows a peak at lower alcohol consumption, with most people reporting minimal to moderate intake. Density drops as consumption rises, indicating that heavy drinkers are less common. This matches typical alcohol consumption trends, where moderation is common. Excessive alcohol intake increases CVD and other health risks. Fried potato consumption follows a similar pattern: as consumption grows, density falls sharply, suggesting that frequent or high fried potato intake is rare. Identifying higher-intake individuals can guide dietary interventions to lower CVD risk and encourage healthier eating.

3.1.3. Visualisation of the Categorical Variables Against the Target Variable

The relationship between depression and heart disease reveals a meaningful connection when analysed through percentages as shown in Figure 4. Among individuals without depression, 92.36% do not have heart disease, while 7.64% report having it. In contrast, among individuals with depression, 90.14% do not have heart disease, while 9.86% report having it. This indicates that individuals with depression are slightly more likely to have heart disease compared to those without depression. Depression can influence heart disease risk through multiple pathways:

Behavioural Factors: Depression is often associated with unhealthy lifestyle choices, such as physical inactivity, poor diet, and smoking, all of which are significant risk factors for heart disease.

Biological Mechanisms: Depression can lead to chronic inflammation, hormonal imbalances, and increased stress hormones like cortisol, which may contribute to the development of cardiovascular conditions [31]. For example, chronic inflammation is a known driver of atherosclerosis, while elevated cortisol levels can increase blood pressure and promote vascular dysfunction.

Medication Side Effects: Some medications used to treat depression, such as tricyclic antidepressants (TCAs) and selective serotonin reuptake inhibitors (SSRIs), may have side effects that impact cardiovascular health. TCAs are associated with arrhythmias, while SSRIs may increase the risk of bleeding in patients with cardiovascular disease.
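The group-wise percentages quoted in this subsection (for example, heart disease prevalence with and without depression) are the kind of figure a row-normalised cross-tabulation produces. The sketch below assumes a column named `Depression` alongside `Heart_Disease`; the toy counts are invented and do not reproduce the reported 7.64% and 9.86%.

```python
import pandas as pd

# Toy data; counts are illustrative and do not reproduce the paper's figures.
df = pd.DataFrame(
    {
        "Depression": ["No"] * 6 + ["Yes"] * 4,
        "Heart_Disease": ["No", "No", "No", "No", "No", "Yes",
                          "No", "No", "No", "Yes"],
    }
)

# Row-normalised cross-tabulation: each row (depression group) sums to 100%.
pct = pd.crosstab(df["Depression"], df["Heart_Disease"], normalize="index") * 100

# Heart disease prevalence (%) among those reporting depression.
prevalence_with_depression = pct.loc["Yes", "Yes"]
```

Normalising by row rather than over the whole table is what allows the two groups to be compared directly even though they differ greatly in size.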

The relationship between cancer (both skin and other types) and heart disease reveals consistent patterns when the actual figures are examined (Figure 5). Among individuals without cancer, approximately 7.3% report having heart disease. In contrast, among individuals with cancer, around 15.7% report having it. This indicates that individuals with cancer are more than twice as likely to have heart disease compared to those without cancer. While cancer itself may not directly cause heart disease, this association could reflect several underlying factors:

Shared Risk Factors: Behaviours such as smoking, poor diet, or sedentary lifestyles can increase the risk of both cancer and heart disease.
Treatment Side Effects: Certain cancer treatments, such as chemotherapy or radiation, are known to have cardiovascular side effects that may contribute to heart disease [32].
Physiological Stress: The presence of cancer can place significant stress on the body, potentially exacerbating existing cardiovascular conditions or triggering new ones.

These findings highlight the importance of integrated care for cancer patients, including regular cardiovascular health monitoring and interventions to mitigate heart disease risk. The relationship between diabetes status and heart disease reveals significant differences in heart disease prevalence across different diabetes categories (Table 1). This pattern highlights the strong association between diabetes and cardiovascular risk. Individuals with diabetes are more than three times as likely to have heart disease compared to those without diabetes. Even individuals with pre-diabetes or borderline diabetes show a higher prevalence of heart disease (11.51%) compared to those without diabetes (6.06%). This underscores the importance of early intervention in pre-diabetic individuals to prevent progression to diabetes and reduce cardiovascular risk.
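One way to make the "times as likely" comparisons above concrete is a prevalence ratio, that is, the prevalence in one group divided by the prevalence in the reference group. The term and the symbol PR are introduced here only for illustration; the percentages are those quoted in the text. The diabetes group is omitted because its prevalence appears only in Table 1.

```latex
\mathrm{PR} = \frac{p_{\text{group}}}{p_{\text{reference}}}, \qquad
\mathrm{PR}_{\text{depression}} = \frac{9.86}{7.64} \approx 1.29, \qquad
\mathrm{PR}_{\text{cancer}} \approx \frac{15.7}{7.3} \approx 2.15, \qquad
\mathrm{PR}_{\text{pre-diabetes}} = \frac{11.51}{6.06} \approx 1.90
```

These ratios are unadjusted: they describe the association in the raw data and do not account for age or other confounders.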

3.1.4. Correlation Analysis

Figure 6’s correlation heat map visualises variable relationships in the dataset, showing correlation strength and direction. It helps detect multicollinearity and key feature associations. Strong correlations may indicate redundant variables, potentially affecting model performance, while weak correlations suggest minimal linear relationships. This analysis supports feature selection and dataset optimisation for predictive modelling.

A correlation threshold of 0.80 was applied to detect and remove strongly correlated feature pairs within the dataset. Identifying highly correlated variables is crucial, as excessive correlation can introduce multicollinearity, which may negatively impact model performance by distorting feature importance.
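A common way to apply such a threshold is sketched below. It assumes Pearson correlation and drops one feature from each pair whose absolute correlation exceeds 0.80; the text does not state which member of a pair was removed, so that choice, the helper name `drop_highly_correlated`, and the toy columns are all assumptions.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.80) -> pd.DataFrame:
    """Drop one feature from every pair whose |Pearson r| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy numeric frame: weight_kg is almost a linear function of bmi.
rng = np.random.default_rng(0)
bmi = rng.normal(27, 4, 200)
toy = pd.DataFrame(
    {
        "bmi": bmi,
        "weight_kg": bmi * 3.0 + rng.normal(0, 0.5, 200),
        "alcohol": rng.normal(5, 2, 200),
    }
)
reduced = drop_highly_correlated(toy)
```

In this toy frame `weight_kg` is removed because it is nearly collinear with `bmi`, while the independent `alcohol` column is kept.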

4. Implementation of Algorithms

The modelling phase employed a mix of classical machine learning models, ensemble learning techniques, and deep learning approaches to develop a robust predictive framework for CVD. Each category of models was selected based on its strengths in handling structured health data and capturing complex relationships within the dataset.

4.1. Classical Machine Learning Algorithms

Logistic Regression, a key statistical model for binary classification, is ideal for predicting CVD. It uses the logistic function to model outcome probability, effectively assessing risk from clinical and lifestyle factors. K-Nearest Neighbour (KNN), a non-parametric, instance-based algorithm, is widely used for classification. It assigns new data points to the majority class among their nearest neighbours in the feature space, adapting well to complex datasets. Decision Tree, a rule-based, non-parametric supervised learning algorithm, splits data based on feature values. It excels in CVD prediction by offering clear, interpretable rules highlighting key risk factors.
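To make two of these ideas concrete, the sketch below shows the logistic function that Logistic Regression uses to turn a linear score into a probability, and a bare-bones KNN majority vote. It is written in plain Python for illustration only; the feature values are hypothetical, and a real analysis would normally use a library implementation.

```python
import math
from collections import Counter

def sigmoid(z: float) -> float:
    """Logistic function mapping a linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training points closest to query (Euclidean)."""
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature points (e.g. scaled BMI and age), labelled 0/1.
X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7)]
y = [0, 0, 1, 1, 1]

p = sigmoid(0.0)                          # a score of zero maps to 0.5
label = knn_predict(X, y, (0.75, 0.8))    # the three nearest points are class 1
```

Because KNN relies on distances, features on very different scales would usually be standardised first; that step is omitted here for brevity.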

4.2. Ensemble Algorithms

Random Forest is an ensemble learning algorithm that constructs multiple decision trees during training and merges their outputs to enhance predictive accuracy and control overfitting. This method is versatile, handling both classification and regression tasks effectively. In the context of CVD prediction, Random Forest has demonstrated high performance. Extreme Gradient Boosting (XGBoost) is a powerful ensemble learning algorithm designed for efficiency and performance. It uses gradient boosting techniques to sequentially improve weak learners, making it highly effective for complex classification tasks like cardiovascular disease (CVD) prediction. Light Gradient Boosting Machine (LightGBM) is an advanced gradient boosting framework designed for speed and efficiency. It builds trees leaf-wise, focusing on areas with the highest loss, which enhances predictive accuracy and reduces overfitting. In cardiovascular disease (CVD) prediction, LightGBM’s capability to handle large datasets and its efficiency in training make it a valuable tool.
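The "many trees, merged outputs" idea behind Random Forest can be illustrated by hand with bootstrap samples and a majority vote. The sketch below uses synthetic data from scikit-learn and is a simplification, not the study's model: scikit-learn's own `RandomForestClassifier` averages class probabilities rather than counting hard votes, and the number of trees is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the health records.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)

# Train each tree on a bootstrap sample (rows drawn with replacement),
# considering a random subset of features at every split.
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Merge the outputs by majority vote across trees.
votes = np.stack([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
train_accuracy = float((ensemble_pred == y).mean())
```

Boosting methods such as XGBoost and LightGBM differ in that trees are added sequentially, each one fitted to the errors of the ensemble so far, rather than trained independently as here.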

4.3. Deep Learning Algorithms

Artificial Neural Networks (ANNs) are computational models inspired by the human brain’s interconnected neuron structure that process input data through layers of nodes, using weights and biases to learn complex patterns. The typical ANN implementation for CVD prediction involves architecture with input, hidden and output layers where each neuron applies activation functions to introduce non-linearity for modelling complex relationships. Regularisation techniques like dropout are employed during training to prevent overfitting by randomly omitting neurons, while optimisation algorithms such as Adam adjust weights and biases to minimise loss functions like binary cross-entropy in classification tasks. Through iterative training across multiple epochs, ANNs develop strong predictive capabilities, making them powerful tools for early CVD detection by capturing intricate data patterns to support clinical intervention strategies.
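The ingredients named above (hidden layer with a non-linear activation, dropout, and binary cross-entropy) can be seen in a single forward pass. The NumPy sketch below is illustrative only: the layer sizes, the dropout rate and the random data are assumptions, and the weight-update step that an optimiser such as Adam would perform is deliberately left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(X, W1, b1, W2, b2, drop_rate=0.0):
    """One hidden ReLU layer, optional inverted dropout, sigmoid output."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    if drop_rate > 0.0:
        # Randomly zero hidden units and rescale the survivors.
        keep = rng.random(hidden.shape) >= drop_rate
        hidden = hidden * keep / (1.0 - drop_rate)
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood of the true binary labels."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Random toy batch: 16 samples, 5 features, 8 hidden units.
X = rng.normal(size=(16, 5))
y = rng.integers(0, 2, size=(16, 1))
W1, b1 = rng.normal(scale=0.1, size=(5, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.1, size=(8, 1)), np.zeros(1)

probs = forward(X, W1, b1, W2, b2, drop_rate=0.2)
loss = binary_cross_entropy(y, probs)
```

Dropout is applied only during training; at prediction time the same function would be called with `drop_rate=0.0`.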

Convolutional Neural Networks (CNNs) are deep learning models inspired by the human visual system, designed to process data with grid-like topology, such as images. They utilise layers of convolutional filters to automatically learn spatial hierarchies of features from input data, making them particularly effective in analysing complex medical datasets. Their capacity to detect intricate patterns is valuable in cardiovascular disease (CVD) prediction, where CNNs can identify subtle indicators of heart conditions within clinical data.

Recurrent Neural Networks (RNNs) are designed to process sequential data by maintaining information across time steps. Unlike traditional neural networks, RNNs employ recurrent connections that allow them to capture temporal dependencies in data, making them particularly effective for time-series analysis and medical sequence modelling [33]. In the context of cardiovascular disease (CVD) prediction, RNNs can analyse historical patient data to identify patterns indicative of disease progression, enhancing predictive accuracy.

4.4. Models Testing

4.4.1. Confusion Matrices

A confusion matrix is a tabular representation that illustrates the performance of a classification model by comparing actual versus predicted classifications. It consists of four components:

○ True Positives (TP): Instances correctly predicted as positive.
○ True Negatives (TN): Instances correctly predicted as negative.
○ False Positives (FP): Instances incorrectly predicted as positive.
○ False Negatives (FN): Instances incorrectly predicted as negative.

The key metrics for the models are derived from their confusion matrices.
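As a small illustration of the four components, the sketch below counts them for a pair of label lists. The function name `confusion_counts` and the toy labels are hypothetical; 1 is taken to mean "has heart disease".

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

For these toy labels the counts are TP = 3, TN = 3, FP = 1 and FN = 1.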

4.4.2. Accuracy

This metric measures the proportion of correctly classified instances among the total predictions. Accuracy is widely used in model evaluation but may not be reliable especially for imbalanced datasets, hence the need for other testing metrics.

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)

4.4.3. Precision

This metric assesses how many of the predicted positive cases were actually positive. It is particularly useful in scenarios where false positives have significant consequences. High precision indicates a low false positive rate, which is critical for medical predictions as it ensures that people receive appropriate care without undue stress or intervention.

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)

4.4.4. Recall (Sensitivity/True Positive Rate)

This metric measures how well the model identifies actual positive cases. In medical diagnostics, achieving a high recall rate is essential to ensure that most true cases are correctly detected, thereby reducing the risk of false negatives. False negatives can lead to missed diagnoses, which may result in delayed treatment and poorer patient outcomes.

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)

4.4.5. Specificity (True Negative Rate)

Specificity is the counterpart of sensitivity/recall for the negative class. It measures a model’s ability to correctly identify negative cases, quantifying how well a classification model avoids false positives by correctly classifying actual negative instances. Specificity is particularly important in medical diagnostics to minimise unnecessary treatments for healthy individuals.

\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (4)

4.4.6. F1-Score

F1-Score is the harmonic mean of precision and recall, useful when dealing with imbalanced datasets. This metric balances both false positives and false negatives, making it robust for clinical applications.

\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5)
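Equations (1) to (5) translate directly into code. The sketch below evaluates them for one hypothetical set of confusion-matrix counts; the function name and the counts are illustrative, and no guard against division by zero is included.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Equations (1)-(5) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts: 100 actual positives and 100 actual negatives.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

With these counts accuracy is 0.85, recall is 0.80 and specificity is 0.90, which shows how a single accuracy figure can hide the difference between the two error types.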

4.4.7. ROC-AUC (Receiver Operating Characteristic-Area Under Curve)

This metric evaluates the model’s ability to distinguish between positive and negative cases by plotting the true positive rate against the false positive rate. AUC is computed as the integral of the ROC curve and provides insight into the trade-off between sensitivity and specificity.
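The area under the ROC curve has an equivalent interpretation that is easy to compute: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The plain-Python sketch below uses that pairwise formulation rather than numerical integration; the scores are toy values.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
```

Three of the four positive-negative pairs are ordered correctly here, giving an AUC of 0.75. The double loop is quadratic in the number of samples, so library routines based on sorting are preferable for large datasets.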

5. Discussion of Results

The results presented in Table 2 indicate that Random Forest achieved the best overall performance, combining high accuracy, recall, specificity, and F1-score. Decision Tree and KNN also performed well, while XGBoost and LightGBM offered strong, consistent results across key metrics. Deep learning models (ANN, CNN, RNN) delivered competitive recall and F1-scores, with RNN performing best in this group, though all showed relatively lower specificity and higher false positives, as shown in Table 2, Table 3, Table 4 and Table 5. Overall, most models prioritised recall over specificity, favouring the identification of at-risk individuals, a desirable trade-off in preventive health.

5.1. Cross Validation

Cross-validation is a machine learning technique used to assess how well a model generalises to unseen data. Using k = 5, our dataset was split into five parts, and the models were trained and tested across different combinations of these folds. This helps detect overfitting and provides a more reliable estimate of performance. After cross-validation, all models showed consistent behaviour with only minor variations: there were slight drops in precision, specificity, and ROC-AUC, especially for KNN and Decision Tree, indicating mild overfitting. Random Forest remained the best-performing model overall, while deep learning models showed stable and reliable results across the folds (Table 6).
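A five-fold evaluation of this kind might look as follows in scikit-learn. The sketch runs on synthetic, imbalanced data; the use of stratified folds, the recall scoring, and the forest size are assumptions made for the example, since the text specifies only k = 5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with roughly 10% positives, mimicking class imbalance.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Five folds; stratification keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One recall score per fold; the spread indicates how stable the model is.
scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
mean_recall, std_recall = scores.mean(), scores.std()
```

Reporting the standard deviation alongside the mean makes it easier to judge whether differences between models exceed fold-to-fold noise.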

5.2. Optimisation of Models

Hyperparameter tuning involves selecting the optimal configuration of parameters that govern a machine learning algorithm’s behaviour. It is a critical step because it directly affects a model’s predictive performance and ability to generalise to new data. For this research, tuning was applied to enhance model accuracy, recall, and overall robustness. This research used GridSearchCV to exhaustively search through a defined hyperparameter space for the traditional and ensemble models, evaluating combinations based on cross-validated performance (primarily accuracy). For deep learning models (ANN, CNN, RNN), Keras Tuner was employed to find the best layer configurations, dropout rates, and learning rates. The final tuned models were then evaluated on unseen test data.
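The exhaustive search that GridSearchCV performs can be illustrated with a short pure-Python sketch: every combination in the grid is scored and the best one is kept. The grid values and the toy scoring function below are assumptions made for illustration only; in practice the score would be the cross-validated accuracy of a model trained with those parameters, and the grids actually used in the study are not reported here.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for a tree ensemble (values are illustrative).
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [5, 10, 20],
}


def toy_score(params):
    """Stand-in for cross-validated accuracy; peaks at 200 trees and depth 10."""
    return (
        1.0
        - abs(params["n_estimators"] - 200) / 1000
        - abs(params["max_depth"] - 10) / 100
    )


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (here 3 × 3 = 9 evaluations, each of which would involve a full cross-validation run), which is why limited computing resources constrain how fine a grid can be searched.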

5.2.1. Summary of Improvements

Logistic Regression: Significant improvement in recall, with a trade-off in precision and specificity. Overall F1-score and ROC-AUC slightly increased.

KNN: Marked improvement in precision and F1-score, indicating better balance, though recall slightly dipped.

Decision Tree: No visible change post-tuning, suggesting it had already reached its peak performance.

Random Forest: Achieved nearly perfect performance on all metrics post-tuning.

XGBoost: Showed notable improvement across all metrics, especially ROC-AUC (from 0.85 to 0.98), indicating stronger classification ability.

LightGBM: Moderately improved across all metrics, particularly in recall and F1-score.

5.2.2. Deep Learning Models

ANN, CNN, and RNN had minor performance gains in recall and F1-score and slight drops in precision and specificity, likely due to sensitivity to data variations. Their performance was generally stable but did not surpass ensemble models.

Random Forest stood out as the most effective, making it the most robust and reliable classifier for CVD risk prediction.

5.2.3. Feature Importance

Feature importance, as visualised in Figure 8, helps us understand which variables most influence a model’s predictions. Using the Random Forest, we analysed the impact of each feature on classification accuracy.

The top three most important features were:

Age Category
Weight (kg)
Height (cm)

Variables like green vegetable intake, fruit consumption, and general health significantly contributed to the model. Although some features had lower importance, all added value, so none were excluded from the final analysis as shown in Figure 8. This improves model transparency and clarifies which factors are most predictive in this context.

5.2.4. Deployment of the Random Forest Model

Deployment represents the culminating phase of the machine learning lifecycle, wherein a trained model is operationalised to deliver predictions on new, unseen data. It bridges the transition from experimental analysis to practical application, enabling the model to contribute meaningfully to real-world decision-making processes. For this study, the Random Forest model, which demonstrated superior predictive performance following extensive optimisation, was selected for deployment.

The deployment process involved developing an interactive web application using Streamlit, a Python-based framework designed for the rapid creation of data-driven interfaces. The optimised Random Forest model was serialised using the ‘joblib’ library, ensuring both rapid loading times and consistency in prediction serving. To maintain pre-processing fidelity, all label encoders used during model training were also preserved and reloaded within the application pipeline. The deployed application was structured to maximise user engagement and accessibility. Inputs were organised into logical categories (Personal and Lifestyle Information; Diet and Health Perception), with real-time feedback mechanisms such as a risk meter and stylised output cards enhancing the user experience. The application was deployed locally, serving as a functional prototype to demonstrate the feasibility of integrating predictive analytics into user-facing health tools. By enabling individuals to input personal health information and instantly obtain a tailored cardiovascular disease (CVD) risk prediction, the application fosters early detection efforts and empowers proactive health management strategies.
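The reason for preserving the label encoders is that the application must map each category to the same integer code the model saw during training. The sketch below illustrates that idea only: it uses plain dictionaries and the standard-library `json` module as a stand-in for the joblib-serialised encoders described above, and the column names and categories are hypothetical.

```python
import json


def fit_encoder(values):
    """Map each distinct category to an integer code, in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode_row(row, encoders):
    """Encode one user input with the mappings fitted at training time."""
    encoded = {}
    for column, value in row.items():
        mapping = encoders.get(column)
        if mapping is None:
            encoded[column] = value          # numeric column, passed through
        elif value in mapping:
            encoded[column] = mapping[value]
        else:
            raise ValueError(f"Unseen category {value!r} for column {column!r}")
    return encoded


# Training time: fit the encoders and persist them next to the model.
training_columns = {                         # hypothetical columns and values
    "Smoking_History": ["Yes", "No", "No"],
    "General_Health": ["Poor", "Good", "Excellent", "Good"],
}
encoders = {name: fit_encoder(values) for name, values in training_columns.items()}
saved_encoders = json.dumps(encoders)

# Serving time: reload the encoders and apply them to a new user input.
loaded_encoders = json.loads(saved_encoders)
user_input = {"Smoking_History": "Yes", "General_Health": "Good", "BMI": 27.4}
features = encode_row(user_input, loaded_encoders)
print(features)
```

Refitting an encoder inside the application instead of reloading it could assign different codes to the same category, which would silently corrupt the model's inputs; rejecting unseen categories makes such mismatches visible.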

5.2.5. Limitations

While the research achieved its objectives, several limitations were encountered. First, the dataset contained disproportionately few heart disease cases. Up-sampling was applied to correct this, but it could affect generalisability, since real-world distributions remain imbalanced. Because up-sampling duplicates existing samples, it may also have introduced synthetic bias, which risks overstating the model’s predictive power by simplifying the complexity of real-life heart disease profiles. Second, variables such as smoking, diet, and exercise were self-reported, which may lead to recall bias: participants may misremember behaviours or underreport unhealthy habits, compromising data accuracy. Third, limited computing resources restricted full hyperparameter tuning and training of complex models such as neural networks and ensemble learners, possibly preventing them from reaching their optimal performance.
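The duplication concern can be seen in a minimal sketch of random up-sampling, in which minority-class rows are redrawn with replacement until the classes are balanced. This is an illustrative pure-Python version with toy data; the exact resampling routine and random seed used in the study are not specified, so the function name and parameters are assumptions.

```python
import random
from collections import Counter


def upsample_minority(rows, labels, seed=0):
    """Duplicate rows of smaller classes (with replacement) until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for label, count in counts.items():
        pool = [row for row, row_label in zip(rows, labels) if row_label == label]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_rows, out_labels


# Toy data: nine records without heart disease (0) and one with (1).
rows = [("record", i) for i in range(10)]
labels = [0] * 9 + [1]
balanced_rows, balanced_labels = upsample_minority(rows, labels)

print(Counter(balanced_labels))
minority_rows = [r for r, l in zip(balanced_rows, balanced_labels) if l == 1]
print(len(minority_rows), "minority rows,", len(set(minority_rows)), "distinct")
```

The balanced set contains nine minority rows but only one distinct record, which shows that up-sampling equalises class counts without adding new information about heart disease cases.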

6. Conclusions and Future Work

This research demonstrated the powerful role of machine learning in identifying lifestyle-driven cardiovascular disease (CVD) risks. Through in-depth data exploration and model evaluation, Random Forest emerged as the most effective model, accurately capturing risk patterns linked to modifiable factors such as smoking, diet, and physical inactivity, and to chronic conditions such as diabetes and arthritis. These findings affirm the viability of data-driven tools for public health decision-making. Each research question and objective was successfully addressed, from analysing behavioural risks to developing a practical risk prediction tool. The deployment of the best-performing model as a Streamlit web app marked an essential step toward real-world utility by enabling individuals to receive instant, personalised CVD risk assessments.

To address the challenges identified and enhance the model’s real-world impact, future work will focus on these key improvements:

Integration of Objective Health Data: Incorporating clinically validated or wearable health data will reduce reliance on self-reported inputs, minimising recall bias and enhancing overall data quality.
Improved Handling of Class Imbalance: To mitigate synthetic bias, future iterations could explore advanced resampling methods and cost-sensitive algorithms that more accurately reflect real-world disease distribution.
Scalable Computing Resources: Using cloud-based platforms will enable deeper optimisation of computationally intensive models, improving performance without hardware limitations.
Expanded Dataset Diversity: Including broader population samples and adopting stratified sampling techniques will enhance generalisability across diverse demographic and clinical subgroups.
Real-World Model Validation: Performance will be evaluated using live user inputs to assess practical reliability and fine-tune predictive accuracy under real-world conditions.

Together, these enhancements will ensure the solution evolves into a robust, scalable, and ethically grounded tool, bridging the gap between machine learning innovation and proactive cardiovascular disease prevention.

Author Contributions

Conceptualization, S.A.K. and M.Z.I.; methodology, S.A.K.; software, S.A.K.; validation, S.A.K. and M.Z.I.; formal analysis, S.A.K.; investigation, S.A.K.; resources, S.A.K.; data curation, S.A.K.; writing—original draft preparation, S.A.K.; writing—review and editing, M.Z.I. and M.G.M.T.; visualisation, S.A.K.; supervision, M.Z.I.; project administration, S.A.K. and M.Z.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used Grok 3, developed by xAI, for the purposes of grammar improvement. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Baghdadi, N.A.; Abdelaliem, S.M.F.; Malki, A.; Gad, I.; Ewis, A.; Atlam, E. Advanced machine learning techniques for cardiovascular disease early detection and diagnosis. J. Big Data 2023, 10, 144.
WHO. 2021. Available online: https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1 (accessed on 20 February 2025).
World Heart Federation; World Health Organisation; World Stroke Organisation. Global Atlas on Cardiovascular Disease Prevention and Control. 2011. Available online: https://www.who.int/publications/i/item/9789241564373 (accessed on 22 February 2025).
Visseren, F.L.J.; Mach, F.; Smulders, Y.M.; Carballo, D.; Koskinas, K.C.; Bäck, M.; Benetos, A.; Biffi, A.; Boavida, J.M.; Capodanno, D.; et al. 2021 European Society of Cardiology (ESC) Guidelines on cardiovascular disease. Eur. Heart J. 2021, 2, 3227–3337.
Yusuf, S.; Joseph, P.; Rangarajan, S.; Islam, S.; Mente, A.; Hystad, P.; Brauer, M.; Kutty, V.R.; Gupta, R.; Wielgosz, A.; et al. Modifiable risk factors, cardiovascular disease, and mortality in 155 722 individuals from 21 high-income, middle-income, and low-income countries (PURE): A prospective cohort study. Lancet 2020, 395, 795–808.
Vaduganathan, M.; Mensah, G.A.; Turco, J.V.; Fuster, V.; Roth, G.A. The Global Burden of Cardiovascular Diseases and Risk: A Compass for Future Health. J. Am. Coll. Cardiol. 2022, 80, 2361–2371.
Zhang, J.; Jin, J.; Jiang, X.; Yang, L.; Fan, S.; Zhang, Q.; Chi, M. Artificial intelligence applied in cardiovascular disease: A bibliometric and visual analysis. Front. Cardiovasc. Med. 2024, 11, 1323918.
Liu, T.; Krentz, A.; Lu, L.; Curcin, V. Machine learning based prediction models for cardiovascular disease risk using electronic health records data: Systematic review and meta-analysis. Eur. Heart J.-Digit. Health 2025, 6, 7–22.
Banks, E.; Joshy, G.; Korda, R.J.; Stavreski, B.; Soga, K.; Egger, S.; Day, C.; Clarke, N.E.; Lewington, S.; Lopez, A.D. Tobacco smoking and risk of 36 cardiovascular disease subtypes: Fatal and non-fatal outcomes in a large prospective Australian study. BMC Med. 2019, 17, 128.
Ambrose, J.A.; Barua, R.S. The Pathophysiology of Cigarette. J. Am. Coll. Cardiol. 2004, 43, 1731–1737.
Öberg, M.; Jaakkola, M.S.; Woodward, A.; Peruga, A.; Prüss-Ustün, A. Worldwide burden of disease from exposure to second-hand smoke: A retrospective analysis of data from 192 countries. Lancet 2011, 377, 139–146.
Mahfooz, K.; Vasavada, A.M.; Joshi, A.; Pichuthirumalai, S.; Andani, R.; Rajotia, A.; Hans, A.; Mandalia, B.; Dayama, N.; Younas, Z.; et al. Waterpipe Use and Its Cardiovascular Effects: A Systematic Review and Meta-Analysis of Case-Control, Cross-Sectional, and Non-Randomized Studies. Natl. Libr. Med.–Cureus 2023, 15, e34802.
Banks, E.; Joshy, G.; Weber, M.F.; Liu, B.; Grenfell, R.; Egger, S.; Paige, E.; Lopez, A.D.; Sitas, F.; Beral, V. Tobacco smoking and all-cause mortality in a large Australian cohort study: Findings from a mature epidemic with current low smoking prevalence. BMC Med. 2015, 13, 38.
Rigotti, N.A.; McDermott, M.M. Smoking Cessation and Cardiovascular Disease: It’s Never Too Early or Too Late for Action. J. Am. Coll. Cardiol. 2019, 74, 508–511.
Rahman, M.; Alatiqi, M.; Al Jarallah, M.; Hussain, M.Y.; Monayem, A.; Panduranga, P.; Rajan, R. Cardiovascular Effects of Smoking and Smoking Cessation: A 2024 Update; World Heart Federation: Geneva, Switzerland, 2025; Volume 20.
Lichtenstein, A.H.; Appel, L.J.; Vadiveloo, M.; Hu, F.B.; Kris-Etherton, P.M.; Rebholz, C.M.; Sacks, F.M.; Thorndike, A.N.; Van Horn, L.; Wylie-Rosett, J.; et al. 2021 Dietary Guidance to Improve Cardiovascular Health: A Scientific Statement From the American Heart Association. Am. Heart Assoc. 2021, 144, e472–e487.
Kim, Y.; Je, Y.; Giovannucci, E.L. Association between dietary fat intake and mortality from all-causes, cardiovascular disease, and cancer: A systematic review and meta-analysis of prospective cohort studies. Clin. Nutr. 2021, 40, 1060–1070.
Martínez-González, M.Á.; Hernández, A.H. Effect of the Mediterranean diet in cardiovascular prevention. Rev. Espanola De Cardiol. 2024, 77, 574–582.
Temporelli, P.L. Cardiovascular prevention: Mediterranean or low-fat diet? Eur. Hear. J. Suppl. 2023, 25 (Supplement_B), B166–B170.
Booth, F.W.; Roberts, C.K.; Thyfault, J.P.; Ruegsegger, G.N.; Toedebusch, R.G. Role of Inactivity in Chronic Diseases: Evolutionary Insight and Pathophysiological Mechanisms. Physiol. Rev.-Am. Physiol. Soc. 2017, 97, 1351–1402.
Guthold, R.; Stevens, G.A.; Riley, L.M.; Bull, F.C. Worldwide trends in insufficient physical activity from 2001 to 2016: A pooled analysis of 358 population-based surveys with 1.9 million participants. Lancet Glob. Health 2018, 6, e1077–e1086.
Katzmarzyk, P.T.; Friedenreich, C.; Shiroma, E.J.; Lee, I.M. Physical inactivity and non-communicable disease burden in low-income, middle-income and high-income countries. Br. J. Sports Med. 2022, 56, 101–106.
Figueredo, V.M.; Patel, A. Detrimental Effects of Alcohol on the Heart: Hypertension and Cardiomyopathy. Rev. Cardiovasc. Med. 2023, 24, 292.
Georgescu, O.S.; Martin, L.; Târtea, G.C.; Rotaru-Zavaleanu, A.D.; Dinescu, S.N.; Vasile, R.C.; Gresita, A.; Gheorman, V.; Aldea, M.; Dinescu, V.C. Alcohol Consumption and Cardiovascular Disease: A Narrative Review of Evolving Perspectives and Long-Term Implications. Life 2024, 14, 1134.
Hoek, A.G.; van Oort, S.; Mukamal, K.J.; Beulens, J.W.J. Alcohol Consumption and Cardiovascular Disease Risk: Placing New Data in Context. Natl. Libr. Med. 2022, 24, 51–59.
Roerecke, M. Alcohol’s Impact on the Cardiovascular System. Nutrients 2021, 13, 3419.
Shah, N.D.; Steyerberg, E.W.; Kent, D.M. Big Data and Predictive Analytics: Recalibrating Expectations. JAMA Netw. 2018, 320, 27–28.
Nadarajah, R.; Wahab, A.; Reynolds, C.; Raveendra, K.; Askham, D.; Dawson, R.; Keene, J.; Shanghavi, S.; Lip, G.Y.H.; Hogg, D.; et al. Future Innovations in Novel Detection for Atrial Fibrillation (FIND-AF): Pilot study of an electronic health record machine learning algorithm-guided intervention to identify undiagnosed atrial fibrillation. Open Heart-Br. Med. J. 2023, 10, e002447.
Pylypchuk, R.; Wells, S.; Kerr, A.; Poppe, K.; Riddell, T.; Harwood, M.; Exeter, D.; Mehta, S.; Grey, C.; Wu, B.P.; et al. Cardiovascular disease risk prediction equations in 400 000 primary care patients in New Zealand: A derivation and validation study. Lancet 2018, 391, 1897–1907.
Sau, A.; Pastika, L.; Sieliwonczyk, E.; Patlatzoglou, K.; McGurk, K.A.; Zeidaabadi, B.; Zhang, H.; Macierzanka, K.; Mandic, D.; Sabino, E.; et al. Artificial intelligence-enabled electrocardiogram for mortality and cardiovascular risk estimation: A model development and validation study. Lancet Digit. Health 2024, 6, e791–e802.
Li, X.; Zhou, J.; Wang, M.; Yang, C.; Sun, G. Cardiovascular disease and depression: A narrative review. Front. Cardiovasc. Med. 2023, 10, 1274595.
Lyon, A.R.; López-Fernández, T.; Couch, L.S.; Asteggiano, R.; Aznar, M.C.; Bergler-Klein, J.; Boriani, G.; Cardinale, D.; Cordoba, R.; Cosyns, B.; et al. 2022 ESC Guidelines on cardio-oncology developed in collaboration with the European Hematology Association (EHA), the European Society for Therapeutic Radiology and Oncology (ESTRO) and the International Cardio-Oncology Society (IC-OS). Eur. Heart J. 2022, 43, 4229–4361.
Hansika, H.; Bergmeir, C.; Bandara, K. Recurrent Neural Networks for Time Series Forecasting: Current status and future directions. Int. J. Forecast. 2021, 37, 388–427.

Figure 1. Major modifiable lifestyle risk factors contributing to cardiovascular disease.

Figure 2. Visualisation of the data against the target variable (heart disease): sex, exercise, smoking history, depression, and arthritis.

Figure 3. Distribution of general health and diabetes.

Figure 4. Distribution of age, BMI, and alcohol consumption.

Figure 5. The distribution of depression, skin cancer, and other types of cancer against heart diseases.

Figure 6. Correlation Plot.

Figure 7. Performance metrics after hyperparameter tuning.

Figure 8. Feature importance (Random Forest).

Table 1. Diabetes Distribution.

Diabetes Status	Percentage with Heart Disease
No	6.06%
No, pre-diabetes or borderline diabetes	11.51%
Yes	20.85%
Yes, but female told only during pregnancy	3.63%

Table 2. Testing metrics.

Metric	Traditional			Ensemble			Deep
Metric	Logistic Regression	KNN	Decision Tree	Random Forest	XGBoost	Light GBM	ANN	CNN	RNN
Accuracy	0.7433	0.8976	0.9567	0.9859	0.7747	0.7708	0.7638	0.7718	0.7668
Precision	0.7235	0.8314	0.9205	0.9728	0.748	0.7451	0.7255	0.7469	0.74
Recall (Sensitivity)	0.7877	0.9974	0.9999	0.9999	0.8285	0.8232	0.8488	0.8222	0.8225
Specificity	0.6989	0.7977	0.9136	0.972	0.7208	0.7183	0.6788	0.7214	0.7111
F1-score	0.7542	0.9069	0.9585	0.9861	0.7862	0.7822	0.7823	0.7827	0.7791
ROC-AUC	0.8164	0.9556	0.9568	1	0.8503	0.8465	0.8385	0.8467	0.8388

Table 3. Confusion matrices of the traditional models.

Traditional Models	Logistic Regression	KNN	Decision Tree
	[39,671 17,090]	[45,280 11,481]	[51,856 4905]
	[12,052 44,709]	[149 56,612]	[6 56,755]

Table 4. Confusion matrices of the ensemble algorithms.

Ensemble Algorithms	Random Forest	XGBoost	Light GBM
	[55,172 1589]	[40,914 15,847]	[40,773 15,988]
	[6 56,755]	[9734 47,027]	[10,035 46,726]

Table 5. Confusion matrices of the deep learning models.

Deep Learning Models	ANN	CNN	RNN
	[38,529 18,232]	[40,946 15,815]	[40,361 16,400]
	[8583 48,178]	[10,092 46,669]	[10,075 46,686]

Table 6. Testing metrics after cross validation.

Metric (Cross Validated)	Traditional			Ensemble			Deep
Metric (Cross Validated)	Logistic Regression	KNN	Decision Tree	Random Forest	XGBoost	Light GBM	ANN	CNN	RNN
Accuracy	0.7447	0.8808	0.9499	0.9796	0.7747	0.7708	0.7659	0.7712	0.7653
Precision	0.7248	0.8127	0.9092	0.9612	0.7479	0.7448	0.7363	0.7381	0.735
Recall (Sensitivity)	0.7891	0.9898	0.9996	0.9995	0.8286	0.8238	0.8284	0.8406	0.8298
Specificity	0.7003	0.7719	0.9002	0.9597	0.7207	0.7177	0.7034	0.7017	0.7008
F1-score	0.7555	0.8925	0.9523	0.98	0.7862	0.7823	0.7796	0.786	0.7795
ROC-AUC	0.8174	0.9443	0.9499	0.9999	0.8512	0.8477	0.8384	0.8458	0.8372

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
