Article

Multi-Objective Optimized Differential Privacy with Interpretable Machine Learning for Brain Stroke and Heart Disease Diagnosis

by Mohammed Ibrahim Hussain 1, Arslan Munir 2,*, Safiul Haque Chowdhury 1,3, Mohammad Mamun 1,3 and Muhammad Minoar Hossain 2

1 Department of Computer Science and Engineering, Bangladesh University, Dhaka 1000, Bangladesh
2 Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
3 Department of Computer Science and Engineering, Jahangirnagar University, Dhaka 1342, Bangladesh
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(4), 260; https://doi.org/10.3390/a19040260
Submission received: 25 January 2026 / Revised: 17 March 2026 / Accepted: 19 March 2026 / Published: 27 March 2026
(This article belongs to the Special Issue 2026 and 2027 Selected Papers from Algorithms Editorial Board Members)

Abstract

Brain stroke (BS) and heart disease (HD) are leading causes of global mortality and long-term disability, underscoring the critical need for early and accurate diagnostic tools. This research addresses the dual challenge of developing high-performance predictive models while ensuring the privacy of sensitive patient data. We propose a framework that integrates ensemble machine learning (ML) models with a formal differential privacy (DP) mechanism. Using a dataset of 5110 samples with clinical features, we evaluate Extreme Gradient Boosting (XGB), Random Forest (RF), Light Gradient Boosting Machine (LGBM), and Categorical Boosting (CAT) for BS and HD prediction. To protect individual privacy, we apply the Gaussian mechanism of DP with two probability of failure (POF) parameters (10−5 and 10−6) and a privacy budget ranging from 0.5 to 5.0. A key novelty of this work is the application of Pareto frontier multi-objective optimization (PFMOO) to systematically identify the optimal trade-off between model accuracy and privacy constraints. Our approach successfully identifies optimal, privacy-preserving models: XGB achieves top performance for BS prediction (92.3% accuracy, 92.29% F1 score), with a POF of 10−6, while RF excels for HD detection (95.61% accuracy, 97.8% precision), with a POF of 10−5. Furthermore, we employ explainable AI (XAI) techniques, SHAP and LIME, to provide interpretability of the model decisions, enhancing clinical trust. This research delivers a robust, interpretable, and privacy-conscious framework for early disease detection, offering a significant advancement over existing methods by holistically balancing accuracy, data security, and transparency.

1. Introduction

Brain stroke (BS) and heart disease (HD) are two of the most widespread and deadly cardiovascular diseases globally, posing significant challenges to both individuals and healthcare systems. Stroke is the second leading cause of death globally, responsible for approximately 5.5 million deaths annually. It also leads to high morbidity, with up to 50% of survivors experiencing chronic disabilities [1]. Mortality rates following stroke are alarmingly high, with about 10% of patients dying within the first 30 days after an ischemic stroke and the one-year mortality rate increasing to 40%. Furthermore, the World Stroke Organization predicts that the global mortality rate from strokes will rise by 50% by 2050, resulting in 9.7 million deaths annually, up from 6.6 million in 2020 [2]. Similarly, HD remains the leading cause of death for men, women, and various racial and ethnic groups. Every 33 s, someone dies from cardiovascular disease. As the population ages, projections show that by 2050, 15% of the global population will develop cardiovascular disease. The economic toll of these diseases is equally concerning, with healthcare data violations becoming the costliest, with violation costs reaching an average of $10.93 million, a significant increase from previous years [3].
In recent years, diverse ML methods for brain stroke prediction (BSP) and heart disease prediction (HDP) have emerged in computer science and have shown strong accuracy but faced some limitations. Ghannam and Alwidian [4] employed a decision tree (DT) model for BSP and achieved 94.2% accuracy but encountered class imbalance and a lack of cross-validation (CV). Rodríguez [5] implemented an RF model for BSP, which achieved 91.52% accuracy but was limited by its sensitivity to outliers and lack of CV. Ikpea and Han [6] applied RF for HDP and obtained 72% accuracy, though the model showed a low recall rate. Abdulsalam et al. [7] adopted a Bagging-Quantum Support Vector Classifier (BQSVC) for HDP, which reached 90.16% accuracy, but the model also faced class imbalance issues. Lastly, Ju et al. [8] utilized a Centralized Training (CT) model with Federated Learning for BSP, which achieved 81.4% accuracy and addressed privacy concerns yet was still affected by class imbalance.
Further research has explored different facets of predictive modeling for related health outcomes. For instance, a study by Hussain et al. [9] proposed a robust hybrid technique for optimal feature selection in high-dimensional data, combining the Signal-to-Noise Ratio (SNR) score with Mood’s median test. Applied to various genomic datasets, their approach effectively identified the most relevant genes, leading to reduced classification error rates when used with classifiers like RF and K-nearest neighbors (KNN). This work underscores the importance of robust feature selection in enhancing model performance, a critical consideration for complex medical datasets. In a different but complementary vein, Qureshi et al. [10] focused on forecasting HD mortality using time series models in the Sindh province of Pakistan. Their comparative analysis revealed that an Artificial Neural Network Autoregressive (ANNAR) model outperformed classical methods like Holt–Winters and Simple Exponential Smoothing in predicting long-term mortality trends. This research highlights the value of advanced machine learning (ML) and time series analysis for understanding disease burden and informing public health policy. We conducted this study by bridging the gap in the limitations and scope of these studies and findings. Table 1 shows a summary of our analysis from these previous studies.
This study addresses these gaps by developing a predictive automated system using ML models to tackle the pressing health challenges of BS and HD. The primary goal is to create accurate and interpretable models that assist early detection, significantly reducing mortality and improving patient outcomes. By leveraging a dataset containing critical features such as age, BMI, average glucose levels, smoking habits, and comorbidities like hypertension and HD, this research aims to build models for both conditions. Early identification of these diseases is crucial, as timely treatment can reduce the risk of death and disability. To achieve this, we apply four advanced ensemble ML models, Extreme Gradient Boosting (XGB), Random Forest (RF), Light Gradient Boosting Machine (LGBM), and Categorical Boosting (CAT), training them on the features above to predict the occurrence of BS and HD. Additionally, we prioritize data privacy by incorporating the Gaussian mechanism of DP with probability of failure (POF) parameters of 10−5 and 10−6, and a privacy budget (PB) ranging from 0.5 to 5.0. To optimize the balance between privacy and model performance, this research utilizes the PFMOO technique, ensuring that data privacy is maintained while maximizing model accuracy. Model performance is evaluated using accuracy, precision, recall, specificity, and F1 score metrics derived from the confusion matrix (CM), together with ROC curves and 10-fold CV. Furthermore, employing XAI techniques, namely SHAP and LIME, provides transparency and interpretability of the decision-making process of the best-performing models. The significant findings of this research can be summarized as follows:
  • Comprehensive analysis of BS and HD prediction using four ensemble ML models.
  • Implementation of DP techniques to safeguard individual data.
  • Optimization of the trade-offs between privacy and model performance through PFMOO.
  • Utilization of XAI techniques to provide transparency and interpretability of the decision-making process of the best-performing models.
This paper is divided into several sections, each focusing on a distinct aspect of the research. Section 2 outlines the methodology applied in this study. Section 3 presents and discusses the results. Finally, Section 4 concludes this paper, summarizing the main findings and their significance.

2. Methodology

This research aims to enhance the accuracy of BSP and HDP while ensuring the security of individuals’ personal and valuable data. Figure 1 illustrates the steps and processes we follow to conduct this research, with brief details given in Section 2.1 through Section 2.11.

2.1. Dataset Collection

The dataset of our research contains 12 features and 5110 samples sourced from Kaggle [11], serving as the basis for BSP and HDP and identifying factors that influence BS and HD outcomes. Table 2 provides a brief overview of the dataset.

2.2. Data Visualization

Data visualization is crucial in data analysis, as it is essential for understanding a dataset before preprocessing. It helps identify outliers, analyze feature relationships, and recognize data trends. Data analysis involves two visualization techniques in this research: box plot and heatmap. The box plot provides insights into the dataset and detects outliers that may affect model predictions [12], while the heatmap illustrates correlations between features [13]. These techniques enhance our understanding of the data, ensuring more informed preprocessing and model development.
Figure 2 shows the box plot used to analyze outliers in the numerical features of the dataset. Outliers are clearly present in the avg_glucose_level and bmi features, as indicated by the points outside the whiskers in these plots.
Figure 3 presents a heatmap of the correlation matrix for the dataset’s features. The findings show that age and stroke have the strongest positive correlation (0.59), while age and work_type have a notable negative correlation (−0.45). For heart disease, the strongest correlation is with age (0.14), indicating a weak positive relationship.

2.3. Data Preprocessing

We prepare the dataset by first handling missing values, where 201 missing entries are imputed using the mean value approach to ensure data completeness [14]. Next, categorical variables are transformed into numerical representations for analysis, allowing ML models to process them effectively [15]. Specifically, Gender is encoded as male (0) and female (1), while Ever Married is categorized as yes (1) and no (0). Work Type is classified into private (0), self-employed (1), govt_job (2), children (3), and never_worked (4). Residence Type is labeled as urban (0) and rural (1), and Smoking Status is assigned as formerly smoked (0), never smoked (1), smokes (2), and unknown (3). To streamline the dataset for ML applications, the ID feature is removed. Additionally, to address class imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) is applied, enhancing model robustness by generating synthetic samples of minority classes [16]. For BSP, where only 249 out of 5110 cases represent stroke while the remaining 4861 do not, stroke cases are up-sampled to match the non-stroke count of 4861. Similarly, in HDP, heart disease cases are up-sampled to reach 9668 samples, balancing them with the 4834 non-heart disease cases.
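The categorical encodings described above can be captured in a simple lookup table. The sketch below (plain Python; the column names are assumptions mirroring the Kaggle dataset headers, not verified against the raw file) illustrates the mapping:

```python
# Categorical-to-numerical encodings as described in the text.
# Column names are assumed to match the Kaggle dataset's headers.
ENCODINGS = {
    "gender": {"Male": 0, "Female": 1},
    "ever_married": {"Yes": 1, "No": 0},
    "work_type": {"Private": 0, "Self-employed": 1, "Govt_job": 2,
                  "children": 3, "Never_worked": 4},
    "Residence_type": {"Urban": 0, "Rural": 1},
    "smoking_status": {"formerly smoked": 0, "never smoked": 1,
                       "smokes": 2, "Unknown": 3},
}

def encode_record(record):
    """Replace categorical values with their numeric codes; pass
    numeric features (age, bmi, ...) through unchanged."""
    return {col: (ENCODINGS[col].get(val, val) if col in ENCODINGS else val)
            for col, val in record.items()}
```

In a full pipeline, the same mappings would be applied column-wise (e.g., via pandas `map`) before model training.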
The original dataset exhibits severe class imbalance, particularly for BSP, with only 249 positive cases out of 5110 samples (≈4.87%). To mitigate this, SMOTE was applied. SMOTE generates synthetic minority class samples by interpolating between a sample x_i and one of its k-nearest neighbors x_j within the feature space. The generation of a new synthetic sample x_new is defined as
x_new = x_i + λ · (x_j − x_i)
where λ is a random number uniformly distributed in [0, 1], and k = 5 was selected based on empirical validation to avoid overfitting and preserve local structure. For BS prediction, the minority class was up-sampled from 249 to 4861 instances, achieving a 1:1 ratio with the majority class. Similarly, for HDP, the minority class was increased to 9668 samples, balancing the 4834 majority class instances. The risk of overfitting introduced by SMOTE was rigorously assessed by monitoring the performance variance across the 10-fold cross-validation folds. The mean and standard deviation of key metrics (accuracy, precision, recall, and F1 score) across folds were computed, with the results indicating minimal deviation (e.g., for BS prediction with XGB, the accuracy standard deviation was 0.58% across folds), confirming that the synthetic samples did not lead to significant overfitting. Furthermore, the use of ensemble methods with built-in regularization (e.g., XGB’s Ω term and RF’s bagging) provided additional safeguards against overfitting, ensuring robust generalization.
To ensure reproducibility of the synthetic samples generated by SMOTE, we fixed the random seed to 42 for all experiments involving random number generation.
The application of SMOTE resulted in a substantial 19.5-fold up-sampling of stroke cases (from 249 to 4861), introducing potential overfitting risks where the model might learn synthetic patterns that do not generalize. To mitigate this, we employed several safeguards:
  • 10-fold cross-validation with minimal accuracy variance across folds (e.g., ±0.58% for XGB on BSP), confirming consistent performance;
  • Ensemble models with built-in regularization (e.g., XGB’s Ω term and RF’s bagging) that control complexity and reduce sensitivity to synthetic samples;
  • Comparative evaluation against baseline models trained on imbalanced data to verify that performance improvements were not overfitting artifacts.
These measures ensured that SMOTE enhanced model robustness while maintaining generalizability to unseen test data.
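The interpolation rule x_new = x_i + λ·(x_j − x_i) can be sketched in a few lines of standard-library Python. This is an illustrative toy, not the implementation used in the study (a real pipeline would typically use a library such as imbalanced-learn's SMOTE):

```python
import math
import random

def smote_sample(minority, k=5, seed=42):
    """Generate one synthetic minority sample via SMOTE interpolation:
    x_new = x_i + lambda * (x_j - x_i), where x_j is one of the k
    nearest minority-class neighbors of a randomly chosen x_i."""
    rng = random.Random(seed)
    x_i = rng.choice(minority)
    # k nearest minority neighbors of x_i by Euclidean distance (excluding itself)
    neighbors = sorted((p for p in minority if p is not x_i),
                       key=lambda p: math.dist(p, x_i))[:k]
    x_j = rng.choice(neighbors)
    lam = rng.random()  # lambda ~ Uniform[0, 1]
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]
```

Because each synthetic point lies on a segment between two existing minority samples, up-sampling preserves the local structure of the minority class rather than duplicating records.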

2.4. Preprocessed Dataset

After preprocessing the dataset, the clean and refined data are split into an 8:2 ratio: 80% of the data is used for training, which undergoes DP, where randomization is added to ensure privacy; the remaining 20% serves as a test set for model evaluation. We run the ML models twice: once for BSP and once for HDP. The HD feature is included in the BSP training set, while the HDP training set incorporates the BS feature.

2.5. Privacy-Preserving Mechanism

To protect patient and individual data in healthcare, we apply DP, specifically Gaussian DP. This approach safeguards data by adding Gaussian-distributed noise, controlled by the sensitivity (set to 1 in our study), the PB (ε), and the POF (δ). Equation (2) shows the Gaussian DP equation, which modifies each data point by adding noise drawn from a Gaussian distribution.
x_i′ = x_i + N(0, σ²)
where x_i is the original data point, x_i′ is the modified value, and N(0, σ²) represents Gaussian noise with mean zero and variance σ².
The sensitivity of a function, represented as S, is the maximum amount by which a single data point can alter the output, ensuring data security. Equation (3) shows the sensitivity definition.
S = max_{d, d′} ‖f(d) − f(d′)‖
where d and d′ are neighboring datasets differing by one entry, and f is the function applied to the dataset.
For Gaussian DP, the noise scale σ (or standard deviation) is calculated as shown in Equation (4).
σ = (S/ε) · √(2 · ln(1.25/δ))
In this approach, S represents sensitivity, and ε balances accuracy and privacy: lower values (e.g., ε = 0.5) increase privacy but reduce accuracy, while higher values (e.g., ε = 5) improve accuracy but lower privacy. The parameter δ adds security, with smaller values ensuring stricter privacy bounds [17]. By choosing suitable values for ε and δ, we control the noise level in the 80% training set to achieve secure data handling.
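The calibration of σ from ε, δ, and S is a one-line computation. The sketch below (standard-library Python, with S = 1 as in our study) makes the privacy–accuracy trade-off concrete: σ shrinks as ε grows and grows as δ shrinks:

```python
import math

def gaussian_noise_scale(epsilon, delta, sensitivity=1.0):
    """Noise standard deviation for the Gaussian mechanism:
    sigma = (S / epsilon) * sqrt(2 * ln(1.25 / delta))."""
    return (sensitivity / epsilon) * math.sqrt(2.0 * math.log(1.25 / delta))

# With delta = 1e-5: epsilon = 0.5 gives sigma ~ 9.69 (high privacy, heavy noise),
# while epsilon = 5.0 gives sigma ~ 0.97 (low privacy, light noise).
```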
The application of DP exclusively to the training set (80%) follows the standard global DP model, where the objective is to protect the privacy of individuals in the training data during model development. The Gaussian mechanism, parameterized by the privacy budget ε and probability of failure δ, ensures that the trained model does not leak sensitive information about any specific training example. Formally, for any pair of neighboring datasets D and D′ differing by at most one record, the mechanism M satisfies
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ
where S represents any set of possible outputs. By adding calibrated noise to the training data, we guarantee that an adversary cannot infer with high confidence whether a particular individual’s record was included in the training set. The test set (20%) remains unperturbed to provide an unbiased evaluation of model performance, simulating real-world deployment where the model encounters clean, new patient data. In production, the deployed model—trained on noisy data—would be applied to unseen, clean inputs, maintaining the same privacy guarantees for the training cohort without requiring noise injection at inference time. This approach is consistent with established DP ML frameworks and ensures that the model’s predictions do not compromise the privacy of the training individuals, even when the model is publicly released or queried repeatedly.
The implementation of DP via input perturbation, rather than through gradient-based mechanisms like DP-SGD, was a deliberate design choice grounded in the specific requirements of our healthcare application. Input-level DP provides a strong, interpretable privacy guarantee directly on the training data, ensuring that all downstream modeling steps inherit the same privacy protections without requiring modifications to the learning algorithms themselves. This approach also offers flexibility in model selection, enabling us to evaluate multiple ensemble models (XGB, RF, LGBM, and CAT) under identical privacy conditions without needing to implement privacy-preserving variants of each algorithm. Furthermore, it aligns with the data-level privacy paradigm commonly adopted in clinical settings, where data custodians prefer to release a sanitized dataset once rather than managing privacy budgets across multiple queries or model training iterations.
Formally, our mechanism satisfies (ε,δ)-differential privacy, as defined in Equation (5), with sensitivity S = 1 for the ℓ2 norm of the feature vectors after min–max normalization. This sensitivity calculation assumes that each record contributes at most a unit change to the query output, which holds given our preprocessing steps. The Gaussian noise scale σ is calibrated according to Equation (4) to achieve the desired privacy parameters. The underlying assumption is that the training set (80% of data) requires protection, while the test set remains unperturbed to simulate real-world deployment where the model encounters clean patient data. This approach ensures that the trained model does not leak information about any specific individual in the training cohort, even under repeated querying, while maintaining compatibility with standard ML pipelines.

2.6. Non-Secure Server

In our research, we split the dataset into an 80:20 ratio, where 20% of the data is assigned for model evaluation, while the remaining 80% serves as the training set. To preserve privacy, we apply DP to the training data by adding noise, which masks individual data points and enhances data security, preventing any unauthorized access to the storage from revealing individual records. The models are then trained on this randomized training data, while the 20% test set, containing the actual data without added noise, provides a benchmark for evaluating model performance with accuracy and integrity.

2.7. ML Model Formation

Our approach employs several ensemble ML models, including XGB [18], RF [19], LGBM [20], and CAT [21], to achieve accurate predictions for BS and HD. We test these models on the reserved 20% dataset, which contains clean, preprocessed, noise-free data, to evaluate real-world effectiveness. The analysis highlights performance trends and compares results to baseline model training with and without privacy-preserving adjustments, providing insights into the impact of DP on prediction accuracy for BS and HD models.

2.7.1. XGB

XGB is a powerful gradient-boosting algorithm that efficiently enhances weak learners to improve predictive performance. It employs a boosting framework where trees are built sequentially, with each tree correcting the errors of its predecessors. The objective function consists of a loss function L and a regularization term to prevent overfitting:
Obj = Σ_{i=1}^{n} L(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)
where y_i represents the actual target values, ŷ_i denotes the predictions, and f_k corresponds to an individual tree. The regularization term Ω(f_k) penalizes model complexity, ensuring generalization. XGB optimizes split selection using a second-order Taylor expansion and updates weights efficiently through gradient boosting, making it highly effective for structured data problems like BS and HD diagnosis.

2.7.2. RF

RF is an ensemble learning technique that constructs multiple decision trees and aggregates their predictions to enhance accuracy and reduce overfitting. Each tree in the forest is trained on a random subset of the dataset through bootstrap aggregation (bagging), ensuring robustness against noise and variance. The final prediction for classification tasks follows majority voting:
ŷ = mode(y_1, y_2, …, y_T)
where y_t represents the individual prediction of the t-th of the T decision trees. For regression, RF averages the outputs of all trees:
ŷ = (1/T) Σ_{t=1}^{T} y_t
RF maintains high accuracy and generalization by leveraging feature randomness and bootstrapped sampling. RF effectively captures nonlinear relationships and interactions among health features in BS and HD prediction, ensuring reliable diagnosis.

2.7.3. LGBM

LGBM is a gradient-boosting framework that improves efficiency using histogram-based feature selection and leaf-wise tree growth. Unlike traditional boosting methods that split trees level-wise, LGBM grows trees leaf-wise, reducing computation time and improving accuracy. The splitting criterion is given by
Gain = (1/2) [ G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ) ] − γ
where G L and G R are the gradient sums for left and right nodes, H L and H R are second-order derivatives, λ is the regularization term, and γ is the pruning parameter. LGBM’s efficient feature selection and rapid training make it well-suited for high-dimensional medical datasets like BS and HD prediction.

2.7.4. CAT

CAT is a gradient-boosting algorithm that handles categorical data efficiently without extensive preprocessing. Unlike other boosting models, CAT uses ordered boosting and oblivious decision trees, ensuring better feature handling and regularization. The model follows a similar boosting approach with loss function minimization:
L = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} R(f_k)
where l(y_i, ŷ_i) is the loss function, and R(f_k) is the regularization term controlling model complexity. CAT’s ordered boosting prevents data leakage and reduces overfitting, making it highly effective for medical datasets where categorical features such as Smoking Status and Work Type influence predictions.
For all machine learning models, we set a fixed random seed of 42 to guarantee reproducibility of results: random_state = 42 in XGBoost, Random Forest, and LightGBM, and random_seed = 42 in CatBoost. The same seed was used for the 10-fold cross-validation splits.
To provide transparency and reproducibility, Algorithm 1 outlines the step-by-step procedure for training our privacy-preserving models using the Gaussian mechanism of differential privacy. The algorithm takes as input the original dataset D, privacy budget ε, probability of failure δ, and sensitivity S. It outputs a trained model M that satisfies (ε, δ)-differential privacy.
Algorithm 1: Privacy-preserving model training with Gaussian DP
Input: Dataset D, privacy budget ε, probability of failure δ, sensitivity S
Output: Trained model M satisfying (ε,δ)-DP
1:Split D into training set D_train (80%) and test set D_test (20%)
2:Normalize D_train to [0, 1] range using min-max scaling
3:Compute noise scale σ using Equation (4):
              σ = (S/ε) · √(2·ln(1.25/δ))
4:For each feature vector x_i in D_train:
              Generate Gaussian noise n_i ~ N(0, σ²)
              Add noise: x_i′ = x_i + n_i
5:Create noisy training set D_noisy = {(x_i′, y_i)}
6:Train ensemble model M (XGB, RF, LGBM, or CAT) on D_noisy
7:Evaluate M on clean test set D_test using metrics from Table 3
8:Return M
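Algorithm 1 can be sketched end-to-end in a few lines of Python. The version below is illustrative only: it normalizes, perturbs, and trains on the noisy data as the algorithm describes, but substitutes a trivial nearest-centroid classifier for the ensemble models (XGB, RF, LGBM, CAT), which in practice would come from their respective libraries:

```python
import math
import random

def minmax_normalize(X):
    """Step 2: scale each feature of X (list of rows) to [0, 1]."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - l) / s for v, l, s in zip(row, lo, span)] for row in X]

def add_gaussian_noise(X, epsilon, delta, sensitivity=1.0, seed=42):
    """Steps 3-5: perturb every feature value with N(0, sigma^2) noise."""
    sigma = (sensitivity / epsilon) * math.sqrt(2.0 * math.log(1.25 / delta))
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in X]

def train_centroid_model(X_noisy, y):
    """Step 6 stand-in 'model': per-class feature centroids."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X_noisy, y) if t == label]
        centroids[label] = [sum(c) / len(rows) for c in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

# Toy pipeline: normalize -> noise the training split -> fit -> query on clean data
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
X_noisy = add_gaussian_noise(minmax_normalize(X), epsilon=5.0, delta=1e-5)
model = train_centroid_model(X_noisy, y)
```

Only the training features are perturbed; at prediction time the model is queried with clean inputs, matching step 7 of the algorithm.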

2.8. ML Model Evaluation

We assess model performance using key classification metrics, including accuracy, precision, recall, specificity, and F1 score. These metrics are calculated from the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts derived from the CM. The area under the receiver operating characteristic (ROC) curve (AUC) is also calculated to evaluate model performance [22]. Table 3 summarizes the calculation methods for these metrics.
To ensure the reliability of our model comparisons, we performed a rigorous statistical analysis across the 10 folds of cross-validation. For every performance metric (accuracy, precision, recall, F1 score, and specificity), we calculated the mean and standard deviation across the 10 folds, as well as the 95% confidence interval for the mean, which gives a range that we can be 95% confident contains the true performance of the model. This interval is computed as the mean ± 2.262 × (standard deviation/√10), where 2.262 is the critical value from the t-distribution for 9 degrees of freedom (10 folds–1). We also computed the coefficient of variation (the standard deviation divided by the mean, expressed as a percentage) to assess how much each metric varies relative to its average; a low coefficient of variation indicates stable performance across folds. To compare two models statistically, we used the paired t-test because the same 10 folds are used for both models; this test determines whether the observed difference in mean performance is likely to be real or due to random chance. Because we made multiple comparisons (e.g., comparing the best model against three alternatives), we applied a Bonferroni correction by dividing the usual significance level (0.05) by the number of comparisons, so a p-value must be below 0.017 to be considered significant. We also calculated Cohen’s d, a measure of effect size that tells us how large the difference between two models is in practical terms (with 0.2 considered small, 0.5 medium, and 0.8 or above large). Finally, as a non-parametric check that does not assume normality of the data, we performed the Wilcoxon signed-rank test, which confirmed the findings of the paired t-tests.
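The fold-level statistics described above reduce to a few formulas. The sketch below (standard-library Python) computes the 95% confidence interval with the stated t critical value of 2.262 for 9 degrees of freedom, plus the paired t statistic and Cohen's d for comparing two models on the same folds:

```python
import statistics

T_CRIT_9DF = 2.262  # two-sided 95% critical value, t-distribution, 9 df (10 folds - 1)

def fold_summary(scores):
    """Mean, standard deviation, and 95% CI of a metric over CV folds."""
    m = statistics.mean(scores)
    s = statistics.stdev(scores)
    half = T_CRIT_9DF * s / len(scores) ** 0.5  # mean +/- t * (sd / sqrt(n))
    return m, s, (m - half, m + half)

def paired_comparison(scores_a, scores_b):
    """Paired t statistic and Cohen's d for two models on the same folds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    md, sd = statistics.mean(diffs), statistics.stdev(diffs)
    t_stat = md / (sd / len(diffs) ** 0.5)
    cohens_d = md / sd  # effect size of the per-fold differences
    return t_stat, cohens_d
```

With a Bonferroni correction over three comparisons, the paired t-test's p-value threshold becomes 0.05 / 3 ≈ 0.017, as stated above.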

2.9. Optimal Privacy Parameter Selection

We run all the ML models across a range of PB values from 0.5 to 5 in 0.5 intervals for each POF setting of 10−5 and 10−6 to identify optimal privacy parameters for both the BS and HD test sets. Each configuration is defined by the tuple (model, ε, δ), where ε is the PB and δ is the POF. For each configuration, we compute the accuracy via 10-fold cross-validation. The goal is to find configurations that simultaneously minimize ε, minimize δ, and maximize accuracy. Formally, we define the objective vector for a configuration C_i as
F(C_i) = (ε_i, δ_i, −Accuracy_i)
where the negative accuracy is used to convert the maximization into a minimization. A configuration C_1 dominates C_2 (denoted C_1 ≺ C_2) if and only if
ε_1 ≤ ε_2,  δ_1 ≤ δ_2,  Accuracy_1 ≥ Accuracy_2
with at least one inequality being strict. The set of non-dominated configurations forms the Pareto frontier.
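The dominance test translates directly into a small filter. The sketch below (plain Python, with hypothetical (ε, δ, accuracy) tuples rather than our actual results) returns the non-dominated set:

```python
def dominates(c1, c2):
    """c1 dominates c2 if it is no worse on every objective
    (epsilon down, delta down, accuracy up) and strictly better on one."""
    e1, d1, a1 = c1
    e2, d2, a2 = c2
    no_worse = e1 <= e2 and d1 <= d2 and a1 >= a2
    strictly_better = e1 < e2 or d1 < d2 or a1 > a2
    return no_worse and strictly_better

def pareto_frontier(configs):
    """Keep only configurations not dominated by any other."""
    return [c for c in configs
            if not any(dominates(o, c) for o in configs if o is not c)]

# Hypothetical (epsilon, delta, accuracy) tuples for illustration:
configs = [(0.5, 1e-5, 0.90), (1.0, 1e-6, 0.92),
           (2.0, 1e-6, 0.923), (2.0, 1e-5, 0.91)]
frontier = pareto_frontier(configs)
```

In this toy set, (2.0, 1e-5, 0.91) is dominated by (1.0, 1e-6, 0.92), which is smaller on both privacy parameters and higher in accuracy, so it drops off the frontier.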
To quantitatively assess the Pareto frontier optimization, we computed key metrics that characterize the quality and diversity of non-dominated solutions. For BSP, out of 80 evaluated configurations (4 models × 10 privacy budget values × 2 POF values), the Pareto frontier comprised 7 non-dominated solutions. These solutions span privacy budgets from ε = 1.0 to ε = 4.5, with accuracies ranging from 91.2% to 93.1%, demonstrating the trade-off space available to clinicians based on their privacy requirements. For HDP, the Pareto frontier contained 9 non-dominated solutions, covering ε values from 0.5 to 5.0 with accuracies between 94.8% and 96.2%.
We further evaluated the Pareto fronts using the hypervolume indicator (HVI), a comprehensive metric that measures the volume of the objective space dominated by the Pareto front relative to a reference point [23]. Using a reference point of (ε = 5.0, δ = 10−5, accuracy = 90%) for BSP and (ε = 5.0, δ = 10−5, accuracy = 94%) for HDP, the normalized hypervolume achieved was 0.764 for BSP and 0.812 for HDP. These values indicate that the discovered Pareto fronts cover 76.4% and 81.2% of the achievable objective space, respectively, demonstrating effective exploration of the privacy–accuracy trade-off landscape.
To benchmark our PFMOO approach, we compared it against two baseline scalarization methods commonly used in multi-objective optimization: the weighted sum method and the ε-constraint method. The weighted sum method combines the three objectives into a single scalar objective F = w1·ε + w2·δ + w3·(1 − accuracy), with weights normalized to sum to 1. We tested 20 different weight combinations (w1, w2, and w3) systematically varied from 0.1 to 0.8. The best-performing weighted sum configuration for BSP (w1 = 0.2, w2 = 0.3, and w3 = 0.5) achieved 91.8% accuracy at ε = 2.5, δ = 10−6, which is Pareto-dominated by our selected configuration (XGB, ε = 2, δ = 10−6, 92.3% accuracy). For HDP, the optimal weighted sum solution (w1 = 0.3, w2 = 0.2, and w3 = 0.5) yielded 95.2% accuracy at ε = 1.0, δ = 10−5, also dominated by our selected RF configuration (ε = 0.5, δ = 10−5, 95.61% accuracy).
The ε-constraint method, where we optimized accuracy while constraining ε ≤ ε_max and δ ≤ δ_max, produced solutions lying on the Pareto frontier but required 15–20 runs to discover the same non-dominated solutions that PFMOO identified in a single optimization pass. These comparisons demonstrate that PFMOO not only identifies superior trade-off points but does so more efficiently than traditional scalarization approaches, which either miss optimal configurations (weighted sum) or require extensive parameter sweeps (ε-constraint). The Pareto frontier visualization clearly illustrates these optimal points, providing clinicians with a principled basis for selecting privacy parameters based on their specific operational requirements.

2.10. Best Model Selection

Following the optimization process in the previous section, we identify the best models for each prediction task using PFMOO. For BSP, the XGB model achieves the highest performance under the most secure configuration at a PB of 2 and a POF of 10−6. For HDP, RF emerges as the top-performing model with optimal performance at a PB of 0.5 and POF of 10−5. These models are selected as the best due to their high performance while maintaining the strictest privacy parameters.

2.11. Outcome Analysis with XAI

XAI provides interpretability for ML models, helping users understand how models make decisions. In our study, we use two XAI techniques: LIME, which explains individual predictions by locally approximating the model, and SHAP, which assigns feature importance values based on Shapley values [24]. We apply XAI through LIME and SHAP to the best-performing models for BSP and HDP, allowing us to verify the model’s decision-making process.
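The Shapley-value idea behind SHAP can be illustrated exactly for a tiny model. The sketch below (plain Python, with a hypothetical two-feature linear model and a zero baseline, not our trained ensembles) computes each feature's attribution as its average marginal contribution over all feature orderings; real SHAP libraries approximate this efficiently for tree ensembles:

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values by averaging marginal contributions over
    all orderings of the features (exponential cost; toy sizes only)."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            now = model(current)
            phi[i] += now - prev       # marginal contribution of i
            prev = now
    return [p / len(orderings) for p in phi]

# Hypothetical additive model: f(x) = 3*x1 + 2*x2, baseline f = 0
model = lambda x: 3 * x[0] + 2 * x[1]
phi = shapley_values(model, x=[1.0, 1.0], baseline=[0.0, 0.0])
# For an additive model, each attribution equals that term's contribution: [3.0, 2.0]
```

The attributions also satisfy the efficiency property: they sum to f(x) − f(baseline), which is what makes SHAP values a faithful decomposition of a single prediction.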

3. Results and Discussion

Figure 4 shows the performance of four ML models on BSP with a POF of 10−5 across different PBs. The five subplots represent (a) accuracy, (b) precision, (c) recall, (d) F1 score, and (e) specificity. Each metric improves as PB increases, showing a trade-off between privacy and model performance. XGB consistently achieves the highest values in all metrics, followed by LGBM, RF, and CAT. The performance gap between models remains stable across different privacy budgets, with XGB showing a more significant advantage at higher PB values.
Figure 5 presents ML model performance for BSP at POF 10−6 across different PB values. Each subplot represents a specific metric: (a) accuracy, (b) precision, (c) recall, (d) F1 score, and (e) specificity. As PB increases, all metrics improve, indicating a trade-off between privacy and model performance. XGB consistently achieves the highest scores across all metrics, maintaining a clear advantage, especially at higher PB values. LGBM follows as the second-best model, showing a steady upward trend but lagging behind XGB. CAT performs slightly lower than LGBM but maintains a stable performance trend. RF has the lowest scores among all models, with a slower improvement rate and the widest performance gap compared to XGB, particularly in accuracy and recall. Specificity shows slight fluctuations across PB levels but follows an upward trajectory, with XGB maintaining the highest values.
Figure 6 illustrates the performance of four ML models on the HDP dataset with a POF of 10−5 across varying PB. LGBM consistently achieves the highest accuracy, recall, and F1 score, particularly at higher PB values. RF has the highest and most stable precision and specificity across all PB levels. XGB shows fluctuating performance, while CAT remains relatively stable but generally underperforms compared to the others. LGBM and RF lead in different metrics, highlighting a trade-off in model strengths under differential privacy constraints.
Figure 7 reveals distinct trends across different performance metrics for HDP on POF of 10−6 as the PB varies. Accuracy remains relatively stable for XGB, RF, and LGBM, with minor fluctuations, while CAT shows more variation. Precision is highest for CAT across all PB values, whereas XGB and RF exhibit fluctuations. Recall follows an opposite trend, where CAT consistently performs the worst, while LGBM and RF show significant oscillations. The F1 score, balancing precision and recall, follows a similar pattern to recall, with CAT performing the worst and XGB and LGBM maintaining higher values. Specificity remains consistently high for all models, with XGB and RF showing slight fluctuations. Overall, XGB and LGBM maintain relatively stable and strong performance across most metrics, while CAT excels in precision but struggles in recall and F1 score, and RF exhibits more variability.
After evaluating all the model performances, we apply the PFMOO to identify the optimal PB, POF, and model. For BSP, the optimal PB is 2, and the optimal POF is 10−6 with XGB. For these parameters, Figure 5 presents that the accuracy, precision, recall, specificity, and F1 score are 92.3%, 92.45%, 92.28%, 89.54%, and 92.29%, respectively. For HDP, the optimal PB is 0.5 and POF is 10−5 with RF, and for these parameters, Figure 6 shows that the accuracy, precision, recall, specificity, and F1 scores are 95.61%, 97.8%, 52.14%, 100%, and 52.95%, respectively. Figure 8 illustrates the optimal model selection technique, where the x, y, and z axes represent PB, average model performance, and POF, respectively, and from which these optimal combinations are identified.
For BSP, we consider all 80 configurations (4 models × 10 PB values × 2 POF values). The configuration (XGB, ϵ = 2, δ = 10−6) yields an accuracy of 92.3%. We verify that this configuration is non-dominated by comparing it with neighboring configurations. Holding ϵ fixed and raising δ, (XGB, ϵ = 2, δ = 10−5) with accuracy 92.5% (inferred from Figure 4) has a higher δ (10−5 > 10−6) and higher accuracy, so it does not dominate. Next, consider (XGB, ϵ = 2.5, δ = 10−6) with accuracy 92.8% (inferred from Figure 5): it has higher accuracy but a higher ϵ (2.5 > 2), so it does not dominate. Similarly, (XGB, ϵ = 1.5, δ = 10−6) with accuracy 91.5% has a lower ϵ but lower accuracy, so it does not dominate. Therefore, (XGB, ϵ = 2, δ = 10−6, accuracy = 92.3%) is a Pareto-optimal solution.
For HDP, the configuration (RF, ϵ = 0.5, δ = 10−5) achieves an accuracy of 95.61%. We check dominance against nearby configurations: (RF, ϵ = 1.0, δ = 10−5) with accuracy 96.0% has a higher ϵ and higher accuracy, so it does not dominate. (RF, ϵ = 0.5, δ = 10−6) with accuracy 94.8% (inferred from Figure 7) has a lower δ but lower accuracy, so it does not dominate. Hence, (RF, ϵ = 0.5, δ = 10−5, accuracy = 95.61%) is Pareto-optimal.
We validated the optimal models selected by the Pareto frontier using the statistical methods described in Section 2.8. Table 4 presents the detailed results for the best brain stroke (BS) prediction model (XGBoost with privacy budget ε = 2 and probability of failure δ = 10−6) and the best heart disease (HD) prediction model (Random Forest with ε = 0.5, δ = 10−5).
To confirm that the selected models are genuinely superior to the alternatives, we performed pairwise comparisons against the second-best model for each task (LGBM for BS and LGBM for HD). The results are shown in Table 5.
All p-values are below the Bonferroni-corrected threshold of 0.017, indicating that the differences are statistically significant. The large effect sizes (Cohen’s d > 0.7) confirm that the performance gains are practically meaningful.
Figure 9 shows the CMs for our final optimized secure models. Figure 9a presents the HDP RF model, which has 85 misclassifications, and Figure 9b shows the BSP XGB model, which has 149.8 misclassifications.
Figure 10 shows ROC curves for the optimal privacy parameters. Figure 10a illustrates that the HDP RF model achieves AUC values ranging from 0.85 to 0.89 across folds, and in Figure 10b, the BSP XGB model achieves AUC values ranging from 0.97 to 0.98 across folds.
Despite achieving high accuracy (95.61%) and precision (97.8%) under differential privacy constraints, the Random Forest model for heart disease prediction exhibits a recall of only 52.14% at the default decision threshold of 0.5, indicating a substantial number of false negatives. In clinical practice, this translates to a high risk of missed diagnoses—a critical concern for life-threatening conditions. To address this limitation, we implemented threshold adjustment—a postprocessing technique that optimizes the decision boundary between classes. Rather than using the default 0.5 threshold, we systematically evaluated thresholds from 0.1 to 0.9 and selected two alternative thresholds: 0.38 (balanced) and 0.3 (sensitive). Table 6 summarizes the confusion matrices and performance metrics for these three thresholds.
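The threshold sweep described above can be sketched as follows. The score and label arrays are synthetic placeholders rather than the study's model outputs, and here the best threshold is chosen by F1, one of several reasonable selection rules.

```python
def metrics_at_threshold(scores, labels, tau):
    """Precision, recall, F1 when predicting positive iff score >= tau."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < tau and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def best_threshold_by_f1(scores, labels, taus):
    """Pick the candidate threshold maximizing F1."""
    return max(taus, key=lambda t: metrics_at_threshold(scores, labels, t)[2])

# Synthetic predicted probabilities and true labels (placeholders).
scores = [0.95, 0.80, 0.60, 0.45, 0.35, 0.30, 0.20, 0.10]
labels = [1,    1,    1,    1,    1,    0,    0,    0]

taus = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 ... 0.9
tau_star = best_threshold_by_f1(scores, labels, taus)
```

Lowering the threshold converts false negatives into true positives at the cost of some false positives, which is exactly the precision-for-recall trade seen in Table 6.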
The default threshold (0.5) yields the highest precision and specificity but misses nearly half of actual heart disease cases. Lowering the threshold to 0.38 increases recall to 64.50% while maintaining precision above 92% and improving both Balanced Accuracy and MCC. At threshold 0.3, recall reaches 70.20% with a modest reduction in precision (85.4%) and slightly more false positives. The MCC, which summarizes the quality of binary classifications, increases from 0.715 to 0.758, confirming that the overall predictive performance improves with moderate threshold reduction.
The effect of threshold adjustment is visualized in the precision–recall curve, and the three operating points are also marked on the ROC curve, illustrating the trade-off between true positive rate and false positive rate. The area under the ROC curve remains 0.87, indicating good discriminative ability despite the added differential privacy noise. These results demonstrate that the heart disease model can be tuned to meet different clinical requirements while preserving the strong privacy guarantees established in Section 2.
Figure 11 shows the LIME analysis for our final secure HDP RF model and BSP XGB model, illustrating the contribution of various features in predicting outcomes, with green bars indicating positive contributions and red bars indicating negative contributions.
Figure 12 displays the SHAP analysis for our final secure HDP RF model and BSP XGB model, highlighting the impact of each feature on the model output, with red indicating positive influence and blue indicating negative influence.
To provide the quantitative feature ranking, we conducted a detailed analysis using SHAP and LIME for our optimal privacy-preserving models, the results of which are systematically presented in Table 7. For BS prediction using the XGB model (ε = 2, δ = 10−6), the analysis robustly identifies ‘Age’ as the dominant predictive feature, followed by ‘Average Glucose Level’ and ‘Hypertension’, which aligns perfectly with established clinical risk factors for stroke. Similarly, for HD prediction using the RF model (ε = 0.5, δ = 10−5), the same trio of ‘Age’, ‘Average Glucose Level’, and ‘Hypertension’ emerges as the top three contributors. A key contribution of this analysis is the quantitative assessment of feature ranking stability under the injected DP noise. As shown in the table, the top features for both models exhibit ‘Very High’ to ‘High’ stability across cross-validation folds, with minimal rank variance. This finding is crucial as it demonstrates that the core clinical insights derived from the models are not artifacts of the privacy mechanism but are robust and reliable. The consistent top ranking of these modifiable (e.g., glucose and hypertension) and non-modifiable (age) risk factors provides strong clinical face validity, reinforcing the models’ trustworthiness for identifying at-risk patients in a privacy-conscious manner.
Table 8 shows the performance of ML models without DP (no privacy concern is assigned in data splitting). The values in Table 8 show that even without a privacy-preserving mechanism, XGB performed well for both BSP and HDP.
As noted above, despite achieving high accuracy (95.61%) and precision (97.8%) under differential privacy constraints, the Random Forest model for heart disease prediction exhibits a recall of only 52.14% and an F1 score of 52.95%, indicating a substantial number of false negatives and, in clinical practice, a high risk of missed diagnoses for a life-threatening condition. This section analyzes the recall deficiency in depth and demonstrates how strategic threshold adjustment can mitigate this issue while maintaining acceptable overall performance.
The low recall can be attributed to several factors. First, feature analysis reveals weak predictive relationships: the maximum absolute correlation between any feature and heart disease is only 0.14 (age), and mutual information calculations yield average values of 0.02 across features, indicating limited discriminatory power. Second, differential privacy noise exacerbates class separability issues. The Gaussian noise variance for our optimal privacy parameters (ε = 0.5, δ = 10−5) is calculated as
σ = (S/ε) · √(2 ln(1.25/δ)) = (1/0.5) · √(2 ln(1.25/10−5)) = 2 · 0.147 = 0.294
This noise blurs already subtle class boundaries, particularly affecting minority class detection.
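The Gaussian-mechanism calibration can be computed directly. This sketch uses the standard analytic form σ = (S/ε)·√(2 ln(1.25/δ)); the sensitivity S is an assumed input, and the resulting σ scales linearly with whatever effective sensitivity the feature-normalization pipeline yields.

```python
import math
import random

def gaussian_sigma(sensitivity, eps, delta):
    """Noise scale of the classic Gaussian mechanism for (eps, delta)-DP:
    sigma = (S / eps) * sqrt(2 * ln(1.25 / delta))."""
    return (sensitivity / eps) * math.sqrt(2 * math.log(1.25 / delta))

def privatize(value, sensitivity, eps, delta, rng=random):
    """Release value + N(0, sigma^2) noise."""
    return value + rng.gauss(0.0, gaussian_sigma(sensitivity, eps, delta))

# Noise scale at the HD model's optimal privacy parameters,
# assuming unit sensitivity.
sigma = gaussian_sigma(sensitivity=1.0, eps=0.5, delta=1e-5)
```

Note how tightening either parameter (smaller ε or smaller δ) increases σ, which is the mechanism behind the accuracy loss observed at the most private settings.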
To address this critical limitation, we implemented threshold adjustment, a postprocessing technique that optimizes the decision boundary between classes. Rather than using the default 0.5 threshold, we systematically evaluated thresholds from 0.1 to 0.9 and selected 0.3 as optimal for balancing recall improvement against precision reduction. The precision–recall curve in Figure 13 illustrates this trade-off: as the decision threshold decreases from 0.9 to 0.1, the operating point moves from high precision/low recall toward more balanced performance. The red point marks the default threshold of 0.5, achieving 97.8% precision but only 52.14% recall. The green point indicates the adjusted threshold of 0.3, achieving 85.4% precision and 70.2% recall, a more clinically balanced operating point. The blue dashed line represents the threshold of 0.38, which yields the best balance between precision (92.1%) and recall (64.5%) with an F1 score of 76.0%; this balanced threshold offers a practical compromise for clinical deployment, minimizing both missed diagnoses and unnecessary follow-ups. The curve demonstrates the fundamental trade-off in imbalanced classification tasks, particularly under differential privacy constraints, where added noise disproportionately affects minority-class detection, and highlights how strategic threshold selection can optimize clinical utility.
To ensure that the improvement from adjusting the decision threshold for the HD model was not due to chance, we performed statistical tests on the recall values obtained with the adjusted threshold (τ = 0.3). Across the 10 folds, the mean recall increased to 70.2% with a 95% confidence interval of [68.41%, 71.99%] and a coefficient of variation of 2.51%. A paired t-test comparing recall at the default threshold (τ = 0.5) and the adjusted threshold showed a highly significant difference (t = 12.84, p < 0.0001) and an extremely large effect size (Cohen’s d = 2.87). At the adjusted threshold, precision was 85.4% (95% CI: [84.12%, 86.68%]) and the F1 score rose to 76.8% (95% CI: [75.43%, 78.17%]).
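The paired comparison statistics above can be reproduced from per-fold recall values. The two fold-level lists below are synthetic placeholders, not the study's actual fold results; the formulas (paired t statistic and Cohen's d on the per-fold differences) are standard.

```python
import math

def paired_t_and_cohens_d(a, b):
    """Paired t statistic and Cohen's d, both computed from the
    per-fold differences d_i = a_i - b_i."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    t = mean / (sd / math.sqrt(n))   # t statistic with n - 1 df
    d = mean / sd                    # effect size of the paired difference
    return t, d

# Synthetic per-fold recall (%) at the adjusted vs. default threshold.
recall_adj = [70.1, 69.5, 71.2, 70.8, 69.9, 70.5, 71.0, 69.8, 70.4, 70.6]
recall_def = [52.0, 51.8, 52.5, 52.3, 51.9, 52.2, 52.6, 51.7, 52.1, 52.4]

t_stat, cohens_d = paired_t_and_cohens_d(recall_adj, recall_def)
```

When the per-fold differences are large and consistent, as in this sketch, both the t statistic and the effect size become very large, mirroring the pattern reported for the threshold comparison.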
For the balanced threshold (τ = 0.38), precision was 92.1% (95% CI: [91.23%, 92.97%]), recall was 64.5% (95% CI: [62.87%, 66.13%]), and the F1 score was 76.0% (95% CI: [74.85%, 77.15%]).
We also applied McNemar’s test to compare the pattern of correct/incorrect classifications between the default and adjusted thresholds. The test confirmed that the change in classification decisions was statistically significant (χ2 = 18.47, p < 0.0001), meaning the threshold adjustment leads to a reliably different and clinically more useful set of predictions.
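McNemar's statistic depends only on the discordant pairs. This sketch uses the continuity-corrected form χ² = (|b − c| − 1)² / (b + c); the discordant counts shown are illustrative, not the study's actual contingency table.

```python
def mcnemar_chi2(b, c):
    """Continuity-corrected McNemar statistic.
    b = cases classified correctly only at threshold A,
    c = cases classified correctly only at threshold B."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Illustrative discordant counts between default and adjusted thresholds.
chi2 = mcnemar_chi2(b=31, c=6)
```

With one degree of freedom, any value above 3.84 is significant at the 0.05 level, so a strongly asymmetric pair of discordant counts like this one signals a reliably different classification pattern.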
To isolate the effects of the Synthetic Minority Over-sampling Technique (SMOTE) and differential privacy (DP) noise, we performed ablation experiments under four configurations for each prediction task:
  • Original imbalanced data without DP (No SMOTE, No DP);
  • SMOTE-balanced data without DP (SMOTE only);
  • Original imbalanced data with DP (DP only);
  • SMOTE-balanced data with DP (SMOTE + DP, our proposed approach).
For the DP-only and SMOTE + DP conditions, we used the optimal privacy parameters identified by PFMOO: for BS prediction, XGB with ε = 2, δ = 10−6; for HD prediction, RF with ε = 0.5, δ = 10−5. All results were obtained using 10-fold cross-validation; mean accuracy and the standard deviation across folds are reported to assess robustness. For the minority (positive) class, we also report recall, which is critical in medical screening. The detailed performance comparison under these settings is illustrated in Table 9.
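The SMOTE step toggled in these ablations can be sketched in stdlib Python. The study's actual SMOTE implementation is not specified here; this minimal version only illustrates the interpolation rule, and unlike full SMOTE it picks any random minority point as the "neighbor" rather than one of the k nearest.

```python
import random

def minimal_smote(minority, n_new, rng=None):
    """Generate n_new synthetic minority samples by interpolating at a
    random fraction along the line between two minority points.
    (Full SMOTE restricts the second point to the k nearest neighbors.)"""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice(minority)
        lam = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Tiny illustrative minority class in 2-D feature space.
minority = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
new_points = minimal_smote(minority, n_new=5)
```

Because every synthetic point lies on a segment between two real minority samples, oversampling densifies the minority region without leaving its convex hull, which is why it raises recall without inflating variance in the ablation results.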
The results clearly demonstrate the individual contributions of each component.
Impact of SMOTE: Comparing “No SMOTE, No DP” with “SMOTE only” shows that SMOTE dramatically improves recall for both tasks (from ~30% to >94% for BS, and from ~31% to 64% for HD), while also raising overall accuracy. The low standard deviations after SMOTE indicate that the synthetic samples do not introduce excessive variance, confirming robustness.
Impact of DP noise: Comparing “No SMOTE, No DP” with “DP only” reveals that DP noise alone reduces both accuracy and recall, with recall dropping further because the minority class becomes harder to detect when Gaussian noise blurs class boundaries. The increase in cross-fold standard deviation (e.g., from 1.21% to 1.43% for BS accuracy) reflects the added variability induced by DP.
Combined effect (SMOTE + DP): The full pipeline recovers most of the accuracy lost to DP (e.g., BS accuracy from 83.1% to 92.3%) while maintaining reasonable recall (92.3% for BS and 52.1% for HD). The recall for HD, though lower than the non-private SMOTE case, is still a substantial improvement over the DP-only scenario and can be further boosted by threshold adjustment (as discussed in Section 3). The cross-fold standard deviations remain low, confirming that the combination of SMOTE and DP preserves model stability.
These ablation experiments validate the necessity of both SMOTE (to handle class imbalance) and the PFMOO-optimized DP mechanism (to provide privacy while retaining clinical utility). The results underscore that the proposed framework effectively mitigates the performance loss caused by privacy constraints, especially for minority-class detection.
From a real-world deployment perspective, our framework addresses several critical requirements for clinical implementation. Regarding regulatory compliance, the Gaussian DP mechanism with ε = 2 (BS) and ε = 0.5 (HD) provides formal privacy guarantees that align with both HIPAA’s “minimum necessary” standard and GDPR’s “privacy by design” principles. The probability of identifying any individual from model outputs is bounded by approximately 2.0 × 10−6 for BS and 5.0 × 10−6 for HD, well below regulatory thresholds. Computationally, the overhead is minimal: training time increased by only 12.7–18.3% compared to non-private baselines (from 4.2 to 4.97 s for XGB, and 5.9 to 6.65 s for RF), while inference remains under 1.2 ms per prediction—suitable for real-time clinical decision support. Model calibration using Platt scaling achieved expected calibration errors of 0.023 (BS) and 0.018 (HD after isotonic regression), ensuring that predicted probabilities reflect true clinical risk.
Risk mitigation strategies include continuous performance monitoring for concept drift, privacy budget accounting for model updates (ε_total ≤ 20 over 5 years with semi-annual retraining), and resistance to membership inference attacks (adversarial advantage ≤3.2% for ε = 2). The economic impact is substantial: reducing breach risk below 10−5 potentially saves millions in violation costs, while early stroke detection could save an estimated $45,000 per patient in long-term disability costs. These considerations collectively demonstrate that our privacy-preserving, interpretable framework is practically viable for clinical deployment, though prospective validation across diverse populations remains essential future work.
Table 10 compares previous studies on BSP and HDP prediction with this study, focusing on dataset size, features, techniques, accuracy, PC, and XAI. Past studies mainly use DT, RF, BQSVC, and CT, with accuracy ranging from 75% to 94.2%, but they largely ignore PC and XAI. This study applies XGB and RF with and without DP, achieving higher accuracy, with RF without DP reaching 96.47%. It also uniquely integrates PC and XAI, with XGB and DP addressing both while maintaining high accuracy at 92.3%. This comparison highlights this study’s strength in balancing accuracy, privacy, and explainability, making it stand out from previous research.
Table 11 provides a detailed comparison of the key performance metrics achieved by various ML models in prior research and the current study. The table highlights that while earlier works such as Ghannam and Alwidian (DT) and Rodríguez (RF) reported high accuracy for BSP (94.2% and 91.89%, respectively), they did not incorporate DP. In contrast, this study’s differentially private models—XGB for BS (ε = 2, δ = 10−6) and RF for HD (ε = 0.5, δ = 10−5)—achieve competitive accuracy (92.3% and 95.61%, respectively) while ensuring strong privacy guarantees. Notably, the RF model for HD attains perfect specificity (100%) and high precision (97.8%), though its recall (52.14%) reflects the challenge of minority-class detection under privacy constraints, a limitation that is addressed through post hoc threshold adjustment to 70.2% recall, as indicated in the table. Complementing these findings, recent studies have explored related methodological avenues. Hussain et al. [9] demonstrated the effectiveness of robust feature selection in high-dimensional data using a hybrid SNR–Mood’s median test approach, achieving low classification error rates. Similarly, Qureshi et al. [10] applied time series models to forecast CVD mortality, with the ANNAR model yielding the lowest Mean Absolute Percentage Error (MAPE)—a measure of prediction accuracy expressed as a percentage—compared to classical methods. These contributions underscore the importance of both feature optimization and temporal modeling in advancing cardiovascular health analytics.
Table 12 offers a comprehensive summary of the models, evaluation metrics, privacy mechanisms, and explainability techniques employed across relevant studies. This table systematically compares the methodological scope of each work, illustrating that while previous studies utilized a range of algorithms—including DT, Random Forests, and Quantum ML models—they largely omitted formal privacy preservation and XAI components. In distinction, the present study not only employs advanced ensemble methods (XGB, RF, LGBM, and CAT) but also integrates differential privacy (with explicit PB and loss parameters) and dual XAI techniques (SHAP and LIME). This holistic approach underscores the research’s contribution in simultaneously addressing predictive accuracy, data privacy, and model interpretability—a combination not simultaneously featured in the compared literature.
While this study focused on leveraging tabular clinical and biomarker data for the early prediction of brain stroke and heart disease under differential privacy constraints, it is situated within a rapidly evolving landscape of AI-assisted medical diagnostics that increasingly incorporates advanced imaging modalities. In neurology, for instance, magnetic resonance imaging (MRI) and computed tomography (CT) provide detailed structural and functional insights that are critical for detecting conditions such as tumors, neurodegenerative diseases (e.g., Alzheimer’s and Parkinson’s), and acute stroke [25,26,27,28]. Our recent methodological advances—including anomalous diffusion models for characterizing tissue microstructure and hybrid deep learning frameworks such as NeuroBlend-3 for brain tumor classification—demonstrate how imaging-derived features can be harnessed with explainable AI to achieve high diagnostic accuracy [29]. Similarly, in breast oncology, SENet-enhanced deep features combined with ML classifiers have shown remarkable performance in ultrasound-based tumor detection [30]. These imaging-centric approaches complement our privacy-preserving, clinical-data-driven model by highlighting the potential of multimodal integration—where imaging biomarkers and routine clinical variables could be jointly analyzed in future work. Such integration would not only enhance predictive granularity but also bridge the gap between pre-symptomatic risk stratification and image-confirmed diagnostic validation, ultimately supporting a more comprehensive, interpretable, and clinically deployable AI-assisted diagnostic ecosystem.
The integration of recent advancements in healthcare analytics further contextualizes the contributions of the present study. Recent research has demonstrated the efficacy of robust feature selection techniques in high-dimensional health data, achieving low classification error rates with ensemble classifiers like RF and KNN. While such work focuses on optimal feature identification rather than clinical risk prediction, it underscores the critical role of feature optimization in enhancing model performance, a principle that aligns with our use of ensemble methods and PFMOO to identify optimal privacy-accuracy trade-offs. Similarly, studies applying advanced time series models for forecasting disease mortality have shown superior performance compared to classical approaches, highlighting the value of sophisticated ML techniques in understanding population health trends. Collectively, these research directions reinforce the growing importance of advanced analytical methods, whether for feature selection, temporal forecasting, or privacy-preserving classification, in advancing cardiovascular health research. Our work extends this trajectory by uniquely integrating differential privacy, multi-objective optimization, and explainable AI within a unified framework for early disease detection, thereby addressing critical gaps in both methodology and clinical applicability.

4. Conclusions

This research developed a secure and interpretable ML framework for the diagnosis of BS and HD. Its core achievement lies in balancing three critical, often competing objectives: high predictive accuracy, robust data privacy via DP, and model transparency through XAI. Using the PFMOO technique to identify optimal trade-offs, we selected XGBoost (ε = 2, δ = 10−6) for BS prediction and Random Forest (ε = 0.5, δ = 10−5) for HD detection as the best-performing models under strict privacy constraints. To contextualize this contribution, it is useful to compare this data-driven ML approach with current clinical gold standards, which rely predominantly on advanced imaging and invasive procedures. For brain stroke, clinical practice depends on neuroimaging techniques such as CT and MRI to confirm an acute stroke after symptom onset, whereas our model operates as a pre-symptomatic risk stratification tool that uses routine clinical data to enable preventive intervention, complementing rather than replacing confirmatory imaging; its accuracy under DP (92.3%) is competitive with tools such as the Framingham Stroke Risk Profile. Similarly, for heart disease, diagnostic pathways involve ECG, stress testing, and definitive coronary angiography upon symptom presentation, while our model targets the earlier, asymptomatic phase of cardiovascular risk. It demonstrates high specificity (100%) and precision (97.8%) under DP, minimizing false alarms in primary care, albeit with a trade-off in recall (52.14%) that mirrors common screening challenges and can be tuned via threshold adjustment to align with clinical priorities.
This study advances the field in three key areas where prior ML studies and standard clinical tools show limitations: firstly, by integrating a formal Differential Privacy framework—unlike previous studies that neglect data privacy—to provide quantifiable, robust privacy guarantees critical for real-world deployment; secondly, by incorporating XAI techniques (SHAP and LIME) to bridge the interpretability gap, mirroring clinician reasoning and building trust beyond the “black-box” nature of many models and pure image analysis; and thirdly, by explicitly optimizing the privacy-accuracy trade-off via PFMOO, a principled methodology absent in conventional diagnostics and prior research. The statistical validation across cross-validation folds—including narrow 95% confidence intervals (e.g., ±0.58% for BS accuracy), low coefficients of variation (<1% for primary metrics), and large effect sizes (Cohen’s d > 0.7) for pairwise model comparisons—confirms the robustness and generalizability of these findings.
Despite these advancements, limitations remain, including the use of a single Kaggle dataset, which may introduce demographic bias and limit the generalizability of our findings. The dataset’s inherent imbalances—both in class distribution and geographic representation—could affect model performance across diverse populations. Additionally, dataset constraints may not capture all biomarkers or genetic factors and the inherent performance tension introduced by DP noise, particularly for minority class detection; thus, future work will focus on expanding datasets with multimodal data (including imaging report features) to include external validation cohorts, exploring advanced DP mechanisms like DP-SGD, and conducting prospective clinical validation studies.

Author Contributions

Conceptualization, S.H.C., M.I.H. and M.M.H.; methodology, S.H.C. and M.I.H.; validation, M.M. and S.H.C.; formal analysis, S.H.C. and M.M.; investigation, A.M. and S.H.C.; resources, S.H.C., A.M. and M.I.H.; writing—original draft preparation, S.H.C. and M.M.H.; writing—review and editing, A.M. and S.H.C.; visualization, M.M.; supervision, M.M.H. and A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in our research is available in Kaggle at https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset (accessed on 31 October 2024). The code used in this study is available from the corresponding author upon reasonable request.

Acknowledgments

The authors used AI-based chat assistants to enhance the clarity and linguistic quality of the English in various sections of this manuscript. All authors have reviewed and approved this acknowledgement without any objections.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

ANNAR: Artificial Neural Network Autoregressive
AUC: Area Under the Curve
BMI: Body Mass Index
BS: Brain Stroke
BSP: Brain Stroke Prediction
CAT: Categorical Boosting
CM: Confusion Matrix
CV: Cross-Validation
DP: Differential Privacy
DT: Decision Tree
FN: False Negative
FP: False Positive
HD: Heart Disease
HDP: Heart Disease Prediction
LGBM: Light Gradient Boosting Machine
LIME: Local Interpretable Model-Agnostic Explanations
ML: Machine Learning
PB: Privacy Budget
PC: Privacy Concerns
PFMOO: Pareto Frontier Multi-Objective Optimization
POF: Probability of Failure
RF: Random Forest
ROC: Receiver Operating Characteristic
SHAP: Shapley Additive Explanations
SMOTE: Synthetic Minority Over-sampling Technique
SNR: Signal-to-Noise Ratio
TN: True Negative
TP: True Positive
XAI: Explainable Artificial Intelligence
XGB: Extreme Gradient Boosting

References

  1. Donkor, E.S. Stroke in the 21st century: A snapshot of the burden, epidemiology, and quality of life. Stroke Res. Treat. 2018, 2018, 3238165. [Google Scholar] [CrossRef]
  2. Moraes, M.D.A.; Jesus, P.A.P.D.; Muniz, L.S.; Costa, G.A.; Pereira, L.V.; Nascimento, L.M.; de Souza Teles, C.A.; Baccin, C.A.; Mussi, F.C. Ischemic stroke mortality and time for hospital arrival: Analysis of the first 90 days. Rev. Esc. Enferm. USP 2023, 57, e20220309. [Google Scholar] [CrossRef] [PubMed]
  3. Centers for Disease Control and Prevention. Heart Disease Facts and Statistics. Centers for Disease Control and Prevention. 24 October 2024. Available online: https://www.cdc.gov/heart-disease/data-research/facts-stats/index.html (accessed on 5 November 2024).
  4. Ghannam, A.; Alwidian, J. A Predictive Model of Stroke Diseases using Machine Learning Techniques. Int. J. Recent Technol. Eng. 2022, 11, 53–59. [Google Scholar] [CrossRef]
  5. Rodríguez, J.A.T. Stroke Prediction Through Data Science and Machine Learning Algorithms. Available online: https://www.researchgate.net/profile/Jose_A_Tavares/publication/352261064_Stroke_prediction_through_Data_Science_and_Machine_Learning_Algorithms/links/60c1071f92851ca6f8d6100b/Stroke-prediction-through-Data-Science-and-Machine-Learning-Algorithms.pdf (accessed on 5 November 2024).
  6. Ikpea, O.W.; Han, D. Performance of Machine Learning Algorithms for Heart Disease Prediction: Logistic Regressions Regularized by Elastic Net, SVM, Random Forests, and Neural Networks. 2022. Available online: https://www.semanticscholar.org/paper/Performance-of-Machine-Learning-Algorithms-for-by-Dr-Han/65fb6f5c92a7c987ab048e8d4b461cfe7503c9e9 (accessed on 5 November 2024).
  7. Abdulsalam, G.; Meshoul, S.; Shaiba, H. Explainable heart disease prediction using ensemble-quantum machine learning approach. Intell. Autom. Soft Comput 2023, 36, 761–779. [Google Scholar] [CrossRef]
  8. Ju, C.; Zhao, R.; Sun, J.; Wei, X.; Zhao, B.; Liu, Y.; Li, H.; Chen, T.; Zhang, X.; Gao, D.; et al. Privacy-preserving technology to help millions of people: Federated prediction model for stroke prevention. arXiv 2020, arXiv:2006.10517. [Google Scholar] [CrossRef]
  9. Hussain, I.; Qureshi, M.; Ismail, M.; Iftikhar, H.; Zywiołek, J.; López-Gonzales, J.L. Optimal features selection in the high dimensional data based on robust technique: Application to different health database. Heliyon 2024, 10, e37241. [Google Scholar] [CrossRef]
  10. Qureshi, M.; Ishaq, K.; Daniyal, M.; Iftikhar, H.; Rehman, M.Z.; Salar, S.A. Forecasting cardiovascular disease mortality using artificial neural networks in Sindh, Pakistan. BMC Public Health 2025, 25, 34. [Google Scholar] [CrossRef]
  11. Soriano, F. Stroke Prediction Dataset [Data Set]. Kaggle. 2019. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset (accessed on 31 October 2024).
  12. Rousseeuw, P.J.; Ruts, I.; Tukey, J.W. The bagplot: A bivariate boxplot. Am. Stat. 1999, 53, 382–387. [Google Scholar] [CrossRef]
  13. Gu, Z. Complex heatmap visualization. Imeta 2022, 1, e43. [Google Scholar] [CrossRef]
  14. Zhang, Z. Missing data imputation: Focusing on single imputation. Ann. Transl. Med. 2016, 4, 9. [Google Scholar] [CrossRef]
  15. Zlatić, L. An alternative for one-hot encoding in neural network models. arXiv 2023, arXiv:2311.05911. [Google Scholar] [CrossRef]
  16. Blagus, R.; Lusa, L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 2013, 14, 106. [Google Scholar] [CrossRef] [PubMed]
  17. Honkela, A.; Melkas, L. Gaussian processes with differential privacy. arXiv 2021, arXiv:2106.00474. [Google Scholar] [CrossRef]
  18. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T. Xgboost: Extreme Gradient Boosting. R Package Version 0.4-2. 2015. Available online: https://cir.nii.ac.jp/crid/1370869856033678496 (accessed on 5 November 2024).
  19. Rigatti, S.J. Random forest. J. Insur. Med. 2017, 47, 31–39. [Google Scholar] [CrossRef]
  20. Minastireanu, E.A.; Mesnita, G. Light gbm machine learning algorithm to online click fraud detection. J. Inform. Assur. Cybersecur. 2019, 2019, 263928. [Google Scholar] [CrossRef]
  21. Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for big data: An interdisciplinary review. J. Big Data 2020, 7, 94. [Google Scholar] [CrossRef]
  22. Hossain, M.M.; Swarna, R.A.; Mostafiz, R.; Shaha, P.; Pinky, L.Y.; Rahman, M.M.; Rahman, W.; Hossain, M.S.; Hossain, M.E.; Iqbal, M.S. Analysis of the performance of feature optimization techniques for the diagnosis of machine learning-based chronic kidney disease. Mach. Learn. Appl. 2022, 9, 100330. [Google Scholar] [CrossRef]
  23. Giagkiozis, I.; Fleming, P.J. Methods for multi-objective optimization: An analysis. Inf. Sci. 2015, 293, 338–350. [Google Scholar] [CrossRef]
  24. Gunning, D.; Aha, D. DARPA’s explainable artificial intelligence (XAI) program. AI Mag. 2019, 40, 44–58. [Google Scholar] [CrossRef]
  25. Goenka, N.; Tiwari, S. Deep learning for Alzheimer prediction using brain biomarkers. Artif. Intell. Rev. 2021, 54, 4827–4871. [Google Scholar] [CrossRef]
  26. Kaur, I.; Sachdeva, R. Prediction models for early detection of Alzheimer: Recent trends and future prospects. Arch. Comput. Methods Eng. 2025, 32, 3565–3592. [Google Scholar] [CrossRef]
  27. Volkmann, H.; Höglinger, G.U.; Grön, G.; Bârlescu, L.A.; Müller, H.P.; Kassubek, J.; DESCRIBE-PSP Study Group. MRI classification of progressive supranuclear palsy, Parkinson disease and controls using deep learning and machine learning algorithms for the identification of regions and tracts of interest as potential biomarkers. Comput. Biol. Med. 2025, 185, 109518. [Google Scholar] [CrossRef]
  28. Tasnim, S.; Mamun, M.; Chowdhury, S.H.; Hussain, M.I.; Hossain, M.M. Advancing Interpretable AI for Cardiovascular Risk Assessment: A Stacking Regression Approach in Clinical Data from Bangladesh. Medinformatics 2025. [Google Scholar] [CrossRef]
  29. Hussain, M.I.; Chowdhury, S.H.; Hossain, M.M.; Mamun, M. NeuroBlend-3: Hybrid Deep and Machine Learning Framework with Explainable AI for Multi-class Brain Tumor Detection Using MRI Scans. Medinformatics 2026, 3, 56–66. [Google Scholar] [CrossRef]
  30. Hussain, M.I.; Chowdhury, S.H.; Shovon, M.; Morzina, M.S.; Hossain, M.M.; Mamun, M. SENet-Augmented Explainable Deep Feature Framework with Machine Learning for Breast Tumor Detection in Ultrasound Imaging. In 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN); IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar] [CrossRef]
Figure 1. Overview of the research process for predicting BSP and HDP with data privacy and security.
Figure 2. Box plot of numerical features of the dataset.
Figure 3. Correlation heatmap of dataset features.
Figure 4. Performance metrics of ML models for BSP with POF of 10−5: (a) accuracy, (b) precision, (c) recall, (d) F1 score, and (e) specificity.
Figure 5. Performance metrics of ML models for BSP with POF of 10−6: (a) accuracy, (b) precision, (c) recall, (d) F1 score, and (e) specificity.
Figure 6. Performance metrics of ML models for HDP with POF of 10−5: (a) accuracy, (b) precision, (c) recall, (d) F1 score, and (e) specificity.
Figure 7. Performance metrics of ML models for HDP with POF of 10−6: (a) accuracy, (b) precision, (c) recall, (d) F1 score, and (e) specificity.
Figure 8. The PFMOO selection process for identifying the optimal point for (a) HDP and (b) BSP.
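The Pareto-frontier selection illustrated in Figure 8 can be sketched in a few lines. The candidate (privacy budget ε, accuracy) pairs below are hypothetical stand-ins, not the paper's actual grid: a point is kept only if no other point offers equal-or-stronger privacy with equal-or-better accuracy while being strictly better in at least one objective.

```python
def pareto_frontier(points):
    """Return the non-dominated (epsilon, accuracy) pairs.

    A point is dominated if some other point has epsilon <= its epsilon and
    accuracy >= its accuracy, with at least one of the two strictly better."""
    frontier = []
    for eps, acc in points:
        dominated = any(
            (e2 <= eps and a2 >= acc) and (e2 < eps or a2 > acc)
            for e2, a2 in points
        )
        if not dominated:
            frontier.append((eps, acc))
    return sorted(frontier)

# Hypothetical candidate models evaluated at different privacy budgets
candidates = [(0.5, 88.1), (1.0, 90.2), (2.0, 92.3), (3.0, 92.1), (5.0, 92.4)]
print(pareto_frontier(candidates))
```

Here (3.0, 92.1) is dropped because (2.0, 92.3) offers both a smaller budget and higher accuracy; the optimal operating point is then chosen among the surviving frontier points.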
Figure 9. CM for the optimal privacy parameters: (a) HDP RF model and (b) BSP XGB model.
Figure 10. ROC curve for the optimal privacy parameters: (a) HDP RF model and (b) BSP XGB model.
Figure 11. LIME analysis: (a) HDP RF model and (b) BSP XGB model.
Figure 12. SHAP analysis: (a) HDP RF model and (b) BSP XGB model.
Figure 13. Precision–recall trade-off curve and threshold balancing for HDP under DP constraints.
Table 1. Analysis of previous studies on BSP and HDP.

| Authors | Technique | Affiliate | Limitations |
|---|---|---|---|
| Ghannam and Alwidian [4] | DT | BSP | Class imbalance and lack of CV. |
| Rodríguez [5] | RF | BSP | Fails to handle outliers and lacks CV. |
| Ikpea and Han [6] | RF | HDP | Low recall value. |
| Abdulsalam et al. [7] | BQSVC | HDP | Class imbalance. |
| Ju et al. [8] | CT | BSP | Class imbalance. |
| Hussain et al. [9] | Hybrid Feature Selection (SNR + Mood's Median Test) | Gene Selection in High-Dim Data | Focused on genomic data, not clinical risk prediction for specific diseases like BS/HD. |
| Qureshi et al. [10] | Time Series (Naïve, SES, Holt, ANNAR) | HD Mortality Forecasting | Small sample size, single-region focus. |
Table 2. Details of the features of the dataset.

| Features | Details | Data Type | Metric |
|---|---|---|---|
| ID | Patient ID number | Numerical | - |
| Gender | Patient gender | Nominal | - |
| Age | Patient age | Numerical | Years |
| Heart Disease | HD (yes/no) | Nominal | - |
| Hypertension | Hypertension (yes/no) | Nominal | - |
| Ever Married | Marriage status (married/single) | Nominal | - |
| Work Type | Employment type (child, government, unemployed, private, self-employed) | Nominal | - |
| Residence | Location type (rural/urban) | Nominal | - |
| Average Glucose | Blood sugar | Numerical | mg/dL |
| BMI | Patient's body mass index | Numerical | kg/m² |
| Smoking Status | Patient smoking background (former, never, current, unknown) | Nominal | - |
| Stroke | Stroke outcome (yes/no) | Nominal | - |
Table 3. Classification metrics and their formulas.

| Metrics | Expression | Explanation |
|---|---|---|
| ACC | (TP + TN) / (TP + TN + FP + FN) × 100 | The proportion of correct predictions |
| Precision | TP / (TP + FP) × 100 | Predicted positives that are actually correct |
| Recall | TP / (TP + FN) × 100 | Actual positives that are correctly identified |
| Specificity | TN / (FP + TN) × 100 | Actual negatives that are correctly identified |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | The harmonic mean of precision and recall |
Table 4. Performance of the optimal ML models with DP (mean ± standard deviation across 10-fold cross-validation, with 95% confidence intervals in brackets).

| Metric | BS Prediction (XGB, ε = 2, δ = 10−6) | HD Prediction (RF, ε = 0.5, δ = 10−5) |
|---|---|---|
| Accuracy (%) | 92.30 ± 0.58; 95% CI [91.72, 92.88]; CoV 0.63% | 95.61 ± 0.47; 95% CI [95.13, 96.09]; CoV 0.49% |
| Precision (%) | 92.45 ± 0.62; 95% CI [91.83, 93.07]; CoV 0.67% | 97.80 ± 0.35; 95% CI [97.45, 98.15]; CoV 0.36% |
| Recall (%) | 92.28 ± 0.71; 95% CI [91.57, 92.99]; CoV 0.77% | 52.14 ± 1.98; 95% CI [50.16, 54.12]; CoV 3.80% |
| F1 Score (%) | 92.29 ± 0.64; 95% CI [91.65, 92.93]; CoV 0.69% | 52.95 ± 1.85; 95% CI [51.10, 54.80]; CoV 3.49% |
| Specificity (%) | 89.54 ± 0.93; 95% CI [88.61, 90.47]; CoV 1.04% | 100.0 ± 0.00; 95% CI [100.0, 100.0]; CoV 0.00% |

Note: SD = standard deviation across 10 folds; CI = confidence interval; CoV = coefficient of variation.
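As a reference for how the Gaussian mechanism ties its noise scale to the privacy budget ε and POF δ, here is a minimal sketch using the classic calibration σ = Δf·√(2 ln(1.25/δ))/ε. This bound is formally valid for ε < 1 (tighter analytic calibrations exist for larger budgets), and the unit sensitivity is an illustrative assumption, not a value from the paper.

```python
import math
import random

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng=random):
    """Release value plus N(0, sigma^2) noise, with sigma set by the classic
    Gaussian-mechanism calibration (formally valid for epsilon < 1)."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return value + rng.gauss(0.0, sigma), sigma

# Example: privacy budget epsilon = 0.5, POF delta = 1e-5, unit sensitivity (assumed)
noisy_value, sigma = gaussian_mechanism(42.0, sensitivity=1.0, epsilon=0.5, delta=1e-5)
```

Smaller ε or smaller δ both increase σ, which is the mechanism behind the accuracy degradation traded off in the PFMOO step.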
Table 5. Pairwise comparisons (best model vs. second best).

| Comparison | Mean Difference | t-Statistic | p-Value (t-Test) | p-Value (Wilcoxon) | Cohen's d | Effect Size |
|---|---|---|---|---|---|---|
| BS: XGB vs. LGBM | 0.60% | 4.32 | 0.0019 | 0.0021 | 0.89 | Large |
| HD: RF vs. LGBM | 0.41% | 3.87 | 0.0037 | 0.0039 | 0.76 | Large |
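The paired statistics in Table 5 follow directly from the per-fold differences. The sketch below uses hypothetical fold differences (stand-ins for the actual 10-fold results) and only the standard library; SciPy's `stats.ttest_rel` and `stats.wilcoxon` would supply the corresponding p-values.

```python
import math
from statistics import mean, stdev

# Hypothetical per-fold accuracy differences (best minus second best, 10-fold CV)
diffs = [0.5, 0.7, 0.4, 0.8, 0.6, 0.5, 0.7, 0.6, 0.5, 0.7]

n = len(diffs)
mean_diff = mean(diffs)
sd = stdev(diffs)                            # sample SD of the paired differences
t_stat = mean_diff / (sd / math.sqrt(n))     # paired t-statistic, df = n - 1
cohens_d = mean_diff / sd                    # effect size for paired samples

print(f"mean diff {mean_diff:.2f}%, t = {t_stat:.2f}, d = {cohens_d:.2f}")
```

A Cohen's d above 0.8 is conventionally labeled "large", which is the criterion the Effect Size column applies.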
Table 6. Performance of the heart disease model (RF, ε = 0.5, δ = 10−5) at different decision thresholds, including the area under the ROC curve (AUC).

| Threshold | TP | FN | FP | TN | Accuracy | Precision | Recall | Specificity | Balanced Accuracy | MCC | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 48 | 44 | 1 | 929 | 95.61% | 97.8% | 52.14% | 99.89% | 76.02% | 0.715 | 0.87 |
| 0.38 | 59 | 33 | 5 | 925 | 96.28% | 92.1% | 64.50% | 99.46% | 81.98% | 0.751 | 0.87 |
| 0.3 | 65 | 27 | 11 | 919 | 96.28% | 85.4% | 70.20% | 98.82% | 84.51% | 0.758 | 0.87 |
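The balanced accuracy and MCC columns derive from the same four counts as the other metrics. A small sketch (the function name is ours), applied to the threshold-0.5 row; the balanced accuracy matches the reported value up to rounding.

```python
import math

def balanced_acc_and_mcc(tp, fn, fp, tn):
    """Balanced accuracy and Matthews correlation coefficient from confusion counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_acc = (recall + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return balanced_acc, mcc

# Threshold-0.5 row of Table 6: TP=48, FN=44, FP=1, TN=929
ba, mcc = balanced_acc_and_mcc(tp=48, fn=44, fp=1, tn=929)
```

Because MCC weighs all four cells, it rewards the lower thresholds that recover minority-class recall even as precision drops, which is why 0.3 scores highest in the table.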
Table 7. Quantitative feature importance rankings, stability under DP, and clinical interpretation for optimal privacy-preserving models.

| Prediction Task | Optimal Model | Rank | Feature | Mean SHAP Value | Mean LIME Contribution | Rank Variance | Clinical Interpretation of Top Features |
|---|---|---|---|---|---|---|---|
| BS | XGB (ε = 2, δ = 10−6) | 1 | Age | 0.324 | +0.215 | Very High (σ² = 0.11) | Advanced age is the single most critical non-modifiable risk factor, reflecting cumulative vascular damage and increased susceptibility. |
| | | 2 | Avg. Glucose Level | 0.187 | +0.142 | High (σ² = 0.23) | Elevated blood sugar, often indicating diabetes or prediabetes, is a major contributor to vascular endothelial dysfunction and atherosclerosis. |
| | | 3 | Hypertension | 0.156 | +0.108 | High (σ² = 0.19) | Chronic high blood pressure is a primary driver of cerebrovascular damage, including small vessel disease and increased risk of vessel rupture or blockage. |
| | | 4 | BMI | 0.098 | +0.065 | Moderate (σ² = 0.35) | Higher BMI, associated with obesity, exacerbates hypertension, diabetes, and systemic inflammation, increasing stroke risk. |
| | | 5 | Heart Disease | 0.076 | +0.051 | Moderate (σ² = 0.41) | A history of heart disease (e.g., atrial fibrillation) is a known embolic source for ischemic strokes, confirming the link between cardiac and cerebrovascular health. |
| HD | RF (ε = 0.5, δ = 10−5) | 1 | Age | 0.291 | +0.188 | Very High (σ² = 0.09) | Age is the bedrock of cardiovascular risk, with arterial stiffness and plaque burden increasing progressively over a lifetime. |
| | | 2 | Avg. Glucose Level | 0.203 | +0.151 | High (σ² = 0.21) | Hyperglycemia's high rank underscores its direct role in promoting coronary artery disease through oxidative stress and glycation of vessel walls. |
| | | 3 | Hypertension | 0.142 | +0.095 | High (σ² = 0.18) | High blood pressure is a primary mechanism for left ventricular hypertrophy and the progression of coronary atherosclerosis. |
| | | 4 | BMI | 0.089 | +0.058 | Moderate (σ² = 0.33) | Elevated BMI's contribution reflects its strong correlation with metabolic syndrome, a cluster of conditions (including hypertension and dyslipidemia) that directly increase heart disease risk. |
| | | 5 | Smoking Status | 0.067 | +0.044 | Moderate (σ² = 0.44) | Smoking's presence in the top five confirms its potent role in promoting endothelial injury, inflammation, and thrombus formation, which are central to heart disease pathogenesis. |
Table 8. Performance of ML models without DP.

| Task | Metric | XGB | RF | LGBM | CAT |
|---|---|---|---|---|---|
| BSP | Accuracy (%) | 94.44 ± 0.51 [94.06, 94.82] | 93.87 ± 0.62 [93.39, 94.35] | 94.19 ± 0.55 [93.76, 94.62] | 93.68 ± 0.68 [93.14, 94.22] |
| | Precision (%) | 94.49 ± 0.53 [94.09, 94.89] | 93.92 ± 0.59 [93.46, 94.38] | 94.22 ± 0.57 [93.78, 94.66] | 93.72 ± 0.71 [93.16, 94.28] |
| | Recall (%) | 94.43 ± 0.58 [94.00, 94.86] | 93.86 ± 0.64 [93.36, 94.36] | 94.18 ± 0.61 [93.72, 94.64] | 93.68 ± 0.73 [93.10, 94.26] |
| | Specificity (%) | 92.95 ± 0.79 [92.36, 93.54] | 92.22 ± 0.88 [91.55, 92.89] | 93.15 ± 0.74 [92.59, 93.71] | 92.36 ± 0.92 [91.66, 93.06] |
| | F1 Score (%) | 94.44 ± 0.52 [94.05, 94.83] | 93.86 ± 0.63 [93.37, 94.35] | 94.19 ± 0.56 [93.76, 94.62] | 93.68 ± 0.70 [93.13, 94.23] |
| HDP | Accuracy (%) | 96.09 ± 0.44 [95.77, 96.41] | 96.47 ± 0.38 [96.19, 96.75] | 96.29 ± 0.41 [95.96, 96.62] | 96.14 ± 0.48 [95.78, 96.50] |
| | Precision (%) | 80.87 ± 1.92 [79.43, 82.31] | 90.25 ± 1.45 [89.13, 91.37] | 84.89 ± 1.78 [83.52, 86.26] | 85.24 ± 1.83 [83.86, 86.62] |
| | Recall (%) | 66.05 ± 2.34 [64.29, 67.81] | 64.26 ± 2.18 [62.62, 65.90] | 64.93 ± 2.27 [63.25, 66.61] | 62.24 ± 2.51 [60.35, 64.13] |
| | Specificity (%) | 99.12 ± 0.31 [98.89, 99.35] | 99.73 ± 0.19 [99.59, 99.87] | 99.46 ± 0.26 [99.26, 99.66] | 99.57 ± 0.23 [99.39, 99.75] |
| | F1 Score (%) | 70.77 ± 2.11 [69.17, 72.37] | 70.45 ± 2.05 [68.90, 72.00] | 70.41 ± 2.08 [68.84, 71.98] | 67.52 ± 2.42 [65.68, 69.36] |
Table 9. Ablation study results for BS and HD prediction under different configurations.

| Task | Configuration | Model | Accuracy (%) | Accuracy σ (%) | Recall (%) | Recall σ (%) |
|---|---|---|---|---|---|---|
| BS | No SMOTE, No DP | XGB | 85.3 | 1.21 | 28.7 | 3.45 |
| BS | SMOTE only | XGB | 94.4 | 0.58 | 94.4 | 0.61 |
| BS | DP only (ε = 2, δ = 10−6) | XGB | 83.1 | 1.43 | 26.3 | 4.02 |
| BS | SMOTE + DP | XGB | 92.3 | 0.71 | 92.3 | 0.74 |
| HD | No SMOTE, No DP | RF | 89.7 | 0.95 | 31.2 | 2.88 |
| HD | SMOTE only | RF | 96.5 | 0.49 | 64.3 | 1.55 |
| HD | DP only (ε = 0.5, δ = 10−5) | RF | 88.4 | 1.12 | 28.9 | 3.21 |
| HD | SMOTE + DP | RF | 95.6 | 0.63 | 52.1 | 1.98 |

Note: Accuracy σ and recall σ denote the standard deviation of the respective metric across the 10 cross-validation folds.
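The SMOTE step in the ablation can be illustrated by its core interpolation rule: each synthetic minority sample lies on the segment between a minority point and one of its minority-class neighbors. This is a didactic sketch, not the library implementation typically used in practice, and the (age, glucose, BMI) values are made up for illustration.

```python
import random

def smote_sample(x, neighbor, rng=random):
    """Generate one SMOTE-style synthetic minority sample by interpolating
    between a minority point and one of its minority-class neighbors."""
    u = rng.random()  # uniform in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
x, nb = [67.0, 228.69, 36.6], [61.0, 202.21, 33.1]  # e.g., (age, glucose, BMI)
synth = smote_sample(x, nb, rng)
# Every coordinate of the synthetic sample lies between the two parent values
assert all(min(a, b) <= s <= max(a, b) for a, b, s in zip(x, nb, synth))
```

Because the synthetic points are interpolations rather than copies, the oversampled minority class gains plausible variety, which is what drives the large recall gains in the SMOTE rows of Table 9.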
Table 10. Comparison of previous studies with our study.

| Authors | Samples | Features | Technique | Affiliate | Accuracy | PC | XAI |
|---|---|---|---|---|---|---|---|
| Ghannam and Alwidian [4] | 5110 | 11 | DT | BSP | 94.2% | NO | NO |
| Rodríguez [5] | 5110 | 11 | RF | BSP | 92.32% | NO | NO |
| Ikpea and Han [6] | 5110 | 11 | RF | HDP | 75% | NO | NO |
| Abdulsalam et al. [7] | 303 | 14 | BQSVC | HDP | 90.16% | NO | YES |
| Ju et al. [8] | - | 119 | CT | BSP | 81.4% | YES | NO |
| Hussain et al. [9] | 10937, 5470, 12534 | 413, 76, 148 | RF | Identifies optimal, significant genes | - | NO | NO |
| Qureshi et al. [10] | 23 (years 1999–2021) | 1 (death cases) + time | ANNAR | HD Mortality | - | NO | NO |
| This Study | 5110 | 11 | XGB with DP | BSP | 92.3% | YES | YES |
| | | | RF with DP | HDP | 95.61% | YES | YES |
| | | | XGB without DP | BSP | 94.44% | NO | NO |
| | | | RF without DP | HDP | 96.47% | NO | NO |
Table 11. Comparative performance metrics of ML models for brain stroke and heart disease prediction across studies.

| Study | Best Model | Accuracy | Precision | Recall | F1 Score | Specificity | Error Rate | MAPE |
|---|---|---|---|---|---|---|---|---|
| [4] | DT | 94.2% | 83.2% | 86.8% | 84.9% | 95.9% | - | - |
| [5] | RF | 91.89% | 92.0% | - | - | - | - | - |
| [6] | RF | 75% | 12% | 59% | ~0.20 | 76% | - | - |
| [7] | BQSVC | 90.16% | 90% | 90% | 90% | - | - | - |
| [8] | CT | - | - | - | - | - | - | - |
| [9] | RF | - | - | - | - | - | 0.004 | - |
| [10] | ANNAR | - | - | - | - | - | - | 13.08 |
| This Study | RF (HDP, DP) | 95.61% | 97.8% | 70.2% (after threshold adjustment) | 52.95% | 100% | - | - |
| | XGB (BSP, DP) | 92.3% | 92.45% | 92.28% | 92.29% | 89.54% | - | - |
Table 12. Methodological overview and comparative analysis of techniques in brain stroke and heart disease prediction studies.

(Checkmark matrix comparing studies [4,5,6,7,8,9,10] with this study across: models used (RF, Neural Networks, XGB, Support Vector Machine, Logistic Regression, Decision Tree, k-Nearest Neighbor, Naïve Bayes, Quantum ML Models, LGBM, CAT, Federated Learning Models); performance metrics reported (Accuracy, Precision, Recall, F1 Score, Specificity, Error); DP parameters (PB, POF); and XAI methods (SHAP, LIME). The per-study check marks appear in the published table.)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hussain, M.I.; Munir, A.; Chowdhury, S.H.; Mamun, M.; Hossain, M.M. Multi-Objective Optimized Differential Privacy with Interpretable Machine Learning for Brain Stroke and Heart Disease Diagnosis. Algorithms 2026, 19, 260. https://doi.org/10.3390/a19040260

