A Novel Approach Utilizing Bagging, Histogram Gradient Boosting, and Advanced Feature Selection for Predicting the Onset of Cardiovascular Diseases
Abstract
1. Introduction
- Development of an Innovative Prediction Model. We propose a CVD prediction model using the bagging algorithm with histogram gradient boosting as the estimator.
- Implementation of Advanced Techniques. This study utilizes the local outlier factor (LOF) for outlier detection and information gain (IG) for feature selection (FS), combined with GridSearchCV hyperparameter tuning and stratified k-fold cross-validation (CV) for optimal model performance.
- Filling Existing Research Gaps. This research addresses the underutilization of the bagging algorithm, particularly with HGB, in the context of CVD prediction, thereby providing a novel approach to enhancing early detection and prevention efforts.
2. Materials and Methods
2.1. Data Sources
- Dataset I: The first data source is the CVD dataset sourced from Kaggle [36]. This dataset encompasses 70,000 samples, consisting of 34,979 positive and 35,021 negative instances. It includes 11 features: “age, height, weight, gender, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking habits, alcohol consumption, and physical activity”. For our analysis, we converted the age from days to years. Notably, the dataset contains no missing values, and all features are considered significant risk factors for CVDs.
- Dataset II: The second dataset, an updated version of dataset I, presents two primary distinctions. First, it includes a body mass index (BMI) feature calculated from the weight and height attributes (a derivation sketch follows this list). Consequently, the original weight and height features were excluded after deriving the BMI. Therefore, dataset II comprises 10 features: “age, BMI, gender, systolic blood pressure, diastolic blood pressure, cholesterol, glucose, smoking habits, alcohol consumption, and physical activity”.
- Dataset III: The third dataset is the Cardiovascular Study Dataset, also procured from Kaggle [37]. This dataset contains 3390 samples, with 511 classified as positive cases and 2879 as negative. Due to the significant imbalance in sample sizes, we first eliminated missing values and balanced the dataset by randomly selecting additional positive samples from dataset I. This balancing process resulted in a final distribution of 2600 positive and 2600 negative samples. We ensured uniformity in feature values by converting similar units. Ultimately, dataset III comprises eight features, “age, sex, smoking status, total cholesterol, systolic blood pressure, diastolic blood pressure, BMI, and glucose”, with no missing values present.
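The BMI feature in dataset II follows the standard definition $\mathrm{BMI} = \text{weight (kg)} / \text{height (m)}^2$. A minimal pandas sketch of the derivation; the file name and separator are assumptions based on the Kaggle source [36]:

```python
import pandas as pd

# Load the Kaggle CVD dataset; the file name and separator are assumptions.
df = pd.read_csv("cardio_train.csv", sep=";")

# BMI = weight (kg) / height (m)^2; height is recorded in centimeters.
df["BMI"] = df["weight"] / (df["height"] / 100) ** 2

# Dataset II drops the raw anthropometric features once BMI is derived.
df = df.drop(columns=["height", "weight"])
```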
2.2. Proposed Method
2.3. Data Redundancy Elimination Technique
- Check all columns for duplicates: the function iterates through each row of the dataset, comparing its values to every other row, considering all columns simultaneously.
- Define duplicates: if a row’s values are identical to another row’s values in all columns, that row is considered a duplicate.
- Remove duplicates: only the first occurrence of each unique row is kept, and all subsequent duplicates are removed from the dataset (a pandas sketch of this procedure follows this list).
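This procedure matches the default behavior of pandas’ `DataFrame.drop_duplicates` (cited in the references), which compares all columns and keeps the first occurrence:

```python
import pandas as pd

def remove_redundant_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    # subset=None (the default) compares all columns simultaneously;
    # keep="first" retains the first occurrence, as in the steps above.
    return df.drop_duplicates(keep="first").reset_index(drop=True)
```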
2.4. Local Outlier Factor-Based Outlier Removal
- Measure the distance from point P to every other point in the dataset using a distance metric, either Euclidean or Manhattan.
- Determine the k-distance of P, i.e., the distance from P to its k-th nearest neighbor (for example, the distance to the third-nearest neighbor when k = 3).
- Identify the k nearest points, which form the neighborhood $N_k(P)$.
- Compute the local reachability density (LRD), where $\text{reach-dist}_k(P, O) = \max\{k\text{-distance}(O), d(P, O)\}$, using the following formula: $\mathrm{LRD}_k(P) = \left( \frac{\sum_{O \in N_k(P)} \text{reach-dist}_k(P, O)}{|N_k(P)|} \right)^{-1}$
- Determine the LOF as follows, where values substantially greater than 1 indicate outliers (a scikit-learn sketch follows this list): $\mathrm{LOF}_k(P) = \frac{1}{|N_k(P)|} \sum_{O \in N_k(P)} \frac{\mathrm{LRD}_k(O)}{\mathrm{LRD}_k(P)}$
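A minimal scikit-learn sketch of LOF-based outlier removal on the training split; the neighborhood size shown is scikit-learn’s default, since the paper’s exact setting is not restated in this outline:

```python
from sklearn.neighbors import LocalOutlierFactor

def remove_lof_outliers(X_train, y_train, n_neighbors=20):
    """Drop training samples flagged as outliers by the local outlier factor."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    labels = lof.fit_predict(X_train)  # +1 = inlier, -1 = outlier
    mask = labels == 1
    return X_train[mask], y_train[mask]
```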
2.5. Information Gain-Based Feature Selection
- Compute the entropy value of the class labels over a sample set S: $H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$, where $p_i$ is the proportion of samples in S belonging to class i.
- Compute the information gain value of each feature A: $IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$, where $S_v$ is the subset of S for which A takes the value v; features are then ranked by their IG scores (a worked sketch follows this list).
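A minimal Python sketch of these two steps; continuous attributes (e.g., age or blood pressure) would need to be discretized into bins before the `groupby` is meaningful, and the label column name is an assumption:

```python
import numpy as np
import pandas as pd

def entropy(y: pd.Series) -> float:
    """Shannon entropy H(S) = -sum(p_i * log2(p_i)) over class proportions."""
    p = y.value_counts(normalize=True).to_numpy()
    return float(-np.sum(p * np.log2(p)))

def information_gain(df: pd.DataFrame, feature: str, target: str) -> float:
    """IG(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    h_total = entropy(df[target])
    h_conditional = sum(
        (len(group) / len(df)) * entropy(group[target])
        for _, group in df.groupby(feature)
    )
    return h_total - h_conditional

# Rank features by IG against the CVD label ("cardio" is an assumed name):
# scores = {f: information_gain(df, f, "cardio") for f in df.columns if f != "cardio"}
```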
2.6. Histogram Gradient Boosting
2.7. Bootstrap Aggregating Algorithm
Algorithm 1 The Bagging EL
Input: training-input set: TS = {(x_i, y_i)}, i = 1, …, N; label-output set: Y = {y_1, …, y_C}; h: base classifier; T: iteration step.
Output: H: final classifier.
for t = 1 to T do
  Draw a bootstrap sample TS_t of size N from TS by sampling uniformly with replacement.
  Train the base classifier h_t on TS_t.
end for
The final classifier combines the T members by majority vote:

$$H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I\left(h_t(x) = y\right) \quad (12)$$

where the indicator function $I(\cdot)$ is calculated with:

$$I(a) = \begin{cases} 1, & \text{if } a \text{ is true} \\ 0, & \text{otherwise} \end{cases} \quad (13)$$
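A minimal sketch of this ensemble as configured in the proposed method, pairing scikit-learn’s `BaggingClassifier` with `HistGradientBoostingClassifier` as the base estimator. The hyperparameter values shown are the dataset I settings reported in Section 3.5; `X_train`/`y_train` are assumed to come from the preprocessing steps above, and the `estimator` keyword requires scikit-learn >= 1.2:

```python
from sklearn.ensemble import BaggingClassifier, HistGradientBoostingClassifier

# Base learner: histogram gradient boosting (dataset I tuned values).
base = HistGradientBoostingClassifier(
    max_iter=105,
    max_depth=None,
    l2_regularization=0.0,
    learning_rate=0.1,
    random_state=0,
)

# Bagging ensemble of T = 20 bootstrap-trained HGB classifiers.
model = BaggingClassifier(
    estimator=base,
    n_estimators=20,
    max_samples=1.0,    # each bootstrap sample matches the training-set size
    max_features=1.0,   # every ensemble member sees all features
    random_state=0,
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```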
2.8. Stratified K-Fold CV
Algorithm 2 Fold generation of Stratified K-Fold CV
Require: number of folds, k; classes, c.
Ensure: generated folds, F_1, …, F_k.
for i := 1 to k do
  F_i := ∅
  for j := 1 to c do
    S := pick n_j / k samples randomly, without replacement, from the remaining samples of class j
    F_i := F_i ∪ S
  end for
end for
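In practice, this fold generation corresponds to scikit-learn’s `StratifiedKFold`. A minimal sketch, assuming a feature matrix `X` and labels `y` as NumPy arrays and k = 10 (the exact fold count is not restated in this outline):

```python
from sklearn.model_selection import StratifiedKFold

# Each fold preserves the class proportions of y.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for train_idx, valid_idx in skf.split(X, y):
    X_tr, X_va = X[train_idx], X[valid_idx]
    y_tr, y_va = y[train_idx], y[valid_idx]
    # fit the model on (X_tr, y_tr) and evaluate on (X_va, y_va) here
```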
2.9. GridSearchCV Hyperparameter Tuning Technique
Algorithm 3 The GridSearchCV Algorithm
Step 1: Set the parameters.
Step 2: Create a “parameter grid”.
Step 3: Develop a “base model”.
Step 4: Configure the parameters for the grid search model.
Step 5: Execute the grid search using training features and labels.
Step 6: Identify the optimal grid.
Step 7: Select the optimal parameters.
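A minimal sketch of these steps with scikit-learn’s `GridSearchCV`. The candidate lists below are illustrative assumptions that span the tuned HGB values reported in Section 3.5, and the accuracy scoring choice is likewise an assumption:

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Steps 1-2: parameters to tune and their candidate grid (illustrative).
param_grid = {
    "max_iter": [100, 105, 700],
    "max_depth": [None, 5],
    "l2_regularization": [0.0, 0.5, 1.0],
    "learning_rate": [0.01, 0.1],
}

base_model = HistGradientBoostingClassifier(random_state=0)  # Step 3

search = GridSearchCV(                                       # Step 4
    estimator=base_model,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring="accuracy",
    n_jobs=-1,
)

search.fit(X_train, y_train)  # Step 5
print(search.best_params_)    # Steps 6-7: optimal grid and parameters
```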
2.10. Performance Evaluation Metrics
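The results in Section 3 report precision, recall, F1 score, accuracy, and AUC. A minimal scikit-learn sketch, assuming the fitted `model` and test split from the sketches above; note that AUC is computed from predicted probabilities rather than hard labels:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probability

metrics = {
    "Precision": precision_score(y_test, y_pred),
    "Recall": recall_score(y_test, y_pred),
    "F1 Score": f1_score(y_test, y_pred),
    "Accuracy": accuracy_score(y_test, y_pred),
    "AUC": roc_auc_score(y_test, y_prob),
}
print(metrics)
```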
3. Results and Discussion
3.1. Experimental Settings
3.2. Data Redundancy Elimination
3.3. Outlier Detection and Elimination
3.4. Information Gain and Chi-Square Scores for Feature Selection
3.5. The Performance of the Proposed Method Before and After Hyperparameter Tuning
3.6. Comparison with Previous Works
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- World Heart Report 2023: Confronting The World’s Number One Killer. Available online: https://world-heart-federation.org/wp-content/uploads/World-Heart-Report-2023.pdf (accessed on 24 March 2025).
- Cardiovascular Diseases. Available online: https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1 (accessed on 24 March 2025).
- Cardiovascular Diseases Kill 10,000 People in the WHO European Region Every Day, with Men Dying more Frequently than Women. Available online: https://www.who.int/europe/news/item/15-05-2024-cardiovascular-diseases-kill-10-000-people-in-the-who-european-region-every-day--with-men-dying-more-frequently-than-women (accessed on 24 March 2025).
- Cardiovascular Diseases. Available online: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds) (accessed on 24 March 2025).
- Khan, A.; Qureshi, M.; Daniyal, M.; Tawiah, K. A Novel Study on Machine Learning Algorithm-Based Cardiovascular Disease Prediction. Health Soc. Care Community 2023, 2023, 1406060.
- Mandava, M.; Reddy Vinta, S. MDensNet201-IDRSRNet: Efficient Cardiovascular Disease Prediction System Using Hybrid Deep Learning. Biomed. Signal Process. Control 2024, 93, 106147.
- El-Sofany, H.; Bouallegue, B.; El-Latif, Y.M.A. A Proposed Technique for Predicting Heart Disease Using Machine Learning Algorithms and an Explainable AI Method. Sci. Rep. 2024, 14, 23277.
- Peng, M.; Hou, F.; Cheng, Z.; Shen, T.; Liu, K.; Zhao, C.; Zheng, W. Prediction of Cardiovascular Disease Risk Based on Major Contributing Features. Sci. Rep. 2023, 13, 4778.
- Krive, J.; Chertok, D. Advancing Cardiovascular Disease Prediction Machine Learning Models With Psychological Factors. JACC Adv. 2024, 3, 101185.
- Dorraki, M.; Liao, Z.; Abbott, D.; Psaltis, P.J.; Baker, E.; Bidargaddi, N.; Wardill, H.R.; Van Den Hengel, A.; Narula, J.; Verjans, J.W. Improving Cardiovascular Disease Prediction With Machine Learning Using Mental Health Data. JACC Adv. 2024, 3, 101180.
- Theerthagiri, P. Predictive Analysis of Cardiovascular Disease Using Gradient Boosting Based Learning and Recursive Feature Elimination Technique. Intell. Syst. Appl. 2022, 16, 200121.
- Budholiya, K.; Shrivastava, S.K.; Sharma, V. An Optimized XGBoost Based Diagnostic System for Effective Prediction of Heart Disease. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 4514–4523.
- Lee, J.; Choi, Y.; Ko, T.; Lee, K.; Shin, J.; Kim, H.-S. Prediction of Cardiovascular Complication in Patients with Newly Diagnosed Type 2 Diabetes Using an XGBoost/GRU-ODE-Bayes-Based Machine-Learning Algorithm. Endocrinol. Metab. 2024, 39, 176–185.
- Feng, M.; Wang, X.; Zhao, Z.; Jiang, C.; Xiong, J.; Zhang, N. Enhanced Heart Attack Prediction Using eXtreme Gradient Boosting. J. Theory Pract. Eng. Sci. 2024, 4, 9–16.
- Jamimi, H.A.A. Early Prediction of Heart Disease Risk Using Extreme Gradient Boosting: A Data-Driven Analysis. Int. J. Biomed. Eng. Technol. 2024, 45, 296–313.
- Nematollahi, M.A.; Jahangiri, S.; Asadollahi, A.; Salimi, M.; Dehghan, A.; Mashayekh, M.; Roshanzamir, M.; Gholamabbas, G.; Alizadehsani, R.; Bazrafshan, M.; et al. Body Composition Predicts Hypertension Using Machine Learning Methods: A Cohort Study. Sci. Rep. 2023, 13, 6885.
- Zhao, H.; Zhang, X.; Xu, Y.; Gao, L.; Ma, Z.; Sun, Y.; Wang, W. Predicting the Risk of Hypertension Based on Several Easy-to-Collect Risk Factors: A Machine Learning Method. Front. Public Health 2021, 9, 619429.
- Chandramouli, A.; Hyma, V.R.; Tanmayi, P.S.; Santoshi, T.G.; Priyanka, B. Diabetes Prediction Using Hybrid Bagging Classifier. Entertain. Comput. 2023, 47, 100593.
- Xu, H.; Zhang, L.; Li, P.; Zhu, F. Outlier Detection Algorithm Based on K-Nearest Neighbors-Local Outlier Factor. J. Algorithms Comput. Technol. 2022, 16, 17483026221078111.
- Adesh, A.; Shobha, G.; Shetty, J.; Xu, L. Local Outlier Factor for Anomaly Detection in HPCC Systems. J. Parallel Distrib. Comput. 2024, 192, 104923.
- Alghushairy, O.; Alsini, R.; Soule, T.; Ma, X. A Review of Local Outlier Factor Algorithms for Outlier Detection in Big Data Streams. Big Data Cogn. Comput. 2020, 5, 1.
- Syafrudin, M.; Fitriyani, N.L.; Alfian, G.; Rhee, J. An Affordable Fast Early Warning System for Edge Computing in Assembly Line. Appl. Sci. 2018, 9, 84.
- Qu, K.; Xu, J.; Hou, Q.; Qu, K.; Sun, Y. Feature Selection Using Information Gain and Decision Information in Neighborhood Decision System. Appl. Soft Comput. 2023, 136, 110100.
- Fitriyani, N.L.; Syafrudin, M.; Alfian, G.; Rhee, J. Development of Disease Prediction Model Based on Ensemble Learning Approach for Diabetes and Hypertension. IEEE Access 2019, 7, 144777–144789.
- Syafrudin, M.; Alfian, G.; Fitriyani, N.L.; Rhee, J. Performance Analysis of IoT-Based Sensor, Big Data Processing, and Machine Learning Model for Real-Time Monitoring System in Automotive Manufacturing. Sensors 2018, 18, 2946.
- Ijaz, M.; Alfian, G.; Syafrudin, M.; Rhee, J. Hybrid Prediction Model for Type 2 Diabetes and Hypertension Using DBSCAN-Based Outlier Detection, Synthetic Minority Over Sampling Technique (SMOTE), and Random Forest. Appl. Sci. 2018, 8, 1325.
- Elgeldawi, E.; Sayed, A.; Galal, A.R.; Zaki, A.M. Hyperparameter Tuning for Machine Learning Algorithms Used for Arabic Sentiment Analysis. Informatics 2021, 8, 79.
- Muzayanah, R.; Pertiwi, D.A.A.; Ali, M.; Muslim, M.A. Comparison of Gridsearchcv and Bayesian Hyperparameter Optimization in Random Forest Algorithm for Diabetes Prediction. J. Soft Comput. Explor. 2024, 5, 86–91.
- Ahamad, G.N.; Shafiullah; Fatima, H.; Imdadullah; Zakariya, S.M.; Abbas, M.; Alqahtani, M.S.; Usman, M. Influence of Optimal Hyperparameters on the Performance of Machine Learning Algorithms for Predicting Heart Disease. Processes 2023, 11, 734.
- Zhao, Y.; Zhang, W.; Liu, X. Grid Search with a Weighted Error Function: Hyper-Parameter Optimization for Financial Time Series Forecasting. Appl. Soft Comput. 2024, 154, 111362.
- Ogunsanya, M.; Isichei, J.; Desai, S. Grid Search Hyperparameter Tuning in Additive Manufacturing Processes. Manuf. Lett. 2023, 35, 1031–1042.
- Balamurali, A.; Kumar, K.V. Early Detection and Classification of Type-2 Diabetes Using Stratified k-Fold Validation. In Proceedings of the 2024 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES), Chennai, India, 12–13 December 2024; IEEE: Chennai, India, 2024; pp. 1–6.
- Tougui, I.; Jilbab, A.; Mhamdi, J.E. Impact of the Choice of Cross-Validation Techniques on the Results of Machine Learning-Based Diagnostic Applications. Healthc. Inform. Res. 2021, 27, 189–199.
- Szeghalmy, S.; Fazekas, A. A Comparative Study of the Use of Stratified Cross-Validation and Distribution-Balanced Stratified Cross-Validation in Imbalanced Learning. Sensors 2023, 23, 2333.
- Mahesh, T.R.; Vinoth Kumar, V.; Dhilip Kumar, V.; Geman, O.; Margala, M.; Guduri, M. The Stratified K-Folds Cross-Validation and Class-Balancing Methods with High-Performance Ensemble Classifiers for Breast Cancer Classification. Healthc. Anal. 2023, 4, 100247.
- Cardiovascular Disease Dataset. Available online: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset (accessed on 20 August 2024).
- Cardiovascular Study Dataset. Available online: https://www.kaggle.com/datasets/christofel04/cardiovascular-study-dataset-predict-heart-disea (accessed on 20 August 2024).
- pandas.DataFrame.drop_duplicates. Available online: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html (accessed on 20 May 2025).
- Breunig, M.M.; Kriegel, H.-P.; Ng, R.T.; Sander, J. LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Rec. 2000, 29, 93–104.
- Agrawal, P.V.; Kshirsagar, D.D. Information Gain-Based Feature Selection Method in Malware Detection for MalDroid2020. In Proceedings of the 2022 International Conference on Smart Technologies and Systems for Next Generation Computing (ICSTSN), Villupuram, India, 25–26 March 2022; IEEE: Villupuram, India, 2022; pp. 1–5.
- Tamim Kashifi, M.; Ahmad, I. Efficient Histogram-Based Gradient Boosting Approach for Accident Severity Prediction With Multisource Data. Transp. Res. Rec. J. Transp. Res. Board 2022, 2676, 236–258.
- Tuv, E.; Borisov, A.; Runger, G.; Torkkola, K. Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination. J. Mach. Learn. Res. 2009, 10, 1341–1366.
- Rao, H.; Shi, X.; Rodrigue, A.K.; Feng, J.; Xia, Y.; Elhoseny, M.; Yuan, X.; Gu, L. Feature Selection Based on Artificial Bee Colony and Gradient Boosting Decision Tree. Appl. Soft Comput. 2019, 74, 634–642.
- Devos, L.; Meert, W.; Davis, J. Fast Gradient Boosting Decision Trees with Bit-Level Data Structures. In Machine Learning and Knowledge Discovery in Databases; Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2020; Volume 11906, pp. 590–606. ISBN 978-3-030-46149-2.
- Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2009; ISBN 978-0-387-84857-0.
- Nhat-Duc, H.; Van-Duc, T. Comparison of Histogram-Based Gradient Boosting Classification Machine, Random Forest, and Deep Convolutional Neural Network for Pavement Raveling Severity Classification. Autom. Constr. 2023, 148, 104767.
- Hossain, S.M.M.; Deb, K. Plant Leaf Disease Recognition Using Histogram Based Gradient Boosting Classifier. In Intelligent Computing and Optimization; Vasant, P., Zelinka, I., Weber, G.-W., Eds.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2021; Volume 1324, pp. 530–545. ISBN 978-3-030-68153-1.
- Features in Histogram Gradient Boosting Trees. Available online: https://scikit-learn.qubitpi.org/auto_examples/ensemble/plot_hgbt_regression.html (accessed on 24 March 2025).
- Fan, W.; Zhang, K. Bagging. In Encyclopedia of Database Systems; Liu, L., Özsu, M.T., Eds.; Springer US: Boston, MA, USA, 2009; pp. 206–210. ISBN 978-0-387-35544-3.
- Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
- Wang, Y.; Liu, J.; Feng, L. Text Length Considered Adaptive Bagging Ensemble Learning Algorithm for Text Classification. Multimed. Tools Appl. 2023, 82, 27681–27706.
- Moreno-Torres, J.G.; Saez, J.A.; Herrera, F. Study on the Impact of Partition-Induced Dataset Shift on k-Fold Cross-Validation. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1304–1312.
- Saranya, G.; Pravin, A. Grid Search Based Optimum Feature Selection by Tuning Hyperparameters for Heart Disease Diagnosis in Machine Learning. Open Biomed. Eng. J. 2023, 17, e187412072304061.
- Trevethan, R. Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Front. Public Health 2017, 5, 307.
- Li, K.; Persaud, D.; Choudhary, K.; DeCost, B.; Greenwood, M.; Hattrick-Simpers, J. Exploiting Redundancy in Large Materials Datasets for Efficient Machine Learning with Less Data. Nat. Commun. 2023, 14, 7283.
- Geiping, J.; Goldstein, T. Cramming: Training a Language Model on a Single GPU in One Day. arXiv 2022, arXiv:2212.14034.
- Sorscher, B.; Geirhos, R.; Shekhar, S.; Ganguli, S.; Morcos, A.S. Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022.
- Yang, S.; Xie, Z.; Peng, H.; Xu, M.; Sun, M.; Li, P. Dataset Pruning: Reducing Training Data by Examining Generalization Influence. arXiv 2022, arXiv:2205.09329.
- Choudhary, K.; DeCost, B.; Major, L.; Butler, K.; Thiyagalingam, J.; Tavazza, F. Unified Graph Neural Network Force-Field for the Periodic Table: Solid State Applications. Digit. Discov. 2023, 2, 346–355.
- Ou, D.; Ji, Y.; Zhang, L.; Liu, H. An Online Classification Method for Fault Diagnosis of Railway Turnouts. Sensors 2020, 20, 4627.
- Vatcheva, K.P.; Lee, M.; McCormick, J.B.; Rahbar, M.H. Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies. Epidemiology (Sunnyvale) 2016, 6, 227.
- Sipper, M.; Moore, J.H. AddGBoost: A Gradient Boosting-Style Algorithm Based on Strong Learners. Mach. Learn. Appl. 2022, 7, 100243.
- Fitriyani, N.L.; Syafrudin, M.; Ulyah, S.M.; Alfian, G.; Qolbiyani, S.L.; Anshari, M. A Comprehensive Analysis of Chinese, Japanese, Korean, US-PIMA Indian, and Trinidadian Screening Scores for Diabetes Risk Assessment and Prediction. Mathematics 2022, 10, 4027.
- Egghe, L. The Measures Precision, Recall, Fallout and Miss as a Function of the Number of Retrieved Documents and Their Mutual Interrelations. Inf. Process. Manag. 2008, 44, 856–876.
- Çorbacıoğlu, Ş.K.; Aksel, G. Receiver Operating Characteristic Curve Analysis in Diagnostic Accuracy Studies: A Guide to Interpreting the Area under the Curve Value. Turk. J. Emerg. Med. 2023, 23, 195–198.
- Peregrin-Alvarez, J. Reinventing the Body Mass Index: A Machine Learning Approach. medRxiv 2024. medRxiv: 26.24306457.
- Gutiérrez-Gallego, A.; Zamorano-León, J.J.; Parra-Rodríguez, D.; Zekri-Nechar, K.; Velasco, J.M.; Garnica, Ó.; Jiménez-García, R.; López-de-Andrés, A.; Cuadrado-Corrales, N.; Carabantes-Alarcón, D.; et al. Combination of Machine Learning Techniques to Predict Overweight/Obesity in Adults. J. Pers. Med. 2024, 14, 816.
- Maiga, J.; Hungilo, G.G.; Pranowo. Comparison of Machine Learning Models in Prediction of Cardiovascular Disease Using Health Record Data. In Proceedings of the 2019 International Conference on Informatics, Multimedia, Cyber and Information System (ICIMCIS), Jakarta, Indonesia, 24–25 October 2019; IEEE: Jakarta, Indonesia, 2019; pp. 45–48.
- Uddin, M.N.; Halder, R.K. An Ensemble Method Based Multilayer Dynamic System to Predict Cardiovascular Disease Using Machine Learning Approach. Inform. Med. Unlocked 2021, 24, 100584.
- Ouf, S.; ElSeddawy, A.I.B. A Proposed Paradigm for Intelligent Heart Disease Prediction System Using Data Mining Techniques. J. Southwest Jiaotong Univ. 2021, 56, 220–240.
- Shorewala, V. Early Detection of Coronary Heart Disease Using Ensemble Techniques. Inform. Med. Unlocked 2021, 26, 100655.
- Punugoti, R.; Dutt, V.; Kumar, A.; Bhati, N. Boosting the Accuracy of Cardiovascular Disease Prediction Through SMOTE. In Proceedings of the 2023 International Conference on IoT, Communication and Automation Technology (ICICAT), Gorakhpur, India, 23–24 June 2023; IEEE: Gorakhpur, India, 2023; pp. 1–6.
- Bhatt, C.M.; Patel, P.; Ghetia, T.; Mazzeo, P.L. Effective Heart Disease Prediction Using Machine Learning Techniques. Algorithms 2023, 16, 88.
No. | Feature | Type | Mean ± STD | Mean ± STD of Positive Samples | Mean ± STD of Negative Samples |
---|---|---|---|---|---|
1 | age_in_years | integer | 53.33 ± 6.76 | 54.95 ± 6.35 | 51.73 ± 6.78 |
2 | gender | categorical (1: women; 2: men) | - | - | - |
3 | height | integer | 164.36 ± 8.21 | 164.27 ± 8.27 | 164.45 ± 8.15 |
4 | weight | float | 74.21 ± 14.40 | 76.82 ± 14.96 | 71.59 ± 13.31 |
5 | ap_hi (SBP) | integer | 128.82 ± 154.01 | 137.21 ± 191.29 | 120.43 ± 103.55 |
6 | ap_lo (DBP) | integer | 96.63 ± 188.47 | 109.02 ± 217.81 | 84.25 ± 152.69 |
7 | cholesterol | categorical (1: normal; 2: above normal; 3: well above normal) | - | - | -
8 | gluc (FBG) | categorical (1: normal; 2: above normal; 3: well above normal) | - | - | - |
9 | smoke | binary (0: not smokers; 1: smokers) | - | - | - |
10 | alco (alcohol intake) | binary (0: non-alcohol drinker; 1: alcohol drinker) | - | - | - |
11 | active (physical activity) | binary (0: inactive; 1: active) | - | - | - |
No. | Feature | Type | Mean ± STD | Mean ± STD of Positive Samples | Mean ± STD of Negative Samples |
---|---|---|---|---|---|
1 | age | continuous | 51.77 ± 8.20 | 54.80 ± 6.74 | 48.74 ± 8.42 |
2 | sex | binary (0: women; 1: men) | - | - | - |
3 | is_smoking | binary (0: not smokers; 1: smokers) | - | - | - |
4 | totChol (total cholesterol) | categorical (1: normal; 2: above normal; 3: well above normal) | - | - | - |
5 | sysBP | continuous | 132.95 ± 20.33 | 135.21 ± 19.70 | 130.68 ± 20.71
6 | diaBP | continuous | 83.60 ± 11.30 | 84.94 ± 10.93 | 82.25 ± 11.50 |
7 | BMI | continuous | 26.95 ± 5.00 | 28.22 ± 5.55 | 25.69 ± 4.00 |
8 | glucose | categorical (1: normal; 2: above normal; 3: well above normal) | - | - | - |
Dataset | Number of Samples | Number of Redundant Samples | Number of Samples After Data Redundancy Elimination |
---|---|---|---|
Dataset I | 70,000 | 1004 | 68,996 (Train: 62,097, Test: 6899) |
Dataset II | 70,000 | 1004 | 68,996 (Train: 62,097, Test: 6899) |
Dataset III | 5200 | 38 | 5162 (Train: 4646, Test: 516) |
Metric | LR | SVC | GNB | MLP | KNN | RF | AdaBoost | GB | HGB | Extra Trees | Proposed Model |
---|---|---|---|---|---|---|---|---|---|---|---|
Precision | 0.7273 | 0.7257 | 0.7224 | 0.7287 | 0.7513 | 0.8472 | 0.8125 | 0.9371 | 0.9273 | 0.8246 | 0.9375 |
Recall | 0.6627 | 0.6230 | 0.2980 | 0.6929 | 0.6565 | 0.8757 | 0.7681 | 0.9326 | 0.9822 | 0.8617 | 0.9877 |
F1 Score | 0.6935 | 0.6911 | 0.4216 | 0.7005 | 0.7007 | 0.8612 | 0.7896 | 0.9348 | 0.9539 | 0.8427 | 0.9619 |
Accuracy | 0.7114 | 0.7257 | 0.5976 | 0.7118 | 0.7236 | 0.8609 | 0.7984 | 0.9359 | 0.9532 | 0.8416 | 0.9615 |
AUC | 0.7709 | 0.6911 | 0.7001 | 0.7883 | 0.7840 | 0.9396 | 0.8897 | 0.9877 | 0.9852 | 0.9214 | 0.9911 |
Metric | LR | SVC | GNB | MLP | KNN | RF | AdaBoost | GB | HGB | Extra Trees | Proposed Model |
---|---|---|---|---|---|---|---|---|---|---|---|
Precision | 0.7075 | 0.7807 | 0.7225 | 0.7728 | 0.7515 | 0.8613 | 0.8126 | 0.9354 | 0.9239 | 0.8594 | 0.9371 |
Recall | 0.6451 | 0.6168 | 0.2810 | 0.6459 | 0.6723 | 0.9026 | 0.7647 | 0.9299 | 0.9822 | 0.9098 | 0.9890 |
F1 Score | 0.6748 | 0.6891 | 0.4042 | 0.6977 | 0.7096 | 0.8814 | 0.7879 | 0.9299 | 0.9521 | 0.8838 | 0.9624 |
Accuracy | 0.6939 | 0.7258 | 0.5923 | 0.7287 | 0.7290 | 0.8804 | 0.7972 | 0.9338 | 0.9513 | 0.8822 | 0.9619 |
AUC | 0.7590 | 0.7882 | 0.6982 | 0.7964 | 0.6708 | 0.9540 | 0.8892 | 0.9871 | 0.9846 | 0.9549 | 0.9914 |
Metric | LR | SVC | GNB | MLP | KNN | RF | AdaBoost | GB | HGB | Extra Trees | Proposed Model |
---|---|---|---|---|---|---|---|---|---|---|---|
Precision | 0.7645 | 0.6790 | 0.7693 | 0.7471 | 0.7741 | 0.8668 | 0.8484 | 0.8862 | 0.8951 | 0.8306 | 0.8954 |
Recall | 0.7646 | 0.7593 | 0.7073 | 0.7770 | 0.8040 | 0.8062 | 0.8305 | 0.8283 | 0.8257 | 0.7776 | 0.8283 |
F1 Score | 0.7643 | 0.7167 | 0.7369 | 0.7332 | 0.7884 | 0.8348 | 0.8388 | 0.8559 | 0.8585 | 0.8028 | 0.8601 |
Accuracy | 0.7660 | 0.7021 | 0.7495 | 0.7471 | 0.7858 | 0.8422 | 0.8418 | 0.8618 | 0.8655 | 0.8108 | 0.8667 |
AUC | 0.8359 | 0.7670 | 0.8159 | 0.8275 | 0.8507 | 0.9135 | 0.9097 | 0.9244 | 0.9278 | 0.8908 | 0.9280 |
Technique | Dataset I: # Outliers | Dataset I: # Training Samples After Outlier Elimination | Dataset II: # Outliers | Dataset II: # Training Samples After Outlier Elimination | Dataset III: # Outliers | Dataset III: # Training Samples After Outlier Elimination
---|---|---|---|---|---|---
LOF | 1465 | 60,632 | 3916 | 58,181 | 118 | 4528
EE | 3105 | 58,992 | 3106 | 58,991 | 233 | 4413
iForest | 3105 | 58,992 | 3106 | 58,991 | 233 | 4413
OneClassSVM | 6210 | 55,887 | 6210 | 55,886 | 464 | 4182
DBSCAN | 6566 | 55,531 | 1244 | 60,853 | 550 | 4098
Model | Outlier Detection Technique | Dataset I | Dataset II | Dataset III
---|---|---|---|---
LR | LOF | 2 (Pre, F1) | 3 (Rec, F1, AUC) | 1 (Pre)
 | EE | 3 (Rec, Acc, AUC) | 0 | 0
 | iForest | 0 | 0 | 0
 | OneClassSVM | 0 | 2 (Pre, Acc) | 1 (Rec)
 | DBSCAN | 0 | 0 | 3 (F1, Acc, AUC)
SVC | LOF | 5 (Pre, Rec, F1, Acc, AUC) | 2 (Rec, Acc) | 0
 | EE | 0 | 2 (F1, AUC) | 0
 | iForest | 0 | 0 | 0
 | OneClassSVM | 0 | 0 | 2 (Rec, F1)
 | DBSCAN | 0 | 1 (Pre) | 3 (Pre, Acc, AUC)
GNB | LOF | 1 (Pre) | 0 | 0
 | EE | 3 (F1, Acc, AUC) | 2 (Acc, AUC) | 0
 | iForest | 0 | 0 | 0
 | OneClassSVM | 1 (Rec) | 2 (Rec, F1) | 3 (Pre, Rec, F1)
 | DBSCAN | 0 | 1 (Pre) | 2 (Acc, AUC)
MLP | LOF | 3 (Pre, Acc, AUC) | 3 (Pre, Acc, AUC) | 1 (Pre)
 | EE | 0 | 2 (Rec, F1) | 0
 | iForest | 2 (Rec, F1) | 0 | 0
 | OneClassSVM | 0 | 0 | 2 (F1, Acc)
 | DBSCAN | 0 | 0 | 2 (Rec, AUC)
KNN | LOF | 5 (Pre, Rec, F1, Acc, AUC) | 5 (Pre, Rec, F1, Acc, AUC) | 0
 | EE | 0 | 0 | 0
 | iForest | 0 | 0 | 0
 | OneClassSVM | 0 | 0 | 0
 | DBSCAN | 0 | 0 | 5 (Pre, Rec, F1, Acc, AUC)
RF | LOF | 1 (Rec) | 2 (F1, Acc) | 0
 | EE | 1 (Pre) | 0 | 0
 | iForest | 3 (F1, Acc, AUC) | 2 (Acc, AUC) | 0
 | OneClassSVM | 0 | 0 | 4 (Pre, Rec, F1, AUC)
 | DBSCAN | 0 | 1 (Pre) | 1 (Acc)
AdaBoost | LOF | 4 (Pre, Rec, F1, AUC) | 4 (Pre, Rec, F1, AUC) | 0
 | EE | 0 | 0 | 0
 | iForest | 0 | 0 | 0
 | OneClassSVM | 0 | 0 | 5 (Pre, Rec, F1, Acc, AUC)
 | DBSCAN | 1 (Acc) | 1 (Acc) | 0
GB | LOF | 1 (Rec) | 0 | 0
 | EE | 0 | 0 | 0
 | iForest | 0 | 2 (F1, Acc) | 0
 | OneClassSVM | 1 (AUC) | 1 (Pre) | 5 (Pre, Rec, F1, Acc, AUC)
 | DBSCAN | 3 (Pre, F1, Acc) | 2 (Rec, AUC) | 0
HGB | LOF | 4 (Pre, Rec, F1, Acc) | 0 | 0
 | EE | 0 | 0 | 0
 | iForest | 1 (AUC) | 0 | 0
 | OneClassSVM | 0 | 5 (Pre, Rec, F1, Acc, AUC) | 5 (Pre, Rec, F1, Acc, AUC)
 | DBSCAN | 0 | 0 | 0
ExtraTrees | LOF | 1 (AUC) | 0 | 0
 | EE | 0 | 0 | 0
 | iForest | 4 (Pre, Rec, F1, Acc) | 5 (Pre, Rec, F1, Acc, AUC) | 0
 | OneClassSVM | 0 | 0 | 4 (Pre, Rec, F1, AUC)
 | DBSCAN | 0 | 0 | 1 (Acc)
Proposed Model | LOF | 0 | 5 (Pre, Rec, F1, Acc, AUC) | 0
 | EE | 0 | 0 | 0
 | iForest | 2 (Pre, F1) | 0 | 0
 | OneClassSVM | 0 | 0 | 5 (Pre, Rec, F1, Acc, AUC)
 | DBSCAN | 3 (Rec, Acc, AUC) | 0 | 0
Hyperparameter | Dataset I | Dataset II | Dataset III
---|---|---|---
max_iter | 105 | 100 | 700
max_depth | None | None | 5
l2_regularization | 0 | 1 | 0.5
random_state | 0 | 42 | 0
learning_rate | 0.1 | 0.1 | 0.01
Hyperparameter | Dataset I | Dataset II | Dataset III
---|---|---|---
estimator | HistGradientBoostingClassifier | HistGradientBoostingClassifier | HistGradientBoostingClassifier
n_estimators | 20 | 19 | 16
max_samples | 1.0 | 1.0 | 1.0
max_features | 1.0 | 1.0 | 1.0
random_state | 0 | 42 | 0
Paired t-Test Metric | Dataset I | Dataset II | Dataset III
---|---|---|---
h | 1 | 1 | 1
p-value | 0.000 | 0.000 | 0.957
t (calculated) | −39.555 | −12.728 | 0.057
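The h/p/t values above follow the reporting convention of a paired t-test on matched performance scores, where h is the hypothesis-test decision flag. A generic SciPy sketch, assuming `scores_a` and `scores_b` hold the fold-wise scores of the two models being compared; the exact pairing and significance level used in the paper are not restated here:

```python
from scipy import stats

# Two-sided paired t-test on matched per-fold scores.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
h = int(p_value < 0.05)  # decision flag at the conventional 5% level
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, h = {h}")
```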
Reference | Method | Precision | Recall | F1 Score | Accuracy | AUC |
---|---|---|---|---|---|---|
[68] | RF | - | 0.8000 | - | 0.7300 | - |
[69] | Ensemble Method (RF + NB + GB) | - | - | - | 0.9416 | 0.9400 |
[70] | Neural Network | - | - | - | 0.7182 | - |
[71] | Stacking (KNN + RF + SVC + LR) | 0.7601 | 0.6680 | - | 0.7510 | - |
[72] | SMOTE + RF | 0.8300 | 0.8500 | 0.8400 | 0.8600 | - |
[8] | XGBH | - | - | - | - | 0.8030 |
[73] | MLP | - | - | - | 0.8727 | 0.9500 |
Proposed Model (Utilizing Dataset I) | Bagging with HGB + LOF + IG | 0.9390 | 0.9883 | 0.9630 | 0.9625 | 0.9916 |