A-Tuning Ensemble Machine Learning Technique for Cerebral Stroke Prediction
Abstract
1. Introduction
2. Literature Review
3. Materials and Methods
3.1. Stroke Prediction Dataset
3.2. Methodology
- We propose RXLM, a novel tuned ensemble of random forest (RF), XGBoost, and LightGBM, for predicting cerebral stroke (a pipeline sketch follows this list).
- We used the open-access Stroke Prediction dataset and handled its missing values with the KNN Imputer technique.
- Because the training subset was imbalanced, we applied SMOTE to the training set to balance the samples of the two classes.
- We tuned the hyper-parameters of RF, XGBoost, and LightGBM with the random search technique to find the parameter values that yield the best performance.
- The ensemble achieved 95.29% accuracy, 99.13% AUC, 96.38% recall, 94.36% precision, 95.35% F1-score, 90.59% Kappa, and 90.63% MCC for stroke disease prediction.
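The following is a minimal sketch of how such a pipeline could be assembled with scikit-learn, XGBoost, and LightGBM. The search spaces, the soft-voting combination, and the pre-processed training data (`X_train`, `y_train`, produced as in Section 3.3) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' exact code): tune RF, XGBoost, and
# LightGBM with random search, then combine them into the RXLM ensemble.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

def tune(model, param_distributions, X, y):
    """Sample random hyper-parameter combinations and keep the best model."""
    search = RandomizedSearchCV(model, param_distributions, n_iter=20,
                                cv=5, scoring="accuracy",
                                random_state=123, n_jobs=-1)
    search.fit(X, y)
    return search.best_estimator_

# Illustrative search spaces -- the paper's actual grids are not reproduced.
rf = tune(RandomForestClassifier(random_state=123),
          {"n_estimators": [100, 200, 500], "max_depth": [None, 5, 10]},
          X_train, y_train)
xgb = tune(XGBClassifier(random_state=123, eval_metric="logloss"),
           {"learning_rate": [0.1, 0.3, 0.5], "max_depth": [3, 6, 9]},
           X_train, y_train)
lgbm = tune(LGBMClassifier(random_state=123),
            {"learning_rate": [0.1, 0.4], "num_leaves": [31, 150]},
            X_train, y_train)

# RXLM: combine the three tuned learners, averaging predicted probabilities.
rxlm = VotingClassifier(estimators=[("rf", rf), ("xgb", xgb), ("lgbm", lgbm)],
                        voting="soft")
rxlm.fit(X_train, y_train)
```

Soft voting averages the three probability estimates, which typically outperforms hard voting when the base learners are well calibrated; whether the paper combines the learners by soft or hard voting is not stated in this excerpt.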
3.3. Data Pre-Processing of Stroke Dataset
3.3.1. Missing Values Handling
3.3.2. Outlier Elimination
3.3.3. One-Hot Encoding
3.3.4. Normalization
3.3.5. Over-Sampling
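A compact sketch of the pre-processing chain in Sections 3.3.1–3.3.5, assuming the public Kaggle Stroke Prediction CSV and its column names; the 1.5 × IQR outlier rule, the imputer settings, and the split ratio are illustrative assumptions rather than the paper's exact choices.

```python
# Sketch of the pre-processing chain (Sections 3.3.1-3.3.5); the outlier
# rule, imputer settings, and split ratio are assumptions for illustration.
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

df = pd.read_csv("healthcare-dataset-stroke-data.csv")

# 3.3.3 One-hot encode the categorical features first so the frame is numeric.
df = pd.get_dummies(df, columns=["gender", "ever_married", "work_type",
                                 "Residence_type", "smoking_status"],
                    drop_first=True)

# 3.3.1 Impute missing BMI values from the K nearest neighbours.
df[df.columns] = KNNImputer(n_neighbors=5).fit_transform(df)

# 3.3.2 Eliminate outliers, here with an assumed 1.5 * IQR rule on BMI.
q1, q3 = df["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["bmi"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

X, y = df.drop(columns=["id", "stroke"]), df["stroke"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

# 3.3.4 Normalize the features, fitting the scaler on the training split only
# (the paper may normalize only the numeric columns).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3.3.5 Over-sample the minority (stroke) class on the training set only,
# as the paper does, so the test set keeps its natural class ratio.
X_train, y_train = SMOTE(random_state=123).fit_resample(X_train, y_train)
```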
3.4. XGBoost
3.5. LightGBM
4. Implementation and Evaluation
4.1. Tuning Parameters Using Random Search Optimization
4.2. Evaluation Metrics
- A true positive for a class means that the observed value and the predicted value are both positive: the model correctly predicted membership of the class.
- A true negative for a class means that the observed value and the predicted value are both negative: the model correctly predicted non-membership of the class.
- A false positive for a class means that the observed value is negative, but the model predicted it as positive.
- A false negative for a class means that the observed value is positive, but the model predicted it as negative.
- Accuracy is the degree to which results match the required values. It is calculated as the number of correct predictions on the dataset divided by the total number of predictions.
- Precision is the percentage of the samples predicted as positive that are actually positive.
- Recall is the percentage of the actual positive samples that the model correctly identifies.
- Sensitivity is the proportion of true positives to all actual positives; in binary classification, it is identical to recall.
- Specificity is the proportion of true negatives to all actual negatives.
- F1-score is the harmonic mean of precision and recall.
- AUC summarizes a classifier’s performance as the total area beneath the receiver operating characteristic (ROC) curve.
- The Matthews correlation coefficient (MCC) is a statistical method for assessing models. It performs the same function as the chi-square statistic for a contingency table, measuring the difference between expected and actual values.
- The Kappa coefficient, or Cohen’s Kappa score, is defined for two raters but can be extended to more than two. In ML binary classification, one rater takes the role of the classification model, while the other takes the role of the real-world observer who knows the true category of each record. Cohen’s Kappa measures the agreement that remains after chance agreement is accounted for, based on the raters’ agreements (TP and TN) and disagreements (FP and FN). The standard formulas for all of these metrics are collected after this list.
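For reference, the standard formulas behind these definitions, written in terms of the TP, TN, FP, and FN counts above; here p_o is the observed agreement between the raters and p_e the agreement expected by chance.

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \qquad
\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN} \qquad
\text{Specificity} = \frac{TN}{TN + FP} \\
\text{F1-score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\
\text{MCC}       &= \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \\
\kappa           &= \frac{p_o - p_e}{1 - p_e}
\end{aligned}
```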
4.3. Model Evaluation
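A minimal sketch of how the reported metrics could be computed with scikit-learn, assuming the fitted `rxlm` ensemble and the held-out `X_test`, `y_test` from the earlier sketches:

```python
# Compute the seven reported metrics on the held-out test split.
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             matthews_corrcoef, precision_score,
                             recall_score, roc_auc_score)

y_pred = rxlm.predict(X_test)               # hard class predictions
y_prob = rxlm.predict_proba(X_test)[:, 1]   # probability of the stroke class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_prob))
print("Recall   :", recall_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("Kappa    :", cohen_kappa_score(y_test, y_pred))
print("MCC      :", matthews_corrcoef(y_test, y_pred))
```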
4.4. Ensemble’s Result Comparison with the Literature
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Katan, M.; Luft, A. Global burden of stroke. Semin. Neurol. 2018, 38, 208–211. [Google Scholar] [CrossRef] [PubMed]
- Bustamante, A.; Penalba, A.; Orset, C.; Azurmendi, L.; Llombart, V.; Simats, A.; Pecharroman, E.; Ventura, O.; Ribó, M.; Vivien, D.; et al. Blood Biomarkers to Differentiate Ischemic and Hemorrhagic Strokes. Neurology 2021, 96, 1928–1939. [Google Scholar] [CrossRef] [PubMed]
- Learn about Stroke. Available online: https://www.world-stroke.org/world-stroke-day-campaign/why-stroke-matters/learn-about-stroke (accessed on 25 February 2023).
- Li, G.; Cheng, L.; Gao, Z.; Xia, X.; Jiang, J. Development of an Untethered Adaptive Thumb Exoskeleton for Delicate Rehabilitation Assistance. IEEE Trans. Robot. 2022, 38, 3514–3529. [Google Scholar] [CrossRef]
- Elloker, T.; Rhoda, A. The Relationship between Social Support and Participation in Stroke: A Systematic Review. Afr. J. Disabil. 2018, 7, a357. [Google Scholar] [CrossRef]
- Concept of Stroke by Health Line. Available online: https://www.cdc.gov/stroke/index.htm (accessed on 7 January 2023).
- Statistics of Stroke by Centers for Disease Control and Prevention. Available online: https://www.cdc.gov/stroke/facts.htm (accessed on 14 March 2023).
- Banerjee, T.; Das, S. Fifty years of stroke researches in India. Ann. Indian Acad. Neurol. 2016, 19, 1–8. [Google Scholar] [CrossRef]
- Stroke Association. Available online: https://www.stroke.org.uk (accessed on 25 January 2023).
- Stroke in Canada. Available online: https://www.canada.ca/en/public-health/services/publications/diseases-conditions/stroke-canada-fact-sheet.html (accessed on 9 March 2023).
- Xia, X.; Yue, W.; Chao, B.; Li, M.; Cao, L.; Wang, L.; Shen, Y.; Li, X. Prevalence and Risk Factors of Stroke in the Elderly in Northern China: Data from the National Stroke Screening Survey. J. Neurol. 2019, 266, 1449–1458. [Google Scholar] [CrossRef]
- Alloubani, A.; Saleh, A.; Abdelhafiz, I. Hypertension and Diabetes Mellitus as a Predictive Risk Factor for Stroke. Diabetes Metab. Syndr. Clin. Res. Rev. 2018, 12, 577–584. [Google Scholar] [CrossRef] [PubMed]
- Boehme, A.; Esenwa, C.; Elkind, M. Stroke risk factors, genetics, and prevention. Circ. Res. 2017, 120, 472–495. [Google Scholar] [CrossRef] [PubMed]
- Adi, N.; Farhany, R.; Ghina, R.; Napitupulu, H. Stroke Risk Prediction Model using Machine Learning. In Proceedings of the International Conference on Artificial Intelligence and Big Data Analytics, Bandung, Indonesia, 27–29 October 2021. [Google Scholar] [CrossRef]
- Dritsas, E.; Trigka, M. Stroke Risk Prediction with Machine Learning Techniques. Sensors 2022, 22, 4670. [Google Scholar] [CrossRef] [PubMed]
- Tazin, T.; Alam, M.; Dola, N.; Bari, M.; Bourouis, S.; Khan, M. Stroke Disease Detection and Prediction using Robust Learning Approaches. J. Healthc. Eng. 2021, 2021, 7633381. [Google Scholar] [CrossRef] [PubMed]
- Al-Mekhlafi, Z.; Senan, E.; Rassem, T.; Mohammed, B.; Makbol, N.; Alanazi, A.; Almurayziq, T.; Ghaleb, F. Deep learning and machine learning for early detection of stroke and haemorrhage. Comput. Mater. Contin. 2022, 72, 775–796. [Google Scholar] [CrossRef]
- Sailasya, G.; Kumari, G. Analyzing the Performance of Stroke Prediction using ML Classification Algorithms. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 539–545. [Google Scholar] [CrossRef]
- Dev, S.; Wang, H.; Nwosu, C.; Jain, N.; Veeravalli, B.; John, D. A predictive analytics approach for stroke prediction using machine learning and neural networks. Healthc. Anal. 2022, 2, 100032. [Google Scholar] [CrossRef]
- Bandi, V.; Bhattacharyya, D.; Midhunchakkravarthy, D. Prediction of Brain Stroke Severity using Machine Learning. Rev. D’intelligence Artif. 2020, 34, 753–761. Available online: https://www.iieta.org/download/file/fid/48077 (accessed on 18 February 2023). [CrossRef]
- Alhakami, H.; Alraddadi, S.; Alseady, S.; Baz, A.; Alsubait, T. A Hybrid Efficient Data Analytics Framework for Stroke Prediction. Int. J. Comput. Sci. Netw. Secur. 2020, 20, 240–250. Available online: http://paper.ijcsns.org/07_book/202004/20200429.pdf (accessed on 7 March 2023).
- Stroke Prediction Dataset. Available online: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset (accessed on 25 February 2022).
- Tan, P.; Steinbach, M.; Karpatne, A.; Kumar, V. Introduction to Data Mining; Computers; Pearson: New York, NY, USA, 2018; pp. 1–864. Available online: https://www-users.cse.umn.edu/~kumar001/dmbook/index.php (accessed on 4 February 2023).
- Naser, M.; Alavi, A. Insights into Performance Fitness and Error Metrics for Machine Learning. arXiv 2020. [Google Scholar] [CrossRef]
Feature Name | Description | Range |
---|---|---|
Gender | The participant’s gender. There are 1260 men and 1994 women in the population. | Male–Female |
Age | The participant’s age; only participants over the age of 18 are included. | Float |
Hypertension | Whether the participant has hypertension; 12.54% of participants have high blood pressure. | 0 No Hypertension, 1 Hypertension |
Heart Disease | Whether the participant has heart disease; the prevalence among participants is 6.33%. | 0 No Heart Disease, 1 Heart Disease |
Ever Married | Whether the participant has ever been married; 79.84% have. | Yes–No |
Work Type | The participant’s employment category; there are five categories: private, self-employed, government, children, and never worked. | Never_worked, Children, Private, Self-employed, or Govt_job |
Residence Type | The participant’s living situation: urban or rural. | Urban or Rural |
Average Glucose Level (mg/dL) | The participant’s average blood glucose level. | Float |
BMI (kg/m2) | The participant’s body mass index. | Float |
Smoking Status | The participant’s smoking status: smokes (22.37%), never smoked (52.64%), or formerly smoked (24.99%). | Never Smoked, Smokes, or Formerly Smoked |
Stroke | Whether the participant has had a stroke in the past; 5.53% of participants have experienced one. | 0 No Stroke, 1 Stroke |
Age | Avg Glucose Level | BMI | Gender Male | Hypertension_1 | Heart_Disease_1 | Ever_Married_Yes | Work_Type Never_Worked |
---|---|---|---|---|---|---|---|
67 | 5.432367 | 3.600048 | 1 | 0 | 1 | 1 | 0 |
61 | 5.309307 | 5.309307 | 0 | 0 | 0 | 1 | 0 |
80 | 4.662684 | 4.662684 | 1 | 0 | 1 | 1 | 0 |
49 | 5.143008 | 5.143008 | 0 | 0 | 0 | 1 | 0 |
79 | 5.159745 | 5.159745 | 0 | 1 | 0 | 1 | 0 |
Age | Avg Glucose Level | BMI | Gender Male | Hypertension_1 | Heart_Disease_1 | Ever_Married_Yes | Work_Type Never_Worked |
---|---|---|---|---|---|---|---|
67 | 2.320709 | 1.027679 | 1 | 0 | 1 | 1 | 0 |
61 | 1.980714 | 0.781547 | 0 | 0 | 0 | 1 | 0 |
80 | 0.194204 | 0.574693 | 1 | 0 | 1 | 1 | 0 |
49 | 1.521257 | 0.791320 | 0 | 0 | 0 | 1 | 0 |
79 | 1.567499 | −0.581283 | 0 | 1 | 0 | 1 | 0 |
Parameter | Meaning | Best Value |
---|---|---|
bootstrap | Whether bootstrap samples (drawn with replacement from the original dataset) are used when building trees. | True |
ccp alpha | Complexity parameter used for minimal cost-complexity pruning (CCP). | 0.0 |
class weight | Weights associated with the classes. | None |
criterion | The function used to measure the quality of a split. | gini |
Max depth | The maximum number of levels in the tree between the root and the leaf nodes. | None |
Max features | The number of features to consider when searching for the best split. | Auto |
Max leaf nodes | Trees are grown with at most this many leaf nodes. | None |
Max samples | The number of samples drawn from X to train each base estimator. | None |
Min impurity decrease | A node is split if the split induces an impurity decrease greater than or equal to this value. | 0.0 |
Min samples leaf | The minimum number of samples required at each leaf node. | 1 |
Min samples split | The minimum number of observations a node must contain before it can be split. | 2 |
Min weight fraction leaf | The minimum weighted fraction of the total weight of all input samples required at a leaf node. | 0.0 |
N estimators | The number of trees to build before taking the maximum voting or the average of the predictions. | 100 |
Random state | Random number seed. | 123 |
N jobs | The number of jobs to run in parallel. | −1 |
Oob score | Whether to use an out-of-bag estimate to score the training dataset. | False |
verbose | Controls the verbosity when fitting and predicting. | 0 |
Warm start | If True, reuse the solution of the previous fit call and add more estimators to the ensemble; otherwise, fit a whole new forest. | False |
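Read as code, the table above corresponds to the following scikit-learn configuration (a sketch of the tabulated values only; note that max_features="auto" was removed in scikit-learn 1.3, where "sqrt" is the equivalent for classifiers):

```python
# Random forest instantiated with the best values from the table above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,        # number of trees in the forest
    criterion="gini",        # split-quality measure
    max_depth=None,          # grow trees until the leaves are pure
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="auto",     # use "sqrt" on scikit-learn >= 1.3
    bootstrap=True,          # sample the training data with replacement
    oob_score=False,
    random_state=123,
    n_jobs=-1,
)
```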
Parameter | Meaning | Best Value |
---|---|---|
Boosting type | The gradient boosting method. | gbdt |
Class weight | Weights associated with the classes. | None |
Colsample bytree | Subsample ratio of columns when constructing each tree. | 1.0 |
Importance type | The type of feature importance to be filled into feature_importances_. | split |
Learning rate | The boosting learning rate. | 0.4 |
Max depth | The maximum number of levels in the tree between the root and the leaf nodes. | −1 |
Min child samples | The minimum amount of data required in a child (leaf). | 6 |
Min child weight | The minimum sum of instance weight required in a child (leaf). | 0.001 |
Min split gain | The minimum loss reduction required to make a further partition on a leaf node of the tree. | 0.3 |
N estimators | The number of boosted trees to fit. | 20 |
N jobs | The number of jobs to run in parallel. | −1 |
Num leaves | The maximum number of tree leaves for base learners. | 150 |
Objective | The learning objective and task, or a custom objective function. | None |
Random state | Random number seed. | 123 |
Reg alpha | The L1 regularization term on weights. | 0.005 |
Reg lambda | The L2 regularization term on weights. | 0.0005 |
Subsample | Subsample ratio of the training instances. | 1.0 |
Subsample for bin | The number of samples used to construct histogram bins. | 200,000 |
Subsample freq | Frequency of subsampling; 0 means subsampling is disabled. | 0 |
Verbosity | Controls whether messages are printed during construction. | warn |
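The same values expressed as a LightGBM constructor call (again a sketch of the tabulated values; parameters left at their listed defaults are included for completeness):

```python
# LightGBM classifier instantiated with the best values from the table above.
from lightgbm import LGBMClassifier

lgbm = LGBMClassifier(
    boosting_type="gbdt",
    learning_rate=0.4,
    n_estimators=20,         # number of boosted trees
    num_leaves=150,
    max_depth=-1,            # -1 means no depth limit
    min_child_samples=6,
    min_child_weight=0.001,
    min_split_gain=0.3,
    reg_alpha=0.005,         # L1 regularization term
    reg_lambda=0.0005,       # L2 regularization term
    subsample=1.0,
    subsample_freq=0,        # 0 disables subsampling
    colsample_bytree=1.0,
    subsample_for_bin=200000,
    random_state=123,
    n_jobs=-1,
)
```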
Parameter | Meaning | Best Value |
---|---|---|
Objective | The learning objective: logistic regression for binary classification, outputting a probability. | binary:logistic |
Base score | The global bias, i.e., the initial prediction score of all instances. | 0.5 |
booster | The gradient boosting method. | gbtree |
Colsample bylevel | Subsample ratio of columns for each level. | 1 |
Colsample bynode | Subsample ratio of columns for each node (split). | 1 |
Colsample bytree | Subsample ratio of columns when constructing each tree. | 1 |
Learning rate | The boosting learning rate. | 0.300000012 |
Max bin | The maximum number of discrete bins used to bucket continuous features. | 256 |
gamma | The minimum loss reduction required to make a further partition on a leaf node of the tree. | 0 |
Max cat to onehot | The threshold that decides whether XGBoost should use a one-hot-encoding-based split for categorical data. | 4 |
Max delta step | The maximum delta step allowed for each tree’s weight estimation; used to make the update step more conservative. | 0 |
Max depth | The maximum number of levels in the tree between the root and the leaf nodes. | 6 |
Min child weight | The minimum sum of instance weight required in a child. | 1 |
N estimators | The number of boosted trees to fit. | 100 |
N jobs | The number of jobs to run in parallel. | −1 |
Num parallel tree | The number of parallel trees built in each iteration. | 1 |
Predictor | The predictor algorithm to use; the GPU and CPU predictors deliver the same results. | auto |
Eval metric | The metric used to evaluate validation data. | None |
Random state | Random number seed. | 123 |
Reg alpha | The L1 regularization term on weights. | 0 |
Reg lambda | The L2 regularization term on weights. | 1 |
Sampling method | The method used to sample the training instances. | Uniform |
Tree method | The tree construction algorithm used by XGBoost. | Auto |
Scale Pos weight | Controls the balance of positive and negative weights; useful for unbalanced classes. | 1 |
Subsample | Subsample ratio of the training instances. | 1 |
Validate parameters | When True, XGBoost validates the input parameters to check whether a given parameter is used. | 1 |
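And the XGBoost values as a constructor call (a sketch; most of the tabulated values coincide with the library defaults, so the notable change is the fixed random seed):

```python
# XGBoost classifier instantiated with the best values from the table above.
from xgboost import XGBClassifier

xgb = XGBClassifier(
    objective="binary:logistic",  # binary classification, outputs probabilities
    booster="gbtree",
    learning_rate=0.3,
    max_depth=6,
    n_estimators=100,             # number of boosted trees
    gamma=0,                      # minimum loss reduction to make a split
    min_child_weight=1,
    max_delta_step=0,
    subsample=1,
    colsample_bytree=1,
    colsample_bylevel=1,
    colsample_bynode=1,
    reg_alpha=0,                  # L1 regularization term
    reg_lambda=1,                 # L2 regularization term
    scale_pos_weight=1,           # rebalancing weight for the positive class
    tree_method="auto",
    random_state=123,
    n_jobs=-1,
)
```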
Ref. | Methodology | Accuracy | Feature Selection | Datasets |
---|---|---|---|---|
[14] | RF, DT, and NB | 94.781% | No | Stroke prediction dataset |
[15] | NB, RF, LR, KNN, SGD, DT, MLP, Majority Voting, and Stacking, which was the best. | 98% | No | Stroke prediction dataset |
[16] | LR, DT, Voting, and RF, which was the best. | 96% | No | Stroke prediction dataset |
[17] | SVM, KNN, DT, MLP, and RF, which was the best | 99% | Recursive Feature Elimination | Medical record dataset
[18] | LR, DT, KNN, SVM, RF, and NB, which was the best | 82% | No | Stroke prediction dataset |
[19] | SVM, LASSO, and NN, which was the best | 79% | Perceptron neural network and principal component analysis (PCA) | Electronic health records dataset
[20] | NB, LR, Logistic R, KNN, DT, AdaBoost, Improvised RF | 96.97% | NIHSS | Medical record |
[21] | LR, SVM, ANN, XGBoost, and RF, which was the best | 97% | No | Stroke prediction dataset |
Proposed RXLM | RF, XGBoost, and LightGBM | 96.34% | No | Stroke prediction dataset |