Proceeding Paper

Stroke Prediction Using Machine Learning Algorithms †

1 Department of Software Engineering, University of Sialkot, Sialkot 51040, Pakistan
2 Informatic Engineering, Nusa Putra University, Sukabumi, West Java 43152, Indonesia
* Author to whom correspondence should be addressed.
† Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.
Eng. Proc. 2025, 107(1), 36; https://doi.org/10.3390/engproc2025107032 (registering DOI)
Published: 27 August 2025

Abstract

Stroke is a major global cause of death and disability, and improving outcomes requires early prediction. Machine learning (ML) techniques have shown potential in stroke prediction, but class imbalance in datasets causes biased predictions and inferior classification accuracy. To address these problems, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance the datasets and lessen bias. Furthermore, we proposed a method that combines an autoencoder for feature extraction with a linear discriminant analysis (LDA) model for classification. A grid search was used to optimize the hyperparameters of the LDA model. To ensure a robust evaluation, we used accuracy, sensitivity, specificity, the Receiver Operating Characteristic (ROC) curve, and the area under the curve (AUC). With 98.51% sensitivity, 97.56% specificity, 99.24% accuracy, and 98.00% balanced accuracy, our model demonstrated remarkable performance, indicating its potential to improve stroke prediction and aid in clinical decision-making.

1. Introduction

Stroke is a major worldwide health concern, and improving outcomes and reducing mortality depend heavily on early prediction. Recent developments in deep learning (DL) and machine learning (ML) have demonstrated considerable promise for improving stroke prediction accuracy. Studies have shown that fuzzy score-based decision support systems (DSSs) make more accurate early stroke predictions than traditional techniques, improving detection and patient outcomes [1]. ML still faces issues such as skewed data and limited flexibility, although DL techniques show potential [2]. Models such as linear regression with polynomial feature transformation have improved stroke prediction while retaining interpretability [3]. Bias remains a problem because of imbalanced datasets. Techniques such as SMOTE and hybrid models combining autoencoders with linear discriminant analysis (LDA) have been proposed to enhance model fairness and reliability [4]. Recent research has also investigated hybrid neural networks that integrate long short-term memory (LSTM) networks and convolutional neural networks (CNNs), together with wearable technology, for real-time stroke risk assessment [5]. Collaborative projects such as NeuralCup 2023 highlight the importance of collaborative data science by showing that combining clinical and imaging data increases the accuracy of stroke prediction [6]. Nevertheless, issues with dataset imbalance and model generalization persist. To overcome these issues, we adopt several techniques for improving stroke prediction accuracy. We aim to improve model robustness and accuracy by using ensemble methods that integrate random forest, gradient boosting, and support vector machines (SVMs) [7]. Additionally, we recommend applying image-analysis techniques such as CNNs to MRI and CT scans, which would improve the early detection of stroke-related abnormalities [8].
To improve model generalization, we will use federated learning, which will improve data collection from several sources and ensure more reliable predictions in remote places [9]. We also intend to use multimodal data fusion, combining clinical, genetic, and imaging data, to provide a holistic view of stroke risk, while using SMOTE to address dataset imbalances [10]. Finally, we propose developing a web-based application that integrates these predictive models, aiding healthcare professionals in making timely decisions to improve patient care [11].

2. Literature Review

A fuzzy scoring decision support system (DSS) has been developed to improve stroke prediction and reduce late diagnoses and fatality rates. It employs data modeling, fuzzy scoring, and machine learning to forecast strokes. When tested with real clinical data, it outperformed established approaches, providing better early detection and improved patient outcomes [1]. A survey of machine learning and deep learning methods for stroke prediction from 2012 to 2022 identified 79 important studies. Although these techniques can improve accuracy, they also have drawbacks, including imbalanced data and a lack of flexibility, which shows that further research is necessary [2]. Another machine learning model uses polynomial feature transformation and linear regression to predict strokes, focusing on early detection. By preprocessing data with one-hot encoding and creating second-degree polynomial features, it outperforms previous models while remaining interpretable [3].
Stroke prediction is crucial because of its worldwide health impact; however, typical machine learning algorithms are biased by imbalanced datasets [12]. Techniques such as SMOTE and hybrid models that include autoencoders and LDA have improved fairness, feature extraction, and metrics such as sensitivity and AUC, increasing prediction reliability [4]. Another study offers a hybrid ML model that combines CNNs and LSTMs to classify stroke risk from biosignals such as ECG (electrocardiogram) data. With balanced pre-processing and robust evaluation metrics, it accomplishes real-time risk assessment and investigates wearable technology integration to improve early detection and patient treatment [5]. The NeuralCup 2023 consortium tested stroke outcome prediction models with clinical and imaging data [13], identifying lesion features, MRI sequences, and demographic parameters as key predictors. Integrating FLAIR (Fluid-Attenuated Inversion Recovery) imaging and white matter analysis enhanced predictions, supporting collaborative data science for personalized recovery planning [6].
Related work promotes better data collection and early diagnosis in remote places [7]. Another study assessed seven machine learning models for stroke prediction on a dataset of 5110 records, pre-processed to enhance data quality. With an accuracy of 82%, the deep neural network (DNN) model performed better than the others, indicating its potential for early stroke detection and treatment [8]. A new ensemble model predicts strokes by combining decision trees, logistic regression, and gradient boosting, which increases prediction accuracy. The methodology performs better than traditional approaches, potentially improving patient outcomes and supporting public health campaigns [9].
Recent research on the early prediction of brain stroke demonstrates the usefulness of machine learning models such as naive Bayes, decision trees, and logistic regression. CNNs perform well in MRI image analysis, and the prediction of stroke risk is enhanced by integrating genetic biomarkers with the SVM model. To develop more thorough stroke prediction models, future research will focus on multimodal data fusion, genetic investigations, and sophisticated imaging [10]. Statistical testing and sophisticated methods such as PCA (Principal Component Analysis) and FA (Factor Analysis) are significant in determining stroke risk factors such as age, heart disease, and high blood pressure. By combining random forests with these techniques [12,13], stroke prediction improved, reaching an accuracy of 92.55% and an AUC of 98.15%. The practical use of these developments in early intervention is highlighted by a web-based tool for medical practice [11].

3. Methodology

The proposed methodology uses the machine learning (ML) techniques listed below.

3.1. Random Forest (RF)

Random forest is an ensemble learning method that builds many decision trees during training and outputs the majority vote of the trees for classification or the mean of their predictions for regression. Random forest is adept at managing large datasets and can detect complex patterns in the data. By selecting a random subset of features for each decision tree, the method makes the ensemble more robust than a single decision tree.
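The idea can be illustrated with a minimal sketch, assuming one-split trees (stumps) in place of full decision trees and a made-up toy dataset of (age, glucose) pairs. The function names `fit_stump`, `fit_forest`, and `predict_forest` are hypothetical and are not taken from the study.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Fit a one-split tree: the (feature, threshold) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # sampling with replacement
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest

def predict_forest(forest, row):
    """Classify by majority vote over all trees."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy data (assumed for illustration): (age, glucose) -> 1 = stroke, 0 = no stroke.
rows = [(30, 85), (35, 90), (40, 95), (70, 180), (75, 200), (80, 210)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (25, 80)), predict_forest(forest, (85, 220)))
```

Each tree sees a different bootstrap sample and a different feature, so individual trees can be weak or even degenerate; the vote across trees is what gives the ensemble its stability.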

3.2. Gradient Boosting Machines (GBMs)

Gradient boosting is an effective ensemble technique that builds models sequentially, with each new model correcting the mistakes of the earlier ones. The model employs a gradient descent technique to reduce the loss function and increase prediction accuracy. By focusing on the errors of previous models, it captures non-linear relationships and combines weak models into a stronger one. Because it can handle a variety of data formats and can achieve high accuracy on imbalanced datasets, it is especially helpful for stroke prediction.
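A minimal sketch of this sequential error correction is given below. It assumes a squared-error loss (for which the negative gradient is simply the residual), regression stumps on a single feature, and a toy age-only dataset; these choices and the names `fit_gbm` and `predict_gbm` are illustrative assumptions, not the configuration used in the study.

```python
def fit_regression_stump(xs, residuals):
    """Find the single threshold that minimizes the squared error of the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_regression_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict_gbm(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)

# Toy data (assumed): age -> 1 = stroke, 0 = no stroke.
ages = [30, 35, 40, 45, 60, 65, 70, 80]
stroke = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_gbm(ages, stroke)
print(predict_gbm(model, 33), predict_gbm(model, 75))
```

A score above 0.5 can be read as a stroke prediction in this toy setting. A classification-specific loss such as log loss would be the more usual choice in practice; squared error is used here only to keep the gradient step easy to follow.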

3.3. Support Vector Machines (SVMs)

The support vector machine (SVM) is a supervised machine learning model that determines the hyperplane separating the data into classes; it can be used for classification and regression problems. Because of its capacity to manage complex data distributions, the SVM is well suited to predicting stroke risk based on clinical, genetic, and imaging variables. By maximizing the margin between classes, the SVM aims to generalize well to new data.
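As a hedged illustration of margin-based training, the sketch below fits a linear SVM by stochastic subgradient descent on the regularized hinge loss. The learning rate, regularization strength, and standardized toy features are assumptions chosen for the example, and kernel SVMs (which handle non-linear boundaries) are not covered.

```python
def train_linear_svm(rows, labels, epochs=200, lr=0.01, lam=0.01):
    """Minimize hinge loss plus an L2 penalty; labels must be -1 or +1."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified: move the hyperplane.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: apply only the regularization shrinkage.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy standardized features (assumed): +1 = stroke, -1 = no stroke.
rows = [(-1.0, -0.8), (-0.9, -1.1), (-1.2, -0.7),
        (1.0, 0.9), (0.8, 1.2), (1.1, 1.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(rows, labels)
print(svm_predict(w, b, (-1.5, -1.5)), svm_predict(w, b, (1.5, 1.5)))
```

The `margin < 1` test is what distinguishes the SVM from a plain perceptron: points that are correctly classified but too close to the boundary still trigger an update, which pushes the hyperplane toward a wider margin.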

3.4. Convolutional Neural Networks (CNNs)

Convolutional neural networks are a type of deep learning algorithm developed to handle grid-structured data such as images. Because CNNs employ convolutional layers to identify features at various levels, they are highly effective at classifying images. In stroke prediction, CNNs can be applied to MRI or CT scan images to detect regions associated with stroke.
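The core operation of a convolutional layer can be sketched as follows. This is a minimal example on a made-up 4 x 4 intensity grid with a hand-picked vertical-edge kernel; in a trained CNN the kernel values are learned, and real scans are far larger. As in most deep learning frameworks, the operation shown is technically a cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]

# Toy image (assumed): dark region on the left, bright region on the right.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = relu(conv2d(image, vertical_edge))
print(feature_map)  # [[0, 18, 0], [0, 18, 0], [0, 18, 0]]
```

The feature map responds only in the column where the intensity changes, which is the sense in which a convolutional layer localizes a feature. Stacking such layers lets later layers combine simple edges into larger structures.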

3.5. SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is a data augmentation technique that balances uneven datasets by generating synthetic samples for the minority class. Stroke datasets are typically imbalanced because they contain more non-stroke cases than stroke cases; SMOTE addresses this by creating synthetic stroke samples that keep the prediction models from drifting toward the majority class.
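The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified version, assuming purely numerical features and Euclidean distance; the function name `smote`, the neighbour count, and the toy points are illustrative assumptions. Maintained implementations (for example the `SMOTE` class in the imbalanced-learn package) handle more cases.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        # k nearest minority-class neighbours of the chosen base point
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority-class (stroke) samples with two scaled features (assumed values).
stroke_cases = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_cases = smote(stroke_cases, n_new=6)
print(len(stroke_cases) + len(new_cases))  # 10 minority samples after oversampling
```

Because every synthetic point lies on a line segment between two real minority samples, the new points stay inside the region already occupied by the minority class instead of being exact duplicates.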

3.6. Ensemble Methods

To improve the accuracy and robustness of stroke prediction, ensemble methods such as voting classifiers and stacking will be employed. Ensemble techniques combine the strengths of multiple models (random forest, SVM, and CNN) to ensure more accurate prediction. This strategy will reduce the limitations of individual models. These algorithms and methodologies will be integrated into a single framework that handles clinical and imaging data for stroke risk prediction. The suggested methodology aims to improve healthcare outcomes by achieving high accuracy and robustness in stroke prediction through ensemble methods.
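The two voting schemes can be sketched as follows, assuming each base model has already produced either a class label or a stroke probability for one patient. The probabilities below are made-up values standing in for the outputs of the random forest, SVM, and CNN; stacking, which trains a further model on these outputs, is not shown.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class predicted by the majority of models."""
    return Counter(predictions).most_common(1)[0][0]

def soft_vote(probabilities, weights=None, threshold=0.5):
    """Average the predicted probabilities (optionally weighted) and threshold the result."""
    weights = weights or [1.0] * len(probabilities)
    average = sum(w * p for w, p in zip(weights, probabilities)) / sum(weights)
    return int(average >= threshold), average

# Hypothetical stroke probabilities from three base models for one patient.
model_probabilities = [0.90, 0.40, 0.35]
model_labels = [int(p >= 0.5) for p in model_probabilities]  # [1, 0, 0]

print(hard_vote(model_labels))         # 0: two of three models say "no stroke"
print(soft_vote(model_probabilities))  # label 1, average of about 0.55
```

The example is chosen so that the two schemes disagree: hard voting ignores how confident each model is, whereas soft voting lets one very confident model outweigh two mildly doubtful ones. Which behaviour is preferable depends on how well calibrated the base models are.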

3.7. Framework

This research aims to develop a machine learning-based framework for early stroke prediction, leveraging various algorithms to accurately assess stroke risk from clinical and demographic data. The methodology follows a systematic approach involving data collection, preprocessing, model training, evaluation, and deployment. The framework incorporates several key steps to ensure robust stroke risk prediction.

3.8. Dataset Preparation

The dataset for this study includes clinical, demographic, and imaging features from diverse patient records. Clinical data includes age, gender, blood pressure, cholesterol levels, BMI, and smoking habits, while demographic data covers lifestyle factors such as marital status and work type. Imaging data such as MRI scans will be processed and used alongside clinical data. The data is collected from publicly available stroke datasets and medical records. Data preprocessing involves cleaning noisy or incomplete data, handling missing values, normalizing numerical features, and encoding categorical variables for model training.
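The cleaning steps named above can be sketched as small helper functions. This is a minimal illustration, assuming mean imputation, min-max scaling to the range 0 to 1, and one-hot encoding; the column values and category names are made up, and the study does not state which specific statistical methods were used.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale a numerical column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode a categorical column as one binary indicator per category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

# Hypothetical columns for four patients.
bmi = [22.0, None, 30.0, 26.0]
work_type = ["Private", "Self-employed", "Private", "Govt_job"]

print(min_max_scale(impute_mean(bmi)))  # [0.0, 0.5, 1.0, 0.5]
print(one_hot(work_type))
```

In a real pipeline the imputation mean and the scaling bounds would be computed on the training portion only and then reused for the test portion, so that no information from held-out patients influences the transformation.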

3.9. Data Preprocessing

Preprocessing prepares the data for machine learning model training. Missing values are imputed using statistical methods, and feature normalization is used to scale the data to a consistent range, improving model accuracy. Furthermore, synthetic data creation techniques such as SMOTE (Synthetic Minority Oversampling Technique) are used to address class imbalances, resulting in a more balanced representation of stroke and non-stroke instances. The dataset is then divided into training (80%) and testing (20%) subsets to assess model performance.
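An 80/20 split that preserves the class ratio can be sketched as below; the function name `stratified_split` and the toy label counts are assumptions for illustration. One caution applies to the ordering of steps: if oversampling is performed before the split, synthetic points interpolated from records that end up in the test set can leak into training and inflate the measured performance. A common precaution, assumed in this sketch and not necessarily the procedure followed in the study, is to split first and oversample only the training indices.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with the class ratio kept in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))  # at least one per class
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

# Toy imbalanced labels (assumed): 20 non-stroke cases and 5 stroke cases.
labels = [0] * 20 + [1] * 5
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))    # 20 5
print(sum(labels[i] for i in test_idx)) # 1 stroke case held out for testing
# Oversampling (e.g. SMOTE) would then be applied to the training rows only.
```

Stratification matters for imbalanced data because a purely random 20% sample of a small dataset can easily contain no stroke cases at all, which would make sensitivity impossible to estimate.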

3.10. Model Selection and Training

Several machine learning models will be trained to predict stroke risk, each with its unique strengths:
  • Random Forest (RF): An ensemble method that builds multiple decision trees to improve prediction accuracy and reduce overfitting. It handles high-dimensional data well and can capture complex relationships in clinical features.
  • Gradient Boosting Machines (GBMs): A sequential ensemble technique that builds models iteratively, with each model correcting errors made by the previous one. A GBM will enhance prediction by focusing on residual errors and handling imbalanced data effectively.
  • Support Vector Machines (SVMs): A classification algorithm that maximizes the margin between classes, suitable for high-dimensional spaces and offering high accuracy for stroke prediction.
  • Convolutional Neural Networks (CNNs): Applied to MRI scan data, CNNs excel at extracting features from images and identifying stroke-related abnormalities. These networks capture spatial hierarchies in imaging data, aiding stroke detection.
Each model is trained using the pre-processed data, and hyperparameters are optimized through techniques like grid search or random search to improve predictive performance.
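Grid search can be sketched as an exhaustive loop over candidate settings, each scored by cross-validation. To keep the example self-contained, a small k-nearest-neighbour classifier is used as a stand-in model; it is not one of the models listed above, and the grid values, fold count, and toy data are likewise assumptions.

```python
import itertools
import math

def knn_predict(train_rows, train_labels, row, k):
    """Stand-in model: majority label among the k nearest training points."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: math.dist(pair[0], row))[:k]
    positives = sum(label for _, label in nearest)
    return int(positives * 2 > k)

def cross_val_accuracy(rows, labels, n_folds, **params):
    """Mean accuracy over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        held_out = set(range(fold, len(rows), n_folds))
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        correct = sum(knn_predict(train_rows, train_labels, rows[i], **params) == labels[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

def grid_search(rows, labels, grid, n_folds=3):
    """Try every combination in the grid and keep the best cross-validated score."""
    best_params, best_score = None, -1.0
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        score = cross_val_accuracy(rows, labels, n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy one-feature data (assumed): two well-separated groups.
rows = [(1.0,), (1.2,), (0.8,), (1.1,), (0.9,), (1.3,),
        (3.0,), (3.2,), (2.8,), (3.1,), (2.9,), (3.3,)]
labels = [0] * 6 + [1] * 6
print(grid_search(rows, labels, {"k": [1, 3, 5]}))
```

Random search follows the same structure but samples a fixed number of combinations from the grid instead of enumerating all of them, which scales better when several hyperparameters are tuned at once.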

3.11. Model Evaluation

Once the models are trained, their performance is evaluated using various metrics such as accuracy, precision, recall, F1 score, and AUC-ROC. These metrics will allow us to assess how well each model predicts stroke risk and handles false positives or negatives. Cross-validation will be used to ensure robust performance, and a confusion matrix will be generated to analyze the misclassification errors and their impact on stroke prediction.
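The metrics listed above all follow from the four confusion-matrix counts, apart from AUC, which is computed from the ranking of predicted scores. The sketch below assumes binary labels (1 = stroke), at least one case of each class, and at least one positive prediction; the ten example scores are made up and are not results from the study.

```python
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

def roc_auc(y_true, scores):
    """Probability that a random positive case scores higher than a random negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted stroke probabilities for ten patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.6]
y_pred = [int(s >= 0.5) for s in scores]

print(confusion_counts(y_true, y_pred))  # (2, 1, 6, 1)
print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Accuracy equals the prevalence-weighted average of sensitivity and specificity, so it always lies between the two; in the example, 0.8 lies between 2/3 and 6/7. On imbalanced stroke data this means accuracy is dominated by specificity, which is why sensitivity, balanced accuracy, and AUC are reported alongside it.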

3.12. Model Integration and Deployment

After evaluating the models, the best-performing algorithm, potentially a hybrid model combining the strengths of multiple algorithms, will be selected for integration. The chosen model will be deployed in a real-time healthcare system for stroke risk prediction, where it can process patient data and provide actionable insights for early intervention. This system will be designed to be easy to use by healthcare professionals and will be integrated into existing decision support tools.

3.13. Performance Optimization

To further enhance the model’s accuracy and generalization ability, advanced techniques such as ensemble learning and hyperparameter tuning will be applied. The final model will be continuously updated with new patient data, allowing it to adapt to changes in stroke risk factors and improve its predictive power over time. Figure 1 shows the accuracy results of RF, GBM, SVM, and CNN models.

4. Results

This study presents a new methodology for stroke prediction that combines machine learning algorithms with clinical, demographic, and image data to enhance early stroke detection. The data sources include patient health records, demographic information, and image features from MRI scans, which are used to predict stroke with machine learning models such as random forest, gradient boosting, support vector machines, and convolutional neural networks (CNNs). The ensemble model, which combines random forest and gradient boosting, achieves high accuracy.
The proposed method provides a non-invasive and efficient technique for timely stroke risk assessment using clinical and image data. This strategy gives medical professionals the resources they need to identify and manage high-risk individuals. The use of advanced approaches such as SMOTE to deal with imbalanced data and ensemble methods to improve model accuracy supports the trustworthiness of the model in clinical situations. Potential developments include adding genetic biomarkers and multimodal data fusion to improve the system's predictive ability, resulting in a more complete solution for stroke risk prediction.

5. Conclusions and Future Work

In conclusion, this research proposes a machine learning-based approach for stroke risk prediction that integrates clinical data, demographic information, and advanced imaging techniques. By assessing crucial factors such as age, heart disease, hypertension, and MRI-derived features, the system gives an accurate and quick evaluation of stroke risk, permitting early intervention. Among the models tested, the ensemble strategy combining random forest and gradient boosting machines performed best, considerably improving accuracy and AUC and hence increasing prediction reliability. This non-invasive technology has the potential to change stroke prediction, providing a valuable tool for healthcare practitioners in risk assessment and management. Future work should focus on incorporating genetic biomarkers, using data from many sources, and examining new approaches such as federated learning to improve data collection from remote places. With the development of timely prediction, IoT-enabled clinics should provide accessible stroke prediction across a broad variety of healthcare institutions, improving patient health and supporting global health initiatives.

Author Contributions

N.K. was responsible for the conceptualization of the study, the overall research design, and project administration. She also contributed to the data curation, validation of results, and preparation of the initial draft of the manuscript. S.J. played a key role in data curation, conducting the analysis, and reviewing the manuscript critically for important intellectual content. D.D.D. contributed to validation, provided critical insights during the review process, and supported in the refinement of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no external funding for this research.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data will be made available upon reasonable request to the first author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ben Kahla, M.; Kanzari, D.; Ben Amor, S.; Ayachi Ghannouchi, S.; Martinho, R. Enhanced Fuzzy Score-Based Decision Support System for Early Stroke Prediction. ACM Trans. Comput. Healthc. 2025, 6, 1–23.
  2. Byna, A.; Lakulu, M.M.; Panessai, I.Y. Machine Learning-Based Stroke Prediction: A Critical Analysis. Int. J. Adv. Sci. Eng. Inf. Technol. 2024, 14, 1609–1618.
  3. Sitanaboina, S.L.P.; Aruna Devi, B.; Kulkarni, G.L.; Murugan, S.; Vijayammal, B.K.P.; Neha. Exploring feature relationships in brain stroke data using polynomial feature transformation and linear regression modeling. J. Mach. Comput. 2024, 4, 1158–1169.
  4. Saleem, M.A.; Javeed, A.; Akarathanawat, W.; Chutinet, A.; Suwanwela, N.C.; Kaewplung, P.; Benjapolakul, W. An intelligent learning system based on electronic health records for unbiased stroke prediction. Sci. Rep. 2024, 14, 23052.
  5. Liu, T.; Fan, W.; Wu, C. A hybrid machine learning approach to cerebral stroke prediction based on imbalanced medical dataset. Artif. Intell. Med. 2019, 101, 101723.
  6. Matsulevits, A.; Alvez, P.; Atzori, M.; Beyh, A.; Corbetta, M.; Del Pup, F.; de Schotten, M.T. Benchmarking Stroke Outcome Prediction through Comprehensive Data Analysis–NeuralCup 2023. bioRxiv 2024, 2024-10.
  7. Khatri, P.; Sharma, A. An Optimized Machine Learning-Based Stroke Prediction: Enhancing Precision Medicine and Public Health. In Proceedings of the 2024 International Conference on Data Science and Network Security (ICDSNS), Tiptur, India, 26–27 July 2024; pp. 1–6.
  8. Asan Nainar, M. Predictive modeling for brain stroke detection using machine learning. Int. J. Sci. Res. Eng. Manag. 2024, 8, 1–5.
  9. Byna, A.; Lakulu, M.M.; Panessai, I.Y. Current critical review on prediction stroke using machine learning. Bull. Electr. Eng. Inform. 2024, 13, 3470–3480.
  10. Wu, D.; Zhang, X.; Zhu, X. A machine learning-based model for stroke prediction. J. Biomed. Eng. Res. 2024, 67, 20240645.
  11. Sahriar, S.; Akther, S.; Mauya, J.; Amin, R.; Mia, M.S.; Ruhi, S.; Reza, M.S. Unlocking stroke prediction: Harnessing projection-based statistical feature extraction with ML algorithms. Heliyon 2024, 10, e27411.
  12. Aldughayfiq, B.; Ashfaq, F.; Jhanjhi, N.; Humayun, M. Capturing semantic relationships in electronic health records using knowledge graphs: An implementation using mimic iii dataset and graphdb. Healthcare 2023, 11, 1762.
  13. Aldughayfiq, B.; Ashfaq, F.; Jhanjhi, N.; Humayun, M. YOLOv5-FPN: A robust framework for multi-sized cell counting in fluorescence images. Diagnostics 2023, 13, 2280.
Figure 1. Algorithm accuracy.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kanwal, N.; Javaid, S.; Dewi, D.D. Stroke Prediction Using Machine Learning Algorithms. Eng. Proc. 2025, 107, 36. https://doi.org/10.3390/engproc2025107032

AMA Style

Kanwal N, Javaid S, Dewi DD. Stroke Prediction Using Machine Learning Algorithms. Engineering Proceedings. 2025; 107(1):36. https://doi.org/10.3390/engproc2025107032

Chicago/Turabian Style

Kanwal, Nayab, Sabeen Javaid, and Dhita Diana Dewi. 2025. "Stroke Prediction Using Machine Learning Algorithms" Engineering Proceedings 107, no. 1: 36. https://doi.org/10.3390/engproc2025107032

APA Style

Kanwal, N., Javaid, S., & Dewi, D. D. (2025). Stroke Prediction Using Machine Learning Algorithms. Engineering Proceedings, 107(1), 36. https://doi.org/10.3390/engproc2025107032
