Proceeding Paper

COVID-19 Prediction Using Machine Learning †

by Ali Raza 1,*, Attique Ur Rehman 1 and Imam Sanjaya 2

1 Department of Software Engineering, University of Sialkot, Sialkot 51040, Pakistan
2 Department of Informatic Engineering, Nusa Putra University, Sukabumi 43152, West Java, Indonesia
* Author to whom correspondence should be addressed.
Presented at the 7th International Global Conference Series on ICT Integration in Technical Education & Smart Society, Aizuwakamatsu City, Japan, 20–26 January 2025.
Eng. Proc. 2025, 107(1), 60; https://doi.org/10.3390/engproc2025107060
Published: 4 September 2025

Abstract

The COVID-19 virus caused unprecedented global disruption, with millions of cases and deaths reported worldwide. Accurate prediction of COVID-19 trends is crucial for effective decision-making, resource allocation, and policy formulation. Machine learning (ML) has been shown to be an excellent method for projecting the virus’s growth and impact, as it can analyze vast datasets, discover trends, and develop predictive models. This study examines the use of various machine learning techniques for the prediction of COVID-19, such as time series analysis, regression models, and classification techniques. This paper further addresses the problems and constraints of applying ML models in this context and suggests possible enhancements for future forecasting endeavors. The overall intention of this work is to show how ML-based methods contribute to pandemic forecasting in terms of improvements in pandemic preparation and response schemes.

1. Introduction

COVID-19 posed a challenge for most health systems around the world, along with its huge impact on worldwide economies and routines [1]. The fast spread of the virus spurred governments and healthcare groups to implement emergency measures aimed at controlling transmission. One of the critical aspects of managing the pandemic has been predicting the future trajectory of the virus, such as forecasting infection rates, hospitalizations, and deaths [2]. With early and accurate prediction, healthcare policy decisions, resource allocation, and public health interventions can be made and carried out, potentially saving lives and reducing the burden on healthcare systems. Machine learning (ML) has proven to be an effective tool in many areas, and its promising applications are widespread. With the help of large amounts of historical and real-time data, ML algorithms find hidden patterns and trends that traditional models may miss. In the context of COVID-19, these algorithms may use a wide variety of data sources, ranging from case counts and mobility data to healthcare system capacity and even social distancing behaviors, to predict future outbreaks and identify potential risk factors [3,4]. This research seeks to explore how several machine learning algorithms can be used to forecast COVID-19 and its spread, as well as the outcomes.

2. Literature Review

Past research explores the application of machine learning algorithms for forecasting the probability of mortality among hospitalized COVID-19 patients based on clinical data obtained upon admission. Data from 1500 patients were used. The random forest (RF) algorithm performed the best in terms of accuracy (95.03%) and precision (94.23%). These works underline the potential of ML models, particularly RF, for helping to identify high-risk patients and optimizing hospital resources for better patient care [1]. Limitations include the retrospective approach, single-center data, and the imbalanced datasets used. Other studies develop automatic COVID-19 prediction systems based on different techniques, such as machine learning combined with explainable AI, used to increase a model’s interpretability. The authors used an open-source dataset with over 2.7 million records; data preprocessing techniques such as the Synthetic Minority Oversampling Technique (SMOTE) for handling imbalanced datasets; logistic regression, decision trees, and random forests; and a hybrid Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM). Among them, the CNN-LSTM obtained the highest accuracy at 96.34% and an F1 score of 0.98 [2]. Another work presents a machine learning model for diagnosing COVID-19 infection using simple binary characteristics, including age, gender, known contact with an infected person, and primary signs of infection (for example, cough, fever, etc.). Using a database containing 51,000 training records and nearly 47,000 testing records, the suggested model achieved impressive accuracy (AUROC of 0.90). Such screening strategies can be used to manage healthcare resources well, especially in resource-limited settings [3]. Further studies focused on supervised and unsupervised learning techniques. Supervised learning outperformed unsupervised methods, achieving up to 92.9% accuracy on tasks such as classification. The three most popular machine learning algorithms (logistic regression, Artificial Neural Networks (ANNs), and Convolutional Neural Networks (CNNs)) have been applied to the evaluation of datasets such as patient records and chest X-rays. These strategies demonstrated promise for improving diagnostic accuracy, predicting patient outcomes, and optimizing resource allocation [4]. This led to the development of a new way of predicting cases and deaths caused by COVID-19 by making use of cutting-edge deep learning and reinforcement learning methods. In one study, a Multilayer LSTM (MLSTM) model was used for predicting future trends of cases while integrating deep reinforcement learning (DRL) for optimizing predictions based on patient symptoms. To validate the model, real-world data from India were used, obtaining high performance in comparison with standard methods such as logistic regression and basic Long Short-Term Memory (LSTM) models. The system achieved high accuracy and low error rates, with a strong correlation between the actual and predicted data [5]. Another study examines the use of machine learning on full blood count testing for the early diagnosis of COVID-19.
Algorithms were used to analyze the data from these tests. One study provided an alternative that was faster, cheaper, and easier than standard techniques such as Reverse Transcription-Polymerase Chain Reaction (RT-PCR), which is time-consuming, expensive, and sometimes inaccurate. The most accurate training performances were found with random forest, kStar, and K-Nearest Neighbors (KNN), while OneR had the highest testing accuracy compared to the others [6]. Another study examined the full range of machine learning techniques for COVID-19 diagnosis, mortality prediction, and severity assessment using clinical and laboratory data. Support vector machines, logistic regression, and random forest are examples of supervised learning algorithms that are often employed. ML provides faster, cost-effective alternatives to standard methods of diagnosis such as RT-PCR, and it helps to identify critical cases early on. By targeting resource use towards high-risk groups, such models can have huge implications for improving healthcare during pandemics [7]. The supervised machine learning techniques that can be used to forecast the existence of COVID-19 were evaluated based on a publicly available dataset. In this context, the authors used the Waikato Environment for Knowledge Analysis (WEKA) software with the J48 Decision Tree, random forest (RF), support vector machine (SVM), KNN, and Naïve Bayes (NB) algorithms, applying 10-fold cross-validation to assess model performance. SVM was found to be the best algorithm, as it achieved the lowest mean absolute error of 0.012 and a maximum accuracy of 98.81% [8]. The application of machine learning for forecasting and prediction is discussed in another study, identifying infections using symptom-based models and time series analysis. It evaluates the effectiveness of various machine learning algorithms, such as the Extra Trees Classifier (ETC) and Auto Regressive Integrated Moving Average (ARIMA), in detecting infections and estimating case counts. The ETC had a prediction accuracy of 93.62%, whereas ARIMA anticipated confirmed cases. Another research paper describes the application of machine learning in healthcare for diagnosis automation, resource planning enhancement, and pandemic management. It finds that while ML-based forecasting is not perfectly accurate, it does provide insight for proactive decision-making, and it suggests the addition of other aspects to produce better results [9]. A further paper discusses COVID-19 cases around the world and within individual countries. Decision tree and linear regression algorithms predicted the time series for confirmed cases, recoveries, and deaths. The prediction accuracy was high, with an overall R2 of 0.99 for confirmed instances. COVID-19 infections were expected to fall dramatically worldwide by the first week of September 2021, with a gradual halt soon after. Another study emphasizes the effect of public health interventions such as lockdowns in flattening the curve and expands the model to assess the influence of health guidelines [10]. Using blood biomarkers, one study develops a machine learning model to forecast mortality risk in COVID-19 patients. The researchers examined data from 398 individuals, including 43 deaths, and focused on 5 critical biomarkers. For forecasting death up to 48 h in advance, a support vector machine (SVM) model demonstrated 91% sensitivity and specificity (AUC 0.93).
The model was interpreted using Shapley additive explanations (SHAP), which revealed that inflammatory markers and renal function indicators played major roles in mortality prediction. The results highlight how this approach can guide clinical decisions and optimize resource allocation in critical care [11]. Other research studies the use of machine learning classifiers [12,13,14] to predict COVID-19 infection using 14 clinical characteristics. In another paper, data were also processed through automated machine learning (AutoML). That analysis used data from 4313 patients and 48 clinical variables, and the top model was a stacked ensemble with an Area Under the Precision-Recall Curve (AUPRC) of 0.807. The ten most influential variables were then used to retrain the model, which achieved comparable results, with an AUPRC of 0.791 [15]. Existing schemes and their performance are summarized in Table 1.

3. Methodology

3.1. Data Collection and Preprocessing

The dataset has 58,561 samples with 9 columns per sample, providing relevant information on factors that could affect the outcome or course of COVID-19. These attributes cover demographics, comorbidities, symptoms, and exposure history, included to potentially generate worthwhile conclusions about COVID-19 infection and severity. Before analysis, categorical variables were encoded into numbers to ensure a uniform data representation. All numerical features were standardized so that their values have a mean of zero and a standard deviation of one, allowing for comparison during model training. Missing values were handled by imputing appropriate replacements [16].
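A minimal sketch of this preprocessing step is given below, using pandas and scikit-learn. The file name, column names, and imputation rules are assumptions for illustration; the paper does not specify its exact pipeline.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the dataset (hypothetical file name; 58,561 rows and 9 columns are expected)
df = pd.read_csv("covid_dataset.csv")

# Impute missing values: mode for categorical columns, median for numeric ones (assumed strategy)
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

# Encode categorical variables as integer codes
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Standardize features to zero mean and unit variance; "outcome" is a hypothetical target column
features = df.drop(columns=["outcome"])
target = df["outcome"]
X = StandardScaler().fit_transform(features)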

3.2. Split Dataset

During the data split step, the dataset was divided into two sets: 30% was used for testing and the remaining 70% was used for training [17]. This partitioning allowed the machine learning models to be trained on the majority of the data while enabling accurate performance evaluation on the reserved test set.
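The 70/30 partition described above can be reproduced with a single call in scikit-learn; the sketch below assumes the X and target variables from the preprocessing step, and the random seed and stratification are illustrative choices.

from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing; stratify to preserve class proportions (assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.30, random_state=42, stratify=target)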

3.3. Feature Selection

A random forest model was used to analyze and rank the importance of various features for predicting COVID-19. This analysis revealed several significant predictors that play a crucial role in the outcomes [18]. These factors were identified as key contributors that may influence the overall results and warrant further investigation.
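A sketch of how such a ranking can be obtained with scikit-learn is shown below; it assumes the training split from the previous step and does not reproduce the specific predictors reported in the paper.

from sklearn.ensemble import RandomForestClassifier

# Fit a random forest and rank features by impurity-based importance
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

importances = sorted(zip(features.columns, rf.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print(f"{name}: {score:.3f}")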

3.4. Model Development

Several machine learning techniques were used in this investigation to evaluate how well they performed on the dataset:
  • Logistic Regression: Logistic regression was applied as the baseline model, providing a standard against which all other models could be compared. It is mostly used for binary classification.
  • K-Nearest Neighbors (KNN): The technique was applied with different values of k, specifically 1 and 3, to determine which value of k gave the optimal number of nearest neighbors for classifying the data points. KNN is a nonparametric method that can easily adapt to any kind of data distribution.
  • Random Forest: This is an ensemble learning method, which is suitable for tuning with GridSearchCV, systematically searching through multiple combinations of hyperparameters to discover the best configuration for improving accuracy and performance (see the sketch after this list).
  • Decision Tree: This is a nonparametric model that does not assume any particular distribution of data.
  • Naïve Bayes: This is a parametric model that assumes conditional independence between features.
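The sketch below assembles this model set in scikit-learn, including a GridSearchCV search for the random forest as noted above. The hyperparameter grid and other settings are illustrative assumptions, not the configurations reported in the paper.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN (k = 1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (k = 3)": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Tune the random forest over a small, assumed hyperparameter grid
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
models["Random Forest"] = grid.best_estimator_

# Train each model and report test-set accuracy
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))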

3.5. Ensemble Method

A soft voting ensemble combines the benefits of logistic regression, random forest, KNN, and Naive Bayes to improve model robustness. By combining the predictions of these four algorithms, it produces results that are more accurate and reliable, which is precisely where it is most effective with complex data. This methodology covers five steps: data preprocessing, data splitting, feature selection, model development, and the ensemble method.
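A minimal sketch of such a soft-voting ensemble in scikit-learn is shown below; the individual estimator settings are assumptions, and probability averaging is used so that each base learner contributes its predicted class probabilities.

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Combine the four base learners; voting="soft" averages predicted probabilities
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)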

4. Results and Discussion

Table 2 summarizes the evaluation of each model. SVM achieved the highest accuracy at 87.00%, while logistic regression and KNN also performed reasonably well. Still, these results suggest areas for improvement, specifically with regard to sensitivity and false positive detection. The workflow for applying the K-Nearest Neighbors (KNN) model using RapidMiner is illustrated in Figure 1.
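The accuracy, precision, and recall figures in Table 2 correspond to standard classification metrics; a sketch of how they can be computed in scikit-learn for one model (the ensemble from the previous section) is given below, assuming a binary positive/negative outcome.

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_pred = ensemble.predict(X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))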

5. Conclusions

The development of a machine learning (ML) model that can accurately forecast COVID-19 outcomes has advanced significantly as a result of this effort. Through training and evaluation, we have demonstrated the potential usefulness of our method. Our findings emphasized the capabilities of several of the algorithms studied. However, we acknowledge the need for additional improvements, particularly in resolving sensitivity and class imbalance issues to ensure consistent predictions across diverse patient profiles. In the future, we will concentrate on several key areas:
  • Dataset size: increasing the size of the datasets to better understand the factors affecting COVID-19 outcomes in different groups.
  • Model optimization: performing hyperparameter optimization and feature selection to improve model performance and reduce errors.
  • Class imbalance: using more advanced techniques such as oversampling, undersampling, and algorithmic tuning to make the model more robust and accurate for under-represented groups.
Our findings have wide-ranging implications for the advancement of better patient care. This work serves as a foundation upon which more powerful diagnostic tools can be developed, allowing practitioners to make more informed decisions based on accurate, data-driven predictions.

Author Contributions

A.R. conceptualized the study, designed the research framework, and supervised the project. A.U.R. performed data collection, preprocessing, and model implementation, including KNN, evaluated the results, and prepared the figures and tables. I.S. provided supervision and guidance, critically reviewed the methodology and results, and validated the findings. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no external funding for this research.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

The data will be made available upon reasonable request to the first author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Moulaei, K.S.; Zohreh, M.-T.; Al, K.-A.H. Comparing machine learning algorithms for predicting COVID-19 mortality. BMC Med. Inform. Decis. Mak. 2022, 22, 2. [Google Scholar] [CrossRef] [PubMed]
  2. Solayman, S.A.; Mamun, M.C.; Mahmud, M.; Khan, K.R. Automatic COVID-19 prediction using explainable machine learning techniques. Int. J. Cogn. Comput. Eng. 2023, 4, 36–46. [Google Scholar] [CrossRef]
  3. El-Ush, L.A.; Ahmad, S.A.; Choi, C.C.; Muhammad, M.I. Supervised machine learning models for prediction of COVID-19 infection using epidemiology dataset. SN Comput. Sci. 2021, 2, 11. [Google Scholar]
  4. Hassan, A.A.; Kwekha-Rashid, A.B. Coronavirus disease (COVID-19) cases analysis using machine-learning applications. Appl. Nanosci. 2023, 13, 2013–2025. [Google Scholar]
  5. Kumar, R.; Dutta, F.; Bera, S.; Maji, S.; Ahmad, A.; Kumar, I.E. Recurrent neural network and reinforcement learning model for COVID-19 prediction. Front. Public Health 2021, 9, 744100. [Google Scholar] [CrossRef]
  6. Shaikh, A.A.; Bhat, B.; Ali, K.A.; Nazeer, A.; Akhtar, J.M. COVID-19 detection from CBC using machine learning techniques. Int. J. Technol. Innov. Manag. 2021, 1, 65–78. [Google Scholar]
  7. Alballa, N.; Al-Turki, A.-T.I. Machine learning approaches in COVID-19 diagnosis, mortality, and severity risk prediction: A review. Inform. Med. Unlocked 2021, 24, 100564. [Google Scholar] [CrossRef] [PubMed]
  8. Tofan, G.G.; Almasan, R.E.; Mohammadi, M.F.; Zainuddin, K.M.; Ulucay, I.M.; Lazaroiu, C.M.; Razvan, S. The economics of deep and machine learning-based algorithms for COVID-19 prediction. Oeconomia Copernic. 2024, 15, 27–58. [Google Scholar] [CrossRef]
  9. Jimenez, C.M.; Jiang, I.X.; Jiang, J.; Villavicencio, H.J. COVID-19 prediction applying supervised machine learning algorithms with comparative analysis using Weka. Algorithms 2021, 14, 201. [Google Scholar] [CrossRef]
  10. Painuli, D.M.; Bansal, D.; Painuli, A.M. Forecast and prediction of COVID-19 using machine learning. In Data Science for COVID-19; Academic Press: Cambridge, MA, USA, 2021; pp. 381–397. [Google Scholar]
  11. Shaikh, I.H.; Mahmud, S.A.-E.; Ali, A.-K.M.; Arpaci, P.M. Predicting the COVID-19 infection with fourteen clinical features using machine learning classification algorithms. Multimed. Tools Appl. 2021, 80, 11943–11957. [Google Scholar]
  12. El-Ebiary, Z.A.; Dwedar, E.A.; Ghaleb, G.; Abdelrazek, O.M.; Mansour, A.-D.; Malki, G.I. The COVID-19 pandemic: Prediction study based on machine learning models. Environ. Sci. Pollut. Res. 2021, 28, 40496–40506. [Google Scholar] [CrossRef] [PubMed]
  13. Ahmed, S.E.; Sayed, M.S. Applying different machine learning techniques for prediction of COVID-19 severity. IEEE Access 2021, 9, 135697–135707. [Google Scholar] [CrossRef] [PubMed]
  14. Ezzat, A.A.; Booth, M.P. Development of a prognostic model for mortality in COVID-19 infection using machine learning. Mod. Pathol. 2021, 34, 522–531. [Google Scholar]
  15. Ikemura, K.B.; Yamaguchi, E.Y.; Huynh, B.H.; Sharafoddini, S.M.; Li, S.K.; Johnson, L.S.; Doran, J.G.; Gonzalez, R.G. Using automated machine learning to predict the mortality of patients with COVID-19: Prediction model development. J. Med. Internet Res. 2021, 23, e23458. [Google Scholar] [CrossRef] [PubMed]
  16. Diwaker, C.; Tomar, P.; Solanki, A.; Nayyar, A.; Jhanjhi, N.Z.; Abdullah, A.; Supramaniam, M.A. A New Model for Predicting Component Based Software Reliability Using Soft Computing. IEEE Access 2019, 7, 147191–147203. [Google Scholar] [CrossRef]
  17. Kok, S.H.; Abdullah, A.; Jhanjhi, N.Z.; Supramaniam, M.A. A review of intrusion detection system using machine learning approach. Int. J. Eng. Res. Technol. 2019, 12, 8–15. [Google Scholar]
  18. Ahmed, S.; Hossain, M.A.; Bhuiyan, M.M.I.; Ray, S.K. A Comparative Study of Machine Learning Algorithms to Predict Road Accident Severity. In Proceedings of the 2021 20th International Conference on Ubiquitous Computing and Communications (IUCC/CIT/DSCI/SmartCNS), London, UK, 20–22 December 2021; pp. 390–397. [Google Scholar] [CrossRef]
Figure 1. The running state of the dataset.
Table 1. Existing schemes.

Paper | Model Used | Accuracy/Performance
[1] | Random Forest (RF) | 95.03% (Precision: 94.23%)
[2] | CNN-LSTM, Random Forest, Logistic Regression, Decision Trees | CNN-LSTM: 96.34% (F1 Score: 0.98)
[3] | ML Model using Binary Features (Age, Gender, Symptoms) | AUROC: 0.90
[4] | Supervised Learning (Logistic Regression, ANN, CNN) | Up to 92.9% accuracy
[5] | MLSTM + Deep Reinforcement Learning (DRL) | High correlation, low error rates
Table 2. Evaluation for each model.

Model | Accuracy | Precision | Recall
Logistic Regression | 73.00% | 77.28% | 67.82%
KNN (k = 1) | 63.00% | 65.56% | 58.08%
KNN (k = 3) | 61.67% | 63.85% | 59.36%
Random Forest | 79.33% | 81.59% | 71.67%
SVM | 87.00% | 91.54% | 79.15%
Naive Bayes | 49.33% | 51.61% | 41.03%
Decision Tree | 59.93% | 51.49% | 44.23%
Ensemble (Soft) | 51.70% | 53.20% | 49.85%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
