1. Introduction
The global COVID-19 pandemic has generated a vast amount of data related to the spread and impact of the disease worldwide. These data represent an invaluable resource for healthcare professionals, epidemiologists, and researchers, offering critical insights into the dynamics of the disease and its effects on populations [1]. However, working with these datasets poses significant challenges, particularly due to the presence of missing values, which can compromise the accuracy of analyses and predictions [2,3].
Machine learning plays a pivotal role in diverse fields, including healthcare, industry, and scientific research. In healthcare, it has been employed for early disease detection, such as COVID-19 diagnosis, as well as for optimizing treatment strategies and predicting clinical outcomes [4,5]. Across industries, machine learning enhances automation and intelligent decision-making, thereby improving efficiency and productivity [6]. It is also widely used in research to uncover patterns in large datasets and support real-time decision-making in sectors such as finance, marketing, and logistics [7]. This versatile approach provides powerful tools to address complex challenges across various domains.
Supervised learning models, which are trained using labeled data, are particularly effective for classifying patients into confirmed or dismissed COVID-19 cases [8,9]. This research evaluates several supervised learning techniques, including Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Logistic Regression (LR), Decision Trees (DTs), and Random Forests (RFs), to determine their performance in predicting COVID-19 cases. This study also investigates how these models respond to missing data, focusing on the use of imputation techniques to address such gaps [10].
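As a point of reference, the minimal sketch below shows one way the five classifier families can be instantiated in a scikit-learn workflow; the hyperparameter values are illustrative placeholders and do not reflect the configuration used in this study.

```python
# Minimal sketch: instantiating the five supervised classifiers with scikit-learn.
# Hyperparameter values are illustrative defaults, not the study's settings.
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

models = {
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
    "ANN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=42),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=200, random_state=42),
}
```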
Recently, advanced data imputation strategies have emerged, leveraging deep learning architectures such as Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs) [11,12]. While these methods can capture highly complex distributions in medical data, they often require specialized hardware, extensive hyperparameter tuning, and substantial computational resources [13]. In this study, we focus on four well-established imputation approaches—Random Forest (RF), Predictive Mean Matching (PMM) via Multiple Imputation by Chained Equations (MICE), K-Nearest Neighbor (KNN), and eXtreme Gradient Boosting (XGBoost)—chosen for their robust performance, relative ease of implementation, and suitability for large-scale datasets with moderate computational demands [14]. By doing so, we aim to balance methodological rigor with practical applicability in real-world healthcare environments.
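To make the comparison concrete, the sketch below shows how the four imputation families might be invoked on a numeric feature matrix in Python; it is an illustration under our own assumptions, not the study's implementation. In particular, PMM is typically run with R's mice package, and scikit-learn's IterativeImputer only approximates chained equations without the matching step.

```python
# Sketch of the four imputation families applied to a numeric NumPy array X.
# IterativeImputer approximates MICE-style chained equations but not PMM itself.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

imputers = {
    "MICE-like": IterativeImputer(max_iter=10, random_state=0),
    "RF-based": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=100, random_state=0),
        max_iter=10, random_state=0),
    "KNN": KNNImputer(n_neighbors=5),
    "XGBoost": IterativeImputer(
        estimator=XGBRegressor(n_estimators=100, random_state=0),
        max_iter=10, random_state=0),
}

def impute_all(X):
    """Return one completed copy of X per imputation strategy."""
    return {name: imp.fit_transform(X) for name, imp in imputers.items()}
```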
This study uses a dataset from the Department of Concepción, Paraguay, which provides a regional perspective on COVID-19 cases in the South American context. Although localized, this dataset offers valuable information on disease dynamics and contributes to the global effort to improve diagnostic methods and data analysis practices.
5. Results
The supervised models (SVM, RF, DT, LR, and ANN), widely recognized for their efficacy in medical diagnostics, have been extensively applied to epidemiological research, particularly in the context of COVID-19 [15,56]. The results are organized into two primary sections.
Section 5.1 details the performance of the supervised learning models on the imputed datasets.
Section 5.2 discusses the impact of the data imputation techniques on the evaluation metrics.
5.1. Supervised Learning Models’ Performance on Imputed Datasets
The performance of the models across various imputation levels (Levels 0 to 4) was evaluated using key metrics, including accuracy, sensitivity, specificity, F1-score, MCC, and AUC.
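For clarity, the sketch below shows how these six metrics can be derived from a model's test-set predictions; y_true, y_pred, and y_score are placeholder names for the true labels, predicted labels, and positive-class probabilities, and the snippet is an illustration rather than the study's evaluation code.

```python
# Sketch: computing the six reported metrics for a binary classifier.
# y_true, y_pred, and y_score (positive-class probabilities) are placeholders.
from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score, confusion_matrix)

def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),  # recall on the positive class
        "specificity": tn / (tn + fp),                # recall on the negative class
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
```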
Table 3 summarizes the results achieved using the PMM imputation method, highlighting how each model’s performance evolved as the level of missingness increased.
From Table 3, RF emerged as the top-performing model, achieving the highest accuracy (0.826) and AUC (0.902) at Level 0. Across all levels, RF demonstrated minimal performance degradation, maintaining an accuracy of 0.773 and an AUC of 0.858 at Level 4. Its metrics, including accuracy, sensitivity, specificity, and F1-score, remained relatively stable, highlighting its resilience to imputation variability. SVM excelled in specificity, reaching 0.930 at Level 4, surpassing its specificity at Level 0 (0.828). However, its sensitivity declined significantly to 0.633 at Level 4, indicating a tendency to favor negative predictions. This improvement in specificity with imputation suggests that PMM may introduce a bias toward negative classification outcomes. ANN displayed marked sensitivity to missing data, with a substantial decline in accuracy and F1-score between Level 0 and Level 1. While its metrics showed slight recovery at higher levels, its overall performance was notably affected by the presence of missing data.
Similar results were obtained with RF imputation, as shown in Table 4. Across all levels of imputation using the RF method, the RF classifier demonstrated minimal performance degradation, maintaining an accuracy of 0.760 and an AUC of 0.857 at Level 4. Its metrics, including sensitivity, specificity, and F1-score, remained consistently high, reflecting its robustness and adaptability to imputation variability. SVM excelled in specificity, reaching 0.950 at Level 3 and 0.918 at Level 4, surpassing its Level 0 specificity of 0.828. However, its sensitivity declined significantly, indicating a trade-off that favored negative classifications at higher levels of imputation. ANN exhibited sensitivity to missing data, showing less stability across imputation levels.
From Table 5, we see that RF remains quite stable under KNN-based imputation, maintaining an accuracy of 0.772 and an AUC of 0.859 at Level 4. These figures underscore RF’s robustness, even when missing values are replaced via KNN. By contrast, SVM achieves a standout specificity of 0.962 at Level 3—surpassing its Level 0 value of 0.828—but does so at the expense of sensitivity, which declines to 0.576. ANN records moderate metrics overall, with accuracy diminishing from 0.764 at Level 0 to 0.743 at Level 4, indicating that it remains somewhat sensitive to the nuances introduced by imputation. DT sustains a reasonably balanced profile, particularly at Levels 3 and 4, where accuracy hovers around 0.756–0.750. Lastly, LR maintains consistent performance but seldom outperforms the ensemble approaches (RF, DT) or SVM on the principal metrics.
Turning to Table 6, which shows the models’ outcomes with XGBoost-based imputation, RF again performs best. Although its accuracy dips slightly at some intermediate levels, the model finishes strong with a 0.769 accuracy and 0.860 AUC at Level 4, underscoring its adaptability across different imputation techniques. SVM once more leverages high specificity—peaking at 0.953 at Level 3—yet sees a corresponding drop in sensitivity (to 0.569), reinforcing the trade-off observed in other imputation settings. ANN presents respectable but modest results, with accuracy falling from 0.764 at Level 0 to 0.748 at Level 4. DT remains fairly steady, moving from a 0.801 accuracy at Level 0 to roughly 0.750–0.754 at higher imputation levels—less stable than RF, but still competitive. LR stays consistent, though it rarely eclipses the ensemble techniques or SVM in accuracy or F1-score.
Overall, Random Forest (RF) performed consistently well across all four imputation methods (PMM, RF-based, KNN, and XGBoost). Even at higher levels of missingness (Level 4), it generally retained an accuracy above 0.76 and AUC above 0.85, indicating relatively low performance degradation compared to the baseline. In contrast, SVM often improved its specificity—sometimes exceeding 0.95—although this typically coincided with lower sensitivity, suggesting a tendency toward negative classifications. Meanwhile, ANN showed moderate resilience but tended to lose 1–2% in accuracy at higher missingness levels, implying some sensitivity to the chosen imputation strategy. Decision Trees (DTs) maintained balanced results, with accuracy values usually fluctuating between 0.74 and 0.80, while Logistic Regression (LR) remained consistent but rarely matched the top accuracy or F1-score results presented by the ensemble- or margin-based models.
When comparing the four imputation methods, PMM- and RF-based approaches offered slightly higher accuracy and AUC for certain models, notably RF itself. However, KNN and XGBoost also produced competitive outcomes: KNN generally preserved RF’s high accuracy (near 0.77) and AUC (around 0.86) at Level 4, whereas XGBoost kept RF above 0.76 accuracy and 0.86 AUC under comparable conditions. For SVM, ANN, and DT, the main patterns held across KNN and XGBoost, suggesting that the models’ performance differences stem more from specificity–sensitivity trade-offs than from large shifts in overall accuracy. In general, the ensemble-based methods—particularly RF—adapted well to missing data imputation, while other algorithms still achieved favorable metrics under certain imputation scenarios.
5.2. Impact of Data Imputation Techniques on Evaluation Metrics
This section evaluates how the four imputation methods (PMM, RF-based, KNN, and XGBoost) impact model performance across increasing missingness levels (0–20%). We first analyze global trends in accuracy and AUC, then dissect granular trade-offs in F1-score, MCC, sensitivity, and specificity. A focused comparison of ANN and SVM—the most imputation-sensitive models—highlights critical performance fluctuations at extreme missingness (Levels 0 vs. 4). Finally, we synthesize recommendations for method–model pairing based on metric priorities.
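To illustrate how such missingness levels can be generated for this kind of analysis, the sketch below masks values completely at random at increasing fractions; the mapping of Levels 0–4 to evenly spaced fractions from 0% to 20% is our reading of the range stated above, not a specification taken from the study.

```python
# Sketch: producing Level 0-4 datasets by masking cells completely at random (MCAR).
# The level-to-fraction mapping below is an assumption for illustration only.
import numpy as np

def make_missing(X, fraction, seed=None):
    """Return a copy of the NumPy array X with `fraction` of its cells set to NaN."""
    rng = np.random.default_rng(seed)
    X_miss = X.astype(float).copy()
    mask = rng.random(X_miss.shape) < fraction
    X_miss[mask] = np.nan
    return X_miss

levels = {level: frac for level, frac in enumerate([0.00, 0.05, 0.10, 0.15, 0.20])}
# e.g., datasets = {lvl: make_missing(X, frac, seed=lvl) for lvl, frac in levels.items()}
```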
Examining accuracy and AUC first, Random Forest (RF) consistently retains higher values relative to other models. For instance, with PMM, RF’s accuracy declines modestly from 0.826 at Level 0 to 0.773 at Level 4, while AUC similarly falls from 0.902 to 0.858. A comparable pattern emerges with RF-based imputation, where accuracy starts at 0.826 and ends at 0.760, and AUC remains above 0.85. KNN and XGBoost also produce competitive results—under KNN, for example, RF’s accuracy sits around 0.772 and its AUC at 0.859 by Level 4, and with XGBoost imputation, the final accuracy is near 0.769 and AUC near 0.860. In contrast, algorithms like ANN and SVM show more noticeable drops in accuracy and AUC when missingness rises, indicating that ensemble-based methods are generally more robust to how data are filled.
Turning to the F1-score and MCC, RF again demonstrates stability. Typically, it retains F1-scores above 0.77 and MCC values above 0.49–0.50 across all imputation levels. Meanwhile, ANN sees steeper dips, especially at lower imputation levels under PMM, where the F1-score can drop by about 8–10 percentage points between Levels 0 and 1. SVM is somewhat more volatile: it benefits from higher specificity but often loses ground in sensitivity, which in turn lowers its F1-score and MCC. Notably, RF-based imputation appears to smooth out these fluctuations for tree-based algorithms like DT, which remains relatively stable in its F1-scores (roughly 0.69–0.76) and MCC values (0.45–0.54) across different levels of missingness.
To further quantify these trade-offs, Table 7 contrasts ANN and SVM performance at Level 0 and Level 4 across imputation methods.
Table 7 highlights two critical patterns. For ANN, sensitivity declines by 13–15% at Level 4, while specificity improves by 20–22%, suggesting a bias toward negative classifications under missingness. XGBoost mitigates this effect slightly, with the smallest sensitivity drop (−13%). For SVM, specificity spikes (e.g., +14% with KNN), but sensitivity plummets by 20–23%, indicating a conservative avoidance of false positives. XGBoost strikes the best balance for SVM, minimizing losses in sensitivity (−20%) and F1-score (−10%).
Regarding sensitivity (recall) and specificity, SVM often spikes in specificity as missingness increases—reaching values above 0.90 with KNN or XGBoost—but its sensitivity can concurrently drop below 0.60. ANN, though not as extreme, also experiences moderate changes in sensitivity (around 5–10% differences), suggesting that margin-based and neural models are particularly sensitive to shifts in data quality post-imputation. In contrast, RF consistently achieves balanced sensitivity and specificity (frequently both above 0.70), underscoring its resilience to the choice of imputation method.
Overall, these findings indicate that PMM- and RF-based approaches often preserve a higher accuracy, AUC, and F1-score for models such as RF and DT. KNN and XGBoost also produce comparably strong results, though they can introduce more pronounced trade-offs in sensitivity versus specificity for models like SVM or ANN. As Table 7 demonstrates, these trade-offs are most acute at Level 4, where KNN maximizes SVM’s specificity (0.942) but at a steep cost to sensitivity (0.620). Consequently, while most methods maintain respectable performance at lower missingness levels, the degree of metric fluctuation at higher levels (e.g., Level 3 or 4) can help determine which combination of approach and model is best—particularly when certain metrics (e.g., sensitivity or AUC) are given priority.
Following these observations, Figure 1 provides a visual overview of how each imputation method influences the two principal metrics, accuracy and AUC, as missingness increases from Level 0 to Level 4. In Figure 1a, the four approaches—PMM, RF-based, KNN, and XGBoost—are plotted against the average accuracy for all models at each level of missingness. Meanwhile, Figure 1b compares these same methods in terms of their average AUC. Together, these visuals reinforce the conclusion that an RF-based imputation method tends to sustain higher accuracy and AUC across most models and levels of missingness.
6. Discussion
The findings of this study confirm the effectiveness of RF models in imputing missing medical data under various missingness conditions. As demonstrated in [57], RF outperforms methods such as mean imputation or nearest neighbors in scenarios where data are missing completely at random or at random (MCAR and MAR), owing to its ability to handle nonlinear relationships and high-dimensional data. However, that study also noted RF’s limitations in handling data that are missing not at random (MNAR). Similarly, [58] further emphasized RF’s utility for mixed medical data, highlighting its advantages over parametric methods in avoiding distributional assumptions and efficiently managing complex interactions. Nevertheless, [59] cautioned against its use in specific scenarios.
Feng et al. [2] identified notable stability of RF in contexts with moderate proportions of missing data (less than 20%), aligning with the findings of this study. However, they also highlighted the superiority of multiple imputation methods in more complex scenarios or with higher levels of missingness. In [60], RF’s potential as an adaptable and precise method for imputing missing medical data was reinforced, with opportunities for enhanced predictive capacity through specific optimizations.
Regarding the impact on evaluation metrics, [14] stressed the importance of selecting appropriate imputation methods to minimize bias in sensitivity and specificity, noting RF’s solid balance between precision and flexibility, consistent with this study’s results. For instance, [61] also confirmed RF’s robust coverage under MAR data and its superiority over PMM in nonlinear scenarios due to its ability to manage complex interactions. However, both [14,61] observed that ANN is more sensitive to imputation methods, further reinforcing RF’s advantage in clinical contexts.
In conclusion, this study highlights the effectiveness of supervised learning models, particularly Random Forest (RF), in classifying COVID-19 cases using imputed datasets, even under significant levels of missing data. RF emerged as the most robust model, maintaining high values for accuracy, area under the curve (AUC), and F1-score across various levels of imputation and across the PMM, KNN, RF-based, and XGBoost-based imputation methods. These results align with previous research that emphasizes RF’s adaptability in imputing complex medical datasets, especially in MAR (missing at random) scenarios.
Addressing RQ1, supervised learning models exhibited variable performance depending on the imputation method and the level of missing data. RF proved to be the most consistent model, achieving the highest metrics for accuracy (0.826) and AUC (0.902) in datasets without missing data (Level 0) and showing minimal degradation at higher levels of imputation (accuracy of 0.760 and AUC of 0.857 at Level 4 with RF-based imputation). This demonstrates its capacity to adapt to changes introduced by imputation methods. Conversely, ANN and SVM were more sensitive to the quality of imputed data: ANN showed a pronounced decline in F1-score and accuracy at lower levels (Level 1 and Level 2), while SVM improved specificity (0.950 at Level 3 with RF imputation) at the expense of sensitivity.
Importantly, when alternative imputation strategies—such as KNN or XGBoost—were employed, RF generally retained its leading position, whereas SVM and ANN exhibited larger shifts in sensitivity or accuracy, reinforcing the view that ensemble-based methods are more resilient to different missing data handling approaches.
Although a comprehensive feature-importance analysis was beyond this study’s primary scope, a preliminary Random Forest assessment on the complete dataset (Level 0) highlighted the specimen collection technique (e.g., PCR vs. antigen test) as the most influential variable. This finding aligns with clinical research indicating that test sensitivity and reliability vary based on sampling methods [62]. Additional key predictors included the epidemiological week, patient age, and the timing of symptom onset, echoing prior evidence that temporal factors and demographics profoundly affect diagnostic outcomes. Finally, fever and sore throat also emerged as relevant indicators, suggesting that timely documentation of symptoms can enhance case detection [63,64]. These insights underscore the practical importance of systematically capturing both logistical (e.g., test availability) and clinical variables (e.g., patient age, symptom onset) to improve COVID-19 diagnostics in real-world public health contexts.
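A preliminary assessment of this kind can be reproduced with a sketch such as the one below, which ranks predictors by a Random Forest’s impurity-based importances; the column names and the rank_features helper are hypothetical stand-ins for the study’s variables, not its actual code.

```python
# Sketch: ranking predictors with impurity-based importances from a Random Forest
# fitted on the complete (Level 0) data. X_full and y_full are placeholders, and
# the example column names are hypothetical stand-ins for the study's variables.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features(X_full: pd.DataFrame, y_full):
    rf = RandomForestClassifier(n_estimators=300, random_state=42)
    rf.fit(X_full, y_full)
    return (pd.Series(rf.feature_importances_, index=X_full.columns)
              .sort_values(ascending=False))

# Hypothetical usage:
# rank_features(df[["sampling_method", "epidemiological_week", "age",
#                   "symptom_onset_day", "fever", "sore_throat"]], df["result"])
```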
These results are consistent with findings in high-dimensional or clinical datasets [65,66]. ANN and SVM are powerful machine learning models, but their performance can be significantly affected by hyperparameter settings and data quality, particularly in the presence of outliers or noise introduced by imputation. ANNs require large, well-preprocessed datasets to ensure stable training dynamics, while SVMs are sensitive to feature scaling and parameter tuning.
Regarding RQ2, the choice of imputation technique significantly affected model performance. Although RF-based imputation delivered more consistent and stable results compared to PMM, we also observed that KNN and XGBoost produced competitive outcomes for ensemble algorithms (RF, DT), while occasionally inducing greater specificity–sensitivity trade-offs for margin-based methods (SVM). While RF experienced slight decreases in accuracy and AUC (from 0.826 to 0.760 and 0.902 to 0.857, respectively), it maintained stability in key metrics such as the F1-score and MCC across all imputation levels. In contrast, PMM enhanced specificity for models like SVM but introduced greater variability in sensitivity and F1-score. These findings underscore the robustness of RF-based approaches, which preserve predictive performance even in scenarios with high levels of missing data, while showing that KNN and XGBoost can also yield strong results, particularly for RF.
Regarding practical implications and future applications, the comparative analysis of several models under varying levels of missingness offers valuable guidance for researchers dealing with real-world clinical data. Transferring these findings to other regions with similar data characteristics would primarily involve adjusting thresholds for missingness and reviewing local epidemiological contexts [67]. Nonetheless, the approach outlined here—comparing multiple imputation techniques across various supervised methods—could be replicated to adapt to different disease profiles or healthcare systems.
Additionally, it is essential to acknowledge that each imputation method introduces a degree of uncertainty, which can propagate through subsequent model training and predictions. This “imputation error” may disproportionately affect algorithms sensitive to noise or outliers, as exemplified by the variability observed in ANN and SVM. Although our study did not quantify error propagation in depth, references such as [68] discuss frameworks for measuring imputation-induced variance and its effects on model estimates. Future work incorporating such techniques could provide a more rigorous understanding of how imputation uncertainty impacts critical metrics like the F1-score and AUC, ultimately leading to more robust, interpretable predictions in clinical contexts.
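As a simple starting point for such work, imputation-induced variance can be gauged by repeating a stochastic imputation with different random seeds and recording the spread of a downstream metric, as in the sketch below; this is a multiple-imputation-style check under our own assumptions, not the specific framework discussed in [68].

```python
# Sketch: estimating imputation-induced variability by repeating a stochastic
# imputation with different seeds and measuring the spread of the test F1-score.
# This is an illustrative check, not the cited framework for error propagation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def imputation_variance(X_train, y_train, X_test, y_test, n_repeats=10):
    scores = []
    for seed in range(n_repeats):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(imputer.fit_transform(X_train), y_train)
        scores.append(f1_score(y_test, clf.predict(imputer.transform(X_test))))
    # The standard deviation reflects how much the metric varies with the imputation.
    return np.mean(scores), np.std(scores)
```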
Beyond confirming the effectiveness of established imputation methods, this study advances the understanding of missing data handling in resource-constrained medical settings through a systematic comparison of four widely used techniques (PMM, RF-based, KNN, and XGBoost) across varying missingness levels. Three key insights emerge: First, RF-based imputation paired with RF classifiers maintains robust performance even at high missingness levels (e.g., an accuracy of 0.760 at Level 4), outperforming ANN and SVM, which exhibit sensitivity to imputation-induced noise. This challenges the assumption that complex models like ANN inherently dominate in clinical predictions, instead highlighting RF’s unique adaptability to incomplete datasets. Second, imputation strategies introduce distinct specificity–sensitivity trade-offs; while PMM enhanced specificity for SVM (0.950 at Level 3), it concurrently reduced sensitivity—a critical concern in high-risk applications like COVID-19 diagnosis, where false negatives carry severe consequences. This underscores the necessity of context-aware imputation selection aligned with clinical priorities. Third, XGBoost-based imputation, though less explored in medical contexts, performed comparably to RF-based methods, expanding the toolkit of viable alternatives for practitioners.
These findings invite targeted future research directions. Hybrid frameworks integrating imputation strategies with model architectures optimized for specific missing data patterns (e.g., combining RF with MNAR-adjusted weighting) could address current limitations. Additionally, quantifying the long-term impact of imputation uncertainty on clinical decision-making—such as through MNAR-aware sensitivity analyses—would further refine predictive robustness. By validating the practicality of established techniques under real-world constraints, this work provides clinicians and researchers with a scalable, interpretable framework for medical datasets, prioritizing methods like RF and XGBoost over resource-intensive alternatives in low-infrastructure settings.
Among the limitations of this study is the exclusive use of data from the Concepción Department in Paraguay, which limits the generalizability of the findings to other regions with different epidemiological characteristics. Additionally, while the selected imputation methods were effective, the conclusions cannot be generalized to other methods not considered in this work. As a future direction, expanding the geographical and epidemiological scope is recommended to validate the findings across different contexts. Moreover, incorporating hybrid imputation methods and exploring model stability under MNAR (missing not at random) conditions could further enhance applicability to more complex scenarios. It should also be noted that details regarding computational resource usage, processing times, and model scalability were not provided, which is a limitation for practical implementation.
Furthermore, adopting more advanced imputation strategies—such as deep learning approaches (GANs or Variational Autoencoders)—goes beyond the present scope due to their higher computational and design complexity, even though they may prove valuable, especially in MNAR scenarios. Broadening this study to other regions or countries would likewise require ensuring comparable data quality and standardized recording protocols, a step that involves substantial coordination and data collection efforts. In addition, carrying out more exhaustive outlier analysis and employing thorough validation methods (e.g., repeated k-fold cross-validation) could further refine the results but entail more extensive methodological requirements. Finally, a deep exploration of MNAR-specific models (e.g., selection models or sensitivity analyses) would require additional datasets and assumptions, making it a natural extension of the work presented here.
Although supervised learning predominates in clinical diagnostics due to the availability of well-defined labels, unsupervised and semi-supervised methods offer potential advantages for handling missing data by leveraging partially labeled or unlabeled information. These approaches can sometimes uncover latent structures or subgroups not apparent in strictly supervised frameworks. Nevertheless, they also entail more complex modeling assumptions and may be difficult to align with standard clinical thresholds or outcome measures. Given our dataset’s clear positive/negative labels, we focused on supervised techniques for this study. Looking ahead, future work might explore semi-supervised or unsupervised paradigms, particularly when labels are sparse or when broader exploratory insights are desired in resource-constrained healthcare settings.