1. Introduction
Missing data remains a pervasive challenge in healthcare analytics, where complete and reliable datasets are essential for accurate disease prediction, diagnostic modeling, and clinical decision support. Healthcare datasets frequently exhibit missingness due to sensor malfunction, inconsistent patient follow-ups, data entry errors, and heterogeneous data collection protocols. If not properly addressed, missing values can introduce bias, reduce statistical power, and significantly degrade the performance and stability of machine learning (ML) models [1].
Conventional statistical imputation techniques, such as mean or median substitution and regression-based methods, are computationally efficient but rely on strong assumptions regarding data distribution and linear relationships. These assumptions rarely hold in real-world healthcare data, particularly under moderate to high missingness rates, resulting in biased estimates and suboptimal predictive performance [1,2]. As a result, statistical imputers often fail to preserve the underlying structure of complex medical datasets.
Machine learning-based imputation methods, including Random Forest (RF) [3], MissForest [4], and gradient boosting approaches such as XGBoost [5], have gained increasing attention due to their ability to capture nonlinear relationships and complex feature interactions. Empirical studies demonstrate that these models generally outperform traditional statistical methods in numerical data imputation tasks [6,7,8]. However, standalone ML imputers may still experience performance degradation as missingness increases or when applied across heterogeneous datasets with differing statistical properties [9,10].
To address these limitations, ensemble and hybrid imputation strategies have emerged as a promising direction. By combining complementary learners, ensemble methods can improve robustness, reduce variance, and enhance generalization [11]. Hybrid models further extend this idea by integrating optimization techniques to tune the large hyperparameter spaces inherent in ensemble architectures [9,12]. Particle Swarm Optimization (PSO), in particular, has been shown to be effective in optimizing complex ML models by balancing exploration and exploitation during parameter search [13].
Despite growing interest, limited research has systematically explored PSO-optimized stacking ensembles for numerical data imputation across multiple healthcare datasets, while simultaneously evaluating both accuracy and computational efficiency. This study addresses this gap by investigating a PSO-optimized stacking ensemble framework for imputing missing numerical values, using the Breast Cancer and Heart Disease datasets as representative healthcare case studies under varying Missing Completely at Random (MCAR) conditions.
2. Related Works
Imputation techniques have evolved from classical statistical approaches to advanced machine learning, deep learning, and hybrid frameworks. This section reviews the relevant literature across these categories, highlighting their strengths and limitations.
2.1. Traditional and Statistical Imputation Methods
Simple imputation methods such as mean and median substitution remain widely used due to their simplicity but are known to distort data distributions and underestimate variability, particularly when missingness is not trivial [1]. Multiple Imputation by Chained Equations (MICE) improves upon single imputation by iteratively modeling missing values while accounting for uncertainty, but it becomes computationally expensive and less effective in the presence of nonlinear relationships or high-dimensional data [2].
Linear Regression-based imputation performs reasonably well in datasets with strong linear correlations among variables; however, its effectiveness diminishes when applied to complex healthcare datasets exhibiting nonlinear dependencies [7]. Overall, statistical methods tend to yield biased results as the percentage of missing data increases, motivating the exploration of more flexible data-driven approaches.
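The variance distortion introduced by mean substitution is easy to demonstrate. The following sketch (our own illustration, not code from the study) mean-imputes a synthetic feature under 30% MCAR and shows the standard deviation contracting while the mean stays roughly intact:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)  # complete synthetic feature

# Simulate 30% MCAR missingness, then mean-impute
mask = rng.random(x.size) < 0.30
x_missing = x.copy()
x_missing[mask] = np.nan
x_imputed = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)

# Mean imputation leaves the mean roughly unchanged but shrinks the variance,
# since every imputed value sits exactly at the centre of the distribution.
print(np.std(x), np.std(x_imputed))
```

With 30% of the cells replaced by the mean, the standard deviation contracts by roughly a factor of sqrt(0.7), illustrating why downstream models see an artificially narrowed distribution.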
2.2. Machine Learning-Based Imputation
Machine learning models have demonstrated superior performance over traditional statistical methods by learning nonlinear patterns directly from data. Random Forest has been widely adopted for imputation due to its robustness, resistance to overfitting, and ability to handle mixed data types [3]. MissForest extends RF to missing value imputation and shows consistent improvements over conventional approaches for numerical data [4]. Gradient boosting methods, particularly XGBoost [5], have gained popularity for their scalability, regularization mechanisms, and strong predictive performance. An ensemble ML method outperforms simpler imputers in sarcopenia prediction tasks [6], while ML-based imputers achieve lower RMSE and MAE across multiple cohort datasets [7]. A comprehensive comparison of imputation techniques in healthcare diagnostics reported that RF-based approaches consistently yielded lower errors across missingness levels ranging from 10% to 25% [8].
However, ML imputers require careful hyperparameter tuning and may suffer from instability when applied across datasets with differing feature distributions or higher missingness rates [9,10].
2.3. Hybrid and Optimization-Based Approaches
Hybrid imputation approaches combine multiple models or paradigms to exploit complementary strengths. There is a growing trend toward hybrid ML and DL imputation strategies, emphasizing the need to evaluate both accuracy and computational cost [9]. Stacked generalization provides a principled framework for combining heterogeneous base learners through a meta-learner, enabling improved predictive performance by learning how to optimally aggregate individual model outputs [11].
Optimization algorithms play a crucial role in enhancing hybrid models by efficiently searching large hyperparameter spaces and tuning complex model configurations [13]. Hybrid imputation strategies have been shown to outperform standalone methods on complex datasets; however, some studies do not include an analysis of computational efficiency [14]. Similarly, cyclical hybrid imputation techniques have demonstrated high accuracy but have been associated with increased computational complexity, with limited reporting of detailed runtime performance [15].
Recent studies highlight the lack of comprehensive evaluations of stacking-based imputation frameworks across multiple healthcare datasets and varying missingness levels [8,9,16]. Furthermore, processing time is frequently overlooked, despite being a critical consideration for the practical deployment of hybrid imputation systems.
In summary, the literature reveals several gaps:
- Limited studies evaluate stacking-based imputation specifically across multiple distinct healthcare datasets (e.g., breast cancer and heart disease).
- Few works integrate PSO with stacking to improve imputation accuracy and robustness comprehensively.
- Comparative studies under multiple, relevant missingness levels (10–30% MCAR) are scarce.
- Prior studies report high imputation error (RMSE), showing significant room for improvement through optimized hybrid models.
- Thorough model evaluation using comprehensive metrics, including processing time, is often lacking, despite processing time being a critical factor for hybrid methods.
3. Methodology
3.1. Dataset Description
Two distinct healthcare datasets were used for comprehensive evaluation:
Breast Cancer Wisconsin Dataset: Sourced from the UCI Machine Learning Repository. The dataset initially has 32 attributes; after removing the non-numeric classification target (diagnosis) and the identifier (id), 30 numerical attributes remain, and the dataset contains no missing values.
The dataset consists of the following attributes and their corresponding data types: id(int64), diagnosis(object), radius_mean(float64), texture_mean(float64), perimeter_mean(float64), area_mean(float64), smoothness_mean(float64), compactness_mean(float64), concavity_mean(float64), concave_points_mean(float64), symmetry_mean(float64), fractal_dimension_mean(float64), radius_se(float64), texture_se(float64), perimeter_se(float64), area_se(float64), smoothness_se(float64), compactness_se(float64), concavity_se(float64), concave_points_se(float64), symmetry_se(float64), fractal_dimension_se(float64), radius_worst(float64), texture_worst(float64), perimeter_worst(float64), area_worst(float64), smoothness_worst(float64), compactness_worst(float64), concavity_worst(float64), concave_points_worst(float64), symmetry_worst(float64), and fractal_dimension_worst(float64).
Heart Disease Dataset: Collected from Kaggle, consisting of 1025 instances and 14 numerical features. This dataset initially contains no missing values, making it ideal for controlled missingness simulation.
The dataset comprises the following attributes and their respective data types: age(int64), sex(int64), cp(int64), trestbps(int64), chol(int64), fbs(int64), restecg(int64), thalach(int64), exang(int64), oldpeak(float64), slope(int64), ca(int64), thal(int64), and target(int64).
3.2. Data Preprocessing
Standard preprocessing steps were applied to both datasets: removal of non-numerical features, normalization of numerical features, and no feature engineering. Missing values were then simulated under MCAR at 30%, 20%, and 10% rates across all numerical features.
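The normalization and MCAR-masking steps can be sketched as follows; the helper name `simulate_mcar` and the synthetic frame are our own illustrative choices, not the study's code:

```python
import numpy as np
import pandas as pd

def simulate_mcar(df: pd.DataFrame, rate: float, seed: int = 0) -> pd.DataFrame:
    """Mask each numeric cell independently with probability `rate` (MCAR)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    mask = rng.random(out[num_cols].shape) < rate
    out[num_cols] = out[num_cols].mask(mask)  # masked cells become NaN
    return out

# Min-max normalize, then inject 10/20/30% missingness into every numeric feature
df = pd.DataFrame(np.random.default_rng(1).normal(size=(100, 4)),
                  columns=["f1", "f2", "f3", "f4"])
df_norm = (df - df.min()) / (df.max() - df.min())
versions = {rate: simulate_mcar(df_norm, rate) for rate in (0.10, 0.20, 0.30)}
```

Because MCAR masks cells independently of both observed and unobserved values, the empirical missingness fraction in each version converges to the nominal rate as the dataset grows.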
The imputation framework consists of Random Forest and XGBoost as base learners and a Linear Regression model as a Stacking Meta-Learner; Particle Swarm Optimization (PSO) is used to tune hyperparameters and missing values are predicted using the optimized stacking model.
Both datasets were partitioned into training, validation, and testing sets using a 70:15:15 ratio, and the following performance metrics were evaluated: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Coefficient of Determination (R2), and Processing Time.
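The three accuracy metrics can be computed with scikit-learn as below; the arrays are illustrative stand-ins for the hidden ground-truth cells and the imputed values, not results from the study:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Evaluation compares imputed values against the ground truth that was
# hidden before imputation (values here are illustrative only).
y_true = np.array([0.42, 0.10, 0.77, 0.55, 0.31])
y_pred = np.array([0.40, 0.15, 0.70, 0.58, 0.29])

rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
mae = float(mean_absolute_error(y_true, y_pred))
r2 = float(r2_score(y_true, y_pred))
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.2%}")
```

RMSE penalizes large deviations more heavily than MAE, so reporting both (as the study does) separates typical error from occasional large misses; R2 expresses the fraction of variance in the hidden values recovered by the imputer.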
3.3. Model Formulation
The proposed optimized hybrid model integrates ensemble learning with evolutionary optimization to provide a robust and adaptive framework for numerical missing value imputation. Specifically, it combines the variance reduction strength of ensemble methods with the global search capability of evolutionary algorithms to enhance imputation accuracy and stability. Particle Swarm Optimization (PSO) is employed to identify the most suitable base models and fine-tune their hyperparameters, ensuring that each learner contributes optimally to the hybrid structure.
As illustrated in Figure 1, the proposed framework operates in four main stages:
- 1. Candidate Model Training: A set of regression models (Random Forest, Linear Regression, XGBoost, KNN, Ridge Regression, and Support Vector Regression (SVR)) is initially trained on incomplete datasets where missing values are masked.
- 2. PSO-Based Model Selection and Hyperparameter Optimization: Particle Swarm Optimization is applied to explore the hyperparameter space of each candidate model and evaluate performance on the validation set. Based on fitness values (e.g., RMSE minimization), PSO identifies the top-performing models, which are selected as base learners for the stacking ensemble.
- 3. Stacking Ensemble Construction: The selected optimized base learners are combined using a stacking strategy, where their individual predictions serve as input features to a Linear Regression meta-learner. This allows the meta-learner to exploit complementary strengths of the base models while reducing individual model bias and variance.
- 4. Final Imputation Generation: The trained stacking ensemble predicts missing values in the dataset, producing the final imputed dataset used for performance evaluation.
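The four stages above can be sketched end-to-end. This is a simplified illustration, not the study's implementation: it tunes only two Random Forest hyperparameters with a hand-rolled PSO, pairs the tuned forest with Linear Regression as the second base learner (the study also uses XGBoost), and uses synthetic regression data in place of masked dataset cells:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for one feature's "predict the masked cells" task
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(n_estimators: int, max_depth: int) -> float:
    """Validation RMSE of a Random Forest with the given hyperparameters."""
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth, random_state=0)
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

# Stage 2: minimal PSO over (n_estimators, max_depth); bounds, swarm size,
# and inertia/acceleration coefficients are illustrative choices.
rng = np.random.default_rng(0)
lo, hi = np.array([10.0, 2.0]), np.array([60.0, 10.0])
pos = rng.uniform(lo, hi, size=(5, 2))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_f = np.array([fitness(int(p[0]), int(p[1])) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(4):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    f = np.array([fitness(int(p[0]), int(p[1])) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

# Stages 3-4: plug the PSO-tuned forest into a stacking ensemble with a
# Linear Regression meta-learner, then predict the held-out (masked) values.
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=int(gbest[0]),
                                             max_depth=int(gbest[1]),
                                             random_state=0)),
                ("lr", LinearRegression())],
    final_estimator=LinearRegression(),
)
stack.fit(X_tr, y_tr)
imputed = stack.predict(X_val)  # stands in for the masked-cell predictions
```

`StackingRegressor` trains the meta-learner on out-of-fold predictions of the base models, which is what lets it learn how to weight the learners without overfitting to their training-set outputs.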
Figure 1. Model framework.
For example, consider the breast cancer dataset with 20% MCAR missingness applied to all numerical attributes. The incomplete dataset is first used to train multiple candidate regressors. PSO evaluates these models and determines that Random Forest and XGBoost, with optimized hyperparameters, yield the lowest RMSE. These two models are then selected as base learners in the stacking ensemble. Their predictions are combined and passed to a Linear Regression meta-learner, which generates the final imputed values. This layered learning process improves stability and accuracy compared to using any single imputation model. The model formulation is illustrated in Figure 2.
As depicted in Figure 2, this architecture enhances generalization performance, particularly when handling heterogeneous numerical datasets and varying missingness levels. By jointly optimizing model selection and hyperparameters, the proposed framework ensures that each learner contributes optimally to the imputation task.
4. Results and Discussion
The imputation results for the breast cancer dataset are presented in Table 1, Table 2 and Table 3 and Figure 3 and Figure 4. The hybrid model consistently achieved the lowest error and highest explained variance across all missingness levels, confirming its robustness in this domain.
The results for the heart disease dataset under varying missingness levels are presented in Table 4, Table 5 and Table 6 and Figure 5 and Figure 6. The hybrid model again demonstrated consistent superiority. The overall lower R2 values compared to the breast cancer dataset (e.g., 75.19% vs. 86.56% at 10% MCAR) suggest that the feature relationships in the heart disease dataset might be more complex or less linearly separable, making imputation inherently more challenging.
Table 7 and Table 8 show that the proposed PSO-optimized stacking model significantly outperforms the existing method by Luke et al. [8], which used the MissForest model. At 20% MCAR, it achieves much lower RMSE (0.0628) and MAE (0.0434) compared to 1.56 and 0.21, respectively. At 10% MCAR, the reduction is even clearer, with RMSE = 0.0446 and MAE = 0.0303. This demonstrates that the ensemble effectively combines multiple learners to enhance imputation accuracy. Overall, the framework is robust and reliable for handling missing numerical data in healthcare datasets.
5. Discussion
The experimental results across both the breast cancer and heart disease datasets clearly demonstrate the effectiveness of the proposed PSO-optimized stacking ensemble model in handling numerical missing data. For all levels of missingness (30%, 20%, and 10% MCAR), the hybrid model consistently achieved the lowest RMSE and MAE, along with the highest R2 values, compared to individual base learners such as Random Forest and XGBoost. For instance, on the breast cancer dataset at 10% MCAR, the proposed model achieved an RMSE of 0.0446 and MAE of 0.0303, substantially lower than the corresponding values for Random Forest (0.0573 and 0.0387) and XGBoost (0.0717 and 0.0494). Similarly, R2 reached 86.56%, indicating a strong explained variance, whereas individual models achieved lower explanatory power.
Compared with the existing approach by Luke et al. [8], which employed the MissForest method, the proposed PSO-stacking model shows dramatic improvements in error reduction. At 20% MCAR, RMSE decreased from 1.56 (MissForest) to 0.0628, and MAE from 0.21 to 0.0434, highlighting the superior predictive capability of the ensemble. This can be attributed to the stacking strategy that combines multiple base learners, allowing the model to exploit complementary strengths and reduce individual model biases, as well as the use of Particle Swarm Optimization (PSO) to fine-tune hyperparameters and select the most effective learners.
One notable advantage of the proposed methodology is its robustness across datasets with varying characteristics. While the breast cancer dataset shows high R2 values for the stacking model (up to 86.56%), the heart disease dataset exhibits lower R2 values (75.19% at 10% MCAR), reflecting its more complex or less linear feature relationships. Despite this, the hybrid model consistently outperforms single learners, confirming its generalizability across healthcare domains. Additionally, the ensemble approach demonstrates resilience to different levels of missingness, effectively handling both moderate (20–30%) and low (10%) missing data without substantial degradation in performance.
However, the methodology has some limitations. The computational cost of the PSO-optimized stacking model is significantly higher than individual learners. For example, the processing time increases from less than 1 s for Random Forest to 4.22 s at 20% MCAR and 15.70 s at 10% MCAR on the breast cancer dataset. While this cost is acceptable for offline imputation in research or clinical datasets, it may become a bottleneck for very large-scale or real-time datasets. Moreover, the approach relies on properly tuning PSO parameters; poor initialization could lead to suboptimal selection of base learners, slightly reducing imputation accuracy.
In contrast, simpler models like Random Forest or XGBoost are computationally efficient and easier to implement, but they cannot fully exploit the complementary information from multiple learners and therefore have higher prediction errors. MissForest, while widely used, shows particularly poor performance under higher missingness rates, as it is less adaptive and lacks global optimization mechanisms like PSO.
In summary, the proposed PSO-optimized stacking ensemble provides a highly accurate and generalizable framework for numerical data imputation in healthcare datasets. Its main strengths include superior predictive performance, adaptability across datasets, and robustness under different missingness scenarios. The primary trade-off is increased computational time, which may need consideration in large-scale deployments. Overall, the methodology represents a significant improvement over traditional single-learner approaches and existing imputation methods such as MissForest.
6. Conclusions and Future Work
This study presented a PSO-optimized stacking ensemble model for numerical missing data imputation in healthcare datasets. The proposed hybrid approach combines multiple base learners (Random Forest, XGBoost, and Linear Regression) with Particle Swarm Optimization to select and fine-tune the most effective models. Experimental results on the breast cancer and heart disease datasets demonstrated that the proposed framework consistently outperforms individual learners and existing methods such as MissForest, achieving significantly lower RMSE and MAE and higher explained variance (R2) across different missingness levels. The findings confirm that integrating ensemble learning with evolutionary optimization enhances imputation accuracy, robustness, and generalizability across datasets with varying complexity.
Despite its advantages, the proposed methodology incurs higher computational cost due to the PSO process, which may limit its application in extremely large-scale or real-time scenarios. Furthermore, its performance depends on proper parameter tuning and model selection within the stacking framework.
For future work, several directions can be explored to further improve and extend the framework. First, incorporating deep learning models as additional base learners could capture complex non-linear feature relationships more effectively. Second, investigating adaptive or dynamic PSO strategies may reduce computation time while maintaining high imputation accuracy. Third, the framework can be extended to handle mixed-type datasets, including categorical variables, to broaden its applicability in real-world healthcare data. Finally, applying the model to longitudinal or time-series healthcare datasets could evaluate its robustness in dynamic, temporally correlated data, paving the way for more comprehensive clinical decision support systems.
Author Contributions
Conceptualization, B.I.M. and A.M.; methodology, A.M., S.A., and B.I.M.; software, B.I.M.; validation, S.A., A.M. and B.I.M.; formal analysis, A.M.; investigation, S.A., B.I.M. and A.M.; resources, A.I.; data curation, A.A.; writing—original draft preparation, B.I.M.; writing—review and editing, S.A. and A.M.; visualization, B.I.M.; supervision, S.A.; project administration, A.A. and A.I.; funding acquisition, B.I.M. and A.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
References
- Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; John Wiley & Sons Inc.: Hoboken, NJ, USA, 2019. [Google Scholar]
- Kavelaars, X.M.; van Ginkel, J.R.; van Buuren, S. Multiple imputation in data that grow over time: A comparison of three strategies. Multivar. Behav. Res. 2021, 57, 513–523. [Google Scholar] [CrossRef] [PubMed]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’16, San Francisco, CA, USA, 13–17 August 2016; Volume 1, pp. 785–794. [Google Scholar]
- Karimov, S.; Turimov, D.; Kim, W.; Kim, J. Comparative study of imputation strategies to improve the sarcopenia prediction task. Digit. Health 2025, 11, 20552076241301960. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Guo, S.; Ma, R.; He, J.; Zhang, X.; Rui, D.; Ding, Y.; Li, Y.; Jian, L.; Cheng, J.; et al. Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets. BMC Med. Res. Methodol. 2024, 24, 41. [Google Scholar] [CrossRef] [PubMed]
- Joel, L.O.; Doorsamy, W.; Paul, B.S. A comparative study of imputation techniques for missing values in healthcare diagnostic datasets. Int. J. Data Sci. Anal. 2025, 20, 6357–6373. [Google Scholar] [CrossRef]
- Alabadla, M.; Sidi, F.; Ishak, I.; Ibrahim, H.; Affendey, L.S.; Ani, Z.C.; Jabar, M.A.; Bukar, U.A.; Devaraj, N.K.; Muda, A.S.; et al. Systematic Review of Using Machine Learning in Imputing Missing Values. IEEE Access 2022, 10, 44483–44502. [Google Scholar] [CrossRef]
- Shadbahr, T.; Roberts, M.; Stanczuk, J.; Gilbey, J.; Teare, P.; Dittmer, S.; Thorpe, M.; Torné, R.V.; Sala, E.; Lió, P.; et al. The impact of imputation quality on machine learning classifiers for datasets with missing values. Commun. Med. 2023, 3, 139. [Google Scholar] [CrossRef] [PubMed]
- Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
- He, Y.; Zhang, G.; Hsu, C.-H. Multiple Imputation of Missing Data in Practice; Chapman and Hall/CRC: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Ahmad, A.F.; Alshammari, K.; Ahmed, I.; Sayed, M.S. Machine Learning for Missing Value Imputation. arXiv 2024, arXiv:2410.08308. [Google Scholar] [CrossRef]
- Deepa, A.V.; Beena, T.L.A. Enhancing data imputation in complex datasets using Lagrange polynomial interpolation and hot-deck fusion. Sci. Temper 2025, 16, 3727–3735. [Google Scholar] [CrossRef]
- Kotan, K.; Kırışoğlu, S. Cyclical hybrid imputation technique for missing values in data sets. Sci. Rep. 2025, 15, 6543. [Google Scholar] [CrossRef] [PubMed]
- Zang, L.; Xiong, F. Harnessing Machine Learning to Address High Levels of Missing Data in Cross-National Studies: From Bias to Precision in Public Service Research. J. Comp. Policy Anal. Res. Pract. 2025, 27, 339–359. [Google Scholar] [CrossRef]