1. Introduction
Missing data remains a pervasive challenge in healthcare analytics, where complete and reliable datasets are essential for accurate disease prediction, diagnostic modeling, and clinical decision support. Healthcare datasets frequently exhibit missingness due to sensor malfunction, inconsistent patient follow-ups, data entry errors, and heterogeneous data collection protocols. If not properly addressed, missing values can introduce bias, reduce statistical power, and significantly degrade the performance and stability of machine learning (ML) models [1].
Conventional statistical imputation techniques, such as mean or median substitution and regression-based methods, are computationally efficient but rely on strong assumptions regarding data distribution and linear relationships. These assumptions rarely hold in real-world healthcare data, particularly under moderate to high missingness rates, resulting in biased estimates and suboptimal predictive performance [1,2]. As a result, statistical imputers often fail to preserve the underlying structure of complex medical datasets.
Machine learning-based imputation methods, including Random Forest (RF) [3], MissForest [4], and gradient boosting approaches such as XGBoost [5], have gained increasing attention due to their ability to capture nonlinear relationships and complex feature interactions. Empirical studies demonstrate that these models generally outperform traditional statistical methods in numerical data imputation tasks [6,7,8]. However, standalone ML imputers may still experience performance degradation as missingness increases or when applied across heterogeneous datasets with differing statistical properties [9,10].
To address these limitations, ensemble and hybrid imputation strategies have emerged as a promising direction. By combining complementary learners, ensemble methods can improve robustness, reduce variance, and enhance generalization [11]. Hybrid models further extend this idea by integrating optimization techniques to tune the large hyperparameter spaces inherent in ensemble architectures [9,12]. Particle Swarm Optimization (PSO), in particular, has been shown to be effective in optimizing complex ML models by balancing exploration and exploitation during parameter search [13].
Despite growing interest, limited research has systematically explored PSO-optimized stacking ensembles for numerical data imputation across multiple healthcare datasets, while simultaneously evaluating both accuracy and computational efficiency. This study addresses this gap by investigating a PSO-optimized stacking ensemble framework for imputing missing numerical values, using the Breast Cancer and Heart Disease datasets as representative healthcare case studies under varying Missing Completely at Random (MCAR) conditions.
2. Related Works
Imputation techniques have evolved from classical statistical approaches to advanced machine learning, deep learning, and hybrid frameworks. This section reviews the relevant literature across these categories, highlighting their strengths and limitations.
2.1. Traditional and Statistical Imputation Methods
Simple imputation methods such as mean and median substitution remain widely used due to their simplicity but are known to distort data distributions and underestimate variability, particularly when missingness is not trivial [1]. Multiple Imputation by Chained Equations (MICE) improves upon single imputation by iteratively modeling missing values while accounting for uncertainty, but it becomes computationally expensive and less effective in the presence of nonlinear relationships or high-dimensional data [2].
Linear Regression-based imputation performs reasonably well in datasets with strong linear correlations among variables; however, its effectiveness diminishes when applied to complex healthcare datasets exhibiting nonlinear dependencies [7]. Overall, statistical methods tend to yield biased results as the percentage of missing data increases, motivating the exploration of more flexible data-driven approaches.
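The variance distortion introduced by mean substitution is easy to demonstrate. The following sketch (our own illustration, not code from the study) mean-imputes a synthetic feature under 30% MCAR and shows the standard deviation contracting while the mean stays roughly intact:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=1000)  # complete synthetic feature

# Simulate 30% MCAR missingness, then mean-impute
mask = rng.random(x.size) < 0.30
x_missing = x.copy()
x_missing[mask] = np.nan
x_imputed = np.where(np.isnan(x_missing), np.nanmean(x_missing), x_missing)

# Mean imputation leaves the mean roughly unchanged but shrinks the variance,
# since every imputed value sits exactly at the centre of the distribution.
print(np.std(x), np.std(x_imputed))
```

With 30% of the cells replaced by the mean, the standard deviation contracts by roughly a factor of sqrt(0.7), illustrating why downstream models see an artificially narrowed distribution.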
2.2. Machine Learning-Based Imputation
Machine learning models have demonstrated superior performance over traditional statistical methods by learning nonlinear patterns directly from data. Random Forest has been widely adopted for imputation due to its robustness, resistance to overfitting, and ability to handle mixed data types [3]. MissForest extends RF to missing value imputation and shows consistent improvements over conventional approaches for numerical data [4]. Gradient boosting methods, particularly XGBoost [5], have gained popularity for their scalability, regularization mechanisms, and strong predictive performance. An ensemble ML method outperforms simpler imputers in sarcopenia prediction tasks [6], while ML-based imputers achieve lower RMSE and MAE across multiple cohort datasets [7]. A comprehensive comparison of imputation techniques in healthcare diagnostics reported that RF-based approaches consistently yielded lower errors across missingness levels ranging from 10% to 25% [8].
However, ML imputers require careful hyperparameter tuning and may suffer from instability when applied across datasets with differing feature distributions or higher missingness rates [9,10].
2.3. Hybrid and Optimization-Based Approaches
Hybrid imputation approaches combine multiple models or paradigms to exploit complementary strengths. There is a growing trend toward hybrid ML and DL imputation strategies, emphasizing the need to evaluate both accuracy and computational cost [9]. Stacked generalization provides a principled framework for combining heterogeneous base learners through a meta-learner, enabling improved predictive performance by learning how to optimally aggregate individual model outputs [11].
Optimization algorithms play a crucial role in enhancing hybrid models by efficiently searching large hyperparameter spaces and tuning complex model configurations [13]. Hybrid imputation strategies have been shown to outperform standalone methods on complex datasets; however, some studies do not include an analysis of computational efficiency [14]. Similarly, cyclical hybrid imputation techniques have demonstrated high accuracy but have been associated with increased computational complexity, with limited reporting of detailed runtime performance [15].
Recent studies highlight the lack of comprehensive evaluations of stacking-based imputation frameworks across multiple healthcare datasets and varying missingness levels [8,9,16]. Furthermore, processing time is frequently overlooked, despite being a critical consideration for the practical deployment of hybrid imputation systems.
In summary, the literature reveals several gaps:
- Limited studies evaluate stacking-based imputation specifically across multiple distinct healthcare datasets (e.g., breast cancer and heart disease).
- Few works integrate PSO with stacking to improve imputation accuracy and robustness comprehensively.
- Comparative studies under multiple, relevant missingness levels (10–30% MCAR) are scarce.
- Prior studies report high imputation error (RMSE), showing significant room for improvement through optimized hybrid models.
- Thorough model evaluation using comprehensive metrics, including processing time, is often lacking, despite processing time being a critical factor for hybrid methods.
3. Methodology
3.1. Dataset Description
Two distinct healthcare datasets were used for comprehensive evaluation:
Breast Cancer Wisconsin Dataset: Sourced from the UCI Machine Learning Repository. The dataset initially has 32 attributes; after removing the non-numeric classification target (diagnosis) and the identifier (id), 30 numerical attributes remain, and the dataset contains no missing values.
The dataset consists of the following attributes and their corresponding data types: id(int64), diagnosis(object), radius_mean(float64), texture_mean(float64), perimeter_mean(float64), area_mean(float64), smoothness_mean(float64), compactness_mean(float64), concavity_mean(float64), concave_points_mean(float64), symmetry_mean(float64), fractal_dimension_mean(float64), radius_se(float64), texture_se(float64), perimeter_se(float64), area_se(float64), smoothness_se(float64), compactness_se(float64), concavity_se(float64), concave_points_se(float64), symmetry_se(float64), fractal_dimension_se(float64), radius_worst(float64), texture_worst(float64), perimeter_worst(float64), area_worst(float64), smoothness_worst(float64), compactness_worst(float64), concavity_worst(float64), concave_points_worst(float64), symmetry_worst(float64), and fractal_dimension_worst(float64).
Heart Disease Dataset: Collected from Kaggle, consisting of 1025 instances and 14 numerical features. This dataset initially contains no missing values, making it ideal for controlled missingness simulation.
The dataset comprises the following attributes and their respective data types: age(int64), sex(int64), cp(int64), trestbps(int64), chol(int64), fbs(int64), restecg(int64), thalach(int64), exang(int64), oldpeak(float64), slope(int64), ca(int64), thal(int64), and target(int64).
3.2. Data Preprocessing
Standard preprocessing steps were applied to both datasets: removal of non-numerical features, normalization of numerical features, and no feature engineering. Missing values were then simulated under MCAR at 30%, 20%, and 10% rates across all numerical features.
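The normalization and MCAR-masking steps can be sketched as follows; the helper name `simulate_mcar` and the synthetic frame are our own illustrative choices, not the study's code:

```python
import numpy as np
import pandas as pd

def simulate_mcar(df: pd.DataFrame, rate: float, seed: int = 0) -> pd.DataFrame:
    """Mask each numeric cell independently with probability `rate` (MCAR)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    mask = rng.random(out[num_cols].shape) < rate
    out[num_cols] = out[num_cols].mask(mask)  # masked cells become NaN
    return out

# Min-max normalize, then inject 10/20/30% missingness into every numeric feature
df = pd.DataFrame(np.random.default_rng(1).normal(size=(100, 4)),
                  columns=["f1", "f2", "f3", "f4"])
df_norm = (df - df.min()) / (df.max() - df.min())
versions = {rate: simulate_mcar(df_norm, rate) for rate in (0.10, 0.20, 0.30)}
```

Because MCAR masks cells independently of both observed and unobserved values, the empirical missingness fraction in each version converges to the nominal rate as the dataset grows.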
The imputation framework consists of Random Forest and XGBoost as base learners and a Linear Regression model as a Stacking Meta-Learner; Particle Swarm Optimization (PSO) is used to tune hyperparameters and missing values are predicted using the optimized stacking model.
Both datasets were partitioned into training, validation, and testing sets using a 70:15:15 ratio, and the following performance metrics were evaluated: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Coefficient of Determination (R2), and Processing Time.
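The three accuracy metrics can be computed with scikit-learn as below; the arrays are illustrative stand-ins for the hidden ground-truth cells and the imputed values, not results from the study:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Evaluation compares imputed values against the ground truth that was
# hidden before imputation (values here are illustrative only).
y_true = np.array([0.42, 0.10, 0.77, 0.55, 0.31])
y_pred = np.array([0.40, 0.15, 0.70, 0.58, 0.29])

rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
mae = float(mean_absolute_error(y_true, y_pred))
r2 = float(r2_score(y_true, y_pred))
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.2%}")
```

RMSE penalizes large deviations more heavily than MAE, so reporting both (as the study does) separates typical error from occasional large misses; R2 expresses the fraction of variance in the hidden values recovered by the imputer.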
3.3. Model Formulation
The proposed optimized hybrid model integrates ensemble learning with evolutionary optimization to provide a robust and adaptive framework for numerical missing value imputation. Specifically, it combines the variance reduction strength of ensemble methods with the global search capability of evolutionary algorithms to enhance imputation accuracy and stability. Particle Swarm Optimization (PSO) is employed to identify the most suitable base models and fine-tune their hyperparameters, ensuring that each learner contributes optimally to the hybrid structure.
As illustrated in Figure 1, the proposed framework operates in four main stages:
- 1. Candidate Model Training: A set of regression models (Random Forest, Linear Regression, XGBoost, KNN, Ridge Regression, and Support Vector Regression (SVR)) is initially trained on incomplete datasets where missing values are masked.
- 2. PSO-Based Model Selection and Hyperparameter Optimization: Particle Swarm Optimization is applied to explore the hyperparameter space of each candidate model and evaluate performance on the validation set. Based on fitness values (e.g., RMSE minimization), PSO identifies the top-performing models, which are selected as base learners for the stacking ensemble.
- 3. Stacking Ensemble Construction: The selected optimized base learners are combined using a stacking strategy, where their individual predictions serve as input features to a Linear Regression meta-learner. This allows the meta-learner to exploit complementary strengths of the base models while reducing individual model bias and variance.
- 4. Final Imputation Generation: The trained stacking ensemble predicts missing values in the dataset, producing the final imputed dataset used for performance evaluation.
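The four stages above can be sketched end-to-end. This is a simplified illustration, not the study's implementation: it tunes only two Random Forest hyperparameters with a hand-rolled PSO, pairs the tuned forest with Linear Regression as the second base learner (the study also uses XGBoost), and uses synthetic regression data in place of masked dataset cells:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy data standing in for one feature's "predict the masked cells" task
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(n_estimators: int, max_depth: int) -> float:
    """Validation RMSE of a Random Forest with the given hyperparameters."""
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_depth=max_depth, random_state=0)
    model.fit(X_tr, y_tr)
    return float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

# Stage 2: minimal PSO over (n_estimators, max_depth); bounds, swarm size,
# and inertia/acceleration coefficients are illustrative choices.
rng = np.random.default_rng(0)
lo, hi = np.array([10.0, 2.0]), np.array([60.0, 10.0])
pos = rng.uniform(lo, hi, size=(5, 2))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_f = np.array([fitness(int(p[0]), int(p[1])) for p in pos])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(4):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    f = np.array([fitness(int(p[0]), int(p[1])) for p in pos])
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = pos[improved], f[improved]
    gbest = pbest[pbest_f.argmin()].copy()

# Stages 3-4: plug the PSO-tuned forest into a stacking ensemble with a
# Linear Regression meta-learner, then predict the held-out (masked) values.
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=int(gbest[0]),
                                             max_depth=int(gbest[1]),
                                             random_state=0)),
                ("lr", LinearRegression())],
    final_estimator=LinearRegression(),
)
stack.fit(X_tr, y_tr)
imputed = stack.predict(X_val)  # stands in for the masked-cell predictions
```

`StackingRegressor` trains the meta-learner on out-of-fold predictions of the base models, which is what lets it learn how to weight the learners without overfitting to their training-set outputs.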
Figure 1. Model framework.
For example, consider the breast cancer dataset with 20% MCAR missingness applied to all numerical attributes. The incomplete dataset is first used to train multiple candidate regressors. PSO evaluates these models and determines that Random Forest and XGBoost, with optimized hyperparameters, yield the lowest RMSE. These two models are then selected as base learners in the stacking ensemble. Their predictions are combined and passed to a Linear Regression meta-learner, which generates the final imputed values. This layered learning process improves stability and accuracy compared to using any single imputation model. The model formulation is illustrated in Figure 2.
As depicted in Figure 2, this architecture enhances generalization performance, particularly when handling heterogeneous numerical datasets and varying missingness levels. By jointly optimizing model selection and hyperparameters, the proposed framework ensures that each learner contributes optimally to the imputation task.
4. Results and Discussion
The imputation results for the breast cancer dataset are presented in Table 1, Table 2 and Table 3 and Figure 3 and Figure 4. The hybrid model consistently achieved the lowest error and highest explained variance across all missingness levels, confirming its robustness in this domain.
The results for the heart disease dataset under varying missingness levels are presented in Table 4, Table 5 and Table 6 and Figure 5 and Figure 6. The hybrid model again demonstrated consistent superiority. The overall lower R2 values compared to the breast cancer dataset (e.g., 75.19% vs. 86.56% at 10% MCAR) suggest that the feature relationships in the heart disease dataset might be more complex or less linearly separable, making imputation inherently more challenging.
Table 7 and Table 8 show that the proposed PSO-optimized stacking model significantly outperforms the existing method by Luke et al. [8], which used the MissForest model. At 20% MCAR, it achieves much lower RMSE (0.0628) and MAE (0.0434) compared to 1.56 and 0.21, respectively. At 10% MCAR, the reduction is even clearer, with RMSE = 0.0446 and MAE = 0.0303. This demonstrates that the ensemble effectively combines multiple learners to enhance imputation accuracy. Overall, the framework is robust and reliable for handling missing numerical data in healthcare datasets.
5. Discussion
The experimental results across both the breast cancer and heart disease datasets clearly demonstrate the effectiveness of the proposed PSO-optimized stacking ensemble model in handling numerical missing data. For all levels of missingness (30%, 20%, and 10% MCAR), the hybrid model consistently achieved the lowest RMSE and MAE, along with the highest R2 values, compared to individual base learners such as Random Forest and XGBoost. For instance, on the breast cancer dataset at 10% MCAR, the proposed model achieved an RMSE of 0.0446 and MAE of 0.0303, substantially lower than the corresponding values for Random Forest (0.0573 and 0.0387) and XGBoost (0.0717 and 0.0494). Similarly, R2 reached 86.56%, indicating a strong explained variance, whereas individual models achieved lower explanatory power.
Compared with the existing approach by Luke et al. [8], which employed the MissForest method, the proposed PSO-stacking model shows dramatic improvements in error reduction. At 20% MCAR, RMSE decreased from 1.56 (MissForest) to 0.0628, and MAE from 0.21 to 0.0434, highlighting the superior predictive capability of the ensemble. This can be attributed to the stacking strategy that combines multiple base learners, allowing the model to exploit complementary strengths and reduce individual model biases, as well as the use of Particle Swarm Optimization (PSO) to fine-tune hyperparameters and select the most effective learners.
One notable advantage of the proposed methodology is its robustness across datasets with varying characteristics. While the breast cancer dataset shows high R2 values for the stacking model (up to 86.56%), the heart disease dataset exhibits lower R2 values (75.19% at 10% MCAR), reflecting its more complex or less linear feature relationships. Despite this, the hybrid model consistently outperforms single learners, confirming its generalizability across healthcare domains. Additionally, the ensemble approach demonstrates resilience to different levels of missingness, effectively handling both moderate (20–30%) and low (10%) missing data without substantial degradation in performance.
However, the methodology has some limitations. The computational cost of the PSO-optimized stacking model is significantly higher than individual learners. For example, the processing time increases from less than 1 s for Random Forest to 4.22 s at 20% MCAR and 15.70 s at 10% MCAR on the breast cancer dataset. While this cost is acceptable for offline imputation in research or clinical datasets, it may become a bottleneck for very large-scale or real-time datasets. Moreover, the approach relies on properly tuning PSO parameters; poor initialization could lead to suboptimal selection of base learners, slightly reducing imputation accuracy.
In contrast, simpler models like Random Forest or XGBoost are computationally efficient and easier to implement, but they cannot fully exploit the complementary information from multiple learners and therefore have higher prediction errors. MissForest, while widely used, shows particularly poor performance under higher missingness rates, as it is less adaptive and lacks global optimization mechanisms like PSO.
In summary, the proposed PSO-optimized stacking ensemble provides a highly accurate and generalizable framework for numerical data imputation in healthcare datasets. Its main strengths include superior predictive performance, adaptability across datasets, and robustness under different missingness scenarios. The primary trade-off is increased computational time, which may need consideration in large-scale deployments. Overall, the methodology represents a significant improvement over traditional single-learner approaches and existing imputation methods such as MissForest.
6. Conclusions and Future Work
This study presented a PSO-optimized stacking ensemble model for numerical missing data imputation in healthcare datasets. The proposed hybrid approach combines multiple base learners (Random Forest, XGBoost, and Linear Regression) with Particle Swarm Optimization to select and fine-tune the most effective models. Experimental results on the breast cancer and heart disease datasets demonstrated that the proposed framework consistently outperforms individual learners and existing methods such as MissForest, achieving significantly lower RMSE and MAE and higher explained variance (R2) across different missingness levels. The findings confirm that integrating ensemble learning with evolutionary optimization enhances imputation accuracy, robustness, and generalizability across datasets with varying complexity.
Despite its advantages, the proposed methodology incurs higher computational cost due to the PSO process, which may limit its application in extremely large-scale or real-time scenarios. Furthermore, its performance depends on proper parameter tuning and model selection within the stacking framework.
For future work, several directions can be explored to further improve and extend the framework. First, incorporating deep learning models as additional base learners could capture complex non-linear feature relationships more effectively. Second, investigating adaptive or dynamic PSO strategies may reduce computation time while maintaining high imputation accuracy. Third, the framework can be extended to handle mixed-type datasets, including categorical variables, to broaden its applicability in real-world healthcare data. Finally, applying the model to longitudinal or time-series healthcare datasets could evaluate its robustness in dynamic, temporally correlated data, paving the way for more comprehensive clinical decision support systems.
Author Contributions
Conceptualization, B.I.M. and A.M.; methodology, A.M., S.A., and B.I.M.; software, B.I.M.; validation, S.A., A.M. and B.I.M.; formal analysis, A.M.; investigation, S.A., B.I.M. and A.M.; resources, A.I.; data curation, A.A.; writing—original draft preparation, B.I.M.; writing—review and editing, S.A. and A.M.; visualization, B.I.M.; supervision, S.A.; project administration, A.A. and A.I.; funding acquisition, B.I.M. and A.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
References
- Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; John Wiley & Sons Inc.: Hoboken, NJ, USA, 2019. [Google Scholar]
- Kavelaars, X.M.; van Ginkel, J.R.; van Buuren, S. Multiple imputation in data that grow over time: A comparison of three strategies. Multivar. Behav. Res. 2021, 57, 513–523. [Google Scholar] [CrossRef] [PubMed]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Tang, F.; Ishwaran, H. Random forest missing data algorithms. Stat. Anal. Data Min. ASA Data Sci. J. 2017, 10, 363–377. [Google Scholar] [CrossRef] [PubMed]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’16, San Francisco, CA, USA, 13–17 August 2016; Volume 1, pp. 785–794. [Google Scholar]
- Karimov, S.; Turimov, D.; Kim, W.; Kim, J. Comparative study of imputation strategies to improve the sarcopenia prediction task. Digit. Health 2025, 11, 20552076241301960. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Guo, S.; Ma, R.; He, J.; Zhang, X.; Rui, D.; Ding, Y.; Li, Y.; Jian, L.; Cheng, J.; et al. Comparison of the effects of imputation methods for missing data in predictive modelling of cohort study datasets. BMC Med. Res. Methodol. 2024, 24, 41. [Google Scholar] [CrossRef] [PubMed]
- Joel, L.O.; Doorsamy, W.; Paul, B.S. A comparative study of imputation techniques for missing values in healthcare diagnostic datasets. Int. J. Data Sci. Anal. 2025, 20, 6357–6373. [Google Scholar] [CrossRef]
- Alabadla, M.; Sidi, F.; Ishak, I.; Ibrahim, H.; Affendey, L.S.; Ani, Z.C.; Jabar, M.A.; Bukar, U.A.; Devaraj, N.K.; Muda, A.S.; et al. Systematic Review of Using Machine Learning in Imputing Missing Values. IEEE Access 2022, 10, 44483–44502. [Google Scholar] [CrossRef]
- Shadbahr, T.; Roberts, M.; Stanczuk, J.; Gilbey, J.; Teare, P.; Dittmer, S.; Thorpe, M.; Torné, R.V.; Sala, E.; Lió, P.; et al. The impact of imputation quality on machine learning classifiers for datasets with missing values. Commun. Med. 2023, 3, 139. [Google Scholar] [CrossRef] [PubMed]
- Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259. [Google Scholar] [CrossRef]
- He, Y.; Zhang, G.; Hsu, C.-H. Multiple Imputation of Missing Data in Practice; Chapman and Hall/CRC: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Ahmad, A.F.; Alshammari, K.; Ahmed, I.; Sayed, M.S. Machine Learning for Missing Value Imputation. arXiv 2024, arXiv:2410.08308. [Google Scholar] [CrossRef]
- Deepa, A.V.; Beena, T.L.A. Enhancing data imputation in complex datasets using Lagrange polynomial interpolation and hot-deck fusion. Sci. Temper 2025, 16, 3727–3735. [Google Scholar] [CrossRef]
- Kotan, K.; Kırışoğlu, S. Cyclical hybrid imputation technique for missing values in data sets. Sci. Rep. 2025, 15, 6543. [Google Scholar] [CrossRef] [PubMed]
- Zang, L.; Xiong, F. Harnessing Machine Learning to Address High Levels of Missing Data in Cross-National Studies: From Bias to Precision in Public Service Research. J. Comp. Policy Anal. Res. Pract. 2025, 27, 339–359. [Google Scholar] [CrossRef]