Search Results (74)

Search Parameters:
Keywords = kNN imputation

27 pages, 8347 KB  
Article
Diversity Constraint and Adaptive Graph Multi-View Functional Matrix Completion
by Haiyan Gao and Youdi Bian
Axioms 2025, 14(11), 793; https://doi.org/10.3390/axioms14110793 - 28 Oct 2025
Viewed by 254
Abstract
The integrity of real-time monitoring data is paramount to the accuracy of scientific research and the reliability of decision-making. However, data incompleteness arising from transmission interruptions or extreme weather disrupting equipment operations severely compromises the validity of statistical analyses and the stability of modelling. From a mathematical viewpoint, real-time monitoring data may be regarded as continuous functions, exhibiting intricate correlations and mutual influences between different indicators. Leveraging their inherent smoothness and interdependencies enables high-precision data imputation. Within the functional data analysis framework, this paper proposes a Diversity Constraint and Adaptive Graph Multi-View Functional Matrix Completion (DCAGMFMC) method. Integrating multi-view learning with an adaptive graph strategy, this approach comprehensively accounts for complex correlations between data from different views while extracting differential information across views, thereby enhancing information utilization and imputation accuracy. Random simulation experiments demonstrate that the DCAGMFMC method exhibits significant imputation advantages over classical methods such as KNN, HFI, SFI, MVNFMC, and GRMFMC. Furthermore, practical applications on meteorological datasets reveal that, compared to these imputation methods, the root mean square error (RMSE), mean absolute error (MAE), and normalized root mean square error (NRMSE) of the DCAGMFMC method decreased by an average of 39.11% to 59.15%, 54.50% to 71.97%, and 43.96% to 63.70%, respectively. It also demonstrated stable imputation performance across various meteorological indicators and missing data rates, exhibiting good adaptability and practical value. Full article

17 pages, 722 KB  
Article
Development of a Machine Learning Model for Predicting Treatment-Related Amenorrhea in Young Women with Breast Cancer
by Long Song, Zobaida Edib, Uwe Aickelin, Hadi Akbarzadeh Khorshidi, Anne-Sophie Hamy, Yasmin Jayasinghe, Martha Hickey, Richard A. Anderson, Matteo Lambertini, Margherita Condorelli, Isabelle Demeestere, Michail Ignatiadis, Barbara Pistilli, H. Irene Su, Shanton Chang, Patrick Cheong-Iao Pang, Fabien Reyal, Scott M. Nelson, Paniti Sukumvanich, Alessandro Minisini, Fabio Puglisi, Kathryn J. Ruddy, Fergus J. Couch, Janet E. Olson, Kate Stern, Franca Agresta, Lesley Stafford, Laura Chin-Lenn, Wanda Cui, Antoinette Anazodo, Alexandra Gorelik, Tuong L. Nguyen, Ann Partridge, Christobel Saunders, Elizabeth Sullivan, Mary Macheras-Magias and Michelle Peate
Bioengineering 2025, 12(11), 1171; https://doi.org/10.3390/bioengineering12111171 - 28 Oct 2025
Viewed by 511
Abstract
Treatment-induced ovarian function loss is a significant concern for many young patients with breast cancer. Accurately predicting this risk is crucial for counselling young patients and informing their fertility-related decision-making. However, current risk prediction models for treatment-related ovarian function loss have limitations. To provide a broader representation of patient cohorts and improve feature selection, we combined retrospective data from six datasets within the FoRECAsT (Infertility after Cancer Predictor) databank, including 2679 pre-menopausal women diagnosed with breast cancer. This combined dataset presented notable missingness, prompting us to employ cross imputation using the k-nearest neighbours (KNN) machine learning (ML) algorithm. Employing Lasso regression, we developed an ML model to forecast the risk of treatment-related amenorrhea as a surrogate marker of ovarian function loss at 12 months after starting chemotherapy. Our model identified 20 variables significantly associated with risk of developing amenorrhea. Internal validation resulted in an area under the receiver operating characteristic curve (AUC) of 0.820 (95% CI: 0.817–0.823), while external validation with another dataset demonstrated an AUC of 0.743 (95% CI: 0.666–0.818). A cutoff of 0.20 was chosen to achieve higher sensitivity in validation, as false negatives—patients incorrectly classified as likely to regain menses—could miss timely opportunities for fertility preservation if desired. At this threshold, internal validation yielded sensitivity and precision rates of 91.3% and 61.7%, respectively, while external validation showed 92.9% and 60.0%. Leveraging ML methodologies, we not only devised a model for personalised risk prediction of amenorrhea, demonstrating substantial enhancements over existing models but also showcased a robust framework for maximally harnessing available data sources. Full article
(This article belongs to the Special Issue Applications of Artificial Intelligence for Medical Diagnosis)
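The cross-imputation step in the abstract above rests on k-nearest-neighbours averaging: a missing entry is filled with the mean of that feature across the most similar complete records. As a rough illustration of the underlying idea only (not the authors' implementation; the function name, distance scaling, and parameters are invented here), a minimal KNN imputer can be sketched as:

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries by averaging the k nearest donor rows.
    Distance is Euclidean over features observed in both rows,
    scaled up so rows sharing few features are not favoured."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) * len(a) / len(shared))

    out = [list(r) for r in rows]
    for i, row in enumerate(out):
        for j, v in enumerate(row):
            if v is None:
                # donors: other rows where this feature is observed
                donors = [r for r in rows if r[j] is not None and r is not rows[i]]
                donors.sort(key=lambda r: dist(rows[i], r))
                vals = [r[j] for r in donors[:k]]
                row[j] = sum(vals) / len(vals)
    return out
```

This sketch assumes every missing cell has at least one donor; production imputers (e.g. scikit-learn's KNNImputer) also handle distance weighting and all-missing columns.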

30 pages, 379 KB  
Article
An Enhanced Discriminant Analysis Approach for Multi-Classification with Integrated Machine Learning-Based Missing Data Imputation
by Autcha Araveeporn and Atid Kangtunyakarn
Mathematics 2025, 13(21), 3392; https://doi.org/10.3390/math13213392 - 24 Oct 2025
Viewed by 248
Abstract
This study addresses the challenge of accurate classification under missing data conditions by integrating multiple imputation strategies with discriminant analysis frameworks. The proposed approach evaluates six imputation methods (Mean, Regression, KNN, Random Forest, Bagged Trees, MissRanger) across several discriminant techniques. Simulation scenarios varied in sample size, predictor dimensionality, and correlation structure, while the real-world application employed the Cirrhosis Prediction Dataset. The results consistently demonstrate that ensemble-based imputations, particularly regression, KNN, and MissRanger, outperform simpler approaches by preserving multivariate structure, especially in high-dimensional and highly correlated settings. MissRanger yielded the highest classification accuracy across most discriminant analysis methods in both simulated and real data, with performance gains most pronounced when combined with flexible or regularized classifiers. Regression imputation showed notable improvements under low correlation, aligning with the theoretical benefits of shrinkage-based covariance estimation. Across all methods, larger sample sizes and high correlation enhanced classification accuracy by improving parameter stability and imputation precision. Full article
(This article belongs to the Section D1: Probability and Statistics)
27 pages, 44538 KB  
Article
Short-Term Load Forecasting in the Greek Power Distribution System: A Comparative Study of Gradient Boosting and Deep Learning Models
by Md Fazle Hasan Shiblee and Paraskevas Koukaras
Energies 2025, 18(19), 5060; https://doi.org/10.3390/en18195060 - 23 Sep 2025
Viewed by 638
Abstract
Accurate short-term electricity load forecasting is essential for efficient energy management, grid reliability, and cost optimization. This study presents a comprehensive comparison of five supervised learning models—Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), a hybrid (CNN-LSTM) architecture, and Light Gradient Boosting Machine (LightGBM)—using multivariate data from the Greek electricity market between 2015 and 2024. The dataset incorporates hourly load, temperature, humidity, and holiday indicators. Extensive preprocessing was applied, including K-Nearest Neighbor (KNN) imputation, time-based feature extraction, and normalization. Models were trained using a 70:20:10 train–validation–test split and evaluated with standard performance metrics: MAE, MSE, RMSE, NRMSE, MAPE, and R2. The experimental findings show that LightGBM beat deep learning (DL) models on all evaluation metrics and had the best MAE (69.12 MW), RMSE (101.67 MW), and MAPE (1.20%) and the highest R2 (0.9942) for the test set. It also outperformed models in the literature and operational forecasts conducted in the real world by ENTSO-E. Though LSTM performed well, particularly in long-term dependency capturing, it performed a bit worse in high-variance periods. CNN, GRU, and hybrid models demonstrated moderate results, but they tended to underfit or overfit in some circumstances. These findings highlight the efficacy of LightGBM in structured time-series forecasting tasks, offering a scalable and interpretable alternative to DL models. This study supports its potential for real-world deployment in smart/distribution grid applications and provides valuable insights into the trade-offs between accuracy, complexity, and generalization in load forecasting models. Full article

13 pages, 1211 KB  
Article
Missing Data in OHCA Registries: How Imputation Methods Affect Research Conclusions—Paper I
by Stella Jinran Zhan, Seyed Ehsan Saffari, Marcus Eng Hock Ong and Fahad Javaid Siddiqui
J. Clin. Med. 2025, 14(17), 6345; https://doi.org/10.3390/jcm14176345 - 8 Sep 2025
Viewed by 659
Abstract
Background/Objectives: Clinical observational studies often encounter missing data, which complicates association evaluation with reduced bias while accounting for confounders. This is particularly challenging in multi-national registries such as those for out-of-hospital cardiac arrest (OHCA), a time-sensitive medical emergency with low survival rates. While various methods for handling missing data exist, observational studies frequently rely on complete-case analysis, limiting representativeness and potentially introducing bias. Our objective was to evaluate the impact of various single imputation methods on association analysis with OHCA registries. Methods: Using a complete dataset (N = 13,274) from the Pan-Asian Resuscitation Outcomes Study (PAROS) registry (1 January 2016–31 December 2020) as reference, we intentionally introduced missing values into selected variables via a Missing At Random (MAR) mechanism. We then compared statistical and machine learning (ML) single imputation methods to assess the association between bystander cardiopulmonary resuscitation (BCPR) and the issuance of a mobile app alert, adjusting for confounders. The impacts of complete-case analysis (CCA) and single imputation methods on conclusions in OHCA research were evaluated. Results: CCA was suboptimal for handling MAR data, resulting in more biased estimates and wider confidence intervals compared to single imputation methods. The missingness-indicator (MxI) method offered a trade-off between bias and ease of implementation. The K-Nearest Neighbours (KNN) method outperformed other imputation approaches, whereas missForest introduced bias under certain conditions. Conclusions: KNN and MxI are easy to use and better alternatives to CCA for reducing bias in observational studies. 
This study highlights the importance of selecting appropriate imputation methods to ensure reliable conclusions in OHCA research and has broader implications for other registries facing similar missing data challenges. Full article
(This article belongs to the Section Cardiology)
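The missingness-indicator (MxI) approach mentioned in the abstract above pairs a constant-filled column with a 0/1 flag column, letting the downstream model learn from the missingness pattern itself; this is what makes it a trade-off between bias and ease of implementation. A minimal sketch (function name and fill value are illustrative, not from the paper):

```python
def missingness_indicator(col, fill=0.0):
    """MxI: replace each missing entry with a constant and return a
    parallel 0/1 flag column marking which entries were imputed."""
    flags = [1 if v is None else 0 for v in col]
    filled = [fill if v is None else v for v in col]
    return filled, flags
```

Both columns would then be passed to the regression model, so the coefficient on the flag absorbs any systematic difference between observed and missing cases.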

42 pages, 6378 KB  
Article
Advances in Imputation Strategies Supporting Peak Storm Surge Surrogate Modeling
by WoongHee Jung, Christopher Irwin, Alexandros A. Taflanidis, Norberto C. Nadal-Caraballo, Luke A. Aucoin and Madison C. Yawn
J. Mar. Sci. Eng. 2025, 13(9), 1678; https://doi.org/10.3390/jmse13091678 - 31 Aug 2025
Viewed by 670
Abstract
Surrogate models are widely recognized as effective, data-driven predictive tools for storm surge risk assessment. For such applications, surrogate models (referenced also as emulators or metamodels) are typically developed using existing databases of synthetic storm simulations, and once calibrated can provide fast-to-compute approximations of the storm surge for a variety of downstream analyses. The storm surge predictions need to be established for different geographic locations of interest, typically corresponding to the computational nodes of the original numerical model. A number of inland nodes will remain dry for some of the database storm scenarios, requiring an imputation for them to estimate the so-called pseudo-surge in support of the surrogate model development. Past work has examined the adoption of kNN (k-nearest neighbor) spatial interpolation for this imputation. The enhancement of kNN with hydraulic connectivity information, using the grid or mesh of the original numerical model, was also previously considered. In this enhancement, neighboring nodes are considered connected only if they are connected within the grid. This work revisits the imputation of peak storm surge within a surrogate modeling context and examines three distinct advancements. First, a response-based correlation concept is considered for the hydraulic connectivity, replacing the previous notion of connectivity using the numerical model grid. Second, a Gaussian Process interpolation (GPI) is examined as alternative spatial imputation strategy, integrating a recently established adaptive covariance tapering scheme for accommodating an efficient implementation for large datasets (large number of nodes). Third, a data completion approach is examined for imputation, treating dry instances as missing data and establishing imputation using probabilistic principal component analysis (PPCA). The combination of spatial imputation with PPCA is also examined. 
In this instance, spatial imputation is first deployed, followed by PPCA for the nodes that were misclassified in the first stage. Misclassification corresponds to the instances for which imputation provides surge estimates higher than ground elevation, creating the illusion that the node is inundated even though the original predictions correspond to the node being dry. In the illustrative case study, different imputation variants established based on the aforementioned advancements are compared, with comparison metrics corresponding to the predictive accuracy of the surrogate models developed using the imputed databases. Results show that incorporating hydraulic connectivity based on response similarity into kNN enhances the predictive performance, that GPI provides a competitive (to kNN) spatial interpolation approach, and that the combination of data completion and spatial interpolation emerges as the recommended approach. Full article
(This article belongs to the Special Issue Machine Learning in Coastal Engineering)

16 pages, 1109 KB  
Article
Development and Validation of a Machine Learning Model for Early Prediction of Acute Kidney Injury in Neurocritical Care: A Comparative Analysis of XGBoost, GBM, and Random Forest Algorithms
by Keun Soo Kim, Tae Jin Yoon, Joonghyun Ahn and Jeong-Am Ryu
Diagnostics 2025, 15(16), 2061; https://doi.org/10.3390/diagnostics15162061 - 17 Aug 2025
Viewed by 834
Abstract
Background: Acute Kidney Injury (AKI) is a pivotal concern in neurocritical care, impacting patient survival and quality of life. This study harnesses machine learning (ML) techniques to predict the occurrence of AKI in patients receiving hyperosmolar therapy, aiming to optimize patient outcomes in neurocritical settings. Methods: We conducted a retrospective cohort study of 4886 patients who underwent hyperosmolar therapy in the neurosurgical intensive care unit (ICU). Comparative predictive analyses were carried out using advanced ML algorithms—eXtreme Gradient Boosting (XGBoost), Gradient Boosting Machine (GBM), Random Forest (RF)—against standard multivariate logistic regression. Predictive performance was assessed using an 8:2 training-testing data split, with model fine-tuning through cross-validation. Results: The RF with KNN imputation showed slightly better performance than other approaches in predicting AKI. When applied to an independent test set, it achieved a sensitivity of 79% (95% CI: 70–87%) and specificity of 85% (95% CI: 82–88%), with an overall accuracy of 84% (95% CI: 81–87%) and AUROC of 0.86 (95% CI: 0.82–0.91). The multivariate logistic regression analysis, while informative, showed less predictive strength compared to the ML models. Delta chloride levels and serum osmolality proved to be the most influential predictors, with additional significant variables including pH, age, bicarbonate, and the osmolar gap. Conclusions: The prominence of delta chloride and serum osmolality among the predictive variables underscores its potential as a biomarker for AKI risk in this patient population. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

26 pages, 4304 KB  
Article
A Hybrid Regression–Kriging–Machine Learning Framework for Imputing Missing TROPOMI NO2 Data over Taiwan
by Alyssa Valerio, Yi-Chun Chen, Chian-Yi Liu, Yi-Ying Chen and Chuan-Yao Lin
Remote Sens. 2025, 17(12), 2084; https://doi.org/10.3390/rs17122084 - 17 Jun 2025
Viewed by 1481
Abstract
This study presents a novel application of a hybrid regression–kriging (RK) and machine learning (ML) framework to impute missing tropospheric NO2 data from the TROPOMI satellite over Taiwan during the winter months of January, February, and December 2022. The proposed approach combines geostatistical interpolation with nonlinear modeling by integrating RK with ML models—specifically comparing gradient boosting regression (GBR), random forest (RF), and K-nearest neighbors (KNN)—to determine the most suitable auxiliary predictor. This structure enables the framework to capture both spatial autocorrelation and complex relationships between NO2 concentrations and environmental drivers. Model performance was evaluated using the coefficient of determination (r2), computed against observed TROPOMI NO2 column values filtered by quality assurance criteria. GBR achieved the highest validation r2 values of 0.83 for January and February, while RF yielded 0.82 and 0.79 in January and December, respectively. These results demonstrate the model’s robustness in capturing intra-seasonal patterns and nonlinear trends in NO2 distribution. In contrast, models using only static land cover inputs performed poorly (r2 < 0.58), emphasizing the limited predictive capacity of such variables in isolation. Interpretability analysis using the SHapley Additive exPlanations (SHAP) method revealed temperature as the most influential meteorological driver of NO2 variation, particularly during winter, while forest cover consistently emerged as a key land-use factor mitigating NO2 levels through dry deposition. By integrating dynamic meteorological variables and static land cover features, the hybrid RK–ML framework enhances the spatial and temporal completeness of satellite-derived air quality datasets. As the first RK–ML application for TROPOMI data in Taiwan, this study establishes a regional benchmark and offers a transferable methodology for satellite data imputation. 
Future research should explore ensemble-based RK variants, incorporate real-time auxiliary data, and assess transferability across diverse geographic and climatological contexts. Full article

16 pages, 1868 KB  
Article
Bridging the Gap: Missing Data Imputation Methods and Their Effect on Dementia Classification Performance
by Federica Aracri, Maria Giovanna Bianco, Andrea Quattrone and Alessia Sarica
Brain Sci. 2025, 15(6), 639; https://doi.org/10.3390/brainsci15060639 - 13 Jun 2025
Cited by 1 | Viewed by 2818
Abstract
Background/Objectives: Missing data is a common challenge in neuroscience and neuroimaging studies, especially in the context of neurodegenerative disorders such as Mild Cognitive Impairment (MCI) and Alzheimer’s Disease (AD). Inadequate handling of missing values can compromise the performance and interpretability of machine learning (ML) models. This study aimed to systematically compare the impacts of five imputation methods on classification performance using multimodal data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Methods: We analyzed a dataset including clinical, cognitive, and neuroimaging features from ADNI participants diagnosed with MCI or AD. Five imputation techniques—mean, median, k-Nearest Neighbors (kNNs), Multiple Imputation by Chained Equations (MICE), and missForest (MF)—were applied. Classification tasks were performed using Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM). Models were trained on the imputed datasets and evaluated on a test set without missing values. The statistical significance of performance differences was assessed using McNemar’s test. Results: On the test set, MICE imputation yielded the highest accuracy for both RF (0.76) and LR (0.81), while SVM performed best with median imputation (0.81). McNemar’s test revealed significant differences between RF and both LR and SVM (p < 0.01), but not between LR and SVM. Simpler methods like mean and median performed adequately but were generally outperformed by MICE. The performance of kNNs and MF was less consistent. Conclusions: Overall, the choice of imputation method significantly affects classification accuracy. Selecting strategies tailored to both data structure and classifier is essential for robust predictive modeling in clinical neuroscience. Full article
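Two of the baseline imputers compared in the abstract above, mean and median, are one-liners per column; they ignore inter-feature structure entirely, which is consistent with the finding that they perform adequately but are generally outperformed. A sketch of both (column-wise, with None marking missing values):

```python
def mean_impute(col):
    """Fill missing entries with the mean of the observed values."""
    obs = [v for v in col if v is not None]
    m = sum(obs) / len(obs)
    return [m if v is None else v for v in col]

def median_impute(col):
    """Fill missing entries with the median of the observed values
    (more robust than the mean when the column has outliers)."""
    obs = sorted(v for v in col if v is not None)
    n = len(obs)
    med = obs[n // 2] if n % 2 else (obs[n // 2 - 1] + obs[n // 2]) / 2
    return [med if v is None else v for v in col]
```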

16 pages, 2124 KB  
Article
Missing Data in Orthopaedic Clinical Outcomes Research: A Sensitivity Analysis of Imputation Techniques Utilizing a Large Multicenter Total Shoulder Arthroplasty Database
by Kevin A. Hao, Terrie Vasilopoulos, Josie Elwell, Christopher P. Roche, Keegan M. Hones, Jonathan O. Wright, Joseph J. King, Thomas W. Wright, Ryan W. Simovitch and Bradley S. Schoch
J. Clin. Med. 2025, 14(11), 3829; https://doi.org/10.3390/jcm14113829 - 29 May 2025
Cited by 1 | Viewed by 742
Abstract
Background: When missing data are present in clinical outcomes studies, complete-case analysis (CCA) is often performed, whereby patients with missing data are excluded. While simple, CCA analysis may impart selection bias and reduce statistical power, leading to erroneous statistical results in some cases. However, there exist more rigorous statistical approaches, such as single and multiple imputation, which approximate the associations that would have been present in a full dataset and preserve the study’s power. The purpose of this study is to evaluate how statistical results differ when performed after CCA analysis versus imputation methods. Methods: This simulation study analyzed a sample dataset consisting of 2204 shoulders, with complete datapoints from a larger multicenter total shoulder arthroplasty database. From the sampled dataset of demographics, surgical characteristics, and clinical outcomes, we created five test datasets, ranging from 100 to 2000 shoulders, and simulated 10–50% missingness in the postoperative American Shoulder and Elbow Surgeons (ASES) score and range of motion in four planes in missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) patterns. Missingness in outcomes was remedied using CCA, three single imputation techniques, and two multiple imputation techniques. The imputation performance was evaluated relative to the native complete dataset using the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). We also compared the mean and standard deviation (SD) of the postoperative ASES score and the results of multivariable linear and logistic regression to understand the effects of imputation on the study results. 
Results: The average overall RMSE and MAPE were similar for MCAR (22.6 and 27.2%) and MAR (19.2 and 17.7%) missingness patterns, but were substantially poorer for NMAR (37.5 and 79.2%); the sample size and the percentage of data missingness minimally affected RMSE and MAPE. Aggregated mean postoperative ASES scores were within 5% of the true value when missing data were remedied with CCA, and all candidate imputation methods for nearly all ranges of sample size and data missingness when data were MCAR or MAR, but not when data were NMAR. When data were MAR, CCA resulted in overestimates of the SD. When data were MCAR or MAR, the accuracy of the regression estimate (β or OR) and its corresponding 95% CI varied substantially based on the sample size and proportion of missing data for multivariable linear regression, but not logistic regression. When data were MAR, the width of the 95% CI was up to 300% larger when CCA was used, whereas most imputation methods maintained the width of the 95% CI within 50% of the true value. Single imputation with k-nearest neighbor (kNN) method and multiple imputation with predictive mean matching (MICE-PMM) best-reproduced point estimates and intervariable relationships resembling the native dataset. Availability of correlated outcome scores improved the RMSE, MAPE, accuracy of the mean postoperative ASES score, and multivariable linear regression model estimates. Conclusions: Complete-case analysis can introduce selection bias when data are MAR, and it results in loss of statistical power, resulting in loss of precision (i.e., expansion of the 95% CI) and predisposition to false-negative findings. Our data demonstrate that imputation can reliably reproduce missing clinical data and generate accurate population estimates that closely resemble results derived from native primary shoulder arthroplasty datasets (i.e., prior to simulated data missingness). 
Further study of the use of imputation in clinical database research is critical, as the use of CCA may lead to different conclusions in comparison to more rigorous imputation approaches. Full article
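Simulation studies like the one above start from a complete dataset and deliberately blank out entries under a chosen mechanism. The simplest mechanism, MCAR, deletes entries with a probability independent of any values; MAR and NMAR would instead condition that probability on observed or unobserved values, respectively. A minimal MCAR sketch (names and the seeding convention are illustrative, not from the paper):

```python
import random

def make_mcar(rows, col, rate, seed=0):
    """Blank out roughly `rate` of the entries in column `col`,
    completely at random (MCAR): deletion ignores all data values."""
    rng = random.Random(seed)          # fixed seed for reproducible simulation
    out = [list(r) for r in rows]      # leave the original dataset intact
    for r in out:
        if rng.random() < rate:
            r[col] = None
    return out
```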
(This article belongs to the Section Orthopedics)

37 pages, 9314 KB  
Article
A Data Imputation Approach for Missing Power Consumption Measurements in Water-Cooled Centrifugal Chillers
by Sung Won Kim and Young Il Kim
Energies 2025, 18(11), 2779; https://doi.org/10.3390/en18112779 - 27 May 2025
Cited by 1 | Viewed by 653
Abstract
In the process of collecting operational data for the performance analysis of water-cooled centrifugal chillers, missing values are inevitable due to various factors such as sensor errors, data transmission failures, and failure of the measurement system. When a substantial amount of missing data is present, the reliability of data analysis decreases, leading to potential distortions in the results. To address this issue, it is necessary to either minimize missing occurrences by utilizing high-precision measurement equipment or apply reliable imputation techniques to compensate for missing values. This study focuses on two water-cooled turbo chillers installed in Tower A, Seoul, collecting a total of 118,464 data points over 3 years and 4 months. The dataset includes chilled water inlet and outlet temperatures (T1 and T2) and flow rate (V˙1) and cooling water inlet and outlet temperatures (T3 and T4) and flow rate (V˙3), as well as chiller power consumption (W˙c). To evaluate the performance of various imputation techniques, we introduced missing values at a rate of 10–30% under the assumption of a missing-at-random (MAR) mechanism. Seven different imputation methods—mean, median, linear interpolation, multiple imputation, simple random imputation, k-nearest neighbors (KNN), and the dynamically clustered KNN (DC-KNN)—were applied, and their imputation performance was validated using MAPE and CVRMSE metrics. The DC-KNN method, developed in this study, improves upon conventional KNN imputation by integrating clustering and dynamic weighting mechanisms. The results indicate that DC-KNN achieved the highest predictive performance, with MAPE ranging from 9.74% to 10.30% and CVRMSE ranging from 12.19% to 13.43%. Finally, for the missing data recorded in July 2023, we applied the most effective DC-KNN method to generate imputed values that reflect the characteristics of the studied site, which employs an ice thermal energy storage system. Full article
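The two validation metrics reported in the abstract above are straightforward to compute from paired actual and imputed values; a sketch (the chiller-specific pipeline is not reproduced, and both functions assume nonzero actuals):

```python
import math

def mape(actual, pred):
    """Mean absolute percentage error, in %."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def cvrmse(actual, pred):
    """Coefficient of variation of the RMSE: RMSE as a % of the mean actual."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / n)
    return 100 * rmse / (sum(actual) / n)
```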

33 pages, 1438 KB  
Article
Mental Disorder Assessment in IoT-Enabled WBAN Systems with Dimensionality Reduction and Deep Learning
by Damilola Olatinwo, Adnan Abu-Mahfouz and Hermanus Myburgh
J. Sens. Actuator Netw. 2025, 14(3), 49; https://doi.org/10.3390/jsan14030049 - 7 May 2025
Viewed by 2515
Abstract
Mental health is an important aspect of an individual’s overall well-being. Positive mental health is correlated with enhanced cognitive function, emotional regulation, and motivation, which, in turn, foster increased productivity and personal growth. Accurate and interpretable predictions of mental disorders are crucial for effective intervention. This study develops a hybrid deep learning model, integrating CNN and BiLSTM applied to EEG data, to address this need. To conduct a comprehensive analysis of mental disorders, we propose a two-tiered classification strategy: the first tier classifies the main disorder categories, while the second tier classifies the specific disorders within each category, providing detailed diagnostic insight. The methodology incorporates techniques to handle missing data (kNN imputation), class imbalance (SMOTE), and high dimensionality (PCA). To enhance clinical trust and understanding, the model’s predictions are explained using local interpretable model-agnostic explanations (LIME). Baseline methods and the proposed CNN–BiLSTM model were implemented and evaluated at both classification tiers using PSD and FC features. On unseen test data, the proposed model demonstrated a 3–9% improvement in prediction accuracy for main disorders and a 4–6% improvement for specific disorders compared to existing methods. This approach offers the potential for more reliable and explainable diagnostic tools for mental disorder prediction. Full article
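The preprocessing chain named in the abstract (kNN imputation for missing data, PCA for dimensionality reduction) can be sketched as a scikit-learn pipeline. This is a minimal illustration, not the paper's implementation: the CNN–BiLSTM and SMOTE steps require a deep learning framework and the imbalanced-learn package, so a logistic regression stands in as the classifier, and a synthetic matrix stands in for the PSD/FC features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Stand-in for EEG-derived features: 500 samples x 60 features, 2 classes.
X, y = make_classification(n_samples=500, n_features=60, n_informative=15,
                           random_state=0)
X[rng.random(X.shape) < 0.05] = np.nan   # simulate missing feature values

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# kNN imputation -> PCA dimensionality reduction -> classifier, fitted as one
# pipeline so the imputer and PCA are learned on training data only.
pipe = make_pipeline(
    KNNImputer(n_neighbors=5),
    PCA(n_components=0.95),          # keep components explaining 95% variance
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
print(f"held-out accuracy: {acc:.3f}")
```

Fitting the whole chain as a single pipeline avoids leaking test-set statistics into the imputer or the PCA projection, which matters when reporting improvements on unseen data as the study does.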

18 pages, 5394 KB  
Article
Comparison of Models for Missing Data Imputation in PM-2.5 Measurement Data
by Ju-Yong Lee, Seung-Hee Han, Jin-Goo Kang, Chae-Yeon Lee, Jeong-Beom Lee, Hyeun-Soo Kim, Hui-Young Yun and Dae-Ryun Choi
Atmosphere 2025, 16(4), 438; https://doi.org/10.3390/atmos16040438 - 9 Apr 2025
Cited by 1 | Viewed by 1259
Abstract
The accurate monitoring and analysis of PM-2.5 are critical for improving air quality and formulating public health policies. However, environmental data often contain missing values due to equipment failures, data collection errors, or extreme weather conditions, which can hinder reliable analysis and predictions. This study evaluates the performance of various missing data imputation methods for PM-2.5 data in Seoul, Korea, using scenarios with artificially generated missing values during high- and low-concentration periods. The methods compared include FFILL, KNN, MICE, SARIMAX, DNN, and LSTM. The results indicate that KNN consistently achieved stable and balanced performance across different temporal intervals, with RMSEs of 5.65, 9.14, and 9.71 for 6 h, 12 h, and 24 h intervals, respectively. FFILL demonstrated superior performance for short intervals (RMSE 4.76 for 6 h) but showed significant limitations as intervals lengthened. SARIMAX performed well in long-term scenarios, with an RMSE of 9.37 for 24 h intervals, but required higher computational complexity. Conversely, deep learning models such as DNN and LSTM underperformed, highlighting the need for further optimization for time-series data. This study highlights the practicality of KNN as the most effective method for addressing missing PM-2.5 data in mid- to long-term applications due to its simplicity and efficiency. These findings provide valuable insights into the selection of imputation methods for environmental data analysis, contributing to the enhancement of data reliability and the development of effective air quality management policies. Full article
(This article belongs to the Special Issue New Insights in Air Quality Assessment: Forecasting and Monitoring)

32 pages, 6360 KB  
Article
Regression-Based Networked Virtual Buoy Model for Offshore Wave Height Prediction
by Eleonora M. Tronci, Matteo Vitale, Therese Patrosio, Thomas Søndergaard, Babak Moaveni and Usman Khan
J. Mar. Sci. Eng. 2025, 13(4), 728; https://doi.org/10.3390/jmse13040728 - 5 Apr 2025
Cited by 1 | Viewed by 994
Abstract
Accurate wave height measurements are critical for offshore wind farm operations, marine navigation, and environmental monitoring. Wave buoys provide essential real-time data; however, their reliability is compromised by harsh marine conditions, resulting in frequent data gaps due to sensor failures, maintenance issues, or extreme weather events. These disruptions pose significant risks for decision-making in offshore logistics and safety planning. While numerical wave models and machine learning techniques have been explored for wave height prediction, most approaches rely heavily on historical data from the same buoy, limiting their applicability when the target sensor is offline. This study addresses these limitations by developing a virtual wave buoy model using a network-based data-driven approach with Random Forest Regression (RFR). By leveraging wave height measurements from surrounding buoys, the proposed model ensures continuous wave height estimation even in the case of malfunctioning physical sensors. The methodology is tested across four offshore sites, including operational wind farms, evaluating the sensitivity of predictions to buoy placement and feature selection. The model demonstrates high accuracy and incorporates a k-nearest neighbors (kNN) imputation strategy to mitigate data loss. These findings establish RFR as a scalable and computationally efficient alternative for virtual sensing, thereby enhancing offshore wind farm resilience, marine safety, and operational efficiency. Full article
(This article belongs to the Section Ocean Engineering)
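The core of the virtual-buoy idea, regressing an offline buoy's wave height on measurements from surrounding buoys, fits in a few lines of scikit-learn. The sketch below uses fabricated sea-state data rather than the study's four sites, but follows the same recipe named in the abstract: kNN imputation of dropped neighbor records, then Random Forest Regression from neighbors to target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic network: 4 neighbouring buoys observe a shared sea state; the
# target buoy's significant wave height correlates with all of them.
n = 2000
sea = np.abs(rng.normal(2.0, 0.8, n))                     # common driving signal
neighbours = sea[:, None] + rng.normal(0, 0.25, (n, 4))   # 4 surrounding buoys
target = 0.9 * sea + rng.normal(0, 0.15, n)               # buoy to re-create

# Neighbour records drop out too; kNN-impute them before regression, as the
# study's kNN imputation strategy does for its input features.
neighbours[rng.random(neighbours.shape) < 0.10] = np.nan
X = KNNImputer(n_neighbors=5).fit_transform(neighbours)

X_tr, X_te, y_tr, y_te = train_test_split(X, target, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
r2 = r2_score(y_te, rf.predict(X_te))
print(f"virtual-buoy R^2 on held-out data: {r2:.3f}")
```

Because the model depends only on the surrounding network, it keeps producing estimates when the physical target sensor is offline, which is the resilience argument the abstract makes.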

17 pages, 2116 KB  
Article
A Comparative Analysis of Hyper-Parameter Optimization Methods for Predicting Heart Failure Outcomes
by Qisthi Alhazmi Hidayaturrohman and Eisuke Hanada
Appl. Sci. 2025, 15(6), 3393; https://doi.org/10.3390/app15063393 - 20 Mar 2025
Cited by 5 | Viewed by 2923
Abstract
This study presents a comparative analysis of hyper-parameter optimization methods used in developing predictive models for patients at risk of heart failure readmission and mortality. We evaluated three optimization approaches—Grid Search (GS), Random Search (RS), and Bayesian Search (BS)—across three machine learning algorithms—Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). The models were built using real patient data from the Zigong Fourth People’s Hospital, which included 167 features from 2008 patients. The mean, MICE, kNN, and RF imputation techniques were implemented to handle missing values. Our initial results showed that SVM models outperformed the others, achieving an accuracy of up to 0.6294, sensitivity above 0.61, and an AUC score exceeding 0.66. However, after 10-fold cross-validation, the RF models demonstrated superior robustness, with an average AUC improvement of 0.03815, whereas the SVM models showed potential for overfitting, with a slight decline (−0.0074). The XGBoost models exhibited moderate improvement (+0.01683) post-validation. Bayesian Search had the best computational efficiency, consistently requiring less processing time than the Grid and Random Search methods. This study reveals that while model selection is crucial, an appropriate optimization method and imputation technique significantly impact model performance. These findings provide valuable insights for developing robust predictive models for healthcare applications, particularly for heart failure risk assessment. Full article
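Two of the three optimization approaches compared above map directly onto scikit-learn's search classes; the sketch below times Grid Search against Random Search for an SVM on a small synthetic task. Bayesian Search follows the same estimator API via third-party packages such as scikit-optimize's `BayesSearchCV`, but is omitted here to keep the sketch dependency-free; the grid values and sample sizes are illustrative, not the study's settings.

```python
import time

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Grid Search: exhaustive over a fixed 3x3 grid (9 candidates).
grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
t0 = time.perf_counter()
gs = GridSearchCV(SVC(), grid, cv=5).fit(X, y)
t_gs = time.perf_counter() - t0

# Random Search: same candidate budget, but sampled from continuous
# log-uniform distributions instead of a fixed grid.
dist = {"C": loguniform(1e-1, 1e1), "gamma": loguniform(1e-2, 1e0)}
t0 = time.perf_counter()
rs = RandomizedSearchCV(SVC(), dist, n_iter=9, cv=5, random_state=0).fit(X, y)
t_rs = time.perf_counter() - t0

print(f"GS best={gs.best_score_:.3f} ({t_gs:.2f}s)  "
      f"RS best={rs.best_score_:.3f} ({t_rs:.2f}s)")
```

With equal candidate budgets the two searches cost about the same; the study's efficiency gap favoring Bayesian Search comes from it needing fewer evaluations to find comparable optima, not from cheaper individual fits.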
