Search Results (41)

Search Parameters:
Keywords = MICE imputation

20 pages, 4787 KiB  
Article
A Data Imputation Strategy to Enhance Online Game Churn Prediction, Considering Non-Login Periods
by JaeHong Lee, Pavinee Rerkjirattikal and SangGyu Nam
Data 2025, 10(7), 96; https://doi.org/10.3390/data10070096 - 23 Jun 2025
Viewed by 530
Abstract
User churn in online games refers to players becoming inactive for an extended period. Even a small increase in churn can lead to significant revenue loss, making churn prediction crucial for sustaining long-term player engagement. Although user churn prediction has been extensively studied, most existing approaches either ignore non-login periods or treat all inactivity uniformly, overlooking key behavioral differences. This study addresses this gap by categorizing non-login periods into three types, as follows: inactivity due to new or dormant users, genuine loss of interest, and temporary inaccessibility caused by external factors. These periods are treated as either non-existent or missing data and imputed using techniques such as mean or mode substitution, linear interpolation, and multiple imputation by chained equations (MICE). MICE was selected due to its ability to impute missing values more robustly by considering multivariate relationships. A random forest (RF) classifier, chosen for its interpretability and robustness to incomplete data, serves as the primary prediction model. Additionally, classifier chains are used to capture label dependencies, and principal component analysis (PCA) is applied to reduce dimensionality and mitigate overfitting. Experiments on real-world MMORPG data show that our approach improves predictive accuracy, achieving a micro-averaged AUC of above 0.92 and a weighted F1 score exceeding 0.70. These findings suggest that our approach improves churn prediction and offers actionable insights for supporting personalized player retention strategies. Full article
(This article belongs to the Section Information Systems and Data Management)
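As a rough illustration of the pipeline this abstract describes, the sketch below chains a MICE-style imputer, PCA, and a random-forest classifier chain with scikit-learn. The feature matrix and multi-label churn targets are synthetic placeholders rather than the paper's MMORPG data, and IterativeImputer stands in for the MICE procedure.

```python
# Minimal sketch, assuming synthetic data: a MICE-style imputer feeding PCA and a
# random-forest classifier chain, mirroring the components named in the abstract.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))                 # stand-in activity features
X[rng.random(X.shape) < 0.2] = np.nan          # non-login periods treated as missing values
Y = rng.integers(0, 2, size=(500, 3))          # hypothetical multi-label churn targets

model = make_pipeline(
    IterativeImputer(max_iter=10, random_state=0),  # chained-equations-style imputation
    PCA(n_components=5),                            # dimensionality reduction to curb overfitting
    ClassifierChain(RandomForestClassifier(n_estimators=200, random_state=0), random_state=0),
)
model.fit(X, Y)
print(model.predict_proba(X[:5]))               # per-label churn probabilities
```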

16 pages, 1868 KiB  
Article
Bridging the Gap: Missing Data Imputation Methods and Their Effect on Dementia Classification Performance
by Federica Aracri, Maria Giovanna Bianco, Andrea Quattrone and Alessia Sarica
Brain Sci. 2025, 15(6), 639; https://doi.org/10.3390/brainsci15060639 - 13 Jun 2025
Viewed by 687
Abstract
Background/Objectives: Missing data is a common challenge in neuroscience and neuroimaging studies, especially in the context of neurodegenerative disorders such as Mild Cognitive Impairment (MCI) and Alzheimer’s Disease (AD). Inadequate handling of missing values can compromise the performance and interpretability of machine learning (ML) models. This study aimed to systematically compare the impacts of five imputation methods on classification performance using multimodal data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Methods: We analyzed a dataset including clinical, cognitive, and neuroimaging features from ADNI participants diagnosed with MCI or AD. Five imputation techniques—mean, median, k-Nearest Neighbors (kNNs), Multiple Imputation by Chained Equations (MICE), and missForest (MF)—were applied. Classification tasks were performed using Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM). Models were trained on the imputed datasets and evaluated on a test set without missing values. The statistical significance of performance differences was assessed using McNemar’s test. Results: On the test set, MICE imputation yielded the highest accuracy for both RF (0.76) and LR (0.81), while SVM performed best with median imputation (0.81). McNemar’s test revealed significant differences between RF and both LR and SVM (p < 0.01), but not between LR and SVM. Simpler methods like mean and median performed adequately but were generally outperformed by MICE. The performance of kNNs and MF was less consistent. Conclusions: Overall, the choice of imputation method significantly affects classification accuracy. Selecting strategies tailored to both data structure and classifier is essential for robust predictive modeling in clinical neuroscience. Full article
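A minimal sketch of the comparison design follows: each imputer is crossed with each classifier, trained on an incomplete training set, and scored on a complete test set. The data are synthetic stand-ins for the ADNI features, and scikit-learn's IterativeImputer approximates MICE.

```python
# Sketch of the comparison design on synthetic data: each imputer is crossed with each
# classifier, trained on an incomplete training set, and scored on a complete test set.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
X_tr = X_tr.copy()
X_tr[np.random.default_rng(0).random(X_tr.shape) < 0.15] = np.nan  # missingness in training data only

imputers = {"mean": SimpleImputer(strategy="mean"),
            "median": SimpleImputer(strategy="median"),
            "kNN": KNNImputer(n_neighbors=5),
            "MICE": IterativeImputer(max_iter=10, random_state=0)}
classifiers = {"RF": RandomForestClassifier(random_state=0),
               "LR": LogisticRegression(max_iter=1000),
               "SVM": SVC()}

for i_name, imputer in imputers.items():
    for c_name, clf in classifiers.items():
        pipe = make_pipeline(clone(imputer), StandardScaler(), clone(clf))
        acc = pipe.fit(X_tr, y_tr).score(X_te, y_te)   # accuracy on the complete test set
        print(f"{i_name:>6} + {c_name}: {acc:.3f}")
```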

16 pages, 2124 KiB  
Article
Missing Data in Orthopaedic Clinical Outcomes Research: A Sensitivity Analysis of Imputation Techniques Utilizing a Large Multicenter Total Shoulder Arthroplasty Database
by Kevin A. Hao, Terrie Vasilopoulos, Josie Elwell, Christopher P. Roche, Keegan M. Hones, Jonathan O. Wright, Joseph J. King, Thomas W. Wright, Ryan W. Simovitch and Bradley S. Schoch
J. Clin. Med. 2025, 14(11), 3829; https://doi.org/10.3390/jcm14113829 - 29 May 2025
Cited by 1 | Viewed by 472
Abstract
Background: When missing data are present in clinical outcomes studies, complete-case analysis (CCA) is often performed, whereby patients with missing data are excluded. While simple, CCA analysis may impart selection bias and reduce statistical power, leading to erroneous statistical results in some cases. However, there exist more rigorous statistical approaches, such as single and multiple imputation, which approximate the associations that would have been present in a full dataset and preserve the study’s power. The purpose of this study is to evaluate how statistical results differ when performed after CCA analysis versus imputation methods. Methods: This simulation study analyzed a sample dataset consisting of 2204 shoulders, with complete datapoints from a larger multicenter total shoulder arthroplasty database. From the sampled dataset of demographics, surgical characteristics, and clinical outcomes, we created five test datasets, ranging from 100 to 2000 shoulders, and simulated 10–50% missingness in the postoperative American Shoulder and Elbow Surgeons (ASES) score and range of motion in four planes in missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR) patterns. Missingness in outcomes was remedied using CCA, three single imputation techniques, and two multiple imputation techniques. The imputation performance was evaluated relative to the native complete dataset using the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). We also compared the mean and standard deviation (SD) of the postoperative ASES score and the results of multivariable linear and logistic regression to understand the effects of imputation on the study results. Results: The average overall RMSE and MAPE were similar for MCAR (22.6 and 27.2%) and MAR (19.2 and 17.7%) missingness patterns, but were substantially poorer for NMAR (37.5 and 79.2%); the sample size and the percentage of data missingness minimally affected RMSE and MAPE. Aggregated mean postoperative ASES scores were within 5% of the true value when missing data were remedied with CCA, and all candidate imputation methods for nearly all ranges of sample size and data missingness when data were MCAR or MAR, but not when data were NMAR. When data were MAR, CCA resulted in overestimates of the SD. When data were MCAR or MAR, the accuracy of the regression estimate (β or OR) and its corresponding 95% CI varied substantially based on the sample size and proportion of missing data for multivariable linear regression, but not logistic regression. When data were MAR, the width of the 95% CI was up to 300% larger when CCA was used, whereas most imputation methods maintained the width of the 95% CI within 50% of the true value. Single imputation with k-nearest neighbor (kNN) method and multiple imputation with predictive mean matching (MICE-PMM) best-reproduced point estimates and intervariable relationships resembling the native dataset. Availability of correlated outcome scores improved the RMSE, MAPE, accuracy of the mean postoperative ASES score, and multivariable linear regression model estimates. Conclusions: Complete-case analysis can introduce selection bias when data are MAR, and it results in loss of statistical power, resulting in loss of precision (i.e., expansion of the 95% CI) and predisposition to false-negative findings. 
Our data demonstrate that imputation can reliably reproduce missing clinical data and generate accurate population estimates that closely resemble results derived from native primary shoulder arthroplasty datasets (i.e., prior to simulated data missingness). Further study of the use of imputation in clinical database research is critical, as the use of CCA may lead to different conclusions in comparison to more rigorous imputation approaches. Full article
(This article belongs to the Section Orthopedics)
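The simulation loop can be sketched as below, under assumed column names and scales: mask a fraction of an outcome column completely at random, impute it, and score the recovery with RMSE and MAPE. Scikit-learn imputers stand in for the study's kNN and MICE-PMM implementations.

```python
# Sketch of the missingness simulation, with hypothetical column names and scales: mask an
# outcome completely at random (MCAR), impute it, and score the recovery with RMSE and MAPE.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({"age": rng.normal(68, 8, n),
                   "preop_ases": rng.normal(40, 12, n),
                   "postop_ases": rng.normal(75, 15, n)})   # outcome to be masked
truth = df["postop_ases"].to_numpy().copy()

mask = rng.random(n) < 0.30                                  # 30% MCAR missingness in the outcome
df_missing = df.copy()
df_missing.loc[mask, "postop_ases"] = np.nan

candidates = {"kNN": KNNImputer(n_neighbors=10),
              "MICE-like": IterativeImputer(sample_posterior=True, random_state=0)}
for name, imputer in candidates.items():
    imputed = imputer.fit_transform(df_missing)[:, 2]        # column 2 = postop_ases
    rmse = mean_squared_error(truth[mask], imputed[mask]) ** 0.5
    mape = mean_absolute_percentage_error(truth[mask], imputed[mask])
    print(f"{name}: RMSE={rmse:.1f}, MAPE={mape:.1%}")
```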

17 pages, 600 KiB  
Article
Protective Factors for Marijuana Use and Suicidal Behavior Among Black LGBQ U.S. High School Students
by DeKeitra Griffin, Shawndaya S. Thrasher, Keith J. Watts, Philip Baiden, Elaine M. Maccio and Miya Tate
Soc. Sci. 2025, 14(5), 267; https://doi.org/10.3390/socsci14050267 - 26 Apr 2025
Viewed by 607
Abstract
This study aimed to investigate the association between protective factors, marijuana use, and suicidal behavior among Black LGBQ U.S. adolescents. Methods: A subsample of 991 Black LGBQ adolescents was derived from the 2019 Combined High School YRBSS dataset. Suicidal behavior was measured as suicidal planning and/or previous suicide attempts. Marijuana usage gauged lifetime consumption. The protective factors included sports team participation, physical activity, eating breakfast, hours of sleep, and academic performance. Age and sex were entered as covariates. Multiple imputation by chained equations (MICE) was used to address missing data, and pooled binary logistic regression analyses were conducted. Results: Academic performance and hours of sleep were significantly associated with lower odds of suicidal behavior and lifetime marijuana use. Sports team participation was associated with higher odds of lifetime marijuana use. Being female was linked to higher odds of marijuana use, while older age was associated with lower odds. Discussion: For Black LGBQ youth, academic performance and sufficient sleep may function as protective factors. Participating in sports was associated with greater odds of risk behaviors, highlighting the need to assess the experiences of Black LGBQ youth in sports. Implications and Contributions: Our findings inform school programming, policy, and practice by identifying academic support and sleep health as intervention areas. Full article
(This article belongs to the Special Issue The Social and Emotional Wellbeing of LGBTQ+ Young People)
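For the analysis strategy named here, MICE followed by pooled binary logistic regression, statsmodels offers a direct implementation. The sketch below uses simplified placeholder variables rather than the YRBSS items.

```python
# Sketch of MICE with pooled logistic regression in statsmodels; the variables are simplified
# placeholders for the YRBSS items, not the study's coding scheme.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.imputation.mice import MICE, MICEData

rng = np.random.default_rng(0)
n = 991
df = pd.DataFrame({"suicidal_behavior": rng.integers(0, 2, n).astype(float),
                   "sleep_hours": rng.normal(7, 1.5, n),
                   "grades": rng.integers(1, 5, n).astype(float),
                   "sports": rng.integers(0, 2, n).astype(float),
                   "age": rng.integers(14, 19, n).astype(float)})
for col in ["sleep_hours", "grades"]:
    df.loc[rng.random(n) < 0.1, col] = np.nan        # item non-response

imp = MICEData(df)                                   # chained-equations imputation model
mice = MICE("suicidal_behavior ~ sleep_hours + grades + sports + age", sm.Logit, imp)
result = mice.fit(n_burnin=10, n_imputations=10)     # estimates pooled across imputed datasets
print(result.summary())
```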

18 pages, 5394 KiB  
Article
Comparison of Models for Missing Data Imputation in PM-2.5 Measurement Data
by Ju-Yong Lee, Seung-Hee Han, Jin-Goo Kang, Chae-Yeon Lee, Jeong-Beom Lee, Hyeun-Soo Kim, Hui-Young Yun and Dae-Ryun Choi
Atmosphere 2025, 16(4), 438; https://doi.org/10.3390/atmos16040438 - 9 Apr 2025
Viewed by 754
Abstract
The accurate monitoring and analysis of PM-2.5 are critical for improving air quality and formulating public health policies. However, environmental data often contain missing values due to equipment failures, data collection errors, or extreme weather conditions, which can hinder reliable analysis and predictions. This study evaluates the performance of various missing data imputation methods for PM-2.5 data in Seoul, Korea, using scenarios with artificially generated missing values during high- and low-concentration periods. The methods compared include FFILL, KNN, MICE, SARIMAX, DNN, and LSTM. The results indicate that KNN consistently achieved stable and balanced performance across different temporal intervals, with an RMSE of 5.65, 9.14, and 9.71 for 6 h, 12 h, and 24 h intervals, respectively. FFILL demonstrated superior performance for short intervals (RMSE 4.76 for 6 h) but showed significant limitations as intervals lengthened. SARIMAX performed well in long-term scenarios, with an RMSE of 9.37 for 24 h intervals, but required higher computational complexity. Conversely, deep learning models such as DNN and LSTM underperformed, highlighting the need for further optimization for time-series data. This study highlights the practicality of KNN as the most effective method for addressing missing PM-2.5 data in mid- to long-term applications due to its simplicity and efficiency. These findings provide valuable insights into the selection of imputation methods for environmental data analysis, contributing to the enhancement of data reliability and the development of effective air quality management policies. Full article
(This article belongs to the Special Issue New Insights in Air Quality Assessment: Forecasting and Monitoring)
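A gap-based evaluation of this kind can be sketched as follows, assuming a synthetic hourly PM-2.5 series: blank out contiguous 6, 12, and 24 h windows, fill them with forward-fill and KNN imputation, and compare RMSE on the masked values.

```python
# Sketch of gap-based evaluation on a synthetic hourly PM-2.5 series: mask contiguous windows,
# fill them with forward-fill (FFILL) and KNN imputation on lagged features, and compare RMSE.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="h")
pm25 = pd.Series(30 + 15 * np.sin(np.arange(len(idx)) / 24 * 2 * np.pi)
                 + rng.normal(0, 5, len(idx)), index=idx, name="pm25")

for gap in (6, 12, 24):                              # masked gap length in hours
    start = 500
    masked = pm25.copy()
    masked.iloc[start:start + gap] = np.nan

    ffill = masked.ffill()                           # carry the last observation forward
    lagged = pd.concat({f"lag{k}": masked.shift(k) for k in (0, 24, 48)}, axis=1)
    knn = pd.Series(KNNImputer(n_neighbors=5).fit_transform(lagged)[:, 0], index=idx)

    truth = pm25.iloc[start:start + gap]
    for name, filled in (("FFILL", ffill), ("KNN", knn)):
        rmse = mean_squared_error(truth, filled.iloc[start:start + gap]) ** 0.5
        print(f"{gap:>2} h gap  {name}: RMSE={rmse:.2f}")
```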

12 pages, 6082 KiB  
Article
Preserving Informative Presence: How Missing Data and Imputation Strategies Affect the Performance of an AI-Based Early Warning Score
by Taeyong Sim, Sangchul Hahn, Kwang-Joon Kim, Eun-Young Cho, Yeeun Jeong, Ji-hyun Kim, Eun-Yeong Ha, In-Cheol Kim, Sun-Hyo Park, Chi-Heum Cho, Gyeong-Im Yu, Hochan Cho and Ki-Byung Lee
J. Clin. Med. 2025, 14(7), 2213; https://doi.org/10.3390/jcm14072213 - 24 Mar 2025
Cited by 1 | Viewed by 850
Abstract
Background/Objectives: Data availability can affect the performance of AI-based early warning scores (EWSs). This study evaluated how the extent of missing data and imputation strategies influence the predictive performance of the VitalCare–Major Adverse Event Score (VC-MAES), an AI-based EWS that uses last observation carried forward and normal-value imputation for missing values, to forecast clinical deterioration events, including unplanned ICU transfers, cardiac arrests, or death, up to 6 h in advance. Methods: We analyzed real-world data from 6039 patient encounters at Keimyung University Dongsan Hospital, Republic of Korea. Performance was evaluated under three scenarios: (1) using only vital signs and age, treating all other variables as missing; (2) reintroducing a full set of real-world clinical variables; and (3) imputing missing values drawn from a distribution within one standard deviation of the observed mean or using Multiple Imputation by Chained Equations (MICE). Results: VC-MAES achieved the area under the receiver operating characteristic curve (AUROC) of 0.896 using only vital signs and age, outperforming traditional EWSs, including the National Early Warning Score (0.797) and the Modified Early Warning Score (0.722). Reintroducing full clinical variables improved the AUROC to 0.918, whereas mean-based imputation or MICE decreased the performance to 0.885 and 0.827, respectively. Conclusions: VC-MAES demonstrates robust predictive performance with limited inputs, outperforming traditional EWSs. Incorporating actual clinical data significantly improved accuracy. In contrast, mean-based or MICE imputation yielded poorer results than the default normal-value imputation, potentially due to disregarding the “informative presence” embedded in missing data patterns. These findings underscore the importance of understanding missingness patterns and employing imputation strategies that consider the decision-making context behind data availability to enhance model reliability. Full article
(This article belongs to the Section Intensive Care)
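The contrast drawn here between default imputation and "informative presence" can be illustrated with a small helper that keeps last-observation-carried-forward plus a normal-value fallback while exposing which measurements were actually taken as indicator features. The lab names and reference values are assumptions, not the VC-MAES implementation.

```python
# Hedged illustration of the "informative presence" idea: last observation carried forward with
# a normal-value fallback, plus indicator columns recording which tests were actually ordered.
import numpy as np
import pandas as pd

NORMAL_VALUES = {"lactate": 1.0, "creatinine": 0.9, "wbc": 7.5}   # assumed reference values

def prepare_features(obs: pd.DataFrame) -> pd.DataFrame:
    """obs: one patient's time-indexed measurements, NaN where a test was not performed."""
    filled = obs.ffill()                                   # last observation carried forward
    filled = filled.fillna(pd.Series(NORMAL_VALUES))       # normal-value fallback before any measurement
    indicators = obs.notna().astype(int).add_suffix("_measured")  # informative-presence flags
    return pd.concat([filled, indicators], axis=1)

obs = pd.DataFrame(
    {"lactate": [np.nan, 2.4, np.nan], "creatinine": [1.1, np.nan, np.nan], "wbc": [np.nan, np.nan, 12.0]},
    index=pd.to_datetime(["2024-01-01 06:00", "2024-01-01 12:00", "2024-01-01 18:00"]),
)
print(prepare_features(obs))
```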

17 pages, 2116 KiB  
Article
A Comparative Analysis of Hyper-Parameter Optimization Methods for Predicting Heart Failure Outcomes
by Qisthi Alhazmi Hidayaturrohman and Eisuke Hanada
Appl. Sci. 2025, 15(6), 3393; https://doi.org/10.3390/app15063393 - 20 Mar 2025
Viewed by 1197
Abstract
This study presents a comparative analysis of hyper-parameter optimization methods used in developing predictive models for patients at risk of heart failure readmission and mortality. We evaluated three optimization approaches—Grid Search (GS), Random Search (RS), and Bayesian Search (BS)—across three machine learning algorithms—Support Vector Machine (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost). The models were built using real patient data from the Zigong Fourth People’s Hospital, which included 167 features from 2008 patients. The mean, MICE, kNN, and RF imputation techniques were implemented to handle missing values. Our initial results showed that SVM models outperformed the others, achieving an accuracy of up to 0.6294, sensitivity above 0.61, and an AUC score exceeding 0.66. However, after 10-fold cross-validation, the RF models demonstrated superior robustness, with an average AUC improvement of 0.03815, whereas the SVM models showed potential for overfitting, with a slight decline (−0.0074). The XGBoost models exhibited moderate improvement (+0.01683) post-validation. Bayesian Search had the best computational efficiency, consistently requiring less processing time than the Grid and Random Search methods. This study reveals that while model selection is crucial, an appropriate optimization method and imputation technique significantly impact model performance. These findings provide valuable insights for developing robust predictive models for healthcare applications, particularly for heart failure risk assessment. Full article
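The three search strategies can be compared in a few lines of scikit-learn; Bayesian search is shown via scikit-optimize's BayesSearchCV, a separate package, and the data and parameter ranges are illustrative rather than the study's.

```python
# Sketch of the three search strategies on one estimator; BayesSearchCV comes from the separate
# scikit-optimize package, so it is included only if installed. Data and ranges are illustrative.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 6, 12]}
searches = {
    "Grid": GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                         cv=5, scoring="roc_auc"),
    "Random": RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                                 n_iter=5, cv=5, scoring="roc_auc", random_state=0),
}
try:                                  # optional Bayesian search via scikit-optimize
    from skopt import BayesSearchCV
    searches["Bayes"] = BayesSearchCV(RandomForestClassifier(random_state=0), param_grid,
                                      n_iter=5, cv=5, scoring="roc_auc", random_state=0)
except ImportError:
    pass

for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X, y)
    print(f"{name}: best AUC={search.best_score_:.3f}, time={time.perf_counter() - start:.1f}s")
```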

32 pages, 502 KiB  
Article
Bayesian Random Forest with Multiple Imputation by Chain Equations for High-Dimensional Missing Data: A Simulation Study
by Oyebayo Ridwan Olaniran and Ali Rashash R. Alzahrani
Mathematics 2025, 13(6), 956; https://doi.org/10.3390/math13060956 - 13 Mar 2025
Cited by 1 | Viewed by 967
Abstract
The pervasive challenge of missing data in scientific research forces a critical trade-off: discarding incomplete observations, which risks significant information loss, while conventional imputation methods struggle to maintain accuracy in high-dimensional settings. Although approaches like multiple imputation (MI) and random forest (RF) proximity-based imputation offer improvements over naive deletion, they exhibit limitations in complex missing data scenarios or sparse high-dimensional settings. To address these gaps, we propose a novel integration of Multiple Imputation by Chained Equations (MICE) with Bayesian Random Forest (BRF), leveraging MICE’s iterative flexibility and BRF’s probabilistic robustness to enhance the imputation accuracy and downstream predictive performance. Our hybrid framework, BRF-MICE, uniquely combines the efficiency of MICE’s chained equations with BRF’s ability to quantify uncertainty through Bayesian tree ensembles, providing stable parameter estimates even under extreme missingness. We empirically validate this approach using synthetic datasets with controlled missingness mechanisms (MCAR, MAR, MNAR) and dimensionality, contrasting it against established methods, including RF and Bayesian Additive Regression Trees (BART). The results demonstrate that BRF-MICE achieves a superior performance in classification and regression tasks, with a 15–20% lower error under varying missingness conditions compared to RF and BART while maintaining computational scalability. The method’s iterative Bayesian updates effectively propagate imputation uncertainty, reducing overconfidence in high-dimensional predictions, a key weakness of frequentist alternatives. Full article
(This article belongs to the Section D1: Probability and Statistics)
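The simulation design (not the BRF-MICE method itself, which has no standard Python implementation) can be sketched as below: generate MCAR, MAR, and MNAR missingness and repair it with a chained, forest-based imputer as a stand-in for the proposed engine.

```python
# Sketch of the simulation design only: generate the three missingness mechanisms and repair
# them with a chained, forest-based imputer. A Bayesian Random Forest is not available in
# standard Python libraries, so IterativeImputer + RandomForestRegressor stands in for it.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))

def add_missing(X, mechanism, rate=0.2):
    X = X.copy()
    if mechanism == "MCAR":                       # missingness independent of the data
        mask = rng.random(X.shape) < rate
    elif mechanism == "MAR":                      # missingness in col 1 depends on observed col 0
        mask = np.zeros(X.shape, bool)
        mask[:, 1] = rng.random(len(X)) < rate * 2 * (X[:, 0] > 0)
    else:                                         # MNAR: missingness in col 1 depends on its own value
        mask = np.zeros(X.shape, bool)
        mask[:, 1] = rng.random(len(X)) < rate * 2 * (X[:, 1] > 0)
    X[mask] = np.nan
    return X, mask

imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                           max_iter=5, random_state=0)
for mech in ("MCAR", "MAR", "MNAR"):
    X_miss, mask = add_missing(X, mech)
    X_hat = imputer.fit_transform(X_miss)
    rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
    print(f"{mech}: imputation RMSE={rmse:.3f}")
```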

23 pages, 466 KiB  
Article
COVID-19 Data Analysis: The Impact of Missing Data Imputation on Supervised Learning Model Performance
by Jorge Daniel Mello-Román and Adrián Martínez-Amarilla
Computation 2025, 13(3), 70; https://doi.org/10.3390/computation13030070 - 8 Mar 2025
Viewed by 2709
Abstract
The global COVID-19 pandemic has generated extensive datasets, providing opportunities to apply machine learning for diagnostic purposes. This study evaluates the performance of five supervised learning models—Random Forests (RFs), Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), Logistic Regression (LR), and Decision Trees (DTs)—on a hospital-based dataset from the Concepción Department in Paraguay. To address missing data, four imputation methods (Predictive Mean Matching via MICE, RF-based imputation, K-Nearest Neighbor, and XGBoost-based imputation) were tested. Model performance was compared using metrics such as accuracy, AUC, F1-score, and MCC across five levels of missingness. Overall, RF consistently achieved high accuracy and AUC at the highest missingness level, underscoring its robustness. In contrast, SVM often exhibited a trade-off between specificity and sensitivity. ANN and DT showed moderate resilience, yet were more prone to performance shifts under certain imputation approaches. These findings highlight RF’s adaptability to different imputation strategies, as well as the importance of selecting methods that minimize sensitivity–specificity trade-offs. By comparing multiple imputation techniques and supervised models, this study provides practical insights for handling missing medical data in resource-constrained settings and underscores the value of robust ensemble methods for reliable COVID-19 diagnostics. Full article
(This article belongs to the Special Issue Artificial Intelligence Applications in Public Health: 2nd Edition)
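A compact sketch of one cell of this evaluation grid, one imputer and one classifier scored with the four reported metrics, might look like the following on synthetic data.

```python
# Compact sketch of one imputer/classifier pairing scored with accuracy, AUC, F1, and MCC,
# the metrics named in the abstract. The data are synthetic stand-ins for the hospital dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, matthews_corrcoef

X, y = make_classification(n_samples=800, n_features=25, weights=[0.7, 0.3], random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.2] = np.nan       # 20% missingness
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = make_pipeline(KNNImputer(n_neighbors=5), RandomForestClassifier(random_state=0)).fit(X_tr, y_tr)
pred, proba = pipe.predict(X_te), pipe.predict_proba(X_te)[:, 1]
print({"accuracy": accuracy_score(y_te, pred), "AUC": roc_auc_score(y_te, proba),
       "F1": f1_score(y_te, pred), "MCC": matthews_corrcoef(y_te, pred)})
```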

16 pages, 1423 KiB  
Communication
Evaluating Imputation Methods to Improve Prediction Accuracy for an HIV Study in Uganda
by Nadia B. Mendoza, Chii-Dean Lin, Susan M. Kiene, Nicolas A. Menzies, Rhoda K. Wanyenze, Katherine A. Schmarje, Rose Naigino, Michael Ediau, Seth C. Kalichman and Barbara A. Bailey
Stats 2024, 7(4), 1405-1420; https://doi.org/10.3390/stats7040082 - 24 Nov 2024
Viewed by 953
Abstract
Standard statistical analyses often exclude incomplete observations, which can be particularly problematic when predicting rare outcomes, such as HIV positivity. In the linkage to the HIV care dataset, there were initially 553 complete HIV positive cases, with an additional 554 cases added through imputation. Imputation methods amelia, hmisc, mice and missForest were evaluated. Simulations were conducted across various scenarios using the complete data to guide imputation for the full dataset. A random forest model was used to predict HIV status, assessing imputation precision, overall prediction accuracy, and sensitivity. While missForest produced imputed values closer to the observed ones, this did not translate into better predictive models. Hmisc and mice imputations led to higher prediction accuracy and sensitivity, with median accuracy increasing from 64% to 76% and median sensitivity rising from 0.4 to 0.75. Hmisc and amelia were the fastest imputation methods. Additionally, oversampling the minority class combined with undersampling the majority class did not improve predictions of new HIV positive cases using only the complete observations. However, increasing the minority class information through imputation enhanced sensitivity for predicting cases in this class. Full article
(This article belongs to the Section Computational Statistics)

12 pages, 2751 KiB  
Article
Impact of Data Pre-Processing Techniques on XGBoost Model Performance for Predicting All-Cause Readmission and Mortality Among Patients with Heart Failure
by Qisthi Alhazmi Hidayaturrohman and Eisuke Hanada
BioMedInformatics 2024, 4(4), 2201-2212; https://doi.org/10.3390/biomedinformatics4040118 - 1 Nov 2024
Cited by 3 | Viewed by 3809
Abstract
Background: Heart failure poses a significant global health challenge, with high rates of readmission and mortality. Accurate models to predict these outcomes are essential for effective patient management. This study investigates the impact of data pre-processing techniques on XGBoost model performance in predicting all-cause readmission and mortality among heart failure patients. Methods: A dataset of 168 features from 2008 heart failure patients was used. Pre-processing included handling missing values, categorical encoding, and standardization. Four imputation techniques were compared: Mean, Multivariate Imputation by Chained Equations (MICEs), k-nearest Neighbors (kNNs), and Random Forest (RF). XGBoost models were evaluated using accuracy, recall, F1-score, and Area Under the Curve (AUC). Robustness was assessed through 10-fold cross-validation. Results: The XGBoost model with kNN imputation, one-hot encoding, and standardization outperformed others, with an accuracy of 0.614, recall of 0.551, and F1-score of 0.476. The MICE-based model achieved the highest AUC (0.647) and mean AUC (0.65 ± 0.04) in cross-validation. All pre-processed models outperformed the default XGBoost model (AUC: 0.60). Conclusions: Data pre-processing, especially MICE with one-hot encoding and standardization, improves XGBoost performance in heart failure prediction. However, moderate AUC scores suggest further steps are needed to enhance predictive accuracy. Full article
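The winning pre-processing combination can be expressed as a single pipeline, sketched below with illustrative column names; the xgboost package is assumed to be installed, and the tiny DataFrame is only a placeholder.

```python
# Sketch of the kNN-imputation + one-hot + standardization pipeline in front of XGBoost;
# column names and values are illustrative, and the xgboost package is assumed to be installed.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

df = pd.DataFrame({"age": [72, np.nan, 65, 80],
                   "nt_probnp": [3500, 1200, np.nan, 8000],
                   "nyha_class": ["II", "III", np.nan, "IV"],
                   "readmitted": [0, 1, 0, 1]})
numeric, categorical = ["age", "nt_probnp"], ["nyha_class"]

pre = ColumnTransformer([
    ("num", Pipeline([("impute", KNNImputer(n_neighbors=2)), ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])
model = Pipeline([("pre", pre), ("xgb", XGBClassifier(n_estimators=100, eval_metric="logloss"))])
model.fit(df[numeric + categorical], df["readmitted"])
print(model.predict_proba(df[numeric + categorical])[:, 1])   # readmission/mortality risk
```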

19 pages, 576 KiB  
Article
Predicting Missing Values in Survey Data Using Prompt Engineering for Addressing Item Non-Response
by Junyung Ji, Jiwoo Kim and Younghoon Kim
Future Internet 2024, 16(10), 351; https://doi.org/10.3390/fi16100351 - 27 Sep 2024
Viewed by 2334
Abstract
Survey data play a crucial role in various research fields, including economics, education, and healthcare, by providing insights into human behavior and opinions. However, item non-response, where respondents fail to answer specific questions, presents a significant challenge by creating incomplete datasets that undermine data integrity and can hinder or even prevent accurate analysis. Traditional methods for addressing missing data, such as statistical imputation techniques and deep learning models, often fall short when dealing with the rich linguistic content of survey data. These approaches are also hampered by high time complexity for training and the need for extensive preprocessing or feature selection. In this paper, we introduce an approach that leverages Large Language Models (LLMs) through prompt engineering for predicting item non-responses in survey data. Our method combines the strengths of both traditional imputation techniques and deep learning methods with the advanced linguistic understanding of LLMs. By integrating respondent similarities, question relevance, and linguistic semantics, our approach enhances the accuracy and comprehensiveness of survey data analysis. The proposed method bypasses the need for complex preprocessing and additional training, making it adaptable, scalable, and capable of generating explainable predictions in natural language. We evaluated the effectiveness of our LLM-based approach through a series of experiments, demonstrating its competitive performance against established methods such as Multivariate Imputation by Chained Equations (MICE), MissForest, and deep learning models like TabTransformer. The results show that our approach not only matches but, in some cases, exceeds the performance of these methods while significantly reducing the time required for data processing. Full article
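The prompt-assembly step alone might look like the sketch below: answers from similar respondents become in-context examples for an LLM asked to predict the missing item. The respondent records, the similarity rule, and the call_llm placeholder are hypothetical, not the paper's implementation.

```python
# Sketch of prompt assembly for item non-response; the records, similarity rule, and the
# call_llm placeholder are hypothetical illustrations, not the paper's pipeline.
respondents = [
    {"age": 34, "employment": "full-time", "q12_life_satisfaction": "satisfied"},
    {"age": 36, "employment": "full-time", "q12_life_satisfaction": "very satisfied"},
    {"age": 35, "employment": "full-time", "q12_life_satisfaction": None},   # item non-response
]

def build_prompt(target, similar, question="Q12: How satisfied are you with your life overall?"):
    lines = [question, "Answers from similar respondents:"]
    lines += [f"- Respondent (age {r['age']}, {r['employment']}): {r['q12_life_satisfaction']}"
              for r in similar if r["q12_life_satisfaction"] is not None]
    lines.append(f"Predict the most likely answer for a respondent (age {target['age']}, "
                 f"{target['employment']}). Reply with one answer option and a brief justification.")
    return "\n".join(lines)

prompt = build_prompt(respondents[2], respondents[:2])
print(prompt)
# answer = call_llm(prompt)   # hypothetical chat-completion call; any LLM API could be plugged in
```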

16 pages, 977 KiB  
Article
Influence of Preprocessing Methods of Automated Milking Systems Data on Prediction of Mastitis with Machine Learning Models
by Olivier Kashongwe, Tina Kabelitz, Christian Ammon, Lukas Minogue, Markus Doherr, Pablo Silva Boloña, Thomas Amon and Barbara Amon
AgriEngineering 2024, 6(3), 3427-3442; https://doi.org/10.3390/agriengineering6030195 - 18 Sep 2024
Viewed by 1463
Abstract
Missing data and class imbalance hinder the accurate prediction of rare events such as dairy mastitis. Resampling and imputation are employed to handle these problems. These methods are often used arbitrarily, despite their profound impact on prediction due to changes caused to the data structure. We hypothesize that their use affects the performance of ML models fitted to automated milking systems (AMSs) data for mastitis prediction. We compare three imputations—simple imputer (SI), multiple imputer (MICE) and linear interpolation (LI)—and three resampling techniques: Synthetic Minority Oversampling Technique (SMOTE), Support Vector Machine SMOTE (SVMSMOTE) and SMOTE with Edited Nearest Neighbors (SMOTEEN). The classifiers were logistic regression (LR), multilayer perceptron (MLP), decision tree (DT) and random forest (RF). We evaluated them with various metrics and compared models with the kappa score. A complete case analysis fitted the RF (0.78) better than other models, for which SI performed best. The DT, RF, and MLP performed better with SVMSMOTE. The RF, DT and MLP had the overall best performance, contributed by imputation or resampling (SMOTE and SVMSMOTE). We recommend carefully selecting resampling and imputation techniques and comparing them with complete cases before deciding on the preprocessing approach used to test AMS data with ML models. Full article
(This article belongs to the Special Issue Implementation of Artificial Intelligence in Agriculture)
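One imputation-plus-resampling combination from this grid can be sketched with imbalanced-learn (whose SMOTE-with-ENN class is spelled SMOTEENN) and scored with Cohen's kappa; the data are synthetic stand-ins for the AMS sensor features.

```python
# Sketch of one imputation + resampling + classifier combination, scored with Cohen's kappa.
# Requires the imbalanced-learn package; features are synthetic stand-ins for AMS data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=1000, n_features=15, weights=[0.95, 0.05], random_state=0)
X[np.random.default_rng(0).random(X.shape) < 0.1] = np.nan      # sensor dropouts
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),       # simple imputer (SI)
    ("resample", SVMSMOTE(random_state=0)),           # applied to the training data only
    ("clf", RandomForestClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)
print("kappa:", round(cohen_kappa_score(y_te, pipe.predict(X_te)), 3))
```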

15 pages, 328 KiB  
Article
An Inductive Approach to Quantitative Methodology—Application of Novel Penalising Models in a Case Study of Target Debt Level in Swedish Listed Companies
by Åsa Grek, Fredrik Hartwig and Mark Dougherty
J. Risk Financial Manag. 2024, 17(5), 207; https://doi.org/10.3390/jrfm17050207 - 15 May 2024
Viewed by 2380
Abstract
This paper proposes a method for conducting quantitative inductive research on survey data when the variable of interest follows an ordinal distribution. A methodology based on novel and traditional penalising models is described. The main aim of this study is to pedagogically present the method utilising the new penalising methods in a new application. A case was employed to outline the methodology. The case aims to select explanatory variables correlated with the target debt level in Swedish listed companies. The survey respondents were matched with accounting information from the companies’ annual reports. However, missing data were present: to fully utilise penalising models, we employed classification and regression tree (CART)-based imputation via multiple imputation by chained equations (MICE) to address this problem. The imputed data were subjected to six penalising models: grouped multinomial lasso, ungrouped multinomial lasso, parallel element linked multinomial-ordinal (ELMO), semi-parallel ELMO, nonparallel ELMO, and cumulative generalised monotone incremental forward stagewise (GMIFS). While the older models yielded several explanatory variables for the hypothesis formation process, the new models (ELMO and GMIFS) identified only one: the quick asset ratio. Subsequent testing revealed that this variable was the only statistically significant variable that affected the target debt level. Full article
(This article belongs to the Section Mathematics and Finance)
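The CART-based chained-equations step can be approximated in scikit-learn by giving IterativeImputer a decision-tree estimator, as sketched below with placeholder accounting variables.

```python
# Sketch of CART-based chained-equations imputation using a decision-tree estimator inside
# IterativeImputer, as an approximation of the mice CART method; variables are placeholders.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({"quick_asset_ratio": rng.lognormal(0.0, 0.4, n),
                   "firm_size": rng.normal(20, 2, n),
                   "profitability": rng.normal(0.05, 0.08, n)})
df.loc[rng.random(n) < 0.25, "profitability"] = np.nan     # survey non-response pattern

cart_mice = IterativeImputer(estimator=DecisionTreeRegressor(max_depth=4, random_state=0),
                             max_iter=10, random_state=0)
df_imputed = pd.DataFrame(cart_mice.fit_transform(df), columns=df.columns)
print(df_imputed.head())
```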
17 pages, 4522 KiB  
Article
Weighted Average Ensemble-Based PV Forecasting in a Limited Environment with Missing Data of PV Power
by Dae-Sung Lee and Sung-Yong Son
Sustainability 2024, 16(10), 4069; https://doi.org/10.3390/su16104069 - 13 May 2024
Cited by 2 | Viewed by 1843
Abstract
Photovoltaic (PV) power is subject to variability, influenced by factors such as meteorological conditions. This variability introduces uncertainties in forecasting, underscoring the necessity for enhanced forecasting models to support the large-scale integration of PV systems. Moreover, the presence of missing data during the model development process significantly impairs model performance. To address this, it is essential to impute missing data from the collected datasets before advancing with model development. Recent advances in imputation methods, including Multivariate Imputation by Chained Equations (MICEs), K-Nearest Neighbors (KNNs), and Generative Adversarial Imputation Networks (GAINs), have exhibited commendable efficacy. Nonetheless, models derived solely from a single imputation method often exhibit diminished performance under varying weather conditions. Consequently, this study introduces a weighted average ensemble model that combines multiple imputation-based models. This innovative approach adjusts the weights according to “sky status” and evaluates the performance of single-imputation models using criteria such as sky status, root mean square error (RMSE), and mean absolute error (MAE), integrating them into a comprehensive weighted ensemble model. This model demonstrates improved RMSE values, ranging from 74.805 to 74.973, which corresponds to performance enhancements of 3.293–3.799% for KNN and 3.190–4.782% for MICE, thereby affirming its effectiveness in scenarios characterized by missing data. Full article
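The weighted-average ensemble idea reduces to a small amount of code: blend the forecasts of the single-imputation models with weights derived per sky condition from validation error. The weight table and forecast values below are illustrative, not the paper's.

```python
# Minimal sketch of a sky-condition-weighted ensemble of two imputation-based forecast models.
# The validation RMSE table and the forecasts are illustrative placeholders.
import numpy as np

# hypothetical validation RMSE of each single-imputation model, per sky condition
val_rmse = {"clear":  {"knn": 60.0, "mice": 75.0},
            "cloudy": {"knn": 95.0, "mice": 80.0}}

def weights_for(sky):
    inv = {m: 1.0 / r for m, r in val_rmse[sky].items()}     # lower error -> higher weight
    total = sum(inv.values())
    return {m: w / total for m, w in inv.items()}

def ensemble_forecast(preds, sky):
    w = weights_for(sky)
    return sum(w[m] * preds[m] for m in preds)

preds = {"knn": np.array([310.0, 420.0]), "mice": np.array([290.0, 450.0])}   # kW, two horizons
for sky in ("clear", "cloudy"):
    print(sky, weights_for(sky), ensemble_forecast(preds, sky))
```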
