Search Results (35)

Search Parameters:
Keywords = penalized regression techniques

49 pages, 4062 KB  
Article
Evaluation of a Non-Parametric Penalized Kaplan–Meier Estimator Under Interval-Censored Survival Data
by Kayakazi Chophela, Chioneso Show Marange and Akinwumi Sunday Odeyemi
Symmetry 2026, 18(3), 519; https://doi.org/10.3390/sym18030519 - 18 Mar 2026
Viewed by 392
Abstract
Interval-censored survival data arise frequently in biomedical and epidemiological studies where event times are observed only within observation intervals. Classical non-parametric estimators, such as the Kaplan–Meier (KM) estimator under imputation and the Turnbull estimator, often suffer from instability, irregular fluctuations, and overfitting when sample sizes are small or when the prevalence rate is low. Recent methodological developments, including smoothed and penalized approaches, have been proposed to improve stability and reduce estimation error in such settings. This study evaluates and benchmarks the finite-sample performance of a non-parametric penalized likelihood KM estimator under interval-censored data. The method is compared with the classical KM estimator using four imputation strategies: midpoint, regression, uniform, and multiple imputation. From a symmetry perspective, midpoint and uniform imputation preserve interval symmetry through deterministic and probabilistic mechanisms, respectively, whereas regression and multiple imputation intentionally introduce structural asymmetry to reflect data-driven risk heterogeneity and distributional uncertainty. To assess and benchmark the performance of the penalized KM estimator, an extensive Monte Carlo (MC) simulation study was conducted across varying sample sizes and prevalence rates using error-based metrics. The MC simulation results revealed that the non-parametric penalized KM estimator consistently outperforms the classical KM estimator in small samples across all prevalence rates. The gains are more pronounced under low prevalence rates, where the penalized KM estimator is superior for small to moderately sized samples (n = 40–100). Among the imputation techniques, regression and multiple imputation generally exhibited superior performance. A real data application further confirms these findings, demonstrating that the non-parametric penalized KM estimator yields more stable and accurate survival curves than the classical KM estimator in small samples.
(This article belongs to the Section Mathematics)
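As a concrete reference for the imputation baselines above, here is a minimal Python sketch, assuming simulated inspection intervals and the lifelines library: midpoint imputation (the symmetric, deterministic strategy) followed by a classical KM fit. The penalized estimator itself is not reproduced, and all data are illustrative.

```python
# Sketch: midpoint imputation for interval-censored data, then a classical
# Kaplan-Meier fit via lifelines. Interval bounds are simulated, not from
# the study; the penalized KM estimator is not implemented here.
import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(42)
n = 60                                    # small sample, as in the study's focus
true_t = rng.exponential(scale=5.0, size=n)

# Each event is observed only within an inspection interval (L, R]; a missed
# final inspection yields right-censoring (R = inf).
left = np.floor(true_t)
right = left + 1.0
censored = rng.random(n) < 0.2
right[censored] = np.inf

# Midpoint imputation: a symmetric, deterministic choice within each interval.
durations = np.where(np.isinf(right), left, (left + right) / 2.0)
observed = ~np.isinf(right)

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_.head())
```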

31 pages, 841 KB  
Article
Penalized Spline Estimator for Semiparametric Binary Logistic Regression Model with Application to Coronary Heart Disease Risk Factors
by Nur Chamidah, Marisa Rifada, Budi Lestari, Dursun Aydin and Naufal Ramadhan Al Akhwal Siregar
Symmetry 2026, 18(3), 432; https://doi.org/10.3390/sym18030432 - 28 Feb 2026
Viewed by 419
Abstract
In this study, we develop Semiparametric Binary Logistic Regression (SBLR), an extension of classical logistic regression that integrates both parametric and nonparametric components, allowing it to model linear and non-linear relationships simultaneously. To estimate the nonparametric component, a non-linear (sigmoid) curve, we use the penalized spline, a smoothing technique valued in the nonparametric approach for producing smooth, adaptive curves for fluctuating data. In this smoothing technique, selecting the optimal smoothing parameter plays an important role in fitting the model. This selection is commonly based on the minimum of the ordinary Cross-Validation (CV) or Generalized Cross-Validation (GCV) score. However, because the log-likelihood is non-quadratic, these criteria are not directly applicable: the CV and GCV curves can decline monotonically without ever attaining a minimum. We therefore use the Generalized Approximate Cross-Validation (GACV) criterion to address such cases, which distinguishes this work from previous studies that relied on CV or GCV. In an application to real data, we define an SBLR model of Coronary Heart Disease (CHD) risk factors that can be used for prediction and interpretation. The results demonstrate the efficacy of the proposed method in identifying critical non-linear thresholds for CHD risk factors; the model is statistically valid and highly effective for CHD risk prediction. These results could serve as the basis of an early warning system, specifically alerting individuals with moderate stress levels and dietary habits exceeding the identified thresholds to their heightened probability of developing CHD. In addition, this research aligns with Goal 3 of the Sustainable Development Goals (SDGs): reducing premature mortality from non-communicable diseases by 2030.
(This article belongs to the Section Mathematics)
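A rough sketch of the model's shape, assuming scikit-learn as a stand-in: a passthrough parametric column plus a spline-expanded nonparametric column, fit with ridge-penalized logistic regression in place of the paper's penalized spline with GACV selection. All variable names are illustrative, not the CHD study's.

```python
# Sketch of a semiparametric binary logistic fit: linear (parametric) part
# plus a spline basis (nonparametric) part, with an L2 penalty standing in
# for a penalized-spline roughness penalty. GACV is not implemented; the
# penalty strength is chosen by ordinary cross-validation instead.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
n = 500
x_lin = rng.normal(size=(n, 1))             # parametric component
x_smooth = rng.uniform(0, 10, size=(n, 1))  # nonparametric component
logit = 0.8 * x_lin[:, 0] + np.sin(x_smooth[:, 0])   # non-linear truth
y = rng.random(n) < 1 / (1 + np.exp(-logit))

pre = ColumnTransformer([
    ("linear", "passthrough", [0]),
    ("spline", SplineTransformer(n_knots=8, degree=3), [1]),
])
model = make_pipeline(pre, LogisticRegressionCV(Cs=10, penalty="l2", cv=5))
model.fit(np.hstack([x_lin, x_smooth]), y)
print("training accuracy:", model.score(np.hstack([x_lin, x_smooth]), y))
```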

18 pages, 840 KB  
Article
Utilizing Machine Learning Techniques for Computer-Aided COVID-19 Screening Based on Clinical Data
by Honglun Xu, Andrews T. Anum, Michael Pokojovy, Sreenath Chalil Madathil, Yuxin Wen, Md Fashiar Rahman, Tzu-Liang (Bill) Tseng, Scott Moen and Eric Walser
COVID 2026, 6(1), 17; https://doi.org/10.3390/covid6010017 - 9 Jan 2026
Viewed by 546
Abstract
The COVID-19 pandemic has highlighted the importance of rapid clinical decision-making to facilitate the efficient usage of healthcare resources. Over the past decade, machine learning (ML) has caused a tectonic shift in healthcare, empowering data-driven prediction and decision-making. Recent research demonstrates how ML was used to respond to the COVID-19 pandemic. This paper puts forth new computer-aided COVID-19 screening techniques using six classes of ML algorithms (including penalized logistic regression, random forest, artificial neural networks, and support vector machines) and evaluates their performance when applied to a real-world clinical dataset containing patients’ demographic information and vital indices (such as sex, ethnicity, age, pulse, pulse oximetry, respirations, temperature, BP systolic, BP diastolic, and BMI), as well as ICD-10 codes of existing comorbidities, as attributes to predict a given patient’s risk of having COVID-19. Variable importance metrics computed using a random forest model were used to reduce the set of predictors to the thirteen most important. Using prediction accuracy, sensitivity, specificity, and AUC as performance metrics, the performance of the various ML methods was assessed, and the best model was selected. Our proposed model can be used in clinical settings as a rapid and accessible COVID-19 screening technique.
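The screening pipeline's overall shape can be sketched as follows, with synthetic data standing in for the (non-public) clinical dataset: random-forest importances select the top thirteen predictors, and candidate classifiers are compared by cross-validated AUC.

```python
# Sketch of the pipeline's shape: RF variable importance -> keep top 13
# predictors -> compare classifiers by cross-validated AUC. Synthetic data
# replaces the clinical dataset; hyperparameters are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=40, n_informative=15,
                           random_state=1)

rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)
top13 = np.argsort(rf.feature_importances_)[::-1][:13]
X_sel = X[:, top13]

models = {
    "penalized logistic": LogisticRegression(penalty="l2", max_iter=2000),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=1),
    "SVM": SVC(probability=True),
}
for name, m in models.items():
    auc = cross_val_score(m, X_sel, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```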

31 pages, 511 KB  
Article
Shrinkage Approaches for Ridge-Type Estimators Under Multicollinearity
by Marwan Al-Momani, Bahadır Yüzbaşı, Mohammad Saleh Bataineh, Rihab Abdallah and Athifa Moideenkutty
Mathematics 2025, 13(22), 3733; https://doi.org/10.3390/math13223733 - 20 Nov 2025
Cited by 1 | Viewed by 657
Abstract
Multicollinearity is a common issue in regression analyses that occurs when some predictor variables are highly correlated, leading to unstable least squares estimates of model parameters. Various estimation strategies have been proposed to address this problem. In this study, we enhanced a ridge-type estimator by incorporating pretest and shrinkage techniques. We conducted an analytical comparison to evaluate the performance of the proposed estimators in terms of their bias, quadratic risk, and numerical performance using both simulated and real data. Additionally, we assessed several penalization methods and three machine learning algorithms to facilitate a comprehensive comparison. Our results demonstrate that the proposed estimators outperformed the standard ridge-type estimator with respect to the mean squared error of the simulated data and the mean squared prediction error of two real data applications.
(This article belongs to the Special Issue Advances in Statistical Methods with Applications)
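The pretest/shrinkage idea can be illustrated with a toy numpy sketch; the submodel, ridge penalty, and Stein weight below are illustrative assumptions, not the paper's estimator or its risk analysis.

```python
# Rough sketch of shrinkage on top of ridge regression: pull the full-model
# ridge estimate toward a restricted (submodel) estimate via a positive-part
# Stein weight driven by a Wald-type statistic. All choices are illustrative.
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 8
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=n)      # induce multicollinearity
beta_true = np.array([2.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)

lam = 1.0
def ridge(Xm, ym):
    return np.linalg.solve(Xm.T @ Xm + lam * np.eye(Xm.shape[1]), Xm.T @ ym)

active = [0, 2]                     # candidate submodel (assumed, not tested)
b_full = ridge(X, y)
b_restr = np.zeros(p)
b_restr[active] = ridge(X[:, active], y)

diff = b_full - b_restr             # gap between full and restricted fits
T = n * float(diff @ diff) / np.var(y - X @ b_restr)   # Wald-type statistic
k = p - len(active)                 # number of restricted coefficients
w = max(0.0, 1.0 - (k - 2) / T)     # positive-part Stein weight
b_shrink = b_restr + w * diff
print("shrinkage weight:", round(w, 3))
print("positive-shrinkage estimate:", np.round(b_shrink, 3))
```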

32 pages, 1288 KB  
Article
Random Forest Adaptation for High-Dimensional Count Regression
by Oyebayo Ridwan Olaniran, Saidat Fehintola Olaniran, Ali Rashash R. Alzahrani, Nada MohammedSaeed Alharbi and Asma Ahmad Alzahrani
Mathematics 2025, 13(18), 3041; https://doi.org/10.3390/math13183041 - 21 Sep 2025
Cited by 2 | Viewed by 1760
Abstract
The analysis of high-dimensional count data presents a unique set of challenges, including overdispersion, zero-inflation, and complex nonlinear relationships that traditional generalized linear models and standard machine learning approaches often fail to adequately address. This study introduces and validates a novel Random Forest framework specifically developed for high-dimensional Poisson and Negative Binomial regression, designed to overcome the limitations of existing methods. Through comprehensive simulations and a real-world genomic application to the Norwegian Mother and Child Cohort Study, we demonstrate that the proposed methods achieve superior predictive accuracy, quantified by lower root mean squared error and deviance, and, critically, produce exceptionally stable and interpretable feature selections. Our theoretical and empirical results show that these distribution-optimized ensembles significantly outperform both penalized-likelihood techniques and naive-transformation-based ensembles in balancing statistical robustness with biological interpretability. The study concludes that the proposed frameworks provide a crucial methodological advancement, offering a powerful and reliable tool for extracting meaningful insights from complex count data in fields ranging from genomics to public health.
(This article belongs to the Special Issue Statistics for High-Dimensional Data)
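A hedged stand-in for the distribution-optimized ensembles: scikit-learn's random forest with a Poisson split criterion versus a naive log-transform forest, compared by Poisson deviance on simulated counts. The authors' own Poisson/Negative Binomial framework is not implemented here.

```python
# Sketch: a count-aware random forest via scikit-learn's "poisson" split
# criterion, versus a naive log1p-transform forest, scored by Poisson
# deviance. Data and settings are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_poisson_deviance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 800, 50
X = rng.normal(size=(n, p))
mu = np.exp(0.6 * X[:, 0] - 0.4 * X[:, 1])
y = rng.poisson(mu)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3)

rf_pois = RandomForestRegressor(criterion="poisson", random_state=3).fit(X_tr, y_tr)
rf_naive = RandomForestRegressor(random_state=3).fit(X_tr, np.log1p(y_tr))

# Deviance needs strictly positive predictions, hence the clipping.
print("Poisson-criterion deviance:",
      mean_poisson_deviance(y_te, np.clip(rf_pois.predict(X_te), 1e-6, None)))
print("log-transform deviance:",
      mean_poisson_deviance(y_te, np.clip(np.expm1(rf_naive.predict(X_te)),
                                          1e-6, None)))
```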

14 pages, 845 KB  
Article
Assessment of Ultrasound-Controlled Diagnostic Methods for Thyroid Lesions and Their Associated Costs in a Tertiary University Hospital in Spain
by Lelia Ruiz-Hernández, Carmen Rosa Hernández-Socorro, Pedro Saavedra, María de la Vega-Pérez and Sergio Ruiz-Santana
J. Clin. Med. 2025, 14(15), 5551; https://doi.org/10.3390/jcm14155551 - 6 Aug 2025
Cited by 1 | Viewed by 2186
Abstract
Background/Objectives: Accurate diagnosis of thyroid cancer is critical but challenging due to overlapping ultrasound (US) features of benign and malignant nodules. This study aimed to evaluate the diagnostic performance of non-invasive and minimally invasive US techniques, including B-mode US, shear wave elastography (SWE), color Doppler, superb microvascular imaging (SMI), and TI-RADS, in patients with suspected thyroid lesions and to assess their reliability and cost-effectiveness compared with fine needle aspiration (FNA) biopsy. Methods: A prospective, single-center study (October 2023–February 2025) enrolled 300 patients with suspected thyroid cancer at a Spanish tertiary hospital. Of these, 296 patients with confirmed diagnoses underwent B-mode US, SWE, Doppler, SMI, and TI-RADS scoring, followed by US-guided FNA and Bethesda System cytopathology. Lasso-penalized logistic regression and a bootstrap analysis (1000 replicates) were used to develop diagnostic models. A utility function was used to balance diagnostic reliability and cost. Results: Thyroid cancer was diagnosed in 25 patients (8.3%). Elastography combined with SMI achieved the highest diagnostic performance (Youden index: 0.69; NPV: 97.4%; PPV: 69.1%), outperforming Doppler-only models. Intranodular vascularization was a significant risk factor, while peripheral vascularization was protective. The utility function showed that, when prioritizing cost, elastography plus SMI was cost-effective (α < 0.716) compared with FNA. Conclusions: Elastography plus SMI offers a reliable, cost-effective diagnostic rule for thyroid cancer. The utility function aids clinicians in balancing reliability and cost. SMI and the rule’s generalizability need to be validated in higher-prevalence settings.
(This article belongs to the Section Endocrinology & Metabolism)
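The modeling step can be sketched as follows, with synthetic features standing in for the ultrasound variables (SWE, SMI, etc.): a Lasso-penalized logistic fit whose Youden index is bootstrapped over 1000 replicates. Note this sketch bootstraps only the evaluation, not the model development.

```python
# Sketch: Lasso-penalized logistic regression plus a bootstrap estimate of
# the Youden index (sensitivity + specificity - 1). Data are synthetic and
# class-imbalanced to mimic a low-prevalence setting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=300, n_features=10, weights=[0.9, 0.1],
                           random_state=5)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

rng = np.random.default_rng(5)
youden = []
for _ in range(1000):                       # 1000 bootstrap replicates
    idx = rng.integers(0, len(y), len(y))
    pred = clf.predict(X[idx])              # default 0.5 threshold, for brevity
    tn, fp, fn, tp = confusion_matrix(y[idx], pred, labels=[0, 1]).ravel()
    youden.append(tp / (tp + fn) + tn / (tn + fp) - 1)
print("bootstrap Youden index: %.2f (95%% CI %.2f-%.2f)"
      % (np.mean(youden), np.percentile(youden, 2.5),
         np.percentile(youden, 97.5)))
```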

20 pages, 774 KB  
Article
Robust Variable Selection via Bayesian LASSO-Composite Quantile Regression with Empirical Likelihood: A Hybrid Sampling Approach
by Ruisi Nan, Jingwei Wang, Hanfang Li and Youxi Luo
Mathematics 2025, 13(14), 2287; https://doi.org/10.3390/math13142287 - 16 Jul 2025
Viewed by 1047
Abstract
Since the advent of composite quantile regression (CQR), its inherent robustness has established it as a pivotal methodology for high-dimensional data analysis. High-dimensional outlier contamination refers to data scenarios where the number of observed dimensions (p) is large relative to the sample size (n) (e.g., p/n > 0.1) and extreme outliers occur in the response variables or covariates. Traditional penalized regression techniques, however, exhibit notable vulnerability to data outliers during high-dimensional variable selection, often leading to biased parameter estimates and compromised resilience. To address this critical limitation, we propose a novel empirical likelihood (EL)-based variable selection framework that integrates a Bayesian LASSO penalty within the composite quantile regression framework. By constructing a hybrid sampling mechanism that incorporates the Expectation–Maximization (EM) algorithm and Metropolis–Hastings (M-H) algorithm within the Gibbs sampling scheme, this approach effectively tackles variable selection in high-dimensional settings with outlier contamination. This design enables simultaneous optimization of regression coefficients and penalty parameters, circumventing the need for ad hoc selection of optimal penalty parameters, a long-standing challenge in conventional LASSO estimation. Moreover, the proposed method imposes no restrictive assumptions on the distribution of random errors in the model. Through Monte Carlo simulations under outlier interference and empirical analysis of two U.S. house price datasets, we demonstrate that the new approach significantly enhances variable selection accuracy, reduces estimation bias for key regression coefficients, and exhibits robust resistance to data outlier contamination.
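For orientation, the frequentist objective underlying the paper's Bayesian machinery can be written down and minimized directly on a toy problem; the sampler itself (EM plus M-H within Gibbs) is beyond a short sketch, and the quantile grid and penalty level below are illustrative.

```python
# Sketch of an L1-penalized composite quantile regression (CQR) objective:
# one intercept per quantile level, common slopes, check-loss summed across
# levels, minimized directly for a small problem (derivative-free Powell).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(11)
n, p = 120, 4
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + rng.standard_t(df=2, size=n)  # heavy tails

taus = np.arange(1, 10) / 10.0        # K = 9 composite quantile levels
lam = 1.0                             # illustrative penalty level

def cqr_objective(theta):
    b = theta[:len(taus)]             # per-level intercepts
    beta = theta[len(taus):]          # slopes shared across levels
    u = y[None, :] - b[:, None] - (X @ beta)[None, :]
    check = np.where(u >= 0, taus[:, None] * u, (taus[:, None] - 1) * u)
    return check.sum() + lam * np.abs(beta).sum()

res = minimize(cqr_objective, np.zeros(len(taus) + p), method="Powell")
print("estimated slopes:", np.round(res.x[len(taus):], 2))
```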

26 pages, 6617 KB  
Article
Penalty Strategies in Semiparametric Regression Models
by Ayuba Jack Alhassan, S. Ejaz Ahmed, Dursun Aydin and Ersin Yilmaz
Math. Comput. Appl. 2025, 30(3), 54; https://doi.org/10.3390/mca30030054 - 12 May 2025
Viewed by 2572
Abstract
This study provides a comprehensive evaluation of six penalty estimation strategies for partially linear regression models (PLRMs), focusing on their performance in the presence of multicollinearity and their ability to handle both parametric and nonparametric components. The methods under consideration include Ridge regression, Lasso, Adaptive Lasso (aLasso), smoothly clipped absolute deviation (SCAD), ElasticNet, and minimax concave penalty (MCP). In addition to these established methods, we also incorporate Stein-type shrinkage estimation techniques, namely the standard and positive shrinkage estimators, and assess their effectiveness in this context. To estimate the PLRMs, we consider a kernel smoothing technique grounded in penalized least squares. Our investigation involves a theoretical analysis of the estimators’ asymptotic properties and a detailed simulation study designed to compare their performance under a variety of conditions, including different sample sizes, numbers of predictors, and levels of multicollinearity. The simulation results reveal that aLasso and the shrinkage estimators, particularly the positive shrinkage estimator, consistently outperform the other methods in terms of Mean Squared Error (MSE) relative efficiency (RE), especially when the sample size is small and multicollinearity is high. Furthermore, we present a real data analysis using the Hitters dataset to demonstrate the applicability of these methods in a practical setting. The results of the real data analysis align with the simulation findings, highlighting the superior predictive accuracy of aLasso and the shrinkage estimators in the presence of multicollinearity. The findings of this study offer valuable insights into the strengths and limitations of these penalty and shrinkage strategies, guiding their application in future research and practice involving semiparametric regression.
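One common way to fit such a model, sketched below under simplifying assumptions, is Robinson/Speckman-style residualization: kernel-smooth y and X on the nonparametric covariate, then apply a penalty method to the residualized parametric part (Lasso here; ridge, SCAD, or MCP would slot in the same way). This is a stand-in for the paper's penalized-least-squares kernel machinery.

```python
# Sketch of a partially linear fit: Nadaraya-Watson smoothing removes the
# nonparametric trend in t from y and X, then Lasso estimates the parametric
# coefficients on the residuals. Bandwidth and alpha are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 300, 6
t = np.sort(rng.uniform(0, 1, n))
X = rng.normal(size=(n, p))
y = (X @ np.array([1.0, 0.0, -1.5, 0.0, 0.0, 0.5])
     + np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=n))

def nw_smooth(t, v, h=0.05):
    # Nadaraya-Watson smoother of v against t with a Gaussian kernel.
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
    denom = w.sum(axis=1)
    return (w @ v) / denom if v.ndim == 1 else (w @ v) / denom[:, None]

y_res = y - nw_smooth(t, y)           # residualize the response
X_res = X - nw_smooth(t, X)           # residualize each predictor

beta_hat = Lasso(alpha=0.05).fit(X_res, y_res).coef_
print("parametric estimates:", np.round(beta_hat, 2))
```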

33 pages, 7879 KB  
Article
Performance Evaluation of Machine Learning Models for Predicting Energy Consumption and Occupant Dissatisfaction in Buildings
by Haidar Hosamo and Silvia Mazzetto
Buildings 2025, 15(1), 39; https://doi.org/10.3390/buildings15010039 - 26 Dec 2024
Cited by 30 | Viewed by 5448
Abstract
This study evaluates the performance of 15 machine learning models for predicting energy consumption (30–100 kWh/m²·year) and occupant dissatisfaction (Percentage of Dissatisfied, PPD: 6–90%), key metrics for optimizing building performance. Ten evaluation metrics, including Mean Absolute Error (MAE, average prediction error), Root Mean Squared Error (RMSE, penalizing large errors), and the coefficient of determination (R², variance explained by the model), are used. XGBoost achieves the highest accuracy, with an energy MAE of 1.55 kWh/m²·year and a PPD MAE of 3.14%, alongside R² values of 0.99 and 0.97, respectively. While these metrics highlight XGBoost’s superiority, its margin of improvement over LightGBM (energy MAE: 2.35 kWh/m²·year, PPD MAE: 3.89%) is context-dependent, suggesting its application in high-precision scenarios. ANN excelled at PPD predictions, achieving the lowest MAE (1.55%) and Mean Absolute Percentage Error (MAPE: 4.97%), demonstrating its ability to model complex nonlinear relationships. This nonlinear modeling advantage contrasts with LightGBM’s balance of speed and accuracy, making it suitable for computationally constrained tasks. In contrast, traditional models like linear regression and KNN exhibit high errors (e.g., energy MAE: 17.56 kWh/m²·year, PPD MAE: 17.89%), underscoring their limitations with respect to capturing the complexities of building performance datasets. The results indicate that advanced methods like XGBoost and ANN are particularly effective owing to their ability to model intricate relationships and manage high-dimensional data. Future research should validate these findings with diverse real-world datasets, including those representing varying building types and climates. Hybrid models combining the interpretability of linear methods with the precision of ensemble or neural models should be explored. Additionally, integrating these machine learning techniques with digital twin platforms could address real-time optimization challenges, including dynamic occupant behavior and time-dependent energy consumption.
(This article belongs to the Section Building Energy, Physics, Environment, and Systems)
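The comparison protocol reduces to fitting each model on a common split and reporting MAE, RMSE, and R²; below is a minimal sketch with xgboost and a linear baseline on synthetic data standing in for the building-simulation dataset.

```python
# Sketch of the evaluation protocol: train/test split, then MAE, RMSE, and
# R^2 for an XGBoost model versus a linear baseline. Data are synthetic.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

for name, model in [("XGBoost", XGBRegressor(n_estimators=400, random_state=4)),
                    ("linear regression", LinearRegression())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: MAE={mean_absolute_error(y_te, pred):.2f} "
          f"RMSE={mean_squared_error(y_te, pred) ** 0.5:.2f} "
          f"R2={r2_score(y_te, pred):.3f}")
```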

16 pages, 685 KB  
Article
Predicting Clinical Outcomes in COVID-19 and Pneumonia Patients: A Machine Learning Approach
by Kaida Cai, Zhengyan Wang, Xiaofang Yang, Wenzhi Fu and Xin Zhao
Viruses 2024, 16(10), 1624; https://doi.org/10.3390/v16101624 - 17 Oct 2024
Cited by 1 | Viewed by 2088
Abstract
In the clinical diagnosis of pneumonia, particularly during the COVID-19 pandemic, individuals who progress to a critical stage requiring mechanical ventilation are classified as mechanically ventilated critically ill patients. Accurately predicting the discharge outcomes for this specific cohort, especially those with COVID-19, is of paramount clinical importance. Missing data, a common issue in medical research, can significantly impact the validity of analyses. In this work, we address this challenge by employing two missing data imputation techniques: multiple imputation and missForest, to enhance data completeness. Additionally, we utilize the smoothly clipped absolute deviation (SCAD) penalized logistic regression method to select significant features. Our real data analysis compares the predictive performances of extreme learning machines, random forests, support vector machines, and XGBoost using 10-fold cross-validation. The results consistently show that XGBoost outperforms the other methods in predicting discharge outcomes, making it a reliable tool for clinical decision-making in the treatment of severe pneumonia, including COVID-19 cases. Within this context, the random forest imputation method generally enhances performance, underscoring its effectiveness in managing missing data compared to multiple imputation.
(This article belongs to the Section Coronaviruses)
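A sketch of the preprocessing chain, under the assumption that a random-forest-based IterativeImputer approximates missForest and with an L1 penalty standing in for SCAD (which scikit-learn does not provide); XGBoost would then be fit on the selected features.

```python
# Sketch: missForest-style imputation via IterativeImputer with a random
# forest estimator, then penalized logistic feature selection (L1 as a
# stand-in for SCAD). Missingness is injected at random for illustration.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=15, random_state=8)
mask = np.random.default_rng(8).random(X.shape) < 0.1   # 10% missing at random
X_miss = np.where(mask, np.nan, X)

imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=50),
                           max_iter=5, random_state=8)
X_imp = imputer.fit_transform(X_miss)

sel = LogisticRegression(penalty="l1", solver="liblinear", C=0.3).fit(X_imp, y)
print("features kept:", np.flatnonzero(sel.coef_[0]))
```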

22 pages, 3573 KB  
Article
The Estimating Parameter and Number of Knots for Nonparametric Regression Methods in Modelling Time Series Data
by Autcha Araveeporn
Modelling 2024, 5(4), 1413-1434; https://doi.org/10.3390/modelling5040073 - 5 Oct 2024
Cited by 6 | Viewed by 2603
Abstract
This research aims to explore and compare several nonparametric regression techniques, including smoothing splines, natural cubic splines, B-splines, and penalized spline methods. The focus is on estimating parameters and determining the optimal number of knots to forecast cyclic and nonlinear patterns, applying these methods to simulated and real-world datasets, such as Thailand’s coal import data. Cross-validation techniques are used to control and specify the number of knots, ensuring the curve fits the data points accurately. The study applies nonparametric regression to forecast time series data with cyclic patterns and nonlinear forms in the dependent variable, treating the independent variable as sequential data. Simulated data featuring cyclical patterns resembling economic cycles and nonlinear data with complex equations to capture variable interactions are used for experimentation. These simulations include variations in standard deviations and sample sizes. The evaluation criterion for the simulated data is the minimum average mean square error (MSE), which indicates the most efficient parameter estimation. For the real data, monthly coal import data from Thailand is used to estimate the parameters of the nonparametric regression model, with the MSE as the evaluation metric. The performance of these techniques is also assessed in forecasting future values, where the mean absolute percentage error (MAPE) is calculated. Among the methods, the natural cubic spline consistently yields the lowest average mean square error across all standard deviations and sample sizes in the simulated data. While the natural cubic spline excels in parameter estimation, B-splines show strong performance in forecasting future values.
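Knot selection by cross-validation can be sketched with a B-spline basis and a grid of knot counts, with synthetic cyclic data standing in for the coal-import series; the grid and bandwidth of knot counts are illustrative.

```python
# Sketch: choose the number of B-spline knots by 5-fold CV mean squared
# error on a cyclic signal. Shuffled folds are used for simplicity, which
# ignores the temporal ordering a real forecasting study would respect.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(9)
t = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * t[:, 0] / 3) + 0.2 * rng.normal(size=200)  # cyclic signal

cv = KFold(n_splits=5, shuffle=True, random_state=9)
best = None
for n_knots in range(4, 21, 2):
    model = make_pipeline(SplineTransformer(n_knots=n_knots, degree=3),
                          LinearRegression())
    mse = -cross_val_score(model, t, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    if best is None or mse < best[1]:
        best = (n_knots, mse)
print("selected number of knots: %d (CV MSE %.4f)" % best)
```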

16 pages, 3921 KB  
Article
Predicting Antidiabetic Peptide Activity: A Machine Learning Perspective on Type 1 and Type 2 Diabetes
by Kaida Cai, Zhe Zhang, Wenzhou Zhu, Xiangwei Liu, Tingqing Yu and Wang Liao
Int. J. Mol. Sci. 2024, 25(18), 10020; https://doi.org/10.3390/ijms251810020 - 18 Sep 2024
Cited by 3 | Viewed by 2750
Abstract
Diabetes mellitus (DM) presents a critical global health challenge, characterized by persistent hyperglycemia and associated with substantial economic and health-related burdens. This study employs advanced machine-learning techniques to improve the prediction and classification of antidiabetic peptides, with a particular focus on differentiating those effective against T1DM from those targeting T2DM. We integrate feature selection with analysis methods, including logistic regression, support vector machines (SVM), and adaptive boosting (AdaBoost), to classify antidiabetic peptides based on key features. Feature selection through the Lasso-penalized method identifies critical peptide characteristics that significantly influence antidiabetic activity, thereby establishing a robust foundation for future peptide design. A comprehensive evaluation of logistic regression, SVM, and AdaBoost shows that AdaBoost consistently outperforms the other methods, making it the most effective approach for classifying antidiabetic peptides. This research underscores the potential of machine learning in the systematic evaluation of bioactive peptides, contributing to the advancement of peptide-based therapies for diabetes management.
(This article belongs to the Special Issue Machine Learning in Disease Diagnosis and Treatment)
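The two-stage design maps naturally onto a scikit-learn pipeline; the features and hyperparameters below are illustrative stand-ins for the peptide descriptors.

```python
# Sketch of the two-stage design: Lasso-penalized feature selection feeding
# an AdaBoost classifier, evaluated by cross-validation. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=60, n_informative=10,
                           random_state=6)
pipe = make_pipeline(
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.2)),
    AdaBoostClassifier(n_estimators=200, random_state=6),
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean().round(3))
```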

23 pages, 426 KB  
Article
A Penalized Empirical Likelihood Approach for Estimating Population Sizes under the Negative Binomial Regression Model
by Yulu Ji and Yang Liu
Mathematics 2024, 12(17), 2674; https://doi.org/10.3390/math12172674 - 28 Aug 2024
Viewed by 1979
Abstract
In capture–recapture experiments, the presence of overdispersion and heterogeneity necessitates the use of the negative binomial regression model for inferring population sizes. However, within this model, existing methods based on likelihood and ratio regression for estimating the dispersion parameter often face boundary and nonidentifiability issues. These problems can result in nonsensically large point estimates and unbounded upper limits of confidence intervals for the population size. We present a penalized empirical likelihood technique for solving these two problems by imposing a half-normal prior on the population size. Based on the proposed approach, a maximum penalized empirical likelihood estimator with asymptotic normality and a penalized empirical likelihood ratio statistic with asymptotic chi-square distribution are derived. To improve numerical performance, we present an effective expectation-maximization (EM) algorithm. In the M-step, optimization for the model parameters could be achieved by fitting a standard negative binomial regression model via the R basic function glm.nb(). This approach ensures the convergence and reliability of the numerical algorithm. Using simulations, we analyze several synthetic datasets to illustrate three advantages of our methods in finite-sample cases: complete mitigation of the boundary problem, more efficient maximum penalized empirical likelihood estimates, and more precise penalized empirical likelihood ratio interval estimates compared to the estimates obtained without penalty. These advantages are further demonstrated in a case study estimating the abundance of black bears (Ursus americanus) at the U.S. Army’s Fort Drum Military Installation in northern New York.
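The penalty's role can be seen in a toy that is deliberately simpler than the paper's negative binomial regression model: a zero-truncated Poisson capture model whose log-likelihood in N is penalized by a half-normal log-prior and maximized over a grid. The prior scale is an arbitrary illustrative choice.

```python
# Toy: penalized likelihood for a population size N under Poisson captures.
# Only individuals captured at least once are observed; the half-normal
# log-prior -N^2 / (2 sigma^2) damps runaway estimates of N.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(13)
N_true, lam_true = 400, 0.8
counts = rng.poisson(lam_true, N_true)
y = counts[counts > 0]                     # only captured individuals are seen
n = len(y)

def loglik(N, lam):
    # Binomial coefficient (up to a constant in N), uncaptured mass, and
    # Poisson terms for the captured counts.
    return (gammaln(N + 1) - gammaln(N - n + 1)
            - (N - n) * lam
            + np.sum(y * np.log(lam) - lam - gammaln(y + 1)))

sigma = 1000.0                             # half-normal prior scale (illustrative)
candidates = ((loglik(N, lam) - N ** 2 / (2 * sigma ** 2), N)
              for N in range(n, 3001, 5)
              for lam in np.linspace(0.3, 1.5, 25))
best = max(candidates)
print("n observed:", n, "| penalized estimate of N:", best[1])
```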

12 pages, 415 KB  
Article
Machine Learning-Based Risk Prediction of Discharge Status for Sepsis
by Kaida Cai, Yuqing Lou, Zhengyan Wang, Xiaofang Yang and Xin Zhao
Entropy 2024, 26(8), 625; https://doi.org/10.3390/e26080625 - 25 Jul 2024
Cited by 1 | Viewed by 2118
Abstract
As a severe inflammatory response syndrome, sepsis presents complex challenges in predicting patient outcomes due to its unclear pathogenesis and the unstable discharge status of affected individuals. In this study, we develop a machine learning-based method for predicting the discharge status of sepsis patients, aiming to improve treatment decisions. To enhance the robustness of our analysis against outliers, we incorporate robust statistical methods, specifically the minimum covariance determinant technique. We utilize the random forest imputation method to effectively manage and impute missing data. For feature selection, we employ Lasso penalized logistic regression, which efficiently identifies significant predictors and reduces model complexity, setting the stage for the application of more complex predictive methods. Our predictive analysis incorporates multiple machine learning methods, including random forest, support vector machine, and XGBoost. We compare the prediction performance of these methods with Lasso penalized logistic regression to identify the most effective approach. Each method’s performance is rigorously evaluated through ten iterations of 10-fold cross-validation to ensure robust and reliable results. Our comparative analysis reveals that XGBoost surpasses the other models, demonstrating its exceptional capability to navigate the complexities of sepsis data effectively.
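The three ingredients combine as sketched below, with synthetic data replacing the sepsis cohort: the minimum covariance determinant flags outlying rows, Lasso-penalized logistic regression selects features, and XGBoost is scored by 10-fold AUC. The 5% outlier cutoff is an illustrative assumption.

```python
# Sketch: MCD-based outlier flagging, Lasso feature selection, then XGBoost
# evaluated by 10-fold cross-validated AUC. All thresholds are illustrative.
import numpy as np
from sklearn.covariance import MinCovDet
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=10)

# Drop the 5% most outlying rows by robust (MCD) Mahalanobis distance.
mcd = MinCovDet(random_state=10).fit(X)
dist = mcd.mahalanobis(X)
keep = dist < np.quantile(dist, 0.95)
X_c, y_c = X[keep], y[keep]

sel = LogisticRegression(penalty="l1", solver="liblinear", C=0.3).fit(X_c, y_c)
cols = np.flatnonzero(sel.coef_[0])

auc = cross_val_score(XGBClassifier(eval_metric="logloss"), X_c[:, cols], y_c,
                      cv=10, scoring="roc_auc").mean()
print("selected features:", cols, "| 10-fold AUC: %.3f" % auc)
```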

20 pages, 2394 KB  
Article
Enhanced Model Predictions through Principal Components and Average Least Squares-Centered Penalized Regression
by Adewale F. Lukman, Emmanuel T. Adewuyi, Ohud A. Alqasem, Mohammad Arashi and Kayode Ayinde
Symmetry 2024, 16(4), 469; https://doi.org/10.3390/sym16040469 - 12 Apr 2024
Cited by 3 | Viewed by 1780
Abstract
We address the estimation of regression parameters for the ill-conditioned predictive linear model in this study. Traditional least squares methods often encounter challenges in yielding reliable results when there is multicollinearity. Therefore, we employ a better shrinkage method, average least squares-centered penalized regression (ALPR), as it offers a more efficient approach for handling multicollinearity than ridge regression. Additionally, we integrate ALPR with the principal component (PC) dimension reduction method for enhanced performance. We compared the proposed PCALPR estimation technique with existing ones for ill-conditioned problems through comprehensive simulations and real-life data analyses using the mean squared error. This integration results in superior model performance compared to other methods, highlighting the potential of combining dimensionality reduction techniques with penalized regression for enhanced model predictions.
(This article belongs to the Section Mathematics)
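Since ALPR's exact form is not given in this listing, the sketch below keeps only the architecture, a principal-component reduction followed by a penalized fit, with ordinary ridge regression standing in for ALPR.

```python
# Sketch of the PC-plus-penalty architecture: PCA reduces an ill-conditioned
# design, then a penalized regression (ridge, as a stand-in for ALPR) is fit
# on the component scores and scored by test MSE.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(12)
n, p = 200, 15
Z = rng.normal(size=(n, 3))
X = Z @ rng.normal(size=(3, p)) + 0.05 * rng.normal(size=(n, p))  # ill-conditioned
y = X @ rng.normal(size=p) + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=12)
model = make_pipeline(PCA(n_components=0.99),   # keep 99% of variance
                      Ridge(alpha=1.0)).fit(X_tr, y_tr)
print("test MSE:", round(mean_squared_error(y_te, model.predict(X_te)), 3))
```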
