Stats, Volume 8, Issue 3 (September 2025) – 26 articles

  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the table of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF form. To view a paper in PDF format, click on the "PDF Full-text" link and open it with the free Adobe Reader.
16 pages, 1006 KiB  
Article
A Bayesian Non-Linear Mixed-Effects Model for Accurate Detection of the Onset of Cognitive Decline in Longitudinal Aging Studies
by Franklin Fernando Massa, Marco Scavino and Graciela Muniz-Terrera
Stats 2025, 8(3), 74; https://doi.org/10.3390/stats8030074 - 18 Aug 2025
Abstract
Change-point models are frequently considered when modeling phenomena where a regime shift occurs at an unknown time. In aging research, these models are commonly adopted to estimate the onset of cognitive decline. Yet these models present several limitations. Here, we present a Bayesian non-linear mixed-effects model based on a differential equation designed for longitudinal studies to overcome some limitations of classical change-point models used in aging research. We demonstrate the ability of the proposed model to avoid biases in estimates of the onset of cognitive impairment in a simulation study. Finally, the methodology presented in this work is illustrated by analyzing results from memory tests from older adults who participated in the English Longitudinal Study of Aging. Full article
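
To see why a smooth transition can avoid the bias of an abrupt change point, here is a minimal sketch contrasting the classical broken-stick mean trajectory with a softplus-smoothed analogue of the kind a differential-equation formulation can produce; all parameter values are hypothetical.

```python
import numpy as np

def broken_stick(t, b0=28.0, b1=-0.05, b2=-1.2, tau=70.0):
    """Classical change-point mean: decline accelerates abruptly at age tau."""
    return b0 + b1 * t + b2 * np.maximum(t - tau, 0.0)

def smooth_transition(t, b0=28.0, b1=-0.05, b2=-1.2, tau=70.0, gamma=2.0):
    """Smooth analogue: the slope change phases in over roughly gamma years."""
    return b0 + b1 * t + b2 * gamma * np.log1p(np.exp((t - tau) / gamma))

ages = np.linspace(60, 85, 6)
print(np.round(broken_stick(ages), 2))
print(np.round(smooth_transition(ages), 2))
```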

13 pages, 970 KiB  
Article
A Mixture Integer GARCH Model with Application to Modeling and Forecasting COVID-19 Counts
by Wooi Chen Khoo, Seng Huat Ong, Victor Jian Ming Low and Hari M. Srivastava
Stats 2025, 8(3), 73; https://doi.org/10.3390/stats8030073 - 13 Aug 2025
Abstract
This article introduces a flexible time series regression model known as the Mixture of Integer-Valued Generalized Autoregressive Conditional Heteroscedasticity (MINGARCH). Mixture models provide versatile frameworks for capturing heterogeneity in count data, including features such as multiple peaks, seasonality, and intervention effects. The proposed model is applied to regional COVID-19 data from Malaysia. To account for geographical variability, five regions—Selangor, Kuala Lumpur, Penang, Johor, and Sarawak—were selected for analysis, covering a total of 86 weeks of data. Comparative analysis with existing time series regression models demonstrates that MINGARCH outperforms alternative approaches. Further investigation into forecasting reveals that MINGARCH yields superior performance in regions with high population density, and significant influencing factors have been identified. In low-density regions, confirmed cases peaked within three weeks, whereas high-density regions exhibited a monthly seasonal pattern. Forecasting metrics—including MAPE, MAE, and RMSE—are significantly lower for the MINGARCH model compared to other models. These results suggest that MINGARCH is well-suited for forecasting disease spread in urban and densely populated areas, offering valuable insights for policymaking. Full article
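
As a flavour of the model class, the sketch below simulates a two-component mixture INGARCH(1,1) count series; the mixing weights and recursion parameters are hypothetical, and the paper's regression covariates are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
pi = np.array([0.7, 0.3])      # mixing weights (hypothetical)
omega = np.array([2.0, 8.0])   # component intercepts
alpha = np.array([0.3, 0.2])   # reaction to the previous count
beta = np.array([0.4, 0.3])    # persistence of the past intensity

T = 200
lam = np.tile(omega / (1 - alpha - beta), (T, 1))  # start near stationary level
y = np.zeros(T, dtype=int)
for t in range(1, T):
    lam[t] = omega + alpha * y[t - 1] + beta * lam[t - 1]  # per-component intensity
    k = rng.choice(2, p=pi)                                # draw mixture component
    y[t] = rng.poisson(lam[t, k])                          # observed weekly count

print(y[:20])
```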

9 pages, 1071 KiB  
Communication
On the Appropriateness of Fixed Correlation Assumptions in Repeated-Measures Meta-Analysis: A Monte Carlo Assessment
by Vasileios Papadopoulos
Stats 2025, 8(3), 72; https://doi.org/10.3390/stats8030072 - 13 Aug 2025
Abstract
In repeated-measures meta-analyses, raw data are often unavailable, preventing the calculation of the correlation coefficient r between pre- and post-intervention values. As a workaround, many researchers adopt a heuristic approximation of r = 0.7. However, this value lacks rigorous mathematical justification and may introduce bias into variance estimates of pre/post-differences. We employed Monte Carlo simulations (n = 500,000 per scenario) in Fisher z-space to examine the distribution of the standard deviation of pre/post-differences (σD) under varying assumptions of r and its uncertainty (σr). Scenarios included r = 0.5, 0.6, 0.707, 0.75, and 0.8, each tested across three levels of uncertainty (σr = 0.05, 0.1, and 0.15). The approximation of r = 0.75 resulted in a balanced estimate of σD, corresponding to a "midway" variance attenuation due to paired data. Compared with the traditional value of 0.7, it more accurately offsets the variance deficit caused by assuming a correlation. While the r = 0.7 heuristic remains widely used, our results support the use of r = 0.75 as a more mathematically neutral and empirically defensible alternative in repeated-measures meta-analyses lacking raw data. Full article
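
The core computation is short enough to reproduce in outline. Assuming unit pre/post standard deviations (so σD = sqrt(2 − 2r)), the sketch below perturbs r in Fisher z-space and averages σD over draws:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000  # draws per scenario, as in the paper

def mean_sigma_d(r0, sigma_z):
    """Average sd of the pre/post difference when r is uncertain."""
    z = np.arctanh(r0) + rng.normal(0.0, sigma_z, n)  # perturb r in z-space
    r = np.tanh(z)                                    # back-transform to r
    return np.sqrt(2.0 - 2.0 * r).mean()              # sigma_D under unit sds

for r0 in (0.5, 0.6, 0.707, 0.75, 0.8):
    print(r0, [round(mean_sigma_d(r0, s), 4) for s in (0.05, 0.10, 0.15)])
```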

27 pages, 942 KiB  
Article
Individual Homogeneity Learning in Density Data Response Additive Models
by Zixuan Han, Tao Li, Jinhong You and Narayanaswamy Balakrishnan
Stats 2025, 8(3), 71; https://doi.org/10.3390/stats8030071 - 9 Aug 2025
Abstract
In many complex applications, both data heterogeneity and homogeneity are present simultaneously. Overlooking either aspect can lead to misleading statistical inferences. Moreover, the increasing prevalence of complex, non-Euclidean data calls for more sophisticated modeling techniques. To address these challenges, we propose a density data response additive model, where the response variable is represented by a distributional density function. In this framework, individual effect curves are assumed to be homogeneous within groups but heterogeneous across groups, while covariates that explain variation share common additive bivariate functions. We begin by applying a transformation to map density functions into a linear space. To estimate the unknown subject-specific functions and the additive bivariate components, we adopt a B-spline series approximation method. Latent group structures are uncovered using a hierarchical agglomerative clustering algorithm, which allows our method to recover the true underlying groupings with high probability. To further improve estimation efficiency, we develop refined spline-backfitted local linear estimators for both the grouped structures and the additive bivariate functions in the post-grouping model. We also establish the asymptotic properties of the proposed estimators, including their convergence rates, asymptotic distributions, and post-grouping oracle efficiency. The effectiveness of our method is demonstrated through extensive simulation studies and real-world data analysis, both of which show promising and robust performance. Full article

33 pages, 905 KiB  
Article
Unraveling Similarities and Differences Between Non-Negative Garrote and Adaptive Lasso: A Simulation Study in Low- and High-Dimensional Data
by Edwin Kipruto and Willi Sauerbrei
Stats 2025, 8(3), 70; https://doi.org/10.3390/stats8030070 - 6 Aug 2025
Abstract
Penalized regression methods are widely used for variable selection. Non-negative garrote (NNG) was one of the earliest methods to combine variable selection with shrinkage of regression coefficients, followed by lasso. About a decade after the introduction of lasso, adaptive lasso (ALASSO) was proposed to address lasso’s limitations. ALASSO has two tuning parameters (λ and γ), and its penalty resembles that of NNG when γ = 1, though NNG imposes additional constraints. Given ALASSO’s greater flexibility, which may increase instability, this study investigates whether NNG provides any practical benefit or can be replaced by ALASSO. We conducted simulations in both low- and high-dimensional settings to compare selected variables, coefficient estimates, and prediction accuracy. Ordinary least squares and ridge estimates were used as initial estimates. NNG and ALASSO (γ = 1) showed similar performance in low-dimensional settings with low correlation, large samples, and moderate to high R². However, under high correlation, small samples, and low R², their selected variables and estimates differed, though prediction accuracy remained comparable. When γ ≠ 1, the differences between NNG and ALASSO became more pronounced, with ALASSO generally performing better. Assuming linear relationships between predictors and the outcome, the results suggest that NNG may offer no practical advantage over ALASSO. The γ parameter in ALASSO allows for adaptability to model complexity, making ALASSO a more flexible and practical alternative to NNG. Full article
(This article belongs to the Section Statistical Methods)
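
For readers who want to experiment, ALASSO can be fitted with an ordinary lasso solver via the standard column-reweighting trick; the sketch below uses ridge initial estimates, one of the two choices compared in the paper, with hypothetical tuning values.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def adaptive_lasso(X, y, gamma=1.0, lam=0.1, eps=1e-8):
    """ALASSO via column rescaling, so a plain lasso solver can be reused."""
    b_init = Ridge(alpha=1.0).fit(X, y).coef_   # ridge initial estimates
    w = 1.0 / (np.abs(b_init) ** gamma + eps)   # adaptive penalty weights
    fit = Lasso(alpha=lam).fit(X / w, y)        # lasso on rescaled columns
    return fit.coef_ / w                        # map back to the original scale

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 2.0 - X[:, 3] * 1.5 + rng.normal(size=100)
print(np.round(adaptive_lasso(X, y), 3))  # only columns 0 and 3 should survive
```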

19 pages, 7512 KiB  
Review
Archimedean Copulas: A Useful Approach in Biomedical Data—A Review with an Application in Pediatrics
by Giulia Risca, Stefania Galimberti, Paola Rebora, Alessandro Cattoni, Maria Grazia Valsecchi and Giulia Capitoli
Stats 2025, 8(3), 69; https://doi.org/10.3390/stats8030069 - 1 Aug 2025
Abstract
Many applications in health research involve the analysis of multivariate distributions of random variables. In this paper, we review the basic theory of copulas to illustrate their advantages in deriving a joint distribution from given marginal distributions, with a specific focus on bivariate cases. Particular attention is given to the Archimedean family of copulas, which includes widely used functions such as Clayton and Gumbel–Hougaard, characterized by a single association parameter and a relatively simple structure. This work differs from previous reviews by providing a focused overview of applied studies in biomedical research that have employed Archimedean copulas, due to their flexibility in modeling a wide range of dependence structures. Their ease of use and ability to accommodate rotated forms make them suitable for various biomedical applications, including those involving survival data. We briefly present the most commonly used methods for estimation and model selection of copula functions, with the purpose of introducing these tools within the broader framework. Several recent examples in the health literature, and an original example of a pediatric study, demonstrate the applicability of Archimedean copulas and suggest that this approach, although still not widely adopted, can be useful in many biomedical research settings. Full article
(This article belongs to the Section Statistical Methods)
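
A minimal sketch of how an Archimedean copula is used in practice: Marshall–Olkin sampling from a Clayton copula (generator ψ(t) = (1 + t)^(−1/θ)), followed by mapping the uniforms to arbitrary survival marginals; θ and the marginals below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)

def clayton_sample(n, theta):
    """Marshall-Olkin sampling: U_i = psi(E_i / V), with V ~ Gamma(1/theta, 1)."""
    v = rng.gamma(1.0 / theta, 1.0, size=n)          # frailty variable
    e = rng.exponential(size=(n, 2))
    return (1.0 + e / v[:, None]) ** (-1.0 / theta)  # Clayton generator psi

u = clayton_sample(10_000, theta=2.0)     # theta = 2 gives Kendall's tau = 0.5
t1 = -np.log(1 - u[:, 0])                 # Exp(1) survival marginal
t2 = (-np.log(1 - u[:, 1])) ** (1 / 1.5)  # Weibull(shape 1.5) marginal
print(round(float(np.corrcoef(t1, t2)[0, 1]), 3))  # positive dependence
```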

22 pages, 579 KiB  
Article
Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics
by Klaus Lehmann, Elio Villaseñor, Alejandro Pimentel, Javiera Preuss, Nicolás Berhó, Oswaldo Diaz and Ignacio Agloni
Stats 2025, 8(3), 68; https://doi.org/10.3390/stats8030068 - 30 Jul 2025
Abstract
This paper presents the implementation of a language model–based strategy for the automatic codification of crime narratives for the production of official statistics. To address the high workload and inconsistencies associated with manual coding, we developed and evaluated three models: an XGBoost classifier with bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). Deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification, incorporating a certainty threshold to distinguish between cases suitable for automation and those requiring expert review. This hybrid strategy led to a 68.4% reduction in manual review workload while preserving high-quality standards. This study represents the first documented application of deep learning for the automated classification of victimization narratives in official statistics, demonstrating its feasibility and impact in a real-world production environment. Our results demonstrate that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
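
The certainty-threshold routing is easy to prototype. The sketch below stands in a TF-IDF plus logistic-regression classifier for the production models (XGBoost/LSTM/BETO); the toy narratives, labels, and the 0.9 threshold are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy narratives and labels, standing in for ENUSC data.
texts = ["robaron mi bicicleta en la calle", "entraron a mi casa de noche",
         "me quitaron el celular con amenazas", "rayaron la puerta del auto"]
labels = ["robo", "robo en vivienda", "robo con violencia", "danos"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

proba = clf.predict_proba(["me robaron el telefono"])[0]
best, label = proba.max(), clf.classes_[proba.argmax()]
# Route by certainty: auto-code confident cases, send the rest to experts.
decision = label if best >= 0.9 else "manual review"
print(round(float(best), 3), decision)
```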

28 pages, 888 KiB  
Article
Requiem for Olympic Ethics and Sports’ Independence
by Fabio Zagonari
Stats 2025, 8(3), 67; https://doi.org/10.3390/stats8030067 - 28 Jul 2025
Abstract
This paper suggests a theoretical framework to summarise the empirical literature on the relationships between sports and both religious and secular ethics, and it suggests two interrelated theoretical models to empirically evaluate the extent to which religious and secular ethics, as well as sports policies, affect achievements in sports. I identified two national ethics (national pride/efficiency) and two social ethics (social cohesion/ethics) by measuring achievements in terms of alternative indexes based on Olympic medals. I referred to three empirical models and applied three estimation methods (panel Poisson, Data Envelopment, and Stochastic Frontier Analyses). I introduced two sports policies (a quantitative policy aimed at social cohesion and a qualitative policy aimed at national pride) by distinguishing sports in terms of four possibly different ethics for the eight summer and eight winter Olympic Games from 1994 to 2024. I applied income level, health status, and income inequality to depict alternative social contexts. I used five main religions and three educational levels to depict alternative ethical contexts. I applied country dummies to depict alternative institutional contexts. Empirical results support the absence of Olympic ethics, the potential substitution of sport and secular ethics in providing social cohesion, and the dependence of sports on politics, while alternative social contexts have different impacts on alternative sport achievements. Full article
(This article belongs to the Special Issue Ethicametrics)

22 pages, 366 KiB  
Article
Proximal Causal Inference for Censored Data with an Application to Right Heart Catheterization Data
by Yue Hu, Yuanshan Gao and Minhao Qi
Stats 2025, 8(3), 66; https://doi.org/10.3390/stats8030066 - 22 Jul 2025
Abstract
In observational causal inference studies, unmeasured confounding remains a critical threat to the validity of effect estimates. While proximal causal inference (PCI) has emerged as a powerful framework for mitigating such bias through proxy variables, existing PCI methods cannot directly handle censored data. This article develops a unified proximal causal inference framework that simultaneously addresses unmeasured confounding and right-censoring challenges, extending the proximal causal inference literature. Our key contributions are twofold: (i) We propose novel identification strategies and develop two distinct estimators for the censored-outcome bridge function and treatment confounding bridge function, resolving the fundamental challenge of unobserved outcomes; (ii) To improve robustness against model misspecification, we construct a robust proximal estimator and establish uniform consistency for all proposed estimators under mild regularity conditions. Through comprehensive simulations, we demonstrate the finite-sample performance of our methods, followed by an empirical application evaluating right heart catheterization effectiveness in critically ill ICU patients. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)

10 pages, 1848 KiB  
Article
Local Stochastic Correlation Models for Derivative Pricing
by Marcos Escobar-Anel
Stats 2025, 8(3), 65; https://doi.org/10.3390/stats8030065 - 18 Jul 2025
Abstract
This paper reveals a simple methodology to create local-correlation models suitable for the closed-form pricing of two-asset financial derivatives. The multivariate models are built to ensure two conditions. First, marginals follow desirable processes, e.g., we choose the Geometric Brownian Motion (GBM), popular for stock prices. Second, the payoff of the derivative should follow a desired one-dimensional process. These conditions lead to a specific choice of the dependence structure in the form of a local-correlation model. Two popular multi-asset options are entertained: a spread option and a basket option. Full article
(This article belongs to the Section Applied Stochastic Models)
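
To convey the construction, the sketch below prices a spread option by Euler Monte Carlo under two GBM marginals whose instantaneous correlation depends on the current spread; the local-correlation function and all market parameters are hypothetical stand-ins, not the paper's closed-form result.

```python
import numpy as np

rng = np.random.default_rng(11)
s1, s2 = 100.0, 95.0                 # spot prices
r, sig1, sig2 = 0.02, 0.30, 0.25     # rate and volatilities
T, n_steps, n_paths, K = 1.0, 252, 50_000, 5.0
dt = T / n_steps

def local_rho(x1, x2):
    """Hypothetical state-dependent correlation, kept inside (-1, 1)."""
    return np.tanh(0.5 + 0.02 * (x1 - x2))

S1 = np.full(n_paths, s1)
S2 = np.full(n_paths, s2)
for _ in range(n_steps):
    rho = local_rho(S1, S2)
    z1 = rng.normal(size=n_paths)
    z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.normal(size=n_paths)
    S1 *= np.exp((r - 0.5 * sig1**2) * dt + sig1 * np.sqrt(dt) * z1)
    S2 *= np.exp((r - 0.5 * sig2**2) * dt + sig2 * np.sqrt(dt) * z2)

price = np.exp(-r * T) * np.maximum(S1 - S2 - K, 0.0).mean()
print(round(float(price), 3))
```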

17 pages, 1296 KiB  
Article
Machine Learning Ensemble Algorithms for Classification of Thyroid Nodules Through Proteomics: Extending the Method of Shapley Values from Binary to Multi-Class Tasks
by Giulia Capitoli, Simone Magnaghi, Andrea D’Amicis, Camilla Vittoria Di Martino, Isabella Piga, Vincenzo L’Imperio, Marco Salvatore Nobile, Stefania Galimberti and Davide Paolo Bernasconi
Stats 2025, 8(3), 64; https://doi.org/10.3390/stats8030064 - 16 Jul 2025
Abstract
Improving medical diagnosis is of utmost importance in medical research and relies on the optimization of accurate classification models able to assist clinical decisions. To minimize the errors that can be caused by using a single classifier, the voting ensemble technique can be used, combining the classification results of different classifiers to improve the final classification performance. This paper aims to compare the existing voting ensemble techniques with a new game-theory-derived approach based on Shapley values. We extended this method, originally developed for binary tasks, to the multi-class setting in order to capture complementary information provided by different classifiers. In heterogeneous clinical scenarios such as thyroid nodule diagnosis, where distinct models may be better suited to identify specific subtypes (e.g., benign, malignant, or inflammatory lesions), ensemble strategies capable of leveraging these strengths are particularly valuable. The motivating application focuses on the classification of thyroid cancer nodules whose cytopathological clinical diagnosis is typically characterized by a high number of false positive cases that may result in unnecessary thyroidectomy. We apply and compare the performance of seven individual classifiers, along with four ensemble voting techniques (including Shapley values), in a real-world study focused on classifying thyroid cancer nodules using proteomic features obtained through mass spectrometry. Our results indicate a slight improvement in the classification accuracy for ensemble systems compared to the performance of single classifiers. Although the Shapley value-based voting method remains comparable to the other voting methods, we envision this new ensemble approach could be effective in improving the performance of single classifiers in further applications, especially when complementary algorithms are considered in the ensemble. The application of these techniques can lead to the development of new tools to assist clinicians in diagnosing thyroid cancer using proteomic features derived from mass spectrometry. Full article
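
The game-theoretic voting idea can be illustrated directly: treat each classifier as a player, define a coalition's value as the accuracy of its majority vote, and average marginal contributions over all orderings. The sketch below (synthetic predictions, exact enumeration for three classifiers) is a simplified reading of the approach, not the authors' implementation.

```python
import numpy as np
from itertools import combinations
from math import factorial

def majority_accuracy(preds, y):
    """preds: (n_classifiers, n_samples) integer label matrix for one coalition."""
    vote = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, preds)
    return (vote == y).mean()

def shapley_values(all_preds, y):
    m = len(all_preds)
    phi = np.zeros(m)
    for i in range(m):
        for size in range(m):
            for S in combinations([j for j in range(m) if j != i], size):
                gain = (majority_accuracy(all_preds[list(S) + [i]], y)
                        - (majority_accuracy(all_preds[list(S)], y) if S else 0.0))
                phi[i] += factorial(size) * factorial(m - size - 1) / factorial(m) * gain
    return phi

rng = np.random.default_rng(5)
y = rng.integers(0, 3, 200)                       # 3-class ground truth
preds = np.array([np.where(rng.random(200) < p, y, rng.integers(0, 3, 200))
                  for p in (0.8, 0.7, 0.6)])      # three noisy classifiers
print(np.round(shapley_values(preds, y), 3))      # contributions sum to ensemble accuracy
```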

30 pages, 2389 KiB  
Communication
Beyond Expectations: Anomalies in Financial Statements and Their Application in Modelling
by Roman Blazek and Lucia Duricova
Stats 2025, 8(3), 63; https://doi.org/10.3390/stats8030063 - 15 Jul 2025
Abstract
The increasing complexity of financial reporting has enabled the implementation of innovative accounting practices that often obscure a company’s actual performance. This project seeks to uncover manipulative behaviours by constructing an anomaly detection model that utilises unsupervised machine learning techniques. We examined a dataset of 149,566 Slovak firms from 2016 to 2023, which included 12 financial parameters. Utilising TwoStep and K-means clustering in IBM SPSS, we discerned patterns of normative financial activity and computed an abnormality index for each firm. Entities with the most significant deviation from cluster centroids were identified as suspicious. The model attained a silhouette score of 1.0, signifying outstanding clustering quality. We discovered a total of 231 anomalous firms, predominantly concentrated in sectors C (32.47%), G (13.42%), and L (7.36%). Our research indicates that anomaly-based models can markedly enhance the precision of fraud detection, especially in scenarios with scarce labelled data. The model integrates intricate data processing and delivers an exhaustive study of the regional and sectoral distribution of anomalies, thereby increasing its relevance in practical applications. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
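
A minimal sketch of the abnormality-index construction on synthetic data: cluster standardized indicators with K-means and flag the firms farthest from their nearest centroid. The cluster count and flagging quantile below are hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 12))            # 12 financial indicators per firm
X[:25] += rng.normal(6, 1, size=(25, 12))  # a few planted outliers

Z = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Z)
dist = np.linalg.norm(Z - km.cluster_centers_[km.labels_], axis=1)  # abnormality index

threshold = np.quantile(dist, 0.995)       # flag the top 0.5% as suspicious
print((dist > threshold).sum(), "firms flagged")
```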

23 pages, 658 KiB  
Article
The Extended Kumaraswamy Model: Properties, Risk Indicators, Risk Analysis, Regression Model, and Applications
by Morad Alizadeh, Gauss M. Cordeiro, Gabriela M. Rodrigues, Edwin M. M. Ortega and Haitham M. Yousof
Stats 2025, 8(3), 62; https://doi.org/10.3390/stats8030062 - 14 Jul 2025
Abstract
We propose a new unit distribution, study its properties, and provide an important application in the field of geology through a set of risk indicators. We test its practicality through two applications to real data, make comparisons with the well-known beta and Kumaraswamy distributions, and estimate the parameters of the new distribution in different ways. We provide a new regression model and apply it in statistical prediction operations for residence times data. Full article

10 pages, 339 KiB  
Article
Continuity Correction and Standard Error Calculation for Testing in Proportional Hazards Models
by Daniel Baumgartner and John E. Kolassa
Stats 2025, 8(3), 61; https://doi.org/10.3390/stats8030061 - 14 Jul 2025
Abstract
Standard asymptotic inference for proportional hazards models is conventionally performed by calculating a standard error for the estimate and comparing the estimate divided by the standard error to a standard normal distribution. In this paper, we compare various standard error estimates, including those based on the inverse observed information, the inverse expected information, and the jackknife. Furthermore, correction for continuity is compared to omitting this correction. We find that correction for continuity represents an important improvement in the quality of approximation, and furthermore note that the usual naive standard error yields a distribution closer to normality, as measured by skewness and kurtosis, than any of the other standard errors investigated. Full article

15 pages, 472 KiB  
Article
Some Useful Techniques for High-Dimensional Statistics
by David J. Olive
Stats 2025, 8(3), 60; https://doi.org/10.3390/stats8030060 - 13 Jul 2025
Abstract
High-dimensional statistics are used when n < 5p, where n is the sample size and p is the number of predictors. Useful techniques include (a) using a sparse fitted model, (b) using principal component analysis for dimension reduction, (c) using alternative multivariate dispersion estimators instead of the sample covariance matrix, (d) eliminating weak predictors, and (e) stacking low-dimensional estimators into a vector. Some variants and theory for these techniques will be given or reviewed. Full article
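
As an illustration of technique (d), the sketch below eliminates weak predictors by marginal correlation screening (in the spirit of sure independence screening) and fits OLS on the survivors; the n/10 retention rule is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 1000                        # n < 5p: high-dimensional regime
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 3.0                          # only five active predictors
y = X @ beta + rng.normal(size=n)

corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
keep = np.argsort(corr)[-(n // 10):]    # retain the n/10 strongest predictors
bhat = np.linalg.lstsq(X[:, keep], y, rcond=None)[0]  # OLS on the survivors
print(sorted(keep[np.abs(bhat) > 1.0]))  # should recover indices 0..4
```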

17 pages, 384 KiB  
Article
The Detection Method of the Tobit Model in a Dataset
by El ouali Rahmani and Mohammed Benmoumen
Stats 2025, 8(3), 59; https://doi.org/10.3390/stats8030059 - 12 Jul 2025
Abstract
This article proposes an extension of detection methods for the Tobit model by generalizing existing approaches from cases with known parameters to more realistic scenarios where the parameters are unknown. The main objective is to develop detection procedures that account for parameter uncertainty and to analyze how this uncertainty affects the estimation process and the overall accuracy of the model. The methodology relies on maximum likelihood estimation, applied to datasets generated under different configurations of the Tobit model. A series of Monte Carlo simulations is conducted to evaluate the performance of the proposed methods. The results provide insights into the robustness of the detection procedures under varying assumptions. The study concludes with practical recommendations for improving the application of the Tobit model in fields such as econometrics, health economics, and environmental studies. Full article
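
For concreteness, the sketch below fits a Tobit model left-censored at zero by maximum likelihood on simulated data, the building block on which such detection procedures rest; all parameter values are illustrative.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(6)
n = 500
x = rng.normal(size=n)
y_star = 0.5 + 1.2 * x + rng.normal(scale=1.0, size=n)  # latent outcome
y = np.maximum(y_star, 0.0)                              # observed, censored at 0

def negloglik(theta):
    b0, b1, log_s = theta
    s = np.exp(log_s)                      # keep sigma positive
    mu = b0 + b1 * x
    ll = np.where(y <= 0.0,
                  stats.norm.logcdf(-mu / s),   # P(y* <= 0) for censored cases
                  stats.norm.logpdf(y, mu, s))  # density for observed y > 0
    return -ll.sum()

res = optimize.minimize(negloglik, x0=np.zeros(3), method="BFGS")
b0, b1, log_s = res.x
print(round(b0, 3), round(b1, 3), round(float(np.exp(log_s)), 3))
```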

16 pages, 1288 KiB  
Article
Quantile Estimation Based on the Log-Skew-t Linear Regression Model: Statistical Aspects, Simulations, and Applications
by Raúl Alejandro Morán-Vásquez, Anlly Daniela Giraldo-Melo and Mauricio A. Mazo-Lopera
Stats 2025, 8(3), 58; https://doi.org/10.3390/stats8030058 - 11 Jul 2025
Abstract
We propose a robust linear regression model assuming a log-skew-t distribution for the response variable, with the aim of exploring the association between the covariates and the quantiles of a continuous and positive response variable under skewness and heavy tails. This model includes the log-skew-normal and log-t linear regression models as special cases. Our simulation studies indicate good performance of the quantile estimation approach and its outperformance relative to the classical quantile regression model. The practical applicability of our methodology is demonstrated through an analysis of two real datasets. Full article
(This article belongs to the Special Issue Robust Statistics in Action II)

14 pages, 16727 KiB  
Article
Well Begun Is Half Done: The Impact of Pre-Processing in MALDI Mass Spectrometry Imaging Analysis Applied to a Case Study of Thyroid Nodules
by Giulia Capitoli, Kirsten C. J. van Abeelen, Isabella Piga, Vincenzo L’Imperio, Marco S. Nobile, Daniela Besozzi and Stefania Galimberti
Stats 2025, 8(3), 57; https://doi.org/10.3390/stats8030057 - 10 Jul 2025
Abstract
The discovery of proteomic biomarkers in cancer research can be effectively performed in situ by exploiting Matrix-Assisted Laser Desorption Ionization (MALDI) Mass Spectrometry Imaging (MSI). However, due to experimental limitations, the spectra extracted by MALDI-MSI can be noisy, so pre-processing steps are generally needed to reduce the instrumental and analytical variability. Thus far, the importance and the effect of standard pre-processing methods, as well as their combinations and parameter settings, have not been extensively investigated in proteomics applications. In this work, we present a systematic study of 15 combinations of pre-processing steps—including baseline correction, smoothing, normalization, and peak alignment—for a real-data classification task on MALDI-MSI data measured from fine-needle aspiration biopsies of thyroid nodules. The influence of each combination was assessed by analyzing the feature extraction, pixel-by-pixel classification probabilities, and LASSO classification performance. Our results highlight the necessity of fine-tuning a pre-processing pipeline, especially for the reliable transfer of molecular diagnostic signatures in clinical practice. We outline some recommendations on the selection of pre-processing steps, together with filter levels and alignment methods, according to the mass-to-charge range and heterogeneity of data. Full article
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
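
One such pre-processing combination, applied to a synthetic single spectrum: rolling-minimum baseline estimation, Savitzky–Golay smoothing, and total-ion-current (TIC) normalization. The window sizes are hypothetical tuning choices, not the paper's recommended settings.

```python
import numpy as np
from scipy.ndimage import minimum_filter1d, uniform_filter1d
from scipy.signal import savgol_filter

rng = np.random.default_rng(8)
mz = np.linspace(2000, 20000, 5000)
spectrum = (np.exp(-(mz - 9000) ** 2 / 2e5) * 50    # one protein peak
            + 20 * np.exp(-mz / 8000)               # slowly varying baseline
            + rng.normal(0, 0.5, mz.size))          # instrumental noise

baseline = uniform_filter1d(minimum_filter1d(spectrum, 301), 301)
corrected = np.clip(spectrum - baseline, 0, None)    # baseline subtraction
smoothed = savgol_filter(corrected, window_length=21, polyorder=3)
tic = smoothed / smoothed.sum()                      # TIC normalization
print(round(float(tic.max()), 6))
```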

18 pages, 359 KiB  
Article
On the Decision-Theoretic Foundations and the Asymptotic Bayes Risk of the Region of Practical Equivalence for Testing Interval Hypotheses
by Riko Kelter
Stats 2025, 8(3), 56; https://doi.org/10.3390/stats8030056 - 8 Jul 2025
Abstract
Testing interval hypotheses is highly relevant in the biomedical and cognitive sciences, for example, in clinical trials. Frequentist approaches include equivalence tests, which have been used to study whether there is a predetermined meaningful treatment effect. In the Bayesian paradigm, two popular approaches exist: the first is the region of practical equivalence (ROPE), which has become increasingly popular in the cognitive sciences; the second is the Bayes factor for interval null hypotheses, proposed by Morey et al. One advantage of the ROPE procedure is that, in contrast to the Bayes factor, it is quite robust to the prior specification. However, while the ROPE is conceptually appealing, it lacks a clear decision-theoretic foundation like the Bayes factor. In this paper, a decision-theoretic justification for the ROPE procedure is derived for the first time, showing that the Bayes risk of a decision rule based on the highest-posterior-density (HPD) interval and the ROPE is asymptotically minimized for increasing sample size. To show this, a specific loss function is introduced. This result provides an important decision-theoretic justification for testing interval hypotheses in the Bayesian approach based on the ROPE and HPD, in particular when the sample size is large. Full article
(This article belongs to the Section Bayesian Methods)
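
The decision rule being justified is simple to state in code: accept practical equivalence when the 95% HPD interval falls inside the ROPE, reject when it falls entirely outside, and withhold judgement otherwise. The posterior draws and ROPE bounds below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(9)
post = rng.normal(0.03, 0.05, 100_000)   # posterior draws of the effect
rope = (-0.1, 0.1)                       # region of practical equivalence

def hpd(draws, mass=0.95):
    """Shortest interval containing `mass` of the posterior draws."""
    d = np.sort(draws)
    k = int(np.floor(mass * len(d)))
    widths = d[k:] - d[:len(d) - k]      # all candidate intervals of that mass
    j = widths.argmin()                  # the narrowest one is the HPD interval
    return d[j], d[j + k]

lo, hi = hpd(post)
if rope[0] <= lo and hi <= rope[1]:
    print("accept practical equivalence")   # HPD fully inside the ROPE
elif hi < rope[0] or lo > rope[1]:
    print("reject equivalence")             # HPD fully outside the ROPE
else:
    print("withhold decision")
```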

24 pages, 347 KiB  
Article
Estimating the Ratio of Means in a Zero-Inflated Poisson Mixture Model
by Michael Pearce and Michael D. Perlman
Stats 2025, 8(3), 55; https://doi.org/10.3390/stats8030055 - 5 Jul 2025
Abstract
The problem of estimating the ratio of the means of a two-component Poisson mixture model is considered, when each component is subject to zero-inflation, i.e., excess zero counts. The resulting zero-inflated Poisson mixture (ZIPM) model can be viewed as a three-component Poisson mixture model with one degenerate component. The EM algorithm is applied to obtain frequentist estimators and their standard errors, the latter determined via an explicit expression for the observed information matrix. As an intermediate step, we derive an explicit expression for standard errors in the two-component Poisson mixture model (without zero-inflation), a new result. The ZIPM model is applied to simulated data and real ecological count data of frigatebirds on the Coral Sea Islands off the coast of Northeast Australia. Full article
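
The EM mechanics are easiest to see on the plain zero-inflated Poisson model (one Poisson component plus a degenerate zero component); the paper's ZIPM adds a second Poisson component. A minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(10)
true_pi, true_lam = 0.3, 4.0
n = 2000
structural = rng.random(n) < true_pi            # structural (excess) zeros
y = np.where(structural, 0, rng.poisson(true_lam, n))

pi, lam = 0.5, y.mean() + 0.5                   # crude starting values
for _ in range(200):
    # E-step: posterior probability that each observed zero is structural
    w = np.where(y == 0, pi / (pi + (1 - pi) * np.exp(-lam)), 0.0)
    # M-step: update the mixing weight and the Poisson mean
    pi = w.mean()
    lam = ((1 - w) * y).sum() / (1 - w).sum()

print(round(float(pi), 3), round(float(lam), 3))  # close to 0.3 and 4.0
```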

14 pages, 715 KiB  
Article
A Data-Driven Approach of DRG-Based Medical Insurance Payment Policy Formulation in China Based on an Optimization Algorithm
by Kun Ba and Biqing Huang
Stats 2025, 8(3), 54; https://doi.org/10.3390/stats8030054 - 30 Jun 2025
Abstract
The diagnosis-related group (DRG) system classifies patients into different groups in order to facilitate decisions regarding medical insurance payments. Currently, more than 600 standard DRGs exist in China. Payment details represented by DRG weights must be adjusted during decision-making. After modeling the DRG weight-determining process as a parameter-searching and optimization-solving problem, we propose a stochastic gradient tracking algorithm (SGT) and compare it with a genetic algorithm and sequential quadratic programming. We describe diagnosis-related groups in China using several statistics based on sample data from one city. We explored the influence of the SGT hyperparameters through numerous experiments and demonstrated the robustness of the best SGT hyperparameter combination. Our stochastic gradient tracking algorithm finished the parameter search in only 3.56 min when the insurance payment rate was set at 95%, which is acceptable and desirable. As the main medical insurance payment scheme in China, DRGs require quantitative evidence for policymaking. The optimization algorithm proposed in this study shows a possible scientific decision-making method for use in the DRG system, particularly with regard to DRG weights. Full article

14 pages, 784 KiB  
Article
Distance-Based Relevance Function for Imbalanced Regression
by Daniel Daeyoung In and Hyunjoong Kim
Stats 2025, 8(3), 53; https://doi.org/10.3390/stats8030053 - 28 Jun 2025
Abstract
Imbalanced regression poses a significant challenge in real-world prediction tasks, where rare target values are prone to overfitting during model training. To address this, prior research has employed relevance functions to quantify the rarity of target instances. However, existing functions often struggle to capture the rarity across diverse target distributions. In this study, we introduce a novel Distance-based Relevance Function (DRF) that quantifies the rarity based on the distance between target values, enabling a more accurate and distribution-agnostic assessment of rare data. This general approach allows imbalanced regression techniques to be effectively applied to a broader range of distributions, including bimodal cases. We evaluate the proposed DRF using Mean Squared Error (MSE), relevance-weighted Mean Absolute Error (MAEϕ), and Symmetric Mean Absolute Percentage Error (SMAPE). Empirical studies on synthetic datasets and 18 real-world datasets demonstrate that DRF tends to improve the performance across various machine learning models, including support vector regression, neural networks, XGBoost, and random forests. These findings suggest that DRF offers a promising direction for rare target detection and broadens the applicability of imbalanced regression methods. Full article
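
One plausible reading of a distance-based relevance score, not the authors' exact definition: targets far from the bulk of the sample, whether in a tail or between the modes of a bimodal outcome, receive relevance near 1.

```python
import numpy as np

def distance_relevance(y):
    """Relevance in [0, 1] from each target's average distance to all others."""
    diffs = np.abs(y[:, None] - y[None, :])   # pairwise target distances
    mean_dist = diffs.mean(axis=1)            # average distance to the rest
    lo, hi = mean_dist.min(), mean_dist.max()
    return (mean_dist - lo) / (hi - lo)       # rescale to [0, 1]

rng = np.random.default_rng(12)
y = np.concatenate([rng.normal(0, 1, 490), rng.normal(12, 1, 10)])  # rare highs
rel = distance_relevance(y)
print(np.round(rel[:3], 2), np.round(rel[-3:], 2))  # common vs. rare targets
```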

13 pages, 300 KiB  
Article
New Effects and Methods in Brownian Transport
by Dmitri Martila and Stefan Groote
Stats 2025, 8(3), 52; https://doi.org/10.3390/stats8030052 - 26 Jun 2025
Abstract
We consider the noise-induced transport of overdamped Brownian particles in a ratchet system driven by nonequilibrium symmetric three-level Markovian noise and additive white noise. In addition to a detailed analysis of this system, we consider a simple example that can be solved exactly, showing both the increase in the number of current reversals and hypersensitivity. The simplicity of the exact solution and the model itself is beneficial for comparison with experiments. Full article
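
A minimal sketch of such a simulation (hypothetical potential, rates, and amplitudes): Euler–Maruyama integration of an overdamped particle in an asymmetric ratchet potential, driven by white noise plus a three-level noise term resampled at exponential switching times, a simplification of the Markovian noise analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(13)
a, nu, D = 1.5, 0.5, 0.1      # noise amplitude, switching rate, white-noise strength
dt, n_steps = 1e-3, 200_000

def ratchet_force(x):
    """Force from the asymmetric potential V(x) = sin(2*pi*x) + 0.25*sin(4*pi*x)."""
    return -(2 * np.pi * np.cos(2 * np.pi * x) + np.pi * np.cos(4 * np.pi * x))

x, xi = 0.0, 0.0              # position and current three-level noise value
for _ in range(n_steps):
    if rng.random() < nu * dt:             # switching event
        xi = a * rng.choice([-1, 0, 1])    # symmetric three-level noise
    x += (ratchet_force(x) + xi) * dt + np.sqrt(2 * D * dt) * rng.normal()

print("net displacement:", round(x, 2))   # nonzero drift indicates transport
```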

14 pages, 464 KiB  
Article
Elicitation of Priors for the Weibull Distribution
by Purvi Prajapati, James D. Stamey, David Kahle, John W. Seaman, Jr., Zachary M. Thomas and Michael Sonksen
Stats 2025, 8(3), 51; https://doi.org/10.3390/stats8030051 - 23 Jun 2025
Abstract
Bayesian methods have attracted increasing interest in the design and analysis of clinical trials. Many of these clinical trials investigate time-to-event endpoints. The Weibull distribution is often used in survival and reliability analysis to model time-to-event data. We propose a process to elicit information about the parameters of the Weibull distribution for pharmaceutical applications. Our method is based on an expert’s answers to questions about the median and upper quartile of the distribution. Using the elicited information, a joint prior is constructed for the median and upper quartile of the Weibull distribution, which induces a joint prior distribution on the shape and rate parameters of the Weibull. To illustrate, we apply our elicitation methodology to a pediatric clinical trial, where information is elicited from a subject-matter expert for the control arm. Full article
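
The deterministic core of such an elicitation is the map from a (median, upper quartile) pair to a unique Weibull; placing a joint prior on those two quantiles then induces a prior on the shape and scale (and hence rate) parameters. The elicited values below are hypothetical.

```python
import numpy as np

def weibull_from_quantiles(median, q75):
    """Shape k and scale b with F(median) = 0.5 and F(q75) = 0.75."""
    # F(t) = 1 - exp(-(t/b)^k)  =>  (t/b)^k = -log(1 - p)
    k = np.log(np.log(4) / np.log(2)) / np.log(q75 / median)
    b = median / np.log(2) ** (1 / k)
    return k, b

k, b = weibull_from_quantiles(median=12.0, q75=20.0)  # months, say
print(round(float(k), 3), round(float(b), 3))

# Sanity check: the fitted CDF reproduces the two elicited quantiles.
F = lambda t: 1 - np.exp(-(t / b) ** k)
print(round(float(F(12.0)), 3), round(float(F(20.0)), 3))   # 0.5 and 0.75
```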

19 pages, 920 KiB  
Article
Ethicametrics: A New Interdisciplinary Science
by Fabio Zagonari
Stats 2025, 8(3), 50; https://doi.org/10.3390/stats8030050 - 22 Jun 2025
Abstract
This paper characterises Ethicametrics (EM) as a new interdisciplinary scientific research area focusing on metrics of ethics (MOE) and ethics of metrics (EOM), by providing a comprehensive methodological framework. EM is scientific: it is based on behavioural mathematical modelling to be statistically validated and tested, with additional sensitivity analyses to favour immediate interpretations. EM is interdisciplinary: it spans from less to more traditional fields, with essential mutual improvements. EM is new: valid and invalid examples of EM (articles referring to an explicit and an implicit behavioural model, respectively) are scarce, recent, time-stable and discipline-focused, with 1 and 37 scientists, respectively. Thus, the core of EM (multi-level statistical analyses applied to behavioural mathematical models) is crucial to avoid biased MOE and EOM. Conversely, articles inside EM should study quantitatively any metrics or ethics, in any alternative context, at any analytical level, by using panel/longitudinal data. Behavioural models should be ethically explicit, possibly by evaluating ethics in terms of the consequences of actions. Ethical measures should be scientifically grounded by evaluating metrics in terms of ethical criteria coming from the relevant theological/philosophical literature. Note that behavioural models applied to science metrics can be used to deduce social consequences to be ethically evaluated. Full article

27 pages, 2058 KiB  
Article
Mission Reliability Assessment for the Multi-Phase Data in Operational Testing
by Jianping Hao and Mochao Pei
Stats 2025, 8(3), 49; https://doi.org/10.3390/stats8030049 - 20 Jun 2025
Abstract
Traditional methods for mission reliability assessment under operational testing conditions exhibit several limitations: coarse modeling granularity, significant parameter estimation biases, and inadequate adaptability to heterogeneous test data. To address these challenges, this study establishes an assessment framework using a vehicular missile launching system (VMLS) as a case study. The framework constructs phase-specific reliability block diagrams based on mission profiles and establishes mappings between data types and evaluation models. The framework integrates the maximum entropy criterion with reliability monotonic decreasing constraints, develops a covariate-embedded Bayesian data fusion model, and proposes a multi-path weight adjustment assessment method. Simulation and physical testing demonstrate that compared with conventional methods, the proposed approach shows superior accuracy and precision in parameter estimation. It enables mission reliability assessment under practical operational testing constraints while providing methodological support to overcome the traditional assessment paradigm that overemphasizes performance verification while neglecting operational capability development. Full article
