Journal Description
Stats is an international, peer-reviewed, open access journal on statistical science published quarterly online by MDPI. The journal focuses on methodological and theoretical papers in statistics, probability, stochastic processes, and innovative applications of statistics in all scientific disciplines, including biological and biomedical sciences, medicine, business, economics and social sciences, physics, data science, and engineering.
- Open Access — free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within ESCI (Web of Science), Scopus, RePEc, and other databases.
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 18.2 days after submission; acceptance to publication is undertaken in 2.9 days (median values for papers published in this journal in the first half of 2025).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 1.0 (2024); 5-Year Impact Factor: 1.1 (2024)
Latest Articles
A Mixture Integer GARCH Model with Application to Modeling and Forecasting COVID-19 Counts
Stats 2025, 8(3), 73; https://doi.org/10.3390/stats8030073 - 13 Aug 2025
Abstract
This article introduces a flexible time series regression model known as the Mixture of Integer-Valued Generalized Autoregressive Conditional Heteroscedasticity (MINGARCH). Mixture models provide versatile frameworks for capturing heterogeneity in count data, including features such as multiple peaks, seasonality, and intervention effects. The proposed model is applied to regional COVID-19 data from Malaysia. To account for geographical variability, five regions—Selangor, Kuala Lumpur, Penang, Johor, and Sarawak—were selected for analysis, covering a total of 86 weeks of data. Comparative analysis with existing time series regression models demonstrates that MINGARCH outperforms alternative approaches. Further investigation into forecasting reveals that MINGARCH yields superior performance in regions with high population density, and significant influencing factors have been identified. In low-density regions, confirmed cases peaked within three weeks, whereas high-density regions exhibited a monthly seasonal pattern. Forecasting metrics—including MAPE, MAE, and RMSE—are significantly lower for the MINGARCH model compared to other models. These results suggest that MINGARCH is well-suited for forecasting disease spread in urban and densely populated areas, offering valuable insights for policymaking.
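As a rough illustration of the building block behind MINGARCH, the sketch below simulates a single Poisson INGARCH(1,1) component; the mixture model in the paper draws each count from one of several such components. The parameter values and the 86-week length are illustrative assumptions, not estimates from the study.

```python
import numpy as np

def simulate_ingarch(omega, alpha, beta, n, rng=None):
    """Simulate a Poisson INGARCH(1,1) count series:
    lambda_t = omega + alpha * y_{t-1} + beta * lambda_{t-1}, y_t ~ Poisson(lambda_t).
    A mixture INGARCH draws each y_t from one of several such components
    according to mixing weights."""
    rng = rng or np.random.default_rng()
    lam = np.empty(n)
    y = np.empty(n, dtype=int)
    lam[0] = omega / (1 - alpha - beta)   # start at the stationary mean
    y[0] = rng.poisson(lam[0])
    for t in range(1, n):
        lam[t] = omega + alpha * y[t - 1] + beta * lam[t - 1]
        y[t] = rng.poisson(lam[t])
    return y, lam

# 86 weekly counts, mirroring the length of the study period (parameters are illustrative).
counts, intensity = simulate_ingarch(omega=2.0, alpha=0.3, beta=0.5, n=86)
print(counts[:10])
```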
Open Access Communication
On the Appropriateness of Fixed Correlation Assumptions in Repeated-Measures Meta-Analysis: A Monte Carlo Assessment
by Vasileios Papadopoulos
Stats 2025, 8(3), 72; https://doi.org/10.3390/stats8030072 - 13 Aug 2025
Abstract
In repeated-measures meta-analyses, raw data are often unavailable, preventing the calculation of the correlation coefficient r between pre- and post-intervention values. As a workaround, many researchers adopt a heuristic approximation of r = 0.7. However, this value lacks rigorous mathematical justification and may introduce bias into variance estimates of pre/post-differences. We employed Monte Carlo simulations (n = 500,000 per scenario) in Fisher z-space to examine the distribution of the standard deviation of pre-/post-differences (σD) under varying assumptions of r and its uncertainty (σr). Scenarios included r = 0.5, 0.6, 0.707, 0.75, and 0.8, each tested across three levels of variance (σr = 0.05, 0.1, and 0.15). The approximation of r = 0.75 resulted in a balanced estimate of σD, corresponding to a “midway” variance attenuation due to paired data. This value more accurately offsets the deficit caused by assuming a correlation, compared to the traditional value of 0.7. While the r = 0.7 heuristic remains widely used, our results support the use of r = 0.75 as a more mathematically neutral and empirically defensible alternative in repeated-measures meta-analyses lacking raw data.
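A minimal sketch of the kind of Monte Carlo check described above, assuming unit pre and post standard deviations and using the standard paired-difference identity sigma_D = sqrt(s1^2 + s2^2 - 2*r*s1*s2), with r drawn in Fisher z-space; it is not the author's simulation code.

```python
import numpy as np

rng = np.random.default_rng(42)
N_SIM = 500_000                  # draws per scenario, matching the abstract
S_PRE = S_POST = 1.0             # hypothetical unit SDs for pre and post scores

def mean_sigma_d(r_mean, sigma_r, n=N_SIM):
    """Draw r around atanh(r_mean) in Fisher z-space, then average
    sigma_D = sqrt(s1^2 + s2^2 - 2*r*s1*s2), the SD of the pre/post difference."""
    z = rng.normal(np.arctanh(r_mean), sigma_r, size=n)
    r = np.tanh(z)
    sigma_d = np.sqrt(S_PRE**2 + S_POST**2 - 2.0 * r * S_PRE * S_POST)
    return sigma_d.mean()

for r0 in (0.5, 0.6, 0.707, 0.75, 0.8):
    for s_r in (0.05, 0.10, 0.15):
        print(f"r = {r0:5.3f}, sigma_r = {s_r:4.2f} -> E[sigma_D] = {mean_sigma_d(r0, s_r):.3f}")
```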
Open Access Article
Individual Homogeneity Learning in Density Data Response Additive Models
by Zixuan Han, Tao Li, Jinhong You and Narayanaswamy Balakrishnan
Stats 2025, 8(3), 71; https://doi.org/10.3390/stats8030071 - 9 Aug 2025
Abstract
In many complex applications, both data heterogeneity and homogeneity are present simultaneously. Overlooking either aspect can lead to misleading statistical inferences. Moreover, the increasing prevalence of complex, non-Euclidean data calls for more sophisticated modeling techniques. To address these challenges, we propose a density data response additive model, where the response variable is represented by a distributional density function. In this framework, individual effect curves are assumed to be homogeneous within groups but heterogeneous across groups, while covariates that explain variation share common additive bivariate functions. We begin by applying a transformation to map density functions into a linear space. To estimate the unknown subject-specific functions and the additive bivariate components, we adopt a B-spline series approximation method. Latent group structures are uncovered using a hierarchical agglomerative clustering algorithm, which allows our method to recover the true underlying groupings with high probability. To further improve estimation efficiency, we develop refined spline-backfitted local linear estimators for both the grouped structures and the additive bivariate functions in the post-grouping model. We also establish the asymptotic properties of the proposed estimators, including their convergence rates, asymptotic distributions, and post-grouping oracle efficiency. The effectiveness of our method is demonstrated through extensive simulation studies and real-world data analysis, both of which show promising and robust performance.
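A small sketch of the grouping step only, assuming the subject-specific effect curves have already been summarized by coefficient vectors (here simulated as a hypothetical two-group configuration); hierarchical agglomerative clustering is then used to recover the latent groups, in the spirit of the approach described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical B-spline coefficient vectors of estimated subject-specific curves
# (one row per subject): two latent groups of 30 subjects each.
rng = np.random.default_rng(4)
coefs = np.vstack([rng.normal(0.0, 0.1, size=(30, 8)),
                   rng.normal(1.0, 0.1, size=(30, 8))])

# Hierarchical agglomerative clustering to recover the latent group structure.
Z = linkage(coefs, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])   # sizes of the recovered groups
```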
Open Access Article
Unraveling Similarities and Differences Between Non-Negative Garrote and Adaptive Lasso: A Simulation Study in Low- and High-Dimensional Data
by Edwin Kipruto and Willi Sauerbrei
Stats 2025, 8(3), 70; https://doi.org/10.3390/stats8030070 - 6 Aug 2025
Abstract
Penalized regression methods are widely used for variable selection. Non-negative garrote (NNG) was one of the earliest methods to combine variable selection with shrinkage of regression coefficients, followed by lasso. About a decade after the introduction of lasso, adaptive lasso (ALASSO) was proposed to address lasso's limitations. ALASSO has two tuning parameters (λ and γ), and its penalty resembles that of NNG when γ = 1, though NNG imposes additional constraints. Given ALASSO's greater flexibility, which may increase instability, this study investigates whether NNG provides any practical benefit or can be replaced by ALASSO. We conducted simulations in both low- and high-dimensional settings to compare selected variables, coefficient estimates, and prediction accuracy. Ordinary least squares and ridge estimates were used as initial estimates. NNG and ALASSO (γ = 1) showed similar performance in low-dimensional settings with low correlation, large samples, and moderate to high R². However, under high correlation, small samples, and low R², their selected variables and estimates differed, though prediction accuracy remained comparable. When γ differed from 1, the differences between NNG and ALASSO became more pronounced, with ALASSO generally performing better. Assuming linear relationships between predictors and the outcome, the results suggest that NNG may offer no practical advantage over ALASSO. The parameter γ in ALASSO allows for adaptability to model complexity, making ALASSO a more flexible and practical alternative to NNG.
(This article belongs to the Section Statistical Methods)
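As a sketch of the ALASSO estimator discussed above (not the authors' simulation code), adaptive lasso can be solved as a plain lasso on rescaled predictors; with OLS initial estimates and γ = 1 the penalty mirrors NNG's, apart from NNG's sign constraints. The data and settings below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 1.0] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(size=n)

# Step 1: initial estimates (OLS here; ridge is the usual choice when p > n).
beta_init = LinearRegression().fit(X, y).coef_

# Step 2: adaptive weights w_j = 1 / |beta_init_j|^gamma; gamma = 1 mirrors NNG's penalty.
gamma = 1.0
w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)

# Step 3: ALASSO = plain lasso on rescaled columns X_j / w_j, coefficients rescaled back.
fit = LassoCV(cv=5).fit(X / w, y)
beta_alasso = fit.coef_ / w
print(np.round(beta_alasso, 3))
```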
Open Access Review
Archimedean Copulas: A Useful Approach in Biomedical Data—A Review with an Application in Pediatrics
by Giulia Risca, Stefania Galimberti, Paola Rebora, Alessandro Cattoni, Maria Grazia Valsecchi and Giulia Capitoli
Stats 2025, 8(3), 69; https://doi.org/10.3390/stats8030069 - 1 Aug 2025
Abstract
Many applications in health research involve the analysis of multivariate distributions of random variables. In this paper, we review the basic theory of copulas to illustrate their advantages in deriving a joint distribution from given marginal distributions, with a specific focus on bivariate cases. Particular attention is given to the Archimedean family of copulas, which includes widely used functions such as Clayton and Gumbel–Hougaard, characterized by a single association parameter and a relatively simple structure. This work differs from previous reviews by providing a focused overview of applied studies in biomedical research that have employed Archimedean copulas, due to their flexibility in modeling a wide range of dependence structures. Their ease of use and ability to accommodate rotated forms make them suitable for various biomedical applications, including those involving survival data. We briefly present the most commonly used methods for estimation and model selection of copula functions, with the purpose of introducing these tools within the broader framework. Several recent examples in the health literature, and an original example of a pediatric study, demonstrate the applicability of Archimedean copulas and suggest that this approach, although still not widely adopted, can be useful in many biomedical research settings.
(This article belongs to the Section Statistical Methods)
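For readers unfamiliar with the family, the sketch below samples from a Clayton copula (one of the Archimedean copulas reviewed) by conditional inversion and checks the known relation tau = theta / (theta + 2) for Kendall's tau; it is only an illustration, not code from the review.

```python
import numpy as np
from scipy.stats import kendalltau

def sample_clayton(theta, n, rng=None):
    """Draw n pairs (u, v) from a Clayton copula (theta > 0) by conditional inversion:
    v = [u^(-theta) * (w^(-theta/(theta+1)) - 1) + 1]^(-1/theta), with u, w ~ U(0, 1)."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = (u ** (-theta) * (w ** (-theta / (theta + 1.0)) - 1.0) + 1.0) ** (-1.0 / theta)
    return u, v

# Kendall's tau for a Clayton copula is theta / (theta + 2); check it empirically.
theta = 2.0
u, v = sample_clayton(theta, n=50_000)
print("theoretical tau:", theta / (theta + 2), "empirical tau:", round(kendalltau(u, v)[0], 3))
```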
Open Access Article
Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics
by Klaus Lehmann, Elio Villaseñor, Alejandro Pimentel, Javiera Preuss, Nicolás Berhó, Oswaldo Diaz and Ignacio Agloni
Stats 2025, 8(3), 68; https://doi.org/10.3390/stats8030068 - 30 Jul 2025
Abstract
This paper presents the implementation of a language model–based strategy for the automatic codification of crime narratives for the production of official statistics. To address the high workload and inconsistencies associated with manual coding, we developed and evaluated three models: an XGBoost classifier with bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). Deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification, incorporating a certainty threshold to distinguish between cases suitable for automation and those requiring expert review. This hybrid strategy led to a 68.4% reduction in manual review workload while preserving high-quality standards. This study represents the first documented application of deep learning for the automated classification of victimization narratives in official statistics, demonstrating its feasibility and impact in a real-world production environment. Our results demonstrate that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
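A minimal sketch of the certainty-threshold idea described above: cases whose top predicted probability clears a threshold are auto-coded, the rest are routed to expert review. The 0.9 threshold and the toy probabilities are assumptions for illustration, not the ENUSC production values.

```python
import numpy as np

def route_narratives(class_probs, threshold=0.9):
    """Split predictions into auto-coded cases and cases sent to expert review.
    class_probs: (n_cases, n_codes) array of predicted class probabilities."""
    class_probs = np.asarray(class_probs)
    confidence = class_probs.max(axis=1)      # certainty of the top predicted code
    predicted = class_probs.argmax(axis=1)
    auto = confidence >= threshold
    return predicted[auto], np.flatnonzero(~auto)

probs = np.array([[0.97, 0.02, 0.01],   # confident -> auto-coded
                  [0.55, 0.30, 0.15]])  # uncertain -> manual review
auto_codes, review_idx = route_narratives(probs, threshold=0.9)
print(auto_codes, review_idx)
```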
Open Access Article
Requiem for Olympic Ethics and Sports’ Independence
by Fabio Zagonari
Stats 2025, 8(3), 67; https://doi.org/10.3390/stats8030067 - 28 Jul 2025
Abstract
This paper suggests a theoretical framework to summarise the empirical literature on the relationships between sports and both religious and secular ethics, and it suggests two interrelated theoretical models to empirically evaluate the extent to which religious and secular ethics, as well as sports policies, affect achievements in sports. I identified two national ethics (national pride/efficiency) and two social ethics (social cohesion/ethics) by measuring achievements in terms of alternative indexes based on Olympic medals. I referred to three empirical models and applied three estimation methods (panel Poisson, Data Envelopment, and Stochastic Frontier Analyses). I introduced two sports policies (a quantitative policy aimed at social cohesion and a qualitative policy aimed at national pride), by distinguishing sports in terms of four possibly different ethics to be used for the eight summer and eight winter Olympic Games from 1994 to 2024. I applied income level, health status, and income inequality, to depict alternative social contexts. I used five main religions and three educational levels to depict alternative ethical contexts. I applied country dummies to depict alternative institutional contexts. Empirical results support the absence of Olympic ethics, the potential substitution of sport and secular ethics in providing social cohesion, and the dependence of sports on politics, while alternative social contexts have different impacts on alternative sport achievements.
(This article belongs to the Special Issue Ethicametrics)
Open Access Article
Proximal Causal Inference for Censored Data with an Application to Right Heart Catheterization Data
by Yue Hu, Yuanshan Gao and Minhao Qi
Stats 2025, 8(3), 66; https://doi.org/10.3390/stats8030066 - 22 Jul 2025
Abstract
In observational causal inference studies, unmeasured confounding remains a critical threat to the validity of effect estimates. While proximal causal inference (PCI) has emerged as a powerful framework for mitigating such bias through proxy variables, existing PCI methods cannot directly handle censored data. This article develops a unified proximal causal inference framework that simultaneously addresses unmeasured confounding and right-censoring challenges, extending the proximal causal inference literature. Our key contributions are twofold: (i) We propose novel identification strategies and develop two distinct estimators for the censored-outcome bridge function and treatment confounding bridge function, resolving the fundamental challenge of unobserved outcomes; (ii) To improve robustness against model misspecification, we construct a robust proximal estimator and establish uniform consistency for all proposed estimators under mild regularity conditions. Through comprehensive simulations, we demonstrate the finite-sample performance of our methods, followed by an empirical application evaluating right heart catheterization effectiveness in critically ill ICU patients.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
Open Access Article
Local Stochastic Correlation Models for Derivative Pricing
by Marcos Escobar-Anel
Stats 2025, 8(3), 65; https://doi.org/10.3390/stats8030065 - 18 Jul 2025
Abstract
This paper reveals a simple methodology to create local-correlation models suitable for the closed-form pricing of two-asset financial derivatives. The multivariate models are built to ensure two conditions. First, marginals follow desirable processes, e.g., we choose the Geometric Brownian Motion (GBM), popular for stock prices. Second, the payoff of the derivative should follow a desired one-dimensional process. These conditions lead to a specific choice of the dependence structure in the form of a local-correlation model. Two popular multi-asset options are entertained: a spread option and a basket option.
(This article belongs to the Section Applied Stochastic Models)
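As context for the pricing targets, the sketch below Monte Carlo-prices a spread option on two GBM marginals under a constant correlation; the paper's contribution is precisely to replace this constant correlation with a local-correlation structure that yields closed-form prices, which is not reproduced here. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
s1_0, s2_0, r, T = 100.0, 95.0, 0.02, 1.0      # spots, risk-free rate, maturity
sig1, sig2, rho, K = 0.25, 0.30, 0.5, 5.0      # volatilities, constant correlation, strike
n_paths = 200_000

# Terminal values of two correlated GBM marginals under the risk-neutral measure.
z1 = rng.standard_normal(n_paths)
z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
s1_T = s1_0 * np.exp((r - 0.5 * sig1**2) * T + sig1 * np.sqrt(T) * z1)
s2_T = s2_0 * np.exp((r - 0.5 * sig2**2) * T + sig2 * np.sqrt(T) * z2)

# Discounted spread-option payoff max(S1 - S2 - K, 0).
price = np.exp(-r * T) * np.maximum(s1_T - s2_T - K, 0.0).mean()
print(round(price, 4))
```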
Open Access Article
Machine Learning Ensemble Algorithms for Classification of Thyroid Nodules Through Proteomics: Extending the Method of Shapley Values from Binary to Multi-Class Tasks
by Giulia Capitoli, Simone Magnaghi, Andrea D'Amicis, Camilla Vittoria Di Martino, Isabella Piga, Vincenzo L'Imperio, Marco Salvatore Nobile, Stefania Galimberti and Davide Paolo Bernasconi
Stats 2025, 8(3), 64; https://doi.org/10.3390/stats8030064 - 16 Jul 2025
Abstract
The need to improve medical diagnosis is of utmost importance in medical research, consisting of the optimization of accurate classification models able to assist clinical decisions. To minimize the errors that can be caused by using a single classifier, the voting ensemble technique can be used, combining the classification results of different classifiers to improve the final classification performance. This paper aims to compare the existing voting ensemble techniques with a new game-theory-derived approach based on Shapley values. We extended this method, originally developed for binary tasks, to the multi-class setting in order to capture complementary information provided by different classifiers. In heterogeneous clinical scenarios such as thyroid nodule diagnosis, where distinct models may be better suited to identify specific subtypes (e.g., benign, malignant, or inflammatory lesions), ensemble strategies capable of leveraging these strengths are particularly valuable. The motivating application focuses on the classification of thyroid cancer nodules whose cytopathological clinical diagnosis is typically characterized by a high number of false positive cases that may result in unnecessary thyroidectomy. We apply and compare the performance of seven individual classifiers, along with four ensemble voting techniques (including Shapley values), in a real-world study focused on classifying thyroid cancer nodules using proteomic features obtained through mass spectrometry. Our results indicate a slight improvement in the classification accuracy for ensemble systems compared to the performance of single classifiers. Although the Shapley value-based voting method remains comparable to the other voting methods, we envision this new ensemble approach could be effective in improving the performance of single classifiers in further applications, especially when complementary algorithms are considered in the ensemble. The application of these techniques can lead to the development of new tools to assist clinicians in diagnosing thyroid cancer using proteomic features derived from mass spectrometry.
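A compact sketch of one way to attribute ensemble accuracy to individual classifiers with Shapley values, using majority voting as the coalition value; the paper's specific Shapley-based voting rule may differ, and the zero value assigned to the empty coalition is a convention assumed here.

```python
import numpy as np
from itertools import combinations
from math import factorial

def coalition_accuracy(preds, y, members):
    """Accuracy of the majority vote of the classifiers in `members`."""
    if not members:
        return 0.0                                   # value of the empty coalition (a convention)
    votes = preds[list(members)]                     # (|S|, n_cases) integer class labels
    majority = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
    return float(np.mean(majority == y))

def classifier_shapley_values(preds, y):
    """Shapley value of each classifier's contribution to ensemble accuracy."""
    m = preds.shape[0]
    phi = np.zeros(m)
    for i in range(m):
        rest = [j for j in range(m) if j != i]
        for k in range(m):
            for S in combinations(rest, k):
                w = factorial(k) * factorial(m - k - 1) / factorial(m)
                phi[i] += w * (coalition_accuracy(preds, y, set(S) | {i})
                               - coalition_accuracy(preds, y, set(S)))
    return phi

# Toy example: 3 classifiers, 3 classes, 200 cases, classifiers of decreasing quality.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, 200)
preds = np.stack([np.where(rng.random(200) < acc, y, rng.integers(0, 3, 200))
                  for acc in (0.9, 0.7, 0.5)])
print(np.round(classifier_shapley_values(preds, y), 3))
```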
Open Access Communication
Beyond Expectations: Anomalies in Financial Statements and Their Application in Modelling
by Roman Blazek and Lucia Duricova
Stats 2025, 8(3), 63; https://doi.org/10.3390/stats8030063 - 15 Jul 2025
Cited by 1
Abstract
The increasing complexity of financial reporting has enabled the implementation of innovative accounting practices that often obscure a company's actual performance. This project seeks to uncover manipulative behaviours by constructing an anomaly detection model that utilises unsupervised machine learning techniques. We examined a dataset of 149,566 Slovak firms from 2016 to 2023, which included 12 financial parameters. Utilising TwoStep and K-means clustering in IBM SPSS, we discerned patterns of normative financial activity and computed an abnormality index for each firm. Entities with the most significant deviation from cluster centroids were identified as suspicious. The model attained a silhouette score of 1.0, signifying outstanding clustering quality. We discovered a total of 231 anomalous firms, predominantly concentrated in sectors C (32.47%), G (13.42%), and L (7.36%). Our research indicates that anomaly-based models can markedly enhance the precision of fraud detection, especially in scenarios with scarce labelled data. The model integrates intricate data processing and delivers an exhaustive study of the regional and sectoral distribution of anomalies, thereby increasing its relevance in practical applications.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
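A minimal sketch of the distance-to-centroid abnormality index on simulated data (the real study uses 12 financial parameters for 149,566 firms and IBM SPSS; the cluster count, quantile cut-off, and data here are assumptions for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical matrix of 12 financial ratios per firm (rows = firms).
rng = np.random.default_rng(7)
X = rng.normal(size=(1_000, 12))

X_std = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_std)

# Abnormality index: distance of each firm from its own cluster centroid.
dist = np.linalg.norm(X_std - km.cluster_centers_[km.labels_], axis=1)
suspicious = np.flatnonzero(dist > np.quantile(dist, 0.995))   # flag the most extreme firms
print(len(suspicious), "firms flagged")
```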
Open Access Article
The Extended Kumaraswamy Model: Properties, Risk Indicators, Risk Analysis, Regression Model, and Applications
by Morad Alizadeh, Gauss M. Cordeiro, Gabriela M. Rodrigues, Edwin M. M. Ortega and Haitham M. Yousof
Stats 2025, 8(3), 62; https://doi.org/10.3390/stats8030062 - 14 Jul 2025
Abstract
We propose a new unit distribution, study its properties, and provide an important application in the field of geology through a set of risk indicators. We test its practicality through two applications to real data, make comparisons with the well-known beta and Kumaraswamy distributions, and estimate the parameters of the new distribution in different ways. We provide a new regression model and apply it in statistical prediction operations for residence times data.
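For reference, the baseline two-parameter Kumaraswamy distribution on (0, 1), which the proposed model extends and against which it is compared (the extended model itself is defined in the article), has cdf and pdf

```latex
F(x) = 1 - \left(1 - x^{a}\right)^{b}, \qquad
f(x) = a\,b\,x^{a-1}\left(1 - x^{a}\right)^{b-1}, \qquad 0 < x < 1,\; a, b > 0.
```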
Open Access Article
Continuity Correction and Standard Error Calculation for Testing in Proportional Hazards Models
by Daniel Baumgartner and John E. Kolassa
Stats 2025, 8(3), 61; https://doi.org/10.3390/stats8030061 - 14 Jul 2025
Abstract
Standard asymptotic inference for proportional hazards models is conventionally performed by calculating a standard error for the estimate and comparing the estimate divided by the standard error to a standard normal distribution. In this paper, we compare various standard error estimates, including those based on the inverse observed information, the inverse expected information, and the jackknife. Furthermore, correction for continuity is compared to omitting this correction. We find that correction for continuity represents an important improvement in the quality of approximation, and furthermore note that the usual naive standard error yields a distribution closer to normality, as measured by skewness and kurtosis, than any of the other standard errors investigated.
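A small sketch of one of the compared standard errors, the jackknife, for a single binary covariate in a Cox model (assuming the lifelines package is available; the simulated data, censoring scheme, and sample size are illustrative and not from the paper).

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n = 100
x = rng.binomial(1, 0.5, n)
t = rng.exponential(scale=np.exp(-0.7 * x))          # true log hazard ratio 0.7
c = np.quantile(t, 0.8)                              # administrative censoring time
df = pd.DataFrame({"T": np.minimum(t, c), "E": (t <= c).astype(int), "x": x})

def cox_coef(d):
    return CoxPHFitter().fit(d, duration_col="T", event_col="E").params_["x"]

beta_hat = cox_coef(df)
# Jackknife standard error: refit the model with each subject deleted in turn.
jack = np.array([cox_coef(df.drop(index=i)) for i in df.index])
se_jack = np.sqrt((n - 1) / n * np.sum((jack - jack.mean()) ** 2))
print(round(beta_hat, 3), round(se_jack, 3))
```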
Open Access Article
Some Useful Techniques for High-Dimensional Statistics
by David J. Olive
Stats 2025, 8(3), 60; https://doi.org/10.3390/stats8030060 - 13 Jul 2025
Abstract
High-dimensional statistics are used when n/p is not large, where n is the sample size and p is the number of predictors. Useful techniques include (a) use of a sparse fitted model, (b) use of principal component analysis for dimension reduction, (c) use of alternative multivariate dispersion estimators instead of the sample covariance matrix, (d) elimination of weak predictors, and (e) stacking of low-dimensional estimators into a vector. Some variants and theory for these techniques will be given or reviewed.
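Two of the listed techniques, (b) principal-component dimension reduction and (d) eliminating weak predictors by marginal screening, sketched in a toy p > n setting; all values are illustrative assumptions, not from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Hypothetical p > n setting.
rng = np.random.default_rng(5)
n, p = 80, 300
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# (b) principal component regression: reduce X to a few components, then fit OLS.
pca = PCA(n_components=5).fit(X)
pcr = LinearRegression().fit(pca.transform(X), y)

# (d) eliminate weak predictors by marginal correlation screening before modelling.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
keep = np.argsort(corr)[-10:]                      # retain the 10 strongest predictors
ols_screened = LinearRegression().fit(X[:, keep], y)
print(round(pcr.score(pca.transform(X), y), 3), round(ols_screened.score(X[:, keep], y), 3))
```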
Open Access Article
The Detection Method of the Tobit Model in a Dataset
by El ouali Rahmani and Mohammed Benmoumen
Stats 2025, 8(3), 59; https://doi.org/10.3390/stats8030059 - 12 Jul 2025
Abstract
This article proposes an extension of detection methods for the Tobit model by generalizing existing approaches from cases with known parameters to more realistic scenarios where the parameters are unknown. The main objective is to develop detection procedures that account for parameter uncertainty and to analyze how this uncertainty affects the estimation process and the overall accuracy of the model. The methodology relies on maximum likelihood estimation, applied to datasets generated under different configurations of the Tobit model. A series of Monte Carlo simulations is conducted to evaluate the performance of the proposed methods. The results provide insights into the robustness of the detection procedures under varying assumptions. The study concludes with practical recommendations for improving the application of the Tobit model in fields such as econometrics, health economics, and environmental studies.
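A minimal sketch of the maximum-likelihood machinery the detection procedures build on: the log-likelihood of a Tobit model left-censored at zero, maximized on simulated data (the detection step itself, and the paper's simulation settings, are not reproduced here).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, X, y):
    """Negative log-likelihood of a Tobit model left-censored at zero.
    params = (beta_0, ..., beta_{p-1}, log_sigma)."""
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    cens = y <= 0
    ll = np.sum(norm.logpdf((y[~cens] - xb[~cens]) / sigma) - np.log(sigma))
    ll += np.sum(norm.logcdf(-xb[cens] / sigma))
    return -ll

# Simulate a censored outcome and recover the parameters by maximum likelihood.
rng = np.random.default_rng(3)
n = 2_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = np.maximum(X @ np.array([0.5, 1.0]) + rng.normal(size=n), 0.0)
res = minimize(tobit_negloglik, x0=np.zeros(X.shape[1] + 1), args=(X, y), method="BFGS")
print(np.round(res.x[:-1], 3), round(float(np.exp(res.x[-1])), 3))
```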
Open Access Article
Quantile Estimation Based on the Log-Skew-t Linear Regression Model: Statistical Aspects, Simulations, and Applications
by Raúl Alejandro Morán-Vásquez, Anlly Daniela Giraldo-Melo and Mauricio A. Mazo-Lopera
Stats 2025, 8(3), 58; https://doi.org/10.3390/stats8030058 - 11 Jul 2025
Abstract
We propose a robust linear regression model assuming a log-skew-t distribution for the response variable, with the aim of exploring the association between the covariates and the quantiles of a continuous and positive response variable under skewness and heavy tails. This model includes the log-skew-normal and log-t linear regression models as special cases. Our simulation studies indicate good performance of the quantile estimation approach and its outperformance relative to the classical quantile regression model. The practical applicability of our methodology is demonstrated through an analysis of two real datasets.
(This article belongs to the Special Issue Robust Statistics in Action II)
Open Access Article
Well Begun Is Half Done: The Impact of Pre-Processing in MALDI Mass Spectrometry Imaging Analysis Applied to a Case Study of Thyroid Nodules
by Giulia Capitoli, Kirsten C. J. van Abeelen, Isabella Piga, Vincenzo L'Imperio, Marco S. Nobile, Daniela Besozzi and Stefania Galimberti
Stats 2025, 8(3), 57; https://doi.org/10.3390/stats8030057 - 10 Jul 2025
Cited by 1
Abstract
The discovery of proteomic biomarkers in cancer research can be effectively performed in situ by exploiting Matrix-Assisted Laser Desorption Ionization (MALDI) Mass Spectrometry Imaging (MSI). However, due to experimental limitations, the spectra extracted by MALDI-MSI can be noisy, so pre-processing steps are generally needed to reduce the instrumental and analytical variability. Thus far, the importance and the effect of standard pre-processing methods, as well as their combinations and parameter settings, have not been extensively investigated in proteomics applications. In this work, we present a systematic study of 15 combinations of pre-processing steps—including baseline correction, smoothing, normalization, and peak alignment—for a real-data classification task on MALDI-MSI data measured from fine-needle aspiration biopsies of thyroid nodules. The influence of each combination was assessed by analyzing the feature extraction, pixel-by-pixel classification probabilities, and LASSO classification performance. Our results highlight the necessity of fine-tuning a pre-processing pipeline, especially for the reliable transfer of molecular diagnostic signatures in clinical practice. We outline some recommendations on the selection of pre-processing steps, together with filter levels and alignment methods, according to the mass-to-charge range and heterogeneity of data.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
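One illustrative pre-processing chain of the kind compared in the paper (smoothing, a crude baseline subtraction, and total-ion-current normalization for a single spectrum); the specific filters, window sizes, and the alignment step are assumptions here, not the combinations evaluated in the study.

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess_spectrum(intensities, smooth_window=11, smooth_order=3, base_window=50):
    """Illustrative chain for one spectrum: smoothing, crude baseline subtraction,
    then total-ion-current (TIC) normalization. Peak alignment across spectra is omitted."""
    smoothed = savgol_filter(intensities, smooth_window, smooth_order)
    baseline = np.array([smoothed[max(0, i - base_window):i + base_window].min()
                         for i in range(len(smoothed))])
    corrected = np.clip(smoothed - baseline, 0.0, None)
    tic = corrected.sum()
    return corrected / tic if tic > 0 else corrected

spectrum = np.abs(np.random.default_rng(0).normal(size=2_000)) + 5.0
print(preprocess_spectrum(spectrum)[:5])
```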
Open Access Article
On the Decision-Theoretic Foundations and the Asymptotic Bayes Risk of the Region of Practical Equivalence for Testing Interval Hypotheses
by Riko Kelter
Stats 2025, 8(3), 56; https://doi.org/10.3390/stats8030056 - 8 Jul 2025
Abstract
Testing interval hypotheses is of huge relevance in the biomedical and cognitive sciences; for example, in clinical trials. Frequentist approaches include the proposal of equivalence tests, which have been used to study if there is a predetermined meaningful treatment effect. In the Bayesian paradigm, two popular approaches exist: The first is the region of practical equivalence (ROPE), which has become increasingly popular in the cognitive sciences. The second is the Bayes factor for interval null hypotheses, which was proposed by Morey et al. One advantage of the ROPE procedure is that, in contrast to the Bayes factor, it is quite robust to the prior specification. However, while the ROPE is conceptually appealing, it lacks a clear decision-theoretic foundation like the Bayes factor. In this paper, a decision-theoretic justification for the ROPE procedure is derived for the first time, which shows that the Bayes risk of a decision rule based on the highest-posterior density interval (HPD) and the ROPE is asymptotically minimized for increasing sample size. To show this, a specific loss function is introduced. This result provides an important decision-theoretic justification for testing the interval hypothesis in the Bayesian approach based on the ROPE and HPD, in particular, when sample size is large.
(This article belongs to the Section Bayesian Methods)
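A minimal sketch of the ROPE-plus-HPD decision rule the paper studies, applied to posterior draws; the ROPE limits, credibility mass, and posterior below are illustrative assumptions.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior draws."""
    x = np.sort(np.asarray(samples))
    k = int(np.floor(mass * len(x)))
    widths = x[k:] - x[:len(x) - k]
    i = int(np.argmin(widths))
    return x[i], x[i + k]

def rope_decision(samples, rope=(-0.1, 0.1), mass=0.95):
    """Accept the interval null if the HPD lies inside the ROPE, reject it if they
    do not overlap, and remain undecided otherwise."""
    lo, hi = hpd_interval(samples, mass)
    if rope[0] <= lo and hi <= rope[1]:
        return "accept interval null (practical equivalence)"
    if hi < rope[0] or lo > rope[1]:
        return "reject interval null"
    return "undecided"

posterior = np.random.default_rng(0).normal(loc=0.03, scale=0.02, size=10_000)
print(rope_decision(posterior))
```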
Open Access Article
Estimating the Ratio of Means in a Zero-Inflated Poisson Mixture Model
by Michael Pearce and Michael D. Perlman
Stats 2025, 8(3), 55; https://doi.org/10.3390/stats8030055 - 5 Jul 2025
Abstract
The problem of estimating the ratio of the means of a two-component Poisson mixture model is considered, when each component is subject to zero-inflation, i.e., excess zero counts. The resulting zero-inflated Poisson mixture (ZIPM) model can be viewed as a three-component Poisson mixture model with one degenerate component. The EM algorithm is applied to obtain frequentist estimators and their standard errors, the latter determined via an explicit expression for the observed information matrix. As an intermediate step, we derive an explicit expression for standard errors in the two-component Poisson mixture model (without zero-inflation), a new result. The ZIPM model is applied to simulated data and real ecological count data of frigatebirds on the Coral Sea Islands off the coast of Northeast Australia.
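A small sketch of the EM iteration for the two-component Poisson mixture that underlies the ZIPM (without the zero-inflation component and without the standard-error computation derived in the paper); the data and starting values are illustrative.

```python
import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(y, n_iter=200):
    """EM for a two-component Poisson mixture; the ZIPM adds a third component
    degenerate at zero on top of this."""
    y = np.asarray(y)
    lam = np.array([0.5 * y.mean(), 1.5 * y.mean()])   # crude starting values
    pi = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each count came from component 1.
        d1 = pi * poisson.pmf(y, lam[0])
        d2 = (1.0 - pi) * poisson.pmf(y, lam[1])
        w = d1 / (d1 + d2)
        # M-step: update the mixing weight and the component means.
        pi = w.mean()
        lam = np.array([np.sum(w * y) / np.sum(w),
                        np.sum((1.0 - w) * y) / np.sum(1.0 - w)])
    return pi, lam

rng = np.random.default_rng(11)
y = np.concatenate([rng.poisson(2.0, 600), rng.poisson(8.0, 400)])
pi_hat, lam_hat = em_poisson_mixture(y)
print(round(pi_hat, 3), np.round(lam_hat, 3), "ratio of means:", round(lam_hat[0] / lam_hat[1], 3))
```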
Open Access Article
A Data-Driven Approach of DRG-Based Medical Insurance Payment Policy Formulation in China Based on an Optimization Algorithm
by Kun Ba and Biqing Huang
Stats 2025, 8(3), 54; https://doi.org/10.3390/stats8030054 - 30 Jun 2025
Abstract
The diagnosis-related group (DRG) system classifies patients into different groups in order to facilitate decisions regarding medical insurance payments. Currently, more than 600 standard DRGs exist in China. Payment details represented by DRG weights must be adjusted during decision-making. After modeling the DRG weight-determining process as a parameter-searching and optimization-solving problem, we propose a stochastic gradient tracking algorithm (SGT) and compare it with a genetic algorithm and sequential quadratic programming. We describe diagnosis-related groups in China using several statistics based on sample data from one city. We explored the influence of the SGT hyperparameters through numerous experiments and demonstrated the robustness of the best SGT hyperparameter combination. Our stochastic gradient tracking algorithm finished the parameter search in only 3.56 min when the insurance payment rate was set at 95%, which is acceptable and desirable. As the main medical insurance payment scheme in China, DRGs require quantitative evidence for policymaking. The optimization algorithm proposed in this study shows a possible scientific decision-making method for use in the DRG system, particularly with regard to DRG weights.
Topics
Topic in
JPM, Mathematics, Applied Sciences, Stats, Healthcare
Application of Biostatistics in Medical Sciences and Global Health
Topic Editors: Bogdan Oancea, Adrian Pană, Cǎtǎlina Liliana Andrei
Deadline: 31 October 2026

Special Issues
Special Issue in
Stats
Benford's Law(s) and Applications (Second Edition)
Guest Editors: Marcel Ausloos, Roy Cerqueti, Claudio Lupi
Deadline: 31 October 2025
Special Issue in
Stats
Nonparametric Inference: Methods and Applications
Guest Editor: Stefano Bonnini
Deadline: 28 November 2025
Special Issue in
Stats
Robust Statistics in Action II
Guest Editor: Marco Riani
Deadline: 31 December 2025