Stats

Stats, Vol. 9, Pages 68: Pertinent Prediction Intervals in Linear Regression

Dimitris N. Politis — 2026-06-25

Authors: Dimitris N. Politis

In linear regression, a point predictor $\hat{Y}_f$ of a future response $Y_f$ associated with a regressor value of interest $\underline{x}_f$ can easily be constructed. Since $\hat{Y}_f$ will always incur a prediction error, it is desirable to accompany the point predictor with a prediction interval, say $C(\underline{x}_f)$, that will contain the target $Y_f$ with a pre-specified high probability, e.g., 90%. An estimated prediction interval, say $\hat{C}(\underline{x}_f)$, is called pertinent if its construction incorporates the variability of all estimators that are employed in the prediction problem. So far, pertinent prediction intervals have only been constructed via some form of bootstrap. However, resampling can be quite computationally expensive since the estimation/prediction problem has to be re-calculated on a large number of pseudo-scatterplots, each having the same sample size as the original one. The paper at hand proposes a short-cut that directly employs the asymptotic normal distribution of relevant estimators, as opposed to a bootstrap histogram, in order to capture their variability. The resulting prediction interval achieves pertinence without full-scale resampling, thus offering computational savings of orders of magnitude.
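As a point of reference for the quantities named in this abstract, the textbook large-sample prediction interval for simple linear regression can be sketched as follows. This is not the construction proposed in the paper; it is a minimal sketch assuming a single regressor, homoscedastic errors, and a normal quantile. The function name `prediction_interval` and the toy data are hypothetical. The `1 / n + (x_f - x_bar)^2 / s_xx` term is where the variability of the estimated coefficients enters the interval.

```python
from math import sqrt
from statistics import NormalDist, mean

def prediction_interval(x, y, x_f, level=0.90):
    """Normal-approximation prediction interval for Y_f at regressor value x_f."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)
    y_hat = intercept + slope * x_f
    # Error variance (the leading 1) plus the variance of the fitted value.
    se_pred = sqrt(s2 * (1 + 1 / n + (x_f - x_bar) ** 2 / s_xx))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return y_hat - z * se_pred, y_hat + z * se_pred

# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
lo_mid, hi_mid = prediction_interval(x, y, x_f=3.5)
lo_far, hi_far = prediction_interval(x, y, x_f=10.0)
```

The interval at `x_f=10.0` is wider than the one at `x_f=3.5` because the fitted value is less precisely estimated far from the centre of the regressor values.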

Stats, Vol. 9, Pages 69: Statistics of Non-Conserved Observables in Lindblad Master Equations

Giovanni Modanese — 2026-06-25

Stats doi: 10.3390/stats9040069

Authors: Giovanni Modanese

We study the dynamics of observables that are conserved under the Hamiltonian evolution of a closed quantum system, but cease to be conserved when the system is coupled to a Markovian environment and described by a Lindblad master equation. Starting from the adjoint Lindblad equation, we derive elementary expressions for the time derivatives of the expectation value and second moment of an observable O, with particular emphasis on the case [H,O]=0 but L†(O)≠0. These formulae provide a direct assessment of how collapse operators break Hamiltonian conservation laws and generate fluctuations of formerly conserved quantities. The discussion is illustrated by analytic examples: one-qubit amplitude damping, a two-qubit excitation-number model, a momentum-diffusion model in which the mean is conserved while the variance grows, and the Jaynes–Cummings model. The latter also shows the complementary case of a reservoir coupled through a conserved quantity, where dephasing can occur without changing the statistics of that quantity. We finally comment on the relation between Lindblad source terms and idealized wave-function reduction models in which local conservation may hold only statistically.
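For orientation, the standard adjoint (Heisenberg-picture) Lindblad equation and the one-qubit amplitude-damping case can be written out as below. This is a sketch under assumptions not stated in the abstract: units with ħ = 1, a single collapse operator L = √γ σ₋, the observable n = σ₊σ₋, and a Hamiltonian commuting with n. The symbol for the dissipative part follows the abstract's L†(O); the paper's own notation and normalisation may differ.

```latex
% Adjoint Lindblad equation for the expectation value of O (\hbar = 1)
\frac{d}{dt}\langle O\rangle
  = \big\langle i[H,O] + \mathcal{L}^{\dagger}(O) \big\rangle,
\qquad
\mathcal{L}^{\dagger}(O)
  = \sum_k \Big( L_k^{\dagger} O L_k - \tfrac{1}{2}\{L_k^{\dagger}L_k,\,O\} \Big).

% Amplitude damping: L = \sqrt{\gamma}\,\sigma_-, \; n = \sigma_+\sigma_-, \; [H,n] = 0.
% Uses \sigma_+^2 = 0 and n^2 = n.
\mathcal{L}^{\dagger}(n)
  = \gamma\,\sigma_+ n\,\sigma_- - \tfrac{\gamma}{2}\{n,\,n\}
  = 0 - \gamma n
\;\Longrightarrow\;
\frac{d}{dt}\langle n\rangle = -\gamma\langle n\rangle,
\qquad
\langle n\rangle(t) = \langle n\rangle(0)\,e^{-\gamma t}.
```

Here the excitation number is conserved by the Hamiltonian alone, yet decays once the collapse operator is present, which is the situation [H,O]=0 with a non-vanishing dissipative term.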

Stats, Vol. 9, Pages 67: Changes in Variance and the Detection of Trends

Markus Neuhäuser — 2026-06-24

Stats doi: 10.3390/stats9040067

Authors: Markus Neuhäuser

Background: Tests for a trend in location are appropriate when there is an ordered alternative such as, for example, when it is assumed that the effect does not decrease with increasing doses of a drug or fertilizer. Classical trend tests for normally distributed data as well as the nonparametric Jonckheere trend test can have inflated type I error rates when variances differ between groups. Here, different approaches suggested to handle heterogeneous variances are investigated in combination with the Williams trend test. Methods: A simulation study was performed to compare the Jonckheere trend test with competing tests. The different tests were investigated for normal and non-normal data and also applied to a data set on sizes of walnuts opened by birds in various stages of a winter. Results: With one exception, all investigated trend tests can have an inflated type I error rate when variances differ. Only a nonparametric multiple contrast test based on relative effects showed an acceptable type I error rate in all scenarios considered in the simulation. Conclusions: The Williams trend test in combination with the nonparametric multiple contrast test based on relative effects can be suggested for routine use. With this procedure, an increase in variance cannot cause a significant result in the test for trend.
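The Jonckheere trend statistic mentioned above can be sketched with its usual normal approximation. This is a minimal illustration, not the simulation code of the study: the function name `jonckheere_terpstra` and the toy groups are hypothetical, and the null variance uses the no-ties formula. Its null distribution assumes the groups differ at most in location, which is why unequal variances can distort the type I error rate.

```python
from math import sqrt
from statistics import NormalDist

def jonckheere_terpstra(groups):
    """Return (J, z, one-sided p) for an increasing trend across ordered groups."""
    # J counts pairs (a from an earlier group, b from a later group) with a < b;
    # ties contribute one half.
    j_stat = 0.0
    for i in range(len(groups)):
        for k in range(i + 1, len(groups)):
            for a in groups[i]:
                for b in groups[k]:
                    j_stat += 1.0 if a < b else 0.5 if a == b else 0.0
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    # Null mean and variance (no-ties formulas).
    mean_j = (n_total ** 2 - sum(m ** 2 for m in sizes)) / 4
    var_j = (n_total ** 2 * (2 * n_total + 3)
             - sum(m ** 2 * (2 * m + 3) for m in sizes)) / 72
    z = (j_stat - mean_j) / sqrt(var_j)
    p_value = 1 - NormalDist().cdf(z)
    return j_stat, z, p_value

# Hypothetical, perfectly ordered dose groups.
j_stat, z, p_value = jonckheere_terpstra([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```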

Stats, Vol. 9, Pages 66: Mathematical Analysis and Computational Approximation of Extremal Points of the Subset of Copulas

Rachid Jaafar — 2026-06-21

Stats doi: 10.3390/stats9030066

Authors: Rachid Jaafar Ahmed Hfa Ahmed Sani

Copulas, as a new tool for statistical analysis, are studied in depth. One of the most notable aspects of this study is the geometric perspective, particularly the concept of regeneration via extreme points and the well-known result in functional analysis: the Krein–Milman theorem. The practical value of such a theoretical study is highlighted by a very interesting illustration in biomedical analysis. As the effective determination of all extremal points is quasi-impossible, we suggest an implementation method to approximate those of a finite and countable set. Computer implementations have demonstrated the effectiveness of the adopted approach.

Stats, Vol. 9, Pages 65: Large Sample Theory for Some Ridge-Type Regression Estimators

Yu Jin — 2026-06-20

Stats doi: 10.3390/stats9030065

Authors: Yu Jin David J. Olive

This paper provides a large sample theory for some ridge-type multiple linear regression estimators, including Liu-type regression estimators, when the number of predictors is fixed. Some large sample theory is also given for ridge-type generalized linear model estimators. The estimators can be used for inference and variable selection.

Stats, Vol. 9, Pages 64: A Random Activation Framework for Cure Models with Waring-Distributed Latent Causes

Jonathan K. J. Vasquez — 2026-06-19

Stats doi: 10.3390/stats9030064

Authors: Jonathan K. J. Vasquez Vera Tomazella Danilo Alvares Pedro Rafael D. Marinho Joaquín Martínez-Minaya

This paper introduces a random activation framework for cure rate modeling that provides a novel latent mechanistic interpretation of the standard mixture cure model, utilizing a Waring-distributed number of latent causes. The proposed approach represents unobserved heterogeneity through a discrete latent variable interpreted as the number of potential risk factors, providing a flexible and biologically interpretable characterization of individual susceptibility. In contrast to classical competing risks models based on extremal operators or deterministic activation schemes, the event time is assumed to arise from a stochastic selection among latent causes. This random activation mechanism defines a unified probabilistic framework in which the cure fraction emerges naturally as the probability of having zero latent causes. The Waring distribution is adopted to model the latent count structure due to its hierarchical formulation, which accommodates overdispersion and heavy-tailed behavior strictly within the latent parametrization of individual risk factors. Under this framework, while the population survival function mathematically reduces to the classical mixture cure representation, the model provides an alternative structure where covariates directly impact the expected latent burden. Parameter estimation for the identifiable regression structure is performed via maximum likelihood, and the finite-sample performance of the estimators is assessed through Monte Carlo simulations, showing accurate parameter recovery and stable inferential properties. An application to real survival data illustrates the practical relevance and epidemiological interpretability of the proposed framework. Overall, this work extends the understanding of existing cure rate models by integrating latent count structures and stochastic activation within a coherent setting, providing a powerful interpretation tool for heterogeneous survival data with long-term survivors.

Stats, Vol. 9, Pages 63: Discrimination of Geological Orientation Data with Measurement Errors

Marco Di Marzio — 2026-06-18

Stats doi: 10.3390/stats9030063

Authors: Marco Di Marzio Stefania Fensore Agnese Panzera Chiara Passamonti

Fracture orientation data in structural geology are commonly affected by non-negligible angular uncertainty, which can significantly impact the reliability of classification and interpretation of deformation patterns. In this work, we address the problem of discriminating between two groups of directional observations. To account for measurement uncertainty inherent in field data, we adopt a deconvolution-based circular kernel discriminant rule specifically designed for noisy angular observations. This approach explicitly incorporates the measurement-error mechanism into the estimation process, allowing for more robust classification in the presence of observational noise. The methodology is applied to measurements arising in structural geology, where the discrimination of fracture orientations is relevant to the interpretation of deformation patterns and to applications in rock engineering. Specifically, we consider two datasets from Ordovician turbidites, involving different types of orientation data. The first dataset consists of L01 axes, representing linear features described by Plunge–Azimuth coordinates, while the second dataset concerns axial-plane cleavage surfaces, expressed in terms of Dip and Dip direction. We assess the performance of the estimator under varying levels of angular uncertainty and alternative error distributions, with a focus on its ability to correctly separate the two geological groups. Results show that explicitly modeling measurement error leads to improved discrimination accuracy and more reliable identification of structural patterns compared to standard methods that neglect noise.

Stats, Vol. 9, Pages 62: A Comparative Study of Robust and Improved Shrinkage Estimators Under Multicollinearity and Outliers Using Multiple Performance Criteria with Application to Health Data

Nusrat Yasmin — 2026-06-17

Stats doi: 10.3390/stats9030062

Authors: Nusrat Yasmin B. M. Golam Kibria Zoran Bursac

Multicollinearity reduces the reliability of ordinary least squares by increasing variances and creating unstable estimates. This issue has led to biased and penalized regression methods such as ridge-, Liu-, and Stein-type estimators. Here, we build on existing ridge-type approaches by introducing improved ridge and Liu-type estimators, along with robust variants to handle outliers. We investigate their theoretical properties regarding bias, variance, and mean squared error. We also evaluate their performance through Monte Carlo simulations with different levels of multicollinearity and data contamination, using several evaluation criteria, including mean squared error, Akaike information criterion, mean absolute deviation, and mean absolute percentage error, along with an average-rank comparison framework applied here for the first time. We further validate our results with two health-related datasets. The findings show that the robust estimators provide more stable estimates and improved predictive performance, particularly when dealing with severe multicollinearity and outliers.
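For readers unfamiliar with the estimator families named above, the standard textbook definitions of the ordinary least squares, ridge, and Liu estimators are given below. These are the classical forms only; the improved and robust variants introduced in the paper are not reproduced here, and the tuning-parameter symbols k and d are the conventional ones rather than the paper's.

```latex
\hat{\beta}_{\mathrm{OLS}} = (X^{\top}X)^{-1}X^{\top}y,
\qquad
\hat{\beta}_{\mathrm{ridge}}(k) = (X^{\top}X + kI)^{-1}X^{\top}y, \quad k \ge 0,
\qquad
\hat{\beta}_{\mathrm{Liu}}(d) = (X^{\top}X + I)^{-1}(X^{\top}X + dI)\,\hat{\beta}_{\mathrm{OLS}}, \quad 0 < d < 1.
```

Both shrinkage forms trade a small bias for a reduction in variance when the eigenvalues of X'X are close to zero, which is the situation multicollinearity creates.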

Stats, Vol. 9, Pages 61: Defective Gamma–G Family for Cure Fraction Models: Novel Survival Methods with Applications to Cancer Data

Cynthia A. V. Tojeiro — 2026-06-17

Stats doi: 10.3390/stats9030061

Authors: Cynthia A. V. Tojeiro Vera L. D. Tomazella Agatha S. Rodrigues Pedro R. D. Marinho

In this paper, we propose two novel defective survival models within the Gamma–G family: the defective Gamma–Gompertz and the defective Gamma–Dagum distributions. In contrast to the corresponding Gamma–G mixture cure formulation, in which the Gamma–G distributional parameters are combined with an explicit cure fraction mixing parameter, the proposed defective formulation induces the cure fraction through the limiting behavior of the survival function. Thus, within the same Gamma–G baseline structure, the model avoids introducing an additional cure fraction parameter. The motivation for these new models lies in the limited set of defective distributions currently available, despite the increasing demand for flexible cure rate models in biomedical applications. By extending the defective property to the Gamma–G construction, our approach fills this methodological gap while providing models that are both interpretable and computationally efficient. We show that the Gamma–G construction preserves defectiveness whenever the baseline distribution is defective, thus establishing a coherent theoretical foundation. Both models allow covariate effects through regression structures on shape, scale, and, in the case of the Gamma–Dagum distribution, on the cure fraction parameter, resulting in flexible and interpretable specifications. Parameters are estimated via maximum likelihood, and an extensive Monte Carlo study confirms estimator consistency and accurate coverage in finite samples. The practical relevance of the models is illustrated with two large clinical datasets on melanoma and cervical cancer from the São Paulo Cancer Registry. Results reveal that the proposed models provide competitive goodness-of-fit and offer useful insights into long-term survival compared to traditional cure rate approaches. 
Overall, this work introduces a unifying and flexible framework for defective survival models, extending their applicability and delivering practical improvements over existing cure models.
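The idea of a cure fraction arising from the limiting behavior of the survival function can be illustrated with the plain Gompertz distribution, which becomes defective when its shape parameter is negative. This is a sketch of that baseline mechanism only, not of the Gamma–Gompertz or Gamma–Dagum models proposed in the paper; the function names and parameter values are hypothetical.

```python
from math import exp

def gompertz_survival(t, a, b):
    """Gompertz survival S(t) = exp(-(b / a) * (exp(a * t) - 1)), with b > 0, a != 0."""
    return exp(-(b / a) * (exp(a * t) - 1.0))

def cure_fraction(a, b):
    """Limit of S(t) as t grows; strictly between 0 and 1 only when a < 0."""
    if a >= 0:
        return 0.0  # proper distribution: survival decays to zero
    return exp(b / a)

a, b = -0.5, 0.4  # hypothetical parameter values
p_cure = cure_fraction(a, b)
s_late = gompertz_survival(50.0, a, b)
```

With a negative shape the survival curve levels off at `exp(b / a)` instead of reaching zero, so no separate mixing parameter is needed to represent long-term survivors.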

Stats, Vol. 9, Pages 60: Bootstrap-Calibrated Outlier Detection and Influence Diagnostics for Meta-Analysis: The R Package boutliers

Hisashi Noma — 2026-06-12

Stats doi: 10.3390/stats9030060

Authors: Hisashi Noma Kazushi Maruo Masahiko Gosho

Meta-analysis is a statistical tool commonly used within systematic reviews to synthesize quantitative evidence, but individual studies with atypical results or disproportionate influence can materially affect pooled estimates, heterogeneity estimates, and the conclusions drawn from evidence syntheses. Conventional outlier and influence diagnostics for meta-analysis are useful, but their interpretation often relies on asymptotic reference values or informal rules of thumb, which may be inadequate when the number of studies is limited or heterogeneity is substantial. We introduce boutliers, an R package that implements bootstrap-calibrated outlier detection and influence diagnostics for fixed-effect and random-effects meta-analysis. The package provides leave-one-study-out diagnostics based on Studentized deleted residuals, relative changes in the variance of the pooled effect estimator, and relative changes in the between-study variance, together with a likelihood-ratio diagnostic based on a mean-shifted model. For each diagnostic measure, bootstrap reference distributions, critical values, and p-values are provided to support quantitative interpretation of influential studies. We describe the statistical framework, implementation, and practical use of the package and illustrate its application using a real published meta-analysis dataset on spinal manipulative therapy for chronic low back pain. The boutliers package provides accessible tools for incorporating uncertainty-calibrated influence diagnostics into routine meta-analytic practice.
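The leave-one-study-out idea can be sketched for the fixed-effect model with inverse-variance weights. This is an illustration in Python, not the interface of the boutliers R package, and it omits the bootstrap calibration that the package provides; the function names and the three toy studies are hypothetical.

```python
def pooled_fixed_effect(effects, variances):
    """Inverse-variance weighted estimate and its variance."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    return estimate, 1.0 / total

def leave_one_out(effects, variances):
    """For each study i: pooled estimate without i and a Studentized deleted residual."""
    rows = []
    for i in range(len(effects)):
        rest_e = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        est, var = pooled_fixed_effect(rest_e, rest_v)
        # Deleted residual scaled by its standard error under the fixed-effect model.
        resid = (effects[i] - est) / (variances[i] + var) ** 0.5
        rows.append((est, resid))
    return rows

effects = [0.10, 0.20, 0.90]      # hypothetical effect sizes
variances = [0.04, 0.04, 0.04]    # hypothetical within-study variances
full_estimate, _ = pooled_fixed_effect(effects, variances)
diagnostics = leave_one_out(effects, variances)
```

In this toy example the third study pulls the pooled estimate upward, and its deleted residual is large. Deciding whether such a value is unusual is the step for which a bootstrap reference distribution replaces a rule of thumb.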

Stats, Vol. 9, Pages 59: A Two-Stage Changepoint–Copula Framework for Non-Stationary Count Time Series: Application to Tropical Cyclones

Md Iqbal Hossain — 2026-06-04

Stats doi: 10.3390/stats9030059

Authors: Md Iqbal Hossain Norou Diawara

Cross-basin tropical cyclone variability may exhibit complex, non-linear dependence structures influenced by large-scale climate modes and potential regime shifts. Reliance on traditional linear correlation measures without accounting for structural changes can therefore lead to misleading interpretations of global storm relationships. This study investigates the regional dependence structures of tropical cyclone counts across six major ocean basins (NA, ENP, WNP, NI, SI, and SP) from 1980 to 2024. We adopt a two-stage analytical framework integrating changepoint detection and copula modeling to address non-stationarity in both marginal distributions and dependence structures. First, we identify a significant structural break in the year 2000 via a penalised likelihood applied jointly to the d=6-variate Poisson series, with inter-basin dependence captured by a latent Gaussian process (the construction used by Lund et al. (2025)). This is mathematically equivalent to a Gaussian copula with Poisson margins (Genest and Nešlehová (2007)). Then, we apply bivariate copula models separately to the pre- and post-2000 regimes using the randomized probability integral transform, with results averaged over 500 replications of the auxiliary uniforms to mitigate randomization noise. The results reveal substantial non-stationarity, most notably a 59% increase in North Atlantic storm frequency and a fundamental reorganization of global dependence structures, while dependence structures evolved from primarily symmetric and weak (dominated by Gaussian and Clayton copulas) to more complex and stronger dependencies (increased Frank and Gumbel copulas). Notably, a statistically significant (p<0.001) and strong negative dependence emerged between the Southern Pacific and Northern Indian basins (τ=−0.464) in the recent regime. 
The inclusion of changepoint detection significantly improves model fit and reveals a fundamental reorganization of global tropical cyclone teleconnections, with enhanced coordination between basins in the contemporary climate regime. Modeling these regimes separately, as opposed to a single stationary period, uncovers a shift towards more complex, tail-dependent copula families (Gumbel, Clayton) in the recent era. These findings have important implications for climate risk assessment, seasonal forecasting, and understanding the impacts of climate change on global storm patterns. The proportion of Gumbel copulas (capturing upper-tail dependence) increased from 7% to 20%, while Gaussian copulas decreased from 53% to 33%, indicating more complex, extreme-value-focused dependencies in the contemporary climate. Due to small sample sizes (n1=20, n2=25), copula and dependence estimates are exploratory, not confirmatory. Interpretations reflect this power constraint, utilizing Benjamini–Hochberg adjustments for significance.
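The randomized probability integral transform for count data can be sketched as follows. This is a minimal illustration assuming a single Poisson margin with its mean estimated by the sample mean; the function names and the toy counts are hypothetical, and the study's actual marginal models and copula fitting are not reproduced.

```python
import random
from math import exp

def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam); zero for k < 0."""
    if k < 0:
        return 0.0
    term = exp(-lam)
    total = term
    for j in range(1, k + 1):
        term *= lam / j
        total += term
    return total

def randomized_pit(counts, lam, rng):
    """Map counts to (0, 1) by drawing uniformly between F(y - 1) and F(y)."""
    out = []
    for y in counts:
        lower, upper = poisson_cdf(y - 1, lam), poisson_cdf(y, lam)
        out.append(lower + rng.random() * (upper - lower))
    return out

rng = random.Random(0)
counts = [3, 7, 5, 4, 9]          # hypothetical annual counts
lam = sum(counts) / len(counts)   # Poisson mean estimated by the sample mean
replicates = [randomized_pit(counts, lam, rng) for _ in range(500)]
```

Each replicate uses fresh auxiliary uniforms, so any copula estimate computed from one replicate carries randomization noise. Computing the estimate on every replicate and averaging, as the abstract describes, reduces that noise.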

Stats, Vol. 9, Pages 58: On the Sine Inverse Lomax Burr III Distribution with Application to Monthly Actual Tax Revenue Data

Anuwoje Ida. L. Abonongo — 2026-06-03

Stats doi: 10.3390/stats9030058

Authors: Anuwoje Ida. L. Abonongo John Abonongo Samuel Asante Gyamerah

Advances in probability distributions are important for modelling complex data across fields such as actuarial science, environmental science, biomedical science, economics, finance, and insurance. Classical distributions often have limitations when dealing with highly skewed data, heavy tails, or unusual failure patterns. To address these challenges, this study introduces the Sine Inverse Lomax Burr III distribution, a new flexible model that combines the tail behaviour of the Burr III distribution with the skewness-control properties of the sine inverse transformation. Statistical properties, including quantiles, moments, moment generating functions, and order statistics, are derived. Some risk measures, including the value at risk, tail value at risk, and tail variance, are derived and studied. Parameter estimation is performed using five different estimation techniques: maximum likelihood estimation, least squares, weighted least squares, percentile matching, and Anderson–Darling. The usefulness of the proposed model is demonstrated using monthly tax revenue data. The results show that the SILBIII distribution performs better than the competing distributions. The proposed model is an alternative model suitable for modeling data in finance, actuarial, and related fields.

Stats, Vol. 9, Pages 57: A Copula-Based Framework for Multivariate Count Time Series with Mixed Marginal Distributions

Dimuthu Fernando — 2026-06-02

Stats doi: 10.3390/stats9030057

Authors: Dimuthu Fernando Yuxin Wen Wimarsha Jayanetti

We developed a class of multivariate integer-valued time series models using copula theory. Each count time series is modeled as a Markov chain, with serial dependence characterized through copula-based transition probabilities for Poisson and negative binomial marginals. Cross-sectional dependence is modeled via a trivariate Gaussian or a “t-copula”, allowing for both positive and negative correlations and providing a flexible dependence structure. Model parameters are estimated using likelihood-based inference, where the trivariate Gaussian or t-copula integrals are evaluated through standard randomized Monte Carlo methods. Simulation results, along with an analysis of annual counts of major hurricanes (Category 3+) across the North Atlantic, Eastern North Pacific, and Western North Pacific basins, demonstrate the effectiveness of the proposed model.

Stats, Vol. 9, Pages 56: Predictive Power Analysis of Multiple Test Procedures Under Arbitrary Dependence

George Karabatsos — 2026-05-29

Stats doi: 10.3390/stats9030056

Authors: George Karabatsos

Many statistical problems can be addressed by applying a multiple testing procedure (MTP) that controls either the Family-Wise Error Rate (FWER) or False Discovery Rate (FDR) under unknown arbitrarily interdependent p-values, without explicitly modeling these inter-correlations. They include the FWER-controlling Bonferroni MTP and Holm MTP; the FDR-controlling Benjamini-Yekutieli MTP; and the DP-MTP, based on a Dirichlet process (DP) prior distribution supporting the entire space of MTPs that control either the FWER or the FDR. For such an MTP, this study introduces a new and congenial method for Bayesian predictive power analysis, for power calculation and sample size determination for any given planned future (e.g., replication or interim) study. This novel MTP predictive power analysis method is based on a joint prior distribution defining a scale matrix mixture of asymmetric multivariate normal mean-variance mixture distributions, factorized as a general prior distribution for effect sizes (e.g., obtained from expert judgment or results of prior studies), and a uniform prior distribution for correlation matrices representing arbitrary dependencies between p-values of test statistics of given multiple hypothesis tests under their alternative hypotheses. The new MTP power analysis method also results in p-value weights which can be used to minimize the relative impacts of and assess for significance-chasing biases (e.g., publication bias, p-hacking, etc.) in multiple testing, without needing to assume that p-values (effect sizes) are independent. Previous MTP power analysis methods are conditional in that they assume fixed effect sizes and a fixed correlation matrix for test statistics (p-values) which may be difficult to specify; while, incongenially, not fully accounting for their uncertainty, especially for MTPs that control either the FWER or FDR for arbitrarily correlated p-values. 
The new simulation-based MTP predictive power analysis method is illustrated through the analysis of p-values obtained by a famous study of lead exposure and re-analyzed by the previous MTP literature, using R package bnpMTP (v1.0).
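Three of the procedures named in this abstract, all valid under arbitrary dependence between p-values, can be sketched in a few lines. This is a plain illustration of the Bonferroni, Holm, and Benjamini–Yekutieli rules, not the bnpMTP package or the DP-MTP; the function names and the p-values are hypothetical.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject every hypothesis whose p-value is at most alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down: test sorted p-values against alpha / (m - rank) until one fails."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject

def benjamini_yekutieli(pvals, alpha=0.05):
    """Step-up FDR control with the harmonic-sum correction c(m)."""
    m = len(pvals)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0  # largest rank whose p-value is under its threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / (m * c_m):
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject

pvals = [0.001, 0.009, 0.03, 0.04, 0.5]  # hypothetical p-values
```

How many of these rejections a planned study can expect depends on the unknown effect sizes and correlations, which is the uncertainty a predictive power analysis is meant to account for.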

Stats, Vol. 9, Pages 54: BLIDE: Bayesian Learning of Infectious Disease Emerging in COVID-19 Studies

Avizit Chandra Adhikary — 2026-05-28

Stats doi: 10.3390/stats9030054

Authors: Avizit Chandra Adhikary Ziyu Liu Anisha Das Rongjie Liu Chao Huang

The COVID-19 pandemic has reshaped global infrastructure, highlighting the importance of effective infectious disease management. Identifying when and where infection trends change abnormally can aid strategic planning; yet, existing change point detection methods struggle due to the non-linear nature of infection trends, spatial and temporal dependencies, regional demographic and healthcare variations, and differing preventive measures. To address this issue, we propose a Bayesian method that can detect candidate regional disease-related change periods while overcoming these challenges. Specifically, we develop a Bayesian function-on-function regression model that learns from infection trends across multiple regions by incorporating both time-invariant features and the historical effect of time-dependent functional covariates. Temporal dependence in the covariate effects is captured through neighborhood-based spike-and-slab priors, whose latent binary inclusion indicators are, in turn, modeled by Ising priors. A Gibbs sampling framework is derived to approximate the joint posterior distribution of the model parameters. We compared the performance of the proposed framework against two widely used change-point detection methods, BCP and Segmented. In our simulation studies, BLIDE achieves an F1-score of 1.000 under high signal-to-noise conditions and maintains an F1-score above 0.95 even when noise dominates the trends, substantially outperforming BCP (F1-scores 0.454 and 0.131, respectively) and Segmented (F1-scores below 0.05 across all scenarios).

Stats, Vol. 9, Pages 55: Extending Entropic Value at Risk Using the γ-Order Generalized Normal Distribution

Christos P. Kitsos — 2026-05-28

Stats doi: 10.3390/stats9030055

Authors: Christos P. Kitsos Ulrich E. Nyamsi Ioannis S. Stamatiou

This paper extends the Entropic Value at Risk by considering the γ-order Generalized Normal distribution, a flexible family of distributions capable of modeling deviations from the classical normality assumption. An analytic expression for the Entropic Value at Risk is derived, and it incorporates a shape parameter γ that controls tail behavior and allows the model to capture heavy-tailed financial data more accurately. An empirical application to daily stock returns shows that this risk measure provides estimates closer to the empirical risk than those obtained under the normality assumption.
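As a baseline for the extension described here, the Entropic Value at Risk under the classical normality assumption has a closed form that can be checked numerically. This sketch covers only the normal case, not the γ-order Generalized Normal expression derived in the paper. It assumes the usual definition as the infimum over z > 0 of (1/z) log(M(z)/α), with M the moment generating function of the loss; the function names are hypothetical.

```python
from math import log, sqrt

def evar_normal(mu, sigma, alpha):
    """Closed form for X ~ N(mu, sigma^2) at confidence level 1 - alpha."""
    return mu + sigma * sqrt(-2.0 * log(alpha))

def evar_numeric(mu, sigma, alpha, grid=20000, z_max=20.0):
    """Grid minimisation of (1/z) * log(M_X(z) / alpha) over z > 0."""
    best = float("inf")
    for i in range(1, grid + 1):
        z = z_max * i / grid
        log_mgf = mu * z + 0.5 * (sigma * z) ** 2  # log MGF of the normal
        best = min(best, (log_mgf - log(alpha)) / z)
    return best

closed = evar_normal(0.0, 1.0, 0.05)
approx = evar_numeric(0.0, 1.0, 0.05)
```

The grid search agrees with the closed form up to the grid resolution. The same numerical route could be used for a distribution whose moment generating function is available but whose infimum is not analytic.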

Stats, Vol. 9, Pages 53: Extreme Rainfall Modelling Using Time-Varying Threshold Generalised Pareto Regression Trees

Matome Lesley Sebola — 2026-05-28

Stats doi: 10.3390/stats9030053

Authors: Matome Lesley Sebola Daniel Maposa

The escalating frequency and intensity of extreme rainfall events driven by climate change threaten infrastructure resilience and societal safety, underscoring the urgent need for robust models to predict these events. Previous studies on the integration of Extreme Value Theory (EVT) and machine learning in modelling extreme rainfall events have not explored the use of a time-varying threshold. This study introduces a novel time-varying threshold Generalised Pareto (GP) regression tree for modelling extreme rainfall in Durban, South Africa. The proposed hybrid model combines EVT with covariate-driven regression tree partitioning, allowing the threshold to evolve dynamically with meteorological conditions. Using daily rainfall and meteorological covariate data from 1981 to 2025, the model was developed, pruned, and benchmarked against a static-threshold GP regression tree and a time-varying threshold Generalised Pareto Distribution (GPD). Evaluation based on the Bayesian Information Criterion (BIC) and log-likelihood demonstrated the superior performance of the proposed model in capturing covariate-driven heterogeneity and temporal variability of rainfall extremes. Four distinct climatic regimes with different tail behaviours and return levels were identified. This study provides the first meteorological application of a time-varying threshold GP regression tree and offers practical insights into flood risk assessment and climate resilience planning in the city of Durban.
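The return levels mentioned in this abstract follow from the standard peaks-over-threshold formula for the Generalised Pareto distribution. The sketch below evaluates that formula for one fixed threshold; in a time-varying setting the threshold and parameters would be supplied per period or per tree leaf. The function name and the parameter values are hypothetical and are not estimates from the Durban data.

```python
from math import log

def gpd_return_level(threshold, scale, shape, exceed_rate, m):
    """Level exceeded on average once every m observations (peaks over threshold)."""
    if abs(shape) < 1e-12:
        # Limiting case of an exponential tail.
        return threshold + scale * log(m * exceed_rate)
    return threshold + (scale / shape) * ((m * exceed_rate) ** shape - 1.0)

# Hypothetical values: 10% of days exceed the threshold, 1000-day return level.
level = gpd_return_level(threshold=10.0, scale=2.0, shape=0.5, exceed_rate=0.1, m=1000)
```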

Stats, Vol. 9, Pages 52: The Poisson–QGamma Distribution: Properties, Estimation Methods, Regression Modeling, and Applications in Engineering Count Data

Fatma Zohra Seghier — 2026-05-26

Stats doi: 10.3390/stats9030052

Authors: Fatma Zohra Seghier Halim Zeghdoudi Muhammad Ameeq Sana Kanwal

Modeling over-dispersed count data is a common challenge in applied statistics, especially in engineering applications where repeated events, system faults, and clustered observations often produce variability beyond that allowed by the classical Poisson model. In this paper, we introduce and study the Poisson–QGamma distribution, a new compound discrete model obtained by mixing the Poisson distribution with the QGamma distribution. The proposed distribution is analytically tractable and flexible enough to capture over-dispersion, skewness, and excess kurtosis, which are frequently observed in real count data. Several statistical properties of the distribution are derived, including the probability mass function, cumulative distribution function, survival and hazard rate functions, moments, dispersion index, skewness, kurtosis, entropy, and generating functions. Parameter estimation is considered using maximum likelihood, method of moments, least squares, and weighted least squares methods. The finite-sample behavior of these estimators is examined through Monte Carlo simulation. A regression model based on the Poisson–QGamma distribution is also developed for count responses with covariates. The proposed model is compared with classical and competing count models using simulation and real-data applications. Three engineering-related datasets, involving power grid failure counts, environmental sensor event counts, and packet loss counts in communication networks, are analyzed to illustrate the practical value of the model. The results show that the Poisson–QGamma model provides a better fit than several standard alternatives, including the Poisson, negative binomial, Poisson–Lindley, generalized Poisson, and COM–Poisson models, particularly in the presence of over-dispersion and heavy-tailed behavior. 
Overall, the proposed distribution offers a parsimonious and effective tool for modeling over-dispersed count data, while also contributing to the broader class of compound discrete distributions.

Stats, Vol. 9, Pages 51: The Generalized Marshall–Olkin Topp–Leone-G Family: Properties, Estimation, and Goodness-of-Fit Testing Under Right-Censored Data

Aidi Khaoula — 2026-05-22

Stats doi: 10.3390/stats9030051

Authors: Aidi Khaoula, Laba Handique, Djemoui Nour el Houda

In this paper, we introduce a new extension of the Topp–Leone-G family, called the generalized Marshall–Olkin Topp–Leone-G (GMOTL-G) family of distributions. The proposed family is obtained by combining the generalized Marshall–Olkin and Topp–Leone-G generators, leading to a more flexible class of models for lifetime data. We study several of its mathematical and statistical properties and focus in particular on the generalized Marshall–Olkin Topp–Leone exponential (GMOTL-E) distribution as an important special case. For this model, we derive and discuss a number of useful characteristics, including the moment generating function, moments, order statistics, residual and reversed residual life functions, mean deviations, asymptotic behavior, and stochastic ordering. We also develop maximum likelihood estimation for the model parameters under both complete and right-censored samples. In addition, we construct a goodness-of-fit test for the proposed model under independent right censoring using a chi-square type approach. The performance of the estimation and testing procedures is investigated through simulation, and the results show good behavior of the estimators and satisfactory agreement between empirical and theoretical significance levels. Finally, two real data applications, one with complete data and one with right-censored data, are presented to illustrate the flexibility and practical usefulness of the proposed model. These results show that the new family provides an effective tool for modeling lifetime data and for assessing model adequacy in the presence of right censoring.

Stats, Vol. 9, Pages 50: Goodness-of-Fit Test for the Kumaraswamy Distribution via Energy Distance Approach with Applications to Real Data

Joseph Njuki — 2026-05-21

Stats doi: 10.3390/stats9030050

Authors: Joseph Njuki, Thomas Gilbert

In this article, we develop a goodness-of-fit test for the Kumaraswamy distribution based on energy statistics. Both the Kumaraswamy and Beta distributions have bounded support on the interval (0,1), and the Kumaraswamy distribution has been shown to be a preferred alternative to the Beta distribution because its quantile (inverse) function is available in closed form. The proposed test procedure is simple and powerful against general alternatives. Simulations under different settings show that the proposed test controls the type I error at the given nominal significance levels. In terms of power, the proposed test outperforms existing methods across these settings. We then apply the proposed test to two real datasets (electricity access data and an underground economy index) to demonstrate its competitiveness and usefulness.
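
The role of the closed-form quantile function can be illustrated with a minimal Python sketch. It uses the standard Kumaraswamy CDF F(x) = 1 - (1 - x^a)^b and its inverse to draw samples by inverse-transform sampling. The function names and the parameter values a = 2, b = 3 are illustrative assumptions, and the sketch does not implement the energy-distance test itself.

```python
import random


def kumaraswamy_cdf(x, a, b):
    """CDF F(x) = 1 - (1 - x**a)**b on (0, 1), with shape parameters a, b > 0."""
    return 1.0 - (1.0 - x ** a) ** b


def kumaraswamy_quantile(u, a, b):
    """Closed-form inverse of the CDF: Q(u) = (1 - (1 - u)**(1/b))**(1/a)."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def kumaraswamy_sample(n, a, b, seed=0):
    """Inverse-transform sampling: push uniform draws through the quantile function."""
    rng = random.Random(seed)
    return [kumaraswamy_quantile(rng.random(), a, b) for _ in range(n)]


# Illustrative parameter values (assumed, not taken from the article).
sample = kumaraswamy_sample(1000, a=2.0, b=3.0)
```

Because sampling needs only one uniform draw per observation, simulating critical values under the null hypothesis is cheap, which is one practical reason a closed-form quantile function is convenient in simulation-based tests.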

Stats, Vol. 9, Pages 49: Defining Effect Size Thresholds for OR, RR, and η² in Physiotherapy Studies

Grzegorz Zieliński — 2026-05-19

Stats doi: 10.3390/stats9030049

Authors: Grzegorz Zieliński

(1) Background: Effect size interpretation in physiotherapy research varies across statistical models, hindering comparability between studies using linear, logistic, and variance-based analyses. Unified, discipline-specific thresholds are needed to harmonise interpretation and support consistent sample size planning in clinical trials. The aim is to estimate physiotherapy-specific reference values for odds ratio (OR), relative risk (RR), and η²/ηp² based on empirically established thresholds for Cohen’s d (0.1, 0.4, and 0.8). (2) Methods: Cohen’s d values were transformed into corresponding effect size metrics using deterministic algebraic relationships. Specifically, OR, RR, and η² were derived from Cohen’s d under selected baseline risks (p₀ = 0.1, 0.2, and 0.5). Calculations were performed in R 4.3.1 assuming equal group sizes and homogeneity of variances. (3) Results: OR thresholds were 1.20 (small), 2.07 (medium), and 4.27 (large). For RR, at p₀ = 0.1, thresholds were 1.18, 1.95, and 3.22; at p₀ = 0.5, they were 1.09, 1.35, and 1.62. Corresponding η²/ηp² values were 0.003, 0.039, and 0.138. (4) Conclusions: The derived thresholds form a coherent, numerically anchored framework linking linear, logistic, and variance-based effect sizes. This approach standardises interpretation across statistical models and strengthens methodological consistency in physiotherapy clinical research.
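
The abstract does not name the algebraic relationships it used, so the following Python sketch relies on common textbook conversions as an assumption: the logistic conversion ln(OR) = d·π/√3, the baseline-risk conversion RR = OR / (1 - p0 + p0·OR), and η² = d² / (d² + 4) for two equal-sized groups. These formulas reproduce the reported OR thresholds to two decimals; some of the other reported values may differ in the last digit, depending on the exact formulas and rounding used by the author.

```python
import math


def d_to_odds_ratio(d):
    """Logistic conversion (assumed): ln(OR) = d * pi / sqrt(3)."""
    return math.exp(d * math.pi / math.sqrt(3.0))


def odds_ratio_to_relative_risk(odds_ratio, p0):
    """Convert an odds ratio to a relative risk at baseline risk p0 (assumed formula)."""
    return odds_ratio / (1.0 - p0 + p0 * odds_ratio)


def d_to_eta_squared(d):
    """Two equal-sized groups (assumed): eta^2 = d^2 / (d^2 + 4)."""
    return d * d / (d * d + 4.0)


# Print the converted thresholds for the three Cohen's d values in the abstract.
for d in (0.1, 0.4, 0.8):
    odds_ratio = d_to_odds_ratio(d)
    print(
        d,
        round(odds_ratio, 2),
        round(odds_ratio_to_relative_risk(odds_ratio, 0.5), 2),
        round(d_to_eta_squared(d), 3),
    )
```

Note that the relative risk depends on the baseline risk p0 while the odds ratio does not, which is why the abstract reports a separate set of RR thresholds for each p0.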

Stats, Vol. 9, Pages 48: Global Versus Australian Progress in Multi-Pollutant Air Quality: GAM-Based Trend Analysis and a Clean-Air Progress Index (1990–2019)

Khaled Haddad — 2026-05-13

Stats doi: 10.3390/stats9030048

Authors: Khaled Haddad

Reliable tracking of multi-pollutant air-quality progress is essential for assessing policy effectiveness and health risks, yet most assessments still focus on single pollutants. We analysed population-weighted exposures to fine particulate matter (PM2.5), nitrogen dioxide (NO2) and household air pollution (HAP) for Australia and the global average over 1990–2019, using harmonised estimates from a Global Burden of Disease–type framework. Non-parametric LOESS and semi-parametric generalised additive models were applied to characterise long-term trends, and a composite clean-air progress index (CAPI; 1990 = 1) was constructed to summarise joint changes in the three pollutants. Statistical and Monte Carlo methods were used to propagate reported exposure uncertainty into both pollutant-specific trends and the composite index. Globally, exposures to PM2.5, NO2 and HAP all declined, and the CAPI fell to around 0.7 by 2019, indicating substantial multi-pollutant improvement relative to 1990. In Australia, NO2 decreased more rapidly than the global mean, but PM2.5 showed little long-term decline and the HAP-related metric increased more than three-fold. As a result, Australia’s CAPI rose to approximately 1.6–1.7, with Monte Carlo uncertainty envelopes remaining well above 1 from the early 2000s onwards. Correlation analyses revealed that pollutants improved together at the global scale, but were partially decoupled in Australia, implying that source-specific gains have not translated into aggregate clean-air progress. These findings demonstrate that single-pollutant assessments can obscure important trade-offs and that multi-pollutant, uncertainty-aware indices such as CAPI provide a more informative basis for benchmarking national trajectories against global experience and for guiding integrated clean-air policy.

Stats, Vol. 9, Pages 47: Unified Numerical Method for Stochastic Differential Equations with Poisson and Gaussian White Noises

Mircea D. Grigoriu — 2026-04-24

Stats doi: 10.3390/stats9030047

Authors: Mircea D. Grigoriu

A method is developed for integrating stochastic differential equations (SDEs) with Poisson (PWN) and Gaussian (GWN) white noises interpreted as the formal derivatives of the compound Poisson and Brownian motion processes. In contrast to the current integration schemes, which solve discrete time versions of the posed SDEs, the proposed method solves the posed SDEs for finite dimensional (FD) models of the compound Poisson and Brownian motion processes, i.e., finite sums of deterministic functions of time weighted by random coefficients. Paths of the resulting solutions, referred to as FD solutions, can be generated by standard ordinary differential equation (ODE) solvers since the paths of the FD input models are smooth. We also establish conditions under which the distributions of extremes and other continuous functionals of the solutions of the posed SDEs can be approximated by those of their FD solutions. This is essential in applications since the distributions of functionals of FD solutions can be estimated while those of actual solutions are rarely available analytically and cannot be obtained numerically.
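
The idea of replacing the noise by a smooth finite-dimensional (FD) model and then calling an ordinary ODE solver can be sketched as follows. This is only an illustration under stated assumptions, not the article's scheme: it covers the Gaussian case alone, it uses a truncated Karhunen-Loève expansion of Brownian motion on [0, 1] as the FD model, it takes a linear drift -theta·X with constant noise intensity sigma, and it uses a hand-written fourth-order Runge-Kutta integrator. All names and parameter values are hypothetical.

```python
import math
import random


def fd_noise_derivative(coeffs):
    """Time derivative of a truncated Karhunen-Loeve model of Brownian motion on [0, 1].

    B_n(t) = sum_k Z_k * sqrt(2) * sin((k + 1/2) * pi * t) / ((k + 1/2) * pi), k = 0..n-1,
    so its derivative is a finite sum of cosines weighted by the random coefficients Z_k.
    """
    def w(t):
        return sum(z * math.sqrt(2.0) * math.cos((k + 0.5) * math.pi * t)
                   for k, z in enumerate(coeffs))
    return w


def rk4(f, x0, t_end, steps):
    """Classical fourth-order Runge-Kutta for x' = f(t, x) on [0, t_end]."""
    h = t_end / steps
    t, x = 0.0, x0
    path = [x]
    for _ in range(steps):
        k1 = f(t, x)
        k2 = f(t + h / 2, x + h * k1 / 2)
        k3 = f(t + h / 2, x + h * k2 / 2)
        k4 = f(t + h, x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
        path.append(x)
    return path


# One FD path: 20 independent standard normal coefficients (assumed truncation level).
rng = random.Random(1)
coeffs = [rng.gauss(0.0, 1.0) for _ in range(20)]
w = fd_noise_derivative(coeffs)

# Hypothetical linear SDE dX = -theta * X dt + sigma * dB, solved pathwise as an ODE.
theta, sigma = 1.0, 0.5
path = rk4(lambda t, x: -theta * x + sigma * w(t), x0=1.0, t_end=1.0, steps=400)
```

Repeating the last three steps with fresh coefficients yields independent FD paths, from which functionals such as the path maximum can be estimated by simple averaging.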

Stats, Vol. 9, Pages 46: A Practical Framework for Incorporating Complex Survey Design in Bayesian Kernel Machine Regression

Doreen Jehu-Appiah — 2026-04-23

Stats doi: 10.3390/stats9030046

Authors: Doreen Jehu-Appiah, Emmanuel Obeng-Gyasi

Large-scale population datasets are rarely generated via simple random sampling; instead, they reflect complex designs involving stratification, clustering, and unequal inclusion probabilities. While survey weights are provided to recover population-representative estimates, standard Bayesian Kernel Machine Regression (BKMR), a flexible nonlinear model for high-dimensional exposure mixtures, does not explicitly accommodate these design features. We present a simulation-based framework that evaluates performance under complex sampling by comparing two analytic strategies applied to identical survey-like data: (i) a naïve, unweighted BKMR implementation and (ii) a design-aware workflow that can be executed using existing software without modifying the BKMR algorithm itself. Finite populations are generated with correlated exposures and a known nonlinear data-generating function. Stratified two-stage cluster samples are then drawn under both non-informative and exposure-dependent (informative) selection mechanisms, with controlled intra-class correlation (ICC). The design-aware approach incorporates sampling weights through resampling of the dataset while preserving primary sampling unit structure, followed by standard BKMR fitting. Methods are evaluated using bias, interval width, and empirical 95% coverage relative to the known truth. Across simulation scenarios, naïve BKMR exhibits bias and systematic under-coverage under informative sampling, with empirical 95% coverage often dropping to approximately 0–40%, whereas the design-aware workflow improves coverage to approximately 40–60%, moving results closer to nominal levels. These findings provide a practical, implementation-ready strategy for integrating survey design considerations into BKMR analyses and delineate conditions under which accounting for sampling design affects inference. 
While the proposed approach improves inferential performance relative to naïve BKMR, it does not fully achieve nominal coverage, indicating that further methodological development is required for fully valid uncertainty quantification under complex survey designs.

Stats, Vol. 9, Pages 45: Coverage and Precision of Net Promoter Score Confidence Intervals Across Sampling Distributions

Philip Turk — 2026-04-21

Stats doi: 10.3390/stats9020045

Authors: Philip Turk, Jordan Cinderich, Emma McNeill

The Net Promoter Score (NPS) is a widely used metric for customer loyalty in business. However, theoretical gaps in the current literature leave room for practical refinements in real-world applications. In this simulation study, we use an unbiased estimator of the variance of the sample NPS to examine coverage and width for three confidence interval methods: Wald, bootstrap t, and adjusted Wald, the last with weights corresponding to four underlying population distribution shapes: extreme (E), left-skewed (LS), triangular (T), and uniform (U). As the sample size increased, all methods approached the nominal 95% coverage rate, except for the extreme population; the adjusted Wald method with triangular and uniform weights is particularly robust among the representative population shapes examined. All adjusted Wald methods performed comparably in width, especially at larger n. The confidence interval width depended on the population shape. Overall, the Wald and bootstrap t methods are not recommended at small sample sizes. Our methods raise awareness of the sampling distribution of the NPS statistic, provide a theoretical basis for an unbiased estimator of the variance, and assess reliable confidence interval construction. These results provide an informed application of NPS and lay the foundation for future methodological development.
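
As a minimal sketch of the plain Wald interval mentioned above: if each response is scored +1 (promoter), 0 (passive), or -1 (detractor), the sample NPS is a sample mean, so the usual sample variance divided by n is an unbiased estimator of its variance. The Python code below uses that fact; whether it coincides exactly with the estimator in the article is an assumption, the function name is hypothetical, and clipping the interval to [-1, 1] is a design choice of this sketch. The adjusted Wald and bootstrap t variants are not shown.

```python
import math


def nps_wald_interval(promoters, passives, detractors, z=1.959963984540054):
    """Wald confidence interval for the Net Promoter Score on the [-1, 1] scale.

    Returns (nps, lower, upper). The default z is the 97.5% standard normal quantile,
    giving a nominal 95% interval.
    """
    n = promoters + passives + detractors
    if n < 2:
        raise ValueError("need at least two responses")
    p_pro = promoters / n
    p_det = detractors / n
    nps = p_pro - p_det
    # Unbiased variance of the mean of scores in {-1, 0, +1}: note the n - 1 divisor.
    var_hat = (p_pro + p_det - nps ** 2) / (n - 1)
    half_width = z * math.sqrt(var_hat)
    return nps, max(-1.0, nps - half_width), min(1.0, nps + half_width)


# Illustrative counts (assumed): 50 promoters, 30 passives, 20 detractors.
nps, lower, upper = nps_wald_interval(50, 30, 20)
```

With small samples or extreme populations the estimated variance can be close to zero, which is one way to see why the plain Wald interval may undercover in those settings.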

Stats, Vol. 9, Pages 44: Assessing the Accuracy of Bootstrap-Based Standard Errors in Regression Models with Unobserved Heterogeneity

Yingjuan Zhang — 2026-04-18

Stats doi: 10.3390/stats9020044

Authors: Yingjuan Zhang, Jochen Einbeck

When the data at hand are suspected to stem from several latent subpopulations, statisticians commonly speak of “unobserved heterogeneity”. While the presence and importance of this phenomenon are commonly acknowledged, there is relatively little guidance on how to carry out correct inference under unobserved heterogeneity, even in relatively simple scenarios such as the linear regression model. In this work, bootstrap algorithms for the computation of standard errors are investigated in the context of a mixture-based regression approach which accounts for the clustered nature of the data. Of interest are both the accuracy of the standard errors (evidenced by confidence interval coverage rates) and the relative reduction in standard errors achieved in comparison to a naïve linear model fit. Simulations and a real data example are provided.

Stats, Vol. 9, Pages 43: Scalable Likelihood Inference for Student-t Copula Count Time Series

Quynh Nhu Nguyen — 2026-04-17

Stats doi: 10.3390/stats9020043

Authors: Quynh Nhu Nguyen, Victor De Oliveira

Count time series often exhibit extremal dependence that may not be adequately captured by Gaussian copula models. We develop a likelihood-based framework for count-valued time series using Student-t copulas with latent ARMA dependence. The latent process is constructed through a scale-mixture representation of a Gaussian ARMA process, preserving the second-order dependence structure while introducing tail dependence and greater persistence of extreme events. Likelihood inference requires evaluating high-dimensional truncated multivariate t probabilities, which is computationally demanding under heavy tails and strong serial dependence. To address this challenge, we develop scalable likelihood approximations tailored to the time series structure. In particular, we formulate a time series version of minimax exponential tilting for multivariate t probabilities, termed Time Series Minimax Exponential Tilting (TMET), which exploits the exact conditional representation of the latent ARMA process. The resulting algorithm reduces computational complexity from cubic to near-linear in the series length while retaining the high accuracy of minimax exponential tilting. For comparison, we also extend two widely used Gaussian copula approximations—the continuous extension (CE) method and the Geweke–Hajivassiliou–Keane (GHK) simulator—to the Student-t copula setting. Simulation studies show that TMET outperforms CE and GHK, particularly under strong dependence, heavy tails, and low-count regimes. The framework also supports predictive inference and residual diagnostics. An application to weekly rotavirus counts illustrates how the Student-t copula provides a flexible extension of the Gaussian copula while retaining stable inference even when tail dependence is weak or absent.

Stats, Vol. 9, Pages 42: MAI-GAN: An Inferentially Calibrated Generative Framework for Multilevel Longitudinal Data with Applications to Educational Intersectionality

Benjamin Hechtman — 2026-04-09

Stats doi: 10.3390/stats9020042

Authors: Benjamin Hechtman, Ross H. Nehm, Wei Zhu

Synthetic datasets are increasingly used in education research for methodological validation, privacy-preserving data sharing, and reproducible equity analysis; however, most generative approaches prioritize marginal distributional similarity without ensuring preservation of multilevel inferential properties. This limitation is consequential for repeated-measures data analyzed using intersectionality-focused hierarchical models, where conclusions depend on variance partitioning, partial pooling, and stratum-level heterogeneity. We introduce MAI-GAN, a hybrid generative framework that implements a structure–residual decomposition approach combining Bayesian longitudinal MAIHDA with conditional GAN-based residual generation. Inferential fidelity is operationalized with respect to multilevel intersectional models by explicitly targeting the preservation of fixed effects, variance components, and variance partitioning coefficients, while baseline composition is maintained via stratified bootstrap resampling. Applied to a six-semester undergraduate biology dataset (N = 2669 students), MAI-GAN was evaluated across multiple independent random seeds and consistently reproduced baseline-dependent residual structure and key inferential quantities. These results demonstrate that model-aligned generative strategies can produce synthetic longitudinal datasets that remain coherent under intersectionality-focused multilevel analysis, offering a principled foundation for equity-oriented synthetic data generation.

Stats, Vol. 9, Pages 41: A Novel Exponentiated Pareto Exponential Distribution with Applications in Environmental and Financial Datasets

Ibrahim Sule — 2026-04-09

Stats doi: 10.3390/stats9020041

Authors: Ibrahim Sule, Mogiveny Rajkoomar

Environmental and financial datasets often display complex distributional characteristics, including heavy tails, high skewness and the presence of extreme observations. Traditional probability models such as the exponential, gamma or log-normal distributions may not adequately capture these behaviours, particularly when modelling extreme events such as rainfall, pollution levels, stock returns or loss severities. By integrating the characteristics of the Pareto and exponential distributions into an exponentiated framework that can describe datasets arising in environmental and financial fields, this study presents a novel three-parameter exponentiated Pareto exponential (EPE) distribution, built from the exponentiated Pareto family of distributions with the classical exponential distribution as the baseline model. This model extends the classical exponential distribution with extra shape parameters that simultaneously regulate the centre and tail behaviour of the new model. The statistical and mathematical characteristics of the proposed distribution are derived and studied. Maximum likelihood estimation is assessed in a simulation study, whose results are used to evaluate the efficiency of the estimator. The practical applicability of the model is illustrated with four real-life datasets, using model adequacy and goodness-of-fit measures such as the log-likelihood, the Akaike information criterion and the Bayesian information criterion. The results show that the proposed model gives a better fit than the comparator models, making the EPE distribution useful and robust in environmental and financial fields of study.

Stats, Vol. 9, Pages 40: A New Partially Linear Regression with an Application to the Price of Coffee Before and After the Pandemic

Edwin M. M. Ortega — 2026-04-08

Stats doi: 10.3390/stats9020040

Authors: Edwin M. M. Ortega, Gabriela M. Rodrigues, Kwan Sung Jang, Gauss M. Cordeiro

We propose a partially linear regression model to explain coffee prices before and after the COVID-19 pandemic. This new regression model accommodates both linear and nonlinear relationships between the variables. We consider the penalized quasi-likelihood method for parameter estimation and present residual analysis for the new regression model. A simulation study examines the penalized quasi-likelihood estimators and the empirical distribution of the quantile residuals. Furthermore, the article aims to identify variables that influence changes in coffee prices, such as the prices of the Arabica and Robusta varieties, supply (expressed in millions of bags produced), global consumption, exchange rates, inflation, and the pandemic.

Stats, Vol. 9, Pages 39: A New Depth-Based Test for Multivariate Two-Sample Problems

My Luu — 2026-04-03

Stats doi: 10.3390/stats9020039

Authors: My Luu, Yuejiao Fu, Augustine Wong, Xiaoping Shi

Statistical depth provides a center–outward ordering of multivariate observations and is widely used in nonparametric inference. We study depth-based tests for multivariate two-sample problems and examine the behaviour of different depth notions using the DD plot (data-depth plot) across a variety of distributional settings. The DD plot illustrates that depth functions differ in their sensitivity to distributional differences, emphasizing the importance of depth selection in two-sample testing. We propose a new two-sample test statistic, log DDR, constructed from ratios of numerical depth values rather than depth-induced ranks. Simulation studies under multiple scenarios and for three representative depth functions indicate that log DDR achieves improved power relative to several competing depth-based nonparametric tests. The results further demonstrate that the performance of log DDR and existing methods depends strongly on the chosen depth function, consistent with insights from the DD plot. These findings support a two-stage testing approach in which the DD plot is used to guide the choice of depth notion before applying log DDR for homogeneity testing.

Stats, Vol. 9, Pages 38: Multiple Imputation of a Continuous Outcome with Fully Observed Predictors Using TabPFN

Jerome Sepin — 2026-04-01

Stats doi: 10.3390/stats9020038

Authors: Jerome Sepin

Handling missing data is a central challenge in quantitative research, particularly when datasets exhibit complex dependency structures, such as nonlinear relationships and interactions. Multiple imputation (MI) via fully conditional specification (FCS), as implemented in the MICE R package, is widely used but relies on user-specified models that may fail to capture complex dependency structures, especially in high-dimensional settings, or on more sophisticated algorithms that are considered data-hungry. This paper investigates the performance of TabPFN, a transformer-based, pretrained foundation model developed for tabular prediction tasks, for MI. TabPFN is pretrained on millions of synthetic datasets and approximates posterior predictive distributions without dataset-specific retraining, offering a compelling solution for imputing complex missing data in small to moderately sized samples. We conduct a simulation study focusing on univariate missingness in a continuous outcome with complete predictors, comparing TabPFN with standard MI methods. Performance is evaluated using bias, standard error, and coverage of the marginal mean estimand across a range of data-generating and missingness mechanisms. Our results show that TabPFN yields competitive or superior performance relative to Classification and Regression Trees and Predictive Mean Matching. These findings highlight TabPFN as a promising tool for missing data imputation, with particular relevance to health research.

Stats, Vol. 9, Pages 37: On the Classification–Causal Tradeoff in Neural Network Propensity Score Estimation

Seungman Kim — 2026-03-31

Stats doi: 10.3390/stats9020037

Authors: Seungman Kim, Jaehoon Lee, Kwanghee Jung

Observational studies serve as a vital alternative to randomized experiments but are highly susceptible to selection bias. Propensity score (PS) methods address this by balancing covariates between groups. Although including all relevant covariates is theoretically ideal, high dimensionality often destabilizes traditional estimation models. This study evaluates the efficacy of deep neural networks (DNN) and convolutional neural networks (CNN) for PS estimation compared to traditional logistic regression (LR), leveraging their capacity to handle complex nonlinear relationships and interactions. Using a Monte Carlo simulation across 36 conditions, model performance was evaluated based on bias and imbalance reduction. Results indicate that DNNs and CNNs significantly outperform LR. Specifically, while LR increased outcome bias by 17% and reduced covariate imbalance by only 5%, DNNs and CNNs reduced outcome bias by 13% and 16%, respectively, while decreasing covariate imbalance by 18% and 21%. We conclude that despite requiring specialized computational resources, neural networks offer substantial advantages for high-dimensional PS estimation. However, their reliable application necessitates stability-aware training and proper error rate thresholds to prevent probability degeneracy.

Stats, Vol. 9, Pages 36: On Dimension-Free Stochastic Surrogates and Estimators of Cross-Partial Derivatives and the Hessian Matrix

Matieyendou Lamboni — 2026-03-29

Stats doi: 10.3390/stats9020036

Authors: Matieyendou Lamboni

This study introduces stochastic surrogates of all the cross-partial derivatives of functions using L evaluations of functions at randomized points. Such randomized points are constructed using the class of ℓ_p-spherical distributions or equivalent distributions. For the cross-partial derivatives of a given order |u|∈{2,…,d}, the proposed surrogates and the corresponding estimators of cross-partial derivatives enjoy the parametric rate of convergence and dimension-free mean squared errors when d≪p, thereby breaking the curse of dimensionality. Imposing p≪d breaks the curse of dimensionality only for the cross-partial derivatives of orders satisfying |u|≪1+d2log(d). An L-point-based Hessian surrogate and estimator are also proposed, together with their convergence analysis. A particular choice of p achieves dimension-free mean squared errors. Analytical examples and simulations are provided to show the efficiency of such surrogates and estimators.

Stats, Vol. 9, Pages 35: Analyzing Complex Non-Linear Fascia-Muscle Interactions Using Cross-Recurrence Quantification Analysis

Andreas Brandl — 2026-03-25

Stats doi: 10.3390/stats9020035

Authors: Andreas Brandl, Marcus Müller, Robert Schleip

Biophysical, neurophysiological, psychological and social processes along with their interactions are complex, often non-linear and inherently time-dependent. However, time series analysis of such measurements usually requires extensive data processing and is therefore potentially associated with structural biases. This exploratory secondary analysis introduces cross-recurrence quantification analysis (CRQA), which is explicitly suited to time series with complicated non-stationary properties. We illustrate and validate CRQA using a previous study that investigated the dynamic relationship between thoracolumbar fascia deformation and back extensor muscle activity in patients with low back pain. CRQA revealed significant differences in the relationships between fascia and muscles in low back pain patients compared to healthy individuals. The analysis revealed more specific aspects of fascia-muscle coupling than traditional analytical approaches, suggesting that CRQA is a useful additional tool for investigating time-dependent interactions with dynamic complex nonlinear patterns.
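
To make the CRQA idea concrete, here is a minimal Python sketch that builds a cross-recurrence matrix for two scalar series and computes two standard measures: the recurrence rate and determinism (the share of recurrence points that lie on diagonal lines of at least a minimum length). It assumes no delay embedding, an absolute-difference distance, and a fixed radius chosen by the user; none of these settings are taken from the article, and in practice the signals are usually normalized and embedded first.

```python
def cross_recurrence_matrix(x, y, radius):
    """Entry (i, j) is 1 when x[i] and y[j] are within `radius` of each other."""
    return [[1 if abs(xi - yj) <= radius else 0 for yj in y] for xi in x]


def recurrence_rate(matrix):
    """Fraction of recurrent points in the cross-recurrence matrix."""
    total = sum(len(row) for row in matrix)
    return sum(sum(row) for row in matrix) / total


def diagonal_line_lengths(matrix):
    """Lengths of all runs of consecutive 1s along diagonals parallel to the main one."""
    rows, cols = len(matrix), len(matrix[0])
    lengths = []
    for offset in range(-(rows - 1), cols):
        i, j = max(0, -offset), max(0, offset)
        run = 0
        while i < rows and j < cols:
            if matrix[i][j]:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
            i += 1
            j += 1
        if run:
            lengths.append(run)
    return lengths


def determinism(matrix, min_length=2):
    """Share of recurrent points that belong to diagonal lines of at least min_length."""
    lengths = diagonal_line_lengths(matrix)
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_length) / total


# Toy series (assumed): two identical ramps recur only along the main diagonal.
toy = cross_recurrence_matrix([0, 1, 2, 3], [0, 1, 2, 3], radius=0.1)
```

Long diagonal lines indicate stretches where the two signals evolve in a similar way, possibly with a time lag given by the diagonal's offset, which is what makes these measures useful for describing coupling between two time series.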

Stats, Vol. 9, Pages 34: Multidimensional Correlates of Childhood Stunting in India: A Spatial Machine Learning and Explainable AI Approach

Bhagyajyothi Rao — 2026-03-24

Stats doi: 10.3390/stats9020034

Authors: Bhagyajyothi Rao, Md Gulzarull Hasan, Bandhavya Putturaya, Asha Kamath, Mohammad Aatif, Yousif M. Elmosaad

Childhood stunting remains a major public health challenge in India and is influenced by multiple socioeconomic and environmental factors. This ecological study examined district-level correlates of childhood stunting, including Crimes Against Women (CAW), the Multidimensional Poverty Index (MPI), and drought severity, using data from NFHS-5, the National Crime Records Bureau, NITI Aayog’s MPI reports, and the Drought Atlas of India. Spatial autocorrelation and spatial regression models were applied alongside machine learning approaches and SHAP-based Explainable AI (XAI) interpretation. Childhood stunting exhibited significant spatial clustering (Moran’s I = 0.520, p < 0.001), with hotspots in northern, central, and eastern India. Higher stunting was positively associated with higher birth order, low maternal BMI, child anaemia, and MPI, and negatively associated with iodised salt usage, electricity access, and timely postnatal care. A significant spatial lag parameter (ρ = 0.348) indicated substantial spillover effects. Machine learning models consistently identified MPI, drought severity, and CAW as key predictors. The integrated spatial and machine learning framework identifies key correlates and spatial dependencies of childhood stunting, highlighting the need for region-specific, multisectoral interventions.
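
The spatial clustering statistic quoted above, Moran's I, can be computed with a short Python sketch. It uses the standard definition I = (n / S0) · Σ_ij w_ij (x_i - x̄)(x_j - x̄) / Σ_i (x_i - x̄)², where S0 is the sum of all spatial weights. The four-region chain of neighbours and the values below are a toy example, not the study's data, and the significance test (typically a permutation test) is not shown.

```python
def morans_i(values, weights):
    """Global Moran's I for `values` under a square spatial weight matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


# Toy example (assumed): four regions in a row, each neighbouring the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth_trend = morans_i([1, 2, 3, 4], chain)    # similar neighbours, positive I
alternating = morans_i([1, -1, 1, -1], chain)   # dissimilar neighbours, negative I
```

Values well above the null expectation of -1/(n - 1) indicate that neighbouring districts tend to have similar stunting rates, which is the sense in which the abstract reports clustering.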

Stats, Vol. 9, Pages 33: An Adaptive Method to Identify Outliers in Skewed Observations: Application to Assess NAACCR Cancer Registry Data Usage

Xiaowen Yang — 2026-03-23

Stats doi: 10.3390/stats9020033

Authors: Xiaowen Yang, Amjila Bam, Nubaira Rizvi, Xiao-Cheng Wu, Donald Mercante, Qingzhao Yu

Outlier detection is a fundamental component of data preprocessing and quality monitoring across diverse scientific domains, including engineering, biomedical sciences, and finance. While many variables in controlled environments approximate a normal distribution, real-world data, particularly biological, environmental, and epidemiological measures, are frequently characterized by pronounced right-skewness. To address the shortcomings of conventional methods, this study introduces the Dynamic Threshold for Outlier Detection (DTOD), which reframes outlier detection as a concrete operational workflow. The DTOD framework dynamically adjusts detection thresholds based on a functional relationship between skewness and tail morphology. Validation through large-scale simulation experiments across light-, middle-, and high-skewness levels confirms the method’s versatility. The DTOD proves particularly effective at two ends of the spectrum: enhancing sensitivity for detecting subtle anomalies in light-skewed data while serving as a conservative, high-confidence screening tool that controls false positives in high-skewness environments. In real-world application to North American Association of Central Cancer Registries (NAACCR) data, the method successfully identified outliers with abnormally high unknown tumor size rates in colorectal cancer and maintained a low misclassification rate in highly skewed lung cancer data. Ultimately, the DTOD provides a promising, interpretable solution for improving data quality in skewed scenarios.

Stats, Vol. 9, Pages 32: Estimator Statistics from Simulation-Free Dirichlet Block-Bootstrap Resampling

Tillmann Rosenow — 2026-03-20

Stats doi: 10.3390/stats9020032

Authors: Tillmann Rosenow

Since the introduction of two variants of the bootstrap method by Efron and Rubin in the late 1970s, a variety of advancements have emerged in the literature. The subsampling of blocks enabled the estimation of the actual variance of the sample mean. The equivalence of data-level and estimator-level resampling is easily established for the sample mean and similar estimators. For Rubin’s variant of the bootstrap, we apply an algorithm by Diniz et al. that allows for the numerically stable computation of the sample-based cumulative distribution function of the estimator under investigation. No actual Monte Carlo resampling is necessary in this setting, and we demonstrate how this gives access to the very small tail probabilities and, moreover, to confidence intervals. We illustrate this using a well-known test model that exhibits geometrically decaying spatial correlations. The analysis also applies naturally to temporally correlated systems and to the correlations occurring in Markov chains.
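
As background, Rubin's variant (the Bayesian bootstrap) reweights the observations with Dirichlet(1, ..., 1) weights instead of resampling them. The sketch below is the plain Monte Carlo baseline that the paper's simulation-free approach is designed to avoid; it ignores blocks and correlations, and the data values and function names are hypothetical.

```python
import random


def bayesian_bootstrap_means(data, draws, seed=0):
    """Monte Carlo draws of the Dirichlet-weighted sample mean."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(draws):
        # Normalised Exp(1) variables are Dirichlet(1, ..., 1) weights.
        g = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(g)
        means.append(sum(gi * x for gi, x in zip(g, data)) / total)
    return means


def percentile_interval(samples, level=0.90):
    """Equal-tailed interval read off the sorted Monte Carlo draws."""
    s = sorted(samples)
    lo = int((1 - level) / 2 * (len(s) - 1))
    hi = int((1 + level) / 2 * (len(s) - 1))
    return s[lo], s[hi]


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]  # hypothetical sample
means = bayesian_bootstrap_means(data, draws=2000)
print(percentile_interval(means))
```

Estimating very small tail probabilities this way would need a very large number of draws, which is the motivation for computing the distribution function directly.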

Stats, Vol. 9, Pages 31: A Bayesian Approach for Clustering Constant-Wise Change-Point Data

Ana Carolina da Cruz — 2026-03-17

Stats doi: 10.3390/stats9020031

Authors: Ana Carolina da Cruz, Camila P. E. de Souza

Change-point models deal with ordered data sequences. Their primary goal is to infer the locations where an aspect of the data sequence changes. In this paper, we propose and implement a nonparametric Bayesian model for clustering observations based on their constant-wise change-point profiles via a Gibbs sampler. Our model incorporates a Dirichlet process on the constant-wise change-point structures to cluster observations while simultaneously performing multiple change-point estimation. Additionally, our approach controls the number of clusters in the model, not requiring specification of the number of clusters a priori. Satisfactory clustering and estimation results were obtained when evaluating our method under various simulated scenarios and on a real dataset from single-cell genomic sequencing. Our proposed methodology is implemented as an R package called BayesCPclust and is available from the Comprehensive R Archive Network.

Stats, Vol. 9, Pages 30: Correction: Risca et al. Archimedean Copulas: A Useful Approach in Biomedical Data—A Review with an Application in Pediatrics. Stats 2025, 8, 69

Giulia Risca — 2026-03-17

Stats doi: 10.3390/stats9020030

Authors: Giulia Risca, Stefania Galimberti, Paola Rebora, Alessandro Cattoni, Maria Grazia Valsecchi, Giulia Capitoli

In the original publication [...]

Stats, Vol. 9, Pages 29: Comparison of Minimal Circular Balanced RMDs Constructed Through Rule I and II of Cyclic Shifts Method

Muhammad Ejaz Malik — 2026-03-13

Stats doi: 10.3390/stats9020029

Authors: Muhammad Ejaz Malik, Muhammad Ameeq, Muhammad Riaz, Rashid Ahmed

The repeated measurement design (RMD) is a cost-effective research design commonly used in various fields. RMDs have several advantages; however, the carryover effect is a fundamental issue. Carryover effects typically serve as the primary source of bias in the evaluation of treatment efficacy. To reduce this bias, minimal circular balanced RMDs (MCBRMDs) are utilized. Rule I of the cyclic shift method produces MCBRMDs only for odd v (the number of treatments to be compared). Rule II produces these designs for both odd and even v. This article contributes to the literature by providing a systematic comparison of the two cyclic shift rules for constructing MCBRMDs for odd v. By comparing these designs in terms of the efficiency of carryover effects and separability, the study provides useful guidance to experimenters in choosing effective designs under practical experimental restrictions.

Stats, Vol. 9, Pages 28: Combinatorial Game Theory and Reinforcement Learning in Cumulative Tic-Tac-Toe via Evaluation Functions

Kai Li — 2026-03-10

Stats doi: 10.3390/stats9020028

Authors: Kai Li, Wei Zhu

We introduce cumulative tic-tac-toe, a novel variant of the classic 3×3 tic-tac-toe game in which play continues until the board is completely filled. Each player’s final score is determined by the total number of three-in-a-row sequences they form. Using combinatorial game theory (CGT), we establish that under optimal play, the game is a draw, and we characterize its theoretical properties. To empirically validate and optimize practical play, we develop a reinforcement learning (RL) framework based on temporal-difference (TD) learning, which is enhanced with a domain-informed evaluation function to accelerate convergence. The experimental results show that our triplet-coverage difference (TCD) evaluation function reduces the average number of training episodes by approximately 23.1% compared with a random-initialization baseline, a statistically significant improvement at the 5% significance level. These results demonstrate the efficiency of our CGT–RL approach for cumulative tic-tac-toe and suggest that similar methods may be useful for analyzing related combinatorial games. We also discuss potential analogies in domains such as competitive resource allocation and coalition formation, illustrating how cumulative-scoring games connect abstract game-theoretic ideas to practical sequential decision problems.
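
The scoring rule described above can be sketched in a few lines of Python. The board encoding (a nine-character string read row by row) and the function name `score` are assumptions made for this illustration, not the authors' implementation.

```python
# The eight three-in-a-row lines of a 3x3 board, cells indexed 0..8 by row.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def score(board):
    """Return (x_lines, o_lines) for a completely filled board."""
    if len(board) != 9 or set(board) - {"X", "O"}:
        raise ValueError("board must contain exactly nine 'X'/'O' cells")
    counts = {"X": 0, "O": 0}
    for a, b, c in LINES:
        if board[a] == board[b] == board[c]:
            counts[board[a]] += 1
    return counts["X"], counts["O"]


print(score("XXXOOXXOO"))  # prints (1, 0)
print(score("XOXOXOXOX"))  # prints (2, 0)
print(score("XOXXOOOXX"))  # prints (0, 0)
```

Because play continues until the board is full, a single final position can contain several scoring lines, which is what distinguishes this variant from the classic game.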

Stats, Vol. 9, Pages 27: Meta-Analysis of Paired Binary Data with Unobserved Dependence: Insights from Laterality and Bilateralism in Anatomy

Vasileios Papadopoulos — 2026-03-07

Stats doi: 10.3390/stats9020027

Authors: Vasileios Papadopoulos, Aliki Fiska

Anatomical variants are observed on paired body sides, yet many prevalence studies—particularly those based on osteological collections—report only right- and left-side frequencies without specifying whether findings occur bilaterally in the same individual. In such cases, the individual-level left–right structure is unobserved. Consequently, inference on laterality and bilateralism cannot be based on the reported data alone and must rely on explicit assumptions about within-individual dependence. We study this problem in the context of anatomic prevalence data, although the framework applies more broadly to paired binary outcomes. We parameterize the admissible joint distributions using a feasibility-based dependence index λ, spanning the full range from independence to maximal feasible concordance implied by the marginal prevalences. Within this framework, we examine two complementary estimands: the paired odds ratio for laterality and bilateral prevalence. Analytic results and Monte Carlo simulations show that bilateral prevalence varies linearly and remains stable across the admissible dependence range, whereas the paired odds ratio exhibits intrinsic boundary instability as dependence approaches its feasible maximum due to vanishing discordant counts. Uncertainty-propagation analyses further indicate that laterality inference is robust to moderate misspecification of the dependence assumption. These results demonstrate that unobserved within-subject dependence is a structural inferential issue in paired binary meta-analysis and motivate feasibility-based sensitivity analysis when only marginal data are available.
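
The contrast between the two estimands can be illustrated with a small sketch. It assumes, as one plausible reading of the abstract, that λ interpolates linearly between the independence value of the bilateral probability and its maximal feasible value, and that the paired odds ratio is the ratio of the two discordant probabilities. The exact definitions in the paper may differ, and the prevalences below are hypothetical.

```python
def bilateral_prevalence(p_left, p_right, lam):
    """P(both sides) for lam in [0, 1]: 0 = independence, 1 = max concordance."""
    independent = p_left * p_right
    maximal = min(p_left, p_right)
    return independent + lam * (maximal - independent)


def paired_odds_ratio(p_left, p_right, lam, tol=1e-12):
    """Ratio of right-only to left-only probability (discordant pairs)."""
    both = bilateral_prevalence(p_left, p_right, lam)
    left_only = p_left - both
    right_only = p_right - both
    if left_only <= tol:
        return float("inf")  # discordant cell vanishes at the boundary
    return right_only / left_only


# Hypothetical marginal prevalences: 10% on the left, 20% on the right.
for lam in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(lam, round(bilateral_prevalence(0.10, 0.20, lam), 4),
          paired_odds_ratio(0.10, 0.20, lam))
```

Under these assumptions the bilateral prevalence moves linearly from 0.02 to 0.10, while the odds ratio grows without bound as the left-only probability shrinks to zero.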

Stats, Vol. 9, Pages 26: The Gamma Power Generalized Weibull Distribution: Modeling Bibliometric Data

Arioane Primon Soares — 2026-03-05

Stats doi: 10.3390/stats9020026

Authors: Arioane Primon Soares, Ryan Novaes Pereira, Fernando A. Peña-Ramírez, Luz Milena Zea Fernández, Renata Rojas Guerra

In this study, we introduce the gamma power generalized Weibull (GPGW) distribution and investigate several of its main mathematical properties. The performance of the maximum likelihood estimators is evaluated through Monte Carlo simulations. The practical relevance of the proposed distribution is illustrated through an application to real bibliometric data, where the GPGW is used to model SCImago Journal Rank (SJR) indicators. In comparison with alternative models commonly employed for lifetime and positive data, the GPGW distribution exhibits strong competitive performance. In particular, in the real data application, it outperforms eleven competing distributions in terms of goodness of fit criteria, including the power generalized Weibull (PGW), the gamma-Nadarajah–Haghighi (GNH), and the exponentiated power generalized Weibull (EPGW) distributions. While inheriting several mathematical features of the EPGW distribution, such as expressions for moments, skewness, and kurtosis, the GPGW offers enhanced flexibility, making it a valuable modeling tool for lifetime data and heavy-tailed positive measurements.

Stats, Vol. 9, Pages 25: Ordering and Quantifying Textual Cohesion via Semantic, Geometric and Statistical Structure

Stelios Arvanitis — 2026-03-03

Stats doi: 10.3390/stats9020025

Authors: Stelios Arvanitis

We propose a semantic, geometric, and statistical framework for quantifying and ordering textual cohesion in long-form discourse. Sentences are embedded into a semantic similarity graph and Ollivier–Ricci curvature is used to extract sentence- and document-level structural profiles, represented as step functions on a normalized rhetorical-time axis. On this functional space we define the Weighted Utopia Index (wUI), a corpus-relative measure of weighted shortfall from an upper-envelope profile under a dominance-type ordering. The rhetorical-time weighting function is learned self-supervised: we generate controlled sentence-order perturbations with known ordinal coherence degradation and estimate the weight parameters via an ordered probit model on a training split. We evaluate ordering recovery on held-out State of the Union speeches using rank correlations, pairwise and adjacent ordering accuracy, and violation-localization diagnostics with bootstrap uncertainty. Across these criteria, wUI systematically outperforms embedding-only adjacent-similarity baselines, while a Nash-type aggregation provides an interpretable semantic–structural trade-off score. An application to later-period speeches illustrates how the method yields interpretable cohesion rankings and curvature-profile diagnostics without requiring external annotations.

Stats, Vol. 9, Pages 24: Asymptotic Properties of Error Density Estimators in the Two-Phase Linear Regression Model

Fuxia Cheng — 2026-03-01

Stats doi: 10.3390/stats9020024

Authors: Fuxia Cheng, Lixia Wang

This paper investigates kernel estimation of the error density function for the two-phase linear regression model. We derive the asymptotic distributions of residual-based kernel density estimators. First, we demonstrate that the asymptotic distribution of the maximum deviation (suitably normalized) between the residual-based kernel density estimator and the expected kernel density (based on the true errors) coincides with the result for an independent and identically distributed (i.i.d.) sample. We then prove that the residual-based kernel density estimator is asymptotically normal at a fixed point.

Stats, Vol. 9, Pages 23: The Path from PCA to Autoencoders to Variational Autoencoders: Building Intuition for Deep Generative Modeling

Alaa Tharwat — 2026-02-28

Stats doi: 10.3390/stats9020023

Authors: Alaa Tharwat, Mahmoud M. Eid

This tutorial provides a comprehensive and intuitive journey through the evolution of deep generative models, tracing a clear path from the foundations of Principal Component Analysis (PCA) to modern Variational Autoencoders (VAEs), showing how each method solves the limitations of the previous one. We begin with PCA, a linear tool for reducing data dimensions. Its inability to model non-linear patterns motivates the use of Autoencoders (AEs), which use neural networks to learn flexible, compressed representations. However, AEs lack a probabilistic framework, preventing them from generating new data. VAEs address this by treating the latent space as a probability distribution, enabling data generation. We compare the three methods through theoretical analysis, experiments, and step-by-step numerical examples that show exactly how each model compresses data—a detail often missing elsewhere. Unlike resources that treat these topics separately, we connect them into a single narrative, building intuition progressively from linear to probabilistic deep generative models.

Stats, Vol. 9, Pages 22: Abundance Estimation Using Minimum Order Set Distances in Line Transect Sampling

Mohammad Ali Al Kadiri — 2026-02-26

Stats doi: 10.3390/stats9020022

Authors: Mohammad Ali Al Kadiri, Mariam H. Al-Husari

Line transect sampling is widely used for estimating population abundance, but existing nonparametric estimators of detection density at the transect line often suffer from boundary bias and tuning sensitivity. In this paper, we propose two simple tuning-light estimators based on minimum order statistics of perpendicular distances, requiring measurement of only the judged-closest object within each set. Under mild regularity conditions, the proposed estimators are consistent and asymptotically normal, with low bias and variance demonstrated through simulation studies under exponential and half-normal detection models. An application to a wooden stakes transect survey illustrates the practical advantages of the proposed approach for low-effort ecological surveys.

Stats, Vol. 9, Pages 21: Performance Evaluation of the Robust Stein Estimator in the Presence of Multicollinearity and Outliers

Lwando Dlembula — 2026-02-22

Stats doi: 10.3390/stats9010021

Authors: Lwando Dlembula, Chioneso Show Marange, Lwando Orbet Kondlo

Multicollinearity and outliers are common challenges in multiple linear regression, often adversely affecting the properties of least squares estimators. To address these issues, several robust estimators have been developed to handle multicollinearity and outliers individually or simultaneously. More recently, the robust Stein estimator (RSE) was introduced, which integrates shrinkage and robustness to effectively mitigate the impact of both multicollinearity and outliers. Despite its theoretical advantages, the finite-sample performance of this approach under multicollinearity and outliers remains underexplored. Earlier research on the RSE has focused mainly on outliers in the y direction, without considering that leverage points could substantially affect regression results. This study addresses the gap by considering both outliers in the y direction and leverage points, providing a more thorough assessment of the robustness of the RSE. In addition, to extend the limited benchmarking in the current literature, we compare the performance of the RSE with a wide range of robust and classical estimators. Several Monte Carlo (MC) simulations were conducted, considering both normal and heavy-tailed error distributions, with varying sample sizes, multicollinearity levels, and outlier proportions. Performance was evaluated using bootstrap estimates of root mean squared error (RMSE) and bias. The MC simulation results indicated that the RSE outperformed other estimators under several scenarios where both multicollinearity and outliers are present. Finally, real data studies confirm the MC simulation results.

Stats, Vol. 9, Pages 20: Penalized Likelihood Estimation of Continuation Ratio Models for Ordinal Response and Its Application in CGSS Data

Huihui Sun — 2026-02-19

Stats doi: 10.3390/stats9010020

Authors: Huihui Sun, Yemin Cui

The continuation ratio model is a crucial tool for analyzing ordinal response data. However, its explanatory power diminishes under high-dimensional settings where the number of covariates p is large. To address this, we introduce, for the first time, the smoothly clipped absolute deviation (SCAD) penalty into the forward continuation ratio model framework. We propose a corresponding penalized likelihood estimation method that performs simultaneous variable selection and parameter estimation, and we provide an efficient algorithm for its implementation. Numerical simulations demonstrate the favorable properties of the SCAD penalty: it precisely identifies significant variables while more aggressively shrinking the coefficients of irrelevant ones to zero, outperforming alternative penalties like Lasso and elastic net in selection accuracy. Finally, we illustrate the practical utility of our method through an empirical application using data from the Chinese General Social Survey (CGSS).
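
For reference, the SCAD penalty itself has a simple piecewise form. The sketch below uses the standard definition with the conventional default a = 3.7 and compares it with the Lasso penalty; it shows only the penalty function, not the paper's estimation algorithm for the continuation ratio model.

```python
def scad_penalty(beta, lam, a=3.7):
    """Standard SCAD penalty for one coefficient (requires a > 2, lam > 0)."""
    b = abs(beta)
    if b <= lam:
        return lam * b  # behaves like the Lasso near zero
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2  # constant: large effects are not shrunk


def lasso_penalty(beta, lam):
    """L1 penalty, which keeps growing with the coefficient size."""
    return lam * abs(beta)


for beta in (0.5, 2.0, 10.0):
    print(beta, scad_penalty(beta, lam=1.0), lasso_penalty(beta, lam=1.0))
```

The flat region beyond a·λ is what lets SCAD set small coefficients to zero without adding bias to large ones.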

Stats, Vol. 9, Pages 19: Sample Size Calculation and Power Analysis for the General Mediation Analysis Method

Nubaira Rizvi — 2026-02-14

Stats doi: 10.3390/stats9010019

Authors: Nubaira Rizvi, Amjila Bam, Wentao Cao, Qingzhao Yu

Mediation analysis is a widely used statistical technique for identifying the mechanisms underlying the relationship between an exposure and an outcome. However, accurate power analysis and sample size determination for mediation models that involve non-normal distributions or mixtures of continuous and binary variables are challenging. We propose a computationally efficient simulation-based approach for general mediation analysis. By applying monotone smoothing splines to estimate empirical critical values derived from extensive simulations, our method enables accurate power calculations without the need for real-time simulation. We validated the method across varying scenarios, including continuous variables, binary variables, and time-to-event outcomes, with strict Type I error control. According to the method, large effects (0.35) yielded >80% power at minimal sample sizes (n = 25–50) across all settings, while small effects (0.02) required larger samples. Continuous models achieved 80% power for small effects at n = 410, whereas fully binary models required n > 500. For medium effects (0.15), the power was >0.80 at n = 75 with binary mediators. This study presents a robust framework that combines the flexibility of simulation-based inference with the speed of analytical approximations. We provide an accompanying R package to facilitate efficient sample size planning for mediation models.

Stats, Vol. 9, Pages 18: The Bivariate Poisson–X–Exponential Distribution: Theory, Inference, and Multidomain Applications

Wafa Treidi — 2026-02-14

Stats doi: 10.3390/stats9010018

Authors: Wafa Treidi, Halim Zeghdoudi

We propose the Bivariate Poisson–X–Exponential Distribution (BPXED), a flexible bivariate count model obtained by compounding Poisson variables with a shared X–Exponential latent mixing distribution. The model extends the Poisson–X–Exponential (PXED) distribution and includes several bivariate Poisson-type models as special or limiting cases. Closed-form expressions are derived for the joint probability mass function, probability generating function, moments, and covariance structure, showing that dependence arises from shared latent heterogeneity and is restricted to positive correlation. Parameter estimation is developed using maximum likelihood, regression-based, and Bayesian approaches, and a Monte Carlo simulation study demonstrates good finite-sample performance. Applications to soccer scores, reliability failures, and correlated photon counts illustrate improved goodness-of-fit over classical and recent competing models. Overall, BPXED provides an analytically tractable and interpretable framework for modeling positively dependent and overdispersed bivariate count data.

Stats, Vol. 9, Pages 17: Estimating the Parameter of Direct Effects in Crossover Designs: The Case of 6 Periods and 2 Treatments

Miltiadis S. Chalikias — 2026-02-12

Stats doi: 10.3390/stats9010017

Authors: Miltiadis S. Chalikias

The present study investigates the derivation of optimal repeated measurement designs for two treatments, six periods, and n experimental units, focusing exclusively on the direct effects of the treatments. The optimal designs are determined for cases where n ≡ 0 or 1, 2, 3, 4 (mod 4). The adopted optimality criterion aims at minimizing the variance of the estimator of the direct effects, thereby ensuring maximum precision in parameter estimation and increased design efficiency. The results presented extend and complement earlier studies on optimal two-treatment repeated-measurement designs for a smaller number of periods, and are closely related to more recent work focusing on optimality with respect to direct effects. Overall, this work contributes to the theoretical framework of optimal design methodology by providing new insights into the structure and efficiency of repeated measurement designs, and lays the groundwork for future extensions incorporating treatment–period interactions.

Stats, Vol. 9, Pages 16: Using Vector Representations of Characteristic Functions and Vector Logarithms When Proving Asymptotic Statements

Wolf-Dieter Richter — 2026-02-11

Stats doi: 10.3390/stats9010016

Authors: Wolf-Dieter Richter

In this methodological–technical note, in addition to the well-known concepts of logarithms of positive real numbers and operators, we open a path for the mathematical treatment of the logarithm of a vector. We prove the most basic arithmetic operations for this new logarithm concept and demonstrate how it applies to characteristic functions and limit theorems of probability theory. As a side result, we revise a formula for i^i that is known from the literature.

Stats, Vol. 9, Pages 15: New Two-Parameter Ridge Estimators for Addressing Multicollinearity in Linear Regression: Theory, Simulation, and Applications

Md Ariful Hoque — 2026-02-10

Stats doi: 10.3390/stats9010015

Authors: Md Ariful Hoque, B. M. Golam Kibria, Zoran Bursac

Multicollinearity among explanatory variables often undermines the reliability of the ordinary least squares (OLS) estimator used in linear regression modeling. To overcome this limitation, a variety of two-parameter estimation strategies have been developed in prior research. We revisit these existing methods and present a new two-parameter ridge estimator to improve the accuracy of the estimated regression coefficients under multicollinearity. The estimators are compared theoretically in terms of their efficiency under the mean squared error (MSE) criterion. Furthermore, a comprehensive simulation study is conducted to examine the empirical properties of all these estimators under different configurations, followed by an application to a real-life dataset to examine their performance.

Stats, Vol. 9, Pages 14: eduSTAT—Automated Workflows for the Analysis of Small- to Medium-Sized Datasets

Rudolf Golubich — 2026-02-04

Stats doi: 10.3390/stats9010014

Authors: Rudolf Golubich

This communication provides a citable methodological reference for eduSTAT (v1), an automated, rule-based workflow for the statistical analysis of small- to medium-sized datasets (N≈30–3000). The web application is initially available in German and will be offered in English once it is established in German-speaking regions. It is developed with the aim of supporting early training in the scientific method and reducing the risk of spurious or inappropriate statistical analyses. The paper establishes the foundation for subsequent meta-analyses based on citation tracking of studies that apply eduSTAT, enabling iterative, data-driven improvement of the software.

Stats, Vol. 9, Pages 13: The Stingray Copula for Negative Dependence

Alecos Papadopoulos — 2026-02-04

Stats doi: 10.3390/stats9010013

Authors: Alecos Papadopoulos

We present a new single-parameter bivariate copula, called the Stingray, that is dedicated to representing negative dependence, and it nests the Independence copula. The Stingray copula is generated in a relatively novel way; it has a simple form and is always defined over the full support, unlike many copulas that model negative dependence. We provide visualizations of the copula, derive several dependence properties, and compute basic concordance measures. We compare it with other copulas and joint distributions with respect to the extent of dependence it can capture, and we find that the Stingray copula outperforms most of them while remaining competitive with well-known, widely used copulas such as the Gaussian and Frank copulas. Moreover, we show, through simulation, that the dependence structure it represents cannot be fully captured by these copulas, as it is asymmetric. We also show how the non-parametric Spearman’s rho measure of concordance can be used to formally test the hypothesis of statistical independence. As an illustration, we apply it to a financial data sample from the building construction sector in order to model the negative relationship between the level of capital employed and its gross rate of return.
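
The independence test mentioned above can be illustrated with a textbook version of Spearman's rho. This sketch assumes no tied observations and uses the usual t-type statistic with n − 2 degrees of freedom as an approximation; the paper's exact test procedure may differ, and the data are hypothetical.

```python
import math


def ranks(values):
    """Ranks 1..n of the observations (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r


def spearman_rho(x, y):
    """Spearman's rank correlation via the squared rank differences."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def independence_t_statistic(rho, n):
    """Approximate t statistic (n - 2 df) under independence; needs |rho| < 1."""
    return rho * math.sqrt((n - 2) / (1 - rho * rho))


# Hypothetical data: capital employed versus gross rate of return.
capital = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
returns = [9.0, 7.0, 8.0, 4.0, 5.0, 2.0]
rho = spearman_rho(capital, returns)
print(round(rho, 4), round(independence_t_statistic(rho, len(capital)), 3))
```

A strongly negative rho with a large absolute t statistic is evidence against independence in favour of negative dependence.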

Stats, Vol. 9, Pages 12: Tuning for Precision Forecasting of Green Market Volatility Time Series

Sonia Benghiat — 2026-01-29

Stats doi: 10.3390/stats9010012

Authors: Sonia Benghiat, Salim Lahmiri

In recent years, the green financial market has exhibited heightened daily volatility, largely due to policy changes and economic shifts. To explore the broader potential of predictive modeling for short-term volatility time series, this study analyzes how fine-tuning the hyperparameters of predictive models is essential for improving short-term forecasts of market volatility, particularly within the rapidly evolving domain of green financial markets. While traditional econometric models have long been employed to model market volatility, their application to green markets remains limited, especially when contrasted with the emerging potential of machine-learning and deep-learning approaches for capturing complex dynamics in this context. This study evaluates the performance of several data-driven forecasting models: two machine-learning models, regression tree (RT) and support vector regression (SVR), and three deep-learning models, long short-term memory (LSTM), convolutional neural network (CNN), and gated recurrent unit (GRU), applied to over a decade of daily estimated volatility data from three distinct green markets. Predictive accuracy is compared with and without hyperparameter optimization. In addition, this study introduces the quantile loss metric to better capture the skewness and heavy tails inherent in these financial series, alongside two widely used evaluation metrics. This comparative analysis yields significant numerical and graphical insights, enhancing the understanding of short-term volatility predictability in green markets and advancing a relatively underexplored research domain. The study demonstrates that deep-learning predictors outperform machine-learning ones, and that including a hyperparameter tuning algorithm yields consistent improvements across all deep-learning models and for all volatility time series.
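
The quantile loss referred to above is commonly defined as the pinball loss. The following minimal sketch uses that standard definition; the quantile level and the toy series are hypothetical, and the abstract does not state which levels the study uses.

```python
def quantile_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile level q, with 0 < q < 1."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-prediction costs q per unit, over-prediction costs 1 - q.
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


observed = [1.0, 2.0, 3.0]   # hypothetical volatility values
forecast = [2.0, 2.0, 2.0]   # hypothetical forecasts
print(quantile_loss(observed, forecast, q=0.9))
print(quantile_loss([3.0], [1.0], q=0.9), quantile_loss([3.0], [1.0], q=0.1))
```

With q = 0.9 an under-prediction is penalised nine times as heavily as an equally large over-prediction, which makes the metric sensitive to errors in the upper tail.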

Stats, Vol. 9, Pages 11: Improving Confidence Interval Estimation in Logistic Regression with Multicollinear Predictors: A Comparative Study of Shrinkage Estimators and Application to Prostate Cancer Data

Sultana Mubarika Rahman Chowdhury — 2026-01-29

Stats doi: 10.3390/stats9010011

Authors: Sultana Mubarika Rahman Chowdhury, Zoran Bursac, B. M. Golam Kibria

In logistic regression with finite binary samples and multicollinear predictors, the maximum likelihood estimator often results in overfitting and high mean squared error (MSE). Shrinkage methods like ridge, Liu, and Kibria–Lukman offer improved MSE performance but are typically evaluated only on this criterion, which overlooks their inferential capability. This study shifts the focus toward confidence interval coverage, using simulations to assess the coverage probability, interval width, and MSE of several shrinkage estimators under varying conditions. The results show that, while shrinkage methods generally reduce interval width and MSE, many fail to maintain adequate coverage. However, certain ridge and Kibria–Lukman estimators achieve a favorable balance between narrow interval width and consistent coverage, making them preferable. The findings are further validated using a prostate cancer dataset, contributing to more reliable inference in logistic regression under multicollinearity. Overall, the results demonstrate that well-chosen shrinkage estimators can serve as effective alternatives to the MLE in biostatistical modeling, improving the stability and interpretability of regression analyses in studies pertaining to public health and medicine.

Stats, Vol. 9, Pages 10: Effect Structures in Ordinal Regression: The Adjacent Categories Approach

Gerhard Tutz — 2026-01-27

Stats doi: 10.3390/stats9010010

Authors: Gerhard Tutz

The potential of the adjacent categories approach for capturing the influence of explanatory variables on ordinal responses is investigated. Several models with increasing complexity in their linear predictors are considered, and their relationships are discussed, including the basic adjacent categories model, the stereotype model, models with category-specific effects, and dispersion models. For the adjacent categories framework, regularization methods for effect selection are introduced with the aim of distinguishing between no effect, global effects, and category-specific effects. Particular attention is given to the adjacent dispersion model, which provides a parsimonious parameterization while substantially improving model fit compared to the basic model. Effect selection for both the location and dispersion effects in the adjacent dispersion model is introduced. The proposed approaches are illustrated using several real data sets.
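
To make the basic model concrete, the sketch below computes category probabilities under one common parameterization of the adjacent categories model, in which the log-odds of category r+1 versus category r equal θ_r + η, with η the linear predictor. The sign convention and the numerical values are assumptions for illustration and may differ from those used in the paper.

```python
import math


def adjacent_category_probs(thresholds, eta):
    """Probabilities of categories 1..k given k-1 thresholds and predictor eta."""
    # log(pi_{r+1} / pi_r) = theta_r + eta, so log-weights accumulate.
    log_weights = [0.0]
    for theta in thresholds:
        log_weights.append(log_weights[-1] + theta + eta)
    shift = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(v - shift) for v in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


thetas = [0.5, -0.2, -1.0]  # hypothetical category-specific intercepts
probs = adjacent_category_probs(thetas, eta=0.3)
print([round(p, 4) for p in probs])
```

In this parameterization a global effect shifts every adjacent log-odds by the same η, whereas category-specific effects would let η vary with r.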

Stats, Vol. 9, Pages 9: A Utility-Driven Bayesian Design: A New Framework for Extracting Optimal Experiments from Observational Reliability Data

Rossella Berni — 2026-01-21

Stats, Vol. 9, Pages 9: A Utility-Driven Bayesian Design: A New Framework for Extracting Optimal Experiments from Observational Reliability Data

Stats doi: 10.3390/stats9010009

Authors: Rossella Berni, Nedka Dechkova Nikiforova, Federico Mattia Stefanini

In this study, a procedure to build Bayesian optimal designs using utility functions and exploiting existing data is proposed. The procedure is illustrated through a case study in the field of reliability, by applying a hierarchical Bayesian model and performing Markov Chain Monte Carlo simulations. Two innovative contributions are introduced: (i) the definition of specific utility functions that involve several key issues and (ii) the use of observational data. The use of observational data makes it possible to build the optimal design without additional costs for the company, while the definition of the utility functions accounts for the specific characteristics of the reliability study. Features like model residuals, i.e., discrepancies between observed and predicted response values, and the costs of the electronic component are addressed. Costs are also weighted considering the environmental impact. Satisfactory results are obtained and subsequently validated through an in-depth sensitivity analysis.

Stats, Vol. 9, Pages 8: Preliminary and Shrinkage-Type Estimation for the Parameters of the Birnbaum–Saunders Distribution Based on Modified Moments

Syed Ejaz Ahmed — 2026-01-16

Stats, Vol. 9, Pages 8: Preliminary and Shrinkage-Type Estimation for the Parameters of the Birnbaum–Saunders Distribution Based on Modified Moments

Stats doi: 10.3390/stats9010008

Authors: Syed Ejaz Ahmed, Muhammad Kashif Ali Shah, Waqas Makhdoom, Nighat Zahra

The two-parameter Birnbaum–Saunders (B-S) distribution is widely applied across various fields due to its favorable statistical properties. This study aims to enhance the efficiency of modified moment estimators for the B-S distribution by systematically incorporating auxiliary non-sample information. To this end, we developed and analyzed a suite of estimation strategies, including restricted estimators, preliminary test estimators, and Stein-type shrinkage estimators. A pretest procedure was formulated to guide the decision on whether to integrate the non-sample information. The relative performance of these estimators was rigorously evaluated through an asymptotic distributional analysis, comparing their asymptotic distributional bias and risk under a sequence of local alternatives. The finite-sample properties were assessed via Monte Carlo simulation studies. The practical utility of the proposed methods is demonstrated through applications to two real-world datasets: failure times for mechanical valves and bone mineral density measurements. Both numerical results and theoretical analysis confirm that the proposed shrinkage-based techniques deliver substantial efficiency gains over conventional estimators.

Stats, Vol. 9, Pages 7: Performance Forecasting for Multi-Server Retrial Queue with Possibility of Processing Repetition and Server Reservation for Repeating Users

Alexander N. Dudin — 2026-01-09

Stats, Vol. 9, Pages 7: Performance Forecasting for Multi-Server Retrial Queue with Possibility of Processing Repetition and Server Reservation for Repeating Users

Stats doi: 10.3390/stats9010007

Authors: Alexander N. Dudin, Sergei A. Dudin, Olga S. Dudina

This study focuses on forecasting and optimizing the performance of a real-world object modelled by a multi-server queueing system that processes two types of users: primary (new) users and repeating users. Repeating users are those who succeeded in entering processing upon arrival and then decided to repeat it. These users are privileged and can enter processing whenever they wish, provided at least one device is idle. A primary user is admitted to the system only if the number of occupied devices is less than some threshold value and the number of repeating users residing in the system does not exceed certain thresholds. Repeating users are impatient and non-persistent. Arrivals of primary users are described by the Markovian arrival process. Processing times of primary and repeating users have distinct phase-type distributions. Utilizing the concept of generalized phase–time distributions, the dynamics of this queueing system are formally characterized by a multidimensional Markov chain, which is examined in this paper. The ergodicity condition is derived. The dependence of the key performance characteristics of the system on the thresholds defining the admission policy for primary users is illustrated numerically. Optimal threshold selection is demonstrated numerically.

Stats, Vol. 9, Pages 6: Two-Stage Wiener-Physically-Informed-Neural-Network (W-PINN) AI Methodology for Highly Dynamic and Highly Complex Static Processes

Dillon G. Hurd — 2026-01-01

Stats, Vol. 9, Pages 6: Two-Stage Wiener-Physically-Informed-Neural-Network (W-PINN) AI Methodology for Highly Dynamic and Highly Complex Static Processes

Stats doi: 10.3390/stats9010006

Authors: Dillon G. Hurd, Yuderka T. González, Jacob Oyler, Spencer Wolfe, Monica H. Lamm, Derrick K. Rollins

Our new Theoretically Dynamic Regression (TDR) modeling methodology was recently applied in three types of real data modeling cases using physically based dynamic model structures with low-order linear regression static functions. Two of the modeling cases achieved the validation set modeling goal of r_fit,val ≥ 0.9. However, the third case, consisting of eleven (11) type one (1) sensor glucose data sets and thus eleven individual models, fell considerably short of this modeling goal in every instance, with an average r_fit,val of r̄_fit,val = 0.68. For this case, the dynamic forms are highly complex 60 min forecast, second-order-plus-dead-time-plus-lead (SOPDTPL) structures, and the static form is a twelve (12) input first-order linear regression structure. Using these dynamic structure results, the objective is to significantly increase r_fit for each of the eleven (11) modeling cases using the recently developed Wiener-Physically-Informed-Neural-Network (W-PINN) approach as the static modeling structure. Two W-PINN stage-two static structures are evaluated: one developed using the JMP® Pro Version 16 Artificial Neural Network (ANN) toolbox and the other developed using a novel ANN methodology coded in Python version 3.12.3. The JMP r̄_fit,val = 0.74 with a maximum of 0.84. The Python r̄_fit,val = 0.82 with a maximum of 0.93. Incorporating bias correction, using current and past SGC residuals, the Python estimator improved r̄_fit,val from 0.82 to 0.87, with the maximum still 0.93.

Stats, Vol. 9, Pages 5: Probabilistic Links Between Quantum Classification of Patterns of Boolean Functions and Hamming Distance

Theodore Andronikos — 2026-01-01

Stats, Vol. 9, Pages 5: Probabilistic Links Between Quantum Classification of Patterns of Boolean Functions and Hamming Distance

Stats doi: 10.3390/stats9010005

Authors: Theodore Andronikos, Constantinos Bitsakos, Konstantinos Nikas, Georgios I. Goumas, Nectarios Koziris

This article investigates the probabilistic relationship between quantum classification of Boolean functions and their Hamming distance. By integrating concepts from quantum computing, information theory, and combinatorics, we explore how Hamming distance serves as a metric for analyzing deviations in function classification. Our extensive experimental results confirm that the Hamming distance is a pivotal metric for validating nearest neighbors in the process of classifying random functions. One of the significant conclusions we arrived at is that the probability of successful classification decreases monotonically with the Hamming distance. However, key exceptions were found in specific classes, revealing intra-class heterogeneity. We have established that these deviations are not random but are systemic and predictable. Furthermore, we were able to quantify these irregularities, turning potential errors into manageable phenomena. The most important novelty of this work is the demarcation, for the first time to the best of our knowledge, of precise Hamming distance intervals for the classification probability. These intervals bound the possible values the probability can assume, and provide a new foundational tool for probabilistic assessment in quantum classification. Practitioners can now endorse classification results with high certainty or dismiss them with confidence. This framework can significantly enhance any quantum classification algorithm’s reliability and decision-making capability.
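As a reminder of the metric the abstract refers to, the following minimal sketch computes the Hamming distance between two Boolean functions represented by their truth tables. The string encoding of truth tables and the function name `hamming_distance` are assumptions made for illustration; the sketch does not reproduce the paper's quantum classification procedure.

```python
def hamming_distance(f, g):
    """Number of inputs on which two Boolean functions disagree.

    f and g are truth tables written as strings of '0'/'1' of equal
    length (an illustrative encoding, not taken from the paper).
    """
    if len(f) != len(g):
        raise ValueError("truth tables must have the same length")
    return sum(a != b for a, b in zip(f, g))


# Truth tables over inputs 00, 01, 10, 11 (hypothetical examples).
XOR = "0110"
AND = "0001"
OR = "0111"

d_xor_and = hamming_distance(XOR, AND)
d_xor_or = hamming_distance(XOR, OR)
```

Under this encoding, two functions at distance 0 are identical, and the maximum distance equals the truth-table length.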

Stats, Vol. 9, Pages 4: ST-Community Detection Methods for Spatial Transcriptomics Data Analysis

Charles Zhao — 2026-01-01

Stats, Vol. 9, Pages 4: ST-Community Detection Methods for Spatial Transcriptomics Data Analysis

Stats doi: 10.3390/stats9010004

Authors: Charles Zhao, Jian-Jian Ren

Single-cell spatial transcriptomics (ST) data with cell type and spatial location, i.e., (C,x,y) with C as the cell type and (x,y) as its spatial location, produced by recent biotechnologies such as CosMx and Xenium, contain a huge amount of information about cancer tissue samples. They thus have great potential for cancer research via the detection of ST-communities, where an ST-community is defined as a collection of cells with a distinct cell-type composition and similar neighboring patterns based on nearby cell percentages. For huge CosMx single-cell ST data, however, the existing clustering methods do not work well for ST-community detection, and the commonly used kNN compositional data method lacks informative neighboring cell patterns. In this article, we propose a novel and more informative disk compositional data (DCD) method for single-cell ST data, which identifies the neighboring patterns of each cell by taking into account the features of ST data from recent technologies. After initial processing of the single-cell ST data into the DCD matrix, an innovative DCD-TMHC computation method for ST-community detection is proposed. Extensive simulation studies and the analysis of CosMx breast cancer data, an example of a single-cell ST dataset, clearly show that the proposed DCD-TMHC computation method is superior to existing methods. Based on the ST-communities detected for the CosMx breast cancer data, logistic regression analysis results demonstrate that the proposed DCD-TMHC computation method produces more interpretable and superior outcomes, especially in terms of assessment for different cancer categories. These results suggest that the proposed DCD-TMHC computation method will be helpful for future cancer research based on single-cell ST data, which can improve cancer diagnosis and the monitoring of cancer treatment progress.

Stats, Vol. 9, Pages 3: Repeated Measurement Designs of Five Periods: Estimating the Parameter of Carryover Effects

Miltiadis S. Chalikias — 2025-12-29

Stats, Vol. 9, Pages 3: Repeated Measurement Designs of Five Periods: Estimating the Parameter of Carryover Effects

Stats doi: 10.3390/stats9010003

Authors: Miltiadis S. Chalikias

This study investigates the derivation of optimal repeated measurement designs of two treatments, five periods, and n experimental units for carryover effects. The optimal designs are determined for cases where n ≡ 0 or 1 (mod 2). The adopted optimality criterion focuses on minimizing the variance of the estimated carryover effect, thereby ensuring maximum precision in parameter estimation and design efficiency. The results presented here extend and complement earlier research by Chalikias and Kounias on optimal two-treatment repeated measurement designs for a smaller number of periods, and are closely related to the more recent findings on optimal designs for direct effects. Overall, the present work contributes to the theoretical framework of optimal design methodology by providing new insights into the structure and efficiency of repeated measurement designs, particularly in the presence of carryover effects, and sets the ground for future extensions incorporating treatment–period interactions.

Stats, Vol. 9, Pages 2: Stochastic Complexity of Rayleigh and Rician Data with Normalized Maximum Likelihood

Aaron Lanterman — 2025-12-25

Stats, Vol. 9, Pages 2: Stochastic Complexity of Rayleigh and Rician Data with Normalized Maximum Likelihood

Stats doi: 10.3390/stats9010002

Authors: Aaron Lanterman

The Rician distribution, which arises in radar, communications, and magnetic resonance imaging, is characterized by a noncentrality parameter and a scale parameter. The Rayleigh distribution is a special case of the Rician distribution with a noncentrality parameter of zero. This paper considers generalized hypothesis testing for Rayleigh and Rician distributions using Rissanen’s stochastic complexity, particularly his approximation employing Fisher information matrices. The Rayleigh distribution is a member of the exponential family, so its normalized maximum likelihood density is readily computed, and shown to asymptotically match the Fisher information approximation. Since the Rician distribution is not a member of the exponential family, its normalizing term is difficult to compute directly, so the Fisher information approximation is employed. Because the square root of the determinant of the Fisher information matrix is not integrable, we restrict the integral to a subset of its range, and separately encode the choice of subset.

Stats, Vol. 9, Pages 1: A Proportional Hazards Mixture Cure Model for Subgroup Analysis: Inferential Method and an Application to Colon Cancer Data

Kai Liu — 2025-12-24

Stats, Vol. 9, Pages 1: A Proportional Hazards Mixture Cure Model for Subgroup Analysis: Inferential Method and an Application to Colon Cancer Data

Stats doi: 10.3390/stats9010001

Authors: Kai Liu, Yingwei Peng, Narayanaswamy Balakrishnan

When determining subgroups with heterogeneous treatment effects in cancer clinical trials, the threshold of a variable that defines subgroups is often pre-determined by physicians based on their experience, and the optimality of the threshold is not well studied, particularly when the mixture cure rate model is considered. We propose a mixture cure model that allows optimal subgroups to be estimated for both the time to event for uncured subjects and the cure status. We develop a smoothed maximum likelihood method for the estimation of model parameters. An extensive simulation study shows that the proposed smoothed maximum likelihood method provides accurate estimates. Finally, the proposed mixture cure model is applied to a colon cancer study to evaluate the potential differences in the treatment effect of levamisole plus fluorouracil therapy versus levamisole alone therapy between younger and older patients. The model suggests that the difference in the treatment effect on the time to cancer recurrence for uncured patients is significant between patients younger than 67 and patients older than 67, and the younger patient group benefits more from the combined therapy than the older patient group.

Stats, Vol. 8, Pages 119: Robust Kibria Estimators for Mitigating Multicollinearity and Outliers in a Linear Regression Model

Hina Naz — 2025-12-17

Stats, Vol. 8, Pages 119: Robust Kibria Estimators for Mitigating Multicollinearity and Outliers in a Linear Regression Model

Stats doi: 10.3390/stats8040119

Authors: Hina Naz, Ismail Shah, Danish Wasim, Sajid Ali

In the presence of multicollinearity, the ordinary least squares (OLS) estimators, despite remaining the best linear unbiased estimators (BLUE), have inflated variances and lose precision. In addition, these estimators are highly sensitive to outliers in the response direction. To overcome these limitations, robust estimation techniques are often integrated with shrinkage methods. This study proposes a new class of Kibria Ridge M-estimators specifically developed to simultaneously address multicollinearity and outlier contamination. A comprehensive Monte Carlo simulation study is conducted to evaluate the performance of the proposed and existing estimators. Based on the mean squared error criterion, the proposed Kibria Ridge M-estimators consistently outperform the traditional ridge-type estimators under varying parameter settings. Furthermore, the practical applicability and superiority of the proposed estimators are validated using the Tobacco and Anthropometric datasets. Overall, the newly proposed estimators demonstrate good performance, offering robust and efficient alternatives for regression modeling in the presence of multicollinearity and outliers.

Stats, Vol. 8, Pages 118: Korovkin-Type Approximation Theorems for Statistical Gauge Integrable Functions of Two Variables

Hari Mohan Srivastava — 2025-12-15

Stats, Vol. 8, Pages 118: Korovkin-Type Approximation Theorems for Statistical Gauge Integrable Functions of Two Variables

Stats doi: 10.3390/stats8040118

Authors: Hari Mohan Srivastava, Bidu Bhusan Jena, Susanta Kumar Paikray, Umakanta Misra

In this work, we develop and investigate statistical extensions of gauge integrability and gauge summability for double sequences of functions of two real variables, formulated within the framework of deferred weighted means. We begin by establishing several fundamental limit theorems that serve to connect these generalized notions and provide a rigorous theoretical foundation. Based on these results, we establish Korovkin-type approximation theorems using the classical test function set {1, s, t, s² + t²} in the Banach space C([0,1]²). To demonstrate the applicability of the proposed framework, we further present an example involving families of positive linear operators associated with the Meyer-König and Zeller (MKZ) operators. These findings not only extend classical Korovkin-type theorems to the setting of statistical deferred gauge integrability and summability but also underscore their robustness in addressing double sequences and the approximation of two-variable functions.

Stats, Vol. 8, Pages 117: Still No Free Lunch: Failure of Stability in Regulated Systems of Interacting Cognitive Modules

Rodrick Wallace — 2025-12-15

Stats, Vol. 8, Pages 117: Still No Free Lunch: Failure of Stability in Regulated Systems of Interacting Cognitive Modules

Stats doi: 10.3390/stats8040117

Authors: Rodrick Wallace

The asymptotic limit theorems of information and control theories, instantiated as the Rate Distortion Control Theory of bounded rationality, enable examination of stability across models of cognition based on a variety of fundamental, underlying probability distributions likely to characterize different forms of embodied ‘intelligent’ systems. Embodied cognition is inherently unstable, requiring the pairing of cognition with regulation at and across the various and varied scales and levels of organization. Like contemporary Large Language Model ‘hallucination,’ de facto ‘psychopathology’—the failure of regulation in systems of cognitive modules—is not a bug but an inherent feature of embodied cognition. What particularly emerges from this analysis, then, is the ubiquity of failure-under-stress even for ‘intelligent’ embodied cognition, where cognitive and regulatory modules are closely paired. There is still No Free Lunch, much in the classic sense of Wolpert and Macready. With some further effort, the probability models developed here can be transformed into robust statistical tools for the analysis of observational and experimental data regarding regulated and other cognitive phenomena.

Stats, Vol. 8, Pages 116: Mapping Research on the Birnbaum–Saunders Statistical Distribution: Patterns, Trends, and Scientometric Perspective

Víctor Leiva — 2025-12-13

Stats, Vol. 8, Pages 116: Mapping Research on the Birnbaum–Saunders Statistical Distribution: Patterns, Trends, and Scientometric Perspective

Stats doi: 10.3390/stats8040116

Authors: Víctor Leiva

This article provides a critical assessment of the Birnbaum–Saunders (BS) distribution, a pivotal statistical model for lifetime data analysis and reliability estimation, particularly in fatigue contexts. The model has been successfully applied across diverse fields, including biological mortality, environmental sciences, medicine, and risk models. Moving beyond a basic scientometric review, this study synthesizes findings from 353 peer-reviewed articles, selected using PRISMA 2020 protocols, to specifically trace the evolution of estimation techniques, regression methods, and model extensions. Key findings reveal robust theoretical advances, such as Bayesian methods and bivariate/spatial adaptations, alongside practical progress in influence diagnostics and software development. The analysis highlights key research gaps, including the critical need for scalable, auditable software and structured reviews, and notes a peak in scholarly activity around 2019, driven largely by the Brazil-Chile research alliance. This work offers a consolidated view of current BS model implementations and outlines clear future directions for enhancing their theoretical robustness and practical utility.

Stats, Vol. 8, Pages 115: Entropy and Minimax Risk Diversification: An Empirical and Simulation Study of Portfolio Optimization

Hongyu Yang — 2025-12-11

Stats, Vol. 8, Pages 115: Entropy and Minimax Risk Diversification: An Empirical and Simulation Study of Portfolio Optimization

Stats doi: 10.3390/stats8040115

Authors: Hongyu Yang, Zijian Luo

The optimal allocation of funds within a portfolio is a central research focus in finance. Conventional mean-variance models often concentrate a significant portion of funds in a limited number of high-risk assets. To promote diversification, Shannon Entropy is widely applied. This paper develops a portfolio optimization model that incorporates Shannon Entropy alongside a risk diversification principle aimed at minimizing the maximum individual asset risk. The study combines empirical analysis with numerical simulations. First, empirical data are used to assess the theoretical model’s effectiveness and practicality. Second, numerical simulations are conducted to analyze portfolio performance under extreme market scenarios. Specifically, the numerical results indicate that for fixed values of the risk balance coefficient and minimum expected return, the optimal portfolios and their return distributions are similar when the risk is measured by standard deviation, absolute deviation, or standard lower semi-deviation. This suggests that the model exhibits robustness to variations in the risk function, providing a relatively stable investment strategy.
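The two ingredients named in the abstract, Shannon entropy of the portfolio weights and the maximum individual asset risk, can be illustrated with a small sketch. The abstract does not give the paper's exact objective or risk definition, so the sketch assumes, purely for illustration, that the individual risk of asset i is its weight times a standalone risk measure; the function names are hypothetical.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy -sum(w * ln w) of portfolio weights (0 * ln 0 := 0)."""
    return -sum(w * math.log(w) for w in weights if w > 0)


def max_individual_risk(weights, risks):
    """Largest weighted standalone risk, max_i w_i * r_i.

    Treating w_i * r_i as the 'individual asset risk' is an assumption
    for this sketch, not the paper's stated definition.
    """
    return max(w * r for w, r in zip(weights, risks))


# Hypothetical standalone risks (e.g., standard deviations) for 4 assets.
risks = [0.1, 0.2, 0.3, 0.4]
equal = [0.25, 0.25, 0.25, 0.25]
concentrated = [0.7, 0.1, 0.1, 0.1]

entropy_equal = shannon_entropy(equal)          # equals ln(4), the maximum
entropy_conc = shannon_entropy(concentrated)    # strictly smaller
worst_equal = max_individual_risk(equal, risks)
```

Entropy is maximized by equal weights, which is why it is used as a diversification term; a minimax criterion additionally penalizes the single worst weighted risk.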

Stats, Vol. 8, Pages 114: Validated Transfer Learning Peters–Belson Methods for Survival Analysis: Ensemble Machine Learning Approaches with Overfitting Controls for Health Disparity Decomposition

Menglu Liang — 2025-12-10

Stats, Vol. 8, Pages 114: Validated Transfer Learning Peters–Belson Methods for Survival Analysis: Ensemble Machine Learning Approaches with Overfitting Controls for Health Disparity Decomposition

Stats doi: 10.3390/stats8040114

Authors: Menglu Liang, Yan Li

Background: Health disparities research increasingly relies on complex survey data to understand survival differences between population subgroups. While Peters–Belson decomposition provides a principled framework for distinguishing disparities explained by measured covariates from unexplained residual differences, traditional approaches face challenges with complex data patterns and model validation for counterfactual estimation. Objective: To develop validated Peters–Belson decomposition methods for survival analysis that integrate ensemble machine learning with transfer learning while ensuring logical validity of counterfactual estimates through comprehensive model validation. Methods: We extend the traditional Peters–Belson framework through ensemble machine learning that combines Cox proportional hazards models, cross-validated random survival forests, and regularized gradient boosting approaches. Our framework incorporates a transfer learning component via principal component analysis (PCA) to discover shared latent factors between majority and minority groups. We note that this “transfer learning” differs from the standard machine learning definition (pre-trained models or domain adaptation); here, we use the term in its statistical sense to describe the transfer of covariate structure information from the pooled population to identify group-level latent factors. We develop a comprehensive validation framework that ensures Peters–Belson logical bounds compliance, preventing mathematical violations in counterfactual estimates. The approach is evaluated through simulation studies across five realistic health disparity scenarios using stratified complex survey designs. Results: Simulation studies demonstrate that validated ensemble methods achieve superior performance compared to individual models (proportion explained: 0.352 vs. 0.310 for individual Cox, 0.325 for individual random forests), with the validation framework reducing logical violations from 34.7% to 2.1% of cases. Transfer learning provides an additional 16.1% average improvement in explanation of unexplained disparity when significant unmeasured confounding exists, with a 90.1% overall validation success rate. The validation framework ensures explanation proportions remain within realistic bounds while maintaining computational efficiency with 31% overhead for validation procedures. Conclusions: Validated ensemble machine learning provides substantial advantages for Peters–Belson decomposition when combined with proper model validation. Transfer learning offers conditional benefits for capturing unmeasured group-level factors while preventing mathematical violations common in standard approaches. The framework demonstrates that realistic health disparity patterns show 25–35% of differences explained by measured factors, providing actionable targets for reducing health inequities.

Stats, Vol. 8, Pages 113: Goodness of Chi-Square for Linearly Parameterized Fitting

George Livadiotis — 2025-12-01

Stats, Vol. 8, Pages 113: Goodness of Chi-Square for Linearly Parameterized Fitting

Stats doi: 10.3390/stats8040113

Authors: George Livadiotis

The paper presents an alternative perspective on the reduced chi-square as a measure of the goodness of fitting methods. The reduced chi-square is given by the ratio of the fitting error to the propagation error, a universal relationship that holds for any linearly parameterized fitting model but not for a nonlinearly parameterized one. We begin by providing the proof for the traditional examples of one-parametric fitting of a constant and the bi-parametric fitting of a linear model, and then for the general case of any linearly multi-parameterized model. We also show that this characterization is not generally true for nonlinearly parameterized fitting. Finally, we demonstrate these theoretical developments with an application to real data from the plasma protons in the heliosphere.
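For the simplest case mentioned above, fitting a constant, the relationship can be checked numerically. The sketch below assumes equal measurement uncertainties sigma and interprets the "ratio of errors" as a ratio of variances: the squared standard error of the mean estimated from the scatter of the data (fitting error) divided by the squared error propagated from sigma (propagation error). This interpretation, the data values, and the function names are assumptions for illustration, not the paper's derivation.

```python
import statistics


def reduced_chi_square_constant(y, sigma):
    """Reduced chi-square for fitting a constant to y with equal errors sigma."""
    n = len(y)
    mean = statistics.fmean(y)  # least-squares estimate of the constant
    chi2 = sum(((yi - mean) / sigma) ** 2 for yi in y)
    return chi2 / (n - 1)       # one fitted parameter, so n - 1 degrees of freedom


def error_variance_ratio(y, sigma):
    """(fitting error)^2 / (propagation error)^2 for the fitted constant."""
    n = len(y)
    fit_var = statistics.variance(y) / n  # from the scatter of the data
    prop_var = sigma ** 2 / n             # propagated from the stated sigma
    return fit_var / prop_var


# Hypothetical measurements with a stated uncertainty of 0.1 each.
y = [1.0, 1.2, 0.9, 1.1]
sigma = 0.1
chi2_red = reduced_chi_square_constant(y, sigma)
ratio = error_variance_ratio(y, sigma)
```

Under these assumptions the two quantities coincide algebraically, so a reduced chi-square near 1 means the scatter-based and propagated uncertainties agree.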

Stats, Vol. 8, Pages 112: Factor Analysis Biplots for Continuous, Binary and Ordinal Data

Marina Valdés-Rodríguez — 2025-11-25

Stats, Vol. 8, Pages 112: Factor Analysis Biplots for Continuous, Binary and Ordinal Data

Stats doi: 10.3390/stats8040112

Authors: Marina Valdés-Rodríguez, Laura Vicente-González, José L. Vicente-Villardón

This article presents biplots derived from factor analysis of correlation matrices for both continuous and ordinal data. It introduces biplots specifically designed for factor analysis, detailing the geometric interpretation for each data type and providing an algorithm to compute biplot coordinates from the factorization of correlation matrices. The theoretical developments are illustrated using a real dataset that explores the relationship between volunteering, political ideology, and civic engagement in Spain.

Stats, Vol. 8, Pages 111: A Copula-Based Model for Analyzing Bivariate Offense Data

Dimuthu Fernando — 2025-11-19

Stats, Vol. 8, Pages 111: A Copula-Based Model for Analyzing Bivariate Offense Data

Stats doi: 10.3390/stats8040111

Authors: Dimuthu Fernando, Wimarsha Jayanetti

We developed a class of bivariate integer-valued time series models using copula theory. Each count time series is modeled as a Markov chain, with serial dependence characterized through copula-based transition probabilities for Poisson and Negative Binomial marginals. Cross-sectional dependence is modeled via a bivariate Gaussian copula, allowing for both positive and negative correlations and providing a flexible dependence structure. Model parameters are estimated using likelihood-based inference, where the bivariate Gaussian copula integral is evaluated through standard randomized Monte Carlo methods. The proposed approach is illustrated through an application to offense data from New South Wales, Australia, demonstrating its effectiveness in capturing complex dependence patterns.

Stats, Vol. 8, Pages 110: Prediction Inferences for Finite Population Totals Using Longitudinal Survey Data

Asokan M. Variyath — 2025-11-18

Stats, Vol. 8, Pages 110: Prediction Inferences for Finite Population Totals Using Longitudinal Survey Data

Stats doi: 10.3390/stats8040110

Authors: Asokan M. Variyath, Brajendra C. Sutradhar

In an infinite-/super-population (SP) setup, regression analysis of longitudinal data, which involves repeated responses and covariates collected from a sample of independent individuals or correlated individuals belonging to a cluster such as a household/family, has been intensively studied in the statistics literature over the last three decades. In general, a longitudinal correlation structure, such as an auto-correlation structure for repeated responses from an individual or a two-way cluster–longitudinal correlation structure for repeated responses from the individuals belonging to a cluster/household, is exploited to obtain consistent and efficient regression estimates. However, as opposed to the SP setup, a similar regression analysis for finite population (FP)-based longitudinal or clustered longitudinal data, using a survey sample (SS) taken from the FP based on a suitable sampling design, becomes complex. It requires first defining the FP regression and correlation (both longitudinal and/or clustered) parameters and then estimating them using appropriate sampling weighted-design unbiased (SWDU) estimating equations. Finite sampling inferences, such as predictions of longitudinal changes in FP totals, become much more complex, because it is necessary to predict the non-sampled totals after accommodating the longitudinal and/or clustered longitudinal correlation structures. Our objective in this paper is to deal with this complex FP prediction inference by developing a design cum model (DCM)-based estimation approach. Two competing FP total predictors, namely the design-assisted model-based (DAMB) and design cum model-based (DCMB) predictors, are compared using an intensive simulation study. The regression and correlation parameters involved in these prediction functions are optimally estimated using the proposed DCM-based approach.

Stats, Vol. 8, Pages 109: Maximum Likelihood and Calibrating Prior Prediction Reliability Bias Reference Charts

Stephen Jewson — 2025-11-06

Stats, Vol. 8, Pages 109: Maximum Likelihood and Calibrating Prior Prediction Reliability Bias Reference Charts

Stats doi: 10.3390/stats8040109

Authors: Stephen Jewson

There are many studies in the scientific literature that present predictions from parametric statistical models based on maximum likelihood estimates of the unknown parameters. However, generating predictions from maximum likelihood parameter estimates ignores the uncertainty around the parameter estimates. As a result, predictive probability distributions based on maximum likelihood are typically too narrow, and simulation testing has shown that tail probabilities are underestimated compared to the relative frequencies of out-of-sample events. We refer to this underestimation as a reliability bias. Previous authors have shown that objective Bayesian methods can eliminate or reduce this bias if the prior is chosen appropriately. Such methods have been given the name calibrating prior prediction. We investigate maximum likelihood reliability bias in more detail. We then present reference charts that quantify the reliability bias for 18 commonly used statistical models, for both maximum likelihood prediction and calibrating prior prediction. The charts give results for a large number of combinations of sample size and nominal probability and contain orders of magnitude more information about the reliability biases in predictions from these methods than has previously been published. These charts serve two purposes. First, they can be used to evaluate the extent to which maximum likelihood predictions given in the scientific literature are affected by reliability bias. If the reliability bias is large, the predictions may need to be revised. Second, the charts can be used in the design of future studies to assess whether it is appropriate to use maximum likelihood prediction, whether it would be more appropriate to reduce the reliability bias by using calibrating prior prediction, or whether neither maximum likelihood prediction nor calibrating prior prediction gives an adequately low reliability bias.
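The reliability bias described above can be reproduced in a few lines for one simple case. The following sketch assumes a normal model with unknown mean and standard deviation, a sample size of 10, and a nominal exceedance probability of 1%; these settings and the function name are illustrative choices, not values taken from the paper's charts.

```python
import random
from statistics import NormalDist, fmean


def ml_exceedance_frequency(n=10, nominal_tail=0.01, trials=20000, seed=0):
    """Out-of-sample frequency of exceeding the ML plug-in upper quantile.

    For each trial: draw n training values from N(0, 1), fit mean and
    standard deviation by maximum likelihood, form the plug-in quantile
    at level 1 - nominal_tail, and check whether one new draw exceeds it.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - nominal_tail)
    exceed = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mu_hat = fmean(sample)
        # ML estimate of the standard deviation divides by n, not n - 1.
        sigma_hat = (sum((s - mu_hat) ** 2 for s in sample) / n) ** 0.5
        threshold = mu_hat + z * sigma_hat
        if rng.gauss(0.0, 1.0) > threshold:
            exceed += 1
    return exceed / trials


observed = ml_exceedance_frequency()
```

In this setting the observed exceedance frequency comes out well above the nominal 1%, which is the underestimation of tail probabilities that the abstract calls reliability bias.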

Stats, Vol. 8, Pages 108: Analysis of the Truncated XLindley Distribution Using Bayesian Robustness

Meriem Keddali — 2025-11-05

Stats doi: 10.3390/stats8040108

Authors: Meriem Keddali, Hamida Talhi, Ali Slimani, Mohammed Amine Meraou

In this work, we present a robustness study of Bayesian estimators for the two-parameter upper truncated XLindley model, a variant of the Lindley model, based on the oscillation of posterior risks. We present the model under a censoring scheme along with its likelihood function. Sensitivity and robustness analysis of Bayesian estimators has been addressed by only a small number of authors, and as a result very few applications have been developed in this field. The oscillation of the posterior risks of the Bayesian estimator is used to illustrate the method. Using a Monte Carlo simulation study, we show that, with an appropriate generalized loss function, a robust Bayesian estimator of the parameters, corresponding to the smallest oscillation of the posterior risks, can be obtained; robust estimators can be obtained when the parameter space is low-dimensional. The robustness and precision of Bayesian parameter estimation can be enhanced in regimes where the parameters of interest are of small magnitude.

Stats, Vol. 8, Pages 107: A High Dimensional Omnibus Regression Test

Ahlam M. Abid — 2025-11-05

Stats doi: 10.3390/stats8040107

Authors: Ahlam M. Abid, Paul A. Quaye, David J. Olive

Consider regression models where the response variable $Y$ depends on the $p \times 1$ vector of predictors $x = (x_1, \ldots, x_p)^T$ only through the sufficient predictor $SP = \alpha + x^T \beta$. Let the covariance vector $\mathrm{Cov}(x, Y) = \Sigma_{xY}$. Assume the cases $(x_i^T, Y_i)^T$ are independent and identically distributed random vectors for $i = 1, \ldots, n$. Then for many such regression models, $\beta = 0$ if and only if $\Sigma_{xY} = 0$, where $0$ is the $p \times 1$ vector of zeroes. The test of $H_0: \Sigma_{xY} = 0$ versus $H_1: \Sigma_{xY} \neq 0$ is equivalent to the high dimensional one sample test $H_0: \mu = 0$ versus $H_A: \mu \neq 0$ applied to $w_1, \ldots, w_n$, where $w_i = (x_i - \mu_x)(Y_i - \mu_Y)$ and the expected values are $E(x) = \mu_x$ and $E(Y) = \mu_Y$. Since $\mu_x$ and $\mu_Y$ are unknown, the test of $H_0: \beta = 0$ versus $H_1: \beta \neq 0$ is implemented by applying the one sample test to $v_i = (x_i - \bar{x})(Y_i - \bar{Y})$ for $i = 1, \ldots, n$. This test has milder regularity conditions than its few competitors. For the multiple linear regression one component partial least squares and marginal maximum likelihood estimators, the test can be adapted to test $H_0: (\beta_{i_1}, \ldots, \beta_{i_k})^T = 0$ versus $H_1: (\beta_{i_1}, \ldots, \beta_{i_k})^T \neq 0$, where $1 \le k \le p$.
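
The construction of the vectors v_i from the data can be sketched as follows. This is only an illustration of the centering step: the helper names `centered_products` and `column_means` and the toy data are assumptions, and the high dimensional one sample test that is then applied to the v_i is not specified in the abstract, so it is not implemented here.

```python
def centered_products(X, Y):
    """Return v_i = (x_i - x_bar) * (Y_i - Y_bar) for each case i.

    X is a list of n rows with p predictors each; Y is a list of n responses.
    """
    n, p = len(X), len(X[0])
    x_bar = [sum(row[j] for row in X) / n for j in range(p)]
    y_bar = sum(Y) / n
    return [
        [(row[j] - x_bar[j]) * (y - y_bar) for j in range(p)]
        for row, y in zip(X, Y)
    ]


def column_means(V):
    """Average the v_i componentwise."""
    n, p = len(V), len(V[0])
    return [sum(row[j] for row in V) / n for j in range(p)]


# Toy data (hypothetical): n = 4 cases, p = 2 predictors.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
Y = [1.0, 2.0, 3.0, 4.0]

V = centered_products(X, Y)
v_bar = column_means(V)  # equals the sample covariance vector with divisor n
print(v_bar)
```

The mean of the v_i is the sample version of the covariance vector between the predictors and the response (with divisor n), which is why a one sample test of a zero mean on the v_i targets the hypothesis that this covariance vector is zero.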

Stats, Vol. 8, Pages 106: A Multi-State Model for Lung Cancer Mortality in Survival Progression

Vinoth Raman — 2025-11-05

Stats doi: 10.3390/stats8040106

Authors: Vinoth Raman, Sandra S. Ferreira, Dário Ferreira, Ayman Alzaatreh

Lung cancer remains one of the leading causes of death worldwide due to its high rates of illness and mortality. In this study, we applied a continuous-time multi-state Markov model to examine how lung cancer progresses through six clinically defined stages, using retrospective data from 576 patients. The model describes movements between disease stages and the final stage (death), providing estimates of how long patients typically remain in each stage and how quickly they move to the next. It also considers important demographic and clinical factors such as age, smoking history, hypertension, asthma, and gender, which influence survival outcomes. Our findings show slower changes at the beginning of the disease but faster decline in later stages, with clear differences across patient groups. This approach highlights the dynamic course of the illness and can help guide tailored follow-up, personalized treatment, and health policy decisions. The study is based on a secondary analysis of publicly available data and therefore did not require clinical trial registration.
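
In a continuous-time Markov model, the time spent in a state and the choice of the next state both follow from the transition intensity matrix. The sketch below illustrates this with a hypothetical three-state matrix; the article's model has six stages and estimated intensities that are not given in the abstract, so every number here is an assumption.

```python
# Hypothetical transition intensities (per year) for a 3-state illustration:
# state 0 = early stage, state 1 = late stage, state 2 = death (absorbing).
Q = [
    [-0.5, 0.4, 0.1],
    [0.0, -0.8, 0.8],
    [0.0, 0.0, 0.0],
]


def mean_sojourn_times(Q):
    """Mean time spent in each state per visit: -1 / q_ii (infinite if absorbing)."""
    return [(-1.0 / row[i]) if row[i] < 0 else float("inf") for i, row in enumerate(Q)]


def jump_probabilities(Q):
    """Probability that the next state is j given the chain leaves state i."""
    probs = []
    for i, row in enumerate(Q):
        rate_out = -row[i]
        if rate_out <= 0:
            probs.append([0.0] * len(row))  # absorbing state: no onward jumps
        else:
            probs.append([0.0 if j == i else row[j] / rate_out for j in range(len(row))])
    return probs


sojourn = mean_sojourn_times(Q)
jumps = jump_probabilities(Q)
print("mean sojourn times (years):", sojourn)
print("jump probabilities from state 0:", jumps[0])
```

Covariates such as age or smoking history would typically enter by making the intensities depend on them; that step is omitted from this sketch.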

Stats, Vol. 8, Pages 105: Silhouette-Based Evaluation of PCA, Isomap, and t-SNE on Linear and Nonlinear Data Structures

Mostafa Zahed — 2025-11-03

Stats doi: 10.3390/stats8040105

Authors: Mostafa Zahed, Maryam Skafyan

Dimensionality reduction is fundamental for analyzing high-dimensional data, supporting visualization, denoising, and structure discovery. We present a systematic, large-scale benchmark of three widely used methods, Principal Component Analysis (PCA), Isometric Mapping (Isomap), and t-Distributed Stochastic Neighbor Embedding (t-SNE), evaluated by average silhouette scores to quantify cluster preservation after embedding. Our full factorial simulation varies sample size $n \in \{100, 200, 300, 400, 500\}$, noise variance $\sigma^2 \in \{0.25, 0.5, 0.75, 1, 1.5, 2\}$, and feature count $p \in \{20, 50, 100, 200, 300, 400\}$ under four generative regimes: (1) a linear Gaussian mixture, (2) a linear Student-t mixture with heavy tails, (3) a nonlinear Swiss-roll manifold, and (4) a nonlinear concentric-spheres manifold, each replicated 1000 times per condition. Beyond empirical comparisons, we provide mathematical results that explain the observed rankings: under standard separation and sampling assumptions, PCA maximizes silhouettes for linear, low-rank structure, whereas Isomap dominates on smooth curved manifolds; t-SNE prioritizes local neighborhoods, yielding strong local separation but less reliable global geometry. Empirically, PCA consistently achieves the highest silhouettes for linear structure (Isomap second, t-SNE third); on manifolds the ordering reverses (Isomap > t-SNE > PCA). Increasing $\sigma^2$ and adding uninformative dimensions (larger $p$) degrade all methods, while larger $n$ improves levels and stability. To our knowledge, this is the first integrated study combining a comprehensive factorial simulation across linear and nonlinear regimes with distribution-based summaries (density and violin plots) and supporting theory that predicts method orderings. The results offer clear, practice-oriented guidance: prefer PCA when structure is approximately linear; favor manifold learning, especially Isomap, when curvature is present; and use t-SNE for the exploratory visualization of local neighborhoods. Complete tables and replication materials are provided to facilitate method selection and reproducibility.
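
The average silhouette score used as the evaluation criterion can be computed directly from its definition. The sketch below is a plain-Python illustration on a hypothetical toy embedding; it is not the article's benchmark code, and the function name `average_silhouette` is an assumption.

```python
import math


def average_silhouette(points, labels):
    """Average silhouette over all points, using Euclidean distance.

    For point i: a = mean distance to the other points in its own cluster,
    b = smallest mean distance to the points of any other cluster,
    s = (b - a) / max(a, b). A point in a singleton cluster scores 0.
    Assumes at least two clusters and that not all points coincide.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton cluster contributes a silhouette of 0
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical 2-D embedding with two well-separated clusters.
embedded = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
cluster_labels = [0, 0, 1, 1]
score = average_silhouette(embedded, cluster_labels)
print(f"average silhouette: {score:.4f}")
```

Scores near 1 indicate that the embedding keeps clusters compact and well separated, scores near 0 indicate overlapping clusters, and negative scores indicate points that sit closer to another cluster than to their own.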

Stats, Vol. 8, Pages 104: Computational Testing Procedure for the Overall Lifetime Performance Index of Multi-Component Exponentially Distributed Products

Shu-Fei Wu — 2025-11-02

Stats doi: 10.3390/stats8040104

Authors: Shu-Fei Wu, Chia-Chi Hsu

In addition to products with a single component, this study examines products composed of multiple components whose lifetimes follow a one-parameter exponential distribution. An overall lifetime performance index is developed to assess products under the progressive type I interval censoring scheme. This study establishes the relationship between the overall and individual lifetime performance indices and derives the corresponding maximum likelihood estimators along with their asymptotic distributions. Based on the asymptotic distributions, the lower confidence bounds for all indices are also established. Furthermore, a hypothesis testing procedure is formulated to evaluate whether the overall lifetime performance index achieves the specified target level, utilizing the maximum likelihood estimator as the test statistic under a progressive type I interval censored sample. Moreover, a power analysis is carried out, and two numerical examples are presented to demonstrate the practical implementation for the overall lifetime performance index. This research can be applied to the fields of life testing and reliability analysis.

Stats, Vol. 8, Pages 103: A Nonparametric Monitoring Framework Based on Order Statistics and Multiple Scans: Advances and Applications in Ocean Engineering

Ioannis S. Triantafyllou — 2025-11-01

Stats doi: 10.3390/stats8040103

Authors: Ioannis S. Triantafyllou

In this work, we introduce a statistical framework for monitoring the performance of a breakwater structure in reducing wave impact. The proposed methodology aims to achieve diligent tracking of the underlying process and the swift detection of any potential malfunctions. The implementation of the new framework requires the construction of appropriate nonparametric Shewhart-type control charts, which rely on order statistics and scan-type decision criteria. The variance of the run length distribution of the proposed scheme is investigated, while the corresponding mean value is determined. For illustration purposes, we consider a real-life application, which aims at evaluating the effectiveness of a breakwater structure based on wave height reduction and wave energy dissipation.
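
A distribution-free Shewhart-type chart of this kind can be sketched with control limits taken from order statistics of an in-control reference sample, combined with a scan-type rule. The abstract does not state the exact decision criterion, so the "at least k of the last r points outside the limits" rule, the function names and the data below are all hypothetical.

```python
def order_statistic_limits(reference, a, b):
    """Use the a-th and b-th order statistics (1-indexed) of the reference sample as limits.

    For i.i.d. continuous data, a new in-control observation falls strictly
    between these limits with probability (b - a) / (m + 1), where m is the
    reference sample size, whatever the underlying distribution.
    """
    ordered = sorted(reference)
    return ordered[a - 1], ordered[b - 1]


def scan_signals(observations, lcl, ucl, k=2, r=3):
    """Signal at time t if point t is outside the limits and at least k of the
    last r points (including t) are outside. Hypothetical scan-type rule."""
    outside = [not (lcl <= x <= ucl) for x in observations]
    signals = []
    for t in range(len(outside)):
        window = outside[max(0, t - r + 1): t + 1]
        if outside[t] and sum(window) >= k:
            signals.append(t)
    return signals


# Hypothetical reference sample of m = 19 in-control measurements.
reference = [float(i) for i in range(1, 20)]
lcl, ucl = order_statistic_limits(reference, a=1, b=19)

# Hypothetical monitoring data (e.g., wave height reduction readings).
monitored = [5.0, 20.0, 7.0, 0.0, 25.0, 10.0]
signals = scan_signals(monitored, lcl, ucl, k=2, r=3)
print("limits:", (lcl, ucl), "signal times:", signals)
```

Requiring several out-of-limit points within a short window trades a slightly slower reaction to a single extreme value for fewer false alarms, which is the usual motivation for run- and scan-type rules.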

Stats, Vol. 8, Pages 102: The Ridge-Hurdle Negative Binomial Regression Model: A Novel Solution for Zero-Inflated Counts in the Presence of Multicollinearity

HM Nayem — 2025-11-01

Stats doi: 10.3390/stats8040102

Authors: HM Nayem, B. M. Golam Kibria

Datasets with many zero outcomes are common in real-world studies and often exhibit overdispersion and strong correlations among predictors, creating challenges for standard count models. Traditional approaches such as the Zero-Inflated Poisson (ZIP), Zero-Inflated Negative Binomial (ZINB), and Hurdle models can handle extra zeros and overdispersion but struggle when multicollinearity is present. This study introduces the Ridge-Hurdle Negative Binomial (RHNB) model, which incorporates L2 regularization into the truncated count component of the hurdle framework to jointly address zero inflation, overdispersion, and multicollinearity. Monte Carlo simulations under varying sample sizes, predictor correlations, and levels of overdispersion and zero inflation show that the RHNB model consistently achieves a lower mean squared error (MSE) than the ZIP, ZINB, Hurdle Poisson, Hurdle Negative Binomial, Ridge ZIP, and Ridge ZINB models. Applications to the Wildlife Fish and Medical Care datasets further confirm its superior predictive performance, highlighting RHNB as a robust and efficient solution for complex count data modeling.

Stats, Vol. 8, Pages 101: Robustness of the Trinormal ROC Surface Model: Formal Assessment via Goodness-of-Fit Testing

Christos Nakas — 2025-10-17

Stats doi: 10.3390/stats8040101

Authors: Christos Nakas

Receiver operating characteristic (ROC) surfaces provide a natural extension of ROC curves to three-class diagnostic problems. A key summary index is the volume under the surface (VUS), representing the probability that a randomly chosen observation from each of the three ordered groups is correctly classified. Parametric estimation of the VUS typically assumes trinormality of the class distributions. However, a formal method for the verification of this composite assumption has not appeared in the literature. Our approach generalizes the two-class AUC-based goodness-of-fit (GOF) test of Zou et al. to the three-class setting by exploiting the parallel structure between empirical and trinormal VUS estimators. We propose a global GOF test for trinormal ROC models based on the difference between empirical and trinormal parametric estimates of the VUS. To improve stability, a probit transformation is applied and a bootstrap procedure is used to estimate the variance of the difference. The resulting test provides a formal diagnostic for assessing the adequacy of trinormal ROC modeling. Simulation studies illustrate the robustness of the assumption via the empirical size and power of the test under various distributional settings, including skewed and multimodal alternatives. An application to COVID-19 antibody level data demonstrates the method’s practical utility. Our findings suggest that the proposed GOF test is simple to implement, computationally feasible for moderate sample sizes, and a useful complement to existing ROC surface methodology.
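
The empirical VUS that enters the test statistic can be computed as the proportion of triples, one observation from each ordered group, that appear in the correct order. The sketch below covers only this empirical estimator; the trinormal estimate, the probit transformation and the bootstrap variance are not reproduced, and the function name and data are illustrative assumptions.

```python
from itertools import product


def empirical_vus(group1, group2, group3):
    """Proportion of triples (x, y, z), one per group, with x < y < z.

    Ties are counted as incorrectly ordered in this simple version.
    A marker with no discriminating ability gives a value near 1/6.
    """
    correct = sum(1 for x, y, z in product(group1, group2, group3) if x < y < z)
    return correct / (len(group1) * len(group2) * len(group3))


# Hypothetical marker values for three ordered classes.
perfect = empirical_vus([1, 2], [3, 4], [5, 6])
overlapping = empirical_vus([1, 4], [3, 5], [2, 6])
print(perfect, overlapping)
```

The cost grows with the product of the three group sizes, which is manageable for moderate samples and relevant when the estimator is recomputed inside a bootstrap loop.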

Stats, Vol. 8, Pages 100: Synthetic Hydrograph Estimation for Ungauged Basins: Exploring the Role of Statistical Distributions

Dan Ianculescu — 2025-10-17

Stats doi: 10.3390/stats8040100

Authors: Dan Ianculescu, Cristian Gabriel Anghel

The use of probability distribution functions in deriving synthetic hydrographs has become a robust method for modeling the response of watersheds to precipitation events. This approach leverages statistical distributions to capture the temporal structure of runoff processes, providing a flexible framework for estimating peak discharge, time to peak, and hydrograph shape. The present study explores the application of various probability distributions in constructing synthetic hydrographs. The research evaluates parameter estimation techniques, analyzing their influence on hydrograph accuracy. The results highlight the strengths and limitations of each distribution in capturing key hydrological characteristics, offering insights into the suitability of certain probability distribution functions under varying watershed conditions. The study concludes that the approach based on the Cadariu rational function enhances the adaptability and precision of synthetic hydrograph models, thereby supporting flood forecasting and watershed management.

Stats, Vol. 8, Pages 99: Model-Free Feature Screening Based on Data Aggregation for Ultra-High-Dimensional Longitudinal Data

Junfeng Chen — 2025-10-16

Stats doi: 10.3390/stats8040099

Authors: Junfeng Chen, Xiaoguang Yang, Jing Dai, Yunming Li

Feature screening procedures for ultra-high-dimensional longitudinal data are widely studied, but most require model assumptions. The screening performance of these methods may deteriorate if the model is misspecified. To resolve this problem, a new model-free method is introduced in which feature screening is performed by sample splitting and data aggregation. Distance correlation is used to measure the association at each time point separately, while longitudinal correlation is modeled by a specific cumulative distribution function to achieve efficiency. In addition, we extend this new method to handle situations where the predictors are correlated. Both methods possess excellent asymptotic properties and are capable of handling longitudinal data with unequal numbers of repeated measurements and unequal intervals between repeated measurement time points. Compared to other model-free methods, the two new methods are relatively insensitive to within-subject correlation, and they can help reduce the computational burden when applied to longitudinal data. Finally, we use simulated and empirical examples to show that both new methods have better screening performance.
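
The association measure used at each time point, distance correlation, can be computed from double-centered pairwise distance matrices. The sketch below handles two univariate samples only; the sample splitting, aggregation across time points and the modeling of longitudinal correlation described in the abstract are not reproduced, and the function names are illustrative.

```python
import math


def _centered_distances(values):
    """Pairwise absolute distances, double-centered by row, column and grand means."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]  # equal to column means by symmetry
    grand = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two univariate samples of equal length."""
    n = len(x)
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x2 = sum(A[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvar_y2 = sum(B[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    denom = math.sqrt(dvar_x2 * dvar_y2)
    if denom == 0:
        return 0.0  # a constant sample carries no association information
    return math.sqrt(max(dcov2, 0.0) / denom)


# Hypothetical predictor and response at a single time point.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * v + 1.0 for v in x]
dcor = distance_correlation(x, y)
print(f"distance correlation: {dcor:.6f}")
```

Because the measure requires no regression model linking predictor and response, ranking predictors by it gives a screening rule that does not depend on a correctly specified model.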

Stats, Vol. 8, Pages 98: Expansions for the Conditional Density and Distribution of a Standard Estimate

Christopher S. Withers — 2025-10-14

Stats doi: 10.3390/stats8040098

Authors: Christopher S. Withers

Conditioning is a very useful way of using correlated information to reduce the variability of an estimate. Conditioning an estimate on a correlated estimate reduces its covariance and so provides more precise inference than using an unconditioned estimate. Here we give expansions in powers of $n^{-1/2}$ for the conditional density and distribution of any multivariate standard estimate based on a sample of size $n$. Standard estimates include most estimates of interest, including smooth functions of sample means and other empirical estimates. We also show that a conditional estimate is not a standard estimate, so that Edgeworth-Cornish-Fisher expansions cannot be applied directly.

Stats, Vol. 8, Pages 97: Goodness-of-Fit Tests via Entropy-Based Density Estimation Techniques

Luai Al-Labadi — 2025-10-14

Stats doi: 10.3390/stats8040097

Authors: Luai Al-Labadi, Ruodie Yu, Kairui Bao

Goodness-of-fit testing remains a fundamental problem in statistical inference with broad practical importance. In this paper, we introduce two new goodness-of-fit tests grounded in entropy-based density estimation techniques. The first is a boundary-corrected empirical likelihood ratio test, which refines the classic approach by addressing bias near the support boundaries, though, in practice, it yields results very similar to the uncorrected version. The second is a novel test built on Correa’s local linear entropy estimator, leveraging quantile regression to improve density estimation accuracy. We establish the theoretical properties of both test statistics and demonstrate their practical effectiveness through extensive simulation studies and real-data applications. The results show that the proposed methods deliver strong power and flexibility in assessing model adequacy in a wide range of settings.

Stats, Vol. 8, Pages 96: Robust Parameter Designs Constructed from Hadamard Matrices

Yingfu Li — 2025-10-11

Stats doi: 10.3390/stats8040096

Authors: Yingfu Li, Kalanka P. Jayalath

The primary objective of robust parameter design (RPD) is to determine the optimal settings of control factors in a system to minimize response variance while achieving a desirable mean response. This article investigates fractional factorial designs constructed from Hadamard matrices of orders 12, 16, and 20 to meet RPD requirements with minimal runs. For various combinations of control and noise factors, rather than recommending a single “best” design, up to the top ten good candidate designs are identified. All listed designs permit the estimation of all control-by-noise interactions and the main effects of both control and noise factors. Additionally, some nonregular RPDs allow for the estimation of one or two control-by-control interactions, which may be critical for achieving optimal mean response. These results provide practical options for efficient, resource-constrained experiments with economical run sizes.
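
The defining property of a Hadamard matrix, mutually orthogonal rows with entries +1 and -1, can be checked with a few lines of code. The sketch below uses Sylvester's doubling construction, which only yields orders that are powers of two (such as 16); orders 12 and 20 need a different construction, such as Paley's, which is not shown. How columns are assigned to control and noise factors in the article's designs is not reproduced.

```python
def sylvester_hadamard(order):
    """Build a Hadamard matrix of the given power-of-two order by repeated doubling."""
    if order < 1 or order & (order - 1):
        raise ValueError("Sylvester's construction needs a power-of-two order")
    h = [[1]]
    while len(h) < order:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def is_hadamard(h):
    """Check that every pair of distinct rows is orthogonal and each row has norm n."""
    n = len(h)
    return all(
        sum(h[i][k] * h[j][k] for k in range(n)) == (n if i == j else 0)
        for i in range(n)
        for j in range(n)
    )


h16 = sylvester_hadamard(16)
print(len(h16), is_hadamard(h16))
```

In a two-level design derived from such a matrix, each run corresponds to a row and each factor to a column other than the constant one, so orthogonal columns give uncorrelated main-effect estimates.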

Stats, Vol. 8, Pages 95: Bayesian Bell Regression Model for Fitting of Overdispersed Count Data with Application

Ameer Musa Imran Alhseeni — 2025-10-10

Stats doi: 10.3390/stats8040095

Authors: Ameer Musa Imran Alhseeni, Hossein Bevrani

The Bell regression model (BRM) is a statistical model that is often used in the analysis of count data that exhibit overdispersion. In this study, we propose a Bayesian analysis of the BRM and offer a new perspective on its application. Specifically, we introduce a G-prior distribution for Bayesian inference in the BRM, in addition to a flat-normal prior distribution. To compare the performance of the proposed prior distributions, we conduct a simulation study and demonstrate that the G-prior distribution provides superior estimation results for the BRM. Furthermore, we apply the methodology to real data and compare the BRM to the Poisson and negative binomial regression models using various model selection criteria. Our results provide valuable insights into the use of Bayesian methods for estimation and inference of the BRM and highlight the importance of considering the choice of prior distribution in the analysis of count data.

Stats, Vol. 8, Pages 94: Rank-Based Control Charts Under Non-Overlapping Counting with Practical Applications in Logistics and Services

Ioannis S. Triantafyllou — 2025-10-09

Stats doi: 10.3390/stats8040094

Authors: Ioannis S. Triantafyllou

In this article, we establish a constructive nonparametric scheme for monitoring the quality of services provided by a transportation company. The proposed methodology aims at achieving the diligent tracking of the underlying process and the swift detection of any potential malfunctions. The implementation of the new framework requires the construction of appropriate schemes, which follow the set-up of a Shewhart chart and are based on ranks and multiple run decision criteria. The dispersion and the mean value of the run length distribution for the suggested distribution-free scheme are investigated for the special case k=2. For illustration purposes, a real-data logistics environment is discussed, in which the proposed approach is applied to improve the quality of the provided services.

Stats, Vol. 8, Pages 93: Improper Priors via Expectation Measures

Peter Harremoës — 2025-10-09

Stats doi: 10.3390/stats8040093

Authors: Peter Harremoës

In Bayesian statistics, prior distributions play a key role in inference, and there are procedures for finding prior distributions. An important problem is that these procedures often lead to improper prior distributions that cannot be normalized to probability measures. Such improper prior distributions lead to technical problems, in that certain calculations are only fully justified in the literature for probability measures or perhaps for finite measures. Recently, expectation measures were introduced as an alternative to probability measures as a foundation for a theory of uncertainty. Using expectation theory and point processes, it is possible to give a probabilistic interpretation of an improper prior distribution. This provides a rigorous formalism for calculating posterior distributions in cases where the prior distributions are not proper, without relying on approximation arguments.

Stats, Vol. 8, Pages 92: Predictions of War Duration

Glenn McRae — 2025-10-09

Stats doi: 10.3390/stats8040092

Authors: Glenn McRae

The durations of wars fought between 1480 and 1941 A.D. were found to be well represented by random numbers chosen from a single-event Poisson distribution with a half-life of (1.25 ± 0.1) years. This result complements the work of L.F. Richardson who found that the frequency of outbreaks of wars can be described as a Poisson process. This result suggests that a quick return on investment requires a distillation of the many stressors of the day, each one of which has a small probability of being included in a convincing well-orchestrated simple call-to-arms. The half-life is a measure of how this call wanes with time.
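
A half-life can be translated into survival probabilities and a mean duration if durations are read as exponentially distributed waiting times. That reading is an assumption made here for illustration, since the abstract describes the model only as a "single-event Poisson distribution"; the function names are hypothetical and the uncertainty of ± 0.1 years is ignored.

```python
import math

HALF_LIFE_YEARS = 1.25  # point estimate quoted in the abstract


def survival_probability(t_years, half_life=HALF_LIFE_YEARS):
    """Probability that a war lasts longer than t years under exponential decay."""
    return 0.5 ** (t_years / half_life)


def mean_duration(half_life=HALF_LIFE_YEARS):
    """Mean of an exponential distribution with the given half-life."""
    return half_life / math.log(2)


for t in (1.25, 2.5, 5.0):
    print(f"P(duration > {t} years) = {survival_probability(t):.4f}")
print(f"mean duration = {mean_duration():.3f} years")
```

Under this reading, half of all wars end within 1.25 years, and the share still ongoing halves again with every further 1.25 years.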

Stats, Vol. 8, Pages 91: Benford Behavior in Stick Fragmentation Problems

Bruce Fang — 2025-10-08

Stats doi: 10.3390/stats8040091

Authors: Bruce Fang, Ava Irons, Ella Lippelman, Steven J. Miller

Benford’s law states that in many real-world datasets, the probability that the leading digit is $d$ equals $\log_{10}((d+1)/d)$ for all $1 \le d \le 9$. We call this weak Benford behavior. A dataset is said to follow strong Benford behavior if the probability that its significand (i.e., the significant digits in scientific notation) is at most $s$ equals $\log_{10}(s)$ for all $s \in [1, 10)$. We investigate Benford behavior in a multi-proportion stick fragmentation model, where a stick is split into $m$ substicks according to fixed proportions at each stage. This generalizes previous work on the single proportion stick fragmentation model, where each stick is split into two substicks using one fixed proportion. We provide a necessary and sufficient condition under which the lengths of the stick fragments converge to strong Benford behavior in the multi-proportion model.
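
The leading-digit probabilities and the single proportion fragmentation model can be explored numerically. The sketch below assumes a unit stick in which every fragment is split at every stage, so that after N stages the fragment lengths are p^k (1-p)^(N-k) with multiplicity C(N, k); the proportion p = 0.3 and the number of stages are arbitrary illustrative choices, and the article's convergence condition is not implemented.

```python
import math


def benford_probability(d):
    """Benford probability that the leading digit equals d (1 <= d <= 9)."""
    return math.log10((d + 1) / d)


def leading_digit_frequencies(p, stages):
    """Leading-digit frequencies of fragment lengths after the given number of stages.

    Fragments with k 'p-cuts' have length p**k * (1 - p)**(stages - k) and occur
    C(stages, k) times; the leading digit is read off the fractional part of log10.
    """
    freqs = {d: 0.0 for d in range(1, 10)}
    total = 2 ** stages
    for k in range(stages + 1):
        log_len = k * math.log10(p) + (stages - k) * math.log10(1 - p)
        frac = log_len - math.floor(log_len)
        digit = min(max(int(10 ** frac), 1), 9)  # clamp against rounding at the edges
        freqs[digit] += math.comb(stages, k) / total
    return freqs


freqs = leading_digit_frequencies(p=0.3, stages=200)
for d in range(1, 10):
    print(d, f"observed {freqs[d]:.4f}", f"Benford {benford_probability(d):.4f}")
```

Comparing the two columns for different values of p gives a quick impression of which proportions produce Benford-like behavior and which do not.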

Stats, Vol. 8, Pages 90: The Use of Double Poisson Regression for Count Data in Health and Life Science—A Narrative Review

Sebastian Appelbaum — 2025-10-01

Stats doi: 10.3390/stats8040090

Authors: Sebastian Appelbaum, Julia Stronski, Uwe Konerding, Thomas Ostermann

Count data are present in many areas of everyday life. Unfortunately, such data are often characterized by over- and under-dispersion. In 1986, Efron introduced the Double Poisson distribution to account for this problem. The aim of this work is to examine the application of this distribution in regression analyses performed in health-related literature by means of a narrative review. The databases Science Direct, PBSC, PubMed, PsycInfo, PsycArticles, CINAHL and Google Scholar were searched for applications. Two independent reviewers extracted data on Double Poisson Regression Models and their applications in the health and life sciences. From a total of 1644 hits, 84 articles were pre-selected and, after full-text screening, 13 articles remained. All these articles were published after 2011 and most of them targeted epidemiological research. Both over- and under-dispersion were present and most of the papers used the generalized additive models for location, scale, and shape (GAMLSS) framework. In summary, this narrative review shows that the first steps in applying Efron’s idea of double exponential families for empirical count data have already been successfully taken in a variety of fields in the health and life sciences. Approaches to ease their application in clinical research should be encouraged.

Stats, Vol. 8, Pages 89: Theoretically Based Dynamic Regression (TDR)—A New and Novel Regression Framework for Modeling Dynamic Behavior

Derrick K. Rollins — 2025-09-28

Stats doi: 10.3390/stats8040089

Authors: Derrick K. Rollins, Marit Nilsen-Hamilton, Kendra Kreienbrink, Spencer Wolfe, Dillon Hurd, Jacob Oyler

The theoretical modeling of a dynamic system will have derivatives of the response ($y$) with respect to time ($t$). Two common physical attributes (i.e., parameters) of dynamic systems are dead-time (θ) and lag (τ). Theoretical dynamic modeling will contain physically interpretable parameters such as τ and θ with physical constraints. In addition, the number of unknown model-based parameters can be considerably smaller than in empirically based (i.e., lagged-based) approaches. This work proposes a Theoretically based Dynamic Regression (TDR) modeling approach that overcomes critical lagged-based modeling limitations, as demonstrated on three large, multiple-input, highly dynamic, real data sets. Dynamic Regression (DR) is a lagged-based, empirical dynamic modeling approach that appears in the statistics literature. However, like all empirical approaches, its model structures do not contain first-principle interpretable parameters. Additionally, several time lags are typically needed for the output, $y$, and input, $x$, to capture significant dynamic behavior. TDR uses a simplistic theoretically based dynamic modeling approach to transform $x_t$ into its dynamic counterpart, $v_t$, and then applies the methods and tools of static regression to $v_t$. TDR is demonstrated on the following three modeling problems of freely existing (i.e., not experimentally designed) real data sets: 1. the weight variation in a person ($y$) with four measured nutrient inputs ($x_i$); 2. the variation in the tray temperature ($y$) of a distillation column with nine inputs and eight test data sets over a three-year period; and 3. eleven extremely large, highly dynamic, subject-specific models of sensor glucose ($y$) with 12 inputs ($x_i$).
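
The idea of transforming an input series into a dynamic counterpart before applying static regression can be sketched with a first-order lag plus dead-time. This is an assumed, textbook-style form chosen for illustration, not necessarily the transformation used in TDR; the function name `first_order_dead_time` and the parameter values are hypothetical.

```python
import math


def first_order_dead_time(x, tau, dead_steps, dt=1.0):
    """Transform input series x into a lagged, delayed series v.

    Discretizes tau * dv/dt + v = x(t - theta) with a zero-order hold:
        v[t] = phi * v[t-1] + (1 - phi) * x[t - 1 - dead_steps],
    where phi = exp(-dt / tau) and theta = dead_steps * dt.
    Inputs before the start of the record are treated as zero.
    """
    phi = math.exp(-dt / tau)
    v = [0.0] * len(x)
    for t in range(1, len(x)):
        idx = t - 1 - dead_steps
        delayed = x[idx] if idx >= 0 else 0.0
        v[t] = phi * v[t - 1] + (1.0 - phi) * delayed
    return v


# Unit step input: v stays at zero during the dead-time, then rises toward 1.
x = [1.0] * 50
v = first_order_dead_time(x, tau=2.0, dead_steps=3)
print([round(val, 3) for val in v[:8]])
```

Once v has been computed for each input, the response can be regressed on the v series with ordinary static regression tools, with τ and the dead-time treated as physically interpretable parameters to be estimated.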