Search Results (465)

Search Parameters:
Journal = Stats

33 pages, 905 KiB  
Article
Unraveling Similarities and Differences Between Non-Negative Garrote and Adaptive Lasso: A Simulation Study in Low- and High-Dimensional Data
by Edwin Kipruto and Willi Sauerbrei
Stats 2025, 8(3), 70; https://doi.org/10.3390/stats8030070 - 6 Aug 2025
Abstract
Penalized regression methods are widely used for variable selection. Non-negative garrote (NNG) was one of the earliest methods to combine variable selection with shrinkage of regression coefficients, followed by lasso. About a decade after the introduction of lasso, adaptive lasso (ALASSO) was proposed to address lasso’s limitations. ALASSO has two tuning parameters (λ and γ), and its penalty resembles that of NNG when γ = 1, though NNG imposes additional constraints. Given ALASSO’s greater flexibility, which may increase instability, this study investigates whether NNG provides any practical benefit or can be replaced by ALASSO. We conducted simulations in both low- and high-dimensional settings to compare selected variables, coefficient estimates, and prediction accuracy. Ordinary least squares and ridge estimates were used as initial estimates. NNG and ALASSO (γ = 1) showed similar performance in low-dimensional settings with low correlation, large samples, and moderate to high R². However, under high correlation, small samples, and low R², their selected variables and estimates differed, though prediction accuracy remained comparable. When γ ≠ 1, the differences between NNG and ALASSO became more pronounced, with ALASSO generally performing better. Assuming linear relationships between predictors and the outcome, the results suggest that NNG may offer no practical advantage over ALASSO. The γ parameter in ALASSO allows for adaptability to model complexity, making ALASSO a more flexible and practical alternative to NNG.
(This article belongs to the Section Statistical Methods)
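The ALASSO construction discussed above can be illustrated with a short sketch: the adaptive lasso is equivalent to a plain lasso on a design whose columns are rescaled by weights 1/|β̂_init|^γ. This is the generic textbook formulation with hypothetical simulated data, not the authors' simulation code:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Plain lasso via cyclic coordinate descent on (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]  # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, n * lam) / col_sq[j]
    return beta

def adaptive_lasso(X, y, lam, gamma=1.0):
    """ALASSO with OLS initial estimates (low-dimensional n > p case):
    rescale each column by its weight, solve a plain lasso, and map back."""
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)   # penalty weights
    X_tilde = X / w                                  # column-wise rescaling
    beta_tilde = lasso_cd(X_tilde, y, lam)
    return beta_tilde / w
```

With γ = 1 and OLS initial estimates this matches the NNG-like penalty the abstract describes, while other γ values change how aggressively weakly supported coefficients are penalized.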

19 pages, 7512 KiB  
Review
Archimedean Copulas: A Useful Approach in Biomedical Data—A Review with an Application in Pediatrics
by Giulia Risca, Stefania Galimberti, Paola Rebora, Alessandro Cattoni, Maria Grazia Valsecchi and Giulia Capitoli
Stats 2025, 8(3), 69; https://doi.org/10.3390/stats8030069 - 1 Aug 2025
Abstract
Many applications in health research involve the analysis of multivariate distributions of random variables. In this paper, we review the basic theory of copulas to illustrate their advantages in deriving a joint distribution from given marginal distributions, with a specific focus on bivariate cases. Particular attention is given to the Archimedean family of copulas, which includes widely used functions such as Clayton and Gumbel–Hougaard, characterized by a single association parameter and a relatively simple structure. This work differs from previous reviews by providing a focused overview of applied studies in biomedical research that have employed Archimedean copulas, due to their flexibility in modeling a wide range of dependence structures. Their ease of use and ability to accommodate rotated forms make them suitable for various biomedical applications, including those involving survival data. We briefly present the most commonly used methods for estimation and model selection of copula functions, with the purpose of introducing these tools within the broader framework. Several recent examples in the health literature, and an original example of a pediatric study, demonstrate the applicability of Archimedean copulas and suggest that this approach, although still not widely adopted, can be useful in many biomedical research settings.
(This article belongs to the Section Statistical Methods)
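The single-parameter structure of Archimedean copulas mentioned above is easy to demonstrate with the Clayton family: pairs can be drawn by conditional inversion, and the association parameter θ maps directly to Kendall's τ = θ/(θ + 2). This is a standard textbook construction, not code from the review:

```python
import random

def clayton_sample(theta, n, seed=0):
    """Draw n pairs from a Clayton copula (theta > 0) by conditional inversion:
    given U = u, V is obtained by inverting the conditional CDF C(v | u)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        u = 1.0 - rng.random()   # uniforms on (0, 1]
        w = 1.0 - rng.random()
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        out.append((u, v))
    return out

def kendall_tau(pairs):
    """Naive O(n^2) Kendall's tau; for Clayton, the population value is theta/(theta+2)."""
    n = len(pairs)
    conc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (pairs[i][0] - pairs[j][0]) * (pairs[i][1] - pairs[j][1])
            conc += 1 if s > 0 else -1
    return 2.0 * conc / (n * (n - 1))
```

For θ = 2, the empirical τ of a large sample should be close to 0.5, illustrating how one parameter controls the whole dependence structure.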

22 pages, 579 KiB  
Article
Automated Classification of Crime Narratives Using Machine Learning and Language Models in Official Statistics
by Klaus Lehmann, Elio Villaseñor, Alejandro Pimentel, Javiera Preuss, Nicolás Berhó, Oswaldo Diaz and Ignacio Agloni
Stats 2025, 8(3), 68; https://doi.org/10.3390/stats8030068 - 30 Jul 2025
Abstract
This paper presents the implementation of a language model–based strategy for the automatic codification of crime narratives for the production of official statistics. To address the high workload and inconsistencies associated with manual coding, we developed and evaluated three models: an XGBoost classifier using bag-of-words and word-embedding features, an LSTM network using pretrained Spanish word embeddings as a language model, and a fine-tuned BERT language model (BETO). Deep learning models outperformed the traditional baseline, with BETO achieving the highest accuracy. The new ENUSC (Encuesta Nacional Urbana de Seguridad Ciudadana) workflow integrates the selected model into an API for automated classification, incorporating a certainty threshold to distinguish between cases suitable for automation and those requiring expert review. This hybrid strategy led to a 68.4% reduction in manual review workload while preserving high-quality standards. This study represents the first documented application of deep learning for the automated classification of victimization narratives in official statistics, demonstrating its feasibility and impact in a real-world production environment. Our results demonstrate that deep learning can significantly improve the efficiency and consistency of crime statistics coding, offering a scalable solution for other national statistical offices.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
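The certainty-threshold routing described in the abstract can be sketched as follows; the data shapes, class names, and the 0.9 cutoff are illustrative assumptions, not values taken from the paper:

```python
def route_predictions(probs, threshold=0.9):
    """Split model outputs into auto-coded cases and cases sent to expert review.
    `probs` maps a case id to a dict of class -> softmax probability (hypothetical
    shape; the actual ENUSC API and threshold value are not specified here)."""
    auto, review = {}, []
    for case_id, dist in probs.items():
        label, p = max(dist.items(), key=lambda kv: kv[1])
        if p >= threshold:
            auto[case_id] = label      # confident: code automatically
        else:
            review.append(case_id)     # uncertain: route to a human coder
    return auto, review
```

Raising the threshold trades automation rate for precision, which is how a hybrid workflow like the one described can keep quality standards while cutting manual workload.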

28 pages, 888 KiB  
Article
Requiem for Olympic Ethics and Sports’ Independence
by Fabio Zagonari
Stats 2025, 8(3), 67; https://doi.org/10.3390/stats8030067 - 28 Jul 2025
Abstract
This paper suggests a theoretical framework to summarise the empirical literature on the relationships between sports and both religious and secular ethics, and it suggests two interrelated theoretical models to empirically evaluate the extent to which religious and secular ethics, as well as sports policies, affect achievements in sports. I identified two national ethics (national pride/efficiency) and two social ethics (social cohesion/ethics) by measuring achievements in terms of alternative indexes based on Olympic medals. I referred to three empirical models and applied three estimation methods (panel Poisson, Data Envelopment, and Stochastic Frontier Analyses). I introduced two sports policies (a quantitative policy aimed at social cohesion and a qualitative policy aimed at national pride), distinguishing sports in terms of four possibly different ethics, for the eight summer and eight winter Olympic Games from 1994 to 2024. I applied income level, health status, and income inequality to depict alternative social contexts. I used five main religions and three educational levels to depict alternative ethical contexts. I applied country dummies to depict alternative institutional contexts. Empirical results support the absence of Olympic ethics, the potential substitution of sport and secular ethics in providing social cohesion, and the dependence of sports on politics, while alternative social contexts have different impacts on alternative sport achievements.
(This article belongs to the Special Issue Ethicametrics)

22 pages, 366 KiB  
Article
Proximal Causal Inference for Censored Data with an Application to Right Heart Catheterization Data
by Yue Hu, Yuanshan Gao and Minhao Qi
Stats 2025, 8(3), 66; https://doi.org/10.3390/stats8030066 - 22 Jul 2025
Abstract
In observational causal inference studies, unmeasured confounding remains a critical threat to the validity of effect estimates. While proximal causal inference (PCI) has emerged as a powerful framework for mitigating such bias through proxy variables, existing PCI methods cannot directly handle censored data. This article develops a unified proximal causal inference framework that simultaneously addresses unmeasured confounding and right-censoring challenges, extending the proximal causal inference literature. Our key contributions are twofold: (i) We propose novel identification strategies and develop two distinct estimators for the censored-outcome bridge function and treatment confounding bridge function, resolving the fundamental challenge of unobserved outcomes; (ii) To improve robustness against model misspecification, we construct a robust proximal estimator and establish uniform consistency for all proposed estimators under mild regularity conditions. Through comprehensive simulations, we demonstrate the finite-sample performance of our methods, followed by an empirical application evaluating right heart catheterization effectiveness in critically ill ICU patients.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)

10 pages, 1848 KiB  
Article
Local Stochastic Correlation Models for Derivative Pricing
by Marcos Escobar-Anel
Stats 2025, 8(3), 65; https://doi.org/10.3390/stats8030065 - 18 Jul 2025
Abstract
This paper presents a simple methodology to create local-correlation models suitable for the closed-form pricing of two-asset financial derivatives. The multivariate models are built to ensure two conditions. First, marginals follow desirable processes, e.g., we choose the Geometric Brownian Motion (GBM), popular for stock prices. Second, the payoff of the derivative should follow a desired one-dimensional process. These conditions lead to a specific choice of the dependence structure in the form of a local-correlation model. Two popular multi-asset options are considered: a spread option and a basket option.
(This article belongs to the Section Applied Stochastic Models)
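For orientation, the constant-correlation GBM setting that local-correlation models generalize can be priced by Monte Carlo; the sketch below values a spread option under two correlated GBMs. All parameter values are illustrative, and this baseline is not the paper's closed-form local-correlation method:

```python
import math
import random

def spread_option_mc(s1, s2, k, r, sig1, sig2, rho, t, n_paths=100_000, seed=1):
    """Monte Carlo price of a European spread option max(S1(T) - S2(T) - K, 0)
    under two GBMs with *constant* correlation rho -- the baseline that
    local-correlation models generalize."""
    rng = random.Random(seed)
    disc = math.exp(-r * t)
    total = 0.0
    for _ in range(n_paths):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        x1 = s1 * math.exp((r - 0.5 * sig1 ** 2) * t + sig1 * math.sqrt(t) * z1)
        x2 = s2 * math.exp((r - 0.5 * sig2 ** 2) * t + sig2 * math.sqrt(t) * z2)
        total += max(x1 - x2 - k, 0.0)
    return disc * total / n_paths
```

With K = 0 the spread option reduces to an exchange option, so the simulated price can be checked against Margrabe's closed-form value, mirroring the paper's theme of matching the payoff to a known one-dimensional process.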

17 pages, 1296 KiB  
Article
Machine Learning Ensemble Algorithms for Classification of Thyroid Nodules Through Proteomics: Extending the Method of Shapley Values from Binary to Multi-Class Tasks
by Giulia Capitoli, Simone Magnaghi, Andrea D'Amicis, Camilla Vittoria Di Martino, Isabella Piga, Vincenzo L'Imperio, Marco Salvatore Nobile, Stefania Galimberti and Davide Paolo Bernasconi
Stats 2025, 8(3), 64; https://doi.org/10.3390/stats8030064 - 16 Jul 2025
Abstract
The need to improve medical diagnosis is of utmost importance in medical research, consisting of the optimization of accurate classification models able to assist clinical decisions. To minimize the errors that can be caused by using a single classifier, the voting ensemble technique can be used, combining the classification results of different classifiers to improve the final classification performance. This paper aims to compare the existing voting ensemble techniques with a new game-theory-derived approach based on Shapley values. We extended this method, originally developed for binary tasks, to the multi-class setting in order to capture complementary information provided by different classifiers. In heterogeneous clinical scenarios such as thyroid nodule diagnosis, where distinct models may be better suited to identify specific subtypes (e.g., benign, malignant, or inflammatory lesions), ensemble strategies capable of leveraging these strengths are particularly valuable. The motivating application focuses on the classification of thyroid cancer nodules, whose cytopathological clinical diagnosis is typically characterized by a high number of false positive cases that may result in unnecessary thyroidectomy. We apply and compare the performance of seven individual classifiers, along with four ensemble voting techniques (including Shapley values), in a real-world study focused on classifying thyroid cancer nodules using proteomic features obtained through mass spectrometry. Our results indicate a slight improvement in the classification accuracy for ensemble systems compared to the performance of single classifiers. Although the Shapley value-based voting method remains comparable to the other voting methods, we envision that this new ensemble approach could be effective in improving the performance of single classifiers in further applications, especially when complementary algorithms are considered in the ensemble. The application of these techniques can lead to the development of new tools to assist clinicians in diagnosing thyroid cancer using proteomic features derived from mass spectrometry.
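The voting idea underlying the comparison above can be sketched as a generic (optionally weighted) majority vote. This is the plain baseline only; the paper's contribution is to derive the weights from Shapley values, i.e., from each classifier's marginal contribution to the ensemble, which is not reproduced here:

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine per-classifier predicted labels by (optionally weighted) majority
    vote. `predictions` maps classifier name -> predicted class label; missing
    weights default to 1.0. Class names here are purely illustrative."""
    weights = weights or {}
    score = defaultdict(float)
    for clf, label in predictions.items():
        score[label] += weights.get(clf, 1.0)
    return max(score.items(), key=lambda kv: kv[1])[0]
```

A Shapley-value scheme replaces the fixed `weights` dict with contributions computed from held-out ensemble performance, so classifiers that add complementary information count for more.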

30 pages, 2389 KiB  
Communication
Beyond Expectations: Anomalies in Financial Statements and Their Application in Modelling
by Roman Blazek and Lucia Duricova
Stats 2025, 8(3), 63; https://doi.org/10.3390/stats8030063 - 15 Jul 2025
Abstract
The increasing complexity of financial reporting has enabled the implementation of innovative accounting practices that often obscure a company’s actual performance. This project seeks to uncover manipulative behaviours by constructing an anomaly detection model that utilises unsupervised machine learning techniques. We examined a dataset of 149,566 Slovak firms from 2016 to 2023, which included 12 financial parameters. Utilising TwoStep and K-means clustering in IBM SPSS, we discerned patterns of normative financial activity and computed an abnormality index for each firm. Entities with the most significant deviation from cluster centroids were identified as suspicious. The model attained a silhouette score of 1.0, signifying outstanding clustering quality. We discovered a total of 231 anomalous firms, predominantly concentrated in sectors C (32.47%), G (13.42%), and L (7.36%). Our research indicates that anomaly-based models can markedly enhance the precision of fraud detection, especially in scenarios with scarce labelled data. The model integrates intricate data processing and delivers an exhaustive study of the regional and sectoral distribution of anomalies, thereby increasing its relevance in practical applications.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)
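The centroid-deviation scoring described above reduces to a simple computation once clusters are fitted: score each observation by its distance to the nearest centroid and flag the largest scores. A minimal stand-in for the paper's abnormality index (the actual SPSS pipeline and 12 financial parameters are not reproduced):

```python
import math

def anomaly_index(points, centroids):
    """Score each observation by its Euclidean distance to the nearest cluster
    centroid; the largest scores flag candidate anomalous firms. `points` and
    `centroids` are sequences of equal-length numeric tuples."""
    return [min(math.dist(p, c) for c in centroids) for p in points]
```

In practice the centroids would come from a clustering step (e.g., K-means on standardized financial ratios), and a cutoff on the index determines which firms are reviewed.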

23 pages, 658 KiB  
Article
The Extended Kumaraswamy Model: Properties, Risk Indicators, Risk Analysis, Regression Model, and Applications
by Morad Alizadeh, Gauss M. Cordeiro, Gabriela M. Rodrigues, Edwin M. M. Ortega and Haitham M. Yousof
Stats 2025, 8(3), 62; https://doi.org/10.3390/stats8030062 - 14 Jul 2025
Abstract
We propose a new unit distribution, study its properties, and provide an important application in the field of geology through a set of risk indicators. We test its practicality through two applications to real data, make comparisons with the well-known beta and Kumaraswamy distributions, and estimate the parameters of the new distribution in different ways. We provide a new regression model and apply it in statistical prediction operations for residence times data.
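As background for the comparison mentioned above, the base two-parameter Kumaraswamy law has a closed-form CDF and quantile function, which makes inverse-transform sampling trivial. The sketch below covers only this base model, not the extended distribution the paper proposes:

```python
import random

def kumaraswamy_sample(a, b, n, seed=0):
    """Inverse-transform sampling from the base Kumaraswamy distribution on (0, 1),
    with CDF F(x) = 1 - (1 - x^a)^b, so Q(u) = (1 - (1 - u)^(1/b))^(1/a).
    The paper's extended model adds further shape parameters on top of this."""
    rng = random.Random(seed)
    return [(1.0 - (1.0 - rng.random()) ** (1.0 / b)) ** (1.0 / a) for _ in range(n)]
```

The same closed-form quantile function is what makes Kumaraswamy-type models convenient for unit-interval data such as proportions, compared with the beta distribution whose quantiles need numerical inversion.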

10 pages, 339 KiB  
Article
Continuity Correction and Standard Error Calculation for Testing in Proportional Hazards Models
by Daniel Baumgartner and John E. Kolassa
Stats 2025, 8(3), 61; https://doi.org/10.3390/stats8030061 - 14 Jul 2025
Abstract
Standard asymptotic inference for proportional hazards models is conventionally performed by calculating a standard error for the estimate and comparing the estimate divided by the standard error to a standard normal distribution. In this paper, we compare various standard error estimates, including those based on the inverse observed information, the inverse expected information, and the jackknife. Furthermore, correction for continuity is compared to omitting this correction. We find that correction for continuity represents an important improvement in the quality of approximation, and furthermore note that the usual naive standard error yields a distribution closer to normality, as measured by skewness and kurtosis, than any of the other standard errors investigated.
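Of the standard error estimates compared above, the jackknife is the easiest to state generically: recompute the estimator on each leave-one-out sample and scale the spread of those replicates. The sketch below uses an arbitrary statistic rather than a proportional-hazards coefficient:

```python
import math

def jackknife_se(data, estimator):
    """Leave-one-out jackknife standard error of an arbitrary estimator:
    se^2 = (n - 1)/n * sum_i (theta_(i) - mean(theta_(.)))^2,
    where theta_(i) is the estimate with observation i removed."""
    n = len(data)
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    var = (n - 1) / n * sum((t - mean_loo) ** 2 for t in loo)
    return math.sqrt(var)
```

For the sample mean this reproduces the familiar s/√n exactly, which is a useful sanity check before applying it to less tractable estimators such as partial-likelihood coefficients.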

15 pages, 472 KiB  
Article
Some Useful Techniques for High-Dimensional Statistics
by David J. Olive
Stats 2025, 8(3), 60; https://doi.org/10.3390/stats8030060 - 13 Jul 2025
Abstract
High-dimensional statistics are used when n < 5p, where n is the sample size and p is the number of predictors. Useful techniques include (a) the use of a sparse fitted model, (b) use of principal component analysis for dimension reduction, (c) use of alternative multivariate dispersion estimators instead of the sample covariance matrix, (d) elimination of weak predictors, and (e) stacking of low-dimensional estimators into a vector. Some variants and theory for these techniques will be given or reviewed.

17 pages, 384 KiB  
Article
The Detection Method of the Tobit Model in a Dataset
by El ouali Rahmani and Mohammed Benmoumen
Stats 2025, 8(3), 59; https://doi.org/10.3390/stats8030059 - 12 Jul 2025
Abstract
This article proposes an extension of detection methods for the Tobit model by generalizing existing approaches from cases with known parameters to more realistic scenarios where the parameters are unknown. The main objective is to develop detection procedures that account for parameter uncertainty and to analyze how this uncertainty affects the estimation process and the overall accuracy of the model. The methodology relies on maximum likelihood estimation, applied to datasets generated under different configurations of the Tobit model. A series of Monte Carlo simulations is conducted to evaluate the performance of the proposed methods. The results provide insights into the robustness of the detection procedures under varying assumptions. The study concludes with practical recommendations for improving the application of the Tobit model in fields such as econometrics, health economics, and environmental studies.
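The maximum likelihood machinery the abstract relies on can be sketched for the standard type I Tobit model censored at zero: uncensored observations contribute a normal density term and censored ones a normal CDF term. This is the textbook likelihood with a simple one-covariate parameterization, not the paper's detection procedure:

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_loglik(beta0, beta1, sigma, x, y):
    """Log-likelihood of the standard Tobit model y = max(0, b0 + b1*x + eps),
    eps ~ N(0, sigma^2): uncensored points contribute the normal log-density,
    points censored at zero contribute log Phi(-mu/sigma)."""
    ll = 0.0
    for xi, yi in zip(x, y):
        mu = beta0 + beta1 * xi
        if yi > 0:
            z = (yi - mu) / sigma
            ll += -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(sigma)
        else:
            ll += math.log(norm_cdf(-mu / sigma))
    return ll
```

Maximizing this function over (β₀, β₁, σ) with a numerical optimizer yields the MLE; detection procedures then assess whether the censoring structure assumed by the Tobit model is actually present in the data.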

16 pages, 1288 KiB  
Article
Quantile Estimation Based on the Log-Skew-t Linear Regression Model: Statistical Aspects, Simulations, and Applications
by Raúl Alejandro Morán-Vásquez, Anlly Daniela Giraldo-Melo and Mauricio A. Mazo-Lopera
Stats 2025, 8(3), 58; https://doi.org/10.3390/stats8030058 - 11 Jul 2025
Abstract
We propose a robust linear regression model assuming a log-skew-t distribution for the response variable, with the aim of exploring the association between the covariates and the quantiles of a continuous and positive response variable under skewness and heavy tails. This model includes the log-skew-normal and log-t linear regression models as special cases. Our simulation studies indicate good performance of the quantile estimation approach and its outperformance relative to the classical quantile regression model. The practical applicability of our methodology is demonstrated through an analysis of two real datasets.
(This article belongs to the Special Issue Robust Statistics in Action II)

14 pages, 16727 KiB  
Article
Well Begun Is Half Done: The Impact of Pre-Processing in MALDI Mass Spectrometry Imaging Analysis Applied to a Case Study of Thyroid Nodules
by Giulia Capitoli, Kirsten C. J. van Abeelen, Isabella Piga, Vincenzo L’Imperio, Marco S. Nobile, Daniela Besozzi and Stefania Galimberti
Stats 2025, 8(3), 57; https://doi.org/10.3390/stats8030057 - 10 Jul 2025
Abstract
The discovery of proteomic biomarkers in cancer research can be effectively performed in situ by exploiting Matrix-Assisted Laser Desorption Ionization (MALDI) Mass Spectrometry Imaging (MSI). However, due to experimental limitations, the spectra extracted by MALDI-MSI can be noisy, so pre-processing steps are generally needed to reduce the instrumental and analytical variability. Thus far, the importance and the effect of standard pre-processing methods, as well as their combinations and parameter settings, have not been extensively investigated in proteomics applications. In this work, we present a systematic study of 15 combinations of pre-processing steps—including baseline correction, smoothing, normalization, and peak alignment—for a real-data classification task on MALDI-MSI data measured from fine-needle aspirate biopsies of thyroid nodules. The influence of each combination was assessed by analyzing the feature extraction, pixel-by-pixel classification probabilities, and LASSO classification performance. Our results highlight the necessity of fine-tuning a pre-processing pipeline, especially for the reliable transfer of molecular diagnostic signatures in clinical practice. We outline some recommendations on the selection of pre-processing steps, together with filter levels and alignment methods, according to the mass-to-charge range and heterogeneity of data.
(This article belongs to the Section Applied Statistics and Machine Learning Methods)

18 pages, 359 KiB  
Article
On the Decision-Theoretic Foundations and the Asymptotic Bayes Risk of the Region of Practical Equivalence for Testing Interval Hypotheses
by Riko Kelter
Stats 2025, 8(3), 56; https://doi.org/10.3390/stats8030056 - 8 Jul 2025
Abstract
Testing interval hypotheses is of great relevance in the biomedical and cognitive sciences; for example, in clinical trials. Frequentist approaches include the proposal of equivalence tests, which have been used to study whether there is a predetermined meaningful treatment effect. In the Bayesian paradigm, two popular approaches exist: The first is the region of practical equivalence (ROPE), which has become increasingly popular in the cognitive sciences. The second is the Bayes factor for interval null hypotheses, which was proposed by Morey et al. One advantage of the ROPE procedure is that, in contrast to the Bayes factor, it is quite robust to the prior specification. However, while the ROPE is conceptually appealing, it lacks a clear decision-theoretic foundation like the Bayes factor. In this paper, a decision-theoretic justification for the ROPE procedure is derived for the first time, which shows that the Bayes risk of a decision rule based on the highest-posterior density interval (HPD) and the ROPE is asymptotically minimized for increasing sample size. To show this, a specific loss function is introduced. This result provides an important decision-theoretic justification for testing the interval hypothesis in the Bayesian approach based on the ROPE and HPD, in particular, when sample size is large.
(This article belongs to the Section Bayesian Methods)
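The HPD-plus-ROPE decision rule analyzed above is straightforward to implement from posterior draws: compute the shortest interval holding the desired mass, then compare it to the ROPE. The sketch below uses the common three-way rule; the ROPE bounds and 95% mass are illustrative defaults, not values prescribed by the paper:

```python
def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior draws (empirical HPD)."""
    s = sorted(samples)
    n = len(s)
    k = max(1, int(mass * n))
    best = min(range(n - k), key=lambda i: s[i + k] - s[i])
    return s[best], s[best + k]

def rope_decision(samples, rope=(-0.1, 0.1), mass=0.95):
    """Three-way ROPE rule: accept the interval null if the HPD lies entirely
    inside the ROPE, reject if the two intervals do not overlap, and remain
    undecided otherwise."""
    lo, hi = hpd_interval(samples, mass)
    if rope[0] <= lo and hi <= rope[1]:
        return "accept"
    if hi < rope[0] or lo > rope[1]:
        return "reject"
    return "undecided"
```

The paper's contribution is to show that, under a suitable loss function, the Bayes risk of exactly this kind of rule is asymptotically minimized as the sample size grows.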
