Journal Description
Stats
Stats is an international, peer-reviewed, open access journal on statistical science published quarterly online by MDPI. The journal focuses on methodological and theoretical papers in statistics, probability, and stochastic processes, as well as innovative applications of statistics in all scientific disciplines, including the biological and biomedical sciences, medicine, business, economics and the social sciences, physics, data science, and engineering.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within ESCI (Web of Science), Scopus, RePEc, and other databases.
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 19 days after submission; the time from acceptance to publication is 2.2 days (median values for papers published in this journal in the first half of 2024).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Impact Factor: 0.9 (2023); 5-Year Impact Factor: 1.0 (2023)
Latest Articles
Factor Analysis of Ordinal Items: Old Questions, Modern Solutions?
Stats 2024, 7(3), 984-1001; https://doi.org/10.3390/stats7030060 - 16 Sep 2024
Abstract
Factor analysis, a staple of correlational psychology, faces challenges with ordinal variables like Likert scales. The validity of traditional methods, particularly maximum likelihood (ML), is debated. Newer approaches, like using polychoric correlation matrices with weighted least squares estimators (WLS), offer solutions. This paper compares maximum likelihood estimation (MLE) with WLS for ordinal variables. While WLS on polychoric correlations generally outperforms MLE on Pearson correlations, especially with non-bell-shaped distributions, it may yield artefactual estimates with severely skewed data. MLE tends to underestimate true loadings, while WLS may overestimate them. Simulations and case studies highlight the importance of item psychometric distributions. Despite advancements, MLE remains robust, underscoring the complexity of analyzing ordinal data in factor analysis. There is no one-size-fits-all approach, emphasizing the need for distributional analyses and careful consideration of data characteristics.
(This article belongs to the Section Computational Statistics)
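As a minimal numpy sketch of the attenuation effect the abstract describes: items generated from a one-factor model are discretized into a skewed five-point Likert scale, and the Pearson correlations among the discretized items fall below the value implied by the true loadings, which is why Pearson-based MLE tends to underestimate loadings. All loadings and thresholds here are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 5000, 6, 0.7

# one-factor model: x_j = lam*f + sqrt(1 - lam^2)*e_j, so corr(x_i, x_j) = lam^2
f = rng.standard_normal(n)
X = lam * f[:, None] + np.sqrt(1 - lam**2) * rng.standard_normal((n, p))

# discretize into a skewed 5-point Likert scale via asymmetric thresholds
L = np.digitize(X, [-0.2, 0.5, 1.0, 1.5])

def mean_offdiag(r):
    return (r.sum() - np.trace(r)) / (r.size - len(r))

print("implied item correlation lam^2:", lam**2)
print("Pearson r, continuous items  :", round(mean_offdiag(np.corrcoef(X.T)), 3))
print("Pearson r, Likert items      :", round(mean_offdiag(np.corrcoef(L.T)), 3))
```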
Open Access Article
A Statistical Methodology for Evaluating Asymmetry after Normalization with Application to Genomic Data
by Víctor Leiva, Jimmy Corzo, Myrian E. Vergara, Raydonal Ospina and Cecilia Castro
Stats 2024, 7(3), 967-983; https://doi.org/10.3390/stats7030059 - 9 Sep 2024
Abstract
This study evaluates the symmetry of data distributions after normalization, focusing on various statistical tests, including a little-explored test named Rp. We apply normalization techniques, such as variance stabilizing transformations, to ribonucleic acid sequencing data with varying sample sizes to assess their effectiveness in achieving symmetric data distributions. Our findings reveal that while normalization generally induces symmetry, some samples retain asymmetric distributions, challenging the conventional assumption of post-normalization symmetry. The Rp test, in particular, shows superior performance when there are variations in sample size and data distribution, making it a preferred tool for assessing symmetry when applied to genomic data. This finding underscores the importance of validating symmetry assumptions during data normalization, especially in genomic data, as overlooked asymmetries can lead to potential inaccuracies in downstream analyses. We analyze postmortem lateral temporal lobe samples to explore normal aging and Alzheimer’s disease, highlighting the critical role of symmetry testing in the accurate interpretation of genomic data.
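The Rp test itself is not part of standard Python libraries, so the hedged sketch below uses scipy's D'Agostino skewness test as a generic stand-in for a symmetry check before and after a simple variance-stabilizing transformation; the overdispersed counts merely mimic RNA-seq expression values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# synthetic overdispersed counts standing in for RNA-seq expression values
counts = rng.negative_binomial(n=2, p=0.05, size=1000)

# a simple variance-stabilizing transformation (log2 with a pseudocount)
vst = np.log2(counts + 1)

# D'Agostino's skewness test as a generic symmetry check
for name, x in [("raw counts", counts), ("normalized", vst)]:
    stat, pval = stats.skewtest(x)
    print(f"{name:10s} skew={stats.skew(x):+.2f}  skewtest p={pval:.3g}")
```

Rejecting symmetry for some samples even after normalization is exactly the situation the abstract warns about.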
Open Access Case Report
The Integrated Violin-Box-Scatter (VBS) Plot to Visualize the Distribution of a Continuous Variable
by David W. Gerbing
Stats 2024, 7(3), 955-966; https://doi.org/10.3390/stats7030058 - 4 Sep 2024
Abstract
The histogram remains a widely used tool for visualization of the distribution of a continuous variable, despite the disruption caused by binning the underlying continuum into somewhat arbitrarily sized discrete intervals, a simplification imposed by its pre-computer origins. Alternatives include three visualizations: a smoothed density distribution such as a violin plot, a box plot, and the direct visualization of the individual data values as a one-dimensional scatter plot. To promote ease of use, the plotting function discussed in this work, Plot(x), automatically integrates these three visualizations of a continuous variable x into what is here called a VBS plot, tuning the resulting plot to the sample size and discreteness of the data. This integration complements the information derived from the histogram well and more easily generalizes to a multi-panel presentation at each level of a second categorical variable.
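The Plot(x) function referred to above belongs to the author's own software; as a rough matplotlib approximation, the three layers of a VBS plot can be stacked on one horizontal axis like this (all styling choices are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, scale=1.5, size=200)   # illustrative skewed sample

fig, ax = plt.subplots(figsize=(7, 3))
ax.violinplot(x, positions=[0], vert=False, showextrema=False)         # V: density
ax.boxplot(x, positions=[0], vert=False, widths=0.15)                  # B: box plot
ax.scatter(x, rng.uniform(-0.35, -0.25, size=x.size), s=8, alpha=0.4)  # S: jittered points
ax.set_yticks([])
ax.set_xlabel("x")
plt.tight_layout()
plt.show()
```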
Open Access Article
Weighted Empirical Likelihood for Accelerated Life Model with Various Types of Censored Data
by Jian-Jian Ren and Yiming Lyu
Stats 2024, 7(3), 944-954; https://doi.org/10.3390/stats7030057 - 3 Sep 2024
Abstract
In the analysis of survival data, the Accelerated Life Model (ALM) is one of the most widely used semiparametric models, and we often encounter various types of censored survival data, such as right censored data, doubly censored data, interval censored data, partly interval-censored data, etc. For complicated types of censored data, the study of statistical inference for the ALM is technically challenging mathematically, and thus little work has been done up to now. In this article, we extend the concept of weighted empirical likelihood (WEL) from the univariate case to the multivariate case, and we apply it to the ALM, which leads to an estimation approach, the weighted maximum likelihood estimator, as well as a WEL-based confidence interval for the regression parameter. Our proposed procedures are applicable to various types of censored data under a unified framework, and some simulation results are presented.
(This article belongs to the Section Survival Analysis)
Open Access Article
Doubly Robust Estimation and Semiparametric Efficiency in Generalized Partially Linear Models with Missing Outcomes
by Lu Wang, Zhongzhe Ouyang and Xihong Lin
Stats 2024, 7(3), 924-943; https://doi.org/10.3390/stats7030056 - 31 Aug 2024
Abstract
We investigate a semiparametric generalized partially linear regression model that accommodates missing outcomes, with some covariates modeled parametrically and others nonparametrically. We propose a class of augmented inverse probability weighted (AIPW) kernel–profile estimating equations. The nonparametric component is estimated using AIPW kernel estimating equations, while parametric regression coefficients are estimated using AIPW profile estimating equations. We demonstrate the doubly robust nature of the AIPW estimators for both nonparametric and parametric components. Specifically, these estimators remain consistent if either the assumed model for the probability of missing data or that for the conditional mean of the outcome, given covariates and auxiliary variables, is correctly specified, though not necessarily both simultaneously. Additionally, the AIPW profile estimator for parametric regression coefficients is consistent and asymptotically normal under the semiparametric model defined by the generalized partially linear model on complete data, assuming that the missing data mechanism is missing at random. When both working models are correctly specified, this estimator achieves semiparametric efficiency, with its asymptotic variance reaching the efficiency bound. We validate our approach through simulations to assess the finite sample performance of the proposed estimators and apply the method to a study that investigates risk factors associated with myocardial ischemia.
(This article belongs to the Special Issue Novel Semiparametric Methods)
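As a pared-down illustration of the doubly robust idea (for a marginal mean rather than the paper's kernel-profile estimating equations), the sketch below combines an inverse-probability-weighting term with an outcome-regression augmentation; the estimator stays consistent if either working model is correct. Data and working models are synthetic assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 4000
X = rng.standard_normal((n, 2))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.standard_normal(n)

# outcomes missing at random: observation indicator R depends on X only
pi_true = 1 / (1 + np.exp(-(0.3 + X[:, 0])))
R = rng.binomial(1, pi_true)

# two working models: missingness probability and outcome regression
pi_hat = LogisticRegression().fit(X, R).predict_proba(X)[:, 1]
m_hat = LinearRegression().fit(X[R == 1], y[R == 1]).predict(X)

# AIPW estimator of E[Y]
mu_aipw = np.mean(R * y / pi_hat - (R - pi_hat) / pi_hat * m_hat)
print("complete-case mean:", y[R == 1].mean().round(3), "| AIPW:", mu_aipw.round(3))
```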
Open Access Article
A Dynamic Reliability Analysis for the Conditional Number of Working Components within a Structure
by Ioannis S. Triantafyllou
Stats 2024, 7(3), 906-923; https://doi.org/10.3390/stats7030055 - 28 Aug 2024
Abstract
In the present work, we study the number of working units of a consecutive-type structure at a specific time point, under the condition that the system’s failure has not yet been observed. The main results of this paper offer closed formulae for determining the distribution of the number of working components under the aforementioned condition. Several alternatives are considered for the structure of the underlying system. The numerical investigation takes into account different distributional assumptions for the lifetimes of the components of the reliability system. Some concluding remarks and comments are provided on the performance of the resulting consecutive-type designs.
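A quick Monte Carlo check of the quantity studied above, for a linear consecutive-2-out-of-10:F system with exponential component lifetimes (both choices are illustrative): the empirical distribution of the number of working components at time t, conditional on the system still working.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, t, reps = 10, 2, 1.0, 100_000   # consecutive-2-out-of-10:F system at time t

def k_consecutive_failed(alive, k):
    run = 0
    for a in alive:
        run = 0 if a else run + 1
        if run >= k:
            return True
    return False

counts = []
for _ in range(reps):
    alive = rng.exponential(scale=2.0, size=n) > t   # i.i.d. exponential lifetimes
    if not k_consecutive_failed(alive, k):           # condition: no system failure yet
        counts.append(alive.sum())

vals, freq = np.unique(counts, return_counts=True)
print(dict(zip(vals.tolist(), (freq / freq.sum()).round(4))))
```

Closed formulae such as those derived in the paper can be validated against this kind of simulation.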
Open Access Article
Scoring Individual Moral Inclination for the CNI Test
by Yi Chen, Benjamin Lugu, Wenchao Ma and Hyemin Han
Stats 2024, 7(3), 894-905; https://doi.org/10.3390/stats7030054 - 23 Aug 2024
Abstract
Item response theory (IRT) is a modern psychometric framework for estimating respondents’ latent traits (e.g., ability, attitude, and personality) based on their responses to a set of questions in psychological tests. The current study adopted an item response tree (IRTree) method, which combines the tree model with IRT models to handle the sequential process of responding to a test item, to score individual moral inclination for the CNI test, a broadly adopted model for examining human moral decision-making that generates three parameters: sensitivity to moral norms, sensitivity to consequences, and inaction preference. Compared to previous models for the CNI test, the resulting EIRTree-CNI Model is able to generate individual scores without increasing the number of items (thus avoiding subject fatigue or compromised response quality) or employing a post hoc approach that is deemed statistically suboptimal. The model fits the data well, and subsequent tests also supported the concurrent validity and the predictive validity of the model. Limitations are discussed further.
Open Access Case Report
Integrating Proteomic Analysis and Machine Learning to Predict Prostate Cancer Aggressiveness
by Sheila M. Valle Cortés, Jaileene Pérez Morales, Mariely Nieves Plaza, Darielys Maldonado, Swizel M. Tevenal Baez, Marc A. Negrón Blas, Cayetana Lazcano Etchebarne, José Feliciano, Gilberto Ruiz Deyá, Juan C. Santa Rosario and Pedro Santiago Cardona
Stats 2024, 7(3), 875-893; https://doi.org/10.3390/stats7030053 - 21 Aug 2024
Abstract
Prostate cancer (PCa) poses a significant challenge because of the difficulty in identifying aggressive tumors, leading to overtreatment and missed personalized therapies. Although only 8% of cases progress beyond the prostate, the accurate prediction of aggressiveness remains crucial. This study therefore evaluated retinoblastoma phosphorylated at Serine 249 (Phospho-Rb S249), N-cadherin, β-catenin, and E-cadherin as biomarkers for identifying aggressive PCa, using a logistic regression model and a classification and regression tree (CART). Using immunohistochemistry (IHC), we assessed the expression of these biomarkers in PCa tissues and correlated their expression with clinicopathological data of the tumor. The results showed a negative correlation of E-cadherin and β-catenin with aggressive tumor behavior, whereas Phospho-Rb S249 and N-cadherin correlated positively with increased tumor aggressiveness. Furthermore, patients were stratified based on Gleason scores and E-cadherin staining patterns to evaluate their capability for early identification of aggressive PCa. Our findings suggest that the classification tree is the most effective method for measuring the utility of these biomarkers in clinical practice, incorporating β-catenin, tumor grade, and Gleason grade as relevant determinants for identifying patients with Gleason scores ≥ 4 + 3. This study could potentially benefit patients with aggressive PCa by enabling early disease detection and closer monitoring.
(This article belongs to the Section Regression Models)
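A hedged sketch of the CART part of this workflow on synthetic data: the four biomarkers are random stand-ins whose effect signs mirror the correlations reported in the abstract, so neither the coefficients nor the resulting tree reflect the study's actual data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(2)
n = 300
# hypothetical biomarker matrix: beta-catenin, Phospho-Rb S249, N-cadherin, E-cadherin
Xb = rng.normal(size=(n, 4))
# hypothetical aggressiveness: rises with Phospho-Rb S249 and N-cadherin,
# falls with beta-catenin and E-cadherin, matching the reported signs
logit = -0.5 * Xb[:, 0] + 0.9 * Xb[:, 1] + 0.7 * Xb[:, 2] - 0.8 * Xb[:, 3]
yb = rng.binomial(1, 1 / (1 + np.exp(-logit)))

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xb, yb)
print("CV accuracy:", cross_val_score(tree, Xb, yb, cv=5).mean().round(3))
print(export_text(tree, feature_names=["beta_catenin", "phospho_rb_s249",
                                       "n_cadherin", "e_cadherin"]))
```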
Open Access Article
An Analysis of the Impact of Injury Severity on Incident Clearance Time on Urban Interstates Using a Bivariate Random-Parameter Probit Model
by M. Ashifur Rahman, Milhan Moomen, Waseem Akhtar Khan and Julius Codjoe
Stats 2024, 7(3), 863-874; https://doi.org/10.3390/stats7030052 - 9 Aug 2024
Abstract
Incident clearance time (ICT) is impacted by several factors, including crash injury severity. The strategy of most transportation agencies is to allocate more resources and respond promptly when injuries are reported. Such a strategy should result in faster clearance of incidents, given the resources used. However, injury crashes by nature require extra time for attending to and moving crash victims while restoring the highway to its capacity. This usually leads to a longer incident clearance duration despite the greater amount of resources used, a finding confirmed by previous studies. The implication is that the relationship between ICT and injury severity is complex, with the possible presence of unobserved heterogeneity. This study investigated the impact of injury severity on ICT on Louisiana’s urban interstates by adopting a random-parameter bivariate modeling framework that accounts for potential correlation between injury severity and ICT, while also investigating unobserved heterogeneity in the data. The results suggest that there is a correlation between injury severity and ICT. Importantly, it was found that injury severity does not impact ICT in only one way, as suggested by most previous studies. Also, some shared factors were found to impact both injury severity and ICT: young drivers, truck and bus crashes, and crashes that occur during daylight. The findings from this study can contribute to an improvement in safety on Louisiana’s interstates while furthering the state’s mobility goals.
Open Access Article
Comparison of Methods for Addressing Outliers in Exploratory Factor Analysis and Impact on Accuracy of Determining the Number of Factors
by W. Holmes Finch
Stats 2024, 7(3), 842-862; https://doi.org/10.3390/stats7030051 - 5 Aug 2024
Abstract
Exploratory factor analysis (EFA) is a very common tool used in the social sciences to identify the underlying latent structure for a set of observed measurements. A primary component of EFA practice is determining the number of factors to retain, given the sample data. A variety of methods are available for this purpose, including parallel analysis, the minimum average partial, and the Chi-square difference test. Research has shown that the presence of outliers among the indicator variables can have a deleterious impact on the performance of these methods for determining the number of factors to retain. The purpose of the current simulation study was to compare the performance of several methods for dealing with outliers, combined with multiple techniques for determining the number of factors to retain. Results showed that using correlation matrices produced by either the percentage bend or the heavy-tailed Student’s t-distribution, coupled with either parallel analysis or the minimum average partial, was most accurate in terms of identifying the number of factors to retain. Implications of these findings for practice are discussed.
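Parallel analysis, one of the retention methods compared above, is compact enough to sketch directly: retain factors whose observed eigenvalues exceed a high percentile of eigenvalues obtained from random data of the same size (the two-factor data here are synthetic).

```python
import numpy as np

def parallel_analysis(X, n_sim=200, pct=95, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.linalg.eigvalsh(np.corrcoef(X.T))[::-1]     # observed eigenvalues, descending
    sim = np.empty((n_sim, p))
    for s in range(n_sim):
        Z = rng.standard_normal((n, p))                  # random data of the same shape
        sim[s] = np.linalg.eigvalsh(np.corrcoef(Z.T))[::-1]
    thresh = np.percentile(sim, pct, axis=0)
    return int(np.sum(obs > thresh))

# synthetic two-factor data with 8 indicators
rng = np.random.default_rng(4)
F = rng.standard_normal((500, 2))
load = np.zeros((8, 2)); load[:4, 0] = 0.7; load[4:, 1] = 0.7
X = F @ load.T + 0.6 * rng.standard_normal((500, 8))
print("factors retained:", parallel_analysis(X))
```

Robust variants would swap np.corrcoef for a percentage bend or Student's t-based correlation matrix, as in the study.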
Open Access Article
Patent Keyword Analysis Using Bayesian Zero-Inflated Model and Text Mining
by Sunghae Jun
Stats 2024, 7(3), 827-841; https://doi.org/10.3390/stats7030050 - 3 Aug 2024
Abstract
Patent keyword analysis is used to analyze the technology keywords extracted from patent documents collected for specific technological fields. Various methods related to this type of analysis have therefore been researched in industrial engineering fields such as technology management and new product development. To analyze patent document data, we have to search for patents related to the target technology and preprocess them to construct the patent–keyword matrix for statistical and machine learning algorithms. In general, a patent–keyword matrix has an extreme zero-inflation problem, because each keyword occupies one column even if it is included in only one document among all patent documents. The performance of general zero-inflated models deteriorates when the proportion of zeros becomes extremely large. To solve this problem, we applied Bayesian inference to a general zero-inflated model. In this paper, we propose a patent keyword analysis using a Bayesian zero-inflated model to overcome the extreme zero-inflation problem in the patent–keyword matrix. In our experiments, we collected practical patents related to digital therapeutics technology and used the patent–keyword matrix preprocessed from them. We compared the performance of our proposed method with that of other methods and showed the validity and improved performance of our patent keyword analysis. We expect that our research can contribute to solving the extreme zero-inflation problem that occurs not only in patent keyword analysis, but also in various text big data analyses.
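To make the zero-inflation mechanism concrete, here is a minimal maximum likelihood fit of a zero-inflated Poisson model to synthetic keyword counts; the paper's Bayesian version would add priors on the parameters (optimizing a posterior rather than the likelihood), which is not shown here.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(8)
# keyword counts with many structural zeros (prob. pi), otherwise Poisson(lam)
pi_true, lam_true = 0.85, 2.0
z = rng.binomial(1, pi_true, size=2000)
y = np.where(z == 1, 0, rng.poisson(lam_true, size=2000))

def zip_negloglik(theta):
    pi = 1 / (1 + np.exp(-theta[0]))      # logit parameterization keeps pi in (0, 1)
    lam = np.exp(theta[1])                # log parameterization keeps lam > 0
    ll_zero = np.log(pi + (1 - pi) * np.exp(-lam))
    ll_pos = np.log1p(-pi) - lam + y * np.log(lam) - gammaln(y + 1)
    return -np.sum(np.where(y == 0, ll_zero, ll_pos))

res = minimize(zip_negloglik, x0=[0.0, 0.0])
print("pi_hat=%.3f  lam_hat=%.3f" % (1 / (1 + np.exp(-res.x[0])), np.exp(res.x[1])))
```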
Open Access Article
Mass Conservative Time-Series GAN for Synthetic Extreme Flood-Event Generation: Impact on Probabilistic Forecasting Models
by Divas Karimanzira
Stats 2024, 7(3), 808-826; https://doi.org/10.3390/stats7030049 - 3 Aug 2024
Abstract
The lack of data on flood events poses challenges in flood management. In this paper, we propose a novel approach to enhance flood-forecasting models by utilizing the capabilities of Generative Adversarial Networks (GANs) to generate synthetic flood events. We modified a time-series GAN by incorporating constraints related to mass conservation, energy balance, and hydraulic principles into the GAN model through appropriate regularization terms in the loss function, and by using mass-conservative LSTMs in the generator and discriminator models. In this way, we can improve the realism and physical consistency of the generated extreme flood-event data. These constraints ensure that the synthetic flood-event data generated by the GAN adhere to fundamental hydrological principles and characteristics, enhancing the accuracy and reliability of flood-forecasting and risk-assessment applications. PCA and t-SNE are applied to provide valuable insights into the structure and distribution of the synthetic flood data, highlighting patterns, clusters, and relationships within the data. We aimed to use the generated synthetic data to supplement the original data and train a probabilistic neural runoff model for forecasting flood events multiple steps ahead. A t-test was performed to compare the means of the synthetic data generated by TimeGAN with those of the original data, and the result was statistically significant at the 95% level. The integration of time-series GAN-generated synthetic flood events with real data improved the robustness and accuracy of the autoencoder model, enabling more reliable predictions of extreme flood events. In the pilot study, the model trained on the dataset augmented with synthetic data from the time-series GAN achieved higher scores of NSE = 0.838 and KGE = 0.908 at the sixth hour ahead, compared to NSE = 0.829 and KGE = 0.90, indicating an improvement of 9.8% in NSE for multistep-ahead predictions of extreme flood events relative to the model trained on the original data alone. The integration of synthetic training datasets in probabilistic forecasting improves the model’s ability to achieve a reduced Prediction Interval Normalized Average Width (PINAW) for interval forecasting, yet this enhancement comes with a trade-off in the Prediction Interval Coverage Probability (PICP).
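One way to read "appropriate regularization terms in the loss function" is a differentiable penalty on violations of the water balance; the PyTorch sketch below penalizes deviations from dS/dt = inflow - outflow for generated sequences. Variable names and the balance form are assumptions, since the paper's exact constraint set is not reproduced here.

```python
import torch

def mass_conservation_penalty(inflow, outflow, storage, dt=1.0):
    """Mean squared violation of dS/dt = I - O for (batch, T) tensors."""
    ds = storage[:, 1:] - storage[:, :-1]
    net = (inflow[:, :-1] - outflow[:, :-1]) * dt
    return torch.mean((ds - net) ** 2)

# sketch of use inside GAN training, with an assumed weight lam:
# g_loss = adversarial_loss + lam * mass_conservation_penalty(I_gen, O_gen, S_gen)
```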
Open Access Article
The Negative Binomial INAR(1) Process under Different Thinning Processes: Can We Separate between the Different Models?
by Dimitris Karlis, Naushad Mamode Khan and Yuvraj Sunecher
Stats 2024, 7(3), 793-807; https://doi.org/10.3390/stats7030048 - 27 Jul 2024
Abstract
The literature on discrete-valued time series is expanding very fast. Very often we see new models with properties very similar to those of existing ones. A natural question that arises is whether the multitude of models with very similar properties can really have a practical purpose, or whether they are mostly of theoretical interest. In the present paper, we consider four models that have negative binomial marginal distributions and exhibit autoregressive behavior of order 1, but have very different generating mechanisms. We then try to answer the question of whether we can distinguish between them with real data. Extensive simulations show that while the differences are small, we can still discriminate between the models with relatively moderate sample sizes. However, the mean forecasts are expected to be almost identical for all models.
(This article belongs to the Special Issue Modern Time Series Analysis II)
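The generating mechanism at issue is easy to state in code: an INAR(1) process keeps each unit of the previous count with probability alpha (binomial thinning) and adds new innovations. The sketch below uses negative binomial innovations purely for illustration; it is not one of the four exact constructions compared in the paper.

```python
import numpy as np

rng = np.random.default_rng(6)

def inar1(T, alpha, innov):
    """INAR(1) with binomial thinning: X_t = alpha o X_{t-1} + eps_t."""
    x = np.zeros(T, dtype=int)
    for t in range(1, T):
        x[t] = rng.binomial(x[t - 1], alpha) + innov()
    return x

x = inar1(5000, alpha=0.5, innov=lambda: rng.negative_binomial(2, 0.5))
print("mean=%.2f  var=%.2f  lag-1 autocorrelation=%.2f (should be near alpha)"
      % (x.mean(), x.var(), np.corrcoef(x[:-1], x[1:])[0, 1]))
```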
Open Access Article
Seismic Evaluation Based on Poisson Hidden Markov Models—The Case of Central and South America
by Evangelia Georgakopoulou, Theodoros M. Tsapanos, Andreas Makrides, Emmanuel Scordilis, Alex Karagrigoriou, Alexandra Papadopoulou and Vassilios Karastathis
Stats 2024, 7(3), 777-792; https://doi.org/10.3390/stats7030047 - 23 Jul 2024
Abstract
A study of earthquake seismicity is undertaken over the areas of Central and South America, the tectonics of which are of great interest. The whole territory is divided into 10 seismic zones based on seismotectonic characteristics, as in previously published studies. The earthquakes used in the present study are extracted from the catalogs of the International Seismological Centre, cover the period 1900–2021, and are restricted to shallow depths (≤60 km) and to events above a minimum magnitude threshold. Fore- and aftershocks are removed according to Reasenberg’s technique. The paper confines itself to the evaluation of earthquake occurrence probabilities in the seismic zones covering parts of Central and South America; we implement the hidden Markov model (HMM) and apply the EM algorithm.
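For readers who want to try the model class, the sketch below fits a two-state Poisson HMM to synthetic event counts, assuming hmmlearn version 0.3 or later (which provides PoissonHMM with EM estimation); the real analysis would use the declustered zone catalogs instead.

```python
import numpy as np
from hmmlearn.hmm import PoissonHMM   # assumes hmmlearn >= 0.3

rng = np.random.default_rng(9)
# synthetic counts from two latent seismicity regimes (low and high rate)
states = (rng.uniform(size=400) < 0.3).astype(int)
counts = np.where(states == 0, rng.poisson(2, 400), rng.poisson(9, 400))

model = PoissonHMM(n_components=2, n_iter=200, random_state=0)
model.fit(counts.reshape(-1, 1))                  # EM (Baum-Welch) estimation
print("estimated Poisson rates:", np.sort(model.lambdas_.ravel()).round(2))
print("decoded regimes (first 20):", model.predict(counts.reshape(-1, 1))[:20])
```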
Open Access Article
Time-Varying Correlations between JSE.JO Stock Market and Its Partners Using Symmetric and Asymmetric Dynamic Conditional Correlation Models
by Anas Eisa Abdelkreem Mohammed, Henry Mwambi and Bernard Omolo
Stats 2024, 7(3), 761-776; https://doi.org/10.3390/stats7030046 - 22 Jul 2024
Abstract
The extent of correlation or co-movement among the returns of developed and emerging stock markets remains pivotal for efficiently diversifying global portfolios. This correlation is prone to variation over time as a consequence of escalating economic interdependence fostered by international trade and financial markets. In this study, the time-varying correlation and co-movement between the JSE.JO stock market of South Africa and its developed and developing stock market partners are analyzed. The dynamic conditional correlation–exponential generalized autoregressive conditional heteroscedasticity (DCC-EGARCH) methodology is employed with different multivariate distributions to explore the time-varying correlation and volatilities between the JSE.JO stock market and its partners. Based on the conditional correlation results, the JSE.JO stock market is integrated and co-moves with its partners, and the conditional correlation for all markets exhibits time-variant behavior. The conditional volatility results show that the JSE.JO stock market behaves differently from other markets, especially after 2015, indicating a positive sign for investors to diversify between the JSE.JO and its partners. The highest value of conditional volatility for markets was in 2020 during the COVID-19 pandemic, representing the riskiest period that investors should avoid due to the lack of diversification opportunities during crises.
(This article belongs to the Section Time Series Analysis)
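Python's arch package fits the univariate EGARCH stage but has no built-in DCC step, so the hedged sketch below stops at a rolling correlation of EGARCH-standardized residuals, a simple diagnostic for time-varying correlation; full DCC adds its own correlation dynamics on top. The two return series are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from arch import arch_model

rng = np.random.default_rng(10)
# synthetic daily returns standing in for JSE.JO and one partner market
r1 = pd.Series(0.01 * rng.standard_normal(1500))
r2 = pd.Series(0.6 * r1 + 0.008 * rng.standard_normal(1500))

def egarch_std_resid(r):
    res = arch_model(100 * r, vol="EGARCH", p=1, o=1, q=1).fit(disp="off")
    return res.resid / res.conditional_volatility

z1, z2 = egarch_std_resid(r1), egarch_std_resid(r2)
rho_t = z1.rolling(250).corr(z2)      # one-year rolling correlation of residuals
print(rho_t.dropna().describe().round(3))
```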
Open Access Case Report
Parametric Estimation in Fractional Stochastic Differential Equation
by Paramahansa Pramanik, Edward L. Boone and Ryad A. Ghanam
Stats 2024, 7(3), 745-760; https://doi.org/10.3390/stats7030045 - 20 Jul 2024
Abstract
Fractional Stochastic Differential Equations are becoming more popular in the literature, as they can model phenomena in financial data that typical Stochastic Differential Equation models cannot. In the formulation considered here, the Hurst parameter, H, controls the fraction of differentiation and needs to be estimated from the data. Fortunately, the covariance structure among observations in time is easily expressed in terms of the Hurst parameter, which means that a likelihood is easily defined. This work derives the maximum likelihood estimator for H and shows that it is biased and not consistent. Simulation data are used to understand the bias of the estimator and to create an empirical bias-correction function, and a bias-corrected estimator is proposed and studied. Via simulation, the bias-corrected estimator is shown to be minimally biased, and its simulation-based standard error is used to create a 95% confidence interval for H. A simulation study shows that the 95% confidence intervals have decent coverage probabilities for large n. The method is then applied to S&P 500 and VIX data before and after the 2008 financial crisis.
(This article belongs to the Special Issue Novel Semiparametric Methods)
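Since the abstract notes that the covariance of the observations has a closed form in H, the exact Gaussian likelihood can be sketched in a few lines: build the fractional Gaussian noise covariance, then maximize the log-likelihood over H. This is a plain ML sketch on simulated data; the bias correction derived in the paper is not reproduced here.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fgn_cov(n, H):
    """Autocovariance matrix of unit-variance fractional Gaussian noise."""
    k = np.arange(n)
    g = 0.5 * (np.abs(k + 1) ** (2 * H) - 2 * np.abs(k) ** (2 * H)
               + np.abs(k - 1) ** (2 * H))
    idx = np.arange(n)
    return g[np.abs(idx[:, None] - idx[None, :])]

def neg_loglik(H, x):
    S = fgn_cov(len(x), H)
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (logdet + x @ np.linalg.solve(S, x))

rng = np.random.default_rng(11)
H_true, n = 0.7, 300
x = np.linalg.cholesky(fgn_cov(n, H_true)) @ rng.standard_normal(n)  # simulate fGn

res = minimize_scalar(neg_loglik, bounds=(0.05, 0.95), method="bounded", args=(x,))
print("H_hat = %.3f (true H = %.2f)" % (res.x, H_true))
```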
Open Access Case Report
Bayesian Model Averaging and Regularized Regression as Methods for Data-Driven Model Exploration, with Practical Considerations
by Hyemin Han
Stats 2024, 7(3), 732-744; https://doi.org/10.3390/stats7030044 - 18 Jul 2024
Abstract
Methodological experts suggest that psychological and educational researchers should employ appropriate methods for data-driven model exploration, such as Bayesian Model Averaging and regularized regression, instead of conventional hypothesis-driven testing, if they want to explore the best prediction model. I intend to discuss practical considerations regarding data-driven methods for end-user researchers without extensive expertise in quantitative methods. I tested three data-driven methods, i.e., Bayesian Model Averaging, LASSO as a form of regularized regression, and stepwise regression, with datasets in psychology and education. I compared their performance in terms of cross-validity, which indicates robustness against overfitting, across different conditions. I employed functionalities widely available via R with default settings, to provide information relevant to end users without advanced statistical knowledge. The results demonstrated that LASSO showed the best performance, and that Bayesian Model Averaging outperformed stepwise regression when there were many candidate predictors to explore. Based on these findings, I discuss how to use data-driven model exploration methods appropriately in different situations from the perspective of non-expert users.
(This article belongs to the Section Data Science)
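The abstract's R workflow does not translate one-to-one to Python (there is no standard scikit-learn implementation of Bayesian Model Averaging), but the LASSO-versus-stepwise comparison can be sketched with LassoCV and forward selection as a stepwise stand-in; data are synthetic with many noise predictors.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=40, n_informative=5,
                       noise=10.0, random_state=0)

lasso = LassoCV(cv=5, random_state=0)
print("LASSO CV R^2   :", cross_val_score(lasso, X, y, cv=5).mean().round(3))

# forward selection as a rough stand-in for stepwise regression; note that
# selecting on the full data before cross-validation is optimistically biased
step = SequentialFeatureSelector(LinearRegression(), n_features_to_select=5, cv=5)
Xs = step.fit_transform(X, y)
print("stepwise CV R^2:", cross_val_score(LinearRegression(), Xs, y, cv=5).mean().round(3))
```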
Open Access Case Report
Transitioning from the University to the Workplace: A Duration Model with Grouped Data
by Manuel Salas-Velasco
Stats 2024, 7(3), 719-731; https://doi.org/10.3390/stats7030043 - 16 Jul 2024
Abstract
Labor market surveys usually measure unemployment duration in time intervals. In these cases, traditional duration models such as Cox regression and parametric survival models are not suitable for studying the duration of unemployment spells. To deal with this issue, we use Han and Hausman’s ordered logit model for grouped durations, which is more flexible than standard specifications. In particular, its flexibility arises from the fact that no functional form needs to be specified for the baseline hazard function; it also circumvents problems associated with heterogeneity. The focus of interest is the first unemployment duration of higher education graduates. The analysis is accomplished using a large dataset from a survey of Spanish university graduates. The results show that the university-to-work transition of higher education graduates is significantly associated with the graduate’s age, participation in internship programs, field of study, type of university, and gender. Specifically, graduates who participated in internship programs, engineering graduates, and graduates from private universities experience a smoother transition.
(This article belongs to the Section Survival Analysis)
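A related (though simpler) grouped-duration strategy can be sketched with a discrete-time hazard model: expand each graduate into one row per interval at risk and fit a logit with period dummies, so no parametric baseline hazard is imposed. Everything below, including the internship covariate and its effect size, is a synthetic assumption rather than the paper's Han and Hausman specification.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(12)
n, K = 500, 6                          # graduates and grouped duration intervals
intern = rng.binomial(1, 0.4, n)       # hypothetical covariate: internship program
haz = 1 / (1 + np.exp(-(-1.5 + 0.8 * intern)))   # hypothetical per-interval hazard
raw = rng.geometric(haz)               # latent duration in intervals
event = raw <= K                       # exit observed within the survey window
dur = np.minimum(raw, K)               # durations censored at K

# person-period expansion: one row per person per interval at risk
rows = [(i, t, int(t == dur[i] and event[i]), intern[i])
        for i in range(n) for t in range(1, dur[i] + 1)]
pp = pd.DataFrame(rows, columns=["id", "period", "exit", "intern"])

# period dummies play the role of a nonparametric baseline hazard
Xd = pd.get_dummies(pp["period"], prefix="t").astype(float)
Xd["intern"] = pp["intern"]
fit = sm.Logit(pp["exit"], Xd).fit(disp=0)
print(fit.params.round(2))
```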
Open Access Article
Optimal Estimators of Cross-Partial Derivatives and Surrogates of Functions
by Matieyendou Lamboni
Stats 2024, 7(3), 697-718; https://doi.org/10.3390/stats7030042 - 14 Jul 2024
Abstract
Computing cross-partial derivatives using fewer model runs is relevant in many modeling tasks, such as stochastic approximation, derivative-based ANOVA, the exploration of complex models, and active subspaces. This paper introduces surrogates of all the cross-partial derivatives of functions by evaluating such functions at N randomized points and using a set of L constraints. The randomized points rely on independent, central, and symmetric variables. The associated estimators, based on model runs, reach the optimal rates of convergence, and the biases of our approximations do not suffer from the curse of dimensionality for a wide class of functions. Such results are used for (i) computing the main and upper bounds of sensitivity indices, and (ii) deriving emulators of simulators or surrogates of functions thanks to the derivative-based ANOVA. Simulations are presented to show the accuracy of our emulators and estimators of sensitivity indices. The plug-in estimates of indices using the U-statistics of one sample are numerically much more stable.
(This article belongs to the Section Statistical Methods)
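For context, the classical deterministic baseline that such randomized estimators improve upon is the four-point finite-difference stencil, which needs four model runs per pair of inputs; the paper's approach instead recovers all cross-partials from one shared set of randomized evaluations. A worked check of the stencil:

```python
import numpy as np

def cross_partial_fd(f, x, i, j, h=1e-4):
    """Four-point finite-difference stencil for d^2 f / (dx_i dx_j)."""
    ei, ej = np.zeros_like(x), np.zeros_like(x)
    ei[i], ej[j] = h, h
    return (f(x + ei + ej) - f(x + ei - ej)
            - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)

f = lambda x: np.sin(x[0]) * x[1] ** 2 + x[0] * x[2]
x0 = np.array([0.5, 1.0, 2.0])
print("finite difference     :", cross_partial_fd(f, x0, 0, 1))
print("analytic 2*x1*cos(x0) :", 2 * x0[1] * np.cos(x0[0]))
```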
Open Access Case Report
Neurodevelopmental Impairments Prediction in Premature Infants Based on Clinical Data and Machine Learning Techniques
by Arantxa Ortega-Leon, Arnaud Gucciardi, Antonio Segado-Arenas, Isabel Benavente-Fernández, Daniel Urda and Ignacio J. Turias
Stats 2024, 7(3), 685-696; https://doi.org/10.3390/stats7030041 - 12 Jul 2024
Abstract
Preterm infants are prone to NeuroDevelopmental Impairment (NDI). Some previous works have identified clinical variables that are potential predictors of NDI. However, machine learning (ML)-based models still present low predictive capabilities when addressing this problem. This work evaluates the application of ML techniques to predict NDI using clinical data from a cohort of very preterm infants recruited at birth and assessed at 2 years of age. Six different classification models were assessed, using all features, clinician-selected features, and mutual information feature selection. The best results were obtained by ML models trained using mutual information-selected features and employing oversampling, for cognitive and motor impairment prediction, while for language impairment prediction the best setting was clinician-selected features. Although the performance indicators in this local cohort are consistent with similar previous works, they are still rather poor. This is a clear indication that, in order to obtain better performance rates, further analysis and methods should be considered, and other types of data should be taken into account together with the clinical variables.
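The winning configuration described above (mutual information feature selection plus oversampling) can be sketched as a leakage-safe pipeline; this assumes the third-party imbalanced-learn package, and the imbalanced dataset is a synthetic stand-in for the clinical cohort.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic stand-in for a small, imbalanced clinical cohort
X, y = make_classification(n_samples=200, n_features=30, n_informative=6,
                           weights=[0.85], random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=10)),
    ("oversample", RandomOverSampler(random_state=0)),   # applied to training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])
score = cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy").mean()
print("balanced CV accuracy:", round(score, 3))
```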
Topics
Topic in Entropy, Mathematics, Modelling, Stats: Interfacing Statistics, Machine Learning and Data Science from a Probabilistic Modelling Viewpoint
Topic Editors: Jürgen Pilz, Noelle I. Samia, Dirk Husmeier. Deadline: 31 December 2024
Special Issues
Special Issue in Stats: Feature Paper Special Issue: Reinforcement Learning
Guest Editors: Wei Zhu, Sourav Sen, Keli Xiao. Deadline: 30 September 2024
Special Issue in Stats: Statistical Learning for High-Dimensional Data
Guest Editor: Paulo Canas Rodrigues. Deadline: 30 September 2024
Special Issue in Stats: Statistics, Analytics, and Inferences for Discrete Data
Guest Editor: Dungang Liu. Deadline: 30 November 2024
Special Issue in Stats: Integrative Approaches in Statistical Modeling and Machine Learning for Data Analytics and Data Mining
Guest Editors: Victor Leiva, Cecília Castro. Deadline: 31 January 2025