Journal Description
Stats is an international, peer-reviewed, open access journal on statistical science, published quarterly online by MDPI. The journal focuses on methodological and theoretical papers in statistics, probability, stochastic processes, and innovative applications of statistics in all scientific disciplines, including biological and biomedical sciences, medicine, business, economics and social sciences, physics, data science, and engineering.
- Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
- High Visibility: indexed within Scopus, ESCI (Web of Science), RePEc, and other databases.
- Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 14.5 days after submission; acceptance to publication takes 2.9 days (median values for papers published in this journal in the first half of 2022).
- Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.
Latest Articles
A Novel Generalization of Zero-Truncated Binomial Distribution by Lagrangian Approach with Applications for the COVID-19 Pandemic
Stats 2022, 5(4), 1004-1028; https://doi.org/10.3390/stats5040060 - 30 Oct 2022
Abstract
The importance of Lagrangian distributions and their applicability in real-world events have been highlighted in several studies. In light of this, we create a new zero-truncated Lagrangian distribution. It is presented as a generalization of the zero-truncated binomial distribution (ZTBD) and hence named the Lagrangian zero-truncated binomial distribution (LZTBD). The moments, probability generating function, factorial moments, as well as skewness and kurtosis measures of the LZTBD are discussed. We also show that the new model’s finite mixture is identifiable. The unknown parameters of the LZTBD are estimated using the maximum likelihood method. A broad simulation study is executed to evaluate the performance of the maximum likelihood estimates. The likelihood ratio test is used to assess the contribution of the third parameter in the new model. Six COVID-19 datasets are used to demonstrate the LZTBD’s applicability, and we conclude that the LZTBD is highly competitive in terms of goodness of fit.
Full article
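As a rough illustration of the starting point, here is a minimal Python sketch of the zero-truncated binomial pmf that the LZTBD generalizes; the Lagrangian third parameter of the LZTBD itself is not reproduced here, and the function name is ours.

```python
# Minimal sketch (not the LZTBD): the zero-truncated binomial pmf,
# P(X = k) = C(n,k) p^k (1-p)^(n-k) / (1 - (1-p)^n) for k = 1, ..., n.
from scipy.stats import binom

def ztb_pmf(k, n, p):
    """Binomial pmf renormalized to exclude the zero count."""
    return binom.pmf(k, n, p) / (1.0 - binom.pmf(0, n, p))

print(sum(ztb_pmf(k, 10, 0.3) for k in range(1, 11)))  # ~1.0
```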
Open Access Article
Comparison of Positivity in Two Epidemic Waves of COVID-19 in Colombia with FDA
Stats 2022, 5(4), 993-1003; https://doi.org/10.3390/stats5040059 - 28 Oct 2022
Abstract
We use the functional data methodology to examine whether there are significant differences between two waves of contagion by COVID-19 in Colombia between 7 July 2020 and 20 July 2021. A pointwise functional t-test is initially used; then, an alternative statistical test for paired samples is proposed, which has a theoretical distribution and performs well in small samples. As an advantage over the existing pointwise tests, our statistical test generates a scalar p-value that provides a global assessment of the significance of the differences between the positivity curves.
Full article
(This article belongs to the Section Applied Stochastic Models)
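For readers unfamiliar with the pointwise approach, here is a minimal sketch of a paired pointwise t-test of the kind the paper starts from; the data and their layout (curves sampled on a common grid) are assumed for illustration.

```python
# Sketch of a pointwise paired t-test between two waves of positivity curves;
# the data here are synthetic and the layout (curves x grid points) is assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_curves, n_grid = 25, 100
wave1 = rng.normal(0.10, 0.02, (n_curves, n_grid))
wave2 = rng.normal(0.12, 0.02, (n_curves, n_grid))

# One paired t-test per grid point; the p-values still need a multiplicity correction.
t_stat, p_val = stats.ttest_rel(wave1, wave2, axis=0)
print((p_val < 0.05).mean())  # fraction of grid points flagged pointwise
```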
Open Access Article
Snooker Statistics and Zipf’s Law
Stats 2022, 5(4), 985-992; https://doi.org/10.3390/stats5040058 - 21 Oct 2022
Abstract
Zipf’s law is well known in linguistics: the frequency of a word is inversely proportional to its rank. This is a special case of a more general power law, a common phenomenon in many kinds of real-world statistical data. Here, it is shown that snooker statistics also follow such a mathematical pattern, but with varying parameter values. Two types of rankings (prize money earned and centuries scored), and three different time frames (all-time, decade, and year) are considered. The results indicate that the power law parameter values depend on the type of ranking used, as well as the time frame considered. Furthermore, in some cases, the resulting parameter values vary significantly over time, for which a plausible explanation is provided. Finally, it is shown how individual rankings can be described somewhat more accurately using a log-normal distribution, but that the overall conclusions derived from the power law analysis remain valid.
Full article
(This article belongs to the Section Data Science)
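As a quick illustration of the fitting involved, the sketch below estimates the exponent of the power-law relation frequency ∝ rank^(-s) by least squares on the log-log scale; the ranking values are invented for the example.

```python
# Sketch: estimating a Zipf/power-law exponent from a ranking (toy data).
import numpy as np

values = np.array([500, 260, 170, 120, 100, 80, 70, 60, 55, 50], float)
ranks = np.arange(1, len(values) + 1)

# log(value) = log(C) - s * log(rank), so the slope estimates -s.
slope, _ = np.polyfit(np.log(ranks), np.log(values), 1)
print(f"power-law exponent s ≈ {-slope:.2f}")
```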
Open Access Article
Extreme Tail Ratios and Overrepresentation among Subpopulations with Normal Distributions
Stats 2022, 5(4), 977-984; https://doi.org/10.3390/stats5040057 - 20 Oct 2022
Abstract
Given several different populations, the relative proportions of each in the high (or low) end of the distribution of a given characteristic are often more important than the overall average values or standard deviations. In the case of two different normally distributed random variables, as is shown here, one of the (right) tail ratios will not only eventually be greater than 1, but will even become infinitely large. More generally, in every finite mixture of different normal distributions, there will always be exactly one of those distributions that is not only overrepresented in the right tail of the mixture but even completely overwhelms all other subpopulations in the rightmost tails. This property (and the analogous result for the left tails), although not unique to normal distributions, is not shared by other common continuous centrally symmetric unimodal distributions, such as the Laplace, nor even by other bell-shaped distributions, such as the Cauchy (Lorentz) distribution.
Full article
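The divergence of the right-tail ratio is easy to verify numerically; the sketch below compares two centered normals whose standard deviations differ (parameter values are arbitrary).

```python
# Sketch: the right-tail ratio P(X > x) / P(Y > x) for X ~ N(0, 1.2^2) and
# Y ~ N(0, 1), which grows without bound as x increases.
from scipy.stats import norm

for x in [2, 4, 6, 8]:
    ratio = norm.sf(x, scale=1.2) / norm.sf(x, scale=1.0)
    print(f"x = {x}: tail ratio ≈ {ratio:,.1f}")
```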
Open Access Communication
Ordinal Cochran-Mantel-Haenszel Testing and Nonparametric Analysis of Variance: Competing Methodologies
Stats 2022, 5(4), 970-976; https://doi.org/10.3390/stats5040056 - 17 Oct 2022
Abstract
The Cochran-Mantel-Haenszel (CMH) and nonparametric analysis of variance (NP ANOVA) methodologies are both sets of tests for categorical response data. The latter are competitor tests for the ordinal CMH tests in which the response variable is necessarily ordinal; the treatment variable may be either ordinal or nominal. The CMH mean score test seeks to detect mean treatment differences, while the CMH correlation test assesses ordinary or (1, 1) generalized correlation. Since the corresponding nonparametric ANOVA tests assess arbitrary univariate and bivariate moments, the ordinal CMH tests have been extended to enable a fuller comparison. The CMH tests are conditional tests, assuming that certain marginal totals in the data table are known. They have been extended to have unconditional analogues. The NP ANOVA tests are unconditional. Here, we give a brief overview of both methodologies to address the question “which methodology is preferable?”.
Full article
(This article belongs to the Section Statistical Methods)
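As a point of reference, the classical 2×2×K CMH test (the basic form that the ordinal CMH tests generalize) is available in statsmodels; the stratified tables below are invented for illustration.

```python
# Sketch: the classical Cochran-Mantel-Haenszel test for stratified 2x2 tables.
# The ordinal CMH tests discussed above generalize this basic form.
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

tables = [np.array([[12, 8], [7, 13]]),   # three hypothetical strata of
          np.array([[15, 5], [9, 11]]),   # treatment-by-response counts
          np.array([[10, 10], [6, 14]])]

result = StratifiedTable(tables).test_null_odds(correction=True)
print(result.statistic, result.pvalue)
```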
Open Access Article
On the Bivariate Composite Gumbel–Pareto Distribution
Stats 2022, 5(4), 948-969; https://doi.org/10.3390/stats5040055 - 16 Oct 2022
Abstract
In this paper, we propose a bivariate extension of univariate composite (two-spliced) distributions defined by a bivariate Pareto distribution for values larger than some thresholds and by a bivariate Gumbel distribution on the complementary domain. The purpose of this distribution is to capture the behavior of bivariate data consisting of mainly small and medium values but also of some extreme values. Some properties of the proposed distribution are presented. Further, two estimation procedures are discussed and illustrated on simulated data and on a real data set consisting of a bivariate sample of claims from an auto insurance portfolio. In addition, the risk of loss in this insurance portfolio is estimated by Monte Carlo simulation.
Full article
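To fix ideas, here is a univariate sketch of the two-spliced construction that the paper extends to two dimensions: a Gumbel body below a threshold joined to a Pareto tail above it. The splice point and body weight are chosen arbitrarily and no continuity constraint is imposed.

```python
# Sketch: a univariate composite (two-spliced) density with a Gumbel body and
# a Pareto tail; theta and w are illustrative, not estimated.
import numpy as np
from scipy.stats import gumbel_r, pareto

theta, w = 5.0, 0.8  # hypothetical splice point and body weight

def composite_pdf(x):
    body = gumbel_r.pdf(x) / gumbel_r.cdf(theta)  # Gumbel renormalized to (-inf, theta]
    tail = pareto.pdf(x / theta, b=2.5) / theta   # Pareto density supported on x > theta
    return np.where(x <= theta, w * body, (1 - w) * tail)

xs = np.linspace(-10, 200, 400_000)
print((composite_pdf(xs) * (xs[1] - xs[0])).sum())  # ~1.0
```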
Open Access Article
Benford Networks
Stats 2022, 5(4), 934-947; https://doi.org/10.3390/stats5040054 - 30 Sep 2022
Abstract
The Benford law applied within complex networks is an interesting area of research. This paper proposes a new algorithm for the generation of a Benford network based on priority rank, and further specifies the formal definition. The condition to be taken into account is the probability density of the node degree. In addition to this first algorithm, an iterative algorithm is proposed based on rewiring. Its development requires the introduction of an ad hoc measure for understanding how far an arbitrary network is from a Benford network. The definition is a semi-distance and does not lead to a distance in mathematical terms, instead serving to identify the Benford network as a class. The semi-distance is a function of the network; it is computationally less expensive than the degree of conformity and serves to set a descent condition for the rewiring. The algorithm stops when it meets the condition that either the network is Benford or the maximum number of iterations is reached. The second condition is needed because only a limited set of densities allow for a Benford network. Another important topic is assortativity and the extremes which can be achieved by constraining the network topology; for this reason, we ran simulations on artificial networks and explored further theoretical settings as preliminary work on models of preferential attachment. Based on our extensive analysis, the first proposed algorithm remains the best one from a computational point of view.
Full article
(This article belongs to the Special Issue Benford's Law(s) and Applications)
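As a small companion to the semi-distance idea, the sketch below measures how far a degree sequence is from the Benford first-digit law using a simple sup-norm deviation; the degrees are simulated rather than taken from the paper.

```python
# Sketch: sup-norm deviation of a degree sequence's first-digit frequencies
# from the Benford law (simulated heavy-tailed degrees).
import numpy as np

def first_digit(n):
    while n >= 10:
        n //= 10
    return n

rng = np.random.default_rng(1)
degrees = rng.zipf(2.0, size=5000)
observed = np.bincount([first_digit(d) for d in degrees], minlength=10)[1:] / len(degrees)
benford = np.log10(1 + 1 / np.arange(1, 10))
print(np.abs(observed - benford).max())
```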
Open Access Feature Paper Article
Robust Permutation Tests for Penalized Splines
Stats 2022, 5(3), 916-933; https://doi.org/10.3390/stats5030053 - 16 Sep 2022
Abstract
Penalized splines are frequently used in applied research for understanding functional relationships between variables. In most applications, statistical inference for penalized splines is conducted using the random effects or Bayesian interpretation of a smoothing spline. These interpretations can be used to assess the uncertainty of the fitted values and the estimated component functions. However, statistical tests about the nature of the function are more difficult, because such tests often involve testing a null hypothesis that a variance component is equal to zero. Furthermore, valid statistical inference using the random effects or Bayesian interpretation depends on the validity of the utilized parametric assumptions. To overcome these limitations, I propose a flexible and robust permutation testing framework for inference with penalized splines. The proposed approach can be used to test omnibus hypotheses about functional relationships, as well as more flexible hypotheses about conditional relationships. I establish the conditions under which the methods will produce exact results, as well as the asymptotic behavior of the various permutation tests. Additionally, I present extensive simulation results to demonstrate the robustness and superiority of the proposed approach compared to commonly used methods.
Full article
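The omnibus logic is easy to convey in a toy form: permute the response, refit the smoother, and compare fit statistics. The sketch below uses a cubic polynomial as a crude stand-in for a penalized spline, so it illustrates the permutation scheme only, not the paper's estimator.

```python
# Sketch: an omnibus permutation test of "no functional relationship",
# with a cubic polynomial standing in for the penalized spline smoother.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 80)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.5, 80)

def fit_stat(x, y):
    resid = y - np.polyval(np.polyfit(x, y, 3), x)
    return 1.0 - resid.var() / y.var()  # R^2 of the smooth fit

obs = fit_stat(x, y)
perm = np.array([fit_stat(x, rng.permutation(y)) for _ in range(999)])
print((np.sum(perm >= obs) + 1) / (len(perm) + 1))  # permutation p-value
```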
Open Access Article
Smoothing County-Level Sampling Variances to Improve Small Area Models’ Outputs
Stats 2022, 5(3), 898-915; https://doi.org/10.3390/stats5030052 - 11 Sep 2022
Abstract
The use of hierarchical Bayesian small area models, which take survey estimates along with auxiliary data as input to produce official statistics, has increased in recent years. Survey estimates for small domains are usually unreliable due to small sample sizes, and the corresponding sampling variances can also be imprecise and unreliable. This affects the performance of the model (i.e., the model will not produce an estimate or will produce a low-quality modeled estimate), which results in a reduced number of official statistics published by a government agency. To mitigate the unreliable sampling variances, these survey-estimated variances are typically modeled against the direct estimates wherever a relationship between the two is present. However, this is not always the case. This paper explores different alternatives to mitigate the unreliable (beyond some threshold) sampling variances. A Bayesian approach under the area-level model set-up and a distribution-free technique based on bootstrap sampling are proposed to update the survey data. An application to the county-level corn yield data from the County Agricultural Production Survey of the United States Department of Agriculture’s (USDA’s) National Agricultural Statistics Service (NASS) is used to illustrate the proposed approaches. The final county-level model-based estimates for small area domains, produced based on updated survey data from each method, are compared with county-level model-based estimates produced based on the original survey data and the official statistics published in 2016.
Full article
(This article belongs to the Special Issue Small Area Estimation: Theories, Methods and Applications)
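For context, the area-level shrinkage at the heart of such models looks like the following sketch, in which the model-based estimate is a precision-weighted compromise between the direct estimate and a synthetic prediction; all numbers are invented.

```python
# Sketch: Fay-Herriot-style area-level shrinkage. Unreliable sampling
# variances D distort the weights gamma, which motivates smoothing them.
import numpy as np

y = np.array([105.0, 98.0, 120.0, 87.0])        # direct county estimates (invented)
D = np.array([40.0, 250.0, 30.0, 400.0])        # their sampling variances (invented)
x_beta = np.array([100.0, 100.0, 110.0, 95.0])  # synthetic regression predictions
sigma2_v = 60.0                                 # assumed between-area model variance

gamma = sigma2_v / (sigma2_v + D)
theta_hat = gamma * y + (1 - gamma) * x_beta
print(theta_hat.round(1))
```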
Open Access Project Report
Using Small Area Estimation to Produce Official Statistics
Stats 2022, 5(3), 881-897; https://doi.org/10.3390/stats5030051 - 08 Sep 2022
Cited by 1
Abstract
The USDA National Agricultural Statistics Service (NASS) and other federal statistical agencies have used probability-based surveys as the foundation for official statistics for over half a century. Non-survey data that can be used to improve the accuracy and precision of estimates such as administrative, remotely sensed, and retail data have become increasingly available. Both frequentist and Bayesian models are used to combine survey and non-survey data in a principled manner. NASS has recently adopted Bayesian subarea models for three of its national programs: farm labor, crop county estimates, and cash rent county estimates. Each program provides valuable estimates at multiple scales of geography. For each program, technical challenges had to be met and a strenuous review completed before models could be adopted as the foundation for official statistics. Moving models out of the research phase into production required major changes in the production process and a cultural shift. With the implemented models, NASS now has measures of uncertainty, transparency, and reproducibility of its official statistics.
Full article
(This article belongs to the Special Issue Small Area Estimation: Theories, Methods and Applications)
Open Access Article
Modeling Realized Variance with Realized Quarticity
Stats 2022, 5(3), 856-880; https://doi.org/10.3390/stats5030050 - 07 Sep 2022
Abstract
This paper proposes a model for realized variance that exploits information in realized quarticity. The realized variance and quarticity measures are both highly persistent and highly correlated with each other. The proposed model incorporates information from the observed realized quarticity process via autoregressive conditional variance dynamics. It exploits conditional dependence in higher-order (fourth) moments, in analogy to the way the class of GARCH models exploits conditional dependence in second moments.
Full article
(This article belongs to the Special Issue Modern Time Series Analysis)
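For readers new to these measures, the sketch below computes both from one day of intraday returns using their standard definitions; the returns are simulated.

```python
# Sketch: realized variance (RV) and realized quarticity (RQ) from one day
# of simulated 1-minute returns.
import numpy as np

rng = np.random.default_rng(3)
r = rng.normal(0, 0.001, 390)    # hypothetical 1-minute returns
n = len(r)

rv = np.sum(r ** 2)              # RV = sum of squared returns
rq = (n / 3.0) * np.sum(r ** 4)  # RQ = (n/3) * sum of fourth powers
print(rv, rq)
```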
Open Access Article
A New Benford Test for Clustered Data with Applications to American Elections
Stats 2022, 5(3), 841-855; https://doi.org/10.3390/stats5030049 - 31 Aug 2022
Abstract
A frequent problem with classic first digit applications of Benford’s law is the law’s inapplicability to clustered data, which becomes especially problematic for analyzing election data. This study offers a novel adaptation of Benford’s law by performing a first digit analysis after converting vote counts from election data to base 3 (referred to throughout the paper as 1-BL 3), spreading out the data and thus rendering the law significantly more useful. We test the efficacy of our approach on synthetic election data generated using discrete Weibull modeling, finding that election data often conforms to 1-BL 3. Lastly, we apply 1-BL 3 analysis to selected states from the 2004 US Presidential election to detect potential statistical anomalies.
Full article
(This article belongs to the Special Issue Benford's Law(s) and Applications)
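In base 3 the only possible leading digits are 1 and 2, with Benford probabilities log_3(2) ≈ 0.631 and log_3(3/2) ≈ 0.369. A minimal sketch of such a first-digit check, on simulated rather than real vote counts, follows.

```python
# Sketch: a base-3 first-digit analysis in the spirit of 1-BL 3 (simulated counts).
import numpy as np

def leading_digit_base3(n):
    while n >= 3:
        n //= 3
    return n

rng = np.random.default_rng(4)
counts = rng.lognormal(6, 1.5, 10_000).astype(int) + 1
digits = np.array([leading_digit_base3(c) for c in counts])
observed = np.array([(digits == 1).mean(), (digits == 2).mean()])
benford3 = np.log(np.array([2.0, 1.5])) / np.log(3)
print(observed.round(3), benford3.round(3))
```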
Open Access Article
A New Bivariate INAR(1) Model with Time-Dependent Innovation Vectors
Stats 2022, 5(3), 819-840; https://doi.org/10.3390/stats5030048 - 19 Aug 2022
Abstract
Recently, there has been a growing interest in integer-valued time series models, especially in multivariate models. Motivated by the diversity of the infinite-patch metapopulation models, we propose an extension to the popular bivariate INAR(1) model, whose innovation vector is assumed to be time-dependent in the sense that the mean of the innovation vector is linearly increased by the previous population size. We discuss the stationarity and ergodicity of the observed process and its subprocesses. We consider the conditional maximum likelihood estimate of the parameters of interest, and establish their large-sample properties. The finite sample performance of the estimator is assessed via simulations. Applications to crime data illustrate the model.
Full article
(This article belongs to the Section Time Series Analysis)
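The building block being extended is the binomial-thinning recursion; a univariate sketch (with a constant-mean Poisson innovation, unlike the paper's time-dependent innovation vector) is shown below.

```python
# Sketch: a univariate INAR(1) process X_t = alpha ∘ X_{t-1} + eps_t via
# binomial thinning; innovations here have a constant mean, unlike the paper.
import numpy as np

rng = np.random.default_rng(5)
alpha, lam, T = 0.6, 2.0, 500

x = np.zeros(T, dtype=int)
for t in range(1, T):
    survivors = rng.binomial(x[t - 1], alpha)  # alpha ∘ X_{t-1}
    x[t] = survivors + rng.poisson(lam)        # add the innovation
print(x.mean())  # ≈ lam / (1 - alpha) = 5
```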
Open Access Article
Deriving the Optimal Strategy for the Two Dice Pig Game via Reinforcement Learning
Stats 2022, 5(3), 805-818; https://doi.org/10.3390/stats5030047 - 17 Aug 2022
Abstract
Games of chance have historically played a critical role in the development and teaching of probability theory and game theory, and, in the modern age, computer programming and reinforcement learning. In this paper, we derive the optimal strategy for playing the two-dice game Pig, both the standard version and its variant with doubles, coined “Double-Trouble”, using certain fundamental concepts of reinforcement learning, especially the Markov decision process and dynamic programming. We further compare the newly derived optimal strategy to other popular play strategies in terms of the winning chances and the order of play. In particular, we compare to the popular “hold at n” strategy, which is considered to be close to the optimal strategy, especially for the best n, for each type of Pig Game. For the standard two-player, two-dice, sequential Pig Game examined here, we found that “hold at 23” is the best choice, with the average winning chance against the optimal strategy being 0.4747. For the “Double-Trouble” version, we found that the “hold at 18” is the best choice, with the average winning chance against the optimal strategy being 0.4733. Furthermore, time in terms of turns to play each type of game is also examined for practical purposes. For optimal vs. optimal or optimal vs. the best “hold at n” strategy, we found that the average number of turns is 19, 23, and 24 for one-die Pig, standard two-dice Pig, and the “Double-Trouble” two-dice Pig games, respectively. We hope our work will inspire students of all ages to invest in the field of reinforcement learning, which is crucial for the development of artificial intelligence and robotics and, subsequently, for the future of humanity.
Full article
(This article belongs to the Special Issue Feature Paper Special Issue: Reinforcement Learning)
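The "hold at n" policy is straightforward to simulate; the single-turn sketch below uses the standard two-dice rule that any 1 ends the turn with nothing banked (the double-1 rule, which also wipes the player's whole score, is ignored in this one-turn view).

```python
# Sketch: expected points banked per turn under "hold at n" in two-dice Pig.
# Single-turn view only; the double-1 total-score penalty is not modeled.
import random

def hold_at_n_turn(n):
    total = 0
    while total < n:
        d1, d2 = random.randint(1, 6), random.randint(1, 6)
        if d1 == 1 or d2 == 1:
            return 0  # a 1 wipes out the turn total
        total += d1 + d2
    return total

turns = [hold_at_n_turn(23) for _ in range(100_000)]
print(sum(turns) / len(turns))
```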
Open Access Article
Autoregressive Models with Time-Dependent Coefficients—A Comparison between Several Approaches
Stats 2022, 5(3), 784-804; https://doi.org/10.3390/stats5030046 - 12 Aug 2022
Abstract
Autoregressive-moving average (ARMA) models with time-dependent (td) coefficients and marginally heteroscedastic innovations provide a natural alternative to stationary ARMA models. Several theories have been developed in the last 25 years for parametric estimation in that context. In this paper, we focus on time-dependent autoregressive (tdAR) models and consider one of the estimation theories in that case. We also provide an alternative theory for tdAR processes that relies on a mixing property. We compare the Dahlhaus theory for locally stationary processes and the Bibi and Francq theory, developed essentially for cyclically time-dependent models, with our own theory. Regarding existing theories, there are differences in the basic assumptions (e.g., on differentiability with respect to time or with respect to parameters) that are better seen in specific cases such as the tdAR(1) process. There are also differences in terms of asymptotics, as shown by an example. Our opinion is that the field of application can play a role in choosing one of the theories. This paper is completed by simulation results showing that the asymptotic theory can be used even for short series (fewer than 50 observations).
Full article
(This article belongs to the Section Time Series Analysis)
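A tdAR(1) path is simple to simulate; the sketch below uses a coefficient that drifts linearly across the observation window (the drift shape is our choice, not the paper's).

```python
# Sketch: simulating a tdAR(1) process X_t = phi(t) X_{t-1} + e_t with a
# linearly drifting coefficient.
import numpy as np

rng = np.random.default_rng(6)
T = 300
phi = np.linspace(0.2, 0.8, T)  # time-dependent AR coefficient (illustrative)

x = np.zeros(T)
for t in range(1, T):
    x[t] = phi[t] * x[t - 1] + rng.normal()
print(x[-5:].round(2))
```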
Open Access Article
Neutrosophic F-Test for Two Counts of Data from the Poisson Distribution with Application in Climatology
Stats 2022, 5(3), 773-783; https://doi.org/10.3390/stats5030045 - 12 Aug 2022
Abstract
This paper addresses the modification of the F-test for count data following the Poisson distribution. The F-test when the count data are expressed in intervals is considered in this paper. The proposed F-test is evaluated using real data from climatology. The comparative study showed the efficiency of the F-test for count data under neutrosophic statistics over the F-test for count data under classical statistics.
Full article
Open Access Article
Poisson Extended Exponential Distribution with Associated INAR(1) Process and Applications
Stats 2022, 5(3), 755-772; https://doi.org/10.3390/stats5030044 - 05 Aug 2022
Abstract
The significance of count data modeling and its applications to real-world phenomena have been highlighted in several research studies. The present study focuses on a two-parameter discrete distribution that can be obtained by compounding the Poisson and extended exponential distributions. It has tractable and explicit forms for its statistical properties. The maximum likelihood estimation method is used to estimate the unknown parameters. An extensive simulation study was also performed. In this paper, the significance of the proposed distribution is demonstrated in a count regression model and in a first-order integer-valued autoregressive process, referred to as the INAR(1) process. In addition to this, the empirical importance of the proposed model is proved through three real-data applications, and the empirical findings indicate that the proposed INAR(1) model provides better results than other competitive models for time series of counts that display overdispersion.
Full article
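The compounding mechanism behind such models is what produces the overdispersion mentioned above; the sketch below uses a plain exponential mixing distribution as a stand-in for the paper's extended exponential.

```python
# Sketch: Poisson counts with a random (exponentially distributed) mean are
# overdispersed; the paper compounds with an extended exponential instead.
import numpy as np

rng = np.random.default_rng(7)
lam = rng.exponential(3.0, 100_000)  # random Poisson means
x = rng.poisson(lam)
print(x.mean(), x.var())             # variance clearly exceeds the mean
```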
Open Access Article
Model-Based Estimates for Farm Labor Quantities
Stats 2022, 5(3), 738-754; https://doi.org/10.3390/stats5030043 - 03 Aug 2022
Cited by 1
Abstract
The United States Department of Agriculture’s (USDA’s) National Agricultural Statistics Service (NASS) conducts the Farm Labor Survey to produce estimates of the number of workers, duration of the workweek, and wage rates for all agricultural workers. Traditionally, expert opinion is used to integrate auxiliary information, such as the previous year’s estimates, with the survey’s direct estimates. Alternatively, implementing small area models for integrating survey estimates with additional sources of information provides more reliable official estimates and valid measures of uncertainty for each type of estimate. In this paper, several hierarchical Bayesian subarea-level models are developed in support of different estimates of interest in the Farm Labor Survey. A 2020 case study illustrates the improvement of the direct survey estimates for areas with small sample sizes by using auxiliary information and borrowing information across areas and subareas. The resulting framework provides a complete set of coherent estimates for all required geographic levels. These methods were incorporated into the official Farm Labor publication for the first time in 2020.
Full article
(This article belongs to the Special Issue Small Area Estimation: Theories, Methods and Applications)
Open Access Article
Reciprocal Data Transformations and Their Back-Transforms
Stats 2022, 5(3), 714-737; https://doi.org/10.3390/stats5030042 - 30 Jul 2022
Abstract
Variable transformations have a long and celebrated history in statistics, one that was rather academically glamorous at least until generalized linear models theory eclipsed their nurturing normal curve theory role. Still, today it continues to be a covered topic in introductory mathematical statistics courses, offering worthwhile pedagogic insights to students about certain aspects of traditional and contemporary statistical theory and methodology. Since its inception in the 1930s, it has been plagued by a paucity of adequate back-transformation formulae for inverse/reciprocal functions. A literature search exposes that, to date, the inequality E(1/X) ≥ 1/E(X), which often has a sizeable gap captured by the inequality part of its relationship, is the solitary contender for solving this problem. After documenting that inverse data transformations are anything but a rare occurrence, this paper proposes an innovative, elegant back-transformation solution based upon the Kummer confluent hypergeometric function of the first kind. This paper also derives formal back-transformation formulae for the Manly transformation, something apparently never done before. Much related future research remains to be undertaken; this paper furnishes numerous clues about what some of these endeavors need to be.
Full article
(This article belongs to the Section Statistical Methods)
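The inequality is easy to check numerically for positive data; the sketch below uses a gamma sample, for which both sides are known in closed form.

```python
# Sketch: Jensen's inequality E(1/X) >= 1/E(X) for positive X, checked on a
# Gamma(3, scale=2) sample where E(1/X) = 1/4 and 1/E(X) = 1/6.
import numpy as np

rng = np.random.default_rng(8)
x = rng.gamma(shape=3.0, scale=2.0, size=1_000_000)
print(np.mean(1 / x), 1 / np.mean(x))  # ≈ 0.250 vs ≈ 0.167
```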
Open Access Article
A Variable Selection Method for Small Area Estimation Modeling of the Proficiency of Adult Competency
Stats 2022, 5(3), 689-713; https://doi.org/10.3390/stats5030041 - 27 Jul 2022
Abstract
In statistical modeling, it is crucial to have consistent variables that are the most relevant to the outcome variable(s) of interest in the model. With the increasing richness of data from multiple sources, the size of the pool of potential variables is escalating. Some variables, however, could provide redundant information, add noise to the estimation, or waste the degrees of freedom in the model. Therefore, variable selection is needed as a parsimonious process that aims to identify a minimal set of covariates for maximum predictive power. This study illustrated the variable selection methods considered and used in the small area estimation (SAE) modeling of measures related to the proficiency of adult competency that were constructed using survey data collected in the first cycle of the PIAAC. The developed variable selection process consisted of two phases: phase 1 identified a small set of variables that were consistently highly correlated with the outcomes through methods such as correlation matrix and multivariate LASSO analysis; phase 2 utilized a k-fold cross-validation process to select a final set of variables to be used in the final SAE models.
Full article
(This article belongs to the Special Issue Small Area Estimation: Theories, Methods and Applications)
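Phase-2-style selection with k-fold cross-validation can be sketched with scikit-learn's LASSO path; the covariate pool below is simulated, and LassoCV is a generic stand-in for the study's multivariate procedure.

```python
# Sketch: LASSO with 5-fold cross-validation as a variable selection step
# (a generic stand-in for the study's two-phase procedure).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 30))          # hypothetical covariate pool
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)

model = LassoCV(cv=5).fit(X, y)
print(np.flatnonzero(model.coef_ != 0))  # indices of retained covariates
```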
Special Issues
Special Issue in Stats: Multivariate Statistics and Applications
Guest Editor: Silvia Romagnoli. Deadline: 31 October 2022
Special Issue in Stats: Novel Semiparametric Methods
Guest Editor: Eddy Kwessi. Deadline: 30 November 2022
Special Issue in Stats: Feature Paper Special Issue: Quantitative Finance
Guest Editors: Gareth W. Peters, Damien Challet, Hongsong Yuan, Paweł Polak, Min Shu. Deadline: 31 December 2022
Special Issue in Stats: Feature Paper Special Issue: Reinforcement Learning
Guest Editors: Wei Zhu, Sourav Sen, Keli Xiao. Deadline: 1 January 2023