Stats, Volume 4, Issue 3 (September 2021) – 13 articles

Cover Story: High-dimensional classification studies have become widespread across various domains. In this paper, we propose a robust and sparse estimator for logistic regression models that simultaneously tackles the presence of outliers and/or irrelevant features. We rely on L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem in a framework that allows one to pursue optimality guarantees. Our proposal is used to investigate the main drivers of honeybee (Apis mellifera) loss in Pennsylvania through annual winter loss survey data, where it produces a more interpretable classification model and provides evidence for several outlying observations. In addition, numerical simulations show that our approach outperforms other methods across most performance measures in the considered settings.
Article
Curve Registration of Functional Data for Approximate Bayesian Computation
Stats 2021, 4(3), 762-775; https://doi.org/10.3390/stats4030045 - 07 Sep 2021
Viewed by 861
Abstract
Approximate Bayesian computation is a likelihood-free inference method which relies on comparing model realisations to observed data with informative distance measures. We obtain functional data that are not only subject to noise along their y axis but also to a random warping along their x axis, which we refer to as the time axis. Conventional distances on functions, such as the L2 distance, are not informative under these conditions. The Fisher–Rao metric, previously generalised from the space of probability distributions to the space of functions, is an ideal objective function for aligning one function to another by warping the time axis. We assess the usefulness of alignment with the Fisher–Rao metric for approximate Bayesian computation with four examples: two simulation examples, an example about passenger flow at an international airport, and an example of hydrological flow modelling. We find that the Fisher–Rao metric works well as the objective function to minimise for alignment; however, once the functions are aligned, it is not necessarily the most informative distance for inference. This means that likelihood-free inference may require two distances: one for alignment and one for parameter inference.
(This article belongs to the Special Issue Functional Data Analysis (FDA))
Article
Some New Tests of Conformity with Benford’s Law
Stats 2021, 4(3), 745-761; https://doi.org/10.3390/stats4030044 - 06 Sep 2021
Cited by 2 | Viewed by 994
Abstract
This paper presents new perspectives and methodological instruments for verifying the validity of Benford’s law for a given large dataset. To this end, we first propose new general tests for checking the statistical conformity of a given dataset with a generic target distribution; we also provide the explicit representation of the asymptotic distributions of the relevant test statistics. Then, we discuss the applicability of such novel devices to the case of Benford’s law. We implement extensive Monte Carlo simulations to investigate the size and the power of the introduced tests. Finally, we discuss the challenging theme of interpreting, in a statistically reliable way, the conformity between two distributions in the presence of a large number of observations.
(This article belongs to the Special Issue Benford's Law(s) and Applications)
Article
Inference for the Linear IV Model Ridge Estimator Using Training and Test Samples
Stats 2021, 4(3), 725-744; https://doi.org/10.3390/stats4030043 - 03 Sep 2021
Viewed by 908
Abstract
The asymptotic distribution is presented for the linear instrumental variables model estimated with a ridge penalty and a prior where the tuning parameter is selected with a holdout sample. The structural parameters and the tuning parameter are estimated jointly by the method of moments. A chi-squared statistic permits confidence regions for the structural parameters. The form of the asymptotic distribution provides insight into the optimal way to perform the split between the training and test sample. Results for the linear regression estimated by ridge regression are presented as a special case.
(This article belongs to the Special Issue Ridge Regression, Liu and Related Estimators)
Article
Cross-Validation, Information Theory, or Maximum Likelihood? A Comparison of Tuning Methods for Penalized Splines
Stats 2021, 4(3), 701-724; https://doi.org/10.3390/stats4030042 - 02 Sep 2021
Cited by 2 | Viewed by 1035
Abstract
Functional data analysis techniques, such as penalized splines, have become common tools used in a variety of applied research settings. Penalized spline estimators are frequently used in applied research to estimate unknown functions from noisy data. The success of these estimators depends on choosing a tuning parameter that provides the correct balance between fitting and smoothing the data. Several different smoothing parameter selection methods have been proposed for choosing a reasonable tuning parameter. The proposed methods generally fall into one of three categories: cross-validation methods, information theoretic methods, or maximum likelihood methods. Despite the well-known importance of selecting an ideal smoothing parameter, there is little agreement in the literature regarding which method(s) should be considered when analyzing real data. In this paper, we address this issue by exploring the practical performance of six popular tuning methods under a variety of simulated and real data situations. Our results reveal that maximum likelihood methods outperform the popular cross-validation methods in most situations, especially in the presence of correlated errors. Furthermore, our results reveal that the maximum likelihood methods perform well even when the errors are non-Gaussian and/or heteroscedastic. For real data applications, we recommend comparing results using cross-validation and maximum likelihood tuning methods, given that these methods tend to perform similarly (differently) when the model is correctly (incorrectly) specified.
(This article belongs to the Special Issue Functional Data Analysis (FDA))
Article
Learning Time Acceleration in Support Vector Regression: A Case Study in Educational Data Mining
Stats 2021, 4(3), 682-700; https://doi.org/10.3390/stats4030041 - 31 Aug 2021
Cited by 3 | Viewed by 932
Abstract
The development of a country involves directly investing in the education of its citizens. Learning analytics/educational data mining (LA/EDM) allows access to big observational structured/unstructured data captured from educational settings and relies mostly on machine learning algorithms to extract useful information. Support vector regression (SVR) is a supervised statistical learning approach that allows modelling and predicting the performance tendency of students to direct strategic plans for the development of high-quality education. In Brazil, performance can be evaluated at the national level using the average grades of a student on their National High School Exams (ENEMs) based on their socioeconomic information and school records. In this paper, we focus on increasing the computational efficiency of SVR applied to ENEM for online requisitions. The results are based on an analysis of a massive data set composed of more than five million observations; they indicate computational learning time savings of more than 90% and provide a prediction of performance that is compatible with traditional modeling.
(This article belongs to the Section Computational Statistics)
Article
Robust Variable Selection with Optimality Guarantees for High-Dimensional Logistic Regression
Stats 2021, 4(3), 665-681; https://doi.org/10.3390/stats4030040 - 31 Aug 2021
Cited by 1 | Viewed by 1268
Abstract
High-dimensional classification studies have become widespread across various domains. The large dimensionality, coupled with the possible presence of data contamination, motivates the use of robust, sparse estimation methods to improve model interpretability and ensure the majority of observations agree with the underlying parametric model. In this study, we propose a robust and sparse estimator for logistic regression models, which simultaneously tackles the presence of outliers and/or irrelevant features. Specifically, we propose the use of L0-constraints and mixed-integer conic programming techniques to solve the underlying double combinatorial problem in a framework that allows one to pursue optimality guarantees. We use our proposal to investigate the main drivers of honey bee (Apis mellifera) loss through the annual winter loss survey data collected by the Pennsylvania State Beekeepers Association. Previous studies mainly focused on predictive performance; however, our approach produces a more interpretable classification model and provides evidence for several outlying observations within the survey data. We compare our proposal with existing heuristic methods and non-robust procedures, demonstrating its effectiveness. In addition to the application to honey bee loss, we present a simulation study where our proposal outperforms other methods across most performance measures and settings.
(This article belongs to the Special Issue Robust Statistics in Action)
Article
Assessment of a Modified Sandwich Estimator for Generalized Estimating Equations with Application to Opioid Poisoning in MIMIC-IV ICU Patients
Stats 2021, 4(3), 650-664; https://doi.org/10.3390/stats4030039 - 12 Aug 2021
Viewed by 1181
Abstract
Longitudinal data are encountered frequently in many healthcare research areas, including the critical care environment. Repeated measures from the same subject are expected to correlate with each other. Models with binary outcomes are commonly used in this setting. Regression models for correlated binary outcomes are frequently fit using generalized estimating equations (GEE). The Liang and Zeger sandwich estimator is often used in GEE to produce unbiased standard error estimation for regression coefficients in large sample settings, even when the covariance structure is misspecified. The sandwich estimator performs optimally in balanced designs when the number of participants is large with few repeated measurements. The sandwich estimator’s asymptotic properties do not hold in small sample and rare-event settings. Under these conditions, the sandwich estimator underestimates the variances and is biased downwards. Here, the performance of a modified sandwich estimator is compared to the traditional Liang-Zeger estimator and alternative forms proposed by authors Morel, Pan, and Mancl-DeRouen. Each estimator’s performance was assessed with 95% coverage probabilities for the regression coefficients using simulated data under various combinations of sample sizes and outcome prevalence values with independence and autoregressive correlation structures. This research was motivated by investigations involving rare-event outcomes in intensive care unit settings.
(This article belongs to the Section Computational Statistics)
Article
Generalized Cardioid Distributions for Circular Data Analysis
Stats 2021, 4(3), 634-649; https://doi.org/10.3390/stats4030038 - 11 Aug 2021
Cited by 1 | Viewed by 870
Abstract
The Cardioid (C) distribution is one of the most important models for circular data. Although some of its structural properties have been derived, this distribution is not appropriate for asymmetric and multimodal phenomena on the circle, so extensions are required. There are various general methods that can be used to produce circular distributions. This paper proposes four extensions of the C distribution based on the beta, Kumaraswamy, gamma, and Marshall–Olkin generators. We obtain a unique linear representation of their densities and some mathematical properties. Inference procedures for the parameters are also investigated. We perform two applications on real data, where the new models are compared to the C distribution and one of its extensions.
(This article belongs to the Section Applied Stochastic Models)
Article
Smoothing in Ordinal Regression: An Application to Sensory Data
Stats 2021, 4(3), 616-633; https://doi.org/10.3390/stats4030037 - 21 Jul 2021
Cited by 2 | Viewed by 1112
Abstract
The so-called proportional odds assumption is popular in cumulative, ordinal regression. In practice, however, such an assumption is sometimes too restrictive. For instance, when modeling the perception of boar taint on an individual level, it turns out that, at least for some subjects, the effects of predictors (androstenone and skatole) vary between response categories. For more flexible modeling, we consider the use of a ‘smooth-effects-on-response penalty’ (SERP) as a connecting link between proportional and fully non-proportional odds models, assuming that parameters of the latter vary smoothly over response categories. The usefulness of SERP is further demonstrated through a simulation study. Besides flexible and accurate modeling, SERP also enables fitting of parameters in cases where the pure, unpenalized non-proportional odds model fails to converge.
(This article belongs to the Special Issue Statistics, Analytics, and Inferences for Discrete Data)
Article
Parameter Choice, Stability and Validity for Robust Cluster Weighted Modeling
Stats 2021, 4(3), 602-615; https://doi.org/10.3390/stats4030036 - 06 Jul 2021
Viewed by 911
Abstract
Statistical inference based on the cluster weighted model often requires some subjective judgment from the modeler. Many features influence the final solution, such as the number of mixture components, the shape of the clusters in the explanatory variables, and the degree of heteroscedasticity of the errors around the regression lines. Moreover, to deal with outliers and contamination that may appear in the data, hyper-parameter values ensuring robust estimation are also needed. In principle, this freedom gives rise to a variety of “legitimate” solutions, each derived from a specific set of choices and their implications in modeling. Here we introduce a method for identifying a “set of good models” to cluster a dataset, considering the whole panorama of choices. In this way, we enable the practitioner, or the scientist who needs to cluster the data, to make an educated choice. They will be able to identify the most appropriate solutions for the purposes of their own analysis, in light of their stability and validity.
(This article belongs to the Special Issue Robust Statistics in Action)
Article
First Digit Oscillations
Stats 2021, 4(3), 595-601; https://doi.org/10.3390/stats4030035 - 05 Jul 2021
Cited by 1 | Viewed by 694
Abstract
The frequencies of the first digits of numbers drawn from an exponential probability density oscillate around the Benford frequencies. Analysis, simulations and empirical evidence show that datasets must have at least 10,000 entries for these oscillations to emerge from finite-sample noise. Anecdotal evidence from population data is provided.
(This article belongs to the Special Issue Benford's Law(s) and Applications)
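The comparison described in this abstract can be explored numerically. The sketch below is an illustration only, not the authors' code: it computes the Benford first-digit probabilities log10(1 + 1/d) and the empirical first-digit frequencies of simulated exponential draws. The function names, the unit rate, the sample size, and the seed are all assumptions chosen for the example.

```python
import math
import random


def benford_probs():
    """Benford first-digit probabilities log10(1 + 1/d) for d = 1..9."""
    return [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x):
    """Leading decimal digit of a positive float, read from scientific notation."""
    return int(format(x, ".15e")[0])


def exponential_first_digit_freqs(n, rate=1.0, seed=0):
    """Empirical first-digit frequencies of n exponential draws (hypothetical helper)."""
    rng = random.Random(seed)
    counts = [0] * 9
    drawn = 0
    while drawn < n:
        x = rng.expovariate(rate)
        if x <= 0.0:
            continue  # a zero draw has no leading digit; skip it
        counts[first_digit(x) - 1] += 1
        drawn += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    benford = benford_probs()
    observed = exponential_first_digit_freqs(100_000)
    for d in range(1, 10):
        print(d, round(benford[d - 1], 4), round(observed[d - 1], 4))
```

Varying the `rate` argument changes the scale of the draws, which is one way to look at how the observed frequencies move relative to the Benford values; the sample size needed to see this clearly is the subject of the article itself.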
Article
Base Dependence of Benford Random Variables
Stats 2021, 4(3), 578-594; https://doi.org/10.3390/stats4030034 - 02 Jul 2021
Cited by 1 | Viewed by 1202
Abstract
A random variable X that is base b Benford will not in general be base c Benford when c ≠ b. This paper builds on two of my earlier papers and is an attempt to cast some light on the issue of base dependence. Following some introductory material, the “Benford spectrum” of a positive random variable is introduced and known analytic results about Benford spectra are summarized. Some standard machinery for a “Benford analysis” is introduced and combined with my method of “seed functions” to yield tools to analyze the base c Benford properties of a base b Benford random variable. Examples are generated by applying these general methods to several families of Benford random variables. Berger and Hill’s concept of “base-invariant significant digits” is discussed. Some potential extensions are sketched.
(This article belongs to the Special Issue Benford's Law(s) and Applications)
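Base dependence can be illustrated with a small simulation. The sketch below is not the author's method of seed functions; it only uses the standard characterisation that X is base b Benford exactly when the fractional part of log_b X is uniform on [0, 1). It draws X = 10^U with U uniform, which is base 10 Benford by construction, and measures the Kolmogorov-Smirnov distance from uniformity in bases 10 and 100. All function names, the sample size, and the seed are assumptions made for the example.

```python
import math
import random


def log_fraction(x, base):
    """Fractional part of log_base(x) for positive x."""
    return math.log(x, base) % 1.0


def ks_distance_to_uniform(values):
    """Kolmogorov-Smirnov distance between a sample and the Uniform(0, 1) law."""
    xs = sorted(values)
    n = len(xs)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(xs))


def base_benford_distance(sample, base):
    """Distance from uniformity of the fractional parts of log_base (hypothetical helper)."""
    return ks_distance_to_uniform([log_fraction(x, base) for x in sample])


def sample_base10_benford(n, seed=0):
    """Draw X = 10**U with U uniform on [0, 1); X is base 10 Benford by construction."""
    rng = random.Random(seed)
    return [10.0 ** rng.random() for _ in range(n)]


if __name__ == "__main__":
    sample = sample_base10_benford(5000)
    print("base 10 :", base_benford_distance(sample, 10))
    print("base 100:", base_benford_distance(sample, 100))
```

In base 100 the fractional parts equal U/2 and therefore never exceed 0.5, so the distance from uniformity stays near 0.5 however large the sample is, while in base 10 it shrinks as the sample grows.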
Article
A Constrained Generalized Functional Linear Model for Multi-Loci Genetic Mapping
Stats 2021, 4(3), 550-577; https://doi.org/10.3390/stats4030033 - 25 Jun 2021
Viewed by 740
Abstract
In genome-wide association studies (GWAS), efficient incorporation of linkage disequilibria (LD) among densely typed genetic variants into association analysis is a critical yet challenging problem. Functional linear models (FLM), which impose a smoothing structure on the coefficients of correlated covariates, are advantageous in genetic mapping of multiple variants with high LD. Here we propose a novel constrained generalized FLM (cGFLM) framework to perform simultaneous association tests on a block of linked SNPs with various trait types, including continuous, binary and zero-inflated count phenotypes. The new cGFLM applies a set of inequality constraints on the FLM to ensure model identifiability under different genetic codings. The method is implemented via B-splines, and an augmented Lagrangian algorithm is employed for parameter estimation. For hypothesis testing, we derive a test statistic that accounts for the model constraints and follows a mixture of chi-square distributions. Simulation results show that cGFLM is effective in identifying causal loci and gene clusters compared to several competing methods based on single markers and SKAT-C. We applied the proposed method to analyze a candidate gene-based COGEND study and a large-scale GWAS dataset on dental caries risk.