Advances in Statistics: Theory, Methodology, Applications and Data Analysis

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Probability and Statistics".

Deadline for manuscript submissions: closed (1 July 2023) | Viewed by 22818

Special Issue Editor


Dr. Alicia Nieto-Reyes
Guest Editor
Department of Mathematics, Statistics and Computer Science, Universidad de Cantabria, 39005 Santander, Spain
Interests: statistical data depth; pattern recognition; ubiquitous computing; healthcare; functional data analysis; hypothesis testing; supervised classification

Special Issue Information

Dear Colleagues, 

This Special Issue is dedicated to exploring the latest advances in statistics that are innovative in their theoretical, methodological, or applied approach. Potential topics include, but are not limited to, non-parametric statistics, functional data analysis, fuzzy and random sets, multivariate statistics, classification (both supervised classification and clustering, including machine learning techniques), robust statistics, hypothesis testing, and time series analysis.

Dr. Alicia Nieto-Reyes
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • non-parametric statistics
  • functional data analysis
  • fuzzy and random sets
  • multivariate statistics
  • classification

Published Papers (16 papers)


Research

13 pages, 333 KiB  
Article
A New Estimator: Median of the Distribution of the Mean in Robustness
by Alfonso García-Pérez
Mathematics 2023, 11(12), 2694; https://doi.org/10.3390/math11122694 - 14 Jun 2023
Viewed by 1004
Abstract
In some statistical methods, the statistical information is provided in terms of the values used by classical estimators, such as the sample mean and sample variance. These estimations are used in a second stage, usually in a classical manner, to be combined into a single value, as a weighted mean. Moreover, in many applied studies, the results are given in these terms, i.e., as summary data. In all of these cases, the individual observations are unknown; therefore, computing the usual robustness estimators with them to replace classical non-robust estimations by robust ones is not possible. In this paper, the use of the median of the distribution F_x̄ of the sample mean is proposed, assuming a location-scale contaminated normal model, where the parameters of F_x̄ are estimated with the classical estimations provided in the first stage. The estimator so defined is called median of the distribution of the mean, MdM. This new estimator is applied in Mendelian randomization, defining the new robust inverse weighted estimator, RIVW.

19 pages, 361 KiB  
Article
Bicluster Analysis of Heterogeneous Panel Data via M-Estimation
by Weijie Cui and Yong Li
Mathematics 2023, 11(10), 2333; https://doi.org/10.3390/math11102333 - 17 May 2023
Viewed by 850
Abstract
This paper investigates the latent block structure in the heterogeneous panel data model. It is assumed that the regression coefficients have group structures across individuals and structural breaks over time, where change points can cause changes to the group structures and structural breaks can vary between subgroups. To recover the latent block structure, we propose a robust biclustering approach that utilizes M-estimation and concave fused penalties. An algorithm based on local quadratic approximation is developed to optimize the objective function, which is more compact and efficient than the ADMM algorithm. Moreover, we establish the oracle property of the penalized M-estimators and prove that the proposed estimator recovers the latent block structure with a probability approaching one. Finally, simulation studies on multiple datasets demonstrate the good finite sample performance of the proposed estimators.

18 pages, 2147 KiB  
Article
Theoretical Structure and Applications of a Newly Enhanced Gumbel Type II Model
by Showkat Ahmad Lone, Tabassum Naz Sindhu, Marwa K. H. Hassan, Tahani A. Abushal, Sadia Anwar and Anum Shafiq
Mathematics 2023, 11(8), 1797; https://doi.org/10.3390/math11081797 - 10 Apr 2023
Cited by 1 | Viewed by 955
Abstract
Statistical models are vital in data analysis, and researchers are always searching for new statistical models to fit data sets in a variety of domains. To create an improved statistical model, we used a T-X transformation and the Gumbel Type-II model in this investigation. A simulation study was carried out to assess the efficacy of the parameters. To show the application of the T-X approach for producing new distributions, here the new and improved Gumbel Type-II (NIGT-II) distribution, two actual data sets were used. The data sets reveal that the NIGT-II distribution fits the data better than the Gumbel Type-II distribution.

15 pages, 1805 KiB  
Article
X-STATIS: A Multivariate Approach to Characterize the Evolution of E-Participation, from a Global Perspective
by Carmen C. Rodríguez-Martínez, Mitzi Cubilla-Montilla, Purificación Vicente-Galindo and Purificación Galindo-Villardón
Mathematics 2023, 11(6), 1492; https://doi.org/10.3390/math11061492 - 18 Mar 2023
Cited by 2 | Viewed by 1025
Abstract
This paper aims to categorize countries by their e-participation index, according to political, capacity, and governmental environment factors; examine how they are projected based on these factors; and analyze whether this projection corresponds to the current state of e-participation development. It is the first study to provide an overview of the e-participation level using multivariate analysis techniques for three-way data analysis, specifically, the X-STATIS methodology and cluster analysis. These techniques enable the simultaneous representation of countries, factors, conditions, trajectories, and groupings, taking into account national conditions in the evolution of e-participation from 2008 to 2016. The results show that when the conditions of each country interact with the level of e-participation development, and depending on the economic development, 7% of countries are lagging behind in e-participation evolution, given their institutional and political capacity. This delay is particularly relevant in countries that enjoy a higher level of socioeconomic status. Meanwhile, 38% are above the level they would correspond to.

20 pages, 851 KiB  
Article
Scalar Variance and Scalar Correlation for Functional Data
by Cristhian Leonardo Urbano-Leon, Manuel Escabias, Diana Paola Ovalle-Muñoz and Javier Olaya-Ochoa
Mathematics 2023, 11(6), 1317; https://doi.org/10.3390/math11061317 - 09 Mar 2023
Viewed by 1502
Abstract
In Functional Data Analysis (FDA), the existing summary statistics are elements of the Hilbert space L² of square-integrable functions. These elements do not constitute an ordered set; therefore, they are not sufficient to solve problems related to comparability, such as obtaining a correlation measurement, comparing the variability between two sets of curves, or determining the efficiency and consistency of a functional estimator. Consequently, we present a coherent redefinition of some common summary statistics in FDA, namely sample variance, sample covariance and correlation. For variance, covariance and correlation between functional data, our summary statistics lead to numbers instead of functions, which is helpful for solving the aforementioned problems. Furthermore, we briefly discuss the coherence of the functional forms of some statistics already present in FDA. We formally enumerate and demonstrate some properties of our functional summary statistics. Then, a simulation study is presented briefly, with evidence of the consistency of the proposed variance. Finally, we present the implementation of our statistics through two application examples.

20 pages, 1730 KiB  
Article
Statistical Depth for Text Data: An Application to the Classification of Healthcare Data
by Sergio Bolívar, Alicia Nieto-Reyes and Heather L. Rogers
Mathematics 2023, 11(1), 228; https://doi.org/10.3390/math11010228 - 02 Jan 2023
Cited by 2 | Viewed by 2225
Abstract
This manuscript introduces a new concept of statistical depth function: the compositional D-depth. It is the first data depth developed exclusively for text data, in particular, for those data vectorized according to a frequency-based criterion, such as the tf-idf (term frequency–inverse document frequency) statistic, which results in most vector entries taking a value of zero. The proposed data depth consists of considering the inverse discrete Fourier transform of the vectorized text fragments and then applying a statistical depth for functional data, D. This depth is intended to address the problem of sparsity of numerical features resulting from the transformation of qualitative text data into quantitative data, which is a common procedure in most natural language processing frameworks. Indeed, this sparsity hinders the use of traditional statistical depths and machine learning techniques for classification purposes. In order to demonstrate the potential value of this new proposal, it is applied to a real-world case study which involves mapping Consolidated Framework for Implementation and Research (CFIR) constructs to qualitative healthcare data. It is shown that the DDG-classifier yields competitive results and outperforms all studied traditional machine learning techniques (logistic regression with LASSO regularization, artificial neural networks, decision trees, and support vector machines) when used in combination with the newly defined compositional D-depth.
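
The first two steps described in this abstract (frequency-based vectorization, then an inverse discrete Fourier transform) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the tf-idf variant (relative term frequency times log(N/df)), the helper names `tfidf` and `inverse_dft`, and the toy documents are all hypothetical, and the functional depth D itself is not shown.

```python
import cmath
import math

def tfidf(docs):
    """Toy tf-idf: relative term frequency times log(N / document frequency).

    This particular weighting is an assumed variant chosen for illustration.
    """
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for words in tokenized for w in words})
    n_docs = len(docs)
    doc_freq = {w: sum(w in words for words in tokenized) for w in vocab}
    rows = []
    for words in tokenized:
        rows.append([
            (words.count(w) / len(words)) * math.log(n_docs / doc_freq[w])
            for w in vocab
        ])
    return vocab, rows

def inverse_dft(v):
    """Inverse discrete Fourier transform of a real or complex vector."""
    n = len(v)
    return [
        sum(v[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

# Hypothetical text fragments; most tf-idf entries of each row are zero.
docs = ["patient needs care", "staff needs training", "budget limits training"]
vocab, rows = tfidf(docs)
zeros_in_first_row = sum(1 for a in rows[0] if a == 0.0)

# The transformed vector is generally dense and could be treated as a
# discretized (complex-valued) curve by a functional depth.
signal = inverse_dft(rows[0])
```

The sketch only shows why the transform is attractive here: a vector with mostly zero entries is mapped to one whose entries are generally non-zero, which is the kind of input functional-data tools expect.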

25 pages, 4203 KiB  
Article
A Machine Learning Approach to Detect Parkinson’s Disease by Looking at Gait Alterations
by Cristina Tîrnăucă, Diana Stan, Johannes Mario Meissner, Diana Salas-Gómez, Mario Fernández-Gorgojo and Jon Infante
Mathematics 2022, 10(19), 3500; https://doi.org/10.3390/math10193500 - 25 Sep 2022
Viewed by 1335
Abstract
Parkinson’s disease (PD) is often detected only in later stages, when about 50% of nigrostriatal dopaminergic projections have already been lost. Thus, there is a need for biomarkers to monitor the earliest phases, especially for those that are at higher risk. In this work, we explore the use of machine learning methods to diagnose PD by analyzing gait alterations via an inertial sensors system that participants in the study wear while walking down a 15 m long corridor in three different scenarios. To achieve this goal, we have trained six well-known machine learning models: support vector machines, logistic regression, neural networks, k nearest neighbors, decision trees and random forest. We thoroughly explored several ways to mitigate the problems derived from the small amount of available data. We found that, while achieving accuracy rates of over 70% is quite common, the accuracy of the best model trained is only slightly above the 80% mark. This model has high precision and specificity (over 90%), but lower sensitivity (only 71%). We believe that these results are promising, especially given the size of the population sample (41 PD patients and 36 healthy controls), and that this research avenue should be further explored.

19 pages, 558 KiB  
Article
Data Depth and Multiple Output Regression, the Distorted M-Quantiles Approach
by Maicol Ochoa and Ignacio Cascos
Mathematics 2022, 10(18), 3272; https://doi.org/10.3390/math10183272 - 09 Sep 2022
Viewed by 927
Abstract
For a univariate distribution, its M-quantiles are obtained as solutions to asymmetric minimization problems dealing with the distance of a random variable to a fixed point. The asymmetry refers to the different weights awarded to the values of the random variable at either side of the fixed point. We focus on M-quantiles whose associated losses are given in terms of a power. In this setting, the classical quantiles are obtained for the first power, while the expectiles correspond to quadratic losses. The M-quantiles considered here are computed over distorted distributions, which allows tuning the weight awarded to the more central or peripheral parts of the distribution. These distorted M-quantiles are used in the multivariate setting to introduce novel families of central regions and their associated depth functions, which are further extended to the multiple output regression setting in the form of conditional and regression regions and conditional depths.
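
As a reading aid, the power-loss M-quantiles mentioned in this abstract can be written in one common form. The notation is an assumption chosen for illustration (τ for the asymmetry level, p for the power, θ for the fixed point) and may differ from the paper's; the distortion of the distribution is not shown, and the expectation is assumed to be finite.

```latex
\[
  \theta_{\tau,p}(X) \in \arg\min_{\theta \in \mathbb{R}}
  \mathbb{E}\!\left[\, \bigl|\tau - \mathbf{1}\{X \le \theta\}\bigr| \; |X - \theta|^{p} \right],
  \qquad \tau \in (0,1),\quad p \ge 1 .
\]
```

Values of X above θ receive weight τ and values below receive weight 1 − τ. With p = 1 the minimizers are the usual τ-quantiles, and with p = 2 the minimizer is the τ-expectile, matching the two special cases named in the abstract.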

15 pages, 1117 KiB  
Article
Relationship between Mental Health and Socio-Economic, Demographic and Environmental Factors in the COVID-19 Lockdown Period—A Multivariate Regression Analysis
by Stefano Bonnini and Michela Borghesi
Mathematics 2022, 10(18), 3237; https://doi.org/10.3390/math10183237 - 06 Sep 2022
Cited by 4 | Viewed by 1840
Abstract
Amongst the several consequences of the COVID-19 pandemic, we should include psychological effects on the population. The mental health consequences of lockdown are affected by several factors. The most important are: the duration of the social isolation period, the characteristics of the living space, the number of online (virtual) and offline (physical) contacts and perceived contacts’ closeness, individual characteristics, and the spread of infection in the geographical area of residence. In this paper, we investigate the possible effects of environmental, social and individual characteristics (predictors) on mental health (response) during the COVID-19 lockdown period. The relationship between mental health and predictors can be studied with a multivariate linear regression model, because “mental health” is a multidimensional concept. This work provides a contribution to the debate about the factors affecting mental health in the period of the COVID-19 lockdown, with the application of an innovative approach based on a multivariate regression analysis and a combined permutation test on data collected in a survey conducted in Italy in 2020.

14 pages, 332 KiB  
Article
Confidence Intervals Based on the Difference of Medians for Independent Log-Normal Distributions
by Weizhong Tian, Yaoting Yang and Tingting Tong
Mathematics 2022, 10(16), 2989; https://doi.org/10.3390/math10162989 - 18 Aug 2022
Cited by 5 | Viewed by 1233
Abstract
In this paper, we study inference on the difference of medians for two independent log-normal distributions. The methods considered include traditional ones such as the parametric bootstrap approach, the normal approximation approach, the method of variance estimates recovery approach, and the generalized confidence interval approach. The simultaneous confidence intervals for the difference in the median for more than two independent log-normal distributions are also discussed. Our simulation studies evaluate the performances of the proposed confidence intervals in terms of coverage probabilities and average lengths. We find that the parametric bootstrap approach would be a suitable choice for smaller sample sizes for the two independent distributions and multiple independent distributions. However, the method of variance estimates recovery and normal approximation approaches are alternative competitors for constructing simultaneous confidence intervals, especially when the populations have large variance. We also include two practical applications demonstrating the use of the techniques on observed data, where one data set consists of PM2.5 mass concentrations in Bangkapi and Dindaeng in Thailand and the other comes from a study of nitrogen-bound bovine serum albumin produced by three groups of diabetic mice. Both applications show that the confidence intervals from the parametric bootstrap approach have the smallest length.
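
To make the parametric bootstrap idea concrete, here is a minimal sketch for two samples. It relies on the fact that the median of a log-normal distribution with log-scale mean μ is exp(μ). The percentile interval, the function names, and the simulated data are assumptions made for illustration and are not taken from the paper, whose exact procedure may differ.

```python
import math
import random
import statistics

def lognormal_median(sample):
    """Plug-in median estimate exp(mean(log x)) for a log-normal sample."""
    return math.exp(statistics.fmean(math.log(v) for v in sample))

def bootstrap_ci_median_diff(x, y, level=0.95, n_boot=2000, seed=0):
    """Percentile parametric-bootstrap interval for median(X) - median(Y)."""
    rng = random.Random(seed)
    fits = []
    for sample in (x, y):
        logs = [math.log(v) for v in sample]
        # Fitted normal model on the log scale: (mean, std dev, sample size).
        fits.append((statistics.fmean(logs), statistics.stdev(logs), len(logs)))
    diffs = []
    for _ in range(n_boot):
        medians = []
        for mu, sigma, n in fits:
            # Draw a synthetic log-scale sample from the fitted model.
            draw = [rng.gauss(mu, sigma) for _ in range(n)]
            medians.append(math.exp(statistics.fmean(draw)))
        diffs.append(medians[0] - medians[1])
    diffs.sort()
    alpha = 1.0 - level
    lower = diffs[math.floor(alpha / 2 * (n_boot - 1))]
    upper = diffs[math.ceil((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper

# Simulated (hypothetical) data: two independent log-normal samples.
data_rng = random.Random(1)
x = [data_rng.lognormvariate(1.0, 0.5) for _ in range(30)]
y = [data_rng.lognormvariate(0.5, 0.5) for _ in range(30)]
point_estimate = lognormal_median(x) - lognormal_median(y)
lower, upper = bootstrap_ci_median_diff(x, y)
```

A percentile interval is used here only because it is the simplest to state; other bootstrap intervals could be substituted without changing the resampling step.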

18 pages, 565 KiB  
Article
Statistical Inference of Wiener Constant-Stress Accelerated Degradation Model with Random Effects
by Peihua Jiang
Mathematics 2022, 10(16), 2863; https://doi.org/10.3390/math10162863 - 11 Aug 2022
Cited by 4 | Viewed by 1231
Abstract
In the field of reliability analysis, the constant-stress accelerated degradation test is one of the most commonly used methods to evaluate a product’s reliability as degradation data are provided. In this paper, a constant-stress accelerated degradation test model of the Wiener process with random effects is proposed. First, the generalized confidence intervals of the model parameters are developed by constructing generalized pivotal quantities. Second, utilizing the substitution method, the generalized confidence intervals for the reliability function of lifetime, mean time to failure and the generalized prediction intervals for the degradation characteristic at the normal operating condition are also developed. Simulation studies are conducted to investigate the performances of the proposed generalized confidence intervals and prediction intervals. The simulation results reveal that the proposed generalized confidence intervals and prediction intervals work well in terms of the coverage percentage. In particular, a comparative analysis is made with the traditional bootstrap confidence intervals. Finally, the proposed procedures are used for a real data analysis.

23 pages, 381 KiB  
Article
Properties of Statistical Depth with Respect to Compact Convex Random Sets: The Tukey Depth
by Luis González-De La Fuente, Alicia Nieto-Reyes and Pedro Terán
Mathematics 2022, 10(15), 2758; https://doi.org/10.3390/math10152758 - 03 Aug 2022
Viewed by 1125
Abstract
We study a statistical data depth with respect to compact convex random sets, which is consistent with the multivariate Tukey depth and the Tukey depth for fuzzy sets. In addition, it provides a different perspective to the existing halfspace depth with respect to compact convex random sets. In studying this depth function, we provide a series of properties for the statistical data depth with respect to compact convex random sets. These properties are an adaptation of properties that constitute the axiomatic notions of multivariate, functional, and fuzzy depth-functions and other well-known properties of depth.

21 pages, 522 KiB  
Article
High-Dimensional Statistics: Non-Parametric Generalized Functional Partially Linear Single-Index Model
by Mohamed Alahiane, Idir Ouassou, Mustapha Rachdi and Philippe Vieu
Mathematics 2022, 10(15), 2704; https://doi.org/10.3390/math10152704 - 30 Jul 2022
Viewed by 1097
Abstract
We study the non-parametric estimation of partially linear generalized single-index functional models, where the systematic component of the model has a flexible functional semi-parametric form with a general link function. We suggest an efficient and practical approach to estimate (I) the single-index link function, (II) the single-index coefficients as well as (III) the non-parametric functional component of the model. The estimation procedure is developed by applying quasi-likelihood, polynomial splines and kernel smoothings. We then derive the asymptotic properties, with rates, of the estimators of each component of the model. Their asymptotic normality is also established. By making use of the splines approximation and the Fisher scoring algorithm, we show that our approach has numerical advantages in terms of the practical efficiency and the computational stability. A computational study on data is provided to illustrate the good practical behavior of our methodology.

16 pages, 310 KiB  
Article
A New Clustering Method Based on the Inversion Formula
by Mantas Lukauskas and Tomas Ruzgas
Mathematics 2022, 10(15), 2559; https://doi.org/10.3390/math10152559 - 22 Jul 2022
Cited by 6 | Viewed by 1421
Abstract
Data clustering is an area of data mining that falls into the class of unsupervised learning. Cluster analysis divides data into different classes by discovering the internal structure of data set objects and their relationship. This paper presents a new density clustering method based on the modified inversion formula density estimation. This new method should allow one to improve the performance and robustness of the k-means, Gaussian mixture model, and other methods. The proposed clustering algorithm consists of three main steps. First, we initialize the parameters and generate a T matrix. Second, we estimate the densities of each point and cluster. Third, we update the mean, sigma, and phi matrices. The new method based on the inversion formula works quite well with different datasets compared with K-means, Gaussian Mixture Model, and Bayesian Gaussian Mixture model. On the other hand, the new method has limitations: in its current state it cannot work with higher-dimensional data (d > 15). This will be solved in future versions of the model, detailed further in future work. Additionally, based on the results, we can see that the MIDEv2 method works the best with generated data with outliers in all datasets (0.5%, 1%, 2%, 4% outliers). The interesting point is that the new method based on the inversion formula can also cluster data that do not have outliers; one of the most popular examples is the Iris dataset.

12 pages, 1635 KiB
Article
Empirical Likelihood Ratio Tests for Homogeneity of Multiple Populations in the Presence of Auxiliary Information
by Ronghuo Wu and Yongsong Qin
Mathematics 2022, 10(13), 2341; https://doi.org/10.3390/math10132341 - 04 Jul 2022
Viewed by 1193
Abstract
The empirical likelihood ratio test (ELRT) statistic is constructed for testing the homogeneity of several nonparametric populations in the presence of some auxiliary information. It is shown, under some regularity conditions and under the null hypothesis that all distribution functions of the populations are equal, that the asymptotic distribution of the ELRT is a chi-squared distribution. The proposed ELRT could be more powerful than the Kruskal–Wallis test, as extra information can be efficiently employed by ELRT. The advantage of ELRT over T&P (2006) is that researchers do not need to select approximately normal statistics for inter-group comparisons, and ELRT is more suitable for the multi-population consistency test with a small sample size.

31 pages, 627 KiB  
Article
Supervised Classification of Healthcare Text Data Based on Context-Defined Categories
by Sergio Bolívar, Alicia Nieto-Reyes and Heather L. Rogers
Mathematics 2022, 10(12), 2005; https://doi.org/10.3390/math10122005 - 10 Jun 2022
Cited by 2 | Viewed by 1799
Abstract
Achieving a good success rate in supervised classification analysis of a text dataset, where the relationship between the text and its label can be extracted from the context, but not from isolated words in the text, is still an important challenge facing the fields of statistics and machine learning. For this purpose, we present a novel mathematical framework. We then conduct a comparative study between established classification methods for the case where the relationship between the text and the corresponding label is clearly depicted by specific words in the text. In particular, we use logistic LASSO, artificial neural networks, support vector machines, and decision-tree-like procedures. This methodology is applied to a real case study involving mapping Consolidated Framework for Implementation and Research (CFIR) constructs to health-related text data and achieves a prediction success rate of over 80% when just the first 55% of the text, or more, is used for training and the remaining for testing. The results indicate that the methodology can be useful to accelerate the CFIR coding process.