Generalizing Normality: Different Estimation Methods for Skewed Information

Normality is the most commonly used mathematical supposition in data modeling. Nonetheless, even with support from the law of large numbers (LLN), normality is a strong presumption, given that asymmetry and multi-modality are expected in real-world problems. Thus, a flexible modification of the normal distribution proposed by Elal-Olivero, called the Alpha-skew-normal (ASN) distribution, adds a skewness parameter that enables bimodality and fat tails when needed, although estimating this third parameter (in addition to the location and scale) is sometimes not trivial. This work analyzed seven different statistical inference methods for the ASN distribution on synthetic data and on historical water-flux data from 21 rivers (channels) in the Atacama region. Moreover, the contributions of this paper relate to the estimation of probabilities surrounding river flux levels in the surroundings of Copiapó city, which is the most economically important city of the third Chilean region and is located in one of the driest areas on Earth (excluding the North and South Poles). The results show the competitiveness of the MPS and RADE methods with respect to the MLE method, as well as their excellent performance.


Introduction
We live in the Big Data Era, where high volumes and many varieties of data are often found in data lakes. Despite the number of observations, symmetry and smooth tails are not always present. These characteristics are natural because we live in a complex world with nonlinear relations and outliers describing extreme values, which are more recurrent than simple statistical tools take into account. This new age requires flexible models and different reasoning based on data and information.
A question that is often asked in traditional departments worldwide is: "Are statistical methods becoming old fashioned?". Sir David Cox [1] explained that the focus is on the relevance and quality of data based on their coverage and representativeness, which provide confidence in the results in spite of the amount of information (large volumes) in a set, which may hold some potentially biased estimates with measurement errors. Efron and Hastie [2] discussed the relation between computer-related inferences and statistical inferences as a system of mathematical logic for guidance and correction that is complemented by large-scale prediction algorithms that are suitable for this new century.
Therefore, complexity is intrinsic to massive amounts of data, where high dimensionality and dynamism are often present [3,4]. Nonetheless, all of the information contained in the acquired data can be extracted by using an estimation method, e.g., with maximum likelihood estimation (MLE); a parametric version of such a method will be supported by a supposed distribution. Parametric approaches can be used to easily interpret patterns by using parameters, enable association across variables, present a low computational cost, and are easier to implement in decision-making systems.
In many cases, the standard MLE may not return desirable results. Other estimation methods that return accurate estimates have been considered, such as estimators based on the least-squares function [5], the product of spacings [6,7], or goodness-of-fit statistics [8]. There is no unique method that performs better for all models, and the performance may depend on the selected parametric form [9,10]. Thus, it is necessary to use an efficient estimation method jointly with a flexible parametric model that covers many data patterns and accommodates the asymmetry and multi-modality that may be contained in the data under consideration. This is the case of meteorological data, which show significant changes as well as complex dynamics [11,12]. This field demands extra attention on data-driven models, which are needed in order to incorporate space-time dependence [13], structural changes [14], and extreme values [15]. Moreover, in a parametric world, a model supported by a probabilistic distribution that deals with asymmetry and multi-modality [16,17] is necessary. Thus, this paper was motivated by a case study of the water flux of 21 rivers (channels) in the surroundings of Copiapó city, which is located in the Atacama Desert, one of the planet's driest areas. Moreover, we intend to characterize the probability densities associated with the empirical distribution of these events by using seven different approaches to statistical inference. This paper is organized as follows. Section 2 presents the motivation and details regarding the analyzed data. Section 3 provides the background of the adopted methodology with respect to the elements of statistical inference. Then, Sections 4 and 5 show the developments related to the methods when implemented on synthetic data and for the analysis of real-world data. Finally, Section 6 discusses the findings based on the obtained results.

The Data
The dataset adopted is related to fluviometric records (average monthly flows) from the Atacama Desert region (the third region of Chile) in the surroundings of Copiapó city. The historical period of these data is the ten years from January 2011 to December 2020; the data are associated with 21 rivers (or stream channels) and were obtained from the Chilean government website of the Dirección General de Aguas (Información Oficial Hidrometeorológica y de Calidad de Aguas en Línea).
Historical events revealed the persistently low water flux of the region. However, cyclical events were also noticeable (such as two rainfall events and the defrosting of glaciers/snow in summer, amongst others), creating the expected multi-modality and a large leptokurtosis.
During the processing of information, a decision-support system (DSS) sifts through and analyzes massive amounts of data, compiling comprehensive information that can be used in solving problems and making decisions. Figure 1 presents a flowchart of knowledge discovery in databases (KDD); the process involves information retrieval (IR), the decision-support system (DSS), and monitoring and forecasting.
Thus, the Atacama Desert watershed problem is part of a multi-dimensional study related to the analysis of a circular economy. It is essential to mention that uncertainty is always present globally (as measurement error or sample bias, amongst others). Nonetheless, probabilistic reasoning allows one to generalize results through statistical inference procedures.

Alpha-Skew-Normal (ASN) Distribution
Let X be a random variable that follows an Alpha-skew-normal (ASN) distribution [18]; then, its probability density function (PDF) is given by

f(x|α) = [(1 − αx)² + 1] / (2 + α²) · φ(x),    (1)

where x ∈ R, α ∈ R, and φ(·) is the PDF of the standard normal distribution. The cumulative distribution function (CDF) is given by

F(x|α) = Φ(x) + [α(2 − αx) / (2 + α²)] φ(x),    (2)

where Φ(·) is the CDF of the standard normal distribution. Then, wrapping the ASN density f(x|α) with the parameters for location (µ) and scale (σ), that is, defining the random variable

T = µ + σX,  µ ∈ R,  σ > 0,    (3)

the PDF of T is given by:

f(t|µ, σ, α) = (1/σ) · [(1 − αz)² + 1] / (2 + α²) · φ(z),  z = (t − µ)/σ.    (4)

The CDF related to Equation (4) is

F(t|µ, σ, α) = Φ(z) + [α(2 − αz) / (2 + α²)] φ(z).

Given its flexibility, the ASN distribution has been used in data modeling and has been adopted in different fields, such as astronomy [19], the modeling of wind speed [20], and for benchmark data [21]. Figure 2 presents different forms of the PDF of the ASN distribution, assuming, for instance, µ = 0 (location), σ = 1 (scale), and various values of α, in order to show the presence of asymmetry and bimodality (incorporating different heights between the modes).
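The density and distribution functions above can be implemented directly. The following is a minimal sketch in Python (the paper's own computations were done in R); the function names `asn_pdf` and `asn_cdf` are ours, and the check simply verifies that the density integrates to one.

```python
import numpy as np
from scipy.stats import norm

def asn_pdf(t, mu=0.0, sigma=1.0, alpha=0.0):
    # f(t) = ((1 - alpha*z)^2 + 1)/(2 + alpha^2) * phi(z) / sigma, z = (t - mu)/sigma
    z = (t - mu) / sigma
    return ((1.0 - alpha * z) ** 2 + 1.0) / (2.0 + alpha ** 2) * norm.pdf(z) / sigma

def asn_cdf(t, mu=0.0, sigma=1.0, alpha=0.0):
    # F(t) = Phi(z) + alpha*(2 - alpha*z)*phi(z) / (2 + alpha^2)
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

# sanity check: the density integrates to (numerically) one for a skewed choice of alpha
grid = np.linspace(-15.0, 15.0, 60001)
mass = float(np.sum(asn_pdf(grid, 0.0, 1.0, 3.0)) * (grid[1] - grid[0]))
print(round(mass, 4))
```

Differentiating `asn_cdf` numerically recovers `asn_pdf`, which is a convenient check that the two closed forms are consistent.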

Different Estimation Methods for the ASN Distribution
In this subsection, we will discuss seven different estimation methods (maximum likelihood estimation, ordinary and weighted least-square estimate, method of the maximum product of spacings, Cramer-von Mises minimum distance estimators, Anderson-Darling estimators, and right-tail Anderson-Darling estimators) for the parameters (µ, σ, and α) of the ASN distribution. Table 1 describes the methods used and the authors that proposed these inferential procedures. A comparison that used the cited estimators was presented for other models [22][23][24]. Note that Carl Friedrich Gauss introduced the LSQ in 1822, and it is one of the oldest estimation procedures, although it was only in the paper of Swain et al. [5] that this approach was discussed for a class of non-normal models, and it then became a standard reference when applied in different probability distributions. The additional details related to the estimation of the ASN distribution's parameters are presented in the following subsections.

Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is widely used in data analysis; Fisher's derivation of information inequality was first used for the analysis of variance, and later for the estimation of functions derived from Euler's relation for homogeneous functions. Despite the fact that historical records of this technique have been widely exposed and defended by Ronald A. Fisher (who likely gained visibility because of his epic fights with Egon S. Pearson), its reasoning dates back to the mid-1700s [28].
Let t₁, t₂, · · · , tₙ be a random sample of size n from F(t|µ, σ, α). Considering z_i = (t_i − µ)/σ, the maximum likelihood estimators µ̂_MLE, σ̂_MLE, and α̂_MLE can be obtained by maximizing, with respect to µ, σ, and α, the likelihood function

L(µ, σ, α) = ∏_{i=1}^{n} (1/σ) · [(1 − αz_i)² + 1] / (2 + α²) · φ(z_i).    (5)

The log-likelihood function of (5) is given by

ℓ(µ, σ, α) = −n log σ − n log(2 + α²) − (n/2) log(2π) + Σ_{i=1}^{n} log[(1 − αz_i)² + 1] − (1/2) Σ_{i=1}^{n} z_i².    (6)

Numerical methods, such as the Newton-Raphson method, are required in order to find the solution of the resulting nonlinear system. Under mild conditions, the MLEs are asymptotically normally distributed with a joint multivariate normal distribution given by

(µ̂, σ̂, α̂) ∼ N₃((µ, σ, α), I⁻¹(µ, σ, α)),

where I(µ, σ, α) is the Fisher information matrix; Elal-Olivero [18] discussed that the ASN family satisfies these mild conditions. This methodology is often adopted because of the sufficiency of the MLE, combined with its consistency (which leads to statistical efficiency) and its guaranteed asymptotic normality. Next, we will present a series of minimum distance estimators (which are easily applied to consistently estimate unknown parameters) that are designed so that the proposed model reproduces the probabilistic structure of the real-world phenomenon under study [29]. Minimum distance estimators provide consistent parameter estimates and are competitive, especially where other methods do not succeed.
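A hedged sketch of the MLE step in Python follows (the paper used R's maxLik and stats4); we minimize the negative of the log-likelihood in Equation (6) with a derivative-free optimizer, and sample from the ASN by numerically inverting its closed-form CDF. All function names are ours.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq, minimize

def asn_cdf(t, mu, sigma, alpha):
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

def asn_rvs(n, mu, sigma, alpha, rng):
    # inversion sampling: solve F(t) = u on a wide bracket around the location
    u = rng.uniform(size=n)
    lo, hi = mu - 15.0 * sigma, mu + 15.0 * sigma
    return np.array([brentq(lambda t: asn_cdf(t, mu, sigma, alpha) - ui, lo, hi) for ui in u])

def neg_loglik(theta, t):
    mu, sigma, alpha = theta
    if sigma <= 0:
        return np.inf
    z = (t - mu) / sigma
    n = len(t)
    ll = (np.sum(np.log((1.0 - alpha * z) ** 2 + 1.0)) - 0.5 * np.sum(z ** 2)
          - n * (np.log(sigma) + 0.5 * np.log(2.0 * np.pi) + np.log(2.0 + alpha ** 2)))
    return -ll

rng = np.random.default_rng(1)
data = asn_rvs(500, 0.0, 1.0, 1.0, rng)
x0 = [np.mean(data), np.std(data), 0.5]   # moment-based starting values (our choice)
fit = minimize(neg_loglik, x0, args=(data,), method="Nelder-Mead")
mu_hat, sigma_hat, alpha_hat = fit.x
```

Nelder-Mead stands in here for the Newton-Raphson iterations mentioned in the text; any local optimizer applied to the same objective plays the same role.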

Ordinary and Weighted Least-Square Estimates
Consider the order statistics t₍₁₎ ≤ t₍₂₎ ≤ · · · ≤ t₍ₙ₎ of a random sample of size n, and let F(t|µ, σ, α) be its (monotonic) distribution function. The least-squares (LSQ) estimators µ̂_LSE, σ̂_LSE, and α̂_LSE for the ASN distribution can be obtained by minimizing, with respect to the parameters µ, σ, and α,

S(µ, σ, α) = Σ_{i=1}^{n} [F(t₍ᵢ₎|µ, σ, α) − i/(n + 1)]².

Thus, the LSQ estimates can be obtained by solving the non-linear equations

Σ_{i=1}^{n} [F(t₍ᵢ₎|µ, σ, α) − i/(n + 1)] ∆_j(t₍ᵢ₎|µ, σ, α) = 0,  j = 1, 2, 3,

where

∆₁(t|µ, σ, α) = ∂F(t|µ, σ, α)/∂µ,  ∆₂(t|µ, σ, α) = ∂F(t|µ, σ, α)/∂σ,  ∆₃(t|µ, σ, α) = ∂F(t|µ, σ, α)/∂α.    (13)

Alternative solutions are obtained through high-precision numerical approximations of the ∆_j, j = 1, 2, 3, partial derivatives.
Alternatively, the weighted least-squares (WLQ) estimates have been proposed for whenever an efficient method is required with small datasets. µ̂_WLSE, σ̂_WLSE, and α̂_WLSE can be obtained by minimizing

W(µ, σ, α) = Σ_{i=1}^{n} [(n + 1)²(n + 2) / (i(n − i + 1))] [F(t₍ᵢ₎|µ, σ, α) − i/(n + 1)]².

The solutions follow from non-linear equations analogous to those involving (13). The WLQ estimation technique is particularly useful whenever one aims to weigh each observation in inverse proportion to the variance of its plotting position, in order to overcome the issue of non-constant variance.
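Both criteria can be sketched with one objective function. The following Python snippet (ours, not the paper's R code) minimizes the squared gaps between the fitted ASN CDF at the order statistics and the plotting positions i/(n + 1), with the WLQ weights (n + 1)²(n + 2)/(i(n − i + 1)); the test data are normal, which the ASN contains as the special case α = 0.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def asn_cdf(t, mu, sigma, alpha):
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

def lsq_obj(theta, t_sorted, weighted=False):
    mu, sigma, alpha = theta
    if sigma <= 0:
        return np.inf
    n = len(t_sorted)
    i = np.arange(1, n + 1)
    resid = asn_cdf(t_sorted, mu, sigma, alpha) - i / (n + 1.0)
    if weighted:
        # inverse variances of the uniform order statistics
        w = (n + 1.0) ** 2 * (n + 2.0) / (i * (n - i + 1.0))
        return float(np.sum(w * resid ** 2))
    return float(np.sum(resid ** 2))

rng = np.random.default_rng(2)
data = np.sort(rng.normal(size=200))          # alpha = 0 reduces the ASN to the normal
x0 = [np.mean(data), np.std(data), 0.0]
lse = minimize(lsq_obj, x0, args=(data, False), method="Nelder-Mead")
wlse = minimize(lsq_obj, x0, args=(data, True), method="Nelder-Mead")
```

In practice, the only difference between LSQ and WLQ is the weight vector, which is why a single routine with a flag suffices.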

Method of the Maximum Product of Spacings
The maximum product of spacings (MPS) method is a powerful alternative to MLE for estimating the unknown parameters of continuous univariate distributions; it maximizes the geometric mean of the spacings in the data (the differences between the values of the cumulative distribution function at neighboring data points). Cheng and Amin proposed this method [6,30], though it was also found independently by Ranneby [7] as an approximation of the Kullback-Leibler information measure. Some desirable properties of the MPS method, such as asymptotic efficiency, invariance, and, most importantly, consistency, hold more broadly (under more general conditions) than for MLEs [30].
Let us represent the differences between the values of the cumulative distribution function at neighboring data points by the spacings

D_i(µ, σ, α) = F(t₍ᵢ₎|µ, σ, α) − F(t₍ᵢ₋₁₎|µ, σ, α),  i = 1, . . . , n + 1,

with F(t₍₀₎|µ, σ, α) ≡ 0 and F(t₍ₙ₊₁₎|µ, σ, α) ≡ 1, so that the D_i form uniform spacings of a random sample from the ASN distribution. Thus, the MPS estimates µ̂_MPS, σ̂_MPS, and α̂_MPS can be obtained by maximizing the geometric mean of the spacings

G_ASN(µ, σ, α) = [∏_{i=1}^{n+1} D_i(µ, σ, α)]^{1/(n+1)},

or, equivalently, by maximizing its logarithm

log G_ASN(µ, σ, α) = (1/(n + 1)) Σ_{i=1}^{n+1} log D_i(µ, σ, α).

The estimates of the unknown parameters µ̂_MPS, σ̂_MPS, and α̂_MPS are obtained by solving the nonlinear equations

Σ_{i=1}^{n+1} (1/D_i(µ, σ, α)) [∆_j(t₍ᵢ₎|µ, σ, α) − ∆_j(t₍ᵢ₋₁₎|µ, σ, α)] = 0,  j = 1, 2, 3,

where ∆₁(· | µ, σ, α), ∆₂(· | µ, σ, α), and ∆₃(· | µ, σ, α) are given, respectively, in Equation (13).
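The MPS criterion above translates into a few lines of code. The sketch below (ours, in Python rather than the paper's R) minimizes the negative mean log-spacing; the small clip on the spacings guards against numerical zeros during the optimizer's excursions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def asn_cdf(t, mu, sigma, alpha):
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

def neg_log_mps(theta, t_sorted):
    mu, sigma, alpha = theta
    if sigma <= 0:
        return np.inf
    F = asn_cdf(t_sorted, mu, sigma, alpha)
    D = np.diff(np.concatenate(([0.0], F, [1.0])))   # spacings D_1 .. D_{n+1}
    D = np.clip(D, 1e-300, None)                     # guard against zero spacings
    return -float(np.mean(np.log(D)))                # maximize the geometric mean

rng = np.random.default_rng(3)
data = np.sort(rng.normal(loc=1.0, size=300))
x0 = [np.mean(data), np.std(data), 0.0]
fit = minimize(neg_log_mps, x0, args=(data,), method="Nelder-Mead")
```

Because the spacings are CDF increments at consecutive order statistics, ties in the data would produce zero spacings; continuous data (as here) avoid that issue.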

The Cramer-von Mises Minimum Distance Estimators
Alternatively, the Cramer-von Mises estimator (CME) is based on the difference between the estimated cumulative distribution function and the empirical distribution function [31,32]. These estimators operate by minimizing the distance between the "true" (observed) distribution and the "modeled" (adjusted) distribution, that is, through maximum goodness of fit. The CME estimates µ̂_CME, σ̂_CME, and α̂_CME are obtained by minimizing, with respect to µ, σ, and α, the function

C(µ, σ, α) = 1/(12n) + Σ_{i=1}^{n} [F(t₍ᵢ₎|µ, σ, α) − (2i − 1)/(2n)]².
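A minimal sketch of the CME for the ASN follows (our Python code, not the paper's); it plugs the ASN CDF into the standard Cramer-von Mises distance and minimizes it numerically.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def asn_cdf(t, mu, sigma, alpha):
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

def cvm_obj(theta, t_sorted):
    # C = 1/(12n) + sum_i [F(t_(i)) - (2i - 1)/(2n)]^2
    mu, sigma, alpha = theta
    if sigma <= 0:
        return np.inf
    n = len(t_sorted)
    i = np.arange(1, n + 1)
    resid = asn_cdf(t_sorted, mu, sigma, alpha) - (2.0 * i - 1.0) / (2.0 * n)
    return 1.0 / (12.0 * n) + float(np.sum(resid ** 2))

rng = np.random.default_rng(4)
data = np.sort(rng.normal(size=250))
x0 = [np.mean(data), np.std(data), 0.0]
cme = minimize(cvm_obj, x0, args=(data,), method="Nelder-Mead")
```

The constant 1/(12n) does not affect the location of the minimum; it is kept only so the objective matches the Cramer-von Mises statistic.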

The Anderson-Darling and Right-Tail Anderson-Darling Estimators
Another type of minimum distance estimator is based on the Anderson-Darling statistic and is often called the Anderson-Darling estimator (ADE). This estimator is based on the minimum distance obtained from the observed set (Y) sorted in ascending order, X = Sort(Y), in combination with the permutation of {1, 2, . . . , n} that sorts the series. The process uses the cumulative distribution function F(·) and the survival function S(·) = 1 − F(·) of the assumed PDF; the transformed values F(X) behave as a sample from a uniform distribution only if Y (and X) are indeed samples from that distribution.
The Anderson-Darling estimates µ̂_ADE, σ̂_ADE, and α̂_ADE of the parameters µ, σ, and α are obtained by minimizing, with respect to µ, σ, and α, the function:

A(µ, σ, α) = −n − (1/n) Σ_{i=1}^{n} (2i − 1) [log F(t₍ᵢ₎|µ, σ, α) + log S(t₍ₙ₊₁₋ᵢ₎|µ, σ, α)].

These estimates can also be obtained by solving the corresponding non-linear equations. Alternatively, one can improve the ADE's performance by considering the information held in the non-symmetrical differences between the theoretical CDF and the empirical CDF [33]. Thus, the right-tail Anderson-Darling estimator (RADE) is an alternative; µ̂_RADE, σ̂_RADE, and α̂_RADE of the parameters µ, σ, and α are obtained by minimizing the function:

R(µ, σ, α) = n/2 − 2 Σ_{i=1}^{n} F(t₍ᵢ₎|µ, σ, α) − (1/n) Σ_{i=1}^{n} (2i − 1) log S(t₍ₙ₊₁₋ᵢ₎|µ, σ, α).

These estimates can also be obtained by solving non-linear equations in which ∆₁(· | µ, σ, α), ∆₂(· | µ, σ, α), and ∆₃(· | µ, σ, α) are given, respectively, in Equation (13).
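Both Anderson-Darling criteria can be sketched as follows (our Python code, with clipping of F away from 0 and 1 to keep the logarithms finite while the optimizer explores):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def asn_cdf(t, mu, sigma, alpha):
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

def ade_obj(theta, t_sorted):
    mu, sigma, alpha = theta
    if sigma <= 0:
        return np.inf
    n = len(t_sorted)
    i = np.arange(1, n + 1)
    F = np.clip(asn_cdf(t_sorted, mu, sigma, alpha), 1e-12, 1.0 - 1e-12)
    S = 1.0 - F
    # A = -n - (1/n) sum (2i-1)[log F(t_(i)) + log S(t_(n+1-i))]
    return -n - float(np.mean((2.0 * i - 1.0) * (np.log(F) + np.log(S[::-1]))))

def rade_obj(theta, t_sorted):
    mu, sigma, alpha = theta
    if sigma <= 0:
        return np.inf
    n = len(t_sorted)
    i = np.arange(1, n + 1)
    F = np.clip(asn_cdf(t_sorted, mu, sigma, alpha), 1e-12, 1.0 - 1e-12)
    S = 1.0 - F
    # R = n/2 - 2 sum F(t_(i)) - (1/n) sum (2i-1) log S(t_(n+1-i))
    return n / 2.0 - 2.0 * float(np.sum(F)) - float(np.mean((2.0 * i - 1.0) * np.log(S[::-1])))

rng = np.random.default_rng(5)
data = np.sort(rng.normal(size=250))
x0 = [np.mean(data), np.std(data), 0.0]
ade = minimize(ade_obj, x0, args=(data,), method="Nelder-Mead")
rade = minimize(rade_obj, x0, args=(data,), method="Nelder-Mead")
```

The only structural difference between the two objectives is that the RADE drops the left-tail term log F and replaces it with a linear term in F, which is why it emphasizes the right tail.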

Numerical Analysis
In this section, we investigate the behavior of the ASN distribution based on artificial (synthetic) data, as well as the behavior of its parameter estimates under each estimation method. Thus, a Monte Carlo simulation was carried out in which the seven estimation methods described above were applied to the parameters and their efficiency was compared. The procedure was:

1. Given a set of parameters of the ASN(µ, σ, α) distribution, N samples of size n were generated;
2. For each sample, the estimates µ̂, σ̂, and α̂ were computed with each of the seven estimation methods;
3. The deviations θ̂ᵢ − θ were recorded for each replication i = 1, . . . , N and each parameter θ ∈ {µ, σ, α};
4. The overall bias and the overall MSE were computed with

Bias(θ̂) = (1/N) Σ_{i=1}^{N} (θ̂ᵢ − θ)  and  MSE(θ̂) = (1/N) Σ_{i=1}^{N} (θ̂ᵢ − θ)².

The results of this simulation should indicate the estimation method that can be expected to be the most efficient, i.e., the one whose bias and MSE are closest to zero. For this simulation study, we adopted the R software [34], and for the maximization method, we used the maxLik and stats4 packages [35]. The values chosen for the simulation parameters were N = 10,000 and n = {40, 60, 80, · · · , 300}. Due to the lack of space, we present the results only for {µ = 0, σ = 1, α = 1} and {µ = 0, σ = 1, α = 6}; nonetheless, the results generalize to other choices of the parameter vector θ. The estimation methods were considered under the same conditions in terms of samples, limits on the number of iterations, and initial values. Here, we considered the true values as initial values. However, we provide a simple approach, discussed in the next section, to deal with real cases where good initial values are not available. Figures 3 and 4 present the performance of the estimators in terms of the bias and MSE for the parameters µ, σ, and α when using the MLE, LSQ, WLQ, MPS, CME, ADE, and RADE with N = 10,000 simulated samples and different values of n. It can be observed that the MLE did not return adequate estimates for some parameter values and only converged for large sample sizes (moreover, an identifiability problem was noticeable when the MLE was adopted). These results show a drawback of the current approaches to obtaining parameter estimates for the ASN distribution. Although no method uniformly returns better estimates for all parameters and parameter values, we observed that the MPS obtained the best results in terms of minimum bias and MSE. Additionally, obtaining an estimate for α was quite challenging, but the MPS returned the smallest bias for this parameter.
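The simulation loop above can be sketched in miniature as follows. This is our illustrative Python version, not the paper's R code: N is reduced from 10,000 to 100 so that the sketch runs quickly, only the MPS is shown, and sampling inverts the true CDF on a fine grid.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def asn_cdf(t, mu, sigma, alpha):
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

def neg_log_mps(theta, t_sorted):
    mu, sigma, alpha = theta
    if sigma <= 0:
        return np.inf
    F = asn_cdf(t_sorted, mu, sigma, alpha)
    D = np.clip(np.diff(np.concatenate(([0.0], F, [1.0]))), 1e-300, None)
    return -float(np.mean(np.log(D)))

true = np.array([0.0, 1.0, 1.0])              # {mu = 0, sigma = 1, alpha = 1}
N, n = 100, 60                                # paper: N = 10,000; reduced for speed
rng = np.random.default_rng(6)

# approximate inversion sampling from the true ASN via a precomputed CDF grid
tgrid = np.linspace(-10.0, 10.0, 4001)
Fgrid = asn_cdf(tgrid, *true)

est = np.empty((N, 3))
for k in range(N):
    sample = np.sort(np.interp(rng.uniform(size=n), Fgrid, tgrid))
    est[k] = minimize(neg_log_mps, true, args=(sample,), method="Nelder-Mead").x

bias = est.mean(axis=0) - true                # Bias = mean of (theta_hat - theta)
mse = ((est - true) ** 2).mean(axis=0)        # MSE  = mean of (theta_hat - theta)^2
```

As in the paper's setup, the true values are used as initial values; swapping `neg_log_mps` for any of the other objectives reproduces the comparison for the remaining methods.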
Therefore, we recommend using the MPS to obtain estimates for all practical purposes. As in Figure 2, choosing the configuration of parameters (µ = 0.5, σ = 0.5, and α = 3) produces a bimodal PDF with a larger peak to the left and a smaller peak to the right.

Results
As we stated in Section 2, the motivation of this paper was driven by the water flux in the Atacama Desert or, more precisely, in the surroundings of Copiapó city. Figure 5 shows the empirical density of this phenomenon; a high concentration of low values (near zero) is present, although important events were also captured in this 10-year time window, such as heavy rainfall that caused a large leptokurtosis. Table 2 presents the statistical summaries (minimum, first quartile, median, mean, third quartile, and maximum) of the water flux by month. Because the weather is very stable in the region, the seasonality across years can be ignored, as low flux is common (values close to the monthly minima). Nonetheless, the cycle within each month is essential; given events such as defrosting at the end of spring/beginning of summer (higher values of the third quartile in November and December), more water is expected to enter the system. Logarithmic transformations have long been used [36], and a normal distribution is often obtained after transformation; in other situations, this is not the case. For instance, the dynamics of the water flux in the observed period, shown in Figure 6, reveal the presence of bimodality in this data transformation, and its monthly representation is shown in Figure 7. It is essential to mention that the maximum historical values occurred in May and June (of 2017); they were related to the heavy rains in the region, which are also notable in Table 2. The empirical distribution of this phenomenon is visualized in Figure 6, in which the dashed lines represent the adjusted density functions: the MPS in red and the RADE in blue. The initial values for µ and σ used to start the iteration procedures were obtained from the sample moments, while α̂ was obtained from a grid search over the range (−10, −9.5, . . . , 9.5, 10).
To confirm that the MPS returned better estimates than the others, we used ECDF-based goodness-of-fit tests of the ASN distribution model; Figure 8 shows a particular type of distance between the fitted functions and the observed values Fₙ. After transforming the parameters with the exponential function, the practical implication of the obtained results is that, on average, a monthly water flux of 0.1452 (from the MPS), compared to 0.1541 (from the RADE), is expected, considering the value of µ, with a monthly dispersion of 2.45 (MPS) versus 2.85 (RADE). Regarding the asymmetry estimated with both methods (α), a frequent occurrence is expected on the right side of the empirical distribution, that is, the recurrence of values greater than the average (µ) is expected. Based on the results found by Elal-Olivero [18], if an ASN distribution presents a parameter α > 1.34, bimodality is obtained; otherwise, only skewness is seen. The estimates of the skewness parameter (α) obtained with both the MPS and the RADE therefore indicate bimodality, together with clearly skewed information; Figure 5 shows the presence of skewed data.
After confirming the ASN distribution's goodness of fit, the occurrence of an event can be associated with its density (or cumulative) probability function. For instance, extreme values can be assessed; Table 3 shows some examples considering the 1%, 10%, 50%, 99%, and 99.99% quantiles.
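Quantities of the kind reported in Table 3 can be obtained by numerically inverting the fitted ASN CDF. The sketch below is ours; the parameter values on the log-flux scale are purely illustrative (not the paper's fitted estimates), and the exponential back-transform returns to the original flux scale.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def asn_cdf(t, mu, sigma, alpha):
    z = (t - mu) / sigma
    return norm.cdf(z) + alpha * (2.0 - alpha * z) * norm.pdf(z) / (2.0 + alpha ** 2)

def asn_quantile(p, mu, sigma, alpha):
    # invert F(t) = p on a wide bracket; F is strictly increasing, so the root is unique
    lo, hi = mu - 20.0 * sigma, mu + 20.0 * sigma
    return brentq(lambda t: asn_cdf(t, mu, sigma, alpha) - p, lo, hi)

mu, sigma, alpha = -1.9, 0.9, 3.0            # illustrative values on the log-flux scale
probs = (0.01, 0.10, 0.50, 0.99, 0.9999)
qs = [asn_quantile(p, mu, sigma, alpha) for p in probs]
for p, q in zip(probs, qs):
    print(p, round(float(np.exp(q)), 4))     # back-transform to the flux scale
```

The same inversion gives exceedance probabilities directly as 1 − F(t), so either direction of the Table 3 summaries follows from the single fitted CDF.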

Conclusions
Uncertainty reveals a wide variety of processes and experiences that may follow different rules. However, other attributions of uncertainty, such as external (disposition) versus internal factors (ignorance), are assessed through statistical inference with certain philosophical interpretations of probability [37]. The utilities of each possible outcome lead to the choice of rational actions regardless of the observed results' uncertainty.
The example presented by this paper is the modeling of water flux, which is an essential element that has been placed under stress by significant population growth and an increase in the demand for water supply (for agriculture, industrial processes, mineral extraction, and human consumption) [38]. Therefore, planning the logistics of water-related decisions according to their probabilistic distributions helps in unraveling this complex task [39,40]; this can be initiated by the analysis of the water flux, especially when environmental factors limit the sources of the water levels.
In the Atacama Desert, a north-south geographic band that is mainly located in northern Chile, precipitation is only a few millimeters per year, making it one of the driest places on Earth [41,42]. However, the vast expanses of the desert are punctuated by fertile valleys with rivers that originate in the central Andes and flow into the Pacific Ocean. Along these rivers, human populations have logically settled, exploiting this rare and precious water more and more throughout history, especially with the growing development of the mining industry, which is logically interested in the mineral resources of the Cordillera.
The hydrological regime of these rivers is characteristic of arid areas: Water flows from the peaks after the melting of the snowfall, glaciers, and permafrost located in the upper parts of the Cordillera [43]. In the context of climate change, it is therefore essential to understand the global hydrological cycles of these regions in order to set up a sustainable management policy [44]. This requires the implementation of tools for forecasting river flows, relative humidity, or any other water-related quantities, resulting in an inevitable need for in-depth knowledge of the physical phenomena that govern the entire hydrological cycle and, more precisely, the complex interactions among climate, ice, snow, and river flows.
Thus, this work proposed the investigation of different inference methods for the ASN probabilistic distribution, which is a promising and flexible distribution. Bimodality and skewed information were noticeable in the historical series, but both were nevertheless accommodated by the adopted probabilistic approach.
By using real-time analysis, big data solutions can now be implemented, as the process's probabilistic function was estimated and evidence of the goodness of fit was shown. In addition to the elements of the ASN distribution shown here, monitoring charts and other statistical process control (SPC) tools can also be explored because parametric distributions are often adopted. Future work should expand on the reasoning of the quantile estimations with explainable features (in a regression structure) that are associated with this problem, and forecasting may also be a motivation for future research.