A Test Detecting the Outliers for Continuous Distributions Based on the Cumulative Distribution Function of the Data Being Tested

One of the pillars of experimental science is sampling. Based on the analysis of samples, estimations for populations are made. There is an entire science based on sampling. The distribution of the population, the distribution of the sample, and the connection between the two (including the sampling distribution) provide rich information for any estimation to be made. Distributions are split into two main groups: continuous and discrete. The present study applies to continuous distributions. One of the challenges of sampling is its accuracy, or, in other words, how representative the sample is of the population from which it was drawn. To answer this question, a series of statistics have been developed to measure the agreement between the theoretical (the population) and observed (the sample) distributions. Another challenge, connected to this, is the presence of outliers, regarded here as observations wrongly collected, that is, not belonging to the population subjected to study. To detect outliers, a series of tests have been proposed, but mainly for Normal (Gauss) distributions, the most frequently encountered distribution. The present study proposes a statistic (and a test) intended to be used for any continuous distribution to detect outliers by constructing the confidence interval for the extreme value in the sample, at a certain (preselected) risk of being in error, and depending on the sample size. The proposed statistic is operational for known distributions (with a known probability density function) and also depends on the statistical parameters of the population; here it is discussed in connection with estimating those parameters by the maximum likelihood estimation method, operating on a uniform U(0,1) continuous symmetrical distribution.


Introduction
Many statistical techniques are sensitive to the presence of outliers, and all calculations, including the mean and standard deviation, can be distorted by a single grossly inaccurate data point. Therefore, checking for outliers should be a routine part of any data analysis.
To date, several tests have been developed for the purpose of identifying outliers of certain distributions. Most of the studies are connected with the Normal (or Gauss) distribution [1]. The first paper that drew attention to this matter is [2], and it was followed by studies that derived the distribution of the extreme values in samples taken from Normal distributions [3]. Then, a series of tests were developed by Thompson in 1935 [4], which were subjected to evaluation [5] and revised [6,7].
For other distributions such as the Gamma distribution, procedures for detecting outliers were proposed [8], revised [9], and unfortunately proved to be inefficient [10].
The first attempt to generalize the criterion for detecting outliers for any distribution can be found in [11], but further research on this subject is scarce apart from a notable recent attempt by Bardet and Dimby [12]. The Grubbs test is a frequently used test for detecting the outliers of a Normal distribution [7]. For a sample (x), the Grubbs test statistic takes the largest absolute deviation from the sample mean (x̄) in units of the sample standard deviation (s) in order to calculate the risk of being in error (α_G) when stating that the values most departed from the mean (min(x), max(x) or both) are not outliers (see Table 1). The probabilities associated with the observed statistic (p_G) are obtained from the Student t distribution [13].

Sample statistic (G), with associated probability p_G = 1 − α_G (Equations (1) and (2)):
$$G_{\text{min}} = \frac{\bar{x} - \min(x)}{s}, \qquad G_{\text{max}} = \frac{\max(x) - \bar{x}}{s}, \qquad G_{\text{all}} = \max(G_{\text{min}}, G_{\text{max}})$$
One should note that the Grubbs test statistic produces a symmetrical confidence interval (see Equations (1) and (2)). The Grubbs statistic, as given in Table 1, is intended to be used with the parameters of the population (µ and σ), which are determined using the central moments (CM) method.
Here, a method is proposed for constructing the confidence intervals for the extreme values of any continuous distribution for which the cumulative distribution function is also obtainable. The method involves the direct application of a simple test for detecting the outliers. The proposed method is based on deriving the statistic for the extreme values for the uniform distribution. Also, the proposed method provides a symmetrical confidence interval in the probability space.
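The Grubbs statistics of Table 1 can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function name `grubbs_statistics` is hypothetical, the sample standard deviation is assumed to use the n − 1 denominator, and the data are the ten values discussed later in the text (with 596 as the suspect observation).

```python
from statistics import mean, stdev

def grubbs_statistics(sample):
    """Return (G_min, G_max, G_all) for a sample, following Table 1."""
    x_bar = mean(sample)
    s = stdev(sample)  # assumption: sample standard deviation with n - 1
    g_min = (x_bar - min(sample)) / s
    g_max = (max(sample) - x_bar) / s
    return g_min, g_max, max(g_min, g_max)

# Ten-value sample used later in the text; 596 is the suspect value.
data = [568, 570, 570, 570, 572, 572, 572, 578, 584, 596]
g_min, g_max, g_all = grubbs_statistics(data)
print(f"G_min={g_min:.3f}  G_max={g_max:.3f}  G_all={g_all:.3f}")
```

Because both G_min and G_max are measured from the same center (x̄) in the same units (s), the resulting interval is symmetric around the mean, which is the property the text points out.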

Materials and Methods
The Grubbs test (Table 1) is based on the fact that if outliers exist, then these are "localized" as the maximum value and/or the minimum value in the dataset. Thus, the Grubbs test is essentially a sort of order statistic [14].
Some introductory elements are required for describing the proposed procedure. When a sample of data is tested under the null hypothesis that it follows a certain distribution, it is intrinsically assumed that the distribution is known. The usual assumption is that we possess its probability density function (PDF, for a continuous distribution) or its probability distribution function (PDF, for a discrete distribution). The discussion below relates to continuous distributions, although the treatment of discrete distributions is similar to a certain degree. Nevertheless, a major distinction between continuous and discrete distributions in the treatment of data is made here; that is, a continuous distribution is "dense", i.e., between any two distinct observations it is possible to observe another, while in the case of a discrete distribution this is generally not true.
Even when the PDF is known (possibly intrinsically), its (statistical) parameters may not necessarily be known, and this raises the complex problem of estimating the parameters of the (population) distribution from the sample; however, this issue is outside the scope of this paper. In general, the estimation of the parameters of the distribution of the data is biased by the presence of the outliers in the data, and thus, identifying the outliers along with the estimation of the parameters of the distribution is a difficult task because two statistical hypotheses are operating. Assuming that the parameters ("parameters") of the distribution (of the PDF) are obtained using the maximum likelihood estimation method (MLE, Equation (3); see [15]), there is some suggestion that the uncertainty accompanying this estimation is transmitted to the process of detecting the outliers.
$$\prod_{i=1}^{n} \mathrm{PDF}(x_i; \text{parameters}) \to \max \;\Rightarrow\; -\sum_{i=1}^{n} \ln \mathrm{PDF}(x_i; \text{parameters}) \to \min \qquad (3)$$
It should be noted that Equation (3) is a simplified version of the MLE method, since its real use requires and involves partial derivatives with respect to the parameters; see the source code (MathCad language) for the MLE estimations in the Supplementary Materials available online.
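As an illustration of the MLE principle in Equation (3), the sketch below treats the Normal distribution, for which the minimizer of the negative log-likelihood has a well-known closed form. It is not the MathCad code from the Supplementary Materials; the function names are hypothetical and the data are the ten-value sample used later in the text.

```python
from math import log, pi, sqrt
from statistics import fmean

def normal_neg_log_likelihood(sample, mu, sigma):
    """Negative log-likelihood of Normal(mu, sigma) for the sample."""
    n = len(sample)
    squares = sum((x - mu) ** 2 for x in sample)
    return n * log(sigma) + 0.5 * n * log(2 * pi) + squares / (2 * sigma ** 2)

def normal_mle(sample):
    """Closed-form MLE for the Normal distribution (divides by n, not n - 1)."""
    mu = fmean(sample)
    sigma = sqrt(fmean([(x - mu) ** 2 for x in sample]))
    return mu, sigma

data = [568, 570, 570, 570, 572, 572, 572, 578, 584, 596]
mu_hat, sigma_hat = normal_mle(data)
print(mu_hat, sigma_hat, normal_neg_log_likelihood(data, mu_hat, sigma_hat))
```

For distributions without a closed-form solution, the same objective would have to be minimized numerically, which is where the partial derivatives mentioned in the text come in.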
Either way (whether the uncertainty accompanying this estimation is transmitted to the process of detecting the outliers or not), once an estimate for the parameters of the distribution is available, a test (most desirably, a test based on a statistic) for detecting the presence of an outlier must provide the probability of observing that (assumed) "outlier" as a randomly drawn value from the distribution. What to do next with the probability is another statistical "trick": to observe a value with a probability less than an imposed "level" (usually 5%) is defined as an unlikely event, and therefore, the suspicion regarding the presence of the outlier is justified. With regard to the statistical "trick" mentioned above, the opinion of the author of this manuscript is that one "observation" is not enough. Actually, there should be a series of observations that come from a series of statistics, each providing a probability. Then, the unlikeliness of the event can be safely ascertained by using Fisher's "combining probability from independent tests" method (FCS, Equation (4); see [16-18]), where p_1, ..., p_τ are probabilities from τ independent tests, CDF_χ² is the χ² cumulative distribution function (see also up until Equation (6) below), and p_FCS is the combined probability from independent tests.
Taking the general case, for (x_1, ..., x_n) as n independent draws (or observations) from an (assumed known) continuous distribution defined by its probability density function, PDF(x; (π_j)_{1≤j≤m}), where (π_j)_{1≤j≤m} are the (assumed unknown) m statistical parameters of the distribution, by way of integration over an (assumed known) domain (D) of the distribution, we may have access to the associated cumulative distribution function (CDF), CDF(x; (π_j)_{1≤j≤m}; PDF), simply expressed as (Equation (5)):
$$\mathrm{CDF}(x; (\pi_j)_{1 \le j \le m}; \mathrm{PDF}) = \int_{\inf(D)}^{x} \mathrm{PDF}(t; (\pi_j)_{1 \le j \le m})\, dt \qquad (5)$$
where inf(D) was used instead of min(D) to include unbounded domains (e.g., when inf(D) = −∞; "inf" stands for infimum, "min" stands for minimum). Please note that having the PDF and CDF does not necessarily imply that we have an explicit formula (or expression) for either of them. However, with access to numerical integration methods [19], it is enough to have the possibility of evaluating them at any point (x). Unlike PDF(x; (π_j)_{1≤j≤m}), CDF(x; (π_j)_{1≤j≤m}) is a bijective function and therefore it is always invertible (even if we do not have an explicit formula; let "InvCDF" be its inverse, Equation (6)):
$$\text{if } p = \mathrm{CDF}(x; (\pi_j)_{1 \le j \le m}), \text{ then } x = \mathrm{InvCDF}(p; (\pi_j)_{1 \le j \le m}), \text{ and vice-versa} \qquad (6)$$
CDF(x; (π_j)_{1≤j≤m}; "PDF") is a strong tool that greatly simplifies the problem at hand: the problems of analyzing any distribution function (PDF) are translated such that only one needs to be analyzed (the continuous uniform distribution). That is, a series of observed data (x_i)_{1≤i≤n} is expressed through their associated probabilities p_i = CDF(x_i; (π_j)_{1≤j≤m}) (for 1 ≤ i ≤ n), and the analysis can be conducted on the (p_i)_{1≤i≤n} series instead.
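The translation from observations to probabilities, and back through the inverse CDF, can be sketched with Python's standard library. The Normal distribution and the three observations below are assumptions chosen purely for illustration; any continuous distribution with an evaluable CDF and inverse CDF would serve.

```python
from statistics import NormalDist

# Illustrative assumption: a standard Normal population and three observations.
dist = NormalDist(mu=0.0, sigma=1.0)
observations = [-1.2, 0.3, 2.1]

# p_i = CDF(x_i): each observation is mapped into the (0, 1) probability space.
probabilities = [dist.cdf(x) for x in observations]

# x_i = InvCDF(p_i): the mapping is invertible, so nothing is lost.
recovered = [dist.inv_cdf(p) for p in probabilities]

for x, p, r in zip(observations, probabilities, recovered):
    print(f"x={x:+.3f}  p={p:.6f}  InvCDF(p)={r:+.3f}")
```

If the observations really come from the assumed distribution, the resulting probabilities behave like draws from the continuous uniform distribution on (0, 1), which is why only that one distribution needs to be analyzed.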
Since the analysis of the (p_i)_{1≤i≤n} series of probabilities is a native case of order statistics, the discussion now turns to order statistics. The first studies in this area were by the fathers of modern statistics, Karl Pearson [20] and Ronald A. Fisher [3], while the first order statistic applicable to any distribution (not only the Normal distribution) was studied by Cramér and Von Mises (see [21,22]).
An order statistic operating on probabilities ((p_i)_{1≤i≤n}) will sort the values (let (q_i)_{1≤i≤n} be the series of sorted (p_i)_{1≤i≤n} values, Equation (7), where SORT is assumed to be a procedure that sorts the values in ascending order) and will assess their departure from the continuous uniform distribution.
$$(q_i)_{1 \le i \le n} = \mathrm{SORT}\big((p_i)_{1 \le i \le n}\big) \qquad (7)$$
For instance, the Kolmogorov-Smirnov (KS) method (see Equation (8); the Kolmogorov-Smirnov statistic) calculates the KS statistic and later tests the value (from a sample) against the threshold of a chosen significance level (usually 5%).
In order to have certain thresholds for a series of significance levels, these statistics can be derived from Monte-Carlo ("MC") simulations [30], and deployed for a large number of samples in order to reflect, as best as possible, the state of the population.
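The idea of deriving thresholds by Monte-Carlo simulation can be sketched as follows for the Kolmogorov-Smirnov statistic. This is an illustration under stated assumptions: the textbook (unscaled) form of the KS statistic is used, which may differ from the exact form of Equation (8), and the function names, the seed, and the number of simulated samples are hypothetical choices.

```python
import random

def ks_statistic(probabilities):
    """Textbook KS distance between sorted probabilities and U(0, 1)."""
    q = sorted(probabilities)
    n = len(q)
    d_plus = max((i + 1) / n - q[i] for i in range(n))
    d_minus = max(q[i] - i / n for i in range(n))
    return max(d_plus, d_minus)

def mc_critical_value(n, alpha=0.05, samples=20000, seed=1):
    """Estimate the (1 - alpha) quantile of the KS statistic by simulation."""
    rng = random.Random(seed)
    values = sorted(
        ks_statistic([rng.random() for _ in range(n)]) for _ in range(samples)
    )
    return values[int((1 - alpha) * samples)]

threshold = mc_critical_value(n=10)
print(f"MC estimate of the 5% KS threshold for n = 10: {threshold:.3f}")
```

A sample whose statistic exceeds the simulated threshold would be judged in disagreement with the assumed distribution at the chosen significance level.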

Proposed Outlier Detection Statistic
A statistic was developed to be applicable to any distribution. For a series of probabilities ((p_i)_{1≤i≤n}, or sorted probabilities (q_i)_{1≤i≤n}) associated with a series of (repeatedly drawn) observations ((x_i)_{1≤i≤n}), the differences (r_i)_{1≤i≤n} are calculated as in Equation (9):
$$r_i = |p_i - 0.5|, \qquad 1 \le i \le n \qquad (9)$$
The statistic called "g1" (see below) is built on the differences given in Equation (9), as their maximum (Equation (10)):
$$g1 = \max_{1 \le i \le n} r_i \qquad (10)$$
It should be noted that Equations (9) and (10) provide the same result regardless of whether the calculation is made on the sorted series of probabilities ((q_i)_{1≤i≤n}) or on the unsorted one ((p_i)_{1≤i≤n}).
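A minimal sketch of the "g1" calculation follows, assuming that each difference is the absolute departure of a probability from the 0.5 reference point (consistent with the median-based description given in the text). The function name and the example probabilities are hypothetical.

```python
def g1_statistic(probabilities):
    """Largest absolute departure of the probabilities from 0.5."""
    return max(abs(p - 0.5) for p in probabilities)

# Hypothetical probabilities p_i = CDF(x_i) for a five-value sample.
p = [0.62, 0.10, 0.50, 0.93, 0.41]
print(g1_statistic(p))           # computed on the unsorted series
print(g1_statistic(sorted(p)))   # same value on the sorted series
```

Since a maximum does not depend on the order of its arguments, sorting is not needed for "g1", unlike for the order statistics discussed earlier.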
Regarding the name of this new proposed statistic ("g1"): when Equations (1) and (2) (G_min, G_max, G_all) are compared with Equation (9) for a standard Normal distribution N(x; µ = 0, σ = 1), the equation defining G_all becomes much more like Equation (9). The difference is that in Equation (2) the sample mean (x̄) is used as an estimate for the mean of the population (µ) and the sample standard deviation (s) is used as an estimate for the standard deviation of the population (σ), while Equation (9) basically expresses the same in terms of associated probabilities (p_i = P(X ≤ x_i) = CDF_Normal(x_i; µ, σ), 0.5 = P(X ≤ µ) = CDF_Normal(µ; µ, σ)).
Therefore, the proposed statistic very much resembles the Grubbs test for Normal data (and hence its name). One difference is that in the Grubbs test, sample statistics (x̄ and s) are used to calculate the sample G_all value, thereby reducing the degrees of freedom associated with the value (from n to n − 2), while for the g1 value (and statistic) the degrees of freedom remain unchanged (n). The major difference is actually the one that makes the proposed statistic generalizable to any distribution: the mean used in the Grubbs test is replaced by the median. The beauty of this change is that for symmetrical distributions (including the Normal distribution) these two coincide.
A further connection with other statistics must also be noted. If any sample is resampled by extracting only the smallest and the largest of its values, then the Kolmogorov-Smirnov statistic for those subsamples almost perfectly resembles (by setting n = 2 in Equations (8)-(1)) the proposed "g1" statistic.
Since the CDF is a bijective function (see Equation (6)), the proposed generalization of the Grubbs test for detecting the outliers of a Normal distribution into the "g1" statistic for detecting the outliers of any distribution is a natural extension of it. The "g1" test associated with the "g1" statistic operates in the probability space ((p_i)_{1≤i≤n} or (q_i)_{1≤i≤n}) instead of the observed space ((x_i)_{1≤i≤n}), the calculation formula (Equations (9) and (10)) is slightly different (from those given in Equations (1) and (2)), and the probability associated with the departure is no longer extracted from the Student t distribution (as in Equations (1) and (2)). The change from mean (µ for G_all) to median (0.5 in Equation (9)) is a safe extension for any distribution type, since Equation (9) measures (or accounts for) the extreme departures from the equiprobable point: having an observation y (y ← X) with y ≤ InvCDF_"Any distribution"(0.5; "parameters") and having an observation z (z ← X) with z ≥ InvCDF_"Any distribution"(0.5; "parameters") are equiprobable.
One way to associate a probability with the "g1" statistic is to do a Monte-Carlo (MC) simulation.

Simulation Study
A MC study was conducted. Two different strategies were developed in order to deal efficiently with a very large amount of data, and specifically, to solve the order statistics problem (that is, first sampling from the uniform distribution, and later using Equations (7)-(10)). One of those alternatives has been described in [14] and the other is described below. Table 2 shows the details of the conducted MC study. For each sample size n, m samples (see Table 2) were generated in each run from the standard continuous uniform distribution (i.e., from the [0, 1] interval). The outlier detection statistic "g1" was calculated (Equations (9) and (10)). From a large pool of sampled and resampled data (m·resa·repe = 7·10^9 in Table 2), repetitions were joined as (n, p, g1) triples from the p·n control points, that is, where the probability was from 0.001 to 0.999 with a step of 0.001 for each n (from 2 to 12). The external repetitions (resa = 7 in Table 2) were joined together by taking the median (since the median is a sufficient statistic [31] for any order statistic, such as in the extraction of (n, p, g1) triples from the p·n control points). The MC simulation was conducted with the configuration set as defined in Table 2. The obtained data were recorded in separate files by sample size and analyzed as such.
The objective associated with any statistic is to obtain its cumulative distribution function (CDF, Equation (5)), and thus, by evaluating the CDF for the value of the statistic obtained from the sample (Equations (9) and (10)), to obtain a probability for the sampling. Please note that only in the lucky cases are we able to do this; generally, only the critical values (values corresponding to certain risks of being in error) or approximation formulas are available (see, for instance, [21,24,26,28,29]). Here, the analytical CDF formula was obtained for the "g1" outlier detection statistic.
Symmetry 2019, 11, 835

The Analytical Formula of CDF for g1
The "g1" statistic has a very simple calculation formula (see Equation (9)) and, as expected, its CDF formula is also very simple (see Equation (11)). Thus, for a calculated sample statistic g1 (x ← g1 in Equation (11)), the significance level (α ← 1 − p) is immediate (Equation (11), where P represents the probability that the random variable X takes on a value less than or equal to x).
$$p = P(X \le x) = \mathrm{CDF}_{g1}(x; n) = (2x)^n, \qquad 0 \le x \le 0.5 \qquad (11)$$
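A short derivation sketch shows why a CDF of the form (2x)^n is to be expected. It assumes, as in the MC study, that the probabilities p_i are independent draws from U(0,1) and that the statistic is the largest absolute departure from 0.5; the symbol G_1 for the random variable is introduced here for illustration only.

```latex
\begin{aligned}
P(G_1 \le x) &= P\Big(\max_{1 \le i \le n} |p_i - 0.5| \le x\Big) \\
             &= \prod_{i=1}^{n} P(0.5 - x \le p_i \le 0.5 + x) \\
             &= (2x)^n, \qquad 0 \le x \le 0.5 .
\end{aligned}
```

The second line uses independence, and the third uses the fact that an interval of length 2x inside [0, 1] has probability 2x under the uniform distribution.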

Simulation Results for the Distribution of the "g1" Statistic
The results of the simulation for n varying from 2 to 10 were sufficient to provide a clear indication of the analytical formula for the CDF of "g1". Descriptive statistics, including the standard error (SE; the standard error formula is given as Equation (12)) between the expected probability (from the MC simulation) and the calculated probability (from Equation (11), p_i ← (2·x_i)^n), as well as the highest positive and highest negative departures, are given in Table 3.
Table 3. Descriptive statistics for the agreement in the calculation of the "g1" statistic (Equation (10) vs. Equation (11)).
As can be observed in Table 3, the standard error (SE) slowly decreases beginning with n = 7, being two orders of magnitude smaller (actually, about 200 times smaller) than the step from the MC experiment. Since the standard error alone is not proof that Equation (11) is the true CDF formula for providing the probability for the g1 statistic, the smallest and the highest differences between the observed and the expected probabilities are also given in Table 3. They substantiate that Equation (11) is indeed the right estimate for the CDF of g1. For convenience, Figure 1 shows the value of the error at each observation point (999 points corresponding to p = 0.001 up to p = 0.999 for each n from 2 to 12). The estimation error (of the "g1" statistic) depicted in Figure 1 is rarely bigger than 10^−5, never bigger than 1.5 × 10^−5, and tends to become smaller with the increase in sample size (n). Using Equation (11), Figure 2 depicts the shape of CDF_"g1"(x; n).
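The agreement between the simulated and the analytical CDF can be checked on a much smaller scale than the study reported in Table 2. The sketch below is not the author's simulation: the sample count, the seed, and the use of the largest absolute difference as the agreement measure are assumptions made for illustration.

```python
import random

def g1_statistic(probabilities):
    """Largest absolute departure of the probabilities from 0.5."""
    return max(abs(p - 0.5) for p in probabilities)

def largest_cdf_gap(n, samples=50000, seed=7):
    """Largest gap between the empirical CDF of g1 and (2x)^n."""
    rng = random.Random(seed)
    values = sorted(
        g1_statistic([rng.random() for _ in range(n)]) for _ in range(samples)
    )
    gap = 0.0
    for rank, x in enumerate(values, start=1):
        gap = max(gap, abs(rank / samples - (2.0 * x) ** n))
    return gap

for n in (2, 5, 10):
    print(n, largest_cdf_gap(n))
```

With only 50,000 samples the gap is dominated by sampling noise (on the order of 1/sqrt(samples)), so it is far larger than the errors reported for the full study; the point is only that it shrinks toward zero as the number of samples grows.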
With regard to the "g1" statistic (depicted in Figure 2), a variable distributed according to Equation (11) takes values between 0 and 0.5. Differentiating Equation (11) gives the density 2n·(2x)^(n−1), which is increasing on this domain for n > 1, so the mode lies at x = 0.5; the median is 2^(−1)·2^(−1/n) (the solution of (2x)^n = 0.5) and the mean is n/(2(n+1)), so the distribution is left-skewed for n > 1.
The expression of CDF_"g1" is easily inverted (see Equation (13)).
$$\mathrm{InvCDF}_{g1}(p; n) = \frac{p^{1/n}}{2} \qquad (13)$$

From "g1" Statistic to "g1" Confidence Intervals for the Extreme Values
Equation (13) can be used to calculate the critical values of the "g1" statistic for any values of α (α ← 1 − p) and n. The critical values of the "g1" statistic act as the boundaries of the confidence intervals (Equation (14)).
$$p_{\min} = 0.5 - \mathrm{InvCDF}_{g1}(1 - \alpha; n), \qquad p_{\max} = 0.5 + \mathrm{InvCDF}_{g1}(1 - \alpha; n) \qquad (14)$$
One should note that the confidence interval defined by Equation (14) is symmetric.
In order to arrive at the confidence intervals for the extreme values in the sampled data (Equation (15)), it is necessary to use the inverse of the CDF again, this time for the distribution of the sampled data.
$$x_{\text{extreme}}(\alpha) = \mathrm{InvCDF}_{\text{Distribution}}\big(0.5 \pm 0.5 \cdot (1 - \alpha)^{1/n}; \text{parameters}\big) \qquad (15)$$
To illustrate the calculation of the confidence intervals for the extreme values in the sampled data, a series of 206 data was chosen from [32]. The data were tested against the assumption that they follow a generalized Gauss-Laplace distribution (Equation (16), a symmetrical distribution), and then checked for observations suspected to be outliers. The steps of this analysis and the obtained results are given in Table 4.
The greatest departure from the median (0.5) for the 206 PCB dataset (Table 4) was that of 9.603 (CDF_"GL"(9.603; µ = 6.47938, σ = 0.82828, k = 1.79106) = 0.9998). Due to the size of this deviation from the median, 9.603 was suspected of being an outlier and was removed (it should be noted that, in a broader context, an outlier can also be seen as an atypical observation, correctly collected from the population as part of the data generation process, and thus it may be maintained in the sample, but probably with less weight). The same procedure (as in Table 4) can be applied to the remaining data (205 observations). Then, InvCDF_"g1"(1 − 0.05; 205) = 0.499875, p_min(n = 205) = 0.0001251, and p_max(n = 205) = 0.9998749. The MLE estimates for the parameters of the Gauss-Laplace distribution remain unchanged (µ = 6.47938, σ = 0.82828, k = 1.79106) and the removed observation (9.603) is still not an outlier (x_max = InvCDF_"GL"(0.9998749; µ = 6.47938, σ = 0.82828, k = 1.79106) = 9.7166 > 9.603).
Step; Results
Calculate the critical probabilities for the extreme values by using Equations (9) and (10)
Since the smallest value in the dataset is 4.151 (> 3.24) and the largest value is 9.603 (< 9.71), at 5% risk of being in error there are no outliers in the dataset, on the assumption that the data follow the Gauss-Laplace distribution.
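The chain from the critical value of "g1" to a confidence interval for the extreme values can be sketched as follows. The first part uses n = 205 and α = 0.05, as in the text above. The second part substitutes a standard Normal distribution for the Gauss-Laplace one (an assumption made only because Python's standard library provides its inverse CDF); the function names are hypothetical.

```python
from statistics import NormalDist

def inv_cdf_g1(p, n):
    """Inverse of CDF_g1(x; n) = (2x)^n."""
    return 0.5 * p ** (1.0 / n)

def extreme_value_interval(dist, n, alpha=0.05):
    """Bounds for the sample extremes, mapped back through the inverse CDF."""
    half_width = inv_cdf_g1(1.0 - alpha, n)
    p_min, p_max = 0.5 - half_width, 0.5 + half_width
    return dist.inv_cdf(p_min), dist.inv_cdf(p_max)

# Critical value and probability bounds for n = 205 at 5% risk.
crit = inv_cdf_g1(0.95, 205)
print(round(crit, 6), round(0.5 - crit, 7), round(0.5 + crit, 7))

# Illustration only: bounds in the data space for a standard Normal, n = 10.
low, high = extreme_value_interval(NormalDist(0.0, 1.0), n=10)
print(low, high)
```

The interval is symmetric in the probability space by construction; in the data space it is symmetric only when the distribution itself is symmetric, as the standard Normal used here is.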

Proposed Procedure for Detecting the Outliers
The procedure for detecting the outliers should start with measuring the agreement between the observed and the estimated distributions (Figure 3). Figure 3 contains a statistical "trick": when there are no outliers, the statistics measuring the gap between the observation and the model (order statistics, Equation (6)) are in agreement (their associated probabilities are not too far from each other). When outliers exist, the order statistics are also sensitive to their presence. Since this is a separate subject, for further discussion please see the series of papers beginning with [32-34].

Second Simulation Assessing "Grubbs" and "g1" Outlier Detection Alternatives
Another MC study was designed to test the claim that the proposed method provides consistent results.This second MC simulation is much simpler than the one used to derive the data for constructing the outlier statistics (Figure 4).

The data used here as a proof of the facts are from [7], and all cases involve a Normal distribution (Distribution = Normal in Equation (15); PDF and CDF for the Normal distribution in Equation (18); a symmetrical distribution) with α = 5% risk of being in error. The parameters of the Normal distribution (µ and σ) are determined for each case, as well as the sample size (Equation (17)).
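$$x_{\text{extreme}}(\alpha) = \mathrm{InvCDF}_{\text{Normal}}\big(0.5 \pm 0.5 \cdot (1 - \alpha)^{1/n}; \mu, \sigma\big) \qquad (17)$$
$$\mathrm{PDF}_{\text{Normal}}(x; \mu, \sigma) = \frac{e^{-(x - \mu)^2 / (2\sigma^2)}}{\sigma \sqrt{2\pi}}, \qquad \mathrm{CDF}_{\text{Normal}}(x; \mu, \sigma) = \int_{-\infty}^{x} \mathrm{PDF}_{\text{Normal}}(t; \mu, \sigma)\, dt \qquad (18)$$
For comparison, the same strategy for calculating the confidence intervals of the extreme values for the Normal distribution was used with the Grubbs test statistic (Equation (2)) to provide an alternate result (Equation (19)).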
The steps followed in this analysis are given in Table 5.
Table 5. Comparison of the steps of the analysis and simulation for extreme values confidence intervals (proposed method vs. Grubbs test).
Step | Action (step 0 is setting the dataset; α ← 0.05)
1 | Estimate (with MLE, Equation (3)) the parameters (µ, σ) of the Normal distribution; calculate the associated CDFs (Equation (18))
2 | Calculate the order statistics, their associated risks of being in error, FCS and p_FCS (Equations (6) and (4))
3 | For n and α, calculate the confidence intervals for the extreme values by using (a) Equations (6) and (17) and (b) Equation (19)
4 | Run the MC experiment (Figure 4) for K = 10000 samples (so that the expected number of outliers is 500) and count the samples containing outliers for the existing method (Grubbs, Equation (19); with µ and σ from the CM method) and for the proposed method (g1, Equations (13)-(15) and (17); with µ and σ from the MLE method)
Results of the analysis using the steps given in Table 5 for the first dataset are given in Table 6. In regard to the results given in Table 6: at step 1, CPs are the cumulative probabilities ({p_1, ..., p_10} in Figure 3) associated with the series of observations from the sample ({x_1, ..., x_10} in Figure 3).
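Step 2 of Table 5 combines the probabilities of several statistics into one. The sketch below uses the standard textbook form of Fisher's method (the statistic −2·Σ ln p_i compared with a χ² distribution with 2τ degrees of freedom); this is an assumption, since the exact parameterization of Equation (4) may differ. The function name is hypothetical, and the closed-form χ² survival function used here is valid because the number of degrees of freedom is even.

```python
from math import exp, factorial, log

def fisher_combined_probability(p_values):
    """Textbook Fisher method: combined p-value of independent tests."""
    tau = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # half of -2 * sum(ln p_i)
    # Survival function of chi-square with 2 * tau degrees of freedom.
    return exp(-half_stat) * sum(half_stat ** k / factorial(k) for k in range(tau))

print(fisher_combined_probability([0.5]))         # a single test is unchanged
print(fisher_combined_probability([0.05, 0.05]))  # two borderline tests
print(fisher_combined_probability([0.3, 0.6, 0.8]))
```

Several individually borderline probabilities combine into a smaller one, which is the sense in which a series of statistics gives a safer verdict than a single "observation".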
At step 4 (see Figure 4), since {510, 526} are comparable with 500 and {1977, 2009} are much greater than 500, the results lead to the conclusion that the existing method produces type I errors by leading to false positive detection of outliers in the samples while the proposed method does not.
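The counting experiment of step 4 can be sketched for the "g1" test in a simplified setting. Unlike the study described here, this sketch assumes that the parameters of the Normal distribution are known a priori (µ = 0, σ = 1) rather than estimated from each sample, so it illustrates only the calibration of the test; the function name, the seed, and the sample size are hypothetical.

```python
import random
from statistics import NormalDist

def count_flagged_samples(n=10, k=10000, alpha=0.05, seed=11):
    """Count samples whose g1 value exceeds the critical value."""
    rng = random.Random(seed)
    dist = NormalDist(0.0, 1.0)  # assumption: parameters known a priori
    critical = 0.5 * (1.0 - alpha) ** (1.0 / n)
    flagged = 0
    for _ in range(k):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        g1 = max(abs(dist.cdf(x) - 0.5) for x in sample)
        if g1 > critical:
            flagged += 1
    return flagged

flagged = count_flagged_samples()
print(f"{flagged} of 10000 samples flagged (about 500 expected at 5% risk)")
```

A count close to K·α indicates that the test rejects at its nominal rate when no outliers are present, which is the criterion applied to both methods in the comparison above.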

Going Further with the Outlier Analysis
What if "596" is removed from the sample? The following table provides mirror-like results for this scenario (Table 7). As can be observed in Table 7, the data is not in good agreement with normality (α_FCS in Table 6 is 7%, while in Table 7 it is 16%) and there is no change in the accuracy of the classification ({563, 543} are comparable with 500, while {2341, 2333} are much greater than 500; the existing method produces type I errors by leading to false positive detection of outliers in the samples, while the proposed method does not). When comparing the results given in Table 6 with the results given in Table 7, it should be noted that both tests (Grubbs and the newly proposed g1) produce somewhat confusing results (see Table 8 for side-by-side outcomes). Table 8 highlights the fact that, based on the {568, 570, 570, 570, 572, 572, 572, 578, 584} sample, the g1 test may be interpreted as identifying 596 as being an outlier. This is not quite true because the g1 test was not intended to be used in this way. That is, 596 is outside of the dataset, so at the time of constructing the confidence intervals for the extreme values, the information regarding its observation was missing.
Another trial was done, this time with 601 replacing 596 in the initial dataset (Table 9). In a further trial, 604 replaced 596 in the initial dataset (Table 10). The conclusion is simple (see the results in Tables 6, 7, 9 and 10): a test will hardly ever detect an outlier in a small sample; it is more likely to reject the hypothesis that the sample was drawn from the distribution itself!
The same trick was used on a bigger sample and the results are shown in Table 11 (the dataset is from Table 4). On the one hand, as the results in Table 11 show, the proposed method correctly identifies the confidence interval for the extreme values, while the existing method does not.
On the other hand, the results in Table 11 also show that the likelihood of identifying the outliers increases with the sample size, making it perfectly possible to identify outliers with the proposed method, although this is not the case in small samples. It is possible to detect the outliers in small samples as well, but not when the parameters of the distribution are derived from the sample data; this works only when the parameters of the distribution are known a priori or determined from other samples (the results given in Tables 6-10 support this).
Independently of the shape of the theoretical distribution being tested (the generic case is defined by Equation (5)), the newly proposed statistic "g1", as defined by Equations (9) and (10), defines a symmetric confidence interval for the extreme values in samples in the probability space (Equation (14)). Later, this symmetric confidence interval may be changed back into an asymmetrical one when it is expressed in the domain of the theoretical distribution being tested (Equation (15)). It should be recognized that "g1" uses a symmetrization strategy to obtain the confidence interval for the extreme values in samples.
It might seem that the literature on robust statistics was ignored in this work; however, this is not entirely true. In fact, a whole pool of robust statistics was used extensively in the study (see Equation (8)), introduced as a tool in Table 5 and involved in the later calculations (Tables 6, 7 and 9-11). Also, it should be noted that the substitution of the mean by the median is not a new idea; it is well known in the field of robust statistics (for example, Watson U² [29], the WU statistic in Equation (8), uses it).
A short literature survey provides several examples of current real applications that require the proposed method. Thus, in signal processing, non-stationary, non-Gaussian, spiky signals are usually regarded as outliers and thus discarded (see [35-38] as typical cases). In this context, it should be noted that Mood's median test is preferred to the Kruskal-Wallis test when outliers are present [39]. The identification of outliers is also recognized as an issue in the validation of protein structures, and the current methods are reviewed in [40]. Other examples can be found in [41].
In the wider context, an alternate window-based strategy has been proposed in which outliers are detected in each window by the Tukey method and labeled so that they can be excluded from the realization of the process points to be used for model identification [42]. A contingency-based strategy proposes maximization of true positive (TP) values and minimization of false negative (FN) and false positive (FP) values [43]. Finally, another distribution testing procedure has been proposed in [44].

Figure 3. The procedure for detecting outliers.


Figure 4. The procedure for testing the outlier statistics.


Table 2. Details of the MC simulation on the "g1" outlier detection statistic.

Table 8. Side-by-side comparison of the analysis of the samples.

Table 11. Outlier analysis results for the Table 4 dataset under the assumption of a Normal distribution.