Comparing the Effectiveness of Robust Statistical Estimators of Proﬁciency Testing Schemes in Outlier Detection

: This study investigates the effectiveness of robust estimators of location and dispersion, used in proﬁciency testing and listed in ISO 13528:2015, in outlier detection. The models utilize (a) kernel density plots, (b) Z-factors, (c) Monte Carlo simulations, and (d) distributions derived from at most two contaminating distributions and one main Gaussian. The simulation parameters cover a wide range of those commonly encountered in proﬁciency testing (PT) schemes, so the results presented are of fairly general application. We chose a functional sub-optimal solution by grouping and classifying the model settings, resulting in ﬁve matrices readily usable for selecting the best robust estimator. Whenever at most half of the distribution of each contaminating population is outside the central distribution, there is only one optimal estimator. For all other cases, the ﬁve matrices provide the appropriate robust statistic. The proposed method applies to 95.1% of 144 results for an existing PT for cement. These actual datasets indicate that the Hampel estimator for the mean and the Q-method for the standard deviation provide the most appropriate performance statistic in 86.1% of the cases.


Introduction
Interlaboratory comparisons and schemes, where at least two laboratories measure one or more characteristics of the same or similar items, are widespread worldwide. Evaluating participant performance against pre-established criteria through interlaboratory comparisons is called proficiency testing (PT) [1]. Among other criteria, Z-score usually expresses the performance of each participant. Its value depends on both the assigned value and standard deviation for proficiency assessment [1]. Results appearing inconsistent with the remainder dataset, called outliers, can impact the values of these two summary statistics [1]. Using robust statistical methods minimizes this influence, defining as such the insensitive ones to small departures from underlying assumptions surrounding a probabilistic model [1]. Several authors contributed significantly with their research and books in developing robust statistics (Hampel et al. [2], Huber et al. [3], Wilcox [4], and Maronna et al. [5]). Hund et al. [6] in a detailed older review of interlaboratory studies in analytical chemistry, pointed out the use of robust statistics in the outliers' treatment. Daszykowski et al. [7] performed an excellent review of some basic concepts of robust techniques and their usefulness in chemometric data analysis. Recently, Shevlyakov [8] and Ghosh et al. [9] further analyzed nonparametric statistics like the M-estimators, the influence function, and the influence curve. Outlier detection stands at the core of robust statistics [10] and particular rules for outlier detection were proposed, e.g., for the estimation of multivariate mean and dispersion (i.e., covariance matrix) [11,12]. In the considered task to estimate the mean and dispersion, it is also possible to use a variety of available robust regression estimators including the least trimmed squares [13] or least weighted squares [14], but such approaches are not available in the relevant ISO norms. ISO 13528:2015 [15] in Annex C analyzes a series of robust estimators for the population mean and standard deviation, to be used in PT schemes as an alternative to the classical method provided by ISO 5725-2:1994 [16]. The estimators of the population mean are the median value, MED; the average according to algorithm A with iterated scale, Ax*; the Hampel estimator for mean, Hx*. The corresponding estimates of the standard deviation are the scaled median absolute deviation, MADe; the normalized interquartile range, nIQR; the estimator according to algorithm A with iterated scale, As*; the estimator according to the Q method, Hs*. We note that method A was first described in the much older ISO 5725-5:1998 [17] and carried over to ISO 13528:2015. The population mean or assigned value, x PT , and standard deviation σ PT constitute the summary statistics of the population's results. The respecting performance statistic for the result X i of each participating lab i is the Z-score provided by Equation (1) according to 17043:2010 [1].
Despite the detailed description of both performance and summary statistics in the mentioned standards and the widespread implementation of PT schemes, the number of studies comparing these estimators for their optimality is limited. Some researchers report comparative studies between the methods described in the statistical standards, focusing on the performance criterion and the detection of outliers by referring to the results of individual PT schemes [18][19][20][21]. For these particular cases and regardless of which robust method they find optimal, most reject the classical one, according to ISO 5725-2:1994. De Oliveira et al. [22] compared different statistical approaches, classical and robust, to evaluate participants' performance in a PT program for lead in blood determination, and they concluded that a modification of Method A was the best one. Kojima et al. [23] attempted a generalization in finding the best method using a Monte Carlo approach by adding a contaminating Gauss distribution to the main one and restricting their study to the next set of parameters. (a) The number of participants is 200; (b) the standard deviation of the main population is 5% of the mean value; (c) the secondary population is 20% of the total. They concluded that the robustness to outliers of the MED-nIQR is more significant than the other methods, noticing simultaneously that the MED-nIQR does not always give the best conclusion in the actual PT due to its reduced degrees of freedom.
The only common conclusion of the mentioned research is that the classic method, described in 5725-2:1994, is insufficient to detect a relatively accurate number of outliers. Additionally, there is no unique optimal robust method for all the considered PT schemes in [18][19][20][21][22][23]. One of the main conclusions of Oliveira et al. [22] is that a PT provider should conduct studies using different statistical approaches to get the best standard deviation estimate since there is no consensus on which method is more suitable for the experimental data. We ended up in the same judgment by comparing the robustness of PT statistical estimators for a limited number of participants [24] by combining: (a) kernel density plots [15]; (b) Z-factors; (c) Monte Carlo simulations using distributions derived from the addition of one or two contaminating distributions and one main Gaussian, and fitting to the kernel plots. The number of participants in this study ranged from 10 to 30. We remarked that the selection of the best robust estimator is a function of (a) the fraction of the contaminating populations; (b) the distances of the mean values of the distributions. Therefore, the best estimator depends on the shape of the results' distribution.
In this study, the same approach as in [24] is employed to assess the effectiveness of the mentioned robust methods by selecting 10 to 100 participants and analyzing the distribution of unsatisfactory results with |Z| > 3. To the author's knowledge, while there is extensive literature on the effectiveness of robust estimators in linear regression [25][26][27][28][29][30], the same is not valid for the estimators proposed in ISO 13528. Actual results of cement's chemical, physical, and mechanical properties are also processed, belonging to a PT scheme of cement organized by Eurocert S.A. The organizer performs the cement scheme nine rounds a year with 11-14 participants per round, including all the tests defined by the standard EN 197-1 [31]. Section 1 of [24] describes the PT schemes provided by this company. The interlaboratory comparison is a source of information for the product evaluation process [32].
The structure of the paper is as follows. Section 2 presents kernel density plots as a possibility to estimate the results' distribution and applies the approximation of an actual kernel density plot with the sum of, at most, three normal distributions of adjustable parameters [24] to the long-term data of a PT scheme. As a result of the application of this approach to 144 cases and six tests, the computed parameter distributions of the three Gaussians are more general, i.e., they should apply to a wide range of PTs. Section 3 describes the model of measuring the effectiveness of robust estimators in outlier detection. The algorithm applies Monte Carlo simulations and, considering all the unsatisfactory results, computes the distance of their distribution from a reference value. Based on the assumption that the samples are homogeneous and the PT scheme permits only one analysis method per test, Section 4 implements the approach introduced in Section 3 for various model parameters. The simulations result in a series of tables of the most suitable estimators. These are of particular interest to PT experts when they have to choose the most accurate and robust method to process the test results. The author developed all software in C#.

Kernel Density Plots
The kernel density plots (KDP) provide the possibility of representing the result's distribution as a continuous curve and identifying bimodalities or lack of symmetry (ISO 13528:2015 [15]). A series of researchers indicated the importance of kernel estimation in robust statistics [33,34]. Adding one main normal distribution with one or two contaminating ones provides an effective simulation of these curves, as proved in [24]. Table 1 shows the parameters of the three distributions. The software built the kernel density plots by processing the results of six tests performed according to EN 197-1 by the PT scheme for a time ranging from 2012 to 2020, with the aim of the study gaining significant generality. Table 2 indicates the selected tests and the applied methods of testing.  Building the KDP with the results of each test makes it difficult to reveal the actual distribution for participants ranging from 11 to 14 per round. For this reason, in [24], we used all the results of nine consecutive yearly tests for each KDP, as participants are the same during a year. The algorithm achieves normalization by using the difference of the assigned value from the mean value of each participant, D ij = x ij − x pt,j , where i is the participant and j is the round. Especially for the compressive strength in different ages, constituting the most significant tests, the software created the kernel plots using the results of nine moving rounds for the last three years of the PT scheme application. The Generalized Reduced Gradient non-linear regression technique fitted the mix of three normal distributions to each KDP and adjusted the parameters shown in Table 1 to achieve the minimum distance between the two curves. The result of this processing was 144 kernel density plots, based on multiple kinds of tests during the long-term operation of the PT scheme, efficiently simulated using a mix of central and contaminating populations. Figure 1 depicts an example of a kernel density plot implemented on 28-day strength results of nine rounds. The location of the two contaminating distributions is on the left and the right of the central one. The coefficient of determination, R 2 , between the mix of the three distributions and the kernel density plot is 0.998, and the parameters' values are as follows: m 1 = 0.12 MPa; s 1 = 1.47 MPa; m 2 = 4.54 MPa; s 2 = 0.68; m 3 = −3.47; s 3 = 1.18; fr 1 = 0.87; fr 2 = 0.024; fr 3 = 0.106; n 2 = 3.01; n 3 = −2.45. The average coefficient of determination of the fit of all 144 sets of KDP is 0.9917, with a standard deviation equal to 0.0051. The above demonstrates that simulating kernel density plots with a sum of three normal distributions is effective.
Building the KDP with the results of each test makes it difficult to reveal the actual distribution for participants ranging from 11 to 14 per round. For this reason, in [24], we used all the results of nine consecutive yearly tests for each KDP, as participants are the same during a year. The algorithm achieves normalization by using the difference of the assigned value from the mean value of each participant, Dij = xij-xpt,j, where i is the participant and j is the round. Especially for the compressive strength in different ages, constituting the most significant tests, the software created the kernel plots using the results of nine moving rounds for the last three years of the PT scheme application. The Generalized Reduced Gradient non-linear regression technique fitted the mix of three normal distributions to each KDP and adjusted the parameters shown in Table 1 to achieve the minimum distance between the two curves. The result of this processing was 144 kernel density plots, based on multiple kinds of tests during the long-term operation of the PT scheme, efficiently simulated using a mix of central and contaminating populations.  The large number of kernel density plots, based on various cement tests, and effectively simulated by the reported method, necessitates a detailed investigation of the range and distribution of the simulation parameters. The outcome of this processing will be input to search for and define the best robust estimators. Figure 2 shows, for all the simulated KDF, the algebraic distances from m1 to m2 and to m3 in s1 units, n2 and n3 correspondingly. The large number of kernel density plots, based on various cement tests, and effectively simulated by the reported method, necessitates a detailed investigation of the range and distribution of the simulation parameters. The outcome of this processing will be input to search for and define the best robust estimators. Figure 2 shows, for all the simulated KDF, the algebraic distances from m 1 to m 2 and to m 3 in s 1 units, n 2 and n 3 correspondingly.
ing populations. In 73.6% of cases, the points are down the diagonal, meaning that f2 ≥ f Thus, in most data, the fraction of the second population with |n2| ≤ |n3| is considerabl higher than that of the third. Figure 3b indicates that the fraction f2 + f3 ≤ 20 represent 84.7% of the total population of contaminating distributions. Additionally, 76.4% of th results lie between 0.05 and 0.20, constituting the main fraction of the two secondary dis tributions.   The algorithm calculates n 2 , n 3 so that always to be |n 2 | ≤ |n 3 |. The data are divided into three regions: (a) |n 3 | < 3.5; (b) 3.5 ≤ |n 3 | < 5.5; (c) |n 3 | ≥ 5.5. The computations provide the following distribution of the three regions: (a) 68%; (b) 27.8%; (c) 4.2%. By rounding the values of n 2 , and n 3 to the nearest integer, the distribution of the three regions has the following meaning. (a) 68% of the rounded values belong to the interval [-3,3]; only 4.2% of the population has n 3 > = 6, or n 3 < = −6; 27.8% of data has n 3 = 4,5 or n 3 = −4, −5. Figure 3 illustrates the fractions f 2 , f 3 , and the sum f 2 + f 3 of the surfaces of contaminating populations. In 73.6% of cases, the points are down the diagonal, meaning that f 2 ≥ f 3 . Thus, in most data, the fraction of the second population with |n 2 | ≤ |n 3 | is considerably higher than that of the third. Figure 3b indicates that the fraction f 2 + f 3 ≤ 20 represents 84.7% of the total population of contaminating distributions. Additionally, 76.4% of the results lie between 0.05 and 0.20, constituting the main fraction of the two secondary distributions.
Standards 2023, 3, FOR PEER REVIEW 5 The algorithm calculates n2, n3 so that always to be |n2| ≤ |n3|. The data are divided into three regions: (a) |n3| < 3.5; (b) 3.5 ≤ |n3| < 5.5; (c) |n3| ≥ 5.5. The computations provide the following distribution of the three regions: (a) 68%; (b) 27.8%; (c) 4.2%. By rounding the values of n2, and n3 to the nearest integer, the distribution of the three regions has the following meaning. (a) 68% of the rounded values belong to the interval [-3,3]; only 4.2% of the population has n3 > = 6, or n3 < = −6; 27.8% of data has n3 = 4,5 or n3 = −4, −5. Figure 3 illustrates the fractions f2, f3, and the sum f2 + f3 of the surfaces of contaminating populations. In 73.6% of cases, the points are down the diagonal, meaning that f2 ≥ f3. Thus, in most data, the fraction of the second population with |n2| ≤ |n3| is considerably higher than that of the third. Figure 3b indicates that the fraction f2 + f3 ≤ 20 represents 84.7% of the total population of contaminating distributions. Additionally, 76.4% of the results lie between 0.05 and 0.20, constituting the main fraction of the two secondary distributions. Standard deviations require normalization since they refer to tests whose results may differ by two orders of magnitude. Typically, SO 3 values range from 2.5% to 4%, while specific surface values lie from 3000 cm 2 /g to 4500 cm 2 /g. Such a normalizing variable is Standards 2023, 3 115 the coefficient of variation, CV, which for each test takes into account the average of the assigned values of the included rounds, AvVal, provided by Equation (2).
Rounding of the coefficients of variation to the nearest integer permits calculating the distribution of CV triads for all simulated core density functions. Table 3 presents the sixteen sets of rounded coefficients of variation with the highest frequency within this distribution. On average, the coefficients of the main population are higher than those of the contaminating ones. Searching for the most suitable robust estimators will rely on the data presented in Figures 2 and 3, and Table 3.

Monte Carlo Simulation
The Monte Carlo approach utilizes the mix of three distributions implemented to simulate the kernel density plot using the parameters of Table 1. Additionally, the following data are inputs and independent variables of the algorithm.  Table 4 according to the analysis of ISO 13528:2015 [15] and EN ISO/IEC 17043:2010 [1]. For each Z factor, the table shows the respecting population's mean, the standard deviation used to calculate it, and the clause of the standard applied. ISO 13,528 [15] (p. 26) and EN ISO/IEC 17043 [1] use the following conventional interpretation of Z scores. (a) A result that gives |Z| ≤ 2.0 is acceptable; (b) if 2.0 < |Z| < 3.0, the performance is questionable; and (c) if |Z| ≥ 3.0, the performance is unacceptable, generating an action signal. There are other arbitrary but reasonable bounds for Gaussian populations in the literature. Rousseeuw et al. [13] use the rule |Z|>2.5 to label a value as an outlier for normalized residuals in regression. In a strictly normal distribution, 0.27% and 1.24% of the population have |Z| greater than 3 and 2.5, respectively. In this study, we use the convention of ISO 13528: If |Z|>3, the laboratory performance is unsatisfactory, and the result is an outlier. As demonstrated in [24], the estimators of location and dispersion described in ISO 5725-2:1994 [16] are not resistant to outliers. For this reason, we restricted this study to the robust estimators analyzed in ISO 13528:2015. Supposing that a distribution of a large number of results contains Zu% non-satisfactory results, the question is which robust estimators most closely approach this value for a certain number of labs. As found in [24], choosing the most reliable estimator depends on the number of participants and the actual distribution of the results. This older study considered a limited number of participants, up to 30, f 2 + f 3 < 0.10, and mean values of the Zu% distribution. Using robust estimators to detect the outliers indicates that the approach is nonparametric. Therefore, using the entire distribution of Zu% instead of its average constitutes a substantial improvement of the method presented in [24]. The improved algorithm applies the Monte Carlo simulation and follows the subsequent steps:   Table 4 and the Zu% for the absolute values of the four Z-factors presented in the same table using a N lab = 1000. For this number of participants, all estimators converge to their final value. As demonstrated in [24], N lab = 400 is adequate for such convergence. (vi) This previous study found that Z_MADe was the closest estimator of unacceptable Z-factors to the estimation based on the main normal distribution of step (iv). Its values corresponding to N lab = 1000 represent the reference values, Z ref .
(vii) The simulations implemented all the settings shown in Table 5 for participants up to 100. The populations correspond to unimodal, bimodal, and trimodal distributions with a maximum total fraction of secondary distributions up to 0.2. (viii) For each mix of the three normal distributions, the software performs N iter iterations and N s simulations for each N lab . Then it calculates the differential distribution of each of the four Zu% in at least twenty bins. The algorithm compares these results with the reference value using the distance of the distribution's points from Z ref provided by Equation (4). The Z-factor with the closest distance to the reference value is optimal.  3 If n 3 < 0, the distribution is symmetric about the Y-axis with that of n 3 > 0 and n 2 of opposite sign and the same absolute values. Therefore, the simulation results will be the same. 4 The algorithm stops earlier than 25 simulations if the new value of the average of each unacceptable Z-factor differs from the previous one by less than 0.1%. 5 If the number of distribution non-zero fractions is higher than 20, then N b = 30 or 40.
There are five components to this equation. ZM j , X ij , ZD ij , Z ref , and N max . ZM j denotes the mean of distances between the distribution of estimator j and Z ref , X ij is the fraction of the differential distribution at point i with a ZD ij value on the X-axis, and N max is the maximum i with a non-zero X ij value. The best estimator is the one whose index j corresponds to ZM j 's minimum. If the ZM value of another estimator differs by less than 1% from the minimum ZM, this also estimator is optimal. Figure 4 illustrates the above method of finding the most suitable estimator using the following model parameters. N lab = 20; s 1 = 2; n 2 = −2; n 3 = 5; s 2 = 1; s 3 = 1; fr 2 = 0.1; fr 3 = 0.05.

Initial Simulations
Monte Carlo simulation was first applied to study the distribution of the two statistics affecting the Z factors: the population's mean and standard deviation. The algorithm created the frequency distributions of all the estimators in Nb bins, using the Niter·N s results of each statistic. These histograms, designed to investigate the symmetry of the distributions, are highly reliable because they incorporate up to 25,000 simulations. Figure 5 depicts such an example for MED and MADe. There is an asymmetry in both distributions, indicating that this skewness should be explored further. The skewness metric must be nonparametric because the approach used to detect outliers is. Additionally, all studied distributions are unimodal, sometimes with few frequency fractions over 0.01. In this study, we implement the simple formula provided by Groeneveld et al. [38] using the mean value, μ, and the median value, ν, of the distribution. The measure of dispersion in this formula, E(|X−ν|), is the mean value of the absolute differences between distribution values and the median. In this way, we avoid utilizing the more complicated formula described in many textbooks [39], involving the use of second and third-order moments. Equation (5) expresses the selected skewness measure, SK.

Initial Simulations
Monte Carlo simulation was first applied to study the distribution of the two statistics affecting the Z factors: the population's mean and standard deviation. The algorithm created the frequency distributions of all the estimators in N b bins, using the N iter ·N s results of each statistic. These histograms, designed to investigate the symmetry of the distributions, are highly reliable because they incorporate up to 25,000 simulations. Figure 5 depicts such an example for MED and MADe. There is an asymmetry in both distributions, indicating that this skewness should be explored further. The skewness metric must be nonparametric because the approach used to detect outliers is. Additionally, all studied distributions are unimodal, sometimes with few frequency fractions over 0.01. In this study, we implement the simple formula provided by Groeneveld et al. [38] using the mean value, µ, and the median value, ν, of the distribution. The measure of dispersion in this formula, E(|X−ν|), is the mean value of the absolute differences between distribution values and the median. In this way, we avoid utilizing the more complicated formula described in many textbooks [39], involving the use of second and third-order moments. Equation (5) expresses the selected skewness measure, SK.

Initial Simulations
Monte Carlo simulation was first applied to study the distribution of the two statistics affecting the Z factors: the population's mean and standard deviation. The algorithm created the frequency distributions of all the estimators in Nb bins, using the Niter·N s results of each statistic. These histograms, designed to investigate the symmetry of the distributions, are highly reliable because they incorporate up to 25,000 simulations. Figure 5 depicts such an example for MED and MADe. There is an asymmetry in both distributions, indicating that this skewness should be explored further. The skewness metric must be nonparametric because the approach used to detect outliers is. Additionally, all studied distributions are unimodal, sometimes with few frequency fractions over 0.01. In this study, we implement the simple formula provided by Groeneveld et al. [38] using the mean value, μ, and the median value, ν, of the distribution. The measure of dispersion in this formula, E(|X−ν|), is the mean value of the absolute differences between distribution values and the median. In this way, we avoid utilizing the more complicated formula described in many textbooks [39], involving the use of second and third-order moments. Equation (5)  It is well known from very early research [40] that the nonparametric skewness always lies between −1 and +1. For symmetric distributions its value is zero. It is positive for right skewed distributions (µ > ν) and negative for left skewed distributions (µ < ν). For each robust statistic, the algorithm computes the mean value, µ, by summing the products of each differential fraction f i , for I = 1 to N b , with the average X i of each interval of the histogram X-axis using Equation (6). The calculation of median value, ν, utilized linear interpolation between the points X p , X p+1 , where f p < = 0.5 and f p+1 > = 0.5. Finally, Equation (7) provides the estimation E(|X−ν|). (c) 0.4 ≤ |SK| < 0.7; (d) 0.7 ≤ |SK| ≤ 1. As a rough guide, we can consider that if |SK| < 0.2, the departure from symmetry is low [41]. The skew is moderate for |SK| values in (b). The distribution is highly skewed for |SK| values in (c). Finally, the skew is very high for |SK| ≥ 0.7. (iii) The skewness of the estimators of mean increases, increasing n 2 from −7 to 7. For these estimators, the skewness decreases, increasing the number of participants. The distributions are highly symmetric for N lab ≥ 40 and n 2 ≤ 0. (iv) The estimators of the mean have less skewness than those of standard deviation for the same N lab and n 2 . The best symmetry in the distributions of the latter occurs for N lab ≥ 40 and n 2 between −1 and 1. (v) For low and high n 2 values, the skewness standard deviation sometimes becomes very high. (vi) The asymmetry of the two distributions indicates that the search for the best method regarding outliers should focus on their distribution for each performance statistic mentioned in Table 4.   Figure 6. Nonparametric skewness values of the robust estimators of population's mean.

Shape of the Outliers Distribution
The distributions of outliers are not only skewed but sometimes not bell-shaped. Furthermore, they are discrete. For a certain number of participants, Nlab, each value ZD on the X-axis corresponds to an integer number of unsatisfactory results. For example, for Nlab = 40 and ZD = 5%, the outliers are 5/100·40 = 2. Figure 8 illustrates the above by implementing the four performance statistics in Table 4 and using six sets of model parameters.
Equation (4) defines the variables shown in Figure 8. The shape of the distribution of outliers strongly depends on the number of participants, the model parameters expressing the distribution of the results, and the performance statistic selected. All curves have a long right tail, but its thickness in each distribution depends on the performance estimator. In general, Z_Hamp shows thinner tails, thus a lower percentage of outliers significantly higher than the reference value, Zref. Figure 8 shows the mean distances ZM between each performance estimator and Zref. Z_Hamp is the most effective estimator for the examples presented. An intriguing case appears in Figure 8b: Z_Hamp and Z_A estimators are both optimal, despite their distinct distributions. It results from using the objective criterion of the mean distance of the distribution from Zref, which is particularly effective for strongly non-normal distributions.

Shape of the Outliers Distribution
The distributions of outliers are not only skewed but sometimes not bell-shaped. Furthermore, they are discrete. For a certain number of participants, N lab , each value ZD on the X-axis corresponds to an integer number of unsatisfactory results. For example, for N lab = 40 and ZD = 5%, the outliers are 5/100·40 = 2. Figure 8 illustrates the above by implementing the four performance statistics in Table 4 and using six sets of model parameters.

Implementation of the Simulator
In the author's experience, 10 to 100 labs participate in most PT schemes. Additionally, if the total contaminating population is more than 20% and very distant from the center, one of two things usually happens. (a) The samples are not homogenous enough, or (b) the labs use different measurement methods for the same analyte or property. ISO 13528:2015 dedicates a complete sub-section and an Annex to the first case [15]. The same standard suggests using a distinct assigned value for each measurement method for the Equation (4) defines the variables shown in Figure 8. The shape of the distribution of outliers strongly depends on the number of participants, the model parameters expressing the distribution of the results, and the performance statistic selected. All curves have a long right tail, but its thickness in each distribution depends on the performance estimator. In general, Z_Hamp shows thinner tails, thus a lower percentage of outliers significantly higher than the reference value, Z ref . Figure 8 shows the mean distances ZM between each performance estimator and Z ref . Z_Hamp is the most effective estimator for the examples presented. An intriguing case appears in Figure 8b: Z_Hamp and Z_A estimators are both optimal, despite their distinct distributions. It results from using the objective criterion of the mean distance of the distribution from Z ref , which is particularly effective for strongly non-normal distributions.

Implementation of the Simulator
In the author's experience, 10 to 100 labs participate in most PT schemes. Additionally, if the total contaminating population is more than 20% and very distant from the center, one of two things usually happens. (a) The samples are not homogenous enough, or (b) the labs use different measurement methods for the same analyte or property. ISO 13528:2015 dedicates a complete sub-section and an Annex to the first case [15]. The same standard suggests using a distinct assigned value for each measurement method for the second case [15]. As a result, the settings in Table 5 are of particular practical interest to those with expertise in PT schemes since they apply to most PT schemes with homogeneous test items, permitting one properly described measurement method per test.
The simulation implements Equation (4) to all the settings of Table 5

Simulation Results
The simulator calculates the results for all combinations [s 1 , s 2 , s 3 ] X [fr 2 , fr 3 ], resulting in 120 matrices similar to those in Figures A1 and A2, which are the exact solution of the optimization problem applying the criterion of Equation (4). Instead of this solution, we chose a functional sub-optimal one by grouping and classifying the parameters as follows: (i) The algorithm initially generates two groups: (a) a small population of participants, N LOW , with N lab = 10, 15, 20, and 30; (b) a large one, N HIGH , with N lab = 40, 60, 80, and 100.
(ii) It then creates at most seven regions of n 2 by keeping|n 2 | < n 3 , the following: (a) −7 ≤ n 2 ≤ −8; (b) −6 ≤ n 2 ≤ −5; (c) −4 ≤ n 2 ≤ −3; (d) −2 ≤ n 2 ≤ 2 (e) 3 ≤ n 2 ≤ 4 (f) 5 ≤ n 2 ≤ 6 (g) 7 ≤ n 2 ≤8. It is seven regions when n 3MAX is 8, but five when n 3MAX is less. (iii) Afterwards, it counts the occurrences of each estimator as optimal in each region and calculates their percentages. (iv) An optimal estimator is the one with the highest percentage and those whose percent of appearance differs by up to 10% from the maximum. (v) Next is the creation of tables with the results per n 3 and [s 1 , s 2 , s 3 ]. Figures 9-11 show the most accurate estimators for [s 1 , s 2 , s 3 ] = 111, 211, 212, 222 and 7 ≤ n 3 ≤ 8, 5 ≤ n 3 ≤ 6, n 3 = 4. Appendix B contains all the remaining results for s 1 = 3 and 4. We do not show the results for n 3 = 2,3 since Z_Hampel is always much superior to the second estimator, i.e., Z_Hampel is the best. Figure 9 shows that the four estimators are numbered and colored. Two columns are present for each [s 1 , s 2 , s 3 ] because two optimal estimators exist for some parameter sets.
Standards 2023, 3, FOR PEER REVIEW 14 contrast, this estimator is found significantly more frequently in the low number of labs than in NHIGH when n3 = 4. (iv) The selection of the most appropriate estimator is not so sensitive to the choice of the mixture of standard deviations, [s1, s2, s3]. When comparing Figures 10, 11, A3 and A4, one finds that in enough cases, the optimal statistic is the same for the same group of labs, n3, n2, f2, and f3, concluding that the impact of [s1, s2, s3] is less strong than that of the other parameters. (v) With a contaminating population fraction f3 of 0.05 and distribution diverging relatively little from the bimodal with −2 ≤ n2 ≤ 2, the Z_Hamp is most often found. However, this rule is not absolute. Figure A4 demonstrates that for the group NHIGH, n3 = 4, and s1 = 3 or 4, Z_A is the most suitable.  1  2  2  2  2  3  2  3  2  2  3  2  0.05  0.1  2  2  2  2  3  2  2  3  2  2  0.1  0.1  2  2  2  2  3  2  2  3  2  3  2  0  0.15  3  3  3  3  3  3  3  3  0.05  0.15  3  3  3  3  3  3  1  3  (vi) In some cases, there are expanded zones where one estimator outperforms the others, increasing the robustness of the suggested solution. For example, in Figures 10 and A3, for 5 ≤ n2 ≤ 6, the first choice is the Z_MADe. In Figures 10, 11, A3 and A4, for −2 ≤ n2 ≤ 2 and f3 = 0.10 the correct selection is Z_A. (vii) For NLOW and NHIGH groups, Figure 12 illustrates a rough tendency for the preferred statistics as a function of the zones of n2, n3, and f3. For each region, the algorithm counts the occurrences of each estimator in the maps depicted in Figures 9-11, A3 and A4. It calculates their percentages and considers as optimal statistics those that are up to 10% below the maximum. (viii) The results of Figure 12 are only a rough guide to choosing the most appropriate estimator, demonstrating that there is no unique solution, and the best selection depends on the data distribution.  Figures A1 and A2. (ii) For the same n 3 , the distribution of the most accurate performance statistics as a function of n 2 is not symmetrical around the center. For all n 3 , the estimators for −8 ≤ n 2 ≤ −7 and −6 ≤ n 2 ≤ −5 differ significantly from the ones for 7 ≤ n 2 ≤ 8 and 5 ≤ n 2 ≤ 6. More symmetrical patterns appear in zones −4 ≤ n 2 ≤ −3 and 3 ≤ n 2 ≤ 4. (iii) With the same model parameters, the results are similar for the two groups N LOW and N HIGH , but not in all cases. For example, for 7 ≤ n 3 ≤ 8 and −7 ≤ n 2 ≤ −8, Z_Hampel appears as the optimal performance statistic much more in the N HIGH group than in N LOW . In contrast, this estimator is found significantly more frequently in the low number of labs than in N HIGH when n 3 = 4. (iv) The selection of the most appropriate estimator is not so sensitive to the choice of the mixture of standard deviations, [s 1 , s 2 , s 3 ]. When comparing Figures 10, 11, A3 and A4, one finds that in enough cases, the optimal statistic is the same for the same group of labs, n 3 , n 2 , f 2 , and f 3 , concluding that the impact of [s 1 , s 2 , s 3 ] is less strong than that of the other parameters. (v) With a contaminating population fraction f 3 of 0.05 and distribution diverging relatively little from the bimodal with −2 ≤ n 2 ≤ 2, the Z_Hamp is most often found. However, this rule is not absolute. Figure A4 demonstrates that for the group N HIGH , n 3 = 4, and s 1 = 3 or 4, Z_A is the most suitable. (vi) In some cases, there are expanded zones where one estimator outperforms the others, increasing the robustness of the suggested solution. For example, in Figures 10 and A3, for 5 ≤ n 2 ≤ 6, the first choice is the Z_MADe. In Figures 10, 11, A3 and A4, for −2 ≤ n 2 ≤ 2 and f 3 = 0.10 the correct selection is Z_A. (vii) For N LOW and N HIGH groups, Figure 12 illustrates a rough tendency for the preferred statistics as a function of the zones of n 2 , n 3 , and f 3 . For each region, the algorithm counts the occurrences of each estimator in the maps depicted in Figures 9-11, A3 and A4. It calculates their percentages and considers as optimal statistics those that are up to 10% below the maximum. (viii) The results of Figure 12 are only a rough guide to choosing the most appropriate estimator, demonstrating that there is no unique solution, and the best selection depends on the data distribution.    Although a machine learning technique, like the Support Vector Machine [42], could better classify the data, the presented solution is directly applicable. The PT schemes can apply the following procedure to implement the presented technique of finding the most appropriate robust estimator. Following verification of the homogeneity of the test items and the use of the same method of analysis per test, the PT expert should build the kernel density plot, enabling him to find the optimal mix of at most three Gaussian approximating this plot. If the PT scheme organizer executes the test several times a year, selecting the results of the latest and several recent rounds to generate kernel plots is preferable. The latest data have a significant probability of belonging to a population similar to the recent results population, especially if the same or almost the same laboratories participate. However, if PT scheme experts used only the last round to build the distribution, the parameters could have severe uncertainty, especially if N lab is small. The next step is to normalize the parameters of the found normal distributions based on the mean value of the central one using Equation (8).
The symbols m i,Act and s i,Act , for I = 1,2,3, denote the mean value and the standard deviation of the three normal distributions simulating the kernel density plot. The parameters fr 1 , fr 2 , fr 3 , and n 2 , n 3 in Table 1 are independent of normalization.  Although a machine learning technique, like the Support Vector Machine [42], could better classify the data, the presented solution is directly applicable. The PT schemes can apply the following procedure to implement the presented technique of finding the most appropriate robust estimator. Following verification of the homogeneity of the test items and the use of the same method of analysis per test, the PT expert should build the kernel density plot, enabling him to find the optimal mix of at most three Gaussian approximating this plot. If the PT scheme organizer executes the test several times a year, selecting the results of the latest and several recent rounds to generate kernel plots is preferable. The latest data have a significant probability of belonging to a population similar to the recent results population, especially if the same or almost the same laboratories participate. However, if PT scheme experts used only the last round to build the distribution, the parameters could have severe uncertainty, especially if Nlab is small. The next step is to normalize the parameters of the found normal distributions based on the mean value of the central one using Equation (8). • 100 = 1, 2, 3  Figure 12. Trend maps of robust performance estimators as a function of n 3 , n 2 , and f 3 regions.
Rounding the algebraic distances n 2 , n 3 , and the three standard deviations to the nearest integer follows. The same is done with the fractions fr 2 and fr 3 to the nearest multiple of 0.05. Next is to check if all the parameters are within the ranges given in Figures 9-11, A3 and A4. If n 3 < 0, and |n 3 | > |n 2 |, the symmetrical case by changing the signs of n 2 and n 3 provides the same optimal estimator. If this check is positive, the choice is the estimator that the Figures demonstrate. If f 2 + f 3 ≥ 0.25 but |n 2 | ≤ 3 and |n 3 | ≤ 3, Z_Hamp is the most appropriate selection, as already shown in 4.2. Otherwise, the figures are not applicable to such a high percentage of contaminating populations. The same happens if |n 3 | > n 3MAX , as provided in Section 4.2.
The 144 kernel density plots analyzed in Section 2 represent a sufficiently large sample to apply the simulation results of the mentioned five figures to the actual conditions of a PT scheme operating systematically for years. Figures 2 and 3 summarize the model parameters found. In case some real [s 1 , s 2 , s 3 ] sets are not present in the simulation results, we selected from the figures those [s 1 , s 2 , s 3 ] that are closest to the actual ones. The group chosen is always the N LOW , as 11-14 labs participate in each test. Table 6 summarizes the best estimators for detecting outliers for model parameters computed from actual and multiple tests. The low population of results with |n 3 | ≥ 6 shows that participants try to improve their performance and the PT scheme is well organized and mature. Only in 4.9% out of the 144 cases the parameters of the kernel density plots are outside the studied range. Z_Hampel significantly outperforms the other estimators when applied to multiple distributions of the results of a well-performing PT scheme. The conclusion is not general, but one can assume that in continuous and mature PT schemes with a limited number of participants, the Z_Hampel estimator is the suitable choice.

Conclusions
This study investigated the robust estimators of PT schemes in outlier detection, listed in ISO 13528:2015, using tools developed in [24] and analyzing the distribution of unsatisfactory results with |Z| > 3. The approach combines various techniques and criteria: (a) robust estimators of the population mean and dispersion; (b) kernel density plots; (c) distributions derived from the mix of at most three Gaussians; (d) Monte Carlo simulations; (e) Z-factors.
One portion of the algorithm created kernel density plots for 144 datasets of cement's chemical, physical, and mechanical properties, belonging to a continuous PT scheme of cement. It approximated each KDP by summing suitable fractions of the Gaussians, which have adjustable parameters. The fit is sufficiently efficient for all the sets, with a coefficient of determination of 0.9917 ± 0.0051. While the central distribution is the primary one, the others contaminate it. The fractions of the three Gaussians, their standard deviations, and the algebraic distance of the mean of the contaminating ones from the center define the shape of the results' kernel density plot. The distribution of these variables determined the minimum range of the simulation's parameters.
Assuming a distribution of a large number of results containing Zu% unsatisfactory results according to EN ISO/IEC 17043:2010, the simulator can find the best estimators. For each PT round, the simulator considers a wide range of parameters and takes into account between 10 and 100 participants. The fraction of the two contaminating Gaussians is up to 0.20 of the total. The model was applied to study the distribution of the mean and the standard deviation of the population. We found that both are unimodal but skewed. For the standard deviation, the skewness ranges from moderate to very high. The distributions of outliers are not only skewed but sometimes not bell-shaped. The metric to find the best robust estimator must be nonparametric because the approach used to detect unsatisfactory results is. The software calculates the differential distribution of each estimator and compares these results with the reference value using the distance of the distribution's points from Z ref . The estimator with the minimum value is the best.
The simulator calculates the optimal statistics for all combinations, resulting in 120 matrices of results. Instead of this solution, we chose a functional sub-optimal one by grouping and classifying the model parameters according to the number of participants, the distance of the secondary distributions from the central one, and the standard deviation of the Gaussians. In each region, the estimators appearing most often are the best. Whenever at most half of the distribution of each contaminating population is outside the central distribution, the Z_Hamp estimator is the most suitable. For all other cases, we concluded with five result tables, which form a building block of this study and a robust tool for choosing the most accurate statistic. Z_Hamp is most frequently found for distributions that diverge relatively little from the bimodal, and the fraction of the most distant population from the center is 0.05.
We applied the suggested method to the referred 144 datasets of a continuous PT scheme for cement. We used their kernel density plots and the parameters of the Gaussian populations to approximate them. The parameters of the KDPs fall outside the studied range only in 4.9% of these data. Z_Hampel is the most suitable estimator in 86.1% of this population of results derived from a stably performing PT scheme.
In the author's experience, 10 to 100 labs participate in most PT schemes, so the results presented are of fairly general application. One can use the reference values derived from Z_MADe when there are more participants. After ensuring that the same method of analysis is used per test and verifying the homogeneity of the samples, the technical expert of a PT can apply the estimators generated by the sub-optimal solution. It is first necessary to build the kernel density plot per test and round, or over the recent rounds if the participants are limited, and then to adjust the parameters of the three Gaussian. If these are within the range presented in this study, then the estimator given in the corresponding Figure is most likely the best one.
Based on the above arguments, the novelty of this study is the generalized assessment of the effectiveness of the robust statistics listed in 13528:2015 in detecting outliers. We generated tables of optimal estimators for a wide range of participants and distributions of results, especially useful for existing and new PT schemes. The proposed method applies to 95.1% of the results for an existing PT for cement. The data cover nine years. The investigation of the best estimators can continue in the following two directions: direct use of the kernel density plots in determining the best statistic estimators' comparisons for Z-factors of absolute value between two and three.
Furthermore, ISO 13528:2015 in Section C.6 states that the PT provider may use other robust estimators, subject to demonstrating their efficiency in fulfilling the particular requirements of the PT scheme. The simulation developed and the rich dataset of PT results used in this study could assist in investigating and comparing the effectiveness of regression or multivariate models when applied to PT schemes. Data Availability Statement: The data and simulation results presented in this paper are available upon request from the author. The results covering 120 matrices could be the inputs of a Support Vector Machine to search for optimal classification. Acknowledgments: The author is grateful to G. Briskolas, EUROCERT S.A., for providing access to the data of the PT schemes organized by his organization. The author is also thankful to reviewers for their constructive comments, which have helped to improve the present paper.

Conflicts of Interest:
The author declares no conflict of interest. Figures A1 and A2 show the best estimators in detecting outliers for [s 1 , s 2 , s 3 ] = 222. The algorithm also calculates the sum per estimator and n 3 value. If a cell contains multiple findings, the color is that of the estimator of the highest ranking in the row.