Should We Gain Confidence from the Similarity of Results between Methods?

Confirming the result of a calculation by a calculation with a different method is often seen as a validity check. However, when the methods considered are all subject to the same (systematic) errors, this practice fails. Using a statistical approach, we define measures for reliability and similarity, and we explore the extent to which the similarity of results can help improve our judgment of the validity of data. This method is illustrated on synthetic data and applied to two benchmark datasets extracted from the literature: band gaps of solids estimated by various density functional approximations, and effective atomization energies estimated by ab initio and machine-learning methods. Depending on the levels of bias and correlation of the datasets, we found that similarity may provide a null-to-marginal improvement in reliability and was mostly effective in eliminating large errors.


Introduction
When all computational methods yield similar results, one often assumes that these cannot be wrong. Logically, however, one cannot prove this: an argument is not necessarily right because the majority thinks so. One might, therefore, ask whether obtaining similar results with different methods gives a higher chance of achieving reliable results (one has to keep in mind that the better accuracy of a method when compared to another is a statistical assessment and is not necessarily valid for all systems [1,2]). In this paper, we propose and test a statistical approach to address this question in the context of computational approximations. The concepts of reliability and similarity are defined and measured by probabilities estimated from benchmark error sets. The interplay between reliability and similarity is estimated by conditional probabilities. Reliability, as defined here, is closely related to measures we used in previous studies, based on the empirical cumulative distribution function (ECDF) of error sets [3]. As for similarity, there is a link with the correlation between error sets, as illustrated in refs. [1,2]. Unlike correlation, however, similarity is affected by bias between methods, i.e., correlation does not imply similarity.
The following section (Section 2) presents the method. The Applications section (Section 3) illustrates the method on a toy dataset of normal distributions and on two real-world datasets. In order to be able to draw conclusions, we chose literature benchmark datasets with enough points to enable reliable numerical results, and a variety of methods encompassing various scenarios of bias and correlation. The main observations are summarized in the Conclusions. The aim of this paper is to exemplify a statistical approach to similarity, not to draw general conclusions or to recommend any of the studied methods.

Frame
For a given computational method, M, and a given system, S, let the value calculated for a chosen property be denoted by X(M, S). A benchmark provides reference values, R(S). The error for the method M and the system S is given by

E(M, S) = X(M, S) − R(S). (1)

Reliability and Similarity of Computational Results
A benchmark data set is expected to provide a large set of data. We can use statistical measures on this set to make assessments on the reliability of the computational method. Let us first define what we mean by the results of a calculation being reliable or being similar.
The computational method M is considered reliable for the system S if

|E(M, S)| < ε_r, (2)

where the reliability threshold, ε_r, is chosen by the user of the method, depending on their needs. We consider here that two methods, M_1 and M_2, provide similar results for system S when

|X(M_1, S) − X(M_2, S)| < ε_s, (3)

where the similarity threshold, ε_s, is also defined by the user. Since both methods are compared to the same reference, this is equivalent to |E(M_1, S) − E(M_2, S)| < ε_s. When we consider a set of methods, we say that the results of these methods are similar when all pairs of methods of the set yield similar results. If not specified otherwise, we will use, in this paper, ε_s = ε_r = ε. Figure 1 schematically presents the problem. The set of systems for which method M_1 is reliable is represented by a red disk; for method M_2, by a blue disk. The systems for which the two methods are similar are contained in the gray disk. The overlapping region of the red (or blue) disk with the gray disk indicates the set of systems that are reliable with the method M_1 (or M_2) and, at the same time, close to the result provided by the other method. Figure 1. A schematic representation of the properties of the systems. The region within the square represents the set of all benchmark systems. The red disk represents the set of systems for which method M_1 is reliable. The blue disk represents the set of systems for which method M_2 is reliable. The gray disk represents the set of systems for which methods M_1 and M_2 give similar results.
Let us define the following notations characterizing those sets, where the indices r and s refer to reliability and similarity, respectively:
• N, the number of systems in the data set (corresponding to the white square in Figure 1).
• N_s(M_1, M_2, . . . ; ε_s), or N_s(ε_s) for brevity, the number of systems that yield similar results (within ε_s) using methods M_1, M_2, . . . (corresponding to the gray disk in Figure 1).
• N_r(M, ε_r), the number of systems for which method M is reliable (corresponding to the red or blue disk in Figure 1).
• N_{r∩s}(M, ε_r, ε_s), the number of systems for which method M is reliable and similar to the other methods (corresponding to the overlap region of the three disks in Figure 1).

Probabilities
If the data set is sufficiently large, we can estimate probabilities as frequencies from these numbers:
• The probability to obtain a reliable result with method M,

P_r(M, ε_r) = N_r(M, ε_r) / N.

• The probability to obtain similar results for the set of considered methods,

P_s(ε_s) = N_s(ε_s) / N.

For a finite sample, the smallest value of ε_s for which P_s(ε_s) = 1 is called the Hausdorff distance [4].
• The (conditional) probability to obtain reliable results with method M, given that this method is similar to the other methods in the set,

P_r|s(M, ε_r, ε_s) = N_{r∩s}(M, ε_r, ε_s) / N_s(ε_s).

• The (conditional) probability that a result with method M is similar to that of the other methods, given that it is reliable,

P_s|r(M, ε_s, ε_r) = N_{r∩s}(M, ε_r, ε_s) / N_r(M, ε_r),

with the limit value

P_s|r(M, ε_s = ∞, ε_r) = 1. (10)

Furthermore, even for ε_s = ε_r = ε, in general, P_s(ε) ≠ P_r(M, ε) and P_s|r(M, ε) ≠ P_r|s(M, ε), where the notations were shortened to imply the equality of both thresholds.
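As a concrete illustration, the frequency estimates above can be sketched in a few lines of numpy. This is a minimal sketch for a pair of methods; the function name `probabilities` and the argument conventions are our own choices, not from the paper.

```python
import numpy as np

def probabilities(e1, e2, eps_r, eps_s):
    """Estimate P_r(M1), P_s, P_r|s(M1) and P_s|r(M1) as frequencies
    from two error sets defined over the same benchmark systems."""
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    reliable = np.abs(e1) < eps_r        # |E(M1, S)| < eps_r  (Eq. (2))
    similar = np.abs(e1 - e2) < eps_s    # |E(M1, S) - E(M2, S)| < eps_s
    p_r = reliable.mean()                # N_r / N
    p_s = similar.mean()                 # N_s / N
    # Conditional probabilities; undefined when the conditioning set is empty.
    p_r_given_s = reliable[similar].mean() if similar.any() else float("nan")
    p_s_given_r = similar[reliable].mean() if reliable.any() else float("nan")
    return p_r, p_s, p_r_given_s, p_s_given_r
```

For a set of more than two methods, `similar` would be the conjunction of the pairwise conditions over all pairs, as stated in the definition above.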
The main objective of this paper is to investigate whether choosing methods with similar results is a good criterion of reliability, i.e., to what extent P_r|s(M, ε_r, ε_s) > P_r(M, ε_r). Even if this aim is achieved, it does not come without a drawback: the systems for which similarity is not reached are eliminated from the study, with probability 1 − P_s(ε_s). P_s|r(M, ε_s, ε_r) gives us an indication of the quality of our selection criterion by similarity.
An important limitation of this approach is the sample size. Even for large data sets, it may happen that the number of similar results, N_s(ε_s), is small, e.g., because at least one of the methods yields results systematically different from those of the other methods, or because ε_s was chosen too small. In such a case, the uncertainty of the empirical estimates becomes large.

Statistical Measures
Often, the distribution of errors is summarized by numbers, such as the mean error, the mean absolute error, and the standard deviation. Although these numbers convey some information, their use sometimes hides the implicit, and often wrong, assumption that the distribution of errors is normal. In the cases we analyze below, as in most cases we studied previously [5], the distributions of errors are not normal. This justifies the use of probabilistic estimators, such as those presented in our previous work [1][2][3], or the ones introduced here.
A direct link can be made between the statistics based on the empirical cumulative distribution function (ECDF) of the absolute errors, presented in ref. [3], and some of those introduced above:
• The reliability probability P_r(M, ε_r) is equivalent to the ECDF of the absolute errors, noted C(ε) in our previous work.
• The qth percentile Q_q(M) of the absolute errors is the value of ε_r such that P_r(M, ε_r) = q/100.
The conditional probabilities P_r|s(M, ε) and P_s|r(M, ε) will thus be represented as conditional ECDFs as a function of ε, generalizing our former probabilistic statistics.
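The equivalence between the ECDF and the percentiles can be sketched as follows (assuming numpy; the helper names are hypothetical):

```python
import numpy as np

def ecdf_reliability(errors, eps):
    """P_r(M, eps): the ECDF C(eps) of the absolute errors, evaluated at eps."""
    return np.mean(np.abs(errors) < eps)

def q_percentile(errors, q=95):
    """Q_q(M): the value of eps_r such that P_r(M, eps_r) = q/100."""
    return np.percentile(np.abs(errors), q)
```

The two functions are inverses of one another up to sampling resolution: `ecdf_reliability(errors, q_percentile(errors, q))` is approximately `q/100`.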

Guidelines
In order to obtain a better understanding of the situations arising from data extracted from the chemical literature, let us first consider pairs of points generated randomly according to normal distributions: each point is treated as a "system", where the value on the abscissa is interpreted as the "error" for M_1 and the value on the ordinate as the "error" for M_2. The results are presented in Figure 2. The panels on the left show the randomly produced "errors" (green dots).
The red stripe shows the reliability region for M_1, where |E(M_1, S)| < ε_r (cf. Equation (2)), and the blue stripe shows the same for M_2. The gray stripe shows the region where the results produced by M_1 and M_2 are within ±ε_s (Equation (3)). Some points are marked by numbers. The polygon with corners corresponding to the points (2, 4, 6, and 8) delimits the region where M_1 is both close to M_2 and reliable. The polygon with corners corresponding to the points (1, 3, 5, and 7) delimits the region where M_2 is both close to M_1 and reliable. The plots were drawn by choosing ε_r = ε_s = 1.
The ratio of the number of points in the red or blue stripe to the total number of points gives P_r. The ratio of the number of points in the gray stripe to the total number of points gives P_s. The ratio of the number of points in the polygons (1, 3, 5, and 7) or (2, 4, 6, and 8) to the number of points in the gray stripe gives P_r|s. The ratio of the number of points in the polygons to the number in the red or blue stripe gives P_s|r. The panels on the right show the dependence of the probabilities on ε_r = ε_s. The results for M_1 are in blue, those for M_2 in red. P_r(M_i) are drawn as thin curves, P_r|s as thick curves, and P_s|r as dashed curves.
The top row is produced for errors centered at the origin (the mean errors are equal to zero for both methods; the variance differs between the two methods). In the second row, the mean errors are different and non-zero. In the third row, a correlation is introduced between the errors produced by the two methods. In the last row, the effect of correlation is enhanced. In the first three rows, the parameters are inspired by those obtained for the PBE/HSE06 pair (see Section 3.2); in the last row, different parameters are used, namely those of PBE0/HSE06.

Let us start by discussing the first row. We see that, with the choices made for ε_r and ε_s, an important number of points is in the region where |E(M_i, S)| < ε_r (i = 1 or 2) and |E(M_1, S) − E(M_2, S)| < ε_s. However, some points that are within the reliable range for both methods (in the region where the red and blue stripes overlap) are not within the region of similarity (gray stripe). This could be corrected by increasing ε_s to 2 ε_r, but there is a price to pay: for each of the methods, the number of points selected increases by including systems for which the method does not yield reliable results. Furthermore, we notice that there are points that are reliable with one method but not similar to the other method. There are also points that are similar but unreliable (inside the gray stripe but outside either the red or the blue stripe). Finally, there are points where the methods are both unreliable and dissimilar (on a white background).

Let us now look at the evolution of probabilities with ε_r = ε_s (top right panel). We see that M_1 is globally of worse quality than M_2, as P_r(M_1) < P_r(M_2) (thin curves).
When first selecting the results by similarity and then checking the reliability, we see that the conditional probabilities P_r|s are close for M_1 and M_2 and better than the P_r curves. Checking similarity has eliminated part of the good results (those that were reliable with either M_1 or M_2) but provides a higher probability to obtain a good result. Note that, while P_r|s is slightly better for M_2 than for M_1, the reverse is true for P_s|r, a consequence of the division by P_r.
Let us now shift the point cloud and analyze the case where the mean errors are non-zero. If the shift for at least one of the methods is important, none of the "systems" produces similar results for the two methods (the point cloud is shifted outside the gray stripe). The figure shows an intermediate case, where the shift is not so important and plays a role mainly for M_1.
The similarity (gray stripe) essentially retains the results that are good for M_2 (within the blue stripe), because the number of points that are both similar and reliable for M_1 is reduced. As a result (second row, right panel), the similarity hardly improves the probability to obtain a good result for the better method (M_2) but eliminates a number of systems for which M_2 would provide reliable results. However, there is still an improvement for the method of lower quality (M_1).
Another effect reducing the improvement is the existence of a positive correlation between the "errors". This is exemplified in the last two rows, where the points are concentrated around a line. In the limit of perfect correlation, all points lie on a line. If the mean errors make the line lie within the similarity region, P_r|s = P_r: no gain is obtained through similarity. If the line lies outside the similarity region (outside the gray stripe), it is even worse: no point is selected by similarity, and if we rely on similarity only, we cannot use any of the calculations.
Note that the correlation between data produces an increase in P_s|r: if a method produces a reliable result, by correlation, it is likely that the other method also produces a reliable result, except when one of the methods is strongly biased compared to the other.
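The effect of correlation on P_s|r can be checked on synthetic bivariate normal "errors", in the spirit of the toy data of Figure 2. This is a sketch under assumed parameters; the function `simulate` and its defaults are illustrative, not the settings actually used for the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(rho, mu=(0.0, 0.0), sigma=1.0, n=100_000, eps=1.0):
    """Draw correlated normal 'errors' for two methods and estimate
    P_s|r for M1, i.e., P(similar | M1 reliable)."""
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    e1, e2 = rng.multivariate_normal(mu, cov, size=n).T
    reliable = np.abs(e1) < eps          # M1 reliable
    similar = np.abs(e1 - e2) < eps      # results of M1 and M2 similar
    return similar[reliable].mean()

# A stronger positive correlation raises the chance that a reliable
# result is also similar to the other method's result.
p_low, p_high = simulate(rho=0.1), simulate(rho=0.9)
```

With these parameters, `p_high` is markedly larger than `p_low`, consistent with the qualitative discussion above.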

Performance of Individual Methods
The errors in the band gaps are quite large for this set of methods. The mean absolute errors lie between 0.5 eV (HSE06) and 1.2 eV (LDA), while Q_95 varies between 1.7 eV (HSE06) and 3.2 eV (LDA) (Figure 3). The probability to have more reliable than unreliable results occurs at the median absolute error, which defines a minimal value ε_r = 0.33 eV for HSE06, the best method in this set. Figure 3 shows the dependence of P_r(M, ε) on ε. One can safely qualify HSE06 as the best (most reliable) among the methods, as P_r(HSE06, ε) is never smaller than any of the other P_r(M, ε) curves (within the sampling uncertainty). While PBE0 becomes competitive with HSE06 for ε > 1.7 eV, it behaves rather like SCAN for values of ε ≈ 1 eV and like the group of the three methods that perform worst (LDA, PBE, and PBEsol) for small values of ε.
It is important to keep in mind that, even if PBE0 and SCAN are identically reliable at the ε = 1 eV threshold (P_r(PBE0, ε = 1) ≈ P_r(SCAN, ε = 1)), this is not necessarily true for the same subset of systems. Figure 4 shows the absolute errors made by two methods (HSE06 and PBE or PBE0) and the distance between the results obtained with the two methods. We choose, for example, a threshold for the similarity of the two methods, ε_s = 1 eV. We take the same value for the threshold defining a method as reliable, ε_r = 1 eV.

Similarity and Reliability
If we assume that the similarity of the results is a good criterion to select the reliable results, the points should lie either in the bottom left rectangle (where the results are both similar and reliable) or outside both ranges (where they are neither).

Impact of Similarity on Reliability
Let us now look at the probabilities as functions of ε (we take ε_s = ε_r = ε); see Figure 5. As a reference, we plot P_r(M, ε) (thin curves, identical to the ECDF curves in Figure 3), the estimation of reliability when no similarity check is made. The thick curves correspond to P_r|s(M, ε), the probability to obtain, with method M, errors smaller than ε if the results of method M are within ±ε of the other method(s). The dashed curves indicate P_s|r(M, ε), the probability of a reliable result obtained with method M to be in the subset selected by similarity. The data set originally contains 471 systems. However, by making selections, e.g., of systems where the DFAs yield similar results, the size of the sample is reduced, and the use of statistical estimates is hampered. We estimate that, below 100 selected systems, the statistics become unreliable. The region for which the size of the sample is smaller than 100 is marked by a gray rectangle in Figure 5.
We notice that, for LDA and PBE, which provide close reliability curves, practically no distinction can be made between the thin and thick curves: similarity has no impact on reliability. We also see that P_s|r(M, ε) ≈ 1 for almost the whole range of ε: if one method gives a reliable result for a system, the other one is very likely to give a reliable result too. Thus, the sample of similar results reaches a size comparable to that of the complete sample already for a small value of ε (the gray rectangle is very thin).
The situation changes when we compare PBE to HSE06. The region where the size of the sample of similar results is below 100 extends up to a large value of ε ≈ 0.6 eV. For ε > 0.6 eV, we notice an improvement for each of the methods: P_r|s(M, ε) > P_r(M, ε). However, the improvement of the worse of the two methods does not compensate for the difference in quality between the two methods.
Even as ε increases beyond 0.6 eV, P_s|r(M, ε) is at first relatively small: if one method gives a reliable result, the probability that the other provides a reliable result too is relatively small. Unsurprisingly, the risk for the better of the two methods (HSE06) of eliminating systems by selection is higher than for the worse of the two methods (PBE); cf. dashed curves in Figure 5. In this case, one should take the result provided by the better of the two methods, not, e.g., the average of the results of the two methods.
The improvement comes at a price: for some of the systems, the methods provide results that are not similar and are thus not taken into consideration; we have no answer to give for these systems. Figure 6 shows an example: choosing PBE, we count how many systems from the data set are similar (within ε) to those obtained with another method. The graph confirms that an important number of systems is lost, unless one declares similarity by choosing a large value for ε.

Let us attempt to condense the results by looking at the values of ε for which the probability of having an absolute error smaller than ε is 0.95, Q_95(M); cf. Table 1. This provides only an exploration of the behavior at large ε. Nevertheless, it leads to the conclusions above: LDA and PBE do not gain by using similarity, with Q_95(LDA) = 3.1 eV and Q_95(PBE) = 2.9 eV even after similarity is imposed. However, Q_95(HSE06) decreases from 1.7 to 1.3 eV when the similarity with PBE is taken into account.

We can expect the errors of different DFAs to be highly correlated [2]. (For example, recall that making the approximation valid for the uniform electron gas is a basic ingredient in almost all DFAs.) In other words, this could mean that if one method is right, all are right, and if one method is wrong, all are wrong: little improvement can be expected from agreement between methods.
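The conditional Q_95 values of the kind reported in Table 1 can be sketched by restricting the absolute errors to the similarity-selected subset. The helper name `conditional_q95` is ours; it also returns the fraction of systems retained, to keep track of the loss of systems discussed above.

```python
import numpy as np

def conditional_q95(e1, e2, eps_s):
    """Q_95 of |E(M1)| restricted to the systems where the two methods
    agree within eps_s; also returns the fraction of systems retained."""
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    similar = np.abs(e1 - e2) < eps_s
    if not similar.any():
        # No system selected by similarity: Q_95 is undefined.
        return float("nan"), 0.0
    return np.percentile(np.abs(e1[similar]), 95), similar.mean()
```

Comparing this value to the unconditional `np.percentile(np.abs(e1), 95)` reproduces, for a given pair of methods, the kind of gain (or absence of gain) discussed for Table 1.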
Another measure of similarity is Spearman's rank correlation coefficient (Table 2). It varies between 0.76 (LDA and PBE0) and 0.99/1.00 (within the group of lower performance: LDA, PBE, and PBEsol). For PBE and HSE06, it takes an intermediate value (0.83). The correlation coefficients give a hint for grouping the methods; however, it is more difficult to extract from them the information given in Figure 5 than it is from Q_95(M). Another problem with using the correlation index is its invariance with respect to a monotonic transformation of the calculated values (a linear transformation for the Pearson correlation). If one of the methods is biased and another is not, these methods are not likely to give similar results, despite a high correlation index. Of course, this dissimilarity can be reduced by correcting the bias, typically by subtracting the estimated mean error from the values obtained. Figure 7 shows the probability that the results of two methods are similar (within ε). The similarity of LDA and PBE can be recognized immediately, as well as the dissimilarity between LDA and PBE0 or HSE06. One can also notice the improvement after centering the errors (i.e., correcting the bias by subtracting the mean signed error for each of the methods). At the same time, the difference between the methods (PBE0 and HSE06) is reduced.
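The distinction between rank correlation and similarity can be made concrete as follows. This is a sketch: `spearman` is a minimal rank-correlation implementation valid for tie-free data, and `p_similar` mimics the bias-centering used for Figure 7; both names are ours.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation for tie-free data:
    the Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def p_similar(e1, e2, eps_s, center=False):
    """Probability that two error sets agree within eps_s, optionally
    after subtracting each method's mean signed error (bias correction)."""
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    if center:
        e1, e2 = e1 - e1.mean(), e2 - e2.mean()
    return np.mean(np.abs(e1 - e2) < eps_s)
```

For instance, with `e2 = e1 + 5`, `spearman(e1, e2)` is 1 while `p_similar(e1, e2, 1.0)` is 0; after centering, the similarity probability becomes 1, illustrating why correlation does not imply similarity.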
The numerical results are given in Table 3. The similarity of LDA, PBE, and PBEsol is clearly visible from these numbers. Table 3. The mean and the standard deviation of the probability distribution of having two DFAs give similar results, µ_s(M_1, M_2) (σ_s(M_1, M_2)), Equations (12) and (13).

Let us now increase the number of methods that we are considering. Taking into account the closeness of the results of LDA, PBE, and PBEsol, we do not expect any gain from considering the similarity of these three methods. However, one might ask whether comparing PBE, HSE06, and SCAN, or PBE, HSE06, and PBE0, provides any improvement. In the first case, Q_95(HSE06) stays at 1.3 eV; in the second, it slightly increases to 1.4 eV.
Increasing the number of methods for similarity checks does not necessarily improve reliability (as one increases the number of "bad" methods to compare with). Considering all six methods provides, at best, Q_95(M) ≈ 1.3 eV, while the best value in Table 1 is 1.1 eV. This can also be seen in Figure 8, the analogue of Figure 5, showing the probabilities obtained when similarity among all six methods is taken into account. It also shows the increase of the region of poor sampling.

Eliminating Strange Results?
The distribution of errors in density functional approximations is often not normal [5]. This can be seen in Figure 9. It seems that similarity confirms (in part) the prejudice that a strange behavior of one method is not repeated by another, different method. After restricting the data set to similar values, the distribution of errors is more compact. This explains the lowering of Q_95(M). Recall, however, that wrong results obtained with both methods are not excluded. Figure 9. Histograms showing the distribution of errors before and after introducing similarity (left and right panels for ε = ∞ and ε = 1 eV, respectively), for PBE (red) and HSE06 (blue). The normal distributions using the mean and standard deviation of these error distributions are shown as curves.

The ZAS2019 Dataset
The effective atomization energies (EAE) for the QM7b dataset [16], for molecules with up to seven heavy atoms (C, N, O, S, and Cl), are taken from the study by Zaspel et al. [17]. We consider here the values for the cc-pVDZ basis set and the prediction errors for 6211 systems for the SCF, MP2, and machine-learning (SLATM-L2) methods with respect to CCSD(T) values, as analyzed by Pernot et al. [18].
In contrast to the case of the DFAs presented in the previous section (Table 2), the errors in this dataset present negligible rank correlation coefficients (smaller than 0.1 in absolute value). Similarity will thus be dominated by the bias in the errors and their dispersion. The P_r(M, ε) and P_r|s(M, ε) curves are shown in Figure 10. When comparing HF to MP2, one sees that both methods benefit from similarity as soon as ε > 0.2 kcal/mol. Naturally, HF benefits much more from the similarity selection than MP2: its Q_95 decreases from 6.1 to 4.2 kcal/mol, while Q_95 for MP2 decreases slightly, from 3.4 to 2.9 kcal/mol. A similar behavior is observed in the comparison of HF to SLATM-L2, with a larger onset of improvement for SLATM-L2 (ε ≈ 0.5 kcal/mol). For HF, Q_95 decreases from 6.1 to 3.8 kcal/mol, and for SLATM-L2 from 4.7 to 2.5 kcal/mol. The comparison of MP2 to SLATM-L2 provides an intermediate case, where both methods present more balanced improvements: for MP2, Q_95 decreases from 3.4 to 2.4 kcal/mol, and for SLATM-L2 from 4.7 to 1.9 kcal/mol. Adding HF to the MP2/SLATM-L2 pair produces a marginal gain for the latter methods, whereas HF presents a strong gain in reliability: the final Q_95 values are 2.5 (HF), 1.8 (MP2), and 1.7 kcal/mol (SLATM-L2). However, this comes at the price of a large number of system rejections: for ε ≈ 2.0 kcal/mol, only about one-fourth of the 6211 systems are selected by their similarity. For comparison, this fraction is about two-thirds for the MP2/SLATM-L2 comparison.
In this context of uncorrelated error sets with different accuracy levels, one sees that similarity selection has a notable positive impact on the reliability of predictions by any of the methods, even the most accurate ones. It is striking that MP2 or SLATM-L2 might benefit from comparison with HF, but, as already discussed for band gaps (Figure 9), this proceeds mainly by elimination of systems with large errors.

Conclusions
We asked whether picking only results that are similar across different methods would improve the accuracy of their predictions (in spite of possibly eliminating a significant part of the calculations performed). The use of probabilities to treat reliability and similarity was illustrated on two benchmark data sets: one of band gap calculations with different density functional approximations, the other of effective atomization energies with two ab initio methods and one machine-learning method.
For the properties and methods studied, the thresholds for reliability and similarity were chosen quite generously. For the band gap data set, we found that similarity of the density functional results had only a marginal impact on improving the prediction accuracy. This is consistent with previous findings that the differences between density functional approximations are less important when considering the error distributions [1,2], or taking into account experimental uncertainty [19].
For the effective atomization energies data set, in which the error sets are uncorrelated, notable improvements of reliability after similarity selection were observed for all methods, even the most accurate ones. Roughly, we observed two categories of results:
1. methods that always give close results, for which similarity is irrelevant; and
2. methods for which an improvement can be achieved, especially by eliminating certain systems that behave strangely with one or the other method; similarity is mainly effective for eliminating large errors.
Note that the size of the data sets might have an impact on the uncertainty of all the statistics. For the smaller datasets, this uncertainty might be comparable with the observed differences between statistics. Bootstrapping approaches, such as the ones used in our previous works [1,3], could be used to quantify this effect. This was not the focus of the present study, and uncertainty management will be considered in forthcoming research.