On the Use of Benchmarks for Multiple Properties

Benchmark calculations provide a large amount of information that is useful both for assessing the performance of density functional approximations and for choosing which one to use. To condense this information, summary indicators are usually provided. These indicators can be insufficient, however, and a more careful analysis may be needed, as shown by examples drawn from an existing data set for cubic crystals.


Introduction
An increase in computing power has allowed the replacement of personal experience with databases (see for instance [1][2][3][4][5][6]). In the realm of density functional theory, these have become a valuable tool for both tuning and tailoring new methods (see [7][8][9][10] for recent examples) and, in particular, to assess the performance of density functional approximations [5,11,12]. Ultimately, benchmarks should help computational chemists in choosing the best method to be adopted in a new study. However, the large amount of available data requires synthetic and reliable indicators [13][14][15] capable of providing a ranking based on the quality of the approach. Unfortunately, these indicators do not always give the necessary information, so one has to go back to the database and analyze the data according to the objectives of the study.
Some examples are given below, where indicators might lead to erroneous conclusions: 1. choosing the method giving the best results for two properties, A and B; 2. choosing the method giving the best results for property B, knowing that property A is well described.
Benchmark calculations for density functionals on some cubic crystals, provided in [16], will be used as a concrete example.
It is not the purpose of this paper to rank density functionals, or to advise for or against any of the density functionals cited here: the questions raised are not connected to any specific functional. Their names appear only in order to facilitate reading and enable reproducibility.
Suppose that a benchmark indicates that method X is better than method Y for property A, and also for property B. Which method should be chosen: 1. when good results are needed for both property A and property B? 2. when it is guaranteed (it can be checked) that A is well described, but good results for property B are also needed?
The quick answer would be to use method X in both cases (1) and (2). On brief reflection, however, it becomes evident that knowing that X is better for A and for B separately is not sufficient.

Two Properties Simultaneously Needed
In order to formalize the problem, let us call the set of systems in the benchmark database $S$. The total number of systems is $N(S)$. A subset $S_{M,P}$ gives "good" results with method $M \in \{X, Y\}$ for property $P \in \{A, B\}$. The number of elements in $S_{M,P}$ is $N(S_{M,P})$. The probability of obtaining a "good" result with method $M$ for property $P$ is given by $p_{M,P} = N(S_{M,P})/N(S)$. We say that method X is "better" than Y for property $P$ when $N(S_{X,P}) > N(S_{Y,P})$, or equivalently $p_{X,P} > p_{Y,P}$.
We now consider the case where $M = X$ is better than $M = Y$ both for $P = A$ and for $P = B$. This is represented schematically in Figure 1 by disks corresponding to the subsets $S_{M,P}$. The color of a disk corresponds to the property (blue for property A, orange for property B). The disks in the left panel, corresponding to $M = X$, are larger than those in the right panel, corresponding to $M = Y$, indicating that $N(S_{X,P}) > N(S_{Y,P})$. However, we have no information about the intersection $S_{M,A} \cap S_{M,B}$, that is, the set of cases in which properties A and B are both well described by method M. We cannot exclude that method X gives "good" results for a larger number of systems than method Y for A and for B separately, $N(S_{X,P}) > N(S_{Y,P})$, while the number of systems for which the results are good for both A and B is smaller for X than for Y. This is the situation drawn in Figure 1, where the overlap of the disks corresponding to $S_{X,A} \cap S_{X,B}$ (left panel) is smaller than that corresponding to $S_{Y,A} \cap S_{Y,B}$ (right panel). In such a case, when "good" results are desired for both properties A and B, it is better to choose method Y, although method X was better when each property was analyzed separately.
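The situation described above can be reproduced with a few lines of set arithmetic. The following is a minimal sketch with purely hypothetical subsets (the system labels, the dictionary `good`, and the helper names `p_single` and `p_joint` are illustrative assumptions, not data or code from [16]). It shows a case where X has the larger success fraction for A and for B separately, while Y has the larger fraction for A and B together.

```python
# Hypothetical benchmark of 10 systems labelled 0..9 (illustrative only,
# not taken from the data of [16]).
systems = set(range(10))

# good[M][P] plays the role of the subset S_{M,P}: the systems for which
# method M gives a "good" result for property P.
good = {
    "X": {"A": {0, 1, 2, 3, 4, 5}, "B": {4, 5, 6, 7, 8, 9}},
    "Y": {"A": {0, 1, 2, 3, 4}, "B": {0, 1, 2, 3, 5}},
}


def p_single(method, prop):
    """Fraction of systems well described by `method` for `prop`."""
    return len(good[method][prop]) / len(systems)


def p_joint(method):
    """Fraction of systems well described for both A and B."""
    return len(good[method]["A"] & good[method]["B"]) / len(systems)


for m in ("X", "Y"):
    print(m, p_single(m, "A"), p_single(m, "B"), p_joint(m))
```

With these made-up subsets, X succeeds for 6 of 10 systems on each property and Y for 5 of 10, yet only 2 systems are well described for both properties by X, against 4 by Y.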
To be more specific, let us consider the data for cubic crystals given in [16], and choose the lattice constants (LC) as A and the bulk moduli (BM) as B. We consider a result to be "good" if it reproduces the reference value within 3 pm for lattice constants, and within 3 GPa for bulk moduli. The probabilities of obtaining "good" results with three different density functional approximations (namely LDA [17,18], PBEsol [19] and HISS [20,21]) are given in Table 1.
Figure 1. Diagrammatic explanation that method X can be better than method Y for property A and property B when taken separately, but method Y is better when both A and B are needed. Blue disks: cases when the method works well for property A; orange disks: cases when the method works well for property B; (left) method X; (right) method Y.

HISS gives the best results both for LC and BM. PBEsol comes next and the local density approximation is the worst. However, when we consider the performance for both LC and BM, LDA, PBEsol, and HISS perform equally. Note that the success probability is rather low.
We would like to stress that the numbers presented in the tables are meant only to show that the effects discussed here can occur. The size of the data set is too small to allow conclusions about the quality of the functionals.
We now turn to the second case, in which property A is known to be well described and a good result is also needed for property B. The relevant probability for method M is then not the joint probability $p_{M,A \cap B} = N(S_{M,A} \cap S_{M,B})/N(S)$ considered above, but the probability of obtaining a good result for B given that the result for A is good, $p_{M,B|A} = N(S_{M,A} \cap S_{M,B})/N(S_{M,A})$. The reference set is now not the full set of data, $S$, but the subset of results reliable for A, $S_{M,A}$. Using the same example as above, we now find that $p_{M,BM|LC}$ increases from HISS to PBEsol, and to LDA (Table 1), in reverse order of the probabilities obtained for LC and BM individually.
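The individual, joint, and conditional success fractions can all be obtained from per-system errors. The sketch below is one possible way to do this, assuming that "within 3 pm" and "within 3 GPa" mean an absolute error not exceeding the tolerance; the function name `success_probabilities` and the error values are hypothetical and do not come from [16].

```python
def success_probabilities(err_a, err_b, tol_a, tol_b):
    """Success fractions for A, for B, for both, and for B given A.

    err_a, err_b: per-system errors (calculated minus reference), same order.
    tol_a, tol_b: tolerances defining a "good" result (assumed inclusive).
    """
    ok_a = [abs(e) <= tol_a for e in err_a]
    ok_b = [abs(e) <= tol_b for e in err_b]
    n = len(ok_a)
    n_a = sum(ok_a)
    n_b = sum(ok_b)
    n_ab = sum(a and b for a, b in zip(ok_a, ok_b))
    return {
        "p_A": n_a / n,
        "p_B": n_b / n,
        "p_A_and_B": n_ab / n,
        # The reference set is the subset that is good for A, not the full set.
        "p_B_given_A": n_ab / n_a if n_a else float("nan"),
    }


# Hypothetical errors for six systems: lattice constants in pm, bulk moduli in GPa.
err_lc = [1.0, -2.0, 4.0, 0.5, -5.0, 2.5]
err_bm = [5.0, 1.0, -2.0, -4.0, 0.5, 2.0]

result = success_probabilities(err_lc, err_bm, tol_a=3.0, tol_b=3.0)
print(result)
```

In this made-up example four systems are good for LC, four for BM, and two for both, so the joint fraction is 2/6 while the conditional fraction is 2/4: the two measures answer different questions and need not rank methods in the same way.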

Remark 1.
The problem presented in this paper is related to the lack of positive correlation between the errors made when computing different properties [23]. In the example given in Table 1, the rank correlation coefficients between the errors for lattice constants and bulk moduli are: −0.51 (LDA), −0.24 (PBEsol) and −0.65 (HISS).
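A rank correlation between the errors for two properties can be computed as sketched below. The text does not state which rank correlation coefficient was used, nor whether the errors are signed; the sketch assumes Spearman's coefficient on signed errors, and the helper names and input values are hypothetical.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = average_rank
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical signed errors for five systems (LC in pm, BM in GPa).
err_lc = [1.0, 2.0, 3.0, 4.0, 5.0]
err_bm = [-1.0, -3.0, -2.0, -5.0, -4.0]
rho = spearman(err_lc, err_bm)
print(rho)
```

For these invented errors the coefficient is negative: systems with a larger error in the lattice constant tend to have a more negative error in the bulk modulus.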

Improving the Quality of the Approximations Reduces the Risk of Unreliable Selection
The risk of unpleasant surprises such as those presented above comes from the low quality of the approximations: in the limiting case when one of the methods gives perfect agreement for both properties and the other does not, there is no doubt about which method to choose. In the following we will use a simple approach to improve the performance of the approximations and repeat the analysis made above.
The previous section uses the results directly provided by density functional approximations. A careful analysis of the data reveals that the parametrizations were not good enough to eliminate systematic errors. Having an exact density functional would obviously solve the problems presented above. An efficient way to correct, at least partially, the errors of current density functionals is to apply a statistical correction, e.g., a linear transformation [16,24,25]. This correction eliminates the main part of the systematic errors, which is a necessary step in evaluating prediction uncertainty [16].
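As an illustration of what a linear correction can look like, the sketch below fits reference values against calculated values by ordinary least squares and applies the fitted line to the calculated values. This is only one plausible reading of "linear transformation"; the procedure actually used in [16] may differ, and the function names and numbers are hypothetical.

```python
def fit_linear(calc, ref):
    """Least-squares fit of ref ~ a + b * calc; returns (a, b)."""
    n = len(calc)
    mean_c = sum(calc) / n
    mean_r = sum(ref) / n
    sxx = sum((c - mean_c) ** 2 for c in calc)
    sxy = sum((c - mean_c) * (r - mean_r) for c, r in zip(calc, ref))
    b = sxy / sxx
    a = mean_r - b * mean_c
    return a, b


def correct(calc, a, b):
    """Apply the linear correction to calculated values."""
    return [a + b * c for c in calc]


# Hypothetical values with a systematic overestimation (arbitrary units).
calc = [4.10, 5.12, 3.07, 6.15, 4.61]
ref = [4.00, 5.00, 3.00, 6.00, 4.50]

a, b = fit_linear(calc, ref)
corrected = correct(calc, a, b)

mean_error_before = sum(c - r for c, r in zip(calc, ref)) / len(ref)
mean_error_after = sum(c - r for c, r in zip(corrected, ref)) / len(ref)
print(a, b, mean_error_before, mean_error_after)
```

A least-squares fit with an intercept removes the mean error on the fitting set by construction; what remains is the scatter around the line, which is the quantity relevant for a prediction uncertainty.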
We now use corrected methods and evaluate their performance on the basis of prediction uncertainty, as reported in [16]. For the same methods as above, one can estimate the success probabilities reported in Table 2. One sees that the success probability has notably increased for LC and is slightly lower for BM. In this group, HISS is no longer the best method for LC, although it still is for BM.

Table 2. Probability that a given method gives "good" results for lattice constants, $p_{M,LC}$, for bulk moduli, $p_{M,BM}$, and for both of them, $p_{M,LC \cap BM}$. The uncertainty on all reported values, estimated by the Agresti-Coull formula [22], is about 0.1 for a data set of size 28. Results are obtained using corrected methods.

Comparing PBEsol and HISS individually with LDA, we notice that the joint and conditional probabilities preserve the supremacy of both "best" methods for individual properties. With PBEsol, as LC is perfect, the error comes only from BM, and the joint and conditional probabilities become equal to $p_{M,BM}$: when one property is perfect, the error of the other determines everything. For HISS, LC is not so good, but BM is better, so the joint probabilities are not worse than for PBEsol and the conditional probabilities are even better than for PBEsol.
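The size of the uncertainty quoted for Table 2 can be checked with a short calculation. The sketch below implements the Agresti-Coull adjusted proportion and its standard uncertainty; the choice z = 1.96 for the adjustment and the reading of "about 0.1" as a standard uncertainty are assumptions, and the function name `agresti_coull` and the success count are hypothetical.

```python
import math


def agresti_coull(successes, n, z=1.96):
    """Agresti-Coull adjusted proportion, its standard error, and interval.

    successes: number of "good" results; n: number of systems;
    z: normal quantile used in the adjustment (1.96 for about 95 %).
    """
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    se = math.sqrt(p_adj * (1 - p_adj) / n_adj)
    lower = max(0.0, p_adj - z * se)
    upper = min(1.0, p_adj + z * se)
    return p_adj, se, (lower, upper)


# Hypothetical count: 14 "good" results out of the 28 systems of the data set.
p_adj, se, interval = agresti_coull(14, 28)
print(p_adj, se, interval)
```

For 28 systems the standard error is largest at a proportion of one half, where it stays a little below 0.1, in line with the value quoted in the caption of Table 2. With uncertainties of this size, differences of a few hundredths between methods are not significant.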

Conclusions
The wealth of methods available, e.g., density functional approximations, requires a selection to be made prior to undertaking a study. This selection can be based on benchmark data sets. However, the information condensed from such data sets can be misleading and should be adapted to the study for which the method is chosen.
If a benchmark provides the information that a method X is better than a method Y for some properties A, B, . . . it does not necessarily mean that the method X is better when these properties are all needed for a given study. In other words, the probability of obtaining a "good" result for each of the properties is not the same as the probability of obtaining a "good" result for all the properties. Similarly, when a property can be tested and only systems that pass this test are considered, the statement that a given method is superior to the other methods for each of the properties is insufficient for choosing a functional for the remaining properties. Numerical results from existing benchmarks show that such situations can appear.
The solution to these problems is relatively simple, but one has to go back to the full set of data used in the benchmark and construct the measure relevant to the project. Unfortunately, this is not always possible: benchmarks are not always constructed using the same set of molecules for different properties.
A final remark: although we used the "probability" of obtaining a "good" result as the measure in this paper, confusions such as those indicated here can also arise for other measures rating the quality of an approximation.
Author Contributions: AS designed the study. AS and PP performed the calculations, analyzed the results, and drafted and finalized the paper. BC, RD and DP provided the necessary computational and reference data and contributed to the discussion and redaction. All the authors have read and approved the final manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.