The Good, the Bad, and the Ugly: “HiPen”, a New Dataset for Validating (S)QM/MM Free Energy Simulations

Indirect (S)QM/MM free energy simulations (FES) are vital to efficiently incorporating sufficient sampling and accurate (QM) energetic evaluations when estimating free energies of practical/experimental interest. Connecting between levels of theory, i.e., calculating ΔAlow→high, remains to be the most challenging step within an indirect FES protocol. To improve calculations of ΔAlow→high, we must: (1) compare the performance of all FES methods currently available; and (2) compile and maintain datasets of ΔAlow→high calculated for a wide-variety of molecules so that future practitioners may replicate or improve upon the current state-of-the-art. Towards these two aims, we introduce a new dataset, “HiPen”, which tabulates ΔAgasMM→3ob (the free energy associated with switching from an MM to an SCC−DFTB molecular description using the 3ob parameter set in gas phase), calculated for 22 drug-like small molecules. We compare the calculation of this value using free energy perturbation, Bennett’s acceptance ratio, Jarzynski’s equation, and Crooks’ equation. We also predict the reliability of each calculated ΔAgasMM→3ob by evaluating several convergence criteria including sample size hysteresis, overlap statistics, and bias metric (Π). Within the total dataset, three distinct categories of molecules emerge: the “good” molecules, for which we can obtain converged ΔAgasMM→3ob using Jarzynski’s equation; “bad” molecules which require Crooks’ equation to obtain a converged ΔAgasMM→3ob; and “ugly” molecules for which we cannot obtain reliably converged ΔAgasMM→3ob with either Jarzynski’s or Crooks’ equations. We discuss, in depth, results from several example molecules in each of these categories and describe how dihedral discrepancies between levels of theory cause convergence failures even for these gas phase free energy simulations.


Introduction
Calculating accurate free energy differences from simulation involves two disparate requirements: accurate energetic evaluations (e.g., semi-empirical quantum mechanics (SQM) or high level electronic structure methods (QM)) and sufficiently long simulations to appropriately sample relevant regions of , a free energy difference of interest "0 → 1" at the desired high (we use the term "low" to refer to any level of theory capable of conducting sufficient sampling, while the term "high" to refer to any level of theory that is too computationally expensive to conduct sufficient sampling for FES but that provides accurate energetics for evaluating FES) level of theory, is a state function, it can be calculated by employing the thermodynamic cycle shown in Figure 1. Specifically, Here, ∆A low 0→1 is a standard FES at the low level of theory, and the steps ∆A connect the low level back to the high level of interest. The low level is chosen such that sufficient sampling can be conducted. Further, if a force field is used for the low level, several tricks of the trade, e.g., soft-core potentials [14,15], may help facilitate alchemical transformations [2]. In particular, if ∆A low 0→1 is carried out at the MM level of theory, any standard free energy estimator can be used for its calculation, such as free energy perturbation (FEP) [16], thermodynamic integration (TI) [17], Bennett's Acceptance Ratio (BAR) [18], its multi-state extension MBAR [19], vFEP [20], WHAM [21], etc. Correction legs (ii) and (iv) have traditionally been calculated using FEP, Equation (2), written here for the specific application of connecting the low and high levels of theory: Here, k B and T have the usual meaning of Boltzmann's constant and temperature, and . . . low denotes an ensemble average generated at the low level of theory and in the canonical ensemble. This is, of course, highly advantageous as costly simulations at the high level of theory are not needed.
In a post-processing step, the difference ∆U low→high = U high − U low is computed for every frame saved during the low level simulation. For this reason, FEP is a so-called "one-sided" method as it only requires simulation from "one side" of the free energy difference.
Although Equation (2) is formally exact, recent research has shown conclusively that FEP, when applied to the calculation of ∆A low→high , rarely gives converged free energy differences [22][23][24][25][26][27][28][29]. In fact, by now few would dispute the statement that FEP cannot be used to compute ∆A low→high for systems of practical interest. In order for Equation (2) to converge in practice, at least some configurations sampled at the low level of theory also need to be low energy configurations at the high level of theory (cf. [30]). Typically, however, there are disparities in "stiff" (bonds and angles) and "soft" (dihedrals) degrees of freedom between low (MM or SQM) and high (SQM or QM) levels of theory. Simply put, in many cases, an "MM" molecule does not look like a "QM" molecule, and slight differences between these structures can result in drastic convergence errors [22][23][24][25][26][27][28]. These disparities are exacerbated as the size of the high level/QM region increases. More generally, in most cases, the phase space sampled at the low level of theory has little or no overlap with the phase space which would be sampled if the system of interest were treated at a high level of theory. Following, e.g., Pohorille et al. [30], phase space overlap can be quantified by comparing the distribution of forward (low → high) energy differences p(∆U f w ) and (negative) backward (high → low) energy differences p(−∆U bw ) (cf. below) (For example, p(∆U f w ) is the histogram obtained from configurations saved during a simulation at the low level of theory, for which one computes ∆U f w = U high − U low . The availability of p(−∆U bw ), of course, depends whether simulations at the high level of theory could be carried out).
The failure of FEP would suggest using more efficient methods to compute ∆A low→high ; however, in the present context (as compared to regular alchemical FES), the choice is severely limited. Consider, e.g., Bennett's Acceptance Ratio (BAR, Equation (3)), a widely used method for calculating ∆A's [31]. If used to evaluate ∆A low→high as in (ii) or (iv), it takes the form: where f (x) = (1 + exp(x/k b T)) −1 and As one sees from the use of ... low and ... high in Equation (3), BAR requires simulation at both endstates, thus is a so-called "two-sided" method. The need for simulations at both levels of theory makes the use of BAR problematic, i.e., computationally too expensive. Thus, the search for reliable, less expensive methods to compute correction steps between low and high levels of theory continues. We, and many others in the FES community, have explored many potential solutions including reweighting schemes [23], force-matching techniques [32], and nonequilibrium work methods [33,34].
Given the underlying problem, poor or non-existing overlap between phase space sampled at the low and high level of theory, there are two paths towards reliably calculating ∆A low→high . One possible path is to enhance efficiency of methods used to compute free energy differences between disparate levels of theory. Another possible path may be to make the low level of theory "look" more similar in configurational space to the high level of theory. Our proof-of-concept results using force-matching in Reference [32], as well as the results of other groups (e.g., [35-50]), are examples of the latter strategy. Concerning the former, we successfully used nonequilibrium work techniques to compute ∆A low→high [33,34]. In particular, we explored the utility of Jarzynski's (JAR) equation [51], the nonequilibrium analog to FEP. Formally, one replaces the energy difference ∆U low→high in Equation (2) by the nonequilibrium work W low→high needed to bring the system from a low to high level description: Although the nonequilibrium switching simulations needed to obtain W low→high do require evaluation of high level energies and forces at every step, two factors make such calculations practicable. First, at least in our tests to date, rather short switching simulations (a few hundred to a few thousand MD steps per switch) were sufficient. Second, these switching simulations are a post-processing step started from coordinate/velocity sets saved during equilibrium simulations at the low level of theory (cf. Materials and Methods). Therefore, they can be run in parallel, making the JAR calculation scheme much more computationally efficient than conducting a sufficiently long high level simulation, e.g., as needed for BAR.
Given that FEP cannot be used to calculate free energy differences ∆A low→high reliably, another challenge in benchmarking performance of FES estimators such as FEP and JAR is obtaining reference results. In the past, we have used BAR, as well as its nonequilibrium analog the Crooks' equation (CRO) [52], to generate reference results [33,34]. As with FEP and JAR, energy differences in BAR are formally replaced by nonequilibrium work values to give CRO, i.e., Of all the methods discussed, CRO is the most computationally expensive. As with BAR, one must conduct long equilibrium simulations at both levels of theory and then conduct nonequilibrium switching simulations, this time launched not only from low → high but also from high → low. In real applications CRO, and most likely BAR, are far too expensive to calculate ∆A low→high within an indirect scheme of practical interest; however, in the context of methodological work, they provide a means to obtain reference results for comparison to cheaper methods.
While our earlier work has demonstrated the utility of both nonequilibrium work methods, in particular JAR [33,34], and force-matching approaches [32] for the computation of ∆A low→high , the techniques were tested only on a relatively small number of systems. To advance the state of the art, a broader test of the existing methodology is required and is the subject of the current study. In previous unrelated work [53], we used ParamChem (https://cgenff.umaryland.edu, a web-interface for automatically predicting parameter and topology sets for small molecules [54,55]) to obtain CHARMM generalized force-field (CGenFF) parameters [56] for the Maybridge Hitfinder set [57]. As part of the ParamChem procedure, "penalties" are assigned that indicate the expected quality of generated parameters. From the full Maybridge set, we then selected 22 molecules that: (1) represent chemical diversity seen in medicinal chemistry, as the Maybridge set includes molecules that are drug-like according to Lipinski's rule of 5; and (2) had high penalties for bonded and/or charge parameters. We expect these systems, shown in Figure 2, to be challenging cases when computing ∆A low→high . Because of the high parameter penalties, we refer to our chosen set as "HiPen".
Given the diversity of the compounds chosen, we view our HiPen set as a benchmark set that can be used to compare methods for computing ∆A low→high in the context of indirect (S)QM/MM FES. In related areas of computational chemistry, extensive benchmark sets have proven very useful. For example, the Benchmark Energy and Geometries Database (BEGDB, http://www.begdb.com) is a highly utilized, online computational resource where quantum quality energies and properties for a wide variety of molecules are deposited. BEGDB has a stated purpose of "serv[ing] as benchmarks for testing and parameterization of other computational methods." Datasets maintained in BEGDB have been cited ≈2400 times, with some of the more frequently cited datasets being S22 (1079 citations), and S66 (454 citations). Similarly, the Minnesota Databases 2.0 (MN2.0) are a large collection of datasets for comprehensive validation. Many of the works citing these BEGDB and MN2.0 datasets are methodological investigations aiming to develop, improve, or validate, the performance of Density Functional Theory (DFT). For example, several DFT functionals that have recently (since 2015) been derived, validated, or improved upon using BEGDB or MN2.0 datasets include but are certainly not limited to: B97M-V [58], the occ-RI-K algorithm [59], ωB97M-V [60], minimally adaptive basis (MAB) [61], ωB97M(2) [62], revised M06 (revM06) [63], revised M06-L [64], and MN15 [65], as well as several large review-style validation studies of DFT methods in general [66,67]. Additionally, the MN2.0 databases were used by Peverati and Truhlar to search for a "universal" density functional in 2014 [68]. For good measure, we also list these seminal works in the DFT development field [69][70][71][72][73]. (It should be noted these references by no means represent a complete list of all DFT improvements facilitated by BEGDB and MN2.0. For a more complete literary listing, travel to http://www.begdb. com and https://comp.chem.umn.edu/db/ and follow links to each dataset's debut publications, and then view all citing publications.) Thus, as online repositories of maintained datasets, BEGDB and MN2.0 represent invaluable resources to the quantum chemistry modeling discipline and in turn chemistry at large. In the area of force-field focused alchemical FES, benchmarks and comparative tests exist as well; e.g., a study comparing results for relative free energy differences obtained with the most widely used programs was just published [74]. The datasets forming the basis for the various SAMPL competitions are another excellent source of curated experimental reference values [75][76][77]. Several test systems can be downloaded from the "alchemistry.org" web site (see https://www.alchemistry.org, follow the link "Test System Repository"). Since nothing comparable to these alchemical collections or to aforementioned QM benchmark sets yet exists for indirect (S)QM/MM FES, we view the HiPen set ( Figure 2) as the core of a benchmark in this area. Studies with a related goal include work by Cave-Ayland et al. [25], in which forward p(∆U f w ) and backward energy distributions p(−∆U bw ) for a number of compounds are computed and compared systematically, as well as the work in Reference [29], where Ryde searched for criteria to determine whether ∆A low→high values have converged or not. Clearly, it should be useful to have available reference results against which novel methodological developments can be compared. The present study is the beginning of such a database, and we ask others in the field to join us.
In Reference [34], we discerned three overall factors contributing to the difficulty of obtaining converged results: (i) subtle differences in bond lengths and angles (i.e., the "sti f f " degrees of freedom); (ii) different conformational preferences, such as preferred ranges for dihedrals (i.e., "so f t" degrees of freedom); and (iii) differences in charge distribution of the region either described by the low or the high level of theory. The last complication arises only in aqueous solution or in a protein-ligand complex. Here, we concentrate on the first two complications, mismatches in stiff and soft degrees of freedom; hence, all calculations here are carried out in the gas phase. Further, in the present work, we use the MM force field as the low level "as is", i.e., we do not attempt to improve phase space overlap through force-matching or related techniques. Additionally, for the purposes of generating reference results via two-sided methods such as BAR and CRO, we have chosen to utilize a semi-empirical method as our high level of theory, as it is still cheap enough to achieve relatively efficient sampling. Specifically, we are interested in whether, at least for some systems, FEP is enough to compute ∆A low→high gas , whether JAR with short switching protocols is sufficient for converged results, or whether two-sided methods, which are too expensive for general use, are needed.
In addition to reporting ∆A MM→3ob gas obtained with each of these methods, we report criteria we have in the past found useful to identifying failures in "convergence" (obtaining the correct ∆A low→high within reasonable certainty). These include comparing differences in magnitude of "forward" and "backward" ∆A's (i.e., ∆A MM→3ob vs. −∆A 3ob→MM ) [23], calculating "sample size hysteresis" (Equation (10) in Reference [34]), calculating the standard deviation in ∆U low→high /W low→high and vice versa [78] (cf. [29]), calculating distribution overlaps in ∆U low→high and W low→high values [7,[32][33][34]78], and finally applying the Π criterion introduced by Wu and Kofke [79] to the forward/backward energy distributions or work distributions [32,78]. To understand difficult cases and failures, we also computed distributions of dihedral angles sampled during simulations at the two levels of theory, as well as at the end of forward and backward switching simulations [23,33,34].

Results
Given the average energy difference between an MM and 3ob calculation differs on the order of 10 5 kcal/mol, ∆A MM→3ob gas by default can be quite large. Thus, even large standard deviations, σ(∆A MM→3ob gas ), of 10 kcal/mol or even more are easy to overlook. Therefore, we list "offset" values in Table 1, which must be added for reported positive ∆A and subtracted for reported negative ∆A to give the actual total ∆A MM→3ob gas (for values <0 in coming tables, the offset should be subtracted to give the true ∆A, for values >0 in coming tables, the offset should be added to give the true ∆A). Extracting "offset" values allows trends in variance and convergence failure to be more apparent, and allows data to be presented more compactly. Table 1. ZINC Database ID's for each HiPen molecule; the total number of atoms (N tot ) and total number of heavy atoms (N heavy ) per molecule; ParamChem reported CGenFF penalties; calculated ∆A "offset" for each molecule in the dataset should be added or subtracted (to positive and negative ∆A's respectively) to every ∆A listed in Tables 2 and 3 to give the total calculated ∆A. Additionally, in Table 1, we list the bonded and charge parameter penalties reported for the CGENFF parameters by ParamChem as well as the number of atoms (total/non-hydrogen) in each selected molecule. As described in the Introduction, we specifically chose the twenty-two molecules based on high parameter penalties, in addition to picking molecules to represent chemical diversity seen in small molecule drug sets. Tables 2 and 3 provide all calculated ∆A MM→3ob gas results for each of the 22 HiPen molecules. In addition, we report values for the convergence criteria considered: hysteresis (Hyst) calculated by comparing ∆A calculated from the complete dataset to block averaged ∆A (this criterion is a simplified version of considerations by Woods and co-workers [80]) (see [78] for complete definition of our "hysteresis" criterion), standard deviations of ∆A calculated from 10 blocks, average ∆U or average W, standard deviation of ∆U/W to indicate width of input distributions (the average energy differences/work values and their standard deviations could also be used to estimate ∆A according to the second order cumulant expansion; however, the underlying distributions are far from Gaussian), and percentage overlap of "forward" and "backward" distributions. Table 2. Equilibrium results. ∆A MM→3ob calculated with FEP (fw), FEP (bw), and BAR, as well as calculated convergence metrics. For each ∆A, we divide the total dataset into 10 blocks, calculate ∆A i from each of these i blocks, and compare the average of those ∆A i to ∆A calculated from the total dataset (giving Hyst), and we also report the standard deviation of these ∆A i (σ ∆A ). To determine the reliability of ∆U distributions for calculating ∆A we report: ∆U, the standard deviation of ∆U's (σ ∆U ) as narrower distributions are likely to provide converged results, and finally we report one-sided Π which, when >0.5, likely indicates the ∆U distribution is resultant from sufficient and unbiased sampling. Finally, we report percentage overlap in ∆U between forward and backward distributions.  Table 3. Nonequilibrium results. ∆A MM→3ob calculated with JAR (fw), JAR (bw), and CRO, as well as calculated convergence metrics. For each ∆A, we divide the total dataset into 10 blocks, calculate ∆A i from each of these i blocks, and compare the average of those ∆A i to ∆A calculated from the total dataset (giving Hyst), and we also report the standard deviation of these ∆A i (σ ∆A ). To determine the reliability of W distributions for calculating ∆A, we report: W, the standard deviation of W's (σ W ) as narrower distributions are likely to provide converged results, and finally we report one-sided Π which, when >0.5, likely indicates the W distribution is resultant from sufficient and unbiased sampling. Finally, we report percentage overlap in W between forward and backward distributions. Another criterion we found useful to predict whether free energy differences obtained from free energy estimators FEP, BAR, JAR, and CRO are likely to be converged [32,78] is Π, introduced by Wu and Kofke [79]. Π provides a quantitative means for determining if a distribution was collected from a sufficiently large sample in a manner free of bias. A complete discussion of the Π criterion in theory and derivation is beyond the scope of this work, but we strongly advise the reader to see Wu and Kofke's works in 2004 and 2005 introducing this measure and its uses [79,81]. In Reference [32], Equation (3) gives the one-sided Π criterion, which we used to determine "reliability" of distributions used herein. Π should be >0.5 to indicate a "well behaved" distribution of ∆U low→high or W low→high . It should be noted, however, that the Π criterion assumes input energy or work distributions obey at least an approximate Gaussian distribution, which is not necessarily likely for molecules of practical interest, and as we show below, is not always the case in HiPen. Nevertheless, we have still found reasonable correlation between Π values and deviations from reference results in Reference [78].

∆A
In Reference [78], we also noted that FEP/JAR is likely to fail when the standard deviation of energy difference or work values (σ ∆U low→high or σ W low→high ), i.e., the raw data entering Equation (2) or Equation (5) becomes too large; σ > 4 k B T being the absolute threshold of reliability, beyond which the corresponding ∆A is likely untrustworthy. A similar observation in the context of indirect (S)QM/MM FES was recently also made by Ryde [29]. All of these criteria have proven useful in the past to indicate convergence of ∆A low→high , as well as while investigating convergence failure in more difficult cases. As such, for each ∆A MM→3ob presented in this work, we also calculated each of these metrics to evaluate the "quality of convergence".

Discussion
Looking at Tables 2 and 3, one immediate observation is that equilibrium methods (FEP/BAR) overall provided results we would consider "unreliable." This can be seen in Table 2: for most molecules, FEP (fw), FEP (bw), and BAR results differ by several kcal/mol; most ∆U distributions are broad (with σ ∆U 4k B T) and do not pass the Π sampling criterion requirement of being >0.5; calculated Hyst values are >1 kcal/mol; and finally the percentage overlap in nearly all cases is not large enough to ensure sufficient sampling for even the two-sided estimator BAR. Therefore, we did not see fit to classify molecules according to performance of FEP/BAR, and results obtained by equilibrium methods are not considered further. While compiling nonequilibrium results, three clusters of molecules emerged within HiPen, which we have named the Good, the Bad, and the Ugly. "Good" molecules were any molecules for which a converged ∆A MM→3ob gas could be calculated using JAR (fw). "Bad" molecules were any molecules for which JAR (fw)/JAR (bw) calculations appeared unconverged, but for which CRO was (largely) converged. "Ugly" molecules were any molecules for which even CRO exhibited convergence issues according to all of our convergence metrics. It is worth expounding upon some examples in each of Good, Bad, and Ugly cases and illustrate how nonequilibrium switches can be used to sharpen distributions and improve configurational overlap between levels of theory in most of these cases. Although we have collected ∆U/W distribution overlaps as well as rotatable dihedral distributions for every molecule in the HiPen data set, presenting all such data would overwhelm the reader, and represent somewhat of a digression from the purpose of this work. For those looking to validate their own data, all plots can be found in the Supplementary Materials. Additionally, all data including input scripts will be made available upon request, and we have compiled all minimized topology/coordinates as well as parameters and input files and those can be found at https://zenodo.org/record/2328952 (doi:10.5281/zenodo.2328952). Before we begin the classification and discussion, we would like to emphasize that our classifications of "Good", "Bad", and "Ugly" are merely intended for grouping comparable results and attempting to describe trends from those groups without having to describe results from every molecule. Within each group, there are further gradients of "best" and "worst" convergence performers.

The Good
We were able to calculate ∆A MM→3ob gas using our JAR (fw) protocol (a 1 ps "forward" switching protocol, see Methods for details) for the following molecules: 2, 3, 4, 7, 10, 11, 12, 13, 14, 15, and 17. Thus, for these molecules obtaining ∆A low→high within an indirect scheme should be fairly straightforward and should not require full simulations at the high level of theory. We will focus our discussion here on molecules 2 and 11.
The potential energy overlaps, shown in Figure 3a,  is not likely to provide converged or accurate results. However, utilizing nonequilibrium W MM→3ob instead of ∆U MM→3ob vastly improves overlap of P mm and P 3ob (shown in Figure 3b) to 53.89%, and improves one-sided Π values to 1.86 and 2.19 for JAR (fw) and JAR (bw), respectively. One point of concern regarding W distributions for 2 is the "tail", or small secondary peak, seen around −4 kcal/mol in P 3ob ; however, despite this secondary peak, there appears to be more than enough configurational overlap to provide converged results. Figure 3. (a) Molecule 2's potential energy "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the ∆U MM→3ob to simplify the x-axis. (b) Molecule 2's nonequilibrium work "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the W MM→3ob to simplify the x-axis.
It should also be noted how poorly FEP performs compared to converged results from JAR and CRO: FEP (fw) predicts ∆A MM→3ob gas = −255.52 kcal/mol, while CRO predicts ∆A MM→3ob gas = −271.84 kcal/mol, a difference of 16.32 kcal/mol. This FEP/CRO discrepancy is quite concerning as these are gas phase FES, and thus the only potential sources for error in ∆A MM→3ob gas are due to lack of overlap in intramolecular degrees of freedom. In aqueous solution, or in simulations of a protein-ligand complex, errors resulting from the use of FEP are likely to be even larger. Further, it would be foolish to hope for error cancellation in steps ∆A low→high 0 and ∆A low→high 1 of the full indirect scheme in Figure 1, as the two corrections may be computed in different environments. This highlights the need for improving overlap/convergence when computing free energy differences between levels of theory.
As seen with 2, potential energy distributions coming from MM and 3ob equilibrium simulations are disparate, with merely 0.03% overlap (Figure 4a). Additionally, Π values (−3.26 and −3.38 for FEP (fw) and FEP (bw), respectively (see Table 2)) indicate that these potential energy distributions are not suited to providing converged FES between levels of theory. Poor convergence can also be seen in: high sample size hysteresis values (>1 kcal/mol), deviation between FEP (fw) and FEP (bw), and broadened ∆U MM→3ob distributions themselves. By all metrics, FEP and BAR fail for molecule 11, a small molecule in gas phase with chemical features seen in many experimental contexts. However, as again seen with molecule 2, W MM→3ob distributions were far more suited for calculating converged ∆A MM→3ob gas (Figure 4b). Π values were much improved compared to their equilibrium counterparts, being 0.99 and 1.68 for JAR (fw) and JAR (bw), respectively; overlap between the two distributions is also dramatically improved at 44.12%; JAR (fw) and JAR (bw) are essentially absent of sample size hysteresis, and agree with each other within ≈0.5 kcal/mol. Figure 4. (a) Molecule 11's potential energy "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the ∆U MM→3ob to simplify the x-axis. (b) Molecule 11's nonequilibrium work "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the W MM→3ob to simplify the x-axis.
Comparing dihedral distributions among MM, 3ob, and nonequilibrium switching simulations illustrates again MM and 3ob levels of theory agree on low energy dihedral conformations although they may not largely agree on relative energy differences between said dihedral angles ( Figure 5). However, there appears to be enough dihedral overlap to allow for converged JAR and CRO, but likely distinctions between "stiff" degrees of freedom (i.e., bonds, angles) cause issues in FEP/BAR convergence, whereas bond/angle discrepancies are likely resolved during nonequilibrium switching simulations. It again should be noted how poorly FEP performed compared to converged JAR/CRO: FEP (fw) predicts (excluding offset values in Table 1) ∆A MM→3ob gas to be −460.27 kcal/mol, whereas CRO predicts −409.50 kcal/mol (see Tables 2 and 3), a difference of 49.23 kcal/mol. Again, considering the only sources of convergence difficulties here are the intramolecular degrees of freedom (i.e., bonds, angles, dihedrals, and intramolecular nonbonded interactions) one cannot expect error cancellation in a full indirect QM/MM FES.

The Bad
For several molecules, we could not obtain converged ∆A MM→3ob gas using JAR (fw) alone but rather needed long equilibrium 3ob simulations and nonequilibrium 3ob → MM simulations to obtain a converged value via CRO. These molecules include: 1, 5, 6, 16, 18, 19, 21, and 22. We focus our discussion here on molecules 5 and 6, as representative examples.

Molecule 5
Molecule 5 (ZINC00087557, 3-phenylcyclopropane-1,2-dicrabohydrazide) contains functional groups useful in metal-organic chemistry [83], and thus has potential for future modeling focus not only in drug discovery projects but also in inorganic modeling. However, we have seen that it is quite difficult to obtain converged ∆A MM→3ob gas for this species even with nonequilibrium switching simulations.
Molecule 5's equilibrium ∆U MM→3ob distributions are quite broad and exhibit little to no overlap (≈0.0%) (see Figure 6a). Convergence issues in FEP/BAR are further indicated by very low one-sided Π values, sample size hysteresis is seen in FEP (fw), FEP(bw) and BAR, and once again magnitude differences between FEP (fw) and FEP (bw) (see Table 2). Unlike with the "Good" molecules, although nonequilibrium simulations of 5 improved overlap and resulted in narrower distributions (Figure 6b), JAR (fw)/JAR (bw) metrics still indicate potential convergence failure and CRO results might require further validation. For example, for JAR (fw) Π = 0.08, and JAR (bw) Π = −0.66, both indicate unreliable one-sided distributions. Additionally, JAR (bw) exhibits sample size hysteresis (Hyst = 2.26 kcal/mol), and JAR (fw) and JAR (bw) disagree by 1.66 kcal/mol. CRO results, on the other hand, do barely pass convergence metrics: sample size hysteresis is low, and overlap is sufficient for a two-sided method at 8.76%. Thus, while CRO may be able to calculate a trustworthy result, it is far from the stellar nonequilibrium convergence results seen from "Good" molecules. Figure 6. (a) Molecule 5's potential energy "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the ∆U MM→3ob to simplify the x-axis. (b) Molecule 5's nonequilibrium work "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the W MM→3ob to simplify the x-axis.
Through examining the dihedral distributions of 5, we see that χ 2 and χ 3 could be causing convergence issues in FEP, BAR, and JAR (Figure 7). For example, consider MM and 3ob dihedral populations in χ 2 and χ 3 . These populations have distinct minimum angles at the two levels of theory, which doubtless are contributing to convergence failure when using FEP/BAR. However, when conducting MM → 3ob switching simulations, the barriers to rotation from MM preferred dihedral conformations to 3ob preferred dihedral conformations may be too high to traverse during the shorter nonequilibrium switching simulations (i.e., 1 ps). Although the switching simulation dihedral populations are more similar to their respective target level of theories, there are still discrepancies. For example, considering 5's χ 3 , we see MM predicts two low energy angles at 60 • and −140 • while 3ob predicts one low energy dihedral value at 140 • . After MM → 3ob switching simulations, the two peaks predicted by MM relax to one larger peak vaguely encompassing the low energy region in 3ob simulations, although this low energy dihedral peak does not have the same population density as seen from 3ob equilibrium simulations. It is likely that longer switching simulations may be needed to allow these dihedral degrees of freedom to completely relax. However, by pooling both W MM→3ob and W 3ob→MM values, we were able to obtain marginally converged CRO results. Finally, CRO predicts ∆A MM→3ob gas to be −537.50 kcal/mol, while FEP(fw) predicts −587.63 kcal/mol and JAR (fw) −539.16 kcal/mol. The FEP (fw) result is again ≈50 kcal/mol from the CRO value; an error which almost certainly will not be resolved via error cancellation in the indirect scheme. However, the JAR (fw) result is only ≈1.66 kcal/mol from the CRO value. Thus, the JAR (fw) result may have given essentially the "correct" result in an indirect cycle ("correct" when compared to our calculated CRO value), yet this result may have been spurious, and thus it is always wise to evaluate convergence metrics before trusting nonequilibrium switching simulations to resolve most configurational disparities in a distribution.

Molecule 6
Molecule 6 (ZINC00095858, ethyl N-[(2-chlorophenyl)sulfonyl]carbamate) is a flexible small molecule with thousands of similar structures available in the PubChem Database, thus ensuring accurate FES modeling of 6 could be beneficial for modeling many other small molecules.
As seen with 5, FEP and BAR results for 6 are unreliable: ∆U MM→3ob distributions are broad and non-overlapping (Figure 8a), Π values are poor, sample size hysteresis for FEP (fw) and FEP (bw) is high, and FEP (fw) and FEP (bw) differ by ≈21 kcal/mol (see Table 2). Unfortunately, as with with FEP, JAR (fw) and JAR (bw) results are also not immediately trustworthy, with one-sided Π values of −0.65 and −0.47 for JAR (fw) and JAR (bw), respectively, and discrepancy between JAR (fw) and JAR (bw) of ≈5 kcal/mol (see Table 3). However, by utilizing data from both MM → 3ob and 3ob → MM switching simulations, i.e., CRO, we were able to calculate a marginally converged ∆A MM→3ob gas . Overlap between nonequilibrium work distributions (23.95%) is much improved compared to ∆U MM→3ob distributions (0.00%). Figure 8. (a) Molecule 6's Potential energy "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the ∆U MM→3ob to simplify the x-axis. (b) Molecule 6's nonequilibrium work "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the W MM→3ob to simplify the x-axis.
Again, qualitatively comparing dihedral distributions illuminates possible causes for JAR convergence failures (Figure 9). Although most of the dihedral populations appear largely similar for 6, χ 3 may be distinct enough to cause convergence errors, certainly with FEP/BAR as trans-χ 3 is vastly overrepresented from MM simulations. This may also be the case in JAR (fw) and JAR (bw) results, as MM → 3ob switching simulations are unable to replicate the near degeneracy of the transand gauche-χ 3 conformations as seen in 3ob simulations. However, pooling together all switching simulations may provide enough gauche-χ 3 conformations to achieve convergence.
Once again, we should point out that FEP (fw) predicts ∆A MM→3ob gas to be −109.58 kcal/mol, while CRO predicts ∆A MM→3ob gas to be −140.17 kcal/mol, an error of ≈30 kcal/mol, and an error which would not necessarily be resolved in error cancellation within the indirect cycle. Thus, using convergence metrics, such as Π, is important for evaluating the reliability of a dataset. Figure 9. Dihedral populations for 6's dihedral degrees of freedom (see Figure 2 for dihedral labels).

The Ugly
Finally, there were three molecules for which we observed severe convergence issues even when using CRO: molecules 8, 9, and 20. We focus our discussion here on molecules 8 and 9.
As in earlier cases, FEP/BAR results again are unconvincing. However, interestingly, the metrics do not indicate there should be much more of a convergence problem than as seen for the "Bad" molecules. Π values illustrate the ∆U MM→3ob distributions are unreliable as expected, and yet FEP (fw) and FEP (bw) differ by only ≈6 kcal/mol, and ∆U MM→3ob distributions do exhibit barely enough overlap for a converged result at 3.74% ( Figure 10). Surprisingly, this potential energy overlap percentage indicates equilibrium results for 8 should result in marginally converged ∆A MM→3ob gas , and yet other analyses of ∆U distributions indicate these are unreliable datasets. Even though W MM→3ob distributions do overlap considerably better than ∆U MM→3ob , at 24.92%, Π evaluations of W MM→3ob (fw and bw) distributions indicate only marginally improved reliability, and not enough to be sufficiently confident in JAR or even CRO results. Additionally, the W distributions in Figure  10b seem to be oddly polymodal. Considering the difficulties in convergence observed, we conducted longer switching simulations (5 ps) in the hopes of improving convergence by allowing longer relaxation times. Data from 5 ps switching simulations are given in Table 3 in row "8 (5 ps)", and distributions are shown in Figure 10c. Unfortunately, even conducting 5 ps switching simulations did not allow for significantly improved W MM→3ob distributions. Figure 10. (a) Molecule 8's potential energy "forward" (P mm ) and "backward" (P 3ob ) distributions plotted as "offset" from the ∆U MM→3ob to simplify the x-axis. (b) Molecule 8's nonequilibrium work "forward" (P mm ) and "backward" (P 3ob ) distributions from 1 ps switching simulations plotted as "offset" from the W MM→3ob to simplify the x-axis. (c) Molecule 8's nonequilibrium work "forward" (P mm ) and "backward" (P 3ob ) distributions from 5 ps switching simulations plotted as "offset" from the W MM→3ob to simplify the x-axis.
Examining 8's dihedral populations may provide insight into this molecule's convergence issues: although χ 1 is fairly consistent between MM and 3ob distributions, χ 2 is quite distinct between MM and 3ob (see Figure 11). MM and 3ob simulations agree there is a low energy cis-χ 2 conformation, however 3ob simulations also predict the gaucheand trans-χ 2 conformation is populated, while MM simulations do not visit this region. It should also be noted, as described in the Methods, equilibrium simulations were launched by initiating randomized dihedrals to ensure thorough dihedral sampling. Even after randomizing χ 2 , MM simulations did not visit trans regions that were shown to be energetically stable in 3ob simulations. Furthermore, even after MM → 3ob switching simulations of 1 ps and 5 ps, MM configurations are not able to relax into the transand gauche-χ 2 conformations predicted by 3ob. Thus, the barrier to rotation around χ 2 must be too high to overcome, even during longer/slower switching protocols. This is a case where intramolecular force matching may improve low-level classical parameters and thus overlap to the higher level of theory.

Molecule 9
Molecule 9 (ZINC ID 00123162, 1-phenyl-1,2,3-butanetrione 2-[N-(4-chlorophenyl)hydrazone]) contains chemical features seen in thousands of molecules available in the PubChem database, and therefore ensuring appropriate FES modeling with 9 could ensure appropriate FES modeling of many other compounds in the near future.
Given the difficulty in arriving at a converged ∆A MM→3ob gas for 9, we again hoped to pin-point convergence issues in dihedral degrees of freedom, Figure 13. As can be seen in Figure 13a,b, χ 1 and χ 2 distributions between equilibrium MM and 3ob simulations are fairly similar, low energy dihedral conformations are consistent between levels of theory and relative populations between such angles are also consistent. However, for χ 3 , χ 4 , χ 5 , and χ 6 , there are large discrepancies between MM and 3ob regarding the low energy dihedral values and their relative populations, such discrepancy is especially clear in χ 6 ( Figure 13g). Furthermore, such discrepancies are not completely resolved within 1, or even 5 ps nonequilibrium switching simulations, as is the case for χ 3 and χ 4 . Thus, these dihedrals likely represent the roadblock to converged ∆A low→high for 9 and further likely require intramolecular force matching to be resolved. Figure 13. Dihedral populations for 9's dihedral degrees of freedom (see Figure 2 for dihedral labels).

Materials and Methods
All molecular modeling described herein was conducted using CHARMM (Chemistry at Harvard Molecular Modeling) software (version C43a2) [87].

Equilibrium Simulations
The complete Maybridge Hitfinder set, a curated online database of "drug-like" (according to Lipinski rules) molecules (https://www.maybridge.com) was scanned through ParamChem, an online tool for generating parameter and topology files used in CHARMM (https://cgenff.umaryland.edu). Standard ParamChem output includes listing "parameter and charge penalties". These penalties represent how trustworthy the output parameters and topologies are, thus higher penalties may indicate less trustworthy parameters and potential modeling inaccuracies. From the Maybridge Hitfinder set, 22 molecules were chosen which returned high parameter or charge penalties from ParamChem and these 22 molecules constituted our "HiPen" dataset.
Initial 3D coordinates of each HiPen molecule were collected from the ZINC 12 Structural Database (http://zinc.docking.org [88]); see Table 1 for the ZINC IDs. As mentioned, CGenFF (CHARMM Generalized Force Field [56]) parameter and topology files for each molecule were obtained through the ParamChem web interface (https://cgenff.umaryland.edu [54,55]). The starting coordinates were optimized by 1000 steps of Steepest Descent minimization, followed by 1000 steps of Adopted Basis Newton-Raphson minimization. To calculate ∆A MM→3ob gas according to the various methods compared in this work (FEP, BAR, JAR, and CRO), Langevin Dynamics (LD) simulations were performed at the two levels of theory, MM and SCC-DFTB/3ob. In all cases, a friction coefficient of 5 ps −1 was applied to all atoms, and random velocities were added at each step corresponding to a temperature bath of 300 K.
MM simulations: For each molecule, ten LD simulations were carried out, which were started from different initial random velocities. Additionally, to enhance sampling, we employed different starting coordinates if/when the molecule contained rotatable bonds (cf. Figure 2). First, all rotatable bonds were randomized. Next, 1000 steps of Adopted Basis Newton-Raphson minimization were carried out while restraining the dihedral angles harmonically (k = 100 kcal/mol/A 2 ) to their randomized value(s). Finally, restraints were removed and 10 ps of LD were carried out as equilibration. As molecules were simulated in the gas phase, all nonbonded interactions were calculated explicitly during the simulation; neither switching nor shifting functions were applied. Following these preparation steps, 10 million steps of LD were carried out with a timestep of 1 fs; this corresponds to 10 ns per run, and a cumulative simulation length of 100 ns for the ten runs per molecule. Restart files (containing both coordinates and velocities) were saved every 100 steps, resulting in 1 million coordinate and velocity sets per molecule. For use in FEP and BAR, the energies for each coordinate set were computed at both the MM and SCC-DFTB/3ob levels of theory. We further computed the instantaneous dihedral angles of all rotatable bonds considered.
SCC-DFTB3/3ob simulations: All simulations employing the SCC-DFTB/3ob potential energy function, were conducted according to a protocol closely mirroring what was just described for the MM simulations. However, in the case of SCC-DFTB/3ob simulations, the simulation length per run (again, 10 runs per molecule) was only 1 ns (a timestep of 1 fs was used); the cumulative simulation length, therefore, was 10 ns. Coordinate and velocity information was saved every 100 steps resulting in a total of 100,000 restart files. As in the MM case, for each of the coordinate sets we computed the energy at the force field and SCC-DFT3/3ob levels of theory, as well as the instantaneous values of the dihedrals of the rotatable bonds.

Nonequilibrium "Switching" Simulations
To compute ∆A MM→3ob gas using JAR (Equation (5)) or CRO (Equation (6)), one must repeatedly compute the nonequilibrium work for switching from the MM Hamiltonian to the SCC-DFTB/3ob Hamiltonian and/or vice versa. In CHARMM, this can be accomplished using the program's MSCALE [89] and PERT [87] functionalities. The multi-scale (MSCALE) modeling module of CHARMM allows the user to treat a system, in part or whole, according to two (or more) different energy functions. In the present case, MSCALE is employed to mix the MM and SCC-DFTB/3ob energy functions as needed. In combination with the PERT free energy facility of CHARMM in slow-growth mode, the degree of mixing can be changed continuously from 100% MM to 100% SCC-DFTB/3ob. Since during switching over any finite time window the system is not at equilibrium, the "energy differences" obtained in slow-growth calculations really are nonequilibrium work values. This is even more the case when switches are carried out very quickly (within a few ps or less) to avoid excessive computational cost, as is the case when switching to (S)QM/MM Hamiltonians. For full details, we refer the reader to our earlier work [33,34]; additionally, a recent general review about nonequilibrium work methods can be found in Reference [90]. Switching simulations in both forward (MM → SCC-DFTB/3ob) and backward (SCC-DFTB/3ob → MM) direction were started from the restart files saved during the respective equilibrium simulations (see above). The timestep during all switching simulations was 1 fs.
MM to 3ob Switching simulations: MM → 3ob switching simulations were launched from every 10,000th MM simulation step, i.e., from snapshots saved at 10 ps intervals during the MM equilibrium simulations. Per molecule, this resulted in a total of 10,000 MM → 3ob nonequilibrium switches. Unless otherwise noted, all switching simulations were conducted for 1 ps (1000 steps). W MM→3ob was recorded per switch and post-processed using JAR and CRO. For each of the final coordinates, we also computed the dihedral angles of the rotatable bonds.
3ob to MM Switching simulations: 3ob → MM switching simulations were launched from every 1000th 3ob simulation step, or every 1 ps. Per molecule, this resulted in a total of 10,000 3ob → MM nonequilibrium simulations. Unless otherwise noted, all switching simulations were conducted for 1 ps (1000 steps). W 3ob→MM was recorded per switch and post-processed according to JAR and CRO. As was done at the end of the MM → 3ob switching simulations, dihedral angle values were recorded.

Conclusions
We present here a new dataset to be used for future method development in the QM/MM FES community. In particular, we have calculated ∆A MM→3ob gas for 22 drug-like small molecules obtained from the Maybridge Hitfinder set. In compiling our dataset, we observed three categories of molecules emerge: "good", molecules for which JAR (fw) could obtain a reliably converged ∆A MM→3ob gas ; "bad", molecules for which JAR (fw) results proved unreliable but CRO results were reliably converged; and "ugly" molecules for which even CRO could not produce reliably converged results. Although we have yet to derive any strict/concrete patterns quantitatively relating our convergence criteria to how "wrong" a calculated ∆A low→high may be (i.e., we cannot yet tell from a one-sided Π value calculated from ∆U MM→3ob distribution how wrong the resultant FEP (fw) will be), we have illustrated that several convergence metrics should always be evaluated and compared before trusting a ∆A, especially those calculated from equilibrium FES. We have also seen, once again, how discrepancies in "stiff" and "soft" degrees of freedom between levels of theory can result in drastic convergence errors which may not always be ameliorated via nonequilibrium work (even extended) switching simulations, as was seen in the case of "ugly" molecules 8 and 9. We intend to use these data in the near future for further method development as well as evaluating the same criterion/metrics in solvent phase free energy simulations. Furthermore, we hope this dataset will prove as useful to FES practitioners in providing standardized results as the BEGDB and MNDB2.0 datasets are to the quantum mechanical calculation community.