Electronegativity Equalization Method: Parameterization and Validation for Large Sets of Organic, Organohalogene and Organometal Molecule

The Electronegativity Equalization Method (EEM) is a fast approach for charge calculation. A challenging part of the EEM is the parameterization, which is performed using ab initio charges obtained for a set of molecules. The goal of our work was to perform the EEM parameterization for selected sets of organic, organohalogen and organometal molecules. We have performed the most robust parameterization published so far. The EEM parameterization was based on 12 training sets selected from a database of predicted 3D structures (NCI DIS) and from a database of crystallographic structures (CSD). Each set contained from 2000 to 6000 molecules. We have shown that the number of molecules in the training set is very important for quality of the parameters. We have improved EEM parameters (STO-3G MPA charges) for elements that were already parameterized, specifically: C, O, N, H, S, F and Cl. The new parameters provide more accurate charges than those published previously. We have also developed new parameters for elements that were not parameterized yet, specifically for Br, I, Fe and Zn. We have also performed crossover validation of all obtained parameters using all training sets that included relevant elements and confirmed that calculated parameters provide accurate charges.


Introduction
Electronegativity Equalization Method (EEM) [1,2,3] is a fast approach for charge calculation. The basic idea is based on the density functional theory (DFT) [4,5]. First, Parr et al. applied the DFT and formulated a new definition and explanation of electronegativity [6,7]. Later on, Mortier et al. applied Parr's definition of electronegativity and Sanderson's Electronegativity Equalization Principle (EEP) [8,9,10] and created the EEM.
This method is able to calculate atomic charges markedly faster than common ab initio approaches. A challenging part of the EEM is the parameterization that is performed using ab initio charges obtained for a set of molecules. The parameterization is very time-consuming with time complexity of O(S.B 4 ), where S is a number of molecules in the set. The most common parameterization of the EEM is a parameterization for the HF method with the STO-3G basis set, where the charges are calculated by Mulliken population analysis (MPA) [11,12]. Principally, it is also possible to parameterize the EEM for other basis sets (i.e., 6-31G*) and methods for charge calculation (i.e., CHELPG, MK, NPA, ESP, Hirshfeld method) [13,14]. First attempts to calculate EEM parameters were published in eighties [1,2]. These publications contained only parameters for C, H, N and O, which were developed using training sets of about one hundred molecules. Further parameterizations were performed during the nineties and contained parameters for new elements (S, Si, P, F, Cl) and more complex bases [15,16,17]. The EEM parameterization still remains attractive to chemists' attention [18,19,20].
The goal of this work is to perform the EEM parameterization based on large sets of organic, organohalogen and organometal molecules (containing Zn and Fe) selected from databases NCI DIS [21] and CSD [22], and to validate the quality of calculated parameters on reference sets of molecules selected from these databases. The parameterization was performed for STO-3G MPA charges.

EEM
Using DFT, the effective (charge-dependent) electronegativity of the atom i in a molecule can be calculated by eq. (1) [1,2 3]: where N is the number of atoms in the molecule, q i and q j are the charges distributed on the atoms i and j, respectively, R i,j is the distance between atoms i and j, and κ is the adjusting factor. The coefficients A i and B i are defined by eqs. (2): where χ i 0 is the electronegativity of an isolated neutral atom i, η i 0 is the hardness, and ∆χ i 0 and ∆η i describe the molecular environment. The coefficients A i , B i and κ are empirical parameters, which must be obtained via EEM parameterization. Such a parameterization is a topic of this work. According to Sanderson's Electronegativity Equalization Principle [8,9,10], the effective electronegativity of each atom in the molecule is equal to the molecular electronegativity χ : The total charge Q of the molecule is equal to the sum of all the atomic charges: The atomic charges are described using the equation system (5), which contains N+1 equations with N+1 unknowns: q 1 , q 2 , … , q N and χ . This system was derived from equations (1), (3) and (4) [1]:

(5)
The matrix of the equation system (5) is called EEM matrix.

κ χ
Then, empirical parameters can be obtained using eq. (8) in the following way: 1. Selection of a set of molecules used for the EEM parameterization. 2. Ab initio calculation of atomic charges q i for all atoms within all selected molecules.
3. Calculation of the molecular electronegativity χ as a harmonic average of atomic electronegativities χ i 0 (for isolated atoms i): 4. Selection of κ values for which the parameterization will be performed. 5. For each of the above selected κ: • Calculation of x i and y i values for all atoms in all molecules using eq. (8).
• Separation of x i and y i couples into subsets according to the chemical symbol and hybridization of the atom i (for example C in sp 3 , C in sp 2 etc.). • Calculation of parameters A i and B i for each of these subsets using the least square minimization.

Methods
In this work, two databases were used. The first one was the NCI DIS 3D database [21], created as a part of DTP NCI (Developmental Therapeutics Program of National Cancer Institute). This database contains organic molecules tested against cancer, specifically their topologies and also geometries, predicted by the program CHEM-X [23] and stored in SDF format [24]. The second database used was CSD (Cambridge Structural Database) [22], administered by CCDC (Cambridge Crystallographic Data Centre). Geometries of molecules are stored also in SDF format. However, in this case information is obtained experimentally using the X-ray and/or neutron diffraction. Both these databases are sufficiently large, containing more than two hundred thousand molecules.  From these two databases, several training sets of molecules were selected (see Table 1). Our goal was to generate training sets, which cover most of bonding situations and also conformational variability in real molecules. For that reason, we have chosen large sets containing randomly selected molecules. As molecules are unsorted in NCI DIS and CSD databases, the simplest random selection is to take a continuous part of the database. We selected three training sets, containing elements C, H, O, N and S from each database. To obtain the most versatile training sets, we selected first training set from the beginning, second from the middle and the third from the end of the databases. We have also used unions of these sets. For organohalogenes and organometals, we did not need so many training sets as the process of parameterization was already debugged on above mentioned six training sets and their unions. Therefore we used only one training set of organohalogenes from each database. Just one training set (from CSD database) was used for organometals, because the NCI DIS database does not contain enough organometal molecules. These organometal and organohalogene molecules were selected from the beginning of the databases.
For the parameterization, ab initio charges were calculated using the HF method with the STO-3G basis set for all molecules in all sets. The charge calculation was performed by Gaussian 98 program [25]. After that, the EEM parameterization was performed using calculated ab initio charges for all training sets.
R mol avg describes quality of parameters obtained via EEM parameterization using the training set. This value is between 0 and 1. The closer it is to 1, the more accurate charges are provided employing the EEM method using the parameters. The literature [17]. For our parameters, the best R mol avg for each tested set is bolded.
Parameters, calculated for all training sets presented in Table 1, and also parameters obtained from the literature were validated for all training sets that contained suitable atoms. Validation of parameters for a selected training set was done in such a way that ab initio charges and the EEM charges calculated for each molecule from the training set using the developed parameters were compared via the least square method. In other words, the linear regression line was fitted to a set of points [q i (ab initio), q i (EEM)], where q i (ab initio) and q i (EEM) are ab initio and EEM charges of the atom i, respectively. Correlation between ab initio and EEM charges was described by the R-squared value [26] of this line. This R-squared value is between 0 and 1. The closer it is to 1, the better the correlation is. The R-squared value (R mol ) was calculated for each molecule in the training set. An average value of (R mol avg ) was calculated from all R mol values in each set to express the quality of parameters for the set.  For each training set of molecules in Table 1, the parameters were found. As it was described in the Methods section, calculated parameters were validated for all training sets that contained suitable atoms and also compared with the parameters from literature [17]. As the literature does not show the κ value (see eq. (1)), we had to find the κ value via our methodology. The best fit for κ was found to equal 1.25. The results of this parameter quality validation expressed by R mol avg are summarized in Table 2. This table shows that the quality of parameters varies for different training sets. Moreover, the quality of parameters from literature is generally slightly better than the quality of our parameters. Therefore, our effort was to further improve our methodology and parameters. The main idea of this improvement was based on results, obtained for training sets n all and its subsets n beg , n mid and n end and also for training set c all with subsets c beg , c mid and c end . It is seen from Table 2 that R mol avg (n all ) is better than the average value from R mol avg (n beg ), R mol avg (n mid ) and R mol avg (n end ), but the best results are obtained for the set n mid . Analogically, in the training set c all , the subset c beg provides the best parameters. Randomly sorted molecules that create the training sets imply the good accuracy of parameters from subsets n mid and c beg . Therefore, the quality of parameters can be increased by selection of an appropriate subset of the input training set. We have tested two methods of appropriate subset selection: 1. Select only molecules, which have R mol greater than a defined limit (for example 0.8).

Results
2. Sort molecules from the training set T randomly and create a sequence of them (1, 2, ..., |T|), where |T| is a cardinality of T. Calculate parameters for all subsets ST i , where ST i is obtained from T by removing the subset DST i . The subset DST i is composed of elements T (i-1).K+1 , T (i-1).K+2 , …, T i.K ; where K can be, for example, 100. Now create the selection in the following way: From the input training set, sorted into the above described sequence, delete every subset DST i , for which R mol avg (T) < R mol avg (ST i ).
By comparison, the second approach was found to be more successful. It is interesting that sets selected via the first method provide worse quality of parameters than the input training sets themselves (results not shown here).
Using method 2, we have performed selections based on sets c beg , c hal , c met and c h,m and created sets c beg2 , c hal2 , c met2 and c h,m2 (see Table 3).
For more details about R mol avg see Table 2. R mol avg values of our parameters that are better than the literature parameters (denoted as Lit., taken from reference [17]) are in italics. The best R mol avg value (our parameters) for each tested set is bolded.
We have chosen the CSD database as this database contains high quality experimental data. The set c beg was selected as it exhibits R mol avg higher than c mid , c end and c all . The parameters were calculated for selected subsets c beg2 , c hal2 , c met2 and c h,m2 (see Table 4). Then the parameters were validated for all training sets containing corresponding atoms (see Table 5 and graphs in supplementary materials).
It is seen that the selected subsets c beg2 , c hal2 , c met2 and c h,m2 provide markedly better parameters than the input sets c beg , c hal , c met and c h,m themselves. In all cases we have found parameters that are better than the literature parameters.
The parameters c beg2 are of better quality than parameters obtained from literature [17] for both used databases. The parameters c hal2 , c met2 and c h,m2 are of a worse quality than published parameters for the NCI DIS database, but are better for the experimental database CSD. Moreover, these parameters contain new data for halogens or Fe and Zn.
Generally, we can conclude, that it is possible to calculate parameters using both the predicted and experimental databases. However, parameters that are based on experimental structures exhibit better charge calculation results. It can be caused by the fact that the theoretical structures from NCI DIS database may include some less realistic geometries compared to the experimental structures from CSD database. These parameters are more useful as they are portable and can be used for an arbitrary molecule that contains atoms for which the parameters were developed. Our results also show that it is useful to work with large training sets and select the best subset that provides the highest quality parameters. It is also reasonable to test several training sets.
We did a large validation of our parameters. For demonstration, tables with detailed results of EEM charge calculation method with our parameters for several different organohalogene and organometal molecules are attached in supplementary material. Also coordinates and charges on single atoms are available there.

Conclusions
In this work, we have improved the published EEM parameters to calculate the STO-3G MPA charges for C, O, N, H, S, F and Cl. The new parameters provide more accurate charges than those published previously [17]. We have developed parameters for elements not yet parameterized, specifically for Br, I, Fe and Zn.
The EEM parameterization we have performed has been based on 12 training sets, which are also the largest published training sets used for the EEM parameterization ranging from 2000 to 6000 molecules. We have shown that the number of molecules in the training set is very important for the quality of the parameters.
We have performed crossover validation of all obtained parameters using all training sets that include relevant elements. To the best of our knowledge, we have performed the most accurate testing of EEM parameters quality published so far. This is the first work to compare EEM parameters calculated using two principally different training sets, one being a database of theoretically predicted 3D structures (NCI DIS) and the second being a database of crystallographic structures (CSD). Our results show that it is possible to use both databases, but parameters from the CSD database training sets give more accurate charges. Moreover, the parameters obtained from the NCI DIS database training sets are not very suitable to calculate charges for molecules from the CSD database.
These improved and newly developed parameters can be used for charge calculation using the program EEM SOLVER [27], which we have developed and which is freely available via the internet on http://ncbr.chemi.muni.cz/~n19n/eem_abeem.