Conformation-Independent QSPR Approach for the Soil Sorption Coefficient of Heterogeneous Compounds

We predict the soil sorption coefficient for a heterogeneous set of 643 organic non-ionic compounds by means of Quantitative Structure-Property Relationships (QSPR). A conformation-independent representation of the chemical structure is established. The 17,538 molecular descriptors derived with PaDEL and EPI Suite softwares are simultaneously analyzed through linear regressions obtained with the Replacement Method variable subset selection technique. The best predictive three-descriptors QSPR is developed on a reduced training set of 93 chemicals, having an acceptable predictive capability on 550 test set compounds. We also establish a model with a single optimal descriptor derived from CORAL freeware. The present approach compares fairly well with a previously reported one that uses Dragon descriptors.


Introduction
The soil sorption coefficient (K oc ) describes the biodegradation and pollution impact of organic pesticides [1] when these compounds interact with the organic matter of soils and sediments either on surface, ground or drinking water [2]. The reliable estimation of the K oc parameter is very important in agriculture, as its experimental measurement is difficult, expensive and time-consuming. Predicting the soil sorption coefficient for a wide number of chemical structures is very convenient in risk assessment [3].
In the realm of the Quantitative Structure-Property Relationships (QSPR) theory [4][5][6], an experimental property of a chemical compound, i.e., K oc , can be predicted through the knowledge of its chemical structure. The structure is quantified by means of a set of suitable molecular descriptors, in other words, numerical quantities carrying specific information on the constitutional, topological, geometrical, hydrophobic, and/or electronic aspects [7][8][9]. Therefore, a set of descriptors is then statistically correlated with the experimental property, resulting in a mathematical model that can be used with find out useful parallelisms.
It is known that many published QSPR models that predict the soil sorption coefficient involve the experimental octanol/water partition coefficient (K ow ) or the water solubility (S w ) [10], while other QSPR are based on theoretical molecular descriptors [11][12][13]. However, usually, little work is done to examine the model's predictivity (validation) and the chemical domain of application over a wide range of compounds, especially for new chemicals [14][15][16].
A previous QSPR study of Gramatica et al. [14] on a highly heterogeneous set of 643 organic non-ionic compounds predicts the soil sorption coefficient expressed in logarithmic units (logK oc ). The training set with 93 compounds used in such work is peculiar, because it is much smaller than the test set of 550 compounds (1:6 ratio). The best Dragon molecular descriptors are selected through the Genetic Algorithms (GA) technique based on Multivariable Linear Regression analysis (MLR), leading to a four-dimensional QSPR having a predictivity of 78% on the test set. The best predicted data are obtained by consensus modeling from ten different models in the GA model population.
In this work, we report new alternative QSPR models for the soil sorption coefficient in the same molecular set studied by Gramatica et al. [14], using an approach that does not consider the conformational representation of the chemical structure by only relying on the constitutional and topological aspects of the molecules [15]. As is known, every model that includes three-dimensional descriptors usually involves high computational costs and long times during the calculation of molecular geometry optimization. Therefore, the conformation-independent QSPR approach can be considered as a very useful methodology.
In addition, we also explore the performance of QSPR models based on optimal descriptors [16]. Within this technique, the calculated optimal descriptor depends both on the molecular structure and the property under analysis (K oc ), but does not explicitly depend on the 3D-molecular geometry. We have shown the importance of optimal descriptors in previous QSPR studies [17][18][19][20][21].

Results and Discussion
We begin our QSPR analysis by exploring the performance of molecular descriptors calculated with the PaDEL freeware. The most representative structural features of the training set of 93 heterogeneous compounds are searched through the RM technique. In this way, the best MLR models based on 1-6 molecular descriptors are found in a pool having 17,536 variables. In order to remove the 'collinear' (identical) descriptors, the linearly-dependent pairs are identified within RM, and only one variable from each pair is kept for further analysis. This process leads to a set containing 3491 linearly-independent descriptors.
We follow the common practice of keeping the model's dimension (d) as small as possible. The best MLR models are listed in Table 1, while a brief description of the descriptors meanings is provided in Table S1. It is appreciated from Table 1 that the RMS train parameter continues improving beyond four descriptors, but RMS test does not significantly improve. According to this, we choose a structure-property relationship having four descriptors with an acceptable predictive power on the test set: logK oc " 0.18SP3`0.30CrippenLogP´0.090gmax`0.16XLogP`1.18 (1) N train " 93, R 2 train " 0.87, RMS train " 0.45 In this equation N is the number of compounds; R ijmax denotes the maximum correlation coefficient between descriptor pairs; o2.5 indicates the number of outlier compounds in the training set having a residual (difference between experimental and calculated activity) greater than 2.5-times RMS train . The conformation-independent descriptors appearing in Equation (1) belong to four different classes [9]: (i) a PaDEL Chi Path Descriptor: SP3, simple path of order 3; (ii) a Crippen descriptor: CrippenLogP, Crippen's LogP; (iii) an electrotopological state atom type descriptor: gmax, the maximum E-state; and (iv) the XLogP descriptor.
A plot for the predicted logK oc as a function of the experimental values for the training and test sets is provided in Figure 1. The dispersion plot of residuals in Figure S1 tends to obey a random pattern around the zero line, suggesting that the assumption of the MLR technique is fulfilled. The correlation matrix for Equation (1) is given in Table S2, showing the absence of high correlations between descriptor pairs, while their numerical values are included in Table S3.
Equation (1) has an acceptable predictive power on the external test set of 550 compounds, according to the R 2 test and RMS test parameters. Such a model approves the internal validation process of Cross-Validation through the exclusion of one molecule at a time. The Y-Randomization technique demonstrates that Equation (1) has RMS train ă RMS rand and thus a valid structure-logK oc relationship is found. The external validation criteria recommended in [22] to assure predictive capability are also achieved and are summarized in Table S4. The conformation-independent descriptors appearing in Equation (1) belong to four different classes [9]: (i) a PaDEL Chi Path Descriptor: SP3, simple path of order 3; (ii) a Crippen descriptor: CrippenLogP, Crippen's LogP; (iii) an electrotopological state atom type descriptor: gmax, the maximum E-state; and (iv) the XLogP descriptor.
A plot for the predicted log oc K as a function of the experimental values for the training and test sets is provided in Figure 1. The dispersion plot of residuals in Figure S1 tends to obey a random pattern around the zero line, suggesting that the assumption of the MLR technique is fulfilled. The correlation matrix for Equation (1) is given in Table S2, showing the absence of high correlations between descriptor pairs, while their numerical values are included in Table S3. Equation (1)

RMS
RMS  and thus a valid structure-log oc K relationship is found. The external validation criteria recommended in [22] to assure predictive capability are also achieved and are summarized in Table S4. The statistical quality of Equation (1) is quite similar to various QSPR models reported previously by Gramatica et al. [14].  The statistical quality of Equation (1) is quite similar to various QSPR models reported previously by Gramatica et al. [14]. For instance, our QSPR with RMS train " 0.45 and RMS test " 0.53 is better than the published four-topological descriptor model with RMS train " 0.52 and RMS test " 0.56. Furthermore, Equation (1) is also comparable to the three-descriptor consensus model proposed in that paper (RMS train " 0.52 and RMS test " 0.53), although such a model has as the disadvantage that it includes geometrical descriptors. In our approach, we do not consider the geometrical representation of the chemical structures, but consider their constitutional and topological aspects instead while achieving acceptable results.
As a next step of this QSPR study, we include optimal molecular descriptor definitions in order to analyze the performance of such soil sorption-specific structural variables. The DCW optimal descriptor is optimized by increasing R 2 train , until the model starts to lose predictive capability in the test set (measured by RMS test ). The best structural representation for the 93 training compounds is hydrogen-filled graph, where the statistics for the stepwise evolution of the linear model is presented in Table 2. The first local descriptor selected is NNC (Nearest Neighboring Code), then the following ones are 0 EC (Morgan Extended Connectivity of zero-th order) and NOSP (the presence of Nitrogen, Oxygen, Sulfur or Phosphorus) in that order. It is noted from Table 2 that the best quality optimal descriptor involves such three-variable types, and 64 active attributes are based on them (shown in Table S5). More complete details for the QSPR model are the following: The parameters used for the DCW calculation are T " 1 and N epochs " 7. Figures S2 and S3 demonstrate that the MLR technique is also satisfied for Equation (2). An example for the calculation of DCW for formaldehyde is provided in Table 3. Our results reveal that Equation (1) has a better performance on the test set than Equation (2). Both QSPRs are obtained through different approaches, i.e., by allowing or not the molecular descriptor representing the chemical structure to be dependent on the studied logK oc property.
As a next step, we investigate what happens when the previous set of 3491 0D-2D descriptors from PaDEL is combined with the optimal DCW descriptor. The best 1-6 variable MLR models found in such pool of 3492 descriptors (Table S6) do not ameliorate the predictive power of our first model, as the training set statistics is better but not the one for the test set.
In a new attempt to improve Equation (1), we consider the inclusion of EPI Suite predictions as semiempirical molecular descriptors, calculated through logK ow Epi and logS w Epi predicted values. After searching the best MLR models in the set composed of 3493 independent descriptors from PaDEL and EPI Suite (refer to Table 4), the following structure-K oc relationship is achieved: N train " 93, R 2 train " 0.87, RMS train " 0.44 The performance of Equation (3) is better than Equation (1), and thus, we consider that this new QSPR model is the most suitable structure-soil sorption coefficient relationship for the 643 organic non-ionic compounds. Figure 2 and Figure S4 plot the predictions, while Tables S2 and S4 provide the correlation matrix and external validation criteria for Equation (3).  (3) is better than Equation (1), and thus, we consider that this new QSPR model is the most suitable structure-soil sorption coefficient relationship for the 643 organic non-ionic compounds. Figure 2 and Figure S4 plot the predictions, while Tables S2 and S4 provide the correlation matrix and external validation criteria for Equation (3).    The 2D molecular descriptors appearing in this last equation belong to three different classes: (i) a Molecular Linear Free Energy Relation (MLFER) descriptor: MLFER.E, measuring the excessive molar refraction; (ii) a substructure fingerprint: SubFP302, the presence of rotatable bonds; and (iii) an EPI Suite descriptor: logK ow Epi. As the three descriptors take positive numerical values, Equation (3) indicates that a compound having higher values for both MLFER.E and logK ow Epi descriptors together with a lower value for SubFP302 tend to have a higher predicted soil sorption coefficient.
MLFER.E measures the excessive molar refraction: the molar refraction of the solute minus the molar refraction of an alkane of equivalent volume. This descriptor can be easily estimated from the knowledge of a compound's refractive index, and suggests the propensity of the soil phase to interact with solute compounds having πand σ-electron pairs.
The SubFP302 descriptor has a clear interpretation as quantifies the presence (equal to one) or absence (equal to zero) of rotatable bonds in the chemical structure. This fingerprint identifies rotatable bonds that allow free rotation around themselves, that is to say, any single bond, not in a ring, bound to a non-terminal heavy atom.
Finally, the logarithm of the octanol/water partition coefficient logK ow Epi descriptor is a well-known physicochemical property that has been widely used in past QSPR studies for correlating the logK oc values. Therefore, hydrophobic compounds with high logK ow Epi values tend to exhibit a higher retaining by the organic matter of soils and sediments.
The analysis of the applicability domain of the new proposed QSPR reveals that 16 compounds out of the 550 included in the test set do not belong to the AD of the model, as h i ą h˚" 0.13. The obtained leverage values are also provided in Table S7. We assume that this particular behavior is due to the complexity of the dataset, i.e., the great structural heterogeneity of the molecules considered in this study. Thus, the predicted logK oc values for all, with the exception to such 16 test set compounds, can be considered as reliable as they fall within the AD.
As a final comparison, our best QSPR model with RMS train " 0.44 and RMS test " 0.48 has a better performance on the heterogeneous compounds than the one provided by EPI Suite: RMS train " 0.47 and RMS test " 0.56 (connectivity method) and RMS train " 0.48 and RMS test " 0.56 (partition coefficient based method). This means that our developed QSPR model of Equation (3) represents an alternative/complementary tool to the EPI Suite program for predicting the studied property in present dataset of 643 organic non-ionic compounds.

Experimental Dataset
The experimental soil sorption partition coefficient collected from [14] is quantified as the ratio between chemical concentration in soil and in water normalized to organic carbon. In the present dataset, logK oc ranges in the interval (´0.31, 6.02) in the training set (train) and (0, 6.33) in the test set (test); the complete list of 643 compounds studied here is included in Table S7 as Supplementary Material. The dataset is highly heterogeneous, and includes practically all of the principal functional groups present in pesticides and various organic pollutants.
In addition and for comparison purposes, the calculated logarithm of the soil sorption partition coefficient is obtained through the Estimation Program Interface (EPI Suite) software from the KOCWIN module (logK oc Epi) [23]. EPI Suite calculates logK oc Epi via two different techniques: (a) based on the first order Molecular Connectivity Index (MCI); and (b) based on logK ow (rather than MCI). In both cases, the program employs a series of group contribution factors.

Structural Representation and Molecular Descriptors Calculation
The molecules are first drawn in mol format with ACDLabs ChemSketch freeware [24]. The set of conformation-independent molecular descriptors is computed using PaDEL Version 2.20 [25], because it has the advantage that it is a freely available and open source software. PaDEL currently calculates 1444 0D-2D descriptors and 12 fingerprint types (total 16,092 bits) [26]. Furthermore, semiempirical descriptors from EPI Suite are added, such as the calculated logarithm of the octanol/water partition coefficient from KOWWIN (logK ow Epi) and the calculated logarithm of the water solubility from WATERNT (logS w Epi) [23]. Therefore, the total number of non-conformational descriptors explored in this work is 17,538. It is our intention to capture, with such a great number of descriptors, the most relevant structural characteristics affecting the studied property.

Molecular Descriptors' Selection in Multivariable Linear Regression (MLR)
We employ the Replacement Method (RM) technique [27][28][29][30][31][32][33] in order to generate MLR models on the training set, by searching in a pool having D " 17, 538 descriptors for optimal subsets containing d descriptors (d is much lower than D), with smallest values for the standard deviation (S train ) or the root mean square deviation (RMS train ). Table S8 includes a list of mathematical equations involved in the present study. All of the MATLAB-programmed [34] algorithms used in our calculations are available upon request.

The Optimal Molecular Descriptors
By means of the CORAL freeware (Correlation and Logic) [35] it is easy to define different optimal molecular descriptors. The Structural Representation (SR) used, i.e., graph or SMILES (Simplified Molecular Input Line Entry Specification), determines the Structural Attributes or local descriptors (SA) available for the QSPR. Therefore, it is necessary to decide which SA combination is the most appropriate, and this is done in a stepwise fashion, i.e., first search for the best single SA, then search for a second SA that combines the best with the previous one, and so on.
The DCW descriptor is a linear combination of Correlation Weights (CW); refer to Table S8. The CW is calculated for each SA in the training set through the Monte Carlo (MC) simulation method. The DCW depends on the threshold (T) and the number of epochs (N epochs ): the appropriate selection of T and N epochs avoids model over-fitting. The rare attributes are the ones that occur in less than T compounds, and in this work T is a positive integer analyzed in the range from 0-5.

Model Validation
The linear regression models are theoretically validated through Leave-One-Out Cross-Validation (LOO) [22]. A more reliable validation is applied with an external test set of structures: the same training set-test set partition from [14] is used in present analysis, that is to say, 93 compounds in the training set and 550 compounds in the test set. We also scramble the experimental property values with Y-Randomization [36] and 10,000 cases, as a way of checking that the model is not a result of chance correlation when RMS rand is greater than RMS train .

Applicability Domain
A predictive QSPR model is only able to predict molecules falling within its Applicability Domain (AD) [37], so that the predicted property is not a result of substantial extrapolation (unreliable prediction). The AD definition is dependent on the model's descriptors and the experimental property. Within the leverage approach [38], a test set compound must have a calculated leverage (h i ) smaller than the warning leverage (h˚).

Conclusions
We have succeeded in establishing structure-property relationships for the soil sorption coefficient, a useful parameter related to sorption processes determining the environmental fate, distribution and persistence of chemicals. The chemical domain explored includes a heterogeneous set of 643 organic non-ionic compounds, having a K oc range of more than six log units. The QSPR models found on a training set composed of 93 compounds have an acceptable predictive performance on a test set including 550 compounds, and are able to fulfill other necessary mathematical conditions, such as Cross-Validation, Y-Randomization and Applicability Domain analysis. Our results compare favorably to previous reported ones from the literature, although the proposed models involve molecular descriptors calculated through freely available software like PaDEL, CORAL and EPI Suite.
As we have developed a conformation-independent QSPR approach, the conformational representation of the chemical structures is avoided, and thus, no-experimental information on the X-ray crystal structure of compounds is required. Our research work continuously focuses on the use of new methods based on constitutional and topological approximations to QSPR studies, and thus, new results will be published shortly elsewhere.