Identification of Electronic and Structural Descriptors of Adenosine Analogues Related to Inhibition of Leishmanial Glyceraldehyde-3-Phosphate Dehydrogenase

Quantitative structure–activity relationship (QSAR) studies were performed in order to identify molecular features responsible for the antileishmanial activity of 61 adenosine analogues acting as inhibitors of the enzyme glyceraldehyde 3-phosphate dehydrogenase of Leishmania mexicana (LmGAPDH). Density functional theory (DFT) was employed to calculate quantum-chemical descriptors, while several structural descriptors were generated with Dragon 5.4. Variable selection was undertaken with the ordered predictor selection (OPS) algorithm, which provided a set with the most relevant descriptors to perform PLS, PCR and MLR regressions. Reliable and predictive models were obtained, as attested by their high correlation coefficients, as well as the agreement between predicted and experimental values for an external test set. Additional validation procedures were carried out, demonstrating that robust models were developed, providing helpful tools for the optimization of the antileishmanial activity of adenosine compounds.


Introduction
Leishmaniases are diseases caused by the intracellular protozoan parasite Leishmania.There are an estimated 1.5-2 million new cases per year, of which up to 500,000 are visceral leishmaniasis (VL), the fatal version of the disease.Left untreated, it causes a global annual mortality estimated at 59,000 [1].According to disease burden estimates, leishmaniasis ranks third in disease burden in disability-adjusted life years caused by neglected tropical diseases and is the second cause of parasite-related deaths after malaria [2].For a variety of reasons, it is not receiving the deserved attention given its high occurrence [3].
The first-line treatments for VL since the 1930s are the pentavalent antimonials, although these compounds are toxic and resistance has been an increasing problem in India [4].While significant progress has been made in the last 10 years, with the approval of amphotericin B, miltefosine and paromomycin, these new and safer chemotherapy alternatives remain out of reach for the affected rural population who are most in need [5].Moreover, the use of poor-quality drugs can be life-threatening for vulnerable patients and also have a devastating impact on public health and elimination programmes targeting the disease [6].
The glycolytic enzyme glyceraldehyde 3-phosphate dehydrogenase (GAPDH) has been considered as a target for the inhibition of protozoan parasites [7,8].GAPDH from the pathogenic trypanosomatids Trypanosoma brucei, Trypanosoma cruzi and Leishmania mexicana are quite similar to each other, but have sufficient structural differences, when compared to the human enzyme, making possible the structure-based design of compounds that selectively inhibit all three trypanosomatid enzymes, but not the human homologue [7].
By exploiting the differences in the structure of the parasitic and human GAPDH, adenosine analogs with substitutions on N-6 of the adenine ring and on the 2′ position of the ribose moiety were designed, synthesized and tested for inhibition of trypanosomatid GAPDHs, and two crystal structures of L. mexicana GAPDH (LmGAPDH) complexed with high-affinity inhibitors that also block parasite growth were solved [9].Induced fit of the LmGAPDH backbone upon binding of the inhibitor may enlarge a cavity at the binding site to accommodate the inhibitor.The extensive hydrophobic interactions between the protein and the two substituents on the adenine scaffold of the inhibitor TND (N-1,2,3,4-tetrahydronaphth-1-yl-2′-[3,5-dimethoxybenzamido]-2′-deoxyadenosine), as shown in Figure 1, provide a plausible explanation for the high affinity of these inhibitors for trypanosomatid GAPDHs [9].In order to enhance the knowledge on structural requirements for the adenosine binding to LmGAPDH, structure-activity relationship studies were carried out employing different molecular modeling techniques [11,12].In this work we have performed the calculation of a large amount of electronic, geometrical and topological descriptors with the aim to select the most relevant ones to the biological activity of adenosine compounds as inhibitors of LmGAPDH, employing the recently developed variable selection algorithm OPS (Ordered Predictor Selection) [13].By employing this strategy in conjunction with a protocol described previously [14,15], we have been able to construct a predictive model of the quantitative structure-activity relationships for the inhibition of LmGAPDH by adenosine compounds.

Statistical Results
The OPS variable selection algorithm selected nine descriptors as the most relevant for the analysis: volume, E HOMO , HATS4e, HATS3u, H7m, Mor23v, BELp1, JGI2, E1v (see Table 1 for the meanings of each descriptor).The PLS regression models obtained with these descriptors have resulted in the statistical parameters presented in Table 2.In order to reassure the suitability of the selected descriptors for building QSAR models for the compounds under study, other two techniques were also employed: Principal Component Regression (PCR) and Multiple Linear Regression (MLR).Statistical results for these techniques are also displayed in Table 2. There, it is possible to observe that the optimum number of latent variables for PLS is 1, while the optimal number of principal components for PCR is 2, since those are the ones presenting lowest SEV (standard error of validation) and PRESS (cross-validation predicted residual error sum of squares) values.
Then, applying leave-one-out (LOO) cross-validation, the best PLS model presents correlation coefficients of = 0.852 and r 2 = 0.874, whereas in the best PCR models these values are = 0.873 and r 2 = 0.852, indicating good internal consistency for both models.Leave-N-out (LNO) cross-validation results show that the models continue to present significant correlation coefficients ( = 0.850 and 0.854 for PLS and PCR, respectively) even when 30% of the samples are left out for prediction, which indicates that robust models were obtained.

External Model Validation and Y-Randomization Tests
External validation tests were applied in order to evaluate the predictive power of the QSAR models constructed.A plot of experimental versus predicted pIC 50 values comparing the compounds in both training and test sets, using the three regression techniques employed here, is shown in Figure 2. The good agreement between the experimental and calculated values indicates that predictive models were obtained, since good values of external validation correlation coefficients ( and standard errors of prediction (SEP) were achieved (see Table 3).These results indicate that the QSAR models constructed can be used to accurately predict the biological activity of other compounds within this structural class.
Chance correlations between the dependent variable and the selected descriptors were verified employing the y-randomization validation.In this test, the pIC 50 values are scrambled and the r 2 and q 2 values are calculated.If low values for both parameters are found, then one can be sure that a true correlation of the descriptors with the response variable exists in the data set [16,17].In the 20 y-randomizations performed for our data, only low values of r 2 and q 2 were obtained (see Table 3).So, this indicates that the descriptors selected by the OPS algorithm possess a true correlation with the dependent variable, attesting that our statistical results are not a chance correlation result.The models obtained were ranked according to the methodology proposed by Karoly et al. [18,19], where ranks are compared with random numbers.The sum of ranking differences (SRD) arranges the models in such a way that low values of SRD are related to better models, while similar SRD values indicates the similarity of the models.Furthermore, the discrete distribution for a small number of objects (n < 14) is calculated, whereas the normal distribution is used as a reasonable approximation if the number of objects is large.This theoretical distribution is visualized for random numbers and can be used to identify SRD values for models that are far from being random, a procedure named as Comparison of Ranks by Random Numbers (CRNN).
The results for the ranking procedure are presented in Table 4 for training and test sets, while Figure 3 shows the SRD distributions (data matrices are provided as supplemental material S2).These results indicate that for both training and test sets the models obtained are better (or similarly) ranked than the experimental values, and that the SRD values for models are not random.

Applicability Domain
The applicability domain was defined here in terms of leverage and Studentized residuals for all samples in the training set.Leverage (h) is a quantity that represents a sample's distance to the centroid of the training set.For the i th sample, (i = 1, …, m), where is the descriptor row-vector for compound i, m is the number of query compounds, is the n k training set matrix, k is the number of model descriptors and n is the number of samples in the training set.A leverage value greater than a certain critical value for a training set sample, defined here on the basis of 95% confidence level, means that the sample has a high influence in the model.Concerning Y outliers, the simple examination of raw y residuals can be misleading due to the effect of leverage.A sample with an extreme y value pulls the model towards itself, decreasing the difference between its experimental and fitted y values.In contrast, a sample with a y value lying close to the y mean value, having little leverage, do not greatly influence the model so its y residual tends to be higher.In order to have a more realistic picture, the Studentized residual, r i , can be applied, since it takes leverage into account.r i is derived from the root mean squared y residual for the training set (RMSE), and is given by Equation (1): Since it is assumed that r i is normally distributed, a t test can determine whether a sample's Studentized residual is large enough to classify such sample as a Y outlier.Here, a critical value for r i was computed at a 95% probability level, based on the n training set samples.Finally, the plots of h i versus r i for the best PLS, PCR and MLR models were examined in order to determine the applicability domain of these models.Results are shown in Figure 4, where is possible to verify that none of the compounds from the training set can be considered as a response outlier, since all of them present low combined values of h i and r i .Although compound 25, in all models, and compound 42 in PLS and PCR present high Y residuals, both of them have extremely low leverage values, meaning that this outcome does not significantly influence the model.Meanwhile, compounds with relatively high leverage values (1, 41, 43-47 in PLS; 35, 38, 45-47 in PCR; and 40, 42, 46 and 47 in MLR) are inside the applicability domains of their respective models, since they are within the thresholds of r i .

Molecular Implications for Ligand Design
Since reliable QSAR models were obtained, the regression vectors can be used to analyze the selected molecular features and to suggest structural modifications that can be able to improve the biological activity of molecules similar to the ones studied here.The contributions of each descriptor to the regression vector for the best models obtained are displayed in Figure 5.

MLR
The positive regression coefficient of descriptors such as Volume, E HOMO , H7m, BELp1, and JGI2 indicates that their higher values are favorable for the LmGAPDH inhibition.Then, a given molecule must present a high solvent-accessible surface-bounded molecular volume (as defined by Connoly [20]) in order to show affinity to LmGAPDH.Molecular volume is an useful index for understanding the drug action since short range dispersion forces play a major role in the binding of ligands to biological receptors.For efficient and specific binding, the receptor cavity must be filled with the interacting ligand in the most optimal geometry [21].Additionally, in this case, E HOMO must also have a high value, which indicates that a highly active molecule must be the one with a high ionization potential, meaning that it would easily donate an electron in a charge transfer mechanism [22].Geometry, Topology and Atom-Weight Assembly (GETAWAY) descriptors such as H7m try to match 3D-molecular geometry provided by the molecular influence matrix and molecular topology with chemical information by using different atomic weightings (atomic mass, polarizability, van der Waals volume, and electronegativity) [23].The information provided by the H7m descriptor in our PLS model is weighted by atomic masses, having a positive influence on the biological activity.
BELp1 is also a 2D descriptor from the class of BCUT descriptors, which accounts for the first eight lowest absolute eigenvalues for the modified Burden adjacency matrix, where p refers to atomic polarizability and 1 is the eigenvalue rank.The ordered sequence of the lowest eigenvalues reflects the relevant aspects of molecular structure, which are useful for similarity searching [24].JGI2 belongs to GALVEZ descriptors, which are the Galvez topological charge indices, and have their origin in the first 10 eigenvalues of the polynomial of corrected adjacency matrix of the compounds.JGI2 represents the mean topological charge index of order 2 [25].
On the other hand, from the negative signs of regression coefficients of HATS4e, HATS3u, Mor23v and E1v, it is evident that these descriptors contribute negatively to the biological activity of adenosine compounds.Thus, lower values of these descriptors are required in order to obtain high activity compounds.HATS4e and HATS3u also belong to the class of GETAWAY descriptors.The HATS prefix means leverage-weighted autocorrelation, 4 and 3 are the lag numbers, and while HATS4e is weighted by atomic Sanderson electronegativities, HATS3u is unweighted [26].The selection of the 3D descriptors Mor23v can be related to the importance of molecular conformation of adenosine analogues for the interaction with key amino acids from the binding site of GAPDH [27].E1v belongs to the class of Weighted Holistic Invariant Molecular (WHIM) descriptors, which contain 3D information calculated from the x,y,z-coordinates.E1v is the 1st component accessibility directional WHIM index, weighted by atomic van der Waals volumes [28].
On the basis of the foregoing considerations, it is possible to observe a balance between steric and electrostatic properties influencing the affinity of adenosines to LmGAPDH, which is in agreement with the findings of Guido et al. [11].Steric molecular features are represented by volume, H7m, E1v, and Mor23v, while descriptors E HOMO , HATS4e, BELp1, and JGI2 account for electronic aspects.

Data Sets
The 61 adenosine derivatives employed in this study were selected from the literature [7,[29][30][31][32].IC 50 values, measured under the same experimental conditions, were converted to the corresponding pIC 50 (-logIC 50 ), and used as dependent variable in the regression analyses.Structures and pIC 50 values for all compounds are displayed in Table 5.From the whole data set, 47 compounds were selected to constitute the training set, while 14 compounds were taken to compose a test set to be utilized in an external validation procedure.This selection was performed carefully in order to certify that the structural diversity and the pIC 50 distribution of the data set were well represented in both training and test sets.

Descriptor Calculation and Selection
A pre-optimization of the geometries of all compounds were carried out with the semiempirical method PM3 [33,34].A final optimization was performed with the density functional theory (DFT) using the B3LYP functional [35,36] along with the 6-311G** basis sets [37].Several electronic descriptors were calculated using Gaussian 03 [37], and various structural descriptors were calculated with the QSAR module implemented in HyperChem 4.5 [34].A set of 1,100 molecular descriptors, encoding information about molecular structure, connectivity and topology were also calculated with Dragon 5.4 [38].All descriptors were autoscaled in order to give them the same weight in the analyses.
With the aim to reduce the number of descriptors, the absolute values of correlation coefficients between each descriptor and pIC 50 were calculated.Descriptors with coefficients lower than 0.3 were eliminated from the analysis, and so 72 descriptors remained.From this subset of descriptors, the ones presenting a non-uniform distribution related to the pIC 50 were also eliminated, leaving 35 descriptors in the analysis.Then, the Ordered Predictor Selection (OPS) algorithm [13] was employed to perform a variable selection.The basic idea of this algorithm is to attribute an importance to each descriptor based on an informative vector.The columns of the matrix are rearranged in such a way that the most important descriptors are presented in the first columns.Afterwards, successive PLS regressions are performed with an increasing number of descriptors in order to find the best PLS model.In this analysis, the regression vector was used as an informative vector and the correlation coefficient of cross-validation, q 2 , as a criterion to select the best models.The suitability of the descriptors selected by this procedure was tested by performing Principal Component Regression (PCR) and Multiple Linear Regression (MLR).
The best models were chosen on the basis of the cross-validation predicted residual error sum of squares (PRESS), being the optimal number of PLS or PCR components the one that minimizes PRESS.Model quality was verified mainly by the correlation coefficients r 2 and q 2 and also by the prediction residuals, which are indications that the model can be used for making predictions of the biological properties of unknown compounds, which are structurally similar.Model robustness and sensitiveness were additionally evaluated by applying leave-N-out (LNO) cross-validation and y-randomization tests.It is important to mention that the model validation is a very crucial step in QSAR studies [39][40][41][42].In the LNO cross-validation procedure, N compounds (N varying from 1 to 20) were left out from the training set.For a particular N, the data were randomized 30 times, and the average and standard deviation values for q 2 were used.In the y-randomization, the dependent variable-vector was scrambled 20 times in order to verify the occurrence of chance correlations between the dependent variable and the selected descriptors [16,17].Applicability domain was defined through the examination of the plots of leverage versus Studentized residuals for the best PLS, PCR and MLR models.

Conclusions
The continuous search for new antileishmanial compounds is undoubtedly important for the researches in neglected diseases.In this context, QSAR models can play an important role in the discovery and optimization of new drug candidates.In this work, PLS, PCR and MLR models were developed to provide indications on relevant molecular features for the antileishmanial activity of adenosine compounds.A set of nine descriptors selected by the OPS approach have demonstrated to be suitable for the construction of QSAR models.The models constructed can be used by researchers interested in synthesizing new adenosine compounds.Once a new adenosine compound is designed, its structure can be submitted to the calculations performed in our work, i.e., the variables selected in our study can be calculated for this new compound.Then, the values of these variables can be inserted into the regression models in order to predict the pIC 50 for this compound.So, our models can be helpful to decide which compounds should be synthesized, saving time and resources.The good statistical parameters, stability and robustness of the models obtained, as assured by the validation tests applied over our data, indicate that these models can be used to design other adenosine derivatives with improved antileishmanial activity.

Figure 2 .
Figure 2. Experimental versus predicted pIC 50 values of the training and test set compounds.

Figure 4 .
Figure 4. Plots of leverage versus Studentized residuals for the regression models constructed.Blue lines indicate the thresholds representing a probability level of 95%.

Figure 5 .
Figure 5. Contribution of each descriptor to the regression vector.

Table 1 .
Symbols, types and definitions of the selected descriptors.

Table 2 .
Statistical parameters for the PLS, PCR and MLR models based on the 9 selected descriptors.

Table 3 .
Statistical parameters of external validation and y-randomization tests.

Table 4 .
SRD ranking of models and experimental values, p% interval and percentiles output for training and test sets.

Table 5 .
Chemical structures and pIC 50 values for training and test set compounds.