QSAR Study of Antimicrobial 3-Hydroxypyridine-4-one and 3-Hydroxypyran-4-one Derivatives Using Different Chemometric Tools

A series of 3-hydroxypyridine-4-one and 3-hydroxypyran-4-one derivatives was subjected to quantitative structure-antimicrobial activity relationship (QSAR) analysis. A collection of chemometric methods, including factor analysis-based multiple linear regression (FA-MLR), principal component regression (PCR) and partial least squares combined with a genetic algorithm for variable selection (GA-PLS), was employed to relate structural parameters to antimicrobial activity. The results revealed the significant role of topological parameters in the antimicrobial activity of the studied compounds against S. aureus and C. albicans. The most significant QSAR model, obtained by GA-PLS, could explain and predict 96% and 91% of the variance in the pMIC data for the compounds tested against S. aureus, and 91% and 87% of the variance in the pMIC data for the compounds tested against C. albicans, respectively.


Introduction
Quantitative structure-activity relationship (QSAR) studies, one of the most important areas of chemometrics, provide information that is useful for molecular design and medicinal chemistry [1][2][3][4][5]. QSAR models are mathematical equations that relate chemical structures to biological activities. These models can also provide deeper insight into the mechanism of biological activity. In the first step of a typical QSAR study, one needs to find a set of molecular descriptors with the highest impact on the biological activity of interest [6][7][8][9].
A wide range of descriptors has been used in QSAR modeling. These descriptors have been classified into different categories, including constitutional, geometrical, topological and quantum chemical descriptors. Several methods are available for variable selection and model building, including multiple linear regression (MLR), genetic algorithm (GA), partial least squares (PLS) and principal component or factor analysis (PCA/FA) [7][8][9]. MLR yields models that are simpler and easier to interpret than those from PCR and PLS, because the latter methods perform regression on latent variables that have no direct physical meaning. Because of the collinearity problem in MLR analysis, collinear descriptors may have to be removed before an MLR model is developed. MLR equations can describe structure-activity relationships well, but some information is discarded in the process. On the other hand, factor analysis-based methods such as PLS regression can handle collinear descriptors and can therefore give better predictive models [10].
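The removal of collinear descriptors before MLR can be sketched as a simple pairwise-correlation filter. This is only an illustration and not the procedure used in this study: the function names, the 0.9 threshold and the toy descriptor values are all assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def drop_collinear(descriptors, threshold=0.9):
    """Keep a descriptor only if |r| with every already-kept one is below threshold.

    `descriptors` maps a descriptor name to its values (one value per compound).
    Constant descriptors are not handled and would cause a division by zero.
    """
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy values for four compounds (not data from this study).
toy = {
    "logP": [1.0, 2.0, 3.0, 4.0],
    "volume": [2.1, 3.9, 6.2, 8.0],   # nearly proportional to logP
    "dipole": [0.5, -0.2, 0.4, -0.1],
}
print(drop_collinear(toy))
```

The filter is order-dependent: whichever member of a collinear pair appears first is retained, so in practice the descriptors would be ordered by some measure of relevance first.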
It is almost 120 years since physicians found that the coincidence of blood and bacteria in a wound may cause a life-threatening infection. It has also been shown that blood or hemoglobin enhances the lethality of intraperitoneal or subcutaneous inocula of bacteria such as Escherichia coli. The effective component of hemoglobin is iron, and various soluble iron compounds exert an equivalent effect [11]. Administration of iron compounds to the host can increase the virulence of Escherichia coli, Listeria monocytogenes, Salmonella typhimurium and other pathogens [12]. In fact, iron is an essential element required for the growth and virulence of virtually all microbial pathogens [13,14]. The availability of iron is critically important in host-parasite interactions [15]. Vertebrate hosts withhold iron from microbial invaders as a major defence mechanism against infection [13,15]. This is achieved by sequestration of iron with iron-binding proteins, the most abundant of which are haemoproteins [16]. Some natural antibiotics, called siderophores, are low-molecular-weight chelating agents that form stable complexes with iron [17,18]. There are many reports of the antimicrobial activity of chelating agents with different chemical structures [19][20][21][22]. Kojic acid (5-hydroxy-2-hydroxymethyl-pyran-4-one) and its 3-hydroxypyranone derivatives are examples of these compounds [19]. The bidentate chelating ligand 3-hydroxypyranone, which has a catechol-like function, forms stable complexes with several metal ions such as Fe³⁺. In vitro antibacterial and antifungal activities of 3-hydroxypyridinones, bioisosteric derivatives of 3-hydroxypyranones with metal chelating ability, have been described. They have an inhibitory effect on the growth of Escherichia coli, Listeria innocua and Staphylococcus aureus [22].
More recently, antibacterial and antifungal activities of carboxamide derivatives of 3-hydroxypyranones, 5-hydroxypyranones and 5-hydroxypyridinones have been reported [23,24].
The antimicrobial activity of these compounds against C. albicans, S. aureus and P. aeruginosa was the subject of MLR analysis in our preliminary study [25]. The MLR models revealed the best relationships between antimicrobial activity and structural properties for S. aureus and C. albicans. In the present paper, more than 600 topological, geometrical, constitutional, functional group, electrostatic, quantum and chemical descriptors were used, and different methods were applied to develop QSAR equations for the antimicrobial activity of the studied compounds against S. aureus and C. albicans. These methods were: (i) genetic algorithm-partial least squares (GA-PLS), (ii) MLR with factor analysis as the data pre-processing step for variable selection (FA-MLR) and (iii) principal component regression analysis (PCRA). The correlation coefficient (r), standard error of regression (SE), cross-validated r² (r²cv, or Q²) and RMScv (STD(r)) were used to judge the validity of the regression equations.
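For reference, the validation statistics named above are conventionally defined as follows. The definitions are not spelled out in this section, so the standard forms are assumed here: y_i is the experimental activity of compound i, ŷ_i its fitted value, ŷ_(i) the value predicted when compound i is left out of the calibration, ȳ the mean activity, n the number of compounds and k the number of terms in the model.

```latex
\begin{align}
r^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, &
SE &= \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}}, \\
Q^2 &= 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, &
\mathrm{RMS}_{cv} &= \sqrt{\frac{\mathrm{PRESS}}{n}}, \\
\mathrm{PRESS} &= \sum_{i=1}^{n} \bigl(y_i - \hat{y}_{(i)}\bigr)^2 .
\end{align}
```

Under these definitions, r² measures how much of the variance the model explains for the compounds it was fitted on, whereas Q² measures how much it predicts for compounds it has not seen.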

Software
The two-dimensional structures of the molecules were drawn using the Hyperchem 7.0 software. The final geometries were obtained with the semi-empirical AM1 method in the Hyperchem program. The molecular structures were optimized using the Polak-Ribiere algorithm until the root mean square gradient was 0.01 kcal mol⁻¹. The resulting geometry was transferred into the Dragon program package, which was developed by the Milano Chemometrics and QSAR Group [26]. The z-matrix of each structure was provided by the software and transferred to the Gaussian 98 program. Complete geometry optimization was performed taking the most extended conformation as the starting geometry. Semi-empirical molecular orbital calculations (AM1) of the structures were performed using the Gaussian 98 program [27]. MATLAB software (version 7.1, MathWorks Inc.) was used for the PLS regression method.

Data set and descriptor generation
The biological data used in this study are the antimicrobial activities (in terms of -log MIC) of a set of 3-hydroxypyridine-4-one and 3-hydroxypyran-4-one derivatives [23,24,25]. The structural features of these compounds are listed in Table 1, and the activity data were used as the dependent variable in the subsequent QSAR analysis. A large number of molecular descriptors were calculated using Hyperchem, the Dragon package and Gaussian 98. Some chemical parameters, including molecular volume (V), molecular surface area (SA), hydrophobicity (LogP), hydration energy (HE) and molecular polarizability (MP), were calculated using the Hyperchem software. The Dragon software calculated different functional group, topological, geometrical and constitutional descriptors for each molecule.
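The activity scale used here, pMIC = -log MIC, can be illustrated with a short sketch. The function name `pmic` and the assumption that MIC is expressed in mol/L are illustrative only; the units of the original MIC data are not stated in this section.

```python
from math import log10

def pmic(mic_molar):
    """Return pMIC = -log10(MIC), with MIC assumed to be given in mol/L."""
    if mic_molar <= 0:
        raise ValueError("MIC must be positive")
    return -log10(mic_molar)

# A hypothetical compound with an MIC of 1e-4 mol/L.
print(pmic(1e-4))
```

A lower MIC (a more potent compound) therefore gives a higher pMIC, which is why the regression models treat larger values as higher activity.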
Constitutional, topological, geometrical, functional group, quantum and physicochemical indices were used in this study; a brief description of some of them is given in Table 2.
Table 1. Chemical structure of the compounds used in QSAR analysis.

Data screening and model building
The calculated descriptors were collected in a data matrix whose rows and columns correspond to the molecules and descriptors, respectively. Genetic algorithm-partial least squares (GA-PLS), MLR with factor analysis as the data pre-processing step for variable selection (FA-MLR) and principal component regression analysis (PCRA) were used to derive the QSAR equations, and feature selection was performed with a genetic algorithm (GA). Genetic algorithms are efficient methods for function minimization. In the context of descriptor selection, the quantity optimized is the prediction error of the model built on a set of features [29].
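The idea of using a genetic algorithm to minimize the prediction error over descriptor subsets can be sketched as follows. This is a minimal illustration under stated assumptions, not the GA used in this study: ordinary least squares is swapped in for PLS as the fitness model to keep the sketch short, and the operators (truncation selection, one-point crossover, bit-flip mutation), the parameter values, the function names and the synthetic data are all hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a.x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + row for row in X]
    k = len(Z[0])
    a = [[sum(z[i] * z[j] for z in Z) for j in range(k)] for i in range(k)]
    b = [sum(z[i] * t for z, t in zip(Z, y)) for i in range(k)]
    return solve(a, b)

def loo_press(X, y, mask):
    """Leave-one-out PRESS of an OLS model built on the descriptors switched on in mask."""
    cols = [j for j, on in enumerate(mask) if on]
    if not cols:
        return float("inf")
    Xs = [[row[j] for j in cols] for row in X]
    press = 0.0
    try:
        for i in range(len(y)):
            coef = ols_fit(Xs[:i] + Xs[i + 1:], y[:i] + y[i + 1:])
            pred = coef[0] + sum(c * v for c, v in zip(coef[1:], Xs[i]))
            press += (y[i] - pred) ** 2
    except ZeroDivisionError:          # exactly singular subset: reject it
        return float("inf")
    return press

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Return the best descriptor mask found and its leave-one-out PRESS."""
    rng = random.Random(seed)
    n_desc = len(X[0])
    pop = [[rng.random() < 0.5 for _ in range(n_desc)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: loo_press(X, y, m))
        parents = pop[: pop_size // 2]             # truncation selection, elitist
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_desc)         # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([(not g) if rng.random() < p_mut else g for g in child])
        pop = parents + children
    best = min(pop, key=lambda m: loo_press(X, y, m))
    return best, loo_press(X, y, best)

# Synthetic data: 12 "compounds", 4 "descriptors"; only columns 0 and 2 matter.
rng = random.Random(1)
X = [[rng.random() for _ in range(4)] for _ in range(12)]
y = [2.0 * row[0] - row[2] + rng.gauss(0.0, 0.01) for row in X]
best, press = ga_select(X, y)
print(best, press)
```

Each chromosome is a Boolean mask over the descriptor columns, so the search space has 2^m members for m descriptors; with several hundred descriptors an exhaustive search is impossible, which is the motivation for a stochastic search of this kind.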
In this study, genetic algorithm-partial least squares (GA-PLS) was employed to better model the structure-antimicrobial activity relationships [30,31]. Partial least squares (PLS) linear regression is a relatively recent technique that generalizes and combines features of principal component analysis and multiple regression. PLS is suitable for overcoming the problems in MLR related to multicollinear or over-abundant descriptors [10].
Application of the PLS method thus allows the construction of larger QSAR equations while still avoiding over-fitting and without eliminating most variables. This method is normally used in combination with cross-validation to obtain the optimum number of components [32,33]. The PLS regression method used was the NIPALS-based algorithm available in the chemometrics toolbox of MATLAB software (version 7.1, MathWorks Inc.). The leave-one-out cross-validation procedure was used to obtain the optimum number of factors based on the Haaland and Thomas F-ratio criterion [34].
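The combination of NIPALS-type PLS with leave-one-out cross-validation can be sketched in a few lines. This is an illustrative single-response (PLS1) implementation, not the MATLAB toolbox routine used in this study; the function names and toy data are assumptions, and the number of components is chosen here by minimum PRESS rather than by the Haaland and Thomas F-ratio criterion, which is not implemented.

```python
def column_means(rows):
    """Column means of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pls1_fit(X, y, n_comp):
    """Fit PLS1 by NIPALS-style deflation; returns what pls1_predict needs."""
    x_mean, y_mean = column_means(X), sum(y) / len(y)
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]   # centered X
    f = [v - y_mean for v in y]                               # centered y
    comps = []
    for _ in range(n_comp):
        w = [sum(e[j] * fi for e, fi in zip(E, f)) for j in range(len(x_mean))]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:
            break                                  # nothing left to explain
        w = [v / norm for v in w]                  # weight vector
        t = [sum(a * b for a, b in zip(row, w)) for row in E]   # scores
        tt = sum(v * v for v in t)
        p = [sum(e[j] * ti for e, ti in zip(E, t)) / tt for j in range(len(w))]
        q = sum(fi * ti for fi, ti in zip(f, t)) / tt
        E = [[a - ti * pj for a, pj in zip(row, p)] for row, ti in zip(E, t)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        comps.append((w, p, q))
    return x_mean, y_mean, comps

def pls1_predict(model, x):
    """Predict y for one sample by repeating the deflation steps on it."""
    x_mean, y_mean, comps = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in comps:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * pj for a, pj in zip(e, p)]
    return y_hat

def loo_press(X, y, n_comp):
    """Leave-one-out PRESS for a PLS1 model with n_comp latent variables."""
    press = 0.0
    for i in range(len(y)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_comp)
        press += (y[i] - pls1_predict(model, X[i])) ** 2
    return press

def best_n_components(X, y, max_comp):
    """Number of latent variables with the lowest PRESS, plus all PRESS values."""
    scores = {a: loo_press(X, y, a) for a in range(1, max_comp + 1)}
    return min(scores, key=scores.get), scores

# Toy data obeying y = 1 + 2*x0 - x1 exactly (not data from this study).
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0], [6.0, 4.0]]
y = [1.0 + 2.0 * a - b for a, b in X]
n_best, press_by_a = best_n_components(X, y, max_comp=2)
model = pls1_fit(X, y, n_best)
print(n_best, pls1_predict(model, [7.0, 2.0]))
```

With noisy real data the PRESS curve typically passes through a minimum as components are added; an F-ratio criterion such as the one cited above would then favour the smallest model whose PRESS is not significantly worse than that minimum.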
In our previous study, the classical multiple regression technique was used to develop QSAR relationships [25]. Here, FA-MLR was also performed on the dataset. Factor analysis (FA) was used to reduce the number of variables and to detect structure in the relationships between them. This data-processing step is applied to identify the important predictor variables and to avoid collinearities among them [35]. Principal component regression analysis (PCRA) was also tried on the dataset along with FA-MLR. With PCRA, collinearities among the X variables are not a disturbing factor, and the number of variables included in the analysis may exceed the number of observations [36]. In this method, factor scores, as obtained from FA, are used as the predictor variables [35]. In PCRA, all descriptors are assumed to be important, whereas the aim of factor analysis is to identify the relevant descriptors.
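The difference between the two approaches can be written compactly. The notation below is assumed for illustration and does not come from the original text: X is the scaled descriptor matrix (n compounds by m descriptors), T holds the scores of the A retained factors, P the corresponding loadings and E the unexplained part of X.

```latex
\begin{align}
\mathbf{X} &= \mathbf{T}\mathbf{P}^{\mathsf{T}} + \mathbf{E}, \\
y_i &= b_0 + \sum_{a=1}^{A} b_a\, t_{ia} + e_i .
\end{align}
```

In PCRA the activity is regressed directly on the factor scores t_ia, as in the second line. In FA-MLR the loadings in P are used only to pick out descriptors that load highly on different factors, and the regression is then carried out on those original descriptors.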

GA-PLS
In PLS analysis, the descriptor data matrix is decomposed into orthogonal matrices with an inner relationship between the dependent and independent variables. Therefore, unlike MLR analysis, PLS analysis avoids the multicollinearity problem in the descriptors. Because a minimal number of latent variables is used for modeling, PLS copes with noisy data better than MLR. A genetic algorithm was used to find the most suitable set of descriptors for PLS modeling; to do so, many different GA-PLS runs were conducted using different initial populations.
The data set of compounds tested against S. aureus (n = 31) was divided into two groups: a calibration set (n = 25) and a prediction set (n = 6). With the 25 calibration samples, the leave-one-out cross-validation procedure was used to find the optimum number of latent variables for each PLS model. The GA-PLS model with the best fitness contained 17 indices, 5 of them being those obtained by MLR. The PLS estimates of the coefficients for these descriptors are given in Figure 1. As can be seen, a combination of quantum, topological, geometrical, constitutional and functional group descriptors was selected by GA-PLS to account for the antimicrobial activity of the studied compounds. The majority of these descriptors are topological indices. The resulting GA-PLS model possessed very high statistical quality (R² = 0.96 and Q² = 0.91). The pMIC values predicted by the PLS model (obtained from cross-validation or from the external prediction set), along with the corresponding relative errors of prediction (REP), are shown in Table 3. The very small relative errors confirm the accuracy of the proposed GA-PLS model for modeling the antimicrobial activity of the studied compounds.
The data set of compounds tested against C. albicans (n = 28) was again divided into two groups: a calibration set (n = 23) and a prediction set (n = 5). With the 23 calibration samples, the leave-one-out cross-validation procedure was used to find the optimum number of latent variables for each PLS model. Here, the best GA-PLS model contained 15 indices, five of them being those obtained by MLR. The PLS estimates of the coefficients for these descriptors are given in Figure 2. The predicted pMIC values and the corresponding relative errors of prediction are shown in Table 4. The very small relative errors confirm the accuracy of the proposed GA-PLS model for modeling the antimicrobial activity of the studied compounds.

FA-MLR and PCRA
Table 5 shows the factor loadings of the variables (after VARIMAX rotation) for the compounds tested against S. aureus. As can be seen, about 79% of the variance in the original data matrix could be explained by the four selected factors. Based on the procedure explained in the experimental section, the following three-parametric equation was derived.
Equation 2 also shows high statistical quality (81% explained variance and 79% predicted variance in the pMIC data). Since factor scores are used instead of selected descriptors, and each factor score contains information from different descriptors, loss of information is avoided and the quality of the PCRA equation is better than that of the equation derived from FA-MLR.
As can be seen from Table 5, for each factor the loading values of some descriptors are much higher than those of the others. These high values indicate which descriptors each factor carries the most information about. It should be noted that all factors contain information from all descriptors, but the contributions of a descriptor to the different factors are not equal. For example, factors 1 and 2 have higher loadings for topological, constitutional and functional group indices, whereas information about quantum and functional group descriptors is mostly incorporated in factors 3 and 4. Therefore, the significance of the original variables for modeling the activity can be obtained from the factor scores used in Equation 2. Factor score 1 indicates the importance of Mv, HNar, nCaH and IDDE (topological, constitutional and functional group descriptors). Factor score 2 indicates the importance of RBN and Me (constitutional descriptors), and factor scores 3 and 4 signify the importance of DMy and nCONHR (quantum and functional group descriptors, respectively).
Table 6 shows the five factor loadings of the variables (after VARIMAX rotation) for the compounds tested against C. albicans. As can be seen, about 80% of the variance in the original data matrix could be explained by the five selected factors. Equation 4 also shows high statistical quality (88% explained variance and 83% predicted variance in the pMIC data). It should be noted that the variables (factor scores) used in Equation 4 are perfectly orthogonal to each other. Since factor scores are used instead of selected descriptors, and each factor score contains information from different descriptors, loss of information is avoided and the quality of the PCRA equation is better than that of the equation derived from FA-MLR.
As can be seen from Table 6, for each factor the loading values of some descriptors are much higher than those of the others. Factors 1 and 2 have higher loadings for topological, quantum and functional group indices, whereas information about geometrical, quantum and topological descriptors is mostly incorporated in factors 3, 4 and 5. Therefore, the significance of the original variables for modeling the activity can be obtained from the factor scores used in Equation 4. Factor score 1 indicates the importance of PW5, piID and electronegativity (topological and quantum descriptors). Factor score 2 indicates the importance of HOMO, nCp and nNR2 (quantum and functional group descriptors). Factor score 3 signifies the importance of ASP and L/Bw (geometrical descriptors), and factor scores 4 and 5 signify the importance of quantum and topological descriptors (DMz and PW3).
Comparison between the results obtained by GA-PLS and the other employed regression methods indicates higher accuracy of this method in describing antimicrobial activity of the studied compounds.
The difference in accuracy of the regression methods used in this study is visualized in Figures 3 and 4 by plotting the predicted activity (by cross-validation) against the experimental values. All linear models show the data scattered around a straight line with slope and intercept close to one and zero, respectively. The plot of the GA-PLS results shows the lowest scatter, whereas those obtained by FA-MLR and PCR analysis show lower accuracy. It should be mentioned that the model provided by the GA-PLS method is better than that provided by MLR analysis in our previous study [25]. In fact, MLR analysis could explain and predict 55% and 35% of the variance in the pMIC data for the compounds tested against S. aureus, and 82% and 73% of the variance in the pMIC data for the compounds tested against C. albicans, respectively.
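The check described above, that predicted-versus-experimental points fall around a line with slope near one and intercept near zero, can be sketched as a simple least-squares line fit. The function name and the five data pairs are hypothetical and are not values from Tables 3 or 4.

```python
def fit_line(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical experimental and cross-validated predicted pMIC values.
experimental = [3.0, 3.5, 4.0, 4.5, 5.0]
predicted = [3.1, 3.4, 4.0, 4.6, 4.9]
slope, intercept = fit_line(experimental, predicted)
print(round(slope, 3), round(intercept, 3))
```

A slope well below one together with a positive intercept would indicate that a model compresses the activity range, over-predicting weak compounds and under-predicting potent ones.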

Figure 4. Plots of the cross-validated predicted activity against the experimental activity for the QSAR models obtained by different chemometrics methods (against C. albicans).

Conclusions
Quantitative relationships between the molecular structure and inhibitory activity of a series of 3-hydroxypyridine-4-one and 3-hydroxypyran-4-one derivatives were established by a collection of chemometric methods including GA-PLS, FA-MLR and PCRA. The results revealed the significant role of topological parameters in the antimicrobial activity of the studied compounds against S. aureus and C. albicans. A comparison of the statistical methods employed indicated that GA-PLS gave superior results: it could explain and predict 96% and 91% of the variance in the pMIC data for the compounds tested against S. aureus, and 91% and 87% of the variance in the pMIC data for the compounds tested against C. albicans, respectively. The plot of the GA-PLS results showed the lowest scatter, and topological descriptors had the greatest impact.