Inductive QSAR Descriptors. Distinguishing Compounds with Antibacterial Activity by Artificial Neural Networks

Abstract: On the basis of the previous models of inductive and steric effects, 'inductive' electronegativity and molecular capacitance, a range of new 'inductive' QSAR descriptors has been derived. These molecular parameters are easily accessible from the electronegativities and covalent radii of the constituent atoms and from interatomic distances, and can reflect a variety of aspects of intra- and intermolecular interactions. Using 34 'inductive' QSAR descriptors alone, we have been able to achieve 93% correct separation of compounds with and without antibacterial activity (in a set of 657). The elaborated QSAR model, based on the Artificial Neural Networks approach, has been extensively validated and has confidently assigned antibacterial character to a number of trial antibiotics from the literature.

Keywords: QSAR, antibiotics, descriptors, substituent effect, electronegativity.


Introduction.
Nowadays, rational drug design efforts rely widely on building extensive QSAR models, which currently represent a substantial part of modern 'in silico' research. Because the fundamental laws of chemistry and physics cannot directly quantify the biological activities of compounds, computational chemists are led to search for simplified but efficient ways of dealing with the phenomenon, such as by means of molecular descriptors [1]. QSAR descriptors came into particular demand during the last decades, when the amount of chemical information started to grow explosively. Scientists now routinely work with collections of hundreds of thousands of molecular structures, which cannot be processed efficiently without the use of diverse sets of QSAR parameters. Modern QSAR science uses a broad range of atomic and molecular properties, varying from purely empirical to quantum-chemical. The most commonly used QSAR arsenals can include hundreds and even thousands of descriptors readily computable for extensive molecular datasets. Such a variety of available descriptors, in combination with numerous powerful statistical and machine learning techniques, allows the creation of effective and sophisticated structure-bioactivity relationships [1][2][3]. Nevertheless, although even the most advanced QSAR models can be great predictive instruments, they often remain purely formal and do not allow interpretation of the individual factors influencing the activity of drugs [3]. Many molecular descriptors (in particular those derived from molecular topology alone) lack a defined physical justification. The creation of efficient QSAR descriptors that also possess a well-defined physical meaning remains one of the most important tasks for QSAR research.
In a series of previous works we introduced a number of reactivity indices derived from the Linearity of Free Energy Relationships (LFER) principle [4]. All of these atomic and group parameters can be easily calculated from the fundamental properties of bound atoms and possess a well-defined physical meaning [5][6][7][8]. It should be noted that, historically, the entire field of QSAR originated from such LFER descriptors as inductive, resonance and steric substituent constants [4]. As the area progressed further, the substituent parameters remained recognized and popular quantitative descriptors that make a lot of intuitive chemical sense, but their applicability to actual QSAR studies was limited [9]. To overcome this obstacle, we utilized the extensive experimental sets of inductive and steric substituent constants to build predictive models for inductive and steric effects [5]. The developed mathematical apparatus not only allowed quantification of inductive and steric interactions between any substituent and reaction centre, but also led to a number of important equations, such as those for partial atomic charges [8], analogues of chemical hardness-softness [7] and electronegativity [6].
Notably, all of these parameters (also known as 'inductive' reactivity indices) have been expressed through very basic and readily accessible parameters of bound atoms: their electronegativities (χ), covalent radii (R) and intramolecular distances (r). Thus, the steric (R_S) and inductive (σ*) influence of an n-atomic group G on a single atom j can be calculated by equations (1) and (2), respectively. In those cases when the inductive and steric interactions occur between a given atom j and the rest of an N-atomic molecule (acting as a sub-substituent), the summation in (1) and (2) should be taken over N-1 terms. Thus, the group electronegativity of the (N-1)-atomic substituent around atom j has been expressed by equation (3). Similarly, we have defined the steric and inductive effects of a single atom on a group of atoms (the rest of the molecule) by equations (4) and (5). In the works [7,8] an iterative procedure for calculating the partial charge on the j-th atom in a molecule has been developed (equation (6), where Q_j reflects the formal charge of atom j).
Initially, the parameter χ in (6) corresponds to χ⁰, the absolute, unchanged electronegativity of an atom; as the iterative calculation progresses, the equalized electronegativity χ' gets updated according to (7), where the local chemical hardness η⁰ reflects the "resistance" of electronegativity to a change of the atomic charge. The parameters of 'inductive' hardness η_i and softness s_i of a bound atom i have been elaborated as equations (8) and (9), respectively, and the corresponding group parameters have been expressed as equations (10) and (11). The interpretation of the physical meaning of the 'inductive' indices has been developed by considering a neutral molecule as an electrical capacitor formed by charged atomic spheres [8]. This approximation relates the inductive chemical softness and hardness of bound atom(s) to the total area of the facings of the electrical capacitor formed by the atom(s) and the rest of the molecule.
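To illustrate how indices of this kind reduce to sums over atom pairs, the following minimal Python sketch computes an atomic softness and hardness from covalent radii and Cartesian coordinates. The functional form used here, a sum of (R_i² + R_j²)/r_ij² terms with a prefactor of 2, is an assumption modelled on the capacitor picture described above and is not a transcription of equations (8) and (9); the exact expressions should be taken from [7,8]. The function names, the prefactor and the toy diatomic geometry are hypothetical.

```python
import math

def inductive_softness(i, radii, coords, prefactor=2.0):
    """Assumed form: s_i = prefactor * sum over j != i of (R_i^2 + R_j^2) / r_ij^2."""
    total = 0.0
    for j in range(len(radii)):
        if j == i:
            continue
        r_ij = math.dist(coords[i], coords[j])  # interatomic distance
        total += (radii[i] ** 2 + radii[j] ** 2) / r_ij ** 2
    return prefactor * total

def inductive_hardness(i, radii, coords, prefactor=2.0):
    """Hardness taken as the reciprocal of softness (assumption of this sketch)."""
    return 1.0 / inductive_softness(i, radii, coords, prefactor)

# Toy diatomic: two atoms of unit covalent radius placed two length units apart.
radii = [1.0, 1.0]
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
s0 = inductive_softness(0, radii, coords)
h0 = inductive_hardness(0, radii, coords)
```

Only radii and distances enter the calculation, which is the point made in the text: once a 3D structure is available, such indices require no further parameterization.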
We have also conducted a very extensive validation of the 'inductive' indices on experimental data. Thus, it has been established that the R_S steric parameters calculated for common organic substituents correlate closely with Taft's empirical E_S steric constants (r² = 0.985) [10]. The theoretical inductive σ* constants calculated for 427 substituents correlated with the corresponding experimental values with a coefficient r = 0.990 [5]. The group inductive parameters χ computed by method (3) agreed with a number of known electronegativity scales [6]. The inductive charges produced by the iterative procedure (6) have been verified against experimental C-1s Electron Core Binding Energies [8] and dipole moments [6]. A variety of other reactivity and physical-chemical properties of organic, organometallic and free radical substances has been quantified within equations (1)-(11) [11][12][13][14][15][16]. It should be noted, however, that in our previous studies we have always considered the different classes of 'inductive' indices (substituent constants, charges or electronegativity) in separate contexts and tended to use the canonical LFER methodology of correlation analysis in dealing with the experimental data. At the same time, a rather broad range of methods for computing 'inductive' indices has been developed to date, and it is feasible to use these approaches to derive a new class of QSAR descriptors. In the present work we introduce 50 such QSAR descriptors (which we call 'inductive') and test their applicability for building a QSAR model of "antibiotic-likeness".

Results
QSAR models for drug-likeness in general, and for antibiotic-likeness in particular, are emerging topics of 'in silico' chemical research. These binary classifiers serve as invaluable tools for automated pre-virtual screening, combinatorial library design and data mining. A variety of QSAR descriptors and techniques has been applied to the drug/non-drug classification problem. The latest series of QSAR works report effective separation of bioactive substances from non-active chemicals by applying the methods of Support Vector Machines (SVM) [17,18], probability-based classification [19], Artificial Neural Networks (ANN) [20][21][22] and Bayesian Neural Networks (BNN) [23,24], among others. Several groups have used datasets of antibacterial compounds to build binary classifiers of general antibacterial activity (antibiotic-likeness models) utilizing the ANN algorithm [25][26][27], linear discriminant analysis (LDA) [28,29], binary logistic regression [29] or the k-means cluster method [30]. Similarly, in the study [31] LDA has been used to relate the anti-malarial activity of a series of chemical compounds to molecular connectivity QSAR indices. These results clearly demonstrate that the creation of QSAR approaches for classifying molecules active against a broad range of infective agents represents an important and valuable task for modern QSAR research.

Dataset
To investigate the possibility of using the inductive QSAR descriptors for creating an effective model of antibiotic-likeness, we have considered the dataset of Vert and co-authors [27], containing a total of 657 structurally heterogeneous compounds, including 249 antibiotics and 408 general drugs.
This dataset has been used in previous studies [27,29] and therefore allows us to comparatively evaluate the performance of the QSAR model built upon the inductive descriptors.

Descriptors
The 50 inductive QSAR descriptors introduced on the basis of formulas (1)-(11) are described in greater detail in Table 1. They include various local parameters calculated for certain kinds of bound atoms (for instance, for the most positively or negatively charged atoms), for groups of atoms (say, for the substituent with the largest or smallest inductive or steric effect within a molecule) or for the entire molecule. One common feature of all of the introduced inductive descriptors is that each produces a single value per compound. Another similarity between them is their relation to atomic electronegativity, covalent radii and interatomic distances. It should also be noted that all descriptors (except the total formal charge) depend on the actual spatial structure of the molecules. The choice of the particular inductive descriptors in Table 1 was driven by our expectation to have a limited set of QSAR parameters reflecting the greatest variety of aspects of the intra- and intermolecular interactions a molecule can be engaged in. It should be mentioned, however, that some inductive descriptors may reflect related or similar molecular/atomic properties and can therefore be correlated in certain cases (even though the analytical representation of those descriptors does not directly imply their collinearity). Thus, special precaution should be taken when using such parameters for QSAR modeling. The procedure for selecting appropriate inductive descriptors is outlined in the following section.

Table 1. Inductive QSAR descriptors introduced on the basis of equations (1)-(11).

| Descriptor | Characterization | Parental formula(s) |
|---|---|---|
| EO_Equalized ᵃ | Iteratively equalized electronegativity of a molecule | Calculated iteratively by (7), where charges get updated according to (6); the atomic hardness in (7) is expressed through (8) |
| Average_EO_Pos ᵃ | Arithmetic mean of electronegativities of atoms with positive partial charge | where n+ is the number of atoms i in a molecule with positive partial charge |
| Average_EO_Neg ᵃ | Arithmetic mean of electronegativities of atoms with negative partial charge | where n− is the number of atoms i in a molecule with negative partial charge |
| η (hardness)-based | | |
| Global_Hardness ᵃ | Molecular hardness, the reciprocal of the softness of a molecule | (10) |
| | Sum of hardnesses of atoms of a molecule | Calculated as a sum of inverse atomic softnesses, in turn computed within (9) |
| | Sum of hardnesses of atoms with positive partial charge | Obtained by summing up the contributions from atoms with positive charge computed by (8) |
| | Sum of hardnesses of atoms with negative partial charge | Obtained by summing up the contributions from atoms with negative charge computed by (8) |
| Average_Hardness ᵃ | Arithmetic mean of hardnesses of all atoms of a molecule | Estimated by dividing quantity (10) by the number of atoms in a molecule |
| Average_Pos_Hardness | Arithmetic mean of hardnesses of atoms with positive partial charge | where n+ is the number of atoms i with positive partial charge |
| Average_Neg_Hardness ᵃ | Arithmetic mean of hardnesses of atoms with negative partial charge | where n− is the number of atoms i with negative partial charge |
| Smallest_Neg_Hardness ᵃ | Smallest atomic hardness among values for negatively charged atoms | (8) |
| Hardness_of_Most_Pos | Atomic hardness of the atom with the most positive charge | (8) |
| Hardness_of_Most_Neg ᵃ | Atomic hardness of the atom with the most negative charge | (8) |
| | Sum of softnesses of atoms with positive partial charge | Obtained by summing up the contributions from atoms with positive charge computed by (9) |
| | Sum of softnesses of atoms with negative partial charge | Obtained by summing up the contributions from atoms with negative charge computed by (9) |
| Average_Softness | Arithmetic mean of softnesses of all atoms of a molecule | (11) divided by the number of atoms in the molecule |
| | Arithmetic mean of softnesses of atoms with positive partial charge | where n+ is the number of atoms i with positive partial charge |
| | Arithmetic mean of softnesses of atoms with negative partial charge | where n− is the number of atoms i with negative partial charge |
| Smallest_Neg_Softness ᵃ | Smallest atomic softness among values for negatively charged atoms | |
| Softness_of_Most_Pos ᵃ | Atomic softness of the atom with the most positive charge | (9) |
| Softness_of_Most_Neg ᵃ | Atomic softness of the atom with the most negative charge | (9) |
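Several of the descriptors listed in Table 1 are simple aggregates of per-atom quantities taken over the positively or negatively charged atoms of a molecule. The sketch below shows how such aggregates could be assembled once per-atom charges, electronegativities and hardnesses are available. It is an illustration only: the per-atom input values are made-up numbers rather than results of equations (6)-(8), and the helper names are hypothetical.

```python
def mean_of(values):
    """Arithmetic mean, or NaN when no atom qualifies."""
    return sum(values) / len(values) if values else float("nan")

def aggregate_descriptors(charges, electronegativities, hardnesses):
    """Collapse per-atom values into one value per molecule, as in Table 1."""
    pos = [i for i, q in enumerate(charges) if q > 0]
    neg = [i for i, q in enumerate(charges) if q < 0]
    most_pos = max(range(len(charges)), key=lambda i: charges[i])
    most_neg = min(range(len(charges)), key=lambda i: charges[i])
    neg_hardness = [hardnesses[i] for i in neg]
    return {
        "Average_EO_Pos": mean_of([electronegativities[i] for i in pos]),
        "Average_EO_Neg": mean_of([electronegativities[i] for i in neg]),
        "Average_Pos_Hardness": mean_of([hardnesses[i] for i in pos]),
        "Average_Neg_Hardness": mean_of(neg_hardness),
        "Smallest_Neg_Hardness": min(neg_hardness) if neg_hardness else float("nan"),
        "Hardness_of_Most_Pos": hardnesses[most_pos],
        "Hardness_of_Most_Neg": hardnesses[most_neg],
    }

# Made-up per-atom values for a four-atom toy molecule (illustration only).
charges = [0.3, -0.2, -0.1, 0.0]
electronegativities = [2.2, 3.4, 3.0, 2.5]
hardnesses = [0.5, 0.4, 0.6, 0.7]
descriptors = aggregate_descriptors(charges, electronegativities, hardnesses)
```

Each entry of the resulting dictionary is a single number per compound, which is the common feature of the inductive descriptors noted in the text.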

Selection of variables
To build a binary QSAR model enabling effective separation of antibacterials, we initially calculated all 50 individual inductive descriptors for each molecule from Vert's dataset. We used the hydrogen-suppressed representation of the molecular structures, i.e. only the heavy atoms were taken into account. The inductive QSAR descriptors were calculated within the MOE package [32] from the values of atomic electronegativities and radii taken from our previous publications [5]. To avoid the cross-correlation among the independent variables mentioned above, we computed pairwise regressions between all 50 sets of QSAR parameters and removed those inductive descriptors that formed any linear dependence with R ≥ 0.9. As a result of this procedure, only 34 inductive QSAR descriptors were selected for further processing (see the legend to Table 1). The average values of these 34 parameters, calculated independently for the antibacterial and non-antibacterial compounds, are plotted in Figure 1. As can be seen, the corresponding curves for the two classes of compounds are clearly separated on the graph and, hence, the selected 34 inductive descriptors should allow an effective QSAR model of "antibiotic-likeness" to be built.
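The correlation filter described above can be sketched as follows. The text does not state which member of a correlated pair was discarded, so the greedy rule used here (keep a descriptor only if its absolute Pearson coefficient with every descriptor kept so far is below the threshold) is an assumption, as are the function names and the toy data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear dependence
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(columns, threshold=0.9):
    """Greedy filter: keep a descriptor if |R| < threshold against all kept ones."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy descriptor table: d2 is almost a multiple of d1, d3 is only weakly related.
toy = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = drop_correlated(toy)
```

With this rule the outcome depends on the order in which descriptors are visited; a different tie-breaking choice could retain a different subset of the same size or smaller.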

QSAR model
In order to relate the inductive descriptors to the antibiotic activity of the studied molecules, we have employed the Artificial Neural Networks (ANN) method, one of the most effective pattern recognition techniques. During the last decades machine-learning approaches have become an essential part of QSAR research; a detailed description of the fundamentals of ANNs can be found in numerous sources (see, for example, [33]).
In our study we have used the standard back-propagation ANN configuration consisting of 34 input nodes and 1 output node. The number of nodes in the hidden layer was varied from 2 to 14 in order to find the optimal network allowing the most accurate separation of antibacterials from other compounds in the training sets. For effective training of the ANN (to avoid over-fitting), we have used training sets of 592 compounds (including 197 antibiotics), randomly derived as 90 percent of the total of 657 molecules. In each training run the remaining 10 percent of the compounds were used as the testing set to assess the predictive ability of the model. It should be noted that the condition of non-correlation amongst the descriptors has been monitored within the training and the testing sets of compounds as well.
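A minimal sketch of this setup, a random 90/10 split followed by back-propagation training of a network with one hidden layer and a single sigmoid output, is given below. Only the overall architecture and the split ratio come from the text; the squared-error loss, learning rate, weight initialization, number of epochs, class and function names, and the two-input toy data are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def random_split(n_items, test_fraction=0.1, rng=None):
    """Return (train_idx, test_idx) for a random hold-out split."""
    rng = rng or random.Random()
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_test = int(n_items * test_fraction)
    return indices[n_test:], indices[:n_test]

class TinyANN:
    """n_in -> n_hidden -> 1 sigmoid network trained by plain backpropagation."""

    def __init__(self, n_in, n_hidden, rng):
        # Each weight row stores its bias as the last element.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w1]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.w2[-1])
        return hidden, out

    def train_step(self, x, target, lr):
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)  # squared-error gradient
        for k, h in enumerate(hidden):
            # Hidden delta uses the output weight before it is updated.
            delta_h = delta_out * self.w2[k] * h * (1.0 - h)
            for i, v in enumerate(x):
                self.w1[k][i] -= lr * delta_h * v
            self.w1[k][-1] -= lr * delta_h
            self.w2[k] -= lr * delta_out * h
        self.w2[-1] -= lr * delta_out

def mean_squared_error(net, xs, ys):
    return sum((net.forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy demonstration on two made-up inputs; the real model uses 34 descriptors.
rng = random.Random(0)
xs = [[rng.random(), rng.random()] for _ in range(60)]
ys = [1.0 if x[0] > x[1] else 0.0 for x in xs]
train_idx, test_idx = random_split(len(xs), 0.1, rng)
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
net = TinyANN(n_in=2, n_hidden=3, rng=rng)
mse_before = mean_squared_error(net, train_x, train_y)
for _ in range(300):
    for x, y in zip(train_x, train_y):
        net.train_step(x, y, lr=0.5)
mse_after = mean_squared_error(net, train_x, train_y)
```

Repeating the split and training with different random seeds, and with different values of `n_hidden`, would mirror the repeated runs over hidden-layer sizes described in the text. Applied to 657 items, `random_split` yields 592 training and 65 testing indices.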
During the learning phase, a value of 1 has been assigned to the training set molecules possessing antibacterial activity and a value of 0 to the others. For each configuration of the ANN (with 2, 3, 4, 6, 8, 10, 12 and 14 hidden nodes, respectively) we have conducted 20 independent training runs to evaluate the average predictive power of the network. Table 2 contains the resulting values of specificity, sensitivity and accuracy of separation of antibacterial and non-antibacterial compounds in the testing sets. The corresponding counts of false/true positive and negative predictions have been estimated using cut-off values of 0.4 and 0.6 for non-antibacterials and antibacterials, respectively. Thus, an antibiotic compound from the testing set has been considered correctly classified by the ANN only when its output value ranged from 0.6 to 1.0. For each non-antibiotic entry of the testing set, correct classification has been assumed if the corresponding ANN output lay between 0 and 0.4. Consequently, all network output values ranging from 0.4 to 0.6 have been considered incorrect predictions (rather than undetermined or non-defined). Considering that one of the most important implications of the "antibiotic-likeness" model is its potential use for identifying novel antibiotic candidates in electronic databases, we have also calculated the Positive Predictive Values (PPV) for the networks while varying the number of hidden nodes. Taking into account the PPV values along with the corresponding values of sensitivity, specificity and general accuracy, we have selected the neural network with three hidden nodes as the most efficient among those studied. The ANN with 34 input, 3 hidden and 1 output nodes has allowed the recognition of 93% of antibiotic and 93% of non-antibiotic compounds, on average.

The output from this 34-3-1 network has also demonstrated very good separation into positive (antibiotics) and negative (non-antibiotics) predictions. Figure 2 features the frequencies of the output values for the training and testing sets consisting of ⅓ antibiotic and ⅔ non-antibiotic compounds. As can readily be seen from the graph, the vast majority of the predictions fall within the [0.0, 0.4] and [0.6, 1.0] ranges, which also illustrates that the 0.4 and 0.6 cut-off values provide a very adequate separation of the two bioactivity classes (Tables 3 and 4 feature the output values from the 34-3-1 ANN for the training and testing sets, respectively). It should be mentioned that the estimated 93% accuracy of prediction by the 34-3-1 ANN is similar or superior to the results of several similar 'antibiotic-likeness' studies, where the overall cross-validated accuracy ranges from 78% [20] to 98% [26] depending on the QSAR methodology, the size of the antibiotics/non-antibiotics dataset, the cross-correlation technique and the statistics utilized.
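The scoring scheme with two cut-offs can be written compactly as below. The sketch counts an antibiotic as a true positive only when its output is at least 0.6 and a non-antibiotic as a true negative only when its output is at most 0.4; every other output, including those in the 0.4 to 0.6 zone, is booked as an error for the compound's own class. This bookkeeping for the intermediate zone is an interpretation of the description above, and the function name and example values are hypothetical.

```python
def evaluate(outputs, labels, low=0.4, high=0.6):
    """Sensitivity, specificity, accuracy and PPV with two cut-off values."""
    tp = fp = tn = fn = 0
    for out, label in zip(outputs, labels):
        if label == 1:          # known antibiotic
            if out >= high:
                tp += 1
            else:
                fn += 1         # includes outputs in the 0.4-0.6 zone
        else:                   # known non-antibiotic
            if out <= low:
                tn += 1
            else:
                fp += 1         # includes outputs in the 0.4-0.6 zone

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),
    }

# Made-up network outputs for three antibiotics and five non-antibiotics.
outputs = [0.9, 0.7, 0.5, 0.1, 0.2, 0.55, 0.8, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
metrics = evaluate(outputs, labels)
```

Treating the intermediate zone as an error makes all four figures conservative compared with a single 0.5 threshold, which is consistent with the strict criterion described in the text.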
We have also applied the developed techniques to the non-hydrogen-suppressed molecular structures. The estimated accuracy of antibiotic/non-antibiotic classification was very close to the results for the hydrogen-suppressed molecules. However, the calculation of the inductive QSAR descriptors takes much longer in the former case, as the total number of atoms nearly doubles.

Discussion
The accuracy of discrimination of antibiotic compounds by the artificial neural networks built upon the 'inductive' descriptors clearly demonstrates the adequacy and good predictive power of the developed QSAR model. There is strong evidence that the introduced inductive descriptors adequately reflect the structural properties of chemicals that are relevant for their antibacterial activity. This observation is not surprising, considering that the inductive QSAR descriptors calculated within (1)-(11) should cover a very broad range of properties of bound atoms and molecules related to their size, polarizability, electronegativity, compactness, mutual inductive and steric influence, distribution of electronic density, etc. The results of the study demonstrate that a relatively small set of inductive QSAR descriptors with a well-defined physical meaning can be sufficient for creating useful models of "antibiotic-likeness". The accuracy of the developed QSAR model is superior or similar to that of other binary classifiers built on the same set of molecules but using much more extensive collections of QSAR descriptors [27,29].
Presumably, the accuracy of the approach based on the inductive descriptors can be improved even further by expanding the set of QSAR descriptors or by applying more powerful classification techniques such as Support Vector Machines or Bayesian Neural Networks. The use of purely statistical techniques in conjunction with the inductive QSAR descriptors would also be beneficial, as they would allow the contributions of individual descriptors to molecular "antibiotic-likeness" to be interpreted. The selection of drugs used for the simulation can also be extended and/or refined. For instance, it has been experimentally confirmed that several non-antibacterial compounds from Vert's dataset can, in fact, possess definite antibacterial activity. Thus, the anti-inflammatory drugs diclofenac [34,35], piroxicam, mefenamic acid and naproxen [35], the antihistamines bromodiphenhydramine [36], diphenhydramine [36] and triprolidine [37], the anti-psychotics chlorpromazine [38,39] and fluphenazine [40,41], the tranquilizer promazine [42] and the anti-hypertensive methyldopa [43] all exhibit moderate to powerful potential against microbes. Clearly, having all these compounds as negative controls can interfere with the training of an efficient antibiotic-likeness model. We, however, did not remove these substances from the training and testing sets, for the sake of comparing our results with the previous data. Nonetheless, despite these drawbacks, the developed ANN-based QSAR model based on the inductive descriptors has demonstrated very high accuracy and can be used for mining electronic collections of chemical structures for novel antibiotic candidates.

An application of the model
We have decided to test the developed model of "antibiotic-likeness" on a series of early-stage antibiotic compounds featured in the free issue of the Drug Data Report, a journal presenting preliminary drug research results appearing for the first time in the patent literature [44]. The "experimental" antibiotic compounds cited in the issue included one penicillin derivative and two cephalosporin derivatives, as well as a number of high molecular weight chemicals with complex structures, such as five C11-carbamate azalides and four eremomycin carboxamides (the corresponding structural formulas are presented in Figure 3). The output values of the neural network for these compounds are presented in Table 5. As can be seen from the data, all of the estimated output values score well above the 0.60 threshold, which confidently assigns all of the trial molecules to the class of antibiotics. These results demonstrate that the developed ANN-based binary classifier of antibacterial activity is adequate and can be considered an effective tool for 'in silico' antibiotics discovery. The results also demonstrate that the inductive parameters readily accessible by formulas (1)-(11) from atomic electronegativities, covalent radii and interatomic distances can produce a variety of useful QSAR descriptors for use in 'in silico' chemical research.

Conclusions
The results of the present work demonstrate that a variety of atomic, substituent and molecular properties that can be computed within the framework of our previous models for inductive and steric effects, inductive electronegativity and molecular capacitance represent a powerful arsenal of 3D QSAR descriptors for modern 'in silico' drug research. Using only 34 inductive descriptors with no additional independent parameters, we have achieved 93% correct classification of compounds with and without antibacterial activity. The introduced inductive descriptors possess a number of important merits: they are 3D- and stereo-sensitive, can be easily computed from fundamental properties of bound atoms and molecules, and possess a well-defined physical meaning. The developed ANN-based model for antibiotic-likeness prediction can be used as a powerful QSAR tool for filtering collections of chemical structures to discover novel antibiotic leads.

Methods
The names of the chemical compounds from the dataset of [27] have been translated into SMILES records and MOL files using the ChemIDPlus online service [45] and the MOE package [32]. The 50 inductive descriptors have been calculated using scripts written in SVL, a specialized language of the MOE package. The interatomic distances have been calculated by MOE from the molecular structures optimized with the MMFF94 force field [46]. The atomic types have been assigned according to the name, valence state and formal charge of the atoms, as defined within MOE. The parameters of the corresponding atomic electronegativities and covalent radii have been taken from our previous publications [5].

Figure 1. Averaged values of the 34 selected inductive QSAR descriptors calculated independently within the studied sets of antibiotics (dashed line) and non-antibiotics (solid line).

Figure 2. Distribution of the output values from the ANN with three nodes in the hidden layer, trained on the set containing 90% of the studied compounds.

Figure 3. Chemical structures of twelve early-stage antibiotics from the Drug Data Report used for validation of the developed ANN-based QSAR model.

Legend to Table 1: ᵃ descriptors selected for building the antibiotic-likeness QSAR model.

Table 2. Parameters of specificity, sensitivity, accuracy and positive predictive values for the prediction of antibiotic and non-antibiotic compounds by the artificial neural networks with varying numbers of hidden nodes. The cut-off values 0.4 and 0.6 have been used for negative and positive predictions, respectively.

Table 3. Compounds of the training set and output values from the trained neural network with three hidden nodes.

Table 4. Compounds of the testing set and the corresponding output values from the trained neural network with three hidden nodes.

Table 5. Output values from the neural network for the validation set's antibiotics.