Topological Models for Prediction of Pharmacokinetic Parameters of Cephalosporins Using Random Forest, Decision Tree and Moving Average Analysis

Topological indices were used to encode the structural features of cephalosporins. Both topostructural and topochemical versions of a distance-based descriptor, three adjacency-based descriptors and five distance-cum-adjacency-based descriptors were calculated. The values of 18 indices for each cephalosporin in the dataset were computed using an in-house computer program. Multiple pharmacokinetic parameters of cephalosporins were predicted using random forest, decision tree and moving average analysis. Random forest correctly classified the pharmacokinetic parameters into low and high ranges with an accuracy of up to 95%. A decision tree was constructed for each pharmacokinetic parameter to determine the importance of topological indices. The decision tree learned the information from the input data with an accuracy of 95% and correctly predicted the cross-validated (10-fold) data with an accuracy of up to 90%. Three independent moving average based topological models were developed using a single range for simultaneous prediction of multiple pharmacokinetic parameters. The accuracy of classification of single index based models using moving average analysis varied from 65% to 100%.

Keywords: Topological indices • Random forest • Decision tree • Moving average analysis • Pharmacokinetic parameters • Cephalosporins


Introduction
The pharmaceutical industry needs to develop new medicinal drugs continuously in order to fight the development of resistance in pathogenic agents and to cope with newly discovered types of infections [1]. Since ADME (absorption, distribution, metabolism and elimination) properties are important parameters in lead identification, in silico methods to search for drug candidates with good ADME properties have attracted the pharmaceutical industry [2][3][4].
Various quantitative structure-activity relationship (QSAR) approaches have been applied to find relationships between ADME parameters and molecular structure and properties. The polarizability and transition state energy of a cephalosporin were used to predict its permeability through the outer membrane and the reactivity of the β-lactam ring with penicillin-binding proteins. The activity exhibited a quadratic dependence on the variables [5]. In another QSAR study, lipophilicity and electronic and hydrogen bonding parameters were used as molecular descriptors. It was found that polar-polar interactions of hydrophilic penicillins and cephalosporins could be explained on the basis of hydrogen bonding properties [6]. Turner et al. [7] predicted multiple pharmacokinetic parameters for a series of cephalosporins using an artificial neural network. Further, artificial neural networks (ANNs) were used for the prediction of clearances, fraction bound to plasma proteins, and volume of distribution of a series of structurally diverse compounds. Simple methods for determining the human pharmacokinetics of known and drug-like compounds are of interest to the pharmaceutical industry [8].
A genetic algorithm combined with partial least squares was used for modeling ADME properties of structurally diverse compounds. Many ADME properties could be well explained by simple molecular descriptors derived from the 2-dimensional chemical structure [9].
The aim of the present study was to develop simple models for the prediction of multiple pharmacokinetic parameters using topological descriptors obtained from the 2-dimensional chemical structure. The predictability of the proposed models using random forest, decision tree and moving average analysis has been compared. Finally, single index range models derived from moving average analysis have also been proposed for the simultaneous classification of multiple pharmacokinetic parameters into low and high values.

Dataset
Turner et al. [7] compiled various pharmacokinetic parameters of cephalosporins, namely t1/2, CL, CLR, fe, V and fb. The half-life was reported quantitatively as t1/2 (h). For the present study, cephalosporins were considered to exhibit low t1/2, labeled "A" (N=13), if the t1/2 value was < 2.0 h, and high t1/2, labeled "B" (N=7), if the t1/2 value was 2.0 h or more. Similarly, the clearance was reported quantitatively as CL (mL·min⁻¹·kg⁻¹), the renal clearance as CLR (mL·min⁻¹·kg⁻¹), and the volume of distribution at steady state as V (L/kg). The fraction excreted unchanged in the urine was reported quantitatively as fe, and the fraction bound to plasma proteins as fb. The cephalosporins were considered to exhibit low CL, labeled "A" (N=5), if CL < 1.0 mL·min⁻¹·kg⁻¹, and high CL, labeled "B" (N=15), if CL ≥ 1.0 mL·min⁻¹·kg⁻¹. They were considered to exhibit low CLR, labeled "A" (N=7), if CLR < 1.0 mL·min⁻¹·kg⁻¹, and high CLR, labeled "B" (N=13), if CLR ≥ 1.0 mL·min⁻¹·kg⁻¹. Cephalosporins were also considered to exhibit low fe, labeled "A" (N=8), if fe < 0.7, and high fe, labeled "B" (N=12), if fe ≥ 0.7. They were considered to exhibit low V, labeled "A" (N=8), if V < 0.2 L/kg, and high V, labeled "B" (N=12), if V ≥ 0.2 L/kg. Finally, the cephalosporins were considered to exhibit low fb, labeled "A" (N=14), if fb < 0.8, and high fb, labeled "B" (N=6), if fb ≥ 0.8.
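The labeling rules above can be written compactly as a lookup of cut-off values. The following is a minimal sketch in Python; the cut-offs are those stated in this section, whereas the names THRESHOLDS, label and example_drug, and the numerical values of the example compound, are hypothetical and serve only as illustration.

```python
# Cut-off values from the Dataset section: a value below the cut-off
# is labeled "A" (low), otherwise "B" (high).
THRESHOLDS = {
    "t1/2": 2.0,  # h
    "CL": 1.0,    # mL/min/kg
    "CLR": 1.0,   # mL/min/kg
    "fe": 0.7,    # fraction excreted unchanged in urine
    "V": 0.2,     # L/kg
    "fb": 0.8,    # fraction bound to plasma proteins
}


def label(parameter, value):
    """Return "A" for a low value and "B" for a high value of the parameter."""
    return "A" if value < THRESHOLDS[parameter] else "B"


# Hypothetical compound (illustrative values, not taken from the dataset).
example_drug = {"t1/2": 1.4, "CL": 2.3, "CLR": 1.0, "fe": 0.65, "V": 0.25, "fb": 0.9}
labels = {name: label(name, value) for name, value in example_drug.items()}
print(labels)
```

A value exactly equal to the cut-off falls into the high class "B", in line with the "≥" conditions given above.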

Decision tree
A single decision tree [28] was grown for each property to identify the importance of topological indices. In a decision tree, the molecules at each parent node are classified, based on the index value, into two child nodes. The prediction for a molecule reaching a given terminal node is obtained by majority vote of the training set molecules reaching the same terminal node. The tree giving the lowest cross-validation error is selected as the optimal tree. In this study, the R program (version 2.1.0) along with the RPART library was used to grow the decision trees.
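The splitting of a parent node into two child nodes, followed by a majority vote, can be illustrated with a single split on one index. The sketch below is written in Python rather than R and is not the RPART implementation; it assumes Gini impurity as the splitting criterion, and the index values and class labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Class assigned to a terminal node by majority vote."""
    return Counter(labels).most_common(1)[0][0]


def best_split(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one index."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical index values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for val, lab in pairs if val < threshold]
        right = [lab for val, lab in pairs if val >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical index values and classes ("A" = low, "B" = high).
index_values = [1.0, 1.5, 2.0, 2.5, 6.0, 6.5, 7.0, 8.0]
classes = ["A", "A", "A", "A", "B", "B", "B", "A"]

threshold, score = best_split(index_values, classes)
left_vote = majority([c for v, c in zip(index_values, classes) if v < threshold])
right_vote = majority([c for v, c in zip(index_values, classes) if v >= threshold])
print(threshold, score, left_vote, right_vote)
```

A full tree repeats this search within each child node and over all 18 indices; selecting the tree size by cross-validation error is not shown here.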

Moving average analysis
To construct a single topological index based model for predicting property/activity based ranges, moving average analysis of correctly predicted compounds was used [20]. According to this method, the minimum size of a range is based on a moving average of 65% of the correctly predicted compounds. However, if the moving average percentage of correct predictions lies within 50±15%, the range is classified as transitional. The characteristic property assigned to each drug was compared with the reported property.
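The two percentages quoted above can be made concrete with a short sketch. It assumes that the compounds have already been sorted by index value and that each compound carries a flag (1 or 0) stating whether the tentatively assigned class matches the reported one; the window size, the flags and the function names are hypothetical, and treating exactly 65% as sufficient for a range is likewise an assumption.

```python
def moving_average(flags, window):
    """Percentage of correct predictions in each sliding window of compounds."""
    return [
        100.0 * sum(flags[i:i + window]) / window
        for i in range(len(flags) - window + 1)
    ]


def range_status(percent_correct):
    """Classify a window by its moving average percentage of correct predictions."""
    if percent_correct >= 65.0:
        return "range"              # enough correct predictions to define a range
    if percent_correct >= 35.0:
        return "transitional"       # within 50 +/- 15 %
    return "below transitional band"


# Hypothetical correctness flags for ten compounds sorted by index value.
flags = [1, 1, 1, 0, 1, 0, 0, 1, 1, 1]
averages = moving_average(flags, window=4)
statuses = [range_status(p) for p in averages]
print(list(zip(averages, statuses)))
```

In this toy series the first two and the last window qualify as ranges, while the windows in between are transitional or fall below the transitional band.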

Results and Discussion
The random forests were grown with 18 topological descriptors. The importance of a node was determined by the mean decrease in accuracy, and the purity of a node was determined by the mean decrease in Gini. The precision and sensitivity of classification were also determined. Precision is a measure of accuracy, provided that a specific class has been predicted. Sensitivity is the ability of a prediction model to select instances of a certain class from a dataset. The RF classified the t1/2 of cephalosporins with an accuracy of 85%, and the out-of-bag (OOB) estimate of error was 15%. The precision and sensitivity for low t1/2 were of the order of 92% and 85%, whereas those for high t1/2 were of the order of 75% and 86%, respectively. A1, molecular connectivity topochemical index, and A12, augmented eccentric connectivity index, were identified as the most important descriptors. The RF classified the CL of cephalosporins with an accuracy of 90%, and the OOB estimate of error was 10%. The precision and sensitivity for low CL were of the order of 80% and 80%, whereas those for high CL were of the order of 93% and 93%, respectively. A8, Zagreb topochemical index, M
a The predictions from the decision tree were obtained by tenfold cross-validation.
was of the order of 64% and 75% respectively. A11, eccentric adjacency index, and A13, superadjacency index, were identified as the most important descriptors. The RF classified the cephalosporins with regard to V with an accuracy of 90%, and the out-of-bag estimate of error was only 10%. The precision and sensitivity for low V were of the order of 100% and 75%, whereas those for high V were of the order of 86% and 100%, respectively. A5, eccentric connectivity topochemical index, A9, Wiener's topochemical index, and A14, eccentric connectivity index, were identified as the most important descriptors. The RF classified the cephalosporins with regard to fb with an accuracy of 95%, and the out-of-bag estimate of error was only 5%. The precision and sensitivity for low fb were of the order of 93% and 100%, whereas those for high fb were of the order of 100% and 83%, respectively. A5, eccentric connectivity topochemical index, A7, Zagreb topochemical index, M1c, and A8, Zagreb topochemical index, M2c, were identified as the most important descriptors. The accuracy of prediction of multiple pharmacokinetic parameters using RF was found to be up to 95% (Tab. 2).
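The precision, sensitivity and accuracy figures quoted for the random forest follow from a two-class confusion matrix. The sketch below shows the arithmetic for t1/2; the counts in cm_t_half are not copied from Tab. 2 but are reconstructed here, as an assumption, so that they agree with the reported class sizes (N=13 low, N=7 high) and the reported percentages. The function and variable names are illustrative.

```python
def precision(cm, cls):
    """Correctly predicted members of cls divided by all compounds predicted as cls."""
    predicted_as_cls = sum(cm[actual][cls] for actual in cm)
    return cm[cls][cls] / predicted_as_cls


def sensitivity(cm, cls):
    """Correctly predicted members of cls divided by all actual members of cls."""
    return cm[cls][cls] / sum(cm[cls].values())


def accuracy(cm):
    """Correct predictions divided by the total number of compounds."""
    correct = sum(cm[cls][cls] for cls in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


# cm[actual][predicted]; counts reconstructed (assumption) for t1/2:
# 13 actual "A" (low) and 7 actual "B" (high) cephalosporins.
cm_t_half = {
    "A": {"A": 11, "B": 2},
    "B": {"A": 1, "B": 6},
}

for cls in ("A", "B"):
    print(cls, round(100 * precision(cm_t_half, cls)), round(100 * sensitivity(cm_t_half, cls)))
print("accuracy", round(100 * accuracy(cm_t_half)))
```

With these counts the sketch reproduces 92% and 85% for low t1/2, 75% and 86% for high t1/2, and an overall accuracy of 85%.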
The decision tree was built from a set of 18 topological indices. The index at the root node is the most important, and the importance of an index decreases as the length of the tree increases. The classification of t1/2 using a single tree based on A1, molecular connectivity topochemical index, and A2, eccentric adjacency topochemical index, is shown in Fig. 1. The decision tree identified molecular connectivity topochemical index (A1) as the most important index. The decision tree classified the cephalosporins in the training set with an accuracy of 95%, and in 10-fold cross-validation, 70% of the cephalosporins were correctly classified with regard to t1/2. In cross-validation, the precision and sensitivity for low t1/2 were of the order of 89% and 62%, whereas those for high t1/2 were of the order of 55% and 86%, respectively (Tab. 2). The classification of CL using a decision tree based on A5, eccentric connectivity topochemical index, is shown in Fig. 1. The tree correctly classified cephalosporins in the training set with an accuracy of 95%. In 10-fold cross-validation, 85% of the cephalosporins were correctly classified with regard to CL. In cross-validation, the precision and sensitivity for low CL were 67% and 80%, whereas those for high CL were 93% and 87%, respectively (Tab. 2). The classification of CLR using a single tree based on A5, eccentric connectivity topochemical index, is shown in Fig. 1. The tree correctly classified cephalosporins in the training set with an accuracy of 95%. In 10-fold cross-validation, 75% of the cephalosporins were correctly classified with regard to CLR. In cross-validation, the precision and sensitivity for low CLR were 63% and 71%, whereas those for high CLR were 83% and 77%, respectively (Tab. 2). The classification of fe using A11, eccentric adjacency index, and A5, eccentric connectivity topochemical index, is shown in Fig. 1. According to the decision tree, eccentric adjacency index (A11) was the most important index. The tree classified the cephalosporins in the training set with an accuracy of 90%. In 10-fold cross-validation, 60% of the cephalosporins were classified correctly with regard to fe. In cross-validation, the precision and sensitivity for low fe were 50% and 38%, whereas those for high fe were 64% and 75%, respectively (Tab. 2). The classification of V using a decision tree based on A5, eccentric connectivity topochemical index, is shown in Fig. 1. The tree correctly classified cephalosporins in the training set with an accuracy of 100%. In 10-fold cross-validation, 85% of the cephalosporins were correctly classified with regard to V. In cross-validation, the precision and sensitivity for low V were 86% and 75%, whereas those for high V were 85% and 92%, respectively (Tab. 2). The classification of fb using a single tree based on A5, eccentric connectivity topochemical index, is shown in Fig. 1. The tree classified the cephalosporins in the training set with an accuracy of 100%. In 10-fold cross-validation, 90% of the cephalosporins were classified correctly with regard to fb. In cross-validation, the precision and sensitivity for low fb were 93% and 93%, whereas those for high fb were 83% and 83%, respectively (Tab. 2). The decision tree learned the information from the input data with an accuracy of more than 95% and predicted the cross-validated (10-fold) data with an accuracy of up to 90%.
The results obtained using a single tree agree in principle with those obtained using random forest. The strength of random forest lies in its out-of-bag estimate of error. Since a decision tree is easy to interpret and can be visualized, the importance of descriptors was taken from the decision trees. The variables selected by the tree can differ from those selected by random forest because decision tree results are based on a single tree, while random forest results are an average over many trees. A single decision tree sometimes assigns the same importance to more than one descriptor and selects one of them at random, whereas random forest assigns importance based on the average of all the individual trees. The property based ranges were identified using moving average analysis [18]. Three independent moving average analysis based models were developed using a single index at a time. The three topological indices identified as most important by the decision trees were used to construct single index based models for simultaneous prediction of multiple pharmacokinetic parameters. The precision and sensitivity of classification for multiple pharmacokinetic parameters (t1/2, CL, CLR, fe, V and fb) using moving average analysis are summarized in Tab. 3.
Tab. 4. Prediction of multiple pharmacokinetic parameters (t1/2, CL, CLR, fe, V and fb) by moving average analysis using eccentric connectivity topochemical index (A5).
Though three independent models were developed, the classification of multiple pharmacokinetic parameters (t1/2, CL, CLR, fe, V and fb) was based on a single range of each of the topological indices A5, eccentric connectivity topochemical index, A1, molecular connectivity topochemical index, and A11, eccentric adjacency index (Tab. 4-5).
Tab. 5. Prediction of multiple pharmacokinetic parameters (t1/2, CL, CLR, fe, V and fb) by moving average analysis using molecular connectivity topochemical index (A1) and eccentric adjacency index (A11).
It is surprising that the topostructural eccentric adjacency index was identified as one of the important indices along with the topochemical indices, and also that a single range could be identified. One would not expect this to happen because topostructural indices are insensitive to topochemical isomers. Therefore, we evaluated the intercorrelation of eccentric adjacency index values with those of A5 and A1 using all possible structures with up to 5 vertices (all 29 structures varying only with respect to connectivity and not topochemical nature). A11 indeed exhibited poor correlation with A5 and A1.
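An intercorrelation check of this kind can be sketched on a few small graphs. The definitions used below are assumptions taken from the general literature on these descriptors rather than from this text: the eccentric connectivity index as the sum over vertices of degree times eccentricity, and the eccentric adjacency index as the sum over vertices of the summed degrees of the neighbours divided by the eccentricity. Only topostructural versions are computed, since the topochemical indices A5 and A1 require atom-specific weights that are not given here; the four example graphs and all function names are illustrative.

```python
from collections import deque
from math import sqrt


def eccentricities(adj):
    """Eccentricity of every vertex of a connected graph (breadth-first search)."""
    ecc = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        ecc[start] = max(dist.values())
    return ecc


def eccentric_connectivity(adj):
    """Sum over vertices of degree * eccentricity (assumed definition)."""
    ecc = eccentricities(adj)
    return sum(len(adj[v]) * ecc[v] for v in adj)


def eccentric_adjacency(adj):
    """Sum over vertices of (sum of neighbour degrees) / eccentricity (assumed
    definition); requires a connected graph with at least two vertices."""
    ecc = eccentricities(adj)
    return sum(sum(len(adj[w]) for w in adj[v]) / ecc[v] for v in adj)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


# Illustrative hydrogen-suppressed graphs as adjacency lists.
path3 = {0: [1], 1: [0, 2], 2: [1]}
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

graphs = [path3, path4, star4, cycle4]
xi_c = [eccentric_connectivity(g) for g in graphs]
xi_a = [eccentric_adjacency(g) for g in graphs]
r = pearson(xi_c, xi_a)
print(xi_c, xi_a, r)
```

Extending the list of graphs to all structures with up to 5 vertices would give the kind of comparison described above; the enumeration itself is not sketched here.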
The cephalosporins were correctly classified as exhibiting low t with an accuracy of 90%. The cephalosporins were also correctly classified as exhibiting high CL, high CLR, high fe, high V or exhibiting low CL, low CLR, low fe, low V with an accuracy of 90%, 90%, 65% and 85%, respectively. The cephalosporins were correctly classified as exhibiting low fb or exhibiting high fb with an accuracy of 95%. The cephalosporins were also correctly classified as exhibiting low t1/2 or exhibiting high t1/2 using eccentric adjacency index (A11) with an accuracy of 85%. Eccentric adjacency index (A11) classified the cephalosporins as exhibiting high CL, high CLR, high fe, high V or exhibiting low CL, low CLR, low fe, low V with an accuracy of 75%, 85%, 80% and 70%, respectively. The cephalosporins were correctly classified as exhibiting low fb or exhibiting high fb with an accuracy of 80%.

It is noteworthy that the threshold index values for classification of compounds into high or low pharmacokinetic properties using moving average analysis may appear different from those obtained using the decision tree. The apparent differences can be attributed to the fact that the topological index values identified using moving average analysis were strictly based on the index values of drugs in the dataset, whereas the ranges of index values obtained from the decision tree may refer to a drug that is not present in the dataset used to obtain the decision tree.

Conclusion
To identify important descriptors and to predict the multiple pharmacokinetic parameters of cephalosporins, RF and decision tree models were constructed. Single index

Tab. 1. Topostructural and topochemical indices
Tab. 2. Confusion matrix for multiple pharmacokinetic parameters (t1/2, CL, CLR, fe, V and fb) using the models based on random forest, decision tree and moving average analysis.
compounds were correctly classified as high fe. The precision and sensitivity of low fe was of the order of 50% and 38%, whereas the precision and sensitivity of high fe
Tab. 3. Accuracy of classification for multiple pharmacokinetic parameters (t1/2, CL, CLR, fe, V and fb) using the models based on moving average analysis.
CL, CLR, fe, V and fb drug; +, high t1/2, CL, CLR, fe, V and fb drug.
connectivity index (A5) classified the cephalosporins as exhibiting high CL, high CLR, high fe, high V or exhibiting low CL, low CLR, low fe, low V with an accuracy of 95%, 95%, 60% and 90%, respectively. All the cephalosporins were correctly classified as exhibiting low fb or exhibiting high fb. The single index range model based on eccentric connectivity index can simultaneously predict the multiple pharmacokinetic parameters. Similarly, the single range model based on molecular connectivity topochemical index (A1) correctly classified the cephalosporins as exhibiting low t1/2 or exhibiting high t1/2