This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

This work is devoted to the prediction of a series of 208 structurally diverse PKCθ inhibitors using the Random Forest (RF) method based on the Mold^{2} molecular descriptors. The RF model was established and identified as a robust predictor of the experimental pIC_{50} values, producing a good external R^{2}_{pred} of 0.72 and a standard error of prediction (SDEP) of 0.45 for the test set.

Playing crucial roles in initiating and controlling immune responses, T cells are responsible for many chronic inflammatory diseases when they are inappropriately or persistently stimulated [

Nonetheless, it is well known that the experimental determination of inhibitory activity remains a labor-intensive and time-consuming operation. A more efficient and economical alternative is offered by quantitative structure-activity relationship (QSAR) modeling.

Among QSAR investigations, one of the important factors affecting the quality of the model is the set of molecular descriptors used to extract the structural information, in the form of a numerical representation suitable for model development, which serves as the bridge between the molecular structures and the physicochemical properties or biological activity of chemicals. The software Mold^{2} [ ] calculates a large and diverse set of descriptors solely from 2D structures. A comparison of Mold^{2} descriptors with those calculated by typical commercial software packages, such as Cerius^{2} and Dragon, on several data sets using Shannon entropy analysis has demonstrated that Mold^{2} descriptors convey a similar amount of information [ ]. Moreover, Mold^{2} has been proven suitable not only for QSAR analysis, but also for virtual screening of large databases of chemicals, due to its low computing cost as well as its high efficiency [

Another key factor for the production of a reliable QSAR model is the statistical approach employed to correlate the descriptors with the activity.

Comparatively, the RF [

In the present investigation, three popular statistical methods, RF, PLS and SVM, were applied with four aims: (1) development of QSAR models based on Mold^{2} descriptors for PKCθ inhibitors; (2) comparison of the performance of the models derived by the three methods of RF, PLS and SVM to determine the superior one (which, in the present work, proved to be RF); (3) investigation of the influence of tuning parameters on the RF models; and (4) identification of the important descriptors using RF's built-in variable importance measures.

In this work, three algorithms popular in chemometrics (random forest, partial least squares and support vector machine) were applied to a large dataset of 208 compounds (157 molecules as a training set and 51 molecules as a test set) to explore their structure-PKCθ inhibitory activity relationships (activity expressed by the experimental IC_{50} values). This resulted in one linear model (PLS) and two different nonlinear models (SVM and RF). All these results were obtained using R statistical packages, and the pre-processing of the data was performed with the package caret [

Using the R package randomForest [ ], an RF model was built with the default m_{try} (for regression, one-third of the number of descriptors) and 500 trees. For the training set, a coefficient of determination, R^{2}, of 0.96 is obtained, and for the test set an R^{2} of 0.76 is obtained. The cross-validated R^{2} (Q^{2}, based on OOB prediction) is 0.54, and for the test set the R^{2}_{pred} is 0.72, suggesting both good internal and external predictions for the developed optimal RF model.
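As an illustrative sketch only (the study used the R package randomForest; this mirrors the same settings with scikit-learn, on synthetic stand-in data rather than the Mold^{2} descriptor set), the OOB score plays the role of the internal Q^{2} reported above:

```python
# Sketch, not the authors' code: 500 trees, m_try = p/3, OOB validation,
# as in the paper, but with scikit-learn and synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(157, 96))          # 157 training molecules, 96 descriptors
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=157)  # mock pIC50

rf = RandomForestRegressor(
    n_estimators=500,                   # ntree = 500, as in the paper
    max_features=1 / 3,                 # m_try = one-third of the descriptors
    oob_score=True,                     # OOB R^2 serves as the internal Q^2
    random_state=0,
)
rf.fit(X, y)
print(rf.score(X, y))                   # training R^2
print(rf.oob_score_)                    # OOB (internal validation) R^2
```

The OOB estimate is typically lower than the training R^{2}, matching the 0.96 vs. 0.54 pattern reported for the real data.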

Support vector machine results were obtained with the R package kernlab [ ]. For the training set, the R^{2} reaches as high as 0.99 with a Q^{2} of 0.57, while, for the test set, an R^{2} of only 0.61 is obtained with an R^{2}_{pred} of 0.59, indicating an overfitting problem in the model.
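A sketch of the same diagnosis (the paper used kernlab's ksvm in R; here an RBF-kernel ε-SVR from scikit-learn on synthetic stand-in data, with hyperparameters chosen for illustration only): a large gap between training and test R^{2}, as in the reported 0.99 vs. 0.61, is the signature of overfitting.

```python
# Sketch, not the authors' code: an epsilon-SVR whose training R^2 greatly
# exceeds its external test R^2, illustrating the overfitting check.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(208, 96))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=208)
X_train, X_test = X[:157], X[157:]      # 157 training / 51 test, as in the paper
y_train, y_test = y[:157], y[157:]

svr = SVR(kernel="rbf", C=100.0, epsilon=0.1).fit(X_train, y_train)
print(svr.score(X_train, y_train))      # training R^2
print(svr.score(X_test, y_test))        # external test R^2 (lower: overfitting)
```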

For the present investigation, could PLS, the widely used linear regression technique in QSAR modeling, which also serves as the statistical method built into Comparative Molecular Field Analysis (CoMFA), be applied to build a reliable model? With this question in mind, PLS regression was carried out using the R package pls [ ] on the descriptors calculated by the Mold^{2} software solely from the 2D chemical structures. As a result, a 7-latent-variable QSAR model was obtained, determined using LOO cross validation with the lowest cross-validation root mean square error. The statistical results of the PLS model present a coefficient of determination R^{2} = 0.57 and a LOO cross-validation coefficient Q^{2} = 0.36 for the training set, and R^{2} = 0.42 and R^{2}_{pred} = 0.39 for the test set, based on the experimental pIC_{50} values of the training and test sets. In a word, PLS generates a relatively poor QSAR model for these PKCθ inhibitors.

Random forest, as a relatively new classification and regression tool, has not been frequently applied in QSAR, QSPR (quantitative structure-property relationship), or other chemometrics studies [

Unlike CART (Classification and Regression Trees), each tree in the forest is fully grown without pruning. Because of this, averaging the predictions of many such weakly optimized trees usually yields a significant performance improvement over a single tree. Random forest introduces a random training set (bootstrap) and random input vectors into the trees: each tree is grown using a bootstrap sample of the training data, and at each node the best split is chosen from a random sample of m_{try} variables instead of all variables.
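The two randomizations just described can be sketched from scratch (an illustration, not the paper's implementation): a bootstrap sample per tree, a random subset of m_{try} candidate descriptors at each split, fully grown trees, and averaged predictions.

```python
# Minimal sketch of the random forest idea: bootstrap + random split
# candidates (max_features) + unpruned trees + averaging.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tiny_forest(X, y, n_trees=50, m_try=None, seed=0):
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    m_try = m_try or max(1, p // 3)              # regression default: p/3
    trees = []
    for i in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        tree = DecisionTreeRegressor(max_features=m_try,  # m_try candidates/split
                                     random_state=seed + i)
        tree.fit(X[idx], y[idx])                 # fully grown, no pruning
        trees.append(tree)
    return trees

def forest_predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)  # average over trees

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)
trees = tiny_forest(X, y)
print(np.corrcoef(y, forest_predict(trees, X))[0, 1])
```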

For most QSAR modeling tools, performance can be significantly degraded if irrelevant descriptors are not removed prior to training. For random forest, however, descriptor selection optimization is not strictly necessary, since the OOB samples are used in RF to estimate feature importance: each variable in the OOB samples is randomly permuted in turn, and the impact of the permutation on prediction is measured. The R^{2} and R^{2}_{pred} for the test set satisfy the generally accepted criteria for a statistically acceptable model (R^{2}_{pred} > 0.5 and R^{2} > 0.6 for the test set) [
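The OOB permutation idea can be written out directly (a from-scratch sketch on synthetic data, not the authors' code): for each tree, permute one descriptor within that tree's OOB samples and record the increase in mean squared error.

```python
# Sketch of OOB permutation importance: the signal-carrying descriptor
# shows the largest MSE increase when permuted.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.3, size=200)

n_trees, n, p = 100, len(X), X.shape[1]
inc_mse = np.zeros(p)
for t in range(n_trees):
    boot = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), boot)       # samples left out of the bootstrap
    tree = DecisionTreeRegressor(max_features=2, random_state=t)
    tree.fit(X[boot], y[boot])
    base_mse = np.mean((y[oob] - tree.predict(X[oob])) ** 2)
    for j in range(p):
        Xp = X[oob].copy()
        Xp[:, j] = rng.permutation(Xp[:, j])     # permute descriptor j in OOB data
        inc_mse[j] += np.mean((y[oob] - tree.predict(Xp)) ** 2) - base_mse
inc_mse /= n_trees
print(np.argmax(inc_mse))                        # descriptor 0 carries the signal
```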

Gaining popularity recently, SVM has been applied to a wide range of pharmacological and biomedical investigations including drug-likeness [

As a useful regression tool, PLS has been successfully applied in a series of QSAR analyses [ ]. Here, however, the PLS model gives only R^{2} = 0.42 and R^{2}_{pred} = 0.39 for the test set, failing the general acceptance criteria (R^{2} > 0.6 and R^{2}_{pred} > 0.5 for the test set). This implies that these series of PKCθ inhibitors might have multiple mechanisms of action, and that there may not be a proper linear relationship between the inhibitory activity of the molecules and their corresponding Mold^{2} indices.

In addition, in order to compare the performance of the RF and PLS models, the following F-test on the ratio of their squared standard errors of prediction was performed.

Where n_{1} and n_{2} are the numbers of samples in the corresponding test sets, and s_{1}^{2} is the square of the higher and s_{2}^{2} the square of the lower root mean square error of the two compared models. When comparing the performance of the PLS and RF models, under the null hypothesis of equal performance the ratio s_{1}^{2}/s_{2}^{2}, which is 3.24, should be less than the critical F value; the observed value shows that the PLS method does a much poorer job of developing a model with predictive value. The large F value indicates that the difference in performance is statistically significant.

In many QSAR studies [

Outliers in a QSAR model are compounds that do not fit the model or are poorly predicted by it [

From the above results, RF clearly exhibits better statistical performance than the other two models (

Most QSAR modeling tools, as we know, require at least a moderate amount of parameter tuning to optimize performance. Though it has been demonstrated that the RF model performs relatively well "off the shelf", we investigated the impact of adjusting two key parameters, m_{try} and the number of trees (n_{tree}). The first, m_{try}, is the number of descriptors randomly sampled as candidates for splitting at each node during tree induction. It ranges from 1 to p, the total number of descriptors; randomization operates whenever m_{try} < p. If m_{try} = 1, then the trees are essentially making random splits, and the only optimization is the selection of the splitting point for the chosen descriptor. One would expect the performance of such an ensemble to be suboptimal, unless all descriptors are of equal importance [ ]. In practice, m_{try} can be chosen as some function of p. Here, RF models were grown with a range of m_{try} values, including the default one (32 for this regression, i.e., one-third of the 96 descriptors).

The boxplots of the 5-fold cross-validation correlations at various values of m_{try} show that m_{try} is optimal near 50, with a median correlation of 0.78, while the default m_{try} gives a median value of 0.77. It is also observed that the correlation decreases by ~7% at m_{try} = 1 compared with the optimum (m_{try} = 50). Therefore, it is still worthwhile to perform a moderate parameter tuning to obtain the optimal model, although in most cases RF can give a near-optimal model with default parameters, which is supported by a previous report [
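The sensitivity analysis above can be sketched as a small grid over max_features (scikit-learn's name for m_{try}), on synthetic stand-in data smaller than the real descriptor set:

```python
# Sketch of the m_try sensitivity check: cross-validated correlation at
# m_try = 1, the p/3 default, and the bagging limit p.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 48))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=120)

corrs = {}
for m in (1, 16, 48):                   # 16 = p/3 default; 48 = bagging
    rf = RandomForestRegressor(n_estimators=100, max_features=m, random_state=0)
    pred = cross_val_predict(rf, X, y, cv=KFold(5, shuffle=True, random_state=0))
    corrs[m] = float(np.corrcoef(y, pred)[0, 1])
print(corrs)                            # m_try = 1 degrades the correlation
```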

The other parameter, the number of trees to grow, also affects the performance in most cases: it should not be set to a very small number, to ensure that every input row gets predicted at least a few times. In many cases, 500 trees are sufficient (more are needed only if descriptor importance or molecular proximity is desired). There is no penalty for having "too many" trees, other than wasted computational resources, in contrast to other algorithms which require a stopping rule [
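The flattening-out of the error as the forest grows can be sketched as follows (synthetic stand-in data; warm_start reuses the earlier trees so the forest is grown incrementally):

```python
# Sketch of the ntree effect: OOB error shrinks and then plateaus, so
# extra trees cost only computation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 30))
y = X[:, 0] + rng.normal(scale=0.3, size=150)

errors = []
rf = RandomForestRegressor(oob_score=True, warm_start=True, random_state=0)
for n in (25, 100, 500):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)                        # warm_start keeps the earlier trees
    errors.append(1 - rf.oob_score_)    # OOB fraction of unexplained variance
print(errors)
```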

The ideal QSAR model would be robust, sparse, predictive, and interpretable. In many cases such an ideal is not achievable with current descriptors and response-variable mapping methods, although much effort is being expended in approaching it. Consequently, QSAR modeling tends to be divided into two classes, depending on the intended outcome of the study. Predictive QSAR aims to screen large, chemically diverse compound libraries that are often noisy; such models therefore tend to offer less interpretation of the descriptors, especially with descriptor sets as varied as the one used in this work. In addition, where multiple mechanisms of action may exist among the molecules, nonlinear machine learning algorithms are sometimes employed (like SVM, ANN,

Here are the definitions of the variable importance measures: (1) Mean Decrease Accuracy (%IncMSE) is computed by permuting the values of each variable in the OOB data, recording the prediction and comparing it with the unpermuted OOB prediction (normalized by the standard error). For regression, it is the average increase in squared OOB residuals when the variable is permuted; a higher %IncMSE value represents a higher variable importance. (2) Mean Decrease Gini (IncNodePurity) measures the quality (node purity) of a split for every variable (node) of a tree by means of the Gini index. Every time a node is split on a variable, the impurity criterion for the two descendent nodes is less than that of the parent node. Adding up these decreases for each individual variable over all trees in the forest gives a fast variable importance measure that is often very consistent with the permutation importance measure. As with %IncMSE, a higher IncNodePurity value represents a higher variable importance.
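As a rough analogy (an assumption, since the R randomForest internals differ in detail), the two measures correspond to impurity-based and permutation importances in scikit-learn; note that scikit-learn permutes data you supply rather than the OOB samples:

```python
# Sketch: feature_importances_ ~ IncNodePurity (total impurity decrease);
# permutation_importance ~ %IncMSE (error increase under permutation).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.2, size=200)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
purity_like = rf.feature_importances_            # impurity-decrease measure
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print(np.argsort(purity_like)[::-1][:2])         # the two informative features
print(np.argsort(perm.importances_mean)[::-1][:2])
```

Both rankings usually agree on strongly informative descriptors, mirroring the consistency noted above.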

In conclusion, though it is often deemed that RF can be used "off the shelf" without expending much effort on parameter tuning or descriptor selection, in some cases it is still important for users to investigate the sensitivity of RF to changes in m_{try}, which sometimes also influences the performance of the derived QSAR models [

A large diverse dataset of 220 compounds with experimental IC_{50} values for PKCθ inhibition was taken from the literature [ ], with IC_{50} values ranging from 0.28 to >30,000 nM. Ten chiral compounds with the same 2D structures but different activities were removed due to software limitations. Two inhibitors whose activities were reported only as inequalities were also deleted for the current QSAR research. Finally, 208 structures with definitive biological values were used for this QSAR analysis. Here, the converted molar pIC_{50} (−log IC_{50}) values, ranging from 5.022 to 9.553, were used as the dependent variable in the QSAR regression analysis to improve the normal distribution of the experimental data points. This span of more than 4 log units of pIC_{50} and the unusually large dataset make it highly appropriate for a QSAR analysis.
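The activity transform used above is a one-liner: IC_{50} in nM converted to molar pIC_{50} via pIC_{50} = −log_{10}(IC_{50} [M]) = 9 − log_{10}(IC_{50} [nM]).

```python
# The pIC50 conversion described in the text; the most potent compound
# (0.28 nM) maps to the upper end of the reported range (9.553).
import math

def pic50_from_nM(ic50_nM: float) -> float:
    return -math.log10(ic50_nM * 1e-9)   # nM -> M, then -log10

print(pic50_from_nM(0.28))   # ~9.553, the top of the reported range
```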

As for the division rule for training and test sets, various studies have provided different valuable strategies, including the most commonly used one,

Firstly, the construction of the 2D prediction models depends on the generation of the molecular descriptors. Simply by using various molecular modeling tools, it is possible to calculate thousands of these descriptors directly from the structure of any particular molecule. In this work, a series of 208 two-dimensional structures were drawn with the ISIS/Draw 2.3 program [ ] and then submitted to Mold^{2}, a program freely available to the public, to calculate the molecular descriptors. Solely from the 2D chemical structures, the Mold^{2} software package calculates 777 molecular descriptors for each compound; models based on these descriptors have been reported to be comparable to those based on descriptors calculated by commercial software packages, according to Hong et al. [ ]. Then, descriptor preprocessing (often called unsupervised selection of descriptors), which is usually required prior to modeling a data set, was executed as follows: (1) descriptors containing greater than 85% zero values were removed; (2) zero- and near-zero-variance predictors were removed, because these may cause the model to crash or the fit to be unstable; (3) for each pair of descriptors with absolute correlation above 0.75, one was omitted; and (4) descriptors involved in linear combinations were identified and removed until the dependencies among predictors were resolved. After these steps, the original 777 descriptors were reduced to 96. The remaining 96 descriptors and their definitions are presented in Supporting Information Tables S2 and S3, respectively.
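The first three filters can be sketched in plain NumPy/pandas (the paper used caret's nearZeroVar, findCorrelation and findLinearCombos; this is an analogue on stand-in data, and step (4), removing exact linear combinations via a QR decomposition, is omitted here):

```python
# Sketch of the unsupervised descriptor filters: >85% zeros, zero
# variance, and pairwise |r| > 0.75.
import numpy as np
import pandas as pd

def prefilter(df: pd.DataFrame, zero_frac=0.85, corr_cut=0.75) -> pd.DataFrame:
    # (1) drop descriptors that are more than 85% zeros
    df = df.loc[:, (df == 0).mean() < zero_frac]
    # (2) drop zero-variance descriptors
    df = df.loc[:, df.std() > 0]
    # (3) for each highly correlated pair, drop one member
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_cut).any()]
    # step (4), exact linear combinations (caret's findLinearCombos), omitted
    return df.drop(columns=drop)

df = pd.DataFrame({
    "a": [0] * 9 + [1],                 # 90% zeros -> removed by (1)
    "b": [5.0] * 10,                    # constant  -> removed by (2)
    "c": list(range(10)),
    "d": list(range(10)),               # duplicate of c -> removed by (3)
    "e": [0, 9, 1, 8, 2, 7, 3, 6, 4, 5],
})
print(list(prefilter(df).columns))
```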

A successful prediction model relies greatly on the use of an appropriate statistical approach. In this work, three popular methods (

PLS is similar to principal components regression, but both the independent and the dependent variables are involved in the generation of the orthogonal latent variables, rather than only the independent variables. PLS is based on the projection of the original multivariate data matrices down onto smaller matrices of scores and loadings:

X = TP^{T} + E

Y = UQ^{T} + F

Where T and U are the score matrices, P and Q are the loading matrices, and E and F are the residual matrices of the descriptor matrix X and the response matrix Y, respectively.

To date, PLS regression algorithms have been extended to various methods such as the kernel algorithm, the wide kernel algorithm, the SIMPLS algorithm and the classical orthogonal scores algorithm. In the present study, the kernel algorithm was selected to build the QSAR models, with leave-one-out (LOO) cross validation used to determine the optimal number of principal components.

SVM: As a novel type of learning machine, the support vector machine developed by Vapnik and Cortes [ ] maps the input data into a high-dimensional feature space through a kernel function and performs a linear regression there. In ε-insensitive support vector regression, the goal is to find a function

f(x) = ⟨w, x⟩ + b

Where w is the weight vector and b is the bias, such that f deviates from the observed targets by at most ε while being as flat as possible. This is achieved by minimizing

(1/2)‖w‖^{2} + C Σ_{i}(ξ_{i} + ξ_{i}^{*})

Where ξ_{i} and ξ_{i}^{*} are slack variables penalizing deviations larger than ε, and C is the regularization constant balancing training error against model complexity.

And the squared norm, ‖w‖^{2}, is used as a measurement of function flatness.

RF: RF models were constructed according to the original RF algorithm [ ]: (1) Draw n_{tree} bootstrap samples from the original training data; (2) For each bootstrap sample, grow an unpruned regression tree, with the modification that at each node, rather than searching all descriptors, randomly sample m_{try} of the predictors and then choose the best split from among those variables; the tree is grown to maximum size and not pruned back; (3) Predict new data by aggregating the predictions of the n_{tree} trees (i.e., the average for regression).

The RF algorithm reduces to bagging when m_{try} equals the total number of descriptors p, since each split then considers all of the descriptors rather than a random subset. The default m_{try} for regression is one-third of the number of descriptors (p/3). RF is computationally efficient because m_{try} is very small, so that the split search is very fast. In addition, RF is more efficient than a single tree because RF does not do any pruning at all, while a single tree needs some pruning, typically via cross validation, which can take up a significant portion of the computation time, to reach the right model complexity.

RF possesses its own reliable internal validation statistics based on OOB prediction, which can be used for validation and model selection with no cross-validation performed. It has been shown that the prediction accuracy of an OOB set and of a 5-fold cross-validation procedure are nearly the same [ ]; thus, little effort on tuning m_{try} or on descriptor selection is typically needed to optimize the performance of RF.
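The near-equivalence just cited can be checked in a few lines (a sketch on synthetic stand-in data, not the cited study's experiment): the forest's OOB R^{2} tracks the 5-fold cross-validated R^{2} closely.

```python
# Sketch: OOB R^2 versus 5-fold cross-validated R^2 for the same forest
# settings on the same data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 20))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

rf = RandomForestRegressor(n_estimators=300, oob_score=True,
                           random_state=0).fit(X, y)
cv = cross_val_score(RandomForestRegressor(n_estimators=300, random_state=0),
                     X, y, cv=KFold(5, shuffle=True, random_state=0),
                     scoring="r2").mean()
print(rf.oob_score_, cv)    # the two estimates are close
```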

Besides the above merits, RF can also calculate descriptor importance in the course of training, as follows: as each tree is grown, predictions are made on the OOB data for that tree. At the same time, each descriptor in the OOB data is randomly permuted, one at a time, and each modified data set is also predicted by the tree. Finally, after the model training is completed, the margins for each molecule are calculated based on the OOB prediction and on the OOB prediction with each descriptor permuted. The measure of importance for descriptor j is then the difference between these two quantities, averaged over all trees.

After the regression model was constructed, the standard error of prediction (SDEP) and the predictive R^{2} (R^{2}_{pred}) [ ] were used to assess the external predictive power of the model on the test set:

SDEP = [Σ(y_{i} − ŷ_{i})^{2}/n]^{1/2}

R^{2}_{pred} = 1 − Σ(y_{i} − ŷ_{i})^{2}/Σ(y_{i} − ȳ_{tr})^{2}

Where y_{i} and ŷ_{i} are the observed and predicted pIC_{50} values of test-set compound i, ȳ_{tr} is the mean observed value of the training set, and n is the number of compounds in the test set.
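The two external-validation metrics are straightforward to implement (note that R^{2}_{pred} is referenced to the training-set mean, not the test-set mean):

```python
# Direct implementation of SDEP and R^2_pred as defined above.
import math

def sdep(y_obs, y_pred):
    n = len(y_obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / n)

def r2_pred(y_obs, y_pred, y_train_mean):
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss = sum((o - y_train_mean) ** 2 for o in y_obs)
    return 1 - press / ss

print(sdep([1, 2, 3], [2, 2, 2]))
print(r2_pred([1, 2, 3], [2, 2, 2], 2.0))
```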

In the present work, a successful computational model was developed, for the first time, for a series of 208 PKCθ inhibitors with diverse structural scaffolds, based on Mold^{2} descriptors using the random forest algorithm. Its statistical results for the test set are R^{2} = 0.76, R^{2}_{pred} = 0.72, and SDEP = 0.45.

For comparison, two alternative approaches, PLS and SVM, were also applied to the dataset. As a result, an overfitting problem was observed for the SVM model, while PLS analysis yielded a poorly predictive model with R^{2}_{pred} = 0.39 and SDEP = 0.67, suggesting that there may be no proper linear relationship between the inhibitory activity and the Mold^{2} indices. We hope that the adopted model and the information above will be of help for the screening and prediction of novel potent PKCθ inhibitors, and for further research on the subject.

The authors thank the R Development Core Team for providing the free R 2.10 software.

Mold^{2}, molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics

(a) Predicted versus experimental pIC_{50} values of the RF model; (b) predicted versus experimental pIC_{50} values of the SVM model; (c) predicted versus experimental pIC_{50} values of the PLS model.

Residual plot for the training and test sets in the RF model.

Boxplots of 30 replications of 5-fold cross-validation correlation at various values of m_{try} for the PKCθ data set. Horizontal lines inside the boxes are the median correlations.

Comparison of the training, out-of-bag, and external test set MSEs for random forest on the PKCθ data set as the number of trees increases.

Ordered variable importance scores from RF. The three most important descriptors are enclosed in a blue frame.

Statistical performance of the QSAR models for PKCθ inhibitors.

| Para. | RF (Training) | RF (Test) | SVM (Training) | SVM (Test) | PLS (Training) | PLS (Test) |
|---|---|---|---|---|---|---|
| Size | 157 | 51 | 157 | 51 | 157 | 51 |
| R^{2} | 0.96 | 0.76 | 0.99 | 0.61 | 0.57 | 0.42 |
| Q^{2} | 0.54 | - | 0.57 | - | 0.36 | - |
| R^{2}_{pred} | - | 0.72 | - | 0.59 | - | 0.39 |
| SEE | 0.25 | - | 0.08 | - | 0.59 | - |
| SDEP | - | 0.45 | - | 0.55 | - | 0.67 |

R^{2}, coefficient of determination; Q^{2}, cross-validated R^{2} (based on OOB, 10-fold cross-validation and leave-one-out for RF, SVM and PLS, respectively); R^{2}_{pred}, predictive correlation coefficient for the test set; SEE, standard error of estimation for the training set; SDEP, standard error of prediction for the test set.

Statistical performance of QSAR models from 100 rounds of 51-chemical-hold-out testing (mean ± standard deviation) for PKCθ inhibitors.

| Para. | RF (Training) | RF (Test) | SVM (Training) | SVM (Test) | PLS (Training) | PLS (Test) |
|---|---|---|---|---|---|---|
| Size | 157 | 51 | 157 | 51 | 157 | 51 |
| R^{2} | 0.95 ± 0.003 | 0.58 ± 0.09 | 0.82 ± 0.01 | 0.49 ± 0.10 | 0.64 ± 0.13 | 0.41 ± 0.13 |
| Q^{2} | 0.57 ± 0.03 | - | 0.59 ± 0.02 | - | 0.39 ± 0.11 | - |
| R^{2}_{pred} | - | 0.56 ± 0.09 | - | 0.45 ± 0.10 | - | 0.10 ± 0.84 |
| SEE | 0.24 ± 0.01 | - | 0.39 ± 0.01 | - | 0.53 ± 0.09 | - |
| SDEP | - | 0.59 ± 0.06 | - | 0.63 ± 0.05 | - | 0.79 ± 0.25 |

R^{2}, coefficient of determination; Q^{2}, cross-validated R^{2} (based on OOB, 10-fold cross-validation and leave-one-out for RF, SVM and PLS, respectively); R^{2}_{pred}, predictive correlation coefficient for the test set; SEE, standard error of estimation for the training set; SDEP, standard error of prediction for the test set.

Representative chemical structures and inhibitory activity of the PKCθ inhibitor dataset.

| No. | Scaffold | Substituents | pIC_{50} | Ref |
|---|---|---|---|---|
| | | R^{1} / R^{2} / R^{3} | | |
| 1 | A | OMe / OMe / 3-Bromophenyl | 5.337 | [ |
| 2 | A | OMe / OMe / Phenyl | 5.796 | [ |
| 3 | A | OMe / OMe / 3-Chlorophenyl | 5.409 | [ |
| | | X | | |
| 17 | B | Pyrrolidine | 8.420 | [ |
| 23 | B | H_{2}N | 7.921 | [ |
| 27 | B | PhNH | 6.959 | [ |
| | | Ar / R | | |
| 77 | C | Phenyl / 4-CH_{2}-NMe_{2} | 7.854 | [ |
| 80 | C | 3-Pyridine / 5-CH_{2}-NMe_{2} | 7.076 | [ |
| 85 | C | Phenyl / 2-OMe, 3-CH_{2}-NMe_{2} | 7.921 | [ |
| | | X | | |
| 37 | D | 1 | 7.456 | [ |
| 41 | D | 2 | 7.469 | [ |
| | | NR'R | | |
| 137 | E | Morpholine | 8.108 | [ |
| 140 | E | Pyrrolidine | 7.456 | [ |
| | | NR'R | | |
| 153 | F | Morpholine | 7.886 | [ |
| 157 | F | NHCH_{2}CH(OH)CH_{2}OH | 8.824 | [ |

Test set; pIC_{50} values from the corresponding reference.