This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Experimental pEC_{50}s for 216 selective respiratory syncytial virus (RSV) inhibitors are used to develop classification models as a potential screening tool for a large library of target compounds. A variable selection algorithm coupled with random forests (VSRF) is used to extract the physicochemical features most relevant to RSV inhibition. Based on the selected small set of descriptors, four other widely used approaches,
Respiratory syncytial virus (RSV), a single-stranded RNA virus of negative genome polarity, is a member of the
Traditionally, the biological activity of a drug candidate is obtained via costly and time-consuming experiments. Thus the introduction of
Construction of a computational model usually requires two components. The first is a set of molecular descriptors, which encode the structural information needed for model development. The software Mold^{2} [
The second is an appropriate classification approach, which is equally central to obtaining accurate predictions. Commonly used classification methods include the simple but interpretable linear discriminant analysis (LDA) and partial least squares (PLS) [
It is well known that an ideal classification model should achieve high performance with a small number of descriptors. Thus, in the present work, to optimize the 2D (two-dimensional) molecular descriptor subset while simultaneously enhancing the statistical performance and efficiency of the model, the variable selection (VS) method by RF combined with backward elimination using out-of-bag (OOB) error is applied to classify the current RSV inhibitors, in order to investigate whether the proposed VSRF method can construct an ideal prediction model (
As a special kind of neural network that can be used for clustering, visualization, and abstraction tasks, the self-organizing map (SOM) is especially suitable for data survey due to its prominent visualization properties. In our previous work, this technology has been successfully applied to dataset split [
A VSRF strategy has been developed successfully, with the number of descriptors reduced from the original 272 to six for further study. Since it is recommended that the number of compounds in the training set should be at least five times larger than the number of selected independent variables [
Based on the selected descriptors, five different statistical methods (VSRF, SVM, GP, LDA,
VSRF: Random forest effectively has only one tuning parameter,
SVM: Similar to other multivariate statistical models, the performance of SVM depends on the combination of several parameters including the capacity parameter
GP: The Gaussian process method, based on clearly defined statistical principles and easily programmed [
LDA: As a widely used classification technique, LDA was also applied to classify the current dataset based on the six selected descriptors. As shown in
From the above discussion, it can be concluded that the developed VSRF model performed comparably with SVM and GP in terms of cross-validation, as shown by Q_{cv}(%) values of 81.6%, 79.1% and 78% for VSRF, SVM and GP, respectively. These models outperform those of the LDA and
In addition, when comparing the other four models, it is observed that the LDA model is comparable to that of
It should be noted that RF, as a new classification and regression tool, can well solve the small
By using feature selection, the most appropriate sets of molecular descriptors for predicting the RSV low and high active inhibitors are extracted from the VSRF models, some of which probably provide new insights into the physicochemical characteristics of RSV inhibition by specific classes of compounds. D299, one of the topological descriptors, is a molecular branching index that is calculated from the algebraic formulas derived by Lovasz and Pelikan for special types of trees such as path or star and for particular eigenvalues [
The 2D autocorrelation descriptors are obtained by summing the products of certain properties of two atoms located at a given topological distance or spatial lag. The most important factor in interpreting them in the model is the topological distance, once weighted equally. In view of this, the best model selected an optimum descriptor combination, which includes van der Waals volumes and atomic polarizabilities as the most relevant key features (
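To make the idea of a lag-dependent autocorrelation concrete, the following is a minimal sketch of the textbook Moran coefficient on a molecular graph. It is not taken from Mold^{2}, whose exact normalization may differ; the four-atom chain and its atomic weights are hypothetical values chosen only for illustration.

```python
from collections import deque


def topological_distances(adjacency):
    """All-pairs shortest path lengths (counted in bonds) via breadth-first search."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[start][v] is None:
                    dist[start][v] = dist[start][u] + 1
                    queue.append(v)
    return dist


def moran_autocorrelation(adjacency, weights, lag):
    """Textbook Moran coefficient at a given topological distance (lag).

    Numerator: mean product of weight deviations over atom pairs at distance `lag`.
    Denominator: variance of the atomic weights over all atoms.
    """
    n = len(weights)
    mean_w = sum(weights) / n
    dev = [w - mean_w for w in weights]
    dist = topological_distances(adjacency)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if dist[i][j] == lag]
    variance = sum(d * d for d in dev) / n
    if not pairs or variance == 0:
        return 0.0
    covariance = sum(dev[i] * dev[j] for i, j in pairs) / len(pairs)
    return covariance / variance


# Hypothetical linear chain of four atoms (0-1-2-3) with made-up atomic weights.
chain = [[1], [0, 2], [1, 3], [2]]
weights = [1.0, 2.0, 3.0, 4.0]
lag1 = moran_autocorrelation(chain, weights, lag=1)
lag3 = moran_autocorrelation(chain, weights, lag=3)
```

Replacing the made-up weights with atomic van der Waals volumes or polarizabilities, and the lag with 4 or 1, would give quantities in the spirit of D490 and D503, under the assumption that the same normalization is used.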
The last two selected descriptors (D513 and D528) belong to the topological charge indices. D513, the molecular topological order-3 charge index (GGI3), represents the three eigenvalues of the corrected adjacency matrix of a molecule. D528, the mean molecular topological order-8 charge index (JGI8), is a kind of Galvez topological charge index which evaluates the charge transfers between pairs of atoms and the global charge transfers in the molecule [
From the aforementioned discussion, it can be seen that the activity of these RSV inhibitors is mainly influenced by several factors including the molecular branching index and atomic polarizabilities. These results are to some extent in agreement with the corresponding related experimental conclusions [
As expected, besides being robust, sparse and predictive, an ideal classification model should also be interpretable. In many cases, gaining an intuitive interpretation of important features from two-dimensional QSAR is not simple. For the present work, it should be pointed out that our explanations of the current descriptors are only general, owing to the nonlinear model types and abstract descriptors. However, in terms of developing a highly predictive classification model, the proposed VSRF model accomplishes this task.
A large, diverse dataset of 216 RSV inhibitors collected from the literature [
In the present work, the two-dimensional structures of all RSV inhibitors were built with the ISIS/Draw 2.3 program [
Rational division of an experimental SAR (structure-activity relationship) dataset into the respective training and test sets for model development and validation is very important. The methods often used include random sampling (RS), Kennard-Stone (KS), K-means clustering, and self-organizing map,
For the independent prediction set, we performed our selection on the basis of their distribution in the chemical space, which is defined by a Kohonen neural network [
VSRF: Random forest model was constructed according to the described original RF algorithm [
RF possesses its own reliable statistical characteristics based on OOB set prediction, which can be used for validation and model selection without performing cross-validation. It was shown that the prediction accuracy of an OOB set and a 5-fold cross-validation procedure was nearly the same [
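As an illustration of why OOB prediction can stand in for cross-validation, the sketch below simulates bootstrap sampling and measures how many compounds each tree leaves out. The training-set size of 163 is an assumption read from the 82 + 81 training counts in the RF comparison table; the function names and the number of simulated trees are hypothetical.

```python
import random


def oob_indices(n_samples, rng):
    """Draw a bootstrap sample of size n_samples and return the indices never drawn."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return [i for i in range(n_samples) if i not in drawn]


def mean_oob_fraction(n_samples, n_trees, seed=0):
    """Average fraction of samples left out-of-bag across n_trees bootstrap draws."""
    rng = random.Random(seed)
    total = sum(len(oob_indices(n_samples, rng)) for _ in range(n_trees))
    return total / (n_trees * n_samples)


n_train = 163  # assumed: 82 + 81 training compounds
expected = (1 - 1 / n_train) ** n_train  # probability a compound is missed by one bootstrap
simulated = mean_oob_fraction(n_train, n_trees=500)
```

Each tree therefore sees roughly 63% of the training compounds and is tested on the remaining 37% or so, which is why the OOB error behaves like a built-in hold-out estimate.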
Random forest, as a new classification and regression tool, has not been frequently applied in QSAR, QSPR (quantitative structure-property relationship) [
As expected, an ideal classification model should possess high prediction ability with a small set of descriptors. Thus, variable selection with random forest was used to implement this task. Here, we briefly introduce the VSRF. To select optimal descriptors, random forests were iteratively fitted, at each iteration building a new forest after discarding those descriptors with the smallest variable importance; the selected set of descriptors is the one that yields the smallest OOB error rate. In this algorithm, all forests result from eliminating, iteratively, a fraction,
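The backward elimination loop described above can be sketched as follows. This is a minimal illustration using scikit-learn rather than the authors' implementation: the dropped fraction of 0.2, the minimum subset size, the number of trees, and the synthetic data are all assumptions, and scikit-learn's impurity-based importances are used in place of whatever importance measure the original work relied on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def backward_eliminate(X, y, drop_fraction=0.2, min_features=2, n_trees=100, seed=0):
    """Iteratively drop the least important features; keep the subset with lowest OOB error."""
    remaining = list(range(X.shape[1]))
    best_error, best_subset = np.inf, list(remaining)
    while True:
        forest = RandomForestClassifier(
            n_estimators=n_trees, oob_score=True, random_state=seed
        )
        forest.fit(X[:, remaining], y)
        oob_error = 1.0 - forest.oob_score_
        # "<=" prefers the smaller subset when two subsets tie on OOB error.
        if oob_error <= best_error:
            best_error, best_subset = oob_error, list(remaining)
        if len(remaining) <= min_features:
            break
        n_drop = max(1, int(drop_fraction * len(remaining)))
        n_keep = max(min_features, len(remaining) - n_drop)
        # Rank features by importance (highest first) and keep the top n_keep.
        order = np.argsort(forest.feature_importances_)[::-1]
        remaining = sorted(remaining[i] for i in order[:n_keep])
    return best_subset, best_error


# Synthetic stand-in data: 20 descriptors, only the first two carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
selected, error = backward_eliminate(X, y)
```

On real descriptor data one would start from the full Mold^{2} descriptor matrix instead of the synthetic array, and the subset returned would depend on the chosen fraction and random seed.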
SVM: Support vector machines are a relatively new type of learning algorithm originally introduced by Vapnik and coworkers [
For the classification task, briefly, this involves the optimization of Lagrangian multipliers
GP: Used so far only preliminarily in the QSAR field, the Gaussian process (GP) was also introduced in the present study to classify the RSV inhibitors. Pioneering work was done by Burden [
A Gaussian process is defined simply as a collection of random variables which have a joint Gaussian distribution. It is completely characterized by its mean and covariance function. The kernel functions available for GP training and prediction include: (1) the radial basis (“Gaussian”) kernel; (2) the polynomial kernel; (3) the linear kernel; (4) the hyperbolic tangent kernel; (5) the Laplacian kernel; (6) the Bessel kernel; (7) the ANOVA RBF kernel; and (8) the spline kernel. In the present work, the popular radial basis kernel was chosen, with the kernel parameters determined by the sigest function implemented in the R package kernlab.
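For readers unfamiliar with the radial basis kernel, the following small sketch evaluates it in the parameterization used by kernlab, k(x, z) = exp(-sigma * ||x - z||^2). The sigma value and the descriptor vectors are hypothetical; in the study sigma is estimated from the data by sigest, which is not reproduced here.

```python
import math


def rbf_kernel(x, z, sigma):
    """Radial basis kernel exp(-sigma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sigma * sq_dist)


def gram_matrix(points, sigma):
    """Covariance (Gram) matrix of the GP prior over the given points."""
    return [[rbf_kernel(p, q, sigma) for q in points] for p in points]


# Hypothetical two-descriptor vectors and an arbitrary sigma for illustration.
points = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]]
K = gram_matrix(points, sigma=0.5)
```

Nearby compounds in descriptor space receive a covariance close to 1 and distant ones a covariance close to 0, which is what lets the GP assign similar class probabilities to similar molecules.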
LDA: LDA is a pattern recognition method providing a classification model based on the combination of variables that best predicts the category or group to which a given compound belongs. The basic theory of LDA is to classify the dependents by dividing an
As in the case of all discriminative methods [
In the present work, based on the largest dataset to date (to the best of our knowledge) of 216 structurally diverse RSV inhibitors, a VSRF classification model with good predictive performance (the overall Q = 94.34% for the prediction set) has been built.
From the interpretation of the selected descriptors, we conclude that topological structure and electronic factors play a central role in RSV inhibition. Moreover, a comparison with four other statistical methods,
Self-organizing map (SOM) top map indicating the distribution of the training and external prediction sets. The training set is labeled in black font and the prediction set in red font. The number corresponds to the series number of the compounds of the RSV inhibitors.
The ROC (receiver operating characteristic) curves of VSRF, SVM, GP, LDA and
The selected 6 Mold^{2} descriptors using variable selection algorithm coupled with random forests (VSRF) and their definition.
D299  The largest eigenvalue  Eigenvalue-based indices
D347  Molecular topological path index of order 07  Walk and path counts
D490  Moran topological structure autocorrelation length-4 weighted by atomic van der Waals volumes  2D autocorrelation
D503  Moran topological structure autocorrelation length-1 weighted by atomic polarizabilities  2D autocorrelation
D513  Molecular topological order-3 charge index  Topological charge indices
D528  Mean molecular topological order-8 charge index  Topological charge indices
The prediction performance of high and low active compounds as respiratory syncytial virus (RSV) inhibitors from VSRF, SVM, GP, LDA and
VSRF  27  0  100  23  3  88.46  94.34  0.89  0.96  81.6 
SVM  23  4  85.19  21  5  80.77  83.02  0.66  0.84  79.1 
GP  27  0  100  20  6  76.92  88.68  0.79  0.9  78 
LDA  20  7  74.07  21  5  80.77  77.36  0.55  0.77  67.5 
22  5  81.48  17  9  65.38  73.58  0.48  0.76  72.9 
VSRF,
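The figures in the table above can be checked with a short calculation. The sketch below assumes that the first six numeric columns are, in order, true positives, false negatives, sensitivity (%), true negatives, false positives and specificity (%), followed by overall accuracy Q (%) and the Matthews correlation coefficient; the column headers did not survive in this copy, so this reading is an inference that happens to reproduce the reported values.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, overall accuracy (all in %) and MCC from confusion counts."""
    se = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    q = 100.0 * (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": round(se, 2), "SP": round(sp, 2), "Q": round(q, 2), "MCC": round(mcc, 2)}


# Counts (TP, FN, TN, FP) read from the table rows under the assumption stated above.
counts = {
    "VSRF": (27, 0, 23, 3),
    "SVM": (23, 4, 21, 5),
    "GP": (27, 0, 20, 6),
    "LDA": (20, 7, 21, 5),
}
results = {name: classification_metrics(*c) for name, c in counts.items()}
```

For VSRF this gives SE = 100, SP = 88.46, Q = 94.34 and MCC = 0.89, matching the corresponding row.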
Comparison of random forest (RF) statistical performance with and without variable selection based on the respiratory syncytial virus (RSV) inhibitor dataset
Training set  RF  82  0  100  81  0  100  100  0.816  171.42 
VSRF  82  0  100  81  0  100  100  0.816  8.06  
Test set  RF  25  2  92.59  23  3  88.46  90.57     
VSRF  27  0  100  23  3  88.46  94.34     
for RF,
Representative compounds with their chemical names, activities and classes used in the dataset.
1  4.507  L  12
2  6.328  L  12
3  5.174  L  12
4^{*}  6.222  L  12
5  5.959  L  12
7  5.959  L  12
8^{*}  4.81  L  12
9  5.481  L  12
10  5.114  L  12
11  5.570  L  12
12^{*}  6.284  L  12
29  6.125  L  13
30  8.398  H  13
31  7.959  H  13
32^{*}  7.796  H  13
34  7.602  H  13
35  7.745  H  13
36  7.921  H  13
37  7.678  H  13
38  8.046  H  13
39^{*}  8.000  H  13
41  7.959  H  13
42^{*}  7.854  H  13
43  7.824  H  13
*, test set;
from the corresponding reference;
H denotes highly active compounds, L denotes low-activity compounds.