These authors contributed equally to this work.
This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
To design ARC111 analogues with improved efficiency, we constructed quantitative structure-activity relationship (QSAR) models for 22 ARC111 analogues tested against RPMI8402 tumor cells. First, a support vector regression (SVR) model optimized on the literature descriptors with the worst descriptor elimination multi-roundly (WDEM) method generalized to the test set about as well as the artificial neural network (ANN) model. Second, the high-dimensional descriptor selection nonlinearly (HDSN) and WDEM methods selected seven and 11 more effective descriptors from 2,923 features, and the SVR models built on these descriptors (SVR3 and SVR4) gave better evaluation measures and more precise predictions for the test set. An interpretability system was then established for these better SVR models. Our analysis offers useful parameters for designing ARC111 analogues with enhanced antitumor activity.
Topoisomerase I (TOP I) is a clinical target for the treatment of cancer [
The quantitative structure-activity relationship (QSAR) is a powerful approach for studying the relationship between drug activities and molecular structures, and it helps explain how structural features determine drug activity. In particular, an acceptable QSAR model is faster and cheaper than experimental testing for evaluating drug activity. Yu
The support vector machine (SVM) is a class of learning-based nonlinear modeling techniques with proven performance in a wide range of practical applications [
The objectives of this work were: (1) to test the effectiveness of the SVR model on ARC111 analogues by comparing it with other chemometric tools, including stepwise MLR, PLS, and ANN; (2) to construct and evaluate QSAR models using SVR with descriptors selected from the high-dimensional features of ARC111 analogues; (3) to analyze the explanatory power of the SVR models; and (4) to predict the activities of several theoretical drugs with our model and thus provide specific parameters for future drug development.
To verify the generalization ability of QSAR models constructed using the SVR technique, a low-dimensional literature dataset with 9 descriptors was adopted. The 9 descriptors were the combined set of features from stepwise MLR and PLS in [
The SVR model with 6 descriptors (SVR2) produced better results than the SVR model with all 9 descriptors (SVR1). The 6 descriptors were obtained from the 9 descriptors with the WDEM method. This suggests that the WDEM method may be effective at choosing relevant descriptors for more accurate prediction of the activities of ARC111 analogues, a property that will be helpful for modeling with high-dimensional features. Considering nonlinear function, predictive ability and computing time, the Radial Basis Function (
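The elimination idea behind WDEM can be illustrated with a short sketch. The text here does not spell out the procedure, so the loop below is only one plausible reading: in each round, every remaining descriptor is left out in turn, and the one whose removal lowers the error most is dropped. The names `eliminate_worst_descriptors` and `toy_error` and the descriptor labels are hypothetical, and the toy error function merely stands in for a cross-validated SVR error.

```python
from typing import Callable, Sequence, Tuple


def eliminate_worst_descriptors(
    descriptors: Sequence[str],
    cv_error: Callable[[Tuple[str, ...]], float],
) -> Tuple[Tuple[str, ...], float]:
    """Greedy multi-round backward elimination (an assumed reading of WDEM).

    Each round leaves out every remaining descriptor in turn. The descriptor
    whose removal gives the lowest error is treated as the "worst" one and is
    dropped, but only if dropping it improves on the current error.
    """
    current = tuple(descriptors)
    best_error = cv_error(current)
    while len(current) > 1:
        trials = []
        for name in current:
            subset = tuple(d for d in current if d != name)
            trials.append((cv_error(subset), subset))
        round_error, round_subset = min(trials, key=lambda t: t[0])
        if round_error >= best_error:
            break  # no single removal helps any more
        current, best_error = round_subset, round_error
    return current, best_error


def toy_error(subset: Tuple[str, ...]) -> float:
    """Hypothetical stand-in for a cross-validated SVR error (made-up numbers)."""
    useful = {"d1", "d2", "d3"}
    missing = len(useful - set(subset))  # useful descriptors left out
    noise = len(set(subset) - useful)    # irrelevant descriptors kept
    return 0.05 + 0.10 * missing + 0.02 * noise


selected, error = eliminate_worst_descriptors(
    ["d1", "d2", "d3", "n1", "n2"], toy_error
)
print(selected, error)
```

In practice `cv_error` would wrap whatever model and validation scheme are in use. The greedy loop itself makes no assumption about them.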
To improve the drug design of ARC111 analogues, analyzing high-dimensional descriptors may give better predictions. Using the software PCLIENT, 2,923 molecular descriptors were calculated. The high-dimensional dataset containing the independent variables (all 2,923 descriptors) and the dependent variables [pIC_{50 (expt.)} values] was then used for modeling. Because the high-dimensional descriptors carried more redundant information, we focused on how to nonlinearly select fewer but more critical descriptors using SVR. We developed two novel methods that can select important descriptors from thousands of candidates. An initial coarse screening with the HDSN method filters out irrelevant features, turning the high-dimensional data set into a low-dimensional one. A further careful screening with the WDEM method then reduces the low-dimensional data set to only the important descriptors. Throughout the process, the descriptors in modeling with higher
In feature screening, the Radial Basis Function (
The SVR3 and SVR4 models predicted that the antitumor activity of ARC111 analogues depends on 7 and 11 molecular factors, respectively. According to the interpretability analysis of the SVR model we have established [
The
For all descriptors, the analysis of single-factor effects showed that the antitumor activity was positively correlated with
Perhaps, starting from a descriptor pool and then revealing the physicochemical properties of a limited number of selected descriptors, as seen in some papers, can lead to a compromise between both approaches. In most of the models for prediction, theoretical molecular descriptors were used. Experimental chromatographic descriptors could be useful but are tedious to determine and therefore less popular [
According to the types and roles of ARC111 substituents reported in literature [
First, to assess the reliability of QSAR modeling of ARC111 analogue activities using the SVR technique, 4 electronic [
Second, to develop a better QSAR model from a high-dimensional data sample using the SVR technique, molecular structures were represented by about 3,000 molecular descriptors that encoded much more structural information. These descriptors were generated by the software PCLIENT (
To reduce dimensionality and improve model robustness in the QSAR analysis, the high-dimensional features were first screened coarsely and nonlinearly into low-dimensional features with lower mean squared error (
The selection of descriptors and the optimization of the kernel function parameters were examined by 10-fold or LOO validation with the minimum
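As a rough illustration of validation-driven parameter choice, the sketch below picks an RBF kernel width by leave-one-out (LOO) mean squared error. It is not the authors' SVR: to stay dependency-free it uses a kernel-weighted average as a stand-in regressor, and the data, the candidate `gamma` values and all function names are hypothetical.

```python
import math


def rbf(x, y, gamma):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_predict(train_x, train_y, query, gamma):
    """Kernel-weighted average of training targets (a stand-in, NOT an SVR)."""
    weights = [rbf(x, query, gamma) for x in train_x]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed: fall back to the plain mean
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def loo_mse(xs, ys, gamma):
    """Leave-one-out mean squared error for a given kernel width."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # hold out sample i
        train_y = ys[:i] + ys[i + 1:]
        pred = kernel_predict(train_x, train_y, xs[i], gamma)
        squared_errors.append((pred - ys[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)


def select_gamma(xs, ys, candidates):
    """Return the candidate gamma with the minimum leave-one-out MSE."""
    return min(candidates, key=lambda g: loo_mse(xs, ys, g))


# Made-up one-descriptor data set, for illustration only.
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
candidates = [0.01, 0.1, 1.0, 10.0]
best_gamma = select_gamma(xs, ys, candidates)
print(best_gamma, loo_mse(xs, ys, best_gamma))
```

The same loop extends to 10-fold validation by holding out a block of samples instead of a single one, and to descriptor selection by comparing the error across descriptor subsets rather than across kernel widths.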
Here
In our QSAR analysis, the structural information of 34 ARC111 analogues was described using 2,923 molecular descriptors. Two groups of more important descriptors were obtained using two nonlinear descriptor selection methods and then used to model the activities of these ARC111 analogues with SVR. The two SVR models consistently outperformed the reference models in prediction accuracy on the test data. Our results offer new theoretical tools for drug design and development.
The authors acknowledge financial support from the Science Foundation for Distinguished Young Scholars of Hunan Province, China (No. 10JJ1005), the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20114320120005), the Research Foundation of Education Bureau of Hunan Province, China (No. 09C502) and the Science Foundation for Talents from Hunan Agricultural University of China (No. 07YJ05). The authors thank Mingshun Chen at USDA-ARS and the Department of Entomology, Kansas State University, Manhattan, Kansas, USA for his comments and suggestions during the manuscript preparation. The authors also thank Gang Qian, Lifeng Wang and Yijun Su at the Department of Bioinformatics, Hunan Agricultural University, Changsha, China for their help.
Single-factor effects of features in the SVR3 (
Four structural types of ARC111 analogues.
Comparative quantitative structure-activity relationship (QSAR) modeling of the independent test set, based on the literature dataset.

| | Stepwise MLR | PLS | ANN | SVR1 | SVR2 |
|---|---|---|---|---|---|
| Number of descriptors | 5 | 7 | 9 | 9 | 6 |
| | 0.201 | 0.167 | 0.050 | 0.141 | 0.061 |
| | 0.910 | 0.890 | 0.962 | 0.937 | 0.950 |
| | 0.730 | 0.775 | 0.933 | 0.811 | 0.918 |
Comparative QSAR modeling of the independent test set, based on high-dimensional descriptor selection using support vector regression (SVR).

| | Stepwise MLR | PLS | ANN | SVR3 | SVR4 |
|---|---|---|---|---|---|
| Number of descriptors | 5 | 7 | 9 | 7 | 11 |
| | 0.201 | 0.167 | 0.050 | 0.032 | 0.028 |
| | 0.910 | 0.890 | 0.962 | 0.964 | 0.971 |
| | 0.730 | 0.775 | 0.933 | 0.957 | 0.962 |
The descriptors retained by the high-dimensional descriptor selection nonlinearly (HDSN) and worst descriptor elimination multi-roundly (WDEM) methods and their

| Model | Group name | Descriptor name | |
|---|---|---|---|
| SVR3 | GSFRAG | | 26.555 |
| | 2D autocorrelations | | 25.175 |
| | Constitutional descriptors | | 12.210 |
| | 2D autocorrelations | | 12.114 |
| | Functional group counts | | 5.898 |
| | Topological charge indices | | 3.687 |
| | Geometrical descriptors | | 2.387 |
| SVR4 | BCUT descriptors | | 11.382 |
| | GSFRAGL | | 3.771 |
| | Randic molecular profiles | | 3.511 |
| | Eigenvalue-based indices | | 2.456 |
| | Constitutional descriptors | | 2.456 |
| | RDF descriptors | | 2.435 |
| | Walk and path counts | | 2.425 |
| | RDF descriptors | | 2.398 |
| | Topological descriptors | | 2.084 |
| | RDF descriptors | | 1.304 |
| | GETAWAY descriptors | | 0.599 |
Substituents and activities of 34 ARC111 analogues.
Experimental drugs  Theoretical drugs  


 
Compound  Type  Substituent  pIC_{50 (expt.)}  Compound  Type  Substituent  pIC_{50 (pred.)}  

 
R_{1}  R_{2}  R_{3}  R_{4}  R_{1}  R_{2}  R_{3}  R_{4}  
1  I  Me  Me  8.699  1  I  Me  Et  8.651  
2  Me  Bn  7.276  2  Me  8.172  
3  Et  Bn  7.114  3  Et  7.876  
4  Bn  6.523  4  7.388  
5  Bn  6.071  5  6.908  
6  Bn  Bn  6.420  6  III  Bn  6.668  
7  Et  Et  8.222  7  Et  7.208  
8  8.097  8  6.904  
9  H  Me  9.523  9  6.617  
10  H  Et  8.699  10  IV  Et  6.248  
11  H  8.523  11  6.102  
12  H  8.699  12  6.100  
13  H  Bn  7.796  
14  H  H  8.398  
15  Me  8.097 

16  Et  8.301  
17  II  8.523  
18  III  H  8.155  
19  Me  7.523  
20  IV  Bn  6.398 

21  H  7.046  
22  Me  6.523 
Four experimental compounds in the test set;
predicted values of 12 theoretical compounds by the SVR4 model.
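The activities in this table are given on the pIC_{50} scale. Assuming the usual definition, pIC_{50} = -log10(IC_{50}) with IC_{50} in mol/L (the surviving text does not restate it), the conversion in both directions can be sketched as follows. The function names are hypothetical.

```python
import math


def pic50_from_ic50(ic50_molar: float) -> float:
    """pIC50 = -log10(IC50), with IC50 in mol/L (assumed definition)."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)


def ic50_from_pic50(pic50: float) -> float:
    """Inverse conversion: IC50 in mol/L from a pIC50 value."""
    return 10.0 ** (-pic50)


# Under this definition an IC50 of 2 nM corresponds to a pIC50 of 8.699.
print(round(pic50_from_ic50(2e-9), 3))  # 8.699
```

Because the scale is logarithmic, a difference of one pIC_{50} unit corresponds to a tenfold difference in IC_{50}.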
Group and count of descriptors from the software PCLIENT.

| Group No. | Group of descriptors | Count | Group No. | Group of descriptors | Count |
|---|---|---|---|---|---|
| 1 | Constitutional descriptors | 48 | 13 | RDF descriptors | 150 |
| 2 | Topological descriptors | 119 | 14 | 3D-MoRSE descriptors | 160 |
| 3 | Walk and path counts | 47 | 15 | WHIM descriptors | 99 |
| 4 | Connectivity indices | 33 | 16 | GETAWAY descriptors | 197 |
| 5 | Information indices | 47 | 17 | Functional group counts | 121 |
| 6 | 2D autocorrelations | 96 | 18 | Atom-centered fragments | 120 |
| 7 | Edge adjacency indices | 107 | 19 | Charge descriptors | 14 |
| 8 | BCUT descriptors | 64 | 20 | Molecular properties | 28 |
| 9 | Topological charge indices | 21 | 21 | ETstate Indices | >300 |
| 10 | Eigenvalue-based indices | 44 | 22 | ETstate Properties | 3 |
| 11 | Randic molecular profiles | 41 | 23 | GSFRAG Descriptor | 307 |
| 12 | Geometrical descriptors | 74 | 24 | GSFRAGL Descriptor | 886 |
| Total: | | | | | >3000 |
This group of descriptors did not exist in the default state.