A Support Vector Machine Classification Model for Benzo[c]phenanthridine Analogues with Topoisomerase-I Inhibitory Activity

Benzo[c]phenanthridine (BCP) derivatives have been identified as topoisomerase I (TOP-I) targeting agents with pronounced antitumor activity. In this study, a support vector machine (SVM) model was built on a series of 73 analogues to classify BCP derivatives according to their TOP-I inhibitory activity. The best SVM model, with a total accuracy of 93% on the training set, was obtained using a set of seven descriptors selected from a large pool by a random forest algorithm. When this SVM classifier was validated internally on a test set of 15 compounds, it reached an overall accuracy of up to 87% and a Matthews correlation coefficient (MCC) of 0.71. For two external test sets, 89% and 80% of the BCP compounds, respectively, were correctly predicted. These results indicate that our SVM model could be used as a filter for designing new BCP compounds with higher TOP-I inhibitory activity.


Introduction
The topoisomerases (TOP) are enzymes involved in processes such as replication, repair, transcription, recombination and segregation of DNA. Type I topoisomerases are the target of several anticancer agents, which stabilize the DNA-enzyme cleavage complex and thereby cause DNA damage and cytotoxicity [1,2]. Among the agents with targeted anti-topoisomerase activity, alkaloids of the benzo[c]phenanthridine (BCP) family are well known [1][2][3][4]. Many BCP analogues have been synthesized and evaluated for their activity on topoisomerase I as well as their cytotoxicity. Of those, ethoxidine, NK-109 and topovale (ARC 111) are potential candidates for cancer chemotherapy [1][2][3][4][5] (Figure 1). The in vitro TOP-I inhibitory activity is expressed as the REC value, the relative effective concentration with respect to topotecan [6][7][8][9][10][11][12][13][14][15][16][17]. Compounds with TOP-I inhibitory activity can therefore be divided into two classes using topotecan as the threshold. In this study, a support vector machine (SVM) approach was used to build a classification model for BCP analogues based on their anti-topoisomerase-I activity. The SVM model could be applied to search for new BCP analogues that inhibit TOP-I for cancer treatment.

Feature Selection
Three feature selection approaches, namely mRMR (Max-Relevance, Min-Redundancy), GA (genetic algorithm) and RF (random forest), were applied to the dataset to identify a set of chemical descriptors related to bioactivity from the 2,032 molecular descriptors calculated by Dragon. Three sets of molecular descriptors were selected, containing 10, 16 and seven descriptors for the mRMR, GA and RF methods, respectively. Table 1 shows the molecular descriptors selected via the RF method. These sets of descriptors were then used to develop the SVM models. For any SVM, it is necessary to select the optimal parameters of the kernel function (C, γ); these values were determined for each set of descriptors and are shown in Table 2. The classification results obtained with the SVM algorithm (the e1071 package in R) for the three descriptor sets (mRMR, GA and RF) are shown in Table 3. The total accuracies of these SVM models are greater than or equal to 0.91 on the training sets and greater than or equal to 0.74 for cross-validation. However, because SVM is a supervised learning approach that is fitted to the training data, the results on the test set and the external set are the more relevant indicators, and these showed relatively accurate performance of the models. The SVM model developed from the descriptors selected by the RF method gave better classification power than the models from the other two methods.

SVM Classification Model
A total of 82 compounds were used for SVM modeling and classified as active or inactive based on their relative effective concentration (REC) compared to topotecan. The REC of topotecan (REC = 1) was selected as the threshold separating actives from inactives. A TOP-I active compound, with activity equal to or stronger than that of topotecan, is labeled "1"; a TOP-I inactive compound, with activity weaker than that of topotecan, is labeled "0". The set of molecular descriptors selected via the RF method was used to create the final SVM classification model (the e1071 package in R) based on anti-topoisomerase-I activity. In addition, two other classification approaches, namely the SVM in the kernlab package and "randomForest" (RF) in R, were applied, and the classification results are presented in Table 4. In general, both SVM packages (e1071 and kernlab) gave better results than the random forest method, with a difference in total accuracy of 20%. On the training set, the two SVM packages (e1071, kernlab) showed similar classification power. On the test set and the external set, however, the e1071 package gave the better results. The Matthews correlation coefficient (MCC) of this package was stable across the training, test and external test sets, with values of 0.82, 0.71 and 0.80, respectively. For the kernlab package, the MCC for the training set was as high as 0.82, whereas the MCC for the test set was only 0.54. Normally, an MCC larger than 0.4 indicates that a model has predictive power [18]. The RF classification method combined with the set of descriptors chosen by RF gave MCC = 0.35 on the external set, which indicates that the RF classification model is unable to predict the biological activity of the compounds in this study.
The SVM-e1071 classification model in R, with a set of descriptors selected from a large pool by an RF algorithm, showed the best results.
To further validate the use of SVM-e1071 for TOP-I classification, a 50-fold Y-scrambling procedure was performed on the training sets [19][20][21]. The Y-scrambling analyses gave a total accuracy of 0.59. Our models therefore perform significantly better than models built on randomly assigned class labels, with total accuracy values of 0.93 versus 0.59, respectively. It should be noted, however, that SVM is a supervised learning method whose learning strategy tries to keep the training error to a minimum, which explains the relatively "good" performance of this method on the Y-scrambled data sets.
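The Y-scrambling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' R workflow: a simple nearest-centroid classifier stands in for the SVM, the toy data are synthetic, and all function names are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid stand-in for the SVM: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(sorted(centroids), key=lambda label: sq_dist(centroids[label]))

def training_accuracy(X, y):
    """Fit on (X, y) and return the fraction of training samples recovered."""
    model = fit_centroids(X, y)
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def y_scrambling(X, y, n_rounds=50, seed=0):
    """Mean training accuracy over n_rounds random permutations of the labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the descriptor-activity relationship
        scores.append(training_accuracy(X, y_perm))
    return sum(scores) / n_rounds

# Synthetic, well-separated toy data (not the BCP descriptors).
X = ([[0.1 * i, 0.1 * (i % 3)] for i in range(10)]
     + [[5 + 0.1 * i, 5 + 0.1 * (i % 3)] for i in range(10)])
y = [0] * 10 + [1] * 10

true_acc = training_accuracy(X, y)
scrambled_acc = y_scrambling(X, y)
print(f"true labels: {true_acc:.2f}, scrambled labels: {scrambled_acc:.2f}")
```

A model that has learned a real descriptor-activity relationship should score clearly higher on the true labels than on the permuted ones, which is the comparison reported above (0.93 versus 0.59).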

Validation and Application
The final SVM model was validated not only by a cross-validation procedure but also on an external dataset (not belonging to the dataset used to create the SVM classification model). The classification power for BCPs by anti-TOP-I activity was 0.87-0.93 in total accuracy, with a cross-validation error of 0.22. Moreover, the final SVM model was applied to classify a set of 10 BCP analogues recently synthesized by LaVoie et al., referred to as the application set or external test set 2; their chemical structures and TOP-I activities are shown in Table 5 [22][23][24][25][26]. According to the literature, seven of these 10 BCP analogues have stronger anti-TOP-I activity than topotecan and three are weaker than topotecan. The classification model achieved a correct prediction rate of 80% (8/10), and the detailed results are presented in Table 4. The accuracy on positives was 100%, i.e., the classification model is more accurate for compounds with stronger activity than topotecan.
Table 5. Chemical structures of the ten benzo[c]phenanthridine derivatives in the application set, their topoisomerase I inhibitory activity (REC) and the classification results from the final SVM model. Classification terms: "1" denotes a TOP-I active compound with activity equal to or stronger than that of topotecan; "0" denotes a TOP-I inactive compound with activity weaker than that of topotecan.

This result fully meets our goal of finding new inhibitors of TOP-I for cancer chemotherapy. However, one limitation of machine learning approaches such as this one is that they do not indicate the role of the functional groups related to biological activity. Hence, combining this SVM classification model with molecular docking studies [27,28] and with the related 2D- and 3D-QSAR models on the cytotoxicity of BCPs [5] could provide insight into the molecular basis of TOP-I inhibition.

Discussions
In this study, the TOP-I inhibitory activity of topotecan, a synthetic derivative of camptothecin and one of the most potent anticancer drugs in clinical use, was used as the threshold for the SVM classification models. Topotecan, ethoxidine, fagaronine and BCP-related compounds are selective for TOP-I over TOP-II. These compounds act as DNA intercalators through two mechanisms: (i) TOP-I poisoning, as with fagaronine; and (ii) TOP-I suppression, as with ethoxidine [27,28]. Our preliminary results from in silico modeling indicated that BCP compounds may inhibit TOP-I activity via the suppression mechanism.
SVM is a machine learning method that has been used for many kinds of pattern recognition problems. SVMs have been successfully adapted to both regression and classification [29]. In this study, the mRMR, GA and RF algorithms were used to select descriptors from the large descriptor set derived from the Dragon software, and the descriptors selected by the RF method gave better classification power than the others. The SVM also gave better results than RF in classifying BCP analogues by anti-TOP-I activity. With its high accuracy (80-90%), fast calculation without the need for 3D conformations and accurate prediction of compounds with positive activity, the final SVM model could be applied to search for and design new BCP analogues with higher topoisomerase I inhibitory activity. SVMs are a black-box technique: they deliver predictions but not the relationship between molecular descriptors and bioactivity. The model could be interpreted in terms of molecular descriptors and bioactivity if the weight of each descriptor were explicitly solved [29]. It should be noted that our SVM model could be applied in screening, prior to synthesis, to identify potential TOP-I inhibitors among new BCP-like compounds. However, the development of a useful drug has to be addressed at the systems level of drug discovery and development and must take into consideration many factors, including solubility, ease of drug formulation, selectivity and ADME-Tox.

Dataset
The BCP analogues in this study (82 compounds), whose topoisomerase I inhibitory activity was evaluated by a DNA cleavage assay, were collected from the work of LaVoie et al. [6][7][8][9][10][11][12][13][14][15][16][17]. The biological activity is expressed as the relative effective concentration (REC) compared to topotecan, the reference compound, whose value is arbitrarily set at 1.0. This parameter allows molecules to be compared based on the cleavage of plasmid DNA in the presence of human topoisomerase I. A compound with an REC value higher than 1 is therefore a weaker topoisomerase I inhibitor than topotecan, and vice versa. Based on the REC values, the data set was grouped into two classes: 58 analogues were assigned to the inactive class (REC value > 1, negatives or inactives) and 24 analogues to the active class (REC value ≤ 1, high TOP-I blockade, positives or actives). The chemical structures of the 82 benzo[c]phenanthridine derivatives and their REC values are presented in Table 6.
Table 6. Chemical structures of the 82 benzo[c]phenanthridine derivatives, their topoisomerase I inhibitory activity (REC) and the classification results from the final SVM model. Classification terms: "1" denotes a TOP-I active compound with activity equal to or stronger than that of topotecan; "0" denotes a TOP-I inactive compound with activity weaker than that of topotecan.
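The class assignment described above reduces to a single threshold test on the REC value. A minimal sketch follows; the function name and the example REC values are hypothetical and are not taken from Table 6.

```python
def classify_by_rec(rec, threshold=1.0):
    """Return 1 (active) when REC <= threshold, otherwise 0 (inactive).

    A lower REC means a stronger TOP-I inhibitor; topotecan itself has REC = 1.
    """
    return 1 if rec <= threshold else 0

# Hypothetical REC values, for illustration only.
example_rec = {"cpd_A": 0.3, "cpd_B": 1.0, "cpd_C": 10.0, "cpd_D": 100.0}
labels = {name: classify_by_rec(rec) for name, rec in example_rec.items()}
print(labels)
```

A compound with exactly the activity of topotecan (REC = 1) falls into the active class, consistent with the "equal to or stronger" definition.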

Training and Test Sets
The training and test sets were generated by random division. First, a set of nine compounds was set aside as the external set and was not used to develop the models. The remaining analogues were split randomly five times into training sets (80%) and test sets (20%) using the R software [30,31]. The numbers of compounds in each subset are presented in Table 7.
a Actives: compounds whose activity is equal to or stronger than that of topotecan; b Inactives: compounds whose activity is weaker than that of topotecan.
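The partitioning scheme can be sketched as below. This is an illustration under assumptions: the external compounds are drawn at random here (the text does not say how they were chosen), no stratification by class is applied, and the function name, seed and compound identifiers are hypothetical.

```python
import random

def split_dataset(compound_ids, n_external=9, test_fraction=0.2,
                  n_repeats=5, seed=0):
    """Hold out an external set, then draw repeated random train/test splits."""
    rng = random.Random(seed)
    ids = list(compound_ids)
    rng.shuffle(ids)
    # Assumption: external compounds are picked at random.
    external, remaining = ids[:n_external], ids[n_external:]
    n_test = round(len(remaining) * test_fraction)
    splits = []
    for _ in range(n_repeats):
        shuffled = remaining[:]
        rng.shuffle(shuffled)
        # The first n_test compounds form the test set, the rest the training set.
        splits.append((shuffled[n_test:], shuffled[:n_test]))
    return external, splits

external, splits = split_dataset(range(1, 83))  # 82 compounds, numbered 1..82
for train, test in splits:
    print(len(train), len(test))
```

With 82 compounds, nine are held out and the remaining 73 are divided into 58 training and 15 test compounds in each of the five repeats, which matches the test-set size of 15 reported in the abstract.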

Molecular Descriptors and Feature Selection
The Dragon software was used to calculate 2,032 molecular descriptors [32,33]. Several feature selection steps were then performed to reduce the dimensionality of the descriptor space. First, descriptors containing only zero values were eliminated before any mathematical method was applied, leaving 533 descriptors for the next steps. Next, highly correlated descriptors (r > 0.90) were removed to avoid redundancy and to handle the data more efficiently in terms of computational resources and intuitive perception of the chemical space. A total of 103 descriptors remained and were scaled to unit variance for use as input to the SVMs. Finally, three feature selection methods, namely mRMR (Max-Relevance, Min-Redundancy), GA (genetic algorithm) and RF (random forest), were applied to select the optimum set of molecular descriptors [30,34].
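The pre-filtering steps (removal of all-zero descriptors, removal of highly correlated descriptors, scaling) can be sketched as follows. The sketch makes several assumptions not stated in the text: the correlation filter uses the absolute Pearson coefficient and keeps the first descriptor of each correlated pair, and scaling also centers each descriptor on its mean. All names and the toy matrix are hypothetical.

```python
import math

def column(matrix, j):
    """Extract descriptor j (one value per compound)."""
    return [row[j] for row in matrix]

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def filter_descriptors(matrix, r_max=0.90):
    """Return indices of descriptors kept after the zero and correlation filters."""
    n_cols = len(matrix[0])
    nonzero = [j for j in range(n_cols) if any(v != 0 for v in column(matrix, j))]
    kept = []
    for j in nonzero:
        # Assumption: compare |r| against already-kept descriptors only.
        if all(abs(pearson(column(matrix, j), column(matrix, k))) <= r_max
               for k in kept):
            kept.append(j)
    return kept

def scale_unit_variance(values):
    """Center on the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]

# Toy descriptor matrix: rows are compounds, columns are descriptors.
X = [
    [0, 1.0, 2.0, 5.0],
    [0, 2.0, 4.1, 3.0],
    [0, 3.0, 6.0, 6.0],
    [0, 4.0, 8.2, 1.0],
]
kept = filter_descriptors(X)
scaled = [scale_unit_variance(column(X, j)) for j in kept]
print(kept)
```

In the toy matrix the first descriptor is all zeros and the third is almost a multiple of the second, so only the second and fourth descriptors survive.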

Support Vector Machine
Support vector machines (SVMs), among the most successfully applied of the newer classification algorithms, were introduced to the machine learning community by Vapnik. Smola and Schölkopf provided an extensive tutorial on SVMs [30][31][32][33][34]. The underlying idea of an SVM classifier is to map linearly inseparable input data into a higher dimensional space where the data can be linearly separated by a maximal-margin separating hyperplane. The SVM is a classification system based on the supervised learning approach. In this study, the SVM algorithm in the e1071 package in R was used with a kernel function [30,31]. The strategy for the SVM classification model is shown in Figure 2.
Selection of optimal parameters of the kernel function: The RBF kernel function has two parameters (C and γ), and their selection is one of the two critical issues in developing a good SVM model (the other is feature selection) [35][36][37]. To find the optimal parameters, a grid search (function tune in R) was carried out following the process described in Figure 3. The training set and test set are used to find the optimal pair of parameters (C and γ) of the kernel function. Pairs of parameters were tested on grids whose intervals were narrowed step by step. The pair giving the minimal cross-validation error is chosen [30,34].
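A grid search with step-by-step narrowing of the intervals can be sketched as below. This is not the e1071 tune function: the search on a log2 scale, the number of rounds and grid points, and all names are assumptions. The error surface is a synthetic stand-in for a real cross-validation error, with its minimum placed at C = 4, γ = 0.25 purely for illustration.

```python
import itertools
import math

def grid_search(cv_error, log2_c_range, log2_gamma_range, n_rounds=3, points=5):
    """Coarse-to-fine search for the (C, gamma) pair with minimal cv_error."""
    (c_lo, c_hi), (g_lo, g_hi) = log2_c_range, log2_gamma_range
    best = None
    for _ in range(n_rounds):
        c_step = (c_hi - c_lo) / (points - 1)
        g_step = (g_hi - g_lo) / (points - 1)
        c_grid = [c_lo + i * c_step for i in range(points)]
        g_grid = [g_lo + i * g_step for i in range(points)]
        # Evaluate every pair on the current grid and keep the best one.
        best = min(itertools.product(c_grid, g_grid),
                   key=lambda p: cv_error(2 ** p[0], 2 ** p[1]))
        # Narrow both intervals to one grid step around the current best pair.
        c_lo, c_hi = best[0] - c_step, best[0] + c_step
        g_lo, g_hi = best[1] - g_step, best[1] + g_step
    return 2 ** best[0], 2 ** best[1]

def toy_cv_error(c, gamma):
    """Synthetic error surface (NOT a real cross-validation error)."""
    return (math.log2(c) - 2) ** 2 + (math.log2(gamma) + 2) ** 2

best_c, best_gamma = grid_search(toy_cv_error, (-4, 8), (-8, 4))
print(best_c, best_gamma)
```

In practice, cv_error would train the SVM with the given pair and return its cross-validation error on the training data; only that function would need to be replaced.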

Evaluation Criteria for Classification Model
Performance of the SVM models was measured using standard parameters for classification models [18,19,[38][39][40]. They are defined as follows: (i) overall classification accuracy of a prediction model, accuracy = (tp + tn)/(tp + fp + tn + fn); (ii) sensitivity (recall, accuracy on actives) = tp/(tp + fn); (iii) specificity (accuracy on inactives) = tn/(tn + fp); (iv) precision on actives = tp/(tp + fp); (v) precision on inactives = tn/(tn + fn); and (vi) Matthews correlation coefficient, MCC = (tp × tn − fp × fn)/sqrt[(tp + fp)(tp + fn)(tn + fp)(tn + fn)]. The quality of a classification model is generally measured by these parameters, which are estimated for the whole set and by applying a cross-validation protocol based on a leave-one-out (LOO) procedure. In all equations, tp = number of true positives, tn = number of true negatives, fp = number of false positives, and fn = number of false negatives.
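The evaluation criteria can be computed from the four confusion-matrix counts as in the sketch below, which uses the standard definition of the MCC. The function name is hypothetical. The example counts are inferred from the figures reported for the application set (seven actives, three inactives, 8/10 correct, 100% accuracy on actives), assuming that "positive accuracy" means sensitivity.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator/denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, precisions and MCC."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_ratio(tp, tp + fn),          # accuracy on actives
        "specificity": safe_ratio(tn, tn + fp),          # accuracy on inactives
        "precision_actives": safe_ratio(tp, tp + fp),
        "precision_inactives": safe_ratio(tn, tn + fn),
        "mcc": safe_ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Counts inferred from the text for the application set (an assumption).
metrics = classification_metrics(tp=7, tn=1, fp=2, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The example shows why accuracy alone can mislead on unbalanced sets: every active is recovered, yet only one of the three inactives is classified correctly, and the MCC reflects this.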

Conclusions
In this study, SVM was used to build a model for the prediction and classification of 73 BCP analogues based on their anti-topoisomerase-I activity. The best model was obtained with the SVM-e1071 package in R, with the optimal settings of the kernel function (C = 4, γ = 0.25) and the set of descriptors selected by the RF method. This final SVM model correctly predicts the anti-topoisomerase activity of 93% of the compounds in the training set and 87% of those in a test set. It was also validated on an external set not included in the dataset used to develop the model, and then on the application set. Total accuracies of 89% (correct prediction for eight out of nine compounds) and 80% (correct prediction for eight out of 10 compounds) were obtained for the external and application sets, respectively. Furthermore, the model also proved able to correctly classify BCP analogues with positive activity, with an accuracy from 80 to 100% overall. With its high accuracy (80-90%) and its fast and accurate prediction of compounds with positive activity, our SVM model could be applied to search for and design new BCP analogues with higher topoisomerase I inhibitory activity.