#### 2.1. The Benzo[c]phenanthridines and Their Biological Activity Data

The compounds studied in this work were BCP derivatives with a core similar to that of the two alkaloids nitidine and fagaronine shown in Figure 1 [1,14]. The in vitro TOP-I inhibition data (REC, the relative effective concentration for TOP-I inhibition relative to topotecan) and IC_{50} values (the concentration of compound causing 50% cell growth inhibition against tumor cell lines) on RPMI8402, CPT-K5, P388, CPT45, KB3-1, KBV-1, KBH5.0, U937 and U937rs for 137 chemical structures related to BCPs were collected from the literature [3,4,5,6,7,8,9,10,11]. However, bioactivity data are not available for every compound on every cell line. Compound numbers and available bioactivity data are listed in Table 1. The U937 and U937rs cell lines, for which only a limited number of cytotoxicity data were available, were not used for developing the QSAR models. REC and IC_{50} values were converted to their negative logarithms (pREC, pIC50) for use in the QSAR studies. Chemically, the dataset can be divided into six groups of general skeletons, presented in Figure 2; the number of compounds in each group is shown in Table 1. The numbers of compounds in the training and external test sets are presented in Table 2. The detailed chemical structures and bioactivities of the BCP dataset are presented in the Supporting Information.
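The conversion to negative logarithms can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `to_p_value` is hypothetical, a base-10 logarithm is assumed, and the concentration unit is not fixed here, so the resulting values are comparable only when all inputs share one unit.

```python
import math

def to_p_value(value):
    """Return the negative base-10 logarithm of a positive activity value.

    `value` may be an IC50 or a REC. Assumption: all values in one
    dataset are expressed in the same unit.
    """
    if value <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value)

# Illustrative numbers only, not taken from the BCP dataset.
for raw in (0.01, 1.0, 10.0):
    print(raw, to_p_value(raw))
```

On this scale a smaller IC_{50} (a more potent compound) gives a larger pIC50, and values above one concentration unit give negative pIC50 values.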

#### 2.2. Over-fitting Problem

A well-accepted QSAR model should be able to accurately predict the activity of a new compound that is not included in the training set. Over-fitting or over-estimation occurs when the predictive ability on an external set is poor; some papers use r^{2} for this assessment [15,16]. However, the squared correlation coefficient is not reliable in all cases and does not fully convey the quality of the model predictions. In this study, the 3D QSAR model of the RPMI8402 cell line and the 2D QSAR model of the CPT45 cell line are given as examples. Judged by these statistics alone, the RPMI-fs45 model (R^{2} = 0.812 for the training set; r^{2} = 0.701 for the test set) would be assessed as a good model able to predict accurately, but in fact its predictive power is poor (Figure 3A): the external set compounds (red triangles) tend to fall outside the two limit lines of the 95% confidence level of the training set (blue circles). In contrast, the CPT45-2D model gave a reasonable result under the 95% confidence level assessment, as shown in Figure 3B. Hence, high r^{2} values for the training and test sets do not necessarily indicate a model with good predictive power.
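One general reason a high r^{2} can coexist with poor predictions is that the squared correlation coefficient is unchanged by a constant shift of the predicted values. The sketch below illustrates this with made-up numbers (not data from this study) and a hypothetical band half-width of 0.5 log units; it is not a claim about what happens in the RPMI-fs45 model.

```python
def pearson_r2(xs, ys):
    """Squared Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Made-up values: every prediction is 1.0 log unit too high.
experimental = [0.0, 0.5, 1.0, 1.5, 2.0]
predicted = [y + 1.0 for y in experimental]

half_width = 0.5  # hypothetical band around the line y = x
r2 = pearson_r2(predicted, experimental)
inside = sum(
    1 for x, y in zip(predicted, experimental) if abs(y - x) <= half_width
) / len(predicted)

print("r2 =", r2)                    # perfect correlation
print("fraction in band =", inside)  # yet no point lies within the band
```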

#### 2.3. Model Assessment Method

For QSAR validation, several parameters such as R^{2}, q^{2}, the standard errors of the training and test sets, Y-scrambling analyses, and confidence interval estimators were used to judge the QSAR models [12,13,16,17,18,19,20]. A confidence level is the result of statistical estimation based on observations of a population. This level can hardly reach 100%; therefore, statisticians often use 90%, 95% or 99% confidence intervals [15,18]. In classical QSAR studies, the 95% confidence interval is commonly used as the parameter for validating QSAR models. In this study, the QSAR model evaluation method based on the confidence level is presented below.

At a confidence level of 95%, the limit is calculated so that 95% of the training set lies in the area bounded by the upper and lower bounds, as shown in Figure 3. The two bounds are almost straight lines parallel to the baseline y = x. If the two bounds meet the horizontal axis at the points x1 = −x2 = δ (δ > 0), they can be approximated by the two lines y = x − δ and y = x + δ. The δ value represents the desired predictability of the model, which depends on the squared correlation coefficient R^{2} and the standard error of the predicted results compared with the experimental values of the training set. If the model gives a predicted value β, then the experimental value is expected to lie in the range β ± δ in 95% of cases.

The predictive ability of the model is confirmed when the external evaluation set, with coordinates (x_{i} = predicted pIC50, y_{i} = experimental pIC50), lies within the boundaries of the two lines. The type I error probability is 5% (if the external set is large enough, about 5% of the compounds will lie outside the confidence interval).

The assessment in this study is done as follows:

- Determine δ from the training set.

- Use an external set to assess the reliability of δ. The reliability is evaluated by the fraction of external-set compounds whose coordinates fall within the confidence limits. This fraction is not required to be greater than or equal to the confidence level (95%), but the difference between the two must not be too large. In this study, the assessment of δ is marked as positive (+), negative (−) or unknown (+/−).

- If r^{2} is also greater than 0.5, the model can be proposed to predict beyond the range of values evaluated.
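The first two steps above can be sketched as follows. The text does not give an explicit formula for δ, so this sketch assumes δ is the empirical quantile of the absolute residuals of the training set (the smallest half-width that covers the requested fraction of points); a multiple of the standard error would be another plausible choice. The function names are hypothetical, and no cut-off for "too large" a difference is assumed, so only the coverage is reported.

```python
import math

def delta_from_training(predicted, experimental, level=0.95):
    """Smallest delta such that at least `level` of the training points
    satisfy |experimental - predicted| <= delta (assumed definition)."""
    residuals = sorted(abs(y - x) for x, y in zip(predicted, experimental))
    needed = math.ceil(level * len(residuals))  # points that must be covered
    return residuals[needed - 1]

def coverage(predicted, experimental, delta):
    """Fraction of points lying between y = x - delta and y = x + delta."""
    inside = sum(
        1 for x, y in zip(predicted, experimental) if abs(y - x) <= delta
    )
    return inside / len(predicted)

# Made-up values for illustration; a 75% level keeps the toy set small.
train_pred = [0.0, 0.0, 0.0, 0.0]
train_exp = [0.1, -0.2, 0.3, -0.4]
delta = delta_from_training(train_pred, train_exp, level=0.75)

ext_pred = [1.0, 2.0]
ext_exp = [1.2, 2.5]
print("delta =", delta)
print("external coverage =", coverage(ext_pred, ext_exp, delta))
```

Comparing the external coverage with the nominal level then gives the basis for the (+), (−) or (+/−) label.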

In addition, several new metrics proposed by Roy’s research group, ${r}_{m}^{2}$, $\overline{{r}_{m}^{2}}$ and $\mathsf{\Delta}{r}_{m}^{2}$, were also calculated for both the training and test sets to validate our QSAR models [21,22,23]. These additional validation parameters were used to assess the predictive quality of the QSAR models. For a good QSAR model, the value of $\overline{{r}_{m}^{2}}$ should be greater than 0.5 and the value of $\mathsf{\Delta}{r}_{m}^{2}$ should preferably be lower than 0.2 for both the training and test sets. The equations for calculating the ${r}_{m}^{2}$, $\overline{{r}_{m}^{2}}$ and $\mathsf{\Delta}{r}_{m}^{2}$ metrics can be found in the Supporting Information.
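For orientation, the form in which these metrics are commonly stated in the QSAR literature is sketched below. It is given here as an assumption about the general definitions, not as a reproduction of the equations in the Supporting Information. Here $r^{2}$ is the squared correlation coefficient between observed and predicted values, $r_{0}^{2}$ is the corresponding value for a regression line forced through the origin, and primed quantities are obtained with the axes interchanged.

$$
r_{m}^{2} = r^{2}\left(1 - \sqrt{r^{2} - r_{0}^{2}}\right), \qquad
r_{m}^{\prime 2} = r^{2}\left(1 - \sqrt{r^{2} - r_{0}^{\prime 2}}\right)
$$

$$
\overline{r_{m}^{2}} = \frac{r_{m}^{2} + r_{m}^{\prime 2}}{2}, \qquad
\Delta r_{m}^{2} = \left| r_{m}^{2} - r_{m}^{\prime 2} \right|
$$

Under these definitions, $\overline{r_{m}^{2}}$ is high only when the correlation is high and the points also lie close to the line y = x, and a small $\Delta r_{m}^{2}$ indicates that this holds regardless of which variable is placed on which axis.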

#### 2.4. Hologram, 2D and 3D QSAR Modeling

In this study, eight 2D QSAR models, eight hologram QSAR models and thirteen 3D QSAR models were developed for TOP-I inhibitory activity and cytotoxicity on the RPMI8402, CPT-K5, P388, CPT45, KB3-1, KBV-1 and KBH5.0 tumor cell lines. The results are presented in Table 3 (2D), Table 4 (hologram) and Table 5 (3D), together with the assessments of the corresponding models at a confidence level of 95%. Based on the 95% confidence interval, the δ value, the assessment and the range of prediction were calculated for all models obtained. Several models have good R^{2} and q^{2} values for the training and test sets but fail to show predictive power for the external test set when the confidence intervals are applied (Table 5).

For the RPMI8402 and KB3-1 cell lines and for TOP-I inhibitory activity, the models show good correlation, and all three QSAR methods give reasonable results. The hologram, 2D and 3D QSAR models built on pREC (topoisomerase inhibitory activity) and on pIC50 for the RPMI8402 and KB3-1 cell lines showed not only significant statistical quality but also predictive ability, with a squared correlation coefficient R^{2} = 0.584–0.768, a squared cross-validation coefficient q^{2} = 0.406–0.594 and a squared predictive correlation coefficient for the external set r^{2} = 0.514–0.795. For the RPMI8402 and KB3-1 cell lines, the largest ranges of prediction were [−1:3] from the hologram model and [−0.5:2.2] from the 2D QSAR model, respectively. The best range of prediction for TOP-I inhibition, [−2.5:1], was achieved with the 3D QSAR model. Based on the calculated ${r}_{m}^{2}$, $\overline{{r}_{m}^{2}}$ and $\mathsf{\Delta}{r}_{m}^{2}$ metrics, several good QSAR models are highlighted in bold in Table 3, Table 4 and Table 5. Details of the 29 QSAR models are available in the Supporting Information.

The QSAR models for topoisomerase inhibitory activity and for cytotoxicity on the RPMI8402 and KB3-1 cell lines were used for further investigation of an application set. Prediction on the application set, which contains 1214 newly designed virtual compounds, yielded a short list of 94 compounds with better predicted antitumor activity. Several selected compounds with their predicted bioactivity values are listed in the Supporting Information. Analysis of the results from our QSAR models reveals general relationships between the chemical structures and the antitumor activity of BCP derivatives, which are summarized in Figure 4 and described as follows:

- (1).
Steric interactions play an important role in determining the bioactivity of the BCPs against many tumor cell lines, including both cytotoxicity and TOP-I inhibitory ability. Methoxy substituents at positions 8 and 9 of the skeleton are necessary for the biological effects. The results show that a methoxy group at position 2 is essential for bioactivity, while one at position 3 is not. Substituents at positions 11 and 12 affect the activity and should be straight chains of no more than 4-5 carbons with bulky end groups.

- (2).
Reducing the number of nitrogen atoms in the ring system and increasing the number of nitrogen atoms in the substituents can improve the bioactivity. Nitrogen at position 6 gave a better effect than at position 5.

- (3).
Substituents at positions 11 and 12 can have a positive effect on cytotoxicity and TOP-I inhibitory activity. A substituent at position 12 has a stronger effect on bioactivity than one at position 11.

Previous studies indicated that topotecan, a synthetic derivative of camptothecin, is one of the most potent anticancer drugs in clinical use [24]. Topotecan, ethoxidine, fagaronine and BCP-related compounds showed selectivity for TOP-I over TOP-II. These compounds act as DNA intercalators and have two mechanisms: (i) TOP-I poison activity, like fagaronine; and (ii) TOP-I suppressor activity, like ethoxidine [24,25]. Our preliminary results from in silico modeling indicate that BCP compounds may inhibit TOP-I activity via the suppression mechanism. This QSAR study indicates the important role of the natural functional groups related to biological activity, as shown in Figure 4. Hence, combining our QSAR models with other classification models for predicting TOP-I inhibition and cytotoxicity and with molecular docking studies [12,25,26] could provide insight into the molecular basis of the antitumor and TOP-I inhibitory activity of BCP derivatives.