Prediction of Anticancer Peptides with High Efficacy and Low Toxicity by Hybrid Model Based on 3D Structure of Peptides

Recently, anticancer peptides (ACPs) have emerged as unique and promising therapeutic agents for cancer treatment compared with antibody and small molecule drugs. In addition to experimental methods of ACP discovery, accurate machine learning models for ACP prediction are also needed. In this study, features were extracted from the three-dimensional (3D) structure of peptides to develop the models, in contrast to most previous computational models, which are based on sequence information. In order to develop ACPs with higher potency, higher selectivity and lower toxicity, separate models for predicting ACPs, hemolytic peptides and toxic peptides were established from peptide 3D structures. Multiple datasets were collected according to whether the peptide sequence was chemically modified. After feature extraction and screening, diverse algorithms were used to build the models. Twelve models with excellent performance (Acc > 90%) on the ACPs mixed datasets were combined into a hybrid model to predict candidate ACPs, and the optimal models for hemolytic peptides (Acc = 73.68%) and toxic peptides (Acc = 85.5%) were then used for safety prediction. Novel ACPs were found using these models, and five peptides were randomly selected to determine their anticancer activity and toxic side effects in in vitro experiments.


Introduction
According to the latest cancer statistics, cancer incidence and death rates are increasing year by year [1,2]. Traditional cancer treatments mainly include surgery, radiation therapy, chemical drugs and macromolecular targeted drugs. However, cancer treatment continues to face the challenge of increasing resistance to chemical and receptor-targeted anticancer drugs. Researchers have identified a class of bioactive peptides with antitumor activity, named anticancer peptides (ACPs), which are found in a wide range of organisms, including mammals, amphibians, insects, plants and microorganisms [3]. As potential new drugs for cancer treatment, ACPs are cationic amphiphilic peptides with a length of about 5-50 amino acids, characterized by a simple structure, ease of synthesis, ease of chemical modification and low immunogenicity [4]. Because the proportion of negatively charged phosphatidylserine on the surface of cancer cells is higher than on normal cells, cationic amphiphilic peptides may be effective and highly selective antitumor drugs. The antitumor mechanisms of ACPs can be divided into two types: selective membrane destruction and non-membrane dissolution, the latter including inhibition of angiogenesis and promotion of tumor cell apoptosis [5]. Despite these advantages, ACPs still face challenges before becoming effective clinical agents, such as poor stability, hemolysis and toxicity to normal tissue cells. The stability of peptides can be improved in various ways, including incorporation of unnatural amino acids, cyclization and modification of the chemical skeleton [6]. Therefore, it is essential to develop methods to identify safer and more effective ACPs.
Machine learning, which derives from artificial intelligence and statistics, is currently one of the key research directions in the field of data analysis. The application of machine learning algorithms in drug research and development greatly speeds up early drug screening. Basic machine learning algorithms include naive Bayes (NB), support vector machine (SVM), random forest (RF), K nearest neighbor (KNN), artificial neural networks (ANNs), ensemble algorithms, etc. Because identifying potential novel ACPs by experimental methods requires considerable time and expense, various machine learning approaches have been applied to ACP recognition to help wet-laboratory researchers discover novel ACPs [7].
Despite tremendous advances in the field of ACPs prediction, these approaches extract features almost entirely from peptide sequences and use very similar datasets. In addition to the basic characteristics of the peptide sequence, the 3D structure of ACPs plays a key role in inhibiting tumor cell proliferation. SATPdb [34] is a database that annotates the tertiary structure of various therapeutic peptides using PEstrMOD [35], homology modeling [36] and I-TASSER Suite [37] methods, which provides a data basis for structure-based activity and toxicity prediction of ACPs. Consequently, in this study, we collected five datasets of 3D structures of ACPs and used a variety of machine learning algorithms to establish the models. Similarly, tertiary structure datasets of hemolytic peptides and toxic peptides were collected to construct models for predicting the safety of ACPs. Then, the 12 better-performing models from the two ACPs datasets were combined into a mixed model to screen ACPs from antimicrobial peptides (AMPs). The optimal prediction models of hemolytic peptides and toxic peptides were used to ensure the safety of candidate ACPs (Figure 1). To our knowledge, this is the first reported method to simultaneously predict ACPs activity, hemolysis and toxicity based on the 3D structure of peptides. Novel ACPs were discovered by the above methods, and their anticancer efficacy and toxicity were evaluated by in vitro experiments.

The Composition of Multiple Datasets
The positive and negative sample compositions of the five ACPs datasets, three hemolytic peptides datasets and three toxic peptides datasets are shown in Figure 2A-C. Figure 2D shows the training set and test set division of all datasets. The datasets are detailed in Materials and Methods, Section 4.1 Datasets.

Figure 1.
The flowchart describes the overall implementation approach for screening high-efficiency and low-toxicity ACPs.

Structural Similarity Analysis
The first three principal components were extracted from the structural features for similarity analysis. The positive and negative data of the ACPs datasets were clearly separated in three-dimensional space. The concentrated distribution of peptides with anticancer activity indicated that the extracted features were effective. In particular, the chemically modified peptide dataset D3 showed better separation than the other ACPs datasets. The distinction between positive and negative data was not obvious in the hemolytic peptide datasets, and the data distribution was less concentrated. Toxic peptides shared very similar features, so positive and negative data could be clearly distinguished. Structural similarity analysis of the mixed peptide datasets is shown in Figure 3, and that of the other datasets in Figure S1.

Development of Models on ACPs Dataset
In this study, five ACP datasets were collected to explore the influence of different datasets on model construction. Among the natural peptide datasets, the model of D2 performed better than the model of D1 (Table 1, Figure 4A), which indicates that ACPs and AMPs have similar structures and are slightly harder to differentiate than ACPs and peptides derived from SwissProt. The model performance on the chemically modified peptide dataset D3 was better than on the natural peptide datasets D1 and D2, suggesting that the feature extraction method in this study captures the chemically modified features of peptides more readily, so chemically modified ACPs were easier to distinguish from modified AMPs. Similarly, the model of mixed peptide dataset D5 was superior to the model of D4 (Table 2). Overall, it is crucial to select an appropriate dataset when constructing the prediction model.

D1 Dataset
A total of 306 descriptors were obtained by preliminary feature screening in the D1 dataset, and 24 descriptors were obtained by further feature selection in WEKA (Table S1). A variety of algorithms were used for model construction based on the 306 and 24 features, respectively. The optimal model was then selected from the classical algorithms (Binary, RF, SVM and KNN) and from the WEKA ensemble algorithms (AdaBoostM1, Bagging, Stacking and Vote). In the D1 dataset, the best performing model was the Vote algorithm based on 306 features, with an accuracy of 94.2% and MCC of 0.89 on the training set and an accuracy of 96.43% and MCC of 0.93 on the test set.

D2 Dataset
A total of 318 descriptors were obtained by preliminary screening in the D2 dataset, and 20 descriptors were obtained by further feature screening in WEKA (Table S3). In addition, the MOE descriptor calculation module recommended 148 descriptors for this dataset (Table S2). As with the D1 dataset, six models with better performance were obtained after model construction and selection. The two best-performing models used the AdaBoostM1 algorithm with J48 as the primary learner, based on 318 and 20 features, respectively, with an accuracy of 92.84% and MCC of 0.86 on the training set, and an accuracy of 97.32% and MCC of 0.95 on the test set.

D3 Dataset
A total of 332 features were selected from the chemically modified peptide dataset D3 after preliminary screening, and 13 features remained after further screening in WEKA (Table S5). Furthermore, MOE suggested 37 descriptors for this dataset (Table S4). After model construction and selection, five models with excellent performance were obtained. Among them, the best performing model was the classical KNN algorithm (k = 9) based on the 37 features. It achieved the highest accuracy of 100% and MCC of 1 on the test set, and an accuracy of 98.13% and MCC of 0.96 on the training set.

D4 Dataset
A total of 333 descriptors were selected from the mixed peptide dataset D4 by preliminary screening, and 14 descriptors were obtained by further feature selection in WEKA (Table S7). In addition, MOE suggested nine descriptors for this dataset (Table S6). Six models with excellent performance were obtained through model construction and selection. The Bagging algorithm based on 333 features with J48 as the primary classifier performed best. It achieved an accuracy of 91.49% and MCC of 0.83 on the training set, and an accuracy of 96.98% and MCC of 0.94 on the test set (Table 2, Figure 4A).

D5 Dataset
In the mixed dataset D5, 334 features were selected preliminarily and 17 features were further selected in WEKA (Table S9). Moreover, MOE suggested 19 descriptors (Table S8). Six models with better performance were obtained through model development and screening. The optimal model was the Stacking algorithm based on 334 descriptors with J48 as the secondary classifier, with an accuracy of 93.53% and MCC of 0.87 (training set), and an accuracy of 96.98% and MCC of 0.94 (test set) (Table 2, Figure 4A).
In each dataset, models built on the selected low-dimensional features performed as well as models constructed with high-dimensional features. These results suggest that high-quality models can be developed as long as the key features are present. On the D1, D2, D4 and D5 datasets, the ensemble algorithms performed slightly better than the classical algorithms. In the five ACPs datasets, all the models screened above showed excellent internal stability and external predictability.
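The accuracy and MCC values reported for each model follow from the confusion matrix of a binary classifier. The sketch below uses the standard definitions of both metrics; the confusion counts are hypothetical and chosen only for illustration on a 112-peptide test set (20% of a 560-peptide dataset), since the actual confusion matrices are not given in the text.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correctly classified peptides."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (not taken from the paper): 56 positives, 56 negatives,
# with two errors in each class.
tp, tn, fp, fn = 54, 54, 2, 2
print(f"Acc = {accuracy(tp, tn, fp, fn) * 100:.2f}%")
print(f"MCC = {mcc(tp, tn, fp, fn):.2f}")
```

Because MCC uses all four cells of the confusion matrix, it stays informative even when a model favors one class, which is why it is reported alongside accuracy.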

Models Developed on Hemolytic Peptide Dataset
In the natural hemolytic peptide dataset HD1, 308 descriptors were selected preliminarily, and 7 descriptors remained after further selection in WEKA (Table S10). The optimal model was the RF algorithm from WEKA based on the 7 descriptors, with an accuracy of 66.67% and MCC of 0.33 (training set) and an accuracy of 73.68% and MCC of 0.47 (test set). In the chemically modified peptide dataset HD2, 330 features remained after initial screening, and 18 features were further selected in WEKA (Table S11). The best performing model was the KNN algorithm based on 330 features. It achieved an accuracy of 76.38% and MCC of 0.53 on the training set, and an accuracy of 81.48% and MCC of 0.66 on the test set (Table 3, Figure 4B). In the mixed peptide dataset HD3, 330 features were selected in the first screening and 23 features were further selected in WEKA (Table S12). The optimal model was the Bagging algorithm based on 330 descriptors with J48 as the primary learner, with an accuracy of 71.04% and MCC of 0.42 (training set), and an accuracy of 83.7% and MCC of 0.68 (test set). In the hemolytic peptide datasets, the models developed for chemically modified peptides from 3D structure features outperformed those for natural peptides. The ensemble algorithms from WEKA performed better than the classical algorithms. The optimal models in the three hemolytic peptide datasets had a good fit on the training set and good generalization ability on the test set.

Models Developed on Toxic Peptide Dataset
In the natural toxic peptide dataset TD1, 336 features remained after preliminary screening and 22 features were further selected in WEKA (Table S13). The best performing model was the RF algorithm from WEKA based on 336 features, with an accuracy of 81.94% and MCC of 0.64 on the training set and an accuracy of 85.5% and MCC of 0.72 on the test set (Table 4, Figure 4B). In the chemically modified peptide dataset TD2, 310 descriptors were selected in the first screening, and 15 descriptors remained after further screening in WEKA (Table S14). The optimal model was the RF algorithm from WEKA based on the 310 features. It achieved an accuracy of 64.58% and MCC of 0.29 on the training set and an accuracy of 70.83% and MCC of 0.43 on the test set. In the mixed peptide dataset TD3, 336 descriptors were obtained by preliminary screening and 27 descriptors by further selection in WEKA (Table S15). The optimal model was the Bagging algorithm based on 336 features with J48 as the primary learner, with an accuracy of 81.31% and MCC of 0.63 (training set), and an accuracy of 81.92% and MCC of 0.64 (test set). In the toxic peptide datasets, the model developed from features of chemically modified peptides performed worse than that of natural peptides, possibly because toxic peptides and AMPs carry similar chemical modifications. Additionally, the models developed with ensemble algorithms were superior to those developed with classical algorithms. The three optimal models showed good internal fitting and external predictability.

Screening of Candidate ACPs
A total of 1294 AMPs were collected as a dataset to screen for peptides with anticancer activity. Since the dataset included both chemically modified peptides and natural peptides, we used the models developed on the ACPs mixed datasets D4 and D5 for ACPs prediction. As shown in Table 2, the 12 models developed on the D4 and D5 datasets all performed excellently. In order to increase prediction accuracy, we combined the prediction results of the 12 models to select ACPs. The screening criterion of the mixed model was that a peptide was classified as an ACP only when at least nine of the 12 models predicted it to be positive. A total of 83 candidate ACPs were screened by this method. Because almost all of the 83 candidate ACPs were natural peptides, the optimal models of the hemolytic peptide dataset HD1 and the toxic peptide dataset TD1 were selected for safety prediction. In the end, we obtained 41 candidate ACPs whose hemolysis and toxicity were both predicted to be negative. Five peptides were randomly selected from the 41 candidate ACPs for experimental verification. The sequence and structural information of the five peptides are shown in Table 5 and Figure 5, and the information on the 41 candidate ACPs is listed in Table S16.

Table 5.
ID     Name            Sequence
16563  RANATUERIN-2Lb  GILSSIKGVAKGVAKNVAAQLLDTLKCKITGC
19566  Brevinin-2DYd   GIFDVVKGVLKGVGKNVAGSLLEQLKCKLSGGC
22121  Odorranain-C1   GVLGAVKDLLIGAGKSAAQSVLKTLSCKLSNDC
22355  RANATUERIN2     GLFLDTLKGAAKDVAGKLEGLKCKITGCKLP
27843  Brevinin-2DYb   GLFDVVKGVLKGAGKNVAGSLLEQLKCKLSGGC

Figure 5.
Three-dimensional structure of 5 candidate ACPs.
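The screening rule described in this section (at least nine positive votes out of 12 ACP models, followed by negative hemolysis and toxicity predictions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the models are represented as plain callables, and all names (`count_votes`, `screen_peptides`, the feature-vector format) are assumptions introduced here.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical model interface: a callable that maps a feature vector
# to True (positive prediction) or False (negative prediction).
Model = Callable[[Sequence[float]], bool]


def count_votes(features: Sequence[float], acp_models: List[Model]) -> int:
    """Number of ACP models that predict the peptide as positive."""
    return sum(1 for model in acp_models if model(features))


def screen_peptides(
    peptides: Dict[str, Sequence[float]],
    acp_models: List[Model],
    is_hemolytic: Model,
    is_toxic: Model,
    min_positive: int = 9,
) -> List[str]:
    """Keep peptides with enough ACP votes and negative safety predictions."""
    selected = []
    for name, features in peptides.items():
        if count_votes(features, acp_models) < min_positive:
            continue  # fails the hybrid ACP vote
        if is_hemolytic(features) or is_toxic(features):
            continue  # fails the safety filter
        selected.append(name)
    return selected
```

In the study, this two-stage filter reduced 1294 AMPs to 83 candidate ACPs after the vote and to 41 after the safety predictions.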

Experimental Verification
The inhibition rates of several cancer cell lines in CancerPPD were analyzed statistically [38], with median activity (EC/IC/LC50 (µM)) ranging from 17 ± 3 µM to 53 ± 9 µM. Therefore, by comparison with the ACPs in CancerPPD, a candidate peptide was considered to have anticancer activity when its IC50 value was less than 50 µM. The anticancer action of the five purified peptides was tested on the A549, MCF7, HeLa and LoVo cancer cell lines, and the concentration that inhibits half of the cell growth was calculated (half-inhibitory concentration [IC50]). Four of the five peptides exhibited anticancer activity and inhibited the growth of at least two types of cancer cells (Figure 6A,B,E, Table 6), which demonstrated the efficacy of the hybrid ACPs model to a certain extent. All four peptides (Ranatuerin-2Lb, Brevinin-2DYd, Odorranain-C1 and Brevinin-2DYb) effectively killed A549 lung cancer cells; in particular, the IC50 values of Brevinin-2DYd and Ranatuerin-2Lb were 2.975 µM and 15.32 µM, respectively. Brevinin-2DYd showed significant inhibitory effects on all four cancer cell lines relative to the other peptides.
To verify the effectiveness of the toxic peptide model, we selected human embryonic kidney 293T cells to test the inhibitory effect of the candidate ACPs on non-cancerous cells (Figure 6C-E, Table 6). As with the classification of ACPs activity, peptides were considered to have low toxicity when the IC50 value exceeded 50 µM. Three of the five peptides (RANATUERIN-2Lb, Brevinin-2DYb and RANATUERIN2) showed low toxicity to 293T cells, Odorranain-C1 exhibited some toxicity and Brevinin-2DYd showed the highest toxicity. The experimental results suggest that the toxic peptide prediction model has a certain degree of predictability, which is consistent with the model validation results.
We further assessed the hemolytic activity of the five peptides, using a hemolysis rate of more than 10% at a peptide concentration of 100 µM as the classification standard for hemolysis. Four peptides (RANATUERIN-2Lb, Odorranain-C1, Brevinin-2DYb and RANATUERIN2) showed no hemolysis, and Brevinin-2DYd showed moderate hemolysis (Figure 6F, Table 7). To some extent, these results indicate that the RF model developed in WEKA is effective in the prediction of hemolytic peptides. In Table 7, the "+" symbol indicates that the hemolysis rate exceeds 10% when the peptide concentration reaches 100 µM, and the "−" symbol indicates a hemolysis rate of less than 10% at a peptide concentration of 100 µM.
In summary, these results show that RANATUERIN-2Lb and Brevinin-2DYb are potential ACPs when both anticancer activity and safety are considered. RANATUERIN-2Lb and Brevinin-2DYb selectively kill A549 lung cancer cells relative to 293T cells, and can be developed as potential candidate drugs for lung cancer. These experiments serve only to validate the practicability of the developed ACPs, hemolytic peptide and toxic peptide models, and more systematic experiments are needed to further develop ACP drugs.

Discussion
The purpose of this study was to construct models for predicting the anticancer activity and safety of peptides based on their 3D structures, and to collect different datasets to compare the models developed on natural peptides and chemically modified peptides. In the ACPs datasets, the classical algorithms KNN and RF performed excellently, while the ensemble algorithms AdaBoostM1, Bagging, Vote and Stacking showed high accuracy. In contrast to previous prediction methods for ACPs [10-26], features were extracted from the 3D structure of peptides for the first time, which can serve as a supplementary method for the prediction of ACPs. Extracting features from the 3D structure of peptides can better reflect the state of peptides in organisms and allows the properties of peptides to be analyzed from different perspectives. Compared with existing prediction models such as ENNAACT [26] and AntiCP 2.0 [23], the ACPs model developed here is very robust when judged by accuracy and MCC alone. However, due consideration has to be given to the differences caused by different feature extraction methods, so it is uncertain whether our model is superior to other models. The test set results of the models constructed on the ACPs datasets were slightly better than the training set results, which may be because the sample size of the test set is small and the characteristics of positive and negative samples are clearly distinguished. Meanwhile, the ensemble algorithms adopted can prevent overfitting of the training set, maintaining a good fit of the model while improving external generalization ability as much as possible. Some other peptide prediction models show similar results, such as the model built by Piyush et al. using the SVM algorithm [23] and the model developed by Vishuda et al. using the RF algorithm [39].
Similarly, this is the first time that both the hemolysis and the toxicity of anticancer peptides have been predicted together with their activity. Compared with the model developed by Vinod et al. using a similar approach [32], our model constructed on the chemically modified hemolytic peptide dataset HD2 achieved good results. However, our model on the toxic peptide datasets and the existing models use different methods of feature extraction, so it is hard to compare model performance. Among the datasets collected in this study, the ACPs and hemolytic peptides were verified by experiments, while the toxic peptide datasets contain all kinds of highly toxic peptides, not only those toxic to non-cancerous cells. Toxic peptides act on non-cancer cells through a variety of mechanisms, including interactions with specific ion channels, enzymes, mitochondria and membrane components. Therefore, the models for predicting hemolysis and toxicity developed in this study can be used to predict the toxic side effects of various peptides. In order to further improve the accuracy of toxicity prediction, more appropriate datasets will need to be collected in the future.
In the classical peptide databases APD3 [40], CAMP R3 [41], DRAMP 2.0 [42] and SATPdb [34], the number of ACPs accounts for about 10% of AMPs. Because some AMPs exhibit anticancer activity [43], a candidate dataset containing 1294 AMPs was collected to screen for novel ACPs. The hybrid ACPs model and the optimal hemolytic and toxicity prediction models developed in this study were applied to the candidate dataset to screen out 41 candidate ACPs, and 5 of them were randomly selected for experimental verification. To a certain extent, the results of this study indicate that the mixed model for screening ACPs had excellent practicability, the hemolytic peptide model also had good applicability, and the toxic peptide model had a certain predictability. Taken together, these results suggest that RANATUERIN-2Lb and Brevinin-2DYb combine anticancer activity with safety, and are expected to be developed as candidate ACP drugs. The method proposed in this study provides a new approach for predicting the activity and safety of ACPs. Although we obtained good results by selecting peptides with good predicted effects for experimental verification, which was useful for identifying candidate ACP drugs, the experimental verification was incomplete from the perspective of model validation. As has been suggested, the peptides with the worst ensemble prediction results should also be selected for further experiments to fully demonstrate the efficacy of the developed model.
Overall, we developed models for predicting the activity and safety of ACPs based on the 3D structure of peptides, and identified two novel candidate ACPs. Although the present method is limited by the time-consuming structure prediction step, it is still a powerful complement to methods that build models on sequence features. With the development of computational biology, more and more 3D structure-based peptide activity prediction methods are expected to be developed and widely used. In order to verify whether the model is reliable, we randomly selected 5 peptides for experiments, and we will continue to conduct experiments on the remaining 36 candidate peptides to select ACPs with high efficiency and low toxicity in the future. Considering that the structure of a peptide may change when it interacts with the cell, it is important to verify its three-dimensional structure through experiments. At present, the methods of protein structure analysis mainly include nuclear magnetic resonance (NMR), X-ray diffraction imaging (XDI) and cryo-electron microscopy (Cryo-EM) [44]. Due to the limitation of experimental conditions, the 3D structure verification experiments could not be carried out, and we hope to address this in subsequent studies. In this study, the selection of candidate ACPs focused mainly on natural ACPs, although chemically modified ACPs generally have better effects. Based on the excellent prediction model of chemically modified peptides developed in this paper, chemically modified peptides with high anticancer activity will be screened in future studies.

ACPs Datasets
In this study, we created 5 datasets in order to compare how different datasets affect the construction of the ACPs prediction model. We collected sequence and structure data on experimentally validated ACPs from CancerPPD [45] and SATPdb. The peptides were then classified as natural or chemically modified according to whether they carry chemical modifications. Generally, acetylation, amidation, methylation, glycosylation and non-natural residues are counted as chemical modifications [46]. We extracted 435 natural ACPs from CancerPPD and SATPdb, and collected 300 chemically modified ACPs from CancerPPD. In order to remove sequence redundancy in the dataset, CD-Hit [47] was used to delete sequences with more than 85% similarity, and 280 natural ACPs were obtained. The non-anticancer AMPs were selected from SATPdb as non-ACPs: 360 chemically modified non-ACPs were extracted and 471 natural non-ACPs were screened using CD-Hit with a sequence identity cut-off of 85%. In addition, we retrieved 356 random peptides from SwissProt [48] using the keywords "not anticancer activity", "amino acid length range of 5 to 70" and "with 3D structure", as another natural non-ACPs dataset.
In order to create several balanced datasets, 280 natural ACPs and 280 natural AMPs from SATPdb were selected as the D1 dataset. The D2 dataset was composed of 280 natural ACPs and 280 random peptides from SwissProt. The D3 dataset was made up of 300 chemically modified ACPs and 300 chemically modified AMPs from SATPdb. Subsequently, natural peptides and chemically modified peptides were placed in one dataset to form a mixed dataset, that is, D1 and D3 constitute the mixed dataset D4, and D2 and D3 form the mixed dataset D5 (Figure 2A). The sequence and 3D structure data of the five ACP datasets are shown in Supplementary Materials 2.

Hemolytic Peptides Datasets
The sequence and structure data of experimentally verified hemolytic peptides were obtained from Hemolytik [49] and SATPdb, and natural peptides and chemically modified peptides were collected separately. Peptides that satisfy at least one of the following criteria were considered hemolytic: (i) minimum hemolytic concentration (MHC) ≤ 250 µg/mL; (ii) half maximum effective concentration (EC50) or hazardous concentration (HC50) ≤ 100 µM; or (iii) >10% hemolytic activity up to 100 µM [28]. Peptides that do not meet these criteria, showing extremely low hemolysis at relatively high concentrations, were selected as non-hemolytic peptides. Finally, the natural dataset HD1 of 94 hemolytic peptides and 94 non-hemolytic peptides was obtained through CD-Hit screening. The chemically modified peptide dataset HD2 was composed of 135 hemolytic peptides and 135 non-hemolytic peptides, and HD1 and HD2 constituted the mixed dataset HD3 (Figure 2B). Peptide sequences and 3D structures of the three hemolytic peptide datasets are shown in Supplementary Materials 2.
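The three labeling criteria above amount to a simple rule. The sketch below encodes them under two assumptions that are not stated in the text: each measurement may be missing (passed as `None`), and a missing measurement is treated as not meeting its criterion. The function and parameter names are hypothetical.

```python
from typing import Optional


def is_hemolytic(
    mhc_ug_per_ml: Optional[float] = None,
    hc50_um: Optional[float] = None,
    hemolysis_pct_up_to_100um: Optional[float] = None,
) -> bool:
    """Label a peptide as hemolytic if it meets any one criterion.

    mhc_ug_per_ml: minimum hemolytic concentration in µg/mL.
    hc50_um: EC50 or HC50 in µM.
    hemolysis_pct_up_to_100um: highest hemolysis (%) seen up to 100 µM.
    """
    if mhc_ug_per_ml is not None and mhc_ug_per_ml <= 250:
        return True  # criterion (i)
    if hc50_um is not None and hc50_um <= 100:
        return True  # criterion (ii)
    if hemolysis_pct_up_to_100um is not None and hemolysis_pct_up_to_100um > 10:
        return True  # criterion (iii)
    return False
```

The third criterion is the same 10% at 100 µM threshold used to classify the experimentally tested peptides in Table 7.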

Toxic Peptides Datasets
Similarly, the structure and sequence data of toxic peptides were obtained from SATPdb and divided into natural peptides and chemically modified peptides. The majority of toxic peptides in SATPdb are peptide toxins from ATDB [50], Tox-Prot [51], ConoServer [52] and DBETH [53], which are usually highly toxic and have a killing effect on many types of cells, including non-cancer cells. Non-toxic AMPs were retrieved from SATPdb as non-toxic peptides, that is, as the negative datasets. We extracted the natural dataset TD1 of 1000 toxic peptides and 1000 non-toxic peptides by CD-Hit screening. The chemically modified dataset TD2 was formed from 120 toxic peptides and 120 non-toxic peptides, and TD1 and TD2 were combined into the mixed dataset TD3 (Figure 2C). Peptide sequences and 3D structures of the three toxic peptide datasets are shown in Supplementary Materials 2.

Candidate Datasets
Since some AMPs also have anticancer properties, we collected a candidate dataset that did not include experimentally validated anticancer peptides in the hope of finding novel ACPs. A total of 2024 non-anticancer and non-toxic antibacterial peptides were obtained from the desired functions module of SATPdb. Then, CD-Hit was used to exclude peptides with sequences that were more than 90% similar to those in all study datasets. Finally, 1294 antibacterial peptides with predicted 3D structures were obtained to compose the candidate dataset.

Internal and External Validations
The length of the amino acid sequences in all of the above datasets ranged from 5 to 70. Each dataset was randomly divided into two subsets: 80% of the data constituted the training set and the remaining 20% constituted the test set, with a ratio of positive to negative samples of about 1:1 in each subset (Figure 2D). The training set was used to train the model, and 5-fold cross validation was used for internal validation. For external validation, the test set was used to evaluate the performance of the trained model.
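The splitting scheme can be illustrated with a short sketch using only the Python standard library. The function names, the fixed seed and the example labels are assumptions made for illustration; the text does not state which tool performed the split.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(labels: Sequence[int], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split sample indices into train/test while keeping the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices: Sequence[int], k: int = 5,
            seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (fit, validate) index pairs for k-fold cross validation."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    return [
        ([i for j, f in enumerate(folds) if j != n for i in f], folds[n])
        for n in range(k)
    ]


# Example: a balanced dataset the size of D1 (280 positives, 280 negatives).
labels = [1] * 280 + [0] * 280
train_idx, test_idx = stratified_split(labels)
cv_folds = k_folds(train_idx, k=5)
```

In this sketch only the 80/20 split is stratified by class; the five folds are drawn by plain shuffling, so their class ratio is only approximately 1:1.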

Feature Extraction and Selection
Based on the 3D structure of peptides, the global physicochemical descriptors of the MOE software (2018) [54] were used to extract features related to structural properties. Peptides of the same category usually have similar features, so a prediction model can be built to distinguish them from other categories of peptides by using machine learning algorithms to recognize their common features. The 3D structure of a peptide is closer to its real state of action in living organisms, so the extracted features can better reflect the properties of peptide drugs. MOE descriptors contain 206 2D descriptors, 148 3D descriptors and 88 protein descriptors, resulting in a total of 435 features. These features include volume and molecular shape, surface area, energy-related descriptors, conformation-dependent charge descriptors, etc. To avoid data redundancy, descriptors poorly correlated with activity or toxicity were removed in a preliminary feature screening. For every dataset, the contingency coefficient C, Cramer's V, the entropic uncertainty U and the linear correlation R² of each descriptor were calculated in MOE, and descriptors whose values of C, V, U and R² were 0 were deleted. The remaining descriptors were used for model construction. Then, for each dataset we used the WEKA package [55] for feature selection based on these descriptors. We chose "CfsSubsetEval" as the evaluator and "Best First" as the search method with its default parameters, that is, the forward direction with amount of backtracking N = 5 and lookup size D = 1 [46]. In addition, after the calculation of C, U, V and R² mentioned above, MOE recommended descriptors for some datasets, and we also built models from these recommended descriptors.
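The idea behind a correlation-based subset evaluator such as CfsSubsetEval is that a good subset contains features that correlate with the class but not with each other. A minimal sketch of the standard CFS merit score is given below; the function name is hypothetical, the correlations are assumed to be supplied by the caller as absolute values, and the way WEKA itself measures correlation is not reproduced here.

```python
from math import sqrt
from typing import Sequence


def cfs_merit(feature_class_corr: Sequence[float],
              feature_feature_corr: Sequence[float]) -> float:
    """CFS merit of a feature subset.

    feature_class_corr: |correlation| of each selected feature with the class.
    feature_feature_corr: |correlation| of every pair of selected features
        (empty when the subset holds a single feature).
    """
    k = len(feature_class_corr)
    if k == 0:
        return 0.0
    mean_cf = sum(feature_class_corr) / k
    mean_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
               if feature_feature_corr else 0.0)
    # Merit rises with class relevance and falls with redundancy.
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)
```

With this score, adding a second descriptor that is perfectly correlated with the first leaves the merit unchanged, whereas adding an equally relevant but uncorrelated descriptor raises it, which is why a best-first search over subsets tends to discard redundant descriptors.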

Structural Similarity Analysis
A principal component analysis (PCA) was performed on the remaining descriptors of each dataset after preliminary feature screening. Then, the first three principal components were selected to perform the structural feature similarity analysis of 3D visualization for each dataset. The more concentrated the data distribution, the higher the similarity of peptide features.

Machine Learning Techniques
In this paper, we adopted a variety of machine learning algorithms to establish the models, which fall into two categories: classical algorithms and ensemble algorithms (Figure 1). The classical algorithms include Binary, KNN, RF and SVM. The Binary method is an algorithm based on Bayesian statistics that builds classification models in the QSAR module of the MOE software (2018) and uses the leave-one-out (LOO) method for internal validation. The KNN algorithm was run in the Windows command window, with k ranging from 1 to 9. We used the Euclidean distance to describe the similarity between samples, and adjacent samples were classified into one category. Genetic algorithms with a population size of 200 and a maximum of 300 generations were used to screen models with a correct classification rate (CCR) over 0.6. RF and SVM were implemented using the randomForest and e1071 packages in R, respectively. RF uses a set of unpruned decision trees and randomly selects a subset of predictors as candidates for splitting tree nodes [56]. The SVM algorithm uses the radial basis kernel (RBK) function to construct the model, with its two key parameters set to cost = 1000 and gamma = 1 × 10⁻⁶.
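To make the KNN step concrete, the following sketch classifies a query by majority vote among its k nearest neighbours under the Euclidean distance. The function name and the two-dimensional toy descriptor values are hypothetical, and the genetic-algorithm screening of models described in the text is not included.

```python
from collections import Counter
from math import dist
from typing import Sequence, Tuple


def knn_predict(train: Sequence[Tuple[Sequence[float], int]],
                query: Sequence[float], k: int = 3) -> int:
    """Classify `query` by majority vote among its k nearest neighbours."""
    # Sort training samples by Euclidean distance to the query.
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy descriptor vectors (hypothetical values); label 1 = ACP, 0 = non-ACP.
toy_train = [
    ((0.9, 1.1), 1), ((1.0, 0.8), 1), ((1.2, 1.0), 1),
    ((-1.0, -0.9), 0), ((-0.8, -1.2), 0), ((-1.1, -1.0), 0),
]
predictions = {k: knn_predict(toy_train, (0.7, 0.9), k) for k in (1, 3, 5)}
```

Odd values of k avoid ties in a two-class problem, which is one reason to compare several values in the 1-9 range before fixing k.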
The ensemble algorithms mainly include AdaBoostM1 [57], Bagging [58], Stacking [59] and Vote [60] from the WEKA software (version 3.8.4). AdaBoostM1 uses DecisionStump (DS) as its default base classifier, and DS, J48, SMO and NaiveBayes (NB) were each selected in turn to build four models. Bagging uses REPTree as its default base classifier, and REPTree, J48, SMO, NB, RF and IBk were each selected to build six models. Stacking used ZeroR, PART, OneR, J48, RF, IBk, NB and SMO as the base learners and J48 as the secondary classifier to build an efficient model. Vote combines multiple algorithms and classifies samples according to the average of their output probabilities. To obtain models with excellent performance, the "CVParameterSelection" method in WEKA was used to optimize the corresponding parameters of each algorithm.
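The averaging rule used by Vote can be sketched in a few lines. The function name and the probability values are hypothetical; the sketch only illustrates the combination rule and does not reproduce the WEKA implementation.

```python
from typing import Sequence


def vote_average(probabilities: Sequence[Sequence[float]]) -> int:
    """Combine per-classifier class probabilities by averaging.

    probabilities[m][c] is the probability that base classifier m assigns
    to class c. Returns the index of the class with the highest mean.
    """
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


# Three hypothetical base classifiers scoring one peptide as [non-ACP, ACP].
combined = vote_average([[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]])
```

In this example the second classifier favours the non-ACP class, but the averaged probability still favours the ACP class, so a single dissenting learner is outvoted.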

Performance Evaluation
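Model performance was evaluated using the following parameters: sensitivity (Sen), specificity (Spc), accuracy (Acc) and the Matthews correlation coefficient (MCC), calculated as follows:

$$\mathrm{Sen} = \frac{TP}{TP + FN}$$

$$\mathrm{Spc} = \frac{TN}{TN + FP}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, FP, TN and FN stand for the number of true positives, false positives, true negatives and false negatives, respectively.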
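The four metrics follow directly from the confusion-matrix counts. The sketch below uses their standard definitions; the function name and the example counts are hypothetical, and returning 0 for an undefined ratio is a convention chosen here rather than one stated in the text.

```python
from math import sqrt
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sen, Spc, Acc and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    sen = tp / (tp + fn) if tp + fn else 0.0
    spc = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / total if total else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spc": spc, "Acc": acc, "MCC": mcc}


# Hypothetical counts for a 112-peptide test set.
metrics = classification_metrics(tp=50, fp=4, tn=52, fn=6)
```

Unlike Acc, MCC uses all four counts in both numerator and denominator, so it stays informative even if the positive and negative classes are not perfectly balanced.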

Cell Killing Ability Assay In Vitro
Human tumor cell lines MCF7, A549, HeLa and LoVo were used to determine the anticancer activity of the peptides by MTT assay [61,62]. In addition, human embryonic kidney 293T cells were included in the experiment to examine the safety of the peptides against non-cancerous cells. The cells were cultured, seeded in 96-well plates (6000 cells per well) and incubated for 24 h, followed by treatment with each of the five peptides at increasing concentrations (0.5, 1, 2, 4, 8, 16, 32, 64 and 128 µM) for 24 h; each concentration was tested in triplicate. Untreated cells incubated with the corresponding medium for 24 h served as the negative control group. Subsequently, 10 µL of 5 mg/mL MTT solution (Sangon Biotech, Shanghai, China) was added to each well and incubated for 4 h at 37 °C. After the medium was discarded, 150 µL of DMSO was added to each well for 10 min to dissolve the formazan crystals. Absorbance was measured at 570 and 630 nm using a multiwell plate reader, and the inhibition rate was expressed as the mean ± standard deviation (SD) of the triplicate data.

Hemolysis Assay
The 2% sheep red blood cells (SRBC) were purchased from Nanjing Senbeijia Biological Technology Co., Ltd. (Nanjing, China). A total of 4 mL of 2% SRBC was washed twice with 4 mL of PBS by centrifugation at 3000 rpm for 5 min, and the pellet was resuspended in 4 mL of PBS. We added 100 µL of 2% SRBC to each well of a 96-well plate, then added 100 µL of peptide solution at different concentrations (2, 4, 8, 16, 32, 64, 128 and 256 µM) in triplicate and incubated the plate at 37 °C for 1 h. SRBC mixed with 100 µL of PBS buffer or 100 µL of 0.1% (w/v) Triton X-100 served as the negative and positive controls, respectively [63]. All samples were centrifuged at 3000 rpm for 5 min, 100 µL of supernatant was transferred to a new 96-well plate, and its absorbance was measured at 540 nm. The hemolysis rate was calculated as follows.