ACPNet: A Deep Learning Network to Identify Anticancer Peptides by Hybrid Sequence Information

Cancer is one of the most dangerous threats to human health. One persistent issue is drug resistance, which leads to side effects after drug treatment, and numerous therapies have endeavored to overcome it. Recently, anticancer peptides have emerged as novel and promising anticancer candidates that can inhibit tumor cell proliferation and migration and suppress the formation of tumor blood vessels, with fewer side effects. However, identifying anticancer peptides by high-throughput biological experiments is costly, laborious and time consuming. Therefore, accurately identifying anticancer peptides is a key and indispensable step for anticancer peptide therapy. Although some computational methods have been developed to predict anticancer peptides, their accuracy still needs to be improved. Thus, in this study, we propose a deep learning-based model, called ACPNet, to distinguish anticancer peptides (ACPs) from non-anticancer peptides (non-ACPs). ACPNet employs three different types of information: peptide sequence features, peptide physicochemical properties and auto-encoding features learned during training. ACPNet is a hybrid deep learning network that fuses fully connected networks and recurrent neural networks. Comparison with other existing methods on the ACPs82 dataset shows that ACPNet not only achieves improvements of 1.2% in Accuracy, 2.0% in F1-score and 7.2% in Recall, but also obtains balanced performance on the Matthews correlation coefficient. Meanwhile, ACPNet is verified on an independent dataset of 20 peptides, where only one proven anticancer peptide is predicted as a non-ACP. The comparison and the independent validation experiment indicate that ACPNet can accurately distinguish anticancer peptides from non-ACPs.


Introduction
Currently, cancer is an enormous threat to human health. As reported by the International Agency for Research on Cancer (IARC) [1], cancer mortality is increasing rapidly every year, reaching 18.1 million new cases and 9.6 million cancer deaths in 2018 alone. Moreover, the range of common cancers is widening, including lung, breast, prostate, colorectal and others. However, the diagnosis and treatment of cancer remain challenging, which hinders therapies such as surgery, radiotherapy, chemotherapy and targeted therapy. Furthermore, traditional treatments are often costly and high-risk, accompanied by the risk of tissue death, drug resistance and other serious side effects [2]. Fortunately, increasing evidence shows that some novel anticancer agents, such as certain peptides, offer novel and safe therapies [3]. For example, peptide p28 is a post-translational, multi-target anticancer agent that preferentially enters a wide variety of solid tumor cells and binds to both wild-type and mutant p53 protein, inhibiting constitutive photomorphogenic protein 1 (Cop1)-mediated ubiquitination and proteasomal degradation of p53; the resulting increase in p53 levels induces cell-cycle arrest at G2/M and eventual apoptosis, leading to tumor cell shrinkage and death [4]. LL-37, a novel anticancer peptide, carries a net positive charge, is amphiphilic, and can eliminate pathogenic microbes directly via electrostatic attraction towards negatively charged bacterial membranes. Several studies have shown that LL-37 participates in various host immune processes, such as inflammatory responses and tissue repair [5]. Peptide-based therapeutics are more affordable, tolerable and safe, and are considered advanced therapeutic strategies [6]. Therefore, rapidly and effectively finding useful anticancer peptides (ACPs) is a significant step.
Although there are many experimental methods to identify ACPs, they are usually laborious, expensive, time consuming and hard to achieve in a high-throughput manner. Furthermore, with the rapid development of data science, it is very desirable to design machine learning-based methods to identify ACPs [7].
Over the past few years, a dozen computational methods have been proposed to identify ACPs, including k-nearest neighbors (KNN), support vector machines (SVMs), random forests (RF) and so on. In 2019, Boopathi, V. et al. proposed a machine learning model to predict ACPs, called mACPpred [7], which uses seven types of encoding features, including amino acid composition (AAC), dipeptide composition (DPC), composition-transition-distribution (CTD), quasi-sequence-order (QSO), amino acid index (AAIF), binary profile (NC5) and conjoint triad (CTF), to represent a peptide sequence, combined with an SVM model to predict ACPs. In 2020, Li Qingwen et al. employed five types of peptide sequence features, including amino acid composition (AAC), conjoint triad (CT), pseudo-amino acid composition (PAAC), grouped amino acid composition (GAAC) and C/T/D, and then fused multiple machine learning methods, including SVM, RF and LibD3C, to identify ACPs [8]. In 2020, Ge Ruiquan et al. proposed a machine learning model called EnACP, which introduces sequence composition, sequence order, physicochemical properties, etc. to encode a peptide sequence and feeds the important features selected by multiple ensemble classifiers to an SVM model to predict ACPs [9]. These methods try to find effective and useful features to represent a peptide and combine them with a high-performance machine learning model to identify ACPs. Another way to identify peptides is to apply a deep learning model directly to the raw peptide sequence. In 2019, Yi Haichen et al. proposed a deep long short-term memory (LSTM) neural network model, called ACP-DL [10], which developed an efficient feature representation approach by integrating binary profile features and a k-mer sparse matrix of the reduced amino acid alphabet, and then implemented a deep LSTM model to identify ACPs. In 2021, Chen Xiangan et al. proposed an ACP prediction model, called ACP-DA [11], which uses data augmentation for insufficient samples and trains a multilayer perceptron model to improve the prediction performance. In 2020, Yu Lezheng et al. found that a recurrent neural network with bidirectional long short-term memory cells is a superior architecture for identifying ACPs and implemented a sequence-based deep learning tool, called DeepACP [12], to accurately predict ACPs.
Although these methods can predict ACPs accurately, the accuracy performance of existing methods still needs to be improved. It is also a challenge to represent peptide sequences to numerical vectors and further improve the prediction accuracy of ACPs. Therefore, in this paper, we propose a hybrid deep learning-based model, called ACPNet, which employs the raw peptide sequence and carefully selected sequence features as input, to fit recurrent neural networks, and fully connected network to further improve the predicting performance.

Materials
The sequences of ACPs and non-ACPs are downloaded from the research in [12]. Three datasets are introduced: ACPs250, ACPs82 and ACPs20. ACPs250 contains 250 ACP and 250 non-ACP sequence samples, ACPs82 is made up of 82 ACP and 82 non-ACP sequence samples, and ACPs20 contains 10 ACP and 10 non-ACP samples. The ACPs250 dataset is split into a training set and a validation set at 80% and 20%, respectively. To further validate the performance of ACPNet, we conducted experiments on the independent test dataset ACPs82. Furthermore, ACPs20, another independent dataset, is introduced to further prove the performance of ACPNet. These datasets are listed in Table 1.

To further improve the prediction accuracy of ACPs, in this work we employed three hybrid kinds of features to encode a peptide sequence into a numerical vector: peptide sequence features, peptide physicochemical properties and automatic embedding features, which are listed in Table 2.

The concept of PAAC (pseudo-amino acid composition) [13] was introduced to avoid completely losing the sequence-order information. In contrast with the conventional amino acid composition (AAC), which contains 20 components, PAAC contains a set of more than 20 discrete factors, where the first 20 represent the components of the conventional amino acid composition while the additional factors incorporate sequence-order information via various pseudo components. PAAC can be represented by P = [p_1, p_2, ..., p_20, p_{20+1}, ..., p_{20+λ}], where λ is an integer parameter set by the user (recommended value 10), and p_u is calculated by Equation (1):

p_u = f_u / (Σ_{i=1}^{20} f_i + w Σ_{k=1}^{λ} τ_k), for 1 ≤ u ≤ 20
p_u = w τ_{u−20} / (Σ_{i=1}^{20} f_i + w Σ_{k=1}^{λ} τ_k), for 20 + 1 ≤ u ≤ 20 + λ    (1)

where f_i is the frequency of each amino acid in the peptide and w is a weight factor set by the user (default value 0.05). τ_k is the k-th tier correlation factor, reflecting the sequence-order correlation between all k-th most contiguous residues, as formulated by Equation (2):

τ_k = (1 / (L − k)) Σ_{i=1}^{L−k} J_{i,i+k}, k = 1, 2, ..., λ    (2)

where J_{i,i+k} can be calculated by Equation (3):

J_{i,i+k} = (1 / Γ) Σ_{q=1}^{Γ} [Φ_q(R_{i+k}) − Φ_q(R_i)]²    (3)

where Φ_q(R_i) is the q-th function of the amino acid R_i and Γ is the total number of functions considered, such as the hydrophobicity value, hydrophilicity value and side-chain mass of the amino acid. Two further features are the peptide sequence length and the Shannon entropy of the peptide sequence [14]. The Shannon entropy can be obtained by Equation (4):

H = −Σ_{i=1}^{20} p_i log2 p_i    (4)

where p_i is the frequency of the i-th amino acid in the peptide.
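Equation (4) is straightforward to compute. A minimal, dependency-free sketch of the Shannon entropy feature (the function name is ours, not from the paper):

```python
from collections import Counter
from math import log2

def shannon_entropy(peptide: str) -> float:
    """Shannon entropy of the amino acid distribution in a peptide (Equation (4))."""
    counts = Counter(peptide)
    n = len(peptide)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A homopolymer carries zero entropy; a sequence using each of the
# 20 amino acids once reaches the maximum, log2(20) ≈ 4.322.
print(abs(shannon_entropy("AAAA")))                        # 0.0
print(round(shannon_entropy("ACDEFGHIKLMNPQRSTVWY"), 3))   # 4.322
```

Together with the peptide length, this gives the two non-PAAC sequence features in a few lines.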
Peptide Physicochemical Properties
A peptide, a short chain of amino acids, exhibits many properties similar to proteins, and physicochemical properties are closely related to protein function. Therefore, three physicochemical properties of peptides are introduced to represent peptide sequences: Gravy [15], Molecular_weight [16] and Charge_at_pH [17]. The Gravy feature describes the grand average of hydropathy of a peptide according to Kyte and Doolittle, Molecular_weight represents the peptide molecular weight, and Charge_at_pH calculates the charge of a peptide at a given pH, set here to 10.
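In practice these three properties are typically obtained from a library such as Biopython's `ProteinAnalysis`. As a dependency-free illustration, the Gravy feature reduces to the mean Kyte–Doolittle hydropathy value per residue:

```python
# Kyte–Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
      "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
      "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
      "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def gravy(peptide: str) -> float:
    """Grand average of hydropathy: mean Kyte–Doolittle value over the residues."""
    return sum(KD[aa] for aa in peptide) / len(peptide)

# A run of strongly hydrophobic residues scores high:
print(round(gravy("ILV"), 3))  # (4.5 + 3.8 + 4.2) / 3 = 4.167
```

Molecular weight and charge at a given pH follow the same per-residue lookup pattern, using residue masses and pKa values instead of hydropathy.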

Embedding Features
A peptide can be seen as a sentence whose 'words' are the 20 kinds of amino acids. To encode a peptide sequence into a numerical vector, each amino acid is mapped to an index, producing a vector of the same length as the peptide, such as [1, 2, 4, . . . , 20, 5]. The corresponding map is A → 1, C → 2, . . . , Y → 20. After a peptide is converted to an index vector, the next step is to map each index to a vector of user-defined dimension. This embedding process turns positive integers (indexes) into dense vectors. Taking the peptide sequence "ADGF" as an example, with a user-defined dimension of three, the representation process is shown in Figure 1.
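The index encoding and embedding lookup described above can be sketched as follows; the fixed random table here merely stands in for the trainable embedding weights learned by the network:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # A→1, C→2, ..., Y→20

def encode(peptide: str) -> list[int]:
    """Map a peptide to its integer index vector, one index per residue."""
    return [INDEX[aa] for aa in peptide]

print(encode("ADGF"))  # [1, 3, 6, 5]

# An embedding layer then maps each index to a dense vector of a chosen
# dimension (three here, as in Figure 1). During training these vectors
# are learned; a seeded random table illustrates the lookup.
random.seed(0)
EMBED = {i: [random.uniform(-1, 1) for _ in range(3)] for i in range(1, 21)}
vectors = [EMBED[i] for i in encode("ADGF")]
print(len(vectors), len(vectors[0]))  # 4 3
```

The output is a length-by-dimension matrix, which is exactly the shape a recurrent layer consumes one row at a time.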

Overall Workflow
Deep learning technology has obtained numerous achievements in many bioinformatics applications [18][19][20]. Therefore, in this paper, we propose a hybrid deep learningbased model, named ACPNet, for predicting ACPs. The overall workflow is shown in Figure 2. The detailed structure of ACPNet is in Supplementary Figure S1.

Model Structure
The manually selected features and the auto-embedding features are fed to a fully connected neural network (Dense Network) and a recurrent neural network (RNN) [21], respectively, and the outputs of the two branches are then merged for the final prediction. Overall, ACPNet combines a Dense Network and an RNN into a hybrid deep learning-based model to identify ACPs, which not only considers the importance of the manually selected features but also automatically learns potential features from the raw peptide sequences.

Prediction Model Constructed by RNN and Dense Networks
The RNN and the Dense network, two widely used deep learning models, are applied to time-series and feature-independent problems, respectively. LSTM [22], an implementation of the RNN, is usually employed to automatically extract potential features from time-series vectors. The kernel process of LSTM is illustrated by Equation (5):

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c)    (5)
h_t = o_t ∘ σ_h(c_t)

where x_t is the input vector, h_t is the output vector, c_t is the cell state vector, W, U and b are learnable parameters, f_t is the forget gate vector that retains old information, i_t is the input gate vector that acquires new information, o_t is the output gate vector that produces the output candidate, and σ_g, σ_c and σ_h are three activation functions. LSTM is employed in ACPNet because a peptide sequence can be seen as time-series data. The Dense network expertly processes one-dimensional data with independent features; therefore, a Dense network is employed to process the manually selected features and to make the final prediction.

The performance is measured by Accuracy, F1-score, Recall, Precision and the Matthews correlation coefficient (MCC), computed from TP, FP, TN and FN, the numbers of true positives, false positives, true negatives and false negatives, respectively. We also plot the receiver operating characteristic (ROC) curve and compute the Area Under the Curve (AUC) to show the performance of ACPNet.
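A single LSTM update, following Equation (5) and the gate definitions above, can be sketched in NumPy. The stacked weight layout and the dimensions here are illustrative, not ACPNet's actual configuration:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM update. The weights for the f, i, o gates and the cell
    candidate g are stacked: W is (4*hidden, input), U is (4*hidden, hidden),
    b is (4*hidden,)."""
    z = W @ x_t + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # σ_g on the three gates
    c_t = f * c_prev + i * np.tanh(g)              # σ_c = tanh on the candidate
    h_t = o * np.tanh(c_t)                         # σ_h = tanh on the cell state
    return h_t, c_t

# Run the step over a toy length-5 sequence with random inputs and weights.
rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
h, c = np.zeros(d_hid), np.zeros(d_hid)
W = rng.normal(size=(4 * d_hid, d_in))
U = rng.normal(size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
print(h.shape)  # (16,)
```

The final hidden state h is the kind of intermediate vector that ACPNet's RNN branch passes on to be merged with the Dense branch.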

The Effects of Feature Combination
To explore the effect of combining the manually selected features and the automatically learned features, the performances of three types of combination are compared on ACPs250 (as the training dataset) and ACPs82 (as the test dataset). The results are listed in Table 3; the hybrid feature-fused model shows better performance on multiple metrics. Notably, in terms of MCC, the feature-fused model surpasses the manually-selected-feature model by more than 16%, and it shows advanced performance on the other metrics as well. The results indicate that the combination of the two types of features plays a positive and reinforcing role in distinguishing ACPs from non-ACPs.
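The metrics compared here follow the standard confusion-matrix definitions. A short sketch (the confusion counts below are hypothetical, not results from the paper):

```python
import math

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, Precision, Recall, F1-score and MCC from a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }

# Hypothetical counts for an 82+82 test set: 70/82 ACPs and 75/82 non-ACPs correct.
m = classification_metrics(tp=70, fp=7, tn=75, fn=12)
print({k: round(v, 3) for k, v in m.items()})
```

MCC is the most informative single number here because it stays low whenever either class is predicted poorly, which is why a balanced MCC is emphasized throughout the paper.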

Manually Selected Features Importance Rank
To further show the importance of each manually selected feature, CatBoost [23], an ensemble machine learning framework, was introduced to calculate an importance score for each feature. Figure 3 shows the importance score of each manually selected feature. Length, Gravy, MW (Molecular_weight) and SH (Shannon entropy) obtained relatively high scores, surpassing the majority of the PAAC features. The PAAC features also make a positive contribution, peaking at the 14th feature and contributing least at the 18th feature; each PAAC feature contributes differently to identifying ACPs. Overall, the manually selected features contribute to the classification of ACPs and non-ACPs.

Feature Visualization
We use Uniform Manifold Approximation and Projection (UMAP) [24] to visualize the distribution of ACPs and non-ACPs, projecting the intermediate vectors generated by an inner layer of ACPNet into two-dimensional space. Figure 4 illustrates that ACPs and non-ACPs in the training and test datasets can be easily separated by these features, which reconfirms that the constructed features contribute to the identification of ACPs and non-ACPs.


Performance Comparison of Models on Independent Datasets
To show the advantages of ACPNet, traditional machine learning- and deep learning-based methods are employed for comparison. For a fair comparison, the same training dataset, ACPs250, and independent test dataset, ACPs82, were used to train and test all methods. For traditional machine learning, SVM, RF and CatBoost are introduced. Note that the auto-embedding features are replaced by index encoding of the peptide sequence, because the auto-embedding is tied to the training process. Therefore, the index encoding and the manually selected features are used to encode a peptide for the traditional machine learning methods. The comparison results for traditional machine learning are listed in Table 4.

For the comparison with deep learning-based models, seven existing models are employed: AntiCP [25], Hajisharifi [26], iACP [27], ACPred-FL [28], CNN (Convolutional Neural Network) [29], CNN+RNN and DeepACP [12]. AntiCP contains two SVM-based models, AntiCP_AAC and AntiCP_DPC; AntiCP_AAC is built on amino acid composition features, while AntiCP_DPC is constructed from dipeptide composition features. Hajisharifi combines two integrative SVM-based classification models for the prediction of anticancer peptides, based on a local alignment kernel and PAAC parameters. iACP employs g-gap dipeptide components and an SVM to predict ACPs. The CNN is fed one-hot encoding matrices to identify ACPs. The CNN+RNN model takes the same input as the CNN model but appends an RNN after the CNN part. DeepACP, an end-to-end method, fuses the peptide sequence encoding with the training process to identify ACPs. As in the comparison with traditional machine learning, the ACPs250 dataset and the independent test dataset ACPs82 are used to train and test all methods. The comparison results for the deep learning-based methods are listed in Table 5.
From Table 5, we find that ACPNet shows better performance than the other seven methods on multiple metrics. In terms of Precision, ACPNet is slightly lower than ACPred-FL, but it shows a nearly 7% improvement in Recall and achieves a more balanced performance. Compared with the other methods, excluding ACPred-FL, ACPNet obtains improvements of more than 6%, 6%, 10%, 3.5% and 12% in Accuracy, F1-score, Recall, Precision and MCC, respectively. Overall, ACPNet outperforms the existing methods. Furthermore, we also plot the receiver operating characteristic (ROC) curve to further show the performance of ACPNet, shown in Figure 5, with an AUC of 0.945 and a PRAUC (Area Under the Precision-Recall Curve) [30] of 0.947. Figure 5 indicates that ACPNet performs well on both AUC and PRAUC.

Independent Validation
Furthermore, ACPNet is verified on an independent dataset with 10 ACPs and 10 non-ACPs; only one proven ACP is predicted as a non-ACP, while all non-ACPs are predicted as non-ACPs. The independent validation results are listed in Table 6. If the predicted score is larger than 0.5, the corresponding peptide is treated as an ACP, and most ACPs obtained a high score, larger than 0.7. The independent validation results indicate that ACPNet can accurately predict ACPs.
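The decision rule above is a simple threshold on the predicted score. A sketch with hypothetical peptide identifiers and scores (not the actual entries of Table 6):

```python
def classify(scores: dict[str, float], threshold: float = 0.5) -> dict[str, str]:
    """Label each peptide as ACP if its predicted score exceeds the threshold."""
    return {pep: ("ACP" if s > threshold else "non-ACP") for pep, s in scores.items()}

# Hypothetical scores: most true ACPs score above 0.7, one falls below 0.5.
scores = {"peptide_1": 0.93, "peptide_2": 0.71, "peptide_3": 0.42}
print(classify(scores))
```

Raising the threshold trades Recall for Precision, which is why the balanced metrics (F1-score, MCC) matter when comparing models at a fixed cutoff.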


Discussion
Cancer is one of the most dangerous threats to human health, and anticancer peptides could be novel agents for cancer therapy [33]. Therefore, accurately identifying anticancer peptides is a key step for such therapy. Although deep neural network models have been developed to predict ACPs, their accuracy still needs to improve. Thus, in this study, we proposed a hybrid deep learning-based model, called ACPNet, to distinguish ACPs from non-ACPs. For feature construction, three types of features were introduced. The first type comprises manually selected features calculated from peptide sequence information, including PAAC, length and Shannon entropy. The second type comprises Molecular_weight, Charge_at_pH and Gravy, which derive from peptide physicochemical properties. The third type comprises auto-encoding features, which are tied to the training process by encoding each amino acid index into a vector. In Section 3.3, the performance of combinations of the three types of features is compared; the results show that combining the three types of features plays a positive role in identifying ACPs. Each manually selected feature, covering sequence information and peptide physicochemical properties, is relatively independent, so a fully connected neural network is employed to learn from the manually selected features. The auto-encoding features of a peptide can be seen as time-series data, which fit the learning pattern of a BiLSTM. Therefore, the sequence information and peptide physicochemical properties are merged and fed into fully connected networks, while the auto-encoding features are input into a BiLSTM network [34]. After passing through the two types of networks, the intermediate vectors are merged and fed into a fully connected network for the final prediction.
ACPNet is a hybrid deep learning network that fuses the advantages of two types of networks in its structure and makes full use of three kinds of feature information as input to improve the prediction accuracy of ACPs. In comparison with other existing methods, ACPNet not only shows higher performance on multiple metrics, including Accuracy, F1-score, Recall, Precision and MCC, but also shows balanced performance across them, which suggests that ACPNet may be more robust in future identification tasks. Furthermore, the intermediate vectors generated by an inner layer of ACPNet are compressed into two dimensions to show the overall performance directly; the visualization results indicate that the three different types of peptide features and the hybrid deep learning-based model can accurately distinguish ACPs from non-ACPs. ACPNet is also verified on an independent dataset with 10 ACPs and 10 non-ACPs: only one proven ACP was predicted as a non-ACP, and all the non-ACPs were predicted as non-ACPs. The independent validation results indicate that ACPNet can accurately distinguish ACPs from non-ACPs. The learning pattern of ACPNet also fits other peptide-related tasks, which may provide a useful clue for solving those problems. ACPNet also has some limitations. For example, it does not provide a user-friendly web server, which may make it difficult to use for people who cannot program. Furthermore, ACPNet does not consider different types of cancer, which may introduce bias. In the future, we will try to build a new ACP prediction model based on the different types of cancer and provide a user-friendly web server to the public.

Conclusions
In this work, we proposed a deep learning-based method, ACPNet, to identify ACPs, combining a hybrid deep learning-based model with manually selected features and automatic encoding features as input. For feature construction, three types of features were introduced: peptide sequence component information features, peptide physicochemical properties and auto-encoding features. The three types of features play a positive role in distinguishing ACPs from non-ACPs. A fully connected network and a recurrent neural network were introduced to process the constructed features. Compared with existing methods, ACPNet shows better and more balanced performance, with improvements of 1.2% in Accuracy, 2.0% in F1-score and 7.2% in Recall. On the third dataset, with 10 ACPs and 10 non-ACPs, ACPNet accurately predicts nine out of ten ACPs. This series of experiments shows that ACPNet can accurately distinguish ACPs from non-ACPs.
Supplementary Materials: The following supporting information can be downloaded online. Figure S1: detailed structure of ACPNet.