Peptide-Based Drug Predictions for Cancer Therapy Using Deep Learning

Anticancer peptides (ACPs) are selectively toxic to cancer cells and are promising candidates for new anticancer drugs. Identifying new ACPs is time-consuming and expensive because the anticancer activity of every candidate must be evaluated experimentally. To reduce the cost of ACP drug development, we collected the most up-to-date ACP data to train a convolutional neural network (CNN) with a peptide sequence encoding method for initial in silico evaluation. Here we introduce PC6, a novel protein-encoding method that converts a peptide sequence into a computational matrix representing six physicochemical properties of each amino acid. By integrating the data, the encoding method, and the deep learning model, we developed AI4ACP, a user-friendly web-based ACP distinguisher that predicts the anticancer property of query peptides and promotes the discovery of peptides with anticancer activity. The experimental results demonstrate that the CNN-based AI4ACP, trained using the new ACP collection, outperforms existing ACP predictors. Five-fold cross-validation of AI4ACP on the new collection also showed that the model performs stably, with an accuracy of around 0.89, without overfitting. Using AI4ACP, users can easily carry out an early-stage evaluation of unknown peptides and quickly select potential candidates for testing their anticancer activities.


Introduction
Cell membrane properties differ between tumor cells and healthy cells [1]. For example, the membrane fluidity of cancer cells is higher than that of healthy cells [2]. In addition, cancer cells are characterized by a negatively charged surface [3]. Anticancer peptides (ACPs), a subset of antimicrobial peptides (AMPs), have been found to be toxic to cancer cells [1]. Compared with the chemotherapeutic reagents used in standard cancer treatment protocols, ACPs have higher specificity and selectivity for neoplasms. ACPs can also be easily synthesized and scaled up. They can thus serve as a new option in cancer treatment [1].
ACPs can be divided into two types based on their putative anticancer mechanism: molecular-targeting and cancer-targeting peptides. Several state-of-the-art ACP predictors have been constructed using ACP data as a positive training set and non-ACP data as a negative training set. These predictors help scientists evaluate the anticancer activity of peptides during anticancer agent development. However, these existing predictors were built using traditional machine learning methods. For example, a support vector machine (SVM) was applied to build AntiCP [4] and iACP [5]. Both ACPred [6] and MLACP [7] were trained using random forest (RF). Although these basic machine learning methods are popular for model construction, they have some limitations affecting model performance. Recent advances in deep learning architectures have been successfully applied in many fields, including the prediction of ACPs. For example, PTPD used Word2Vec and a deep neural network (DNN) model [8].
To hasten the discovery of ACPs, we built a deep learning model to detect peptides with anticancer activity. Our model was composed of a peptide sequence encoding method and a machine learning model. In this study, we used PC6 [9], a novel protein-encoding method, to convert a peptide sequence into a computational matrix representing six physicochemical properties of each amino acid. We mainly applied a convolutional neural network to build our model. Because the number of confirmed ACPs has increased recently, we could collect more ACP sequences and construct a highly accurate ACP prediction model.

Results
First, we collected ACPs from the publications described in the Materials and Methods. Figure 1 presents the relationship between our positive set and the data sets used in Charoenkwan et al. [10]. The data sets used in Charoenkwan et al. (hereafter, the "Charoenkwan sets") comprised a main data set and an alternative data set. The main data set consisted of 861 experimentally validated ACPs as the positive set and 861 AMPs with no anticancer ability as the negative set. The alternative data set consisted of 970 ACPs as the positive set and 970 peptide sequences randomly chosen from Swiss-Prot as the negative set. The positive set in the new collection (this study; 2124 positive and 2124 negative sequences) includes the Charoenkwan sets and another 942 recently discovered ACPs. We used 80% of the data for model training and 20% for validation (Table 1) and set aside one small set from the newly collected ACPs (n = 212) as the testing set.
We compared AI4ACP, trained using the main data set and the alternative data set from the previous study, with other ACP predictors. Most of the ACP predictors are poorly maintained and thus were not working or not available. The results shown in Tables 2 and 3 were obtained from the manuscripts of AntiCP2.0 [11] and ACPred [10], which were based on Charoenkwan's main set. As shown in Table 2, most ACP predictors trained with the main data set performed poorly, with either low specificity or low sensitivity.
Table 3 shows the performance of ACP predictors trained and tested using Charoenkwan's alternative data set. To make a fair comparison, AI4ACP was also trained and tested on the alternative data set. Most of the ACP predictors performed better on this data set than when trained using the main data set. Among these predictors, AntiCP2.0 [11] exhibited the best performance, and the performance of AI4ACP trained and tested on the alternative data set was close to that of AntiCP2.0.
Most ACP predictors could not be run on the new collection because their code was unavailable or their web portals were poorly maintained. Therefore, we selected AntiCP2.0 [11], which performed best on the alternative data set (Table 3), and compared it with AI4ACP through its web-based service, using the testing set of the new collection for validation. To make a fair comparison, AI4ACP was trained using both the alternative data set and the new collection. As shown in Table 4, AI4ACP performed slightly better than AntiCP2.0 when it was trained using the alternative set. When trained on the new collection, which is almost double the size of the alternative set, AI4ACP demonstrated the best performance in accuracy, specificity, sensitivity, and MCC on the testing set from the new collection.

Discussions
The increasing number of publications, databases, and tools shows the growing importance of peptide-based therapeutics. More and more ACPs have been recognized, and some are even used as FDA-approved drugs. This trend reveals the growing demand for identifying and predicting ACPs. Identifying and screening novel ACPs in a wet lab is usually time-consuming and expensive. Exploring the anticancer activity of peptides by using ACP predictors can accelerate the development of new anticancer drugs. However, the output of an ACP predictor is only a prediction; laboratory experiments are still required to confirm whether a peptide sequence possesses anticancer activity.
There are already some existing ACP predictors, such as iACP, ACPred, and AntiCP2.0. Most existing ACP predictors were constructed using protein-encoding methods such as amino acid composition (AAC), dipeptide composition (DPC), and the autocovariance (AC) method, together with traditional machine learning methods such as SVM, random forest, or ensemble methods. This study used a novel protein-encoding method, PC6. The PC6 encoding method selects one property from each of six subclusters of the physicochemical properties of amino acids. Four of the six chosen properties were taken from the seven properties used in the original autocovariance (AC) method, and two common physicochemical properties were selected from the remaining two subclusters. We speculate that the PC6 encoding method might capture more complete features from the sequences of interest.
The data sets used in previous studies had not been updated for quite some time. In addition to the positive data sets containing few ACPs, the use of AMPs as the negative data set in previous studies might be inappropriate because ACPs are a subset of AMPs. Such poorly constructed data might reduce prediction accuracy. With the up-to-date ACP data set and an unbiased negative data set, composed of sequences collected from UniProt and randomly generated sequences, the predictor performed better under the same architecture.
To ensure the stability of the model and avoid overfitting during training, five-fold cross-validation was also applied to the model (Table 5). All the sequences used in our final model were randomly divided into five parts. In each training repetition, one of the five parts was left out as the testing set, and the other four parts were used as the training set. Although the model's performance declined slightly because each training set in five-fold cross-validation was smaller than the original training set, the average accuracy of the model was still about 89%. This result shows that the model performs at a stable level without apparent overfitting. The external testing set was used to test the model's performance on unseen sequences. Only 7 of 43 ACP sequences were misidentified as non-ACPs, so the model's accuracy on the external testing set was about 84%. Therefore, we presume that if a newly designed sequence is predicted as an ACP by our model, it is likely to be an effective anticancer peptide.
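The five-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `five_fold_splits`, the fixed `seed`, and the round-robin assignment of shuffled items to folds are assumptions made for the example, not details taken from the AI4ACP code.

```python
import random


def five_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists; every item lands in a test fold exactly once."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # random division into k parts
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]  # the part left out in this repetition
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


if __name__ == "__main__":
    # Toy stand-in for peptide sequences (hypothetical data).
    toy = [f"pep{i}" for i in range(10)]
    for train, test in five_fold_splits(toy):
        print(len(train), len(test))
```

Averaging the accuracy over the five repetitions gives the kind of cross-validated estimate reported in the text.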
Tables 2-4 show that combining the PC6 encoding method with a deep learning model can efficiently predict ACPs. The PC6 encoding method preserves the physicochemical properties of the amino acids in the original peptide sequences, and the deep learning model can learn these preserved features. In addition, with the increase in the number of peptide sequences confirmed as ACPs, we could build a predictor that exhibited more favorable performance and higher accuracy than other state-of-the-art ACP predictors. AI4ACP is a user-friendly web-based ACP predictor that users can use to detect whether a query sequence is an ACP. This tool can be beneficial for drug development for cancer treatment. AI4ACP will be continuously updated as new ACPs are discovered. In addition, the deep learning model is available at https://github.com/yysun0116/AI4ACP, accessed on 6 March 2022.

Positive Data Collection
We collected ACP sequences from four ACP and AMP databases: CancerPPD [12], DBAASP [13], DRAMP [14], and YADAMP [15]. In addition, we included sequences from the positive alternative set reported by Charoenkwan et al., 2021 [10]. We downloaded all peptides with anticancer activity from the four databases and previous studies. After excluding ACPs with unusual amino acids or a nonlinear structure (sequences containing "B", "Z", "U", "X", "J", "O", "i", or "-") and duplicates between different databases, we obtained 2839 positive ACPs. Figure 2a presents the length distribution of the 2839 ACPs; most of the sequences were shorter than 50 amino acids. Therefore, we excluded ACPs longer than 50 amino acids. Finally, 2815 ACP sequences were retained. Figure 2b depicts the length distribution of the 2815 ACPs.
To ensure that the characteristics of the ACPs learned by the model were balanced, we filtered out the remaining ACPs sharing >99% sequence identity with existing ACPs by calculating the sequence identity using CD-HIT [16]. A total of 2124 ACPs were included as positive data. To evaluate the performance of our model and compare it with that of other state-of-the-art predictors, we used 10% of all the positive data as the testing set after excluding sequences from the positive set of other predictors. Figure 3 presents the detailed positive data collection and division process.
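The first filtering steps above (unusual characters, exact duplicates, and the 50-residue length limit) can be sketched as follows. This is a hedged illustration: `filter_peptides` is a hypothetical helper name, and the >99% identity clustering done with CD-HIT is not reproduced here, because that step relies on the external CD-HIT tool.

```python
# The 20 standard amino acids in one-letter code. Anything else, including
# "B", "Z", "U", "X", "J", "O", lowercase letters such as "i", and "-",
# causes a sequence to be rejected.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MAX_LEN = 50


def filter_peptides(sequences, max_len=MAX_LEN):
    """Keep unique sequences of standard residues no longer than max_len."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip()
        if not seq or len(seq) > max_len:
            continue  # empty or longer than the length limit
        if not set(seq) <= STANDARD_AA:
            continue  # contains an unusual or non-residue character
        if seq in seen:
            continue  # exact duplicate, e.g. shared between databases
        seen.add(seq)
        kept.append(seq)
    return kept


if __name__ == "__main__":
    # Made-up example sequences, not real ACPs.
    print(filter_peptides(["ACDK", "ACDK", "ABCK", "KLAK-"]))
```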

Negative Data Collection
The negative data set consisted of 1062 non-ACP peptides from UniProt [17] and 1062 randomly generated peptides. From UniProt, we collected peptides shorter than 50 amino acids in length and without anticancer, antiviral, antimicrobial, or antifungal activities. The randomly generated peptides were derived using the same length distribution as the positive data set and were randomly filled with the 20 standard amino acids. Accordingly, we obtained 2124 sequences as the negative data set. We used 90% of the negative data set (1912 sequences) as the negative training set and the remaining 10% (212 sequences) as the negative testing set. Figure 4 presents the detailed negative data collection and division process.
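One way to generate random peptides that follow the length distribution of the positive set is sketched below. The function name `random_peptides_like`, the choice to sample lengths with replacement from the observed lengths, and the uniform choice among residues are all assumptions for illustration; the text does not specify these details.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_peptides_like(positive_seqs, n, seed=0):
    """Generate n random peptides whose lengths mimic those of positive_seqs."""
    rng = random.Random(seed)
    lengths = [len(s) for s in positive_seqs]
    peptides = []
    for _ in range(n):
        length = rng.choice(lengths)  # draw from the empirical length distribution
        # Assumption: each position is filled uniformly at random.
        peptides.append("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
    return peptides


if __name__ == "__main__":
    # Made-up positive sequences, used only to supply lengths.
    print(random_peptides_like(["ACDKK", "GLFDIVKKVV"], n=3))
```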
Pharmaceuticals 2022, 15, x FOR PEER REVIEW

Protein-Encoding Method
This study used the PC6 protein-encoding method [9] to convert a peptide sequence into a computational matrix. PC6 is a novel protein-encoding method that can encode a sequence based on both the order and physicochemical properties of the amino acids of the sequence. After benchmarking with other encoding methods, the PC6 encoding method exhibited the most satisfactory performance. Therefore, we applied PC6 in the encoding stage in our final prediction model.

Developing a Deep Learning Model
We used Keras, a high-level API of TensorFlow, to construct and train the deep learning model. We first applied the PC6 protein-encoding method [9] to all sequences. PC6 appends an extra character, "X", which is 0 in all six properties, to the end of each sequence to pad it to length 50 and converts each sequence into a 50 × 6 matrix. Figure 5 presents the process of the PC6 protein-encoding method.
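The padding and matrix conversion can be illustrated with a small sketch. Note the loud assumption: the property values in `PROPS` are placeholders invented for this example, since the actual PC6 property table [9] is not given in the text; only the "X" padding row of zeros and the 50 × 6 output shape come from the description above. The name `pc6_like_encode` is likewise hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MAX_LEN = 50
N_PROPS = 6

# PLACEHOLDER values only: the real PC6 table [9] assigns six physicochemical
# property values to each residue. Here each residue simply gets its 1-based
# index repeated six times so that the example is runnable.
PROPS = {aa: [float(i + 1)] * N_PROPS for i, aa in enumerate(AMINO_ACIDS)}
PROPS["X"] = [0.0] * N_PROPS  # padding character: 0 in all six properties


def pc6_like_encode(seq, max_len=MAX_LEN):
    """Pad seq with 'X' to max_len and return a max_len x 6 list of lists."""
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    padded = seq + "X" * (max_len - len(seq))
    # A KeyError here means the sequence contains a non-standard residue.
    return [list(PROPS[aa]) for aa in padded]


if __name__ == "__main__":
    matrix = pc6_like_encode("ACD")
    print(len(matrix), len(matrix[0]))  # rows x columns
```

Because the residue order is kept row by row, a one-dimensional convolution over this matrix sees both the sequence order and the per-residue properties.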

The model architecture consists of three blocks, each composed of a convolutional layer, batch normalization, max pooling, and a dropout layer, followed by two dense layers (Figure 6). The convolutional layers in the three blocks were built using 64, 32, and 8 one-dimensional filters of length 20, respectively, with the ReLU activation function. After each convolutional layer, batch normalization and max pooling were applied with a 25% dropout rate. The first dense layer contains 128 units with a 50% dropout rate. The last layer in the model is the output layer, a dense layer with a single unit and the sigmoid activation function that produces a value ranging from 0 to 1; this value indicates whether a peptide is an ACP. Binary cross entropy was used as the loss function, and the Adam optimizer was used with a learning rate of 0.0001. We trained the model using the training data set (90%) and evaluated its performance using the validation data set (10%). Finally, all available data, namely 2124 positive and 2124 negative sequences, were used to train the final model. We implemented the neural network using Keras (https://github.com/keras-team/keras, accessed on 30 March 2022) from TensorFlow 2 (https://www.tensorflow.org/, accessed on 30 March 2022).
Figure 6. Model architecture in this study. After PC6 encoding, protein sequences pass through every layer in this model.
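To make the layer sizes concrete, the following sketch works through the arithmetic of the three convolutional blocks. The filter counts (64, 32, 8), the kernel length (20), and the 50 × 6 input come from the text; the pooling size of 2 and "same" padding (so that a convolution keeps the sequence length) are assumptions, as the text does not state them. The parameter count of a one-dimensional convolution is kernel length × input channels × filters, plus one bias per filter.

```python
KERNEL = 20             # filter length, from the text
FILTERS = [64, 32, 8]   # filters per block, from the text
IN_CHANNELS = 6         # six physicochemical properties
SEQ_LEN = 50            # padded peptide length
POOL = 2                # ASSUMPTION: max-pooling size is not given in the text


def conv1d_params(kernel, in_ch, out_ch):
    """Weights plus biases of a 1D convolutional layer."""
    return kernel * in_ch * out_ch + out_ch


def trace_blocks(seq_len=SEQ_LEN, in_ch=IN_CHANNELS, filters=FILTERS,
                 kernel=KERNEL, pool=POOL):
    """Return (length, channels, conv_params) after each block."""
    rows = []
    length, channels = seq_len, in_ch
    for out_ch in filters:
        params = conv1d_params(kernel, channels, out_ch)
        # ASSUMPTION: 'same' padding keeps the length; pooling floor-divides it.
        length = length // pool
        channels = out_ch
        rows.append((length, channels, params))
    return rows


if __name__ == "__main__":
    for length, channels, params in trace_blocks():
        print(f"length={length} channels={channels} conv_params={params}")
```

Under these assumptions the convolutional layers hold 7744, 40992, and 5128 parameters, and the feature map shrinks from 50 to 25, 12, and 6 positions before the dense layers.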

Data for the Final Model
After confirming the most favorable model architecture and hyperparameters, we trained the model using all the available data (2124 positive and 2124 negative sequences). Eventually, we produced the final prediction model for the website. The data set used in this study can be found on our online HELP page (https://axp.iis.sinica.edu.tw/AI4ACP/helppage.html, accessed on 30 March 2022). The positive and negative data sets will be continuously updated with the same criteria as new ACPs are discovered.

Performance Measure
We evaluated the performance of our model using threshold-dependent parameters, namely accuracy, specificity, sensitivity, and the Matthews correlation coefficient (MCC). These parameters are calculated via the following equations:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP represents the true positive predictions, TN represents the true negative predictions, FP represents the false positive predictions, and FN represents the false negative predictions.
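These four measures follow their standard definitions and can be computed directly from the confusion-matrix counts, as in the sketch below. The function name `classification_metrics` and the example counts are hypothetical, and returning NaN when a denominator is zero is a convention chosen for this example.

```python
import math


def _safe_div(numerator, denominator):
    """Return numerator / denominator, or NaN when the ratio is undefined."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, specificity, sensitivity, and MCC from confusion counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": _safe_div(tp + tn, tp + tn + fp + fn),
        "specificity": _safe_div(tn, tn + fp),
        "sensitivity": _safe_div(tp, tp + fn),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Made-up counts for illustration; not results from the study.
    for name, value in classification_metrics(tp=90, tn=85, fp=15, fn=10).items():
        print(f"{name}: {value:.3f}")
```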

External Testing Set
We collected ACP sequences from the updated version of DBAASP and excluded sequences that duplicated any of the 2124 ACP sequences we had already collected. Sequences shorter than 10 amino acids, longer than 50 amino acids, or containing residues other than the 20 standard amino acids were also excluded. Finally, 43 ACP sequences were retained as an external testing set.

System Implementation and Workflow
To provide an intuitive user experience, we built AI4ACP on the LAMP system architecture (Linux Ubuntu 16.04, Apache 2.04, MySQL 5.7, and PHP 5.1) with the Bootstrap 3 CSS framework (http://getbootstrap.com/, accessed on 6 March 2022), jQuery 1.11.1, and jQuery Validation version 1.17. The core of the analysis process, the neural network, was implemented using Keras from TensorFlow. AI4ACP runs as a virtual machine (CPU of 2.27 GHz, 20 cores, 32-GB RAM, and 500-GB storage) on the cloud infrastructure of the Institute of Information Science, Academia Sinica, Taiwan.
AI4ACP is a web service that allows users to predict whether a query peptide sequence is an ACP. The input data should be in the FASTA format, and the query peptide sequence should be composed of only the 20 standard amino acids; sequences are not recognized if they contain unusual amino acids such as B, Z, U, X, J, or O. AI4ACP outputs a CSV file containing, for each input peptide sequence, a prediction score ranging from 0 to 1 and the prediction result as YES or NO. The prediction score represents the probability that the query peptide sequence is an ACP. The output file's prediction results, shown as a binary column, indicate the ACP sequence(s). The prediction result is based on the prediction score with a threshold of 0.472, which is the average of the thresholds calculated by training the model five times. The workflow of AI4ACP is presented in Figure 7 and explained as follows. First, the query peptide sequence is input in the FASTA format or as a FASTA file, and a valid job title is provided (Figure 7A). After the query sequence is submitted, the result appears in a three-column table composed of the input peptide's name, prediction score, and result (Figure 7B). In addition, a pie chart presents the prediction result; this pie chart enables users to view the prediction results of the whole submission at the same time (Figure 7C).
Figure 7. (B) The output of ACP activity for each submitted sequence with a prediction score. (C) Pie chart presenting the predictions for the whole submission and the files generated during the prediction.
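The input checks and the score-to-label step described above can be sketched as follows. This is not the AI4ACP server code: the helper names `parse_fasta`, `is_valid_query`, and `label` are hypothetical, and treating a score exactly equal to the threshold as YES is an assumption. Only the threshold value of 0.472, the FASTA input format, and the YES/NO output come from the text.

```python
THRESHOLD = 0.472  # decision threshold stated in the text
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def is_valid_query(seq):
    """True if seq is non-empty and uses only the 20 standard amino acids."""
    return bool(seq) and set(seq) <= STANDARD_AA


def label(score, threshold=THRESHOLD):
    """Map a prediction score in [0, 1] to YES or NO."""
    # ASSUMPTION: a score equal to the threshold counts as YES.
    return "YES" if score >= threshold else "NO"


if __name__ == "__main__":
    # Made-up query; the scores passed to label() are also made up.
    for name, seq in parse_fasta(">query1\nGLFDIVKKVV\n>query2\nKXK\n"):
        print(name, seq, is_valid_query(seq))
    print(label(0.9), label(0.1))
```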

Conclusions
ACPs are a special subset of short peptides that can fight cancer. Modeling ACP properties is a crucial research topic for developing ACP-based cancer therapy. This study collected up-to-date ACP data and then developed an online ACP predictor, AI4ACP. Our approach builds a prediction model based on the PC6 protein-encoding method and deep learning; it outperformed other predictors and was also evaluated on an external testing set. AI4ACP can serve as a filter for selecting potential peptides in the first step of discovering new ACPs. Users can upload their candidate peptide sequences to our web server, get predictions in a few minutes, and pick promising ones for further, costly bench experiments.
The deep learning approach in the drug discovery pipeline is beneficial for promoting and economizing the early drug development process. This study successfully transforms peptides into a machine-readable format encoded with physicochemical information. Although a large amount of data can improve model performance, careful data preprocessing is necessary to avoid "garbage in, garbage out". We took special care with data utilization and found that using a robust machine learning algorithm can improve model performance in learning different peptide patterns.

Conflicts of Interest:
The authors declare no conflict of interest.