Prediction of Peptide Vascularization Inhibitory Activity in Tumor Tissue as a Possible Target for Cancer Treatment

Abstract: Predicting metabolic activities in silico is crucial for exploring all research possibilities without exceeding experimental budgets. In cancer research in particular, predicting certain activities can greatly help in the discovery of new treatments. This work proposes using Machine Learning (ML) to predict the anti-angiogenic activity of peptides, which are currently being used in cancer treatment with promising results. From a list of peptide sequences, three types of molecular descriptors were obtained (AAC, DC and TC) and used to train different ML algorithms. After a Feature Selection process, several models were obtained whose predictive performance surpassed the current state of the art. These results show that ML is useful for classifying and predicting the activity of new peptides, making experimental screening cheaper and faster.


Introduction
The prediction of metabolic activities in silico is crucial to address all research possibilities without exceeding the experimental costs. In cancer research specifically, there are countless options that could be tested as possible treatments. Among them, one approach showing promising results is tumor treatment with anti-angiogenic peptides. Attacking the tumor by destabilizing its micro-environment is crucial to prevent its development. The vast majority of treatments of this type are based on peptides, mainly due to their low toxicity and design possibilities.
Being able to predict the anti-angiogenic activity of these peptides from their amino acid sequence opens the door to the discovery of new peptides with this activity, whether natural or designed in the laboratory. Experimental researchers could then use in-silico techniques as a preliminary filter and validate experimentally only those candidates predicted to be promising.
In this work, Machine Learning techniques were able to predict with high precision whether a peptide has anti-angiogenic activity from its amino acid sequence alone. This work was previously published in detail as open access in the journal Scientific Reports [1]. The list of sequences was obtained from [2].

Results
Previous studies have shown that peptides with anti-angiogenic activity share a common structure, mostly anti-parallel beta-sheet folds, with a high incidence of hydrophobic and cationic residues. The composition of these peptides, on the other hand, is not fully defined, although certain amino acid residues have been observed to occur more often in their sequences.

Baseline Algorithms without Feature Selection
First, a comparative experiment was carried out under identical conditions to determine which dataset-algorithm pair achieves the best performance. Four algorithms (SVM [3], RF [4], Glmnet [5] and k-NN [6]) were trained with the three descriptors (AAC, DC and TC) [7] and with the unions AAC-DC and AAC-TC. The results obtained in this phase do not improve on those reported in the literature (accuracy of 0.809) [2], but they do indicate a certain trend in the data.
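The descriptor names are not expanded in this text. Assuming AAC, DC and TC stand for amino acid, dipeptide and tripeptide composition (which is consistent with the residues, dipeptides and tripeptides named in the Discussion), the following minimal sketch shows how such features could be derived from a sequence. The function names, the normalization by the number of counted windows and the example peptide are illustrative assumptions, not the implementation used in [1] or [7].

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Return the relative frequency of every possible k-mer in `sequence`.

    k=1 yields 20 features, k=2 yields 400 and k=3 yields 8000.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1
    for i in range(max(windows, 0)):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip windows containing non-standard residues
            counts[kmer] += 1
    total = sum(counts.values())
    # Assumption: frequencies are normalized by the number of counted windows.
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}


def describe(sequence):
    """Compute the three descriptor sets for one peptide."""
    seq = sequence.upper()
    return {
        "AAC": kmer_composition(seq, 1),
        "DC": kmer_composition(seq, 2),
        "TC": kmer_composition(seq, 3),
    }


if __name__ == "__main__":
    features = describe("ACKLSLAV")  # hypothetical peptide, not from the dataset
    print(len(features["AAC"]), len(features["DC"]), len(features["TC"]))
    print(features["AAC"]["A"], features["DC"]["CK"], features["TC"]["LSL"])
```

Under this reading, the AAC-DC and AAC-TC unions would simply concatenate the corresponding feature dictionaries for each peptide.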

Feature Selection
We ranked the features according to their p-values after performing a Kruskal-Wallis test of each variable against the dependent variable (Anti-angiogenic vs. Non Anti-angiogenic). We explored feature subsets of different sizes: 5, 10 and 15 for AAC; 25, 50, 75 and 100 for DC; 75, 100, 125 and 150 for TC; and 50, 100, 150 and 200 for the combination of the three datasets.
With feature selection, several models exceed the value reported in the literature (red line), all of them using the RF and Glmnet algorithms. The best performance was achieved by Glmnet trained on the union of the three descriptors, using the 200 most significant variables according to the statistical test. Furthermore, as can be seen in Figure 1b,d, all the models show great stability, meaning that they obtained homogeneous results across all repetitions.
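The ranking step described in this section can be sketched as follows. This is a standard-library illustration under stated assumptions, not the code used in [1]: it computes the tie-corrected Kruskal-Wallis H statistic for a two-class label, converts it to a p-value using the chi-square approximation with one degree of freedom, and keeps the k features with the smallest p-values. The function names and the column-oriented input layout are hypothetical. In practice a library routine such as scipy.stats.kruskal would normally be used instead.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_two_groups(values, labels):
    """Return (H, p-value) of the Kruskal-Wallis test for a binary label."""
    n = len(values)
    ranks, tie_sizes = average_ranks(values)
    h = 0.0
    for group in set(labels):
        group_ranks = [r for r, y in zip(ranks, labels) if y == group]
        h += sum(group_ranks) ** 2 / len(group_ranks)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:  # constant feature: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    # With two groups, H is approximately chi-square with 1 degree of freedom,
    # whose survival function is erfc(sqrt(h / 2)).
    return h, math.erfc(math.sqrt(max(h, 0.0) / 2))


def select_top_k(columns, labels, k):
    """Return the names of the k features with the smallest p-values.

    `columns` maps a feature name to its list of values, one per peptide.
    """
    p_values = {name: kruskal_two_groups(vals, labels)[1]
                for name, vals in columns.items()}
    return sorted(p_values, key=p_values.get)[:k]
```

Calling select_top_k with k set to 50, 100, 150 or 200 on the combined descriptors would reproduce the kind of subset sizes explored above.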

Discussion
The results obtained in this work reflect the importance of new data analysis technologies in supporting experimental research. This work reports a set of models that surpass previous state-of-the-art models for this type of problem (p-value = 1.8665 × 10^-9). A high proportion of amino acids such as Alanine, Valine and Cysteine within the sequence is important for classifying peptides. In addition, among the top-ranked features, dipeptides (SP, VD, ID and CK) and tripeptides (LSL, DIT and PDL) provide significant information to this model.

Machine Learning Models
The following algorithms were implemented: Support Vector Machines (SVM) [3], Random Forest (RF) [4], k-Nearest Neighbors (k-NN) [6] and Generalized Linear Model (Glmnet) [5]. A Nested Cross Validation was used for training the models, that is, there were two validation phases. First, a holdout (2/3 for training and 1/3 for testing) was used to select the best hyperparameters; second, a 10-fold CV was used to validate the model, and this CV process was repeated 5 times.
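One possible reading of this validation scheme is sketched below: an outer 10-fold CV repeated 5 times estimates performance, and inside each outer training set a 2/3-1/3 holdout picks the hyperparameter. The tiny k-NN classifier is only a stand-in so that the sketch runs without external libraries. The function names, the unstratified fold construction and the refit on the full outer training set are assumptions, not details confirmed by the text.

```python
import random


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def knn_predict(train_X, train_y, test_X, k):
    """Tiny k-NN stand-in; k is the hyperparameter being tuned."""
    predictions = []
    for x in test_X:
        nearest = sorted(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
        )[:k]
        votes = [train_y[i] for i in nearest]
        predictions.append(max(sorted(set(votes)), key=votes.count))
    return predictions


def take(items, indices):
    return [items[i] for i in indices]


def nested_validation(X, y, param_grid, predict=knn_predict,
                      folds=10, repeats=5, seed=0):
    """Outer repeated k-fold CV with an inner 2/3-1/3 holdout for tuning."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        indices = list(range(len(X)))
        rng.shuffle(indices)
        for f in range(folds):
            test_idx = indices[f::folds]  # every folds-th shuffled sample
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]

            # Inner holdout: fit on 2/3 of the outer training data and
            # score each candidate value on the remaining 1/3.
            cut = (2 * len(train_idx)) // 3
            fit_idx, val_idx = train_idx[:cut], train_idx[cut:]
            best = max(
                param_grid,
                key=lambda p: accuracy(
                    take(y, val_idx),
                    predict(take(X, fit_idx), take(y, fit_idx),
                            take(X, val_idx), p),
                ),
            )

            # Assumption: refit with the chosen value on all outer training data.
            y_pred = predict(take(X, train_idx), take(y, train_idx),
                             take(X, test_idx), best)
            scores.append(accuracy(take(y, test_idx), y_pred))
    return scores
```

With the default settings the function returns 50 accuracy values (10 folds × 5 repetitions), whose spread gives a sense of model stability across repetitions.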