Predicting Antitumor Activity of Peptides by Consensus of Regression Models Trained on a Small Data Sample

Predicting antitumor activity of compounds using regression models trained on a small number of compounds with measured biological activity is an ill-posed inverse problem. Yet, it occurs very often within the academic community. To counteract, up to some extent, overfitting problems caused by a small training data, we propose to use consensus of six regression models for prediction of biological activity of virtual library of compounds. The QSAR descriptors of 22 compounds related to the opioid growth factor (OGF, Tyr-Gly-Gly-Phe-Met) with known antitumor activity were used to train regression models: the feed-forward artificial neural network, the k-nearest neighbor, sparseness constrained linear regression, the linear and nonlinear (with polynomial and Gaussian kernel) support vector machine. Regression models were applied on a virtual library of 429 compounds that resulted in six lists with candidate compounds ranked by predicted antitumor activity. The highly ranked candidate compounds were synthesized, characterized and tested for an antiproliferative activity. Some of prepared peptides showed more pronounced activity compared with the native OGF; however, they were less active than highly ranked compounds selected previously by the radial basis function support vector machine (RBF SVM) regression model. The ill-posedness of the related inverse problem causes unstable behavior of trained regression models on test data. These results point to high complexity of prediction based on the regression models trained on a small data sample.


Introduction
Peptides are attracting increasing attention and have growing significance as therapeutics. They are Nature's toolkit known to control and direct various cellular functions and intercellular communication events. For many years, peptide-based therapeutics were only considered for hormonal disorders and hormone-dependent cancers. However, novel technologies comprising synthetic procedures (solid-phase synthesis), recombinant processes and especially recent progress in drug delivery technologies, overcome many of the former drawbacks associated with peptide-based drugs [1,2]. About half of the peptides in clinical trials address oncology, metabolic, infectious and cardiovascular diseases-related targets. However, it is expected in the future that peptide drugs will address other medical disorders as well. Peptides offer several advantages over "classical" small molecules (higher specificity/selectivity, lower toxicity and tissue accumulation) or antibodies (smaller size, lack of serious immune responses, easy storage). Some of the most applied peptide-based drugs today are glatiramer acetate for the treatment of multiple sclerosis [3], leuprolide acetate, a GnRH receptor agonist for the treatment of breast and prostate cancers [4] and exenatide, approved for the treatment of diabetes mellitus type 2 [5].
Among short peptides with significant therapeutic potential, the native opioid growth factor (OGF), Met-enkephalin (Tyr-Gly-Gly-Phe-Met) is of particular interest. Numerous studies revealed that it acts in a receptor-mediated fashion and has regulatory function in the onset and progression of different human cancers [6]. OGF binds to the OGF receptor (OGFr) and modulates cyclin-dependent kinase inhibition pathway. Cell proliferation can be reduced by the increase of the OGF-OGFr activity through the addition of exogenous OGF [7] or some immunomodulators, like resiquimod, an upregulator of the OGFr [8]. Recent studies under the phase II clinical trials showed that biotherapy with OGF improves clinical benefits and even survival in patients with advanced pancreatic cancer [9], while the combination of chemotherapy with gemcitabine and biotherapy with OGF decreases pancreatic cancer growth and also reduced toxic effects of chemotherapy (in vitro experiments and animal models) [10].
The main drawback of OGF is low enzymatic stability and thus rapid hydrolysis in biological fluids. Some of the recent attempts to overcome this limitation involved incorporation of unnatural, adamantane-containing amino acids into primary OFG sequence [11]. It was found that the replacement of Gly 2 with (R,S)-(1-adamantyl)glycine (Ada) gave the most effective derivative with antitumor activity against HEp-2, HBL, SW-620 and Caco-2 cell lines in vitro. Afterwards, the support vector machines (SVM) QSAR approach was undertaken to screen a virtual library of OGF-related compounds and identify novel structures with possibly improved antitumor activities [12]. Some of the top-rated compounds obtained by computational prediction were synthesized and showed more pronounced activity on the selected cancer cell lines. SVM approach is one of the most used QSAR models in rational drug design for the active/non-active classification problem. Additionally, probability based-and artificial neural networks (ANN) regression models were applied on similar problems [13][14][15]. The size of the training set determinates quality of prediction and in examples mentioned above it ranges from 100 to 1400 compounds. A common problem within the academic community is availability of a limited number of samples with measured biological activity. Thus, reliable identification of novel lead compound(s) from a virtual library becomes a challenging problem. The situation is generally known as the "small N large p" problem [16], and is very common in medicine, bioinformatics, computational drug design, etc. Therefore, methodology for the selection of regression model(s) that can possibly yield reliable and stable prediction is of crucial importance. Stability implies that predictions do not change significantly if small number of compounds, or even one, with measured activity is replaced in the training sample. Thus, it is the aim of the present work to validate potential of methodology that uses consensus between regression models in predicting biological activity of virtual compounds. Therefore, we have trained the following regression models: the feed forward ANN, the k-nearest neighbor (KNN), the sparseness constrained linear regression, the linear SVM and nonlinear SVM with polynomial and Gaussian kernels on a dataset of 22 OGF-related peptides with measured antitumor activity. Learned models were then applied on a virtual library consisted of 429 peptides yielding six lists with candidate compounds ranked by predicted antitumor activity. Consensus between the ranking lists has been sought. By using the majority voting principle a final ranking list has been obtained. The highly ranked candidate compounds were synthesized, characterized and tested for an antiproliferative activity. The activities were compared against compounds selected previously from the same virtual library by the radial basis function SVM regression model [12].

Regression Model Selection for Compound Activity Prediction
The training sample composed of 22 compounds with measured biological activity and 1,647 molecular descriptors (features) generated by the QSAR methodology to characterize each compound (details in the Experimental Section). Thus, the training data are described with the set: or in matrix formulation with X ∈ R 22 × 1647 and y ∈ R 22 . When using stratified cross-validation (CV) to optimize prediction models small data sample with the large number of features yields CV error with large variance [17]. Since too few labeled data are available to distinguish role of different features in error analysis, selection of the parameters of regression models is uncertain [18][19][20].
In order to possibly counteract overfitting problems, we have trained six regression models listed previously. Prior to training and regression all compound descriptors were standardized to zero mean and unit variance whereupon the compound activity values y were scaled to [−1, 1] interval. Regression models were trained by leave-one-out and leave-two-out CVs using 1000 random partitions. The mean correlation between true and predicted values has been used as performance measure. The results of CV analysis are reported in Table S1 in the Supplementary Information. Sparse regression model was the only one with the mean correlation above 0.7 in both CV modes and the ANN model was closed to that. Linear SVM also has correlation value above 0.7 in leave-one-out CV. The RBF SVM had quite low correlation value, but top 5 compounds predicted by it agreed with the top 5 compounds predicted by models with better CV performance (see below). Among used regression models, only sparse linear regression model performs feature selection during the training process. That was expected to increase chance of predicting compound with high biological activity. This choice follows the principle of parsimony, also known as Occam's razor, since the experimental data are explained with a less complex model (smaller number of features). Hence, overfitting should be reduced substantially.
Sparse regression model w is obtained as solution of the underdetermined system of linear equations y = X w + e, where e represents modeling error: constant. Its value depends on the noise variance. The least absolute shrinkage and selection operator (LASSO) [21] is such regression model. However, over the last decade a number of methods for sparse solution of the linear inverse problems with the formulations equivalent to above, have been developed [22][23][24] and conditions necessary for uniqueness of the solution are established. To be more specific, let us suppose that k = |w| 0 , i.e., w has only k important (non-zero) coefficients that point out to corresponding molecular descriptors. Unique solution of y = X w in the problem considered here is obtained when: k < 22 and 22  k log 1647/k. Maximal number of molecular descriptors k that satisfies above inequality is 3. It is, however, not likely that compounds can be characterized with 3 QSAR molecular descriptors only. Some of the well conducted studies on the prediction of antibacterial activity of polypeptides were based on 44 [14] and 20 "inductive" QSAR descriptors [15]. Moreover, the prediction error 2 2  y Xw can't be made reasonably small with such a small number of molecular descriptors. The interior point method [24] used to solve related inverse problem yielded regression vector with 7 coefficients that are significantly different than zero. But k = 7 implies that 39 samples (equations) with measured activity should be available to guarantee uniqueness and, consequently, stable behavior of learned regression model on test data. Evidently, that was impossible to achieve with such a small training dataset. The unstable behavior is demonstrated in disagreement between features that correspond with seven large coefficients of the regression vector (molecular descriptors GMTI, w, D/D, SRW07, SRW09, IDET and SHP2) and features (piID, MATS4p, Mor20m-see also Table S2 in the Supplementary Information) that are highly correlated (above 0.7) with the prediction of this sparse regression model on test data. This unstable behavior has been already recognized and discussed in gene selection for cancer prediction [25]. From the mathematical standpoint there is analogy between antitumor activity prediction in chemoinformatics and cancer prediction in bioinformatics. Thus, some advanced concepts used in microarray data analysis could potentially be beneficial in the problem considered here. Consensus between feature selection algorithms has been used very recently in microarray data analysis to alleviate disagreement between them and improve credibility of selected features (genes) [26]. Therefore, to counterattack uncertainty in selection of regression models, consensus between them has been sought after they were applied to unseen data.
Trained regression models were applied to 429 compounds from the virtual library (details regarding construction of the virtual library are given in the Experimental Section). For each regression model a list was formed with compounds ordered by predicted activity. Then we looked at lists that overlap significantly in top ten compounds. This has been achieved with the lists corresponding with the linear SVM, sparse regression model, feed forward ANN and RBF SVM (Table 1). Moreover, these four regression models yielded highly correlated predictions when applied to library of 429 virtual compounds with the correlation value above 0.94 ( Table 2). Rankings of first 10 compounds in each of four lists were combined using the majority voting and yielded a final list where the first position has been reached with the minimal score. The final list of first 15 compounds is presented in Table 1 together with the top 10 compounds predicted by [12]; compound 10 had an overall score equal to 4 implying that it was first on the list of each of four regression models. The first 5 compounds were in the top 10 compounds in each of the four lists. It is important to point out that structures of these 5 compounds differ significantly from those predicted by the SVM Gaussian kernel [12]. This finding prompted us to synthesize compounds 10-14, evaluate their biological potential and finally compare utilities of different regression models for a given problem. Table 1. Ranking list of the activity of virtual compounds as predicted by four regression models (linear SVM, sparse regression, MLP ANN, RBF SVM). Numbers in parentheses represent position of given compound on individual lists obtained by four regression models. Top-rated compounds predicted in reference [12] are given for comparison.

Rank/(position on the lists/mean value)
Compounds

Peptide Synthesis
Synthesis of methionine rich peptides 10-14 was based on the 2 + 2 and 2 + 3 coupling between Boc-protected dipeptides 1-4 and H-Met-Met-OH or H-Met-Met-Met-OH (Scheme 1). Different methods of peptide bond formation were tested; finally peptides 5-7 were obtained by the DCC/HOSu method, activation with HATU yielded Ada-containing pentapeptide 8, while pentapeptide 9 was obtained by the mixed anhydride method (details in the Experimental section). Finally, deprotection of the Boc group was performed by acid hydrolysis and extra precautions were taken due to the known susceptibility of methionine residue toward acid-promoted oxidation to sulfoxide. Therefore, deprotection step was performed in ice bath, crude product was immediately purified by the HPLC and residual acid removed by passing through the short C18 column. This procedure hampered oxidation of Met residue in all derivatives with the exception of Ada containing pentapeptide 11, which showed extreme susceptibility to oxidation. The full scan mass spectrum of product(s) obtained in deprotection step ( Figure 1

Proliferation Assay
Peptides 10, 12-14 as well as trisulfoxide derivative 15 were screened for possible antiproliferative effects on three cell lines: SW 620 (colon carcinoma), MCF-7 (breast carcinoma) and HeLa (cervical carcinoma), according to the previously published procedure [12]. Table 3 summarizes activities found for tested peptides, expressed as percentage of growth (PG) of three cell lines, together with values for OGF and three OGF-related peptides proposed by the SVM radial basis function kernel (RBF) [12]. In vitro screening results confirmed that some of tested compounds possess more pronounced activity compared to the OGF on certain cell lines (numbers in bold, Table 3). However, compared to three OGF-related peptides proposed by the SVM radial basis function kernel (RBF), compounds predicted by the consensus of regression models proposed herein are less effective in inhibition of tumor cells growth. Comparison with the most active peptides from the training set revealed comparable antiproliferative effects on the SW620 cell line, but improved activities on HeLa and especially MCF-7 cell lines (Table 3). Therefore, the approach presented here managed to select peptides with improved biological activities from virtual library.   [12]; c data from Reference [11].
Regarding amino acid sequence, peptides proposed by the consensus of regression models differ significantly from those prepared following the RBF SVM regression model prediction [12]. Only one derivative in top-five contains non-natural adamantane-related amino acid, and all are rich in methionine. Contrary to that, prevalence of phenylalanine is characteristic of peptides described in [12]. Improved biological activity was attributed to the hydrophobic character and somewhat fixed conformation concluded from CD spectra. Presence of Ada, a -turn inducer, contributes to the observed conformational change when compared with the native OGF. Peptides prepared according to the consensus of regression models are less hydrophobic (log P values, Table 1), and absence of bulky Ada residue probably influences flexibility of prepared peptides. On the other hand, in peptides prepared in [12] third position was occupied with Gly, which allow flexibility of the peptide chain and thus help accommodation of the rigid Ada amino acid. It can be expected that, owing to the prevalence of nonpolar aliphatic amino acid methionine, peptides prepared in this work are structurally distinct from those prepared in [12] where aromatic amino acid phenylalanine is incorporated. Therefore, all these diversities can influence observed lower antiproliferative potential of tested peptides. Interestingly, trisulfoxide derivative 15 obtained by the oxidation of unstable, the only Ada-containing peptide 11, showed similar antiproliferation potential as other peptides tested in this work, despite its unfavourable characteristics (log P − 1.49, log S − 3.03). Therefore, different structural and conformational factors contribute to biological profiles of studied peptides.
Although top 5 ranked compounds were on the lists of four regression models, compounds synthesized from that list were less effective in inhibition of tumor cell growth than RBF SVM predicted compounds in [12]. This confirms complexity and unreliability of prediction when regression models are trained on a sample with small number of compounds with measured antitumor activity and large number of features. This instability of prediction is consequence of the high ill-posedness of the concrete problem and is inherent to all "small N large p problems", as shown in [25] for the gene subset selection problem. Thus, it is possibly wise to propose development of mathematical model for evaluation of the quality of prediction achieved by various statistical and machine learning methods in the spirit of the methods developed in [27] for validation of gene selection algorithms. It is also important to focus efforts and attention strictly on feature selection methods and try to reduce redundant features before training of regression models. Since reduction of features increases the ration between number of labeled samples and number of features this should up to some extent counteract instability of predictions on test data caused by overfitting. One such concept that combines L 1 and L 2 regularization has been proposed recently in [28].

Construction of the Virtual Library
Details about training set, QSAR descriptors and virtual library construction are available at [29]. Web-site also contains structures of peptides in the virtual library, their molecular descriptors, and the predicted cytostatic activities. Briefly, 22 OGF-related peptides with measured antiproliferative activity were submitted to the E-Dragon web service [30], that predicts a probable three-dimensional structure using CORINA [31] and computes molecular descriptors for each compound. A virtula library was constructed encompassing OGF-like tri-, tetra-and pentapeptides. Amino acids present in training set peptides were used in the virtual library with two additional members to raise variety. At the N-terminal position, aromatic amino acids Tyr, Phe and Trp were used, at the second position Gly, Ala and Ada were varied, while other positions were occupied with Gly, Ala, Ada, Phe and Met. We used Marvin 5.1.1 software [32] to visualize, browse, and otherwise handle the molecules in the virtual library.

Software Environment
The regression models were implemented in MATLAB script language using functions available in the classification toolbox: newff, train and sim for the feed forward ANN, knnclassify for the KNN predictor and our own MATLAB implementation of the linear and nonlinear SVM-based regression models (files in MATLAB script language are given in Supplementary Information). The feed-forward ANN was composed of one input layer with ten nonlinear neurons (sigmoids) and one output layer with pure linear neuron. The network was trained in the backpropagation mode and the smallest CV-based prediction error has been obtained in the trainoss mode (the one step secant algorithm). Sparse linear regression was implemented by the interior point method [24], designed for the solution of large scale linear inverse problems with a MATLAB code available at [33].

Materials and Methods
Reaction courses were monitored by TLC on Silica Gel 60 F 254 Merck plates and examined under UV light or detected with HBr/ninhydrine. Column chromatography was performed with Merck silica gel (0.040-0.063 mm). Optical activities were measured on Optical Activity LTD automatic AA-10 polarimeter at 20 °C. Analysis was performed on HPLC system coupled with UV detector; C-18 semipreparative (250 × 8 mm, ID 5µm) column at flow rate 1 mL/min, or analytical (150 × 4.5 mm, ID 5 µm) column at flow rate 0.5 mL/min was used under isocratic conditions using different concentration of MeOH in 0.1% aqueous TFA. UV detection was performed at 254 or 280 nm. NMR spectra were recorded on 600 MHz and 300 MHz spectrometers, operating at 150.92 or 75.47 MHz for 13 C and 600.13 or 300.13 MHz for 1 H nuclei. TMS was used as an internal standard. Mass spectrometry measurements were performed on HPLC system coupled with triple quadrupole mass spectrometer, operating in positive electrospray ionization (ESI) mode. Spectra were recorded from a 10 μg/mL compound solution in 50% methanol/0.1% FA by injection of 2 μL into the ion source of the instrument by autosampler, at the flow rate of 0.2 mL/min (mobile phase 50% methanol/0.1% FA). HRMS analysis was performed on MALDI-TOF mass spectrometer operating in reflectron mode. Calibration type was internal with calibrants produced by matrix ionization (monomeric, dimeric and trimeric CHCA), azithromycin and angiotensin II dissolved in α-cyano-4-hydroxycinnamic acid matrix in the mass range m/z 190.0499 to 749.5157 or 1046.5417. Accurately measured spectra were internally calibrated and elemental analysis was performed on Data Explorer v. 4.9 Software with mass accuracy better than 5 ppm. Samples were prepared by mixing 1 μL of analyte methanol solution with 5 μL of saturated (10 mg/mL) solution of α-cyano-4-hydroxycinnamic acid (α-CHCA) and internal calibrants (0.1 mg/mL) dissolved in 50% acetonitrile/0.1% TFA. Synthesis of (R,S)-(1-adamantyl)glycine (Ada) and Boc-Phe-Ada-OH (4) is described previously [11].

General Procedure for the Synthesis of Compounds 5-7
Selected Boc-protected dipeptide 1 or 2 (Boc-Phe-Ala-OH or Boc-Phe-Gly-OH, respectively) (0.150 mmol) was dissolved in dry DMF at 0 °C. HOSu (0.180 mmol) and DCC (0.195 mmol) were added into the solution and the resulting mixture was stirred for 30 min at 0 °C and then at room temperature overnight. The reaction mixture was filtered and the precipitate washed with DMF. Mother liquor was added dropwise into the suspension of KHCO 3 (0.198 mmol) and H-Met-Met-Met-OH or H-Met-Met-OH (0.180 mmol) in water. The reaction mixture was stirred for 3 hours at room temperature. pH of the solution was set to 2-3 with 10% citric acid and the product was extracted with Et 2 O, washed with brine and water and purified by flash chromatography.

Proliferation Assay
The biological potential of prepared compounds has been tested in the Laboratory for experimental therapy (Ruđer Bošković Institute) following the previously published procedure [12]. Briefly, MCF-7, SW 620 and HeLa cells were cultured as monolayers and maintained in Dulbecco's modified Eagle medium (DMEM), supplemented with 10% fetal bovine serum (FBS), 2mM L-glutamine, 100 U/mL penicillin and 100 μg/mL streptomycin in a humidified atmosphere with 5% CO 2 at 37 °C. The panel cell lines were inoculated onto a series of standard 96-well microtiter plates on day 0, at 1 × 10 4 to 3 × 10 4 cells/mL, depending on the doubling times of specific cell line. Test agents were then added in five, 10-fold dilutions (10 −8 to 10 −4 M) and incubated for a further 72 hours. Working dilutions were freshly prepared on the day of testing. The solvent was also tested for eventual inhibitory activity by adjusting its concentration to be the same as in working concentrations. After 72 hours of incubation the cell growth rate was evaluated by performing the MTT assay. For this purpose the substance treated medium was discarded and MTT was added to each well at a concentration of 20 μg/40 μL. After four hours of incubation the precipitates were dissolved in 160 μL of dimethylsulphoxide (DMSO). The absorbance that is directly proportional to the number of living, metabolically active cells was measured on a microplate reader at 570 nm. Each test point was performed in quadruplicate in three individual experiments.

Conclusions
The results of the present work demonstrate the complexity of prediction of biological activity of candidate molecules based on the regression models trained on a dataset of very small ratio between the number of labeled samples and number of features. Six regression models were trained on a dataset of 22 OGF-related peptides with measured antitumor activity (1647 QSAR descriptors) and then applied on a 429 virtual library members. To, possibly, hinder overfitting and instability of prediction, consensus among the regression models has been sought. Highly ranked compounds were prepared and tested. Although some of them showed more pronounced activity compared with the native OGF, they were less active than highly ranked compounds selected previously from the same virtual library by the RBF SVM regression model. The RBF SVM regression model has also been considered here and showed smaller cross-validation accuracy. However, its prediction has been a part of the consensus reached, in addition to RBF SVM, between feed-forward ANN, linear SVM and linear sparse regression model. This seemingly paradoxical result confirms inherent instability of prediction that is based on regression model trained on a (very) small sample size. This has been additionally confirmed by high correlation (0.94) of predictions of test data between these four models, whereas list of features that are highly correlated with predictions of corresponding models do not overlap. As this situation is, unfortunately, often found within the academic community, it is possibly wise to consider development of a mathematical model for the evaluation of the quality of prediction achieved under such conditions, in the spirit of the methods developed recently in microarray data analysis. It is also recommended to consider application of existing or development of novel feature selection methods to remove redundant features. This would increase the ratio between the number of labeled samples and number of features before training of regression models and, thus, should reduce instabilities of predictions of test data.