Proteomic Screening for Prediction and Design of Antimicrobial Peptides with AmpGram

Antimicrobial peptides (AMPs) are molecules widespread in all branches of the tree of life that participate in host defense and/or microbial competition. Due to their positive charge, hydrophobicity and amphipathicity, they preferentially disrupt negatively charged bacterial membranes. AMPs are considered an important alternative to traditional antibiotics, especially at a time when multidrug-resistant bacteria are on the rise. Therefore, to reduce the costs of experimental research, robust computational tools for AMP prediction and for identification of the best AMP candidates are essential. AmpGram is our novel tool for AMP prediction; it outperforms top-ranking AMP classifiers, including AMPScanner, CAMPR3 and iAMPpred. It is the first AMP prediction tool created for longer AMPs and for high-throughput proteomic screening. AmpGram prediction reliability was confirmed on the example of lactoferrin and thrombin; the former is a well-known antimicrobial protein and the latter a cryptic one. After protease treatment, both proteins produce functional AMPs that have been experimentally validated at the molecular level. The lactoferrin and thrombin AMPs were located in the antimicrobial regions clearly detected by AmpGram. Moreover, AmpGram also provides a list of short 10-amino-acid fragments in the antimicrobial regions, along with their prediction probabilities; these can be used for further studies and the rational design of new AMPs. AmpGram is available as a web server and as an easy-to-use R package for proteomic analysis in the CRAN repository.


Introduction
Abuse and overuse of antibiotics in human health care and animal breeding have greatly contributed to worldwide resistance to antibiotics. Moreover, the fact that hardly any new classes of antibiotics have been introduced to the market for decades makes the situation even more alarming [1,2]. Multidrug-resistant bacteria, the so-called 'superbugs', threaten our ability to tackle even common infectious diseases, resulting in prolonged illnesses and the death of tens of thousands of people.

Table 1. Peptide and protein length distribution in the UniProt [38] and dbAMP [39] databases divided into length groups according to the AmpGram benchmark dataset (for details, see Section 3).

Our goal was to launch a high-throughput computational classifier, AmpGram, that could efficiently scan proteomes not only for typical AMPs but also for longer proteins with AMP properties, including cryptic AMPs, and indicate with high accuracy the regions responsible for AMP activity. AmpGram uses n-grams (amino-acid motifs) and random forests (a machine learning method) as an AMP classification algorithm. This methodology has already been used with success in our previous projects to create software for the prediction of amyloid proteins [40] and signal peptides [41], and to assess optimal growth conditions for methanogens [42].
A new approach that identifies potential AMP regions in proteins is needed not only because of the alarming growth of bacterial resistance, but also because small peptides are easier and cheaper to synthesize and present fewer side effects, as indicated, e.g., by pardaxin [43]. Moreover, their activity can be easily improved by sequence modifications that increase hydrophobicity and/or positive charge. The application of n-grams also allowed us to overcome the problem of high score-length dependency [29]. The overprediction of longer AMPs could not have been solved simply by including them in the positive training dataset because, in contrast to typical AMPs, their amino acid composition is hardly distinguishable from that of other proteins (Supplementary Figure S1). The similarity in amino acid composition between longer AMPs and the negative dataset results from the fact that only short regions of these proteins are responsible for their AMP properties.

Benchmark Analysis of AMP Predictors
The benchmark analysis involved AmpGram and other top-ranking AMP predictors: AMPScanner [24], ADAM [25], iAMP-2L [26], CAMPR3 [27] and iAMPpred [28]. In order to compare their performance, the AUC (area under the receiver operating characteristic curve), precision, sensitivity and specificity were calculated for the test dataset. The performance results include the division of the benchmark dataset into five groups according to sequence length (for details, see Section 3). However, to keep the article concise, only the results for (i) all lengths and (ii) the longest AMPs are presented. The group of all lengths is dominated by shorter sequences, from ten to 60 amino acids, i.e., typical AMPs, and is therefore biased against longer peptides and proteins. Consequently, the results in Figure 1 and Tables 2 and 3 include the most informative groups analyzed. The complete results are available in the Supplementary Materials (Figure S2 and Tables S1-S4).
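The four benchmark metrics can be reproduced from first principles. Below is a self-contained Python sketch (AmpGram itself is distributed as an R package, so this is a stand-in, not the authors' code) with purely illustrative labels and scores:

```python
# Benchmark metrics for a binary AMP classifier (1 = AMP, 0 = non-AMP).

def confusion(labels, predictions):
    """Count TP, FP, TN, FN from binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return tp, fp, tn, fn

def sensitivity(tp, fn):  # proportion of true AMPs recovered
    return tp / (tp + fn)

def specificity(tn, fp):  # proportion of non-AMPs recognized as such
    return tn / (tn + fp)

def precision(tp, fp):    # proportion of predicted AMPs that are AMPs
    return tp / (tp + fp)

def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney) formulation: the probability
    that a random AMP is scored higher than a random non-AMP."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The AUC formulation also explains why threshold-free comparison is preferred: it does not depend on the cut-off each predictor happens to use.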

Figure 1. AUC, precision, sensitivity and specificity of the benchmarked predictors for all sequence lengths and for the longest (61-710 aa) AMPs.

The benchmark results (Figure 1, Tables 2 and 3) confirm that AmpGram performs very well but is outperformed by AMPScanner [24], both for the group of all lengths (AUC: 0.964 vs. 0.906) and for the longest AMPs (AUC: 0.905 vs. 0.839). However, the benchmark is biased against AmpGram because our test dataset could contain sequences that were included in the training datasets of other AMP predictors, including AMPScanner [24]. In order to test the influence of this bias, we compared the performance of AmpGram and AMPScanner on two datasets, APD3 [44] and DAMPD [45], in accordance with the methodology of Gabere and Noble [29]. It is important to emphasize that AMPScanner [24] was trained exclusively on sequences from the APD3 database [44], and that neither AMPScanner [24] nor AmpGram used the DAMPD database [45]. To ensure that the DAMPD dataset [29] is indeed unbiased, we additionally searched it and removed all sequences that were present in the AmpGram or AMPScanner [24] training dataset. As expected, AMPScanner beats AmpGram on the biased APD3 dataset (AUC: 0.985 vs. 0.972; Figure 2, Table 4); however, AmpGram outperforms AMPScanner [24] on the unbiased DAMPD dataset (AUC: 0.932 vs. 0.909; Figure 2, Table 5). This indicates that AmpGram is a more robust predictor. Moreover, in contrast to AMPScanner [24], AmpGram also allows query sequences to contain non-standard amino acids.

Figure 2. Comparison of AmpGram and AMPScanner [24] performance on the APD and DAMPD datasets with other predictors from Gabere and Noble's benchmark and according to their methodology [29]. Sequences used to train either AmpGram or AMPScanner were removed from the DAMPD dataset. The benchmark without their removal is presented in Figure S3 in the Supplementary Materials. The very low values of precision are due to the very large negative dataset used (for details, see Section 3).
The other top-ranking AMP classifiers are not far behind AmpGram in the prediction of typical AMPs, but they have problems with longer peptides and proteins (Figure 1, Tables 2 and 3). For example, all CAMPR3 tools [27], which are based on random forests (CAMPR3-RF), support vector machines (CAMPR3-SVM), artificial neural networks (CAMPR3-ANN) and discriminant analysis (CAMPR3-DA), are characterized by decent sensitivity but very low specificity and precision. Sensitivity and specificity reflect the proportions of AMP and non-AMP sequences that are correctly identified as AMPs and non-AMPs, respectively, and precision the proportion of predicted AMPs that actually are AMPs [46,47]. This means that all CAMPR3 algorithms tend to 'overpredict' longer sequences as AMPs, i.e., generate a high number of false positive results. This high score-length dependency has already been indicated by Gabere and Noble [29] and also concerns iAMPpred [28]. In contrast to CAMPR3 and iAMPpred, ADAM [25] has very low sensitivity but decent specificity and precision, which means that the program rather 'underpredicts' longer peptides and proteins, i.e., generates a high number of false negative results.

Table 4. Comparison of AmpGram and AMPScanner [24] performance on the APD dataset with other predictors from Gabere and Noble's benchmark and according to their methodology [29]. The very low values of precision are due to the very large negative dataset used (for details, see Section 3).

Prediction of Potential AMP Regions and Fragments
The goal behind the development of AmpGram was to introduce a high-throughput and accurate computational classifier that could search proteomes not only for typical AMPs, but also for longer and cryptic AMPs, such as lactoferrin [32] and thrombin [37], respectively. Cryptic AMPs are AMP sequences embedded in proteins that do not seem to have any AMP properties.
As indicated in the benchmark section, AmpGram is the best AMP classifier that also robustly detects longer AMPs. Moreover, AmpGram predicts regions that have antimicrobial potential. It scans a protein sequence with a sliding window of 10 amino acids in search of n-grams characteristic of AMPs and non-AMPs. Consequently, it divides the protein into overlapping subsequences of 10 amino acids (10-mers) that either are or are not AMPs (for details, see Section 3). The 10-mers are subsequently plotted along the sequence of the whole protein, indicating regions with strong antimicrobial potential. Figure 3 presents exemplary results for lactoferrin (AmpGram prediction probability 0.627) and thrombin (AmpGram prediction probability 0.839).
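The sliding-window decomposition and the marking of candidate regions described above can be sketched in a few lines of Python (a simplified illustration, not the AmpGram implementation; the `score` argument is a hypothetical stand-in for the trained first-layer model):

```python
def ten_mers(sequence, k=10):
    """Slide a window of k residues over the sequence one position at a time,
    yielding every overlapping k-mer with its 1-based start coordinate."""
    return [(i + 1, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

def candidate_regions(sequence, score, threshold=0.5):
    """Return 1-based (start, end) coordinates of runs of consecutive 10-mers
    whose AMP score exceeds the threshold; `score` maps a 10-mer to [0, 1]."""
    regions, run_start = [], None
    for pos, mer in ten_mers(sequence):
        if score(mer) > threshold:
            run_start = pos if run_start is None else run_start
        elif run_start is not None:
            regions.append((run_start, pos + 8))  # end of the previous 10-mer
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(sequence)))
    return regions
```

A protein of length L yields L - 9 overlapping 10-mers, which is why even a single short antimicrobial stretch produces a visible cluster of positive windows in plots such as Figure 3.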
In the case of lactoferrin, three regions have already been experimentally confirmed as AMPs, and two of them, lactoferricin and lactoferrampin (268-284), were clearly identified by AmpGram as AMPs [32]. Moreover, AmpGram detected many more regions in the lactoferrin sequence that could represent potential AMPs. They can be easily identified in Figure 3A as sites with many overlapping AMP 10-mers (Table S6). Interestingly, the distribution of AMP 10-mers also perfectly reflects the evolutionary history of lactoferrin, i.e., its origin by a gene duplication event [48]. There are six distinct regions with an accumulation of AMP 10-mers: three in the N-terminal globular domain and three in the C-terminal one.
Human thrombin is a typical cryptic AMP. While the whole protein does not have any AMP properties, its C-terminal region does; moreover, the AMP fragments constitute a novel class of AMPs [37]. AmpGram prediction reveals that the AMP potential of the longest experimentally confirmed thrombin fragment (527-622) seems to be restricted to its C-terminus and overlaps with the other two shorter AMP fragments (597-622, 604-622). As in the case of lactoferrin, AmpGram also detected many more regions in thrombin that could represent AMPs (Figure 3B; Table S6).

Figure 3. AmpGram prediction of potential AMP regions in lactoferrin (A) and thrombin (B). Experimentally confirmed AMP fragments are indicated for lactoferrin [32] and 527-622, 597-622 and 604-622 for thrombin [37]; the sequence coordinates for lactoferrin do not include an N-terminal signal peptide (1-19).

Datasets
In order to construct the positive, i.e., antimicrobial, dataset, 12,389 AMPs were retrieved from dbAMP [39], which is at present the most comprehensive database of AMPs. It includes information from other publicly available AMP databases, such as APD3 [44], CAMPR3 [27], ADAM [25], PhytAMP [49], AMPer [50], AntiBP2 [51], BACTIBASE [52] and LAMP [53]. Sequences containing non-standard amino acids (B, J, O, U, X, Z) were removed from the positive dataset. In order to reduce redundancy, and consequently bias, in the antimicrobial dataset, sequence clustering was performed with the CD-HIT program (version 4.8.1) at an identity threshold of 0.90 [54]. In total, the final positive dataset contained 2463 peptides.
As only a few sequences have been verified as non-AMPs, the negative dataset was created using peptides extracted from cytoplasmic proteins, similarly to the datasets presented by Gabere and Noble [29]. We downloaded 544,249 sequences from UniProt (version from 20.12.2019) [38] that were experimentally validated as proteins without documented antimicrobial, antibacterial, antiviral or antifungal activity and did not possess a mitochondrial or plastid transit peptide. We excluded proteins carrying mitochondrial or plastid transit peptides because their presequences have been hypothesized to have evolved from AMPs [55] and therefore might have introduced bias into the negative dataset. The sequences downloaded from UniProt [38] were concatenated into a single string. From the concatenated string, we cut off blocks equal in length to all 2463 sequences from the positive dataset. Next, within the extracted blocks, we cut off sequences corresponding in length to AMPs from the randomly mixed positive dataset. For each AMP in the positive dataset, a subset of non-AMP sequences equal in length to the given AMP was created. Finally, from each subset of non-AMPs, we randomly collected one sequence for the negative dataset, amounting to 2463 sequences (Figure 4A).

Figure 4. Construction of the negative dataset: antimicrobial sequences were retrieved from the dbAMP database [39] (red, green and blue horizontal lines). To create the negative dataset, non-antimicrobial sequences (grey horizontal lines) were retrieved from the UniProt database [38]. The sequences were first concatenated into one string (grey horizontal line) and then cut (black vertical lines) into blocks corresponding in length to sequences from the positive dataset (red, green and blue horizontal lines). The extracted blocks were next cut (not indicated in the figure) into subsets corresponding in length to sequences from the positive dataset (red, green and blue circles), from which individual sequences were randomly selected for the negative dataset (A).
Exemplary n-grams used to train AmpGram: the positive n-grams are shaded in red, green and blue, and the negative ones in grey (B). To make a prediction, AmpGram first divides a peptide into subsequences of 10 amino acids (10-mers). For each 10-mer, AmpGram predicts whether it is an AMP (true) or not (false) (first model). To scale the prediction for 10-mers to the whole peptide, a set of statistics is calculated, on the basis of which AmpGram makes the final prediction (second model). Abbreviations of the statistics: fraction_true - fraction of positive 10-mers, pred_mean - mean value of prediction, pred_median - median of prediction, n_peptide - number of 10-mers in a peptide, n_pos - number of positive 10-mers, pred_min - minimum value of prediction, pred_max - maximum value of prediction, longest_pos - the longest stretch of consecutively occurring 10-mers predicted as positive, n_pos_10 - number of stretches comprising at least 10 10-mers predicted as positive, frac_0_0.

We divided both the positive and negative datasets into five equally sized groups of sequence lengths: (i) 11-19, (ii) 20-26, (iii) 27-36, (iv) 37-60 and (v) 61-710, in order to ensure a similar length distribution of sequences in the training and benchmark datasets. Next, we randomly extracted one tenth of the sequences from each group to create the benchmark dataset. It comprised 247 AMP and 247 non-AMP sequences and was subsequently used to compare the performance of AmpGram with other top-ranking predictors. The remaining 2216 sequences in each dataset were used to train AmpGram.
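The negative-dataset construction described above (concatenation of background proteins into a single string, cutting length-matched fragments, and random selection of one fragment per AMP) can be sketched as follows. This is a deliberately simplified Python illustration with hypothetical inputs, not the exact procedure used for AmpGram:

```python
import random

def sample_negatives(amp_lengths, background_sequences, n_candidates=3, seed=0):
    """Simplified sketch: concatenate background (non-AMP) proteins into one
    string, cut several candidate fragments matching each AMP's length, and
    randomly keep one candidate per AMP for the negative dataset."""
    rng = random.Random(seed)
    pool = "".join(background_sequences)   # single concatenated string
    negatives, offset = [], 0
    for length in amp_lengths:
        candidates = []
        for _ in range(n_candidates):      # a small subset of candidates per AMP
            candidates.append(pool[offset:offset + length])
            offset += length
        negatives.append(rng.choice(candidates))
    return negatives
```

The key property preserved by this scheme is that the negative dataset exactly mirrors the length distribution of the positive dataset, so length itself cannot act as a discriminative feature.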
We also compared the performance of AmpGram and other AMP predictors, including AMPScanner [24], on the benchmark datasets from Gabere and Noble [29]. They used 1713 AMP and 8565 non-AMP sequences from the APD3 database [44], and 547 AMP and 2735 non-AMP sequences from the DAMPD database [45]. To ensure that the DAMPD dataset was not biased in favour of AmpGram or AMPScanner [24], we removed 336 AMP sequences from the DAMPD dataset that were present either in the AmpGram (240 sequences) or AMPScanner (239 sequences) [24] training dataset. The benchmark without their removal is presented in the Supplementary Materials (Figure S3 and Table S5).
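The removal of previously seen sequences amounts to a simple set-difference filter; a minimal sketch (function name and inputs are ours, for illustration only):

```python
def remove_training_overlap(benchmark, training_sets):
    """Drop benchmark sequences that occur in any predictor's training
    dataset, so the comparison is not biased in favour of a predictor
    that has already seen them."""
    seen = set().union(*training_sets)
    return [seq for seq in benchmark if seq not in seen]
```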

Extraction of Encoded N-Grams
We scanned each sequence with a sliding window of 10 amino acids, dividing it into overlapping subsequences of 10 amino acids (10-mers). All 10-mers from the positive dataset were considered AMPs, whereas all 10-mers from the negative dataset were considered non-AMPs. Consequently, we obtained 87,716 AMP 10-mers and 87,599 non-AMP 10-mers. For each 10-mer in the positive and negative datasets, we extracted n-grams, which are continuous or discontinuous sequences of n elements. We considered unigrams (n-grams of size 1), bigrams (size 2) and trigrams (size 3), and analyzed continuous and discontinuous n-grams separately. For bigrams, we considered gap lengths from 1 to 3, whereas trigrams could contain only a single gap, between the first and the second or the second and the third position. Next, the counts of n-grams were binarized: 1 if an n-gram was present in the sequence and 0 if it was absent (Figure 4B).
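The n-gram extraction for a single 10-mer can be sketched as below, following the gap conventions described above. This is an illustrative Python stand-in (AmpGram's extraction is implemented in R); the trigram gap length of 1 is our assumption, since the exact value is not restated here. Underscores mark gap positions:

```python
def gapped_bigrams(mer, gap):
    """Bigrams whose residues are separated by `gap` positions (0 = contiguous)."""
    return {mer[i] + "_" * gap + mer[i + gap + 1] for i in range(len(mer) - gap - 1)}

def gapped_trigrams(mer, gap1, gap2):
    """Trigrams with gap1 positions after the first residue and gap2 after
    the second; at most one gap is non-zero in the scheme described above."""
    span = 3 + gap1 + gap2
    return {
        mer[i] + "_" * gap1 + mer[i + gap1 + 1] + "_" * gap2 + mer[i + gap1 + gap2 + 2]
        for i in range(len(mer) - span + 1)
    }

def binarized_features(mer):
    """The set of n-grams present in one 10-mer (presence = 1, absence = 0)."""
    feats = set(mer)                        # unigrams
    for gap in range(0, 4):                 # contiguous bigrams + gaps of 1-3
        feats |= gapped_bigrams(mer, gap)
    feats |= gapped_trigrams(mer, 0, 0)     # contiguous trigrams
    feats |= gapped_trigrams(mer, 1, 0)     # single gap between 1st and 2nd
    feats |= gapped_trigrams(mer, 0, 1)     # single gap between 2nd and 3rd
    return feats
```

Collecting these sets over all 175,315 10-mers and indicator-encoding them against a common vocabulary yields the binary matrix on which the random forest is trained.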

Model Training with Random Forests
The classifier with the best ability to correctly predict 10-mers with AMP activity was chosen during five-fold cross-validation using different length groups of sequences for training. The use of 11-26-amino-acid-long peptides (893 AMP and 893 non-AMP sequences, yielding 8791 AMP and 8818 non-AMP 10-mers) gave the best results. We used random forests as the classification algorithm and trained them on the binarized n-grams extracted from 10-mers of the positive and negative datasets (Figure 4B,C). We considered only the most informative n-grams (13,087) selected by the Quick Permutation Test (QuiPT) [40]. We grew the forest with 2000 trees and the default number of variables to possibly split at each node (the rounded-down square root of the total number of variables). To speed up the computation, we used the fastest implementation of random forests in R, i.e., the ranger package [56].
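The first-layer training step can be sketched as follows; AmpGram itself uses the ranger package in R, so scikit-learn serves here only as a stand-in, and the toy data are randomly generated for illustration:

```python
# Sketch of the first-layer random forest on binarized n-gram features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Binarized n-gram matrix: rows are 10-mers, columns are informative n-grams
# (13,087 after QuiPT selection in AmpGram; 50 in this toy example).
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 2, size=200)     # 1 = AMP 10-mer, 0 = non-AMP 10-mer

forest = RandomForestClassifier(
    n_estimators=2000,      # 2000 trees, as in the trained AmpGram model
    max_features="sqrt",    # rounded-down square root of the variable count
    n_jobs=-1,
    random_state=0,
)
forest.fit(X, y)

# Per-10-mer probability of being an AMP, consumed by the second layer.
probabilities = forest.predict_proba(X)[:, 1]
```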
In order to scale the prediction for 10-mers to the whole peptide, we calculated the following statistics for each peptide using the predictions for its 10-mers: (i) fraction_true - fraction of positive 10-mers, (ii) pred_mean - mean value of prediction, (iii) pred_median - median of prediction, (iv) n_peptide - number of 10-mers in a peptide, (v) n_pos - number of positive 10-mers, (vi) pred_min - minimum value of prediction, (vii) pred_max - maximum value of prediction, (viii) longest_pos - the longest stretch of consecutively occurring 10-mers predicted as positive, and (ix) n_pos_10 - number of stretches comprising at least 10 10-mers predicted as positive. The second random forest layer is responsible for deciding whether a given peptide (a collection of overlapping 10-mers) is an AMP or not. This architecture is also known as a stacked random forest [57].
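The aggregation of per-10-mer probabilities into peptide-level features can be sketched as below (a Python illustration of the statistics listed above, covering (i)-(viii); the positivity threshold of 0.5 is our assumption):

```python
from statistics import mean, median

def peptide_statistics(tenmer_preds, threshold=0.5):
    """Aggregate per-10-mer AMP probabilities into peptide-level features
    for the second random-forest layer."""
    positive = [p > threshold for p in tenmer_preds]
    # Longest run of consecutive positive 10-mers (longest_pos).
    longest = run = 0
    for flag in positive:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    return {
        "fraction_true": sum(positive) / len(tenmer_preds),
        "pred_mean": mean(tenmer_preds),
        "pred_median": median(tenmer_preds),
        "n_peptide": len(tenmer_preds),
        "n_pos": sum(positive),
        "pred_min": min(tenmer_preds),
        "pred_max": max(tenmer_preds),
        "longest_pos": longest,
    }
```

These features, rather than the raw 10-mer scores, form the input of the second-layer forest, which is what makes the final prediction length-independent.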

Conclusions
AmpGram is a novel AMP predictor that uses n-grams to represent information hidden in amino acid sequences and random forests as the classification algorithm. In comparison to other top-ranking AMP predictors, including AMPScanner, CAMPR3 and iAMPpred, AmpGram performs better at detecting AMPs. To the best of our knowledge, AmpGram is the first AMP classifier created for the prediction of longer AMPs and for high-throughput proteomic screening. The application of n-grams made it possible to overcome the problem of high score-length dependency that was first indicated by Gabere and Noble [29] and also confirmed in our research. AmpGram not only predicts AMPs with high accuracy, but also precisely indicates peptide/protein fragments and regions with AMP potential. In order to test how AmpGram predictions relate to actual biological activity, we performed analyses for lactoferrin and thrombin; the former is a well-known antimicrobial protein and the latter represents a cryptic AMP. Cryptic AMPs do not exhibit any AMP properties as mature proteins, but their proteolytic products do. As expected, AmpGram identified both lactoferrin and thrombin as AMPs and indicated their potential AMP fragments and regions, including the sequences previously verified experimentally as AMPs [32,37]. The examples of lactoferrin and thrombin show that antimicrobial fragments and regions predicted by AmpGram are good candidates for further investigation in terms of bactericidal activity, stability, toxicology, pharmacokinetics and the rational design of new AMPs; their antimicrobial activity can be further improved by amino acid modifications that balance the peptide hydrophobicity and positive charge vital for disrupting bacterial membranes [58]. Moreover, the small size of AmpGram-predicted fragments makes them easy to synthesize, and they potentially exhibit fewer side effects compared to longer AMPs [43].
AmpGram is available as a web server for multiple query sequences; however, for high-throughput proteomic screening, users are encouraged to use its stand-alone version (see Appendix A), which we have implemented as an easy-to-use R package.
Supplementary Materials: The following are available online at http://www.mdpi.com/1422-0067/21/12/4310/s1. Figure S1. Amino acid composition of AMP and non-AMP sequences; the analysis was performed on sequences from the positive and negative datasets, respectively (for details, see Section 3); the shorter the sequence, the stronger the differences in amino acid composition between AMPs and non-AMPs; for longer AMPs, i.e., over 60 amino acids, the differences between the datasets are hardly visible. Figure S2. Comparison of AmpGram performance with other top-ranking predictors for (i) all AMP lengths and (ii) 11-19, (iii) 20-26, (iv) 27-36, (v) 37-60 and (vi) 61-710-amino-acid-long AMPs. Figure S3. Comparison of AmpGram and AMPScanner [24] performance on the APD and DAMPD datasets with other predictors from Gabere and Noble's benchmark and according to their methodology [29]. Table S1. Comparison of AmpGram performance with other top-ranking predictors for 11-19-amino-acid-long AMPs; programs that do not provide prediction probability are marked with asterisks. Table S2. Comparison of AmpGram performance with other top-ranking predictors for 20-26-amino-acid-long AMPs; programs that do not provide prediction probability are marked with asterisks. Table S3. Comparison of AmpGram performance with other top-ranking predictors for 27-36-amino-acid-long AMPs; programs that do not provide prediction probability are marked with asterisks. Table S4. Comparison of AmpGram performance with other top-ranking predictors for 37-60-amino-acid-long AMPs; programs that do not provide prediction probability are marked with asterisks. Table S5. Comparison of AmpGram and AMPScanner [24] performance on the APD and DAMPD datasets with other predictors from Gabere and Noble's benchmark and according to their methodology [29]. Table S6. List of antimicrobial 10-mers for lactoferrin and thrombin, including experimentally confirmed fragments, predicted by AmpGram.