On the Prediction of In Vitro Arginine Glycation of Short Peptides Using Artificial Neural Networks

One of the hallmarks of diabetes is an increased modification of cellular proteins. The most prominent type of modification stems from the reaction of methylglyoxal with arginine and lysine residues, leading to structural and functional impairments of target proteins. For lysine glycation, several algorithms allow a prediction of occurrence, thus making it possible to pinpoint likely targets. However, to our knowledge, no approaches have been published for predicting the likelihood of arginine glycation, even though there are indications that arginine, and not lysine, is the most prominent target of this toxic dicarbonyl. One of the reasons why there is no arginine glycation predictor is the limited availability of quantitative data. Here, we used a recently published high-quality dataset of arginine modification probabilities to employ an artificial neural network strategy. Despite the limited data availability, we achieve an accuracy of about 75% in predicting the exact value of the glycation probability of an arginine-containing peptide, without relying on thresholds to decide whether a given arginine is modified or not. This contribution suggests a solution for predicting arginine glycation in short peptides.


Introduction
In nature, there is an amazing variety of proteins. So far, thousands of them have been described that carry out diverse functions essential for life, either as structural building blocks within and outside of cells or as catalysts of biochemical reactions in the form of enzymes [1,2]. Proteins are composed of a certain number of amino acids. There are 20 'standard' amino acids and several somewhat obscure ones, such as selenocysteine and pyrrolysine, that are also proteinogenic [3].
Clearly, the potential sequence variety is enormous even in short proteins (peptides), reaching astronomical proportions. To this huge variety, so-called post-translational modifications of amino acids must be added. These modifications add layers of regulation and control and include a plethora of processes leading to adducts, such as acetylation, phosphorylation, methylation, and ubiquitination, among many others [4]. The post-translational modification of specific amino acids can occur enzymatically or non-enzymatically. For example, protein glycosylation, which is important for protein sorting and protein secretion, occurs enzymatically [...]; in our case, the modified residue will always be arginine. Specifically, we utilize ANNs because they offer a direct way to solve problems, given their high accuracy and their adaptation to noisy, unknown, or incomplete information [24], as well as fast computation after training, since the neural network can easily be implemented on parallel hardware. In particular, ANNs as ML methods have been applied, for example, to flow phase pattern identification problems in fluids [39,40].
The peptide sequences we used were extracted from Sjoblom et al. [22]. Although the number of glycated peptides is relatively small for training or prediction, the information is of exceptional quality because all experiments were performed by the same lab using the same procedures for modification and its detection. As such, our work is not based on glycation data from different sources that are inherently difficult to compare. Furthermore, our approach allows stating a probability of glycation in percent. It is therefore superior to algorithms that are based on thresholds that decide whether a residue is glycated or not. In conclusion, our contribution is aimed at enabling a more directed AGE analysis of in vitro glycated peptides, saving time and funds for the researcher.

General Outline
We chose the following features of amino acids to computationally characterize a short protein sequence (11 amino acids) that contains a central arginine residue: sequence of amino acids of the peptide (SoA), hydropathy (Hyd), mass (Mas), hydrophobicity (Hyp), polarizability (Pol), normalized van der Waals volume (vdW), torsion angle (ToA), and isoelectric point (IEP) (Figure 1). Some of these features represent physical properties and others structural properties; for the sake of simplicity, we will refer to all of them indistinctly as features. Subsequently, we formed 11-element vectors for each feature and selected the number of vectors, one (study case 1), two (study case 2), or three (study case 3), that enter an ANN to train a model that allows us to predict the glycation probability of the central arginine residue in the 11mer peptide sequence. These sequences were taken from a recent publication by Sjoblom et al. [22]. The extent of arginine modification was determined by these authors using a state-of-the-art technique (liquid chromatography mass spectrometry). In total, 54 sequences were retrieved. Although the number of glycated peptide fragments is not particularly high, all experimental steps were conducted under comparable conditions, making comparisons much more reliable [22]. Figure 1 shows a general outline of the process followed in our methodology. More details on the ANN operation can be found in the Supplementary Materials to this manuscript.

Figure 1. Steps to follow for the characterization and prediction of glycation using ANN. First, a preliminary database is assembled from the amino acid sequence of the peptides. Then, by rewriting the amino acid sequence with the values corresponding to each of the physical properties, a list of all the vectors is created. Their values are normalized and delivered to the ANN, which through a learning process makes the final predictions corresponding to the probability of glycation for each peptide.

Database Construction and Study Cases
The tool of ML used to make the prediction of peptide glycation is an ANN, which requires data for training and testing. As stated above, for the construction of the arginine glycation database, information obtained experimentally by Sjoblom et al. [22] was used. This information allows us to rewrite the alphabetical sequence of amino acids of each peptide with the corresponding numerical values of each of its physical properties. The list of the 20 proteinogenic amino acids, with their corresponding values for each of the physical properties that represent them, can be found in Table S1 in the Supplementary Material.
By employing this information, we built a database with 54 (peptides) × 8 (features) vectors using the different values for each amino acid (11 elements). That is, each of the 432 vectors was formed by selecting one of the 54 peptides, made up of a sequence of 11 elements, and one of the 8 physical properties, thus assigning to each element of the peptide the value corresponding to that physical property. For example, for the 11mer sequence SPFYLRPPSFL we built eight vectors (Figure 2). We retrieved each element of the sequence (i.e., an amino acid) and obtained the corresponding values (for the complete list of constructed vectors refer to Table S2 in the Supplementary Material). This process was repeated for all properties.
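As a minimal illustration of this rewriting step, the following Python sketch maps the example 11mer onto two of its eight feature vectors. The property values shown (Kyte-Doolittle hydropathy and monoisotopic residue masses for the six amino acids occurring in the example) are illustrative stand-ins; the values actually used are those of Table S1.

```python
# Illustrative per-residue property values (subset covering SPFYLRPPSFL only).
HYDROPATHY = {"S": -0.8, "P": -1.6, "F": 2.8, "Y": -1.3, "L": 3.8, "R": -4.5}
MASS = {"S": 87.03, "P": 97.05, "F": 147.07, "Y": 163.06, "L": 113.08, "R": 156.10}

def peptide_to_vector(peptide: str, prop: dict) -> list:
    """Rewrite an 11mer amino acid sequence as an 11-element numeric vector."""
    return [prop[aa] for aa in peptide]

peptide = "SPFYLRPPSFL"                      # example 11mer; arginine (R) is central
hyd_vec = peptide_to_vector(peptide, HYDROPATHY)
mas_vec = peptide_to_vector(peptide, MASS)   # two of the eight vectors for this peptide
```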

We considered basically three cases as inputs for the ANN. The single case accepts as input one individual vector of the same property for each peptide. The two-vector case considers two vectors of two different properties in combination for each peptide, giving rise to up to 28 different combinations. Finally, the three-vector case considers three vectors of three different properties in combination for each peptide, giving rise to up to 56 different combinations. We would like to point out that considering higher-order combinations results in more complex learning processes without providing significant improvements in the predictions; consequently, we did not consider them further in our study.
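The counts of 28 and 56 combinations follow directly from the binomial coefficients C(8,2) and C(8,3). A short sketch enumerating them, with the feature abbreviations defined above:

```python
from itertools import combinations

FEATURES = ["SoA", "Hyd", "Mas", "Hyp", "Pol", "vdW", "ToA", "IEP"]

pairs = list(combinations(FEATURES, 2))      # case 2: C(8, 2) = 28 combinations
triples = list(combinations(FEATURES, 3))    # case 3: C(8, 3) = 56 combinations
print(len(pairs), len(triples))              # -> 28 56
```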
Here, it is important to note that, for any of the three cases, the amino acid at the center of the sequence of elements is always arginine. Recall that the objective of the ANN in this project is to predict the glycation probability of the central arginine of each peptide, which is done through the combination of vectors described above. It is also worth establishing now that another objective of this study is to determine which of all these possible combinations of amino acid parameters gives the most accurate predictions.
For the ANN learning and prediction process, it is necessary to form a set of samples for each of the study cases, which we termed patterns. Each pattern was built from a combination of one, two, or three vectors for a specific peptide, whose numerical values are the properties applied to the corresponding amino acid sequence of the peptide. Thus, each pattern (P) is represented by a matrix of m rows (the number of properties selected) and n columns (one for each of the 11 amino acids within the peptide). These patterns are provided to the ANN to predict the exact glycation probability of the central amino acid arginine (inside the pattern). For example, if we wanted to predict the probability of glycation of a certain peptide from the analysis of a pattern formed by the combination of two vectors corresponding to hydropathy (Hyd) and mass (Mas), we would specify an array $A_{mn}$ of the form:

$$A_{mn} = \begin{pmatrix} \mathrm{Hyd}_1 & \mathrm{Hyd}_2 & \cdots & \mathrm{Hyd}_{10} & \mathrm{Hyd}_{11} \\ \mathrm{Mas}_1 & \mathrm{Mas}_2 & \cdots & \mathrm{Mas}_{10} & \mathrm{Mas}_{11} \end{pmatrix}_p \qquad (1)$$

where p is the index for each of the 54 peptides. Thus, for the 11mer sequence SPFYLRPPSFL mentioned above, to form a 2-vector pattern (P) we can take any two of the eight different sequences of numerical values shown in Figure 2.
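A minimal sketch of assembling such a 2-vector pattern as a 2 × 11 matrix, reusing the illustrative hydropathy and mass values from the sketch above:

```python
import numpy as np

# Illustrative 11-element feature vectors for SPFYLRPPSFL (hypothetical
# stand-ins for the Hyd and Mas rows of Eq. (1)).
hyd_vec = [-0.8, -1.6, 2.8, -1.3, 3.8, -4.5, -1.6, -1.6, -0.8, 2.8, 3.8]
mas_vec = [87.03, 97.05, 147.07, 163.06, 113.08, 156.10,
           97.05, 97.05, 87.03, 147.07, 113.08]

pattern = np.array([hyd_vec, mas_vec])   # m = 2 properties, n = 11 residues
assert pattern.shape == (2, 11)
```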
All the specifications of the study cases are presented in Table 1 for reference. From the eight amino acid features, there are multiple combinations that can form each of the study cases. For cases 2 and 3, we focus on the combinations of features that presented the best results. However, for case 1, to explain the learning and prediction process of the ANN in detail, we chose to present the results for all the features, thus forming a total of eight sub-cases (Table 1).

Table 1. Details of the ANN study cases. The physical properties and the architecture of the neural network are presented for each of the sub-cases of case 1, along with the sub-cases that performed best for cases 2 and 3.

Cases      Features                         ANN Layer Architecture
Case 1A    Sequence of amino acids (SoA)    …

Because each of the patterns used as input information for the ANN training is composed of an array of m features and 11 amino acid values (n), for the three-feature case (case 3) we have an array of the form:

$$P_p = \begin{pmatrix} f_{a,1} & f_{a,2} & \cdots & f_{a,10} & f_{a,11} \\ f_{b,1} & f_{b,2} & \cdots & f_{b,10} & f_{b,11} \\ f_{c,1} & f_{c,2} & \cdots & f_{c,10} & f_{c,11} \end{pmatrix} \qquad (2)$$

where the index p represents the peptide and therefore runs from 1 to 54, and the index m selects one of the three chosen features ($f_a$, $f_b$, or $f_c$) among the eight possible ones; the indices a, b, and c take values different from each other, ranging from 1 to 8. For optimal ANN performance, the input data for all cases are preprocessed to yield the normalized pattern $\hat{P}_p$, for which

$$\hat{P}_p(m,n) = \frac{P_p(m,n) - \bar{f}_m}{\sigma_m} \qquad (3)$$

For each feature m and each amino acid n, we normalize the matrix elements following relation (3), where $\bar{f}_m$ is the mean of the elements of the training set corresponding to the m-th feature, and $\sigma_m$ is the standard deviation of those elements. The normalization process for a given case was performed on each subset of vectors made up of elements corresponding to the same feature, and not on the whole dataset.
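The following sketch implements the per-feature normalization of relation (3) under the stated convention that the statistics come from the training set only; the array layout (peptides × features × residues) is our assumption.

```python
import numpy as np

def normalize_patterns(patterns: np.ndarray, train_idx: np.ndarray) -> np.ndarray:
    """Z-score each feature using training-set statistics only (relation (3)).

    patterns has shape (peptides, m features, 11 residues).
    """
    train = patterns[train_idx]                     # statistics from training set only
    mean = train.mean(axis=(0, 2), keepdims=True)   # one mean per feature (f̄_m)
    std = train.std(axis=(0, 2), keepdims=True)     # one std per feature (σ_m)
    return (patterns - mean) / std
```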
For the learning process, all the normalized information is divided into three sets. The first and largest, consisting of 70% of the data, is used for training. The second is the validation set, consisting of 15% of the data, and the remaining 15% is used for the final predictions presented in the Results section. It is important to note that during training the ANN does not see the data of the validation and prediction sets.
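A sketch of this 70/15/15 partition of the 54 peptides; the exact rounding of the set sizes and the fixed random seed are our assumptions, not specifications from the paper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)               # fixed seed for reproducibility
idx = rng.permutation(54)

n_train, n_val = int(0.70 * 54), int(0.15 * 54)   # 37 and 8 peptides
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]                  # remaining 9 for final predictions
```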
Thus, the ANN is fed only with the training set for each of the case studies. In general, the ANN architecture comprises one to three hidden layers, with a varying number of neurons per layer (including the input layer), according to each of the cases studied (see Table 1). Figure 3 shows the general architecture of the ANN used. It is fed with the patterns formed for each of the case studies through the input layer. Subsequently, learning is performed through the hidden layers, and the prediction is processed in the output layer. Finally, to minimize the error during the training process, the ANN was constructed as a regression model using the Adam optimization algorithm [41] with a learning rate γ = 0.001, a rectified linear unit (ReLU) as the activation function [42], and a backpropagation algorithm [43].
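As a sketch only: a regression network of this kind can be written, for instance, in Keras as below. The hidden-layer widths are placeholders (the per-case architectures are those of Table 1), and the MAE loss is our assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(m_features: int, hidden=(32, 16)) -> tf.keras.Model:
    """Regression ANN: flattened (m x 11) pattern in, glycation probability out."""
    model = tf.keras.Sequential([tf.keras.Input(shape=(m_features, 11)),
                                 layers.Flatten()])
    for units in hidden:                       # one to three hidden layers (Table 1)
        model.add(layers.Dense(units, activation="relu"))   # ReLU activations [42]
    model.add(layers.Dense(1))                 # predicted glycation probability (%)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # Adam [41]
                  loss="mean_absolute_error")  # loss choice is our assumption
    return model

model = build_model(m_features=2)              # e.g., a two-feature (case 2) pattern
```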

Figure 3. Schematic of the utilized ANN structure. Each of the 11mer sequences with n features is provided to the ANN through the input layer, from which the learning process proceeds through an adjustment of the interconnections in the hidden layers. Finally, a prediction of the glycation probability is made, which is provided by the output layer. Different colors distinguish different properties.

Results
To predict the value of the glycation probability based on the small database available, ANNs were used; they can offer high accuracy even with incomplete information [24]. As previously described, we used eight different features to construct the vectors, feeding the algorithms with one of the features for each sequence or with a combination of them. We report the Mean Absolute Percent Error (MAPE) and the Mean Absolute Error (MAE), averaging the results over 160 different predictions, each with a different ANN training. Table 2 summarizes the analysis carried out for the eight individual features and the combinations of two of them. The MAPE and MAE values are reported in the upper and lower diagonal matrices, respectively; in the main diagonal of the matrix, the MAPE is given first, followed by the MAE. Both errors are also characterized by gray shades, describing whether the prediction accuracy is high (light gray), average (medium gray), or low (dark gray).

We found that the combination of features can improve the performance of the ANN. We now review the results for each specific case, recalling that the values above are averages over 160 different predictions. Figure 4 shows the box plots of MAPE and MAE for the individual features. Even though vdW yields the lowest errors in Table 2, its distribution is broader than that of Hyd. According to Table 2, Hyd shows a 7.2% higher MAPE and a 2.5% higher MAE than vdW, but a 66.7% and 70.7% narrower range distribution, respectively. The narrowest range distributions for MAPE and MAE are given by IEP and Hyp, respectively; in contrast, Pol shows the broadest distribution in both cases.
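For reference, a minimal sketch of the two reported error metrics; the train_and_predict helper in the comment is hypothetical.

```python
import numpy as np

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error between measured and predicted probabilities."""
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Percent Error; assumes no measured value is exactly zero."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# Averaging over repeated retrainings, as in the 160-run protocol:
# errors = [mape(y_test, train_and_predict(seed=s)) for s in range(160)]  # hypothetical helper
# reported_mape = np.mean(errors)
```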
For cases 2 and 3, where there are multiple combinations of amino acid features, for clarity we present in detail only the combinations that showed the best results (the remaining cases are shown in the Supplementary Material, Figure S1).

Thus, for case 2, the highest errors were obtained with the combination of Hyp and ToA, with values for MAE and MAPE of 30.61 and 57.13%, respectively. If we instead combine the best features in the main diagonal, we can improve the results compared to a single feature. The best values found using a combination of two features were for vdW-Pol (for both MAE and MAPE), with 23.53 and 33.3%, respectively (Table 2).

Figure 5 shows the MAPE and MAE, respectively, of the eight combinations of two features with the lowest values; in turn, for comparison, the two combinations with the highest errors are presented (the complete set can be found in the Supplementary Material, Figure S1). Interestingly, once we study combinations of two features, most cases show error distributions as narrow as that of Hyd alone, the single feature with the narrowest error distribution (see Figure 4). This could lead to the idea that increasing the number of features used in the ANN increases the performance and narrows the error distribution, but this is not always the case: the features may hide unknown correlations, so that increasing their number will not improve the prediction, as the amount of independent information may not increase. The narrowest error ranges for case 2 are given by the combinations Mas-ToA for MAPE and Hyd-IEP for MAE, with ranges of 2.13% and 0.80, respectively.

In case 3, consisting of combinations of three different properties, an improvement in ANN performance was generally observed, mainly in combinations incorporating the properties SoA, Hyd, Mas, vdW, and IEP, a trend already visible for case 2 in Figure 5. Thus, the lowest value of the MAE, 15.42, was obtained by combining SoA with Mas and IEP, while the lowest MAPE, 25.04%, was reached by combining Hyd with Mas and IEP; both are substantial reductions with respect to the best results of case 2.

Discussion
Here, we developed a tool for predicting the glycation probability of arginine residues in proteins. Although we had to work with a limited dataset, we consider it relevant to have some means of predicting arginine modifications. Furthermore, our work can serve as a proof of principle and be subsequently expanded once more information on arginine glycation becomes available.
Arginine, akin to lysine, is a prime target for methylglyoxal [10]. Since our dataset on arginine glycation is too small for conventional approaches, we employed a machine learning strategy. While artificial intelligence (AI) is the overarching science of mimicking human abilities, machine learning is a specific subset of AI that trains a machine how to learn. Nowadays, machine learning is one of the most important tools for scientists in the development of new applications [44,45]. We could have conveniently employed a linear-regression-based approach to estimate the probability of glycation, but the results would have been considerably poorer than those obtained using the more sophisticated ANN.
At present, the accuracy of our algorithm is limited by the relatively small size of the database we used for training and testing the ANN. This bottleneck can be tackled by adding more data entries to the database once these become available. This would allow an improvement in the reliability of our algorithm for successfully reporting glycation probabilities.
Experimentally, approaches like nano high-performance liquid chromatography/electrospray ionization/tandem mass spectrometry can be utilized to determine the ratio of glycated to total peptides [46]. It should be kept in mind that amino acid glycation usually does not result in a "black or white" pattern (i.e., all peptides carrying the modification or none) but rather in a gradual probability scale. Once more data on arginine glycation become available, we aim to present a tool based on our algorithm that analyzes a protein sequence provided by the user in FASTA format for the presence of arginine residues that are potentially glycated. The output would be given as an arginine glycation probability, in percent, at a specific position of the protein. This approach could allow narrowing down the number of arginine residues that can preferentially become AGE-modified. Such a tool is envisioned to enable a more directed protein-AGE-arginine analysis, saving time and funds for the researcher.
We want to stress that efforts that inspired the present research have already been made in this area. Reddy et al. [31] developed a methodology based on support vector machines (SVM) with which they were able to classify glycated and non-glycated lysine residues using the structural properties of amino acid residues. For that work, they had a reference database containing a total of 538 glycated and non-glycated lysine residues, with which they obtained an accuracy of 0.7562 (where 0 is totally inaccurate and 1 totally accurate). Recently, Yu et al. [27] achieved a considerable improvement in the classification of lysine glycation with SVM, working with a database of more than 6000 items and reaching a high accuracy of 0.88.
In comparison with the work presented here, it should be emphasized that although the methods are different (classification with SVM versus prediction with ANN), the highest accuracies achieved are of similar magnitude. However, several considerations must be stressed: first, our work relied on a very small base of only 54 peptides (both glycated and non-glycated), which made the learning process of the neural network more complicated; second, what we performed in this project is an exact prediction of the probability of glycation, whereas the cited study and other similar studies on which this one is based [26,28,30] rely on a classification between groups of peptides with and without glycation. It is important to note that all studies prior to the one developed by Yu et al. [27] achieve, relatively speaking, lower accuracy.
Our algorithm shows that the most important characteristics determining arginine glycation probability are the sequence of amino acids, polarizability, amino acid mass, normalized van der Waals volume, and hydropathy, while torsion angle, hydrophobicity, and isoelectric point seem to be of lesser importance (Figure 4). When simultaneously considering two characteristics (case 2), polarizability and normalized van der Waals volume stand out as most important for determining glycation probability (Figure 5). The errors become lower when considering these two characteristics, showing that a combination of several factors probably predisposes an arginine residue to glycation. Sjoblom et al. [22] made the observation that polar residues such as tyrosine (large van der Waals volume) and negatively charged ones seem to influence glycation probability. Certainly, it is possible that more than two properties of neighboring amino acids are relevant for determining arginine glycation probability. We plan to address this question in future work.
Finally, the question might arise whether arginine glycation probabilities can be transferred from the peptide level to the protein level. Our study is focused on short 11mer sequences that harbor a central arginine residue. Nonetheless, if the protein region (i.e., the arginine residue) is in contact with solvent (e.g., on the exterior of the protein), it should be possible to get a good estimate of whether the arginine residue in question is modified or not. On the other hand, residues in hydrophobic pockets or at the surface of protein-protein interaction sites are probably not predicted very reliably.

Conclusions
In conclusion, we herein present a conceptual framework that allows predicting the glycation susceptibility of arginine residues in peptides. Arginine modification by glycation is emerging as highly relevant, perhaps even more so than lysine modification [10,35]. Whereas several research groups have addressed the question of how to predict lysine modification, to our knowledge, we present the first attempt at predicting arginine glycation. Notably, this study was carried out using an ANN on a very limited database, whereas previous studies on lysine were carried out with the SVM method on databases of considerable size.
The present work focused on obtaining an accurate estimation of the probability of glycation in arginine. Promising results were obtained by taking combinations of two or three amino acid characteristics for such estimation. We identified that a combination of three characteristics (sequence of amino acids, amino acid mass, and isoelectric point) gives the smallest mean absolute error (15.42). Combinations with other characteristics such as normalized van der Waals volume and hydropathy yield similar results. This key finding suggests that arginine glycation (and potentially glycation in general) is mostly influenced by the combination of these factors. Experimental approaches are needed to confirm this result.
Our work is aimed at the researcher who requires information on whether a certain arginine residue might be the target of reactive dicarbonyls and, if so, to what extent. More than just reporting qualitative aspects, we provide a strategy to obtain quantitative information on the glycation probability of individual arginine residues. Therefore, the most probable "hits" would be the ones to which experimental characterization is preferentially applied. Overall, our approach is positioned not only to integrate into the landscape of previously published algorithms for the estimation of lysine residue glycation but also to extend it in a meaningful way.