On the Prediction of In Vitro Arginine Glycation of Short Peptides Using Artificial Neural Networks

Ulices Que-Salinas; Dulce Martinez-Peon; Angel D. Reyes-Figueroa; Ivonne Ibarra; Christian Quintus Scheckhuber

doi:10.3390/s22145237

¹ Centro de Ciencias de la Tierra, Universidad Veracruzana, Xalapa 91090, VER, Mexico

² Department of Electrical and Electronic Engineering, National Technological Institute of Mexico/IT, Monterrey 67170, NL, Mexico

³ Consejo Nacional de Ciencia y Tecnología, Av. Insurgentes Sur 1582, Col. Crédito Constructor, Benito Juárez, Mexico City 03940, DF, Mexico

⁴ Centro de Investigación en Matemáticas Unidad Monterrey, Parque de Investigación e Innovación Tecnológica (PIIT), Av. Alianza Centro No. 502, Apodaca 66628, NL, Mexico

Sensors 2022, 22(14), 5237; https://doi.org/10.3390/s22145237

This article belongs to the Special Issue Artificial Intelligence, Digital Sensors and Data Science in Bio-Medicine

Abstract

One of the hallmarks of diabetes is an increased modification of cellular proteins. The most prominent type of modification stems from the reaction of methylglyoxal with arginine and lysine residues, leading to structural and functional impairments of target proteins. For lysine glycation, several algorithms predict its occurrence, making it possible to pinpoint likely targets. However, to our knowledge, no approaches have been published for predicting the likelihood of arginine glycation. There are indications that arginine, and not lysine, is the most prominent target for the toxic dialdehyde. One of the reasons why there is no arginine glycation predictor is the limited availability of quantitative data. Here, we used a recently published high–quality dataset of arginine modification probabilities to train an artificial neural network. Despite the limited data availability, our approach predicts the exact value of the glycation probability of an arginine–containing peptide with an accuracy of about 75%, without setting a threshold to decide whether a given arginine is modified or not. This contribution suggests a solution for predicting arginine glycation of short peptides.

Keywords:

amino acids; arginine; artificial neural network; glycation; methylglyoxal; modification probability; prediction; protein sequences

1. Introduction

In nature, there is an amazing variety of proteins. So far, thousands of them have been described that carry out diverse functions that are essential for life, either as structural building blocks within and without cells or as catalysts of biochemical reactions in the form of enzymes [1,2]. Proteins are composed of a certain number of amino acids. There are 20 ‘standard’ amino acids and several somewhat obscure ones such as selenocysteine and pyrrolysine that are also proteinogenic [3].

Clearly, the potential sequence variety is enormous even in short proteins (peptides), reaching astronomical proportions. To this huge variety, so–called post–translational modifications of amino acids must be added. These modifications add layers of regulation and control and include a plethora of processes leading to adducts, such as acetylation, phosphorylation, methylation, and ubiquitination, among many others [4]. The post–translational modification of specific amino acids can occur enzymatically or non–enzymatically. For example, protein glycosylation, which is important for protein sorting, protein secretion, and cellular recognition among other functions, is performed by glycosyltransferases and related enzymes [5]. Another important example is the reversible modification of histones by histone acetylases and deacetylases that is essential for the coordinated regulation of gene expression [6]. Glycation, on the other hand, is regarded as a strictly non–enzymatic process that involves the reaction of sugars (e.g., glucose, fructose) and sugar–derived molecules with amino groups of biologically highly relevant molecules, such as nucleic acids, lipids, and proteins [7]. Usually, these reactions result in the formation of advanced glycation end–products (AGEs) which are mostly detrimental and compromise the function of the target molecule irreversibly [8,9].

In proteins, the side chains of lysine and arginine are the main targets of AGE formation [10,11]. One of the most reactive glycating compounds is the reactive carbonyl species (RCS) methylglyoxal (MGO) which is formed as a toxic by–product by metabolic activity, e.g., during glycolysis [12].

Usually, cellular MGO levels are kept at relatively low levels of around 0.3–6 µM [13] by dedicated enzymatic defense systems (e.g., glyoxalases I and II, aldose reductases) [14,15] and low–molecular–weight scavengers, but in certain pathological conditions (i.e., diabetes, neurodegeneration, cancer) [16,17] and in aged cells and tissues [18,19,20] MGO can become problematic for cellular viability due to increased production and/or impaired removal. It should be noted that the specific MGO–mediated modification of proteins can be important for several signaling processes and for gene regulation. This has been demonstrated in studies often conducted in simple eukaryotic model systems that are very amenable to experimental procedures [21].

Although the importance of MGO binding to certain amino acids in a target protein is a well–studied phenomenon, it has become clear that there are most likely no straightforward consensus sequences that allow a reliable prediction of potential glycation sites [22]. Available predictive algorithms therefore must rely on the physical (e.g., polarity), chemical (e.g., amino acid composition), or structural features (e.g., accessible surface area, secondary structure features, and local backbone angles) of nearby amino acid residues. These allow a prediction of potential sites of lysine glycation. For example, GlyNN [23] utilizes an artificial neural network (ANN) [24] approach to enable lysine glycation prediction from a relatively small dataset of 215 elements. Further developments are BPB_GlySite [25], PreGly [26], PredGly [27], Gly–PseAAC [28], Glypre [29], iProtGly–SS [30], and GlyStruct [31] with approaches such as bi–profile Bayes feature extraction, position specific amino acid propensity, and models trained with support vector machine (SVM) classifiers. Traditionally, lysine glycation has been the target of in–depth research with a large database of lysine modifications being freely available (PLMD [32], which is based on CPLM [33] and CPLA 1.0 [34]). However, in recent years it has become clear that also, possibly even more so, the reaction of MGO and arginine is very relevant for the pathogenicity of AGEs [10,35]. Chemically, MGO reacts with arginine yielding an irreversible intermediate, the AGE dihydroxyimidazoline (DHI) after Schiff base addition and its subsequent rearrangement (Amadori product formation). Removal of water from DHI leads to the formation of the AGE 5–hydro–5–methylimidazolone (MG–H1) [8,9]. MG–H1 is an important marker for the AGE–modification of skin collagen [36], mitochondrial dysfunction [37], and acute coronary syndrome [38], among others.

The experimental demonstration of specific amino acid modifications is often very costly in terms of time, resources, and labor, especially in larger proteins which may contain several potential sites of glycation. Hence, being able to analyze protein sequences for the presence of glycated arginine residues would be a useful approach to predict sites of MG–H1 formation. To our knowledge, no tools have been published so far that allow the predictive identification of potential arginine modification sites in proteins. In this work, we implemented a machine learning (ML) method using our own supervised numerical training algorithm. It takes the features of amino acids as input and the probability that glycation occurs in a certain protein as target; the selected amino acids form a sequence of 11 elements within the protein whose central amino acid is always arginine. Specifically, we utilize ANNs because they offer a direct way to solve problems given their high accuracy and their adaptation to noisy, unknown, or incomplete information [24], as well as fast computation after training, because a neural network can be easily implemented on parallel hardware. ANNs have also been applied as ML methods to other problems, for example the identification of flow phase patterns in fluids [39,40].

The peptide sequences we used were extracted from Sjoblom et al. [22]. Although the number of glycated peptides is relatively small for training or prediction, the information is of exceptional quality because all experiments were performed by the same lab using the same procedures for modification and its detection. As such, our work is not based on glycation data from different sources that are inherently difficult to compare. Furthermore, our approach allows stating a probability of glycation in percent. It is therefore superior to algorithms that are based on thresholds that decide whether a residue is glycated or not. In conclusion, our contribution is aimed at enabling a more directed AGE analysis of in vitro glycated peptides, saving time and funds for the researcher.

2. Methodology

2.1. General Outline

We chose the following features of amino acids to computationally characterize a short protein sequence (11 amino acids) that contains a central arginine residue: sequence of amino acids of the peptide (SoA), hydropathy (Hyd), mass (Mas), hydrophobicity (Hyp), polarizability (Pol), normalized van der Waals volume (vdW), torsion angle (ToA), and isoelectric point (IEP) (Figure 1). Some of these features represent physical properties and others represent structural properties; for the sake of simplicity, we refer to all of them as features. Subsequently, we formed 11–element vectors for each feature and selected the number of vectors, one (study case 1), two (study case 2), or three (study case 3), that enter an ANN to train a model that allows us to predict the glycation probability of the central arginine residue in the 11mer peptide sequence. These sequences were taken from a recent publication by Sjoblom et al. [22], who determined the extent of arginine modification using a state–of–the–art technique (liquid chromatography mass spectrometry). In total, 54 sequences were retrieved. Although the number of glycated peptide fragments is not particularly high, all experimental steps were conducted under comparable conditions, making comparisons much more reliable [22]. Figure 1 shows a general outline of our methodology. More details on the ANN operation can be found in the Supplementary Materials to this manuscript.

Figure 1. Steps to follow for the characterization and prediction of glycation using ANN. First, a preliminary database is assembled from the amino acid sequence of the peptides. Then, by rewriting the amino acid sequence with the values corresponding to each of the physical properties, a list of all the vectors is created. Their values are normalized and delivered to the ANN, which through a learning process makes the final predictions corresponding to the probability of glycation for each peptide.

2.2. Database Construction and Study Cases

The ML tool used to predict peptide glycation is an ANN, which requires data for training and testing. As stated above, the arginine glycation database was constructed from information obtained experimentally by Sjoblom et al. [22]. This information allows us to rewrite the alphabetical sequence of amino acids of each peptide with the corresponding numerical values of each physical property. The list of the 20 standard amino acids, with their corresponding values for each of the physical properties that represent them, can be found in Table S1 in the Supplementary Material.

By employing this information, we built a database with 54 (peptides) × 8 (features) vectors using the different values for each amino acid (11 elements). That is, each of the 432 vectors was formed by selecting one of the 54 peptides, made up of a sequence of 11 elements, and one of the eight physical properties, and then assigning to each element of the peptide the value corresponding to that physical property. For example, for the 11mer sequence SPFYLRPPSFL we built eight vectors (Figure 2). We retrieved each element of the sequence (i.e., an amino acid) and obtained the corresponding values (for the complete list of constructed vectors refer to Table S2 in the Supplementary Material). This process was repeated for all properties.

Figure 2. Example of construction of vectors. For the 11mer sequence SPFYLRPPSFL, each amino acid is converted into a number, depending on the property. For example, the first amino acid, serine (SER), has a value of 5.7 for the property isoelectric point, 3.83 for torsion angle, 1.6 for normalized van der Waals volume, −0.04 for polarizability, 0.06 for hydrophobicity, 105 for mass, −0.8 for hydropathy, and −2 for amino acid sequence.

We considered three cases as inputs for the ANN. The single–case study accepts one individual vector of the same property for each peptide as input. The two–case study considers two vectors of two different properties in combination for each peptide, giving rise to up to 28 different outputs. Finally, the three–case study considers three vectors of three different properties in combination for each peptide, giving rise to up to 56 different outputs. We would like to point out that considering higher–order combinations results in more complex learning processes without providing significant improvements in predictions. Consequently, we did not consider them further in our study.
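The counts of 28 and 56 quoted above are the numbers of unordered pairs and triples that can be drawn from the eight features. The following minimal sketch reproduces them with the Python standard library; the function name `feature_sets` is a hypothetical helper introduced only for illustration.

```python
from itertools import combinations

# The eight feature abbreviations used in the text.
FEATURES = ["SoA", "Hyd", "Mas", "Hyp", "Pol", "vdW", "ToA", "IEP"]

def feature_sets(k):
    """Return every unordered choice of k distinct features."""
    return list(combinations(FEATURES, k))

# Number of possible inputs for study cases 1, 2, and 3.
counts = {k: len(feature_sets(k)) for k in (1, 2, 3)}
print(counts)  # {1: 8, 2: 28, 3: 56}
```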

Here, it is important to note that, for any of the three cases, at the center of the sequence of elements is always the amino acid arginine. Recall that the objective of ANN in this project is to be able to predict the glycation probability of the central arginine corresponding to each peptide. This can be done through the combination of vectors as described above.

A further objective of this study is to determine which of these possible combinations of amino acid features gives the most accurate predictions.

For the ANN learning and prediction process, it is necessary to form a set of samples for each of the study cases, which we termed patterns. Each pattern was built on a combination of only one, two, or three vectors for a specific peptide where their numerical values are the properties used in the corresponding amino acid sequence of the peptide. Thus, each pattern (𝒫) is represented by a matrix of “m” rows (the number of properties selected) and “n” columns (each of the 11 amino acids within the peptide). These patterns are provided to the ANN to be able to predict the exact probability of glycation of the central amino acid arginine (inside the pattern).

For example, if we wanted to predict the probability of glycation of a certain peptide from the analysis of a pattern formed by the combination of two vectors corresponding to hydropathy (Hyd) and mass (Mas), we would specify an array $A_{m \times n}$ of the form:

$$\mathcal{P}_{p} = A_{2 \times 11} = \begin{pmatrix} Hyd_{1} & Hyd_{2} & \dots & Hyd_{10} & Hyd_{11} \\ Mas_{1} & Mas_{2} & \dots & Mas_{10} & Mas_{11} \end{pmatrix} \qquad (1)$$

where p is the index for each of the 54 peptides. Thus, for the 11mer sequence SPFYLRPPSFL mentioned above, a 2–vector pattern (𝒫) can be formed from any two of the eight different sequences of numerical values shown in Figure 2.
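As an illustration of Equation (1), the sketch below builds the 2 × 11 Hyd/Mas pattern for the example peptide SPFYLRPPSFL. The lookup tables are assumptions: they hold Kyte-Doolittle hydropathy values and rounded molecular masses for the residues of this peptide only, which agree with the serine values quoted for Figure 2 (−0.8 and 105) but have not been checked against Table S1. The helper names `encode` and `build_pattern` are hypothetical.

```python
# Assumed per-residue values (Kyte-Doolittle hydropathy, rounded mass in Da).
# Only the residues of the example peptide are listed; Table S1 may differ.
HYD = {"S": -0.8, "P": -1.6, "F": 2.8, "Y": -1.3, "L": 3.8, "R": -4.5}
MAS = {"S": 105, "P": 115, "F": 165, "Y": 181, "L": 131, "R": 174}

def encode(peptide, table):
    """Rewrite a peptide as the list of numerical values of one feature."""
    return [table[aa] for aa in peptide]

def build_pattern(peptide, tables):
    """Stack one encoded vector per feature into an m x 11 pattern."""
    if len(peptide) != 11 or peptide[5] != "R":
        raise ValueError("expected an 11mer with a central arginine")
    return [encode(peptide, table) for table in tables]

pattern = build_pattern("SPFYLRPPSFL", [HYD, MAS])
for row in pattern:
    print(row)
```

The first row corresponds to Hyd_1 … Hyd_11 and the second to Mas_1 … Mas_11; adding a third table would give the 3 × 11 pattern of the three–case study.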

All the specifications of the study cases are presented in Table 1 for reference. From the eight amino acid features, multiple combinations can be formed for each of the study cases. For the two–case and three–case, we focus on the combinations of features that give the best results. For the one–case, however, we present the results of all the features in order to explain the learning and prediction process of the ANN in detail, forming a total of eight sub–cases (Table 1).

Table 1. Details of the ANN study cases. The physical properties and the architecture of the neural network are presented for each one of the sub–cases of case 1, along with the sub–cases that gave the best results for cases 2 and 3.

Because each of the patterns used as input information for the ANN training is composed of an array of “m” features and 11 amino acid values (n), for the three–case study we have an array of the form:

$$\mathcal{P}_{p} = A_{m \times n} = \begin{pmatrix} f_{a1} & f_{a2} & \dots & f_{a11} \\ f_{b1} & f_{b2} & \dots & f_{b11} \\ f_{c1} & f_{c2} & \dots & f_{c11} \end{pmatrix} \qquad (2)$$

where the index p represents the peptide and therefore runs from 1 to 54, and the index m represents each one of the three features chosen (f_a, f_b, or f_c) among the eight possible ones; the indices a, b, and c take different values from each other, ranging from 1 to 8.

For optimal ANN performance, the input data for all cases are preprocessed to result in the normalized pattern $\hat{\mathcal{P}}_{p}$, for which

$$\hat{f}_{mn} = \frac{f_{mn} - \bar{f}_{m}}{\sigma_{m}} \qquad (3)$$

For each feature m and for each amino acid n, we normalize the matrix elements following relation (3), where $\bar{f}_{m}$ is the mean of the elements of the training set corresponding to the m–th feature, and $\sigma_{m}$ is the standard deviation of those elements. The normalization process for a given case was performed on each subset of vectors made up of the elements corresponding to the same feature, and not on the whole dataset.
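A minimal sketch of relation (3) follows, assuming that the mean and standard deviation of feature m are pooled over all positions of all training patterns and that the population standard deviation is meant (the text does not say whether the population or sample form was used). The function names and the toy 2 × 3 patterns are hypothetical and serve only to keep the example short.

```python
from statistics import mean, pstdev

def fit_feature_stats(train_patterns):
    """Return one (mean, std) pair per feature row, pooled over the training set."""
    n_features = len(train_patterns[0])
    stats = []
    for m in range(n_features):
        values = [v for pattern in train_patterns for v in pattern[m]]
        stats.append((mean(values), pstdev(values)))  # assumption: population std
    return stats

def normalize(pattern, stats):
    """Apply relation (3) row by row, using statistics of the training set only."""
    return [[(v - mu) / sigma for v in row]
            for row, (mu, sigma) in zip(pattern, stats)]

# Toy training set: two patterns, two features, three positions each.
train = [
    [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]],
    [[4.0, 5.0, 6.0], [40.0, 50.0, 60.0]],
]
stats = fit_feature_stats(train)
normalized = [normalize(p, stats) for p in train]
print(normalized[0])
```

Validation and prediction patterns would be passed through `normalize` with the same `stats`, so that no information from those sets enters the preprocessing. A feature that is constant over the training set would give a zero standard deviation and needs separate handling.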

For the learning process, all the normalized information is divided into three sets. The first and largest, consisting of 70% of the data, is used for training. The second is the validation set, consisting of 15% of the data, and the remaining 15% is used for the final predictions presented in the Results section. It is important to note that during training the ANN does not see the data of the validation and prediction sets.

Thus, the ANN is fed only with the training set for each of the case studies. In general, the ANN architecture has one to three hidden layers, with a number of neurons per layer (including the input layer) that varies according to each of the cases studied (see Table 1).
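The 70/15/15 partition can be sketched as a random split of peptide indices. The rounding rule, the seed, and the function name `split_indices` are assumptions; the text does not state the exact set sizes, and with 54 peptides this particular rounding gives 38, 8, and 8.

```python
import random

def split_indices(n_samples, seed=0, train_frac=0.70, val_frac=0.15):
    """Shuffle sample indices and cut them into training, validation, and prediction sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = round(train_frac * n_samples)
    n_val = round(val_frac * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, pred_idx = split_indices(54)
print(len(train_idx), len(val_idx), len(pred_idx))  # 38 8 8
```

Changing `seed` gives a different partition, which is one way to obtain the repeated, independently trained runs over which errors are averaged.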

Figure 3 shows the general architecture of the ANN used. Consider that it will be fed with the patterns formed for each of the case studies, through the input layer of the ANN. Subsequently, learning is performed through the hidden layers and the prediction is processed in the output layer.

Figure 3. Schematic of the utilized ANN structure. Each of the 11mer sequences with n–features is provided to the ANN through the input layer, from which the learning process proceeds through an adjustment in the interconnections in the hidden layers. Finally, a prediction of the glycation probability is made, which is provided by the output layer. Different colors distinguish different properties.

Finally, the ANN was constructed as a regression model. To minimize the error during training, we used the Adam optimization algorithm [41] with a learning rate γ = 0.001, a rectified linear unit (ReLU) as the activation function [42], and the backpropagation algorithm [43].
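To make the training setup concrete, the following NumPy sketch trains a one–hidden–layer regression network with ReLU activations, backpropagation of a mean squared error, and Adam updates with γ = 0.001. It is not the authors' implementation: the hidden–layer size, the number of epochs, the squared–error loss, the flattening of a 2 × 11 pattern into 22 inputs, the Adam constants β1 = 0.9, β2 = 0.999, ε = 1e−8, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def init_params(n_in, n_hidden, rng):
    """Small random weights (He-style scaling) and zero biases."""
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(2.0 / n_hidden), (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, X):
    """Input layer -> ReLU hidden layer -> linear output (regression)."""
    z1 = X @ params["W1"] + params["b1"]
    h = np.maximum(z1, 0.0)  # ReLU
    y_hat = (h @ params["W2"] + params["b2"]).ravel()
    return y_hat, z1, h

def loss_and_gradients(params, X, y):
    """Mean squared error and its gradients obtained by backpropagation."""
    y_hat, z1, h = forward(params, X)
    loss = float(np.mean((y_hat - y) ** 2))
    d_out = (2.0 / len(y)) * (y_hat - y)[:, None]   # shape (n, 1)
    d_h = (d_out @ params["W2"].T) * (z1 > 0)       # ReLU derivative
    grads = {
        "W1": X.T @ d_h, "b1": d_h.sum(axis=0),
        "W2": h.T @ d_out, "b2": d_out.sum(axis=0),
    }
    return loss, grads

def train(X, y, n_hidden=16, epochs=500, lr=0.001,
          beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Full-batch training with Adam; returns parameters and the loss history."""
    rng = np.random.default_rng(seed)
    params = init_params(X.shape[1], n_hidden, rng)
    m = {k: np.zeros_like(p) for k, p in params.items()}  # first moments
    v = {k: np.zeros_like(p) for k, p in params.items()}  # second moments
    losses = []
    for t in range(1, epochs + 1):
        loss, grads = loss_and_gradients(params, X, y)
        losses.append(loss)
        for k in params:
            m[k] = beta1 * m[k] + (1 - beta1) * grads[k]
            v[k] = beta2 * v[k] + (1 - beta2) * grads[k] ** 2
            m_hat = m[k] / (1 - beta1 ** t)   # bias correction
            v_hat = v[k] / (1 - beta2 ** t)
            params[k] -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, losses

# Synthetic stand-in data: 38 training patterns, 2 features x 11 positions flattened.
rng = np.random.default_rng(1)
X = rng.normal(size=(38, 22))
y = rng.uniform(0.0, 1.0, size=38)  # made-up glycation probabilities as fractions
params, losses = train(X, y)
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.4f}")
```

The architectures actually used (one to three hidden layers, with neuron counts per case) are those listed in Table 1; this sketch only shows how the named ingredients fit together.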

3. Results

Towards predicting the value of the glycation probability based on the small database available, ANNs were used. They can offer high accuracy, even with incomplete information [24]. As previously described, we used eight different features to construct the vectors, feeding the algorithms with one of the features for each sequence, or using a combination of them. We report the Mean Absolute Percent Error (MAPE) and Mean Absolute Error (MAE), by averaging the results over 160 different predictions, each with a different ANN training.
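For reference, the two error measures can be written out as follows, using their standard definitions. The handling of peptides whose measured glycation probability is zero is not described in the text, so this sketch simply refuses such inputs for MAPE; the numbers in the example are made up.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error, in the same units as the glycation probability."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean Absolute Percent Error, in percent; undefined for true values of 0."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up glycation probabilities in percent.
measured = [50.0, 20.0, 80.0]
predicted = [40.0, 25.0, 80.0]
print(mae(measured, predicted))   # 5.0
print(mape(measured, predicted))  # approximately 15.0
```

The values reported in the tables would then be the averages of these per–run errors over the 160 independently trained networks.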

Table 2 summarizes the analysis carried out from the eight different features and the combination of two of them. The MAPE and MAE values are reported in the upper and lower diagonal matrices, respectively. In the main diagonal of the matrix, the MAPE is represented first, followed by MAE. Both errors are also characterized by gray shades, to describe if we have a high (light gray), average (medium gray), or low accuracy (dark gray) prediction value.

Table 2. Summary of glycation probability errors (MAPE/MAE) for the single–case/two–case predictive approaches. We report the MAPE values in the upper diagonal matrix, and MAE values in the lower diagonal. The main diagonal shows the MAPE and MAE values for case 1, respectively. Both errors are also characterized by gray shades, to describe if we have a high (light gray), average (medium gray), or low accuracy (dark gray) prediction value. Hyd: hydropathy; Hyp: hydrophobicity; IEP: isoelectric point; Mas: mass; Pol: polarizability; SoA: amino acid sequence of the peptide; ToA: torsion angle; and vdW: normalized van der Waals volume.

The cases where we used only one feature can be seen in the main diagonal. For MAE, the highest errors (poorest performance) were obtained with Hyp (30.63) or ToA (32.11), whereas the lowest errors (best performance) were obtained with SoA (26.98) or vdW (27.49). For MAPE, the highest errors were again obtained with Hyp (54.79%) or ToA (52.12%), and the lowest with vdW (39.25%) or Mas (40.93%).

We found that combining features can improve the performance of the ANN. Before reviewing the results for each specific case, recall that the previous values are averages over 160 different predictions. Figure 4 shows the box plots of MAPE and MAE for the individual features. Even though vdW gives some of the lowest errors in Table 2, its error distribution is broader than that of Hyd. According to Table 2, the MAPE and MAE of Hyd are 7.2% and 2.5% higher than those of vdW, respectively, but the ranges of their distributions are 66.7% and 70.7% narrower. The narrowest ranges for MAPE and MAE are found for IEP and Hyp, respectively; in contrast, the broadest is found for Pol in both cases.

Figure 4. Box plot of the results for case 1. Values obtained for MAPE (left) and for MAE (right). The characteristics are ordered from the lowest error (implying higher reliability and accuracy) to the highest errors. The * sign denotes outliers.

For cases 2 and 3, where there are multiple combinations of amino acid features, we present in detail only the combinations that showed the best results for the sake of clarity (the remaining cases are shown in the Supplementary Material, Figure S1).

Thus, for case 2, the highest errors were obtained with the combination of Hyp and ToA, with MAE and MAPE values of 30.61 and 57.13%, respectively. If we instead combine the features that perform best in the main diagonal, the results improve compared to using one feature only. The best values we found for a combination of two features were obtained with vdW–Pol, with an MAE of 23.53 and a MAPE of 33.3% (Table 2).

Figure 5 shows the MAPE and MAE of the eight two–feature combinations with the lowest errors. For comparison, the two combinations with the highest errors are also presented (the complete set can be found in the Supplementary Materials, Figure S1). Interestingly, once we study the combination of two features, most of the cases show error distributions as narrow as using only the feature Hyd, which is the single feature that shows the narrowest error distribution (see Figure 4). This might suggest that increasing the number of features used in the ANN improves performance and narrows the error distribution. However, this is not always the case: the features may hide unknown correlations, so that adding more of them does not improve the prediction because the amount of independent data does not increase. The narrowest error range for case 2 is found for the combination Mas–ToA for MAPE and Hyd–IEP for MAE, with ranges of 2.13% and 0.80, respectively.

Figure 5. Box plots of the results for case 2. Values obtained for MAPE (left) and for MAE (right). The first eight combinations with the lowest errors are shown, as well as the two combinations with the highest errors. The * sign denotes outliers.

In case 3, consisting of the combination of three different properties, an improvement in ANN performance was generally observed, mainly in combinations that incorporated the properties SoA, Hyd, Mas, vdW, and IEP, whose trend can be seen from case 2 in Figure 5. Thus, the lowest value for MAE was obtained by combining SoA with Mas and IEP, reaching an error of 15.42; while for MAPE a value of 25.04% was reached by combining Hyd with Mas and IEP; which is a substantial reduction with respect to the best results of case 2.

4. Discussion

Here, we developed a tool for predicting the glycation probability of arginine residues in proteins. Although we had to work with a limited dataset, we consider it relevant to have some means to predict arginine modifications. Furthermore, our work can serve as a proof of principle and be subsequently expanded once more information on arginine glycation becomes available.

Arginine, akin to lysine, is a prime target for methylglyoxal [10]. Since our data set on arginine glycation is too small for conventional approaches, we employed a machine learning strategy. While artificial intelligence (AI) is the overarching science of mimicking human abilities, machine learning is a specific subset of AI that trains a machine how to learn. Nowadays, machine learning is one of the most important tools for scientists in the development of new applications [44,45]. We could have conveniently employed a linear–regression–based approach to estimate the probability of glycation, but the results would have been considerably poorer than those obtained using the more sophisticated ANN.

At present, the accuracy of our algorithm is limited by the relatively small size of the database we used for training and testing the ANN. This bottleneck can be tackled by adding more data entries to the database once these become available. This would allow an improvement in the reliability of our algorithm for successfully reporting glycation probabilities.

Experimentally, approaches like nano high performance liquid chromatography/electrospray ionization/tandem mass spectrometry can be utilized to determine the ratio of glycated to total peptides [46]. It should be kept in mind that amino acid glycation usually does not result in a “black or white” pattern (i.e., all peptides carrying the modification or none) but rather in a gradual probability scale. Once more data on arginine glycation becomes available, we aim to present a tool based on our algorithm that analyzes a protein sequence provided by the user in FASTA format for the presence of arginine residues that are potentially glycated. The output would be given as an arginine glycation probability at a specific position of the protein in percent. This approach could allow narrowing down the number of arginine residues that can preferentially become AGE–modified. Such a tool is envisioned to enable a more directed protein–AGE–arginine analysis saving time and funds for the researcher.

We want to stress that earlier efforts in this area inspired the present research. Reddy et al. [31] developed a methodology based on support vector machines (SVM) with which they were able to classify glycated and non–glycated lysine residues using the structural properties of amino acid residues. For that work, they had a reference database containing a total of 538 glycated and non–glycated lysine residues, with which they were able to obtain an accuracy of 0.7562, 0 being totally inaccurate and 1 being totally accurate. Recently, Yu et al. [27] achieved a considerable improvement in the classification process of lysine glycations with SVM, working with a database of more than 6000 items, reaching a high accuracy of 0.88.
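The first step of such a tool, extracting every 11mer window centered on an arginine from a protein sequence, might look like the sketch below. This is not an existing implementation: the function name, the decision to skip arginines with fewer than five flanking residues on either side, and the test sequence are all assumptions for illustration.

```python
def arginine_windows(sequence, flank=5):
    """Return (1-based position, window) for each arginine with full flanks."""
    windows = []
    for i, aa in enumerate(sequence):
        # Assumption: arginines too close to either terminus are skipped.
        if aa == "R" and i >= flank and i + flank < len(sequence):
            windows.append((i + 1, sequence[i - flank:i + flank + 1]))
    return windows

# Hypothetical sequence containing the example peptide SPFYLRPPSFL.
print(arginine_windows("MKSPFYLRPPSFLRA"))  # [(8, 'SPFYLRPPSFL')]
```

Each returned window could then be encoded, normalized, and passed to the trained network to obtain a glycation probability for that position.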

In comparison with the work presented here, it should be emphasized that although they are different methods (classification with SVM versus prediction with ANN), the highest precisions achieved are of similar magnitudes. However, there are several considerations to be stressed: first, the fact that our work had a very small base of only 54 peptides (both glycated and non–glycated peptides), which made the learning process of the neural network more complicated, and second the fact that what we performed in this project is an exact prediction of the probability of glycation, while the cited study and other similar studies on which this one is based [26,28,30] are founded on a classification between groups of peptides where there is glycation and where there is no glycation. It is important to note that all other studies prior to the one developed by Yu et al. [27] achieve, relatively speaking, lower accuracy.

Our algorithm shows that the most important characteristics determining arginine glycation probability are the sequence of amino acids, polarizability, amino acid mass, normalized van der Waals volume, and hydropathy, while torsion angle, hydrophobicity, and isoelectric point seem to be of lesser importance (Figure 4). When simultaneously considering two characteristics (two–case), polarizability and normalized van der Waals volume stand out as being most important for determining glycation probability (Figure 5). The errors become lower when considering these two characteristics, showing that probably a combination of several factors predisposes an arginine residue for glycation. Sjoblom et al. [22] made the observations that polar residues such as tyrosine (large van der Waals volume) and negatively charged ones seem to influence glycation probability. Certainly, it is possible that more than two properties of neighboring amino acids are relevant for the determination of arginine glycation probability. This question is planned to be addressed in future work.

Finally, the question might arise whether arginine glycation probabilities can be transferred from the peptide level to the protein level. Our study is focused on short 11mer sequences that harbor a central arginine residue. Nonetheless, if the protein region (i.e., the arginine residue) is in contact with solvent (e.g., on the exterior of the protein), it should be possible to get a good estimate of whether the arginine residue in question is modified or not. On the other hand, predictions for residues in hydrophobic pockets or on the surface of protein–protein interaction sites are probably not very reliable.

5. Conclusions

In conclusion, we herein present the conceptual framework that allows predicting the glycation susceptibility of arginine residues in peptides. Arginine modification by glycation is emerging to be highly relevant, perhaps even more so than lysine modification [10,35]. Whereas several research groups addressed the question of how to predict lysine modification, to our knowledge, we present the first attempt at predicting arginine glycation. At the same time, this study was carried out using ANN on a very limited database. This is relevant given that previous studies on lysine have been carried out with the SVM method on databases of considerable size.

The present work focused on obtaining an accurate estimate of the probability of arginine glycation. Promising results were obtained by combining two or three amino acid characteristics. We identified that a combination of three characteristics (sequence of amino acids, amino acid mass, and isoelectric point) gives the smallest mean absolute error (15.42). Combinations with other characteristics such as normalized van der Waals volume and hydropathy yield similar results. This key finding suggests that arginine glycation (and potentially glycation in general) is mostly influenced by the combination of these factors. Experimental approaches are needed to confirm this result.

Our work is aimed at researchers who require information on whether a certain arginine residue might be the target of reactive dicarbonyls and, if so, to what extent. Beyond reporting qualitative aspects, we provide a strategy for obtaining quantitative information on the glycation probability of individual arginine residues. The most probable “hits” can therefore be prioritized for experimental characterization. Overall, our approach is positioned not only to integrate into the landscape of previously published algorithms for the estimation of lysine residue glycation but also to extend it in a meaningful way.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/s22145237/s1. References [47,48,49,50,51,52,53,54,55,56,57] are cited in the supplementary materials.

Author Contributions

Conceptualization, U.Q.-S., D.M.-P., A.D.R.-F., I.I. and C.Q.S.; Investigation, U.Q.-S., D.M.-P., A.D.R.-F., I.I. and C.Q.S.; Methodology, U.Q.-S., D.M.-P. and I.I.; Software, U.Q.-S.; Validation, U.Q.-S.; Visualization, D.M.-P.; Writing—original draft, U.Q.-S., D.M.-P., A.D.R.-F. and C.Q.S.; Writing—review & editing, U.Q.-S., D.M.-P., A.D.R.-F. and C.Q.S. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Tecnológico de Monterrey.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data are contained in the article and its supplementary information.

Acknowledgments

Computational resources were supported by the biophysical systems laboratory of the Universidad Iberoamericana Puebla. Ulices Que–Salinas acknowledges the financial support provided by CONACyT México through grant: Estancias Posdoctorales Nacionales, no. I1200/224/2021.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Sigal, A.; Milo, R.; Cohen, A.; Geva-Zatorsky, N.; Klein, Y.; Liron, Y.; Rosenfeld, N.; Danon, T.; Perzov, N.; Alon, U. Variability and Memory of Protein Levels in Human Cells. Nature 2006, 444, 643–646.
2. Ponomarenko, E.A.; Poverennaya, E.V.; Ilgisonis, E.V.; Pyatnitskiy, M.A.; Kopylov, A.T.; Zgoda, V.G.; Lisitsa, A.V.; Archakov, A.I. The Size of the Human Proteome: The Width and Depth. Int. J. Anal. Chem. 2016, 2016, 7436849.
3. Ho, J.M.L.; Miller, C.A.; Smith, K.A.; Mattia, J.R.; Bennett, M.R. Improved Pyrrolysine Biosynthesis through Phage Assisted Non-Continuous Directed Evolution of the Complete Pathway. Nat. Commun. 2021, 12, 3914.
4. Müller, M.M. Post-Translational Modifications of Protein Backbones: Unique Functions, Mechanisms, and Challenges. Biochemistry 2018, 57, 177–185.
5. Gavin, J.W.; Jon, S.T.; Toone, E.J. Natural Product Glycosyltransferases: Properties and Applications. Adv. Enzymol. Relat. Areas Mol. Biol. 2009, 76, 55–119.
6. Zhang, Y.; Sun, Z.; Jia, J.; Du, T.; Zhang, N.; Tang, Y.; Fang, Y.; Fang, D. Overview of Histone Modification. In Advances in Experimental Medicine and Biology; Springer Nature: Berlin/Heidelberg, Germany, 2021; pp. 1–16.
7. Rabbani, N.; Thornalley, P.J. Dicarbonyl Stress in Cell and Tissue Dysfunction Contributing to Ageing and Disease. Biochem. Biophys. Res. Commun. 2015, 458, 221–226.
8. Ahmed, N.; Babaei-Jadidi, R.; Howell, S.K.; Beisswenger, P.J.; Thornalley, P.J. Degradation Products of Proteins Damaged by Glycation, Oxidation and Nitration in Clinical Type 1 Diabetes. Diabetologia 2005, 48, 1590–1603.
9. Oya, T.; Hattori, N.; Mizuno, Y.; Miyata, S.; Maeda, S.; Osawa, T.; Uchida, K. Methylglyoxal Modification of Protein. Chemical and Immunochemical Characterization of Methylglyoxal-Arginine Adducts. J. Biol. Chem. 1999, 274, 18492–18502.
10. Rabbani, N.; Thornalley, P.J. Protein Glycation – Biomarkers of Metabolic Dysfunction and Early-Stage Decline in Health in the Era of Precision Medicine. Redox Biol. 2021, 42, 101920.
11. Mercado-Uribe, H.; Andrade-Medina, M.; Espinoza-Rodríguez, J.H.; Carrillo-Tripp, M.; Scheckhuber, C.Q. Analyzing Structural Alterations of Mitochondrial Intermembrane Space Superoxide Scavengers Cytochrome-c and SOD1 after Methylglyoxal Treatment. PLoS ONE 2020, 15, e0232408.
12. Phillips, S.A.; Thornalley, P.J. The Formation of Methylglyoxal from Triose Phosphates. Investigation Using a Specific Assay for Methylglyoxal. Eur. J. Biochem. 1993, 212, 101–105.
13. Rabbani, N.; Thornalley, P.J. Measurement of Methylglyoxal by Stable Isotopic Dilution Analysis LC-MS/MS with Corroborative Prediction in Physiological Samples. Nat. Protoc. 2014, 9, 1969–1979.
14. Thornalley, P.J. Glyoxalase I – Structure, Function and a Critical Role in the Enzymatic Defence against Glycation. Biochem. Soc. Trans. 2003, 31 Pt 6, 1343–1348.
15. Mannervik, B. Molecular Enzymology of the Glyoxalase System. Drug Metabol. Drug Interact. 2008, 23, 13–27.
16. Kumar Pasupulati, A.; Chitra, P.S.; Reddy, G.B. Advanced Glycation End Products Mediated Cellular and Molecular Events in the Pathology of Diabetic Nephropathy. Biomol. Concepts 2016, 7, 293–309.
17. Schalkwijk, C.G.; Stehouwer, C.D.A. Methylglyoxal, a Highly Reactive Dicarbonyl Compound, in Diabetes, Its Vascular Complications, and Other Age-Related Diseases. Physiol. Rev. 2020, 100, 407–461.
18. Morcos, M.; Du, X.; Pfisterer, F.; Hutter, H.; Sayed, A.A.R.; Thornalley, P.; Ahmed, N.; Baynes, J.; Thorpe, S.; Kukudov, G.; et al. Glyoxalase-1 Prevents Mitochondrial Protein Modification and Enhances Lifespan in Caenorhabditis elegans. Aging Cell 2008, 7, 260–269.
19. Scheckhuber, C.Q.; Mack, S.J.; Strobel, I.; Ricciardi, F.; Gispert, S.; Osiewacz, H.D. Modulation of the Glyoxalase System in the Aging Model Podospora anserina: Effects on Growth and Lifespan. Aging 2010, 2, 969–980.
20. Fan, X.; Monnier, V.M. Protein Posttranslational Modification (PTM) by Glycation: Role in Lens Aging and Age-Related Cataractogenesis. Exp. Eye Res. 2021, 210, 108705.
21. Scheckhuber, C.Q. Studying the Mechanisms and Targets of Glycation and Advanced Glycation End-Products in Simple Eukaryotic Model Systems. Int. J. Biol. Macromol. 2019, 127, 85–94.
22. Sjoblom, N.M.; Kelsey, M.M.G.; Scheck, R.A. A Systematic Study of Selective Protein Glycation. Angew. Chem. Int. Ed. 2018, 57, 16077–16082.
23. Johansen, M.B.; Kiemer, L.; Brunak, S. Analysis and Prediction of Mammalian Protein Glycation. Glycobiology 2006, 16, 844–853.
24. Rabuñal, J.R.; Dorado, J. Artificial Neural Networks in Real-Life Applications. In Artificial Neural Networks in Real-Life Applications; IGI Global: Hershey, PA, USA, 2006; pp. 1–375.
25. Ju, Z.; Sun, J.; Li, Y.; Wang, L. Predicting Lysine Glycation Sites Using Bi-Profile Bayes Feature Extraction. Comput. Biol. Chem. 2017, 71, 98–103.
26. Liu, Y.; Gu, W.; Zhang, W.; Wang, J. Predict and Analyze Protein Glycation Sites with the MRMR and IFS Methods. Biomed. Res. Int. 2015, 2015, 561547.
27. Yu, J.; Shi, S.; Zhang, F.; Chen, G.; Cao, M. PredGly: Predicting Lysine Glycation Sites for Homo sapiens Based on XGboost Feature Optimization. Bioinformatics 2019, 35, 2749–2756.
28. Xu, Y.; Li, L.; Ding, J.; Wu, L.Y.; Mai, G.; Zhou, F. Gly-PseAAC: Identifying Protein Lysine Glycation through Sequences. Gene 2017, 602, 1–7.
29. Zhao, X.; Zhao, X.; Bao, L.; Zhang, Y.; Dai, J.; Yin, M. Glypre: In Silico Prediction of Protein Glycation Sites by Fusing Multiple Features and Support Vector Machine. Molecules 2017, 22, 1891.
30. Islam, M.M.; Saha, S.; Rahman, M.M.; Shatabda, S.; Farid, D.M.; Dehzangi, A. IProtGly-SS: Identifying Protein Glycation Sites Using Sequence and Structure Based Features. Proteins Struct. Funct. Bioinform. 2018, 86, 777–789.
31. Reddy, H.M.; Sharma, A.; Dehzangi, A.; Shigemizu, D.; Chandra, A.A.; Tsunoda, T. GlyStruct: Glycation Prediction Using Structural Properties of Amino Acid Residues. BMC Bioinform. 2019, 19, 547.
32. Xu, H.; Zhou, J.; Lin, S.; Deng, W.; Zhang, Y.; Xue, Y. PLMD: An Updated Data Resource of Protein Lysine Modifications. J. Genet. Genom. 2017, 44, 243–250.
33. Liu, Z.; Wang, Y.; Gao, T.; Pan, Z.; Cheng, H.; Yang, Q.; Cheng, Z.; Guo, A.; Ren, J.; Xue, Y. CPLM: A Database of Protein Lysine Modifications. Nucleic Acids Res. 2014, 42, D531–D536.
34. Liu, Z.; Cao, J.; Gao, X.; Zhou, Y.; Wen, L.; Yang, X.; Yao, X.; Ren, J.; Xue, Y. CPLA 1.0: An Integrated Database of Protein Lysine Acetylation. Nucleic Acids Res. 2011, 39 (Suppl. 1), D1029–D1034.
35. Rabbani, N.; Xue, M.; Thornalley, P.J. Dicarbonyls and Glyoxalase in Disease Mechanisms and Clinical Therapeutics. Glycoconj. J. 2016, 33, 513–525.
36. Sugiura, K.; Koike, S.; Suzuki, T.; Ogasawara, Y. Carbonylation of Skin Collagen Induced by Reaction with Methylglyoxal. Biochem. Biophys. Res. Commun. 2021, 562, 100–104.
37. Hara, T.; Toyoshima, M.; Hisano, Y.; Balan, S.; Iwayama, Y.; Aono, H.; Futamura, Y.; Osada, H.; Owada, Y.; Yoshikawa, T. Glyoxalase I Disruption and External Carbonyl Stress Impair Mitochondrial Function in Human Induced Pluripotent Stem Cells and Derived Neurons. Transl. Psychiatry 2021, 11, 275.
38. Bora, S.; Adole, P.S.; Motupalli, N.; Pandit, V.R.; Vinod, K.V. Association between Carbonyl Stress Markers and the Risk of Acute Coronary Syndrome in Patients with Type 2 Diabetes Mellitus–A Pilot Study. Diabetes Metab. Syndr. Clin. Res. Rev. 2020, 14, 1751–1755.
39. Al-Naser, M.; Elshafei, M.; Al-Sarkhi, A. Artificial Neural Network Application for Multiphase Flow Patterns Detection: A New Approach. J. Pet. Sci. Eng. 2016, 145, 548–564.
40. Rosa, E.S.; Salgado, R.M.; Ohishi, T.; Mastelari, N. Performance Comparison of Artificial Neural Networks and Expert Systems Applied to Flow Pattern Identification in Vertical Ascendant Gas-Liquid Flows. Int. J. Multiph. Flow 2010, 36, 738–754.
41. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
42. Eckle, K.; Schmidt-Hieber, J. A Comparison of Deep Networks with ReLU Activation Function and Linear Spline-Type Methods. Neural Netw. 2019, 110, 232–242.
43. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. Nature 1986, 323, 533–536.
44. Pan, C.; Tabatabaei, S.K.; Tabatabaei Yazdi, S.M.H.; Hernandez, A.G.; Schroeder, C.M.; Milenkovic, O. Rewritable Two-Dimensional DNA-Based Data Storage with Machine Learning Reconstruction. Nat. Commun. 2022, 13, 2984.
45. Wang, R.; Wang, Z.; Wang, H.; Pang, Y.; Lee, T.-Y. Characterization and Identification of Lysine Crotonylation Sites Based on Machine Learning Method on Both Plant and Mammalian. Sci. Rep. 2020, 10, 20447.
46. Scheckhuber, C.Q. Arg354 in the Catalytic Centre of Bovine Liver Catalase Is Protected from Methylglyoxal-Mediated Glycation. BMC Res. Notes 2015, 8, 830.
47. Markus, G.; Tritsch, G.L.; Parthasarathy, R. A Model for Hydropathy-Based Peptide Interactions. Arch. Biochem. Biophys. 1989, 272, 433–439.
48. Chiavari, G.; Galletti, G.C. Pyrolysis–Gas Chromatography/Mass Spectrometry of Amino Acids. J. Anal. Appl. Pyrolysis 1992, 24, 123–137.
49. Fauchère, J.L.; Charton, M.; Kier, L.B.; Verloop, A.; Pliska, V. Amino Acid Side Chain Parameters for Correlation Studies in Biology and Pharmacology. Int. J. Pept. Protein Res. 1988, 32, 269–278.
50. Lefranc, M.-P. Amino Acids. 2001. Available online: https://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/ (accessed on 7 June 2021).
51. Bachem. Peptide Calculator. 2017. Available online: https://www.bachem.com/knowledge-center/peptide-calculator/ (accessed on 7 June 2021).
52. Jha, K.; Saha, S.; Tanveer, M. Prediction of Protein-Protein Interactions Using Stacked Auto-Encoder. Trans. Emerg. Telecommun. Technol. 2021, e4256.
53. Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms; Spartan Books: Washington, DC, USA, 1962.
54. Rojas, R. Neural Networks: A Systematic Introduction; Springer: Berlin/Heidelberg, Germany, 1996.
55. Bishop, C. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
56. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; The MIT Press: Cambridge, MA, USA, 2016.
57. Ketkar, N. Deep Learning with Python: A Hands-on Introduction; Apress: New York, NY, USA, 2017.

Figure 1. Steps to follow for the characterization and prediction of glycation using ANN. First, a preliminary database is assembled from the amino acid sequence of the peptides. Then, by rewriting the amino acid sequence with the values corresponding to each of the physical properties, a list of all the vectors is created. Their values are normalized and delivered to the ANN, which through a learning process makes the final predictions corresponding to the probability of glycation for each peptide.

Figure 2. Example of construction of vectors. For the 11mer sequence SPFYLRPPSFL, each amino acid is converted into a number, depending on the property. For example, the first amino acid, serine (SER), has a value of 5.7 for the property isoelectric point, 3.83 for torsion angle, 1.6 for normalized van der Waals volume, −0.04 for polarizability, 0.06 for hydrophobicity, 105 for mass, −0.8 for hydropathy, and −2 for amino acid sequence.
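The vector construction described in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the Kyte-Doolittle hydropathy scale, which agrees with the value of −0.8 given for serine in the caption, although the caption does not name the scale. The min-max normalization is likewise an assumption, since the article states only that the values are normalized.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the caption only
# confirms -0.8 for serine).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(peptide, scale):
    """Replace each one-letter residue code with its property value."""
    return [scale[residue] for residue in peptide]


def min_max_normalize(values, lowest, highest):
    """Rescale values to [0, 1] (hypothetical normalization choice)."""
    return [(v - lowest) / (highest - lowest) for v in values]


peptide = "SPFYLRPPSFL"  # 11mer from Figure 2, arginine at the center
raw = encode(peptide, KYTE_DOOLITTLE)
scaled = min_max_normalize(
    raw, min(KYTE_DOOLITTLE.values()), max(KYTE_DOOLITTLE.values())
)
print(raw)
print(scaled)
```

Repeating the encoding with a different lookup table per property, and concatenating the resulting vectors, would give the multi-property inputs used in the two–case and three–case settings.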

Figure 3. Schematic of the utilized ANN structure. Each of the 11mer sequences with n–features is provided to the ANN through the input layer, from which the learning process proceeds through an adjustment in the interconnections in the hidden layers. Finally, a prediction of the glycation probability is made, which is provided by the output layer. Different colors distinguish different properties.

Figure 4. Box plot of the results for case 1. Values obtained for MAPE (left) and for MAE (right). The characteristics are ordered from the lowest error (implying higher reliability and accuracy) to the highest errors. The * sign denotes outliers.

Figure 5. Box plots of the results for case 2. Values obtained for MAPE (left) and for MAE (right). The first eight combinations with the lowest errors are shown, as well as the two combinations with the highest errors. The * sign denotes outliers.

Table 1. Details of the ANN study cases. The physical properties and the architecture of the neural network are presented for each of the sub–cases of case 1, along with the sub–cases that gave the best results for cases 2 and 3.

Cases	Features	ANN Layer Architecture
Case 1A	Sequence of amino acids	11 × 4 × 3 × 2 × 1
Case 1B	Hydropathy	11 × 4 × 3 × 2 × 1
Case 1C	Mass	11 × 4 × 3 × 2 × 1
Case 1D	Polarizability	11 × 4 × 3 × 2 × 1
Case 1E	Hydrophobicity	11 × 4 × 3 × 2 × 1
Case 1F	Normalized van der Waals volume	11 × 4 × 3 × 2 × 1
Case 1G	Torsion angle	11 × 4 × 3 × 2 × 1
Case 1H	Isoelectric point	11 × 4 × 3 × 2 × 1
Case 2	Polarizability + normalized van der Waals volume	22 × 10 × 1
Case 3	Sequence of amino acids + hydropathy + normalized van der Waals volume	33 × 40 × 1
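To make the layer notation in Table 1 concrete, the following Python sketch reads each architecture as a list of neurons per layer (input, hidden layers, output) and builds a plain fully connected network from it. This reading of the notation is an assumption, as are the ReLU hidden activations, the linear output, and the random initialization; the function names are hypothetical and the code is not taken from the article.

```python
import random


def count_parameters(layers):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))


def init_network(layers, seed=0):
    """Create (weights, biases) per layer with small random weights."""
    rng = random.Random(seed)
    return [
        ([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(layers, layers[1:])
    ]


def forward(network, inputs):
    """Forward pass: ReLU on hidden layers, linear output (both assumed)."""
    x = list(inputs)
    for index, (weights, biases) in enumerate(network):
        z = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        is_output = index == len(network) - 1
        x = z if is_output else [max(0.0, v) for v in z]
    return x[0]


case_1 = [11, 4, 3, 2, 1]
case_2 = [22, 10, 1]
case_3 = [33, 40, 1]
for layers in (case_1, case_2, case_3):
    print(layers, count_parameters(layers))

prediction = forward(init_network(case_1), [0.5] * 11)
```

Under this reading, the case 1 networks are very small, which is consistent with the limited size of the dataset noted in the conclusions.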

Table 2. Summary of glycation probability errors (MAPE/MAE) for the single–case and two–case predictive approaches. MAPE values are reported in the upper triangle of the matrix and MAE values in the lower triangle. The main diagonal shows the MAPE and MAE values, respectively, for case 1. Both errors are also shaded in gray to indicate high (light gray), average (medium gray), or low (dark gray) prediction accuracy. Hyd: hydropathy; Hyp: hydrophobicity; IEP: isoelectric point; Mas: mass; Pol: polarizability; SoA: amino acid sequence of the peptide; ToA: torsion angle; and vdW: normalized van der Waals volume.

	MAPE
	vdW	Mas	Pol	SoA	Hyd	IEP	ToA	Hyp
vdW	39.75/27.49	35.71	33.3	37.61	35.77	37.34	40.62	39.45
Mas	24.83	40.93/27.5	36.74	46.28	40	39.05	46.47	39.36
Pol	23.53	25.5	41.06/28.3	36.79	35.66	37.88	41.51	41.73
SoA	25.63	29.93	25.74	41.71/26.98	40	46.59	44.96	44.74
Hyd	25.69	29.87	25.66	26.36	42.6/28.17	45.4	46.58	46.22
IEP	23.96	26.11	23.87	26.81	27.02	48.34/28.95	46.15	52.23
ToA	24.69	28.85	25.15	28.42	28.83	26.08	52.12/32.11	57.13
Hyp	26.21	26.77	26.51	27.48	27.8	27.57	30.61	54.79/30.63
								MAE
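For readers unfamiliar with the two error measures in Table 2, the following Python sketch shows the standard definitions of the mean absolute error (MAE) and the mean absolute percentage error (MAPE). It assumes the conventional formulas; this part of the article does not spell them out. The observed and predicted values are hypothetical and serve only to show the calculation.

```python
def mae(observed, predicted):
    """Mean absolute error, in the units of the observed values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error; observed values must be non-zero."""
    ratios = [abs(o - p) / abs(o) for o, p in zip(observed, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical glycation probabilities (percent), for illustration only.
observed = [50.0, 20.0, 80.0]
predicted = [40.0, 25.0, 100.0]

print(mae(observed, predicted))
print(mape(observed, predicted))
```

Because MAPE divides by the observed value, peptides with low glycation probabilities weigh heavily in it, which is one reason to read both measures together.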

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
