Glypre: In Silico Prediction of Protein Glycation Sites by Fusing Multiple Features and Support Vector Machine

Glycation is a non-enzymatic process occurring inside or outside the host body by attaching a sugar molecule to a protein or lipid molecule. It is an important form of post-translational modification (PTM), which impairs the function and changes the characteristics of the proteins so that the identification of the glycation sites may provide some useful guidelines to understand various biological functions of proteins. In this study, we proposed an accurate prediction tool, named Glypre, for lysine glycation. Firstly, we used multiple informative features to encode the peptides. These features included the position scoring function, secondary structure, AAindex, and the composition of k-spaced amino acid pairs. Secondly, the distribution of distinctive features of the residues surrounding the glycation and non-glycation sites was statistically analysed. Thirdly, based on the distribution of these features, we developed a new predictor by using different optimal window sizes for different properties and a two-step feature selection method, which utilized the maximum relevance minimum redundancy method followed by a greedy feature selection procedure. The performance of Glypre was measured with a sensitivity of 57.47%, a specificity of 90.78%, an accuracy of 79.68%, area under the receiver-operating characteristic (ROC) curve (AUC) of 0.86, and a Matthews’s correlation coefficient (MCC) of 0.52 by 10-fold cross-validation. The detailed analysis results showed that our predictor may play a complementary role to other existing methods for identifying protein lysine glycation. The source code and datasets of the Glypre are available in the Supplementary File.


Introduction
In general, glycation is a post-translational modification produced by a reaction between reducing sugars and the amino groups of lysine or arginine, or N-terminal amino acids. It is a two-step non-enzymatic reaction. The first step is to form the stable Amadori product based on the unstable Schiff base. The second step is to further form irreversible cross-linked products, the so-called advanced glycation end products (AGES). The accumulation of glycation products are known to associate with the pathogenesis of aging and complications of diabetes. It also plays crucial regulatory roles in almost all cellular processes and is involved in other human diseases, such as Alzheimer's [1] and Parkinson's diseases [2]. The essence of glycation is the reducing sugars attached to amino groups in cellular proteins, which result in Schiff bases as early glycation products [3,4]. Recently, more interest has been paid to lysine glycation from researchers working on metabolism [5].

Investigation of Different Features
The residue distribution surrounding lysine is an important factor for the lysine glycation sites. Thus, for Dataset 1, the investigation is performed for the distribution of different properties, including position conservation, secondary structure, AAindex, and the composition of k-spaced amino acid pairs on the basis of a window size 31.

The Position Conservation Features Analysis
The position conservation of residues surrounding the lysine can usually provide much helpful information to predict glycation sites. The analysis was performed for the distribution of the M(l) value of each position around lysine residues, as shown in Figure 2.
From Figure 2, it can be observed that the M(l) value of almost each position in positive samples is greater than that in negative samples. It makes clear that these glycation sites prefer some special residues. Especially, it was found that the M(l) value at −6 and +4 sites were significantly higher than other sites, indicating that these two sites play relatively more important roles for glycation. Further analysis shows that the frequency of alanine (A), glutamic acid (E) and leucine (L) is >10% at −6 and +4 sites of glycation. From the sequence logo of the experimental 210 glycation peptides (Figure 3), we also found that alanine (A), glutamic acid (E), leucine (L), and lysine (K) have preferences to

Investigation of Different Features
The residue distribution surrounding lysine is an important factor for the lysine glycation sites. Thus, for Dataset 1, the investigation is performed for the distribution of different properties, including position conservation, secondary structure, AAindex, and the composition of k-spaced amino acid pairs on the basis of a window size 31.

The Position Conservation Features Analysis
The position conservation of residues surrounding the lysine can usually provide much helpful information to predict glycation sites. The analysis was performed for the distribution of the M(l) value of each position around lysine residues, as shown in Figure 2.
From Figure 2, it can be observed that the M(l) value of almost each position in positive samples is greater than that in negative samples. It makes clear that these glycation sites prefer some special residues. Especially, it was found that the M(l) value at −6 and +4 sites were significantly higher than other sites, indicating that these two sites play relatively more important roles for glycation. Further analysis shows that the frequency of alanine (A), glutamic acid (E) and leucine (L) is >10% at −6 and +4 sites of glycation. From the sequence logo of the experimental 210 glycation peptides (Figure 3), we also found that alanine (A), glutamic acid (E), leucine (L), and lysine (K) have preferences to appear near the glycation sites. The results also support that acidic amino acids, mainly glutamate (E), and lysine (K) residues, catalyse the glycation of nearby lysines [11,14,15]. All in all, these analyses suggested that position conservation influenced the glycation. appear near the glycation sites. The results also support that acidic amino acids, mainly glutamate (E), and lysine (K) residues, catalyse the glycation of nearby lysines [11,14,15]. All in all, these analyses suggested that position conservation influenced the glycation.

Secondary Structure Features Analysis
In order to analyse the difference of secondary structure (SS) between glycated and non-glycated peptides, the distribution of SS around lysine residues is calculated, as illustrated in Figures 4-6, which demonstrates that the residues around the glycation site favoured to form helix structures. Moreover, it was observed that the closer the site gets to the glycation sites, the lower the frequency of helix and sheet structures, and the greater the frequency of coil structures, whereas the value of coils and sheets surrounding non-glycation sites were higher than that around the glycation sites. Moreover, the frequency of helix and sheet structures has a similar fluctuation in positive and negative samples. We can conclude from the above that the difference of SS could discriminate glycation and non-glycation sites efficiently. appear near the glycation sites. The results also support that acidic amino acids, mainly glutamate (E), and lysine (K) residues, catalyse the glycation of nearby lysines [11,14,15]. All in all, these analyses suggested that position conservation influenced the glycation.

Secondary Structure Features Analysis
In order to analyse the difference of secondary structure (SS) between glycated and non-glycated peptides, the distribution of SS around lysine residues is calculated, as illustrated in Figures 4-6, which demonstrates that the residues around the glycation site favoured to form helix structures. Moreover, it was observed that the closer the site gets to the glycation sites, the lower the frequency of helix and sheet structures, and the greater the frequency of coil structures, whereas the value of coils and sheets surrounding non-glycation sites were higher than that around the glycation sites. Moreover, the frequency of helix and sheet structures has a similar fluctuation in positive and negative samples. We can conclude from the above that the difference of SS could discriminate glycation and non-glycation sites efficiently.

Secondary Structure Features Analysis
In order to analyse the difference of secondary structure (SS) between glycated and non-glycated peptides, the distribution of SS around lysine residues is calculated, as illustrated in Figures 4-6, which demonstrates that the residues around the glycation site favoured to form helix structures. Moreover, it was observed that the closer the site gets to the glycation sites, the lower the frequency of helix and sheet structures, and the greater the frequency of coil structures, whereas the value of coils and sheets surrounding non-glycation sites were higher than that around the glycation sites. Moreover, the frequency of helix and sheet structures has a similar fluctuation in positive and negative samples. We can conclude from the above that the difference of SS could discriminate glycation and non-glycation sites efficiently.

AAindex Features Analysis
We knew that The Amino Acid Index database (AAindex) played very important roles in PTM prediction [16]. In this study, we selected the top 20 amino acid indices (shown in Table 1) corresponded to the best AUC value. Afterwards, the distribution of five of the 20 amino acid indices of residues around the glycation and non-glycation sites was statistically analysed. The six representative amino acid indices were shown in Figure 7. More details for the distribution of 20 amino acid indices are shown in the Supplementary Information. The results show that, for the 20 amino acid indices, the positive samples have a larger fluctuation than the negative samples. It also illustrated that information about the 20 amino acid indices around glycation and non-glycation sites was very different.

AAindex Features Analysis
We knew that The Amino Acid Index database (AAindex) played very important roles in PTM prediction [16]. In this study, we selected the top 20 amino acid indices (shown in Table 1) corresponded to the best AUC value. Afterwards, the distribution of five of the 20 amino acid indices of residues around the glycation and non-glycation sites was statistically analysed. The six representative amino acid indices were shown in Figure 7. More details for the distribution of 20 amino acid indices are shown in the Supplementary Information. The results show that, for the 20 amino acid indices, the positive samples have a larger fluctuation than the negative samples. It also illustrated that information about the 20 amino acid indices around glycation and non-glycation sites was very different.

CKSAAP Feature Analysis
The feature selection method could determine the most important amino acid pairs, which were generated by the CKSAAP encoding scheme [17]. The CKSAAP is dependent on the window size, so we selected the optimal window size (17) to analyse the CKSAAP. In order to give some instruments for predicting the glycation sites, the top 30 features were selected according to the IG feature list. The composition of the top-30 residue pairs were also presented in two radar diagrams ( Figure 8). As can be seen from Figure 8, the compositions of the top-30 features are remarkably different in glycation and non-glycation sites. The importance of the top-30 residue pairs is also clearly and intuitively characterized in Figure 9. For example, the feature 'LxE' is significantly enriched in

CKSAAP Feature Analysis
The feature selection method could determine the most important amino acid pairs, which were generated by the CKSAAP encoding scheme [17]. The CKSAAP is dependent on the window size, so we selected the optimal window size (17) to analyse the CKSAAP. In order to give some instruments for predicting the glycation sites, the top 30 features were selected according to the IG feature list. The composition of the top-30 residue pairs were also presented in two radar diagrams ( Figure 8). As can be seen from Figure 8, the compositions of the top-30 features are remarkably different in glycation and non-glycation sites. The importance of the top-30 residue pairs is also clearly and intuitively characterized in Figure 9. For example, the feature 'LxE' is significantly enriched in position pairs (−12/−11) surrounding the glycation sites. As can be seen in Figure 9 A, E, and L frequently appeared in the top-30 amino acid pairs, which is consistent with the observation from Figure 8. Figure 9 also showed the sequence patterns around the glycation sites, that is, a sequence fragment including these amino acid pairs would more likely have glycation sites. Figure 8 also illustrated that information about the composition of the top-30 residue pairs was very different in positive and negative samples. position pairs (−12/−11) surrounding the glycation sites. As can be seen in Figure 9 A, E, and L frequently appeared in the top-30 amino acid pairs, which is consistent with the observation from Figure 8. Figure 9 also showed the sequence patterns around the glycation sites, that is, a sequence fragment including these amino acid pairs would more likely have glycation sites. Figure 8 also illustrated that information about the composition of the top-30 residue pairs was very different in positive and negative samples.

The Performance of the Proposed Predictor
According to the above analysis, we respectively optimize the window size of each set of features to obtain the best prediction accuracy. The window size is varied from 11 residues to 31 residues and investigated the AUC found through SVM in 10-fold cross-validation. The optimal window sizes are 31, 27, 13 and 25 residues, respectively. Based on the optimal window sizes for four groups of features, the sample can be formulated as a 402-dimension vector, including 31 dimensions for the position scoring function, 27 × 3 = 81 dimensions for the secondary structure, 13 × 20 dimensions for the AAindex, and 30 dimensions for the CKSAAP.
Next, it is necessary to perform feature selection to remove the irrelevant and redundant features. In this study, features in the mRMR feature rank list were added one by one during the GFS procedure by using SVM in 10-fold cross-validation. Performance comparisons of the prediction models with the addition of features is shown in Figure 10. The red asterisk showed the distribution of the selected optimal features. The number of selected optimal features was 87, which had the position pairs (−12/−11) surrounding the glycation sites. As can be seen in Figure 9 A, E, and L frequently appeared in the top-30 amino acid pairs, which is consistent with the observation from Figure 8. Figure 9 also showed the sequence patterns around the glycation sites, that is, a sequence fragment including these amino acid pairs would more likely have glycation sites. Figure 8 also illustrated that information about the composition of the top-30 residue pairs was very different in positive and negative samples.

The Performance of the Proposed Predictor
According to the above analysis, we respectively optimize the window size of each set of features to obtain the best prediction accuracy. The window size is varied from 11 residues to 31 residues and investigated the AUC found through SVM in 10-fold cross-validation. The optimal window sizes are 31, 27, 13 and 25 residues, respectively. Based on the optimal window sizes for four groups of features, the sample can be formulated as a 402-dimension vector, including 31 dimensions for the position scoring function, 27 × 3 = 81 dimensions for the secondary structure, 13 × 20 dimensions for the AAindex, and 30 dimensions for the CKSAAP.
Next, it is necessary to perform feature selection to remove the irrelevant and redundant features. In this study, features in the mRMR feature rank list were added one by one during the GFS procedure by using SVM in 10-fold cross-validation. Performance comparisons of the prediction models with the addition of features is shown in Figure 10. The red asterisk showed the distribution of the selected optimal features. The number of selected optimal features was 87, which had the

The Performance of the Proposed Predictor
According to the above analysis, we respectively optimize the window size of each set of features to obtain the best prediction accuracy. The window size is varied from 11 residues to 31 residues and investigated the AUC found through SVM in 10-fold cross-validation. The optimal window sizes are 31, 27, 13 and 25 residues, respectively. Based on the optimal window sizes for four groups of features, the sample can be formulated as a 402-dimension vector, including 31 dimensions for the position scoring function, 27 × 3 = 81 dimensions for the secondary structure, 13 × 20 dimensions for the AAindex, and 30 dimensions for the CKSAAP.
Next, it is necessary to perform feature selection to remove the irrelevant and redundant features. In this study, features in the mRMR feature rank list were added one by one during the GFS procedure by using SVM in 10-fold cross-validation. Performance comparisons of the prediction models with the addition of features is shown in Figure 10. The red asterisk showed the distribution of the selected optimal features. The number of selected optimal features was 87, which had the highest AUC of 0.87.
The selected optimal features included the position scoring function (17), secondary structure (21), AAindex (32), and the composition of k-spaced amino acid pairs (17).  (17), secondary structure (21), AAindex (32), and the composition of k-spaced amino acid pairs (17). In order to evaluate the classification model in broader terms, the k-fold cross-validation test (k = 6, 8 and 10) is used by splitting the dataset into k equally-sized subsets, and using each of the k subsets as the testing dataset iteratively. The k-fold cross-validation test was executed 50 times with different random seeds to extract most representative statistical results. Meanwhile, we used the leave-one-out test as the classification evaluation strategy to avoid the sampling bias of the k-fold crossvalidation test. The performance of the proposed method was shown in Table 2. In can be observed that Glypre performs statistically well for the sensitivity, specificity, accuracy, AUC, and MCC. Additionally, based on the same benchmark Dataset 1, our Glypre yielded a MCC of 0.52, a specificity of 90%, an accuracy of 79%, and an AUC of 0.86, which were observably better than an MCC of 0.31, a specificity of 54%, an accuracy of 68%, and an AUC of 0.72 in the Gly-PseAAC [13]. The sensitivity of 57.62% with Glypre was slightly worse than the sensitivity of 58.74% with Gly-PseAAC.

Comparison with Existing Methods
To ensure a fair comparison with previous studies, including Gly-PseAAC, GlyNN, and PreGly, Dataset 2 (89 glycation and 126 non-glycation sequences of 20 proteins) was adopted to investigate Glypre in three-and 10-fold cross-validation. Table 3 shows the results of comparison. Glypre outputs the average values 20 times and other results of Gly-PseAAC, GlyNN, and PreGly have been cited from a paper [13]. Glypre achieved the highest Sen of 85.11%, Acc of 89.77%, AUC of 95.57, and MCC of 78.84, which was clearly better than the existing methods. Although the Spe of Glypre is less than the Sp of PreGly, the difference is not large. In order to evaluate the classification model in broader terms, the k-fold cross-validation test (k = 6, 8 and 10) is used by splitting the dataset into k equally-sized subsets, and using each of the k subsets as the testing dataset iteratively. The k-fold cross-validation test was executed 50 times with different random seeds to extract most representative statistical results. Meanwhile, we used the leave-one-out test as the classification evaluation strategy to avoid the sampling bias of the k-fold cross-validation test. The performance of the proposed method was shown in Table 2. In can be observed that Glypre performs statistically well for the sensitivity, specificity, accuracy, AUC, and MCC. Additionally, based on the same benchmark Dataset 1, our Glypre yielded a MCC of 0.52, a specificity of 90%, an accuracy of 79%, and an AUC of 0.86, which were observably better than an MCC of 0.31, a specificity of 54%, an accuracy of 68%, and an AUC of 0.72 in the Gly-PseAAC [13]. The sensitivity of 57.62% with Glypre was slightly worse than the sensitivity of 58.74% with Gly-PseAAC.

Comparison with Existing Methods
To ensure a fair comparison with previous studies, including Gly-PseAAC, GlyNN, and PreGly, Dataset 2 (89 glycation and 126 non-glycation sequences of 20 proteins) was adopted to investigate Glypre in three-and 10-fold cross-validation. Table 3 shows the results of comparison. Glypre outputs the average values 20 times and other results of Gly-PseAAC, GlyNN, and PreGly have been cited from a paper [13]. Glypre achieved the highest Sen of 85.11%, Acc of 89.77%, AUC of 95.57, and MCC of 78.84, which was clearly better than the existing methods. Although the Spe of Glypre is less than the Sp of PreGly, the difference is not large.

Comparison with Other Predictors on the Independent Test Dataset
Since we have already investigated our proposed algorithm by k-fold cross-validation, it is necessary to further narrow down the independent dataset test to check the effectiveness of the final model. We randomly selected 20 proteins from PLMD 3.0, which had no overlap with Dataset 1. The performance of model on Dataset 1 was evaluated using the 20 proteins as independent testing dataset. A glycation prediction tool (Gly-PseAAC) still provided online prediction services (http://app.aporc.org/Gly-PseAAC/).
The performance of the proposed method was shown in Table 4. From Table 4, we can see that Glypre can successfully detect the same number of glycation sites as the Gly-PseAAC does. However, the threshold θ of Gly-PseAAC was set to 0.35, while the threshold θ of Glypre was set to 0.5. We can also see that the posterior probability scores of Glypre were almost larger than that of Gly-PseAAC. This means that Glypre is a promising method for the prediction of protein glycation sites. Table 4. Experimental results for Glypre and Gly-PseAAC on the independent test dataset. We highlighted the posterior probability scores of successfully detecting glycation sites. The glycation sites of protein were listed, and the posterior probability scores of these two predictors were also shown.

Datasets
The experimental-confirmed lysine glycation benchmark dataset (Dataset 1) used in this work was collected from the database CPLM (http://cplm.biocuckoo.org/) [18]. There were 323 lysine glycation sites extracted from 72 proteins, and the corresponding primary protein sequences were collected from UniProt (release 2014_11, http://www.uniprot.org/) [19]. Redundant sequences were removed by using the CD-HIT [20] program with at least a 30% pairwise sequence identity threshold. Finally, a total of 47 proteins which contain lysine glycation sites were obtained. Subsequently, from these proteins, 210 peptides having length of 31 residues and experimental-validated glycation in the centre were collected and labelled them as positive samples. Correspondingly, the other 31-mer sequences with lysine centres, which are non-glycation sites, were selected as the negative samples. Finally, there were 210 positive samples and 1383 negative samples. The number of the positive and negative samples is unbalanced which usually results in a skewed classification of non-glycation. Thus, a main dataset was obtained by combining the 210 positive samples and 420 negative samples randomly selected from the 1383 negative samples. Although Gly-PseAAC [13] used the same benchmark dataset, it only provided the samples which are 15 residues long with the lysine in the centre. Thus, the comparison with the Gly-PseAAC on the benchmark dataset is not so propitious.
To evaluate the effectiveness of the proposed method as well as to perform fair comparisons with previous methods [11][12][13], we used another benchmark dataset (Dataset 2) to train Glypre, which contained only 89 glycation sites and 126 non-glycation sites from 20 proteins. This dataset can be downloaded from http://www.cds.dtu.dk/databases/GlycateBase-1.0/.
To further test the generalizability of our method, an independent testing dataset was introduced in this work which was derived from the database PLMD 3.0 (http://plmd.biocuckoo.org/) [21]. We randomly selected the 20 proteins from PLMD 3.0, which had no overlap with Dataset 1. These 20 proteins including 37 lysine glycation sites were constructed as the independent testing dataset.

The Position Scoring Function
The potential sequence characteristic of residues around glycation sites (from −15 to +15) is one of the most important aspects. Here, we investigated the residue preference at each site by the following conservation formulation (Equation (1)): where P l i indicates the occurrence frequency of the ith amino acid at the lth position. p 0 , which represents the background frequency, is set to 0.05. The larger the conservation of the M(l) value is, the stronger the conservation of the lth site. When the M(l) value is 0, it represents a random distribution of the 20 residues at the lth position.
Given the aligned training sequences from positive samples, the position weight matrix (PWM) was defined as follows Equation (2): where n xl denotes the real counts of residue x at the lth position. p 0 is the background frequency of each amino acid in the protein sequence, and equal to 0.05 in this work. N denotes the number of the training sequences. Then, with the arbitrary peptide fragment, which has 31 residues, the position scoring function [10] of the lth site can be calculated as Equation (3): where M p (l) and M N (l) denote the position conservation at the lth site in the positive and negative samples, respectively. The value of F(l) shows the degree of the sequence close to positive samples.

The Secondary Structure (SS)
In this step, the secondary structure information around glycation sites is taken into account, which plays a vital role in the protein's structure and function [22]. Secondary structures include α-helix, β-sheets, and coil. PSIPRED [23] is taken into account for the prediction of secondary structures, which delivered the output in the form of "H", "C", and "E", representing helix, sheets, and coils, respectively. In order to constitute a numeric vector, we used the probability of "H", "C", and "E" to encode protein segments in this study.

Amino Acid Indices
The Amino Acid Index database (AAindex) includes amino acid mutation matrices and amino acid indices [24]. It collected 544 physicochemical properties for version 9.0. An amino acid index including 20 numerical values stands for physicochemical properties of 20 amino acids. Physicochemical properties have been successfully predicted for several protein modifications [25,26]. Here, we selected 20 informative physicochemical properties to encode each peptide in this work.

The Composition of k-Spaced Amino Acid Pairs (CKSAAP)
The composition of k-spaced amino acid pairs (CKSAAP) has been successfully used for predicting various PTMs [12,17,27], which could reflect the characteristics of residues surrounding glycation sites. Given 20 native amino acids and one complementary residue 'X', the CKSAAP feature contains 441 basic amino acid pair types: AA, AC, . . . , AW, AY, XX. The basic amino acid pair types are enlarged to the k-spaced amino acid pair types. For example, 'AˆW' means that this amino acid pair is separated by one other amino acid. Considering that the CKSAAP was performed over k = 0, 1, 2, 3 and 4 in this study, and the vector size of the CKSAAP feature was 2205 dimensions, here, we utilized the information gain method (IG) [28] to rank the 2205 features. Afterwards, the top-30 features to encode the protein were selected.

Feature Selection
Feature selection is an important step for building an effective prediction model [29][30][31][32][33]. Generally, not all features have an equivalent contribution to the glycation prediction system. In addition, some features are usually noisy and redundant. To analyse the features, we used the mRMR method [34] to rank all the features. Then, we implemented a new GFS procedure based on the mRMR rank list. The GFS procedure is as follows: Features in the mRMR feature rank list were added one by one to make the feature subset, and if the AUC of the feature subset fulfils the criteria of improvement by means of an SVM in 10-fold cross-validation, and this feature is added. Finally, the feature subset that has the highest AUC was used to train the model for predicting glycation sites.

Support Vector Machine
The support vector machine (SVM) derives from statistical learning theory, first proposed by Vapnik [35], which is an effective machine learning technique. SVM constructs a hyperplane that separates two types of samples as widely as possible in a high-dimensional space. It has been effectively applied in many bioinformatics problems. In this study, we utilized the LIBSVM toolset [36]. A grid search strategy based on 10-fold cross-validation is utilized to find the optimal parameters.

Performance Assessment
Five measurements were employed to evaluate the performance of our proposed predictor. These measurements included sensitivity (SN), specificity (SP), accuracy (ACC), and Matthews's correlation coefficient (MCC), and area under the receiver-operating characteristic (ROC) curve (AUC). AUC is the area under the receiver-operating characteristic (ROC) curve, presented as a plot of true positive rate against false positive rate. The AUC score of a ROC curve summarizes the overall performance of a corresponding model or method [37], while others are defined by the following formulas Equations (4)- (7): where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively.

Conclusions
In this work, we explored the application of a position scoring function and secondary structure (SS) in the glycation prediction problem. The distribution of different properties around the glycation sites and non-glycation sites based on a large training dataset were statistically analysed. The analysis suggests that the proximity of acidic amino acids, mainly glutamate, and lysine to lysines promotes the glycation. We also found some important features which can contribute to the prediction of glycation. A novel two-step feature selection was proposed, which improved the prediction and generalization ability of the model. The experimental results of Glypre outperformed the existing methods on Dataset 2, which revealed the effectiveness of this new method. Moreover, the promising performance on an independent testing dataset from the PLMD database also proved the commendable generalization ability of the defined method. In the future, we will collect more data, analyze more features, and use other useful strategies to construct a predictive model with higher accuracy. Of course, the method forged in this research can also be used in the prediction of other protein post-translational modifications.