Recognition of Protein Pupylation Sites by Adopting Resampling Approach

With the in-depth study of posttranslational modification sites, protein ubiquitination has become a key topic in research on the molecular mechanisms of posttranslational modification. Pupylation is a widespread process in prokaryotes in which a prokaryotic ubiquitin-like protein (Pup) is attached to a substrate through a series of biochemical reactions. However, experimental methods for identifying pupylation sites are often time-consuming and laborious. This study aims to propose an improved approach for predicting pupylation sites. First, the Pearson correlation coefficient was used to measure the correlation among different amino acid pairs, calculated from the frequency of each amino acid. Then, the multiple types of features were filtered separately according to the descending order of their Pearson correlation coefficient values. Third, to obtain a balanced dataset, the K-means principal component analysis (KPCA) oversampling technique was employed to synthesize new positive samples, and the fuzzy undersampling method was employed to reduce the number of negative samples. Finally, the performance of our method was verified by means of the jackknife test and 10-fold cross-validation. The average results of 10-fold cross-validation showed a sensitivity (Sn) of 90.53%, specificity (Sp) of 99.8%, accuracy (Acc) of 95.09%, and Matthews correlation coefficient (MCC) of 0.91. Moreover, an independent test dataset was used to further measure its performance; the prediction results achieved an Acc of 83.75% and an MCC of 0.49, superior to previous predictors. The good performance and stability of the proposed method show that it is an effective way to predict pupylation sites.


Introduction
Pupylation is a prokaryotic analog of ubiquitination, in which the prokaryotic ubiquitin-like protein (Pup) specifically modifies target intracellular proteins under the action of dedicated enzymes [1,2]. Its proteasome-independent functions help regulate DNA repair mechanisms; in particular, recent advances on its regulatory role in a range of cellular functions point toward therapeutic applications in cancer chemotherapy [3][4][5]. Pup is an identified posttranslational small modifier in prokaryotes [1,2] that usually attaches to a substrate lysine via an isopeptide bond; this process is called pupylation. Although the functions of pupylation and ubiquitylation are similar, the enzymologies involved in these processes are different [1,2,6]. Since experimental methods are laborious, it is essential to improve current computational methodologies to provide direction for further research on intriguing research questions. An example is the work on understanding the stability of the ERCC1 DNA repair protein, a biomarker of several advanced cancers. This protein functions in multiple DNA repair pathways, and certain mutations in this protein and its partner have drastic consequences for the stability of the protein complex [3][4][5][6][7][8][9]. Functional and structural studies have established the significance of complex integrity and of ubiquitination/deubiquitination events in controlling the function of the protein during the DNA damage response [1,2,6,7,8]. These findings motivate new therapeutic approaches in cancer chemotherapy. Thus, it is important to devise accurate computational methodologies that allow scientists to predict the probability of the different regulatory pathways that control protein function and disease severity [6,8,9]. Several proteomics methods have been proposed for identifying pupylated proteins and pupylation sites [10][11][12][13][14][15][16][17].
However, pupylation sites of many experimentally identified pupylated proteins are still unknown. Accurate identification of pupylation sites is an essential first step to better understand the mechanism underlying protein pupylation. A number of large-scale proteomics techniques have been used for predicting pupylation sites, which are time-consuming and laborious. Therefore, a series of other effective and accurate prediction methods have been proposed for predicting pupylation sites.
The Group-based Prediction System was used by Liu et al. to design the predictor GPS-PUP for predicting pupylation sites [11]. The predictor iPup [12] was designed by combining the composition of k-spaced amino acid pairs (CKSAAP) feature and support vector machines (SVMs). Zhao et al. [13] created a predictive model with five features and also adopted feature selection methods to find the optimal feature set. Chen et al. [14] proposed PupPred, which was also an SVM-based predictor; it used amino acid pair features to encode lysine-centered peptides and combined a series of features to improve its predictive performance. Hasan et al. [15] constructed a predictor called pbPUP that used the profile-based composition of k-spaced amino acid pairs (pbCKSAAP) to represent the sequence information around the pupylation site. Jiang et al. [16] created a predictor called PUL-PUP that combined the positive-unlabeled learning technique with CKSAAP to predict pupylation sites. Owing to the large number of newly identified pupylated proteins and sites, the structured and searchable database PupDB was built to integrate information on pupylated proteins and sites, protein structures, and functional annotations for the management and analysis of pupylation data [16]. Recently, Nan et al. [17] proposed an enhanced positive-unlabeled learning algorithm for predicting pupylation sites.
In this study, the Pearson correlation coefficient was used to evaluate the relevance among amino acid pairs based on amino acid composition information. Then, following the descending order, the amino acid pair features were added in steps of 20 to find the best prediction results under the jackknife test. The selected 320 amino acid pairs were then combined with TOP-n-gram [18], adapted normal distribution bi-profile Bayes (ANBPB) [19], and parallel correlation pseudo amino acid composition (PC-PseAAC) [20] to construct a multiple feature vector for the query protein peptide sequence. Because of the imbalance between the number of positive and negative samples (183:2258), the K-means principal component analysis (KPCA) oversampling technique, first proposed by Jia and Zuo [21], was applied to the positive training dataset for oversampling, and the synthetic samples were added to the original positive training dataset to form a new positive training dataset. The fuzzy undersampling (FUS) method [21,22] was applied to the negative dataset to remove noisy negative samples, and the selected negative samples were used as a new negative training dataset. Finally, the 377 pupylated sites (positive samples) and 365 non-pupylated lysine sites (negative samples) were used to train and test our model. Moreover, the performance of our method was verified by means of the jackknife test, 10-fold cross-validation, and an independent dataset. Compared with other existing predictors, all of the results showed better performance and stability, demonstrating that this is an effective way to predict pupylation sites.

Pearson Correlation Coefficient for Feature Selection
The Pearson correlation coefficient [23,24] is a measure of the linear relationship between variables. If a positive linear correlation exists between two variables, the metric approaches 1. The Pearson correlation coefficient was used to find the most closely related amino acids. First, amino acid composition (AAC) features were extracted from the training dataset. Second, the Pearson correlation coefficient between each pair of amino acids was calculated from the AAC values, and the values were sorted in descending order. Finally, the amino acid pair composition (AAPC) features were selected according to this order. The Pearson correlation coefficient equals 1 for the identical amino acid pairs AA, CC, ..., YY, so these 20 same-amino-acid pairs were reserved and combined with the others. Jackknife-test prediction was performed on combinations of 20, 40, ..., 380, and 400 features, in steps of 20, and the detailed results are shown in Table S1. The best results were obtained when the first 320 AAPC features were selected: sensitivity (Sn) was 81.42%, specificity (Sp) was 76.44%, accuracy (Acc) was 78.10%, and Matthews correlation coefficient (MCC) was 0.55. However, the prediction results did not improve monotonically with the number of features; this might be due to the same classifier parameters being used for features of different dimensions.
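As an illustration of the ranking step described above, the following Python sketch computes the Pearson correlation matrix of the 20 AAC frequency columns and ranks the 400 amino acid pairs in descending order of correlation. The function and variable names are ours, not from the original implementation:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def rank_amino_acid_pairs(aac_matrix):
    """Rank the 400 amino acid pairs by the Pearson correlation between
    the AAC frequency columns of their two amino acids.
    aac_matrix: (n_samples, 20) array of per-peptide AAC frequencies."""
    # np.corrcoef treats rows as variables, so transpose to get a 20x20 matrix
    corr = np.corrcoef(aac_matrix.T)
    pairs = []
    for i, a in enumerate(AMINO_ACIDS):
        for j, b in enumerate(AMINO_ACIDS):
            pairs.append((a + b, corr[i, j]))
    # descending order; identical pairs (AA, CC, ..., YY) have r = 1 and rank first
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs

# stepwise selection would then evaluate the top 20, 40, ..., 400 pairs
# with the classifier and keep the best-performing prefix (320 in the paper)
```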

Combination of KPCA and FUS for Training Set Balancing
As described in Section 2.1, the training dataset comprised 183 positive samples and 2258 negative samples, the same datasets as in [15,17], to guarantee a fair comparison.
The ratio of positive to negative samples was about 1:12, which would bias the classifier toward the majority class. The KPCA oversampling and fuzzy undersampling (FUS) methods were first proposed by Jia and Zuo [21] and proved effective in identifying protein O-GlcNAcylation sites. Therefore, in this study, KPCA and FUS were also employed to resolve the imbalance between the positive and negative training datasets. First, FUS was used to remove redundant negative samples, and the remaining 365 nonpupylation protein peptides were used as the new negative training dataset. Then, KPCA was applied to the positive samples to divide them into three clusters (k = 3), and the experiment was repeated several times. The synthesized samples were added to the original positive samples so that the ratio of positive to negative samples was approximately 1:1. The following results were obtained based on KPCA and FUS.
The performance illustrated that results obtained from the original unbalanced training dataset were biased toward the majority class. For example, on the 320-dimensional selected feature vector, Sn was 0 and Sp was 100%, the worst possible behavior for a prediction model. However, after applying KPCA and FUS to the training dataset, Sn was 81.42% and Sp was 76.44%, as shown in Table 1, indicating that the unbalanced data had a great influence on the experimental results.

Predictive Performance Improvement
A variety of feature extraction methods can be found on the web server (http://bioinformatics.hitsz.edu.cn/Pse-in-One2.0/) [25]. After several trials, the TOP-n-gram, ANBPB, and PC-PseAAC feature extraction methods were combined with the 320-dimensional features extracted by AAPC. Table 2 reports the predictive performance of TOP-n-gram, adapted normal distribution bi-profile Bayes (ANBPB), and parallel correlation pseudo amino acid composition (PC-PseAAC) under the jackknife test. The Pearson correlation coefficient was used to filter the AAPC (320) features out of the 400-dimensional feature set so as to remove redundant information. Although the results improved, the value of MCC was slightly low, so AAPC (320) was combined with other features to increase it. When AAPC (320) was combined with TOP-n-gram and ANBPB, Sn, Acc, and MCC improved greatly; in particular, when AAPC (320) was combined with TOP-n-gram, MCC increased by 0.3037, although Sn was slightly reduced. When AAPC (320) was combined with PC-PseAAC, Sn increased by 2.73%. Combining AAPC (320) with TOP-n-gram and PC-PseAAC, Sn increased by 6.01%. After several experiments, AAPC (320) was combined with TOP-n-gram, ANBPB, and PC-PseAAC, yielding an MCC of 0.96, higher by 0.41 than the value for AAPC (320) alone, and an accuracy of 98.18%. Furthermore, Sn and Sp increased by 13.12% and 23.56% compared with AAPC (320). These results indicate that the predictor achieved better prediction performance.
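The combination step above amounts to concatenating the per-peptide feature vectors produced by each encoding. The sketch below shows the idea; only the 320-dimensional AAPC size comes from the text, while the other dimensions are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(7)

# hypothetical per-peptide feature vectors from each encoding
aapc_320 = rng.random(320)    # the 320 selected AAPC features (from the text)
top_ngram = rng.random(20)    # TOP-n-gram occurrence counts (illustrative size)
anbpb = rng.random(42)        # ANBPB probability features (illustrative size)
pc_pseaac = rng.random(24)    # PC-PseAAC with lambda = 4 (illustrative size)

# the combined representation is a simple concatenation of all encodings
combined = np.concatenate([aapc_320, top_ngram, anbpb, pc_pseaac])
```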

Comparison between the Proposed Method and Other Prediction Methods
The proposed method was compared with other methods, including EPuL [17], PUL-PUP [16], PSoL [26], and SVM balance [15], by tenfold cross-validation on the training dataset to evaluate its effectiveness for pupylation site prediction. Table 3 presents the comparison among EPuL, PUL-PUP, PSoL, and SVM balance. The SVM balance method can address dataset imbalance, but its negative samples, obtained from unannotated lysine sites, are inaccurate. PUL-PUP and PSoL use unreliable negative datasets, whose features are not optimal. EPuL uses a series of options to obtain reliably negative datasets; however, its accuracy is low because it does not integrate other features and its balanced dataset is selected randomly. Figure 1 shows that, over 10 runs of tenfold cross-validation, the proposed method achieved a higher accuracy than EPuL, with Sn higher by 6.32% and Sp higher by 4.35%. Furthermore, the Acc and MCC of the proposed method were 95.09% and 0.9063, respectively, which were 4.85% and 0.0963 higher than those of EPuL. Because the EPuL web server is unavailable, we list the detailed 10-fold cross-validation results of our method, PUL-PUP, PSoL, and SVM in Tables S2-S5.

Performance on the Independent Test Dataset
In this study, the proposed method achieved better performance, indicating that the Pearson correlation coefficient, KPCA, and FUS were effective in predicting pupylation sites. An independent test dataset of 20 proteins, containing 29 experimentally validated pupylation sites and 408 nonannotated lysine sites, was selected to further verify the effectiveness of the proposed method. Although the positive and negative sets of the independent test dataset were imbalanced, the dataset was obtained from real proteins and reflected the true distribution of pupylated and nonpupylated sites. A comparison with the existing methods, including EPuL, PUL-PUP, PSoL, and SVM balance, is shown in Figure 1. The prediction results of PUL-PUP, PSoL, and SVM balance are taken directly from EPuL [17]. We list the results of our method, PUL-PUP, PSoL, and SVM balance on the independent dataset in Tables S6-S9.
The overall accuracy of the proposed method was the best among the five models. In particular, Sn reached 100%, and Acc was 24% higher than that of EPuL.

Materials and Methods
This study involved the steps described in the following subsections.

Datasets
The training and test datasets constructed by Tung [12] and later used by Nan et al. [17] were also adopted in this study. The training dataset consisted of 183 experimentally validated pupylation sites and 2258 artificially generated nonannotated lysine sites from 162 proteins of Mycobacterium smegmatis (M. smegmatis), Mycobacterium tuberculosis (M. tuberculosis), and Escherichia coli (E. coli). The independent test dataset contained 29 experimentally verified lysine pupylation sites and 408 nonannotated lysine sites from 20 proteins. As the independent test dataset was highly unbalanced, it reflected the real-world behavior of the different methods [10][11][12][13][14][15][16][17]. As in previous studies, each protein peptide in the training and test datasets was 21 residues long. The training and test datasets are available in the supplementary material.

Amino Acid Composition (AAC)
AAC [27] is a widely used feature extraction method for protein sequence analysis and prediction. It calculates the content of each amino acid in a peptide fragment. For a given peptide fragment containing the 20 natural amino acids and the pseudo amino acid "X," only the 20 natural amino acids were used to construct the 20-dimensional AAC feature vector, because the frequency of the pseudo amino acid "X" is always zero or close to zero. AAC can be defined as follows:

AAC = [f_1, f_2, ..., f_20]^T, (1)

where f_i represents the frequency of occurrence of the ith amino acid among the 20 natural amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}:

f_i = N_i / N_total, (2)

where N_i is the number of occurrences of the ith amino acid in the peptide fragment and N_total is the total length of the peptide fragment. The AAC feature vector was thus 20-dimensional in this study.
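A minimal Python sketch of the AAC encoding described above (the helper name is ours, not from the original implementation):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(peptide):
    """20-dimensional amino acid composition: frequency of each natural
    amino acid in the peptide; the placeholder 'X' contributes nothing."""
    counts = Counter(peptide)
    total = len(peptide)  # N_total: total length of the fragment
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]

# example on an illustrative short fragment (the paper uses 21-residue windows)
vec = aac("MKVLAAGKK")
```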

Amino Acid Pair Composition
Amino acid pair composition (AAPC) [28] is generally used to capture the association between an amino acid and its adjacent amino acids in a protein sequence. In this study, it was used to calculate the frequency of occurrence of amino acid pairs in pupylation fragments. The AAPC feature vector is defined as follows:

AAPC = [x_1, x_2, ..., x_400]^T, (3)

where x_i is the frequency of the ith amino acid pair (AA, AC, ..., YY):

x_i = n_i / n_total, (4)

where n_i is the number of occurrences of the ith amino acid pair in the peptide fragment, and n_total is the total number of adjacent amino acid pairs. The AAPC feature vector was 400-dimensional in this study.
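A minimal Python sketch of the AAPC encoding, counting the overlapping adjacent dipeptides of a fragment (the helper name is ours; pairs involving the placeholder "X" are counted in the denominator but carry no feature of their own):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides

def aapc(peptide):
    """400-dimensional amino acid pair composition: frequency of each
    adjacent dipeptide among the L-1 overlapping pairs of the fragment."""
    total = len(peptide) - 1  # n_total: number of adjacent pairs
    counts = {}
    for i in range(total):
        pair = peptide[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(p, 0) / total for p in PAIRS]
```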

Adapted Normal Distribution bi-profile Bayes
ANBPB [19] is a combination of bi-profile Bayes (BPB) [29,30] and the standard normal distribution, which uses a probability vector to encode the peptide fragment.
The position-specific random variable X_ij (i = 1, 2, ..., 21) approximately obeys the standard normal distribution N(0, 1). If V_j denotes the standard deviation of the random variable X_ij, the normalized form of X_ij can be defined accordingly, and p_j (j = 1, 2, ..., n, n + 1, ..., 2n) can then be encoded by the adapted normal distribution through the standard normal cumulative distribution function

φ(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-t²/2} dt.
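The cumulative distribution function φ(x) above can be evaluated numerically via the error function, as in this short sketch:

```python
import math

def phi(x):
    """Standard normal CDF: φ(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-t²/2} dt,
    computed via the identity φ(x) = (1 + erf(x/√2)) / 2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```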

TOP-n-gram
Top-n-gram [18], which includes the evolutionary information extracted from frequency profiles, can be viewed as a novel profile-based building block of proteins. The multiple sequence alignments produced by PSI-BLAST [31] are used to calculate the protein sequence frequency profiles; these are combined with the n most frequent amino acids in each amino acid frequency profile, and the frequency profiles are converted into Top-n-grams. The occurrence counts of each Top-n-gram are used to convert protein fragments into fixed-length feature vectors, which are then fed into the SVM. Several basic building blocks have been investigated as the words of the "protein sequence language," including N-grams [29,32], patterns [33], motifs [34], and binary profiles [18].

Parallel Correlation Pseudo Amino Acid Composition
Parallel correlation pseudo amino acid composition (PC-PseAAC) was also adopted to obtain additional sequence-order information for a fragment [20]. PC-PseAAC incorporates both the contiguous local sequence-order information and the global sequence-order information into the feature vector of a protein sequence. Given a protein sequence P of length L, the PC-PseAAC feature vector of P is defined as follows [25]:

x_u = f_u / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} θ_j), 1 ≤ u ≤ 20,
x_u = w θ_{u-20} / (Σ_{i=1}^{20} f_i + w Σ_{j=1}^{λ} θ_j), 21 ≤ u ≤ 20 + λ,

where f_i (i = 1, 2, ..., 20) is the normalized occurrence frequency of the 20 amino acids in the protein sequence P; the parameter λ is an integer representing the highest counted rank (or tier) of the correlation along the protein sequence; w is a weight factor ranging from 0 to 1; and θ_j (j = 1, 2, ..., λ) is the j-tier correlation factor reflecting the sequence-order correlation between all residues that are j positions apart along the chain:

θ_j = (1 / (L - j)) Σ_{i=1}^{L-j} Θ(R_i, R_{i+j}), j = 1, 2, ..., λ,

where the correlation function is given by

Θ(R_i, R_j) = (1/3) [(H_1(R_j) - H_1(R_i))² + (H_2(R_j) - H_2(R_i))² + (M(R_j) - M(R_i))²],

and H_1(R_i), H_2(R_i), and M(R_i) are the hydrophobicity value, hydrophilicity value, and side-chain mass of the amino acid R_i, respectively. Before being substituted into the correlation function, these values are subjected to a standard conversion [25], e.g.,

H_1(i) = (H_1^0(i) - mean of H_1^0) / SD of H_1^0,

and analogously for H_2 and M, where H_1^0(i) is the original hydrophobicity value of the ith amino acid, H_2^0(i) is the corresponding original hydrophilicity value, and M^0(i) is the mass of the ith amino acid's side chain.

Pearson Correlation Coefficient
As a measure of the covariability of variables, the correlation coefficient indicates the degree of association between two variables. The Pearson correlation coefficient [23,24] is used to judge how close two datasets lie to a line, measuring their linear relationship. If a positive linear correlation exists between two variables, the metric approaches 1; similarly, if the two variables have a negative linear correlation, the metric approaches -1. Let (X, Y) be a two-dimensional random vector; the Pearson correlation coefficient is defined as

ρ(X, Y) = Cov(X, Y) / (√(D(X)) √(D(Y))).

Let (X_i, Y_i) (i = 1, 2, ..., n) be a random sample of (X, Y); the sample form of the Pearson correlation coefficient can be expressed as

r = Σ_{i=1}^{n} (X_i - X̄)(Y_i - Ȳ) / √(Σ_{i=1}^{n} (X_i - X̄)² Σ_{i=1}^{n} (Y_i - Ȳ)²),

where X̄ is the mean of the X_i and Ȳ is the mean of the Y_i (i = 1, 2, ..., n). In this study, AAC was used for feature extraction on the training dataset, and the Pearson correlation coefficient values were sorted in descending order. The positive and negative training datasets were encoded with AAPC. Feature selection was performed on the 400-dimensional feature vector, and the best results were obtained when the first 320 AAPC features were selected.
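A direct Python transcription of the sample form of the coefficient:

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient r between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```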

KPCA Oversampling Technique
The KPCA oversampling technique [21] is based on the K-means clustering algorithm and principal component analysis (PCA) [35,36]. First, k initial cluster centers were formed by randomly selecting samples from the positive training dataset:

K(InitialCenter(j)), j = 1, 2, ..., k. (17)

Second, the Euclidean distance between each positive training sample and the cluster centers was used to divide the positive training samples into k clusters, where K(j) denotes the jth original positive training sample and I_j its distance to the cluster centers. Finally, the mean of all peptide fragments in each cluster was calculated and the cluster centers were adjusted accordingly; an initial cluster center was never picked twice, and the process was performed iteratively. Next, each cluster of positive training samples K = [k_1, k_2, ..., k_m1]^T was normalized by k'_i = (k_i - μ) / σ to obtain K' = [k'_1, k'_2, ..., k'_m1]^T.
Then, the covariance matrix R = (r_ij) of K' was computed, and the following equations were used to generate the synthesized positive samples added to the positive training dataset:

y_i = (y_i1, y_i2, ..., y_iDim), i = 1, 2, ..., m1, (22)
y_ij = k'_i · A(:, j), j = 1, 2, ..., Dim,

where m1 is the number of positive samples in the given cluster, Dim is the number of features in the positive samples, and A is the eigenvector matrix of the covariance matrix with columns arranged in descending order of eigenvalue. In this study, k was set to 3 and KPCA was applied to the positive samples.
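The following sketch illustrates one assumed reading of the synthesis step above (normalize a cluster, project it onto the eigenvectors of its covariance matrix sorted by descending eigenvalue, and treat the projections as synthetic positives). It is not the authors' MATLAB implementation, and the function name is ours:

```python
import numpy as np

def kpca_oversample(cluster, n_new):
    """Sketch of the KPCA synthesis step for one cluster of positive samples.
    cluster: (m1, Dim) array; returns n_new synthetic (Dim,) vectors."""
    mu = cluster.mean(axis=0)
    sigma = cluster.std(axis=0) + 1e-12       # guard against constant features
    z = (cluster - mu) / sigma                # k'_i = (k_i - mu) / sigma
    cov = np.cov(z, rowvar=False)             # covariance matrix R of K'
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]         # descending eigenvalues
    A = eigvecs[:, order]                     # eigenvector matrix A
    synth = z[:n_new] @ A                     # y_ij = k'_i . A(:, j)
    return synth
```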

Fuzzy Undersampling Method
The FUS method [22] uses the fuzzy membership function to extract the information hidden in the training samples. In this study, the FUS approach was used to reduce the number of negative training samples to keep the positive and negative datasets in balance.
First, the mean (and standard deviation) of each feature of the positive and negative training samples was calculated:

PosMean(j) = (1/PosNum) Σ_{i=1}^{PosNum} Pos(i, j), j = 1, 2, ..., Dim,
NegMean(j) = (1/NegNum) Σ_{i=1}^{NegNum} Neg(i, j), j = 1, 2, ..., Dim,

where PosNum is the number of positive training samples, NegNum is the number of negative training samples, Pos(i, j) is the value of the jth feature of the ith positive training sample, Neg(i, j) is the value of the jth feature of the ith negative training sample, and Dim is the number of features. Second, the means and standard deviations of the positive and negative samples were used to establish a membership function Fval(i, j), where Data(i, j) is the jth feature value of the ith training sample (i = 1, 2, ..., PosNum, PosNum + 1, ..., PosNum + NegNum; j = 1, 2, ..., Dim).
Finally, all positive training samples were retained, and the negative training samples with high scores were removed. The score function is defined as

F(i) = Σ_{j=1}^{Dim} Fval(i, j), i = PosNum + 1, ..., PosNum + NegNum.
Then, FUS was applied on the negative training samples, and the number of negative training samples was reduced from 2258 to 365, which could decrease the inaccuracy caused by imbalanced datasets.
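A rough sketch of the undersampling idea: score each negative by how positive-like its features are and drop the highest-scoring negatives. The membership score used here is a simplified surrogate (distance to the positive-class mean in units of its standard deviation), not the exact fuzzy membership function of [22], and the function name is ours:

```python
import numpy as np

def fuzzy_undersample(pos, neg, keep):
    """Keep the `keep` negatives that look least like the positive class.
    pos: (PosNum, Dim) positives; neg: (NegNum, Dim) negatives."""
    pos_mean = pos.mean(axis=0)
    pos_std = pos.std(axis=0) + 1e-12         # guard against constant features
    # higher score = more positive-like; such negatives are treated as noise
    scores = -np.abs((neg - pos_mean) / pos_std).sum(axis=1)
    order = np.argsort(scores)                # ascending: least positive-like first
    return neg[order[:keep]]
```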

SVM Implementation and Parameter Selection
SVM is a supervised learning method that applies statistical learning theory to classification and regression tasks in bioinformatics [15,37-43]. In this study, the LibSVM package [43,44] was used to build and evaluate the prediction model. The radial basis function (RBF) kernel K(S_i, S_j) = exp(-γ ||S_i - S_j||²) was used, and a grid search based on 10 runs of tenfold cross-validation and the jackknife test was employed to find the optimal parameters, C = 0.70711 and γ = 1.4142. Four measurements, namely sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews correlation coefficient (MCC), were used to evaluate the performance of the proposed predictor [44]:

Sn = TP / (TP + FN),
Sp = TN / (TN + FP),
Acc = (TP + TN) / (TP + FP + TN + FN),
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),

where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives, respectively.
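The four measurements can be computed directly from the confusion-matrix counts, as in this short sketch:

```python
import math

def metrics(tp, fp, tn, fn):
    """Sn, Sp, Acc, and MCC from the confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc
```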

Conclusions
In this study, the Pearson correlation coefficient was used to evaluate the relevance between any two amino acids, which was then used to obtain an optimal combination of amino acid pairs. The AAPC (320) information was combined with TOP-n-gram, ANBPB, and PC-PseAAC to construct multiple feature vectors. Moreover, KPCA and FUS were used to resolve the imbalance between the positive and negative datasets. Finally, tenfold cross-validation, the jackknife test, and an independent test were used to verify the proposed model. The prediction results showed that the proposed method was more accurate than previous methods in predicting pupylation sites. However, this study only considered the content of amino acids and neglected the location of amino acids in pupylated proteins. The amino acid environment, such as surface areas and buried residues, should be considered in future studies to explore the effects of amino acids at different positions. It should be pointed out that user-friendly and publicly accessible web servers significantly enhance the impact of prediction methods; we shall therefore make efforts in future work to provide a web server for the method presented in this paper. The MATLAB (matrix laboratory) package of our prediction method is available as supplementary material.