PUP-Fuse: Prediction of Protein Pupylation Sites by Integrating Multiple Sequence Representations

Pupylation is a type of reversible post-translational modification of proteins, which plays a key role in the cellular function of microbial organisms. Several proteomics methods have been developed for the prediction and analysis of pupylated proteins and pupylation sites. However, the traditional experimental methods are laborious and time-consuming. Hence, computational algorithms are highly needed that can predict potential pupylation sites using sequence features. In this research, a new prediction model, PUP-Fuse, has been developed for pupylation site prediction by integrating multiple sequence representations. Meanwhile, we explored the five types of feature encoding approaches and three machine learning (ML) algorithms. In the final model, we integrated the successive ML scores using a linear regression model. The PUP-Fuse achieved a Mathew correlation value of 0.768 by a 10-fold cross-validation test. It also outperformed existing predictors in an independent test. The web server of the PUP-Fuse with curated datasets is freely available.


Introduction
Pupylation is a type of prokaryotic ubiquitin-like protein (Pup), which contributes to many cellular processes [1,2]. The Pup process connects the lysine residue with isopeptide bonds, called pupylation, which plays an important role in controlling signal transduction and protein degradation in prokaryotic cells [3,4]. Pup proteins tag intrinsically disordered and misfolded proteins to be degraded [3,5]. While pupylation and ubiquitylation are analogs in terms of function, they have different enzymologies [6]. Unlike ubiquitylation, pupylation involves two types of enzymes: deamidase of Pup (DOP) and proteasome accessory factor A (PafA) [3,[7][8][9][10]. On the other hand, enzymes of pupylation are initiated from microbial species and exhibit no homology to ubiquitylation enzymes [7,11].
To know the molecular mechanisms of pupylation, it is necessary to define the substrates of pupylation and its sites precisely. Typically, this task is labor-intensive and time-consuming because of the large-scale analysis of proteomics; thus, a few computational methods of predicting pupylation sites have been proposed [12][13][14][15][16]. Liu et al. first developed the GPS-PUP predictor for the prediction of pupylation sites by the group-based prediction system (GPS) method [17]. Tung et al. presented the iPUP predictor that implemented the support vector machine (SVM) algorithm with the composition of pairs of k-space amino acids (CKSAAP) [18]. Chen et al. developed the PupPred predictor based on SVM [19], where the pairing of amino acids was used to encode the lysine-centered peptides. Recently, Hasan et al. proposed a web server, called pbPUP, to predict the pupylation sites using the profile-based features [20]. The predictors of GPS-PUP, iPUP, and pbPUP showed reasonable performance of population site prediction. However, when they were given higher specificity, their sensitivity score was low.
In this study, we developed the PUP-Fuse as a machine learning (ML)-based predictor, as shown in Figure 1. In brief, we employed the PupDB database [21] to compile the positive and negative samples with a full sequence, encoded the sequence windows into numerical feature vectors by using multiple sequence encoding schemes, selected informative features, and inputted them to ML models. The PUP-Fuse integrated the multiple ML scores generated by the different, single encoding-employing ML methods to enhance the prediction performance.

Performance Results on Training Dataset
We have developed the five single encoding-employing random forest (RF) models and the combined model (PUP-Fuse) by linearly combining them. The PUP-Fuse optimized the five weight coefficients for the AAI, Binary, tripeptide composition (TPC), Profile-Based Composition of K-Spaced Amino Acid Pairs (pbCKSAAP), and CKSAAP-employing RF models as 0.01, 0.1, 0.3, and 0.3, 0.2, respectively. We evaluated the Sens, Spec, Acc, MCC, AUC values of the single encoding-employing RF models and the PUP-Fuse without any feature selection by 10-fold CV test, as shown in Table 1. The ROC and corresponding auPRC curves are shown in Figure 3A,C. The five measures of the PUP-Fuse were higher than those of any single encoding-employing RF model. The PUP-Fuse achieved a very high AUC of 0.913 and outperformed all the single encoding-employing models, which significantly outperformed all the single encoding-employing models with a two-sample t-test at p-value < 0.05 (Table 1). The PUP-Fuse predictor was developed based on the PupDB database, where a positive-to-negative ratio of ~1:12 was highly imbalanced. Since the prediction accuracy of

Performance Results on Training Dataset
We have developed the five single encoding-employing random forest (RF) models and the combined model (PUP-Fuse) by linearly combining them. The PUP-Fuse optimized the five weight coefficients for the AAI, Binary, tripeptide composition (TPC), Profile-Based Composition of K-Spaced Amino Acid Pairs (pbCKSAAP), and CKSAAP-employing RF models as 0.01, 0.1, 0.3, and 0.3, 0.2, respectively. We evaluated the Sens, Spec, Acc, MCC, AUC values of the single encoding-employing RF models and the PUP-Fuse without any feature selection by 10-fold CV test, as shown in Table 1. The ROC and corresponding auPRC curves are shown in Figure 3A,C. The five measures of the PUP-Fuse were higher than those of any single encoding-employing RF model. The PUP-Fuse achieved a very high AUC of 0.913 and outperformed all the single encoding-employing models, which significantly outperformed all the single encoding-employing models with a two-sample t-test at p-value < 0.05 (Table 1). To investigate the validity of a high cutoff similarity of 80%, employed by CD-HIT [25], we compared the prediction performance with a cutoff of 80% to that with an ordinary cutoff of 30%. A cutoff of 30% produced the training dataset that contained 129 proteins with 141 pupylation and 141 non-pupylation sites (with a 1:1 positive-to-negative ratio). The overall performance of PUP-Fuse with a cutoff similarity of 30% a little decreased (AUC = 0.903) (Table S1) by the 10-fold cross-validation, but a cutoff similarity of 80% presented almost similar performance with a cutoff similarity of 30%.  The PUP-Fuse predictor was developed based on the PupDB database, where a positive-to-negative ratio of~1:12 was highly imbalanced. Since the prediction accuracy of ML algorithms is seriously impaired by such unbalanced datasets [19,20], many site predictors of PTM use a fairly balanced ratio of positive to negative samples to train classification models [20,23,24]. On the training dataset, we compared the prediction performance (AUC) between 1:1, 1:2, and 1:all ratios of positive to negative samples ( Figure S1). Since a 1:1 ratio provided a higher AUC value, we determined a 1:1 ratio as the optimal one. The size of the window is also an important factor to discriminate the positive sites from the negative ones. Based on AUC values, the window size was searched from 25 to 61 ( Figure S2). An optimal window size of 57 was obtained because the AUC increasing rate from 45 to 57 was very low.
To investigate the validity of a high cutoff similarity of 80%, employed by CD-HIT [25], we compared the prediction performance with a cutoff of 80% to that with an ordinary cutoff of 30%. A cutoff of 30% produced the training dataset that contained 129 proteins with 141 pupylation and 141 non-pupylation sites (with a 1:1 positive-to-negative ratio). The overall performance of PUP-Fuse with a cutoff similarity of 30% a little decreased (AUC = 0.903) (Table S1) by the 10-fold cross-validation, but a cutoff similarity of 80% presented almost similar performance with a cutoff similarity of 30%.

Performance Optimization by Chi-Square Test
The proposed method describes some sniping sequence patterns for a pupylation site in a comprehensive way, while it results in a high-dimensional vector. Some redundant or irrelevant attributes may be present that affect accuracy reduction. Thus, we selected the informative features out of many features using a well-established chi-squared test. For each employed scheme, different feature subsets were selected, which contained the top-ranked features ranging from the top 20 to the top 500 with an interval of 20. All these curated feature subcategories were inputted to RF separately, and their respective performances were evaluated using 10-fold cross-validation ( Figure S3). To end, the feature subset that reached the highest AUC was selected as the optimal one. In this approach, we selected the 260-, 100-, 200-, 240-, and 350-dimensional features from pbCKSAAP, AAI, Binary, CKSAAP, and TPC encodings, respectively. In the PUP-Fuse, the weight coefficients for the AAI, Binary, TPC, and pbCKSAAP-employing RF models were optimized as 0.1, 0.1, 0.3, 0.3, and 0.2, respectively. As shown in Figure 3B,D, the PUP-Fuse with the chi-squared test achieved higher AUC and auPRC values than that without the feature selection. The PUP-Fuse with feature selection achieved an accuracy of 88.4% (Sn = 88.1% and MCC = 0.768) at a specificity control of 88.1% on the training data ( Table 2). The PUP-Fuse reached a remarkable AUC value of 0.956, which significantly outperformed all the singleemploying-based models with a two-sample t-test at the level of p-value < 0.05 (Table 2).

Comparison among Different ML Methods on Training Dataset
Selecting an optimal ML method is an essential step. Therefore, to verify the effectiveness and superiority of the RF algorithm employed by the PUP-Fuse, we compared it with the KNN and SVM algorithms on the same training dataset by a 10-fold CV test. In order to make a fair comparison, the KNN and SVM models implemented the same encoding schemes as the PUP-Fuse. As shown in Figure 4, the RF model yielded a higher AUC than the other two ML models, which was approximately 2-5% higher than the AUCs of the other models.

Comparison of PUP-Fuse with Existing Methods on Independent Dataset
Several computational methods had been proposed for the prediction of pupylation sites. In order to compare the PUP-Fuse with the four existing methods (GPS-PUP, iPUP, PUPS, and PbPUP), an independent dataset of 86 pupylation sites from 71 pupylated proteins and 1136 non-pupylation putative sites was used. Even though the PUP-Fuse and these existing methods did not use the same training dataset, we used the same independent dataset for a fair comparison of performances. We submitted the independent dataset directly to the web servers to obtain the prediction performances. The PUP-Fuse achieved the highest performance, as shown in Table 3, with a Sens of 0.59, a Spec of 0.91, an Acc of 0.82, and an MCC of 0.55. The PUP-Fuse provided 10-20% higher MCC than the other existing models, demonstrating the superiority of the PUP-Fuse over the existing predictors. The superiority of PUP-Fuse could result from a linear combination of the five ML probability scores evaluated by the five different encodings. Note that all the encodings contribute to prediction performance.

Comparison of PUP-Fuse with Existing Methods on Independent Dataset
Several computational methods had been proposed for the prediction of pupylation sites. In order to compare the PUP-Fuse with the four existing methods (GPS-PUP, iPUP, PUPS, and PbPUP), an independent dataset of 86 pupylation sites from 71 pupylated proteins and 1136 non-pupylation putative sites was used. Even though the PUP-Fuse and these existing methods did not use the same training dataset, we used the same independent dataset for a fair comparison of performances. We submitted the independent dataset directly to the web servers to obtain the prediction performances. The PUP-Fuse achieved the highest performance, as shown in Table 3, with a Sens of 0.59, a Spec of 0.91, an Acc of 0.82, and an MCC of 0.55. The PUP-Fuse provided 10-20% higher MCC than the other existing models, demonstrating the superiority of the PUP-Fuse over the existing predictors. The superiority of PUP-Fuse could result from a linear combination of the five ML probability scores evaluated by the five different encodings. Note that all the encodings contribute to prediction performance. The threshold values of iPUP, GPS-PUP, PUPS, and pbPUP were set to show high specificity (90%) in their corresponding webservers.

Data Collection and Processing
The datasets were retrieved and taken from the publication of the PupDB database [21]. The experimentally identified lysine pupylation sites were treated as positive samples, while all existing lysine residues that were not experimentally confirmed as the sites of pupylation in those proteins were treated as non-pupylation sites or negative samples. After deleting 80% similar sequences using CD-HIT [25], we preserved 233 pupylated proteins with 273 positive and 3280 negative sites. In the PupDB dataset, the ratio of the positive to negative samples (~1:12) is very unbalanced, which would obstruct the training model. Thus, a balanced dataset with a positive-to-negative ratio of 1:1 (186 of positive sites and negative 186 sites) was composed by randomly excluding the negative samples. The independent dataset consisting of 87 experimentally verified pupylation sites and 191 putative non-pupylation sites was randomly extracted from the dataset to test the various predictors. The curated datasets are summarized in Table 4. The pbCKSAAP method is widely investigated in previous studies [20,[26][27][28]. The k-spaced residue pair could be defined as p a {k} p b (a, b = 1, 2, . . . , 20), where p a and p b show two residues of 20 types of amino acids. While k = 0, p a {k} p b represents a dipeptide and considers a number of 400 (= 20 × 20) dipeptides. In this study k = 0, 1, 2, 3, 4 were considered (i.e., k max = 4). Accordingly, the feature vector from each positive/negative sample has a dimension of 200 (= 400 × 5 ). In this process, PSI-BLAST searched each protein sequence to produce a profile (i.e., PSSM matrix) with respect to the NCBI NR90 database (December 2010 version). For the inclusion of new sequences, the iteration time and e-value limit were set, respectively, to 3 and 1.0 × 10 −4 .
If residue pair p a {k} p b performs between the positions t and t + k + 1 in the PSSM matrix, the frequency scores could be generated as follows: where PSSM(t, p a ) denotes the amino acid score p a at the t th of PSSM in a row position, PSSM (t + k + 1, p b ) exists for an amino acid score of p b at (t + k + 1) th of PSSM in a row position. The pupylation/non-pupylation site appears N times. Moreover, we normalize S a,b using the formulation below: where L stands for the total sequence fragment length, i.e., the size of a window is L. We have used the pbCKSAAP encoding scheme to create a 2000-dimensional feature vector for any positive/negative sample.

CKSAAP Encoding
The CKSAAP encoding is widely used for representing sequence motifs [26,27,29]. If a sequence fragment is composed of 57 windows and 20 types of residues, it contains 400 (=20 × 20) types of residue pairs (i.e., AA, AC, AD, . . . ) for every single k, where k signifies the space between two amino acids. In this work, the optimal k max was set to 4 to generate 2000 (=20 × (k max + 1) × 20)-dimensional feature vectors for a single sequence.

Binary Encoding
Twenty types of amino acids can encode with the sequence window to generate the binary feature vectors [30,31]. By binary encoding, a 1140 (=20 × 57)-dimensional feature vector was calculated for a window sequence.

AAI Encoding
The AAI encoding scheme uses the amino acid properties [32]. We selected the top 15 instructive amino acid indices after assessing different physicochemical and biological properties of amino acids (Table S2). The AAI encoding generates 855 (=57 × 15)-dimensional feature vectors.

Feature Selection
Feature selection is a key step to eliminate unrelated features and to improve predictive performance. All the features are not equally important, or even some of them are noisy and have adverse effects on performance [15,20]. We used ChiSquaredAttributeEval and Ranker evaluation tools of WEKA [33] to select the features that are relevant to pupylation sites.
The chi-squared (χ 2 ) test is a standard statistical test that analyzes the variance of the expected distribution, assuming that the presence of a given function is independent of the class value. Details in the χ 2 feature selection process can be found elsewhere [20].

Classification Method
Random forest algorithms are based on the classification and regression trees (CART) techniques [26,31,34]. It raises numerous trees of classification or regression that are called "forests". Each tree is constructed using a deterministic algorithm, and due to two factors, the trees are different. The best separation is initially chosen from a random subset of the predictors at each node. In addition, a bootstrap observation sample is used to construct each tree. The overall prediction is then calculated as the average of all the trees.
We used KNN and SVM to compare with the RF classifier employed by PUP-Fuse. SVM is being widely used in protein bioinformatics [35][36][37]. For making a proper binary prediction, a kernel radial basis function (RBF) with the LIBSVM 2019 package (http: //www.csie.ntu.edu.tw/~cjlin/libsvm/ (accessed date: 11 September 2019)) was applied to the training and independent datasets [38]. For tuning parameters, C and γ were maximized based on the training dataset by using the LIBSVM grid search strategy. The grid search strategy was carried based on 10-fold cross-validation tests to find the optimal C and γ ∈ [2 −7 , 2 −6 , . . . , 2 8 ]. The KNN is a supervised ML algorithm that solves both classification and regression problems. We used the KNN algorithm of the R package to classify positive and negative samples at (https://cran.r-project.org (accessed date: 11 September 2019)).

Feature Integration
Features were typically integrated to improve the prediction performance. We linearly combined the ML scores evaluated by the five encoding schemes: AAI, Binary, TPC, CKSAAP, and pbCKSAAP, with the formula as follows:

of 12
where Cl is the linear combination of the ML scores, w i and s i are the weight coefficient and score for each encoding scheme i. The total of w i equals to 1. The linear combination model of the ML scores calculated by the five encoding schemes is named the PUP-Fuse. The above feature integration model is widely used in different bioinformatics tasks [27,[39][40][41].

Conclusions
The PUP-Fuse was developed for better prediction of pupylation sites. The PUP-Fuse was the RF model that integrated the five types of encoding schemes to consider various sequence patterns around protein pupylation sites. Then, the chi-squared test was used as the feature selection method. Performance evaluated by the training and independent tests clearly demonstrated the advantage of the PUP-Fuse over the existing models. The performances of PUP-Fuse are assessed based on the independent test dataset and compared with other existing methods, concluding that the predictive performance of PUP-Fuse is better than other existing methods. The comparison between different classifiers shows that the chi-squared feature selection algorithm optimized the curated feature vectors and RF-based model superior to other classifiers in predicting pupylation sites. Additionally, we found that the integration of successive ML scores by using a linear regression model advances the prediction performances. The web implementation of the PUP-Fuse with curated datasets are freely available for users at http://kurata14.bio. kyutech.ac.jp/PUP-Fuse/ (accessed date: 11 September 2019).
Supplementary Materials: Supplementary materials can be found at https://www.mdpi.com/14 22-0067/22/4/2120/s1. Figure S1: Comparison of 1:1, 1:2, and 1:all ratios of positive-to-negative samples on training dataset; Figure S2: AUC values for different window sizes based on 10-fold cross-validation tests; Figure S3: AUC value with respect to selected features for the five encoding schemes; Table S1: Prediction performance after the removal of 30% sequence redundancy on the training dataset; Table S2: Fifteen types of AAI properties used in this study. Funding: This work was supported by the Grant-in-Aid for Scientific Research (B) (19H04208) and partially supported by the Grant-in-Aid for JSPS Research Fellow (19F19377) from Japan Society for the Promotion of Science (JSPS) and by the developing key technologies for discovering and manufacturing pharmaceuticals used for next-generation treatments and diagnoses both from the Ministry of Economy, Trade and Industry, Japan (METI) and from Japan Agency for Medical Research and Development (AMED).

Conflicts of Interest:
The authors declare that they have no conflicts of interest.