Prediction of Protein Subcellular Localization Based on Fusion of Multi-view Features

The prediction of protein subcellular localization is critical for inferring protein functions, gene regulation and protein-protein interactions. With advances in high-throughput sequencing technologies and proteomic methods, the protein sequences of numerous yeasts have become publicly available, which enables us to computationally predict yeast protein subcellular localization. However, widely used protein sequence representation techniques, such as amino acid composition and Chou's pseudo amino acid composition (PseAAC), have difficulty extracting adequate information about the interactions between residues and the position distribution of each residue. Therefore, novel sequence representations are still urgently needed. In this study, we present two novel protein sequence representation techniques: Generalized Chaos Game Representation (GCGR), based on the frequency and distribution of the residues in the protein primary sequence, and novel statistics and information theory (NSI), reflecting local position information of the sequence. In the GCGR + NSI representation, a protein primary sequence is represented by a 5-dimensional feature vector, while other popular methods such as PseAAC and dipeptide composition adopt features with hundreds of dimensions. In practice, the feature representation is highly efficient in predicting protein subcellular localization. Even without machine learning-based classifiers, a simple model based on the feature vector achieves prediction accuracies of 0.8825 and 0.7736 on the CL317 and ZW225 datasets, respectively. To further evaluate the effectiveness of the proposed encoding schemes, we introduce a multi-view features-based method that combines the two above-mentioned features with other well-known features, including PseAAC and dipeptide composition, and uses a support vector machine as the classifier to predict protein subcellular localization. This novel model achieves prediction accuracies of 0.927 and 0.871 on the CL317 and ZW225 datasets, respectively, better than other existing methods in the jackknife tests. The results suggest that the GCGR and NSI features are useful complements to popular protein sequence representations in predicting yeast protein subcellular localization. Finally, we validate a few newly predicted protein subcellular localizations with evidence from articles published in authoritative journals and books.


Introduction
Assigning subcellular localizations to a protein is a significant step in elucidating its interaction partners, functions and potential roles in the cellular machinery [1,2]. However, experimental methods to determine subcellular localization usually involve immunolabelling or tagging, which could be [...] We tested the performance of the proposed features and models on two yeast benchmark datasets and compared them with a few popular methods using the jackknife test.

Results and Discussions
Tables 1 and 2 list the prediction results of the proposed model and other existing models for the jackknife test on CL317 and ZW225, respectively. As can be seen, our model achieved overall prediction accuracies of 0.8825 and 0.7736 on CL317 and ZW225, respectively. Its performance on CL317 exceeds that of some existing methods, such as Wei et al. [15] (accuracy 0.827) and Zhang et al. [24] (accuracy 0.88). The improvement is notable considering that we used only a 2-D GCGR feature and a 3-D NSI feature, while other methods combined features such as the 20-D amino acid composition and the 400-D dipeptide composition. We further tested the performance of combining GCGR and NSI with other widely recognized features, including pseudo-amino acid composition (PwAAC) and dipeptide composition (Dipeptide). Specifically, we applied three models to the CL317 dataset: (1) PwAAC alone, (2) fusion of PwAAC and Dipeptide, and (3) fusion of PwAAC, Dipeptide, GCGR and NSI. Their prediction results for the jackknife test are summarized in Figure 1. As Figure 1 depicts, the model combining all features achieved much higher prediction accuracy than the others, indicating that: (1) feature fusion techniques are promising for improving prediction accuracy, since a single-view feature can only reflect part of the information in a protein sequence; (2) the two features GCGR and NSI can serve as a helpful complement to features like PwAAC and Dipeptide, which also shows the effectiveness of the two novel feature representation techniques.
To further evaluate the efficiency of the feature fusion technique and improve protein subcellular location prediction accuracy, we introduced the final multiple-views based model, in which the feature vector for each protein is formed by concatenating the numerical vectors from GCGR, NSI, PwAAC and Dipeptide. SVM was selected as the classifier. Comparisons with other existing models using the jackknife test on CL317 and ZW225 are shown in Tables 3 and 4, respectively. As can be seen, our integration model achieved the highest overall accuracies, that is, 0.921 and 0.889 on CL317 and ZW225, respectively. This indicates two things: (1) the proposed protein sequence feature representations GCGR and NSI both contain valuable information, such as concentrated local information, that was not covered by previous features; (2) the integration of multiple informative features may improve prediction performance. In addition, our model achieved the highest MCCs for most subcellular location classes on CL317, except for Cy and Me. Moreover, the MCCs and accuracies of the classes Nu and Mi for our model are much better than those of other existing methods on ZW225.
Finally, we searched authoritative journals and publications for further validation of the predicted subcellular locations of some proteins, and found that some of them have already been validated by experiments. For example, we predicted that the protein YHR196W localizes to the nucleolus, which has been reported by more than 20 publications such as Eswara et al. [26] and Polymenis et al. [27]. We also predicted Sec17p to be localized in the cytoplasm and the endoplasmic reticulum, consistent with Aouida et al. [28]. Thus, our model is effective in screening out potential protein subcellular locations for further experimental validation.
To summarize: first, the new simple unitary distance-based method is comparable to many methods in prediction accuracy; second, the proposed new perspectives (GCGR and NSI) contain valuable information from the protein primary sequence and can serve as a complement to the existing feature representations; third, the multi-feature based model improves the prediction accuracy notably and can thus be used to help biologists determine protein subcellular location.
However, we are fully aware that this study has several limitations. First, we only used the averages of the x- and y-coordinates of the points in the GCGR plot, which may capture only part of the information in the plot. An immediate option is to try other statistics of the GCGR plot, such as the median and percentiles. Second, the biological interpretation of why the features are effective is not fully clear. Third, the current version of the software is not very user-friendly. In the future, we will work on offering an online web service so that more biologists can use the software. We will also try to use parallel algorithms to deal with large-scale eukaryote species, including human data.

Datasets
In this paper, two yeast datasets, CL317 and ZW225, are used for comparing different prediction models. The CL317 dataset was collected by Chen and Li [18]. The original 846 proteins, each explicitly annotated to one subcellular location, were derived from SWISSPROT (version 49.0) by the European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom (www.ebi.ac.uk/swissprot) [25]. Since short sequences are more likely to be homologous and it is difficult to extract enough information from them, we removed proteins with fewer than 80 residues, following Chen and Li [18]. The remaining dataset contains 317 apoptosis proteins belonging to six subcellular locations: cytoplasmic (Cy), membrane (Me), nuclear (Nu), endoplasmic reticulum (En), mitochondrial (Mi) and secreted (Se), with 112, 55, 52, 47, 34 and 17 proteins, respectively. ZW225 was curated by Zhang and Wang [24] and includes 225 proteins in four subcellular locations: 89 membrane proteins, 70 cytoplasmic proteins, 41 nuclear proteins and 25 mitochondrial proteins. These proteins were extracted from SWISSPROT (version 50.3) by the European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom, using the same rules as for CL317.

Generalized Chaos Game Representation (GCGR) of Protein Primary Sequences
The chaos game representation (CGR) was initially introduced to visualize DNA sequences [29] and was later applied to protein sequences as well [30]. Here, we further developed a generalized chaos game representation (GCGR) to represent a protein sequence by a 2-dimensional numerical feature vector describing the frequency of the 20 amino acids and their neighbor information in the sequence. The construction of GCGR consists of three steps.

Step 1: Convert a Protein Sequence into a Sequence on an Alphabet of Size 6
We converted the 20 amino acids into six groups (Table 5). Specifically, Proline (P), Glycine (G) and Cysteine (C) formed three separate groups because of their unique backbone properties. The remaining 17 amino acids were classified into the other three groups according to their hydropathy scale: strongly hydrophilic (denoted by H), strongly hydrophobic (L), and weakly hydrophilic or weakly hydrophobic (S) [31]. As a result, each primary protein sequence can be uniquely represented by a string on the alphabet {H, L, S, P, G, C}. For example, the protein sequence "YAMQESHFTCI" is represented by "SLLHHSHLSCL" according to Table 5.

In the second step, we first drew a regular hexagon in which each vertex is associated with a distinct label among H, L, S, P, G and C, and each edge is of unit length. Then, for each primary sequence encoded in the first step, we plotted its letters sequentially as points inside the hexagon as follows: the first point, corresponding to the first letter of the primary sequence, was placed at the center of the hexagon; and the i-th point, corresponding to the i-th letter, was placed at the midpoint between the (i-1)-th point and the hexagon vertex labelled with the i-th letter. The resulting plot is called the GCGR of the primary sequence. As examples, we plotted in Figure 2 the GCGRs of six representative proteins, each belonging to a different subcellular location.
From the six GCGR figures, we can directly retrieve some valuable information: for proteins in the Cy and Nu classes, the plotted points are close to vertices H and L; the points of the protein in the Me class are uniformly distributed around all the vertices except for C; proteins in the Nu and En classes have fewer points around vertices G, C and P, G, respectively; the points of proteins in the last two classes are almost uniformly distributed. In short, proteins in different subcellular locations are distributed differently in the GCGR plots.
Molecules 2019, 24, x FOR PEER REVIEW

Step 3: Convert Each Protein Sequence into a 2-D Vector according to its GCGR Plot
As can be seen from Figure 2, each letter in the protein sequence corresponds to an (x, y)-coordinate in the GCGR plot. We then modelled the GCGR plot as a combination of two series: one composed of the x-coordinates and the other composed of the y-coordinates, named the x-series and y-series, respectively. Each panel in Figure 2 thus gives rise to two time series. Several useful observations can be made from Figures 3 and 4: (1) The average values of the x-series and y-series for proteins in the En class, denoted as $\bar{x}$ and $\bar{y}$ respectively, tend to be greater than those for proteins in the other classes; (2) Proteins in the first class Cy also have a large $\bar{x}$, but do not have a large $\bar{y}$; (3) Proteins in the last two classes Mi and Se have moderate $\bar{x}$ and $\bar{y}$; however, unlike proteins in the Se class, those in the Mi class have a greater $\bar{y}$ than $\bar{x}$; (4) $\bar{y}$ of proteins in the second class Me is the smallest among all classes. In summary, $\bar{x}$ and $\bar{y}$ are two effective numerical features for identifying the subcellular location of proteins. This is not surprising, since $\bar{x}$ and $\bar{y}$ contain not only information about amino acid frequencies but also about their order in a protein sequence. For a better view, we also drew in Figure 5 the boxplots of $\bar{x}$ and $\bar{y}$ for proteins in each of the six classes in CL317.
Theoretically, a class with a narrow variation range and fewer outliers can be discriminated more robustly. As can be seen in Figure 5, the proteins in the Nu class have substantially narrower variation, with $\bar{x}$ ranging approximately from 1.22 to 1.26 and $\bar{y}$ ranging approximately from 1.23 to 1.28. For En, though $\bar{x}$ is widely distributed, $\bar{y}$ is more concentrated, which can be used to differentiate this class. Similarly for Se, though $\bar{y}$ is widely distributed, $\bar{x}$ is more concentrated. Finally, it is of note that all six classes have different medians and relatively distinguishable variation ranges for both $\bar{x}$ and $\bar{y}$. Therefore, by combining $\bar{x}$ and $\bar{y}$, it is possible to predict the localization of most proteins.
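As an illustration of the second and third steps, the following minimal Python sketch builds the GCGR points of an already encoded sequence and averages them. The hexagon placement (centred at the origin, vertex H at angle 0, labels counter-clockwise in the order H, L, S, P, G, C) and the function names gcgr_points and gcgr_feature are assumptions made for illustration; the text does not fix them, so the absolute coordinates produced here need not match those in the figures.

```python
import math

# ASSUMPTION: unit-edge regular hexagon centred at the origin (circumradius 1),
# with labels assigned counter-clockwise starting from angle 0.
LABELS = "HLSPGC"
VERTICES = {
    label: (math.cos(math.pi / 3 * k), math.sin(math.pi / 3 * k))
    for k, label in enumerate(LABELS)
}


def gcgr_points(encoded):
    """Return the GCGR points of a string over the alphabet {H, L, S, P, G, C}."""
    if not encoded:
        raise ValueError("encoded sequence must not be empty")
    points = []
    current = (0.0, 0.0)  # the first point sits at the centre of the hexagon
    for i, letter in enumerate(encoded):
        if letter not in VERTICES:
            raise ValueError(f"unexpected letter: {letter!r}")
        if i > 0:
            vx, vy = VERTICES[letter]
            # midpoint between the previous point and the vertex of this letter
            current = ((current[0] + vx) / 2, (current[1] + vy) / 2)
        points.append(current)
    return points


def gcgr_feature(encoded):
    """Return (mean of the x-series, mean of the y-series)."""
    points = gcgr_points(encoded)
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


points = gcgr_points("SLLHHSHLSCL")  # encoded example from Step 1
feature = gcgr_feature("SLLHHSHLSCL")
```

Because every new point is a midpoint between an interior point and a vertex, all points stay inside the hexagon, and the two averages summarize where the trajectory spends most of its time.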

Novel Statistics and Information Theory (NSI) of Protein Primary Sequences
In order to acquire more position information from the primary sequence, we presented a novel statistics and information theory based method to extract features from the protein primary sequence. Different from the previous section, we classified the 20 amino acids into just three groups in view of their hydropathy profiles [21]: (1) the internal group, in which the residues tend to appear on the inner side of the protein spatial structure, (2) the external group, in which the residues tend to occur at the surface, and (3) the ambivalent group, in which the residues do not have fixed common positions. Then, a protein sequence can be transformed into a 3-letter string by a mapping F, where P(j) represents the j-th letter in the protein primary sequence P, and F(P(j)) denotes the encoded letter for P(j). For example, given a protein sequence P = YAMQESHFTCI, its encoded sequence is F(P) = SSFDDSDFSSF.
After that, we calculated the position features of the encoded sequence to represent its local information. Specifically, let W(k) (k ∈ {F, D, S}) be the position sequence of a given amino acid k. We calculated the intervals between consecutive positions in W(k), which form a new numerical distance sequence denoted by N(k). N(k) thus contains the positional and distribution information of the given amino acid k in the primary sequence. For instance, for the encoded amino acid sequence above, the position sequences for the reduced amino acids F, D, and S are: W(F) = (3, 8, 11), W(D) = (4, 5, 7), W(S) = (1, 2, 6, 9, 10). Then the symbolic sequences are cyclic and their numerical sequences N(k) are: N(F) = (5, 3), N(D) = (1, 2), N(S) = (1, 4, 3, 1). The numerical sequence N(k) provides a new profile to characterize correlated residues of the given sequence. In fact, the interval distance between two occurrences of k can be denoted by a random variable $x$. We calculated the probability $p_k(x)$ of the variable $x$ and obtained its distribution function. Based on probability theory, we can further calculate the mean value $E_k(x) = \sum_x x \, p_k(x)$ and the variance $D_k(x) = \sum_x (x - E_k(x))^2 \, p_k(x)$. Then, we defined the positional information $I^{(k)}(x)$, a pivotal statistic for comparing the degree of variation from one data set to another. Finally, the values $I^{(k)}(x)$ $(k \in \{F, D, S\})$ characterize the positional information of the three encoding letters and thus form a novel feature vector of a protein primary sequence.
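The worked example above can be reproduced with a short Python sketch. The function names are hypothetical, and the last function is explicitly an assumption: the defining formula of the positional information is not reproduced in this text, so the coefficient of variation (standard deviation divided by the mean) is used here only as a plausible stand-in for a statistic that compares degrees of variation.

```python
from collections import Counter
from math import sqrt


def position_sequences(encoded):
    """Map each letter to the 1-based positions at which it occurs, i.e. W(k)."""
    positions = {}
    for index, letter in enumerate(encoded, start=1):
        positions.setdefault(letter, []).append(index)
    return positions


def interval_sequence(positions):
    """Distances between consecutive occurrences, e.g. (3, 8, 11) -> [5, 3]."""
    return [b - a for a, b in zip(positions, positions[1:])]


def interval_statistics(intervals):
    """Mean and variance of the interval length x under its empirical p_k(x)."""
    total = len(intervals)
    probability = {x: count / total for x, count in Counter(intervals).items()}
    mean = sum(x * p for x, p in probability.items())
    variance = sum((x - mean) ** 2 * p for x, p in probability.items())
    return mean, variance


def positional_information(intervals):
    """ASSUMPTION: sqrt(D_k) / E_k stands in for the paper's I^(k)(x)."""
    mean, variance = interval_statistics(intervals)
    return sqrt(variance) / mean


encoded = "SSFDDSDFSSF"
W = position_sequences(encoded)
N = {k: interval_sequence(W[k]) for k in "FDS"}
```

A letter needs at least two occurrences for its interval sequence to be non-empty; a real implementation would have to decide how to handle the remaining cases.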

Unitary Distance
In this article, rather than the serial combination method, which concatenates different feature vectors into a super-vector, we used the parallel combination method, which combines two feature vectors $u$ and $v$ into a complex vector $u + iv$ [32], where $i$ is the imaginary unit. Here, we defined the parallel combined feature space as $C = \{u + iv \mid u \in A, v \in B\}$. Thus $C$ is an $m$-dimensional complex vector space, where $m = \max(\dim A, \dim B)$. The inner product of two vectors in the complex space is given by $(a, b) = a^{H} b$, where $a, b \in C$ and $H$ denotes the conjugate transpose. The complex vector space equipped with this inner product is usually called a unitary space. The norm in unitary space is given by $\|z\| = \sqrt{z^{H} z} = \sqrt{\sum_{j=1}^{n} (a_j^2 + b_j^2)}$, where $z = (a_1 + i \cdot b_1, \cdots, a_n + i \cdot b_n)^{T}$. Then, the unitary distance between two complex vectors $z_1$ and $z_2$ is calculated by $\|z_1 - z_2\| = \sqrt{(z_1 - z_2)^{H} (z_1 - z_2)}$.
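A minimal sketch of the unitary norm and distance, using Python's built-in complex numbers, may make the definition concrete. The function names unitary_norm and unitary_distance are illustrative and do not come from the authors' software.

```python
from math import sqrt


def unitary_norm(z):
    """Norm induced by the inner product (a, b) = a^H b, i.e. sqrt(z^H z)."""
    return sqrt(sum((value.conjugate() * value).real for value in z))


def unitary_distance(z1, z2):
    """Unitary norm of the difference of two equal-length complex vectors."""
    if len(z1) != len(z2):
        raise ValueError("vectors must have the same dimension")
    return unitary_norm([a - b for a, b in zip(z1, z2)])


z1 = [complex(1.0, 2.0), complex(0.0, 0.0)]
z2 = [complex(0.0, 0.0), complex(0.0, 0.0)]
distance = unitary_distance(z1, z2)  # sqrt(1^2 + 2^2)
```

Since each term is the squared modulus of a component difference, this is the Euclidean distance on the stacked real and imaginary parts, written in complex form.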

Performance Assessment
To evaluate the effectiveness of the two proposed features GCGR and NSI, we first introduced a simple model for fast prediction of protein subcellular location, described as follows: for a given protein, its numeric features from GCGR and NSI are first extracted. Then, these two features are combined in parallel and classified by a classifier-free prediction model. As is well acknowledged, all prediction models in the real number space can be extended to the complex number space by using different similarity measures. For example, the Euclidean distance is a commonly used similarity metric in the real space, while the unitary distance, which was adopted in this paper, is often used in the complex space. Note that the dimensionalities of u and v in the complex space must be equal; before the vectors are combined, the lower-dimensional one is padded with x/y until its dimensionality equals that of the higher-dimensional one.
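The parallel combination with padding can be sketched as follows. This is an illustration under a stated assumption: the padding value is exposed as a parameter and defaults to zero, which is not necessarily the value used in the paper, and the function name parallel_combine is hypothetical.

```python
def parallel_combine(u, v, pad_value=0.0):
    """Combine two real feature vectors into the complex vector u + i*v.

    ASSUMPTION: the shorter vector is padded with pad_value (zero by default);
    the padding quantity used in the paper may differ.
    """
    m = max(len(u), len(v))
    u_padded = list(u) + [pad_value] * (m - len(u))
    v_padded = list(v) + [pad_value] * (m - len(v))
    return [complex(a, b) for a, b in zip(u_padded, v_padded)]


# A 2-D vector (such as GCGR) combined with a 3-D vector (such as NSI);
# the numbers are made up purely for illustration.
z = parallel_combine([1.0, 2.0], [3.0, 4.0, 5.0])
```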
In order to measure the predictive capability of the algorithm, we adopted the following commonly used measures:
$$S_n = \frac{TP}{TP + FN}, \qquad S_p = \frac{TN}{TN + FP},$$
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \qquad A_c = \frac{\sum_i TP_i}{N},$$
where, for a given subcellular location, True Positive (TP) is the number of true positives, True Negative (TN) is the number of true negatives, False Positive (FP) is the number of false positives and False Negative (FN) is the number of false negatives; $TP_i$ is the number of true positives for the $i$-th location, and $N$ is the total number of protein sequences. Sensitivity ($S_n$) is the rate of correct prediction. Specificity ($S_p$) is the reliability level of the predictive model. Overall prediction accuracy ($A_c$) is the fraction of correctly classified proteins. The Matthew's correlation coefficient (MCC) shows the comprehensive performance of the prediction algorithm.
In this paper, all the experiments were carried out with MATLAB and a Library for Support Vector Machines (LIBSVM). In the experiments, we chose the jackknife test for validating the performance of each model. The jackknife test is done by dropping each sample in turn from the data set as the test sample and fitting the model on the remaining observations as the training samples. The prediction accuracy is the number of correctly classified samples divided by the total number of samples. It is worth noting that, although there is no parameter in our feature construction process, there are two model parameters for LIBSVM, namely c and g. We adopted a grid search to find the best c and g in the jackknife test. Specifically, c and g both varied from $2^{-5}$ to $2^{5}$ in multiples of 2; the best values are c = 8 and g = 0.0625 for the dataset CL317, and c = 16 and g = 0.125 for the dataset ZW225.
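The jackknife procedure can be sketched in a few lines of Python. The nearest-neighbour decision rule used here is an assumption chosen as the simplest classifier-free stand-in; the text does not spell out the decision rule of the simple model, and the experiments themselves were run in MATLAB with LIBSVM. The function name jackknife_accuracy and the toy data are hypothetical.

```python
def jackknife_accuracy(features, labels, distance):
    """Leave-one-out accuracy of a nearest-neighbour rule (assumed stand-in)."""
    correct = 0
    for i, query in enumerate(features):
        # Every sample except the held-out one acts as training data.
        candidates = [j for j in range(len(features)) if j != i]
        nearest = min(candidates, key=lambda j: distance(query, features[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(features)


# Candidate values for c and g: 2^-5, 2^-4, ..., 2^5.
grid = [2.0 ** k for k in range(-5, 6)]

# Toy one-dimensional data, made up purely to exercise the function.
toy_features = [[0.0], [0.1], [1.0], [1.1]]
toy_labels = ["a", "a", "b", "b"]
accuracy = jackknife_accuracy(
    toy_features, toy_labels, lambda p, q: abs(p[0] - q[0])
)
```

Any distance function can be passed in, so the same loop works with a complex-space distance on parallel-combined vectors.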

Conclusions
In this study, we first proposed a novel and quick method for predicting yeast subcellular locations based on the generalized chaos game representation of the protein primary sequence and on statistics and information theory, which uncover the distribution of residues along the sequence. Application to two benchmark yeast datasets suggests that this model achieves classification performance comparable to that of machine learning-based classifiers. In addition, a fusion model incorporating GCGR and NSI with known features including PwAAC and Dipeptide was presented, which attains the highest overall accuracy and MCC on the two benchmark datasets. The results also indicate that the newly extracted features contain useful information that was not mined by previous methods.