KSIMC: Predicting Kinase–Substrate Interactions Based on Matrix Completion

Protein phosphorylation is an important chemical modification catalyzed by kinases. It plays important roles in many cellular processes. Predicting kinase–substrate interactions is vital to understanding the mechanism of many diseases. Many computational methods have been proposed to identify kinase–substrate interactions. However, the prediction accuracy still needs to be improved. Therefore, it is necessary to develop an efficient computational method to predict kinase–substrate interactions. In this paper, we propose a novel computational approach, KSIMC, to identify kinase–substrate interactions based on matrix completion. Firstly, the kinase similarity and substrate similarity are calculated by aligning sequence of kinase–kinase and substrate–substrate, respectively. Then, the original association network is adjusted based on the similarities. Finally, the matrix completion is used to predict potential kinase–substrate interactions. The experiment results show that our method outperforms other state-of-the-art algorithms in performance. Furthermore, the relevant databases and scientific literature verify the effectiveness of our algorithm for new kinase–substrate interaction identification.


Introduction
Protein phosphorylation is one of the most important post-translational modifications (PSMs) in an organism [1]. It is catalyzed by protein kinases, which promote the transfer of a phosphate group to corresponding substrates. Additionally, protein phosphatases remove the phosphates from substrates. Therefore, protein phosphorylation is a reversible post-translational modification based on the equilibrium of kinases and phosphatases. It plays critical roles in many cellular processes, such as cell metabolism, cell proliferation, cell differentiation, cell apoptosis, and cellular signal transduction [2,3]. Abnormal action of kinases and substrates may lead to a series of diseases, such as rheumatoid arthritis [4] and diabetes [5]. Thus, identifying interactions between substrates and its specific kinases may facilitate the study of diseases and drug targets [6][7][8].
In recent years, several biological methods have been proposed to identify phosphorylation sites and corresponding kinases including a low-throughput [9] and high-throughput [10] technique. Large amounts of phosphorylation sites have been identified by using high-throughput technology.
However, most of the corresponding kinases are still unknown. For example, there are more than 30,000 phosphorylation sites stored in the popular knowledgebase Phospho.ELM [11]. However, 90% of these phosphorylation sites do not have records of corresponding annotated kinases. Moreover, a similar problem also exists in the PhosphoSitePlus [12], more than 95% phosphorylation sites do not have the records of the corresponding annotated kinases. Therefore, many computational methods have been developed for identifying kinase-substrate interactions [13,14]. Linding et al. [15] proposed a computational framework to identify a site-specific kinase-substrate based on the network context of kinases and phosphoproteins. Dang et al. [16] developed a new method for identifying kinase-substrate interactions by using conditional random fields. Zhou et al. [17] proposed a web server tool (GPS) to predict kinase-substrate interactions based on the BLOSUM matrix and Markov Cluster Algorithm. Zou et al. [18] presented a computational method, PKIS, to identify kinase-substrate interaction by applying the composition of a monomer spectrum encoding strategy(CMS) [19] to encode the protein sequence feature. Patrick et al. [20] proposed a Bayesian network model to identify kinase-substrate interactions by integrating the cellular context. Fan et al. [21] developed a random forest model for predicting kinase-substrate interactions based on functional information. Li et al. [22] proposed a kernel-based method to identify kinase-substrate interactions by using Supervised Laplacian Regularized Least Squares. Song et al. [23] presented a computational method to infer the relationships between kinases and substrates by integrating protein sequence and functional features. Gnad et al. [24] utilized support vector machines to predict phosphorylation and acetylation sites based on the primary sequence and developed an online database (PHOSIDA) to store phosphorylation data. Moreover, several computational methods employ the biological network information to improve the prediction accuracy. For instance, Song et al. [25] proposed a computational method named iGPS, to predict kinase-substrate interactions based on the PPI network. Damle et al. [26] presented an algorithm, PhosNetConstruct, to infer kinase-substrate interactions based on the domain-specific phosphorylation network. Li et al. [27] proposed a network-based method to identify kinase-substrate interactions by integrating sequence similarity. Qin et al. [28] developed a computational framework for predicting kinase-substrate interactions based on the protein domains network. However, due to the complexity of protein phosphorylation, the accuracy of kinase-substrate prediction of most of the exiting computational methods still needs to be improved.
In this paper, we propose a new computational approach, KSIMC, to predict kinase-substrate interactions based on matrix completion. Firstly, the kinase-kinase similarity and the substrate-substrate similarity are calculated by using a sequence local alignment method, respectively. Then, the kinasesubstrate association network is adjusted based on pairwise similarities. Finally, the matrix completion is used to predict potential kinase-substrate interactions. The experiment results show that our method outperforms other state-of-the-art algorithms in performance. Furthermore, the relevant databases and scientific literatures verify the effectiveness of this algorithm for potential kinase-substrate interactions prediction.

Evaluation Metrics
In this paper, ten-fold cross-validation and de novo tests are conducted to evaluate the performance of KSIMC in predicting kinase-substrate interactions. In the ten-fold cross validation, all known kinase-substrate associations are randomly divided into 10 subsets of equal size. Each subset takes a turn as a test set, while the remaining nine subsets are treated as the training set. After performing the algorithm on the dataset, the predicted scoring matrix of all kinase-substrate interactions is generated. Then we calculate the true positive (TP), true negative (TN), false positive (FP) and false negative (FN) by ranking the prediction results. Correspondingly, TP represents the number of the positive samples that are correctly predicted, TN represents the number of the negative samples that are correctly predicted, FP represents the number of the positive samples that are incorrectly predicted, and FN represents the number of the negative samples that are incorrectly predicted. By changing various thresholds, true positive rate (TPR) and false positive rate (FPR) are calculated as follows: Finally, the Receiver Operating Characteristic (ROC) is drawn based on the TPR and FPR and the Area Under Curve (AUC) is calculated for the performance evaluation.

Comparison with Network-Based Method
To evaluate the performance of KSIMC, we compare it with another network-based method Hetesim-SEQ [27] for all kinases and substrates by using ten-fold cross-validation. Hetesim-SEQ is a network-based method for kinase-substrate interactions prediction based on Hetesim [29] similarity. The ROC curve of KSIMC and Hetesim-SEQ is shown in Figure 1. KSIMC achieves the AUC value of 0.862, which is 0.06 higher than Hetesim-SEQ. It shows that KSIMC performs better than Hetesim-SEQ. predicted. By changing various thresholds, true positive rate (TPR) and false positive rate (FPR) are calculated as follows: Finally, the Receiver Operating Characteristic (ROC) is drawn based on the TPR and FPR and the Area Under Curve (AUC) is calculated for the performance evaluation.

Comparison with Network-Based Method
To evaluate the performance of KSIMC, we compare it with another network-based method Hetesim-SEQ [27] for all kinases and substrates by using ten-fold cross-validation. Hetesim-SEQ is a network-based method for kinase-substrate interactions prediction based on Hetesim [29] similarity. The ROC curve of KSIMC and Hetesim-SEQ is shown in Figure 1. KSIMC achieves the AUC value of 0.862, which is 0.06 higher than Hetesim-SEQ. It shows that KSIMC performs better than Hetesim-SEQ.

Comparison with Different Predictors by De Novo Test
In order to evaluate the ability of KSIMC in predicting potential kinase-substrate interactions, we perform de novo kinase-substrate interaction prediction test experiments. In the de novo test, for each queried kinase i, all known kinase-substrate interactions of kinase i are deleted. The remaining kinase-substrate interactions are treated as training sets. Four popular methods of kinase-substrate interactions including GPS [17], iGPS [25], NetworKIN [15] and PhosphoPICK [20] are also applied to predict potential substrates for new kinases. Since these predictors only provide a web server, we submit the dataset to the corresponding web server for testing. Four kinase groups including CAMK, CMGC, STE, and TK are used to illustrate the overall performance of different methods. The ROC curves for different methods in different kinase groups are illustrated in Figure 2. It can be observed that KSIMC performs better than the other four algorithms on different kinase groups. For instance,

Comparison with Different Predictors by De Novo Test
In order to evaluate the ability of KSIMC in predicting potential kinase-substrate interactions, we perform de novo kinase-substrate interaction prediction test experiments. In the de novo test, for each queried kinase i, all known kinase-substrate interactions of kinase i are deleted. The remaining kinase-substrate interactions are treated as training sets. Four popular methods of kinase-substrate interactions including GPS [17], iGPS [25], NetworKIN [15] and PhosphoPICK [20] are also applied to predict potential substrates for new kinases. Since these predictors only provide a web server, we submit the dataset to the corresponding web server for testing. Four kinase groups including CAMK, CMGC, STE, and TK are used to illustrate the overall performance of different methods. The ROC curves for different methods in different kinase groups are illustrated in Figure 2. It can be observed that KSIMC performs better than the other four algorithms on different kinase groups.

Case Studies
To further demonstrate the ability of KSIMC to predict new kinase-substrate interactions, the case study is performed in here. All known kinase-substrate interactions are treated as the training set and the unknown kinase-substrate interactions are treated as the test set. We apply KSIMC to predict potential kinase-substrate interactions and obtain the prediction scores for all candidate kinase-substrate interactions (Table S1). We take the substrate IRS1 as an example to illustrate the capability of KSIMC to identify unknown kinase-substrate interactions. The top 10 predicted results of IRS1 are selected to validate based on the database and literatures (Table S2). The detailed information is shown in Table 1. We find that four predicted kinases have been confirmed in the PhosphoNET database and one predicted kinase has been validated in recent literature. For example, the serine site at the 312 position of the IRS1 sequence is catalyzed by two kinases (MAPK1 ranked at top 2 and MAPK8 ranked at top 6). The serine sites at the position 24 and 233 of IRS1 are catalyzed by PRKCA (ranked at top 3) and PRKCE (ranked at top 7), respectively. In addition, it has been proved that IRS1 can be regulated by CDK1 [30].
In addition, some interesting kinases such as ABL1, CSNK2A1, GSK3B, PRKG1 and RPS6KA3 are also discovered from the experimental results. The molecular mechanism of these kinases is still unknown; it deserves a biologist to validate its functions by using a biological experiment.

Case Studies
To further demonstrate the ability of KSIMC to predict new kinase-substrate interactions, the case study is performed in here. All known kinase-substrate interactions are treated as the training set and the unknown kinase-substrate interactions are treated as the test set. We apply KSIMC to predict potential kinase-substrate interactions and obtain the prediction scores for all candidate kinase-substrate interactions (Table S1). We take the substrate IRS1 as an example to illustrate the capability of KSIMC to identify unknown kinase-substrate interactions. The top 10 predicted results of IRS1 are selected to validate based on the database and literatures (Table S2). The detailed information is shown in Table 1. We find that four predicted kinases have been confirmed in the PhosphoNET database and one predicted kinase has been validated in recent literature. For example, the serine site at the 312 position of the IRS1 sequence is catalyzed by two kinases (MAPK1 ranked at top 2 and MAPK8 ranked at top 6). The serine sites at the position 24 and 233 of IRS1 are catalyzed by PRKCA (ranked at top 3) and PRKCE (ranked at top 7), respectively. In addition, it has been proved that IRS1 can be regulated by CDK1 [30].
In addition, some interesting kinases such as ABL1, CSNK2A1, GSK3B, PRKG1 and RPS6KA3 are also discovered from the experimental results. The molecular mechanism of these kinases is still unknown; it deserves a biologist to validate its functions by using a biological experiment.

Data Resources
In this work, the human kinase-substrate interactions are obtained from the Phospho.ELM 9.0 database [11]. The interactions labeled with kinase group or family are not considered in the experiment.

Kinase-Kinase and Substrate-Substrate Similarity Measure
Based on the protein sequence information of the kinase and the substrate, the sequence local alignment method is used to calculate the kinase-kinase similarity and the substrate-substrate similarity. The sequence alignment tool Emboss [31] is utilized to calculate the sequence similarity, which has been widely used in sequence alignment [32,33]. The parameters of Emboss are set with the default value (Matrix = BLOSUM62, Gap open = 10, Gap extend = 0.5).

Adjust the Kinase-Substrate Interaction Network
The adjacency matrix of the kinase-substrate interaction network is described as M KS matrix. If there is a known relationship between the kinase i and the substrate j, then M KS (i,j) is 1, otherwise M KS (i,j) is 0. However, there may still be potential positives for unknown kinase-substrate relationships. In order to further enhance the reliability of the kinase-substrate association network, the kinase-substrate association network is adjusted based on sequence similarity. It is based on the assumption that similar substrates tend to be related with similar kinases. For instance, assume that kinase k 1 is associated with the substrate s 1 and there is no known association between k 1 and the substrate s 2 . If the similarity between s 1 and s 2 is greater than a certain threshold t, the kinase k 1 and substrate s 2 are considered to be associated and the corresponding element in the kinase-substrate interaction matrix is set to 1. The parameter t is set as 0.9 here. Based on the above process, the kinases-substrate associations are readjusted to obtain a new kinase-substrate interaction matrix. The process is shown in Figure 3.

Construction of Kinase-Substrate Heterogenous Network
The heterogeneous network is constructed based on three sub-networks including the kinase similarity network, substrate similarity network and kinase-substrate association network. Let K = { , , ⋯ , } represent the set of m different kinases. Mkk denotes the kinase similarity matrix.
Similarly, for the substrate similarity network, let S = { , , ⋯ , } represent the set of n different substrates. Mss denotes the substrate similarity matrix. The value of element Mss denotes the similarity of two substrates. The example of the kinase-substrate heterogenous network is shown in Figure 4.
The matrix M of the kinase-substrate heterogenous network can be defined as follows: The main diagonal elements of matrix M are composed of sub-matrices and . The subdiagonal elements are composed of sub-matrices and .
is the transpose of .

Construction of Kinase-Substrate Heterogenous Network
The heterogeneous network is constructed based on three sub-networks including the kinase similarity network, substrate similarity network and kinase-substrate association network. Let K = {K 1 , K 2 , · · · , K m } represent the set of m different kinases. M kk denotes the kinase similarity matrix. Similarly, for the substrate similarity network, let S = {S 1 , S 2 , · · · , S n } represent the set of n different substrates. M ss denotes the substrate similarity matrix. The value of element M ss denotes the similarity of two substrates. The example of the kinase-substrate heterogenous network is shown in Figure 4. The

Construction of Kinase-Substrate Heterogenous Network
The heterogeneous network is constructed based on three sub-networks including the kinase similarity network, substrate similarity network and kinase-substrate association network. Let K = { , , ⋯ , } represent the set of m different kinases. Mkk denotes the kinase similarity matrix.
Similarly, for the substrate similarity network, let S = { , , ⋯ , } represent the set of n different substrates. Mss denotes the substrate similarity matrix. The value of element Mss denotes the similarity of two substrates. The example of the kinase-substrate heterogenous network is shown in Figure 4.
The matrix M of the kinase-substrate heterogenous network can be defined as follows: The main diagonal elements of matrix M are composed of sub-matrices and . The subdiagonal elements are composed of sub-matrices and .
is the transpose of .

Predicting Kinase-Substrate Interactions by Using Matrix Completion
The goal of the kinase-substrate interactions prediction is to complement the heterogeneous network adjacency matrix M by constructing a matrix M*. The adjacency matrix M can be recovered by minimizing the rank of the matrix based on the assumption that the matrix is of low rank. The optimization problem can be defined as: where M* is a candidate solution matrix with scores of all the unknown kinase-substrate interactions.
Ω denotes a set of index of known elements in the matrix M. P Ω denotes an orthogonal projection matrix. It is defined as follows: However, the rank minimization problem is known to be NP-Hard [34]. It is impractical for the problem of predicting kinase-substrate interactions with a large number of kinases and substrates. In order to facilitate the solution, the relaxation form [35] is used to minimize the nuclear norm with a soft threshold instead of minimizing the rank: where M * denotes the nuclear norm of the matrix M*. M * F denotes the Frobernius form of M* and τ is a singular value threshold parameter.
This optimization problem can be solved by the singular value thresholding (SVT) algorithm [35]. For a matrix X with rank r, singular value decomposition of X ∈ R n1×n2 is as follows: where U and V are n1 × r and r × n2, respectively. σ i is a nonnegative singular value. For each τ > 0, the soft thresholding operator D τ is defined as follows: (8) where (σ i − τ) + denotes the positive part of (σ i − τ), namely (σ i − τ) + = max(0, (S i − t)). This soft thresholding operation thus shrinks the singular values of X toward zero. The shrinkage iterations starts from Y 0 , and the matrix X K and Y K are reconstructed continuously via: where δ is the positive step size. The value of δ is set to (m + n)/ |Ω|, and Y 0 is set to τ/(δ P Ω (M) ) δP Ω (M) as suggested by previous research [36]. Since the algorithm needs to iteratively decompose and reconstruct the matrix, a fast implementation of the SVT algorithm [37] is used to improve the computational efficiency. The flow chart of predicting kinase-substrate interactions by using matrix completion is shown in Figure 5.

Conclusions
Protein phosphorylation is one of the most post-translational modifications, which plays critical roles in many cellular processes, such as cell metabolism, gene expression and cellular signal transduction. Abnormal action of kinases and substrates may lead to a series of diseases. Thus, identification of the interactions between substrates and its specific kinases can increase our understanding in the pathogenesis of diseases [38][39][40]. In this paper, we propose a computational method to predict kinase-substrate interactions by using matrix completion. Firstly, the kinase similarity matrix and the substrate similarity matrix are calculated by aligning the sequence of kinase-kinase and substrate-substrate, respectively. Then the original association network is adjusted and the adjacency matrix of the kinase-substrate heterogeneous network is constructed based on the similarities. Finally, the matrix completion is used to fill in the missing information in the adjacency matrix and to predict the potential kinase-substrate interactions. The experiment results show that our method outperforms other state-of-the-art algorithms in performance. In addition, as KSIMC only utilizes the sequence information, it also can be used to predict the kinasesubstrate of other species such as prokaryotes.

Conclusions
Protein phosphorylation is one of the most post-translational modifications, which plays critical roles in many cellular processes, such as cell metabolism, gene expression and cellular signal transduction. Abnormal action of kinases and substrates may lead to a series of diseases. Thus, identification of the interactions between substrates and its specific kinases can increase our understanding in the pathogenesis of diseases [38][39][40]. In this paper, we propose a computational method to predict kinase-substrate interactions by using matrix completion. Firstly, the kinase similarity matrix and the substrate similarity matrix are calculated by aligning the sequence of kinase-kinase and substrate-substrate, respectively. Then the original association network is adjusted and the adjacency matrix of the kinase-substrate heterogeneous network is constructed based on the similarities. Finally, the matrix completion is used to fill in the missing information in the adjacency matrix and to predict the potential kinase-substrate interactions. The experiment results show that our method outperforms other state-of-the-art algorithms in performance. In addition, as KSIMC only utilizes the sequence information, it also can be used to predict the kinase-substrate of other species such as prokaryotes.