Prediction of Drug–Target Interaction Networks from the Integration of Protein Sequences and Drug Chemical Structures

Knowledge of drug–target interactions (DTI) plays an important role in discovering new drug candidates. Unfortunately, experimental methods for identifying DTI have unavoidable shortcomings: they are time-consuming and expensive. This motivates us to develop an effective computational method to predict DTI based on protein sequence. In this paper, we propose a novel computational approach based on protein sequence, namely PDTPS (Predicting Drug Targets with Protein Sequence), to predict DTI. The PDTPS method combines Bi-gram probabilities (BIGP), Position Specific Scoring Matrix (PSSM), and Principal Component Analysis (PCA) with the Relevance Vector Machine (RVM). To evaluate the prediction capacity of PDTPS, experiments were carried out on enzyme, ion channel, GPCR, and nuclear receptor datasets using five-fold cross-validation tests. The proposed PDTPS method achieved average accuracies of 97.73%, 93.12%, 86.78%, and 87.78% on the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively. The experimental results show that our method has good prediction performance. To further evaluate its prediction performance, we compared it with the state-of-the-art support vector machine (SVM) classifier on the enzyme and ion channel datasets, and with other existing methods on all four datasets. The promising comparison results further demonstrate the efficiency and robustness of the proposed PDTPS method. This makes it a useful tool for predicting DTI, as well as for other bioinformatics tasks.


Introduction
The identification of drug-target interactions (DTI) has recently emerged as an area of intense research activity due to its important role in finding new proteins to target for drug development and in discovering new drug candidates [1,2]. However, the target proteins of many drugs are incompletely known or not known at all. In past years, much effort has been devoted to identifying drug-protein interactions experimentally, but these experimental methods are both time-consuming and expensive. Developing a successful novel chemistry-based drug often costs billions of dollars, and bringing it to market takes nearly a decade. Moreover, only a few drug candidates are approved by the Food and Drug Administration (FDA) to reach the market [3][4][5]. This is partially caused by unacceptable toxicity in drug candidates that otherwise show satisfactory activity, owing to deficient knowledge of drug-target interactions. Thus, it is necessary to develop effective computational methods for predicting DTI. In this paper, we propose a novel computational approach based on protein sequence, namely PDTPS (Predicting Drug Targets with Protein Sequence), to predict drug-target interactions (DTI). The PDTPS method combines Bi-gram probabilities (BIGP), Position Specific Scoring Matrix (PSSM), and Principal Component Analysis (PCA) with the Relevance Vector Machine (RVM). To evaluate the prediction capacity of PDTPS, we carried out experiments on enzyme, ion channel, GPCR, and nuclear receptor datasets using five-fold cross-validation tests. The proposed PDTPS method achieved average accuracies of 97.73%, 93.12%, 86.78%, and 87.78% on the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively. The experimental results show that our method has good prediction performance.
Furthermore, to further evaluate the prediction performance of the proposed PDTPS method, we compared it with the state-of-the-art support vector machine (SVM) classifier on the enzyme and ion channel datasets and with other existing methods on all four datasets. The promising comparison results further demonstrate the efficiency and robustness of the proposed PDTPS method. This makes it a useful tool for predicting DTI, as well as for other bioinformatics tasks. The flow chart of the proposed prediction model is shown in Figure 1.

Performance of the Proposed Method
To verify the effectiveness of the proposed method, we carried out experiments on the enzyme, ion channel, GPCR, and nuclear receptor datasets using five-fold cross-validation tests. For five-fold cross-validation, the whole dataset was divided into five parts; in each round, four parts were used as training samples and the remaining part as testing samples. In addition, several parameters of the RVM classifier need to be optimized. Here, the 'ploy2' function was selected as the kernel function, and the other parameters were set as follows: width = 1, initapla = 1/N, and beta = 0, where width is the width of the 'ploy2' kernel function, N is the number of training samples, and beta = 0 specifies classification. Tables 1-4 list the five-fold cross-validation results of the proposed approach on the enzyme, ion channel, GPCR, and nuclear receptor datasets.
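The five-fold partition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `five_fold_splits`, the shuffling seed, and the strided assignment of samples to folds are all assumptions made for the example. Only the sample count (5852 enzyme pairs) comes from the paper.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) once per fold.

    Hypothetical helper: the shuffling and fold assignment are assumptions.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold k takes every n_folds-th shuffled index starting at offset k,
    # so fold sizes differ by at most one sample.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


# 5852 is the number of enzyme drug-target pairs used in the experiment.
splits = list(five_fold_splits(5852))
fold_sizes = [len(test) for _, test in splits]
```

Each sample appears in exactly one test fold, so every drug-target pair is predicted once across the five rounds.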
The good prediction results of the proposed approach for drug-target interactions stem from an appropriate choice of feature extraction method and classifier. The advantages of the proposed feature extraction method are threefold: (1) because the PSSM not only describes order information but also retains sufficient prior information, it can capture useful information from a given protein sequence; (2) the bi-gram probabilities are computed from the probability information contained in each protein's PSSM, and because bi-gram features extracted from PSSMs significantly reduce the sparsity level, they help improve recognition performance; (3) to reduce the influence of noise on classification while preserving the integrity of the feature information, we reduced the dimension of each BIGP feature vector from 400 to 350 using Principal Component Analysis (PCA). These experimental results therefore indicate that the proposed BIGP method plays an essential role in improving the accuracy of DTI prediction.

Comparison with the SVM-Based Method
The proposed method has achieved good prediction accuracy. To further evaluate the prediction performance of the RVM classifier, we compared its prediction accuracy with that of the state-of-the-art support vector machine (SVM) classifier, using the same feature extraction method on the enzyme and ion channel datasets. We again adopted five-fold cross-validation tests to assess the prediction accuracy of the SVM classifier. The LIBSVM tool [25] was used to perform the classification. In the experiment, we also optimized several parameters of the SVM classifier. We selected the radial basis function (RBF) as the kernel function, and the c and g parameters were set to c = 0.5 and g = 0.6 using a grid search.
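The kernel and parameter search mentioned above can be illustrated with a short sketch. The RBF form exp(-g * ||u - v||^2) is the one LIBSVM uses. Everything else here is an assumption for illustration: the candidate grids, the helper names, and in particular `toy_score`, which merely stands in for the mean five-fold accuracy that a real search would compute by training the SVM.

```python
import math
from itertools import product


def rbf_kernel(u, v, g):
    """RBF kernel in the form used by LIBSVM: exp(-g * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-g * sq_dist)


def grid_search(score_fn, c_values, g_values):
    """Return the (c, g) pair that maximizes score_fn(c, g)."""
    return max(product(c_values, g_values), key=lambda cg: score_fn(*cg))


def toy_score(c, g):
    """Hypothetical stand-in for cross-validated accuracy (peaks at 0.5, 0.6)."""
    return -((c - 0.5) ** 2 + (g - 0.6) ** 2)


# Candidate grids are assumptions; the paper does not list the values searched.
c_grid = [0.1, 0.5, 1.0, 2.0]
g_grid = [0.2, 0.6, 1.0]
best_c, best_g = grid_search(toy_score, c_grid, g_grid)
```

In practice `score_fn` would train and evaluate the classifier for each (c, g) pair, which is why the grid is usually kept coarse.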
The prediction results of the RVM and SVM classifiers on the enzyme and ion channel datasets are listed in Tables 5 and 6, respectively. The ROC curves of the RVM and SVM classifiers on the enzyme and ion channel datasets are compared in Figures 2 and 3, respectively. As displayed in Table 5, the RVM classifier obtained 97.73% average accuracy on the enzyme dataset, while the SVM classifier achieved 91.15%. Similarly, it can be seen from Table 6 that the RVM classifier obtained 93.12% average accuracy on the ion channel dataset, while the SVM classifier achieved 87.77%. These results show that the prediction accuracy of the RVM classifier is significantly higher than that of the SVM classifier. In addition, as displayed in Figures 2 and 3, the ROC curves of the RVM classifier are also clearly better than those of the SVM classifier. The good prediction results of the proposed method may be attributable to two reasons: (1) the RVM classifier greatly reduces the number of kernel function evaluations relative to the SVM classifier, which helps improve prediction performance; (2) the SVM classifier has the obvious disadvantage that its kernel functions must satisfy Mercer's condition, a restriction that the RVM classifier does not have. Thus, all of these experimental results indicate that the proposed prediction model might become a useful tool for predicting DTI, as well as for performing other bioinformatics tasks.

Comparison with Other Methods
To date, a number of computational methods have been proposed for predicting drug-target interactions. To further evaluate the prediction performance of the proposed method, we compared its prediction accuracy with that of four existing DTI predictors, DBSI [26], Yamanishi [27], KBMF2K [28], and NetCMP [29], on the enzyme, ion channel, GPCR, and nuclear receptor datasets. These methods use the same strategy as the proposed method; however, they adopt different feature extraction methods and classifiers. Table 7 displays the comparison results. It can be observed from Table 7 that the prediction accuracy of the proposed approach is significantly higher than that of the other four methods on all four datasets. The comparison results further demonstrate that PDTPS can improve prediction accuracy relative to current approaches. The proposed method achieved good prediction results because it uses a good classifier and a novel feature extraction method. This makes PDTPS a useful tool for predicting DTI.

Dataset
In this study, we evaluated the proposed method on four protein target datasets: enzymes, ion channels, GPCRs, and nuclear receptors. These data can be freely obtained from the KEGG BRITE [7], BRENDA [30], SuperTarget [6], and DrugBank [8] databases and were used as the gold-standard datasets by Yamanishi et al. [27]. The numbers of drugs known to target enzymes, ion channels, GPCRs, and nuclear receptors are 445, 210, 233, and 54, respectively. The numbers of proteins known to be targeted by these drugs are 664, 204, 95, and 26, respectively. These drug-target pairs were carefully screened, and 5127 of them are known to interact. The numbers of known interactions involving enzymes, ion channels, GPCRs, and nuclear receptors are 2926, 1476, 635, and 90, respectively. All known interacting drug-target pairs were chosen as the positive sample sets for the four datasets in our experiment.
A drug-target interaction network is usually represented as a bipartite graph, whose nodes represent target proteins or drug molecules and whose edges represent real drug-target interactions that have already been identified through experiments or other means. In such a graph, the number of real drug-target interaction edges is small. Taking the enzyme dataset as an example, there are a total of 295,480 (445 × 664) possible connections in the corresponding bipartite graph, and only 2926 of them are known drug-target interactions. Therefore, the number of possible negative samples (295,480 − 2926 = 292,554) is far larger than the number of positive samples (2926), which creates a class imbalance problem. To solve this problem, we randomly selected as many negative samples as there are positive samples. As a result, there are 2926, 1476, 635, and 90 negative samples in the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively. In other words, the experiment uses 5852, 2952, 1270, and 180 drug-target pairs for the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively.
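The balancing step can be sketched as below, as a rough illustration rather than the authors' procedure. The counts 445, 664, and 2926 are the enzyme figures from the text. The function `balanced_samples`, the random seed, and `toy_positives` (placeholder pairs that stand in for the real gold-standard interactions) are hypothetical.

```python
import random


def balanced_samples(n_drugs, n_targets, positives, seed=0):
    """Return (positives, negatives) with equally many pairs in each set.

    Negatives are drawn uniformly from drug-target pairs that are not
    known interactions. The helper name and seed are assumptions.
    """
    positive_set = set(positives)
    candidates = [(d, t) for d in range(n_drugs) for t in range(n_targets)
                  if (d, t) not in positive_set]
    negatives = random.Random(seed).sample(candidates, len(positive_set))
    return sorted(positive_set), negatives


# Enzyme dataset sizes from the text.
n_drugs, n_targets, n_known = 445, 664, 2926
total_pairs = n_drugs * n_targets            # all edges of the bipartite graph
possible_negatives = total_pairs - n_known   # pairs with no known interaction

# Placeholder positives (hypothetical); real pairs come from the gold standard.
toy_positives = [(i % n_drugs, i % n_targets) for i in range(n_known)]
pos, neg = balanced_samples(n_drugs, n_targets, toy_positives)
```

Randomly chosen negatives are only presumed non-interacting, since an unobserved pair may still be a true interaction that has not been reported.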

Position Specific Scoring Matrix
A Position Specific Scoring Matrix (PSSM) can be represented as an M × 20 matrix $M = \{M_{ij} : i = 1, \dots, M;\ j = 1, \dots, 20\}$, where M is the length of a given protein sequence, 20 is the number of amino acids, and $M_{ij}$ is the score of the j-th amino acid at the i-th position of the query protein sequence [31]. The score $M_{ij}$ can be expressed as $M_{ij} = \sum_{k=1}^{20} p(i, k) \times q(j, k)$, where $p(i, k)$ is the frequency with which the k-th amino acid appears at position i of the probe, and $q(j, k)$ is the value of Dayhoff's mutation matrix between the j-th and k-th amino acids. Thus, a high score indicates a highly conserved position, whereas a low score indicates a weakly conserved position.
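As a small worked illustration of the score definition, the sketch below evaluates the sum over k of p(i, k) × q(j, k) for every position i and residue type j. To keep it readable it uses a three-letter alphabet instead of 20 amino acids, and the values in `p` and `q` are invented for the example; they are not taken from Dayhoff's matrix.

```python
def pssm_scores(p, q):
    """Compute M[i][j] = sum_k p[i][k] * q[j][k].

    p[i][k]: frequency of residue type k at position i.
    q[j][k]: substitution score between residue types j and k.
    """
    return [[sum(p_ik * q_jk for p_ik, q_jk in zip(p_row, q_row))
             for q_row in q]
            for p_row in p]


# Toy inputs (hypothetical values, three residue types, two positions).
p = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]
q = [[2, -1, 0],
     [-1, 3, 1],
     [0, 1, 4]]
scores = pssm_scores(p, q)
```

At the first position only residue type 0 is observed, so its row of scores is simply the corresponding row of `q`.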
In this study, to create the experimental datasets, we used Position Specific Iterated BLAST (PSI-BLAST) [32] to construct a PSSM for each protein sequence. To obtain highly and widely homologous sequences, an e-value of 0.001 and three iterations were selected, and the remaining PSI-BLAST parameters were left at their default values. The features may differ if different parameters are used; however, in this work we concentrated on exploring general PSSM features for predicting DTI with mostly default settings. Thus, each protein sequence is represented by an M × 20 PSSM produced by PSI-BLAST, where M is the number of residues in the sequence and the 20 columns correspond to the 20 amino acids.

Bi-Gram Probabilities
Bi-gram Probabilities (BIGP) have been used for protein fold recognition. The literature [33] describes how to use a protein's original primary sequence or its consensus sequence for protein fold recognition. Instead, we employed the BIGP feature extraction method proposed in [34], which represents a given protein sequence based on its PSSM (the PSSM is described in Section 3.2 of the paper). In detail, the bi-gram feature vector is computed by counting the bi-gram frequencies of occurrence in the PSSM. Let P denote the PSSM of a protein sequence, with L rows and 20 columns, where L is the length of the protein sequence and the 20 columns correspond to the 20 amino acids, so that $P = \{P_{ij} : i = 1, \dots, L;\ j = 1, \dots, 20\}$. The element $P_{ij}$ can be interpreted as the relative probability of the j-th amino acid at the i-th location of the primary protein sequence. The frequency of occurrence of the transition from the m-th amino acid to the n-th amino acid is defined as follows:

$$BIGP_{mn} = \sum_{i=1}^{L-1} P_{i,m}\, P_{i+1,n}, \quad 1 \le m \le 20,\ 1 \le n \le 20 \tag{1}$$

Equation (1) gives 400 frequencies of occurrence $BIGP_{mn}$ for the 400 bi-gram transitions. The matrix BIGP is called the bi-gram occurrence matrix, and its 400 elements form the bi-gram feature vector [34]:

$$[BIGP_{1,1}, BIGP_{1,2}, \dots, BIGP_{1,20}, BIGP_{2,1}, \dots, BIGP_{2,20}, \dots, BIGP_{20,1}, \dots, BIGP_{20,20}]^{T}$$

These bi-gram features can also be expressed as

$$BF = [\phi_1, \phi_2, \dots, \phi_u, \dots, \phi_\theta]^{T},$$

where θ = mn = 400 is the dimensionality of the feature vector BF, and $\phi_u$ is given by

$$\phi_u = BIGP_{mn}, \quad u = 20(m - 1) + n.$$

Finally, each protein sequence was converted into a 400-dimensional vector using the BIGP method. In this paper, to reduce the influence of noise and improve prediction accuracy, the feature dimension for the enzyme, ion channel, GPCR, and nuclear receptor datasets was reduced from 400 to 350 using Principal Component Analysis (PCA).
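The bi-gram extraction can be sketched in a few lines. The sketch assumes the bi-gram occurrence for the transition m → n is the sum over consecutive positions i of P[i][m] × P[i+1][n], which is the usual definition for PSSM-based bi-grams. The function name and the toy matrices are hypothetical, and the PCA reduction to 350 dimensions is not shown.

```python
def bigram_features(pssm):
    """Flatten the bi-gram occurrence matrix of a PSSM into a feature vector.

    Assumed definition: bigp[m][n] = sum_i pssm[i][m] * pssm[i + 1][n].
    A PSSM with 20 columns yields a 400-dimensional vector in row-major order.
    """
    n_cols = len(pssm[0])
    bigp = [[0.0] * n_cols for _ in range(n_cols)]
    # Walk over consecutive row pairs (position i, position i + 1).
    for row, next_row in zip(pssm, pssm[1:]):
        for m, p_im in enumerate(row):
            for n, p_next_n in enumerate(next_row):
                bigp[m][n] += p_im * p_next_n
    return [value for bigp_row in bigp for value in bigp_row]


# Toy PSSM with two residue types and three positions (hypothetical values).
toy_pssm = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 0.0]]
toy_features = bigram_features(toy_pssm)

# A 20-column matrix of any length gives a 400-dimensional vector.
uniform_pssm = [[0.05] * 20 for _ in range(30)]
full_features = bigram_features(uniform_pssm)
```

Because the sum runs over all L − 1 adjacent position pairs, the output length depends only on the number of columns, so proteins of different lengths map to vectors of the same size.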

Relevance Vector Machine
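The theory of the Relevance Vector Machine is described in detail in the literature [35]. Let $\{x_n, t_n\}_{n=1}^{N}$, $x_n \in R^d$, be the training set for a binary classification problem, where $t_n \in \{0, 1\}$ is the training set label and $t_i$ is the testing set label, with $t_i = y_i + \varepsilon_i$, where $y_i = y(x_i; w)$ is the output of the classification model and $\varepsilon_i$ is additive noise with a mean of zero and a variance of $\sigma^2$, so that $\varepsilon_i \sim N(0, \sigma^2)$ and $t_i \sim N(y_i, \sigma^2)$. Assuming that the training samples are independent and identically distributed, the vector t follows the distribution

$$p(t \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left(-\frac{\lVert t - \Phi w \rVert^2}{2\sigma^2}\right),$$

where $\Phi$ is defined in terms of the kernel function K as

$$\Phi = [\phi(x_1), \phi(x_2), \dots, \phi(x_N)]^{T}, \quad \phi(x_n) = [1, K(x_n, x_1), \dots, K(x_n, x_N)]^{T}.$$

The training set label t is used to predict the testing set label $t_*$:

$$p(t_* \mid t) = \int p(t_* \mid w, \sigma^2)\, p(w, \sigma^2 \mid t)\, dw\, d\sigma^2$$

To drive most components of the weight vector w to zero and thereby reduce the number of kernel function evaluations, an additional condition is imposed on w. Assuming that each $w_i$ follows a Gaussian distribution with a mean of zero and a variance of $\alpha_i^{-1}$, with $a = (\alpha_0, \dots, \alpha_N)$,

$$p(w \mid a) = \prod_{i=0}^{N} N(w_i \mid 0, \alpha_i^{-1}),$$

the prediction becomes

$$p(t_* \mid t) = \int p(t_* \mid w, a, \sigma^2)\, p(w, a, \sigma^2 \mid t)\, dw\, da\, d\sigma^2 \tag{8}$$

$$p(t_* \mid w, a, \sigma^2) = N(t_* \mid y(x_*; w), \sigma^2)$$

Because $p(w, a, \sigma^2 \mid t)$ cannot be obtained by direct integration, it is decomposed using Bayes' formula:

$$p(w, a, \sigma^2 \mid t) = p(w \mid a, \sigma^2, t)\, p(a, \sigma^2 \mid t) \tag{10}$$

$$p(w \mid a, \sigma^2, t) = \frac{p(t \mid w, \sigma^2)\, p(w \mid a)}{p(t \mid a, \sigma^2)}$$

The integral of the product of $p(t \mid w, \sigma^2)$ and $p(w \mid a)$ is as follows:

$$p(t \mid a, \sigma^2) = \int p(t \mid w, \sigma^2)\, p(w \mid a)\, dw = (2\pi)^{-N/2}\, \lvert \Omega \rvert^{-1/2} \exp\left(-\frac{t^{T} \Omega^{-1} t}{2}\right),$$

$$\Omega = \sigma^2 I + \Phi A^{-1} \Phi^{T}, \quad A = \mathrm{diag}(\alpha_0, \alpha_1, \dots, \alpha_N).$$

Because $p(a, \sigma^2 \mid t) \propto p(t \mid a, \sigma^2)\, p(a)\, p(\sigma^2)$ and $p(a, \sigma^2 \mid t)$ cannot be solved by means of integration, the solution is approximated using the maximum likelihood method:

$$(a_{MP}, \sigma^2_{MP}) = \arg\max_{a, \sigma^2}\, p(t \mid a, \sigma^2)$$

The iterative process for $a_{MP}$ and $\sigma^2_{MP}$ is given by

$$\alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}, \quad (\sigma^2)^{new} = \frac{\lVert t - \Phi\mu \rVert^2}{N - \sum_{i=0}^{N} \gamma_i}, \quad \gamma_i = 1 - \alpha_i \Sigma_{i,i} \tag{15}$$

Here $\Sigma = (\sigma^{-2} \Phi^{T} \Phi + A)^{-1}$ and $\mu = \sigma^{-2} \Sigma \Phi^{T} t$ are the covariance and mean of the posterior $p(w \mid a, \sigma^2, t)$, and $\Sigma_{i,i}$ is the i-th diagonal element of $\Sigma$. Starting from initial values of $\alpha$ and $\sigma^2$, Formula (15) is applied repeatedly to approximate $a_{MP}$ and $\sigma^2_{MP}$.
After enough iterations, most of the $\alpha_i$ approach infinity and the corresponding weights $w_i$ become zero, while the remaining $\alpha_i$ converge to finite values. The training samples $x_i$ associated with these finite $\alpha_i$ are referred to as relevance vectors.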
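A single pass of the hyperparameter re-estimation can be sketched as follows. The sketch assumes the commonly used sparse Bayesian update: gamma_i = 1 − alpha_i × Sigma_ii, the new alpha_i is gamma_i / mu_i², and the new noise variance is the squared residual divided by (N − sum of gamma_i). It also assumes the posterior mean `mu`, the diagonal of the posterior covariance `sigma_diag`, and the squared residual have already been computed elsewhere. All names and numbers are hypothetical.

```python
def update_hyperparameters(alpha, mu, sigma_diag, residual_sq, n_samples):
    """One re-estimation step for the RVM hyperparameters (assumed rule).

    alpha       : current precision alpha_i of each weight
    mu          : posterior mean of each weight (must be nonzero here)
    sigma_diag  : diagonal of the posterior covariance matrix
    residual_sq : squared norm of (t - Phi @ mu), computed by the caller
    n_samples   : number of training samples N
    """
    # gamma_i measures how well the data determine weight i (0 to 1).
    gamma = [1.0 - a * s for a, s in zip(alpha, sigma_diag)]
    new_alpha = [g / (m * m) for g, m in zip(gamma, mu)]
    new_sigma2 = residual_sq / (n_samples - sum(gamma))
    return new_alpha, new_sigma2, gamma


# Toy values, chosen only to make the arithmetic easy to follow.
new_alpha, new_sigma2, gamma = update_hyperparameters(
    alpha=[1.0, 2.0], mu=[0.5, 0.1], sigma_diag=[0.5, 0.4],
    residual_sq=1.86, n_samples=10)
```

A weight whose posterior mean shrinks toward zero receives a rapidly growing alpha, and practical implementations prune such weights once alpha exceeds a large threshold. The samples that survive this pruning are the relevance vectors.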

Performance Evaluation
In this paper, we used the following criteria to evaluate the performance of the proposed classifier and feature extraction method. These include accuracy (Ac) and sensitivity (Sn), defined as $Ac = (TP + TN)/(TP + TN + FP + FN)$ and $Sn = TP/(TP + FN)$, where true positives (TP) is the number of positive pairs predicted as interacting drug-target pairs, false positives (FP) is the number of negative pairs predicted as interacting drug-target pairs, true negatives (TN) is the number of negative pairs predicted as non-interacting drug-target pairs, and false negatives (FN) is the number of positive pairs predicted as non-interacting drug-target pairs. In addition, the Receiver Operating Characteristic (ROC) curve was used to evaluate the performance of the proposed approach in the experiment.
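The counts defined above translate directly into code. The sketch below derives TP, FP, TN, and FN from label lists and computes accuracy and sensitivity using their standard definitions. The function names and the toy labels are illustrative assumptions, with 1 denoting an interacting pair and 0 a non-interacting pair.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, 1 = interacting pair."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all pairs that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    """Fraction of truly interacting pairs that are recovered."""
    return tp / (tp + fn)


# Toy labels (hypothetical).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
ac = accuracy(tp, fp, tn, fn)
sn = sensitivity(tp, fn)
```

With balanced positive and negative sets, as used in this study, accuracy is not dominated by the majority class and can be read alongside sensitivity.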

Conclusions
In this paper, we proposed a novel computational approach based on protein sequence, namely PDTPS (Predicting Drug Targets with Protein Sequence), to predict drug-target interactions (DTI). The PDTPS method combines bi-gram probabilities (BIGP), Position Specific Scoring Matrix (PSSM), and Principal Component Analysis (PCA) with the Relevance Vector Machine (RVM). To evaluate the prediction capacity of PDTPS, we applied the method to the enzyme, ion channel, GPCR, and nuclear receptor datasets using five-fold cross-validation tests. The proposed PDTPS method achieved average accuracies of 97.73%, 93.12%, 86.78%, and 87.78% on the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively. The experimental results show that our method has good prediction performance. Furthermore, to evaluate the prediction performance of the proposed PDTPS method, we compared it with the state-of-the-art support vector machine (SVM) classifier on the enzyme and ion channel datasets and with other existing methods on all four datasets. The promising comparison results further demonstrate the efficiency and robustness of the proposed PDTPS method. This makes it a useful tool for predicting DTI, as well as for performing other bioinformatics tasks. In future studies, more effective feature extraction approaches and machine learning algorithms can be developed for predicting DTI.