Drug-Target Interaction Prediction through Label Propagation with Linear Neighborhood Information

Interactions between drugs and target proteins provide important information for drug discovery. To date, experiments have identified only a small number of drug-target interactions. Therefore, developing computational methods for drug-target interaction prediction is an urgent task of theoretical interest and practical significance. In this paper, we propose a label propagation method with linear neighborhood information (LPLNI) for predicting unobserved drug-target interactions. First, we calculate the drug-drug linear neighborhood similarity in the feature space by considering how to reconstruct data points from their neighbors. Then, we take the similarities as the manifold of drugs and assume that the manifold is unchanged in the interaction space. Finally, we predict unobserved interactions between known drugs and targets by using the drug-drug linear neighborhood similarity and the known drug-target interactions. The experiments show that LPLNI can make high-accuracy predictions on four benchmark datasets using only known drug-target interactions. Furthermore, we consider incorporating chemical structures into LPLNI models. Experimental results demonstrate that the model with integrated information (LPLNI-II) achieves improved performance and outperforms other state-of-the-art methods. The known drug-target interactions are an important information source for computational predictions. The usefulness of the proposed method is demonstrated by cross validation and a case study.


Introduction
The identification of potential drug-target interactions is a crucial task in drug discovery, which helps to find novel targets for existing drugs or to identify targets for new drugs [1]. Wet experiments are reliable ways of determining interactions between drugs and targets, but they are cost-intensive and time-consuming [2]. In contrast, computational methods provide an economical and efficient alternative for predicting possible drug-target interactions with high reliability for further experiments.
Researchers have collected drug-target interaction data and constructed public databases, and the available data facilitate the development of drug-target interaction prediction methods. Traditional computational methods include molecular docking simulation methods and ligand-based methods. Though docking simulation methods are effective, they cannot work without three-dimensional (3D) structures of targets [3]. Ligand-based methods perform well when there are sufficient known ligands for a target protein, but such methods are not suitable for large-scale data [4].
In addition, several methods have been proposed based on properties of drugs and targets. Kuhn et al. [5] used molecular features and target proteins to predict drug-target relations. Garcia-Sosa et al. [6,7] introduced logistic regression and naïve Bayesian classifiers for classification of

Evaluation Metrics
To evaluate the performance of the prediction models, computational experiments were conducted on four benchmark datasets. Here, we adopted leave-one-out cross validation (LOOCV) to test model performance. That is, each drug-target pair was left out in turn, and the remaining pairs were used as the training set to build models for prediction. We repeated the procedure until every drug-target pair had been tested once.
AUC and AUPR are the most popular evaluation metrics in previous works. AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) versus the false positive rate (FPR). AUPR is the area under the precision-recall curve, which plots the precision (the ratio of true positives among the predicted positives) at each recall rate. Because there are far more negative instances than positive ones, and AUPR punishes false positives more heavily in evaluation [31], we adopted AUPR as the primary metric and also used AUC to evaluate models.
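As an illustration of the two metrics, the following minimal sketch computes both from a list of labels and predicted scores. It makes two assumptions that are not stated in the text: AUC is computed through its rank interpretation (the probability that a random positive is scored above a random negative), and AUPR is approximated by average precision, a common step-wise estimate of the area under the precision-recall curve. The function names and the toy data are hypothetical.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each true positive.

    Assumption: used here as a step-wise estimate of AUPR.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall level
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Toy example: two interacting pairs among six, one negative ranked second.
labels = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
auc = auc_score(labels, scores)
aupr = average_precision(labels, scores)
```

In an LOOCV setting, `scores` would hold the score predicted for each pair while that pair was held out.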

The Performances of the LPLNI Models
In this section, we evaluate the performance of the LPLNI models. Since we had the interaction profiles and fingerprints for drugs, we used each of these features to calculate the linear neighborhood similarities and then built LPLNI models. Here, we used the PubChem fingerprint for analysis.
LPLNI has two parameters, K and α: K is the number of neighbors used in the linear neighborhood similarity (LNS), and α is the probability of absorbing target information from neighbors. These parameters may influence the results, so we built LPLNI models using different parameter values. The number of drug neighbors K should be less than the number of drugs, and the four benchmark datasets, i.e., the nuclear receptors (NRs) dataset, the G-protein coupled receptors (GPCRs) dataset, the ion channels (ICs) dataset, and the enzymes (Es) dataset, contain 54, 223, 210, and 445 drugs, respectively. Therefore, we considered neighborhood sizes K of 10, 30, and 50 for the NRs dataset; 60, 120, and 180 for the GPCRs and ICs datasets; and 120, 240, and 360 for the Es dataset. In addition, the absorbing probability α should be greater than zero and smaller than one. Hence, we chose α from 0.1 to 0.9 with a step size of 0.1.
The drug-drug similarity is critical for LPLNI. To demonstrate the superiority of the linear neighborhood similarity, we also considered the cosine similarity, the Jaccard similarity, and the Gauss similarity, and applied label propagation to build similarity-based prediction models. The Gauss function calculates the similarity as $\mathrm{Gauss}(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / \sigma\right)$, where $\sigma$ is the bandwidth parameter; we set $\sigma = \sum_i |x_i| / n_d$ as in [23], where $x_i$ is the feature vector of the i-th drug, and $n_d$ is the number of drugs. All prediction models are evaluated using LOOCV. The performances of the different similarity-based models are shown in Figure 1. In general, the linear neighborhood similarity leads to better performance than the cosine similarity, the Gauss similarity, or the Jaccard similarity. A possible reason for the superior performance of the LPLNI models is that the linear neighborhood similarity describes the linear relationship of data points in the feature space. The linear neighborhood similarity is then smoothly transferred into the interaction space, and LPLNI uses label propagation to make predictions based on the same linear relationship of data points in the interaction space.
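For reference, the three baseline similarities can be sketched with NumPy as below. This is a sketch under stated assumptions rather than the implementation used in the experiments: features are the rows of a matrix `X`, the Jaccard similarity treats any positive entry as "present", and $|x_i|$ in the bandwidth is read as the Euclidean norm of $x_i$. All function names are hypothetical.

```python
import numpy as np


def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unscaled
    Xn = X / norms
    return Xn @ Xn.T


def jaccard_similarity(X):
    """Pairwise Jaccard similarity between binarized rows of X."""
    B = (X > 0).astype(float)
    inter = B @ B.T
    counts = B.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    # Pairs with an empty union get similarity 0.
    return np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)


def gauss_similarity(X):
    """exp(-||x_i - x_j||^2 / sigma) with sigma = sum_i |x_i| / n_d.

    Assumption: |x_i| is the Euclidean norm; sigma must be non-zero.
    """
    n_d = X.shape[0]
    sigma = np.linalg.norm(X, axis=1).sum() / n_d
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    return np.exp(-d2 / sigma)


# Toy binary features for three drugs (hypothetical data).
X = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
S_cos = cosine_similarity(X)
S_jac = jaccard_similarity(X)
S_gauss = gauss_similarity(X)
```

All three matrices are symmetric, in contrast to the linear neighborhood similarity, whose weights are generally not.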
Moreover, we observed that the LPLNI models based on the interaction profiles perform better than the LPLNI models based on the PubChem fingerprint, which indicates that the interaction profiles are a highly important information source for prediction.

The Performances of LPLNI Models with Integrated Information
In machine learning, combining diverse information of drugs can improve the performance of prediction models [32][33][34][35][36][37]. In Section 2.2, our study demonstrates that using only the interaction profiles of drugs can lead to high-accuracy prediction models; however, we still attempted to incorporate structural information of drugs to further improve accuracy.
Since we had nine different fingerprints, we first built individual LPLNI models based on the different fingerprint features and evaluated their usefulness. The leave-one-out cross validation performances of the prediction models are shown in Table 1. Among all fingerprints, the Daylight, Extended, and Hybridization fingerprints produce better performance than the others on the benchmark datasets. Although the fingerprints perform worse than the interaction profiles, they can still provide information for drug-target interaction prediction. According to their performances, the Daylight, Extended, and Hybridization fingerprints were selected for incorporation into the interaction profile-based models.
By using the strategy described in Section 3.4, we incorporated the three fingerprints into the interaction profile-based model and developed the prediction model with integrated information, named "LPLNI-II." As shown in Table 1, LPLNI-II produces better results than the individual feature-based models on the benchmark datasets; on the NRs dataset, for example, the AUPR improves from 0.9464 to 0.9492 and the AUC from 0.9532 to 0.9919, indicating the usefulness of combining various information of drugs.

Comparison with State-of-the-Art Methods
To the best of our knowledge, a great number of methods have been proposed to predict drug-target interactions. NetLapRLS [20] trained two classifiers separately, based on the chemical and the genomic information together with the interaction profiles, and then linearly combined the two classifiers to develop the prediction model. RLS-Kron [21] considered chemical structures, genomic sequences, and the interaction profiles, calculated the similarity with the Gaussian function, and used the Regularized Least Squares (RLS) classifier to build prediction models. The model based on the interaction profiles could produce high-accuracy performance, and the final prediction model was developed by integrating diverse information with the Kronecker product. These methods and our method use the interaction profiles as the primary information source to develop prediction models. To demonstrate the superiority of our method, we adopted NetLapRLS and RLS-Kron for comparison. All methods were evaluated by leave-one-out cross validation (LOOCV).
Since RLS-Kron and our method can make high-accuracy predictions using only the interaction profiles, we first built prediction models based on the interaction profiles and compared their performances. As shown in Table 2, the AUPR values of LPLNI are 0.9051, 0.9461, 0.9658, and 0.9464 on the enzymes (Es) dataset, the G-protein coupled receptors (GPCRs) dataset, the ion channels (ICs) dataset, and the nuclear receptors (NRs) dataset, respectively, all higher than those of RLS-Kron. In addition, LPLNI produces superior AUC performance on the GPCRs dataset, the ICs dataset, and the NRs dataset. Therefore, the interaction profile-based LPLNI model produces better results than the interaction profile-based RLS-Kron model on these benchmark datasets.
Further, we tested the performance of the LPLNI model with integrated information (LPLNI-II) by comparing LPLNI-II with RLS-Kron and NetLapRLS. As shown in Table 3, LPLNI-II outperforms the benchmark methods on the GPCRs dataset, the ICs dataset, and the NRs dataset. Therefore, LPLNI-II can integrate different information and make high-accuracy predictions.

Case Study
To test the potential of LPLNI in drug-target interaction prediction, we built models based on the known interactions of the Es dataset and then made predictions for unknown interactions. We checked the top 10 interactions predicted by our method and looked for evidence in SuperTarget [38] to support our discoveries. SuperTarget contains updated interactions from several drug databases, e.g., DrugBank and KEGG. As shown in Table 4, 4 predictions out of 10 are confirmed, and the results indicate that our method is capable of predicting novel interactions.

Datasets
There are several databases that provide information about drugs and drug-target interactions and that can be used for predicting unobserved drug-target interactions.
The PubChem database [39,40] provides chemical structures. The DrugBank database [41][42][43][44] is a comprehensive bioinformatics resource that includes targets, transporters, and enzymes of drugs. The KEGG database [45,46] is a collection of protein pathways that are associated with drug targets. BRENDA [47] is a comprehensive collection of enzyme and metabolic data, and is updated by extracting information from primary literature. SuperTarget [38] contains more than 2500 target proteins, which are annotated with about 7300 relations to 1500 drugs.
To study potential drug-target interactions, we used four benchmark datasets of drug-target interactions, which were compiled by Yamanishi et al. [48]. There are mainly four types of target proteins: enzymes (Es), ion channels (ICs), G-protein coupled receptors (GPCRs), and nuclear receptors (NRs). In Yamanishi's datasets, the drug-target interactions were classified into four subsets, which are associated with different types of targets. Table 5 lists the details of the four datasets.

Note to Table 5: the table reports the number of drugs, the number of targets, the number of known interactions, the average number of targets for each drug, and the average number of drugs for each target. Sparsity is the number of known interactions divided by the number of all possible interaction pairs.

Features
In order to build prediction models, we should represent drugs or targets as feature vectors. First, we present a feature named "interaction profile" for drugs (targets), which is derived from known interactions. As shown in Figure 2, let $\{d_1, d_2, \cdots, d_m\}$ be a set of given drugs and $\{t_1, t_2, \cdots, t_n\}$ be a set of given targets; their interactions can be formalized as an interaction network. The interaction profile of a drug (target) is a binary vector describing the presence or absence of interaction with every target (drug) in the network.
Since we collect drug structures from KEGG DRUG, we also represent drugs as feature vectors based on their substructures. Structural features of drugs are well known as fingerprints, which are bit vectors with elements indicating the frequencies or the existence of certain substructures. As listed in Table 6, there are different drug fingerprints; we adopt the Chemistry Development Kit (CDK) [49] to calculate these fingerprints and then use them as structural feature vectors.
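The interaction profile can be illustrated with a small sketch. The drug and target identifiers, the list of known pairs, and the helper name `build_interaction_matrix` are hypothetical and only serve to show how drug profiles, target profiles, and the sparsity of a dataset relate to one binary matrix.

```python
def build_interaction_matrix(drugs, targets, known_pairs):
    """Return a binary matrix (list of rows) with one row per drug
    and one column per target; an entry is 1 for a known interaction."""
    d_index = {d: i for i, d in enumerate(drugs)}
    t_index = {t: j for j, t in enumerate(targets)}
    Y = [[0] * len(targets) for _ in drugs]
    for d, t in known_pairs:
        Y[d_index[d]][t_index[t]] = 1
    return Y


# Hypothetical toy network: three drugs, two targets, three known interactions.
drugs = ["d1", "d2", "d3"]
targets = ["t1", "t2"]
known_pairs = [("d1", "t1"), ("d2", "t1"), ("d2", "t2")]

Y = build_interaction_matrix(drugs, targets, known_pairs)

drug_profile_d2 = Y[1]                        # row: which targets d2 interacts with
target_profile_t1 = [row[0] for row in Y]     # column: which drugs interact with t1
sparsity = len(known_pairs) / (len(drugs) * len(targets))
```

Note that a drug without any known interaction (here `d3`) has an all-zero profile, so a profile-based similarity carries no information for it.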

The Label Propagation Method with Linear Neighborhood Information
In this section, we introduce the label propagation method with linear neighborhood information (LPLNI), which has two steps: calculation of linear neighborhood similarity and label propagation-based prediction.
Let us introduce some notation. Given $n_d$ drugs and $n_t$ targets, their interactions are organized as an interaction matrix $Y = (Y_1, Y_2, \cdots, Y_{n_t}) \in \mathbb{R}^{n_d \times n_t}$, where $Y_i$ is the interaction profile of the i-th target. The entry $y_{ij} = 1$ if the i-th drug interacts with the j-th target; otherwise, $y_{ij} = 0$. Each drug can be represented by a p-dimensional feature vector $x_i$ (for example, the interaction profile), $i = 1, 2, \cdots, n_d$.

Linear Neighborhood Similarity
Roweis et al. [50] revealed that a data point and its neighbors are close to a locally linear patch of the manifold, and Wang et al. [51] discovered that each point can be optimally reconstructed by its neighbors. Based on these studies [50,51], we calculated the drug-drug similarity by considering how to reconstruct the data point through its neighbors, as per our previous work [52].
Here, we represent drugs as feature vectors $x_i$, $i = 1, 2, \cdots, n_d$, and take them as data points in the feature space. We reconstruct each data point $x_i$ by a linear combination of its neighbors and formulate the optimization problem as follows:

$$\min_{\omega_i} \; \Big\| x_i - \sum_{i_j : x_{i_j} \in N(x_i)} \omega_{i,i_j} x_{i_j} \Big\|^2 \quad \text{s.t.} \quad \sum_{i_j : x_{i_j} \in N(x_i)} \omega_{i,i_j} = 1, \;\; \omega_{i,i_j} \ge 0, \;\; j = 1, 2, \cdots, K, \qquad (1)$$

where $\|\cdot\|$ is the Euclidean norm, and $N(x_i)$ represents the set of $K$ ($0 < K < n_d$) nearest neighbors of $x_i$. $\omega_{i,i_j}$ represents the weight of $x_{i_j}$ for reconstructing $x_i$ and can be considered as the similarity between $x_i$ and $x_{i_j}$. Clearly, because the weights sum to one,

$$\Big\| x_i - \sum_{j=1}^{K} \omega_{i,i_j} x_{i_j} \Big\|^2 = \Big\| \sum_{j=1}^{K} \omega_{i,i_j} (x_i - x_{i_j}) \Big\|^2 = \omega_i^T G_i \omega_i,$$

where $\omega_i = (\omega_{i,i_1}, \cdots, \omega_{i,i_K})^T$ and $G_i$ is the $K \times K$ Gram matrix with entries $(G_i)_{jk} = (x_i - x_{i_j})^T (x_i - x_{i_k})$.
We notice that the matrix $G_i$ is likely to be singular if the K neighbors are close to each other. In this case, it is hard to obtain a unique solution of the optimization problem. In order to avoid the singular matrix and enhance the generalization capability, we introduce regularization for the reconstruction weights and present the optimization problem:

$$\min_{\omega_i} \; \omega_i^T G_i \omega_i + \lambda_i \|\omega_i\|^2 \quad \text{s.t.} \quad e^T \omega_i = 1, \;\; \omega_i \ge 0, \qquad (2)$$

where $\lambda_i$ is the regularization parameter and the column vector $e = (1, 1, \cdots, 1)^T$. The parameter $\lambda_i$ controls the relative weight of the reconstruction error $\omega_i^T G_i \omega_i$ and the regularization term $\|\omega_i\|^2$. Since the spectral norm is compatible and the Gram matrix $G_i$ is symmetric and positive semidefinite, we have

$$\omega_i^T G_i \omega_i \le \|G_i\|_2 \|\omega_i\|^2 = \rho(G_i) \|\omega_i\|^2, \qquad (3)$$

where $\rho(G_i)$ is the spectral radius of $G_i$. Here, we can estimate the value range of $\omega_i^T G_i \omega_i$ and $\|\omega_i\|^2$. Therefore, in practical use we can roughly set $\lambda_i$ by means of a small number $\varepsilon$ satisfying $\varepsilon \ll 1$. We set $\varepsilon$ to 0.01 for simplicity. We can use standard quadratic programming to solve Equation (2), and its solution is named the "linear neighborhood similarity" (LNS). We calculate the weights for all data points and concatenate them row by row to form the similarity matrix $W \in \mathbb{R}^{n_d \times n_d}$. The entire procedure of calculating LNS is summarized in Figure 3.
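The LNS computation can be sketched as follows. This is a minimal illustration under several assumptions, not the implementation used in the experiments: the regularization parameter is taken as $\lambda_i = \varepsilon \, \mathrm{tr}(G_i)$ (the exact setting is not recoverable from the text), neighbors are chosen by Euclidean distance, and the quadratic program is solved by projected gradient descent onto the simplex instead of a standard QP solver. The names `project_simplex` and `linear_neighborhood_similarity` are hypothetical.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)


def linear_neighborhood_similarity(X, K, eps=0.01, n_iter=2000):
    """Return W (n_d x n_d); row i holds the reconstruction weights of drug i."""
    n_d = X.shape[0]
    W = np.zeros((n_d, n_d))
    for i in range(n_d):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                       # a point is not its own neighbor
        nbrs = np.argsort(dist)[:K]            # indices of the K nearest neighbors
        D = X[i] - X[nbrs]                     # rows are x_i - x_{i_j}
        G = D @ D.T                            # local Gram matrix G_i
        lam = eps * np.trace(G)                # ASSUMPTION: lambda_i = eps * tr(G_i)
        A = G + lam * np.eye(K)
        L = 2.0 * np.linalg.eigvalsh(A)[-1]    # Lipschitz constant of the gradient
        step = 1.0 / L if L > 0 else 1.0
        w = np.full(K, 1.0 / K)                # start from uniform weights
        for _ in range(n_iter):
            # Gradient of w^T A w is 2 A w; project back onto the simplex.
            w = project_simplex(w - step * 2.0 * (A @ w))
        W[i, nbrs] = w
    return W


# Toy data: the first point is the midpoint of the second and third.
X = np.array([[0.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
W = linear_neighborhood_similarity(X, K=2)
```

Each row of `W` is non-negative and sums to one, the diagonal is zero, and `W` is in general not symmetric, which is why the graph built from it is directed.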

Label Propagation
Based on the drug-drug similarity, we formulate a directed graph, which uses drugs as nodes and similarities $W(i, j) = \omega_{ij}$ as edge weights. It is worth mentioning that usually $\omega_{ij} \ne \omega_{ji}$.
In the graph, the known interactions of drugs with a given target are taken as the initial label information of the nodes, and the label information is then updated iteratively. In each update, a node absorbs label information from its neighbors with probability $\alpha \in (0, 1)$ and retains its initial label information with probability $1 - \alpha$. The update process for the i-th label of the nodes at the k-th iteration is written as

$$F_i^{(k+1)} = \alpha W F_i^{(k)} + (1 - \alpha) Y_i, \qquad (5)$$

where $Y_i$ is the i-th column vector of the interaction matrix $Y$ (i.e., the i-th initial labels for all nodes). Further, we can formulate the update for all target labels in matrix form:

$$F^{(k+1)} = \alpha W F^{(k)} + (1 - \alpha) Y, \qquad (6)$$

where $F^{(k)} \in \mathbb{R}^{n_d \times n_t}$ represents the label matrix at the k-th iteration, and $F^{(0)} = Y$. We analyze the convergence of the iterative process, Equation (6), in Theorem 1.

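Theorem 1.
The iterative process, Equation (6), will converge to a solution $F$, that is

$$F = \lim_{k \to \infty} F^{(k)} = (1 - \alpha)(I - \alpha W)^{-1} Y, \qquad (7)$$

where $I \in \mathbb{R}^{n_d \times n_d}$ is the identity matrix.
Proof of Theorem 1. Noting that $F^{(0)} = Y$, the iterative process, Equation (6), can be rewritten as follows:

$$F^{(k)} = (\alpha W)^{k} Y + (1 - \alpha) \sum_{i=0}^{k-1} (\alpha W)^{i} Y.$$

Since the entries of $W$ are non-negative and each row of $W$ sums to one, and $0 < \alpha < 1$, the spectral radius of $\alpha W$ is at most $\alpha < 1$. Hence $\lim_{k \to \infty} (\alpha W)^{k} = 0$ and $\lim_{k \to \infty} \sum_{i=0}^{k-1} (\alpha W)^{i} = (I - \alpha W)^{-1}$, which gives Equation (7). $F \in \mathbb{R}^{n_d \times n_t}$ is the final label matrix, presenting the predicted scores for drug-target pairs.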
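The convergence result can be checked numerically. The sketch below assumes the update $F \leftarrow \alpha W F + (1 - \alpha) Y$ with a row-stochastic similarity matrix $W$ and compares the iterated label matrix with the closed form $(1 - \alpha)(I - \alpha W)^{-1} Y$. The toy matrices and function names are hypothetical.

```python
import numpy as np


def propagate_iterative(W, Y, alpha, n_iter=200):
    """Run the label propagation update for a fixed number of iterations."""
    F = Y.astype(float).copy()          # F^(0) = Y
    for _ in range(n_iter):
        F = alpha * (W @ F) + (1.0 - alpha) * Y
    return F


def propagate_closed_form(W, Y, alpha):
    """Limit of the iteration: (1 - alpha) * (I - alpha * W)^(-1) * Y."""
    n = W.shape[0]
    # Solving the linear system avoids forming the inverse explicitly.
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * W, Y)


# Toy similarity matrix: non-negative rows summing to one, zero diagonal,
# and deliberately asymmetric (directed graph).
W = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])
# Toy interaction matrix: three drugs, two targets; the third drug has no label.
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

F_iter = propagate_iterative(W, Y, alpha=0.5)
F_closed = propagate_closed_form(W, Y, alpha=0.5)
```

In this toy case the third drug, which has no known interaction, receives non-zero scores from its neighbors. For large datasets the closed form requires solving an $n_d \times n_d$ linear system, whereas the iteration only needs matrix products.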

LPLNI with Integrated Information
In this paper, we consider the interaction profile feature of drugs and targets and consider different fingerprint features of drugs. Therefore, we can calculate different similarities based on different features and then build different prediction models. Generally, combining diverse models can enhance predictive performances [53][54][55][56].
Here, we consider a nonlinear strategy to integrate different prediction models. Given n models, they produce n predicted scores for a drug-target pair, denoted as $F = (F_1, F_2, \cdots, F_n)$, and the integrated score is given by the following binomial logistic regression model in the conditional probability form:

$$P(y = 1 \mid F) = \frac{\exp\left(\sum_{k=1}^{n} \alpha_k F_k + b\right)}{1 + \exp\left(\sum_{k=1}^{n} \alpha_k F_k + b\right)}, \qquad (8)$$

where $\alpha_k \in \mathbb{R}$, $k = 1, 2, \cdots, n$, and $b \in \mathbb{R}$. The parameters are estimated by maximum likelihood estimation based on known interactions and their predicted scores.
In the prediction stage, the predicted scores from the n models are aggregated by Equation (8) to produce the final predictions.
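The integration step can be sketched as a small logistic regression fitted on the scores of the individual models. The sketch assumes plain gradient ascent on the mean log-likelihood as the maximum likelihood estimator (the text does not specify the optimizer), and the score matrix `S`, the labels `y`, and the function names are hypothetical toy inputs.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(S, y, lr=0.5, n_iter=2000):
    """Estimate weights a (one per model) and bias b by gradient ascent
    on the mean log-likelihood of the labels y given the scores S."""
    n_pairs, n_models = S.shape
    a = np.zeros(n_models)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(S @ a + b)
        err = y - p                      # gradient of the log-likelihood w.r.t. z
        a += lr * (S.T @ err) / n_pairs
        b += lr * err.mean()
    return a, b


def integrate_scores(S, a, b):
    """Integrated score P(y = 1 | F) for each row of model scores."""
    return sigmoid(S @ a + b)


# Hypothetical scores from two models for eight drug-target pairs.
S = np.array([[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.4, 0.7],
              [0.2, 0.3], [0.3, 0.1], [0.6, 0.2], [0.1, 0.2]])
y = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

a, b = fit_logistic(S, y)
p = integrate_scores(S, a, b)
```

Here each row of `S` plays the role of $F = (F_1, \cdots, F_n)$ for one drug-target pair, and `a` and `b` correspond to $\alpha_k$ and $b$ in the logistic model.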
We abbreviate the LPLNI model with integrated information as "LPLNI-II".

Conclusions
In this paper, we propose a drug-target interaction prediction method with linear neighborhood information, which can use known interactions to make high-accuracy predictions. Further, we incorporate structural information into the prediction models to improve performance. Computational experiments show that our method outperforms other state-of-the-art methods on the benchmark datasets. The potential of the method is also validated in the case study. In conclusion, the proposed method is a promising tool for drug-target interaction prediction.