Inferring Drug-Related Diseases Based on Convolutional Neural Network and Gated Recurrent Unit

Predicting novel uses for drugs from their chemical, pharmacological, and indication information helps reduce the cost and duration of drug development. Most previous prediction methods focused on integrating the similarity and association information of drugs and diseases. However, they tended to construct shallow prediction models, which makes it difficult to integrate this information deeply. Further, path information between drugs and diseases is important auxiliary information for association prediction, but it has not been deeply integrated. We present a deep learning-based method, CGARDP, for predicting drug-related candidate disease indications. CGARDP establishes a feature matrix by exploiting a variety of biological premises related to drugs and diseases. A novel model based on a convolutional neural network (CNN) and a gated recurrent unit (GRU) is constructed to learn the local and path representations of a drug-disease pair. The CNN-based framework on the left of the model learns the local representation of the drug-disease pair from their feature matrix. On the right, a GRU-based framework learns the path representation from the path information between the drug and the disease. Because different paths contribute differently to drug-disease association prediction, we construct an attention mechanism at the path level to identify the informative paths. Cross-validation results indicate that CGARDP performs better than several state-of-the-art methods. Further, CGARDP retrieves more real drug-disease associations in the top part of the prediction results, which is the part of most concern to biologists. Case studies on five drugs demonstrate that CGARDP can discover potential drug-related disease indications.


Introduction
In the past decades, research and development on new molecular entities has gradually increased, but the number of new molecular entities approved by the Food and Drug Administration (FDA) has been decreasing [1][2][3]. Traditional drug development often requires 10-15 years and an investment of $1.5 billion [4][5][6]. Because FDA-approved drugs have already undergone biological experiments, clinical trials, and safety evaluation, they are frequently repositioned. Repositioning an existing drug for a new indication or use requires only 6.5 years and costs $300 million, which is far less than the cost of developing a new drug [7][8][9].
Based on different biological premises and assumptions, researchers use different data types and biological preconditions to study drug repositioning. Research methods include retargeting based on binary vector. We defined $S_i = [sub_{i,1}, sub_{i,2}, \ldots, sub_{i,j}, \ldots, sub_{i,869}]$, where $sub_{i,j}$ is the $j$-th chemical substructure of the $i$-th drug. LRSSL [23] measured the drug similarities by calculating the cosine similarities between the chemical substructures of drugs. We also use $R = (R_{[i,j]}) \in \mathbb{R}^{N_r \times N_r}$ to represent drug similarity, where $R_{[i,j]}$ lies in the range $[0, 1]$ and is the similarity of $r_i$ and $r_j$, and $N_r$ denotes the number of drugs.
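The cosine similarity between binary substructure vectors can be illustrated with a minimal sketch. The toy matrix `S` (three hypothetical drugs with five substructure bits instead of 869) and the function name `cosine_similarity_matrix` are illustrative assumptions, not part of LRSSL or CGARDP.

```python
import numpy as np

def cosine_similarity_matrix(S):
    """Pairwise cosine similarity between the rows of a binary matrix S."""
    S = np.asarray(S, dtype=float)
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard: all-zero rows would otherwise divide by zero
    unit = S / norms
    return unit @ unit.T

# Toy data (hypothetical): 3 drugs x 5 substructure bits.
S = np.array([[1, 1, 0, 0, 1],
              [1, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]])
R = cosine_similarity_matrix(S)
print(np.round(R, 3))
```

Because the vectors are non-negative, every entry of `R` falls in [0, 1], which matches the range stated for the drug similarity matrix.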
To evaluate the similarity between diseases, we establish a directed acyclic graph (DAG) of semantic terms for each disease, which contains all semantic terms related to that disease. Wang et al. [28] successfully calculated the semantic similarity between diseases using their related terms in the DAG. LRSSL computed the similarities between diseases by using Wang's method, and we obtained the disease similarity from LRSSL. Let $D = (D_{[i,j]}) \in \mathbb{R}^{N_d \times N_d}$ be the similarity matrix of diseases, where $N_d$ denotes the number of diseases and each element lies between 0 and 1.
In light of the relationship between drugs and diseases, we add an edge between each associated drug and disease (Figure 1). Matrix $A \in \mathbb{R}^{N_r \times N_d}$ denotes the edge set; if $A_{ij} = 1$, drug $r_i$ is associated with disease $d_j$; otherwise, $A_{ij} = 0$.

Prediction Model Based on CNN and GRU
To learn the latent representation of the association between a drug and a disease, we propose a novel prediction model based on a CNN and a GRU. We apply the CNN module in the left part to learn the combinatorial representation of drug $r_i$ and disease $d_j$; further, we apply the GRU in the right part to capture the path representation between $r_i$ and $d_j$. Finally, the two representations are integrated by a combined strategy to obtain the final association score of $r_i$ and $d_j$. We take drug $r_1$ and disease $d_3$ as an example to describe the learning framework for the left and right parts, and we use $x$, $\mathbf{x}$, and $\mathbf{X}$ to represent a scalar, a vector, and a matrix, respectively.
A drug is more likely to be associated with a disease when more of the drugs similar to it are already associated with that disease. Take $r_1$ and $d_3$ as an example. As shown in Figure 2, the drugs similar to $r_1$ are $\{r_2, r_3, r_6\}$, and the drugs associated with $d_3$ are $\{r_2, r_6\}$. Because the drugs associated with $d_3$ are similar to $r_1$, the probability that $d_3$ is associated with $r_1$ is high. The first row of matrix $R$ records the similarity between $r_1$ and all drugs, and the third row of matrix $A^T$ records the associations between $d_3$ and all drugs.
Likewise, the more diseases similar to a given disease a drug is associated with, the more likely the drug is associated with that disease. Take $r_1$ and $d_3$ again. As shown in Figure 2, the diseases similar to $d_3$ are $\{d_1, d_2, d_5\}$, and $r_1$ is associated with $\{d_1, d_2\}$; therefore, $r_1$ and $d_3$ are more likely to be related. The third row of matrix $D$ records the similarity between $d_3$ and all diseases, and the first row of matrix $A$ records the associations between $r_1$ and all diseases. Therefore, we combine these feature representations into the feature matrix $X = (X_{[i,j]}) \in \mathbb{R}^{2 \times (N_r + N_d)}$ of $r_1$ and $d_3$, where $N_r$ is the number of drugs and $N_d$ is the number of diseases. The first row of $X$ is the feature vector of drug $r_1$, and the second row is the feature vector of disease $d_3$.
Figure 2. Construction of the feature matrix by integrating the similarities and associations.
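The construction of the 2 × (N_r + N_d) feature matrix can be sketched as follows. The helper name `pair_feature_matrix`, the zero-based indices, and the toy matrices are assumptions for illustration; the concatenation order (drug-side block first, disease-side block second) follows the rows of R, A, A^T, and D named in the text.

```python
import numpy as np

def pair_feature_matrix(R, D, A, i, j):
    """Feature matrix of drug i and disease j (zero-based indices)."""
    R, D, A = (np.asarray(M, dtype=float) for M in (R, D, A))
    # Row for the drug: similarity to all drugs, then associations with all diseases.
    drug_row = np.concatenate([R[i], A[i]])
    # Row for the disease: associations with all drugs, then similarity to all diseases.
    disease_row = np.concatenate([A.T[j], D[j]])
    return np.vstack([drug_row, disease_row])

# Toy data (hypothetical): 3 drugs, 2 diseases.
R = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
D = [[1.0, 0.4], [0.4, 1.0]]
A = [[1, 0], [0, 1], [1, 1]]
X = pair_feature_matrix(R, D, A, i=0, j=1)
print(X)
```

Each row has length N_r + N_d, so the two rows can be stacked and passed to a 2-D convolution.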

Convolutional Layer
As shown in Figure 3, to capture the boundary information of $X$, we first apply a padding operation to obtain a new matrix named $X'$. Then, we use $X'$ as the input to the left convolution module [29] to learn the latent representation of a drug-disease pair. We assume that the filter size is set as $W_f$ and $W_h$ for each convolutional layer. When there are $n_{conv}$ filters, the convolution filter $W_{conv} \in \mathbb{R}^{W_f \times W_h \times n_{conv}}$ is applied to $X'$. Then, we obtain the feature matrix $Z_{conv} \in \mathbb{R}^{(2 - W_h + 2p + 1) \times (d - W_f + 2p + 1) \times n_{conv}}$, where $p$ is the number of padding layers added to the feature matrix in the CNN model, and $d$ is the length of $X$. $X'(i, j)$ is the element at the $i$-th row and the $j$-th column of $X'$, and $X_{conv}(k, i, j)$ represents the region covered by the filter when the $k$-th filter slides to $X'(i, j)$. The formal definitions of $X_{conv}(k, i, j)$ and $Z_{conv,k}(i, j)$ are as follows:

$$X_{conv}(k, i, j) = X'(i : i + W_h - 1,\; j : j + W_f - 1),$$

$$Z_{conv,k}(i, j) = f\big(W_{conv}(k, :, :) * X_{conv}(k, i, j) + b_{conv}(k)\big),$$

where $W_{conv}(k, :, :)$ is the sliding-window weight matrix of the $k$-th filter, $b_{conv}$ is the bias vector, $f$ is a ReLU function [30], and $Z_{conv,k}(i, j)$ is the element at the $i$-th row and $j$-th column of the $k$-th feature map $Z_{conv,k}$.
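A minimal NumPy sketch of one padded convolutional layer with ReLU follows. It assumes zero padding, stride 1 (which is what the output size stated above implies), and separate row and column padding amounts; the function name `conv2d_relu` and the toy sizes are illustrative and not the authors' implementation.

```python
import numpy as np

def conv2d_relu(X, W, b, pad):
    """X: (rows, cols); W: (n_filters, fh, fw); b: (n_filters,); pad: (ph, pw)."""
    ph, pw = pad
    Xp = np.pad(X, ((ph, ph), (pw, pw)))   # zero padding on every side
    n, fh, fw = W.shape
    out_h = Xp.shape[0] - fh + 1           # rows - fh + 2*ph + 1
    out_w = Xp.shape[1] - fw + 1           # cols - fw + 2*pw + 1
    Z = np.zeros((n, out_h, out_w))
    for k in range(n):
        for i in range(out_h):
            for j in range(out_w):
                region = Xp[i:i + fh, j:j + fw]   # window under the k-th filter
                Z[k, i, j] = max(0.0, float(np.sum(W[k] * region) + b[k]))  # ReLU
    return Z

# Toy input (hypothetical): a 2 x 6 feature matrix and one 2 x 3 filter.
X = np.ones((2, 6))
W = np.ones((1, 2, 3))
b = np.zeros(1)
Z = conv2d_relu(X, W, b, pad=(1, 1))
print(Z.shape)
```

The explicit loops are kept for readability; a deep learning framework would perform the same computation in a vectorized form.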

Pooling Layer
The feature maps $Z_{conv,k}$ are passed to a pooling layer for downsampling, which removes unimportant sample data and thus further reduces the number of parameters. We use max pooling to complete the pooling operation and set its sampling window size to $W_m \times W_p$. The pooling outputs of all the feature maps are $Z_{convpool,k}$: where $Z_{convpool,k}$ is the $k$-th feature map, $Z_{convpool,k}(i, j)$ is the element at its $i$-th row and $j$-th column, and $p$ is the number of padding layers in $Z_{conv,k}$. The resulting feature representation of the node pair, $Z_{convpool,k}(i, j)$, is flattened and sent to the fully connected layer. The output of the fully connected layer is taken as the latent association representation $c$ of the drug-disease pair: where $\sigma$ is a sigmoid function [31], $W_l$ is the weight matrix of the fully connected layer, and $\cdot$ is the dot product symbol.
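The pooling, flattening, and fully connected steps can be sketched as below. Two assumptions are made that the text does not state: the pooling windows do not overlap (stride equal to the window size), and the fully connected layer has no bias term because only `W_l` is named. The function names are hypothetical.

```python
import numpy as np

def max_pool(Z, wm, wp):
    """Non-overlapping max pooling; Z has shape (n_maps, rows, cols)."""
    n, rows, cols = Z.shape
    oh, ow = rows // wm, cols // wp
    Z = Z[:, :oh * wm, :ow * wp]               # drop any incomplete border window
    return Z.reshape(n, oh, wm, ow, wp).max(axis=(2, 4))

def dense_sigmoid(Z_pool, W_l):
    """Flatten the pooled maps and apply a sigmoid-activated linear layer."""
    z = Z_pool.reshape(-1)
    return 1.0 / (1.0 + np.exp(-(W_l @ z)))

# Toy feature map (hypothetical): one 4 x 4 map pooled with a 2 x 2 window.
Z = np.arange(16, dtype=float).reshape(1, 4, 4)
P = max_pool(Z, 2, 2)
c = dense_sigmoid(P, np.zeros((3, 4)))
print(P, c)
```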

GRU with Attention-Based Path Encoder on the Right
For the prediction of the novel association between drug r i and disease d j , the different paths between the two nodes contribute differently to their associations. Thus, a path-level attention mechanism is introduced to select more important paths for the association between r i and d j . This mechanism consists of two parts: a path encoder and a path attention layer, as shown in Figure 3.

GRU-Based Sequence Encoder
The GRU module [32] tracks the state of paths with a gating mechanism instead of using separate memory cells. There are two types of gates: the reset gate $r_t$ and the update gate $z_t$. These gates jointly control how much information is updated to the state. To illustrate how the state is updated, we take $r_1$ and $d_3$ as an example. There are four paths between $r_1$ and $d_3$, which form the set $P_{13} = \{P^1_{13}, P^2_{13}, P^3_{13}, P^4_{13}\}$, where $P^i_{13}$ is the $i$-th path in $P_{13}$. Each node in a path inputs its corresponding feature vector $x_t$, and the new state $h_t$ of the $t$-th node is calculated as:

$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,$$

where $h_{t-1}$ is the state of the $(t-1)$-th node in the path, and $\tilde{h}_t$ is the candidate state of the current node. This is a linear interpolation between the previous state $h_{t-1}$ and the candidate state $\tilde{h}_t$ computed with new information. The update gate $z_t$ controls the extent to which the previous node information is introduced into the current state. The closer $z_t$ is to 1, the more state information of the previous node is brought in. $z_t$ is updated as:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$

where $x_t$ is the vector at the $t$-th node, $W_z$ is the weight matrix of the node vector, $U_z$ is the weight matrix of the previous state, and $b_z$ is a bias vector. The candidate state $\tilde{h}_t$ is calculated as:

$$\tilde{h}_t = \tanh\big(W_h x_t + r_t \odot (U_h h_{t-1}) + b_h\big),$$

where $r_t$ is the reset gate, which controls how much the past state contributes to the candidate state. If $r_t$ is zero, the previous state is forgotten entirely. $W_h$ and $U_h$ are the weight matrices of the candidate state, $b_h$ is the bias vector of the candidate state, and $\odot$ is the Hadamard product symbol. The reset gate is updated as:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$

where $\sigma$ is the sigmoid function, $W_r$ is the weight matrix of the node vector $x_t$ in the reset gate, $U_r$ is the weight matrix of the previous state $h_{t-1}$, and $b_r$ is the bias vector.
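A single GRU update can be sketched in NumPy as follows. The sketch adopts the convention described in the text, in which an update gate close to 1 keeps more of the previous state. The tanh activation and the placement of the reset gate on `U_h @ h_prev` are assumptions taken from the common GRU formulation, and the parameter dictionary `p` and function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update for the node vector x_t given the previous state h_prev."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    h_cand = np.tanh(p["W_h"] @ x_t + r * (p["U_h"] @ h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * h_cand   # z near 1 keeps the previous state

def encode_path(node_vectors, p, hidden):
    """Run the GRU over the nodes of one path, starting from a zero state."""
    h = np.zeros(hidden)
    for x_t in node_vectors:
        h = gru_step(x_t, h, p)
    return h

# Toy parameters (hypothetical): input size 3, hidden size 2, all weights zero.
p = {k: np.zeros((2, 3)) for k in ("W_z", "W_r", "W_h")}
p.update({k: np.zeros((2, 2)) for k in ("U_z", "U_r", "U_h")})
p.update({k: np.zeros(2) for k in ("b_z", "b_r", "b_h")})
h_half = gru_step(np.ones(3), np.ones(2), p)
```

With all-zero parameters both gates equal 0.5 and the candidate state is zero, so the new state is half of the previous one, which makes the interpolation easy to check by hand.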

GRU-Based Path Encoder
We assume that $P_{ij}$ is the path set of drug $r_i$ and disease $d_j$, and that its $t$-th path $P^t_{ij}$ consists of a sequence of nodes. We use a bidirectional GRU module to integrate the information in both directions of the path and combine the context information of the path nodes. A bidirectional GRU module contains a forward $\overrightarrow{GRU}$ module, which reads from the first node to the last node, and a backward $\overleftarrow{GRU}$ module, which reads from the last node to the first node, as:
We concatenate the forward state $\overrightarrow{h}^t_{ij}$ and the backward state $\overleftarrow{h}^t_{ij}$ to obtain the representation $h^t_{ij} = [\overrightarrow{h}^t_{ij}, \overleftarrow{h}^t_{ij}]$ of the $t$-th path of $r_i$ and $d_j$.

Path Attention
To distinguish the different contributions that multiple paths from $r_i$ to $d_j$ make to the association prediction, we introduce an attention mechanism that weighs the importance of each path. The total path information $g_{ij}$ is formulated as the weighted sum of all paths, and it is expressed as:

$$g_{ij} = \sum_{t} \alpha^t_{ij} h^t_{ij},$$

where $h^t_{ij}$ is the representation vector of the $t$-th path from $r_i$ to $d_j$, and $\alpha^t_{ij}$ is the attention weight of $h^t_{ij}$, which measures the importance of the $t$-th path. We introduce a path vector $u_p$ to measure the importance of the path. The attention weight of each path can be defined as: where $u^t_{ij}$ is the score function of the corresponding path, i.e., the score of the importance of the path, $W_t$ is the weight matrix, $b_t$ is the bias vector, $\alpha^t_{ij}$ is the attention weight of the $t$-th path, $u_p$ is the weight vector, and $(u_p)^T$ indicates its transpose.
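The path-level attention can be sketched as follows, using the symbols named in the text (`W_t`, `b_t`, `u_p`). The tanh projection and the softmax normalization are assumptions borrowed from common attention formulations, since the text only names the parameters; the function name and toy values are hypothetical.

```python
import numpy as np

def path_attention(H, W_t, b_t, u_p):
    """H: (n_paths, d) path representations. Returns (g, alpha)."""
    U = np.tanh(H @ W_t.T + b_t)          # score vector of each path (assumed tanh)
    scores = U @ u_p                      # alignment with the path vector u_p
    scores = scores - scores.max()        # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over paths
    g = alpha @ H                         # weighted sum of path representations
    return g, alpha

# Toy data (hypothetical): 3 paths with 2-dimensional representations.
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
g, alpha = path_attention(H, np.zeros((4, 2)), np.zeros(4), np.ones(4))
print(alpha, g)
```

With zero projection weights every path receives the same score, so the attention weights are uniform and `g` reduces to the mean of the path representations.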

Combined Strategy
To fully combine the representation of the node pair $r_1$ and $d_3$ learned by the left part with the path representation learned by the right part, we design a combined strategy for determining the association score of $r_1$ and $d_3$. We add a softmax classifier to ensure that both the left and right parts have predictive capability and to further improve classification performance. The corresponding loss is defined as: where $c_{ij}$ is the representation of drug $r_i$ and disease $d_j$ learned by the CNN on the left, $g_{ij}$ is the representation learned on the right, $W_c$ and $W_v$ are the weight matrices of the left and right parts, respectively, $b_c$ and $b_v$ are the bias vectors, and $y_{real}$ is the actual association label between the drug and the disease, where 1 means that the drug is associated with the disease and 0 means that the association is unknown. $score^0_c$ is the probability that there is no association between drug $r_i$ and disease $d_j$, and $score^1_c$ is the probability that there is an association between them. Finally, $loss_1$ and $loss_2$ are the cross-entropy losses between the predicted probabilities of the model and the true association labels. The final loss function of our model is the weighted sum of $loss_1$ and $loss_2$: where $\alpha_1$ is a hyperparameter that weighs the contributions of $loss_1$ and $loss_2$. Our final score is:

Reducing Overfitting
Our neural network has nearly 50 million parameters, which turns out to be too many to learn without considerable overfitting. Thus, we introduce the following measures to prevent overfitting.

Dropout
Combining the results of many different models is an excellent way to reduce test errors [33,34], but this approach is too computationally expensive for large neural networks that take several days to train. There is, however, a very efficient approach to model combination that costs only about a factor of two during training. The recently presented technique, called "dropout" [35], consists of setting the output of each hidden neuron to zero with probability 0.5. The neurons that are "dropped out" in this way do not participate in the forward pass or in back-propagation. Thus, every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces intricate co-adaptations of neurons, because a neuron cannot depend on the existence of other specific neurons. Therefore, it is forced to learn more robust, beneficial features in conjunction with many different random subsets of the other neurons. At test time, we multiply the output of all the neurons by 0.5, which reasonably approximates the geometric mean of the predictive distributions produced by the exponentially many dropout networks.
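The dropout procedure described above can be sketched in a few lines: a random binary mask during training and a fixed scaling by 0.5 at test time. The function names and the seeded generator are illustrative assumptions.

```python
import numpy as np

def dropout_train(h, rng, p_drop=0.5):
    """Zero each activation independently with probability p_drop."""
    mask = rng.random(h.shape) >= p_drop   # True where the neuron is kept
    return h * mask, mask

def dropout_test(h, p_drop=0.5):
    """At test time keep every neuron and scale its output by 1 - p_drop."""
    return h * (1.0 - p_drop)

rng = np.random.default_rng(0)
activations = np.ones(10000)
dropped, mask = dropout_train(activations, rng)
scaled = dropout_test(activations)
print(mask.mean(), scaled[:3])
```

Because a dropped neuron outputs zero, it contributes nothing to the forward pass and receives no gradient, which is the behaviour the paragraph describes.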

Evaluation Metrics
In this study, we applied five-fold cross-validation to evaluate the performance of our method. All known drug-disease associations were treated as positive samples and divided randomly into five equal positive subsets. At the same time, a matching number of unknown associations were randomly selected and divided into five negative subsets. In each fold, four positive subsets and four negative subsets were selected for training, and the remaining subsets were used for testing. We trained the prediction model on the known associations in the training set and predicted the associations in the testing set. Training and testing were repeated five times, and the average performance was adopted. In addition, we recalculated the drug similarity each time the four positive training subsets were selected. Then, the testing set for each drug was ranked; the higher a candidate disease was ranked, the greater the possibility of an association between the drug and the disease.
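The splitting procedure can be sketched with the standard library only. The function name `five_fold_splits`, the fixed seed, and the toy drug-disease pairs are assumptions for illustration; the recalculation of drug similarity per fold is not shown.

```python
import random

def five_fold_splits(positives, unknowns, n_folds=5, seed=0):
    """Yield (train, test) lists of ((drug, disease), label) for each fold."""
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    # Sample as many unknown pairs as there are positives to act as negatives.
    neg = rng.sample(list(unknowns), len(pos))
    pos_folds = [pos[f::n_folds] for f in range(n_folds)]
    neg_folds = [neg[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = [(pair, 1) for pair in pos_folds[f]]
        test += [(pair, 0) for pair in neg_folds[f]]
        train = [(pair, 1) for g in range(n_folds) if g != f for pair in pos_folds[g]]
        train += [(pair, 0) for g in range(n_folds) if g != f for pair in neg_folds[g]]
        yield train, test

# Toy data (hypothetical): 10 known and 50 unknown drug-disease pairs.
positives = [(i, 0) for i in range(10)]
unknowns = [(i, 1) for i in range(50)]
splits = list(five_fold_splits(positives, unknowns))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```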
The CGARDP model was used to obtain scores for the associations in the testing set, and the associations were ranked in descending order of their scores. Given a threshold θ, samples with scores higher than θ were considered positive, and those with scores below θ were considered negative. We calculated the true positive rate (TPR), false positive rate (FPR), precision, and recall at each θ as follows:

$$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{TN + FP}, \quad Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN},$$

where TP is the number of positive samples that are correctly identified, TN is the number of negative samples that are correctly identified, FP is the number of negative samples that are incorrectly predicted as positive, and FN is the number of positive samples that are incorrectly identified as negative. Thus, the receiver operating characteristic (ROC) curve [36] can be drawn using the TPRs and FPRs obtained under different θ. The area under the curve (AUC) is called the drug-related AUC value. The average AUC of all drugs was used to assess the overall performance of our method. Because the ratio of positive to negative samples is 1:169, there is a large class imbalance. Under class imbalance, the positive cases are of primary concern, and both indicators of the precision-recall (PR) curve focus on positive samples; therefore, the PR curve is more credible than the ROC curve [1]. Thus, we also used the PR curve to measure performance. Precision is the proportion of samples predicted as positive that are truly positive, and recall is the proportion of all actual positive samples that are correctly identified. In addition, biologists usually select the higher-ranked candidate diseases for biological verification, and therefore, the top of the ranked candidate list should contain as many positive samples as possible. Therefore, we adopted another performance metric, i.e., the average recall rate in the top k (k = 30, 60, 90, 120 . . . ). A higher recall rate means that a larger proportion of drug-related diseases is correctly retrieved, that is, more positive samples are successfully identified and the predictive performance is better.
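The threshold-based rates and the top-k recall can be computed as in the following sketch. The function names and the toy scores are hypothetical; ratios with a zero denominator are reported as 0.0 by convention here.

```python
def rates(scores, labels, theta):
    """Return (TPR, FPR, precision, recall) for predictions with score > theta."""
    tp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > theta and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= theta and y == 0)
    ratio = lambda a, b: a / b if b else 0.0
    tpr = ratio(tp, tp + fn)
    fpr = ratio(fp, fp + tn)
    precision = ratio(tp, tp + fp)
    return tpr, fpr, precision, tpr   # recall is the same quantity as TPR

def recall_at_k(scores, labels, k):
    """Fraction of all positives that appear among the k highest scores."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    hits = sum(y for _, y in ranked[:k])
    return hits / n_pos if n_pos else 0.0

# Toy ranking (hypothetical) for one drug.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0]
tpr, fpr, precision, recall = rates(scores, labels, theta=0.5)
print(tpr, fpr, precision, recall, recall_at_k(scores, labels, 3))
```

Sweeping `theta` over the sorted scores yields the points of the ROC and PR curves for that drug.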

Comparison with Other Methods
To evaluate the performance of the CGARDP model, we compared it with several state-of-the-art methods including HGBI [37], MBiRW [24], LRSSL [23], and SCMFDD [25]. HGBI builds a three-layer heterogeneous network that combines drug, disease, and target information for prediction. MBiRW builds a two-layer network of drugs and diseases and performs drug repositioning by walking on the drug-disease network. LRSSL, a Laplacian regularized sparse subspace learning method, combines the chemical substructure of the drug, the target domain, and the target annotation for prediction. SCMFDD calculates the Jaccard similarity of the chemical substructures of drugs and the semantic similarity of diseases to predict novel drug-disease associations using matrix factorization.
For CGARDP and the other methods in the comparison, the parameters involved must be adjusted to optimize prediction performance. In our method, the filter sizes $W_f$ and $W_h$ of the left convolutional neural network are 3 and 20, respectively. The network has two convolutional layers; the first contains 16 convolution kernels, and the second contains 32 convolution kernels, that is, $n_{conv}$ is 16 and 32. The padding parameter $P$ is (1, 10). The size of the sampling window $(W_m, W_p)$ is set to (2, 2), and the hyperparameter $\alpha_1$ is 2. For fairness, the parameters of the other methods are those recommended in the corresponding literature ($\alpha = 0.4$ for HGBI, $\alpha = 0.3$ for MBiRW, $\mu = 0.01$, $\lambda = 0.01$ for LRSSL, and $\mu = 2^0$, $\lambda = 2^2$ for SCMFDD).
As shown in Figure 4A and Table 1, CGARDP achieves the best average performance over all 763 drugs that we considered (AUC of ROC curve = 0.956). The AUC-ROC values of the other methods, i.e., HGBI, MBiRW, LRSSL, and SCMFDD, for the 763 drugs are 0.683, 0.837, 0.838, and 0.726, respectively. In particular, CGARDP outperforms HGBI by 27.3%, MBiRW by 11.9%, LRSSL by 11.8%, and SCMFDD by 23% in absolute terms. Further, we list the AUCs of all five methods on 15 well-characterized drugs, each of which has more than 15 known related diseases. CGARDP yields the best average performance in terms of AUC and achieves the best performance for 11 of the 15 drugs. Among the other methods, LRSSL performed second best, as it took full advantage of multiple kinds of drug similarity. MBiRW achieved almost the same AUC as LRSSL; however, its AUPR was 7% lower than that of LRSSL. This difference in performance is possibly because MBiRW focuses on the topology information of the network. SCMFDD and HGBI perform considerably worse than LRSSL and MBiRW; however, SCMFDD performs 4.5% better than HGBI. This difference can be attributed to the fact that SCMFDD relies on the calculation of similarity, while HGBI constructs a three-layer network that introduces drug-protein information but does not make full use of this information. Compared with the other methods, the superiority of CGARDP is due to its deep learning of the node-pair representation of the drug-disease association and its attention-based learning of the path representation.

Because the number of unknown drug-disease associations far exceeds the number of known associations, there is a serious imbalance in the data. The PR curve reflects performance better than the ROC curve when there is a serious imbalance between the positive and negative samples. Figure 4B and Table 2 show the AUPR for the average performance over all drugs, and CGARDP produces the best average performance on these drugs (AUC of PR curve = 0.425). Its average AUPR is 41.3%, 37.8%, 30.8%, and 41.1% higher than those of HGBI, MBiRW, LRSSL, and SCMFDD, respectively. For the 15 well-characterized drugs, CGARDP demonstrates the best performance for 11 of these drugs. In addition, 265 diseases were associated with only one drug, and 116 diseases were associated with two drugs. Therefore, CGARDP can be used for diseases associated with only one or two drugs.
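The reported AUC margins are absolute differences between the ROC AUC values, which can be checked with a short calculation. The dictionary below only restates the AUC values given in the text; the helper name `margin_in_points` is hypothetical.

```python
# ROC AUC values as reported in the text for the 763 drugs.
auc = {"CGARDP": 0.956, "HGBI": 0.683, "MBiRW": 0.837, "LRSSL": 0.838, "SCMFDD": 0.726}

def margin_in_points(ours, other):
    """Absolute AUC difference expressed in percentage points."""
    return round((auc[ours] - auc[other]) * 100, 1)

margins = {name: margin_in_points("CGARDP", name) for name in auc if name != "CGARDP"}
print(margins)
```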
For the prediction results on all 763 drugs, we performed a Wilcoxon test to evaluate whether CGARDP's performance is significantly better than that of the other methods. The statistical results (Table 3) indicate that CGARDP yields significantly better performance under the p-value threshold of 0.05 in terms of not only AUCs but also AUPRs. A higher recall rate among the top k ranked candidates means that more real drug-disease associations are correctly identified. The average recall rates of the top k samples on all 763 drugs are shown in Figure 5. CGARDP consistently outperforms the other methods at various k values; its recall rate is 89.9% in the top 30, 93.8% in the top 60, and 97.1% in the top 120. Before the top 90, LRSSL performed better than MBiRW, and then MBiRW surpassed LRSSL. The recall rates of LRSSL are 63.4%, 71.3%, and 77.7% in the top 30, 60, and 120, respectively, and those of MBiRW are 53.1%, 66.3%, and 79.3%. A possible reason for these different rankings is that MBiRW makes better use of global topology information, while LRSSL focuses more on neighbor node information. HGBI and SCMFDD have relatively close recall rates at different k values. The recall rates of HGBI for k values of 30, 60, and 120 are 28.8%, 41.1%, and 54.9%, respectively, and those of SCMFDD are 30.6%, 45.0%, and 57.8%. Ultimately, we can conclude that CGARDP is indeed better than the other methods at discovering potential disease indications for drugs.

Case Studies on Ciprofloxacin, Ceftriaxone, Ofloxacin, Ampicillin, and Levofloxacin
After evaluating the performance of the method with five-fold cross-validation, we used all known association data as training data to predict unknown drug-disease associations. Case studies of five drugs (Ciprofloxacin, Ceftriaxone, Ofloxacin, Ampicillin, and Levofloxacin) demonstrate the ability of CGARDP to detect high-quality candidate diseases for drugs. The analysis of the top ten candidates for each drug is presented in detail in Table 4. First, DrugBank is a database of drug pharmacology, indications, drug interactions, and clinical trials for diseases. The Comparative Toxicogenomics Database (CTD) contains important information about the effects of drugs on diseases. The Centers for Disease Control and Prevention (CDC) records the trends and preventive treatments of common diseases. In Table 4, 12 candidate diseases are included in DrugBank, nine candidates are included in the CTD, and two candidates are included in the CDC records; this shows that these candidate diseases are indeed related to the corresponding drugs. Second, ClinicalTrials.gov (https://clinicaltrials.gov/) is a database of clinical trials run by the National Institutes of Health (NIH), and it contains clinical trials of various drugs and related diseases. PubChem (https://pubchem.ncbi.nlm.nih.gov/) is a database of chemical molecules supported by the NIH, and it stores biochemical experimental data and structural information on compounds, including drugs and their biological activity data. A total of 21 candidate diseases in Table 4 were included in ClinicalTrials.gov, and 7 candidates were included in PubChem, indicating that these candidates are supported by experiments. In addition, one candidate marked "literature" was supported by the literature.
The addition of ceftriaxone to metronidazole has a synergistic effect, which can reduce the production of toxins and promote wound healing; thus, the combination of metronidazole and ceftriaxone is preventive. Good efficacy has been observed in tetanus patients with sepsis and pneumonia, confirming that Ceftriaxone affects the candidate disease tetanus.
In addition, the CTD also contains potential associations that are inferred from the literature, which are labelled "Inferred". Four candidate diseases in Table 4 were inferred from the CTD literature, indicating that the drug is more likely to be associated with the candidate disease. The case studies of candidate diseases for the five drugs confirm that CGARDP is indeed able to detect potential candidate diseases for drugs.

Prediction of Novel Drug-Disease Associations
Having validated CGARDP through cross-validation and case studies, we applied it to predict novel drug-disease associations. All known drug-disease associations were used to train CGARDP's prediction model, and the potential candidate associations obtained with the trained model are listed in Supplementary Table S1.

Conclusions
A novel method based on CNN and GRU, CGARDP, was proposed to predict potential drug-disease associations. The CNN-based framework deeply integrates the similarity and association information of a drug-disease pair. The GRU-based framework deeply learns the path information between the drug and the disease. CGARDP discriminates the different contributions of the paths by constructing an attention mechanism and learns a more informative representation of the drug-disease pair. The experimental results show that CGARDP outperforms other methods in terms of both AUCs and AUPRs. The case studies on five drugs confirm that CGARDP is able to retrieve potential candidate drug-disease associations.