Drug Repositioning Based on the Enhanced Message Passing and Hypergraph Convolutional Networks

Drug repositioning, an important method of drug development, is utilized to discover investigational drugs beyond their originally approved indications, expand the application scope of drugs, and reduce the cost of drug development. With the emergence of increasing numbers of drug-disease-related biological networks, it remains a challenge to effectively fuse biological entity data and accurately predict drug-disease associations. This paper proposes a new drug repositioning method named EMPHCN based on enhanced message passing and hypergraph convolutional networks (HGCN). It first constructs homogeneous multi-view information from multiple drug similarity features and then extracts the intra-domain embedding of drugs through the combination of HGCN and a channel attention mechanism. Second, inter-domain information of known drug-disease associations is extracted by graph convolutional networks combining node and edge embedding (NEEGCN), and a heterogeneous network composed of drugs, proteins and diseases is built as an important auxiliary to enhance the inter-domain message passing of drugs and diseases. Besides, the intra-domain embedding of diseases is also extracted through HGCN. Ultimately, the intra-domain and inter-domain embeddings of drugs and diseases are integrated as the final embedding for calculating the drug-disease correlation matrix. Through 10-fold cross-validation on several benchmark datasets, we find that the AUPR of EMPHCN reaches 0.593 (T1) and 0.526 (T2), and the AUC achieves 0.887 (T1) and 0.961 (T2), which shows that EMPHCN has an advantage over other state-of-the-art prediction methods. Concerning new disease association prediction, the AUC of EMPHCN through five-fold cross-validation reaches 0.806 (T1) and 0.845 (T2), which are 4.3% (T1) and 4.0% (T2) higher than the second-best existing method, respectively.
In the case study, EMPHCN also achieves satisfactory results in real drug repositioning for breast carcinoma and Parkinson’s disease.


Introduction
The process of developing a new drug typically takes 15-20 years, with high costs, a low success rate, and substantial risk from idea identification to marketing approval [1]. Drug repositioning, an important part of drug development, refers to the discovery of new and reliable indications for existing drugs. It enables pharmaceutical researchers to save time and reduce costs, thus becoming an effective strategy for the development of new drugs [2,3]. However, drug repositioning still requires intensive study due to unknown and potentially complex pharmacology and biology.
Currently, drug repositioning methods mainly include network-based diffusion analysis, machine learning-based methods, and deep learning-based methods. Among the latter, some approaches build inter-domain and intra-domain feature extraction modules based on the drug and disease similarity networks, considering various kinds of network topology information; the inter-domain and intra-domain embeddings are then fused in parallel to obtain the drug and disease embedding representations. Although GCN achieves satisfactory performance for drug-disease association prediction, it has difficulty capturing the higher-order features of nodes in the graph. The hypergraph convolution model [23], on the other hand, can effectively address this problem and has drawn wide attention in recent years.
In order to effectively extract information about the higher-order features of nodes in drug- and disease-related networks, we propose a new drug repositioning method based on enhanced message passing and hypergraph convolutional networks (EMPHCN). The contributions of this work are summarized as follows: (1) A multimodal heterogeneous network is built to reveal association information between drugs and diseases by connecting drug-disease associations, drug-drug similarities, disease-disease similarities, drug-protein associations, protein-protein interactions, and disease-protein associations. This related knowledge provides the foundation for enhanced message passing. (2) EMPHCN explicitly models the higher-order associations within the respective domains of drugs and diseases by hypergraph convolution to extract the intra-domain information, and it uses graph convolutional networks combining node and edge embedding (NEEGCN) to effectively extract the inter-domain information of known drug-disease associations. (3) In addition, it constructs multiple views from multiple drug similarities and extracts features from this multi-homogeneous information by HGCN combined with the channel attention mechanism, which enhances the intra-domain message passing and forms the intra-domain embedding. Besides, it transfers features through a heterogeneous network composed of drugs, proteins, and diseases with GAT, fuses them with the inter-domain features of known drug-disease associations to form the enhanced inter-domain embedding, and then combines the intra-domain and inter-domain embeddings to obtain the final embeddings of drugs and diseases.

Dataset
The drug-disease associations in dataset 1 (T1) of this paper are selected from the Zhang dataset [12], which contains 18,416 drug-disease associations between 269 drugs and 598 diseases from CTD [24] (http://ctdbase.org/ (accessed on 8 April 2021)). This dataset also collects comprehensive drug information such as target, enzyme, drug-drug interaction, pathway, and substructure, from which five drug similarity features can be computed. We select 15,630 of the 18,416 drug-disease associations in this dataset. Dataset 2 (T2) of this paper consists of 1921 known drug-disease relationships for therapeutic indications collected from the 6677 approved indications in the repoDB database [25], where we consider only FDA-approved small-molecule drugs. We obtained 10 drug similarity features by following the method proposed by Zeng et al. [16]. In order to fuse more drug features and thereby enhance the intra-domain message passing of drugs, we feed the various drug similarity matrices of T1 (e.g., target, enzyme, drug-drug interaction, and pathway) and of T2 (e.g., clinical similarity, drug side-effect similarity, and chemical similarity, where the chemical similarity is the Tanimoto score [26] computed with the Chemistry Development Kit [27] from the SMILES strings [28] of the drugs) into the hypergraph convolutional network as multiple similarity matrices of drugs, respectively.
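As an illustration of the chemical-similarity feature above, the Tanimoto score of two binary substructure fingerprints is |A ∩ B| / |A ∪ B|. Below is a minimal NumPy sketch; the fingerprint array and its bit layout are hypothetical, and in practice the fingerprints would come from a cheminformatics toolkit such as the Chemistry Development Kit:

```python
import numpy as np

def tanimoto(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """Tanimoto score between two binary fingerprint vectors:
    |A intersect B| / |A union B| (0.0 when both fingerprints are empty)."""
    inter = np.sum(fp_a & fp_b)
    union = np.sum(fp_a | fp_b)
    return float(inter) / union if union > 0 else 0.0

def tanimoto_matrix(fps: np.ndarray) -> np.ndarray:
    """Pairwise drug-drug chemical similarity matrix from an
    (n_drugs x n_bits) binary fingerprint array."""
    n = fps.shape[0]
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = tanimoto(fps[i], fps[j])
    return sim
```

The resulting symmetric matrix can then serve as one of the drug similarity views fed into the hypergraph convolutional network.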
Although datasets T1 and T2 contain a limited number of disease and drug nodes, they contain a massive amount of protein association information, which is helpful for the prediction of drug-disease associations. In addition, in order to test the robustness of our baseline model, we introduce two public datasets, Fdataset [10] and Cdataset [6], which do not contain the protein association network or the various drug characteristics. Fdataset contains 593 drugs, 313 diseases and 1933 confirmed drug-disease associations. Cdataset includes 2532 associations between 663 drugs and 409 diseases. The protein association networks (drug-protein, disease-protein, protein-protein interaction (PPI)) in this paper draw data from the databases used by Wang et al. [20]. The DGIdb database [29] (https://dgidb.genome.wustl.edu (accessed on 9 September 2021)) is a drug-gene interaction database, which not only integrates drug-gene interactions reported in the existing literature, but also records drug-gene interactions from more than 30 databases such as DrugBank [30], PharmGKB [31], Chembl [32], and TTD [33], providing information on genes and their known or potential drug associations. We retrieve the target protein-coding genes of a given drug from DGIdb, map the genes to proteins according to gene names, and thus obtain the drug-protein associations. We require at least two targets for each drug, motivated by the notion that a drug can treat a distinct disease most likely due to its off-target activities.
The disease-protein associations are extracted from DisGeNET [34] (https://www.disgenet.org (accessed on 5 May 2021)), where we extract the protein-coding genes of specific diseases and then map them to the corresponding proteins. We require that at least one protein be associated with each disease, which makes the associations more biologically meaningful. Each drug is represented by its DrugBank ID, and the common name of each disease is annotated according to MeSH [35] (http://www.ncbi.nlm.nih.gov/ (accessed on 8 April 2021)), which can be converted to a UMLS ID [36] using DisGeNET to discover the related proteins.
For the PPI network, we refer to the human PPI information compiled by Menche et al. [37] to extract the protein-protein associations; the initial weight of each edge is set to 1, and each protein is associated with at least one other protein. Ultimately, we obtain the multi-association networks by combining the above data, as shown in Table 1.

Dataset    P    Drugs    Diseases    Association Type    Associations    Sparsity
Cdataset   1    663      409         drug-disease        2532            0.0093
Fdataset   1    593      313         drug-disease        1933            0.0104

Note: P denotes the number of drug similarity matrices in each dataset. Sparsity is the ratio of the number of known associations to the number of all possible associations.
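The sparsity values in Table 1 follow directly from the dataset sizes; a small helper (the function name is ours, not the paper's) reproduces them:

```python
def sparsity(n_assoc: int, n_drugs: int, n_diseases: int) -> float:
    """Ratio of known drug-disease associations to all possible pairs."""
    return n_assoc / (n_drugs * n_diseases)

# Fdataset: 1933 associations between 593 drugs and 313 diseases
print(round(sparsity(1933, 593, 313), 4))  # 0.0104
# Cdataset: 2532 associations between 663 drugs and 409 diseases
print(round(sparsity(2532, 663, 409), 4))  # 0.0093
```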

Multi-Association Network Construction
We apply the multiple similarity matrices of the drugs mentioned in the previous section to construct a drug multi-view network; namely, the set of multiple similarity matrices of drugs is denoted as P^r = {P^r_1, P^r_2, ..., P^r_S}, P^r_s ∈ R^{M×M}, where M is the number of drugs and S represents the total number of similarity matrices. It is then used to construct the multi-view network G^r = {G^r_1, G^r_2, ..., G^r_S}, whose set of adjacency matrices is denoted as A^r = P^r = {P^r_1, P^r_2, ..., P^r_S}. The similarity matrix of diseases is expressed as S^d ∈ R^{N×N}, where N denotes the number of diseases. The similarity between diseases is obtained from the MeSH database [35], in which diseases are classified into various categories. We construct a directed acyclic graph (DAG) to compute the semantic similarities of diseases following Wang et al. [38]. The disease-disease similarity network is denoted as G^d, and its adjacency matrix is A^d = S^d ∈ R^{N×N}. The known drug-disease association network is represented as a graph G^rd with the adjacency matrix A^rd ∈ {0, 1}^{M×N} (A^rd_ij = 1 if drug r_i is associated with disease d_j, and A^rd_ij = 0 otherwise). In addition to the graphs G^r, G^d and G^rd constructed above, we build graph structures from the protein-drug, protein-disease and PPI association data described in the previous section, denoted as G^rp, G^dp, and G^p, respectively. The overall association network is shown in Figure 1.

Model Framework
In this section, we introduce a deep learning model named EMPHCN to predict the association information between drugs and diseases. The overall workflow is shown in Figure 2, which includes the intra-domain and inter-domain message passing of drugs and diseases. A multi-view network is constructed from multiple similarity features of drugs for the intra-domain message passing of drugs, which is enhanced by the combination of HGCN and a channel attention mechanism (Figure 2a). Besides, we directly apply HGCN to extract the intra-domain embeddings of diseases. The inter-domain message passing includes two parts (Figure 2b): one is the inter-domain message passing over known drug-disease associations, which is enhanced by graph convolutional networks with node and edge embedding (NEEGCN), and the other is a heterogeneous network composed of proteins, drugs and diseases, which enriches the message passing through GAT. The embeddings of the two parts are then summed to obtain the inter-domain features of drugs and diseases, and ultimately all intra-domain and inter-domain features are integrated for the prediction of drug-disease associations.


Drug and Disease Intra-Domain Message Passing
We first employ hypergraph convolutional networks (HGCN) [23] in the intra-domain message passing to extract the intra-domain information of drugs and diseases in G r and G d , respectively. The general graph network structure is usually represented by an adjacency matrix, where each edge connects only two vertices. Hypergraphs, on the other hand, can be used to encode the higher-order data correlation (beyond pairwise connections) by using their degree-free hyperedges, i.e., hypergraphs with flexible hyperedge properties are easily scalable for multi-modal data.

A hypergraph is usually defined as G = (V, E), where V represents the vertex set and E denotes the hyperedge set. An incidence matrix H ∈ R^{N_v×N_e} is used to represent the connections among vertices on the hypergraph, where N_v is the number of vertices and N_e is the number of hyperedges. Each element of H is defined as

h(v, e) = 1 if the hyperedge e ∈ E is associated with vertex v ∈ V, and h(v, e) = 0 otherwise.   (1)

Simultaneously, the degree d(v) of vertex v and the degree δ(e) of hyperedge e are collected in the diagonal matrices D_v ∈ R^{N_v×N_v} and D_e ∈ R^{N_e×N_e}, respectively, and are defined as

d(v) = Σ_{e∈E} h(v, e),   δ(e) = Σ_{v∈V} h(v, e).

The incidence matrix H and the vertex features X can then be fed into the hypergraph neural network with the hypergraph convolutional layer defined as

X^{l+1} = σ( D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X^l Θ^l ),   (2)

where the two diagonal matrices D_v^{-1/2} and D_e^{-1} are used for normalization, Θ^l is the learnable weight, and X^{l+1} is the feature embedding learned by the HGCN.
Each hyperedge in EMPHCN is built by connecting a vertex with its neighboring K vertices (Here, we set K = 15) based on the adjacency relationship on the graph, so N vertices construct N hyperedges with the incidence matrix H ∈ R N×N .
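A minimal NumPy sketch of this K-nearest-neighbor hyperedge construction and of the hypergraph convolution of Equation (2); unit hyperedge weights and a ReLU nonlinearity are our assumptions, and the function names are illustrative:

```python
import numpy as np

def knn_incidence(sim: np.ndarray, k: int = 15) -> np.ndarray:
    """Build the N x N hypergraph incidence matrix H: hyperedge j
    connects vertex j with its k most similar neighbours."""
    n = sim.shape[0]
    H = np.zeros((n, n))
    for j in range(n):
        # exclude the vertex itself, pick the k largest similarities,
        # then add the centre vertex to its own hyperedge
        order = np.argsort(-sim[j])
        neigh = [v for v in order if v != j][:k]
        H[neigh, j] = 1.0
        H[j, j] = 1.0
    return H

def hgcn_layer(H: np.ndarray, X: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """One hypergraph convolution step (Equation (2)):
    X' = ReLU(D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X Theta),
    with unit hyperedge weights assumed."""
    dv = H.sum(axis=1)           # vertex degrees d(v)
    de = H.sum(axis=0)           # hyperedge degrees delta(e)
    Dv = np.diag(1.0 / np.sqrt(dv))
    De = np.diag(1.0 / de)
    return np.maximum(Dv @ H @ De @ H.T @ Dv @ X @ theta, 0.0)
```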
Specifically, we use the initial input graphs G^r and G^d to construct the hypergraphs, whose incidence matrices are set as H^r = [H^r_1, H^r_2, ..., H^r_S] and H^d, respectively. Here, H^r_s ∈ R^{M×M}, H^d ∈ R^{N×N}, M is the number of drugs, S is the total number of drug incidence matrices, and N is the number of diseases. We use one of H^r (here, we chose H^r_1) and H^d as the initial inputs X^0_intra_r and X^0_intra_d (Equation (3)). According to Equation (2), the updated embeddings in the drug and disease domains can be obtained as X^{l+1}_intra_r and X^{l+1}_intra_d, respectively (Equation (4)). From Equation (4), we can calculate the multi-channel drug embedding X^{l+1}_intra_r = [X^{l+1}_intra_r,1, ..., X^{l+1}_intra_r,S] ∈ R^{M×F×S} by the hypergraph convolution of each drug incidence matrix, and similarly obtain the disease embedding X^{l+1}_intra_d ∈ R^{N×F}. Here, F is the dimensionality of the embedding. Furthermore, in order to exploit the multivariate feature information of drugs for intra-domain message-passing enhancement, we employ the channel attention mechanism ECA-Net [39] to learn the importance of each channel in X^{l+1}_intra_r. The specific operation is as follows: we first apply global average pooling (GAP), z_s = (1/(M·F)) Σ_{i,j} x_ij, over the spatial dimensions to compress the feature of each channel, and then compute the attention factor ω^r corresponding to each channel; in order to reduce the model complexity, a 1D convolution is used so that all channels share the same parameters:

ω^r = σ( conv1d_k(z) ),   ω^r = [ω^r_1, ω^r_2, ..., ω^r_S] ∈ R^{1×S},

where conv1d stands for the 1D convolution, k represents the convolution kernel size (here, we set k = 3 × 1), and σ is the Sigmoid activation function.
Ultimately, we sum the features of all channels, weighted by their attention factors, to obtain X^{l+1}_intra_r ∈ R^{M×F}, which is used as the drug embedding input to the next network layer.
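The ECA-style channel fusion described above can be sketched as follows; this is a simplified NumPy version with a hand-supplied 1D convolution kernel, and pooling over both the drug and feature dimensions is our simplification of the paper's GAP:

```python
import numpy as np

def eca_fuse(X: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Fuse a multi-channel drug embedding X (M x F x S) into M x F.
    Per-channel global average pooling -> shared 1D convolution over
    the channel axis (kernel size 3, zero padding) -> sigmoid attention
    weights -> attention-weighted sum over the S channels."""
    M, F, S = X.shape
    gap = X.mean(axis=(0, 1))                   # (S,) one scalar per channel
    padded = np.pad(gap, 1)                     # zero-pad for a 'same' conv
    conv = np.array([padded[i:i + 3] @ kernel for i in range(S)])
    w = 1.0 / (1.0 + np.exp(-conv))             # sigmoid attention factors
    return np.tensordot(X, w, axes=([2], [0]))  # weighted sum over channels
```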

Drug and Disease Inter-Domain Message Passing
Drug-disease inter-domain message passing contains the known drug-disease association network, which is treated as the main part of message passing. Moreover, we enhance the inter-domain message passing by introducing a heterogeneous network composed of proteins, drugs and diseases.

Known Drug-Disease Inter-Domain Message Passing
GCN has been widely used in recommendation systems and prediction models [40]. However, it usually considers only the node embeddings in the graph convolutional process while ignoring the role of edges [41]. We introduce a novel graph convolution model (NEEGCN) for bipartite graph link prediction, which combines the embeddings of edges and nodes. We first map the edge information into the node domain to obtain the corresponding drug node feature X^l_r_e and disease node feature X^l_d_e, where A^rd denotes the adjacency matrix of the known drug-disease association graph, and W^l_r_e and W^l_d_e are trainable weight matrices. Taking the intra-domain initial input of Equation (3) as the initial inter-domain input, the edge information mapped to the drug and disease node embeddings is fused with the original drug and disease node embeddings, and the message passing between drug and disease nodes is then defined over N(r_i) and N(d_j), the sets of inter-domain neighbors of node r_i and node d_j respectively, with trainable weights W^l_{d→r} and W^l_{r→d} projecting the node embedding from one domain to the other.
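A rough NumPy sketch of one NEEGCN step under our reading of the description above; the exact fusion and normalization used in the paper are not fully specified, so the mean aggregation and degree normalization here are assumptions, and all names are illustrative:

```python
import numpy as np

def neegcn_step(A, Xr, Xd, E, Wr_e, Wd_e, Wr2d, Wd2r):
    """One sketched NEEGCN message-passing step on the bipartite
    drug-disease graph. A is the M x N association matrix; E holds one
    feature vector per known edge (indexed in row-major order of A's
    nonzeros). Edge features are averaged into drug- and disease-side
    node features, fused with the node embeddings, and then propagated
    across domains with degree normalisation."""
    rows, cols = np.nonzero(A)
    M, N = A.shape
    Xr_e = np.zeros((M, E.shape[1]))
    Xd_e = np.zeros((N, E.shape[1]))
    for idx, (i, j) in enumerate(zip(rows, cols)):
        Xr_e[i] += E[idx]          # edge features gathered at the drug end
        Xd_e[j] += E[idx]          # ... and at the disease end
    deg_r = np.maximum(A.sum(axis=1, keepdims=True), 1)
    deg_d = np.maximum(A.sum(axis=0)[:, None], 1)
    Xr_fused = Xr + (Xr_e / deg_r) @ Wr_e   # fuse edge info into drug nodes
    Xd_fused = Xd + (Xd_e / deg_d) @ Wd_e   # fuse edge info into disease nodes
    # cross-domain propagation: drugs gather from disease neighbours, and vice versa
    Xr_new = (A / deg_r) @ Xd_fused @ Wd2r
    Xd_new = (A.T / deg_d) @ Xr_fused @ Wr2d
    return Xr_new, Xd_new
```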

Protein-Related Inter-Domain Message Passing
In addition to using the known drug-disease network for message passing, we add a heterogeneous network composed of proteins, diseases and drugs, and combine them to form an inter-domain message-passing enhancement module for extracting the inter-domain features of drugs and diseases. Following the method suggested by Wang et al. [20], we achieve message passing between them with the graph attention network (GAT) [42], which can score the vast amount of protein-association information to filter out the important associations. The propagation process is first implemented by computing E^l_{r→p} and E^l_{d→p}, the messages passed from drugs to proteins and from diseases to proteins through GAT in the l-th layer, from the drug and disease embeddings X^l_r and X^l_d of the l-th layer; we fuse them to obtain the protein embedding E^l_p. Here, GAT is used to project node information from the v-domain to the u-domain, and the update formula is

x^u_i = σ( Σ_{v_j ∈ N(u_i)} α_{u_i, v_j} W^v x^v_j ),

where W^v is a trainable weight matrix, x^v_j represents the feature of node v_j, and α_{u_i, v_j} is the attention weight coefficient

α_{u_i, v_j} = softmax_j( ρ( a^T [ W^u x^u_i || W^v x^v_j ] ) ),

where ρ is the LeakyReLU activation function, a denotes the weight vector, and || is the concatenation operation.
Then, in order to smooth the features between neighboring protein nodes, a single GAT layer is applied to perform protein intra-domain message passing over the PPI network. We then project the updated protein node embedding E^l_p back to the drug and disease domains to obtain the drug embedding X^l_p_r and the disease embedding X^l_p_d, respectively.
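The GAT projection from one domain to another, as used for the drug/disease-to-protein message passing, can be sketched in NumPy as a single attention head; the LeakyReLU slope and softmax normalization follow the standard GAT formulation [42], and the function and parameter names are illustrative:

```python
import numpy as np

def gat_project(X_src, X_dst, adj, W, a, alpha=0.2):
    """Sketch of one GAT attention head projecting source-domain node
    features into the target domain. adj[i, j] = 1 when target node i
    is connected to source node j; attention scores use
    LeakyReLU(a . [W x_i || W x_j]), normalised by a softmax over each
    target node's neighbours."""
    Hs = X_src @ W
    Hd = X_dst @ W
    out = np.zeros_like(Hd)
    for i in range(adj.shape[0]):
        neigh = np.nonzero(adj[i])[0]
        if len(neigh) == 0:
            continue  # isolated target node receives no message
        scores = np.array([a @ np.concatenate([Hd[i], Hs[j]]) for j in neigh])
        scores = np.where(scores > 0, scores, alpha * scores)  # LeakyReLU
        att = np.exp(scores - scores.max())
        att /= att.sum()                                       # softmax
        out[i] = att @ Hs[neigh]                               # weighted sum
    return out
```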

Association Prediction for Drugs and Diseases
After computing the respective intra-domain and inter-domain features of drugs and diseases, we fuse these embeddings to obtain X^{l+1}_r and X^{l+1}_d, the next-layer embeddings of drugs and diseases, respectively. We can similarly obtain X^{l+2}_r and X^{l+2}_d, and then combine them via a skip connection to get the final embeddings X̂_r and X̂_d of drugs and diseases, respectively. Lastly, we obtain the final association information between drugs and diseases by matrix multiplication:

Â^rd = X̂_r X̂_d^T,

where the matrix Â^rd is the predicted score probability matrix, and the values in Â^rd represent the probabilities of drug-disease associations.
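A compact sketch of the final fusion and scoring step; the sigmoid squashing is our assumption for mapping scores to probabilities, as the paper specifies only the matrix multiplication:

```python
import numpy as np

def predict_scores(Xr_l1, Xr_l2, Xd_l1, Xd_l2):
    """Combine two layers' embeddings via a skip connection and score
    every drug-disease pair with an inner product followed by a
    sigmoid (the sigmoid is an assumption here)."""
    Xr = Xr_l1 + Xr_l2          # skip connection over layers (drugs)
    Xd = Xd_l1 + Xd_l2          # skip connection over layers (diseases)
    logits = Xr @ Xd.T          # M x N pairwise association scores
    return 1.0 / (1.0 + np.exp(-logits))
```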

Optimization and Parameter Setting
Since the number of experimentally confirmed associations is much smaller than the number of drug-disease pairs, and because the sparsity of the various drug-disease datasets is inconsistent, we adopt a weighted binary cross-entropy loss to balance the ratio of positive to negative samples. We denote known drug-disease association pairs as positive samples and all other pairs as negative samples, and the loss function is defined as

L = −(1/(M·N)) Σ_{i,j} [ λ · A^rd_ij · log Â^rd_ij + (1 − A^rd_ij) · log(1 − Â^rd_ij) ],

where |y⁻| and |y⁺| are the numbers of negative and positive samples, respectively, and the balance factor is λ = |y⁻|/|y⁺|. We use the Adam optimizer [43] to minimize the loss function. EMPHCN has two hidden layers, and the main parameters of the framework include the first and second hidden-layer embedding dimensions, k_1 = 256 and k_2 = 128, the initial learning rate of the optimizer, lr = 0.002, and the regular dropout rate, β = 0.4. According to the sparsity of the datasets, we set the edge dropout rate γ and the number of training epochs µ as γ(T1) = 0.35, γ(T2) = 0.1, µ(T1) = 4500, and µ(T2) = 3200, respectively.
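The weighted binary cross-entropy with balance factor λ = |y⁻|/|y⁺| can be sketched as follows; the mean reduction is our assumption:

```python
import numpy as np

def weighted_bce(pred, target, eps=1e-12):
    """Weighted binary cross-entropy: positive pairs are up-weighted by
    lambda = (#negatives / #positives) to counter the sparsity of
    known drug-disease associations."""
    pos = target == 1
    lam = (~pos).sum() / max(pos.sum(), 1)   # balance factor lambda
    pred = np.clip(pred, eps, 1 - eps)       # numerical safety for log
    loss = -(lam * target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return loss.mean()
```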

Ablation Experiment of EMPHCN Model
We conducted an ablation experiment to verify the effectiveness of each part of our EMPHCN model. Specifically, we analyzed the impact of each part of the model according to the results of 10-fold cross-validation on dataset T1. The performance of the different variants of EMPHCN is shown in Figure 3. We first replaced the HGCN with a GCN and found that the AUC and AUPR values achieved by the GCN were 0.4% and 0.8% lower than those of the HGCN, respectively. For the baseline model EMP_base (with the HGCN and the NEEGCN), we used a single drug similarity matrix in the intra-domain of drugs and only the known drug-disease associations in the inter-domain. We then introduced the drug multi-view information and integrated the HGCN and ECA-Net to form an intra-domain fusion enhancement module. The AUC and AUPR values obtained with the intra-domain fusion enhancement module were both 0.3% better than those of the baseline method using a single similarity matrix, as shown in Figure 3. In addition, we introduced the additional networks G^dp, G^rp, and G^p into the inter-domain information and combined them with the known drug-disease associations to form the inter-domain fusion enhancement module. The AUC and AUPR values achieved by the inter-domain fusion enhancement module were 0.5% and 0.6% higher, respectively, than those of the baseline method, whose inter-domain module is composed only of known drug-disease associations. These comparisons clearly show that EMPHCN, using HGCN together with the intra-domain and inter-domain fusion modules (i.e., with intra-domain and inter-domain message-passing enhancement), improves the performance of drug repositioning.



Comparison between EMPHCN and Other Methods
In this section, we compare EMPHCN with six state-of-the-art drug repositioning methods, listed below. Among these methods, LAGCN [21] uses an attention mechanism to combine the embeddings from multiple graph convolutional layers to predict drug-disease associations. SCMFDD [12] provides a similarity-constrained matrix factorization method for predicting drug-disease correlations. NIMCGCN [18] applies GCNs to learn the intra-domain information of miRNAs and diseases from similarity networks and builds a neural inductive matrix completion model to predict miRNA-disease associations. BNNR [14] proposes a bounded nuclear norm regularization method to complete a drug-disease heterogeneous network for drug repositioning. DRRS [13] constructs a drug repositioning recommendation system for novel drug indications by integrating related data sources and validated information on drugs and diseases. DRHGCN [22] designs inter- and intra-domain feature extraction modules that learn drug and disease embeddings by GCNs, and then fuses these embeddings in parallel for drug-disease association prediction.
We applied 10-fold cross-validation when comparing EMPHCN with the other methods. The results are shown in Table 2 as well as Figure 4a-d. On dataset T1, EMPHCN obtains the highest average AUPR and AUC, at 0.593 and 0.887, respectively. On dataset T2, EMPHCN outperforms the other six models with a final average AUC of 0.961, which is 2.3% higher than the second-best method, DRHGCN, and achieves a final average AUPR of 0.526, which is 3.4% higher than the second-best method, BNNR. Because we introduce more homogeneous and heterogeneous networks and apply HGCN to extract intra-domain features for message passing, EMPHCN achieves satisfactory prediction results for drug repositioning. At the same time, we find that the EMPHCN baseline model (EMP_base) also maintains excellent results in the comparative experiments (Figure 5, Table 2), achieving the highest average AUC and AUPR values (0.941 and 0.593, respectively) on the benchmark datasets.
grating related data sources and validated information on drugs and diseases. The DRHGCN [22] designs inter-and intra-domain feature extraction modules which learn drug and disease embeddings by GCNs, and then parallelly fuse these embeddings for drug and disease association prediction.
We applied 10-fold cross-validation when comparing the EMPHCN with other methods. The results are shown in Table 2 as well as Figure 4a-d. On dataset T1, EMPHCN obtains the highest average AUPR and AUC at 0.593 and 0.887, respectively. On dataset T2, EMPHCN outperforms the other six models with a final average AUC of 0.961, which is 2.3% higher than the second-best method, DRHGCN, and achieves a final average AUPR of 0.526, which is 3.4% higher than the second-best method, BNNR. Because we introduce more homogeneous and heterogeneous networks and apply HGCN to extract the feature in intra-domain as the message passing, EMPHCN can achieve a satisfactory prediction result for the drug repositioning. At the same time, we find that the EMPHCN baseline model (EMP_base) also keeps excellent results in the comparative experiment ( Figure 5, Table 2), which achieved the highest average AUC and AUPR values (0.941 and 0.593, respectively) in the benchmark datasets.      We also evaluated the effect of the ratio of positive and negative samples in the training set on the performance of the EMPHCN. We chose the T2 dataset with sparse positive samples and selected positive and negative sample ratios of 1:10, 1:50, 1:80, 1:100, 1:120, and 1:140 (where 1:140 is close to the maximum ratio in the dataset). Besides, we variously set a small portion of the positive samples missing as a control group, i.e., set the number of positive samples as {100%, 95%, 90%, and 85%} of the original positive samples, respectively. The test results are shown in Figure 6. We find that the performance of the model is improved as the number of positive samples is increased, probably because more positive samples are beneficial for the message passing in the prediction model. 
In addition, we find that when the ratio of retained positive samples is in the range of 90% to 100%, the performance of the model is stable and improves slightly as the proportion of negative samples increases, probably because of the combined influence of the larger numbers of training samples and negative samples.

We also perform a statistical significance analysis on the T1 dataset for the AUROC and AUPR results of 10-fold cross-validation between EMPHCN and the other methods. Figure 7 demonstrates that, at the 0.05 significance level, EMPHCN differs considerably from the other approaches.
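A paired significance test of this kind can be sketched as follows; the Wilcoxon signed-rank test is one common choice for paired per-fold scores, and the fold values below are illustrative stand-ins, not the paper's actual numbers:

```python
from scipy.stats import wilcoxon

# Hypothetical per-fold AUPR values for two methods over the same 10 folds
emphcn_aupr = [0.591, 0.604, 0.582, 0.611, 0.597, 0.588, 0.602, 0.579, 0.593, 0.588]
other_aupr  = [0.571, 0.588, 0.570, 0.590, 0.574, 0.569, 0.585, 0.561, 0.578, 0.566]

# Two-sided test on the paired fold-wise differences
stat, p = wilcoxon(emphcn_aupr, other_aupr)
print(f"p-value = {p:.4f}")
if p < 0.05:
    print("difference is significant at the 0.05 level")
```

Because the folds are shared between methods, a paired test is more powerful than comparing two independent means.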

Investigation of Novel Predictions
To validate the ability of diverse models to recover drug associations for new diseases (i.e., diseases with no known drug-association information), we first choose 20% of the disease nodes as the test set and, for each disease in the test set, delete all known drug-disease associations involving that disease. We then use the known drug-disease associations of the remaining 80% of the disease nodes as the training set. This procedure ensures that the disease nodes in the test sample are unknown new nodes. We repeat the test five times for each prediction model and report the average result. The corresponding results, including ROC and PR curves, are shown in Table 3 and Figure 8a-d. We find that EMPHCN achieves excellent performance on datasets T1 and T2 (AUC = 0.806 on T1 and 0.845 on T2), superior to the other state-of-the-art prediction methods.
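The disease-wise holdout above can be sketched as a column-masking operation on the association matrix. A minimal numpy sketch with toy data; the function name is ours, not part of EMPHCN:

```python
import numpy as np

rng = np.random.default_rng(42)

def disease_holdout_split(assoc, test_frac=0.2):
    """Hold out a fraction of disease nodes as entirely new diseases by
    deleting every known association in their columns."""
    n_diseases = assoc.shape[1]
    n_test = max(1, int(n_diseases * test_frac))
    test_ids = rng.choice(n_diseases, size=n_test, replace=False)
    train = assoc.copy()
    train[:, test_ids] = 0      # test diseases carry no links during training
    return train, test_ids

A = (np.arange(50).reshape(10, 5) % 3 == 0).astype(int)  # toy 10x5 matrix
train, test_ids = disease_holdout_split(A)
print(int(train[:, test_ids].sum()))  # 0: no leakage into training
```

Masking whole columns (rather than individual pairs) is what makes the held-out diseases genuinely "new" nodes: the model never sees any association for them.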

In addition, we further examine the number of positive samples successfully recovered from the top k candidate samples. Considering the different numbers of positive samples in the datasets, we set the threshold k to 3000 (T1) and 300 (T2), respectively. On the T1 dataset, the numbers of positive samples successfully recovered by EMPHCN and SCMFDD are significantly higher than those of the other models, as shown in Figure 9a. On the T2 dataset, the number of positive samples successfully recovered by EMPHCN is significantly higher than that of the other models, as shown in Figure 9b, which again indicates that EMPHCN has an outstanding ability to prioritize potential disease-related drugs.
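The recovery-at-k measure reduces to counting true associations among the k highest-scored pairs. A minimal sketch with a toy score matrix; the function name is ours:

```python
import numpy as np

def recovered_at_k(scores, truth, k):
    """Count how many true associations appear among the k highest-scored
    drug-disease pairs (both matrices flattened to one ranked list)."""
    order = np.argsort(scores.ravel())[::-1][:k]
    return int(truth.ravel()[order].sum())

# toy example: 3 drugs x 4 diseases with 2 true pairs
truth = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0],
                  [0, 0, 0, 0]])
scores = np.array([[0.9, 0.1, 0.2, 0.1],
                   [0.3, 0.2, 0.8, 0.1],
                   [0.4, 0.1, 0.2, 0.3]])
print(recovered_at_k(scores, truth, 3))  # 2: top-3 pairs (0.9, 0.8, 0.4) contain 2 hits
```

Choosing k proportional to the number of positives in each dataset (3000 for T1, 300 for T2) keeps the measure comparable across datasets of different sparsity.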

Case Study
To further verify the reliability and capability of the drug-disease association prediction model, we applied EMPHCN to predict candidate drugs for two diseases: breast carcinoma and Parkinson's disease (PD). When identifying potential drugs for these diseases, we used all known drug-disease associations in T2 as the training set to predict whether a missing drug-disease association exists, and ranked the predicted drug candidates by their probability values.
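Ranking candidates for one disease amounts to sorting the corresponding column of the predicted score matrix while masking already-known associations. A minimal sketch under these assumptions; the function name and toy data are ours:

```python
import numpy as np

def top_candidates(score_matrix, disease_idx, known, n=10):
    """Rank novel drug candidates for one disease by predicted score.

    score_matrix : (n_drugs, n_diseases) predicted association scores
    known        : same-shape 0/1 matrix; known pairs are excluded
    """
    col = score_matrix[:, disease_idx].astype(float).copy()
    col[known[:, disease_idx] == 1] = -np.inf   # mask already-approved drugs
    return np.argsort(col)[::-1][:n]

rng = np.random.default_rng(7)
scores = rng.random((6, 3))        # toy scores: 6 drugs x 3 diseases
known = np.zeros((6, 3), dtype=int)
known[2, 1] = 1                    # drug 2 is already approved for disease 1
ranked = top_candidates(scores, disease_idx=1, known=known, n=3)
print(ranked)                      # indices of the 3 best novel candidates
```

Masking known pairs matters here because the model is trained to score them highly; without the mask, the top of the list would simply reproduce the training set.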
The top 10 drug candidates for breast carcinoma predicted by EMPHCN are listed in Table 4. We find that all 10 of them (a 100% success rate) are confirmed by published evidence. For instance, topotecan is an antineoplastic agent used to treat ovarian cancer, small-cell lung cancer and cervical cancer. A previous study reports that primary chemotherapy with topotecan is an effective and well-tolerated treatment for patients with breast cancer and central nervous system (CNS) metastases [44]. Herein, topotecan is the first predicted candidate for potentially treating breast carcinoma. Carboplatin, an organoplatinum antineoplastic alkylating agent used in the treatment of advanced ovarian carcinoma, is predicted by EMPHCN to be associated with breast carcinoma. The study by Martin et al. [45] indicates that carboplatin is an active drug in metastatic breast cancer (MBC) patients without previous exposure to chemotherapy. Campana et al. [46] observed that elderly breast cancer patients responded well to electrochemotherapy (ECT) and achieved durable local tumor control, where the ECT treatment schedule consisted of intravenous or intratumoral bleomycin followed by locally delivered electric pulses.

For Parkinson's disease, the top 10 candidate drugs predicted by EMPHCN are listed in Table 5. We find that 8 of them (an 80% success rate) are confirmed by literature evidence. For example, pergolide is an ergot-derived dopamine receptor agonist used to treat Parkinson's disease and restless legs syndrome. The study by Mizuno et al. [54] confirms that pergolide is efficacious in patients with Parkinson's disease, either as monotherapy or in combination with levodopa. Galantamine is a cholinesterase inhibitor used to manage mild to moderate dementia associated with Alzheimer's disease. Aarsland et al. [55] report that galantamine is beneficial for patients with Parkinson's disease.
In addition, we used the same EMPHCN model to predict the top five diseases most likely to be associated with felodipine, as shown in Table 6. Lorimer et al. [61] verified that felodipine provides a modest but significant anti-anginal benefit when combined with a beta-blocker.

Discussion
In this paper, we propose EMPHCN, a novel model for recovering potential drug-disease associations based on enhanced message passing and hypergraph convolutional networks. Compared with existing methods, EMPHCN incorporates multiple drug similarities, enhances intra-domain message passing by combining hypergraph convolution with a channel attention mechanism, and improves inter-domain message passing by incorporating protein, drug and disease association networks, which enables it to outperform other drug-disease association prediction methods.
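The channel-attention idea mentioned above (weighting multiple similarity-view embeddings by learned importance) can be illustrated roughly as follows. This is a minimal numpy sketch in the spirit of squeeze-and-excitation attention, not the paper's actual architecture; the weight matrices are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def channel_attention(views):
    """Fuse per-view drug embeddings with view-level (channel) attention.

    views : (n_views, n_drugs, dim) stack of embeddings, one per similarity view.
    Squeeze each view to a scalar descriptor, score the descriptors with a
    tiny two-layer network, softmax over views, then sum the weighted views.
    """
    n_views = views.shape[0]
    w1 = rng.normal(size=(n_views, n_views))      # stand-in for learned weights
    w2 = rng.normal(size=(n_views, n_views))
    squeeze = views.mean(axis=(1, 2))             # global average pool per view
    hidden = np.maximum(squeeze @ w1, 0)          # ReLU
    logits = hidden @ w2
    att = np.exp(logits) / np.exp(logits).sum()   # softmax over views
    fused = np.tensordot(att, views, axes=1)      # (n_drugs, dim)
    return fused, att

views = rng.normal(size=(3, 5, 4))  # 3 similarity views, 5 drugs, dim 4
fused, att = channel_attention(views)
print(fused.shape)  # (5, 4)
```

The attention weights sum to one over the views, so the fused embedding is a convex combination that lets informative similarity views dominate.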
Although EMPHCN can serve as a powerful tool for predicting drug-disease associations, several aspects can be improved in future work. For instance, we will consider more biological entities, such as miRNAs and microorganisms involved in drug-disease associations, to construct a heterogeneous network with more types of entities and links for a better understanding of drug and disease features for drug repositioning. Because our current model does not incorporate genomics, transcriptomics, other omics or cell-line data from individual patients, we will extend our method by combining it with multi-omics information to prescribe specifically repurposed drugs for individual patients. In addition, we currently use PPI information to construct the association network for training; in future work, we will use protein similarities to construct a protein-protein association network to investigate the relationship between off-target consequences and repositioning opportunities.