Knowledge-Graph-Based Drug Repositioning against COVID-19 by Graph Convolutional Network with Attention Mechanism

The current global crisis caused by COVID-19 has almost halted normal life in most parts of the world. Because the development cycle for new drugs is long, drug repositioning is an effective way to screen drugs for COVID-19. To find suitable drugs for COVID-19, we add COVID-19-related information to our medical knowledge graph and use a knowledge-graph-based drug repositioning method to screen potential therapeutic drugs. The specific steps are as follows. First, information about COVID-19 is collected from the latest published literature, and gene targets of COVID-19 are added to the knowledge graph. Then, the COVID-19-related information in the knowledge graph is extracted, and a drug–disease interaction prediction model based on a Graph Convolutional Network with Attention (Att-GCN) is established. Att-GCN is used to extract features from the knowledge graph, and the prediction matrix is reconstructed through matrix operations. We evaluate the model by predicting drugs for both ordinary diseases and COVID-19. The model achieves an area under the curve (AUC) of 0.954 and an area under the precision-recall curve (AUPR) of 0.851 for ordinary diseases. In the drug repositioning experiment for COVID-19, five drugs predicted by the model have proved effective in clinical treatment. The experimental results confirm that the model can predict drug–disease interactions effectively for both ordinary diseases and COVID-19.


Introduction
Coronavirus disease 2019 (COVID-19) has been listed as an international public health emergency by WHO [1], and on March 11, it was defined as a global "pandemic". As of December 16, more than 20.58 million people were infected worldwide. As the number of COVID-19 patients is increasing dramatically worldwide, treatment in intensive care units (ICUs) has also become a major challenge [2]. In the absence of specific vaccines and medicines against COVID-19, it is urgent to discover effective therapies, especially drugs, to treat COVID-19. Considering that it usually takes 10-15 years to develop a new drug, probably the best strategy for the treatment of COVID-19 is drug repositioning. Drug repositioning, also known as "new use of old drugs" and "re-examination of old drugs", refers to the discovery of new indications or new uses for drugs already on the market, including the repurposing of drugs that are in the clinical research stage or approved for marketing, re-evaluation, and reorientation of the treatment direction. Conventional drug development, from the initial idea to the drug market, involves uncertainties such as safety and pharmacokinetics, so its R&D costs and risks are significant. However, since drugs used for drug repositioning studies usually have passed several stages of clinical trials or are already on the market, their risks are lower compared with strategies such as developing from scratch and obtaining patent licenses. In addition, compared with obtaining patent licenses and restructuring strategies, drug repositioning has the advantages of a short time to market and a greater possibility of discovering differences in drug effects, so higher returns are expected. Therefore, drug repositioning is one of the strategies with the best risk/benefit ratio among currently known drug development strategies.
Future Internet 2021, 13, 13
As many drugs treat disease by acting on related targets, most drug repositioning studies predict new drug-disease interactions (DDIs) by discovering new drug-target interactions (DTIs). However, for a new disease, the corresponding targets have not been fully discovered, so if drug repositioning is achieved only through the discovery of new DTIs, the results may be limited. In addition, the existing data on known drugs and related information, such as proteins and genes, are huge, so it is particularly important to use appropriate databases and to screen the data in advance when repositioning drugs for an emerging disease. Wearable devices and internet medical service platforms generate a large amount of high-dimensional medical data. Traditional machine learning methods struggle to process high-dimensional medical data from different sources. Deep learning methods are widely used in feature extraction [3] and disease prediction [4] for medical data. In the drug repositioning problem, drugs and targets naturally form a graph structure, and the most widely used deep learning model is the Graph Convolutional Network (GCN).
In order to find drug candidates against COVID-19, we construct a knowledge graph (KG) for COVID-19 and propose a new model called Graph Convolutional Network with Attentional mechanism for Drug-Disease Interaction (Att-GCN-DDI) to predict potential therapeutic drugs for COVID-19. We first collect information about COVID-19 in the latest published literature and add the gene targets of COVID-19 to our drug KG. Then, we screen the nodes and relationships associated with COVID-19 to build the KG. Borrowing the idea from Neo-DTI [5] and DTI-NET [6], the Att-GCN-DDI model first employs GCN with attentional mechanism to extract features from KG and then performs matrix factorization for DDI prediction. The tests on two scenarios in DDI prediction have demonstrated that Att-GCN-DDI can significantly outperform several baseline prediction methods. Att-GCN-DDI also has good performance on the DDI prediction against COVID-19. Five drug candidates predicted by Att-GCN-DDI have proved effective in the clinical treatment.
The main contributions of our work lie in: (1) We gather and add the target of COVID-19 to our KG. Then, we select the related knowledge to construct a KG for COVID-19, which is applied to find the potential therapeutic drugs against COVID-19. (2) We propose a GCN-based model for drug repositioning on KG. The model can learn the topology around the disease effectively, which is utilized to predict new drugs for the disease. (3) We evaluate our method by predicting drugs for both ordinary diseases and COVID-19.
Att-GCN-DDI identifies five drugs against COVID-19 that have proved effective in clinical treatment, and it outperforms five other baseline models in drug repositioning for ordinary diseases. The experimental results confirm the strong predictive power of Att-GCN-DDI.

Related Work
The most important step in drug repositioning is to find novel drug-disease or drug-target interactions. To achieve this goal, various methods have been developed, including computational approaches, biological experimental approaches, and mixed approaches [7]. In recent years, researchers have mainly used computational methods for drug repositioning because biological experimental methods cost a lot of money and time [8]. The most widely used computational methods fall into four categories: network-based methods, knowledge graph embedding-based methods, text mining methods and biological feature-based methods [9]. Text mining methods extract useful information from the published literature to find novel interactions between drugs and diseases. "MAM" [10], "PharmGKB" [11] and "Chem2bio2rdf" [12] all use semantic similarity to measure the relationship between drugs and diseases. Recently, researchers began to use machine learning methods to achieve this goal. Fu et al. [13] proposed a semantic similarity framework using random forest (RF) and support vector machine (SVM) methods. However, the diversity of language expression and the contradictory information found in the literature limit the performance of text mining methods [14].
The biological feature-based method realizes drug repositioning by using machine learning approaches to extract the biological features of drugs and targets [15]. These methods usually include two key parts: feature extraction and relationship prediction. "SimBoost" [16] trains a gradient boosting machine model on the similarities between drugs and proteins to predict their binding affinity. "NRLMF" [17] uses the similarity between drugs and proteins to model the probability of drug-target interaction through logistic matrix factorization [18]. These methods improve the accuracy of DTI prediction to a certain extent. However, they do not take drug-drug or protein-protein interactions into account [9].
The method based on knowledge graph embedding (KGE) maps the entities and relationships in the knowledge graph to a low-dimensional continuous vector space, which retains the inherent characteristics of the knowledge graph and alleviates the feature sparsity problem that may arise when the knowledge graph is applied. The training process is divided into multiple stages. First, the KGE model uses random noise to initialize the embedding vectors. Then, the loss is calculated by the score function, and training is performed. GrEDeL [19] uses TransE to learn embedding vectors on a biomedical knowledge graph built from relations extracted from biomedical abstracts. Then, a Long Short-Term Memory (LSTM) network is trained to discover candidate drugs for diseases of interest from biomedical literature. TriModel [20] is an extension of the DistMult and ComplEx models, using three embedding vectors to represent each entity and relationship. These methods place high requirements on the quality of the knowledge graph and are suitable for finding new associations between drugs and targets that have been fully studied; however, they are not suitable for drug repositioning for emerging diseases because the disease-related information in the knowledge graph is incomplete.
In recent years, the network-based method has been widely used. This method mainly includes three steps: network construction, feature extraction and relationship prediction [9]. The network-based method calculates the similarity between drugs and targets based on the network topology, and its purpose is to predict unknown interactions from known interactions [21]. The basic principle is that drugs tend to combine with similar targets or diseases. DDR [22] constructed a drug-drug interaction network and a protein-protein interaction network based on the similarity between proteins and drugs [18], and then used the RF method to predict the combination of drugs and proteins. DTI-NET [6], MSCMF (Multiple Similarities Collaborative Matrix Factorization) [23] and HNM (Heterogeneous Network Model) [24] further improve the accuracy of DTI prediction by integrating information from heterogeneous data sources and improving the relationship prediction method. Neo-DTI [5] uses a new feature extraction scheme based on DTI-NET [6] to enhance the accuracy. However, the existing research on network-based methods is limited to the prediction of drug-target interactions, so the mining of the heterogeneous network is not thorough enough. In addition, there is still a lot of data loss in the process of feature extraction.

The Drug KG
We adopted the drug KG that was built in our previous study, which integrated six pharmaceutical knowledge bases: DrugBank [25], KEGG DRUG [26], TTD [27], DID [28], PharmGKB [29] and SIDER [30]. We first analyze the original data in each knowledge base to extract triples, and then integrate the triples according to the graph data model to obtain the knowledge graph [31]. The KG contains five types of entities (drugs, genes, diseases, pathways and side effects) and nine relationships among them. The data schema of our drug KG is shown in Figure 1.

Acquisition of COVID-19 Information
At present, drugs against COVID-19 are divided into two categories according to their targets (genes) [32]. The first category acts on the immune cells of the human body to enhance human immune function. The second category acts on the virus itself, on the receptors it binds, and on the enzymes needed for its replication. By gathering information on COVID-19, we have screened out four targets (genes) of COVID-19 with clear function and high reliability: RNA-dependent RNA polymerase (RdRp) [32], ACE2 [32], pp1ab [33], and human immunodeficiency virus type 1 protease (pol) [34]. Therefore, we link the COVID-19 entity to the KG through four disease-gene interactions: COVID-19-RdRp, COVID-19-ACE2, COVID-19-pp1ab and COVID-19-pol.
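The linking step can be illustrated with a minimal sketch. It assumes the KG is held as a set of (head, relation, tail) triples; the relation label "disease-gene", the helper names and the toy edge for "DrugA" are hypothetical and serve only as an illustration, not as the storage format of the actual drug KG.

```python
from collections import defaultdict

# The four COVID-19 targets named in the text.
COVID_TARGETS = ["RdRp", "ACE2", "pp1ab", "pol"]


def add_disease_targets(triples, disease, targets, relation="disease-gene"):
    """Return a new triple set with one disease-gene edge per target.

    The relation label is an illustrative assumption.
    """
    updated = set(triples)
    for gene in targets:
        updated.add((disease, relation, gene))
    return updated


def neighbours(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples."""
    adj = defaultdict(set)
    for head, _, tail in triples:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj


# Toy pre-existing edge (hypothetical drug, not real data).
kg = {("DrugA", "drug-gene", "ACE2")}
kg = add_disease_targets(kg, "COVID-19", COVID_TARGETS)
adj = neighbours(kg)
```

After this step the COVID-19 node is reachable from any drug that shares one of the four targets, which is what the later distance-based selection relies on.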

Construction of the KG for COVID-19
Our drug KG contains more than 100,000 entities and more than 670,000 relationships. It is extremely difficult to run a computationally expensive model such as GCN on such a large-scale KG. In addition, the information on drugs and proteins in the KG that are not related to COVID-19 will interfere with drug repositioning. To reduce the amount of computation and improve the accuracy of the DDI prediction, we construct a KG for COVID-19 by extracting the related knowledge from the drug KG.
In our drug KG, if a drug can treat a disease, there is usually a path of length less than 4 between them in addition to the direct connection. This path carries the information on why the drug can treat the disease. In addition, Att-GCN-DDI predicts DDIs through the similarity of topological structure, and diseases with similar topological structure are usually connected by a path of length less than 4, such as disease-drug-disease, disease-gene-disease and disease-gene-gene-disease. Therefore, in this paper, we focus on the COVID-19 node and select the drug and disease nodes whose shortest-path distance from the COVID-19 node is less than 4. After that, we use these drug and disease nodes as the initial data and supplement the related gene, side effect and pathway nodes. Finally, the drug-disease, drug-gene, gene-pathway, drug-drug, drug-side effect, gene-gene, and disease-gene relationships among the nodes are supplemented as well. Tables 1 and 2 show the numbers of entities and relationships in the KG for COVID-19.
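The distance-based selection can be sketched with a breadth-first search. This is a minimal illustration under stated assumptions: the KG is given as an undirected adjacency map, node types are available in a dictionary, and the function names and the toy graph are hypothetical.

```python
from collections import deque


def nodes_within(adj, source, max_dist):
    """Breadth-first search returning {node: distance} for distance <= max_dist."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the distance limit
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def select_seed_nodes(adj, node_type, source="COVID-19", limit=4):
    """Keep drug and disease nodes whose shortest-path distance is < limit."""
    dist = nodes_within(adj, source, limit - 1)
    return {n for n in dist
            if n != source and node_type.get(n) in ("drug", "disease")}


# Toy chain: COVID-19 - ACE2 - DrugA - DiseaseX - DrugB - DiseaseY
edges = [("COVID-19", "ACE2"), ("ACE2", "DrugA"), ("DrugA", "DiseaseX"),
         ("DiseaseX", "DrugB"), ("DrugB", "DiseaseY")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
node_type = {"COVID-19": "disease", "ACE2": "gene", "DrugA": "drug",
             "DiseaseX": "disease", "DrugB": "drug", "DiseaseY": "disease"}

seeds = select_seed_nodes(adj, node_type)
```

In the toy chain, DrugA (distance 2) and DiseaseX (distance 3) are kept, while DrugB (distance 4) and DiseaseY (distance 5) fall outside the limit. The gene, side effect and pathway neighbours of the selected seeds would then be added in a second pass.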

Method
We designed a model called Att-GCN-DDI to discover unknown DDIs based on the drug KG. The workflow of Att-GCN-DDI is shown in Figure 2. Att-GCN-DDI includes three main steps: (a) node embedding based on Att-GCN; (b) topology-preserving learning of the node embedding; (c) reconstruction of the DDI matrix. In step (a), the topological features of each node in the KG are extracted into an F-dimensional vector, and the feature vectors of all drugs and diseases constitute the drug feature matrix X and the disease feature matrix Y, respectively, where each row of X is the feature vector of a drug and each row of Y is the feature vector of a disease. In step (b), we attempt to find an optimal projection from the drug space to the disease space by supervised learning so that the mapped drug feature vectors are geometrically close to the diseases with which they are known to interact. The projection matrix Z is supervised by known drug-disease interactions and is learned to minimize the difference between the known interaction matrix P and XZY^T. In step (c), Att-GCN-DDI performs matrix operations on the projection matrix Z and the feature matrices to reconstruct the DDI matrix, from which novel DDIs can be obtained.
Next, we introduce the mathematical formulations of these three steps. The given KG is defined as an undirected graph G = (V, E), where V = {v_1, v_2, ..., v_n} is the set of nodes, E = {e_1, e_2, ..., e_m} is the set of edges, n is the number of nodes, m is the number of edges, and E ⊆ V × V. The adjacency matrix A is binary: an entry of 1 means that the two nodes are connected, and 0 means that they are not.
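As a small illustration of the graph definition, the binary adjacency matrix of an undirected graph can be built from a node list and an edge list. The function name and the three-node toy graph are assumptions made for this sketch.

```python
def adjacency_matrix(nodes, edges):
    """Build the symmetric binary adjacency matrix A of an undirected graph."""
    index = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        A[i][j] = 1
        A[j][i] = 1  # undirected: mirror every edge
    return A


# Toy graph (hypothetical drug name, not real data).
nodes = ["COVID-19", "ACE2", "DrugA"]
edges = [("COVID-19", "ACE2"), ("DrugA", "ACE2")]
A = adjacency_matrix(nodes, edges)
degrees = [sum(row) for row in A]
```

The row sums of A give the node degrees, which are the entries of the degree matrix used when the Laplacian is formed.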
The key step of topological feature extraction using Att-GCN is to construct the Laplacian matrix, and in our model, the Laplacian matrix should be: where I_n is an identity matrix and D is the inverse degree matrix [9]. Finally, the topological feature of each node in the heterogeneous network can be extracted using the following formula: where M is a feature vector of the input entity, and X is the characteristic of each node itself. The extracted features are used to form a drug feature matrix F_drug and a disease feature matrix F_disease, in which each row represents a drug or disease feature vector M. Then, we use supervised learning to find the most appropriate projection matrix Z. The learning objectives are as follows: where P is the known DDI matrix. Note that the same matrix construction method is also used in [1,2] to solve the problem of relationship prediction. In order to achieve quick convergence, we use λ‖w‖_1 as the regularizer.
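For orientation, one standard formulation that is consistent with the description above is sketched here. It assumes the widely used renormalized GCN propagation rule and a squared-error objective. The symbols \tilde{A}, \tilde{D}, \hat{L}, W and \sigma, the Frobenius norm, and the reading of the regularizer as an L1 penalty on the trainable weights w are all assumptions of this sketch and may differ from the exact form used in Att-GCN-DDI.

```latex
% Assumed renormalized Laplacian (self-loops added to A)
\tilde{A} = A + I_n, \qquad
\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}, \qquad
\hat{L} = \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}

% Assumed one-layer propagation: X holds the nodes' own features,
% W is a trainable weight matrix, \sigma is a nonlinearity
M = \sigma\!\left( \hat{L}\, X\, W \right)

% Assumed objective for the projection matrix Z, with P the known DDI matrix
\min_{Z,\, w} \;
\left\lVert P - F_{drug}\, Z\, F_{disease}^{T} \right\rVert_F^{2}
+ \lambda \lVert w \rVert_1
```

Under this reading, the first line turns the binary adjacency matrix into a symmetric normalized operator, the second aggregates each node's neighbourhood into its feature vector, and the third fits Z so that the product of the feature matrices approximates the known interactions.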
Then, we introduced the attention mechanism to assign weights to the feature matrix. The calculation formula of attention is as follows: where Q is the embedding representation of drug features. After the supervised learning of the projection matrix, the process of reconstructing the DDI matrix is as follows:
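The reconstruction step can be illustrated with a small sketch that follows the product XZY^T described earlier and then ranks unseen drugs for one disease. The attention weighting is omitted here, and the helper names, the tie-breaking rule and all numbers are toy assumptions, not values from the experiments.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def reconstruct(X, Z, Y):
    """Score matrix X Z Y^T: rows are drugs, columns are diseases."""
    return matmul(matmul(X, Z), transpose(Y))


def rank_drugs(scores, known, disease_idx, top_k):
    """Rank drugs without a known interaction by predicted score (descending)."""
    candidates = [(scores[d][disease_idx], d)
                  for d in range(len(scores))
                  if known[d][disease_idx] == 0]
    candidates.sort(key=lambda t: (-t[0], t[1]))  # ties broken by drug index
    return [d for _, d in candidates[:top_k]]


# Toy data: 3 drugs, 2 diseases, feature dimension F = 2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # drug features
Y = [[1.0, 0.0], [0.0, 1.0]]               # disease features
Z = [[1.0, 0.0], [0.0, 0.5]]               # projection matrix (assumed learned)
P = [[1, 0], [0, 0], [0, 0]]               # known interactions

scores = reconstruct(X, Z, Y)
top = rank_drugs(scores, P, disease_idx=0, top_k=1)
```

For disease 0, drug 0 is skipped because its interaction is already known, and drug 2 is ranked ahead of drug 1 because its reconstructed score is higher.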

Experiments
Our model is tested on two experiments. First, we employ the Att-GCN-DDI model to predict drug candidates for COVID-19 and conduct a case study based on predicted drugs. Then we compare Att-GCN-DDI with 5 baseline models on the drug repositioning experiment for ordinary diseases.

Results
In the experiment, we used the Att-GCN-DDI model to predict drug candidates against COVID-19. Here, we take COVID-19 KG as the input. Since the KG does not include the interactions between COVID-19 and drugs, we can use all known DDIs as the training set to train our model and finally reconstruct the prediction matrix.
After analysis and a literature search, five of the 30 predicted drug candidates were found to have been clinically proven to be viable for COVID-19 treatment. Their information is shown in Table 3. These results indicate that the drug candidates against COVID-19 predicted by Att-GCN-DDI are generally reliable.

Case Study
We analyzed the paths between COVID-19 and our drug candidates in the KG to understand why these drugs are more likely to treat COVID-19 than others. The paths can be divided into two types. In the first type, the drug is connected to COVID-19 directly through genes. In this case, the drug can be regarded as acting on COVID-19-related genes to treat COVID-19. In the second type, the drug is connected to COVID-19 without acting directly on its genes. Take the drug Tipranavir as an example. The paths between this drug and COVID-19 in the KG are shown in Figure 3. Although Tipranavir does not directly act on COVID-19-related genes, drugs related to Tipranavir directly act on the COVID-19-related genes ACE2 and pol, respectively. Therefore, we believe that this drug is indeed more likely to have a therapeutic relationship with COVID-19 than other drugs.

DDI Prediction for Other Diseases
At present, only a small number of medicines have been proven to have a therapeutic effect on COVID-19. Performing DDI experiments on COVID-19 alone therefore cannot fully verify the effectiveness of the Att-GCN-DDI model, so we also test our model on DDI prediction for other diseases.
For evaluation, 90% of the positive and negative examples were used as the training set, and the remaining 10% of positive and negative examples were regarded as the test set.
We then compared the performance of Att-GCN-DDI with GCN-DDI, Neo-DTI [5], DTI-NET [6] and HNM [24] in predicting DDIs. GCN-DDI denotes the variant of our framework in which the GCN model does not use the attention mechanism. The area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUROC) were used to evaluate the predictive performance of all prediction methods. The evaluation results are shown in Figure 4. The results show that Att-GCN-DDI is superior to the other methods in both AUROC and AUPR. Att-GCN-DDI adopts the Att-GCN model to extract features by node embedding, so the extracted features better retain the topological structure of the nodes. This makes the prediction more accurate.
Next, we tested Att-GCN-DDI in another scenario by including all positive and negative examples in the 10-fold cross-validation procedure. The evaluation results are shown in Figure 5. We find that the performance of Att-GCN-DDI in this scenario is also superior to the other methods. However, the AUPR of GCN-DDI is very close to that of the other baseline methods. The main reason is that, as the data volume expands, the model's demand for accurate topological feature extraction decreases. Therefore, compared with the other methods, applying GCN alone for feature extraction has no obvious advantage. From Figures 4 and 5, we can see that Att-GCN-DDI, which uses the attention mechanism, performs clearly better than GCN-DDI in both scenarios. This is because the attention mechanism can flexibly capture the relationship between global information and local information, which is very important for extracting topological feature information from the knowledge graph.
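For reference, the two evaluation metrics can be computed from labels and predicted scores as sketched below. This is a minimal illustration: AUROC is computed as the fraction of correctly ordered positive-negative pairs, and AUPR is approximated by average precision, which is one common estimator and an assumption here. The function names and the four-example toy data are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative.

    Tied scores count as half a correctly ordered pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision values at each positive hit.

    Assumes at least one positive label; tied scores keep input order.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy predictions for four drug-disease pairs (not experimental data).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

roc = auroc(labels, scores)             # 3 of 4 pairs ordered correctly
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

AUPR is usually the more informative of the two when negatives far outnumber positives, as is typical for drug-disease pairs, because it is not inflated by the large number of easy negatives.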

Discussion
This paper collects COVID-19 information and inserts it into the existing medical KG to build a KG for COVID-19. Based on this KG, the GCN-based drug repositioning model is used to predict potential therapeutic drugs for COVID-19. We conduct drug repositioning experiments on COVID-19 and on other diseases. Our model ultimately identified 30 potential drugs for COVID-19 treatment, of which five have proven to be clinically effective. In the DDI prediction experiments for other diseases, our model outperforms the baseline methods. Our work supports the preliminary screening of drugs for new diseases and helps medical staff to screen out potential drugs for new diseases in the shortest time. However, our research also has some shortcomings. For example, the GCN model can effectively learn the structural information and the relations between nodes in the knowledge graph, but it cannot learn a representation of the relation itself, let alone the direction of the relation. Therefore, in the future, we will draw on the structure of GraphSAGE and the GAT (Graph Attention Network) model to improve the ability of our model in relation representation.