Data-Driven Drug Repurposing in Diabetes Mellitus through an Enhanced Knowledge Graph †

Abstract: Diabetes mellitus affects more than 400 million people worldwide, and the incidence of the disease is rising. Current anti-hyperglycemic agents share major drawbacks, such as hypoglycemia and low potency due to a lack of target specificity. Drug repurposing accelerates drug research and development pipelines and empowers chemical space enrichment. Herein, we propose a data-driven approach towards drug repurposing in diabetes mellitus by integrating heterogeneous biomedical data in a unified knowledge graph. Through extensive data mining in public repositories, diabetes-related multimodal data have been retrieved. Several data analysis techniques were employed to extract information and define semantic associations, starting with data parsing and continuing with descriptive statistics, regression, and cluster analysis. Biomedical entity recognition and negation detection were performed by natural language processing. Predefined biological ontologies served as reference endpoints for class definition upon data integration. Graph analytics were performed, and drug–drug, protein–protein, drug–protein, and drug–disease interactions were established. A majority vote-based machine learning framework for the prediction of human cytochrome P450 inhibitors was also integrated into the proposed enhanced knowledge graph analysis, which facilitates data-driven ranking for drug repurposing candidates in diabetes mellitus. The presented method yields a ranked list of repurposing candidates.


Eng. Proc. 2023, 50, 9

Introduction
Diabetes mellitus (DM) is a fast-growing disease of the endocrine system worldwide, regarded as a modern pandemic given its global prevalence. According to the latest data from the International Diabetes Federation, 536.6 million people were affected by diabetes in 2021, while 6.7 million deaths occurred due to this condition. The number of people afflicted by diabetes is expected to rise to 783.2 million in 2045 [1]. Diabetes is a metabolic disorder in which continuously elevated levels of blood glucose occur, a state called hyperglycemia. Diabetes can be classified into four main categories based on disease etiology and pathogenesis [2]. The most prevalent disease phenotypes include type 1 (5-10%) and type 2 (90-95%) diabetes [2]. Type 1 diabetes is an insulin-dependent autoimmune disorder that is characterized by pancreatic beta-cell dysfunction, leading to dysregulation of insulin response and hyperglycemia [3]. Type 2 diabetes, on the other hand, is insulin-independent and characterized by insulin resistance, resulting in the excessive function of beta-cells to maintain normoglycemia [4].
Along with conventional drug discovery, drug repurposing holds promise for the control of the diabetes epidemic [7]. To this end, several in silico approaches that employ heterogeneous data sources have been developed, such as machine learning, text mining, and network analysis [8] or knowledge graph-based drug repurposing. The latter facilitates a data integration framework for the unified analysis of heterogeneous data, enabling the utilization of different layers of information [9]. Ghorbanali et al. [10] proposed the DrugRep-KG method, which employs knowledge graph embedding to represent drug and disease associations in a unified latent space towards drug repurposing. Zhu et al. [11] introduced a similar approach, which includes several drug databases in an integrated and unified knowledge graph. The drug knowledge graph was then used to predict drug repurposing candidates through machine learning models. Herein, we propose a data-driven approach towards drug repurposing in diabetes mellitus by integrating heterogeneous biomedical data and predictions of in-house machine learning models in a unified knowledge graph. Molecular docking data were used to enrich the knowledge graph in question. Overall, the proposed enhanced knowledge graph analysis facilitates a data-driven ranking for drug repurposing candidates in diabetes mellitus.

Databases and Repositories
Heterogeneous biomedical data were collected from publicly available repositories. Information regarding bioactive molecules was gathered from the DrugBank database [12]. An important feature provided by this repository is the mapping of protein targets for each bioactive molecule to the UniProt database [13]. UniProt served as the main data source for proteins, providing information about their biological function and structure. Next, the SureChEMBL platform [14] was used to extract patent data, while datasets from clinical trials were collected from ClinicalTrials.gov [15]. Additionally, pharmacogenomics data were extracted from the PharmGKB repository [16]. Another repository used was OmniPath [17], as it contains information about signaling network interactions, enzyme-substrate relationships, protein complexes, protein annotations, and intracellular communication. Complementary to the aforementioned datasets, information about molecular pathways was retrieved from Reactome [18], while pharmacogenomic data were enriched with data from the ENSEMBL repository [19]. Additionally, pharmacogenomic recommendations were obtained from CPIC [20]. Gene sequences were retrieved from RefSeq [21]. The ClinVar [22] and dbSNP [23] platforms provided information about the clinical significance of selected genomic variants (missense mutations) and their frequency of occurrence in different population groups. The ChEMBL repository [24] was also queried for experimental data regarding either the pharmacological response of chemical molecules in cellular assays or experimental binding values to specific protein targets. miRNA-protein interactions were collected from the miRTarBase platform [25], and TCGA was queried for gene-cancer type associations [26]. Data on drug responses were retrieved from PharmacoDB [27]. Finally, data regarding protein-disease associations were obtained from the OpenTargets platform [28]. The databases used and the layers of information they provided are illustrated in Figure 1.

Data Gathering
Data gathering was performed per data source. Data were available (a) as downloadable files, either through the source's webpage or an FTP connection (e.g., DrugBank, UniProt); (b) via data scraping (e.g., SureChEMBL); or (c) via REST APIs (e.g., OmniPath).

Information Extraction per Data Type
Mining and extensive filtering were performed per data type. For SureChEMBL, the first step was to parse the data retrieved via scraping and then extract the claims of each patent. Named entity recognition (NER) was applied to annotate biomedical terms. For clinical trials, data mining and filtering were applied to extract drug-disease and protein-disease associations. For pharmacogenomics, data on clinical significance for the missense mutations located at protein binding sites were prioritized, along with their frequency of occurrence and pharmacogenomic recommendations. The data collected from OpenTargets were filtered, focusing only on direct relations. Finally, for ChEMBL-derived data, only drug-protein associations, along with experimental values per assay type, were retained after filtering. The overall workflow for each data type is summarized in Figure 2.
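As an illustration of the annotation step, the following minimal sketch tags terms from a small hand-written dictionary and flags mentions that are preceded by a negation cue in the same sentence. The lexicon, the cue list, the example text, and the function name are all hypothetical; the tools actually used for NER and negation detection are not specified here, and a realistic pipeline would rely on curated ontologies and a trained model rather than string matching.

```python
import re

# Hypothetical mini-lexicon; a real pipeline would draw terms from curated ontologies.
LEXICON = {
    "sitagliptin": "DRUG",
    "metformin": "DRUG",
    "dpp-4": "PROTEIN",
    "type 2 diabetes": "DISEASE",
    "hypoglycemia": "DISEASE",
}
# Simple rule-based negation cues (an assumption, in the spirit of NegEx-like detectors).
NEGATION_CUES = ("no", "not", "without", "absence of")


def annotate(text):
    """Return (term, label, negated) for every lexicon term found in the text."""
    results = []
    # Split into rough sentences so that a cue only affects its own sentence.
    for sentence in re.split(r"(?<=[.;])\s+", text.lower()):
        for term, label in LEXICON.items():
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            for match in re.finditer(pattern, sentence):
                prefix = sentence[:match.start()]
                # Crude scope rule: any cue earlier in the sentence negates the mention.
                negated = any(
                    re.search(r"\b" + re.escape(cue) + r"\b", prefix)
                    for cue in NEGATION_CUES
                )
                results.append((term, label, negated))
    return results


claims = "Sitagliptin inhibits DPP-4 in type 2 diabetes. The compound did not cause hypoglycemia."
annotations = annotate(claims)
for term, label, negated in annotations:
    print(term, label, "negated" if negated else "affirmed")
```

The negation scope used here is deliberately coarse; it is only meant to show why negation detection matters when turning free text into drug-disease or protein-disease associations.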

Molecular Docking
Molecular interaction data were also generated by molecular docking as an extra layer of information. In this context, virtual screening through docking simulations was performed for 7,955 bioactive molecules and 529 protein targets. AutoDock Vina [29] and the Protein Data Bank (PDB) [30] were used.

Data Integration in an Enhanced Knowledge Graph
The information extracted led to a large data volume, complex inter-relationships, and extreme heterogeneity. Relational databases could not be used as they lack scalability and cannot handle unstructured data. Hence, a graph database was employed to manage and query connected data that share semantic relations (Figure 3). For data integration in a unified knowledge graph, further preprocessing of the extracted information took place. The knowledge graph in question was further enriched with cytochrome P450 toxicity predictions [31]. Additionally, docking scores were included after normalization [32].
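To make the idea of typed nodes and attributed relations concrete, the sketch below builds a tiny in-memory property graph and queries drug-protein-disease paths. It is only an illustration: the class, the node labels ("Drug", "Protein", "Disease"), the relation names ("TARGETS", "ASSOCIATED_WITH"), the placeholder identifiers, and the docking score value are assumptions, and the graph database and schema actually used are not described here.

```python
class KnowledgeGraph:
    """Tiny in-memory property graph: typed nodes and typed, attributed edges."""

    def __init__(self):
        self.node_labels = {}   # node id -> label, e.g. "Drug"
        self.edges = {}         # source id -> list of (relation, target id, properties)

    def add_node(self, node_id, label):
        self.node_labels[node_id] = label

    def add_edge(self, source, relation, target, **properties):
        self.edges.setdefault(source, []).append((relation, target, properties))

    def neighbors(self, source, relation):
        """Targets (with edge properties) reachable from source via one relation type."""
        return [(t, p) for r, t, p in self.edges.get(source, []) if r == relation]

    def drug_disease_paths(self, disease):
        """Drugs connected to a disease through a shared protein target."""
        paths = []
        for drug, label in self.node_labels.items():
            if label != "Drug":
                continue
            for protein, props in self.neighbors(drug, "TARGETS"):
                for linked_disease, _ in self.neighbors(protein, "ASSOCIATED_WITH"):
                    if linked_disease == disease:
                        paths.append((drug, protein, props.get("docking_score")))
        return paths


kg = KnowledgeGraph()
kg.add_node("drug_A", "Drug")        # placeholder compound names
kg.add_node("drug_B", "Drug")
kg.add_node("DPP4", "Protein")
kg.add_node("protein_X", "Protein")
kg.add_node("type 2 diabetes", "Disease")
# The docking score is a made-up normalized value stored as an edge property.
kg.add_edge("drug_A", "TARGETS", "DPP4", docking_score=-2.1)
kg.add_edge("drug_B", "TARGETS", "protein_X")
kg.add_edge("DPP4", "ASSOCIATED_WITH", "type 2 diabetes")

print(kg.drug_disease_paths("type 2 diabetes"))
```

Storing predictions such as docking scores as edge properties keeps them queryable alongside curated associations, which is the practical benefit of the enrichment step described above.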

Graph-Based Machine Learning
Link prediction employing machine learning models was applied based on the subnetwork of drug-protein associations. The aim was to predict new relationships between these two entities of the graph in question, taking into account already-known relationships. To this end, the subnetwork of interest was extracted from the knowledge graph, and drug-protein pairs were labeled by assigning pairs with known interactions to the positive class, whereas the negative class included those pairs devoid of drug-protein associations as indicated by experimental values (IC50, EC50, and Ki). Next, feature extraction was performed based on local statistical measurements of the distances between drug-protein pairs using the Fast Random Projection (FastRP) method [33]. Of note, the extracted measurements characterized each node and not the pair. Therefore, the next step was to combine the pairs by multiplying the feature vectors of each node. By importing the drug-protein pairs and their features, classifiers were designed and trained to distinguish between the connected and non-connected pairs. Data splitting took place in a 70:30 ratio for training and test sets. The training data set was used to design three classifiers (Random Forest, Support Vector Machine, and k-nearest neighbors), for which the optimal parameters were found through 10-fold cross-validation. The optimal models of each classifier were tested on the external test set. The process of splitting the data, designing, and testing the classifiers was performed ten times.
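The feature construction described above can be sketched as follows. This is a simplified, pure-Python random-projection embedding written in the spirit of FastRP, not the reference implementation, followed by the element-wise multiplication of node vectors and a 70:30 split. The function names, the toy graph, the embedding dimension, the iteration weights, and the seeds are assumptions made for illustration; in particular, the toy example treats missing edges as negatives, whereas the text assigns negatives from experimental values. Classifier training is omitted.

```python
import math
import random


def fastrp_like_embeddings(adjacency, dim=8, weights=(0.0, 1.0, 1.0), seed=0):
    """Simplified random-projection node embeddings (a sketch, not reference FastRP).

    adjacency maps each node to a list of its neighbours (undirected graph).
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    scale = math.sqrt(3.0)
    # Very sparse random projection: +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 2/3, 1/6.
    current = {
        n: [rng.choice((scale, 0.0, 0.0, 0.0, 0.0, -scale)) for _ in range(dim)]
        for n in nodes
    }
    embedding = {n: [0.0] * dim for n in nodes}
    for weight in weights:
        # One propagation step: every node takes the mean of its neighbours' vectors.
        current = {
            n: [
                sum(current[m][k] for m in adjacency[n]) / max(len(adjacency[n]), 1)
                for k in range(dim)
            ]
            for n in nodes
        }
        # Accumulate the L2-normalised vectors, weighted per iteration.
        for n in nodes:
            norm = math.sqrt(sum(x * x for x in current[n])) or 1.0
            for k in range(dim):
                embedding[n][k] += weight * current[n][k] / norm
    return embedding


def pair_features(embedding, drug, protein):
    """Combine two node vectors into one pair vector by element-wise multiplication."""
    return [a * b for a, b in zip(embedding[drug], embedding[protein])]


def split_70_30(samples, seed):
    """Shuffle and split samples into 70% training and 30% test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy bipartite drug-protein subnetwork with placeholder names.
adjacency = {
    "drug_A": ["P1", "P2"], "drug_B": ["P2"], "drug_C": ["P3"],
    "P1": ["drug_A"], "P2": ["drug_A", "drug_B"], "P3": ["drug_C"],
}
drugs, proteins = ["drug_A", "drug_B", "drug_C"], ["P1", "P2", "P3"]

embedding = fastrp_like_embeddings(adjacency, dim=8, seed=0)
# Label 1 for known pairs; non-edges serve as negatives only in this toy example.
dataset = [
    (pair_features(embedding, d, p), 1 if p in adjacency[d] else 0)
    for d in drugs
    for p in proteins
]
train, test = split_70_30(dataset, seed=1)
print(len(train), "training pairs,", len(test), "test pairs")
```

Repeating the split with ten different seeds and averaging the test metrics would mirror the ten repetitions mentioned above.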

Link Prediction through Machine Learning
The machine learning models developed to perform link prediction for the drug-protein pairs considered were evaluated through 10-fold cross-validation and tested on an external test set. The mean performance of the models employing the FastRP embedding method is provided in Table 1. As summarized in Table 1, the metrics indicate that the models (a) generalize well enough, as they achieve similar performance on the external test set, and (b) discriminate the drug-protein pairs that are linked from those that are not. The optimal parameters selected for each classifier through cross-validation were the following:

• Support Vector Machines: radial basis function (RBF) kernel, sigma equal to 0.0043, and the cost of constraints violation (C) set to 1.
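For reference, a radial basis function kernel with width parameter sigma is commonly written K(x, y) = exp(−sigma · ‖x − y‖²). The sketch below evaluates it with sigma = 0.0043, assuming this parameterisation (the one used, for example, by the kernlab package in R); the software behind the reported values is not stated here, and the input vectors are made up.

```python
import math

SIGMA = 0.0043  # kernel width reported above


def rbf_kernel(x, y, sigma=SIGMA):
    """K(x, y) = exp(-sigma * ||x - y||^2), assuming this sigma parameterisation."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])      # identical vectors: similarity of 1.0
apart = rbf_kernel([0.0, 0.0], [10.0, 10.0])   # squared distance of 200
print(same, apart)
```

With such a small sigma the kernel decays slowly with distance, so even fairly distant pair vectors retain a non-negligible similarity.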

Molecular Docking Analysis for DPP-4 Inhibitors
To identify dipeptidyl peptidase-4 (DPP-4) inhibitors, docking results were analyzed for drug repurposing candidates. A simple condition was set to identify the most potent inhibitors: the docking score of a candidate inhibitor should be better (lower) than the docking score of the reference inhibitor of DPP-4. The list of drug repurposing candidates is depicted in Figure 4 as a histogram of their docking scores; 15 compounds were predicted to be more potent DPP-4 inhibitors, with a normalized docking score lower than −2.05, among 392 test compounds that had a score lower than −1.9 (reference score).
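The selection rule above amounts to a threshold filter on normalized docking scores, which can be sketched as follows. The two cut-offs are taken from the text; the compound names, their scores, and the function name are made up for illustration.

```python
REFERENCE_SCORE = -1.9   # normalized score of the DPP-4 reference inhibitor (from the text)
STRICT_CUTOFF = -2.05    # stricter cut-off quoted for the shortlisted compounds


def better_than(scores, cutoff=REFERENCE_SCORE):
    """Keep compounds whose normalized docking score is lower (better) than the cutoff.

    Returns (name, score) tuples sorted from best to worst score.
    """
    kept = [(name, score) for name, score in scores.items() if score < cutoff]
    return sorted(kept, key=lambda item: item[1])


# Hypothetical compounds and scores.
scores = {"cmpd_1": -2.30, "cmpd_2": -1.95, "cmpd_3": -1.50, "cmpd_4": -2.06}

candidates = better_than(scores)                 # better than the reference inhibitor
shortlist = better_than(scores, STRICT_CUTOFF)   # subset passing the stricter cut-off
print(candidates)
print(shortlist)
```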

Identifying Drug Repurposing Candidates
The top-15 drug repurposing candidates were filtered based on (a) cytochrome P450 inhibition, (b) structural similarity to known DPP-4 inhibitors (Tanimoto score), (c) data from clinical trials, and (d) patent data and pharmacogenomics, which led to the top-four drug repurposing candidates. The latter are ranked by their docking scores, their probability of serving as DPP-4 ligands according to the SVM classifier, and their Tanimoto scores.
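A minimal sketch of the similarity and ranking steps is given below. The Tanimoto coefficient of two binary fingerprints is the number of shared "on" bits divided by the number of bits set in either fingerprint. How the three ranking criteria are combined is not specified in the text, so the lexicographic ordering used here (docking score first, then SVM probability, then Tanimoto score) is an assumption, as are the fingerprints, candidate names, and values.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def rank_candidates(candidates):
    """Order by docking score (lower first), then SVM probability and Tanimoto (higher first)."""
    return sorted(
        candidates,
        key=lambda c: (c["docking"], -c["svm_probability"], -c["tanimoto"]),
    )


# Hypothetical fingerprint of a known DPP-4 inhibitor and three made-up candidates.
reference_fp = {1, 2, 3, 4}
candidates = [
    {"name": "cand_1", "docking": -2.30, "svm_probability": 0.80, "fp": {2, 3, 4, 5}},
    {"name": "cand_2", "docking": -2.30, "svm_probability": 0.90, "fp": {1, 2}},
    {"name": "cand_3", "docking": -2.10, "svm_probability": 0.95, "fp": {1, 2, 3, 4}},
]
for cand in candidates:
    cand["tanimoto"] = tanimoto(cand["fp"], reference_fp)

ranking = [c["name"] for c in rank_candidates(candidates)]
print(ranking)
```

A weighted score combining the three criteria would be an equally plausible design; the lexicographic order simply makes the priority of each criterion explicit.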

Discussion
Herein, an enhanced knowledge graph was designed for a holistic view, processing, and curation of biomedical knowledge coupled to (a) structural information generated by molecular docking and (b) machine learning models. Overall, such a design allowed for faster and better filtering of drug repurposing candidates in diabetes mellitus after building upon the efficacy, safety, and selectivity ranking for test compounds. DPP-4 served as a paradigm, yet our strategy is robust and easy to adapt.

Conclusions
The enhanced knowledge graph analysis presented herein facilitates data-driven ranking for drug repurposing candidates in diabetes mellitus. This is a unified system for integrating multi-modal heterogeneous data for informed drug repurposing. DPP-4 served as a paradigm, resulting in the top-four candidates. Overall, this is a robust adaptive strategy.

Figure 1. Collection of public repositories mined to extract biomedical data.

Figure 2. Data analysis pipeline per data source to extract the information of prime interest.

Figure 3. A schematic representation of our enhanced biomedical knowledge graph.

Figure 4. Histogram of the normalized docking scores for those test compounds sharing better docking scores than the DPP-4 reference inhibitor.

Table 1. Machine learning models and their mean performance for ten iterations.