Changing Trends in Computational Drug Repositioning

Efforts to maximize the indication potential and revenue of drugs that are already marketed are largely motivated by what Sir James Black, a Nobel Prize-winning pharmacologist, advocated: “The most fruitful basis for the discovery of a new drug is to start with an old drug”. However, rational design of drug mixtures poses formidable challenges because information about in vivo cell regulation, mechanisms of genetic pathway activation, and in vivo pathway interactions is limited or lacking. Hence, most of the successfully repositioned drugs are the result of “serendipity”: unexpected but beneficial findings made during late-phase clinical studies. The connections between drug candidates and their potential adverse drug reactions or new applications are often difficult to foresee because the underlying mechanisms associating them are largely unknown, complex, or dispersed and buried in silos of information. Discovery of such multi-domain pharmacomodules (pharmacologically relevant sub-networks of biomolecules and/or pathways) from collections of databases by independent or simultaneous mining of multiple datasets is an active area of research. Here, while presenting some of the promising bioinformatics approaches and pipelines, we summarize and discuss the current and evolving landscape of computational drug repositioning.


Introduction
The path to new drug discovery has always been a road full of twists and turns. De novo drug discovery in particular is an expensive, time-consuming, and high-risk process. For instance, the total average cost of developing a new drug is estimated to range from $2 billion to $3 billion, and it takes at least 13-15 years to bring a drug to the market, from initial discovery to the approval stage [1]. Further, the process suffers from a high rate of attrition. Only about 10% of the drugs that enter clinical trials are approved by regulatory agencies [2]. The remaining 90% fail because of inefficacy or high toxicity, reflecting the limited predictive value of preclinical studies [3]. Nearly 62% of the compounds fail in Phase II, and approximately 45% attrition occurs in Phase III [4]. This attrition is attributed to insufficient R&D productivity in identifying the drug response on the target, owing to the limited availability of preclinical disease models, and has raised concerns in the pharmaceutical industry [5]. Despite rapid technological advances and exponential increases in pharmaceutical R&D investments, the number of newly approved drugs has remained essentially unchanged [6]. To overcome these challenges and to potentially bypass this productivity gap, more and more companies are resorting to "drug repositioning" or "drug repurposing" (sometimes also referred to as drug reprofiling, drug retasking, or therapeutic switching), that is, identifying and developing new therapeutic uses for existing or abandoned pharmacotherapies [7]. The premise is that since most approved compounds have known bioavailability and safety profiles, proven formulation and manufacturing routes, and reasonably characterized pharmacology, repositioned drugs can enter clinical phases more rapidly and at a lower cost than novel compounds.
Further, the 90% therapeutic development failure rate means there are many existing, partially developed therapeutic candidates that could be revisited, explored further, and potentially repurposed for a new disease, common or rare. It is therefore not surprising that, in recent years, repositioned drugs have accounted for approximately 30% of the new drugs reaching their first markets. For instance, of the 113 new drugs and biologics approved or launched in 2017, only seven were first-in-class agents (an approved and launched first drug with a novel mechanism of action) while 36 were repositioned drugs [8]. As per an estimate, this bypassing can potentially make a drug available for use in patients within 3-12 years at a total estimated cost of $40-80 million [9,10].

Table 1. Examples of repositioned drugs (adapted in part from [11]; this list is neither extensive nor exhaustive). Columns: Drug, Original Indication, New Indication.

Most of the successful cases of drug repurposing have been serendipitous discoveries rather than systematic, hypothesis-driven outcomes. These include the accidental discovery of thalidomide as an agent for leprosy or the more notable example of sildenafil, an angina medication developed in 1989 and subsequently marketed as Viagra®, a blockbuster drug to treat erectile dysfunction [12] (see Table 1 for additional examples of drug repositioning). De novo drug therapies for more than 8000 orphan or rare diseases cannot feasibly be developed at current R&D costs; however, drug repositioning, with its premise of discovering hidden connections or building new connections between a drug and a disease, holds promise for orphan disease therapy [13]. Further, revisiting approved drugs to identify new indications helps pharmaceutical companies extend the patent life of drugs through application to adjacent diseases and also helps them protect their IP against competitors [14].
In silico methods such as data mining, machine learning, and network-based approaches offer an unprecedented opportunity to predict possible drug repositioning candidates using the diverse and heterogeneous data sources available from genomics and biomedical domains [15]. Indeed, predictive models have been built with these methods by exploiting existing data such as protein targets, chemical structures, or phenotypic information such as side-effect and gene expression profiles. While advances in computational sciences bring the possibility of applying novel algorithms and approaches to systems biology data, these datasets themselves have triggered fundamental research on more complex problems [16]. As a result of this hybrid approach of combining computational methods and experimental screening, various modalities of drug repositioning methods have emerged. Computational drug repositioning methods focus on shared characteristics between two drugs and, depending on whether the starting point is drug-based or disease-based [17], can be classified into target-based, expression-based, knowledge-based, chemical structure-based, pathway-based, and mechanism of action-based methods [18]. In this review article, we briefly outline the recent progress in computational methods and strategies applied to drug-disease data for drug repositioning investigations.

Approaches
In silico drug repurposing problems, whether drug-centric (i.e., discovering new indications for existing drugs) or disease-centric (i.e., identifying an effective drug as a potential treatment for a disease), share the common challenge of assessing the similarity or connections either between drugs or between diseases [19]. Jin and Wong [18] reviewed a variety of approaches used as a basis for computational drug repurposing. These can be broadly categorized as knowledge-based and signature-based approaches.

Knowledge-Based Drug Repurposing
This repurposing method utilizes the available information on drugs, such as drug targets, chemical structures, adverse effects, and pathways, and builds computational models to predict unknown mechanisms, targets, or new biomarkers for diseases [20][21][22][23][24]. In the pathway-based approach, data on signaling pathways, metabolic pathways, and protein-interaction networks are used to compute the similarity or connections between drug and disease. Processed omics data, for example from human patients or animal models of disease, are used to reconstruct disease-specific pathways that can serve as key targets for novel therapeutic discovery or for repositioned drugs [25][26][27][28][29][30]. Target mechanism-based approaches, on the other hand, take into account the known mechanism of action and the role of the target: here, the available data on signaling pathways, protein interactions, and omics are integrated to identify the potential mechanism of action (MoA) of drugs [31][32][33][34]. This in turn can help find better and more specific drug targets and discover alternative medications for a disease.

Signature-Based Drug Repurposing
This method makes use of gene expression signatures by comparing drug gene expression profiles with disease gene expression profiles and is frequently referred to as the 'signature reversion' method [35]. Gene expression-based methods are effective in constructing a detailed map of connections between diseases and drug actions [36-40].
Table 2 (excerpt). STRING; Disease; protein-protein interaction, analysis, and networks; https://string-db.org/cgi/input.pl [82].
Connectivity Map (CMap) [83,84], NCBI's Gene Expression Omnibus (GEO) [75], and the relatively recent LINCS datasets [44] are also extensively explored in drug repositioning studies.
Recent technical and technological advancements in molecular biology and the exponential growth of biomedical data, while presenting challenges, have also opened up an array of opportunities to develop and apply novel and powerful computational approaches that can enable informed drug repositioning. The free availability of data repositories is further directing and catalyzing these efforts. In Table 2 we present some of the widely used open source drug-centric, disease-centric, and related databases. These include, for instance, databases that provide information on known targets, mechanism of action, gene expression, clinical status, ADMET properties, and signaling pathways, as well as disease-centric databases that hold omics data (transcriptomic, proteomic, and genetic characteristics of diseases).
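As a rough illustration of the signature reversion idea described in this section, the sketch below ranks drugs by how strongly their expression profile anti-correlates with a disease profile. It is a simplification under stated assumptions: a plain Spearman rank correlation stands in for the enrichment-style connectivity scores used by resources such as CMap, and the gene names, drug names, and expression values are hypothetical toy data.

```python
from statistics import mean


def ranks(values):
    """Return 1-based average ranks for a list of numbers (ties share a rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation; returns 0.0 if either input is constant."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def reversal_scores(disease_sig, drug_sigs):
    """Sort drugs from most reversing (most negative correlation) to least."""
    genes = sorted(disease_sig)
    disease = [disease_sig[g] for g in genes]
    scores = {
        drug: spearman(disease, [sig[g] for g in genes])
        for drug, sig in drug_sigs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])


# Hypothetical log fold-changes (disease vs. control, drug vs. vehicle).
disease_signature = {"G1": 2.0, "G2": 1.5, "G3": -1.0, "G4": -2.5}
drug_signatures = {
    "drug_a": {"G1": -1.8, "G2": -1.0, "G3": 0.7, "G4": 2.2},   # opposes disease
    "drug_b": {"G1": 1.9, "G2": 1.1, "G3": -0.5, "G4": -2.0},   # mimics disease
}
ranking = reversal_scores(disease_signature, drug_signatures)
print(ranking)
```

Drugs at the top of the ranking reverse the disease signature most strongly and would be the candidates carried forward; a real analysis would also need to handle genes missing from either profile and assess significance.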

In Silico Methods for Drug Repositioning
In the following sections, we present an overview of some of the in silico methods, both current and emerging, used to facilitate drug repositioning candidate discovery.

Machine Learning
A machine learning workflow typically comprises four steps: data pre-processing, feature extraction, model fitting, and evaluation [85]. PREDICT is a similarity-based machine learning framework that integrates drug-drug similarity (based on drug-protein interactions, sequence, and gene ontology) and disease-disease similarity (based on disease phenotypes and the human phenotype ontology). Its authors used these similarities as features for logistic regression to predict similar drugs for similar diseases and achieved an AUC of 0.9 in predicting drug indications [86]. SPACE, another similarity-based method, predicts the anatomical therapeutic chemical (ATC) classification of drugs by integrating multiple data sources using logistic regression [87]. Likewise, several other similarity-based methods have been reported for predicting novel drug indications [88][89][90].
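To make the feature extraction step of such similarity-based methods concrete, the sketch below scores a candidate drug-disease pair by its closest known association, combining one drug-drug and one disease-disease similarity through a geometric mean. This is an illustrative construction only, not the published implementation of any of the cited methods; the function name, the similarity values, and the known association are hypothetical.

```python
from math import sqrt


def pair_score(drug, disease, known_pairs, drug_sim, disease_sim):
    """Score a candidate (drug, disease) pair by its most similar known pair.

    drug_sim[a][b] and disease_sim[x][y] are assumed to be symmetric
    similarities in [0, 1]. The candidate itself is skipped so that a known
    association cannot trivially support itself.
    """
    best = 0.0
    for known_drug, known_disease in known_pairs:
        if (known_drug, known_disease) == (drug, disease):
            continue
        s = sqrt(drug_sim[drug][known_drug] * disease_sim[disease][known_disease])
        best = max(best, s)
    return best


# Hypothetical toy similarities and a single known indication.
drug_sim = {
    "A": {"A": 1.0, "B": 0.81},
    "B": {"A": 0.81, "B": 1.0},
}
disease_sim = {
    "X": {"X": 1.0, "Y": 0.25},
    "Y": {"X": 0.25, "Y": 1.0},
}
known = [("A", "X")]

for candidate in [("B", "X"), ("B", "Y")]:
    print(candidate, round(pair_score(*candidate, known, drug_sim, disease_sim), 3))
```

With several drug and disease similarity measures, each combination would yield one such score, and the resulting feature vector for every candidate pair could then be passed to a classifier such as logistic regression.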
Deep learning, a large class of machine learning models composed of multiple processing layers that represent data at a high level of abstraction, is now being explored in computational biology for a wide variety of applications, including drug discovery [91,92]. The principal difference between conventional "shallow" learning (a neural network with one or two hidden layers) and deep learning is that the former does not deal with raw data and requires a feature extraction step before the learning process, whereas the latter discovers intricate structure in large data sets and uses the backpropagation algorithm to incrementally adjust the internal parameters that compute the representation in each layer from the representation in the previous layer [92]. Deep learning-based approaches have dramatically improved the state of the art in speech recognition, visual object recognition, and object detection, and are currently being explored in biomedical and genomic domains. Aliper and Plis, for example, used deep learning with gene expression data to learn drug therapeutic categories and found that deep neural networks surpassed support vector machines (SVM) under 10-fold cross-validation, offering a working proof of concept for applying deep learning to drug discovery and development [93]. Interestingly, Zhao and Cheong compared a deep neural network (DNN) approach with an SVM-based approach to predict psychiatric drug indications based on the expression profiles of drugs and reported that [37]. While more studies are needed to understand whether DNN-based approaches indeed have the claimed benefits, there have been additional reports suggesting that deep learning-based approaches perform better than traditional machine learning algorithms in toxicity prediction by enabling multi-task learning [94,95].

Network Models
Network-based approaches have been extensively exploited in computational drug repositioning for identifying novel drug targets, interactions, and indications [96]. Typically, in these models, the nodes in the networks represent drugs, diseases, or gene products, and the edges represent the interactions or relationships between them. These networks are either knowledge-based or computationally inferred using multiple data resources and have various representations such as drug-drug, drug-target, drug-disease, disease-disease, disease-gene, disease-drug, protein-protein interaction, and transcriptional networks [97]. Cheng and Liu computed three types of similarity (drug-based, target-based, and network-based) to predict drug-target interactions in a bipartite network and found that the network-based inference method performed best, with an average ROC AUC of 0.96 [21]. Similar homogeneous or bipartite network models have been built using phenotype data such as side effects [98][99][100], transcriptional data [101][102][103], drug-disease data [104,105], and signaling pathway data [25].
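The intuition behind network-based inference on a bipartite drug-target network can be sketched as a two-step resource diffusion: resource placed on a drug's known targets flows to the drugs sharing those targets and then back to all of their targets. The code below is a minimal sketch of that general idea under our own assumptions (unweighted edges, equal splitting at each step, hypothetical drug and target identifiers) and should not be read as the exact procedure of the cited study.

```python
def nbi_scores(drug_targets, query_drug):
    """Two-step resource diffusion on an unweighted bipartite drug-target graph.

    drug_targets maps each drug to the set of its known targets.
    Returns scores for targets not yet linked to query_drug.
    """
    # Invert the adjacency: target -> set of drugs that hit it.
    target_drugs = {}
    for drug, targets in drug_targets.items():
        for t in targets:
            target_drugs.setdefault(t, set()).add(drug)

    # Step 1: each known target of the query drug starts with one unit of
    # resource and splits it equally among the drugs connected to it.
    drug_resource = {}
    for t in drug_targets[query_drug]:
        share = 1.0 / len(target_drugs[t])
        for d in target_drugs[t]:
            drug_resource[d] = drug_resource.get(d, 0.0) + share

    # Step 2: each drug splits the resource it received equally among
    # its own targets.
    target_resource = {}
    for d, resource in drug_resource.items():
        share = resource / len(drug_targets[d])
        for t in drug_targets[d]:
            target_resource[t] = target_resource.get(t, 0.0) + share

    # Only targets that are not already known for the query drug are
    # reported as candidate new interactions.
    known = drug_targets[query_drug]
    return {t: s for t, s in target_resource.items() if t not in known}


# Hypothetical toy network.
network = {
    "D1": {"T1", "T2"},
    "D2": {"T2", "T3"},
    "D3": {"T3"},
}
print(nbi_scores(network, "D1"))
```

Here D1 receives a non-zero score for T3 only because it shares T2 with D2, which is the guilt-by-association reasoning that such bipartite models encode.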
Integrating heterogeneous data also provides diverse information and has the potential to unveil hidden or unknown drug-disease relationships based on the guilt-by-association principle. Most similarity-based methods use either drug-centric or disease-centric networks; relatively few approaches have built a heterogeneous drug-disease network using compendia of gene annotations and network clustering to identify drug repositioning candidates [105,106]. Luo and Zhao built a similar network-based framework that integrates heterogeneous data through a network diffusion process and uses the diffusion distributions to derive prediction scores for drug-target interactions [107]. Recently, Himmelstein et al. integrated data from 29 public resources to identify drug repositioning candidates and predicted the probability of repositioning for 209,168 drug-disease pairs [108].
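One common way to realize guilt-by-association on a heterogeneous network is a random walk with restart from a node of interest; nodes that accumulate more visiting probability are considered more closely associated with the seed. The sketch below is a generic version of this diffusion, offered as an assumption-laden illustration rather than the pipeline of any cited framework: the graph, node names, restart probability, and convergence tolerance are all hypothetical choices.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a random walk that restarts at seeds.

    adjacency maps every node to a list of its neighbours (undirected edges
    must be listed in both directions). seeds is a set of starting nodes.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # With probability `restart` the walker jumps back to a seed ...
        nxt = {n: restart * p0[n] for n in nodes}
        # ... otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                continue
            share = (1.0 - restart) * p[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical heterogeneous network mixing disease, gene, and drug nodes.
graph = {
    "disease1": ["geneX"],
    "geneX": ["disease1", "drugA", "geneY"],
    "drugA": ["geneX"],
    "geneY": ["geneX", "drugB", "disease2"],
    "drugB": ["geneY"],
    "disease2": ["geneY"],
}
probs = random_walk_with_restart(graph, {"disease1"})
drug_ranking = sorted(
    ((n, pr) for n, pr in probs.items() if n.startswith("drug")),
    key=lambda kv: kv[1],
    reverse=True,
)
print(drug_ranking)
```

In this toy graph, the drug whose target gene is directly linked to the seed disease ranks above the drug that is one gene-gene step further away.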

Mining Electronic Health Records for Drug Repurposing
Electronic health records (EHRs), which provide medication details along with patient history, can also be mined to identify drug repositioning candidates. Applying natural language processing to EHRs, for instance, reveals additional post-market adverse drug events that were not found in clinical trials [109]. This side-effect information can potentially be used for drug repositioning and validation [23]. Mining EHRs, for example, helped identify that metformin, one of the most commonly prescribed medications for type II diabetes, could also be repurposed for cancer treatment [110].
The relevance and accuracy of a model's predictions need to be assessed when the goal is to discover indications that are not yet known. The validity of novel predictions can be evaluated by checking them against ClinicalTrials.gov, PubMed abstracts, or EHRs. The performance of the model can be evaluated by computing the area under the ROC curve (ROC AUC) and the precision-recall (PR) curve. Sensitivity is the proportion of true positives correctly identified, and specificity is the proportion of true negatives correctly identified. Because the large number of unannotated drug-indication pairs are counted as false positives, sensitivity and specificity estimates are poor, and there is a substantial imbalance between true positives and true negatives. In a recent review, Brown and Patel suggest that sensitivity-based validation alone is ideal since it does not need true negatives. The authors further suggest that investigators test model performance with cross-validation to guard against over-fitting and weak predictive performance [111].
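The evaluation metrics mentioned above can be computed directly from labels and prediction scores. The following is a minimal sketch with hypothetical toy data; it treats unannotated pairs as negatives (label 0), which, as noted above, is itself a questionable assumption in drug repositioning, and it computes ROC AUC through the equivalent pairwise-ranking formulation.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical drug-indication pairs: 1 = annotated indication, 0 = unannotated.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.2]
predicted = [1 if s >= 0.5 else 0 for s in scores]  # illustrative threshold

sens, spec = sensitivity_specificity(labels, predicted)
print("sensitivity:", round(sens, 3))
print("specificity:", round(spec, 3))
print("ROC AUC:", round(roc_auc(labels, scores), 3))
```

Sensitivity depends only on the annotated positives, which is why it remains usable when reliable negatives are unavailable, whereas specificity and ROC AUC both inherit whatever errors are present in the assumed negatives.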

Open Innovation-Crowd Sourcing
Crowdsourcing is a collaborative approach that delegates tasks to a crowd, whose varied expertise generates new insights or hypotheses from the available data. This paradigm has been exploited in a multitude of areas across diverse domains, including health care and genomics. The open source drug discovery process enables faster translation of research into results through a clear definition of the specific problem, task decomposition, and an immediate feedback loop [112][113][114][115]. Pharmaceutical companies, owing to limitations in their R&D business models and manpower, often focus on specific diseases, which may or may not include rare and neglected diseases. Hence, a few pharmaceutical and non-profit companies have used crowdsourcing platforms and embraced a wide variety of innovative solutions [116,117] directed towards addressing scientific enigmas. Several open innovation platforms have been established in order to build industry-academia partnerships and to explore science and business opportunities with mutual benefit (Table 3).

National Center for Advancing Translational Sciences (NCATS)-NIH-Academia-Industry Partnerships Initiative
The National Institutes of Health (NIH), as part of the new therapeutic uses program, launched (i) the NCATS NIH-Industry Partnerships initiative in 2012 to foster collaboration between pharmaceutical companies and the biomedical research community; and (ii) the bench-to-clinic repurposing initiative to test the utility of crowdsourcing efforts or computational approaches for drug repurposing.
The focus of the match-making NIH-industry partnership projects is to match researchers with open assets from pharmaceutical companies to fuel and accelerate drug repurposing candidate discovery. Through this initiative, NCATS supports and advances research on a wide range of common and rare (including neglected) diseases. Current industry partners in this initiative include AstraZeneca, AbbVie, Bristol-Myers Squibb, Eli Lilly, GlaxoSmithKline, Janssen Pharmaceuticals, MedImmune, Mereo BioPharma, Pfizer, and Sanofi. The participating companies make a number of partially developed assets available to academic researchers to crowdsource repurposing ideas. Projects using most of these assets can go directly into Phase II clinical trials, while some may require additional pre-clinical investigations or a Phase I clinical trial (e.g., testing in target populations to determine dosing and to assess safety and tolerability).
Through the bench-to-clinic repurposing program, NCATS supports pre-clinical studies, clinical feasibility studies or proof-of-concept clinical trials to assess the utility of computational approaches or crowdsourcing efforts in discovering drug repurposing candidates. Table 4 lists the new therapeutic uses projects funded by NIH-NCATS through these two programs (additional details can be found at https://ncats.nih.gov/ntu/projects).

Open Source Software
The open source movement has created substantial value for state-of-the-art research over the last decade through reusable and generic software libraries for data processing [124,125]. Jupyter Notebook, for instance, is a modern data analysis tool for reproducible computational research that supports open source languages such as Python, Julia, C++, and R, among others, and provides rich features for interactive computing, visualization, and documentation [126]. Structured data tools like Scikit-learn [127], R, Orange [128], and Weka [129] are useful for mining, analysis, learning, and statistical computing. For high-dimensional unstructured data such as images, text, or audio, deep learning tools like TensorFlow [130], Keras [131], PyTorch [132], CNTK [133], and Matlab [134], which take advantage of multi-GPU accelerated training, are increasingly used. Gephi [135] and Cytoscape [136] are other popular tools used primarily for biomolecular interaction networks, omics data integration, clustering, and visualization. In Table 5, we summarize a few such tools used in computational drug discovery and repositioning.

Discussion
Drug repositioning is a viable, cost-effective alternative to de novo drug discovery. Although in silico methods have proven successful in addressing the problem of repurposing, some challenges remain to be addressed. One of the principal issues is missing drug-disease indication data. Marking the missing indications as true negatives or excluding them from training can compromise the predictive power of a computational model for drug repurposing candidate discovery. Second, the lack of a true gold standard dataset for drug repositioning makes it difficult to evaluate the results of in silico methods. As a result, common performance metrics such as sensitivity, specificity, and precision are used to assess the utility of computational drug repurposing algorithms. Third, existing computational methods tend to be predominantly one-sided (e.g., drug-centric or disease-centric). However, the integration of multi-omic data with similarity measures has been shown to improve predictive performance and to identify novel therapeutic compounds [105,108].
The sea of biomedical information (see Table 2) in which small molecule and gene/protein structural, functional, and process knowledge is embedded, for both normal and disease states, consists of unstructured free text, as in publications, and structured or semi-structured relational databases. Transforming information from these silos into actionable knowledge is facilitated by establishing connectivity among the subsets taken from these multiple heterogeneous and diverse domains. For example, a pharmacomodule consisting of a group of genes, biological processes, pathways, phenotypes, small molecules (approved drugs or investigational compounds), and a group of drug-induced or related adverse events forms a meaningful multi-domain module when the interdependency among most of the pairs of subsets is supported by scientific evidence (literature or databases). These pharmacomodules can potentially take us closer to answering the "how" question about a hypothetical underlying mechanism of action or phenomenon. An informed answer to the "how" question holds the promise of generating better-informed drug repositioning hypotheses. Growing scientific evidence [7] suggests that any compound found to be safe in humans is likely to have multiple therapeutic uses. However, almost all successful drug repositioning crossovers so far have been the result of either accidental occurrences or informed guesses. Given that this "back-to-basics" approach for repositioning is growing in popularity [8], there is an urgent need for more efficient and systematic computational approaches to first systematize the available genomic and pharmacological databases for representation and knowledge discovery and then use these databases and pattern discovery tools to identify potential new uses for existing drugs.
What is clearly needed is a paradigm shift in the approaches (genomic, biopharmacological, and computational) towards a more informed, systematic drug rediscovery ("systematic serendipity") that takes into account all of the data resources. Originally coined by Eugene Garfield, "systematic serendipity" refers to the organized process of discovering previously unknown scientific relations using citation databases, leading to better possibilities for a collaboration of human serendipity with computer-supported knowledge discovery [150].
Published research is more credible, and contributes more to scientific discovery, when the accompanying compendium provides evidence for the accuracy and reproducibility of the results. Reproducibility is a major issue, especially when scientific papers publish unexpected, positive results and an independent research group is unable to replicate them even after using the same or similar methods as reported by the original study [151,152]. It has been estimated that irreproducible research costs up to $28 billion per year [153]. Providing the code and data used to obtain the reported results is always a better strategy than merely describing them in natural language in the paper and can be an incremental step towards better science [154][155][156]. The recent Findability, Accessibility, Interoperability, Reusability (FAIR) data principles go beyond the reuse of data by individuals and aim to enhance the ability of machines to find and use the data automatically. They encompass efforts that support discovery and reproducibility through practices such as sound data management, maintenance of the data flow, and sharing of the relevant tools or pipelines used in the research [157]. The recent Datasets2Tools project complies with these principles and enables users to search for contributed canned analyses, datasets, and tools [158]. Computational research can be replicated effectively using code version control platforms such as GitHub [159] and transferable computational environments such as Docker [160]. Over the past few years, the reproducibility issue has been taken increasingly seriously, and many journals now insist that code and data be provided when a paper is submitted.
In summary, emerging and advanced computational methods and crowdsourcing-based approaches that enable the joint analysis of genomic, biomedical, and pharmacological data hold the promise of facilitating informed, efficient, and systematic drug repositioning. Whether this promise expedites drug development pipelines, and how much of it translates into novel therapeutic discovery and positively impacts public health, especially in catering to unmet needs (e.g., rare and neglected diseases), remains to be seen.
Author Contributions: J.K.Y. and A.G.J. conceived the outline, reviewed the literature and wrote the manuscript; S.Y. and Y.W. participated in the discussions and provided the edits to some sections of the paper.