Prediction of Drug Targets for Specific Diseases Leveraging Gene Perturbation Data: A Machine Learning Approach

Identification of the correct targets is a key element for successful drug development. However, there are limited approaches for predicting drug targets for specific diseases using omics data, and few have leveraged expression profiles from gene perturbations. We present a novel computational approach for drug target discovery based on machine learning (ML) models. ML models are first trained on drug-induced expression profiles with outcomes defined as whether the drug treats the studied disease. The goal is to “learn” the expression patterns associated with treatment. Then, the fitted ML models were applied to expression profiles from gene perturbations (overexpression (OE)/knockdown (KD)). We prioritized targets based on predicted probabilities from the ML model, which reflects treatment potential. The methodology was applied to predict targets for hypertension, diabetes mellitus (DM), rheumatoid arthritis (RA), and schizophrenia (SCZ). We validated our approach by evaluating whether the identified targets may ‘re-discover’ known drug targets from an external database (OpenTargets). Indeed, we found evidence of significant enrichment across all diseases under study. A further literature search revealed that many candidates were supported by previous studies. For example, we predicted PSMB8 inhibition to be associated with the treatment of RA, which was supported by a study showing that PSMB8 inhibitors (PR-957) ameliorated experimental RA in mice. In conclusion, we propose a new ML approach to integrate the expression profiles from drugs and gene perturbations and validated the framework. Our approach is flexible and may provide an independent source of information when prioritizing drug targets.


Background
Traditionally, drug discovery involves a series of steps: target identification, target validation, lead identification, lead optimization, clinical trials, and introduction of the new drug to the market [1]. Nevertheless, the speed of new drug development has been slower than anticipated, despite increasing investment [2]. It is estimated that the cost of developing a new drug is ≈USD 2.6 billion [3]. One of the main reasons for the enormous cost of drug discovery is the high failure rate.
The success of drug development largely depends on the validity of targets. However, most drugs fail to complete the development process due to a lack of efficacy, and this is often due to the wrong target being pursued [4]. Traditionally, drug targets are often identified from hypothesis-driven pre-clinical models, yet pre-clinical models may not always translate well to clinical applications. For some diseases such as psychiatric disorders, current animal or cell models are still far from capturing the complexity of the human disorder [5]. In addition, some have hypothesized that relying on hypothesis-driven studies alone may have led to the 'filtering' of findings and publication bias, exacerbating the reliability and reproducibility issues of some research findings [6].
On the other hand, the recent decade has observed a remarkable growth in genomics and other forms of biomedical big data. As increasing amounts of data have been made available, computational methods have attracted increasing attention as they offer a fast, cost-effective, and unbiased way to prioritize promising drug targets. Given the limitation of current approaches and the urgent need to develop therapies for diseases, addressing the problem of target identification and drug development from different angles is essential. We believe that computational and experimental approaches can complement each other to improve the efficiency and reliability of identifying valid drug targets. Given the extremely high cost and time investment in drug development, even if the success rate can be increased by a small margin, the savings (in absolute terms) could be substantial.

Overview of Our Approach
In this study, we present a novel computational target discovery approach based on machine learning (ML) models to expression profiles induced by genetic perturbation. In our approach, ML models are first trained on drug-induced expression profiles, with outcomes defined as whether the drug can treat the studied disease. The goal is to "learn" the expression patterns associated with treatment. Then, the fitted ML models were applied to expression profiles derived from gene perturbations (i.e., overexpression (OE) or knockdown (KD) of specific genes). Afterwards, we could prioritize drug targets based on the predicted probabilities from the ML model, which reflects treatment potential.
Intuitively, for example, the overexpression (OE) of gene X leads to an expression profile 'similar' to that of five other drugs known to treat diabetes. Then, an agonist targeted at X (or other drugs that activate or upregulate X and related pathways) may also be useful for treating diabetes. In this case, we expect that the ML model (trained on drugs but applied to gene perturbation data) would output a high predicted probability (of treatment potential) for gene X, and it can be prioritized for further studies.
Let us consider an opposite scenario in which the overexpression of gene Y increases the disease risk or severity. In this case, we may observe a lower-than-expected predicted probability of 'treatment potential' from the ML model. Gene Y can still be considered a potential drug target for further studies, but here, we expect a down regulation of gene Y to be associated with disease treatment.

Strengths of Our Approach
Our approach has several potential advantages. Firstly, it provides a general and flexible framework in which any kinds of supervised learning methods can be applied for training. As such, we may leverage the advantages of different, including recently developed, supervised learning algorithms. In addition, our approach is independent of other kinds of evidence usually employed to identify drug targets, for example those used by the OpenTargets platform [7] (e.g., genetic associations, mutation data, expression data, animal models, text mining, etc.). Therefore, the proposed methodology may provide an independent source of information when prioritizing targets. In addition, our approach does not rely on the information of known genes or drug targets for a disease; as such, it may be applicable to a wide range of diseases, including disorders with less well-known pathophysiology and targets. The lack of reliance on known disease gene/targets may help discover more novel disease drug targets that are not directly linked to previous ones.
In brief, we first proposed a general framework for identifying drug targets of specific diseases using a machine learning approach, leveraging gene perturbation and drug tran-scriptome data. Our methodology was applied to several diseases, including type 1 and 2 diabetes mellitus (DM), hypertension (HT), schizophrenia (SCZ), and rheumatoid arthritis (RA). Then, we validated our new framework by assessing its ability to 're-discover' drug targets based on an external established database (OpenTargets). We also found that many candidate targets are supported by the literature and are functionally relevant.

Methods
We present a general approach for identifying potential drug targets of a specific disease using state-of-the-art ML methods. As described above, ML models were first trained on drug expression profiles to learn the expression patterns associated with treatment of a disease. Then, the trained model was applied to expression profiles after OE or KD to predict the therapeutic potential of up-or downregulation of individual genes.

Datasets
The drug-induced expression profiles and genetically perturbed (OE/KD) expression profiles were downloaded from LINCS (The Library of Integrated Network-Based Cellular Signatures) [8]. For details of the study, please refer to [8]. Briefly, to measure the influence of genetic perturbation on expression, each genetic perturbation (OE/KD) was profiled in triplicate 96 h (h) after application. A single cDNA clone was employed for studies of OE; on the other hand, three distinct shRNAs targeting each gene were profiled for KD experiments. As for expression profiling for drugs, each compound was profiled in triplicate at 6 or 24 h following treatment. Gene expression profiling was based on a reduced representation of the transcriptome (1000 'landmark' genes), which has been shown to produce reliable results compared to standard RNA-seq [8].
The original data at multiple levels of pre-processing are available via the accession GEO: GSE92742 [8]. In the current study, expression data were downloaded from the link (https://github.com/dhimmel/lincs [accessed 26 June 2018]), which provides consensus transcriptional signatures for LINCS L1000 perturbations (see https://think-lab.github. io/d/43/#7 [accessed 26 June 2018]). Briefly, the input signatures were weighted by its Spearman correlation with other input signatures. For consistency, we kept the genes that appeared in both drug-induced and genetically perturbed expression profiles, so ML models trained on drug expression profiles can be directly employed to make predictions on expression data induced from KD/OE experiments. The final drug expression profile dataset consists of 1158 observations, with expression measured in 7467 genes. The dimensions of the OE and KD datasets were 2413 × 7467 and 4326 × 7467, respectively.

Training ML Models on Drug Expression Data to Predict Treatment Potential
The outcome variable (0/1) is defined as whether the drug is indicated for the disease under study. The drug indications were derived from the Anatomical Therapeutic Chemical (ATC) classification system and the MEDication Indication Resource high-precision subset (MEDI-HPS). We employed our proposed approach to predict drug targets for various diseases covering different systems, including hypertension (HT), type 1 and type 2 diabetes mellitus (DM), schizophrenia (SCZ), and rheumatoid arthritis (RA). Indications for HT, DM, and SCZ were extracted from ATC, and indications for RA were extracted from MEDI-HPS, because there is no exact category for RA in ATC. We built prediction models for each disease separately, and four ML classification methods were employed for each disease.

ML Model Building
The model-building procedure largely followed our previous work [9], and we also provide a brief description below. Briefly, we employed four state-of-the-art classification methods, including support vector machine (SVM), gradient boosting machine (GBM), random forest (RF), and logistic regression with the elastic net penalty (EN), to learn the pattern of gene expression profiles associated with treatment of the studied disease [10][11][12][13][14]. As the number of drugs known to treat specific diseases is small, there are few observations with positive outcomes. Following our previous study [9], we performed a weighted analysis by increasing weight of the minority class. SVM, RF, and GBM models were implemented using "scikit-learn" in Python, while EN was implemented with the R package 'glmnet'. We employed nested 5-fold cross-validation (CV) to choose the optimal hyperparameters and evaluate the performance of corresponding models on hold-out datasets.
Briefly, the dataset was divided into 5 folds using a random seed, and 1/5 of the data (i.e., N ≈ 232 observations) was held out for testing in each run. For the remaining 4/5 of observations (N ≈ 926), 1/5 of them (N ≈ 185) was reserved for tuning hyper-parameters and choosing the best-performing model, while the rest (N ≈ 741) was used for training. This is to avoid optimistic bias if one chooses the best hyper-parameters in the test set only. In this study, the test set was only used to evaluate the predictive performance and not involved in hyper-parameter tuning. As a secondary analysis, we also built an 'ensemble' model with logistic regression to integrate predicted probabilities from the four ML models for each disease. Drug indication was treated as the outcome, while the predictors were predicted probabilities from the four ML models. We employed a weighted model that up-weighed the minority class, as described above. Please also refer to Supplementary Text for details of model building.

Model Evaluation
Two metrics were used to evaluate the predictive performance of ML models, including area under the receiver operating characteristic curve (ROC-AUC) and area under the precision-recall curve (PR-AUC). PR-AUC may be more instructive in classification performance evaluation when the dataset is imbalanced [15].

External Validation Approach
Validation of drug-disease or drug-target predictions from computational methods has always been a difficult task. As reported by [16], for studies on drug repositioning, a cross-validation approach may overestimate predictive accuracy, as there may be drugs with overlap in the training and testing sets. In addition, highly similar drugs may be split into training and testing sets; hence, the similarity of training and testing sets may be higher than expected than in practice. There may be similar concerns for disease drug target predictions. If one only evaluates the validity of predictions using performance evaluation metrics (e.g., AUC-ROC) under cross-validation alone, this may lead to over-optimistic results. To avoid this problem, we utilized an independent resource to examine whether our approach can 're-discover' known drug targets for diseases from other data sources. Briefly, we validated our results by evaluating whether the identified targets were enriched for those listed by OpenTargets [7], a platform for systematic drug target identification and prioritization. The platform integrates data from genetics, somatic mutations, expression analysis, drugs, animal models, and the literature through robust pipelines and uses an aggregate score to indicate the association of a target with a disease [7].
We applied the models trained on drug expression profiles to OE/KD expression profiles to predict their treatment potentials. Drug targets were downloaded from OpenTargets, with a continuous score (from 0 to 1) indicating the strength of association between the target and disease.
We need to define a cutoff to select relevant genes as 'valid' targets for the disease. To avoid arbitrariness in selecting a fixed cutoff, here, we defined a cutoff sequence ranging from 1 to 0 with a step size of 0.2; genes whose association scores are lower than or equal to the cutoff were filtered away, and those genes with association scores higher than the cutoff were considered as "valid" targets. To test for enrichment, we examined whether these 'valid' targets from the external database had a higher-or lower-than-expected predicted probability (of treatment potential) from our model when compared to the non-targets. Specifically, we compared the mean predicted probability for genes within the set of 'valid' targets against the mean predicted probability of genes not included in the 'valid' set. A two-tailed t-test was used for this comparison. This method follows closely the principle and methodology of MAGMA [17], which is one of the most widely used programs for gene-based and gene-set analyses in genome-wide associations studies (GWAS). Intuitively, MAGMA compares the mean z-statistic for SNPs/genes within a specified set against those outside the set. (An alternative approach would be to conduct correlation tests between the target score and predicted probabilities; however, the target scores are not normally distributed (with many zeros and ones), which might render such a method unreliable.) We note that OpenTargets includes the EMBL-EBI Expression Atlas, which contains data for differentially expressed genes in patients against controls for different diseases. However, our drug and gene perturbation expression profiles are based on the L1000 data, and L1000 is not included in the Expression Atlas according to our latest search (as at 3 January 2022). In addition, L1000 data were obtained by the OE/KD of individual genes or application of drugs on primarily cancer cell lines, which do not have a direct relationship with the diseases studied here. We also checked that OpenTargets did not include L1000 as a source for scoring disease-gene associations. Therefore, we believe that OpenTargets represents a reasonably good independent resource for external validation.

Model Performance
It should be noted that the predictive performance of different ML methods is not the major focus of this study; our main objective is to uncover new disease drug targets and to validate our proposed approach by testing for its ability to 're-discover' known targets.
The average predictive performance of different ML models, measured in AUC-ROC and AUC-PR, is presented in Table S1. In terms of AUC-ROC, SVM performed the best for SCZ, while EN performed the best for DM and RA. GBM slightly outperformed others for HT. In terms of AUC-PR, SVM performed the best in DM and SCZ datasets, but GBM and EN showed the best performance for HT and RA, respectively. Note that AUC-PR is dependent on the proportion of positive outcome; hence, the low AUC-PR observed is expected given the relatively small number of drugs indicated for each disorder. Table S2 briefly summarizes the overall number of drugs included for each disease.
We also carried out further analysis to evaluate the correlation of predicted probabilities from different models for each disease under study (Table S3). We found significant positive correlations between the predicted probabilities. However, most of the correlations were moderate, suggesting that different models can still produce different predictions.

External Validation
The results of enrichment test for 'known' drug targets from OpenTargets (based on OE data) are shown in Tables 1-5. Overall, for drug targets identified from OE data, we observed significant enrichment (with FDR (false discovery rate) adjusted p-values < 0.05) for at least one ML method and score threshold for all the diseases under study (Tables 1-5). 1.56 × 10 −1 6.84 × 10 −1 1.96 × 10 −1 1.12 × 10 −1 Enrichment p-values are shown. Four machine learning methods were used to train a model on expression data to predict treatment potential, and the model was fitted to expression profiles after gene perturbation. The false discovery rate (FDR) approach was employed to correct for multiple testing. p-values with corresponding FDR < 0.05 are in bold, while p-values with corresponding FDR between 0.05 and 0.1 are in italics. The first column is the threshold of the 'relevance' score (available from OpenTargets) above which we defined a gene as a drug 'target'. Enrichment was tested against the targets listed in the OpenTargets database, which is the same as below. SVM: support vector machines; EN: logistic regression with elastic net regularization; RF: random forest; GBM, gradient boosted machines. For Tables 1-5, the predicted targets are based on expression data from overexpression (OE) experiments. The results from KD data are listed in Table S6.
For DM and HT, we observed significant enrichment across multiple thresholds and most of the ML methods with FDR < 0.05, indicating that the proposed method indeed 'rediscovered' known targets more than expected by chance. For RA, significant enrichment was mainly observed for prediction models based on RF or GBM. For SCZ and bipolar disorder (BP), which shared anti-psychotics as treatment, the enrichment was not as strong, but suggestive enrichment (FDR < 0.1) was observed especially for targets based on SVM for SCZ and EN for BP.
As a test of the robustness of our results to different random seeds, we also picked two diseases (HT/RA) and repeated the analysis using a different random seed for sample splitting. The results were broadly similar with significant enrichment observed (see Tables S4 and S5).
On the other hand, apart from a few significant findings for HT, no statistically significant enrichment was observed for targets identified from KD data. The results are shown in supplementary tables (Table S6).

Literature Support of Potential Targets
In order to validate the functional relevance of our identified potential targets, we conducted a literature search of the 10 targets with the highest and lowest predicted probabilities for each disease (please see Table S7 for a list of these targets) based on targets identified from OE data. These highlighted targets were selected from a total of 2413 genes subject to OE experiments. As described in the introduction, for targets with high predicted probabilities, we expect that upregulation of the gene may be associated with therapeutic potential; for targets with lower-than-expected predicted probabilities, we predict that downregulation of the gene may be associated with therapeutic potential.
Selected targets with literature support are discussed below and highlighted in Table 6. Note that our proposed approach does not utilize any prior knowledge of disease-gene associations.

Schizophrenia/Bipolar Disorder
Schizophrenia and bipolar disorder share similar clinical characteristics, and antipsychotics are indicated for both disorders. The two disorders are also highly genetically correlated [18]. Therefore, we tested for target enrichment for both SCZ and BP based on our model trained on the ATC-SCZ dataset. Our study suggests that the overexpression of DRD1 may be associated with treatment effects on SCZ. It was reported that insufficient D1 receptor signaling was associated with cognitive deficits and that working memory deficits may be relieved by treatments that augment D1 receptor stimulation, indicating that drugs acting on this potential target may restore cognitive dysfunction in SCZ [19]. Indeed, DRD1 agonist has been tested in a clinical trial for cognitive enhancement [20]. Moderate improvement was observed on some cognitive tests, including the CogState battery and attention domain of the MATRICS cognitive battery, although no significant improvement was detected for working memory. Other studies also suggested a role of DRD1 in the pathophysiology of SCZ [21]. Taken together, DRD1 may be a potential therapeutic target for SCZ.
HIF1AN, another target identified in our models, has been proven to suppress HIF1A's transcriptional ability, which thus can affect HIF1A's ability in regulating hypoxia-inducible genes [22,23]. Hypoxia may be involved in the pathogenesis of SCZ. For example, a methylome-wide association study (MWAS) of SCZ identified many top hits related to hypoxia [24]. HIF1A is also proposed as a candidate gene for SCZ, considering the association between HIF1A and intrinsic hypoxia occurring in the developing brain that may lead to complex changes in neurodevelopment [25,26]. HIF1A may enhance vascular growth (hence reducing hypoxia) via controlling the expression of vascular endothelial growth factor (VEGF) [27]. We found that the inhibition of HIF1AN expression may be associated with the therapeutical effect on SCZ, which is in line with the direction of effect from the above studies.
A few other targets may also be associated with SCZ, as supported by other studies. For example, ADCY9 is involved in glutamate and GABA neurotransmission pathways [28], and damaging de novo mutations have been identified in the gene [29]. Another potential target, RPA2, showed differential expression in a study of pluripotent stem cell-derived neurons from SCZ patients [30]. As for another candidate target, RhoA, a study showed reduced mRNA expression of RhoA in dorsolateral prefrontal cortex of SCZ subjects compared to controls [31]. NDUFS4 is another target we identified. Interestingly, a GWAS study revealed that an SNP (rs67017972) close to NDUFS4 was significantly associated with verbal memory in SCZ patients [32].

Rheumatoid Arthritis
A number of selected potential targets such as SMAD7, TGFBR2, FGFR10P, and PSMB8 are supported by previous studies. It was reported that SMAD7 expression was largely reduced in synovial tissues of RA patients, and mouse models also showed that SMAD7 deficiency increased risk to autoimmune arthritis [33]. In addition, it was shown that the intra-articular overexpression of SMAD7 relieved experimental arthritis [34]. These results support our prediction that the overexpression of SMAD7 may improve RA.
Regarding another potential target, TGFBR2, it has been reported that TGFBR2 plays an important role in chondrogenesis [35]. Current results showed that up to 40% of RA patients are resistant to methotrexate, which is the first-line therapy for RA. Peres et al. reported that the drug resistance of methotrexate was linked to a reduction of CD39 expression due to the impairment in TGF-β signaling, and TGF-β increases CD39 expression on regulatory T cells (Tregs) via the activation of TGFBR2 [36]. The authors also observed that patients non-responsive to methotrexate had reduced expression of TGFBR2 in Tregs compared to responsive patients. In this connection, the overexpression of TGFBR2 may reverse the impairment of TGF-β signaling, which is consistent with our prediction that TGFBR2 overexpression may be useful for RA treatment. In addition, hypermethylation of TGFBR2 (associated with decreased expression) was found in RA samples [37]. Taken together, the results above indicate that TGFBR2 expression levels might be linked to RA disease activity.
Additionally, FGFR10P was identified as a possible target for RA by our study. An LD block on chromosome 6 (6q27) that contains the genes CCR6 and FGFR10P was observed to be associated with increased risks for several autoimmune diseases, such as RA, Crohn's disease, and vitiligo [38][39][40][41][42][43]. Moreover, our model predicted that the inhibition of PSMB8 may induce treatment effects on RA. This is directly supported by experimental evidence from animal studies. It was observed that treatment with a PSMB8 inhibitor (PR-957) can ameliorate experimental RA in mice [44]. The drug led to a decrease in cellular infiltration, cytokine production, and autoantibodies in the RA mouse model. In a similar vein, several studies also reported that PSMB8 inhibitors reversed autoreactive immune responses and showed therapeutic effects in animal models of autoimmune encephalomyelitis, colitis, and Hashimoto's thyroiditis [45][46][47]. Regarding evidence from human genetics studies, an SNP in the PSMB8 (LMP7) gene was also found to be associated with juvenile RA [48].
Our study also predicted inhibition of IL-21 receptor (IL-21R) as a potential treatment for RA. An animal study reported that IL-21 receptor expression on B cells contributed to the development of collagen-induced arthritis (CIA) [49]. This implies that IL-21R inhibitors may hinder the development of CIA. The role of IL-21R in RA is also supported by other studies. Berberine (BBR) is a drug that has been suggested as a potential treatment for RA [50], and a recent study [51] showed that BBR inhibits IL-21/IL-21R-dependent autophagy, leading to a reduction of proliferation of arthritic fibroblast-like synoviocytes, which in turn may lead to the amelioration of RA.
We also predicted that the inhibition of LTβR (lymphotoxin β receptor) may be associated with therapeutic effects. Interestingly, a phase-1 randomized controlled trial of the safety and efficacy of pateclizumab (a drug that inhibits LTα1β2-LTβR interactions) showed that the drug was well tolerated and demonstrated preliminary evidence of clinical activity compared to placebo [52]. In a subsequent phase II study, pateclizumab resulted in a higher response rate than placebo treatment at 12 weeks, but the difference was not statistically significant [53]. However, CXCL13 (a biomarker of RA activity) [54] serum levels decreased significantly after pateclizumab treatment.
Another potential target identified was DAXX. It was reported that the accumulation of DAXX in promyelocytic leukemia (PML) protein nuclear bodies (NBs) may promote inflammatory disorders, which is in line with our prediction that the downregulation of this target may be beneficial to treatment [55].

Diabetes Mellitus
Previous studies also support several potential targets, and our study suggested that the overexpression of these targets may be associated with treatment effects on DM. First, Mayumi et al. found that mutations of NR0B2, also known as SHP, was associated with type 2 DM in a Japanese sample. It was reported that the mutant proteins show significantly reduced activities [56]. Another study showed that the inhibitory effect of metformin (one of the most commonly used drugs for DM) on hepatic gluconeogenesis may be mediated through the expression of NR0B2 [57].
We also identified Fos as a potential target for DM (with OE favoring treatment). It was found that insulin could induce c-Fos mRNA expression in neurons [58], fibroblasts [59], and pancreatic beta-cells [60]. Another study showed that c-Fos upregulation increased beta-cell proliferation, insulin secretion, and cellular survival, which is mediated by the activation of Nkx6.1. On the other hand, c-Fos knockdown inhibits Nkx6.1-mediated beta-cell proliferation and reduces insulin secretion [61].
QPRT (quinolinate phosphoribosyltransferase) was also found to be a potential target for DM. It is an enzyme involved in the kynurenine pathway, which may be involved in diabetes pathogenesis [62]. In a recent clinical study, the expression of QPRT in the subcutaneous compartment was negatively correlated with HbA1c, fasting glycemia, and 120 min glycemia [62]. This is consistent with our finding that the upregulation of this target may be associated with therapeutic potential.
We also revealed MAGED1 as a potential drug target, and predicted that the upregulation of the gene may be associated with the treatment of DM. One study concluded that MAGED1-deficient mice showed hyperphagia and reduced motor activity, which led to the development of obesity [63]. Another subsequent animal study [64] observed a similar phenomenon in which MAGED1-deficient mice showed late-onset obesity, owing to reduced energy expenditure and physical activities. The study also found that MAGED1 expression was reduced during adipogenesis, and loss of MAGED1 led to increased pre-adipocyte proliferation and differentiation in vitro. MAGED1 also reduced the stability and transcriptional activity of PPAR-gamma, which is the target for thiazolidinediones (a class of anti-diabetic medication).
Several other identified targets were also shown in previous studies to be associated with DM. For example, GADD45A was suggested as a diabetes-associated gene, which might be involved in both diabetic cardiomyopathy and DM-induced baroreflex dysfunction [65]. Regarding another potential target, TSPAN8, a SNP in this gene was associated with insulin release and sensitivity in a genetic association study [66]. Another target of interest was PPP2R1A, which encodes a constant regulatory subunit of protein phosphatase 2 (PP2A). It was found that the podocyte-specific loss of PP2A worsened diabetic glomerulopathy and accelerated the progression of diabetic kidney disease [67]. In addition, PPP2R1A was discovered to interact with IRS1 (Insulin receptor substrate 1), which is a key mediator of insulin signal transduction implicated in Type 2 DM [68].
We also identified TANK-binding kinase 1 (TBK1) as a potential target. The inhibition of TBK1 to treat DM is supported by a few previous studies. It has been suggested that a combination of immunomodulators and agents that specifically increase the mass of functional β-cells favors the treatment of type 1 DM [69]. In a study using a zebrafish model of type 1 DM, Xu et al. revealed that the inhibition of TBK1/IKKε (IκB kinase ε) led to β-cell regeneration [70]. Another study [71] showed that TBK1 is expressed mainly in beta cells of mammalian islets and that the genetic silencing of TBK1 led to an elevated expression of genes and proteins that are important for beta cell proliferation. On the other hand, TBK1 expression in beta cells was raised in human type 2 DM islets.
Our study also suggested that USF1 and HLA-DMB are potential targets for DM. Risk alleles of USF1 have been found to be associated with cardiovascular disease and type 2 DM risk in a number of studies [72][73][74]. Moreover, another study showed that a risk allele within USF1 appeared to remove the inductive effect of insulin on USF1 expression, which in turn affected the expression of other target genes, contributing to increased risk of cardiometabolic diseases [75]. HLA-DMB was also reported to be associated with type 1 DM in several human genetic studies [76][77][78].

Hypertension
Our models identified TCF7L2 as a potential target for hypertension. TCF7L2 is a wellestablished susceptibility gene for type 2 DM [79], and given the high comorbidity rate and possibly shared pathophysiology between DM and hypertension [80], further studies into this target may be warranted. In addition, a study of the Thai elderly population suggested that the SNP rs290487 in TCF7L2 may contribute to risks of hypertension regardless of type 2 DM [81]. Another cohort study concluded that both a parental history of diabetes and the TCF7L2 at-risk variant were associated with a higher incidence of hypertension after controlling for other cardiometabolic risk factors [82].
Considering another potential target, ATP5A1, a pharmacological network analysis of Compound Uncaria Hypotensive Tablet (a Chinese medication for hypertension) revealed that the therapeutic effect of this drug may be associated with actions on ATP synthetases, including ATP5A1 [83]. Another study revealed that the expression of ATP5A1 was significantly decreased in spontaneously hypertensive rats compared with controls [84]. These results may further support our finding that the overexpression of ATP5A1 may be associated with therapeutic effects.
Another target of interest is FADD, which is also a marker of apoptosis and apoptosis that may be implicated in atherosclerosis [85]. Cohort studies reported that a high plasma level of FADD was associated with increased incidence of coronary events and ischemic stroke [86,87]. Considering the strong associations between hypertension, stroke, and coronary heart diseases [88], the findings from these studies are supportive of the role of FADD and our prediction that the inhibition of FADD expression may be associated with treatment of hypertension (or its complications).
Interestingly, TSPAN8, which was suggested by our approach as a DM drug target, was also identified as a potential target for hypertension. As argued above, given the comorbidities and possibly shared metabolic pathways [80], TSPAN8 may also be an interesting candidate.
Another potential target we found was NFE2L2 (also known as Nrf2), which was directly supported by an animal study. Bai et al. [89] showed that the injection of tertbutylhydroquinone (t-BHQ), a selective Nrf2 activator, significantly reduced mean arterial pressure, plasma norepinephrine levels, and sympathetic nerve activities in hypertensive rats. tBHQ also reduced levels of reactive oxygen species and decreased inflammatory cytokine release in the periventricular nucleus (PVN). In addition, the knockdown of Nrf2 in the PVN abrogated the therapeutic effects of tBHQ on HT. Another study also suggested a role of Nrf2 activation in endothelial protection and the attenuation of oxidative stress [90]. These findings are consistent with our model prediction that the activation or overexpression of NFE2L2 may be associated with therapeutic benefits.
Some other potential targets, such as DUSP6 and HOXB13, are also supported by the literature. Zoe et al. [91] found that genes from the DUSP family may contribute to hypertensive heart disease; specifically, for DUSP6, its expression was upregulated in spontaneously hypertensive rats compared to controls. The above results were consistent with our finding that the inhibition of DUSP6 may have a protective treatment effect. For HOXB13, it was reported that the knockdown of HOXB13 can reduce the cytotoxicity caused by various oxidative stress inducers [92,93], and an increasing number of studies suggest that oxidative stress has a key role in the pathogenesis of hypertension [94]. As for another target, ETV1, in a study of human left atrium (LA) samples, ETV1 was downregulated in cardiac pressure overload, which may in turn be associated with both electrical and structural remodeling [95].

Overview
In this study, we presented a novel ML-based computational approach to identify promising drug targets. To our knowledge, this work is the first to employ ML methods to leverage both drug-induced and genetically perturbed expression data to discover potential drug targets for specific diseases.
Our approach is general as it can incorporate any supervised learning algorithms. To validate our method, we examined whether it may 're-discover' known targets based on other sources of data. Indeed, we observed that top genes from our models were enriched for targets from the OpenTargets platform. Encouragingly, a number of targets highlighted by our proposed method were also supported by the literature.

Relevant Works
We highlight a few relevant works on drug target prediction here. Kandoi et al. reviewed machine learning and system biology applications in distinguishing drug targets from non-targets [96]. Several studies explored the biological properties of known drug targets by ML methods to predict the druggability of proteins [97][98][99][100][101]. For example, Kumari et al. proposed a sequence-based prediction model and leveraged information such as amino acid composition and amino acid property group composition to predict whether a new target may be druggable. They also performed a comprehensive comparison of several ML methods [100]. In another study, eight key properties of human drug targets were extracted and learned by SVM to discover new targets [97][98][99][100][101]. A similar study extracted simple physicochemical properties from known drug targets to predict targets against non-targets [101].
Regarding network-based approaches, Costa et al. [102] leveraged interaction network topological features together with tissue expression and subcellular localization data to predict druggable genes. In another work, Li et al. employed the topological features of a protein-protein interaction network to identify potential drug targets [99]. Emig et al. presented an integrated network-based method to predict drug targets based on disease gene expression profiles and an interaction network, and some novel drug targets for scleroderma and cancer were reported [103]. However, our study is different from the previous studies in several aspects. One of the most important differences is that the focus of most of the above studies (except [103]) is to predict in general whether a protein may serve as a drug target (i.e., distinguishing targets from non-targets). They did not address the problem of predicting whether a protein is a target for a specific disease, such as diabetes or SCZ. As discussed above, networkbased methods are useful and have been proposed for uncovering disease drug targets. However, they are relatively dependent on similarity between entities and known drug targets; hence, they may be less capable of discovering novel targets. In addition, networkbased approaches usually require good knowledge of gene-gene (or protein-protein) and disease-gene interactions. It may not be easy to define such interactions accurately, and different sources may suggest different patterns of interactions. Therefore, the edges may need to be defined arbitrarily.
There are relatively few studies that employed gene perturbation data to predict drug targets, but a recent study [104] has leveraged such data to identify tentative targets. The authors proposed pairwise learning and joint learning methods constructed on chemically and genetically perturbed gene expression profiles to predict the targets of different chemicals [104]. They also constructed a drug-protein-disease network for drug repurposing. However, the methodologies and objectives of our study and ref [104] are different. We proposed ML methods to assess how the expression profiles from gene perturbations are related to those of drugs. Ref [104] mainly employed Pearson correlation and linear models to assess the similarity between transcriptomic changes from gene perturbation and those from drugs. An advantage of our approach is that by employing ML methods (e.g., SVM, random forests, boosted trees), we may accommodate complex non-linear relationships and interactions between features. The study [104] used transcriptomic data from gene perturbations mainly to predict drug-protein interactions; prediction of disease-specific drug targets was performed in a separate analysis using networks (which requires knowledge of the known therapeutic targets of studied diseases). As discussed above, network-based methods have their own limitations. We proposed an alternative new approach, which integrates transcriptomic data with ML approaches in a unified framework to predict drug targets for specific diseases.
One of our previous works has employed an ML approach for drug repositioning, leveraging drug expression data [18]. However, the objectives are different from the current study, in which we aim to uncover novel drug targets. In practice, drug repositioning may not always be feasible (for example, due to side effects of existing drugs), and there are also important hurdles to drug repositioning efforts, such as patent considerations, regulatory barriers, and organizational hurdles in industry, as reviewed by Pushpakom et al. [105]. As a result, revealing new targets remains a very important goal in drug development and pharmaceutical research. Unlike our previous work, here, we have covered diseases other than psychiatric disorders. In addition, gene perturbation data have not been used in the previous study.
There seems to be some discrepancies in the performance of enrichment tests across diseases, with DM and HT having the most significant enrichment. We are uncertain about the exact reasons, but there are many possible factors that may lead to such differences. For example, it is unknown how well the experiments on cell lines can capture the actual expression changes in human tissues relevant to the studied diseases. The L1000 gene perturbation data are primarily based on several cancer cell lines, which may not correlate very well with actual expression changes in relevant tissues, e.g., the brain for our study on SCZ. Another speculation is that the range of mechanisms underlying HT and DM drugs may be slightly more diverse than RA or SCZ, which may facilitate the ML models to learn expression patterns conducive to therapeutic effects. For example, most drugs for SCZ are based on similar mechanisms related to dopaminergic blockade.
We also could not conclude with certainty why some ML approaches may perform worse in some cases. It is possible that some methods may not be able to capture very well the expression patterns (which can differ across diseases) contributing to therapeutic potential, especially in view of the imbalanced datasets. This may in turn affect the performance in the target enrichment tests.

Strengths and Limitations
As described earlier, there are important strengths of our approach. Our approach is general and highly flexible, can incorporate any supervised ML methods, is independent of other sources of evidence commonly employed to identify drug targets, and does not rely on knowledge on known disease genes/targets. However, there are also several limitations. One limitation is that our ML prediction model-building datasets are highly imbalanced, as only a small number of drugs are usually indicated for each disease. In order to address this issue, we increased the class weight of the minority group. There are other strategies to address issues, such as SMOTE (Synthetic Minority Oversampling Technique) [106], but whether strategies such as SMOTE can address this issue in high-dimensional settings is still unclear and will be a topic for further investigations. Our primary objective of this study is to present a general framework for prioritizing drug targets leveraging drug expression and gene perturbation data; as such, we have not investigated in great depths ways to improve the ML model itself, which will be left as a topic for further work. For example, it may be worthwhile to explore other kinds of ML approaches and ways of combining predictions from different models; the models may also be further evaluated by repeated CV or bootstrap approaches.
Another aspect is that we observed significant enrichment for the identified targets primarily in the OE datasets but not in KD datasets. One hypothesis is that some off-target effects may interfere with the expression profiles in KD experiments, leading to greater difficulties in finding relevant drug targets [8]. How to overcome or reduce the influence of off-target effects remains an area for further studies. Here, we have employed enrichment tests to examine 're-discovery' of known drug targets from other sources of data and showed that many targets may be clinically/biologically relevant based on the literature. Nevertheless, further experimental and clinical studies are required to confirm our findings. In addition, further works are required to elucidate the mechanisms underlying the drugs that may act on the identified targets.

Conclusions
This study presents a general computational framework to prioritize drug targets for various diseases. Under the framework, different kinds of ML methods can be utilized. We applied four ML methods to identify potential drug targets of four disorders. External validation showed that the top candidates are enriched for targets selected by independent lines of evidence from a large external database (OpenTargets). We also found that previous studies provided support to a number of targets identified by our approach.
Finding promising targets for diseases is crucial to drug development. However, it is impractical to perform in-depth experimental studies on every possible target for each disease. Computational methods offer a cheap, fast, and systematic high-throughput approach to guide the prioritization of targets. We hope our presented framework will provide an additional way to prioritize drug targets, which may benefit future drug development.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/pharmaceutics14020234/s1. Supplementary Text: Hyperparameter tuning and weighted analysis, Figure S1: Receiver-operating curves (ROC) of different machine learning methods across four datasets, Table S1: Average predictive performance of different machine learning methods across four datasets, Table S2: Summary of the number of drugs in nested cross validation in our study, Table S3: Pearson correlation of predicted probabilities from different ML models for each disease, Table S4: Average predictive performance of different machine learning methods across two datasets, Table S5: Enrichment test of the predicted targets for HT and RA, Table S6: Enrichment test results from Knockdown (KD) data (including 5 sub-tables), Table S7: List of identified targets (the 10 targets with the highest and lowest predicted probabilities of treatment potential are shown; including four sub-tables showing targets for each disease). Data Availability Statement: Full data of LINCS L1000 perturbational profiles can be accessed via the GEO accession code: GSE70138. Consensus transcriptional signatures for LINCS L1000 perturbations are available at https://github.com/dhimmel/lincs. ATC drug categories can be found at https: //www.genome.jp/brite/br08303, and the drug indications from MEDI-HPS can be downloaded from https://www.vumc.org/cpm/cpm-blog/medi-ensemble-medication-indication-resource-0. The OpenTargets database is available at https://platform.opentargets.org/ (accessed on 30 November 2021).