The Performance of Gene Expression Signature-Guided Drug–Disease Association in Different Categories of Drugs and Diseases

A gene expression signature (GES) is a group of genes that shows a unique expression profile as a result of perturbations by drugs, genetic modification or diseases on the transcriptional machinery. The comparisons between GES profiles have been used to investigate the relationships between drugs, their targets and diseases with quite a few successful cases reported. Especially in the study of GES-guided drugs–disease associations, researchers believe that if a GES induced by a drug is opposite to a GES induced by a disease, the drug may have potential as a treatment of that disease. In this study, we data-mined the crowd extracted expression of differential signatures (CREEDS) database to evaluate the similarity between GES profiles from drugs and their indicated diseases. Our study aims to explore the application domains of GES-guided drug–disease associations through the analysis of the similarity of GES profiles on known pairs of drug–disease associations, thereby identifying subgroups of drugs/diseases that are suitable for GES-guided drug repositioning approaches. Our results supported our hypothesis that the GES-guided drug–disease association method is better suited for some subgroups or pathways such as drugs and diseases associated with the immune system, diseases of the nervous system, non-chemotherapy drugs or the mTOR signaling pathway.


Introduction
A gene expression signature (GES) is a set of comprehensive gene expression profiles that can reveal the difference between stimulated and normal cell states [1]. Current applications of GES analysis are fruitful in cancer-related areas for disease genotype classification and outcome predictions. For example, Ramaswamy, S. et al. created a GES database for diagnosing and categorizing the tumour type with an accuracy rate of 78% [2]. Wright, G. et al. developed a Bayesian rule-based algorithm to classify diffuse large B cell lymphoma into two subgroups which have a significant difference in the five-year survival rate [3]. Although the GES method is more commonly used in diagnosing   When the inclusion criteria were applied, and the signatures with no indication relationship were excluded, 230 manual disease signatures and 244 manual drug perturbation signatures from 71 unique diseases and 56 unique drugs, respectively, were enrolled in the final analysis. The average signed Jaccard indexes [12] (SJI) of 3976 unique drug-disease pairs were calculated. Among them, there were 167 pairs with a drug-disease indication from the drug labels. The remaining 3809 unique drug-disease pairs were used as the control group. 14  When the inclusion criteria were applied, and the signatures with no indication relationship were excluded, 230 manual disease signatures and 244 manual drug perturbation signatures from 71 unique diseases and 56 unique drugs, respectively, were enrolled in the final analysis. The average signed Jaccard indexes [12] (SJI) of 3976 unique drug-disease pairs were calculated. Among them, there were 167 pairs with a drug-disease indication from the drug labels. The remaining 3809 unique drug-disease pairs were used as the control group.

Subgroups Distribution
Among the 56 unique drugs analysed, 32 unique protein targets with 22 categories of Anatomical Therapeutic Chemical (ATC) classification were assigned. Thirteen drugs are classified as chemotherapy drugs, and 44 drugs are not (Methotrexate is both a chemotherapy and a non-chemotherapy drug due to its different main therapeutic targets when against different diseases). For transcription factor (TF) level, 12 drugs are labelled as "directly", 39 drugs are labelled as "not-directly" and 5 drugs were labelled as "non-Human" (see section 4.4. subgroup classification for the detailed meanings of labels). Further, 71 diseases are divided into 11 ICD-11 (International Classification of Diseases 11th Revision) categories. In total, 70 subgroups belonging to five categories were assigned ( Figure 2, detailed information in Table S1 and Table S2).
Among the 56 unique drugs analysed, 32 unique protein targets with 22 categories of Anatomical Therapeutic Chemical (ATC) classification were assigned. Thirteen drugs are classified as chemotherapy drugs, and 44 drugs are not (Methotrexate is both a chemotherapy and a nonchemotherapy drug due to its different main therapeutic targets when against different diseases). For transcription factor (TF) level, 12 drugs are labelled as "directly", 39 drugs are labelled as "notdirectly" and 5 drugs were labelled as "non-Human" (see section 4.4. subgroup classification for the detailed meanings of labels). Further, 71 diseases are divided into 11 ICD-11 (International Classification of Diseases 11th Revision) categories. In total, 70 subgroups belonging to five categories were assigned (Figure 2, detailed information in Table S1 and Table S2).   Figure 2. The subgroups proportion of unique 167 indicated drug-disease pairs of different categories. (a) Disease classification. NEO: neoplasms, DMSCT: diseases of the musculoskeletal system or connective tissue, DS: diseases of the skin, CIPD: certain infectious or parasitic diseases, DIS: diseases of the immune system, ENMD: endocrine, nutritional or metabolic diseases, DBBO: diseases of the blood or blood-forming organs, DRS: diseases of the respiratory system, DNS: diseases of the nervous system, DDS: diseases of the digestive system, DCS: diseases of the circulatory system. (b) Drug target. GLUR: glucocorticoid receptor, DNAtopo: DNA/topoisomerase-human, TYRK: tyrosine kinase, DNAclak: DNA cross-linking/alkylation, CYC: cyclooxygenase, DNAlig: DNA/ligase, TOPOI: topoisomerase-non-human, INTR: interferon receptor, MICROT: microtubules, NUCS: nucleotide synthesis, TNF: tumor necrosis factor. (c) TF (transcription factor) level. "Directly": drugs with TFs as its main therapeutic targets. "Not-directly" indicates drugs with main therapeutic targets which are human DNA structures or human proteins but not TFs. "Non-Human" represent drugs interacting with protein or structures of non-human (for example, from virus or bacterial) as main therapeutic targets. (d) Chemotherapy. "YES" or "NO" indicates the drug is a chemotherapy drug or not. (e) ATC classification. CORTI: corticosteroids for systemic use, plain, OAA: other antineoplastic agents, CYTOANTIB: cytotoxic antibiotics and related substances, ANTIME: antimetabolites, IMMSUP: immunosuppressants, NSAAP: anti-inflammatory and antirheumatic products, non-steroids, HAARA: hormone antagonists and related agents, QUINA: quinolone antibacterial, IMMSTI: immunostimulants, PAAAONP: plant alkaloids and other natural products, ALKA: alkylating agents.

Overall Score of GES Similarity of Drug-Indicted Disease Pairs Against Random Drug-Disease Pairs
We observed significantly lower SJI similarity scores of drug-disease indication pairs than those of random drug-disease pairs (p-value of two-side t-test [13] equals to 0.02324). The average similarity score of indicated pairs is −0.00386 with a standard deviation of 0.01794 and that of random control pairs is −0.00072 with a standard deviation of 0.01750, indicating that the GES method can reflect the therapeutic effects of the drugs (The distributions of SJI in both the indication group and the control group are shown in Figure 3).
Molecules 2020, 25, x FOR PEER REVIEW 5 of 17 drugs with TFs as its main therapeutic targets. "Not-directly" indicates drugs with main therapeutic targets which are human DNA structures or human proteins but not TFs. "Non-Human" represent drugs interacting with protein or structures of non-human (for example, from virus or bacterial) as main therapeutic targets. (d) Chemotherapy. "YES" or "NO" indicates the drug is a chemotherapy drug or not. (e) ATC classification. CORTI: corticosteroids for systemic use, plain, OAA: other antineoplastic agents, CYTOANTIB: cytotoxic antibiotics and related substances, ANTIME: antimetabolites, IMMSUP: immunosuppressants, NSAAP: anti-inflammatory and antirheumatic products, non-steroids, HAARA: hormone antagonists and related agents, QUINA: quinolone antibacterial, IMMSTI: immunostimulants, PAAAONP: plant alkaloids and other natural products, ALKA: alkylating agents.

Overall Score of GES Similarity of Drug-Indicted Disease Pairs Against Random Drug-Disease Pairs
We observed significantly lower SJI similarity scores of drug-disease indication pairs than those of random drug-disease pairs (p-value of two-side t-test [13] equals to 0.02324). The average similarity score of indicated pairs is −0.00386 with a standard deviation of 0.01794 and that of random control pairs is −0.00072 with a standard deviation of 0.01750, indicating that the GES method can reflect the therapeutic effects of the drugs (The distributions of SJI in both the indication group and the control group are shown in Figure 3).

Subgroup Scores of GES Similarity of Drug-Indicated Disease Pairs Against Random Drug-Disease Pairs
We compared drugs from five different categories of subgroups: (1) disease classifications; (2) drug target; (3) TF level; (4) chemotherapy; and (5) ATC classification. The results are shown in Figure  4, detail information is listed in Table S3 and Table S4. Subgroups with important or significant (qvalue according to false discover rate (FDR) lower than 0.05) results according to least squares mean partitions F tests of a generalized linear model (GLM) [14] are listed in Table 2.

Subgroup Scores of GES Similarity of Drug-Indicated Disease Pairs Against Random Drug-Disease Pairs
We compared drugs from five different categories of subgroups: (1) disease classifications; (2) drug target; (3) TF level; (4) chemotherapy; and (5) ATC classification. The results are shown in Figure 4, detail information is listed in Table S3 and Table S4. Subgroups with important or significant (q-value according to false discover rate (FDR) lower than 0.05) results according to least squares mean partitions F tests of a generalized linear model (GLM) [14] are listed in Table 2. Important subgroups or subgroups with false discover rate (FDR) q-value lower than 0.05 from GLM least squares mean partitions F tests for signed Jaccard index differences between drug-indicted disease pairs and random drug-disease pairs. "----" indicates that subgroups only have one unique drug-disease pair sample with no standard deviation.  "Not-directly" indicates drugs with main therapeutic targets which are human DNA structures or human proteins but not TFs. "Non-Human" represents drugs interacting with non-human proteins or structures (for example, from viruses or bacteria) as main therapeutic targets.

Gene and Pathway Analysis on an Example Drug-Disease GES Pair
Interferon receptor (with the same drug-disease pair content as the immunostimulants subgroup), the subgroup with the lowest q-value, was chosen as a case report for the pathway analysis. The top 5% (93/1898) genes with a relatively reversed expression probability according to the relatively expression probability of a gene's (G I-R% , an indicator of the relative possibility difference of gene expression between the indicated group and the random control group, see below 4.5) scores are shown in Table 3. The top 10 significant biological pathways identified by the ingenuity pathway analysis are shown in Table 4.  These 10 pathways are reported to be involved with interferon regulation [15][16][17][18][19][20][21][22][23][24][25][26][27]. within inflammatory and immune responses (see Table 5).

Discussion
It is well-recognized that genes with similar gene expression patterns have a similar function [34]. From the overall score, we can see that FDA-approved drugs listed in the CREEDS database and their indicated diseases generally have inverse GES patterns compared with the random controls. However, the absolute difference between the indicated group and random control group is not very obvious. For example, in a recent study [35], a significant relationship was found between drug-disease GES similarity and drug therapeutic effect using Cmap [36], with a relatively low overall area under curve (AUC) of 0.57, indicating a real, albeit weak, inverse relationship. The treatment effectors of the drugs identified in this study likely work via the interaction of the genes' protein products, with only a moderate correlation between gene expression and levels of the corresponding protein(s) [37]. Thus, an association study between drugs/diseases and gene expression/pharmaceutical effect is necessary. Also, other mechanisms, for instance, microRNA-based therapeutics, might directly orchestrate the activation/deactivation of the gene expression. However, due to the limitation of available sources, we were unable to investigate other mechanisms of action. Besides, the drugs' TF-levels were not a significant factor that reflect the indication relationship (although drugs directly interacting with TF perform slightly better with q-value of 0.22309 vs. q-value of 0.99509). In our analyses, some subgroups of drugs-diseases pairs with indication associations have positive similarity scores (which means that the drug may exacerbate the disease according to the assumption of gene expression signature similarity) or a score higher than random drug-disease pairs, but these findings were not statistically significant. On the other hand, 7 of 70 subgroups had a significantly lower similarity score when a drug-disease association is indicated.
This study may provide some hints to other future studies utilizing the GES method strategies of comparing drug-disease GES similarity for drug repositioning. That is, certain types of drugs may have a stronger ability to reverse the GES of the diseases they treat, and the disease type may also influence this ability. As such, in specific kinds of subgroups, the drug-disease pairs with higher similarities of reversed GES patterns may have greater therapeutic relationships, which means that focusing on certain kinds of diseases or drugs can increase the true positive rate of the GES-guided drug repositioning method For example, over half (4/7) of the significant subgroups (immunostimulants, interferon receptor, other dermatological preparations, and diseases of the blood or blood-forming organs) are related to diseases associated with the immune system (the disease includes in "other dermatological preparations" atopic dermatitis). This indicates that a drug with drug-disease pairs associated with the immune system tends to have lower similarity scores when compared with the diseases it indicated than random diseases. This means in a GES-guided drug repositioning analysis, an immune-associated drug is more likely to have a potential therapeutic effect on diseases that have a higher inverse similarity with it.
Chemotherapy drugs may not be as good as non-chemotherapy drugs for the GES-guided drug repositioning method (q-values: 0.99509 vs. 0.03937). Unarguably, the high diversity of chemotherapy responses to heterogenetic tumor tissues or even histologically similar tumors has been a challenging problem for a long time [37,38]. The failure of controlling the process of programmed cell death in tissue, one of the major causes of tumors, can be rectified or even overturned by activating/deactivating different pathways under various conditions [39]. This may be the reason that chemotherapy drugs are not good for the GES-guided repositioning approach. On the other hand, non-chemotherapy drugs show a significant result as they interact with cancer cells through more specific mechanisms, such as hormone regulation or mono-target therapy.
For the biological pathway analysis of the interferon receptor subgroup, we found that the genes involved in pathways directly regulated by drugs have the lowest GI-R% scores. It is reasonable that GES-guided drug repositioning methods are more sensitive to drugs directly targeting pathways related to diseases. Furthermore, the significance of mTOR signaling is in accordance with the result in which the subgroup kinase mTOR had a significant indicated-random drug-disease pairs' SJI difference. This result confirmed the high sensitivity of the GES-guided drug repositioning method to this pathway on the other side.
There are some limitations in this study. First, the tissues used for testing the drug effects may not match with the body parts/organs affected by the diseases.. Second, some bias may be caused by the limited number of the CREEDS bio-assay collection which may not have the ability to fully present the pattern of all kinds of drugs and diseases. Additionally, it is important to differentiate the types of "treatment effect". Some drugs may cure a particular disease while others may just provide symptomatic relief thereby resulting in different patterns of GES for the same disease. Also, some indicated subgroups ("kinase mTOR", and "other dermatological preparations") have too few unique drug-disease pairs (n = 1), which may weaken the analyses' power.
In this study, we systematically analyzed the similarity of gene expression profiles from known drug-disease associations, and we found that indicated pairs have a greater inverse similarity score. We found seven subgroups in which their drugs or diseases may have a greater reversed GES pattern when there is a clear therapeutic effect. These findings suggest that a GES-guided drug repositioning method should be used based on the drug or disease type differences. For example, drugs or diseases associated with the immune system, diseases of the nervous system or non-chemotherapy drugs may be a better choice for drug repositioning. Moreover, our biological pathway enrichment analysis showed that some pathways may be more sensitive to this method, such as the mTOR signaling pathway.

Gene Signature Data Collection and Filtering
In this study, all gene signature information was collected from a well-calibrated GES repository, the crowd extracted expression of differential signatures (CREEDS) [12] database. The CREEDS database is maintained by the Ma'ayan Lab of Icahn School of Medicine at Mount Sinai. CREEDS utilized GEO2Enrichr [40] to extract GES profiles from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) and applied a characteristic direction (CD) model [41] to identify differentially expressed genes. This database V1.0 includes 10,797 single-gene perturbations, 2258 disease signatures and 5516 drug perturbation gene signatures. Among these signatures, 2176 manual single-gene perturbations, 828 manual disease signatures and 875 manual drug perturbation signatures were considered to be more accurate compared with the automatically generated GES by the machine learning method. The CREEDS database allows users to compare the similarity between the user-specified GES and the GESs processed and stored in the CREEDS.
We first selected the CREEDS manual GES profiles if the assays were from human tissues and/or human cell linesand if the drugs had FDA approval.
Each GES profile includes a list of up-and down-regulated genes. The SJI [12] (see below), a measurement for the similarity between two GES profiles from the paired drug-disease, was calculated. When a drug or a disease had multiple GES profiles, we calculated the SJIs of all the possible combinations, and an overall score for each unique drug-disease pair was calculated from the average of all scores from pairs sharing the same drug-disease combination. All the disease signatures and drug perturbation signatures were requested through the application program interface (API) provided by CREEDS. GES profiles were removed if they labelled for both a drug treatment and for a disease, because this may cause biased similarity. Under the criteria that (a) the GES profiles must come from assays of human cells/tissues, and (b) drugs must be approved by FDA, the remaining signatures were paired within drugs and diseases according to the indication associations. Signatures without any indicated drug-disease relationship were also excluded from further analysis. For example, cocaine was removed because its indication, local anesthesia, was not in the data of disease signatures and could not be paired. The overall data process is shown in Figure 5.

Similarity Calculation
In our analysis, SJI, which is based on the Jaccard similarity coefficient [42], was used to compute the similarity between GES profiles from a drug and a disease. The Jaccard similarity coefficient is a statistic used to gauge the similarity between different sample sets. It is defined as the size of the intersection divided by the size of the union of two sample sets. It is calculated as follows: where G1 and G2 stand for two lists of differential expressed gene sets, "SAME" represents the number of same genes between two given gene sets, and "ALL" stands for all the unique genes that appeared in the two gene sets.
SJI, which combines the Jaccard similarity coefficient with the gene regulation direction is calculated as follows: where J means Jaccard similarity coefficient, and G up and G down are up-or down-regulated genes in the given gene set G, respectively. The index ranges from +1 to −1, where +1 and −1 indicate a same pattern and inverse pattern of two gene sets, respectively. Zero indicates that the two sets have no associations, or the same part is cancelled out by the inverse part. The reason to use an unranked 2. Remove non-human assays (450/555)

Similarity Calculation
In our analysis, SJI, which is based on the Jaccard similarity coefficient [42], was used to compute the similarity between GES profiles from a drug and a disease. The Jaccard similarity coefficient is a statistic used to gauge the similarity between different sample sets. It is defined as the size of the intersection divided by the size of the union of two sample sets. It is calculated as follows: Jaccard Similarity Coe f f icient(G 1 , G 2 ) = SAME ALL where G 1 and G 2 stand for two lists of differential expressed gene sets, "SAME" represents the number of same genes between two given gene sets, and "ALL" stands for all the unique genes that appeared in the two gene sets. SJI, which combines the Jaccard similarity coefficient with the gene regulation direction is calculated as follows: where J means Jaccard similarity coefficient, and G up and G down are up-or down-regulated genes in the given gene set G, respectively. The index ranges from +1 to −1, where +1 and −1 indicate a same pattern and inverse pattern of two gene sets, respectively. Zero indicates that the two sets have no associations, or the same part is cancelled out by the inverse part. The reason to use an unranked score calculation method (SJI) is to keep in accordance with the same scoring method used in the CREEDS database. The CREEDS API (application programming interface) offers the function to calculate the SJI automatically. However, we found the API could not calculate the SJI correctly when two GES profiles are highly overlapped., therefore, all the SJIs in this study were re-calculated.

Drug-Related Information Collection
In our analysis, the source of drug-related information is listed as follows: 1. Drug target information was collected from DrugBank [43,44] Release Version 5.1.4 [45] (https://www.drugbank.ca/releases/latest#external-links). Only the targets with the main therapeutic effect in the mechanism of action section were included; 2. The human TF list was collected from the paper published by Samuel A. Lambert et al. [46]; 3. ATC classifications on level 3 were collected from the WHO official website (https://www. whocc.no/atc_ddd_index/); 4. The drug indication was from section "indications and usage" of FDA label on FDA website (https://labels.fda.gov/); 5. (Drug-indicated) Disease classification was assigned to each disease based on the International Classification of Diseases 11th Revision (ICD-11), level 1.

Subgroup Classification
In our analysis, we assessed the following factors that might influence the power of the GES-guided drug repositioning method:

1.
Disease classifications: A subgroup was assigned to a disease in a drug-disease pair according to the ICD-11-level 1 code of the disease; 2.
Drug target subfamilies: Subgroups were divided by the main therapeutic target of each drug. To avoid group splits being too small, some same subfamilies of targets are grouped as one, such as "Beta-1 adrenergic receptor", "Beta-2 adrenergic receptor" and "Beta-3 adrenergic receptor" are grouped in the same subgroup "Beta adrenergic receptors"; 3.
The relationship between the drug's main therapeutic targets and human transcription factors: A TF level was assigned according to the relationship between the drugs' main therapeutic targets and human TF. Drugs with main therapeutic targets that can directly interact with at least one TF were labelled as "directly". Drugs with main therapeutic targets which are human DNA structures or human proteins but not TFs were labelled as "not-directly". Drugs interacting with non-human proteins or structures (for example, from viruses or bacteria) as main therapeutic targets were labelled as "non-Human"; 4.
The drug's ATC classification: Subgroups were divided according to the Anatomical Therapeutic Chemical classification system, level 3. Drugs with multiple classifications caused by different administration routes were unified to systematic use.

Statistical Analysis and Pathway Analysis
The random control group was generated by calculating the average SJI of all possible drug-disease pairs without indicated associations to imitate a GES-guided drug repositioning screening. A t-test [13] was applied to quantify the mean differences of the SJI between drug-indicated disease pairs and random controls.
For subgroup analysis, GLM [14] least squares mean partitions F tests function was applied to estimate the mean difference between the indicated and control group since the data was unbalanced with multiple factors. A significant result of a certain subgroup indicated that the average SJI of this subgroup was significant between two indication levels (Yes/No). False discovery rate (FDR) q-value of the Benjamini-Hochberg procedure [47] was controlled to 0.05 to avoid an inflated experiment-wise type I error rate caused by multiple comparisons among all subgroups.
Data processing and statistical analysis (student t-tests, GLM, FDR calculation) were conducted using R studio 3.6.1 [48] and SAS software version 9.4. Copyright © 2019 SAS Institute Inc. Cary, NC, USA.
Differentially reversed expression genes (top 5% negative score according to the relatively reverse percentage) from the most significant subgroup will be chosen as examples to conduct biological pathway enrichment analysis.
The relatively reverse percentage is calculated as Relatively expression probability o f a gene G I−R% = D I % − D R % where D I % and D R % stand for the percentage of the gene which is differentially expressed in all assays of indicated/random drug-diseases pairs. It is calculated as D% = NS − NR Total assays pairs where NS and NR represent the number of times a gene showed a same or reverse regulation direction between assays of drugs and diseases among all drug-disease assays pairs.
The G I-R% ranges from 100% to -100%. A higher positive score indicates that this gene is more likely to be expressed in the same direction in indicated drug-disease assays compared with random drug-disease assays. Likewise, a lower negative score indicates that this gene has a higher probability to express reversely between indicated drug-disease assays compared with random drug-disease assays.