MNBDR: A Module Network Based Method for Drug Repositioning

Drug repurposing/repositioning, which aims to find novel indications for existing drugs, contributes to reducing the time and cost for drug development. For the recent decade, gene expression profiles of drug stimulating samples have been successfully used in drug repurposing. However, most of the existing methods neglect the gene modules and the interactions among the modules, although the cross-talks among pathways are common in drug response. It is essential to develop a method that utilizes the cross-talks information to predict the reliable candidate associations. In this study, we developed MNBDR (Module Network Based Drug Repositioning), a novel method that based on module network to screen drugs. It integrated protein–protein interactions and gene expression profile of human, to predict drug candidates for diseases. Specifically, the MNBDR mined dense modules through protein–protein interaction (PPI) network and constructed a module network to reveal cross-talks among modules. Then, together with the module network, based on existing gene expression data set of drug stimulation samples and disease samples, we used random walk algorithms to capture essential modules in disease development and proposed a new indicator to screen potential drugs for a given disease. Results showed MNBDR could provide better performance than popular methods. Moreover, functional analysis of the essential modules in the network indicated our method could reveal biological mechanism in drug response.


Introduction
The traditional process of drug development is particularly slow and costly, which usually takes 12-15 years and billions of dollars [1,2]. In addition, the rate of new drug candidates being Food and Drug Administration (FDA)-approved has been lessen although the levels of investments in pharmaceutical R&D remarkably increase [3]. At the same time, this philosophy of rational drug design that "one gene, one drug, one disease" paradigm overlooks the inherent complexity of diseases [4,5]. In this case, drug repositioning (or repurposing), which aims to identify novel disease indications for known safety and pharmacology approved-drugs, is very economical and efficient. Compared with traditional process of drug development, repositioning a drug may reduce the drug development period to 6.5 years and costs on average $300 million [6]. Therefore, drug repositioning should be "the primary strategy in drug discovery for every broadly focused, research-based pharmaceutical company" [1].
One of the seminal method is connectivity map (CMap) [7], and the assumption of it was that biological state could be described in terms of a genomic signature. They measured genome-wide transcriptional expression data across a multiple of cell lines treated with small drug molecules and matched these profiles with disease perturbation gene expression profiles to find new associations between drugs and diseases. Although it is difficult to interpret the meaning of predicted associations, the robust of disease signatures and the effectiveness of the method has been experimentally validated [8][9][10]. Inspired by the Genes 2021, 12, 25 2 of 12 rationale behind the CMap method [7], numerous approaches for drug repositioning based on gene expression data and connectivity map have been developed. Zhang et al. [11] proposed a simple method to filter reference gene-expression profiles for the connection scoring scheme. In addition, more connection methods, such as eXtreme Sum score (XSum) [12], Xcos [13], was proposed to calculate the similarities between gene expression patterns of diseases and drugs. Iorio et al. [14] developed a drug repositioning method that constructed drug-drug similarity networks by comparing drug perturbation gene expression profiles. Saberian et al. [15] presented a novel machine learning-based method, which explored the anti-similarity between drugs and diseases to uncover new uses for drugs. However, these previous methods ignored the fact that both the pathogenesis of diseases and drug mode of action (MoA) have been revealed to be tightly connected with gene modules [16]. Chung et al. [17] developed Functional Module Connectivity Map (FMCM), using functional gene modules as disease signatures to build a connectivity map and its performance was superior to traditional signature-based drug-repurposing methods. Jia et al. [18] introduced a new framework incorporating the gene expression data and pathway analysis. They provided a new approach to explain the drug mode of action in a disease context. As we know, proteins, nucleic acids, and small molecules could form a dense network of molecular interactions in a cell, and there may be cross-talks among different functional modules in the cell [19]. Therefore, in drug repositioning, it may be helpful to consider the cross-talks among the function modules. However, as far as we know, there were no drug repositioning methods taking the cross-talks among modules into consideration.
To fill the gap, we present Module Network Based Drug Repositioning (MNBDR), a novel computational module for drug repositioning. We applied the module network to the field of drug repositioning for the first time and proposed two new indicators to evaluate the expression levels of modules and the score of drug-disease. First of all, dense clusters in PPI network were detected as modules. After that, as described in our previous study [20], the cross-talks among modules were identified by testing whether the connections among the genes in two modules were significantly high. Based on both the gene expression data of disease samples and the module network, Pagerank [21] algorithm was applied to rank the important modules in disease. Lastly, the gene expression data of the important modules in drug stimulation samples were further pooled together to calculate an overall connectivity score for each pair of drug-disease. In order to validate our method, we applied MNBDR in 19 cancer datasets and compared it with several popular signature-based drug repurposing methods. We showed that MNBDR performed better than previous methods. Finally, we analyzed the function of the important modules in our module network to investigate the biological meaning of our method.

Data Set and Preprocessing
The drug stimulation data was downloaded from The Library of Integrated Network-Based Cellular Signatures (LINCS) program (level 5; accession number: GSE70138), which contains 118,051 gene expression profiles from multiple human cultured cell lines (treatment and control) treated with 1827 bioactive small chemical molecules at varying concentrations. Each expression profile consists of moderated z-score value of 12,328 genes. LINCS team defined nine touchstone cell lines [22] and we used five cell lines (PC3, A375, HALE, MCF7 and HT29) which have the sample sizes more than 10,000. The pre-processing procedure for drug gene expression data was included in the Supplementary Materials and Figure S1.
Cheng et al. [12,13] showed that the majority of the compounds do not have sufficient therapeutic effects on cell lines. In our work, we applied the compound filtering procedure described by Cheng et al. [12] and used expression signal strength (ESS) to filter the drug stimulation samples. The details were described in Supplementary Materials and Figure S2. The microarray data for whole-genome mRNA expression of disease samples were downloaded from TCGA (The Cancer Genome Atlas) research network [23]. In order to generate more stable disease features, only data sets with at least three normal and three disease samples were considered for further processing. Lastly, we obtained a total of 3486 control samples and 60,460 disease samples from 19 cancer data sets. Then, for each cancer data set, we averaged disease and control samples and calculated the corresponding fold changes for all the genes.
The PPI data was derived from STRING database [24]. In order to reduce the falsepositive interactions which are probably originated from prediction methods, we followed the strategy of Zhou et al. [25] and only the interactions with a confidence score of 770 or above were kept. In total, there were 36,619 unique interactions among 9474 proteins in the PPI.

Benchmark Standard
The golden standard of known drug indications were obtained from Quan et al. [26]. They identified the drug-indication relationships through Drug-Gene Interaction database (DGIdb) [27], Therapeutic Target Database (TTD) [28], and DrugBank [29]. Only the clinically supported or FDA-approved drug-disease relationships were used. In this study, we got a total of 2877 associations between 19 cancers and 477 drugs. All the drug and disease interactions used in this work were shown in Supplementary Table S1.

Construction of the Module Network
To construct the module network, we adopted a similar strategy as our previous study [20]. First of all, we used MCODE [30] in Cytoscape [31] to detect dense clusters in the PPI network and only the clusters containing no less than 5 nodes were retained as modules. After that, for each pair of modules, the number of edges (PPI interactions) between the two modules was calculated. Then two random gene sets, which have the same number of genes with the two modules, were randomly selected, and the edges among the two random gene sets were counted. The random process was repeated 1000 times and the 1000 edge numbers were used as null distribution. Then, the p-value of the cross-talks among the two modules was calculated based on the null distribution. If the number of edges between two modules was significantly high (p-value < 0.01), then there was a cross-talk between the two modules. Finally, all the modules and cross-talks among these modules constituted the module network ( Figure 1A).

Feature Space Transformation
For each disease (or drug), we mapped gene expression data from gene's feature space to the module's feature space. Taking breast cancer as an example, first of all, the fold-change of all the genes' expression levels in the breast cancer samples and control samples was calculated. Then for all the n dense clusters (M 1 , M 2 , . . . , M i , . . . , M n ) in PPI, the importance (Imp i ) for module M i was calculated as follow: Fmax and Fmin are the maximum and minimum fold-change of all the genes in M i . At last, we obtained {Imp 1 , Imp 2 , . . . , Imp i , . . . , Imp n }, which can characterize the difference of the gene expression levels of all the modules in the disease.
Based on the fact that cross-talks among functional modules could play important roles in drug response and disease progression, we proposed a computational method, which used module network to identify essential modules in disease progression and improve CMap drug screening strategy, to do drug repositioning. Figure 1 shows the pipeline of our method. Our framework includes the following steps: (i) As PPI network exhibit a "scale-free" topology [37], communities exist in PPI network. The dense cluster in PPI may work together as a functional unit, we mined the communities in PPI as functional gene sets (denoted as modules in this work). After that, a permutation test was applied to identify the cross-talks among these modules (Method). As a result, we obtained 486 significant pairs among 116 modules, which were shown in Figure S3 and Supplementary Table S2.

Module Rank Based on Pagerank
We assumed that the modules with both important topological coefficient in the module network and significantly differential expression levels would be more essential in disease. We thus used network propagation algorithms to simulate cross-talks of functional modules, which was defined as follows: where W denotes a transition matrix that is the column normalization of the adjacency matrix. In our work, the nodes of the adjacency matrix are modules, and the edges are the connections among the modules in our module network. Here, P 0 represents our initial, or prior, information of the modules. In this work, we set P 0 as {Imp 1 , Imp 2 , Imp 3 , . . . , Imp n } of all the modules in the corresponding disease. As we know, if the propagation process repeated too much times, information will eventually spread out over the whole network and the local neighborhood of the important nodes will be missed [32]. Therefore, a damping factor λ (0 < λ < 1) was defined to avoid it. In this study, λ was set as 0.85, which was typical value for Pagerank [33].

Drug Prioritizing
Inspired by the normalized discounted cumulative gain (NDCG) [34], we proposed a new indicator S to evaluate the drug-disease score between each drug to a specific disease. The indicator is described as follows: For the top n modules (1st, 2nd, . . . , ith, . . . , nth) in disease progression, V(i) is the imp of the ith modules in drug response and P(i) is the position of the ith module in the ranked module list in drug response. That is, if the important modules in disease were also ranked on the top of the module list in drug response, a high score S would be obtained. At last, for each disease, all the drugs were prioritized based on S.

Evaluation Metrics
We used the area under the curve (AUC) of the receiver operator characteristic (ROC) measure to evaluate model performance. The ROC curve can be drawn with the truepositive rates (TPRs) and the false-positive rates (FPRs) at different cutoffs. TPR is the proportion of positive samples identified correctly among the total positive samples, while FPR is the ratio of misidentified negative samples accounting for all the negative samples. TPR and FPR are defined as follows: where TP and TN are the numbers of correctly identified positive and negative samples, and FN and FP are the numbers of positive and negative samples that are misidentified. At the same time, we also used AUC0.1, which is widely used in the field of drug repositioning [12], to evaluate our algorithm. Index AUC0.1 is the area under the curve measured of the ROC under the condition of FPR ≤ 0.1. It guarantees that indicator can focus on the early retrieval performance of the model by restricting FPR. It is essential because it is more realistic in drug repositioning when the candidate drug number is small. Thus, in this work, we applied AUC0.1 as the main index for module evaluation. To better compare model performance, we also used the average AUC (AvgAUC) of all the diseases as our evaluation index. To determine the statistical significance of the results, we calculated non-parametric p-value by performing 10,000 runs with random permutations of the drug-disease relation.

Assessment
To assess the performance of MNBDR, we compared the prediction results with several methods.
In order to investigate the impact of cross-talks between modules on prediction performance, we compared MNBDR with two other methods (Gene based method and Module based method). The Gene based method only used the gene's fold-change to rank the genes and used the ranked gene list to screen drugs, which is similar with the traditional CMap. In the meanwhile, Module based method, which ranked the modules using the gene expression levels (without taking the module networks into calculation) and screened the drugs based on the ranked module list.
MNBDR and Module based method are described as above and have been implemented in Python package, which are freely available at https://github.com/nbnbhwyy/ MNBDR. Details about other methods were available in Supplementary Material.

Framework Overview
Based on the fact that cross-talks among functional modules could play important roles in drug response and disease progression, we proposed a computational method, which used module network to identify essential modules in disease progression and improve CMap drug screening strategy, to do drug repositioning. Figure 1 shows the pipeline of our method.
Our framework includes the following steps: (i) As PPI network exhibit a "scale-free" topology [37], communities exist in PPI network. The dense cluster in PPI may work together as a functional unit, we mined the communities in PPI as functional gene sets (denoted as modules in this work). After that, a permutation test was applied to identify the cross-talks among these modules (Method). As a result, we obtained 486 significant pairs among 116 modules, which were shown in Figure S3 and Supplementary Table S2. (ii) For each disease, the perturbation of the genes was calculated based on the gene expression data of disease samples and control samples, and then mapped the perturbation of the genes to the module's space through an index Imp (Method) to obtain the initial score of the modules. Subsequently, we applied a network propagation algorithm to learn the topology information of the module network to refine the scores of the disease modules. In the propagation algorithm, λ is an important parameter and we adopted a typical value (0.85) [33]. In addition, we changed λ in the algorithm and found the result was robust ( Figure S4). At last, we selected the n modules with the highest scores to characterize the corresponding disease. In this study, n is set to 15, which is about 10% of all the nodes in the module network. In order to validate the robustness of our model, we also varied n from 3 to 50 and found the result was stable, and our method could achieve the best performance with n = 15 (The details are included in Figure S5). (iii) For these modules, the perturbation scores after each drug's stimulation was also calculated based on the gene expression data of the samples stimulated by the corresponding drug and the control sample (Method). Then, a new indicator S (Method) was proposed to evaluate the effect of each drug to the specific disease. Finally, all the drugs for each disease could be ranked based on the indicator.

Comparing with CMap
CMap is the most famous method to screen drugs using gene expression data [7]. Cheng et al. [12,13] have made a systematic assessment of CMap. In this work, we use this method as a benchmark (Gene based method). In addition, in order to validate the hypothesis that the cross-talks information among functional modules could facilitate drug repositioning, we compared our strategy (MNBDR), which integrating module networks and gene expression data to rank modules, with a simple expression ranking strategy which prioritizing modules based on expression data only (Module based method).
Cancer is one of the most serious threats to human health and drug development for cancer is a big challenge [38]. Here, we applied our method in 19 cancer data sets to comparing the performance of MNBDR, Gene based method and Module based method. In this work, we adopted the same index (AUC, AUC0.1 and p-value) in a previous work to evaluate the performance of drug screen methods [12]. The detail result was shown in Table 1 and Figure 2. Cancer is one of the most serious threats to human health and drug development for cancer is a big challenge [38]. Here, we applied our method in 19 cancer data sets to comparing the performance of MNBDR, Gene based method and Module based method. In this work, we adopted the same index (AUC, AUC0.1 and p-value) in a previous work to evaluate the performance of drug screen methods [12]. The detail result was shown in Table 1 and Figure 2.   The results showed that MNBDR had the best performance in the two indexes: AveAUC and AveAUC0.1 (FPR = 0.1, specificity higher than 0.9). In the meanwhile, the method based on modules performed better than the method using gene as features (the original method of CMap), which was consistent with the previous report [17]. In addition, this phenomenon also proved that the communities mined from PPI indeed were functional modules. Furthermore, because the main difference between our method and the method based on modules was that MNBDR using the cross-talks information in the module network, the better performance of our method validated the hypothesis of our strategy. At last, we also validated the performance by randomly permuted the drugdisease relations of the benchmark standard and calculated the p-values of the two indexes (Method). These p-values also proved the power of our method. The detail result was shown in Table S3.

Comparing with the Other Methods
The connectivity approach is essential for drug screen using gene expression data and a previous paper compared several connectivity approaches [39]. In order to evaluate the power of our method, we compared the performance of our method with all the five connectivity methods. From this result (Table 2), it can be seen that all the connectivity approaches can achieve a better performance than random methods (p-value < 0.01). Among them, XsumScore was the best. Apart from that, in the two indexes, MNBDR outperformed other connectivity methods. The results of our method (MNBDR) are bolded.
In addition to comparing with traditional connectivity methods, we also compared our method with three latest published methods (LLE-DML [15] and Cogena [40] and EMUDRA [36]) that used gene expression data to screen drugs for diseases. The results shown in Table 2 indicated MNBDR was more effective than LLE-DML, Cogena, and EMUDRA. More importantly, the differences of AUC0.1 in the four methods are more obvious and AUC0.1 is very valuable for drug development [12], and we find that Cogena has similar performance with the Module based method. This can prove that our approach may be useful for different functional modules, which is valuable for further research. Moreover, LLE-DML achieves the second best performance. It performs as "black boxes" and it is very hard to investigate the important genes in the modules. In the meanwhile, our method could reveal the important modules in diseases, and could be used to investigate the biological mechanisms in disease progression and drug response.

Function Analysis of the Important Modules in Diseases
We also investigated the function of the important modules to reveal the underlying mechanisms in disease and drug response. In our study, we selected the modules that are important in all the 19 cancers and set them as GCF (generalized cancer features). As a result, 15 modules were selected. Then, we used GSEA to analyze which pathways the genes, contained by GCF were involved in. Some enriched KEGG pathways for GCF genes were shown in Figure 3 and all the enriched pathways were shown in Supplementary  Table S4.
Genes 2020, 11, x FOR PEER REVIEW 8 of 12 genes were shown in Figure 3 and all the enriched pathways were shown in Supplementary Table S4. Among the 47 significant pathways (FDR < 1.0 × 10 −4 ), we found that "Pathways in cancer" was on the top, with an FDR of 2.89 × 10 −13 . What is more, many sub-pathways of "Pathways in cancer" were enriched, such as "MAPK signaling pathway", "PI3K-Akt signaling pathway", "FoxO signaling pathway", "Proteoglycans in cancer", "Jak-STAT signaling pathway", "Regulation of actin cytoskeleton", "Focal adhesion" and "ErbB signaling pathway". In these sub-pathways, "MAPK signaling pathway" is reported to be essential for cancer-immune evasion in human cancer cells [41]. In addition, "PI3K-Akt signaling pathway" plays a major role not only in tumor development but also in the tumor's potential response to cancer treatment [42]. Recent studies indicate that numerous components of the phosphatidylinositol-3-kinase (PI3K)/AKT pathway have more frequent amplification, mutation and translocation than any other pathway in cancer patients [43]. About "Proteoglycans in cancer", the available evidence indicates both an up-regulation of ribosome production and changes in the ribosome structure might causally contribute to neoplastic transformation [44]. Forkhead box O (FOXO) transcription factors are involved in multiple signaling pathways and function as tumor suppressors in a variety of cancers [45]. Apart from this, there were also many pathways in specific cancers, such as "NON-Small cell lung cancer", "Prostate cancer", "endometrial cancer", and "Basal cell carcinoma".
In a word, module genes are significantly enriched with many cancer-related pathways. Among the 47 significant pathways (FDR < 1.0 × 10 −4 ), we found that "Pathways in cancer" was on the top, with an FDR of 2.89 × 10 −13 . What is more, many sub-pathways of "Pathways in cancer" were enriched, such as "MAPK signaling pathway", "PI3K-Akt signaling pathway", "FoxO signaling pathway", "Proteoglycans in cancer", "Jak-STAT signaling pathway", "Regulation of actin cytoskeleton", "Focal adhesion" and "ErbB signaling pathway". In these sub-pathways, "MAPK signaling pathway" is reported to be essential for cancer-immune evasion in human cancer cells [41]. In addition, "PI3K-Akt signaling pathway" plays a major role not only in tumor development but also in the tumor's potential response to cancer treatment [42]. Recent studies indicate that numerous components of the phosphatidylinositol-3-kinase (PI3K)/AKT pathway have more frequent amplification, mutation and translocation than any other pathway in cancer patients [43]. About "Proteoglycans in cancer", the available evidence indicates both an up-regulation of ribosome production and changes in the ribosome structure might causally contribute to neoplastic transformation [44]. Forkhead box O (FOXO) transcription factors are involved in multiple signaling pathways and function as tumor suppressors in a variety of cancers [45]. Apart from this, there were also many pathways in specific cancers, such as "NON-Small cell lung cancer", "Prostate cancer", "endometrial cancer", and "Basal cell carcinoma". In a word, module genes are significantly enriched with many cancer-related pathways.

Case Study in Breast Cancer
Breast cancer is one of the most common cancer and drug screen for breast is essential for the therapy [46]. As described above, MNBDR could achieve a good performance in breast cancer data set. The details of drugs identified by MNBDR for breast cancer are included in the Supplementary Table S5. Among the identified drugs, most of them are also supported by the literature, in addition to being confirmed by the benchmark data.
Romidepsin is predicted as an efficient drug for breast cancer by our method. In the meanwhile, Romidepsin is a histone deacetylase inhibitor treatment of adult patients with cutaneous T-cell lymphoma (CTCL) [47]. It modulates additional targets involved in cancer initiation and progression such as c-myc, Hsp90 and p53. Romidepsin has shown anticancer effects by induction of apoptosis, cell differentiation and cell cycle arrest, either alone or in combination [48,49].
Colchicine has been considered as one of the most effective medications for alleviating crystal-induced joint inflammation [50]. Inhibition of microtubule polymerization is the chief mechanism of action. Microtubules are among the main protein filaments that make up the cytoskeleton, which is crucial to the regulation of many activities [51]. To date, microtubule-targeting agents (MTAs) remain one the most reliable classes of antineoplastic drugs in the treatment of BC [52]. Based on this evidence, Colchicine which inhibits of microtubule polymerization may have a potential therapeutic effect on breast cancer and the experiment also got a certain degree of verification [53,54], and Colchicine is predicted as one of the most efficient drugs for breast cancer by MNBDR.
Ciclopirox is also one of the top rank drugs predicted by our method for treating breast cancer. This drug is a diterpene triepoxide that is able to suppression the cell growth of breast cancer [55,56].
All the results indicate our method could not only prioritize the drugs which have been approved, but also find the new adaptation disease for the old drugs.

Conclusions
As cross-talks among modules may be important in disease progression and drug response, we proposed a module network based drug repositioning (MNBDR) method. We used module network, which is based on a permutation test method, to describe the cross-talks among the modules in PPI, and then, using the gene expression data of disease samples and control samples, a network diffusion method was used to rank important modules in disease. After that, the important modules in each drug were also identified using gene expression data of samples stimulated by the drug. Finally, a new index, which could reveal whether the important modules in disease progression were also important in drug response, was proposed to evaluate the efficiency of a drug to the specific disease. We evaluated our method using gene expression data of more than 7000 samples from 19 different cancers obtained from TCGA, as well as measurements of around 118,051 drug instances LINCS databases. The results showed that MNBDR consistently outperformed the other methods in terms of not only AUC but also AUC0.1, which indicated the proposed method performed well when the effective drugs are on the top of the ranked lists. Function annotation of the genes in the modules shows our method indeed could capture the import genes in disease and drug response. In addition, case study in breast cancer showed our method could not only prioritize the drugs which have been approved, but also find the new adaptation disease for the old drugs.
In order to prevent overfitting, we did not train the model and we adopted typical values for the parameters in the model. As n (the number of important models) in the model is very important, we also used 10-fold cross-validation to select the best n in training set and compared the performance between our current strategy (n = 15) and the optimal model in 10-fold cross validation. We repeated the 10-fold cross-validation 10 times and the result was shown in Figure S6. We found that the result of optimal model is similar to our current strategy, which proves that our model does not have a data leakage problem. Of course, there are some drawbacks of our work. In this work, we only validated our method in cancer data sets. In fact, drug repositioning for other diseases are also valuable. In our future work, we will test our method in more diseases.

Conflicts of Interest:
The authors declare no conflict of interest.