Identifying Tissue- and Cohort-Specific RNA Regulatory Modules in Cancer Cells Using Multitask Learning

Simple Summary Understanding the underlying biological mechanisms of primary tumors is crucial for predicting how tumors respond to therapies and exploring accurate treatment strategies. miRNA–mRNA interactions have a major effect on many biological processes that are important in the formation and progression of cancer. In this study, we introduced a computational pipeline to extract tissue- and cohort-specific miRNA–mRNA regulatory modules of multiple cancer types from the same origin using miRNA and mRNA expression profiles of primary tumors. Our model identified regulatory modules of underlying cancer types (i.e., cohort-specific) and shared regulatory modules between cohorts (i.e., tissue-specific). Abstract MicroRNA (miRNA) alterations significantly impact the formation and progression of human cancers. miRNAs interact with messenger RNAs (mRNAs) to facilitate degradation or translational repression. Thus, identifying miRNA–mRNA regulatory modules in cohorts of primary tumor tissues are fundamental for understanding the biology of tumor heterogeneity and precise diagnosis and treatment. We established a multitask learning sparse regularized factor regression (MSRFR) method to determine key tissue- and cohort-specific miRNA–mRNA regulatory modules from expression profiles of tumors. MSRFR simultaneously models the sparse relationship between miRNAs and mRNAs and extracts tissue- and cohort-specific miRNA–mRNA regulatory modules separately. We tested the model’s ability to determine cohort-specific regulatory modules of multiple cancer cohorts from the same tissue and their underlying tissue-specific regulatory modules by extracting similarities between cancer cohorts (i.e., blood, kidney, and lung). We also detected tissue-specific and cohort-specific signatures in the corresponding regulatory modules by comparing our findings from various other tissues. We show that MSRFR effectively determines cancer-related miRNAs in cohort-specific regulatory modules, distinguishes tissue- and cohort-specific regulatory modules from each other, and extracts tissue-specific information from different cohorts of disease-related tissue. Our findings indicate that the MSRFR model can support current efforts in precision medicine to define tumor-specific miRNA–mRNA signatures.


Introduction
Cancer is one of the most leading causes of death globally. Despite the remarkable improvements in cancer therapies, cancer patients remain undiagnosed or mistakenly diagnosed in many cases. This mainly happens when cancer therapy cannot match a specific disease due to insufficient knowledge of molecular mechanisms [1]. There is a consensus among cancer biologists that distinct cancers have various molecular subgroups with unique biological characteristics, which is believed as one of the main reasons for drug resistance and less effectiveness treatments [2,3]. Hence, understanding the molecular mechanism of primary tumor cells and tissues is fundamental to infer the biology of human tumors and predict how the tumors respond to therapies [4].
MicroRNAs (miRNAs) are important non-protein-coding RNA regulators of gene expression by directly or indirectly targeting messenger RNAs (mRNAs), and they are also known to be involved in biological processes that impact the formation, progression and treatment of various cancer types [5]. However, the functional roles of miRNAs and their combinatorial effects as regulatory molecules in cellular processes remain elusive [6,7]. Thus, extracting information of miRNA and mRNA relationships from primary tumors informs about the molecular pathogenesis in the underlying tissue and can provide a deeper understanding of the biological mechanisms of miRNAs in cancer. This helps to provide new strategies for further development and application in clinical settings in terms of early diagnosis and better treatment [8].
The evidence presented thus far, besides similarities in the molecular mechanism of different cancers from the same tissue and the highly correlated nature of genomic data [9], clearly demonstrates the need for establishing accurate computational methods to interpret the regulation of mRNAs by miRNAs in similar tumor tissues.

Previous Studies
Identifying interactions between miRNAs and mRNAs has been improved in recent years due to several proposed computational techniques [10][11][12]. Especially, probabilistic methods [13][14][15] led to reported miRNA-mRNA interactions in cancer. Recently, casual links between miRNAs and mRNAs have been reported [16,17]. However, these studies ignore the effectual common assumption of mRNA regulation by other mRNAs [18]. Hence, we formulated a regulatory module as a small subset of mRNAs correlated with each other, regulated directly or indirectly by a small subset of correlated miRNAs.
Previously, we established a single-task algorithm to identify miRNA-mRNA regulatory modules in cancer [19]. Here, we apply multitask learning [20,21], to improve the model's predictive performance and to extract biological relevant modules. Multitask learning improves the detection power to identify biomarkers for small sample sizes (i.e., various cancer cohorts) by inferring information from abundant data. Multitask learning has been applied successfully to explore the commonalities between cancer-related tasks and corresponding treatment. Examples of such studies include cancer drug susceptibility prediction [22], cancer survival analysis [23], cancer staging [24], diagnosis-specific genotype-phenotype identification [25], embedding multi-omics data and predicting phenotype profile [26], and identification of cancer drug response biomarkers [27].
In this project, we established a multitask learning sparse regularized factor regression (MSRFR) model to increase the power and consistency of biomarker identification by targeting sample size disparity in different cancer cohorts of the underlying tissue.

Our Contributions
MSRFR efficiently extracts tissue-and cohort-specific miRNA-mRNA regulatory modules of multiple cancer types from a similar origin (i.e., same tissue) using miRNA and mRNA expression profiles of primary tumors. MSRFR was able to simultaneously estimate the effective number of modules for each cancer type (cohort-specific) and for shared modules (i.e., tissue-specific overlaps between cohorts of the same origin), and extract regulatory modules by imposing a low-rank structure and by grouping correlated mechanisms. We applied our algorithm on three sets of cancer cohorts of the same tissue (i.e., blood, kidney, and lung).
The predictive performance of MSRFR and the percentage of regulatory modules with significant survival analysis identified by MSRFR was significantly higher than the single-task algorithm [19], which indicates the higher ability of the proposed model in extracting regulatory modules with biological importance. Moreover, the significance of tissue-specific regulatory modules in survival analysis suggests that our algorithm was able to identify miRNA-mRNA regulatory modules with biological functions in both, the underlying cohorts and the associated tissue. Enrichment analysis and literature validation of identified regulatory modules by MSRFR showed disease-associated and tissue-specific miRNA-mRNA signatures. MSRFR can also be customized to be applied on more than two cohorts from the same origin.

Datasets
In this work, we developed a predictive model that incorporates expression profiles and clinical phenotypes of multiple cancer cohorts into a unified learning framework to identify tissue-and cohort-specific miRNA-mRNA regulatory modules. We used Lymphoid neoplasm diffuse large B-cell lymphoma (DLBC), Acute myeloid leukemia (LAML), Kidney renal clear cell carcinoma (KIRC), Kidney renal papillary cell carcinoma (KIRP), Lung adenocarcinoma (LUAD), and Lung squamous cell carcinoma (LUSC) data sets which are publicly available by the Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov, accessed on 10 August 2022). TCGA provided genomic characterizations (including miRNA and mRNA expression profiles) and clinical information of cancer patients. Our computational analyses did not include metastatic tumors since their underlying biology is generally different to primary tumors.
For each cohort, we extracted "BCGSC miRNA Profiling" files for miRNA expression profiles of all primary tumors, which are preprocessed using the unified miRNASeq pipeline of TCGA. We also extracted "HTSeq-FPKM" files for mRNA expression profiles of all primary tumors, which are preprocessed using the unified RNASeq pipeline of TCGA. Since not all patients had both miRNA and mRNA expression profiles, we eliminated samples with only one available expression profile and used the primary tumors only with matched miRNA and mRNA expression profiles. We filtered miRNAs and mRNAs by discarding those that were expressed in less than 50% of the tumors from the analysis. Moreover, we considered matched miRNAs and mRNAs of both cohorts incorporated in each analysis. Hence, the number of miRNAs and mRNAs included in each cancer type reduced to around 500 miRNAs and 17,000 mRNAs on average.
TCGA also provided "Clinical Supplement" files of all patients. To evaluate the biological relevance of the identified miRNA-mRNA regulatory modules, we performed survival analysis in our experiments using extracted survival characteristics of patients (i.e., days to last follow-up for alive patients and days to death for deceased patients).

Problem Definition
We developed a machine learning approach that utilizes expression levels of miRNAs and mRNAs to identify highly correlated miRNA-mRNA regulatory modules in primary tumors to determine similarities between multiple cancer cohorts of the same origin and to uncover tissue-specific signatures. In this study, we are given K cohorts indexed by k. For each cohort, we are given a training set , which contains miRNA and mRNA levels of N k tumors, where x ik ∈ R D and y ik ∈ R T denote miRNA and mRNA expression profiles of tumor i for cancer k, respectively. All symbols used in our model (in Section 2.3) are described in Table 1.

Method
It is believed that the relationship between miRNAs and mRNAs is sparse [28]. Hence, we formulated the proposed problem as a linear factor multivariate regression model with a low-rank structure on the coefficient matrix to support this assumption. For a single cancer cohort, this formulation can be shown as follows: where W X ∈ R D×R projects D-dimensional miRNA profiles into a R-dimensional space, W Y ∈ R R×T performs linear regression in this R-dimensional projected space, E ∈ R D×T is the matrix of error terms, and W ∈ R D×T is the coefficient matrix of regression model. In addition to W X and W Y matrices, the dimensionality of the projected space (i.e., R) must to be estimated in the learning process. With this low-rank assumption, instead of learning W matrix, we attempted to learn W X and W Y matrices. By doing so, we were able to reduce the number of parameters that needed to be learned. Furthermore, capability of converting W X and W Y matrices into a summation of rank-one matrices is the other major incentive of this low-rank assumption. Each of these rank-one matrices will be considered as a distinct miRNA-mRNA regulatory module. We explored a multitask variant of multivariate regression between miRNA and mRNA expression profiles of multiple cancer cohorts under the assumption that the columns of X and Y are centered (i.e., columns with zero mean) and normalized (i.e., columns with unit standard deviation). The error terms in E assumed to be independent and identically distributed Gaussian random variables with zero mean and σ 2 variance.
Considering that the total number of responses (mRNAs) and predictors (miRNAs) are much larger than N, but the number of important factors is typically smaller than N, it is a credible assumption that the relationship between predictors and responses is sparse. To fit the model on such data, regularized or penalized methods are needed to perform dimensionality reduction and feature extraction. Moreover, we intend to extract similarities of regulatory modules in different tasks (cancer cohorts) which will be interpreted as tissuespecific miRNA-mRNA regulatory modules. To capture this similarity, a multitask learning formulation needs to be applied. Consequently, we proposed MSRFR to find tissue-and cohort-specific miRNA-mRNA regulatory modules of multiple cancer cohorts from their miRNA and mRNA expression profiles and estimate the effective number of regulatory modules as follows: where {W X S , W Y S } and {W X k , W Y k } K k=1 are the model parameters that infer the tissuespecific and K cohort-specific regulatory modules, respectively. Y k and X k denotes mRNA and miRNA expression profiles of cohort k, respectively, {λ 1 , λ 2 } ∈ R + are the user-defined regularization parameters of the elastic net penalty on W X S and {W X k } K k=1 matrices, likewise, {λ 3 , λ 4 } ∈ R + are regularization parameters on W Y S and {W Y k } K k=1 matrices, to restrict the search space by enforcing the sparsity structure on the variables based on the input data size. R s and {R k } K k=1 are the number of shared regulatory modules (i.e., tissue-specific) and cohort-specific regulatory modules of cancer k, respectively. Note that number of regulatory modules are chosen a priori before optimization such that 1 ≤ R s , where parameter R U is the problem-specific upper bound for the dimensionality of the projected space.
By imposing elastic penalty which linearly combines 1 and Frobenius norms, as regularization function, we expect two highly correlated features both exist or both absent in a factor. In other words, elastic net penalty empowered our model to group correlated miRNAs together and correlated mRNAs together, besides inducing the sparse structure. In addition to accurately extract all cancers' miRNA-mRNA regulatory modules separately by inferring information from all data sets, proposed multitask learning formulation that uses shared parameters between all tasks (W X S and W Y S ), which enabled our model to extract joint regulatory modules from all cohorts to be interpreted as tissue-specific miRNA-mRNA regulatory modules. The overall view of the developed framework for two cohorts (i.e., K = 2) is demonstrated in Figure 1. To solve the regularized model, as well as to find the number of effective latent factors (i.e., number of regulatory modules), an alternating optimization algorithm was proposed.
Selecting a large or small number of latent factors (i.e., R s , {R k } K k=1 values) would lead to overfitting or underfitting, respectively. To avoid this, we applied the mechanism proposed by [29] that guarantees identifying linearly independent modules and learns how many independent latent factors are needed to explain the data. Since problem (2) is non-convex, it is not expected to find the exact solution in a reasonable time. However, with predefined R s and This attribute enabled us to apply a heuristic algorithm using a gradient descent method.
To solve problem (2) with predefined R s and {R k } K k=1 values, we performed Algorithm A1 with a random initial values of decision variables. We determined the stopping criterion of the algorithm according to the objective function of the optimization problem (2). We assumed optimization problem (2) and f (t) are the objective function values of problem (2) in the last two iterations.
Our algorithm starts by fixing the number of latent factors with an initial upper bound (i.e., R s = {R k } K k=1 = R U ). Problem (2) is not jointly convex with respect to all variables, but if we fix {W X S , {W X k } K k=1 }, it will be convex with respect to {W Y S , {W Y k } K k=1 } or vice versa. After converting optimization problem (2) to a convex problem by fixing one set of variables, the algorithm starts solving it using an alternating optimization strategy. After convergence, the algorithm checks whether all variable matrices are full rank. If there was any matrix that is not full rank, the algorithm reduces the value of related matrices ranks by one and solves the optimization problem (2) again. For instance, after convergence if rank(W X i ) or rank(W Yi ) < R i , ∀i ∈ {S, 1, 2, ..., K}, then the algorithm reduces R i by one and starts from the first step. The algorithm is guaranteed to achieve full rank matrices (i.e., , at the termination.  Figure 1. Overview of the developed framework for identifying key tissue-specific and cohort-specific regulatory modules of two cancer types from the same origin. (a) Input data that includes miRNA and mRNA expression profiles of two cancer cohorts. (b) Imposing low-dimensional structure and multitask learning formulation simultaneously to identify cohort-specific regulatory modules of two cancer types and shared regulatory modules between them (i.e., tissue-specific), as well as effective estimation of the number of each regulatory module (i.e., R 1 , R 2 , and R s ) using MSRFR model. (c) Writing the low-rank matrices as the summation of R 1 , R 2 , and R s rank-one matrices such that each of them corresponds to one miRNA-mRNA regulatory module of first cancer cohort, second cancer cohort, and tissue-specific regulatory modules. (d) Filtering identified regulatory modules to test for key modules with biological relevance and functional importance by applying functional survival and gene set enrichment analyses. (e) Validating the result of the experiment, first by comparing the result of the study against cross-cohort and healthy sample result to examine the uniqueness of identified modules, then by literature validation of identified miRNAs in cohortspecific regulatory modules to see whether they are related to the underlying disease, and finally by assessing the transcription factor expression in identified modules.
To update variables in each iteration, a gradient descent approach was performed, and to accelerate the convergence of gradient descent, we applied Prox-Linear update [30]. To simplify the notation of update steps, k index refers to all k ∈ {1, . . . , K} for the following steps, and we considered the following notation: The prox-Linear update functions defined as follows: where S is the soft-thresholding function, such that S τ (ν) = sign(ν) × max(|ν| − τ, 0), and are the derivatives of the objective value of problem (2) k are the multipliers that have to be greater than the Lipschitz constant of g(W X S , W Y S ) and g(W X k , W Y k ), respectively. According to the Equations (A1)-(A4) in Appendix B, the multipliers can be set as follows: Our optimization strategy is described in Algorithm A1 with more details The algorithm's pseudocode is presented in Appendix C).
For the sake of simplicity, we refer to W X S and {W X k } K k=1 as W X and refer to W Y S and {W Y k } K k=1 as W Y . Similarly, we also refer to R S and {R k } K k=1 as R. After finding W X and W Y matrices using the proposed algorithm, we need to extract key miRNA-mRNA regulatory modules. We first determined weights of each identified regulatory module. Hence, we normalized each row of W X and each column of W Y to unit norm, in order to set a unified scale for all of the regulatory modules. For this purpose, instead of our initial decomposition assumption (i.e., W ≈ W X W Y ), for each tissue-and cohort-specific regulatory modules we obtained the low rank decomposition The diagonal entries of D X and D Y are used for assigning the importance weight of each underlying regulatory module. To determine weights of regulatory modules, we sorted the rows of W X and the columns of W Y from the largest one to the smallest one. The regulatory module with the highest importance (i.e., the first regulatory module) refers to the one that corresponds to the highest diagonal entry. The first regulatory module is considered in biological relevance analyses (Section 3), since it reflects the most considerable portion of knowledge.
Similar to our previous strategy [12], we detected miRNAs (mRNAs) for each module as follows: A miRNA (mRNA) is considered to be selected if the magnitude of its weight is larger than two over square root of the total number of miRNAs (mRNAs) included.
To pick the key regulatory modules with biological relevance among all identified regulatory modules, we filtered the regulatory modules by performing functional survival and functional gene set enrichment analyses [31].
First, to check if the MSRFR-identified mRNAs are related to the survival rate of all patients, we classified patients into two groups using k-means clustering based on the mRNA expression profiles. We then examined survival rates in the clinical parameters by checking whether there is a statistically significant difference between the two groups of patients using the log-rank test. Significant survival difference refers to a module with different mRNA expression levels of patients of the underlying cohort. The main motivation of this process was exploring the functional importance of identified regulatory modules.
In the second step of filtering, regulatory modules that seem to have biological relevance from filtering step one were further evaluated in functional gene set enrichment analysis [31].
A regulatory module is categorized as a key regulatory module if "survival analysis of the corresponding module reports a significant difference between patient groups" and "underlying module is enriched in either tissue-specific or cohort-specific gene sets".
Following the biological validation strategy, we attempted to investigate the identifying transcription factors (TFs) in miRNA-mRNA signatures. To determine whether or not TF regulation is affected by cancer-specific miRNA-mRNA signatures, we assessed the currently known human TFs [32] in the regulatory modules. We performed 10,000× permutation analysis to estimate the distribution for the number of TFs in a certain number of random genes and compare it with the number of TFs in the selected mRNAs by MSRFR (Section 3.6).

Experimental Setting
To test the applicability of our MSFRF model, we applied it on three pairs of cancer cohorts where each originated from the same tissue (blood, kidney, lung).
We trained the algorithm with all combinations of predefined parameters using fourfold cross-validation. Then, we calculated the average root mean square error (RMSE) of four folds between predicted and actual mRNA expression levels for all parameter combinations. For each experiment, we picked the set of parameters with the minimum average RMSE. Finally, the parameters of the three experiments are set as follows: 15,300,30). We considered R U = 10 as the upper bound on the number of latent factors in all cohorts of experiments. Thus, due to the number of independent factors that can explain each cohort/tissue, MSRFR identifies 10 or less than 10 regulatory modules. We also set the stopping criteria parameter to 10 −4 , and the maximum number of iterations to 10 3 . Since miRNA and mRNA expression profiles of primary tumors are count data and can take only non-negative values, we applied log 2 -transformation before feeding them to the algorithm. We implemented our algorithm in R.

Predictive Performance Comparison
We tested the performance of MSRFR by calculating the normalized root mean squared error (NRMSE) of our algorithm between observed and predicted values of mRNA expression levels against the single-task algorithm [19] using where k indexes the cancer cohorts.
The predictive performance comparison showed that MSRFR algorithm explained a higher proportion of variance than the single-task algorithm in five out of six cohorts using fewer regulatory modules ( Table 2). By applying MSRFR in three experiments to cancer cohorts of blood, kidney, and lung, we identified 75 regulatory modules in total (12 cohort-specific and 10 tissue-specific modules for blood, 14 cohort-specific and 10 tissue-specific modules for kidney, and 19 cohort-specific and 10 tissue-specific modules for lung). Figure 2a,b show examples for selected top 10 miRNAs and top 50 mRNAs identified in the first regulatory modules of KIRC cohort and kidney tissue, respectively. The first regulatory module refers to the extraction of the most significant result from each cohort/tissue. Selected miRNAs and mRNAs for all regulatory modules identified by MSRFR are presented in Supplementary Table S1.  NKTR  AP1G2  IRF9   ATG16L2  TUBGCP6  MAN2C1  DNHD1  ZNF276  RBM6  TTLL3  SEC31B  UBAP1L  AHSA2  UVSSA  LY6G5B  HYPK  RBM33  RBM5  ZNF169  L3MBTL1  DNAH1  MSH5  OGT  FNBP4  METTL3  ENGASE  CLK3  MAPK8IP3  ARGLU1  MDM4  MZF1  RNPC3  CCDC84  KIAA0895L  AGAP6  RSRP1  ORAOV1  TAF1C  TRIM66  SNRNP70  PRPF3  NFATC2IP  UNKL  XRCC3  DENND4B  CCNL2 Figure 2. Heat map of top 10 miRNAs and top 50 mRNAs of as (a) cohort-specific regulatory module identified in KIRC and (b) tissue-specific regulatory module identified for kidney, clustered into two groups of patients using k-means clustering on mRNA expression values. Red colors indicate over-expression (i.e., higher than the population mean), and blue colors indicate lower expression (i.e., lower than the population mean). The font sizes of miRNAs and mRNAs are proportional to the magnitudes of their weights inferred by our algorithm.

Functional Survival Analysis of Identified Regulatory Modules
To assess the findings of our MSRFR algorithm, we compared results between different cohorts and tissues. By including the clinical parameters (i.e., survival of patients), we clustered patients into two groups by applying k-means clustering on the expression values of the selected mRNAs. Next, we acquired the expression values of selected miRNAs and mRNAs and used the expression profiles of both cancers in each tissue to cluster the patients and to address their survival rates (Figure 2a,b). Significant differences were determined using the log-rank test.
For tissue-specific analysis, we selected the overlap of regulatory modules with significant differences in both cancer cohorts of the same tissue. Interestingly, we find that the differences in expression levels of the identified miRNA-mRNA signatures seem to relate to survival differences in the different cancer patient groups (Figure 3). These figures indicated that the identified regulatory module is highly effective in capturing the biological mechanism of both cohorts, and it is associated with kidney tissue. In 27 of 75 identified regulatory modules in total (36%), we observed a significant survival difference between the two groups (i.e., p-value < 0.05 in the log-rank test). To demonstrate the detection power of clinical characterization using the proposed method, we compared the percentage of regulatory modules with significant survival differences in this study against another algorithm [19]. Table 3 shows all identified regulatory modules and those with significant survival differences by MSRFR model and our recently reported single-task algorithm [19]. The percentage of significant survival analysis is higher in MSRFR algorithm, even though, in MSRFR we have two filters for survival analysis of tissue-specific regulatory modules using the expression value and clinical information of both underlying cohorts.

Tissue-Specificity and Disease Association of Key Regulatory Modules
We next performed enrichment analysis of selected mRNAs in the first regulatory modules by using cell type signatures [33] and disease association [34] to address if the underlying tissues in tissue-specific regulatory modules and signatures that are cohortspecific can be found. Of note, we observed specific cell type signatures in all tissue-specific modules that related to the underlying tissues ( Figure 4A-C) Figure 4. Summary of enrichment analysis in cell types of tissue-specific regulatory modules of (A) blood, (B) kidney and (C) lung. Examples of enrichment analysis of tissue-specific regulatory modules of (D) KIRP and (E) LUAD in DisGeNET. Red arrows in (A-C) depict cell-type and tissue-related terms and disease-relevant terms in (D,E).
Regarding disease association of cohort-specific regulatory modules, we found disease associated terms that are relevant to the cohort-specific regulatory modules ( Figure 4D,E), thereby validating that MSRFR can specifically determine tissue-specific and cohort-specific miRNA-mRNA signatures.

Comparing miRNA-mRNA Signatures in Cross-Cohort Combinations and with Healthy Samples
Next, we examined the uniqueness of identified modules by applying the MSRFR algorithm on a total of 12 combinations of cohorts that did not originate from the same tissue (DLBC-KIRC, DLBC-KIRP, DLBC-LUAD, DLBC-LUSC, LAML-KIRC, LAML-KIRP, LAML-LUAD, LAML-LUSC, KIRC-LUAD, KIRC-LUSC, KIRP-LUAD, and KIRP-LUSC). We found 118 tissue-specific modules and 220 cohort-specific modules in a total of 12 experiments. Of note, only 4/118 (3.39%) of tissue-specific regulatory modules presented significant survival differences, while 64/220 (29.10%) of cohort-specific regulatory modules are significant in survival analysis ( Figure 5A). The ratio of significant survival differences in the related cohort experiments was 10/30 (33.33%) and 17/45 (37.78%) for tissue-specific and cohort-specific regulatory modules, respectively (Table 3). Accordingly, survival analysis based on clinical information in non-relevant cohort pairs was significantly lower than identified regulatory modules with significant survival differences in tissue-specific regulatory modules. However, there is no significant difference in cohort-specific regulatory modules ( Figure 5B). These findings indicate that MSRFR identifies tissue-specific regulatory modules with biological relevance using cohorts from the same tissue rather than miRNA-mRNA signatures from non-related tissues.
We conclude that MSRFR increases the detection power of the model by multitask formulation, enhances the capacity of the model in grouping genes that participate in the same processes while extracting fewer modules, effectively determines biologically related miRNA-mRNA regulatory modules by inferring information from other tasks and it distinguishes tissue-and cohort-specific regulatory modules from each other to extract tissue-specific information from different cohorts of disease-related tissue.
We also applied MSRFR on healthy samples of kidney and lung that are deposited in TCGA and compared the findings to the two cohort-specific and their shared tissue-specific first regulatory modules. We found marginal overlaps between healthy and disease samples ( Figure 6), indicating that MSRFR effectively determines miRNA-mRNA signatures related to primary tumors.

Literature Validation of Identified miRNA-mRNA Signatures
We investigated the relevance of the identified cohort-specific regulatory modules underlying cancer using a literature survey validation by checking whether identified miRNAs are important for formation or progression of the corresponding cancer type. For 68 selected miRNAs out of 161 identified miRNAs (42.24%) in the first regulatory module of six different cohorts, we found published records in PubMed. The ratio of selected miRNAs with literature support in LUAD and LUSC cohorts was higher than in the other cohorts ( Figure 7). Detailed information on all 68 miRNAs is reported in Supplementary Table S2, indicating that MSRFR effectively determines cancer-related miRNAs in cohort-specific regulatory modules. Moreover, to see if identified miRNA-mRNA signatures are known in the literature (PubMed), we investigated all possible duplex combinations of the top 2 and top 10 selected miRNAs and mRNAs of each cohort/tissue. Specifically, we looked for direct or indirect associations as biomarkers in the formation, progression, or treatment of cancer. For example, mir-152 and mir-30e, the top two selected miRNAs by MSRFR for LUSC cohort, have been shown to improve the non-invasive diagnosis of renal cell carcinoma [35]. mir-142 and APAF1 as target gene are proposed as a promising non-invasive diagnostic biomarker of hepatocellular carcinoma [36], which have been detected by MSRFR for kidney tissue. ASXL2 and BPTF were suggested for a potential therapeutic approach for human diseases [37], and MSRFR identified both in lung tissue. In total, we were able to validate 30 records in PubMed, including miRNA-miRNA, miRNA-mRNA, and mRNA-mRNA interactions, which are listed in Supplementary Table S3.

Abundance of Transcription Factor mRNAs in miRNA-mRNA Signatures
To check if TF expression could be affected in the investigated primary tumors and their specific miRNA-mRNA signatures, we assessed the currently known 1639 human TFs [32] in our nine regulatory modules (first regulatory module in each cohort/tissue). MSRFR found 1030 mRNAs on average in these nine regulatory modules (Supplementary Table S1).
To test if the number of overlapping TFs is meaningful, we developed a permutation test and compared our result with random sets of genes. To this end, we randomly picked 1030 genes 10,000× and compared the percentage of TFs among random experiments and MSRFR results. The maximum average of randomly identified TFs was 10.49%. In 6 out of 9 cohorts/tissues the average number of TFs identified by MSRFR deterministically dominates the average number of TFs in random genes ( Figure 8A). We also found that the difference in the median of TFs in random genes and selected mRNAs by MSRFR is significantly greater (Mann-Whitney test p-value = 0.004) than in random sets ( Figure 8B).
The detailed information on detected TFs in regulatory modules are listed in Supplementary Table S4.

Discussion
Understanding the underlying biological mechanisms of primary tumors is crucial for predicting how tumors respond to therapies. miRNAs' interactions with mRNAs have a major effect on many biological processes that are important in the formation and progression of cancer. Therefore, identifying both cohort-and tissue-specific miRNA-mRNA regulatory modules of cancers have received considerable interest due to its importance in cancer biology.
This study introduces a pipeline to extract tissue-and cohort-specific miRNA-mRNA regulatory modules of multiple cancer types from the same origin using matched miRNA and mRNA expression profiles of primary tumors. We generated a multitask sparse regularized factor regression model which was able to successfully extract and distinguish tissue-and cohort-specific regulatory modules and estimate the effective numbers of cohortspecific and tissue-specific regulatory modules.
Out of all six considered cohorts, MSRFR model outperformed single-task regression method in 5/6 cohorts (see Table 2). We were able to identify mRNAs that are related to tissue type and that were enriched in disease-relevant terms. The identified miRNAs were also reported in the investigated primary tumors, and finally, TFs in the determined miRNA-mRNA signatures indicate their strong involvement in pathogenesis. The sets of experiments with cohorts from unrelated tissues to the phenotype showed marginal overlaps. This indicates that MSRFR determines significant differences between cohorts and identifies tissue-specific modules from the same tissue.
Collectively, these results show that the proposed model is highly effective in identifying key miRNA-mRNA regulatory modules and distinguishing cohort-specific and tissue-specific regulatory modules.
There is an abundant room for further progress in determining similarities in molecular patho-mechanisms. Extensions of this study can be applied to investigate other diseases where similar primary cells are involved to find cohort-specific mechanisms together with the mechanisms shared among underlying conditions. Moreover, further work is required to establish complementary research on other RNAs in the non-coding genome, such as long non-coding RNAs, to decode their functional similarities in different conditions.  Data Availability Statement: Publicly available data sets were analyzed in this study. These data sets can be found at https://portal.gdc.cancer.gov, accessed on 10 August 2022. Our implementation of proposed MSRFR algorithm in R is available at https://github.com/MiladMokhtaridoost/MSRFR, accessed on 10 August 2022. Gene set enrichment analysis was conducted using publicly available Metascape web tools at https://metascape.org, accessed on 10 August 2022.

Acknowledgments:
The results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga, accessed on 10 August 2022.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: