Network-Based and Machine-Learning Approaches Identify Diagnostic and Prognostic Models for EMT-Type Gastric Tumors

The microsatellite stable/epithelial-mesenchymal transition (MSS/EMT) subtype of gastric cancer represents a highly aggressive class of tumors associated with low rates of survival and considerably high probabilities of recurrence. In the era of precision medicine, the accurate and prompt diagnosis of tumors of this subtype is of vital importance. In this study, we used Weighted Gene Co-expression Network Analysis (WGCNA) to identify a differentially expressed co-expression module of mRNAs in EMT-type gastric tumors. Using network analysis and linear discriminant analysis, we identified mRNA motifs and microRNA-based models with strong prognostic and diagnostic relevance: three models comprised of (i) the microRNAs miR-199a-5p and miR-141-3p, (ii) EVC/EVC2/GLI3, and (iii) PDE2A/GUCY1A1/GUCY1B1 gene expression profiles distinguish EMT-type tumors from other gastric tumors with high accuracy (Area Under the Receiver Operating Characteristic Curve (AUC) = 0.995, AUC = 0.9742, and AUC = 0.9717; respectively). Additionally, the DMD/ITGA1/CAV1 motif was identified as the top motif with consistent relevance to prognosis (hazard ratio > 3). Molecular functions of the members of the identified models highlight the central roles of MAPK, Hh, and cGMP/cAMP signaling in the pathology of the EMT subtype of gastric cancer and underscore their potential utility in precision therapeutic approaches.


Introduction
Gastric cancer (GC) is one of the most common malignancies with extreme interand intra-tumoral heterogeneity [1,2]. With more than a million new cases each year and approximately 769,000 deaths in 2020, it comprises one of the leading causes of cancerrelated deaths worldwide [3]. Despite its substantial burden, little progress has been made regarding the development of effective therapeutic interventions for GC patients [4]. This reflects the inability of the conventional one-size-fits-all diagnostic/therapeutic approaches for combatting such a heterogeneous disease.
Fortunately, in recent decades, various classifications with either histologic [5] or molecular [6] bases have been developed for this malignancy. These classification systems guide the development of disease management strategies that are tailor-made for specific subtypes of GC. In comparison with histologic classifications, molecular classifications display a wider association with tumor heterogeneity and patient prognosis, suggesting their broader utility in the clinical setting [7]. One of the major molecular classifications of stomach cancer was developed based on the mRNA expression data of gastric tumors almost a decade ago by the Asian Cancer Research Group (ACRG) [8]. This classification stratifies gastric tumors into four subtypes, namely (i) microsatellite instability (MSI), (ii) microsatellite stable/epithelial-mesenchymal transition (MSS/EMT; EMT for short), (iii) microsatellite stable/TP53+, and (iv) microsatellite stable/TP53−. Among these, the EMT subtype is associated with significantly poorer overall survival and a higher chance of recurrence, possibly demanding a more aggressive treatment approach [8][9][10].
Despite the obvious benefits of tumor classifications, the substantial costs of the current experimental approaches required for patient stratification impede the clinical translation of these subtypes, underscoring the necessity of the development of practical biomarkers for disease/patient management [7]. Specifically, considering the aggressive nature of the EMT-type tumors, exploration of the molecular landscape of these tumors and the development of practical means for the stratification of patients into EMT and non-EMT cases is of substantial interest. In this line, Lee at el. [9] developed a NanoString-based 71gene signature assay that can potentially be used for diagnostic/prognostic purposes in the clinical setting. Nevertheless, there is still room for reductions in the costs and availability of patient stratification approaches, and the underlying biology of the phenotypes observed in patients with EMT-type tumors remains elusive.
In this study, we established the EMT GC subtype, proposed by the ACRG, as the subtype with the most distinct transcriptomic landscape and moved on to identify some of the core elements involved in the pathology of this subtype through the combination of co-expression module discovery and motif extraction approaches. These elements were further explored in terms of their clinical utility, and the most potent candidates with diagnostic and prognostic relevance were identified and discussed. The pipeline designed for this study appears to be robust for the identification of central regulators of biological phenomena and can readily be employed in other similar contexts. Moreover, the top motifs identified represent potent candidates for further validation to be used as affordable means for the stratification of GC patients in the clinical setting.

Datasets
We retrieved RNA-seq and miRNA-seq raw counts from treatment-naïve adenocarcinomas of The Cancer Genome Atlas-STomach ADenocarcinoma (TCGA-STAD) cohort (n = 316; only the samples that were not flagged as low quality were retrieved) using the Genomic Data Commons (GDC) data portal [11] and microarray data from the ACRG cohort (n = 300) and the Singapore cohort (n = 192) via the Gene Expression Omnibus (accession numbers GSE62254 and GSE15459). The clinical information for the analyzed samples is available in the Supplementary Table S1. The distribution of the clinical information within each subtype for all three cohorts is also presented in Table 1. Since not all of the 316 TCGA samples possessed all the required data categories for the different steps of the analysis (e.g., survival data, ACRG classification, etc.), for each specific step of the study, only the subset of the original cohort that included all data modalities relevant to that step was utilized. Tumors from all three cohorts have been previously classified into the four ACRG-based molecular subtypes [8]. The same classification was used in this study. This reduced the samples with classifications for the TCGA to a total of 167 samples (MSI = 37; EMT = 47; TP53+ = 42; TP53− = 41). In the ACRG cohort, three samples (#369, #533, and #542) were removed since they were identified as outliers based on the Principal Component Analysis (PCA) of the log2 transformed intensities (total: 297; EMT = 46; MSI = 68; TP53+ = 77; TP53− = 106). The RNA-seq data for gastric tumors and paired normal gastric tissues were also retrieved from GSE184336 for tumor vs normal comparisons.

Data Analysis and Visualization
R version 4.1.1 and Cytoscape version 3.9.0 were used for the statistical and networkbased analysis of the data and visualization of the results. Differential gene expression analysis was carried out using the DESeq2 R package [12], which uses negative binomial generalized linear models for the identification of the differentially expressed genes. Venn diagrams were constructed using the VennDiagram package and PCA was carried out using the prcomp function in R.

Evaluation of ACRG Subtypes
Enrichment analysis of the TCGA tumor samples classified into the four distinct subtypes in comparison to the normal samples was carried out using the Hallmark gene sets of the Gene Set Enrichment Analysis (GSEA) desktop application version 4.1.0 [13]. GSEA is one of the most popular methods from the second generation of enrichment analysis techniques. This method ranks genes based on the correlation of their expression levels with the phenotype under investigation and calculates an enrichment score for each predefined gene set (in this case, the gene sets in the Hallmark collection of the GSEA) based on the aggregation of the members of these sets at the top or the bottom of the ranked list of genes. Identification of the top modules of the differentially expressed genes for each subtype was conducted using the greedy search algorithm of the jActiveModules plug-in in the Cytoscape [14].

Weighted Gene Co-Expression Network Analysis and Motif Identification
Co-expression modules are, in essence, clusters of genes that present a coordinated variation in their expression levels across samples, and they potentially represent groups of genes with related functions regulated by the same transcriptional program. The interpretation of these modules within specific biological contexts can reveal novel insights regarding how specific functions/phenotypes are regulated [15]. Here, the identification of co-expression modules was performed using the Weighted Gene Co-expression Network Analysis (WGCNA) algorithm [16]. WGCNA first constructs an adjacency matrix by applying a hard or soft thresholding procedure on the co-expression similarity measurements between each pair of genes and then utilizes a clustering approach for the identification of the co-expression modules. In this study, the co-expression module discovery was carried out with the following parameters: a signed topological overlap matrix was used, the minimum module size was set to 20, the optimum soft threshold was identified as 20 using the scale independence and mean connectivity plots, and the dendrogram cut height for module merging was set to 0.25. The significance of the modules was determined by taking the average of the −log10(adj. p-value) of the differential expression of their members in the EMT samples compared to the pooled samples of the other subtypes (Wald test; corrected for multiple hypothesis testing by the Benjamini-Hochberg method).
Motifs in protein-protein interaction (PPI) networks are small subgraphs that occur much more often than is expected by chance. Alterations in the activity and expression levels of these regulatory units are a common observation in pathological states such as cancer [17]. In this context, the identified top module was further queried for biologically relevant regulatory subunits through the utilization of motif identification approaches. The PPI data were retrieved from the STRING database version 11.5 [18], and the NetMatchStar plug-in in the Cytoscape [19] was used to identify triangle motifs with three nodes and three edges. The choice of the triangle motifs was based on the high frequency with which they are observed in the biological systems and the fact that many larger motifs are comprised of multiple triangle motifs [20].
A modified version of the multi-objective scoring function used in [21,22] was used for motif scoring: where W stands for the weight, i is any given motif, j is any one of the weighting scenarios (all of the 13 utilized weighting scenarios are available in the Supplementary Table S2), ND is the mean of the node degree of each of the motif members, BC is the mean betweenness centrality, DP is the number of the nodes in a given motif that are members of the pathways in the cancer KEGG pathway (hsa05200), AUC is the mean area under the ROC curve, and the LFC is the mean absolute log2 fold change of the expression of the nodes in a motif in the EMT subtype in comparison to the pooled samples of the other subtypes. The 'max (parameter)' denotes the maximum value of each parameter achieved by a motif.

Assessment of Diagnostic and Prognostic Values of the RNAs
Survival analysis was performed using the survival and survminer packages in R. The TCGA RNA-seq data for 288 solid tumor samples with appropriate clinical information based on the criteria used by Anaya [23] were subjected to Variance Stabilizing Transformation (VST), and the ACRG microarray data were Robust Multichip Average (RMA)-normalized prior to the survival analysis.
The top and bottom 40% of the samples (based on the expression of the gene under investigation) were used as the high-expression and low-expression groups, respectively. As for the motifs, the intersection of the samples in the top/bottom 40% based on the expression of each motif member was used to form the high-expression and low-expression groups. The age and sex of the patients were used as covariates in the Cox regression analysis in order to account for their possible confounding effects. Due to the inclusion of samples that exhibited concordant high/low expression of all of the motif members in each analysis, a varying number of samples were analyzed for each motif. Considering this, only motifs with at least 30 samples in each group (high-and low-expression groups) and a total of at least 100 samples were selected for further examination. Among these, we specifically looked for motifs that were consistently present among the top five motifs of both cohorts (based on their Hazard Ratio [HR]).
The glm built-in function in R was used for the logistic regression analysis. Since quantile normalization was found to be an excellent method for making the microarray and RNA-seq data comparable for machine learning applications [24], the raw counts and intensities for TCGA and ACRG samples were pooled, log2 transformed, and quantile normalized prior to logistic regression analysis. After normalization, the TCGA and ACRG samples were again separated, and the regression models for discrimination between tumor subtypes were first fitted to the TCGA data and then validated on the ACRG data. To assess the robustness of the models, their performance on the independently quantile normalized data of the samples from the Singapore cohort was also evaluated. The ability of the motifs to distinguish tumors from normal samples was also assessed by fitting a model to the TCGA RNA-seq data for both STAD solid tumors (n = 316) and the available adjacent normal tissue samples from the gastric cancer patients in the TCGA-STAD cohort (n = 30; cases for which adjacent normal tissue samples were available are distinguished with bold script in the Supplementary Table S1) after VST normalization. The same method was also applied to the GSE184336 dataset (with 70% of the samples as the training set and the remaining samples as the validation set) for independent validation of the capacity of the motifs for discrimination between normal and tumor samples.
Multi-candidate miRNA combinations capable of discriminating EMT-type tumors from other subtypes were identified using the linear discriminant analysis (LDA) with leave-one-out cross-validation, using the method described in [25]. Eighty percent of the samples were allocated to the training set for this analysis and the remaining samples were used for validation. The validated mRNA targets of the differentially expressed miRNAs were obtained using the multiMiR library in R [26].

MiRNA-mRNA Network Construction
The miRNA-mRNA network was constructed in R using the PPI interaction information from STRING and the validated miRNA-target interactions obtained from multiMiR. Twenty-three centrality measures were calculated for the network using the igraph and centiserve [27,28] packages in R. PCA was used to identify the most suitable centrality measure among these 23 centrality measures based on the structure of the network, using the method described in Ashtiani et al. [29]. The final network was visualized using Cytoscape.

EMT-Type Gastric Cancer Displays a Distinct Transcriptional Profile
In order to assess the transcriptional rewiring of the tumors in different ACRG subtypes, we performed a set of exploratory analyses on 167 TCGA samples classified into four distinct subtypes (MSI, EMT, TP53+, and TP53−) [8]. GSEA has shown that EMT-type tumors did indeed exhibit hallmarks of epithelial-mesenchymal transition (False Discovery Rate (FDR) = 0.038) and angiogenesis (FDR = 0.047) as their top enrichment signals. Other subtypes, however, have consistently shown G2M checkpoint and E2F/MYC targets as their top enrichment results (FDR < 0.05) (Supplementary Figure S1). This suggests a more profound difference in the transcriptional rewiring of EMT-type tumors compared to other subtypes.
Next, we reconstructed PPI networks, highlighting interactions among the differentially expressed genes in each subtype compared to normal samples (adjusted p-value ≤ 0.05, absolute LFC ≥ 3). We then identified and compared the top-scoring modules of the different subtypes based on the greedy algorithm of the jActiveModules Cytoscape plug-in. Considerable overlap between the top modules of MSI, TP53+, and TP53− subtypes was observed, yet the top module of the EMT subtype did not share any genes with the other subtypes (Supplementary Figure S2).
Finally, the results of the PCA on the complete expression matrices of TCGA tumors revealed that the samples belonging to the EMT subtype are roughly distinguished in PC1; this is while no tangible difference can be observed between the other three subtypes (Supplementary Figure S3). In accordance with our observations in the TCGA samples, similar results were also observed in the PCA of the ACRG samples (Supplementary Figure S3).
Overall, these results indicated that the samples belonging to the EMT subtype display the most distinct transcriptional profile among all the ACRG subtypes.

WGCNA and Motif Ranking Identify 39 Core mRNA Motifs
In order to find robust prognostic/diagnostic RNA markers, we sought to take advantage of co-expression module and motif identification approaches to identify core RNA regulators of EMT-type tumors. The workflow implemented for the identification of these RNAs is shown in Figure 1A. Fourteen co-expression modules with varying numbers of genes were identified by applying WGCNA on the expression data of the 47 EMT-type tumors in the TCGA cohort. A list of members of each module is provided in Supplementary Table S3. We used the negative logarithm of each gene's adjusted p-value, after differential expression analysis between EMT-type samples and other subtypes, as the criterion for gene significance. Using this criterion, the module with the most significant average differential expression was designated as the "EMT" module and the members of this module were selected for further investigation ( Figure 1B). Since a high level of module membership indicates that the expression level of a gene is an adequate proxy for the general behavior of a module, the label for the rest of the modules was based on the gene with the highest level of module membership in that module. The association of the eigengenes of each module with clinical parameters (gender, age at diagnosis, pathological stage, TNM stages, and the tissue of origin) was also assessed ( Figure 1C). There is a significant negative correlation between the eigengene of the EMT module and the age at diagnosis, suggesting the potential role of the members of this module in the earlier onset of the disease. this is while no tangible difference can be observed between the other three subtypes (Supplementary Figure S3). In accordance with our observations in the TCGA samples, similar results were also observed in the PCA of the ACRG samples (Supplementary Figure S3). Overall, these results indicated that the samples belonging to the EMT subtype display the most distinct transcriptional profile among all the ACRG subtypes.

WGCNA and Motif Ranking Identify 39 Core mRNA Motifs
In order to find robust prognostic/diagnostic RNA markers, we sought to take advantage of co-expression module and motif identification approaches to identify core RNA regulators of EMT-type tumors. The workflow implemented for the identification of these RNAs is shown in Figure 1A. Fourteen co-expression modules with varying numbers of genes were identified by applying WGCNA on the expression data of the 47 EMTtype tumors in the TCGA cohort. A list of members of each module is provided in Supplementary Table S3. We used the negative logarithm of each gene's adjusted p-value, after differential expression analysis between EMT-type samples and other subtypes, as the criterion for gene significance. Using this criterion, the module with the most significant average differential expression was designated as the "EMT" module and the members of this module were selected for further investigation ( Figure 1B). Since a high level of module membership indicates that the expression level of a gene is an adequate proxy for the general behavior of a module, the label for the rest of the modules was based on the gene with the highest level of module membership in that module. The association of the eigengenes of each module with clinical parameters (gender, age at diagnosis, pathological stage, TNM stages, and the tissue of origin) was also assessed ( Figure 1C). There is a significant negative correlation between the eigengene of the EMT module and the age at diagnosis, suggesting the potential role of the members of this module in the earlier onset of the disease. It should be noted that since all of the co-expression modules were identified on the same set of samples, the observation that the eigengene of the EMT module is negatively correlated with the age at diagnosis is not biased by possible age imbalances in the data. Tumor staging system: T-size and spread of the primary tumor; N-level of spread to lymph nodes; M-metastasis.
Triangle motifs (with three nodes and three edges) are the most common type of motifs and are known to largely regulate the higher network structures and serve as the core building blocks of complex biological networks [20,30]. To identify core regulatory elements of the EMT module by taking advantage of the biological relevance of triangle motifs, the PPI network of the members of this module was reconstructed in Cytoscape. A total of 920 triangle motifs were identified. Each one of these motifs was scored based on 13 different weighting scenarios (Supplementary Table S2) using the multi-objective scoring function (see Section 2). Supplementary Table S4 contains all 920 motifs with their corresponding scores in each of the weighting scenarios. The top 10 motifs based on each of the weighting scenarios were selected. After removing the redundant motifs, a total of 39 top motifs remained and were used for further evaluation ( Table 2). These motifs represent potent candidates for playing central roles in GC, specifically the EMT subtype. This is due to the fact that the utilized scoring function was designed to designate the best scores to the motifs with the most profound topological significance, diagnostic value, and differential expression in the EMT subtype in comparison to the other subtypes.

Expression of the DMD/ITGA1/CAV1 Motif Is a Strong Predictor of Patient Survival
Next, we set out to characterize the 39 top motifs and identify the most potent candidates in terms of their prognostic capability. To this end, we conducted a survival analysis on the motifs based on the expression levels of the members of the motifs. For each member of the motifs, and for each motif considered a single entity, samples were divided into high expression and low expression groups both for the TCGA and ACRG cohorts, Kaplan-Meier curves were constructed (Figure 2A), and multivariate cox regression results (to account for the effects of age and sex) were extracted ( Table 2). Considering our stringent criteria (Section 2), the DMD/ITGA1/CAV1 motif was identified as the top motif with consistent relevance to prognosis (HR > 3 in both TCGA and ACRG cohorts).

EVC/EVC2/GLI3 and PDE2A/GUCY1A1/GUCY1B1 Are Robust Diagnostic Motifs
In order to assess the diagnostic capacity of the motifs and identify the most significant motifs with diagnostic relevance, we conducted a logistic regression analysis. Members of the motifs were used as predictors and the subtype of the samples (EMT versus non-EMT) as the response variable. We used the TCGA cohort as the training set and the ACRG cohort as the validation set. Additionally, the independently normalized data from the samples of the Singapore cohort were used to assess the robustness of the models. The top two motifs based on their Area Under the Receiver Operating Characteristic Curve (AUC) in the validation set were EVC/EVC2/GLI3 (AUC = 0.97) and PDE2A/GUCY1A1/GUCY1B1 (AUC = 0.97) ( Figure 2B; Table 3). We also assessed the diagnostic capacity of the motifs for distinguishing tumors from normal samples using the

EVC/EVC2/GLI3 and PDE2A/GUCY1A1/GUCY1B1 Are Robust Diagnostic Motifs
In order to assess the diagnostic capacity of the motifs and identify the most significant motifs with diagnostic relevance, we conducted a logistic regression analysis. Members of the motifs were used as predictors and the subtype of the samples (EMT versus non-EMT) as the response variable. We used the TCGA cohort as the training set and the ACRG cohort as the validation set. Additionally, the independently normalized data from the samples of the Singapore cohort were used to assess the robustness of the models. The top two motifs based on their Area Under the Receiver Operating Characteristic Curve (AUC) in the validation set were EVC/EVC2/GLI3 (AUC = 0.97) and PDE2A/GUCY1A1/GUCY1B1 (AUC = 0.97) ( Figure 2B; Table 3). We also assessed the diagnostic capacity of the motifs for distinguishing tumors from normal samples using the data from TCGA-STAD normal and tumor tissues and the GSE184336 dataset as an independent test set. Interestingly, PDE2A/GUCY1A1/GUCY1B1 achieved the highest AUC in the TCGA cohort (AUC = 0.95) and an AUC of 0.85 in the test set of the GSE184336 dataset, reinforcing its diagnostic importance (Table 4).

A Two-Membered miRNA Model Accurately Distinguishes EMT-Type Tumors from Other Gastric Tumors
The candidate miRNAs regulating the expression of the identified motifs were determined through the identification of differentially expressed miRNAs (EMT vs other subtypes; n = 220) that targeted one or more genes among the members of the top 39 motifs (109 miRNAs). The top multi-candidate miRNA combination was identified using LDA with leave-one-out cross-validation. The top two-membered miRNA combination consisting of hsa-miR-199a-5p and hsa-miR-141-3p with an AUC of 0.963 in the training set and an AUC of 0.995 in the test set was identified as the best discriminant multi-candidate miRNA combination (index: (0.597167 × hsa-miR-199a-5p) + (−0.798247 × hsa-miR-141-3p) + 2.02755). The results of the survival analysis for these miRNAs and their combination are demonstrated in Figure 3.
The candidate miRNAs regulating the expression of the identified motifs were determined through the identification of differentially expressed miRNAs (EMT vs other subtypes; n = 220) that targeted one or more genes among the members of the top 39 motifs (109 miRNAs). The top multi-candidate miRNA combination was identified using LDA with leave-one-out cross-validation. The top two-membered miRNA combination consisting of hsa-miR-199a-5p and hsa-miR-141-3p with an AUC of 0.963 in the training set and an AUC of 0.995 in the test set was identified as the best discriminant multi-candidate miRNA combination (index: (0.597167 × hsa-miR-199a-5p) + (−0.798247 × hsa-miR-141-3p) + 2.02755). The results of the survival analysis for these miRNAs and their combination are demonstrated in Figure 3.

Discussion
Among the molecular classifications of gastric tumors by ACRG, tumors of the EMT subtype are associated with significantly worse patient prognosis and likely demand more drastic therapeutic interventions [9]. Coupling this with the vastly unknown nature of the tumors of this subtype, further investigation of the molecular landscape of these tumors and the development of diagnostic and predictive biomarkers are of utmost importance. Here, we have identified a differentially expressed co-expression network in the tumors of the EMT subtype using WGCNA. The negative correlation of this module with the age of the patients at the time of diagnosis ( Figure 1C) is in line with the characterization of this subtype by ACRG [8] and indicates the relevance of this module to the EMT subtype. We have further explored this co-expression module in order to extract its central motifs and regulatory miRNAs with relevance to diagnosis and prognosis.

Discussion
Among the molecular classifications of gastric tumors by ACRG, tumors of the EMT subtype are associated with significantly worse patient prognosis and likely demand more drastic therapeutic interventions [9]. Coupling this with the vastly unknown nature of the tumors of this subtype, further investigation of the molecular landscape of these tumors and the development of diagnostic and predictive biomarkers are of utmost importance. Here, we have identified a differentially expressed co-expression network in the tumors of the EMT subtype using WGCNA. The negative correlation of this module with the age of the patients at the time of diagnosis ( Figure 1C) is in line with the characterization of this subtype by ACRG [8] and indicates the relevance of this module to the EMT subtype. We have further explored this co-expression module in order to extract its central motifs and regulatory miRNAs with relevance to diagnosis and prognosis.

Poor Outcomes for Patients with High Expressions of DMD/ITGA1/CAV1 Motif
Our results are able to characterize the signaling circuits involved in the aggressive phenotypes often observed in the gastric tumors of the EMT subtype (e.g., invasion, chemoresistance, etc.). We have identified the DMD/ITGA1/CAV1 motif as the top motif with consistent relevance to prognosis (HR > 3 in both TCGA and ACRG cohorts). The ITGA1 gene encodes the α-1 subunit of the integrin superfamily of glycoproteins. These transmembrane receptors are responsible for a variety of cellular functions including cell adhesion, migration, and intracellular signaling in response to the extracellular environment (ECM) [32]. ITGA1 is extensively associated with cancer invasiveness and poor patient prognosis in various tumor types. It promotes EMT, proliferation, and drug resistance in response to dysregulations in the tumor extracellular matrix. This is in part realized through upregulation of the Ras/MEK/ERK (MAPK) pathway [33][34][35][36]. Additionally, a wealth of studies indicate that the EMT-promoting effects of dysregulation in various molecules in GC converge on ITGA1, highlighting its potential as a therapeutic target [37,38].
Upon stimulation, the integrin receptors activate Ras through the recruitment of the Grb2/SOS complex. This is a process in which Caveolin-1 (Cav-1), a protein encoded by another member of the identified motif (CAV1), has been shown to play a pivotal role [39]. Cav-1 is best known for its crucial roles as a component of the caveolae-invaginations in the cell membrane involved, among other functions, in cell surface receptor localization and signal transduction [40]. Similar to ITGA1, Cav-1 is strongly associated with poor treatment outcomes, poor prognosis, and EMT [41,42]. Importantly, MAPK is not the only pathway through which Cav-1 has been associated with EMT. It has been shown that Cav-1 stimulates the dephosphorylation of β-Catenin, culminating in the activation of the WNT pathway and upregulation of Met receptor tyrosine kinase. Met (also known as HGFR), through its positive crosstalk with HER2, contributes to tumor aggressiveness, migration, proliferation, and chemoresistance by upregulating MAPK, WNT, and PI3K/AKT pathways [40]. Studies investigating the role of DMD, the last member of the identified motif, are sparse and contradictory [43], warranting a need for further investigation of the role of the DMD in the GC EMT subtype and its functional association with ITGA1 and Cav-1.

The EVC/EVC2/GLI3 Motif Performs Well Both as a Diagnostic and a Prognostic Marker
Our analysis pipeline resulted in the identification of two motifs with superior relevance to the diagnosis of gastric tumors of the EMT subtype. The top identified motif consists of EVC, EVC2, and GLI3; genes coding for essential members of the Hedgehog (Hh) signaling pathway [44]. The Hh pathway is firmly associated with the exhibition of stem-like phenotypes in cancer, cancer cell migration, EMT, and drug resistance in various cancer types including GC [45][46][47]. GLI3 is a transcription factor central to the regulation of the Hh pathway and plays dual roles both as an activator and a repressor of the genes downstream of this pathway [44]. In the absence of the Hh pathway ligands, GLI3 is bound to SUFU, which mediates its proteolytic cleavage, resulting in the abundance of cleaved GLI3 proteins, which act as suppressors of the Hh pathway. In the presence of the Hh ligands, SUFU dissociates from the GLI3 in a process in which both EVC and EVC2 have been shown to be of vital importance [48]. The dissociated full-length GLI3 promotes upregulation of the Hh pathway. The activity of GLI3 is strongly associated with various malignancies. For example, it promotes proliferation and EMT in multiple cancer types [49,50] and plays a role as a cancer driver gene in GC [51]. Importantly, multiple lines of evidence associate the overexpression of GLI3 with poor prognosis in various tumor types [50,52]. In line with these reports, our results indicate considerably worse outcomes for patients with higher expression of the EVC/EVC2/GLI3 motif in both TCGA (HR = 2) and ACRG (HR = 2.7) cohorts, suggesting the possible utility of this motif as a prognostic indicator as well as a diagnostic marker.

PDE2A/GUCY1A1/GUCY1B1-A Strong Diagnostic Marker
The other identified top motif with potential diagnostic capacity for the EMT subtype of GC is comprised of PDE2A (a member of the phosphodiesterase superfamily), GUCY1A1, and GUCY1B1 (also known as GUCY1A3 and GUCY1B3, respectively). These molecules are central regulators of the metabolism of cyclic guanosine monophosphate (cGMP) and cyclic adenosine monophosphate (cAMP), secondary messengers involved in many cellular functions including cell proliferation, differentiation, and apoptosis [53]. Interestingly, in addition to its exceptional performance in discriminating the samples of the EMT subtype from other gastric tumors, this motif presented a capacity for distinguishing gastric tumors from normal samples (AUC = 0.95; highest AUC among the assessed motifs), demonstrating its potential use as a diagnostic marker of GC in general. Notably, the presence of other proteins of the phosphodiesterase superfamily (PDE1A and PDE3A) and adenylate cyclase 5 (ADCY5) in addition to guanylate cyclase (GUCY) proteins among the identified top motifs (Table 3; Figure 4) points to a likely central role of cAMP and cGMP metabolism in the EMT subtype of GC. In line with this, there are a plethora of studies indicating the viability of phosphodiesterase inhibition as a treatment approach for the suppression of proliferation and reduction of the invasion capacity of tumors in various cancers [54]. However, the exact role of these molecules in tumorigenesis and cancer progression is ambiguous, and specifically, the interplay between the cyclase and phosphodiesterase proteins in cancer remains largely unexplored.

MiR-199a-5p and miR-141-3p Dysregulations Are Associated with Tumor Invasiveness
Another important result of this study is the identification of a candidate two-membered miRNA diagnostic biomarker (AUC = 0.995; Figure 3) consisting of hsa-miR-199a-5p (upregulated in the samples of the EMT subtype; LFC = 1.4) and hsa-miR-141-3p (downregulated in the samples of the EMT subtype; LFC = −1.9). In contrast to its downregulation in various tumor types, the expression of hsa-miR-199a is shown to be increased in the case of GC and has been associated with increased tumor invasiveness and metastasis in multiple studies [55,56]. These reports are in accordance with the observations of the current study and support the positive coefficient of this molecule in the identified diagnostic model. The other member of our two-membered diagnostic model, hsa-miR-141-3p, is a member of the miR-200 family of miRNAs, the downregulation of the members of which is tightly associated with increased proliferation, EMT, and invasiveness of gastric tumors among other tumor types [57][58][59]. Altogether, these results highly support the relevance of the identified two-membered miRNA-based diagnostic model in distinguishing gastric tumors of the EMT subtype. Additionally, the expression of both of these miRNAs was associated with patient outcomes in GC in previous studies [55,59]. However, our results only indicate a positive association between the high expression of hsa-miR-199a-5p and poor survival (p-value = 0.034). No association between the expression of hsa-miR-141-3p and patient prognosis could be observed (p-value = 0.34; Figure 3).

Conclusions
A few points regarding the implemented methods for motif identification and their limitations in this study should be noted. Considering the effects of multi-collinearity, the coefficients in the logistic regression modeling of the motifs should be utilized with caution when inferring the behavior of the mRNAs in these motifs since they are all extracted downstream of WGCNA. Nevertheless, this does not affect the precision of the prediction of the disease status by the motifs, and thus the top motifs with diagnostic capacity represent viable candidates. One should also take note that, based on the design of this study, the identified motifs are inclined to be more important in the EMT subtype, but their importance is not necessarily restricted to it; especially due to the inclusion of weighting factors such as the topological significance and previous association with cancer pathways in the motif ranking procedure. Additionally, while the top motifs in terms of prognostic and diagnostic capacity were the main focus of this discussion, all of the other high-scoring motifs in different weighting scenarios (Supplementary Table S4) represent potential candidates for playing significant roles in the pathology of GC and are encouraged to be further explored. Finally, this investigation was carried out entirely in silico, and subsequent wet-lab experiments are necessary for further validation of the results.
Overall, the current study took advantage of the biological relevance of both coexpression modules and network motifs through the combination of their identification methods in an end-to-end analysis workflow. Exploiting the abilities of WGCNA, a multiobjective motif scoring function, and machine learning approaches, we identified combinations of mRNAs and regulatory miRNAs with considerable prognostic and diagnostic capability. These results highlight the central roles of MAPK, Hh, and cGMP/cAMP signaling in the pathology of the EMT subtype of GC and provide an unprecedented picture of rewired signaling circuits that possibly contribute to the phenotypes observed in tumors of this subtype. Additionally, the identified co-expression modules and the large number of characterized motifs provide an opportunity for further exploration of this subtype of gastric tumors through various study designs.