Article

G-S-M-E: A Prior Biological Knowledge-Based Pattern Detection and Enrichment Framework for Multi-Omics Data Integration

by Miray Unlu Yazici 1,*, Burcu Bakir-Gungor 2 and Malik Yousef 3,4,*
1 Department of Bioengineering, Abdullah Gul University, Kayseri 38080, Türkiye
2 Department of Computer Engineering, Abdullah Gul University, Kayseri 38080, Türkiye
3 Department of Information Systems, Zefat Academic College, Zefat 13206, Israel
4 Galilee Digital Health Research Center, Zefat Academic College, Zefat 13206, Israel
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12669; https://doi.org/10.3390/app152312669
Submission received: 9 October 2025 / Revised: 15 November 2025 / Accepted: 22 November 2025 / Published: 29 November 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The rapid advancements in high-throughput technologies have led to a dramatic increase in diverse -omics data types, enabling comprehensive analyses, especially for complex diseases like cancer. Despite the development of multi-omics approaches, the challenges of scaling integration to massive, heterogeneous -omics datasets call for novel computational tools. In this study, we propose an approach for integrating microRNA (miRNA) and messenger RNA (mRNA) expression data that incorporates prior biological knowledge (PBK). This approach scores and ranks groups of miRNAs and their associated genes over cross-validation iterations. The proposed method incorporates a Pattern detection (P) component to identify molecular motifs unique to each biological group. The analysis also supports visualization of the groups, which facilitates the identification of co-occurring groups and their characteristic features across iterations. Furthermore, the groups are scored using an over-representation analysis through a new Enrichment (E) component in each iteration. The groups are clustered by their Enrichment Scores (ESs) and visualized in a heatmap to obtain novel insights into the collective behavior and dependencies of the groups, aiming to understand the molecular mechanisms of complex diseases. The developed G-S-M-E tool not only provides performance metrics and biological scores at the group level but also offers comprehensive insights into intricate multi-omics interactions. In summary, our study emphasizes the importance of mathematical and data science methodologies in multi-omics integration, yielding a formalized approach that deepens our understanding of complex diseases.

1. Introduction

Cancer is a multifactorial disease that develops over time through the interaction of environmental and genetic components. Understanding the underlying factors and their interactions at multiple cellular levels can contribute significantly to elucidating the mechanisms of cancer onset and progression. MicroRNAs (miRNAs), a class of non-protein-coding molecular factors, can act as tumor suppressors or oncogenes, affecting cancer development [1]. For example, miRNAs regulate protein-coding gene activity by binding to target mRNAs [2]. While most studies show that miRNA interaction with the 3′UTR of the target mRNA suppresses or inhibits gene expression [3], in some cases, miRNAs promote gene expression by interacting with the 5′UTR and the coding sequence [4]. Both upregulation and downregulation of genes mediated by miRNAs are linked to cancer [5,6]. Hence, it is critical to shed light on the miRNA-mediated gene regulation mechanisms in carcinogenesis and tumor progression. In this respect, multi-omics data analysis focusing on miRNA–mRNA regulation mechanisms can provide important insights into the molecular pathways affected during cancer development and progression.
Conventional single-omics analyses [7,8] have identified significant factors in molecular pathways and in cellular processes (apoptosis, metastasis, proliferative signaling) that are fundamental to understanding cancer. However, the immense amount of data generated with high-throughput technology has shifted ongoing studies toward integration-based methods [9]. In recent years, machine learning (ML) methods have offered novel techniques for integration and analysis of multi-omics data, as summarized in a recent review paper [10]. Several unsupervised approaches [11,12] are used for hidden pattern identification and patient subgroup discovery, and supervised ML methods are used for patient stratification, feature selection, and classification [13,14]. A recent survey investigated graph-based ML models for multi-omics analyses [15]. Modeling complex biological systems while accounting for their interrelations requires the development of multi-omics data fusion approaches [16]. Data fusion has progressed over recent years, moving from basic statistical data analysis to complex network and deep learning models [10]. Network-centric multi-omics data analyses use linked feature graphs to map complex biological systems and enable the detection of biomarkers [17,18]. Deep learning data fusion models (autoencoders, graph neural networks, etc.) map the complex, non-linear data interrelations into lower-dimensional spaces. These models are capable of detecting the key biomarkers that take part in biological regulation through the use of multi-omics data [19].
Most ML models [10,20,21] use -omics data as biological features, disregarding biological domain knowledge. Instead of relying only on fully data-driven analyses, incorporating prior biological knowledge (PBK) into studies can provide deeper insight into the underlying changes and patterns in the regulatory mechanisms of complex diseases [22]. For example, the Kyoto Encyclopedia of Genes and Genomes (KEGG) [23] is used in the literature as a PBK repository [24]. KEGG presents interaction maps of genes and gene products at molecular and higher levels. Other examples of PBK repositories include miRTarBase [25], which provides experimentally validated miRNA–target interactions; the DisGeNET platform [26], which contains scored disease–gene associations; and the Gene Ontology (GO) database [27], which collects and organizes functional annotations of genes from a wide range of species. Network topology-based approaches [28,29] utilize PBK resources such as DisGeNET [30] and miRWalk [31] to construct gene–miRNA–circRNA networks. KEGG and GO enrichment analyses are then performed on the genes associated with the significant circRNAs in these networks to identify potential functions of differentially expressed circRNAs. As another example, refs. [32,33] proposed statistical and computational approaches to integrate PBK into -omics data analyses.
In order to incorporate PBK into -omics data analysis, one of the alternatives is to perform functional enrichment. These analyses are widely used to identify over-represented biological functions and cellular pathways for a set of genes that are identified as significant in high-throughput experiments for the disease under investigation [34,35]. The enrichment analysis or over-representation analysis (ORA) can be considered a dimension reduction procedure in which a large set of biomolecular entities can be narrowed down to a smaller set of biological processes [36]. Enrichment significance (p-value) indicates the importance of biological terms for given sets of biomolecules which are identified in -omics data analysis [37].
In this study, we developed the G-S-M-E tool, which performs ML-based multi-omics integration of mRNA and miRNA expression profiles utilizing PBK and functional enrichment analysis. In our earlier studies [38], we developed the G-S-M (grouping–scoring–modelling) approach to perform grouping-based feature selection. In its G component, the groups can be constructed based on the correlation information between the -omics features (pairs). Highly correlated pairs such as a set of miRNA–gene pairs can be collected under the same miRNA with common gene information. These common genes and their associated miRNA can be named as a group, and each group is given as an input to the S component. After a score is assigned to each group, the best scoring groups are used to develop an ML model for classification of disease vs. control samples.
The proposed approach (G-S-M-E) expands on the G-S-M idea by introducing two new components, pattern detection (P) and enrichment analysis (E), to score and rank groups of miRNAs and their associated genes. This scoring and ranking process is performed iteratively through cross-validation, providing insight into multi-omics interactions. In the G-S-M-E tool, the high-scoring groups, including their miRNA-associated genes, are given as an input to an ML classifier. In the P component, iteration information generated via a random subsampling cross-validation technique is used to track the scores and the rankings of the groups in each iteration. Collecting this information across iterations identifies molecular motifs unique to biological groups and thereby reveals hidden molecular patterns within distinct groups. In addition, the iteration and ranking information obtained within the random subsampling cross-validation procedure [39] is visualized in a heatmap to understand the distinct patterns of groups at the molecular level. In the E component of G-S-M-E, the collective behavior of clusters of groups incorporating PBK is visually represented with ORA scores in a compact representation. With this component, pathway impact is evaluated for the groups rather than merely for individual genes. The E component utilizes four PBK databases, i.e., GO, KEGG, Wikipathways, and Reactome, for enrichment analysis. The significance of the over-represented terms in the groups is calculated using the hypergeometric distribution. Clustering approaches are used to reveal clusters of groups that are similarly scored by the four enrichment approaches. The proposed P and E components make it possible to assess the importance of the potential signatures by considering not only expression profiles but also pathway knowledge, via cross-validation.
Thus, our investigation traverses the landscape of multi-omics intricacies, guided by mathematical rigor and data science expertise, so that we might reveal the intricate choreography underpinning carcinogenesis.

2. Materials and Methods

2.1. Data Sets and Preprocessing

Two breast invasive carcinoma (BRCA) -omics datasets (miRNA and gene expression profiles) including 760 cases and 87 controls were downloaded from The Cancer Genome Atlas (TCGA) data portal (https://portal.gdc.cancer.gov/, accessed on 25 May 2024). A total of 1881 miRNAs (Illumina stem-loop expression data) in the form of log2 (RPM + 1) and 21,839 genes (Illumina RNA-Seq; reads mapped to GRCh38) in the form of raw read counts were given as an input into the workflow. Raw read counts of RNA-seq data were normalized with the trimmed mean of M-values (TMM) method using the edgeR package [40]. Normalized reads per million (RPM) miRNA-seq data was used for further analyses. Then, expression levels in the two classes were compared with a t-test, and the miRNAs and genes with p < 0.05 were retained as significant. In BRCA, preprocessing of expression profiles resulted in 13,009 significant genes and 166 significant miRNAs, which were used for further steps.
The G-S-M-E tool was also tested on several other -omics datasets, including kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), prostate adenocarcinoma (PRAD), stomach adenocarcinoma (STAD), thyroid carcinoma (THCA), and uterine corpus endometrial carcinoma (UCEC), to assess the underlying patterns identified by our tool and to provide a comprehensive evaluation of the model’s performance. Table 1 presents the case and control sample numbers for each cancer type analyzed with G-S-M-E.

2.1.1. Design and Implementation of the G-S-M-E Tool

G-S-M-E was developed with an open-source data analytics platform called KNIME [41]. The selection of significant miRNAs and genes by using a t-test, grouping important features, determining the top groups using the ML model, training the ML classifier with the top groups, performance evaluation, and other extended analyses (pattern detection of groups with iteration and ranking information and significant feature statistics) were carried out using the KNIME platform. Enrichment analysis was performed in R.
The general framework of G-S-M-E, which illustrates the overall integration procedure and the flow of -omics data across the components of the tool, is presented in Figure 1. These components include G (Grouping) for detecting the groups; S (Scoring) for assigning the scores to each group; M (Modelling) for training the classifier; P (Pattern detection) for identifying the patterns among groups; and E (Enrichment) for enriching the groups with prior biological knowledge.
Following the above-mentioned preprocessing steps, the miRNA data matrix with dimensions p × n, where p corresponds to the number of miRNAs for n samples, and a gene expression matrix with dimensions q × n, where q represents the number of genes for n samples, were given as an input to the G-S-M-E tool. A 1:1 ratio was applied on the samples to ensure an equal number of cases and controls.
The normalized gene expression profiles were then divided into training and test sets, with a split ratio of 80:20. A random subsampling cross-validation technique (Monte Carlo cross-validation, or MCCV [40]) was used to split the datasets. In each iteration, a different sample subset of expression profiles was selected for the training and test datasets, which helps to guard against overfitting. Figure 1 illustrates the MCCV method in the upper-middle part of the figure, where the training data is shown in white and the test data in blue. The training sets comprising p miRNAs and q genes for x samples were used for further analysis. Subsequently, the test portion of the gene expression dataset, including q genes and n–x samples, was used in the evaluation of the model’s performance. In BRCA, the training set in each iteration included 70 cases and 70 controls, while the test set comprised 17 cases and 17 controls. The number of samples in each set varied depending on the number of case and control samples available for each cancer type.
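The balancing and splitting steps described above can be sketched as follows. This is a minimal illustration, not the KNIME workflow itself: the function name `balanced_mccv_splits`, the sample identifiers, and the choice to redraw the 1:1 undersample in every iteration are assumptions made for the example.

```python
import random

def balanced_mccv_splits(case_ids, control_ids, n_iter=100, train_frac=0.8, seed=0):
    """Yield (train, test) lists of (sample_id, label) for each MCCV iteration."""
    rng = random.Random(seed)
    # 1:1 ratio: both classes are cut down to the size of the smaller class.
    n_per_class = min(len(case_ids), len(control_ids))
    n_train = round(train_frac * n_per_class)
    for _ in range(n_iter):
        train, test = [], []
        for ids, label in ((case_ids, "case"), (control_ids, "control")):
            # Assumption: the undersample is redrawn in every iteration.
            picked = rng.sample(list(ids), n_per_class)
            train += [(s, label) for s in picked[:n_train]]
            test += [(s, label) for s in picked[n_train:]]
        yield train, test

# BRCA sample counts from the text: 760 cases and 87 controls.
cases = [f"case_{i}" for i in range(760)]
controls = [f"ctrl_{i}" for i in range(87)]
splits = list(balanced_mccv_splits(cases, controls, n_iter=3))
train, test = splits[0]
print(len(train), len(test))  # 140 34
```

With 87 samples per class and an 80:20 split, each iteration yields 70 + 70 training samples and 17 + 17 test samples, which matches the BRCA numbers reported above.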

2.1.2. Components of the G-S-M-E Tool

  • The G (grouping) component
The G component aims to detect the feature groups by processing the significant miRNA and mRNA datasets. First, pairs of correlated features (miRNA–gene pairs, or mg pairs) were detected using a correlation function. Here, the strength of miRNA–gene correlation was measured using Spearman’s correlation [42] to elucidate monotonic associations in disease classification. This operation created a list of mg pairs. Let i index the highly correlated pairs and t denote their total number. The highly correlated mg pairs are indicated as follows:
\[ \text{pairs} = \{\, mg_i \mid i = 1, 2, \ldots, t \,\} \qquad (1) \]
A sample table that was generated by the G component is given in Table 2.
Here, we used correlation information to identify the relationships between -omics datasets. Alternative biological knowledge can be utilized instead of correlation metrics, e.g., pathways, disease–gene, drug–gene.
  • The S (Scoring) component
The Scoring component aims to assign a score to each group. In this step, G (including N unique groups) was given as an input into the S component. To perform the scoring task, the group information (each miRNA and its associated gene(s)) was taken from this input.
The expression data of the gene(s) associated with the corresponding miRNA was extracted from the gene expression training dataset (q genes for x samples). This submatrix, including case and control samples of the miRNA-associated gene(s), was split into new training and validation sets. An ML classifier trained on the new training set with 10-fold cross-validation was tested on the validation set. The mean of the accuracy values generated through cross-validation was assigned as the score (score_j) of the j-th group (group_j). The tool also allows other performance metrics (specificity, sensitivity, AUC, etc.), with different weight options, to be used as the group score. This process was applied to each group, and a list of the N groups and their assigned scores was retrieved via the S component (Equation (2)):
\[ \text{scores} = \{\, (\text{group}_j, \text{score}_j) \mid j \in \{1, \ldots, N\} \,\} \qquad (2) \]
Following the score assignment to the N groups, a rank was given to each group based on its score. This ranking function, denoted Rank, is shown in Table 3. Let Rank(score_j) represent the rank assigned to a group based on its score (score_j). The set of all possible ranks is R = {1, 2, 3, …, 10, 0}, and the set of predefined score thresholds in descending order is score = {1, 0.95, 0.90, …, 0.50}.
While a score between 0.95 and 1 corresponds to rank 1 (the highest rank), groups with scores lower than 0.50 are assigned a rank of 0. The ranked groups, including score and rank information for each group, are recorded as in Equation (3).
\[ \text{RankedGroups} = \{\, (\text{group}_j, \text{score}_j, \text{rank}_j) \mid j = 1, \ldots, N \,\} \qquad (3) \]
The RankedGroups list was then used in the Modeling (M) and Pattern Detection (P) components. While the RankedGroups list is utilized to select the groups with best ranks in the M component, a matrix named ‘Basic Informative Matrix (BIM)’ is generated from the RankedGroups list in the P component.
In summary, the S component performs a prediction task for each identified group. The gene(s) associated with the particular miRNA group predict the difference between case and control samples. The assigned score reflects that group’s individual predictive power. This scoring then enables the system to rank groups, focusing first on those that are most effective at prediction.
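The score-to-rank mapping can be sketched as below. The bin edges follow the thresholds listed above (0.95, 0.90, …, 0.50); treating each bin as closed at its lower bound is an assumption, since the exact boundary handling is defined in Table 3. The group names and scores are hypothetical.

```python
# Lower bounds of ranks 1..10: 0.95, 0.90, ..., 0.50.
LOWER_BOUNDS = [(95 - 5 * i) / 100 for i in range(10)]

def score_to_rank(score):
    """Return rank 1 (best) to 10, or 0 when the score is below 0.50."""
    for rank, lower in enumerate(LOWER_BOUNDS, start=1):
        # Assumption: a score equal to a threshold falls in the better bin.
        if score >= lower:
            return rank
    return 0

def rank_groups(scores):
    """Build (group, score, rank) triples, best groups first, rank 0 last."""
    ranked = [(g, s, score_to_rank(s)) for g, s in scores.items()]
    return sorted(ranked, key=lambda r: (r[2] == 0, r[2], -r[1]))

# Hypothetical group scores (mean cross-validation accuracies).
scores = {"miR-A": 0.97, "miR-B": 0.91, "miR-C": 0.42, "miR-D": 0.50}
ranked_groups = rank_groups(scores)
for group, score, rank in ranked_groups:
    print(group, score, rank)
```

Sorting rank 0 to the end keeps groups that perform no better than chance out of the top of the list, so that selecting the first ten entries corresponds to selecting the ten best-ranked groups.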
  • The M (Modeling) component
The Modeling (M) component aims to train the classifier by using the best ranking groups and the expression levels of their associated gene(s). The M component utilizes the RankedGroups list and corresponding expression profiles of -omics datasets to create an ML model for classifying tumor and control samples. To achieve this, the 10 best ranking groups from the RankedGroups list and their gene(s) expression information were provided as an input to the M component. The input table (Table 1) introduced into the M component consists of the tumor and control samples in rows and the features (genes associated with the 10 best scoring groups) in columns. A flowchart of the M component’s functioning is given in Figure 2.
The M component generated ML models by training classifiers, including Decision Tree (DT), Gradient Boosting Tree (GBT), Naïve Bayes (NB), Probabilistic Neural Network (PNN), Random Forest (RF), Support Vector Machine (SVM), and Tree Ensemble (TE). A repeated random subsampling cross-validation technique was used to evaluate the model’s performance. After randomly splitting samples of the given dataset into training and test sets in each iteration, the model was fitted on the training set and tested on the separate test set in order to prevent overfitting. The performance results of each classifier were averaged over the total number of iterations.
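The repeated random subsampling evaluation can be sketched as a loop that refits a classifier on each random split and averages the test accuracy. In this illustration a simple nearest-centroid classifier stands in for the classifiers listed above (RF, SVM, etc.), the split is not stratified by class, and all function names and data values are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroids(centroids, X):
    """Assign each sample to the class with the closest centroid."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda lab: dist(centroids[lab], x)) for x in X]

def evaluate_mccv(X, y, fit, predict, n_iter=100, train_frac=0.8, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    n_train = round(train_frac * len(idx))
    accuracies = []
    for _ in range(n_iter):
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)

# Hypothetical, well-separated single-gene expression values.
X = [[5.0 + 0.1 * i] for i in range(20)] + [[0.1 * i] for i in range(20)]
y = ["tumor"] * 20 + ["control"] * 20
mean_acc = evaluate_mccv(X, y, fit_centroids, predict_centroids)
print(mean_acc)  # 1.0
```

Because the model never sees the test samples of an iteration during fitting, the averaged accuracy estimates performance on unseen data rather than on the training data.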
  • The P (Pattern Detection) component
While the M component trains and tests the ML classifier, the P component aims to identify the patterns among groups by tracking the iteration information. If a group appeared in a particular iteration, the iteration number and ranking level of the group were recorded in the RankedGroups list. Utilizing this list, a Pattern matrix $P \in \mathbb{R}^{g \times in}$ was created, where rows represent groups (g) and columns represent iteration numbers (in) from 1 to 100. Within this matrix, $P(g_i, in_j) = 1$ indicates a high significance (ranking) level of the i-th group in the j-th iteration, and $P(g_i, in_j) = 10$ indicates a low significance level of the corresponding group. Where no rank was recorded, the group did not appear in that iteration. The construction of the Pattern matrix (P) is defined in Table 4.
The P matrix was used to visualize the occurrence patterns of these frequently seen groups in a heatmap by using average linkage hierarchical clustering of rows and columns based on Euclidean distances. This analysis not only provides visualization of the group’s appearance but also derives co-occurring groups and the appearance characteristics of groups through iterations.
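The construction of the Pattern matrix can be sketched as follows. The input format (one dictionary of group ranks per iteration), the use of `None` for a group that is absent from an iteration, and the summary by frequency and mean rank are assumptions for illustration; the hierarchical clustering and heatmap steps are not reproduced here.

```python
def build_pattern_matrix(ranked_per_iteration):
    """Rows are groups, columns are iterations; an entry is a rank or None."""
    n_iter = len(ranked_per_iteration)
    groups = sorted({g for iteration in ranked_per_iteration for g in iteration})
    matrix = {g: [None] * n_iter for g in groups}
    for j, iteration in enumerate(ranked_per_iteration):
        for group, rank in iteration.items():
            matrix[group][j] = rank
    return matrix

def summarize(matrix):
    """Frequency and mean rank per group, most frequent groups first."""
    rows = []
    for group, ranks in matrix.items():
        present = [r for r in ranks if r is not None]
        rows.append((group, len(present), sum(present) / len(present)))
    return sorted(rows, key=lambda r: (-r[1], r[2]))

# Hypothetical ranks of the groups selected in three iterations.
iterations = [
    {"miR-A": 1, "miR-B": 3},
    {"miR-A": 2},
    {"miR-A": 1, "miR-C": 10},
]
pattern = build_pattern_matrix(iterations)
summary = summarize(pattern)
for group, frequency, mean_rank in summary:
    print(group, frequency, round(mean_rank, 2))
```

A group that is both frequent and low in mean rank, such as the hypothetical miR-A here, corresponds to what the Results section later calls a dominant group.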
  • The E (Enrichment) component: Enriched Groups with PBK
As the P component was employed to identify the occurrence patterns of groups through iterations, providing ranking information, the E component was designed to assign additional scores to each group utilizing prior biological knowledge. A flowchart for detecting enriched group features is given in Figure 3.
This component employs the hypergeometric distribution and incorporates prior biological knowledge (PBK), which was obtained from the GO, KEGG, Wikipathways, and Reactome databases. The significant groups and their associated gene(s) derived from the RankedGroups list (Equation (3)) were used as an input in the E component. We investigated whether the known pathways/terms (representing biological functions or processes) were over-represented in our identified groups by submitting the relevant functional gene sets from these databases into the E component.
In our analysis, the hypergeometric distribution was used to determine whether known biological functions or processes were significantly enriched within the groups. The enrichment p-value is the upper tail of the hypergeometric probability mass function:
\[ p = P(X \ge k) = \sum_{x=k}^{t} \frac{\binom{n}{x}\binom{N-n}{t-x}}{\binom{N}{t}} \qquad (4) \]
The random variable X represents the number of genes in the intersection. The observed number of successes, k, is the number of genes shared between a pathway/term and the group of interest. The population size, N, is the total number of genes in the genome. The number of successes within the population, n, is the number of genes in the pathway/term. The sample size, t, is the total number of differentially expressed genes within the group. The p-value is the probability of observing k or more genes from the pathway/term within the respective group.
The lower the p-value, the less likely it is that k or more genes from the pathway/term would appear in the group by chance. The p-values for the pathways/terms were collected for each group in this step. Finally, the averaged hypergeometric p-values and the Benjamini–Hochberg-adjusted p-values were converted to the −log10 (p-value) scale. This value was assigned to the given group as an Enrichment score (ES). The E component collects the ESs for each group in the Enrichment score matrix (Equation (5)).
\[ ES(g_i) = \begin{cases} -\log_{10}(p\text{-value}_i), & \text{if } g_i \text{ has an enrichment score} \\ \text{NA}, & \text{otherwise} \end{cases} \qquad (5) \]
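The over-representation test and the Enrichment score can be sketched with the Python standard library as shown below. The symbols k, N, n, and t follow the definitions above. The function names are hypothetical, and collapsing a group's term-level p-values into one ES by taking −log10 of the mean adjusted p-value is an assumption about the aggregation order; the tool itself relies on clusterProfiler and ReactomePA in R.

```python
import math

def hypergeom_pvalue(k, N, n, t):
    """P(X >= k): at least k of the n pathway genes fall in a group of t genes."""
    total = math.comb(N, t)
    tail = sum(
        math.comb(n, x) * math.comb(N - n, t - x)
        for x in range(k, min(n, t) + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def enrichment_score(adjusted_pvalues):
    """Assumed aggregation: -log10 of the mean adjusted p-value; None means NA."""
    if not adjusted_pvalues:
        return None
    return -math.log10(sum(adjusted_pvalues) / len(adjusted_pvalues))

# Toy numbers: genome of 10 genes, pathway of 4 genes, group of 3 genes,
# of which 2 belong to the pathway.
p = hypergeom_pvalue(k=2, N=10, n=4, t=3)
print(p)  # (6*6 + 4*1) / 120, i.e. one third

adjusted = benjamini_hochberg([0.01, 0.04, 0.03])
print(adjusted, enrichment_score(adjusted))
```

Exact integer binomial coefficients are used here so that no precision is lost before the final division; for genome-scale inputs, a survival-function routine from a statistics library would be the more usual choice.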
The enrichment score matrix enables users to select significant biological groups based on pathway enrichment analysis. The significant groups were visualized with a stratigraphic plot, and group features were hierarchically clustered by the enrichment levels (ES scores) of groups.
The following packages were used in the E component to perform the functional enrichment analysis. Over-represented GO, KEGG, and Wikipathways terms in significant groups were identified using the R package clusterProfiler 4.2.2 [43] with the enrichGO, enrichKEGG, enrichWP functions, respectively. The R package ReactomePA [44] with the enrichPathway function provided enrichment analysis of functional gene sets using the Reactome database. For visual representation of the over-represented groups, a stratigraphic diagram was created with the R packages tidypaleo [45] and ggplot2 [46].

3. Results

In this section, the significant groups, miRNAs, and genes identified by the G-S-M-E tool in the BRCA dataset are presented. In addition, to elucidate the underlying molecular patterns for different cancer types, the G, S, M, and E components were tested on different -omics datasets belonging to other cancer types (as shown in Table 1). We present molecular patterns within distinct groups that were visualized based on the iteration and ranking information. In the last part of this section, functional enrichment findings are presented. The biological functions of the group features are utilized to understand the behaviors of the clusters of groups.

3.1. Identification of Significant Groups in the BRCA Dataset

Table S1 summarizes the different characteristics of the 10 most significant groups identified by the G-S-M-E tool in the BRCA dataset. The frequencies of the groups over 100 iterations are given in the ‘Frequency of Group’ column. The average scores and ranks of each group calculated within the S component are given in the following two columns of the table. The Number of Associated Genes and Associated Genes columns represent the number and the set of unique target genes of each group, respectively. Iteration and rank information tracked via MCCV is given in list form in the last two columns. The higher the score, the higher the importance of the group for the classification task. On the other hand, a lower rank represents the higher statistical significance of that group.
Table S1 also highlights that the top five identified groups occurred in more than 50 out of 100 iterations. While the number of target genes varies among these groups, the two leading groups each comprise over 100 associated genes. It is essential to note that the assessment of a group’s significance combines multiple characteristics. For instance, “hsa-miR-10b-5p” appears in nearly half of the total iterations, and the combination of its average rank and score points to a critical role in classifying BRCA tumor and normal samples. This output of G-S-M-E is relevant considering the reported downregulation of the tumor-suppressive “hsa-miR-10b-5p” in BRCA, as well as in other cancer types including ovarian and prostate cancer [47]. The distribution of gene numbers within groups is illustrated in Figure 4. The associated gene numbers for most of the groups ranged from 0 to 50 over 100 iterations.

3.2. Performance Evaluation

Considering the complexity of the regulatory network between miRNA and target mRNA, which tends to exhibit a non-linear (monotonic) relationship, Spearman’s rank correlation was used to identify most significantly correlated miRNA–mRNA pairs, and these pairs were used to create the groups (details can be found in the Methods section—Grouping component). To determine the optimum threshold in group creation, the performance of the ML models for the thresholds of 0.6, 0.7, and 0.8 was evaluated using the BRCA dataset. The detailed results are illustrated in Figure 5.
Our main idea in forming biological groups was to distinguish the classes using a minimal number of highly correlated features (genes) while achieving the best accuracy and AUC-ROC values. Therefore, a threshold of 0.8 was selected as the optimum one for BRCA data, yielding high performance metrics (average accuracy: 0.98, average AUC-ROC: 0.99) with a reduced gene number (average gene number: 52.7). The average accuracy and AUC metrics are calculated over 100 iterations. Here, we used accuracy and AUC-ROC to select the threshold, but other performance metrics can be considered at this stage. Following the establishment of the threshold, we evaluated the performance of various classifiers (Decision Tree (DT), Gradient Boosting Trees (GBT), Naïve Bayes (NB), Probabilistic Neural Network (PNN), Random Forest (RF), Support Vector Machine (SVM), and Tree Ensemble (TE)) across each cancer type, as shown in Figure 6. This figure presents a comprehensive overview of the performance measures achieved by the G-S-M-E approach across groups in the context of multi-omics data analysis for the BRCA and other cancer multi-omics datasets.
Each subfigure shows a specific cancer type (BRCA, KIRP, LIHC, LUAD, STAD, PRAD, THCA, UCEC). The classifiers (models) are displayed on the x-axis, while the values of each performance metric (accuracy, area under the ROC curve, Cohen’s kappa, etc.) are represented in different colors. In BRCA, the performance of the models clusters around 0.9–1.0. Model performance is consistent across the analyzed cancer types except STAD, where metrics such as Cohen’s kappa and sensitivity are lower for Naïve Bayes (NB) than for the other models. Some metrics vary with the model, such as the F-measure value in the PRAD dataset. In terms of models, RF and GBT generally achieve the best results across the cancer datasets, and SVM and TE also perform consistently well across metrics. Overall, RF is the most reliable classifier in differentiating the positive and negative classes, with consistently high performance across all datasets. SVM and GBT follow RF, with consistently good metrics in most cases. NB is less reliable than the other models, considering its variability and low scores across datasets. The Receiver Operating Characteristic (ROC) curves of the classifiers for each studied cancer type are given in Figure S1. The higher the area under the ROC curve, the better the distinguishing ability of the classifier. In addition, established multi-omics integration methods, namely MOFA (Multi-Omics Factor Analysis) and SNF (Similarity Network Fusion), were used as baselines to assess the ability of our proposed method, G-S-M-E, to discriminate between the classes.
The results in Figure S2 indicate that compared to state-of-the-art multi-omics integration methods, G-S-M-E shows highly competitive and robust performance, especially in maintaining high discriminative and balanced classification capability. While MOFA achieves the highest accuracy (0.99) and reliability (Cohen’s kappa: 0.97), G-S-M-E has the highest AUC-ROC (showing the model’s overall ability to distinguish between the classes) among the compared methods. Our method outperforms SNF in almost all metrics. Scoring and ranking groups using the MCCV and Pattern detection (P) component in G-S-M-E validate its success in feature selection and balanced classification performance. In addition, G-S-M-E improves biological insight through providing pattern detection and group-level scoring.

3.3. Impact of Group Construction on G-S-M-E Classification Performance

This section investigates the effect of group construction on the performance of the model in the G-S-M-E tool. The model is trained both on Spearman correlation-derived groups based on expression data and on miRTarBase-validated groups from the external database. The validated groups are shown in Table S2. The most striking association is that of hsa-miR-145-5p and TGFBR2, which is strongly supported by the existing literature, where hsa-miR-145-5p acts as a tumor suppressor by directly targeting TGFBR2. Likewise, our study identified another key tumor-suppressive miRNA, hsa-miR-139-5p, associated with a number of potential target genes such as FZD4, TNS1, and CDCA8. In addition, the effect of group construction on the prediction performance of the G-S-M-E tool is shown in Figure S3. Groups constructed via correlation yielded better classification results (e.g., higher accuracy and AUC-ROC), which supports our data-driven approach. Limiting the groups to previously validated miRNA–target interactions may eliminate new interactions that are necessary to maximize performance. Our method aims for optimal classification performance while identifying potential biomarkers and their interactions, rather than being bound to previously known interactions.
In this part of the study, we focus on the collective functionality of features to understand disease mechanisms. Identifying the significant groups operating together is essential. Therefore, we identify and collect features from significant groups and then construct our models.

3.4. Molecular Patterns Within Identified Groups of BRCA

In the follow-up analysis, the contribution of pairs of groups and clusters of groups to the collective characteristics was investigated. The appearance behavior of groups throughout the iterations is visualized in Figure 7. The heatmap uses color-coding to highlight the characteristics of significant groups, making it easier to spot hidden patterns. Understanding the comprehensive connection of the features can provide deeper knowledge about the underlying mechanisms of BRCA.
In Figure 7, the iteration information given in the columns represents the appearance of each group, with ranks depicted by colors. Similar patterns of iterations are displayed in close proximity via hierarchical clustering (using Euclidean distance). The corresponding dendrogram is shown at the top of the heatmap. The top-ranked features, which are potential key players in BRCA, are encoded in dark blue. The three most commonly appearing groups (hsa-miR-139-5p, hsa-miR-378a-3p, and hsa-miR-378a-5p) also have significant ranks and are referred to as dominant groups. The identification of dominant groups may have aided in the classification of the defined classes. Other candidate dominant features include hsa-miR-10b-5p and hsa-miR-335-3p. Although they appear in only half of the total iterations, these groups are highly ranked, with average ranks of 2.59 and 5.27, respectively. Complementary pairs among groups can also be inferred from Figure 7. Complementary groups are defined as groups that function on the same target but do not appear together: when one group appears, the complementary one does not.
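The quantities discussed for Figure 7 (appearance frequency, average rank, and complementary pairs) can be derived from a group-by-iteration rank table. The following is a minimal sketch under stated assumptions: the data layout (`None` marking a group not detected in an iteration), the function names, and the toy values are hypothetical and are not taken from the G-S-M-E code.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical layout: group name -> rank per iteration (None = not detected).
RankTable = Dict[str, List[Optional[int]]]


def summarize_groups(ranks: RankTable) -> Dict[str, Tuple[int, Optional[float]]]:
    """Return (appearance frequency, average rank) for every group."""
    summary: Dict[str, Tuple[int, Optional[float]]] = {}
    for group, row in ranks.items():
        seen = [r for r in row if r is not None]
        avg = sum(seen) / len(seen) if seen else None
        summary[group] = (len(seen), avg)
    return summary


def complementary_pairs(
    ranks: RankTable, targets: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Pairs that share at least one target but never appear in the same iteration."""
    pairs: List[Tuple[str, str]] = []
    for a, b in combinations(sorted(ranks), 2):
        a_seen = any(r is not None for r in ranks[a])
        b_seen = any(r is not None for r in ranks[b])
        co_occur = any(
            x is not None and y is not None for x, y in zip(ranks[a], ranks[b])
        )
        shares_target = bool(targets.get(a, set()) & targets.get(b, set()))
        if a_seen and b_seen and shares_target and not co_occur:
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    toy_ranks: RankTable = {
        "g1": [1, 1, 2, 1],
        "g2": [2, None, 1, None],
        "g3": [None, 2, None, 2],
    }
    toy_targets = {"g1": {"T1"}, "g2": {"T2", "T3"}, "g3": {"T3"}}
    print(summarize_groups(toy_ranks))
    print(complementary_pairs(toy_ranks, toy_targets))
```

In this toy table, a group that appears in every iteration with a low average rank would correspond to what the text calls a dominant group.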

3.5. Identification of Post-Modified Groups via Functional Enrichment Analysis

Sections 3.1–3.4 summarized our findings on the BRCA dataset and the behavior of significantly identified features with sequential patterns of iteration and ranking information. We expand our study to identify the biological functions of the group features based on the Enrichment score (ES) by incorporating PBK. For this purpose, the significance levels of all known biological processes were calculated via a hypergeometric test for each group. An average score was assigned to each group feature, and functional importance scores were thereby assigned to the features. A deeper insight into the collective behavior of groups was developed by clustering the groups based on the ES, as illustrated in Figure 8.
This distinct collective interpretation of features provided a framework for selecting the functional hub features for BRCA. Feature associations at the pairwise and cluster levels, depending on the ES of GO, KEGG, Reactome, and Wikipathways, are displayed in the stratigraphic plot (Figure 8). The graphic illustrates the significance level of each group in biological processes individually.
As seen in Figure 8, the group features with a significant p-value (<0.05) according to over-representation analysis were scaled according to −log10(p-value), also known as the ES. A level of significance between 0.05 and 0.01 is indicated with red color, and higher significance is encoded by the blue color. For enhanced visualization of groups, rows (groups) are hierarchically clustered. Post-modification of groups with enrichment analysis contributed to identifying the relationships of pairs and clusters of features at the level of known biological functions and processes. For instance, the pair hsa-miR-486-5p and hsa-miR-144-5p was over-represented in Reactome and KEGG with a significant ES (p-value < 0.01), and in GO and Wikipathways with a significant ES (0.01 < p-value < 0.05).
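As a worked illustration of the over-representation test and the −log10 scaling described above, the sketch below computes an upper-tail hypergeometric p-value and converts it to an ES. The function names and the example numbers are assumptions for illustration; the published analysis relies on dedicated enrichment packages rather than this code.

```python
from math import comb, log10


def hypergeom_upper_tail(population: int, annotated: int, drawn: int, overlap: int) -> float:
    """P(X >= overlap) when `drawn` genes are sampled without replacement.

    population: number of background genes
    annotated:  background genes annotated to the term (e.g., a pathway)
    drawn:      number of genes in the group being tested
    overlap:    genes in the group that are annotated to the term
    """
    if overlap <= 0:
        return 1.0
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def enrichment_score(p_value: float) -> float:
    """ES as used in the text: the negative base-10 logarithm of the p-value."""
    return -log10(p_value)


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background genes, 150 in the pathway,
    # a group of 12 genes, 4 of which fall in the pathway.
    p = hypergeom_upper_tail(20000, 150, 12, 4)
    if p < 0.05:
        print(f"p = {p:.3g}, ES = {enrichment_score(p):.2f}")
```

Under this scaling, a p-value of 0.05 corresponds to an ES of about 1.3 and a p-value of 0.01 to an ES of 2, which are the two significance bands distinguished in the text.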
The groups were clustered by similarities based on their biological functions, so that the effects of all the biological processes can be observed together. For instance, hsa-miR-486-5p, hsa-miR-144-5p, and hsa-miR-451a are located in the same cluster, and they have high ES. In the biological literature, the term “cluster” is defined as two or more miRNAs with close physical distance. The three miRNAs hsa-miR-99a, hsa-let-7c, and hsa-miR-125b-2, located within a long noncoding RNA (LINC00478 on chr 21), are referred to in the literature as the 99a/let-7c/125b-2 miRNA cluster, and they are expressed from the primary miRNA cluster MIR99AHG [48]. It is reported that the noncoding RNA (LINC00478) inhibits the activity of MYC targets, thereby suppressing metastasis in breast cancer [49].
As shown in Figure 8, two members of this cluster were identified within our functionally significant cluster. Additionally, it is indicated that LINC00478-mediated upregulation of MMP9 expression in bladder cancer tissue promotes lncRNA-related cancer development [50]. It is reported that the tumor suppressor cluster miR-144/miR-451a is downregulated in hepatocellular carcinoma via epigenetic silencing [51]. The functional significance of these clusters can help studies on BRCA formulate new research directions. Incorporating PBK into -omics analyses can offer a new perspective on establishing new biological relationships.

4. Discussion

4.1. Biological Validation of G-S-M-E Findings

To validate our findings from a biological perspective, we utilize dbDEMC [52], which is one of the most widely used miRNA–disease association databases. All of the best-scoring miRNAs listed in Table S1 are found in dbDEMC as breast cancer-associated miRNAs with adjusted p-values of <10^−17. We further investigated the roles of the miRNAs hsa-miR-139-5p and hsa-miR-378a-5p, which appeared in all iterations. Considering the inhibitory role of hsa-miR-139-5p in invasion and metastasis in human triple-negative breast cancer [53] and the regulatory role of hsa-miR-378a-5p in mitotic fidelity [54], downregulation of these key miRNAs promotes tumor progression. Contrary to the inhibitory role of these two identified miRNAs, the overexpression of hsa-miR-196a-5p promotes tumor progression in BRCA [55]. G-S-M-E identified hsa-miR-196a-5p as a significant miRNA for the BRCA dataset, as shown in Table S1.
From a gene-centric point of view, G-S-M-E reports the following 10 genes as the most frequently detected genes: ADRB2, BTNL9, EBF3, GPR146, HSPB6, KIAA0408, LDB2, LRRN4CL, PDE2A, and TGFBR2. The β2-adrenergic receptor (ADRB2) affects regulatory functions in the ERK/COX-2 signaling pathway in breast cancer [56]. It is reported that BTNL9 is downregulated in BRCA, and it has a role in P53/CDC25C and P53/GADD45 pathways [57]. While G protein-coupled receptor 146 (GPR146) is upregulated in BRCA [58], LDB2 downregulates the estrogen receptor (ERα) activity in BRCA [59]. Another study reported the significant role of TGFBR2 in the TGF-β signaling pathway in BRCA [60].

4.2. Functional Validation of Post-Transcriptional Regulation via the Clinical Proteomic Tumor Analysis Consortium (CPTAC): Application to LUAD and PDAC

To validate the efficacy of the multi-omics network fusion framework and the functional relevance of post-transcriptional regulation, we further incorporated proteomic data obtained through the Clinical Proteomic Tumor Analysis Consortium (CPTAC). We obtained level 3 data for two different cancer types, lung adenocarcinoma (LUAD) and pancreatic ductal adenocarcinoma (PDAC). The data for each tumor type consisted of a matched set of samples covering two modalities: miRNA expression and proteomics. Applying the G-S-M-E framework to the CPTAC cohorts, integrating miRNA and protein data, supported the effectiveness of the model in a two-omics setting. The performance metrics of the model using the CPTAC datasets are presented in Figure S4. On the LUAD dataset, the model showed a remarkable discriminative capability, reaching a peak AUC of 0.93 and a maximum F1-score of 0.88 (the latter obtained using the Random Forest classifier on the integrated features). Similarly, on the PDAC dataset, the model retained a high performance level, reaching a peak AUC of 0.91 and a maximum F1-score of 0.84.
The G-S-M-E tool analyzes biological data by exploring features that operate together as a group, in contrast to evaluating them individually. In addition, the tool facilitates the selection or filtering of the groups based on the biological functionality within cellular mechanisms. Considering the complexity of disease mechanisms, leveraging functional groups using prior biological knowledge (PBK) helps researchers to understand the dynamics among and within biological groups. This insight can help to develop strategies for disease identification and classification.

5. Conclusions

In summary, we have developed a novel tool called G-S-M-E which incorporates several PBK databases into multi-omics data analysis by utilizing supervised ML and statistical methods. The first objective of the tool is to integrate multi-omics datasets utilizing scoring and ranking information over the iterations. Many studies disregarded the potential benefits of cross-validation iterations in revealing the hidden patterns within the features. However, our study aims to track the scores of each group across multiple iterations so that score vectors with a length of 100 are generated for the groups. A heatmap is then used to identify patterns among the groups based on these score vectors.
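The idea of tracking group scores across cross-validation iterations can be sketched as follows. This is only an illustration of the bookkeeping, not the G-S-M-E scoring component: the function name, the training fraction, the seed, and the `score_group` callable are hypothetical placeholders for whatever scoring procedure is applied to each group on the training split.

```python
import random
from typing import Callable, Dict, List, Sequence


def mccv_score_vectors(
    sample_ids: Sequence[str],
    groups: Dict[str, List[str]],
    score_group: Callable[[List[str], List[str]], float],
    n_iter: int = 100,
    train_fraction: float = 0.9,  # assumed value, not taken from the article
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Run Monte Carlo cross-validation and keep one score per group per iteration.

    Each iteration draws a fresh random training subset (without replacement),
    scores every group on it, and appends the score to that group's vector,
    so every vector ends up with length n_iter.
    """
    rng = random.Random(seed)
    n_train = max(1, int(round(train_fraction * len(sample_ids))))
    vectors: Dict[str, List[float]] = {name: [] for name in groups}
    for _ in range(n_iter):
        train = rng.sample(list(sample_ids), n_train)
        for name, features in groups.items():
            vectors[name].append(score_group(features, train))
    return vectors


if __name__ == "__main__":
    ids = [f"sample{i}" for i in range(20)]
    toy_groups = {"groupA": ["miR-A", "GENE1"], "groupB": ["miR-B", "GENE2", "GENE3"]}

    def toy_score(features: List[str], train: List[str]) -> float:
        # Placeholder score; a real scorer would train and evaluate a classifier.
        return len(features) / len(train)

    vectors = mccv_score_vectors(ids, toy_groups, toy_score)
    print({name: len(vec) for name, vec in vectors.items()})
```

The resulting vectors, one per group, are the kind of input that a heatmap with hierarchical clustering can then organize to expose co-occurrence patterns.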
The collective behavior of the groups is analyzed by clustering the groups based on the ES reflecting the biological function. The analysis is visualized with a stratigraphic plot of group features, which illustrates the enhancements of the prediction model by incorporating the PBK. Another key feature of the tool is its focus on the “collective interaction of miRNAs and mRNAs in gene regulatory networks”. Here, groups are constructed by miRNA and mRNA features. Then, all of the features within the groups are evaluated at group level for further significance, scoring, and enrichment analyses.
In the original G-S-M approach, the scores are assigned to each group via an internal ML cross-validation method in which computational measurements are utilized. In contrast, in this study, we present a novel component called E, which incorporates PBK to score the groups. The E component integrated into the existing workflow utilizes enriched pathways from various databases, such as KEGG and Reactome, to score the groups based on their functional significance in a biological context.
In summary, this study introduces a novel prior knowledge-driven multi-omics data analysis which utilizes biological function- and process-related knowledge from different sources to improve the developed model; thus, it provides key biomarkers for the disease under study.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app152312669/s1, Figure S1: Receiver operating characteristic (ROC) curves for model evaluation across cancer types; Figure S2: Comparison of the G-S-M-E tool with state-of-the-art multi-omics integration methods, MOFA (multi-omics factor analysis) and SNF (similarity network fusion); Figure S3: Performance assessment of G-S-M-E for building miRNA–target groups in BRCA: Correlation-derived miRNA–target groups and validated miRNA–target groups; Figure S4: Performance of the G-S-M-E framework on CPTAC multi-omics data (miRNA and protein fusion): lung adenocarcinoma (LUAD) and pancreatic ductal adenocarcinoma (PDAC); Table S1: Characteristics of significant groups, i.e., frequency of the group, average score, associated genes, iteration information and rank list; Table S2: Output groups and their associated features identified by the G-S-M-E tool, confirmed by miRTarBase in BRCA.

Author Contributions

Conceptualization, M.U.Y. and M.Y.; methodology, M.U.Y. and M.Y.; software, M.U.Y. and M.Y.; validation, M.U.Y.; formal analysis, M.U.Y. and B.B.-G.; investigation, B.B.-G., M.U.Y. and M.Y.; resources, M.U.Y.; data curation, M.U.Y.; writing—original draft preparation, B.B.-G., M.U.Y.; writing—review and editing, B.B.-G., M.U.Y. and M.Y.; visualization, M.U.Y.; supervision, M.Y. and B.B.-G.; project administration, M.U.Y.; funding acquisition, M.Y. and B.B.-G. All authors have read and agreed to the published version of the manuscript.

Funding

The work of MY was supported by Zefat Academic College. The work of BB-G was supported by the Abdullah Gul University Support Foundation (AGUV).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The G-S-M-E tool, version 1.0, along with the provided Supplementary Materials, has been made accessible on GitHub (https://github.com/malikyousef/G-S-M-E, accessed on 25 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DT: Decision Tree
GBT: Gradient Boosting Trees
MCCV: Monte Carlo Cross-validation
NB: Naïve Bayes
ORA: Over-representation Analysis
PBK: Prior Biological Knowledge
PNN: Probabilistic Neural Network
RF: Random Forest
SVM: Support Vector Machine
TE: Tree Ensemble
TCGA: The Cancer Genome Atlas

References

  1. Govindaraj, V.; Kar, S. Role of microRNAs in oncogenesis: Insights from computational and systems-level modeling approaches. Comput. Syst. Oncol. 2021, 1, e1028.
  2. Martin, H.C.; Wani, S.; Steptoe, A.L.; Krishnan, K.; Nones, K.; Nourbakhsh, E.; Vlassov, A.; Grimmond, S.M.; Cloonan, N. Imperfect centered miRNA binding sites are common and can mediate repression of target mRNAs. Genome Biol. 2014, 15, R51.
  3. Ha, M.; Kim, V.N. Regulation of microRNA biogenesis. Nat. Rev. Mol. Cell Biol. 2014, 15, 509–524.
  4. Broughton, J.P.; Lovci, M.T.; Huang, J.L.; Yeo, G.W.; Pasquinelli, A.E. Pairing beyond the Seed Supports MicroRNA Targeting Specificity. Mol. Cell 2016, 64, 320–333.
  5. Yang, C.; Tabatabaei, S.N.; Ruan, X.; Hardy, P. The Dual Regulatory Role of MiR-181a in Breast Cancer. Cell. Physiol. Biochem. 2017, 44, 843–856.
  6. Jurkovicova, D.; Smolkova, B.; Magyerkova, M.; Sestakova, Z.; Horvathova Kajabova, V.; Kulcsar, L.; Zmetakova, I.; Kalinkova, L.; Krivulcik, T.; Karaba, M.; et al. Down-regulation of traditional oncomiRs in plasma of breast cancer patients. Oncotarget 2017, 8, 77369–77384.
  7. Chen, Y.; Li, Y.; Narayan, R.; Subramanian, A.; Xie, X. Gene expression inference with deep learning. Bioinformatics 2016, 32, 1832–1839.
  8. Frommlet, F.; Szulc, P.; König, F.; Bogdan, M. Selecting predictive biomarkers from genomic data. PLoS ONE 2022, 17, e0269369.
  9. Cai, Z.; Poulos, R.C.; Liu, J.; Zhong, Q. Machine learning for multi-omics data integration in cancer. iScience 2022, 25, 103798.
  10. Baião, A.R.; Cai, Z.; Poulos, R.C.; Robinson, P.J.; Reddel, R.R.; Zhong, Q.; Vinga, S.; Gonçalves, E. A technical review of multi-omics data integration methods: From classical statistical to deep generative approaches. Brief. Bioinform. 2025, 26, bbaf355.
  11. Shomorony, I.; Cirulli, E.T.; Huang, L.; Napier, L.A.; Heister, R.R.; Hicks, M.; Cohen, I.V.; Yu, H.-C.; Swisher, C.L.; Schenker-Ahmed, N.M.; et al. An unsupervised learning approach to identify novel signatures of health and disease from multimodal data. Genome Med. 2020, 12, 7.
  12. Zhang, Y.; Kiryu, H. MODEC: An unsupervised clustering method integrating omics data for identifying cancer subtypes. Brief. Bioinform. 2022, 23, bbac372.
  13. Albaradei, S.; Thafar, M.; Alsaedi, A.; Van Neste, C.; Gojobori, T.; Essack, M.; Gao, X. Machine learning and deep learning methods that use omics data for metastasis prediction. Comput. Struct. Biotechnol. J. 2021, 19, 5008–5018.
  14. Feldner-Busztin, D.; Firbas Nisantzis, P.; Edmunds, S.J.; Boza, G.; Racimo, F.; Gopalakrishnan, S.; Limborg, M.T.; Lahti, L.; de Polavieja, G.G. Dealing with dimensionality: The application of machine learning to multi-omics data. Bioinformatics 2023, 39, btad021.
  15. Valous, N.A.; Popp, F.; Zörnig, I.; Jäger, D.; Charoentong, P. Graph machine learning for integrated multi-omics analysis. Br. J. Cancer 2024, 131, 205–211.
  16. Chen, F.; Cai, G.; Li, Y.; Ou-Yang, L. SpaFusion: A multi-level fusion model for clustering spatial multi-omics data. Inf. Fusion 2025, 124, 103372.
  17. Deng, Z.; Wu, J.; Chen, X.; Li, G.; Liu, J.; Hu, Z.; Li, R.; Deng, W. MNMO: Discover driver genes from a multi-omics data based-multi-layer network. Bioinformatics 2025, 41, btaf134.
  18. Kumar, R.; Romano, J.D.; Ritchie, M.D. Network-based analyses of multiomics data in biomedicine. BioData Min. 2025, 18, 37.
  19. Jiang, W.; Ye, W.; Tan, X.; Bao, Y.-J. Network-based multi-omics integrative analysis methods in drug discovery: A systematic review. BioData Min. 2025, 18, 27.
  20. Dimitrakopoulos, C.; Hindupur, S.K.; Häfliger, L.; Behr, J.; Montazeri, H.; Hall, M.N.; Beerenwinkel, N. Network-based integration of multi-omics data for prioritizing cancer genes. Bioinformatics 2018, 34, 2441–2448.
  21. Wang, T.; Shao, W.; Huang, Z.; Tang, H.; Zhang, J.; Ding, Z.; Huang, K. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat. Commun. 2021, 12, 3445.
  22. Jarada, T.N.; Rokne, J.G.; Alhajj, R. SNF-NN: Computational method to predict drug-disease interactions using similarity network fusion and neural networks. BMC Bioinform. 2021, 22, 28.
  23. Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017, 45, D353–D361.
  24. Wang, Y.; Yang, S.; Zhao, J.; Du, W.; Liang, Y.; Wang, C.; Zhou, F.; Tian, Y.; Ma, Q. Using Machine Learning to Measure Relatedness Between Genes: A Multi-Features Model. Sci. Rep. 2019, 9, 4192.
  25. Huang, H.-Y.; Lin, Y.-C.-D.; Li, J.; Huang, K.-Y.; Shrestha, S.; Hong, H.-C.; Tang, Y.; Chen, Y.-G.; Jin, C.-N.; Yu, Y.; et al. miRTarBase 2020: Updates to the experimentally validated microRNA–target interaction database. Nucleic Acids Res. 2019, 48, gkz896.
  26. Piñero, J.; Ramírez-Anguita, J.M.; Saüch-Pitarch, J.; Ronzano, F.; Centeno, E.; Sanz, F.; Furlong, L.I. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2019, 48, D845–D855.
  27. The Gene Ontology Consortium; Aleksander, S.A.; Balhoff, J.; Carbon, S.; Cherry, J.M.; Drabkin, H.J.; Ebert, D.; Feuermann, M.; Gaudet, P.; Harris, N.L.; et al. The Gene Ontology knowledgebase in 2023. Genetics 2023, 224, iyad031.
  28. Wang, S.; Tang, X.; Qin, L.; Shi, W.; Bian, S.; Wang, Z.; Wang, Q.; Wang, X.; Gu, J.; Hao, B.; et al. Integrative Analysis Extracts a Core ceRNA Network of the Fetal Hippocampus With Down Syndrome. Front. Genet. 2020, 11, 565955.
  29. Xi, Y.; Fowdur, M.; Liu, Y.; Wu, H.; He, M.; Zhao, J. Differential expression and bioinformatics analysis of circRNA in osteosarcoma. Biosci. Rep. 2019, 39, BSR20181514.
  30. Piñero, J.; Saüch, J.; Sanz, F.; Furlong, L.I. The DisGeNET cytoscape app: Exploring and visualizing disease genomics data. Comput. Struct. Biotechnol. J. 2021, 19, 2960–2967.
  31. Sticht, C.; De La Torre, C.; Parveen, A.; Gretz, N. miRWalk: An online resource for prediction of microRNA binding sites. PLoS ONE 2018, 13, e0206239.
  32. Benedetti, E.; Pučić-Baković, M.; Keser, T.; Gerstner, N.; Büyüközkan, M.; Štambuk, T.; Selman, M.H.J.; Rudan, I.; Polašek, O.; Hayward, C.; et al. A strategy to incorporate prior knowledge into correlation network cutoff selection. Nat. Commun. 2020, 11, 5153.
  33. Zuo, Y.; Cui, Y.; Yu, G.; Li, R.; Ressom, H.W. Incorporating prior biological knowledge for network-based differential gene expression analysis using differentially weighted graphical LASSO. BMC Bioinform. 2017, 18, 99.
  34. Garcia-Moreno, A.; López-Domínguez, R.; Villatoro-García, J.A.; Ramirez-Mena, A.; Aparicio-Puerta, E.; Hackenberg, M.; Pascual-Montano, A.; Carmona-Saez, P. Functional Enrichment Analysis of Regulatory Elements. Biomedicines 2022, 10, 590.
  35. Ietswaart, R.; Gyori, B.M.; Bachman, J.A.; Sorger, P.K.; Churchman, L.S. GeneWalk identifies relevant gene functions for a biological context using network representation learning. Genome Biol. 2021, 22, 55.
  36. Pomyen, Y.; Segura, M.; Ebbels, T.M.D.; Keun, H.C. Over-representation of correlation analysis (ORCA): A method for identifying associations between variable sets. Bioinformatics 2015, 31, 102–108.
  37. Reimand, J.; Isserlin, R.; Voisin, V.; Kucera, M.; Tannus-Lopes, C.; Rostamianfar, A.; Wadi, L.; Meyer, M.; Wong, J.; Xu, C.; et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat. Protoc. 2019, 14, 482–517.
  38. Yousef, M.; Allmer, J.; İnal, Y.; Gungor, B.B. G-S-M: A Comprehensive Framework for Integrative Feature Selection in Omics Data Analysis and Beyond. bioRxiv 2024, 585514.
  39. Xu, Q.-S.; Liang, Y.-Z. Monte Carlo cross validation. Chemom. Intell. Lab. Syst. 2001, 56, 1–11.
  40. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140.
  41. Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Thiel, K.; Wiswedel, B. KNIME—The Konstanz information miner: Version 2.0 and beyond. ACM SIGKDD Explor. Newsl. 2009, 11, 26–31.
  42. Rahnenführer, J.; De Bin, R.; Benner, A.; Ambrogi, F.; Lusa, L.; Boulesteix, A.-L.; Migliavacca, E.; Binder, H.; Michiels, S.; Sauerbrei, W.; et al. Statistical analysis of high-dimensional biomedical data: A gentle introduction to analytical goals, common approaches and challenges. BMC Med. 2023, 21, 182.
  43. Wu, T.; Hu, E.; Xu, S.; Chen, M.; Guo, P.; Dai, Z.; Feng, T.; Zhou, L.; Tang, W.; Zhan, L.; et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation 2021, 2, 100141.
  44. Yu, G.; He, Q.-Y. ReactomePA: An R/Bioconductor package for reactome pathway analysis and visualization. Mol. Biosyst. 2016, 12, 477–479.
  45. Dunnington, D.W.; Libera, N.; Kurek, J.; Spooner, I.S.; Gagnon, G.A. tidypaleo: Visualizing Paleoenvironmental Archives Using ggplot2. J. Stat. Softw. 2022, 101, 1–20.
  46. Wickham, H. Ggplot2: Elegant Graphics for Data Analysis, 2nd ed.; Use R! Springer: Cham, Switzerland, 2016.
  47. Wang, J.; Yan, Y.; Zhang, Z.; Li, Y. Role of miR-10b-5p in the prognosis of breast cancer. PeerJ 2019, 7, e7728.
  48. Søkilde, R.; Persson, H.; Ehinger, A.; Pirona, A.C.; Fernö, M.; Hegardt, C.; Larsson, C.; Loman, N.; Malmberg, M.; Rydén, L.; et al. Refinement of breast cancer molecular classification by miRNA expression profiles. BMC Genom. 2019, 20, 503.
  49. Guo, R.; Su, Y.; Zhang, Q.; Xiu, B.; Huang, S.; Chi, W.; Zhang, L.; Li, L.; Hou, J.; Wang, J.; et al. LINC00478-derived novel cytoplasmic lncRNA LacRNA stabilizes PHB2 and suppresses breast cancer metastasis via repressing MYC targets. J. Transl. Med. 2023, 21, 120.
  50. Yang, H.-J.; Liu, T.; Xiong, Y. Anti-cancer effect of LINC00478 in bladder cancer correlates with KDM1A-dependent MMP9 demethylation. Cell Death Discov. 2022, 8, 242.
  51. Zhao, J.; Li, H.; Zhao, S.; Wang, E.; Zhu, J.; Feng, D.; Zhu, Y.; Dou, W.; Fan, Q.; Hu, J.; et al. Epigenetic silencing of miR-144/451a cluster contributes to HCC progression via paracrine HGF/MIF-mediated TAM remodeling. Mol. Cancer 2021, 20, 46.
  52. Yang, Z.; Wu, L.; Wang, A.; Tang, W.; Zhao, Y.; Zhao, H.; Teschendorff, A.E. dbDEMC 2.0: Updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res. 2017, 45, D812–D818.
  53. Krishnan, K.; Steptoe, A.L.; Martin, H.C.; Pattabiraman, D.R.; Nones, K.; Waddell, N.; Mariasegaram, M.; Simpson, P.T.; Lakhani, S.R.; Vlassov, A.; et al. miR-139-5p is a regulator of metastatic pathways in breast cancer. RNA 2013, 19, 1767–1780.
  54. Winsel, S.; Mäki-Jouppila, J.; Tambe, M.; Aure, M.R.; Pruikkonen, S.; Salmela, A.-L.; Halonen, T.; Leivonen, S.-K.; Kallio, L.; Børresen-Dale, A.-L. Excess of miRNA-378a-5p perturbs mitotic fidelity and correlates with breast cancer tumourigenesis in vivo. Br. J. Cancer 2014, 111, 2142–2151.
  55. Jiang, C.-F.; Shi, Z.-M.; Li, D.-M.; Qian, Y.-C.; Ren, Y.; Bai, X.-M.; Xie, Y.-X.; Wang, L.; Ge, X.; Liu, W.-T.; et al. Estrogen-induced miR-196a elevation promotes tumor growth and metastasis via targeting SPRED1 in breast cancer. Mol. Cancer 2018, 17, 83.
  56. Xie, W.-Y.; He, R.-H.; Zhang, J.; He, Y.-J.; Wan, Z.; Zhou, C.-F.; Tang, Y.-J.; Li, Z.; Mcleod, H.L.; Liu, J. β-blockers inhibit the viability of breast cancer cells by regulating the ERK/COX-2 signaling pathway and the drug response is affected by ADRB2 single-nucleotide polymorphisms. Oncol. Rep. 2019, 41, 341–350.
  57. Mo, Q.; Xu, K.; Luo, C.; Zhang, Q.; Wang, L.; Ren, G. BTNL9 is frequently downregulated and inhibits proliferation and metastasis via the P53/CDC25C and P53/GADD45 pathways in breast cancer. Biochem. Biophys. Res. Commun. 2021, 553, 17–24.
  58. Cui, P.; Chen, Y.; Waili, N.; Li, Y.; Ma, C.; Li, Y. Associations of serum C-peptide and insulin-like growth factor binding proteins-3 with breast cancer deaths. PLoS ONE 2020, 15, e0242310.
  59. Johnsen, S.A.; Güngör, C.; Prenzel, T.; Riethdorf, S.; Riethdorf, L.; Taniguchi-Ishigaki, N.; Rau, T.; Tursun, B.; Furlow, J.D.; Sauter, G.; et al. Regulation of Estrogen-Dependent Transcription by the LIM Cofactors CLIM and RLIM in Breast Cancer. Cancer Res. 2009, 69, 128–136.
  60. Ma, X.; Beeghly-Fadiel, A.; Lu, W.; Shi, J.; Xiang, Y.-B.; Cai, Q.; Shen, H.; Shen, C.-Y.; Ren, Z.; Matsuo, K.; et al. Pathway Analyses Identify TGFBR2 as Potential Breast Cancer Susceptibility Gene: Results from a Consortium Study among Asians. Cancer Epidemiol. Biomark. Prev. 2012, 21, 1176–1184.
Figure 1. The workflow of G-S-M-E comprises G, S, M, P, E components indicated by circles. The panels on the left depict the input data sources: TCGA and prior biological knowledge. The panel on the right illustrates the developed algorithm within G-S-M-E. The components—G, S, M, P, E—are designed to integrate the miRNA–mRNA datasets, incorporating prior biological knowledge. Component G constructs the groups that include miRNA and its associated gene(s). Component S assigns scores to these groups based on expression profile information. Component M trains classifiers using the most significant groups. Component P detects patterns through iteration and ranking information obtained from Component M. The last component (E) enriches the groups by assigning enrichment scores via pathway enrichment analysis. The bottom layer displays the generated output files, which are the ‘Enriched groups table’ and ‘Performance table’.
Figure 2. Flowchart for training and testing the model in the G-S-M-E tool. The M component iteratively builds classifier models by incorporating top-ranking groups.
Figure 3. Flowchart for detecting enriched group features based on prior biological knowledge (PBK) via hypergeometric distribution analysis.
Figure 4. Frequency distribution of gene-associated miRNA groups by size. This histogram illustrates the relationship of the number of features within a biological group and frequency of groups at that size. The group size (miRNA-gene number within a group) is shown on the x-axis, while frequency of group size is shown on the y-axis.
Figure 5. Performance evaluation of machine learning models for biological groups generated with different correlation thresholds (tested on the BRCA dataset). The x-axis refers to the correlation thresholds, while the y-axis shows the accuracy and AUC-ROC scores. The average accuracy and AUC metrics are calculated over 100 iterations. The color key encodes the learner (classifier) used in the analysis.
Figure 6. G-S-M-E performance across multi-omics datasets. This figure illustrates the performance of various classifiers (x-axis) for each of the following cancer types: BRCA, KIRP, LIHC, LUAD, STAD, PRAD, THCA, UCEC. The y-axis quantifies the performance metrics (accuracy, AUC-ROC, Cohen’s kappa, etc.). The metrics are distinguished by color.
Figure 7. Heatmap of ranked group features for each iteration. The co-occurrence patterns of groups are visualized with ranks. The color key encodes the ranks of the features from dark blue to yellow. Dark blue shows the top-ranked features. Light yellow indicates the non-detected groups for the corresponding iteration.
Figure 8. Stratigraphic plot of group features with Enrichment Scores (ESs) for BRCA. This figure illustrates the significance level of each group for biological processes, using terms from GO, KEGG, Reactome, and WikiPathways. Significant groups are shown on a color scale from blue to red, with red indicating higher significance. Non-significant groups are not displayed in the graph for the corresponding terms.
Table 1. Names and abbreviations (TCGA IDs) of the cancer datasets used within this study to test the G-S-M-E tool. The numbers of cancer (case) and normal (control) samples are given for each cancer type.
Cancer Type | TCGA ID | Number of Case Samples | Number of Control Samples
Breast invasive carcinoma | BRCA | 760 | 87
Kidney renal papillary cell carcinoma | KIRP | 290 | 32
Liver hepatocellular carcinoma | LIHC | 367 | 44
Lung adenocarcinoma | LUAD | 449 | 20
Prostate adenocarcinoma | PRAD | 493 | 52
Stomach adenocarcinoma | STAD | 370 | 35
Thyroid carcinoma | THCA | 506 | 59
Uterine corpus endometrial carcinoma | UCEC | 174 | 23
Table 2. A sample table that shows biological groups of correlated miRNA and gene pairs created by the G component. Each group features a unique miRNA name and the list of genes that are highly correlated with it.
Group Name | miRNA Associated Genes
hsa-miR-139-5p | LRRN4CL, LOC90586, NRN1, TGFBR2, …
hsa-miR-378a-5p | SYNPO, C17orf103, HSPB6, …
hsa-miR-10b-5p | IL33, ADAMTS5, ABCA10, DMD, …
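The grouping shown in Table 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Pearson coefficient, the use of the absolute correlation value, and the names `pearson` and `build_groups` are hypothetical choices, not details confirmed by the table. The expression values and identifiers in the example are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def build_groups(mirna_expr, gene_expr, threshold):
    """Map each miRNA to the genes whose |correlation| with it reaches the threshold."""
    groups = {}
    for mirna, m_values in mirna_expr.items():
        genes = [
            gene
            for gene, g_values in gene_expr.items()
            if abs(pearson(m_values, g_values)) >= threshold
        ]
        if genes:  # keep only miRNAs that have at least one correlated gene
            groups[mirna] = genes
    return groups


# Made-up expression profiles over four samples (hypothetical identifiers).
mirna_expr = {"miR-a": [1, 2, 3, 4]}
gene_expr = {
    "G1": [2, 4, 6, 8],    # perfectly correlated with miR-a
    "G2": [1, -1, 1, -1],  # weakly correlated, filtered out
    "G3": [8, 6, 4, 2],    # perfectly anti-correlated, kept because of abs()
}
groups = build_groups(mirna_expr, gene_expr, threshold=0.9)
```

With these toy profiles, `groups` is `{"miR-a": ["G1", "G3"]}`. Dropping the `abs()` call would keep only positively correlated genes, which may matter because miRNAs often repress their targets.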
Table 3. Assigning ranks to groups based on scores. For each score interval, the corresponding rank is presented in the table.
Scorej | Rank (Scorej)
≥0.95 | 1
≥0.90 and <0.95 | 2
≥0.50 and <0.55 | 10
<0.50 | 0
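The score-to-rank mapping of Table 3 can be written as a short lookup. The sketch below is in Python and assumes that the intervals not listed in the table (ranks 3 to 9) continue in steps of 0.05, which is consistent with the rows shown but is not stated explicitly. The name `score_to_rank` is hypothetical.

```python
# Lower bounds of the score intervals, from rank 1 down to rank 10.
# Ranks 3-9 are an assumption: the table lists only ranks 1, 2, 10 and 0.
RANK_LOWER_BOUNDS = (0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50)


def score_to_rank(score):
    """Return the rank of a group score: 1 is best, 10 is lowest, 0 is below 0.50."""
    for rank, lower_bound in enumerate(RANK_LOWER_BOUNDS, start=1):
        if score >= lower_bound:
            return rank
    return 0


ranks = [score_to_rank(s) for s in (0.97, 0.92, 0.52, 0.49)]
```

Here `ranks` is `[1, 2, 10, 0]`, matching the four rows given in Table 3.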
Table 4. Determination of significance level for groups in the P component.
P (gi, inj) | Significance Level
1 | High significance (ranking) level of gi in iteration inj
2 to 10 | Gradually decreasing significance level of gi in iteration inj
0 | Group gi does not appear in iteration inj
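The values P (gi, inj) described in Table 4 can be collected into a group-by-iteration matrix of the kind visualized in Figure 7. The Python sketch below assumes that each iteration is available as a mapping from group name to rank; the function name `build_pattern_matrix` and the input layout are hypothetical.

```python
def build_pattern_matrix(ranks_per_iteration):
    """Build P(g_i, in_j): one row per group, one column per iteration.

    ranks_per_iteration: list of dicts, each mapping group name -> rank (1..10)
    A group that does not appear in an iteration gets the value 0.
    """
    groups = sorted({g for iteration in ranks_per_iteration for g in iteration})
    return {
        group: [iteration.get(group, 0) for iteration in ranks_per_iteration]
        for group in groups
    }


# Made-up ranks for three iterations and three hypothetical groups.
iterations = [
    {"g1": 1, "g2": 3},
    {"g1": 2},
    {"g2": 1, "g3": 10},
]
pattern = build_pattern_matrix(iterations)
```

In this example `pattern["g1"]` is `[1, 2, 0]`: the group ranks highly in the first two iterations and is absent from the third. Rows with similar patterns point to groups that tend to co-occur across iterations.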
