Bioinformatics Strategies to Identify Shared Molecular Biomarkers That Link Ischemic Stroke and Moyamoya Disease with Glioblastoma

Expanding data suggest that glioblastoma is accountable for the growing prevalence of various forms of stroke formation, such as ischemic stroke and moyamoya disease. However, the underlying deterministic details are still unspecified. Bioinformatics approaches are designed to investigate the relationships between two pathogens as well as fill this study void. Glioblastoma is a form of cancer that typically occurs in the brain or spinal cord and is highly destructive. A stroke occurs when a brain region starts to lose blood circulation and prevents functioning. Moyamoya disorder is a recurrent and recurring arterial disorder of the brain. To begin, adequate gene expression datasets on glioblastoma, ischemic stroke, and moyamoya disease were gathered from various repositories. Then, the association between glioblastoma, ischemic stroke, and moyamoya was established using the existing pipelines. The framework was developed as a generalized workflow to allow for the aggregation of transcriptomic gene expression across specific tissue; Gene Ontology (GO) and biological pathway, as well as the validation of such data, are carried out using enrichment studies such as protein–protein interaction and gold benchmark databases. The results contribute to a more profound knowledge of the disease mechanisms and unveil the projected correlations among the diseases.


Introduction
Glioblastoma, generally regarded as glioblastoma-multiforme (GBM), is the most deadly form of cancer in the brain region throughout the world [1]. Percival Bailey and Harvey Cushing introduced the name glioblastoma multiforme in 1926, emphasizing the hypothesis that the cancerous cells arise from Gila, fundamental drug precursors (glioblasts). Additionally, it is an utterly volatile presentation caused by necrosis, hemorrhage, and cysts (multiform) [2]. GBM has vague signs or symptoms initially. Headaches, mood changes, fatigue, and symbols close to those of a stroke are all possible symptoms [3]. Symptoms lation, signaling pathways, and protein-protein interactions analyzed from disease-affected tissues. Results were then validated using experimentally validated gold benchmark databases and literature such as DisGeNET, db-GaP, and Rare-Diseases-AutoRIF.

Collected Datasets
The datasets included for the analysis were derived from the National Center for Biotechnology Information (NCBI), a well-known Gene Expression Omnibus (GEO) database. Each disease's query returns a series of datasets. If the dataset is obtained from a non-human species and does not meet two criteria for each group, such as control samples (healthy) and case samples (patients), it is not preferred for our study. Additionally, we discarded repeated datasets, unfavorable formatting, or insignificant experimental emphasis. We also excluded datasets with sample sizes smaller than our preselected cutoff sample size of three for each group. Linear regression is used to analyze the transcriptomic differential expression of the selected GEO datasets, and a linear model may have appropriate analytical strength when the sample size for either the healthy or the patient is three, or higher than three [27]. In addition, we concentrated on a particular cell or tissue type in light of its influence on the course of a disease. This method resulted in selecting three strongly important datasets for glioblastoma, I. stroke, and moyamoya (mm) as well as suitable for the analysis. The datasets for glioblastoma and ischemic stroke are RNA-seq, and the dataset for moyamoya disease is micro-array. As no RNA-seq datasets met our requirements, we used the microarray dataset for moyamoya. We looked for datasets with the lowest amount of biases and distortion for this study. For the analysis, we selected transcriptome RNA-seq/microarray datasets of human participants with the accession numbers GSE106804, GSE56267, and GSE131293, which included both healthy and diseased patients.
The glioblastoma dataset (GSE106804) included gene expression data from the Extracellular Vesicle of 13 glioblastoma patients and 6 healthy controls [28]. GBM is constantly in contact with its underlying tumor microenvironment (TME). Extracellular vesicle has a significant effect on the GBM tumor microenvironment, paving the path for the development of GBM [29,30]. Hence, we selected the dataset. The I. stroke (GSE56267) dataset included gene expression evidence from the cortical tissue of seven I. stroke patients and six healthy controls, whereas cortical neurons depict important intact genome information regarding I. stroke patients [31]. The moyamoya dataset (GSE131293), the only microarray data, included gene expression results from three patients and three stable controls' neural crest stem cells [32].

Preprocessing and Distinction of Differentially Expressed Genes
As mentioned earlier, the datasets were collected from NCBI. We performed differential expression analysis to detect the genes that are noticeably expressed in patients' samples compared to healthy samples. We performed differential expression analysis (DEA) of RNA-seq raw count data using DESeq2, an R package. The internal normalization technique was carried out using DESeq2 and determined the geometric mean of every gene across all samples. Then, the negative binomial distribution, a linear model, was calculated for each gene, considering variability among samples. Finally, notable genes were filtered using the Wald test and we automatically removed low-expressed/outlier genes using Cook's distance [33]. For microarray data, we used Limma, also a linear model, for DEA, which performed a t-test to find the importance of every gene over samples [34]. The code for DEA was implemented in R and can be accessed through our Github repository: https://github.com/hiddenntreasure/glioblastoma, accessed on 11 July 2022.
We used the Z-score transformation (Z mn ) for each disease phenotype to make the gene expression data more comparable. The equation for this transformation is where σ m indicates standard deviation and g mn suggests the magnitude of the gene (m) in the sample (n). Thus, this allows us to directly measure the expression of genes across samples and types of cells from various disorders. We discarded the genes with missing or null values. Two parameters are deployed to derive the most significant/biomarker genes accountable for the emergence of a disease. First, the p-value should be less than 0.05; secondly, the absolute value of the log2-fold change is either 1 or greater/less than 1. Genes with a logFC greater than one are highly expressed compared to the other genes and are known as upregulated (up-reg) genes, whereas downregulated (down-reg) genes are lower expressed in contrast to gene expression arrays and logFC is less than 1. We have several significant genes for each disease that are differentially dysregulated and significantly liable for developing a disease. Then, we identified shared genes between a pair: glioblastoma and I. stroke, as well as glioblastoma and moyamoya.
The prevalent genes in these two pairs of diseases were then used to build a genedisease network (GDN), and different neighbors were found using Jaccard coefficient methods [35], which is the co-occurrence score. In contrast, the edge (connection among genes) predicts the correlation coefficient rate for the nodes (genes): G indicates the total number of genes represented as nodes, and E denotes the number of connections among genes represented as edges. To cross-check illness comorbidity relationships, we used the R programs comoR [36] and POGO [37].

Enrichment Analysis for Significant Gene Ontology and Molecular Pathway Selection
Previously, gene expression profiling generally consisted of a group of genes corresponding to either healthy or affected samples, enlisted in a list L as per their differential expression. A meaningful understanding of this list was extracted. However, in a given biological process, it may provide an insufficient number or an excessive number of statistically significant genes that might fluctuate from one dissertation to another for a given batch of genes [38,39]. However, enrichment analysis denotes a normalized set of genes that employs previously identified molecular pathways or gene expression arrays. Moreover, it defined the group of genes associated with the different genotypes (phenotypes) hypothesis [40].
EnrichR was employed to acquire a deeper insight into the biological pathways and Gene Ontology (GO) terms associated with GBM in relation to I. stroke and mm [41]. It conducts GSEA to classify the DEGs' corresponding pathways and GOs. Compared with a catalog of well-annotated gene sets, such as pathway analysis, it facilitates observing the functional relevance of the given gene set. The pathway is the molecular biology concept, which defines an artificial condensed process model within a cell or tissue [42]. A typical pathway model begins across an external signaling molecule by provoking a specific receptor that triggers a string of proteins connected with each other [43]. The Gene Ontology (GO) is a computational paradigm for representing gene (protein) functions as well as their related connections towards other genes [44]. The hierarchical arrangement of the GO makes it possible to compare proteins annotated with different meanings in ontology as well as have relationships with each other. We focused on four different pathway databases: KEGG [45], BioCarta [46], Reactome [47], and Wiki-Pathways [48]; and biological Process (BP) from Gene Ontology (GO) domain [49].

Analysis of Protein-Protein Interactions (PPIs)
The PPIs are central to all cellular/molecular mechanisms since they constitute the physical interactions between two or even more protein components [50]. We used data from the STRING database [51] and Network Analyst [52] to create PPI networks centered on the connections among various proteins. We used the String Interactome repository from "String-db.org" (accessed on 28 October 2014) with a confidence level of 800 and topological criteria such as degree >15 [51]. Proteins are denoted by colored circles/nodes; conversely, connections of the proteins are characterized by edges.

Analysis of Transcription Factors (TFs) and microRNAs (miRNAs)
We discovered DEGs-TFs, which regulate the identified significant genes (identified from transcriptomic differential analysis) not only at their correct period but also at their suitable volume in a cell throughout the cell's/organism's lifetime, and are responsible for determining the transformation of genetic information from DNA to mRNA at the transcription level. Furthermore, gene-miRNAs were also discovered in order to help researchers by giving insight into the regulatory biomolecules that determine and control RNA splicing and expression of genes at their post-transcriptional level.
EnrichR was deployed to identify the DEGs-TFs and microRNAs [41]. The DEGs-TFs relationship was identified and studied using the JASPAR database [53] and ENCODE [54,55], whereas miRNA-DEGs interactions are found using a well-known database called TarBase [56] and miRTarBase [57]. The topological investigation was carried out using Cytoscape's Network Analyzer and Network Analyst [58,59].

Drug Prediction
Network Analyst was used to identify the possible medications for treating glioblastoma and its associated diseases. The drug was predicted using the DrugBank database version 5.0. [60]. A list of protein-drug interactions was made based on statistical importance. Two protein-drug interactions were predicted for two pairs of cases, such as glioblastoma and I. stroke, and glioblastoma and mm. In our study, we utilized highly interacted shared proteins (hub proteins) found from both pairs of cases. To identify hypothesized selective biomarkers between GBM and I. stroke and GBM and mm, we used gene expression analyses using limma. Moreover, we extracted signaling pathways and GO terminologies from various databases, as well as protein-protein interactions (PPIs), transcription factors (TFs) of genes, and gene-MicroRNAs (miRNAs) that are related to the derived biomarkers. Our network-based method was cross-checked with the three gold standard databases, namely, DisGeNET, db-GaP, and Rare-Diseases-AutoRIF, to validate our biomarker genes and pathways.

Evaluation of Gene Expression
"Expression profiling by high-throughput sequencing" (or RNA-seq) data of GBM was reviewed from the NCBI to categorize and comprehend the gene enrichment that could influence the development of I. stroke and mm. However, due to the unavailability of moyamoya's RNA-seq data, we collected "expression profiling by array" (or microarray) data of moyamoya.
A well-known project called Bioconductor established R packages called Limma and DESeq2 for microarray and RNA-seq data. We used it to perform expression profiling and found 3585 DEGs in glioblastoma with a p-value less than 0.05 and an absolute logFC greater than 1. Whereas 1038 genes are upregulated due to foreign signals increasing the cellular process factor in all genes, 2547 genes are downregulated due to the same component decreasing markedly. Following the statistical study, we identified the most significant DEGs for each disease, such as I. stroke and moyamoya. Table 1 illustrates that 1465 significant DEGs were found in I. stroke, whereas the expression increased (up-reg) in 1120 genes and expression decreased (down-reg) in 345 genes; similarly, 1382 significant DEGs were found in mm, whereas the expression increased (up-reg) in 715 genes and expression decreased (down-reg) in 667 genes. The GSE accession numbers for the selected study are GSE106804 [28], GSE56267 [31], and GSE131293 [32] for glioblastoma, ischemic stroke, and moyamoya, respectively, as shown in Table 1. Due to the proper data availability, we took the dataset from three different cells: extracellular vesicle, cortical ischemic stroke tissue, and neural crest stem cell for GBM, I. stroke, and mm, respectively (Table 1, column 3). However, the findings still show insightful outcomes for our projected hypothesis. Column 2 in Table 1 demonstrates the RNA sequencing technology used to identify the transcriptomic data for each disease in our study. The number of samples for both cases and controls is an essential identifier in identifying associations among diseases because the increasing number of samples enhances the computational power of a dataset. In our study, moyamoya has only three samples, both for control and case, which is the least, whereas the other two diseases have at least six samples for either side. The overall up-and downregulated genes are quite balanced for moyamoya, though not for GBM and I. stroke.

Identified Enriched Pathways and Gene Ontology Terminologies
Pathway enrichment analysis was implemented to better understand the molecular mechanisms/processes that underlie all complicated diseases. Using EnrichR [41], a bioinformatics resource, we conducted a comparison-based enrichment analysis to classify overexpressed pathways in our relationship (GBM and I. stroke; or GBM and MM), and the analysis was performed on top of three different databases (Wikipathways (human-2019) [61], BioCarta (2016) [41], and KEGG (human-2019) [62]) in our experiment. The pathway enrichment experiments were performed using the common DEGs between GBM and its associated diseases (I. stroke and mm). We carried out regulatory research to learn more about the molecular mechanisms that play a role in this comorbidity. Our research identified overexpressed pathways in which DEGs are identified in various disorders and categorized them based on their functional importance. Manual curation was used to limit pathways considered greatly enriched in the typical DEG sets with p-value criteria. The criteria denote that the p-value must be less than 0.05. EnrichR discovered major pathways from KEGG, WikiPathways, and BioCarta databases that are significantly linked to DEGs that are common between GBM and I. stroke pair and GBM and mm pairs. Using the shared 50 genes between GBM and mm, we obtained 149 shared pathways, among which 20 are significant, considering the p-value (<0.05). Similarly, 59 genes are common between GBM and I. stroke; we obtained 217 signaling pathways common between them, and 68 are highly expressed (significant pathways). Thus, ascending sorting of p-value implied retrieving the top 15 significant pathways between (a) GBM and I. stroke- Table 2 and (b) GBM and mm- Table 3.  We also discovered highly expressed Gene Ontology (GO) terms, especially for identifying molecular events associated with a disease. Therefore, popular DEGs between two diseases were employed to obtain the list of GOs associated with a disease. The Enrichr was used to find GO terms enriched by shared DEGs. Enrichr introduces biological processes (BP-2016) that are linked to DEGs so that they can be grouped into functional categories [63,64]. Hence, this helps us learn more about the molecular processes and biological relevance of DEGs. It was then narrowed down to only those processes and terms with a relative p-value below 0.05. Between GBM and mm, 503 GO terminologies are shared, where 138 are significant GO terms (p-value < 0.05). Likewise, GBM and I. stroke have 652 shared GO terms, among which 193 are significant. Tables 4 and 5

Protein-Protein Interactions (PPIs) Analysis
With the use of online-based tools such as STRING and Network Analyst, we built putative PPI networks utilizing our enriched common disease genes. PPIs try to compensate for the organism's so-called interactomics, in which abnormal PPIs cause numerous illnesses. One or more typically linked protein subnetworks are reported to be represented by two diseases. PPI analysis revealed strongly interacting proteins employing topological criteria, such as a degree higher than 15°. Figure 2A shows the PPI network between GBM and mm. The network includes 59 nodes (genes) and 29 edges; the PPI network's enriched p-value is 0.232. Figure 2B   The cytoHubba module was used to explore the most significant hub-proteins based on the simplified PPI networks developed previously [65]. We found 14 hub proteins between GBM and mm using four cytoHubba algorithms, and they are MCC, DMNC, Degree, and EPC (as shown in Figure 3 Similarly, we found 26 hub proteins between GBM and I. stroke, as shown in Figure 4. All the four cytoHubba algorithms share 21 hub proteins: COL1A1, ANXA2, PPBP, SPARC, TIMP1, SERPINE1, PECAM1, HLA-DRA, CXCR4, ALOX5AP, S100A12, BCL2A1, HLA-DQA1, LCP2, GNB5, S100A8, PLEK, ARHGEF9, LCP1, IL2RG, and SLA; two are shared by Degree, MCC, and EPC: TREML1 and F11R; two are shared by Degree, EPC, and DMNC: SERPINA1 and NCF2; and only one is shared by Degree and EPC: ANKRD1. Although further research into the activities of these newly discovered hub proteins is needed, they might be potential therapeutic targets.

Determination of the DEGs' Transcriptional and Post-Translational Regulators
Transcription factors (TFs) are nothing but proteins that govern the expression of the identified significant genes in our case. In other words, the transcriptional process converts genes into RNA or protein products. Transcription factors are found in all living organisms and regulate gene expression. TF genes are significant because they regulate a variety of biological processes [66,67]. miRNA plays a vital role in cellular processes and biochemical and molecular functions [68]. As a result, changes in miRNA levels (enriched miRNA) may affect metabolic processes, signal transmission, and transcription [69]. According to this study, microRNAs play a role in various diverse biological characteristics related to glioblastoma, such as cell growth, incursion, glioma stem cell activity, and angiogenesis (blood vessel formation) [70]. Additionally, miRNA functions may aid in elucidating the dysregulated signaling pathways and provide insight into the development of novel therapeutic and diagnostic procedures [71].

Analysis of the Predicted Drugs
We predicted drugs using shared proteins that resulted from our analysis. A web tool called Network Analyst was employed, which collected data from the DrugBank 5.0 database.
We utilized 26 hub proteins shared between GBM and I. stroke to discover the drugs as represented in Figure 7B. The protein-drug interaction ( Figure 7A) has ten nodes, including two genes (ANXA2 and SERPINE1) and eight chemical compounds (Alteplase, Tenecteplase, Urokinase, Plasmin, Troglitazone, Drotrecogin alfa, Anistreplase, and Reteplase). Similarly, we used 14 shared hub proteins from GBM and mm for drug prediction, as shown in Figure 7A

Validation of Transcriptomic Potential DEGs
We validated our potential DEGs of transcriptomic analysis by using literature-based disease-gene association datasets such as DisGeNET [72], dbGaP [73], and Rare-Diseases-AutoRIF. The data were created and validated using the previous study, including the biomarker genes corresponding to diseases. In order to assess the shared genes' statistical significance and validate our findings, we employed EnrichR [41], an online program. EnrichR utilized the shared genes with the disease-associated gene database to discover the relevant data. Even though EnrichR gives disease-gene information for a variety of disorders, we only take into account the information pertaining to the diseases we identified.
We further confirmed our findings by reviewing previous publications that discovered biomarkers for the diseases. The literature associated with each gene is included in Tables 6  and 7. Ultimately, we created a diseasome network of GBM and its associated neurological and vascular disorders, as shown in Figure 8. We developed this association map from the gold benchmark database and previous literature review using Cytoscape [58]. Table 6. Transcriptomic analysis identifies potential target genes in GBM and mm that have been verified by previous research.

Discussion
According to current research, it is clear that glioblastoma is the most aggressive type of brain cancer. It is also known that glioblastoma is responsible for an increased risk of developing ischemic stroke [9]. Similarly, moyamoya disease develops in brain tumor patients due to cranial irradiation during radiation therapy [128]. Thus, it is possible that ischemic stroke and moyamoya may be formed in glioblastoma patients.
Hence, our study aims to identify genetic relationships between glioblastoma and ischemic stroke as well as glioblastoma and moyamoya. Thus, doctors should be concerned about ischemic stroke and moyamoya in glioblastoma patients. The bioinformatics approach may comprehensively understand the molecular mechanisms in the specified disease progression. In this study, we carried out an investigation on transcriptomic profiles of ischemic stroke, moyamoya, and glioblastoma (as shown Figure 1). Moreover, we predicted the therapeutic drugs for the associations.
For pathways, there are two pathways, named antigen processing and presentation and serotonin and anxiety, that are common in glioblastoma, ischemic stroke, and moyamoya. In IDH-wildtype gliomas, the antigen processing and presentation (APP) score is linked with the immunological score [136]. Antigen processing and presentation, DC pathway, cytokine pathway, and IL-12 pathway were increased in the intracranial arteries of patients with mm in this study [137]. The serotonin and anxiety pathway is known as the monoaminergic system [138]. In addition, ischemic brain injury alters this route, and the monoaminergic system may be a potential therapeutic target for stroke [139]. Therefore, these four pathways can be a therapeutic target in order to prevent ischemic stroke and moyamoya associated with glioblastoma patients. In addition, a pathway named leukocyte transendothelial migration is activated, which is validated by a previous study between glioblastoma and ischemic stroke. It is possible that an aberrant immunological condition and the development of GBM are associated, and the leukocyte transendothelial migratory pathway might be an indicator of that [140].
According to the information presented above, our technique has the potential to disclose some of the essential mechanisms that underlie disease, as well as generate unique theories about disease mechanisms and identify new biomarkers for disease. Genetic data analysis is expected to be crucial for improving predictive medicine and uncovering pathways connecting with glioblastoma, ischemic stroke, and moyamoya, as well as identifying potential therapeutic targets.
We made an effort to use prior research to validate each of our findings. However, there is still a need for more in vivo and in vitro research. Due to their complexity, the doctor must be concerned about ischemic stroke and moyamoya in glioblastoma patients. Moreover, the prevention of ischemic stroke and moyamoya can be made possible by inactivating mentioned pathways using the predicted drug.
A few limitations open the way for further research, such as the availability of brainrelated data from living organisms. Moreover, more specific clinical-and gene-level research is required to better understand the complications by analyzing the candidate biomarkers found in this work.

Conclusions
The current study used a statistical technique on the transcriptomic data to uncover the shared significant genes that are highly enriched among glioblastoma, ischemic stroke, and moyamoya patients. The study of the significant gene sets revealed the associated dysregulated pathways that were also highly enriched. Protein-protein interactions, regulatory TFs from the survey of TF-gene interactions, and miRNAs from gene-miRNA interactions were obtained by comparing the overlapped DEGs with distinct biomolecular interaction networks and databases. Most of the transcription factors and microRNAs discovered in this study are novel; no prior studies have implicated these genes or pathways in developing these disorders or their connections. More studies still need to be performed to validate these molecular signature biomarkers. This study looked at candidate genes at protein and RNA levels, such as TFs, mRNAs and miRNAs, the pathway, and the GO terminologies. Finally, we predicted the potential drugs for the associations. Moreover, the results were validated using gold benchmark databases and published literature. These results show that genes in glioblastoma are more or less active in people with ischemic stroke and moyamoya, which could help explain these diseases. It also demonstrates how to find functional relationships between ischemic stroke and moyamoya, explaining why they are linked to glioblastoma.