A Comparative Cross ‐ Platform Meta ‐ Analysis to Identify Potential Biomarker Genes Common to Endometriosis and Recurrent Pregnancy Loss

: Endometriosis is characterized by unwanted growth of endometrial tissue in different locations of the female reproductive tract. It may lead to recurrent pregnancy loss, which is one of the worst curses for the reproductive age group of human populations around the world. Thus, there is an urgent need for unveiling any common source of origin of both these diseases and con ‐ nections, if any. Herein, we aimed to identify common potential biomarker genes of these two dis ‐ eases via in silico approach using meta ‐ analysis of microarray data. Datasets were selected for the study based on certain exclusion criteria. Those datasets were subjected to comparative meta ‐ anal ‐ yses for the identification of differentially expressed genes (DEGs), that are common to both diag ‐ noses. The DEGs were then subjected to protein ‐ protein networking and subsequent functional en ‐ richment analyses for unveiling their role/function in connecting two diseases. From the analyses, 120 DEGs are reported to be significant out of which four genes have been found to be prominent. These include the CTNNB1 , HNRNPAB , SNRPF and TWIST2 genes. The significantly enriched path ‐ ways based on the above ‐ mentioned genes are mainly centered on signaling and developmental events. These findings could significantly elucidate the underlying molecular events in endometri ‐ osis ‐ based recurrent miscarriages.


Introduction
Endometriosis is commonly known as a chronic condition that has been characterized by the growth of endometrial tissue in sites other than the endometrium [1]. This may result in the abnormal growth of endometrial cells outside the uterus and cause a painful condition. According to NHS-UK, symptoms include severe pelvic pain during periods, sex, urination and defacation. Major symptoms could be constipation, diarrhea, and even blood during urination. Women also face difficulties in getting pregnant (https://www.nhs.uk/conditions/endometriosis/, accessed on 20 July 2020). After several years of research, the pathogenesis of endometriosis is still not clear [2]. The existence of endometriosis has been found from Müllerian or non-Müllerian stem cells, which may include those from bone marrow, the endometrial basal layer, the peritoneum, or Müllerian remnants [3]. In addition, scientists believe that dysregulation of the canonical Wnt/βsignaling pathway could be responsible for the endometriotic lesions leading to the endometriosis condition [2]. Wnt/β-catenin signaling also has a role in governing the endometrial cells regulated by estrogen and progesterone. Any changes in the expression of estrogen and progesterone receptors may cause progesterone resistance in endometriosis patients [4]. Infertility problems may be caused due to recurrent pregnancy loss which has been found a major issue in endometriosis patients. Indeed, the loss of two or more pregnancies has also been reported by the European Society of Human Reproduction and Embryology Recurrent Pregnancy Loss (RPL) [5], where ectopic pregnancy and molar pregnancy has been excluded. Endometriosis-associated infertility could be identified by potential markers, such as inflammatory cytokines, iron and oxidative stress, oxidant-antioxidant imbalance, and iron-dependent progression of endometriosis [6]. A recent literature review suggested that endometrial immune dysregulation could be responsible for RPL and may also lead to endometriosis [7]. Thus far, the exact reason for endometriosis is still not clear and, therefore meta-data analysis may provide further knowledge to solve the molecular pathogenesis complexity of such condition(s). To find the genes involved in the loss of the hormonal functions and association with endometriosis, Sapkota et al. [8] performed a large scale, 11 genome-wide case-control dataset meta-analysis and found that FN1, CCDC170, ESR1, SYNE1, and FSHB are the 5 genes that could be responsible for the endometriosis risk. Therefore, computational system biology plays a major role in meta-data analysis. In combination with machine learning, many biomarker genes have been identified, including NOTCH3, SNAPC2, B4GALNT1, SMAP2, DDB2, GTF3C5, and PTOV1 from the transcriptomic data analysis, and TRPM6, RASSF2, TNIP2, RP3-522J7. 6, FGD3, and MFSD14B from the methylomic data analysis [9]. The latest metadata investigation related to polymorphisms and endometriosis tried to find the genetic level reason behind endometriosis, where five polymorphisms have been associated with endometriosis [10]. They were glutathione S-transferase pi 1 (GSTP1) rs1695, interferon-gamma (IFNG) (CA) repeat, wingless-type MMTV integration site family member 4 (WNT4) rs16826658, rs2235529, and glutathione S-transferase mu 1 (GSTM1) null genotype. The present study aimed to identify the genes that are differentially expressed in endometriosis and RPL conditions, and to elucidate their involvement in protein-protein interactions, as well as their functional importance in biological pathways as potential biomarkers common to both endometriosis and RPL.

Microarray Data
Suitable gene expression microarray samples were obtained from the NCBI Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/, accessed on 20 July 2020) [11]. A thorough search was performed of the GEO database from July 2020 to September 2020 (3 months) using the keywords "Endometriosis AND Recurrent Pregnancy Loss". The GEO datasets that were included in our study are GSE7305, GSE23339, GSE26787, GSE58178 and GSE111974 subject to their fulfillment of certain criteria. The gene expression profiling was based on endometrial tissue and each dataset contained sufficient data to perform a meta-analysis. The following inclusion criteria were imposed while selecting the datasets for the meta-analyses: (i) the sample type must be endometrial tissue only, (ii) datasets should not contain overlapping sample sets, (iii) datasets must not have been generated from the same research laboratory, and (iv) they are heterogeneous in terms of microarray platform ( Table 1). The datasets that met these inclusion criteria were selected for the present study.

DEG Screening and Meta-Analyses
Analyses of microarray expression data were performed using the ExAtlas metaanalyses software [12]. The expression profiles of the 5 GEO datasets that were included in our study were extracted from the GEO database.
Normalization of the data was carried out using the quantile method. Each dataset was saved separately and later combined using the batch normalization method. Genespecific batch normalization can be used to combine two or more datasets. If two datasets include the same tissue or organ then median expression levels for this common tissue/organ are equalized in the two datasets using this method.
ExAtlas uses the same algorithm for statistical analysis as NIA Array Analysis [13]. Gene expression values are log-transformed and used for ANOVA [13], which is modified for the multiple hypotheses testing case. Additionally, the false discovery rate (FDR) [2] is used to assess the significance of gene expression change instead of p-values. Later meta-analyses were performed on the saved datasets using a random effect method and lists of differentially-expressed genes were saved as a gene set file. The random-effects method takes into account the variance of heterogeneity between studies, which is added to the variance of individual effects. Here, term effect means the log ratio of gene expression change/difference compared to control or study-wide mean or median.
In a parallel manner, the same raw datasets were analyzed with another software named Network analyst 3.0 [13]. Upon combining the datasets after normalization, 17,347 matched feature numbers were recognized, which were then subjected to batch effect adjustment using Combat. Then, meta-analyses were carried out on the combined dataset using a random effect model with the p-value set to less than 0.05 and FDR to less than or equal to 2. While FDR can be a great indicator of the strength of a study, the p-value can be more useful for statistical power analyses in future studies. The Limma package [14] was used for the identification of differentially expressed genes (DEGs).
Furthermore, gene expression analyses were performed on all the datasets individually using Geo2R [3]. Quantile normalization was performed and the Benjamini and Hochberg false discovery rate method was selected by default for Geo2R analyses because it is the most commonly used adjustment for microarray data and provides a good balance between the discovery of statistically significant genes and limitation of false positives.

Comparative Analyses
The DEGs from both the analyses were then compared and then the common genes were marked. These genes have the annotation set to official gene symbol, which was corrected using db2db tool of the Biological Database Network [15]. Furthermore, the gene expression outputs of all the datasets generated using Geo2R [11] were compared and the common DEGs were recorded, which were also compared with the output of Ex-Atlas and Network Analyst 3.0. The DEGs were then used to construct a heatmap using the ComplexHeatmappackage of R [16]

Protein-Protein Network Interaction
Additionally, DEGs have also been used to study the protein-protein interactions using the STRING app [17] of Cytoscape [18]. The protein-protein interaction network was developed in the STRING app. The meaning of the network edges was set to evidence-based analyses. The second shell interactors were added to the network to ensure or visualize connections between our target proteins, which were too weak to be found. The 1st shell interactors were the proteins directly associated with the input protein(s) while the 2nd shell of interactors were the proteins associated with the proteins from the 1st shell. It can be the case that a 2nd shell protein can be directly connected to an input protein(s), but it will usually have a weaker association and therefore it would not show up among the specified number of 1st shell interactors. The 2nd shell proteins are always grey. The generated network was then analyzed using the Network Analyzer function of Cytoscape.

Pathway Enrichment Analysis
Furthermore, the biological processes that are involved with the DEGs and the functional enrichment analysis were also studied using the BINGO app [19] of Cytoscape. A hypergeometric test was carried out using Benjamini and Hochberg FDR correction. The GO Biological process was selected as the ontology file for executing enrichment analyses. The generated network was then analyzed using the network analyzer function of Cytoscape.
The overall presentation of the methods used in this study is present in Figure 1.

Results
Five microarray datasets met the inclusion criteria and have been included in our study namely, GSE58178, GSE23339, GSE7305, GSE111974 and GSE26787 (Table 1). Altogether, these 5 datasets consisted in 114 samples, of which 54 were controls, and the remaining 60 were patient samples (34 EMS and 26 RPL subjects). Box plots representing the value distribution of these five datasets, which were constructed using Geo2R. The plot shows that the log 2 expression values are normalized across all the samples of each dataset with the median line having more or less equal distribution for each dataset (Figure 2).  GEO-Gene Expression Omnibus. [D] [E]

Expression of Up-and Down-Regulated Genes
Meta-analyses of selected microarray datasets using ExAtlas software estimated 207 significant genes using a random-effect model, of which 109 genes were down-regulated and 98 genes were up-regulated in the patients (both endometriosis and RPL patients taken together) compared to healthy controls. Figure 3 shows clustered heatmaps of the five datasets comprising the expression of the up-regulated and down-regulated DEGs. Based on the expression values of the DEGs, the datasets are clearly clustered into two groups, namely endometriosis and RPL. It is evident from Figure 3 that both the groupsendometriosis and RPL-have a similar pattern of expression of genes. In Figure 3, effect value refers to the log ratio of gene expression change/difference compared to control or study-wide mean or median. NA analysis revealed 685 DEGs, of which 236 were up-regulated and 449 were downregulated. When the results of both EA and NA were compared, 120 genes were found to be common. The top 25 DEGs from the above-mentioned 120 genes are listed in Table 2 based on their fold change (FC) values along with their Entrez ID, log-ratio combined and FDR value. Interestingly, among all the DEGs, the TWIST2 gene was found to possess the highest fold change value (3.494), which can be considered as a significant observation since the same gene has been found to have the highest fold change value in the case of NA analyses. Among these top 25 DEGs, 60% were down-regulated as evident from their log-ratio combined value while the rest 40% were up-regulated (Table 2). Thus, the downregulated genes overweighed the scale as compared to the up-regulated genes. In a parallel workflow, all the target GEO datasets were analyzed using Geo2R. The expression profiles contained genes that were significantly expressed in comparison to the control. Following this, the expression profiles of all the datasets were overlapped using the Venn diagram ( Figure 4A); it was seen that only 19 significantly overexpressed genes were common among the five datasets. Interestingly, when these 19 genes were compared with the differentially expressed genes from EA and NA analysis results ( Figure 4B), then, surprisingly, only a single gene, i.e., TWIST2, was found to be commonly present among all the three analyses, viz. EA, NA and Geo2R. This outcome shifted our focus towards the TWIST2 gene and triggered our interest in exploring the biological role of this marker, especially in the context of human reproductive health. It should be noted here, with respect to Figure 4B, that all the genes that are considered for comparative analyses between the three different software-based approaches demonstrated significant fold change in the patient sample compared to the control.

Protein-Protein Interaction (PPI) Network
The PPI network for the DEGs is illustrated in Figure 5. The size of the node indicates the connection degree value. Centrality is an important parameter in a signaling network since it helps us to estimate the importance of a node/edge in the flow of information. It is considered an important parameter while exploring drug targets. The degree of the nodes can be used as a rough estimate of centrality. The top 20 query nodes, based on the descending order of their degree of centrality, are listed in Table 3, along with their respective betweenness centrality, closeness centrality, and the average shortest path length. A small nuclear ribonucleoprotein F (SNRPF), had the highest degree of node (84) followed by Catenin Beta 1 (CTNNB1) and Heterogeneous Nuclear Ribonucleoprotein A/B (HNRNPAB) with their degrees of nodes being 54 and 50, respectively.
Betweenness centrality is a measure of information flow in a network system. Nodes with a high betweenness centrality are crucial for a network since they can control information flow in a biological network and can be considered as targets for drug discovery. It is basically defined as the number of shortest paths in a graph that pass through the node, divided by the total number of shortest paths. Among the top three genes with the highest degrees of centrality, CTNNB1 has a comparatively higher betweenness centrality value than the other two, i.e., SNRPF and HNRNPAB. Another important measure that estimates how fast the flow of information would be through a given node to other nodes is closeness centrality. Among the three top genes in Table 3, CTNNB1 (0.51828) has the highest value followed by SNRPF (0.46615) and HNRNPAB (0.42133), respectively. Average shortest-path length may be defined as the average number of steps along the shortest paths for all possible pairs of network nodes. It measures how efficiently information or mass transport occurs on a network. This list has also been topped by CTNNB1 (1.92946) followed by SNRPF (2.14523) and HNRHPAB (2.37344), respectively. Figure 5. Protein-protein interaction (PPI) network of endometriosis and recurrent pregnancy loss genes, performed using the STRING app of Cytoscape. The size of the node indicates the connection degree value. Colored nodes represent the most common 120 genes.

A B
The colored nodes represent the first shell interactors or the query proteins (120 DEGs) while the white nodes represent the second shell interactors or the proteins that are not included in the input file and have been included for analytical purposes only. The maximum number of white nodes that was allowed in our PPI analyses was set to 50. In the inset, the 20 proteins that were listed in Table 3 are represented via protein-protein interactions without any secondary interactors.

Pathway Enrichment Analyses
In the GO functional enrichment analyses using the BINGO plugin of Cytoscape (Figure 6), the yellow nodes are significantly over-represented while the white nodes are not significantly over-represented and are included only to show the yellow nodes in the context of the GO hierarchy. The size of a node is proportional to the number of query genes that are annotated to the corresponding GO category. The top 20 GO categories based on their respective node sizes, which are significantly over-represented in our study, are listed in Table 4. Among these significantly over-represented categories, the highest node size was reported for the biological regulation pathway followed by regulation of biological processes and regulation of cellular processes. Neighborhood connectivity was found to be highest for regulation of the signaling pathway, followed by biological regulation, organ morphogenesis, and skeletal development. It is interesting to find that among the first 20 over-represented pathways, CTNNB1 was found to be present in all the pathways. This shows the importance of this gene in the flow of information in reference to the pathophysiology of both the diseases. The HNRNPAB protein was found to be involved in 15 pathways, thereby demonstrating its role in disease occurrence. TWIST2 protein has also been found to be present in 12 pathways. These observations definitely point towards the probable involvement of the TWIST2 gene in endometriosis and RPL etiology. The SNRPF protein was found to only be linked to the cellular component organization pathway inspite of having the highest degree of centrality in the case of the protein-protein interaction network.

Discussion
A large number of works have been carried out in the past decades to identify genetic markers for both endometriosis and RPL [25][26][27][28][29][30][31][32][33]. However, a trustworthy molecular marker having significant prognostic value has not yet been determined. Moreover, the lack of potential drug targets is also one of the probable causes for several unsuccessful attempts to ameliorate the diseases. Therefore, there is an urgent need for the identification of potential biomarkers for the two diseases. This study is one of the pioneers in finding a common potential biomarker for the two diseases for successful diagnostic purposes and for effective drug delivery systems.
The literature survey provided epidemiological evidence to establish a probable link between endometriosis and RPL [34]. A recent investigation by Santulli et al. demonstrated an increased rate of spontaneous miscarriages in endometriosis-affected females [35]. Another interesting study in 2017 claimed mild endometriosis to be a potential risk for miscarriages [36]. Later in 2019, s study claimed that endometriosis affected the efficacy of assisted reproductive technology by increasing the risk of miscarriage [37].
More recently, Poli-Neto et al. identified the NOLC1 gene as the most common gene in the phase I and II endometriosis and affects menstruation, while in phases III and IV, the genes CDKN1B, DLD, ELOVL5, H2AFZ, IDI1, ME1, MTHFD2, NOLC1, and SOD1 play a major role. These reports prompted the authors to explore the relationship between endometriosis and RPL through the identification of any potential biomarkers common to both the diseases. The present study, in relation to Poli-Neto et al. [38], extends the identification of CTNNB1, HNRNPAB, SNRPF, and TWIST2 genes as major markers, while the TWIST2 gene was identified as the most prominent marker for the exploration of endometriosis and RPL. Although, authors investigated and predicted several parameters reporting challenges in treating the diseases [38].
It is clearly evident from Figure 3 that both the diseases have similar gene expression patterns, thereby providing a clear indication for some common markers for the two diseases. It is also evident from Figure 3 that there exists a clear distinction between the patient and the control groups of each dataset in terms of the expression profiles of the genes. This observation partially supports the idea that the above-mentioned 207 genes may be considered as signatory markers for both EM and RPL. It is clearly evident from Table 2 that the TWIST2 gene has the highest fold change value from both the EA and NA analyses. This observation clearly indicates that TWIST2 has a significant role to play as a potential diagnostic marker for endometriosis-based recurrent miscarriages. TWIST2 has a very important role to play in reproduction. The TWIST2 gene is proved to play a very significant role in embryo implantation in mice. Embryo implantation is a very important event for a successful pregnancy. Suppression of the TWIST2 gene impaired the embryo implantation by suppressing endothelial-mesenchymal transition (EMT) during embryo implantation [39]. A recent clinical study reported Setleis syndrome in a child with a novel mutation in the TWIST2 gene [40]. Another study in 2014 by Huang et al. showed that haploinsufficiency of TWIST2 results in reduced bone formation [41]. Franco et al. highlighted TWIST2 as a molecular switch during gene transcription [42]. Furthermore, sequestration of E-proteins by increased TWIST2 levels functions to inhibit muscle-specific gene activation [42][43][44]. TWIST2 requires Histone Deacetylases for Myoblast Determination Protein 1-Myocyte Enhancer Factor 2 inhibition [43]. TWIST2 is also known to regulate osteoblast differentiation, however its involvement occurs temporally after TWIST1 [41,45]. The transcription factor RUNX2 is considered a master regulator of the osteogenic program due to its indispensable role in the regulation of most of the genes that give rise to the mature osteoblast phenotype [41,46]. Both TWIST1 and TWIST2 can also regulate RUNX2 at the protein level by physically interacting with RUNX2 and inhibiting its ability to bind DNA [41,46]. TWIST2 also acts as an important key negative regulator of myeloid lineage development, as manifested by marked increases in mature myeloid populations of macrophages, neutrophils, and basophils in TWIST2-deficient mice [41,47]. Therefore, on converging our findings with the above-mentioned published investigations, it is clearly evident that downregulation of the TWIST2 gene may have a very potent role in early embryonic developmental events, rendering it as a potential clinical marker for endometriosis based RPL. Another gene, CA XII (Carbonic Anhydrase XII), also has a high log fold change value, as evident from Table 2. CA XII has been found to have prominent expression during mouse embryonic development [48]. However, in this article, we did not focus on other genes in Table 2 since, on overlapping our intersected gene list from EA and NA output with the Geo2R results, only the TWIST2 gene was found to be in common. In other words, only the TWIST2 gene was found to be present in all the three analyses and therefore was considered to be an important clinical marker.
In the case of protein-protein interaction analyses, the top three genes participating in the network were SNRPF, CTNNB1 and HNRNPAB, based on their degree of centrality. Small Nuclear Ribonucleoprotein Polypeptide F (SNRPF) plays role in pre-mRNA splicing and also as a component of the spliceosomal U1, U2, U4 and U5 small nuclear ribonucleoproteins (snRNPs), the building blocks of the spliceosome [49][50][51][52][53][54][55][56][57]. The SNRPF gene was found to be downregulated in our study samples and therefore may serve as a valid target for disease-based research.
CTNNB1 or Catenin Beta 1 is an important downstream component of the canonical Wnt signaling pathway [58][59][60][61][62][63][64][65]. The Wnt signaling pathway is known for its role in embryonic development, where it actively participates in body axis patterning, cell fate specification, cell proliferation, and cell migration events [66]. These developmental processes are essential for proper tissue formations, including bone, heart, and muscles. CTNNB1 protein is also a part of a protein complex that forms cell-cell junctions in epithelial and endothelial tissues [67]. Additionally, β-catenin 1 also promotes neurogenesis by maintaining sympathetic neuroblasts within the cell cycle [68]. Surprisingly, β-catenin has also been associated with endometrial cancer onset and recurrence [69]. Therefore, it is evident from the above-mentioned studies that CTNNB1 has an important role in early developmental pathways and inter-and intracellular recognitions. Interestingly, this gene was found to be upregulated in our study, rendering it an important marker for disease-based research and for exploring its role in disease prognosis.
Located on chromosome 5q35.3, HNRNPAB or Heterogeneous Nuclear Ribonucleoprotein A/Bis is a member of a subfamily of ubiquitously expressed heterogeneous nuclear ribonucleoproteins (hnRNPs). They are associated with pre-mRNAs in the nucleus and appear to influence pre-mRNA processing and other aspects of mRNA metabolism and transport. HNRNPAB has also been found to be associated with ankyloblepharonectodermal defects-cleft lip/palate syndrome [70][71][72]. Surprisingly, this gene is also a member of the preimplantation embryo pathway (WP3527) [73]. In our study, this gene is downregulated, similar to the SNRPF gene. Considering the above-mentioned facts, it can be hypothesized that HNRNPAB has a definitive role in disease prognosis via pre-mRNA processing or preimplantation embryo pathways and can be an essential diagnostic marker for endometriosis-based RPL.
It is clearly evident from Table 4 that the top 20 pathways of the pathway enrichment analysis based on the overlapping genes of the EA-NA analyses are mainly concerned with signaling pathways and developmental biology, thereby indicating the combined inclination of the genes towards functioning in the arena of developmental signaling events during embryogenesis. When we tried to explore the involvement of our potential biomarkers in the biological pathways, it was seen that the SNRPF protein is involved in the cellular component organization pathway. Interestingly, CTNNB1 is involved in all 20 pathways. HNRNPAB is involved in 15 pathways and TWIST 2 in 13 of the pathways. CTNNB1, HNRNPAB and TWIST2 are commonly involved in 11 out of 20 major pathways, that are shown in Supplementary Table S1, while SNRPF and CTNNB1 share only one pathway in common.

Conclusions
In conclusion, our work has identified 120 DEGs in the five profile datasets based on ExAtlas and Network Analyst results. A handful of biomarkers were found common to both endometriosis and RPL, and can have a diagnostic role in the case of endometriosisbased RPL. Notable among these markers are CTNNB1, HNRNPAB, SNRPF and TWIST2. The 120 DEGs, when compared with the cumulative output of Geo2R software, showed only one gene (TWIST2) to be common among the three analytical approaches. Therefore, our study also claims the TWIST2 gene as a prominent marker of choice for the diseases.
The significantly enriched pathways based on the above-mentioned genes are mainly centered on signaling and developmental events. These findings could significantly improve our understanding of the cause and underlying molecular events in endometriosisbased recurrent miscarriages. However, further downstream validation of these markers is a needed for quantitating their potentiality and establishing their efficacy as a potential drug target(s).

Supplementary Materials:
The following are available online at www.mdpi.com/2076-3417/11/8/3349/s1, Table S1: List of top 20 pathways as an outcome of pathway enrichment analysis using the target genes by BINGO app of Cytoscape.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data used to arrive at the findings of the study are openly available in GEO datasets at https://www.ncbi.nlm.nih.gov/gds [11].

Conflicts of Interest:
The authors declare no conflict of interest.