Prioritization of Susceptibility Genes for Ectopic Pregnancy by Gene Network Analysis

Ectopic pregnancy is a very dangerous complication of pregnancy, affecting 1%–2% of all reported pregnancies. Due to ethical constraints on human biopsies and the lack of suitable animal models, there has been little success in identifying functionally important genes in the pathogenesis of ectopic pregnancy. In the present study, we developed a random walk–based computational method named TM-rank to prioritize ectopic pregnancy–related genes based on text mining data and gene network information. Using a defined threshold value, we identified five top-ranked genes: VEGFA (vascular endothelial growth factor A), IL8 (interleukin 8), IL6 (interleukin 6), ESR1 (estrogen receptor 1) and EGFR (epidermal growth factor receptor). These genes are promising candidate genes that can serve as useful diagnostic biomarkers and therapeutic targets. Our approach represents a novel strategy for prioritizing disease susceptibility genes.


Introduction
Ectopic pregnancy is defined as embryo implantation outside of the uterine cavity, with the most common site being the Fallopian tube [1]. Its incidence is 1%-2% of all reported pregnancies [2]. Ectopic pregnancy is a very dangerous complication of pregnancy, which accounts for 10% of all maternal deaths during the early stage of pregnancy [2]. Notably, unlike humans, the major type of ectopic pregnancy in animals is abdominal pregnancy, and tubal pregnancy is rather rare [3]. The pathogenesis of ectopic pregnancy in humans is not fully understood.
It has been suggested that both impaired embryo transport in the Fallopian tube and abnormal tubal environment may contribute to the pathogenesis of ectopic pregnancy. The major risk factors for ectopic pregnancy include tubal infection and smoking [4,5]. These two conditions initiate inflammation within the Fallopian tube, leading to embryo retention by disrupting smooth muscle contractility and ciliary beat frequency [6]. The inflammatory environment may also provide necessary pro-implantation factors with roles in the establishment of Fallopian tube receptivity for the retained embryo [7]. A number of genes have been implicated in ectopic pregnancy [8][9][10]. However, due to ethical constraints on human biopsies and the lack of suitable animal models, most studies have focused on altered tubal gene or protein expression by comparing ectopic pregnancy with normal controls; thus, these studies are largely descriptive and speculative. There has been little success in identifying functionally important genes in the pathogenesis of ectopic pregnancy.
Recently, text mining methodology has been implemented [11], making it possible to retrieve all ectopic pregnancy candidate genes from published experimental results in an automated way. Because functional validation of these candidates using experimental methods is not feasible for now, computational methods may provide a useful option for prioritization of the most likely susceptibility genes. Among all computational methods, the random walk algorithm, which incorporates topological properties of the gene network, has attracted more and more attention in recent years [12,13]. It is a variant of the PageRank algorithm used by the Google search engine to rank search results on the world-wide web [14]. In the present study, we developed a random walk-based strategy named TM-rank to prioritize ectopic pregnancy-related genes by combining text mining data and gene network information. Our approach represents a novel strategy for prioritization of disease susceptibility genes.

Identification of Susceptibility Genes for Ectopic Pregnancy by Using Text Mining
We carried out a key word search in the PubMed database and obtained 7363 publications related to ectopic pregnancy as a result (from January 1980 to October 2015). The titles and abstracts of these publications were fetched and analyzed through a computational pipeline ( Figure 1A). Statistical analysis shows that the number of publications on ectopic pregnancy has been growing linearly in recent years ( Figure 1B). From these publications, a total of 264 ectopic pregnancy-related genes were extracted ( Figure 1C; Table S1). incorporates topological properties of the gene network, has attracted more and more attention in recent years [12,13]. It is a variant of the PageRank algorithm used by the Google search engine to rank search results on the world-wide web [14]. In the present study, we developed a random walk-based strategy named TM-rank to prioritize ectopic pregnancy-related genes by combining text mining data and gene network information. Our approach represents a novel strategy for prioritization of disease susceptibility genes.

Identification of Susceptibility Genes for Ectopic Pregnancy by Using Text Mining
We carried out a key word search in the PubMed database and obtained 7363 publications related to ectopic pregnancy as a result (from January 1980 to October 2015). The titles and abstracts of these publications were fetched and analyzed through a computational pipeline ( Figure 1A). Statistical analysis shows that the number of publications on ectopic pregnancy has been growing linearly in recent years ( Figure 1B). From these publications, a total of 264 ectopic pregnancy-related genes were extracted ( Figure 1C; Table S1).

Gene Prioritization by TM-Rank
Initially, the gene rank (the importance of a gene to ectopic pregnancy) was determined by the number of publications for each gene. Considering the network underlying all ectopic pregnancy-related genes, we would like a gene to be highly ranked if it is linked to other highly ranked genes. Therefore, we developed the TM-rank algorithm, which extends the PageRank algorithm [14], for analyzing our text mining results. We demonstrated that the importance of a gene as determined by the number of publications could propagate fast through the entire network by iterative random walk ( Figure 3A). Finally, genes were ranked according to TM-rank values in descending order. As a result, the top five genes exceeded a defined TM-rank cut-off (mean + 2 × SD = 0.1919). These five genes were: VEGFA (vascular endothelial growth factor A), IL8 (interleukin 8), IL6 (interleukin 6), ESR1 (estrogen receptor 1) and EGFR (epidermal growth factor receptor) ( Figure 3B).

Gene Prioritization by TM-Rank
Initially, the gene rank (the importance of a gene to ectopic pregnancy) was determined by the number of publications for each gene. Considering the network underlying all ectopic pregnancy-related genes, we would like a gene to be highly ranked if it is linked to other highly ranked genes. Therefore, we developed the TM-rank algorithm, which extends the PageRank algorithm [14], for analyzing our text mining results. We demonstrated that the importance of a gene as determined by the number of publications could propagate fast through the entire network by iterative random walk ( Figure 3A). Finally, genes were ranked according to TM-rank values in descending order. As a result, the top five genes exceeded a defined TM-rank cut-off (mean + 2ˆSD = 0.1919). These five genes were: VEGFA (vascular endothelial growth factor A), IL8 (interleukin 8), IL6 (interleukin 6), ESR1 (estrogen receptor 1) and EGFR (epidermal growth factor receptor) ( Figure 3B).

Gene Prioritization by TM-Rank
Initially, the gene rank (the importance of a gene to ectopic pregnancy) was determined by the number of publications for each gene. Considering the network underlying all ectopic pregnancy-related genes, we would like a gene to be highly ranked if it is linked to other highly ranked genes. Therefore, we developed the TM-rank algorithm, which extends the PageRank algorithm [14], for analyzing our text mining results. We demonstrated that the importance of a gene as determined by the number of publications could propagate fast through the entire network by iterative random walk ( Figure 3A). Finally, genes were ranked according to TM-rank values in descending order. As a result, the top five genes exceeded a defined TM-rank cut-off (mean + 2 × SD = 0.1919). These five genes were: VEGFA (vascular endothelial growth factor A), IL8 (interleukin 8), IL6 (interleukin 6), ESR1 (estrogen receptor 1) and EGFR (epidermal growth factor receptor) ( Figure 3B).  By searching the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database [15], these five genes are linked to 11 pathways ( Figure 4A). EGFR is involved in seven pathways, namely the estrogen signaling pathway, focal adhesion pathway, ErbB signaling pathway, MAPK signaling pathway, PI3K-Akt signaling pathway, FoxO signaling pathway and HIF-1 signaling pathway. IL6 is associated with six pathways: the Jak-STAT signaling pathway, PI3K-Akt signaling pathway, HIF-1 signaling pathway, FoxO signaling pathway, NOD-like receptor signaling pathway and Toll-like receptor signaling pathway. Beyond the VEGF signaling pathway, VEGFA also plays a role in the PI3K-Akt signaling pathway, HIF-1 signaling pathway and focal adhesion pathway. IL8 participates in the NOD-like receptor signaling pathway and Toll-like receptor signaling pathway. ESR1 is the key component of the estrogen signaling pathway.
In addition, we performed pathway enrichment analysis on the complete list of 264 susceptibility genes by using the DAVID (Database for Annotation, Visualization and Integrated Discovery) tools. As a result, a total of 13 pathways, namely the PI3K-Akt signaling pathway, FoxO signaling pathway, Jak-STAT signaling pathway, hematopoietic cell lineage, HIF-1 signaling pathway, estrogen signaling pathway, Toll-like receptor signaling pathway, apoptosis, VEGF signaling pathway, focal adhesion, T cell receptor signaling pathway, NOD-like receptor signaling pathway, and MAPK signaling pathway, are significantly enriched (false discovery rate FDR < 0.05) ( Figure 4B). Interestingly, all 11 pathways linked to top-ranked genes are overlapped with these 13 pathways enriched among all genes identified by text mining, indicating that our TM-rank algorithm is able to prioritize functionally important genes for ectopic pregnancy. By searching the KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway database [15], these five genes are linked to 11 pathways ( Figure 4A). EGFR is involved in seven pathways, namely the estrogen signaling pathway, focal adhesion pathway, ErbB signaling pathway, MAPK signaling pathway, PI3K-Akt signaling pathway, FoxO signaling pathway and HIF-1 signaling pathway. IL6 is associated with six pathways: the Jak-STAT signaling pathway, PI3K-Akt signaling pathway, HIF-1 signaling pathway, FoxO signaling pathway, NOD-like receptor signaling pathway and Toll-like receptor signaling pathway. Beyond the VEGF signaling pathway, VEGFA also plays a role in the PI3K-Akt signaling pathway, HIF-1 signaling pathway and focal adhesion pathway. IL8 participates in the NOD-like receptor signaling pathway and Toll-like receptor signaling pathway. ESR1 is the key component of the estrogen signaling pathway.
In addition, we performed pathway enrichment analysis on the complete list of 264 susceptibility genes by using the DAVID (Database for Annotation, Visualization and Integrated Discovery) tools. As a result, a total of 13 pathways, namely the PI3K-Akt signaling pathway, FoxO signaling pathway, Jak-STAT signaling pathway, hematopoietic cell lineage, HIF-1 signaling pathway, estrogen signaling pathway, Toll-like receptor signaling pathway, apoptosis, VEGF signaling pathway, focal adhesion, T cell receptor signaling pathway, NOD-like receptor signaling pathway, and MAPK signaling pathway, are significantly enriched (false discovery rate FDR < 0.05) ( Figure 4B). Interestingly, all 11 pathways linked to top-ranked genes are overlapped with these 13 pathways enriched among all genes identified by text mining, indicating that our TM-rank algorithm is able to prioritize functionally important genes for ectopic pregnancy.

Discussion
Ectopic pregnancy is a very dangerous complication of pregnancy, the pathogenesis of which is not fully understood. There is a wealth of information that is hidden within published experimental results. In the present study we performed a text mining analysis of ectopic pregnancy-associated genes. A total of 264 genes were indentified from 7363 publications. Since a large number of publications were analyzed, our result may provide reasonably good coverage of the entire set of ectopic pregnancy-associated genes.
Interpretation of the text mining results is usually based on the number of publications for each gene. That is, the more publications a gene is cited in, the more important the gene is to ectopic pregnancy. The disadvantage is that genes are treated as separated entities and the interactions between genes are ignored. It has been shown that gene products or proteins usually form protein complexes to execute their functions instead of acting alone [16]. Hence, a gene is indicative of

Discussion
Ectopic pregnancy is a very dangerous complication of pregnancy, the pathogenesis of which is not fully understood. There is a wealth of information that is hidden within published experimental results. In the present study we performed a text mining analysis of ectopic pregnancy-associated genes. A total of 264 genes were indentified from 7363 publications. Since a large number of publications were analyzed, our result may provide reasonably good coverage of the entire set of ectopic pregnancy-associated genes.
Interpretation of the text mining results is usually based on the number of publications for each gene. That is, the more publications a gene is cited in, the more important the gene is to ectopic pregnancy. The disadvantage is that genes are treated as separated entities and the interactions between genes are ignored. It has been shown that gene products or proteins usually form protein complexes to execute their functions instead of acting alone [16]. Hence, a gene is indicative of functional importance for its interacting partners, opening the possibility to integrate interaction information for gene prioritization. In the present study, a genome-wide gene network was constructed by using up-to-date interaction data available in the STRING database [17]. Using a threshold value of 0.9 for the combined score, we obtained a gene network consisting of 198 nodes connected via 1007 edges. So far, several studies have been conducted to incorporate the topology of the gene network in prioritization of disease candidate genes [18][19][20]. The main concern for these studies is that the incompleteness and noisiness of interaction data may affect the accuracy of the prioritization results. By weighting and integrating interaction data from high-throughput experiments and from the mining of databases and literature, the STRING database provides a comprehensive gene network at a reasonable false positive rate [17]. We expected that the use of the STRING database may alleviate this problem to a certain extent.
Using text mining data and gene network information as input, we developed a random walk-based algorithm named TM-rank to prioritize ectopic pregnancy-related genes. This algorithm is a variant of the PageRank algorithm used by the Google search engine to rank search results on the world-wide web [14]. Compared to previously described algorithms, our method is unique in that: (1) both text mining data and gene interaction information are used; (2) no training set of functionally confirmed genes is needed; and (3) valuable insights of genes with high susceptibility to ectopic pregnancy are recommended.
These top five genes recommended by TM-rank can serve as useful diagnostic biomarkers and therapeutic targets. Serum VEGF is reported to be higher in ectopic pregnancy compared with normal pregnancy, indicating that VEGF may be a useful diagnostic biomarker for ectopic pregnancy [21]. Moreover, VEGF has been shown to be up-regulated at the implantation site compared to the non-implantation site of the Fallopian tube in ectopic pregnancy [22]. It is thought that elevated expression of VEGF may contribute to ectopic pregnancy by enhancing the supply of oxygen to the embryo via angiogenesis [23]. IL-6 and IL-8 are inflammatory cytokines and both of them are highly induced in the Fallopian tube upon chlamydial infection [24] as well as ectopic implantation [25]. Embryo implantation into the uterus is known to induce a local inflammatory response in a similar way as leukocytes accumulate at inflammatory sites [26]. It is likely that IL-6 and IL-8, induced by inflammation in the Fallopian tube, may provide signals to the embryo and promote tubal implantation. As the predominant estrogen receptor subtype in the Fallopian tube [27], ESR1 has been shown to serve as a crucial regulator in Fallopian tube development [28]. Exogenous estrogen treatment enhances embryo transport in the mouse Fallopian tube [29,30]. Interestingly, ESR1 protein has been shown to be absent in the Fallopian tube in human ectopic pregnancy [31], suggesting a role of ESR1 in the pathogenesis of ectopic pregnancy. EGFR is highly expressed at the ectopic implantation site [32]. It has been demonstrated that EGFR is a promising target for the treatment of ectopic pregnancy [33].
Unlike the supervised method with a positive training set, our un-supervised prioritization method is very difficult to provide validation data for. Wet experiments are often used; however, due to ethical constraints on human biopsies and the lack of suitable animal models, experimental validation is not a feasible option in this case. Alternatively, we performed pathway enrichment analysis on the complete list of 264 susceptibility genes by using the DAVID tools. A total of 13 enriched pathways were identified. In contrast, the top five genes selected by TM-rank are linked to 11 pathways. Interestingly, all 11 pathways are overlapped with the 13 enriched pathways, except for the ErbB signaling pathway. This result might provide validity of our methodology.
A limitation for text mining-based strategies is that there is no chance to discover new genes. In order to solve this problem, we enlarged the network by including new genes that are not reported to be involved in ectopic pregnancy. According to the rule of "guilty-by-association", these new genes may be potential susceptibility genes. Finally, we made a list of 10 new genes, all of which have more than two connections with known genes (Table S2).
In conclusion, we developed a random walk-based algorithm named TM-rank to prioritize ectopic pregnancy-related genes based on text mining data and gene network information. Application of the TM-rank methodology yielded promising candidate genes that can serve as useful diagnostic biomarkers and therapeutic targets. Our study on ectopic pregnancy can act as an illustrative example for broader use in various diseases.

Text Mining
The PubMed database [34] was used as a source of publications for text mining. We conducted a search with the following combinations of query key words: "ectopic pregnancy" OR "tubal pregnancy" OR "eccyesis". The search tag "(Title/Abstract)" was added after each key word. In order to make information extraction more precise, the relevant publications were retrieved in XML format with content enclosed within XML tag pairs. For each article, titles and abstract texts were fetched and transformed into the PubTator format [35] using in-house PERL scripts.
Text mining was performed by using the GNormPlus pipeline [36]. GNormPlus includes two modules: gene mention recognition and gene name normalization, respectively. Gene mentions were detected by using the CRF++ library [37]. Gene name normalization was performed by GenNorm [38]. The GNormPlus pipeline has achieved a precision of 87.1% and a recall of 86.4% [36]. To ensure accuracy, we manually checked each entry from the output of GNormPlus. A total of 1864 false positive errors out of a total of 6771 entries were identified. Thus, the observed precision was 4907/6771 = 72.47%. Finally, all the curated entries were summarized and a full gene list associated with ectopic pregnancy was compiled.

Construction of Gene Network
The STRING database v10.0 [17] was used to create gene network. The minimum combined score was set to 0.900 (highest confidence). The Cytoscape software [39] was applied for visualization and analysis of the gene network. The degree for each gene was analyzed by NetworkAnalyzer [40]. The threshold value for hub genes was defined as the mean plus two standard deviations.

The TM-Rank Algorithm
In order to prioritize susceptibility genes for ectopic pregnancy, TM-rank adopted a random walk in the gene network to infer which genes are most important. We followed a general implementation of a random walk in an iterative form: where r t+1 and r t are the TM-rank vectors in which the i-th element represents the TM-rank for gene i at time steps t and t+1, respectively. The parameter d is the restart probability which is set to 0.15 in this study. The vector r˝is the initial vector. Taking the number of publications into consideration, we biased the initial rank vector to the normalized number of publications for each gene. Assuming the total number of genes in the gene network is n, then the column-normalized network adjacency matrix can be written as: . .¨¨ẅ For a given gene i, let N i be the node set of immediate interacting neighbors of node i and c i be the number of publications for gene i. Then w ij is defined as: The TM-rank algorithm was implemented in MATLAB language (MathWorks, MA, USA).