Data Mining of Gene Arrays for Biomarkers of Survival in Ovarian Cancer

The expected five-year survival rate from a stage III ovarian cancer diagnosis is a mere 22%; this applies to the 7000 new cases diagnosed yearly in the UK. Stratification of patients with this heterogeneous disease, based on active molecular pathways, would aid a targeted treatment improving the prognosis for many cases. While hundreds of genes have been associated with ovarian cancer, few have yet been verified by peer research for clinical significance. Here, a meta-analysis approach was applied to two carefully selected gene expression microarray datasets. Artificial neural networks, Cox univariate survival analyses and T-tests identified genes whose expression was consistently and significantly associated with patient survival. The rigor of this experimental design increases confidence in the genes found to be of interest. A list of 56 genes were distilled from a potential 37,000 to be significantly related to survival in both datasets with a FDR of 1.39859 × 10−11, the identities of which both verify genes already implicated with this disease and provide novel genes and pathways to pursue. Further investigation and validation of these may lead to clinical insights and have potential to predict a patient’s response to treatment or be used as a novel target for therapy.


Introduction
Ovarian cancer is the fifth most common cancer and the fourth most common cause of cancer related deaths in UK women. Each year approximately 7000 UK women are diagnosed with ovarian cancer and over 4000 succumb to the disease.
Ovarian cancer's high mortality is attributed to the majority of incidences being diagnosed at a late stage. Few, if any, symptoms are expected from early stage disease, while in the later stages the indications are at most vague and more commonly attributed to non-pathological complaints including, back and abdominal pain, bloating and abnormal menstrual patterns [1].
Stage I ovarian cancer has a relatively good prognosis with 92% five-year survival, which drops down to 22% in patients with stage III disease. Despite the rising interest in identifying targeted therapy, there has not been significant change in disease outcome in the last few decades [2,3]. Currently, there is no screening tool with a performance specific or accurate enough to be implemented on the general population. Alongside ultrasonography, the existing tests for detection and monitoring of cancer progression or recurrence is based on serological immunoassay of Cancer Antigen 125 (CA125) [4,5]. This test is flawed by the natural variation and fluctuations of the protein [5][6][7], often false negative results lead to late presentation and diagnosis, and false positives to unnecessary explorative surgery [4]. However, encouragingly a recent report demonstrates the sensitivity of using CA125 as a screening tool for the general population to be vastly improved by using mathematical modeling to calculate risk based on serial measurements of CA125 [8].
Despite the continuing extensive study of ovarian cancer cell lines and patient material with numerous publications implicating novel genes associated with its incidence [9], little has changed in the treatment and expected outcome of patients presenting with ovarian cancer. Treatment for ovarian cancer is mainly total abdominal hysterectomy with bilateral salpingo-oophorectomy, omentectomy and staging. In advanced stage disease platinum based chemotherapy with or without taxol may be indicated as adjuvant or neoadjuvant therapy with interval debulking. Recently bevacizumab, an antiangiogenic therapy, has been used in certain cases [10,11]. A response to which is seen in approximately 70% of patients, however the majority of which develop a resistance to the therapy and experience a recurrence of the tumor, some more aggressively than others [10].
From the above, it is clear that there is an urgent need to identify non-invasive screening tools for early detection of ovarian cancer and also to improve targeted therapy for advanced stage disease.
DNA microarray experiments allow determination of the expression of entire genomes in DNA and RNA extracted from biological samples. To obtain the data in the current study, genetic material acquired from ovarian tumors was hybridized against a microarray gene chip containing probes for most of the characterized genes in the human genome yielding a relative expression value for several probes per gene [12]. These large, multidimensional, data could be interpreted using infinite analytical strategies to draw different conclusions [13]. Out of the thousands investigated and implicated genetic variants that are reported to have a role in ovarian cancer, only a few, have been exclusively positively replicated [9]. A recent review highlights agreement that instead of generating new experimental data, which can be both costly and timely, the sharing of resources, data, results, methods and samples is crucial to narrowing down common active cellular mechanisms in what is a relatively rare yet genotypically diverse disease [11].
The two methods of analysis explored in the current study are artificial neural networks (ANNs) and Cox proportional hazard modeling analysis. ANNs are a form of machine learning that are applied to non-linear datasets, pattern recognition algorithms to strengthen connections within its structure, which is akin to the plasticity of nervous systems in biology [14]. Cox proportional hazard modeling analysis is used to determine if a continuous independent variable such as gene expression levels associate with survival [15].
The two key focal points of research into ovarian cancer are firstly the development of a biomarker from a non-invasive test that can be used as a screening tool for early detection in the at risk population, and secondly to improve the prognosis and treatment of patients diagnosed with later stage disease.
The aim of the current study was to characterize genomic differences between tumors from patients that experienced different survival times after diagnosis with stage III ovarian cancer. Figure 1 is a schematic depicting the meta-analysis approach used to filter two cohorts of data for genes that consistently significantly associate with patient survival time when analyzed using two cohorts of data and two analytical approaches.

Source Data
Array Express was searched for datasets comprising gene microarray data collected from cohorts of ovarian cancer samples with as similar profile as possible. Extraneous variables were minimized by searching Array Express and not including data acquired from experiments that did not fit a strict criteria: i.e., including only data from large patient cohorts using micro-arrays representing the full genome. Datasets with low sample numbers, ambiguous or unclear sample data, studies based around cell lines, or with a focus on drug trials, were not included.
Survival time was the only dependent variable available in both the cohorts selected for the analysis. Patients in both studies selected were subject to the same treatment of possible debulking surgery, followed by platinum based chemotherapy [16,17].

Datasets Used
Gene array data were downloaded from Array Express, the dataset was built from tissue from patients with ovarian cancer who have been treated with the same care pathway. Full data and information is available at Array Express under experiments E-GEOD-13876 and E-GEOD-26712 [12].
Based on the patient information and data annotations provided with both datasets, survival time was selected as the basis for this investigation, i.e., survival time was the only listed variable common to both data sets. Both of these datasets could be used to identify genes whose expression significantly and consistently associate with survival time from Stage III serous ovarian cancer, and, to validate or refute any genes recently reported to be linked to ovarian cancer but not fully validated.
Cohort 1: Full data and information is available at Array express under the E-GEOD-13876 [12] Array: A-GEOD-7759-Operon human v3 ~35 K 70-mer two-color oligonucleotide microarrays. Sample information: 157 consecutive patients donated tumor from cyto-reductive surgery prior to platinum based chemotherapy treated at University Medical Center Groningen (UMCG, Groningen, The Netherlands) in the period 1990-2003 [17].

Meta-Analysis of Microarray Data
A set of six three-layered back propagation ANNs with an architecture of 1 input node, 2 hidden layer nodes and 1 output node were trained to identify gene probes that perform well as predictors of short and long survival. The ANN algorithm was developed at NTU [14,18], contact CompanDX [19] for further details. Multiple ANNs were trained to accommodate a categorical analysis around a continuous variable. A backpropagation algorithm was used to update the weights of the ANN and was trained to convergence on an early stopping randomly extracted dataset comprising 20% of the global dataset. A sigmoidal transfer function was used in the architecture to relate input gene expression to survival.
Firstly, the survival distribution of the population of the two datasets were observed, three possible cut-off time points determining short and long survival were defined; above and below 16, 23 and 30 months. Using these three survival cut-offs, ANN analyses were conducted on the two datasets. Within each of the six ANN analyses, the gene probes were ranked by their root mean gained error on an internal blind validation step comprising a different 20% of the global dataset and gene probes ranking below 0.05% were disregarded. The gene short names of these shortlisted gene probes were then cross-referenced across the three ANN from each time point in each dataset. Gene names were then weighted based on the frequency of their presence in the three ANNs top 0.05% ranking probes. The list of weighted gene names with a consistent predictive performance between long and short term survival were taken forward to the meta-analysis (see supplementary data for full gene probe listings).
Cox univariate survival analysis was conducted on every gene probe individually to determine the expression significantly correlated with survival. To do this, a macro was created within Statistica software that cycled round each of the thousands of gene probes within each dataset and produced a report for each one. Due to software limitations, this had to be done in several batches of 4000 probes for each dataset. The individual output reports were compiled and converted to an Excel spreadsheet. Gene probes were ranked by their p-value and any below 0.05 were disregarded. The gene codes of the gene probes with a p-value of ≤0.05 were taken forward for the meta-analysis (p-values available in supplementary data).
The Pivot table function within Excel was used to cross-compare the gene codes that performed well as predictors in the MLP-ANNs and had a significant p-value in the Cox univariate survival analysis. Gene probes that did not occur in all four categories were disregarded. The data corresponding to the gene probes of the genes identified to be of interest were extracted from the data. T-tests were conducted using the same time point cut-offs as described for the ANNs. Genes that did not have a significant p-value for one or more probe in both datasets were disregarded. Finally the mean averages of each were compared. Genes whose expression trends differed when correlated with survival between the datasets were disregarded.
The final list of 56 gene codes (Table 1) were cross-referenced using STRING to highlight any known association or link between them [20,21]. Literature and online resources such as Gene Cards and Human Protein Atlas were further mined to create a database of genomic, proteomic, expression, oncologic and pathway information to direct avenues of further investigation [22,23].
The probability this discovery occurring by chance was a probability of 1.39859 × 10 −11 . The number of genes found to be of interest multiplied by number of possible probes in each data set for both analyses ((56/37,632) × (56/22,283) × (56/37,632) × (56/22,283)) = 1.39859 × 10 −11 . If the work of Fury et al. [24] is taken into consideration, this probability may be even lower.

Verification of Protein Expression
From the literature and database mining, Endothelin receptor type A (EDNRA) was selected for verification at a protein level. A tissue MicroArray was purchased form Biomax (OV6161 from US Biomax Inc., Rockville, MD, USA [25]), and an Anti-EDNRA HPA014087 (Atlas Antibodies, Stockholm, Sweden) was selected above others for its demonstrated specificity via western blot of a human cell line. Biomax OV6161 is a high density microarray of 616 cores of paraffin-embedded ovarian specimens mounted onto a glass slide. It contains; 28 normal or normal adjacent tissue, 1 transitional cell carcinoma, 13 clear cell carcinoma and 280 cases of adenocarcinoma of varying stage and grade. All information is available at http://www.biomax.us/tissue-arrays/Ovary/OV6161 [25].
Slides were deparaffinized and dehydrated by heating at 60 °C on a hot plate for 10 min, immediately followed by two 5 min alcohol washes, and three 2 min washes in Industrial Methylated Spirits ending in ddH2O. Antigen retrieval consisted of a 20 min boil in a citrate buffer (pH6). After cooling in ddH2O, slides were carefully loaded to the Sequenza staining system and stained using the Novolink Polymer detection system (RE7200-CE, Leica Biosystems, Buckingham, UK) care was taken and checks were in place to ensure no part of the slide ever dried or microbubbles of air were trapped between the Sequenza coverslip and the slide, as per the manufactures recommendations. The dilution of the primary antibody was optimized using incomplete offcuts of a breast TMA and one additional test slide purchased from Biomax. A negative control omitting the primary antibody ensured all staining was associated with primary antibody binding. Two 5 min wash cycles rinsing with tris-buffered saline (TBS) were conducted between each of the following incubations; 5 min peroxidase block at room temperature to minimize non-specific binding, an 80 min room temperature incubation with the primary antibody HPA014087 (Atlas Antibodies, Stockholm, Sweden) at a 1 in 40 dilution. The antibody binding signal was amplified with a 30 min room temperature incubation with post primary reagent and a 5 min exposure to a 1 in 20 dilution of diaminobenzidine working solution. Finally, a 6 min incubation with the haematoxylin reagent enabled visualization of cell nucleic architecture. The stained slides were fixed by sequential alcohol washes in the reverse order they are listed above before sealing with a cover slip.
The TMA was accepted for scoring as a range of staining intensities were seen in tumor tissue across the slide. For a core to be considered viable to be scored, it had to contain at least 100 tumor cells. Cores were scored blindly on a categorical basis assigning a number to the overall intensity of the staining seen (0 negative, 1 weak, 2 moderate and 3 intense). Scores were assigned by a trained technician and a proportion (13.8%) were separately scored by a pathologist familiar with ovarian malignancies. The concordance between the scorers was very good (κ value = 0.921).

Genes of Interest
A list of 56 genes were distilled from a potential 37,000 gene probes to warrant further research into their role in survival time from ovarian cancer. These are listed in Table 1.
Completely different gene sets and numbers of genes in panels can be shown to be significantly differentially expressed between two datasets if different data mining methods are applied to the same data [26]. Of the final list of 56 genes of interest listed above, only three overlapped with those found to be of interest in the original publications. LRRC17 and TMEM45A were part of the panel of 86 genes found by continuous prediction algorithm to be of interest by Crijins et al. [17], GULP1 was also one of the 57 genes found to be of interest published by Bonome et al. [16]. The latter is intriguing as the paper's primary analysis of fitting a Cox univariate survival curve to each gene is akin to Method 1 described above. This disparity can be attributed firstly to the stringency of using additional statistical analyses and validation of a second dataset as a filter to a genes significance, and secondly, the difference in data pre-processing and normalization strategies, which is known to alter the results to downstream analyses [17,26].
The rigor of combining a meta-analysis approach with multiple testing using a variety of statistical approaches, increases the power and confidence in the relevance of genes found to be of interest and ensures the probability of these findings to have occurred by chance to be infinitesimal; only the most "robust" biomarkers remained. Encouragingly, the 56 genes of interest included are both known and novel candidates associating with ovarian cancer survival. Namely, IGF2 is overexpressed in ovarian cancers, increased ligation is seen ovarian cystic fluid [27], which activates molecular pathways key to cell invasion [28], and, independently is a predictor of poor survival [29]. IGFBP3 and IGFBP6 are part of these pathways and the former is downstream of a p53 cascade. BMP4 is a known mediator of ovarian metastasis and cell invasion [30], its increased expression is a predictor of poor survival [31], and, has been implicated in cisplatin resistance [31]. Others such as WTAP, MAPK, and NAV3 have been implicated in other cancers but less so for ovarian [32][33][34].
This broad, meta-analytical approach benefits from being comprehensive; however, the loss of the ability to control extraneous variables is an inherent challenge when using publically sourced data. There are numerous non-recorded variables that could also determine patient survival times, this was and should always be acknowledged and considered when assumptions during the interpretation of results are made in order to hypothesize and derive possible meaning.
As both patient data cohorts received the same care pathway of primary debulking surgery followed by platinum based chemotherapy, chemoresistance will have been a contributing factor to survival times for a proportion of those patients. It could be suggested that the differential expression of at least some of the 56 genes of interest are a consequence of up or down-regulation of genes within tumors making them either more aggressive or to be able to evade platinum based chemotherapy. IGFBP3 has been shown to mediate resistance to cisplatin therapy in non-small-cell lung cancer [35], and BMP-4 expression has been shown to be altered after chemotherapy [31].

Preliminary Validation
Based on collated information from databases and literature review, EDNRA was selected as an interesting starting point to begin verification of genes protein expression patterns in relation to ovarian cancer: Epithelial to mesenchymal transition (EMT) was a common theme when collating information of the 56 genes of interest. Cell line studies have also implicated the phenomena of EMT to occur in platinum based drug resistance in epithelial ovarian cancer [36]. However, the exact mechanisms by which this happens are unconfirmed, in fact conflicting results are reported from both in vivo and in vitro studies [37]. The presence of markers of EMT such as SNAIL and E-cadherin have been linked with ovarian cancer invasiveness [36] and the activation of anti-apoptotic pathways such as NF-κB have been observed in cisplatin resistant cell lines [37]. Contrary to prior evidence, Miow et al. [37] found cisplatin had a higher efficacy on ovarian cell lines with mesenchymal status than those with an epithelial status.
Rosano et al. [36] elucidates EDNRA role in cell signaling pathways in the context of EMT in ovarian cancer cell line. An examination of EDNR2A expression in a wider cohort of ovarian specimens such as a tissue microarray would better represent the heterogeneity of ovarian cancers-hence its selection for this study.
A clear increase in EDNRA protein expression was seen in the higher grade and later stage disease (Figure 2, Tables 2 and 3). Endothelin receptor type A (EDNRA) is the primary receptor for endothelin-1. Activation of EDNRA initiates G protein coupled receptor (GPCR) mediated activation of phosophatidylinositol-calcium second messenger system [13]. Its increased expression in the more intense cancers is likely representing increased cell proliferative activity of the tumors. A tissue microarray from a cohort of patients matching the profile of those in the microarray cohorts with survival data would expand upon this.
Significantly differential staining was also seen in different types of ovarian tumor ( Figure 3 and Table 4) implying that expression has potential to subgroup different histotypes of tumor. However there are insufficient numbers to draw any firm conclusions from these.
Further investigation and validation of the genes that have not yet been reported to associate with survival and investigating commonalities between the novel and known genes may have clinical relevance and have potential to predict a patient's response to treatment or be used as a novel target for therapy.
Moreover, using the genes in combination with each other as a gene signature or biomarker panel and clarifying the nature of these commonalities using more, freely available online resources such as STRING, KEGG, Reactome, BioGrid, Panther and HeTop could begin to unearth molecular pathways with potential to characterize the nature of individual tumors within patient cohorts and enable more tailored treatment.   Table 3. T-test table comparing the significance of protein expression differences.  Table 4. T-test p-values comparing EDNRA protein expression between cancer histology. Italicized numbers indicate p-value less than 0.05. It should be emphasized that the reporting of each of these genes association with survival from ovarian cancer may not be novel, however the genes that emerge to appear alongside each other consistently over a number of experiments, technologies and cohorts will elucidate commonalities, signaling pathways and cell processes active that would lead to subcategorization of tumors. Unfortunately, it is likely that the results seen here, as in all multidimensional analyses of large cohorts are further corrupted by the heterogeneity of both the cases within the disease, and the cells within each tumor microenvironment. It is unlikely a disease as phenotypically diverse and poorly characterized as ovarian cancer will have one or a few subcategories. Multiple onco-genotypes and onco-phenotypes are likely to be present within any cohort dampening the potential for each to be discovered.

Conclusions
A list of 56 genes have been filtered from a meta-analysis of gene micro-array data. A proportion of these are well characterized in cancer, this both confirms the reliability of the methods and data used, and opens avenues of research to peruse to further our understanding of the genetics of the disease.
Validation at protein level was begun with the IHC of an ovarian TMA (322 ovarian specimens) for EDNRA. A significant association was seen between EDNRA expression and ovarian cancer stage and grade. Future investigations EDNRA in ovarian tumors, where survival data is available, would elucidate its potential role identifying subpopulations of patients and direct treatment accordingly.