Identification and Validation of a New Set of Five Genes for Prediction of Risk in Early Breast Cancer

Molecular tests predicting the outcome of breast cancer patients based on gene expression levels can be used to assist in making treatment decisions after consideration of conventional markers. In this study we identified a subset of 20 mRNAs differentially regulated in breast cancer by analyzing several publicly available array gene expression datasets with R/Bioconductor packages. Using RTqPCR on paraffin-embedded sections, we evaluated 261 consecutive invasive breast cancer cases not selected for age, adjuvant treatment, nodal status or estrogen receptor status. The biological samples dataset was split into a training set (137 cases) and a validation set (124 cases). The gene signature was developed on the training set, and a multivariate stepwise Cox analysis selected five genes independently associated with disease free survival (DFS): FGF18 (HR = 1.13, p = 0.05), BCL2 (HR = 0.57, p = 0.001), PRC1 (HR = 1.51, p = 0.001), MMP9 (HR = 1.11, p = 0.08), SERF1A (HR = 0.83, p = 0.007). These five genes were combined into a linear score (signature) weighted according to the coefficients of the Cox model, as: 0.125FGF18 − 0.560BCL2 + 0.409PRC1 + 0.104MMP9 − 0.188SERF1A (HR = 2.7, 95% CI = 1.9–4.0, p < 0.001). The signature was then evaluated on the validation set, assessing its discrimination ability by a Kaplan Meier analysis that used the same cut offs defined on the training set to classify patients at low, intermediate or high risk of disease relapse (p < 0.001). After further clinical validation, our signature could be proposed as a prognostic signature for disease free survival in breast cancer patients for whom the indication for adjuvant chemotherapy added to endocrine treatment is uncertain.

The array gene expression analysis "Mammaprint®" identifies a 70-gene signature indicative of poor prognosis in patients with lymph node-negative disease or with 1-3 positive nodes, predicting chemotherapy benefit in the "high risk" group vs. no apparent benefit in the "low risk" group [3-6], in a non-randomized clinical setting. It requires fresh/frozen tissue from the primary breast tumor [2,3]. The multigene assay "Oncotype DX®" evaluates the expression of 21 genes in paraffin-embedded tissue and calculates a recurrence score to classify patients at low, intermediate, or high risk of recurrence. In two independent retrospective analyses of phase III clinical trials with adjuvant tamoxifen-alone control arms, the 21-gene recurrence score (RS) assay defined a group of patients with low scores who do not appear to benefit from chemotherapy, and a second group with very high scores who derive major benefit from chemotherapy, independently of age and tumor size [1,9-11].
Other studies, using a supervised approach with clinical outcome endpoints or tumor grade as the basis for gene discovery, have resulted in the development of several commercial reference laboratory assays for prognostication (MapQuant Dx [14], Theros Breast Cancer Index [15]).
The above-mentioned multigene assays are expensive, and their validations were performed on patients selected by age, nodal status or estrogen receptor status, and/or by the adjuvant treatment received.
By analyzing data from several array-based genome-wide gene expression studies publicly available on NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/), we identified a subset of 20 mRNAs differentially regulated in breast cancer. We then set up a protocol to evaluate these markers and to create a new gene signature based on real time PCR from paraffin-embedded tissue in a "real life" breast cancer patient population. The enrolled cases were not selected for age, adjuvant treatment, nodal status or estrogen receptor status.

Results and Discussion
Formalin-fixed and paraffin-embedded (FFPE) tissues represent one of the largest tissue sources for which well-documented clinical follow-up is available, and they therefore make large-scale retrospective studies possible [18]. As described recently by Bussolati et al. [19], in the near future the possibility of obtaining high-quality total RNA from archival tissues will guarantee more powerful and robust gene expression analyses. In order to identify a small number of informative genes providing prognostic information for breast cancer, we evaluated in silico a set of published signatures and tested them on gene expression array data from the 408 breast cancer cases deposited in NCBI Gene Expression Omnibus. Through several steps involving univariate analysis for the association with disease free survival (DFS), an unsupervised hierarchical clustering algorithm, and multivariate Cox model selection, we found 20 genes highly associated with DFS. These candidate genes were subsequently evaluated in vitro by RTqPCR on a total of 261 cases representing the training (137 cases) and the validation (124 cases) datasets (see the workflow shown in Figure 1).

Gene Selection on the Published Datasets
We used data deposited in NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/, GEO Series accession numbers GSE1456 and GSE3494), including 408 breast cancer cases. Files containing raw intensity data from the Affymetrix HU133A and HU133B arrays of the two datasets (GSE1456 and GSE3494) were preprocessed using R/Bioconductor (GCRMA package, quantile normalization, median polish summarization). The two datasets were preprocessed together using the supercomputer Michelangelo (http://www.litbio.org). The candidate genes were selected from the above-mentioned datasets as those included in four previously proposed signatures: the "70-gene signature" developed by van de Vijver et al. [3] and van't Veer et al. [2], the "recurrence-score" developed by Paik et al. [9] including 21 genes, the "two-gene-ratio model" [12] including 2 genes, and the "Insulin Resistance" signature including 15 genes [20] (Table 1). Since some genes are present in more than one signature, the final extracted set was made up of 98 genes (194 Affymetrix probes) (Table 1).

Gene Selection on the Merged GEO Datasets
The 98 genes selected from the published signatures were first tested in univariate analysis for their association with disease free survival (DFS). Forty-eight genes were associated with DFS with a p value < 0.01 and were selected for the subsequent step. Using an unsupervised hierarchical clustering algorithm, genes with similar expression profiles were grouped into 20 clusters. Within each cluster, one gene was selected using a multivariate Cox model, choosing the gene most strongly associated with DFS. The final 20-gene set, with all genes highly associated with DFS, is reported in Table 2.
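The filter-then-pick-one-per-cluster logic described above can be illustrated with a minimal Python sketch. It assumes that the univariate p values and the cluster labels have already been computed elsewhere, and it uses the smallest p value as a stand-in for the multivariate Cox comparison within each cluster. The function name, gene names and numbers are hypothetical and are not data from this study.

```python
from typing import Dict, List


def select_representatives(
    p_values: Dict[str, float],
    clusters: Dict[str, int],
    p_threshold: float = 0.01,
) -> List[str]:
    """Keep genes with p < p_threshold, then pick one gene per cluster."""
    # Step 1: univariate filter on the DFS association p value.
    kept = {gene: p for gene, p in p_values.items() if p < p_threshold}

    # Step 2: within each cluster, keep the gene with the smallest p value.
    # (Assumption: a simple stand-in for the multivariate Cox comparison.)
    best: Dict[int, str] = {}
    for gene, p in kept.items():
        label = clusters[gene]
        if label not in best or p < kept[best[label]]:
            best[label] = gene

    # Return the representatives ordered by cluster label.
    return [best[label] for label in sorted(best)]


# Toy input: made-up p values and cluster labels.
toy_p = {"GENE_A": 0.001, "GENE_B": 0.004, "GENE_C": 0.20, "GENE_D": 0.009}
toy_clusters = {"GENE_A": 1, "GENE_B": 1, "GENE_C": 2, "GENE_D": 2}
selected = select_representatives(toy_p, toy_clusters)
print(selected)
```

In this toy example GENE_C is dropped by the univariate filter, and GENE_A is preferred over GENE_B within cluster 1.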

Tumor Samples
Among 350 consecutive invasive breast cancer patients treated between 1998 and 2001, with full information about tumor, adjuvant treatments, follow up, relapse, death and causes of death, 89 cases (25.4%) were removed from the study because of low RNA concentration (below 10 ng/µL) or high degradation (Ct values for ACTB and B2M over 34). The remaining 261 cases were split into two biological sample datasets, the training set (137 cases) and the validation set (124 cases), by a simple criterion of consecutiveness.
The clinical and demographic characteristics of the patients included in the training and in the validation set are summarized in Table 3 and reported in detail in the supplementary file. Because the sets were built by a simple criterion of consecutiveness, the training set has a longer mean follow up (100.7 months; range 59-123) than the validation set (89.2 months; range 61-121). Nevertheless, the only significant differences between the two sets were the use of anthracycline-based regimens in the adjuvant setting (training 16% vs. validation 32.2%; p = 0.01) and a higher incidence of G3 tumors in the validation set (30.6% vs. 19.7%, p = 0.04). The lack of information about HER2 status is related to the temporal context of the selected cases (1998-2001); HER2 status was evaluated "a posteriori" in only 40% of relapsed patients. All other clinical and biological features are similar between the sets and reflect the "real life" picture of the disease in North East Italy at that time.
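As an illustration of the exclusion rule and of the consecutive split, the following Python sketch filters samples by RNA concentration and housekeeping-gene Ct values, then assigns the first cases to the training set. It assumes that "high degradation" means that both ACTB and B2M Ct values exceed 34. The class, function names and sample values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    case_id: int          # position in the consecutive series
    rna_ng_per_ul: float  # RNA concentration in ng per microlitre
    ct_actb: float        # Ct of the housekeeping gene ACTB
    ct_b2m: float         # Ct of the housekeeping gene B2M


def passes_qc(s: Sample, min_conc: float = 10.0, max_ct: float = 34.0) -> bool:
    """Reject samples with low RNA concentration or degraded RNA."""
    if s.rna_ng_per_ul < min_conc:
        return False
    # Assumption: degraded means BOTH housekeeping Ct values exceed max_ct.
    if s.ct_actb > max_ct and s.ct_b2m > max_ct:
        return False
    return True


def split_consecutive(
    samples: List[Sample], n_train: int
) -> Tuple[List[Sample], List[Sample]]:
    """First n_train cases (by case_id) form the training set."""
    ordered = sorted(samples, key=lambda s: s.case_id)
    return ordered[:n_train], ordered[n_train:]


# Toy series with made-up values.
toy = [
    Sample(1, 25.0, 28.1, 27.4),
    Sample(2, 6.0, 29.0, 28.2),   # rejected: concentration below 10
    Sample(3, 40.0, 35.2, 36.0),  # rejected: both Ct values above 34
    Sample(4, 18.0, 30.5, 29.9),
    Sample(5, 12.0, 31.0, 33.5),
]
usable = [s for s in toy if passes_qc(s)]
training, validation = split_consecutive(usable, n_train=2)
print([s.case_id for s in training], [s.case_id for s in validation])
```

With the figures reported in the text, 350 − 89 = 261 usable cases, and 89/350 corresponds to the stated 25.4% exclusion rate.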

Signature Definition on the Training Set
A multivariate stepwise Cox analysis including the 20 selected genes was run on the breast cancer samples. The Cox model selected a final set of five genes independently associated with DFS (Table 4). These five genes were combined into a linear score (signature) weighted according to the coefficients of the Cox model (Table 4), as: 0.125FGF18 − 0.560BCL2 + 0.409PRC1 + 0.104MMP9 − 0.188SERF1A. This score ranged from −2.95 to 2.91, with a mean value of −0.48 and an SD of 1.00. The linear score was highly associated with DFS in the training set: HR = 2.7, 95% CI = 1.9-4.0, p < 0.001.
The score was then categorized into three groups according to the tertiles of its distribution. The DFS according to the three risk groups is reported in Figure 2.
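The score calculation and the tertile categorization can be sketched in Python as follows. The coefficients are those reported for the Cox model; the expression values, the treatment of scores that fall exactly on a cut off, and the quantile method used by `statistics.quantiles` are assumptions made for illustration, since the text does not specify them.

```python
import statistics
from typing import Dict, List, Tuple

# Cox model coefficients of the five-gene signature.
COEFFICIENTS = {
    "FGF18": 0.125,
    "BCL2": -0.560,
    "PRC1": 0.409,
    "MMP9": 0.104,
    "SERF1A": -0.188,
}


def signature_score(expression: Dict[str, float]) -> float:
    """Weighted sum of the normalized expression levels of the five genes."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def tertile_cutoffs(training_scores: List[float]) -> Tuple[float, float]:
    """Two cut points that split the training scores into three groups."""
    low_cut, high_cut = statistics.quantiles(training_scores, n=3)
    return low_cut, high_cut


def risk_group(score: float, cutoffs: Tuple[float, float]) -> str:
    """Assumption: a score equal to a cut off falls in the lower group."""
    low_cut, high_cut = cutoffs
    if score <= low_cut:
        return "low"
    if score <= high_cut:
        return "intermediate"
    return "high"


# Toy data: made-up expression values and training scores.
toy_expression = {gene: 1.0 for gene in COEFFICIENTS}
toy_score = signature_score(toy_expression)
toy_training_scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.5]
cutoffs = tertile_cutoffs(toy_training_scores)
print(round(toy_score, 3), risk_group(toy_score, cutoffs))
```

The cut offs are computed once on the training scores and then reused unchanged, which is how the same thresholds can later be applied to an independent set of patients.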

Signature Evaluation on the Validation Set
The signature defined on the training set was evaluated on the independent set of data of the 124 patients included in the validation set. The discrimination ability of the signature was assessed on the validation set by a Kaplan Meier analysis, using the same cut offs classifying patients at low, intermediate or high risk of disease relapse as defined on the training set.
The score was also highly associated with DFS in the validation set (p < 0.001) (Figure 3). Compared with patients with a low risk signature, patients with an "intermediate risk" signature had an HR = 2.1 (95% CI = 0.72-6.2, p = 0.17) and patients with a high risk signature had an HR = 5.4 (95% CI = 2.0-14.4, p = 0.001).

Inter and Intra Assay Reproducibility
Three serial sections from each of three cases were evaluated independently in triplicate, and the coefficients of variation (CVs) of the recurrence score were calculated within the same run and between different runs. The intra-assay and the inter-assay CVs were 3.7% and 4.7%, respectively.
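A coefficient of variation is the standard deviation expressed as a percentage of the mean. The Python sketch below shows one common convention, which is an assumption here because the text does not give the exact formula: the intra-assay CV is the average of the CVs of the replicates within each run, and the inter-assay CV is the CV of the run means. The function names and the replicate values are hypothetical.

```python
import statistics
from typing import List


def cv_percent(values: List[float]) -> float:
    """Sample standard deviation as a percentage of the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * statistics.stdev(values) / abs(mean)


def intra_assay_cv(runs: List[List[float]]) -> float:
    """Average CV of the replicates measured within each run."""
    return statistics.mean([cv_percent(run) for run in runs])


def inter_assay_cv(runs: List[List[float]]) -> float:
    """CV of the per-run means across different runs."""
    return cv_percent([statistics.mean(run) for run in runs])


# Toy data: three runs of triplicate measurements (made-up values).
toy_runs = [
    [2.0, 2.1, 1.9],
    [2.2, 2.3, 2.1],
    [1.8, 1.9, 1.7],
]
intra = intra_assay_cv(toy_runs)
inter = inter_assay_cv(toy_runs)
print(round(intra, 2), round(inter, 2))
```

Because a CV divides by the mean, it becomes unstable for quantities whose mean is close to zero, so the scale on which the score is expressed matters when such figures are compared.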

Multivariate Analysis
The multivariate analysis (Cox regression) indicates that nodal status (p = 0.00001), T size (p = 0.0002) and the five-gene signature (p = 0.0004) are significantly related to DFS, while Ki67 (cut off: 14%), grading and chemotherapy or endocrine adjuvant treatments are not (Table 6). The HR of the five-gene signature is slightly affected by adjuvant treatments: Table 7 summarizes the data on the five-gene signature in the presence or absence of adjuvant treatment.

Discussion
In this study we developed a five-gene recurrence score able to estimate the likelihood of recurrence in a series of consecutive breast cancer tissue samples. These five informative genes were selected by a multistep approach summarized in Figure 1. First, we identified in silico a subset of 20 mRNAs differentially regulated in breast cancer by analyzing several publicly available array gene expression datasets with R/Bioconductor packages. We then evaluated, in vitro, the expression level of these 20 genes in paraffin-embedded sections from 261 consecutive invasive breast cancer cases not selected for age, adjuvant treatment, nodal status or estrogen receptor status. The only requirement was a minimum follow up of 5 years with full clinical data. Each tissue block was reviewed by a pathologist to ensure a tumor cell content greater than 70%. The gene expression analysis was based on RTqPCR. The biological samples dataset was split into a training and a validation dataset. The gene signature was developed on the training set by a multivariate stepwise Cox analysis that selected five genes independently associated with DFS. These five genes were combined into a linear score (signature) weighted according to the coefficients of the Cox model. The signature was then evaluated on the validation set, assessing its discrimination ability by a Kaplan Meier analysis that used the same cut offs defined on the training set to classify patients at low, intermediate or high risk of disease relapse.
These five genes of interest were identified without any a priori selection for gene function or cancer involvement, but simply for the relationship between their expression level and DFS. Interestingly, except for SERF1A, whose function is still unknown, they have been described to play an important role in cancer as follows: (a) FGF18: Its over-expression in tumors has also been demonstrated [21,22]. FGF18 expression is up-regulated through the constitutive activation of the Wnt pathway observed in most colorectal carcinomas [23]. As a secreted protein, FGF18 can thus affect both the tumor and the connective tissue cells of the tumor microenvironment. (b) BCL2: Over-expression of BCL2 protein has been identified in a variety of solid organ malignancies, including breast cancer. BCL2 transcript over-expression is related to unfavorable prognosis in Oncotype Dx [9] and in Mammaprint® [3]. (c) PRC1: It associates with the mitotic spindle and has been found to play a crucial role in the completion of cytokinesis [24,25]. PRC1 is negatively regulated by p53 and is over-expressed in p53-defective cells [26], suggesting that the gene is tightly regulated in a cancer-specific manner. (d) MMP9: Metalloproteases are frequently up-regulated in the tumor microenvironment [27]. MMP9 influences many aspects of tissue function by cleaving a diverse range of extracellular matrix components, cell adhesion molecules, and cell surface receptors, and regulates the bioavailability of many growth factors and chemokines [28]. (e) SERF1A: The function of SERF1A is not yet known.
The biological properties of these genes are related to four of the six hallmarks of cancer proposed by Hanahan et al. [29,30]: FGF18 belongs to the "Self-sufficiency in growth signal" group, BCL2 to the "Evading apoptosis" group, PRC1 to the "Limitless replication potential" group, and MMP9 to the "Tissue invasion and metastasis" group, while the function of SERF1A is still unknown. These findings establish a link between our proposed molecular signature of breast cancer and the previously categorized capabilities acquired during the multistep development of human tumors [29,30].
From an experimental point of view, our assay appears affordable and not time consuming, it requires only FFPE tissue, and it could be performed easily in almost any laboratory with the required RT-qPCR instrumentation. Importantly, it was validated in a "real life" clinical setting on a set of consecutive breast cancer cases irrespective of age, nodal and estrogen receptor status, and adjuvant treatment, with a minimum follow up of 5 years. An important limitation of our approach is that the test was feasible in only 74.6% of the initial set of cases because of RNA degradation in FFPE tissues, in line with the literature on other signatures [19,31,32]. RNA degradation can be monitored simply by evaluating the Ct values of the housekeeping genes used for normalization. Multicentric studies will be needed to evaluate possible pitfalls due to experimental inter-laboratory variability and, above all, to increase the reliability of the assay. A further step will be to analyze the predictive value of the five-gene signature, in the ER-positive population, for the benefit of tamoxifen alone and of chemotherapy added to tamoxifen.

Tumor Samples Enrolled in This Study
Tumor samples were obtained from routinely processed formalin-fixed, paraffin-embedded sections retrieved from 350 consecutive invasive breast cancer patients treated between 1998 and 2001, with full information about tumor, adjuvant treatments, follow up, relapse, death and causes of death. In order to test our signature in a "real life" clinical setting, we decided to use consecutive non-metastatic breast cancer cases irrespective of age, nodal and estrogen receptor status, and adjuvant treatment. The only requirement was a minimum follow up of 5 years with full clinical data. All patient information was handled in accordance with review board approved protocols and in compliance with the Helsinki declaration [33]. Hematoxylin and Eosin (H & E) sections were reviewed to identify paraffin blocks with tumor areas. Histological type and grade were assessed according to the World Health Organization criteria [34]. The detailed histological and clinical features of each patient enrolled in this study are available in the supplementary information file. Paraffin blocks were selected when the corresponding histology sections showed the highest relative amount of tumor vs. stroma, few infiltrating lymphoid cells, and no significant areas of necrosis. Three 20 µm thick sections were cut, followed by one H & E control slide. The tumor area selected for the analysis was marked on this control slide to ensure a neoplastic cell content greater than 70%. Dissected tumor areas ranged from 0.5 to 1.0 cm².

Ethics Statement
The use of tissues for this study has been approved by the Ethics Committee of Centro Oncologico, ASS1 triestina & Università di Trieste, Italy. A comprehensive written informed consent was signed for the surgical treatment that produced the tissue samples and the related diagnostic procedures. All information regarding the human material used in this study was managed using anonymous numerical codes, clinical data were not used and samples were handled in compliance with the Helsinki declaration (http://www.wma.net/en/30publications/10policies/b3/).

RNA Isolation
Paraffin-embedded tumor material obtained from the 20 μm thick sections was de-paraffinized in xylene at 50 °C for 3 min and rinsed twice in absolute ethanol at room temperature. Total RNA was extracted using the RecoverAll kit (Ambion, Austin, TX, USA), including a DNase step, according to the manufacturer's recommended protocol. RNA concentration was measured with the Quant-iT™ RNA kit (Invitrogen, Carlsbad, CA, USA).

Two Step RTqPCR Analysis
Fourteen µL of total RNA was subjected to reverse transcription using the SuperScript® VILO™ cDNA Synthesis kit (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's recommended protocol. One microlitre of cDNA was amplified in duplicate by adding 10 picomoles of each primer (see Table 8 for sequence details) to the 1× QuantiFast™ SYBR® Green PCR solution (Qiagen, Hilden, Germany) in a final volume of 25 µL.
Cycling conditions consisted of 5 min at 95 °C, followed by 40 cycles of 10 s at 95 °C and 30 s at 60 °C, using Stratagene Mx3000™ or ABI SDS 7000™ instruments. Plate reading was performed during the 60 °C step.
For each primer set, standard curves made from serial dilutions of cDNA from the MCF7 cell line (see Table 2) were used to estimate the PCR reaction efficiency (E) using the formula: E (%) = (10^(−1/slope) − 1) × 100. The expression levels of each of the 20 selected genes were normalized by GeNorm [35] using 2 housekeeping genes (B2M and ACTB), and the relative quantification was calculated with the statistical computing language R. The human breast cancer cell line MCF7 was purchased from the American Type Culture Collection (ATCC HTB22; derived from a human breast adenocarcinoma). Cells were maintained in minimal essential medium (MEM) (Invitrogen/Life Technologies, Villebon-sur-Yvette, France) supplemented with 2 mM L-glutamine, 1.5 g/L sodium bicarbonate, 0.1 mM nonessential amino acids, 1 mM sodium pyruvate, 0.01 mg/mL bovine insulin, and 10% fetal bovine serum (Thermo Scientific, Waltham, MA, USA) at 37 °C in a humidified atmosphere of 5% CO2.
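The efficiency formula and the normalization to housekeeping genes can be illustrated with a short Python sketch (the study itself performed these calculations in R). The slope is obtained by least-squares regression of Ct on log10 of the relative template quantity. The normalization shown, which divides the target quantity by the geometric mean of the reference-gene quantities, is a simplified geNorm-style calculation and an assumption about the exact procedure; all function names and numerical values are hypothetical.

```python
import math
from typing import List


def fit_slope(log10_quantities: List[float], ct_values: List[float]) -> float:
    """Least-squares slope of Ct versus log10(relative quantity)."""
    n = len(log10_quantities)
    mean_x = sum(log10_quantities) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_quantities)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(log10_quantities, ct_values)
    )
    return sxy / sxx


def efficiency_percent(slope: float) -> float:
    """E (%) = (10^(-1/slope) - 1) * 100."""
    return (10 ** (-1.0 / slope) - 1.0) * 100.0


def relative_quantity(ct: float, efficiency_pct: float) -> float:
    """Quantity on an arbitrary scale: (1 + E) raised to the power -Ct."""
    return (1.0 + efficiency_pct / 100.0) ** (-ct)


def normalized_expression(
    target_ct: float,
    target_eff: float,
    reference_cts: List[float],
    reference_effs: List[float],
) -> float:
    """Target quantity divided by the geometric mean of reference quantities."""
    ref_q = [relative_quantity(c, e) for c, e in zip(reference_cts, reference_effs)]
    norm_factor = math.exp(sum(math.log(q) for q in ref_q) / len(ref_q))
    return relative_quantity(target_ct, target_eff) / norm_factor


# Toy standard curve: 10-fold dilutions with perfect doubling per cycle.
toy_log10 = [-float(k) for k in range(4)]
toy_ct = [20.0 + 3.321928 * k for k in range(4)]
toy_eff = efficiency_percent(fit_slope(toy_log10, toy_ct))

# Toy sample: target Ct 25, reference genes (e.g., ACTB and B2M) Ct 20 and 22.
toy_norm = normalized_expression(25.0, 100.0, [20.0, 22.0], [100.0, 100.0])
print(round(toy_eff, 2), toy_norm)
```

A slope close to −3.32 corresponds to an efficiency close to 100%, that is, a doubling of the product at each cycle.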

Training and Validation Dataset
The biological samples dataset was split into the training and the validation dataset. The training set consisted of the first 137 consecutive cases and the validation set of the last 124 cases. The gene signature was developed on the training set. Once the signature had been fully specified, the validation set was accessed once, and only to estimate the prediction accuracy of the signature. A multivariate stepwise Cox analysis including the 20 selected genes was run on the breast cancer training set samples. The stepwise procedure was run to select genes independently associated with DFS (p for inclusion < 0.10). The overall workflow shown in Figure 1 summarizes every step, from the selection of markers from the literature to the validation of the gene signature. Reproducibility within and between blocks was assessed by performing the test on serial sections from three blocks representing three cases. We finally performed a multivariate Cox proportional-hazards analysis in a model that included treatment received (no adjuvant therapy vs. chemotherapy, hormonal therapy, or both) and the final gene signature (both training and validation sets included), using the NCSS 2001 Statistical software (NCSS Inc., Kaysville, UT, USA, 2001).

Univariate and Multivariate Analysis
We performed a univariate analysis including Age, T size, Nodal status, Grading, Ki67, adjuvant treatments and the 5-gene signature, followed by a multivariate Cox proportional-hazards analysis in a model that included treatment received (no adjuvant therapy vs. chemotherapy, hormonal therapy, or both) and the 5-gene Signature (Low/Intermediate/High Risk; both Training and Validation sets included), using the NCSS 2001 Statistical software (NCSS Inc., Kaysville, UT, USA, 2001).

Conclusions
We developed a prognostic tool for early breast cancer based on the analysis of the relative expression levels of FGF18, BCL2, PRC1, MMP9 and SERF1A in combination. Our signature showed good discriminating ability when tested on the validation set. We propose that, after the necessary further clinical validation on a larger number of cases, it could be used as an inexpensive prognostic signature for disease free survival in breast cancer patients for whom the indication for adjuvant chemotherapy added to endocrine treatment is uncertain.