Qualitative and Quantitative Requirements for Assessing Prognostic Markers in Prostate Cancer

Molecular prognostic markers are urgently needed to improve therapy decisions in prostate cancer. To better understand the requirements for biomarker studies, we re-analyzed prostate cancer tissue microarray immunohistochemistry (IHC) data from 39 prognosis markers in subsets of 50 to >10,000 tumors. We found a strong association between the “prognostic power” of individual markers and the minimal number of tissues that should be included in such studies. The prognostic relevance of about 90% of the 39 IHC markers could be detected if ≥6400 tissue samples were analyzed. Studying markers of tissue quality, including immunohistochemistry of ets-related gene (ERG) and vimentin, and fluorescence in-situ hybridization analysis of human epidermal growth factor receptor 2 (HER2), we found that 18% of tissues in our tissue microarray (TMA) showed signs of reduced tissue preservation and limited immunoreactivity. Comparing the results of Kaplan-Meier survival analyses or associations to ERG immunohistochemistry in subsets of tumors with and without exclusion of these defective tissues did not reveal statistically relevant differences. In summary, our study demonstrates that TMA-based marker validation studies using biochemical recurrence as an endpoint require at least 6400 individual tissue samples to establish statistically relevant associations between the expression of molecular markers and patient outcome if weak to moderate prognosticators are also to be reliably identified.


Introduction
Prostate cancer is the most frequent malignancy in men. The clinical behavior ranges from slowly growing indolent tumors to highly aggressive and metastatic cancers. Based on the results of large autopsy studies demonstrating a high prevalence of prostate cancers also in men who never experienced symptoms of the disease during their lifetime, it is assumed that most prostate cancer patients would be manageable without surgery and its associated side effects [1]. Accordingly, distinguishing between the low malignant and indolent form of the disease that does not require immediate therapy and the aggressive cancers that will eventually progress to life-threatening disease is the clinically most relevant discipline of current prostate cancer research.
Only recently, commercial molecular classifiers have become available. These tests are based on mRNA expression profiling of defined gene sets, allow for estimating the biological aggressiveness of a cancer and, therefore, may aid in therapy decisions [2][3][4]. These classifiers underscore the interest of the diagnostic industry in the topic of prostate cancer prognosticators. It can be expected that future generations of such classifiers can be substantially improved if genes are included that on their own already exhibit strong and independent prognostic power.
During the last decades, a multitude of studies announced prognostic biomarkers for prostate cancer. However, although more than one hundred different prognostic markers have been suggested (reviewed in [5]), none of them has entered clinical routine testing as yet. This disappointing failure to translate research findings into clinical applications is partly due to the fact that data obtained on virtually all of these markers vary widely between studies. This is true even for the most established prognostic parameters, such as p53 or phosphatase and tensin homolog (PTEN). More than 50 studies analyzed the impact of p53 alterations on prostate cancer phenotype and prognosis. Although most immunohistochemistry studies reported a link between nuclear p53 accumulation and adverse tumor features, such as high grade, advanced stage, and peripheral zone origin [6], as well as poor prognosis after radical prostatectomy [7] or external beam radiation [8] and unfavorable clinical courses in conservatively managed patients [9], there are also studies that do not confirm these associations [10,11]. Likewise, although genomic deletion of PTEN has been unequivocally linked to adverse tumor features in several studies [12][13][14][15][16], other studies employing immunohistochemistry reported highly variable results on the prognostic value of PTEN expression. For example, an association between loss of PTEN expression and poor patient prognosis was found in only one [13] out of four studies [13][17][18][19], and a link between loss of PTEN expression and high Gleason grade or advanced tumor stage was reported in only two [20,21] out of five studies on this topic [18][19][20][21][22].
It is quite obvious that most of the discrepant results in the literature are due to (i) technical issues, and (ii) the relatively small patient cohorts used for most studies. Different antibodies, staining protocols, and scoring criteria can cause massive experimental variation. Following an intense dispute with a reviewer of one of our manuscripts on whether our frequency of p53 immunostaining in prostate cancer was lower than the 50% suggested by another group because of protocol issues (our opinion) or because of missed heterogeneity in a tissue microarray (TMA) setting (the reviewer's opinion), we were forced to demonstrate experimentally that the fraction of p53 positive prostate cancers could be shifted from 2.5% to 98% solely by protocol modifications [7]. The example of HER2 immunohistochemistry analysis of breast cancer demonstrates, however, that a considerable (but not a complete) degree of assay standardization can be achieved [23]. Yet even in such a highly standardized analysis including various controls, the quality of the tissue samples will impact the results. This is because postsurgical tissue fixation cannot be fully standardized. The most frequently used fixative, i.e., formalin, causes proteins to cross-link and protects them from microbial degradation and autolysis. The efficiency of the fixation process depends on the proper penetration of the formalin into the tissue, which in turn depends greatly on the size and the composition of a given tissue. In case of too much or too little fixation, the tissue may not be suitable for analysis. This is a serious problem particularly in immunohistochemistry studies, where lack of immunoreactivity cannot be distinguished from a true negative result due to biological absence of the protein of interest.
The tissue microarray (TMA) technology has proven to be excellently suited for rapid and cost-efficient analysis of large numbers of tissue samples [24]. While in studies analyzing conventional large sections the study cohort size was typically limited to less than 100 samples due to the time and costs connected with such "classical" analyses, it is primarily the availability of suitable tissues that limits the size of TMA studies. As a consequence, TMA studies including hundreds of tissue samples are often viewed as "large-scale" analyses. The extent and impact of low-quality tissues that are inevitably included in every large-scale tissue analysis are, however, unknown. In the present study, we took advantage of our very large prostate cancer tissue microarray comprising more than 12,000 tissue spots and molecular data from more than one hundred proteins analyzed by means of immunohistochemistry to better understand the impact of the sample size and the tissue quality on the outcome of TMA studies for marker validation purposes. Biochemical (PSA) recurrence was used as an endpoint in this project dealing with patients having undergone prostatectomy. The reason for this is that PSA recurrence is the "easiest" (most frequent) clinical endpoint to analyze in prostatectomy studies and it is strongly associated with other clinical endpoints such as metastasis or cancer-related death.

Experimental Section
Patients. Radical prostatectomy specimens were available from 12,427 patients undergoing surgery between 1992 and 2012 at the Department of Urology and the Martini Clinics at the University Medical Center Hamburg-Eppendorf. Follow-up data were available for a total of 12,344 patients with a median follow-up of 36 months (range of 1 to 241 months; Table 1). Prostate-specific antigen (PSA) values were measured following surgery, and PSA recurrence was defined as a postoperative PSA of 0.2 ng/mL and increasing at first appearance. All prostate specimens were analyzed according to a standard procedure, including a complete embedding of the entire prostate for histological analysis [7]. The TMA manufacturing process was described earlier in detail [24]. In short, one 0.6 mm core was taken from a representative tissue block from each patient. The tissues were distributed among 27 TMA blocks, each containing 144 to 522 tumor samples. For internal controls, each TMA block also contained various control tissues, including normal prostate tissue.
Statistics. Statistical calculations were performed with JMP 9 (JMP®, Version 9; SAS Institute Inc., Cary, NC, USA, 1989-2007). Contingency tables and the chi² test were used to search for associations between molecular parameters and tumor phenotype. Survival curves were calculated according to Kaplan-Meier. The Log-rank test was applied to detect significant differences between groups.
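For readers who want to retrace the survival statistics outside of JMP, the following is a minimal sketch of a Kaplan-Meier estimator and a two-group Log-rank test in plain Python. It is not the software used in this study; the function names (`kaplan_meier`, `logrank_test`) and the input layout (follow-up time in months plus an event flag, 1 for PSA recurrence and 0 for censored) are assumptions made for illustration only.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival) steps.

    times  : follow-up times (e.g., months), hypothetical input layout
    events : 1 if PSA recurrence was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0   # recurrences observed exactly at time t
        n_leaving = 0  # recurrences plus censored cases at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / n_at_risk
            curve.append((t, survival))
        n_at_risk -= n_leaving
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group Log-rank test; returns (chi2, p) for 1 degree of freedom."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance if variance > 0 else 0.0
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Calling `logrank_test` with the follow-up data of marker-positive and marker-negative tumors yields the chi² value and p-value of the kind reported in the tables of this study, under the assumptions stated above.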

Impact of the Tissue Quality
In order to identify tissues with poor immunoreactivity, we performed ERG and vimentin immunohistochemistry analysis of our TMA. These proteins are expressed in virtually every human tissue. ERG is a member of the E26 transformation-specific (ETS) transcription factor family that is expressed in endothelial cells. ERG had previously been studied extensively in our TMA because it is strongly expressed in about 50% of prostate cancers [47] and has been linked to early onset prostate cancer [48]. For the purpose of identifying low-quality tissues, we re-analyzed our large 12,427-sample prostate cancer TMA for ERG expression specifically in endothelial cells (Figure 2a). In addition, we stained the TMA for vimentin, a type III intermediate filament that is strongly expressed in mesenchymal cells, which typically accompany prostate cancer cells (Figure 2b). Since blood vessels and mesenchymal cells can be found in virtually every prostate tissue sample, we considered complete absence of vimentin and ERG staining as an indicator of impaired immunoreactivity. In addition, we took advantage of the results from an earlier study, where we demonstrated that low-quality tissues with impaired immunoreactivity also showed a poor performance in fluorescence in-situ hybridization (FISH) analysis of gene copy numbers [49]. In the present study, we performed HER2 FISH analysis on the TMA and considered absence of FISH signals as an additional indicator of poor tissue quality. In summary, all tissue spots that showed simultaneous lack of ERG and vimentin immunostaining and absence of HER2 FISH signals were considered "low-quality". A total of 11,223 tissue spots were included in this analysis. The remaining tissue spots were excluded because they were severely damaged or absent in the TMA slides. Simultaneous lack of ERG, vimentin, and HER2 signals was found in 2056 (18.3%) of the analyzed tissue spots.
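The classification rule described above can be written down in a few lines. The sketch below is illustrative only: the `Spot` record and its field names are hypothetical and do not correspond to the actual layout of our database. It flags a spot as "low-quality" only if all three signals are absent, and it recomputes the reported fraction from the counts given in the text.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    """One analyzable TMA spot (hypothetical record layout)."""
    erg_endothelial: bool   # ERG staining present in endothelial cells
    vimentin: bool          # vimentin staining present in mesenchymal cells
    her2_fish_signal: bool  # HER2 FISH signals present


def is_low_quality(spot):
    """A spot is 'low-quality' only if ERG, vimentin, and HER2 FISH all fail."""
    return not (spot.erg_endothelial or spot.vimentin or spot.her2_fish_signal)


def low_quality_fraction(spots):
    """Fraction of spots flagged as 'low-quality' (0.0 for an empty list)."""
    if not spots:
        return 0.0
    return sum(is_low_quality(s) for s in spots) / len(spots)


# Counts reported in the text: 2056 of 11,223 analyzed spots.
reported_percent = round(100 * 2056 / 11223, 1)
```

Note that a spot lacking only one or two of the three signals is not flagged under this rule, which mirrors the "simultaneous lack" criterion used in the text.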
These "low-quality" tissues were randomly distributed across the TMA, and there was no obvious association between tissue reactivity and tumor phenotype or patient outcome (Figure 3). The marginally significant p-values obtained in these analyses do not reflect true associations but are attributable to slight variations between the groups.
To further investigate the performance of these "low-quality" spots in IHC experiments, we next compared the staining patterns of our 39 IHC markers in subsets of 2000 "low-quality" and 2000 "high-quality" tissues (i.e., samples that were positive for all of ERG, vimentin, and HER2). This analysis revealed that, although the "low-quality" tissues can be stained with most of the tested antibodies, there was an average reduction of about 12 percentage points in the fraction of positive tissue samples across all of these markers in "low-quality" tissues as compared to "high-quality" tissues (Figure 4). These data demonstrate that "low-quality" tissues bear a high risk of underestimating the true expression level and may even result in false negative findings. These findings imply that problems may arise when it comes to comparisons between biomarkers analyzed by immunohistochemistry. In such a scenario, false positive associations can potentially occur if the level of immunostaining of the analyzed markers parallels the quality of the tissue in a relevant fraction of samples. In contrast, inverse associations must always be considered valid. Here, the same tissues that are negative for one marker stain positive for the other, thus excluding the possibility of false associations due to reduced immunoreactivity. To assess the potential impact of "low-quality" tissues on the reliability of associations between ERG and other IHC markers, we used our existing ERG IHC data [47], which showed a positive result in about 50% of cancers. We studied the associations of all 39 markers to ERG expression, including markers with strong associations to ERG positivity (e.g., Marker #24, Figure 5a), markers with strong associations to ERG negativity (e.g., Marker #34, Figure 5c), markers with weak associations to ERG positivity (e.g., #12 (MTC02), Figure 5b), and markers lacking such associations (e.g., Marker #32, Figure 5d).
Particularly for the latter set of markers, it could be possible that "low-quality" tissues drive such weak associations. All analyses were performed in differently sized subsets of our large TMA, and the significance of associations was compared between tissue sets containing both "low" and "high"-quality tissues and tissue sets after excluding the 2056 "low-quality" tissues from the data. Since statistical associations become stronger the more samples are analyzed, we performed the analyses in randomly selected subsets of 1600, 3200, 6400, and 10,000 samples. To compensate for incidental findings that might arise from random subset selection, we repeated each analysis five times. The Log-rank chi² p-value was recorded from each analysis, and the average Log-rank chi² p-value was calculated from the five repeated analyses. All results are shown in Table 2. Following the same analysis strategy, we also asked whether the reduced immunoreactivity in the "low-quality" tissues impacted the outcome of prognosis associations. For this analysis, we selected five markers from our set of 39 prognostic markers and generated Kaplan-Meier survival plots to compare the impact of these markers (using biochemical (PSA) recurrence as clinical endpoint) before and after exclusion of the 2056 "low-quality" tissues. All results are shown in Table 3, and examples of Kaplan-Meier plots are given in Figure 4.
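The repeated random-subset strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the JMP workflow used in the study: a Pearson chi² test on a 2×2 contingency table stands in for the association test, each record is a hypothetical tuple `(marker_positive, erg_positive, low_quality)`, and "low-quality" spots are dropped after the random subset has been drawn. The function names and the `seed` parameter are likewise illustrative.

```python
import math
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test for a 2x2 table [[a, b], [c, d]].

    Returns (chi2, p) with 1 degree of freedom; (0.0, 1.0) if a margin is empty.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denominator
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def mean_p_value(records, subset_size, repeats=5,
                 exclude_low_quality=False, seed=0):
    """Average chi-square p-value over repeated random subsets.

    records: list of (marker_positive, erg_positive, low_quality) tuples
             (hypothetical layout for illustration).
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(repeats):
        subset = rng.sample(records, subset_size)
        if exclude_low_quality:
            # Assumption: exclusion is applied after drawing the subset.
            subset = [r for r in subset if not r[2]]
        a = sum(1 for m, e, lq in subset if m and e)
        b = sum(1 for m, e, lq in subset if m and not e)
        c = sum(1 for m, e, lq in subset if not m and e)
        d = sum(1 for m, e, lq in subset if not m and not e)
        p_values.append(chi2_2x2(a, b, c, d)[1])
    return sum(p_values) / len(p_values)
```

Running `mean_p_value` once with `exclude_low_quality=False` and once with `exclude_low_quality=True` for each subset size gives the paired comparison summarized in Table 2.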
In both sets of calculations, we did not observe changes in the analysis results regardless of whether the "low-quality" tissues were excluded from the analysis or not, demonstrating that the ≈20% "low-quality" tissues present in our TMA did not significantly impact the study outcome. Nevertheless, the examples of associations with ERG expression given in Figure 5 confirm that the fraction of entirely negative samples can be slightly overestimated unless the "low-quality" tissues are excluded, as indicated by a difference of five percentage points between tumors with a negative result for both Marker #24 and ERG in subsets of cancers before and after exclusion of the "low-quality" tissues. However, the finding that even positive associations resulting from discrete expression differences remained significant after exclusion of the "low-quality" tissues (Figure 5b) clearly demonstrates that associations between different markers can be reliably detected in large-scale TMA studies. Since "low-quality" tissues were randomly distributed across all samples irrespective of the clinical course (Figure 3), it was not surprising that there was no difference in the ability to detect prognostic differences in tissue sets with or without "low-quality" tissues. Here, the "underestimation" of the true staining intensity resulted in a smooth shift of all survival curves either towards an overall better prognosis (i.e., if strong expression of the marker was linked to poor prognosis), or towards worse prognosis (i.e., if strong expression of the marker was linked to good prognosis), whereas the relative distance between the curves remained largely constant. However, this analysis also suggested that the number of samples included in marker validation analyses might have a much stronger impact on the analysis result than the tissue quality, thus prompting us to analyze the impact of the sample size in more detail below.

Table 2. Impact of the tissue quality on the association between expression of ERG and other IHC markers. The chi² p-values are given for survival analyses in subsets of 1600-10,000 tissue spots. "Low quality tissue" indicates whether tissues with impaired immunoreactivity were excluded from analysis or not (included). "Association strength" separates the markers into those with weak, moderate, or strong positive associations (i.e., the marker is more frequently expressed in ERG positive than in ERG negative cancers), those with inverse associations (i.e., the marker is more frequently expressed in ERG negative than in ERG positive cancers), and those that are unrelated to ERG (no association).

Table 3. Impact of the tissue quality on the outcome of Kaplan-Meier survival analysis. The chi² p-values are given for survival analyses in subsets of 1600-10,000 tissue spots. "Low quality tissue" indicates whether tissues with impaired immunoreactivity were excluded from analysis or not (included).

Impact of the Sample Size
In order to estimate the minimal sample size that is required to yield statistically stable results in prostate cancer prognosis marker validation studies, we carried out serial analyses in randomly selected subsets of 50, 100, 200, 400, 800, 1600, 3200, and 6400 samples as well as in all (12,427) samples included in our TMA. We generated Kaplan-Meier survival plots and performed Log-rank chi² tests for a total of 39 protein markers with confirmed prognostic relevance from our molecular database. The smallest sample set that revealed a Log-rank p-value of 0.001 or less was considered to be sufficient for reliable marker analysis, provided that this significance level also held true in the analysis of all larger sample sets. In addition, in order to rank the "prognostic power" of our 39 markers, we summed the Log-rank chi² values emerging from all subset analyses of each marker. This strategy was selected because chi² values can be easily extracted from all tests and thus provide a simple yet objective index of the power of individual markers. We grouped our markers according to the accumulated chi² values into markers with "weak" (sum of all chi² values <100), "moderate" (sum chi² 101-299), and "strong" prognostic power (sum chi² ≥300). The Log-rank p-values for all markers in each sample subset and the accumulated Log-rank chi² values per marker are shown in Table 4, and exemplary Kaplan-Meier plots are given in Figure 6.
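The two decision rules described above (the minimal sufficient sample set and the "prognostic power" grouping) can be expressed compactly. The following is a minimal sketch with hypothetical function names and input dictionaries keyed by subset size. One assumption is labeled in the code: accumulated chi² values falling between the stated group boundaries (from 100 up to 101, and above 299 but below 300) are assigned to the "moderate" group, since the text does not specify them.

```python
SUBSET_SIZES = [50, 100, 200, 400, 800, 1600, 3200, 6400, 12427]


def minimal_sufficient_size(p_by_size, alpha=0.001):
    """Smallest subset size with p <= alpha that also holds in all larger sets.

    p_by_size: dict mapping subset size -> Log-rank p-value for one marker.
    Returns None if even the largest set does not reach the threshold.
    """
    sufficient = None
    # Walk from the largest set downwards and stop at the first failure.
    for size in sorted(p_by_size, reverse=True):
        if p_by_size[size] <= alpha:
            sufficient = size
        else:
            break
    return sufficient


def marker_power(chi2_by_size):
    """Group a marker by the sum of its Log-rank chi-square values.

    Assumption: sums that fall between the stated boundaries
    are treated as 'moderate'.
    """
    total = sum(chi2_by_size.values())
    if total < 100:
        return "weak"
    if total < 300:
        return "moderate"
    return "strong"
```

With this rule, a marker that is significant in a small subset but not in the next larger ones is not credited with the small subset, which is how incidental findings in small cohorts are filtered out.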
The results of this analysis first of all demonstrate a close relationship between the "prognostic power" of a marker and the number of samples that need to be analyzed in order to reliably evaluate the marker's prognostic potential (Figure 7 and Table 4). Given that the power of a marker of interest is typically not known before the analysis is performed (particularly in case of novel and uncharacterized candidate markers), and that four markers revealed prognostic relevance only if the entire sample set was analyzed (Table 4), our findings imply that as many samples as possible should be included in such marker validation experiments in order to also reliably detect minor associations between prostate cancer genotype and clinical behavior. However, from a more practical point of view, our data also demonstrate that a cohort size of 6400 prostate cancers is sufficient to reproduce the prognostic value of the vast majority (i.e., 35 out of 39, 90%) of the markers included in our study.

Table 4. Impact of the sample size on the outcome of Kaplan-Meier survival analyses. The chi² p-values are given for survival analyses in subsets of 50-12,427 tissue spots. "n analyzable" gives the number of interpretable tissue spots if the entire TMA was analyzed. "Marker power" indicates the relative prognostic power as described in the results section. Bold face indicates the minimal sample set that was considered sufficient to evaluate the respective marker. Grey color indicates sample sizes that yielded strong prognostic relevance.

Figure 7. Association between the "prognostic power" of different immunohistochemistry markers and the minimal number of samples that is required for statistically sound marker validation studies. "Marker Power" is given as the sum of Log-rank chi² values per marker from the analysis in subsets of 50, 100, 200, 400, 800, 1600, 3200, 6400, and 12,427 samples. Some markers are annotated as examples.
Importantly, only two (5%) of the 39 markers in our study, including p53 as a prime example of a very strong prognostic marker [27], had sufficient prognostic power to allow for conclusive results also in small cohorts including fewer than 500 prostate cancers. This finding is of particular interest since the majority of prostate cancer marker studies still analyze fewer than 500 cancer samples [5]. Our findings provide a simple explanation for the highly discrepant results on most potential prognostic biomarkers. We also found significant associations in subsets of fewer than 500 samples that did not hold true in the next larger subsets and must, therefore, be considered incidental. For example, Marker #7 revealed significant p-values in subsets of 50 and 200 samples, but not in subsets of 100, 400, or even 3200 samples, demonstrating that analysis of small subsets can occasionally lead to incidental statistical results.
It is further of note that there is always a considerable fraction of samples that does not yield interpretable results. This is due to typical TMA-related issues, including exhausted tissue cores resulting in empty spots in the TMA section or lack of tumor cells in the tissue spot. In our study, the fraction of non-interpretable tissue cores was about 35%, independent of the size of the subset selected for analysis (Figure 8a). As a consequence, the number of interpretable samples varied between 6494 and 10,946 (average 8592) spots for the different markers in the entire dataset (n = 12,427) and averaged, for example, 33.4 cancers in the 50-sample subset, 1019.7 cancers in the 1600-sample subset, or 4065 cancers in the 6400-sample subset (Figure 8b). Therefore, a certain dropout rate of TMA spots should always be taken into account when a TMA is built. The fraction of interpretable samples can potentially be increased if multiple samples of the same cancer specimen are included in the TMA. We do not recommend this procedure, however. For example, building a 6000-sample TMA from 2000 cancers with three spots from each cancer can be expected to result in about 1800 interpretable cancers (which is still too small a number for reliable statistical analysis according to our data), but is connected with the same costs, analysis time, and tissue consumption as a 6000-sample TMA built from one punch per tumor, which will, however, result in about 3900 interpretable cancers. In addition, analysis of multiple cores always introduces a statistical bias into the analysis. This is because not all of the multiple tissue spots per tumor will be analyzable, and tumors with three to four interpretable spots might have a higher likelihood of showing positive staining than tumors with only one to two interpretable tissue spots.
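The arithmetic behind the one-punch-per-tumor estimate and the bias argument for multiple cores can be illustrated with a short sketch. The function names are hypothetical. The interpretable rate of 0.65 follows from the dropout of about 35% reported above, whereas the per-spot positivity probability and the independence of spots from the same tumor are simplifying assumptions introduced here purely for illustration.

```python
def expected_interpretable(n_spots, interpretable_rate=0.65):
    """Expected number of interpretable spots with one punch per tumor.

    interpretable_rate = 1 - dropout (about 35% dropout in this study).
    """
    return n_spots * interpretable_rate


def detection_probability(k_interpretable_spots, p_positive_per_spot):
    """Chance that at least one of k interpretable spots stains positive.

    Assumption: spots of one tumor stain positive independently with the
    same probability (a simplification for a heterogeneously expressed
    marker, not a model fitted to our data).
    """
    return 1.0 - (1.0 - p_positive_per_spot) ** k_interpretable_spots


# One punch per tumor, 6000 tumors.
single_core_yield = expected_interpretable(6000)

# Bias illustration: the same tumor is more likely to be called
# "positive" when more of its spots happen to be interpretable.
one_spot = detection_probability(1, 0.3)
three_spots = detection_probability(3, 0.3)
```

Under these assumptions, `three_spots` exceeds `one_spot`, which is the source of the bias: the call for a tumor depends on how many of its cores survive processing rather than on its biology alone.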

Conclusions
The availability of a very large prostate cancer prognosis TMA with an extensive molecular database, including samples from more than 12,000 individual prostate cancers as well as molecular data from 39 prognostically relevant protein markers, enabled us to evaluate the impact of qualitative and quantitative factors on prostate cancer biomarker studies. The results of our analyses suggest that such studies should aim at the analysis of at least 6400 individual prostate cancer samples to obtain reliable statistical findings that allow for a conclusive judgment of the potential prognostic value of a marker of interest. Only for particularly strong markers can reliable results also be obtained from substantially smaller cohorts. However, very strong prognostic markers appear to be rare, and the power of a marker is often not known before the analysis is made. Our data further suggest that almost 20% of the tissues included in a prostate cancer TMA may have limited tissue reactivity, potentially compromising the results of some analyses. While we found no relevant impact of tissue reactivity on the results of prognostic studies, this issue is more relevant when it comes to comparisons between biomarkers analyzed by immunohistochemistry. Our data suggest, however, that even such associations that result from only discrete expression differences can be reliably identified in large-scale TMA analyses.