DNA Repair Gene Expression Adjusted by the PCNA Metagene Predicts Survival in Multiple Cancers

Removal of the proliferation component of gene expression by proliferating cell nuclear antigen (PCNA) adjustment via statistical methods has been addressed in numerous survival prediction studies for breast cancer and all cancers in the Cancer Genome Atlas (TCGA). These studies indicate that the removal of proliferation in gene expression by PCNA adjustment removes the statistical significance for predicting overall survival (OS) when gene selection is performed on a genome-wide basis. Since cancers become addicted to DNA repair as a result of forced cellular replication, increased oxidation, and repair deficiencies from oncogenic loss or genetic polymorphisms, we hypothesized that PCNA adjustment of DNA repair gene expression does not remove statistical significance for OS prediction. The rationale and importance of this translational hypothesis is that new lists of repair genes which are predictive of OS can be identified to establish new targets for inhibition therapy. A candidate gene approach was employed using TCGA RNA-Seq data for 121 DNA repair genes in 8 molecular pathways to predict OS for 18 cancers. Statistical randomization test results indicate that after PCNA adjustment, OS could be predicted significantly by sets of DNA repair genes for 61% (11/18) of the cancers. These findings suggest that removal of the proliferation signal in expression by PCNA adjustment does not remove statistical significance for predicting OS. In conclusion, it is likely that previous studies on PCNA adjustment and survival were biased because genes identified through a genome-wide approach are strongly co-regulated by proliferation.


Introduction
Genomic instability is an important hallmark of cancer which leads to mutations that dysregulate cellular growth [1,2]. Mutations play an important role in oncogenic transformation and can be catastrophic during mitosis [3,4]. The additional replication stress and increased oxidative damage that arises from continuous forced cell division in tumor cells requires several DNA repair components [5,6]. However, inherited genetic polymorphisms and oncogenic loss result in DNA repair deficiencies, and therefore alternative DNA repair pathways must be found if replication is to continue [7]. The addiction to alternative DNA repair pathways by cancer can therefore be targeted to prevent the repair and restart of stressed replication forks [8][9][10].
All living cells depend on DNA repair to correct damage to DNA, which can arise from mismatched bases, oxidized bases, DNA methylation, DNA adducts, intra-and interstrand DNA crosslinks, single-strand and double strand breaks, and stalled forks. DNA repair pathways include direct reversal repair, base excision repair, non-homologous end-joining, mismatch repair, translesion synthesis, damage signaling, homologous recombination, and nucleotide excision repair. Early attempts to RNA-Seq based normalized expression values for somatic mutations in DNA repair genes were also obtained from cBio-Portal [24,25]. We also acquired high-confidence deletions and amplifications from cBio-Portal, where a deletion was defined as full homozygous loss with a GISTIC score [26] of −2, and an amplification was defined as high-level gain with a GISTIC score of 2. Low-level deletions (heterozygous loss) and low-level gain (low-level amplifications) with GISTIC scores of −1 and 1, respectively, were not used.

DNA Repair Gene Lists
We used the comprehensive list of DNA repair genes provided by Wood et al. [27], which have been previously described [28][29][30][31][32][33]. Table 1 lists the DNA repair pathways and 123 genes used which were available in TCGA expression data.  A brief description of each pathway follows. Direct reversal DNA repair (DRR) is a single step reaction of removal of the methyl-or photoadducts. Base excision DNA repair (BER) corrects base lesions generated by oxidative, alkylation, deamination, and depurinatiation/depyrimidination damage. Non-homologous end-joining (NHEJ) repairs double-strand breaks (DSBs) at all stages of the cell-cycle, bringing about the ligation of two DSBs without the need for sequence homology, and therefore NHEJ is error-prone. DNA mismatch repair (MMR) is responsible for correction of replication errors (mismatches, small insertions, deletions, and microsatellites) that escape the proofreading activity of a DNA polymerase [34]. Translesion synthesis (TLS) is a process involving specialized DNA polymerases which replicate across from DNA lesions. DNA damage signaling (DDS) induces several cellular responses including DNA repair, cell-cycle checkpoint activity, and triggering of apoptotic pathways. Homologous recombination repair (HRR) is a type of genetic recombination in which nucleotide sequences are exchanged between two similar or identical molecules of DNA. Nucleotide excision repair (NER) removes UV-induced damage (thymine dimers and 6-4-photoproducts) as well as other kinds of DNA damage, which produce bulky distortions in the shape of a DNA double helix.

Cancer Survival Prediction-Univariate
To understand the effect of the PCNA metagene on OS prediction by repair genes, we first estimated Pearson correlation between each DNA repair gene's expression (adjusted for age at diagnosis) with expression of the PCNA metagene. Figure 1 shows a cluster heatmap for the 123 repair genes and each of the 18 cancers considered, with blue-coloring denoting low correlation with the PCNA metagene and red indicating high correlation with the PCNA metagene. Renal clear cell (rcc) and renal papillary cancer (rpc) clustered together along with cervical cancer in their own "tree branch" of the dendogram, because the two cancers had much lower correlation (blue) with the PCNA metagene when compared with all remaining cancers (red). Head and neck cancer clustered by itself as an outlier, because repair gene correlation with the PCNA metagene was greater than other cancers for genes 94-123 (red color). Glioblastoma multiforme (gbm) and brain low-grade gliomas (brainlgg) clustered together mostly because of shared correlation of repair genes and the PCNA metagene for these cancers, and the unique pattern involving lower correlation between BLM (gene 123) expression and the PCNA metagene. The remaining cancers showed similar correlations between repair genes and PCNA, and therefore clustered together with slightly varying patterns. The biological importance of the correlation shown reveals how each repair gene is potentially coregulated by proliferation, which may or may not be important for readers developing models of repair gene co-regulation by PCNA. Certainly, the associations reported would need to be verified experimentally.
Next, we report the univariate Kaplan-Meier (KM) test results for each gene based on expression adjusted for age at diagnosis, stage, and the PCNA metagene. Figure 2 illustrates the KM chi-squared values (χ 2 ) for each gene when used to predict OS for the various cancers. Overall, most of the genes were not significant predictors of OS, since significant KM tests are required to have χ 2 > 3.84 (red color). Low-grade gliomas (brainlgg), renal cell cancer and uterine cancer clustered by themselves as individual outliers since they had many more genes which predicted OS significantly (red). The remaining cancers had different patterns of χ 2 , suggesting uniqueness of the cancer in terms of OS prediction by repair genes. Previous reports [1,2] have suggested that adjustment of genome-wide transcript expression with the PCNA metagene before survival analysis reduces the chance of finding significant predictors of OS. However, when considering DNA repair genes, there appears to be many cancer-repair gene combinations for which OS can be predicted significantly when removing proliferation from expression.

Cancer Survival Prediction-Multivariate Gene Sets for Multiple Pathways
In this section we report on multivariate sets of genes which were significant predictors of OS. Two types of hypothesis test results are provided: one in which only KM tests were used and the other based on randomization tests for OS in which same-size gene sets were randomly selected from the pool of DNA repair genes. Table 2 lists sample sizes and cancer sites for which KM logrank tests of OS based on the best binarized principal components (PC) from correlation of multiple DNA repair gene expression adjusted for age and the PCNA metagene. For the simultaneous adjustment of expression by age and PCNA metagene (first A,P column), all of the cancers resulted in a significant KM logrank test for the best binarized PC; however, the empirical p-values for random selection of genes resulted in a significant KM test (second A,P column) for three cancers, namely, AML, bladder, and sarcoma.   Table 3 lists sample sizes and cancer sites for KM tests performed on the best binarized PC for multiple DNA repair gene expression adjusted for age, stage, and the PCNA metagene. For the simultaneous adjustment of expression for age, stage, and PCNA metagene (first A,S,P column), all of the 12 cancers resulted in PCs whose KM test result were significant. However, when random gene sets of the same size were selected, (second A,S,P column), only eight of the 12 cancers were significant: breast, colorectal, liver, lung, lung squamous cell, melanoma, renal papillary cancer, and stomach. The combined results in Table 2; Table 3 suggest that while age, stage, and PCNA metagene adjusted expression of DNA repair gene expression resulted in significant KM tests for 100% of the cancers (18/18), randomization tests were significant for 61% (11/18) of the cancers. This is important because it supports our hypothesis that PCNA adjustment does not remove statistical significance for predicting OS when using DNA repair genes.

Multivariate Sets of DNA Repair Genes which Predict Overall Survival for each Cancer
This section describes for each cancer the multivariate list of genes constructed from individual KM tests using PCA. The loadings on the best binarized PCs with the greatest significance for predicting OS based on a KM test were used to reveal whether downregulation or upregulation resulted in shortened or prolonged OS. Table 4 lists the DNA repair genes of the best binarized PC for 10/11 cancers showing empirical p-values less than 0.05 when adjusting for age, stage, and PCNA metagene. The gene set for liver cancer is not shown because there were 43 genes which loaded on the best binarized PC, suggesting the model is not parsimonious. When considering the composite of all the genes which were significant survival predictors, pathway activation results indicate upregulation of the BRCA1 pathway, NHEJ pathway, BER pathway, MMR pathway, and downregulation of the NER pathway. ATM signaling (ATM, RAD17, RAD50) was downregulated due to strong downregulation of ATM, which one possible mechanism has been suggested to be promoter hypermethylation [35]. With downregulation of the tumor suppressor kinase ATM and the checkpoint kinase CHEK2, it follows that the ATM intraphase checkpoint would also likely be inactivated. Cells deficient in ATM have also been observed to not reduce transcription following DSBs, a phenotype which has been called "radiosensitized DNA synthesis". In addition, ataxia-telangiaectasia (A-T) cells deficient in ATM are known to repair DSBs following exposure to ionizing radiation [36]. With regard to the BRCA1 pathway, Complex B was highly upregulated with downstream upregulation of G1/S-Phase as well as the HRR pathway (BLM, BRCA1, MSH2, MS62, and RFC). In addition, FANCD2 was upregulated, which activates S-Phase checkpoint control. However, Complex C was downregulated (mostly due to RAD50), possibly suggesting downregulation of downstream G2/M phase. In the NHEJ pathway, ATM, the cross-linking enzyme Artemis (DLCRE1C), and RAD50 were downregulated, however, LIG3, LIG4, NBN, PRKDC, XRCC1, XRCC2, and XRCC5 were upregulated. Activated genes in the BER pathway included POLB, POLE, XRCC1, LIG1, LIG3, and FEN1. The MMR pathway activated genes were EXO1, FEN1, MSH2, MSH6, RFC2, and RFC3. Downregulated genes in the NER pathway were the sensitizer APEX1 and the DNA glycosylate OGG1. The NER pathway was mostly downregulated due to downregulation of HR23B, TFIH, XPC, and XPG. We also noted that MGMT was downregulated, which would increase SSBs and their conversion to DSBs at replication forks [37]. Another observation was that PARP1 was not in any of the lists of significant genes, because from a univariate perspective, it correlated positively with the PCNA metagene in all cancers except renal clear cell and renal papillary (see gene #88 in Figure 1). PARP1 was also a weak predictor of OS for most cancers after its expression was adjusted for age at diagnosis and the PCNA metagene (see gene #21 in Figure 2).

Cluster Analysis of Genomic Event Rates
Once genomic event rates (GERs) were determined for each cancer and the sets of driver genes and DNA repair genes in each repair pathway, cluster analysis was pursued to portray the pattern of similarity between the various cancers based on GERs. Results of cluster analysis of GERs for various cancers is shown in Figure 3. A total of four clusters of cancers were discernible in the data. In spite of all the cancers exhibiting high GERs for driver mutations, cancers in cluster 1 portray strong upregulation of genomic amplification in DNA repair genes, while cancers in cluster 2 reveal downregulation of amplification, deletion, and mutation in DNA genes. Melanoma and ovarian cancer clustered furthest away from the previously described clusters, mostly because of the unique patterns among GERs which emerged. Regarding driver genes, both melanoma and ovarian cancer exhibited greater rates of amplifications but had lower rates of deletions. Additionally, while melanoma revealed increased rates of mutations and amplifications and decreased rates of deletions in DNA repairs genes, ovarian cancer showed the opposite pattern, with lower rates of mutations in DNA repair genes and greater rates of amplifications and deletions. Table 5 lists qualitative patterns which emerged from the cluster analysis of GERs shown in Figure 3. Table 5. Qualitative patterns of genomic event rates per gene-tumor for each cluster identified during hierarchical cluster analysis (from Figure 3). Opportunistic cancers for further study in cluster 2 are AML, colorectal, and renal papillary. Figures S1-S36 illustrate for each cancer investigated a KM plot and Kernel density (pdf) plot of p-values during random selection of gene sets for the best binarized PCs derived from sets of genes, for which each gene had its own significant KM test after the various adjustments for age, stage, and PCNA metagene. p-Values listed in Tables 2 and 3 were extracted from Figures S1-S36. for somatic mutations ("mut"), deletions ("del"), and amplifications ("amp") in driver genes ("dr") and the 8 groups of DNA repair genes (DRR, BER, NHEJ, MMR, TLS, DDS, HRR, NER). Euclidean distance used as the distance function, with unweighted pair group method with arithmetic mean (UPGMA) as the agglomeration method. Heat map colors represent GER. Table 5. Qualitative patterns of genomic event rates per gene-tumor for each cluster identified during hierarchical cluster analysis (from Figure 3). Opportunistic cancers for further study in cluster 2 are AML, colorectal, and renal papillary.

Discussion
Cancer is a multifactorial disease which depends on a constellation of factors involving genomic instability, selective genetic pressure from somatic mutations and polymorphisms, and gene-environment interactions. Two important hallmarks of cancer are the persistent high level of somatic mutations in driver genes and DNA repair addiction. Together, these mechanisms directly and indirectly support a growth advantage and prolonged cell survival. As the costs of genomic DNA sequencing and RNA-Seq analysis decrease, there will continue to be new information available regarding cancer's addiction for DNA repair.
Our approach employed two levels of statistical testing, one that merely involved straightforward maximum likelihood (ML) estimation and another based on randomization tests, which resulted in empirical p-values. The ML-based survival prediction with adjusted DNA repair gene expression was significant for most of the cancers; however, survival prediction based on empirical p-values was significant for fewer cancers. It is now known that identification of sets of genes from genome-wide annotation lists will result in false positives that are associated with the PCNA metagene. Our focus was to specifically target DNA repair gene expression, remove the effect of the PCNA metagene, age at diagnosis, and stage, to determine if significant lists can be obtained. Not surprisingly, after the adjustments, many of the cancers revealed DNA repair genes which significantly predicted OS.
It is important to realize that our use of randomly selected same-size gene sets was performed in order to develop randomization tests for empirical p-value testing. Since the study was hypothesis-driven and used a candidate gene approach with a constrained list of DNA repair genes, it would be impossible to determine the effects of random bias based on randomly selecting genes from a genome-wide perspective as was done by Venet et al. [16] and Shimon [19]. There may likely be genes in the genome which predict survival better than the DNA repair genes considered; however, they would not be DNA repair genes. There was also no freedom to use genome-wide selection of genes for survival prediction, due to our candidate gene approach.
Pathway analysis results indicate a pattern suggestive of downregulation of primary damage signaling kinase (ATM) and initial BER pathway components (APEX1 and OGG1), and when combined with increased pathogenic somatic mutations in driver genes (e.g., TP53), our results may indicate that damage signaling in the initial portion of repair pathways is abrogated, while the remainder of the pathway is intact.
We also assessed GERs of the various cancers and confirmed that all of the cancers had high somatic mutation rates in driver genes. There were also two main clusters of cancers identified, which portrayed either high levels of amplification in DNA repair genes or low GERs for mutations, deletions, and amplification in DNA repair genes. The latter group of cancers including AML, colorectal, GBM, low grade gliomas, and renal papillary (cluster 2 in Table 5), may be more opportunistic for repair inhibition therapy because of less confounding associated with low levels of mutations, deletions, and amplification exhibited in DNA repair genes for these cancers. It warrants noting that the heat map in Figure 3 represents cancer-specific GERs in DNA repair pathways and not genome-wide mutations levels in cancers as reported by Lawrence et al. [38]. While melanoma was reported by Lawrence et al. to have the greatest levels of genome-wide mutation levels, our results indicate that melanoma had high levels of gene amplification and somatic mutations in DNA repair genes, causing its separate clustering. In addition, ovarian cancer clustered by itself because of high levels of amplification in DNA repair genes.
Random somatic mutations considered in TCGA data are not genetic polymorphisms (e.g., SNPs) occurring in the same regions of DNA, for which allelic phenotypes, associated risks, prognosis, and recommended treatment options are known. Instead, they are rarely found in the same location in DNA and are rarely of the same type. Somatic mutations merely accumulate from genetic selective pressure in driver genes, which is one of the most important hallmarks of sporadic cancers. Along these lines, there is uncertainty related to experimentally verifying the effect and pathogenicity of a single somatic mutation. Numerous algorithms can be employed for computationally predicting pathogenicity of somatic mutations (our results are based on the FATHMM algorithm used by TCGA and COSMIC [39]), but the experimental laboratory costs required to fully understand how a single somatic mutation alters a protein and how any change in function impacts the disease phenotype are exorbitant.
The translational value of our results is established by the potential of novel patterns of DNA repair gene expression in cancer, which could prove useful in animal studies, transgenics, and xenograft models, etc., in order to understand if inhibition of the genes identified inhibit tumor growth and improve survival [40][41][42]. Adjustment of DNA repair gene expression by the PCNA metagene has enabled us to view cancer from a distant perspective based on high-granularity involvement of DNA repair pathways in cancer. This view will hopefully enforce an appreciation among biologists and oncologists for the translational value of pursuing experimental inhibition studies, as well as randomized control trials for establishing safety and evaluating efficacy.
We did not comparatively assess numerous techniques for their computational efficiency, scalability, or differences in OS survival prediction. We also did not evaluate differences between using progression free survival (PFS) vs. OS, or bootstrapping effects on results. TCGA was primarily undertaken for molecular studies, and therefore clinical data standardization and collection was a secondary effort [43]. TCGA was also not a clinical trial, and therefore outcomes were considered from both retrospective and prospective cases, without a standardized patient follow-up plan. Molecular data were also obtained from single sections of primary tumors, and therefore spatial and temporal variation in tumor heterogeneity cannot be addressed. Use of OS as a survival endpoint is supported for most of the cancers available in TCGA [43]. For breast cancer subtypes with varying aggressiveness, OS is likely appropriate for the basal-like subtype but not for luminal A. Glioblastoma multiforme is also considered an aggressive cancer, although OS is considered suitable for use with TCGA data. Prostate cancer OS in TCGA is not a suitable endpoint, and therefore prostate cancer was not evaluated. Regarding confounding, we removed the effects of age and stage at diagnosis from expression data using skew-zero inputs to multivariate regression. Confounding caused by competing risks and outcomes in TCGA is more relevant for disease-specific survival (DSS), disease-free interval (DFI), and PFS and less of an issue for OS, so we do not envision a sufficient level of competing risks bias which would trigger a concern. The Cox PH assumption is also supported for most of the TCGA cancers, with only a few exceptions. While KM analysis was employed as the primary survival hypothesis test, Cox PH regression was only used for determination of risk directionality as a function of increasing expression level.
There is also the problem of unknown upstream effects of germline polymorphisms and DNA repair deficiencies which may result in a variety of unknown influences. The difficulty presented by cellular niching and high levels of clonal heterogeneity in tumors also presents a challenge for fully unraveling the associations observed in this study. The TCGA data used are not based on single-cell RNA-Seq analysis, which would be helpful for elucidating heterogeneity effects; however, the large variation in genotypes would exacerbate the present uncertainties surrounding our attempt to portray the role of DNA repair genes in cancer survival. We also did not employ empirical data from TCGA for DNA microsatellite instability, methylation status, or chromosome aberrations, which would overlay more complexity on the models developed. Although we did include DNA repair genes in the MMR pathway, which verify repeat count of microsatellites during cell division [44].

PCNA Metagene
Table S1 of the Supplemental Information lists the 131 PCNA-related genes employed for PCNA adjustment. For each tumor, we obtained RNA-Seq derived normalized expression values of the 131 PCNA-related genes [16], and collapsed their expression values down to median expression, which is termed the "PCNA metagene". Next, the PCNA metagene (median) and expression values for all of the DNA repair genes listed above were transformed into van der Waerden (VDW) scores. This transformation simply involved first transforming expression values for each gene into percentiles, and then substituting the percentiles as probabilities in the inverse standard normal function, i.e., Z = Φ −1 (ptile), to obtain standard normal variates of expression, which are distributed with mean zero and variance unity.

Maximum Likelihood Survival Prediction
The VDW scores for expression of each DNA repair gene were regressed separately on the VDW scores for age, VDW scores for tumor stage, and VDW scores for PCNA metagene, and the residuals were taken as the new DNA repair gene expression values. Residuals for each DNA repair gene were then binarized into (0,1) by splitting on the median, to form a grouping variable which was employed in Kaplan-Meier (KM) survival analysis with OS status. Overall survival was reported to be the most accurately derived outcome for TCGA data [43]. Each DNA repair gene which resulted in significant maximum likelihood (ML) estimates of the KM logrank tests, i.e., χ 2 p-value < 0.05 (1 d.f. chi-squared), was appended to a list of p genes. Eigendecomposition of the correlation matrix of the p significant genes was then performed with Varimax orthogonal rotation, and the PCs for all p dimensions were extracted. Each PC was then transformed into a binary grouping variable for KM input, by assigning negative PC values to 0 and positive to 1. The single group-transformed PC which resulted in the greatest χ 2 value during KM analysis was identified and called the "best binarized PC". For cancers without stage available in TCGA, the best binarized PC when age and PCNA were used for residual generation was input into Cox proportional hazards (PH) regression as a continuous variable (i.e., PC score) to determine whether positive values prolonged or shortened OS. Whereas, for cancers with stage available in TCGA, the best binarized PC when age, stage, and PCNA were used for residual generation was input into Cox PH regression. For the best binarized PC under evaluation, if the Cox PH hazard ratio (HR) < 1, then genes having positive loading on the best binarized PC were beneficial to OS if upregulated, whereas genes whose loadings were negative were considered hazardous, and would need to be downregulated in order to be beneficial. Analogously, if the Cox PH HR > 1, it meant that positive PC values were deleterious, and therefore genes that loaded negatively on this PC would need to be upregulated to be beneficial, and genes that loaded positively on this PC would need to be downregulated in order to be beneficial to OS. Figure 4 illustrates the workflow employed, outlining the various steps used for establishing the best binarized PC for each cancer, and whether positive loadings on the best binarized PC prolonged or shortened OS. Justification for using KM analyses was hinged to our observation that for the same genes, results from Cox PH regression of continuously-scaled expression were consistently more significant when compared with grouped analysis based on KM. Thus, we chose KM analysis for prediction significance due to conservativeness, and Cox PH to determine directionality of survival risk as a function of increasing expression values. Table S2 lists the multiple linear regression results for each DNA repair gene (dependent variable) and Z-scores for the independent variables: age, stage, and the PCNA metagene, as well as KM chi-squared (1 d.f.) statistics for univariate survival analysis based on OS status.
based on the top 20 driver genes identified by at least two tools in the DRIVERDB database [46]. The GER of each type of event was determined by dividing the sum by the number of genes in the group and the number of tumors obtained for each of the cancers considered. This led to the GER in units of events per gene-tumor. Hierarchical cluster analysis was then used to cluster values of GER for each cancer. Euclidean distance was used as the distance function, while the unweighted pair group method with arithmetic mean (UPGMA) was used for the agglomeration method.  (E) The PC resulting in the greatest KM chi-squared value is selected as the "best binarized PC," and univariate Cox PH regression is then run on the best binarized PC to determine if positive (negative) PC score values are associated with prolonged (shortened) OS.

Empirical p-Value Tests of Survival Using Randomly Selected Genes
For each cancer, the single best binarized PC that resulted in the greatest chi-squared statistic during maximum likelihood KM analysis was considered to the be the "observed" test statistic. Recall, this test statistic for the best binarized PC was initially based on individual DNA repair genes whose adjusted gene expression resulted in a significant KM test. Let the number of significant DNA repair genes for a best binarized PC be p. We used B = 1000 iterations for empirical p-value testing. During each b-th iteration, a random set of p DNA repair genes with the same adjustment to expression were selected, followed by correlation analysis, and then PC extraction via eigendecomposition of the p × p correlation matrix. Each PC was then binarized and used in KM analysis to determine which PC resulted in the greatest chi-squared test statistic for the set of p random genes. After B iterations, the empirical p-value was equal to P = # {b:χ 2(b) > χ 2 }/B, where χ 2 is the "observed" 1 d.f. χ 2 test statistic from ML-based KM analysis based on the best binarized PC, and χ 2(b) is the chi-squared statistic from the best binarized PC extracted from the correlation matrix of p randomly selected DNA repair genes used in KM analysis during the bth iteration. The bottom of Figure 4 illustrates how the correlation matrix of p genes with significant KM analysis were employed to obtain the best binarized PC for predicting OS.

Genomic Event Rates
For each cancer, we also summed the number of pathogenic somatic mutations, deletions, and amplifications in the set of 20 driver genes, and in each of the eight groups of DNA repair genes (DRR, BER, NHEJ, MMR, TLS, DDS, HRR, NER). Table S3 of the Supplemental Information lists the cancer-specific driver genes used, which we previously reported [45]. Driver gene selection was based on the top 20 driver genes identified by at least two tools in the DRIVERDB database [46]. The GER of each type of event was determined by dividing the sum by the number of genes in the group and the number of tumors obtained for each of the cancers considered. This led to the GER in units of events per gene-tumor. Hierarchical cluster analysis was then used to cluster values of GER for each cancer. Euclidean distance was used as the distance function, while the unweighted pair group method with arithmetic mean (UPGMA) was used for the agglomeration method.

Removing Redundant Genes in Gene Lists
PCNA itself was listed as a BER gene and was removed from survival analysis because our primary goal was to remove the genome-wide association of PCNA with other genes from the expression of DNA repair genes. For most cancers, TP53 was listed as a driver gene with high levels of somatic mutations, so it was not included as a DNA repair gene during survival analysis for those cancers. DNA repair genes listed multiple times in the repair pathways described above included POLD1, POLE, POLH, POLM, WRN, PCNA, LIG1, BLM, BRCA1, FANCA, FANCC, and ERCC4. Only the first occurrence of these genes in their respective pathway lists was used. This resulted in a full list of 123 DNA repair genes, which were used for the univariate statistics presented in Figure 1; and Figure 2. However, after PCNA and TP53 were removed, a final list of 121 unique (non-redundant) DNA repair genes was constructed and used during survival analysis for all cancers. Table S4 of the Supplemental Information lists the DNA repair genes which were excluded from survival analysis because of duplicate listing in the PCNA gene list or list of driver genes for the given cancer, and lists driver genes excluded from GER analysis because of duplicate listing in the PCNA gene list.

Conclusions
In conclusion, our hypothesis-driven focus to target DNA repair gene expression adjusted for the PCNA metagene as a means of predicting OS in various cancers resulted in statistically significant sets of DNA repair genes. We also identified that AML, colorectal, and renal papillary cancers may be potentially more opportunistic for inhibition therapy because of less confounding in the form of lower rates of mutations, deletions, and amplifications in DNA repair genes which predict OS in these cancers. The most opportunistic cancer for DNA repair inhibition therapy appears to be AML, since the TCGA cases harbored the lowest rates of somatic mutations, deletions, and amplifications in DNA repair genes.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-6694/11/4/501/s1, Table S1 of the Supplemental Information lists the 131 PCNA-related genes which were available in TCGA expression data. Table S2 lists the multiple linear regression results for each DNA repair gene (dependent variable) and Z-scores for the independent variables: age, stage, and the PCNA metagene, as well as KM chi-squared (1 d.f.) statistics for univariate survival analysis based on OS status. Table S3 of the Supplemental Information lists the cancer-specific driver genes used, which we previously reported [45]. Driver gene selection was based on the top 20 driver genes identified by at least 2 tools in the DRIVERDB database [46]. Table S4 of the Supplemental Information lists the DNA repair genes which were excluded from survival analysis because of duplicate listing in the PCNA gene list or list of driver genes for the given cancer, and also lists driver genes excluded from GER analysis because of duplicate listing in the PCNA gene list. Figures S1-S36 show KM and kernel density plots for acute myelogenous leukemia, bladder, breast, cervical, colorectal, glioblastoma multiforme, head and neck, low grade gliomas, liver, lung sq. cell, lung, melanoma, ovarian, renal clear cell, renal papillary, sarcoma, stomach, and cancers of the uterine.
Author Contributions: L.E.P. wrote the manuscript, developed the workflow, and performed the majority of computational modeling, while T.K. reviewed and confirmed the statistical procedures.
Funding: Research was supported by NASA Grant NNX-12AO52A.