PR/SET Domain Family and Cancer: Novel Insights from The Cancer Genome Atlas

The PR/SET domain gene family (PRDM) encodes 19 different transcription factors that share a subtype of the SET domain [Su(var)3-9, enhancer-of-zeste and trithorax] known as the PRDF1-RIZ (PR) homology domain. This domain, with its potential methyltransferase activity, is followed by a variable number of zinc-finger motifs, which likely mediate protein–protein, protein–RNA, or protein–DNA interactions. Intriguingly, almost all PRDM family members express different isoforms, which likely play opposite roles in oncogenesis. Remarkably, several studies have described alterations in most of the family members in malignancies. Here, to obtain a pan-cancer overview of the genomic and transcriptomic alterations of PRDM genes, we reanalyzed the Exome- and RNA-Seq public datasets available at The Cancer Genome Atlas portal. Overall, PRDM2, PRDM3/MECOM, PRDM9, PRDM16 and ZFPM2/FOG2 were the most mutated genes with pan-cancer frequencies of protein-affecting mutations higher than 1%. Moreover, we observed heterogeneity in the mutation frequencies of these genes across tumors, with cancer types also reaching a value of about 20% of mutated samples for a specific PRDM gene. Of note, ZFPM1/FOG1 mutations occurred in 50% of adrenocortical carcinoma patients and were localized in a hotspot region. These findings, together with OncodriveCLUST results, suggest it could be putatively considered a cancer driver gene in this malignancy. Finally, transcriptome analysis from RNA-Seq data of paired samples revealed that transcription of PRDMs was significantly altered in several tumors. Specifically, PRDM12 and PRDM13 were largely overexpressed in many cancers whereas PRDM16 and ZFPM2/FOG2 were often downregulated. Some of these findings were also confirmed by real-time-PCR on primary tumors.


Introduction
The positive regulatory domain (PRDM) gene family, a subfamily of Kruppel-like zinc finger gene products, currently includes 19 members in humans [1][2][3][4]. The protein products of this family Here, The Cancer Genome Atlas (TCGA) deposited exome and RNA-Seq data [24] were used to obtain a complete pan-cancer overview of the genomic and transcriptomic alterations for all PRDM genes across 31 distinct human cancer types.

Mutational Profiling of PRDM Genes Across Human Cancers
To systematically identify somatic mutations within genes encoding PRDMs, we started with a mutational profiling of these genes across human cancers. We downloaded Exome-sequencing datasets from the TCGA web portal for 31 cancer types and about 11,000 patients. The number of samples for each cancer type is illustrated in Table S1 [25].
Overall, we identified 3995 point mutations, 180 deletions (39 in-frame and 141 frameshift), and 22 insertions (16 in-frame and 6 frameshift) affecting PRDM genes. Silent or synonymous mutations were 1531 (26.7% of total mutations) and ranged between 11% (PRDM6) and 41% (PRDM8) of the total mutations for each gene (Figure 1). Here, The Cancer Genome Atlas (TCGA) deposited exome and RNA-Seq data [24] were used to obtain a complete pan-cancer overview of the genomic and transcriptomic alterations for all PRDM genes across 31 distinct human cancer types.

Mutational Profiling of PRDM Genes Across Human Cancers
To systematically identify somatic mutations within genes encoding PRDMs, we started with a mutational profiling of these genes across human cancers. We downloaded Exome-sequencing datasets from the TCGA web portal for 31 cancer types and about 11,000 patients. The number of samples for each cancer type is illustrated in Table S1 [25].
To measure the frequencies of somatic mutations for each PRDM gene across all tumor types, only non-synonymous mutations were considered. We observed heterogeneity in the mutation frequencies of these genes in the different tumor types. A global low mutation rate (from 0 to 8.2%) was found, except for PRDM3/MECOM, PRDM8, PRDM9, PRDM15, ZFPM1/FOG1, and ZFPM2/FOG2 (Table 1). In detail, PRDM8 and PRDM15 were mutated at low rates in most of the analyzed cancer types except PAAD where they were both frequently mutated (16.0% and 11.2%, respectively). PRDM3/MECOM was recurrently mutated in various cancer types, also reaching a value of 20.1% of mutated samples in SKCM. Similarly, PRDM9 was mutated with a high mutation rate in many cancer types, achieving values of 10.0% in UCEC, 14.2% in LUAD, and 15.4% in SKCM. Otherwise, ZFPM1/FOG1 was mutated at a low rate in a few cancer types, except in UCS (5.2%), COAD (6.6%), READ (9.4%), and ACC (50.5%). Finally, ZFPM2/FOG2 was frequently mutated at a high rate in various cancer types, reaching a value of 11.1% in LUAD and 16.5% in SKCM.
We visualized the mutation data in each tumor type by Oncostrip function (Supplementary file 1). Through this approach, we evaluated the percentage of samples with at least one mutated PRDM gene in each tumor type ranging from 1.02% (2/196) in LAML samples to 55.43% (51/92) in ACC samples. Furthermore, this function allowed us to visualize the mutation type affecting PRDM genes in each sample. Interestingly, ZFPM1/FOG1 revealed a high number of samples, especially in ACC, with "multi_hit" mutations (more than one mutation affecting the same gene in the same cancer sample). Specifically, we found 11/18 (61%) multi-hit mutations in COAD, 10/11 (90%) in READ, and 23/47 (48%) in ACC (see Supplementary file S1).

Cancers Genes
To distinguish between damaging and tolerated missense mutations, we carried out a variant effect predictor (VEP) analysis (Table S2). Missense mutations with a SIFT score ranging in the interval 0.0-0.05 and/or with a PolyPhen score in the interval 0.5-1 were considered as deleterious or probably damaging, respectively. As shown in Table 2, adding all the other deleterious somatic mutations (frameshift, in-frame deletions, stop gained and start lost mutations, splice site, UTR, and intron variants) to the deleterious missense mutations classified with the VEP analysis, we obtained the total number of deleterious mutations affecting each PRDM gene. Thus, we obtained the percentage of deleterious somatic mutations across the tumor samples. This number was ≥50% for MECOM/PRDM3 (52.7%), PRDM4 (55%), PRDM5 (54.8%), PRDM6 (58.5%), PRDM10 (55.2%), PRDM11 (54%), PRDM13 (51.7%), and PRDM16 (50%). Additionally, to predict the potential functional effect of the identified PRDM somatic mutations on the affected proteins and to detect a possible mutation enrichment in some domains, we localized the deleterious missense mutations on the canonical protein isoform of each PRDM ( Figure 2). Interestingly, a random sampling weighted on the size of the annotated protein domains demonstrated that somatic deleterious mutations were significantly enriched in the PR domain of PRDM1, PRDM5, PRDM6, PRDM8, PRDM9, PRDM12 and PRDM13 (p < 0.005). Another important aspect of cancer genetic studies is the presence of possible recurrent and hotspot mutations. Figure 3 illustrates mutations in PRDM genes recurring in more than three tumor types. Interestingly, the frameshift mutation T/-→K678X, despite affecting PRDM3/MECOM in a region not containing known domains, was recurrent in different tumor types; similarly, also the missense mutation G/A→S237L occurred in a region without known domains but in many tumors. Otherwise, missense mutations affecting a Zn-finger domain and occurring in different tumors were observed for PRDM9, PRDM14, and PRDM16. Likewise, PRDM12 was frequently mutated in a splice donor site in a region coding for the PR domain whereas in different tumor types, ZFPM2/FOG2 was affected by the missense mutation C/T→R734C in a region without known domains. In addition, PRDM2 and PRDM15 revealed an in-frame deletion in various cancers and PRDM11 a frameshift mutation. Finally, ZFPM1/FOG1 showed several recurrent mutations; they all (frameshift mutations and in-frame deletions) hit a region without known domains ( Figure 3). Another important aspect of cancer genetic studies is the presence of possible recurrent and hotspot mutations. Figure 3 illustrates mutations in PRDM genes recurring in more than three tumor types. Interestingly, the frameshift mutation T/-→K678X, despite affecting PRDM3/MECOM in a region not containing known domains, was recurrent in different tumor types; similarly, also the missense mutation G/A→S237L occurred in a region without known domains but in many tumors. Otherwise, missense mutations affecting a Zn-finger domain and occurring in different tumors were observed for PRDM9, PRDM14, and PRDM16. Likewise, PRDM12 was frequently mutated in a splice donor site in a region coding for the PR domain whereas in different tumor types, ZFPM2/FOG2 was affected by the missense mutation C/T→R734C in a region without known domains. In addition, PRDM2 and PRDM15 revealed an in-frame deletion in various cancers and PRDM11 a frameshift mutation. Finally, ZFPM1/FOG1 showed several recurrent mutations; they all (frameshift mutations and in-frame deletions) hit a region without known domains ( Figure 3).  Interestingly, all these mutations were particularly recurrent in ACC patients. In this cohort, ZFPM1/FOG1 also displayed five hotspot mutations, all localized in the same region outside the known domains ( Figure 4a). To establish whether these hotspot mutations could have an impact on the ZFPM1/FOG1 structure, we utilized the I-TASSER web-tool to predict the tertiary structure of the annotated ZFPM1/FOG1 protein ( Figure 4b) and proteins carrying the missense mutations and the in-frame deletions (Figure 4c-e). As illustrated, these mutations completely altered the structure of the canonical protein. Otherwise, the frameshift mutations E444X and P445X led to premature stop codons at the residues 669 and 796, respectively; both of the mutated proteins shared only the first Interestingly, all these mutations were particularly recurrent in ACC patients. In this cohort, ZFPM1/FOG1 also displayed five hotspot mutations, all localized in the same region outside the known domains (Figure 4a). To establish whether these hotspot mutations could have an impact on the ZFPM1/FOG1 structure, we utilized the I-TASSER web-tool to predict the tertiary structure of the annotated ZFPM1/FOG1 protein ( Figure 4b) and proteins carrying the missense mutations and the in-frame deletions (Figure 4c-e). As illustrated, these mutations completely altered the structure of the canonical protein. Otherwise, the frameshift mutations E444X and P445X led to premature stop codons at the residues 669 and 796, respectively; both of the mutated proteins shared only the first 9 of 17 443 residues with the canonical protein whereas they changed in the 444-669 and 444-796 regions and missed respectively 337 and 210 residues at the C-terminal, which contains the last five zinc fingers of ZFPM1/FOG1. 443 residues with the canonical protein whereas they changed in the 444-669 and 444-796 regions and missed respectively 337 and 210 residues at the C-terminal, which contains the last five zinc fingers of ZFPM1/FOG1.  Finally, to assess whether members of the PRDM family may be driver genes in a given cancer type, we used the OncodriveCLUST tool, which aims to identify genes whose mutations are biased towards a large spatial clustering. This method is based on the feature that cancer gene mutations frequently cluster in particular positions of the protein. Thus, mutations with a frequency higher than the background rate that tend to cluster in specific regions of protein-coding genes are likely to be driver genes. Based on the scores of this analysis, ZFPM1/FOG1 can be considered as a cancer driver for ACC (Figure 4f) and PRDM8 for PAAD ( Figure S2).

Differentially Expressed PRDM Genes across Human Cancers
To evaluate whether the expression of PRDM genes is affected in human cancers, we took advantage of RNA-Seq datasets from paired samples (cancer vs. benign counterpart) available at the TCGA web portal. Globally, 585 patients across 21 cancer types were analyzed (Table S1). The gene expression profiles differed considerably between normal and tumor specimens, depending on the cancer type, as shown by the principal component analysis [26]. The results of gene expression profiling are summarized in Table S3 and Figure 5.
to the number of clusters found in a certain gene, also indicated in the squared bracket. Specifically, in the ZFPM1/FOG1 locus, two mutation clusters were found (fdr < 2.87 × 10 −6 ).
Finally, to assess whether members of the PRDM family may be driver genes in a given cancer type, we used the OncodriveCLUST tool, which aims to identify genes whose mutations are biased towards a large spatial clustering. This method is based on the feature that cancer gene mutations frequently cluster in particular positions of the protein. Thus, mutations with a frequency higher than the background rate that tend to cluster in specific regions of protein-coding genes are likely to be driver genes. Based on the scores of this analysis, ZFPM1/FOG1 can be considered as a cancer driver for ACC (Figure 4f) and PRDM8 for PAAD ( Figure S2).

Differentially Expressed PRDM Genes across Human Cancers
To evaluate whether the expression of PRDM genes is affected in human cancers, we took advantage of RNA-Seq datasets from paired samples (cancer vs. benign counterpart) available at the TCGA web portal. Globally, 585 patients across 21 cancer types were analyzed (Table S1). The gene expression profiles differed considerably between normal and tumor specimens, depending on the cancer type, as shown by the principal component analysis [26]. The results of gene expression profiling are summarized in Table S3 and Figure 5.

PRDM Expression in Human Primary Tumors
In an attempt to validate the findings obtained on RNA-Seq datasets, we assayed TissueScan cDNA panel arrays containing eight different tumors (breast, colon, kidney, liver, lung, ovary, prostate, and thyroid) [25].
Specifically, we analyzed the differential expression of PRDM3/MECOM, PRDM10, PRDM12, PRDM16, and ZFPM2/FOG2 genes ( Figure S3). As illustrated in Figure 6, a general differential expression, even though not significant, was observed for both ZFPM2/FOG2 and PRDM16 in several tumor tissues (Figure 6a,b). PRDM3/MECOM was found to be significantly overexpressed in breast (p < 0.001), ovary (p < 0.001), and colon (p < 0.05) cancer specimens (Figure 6c). Similarly, PRDM10 was overexpressed at significant levels in breast (p < 0.001) and colon (p < 0.05) cancer specimens (Figure 6d). No statistical differences in the expression of all these genes were measured between tumor and healthy samples of the other cancer entities (Figure 6a-d). Otherwise, the PRDM12 gene was difficult to analyze. This gene was confirmed as being highly overexpressed in all the analyzed cancer tissues, except in ovarian cancer, as indicated by reanalysis of the TCGA RNA-Seq dataset. However, this gene was not expressed or was expressed at very low levels in all the healthy tissues used as controls; it was expressed in tumor specimens. For this reason, its relative gene expression could not be calculated using the 2 −∆∆Ct method for all the analyzed tissues. Particularly, relative expression was measured only for thyroid, ovary, prostate, and liver ( Figure 6e) whereas in breast, colon, kidney, and lung normal tissues, the amplification products were undetectable when observed by agarose gel electrophoresis analysis (Figure 6f). Of note, these cancer specimens showed measurable levels of PRDM12 (Figure 6e-f).

Discussion
In this study, we provide for the first time a systematic and comprehensive overview of both the mutational status and the expression profile of all the PRDM genes across a large number of different cancer types.
Recently, the availability of multi-omics datasets (such as TCGA) from human cancers, together with the development of advanced bioinformatics tools, represent a unique source to study human malignancies [24].

Discussion
In this study, we provide for the first time a systematic and comprehensive overview of both the mutational status and the expression profile of all the PRDM genes across a large number of different cancer types.
Recently, the availability of multi-omics datasets (such as TCGA) from human cancers, together with the development of advanced bioinformatics tools, represent a unique source to study human malignancies [24].
Our reanalysis of the TCGA Exome-sequencing datasets revealed PRDM2, PRDM3/MECOM, PRDM9, PRDM16 and ZFPM2/FOG2 as the most mutated genes. Heterogeneity in the mutation frequencies was observed in the different tumor types with higher mutation rates found for PRDM3/MECOM, PRDM8, PRDM9, PRDM15, ZFPM1/FOG1 and ZFPM2/FOG2 in specific cancers. Remarkably, VEP analysis indicated that the percentage of total deleterious mutations across the tumor samples was high for most genes. More interestingly, a random sampling weighted on the size of the annotated protein domains demonstrated that deleterious mutations were significantly enriched in the PR domain of PRDM1, PRDM5, PRDM6, PRDM8, PRDM9, PRDM12 and PRDM13. Frequent mutations disrupting the PR domain in tumor samples would be a mechanism for removing the tumor suppressor function of the PR-plus isoform in favor of the oncogenic PR-minus form.
A big challenge in cancer biology studies is distinguishing between mutations conferring a selective growth advantage to cancer cells (drivers) and those randomly accumulating and without significant effects on the oncogenic process (passengers) [27,28]. Many algorithms employing different approaches are now utilized to recognize driver genes although, when compared for their performance, all display both strengths and weaknesses [25,[29][30][31]. Moreover, recent studies have highlighted the existence of "mini-driver" genes with weaker tumor-promoting effects, thus expanding the previous driver-passenger dichotomy to a continuous model [25,30,32]. Besides, a sub-classification has been proposed to differentiate "mut-driver genes", usually altered by somatic gene mutations, from "epi-driver genes", which are deregulated through epigenetic modifications but are not frequently mutated [25,27].
In this study, OncodriveCLUST analysis revealed two putative cancer mut-driver genes: PRDM8 in PAAD and ZFPM1/FOG1 in ACC. In the latter case, we found that the involved gene was mutated in a very high percentage (about 55%) of tumor samples. Additionally, a mutational hotspot region localized between the amino acid positions 444 and 447, outside the known domains, was recognizable. All these findings agree with the key parameters commonly used to discern drivers from passengers [29]. Notably, these hotspot mutations recurred also in other malignancies, such as COAD, KIRP, READ, STAD, and UCS, supporting an important role of this gene in carcinogenesis. Moreover, the finding of "multi_hit" mutations in ACC, as well as in COAD and READ, advises that this gene could function as a tumor suppressor gene. This is conceivable with the current knowledge about the role of ZFPM1/FOG1 in differentiation. Indeed, ZFPM1 is also known as a friend of GATA1 (FOG1) since it interacts with GATA1 and it is an essential cofactor for the transcription factor GATA-1 in erythroid and megakaryocytic differentiation. Reduced expression of ZFPM1/FOG1 was found in preleukemic progenitors of a mouse model of leukemia [33,34]. Besides, the high recurrence of these mutations together with results from 3D-modeling of the canonical and mutant ZFPM1/FOG1 proteins suggest that, although without known domains, this is a critical region for the protein. Interestingly, this region (particularly K431 residue) is evolutionary highly conserved ( Figure S4).
It is conceivable that other PRDM genes may play a key role in the initiation and progression of specific or multiple tumor types, as also reported by previous literature [8,[15][16][17]. Indeed, it is accepted that cancer driver genes are mainly involved in three core cellular processes: cell fate, cell survival, and genome maintenance. For instance, PRDM2 has a relevant role in all of them; specifically, it also participates to the formation of protein complexes involved in the DNA damage response and in genome maintenance [15]. Additionally, a recent study identified a driver PRDM2 mutation in MSI colorectal cancer [16]. However, our analysis has not considered this gene as a driver. We cannot exclude, among the explanations, the limitation of the utilized bioinformatics tool [30].
Our pan-cancer study has also identified PRDM12 as a possible epi-driver gene in multiple cancers. Of note, TCGA gene expression profiling of PRDM genes revealed significant overexpression of PRDM12 and PRDM13 in many tumor types whereas ZFPM2/FOG2, PRDM8, and PRDM16 were more often downregulated in tumor tissues. Our qRT-PCR analysis was not able to confirm all the results obtained through TCGA analysis. The main reason could be the utilization of unpaired samples in the validation through TissueScan cDNA panel arrays; otherwise, our analysis on TCGA was carried out on paired samples. In addition, we have analyzed a small number of samples by qRT-PCR compared to the huge number of cases from TCGA. Noteworthy is that when we measured the differential expression of PRDM12 in cDNA panel arrays, we found the expression of this gene only in cancer specimens but not in healthy samples of several tissues, suggesting that it could be putatively utilized as a biomarker in those malignancies. Our study represents the first analysis of all PRDM genes in pan-cancer; further studies using large cohorts are necessary to validate the most promising results, particularly for PRDM12. In addition, given the lack of literature data, we are aware that functional studies investigating the effect of altered expression both in vitro and in vivo are required to establish the possible impact of PRDM12 in cancer.
Altogether, our results can be useful for identifying a subset of relevant PRDMs that are frequently mutated and/or transcriptionally deregulated in certain tumor types. Functional studies on specific PRDM gene mutations should be accomplished to definitely prove their potential oncogenic role. Moreover, it would be interesting to investigate whether these mutations contribute to cancer progression and metastasis, as well as whether they correlate with prognosis and/or with drug response and resistance. The epigenetic changes underlying the altered gene expression observed in tumor samples should also be explored. In this context, the availability of novel multi-omics data integration tools and methods also offer the opportunity to further integrate our analysis of PRDM gene expression by a systematic pan-cancer study of the epigenetic marks in these genes [25,35,36].

TCGA Data Source Selection and Processing for Mutation Analysis
In this manuscript, we analyzed both whole Exome-and RNA-Seq data retrieved from publicly accessible repositories. Specifically, we retrieved the whole exome sequencing data from the GitHub R data package for pre-compiled somatic mutations from TCGA cohorts "TCGA mutations" and analyzed it using the Bioconductor package "maftools" [37].
The selection and nomenclature of PRDM genes were based on the HUGO Gene Nomenclature Committee [38]. To estimate the mutation enrichment within the PR domain of each of the PRDM proteins, we performed a random sampling iterated 1000 times weighted on the size of the annotated domains.
To assess whether one or more PRDM proteins could be considered as cancer driver genes based on the positional clustering of the variants in the selected human cancers, we used a re-implementation of the software OncodriveCLUST within the maftools package [39].

TCGA Data Source Selection and Processing for Expression Analysis
The RNA-Seq gene expression data were downloaded from TCGA [42]. The analysis of gene expression and the identification of differentially expressed genes were performed comparing the expression profiles of cancer vs. matched normal samples in a paired analysis. Therefore, expression data taken from human primary cancers for which healthy samples were not available were discarded. According to this criterion, 22 tumor entities were analyzed. To have a more robust differential expression analysis in paired samples, we applied generalized linear models implemented in the EdgeR Bioconductor package version 3.17.10. p-values adjustment was performed through the application of the false discovery rate (FDR) method. We considered differentially expressed genes with a logFC ≤−1 and logFC ≥1, and an FDR ≤ 0.01.

Real-Time RT-PCR Analysis
Quantitative real-time PCR (qRT-PCR) experiments were carried out on TissueScan Cancer Survey Panels, which contained cDNA samples from various normal and cancer tissues covering eight different tumors (breast, colon, kidney, liver, lung, ovary, prostate and thyroid) from independent patients diagnosed at various clinical disease stages and selected from mixed ages and genders. Tissue cDNAs of each array were synthesized from high-quality total RNAs of pathologist-verified tissues, normalized and validated with β-actin in two sequential qPCR analyses, and provided with clinical information and QC data [25].
The amplification products were also analyzed by agarose gel electrophoresis [45]. Data were normalized with β-actin gene provided with arrays. The relative gene expression was calculated using the 2 −∆∆Ct method [25]. Funding: This work was supported in part by ordinary funds from the University of Campania "Luigi Vanvitelli".

Conflicts of Interest:
The authors declare no conflict of interest.