The Genomic Impact of DNA CpG Methylation on Gene Expression; Relationships in Prostate Cancer

The process of DNA CpG methylation has been extensively investigated for over 50 years, revealing associations between changing methylation status of CpG islands and gene expression. As a result, DNA CpG methylation is implicated in the control of gene expression in developmental and homeostatic processes, as well as being a cancer-driver mechanism. The development of genome-wide technologies and sophisticated statistical analytical approaches has ushered in an era of widespread analyses, for example in the cancer arena, of the relationships between altered DNA CpG methylation, gene expression, and tumor status. The remarkable increase in the volume of such genomic data, for example, through investigators from the Cancer Genome Atlas (TCGA), has allowed dissection of the relationships between DNA CpG methylation density and distribution, gene expression, and tumor outcome. In this manner, it is now possible to test whether genome-wide correlations are measurable between changes in DNA CpG methylation and gene expression. Perhaps surprisingly, these associations can only be detected for hundreds, but not thousands, of genes, and the direction of the correlations is both positive and negative. This, perhaps, suggests that CpG methylation events in cancer systems can act as disease drivers but the effects are possibly more restricted than suspected. Additionally, the positive and negative correlations suggest direct and indirect events and an incomplete understanding. Within the prostate cancer TCGA cohort, we examined the relationships between expression of genes that control DNA methylation, known targets of DNA methylation and tumor status. This revealed that genes that control the synthesis of S-adenosyl-L-methionine (SAM) associate with altered expression of DNA methylation targets in a subset of aggressive tumors.


Methylation of DNA Cytosine
The patterns and function of DNA methylation have been extensively investigated across organisms, in settings of both health and disease (reviewed in [1]). In humans, for example, this covalent DNA modification has been studied in developmental and normal biology [2] where it is clear that the control of DNA methylation is biologically profound. The control of DNA methylation is central to embryogenesis, genetic imprinting, and X chromosome inactivation. Furthermore, DNA methylation levels change across the genome through the aging process, thus giving rise to the concept of the so-called epigenetic clock (reviewed in [3,4]). There has been an equally profound examination of DNA methylation in cancer settings (reviewed in [5]) and other age-related syndromes.
At the center of the process of DNA methylation is the transfer of a methyl group from S-adenosyl-L-methionine (SAM) to the cytosine of a CpG dinucleotide (adjacent within a single DNA strand), immediately following DNA replication [6]. The addition of a methyl group to cytosine most commonly occurs in the context of the cytosine being adjacent and five prime to guanine, hence the nomenclature of CpG, where p represents the DNA phosphate backbone. Cytosine can also be hydroxymethylated.
In fact, SAM is a universal methyl donor, being the substrate for enzymes that control the methylated status of DNA, RNA [7], and proteins, such as histones [8]. Indeed, varieties of the enzymes that control these events have relatively high affinity for SAM [9]. Following donation of a methyl group, S-adenosyl-L-homocysteine (SAH) is formed and the ratio of SAM/SAH appears to be critical for the control of these biological processes (reviewed in [10]). Importantly, these substrates are, themselves, profoundly and rapidly influenced by environmental factors including diet [11]. In turn, this potentially combines with genetic variation and links to the predisposition to a number of diseases (reviewed in [12]).
The central tenet of studying changes in DNA methylation is that it represents a major mechanism by which chromatin access of transcription factors (TF) and the basal transcriptional machinery is modulated. There are at least 50 years of research supporting links between altered DNA methylation of genes and the processes they govern in cancer initiation and progression. In the 1960s, researchers had already observed altered patterns of DNA methylation in cancer cell models and even proposed that this resulted in altered distribution of the sites of transcription initiation [13,14]. The biochemistry and regulation of CpG methylation were investigated and led to the description of altered methylation states of known tumor suppressors and oncogenes. Subsequent exploration of the associations at candidate sites between altered DNA methylation and gene expression largely confirmed the hypothesis that the DNA methylation signal serves as a physical impediment to TFs and the transcriptional machinery.
Thus, a general hypothesis emerged where the consequence of DNA methylation is to provide a physical barrier to positive regulators of transcription. More specifically, DNA methylation is a process of epigenetic control of the information encoded by DNA. That is, the process does not alter the underlying DNA sequence as 5-methylcytosine (5mC) still operates in a codon in the same manner as unmethylated cytosine, and yet it is heritable to daughter cells as the 5mC mark is copied to the nascent strand of DNA during DNA replication.
Methylated cytosine is a dynamic modification not only because it is added and removed by enzymatic processes, but also because it can spontaneously deaminate and, therefore, mutate [15][16][17]. By contrast, spontaneous deamination of unmethylated cytosine is readily recognized by the DNA repair machinery and corrected. Therefore, during the course of evolution methylated CpG dinucleotides have steadily been deaminated. As a result, across the human genome CpG is actually underrepresented, accounting for only 1/80th of the dinucleotides in the genome rather than the expected 1/16th [18].
The overwhelming majority (~85%) of methylated CpG sites in the genome are found in repetitive elements such as short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs), as well as satellite DNA repeats in peri-centromeric regions. Persistent methylation of CpG in these repetitive elements is probably withstood as there is less evolutionary pressure on these regions and they can withstand the increased mutation rate without apparent harm to fitness. By contrast, CpG regions that remain unmethylated presumably have higher evolutionary pressure to remain unmethylated and thereby avoid the higher mutation rate. Approximately 15% of the CpG sites are found in CpG islands in the promoter regions of some 70% of protein-coding genes [19].
Presumably the regulatory function of methylation at CpG islands outweighs the potential mutation that may occur and argues for evolutionary conservation of CpG island function in gene regulation. Hence, while there is significant evolutionary pressure to reduce CpGs, the function of CpGs in islands is advantageous and protected from evolutionary erosion [20,21].
CpG islands are between 300 and 3000 bp (base pairs) in length with a GC content of greater than 50% and an observed/expected CpG ratio greater than 0.6. Consistent with the idea that they play roles in the control of gene expression, they often overlap with trimethylation of lysine 4 on histone 3 (H3K4me3) and with binding sites for RNA polymerase, indicative of active, or at least permissive, transcription [22,23]. Although annotation of the non-coding genome is not as complete, there is evidence that CpG islands that previously appeared to be orphans, not associated with a known gene, can be associated with long non-coding RNA (lncRNA), micro RNAs (miRNAs) and other non-coding genes [24] and that these distal, or orphan, CpG islands may be important sites for TF binding and the control of non-coding RNA expression [25].
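The length, GC content, and observed/expected criteria quoted above can be illustrated with a minimal Python sketch. It assumes the commonly used formula in which the expected number of CpG dinucleotides is (number of C × number of G) / sequence length. The function names and the default thresholds are illustrative choices taken from the values in the text, not from any particular annotation tool, and real island callers typically scan the genome in sliding windows rather than testing one fixed sequence.

```python
def cpg_stats(seq: str) -> dict:
    """Return length, GC content and observed/expected CpG ratio for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    # "CG" cannot overlap with itself, so str.count gives the CpG dinucleotide count.
    cpg_count = seq.count("CG")
    gc_content = (c_count + g_count) / n if n else 0.0
    # Assumed formula: expected CpG = (#C * #G) / length.
    expected = (c_count * g_count) / n if n else 0.0
    obs_exp = cpg_count / expected if expected else 0.0
    return {"length": n, "gc_content": gc_content, "obs_exp": obs_exp}


def looks_like_cpg_island(seq: str, min_len: int = 300, max_len: int = 3000,
                          min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply the three criteria from the text to a single candidate sequence."""
    stats = cpg_stats(seq)
    return (min_len <= stats["length"] <= max_len
            and stats["gc_content"] > min_gc
            and stats["obs_exp"] > min_obs_exp)


if __name__ == "__main__":
    # Hypothetical toy sequences, not real genomic loci.
    print(cpg_stats("CG" * 200))
    print(looks_like_cpg_island("CG" * 200))
    print(looks_like_cpg_island("AT" * 200))
```

Because the thresholds are parameters, the same sketch can be used to see how the number of regions that qualify as islands depends on the specific definitions of GC content, density, and length.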

The Control of CpG Methylation and Its Impact on Transcription
As with all modifications to DNA, the addition and subtraction of methyl groups are tightly regulated by antagonistic enzyme families. The DNA methyltransferases (DNMTs) add methyl groups. DNMT1 is the major DNA methyltransferase, expressed at high levels in all tissues, where it plays a role in the maintenance of cytosine methylation following progression through the cell cycle. DNMT3a and 3b are more involved in the de novo initiation of methylation patterns. Whilst these enzymes have been identified and investigated for many years, being cloned in the late 1980s and early 1990s [26], the ten-eleven translocation (TET) family of proteins was identified as the methylcytosine dioxygenases only in 2009 [27]. These proteins can reverse the methylation actions of DNMTs by oxidizing 5mC. Again, underscoring the importance of DNA methylation in embryogenesis, genetic knockouts of DNMT and TET family members display a range of embryonic phenotypes, including lethality, supporting the importance of normal regulation of DNA methylation [28][29][30][31][32][33][34][35][36].
DNMT1 maintains the DNA methylation pattern from the parent cell to the daughter cell. This heritable nature of DNA methylation is a key feature defining DNA methylation as an epigenetic mark. Unlike DNMT1, DNMT3a and DNMT3b normally methylate DNA that is unmethylated on both strands and do not have a binding preference for the hemi-methylated state, a feature central to DNMT1's maintenance function. The roles of DNMT3b and 3a are not completely redundant. DNMT3a is a distributive enzyme, while DNMT3b is a processive enzyme. DNMT3a is important for focal methylation of single copy genes or regions where there are not long stretches of CpG to methylate. On the other hand, the high processivity of DNMT3b is conducive to its role in methylating the highly repetitive peri-centromeric regions where there are long stretches with many CpG positions to be methylated. Another DNMT3 family member is DNMT3L, which has no catalytic activity because it has a non-consensus catalytic domain. Nevertheless, DNMT3L plays an important role in DNA methylation because it interacts with DNMT3a and 3b. For instance, the interaction with DNMT3a increases the activity of DNMT3a and has been shown to be essential for maternal imprinting.
TET1 was initially cloned from leukemic cells where it was identified as a fusion partner of the mixed-lineage leukemia (MLL) translocation [37], but its function was only revealed in 2009 [27], when its capacity to oxidize 5mC was demonstrated. Three TET family members exist [38], and similarly to the DNMTs, the TETs appear to differ in their functions. TET1 exerts an important embryonic function by governing expression of stem cell specific TFs such as NANOG [38] and its loss triggers a shift towards trophectoderm differentiation. TET1 also governs meiosis [39] and prevents the spread of CpG methylation [40]. Others have applied TET1 chromatin immunoprecipitation combined with next generation sequencing (ChIP-Seq), revealing binding of TET1 to CpG rich sequences in transcriptionally-active promoters, but also at polycomb-repressed genes. Thus, TET1 at least has a potentially dual role in triggering gene activation, and also modulating polycomb-mediated gene repression [41,42]. Again, in embryonic systems, TET1 was shown to associate with so-called bivalent environments with H3K4me3 and H3K27me3, adding evidence to the concept that TET1 at least controls the so-called 'poised' chromatin state, for example at developmentally-regulated genes [43]. Supporting an antagonistic role against DNMTs, TET1 can remove the imprinted status of specific genes [44,45].
There is some evidence that TET2 and TET3 target specific gene regulation programs, for example the active repression of interleukin 6 (IL6) during inflammation through recruitment of histone deacetylases (HDAC) independently of modification of 5mC [46]. There is also evidence that either genetic variation or somatic mutation of TET2 impacts hematopoietic stem cell function, and TET2 is implicated in a range of myeloid disorders [47].
Thus, 5mC is directly governed spatially and temporally, often in a gene program specific manner, by at least six different enzymes. The levels of 5mC at CpG regions and islands are in turn sensed by a number of proteins, which help translate the chemical signal of the methyl group into biological functions that include regulating transcriptional activity; this is most clearly found when the CpG island is in a proximal promoter region.
Altered CpG island methylation is thought to regulate transcription through at least two mechanisms. Firstly, the increase in methylation levels at CpG islands or CpG regions can impact the physical access of TFs and, therefore, suppress gene expression. Indeed, early studies demonstrated that a single methylated CpG in a 6 bp region could impact TF access [48]. Secondly, methylated CpG regions are, in turn, recognized by a family of proteins containing the methyl-CpG-binding-domain, known collectively as MBDs. These proteins (Methyl-CpG Binding Protein 2 (MeCP2), Methyl-CpG Binding Domain Protein 1 (MBD1), MBD2, MBD3, and MBD4), along with Zinc Finger and BTB Domain Containing 33 (ZBTB33/KAISO), which has a different domain to recognize methylated DNA, are found in complexes that contain other chromatin modifying enzymes such as HDACs. Various workers established that CpG methylation attracts MeCP1 [49] and that these proteins can recruit HDACs to repress transcription [50]. Similarly, MeCP2 binds to methylated DNA and recruits SIN3a, which recruits HDACs, leading to a situation where regions of DNA methylation coexist with regions of deacetylated histones that can form a compact, closed chromatin structure to exclude interaction with TF and the basal transcriptional machinery. Therefore, the levels of methylation at CpG islands can more widely impact the genome around the island.
When restricted to specific biological settings, the relationships between DNA methylation, chromatin assembly and transcription are certainly apparent. The mostly-methylated CpG islands on the inactive X chromosome in a female cell strongly (but not exclusively) correlate with gene silencing [51][52][53]. In a similar way, imprinted genes, expressed either from the paternal or the maternal allele, are associated with CpG island regions methylated only on one allele. Another group of genes, the cancer-testis (CT) antigen genes, such as those of the melanoma antigen (MAGE) family, often have methylated CpG island promoters in all normal tissues except testes, where they are expressed. Often these genes are expressed widely in cancers where those CpG islands lose methylation [54][55][56][57][58][59].

Development of Technologies and Computational Approaches for Epigenomic Analyses
Early work in cancer systems, at candidate loci, quickly revealed that gains and losses of DNA CpG methylation were significantly detected at the sites of tumor suppressors and oncogenes. These observations gave rise to the concept of epi-mutations that could act alongside somatic mutations as cancer-drivers [60]. Specifically, regulatory and promoter CpG regions on these genes have been identified as being inappropriately methylated. Some of the earliest studies in the field focused on loss of methylation at the CpG islands of known oncogenes, for example at the RAS locus [61]. Subsequently, other workers considered the possibility of gain of methylation at tumor suppressors, and again early studies demonstrated gain of CpG methylation at the calcitonin gene in lung cancer [62]. In part, this suggested that hypomethylation at oncogenes may arise earlier in cancer, or in pre-malignant conditions, whereas hypermethylation, by contrast, possibly arises later and tends to be associated with promoters that control the expression of oncogenes and tumor suppressors. More recently, there is some evidence for coordinated hypo- and hypermethylation events occurring in leukemia [63].
As well as giving rise to the concept of epi-mutations, the sequencing of DNA methylation events, inter-joined as they are with the regulation of histone modifications, also justified development of DNMT inhibitors and the combination of epigenetic therapies that targeted both CpG methylation and repressive histone modifications [64,65].
These candidate gene studies were catalysts for the development of genomic technologies and large scale data-analytic approaches aimed at revealing how many and how frequently CpG islands and regions were methylated in various cancer types. For example, some of the first technologies to expand from the candidate loci to genome-wide coverage built on the use of methylation sensitive restriction enzymes that digest specifically unmethylated CpG regions [66]. The recovered fragments were modified to detect digestion, or not, of fragments indicative of methylation status that could be imaged with either radiolabel approaches or subsequently with next generation sequencing (NGS) approaches. Thus, workers were able to begin to measure quantitative differences in the levels and distribution of CpG methylation between cell models, or tumor material compared with adjacent normal material.
In parallel, other technologies were developed using microarray platforms and hybridization, and the Illumina platform clearly emerged as the market leader. This platform allowed the relatively easy scanning of multiple CpG regions in cells, be they cell lines, frozen tumor or even formalin fixed material. The CpGs assessed by these technologies were selected based on their relative position within CpG islands and/or near to gene features such as Reference Sequence (RefSeq) annotated transcriptional start sites (TSS). Although these early arrays and more recent larger ones sample CpG positions in all known TSSs and CpG islands and even annotated enhancers, they are not exhaustive in their coverage and frequently have rather limited numbers of CpG positions representing large genomic regions. Furthermore, their design rationale is based on an incomplete understanding of which CpG positions might be most relevant to regulation of chromatin and of gene expression. Nonetheless, these approaches have very clearly established that there are altered patterns of CpG methylation at CpG islands in a wide range of cells and in health and disease.
A significant caveat is an implicit assumption that the CpG positions represented on the array from a given genomic region are indicative of the broader methylation state of all CpGs in that region. While this is clearly correct in some situations, it may not be universal, and there is clearly emerging evidence for highly specific patterns of CpG methylation. For example, when comparing CpG island methylation between normal and cancer, it is likely that a few CpG positions can accurately represent the CpG island status. However, it is much less clear if this is true when comparing non-malignant genomes with different exposures. Nevertheless, array studies have provided significant insights from which a generally more textured understanding has emerged of the roles of different CpG methylation positions and density on gene expression in development (reviewed in [67,68]). Indeed, the CpG methylation arrays are one of the earliest and most widely applied genomic technologies, and also helped to catalyze the development of the statistical framework for the analyses of differential DNA methylation [69,70].
Obtaining exhaustive genome-wide coverage has required the development of NGS-based approaches to DNA methylation [71,72], for example, whole genome bisulphite sequencing (BS) at base pair resolution. However, this does not distinguish between 5mC and 5-hydroxymethylcytosine (5hmC) and, therefore, comparative analysis is needed using BS and oxidative BS sequencing. 5mC and 5hmC specific antibodies can be used in methylated DNA immunoprecipitation (MeDIP-Seq) approaches. Other variations include enrichment for GC-rich regions (eRRBS), as well as locus-averaging methods that pull down methylated regions or regions associated with MBD binding. Array technology has continued to be developed, principally by Illumina, who have developed 27K, 450K and now 850K Infinium MethylationEPIC arrays for more comprehensive analyses.
Marching in tandem with the progress of the technologies available to survey the CpG status across the genome has been an equally profound development of the computational approaches to analyze the data. The broad demands on this software are to cope with modelling the data, dealing with incompleteness of data and identifying differentially methylated regions (DMR). Most commonly, the bioinformatics community uses the R platform for statistical computing [73,74] and a range of library packages implemented in Bioconductor [75,76]. For example, there are currently over 60 packages available in Bioconductor that deal with the analyses of genomic CpG methylation data. R and Bioconductor are both community developed and maintained and, therefore, new approaches are continually developed. Indeed, Bioconductor illustrates the combination of packages, or software libraries, that can be applied for optimal workflows for many common bioinformatic analyses. Such a workflow has been developed for the analyses of DNA methylation [77].

Analyses of CpG Methylation in Large Cohorts of Publicly Available Data
In tandem with the development of these, and other, genomic technologies has been their widespread application across multiple genomes undertaken by researchers in the Encyclopedia of DNA elements (ENCODE) [78,79], RoadMap Epigenome [80], and FANTOM [81] consortia. In doing so, these consortia have also generated remarkable volumes of publicly available data with which to interrogate DNA methylation patterns and relationships to gene expression and cell phenotypes. In cancer it is clear that the TF-genome interactions are corrupted [82][83][84][85] and "re-wired" [86][87][88], for example, by somatic mutations and endogenous structural variants that disrupt TF binding. A major driver of addressing global DNA methylation in cancer has been the development of large cancer genome studies, for example the Cancer Genome Atlas (TCGA) in which virtually all 30,000 tumors in the archive have been screened with Illumina microarray approaches. This, in part, has allowed workers to undertake pan-cancer analyses of the DNA methylation patterns in an effort to classify different methylation subgroups between tumors [89].
However, testing the extent of genome-wide correlations between CpG methylation and gene expression is challenging because of statistical, biological, and technical limitations and incomplete biological understanding and, therefore, the extent and strength of correlations differ significantly between studies. For example, it is also critical to consider that these DNA methylation states are in the context of chromatin, and that this chromatin structure is, in part, defined by modifications of the histones making up the nucleosomes; unmethylated CpGs are frequently associated with active histone marks. This interplay of epigenetic events has also guided how researchers consider the transcriptional potential of a gene, as to whether the gene is active, repressed, or poised [90]. If the gene locus is in a transcriptionally permissive environment, and is poised, then methylation can impact expression of the gene.
A further major impediment to identifying TF-genomic interactions that are disrupted by CpG methylation, and which might also drive cancer development, is the sheer volume of TF complexes involved and the combinatorial nature of epigenomic events. Over 20% of the protein-coding genome relates to transcriptional control; approximately 2500 TFs interact with over 2000 TF co-factors and chromatin remodeling factors, before even considering the actions of the non-coding genome. The diversity of TF-genomic interactions is amplified even further when considering that each TF complex may have thousands of genomic binding sites, known collectively as a cistrome [91].
Therefore, it is reasonable to postulate that the choice of TF binding sites is guided by the interplay of multiple histone modifications and the CpG methylation status, but it is much harder to test the strength of these relationships and how they diverge across cancer states. Considering the impact of CpG methylation on TF-genomic interactions thus quickly becomes a very challenging question. For instance, although CpG islands tend to be considered as discrete data points, being either on or off, their methylation is in fact highly continuous. CpG status can be altered by changes in both the distribution and density of methylation.
Furthermore, workers have more recently proposed that the regions around the islands (the shores and shelves) carry further important information to mediate relationships that control gene expression [92,93]. For example, the shores and shelves of CpG islands tend to have higher variation across cancers and therefore comparative analyses have to be specific to locations. As the CpG methylation status of the human genome has become increasingly mapped it has also emerged that differential methylation is not restricted to the CpG islands, but also extends to CpG regions, for example at enhancers [94] as well as across gene bodies. Again, it is worth remembering that one of the earliest studies of DNA methylation impacting transcription factor binding considered a single 6 bp region and the impact of one methylated cytosine within that region [48]. Thus, as further details regarding DNA methylation patterns in tissues have emerged, so too have the complexities in which they relate to transcription.
From a developmental perspective, recent genomic studies have begun to reveal the impact of CpG methylation at enhancer and intergenic regions. Clearly, 5mC changes are dramatic and dynamic in embryos to establish totipotency. In mammals, active TET-dependent oxidation of 5mC, as well as passive cell-division dependent depletion, is critical to demethylate gene enhancers and activate transcription factors that control embryonic development pathways [95]. More recently, cell-based studies have also modeled the impact of CpG methylation in gene enhancers and revealed the interplay with the so-called bivalent status of enhancers and super-enhancers [96], again demonstrating the interplay between CpG methylation and histone modifications.
There are approximately 50,000 CpG islands in the human genome (depending on the specific definitions of GC content, density, and length) and the shore and shelf concept of course expands this number. Rarely is a region entirely methylated or not and therefore the calculation of the level of methylation is not trivial, requiring relatively sophisticated statistical models that aim to identify DMR. To ascribe function to these DMR requires some aspect of spatial annotation. For example, if a CpG island is proximal to a TSS it is a reasonable assumption that heavily methylated (or unmethylated) status impacts the expression of this gene. This does not preclude the fact that methylation may impact the expression of distal genes in both the 5′ and 3′ directions, and that the promoter of one gene may actually be a distal enhancer of another [97] and, therefore, give rise to so-called "ripple" expression of adjacent genes [98,99]. Additionally, gene expression may be impacted by the combined effect of 5′ and 3′ distal and proximal regulatory regions and may include the methylation of the gene body.
Finally, the definition of a DMR may also relate to cell type. For example, in a cancer genome it is not clear if the average methylation state of a limited number of CpG probes within a 1 kb CpG island accurately defines the overall methylation state of that island. If it does, it is also unclear whether that would be true in normal matched tissues.
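The concern about a few probes standing in for a whole island can be made concrete with a small Python sketch. It assumes methylation is summarized per CpG as a fraction between 0 (unmethylated) and 1 (fully methylated); the values, the function names, and the choice of a simple unweighted mean are all hypothetical illustrations. It is not a DMR caller, since it performs no statistical modelling of the kind the text describes.

```python
from statistics import mean


def region_mean(cpg_fractions):
    """Unweighted mean methylation fraction across the CpGs measured in one region."""
    return mean(cpg_fractions)


def group_difference(tumor_samples, normal_samples):
    """Difference in mean regional methylation (tumor minus normal).

    Each argument is a list of samples; each sample is a list of per-CpG
    methylation fractions for the same region.
    """
    tumor = mean(region_mean(sample) for sample in tumor_samples)
    normal = mean(region_mean(sample) for sample in normal_samples)
    return tumor - normal


if __name__ == "__main__":
    # Hypothetical island with five CpGs, of which an array samples only the first two.
    all_cpgs = [0.9, 0.9, 0.1, 0.1, 0.1]
    probed_cpgs = all_cpgs[:2]
    print("all CpGs:", region_mean(all_cpgs))
    print("probed CpGs only:", region_mean(probed_cpgs))

    # Hypothetical two-probe measurements in two tumor and two normal samples.
    tumor = [[0.8, 0.6], [0.9, 0.7]]
    normal = [[0.2, 0.2], [0.3, 0.1]]
    print("tumor minus normal:", group_difference(tumor, normal))
```

In this toy island the two probed positions suggest heavy methylation while the full set of CpGs does not, which is the kind of discrepancy that makes probe-based regional summaries difficult to interpret.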
These concerns notwithstanding, NGS technologies have been applied to an ever-greater extent with increasing genomic coverage and resolution of CpG methylation yielding new observations regarding DNA methylation patterns. Genome-wide, in normal cells, there is a negative correlation for approximately 20% of the genome between elevated 5mC in CpG islands and repression of a neighboring TSS [100].
Thus, with the development and widespread application of tools to measure CpG methylation levels and distribution across the genome it is now possible to test the extent of correlation between CpG methylation and gene expression. Within the cancer context the remarkable volume of data developed by TCGA investigators has made it possible to dissect the relationships between CpG methylation and gene expression.
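As a minimal illustration of what testing such correlations involves, the Python sketch below computes a per-gene Pearson correlation between methylation and expression across matched samples and sorts genes by the sign of the relationship. The gene names, the data values, and the cut-off of |r| ≥ 0.5 are hypothetical, Pearson correlation is only one of several possible measures, and no significance testing or multiple-testing correction is performed, all of which a real genome-wide analysis would need.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def classify_genes(methylation, expression, threshold=0.5):
    """Group genes by the direction of their methylation-expression correlation.

    Both arguments map gene name -> list of values, one per matched sample.
    The threshold on |r| is an arbitrary illustrative cut-off.
    """
    groups = {"negative": [], "positive": [], "uncorrelated": []}
    for gene, meth_values in methylation.items():
        r = pearson(meth_values, expression[gene])
        if r <= -threshold:
            groups["negative"].append(gene)
        elif r >= threshold:
            groups["positive"].append(gene)
        else:
            groups["uncorrelated"].append(gene)
    return groups


if __name__ == "__main__":
    # Hypothetical promoter methylation fractions and expression levels in five samples.
    methylation = {
        "GENE_A": [0.1, 0.3, 0.5, 0.7, 0.9],
        "GENE_B": [0.1, 0.3, 0.5, 0.7, 0.9],
        "GENE_C": [0.1, 0.3, 0.5, 0.7, 0.9],
    }
    expression = {
        "GENE_A": [9, 7, 5, 3, 1],   # falls as methylation rises
        "GENE_B": [1, 3, 5, 7, 9],   # rises as methylation rises
        "GENE_C": [5, 1, 5, 1, 5],   # no linear relationship
    }
    print(classify_genes(methylation, expression))
```

Counting the genes in each group is the kind of summary that underlies statements about how many genes show negative, positive, or no detectable correlation.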
One of the earlier studies by Aran et al. [101] addressed this question and focused on negative correlations between CpG methylation at enhancer sites and the gene body, and gene expression across multiple cell lines, and offered evidence that enhancer methylation was selectively disrupted in cancer. However, if the starting point is the individual datasets, rather than modelling the ones where there is the strongest negative correlation, then identifying these relationships de novo can be more challenging.
Combining MeDIP-Seq and ribonucleic acid (RNA)-Seq datasets from malignant mesothelioma cells was undertaken to investigate the genome-wide extent of strong negative correlations between methylation and gene expression. The number of such CpG methylation-gene expression negative correlations was strikingly small. For example, there were several thousand hypermethylated and hypomethylated genes but only hundreds of genes differentially expressed, with the clearest relationships being at intronic regions where altered methylation associated with altered expression [102].
Similarly in breast cancer, genome-wide NGS approaches have been applied to identify how CpG methylation impacts gene expression and the emergence of drug resistance. In this context, approximately one thousand genes had both altered methylation and altered expression, perhaps supporting the concept that epigenetic events can allow for the rapid adaptation to drug exposure. However, drilling into the subset of data where CpG methylation and gene expression were negatively correlated identified fewer genes (in the low hundreds) [103]. In another report [104] in lung cancer the correlation between DNA methylation and gene expression was detected for approximately 750 genes, but for one third of these the correlation was positive. Again, a negative correlation could only be found for approximately 500 genes. Similarly, in esophageal cancer the authors only illustrated the negative correlation at candidate loci despite having generated RNA-Seq and matched MeDIP-Seq data [105].
Reflecting the challenges in finding strong and widespread negative correlations between genome-wide CpG methylation and gene expression, investigators have applied more sophisticated analytical approaches to transform the continuous DNA methylation data into a categorical format. Again, in breast cancer, both positive and negative correlations between DMR and gene expression are often observed. Selecting for lowly-expressed genes enhanced the detection of stronger negative correlations between promoter CpG methylation and gene expression [106]. Therefore, it seems that a negative correlation between CpG methylation and gene expression exists for only a small subset of genes expressed in cancer cells, numbering perhaps in the hundreds.

Prostate Cancer as a Model of the Interplay between Genomic and Epigenomic Cancer Drivers
Amongst men in the US, prostate cancer (PCa) is the most common non-cutaneous cancer diagnosed and the second leading cause of cancer death [107,108]. This cancer is highly heterogeneous in terms of progression rates. Although pathological tumor grade (Gleason Grade) accurately predicts disease outcome, the clinical parameters currently available before surgery do not accurately predict the risk of progression to more aggressive stages of disease. Therefore, it is not easy to distinguish men who both need and will be cured by surgical treatment from those who will experience subsequent treatment failure and disease recurrence [109,110]. This is of clinical significance because patients who experience treatment failure are significantly more likely to progress to more aggressive forms of PCa, with significantly increased risks of tumor-related mortality [111,112].
This ambiguity is compounded because the incidence and natural history of PCa vary between races. American men of African ancestry have a 19% increased incidence of, and 37% increased mortality from, PCa compared to men of European ancestry (reviewed in [113][114][115][116]). Thus, in African American (AA) PCa patients, the disease appears more aggressive, and occurs at a younger age, than in European American (EA) patients.
In an effort to define the disease more accurately, multiple groups [117], including the TCGA consortium, have added to previous understanding [118,119] and established roles for common genetic alterations in PCa [120][121][122] and for novel somatic mutations, including those in Forkhead Box A1 (FOXA1) and Speckle-Type POZ Protein (SPOP). Also, supporting the importance of androgen receptor (AR) signaling in PCa and the cross-talk with epigenetic events, the co-activator Nuclear Receptor Coactivator 2 (NCOA2) is commonly amplified [122,123].
The complex nature of cancer phenotypes however cannot be explained by genetic components alone [124]. Epigenomic modifications and events contribute significantly to cell transformation and play distinct yet complementary roles to genomic events, and add to a fuller explanation for the etiology of disease. For example, up-regulation of the histone methyltransferase, Enhancer of Zeste Homolog 2 (EZH2) appears common in both localized and metastatic PCa, and associates with poorer prognosis [125,126]. Additionally, reflecting different genomic and epigenomic drivers of PCa, there are significant global differences in the pattern of CpG DNA methylation associated with different genetic PCa phenotypes, notably in the presence of TMPRSS2-ERG translocations [127,128]. Similarly, we and others have examined the expression and CpG methylation status associated with miRNA (reviewed in [129]). For example, in PCa, promoter hypermethylation is associated with loss of microRNA (miR)-200 family members that regulate cell migration/invasion. We have defined cohorts of miRNA that predict aggressive disease [130] and in turn revealed that their expression may often be associated with altered CpG methylation [131].
The changes in CpG methylation in PCa progression have been comprehensively reviewed by Lynch and co-workers [132]. They highlighted the consistency of methylation at the promoter of certain genes, for example Glutathione S-Transferase Pi 1 (GSTP1). They also combined datasets and measured the overlaps to identify 168 genes commonly reported to have both an associated DMR and altered expression, including GSTP1 and others such as retinoic acid receptor β (RARB), Ras Association Domain Family Member 1 (RASSF1), and Aldehyde Oxidase 1 (AOX1), as well as homeobox gene family members.
Others have sought to relate CpG methylation patterns to clinical outcome by combining these patterns in univariate regression analyses of time to disease recurrence, revealing that methylation of certain loci, again including AOX1 and RARB [133], predicted disease progression [134,135]. Further supporting the relevance of DMR in PCa progression, the TCGA investigators revealed how altered DNA methylation patterns associated with different PCa genetic phenotypes. Interestingly, no negative correlation patterns were noted between DNA methylation level and either mRNA or miRNA expression. However, subsequent studies of the same data by Jin and co-workers modelled the interplay between distal, promoter, and genic CpG methylation and gene expression [136]. Notably, these workers revealed that TSS and distal hypermethylation, but not hypomethylation, were associated with differential gene expression. Again, reflecting other cancer studies, the correlations were both negative and positive, and the number of genes for which a specific location of hypermethylation negatively correlated with gene expression was fewer than 100.
To complement these studies, many researchers have examined how the expression and genetic variation of either DNMTs or TETs are altered in cancer systems. For example, earlier studies have linked gain of expression or genetic variation with altered DNA methylation patterns in various tumors. In many cases these gains of DNMT function were linked to disease progression and worse clinical features. Indeed, the interplay of DNMTs and TETs has also been established as a putative feed-forward loop in which increased DNMT function silences TETs [137]. Furthermore, researchers have examined how DNMTs and TETs may contribute to disease progression, establishing roles for increased DNMTs in mouse models of prostate cancer [138][139][140] and showing that TET1, for example, is disrupted by copy number changes that correlate with reduced 5hmC levels in prostate cancer samples [141]. Others have revealed a targeted function for TET1, which interacts with the pioneer factor FOXA1 to activate lineage-specific enhancers [142].
A complementary area of very active research in PCa is the control of one-carbon metabolism and the methionine cycle, which generates the SAM pools that in turn feed into the control of DNA methylation, as well as histone methylation [143]. This pathway has unique relevance to the prostate, which normally secretes high levels of acetylated polyamines; this in turn can stress the cellular production of SAM and, therefore, the biochemistry of prostate epithelial cells is modified to enhance methionine salvage pathways. Indeed, this dependency may highlight a unique therapeutic approach to targeting prostate cancer through inhibitors of Methylthioadenosine Phosphorylase (MTAP), the rate-limiting enzyme [144]. Given that folate is an upstream dietary-derived precursor of these pathways, it is likely that dissecting the links between folate metabolism and prostate cancer may yield unique and tissue-specific insight [145][146][147].

Bioinformatic Approaches to Reveal Associations between DNA Methylation and Prostate Tumor Status
To add to these studies, we have now sought to model the impact of the DNA methylation pathway on gene expression in PCa by using the R platform for statistical computing and a range of library packages implemented in Bioconductor. As a starting point, we created a list of genes known to be involved in the control of DNA methylation. To do this, we downloaded genes from the DNA methylation pathway in WikiPathways [148] and combined these genes with those returned from searches for DNA methylation pathway genes in UniProt [149]. Together, these approaches yielded a unique list of 165 genes that includes those that control the regulation of SAM pools and DNA CpG methylation.
To examine how these genes were altered in PCa, we examined their expression in the TCGA prostate cancer cohort (PRAD) of 497 tumors. These data are publicly available and were downloaded. The data include both tumor and normal samples and, therefore, we created an expression table of all genes detectable in at least 80% of tumors (n = 16,785), given as Z-scores relative to the mean of the normal samples [150]. Expression of the 165-gene panel of the DNA methylation pathway was examined in this table using genefilter to capture only those genes altered by more than two Z-scores in 25% of tumors; this yielded 21 genes on the DNA methylation pathway. These genes included DNA methyltransferase 3A (DNMT3A), MBD1, and Nuclear Receptor Corepressor 1 (NCOR1). Tumor expression patterns for these genes were then visualized and clustered by expression on a heatmap (pheatmap). Relationships between cluster membership and tumor grade (Gleason Grade 6 and 7 compared to 8, 9, and 10) were measured using survival. The expression patterns of these 21 genes clustered tumors into groups that in turn associated with significantly different levels of Gleason Grade (p < 0.006) (Figure 1A). Interestingly, of these genes shown to be associated with altered CpG methylation, we had previously identified that increased NCOR1 binding to gene targets in prostate cancer cell lines leads to elevated CpG methylation [151].
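As an illustration only, the scaling and filtering step described above can be sketched as follows. The analysis itself was run in R with genefilter; this Python sketch is a stand-in and not the original pipeline. The function names, the toy expression values, the use of the normal-sample standard deviation for scaling, and the reading of "altered by more than two Z-scores" as an absolute threshold are all assumptions that the text does not specify.

```python
from statistics import mean, stdev


def z_scores_vs_normal(tumor_values, normal_values):
    """Scale tumor expression against the normal-sample mean and SD (assumed scaling)."""
    mu = mean(normal_values)
    sd = stdev(normal_values)
    return [(v - mu) / sd for v in tumor_values]


def frequently_altered(z_by_gene, z_cutoff=2.0, min_fraction=0.25):
    """Keep genes with |Z| > z_cutoff in at least min_fraction of tumors."""
    kept = []
    for gene, zs in z_by_gene.items():
        n_altered = sum(1 for z in zs if abs(z) > z_cutoff)
        if n_altered / len(zs) >= min_fraction:
            kept.append(gene)
    return kept


# Toy data with hypothetical gene names and made-up expression values.
normal = {"GENE_A": [10.0, 11.0, 9.0, 10.0], "GENE_B": [5.0, 5.5, 4.5, 5.0]}
tumor = {"GENE_A": [10.2, 9.8, 10.5, 9.9], "GENE_B": [8.0, 5.1, 7.5, 4.9]}

z_table = {gene: z_scores_vs_normal(tumor[gene], normal[gene]) for gene in tumor}
selected = frequently_altered(z_table)
print(selected)
```

In this toy example only GENE_B passes the filter, because two of its four tumor values lie more than two normal-sample standard deviations from the normal mean.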
Next, we sought to investigate the relationships between these 21 genes and the targets of DNA methylation. To do this, we identified all genes amongst the 16,785 genes expressed in the TCGA-PRAD tumors that positively or negatively correlated with each of these 21 genes on the DNA methylation pathway. Subsequently, we used a hypergeometric test to measure whether genes that were strongly correlated with these 21 genes were themselves enriched for known targets of DNA methylation. Thus, for each of the 21 DNA methylation pathway genes, the negative and positive correlations of expression (Pearson correlation either <−0.6 or >0.6) were determined. Only five DNA methylation pathway genes had strong correlations; the enrichment of the 165 common targets of DNA methylation from the Lynch et al. review [132] was then measured within these correlated genes.
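The correlation and enrichment steps can be sketched in the same hedged way. This is a minimal Python illustration under stated assumptions, not the R code used for the analysis: the gene names and values are invented, the background for the hypergeometric test is assumed to be all expressed genes other than the pathway gene itself, and the one-sided (upper-tail) form of the test is assumed.

```python
from math import comb, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def strongly_correlated(pathway_gene, expression, cutoff=0.6):
    """Genes whose correlation with pathway_gene is < -cutoff or > cutoff."""
    ref = expression[pathway_gene]
    return {
        gene
        for gene, values in expression.items()
        if gene != pathway_gene and abs(pearson(ref, values)) > cutoff
    }


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are known targets."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)


# Toy expression table across five tumors (hypothetical genes, made-up values).
expression = {
    "PATHWAY": [1, 2, 3, 4, 5],
    "T1": [2, 4, 6, 8, 10],   # perfectly positively correlated
    "T2": [5, 4, 3, 2, 1],    # perfectly negatively correlated
    "G3": [1, 3, 2, 3, 1],    # uncorrelated
    "G4": [2, 1, 3, 2, 2],    # weakly correlated
}
known_targets = {"T1", "T2"}

correlated = strongly_correlated("PATHWAY", expression)
background = len(expression) - 1  # all genes except the pathway gene
overlap = len(correlated & known_targets)
p_value = hypergeom_upper_tail(overlap, background, len(known_targets), len(correlated))
print(sorted(correlated), p_value)
```

With so few genes the toy p-value is not meaningful; the point is only the order of operations, in which the correlation threshold defines the gene set and the hypergeometric test then asks whether known methylation targets are over-represented within it.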
Figure 1. (A) Gene expression was measured as normal-tissue-relative Z-scores of all genes detectable in at least 80% of tumors. Cluster membership significantly identifies aggressive tumors (p < 0.006); (B) The negative and positive correlation (Pearson correlation either <−0.6 or >0.6) between each of the commonly altered genes from the DNA methylation pathway (n = 21) and all other detectable genes in the TCGA cohort (n = 16,785) was measured, and the enrichment of the 165 common targets of DNA methylation from the Lynch et al. review [132] was measured using a hypergeometric test. Only the indicated genes had significant correlation with all genes and significant enrichment of the targets of DNA methylation; (C) Heatmap illustrating common and significantly altered mRNA expression of genes that significantly correlate with Indolethylamine N-Methyltransferase (INMT) and Methionine Adenosyltransferase 2B (MAT2B) and are known targets of DNA methylation in the TCGA PRAD cohort. Cluster membership significantly identifies aggressive tumors (p < 0.0007).
This analysis identified that, of the DNA methylation pathway genes, only Indolethylamine N-Methyltransferase (INMT) and Methionine Adenosyltransferase 2B (MAT2B) significantly correlated with a set of genes that were themselves significantly enriched for known targets of DNA methylation in PCa (Figure 1B). That is, INMT and MAT2B were commonly altered and associated with the expression of genes that are significantly enriched for known targets of DNA methylation changes in PCa. INMT is a methyltransferase, and the methionine adenosyltransferase MAT2B regulates the biosynthesis of SAM from methionine and is therefore important in the regulation of SAM pools upstream of DNA methylation events. INMT was previously identified as predicting disease progression risk in prostate cancer [152], whereas MAT2B has not previously been implicated in prostate cancer risk or etiology. Together, these findings suggest that mechanisms that control the flux of SAM pools appear to be linked to aggressive PCa and that only a relatively small subset of genes is probably targeted. Again, it is worth emphasizing that SAM pools are related to diet and genetic variation, and there is a significant literature on how control of the central methyl donor impacts a range of diseases including PCa [144,145,153][154][155][156][157][158].
Next, we took the targets of DNA methylation that correlated with INMT and MAT2B expression and identified those with the most altered expression in tumors with higher Gleason Grade; this yielded 32 genes (Figure 1C). These genes were both up- and down-regulated. Within the down-regulated genes (n = 20), the reduction in expression was pronounced and significant in the more aggressive tumors; these genes included RARB, AOX1, and GSTP1.
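The text does not state how the "most altered" targets were chosen, so the following Python sketch shows only one plausible approach: ranking genes by the difference in mean Z-score between higher-grade (Gleason 8 to 10) and lower-grade (Gleason 6 and 7) tumors. The ranking criterion, the function name, the placeholder gene GENE_X, and all numerical values are assumptions for illustration; only the grade grouping is taken from the analysis described above.

```python
from statistics import mean


def mean_shift_by_grade(z_by_gene, gleason_scores, high_cutoff=8):
    """Mean Z-score in higher-grade tumors minus mean Z-score in lower-grade tumors."""
    high = [i for i, g in enumerate(gleason_scores) if g >= high_cutoff]
    low = [i for i, g in enumerate(gleason_scores) if g < high_cutoff]
    shifts = {}
    for gene, zs in z_by_gene.items():
        shifts[gene] = mean([zs[i] for i in high]) - mean([zs[i] for i in low])
    return shifts


# Toy data: six tumors with made-up Z-scores; GENE_X is a hypothetical placeholder.
gleason = [6, 7, 7, 8, 9, 10]
z_by_gene = {
    "RARB": [-0.5, -0.2, -0.3, -2.5, -3.0, -2.0],
    "GENE_X": [0.1, 0.0, -0.1, 0.2, 0.0, 0.1],
}

shifts = mean_shift_by_grade(z_by_gene, gleason)
ranked = sorted(shifts, key=lambda gene: abs(shifts[gene]), reverse=True)
down_regulated = [gene for gene in ranked if shifts[gene] < 0]
print(ranked, down_regulated)
```

A formal comparison would replace the simple mean difference with a statistical test and a correction for multiple testing; the sketch is intended only to make the grade-based comparison concrete.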
Together this translational bioinformatics approach has mined existing datasets and identified that 21 members of the DNA methylation pathway are commonly altered and associated with more aggressive tumor features (e.g., INMT and MAT2B). These genes were strongly correlated with a small number of known targets of DNA methylation including RARB, which in turn could distinguish tumors with higher Gleason Grade.

Summary
The current review has aimed to examine the central aspects of the relationships between DNA CpG methylation and the control of gene expression. Within the cancer arena, this complex field has made very significant strides with the application of genome-wide technologies and sophisticated statistical approaches, combined with high-quality and large-scale tumor profiling data. Perhaps surprisingly, the numbers of negative correlations between DNA CpG methylation and gene expression are in the hundreds not thousands, although within these there are clear examples of tumor adaptation to drug exposure, and genes that are known tumor-drivers. We also present a bioinformatics pipeline to examine how genes known to control DNA CpG methylation relate to altered gene expression and tumor status. This approach is relatively generic and revealed that, at least in PCa, the control of the biosynthesis of SAM is significantly associated with altered gene expression and tumor aggressiveness.