Copy Number Variation in Hereditary Non-Polyposis Colorectal Cancer

Hereditary non-polyposis colorectal cancer (HNPCC) is the commonest form of inherited colorectal cancer (CRC) predisposition and by definition describes families which conform to the Amsterdam Criteria or reiterations thereof. In ~50% of patients adhering to the Amsterdam criteria germline variants are identified in one of four DNA Mismatch repair (MMR) genes MLH1, MSH2, MSH6 and PMS2. Loss of function of any one of these genes results in a failure to repair DNA errors occurring during replication which can be most easily observed as DNA microsatellite instability (MSI)—a hallmark feature of this disease. The remaining 50% of patients without a genetic diagnosis of disease may harbour more cryptic changes within or adjacent to MLH1, MSH2, MSH6 or PMS2 or elsewhere in the genome. We used a high density cytogenetic array to screen for deletions or duplications in a series of patients, all of whom adhered to the Amsterdam/Bethesda criteria, to determine if genomic re-arrangements could account for a proportion of patients that had been shown not to harbour causative mutations as assessed by standard diagnostic techniques. The study has revealed some associations between copy number variants (CNVs) and HNPCC mutation negative cases and further highlights difficulties associated with CNV analysis.


Introduction
Somewhere between 2% and 5% of all colorectal cancers (CRCs) are classified as hereditary non-polyposis colorectal cancer (HNPCC). Families with germline mutations or complex genomic changes (without structural gene alterations) that render one of four DNA mismatch repair (MMR) genes ineffective compose a subset of HNPCC known as Lynch Syndrome (LS).
The clinical diagnosis of HNPCC is defined by any one of several reiterations of the Amsterdam Criteria, first established in 1990 to enable the identification of the genetic basis of the disease [1]. As such mutations in MLH1, MSH2, MSH6 and PMS2 have been identified to account for all LS families [2][3][4]. Recently, loss of EPCAM, has been associated with transcriptional silencing of MSH2, and rare epimutations in MLH1 have also been implicated in LS [5,6].
Despite the definition of HNPCC up to 50% of clinically tested patients with tumours demonstrating microsatellite instability (MSI), the hallmark phenotype of HNPCC, will fail to have any germline mutation identified in any one of the four MMR genes responsible for LS [7][8][9]. This suggests that there are either other genes associated with this disorder or different mechanisms of gene silencing responsible for HNPCC.
Since the sequencing of the human genome it has become apparent that genomic rearrangements are ubiquitous in the population. Genomic duplication or deletion have been shown to encompass large stretches of contiguous DNA and are commonly termed copy number variants (CNVs). As CNVs range from kilobase (Kb) to megabase in size, they may encompass or disrupt functional DNA sequences, result in gene amplification or loss, or alter epigenetic patterning [10]. As such, CNVs have been well documented in their contribution to disease development and variation in disease phenotype [11][12][13][14][15][16].
CNVs have been implicated in the development of many forms of CRC, e.g., germline deletion of two genes, PTEN and BMPR1A have been identified to be the cause of Juvenile Polyposis (JP) in four unrelated children [17], while genomic deletions in the genes SMAD4, BMPR1A and PTEN result in JP [18] and furthermore, the Leiden Open Variation Database (LOVD) database lists nearly 3,000 mutations in four MMR genes associated with HNPCC, of which many are gains and losses of genomic material [19]. Recent reports specifically examining the association between genomic rearrangements and LS have revealed that loss of a region on chromosome 2 encompassing EPCAM appears to be associated with LS [6,20]. The loss of EPCAM appears to re-write the epigenetic programming in the region such that the MSH2 becomes silenced as a result of CpG methylation of the 5' promoter region. This evidence suggests that a proportion of HNPCC families may be accounted for by genomic rearrangements that may not be readily identified using more traditional gene mutation searches.
CNVs are detected using DNA arrays that comprise a series of oligonucleotides that represent evenly distributed markers across the entire genome. As the number of oligonucleotide markers has increased from a few hundred thousand to over five million, CNV resolution has improved such that ever smaller rearrangements can be detected in a single experiment. In this study we have used the Affymetrix Cytogenetic Whole Genome2.7M (Cyto2.7M) array which contains over 400,000 SNP probes and greater than 2.1 million CNV probes (average spacing 1,395 base pairs) to examine the CNV landscape in HNPCC patients and search for CN gains or CN losses which may reside in or in the vicinity of the 22 genes associated with DNA MMR. We also investigated genes and gene expression regulatory elements (microRNAs or miRs) associated with CNVs unique to the HNPCC patients using pathway analysis to determine if they may contribute to disease development.

Samples
Genomic DNA samples for the current study were obtained from HNPCC patients who had given informed consent for their DNA to be used for studies into their disease and control DNA samples from the Hunter Community Study (HCS) [21]. DNA was extracted from whole blood by the salt precipitation method [22]. The study was approved by the University of Newcastle's Human Research Ethics Committee (HREC) and the Hunter New England Human Research Ethics Committee (HNEHREC).
A sample size of 125 HNPCC patients was used for the current study. All HNPCC patients were clinically diagnosed as per the Amsterdam Criteria II [1,23] or the Bethesda Guidelines [24]. All patients had been diagnosed with CRC and were the first individual (proband) of their family to seek genetic testing. The samples were referred for routine clinical diagnostic testing involving screening for mutations in: MLH1, MSH2, MSH6 and/or PMS2. The mutation screening was performed using Sanger Sequencing and/or Multiplex ligation-dependant probe amplification (MLPA). No mutations were identified in any of the patients used for the current study and are thus considered to be MMR mutation negative. The average age of patients recruited for this study was 52 years of age.
A sample size of 40 controls from the Hunter Community Study (HCS) [21] was used in the current study. Theses samples were from healthy individuals aged >55 years who were cancer free at the time of sample collection.

Genomic Array Analysis
The DNA from the 165 patients and controls was processed on the Affymetrix Cyto2.7M array according to manufacturer's protocols. CEL files obtained from scanning the Cyto2.7M array were analysed in the proprietary software from Affymetrix, the Chromosome Analysis Suite (ChAS) (Version CytoB-N1.2.0.232; r4280) using NetAffx Build 30.2 (Hg18) annotation. Quality control parameters were optimized and validated using a training set of 20 randomly selected samples (patients and controls). Identified CNV regions within the training set were assessed according to CNV call confidence, probe count, size, wavinessSd and by visual inspection for distinction from normal CN state. In addition, data was visually inspected to identify regions with low density of markers including centromeric and telomeric regions (Supplementary Table 1) which were excluded from analysis across all samples. The resultant thresholds were applied to all samples. Most of the thresholds were more stringent than default settings, aiming to minimize the number of false-positive CNVs being included in the analysis. Briefly, all samples were subject to a series of quality cut-off measures: snpQC >1.1 (assesses quality of SNP probes with respect distances between the distribution of alleles AA, AB and BB alleles and larger differences are associated with an increased ability to identify a genotype; default), mapdQC <0.27 (Median Absolute Pair-wise Difference; assesses quality of CN probes with respect to a reference model file; default) and wavinessSd <0.1 (measure of standard deviation in data waviness; the GC content across the genome correlates with average probe intensities i.e., high GC probes are brighter than low GC probes on average, which creates waves in the data). CNV regions within all samples were then filtered using a set of CNV calling parameters: >90% confidence, autosomes only and a minimum number of 24 probes.

Statistical, Pathway and Annotation Analysis
Refined ChAS CNV counts and CNV size for the patients were compared to the controls using a two tailed un-paired t-test in Graphpad Prism (Version 6) [25]. Gene enrichment analysis was performed using WebGestalt analysis software (Version 2013) [26]. This software was used to assess gene lists derived from the refined CNV results obtained from ChAS according to Gene Ontology (GO) categories, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and miR targets. Analysis was performed using hypergeometric statistical method, Benjamini and Hochberg (BH) correction for multiple testing (both default settings) and a biological significance threshold of <0.05 with a minimum of two genes per category required to assess any enrichment. TAM (Tool for Annotations of miRs) (Version 2) [27] software was used to annotate miRs according to miR family, cluster, function, Human miR associated disease categories (HMDD) and tissue specificity. Annotations were performed using the following parameters: all miRs in the TAM database were used as a background; to identify meaningful categories we looked at miR over-representation in all categories and analysis was limited to at least one miR in a given category. Enrichment analysis for miRs categories was conducted using hypergeometric testing and p values were corrected according to Bonferroni correction for multiple testing.

Results and Discussion
Recent studies have reported CNV's as relevant contributors to human diversity and cancer susceptibility [28][29][30]. This study further defines the contribution of CNVs to disease risk in HNPCC.

Resolution
Refinement of ChAS thresholds resulted in the final analysis of CNVs ranging from a minimum of 8.4 Kb to a maximum of 2,722.5 Kb in size (see Figure 1 for examples). CN gains ranged from 8.

CNV Detection
Analysis of Cyto2.7M array data identified a total of 543 CNVs in the 165 patients and controls utilized in this study (Table 1). Total counts of CNVs observed in the 125 HNPCC patients corresponded to 439 CNV events compared to 104 events in the 40 controls. The mean number of CNVs identified per sample did not significantly differ between patients and controls (3.51 CNVs per patient and 2.60 CNVs per control, p = 0.2980). Consistent with a recent report looking at CNVs in hereditary breast cancer, similar counts of CNVs detected between patients and controls have been suggested to reflect a lack of genomic instability in the genomes of patients screened [31,32]. The mean CNV affected genome per sample did not differ between patients and controls either (284.07 Kb patients and 295.52 Kb controls, p = 0.9121). However, the mean size of a CNV differed significantly between patients and controls (70.08 Kb in patients and 106.57 Kb in controls, p = 0.0165). The exact reason why we observe this difference is unclear however it may be a function of the number of samples in each group.
No CN gains or losses were identified within the defined search region for any of the 22 genes in the MMR pathway for all samples utilized in this study, patients and controls. We cannot however rule of the possibility for CNVs residing in these regions which are smaller than the resolution of detection provided by this array (<8.4 Kb).

Occurrence and Distribution of CNVs in Patients and Controls
Of the total 104 CNVs identified in controls, 34 CNVs contained genomic regions that were common to genomic regions identified in patients (Supplementary Table 2). A total of 70 CNVs were unique to the controls of which 47 (67.14%) were associated with genes (Supplementary Table 3).
Of 439 CNVs identified in patients, 53 CNVs contained genomic regions that were common to genomic regions identified in controls (Supplementary Table 4). Three hundred and eighty six CNVs were unique to the patients population of which 207 (53.63%) were associated with genes (Supplementary Table 5).
From the 207 unique CNVs associated with genes identified in the patients, 9 were identified in patients that did not overlap any CNVs in controls but affected the same gene even in multiple patients (ARPP-21, C7orf10, KIAA1217, LINGO2, MACROD2 and NKAIN2). A total of 60 genes associated with 131 CNVs were identified in multiple individuals (as shown in Table 3). Fifty two genes were affected by a CNV in two individuals; five genes were affected by a CNV in three individuals (IGSF11, GK5, XRN1, NAMPT and LCP1); and three genes were affected by a CNV in four individuals (CTNNA3, NRG3 and LOC642597). Table 3. Genes associated with unique CNVs (compared to controls) identified across multiple patients. Number of CNV events in which gene (s) have been identified and if they were a CN gain or loss.

Type 2 CNV events 3 CNV events 4 CNV events Gains
While this study has not investigated further the contribution of any one of these CNVs to disease development, previous studies have reported the involvement of several of the genes influenced by one or more CNVs in CRC: L-plastin (subunit LCP1) has been shown to be unregulated in various solid human tumours and is also known to contribute to CRC progression via its involvement in cell proliferation and invasion and consequently metastasis [33][34][35]; alpha-catenin (subunit CTNNA3) has been reported to show reduced expression in CRC cell lines which has been suggested to facilitate metastasis [36], while another study has reported increased expression of alpha-catenin during adenoma formation via the negative regulation of beta-catenin signalling [37]; the tumour suppressor gene APC has been unequivocally associated with the CRC and Familial adenomatous polypsis (FAP) [38][39][40]; and furthermore, expression of IGSF11 has been reported to be elevated in CRC cells lines and may represent a target for cancer immunotherapy [41]. Future studies are required to validate and investigate the role of the CNVs identified in our study for their potential contribution in the development of HNPCC.
Of the 386 CNVs identified unique to the patients, of these regions 56.5% of them have been previously reported in the Database of Genomic Variance (DGV). Fifty nine CNVs contained genomic regions which were identified in multiple patients (Table 4). A total of 15 genomic regions were identified in two patients; five common genomic regions were identified in three patients, located on chromosomes 3, 5, 9, 11 and 12; and one genomic region was identified in four patients on chromosome 16. Two other CNVs were also shown to be common to five patients on chromosomes 3 and 5. Additional studies are required to investigate the sequence content of these regions to identify if novel contributors to disease development may reside in these regions. Table 4. Genomic regions associated with unique CNVs (compared to controls) identified across multiple among patients. Note CNV frequency and CNV type; CNV location (chromosome, start bp and end bp) and size; as well as the confidence score associated with CNV call and the number of probes used to call the CNV are also noted.

Pathway Analysis
WebGestalt [26] pathway analysis software was then used to compare a list of 317 genes associated with CNVs uniquely identified across all patients (compared to controls) to all genes in the human genome (Supplementary Table 6). Enrichment analysis of KEGG pathways and miR targets was conducted. In relation to the control population we also undertook a similar analysis which revealed three pathways, a tight junction pathway and two related pathways, an oocyte meiosis and a progesterone mediated oocyte maturation related pathway. Of particular note was the absence of any cancer related pathway in the control population suggests that those identified in the patient population are likely to be involved in some aspect of malignancy.
KEGG analysis revealed a total of 18 significant pathways in which genes uniquely identified in the patients were enriched ( Table 5). The most significant pathways identified included those of the carbohydrate digestion and absorption (p = 0.0012); starch and sucrose metabolism (p = 0.0017); and metabolic pathways (p = 0.0023) affecting a total of 11 patients. Previous studies have suggested that changes occurring in metabolic pathways are commonly observed during carcinogenesis and tumour growth [42,43]. In the context of this study, these results suggest the potential existence of a germline predisposition in the affected patients which lead to metabolic conditions that promote disease development. The tight junction pathway (p = 0.0058) and neurotrophin signalling pathway (p = 0.0058) were also identified to be enriched and have been shown to play a role in gut permeability and motility [44][45][46]. These pathways have been well documented for their contribution to CRC [47][48][49][50]. It is interesting to also note among the enriched KEGG pathways the prostate cancer pathway (p = 0.0251) and endometrial cancer pathway (p = 0.0251) also featured and represent two cancers commonly arising in the general population and in the setting of HNPCC/LS [51][52][53]. Overall, our KEGG results suggest the existence of genetic risk factors which may act to promote the development of cancer. Enrichment analysis for targets of miRs identified 65 significant regions within the 3' UTR of the CNV impacted genes unique in the patients. We identified in silico 114 miRs (Supplementary Table 7) that target these genes regions with over 35% of these having previously been reported to have associations with CRC . Of the top 10 most significant regions, 40% of the miRs we identified using our approach have been associated with CRC (miR-141, miR-15A, miR-15B, miR-18A, miR-200A, miR-200B, miR-203, miR-32, miR-429 and miR-92). Overall, our miR enrichment analysis supports reported findings on miR involvement in CRC. Confirmation of the respective miRs by functional analysis is required to unequivocally demonstrate their role in HNPCC.
In summary the results obtained from the pathway analysis suggest that many of the genes associated with CNVs uniquely identified in patients are associated with carcinogenesis, tumour growth and disease susceptibility and may be factors in the development of CRC.

MicroRNA Annotation
From the 317 genes affected by a CNV in the patient cohort we identified, using pathway analysis, 65 genes that enriched 3' UTR microRNA target regions. These 65 binding regions were associated with 114 proposed regulatory microRNAs. TAM (a Tool for Annotations of miRs) software [27] was then used to identify meaningful miR categories among the 114 miRs that target significantly enriched 3' UTR regions identified in the patients from previous pathway analysis (Table 6). We identified a total of 261 miR categories: 22 families, 33 clusters, 39 functional categories, 162 HMDD and 5 tissue specificity categories. It was identified that miRs were enriched in the family category miR-17 (p = 0.0011). A total of 10 functional categories were enriched including those associated with onco-miRs (p = 0.0264), processes of apoptosis (p = 0.0291) and cell-cycle (p = 0.0406). For the HMDD category miRs were enriched in various forms of cancer, with cancer enrichment alone accounting for 80% of the most significant findings. In the context of our study it was reassuring to note the presence of adenocarcinoma (p = 0.0000155) and colorectal neoplasm's (p = 0.0253) among the cancers enriched in miR annotation. Cardiovascular diseases (4%), psychological disorders (10%) and infection (4%) represented the minority of other significant finds (all with p < 0.0047). According to tissue specificity, the placenta represented the most significant miR enriched tissue (p = 0.00001284). Previous studies have suggested that processes of angiogenesis and vascularisation occurring during placental development in pregnancy are also present during tumour development and this has been observed in CRC [77][78][79].
Overall the results obtained from the TAM analysis suggest that the 114 miRs associated with the 3' UTR regions significantly identified in the patients may stimulate processes leading to carcinogenesis which is consistent with what we expect to find in these cancer patients [80].

CNV Burden
A study by Girirajan and Eichler [81] has suggested that the severity of disease may be explained by the overall burden CNVs place on an individual's genome where increased sensitivity to developing disease is correlated with increased CNV burden and furthermore that variation in CNV burden will result in phenotype variation in patients. In a recent study looking at both HNPCC MMR mutation negative and MMR mutation positive patients, we observed an increased average size of CNVs in patients tested and suggested that this was related to an increased genomic burden [82]. An increased CNV burden was not observed among the patients utilized in the current study, though we did detect a decreased mean size of CNVs in patients compared to controls. Importantly, the current study compared 125 patients and 40 controls whereas the recent HNPCC study compared 96 patients and 384 controls [82]. We suggest that discrepancies in these findings are likely to be related to the inequity of sample populations between studies, the limited number of controls used in the current study, the type of array used (noting differences in both the array coverage and density), as well as the algorithm used by analysis software may all contribute to variation in the observed results [83][84][85].

CNV Bias
CNV analysis has suffered from a lack of standardization in analytical techniques used for data mining. The Hidden Markov Model (HMM) and Circular Binary Segmentation (CBS) represent the algorithms utilized to develop CNV calling programs that have been reported to be the most efficient [83][84][85]. Furthermore, using algorithms developed for a specific data type has been shown to perform better in CNV calling compared to platform-independent software algorithms [86]. The robustness of software algorithms, batch effects, and population stratification will therefore influence the accuracy of calls made to segmented data and hence the reliability of CNV calls and CNV boundary descriptors derived from arrays [83][84][85]. The Cyto2.7M array was chosen for use in the current study as at the time it provided the greatest density and most even genomic coverage of any CNV arrays. All data was analysed through ChAS which uses a HMM-based algorithm and was specifically developed to use with the Cyto2.7M array data.
To improve our predictive accuracy we did consider merging other CNV data sets that were at our disposal [82], but this would result in a significant loss of information since the our original data set was created using a SNP based array which did not have even coverage across the genome whereas the one used in this report was an oligo-based array with uniform coverage. Any merged dataset would lose a significant amount of information.
Finally, a recent report suggesting the use of a minimum of three different algorithms when conducting CNV association analysis, this was not possible due to limitations in the data generated by the array used in this study [87].

Conclusions
We were unable to identify any DNA mismatch repair genes targeted by CNVs that may contribute to a significant proportion of HNPCC patients recruited into this study. We did identify several genomic regions that were altered in multiple unrelated HNPCC patients that could potentially be associated with disease risk. The genomic regions encompassed by these CNVs warrant further study to define precisely their role in disease development. We could not rule out the existence of CNVs, smaller than the limits of detection provided by this array, from involvement in the aetiology of HNPCC.
Pathway analysis was thus utilized to identify possible common pathways associated with the heterogeneous outcomes of the analysis. We identified a total of 317 genes impacted by CNVs uniquely identified in patients (compared to controls). Results from KEGG pathway analysis identified the enrichment of pathways involved in metabolism, and these are known to be required for cancer development. It is likely that these loci may contribute to CRC disease risk in the affected individuals. miR enrichment analysis has further highlighted a series of miRs which are suggested to contribute to carcinogenesis. It was found that over 40% of these miRs had been previously reported to play a role in CRC development. As such we have shown that CNV altered genes are over represented in pathways leading to carcinogenesis, tumour growth and disease susceptibility, including CRC. The genes driving pathway enrichments require further investigation to elucidate their precise role in disease development.
The annotation of 114 miRs (reported in the pathway analysis) identified significant functional miR categories associated with cancer, including specifically adenocarcinomas and colorectal neoplasms. Placental tissue was identified to be among the tissues most significantly enriched with the miR looked at in this investigation. We speculate that processes of angiogenesis and vascularisation necessary for placental development are also present during tumour formation including those observed in CRC. As such the genes associated with CNVs we have identified are targeted by miRs which are implicated in various processes leading to malignancy. We conclude that while we have not shown direct consequences of miRs interacting with our CNV altered genes, the separate effects of aberrant miR expression and CNVs impacting on genes that such miRs target may have similar consequences.
Overall the results of this study provide some evidence of CNV involvement in the aetiology of HNPCC and furthermore reinforce that CNV probe arrays compared to SNP arrays appear to be of limited utility for CNV detection.